Refining OEM to Improve Features of Query Languages for Semistructured Data

Pavel Hlousek
Charles University, Faculty of Mathematics and Physics, Prague, Czech Republic

Abstract. Semistructured data can be modeled by the graph-oriented data model OEM. Because the model is so general, the associated query languages become complex: they have to cope with cycles in data graphs. On the other hand, there are applications that need to manipulate only tree-structured fragments of semistructured data (part-subpart relationships) while preserving the other relationships. Therefore we refine the OEM model to distinguish between two types of edges. This refinement makes it possible to simplify the semantics of query expressions and to improve the default structuring of the query result by keeping the minimal structural context that the specified nodes had in the source data tree. We illustrate the proposed notions with the language MailQL, whose prototype is under development.

Introduction

Semistructured data can be modeled by the graph-oriented data model OEM (introduced in [Widom et al, 1995]), which allows cycles in the data graph. Because query languages over this data must allow for the possibility of cycles, they become complicated. Also, an answer to a query may contain data that is useless to the user, because each node is returned together with its whole subgraph. This is because, in the OEM model, data is held by nodes without outgoing edges. On the other hand, data often has a tree structure in terms of the part-subpart relationship, and cycles in the graph are then caused by references, which represent a kind of added information. Therefore, in the Data Model section, we refine the OEM model so as not to lose the notion of part-subpart relationships in the cyclic graph. This affects answers to queries, which return with each node the subgraph that represents only its subparts.
Because part-subpart relationships form a tree, we no longer have to deal with cycles and can make our query language much simpler. This is well suited to query languages for XML data, where the part-subpart relationship may correspond to the element-subelement relationship and cycles are realized through IDREFS attributes. The fact that we can work with data as if it were a tree creates many possibilities for query languages over semistructured data. A notable one is presented in the section Automatic Construction of the Result Structure. It illustrates default structuring of result nodes that keeps the minimal structural context these nodes had in the source data graph. Other similar languages, like Lorel [Abiteboul et al, 1996], XML-QL [Deutsch et al, 1998] or UnQL [Buneman et al, 1996], force the user to use some "construct" clause, in which he/she explicitly specifies the result structure. And if they provide some default structuring, they simply return a set of nodes.

In the following text, we use a message base as an example of semistructured data, together with the associated query language MailQL, which was developed as the author's master thesis [Hlousek, 2000]. In the message base, part-subpart relationships are e.g. folder-subfolder, folder-message and message-fields, and cycles are caused e.g. by message threads (described later). There are many other areas of semistructured data where part-subpart relationships can be found: an image and its subregions, XML data with elements containing elements, etc.

Data Model

In this section, we define the OEM model and its refinement.

OEM Model

The Object Exchange Model (OEM), which first appeared in the TSIMMIS project [Widom et al, 1995], is the de facto standard for modeling semistructured data. The following definition of OEM is borrowed from [Abiteboul et al, 2000].

Definition 1: An OEM object is a quadruple (label, oid, type, value), where label is a character string, oid is the object's unique identifier, and type is either complex or an identifier denoting an atomic type (like integer, string, etc.). When type is complex, the object is called a complex object and value is a set of oids. Otherwise, the object is an atomic object and value is an atomic value of that type.

Thus data represented by the OEM model is kept by OEM objects of atomic type, which are referred to by OEM objects of complex type. OEM data is usually understood as an oriented graph with labeled nodes, where OEM objects correspond to nodes, and for each node n representing a complex OEM object o = (label, oid, complex, value) there are edges leading from n to the nodes that represent the OEM objects in o's value field. Figure 1 shows how this model can describe part of our tree-like message base. It is also apparent from the figure that we use a very common modification of the OEM model: labels are attached to edges rather than nodes. Thus we take our message base as an oriented edge-labeled data graph.

[Figure 1 labels: Folder(Inbox), Folder(Private), Message, Message, Field(From), Field(To), Body]

Figure 1. Part of message base modeled by OEM. The complex nodes are distinguished from atomic ones by empty circles.

Definition 2: We say that m is an l-subobject of n if there exists an edge labeled l leading from n to m.

Henceforth, OEM objects are simply called objects. We should also note that in the following text we use the terms node and object interchangeably, since according to the OEM definition a node represents an object and vice versa.

Refining OEM Model

Now let us turn our attention to our example of a message base. So far the model consists only of part-subpart relationships like folder-message; let us enrich it with edges that provide some added information. Let us add message threads to our model.
The message thread of a message comprises all replies to the message itself, all replies to those replies, etc., as well as all preceding messages, i.e. messages that have our message among their replies, or among the replies to their replies, etc. In the message base model, a message thread would be expressed by edges labeled thread, leading from a message to all nodes that represent a message in its thread.

We mentioned earlier that in other OEM-oriented languages each node is returned in an answer together with its subgraph. So with each message, all messages from its thread must always be returned. This OEM-specific behaviour does not suit us, because we might be interested in the messages only, not caring about their threads, which may in some cases be very large and would needlessly enlarge the result.

Therefore we refine the OEM model to distinguish between two types of edges. This refinement is described by Def. 3.

Definition 3: (i) Core edges are edges describing the tree structure of data, given by part-subpart relationships. (ii) Secondary edges are edges describing added information. Core (secondary) paths are oriented paths consisting only of core (secondary) edges.

Def. 3 says that core edges are edges describing part-subpart relationships (e.g. edges from a message to its fields), while secondary edges are edges describing relationships other than part-subpart (e.g. edges from a message to the messages in its thread).

Definition 4: By the core data tree of a data graph $G = (V, E)$ we denote its subgraph $G' = (V, E')$, where $E' \subseteq E$ consists of all core edges.
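Definitions 3 and 4 can be illustrated with a small sketch. It is not taken from the MailQL prototype: the names `Edge`, `DataGraph`, `core_data_tree` and `has_cycle`, as well as the oids `inbox`, `m1`, `m2`, `f1`, `f2`, are assumptions made only for this example. The sketch stores a flag on every edge, derives the core data tree by dropping the secondary edges, and checks that the thread edges are what introduces the cycle.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Edge:
    source: str   # oid of the parent object
    label: str    # edge label, e.g. "Message" or "Thread"
    target: str   # oid of the l-subobject
    core: bool    # True for a core edge, False for a secondary edge


@dataclass
class DataGraph:
    nodes: set = field(default_factory=set)
    edges: list = field(default_factory=list)

    def add_edge(self, source, label, target, core=True):
        self.nodes.update((source, target))
        self.edges.append(Edge(source, label, target, core))

    def core_data_tree(self):
        """Definition 4: the same nodes, but only the core edges."""
        tree = DataGraph(nodes=set(self.nodes))
        tree.edges = [e for e in self.edges if e.core]
        return tree


def has_cycle(graph):
    """Depth-first search for a directed cycle."""
    children = {}
    for e in graph.edges:
        children.setdefault(e.source, []).append(e.target)
    state = {}  # oid -> "open" while on the DFS stack, "done" afterwards

    def visit(node):
        if state.get(node) == "open":
            return True
        if state.get(node) == "done":
            return False
        state[node] = "open"
        found = any(visit(child) for child in children.get(node, []))
        state[node] = "done"
        return found

    return any(visit(node) for node in graph.nodes)


# Hypothetical fragment of a message base: two messages in one thread.
g = DataGraph()
g.add_edge("inbox", "Message", "m1")
g.add_edge("inbox", "Message", "m2")
g.add_edge("m1", "From", "f1")
g.add_edge("m2", "From", "f2")
g.add_edge("m1", "Thread", "m2", core=False)
g.add_edge("m2", "Thread", "m1", core=False)

core = g.core_data_tree()
```

In this fragment `has_cycle(g)` is true because of the two thread edges, while `has_cycle(core)` is false, and both graphs share the same node set.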

We should note that the data graph and the core data tree share the same nodes. The only difference is that the core data tree is composed exclusively of core edges, and therefore it is always a tree, because core edges occur exactly where the part-subpart relationship holds, while the complete data graph can contain cycles, which are caused by the presence of secondary edges.

Here we made a simplification in talking about a data tree. Instead of a tree, we might more generally talk about a rooted acyclic graph, but for this article the only consequence of distinguishing the two is the number of roots. Henceforth, by root of a data tree we will understand any root of the rooted acyclic graph.

Now back to our example. By enriching the tree-like message base model with edges that provide additional information, the data graph is no longer a tree: it is a cyclic graph. But using Def. 3 and Def. 4, we do not lose the notion of the previous tree in the data graph (thanks to core edges). Figure 2 illustrates this situation on our message base model enriched with thread edges.

[Figure 2 labels: Message, Message, Thread, Thread, From, Body, To, Attachment]

Figure 2. Part of message base with core edges (drawn with thicker lines) and secondary edges (drawn with thinner lines). The outgoing edges of message nodes are heavily reduced to simplify the figure; normally there would be many more of them.

The refinement was introduced to achieve the following: the query language works with the core data tree instead of the whole data graph. Specifically, each specified node is returned in the result together with its subgraph from the core data tree. So when asking for a message, the result will not contain all messages from its thread, as it would in OEM. Furthermore, just before the result is shipped to the user, it is supplemented with those secondary edges whose two end nodes were both preserved. Thus those messages from a message's thread that made it into the result remain connected by thread edges.
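The behaviour just described can be sketched in a few lines. This is an illustration under assumptions, not the prototype's code: edges are plain tuples `(source, label, target, is_core)`, and the function names `core_subtree` and `build_result` and all oids are invented for the example. Each selected node contributes its subtree along core edges, and afterwards every edge whose two end nodes both survived is kept, which restores the thread edges between returned messages.

```python
def core_subtree(edges, start):
    """Collect the nodes reachable from `start` along core edges only."""
    kept = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for src, _label, dst, is_core in edges:
            if is_core and src == node and dst not in kept:
                kept.add(dst)
                stack.append(dst)
    return kept


def build_result(edges, selected):
    """Return the result nodes and the edges that connect them."""
    kept = set()
    for node in selected:
        kept |= core_subtree(edges, node)
    # Keep every edge, core or secondary, whose two end nodes both survived.
    kept_edges = [e for e in edges if e[0] in kept and e[2] in kept]
    return kept, kept_edges


# Hypothetical message base: three messages that all belong to one thread.
edges = [
    ("inbox", "Message", "m1", True),
    ("inbox", "Message", "m2", True),
    ("inbox", "Message", "m3", True),
    ("m1", "Body", "b1", True),
    ("m2", "Body", "b2", True),
    ("m3", "Body", "b3", True),
    ("m1", "Thread", "m2", False),
    ("m2", "Thread", "m1", False),
    ("m1", "Thread", "m3", False),
    ("m3", "Thread", "m1", False),
]

nodes, result_edges = build_result(edges, ["m1", "m2"])
```

Selecting `m1` and `m2` returns the two messages with their bodies and the thread edges between them; `m3` is not pulled in, although it belongs to the same thread.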
Path Expressions

Now that our semistructured data model is refined, we need a way to navigate through the data graph. Nowadays, the most suitable tool seems to be path expressions. Thanks to their navigational syntax, their presence in languages over semistructured data is almost a rule. If o is an object and l is a label, then by the expression o.l we denote the set of l-subobjects of o. Note that o.l always denotes a set of objects. This semantics of path expressions is typical for semistructured data.

Simple Path Expressions. A simple path expression is an expression $r.l_1.\ldots.l_n$, where $r$ is a root node and $l_1, \ldots, l_n$ are edge labels. A data path is a sequence $o_0, l_1, o_1, \ldots, l_n, o_n$, where the $o_i$ are objects and for each $i$ there is an edge from $o_{i-1}$ to $o_i$ labeled $l_i$. According to these definitions, there can be more than one data path that satisfies a given simple path expression.

The semantics of simple path expressions is very intuitive. We explain it on the example root.a.b. The expression root denotes the starting object. The expression root.a denotes the set X of all objects to which there is an edge labeled a leading from root. The expression root.a.b then denotes the set Y of all objects to which there is an edge labeled b leading from any object in X. Thus each simple path expression denotes a set of objects, even when no data path satisfies it; in that case, the path expression denotes the empty set.

General Path Expressions. General path expressions enhance the power of simple path expressions by allowing wild cards and regular expressions in path expressions. With wild cards we can substitute either an edge label (using %) or a sequence of edge labels (using ). Thus the expression root.%.b means root.<any label>.b, and the expression root..z means root.<any path>.z.
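The set-at-a-time semantics described above can be made concrete with a short sketch. It is only an illustration: the adjacency-list representation, the function name `evaluate` and the oids are assumptions, and of the wild cards only the single-label % is covered.

```python
def evaluate(edges, root, labels):
    """Evaluate root.l1. ... .ln and return the denoted set of objects.

    `edges` maps an oid to a list of (label, child oid) pairs.
    The label "%" stands for any single edge label.
    """
    current = {root}
    for label in labels:
        current = {
            child
            for obj in current
            for edge_label, child in edges.get(obj, [])
            if label == "%" or edge_label == label
        }
    return current


# Hypothetical data graph used only for this example.
edges = {
    "root": [("a", "x1"), ("a", "x2"), ("c", "x3")],
    "x1": [("b", "y1")],
    "x2": [("b", "y2"), ("d", "y3")],
    "x3": [("b", "y4")],
}

X = evaluate(edges, "root", ["a"])            # root.a
Y = evaluate(edges, "root", ["a", "b"])       # root.a.b
any_b = evaluate(edges, "root", ["%", "b"])   # root.%.b
empty = evaluate(edges, "root", ["a", "z"])   # no data path satisfies it
```

Each step maps the current set of objects to the set of their matching subobjects, so an expression with no satisfying data path simply ends in the empty set rather than in an error.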

Regular expressions enable the use of path expressions such as root(.a|.c).b, which matches the two simple path expressions root.a.b and root.c.b, and so results in the union of two sets of objects. Usually there are many more constructs typical for regular expressions, and the usage of wild cards could be widened much further as well, but discussing them is not the goal of this paper.

Common Prefix. When using path expressions, we often refer to the common prefix of two or more path expressions. We first define the predicate IsPrefix; the definition of common prefix follows. Note that by common prefix we always mean the longest common prefix.

Definition 5: Let $pe_1$ and $pe_2$ be path expressions. The predicate $IsPrefix(pe_1, pe_2)$ is true if $pe_1 = X.l_1.\ldots.l_n$ and $pe_2 = X.l_1.\ldots.l_{n+m}$, where $n \geq 0$ and $m \geq 0$.

Definition 6: Let $pe_1$ and $pe_2$ be path expressions. By their common prefix we denote the path expression $pe = X.l_1.\ldots.l_k$ for which both $IsPrefix(pe, pe_1)$ and $IsPrefix(pe, pe_2)$ are true, and there is no path expression $pe' = pe.l_{k+1}$ such that both $IsPrefix(pe', pe_1)$ and $IsPrefix(pe', pe_2)$ are true.

Example Syntax

As introduced earlier, MailQL is a query language over a message base, which we use here as an example of semistructured data modeled by the refined OEM model. MailQL queries borrow their syntax from OQL [Cattell et al, 1997] and its semistructured-data-oriented successor Lorel [Abiteboul et al, 1996].

    SELECT list of path expressions
    FROM list of aliases for path expressions
    WHERE boolean expression

Example (1) shows a simple query in MailQL, which returns the fields From and Date of all messages that contain the string 'MailQL' in their Subject field.

    SELECT m.from, m.date
    FROM Inbox.Message: m
    WHERE m.subject CONTAINS 'MailQL'        (1)

Note that the FROM clause in a MailQL query plays a different role than in other syntactically similar languages: here it only defines aliases (m) for path expressions (Inbox.Message).
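Since the FROM clause only names path expressions, a query such as (1) can be normalised by textual expansion. The following is a minimal sketch of that idea, assuming that an alias can only appear as the first component of a path expression; the function name `expand_aliases` and the dictionary representation of the FROM clause are hypothetical and not part of MailQL.

```python
def expand_aliases(path_expressions, aliases):
    """Replace a leading alias in each path expression by what it stands for."""
    expanded = []
    for pe in path_expressions:
        head, _, rest = pe.partition(".")
        if head in aliases:
            pe = aliases[head] + ("." + rest if rest else "")
        expanded.append(pe)
    return expanded


# FROM Inbox.Message: m
aliases = {"m": "Inbox.Message"}

select_list = expand_aliases(["m.from", "m.date"], aliases)
where_path = expand_aliases(["m.subject"], aliases)
```

After expansion the SELECT list of query (1) reads `Inbox.Message.from` and `Inbox.Message.date`, and path expressions that do not start with an alias are left unchanged.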
Before the query is executed, all occurrences of aliases in the SELECT and WHERE clauses are replaced by the corresponding path expressions, so the FROM clause is no longer needed after that.

Automatic Construction of the Result Structure

In this section, we introduce an interesting use of the OEM refinement. So far, we have considered what should be returned with the specified nodes; that is, we were inspecting the part of the core data tree on the path from a node to the leaves. Now we turn our attention to the other part: the path from a node to the root of the core data tree. The question is: in what structural relationships should the nodes specified in the SELECT clause of a query be returned?

Current languages for semistructured data usually provide some "construct" clause which explicitly defines the structure of the result. Some of them provide default structuring as well, which means returning a set. But having the core data tree, we can improve default structuring by keeping the minimal structural context that the specified nodes had in the source data tree.

Simply said, path expressions from the SELECT clause specify nodes in the data graph. All these nodes will be returned (with their subgraphs, as described in the section Refining OEM Model). We also want all these nodes to stay in the same structural relationships to each other as they had in the source data tree. This could certainly be achieved by preserving the whole paths leading to these nodes from the root of the core data tree. But our solution vertically reduces these paths and keeps only the "interesting" nodes: nodes in which the path forks to reach the specified nodes.

Compared with other languages for semistructured data, we may lack strong result restructuring features, but we think that in many cases the user needs to see the minimal structural context of the data, which is just what our language provides.
Furthermore, all the features of our language introduced here could be incorporated into any query language with strong formatting options, where they would provide the default structuring of the result.

Minimal Structural Context

Let us first formalize vertical reduction.

Definition 7: Let $T = (V, E)$ be a tree, where $V$ is a set of nodes and $E$ is a set of edges. We say that a tree $T' = (V', E')$ is a vertically reduced tree of $T$ if both of the following conditions hold:
(i) $V' \subseteq V$;
(ii) for every $e' = (v'_1, v'_2) \in E'$, either
    (a) $e' \in E$ (an edge preserved from $T$); or
    (b) there exist a sequence $e_1, \ldots, e_n$ of edges from $E \setminus E'$ and a sequence $v_{k_2}, \ldots, v_{k_n}$ of nodes from $V \setminus V'$ such that $v'_1, e_1, v_{k_2}, \ldots, v_{k_n}, e_n, v'_2$ is a path in $T$.

Definition 7 says that $T'$ was obtained from $T$ by leaving out some nodes in such a way that if there was a path between two nodes in $T$ and both nodes were preserved in $T'$, then there must exist a path between them in $T'$ as well.

Now let us turn our attention to "interesting" nodes. As stated at the beginning of this section, the only nodes preserved from paths leading to queried data are those where these paths fork. Given two path expressions, we can determine the path expression of the nodes where the paths fork by finding the common prefix of these two path expressions. So by "interesting" nodes we understand those nodes that are represented by common prefixes of path expressions from the SELECT clause.

To describe the minimal structural context of result nodes, we use a data structure called the result structure tree (RST), which helps us specify the (vertically reduced) structure of the result. A result structure tree is a tree RST = (V, E) with a mapping PEI (path expression infix) defined for all its nodes. Each node is mapped by PEI to a part of a path expression in such a way that the common prefix is empty for each pair of siblings in the RST. The PEI of the root node always maps to an empty path expression. [1]

To get the structure of the result for a certain query, we take all path expressions from the SELECT clause of the query and construct the appropriate RST from them.
The construction is directed by common prefixes of path expressions. We start with the root node, which is always present and is always assigned an empty path expression. The first path expression is represented by a child node of the root, whose PEI is the path expression itself. When adding each further path expression to the RST, we check for a common prefix with all path expressions that are already represented by the RST. If there is a non-empty common prefix, some splitting must take place in order not to violate the definition of RST. Figure 3 illustrates in three steps how the RST is created for the set of path expressions {Inbox.Private.Message.To, Inbox.Private.Message.From, Inbox.Private.Name}.

[Figure 3, three steps: (1) v_root with the child Root.Inbox.Private.Message.To; (2) v_root with the child Root.Inbox.Private.Message, which has the children To and From; (3) v_root with the child Root.Inbox.Private, which has the children Message (with children To and From) and Name.]

Figure 3. Example of RST construction. Dotted lines indicate node splitting; the text attached to a node is the value of the node's PEI.

The inner nodes of the RST, which represent common prefixes of the path expressions in the SELECT clause of a query, represent the path expressions of the "interesting" nodes we spoke of above (these are the ones which express the minimal structural context of returned data), while the leaf nodes represent the data to be returned.

[1] If the data tree forms a rooted acyclic graph, then the RST forms a rooted acyclic graph as well, but it always has just one root, which represents an empty path expression. So if several roots of the core data rooted acyclic graph get into the result, they are represented as children of the root in the RST.
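The RST construction described above behaves much like insertion into a compressed prefix tree. The sketch below is one possible reading of it, not the MailQL implementation: the names `common_prefix`, `RSTNode`, `build_rst` and `outline` are assumptions, path expressions are split into label lists at the dots, and the leading Root label shown in Figure 3 is left out because the root node itself carries the empty PEI.

```python
def common_prefix(pe1, pe2):
    """Longest common prefix of two label sequences (Definition 6)."""
    prefix = []
    for a, b in zip(pe1, pe2):
        if a != b:
            break
        prefix.append(a)
    return prefix


class RSTNode:
    def __init__(self, pei):
        self.pei = list(pei)   # path expression infix of this node
        self.children = []

    def insert(self, labels):
        for child in self.children:
            shared = common_prefix(child.pei, labels)
            if not shared:
                continue
            if len(shared) < len(child.pei):
                # Split: the child keeps the shared part, its old tail moves down.
                tail = RSTNode(child.pei[len(shared):])
                tail.children = child.children
                child.pei = shared
                child.children = [tail]
            rest = labels[len(shared):]
            if rest:
                child.insert(rest)
            return
        # No sibling shares a prefix, so siblings stay pairwise prefix-free.
        self.children.append(RSTNode(labels))

    def outline(self, depth=0):
        """Indented listing of the PEI values, one node per line."""
        lines = ["  " * depth + (".".join(self.pei) or "<root>")]
        for child in self.children:
            lines.extend(child.outline(depth + 1))
        return lines


def build_rst(path_expressions):
    root = RSTNode([])   # the root always carries the empty path expression
    for pe in path_expressions:
        root.insert(pe.split("."))
    return root


rst = build_rst([
    "Inbox.Private.Message.To",
    "Inbox.Private.Message.From",
    "Inbox.Private.Name",
])
```

For the three path expressions of Figure 3, `rst.outline()` lists the root, then Inbox.Private, then Message with the children To and From, and finally Name. Because a new child is appended only when no existing sibling shares a prefix with it, the condition that siblings have an empty common prefix is preserved by every insertion.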

Henceforth, we denote the structure of the result using an XML-like syntax (according to [Bray et al, 1998]). So the final structure from Fig. 3 is written down like this:

    {<Inbox.Private>
        {<Name>...</>}
        {<Message>
            {<To>...</>}
            {<From>...</>}
        </Message>}
    </Inbox.Private>}

RST to Result

Once we have the RST representing the result structure, we can easily generate the result tree. Each node of the RST may correspond to several nodes of the core data tree (e.g. Inbox.Message denotes a set of messages). Thus the result for the RST in Fig. 3 can look like this in XML syntax:

    <Inbox.Private>
        <Name>Private</>
        <Message>
            <To>julius.satinsky@theatre.sk</>
            <From>fan@home.cz</>
        </Message>
        <Message>
            <To>julius.satinsky@theatre.sk</>
            <To>milan.lasica@theatre.sk</>
            <From>fan@home.cz</>
        </Message>
    </Inbox.Private>

Conclusions

Relying on the fact that most semistructured data (e.g. XML data) has a tree structure in terms of the part-subpart relationship, we refined the OEM model so as not to lose the notion of this tree behind a complex data graph with cycles. This refinement allows us, and generally all languages over semistructured data, to work with data as if it were a tree, and so spares us the complexity caused by cycles in the data graph.

As another use of our OEM refinement, we introduced the basics of a query language that provides default structuring of a query result based on keeping the minimal structural context. This is a different approach to default result structuring than that of languages over semistructured data with some "construct" clause.

The query language presented here (originally MailQL) was designed and implemented for a message base as part of a master thesis and is under further development during the author's postgraduate study.
References

Abiteboul, Quass, McHugh, Widom, and Wiener, The Lorel query language for semistructured data, International Journal on Digital Libraries, 1(1):68-88, 1996.
Serge Abiteboul and Dan Suciu, Data on the Web: From Relations to Semistructured Data and XML, Data Management Systems, Morgan Kaufmann, first edition, 2000.
Peter Buneman, Susan Davidson, Gerd Hillebrand, and Dan Suciu, A query language and optimization techniques for unstructured data, In H. V. Jagadish and Inderpal Singh Mumick, SIGMOD, pages 505-516, ACM Press, 1996.
Tim Bray, Jean Paoli, and C. M. Sperberg-McQueen, Extensible markup language (XML) 1.0, February 1998.
R.G.G. Cattell et al, The Object Database Standard: ODMG 2.0, Morgan Kaufmann Publishers, Inc., 1997.
Alin Deutsch, Mary Fernandez, Daniela Florescu, Alon Levy, and Dan Suciu, XML-QL: A query language for XML, August 1998.
Pavel Hlousek, MailQL: A query language for an message base, Master's thesis, Charles University, Prague, 2000, In Czech.
J. Widom, H. Garcia-Molina, and Y. Papakonstantinou, Object exchange across heterogeneous information sources, In Proceedings of the Eleventh International Conference on Data Engineering, pages 251-260, Taipei, Taiwan, March 1995.
