Optimizing Correlated Path Queries in XML Languages. Technical Report CS November 2002


Ning Zhang and M. Tamer Özsu
School of Computer Science, University of Waterloo

Abstract

Path expressions are ubiquitous in XML processing languages such as XPath, XQuery, and XSLT. Expressions in these languages typically include multiple path expressions, some of them correlated. Existing approaches evaluate these path expressions one at a time and miss the optimization opportunities that may be gained by exploiting the correlations among them. In this paper, we address the evaluation and optimization of correlated path expressions. In particular, we propose two types of optimization techniques: integrating correlated path expressions into a single pattern graph, and rewriting the pattern graph according to a set of rewriting rules. The first optimization technique allows the query optimizer to choose an execution plan that is impossible using the existing approaches. The second optimization technique rewrites pattern graphs at a logical level and produces a set of equivalent pattern graphs from which a physical optimizer can choose given an appropriate cost function. Under certain conditions that we identify, the graph pattern matching-based execution approach that we propose may be more efficient than the join-based approaches.

1 Introduction

Path expressions are ubiquitous in XML processing languages (XPath [1], XQuery [2], XSLT [3], XPointer [4], to name a few). With a syntax that involves slash-separated, UNIX directory-like notation, path expressions are used to locate elements in a collection of XML documents. For example, the path expression /bib/book/authors locates in the bibliography all authors who wrote a book; /bib/*[authors="John Smith"] locates all of John Smith's publications (including those co-authored with others) in the bibliography. Since path expressions are usually used in conjunction with other query languages such as XPath and XQuery, there could be multiple path expressions correlated with each other in a single XPath or XQuery expression. Consider the example query in Figure 1 taken from XQuery Use Cases [5].
There are four path expressions bound to four variables in the for clause, and ten more path expressions introduced by referencing the four variables in the where clause. Note that these path expressions are not independent, but are correlated with each other by variable referencing, simple equality tests (=), a substring containment test (the contains function), and an equality test on an aggregation (max) of subexpressions.

Previous research on the evaluation and optimization of path expressions has focused on a single path expression [6, 7, 8, 9, 10], and follows two approaches. A straightforward approach is to follow the formal semantics of the XPath and XQuery languages [11], and treat a path expression as a sequence of steps, each of which takes input from the previous step and produces an output for the next step. Although this approach seems simple, experimental results show that implementations following it suffer from exponential runtime in the size of the path expressions in the worst case [10].

Another approach is to reduce the problem of evaluating path expressions against XML documents to the problem of matching pattern trees against subject trees, where the subject tree is a simplified data model for an XML document, and the pattern tree is an abstraction for a fragment of a path expression. This problem is called the tree pattern matching (TPM) problem. Intuitively, the TPM problem is to find all mappings δ_i from the nodes of the pattern tree to the nodes of the subject tree, such that δ_i is label-preserving and relationship-preserving (child-parent, descendant-ancestor, and sibling-ordering relationships). Using proper labeling techniques [6, 7, 8, 12, 13], TPM can be evaluated reasonably efficiently. However, it has not been shown that a complete path expression (including all thirteen axes) can be translated into a pattern tree. In fact, we shall show that a complete path expression cannot be translated into a pattern tree but can be translated into a directed graph, which we call the pattern graph.
More generally, we shall show that multiple path expressions can be translated into a pattern graph, which is the basis for our optimization of correlated path expressions.

     1. <result> {
     2.   for $seller in document("users.xml")//user_tuple,
     3.       $buyer in document("users.xml")//user_tuple,
     4.       $item in document("items.xml")//item_tuple,
     5.       $highbid in document("bids.xml")//bid_tuple
     6.   where $seller/name = "Tom Jones"
     7.     and $seller/userid = $item/offered_by
     8.     and contains($item/description, "Bicycle")
     9.     and $item/itemno = $highbid/itemno
    10.     and $highbid/userid = $buyer/userid
    11.     and $highbid/bid =
    12.       max( for $x in document("bids.xml")
    13.            //bid_tuple[itemno=$item/itemno]/bid
    14.            return decimal($x) )
    15.   return
    16.     <jones_bike>
    17.       { $item/itemno }
    18.       { $item/description }
    19.       <high_bid>{ $highbid/bid }</high_bid>
    20.       <high_bidder>{ $buyer/name }</high_bidder>
    21.     </jones_bike>
    22.   sort by {itemno}
    23. } </result>

Figure 1: A Sample XQuery Expression in XQuery Use Cases [5] (Q5 in the use case "R")

Both of the previous approaches can be easily extended to handle correlated path expressions: evaluate each path expression one by one, and decide whether to materialize/cache the results of path expressions bound to variables. However, they ignore the connections between path expressions, thus missing the optimization opportunities based on these connections. For example, in Figure 1, the path expression document("users.xml")//user_tuple in line 2 is correlated to the path expression $seller/name in line 6 since the latter references the variable $seller, which is bound to the former path expression.
Due to this correlation, the two path expressions, together with the comparison expression between the second path expression and "Tom Jones", can be rewritten to a single path expression

    document("users.xml")//user_tuple[name="Tom Jones"]

Instead of evaluating the two path expressions one at a time and then evaluating the comparison expression, the query optimizer can choose an access plan that evaluates user_tuple[name="Tom Jones"] directly from the XML document users.xml.

In this paper, we propose a new way of evaluating multiple path expressions by exploiting the correlations among them. Our optimization methodology is illustrated in Figure 2. We start by converting multiple correlated path expressions to a pattern graph. The identification of the correlated path expressions is outside the scope of this paper, and will be the subject of our future work. The pattern graph is then rewritten (using rules introduced in the paper) to generate a set of equivalent pattern graphs from which the physical optimizer can choose. In particular, one rewriting rule partitions the pattern graph into quasi-ordered pattern trees, which are passed to a tree pattern matching algorithm. The results of the tree pattern matchings are joined together to generate the final result for pattern graph matching.

In summary, our contributions are as follows:

• We define a pattern graph and give a linear-time algorithm to convert multiple correlated path expressions to a pattern graph. Therefore, evaluating multiple path expressions can be transformed to the problem of matching a directed graph against an ordered tree. We call this the graph pattern matching (GPM) problem.

• We present rewriting rules that can be applied to a pattern graph to produce a set of equivalent pattern graphs from which the physical algebra optimizer can choose based on an appropriate cost function.

• In one of the rewriting techniques, a pattern graph is divided into connected components, each of which is evaluated by a tree pattern matching (TPM) operator. We develop a Θ(nm) algorithm for the TPM operator, where n and m are the sizes of the XML document and the pattern tree, respectively.

Figure 2: Overview of Optimization Steps

The rest of the paper is organized as follows. In Section 2, we introduce path expressions in XML query languages and path expressions in object-oriented databases. In Section 3, we give the definitions and introduce the mapping from correlated path expressions to a pattern graph. In Section 4, we present the rewriting techniques based on the pattern graph. In Section 5, we use an existing algorithm combined with our own algorithm to solve the TPM problem. Finally, we introduce related work in Section 6 and conclude the paper in Section 7.

2 Background

In this section, we first introduce the syntax of path expressions as defined in the working draft of XPath 2.0 [1]. Then we discuss path expression evaluation within the context of object databases, since there are surface similarities and important differences.

2.1 Path Expressions in XPath

The syntax of path expressions can be simplified as the following EBNF grammar:

    Path      ::= (Root "/")? Step ("/" Step)*
    Root      ::= "$" QName | "document" "(" StringLiteral ")" | ε
    Step      ::= Axis "::" NodeTest Qualifier*
    NodeTest  ::= NameTest | KindTest
    Qualifier ::= "[" Exp "]" | "=>" NameTest
    NameTest  ::= Name ("|" Name)*
    Name      ::= "*:"? QName | QName ":*"?
    Axis      ::= "child" | "parent" | "descendant" | "ancestor"
                | "following-sibling" | "preceding-sibling" | "self"
                | "descendant-or-self" | "ancestor-or-self"
                | "following" | "preceding" | "attribute" | "namespace"
    KindTest  ::= "processing-instruction" "(" StringLiteral? ")"
                | "comment" "(" ")" | "namespace" "(" ")" | "node" "(" ")"

Each Path expression consists of a series of Steps separated by slashes (/). Each Step consists of an Axis, a NodeTest, and zero or more Qualifiers. The Axis can be thought of as the direction of this Step, representing the relationship between the previous Step and the current Step¹. A NodeTest in the path expression filters nodes by name or by kind (e.g. comment or namespace); these tests are called NameTest and KindTest, respectively. Qualifiers are filters testing more complex conditions by evaluating the qualifier expressions enclosed in the brackets [ ]. Qualifier expressions can be of any type (arithmetic, logical, for, etc.), but here we restrict them to two types: i) path expressions, and ii) comparison expressions between a path expression and a literal (numerical or string). For example, /bib/book[title="XML"][price<100] is a path expression within our scope of discussion because the two qualifier expressions are comparisons between path expressions and literals, while /bib/book/chapter[2]/title is not, since 2 is neither a path expression nor a comparison expression between a path expression and a literal. Another feature of path expressions that is outside the scope of this paper is the dereferencing of IDREF and IDREFS indicated by the => in the Qualifier.

2.2 Path Expressions in OQL

Path expressions are not unique to XML query languages; they also appear in object-oriented database (OODB) query languages such as OQL [14, 15].
Although they are based on different data models (ordered tree versus directed graph), path expressions in both kinds of query languages have similarities. In OQL, path expressions are defined as a chain of objects and methods/attributes in the so-called object composition graph [15] (a.k.a. aggregation graph [16]). In the object composition graph, an object o_1 has a directed edge to another object o_2 if and only if o_1 has an attribute or method whose value or result is in the class of object o_2. For example, o.m_1(ϕ_1).m_2(ϕ_2). ... .m_n(ϕ_n) is a path expression, where o is an object instance, m_i (1 ≤ i ≤ n) is an attribute or a method of either o (if i = 1) or o.m_1. ... .m_{i-1} (if i > 1), and ϕ_i is the argument of m_i if it is a method. This path expression can be thought of as a top-down path in the breadth-first spanning tree of the object composition graph rooted at o. This is analogous to the path expressions in XPath. However, the differences are important:

• In XPath, the XML trees on which the path expressions work (i.e., the trees representing the documents in the XML database) are known a priori, before the path expression is given. However, the OODB is based on a graph database, and the breadth-first spanning tree can be obtained only after the starting point of the path expression is fixed.

• The XML tree is always ordered, while the breadth-first spanning tree of the object composition graph is always unordered. Ordering introduces differences to the query evaluation and optimization process.

• The path expressions in XPath provide a way to query XML trees in a multitude of ways, including top-down, bottom-up, left-to-right, etc., by using axes. In an OODB, by contrast, the object composition graph can be traversed only through the reference links.

¹ Following the abbreviations defined in [1], we use the labels ., /, and // to represent the self, child, and descendant-or-self axes, respectively.

In addition to the existence of axes, path expressions in XPath differ from path expressions in OODB in the NodeTest and qualifier expressions. In XPath path expressions, a NodeTest can select nodes either by their kind or by their names (exact matching or regular expression), while in OODB path expressions, methods m_i are chosen only based on their names. In XPath path expressions, qualifier brackets [ ] have simple and predefined semantics: the results of qualifier expressions in brackets are converted to boolean values by the predefined conversion rule, and this boolean value determines whether or not the node generated by the NodeTest is filtered out.
In contrast, in OODB path expressions, the arguments ϕ_i can be any expression and the semantics of m_i varies according to the definition of types.

Path expressions in OODB are usually evaluated by joins on components along the path. The optimization techniques are therefore concentrated on how to efficiently evaluate the joins [15]. Another optimization technique is using an indexed scan on the paths when an index is available. Optimization based on indexes is beyond the scope of this paper, and we refer interested readers to [16]. Join-based evaluation of XPath path expressions is possible, and we discuss it along with an approach based on tree pattern matching (TPM) and a hybrid approach combining the two.

3 Correlated Path Expressions and Pattern Graphs

In this section, we shall first define the correlated path expressions, the subject tree, and the pattern graph. Then we discuss the translation of correlated path expressions into a pattern graph. In the rest of the paper, for simplicity, we denote the subject tree and the pattern graph as T and P, respectively.

3.1 Definitions

Many XML queries include multiple path expressions. In most cases, these path expressions are not independent, but are somehow correlated. In this paper, we consider two simple yet widely used types of correlations for query optimization. The context in which these correlations can be used for optimization is discussed in Section 3.

Definition 1 (Correlated Path Expressions) Path expression e is said to be correlated to another path expression f (assuming they are bound to variables $e and $f, respectively) if and only if

1. the first Step of e is the variable reference $f, i.e. the syntax of e is $f ("/" Step)* (called strong correlation); or

2. they are connected by a binary operator θ ∈ R (e θ f) that returns a boolean value. The semantics of the operators in R are as defined in [17] (examples are the document order operators << and >> or the equality operator =). We call this kind of correlation weak correlation.

For example, the path expression $seller/userid in line 7 of Figure 1 is correlated with the path expression document("users.xml")//user_tuple in line 2 since the former references the variable binding of the latter. Also, the two path expressions in line 7 are correlated to each other by the equality relation.

An XML document is usually modeled as a labeled ordered tree [18], where each node belongs to one of seven kinds²: document, element, attribute, namespace, processing instruction, comment, and text. Each kind is associated with a set of methods defined as its access interface. This is analogous to the association of data structures and methods found in object-oriented programming languages. In our definition, the subject tree is an abstract data structure for a labeled ordered tree, where edges represent the parent-child relationship between elements in the XML document.

Definition 2 (Subject Tree) A subject tree T is a rooted, ordered, node-labeled tree, which can be denoted by a 4-tuple T = ⟨Σ, N, E, t⟩, where Σ is a finite alphabet, N and E are the sets of nodes and edges in the tree, respectively, and t ∈ N is the root of T. A node corresponds to an element in the XML document. A node m is a child of node n iff the corresponding element of m is a subelement of the corresponding element of n.
For each node n ∈ N,

• the level of n, denoted by level(n), is defined by the number of edges in the path from t to n;

• n is labeled with a character in Σ, where its label is denoted by label(n);

• n is associated with a value, which is denoted by value(n);

• the children of n are ordered from left to right. We call this local order (denoted by ≺) since the ordering relationship only applies to siblings. There is also a total order among all nodes in the tree, the order obtained by a preorder traversal of the tree, which we call global order (denoted by ≪).

Since a subject tree is a data structure for modeling an XML document, we need to define a data structure to model path expressions. It turns out that a tree is not a sufficient data structure for this purpose. Consider the path expression /a/b/ancestor-or-descendant::c. This cannot be translated into a pattern tree, since the tree node b has two ancestors (a and c), which violates the definition of a tree. Therefore, path expressions are converted to pattern graphs.

² This is the XPath term that roughly corresponds to a type.
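The subject tree of Definition 2 can be sketched in a few lines of Python. This is only an illustrative sketch: the names Node, annotate, precedes_locally, and precedes_globally are hypothetical and do not come from the report, and the global order is simply taken to be the preorder rank, as the definition describes.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # identity-based equality: two distinct nodes never compare equal
class Node:
    """One subject-tree node: a label, an optional value, and ordered children."""
    label: str
    value: Optional[str] = None
    children: List["Node"] = field(default_factory=list)
    level: int = 0       # number of edges on the path from the root
    preorder: int = -1   # rank in the global (document) order


def annotate(root: Node) -> List[Node]:
    """Assign level and preorder rank to every node; return nodes in global order."""
    ordered: List[Node] = []
    stack = [(root, 0)]
    while stack:
        node, depth = stack.pop()
        node.level = depth
        node.preorder = len(ordered)
        ordered.append(node)
        # Push children right-to-left so the leftmost child is visited first.
        for child in reversed(node.children):
            stack.append((child, depth + 1))
    return ordered


def precedes_locally(parent: Node, m: Node, n: Node) -> bool:
    """Local order: m is a sibling to the left of n under the same parent."""
    kids = parent.children
    return kids.index(m) < kids.index(n)


def precedes_globally(m: Node, n: Node) -> bool:
    """Global order: m comes before n in a preorder traversal (after annotate)."""
    return m.preorder < n.preorder
```

The local order is only defined between siblings, which is why precedes_locally takes the common parent as an argument, whereas the global order compares any two annotated nodes.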

Definition 3 (Pattern Graph) A pattern graph is a labeled, directed graph, which is denoted by a 4-tuple P = ⟨Σ, V, A, R⟩, where Σ is a finite alphabet, V and A are the sets of vertices and arcs (directed edges) in the graph, respectively, and R is the set of binary relations³ between vertices as defined in Definition 1. For each vertex v ∈ V:

• v is labeled with * or a set of characters in Σ, where * ∉ Σ.

• v is associated with a list (which may be empty) of ⟨⊙, l⟩ tuples, where ⊙ is a comparison operator and l is a numerical or string literal.

Each arc (s, t) ∈ A is labeled with a relation r ∈ R, which indicates that (s, t) is in relation r.

In the pattern graph, we can think of the labels for vertices and arcs as query conditions or constraints when mapping onto the subject tree. Their semantics are defined in the following.

Definition 4 (GPM problem) Given a pattern graph P = ⟨Σ, V, A, R⟩ and a subject tree T = ⟨Σ, N, E, t⟩, the GPM problem is to find all functions δ_i : V → N such that:

1. For any v ∈ V:

   • If label(v) ⊆ Σ, then v can be mapped to any subject tree node whose label is in label(v), i.e. label(v) ⊆ Σ ⟹ label(δ_i(v)) ∈ label(v).

   • If label(v) = *, then v can be mapped to any subject tree node (we do not care about its label), i.e. label(v) = * ⟹ ∃x ∈ N. δ_i(v) = x.

   • If the associated ⟨⊙, l⟩ tuple list is not empty, then value(δ_i(v)) ⊙ l holds for every tuple in the list.

   If all the above conditions hold, δ_i(v) is called a hit of v.

2. For any (s, t) ∈ A, if the arc is labeled with the binary relation r, then ⟨δ_i(s), δ_i(t)⟩ satisfies relation r in the subject tree.

If every vertex in V has a hit after the above matching process, the mapped trees δ_i(P) in the subject tree are called witness trees and form the result of GPM⁴. In this case, we say P is satisfiable on T.

For example, the following XQuery expression can be converted to the pattern graph shown in Figure 3(b).
    for $b in /a/*,
        $a in $b/../a,
        $d in $b/d,
        $c in $b/../c,
        $e in $c/e
    where $b << $a
    return
      <result> { $d } { $e } </result>

³ In this paper, we do not study ternary or higher-arity relations, in which case the graph is a hypergraph.

⁴ We do not consider the order of witness trees in the output here, since their order is important only in the construction process, which is outside the scope of this paper.
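To make the GPM definition concrete, the following is a deliberately naive sketch that enumerates every label-compatible assignment of pattern vertices to tree nodes and keeps those that satisfy all arc relations. It is not the evaluation strategy proposed in this report; it is exponential and is only meant to spell out the definition. The class TreeNode, the function gpm, the restriction to the three relations "/", "//", and "<<", and the small subject tree are all assumptions made for illustration (the tree is in the spirit of Figure 3, not a reproduction of it). Value predicates are omitted.

```python
from itertools import product


class TreeNode:
    """Subject-tree node with a label, ordered children, and a parent pointer."""
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)
        self.parent = None
        self.pre = -1  # preorder rank, filled in by number()
        for child in self.children:
            child.parent = self


def number(root):
    """Return all nodes in preorder and record each node's preorder rank."""
    out, stack = [], [root]
    while stack:
        node = stack.pop()
        node.pre = len(out)
        out.append(node)
        stack.extend(reversed(node.children))
    return out


def is_descendant_or_self(s, t):
    """True if t is s itself or lies below s."""
    while t is not None:
        if t is s:
            return True
        t = t.parent
    return False


# Arc labels supported by this sketch: child, descendant-or-self, document order.
RELATIONS = {
    "/": lambda s, t: t.parent is s,
    "//": is_descendant_or_self,
    "<<": lambda s, t: s.pre < t.pre,
}


def gpm(vertices, arcs, root):
    """vertices: name -> "*" or a set of labels; arcs: (source, target, relation)."""
    nodes = number(root)
    names = list(vertices)
    candidates = [
        [n for n in nodes if vertices[v] == "*" or n.label in vertices[v]]
        for v in names
    ]
    mappings = []
    for combo in product(*candidates):
        delta = dict(zip(names, combo))
        if all(RELATIONS[r](delta[s], delta[t]) for s, t, r in arcs):
            mappings.append(delta)
    return mappings


# Subject tree: a( b(d), a, c(e), c(e) )
subject = TreeNode("a", [
    TreeNode("b", [TreeNode("d")]),
    TreeNode("a"),
    TreeNode("c", [TreeNode("e")]),
    TreeNode("c", [TreeNode("e")]),
])

# One possible pattern graph for the example query, with r standing for /a.
pattern_vertices = {"r": {"a"}, "b": "*", "a": {"a"}, "d": {"d"}, "c": {"c"}, "e": {"e"}}
pattern_arcs = [
    ("r", "b", "/"), ("r", "a", "/"), ("b", "d", "/"),
    ("r", "c", "/"), ("c", "e", "/"), ("b", "a", "<<"),
]
matches = gpm(pattern_vertices, pattern_arcs, subject)
```

On this tree the sketch finds two mappings that differ only in where c and e are mapped, mirroring the situation with two witness trees described for Figure 3.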

Figure 3: Example of subject tree, pattern graph, mappings, and witness trees. (a) Subject tree T, (b) Pattern graph P, (c) Two witness trees of P

Assume an XML document that has been translated into the subject tree in Figure 3(a). The dotted lines from V in P to N in T represent a mapping δ_0 satisfying the definition of GPM. Note that δ_0 is not the only valid mapping. Another one differs from δ_0 only in the mappings of c and e (represented by dashed lines) in V, where they are mapped to the c and e, respectively, in the right subtree of the root in T. Therefore, the results of the GPM are the two witness trees corresponding to the two mappings in Figure 3(c).

3.2 Converting Correlated Path Expressions to a Pattern Graph

Based on the definition, converting correlated path expressions to a pattern graph can be done in two steps: 1) convert every path expression to a pattern graph, and 2) pairwise merge the pattern graphs that are correlated.

Algorithm 1 converts a path expression to a pattern graph. Based on the definition, a path expression can be divided into a sequence of Steps, each of which consists of an axis, a NodeTest, and a sequence of qualifiers. The algorithm converts a Step of the path expression to a pattern graph using Algorithm 2, which, in turn, uses Algorithm 3 to convert the NodeTest that is part of that Step. Two pattern graphs resulting from two Steps are then merged together using a directed arc labeled with the axis between these two Steps.

Algorithm 1 ConvPath(pathExpr, r)
Input: pathExpr: a path expression consisting of a list of Steps; r: the vertex object of the previous step.
Output: the vertex object corresponding to the NodeTest of the last Step.
    separate pathExpr into steps S;
    tempRoot ← r;
    for each s ∈ S in left-to-right order do
        v ← ConvStep(s, tempRoot);
        tempRoot ← v;
    end for
    return tempRoot;

Algorithm 2 ConvStep(step, pNode)
Input: step: a structure consisting of an axis, a NodeTest, and a list of qualifiers; pNode: the vertex object corresponding to the NodeTest of the previous step.
Output: the vertex object corresponding to the NodeTest in the current step.
    v ← ConvNodeTest(step.NodeTest);
    create an arc α from pNode to v;
    label α with step.axis;
    for each q ∈ step.qualifiers do
        if q is a comparison expression between path expression p and a literal l then
            s ← ConvPath(p, v);
            create a tuple ⟨⊙, l⟩ and associate it with s; {⊙ is the comparison operator}
        else
            s ← ConvPath(q, v);
        end if
    end for
    return v;

Given two pattern graphs g and h translated from two correlated path expressions, the merge operation is straightforward. If g and h are strongly correlated and h references the variable binding of g, then we only need to create an arc from the vertex corresponding to the NodeTest of g to the vertex corresponding to the NodeTest of h (note that the algorithm does not convert the variable reference of h), and label the arc with the axis connecting the variable reference and the rest of the path expression in h. If g and h are weakly correlated, we also create an arc from the vertex corresponding to the NodeTest of g to the vertex corresponding to the NodeTest of h, but label it with the binary relation between them.
For example, in the pattern graph shown in Figure 3(b), the root labeled a, its leftmost child labeled *, and the arc connecting them form the pattern graph resulting from the path expression /a/*. The leftmost leaf, labeled d, is the result of converting the path expression $b/d without the variable reference. The arc connecting the leaf and its parent is created by merging the two strongly correlated path expressions together. The arc between the vertex labeled * and the vertex labeled a, which is a child of the root, is created by the two weakly correlated path expressions and their binary relation: $b << $a.
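The conversion and merge steps described above can be sketched in Python as follows. This is a simplified illustration under several assumptions: path expressions are taken to be already parsed into Step and Qualifier objects (hypothetical classes, not part of the report), the NodeTest handling covers only the wildcard and "|"-separated names, and vertices are plain integer ids. The functions conv_path, conv_step, and conv_node_test loosely follow Algorithms 1 to 3; merge_strong and merge_weak are hypothetical names for the two merge cases.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Step:
    axis: str                                   # arc label, e.g. "/" or "//"
    node_test: str                              # a name, "a|b", or "*"
    qualifiers: List["Qualifier"] = field(default_factory=list)


@dataclass
class Qualifier:
    path: List[Step]                            # the path inside the brackets
    comparison: Optional[Tuple[str, object]] = None   # (operator, literal), if any


@dataclass
class PatternGraph:
    labels: List[object] = field(default_factory=list)       # vertex id -> "*" or set of names
    predicates: List[list] = field(default_factory=list)     # vertex id -> [(op, literal), ...]
    arcs: List[Tuple[int, int, str]] = field(default_factory=list)

    def new_vertex(self, label) -> int:
        self.labels.append(label)
        self.predicates.append([])
        return len(self.labels) - 1


def conv_node_test(g: PatternGraph, node_test: str) -> int:
    """Create one vertex per NodeTest (wildcard or a set of names)."""
    label = "*" if node_test == "*" else set(node_test.split("|"))
    return g.new_vertex(label)


def conv_step(g: PatternGraph, step: Step, prev: int) -> int:
    """Create the vertex for this step, link it to prev, then convert qualifiers."""
    v = conv_node_test(g, step.node_test)
    g.arcs.append((prev, v, step.axis))
    for q in step.qualifiers:
        s = conv_path(g, q.path, v)
        if q.comparison is not None:
            g.predicates[s].append(q.comparison)
    return v


def conv_path(g: PatternGraph, steps: List[Step], root: int) -> int:
    """Convert steps left to right; return the vertex of the last NodeTest."""
    current = root
    for step in steps:
        current = conv_step(g, step, current)
    return current


def merge_strong(g: PatternGraph, bound_vertex: int, steps: List[Step]) -> int:
    """Strong correlation: continue from the vertex the referenced variable is bound to."""
    return conv_path(g, steps, bound_vertex)


def merge_weak(g: PatternGraph, u: int, v: int, relation: str) -> None:
    """Weak correlation: connect two vertices with an arc labeled by the relation."""
    g.arcs.append((u, v, relation))


# Example: a book step with two qualifiers, title = "XML" via a child arc
# and price < 100 via a descendant-or-self arc.
g = PatternGraph()
root = g.new_vertex({"root"})
book = conv_path(g, [Step("/", "book", [
    Qualifier([Step("/", "title")], ("=", "XML")),
    Qualifier([Step("//", "price")], ("<", 100)),
])], root)
```

Each NodeTest is visited once and each step adds exactly one arc, which is consistent with the linear-time behaviour claimed for the conversion.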

Algorithm 3 ConvNodeTest(nodeTest)
Input: nodeTest: the NodeTest of the current step.
Output: the vertex object corresponding to the NodeTest.
    create a vertex v;
    if nodeTest is the wildcard * then
        label v with *;
    else if nodeTest is a NameTest then
        S ← ∅;
        for each Name separated by | do
            if Name is of the form QName then
                map QName to a character c in Σ;
                S ← S ∪ {c};
            else if Name is of the form QName:* then
                map QName to a character c in Σ;
                S ← S ∪ {c:*};
            else if Name is of the form QName1:QName2 then
                map QName1 and QName2 to c1 and c2 in Σ, respectively;
                S ← S ∪ {c1:c2};
            end if
        end for
        label v with S;
    else if nodeTest is a KindTest then
        map the KindTest to its corresponding special character c; {For example, we can assign distinct special characters to attribute and namespace, # to comment, ! to PI, etc.}
        label v with c;
    end if
    return v;

Theorem 1 Algorithm 1 translates a path expression to a pattern graph such that:

1. each NodeTest in the path expression is converted to a vertex in the pattern graph. If the NodeTest is a NameTest, the list of Names corresponds to the label (a set of characters in Σ) of the vertex.

2. if two vertices u and v correspond to two NodeTests in two consecutive Steps in left-to-right order, then there is a directed arc (u, v) in the pattern graph and the label of the arc is the Axis between these two Steps.

3. if vertex u corresponds to a NodeTest in a Step, and v corresponds to a NodeTest in that Step's Qualifier expression, then there is a directed arc (u, v) in the pattern graph and the label of the arc is the Axis of the NodeTest in the qualifier expression. If the qualifier expression is a comparison expression between a path expression and a literal, a tuple ⟨⊙, l⟩ is constructed and associated with v.

Proof Since a path expression is a list of Steps, each of which consists of an Axis, a NodeTest, and a list of Qualifiers, Algorithm 1 translates a path expression to a pattern graph by translating each Step into a pattern graph and then connecting the two pattern graphs corresponding to two consecutive Steps by a directed arc that is labeled with the Axis of the second Step. This guarantees property 2 of the theorem.

For each Step, Algorithm 1 calls Algorithm 2 to translate the Step into a pattern graph. Algorithm 2 first calls Algorithm 3 to convert the NodeTest into a vertex in the pattern graph. If there are qualifiers, Algorithm 2 checks whether the qualifier expression is a path expression or a comparison expression between a path expression and a literal. In either case, it calls Algorithm 1 to construct another pattern graph and connects the two pattern graphs with a directed arc labeled with the Axis of the qualifier expression. If the qualifier expression is a comparison expression, Algorithm 2 also constructs a tuple ⟨⊙, l⟩ from the comparison expression and associates the tuple with the vertex corresponding to the qualifier expression. This guarantees property 3 of the theorem. Algorithm 3 converts a NodeTest to a vertex in the pattern graph, with a label corresponding to the Names of the NodeTest or to the KindTest. This guarantees property 1 of the theorem.

Theorem 2 Algorithm 1 converts a path expression to a pattern graph in Θ(n) time, where n is the number of NodeTests.

Proof The algorithms convert each NodeTest to a vertex in the pattern graph, and Algorithm 3 runs in Θ(1) time, so the complexity of converting all NodeTests is Θ(n). Since there are exactly n axes for n NodeTests, the time for creating all edges is also Θ(n). By definition, there are at most n comparison expressions in the qualifiers, so there are O(n) ⟨⊙, l⟩ tuples associated with vertices. Therefore, the total time for converting a path expression is Θ(n) + Θ(n) + O(n) = Θ(n).

Careful readers might have noted that not all correlated path expressions in an XQuery expression can be translated into a pattern graph for evaluation without changing the semantics. For example, assume that e_1 and e_2 are two correlated path expressions where e_1 appears in a for clause and e_2 appears in the where clause. If there is a construction expression related to e_1 between e_1 and e_2, they cannot be translated into a pattern graph.
Consider the following XQuery expression:

    1. for $a in document("bib.xml")/bib/book/authors
    2. return
    3.   <author>
    4.     { $a/first_name }
    5.     for $b in document("articles.xml")/articles
    6.     where $b/author/last_name = $a/last_name
    7.     return
    8.       <book> $b/title </book>
    9.   </author>

The path expression $a/last_name in line 6 is correlated to the path expression document("bib.xml")/bib/book/authors in line 1. However, they cannot be merged together since doing so would result in a witness tree that contains a node labeled last_name. This is not equivalent to the semantics of the original expression, since the original outputs every author's first name (line 4) even if the author does not have a last name. In this paper, we assume that the inputs of the conversion algorithm are correlated path expressions that can be converted to pattern graphs without changing semantics. How to identify these correlated path expressions is part of our future work.

Also note that the pattern graph generated by the above algorithm may not be a minimum one, in the sense that there may be redundant vertices in the pattern graph. For example, the two path expressions /bib/book[price<100] (which is bound to variable $b) and $b/price convert to a pattern graph in which the vertex labeled book has two children, both of which are labeled price. Interested readers are referred to [19].

Figure 4: Pattern graph for /book[title="XML"][//price<100]

4 Optimization Based on Pattern Graphs

In this section, we discuss a general evaluation strategy for path expressions based on joins and some optimization techniques based on pattern graph transformation. The first optimization technique is a set of rewriting rules that rewrite an arc labeled by any axis to a sub-pattern graph that only contains a minimal subset of axes. The second optimization technique divides the pattern graph into connected pattern trees of /-arcs. Each pattern tree can be passed to a TPM operator instead of a set of joins.

4.1 A General Evaluation Strategy Based on Joins

A pattern graph P is a visual representation of the constraints on tree nodes specified in the path expression. For example, the path expression /book[title="XML"][//price<100] returns the books whose titles are "XML" and whose prices are less than 100. Its semantics can be expressed using the following conjunctive calculus expression⁵:

\[
\{\, x \mid \exists y, z\, (\mathrm{child}(\mathit{root}, x) \land \mathrm{label}(x) = \text{book} \land \mathrm{child}(x, y) \land \mathrm{label}(y) = \text{title} \land \mathrm{value}(y) = \text{``XML''} \land \mathrm{descendant\text{-}or\text{-}self}(x, z) \land \mathrm{label}(z) = \text{price} \land \mathrm{value}(z) < 100) \,\}
\]

where root is the root of the current XML tree, and child (descendant-or-self) is a predicate that returns true if the second argument is a child (descendant-or-self) node of the first argument in the tree. In general, any axis can be thought of as a predicate on two tree nodes. As suggested by their names, label and value are two functions that return the label and value of the tree node, respectively. Figure 4 is a pattern graph corresponding to the path expression and the conjunctive calculus expression.

This conjunctive calculus expression can be evaluated by converting it to an expression consisting of selection (σ) and join (⋈) operators similar to the ones in the relational algebra:

\[
\sigma_{\phi_1}(T_1) \bowtie_{\mathrm{child}(T_1, T_2)} \sigma_{\phi_2}(T_2) \bowtie_{\mathrm{descendant\text{-}or\text{-}self}(T_1, T_3)} \sigma_{\phi_3}(T_3)
\]

where all T_i refer to the same subject tree T, but appropriately renamed.
The three selection conditions are:

    ϕ_1 : label = book
    ϕ_2 : label = title ∧ value = "XML"
    ϕ_3 : label = price ∧ value < 100

⁵ Note that not all path expressions can be expressed by conjunctive calculus, since the latter does not allow ∨ and ¬ in the qualifier expressions.

A possible access plan is to first evaluate the label and value functions and obtain three sequences of tree nodes x, y, and z, respectively, either by index searching or by sequential scan depending on whether there are indexes on tree node labels. Then the three sequences are joined together on the child and descendant-or-self predicates. One way of actually implementing these joins is to associate each tree node with a tuple that provides enough information for determining axis relationships. For example, the ⟨docid, preorder, postorder, level⟩ interval encoding [6] of subject tree nodes provides a way of deciding any of the thirteen axis relationships between two nodes in constant time. However, when the subject tree is updated, the cost of updating the encodings is Θ(n) in the worst case, where n is the number of nodes in the subject tree. A recent improvement of this encoding scheme can achieve Θ(log n) amortized update time and encoding length [13].

The problem with this approach is that every NodeTest is converted to a sequential scan or index lookup, and every Axis (including those in the qualifier expressions) is converted to a join on two sequences, which can seriously increase the number of joins and make it hard for the query optimizer to choose an efficient join order. For example, in the XQuery Use Cases [5], the total number of axes in the XQuery expressions⁶ varies from 1 to 21, with an average of 5.68 if one counts the axes in construction expressions or 3.94 otherwise. Of these axes, approximately 65.63% are child axes (/), 32.98% are descendant-or-self axes (//), and the rest are self axes (.). Figure 5 shows the number of / and // axes in the queries in the XQuery Use Cases. Since almost two-thirds of the axes in path expressions are child axes, if nodes satisfying child relationships are stored close to each other, we can use this locality to accelerate the queries by using tree pattern matching rather than joins.
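The constant-time axis tests that such an encoding enables can be sketched as follows. This is an illustration only, not the scheme of [6] in full: the docid component is dropped on the assumption of a single document, only a few axes are shown, and the names Enc, encode, is_descendant, is_child, is_following, and child_join are hypothetical. The last function is a plain nested-loop join standing in for whatever join algorithm a real system would use.

```python
from typing import NamedTuple


class Enc(NamedTuple):
    pre: int    # preorder rank
    post: int   # postorder rank
    level: int  # depth below the root


def encode(tree):
    """tree is (label, [subtrees]); returns [(label, Enc), ...] in document order."""
    result = []
    counter = {"pre": 0, "post": 0}

    def visit(node, level):
        label, children = node
        pre = counter["pre"]
        counter["pre"] += 1
        slot = len(result)
        result.append(None)  # reserve this node's position in document order
        for child in children:
            visit(child, level + 1)
        result[slot] = (label, Enc(pre, counter["post"], level))
        counter["post"] += 1

    visit(tree, 0)
    return result


def is_descendant(a: Enc, d: Enc) -> bool:
    """d lies strictly below a: entered after a and left before a."""
    return a.pre < d.pre and d.post < a.post


def is_child(p: Enc, c: Enc) -> bool:
    return is_descendant(p, c) and c.level == p.level + 1


def is_following(a: Enc, b: Enc) -> bool:
    """b comes after a in document order and is not a descendant of a."""
    return a.pre < b.pre and a.post < b.post


def child_join(parents, children):
    """Nested-loop join of two node sequences on the child predicate."""
    return [(p, c) for p in parents for c in children if is_child(p, c)]


doc = ("bib", [
    ("book", [("title", []), ("price", [])]),
    ("book", [("title", [])]),
])
encoded = encode(doc)
books = [e for label, e in encoded if label == "book"]
titles = [e for label, e in encoded if label == "title"]
pairs = child_join(books, titles)
```

Each predicate needs only a fixed number of integer comparisons, which is what makes the join conditions cheap to test; the cost that remains is the number of joins themselves, which is the concern raised in the paragraph above.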
The other axes that do not have localization characteristics can be evaluated by joins.

Footnote 6: The number is calculated by summing up all axes in all path expressions in an XQuery expression.

Figure 5: The numbers of / and // axes in the queries in XQuery Use Cases (queries from the XMP, Tree, Seq, R, SGML, Text, NS, and Strong use cases).

4.2 Rewriting Axes Relations

The set of axes defined in the path expression is not minimal, in the sense that we can identify a subset of the axes that can express any of the eleven axes. In fact, the minimum set is not unique: more than one four-label set containing the self label (.) can serve as the set of arc labels of the pattern graph and still represent all the axes. In this paper we choose as the minimum set the one consisting of ., /, //, and the document-order label, since the last label indicates the document order relationship, which is widely used in XPath expression evaluation. On the other hand, if a pattern graph is very large, sometimes we would like to reduce the number of arcs by combining a particular component of the pattern graph into a single arc. This can be done by the reverse process of the above conversion. These conversions are provided in the form of rewriting rules in Theorem 3.

Theorem 3 (Rewriting Rules for Axes) In the pattern graph, any arc labeled with an axis other than attribute and namespace can be converted to a subgraph whose arc labels are in the set consisting of ., /, //, and the document-order label. Assume that (p, c) is any arc and that λ(p, c) denotes that the axis associated with the arc is λ. Depending on the axis, the rewriting rules are as follows:

(a) child(p, c) ⟺ label(p, c) = /
(b) parent(p, c) ⟺ label(c, p) = /
(c) descendant(p, c) ⟺ ∃x label(p, x) = / ∧ label(x, c) = // ∧ label(x) = *
(d) ancestor(p, c) ⟺ ∃x label(x, p) = // ∧ label(c, x) = / ∧ label(x) = *
(e) self(p, c) ⟺ p and c are the same vertex
(f) descendant-or-self(p, c) ⟺ label(p, c) = //
(g) ancestor-or-self(p, c) ⟺ label(c, p) = //
(h) following-sibling(p, c) ⟺ ∃x label(x, p) = / ∧ label(x, c) = / ∧ label(p, c) is the document-order label
(i) preceding-sibling(p, c) ⟺ ∃x label(x, p) = / ∧ label(x, c) = / ∧ label(c, p) is the document-order label
(j) following(p, c) ⟺ label(p, c) is the document-order label
(k) preceding(p, c) ⟺ label(c, p) is the document-order label
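As an illustration of the forward direction of these rules, the following Python sketch maps an axis arc (p, c) to the arcs of its replacement subgraph. It is a sketch under stated assumptions, not the report's implementation: arcs are modeled as (source, target, label) triples, the string "<<" is a hypothetical stand-in for the document-order label, the fresh vertex name "x" is made up, and the self axis is left out because it merges two vertices rather than producing arcs.

```python
CHILD = "/"
DESC_OR_SELF = "//"
DOC_ORDER = "<<"  # hypothetical name for the document-order arc label


def rewrite_axis(axis, p, c, fresh="x"):
    """Forward rewrite of one axis arc (p, c).

    Returns (arcs, wildcard_vertices), where arcs is a list of
    (source, target, label) triples and wildcard_vertices lists the
    newly introduced vertices, which carry the label *.
    """
    if axis == "child":
        return [(p, c, CHILD)], []
    if axis == "parent":
        return [(c, p, CHILD)], []
    if axis == "descendant":
        return [(p, fresh, CHILD), (fresh, c, DESC_OR_SELF)], [fresh]
    if axis == "ancestor":
        return [(fresh, p, DESC_OR_SELF), (c, fresh, CHILD)], [fresh]
    if axis == "descendant-or-self":
        return [(p, c, DESC_OR_SELF)], []
    if axis == "ancestor-or-self":
        return [(c, p, DESC_OR_SELF)], []
    if axis == "following-sibling":
        return [(fresh, p, CHILD), (fresh, c, CHILD), (p, c, DOC_ORDER)], [fresh]
    if axis == "preceding-sibling":
        return [(fresh, p, CHILD), (fresh, c, CHILD), (c, p, DOC_ORDER)], [fresh]
    if axis == "following":
        return [(p, c, DOC_ORDER)], []
    if axis == "preceding":
        return [(c, p, DOC_ORDER)], []
    raise ValueError(f"axis not handled in this sketch: {axis}")
```

For example, rewrite_axis("descendant", "p", "c") yields one /-arc into a wildcard vertex followed by one //-arc, which is the shape of rule (c).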

Figure 6: Converting from axes to pattern trees. (a) child(p, c) (b) parent(p, c) (c) descendant(p, c) (d) ancestor(p, c) (e) self(p, c) (f) descendant-or-self(p, c) (g) ancestor-or-self(p, c) (h) following-sibling(p, c) (i) preceding-sibling(p, c) (j) following(p, c) (k) preceding(p, c)

The graphical representations of the rules are shown in Figure 6(a)-(k), in that order.

Proof The child, parent, descendant-or-self, ancestor-or-self, self, following, and preceding axes are straightforward based on the semantics of the edge labels /, //, ., and the document-order label. Since ancestor is the reverse of descendant, and preceding-sibling is the reverse of following-sibling, we prove descendant and following-sibling only. For rule (c), by the semantics of the descendant-or-self axis, descendant-or-self(x, c) ⟺ descendant(x, c) ∨ x = c. Therefore, the right-hand side can be rewritten as

∃x [(child(p, x) ∧ descendant(x, c)) ∨ (child(p, x) ∧ x = c)]
⟺ descendant(p, c) ∨ child(p, c)
⟺ descendant(p, c),

which is the left-hand side; the last step holds because every child is also a descendant. Rule (h) is straightforward since the right-hand side is exactly the semantics of the following-sibling axis.

The purpose of rewriting axes is twofold: to convert an arc labeled by any axis to a subgraph in Figure 6 (forward rewrite), and, in the other direction, to convert a subgraph in Figure 6 to a single arc (backward rewrite). Whether the query optimizer should take one direction or the other depends on the cost of the two plans. Usually the costs vary depending on the underlying physical storage structure. If the XML documents are shredded (e.g., using the interval encoding scheme [6]) and the pieces are then stored in a relational database system, the backward rewrite may result in a more efficient physical plan, since each arc in the pattern graph is translated into a join in the relational system [6, 7, 8], and the conversion reduces the number of joins.
On the other hand, if the subject tree nodes are not shredded but clustered in physical storage according to the parent-child relationship, then the forward rewrite may lead to a more efficient physical plan, since we can use the rewriting rules presented in the next subsection to divide a pattern graph into connected pattern trees. For each pattern tree, a tree pattern matching (TPM) algorithm may need only one sequential scan of the data to find all witness trees. The total cost (I/O cost plus the TPM algorithm in main memory) may be less than the cost of many sequential scans to find the nodes first and then many joins to get the final witness trees.

4.3 Partitioning Pattern Graph into Pattern Trees

Another type of rewriting rule partitions the pattern graph into connected components such that there are no /-arcs connecting vertices in different components. Based on the properties of /-arcs, we can further transform each component such that the /-arcs form a tree structure and all //-arcs are safely removed. Therefore, every resulting component is a tree (with optional document-order arcs between tree nodes) if we only consider /-arcs. We call such a component a component pattern tree (CPT). By rewriting each component as a CPT, we can match the CPT against the subject tree first. Then all the resulting witness trees corresponding to the CPTs are joined together according to the arcs connecting them. The purpose of removing //-arcs is to reduce the size of the CPT and make the TPM algorithm more efficient.

Algorithm 4 DecomposePG(P)
Input: P = ⟨Σ, V, A, R⟩: the pattern graph.
Output: P is rewritten, or UNSATISFIABLE is returned if the pattern graph is unsatisfiable on any subject tree.
    {Initialization}
    S ← ∅; R ← ∅;
    for each vertex v ∈ V do
        v.color ← WHITE;
        v.level ← 0;
        if v's indegree of /-arcs is 0 then
            S ← S ∪ {v};
        end if
    end for
    if S = ∅ then
        return UNSATISFIABLE;
    end if
    {Partition the pattern graph into components and comb each component into a pattern tree.}
    ncomp ← 0;
    for each r ∈ S do
        R ← R ∪ {Comb(P, r, ncomp)};
        ncomp ← ncomp + 1;
    end for
    {Remove //-arcs in the pattern trees.}
    for each r ∈ R do
        RmAD(P, r);
    end for

The decomposition process is shown in Algorithm 4. The algorithm first colors each vertex WHITE and initializes its level (the number of /-arcs along the path from the root) to 0. The initialization also keeps a record of all vertices that have no incoming /-arcs and the total number of vertices whose indegree is non-zero. If no vertex has zero indegree of /-arcs, either there is no /-arc at all or there exists a cycle of /-arcs. In the former case, every vertex is a component and thus a trivial CPT. In the latter case, since any /-arc is mapped to an edge in the subject tree, which has no cycles, the existence of a cycle in P implies that it is unsatisfiable on any subject tree.

The zero-indegree vertices are candidates to be the roots of the resulting pattern trees. Therefore, for each of these vertices, the decomposition algorithm calls function Comb (shown in Algorithm 5) to explore the pattern graph using depth-first search (DFS) and manipulate the /-arcs to form a tree structure. The output of Comb is the real root of the pattern tree. After getting all of the roots of the pattern trees, the decomposition algorithm calls function RmAD (shown in Algorithm 6) to remove the //-arcs in each component. Both functions check the satisfiability of the pattern graph, and call function MergePaths (shown in Algorithm 7) to merge two paths in order to eliminate a cycle or a redundant //-arc in the tree.

Algorithm 5 Comb(P, r, nc)
Input: P = ⟨Σ, V, A, R⟩: the pattern graph. r: the starting vertex (root) of the pattern tree. nc: the component number.
Output: P is rewritten or UNSATISFIABLE is returned.
    r.color ← GREY;
    r.ncomp ← nc;
    for each /-arc α(r, v) ∈ A do
        if v.color = WHITE then
            v.level ← r.level + 1;
            v.predecessor ← r;
            Comb(P, v, nc);
        else if v.color = GREY then
            return UNSATISFIABLE; {there is a directed cycle}
        else if v.level ≠ r.level + 1 then
            return UNSATISFIABLE; {there is an undirected cycle, and v's levels based on the two paths are different, so the paths cannot be merged}
        else
            MergePaths(P, v.predecessor, r);
        end if
    end for
    r.color ← BLACK;
    return r;

Algorithm 6 RmAD(P, r)
Input: P = ⟨Σ, V, A, R⟩: the pattern graph. r: the starting vertex (root) of the pattern tree.
Output: the //-arcs in the component are removed or UNSATISFIABLE is returned.
    r.color ← GREY;
    for each //-arc α(r, v) ∈ A such that r.ncomp = v.ncomp do
        if (v.color = GREY ∧ r ≠ v) ∨ v.level < r.level then
            return UNSATISFIABLE;
        end if
        find v's ancestor u such that u.level = r.level;
        MergePaths(P, r, u);
        remove α;
    end for
    for each /-arc (r, w) ∈ A such that w.color = BLACK do
        RmAD(P, w);
    end for
    r.color ← WHITE;

Algorithm 7 MergePaths(P, u, v)
Require: u.level = v.level
Input: P = ⟨Σ, V, A, R⟩: the pattern graph. u, v: the two end vertices of paths whose lengths are equal.
Output: the two paths from the root to u and to v, respectively, are merged, or UNSATISFIABLE is returned.
    i ← u;
    j ← v;
    while i ≠ j do
        if label(i) ∩ label(j) ≠ ∅ then
            label(i) ← label(i) ∩ label(j);
            append the value constraint tuple associated with j to i;
            remove arc (j.predecessor, j);
        else
            return UNSATISFIABLE;
        end if
        i ← i.predecessor;
        j ← j.predecessor;
    end while

The MergePaths function merges two paths in the pattern tree into one path. It takes two vertices that have the same level as input and tests whether they are compatible. Two vertices are compatible if they can be mapped to the same node in the subject tree, i.e., the intersection of the labels of the two vertices is not empty, where * is treated as the universal set. If they are compatible, the two vertices are merged. The merging process continues recursively to the parents of the two vertices until it reaches the common ancestor of the two original vertices.

Figure 7: Decompose a Pattern Graph to Pattern Trees. (a) the original pattern graph (b) the two connected CPTs (c) the two connected quasi-ordered pattern trees

Figure 7 shows an example of dividing a pattern graph into pattern trees. Figure 7(a) is a pattern graph consisting of /-arcs, a //-arc, and a document-order arc. The decomposition algorithm first finds the two vertices a and g, and passes them to the function Comb individually. Comb removes all directed and undirected /-cycles in the left component and results in the pattern tree on the left-hand side of the document-order arc shown in Figure 7(b). Then the decomposition algorithm calls function RmAD to remove all //-arcs. In the example, the only //-arc is removed since it is redundant with the /-arcs (g, h) and (h, j).

Lemma 1 Algorithm 7 merges two paths into one path if they are compatible, or returns UNSATISFIABLE otherwise.

Proof Given two vertices i and j, we want to merge them into one vertex k such that a node in any subject tree is mapped from both i and j iff it can be mapped from k. By the definition of GPM, k is mapped to a node m in the subject tree iff label(m) ∈ label(k) and m satisfies the value constraint associated with k. Similarly, m is mapped from i and j iff label(m) ∈ label(i) ∧ label(m) ∈ label(j) and m satisfies the value constraints associated with both i and j. Therefore, we should label k and associate value constraints such that label(k) = label(i) ∩ label(j) and the value constraint is the conjunction of the two constraints associated with i and j, respectively. A special case arises if the label of i or of j is *. In this case, the result of the intersection is the label of the other vertex, i.e., * is treated as the universal set here. If the intersection of label(i) and label(j) is empty, there is no node in the subject tree that can be mapped from both i and j; thus the pattern graph is unsatisfiable on any subject tree.

Lemma 2 Algorithm 7 merges two paths of the pattern graph in O(l|Σ|) time, where l is the length of the path from the common ancestor of u and v to u or v.

Proof The complexity of Algorithm 7 is straightforward: the loop is executed at most l times. In each iteration, the most time-consuming step is finding the intersection of label(i) and label(j). Since the sizes of label(i) and label(j) are bounded by |Σ|, finding the intersection can be done in O(|Σ|) time if the labels are sorted. Thus the complexity of the algorithm is O(l|Σ|).

Although the complexity result of Algorithm 7 is disappointing, especially when Σ is very large, the bound should be much tighter in practice, since most vertex labels are expected to be a single character in Σ or *, in which case the complexity is only Θ(l).
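The merging step of Algorithm 7 can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the report's code: the Vertex class, the Unsatisfiable exception, and the use of None to stand for the wildcard label * are all hypothetical choices, and, like the pseudocode, the sketch records which /-arcs are dropped without re-parenting the children of the merged-away vertices.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Optional


class Unsatisfiable(Exception):
    """Raised when two vertices on the paths cannot map to the same node."""


@dataclass(eq=False)
class Vertex:
    name: str
    labels: Optional[FrozenSet[str]]          # None stands for the wildcard *
    predecessor: Optional["Vertex"] = None    # parent along /-arcs
    constraints: List[str] = field(default_factory=list)


def intersect(a, b):
    """Label intersection in which the wildcard acts as the universal set."""
    if a is None:
        return b
    if b is None:
        return a
    return a & b


def merge_paths(u, v):
    """Merge the path ending at v into the path ending at u.

    Assumes u and v have the same level and share a common ancestor.
    Returns the list of removed /-arcs as (parent_name, child_name) pairs.
    """
    removed = []
    i, j = u, v
    while i is not j:
        common = intersect(i.labels, j.labels)
        if common is not None and not common:
            raise Unsatisfiable(f"{i.name} and {j.name} have disjoint labels")
        i.labels = common
        i.constraints.extend(j.constraints)
        parent = j.predecessor.name if j.predecessor is not None else None
        removed.append((parent, j.name))
        i, j = i.predecessor, j.predecessor
    return removed
```

With two siblings labeled {b, c} and {c, d}, as in Figure 7(a), the merged vertex keeps the label {c}; if the label sets were disjoint, the sketch would raise Unsatisfiable, matching the unsatisfiable case in the proof of Lemma 1.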

Figure 8: Breaking the cycles. (a) a directed cycle (b) an undirected cycle, c has the same level from two paths (c) an undirected cycle, c has different levels from two paths

Lemma 3 Algorithm 5 divides the pattern graph into connected components, where the /-arcs form a tree structure in each of the connected components.

Proof Algorithm 5 is based on depth-first search (DFS). At first, all vertices are colored WHITE. During the DFS exploration, when a vertex is first visited through a /-arc, it is colored GREY. When all its /-arcs have been visited, the vertex is colored BLACK. In each loop of the algorithm, a vertex r is picked and all its /-arcs (r, v) are examined. If the adjacent vertex v is WHITE, then a normal DFS call is applied to v recursively. If v's color is GREY, then v is one of the ancestors of r, which forms a directed cycle (illustrated in Figure 8(a)). Therefore, the pattern graph is unsatisfiable on any subject tree. If v's color is BLACK, then v has already been visited along another path, so the indegree of /-arcs of v is greater than 1. Since the indegree of every node in the subject tree except the root is 1, the only possibility for the pattern graph to be satisfiable is that the two parents of v actually map to the same node in the subject tree; thus the two parents are compatible and should be merged into one vertex. If the two levels are equal (illustrated in Figure 8(b)), then the algorithm calls function MergePaths to merge the two paths into one. The result of the merge is that the indegrees of /-arcs for v and all its ancestors are 1. If the two paths assign different levels to v (illustrated in Figure 8(c)), it is impossible for the two paths leading to v to be merged, and the algorithm returns UNSATISFIABLE. Since all the directed and undirected cycles in the pattern graph are either broken or reported as unsatisfiable, after the DFS all vertices other than the root have indegree of /-arcs equal to 1, from which it follows that the /-arcs form a tree structure.
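The three outcomes distinguished in this proof can be sketched with a small Python DFS. The sketch makes several simplifying assumptions: the /-arcs are given as an adjacency dictionary, the function comb and the exception Unsatisfiable are hypothetical names, and instead of performing the merge it only records the pair of vertices that Algorithm 5 would pass to MergePaths.

```python
WHITE, GREY, BLACK = "WHITE", "GREY", "BLACK"


class Unsatisfiable(Exception):
    """Raised when the /-arcs cannot be mapped onto any subject tree."""


def comb(child_arcs, root):
    """DFS over /-arcs starting at root.

    child_arcs maps a vertex to the list of vertices it reaches by a /-arc.
    Returns (level, predecessor, merges), where merges lists the vertex
    pairs whose root paths would have to be merged.
    """
    color = {}
    level = {root: 0}
    pred = {root: None}
    merges = []

    def visit(r):
        color[r] = GREY
        for v in child_arcs.get(r, []):
            state = color.get(v, WHITE)
            if state == WHITE:
                level[v] = level[r] + 1
                pred[v] = r
                visit(v)
            elif state == GREY:
                # v is an ancestor of r: directed cycle, as in Figure 8(a)
                raise Unsatisfiable("directed cycle of /-arcs")
            elif level[v] != level[r] + 1:
                # two paths give v different levels, as in Figure 8(c)
                raise Unsatisfiable("undirected cycle with unequal levels")
            else:
                # same level on both paths, as in Figure 8(b): merge the parents
                merges.append((pred[v], r))
        color[r] = BLACK

    visit(root)
    return level, pred, merges
```

On a diamond-shaped graph (a to b1 and b2, both of which point to c), the sketch reports that b1 and b2 need to be merged; on the other two shapes of Figure 8 it raises Unsatisfiable.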
Lemma 4 Algorithm 5 runs in O(m + n|Σ|) time, where m and n are the numbers of /-arcs and vertices in the connected component, respectively.

Proof The complexity of Algorithm 5 depends on the complexity of DFS and of Algorithm 7. For each /-arc, the algorithm either calls Comb recursively or calls MergePaths. Since the complexity of DFS is O(n + m) and there are at most n merges, the total complexity is O(n + m) + O(n|Σ|) = O(m + n|Σ|). Also, since most of the vertex labels are a single character or *, the above complexity result can be as tight as Θ(m + n).

After Algorithm 5, all vertices should be colored BLACK. To save time, we do not re-initialize the vertices to WHITE. Rather, in Algorithm 6, a BLACK vertex indicates that it has not been visited yet, while a WHITE vertex means that its visit is finished.

Figure 9: Remove all //-arcs (solid lines represent /-arcs and dotted lines represent //-arcs)

Lemma 5 Algorithm 6 removes all //-arcs from a connected component.

Proof Algorithm 6 is also based on DFS. During the DFS, when a vertex r is examined, each //-arc (r, v), where v is also in this connected component, is examined to see whether it is redundant. There are three possibilities (illustrated by the four dotted lines in Figure 9) for the location of v:

1. v is an ancestor of r (indicated by v.color = GREY and r ≠ v): this contradicts the constraint of the //-arc (r, v), which requires that v be a descendant of r or r itself. So the algorithm returns UNSATISFIABLE;
2. v.level < r.level: this contradicts the constraint of the //-arc (r, v), which requires that v's level be greater than or equal to r's level. So the algorithm also returns UNSATISFIABLE;
3. v.level ≥ r.level: either v is a descendant of r or r itself, or r and v are on different paths from the root. The necessary condition for the pattern graph to be satisfiable is that the two paths from the root to r and to v can be merged together. Thus the algorithm calls the MergePaths function to merge the two paths into one, and then removes the //-arc.

Since all //-arcs are removed in all cases, the resulting component is a pattern tree consisting of only /-arcs and possibly document-order arcs.

Lemma 6 Algorithm 6 runs in O(m + n|Σ|) time, where m is the number of /- and //-arcs, and n is the number of vertices in the pattern tree.

Proof The complexity analysis is similar to the analysis of Algorithm 5. The complexity of DFS is O(m + n) and there are at most n vertex merges, so the total complexity is O(m + n|Σ|).

Theorem 4 Algorithm 4 decomposes a pattern graph into connected CPTs.

Proof This theorem follows directly from the correctness analysis of Algorithm 5, Algorithm 6, and Algorithm 7.

Theorem 5 Algorithm 4 decomposes a pattern graph into pattern trees in O(m + n|Σ|) time, where n and m are the numbers of vertices and arcs, respectively, in the pattern graph.
Proof The complexity of this algorithm also follows directly from the complexity analysis of Algorithm 5 and Algorithm 6: O(m + n|Σ|).
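The three-way case analysis that Algorithm 6 applies to each //-arc (r, v), as described in the proof of Lemma 5, can be sketched in Python as follows. This sketch assumes the component is already a tree described by level and predecessor maps (for instance as produced by a Comb-like pass), and it tests the ancestor relation by walking predecessors rather than by vertex color; the function name classify_ad_arc and the returned tags are hypothetical.

```python
def classify_ad_arc(r, v, level, pred):
    """Decide what to do with a //-arc (r, v) inside one pattern tree.

    level maps each vertex to its depth; pred maps each vertex to its
    parent along /-arcs (None for the root).
    Returns "UNSATISFIABLE", or ("MERGE", r, u) where u is v's ancestor
    (or v itself) at r's level; the //-arc can be removed after that merge.
    """
    # Case 1: v is a proper ancestor of r.
    a = pred.get(r)
    while a is not None:
        if a == v:
            return "UNSATISFIABLE"
        a = pred.get(a)
    # Case 2: v lies above r's level.
    if level[v] < level[r]:
        return "UNSATISFIABLE"
    # Case 3: climb from v to r's level and merge the two root paths.
    u = v
    while level[u] > level[r]:
        u = pred[u]
    return ("MERGE", r, u)


# The right-hand component of Figure 7, read here as g -> h -> j and g -> i.
pred = {"g": None, "h": "g", "j": "h", "i": "g"}
level = {"g": 0, "h": 1, "j": 2, "i": 1}
```

When the result is a merge of a vertex with itself, as for the //-arc (g, j), the arc is simply redundant with the existing /-arcs and can be dropped without changing the tree.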
