Efficient Support for Ordered XPath Processing in Tree-Unaware Commercial Relational Databases


Boon-Siew Seah, Klarinda G. Widjanarko, Sourav S. Bhowmick, Byron Choi, Erwin Leonardi
School of Computer Engineering, Nanyang Technological University, Singapore
Singapore-MIT Alliance, Nanyang Technological University, Singapore
{ ,klarinda,assourav,kkchoi,lerwin}@ntu.edu.sg
May 3, 2007

Abstract

In this paper, we present a novel approach to ordered XPATH evaluation in a tree-unaware RDBMS. The novelty of our approach lies in the following. (a) We propose a novel XML storage scheme which comprises only leaf nodes, their corresponding data values, order encodings and their root-to-leaf paths. (b) We propose an algorithm for mapping ordered XPATH queries into SQL queries over the storage scheme. (c) We propose an optimization technique that forces all mapped SQL queries to be evaluated in a left-to-right join order. By employing these techniques, we show, through comprehensive experiments, that our approach not only scales well but also performs better than some representative tree-unaware approaches on more than 65% of our benchmark queries. In addition, our approach significantly reduces the performance gap between tree-aware and tree-unaware approaches and even outperforms a state-of-the-art tree-aware approach for certain benchmark queries.

1 Introduction

With the rapid emergence of XML as the de facto standard for exchanging data on the Web, the interest in efficiently querying growing XML data sources has increased. One of the salient features of XML data is that it is order-sensitive. Supporting an ordered data model of XML as well as ordered XML queries, ordered XPATH axes and position predicates in particular, has been the key to successful XML applications, e.g., [1]. In this paper, we present a novel approach to efficiently evaluate ordered XPATH queries in a relational database.
Current approaches for evaluating XPATH expressions in relational databases can arguably be categorized into two representative types. They either encode XML data as tables and translate XML queries into relational queries [3, 4, 5, 6, 10, 11, 15], or store XML data as a rich data type and process XML queries by enhancing the relational infrastructure [9]. The former approach can further be classified into two representative types. Firstly, a host of work on processing XPATH queries on tree-unaware relational databases has been reported [5, 10, 11]; these approaches do not modify the database kernels. Secondly, there have been several efforts on enabling relational databases to be tree-aware by invading the database kernel to implement XML support [3, 4, 6, 15]. It has been shown that the latter approaches appear scalable and, in particular, perform orders of magnitude faster than some tree-unaware approaches [3, 6].

In this paper, we focus on supporting ordered XPATH evaluation in a tree-unaware relational environment. Such an approach offers considerable benefits with respect to portability and ease of implementation on top of an off-the-shelf RDBMS. Although a diverse set of strategies for evaluating XML queries in a tree-unaware relational environment has recently been proposed, few have undertaken a comprehensive study on evaluating ordered XPATH queries. Tatarinov et al. [1] were the first to show that it is indeed possible to support ordered

XPATH queries in relational databases. However, this approach does not scale well with large XML documents. In fact, as we shall show in Section 7, the GLOBAL-ORDER approach in [1] failed to return results for 0% of our benchmark queries on the 1GB dataset within 60 minutes. Furthermore, this approach resorts to manual tuning of the relational optimizer when it fails to produce good query plans. Although such manual tuning works, it is a cumbersome solution.

In this paper, we address the above limitations by proposing a novel scheme for ordered XPATH query processing. Our storage strategy is built on top of SUCXENT++ [10], which we extend to support efficient processing of ordered axes and predicates. SUCXENT++ is designed primarily for query-mostly workloads. We exploit SUCXENT++'s strategy of storing leaf nodes, their corresponding data values, auxiliary encodings and root-to-leaf paths. In contrast, some approaches, e.g., [6, 15], explicitly store information for all nodes of an XML document. Specifically, the novelties of our storage scheme are as follows. (1) For each level of an XML document, we store an attribute called RValue, which is an enhancement of the original RValue proposed in [10] for processing recursive XPATH queries. (2) For each leaf node, we store three additional attributes, namely BranchOrder, DeweyOrderSum and SiblingSum. These attributes are the foundation of our ordered XPATH processing. Their key features are that they enable us (a) to compare the order between non-leaf nodes by comparing the order between their first descendant leaf nodes only; and (b) to determine the nearest common ancestor of two leaf nodes efficiently. As a result, it is not necessary to store the order information of non-leaf nodes. Furthermore, given any pair of nodes, these attributes enable us to evaluate position-based predicates efficiently.
As highlighted in [1], relational optimizers may sometimes produce poor query plans for processing XPATH queries. In this paper, we undertake a novel strategy to address this issue. As opposed to manual tuning efforts, we propose an automatic approach that forces the optimizer to replace previously generated poor plans with plans that are likely to be better, as verified by our experiments. Unlike tree-aware schemes, our technique is non-invasive in nature. That is, it can easily be incorporated without modifying the internals of relational optimizers. Specifically, we force a relational optimizer to follow a left-to-right join order and force the relational engine to evaluate the mapped SQL queries according to the XPATH steps specified in the query. This technique selects better plans for the majority of our benchmark queries across all benchmark datasets. As we shall see in Section 7, the performance of previously inefficient queries in SUCXENT++ is significantly improved. The highest observed gain factor is 59. Furthermore, queries that failed to finish in 60 minutes are able to do so in the presence of such join-order enforcement. This is encouraging, as it shows that some sophisticated internals of relational optimizers are not only irrelevant to XPATH processing but often also confuse XPATH query optimization in relational databases. Overall, a join-order-conscious SUCXENT++ significantly outperforms both GLOBAL-ORDER and SHARED-INLINING [11] in at least 65% of the benchmark queries, with the highest observed gain factors being 1939 and 880, respectively. To the best of our knowledge, this is the first effort to exploit a non-invasive automatic technique to improve query performance in the context of XPATH evaluation in a relational environment.
Recently, [3] showed that MONETDB is among the most efficient and scalable tree-aware relational-based XQuery processors and significantly outperforms the current generation of XQuery systems. Consequently, we investigated how our proposed technique compares to MONETDB. Our study revealed some interesting results. First, although MONETDB is and 3-74 times faster than GLOBAL-ORDER and SHARED-INLINING, respectively, for the majority of the benchmark queries, this performance gap is significantly reduced when MONETDB is compared to SUCXENT++. Our results show that not only MONETDB is now times faster than SUCXENT++ with join-order enforcement, but, surprisingly, our approach is faster than MONETDB for 33% of the benchmark queries. Additionally, MONETDB (Win32 builds) failed to shred the 1GB dataset as it is vulnerable to virtual memory fragmentation in the Windows environment. Note that this is contrary to the results in [3], where MONETDB was built on top of the Linux 2.6.11 operating system (8GB RAM), using a 64-bit address space, and was able to efficiently shred an 11GB dataset.

In summary, the main contributions of this paper are as follows. In Section 4, we describe our novel schema-oblivious relational storage scheme for XML. In Section 5, we present how ordered XPATH queries are supported in our proposed storage scheme. In Section 6, we propose a novel left-to-right join order-based technique to

improve query plan selection of relational query optimizers. To the best of our knowledge, this is the first effort to exploit such a non-invasive automatic technique to improve query performance in the context of XPATH processing in a relational environment. Through an extensive experimental study in Section 7, we show that our approach significantly outperforms existing tree-unaware approaches for ordered XPATH queries. Additionally, our approach significantly reduces the performance gap between tree-aware and tree-unaware approaches and even outperforms a state-of-the-art tree-aware approach for certain benchmark queries.

(a) Original Schema of SUCXENT++
  Document (DocId, Name)
  Path (PathId, PathExp)
  PathValue (DocId, PathId, LeafOrder, BranchOrder, BranchOrderSum, LeafValue)
  DocumentRValue (DocId, Level, RValue)
(b) Modified Schema of SUCXENT++
  Document (DocId, Name)
  Path (PathId, PathExp)
  PathValue (DocId, DeweyOrderSum, PathId, BranchOrder, LeafOrder, SiblingSum, LeafValue)
  Attribute (DocId, LeafOrder, PathId, LeafValue)
  DocumentRValue (DocId, Level, RValue)
(c) XML Data: example catalog tree (book, title, chapter and para nodes) annotated with the local order of each node, the DeweyOrderSum $D_i$ of each leaf node, and the preceding, descendant-or-self and following regions of a context node
Figure 1: SUCXENT++ schema and example of XML data

2 Related Work

Most of the previous tree-unaware approaches, except [1], focused on proposing efficient evaluation for children and descendant-or-self axes and positional predicates in XPATH queries. In this paper, the main focus is on the evaluation of following, preceding, following-sibling, and preceding-sibling axes as well as position-based and range predicates. All previous approaches reported query performance on small or medium XML documents smaller than 500 MB.
We investigate query performance on large synthetic and real datasets. This gives insights into the scalability of the state-of-the-art tree-unaware approaches for ordered XML processing. Compared to the tree-aware schemes [3, 4, 6, 15], our technique is tree-unaware in the sense that it can be built on top of any commercial RDBMS without modifying the database kernel. The approaches in [4, 15] do not provide a systematic and comprehensive effort for processing ordered XPATH queries. Although the schemes presented in [3, 4, 6] can support ordered axes, no comprehensive performance study has been demonstrated with a variety of ordered XPATH queries. Furthermore, these approaches did not exploit the left-to-right join order technique to improve query plan selection.

In [1], Tatarinov et al. proposed the first solution for supporting ordered XML query processing in a relational database. A modified EDGE table [5] was the underlying storage scheme. They described three order encoding methods: global, local, and dewey encodings. The best query performance was achieved with the global encoding for query-mostly workloads and with the dewey encoding for a mix of queries and updates. Our focus differs from the above approach in the following ways. First, we focus on query-mostly workloads. Second, we consider a novel order-conscious storage scheme that is more space- and query-efficient and scalable when compared to the global encoding.

3 Background on SUCXENT++

Our approach for ordered XPATH processing relies on the SUCXENT++ approach [10]. We begin our discussion by briefly reviewing the storage scheme of SUCXENT++. Foremost, in the rest of the paper, we always assume

document order in our discussions.

Figure 2: XML data in RDBMS (contents of the Path, PathValue, DocumentRValue and Attribute tables for the example document)

The SUCXENT++ schema is shown in Figure 1(a). Document stores the document identifier DocId and the name Name of a given input XML document T. We associate each distinct (root-to-leaf) path appearing in T, namely PathExp, with an identifier PathId and store this information in the Path table. For each leaf node n in T, we create a tuple in the PathValue table. We now elaborate on the meaning of the attributes of this relation.

Given two leaf nodes $n_1$ and $n_2$, $n_1.\mathit{LeafOrder} < n_2.\mathit{LeafOrder}$ iff $n_1$ precedes $n_2$. LeafOrder of the first leaf node in T is 1, and $n_2.\mathit{LeafOrder} = n_1.\mathit{LeafOrder} + 1$ iff $n_1$ is the leaf node immediately preceding $n_2$. Given two leaf nodes $n_1$ and $n_2$ where $n_1.\mathit{LeafOrder} + 1 = n_2.\mathit{LeafOrder}$, $n_2.\mathit{BranchOrder}$ is the level of the nearest common ancestor of $n_1$ and $n_2$. That is, $n_1$ and $n_2$ intersect at the BranchOrder level. The data value of n is stored in n.LeafValue.

To discuss BranchOrderSum and RValue, we introduce some auxiliary definitions. Consider a sequence of leaf nodes $C = \langle n_1, n_2, n_3, \ldots, n_r \rangle$ in T. Then, C is a k-consecutive leaf nodes of T iff (a) $n_i.\mathit{BranchOrder} \ge k$ for all $i \in [1, r]$; (b) if $n_1.\mathit{LeafOrder} > 1$, then $n_0.\mathit{BranchOrder} < k$ where $n_0.\mathit{LeafOrder} + 1 = n_1.\mathit{LeafOrder}$; and (c) if $n_r$ is not the last leaf node in T, then $n_{r+1}.\mathit{BranchOrder} < k$ where $n_r.\mathit{LeafOrder} + 1 = n_{r+1}.\mathit{LeafOrder}$. A sequence C is called a maximal k-consecutive leaf nodes of T, denoted as $M_k$, if there does not exist a k-consecutive leaf nodes $C'$ with $|C| < |C'|$. Let $L_{max}$ be the largest level of T. Then, RValue of level l, denoted as $R_l$, is 1 if $l = L_{max}$. Otherwise, $R_l = R_{l+1} \times |M_l| + 1$. Now we are ready to define the BranchOrderSum attribute.
Let N be the set of leaf nodes preceding a leaf node n. Then n.BranchOrderSum is 0 if n.LeafOrder = 1 and $\sum_{m \in N} R_{m.\mathit{BranchOrder}}$ otherwise. Based on the definitions above, Prakash et al. [10] defined Property 1 (below), which is essential to determine ancestor-descendant relationships efficiently.

Property 1 Given two leaf nodes $n_1$ and $n_2$, $|n_1.\mathit{BranchOrderSum} - n_2.\mathit{BranchOrderSum}| < R_l$ implies that the nearest common ancestor of $n_1$ and $n_2$ is at a level greater than $l$.

4 Extensions of SUCXENT++

To support ordered XML queries, the order information of nodes must be captured in the XML storage scheme. Unfortunately, the LeafOrder and BranchOrderSum attributes only encode the global order of all leaf nodes. Since (order) information of non-leaf nodes is not explicitly stored, it must be derived from the attributes of leaf nodes. We now present how the original SUCXENT++ schema is extended to process ordered XPATH queries efficiently. First, we move information related to attribute nodes from the PathValue table to a new Attribute table. Second, we introduce new attributes to encode the relative order between (both non-leaf and leaf) nodes. Specifically, DeweyOrderSum and SiblingSum are introduced to replace BranchOrderSum. Third, the definition of RValue is modified such that RValue and DeweyOrderSum preserve the properties presented in [10]. The modified schema is shown in Figure 1(b), and Figure 2 shows the shredded version of the example XML document.

Algorithm to Compute DeweyOrderSum
01 for each non-attribute leaf node n_i arranged in document order {
02   if (n_i.BranchOrder > 0) {
03     n_i.DeweyOrderSum = dsum[n_i.BranchOrder + 1] + ModRValue(n_i.BranchOrder);
04     for (level = n_i.BranchOrder + 1; level <= maxdepthofxmldoc; level++)
05       dsum[level] = n_i.DeweyOrderSum;
06   }
07   else { // initialization for the first leaf node
08     n_i.DeweyOrderSum = 0;
09     for (level = 1; level <= maxdepthofxmldoc; level++)
10       dsum[level] = 0;
11   }
12 }
Figure 3: Algorithm to compute DeweyOrderSum

4.1 Attribute Table

The PathValue table originally stored information related to both element and attribute nodes. However, to avoid mixing the order of element and attribute nodes, we separate the attribute nodes into the Attribute table. The Attribute table consists of the following columns: DocId, LeafOrder, PathId, LeafValue. As we shall see later, a non-leaf node can be represented by its first descendant leaf node. Therefore, an attribute node is identified by the DocId and LeafOrder of its parent node together with its PathId.

4.2 Modified RValue Attribute

Conceptually, RValue is used to encode the level of the nearest common ancestor of any pair of leaf nodes. To ensure that a property like Property 1 holds after the modifications, we intuitively magnify the gap between RValues, as shown in Definition 1. Relative order information is then captured in these gaps.

Definition 1 [ModifiedRValue] Let $L_{max}$ be the largest level of an XML tree T. ModifiedRValue of level l, denoted as $R'_l$, is defined as follows: (i) if $l = L_{max} - 1$ then $R'_l = 1$; (ii) if $0 < l < L_{max} - 1$ then $R'_l = 2R'_{l+1} \times |M_{l+1}| + 1$.

For example, consider the XML tree shown in Figure 1(c), where $L_{max} = 4$. The values of $|M_1|$, $|M_2|$, and $|M_3|$ are 6, 3, and 1, respectively. Then, $R'_3 = 1$, $R'_2 = 2 \times 1 \times |M_3| + 1 = 3$, and $R'_1 = 2 \times 3 \times |M_2| + 1 = 19$. To ensure that the evaluation of queries other than ordered XPATH queries is not affected by the above modifications, the RValue attribute in DocumentRValue stores $(R'_l - 1)/2 + 1$ instead of $R'_l$.
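The arithmetic of the ModifiedRValue example can be replayed with a minimal Python sketch. It assumes the recurrence $R'_{L_{max}-1} = 1$ and $R'_l = 2R'_{l+1} \times |M_{l+1}| + 1$, which reproduces the values 1, 3 and 19 used throughout the running example; the function name `modified_rvalues` and the dictionary layout are illustrative assumptions, not part of SUCXENT++.

```python
def modified_rvalues(m_sizes, l_max):
    """Return {level: R'_level} for levels 1 .. l_max - 1.

    m_sizes[k] is |M_k|, the length of the maximal k-consecutive leaf run.
    Assumed recurrence (see lead-in): R'_(l_max-1) = 1 and
    R'_l = 2 * R'_(l+1) * |M_(l+1)| + 1 for 0 < l < l_max - 1.
    """
    r = {l_max - 1: 1}
    for level in range(l_max - 2, 0, -1):
        r[level] = 2 * r[level + 1] * m_sizes[level + 1] + 1
    return r


# Values from the running example: |M_1| = 6, |M_2| = 3, |M_3| = 1, L_max = 4.
example = modified_rvalues({1: 6, 2: 3, 3: 1}, l_max=4)
print(example)  # {3: 1, 2: 3, 1: 19}
```

Every value produced this way is odd, so the quantity $(R'_l - 1)/2$ that appears in the later range checks is always an integer.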
4.3 DeweyOrderSum and SiblingSum Attributes

Next, we define the first attribute related to ordered XPATH processing. Consider the path query /catalog/book[1]/chapter[1] and Figure 1(c). Since only leaf nodes are stored in the PathValue table, the new attribute DeweyOrderSum of leaf nodes captures the order information of the non-leaf nodes. At first glance, a simple representation of the order information could be a Dewey path, such as the Dewey path of the first chapter node of the first book node. However, using such Dewey paths has two major drawbacks. Firstly, string matching of Dewey paths can be computationally expensive. Secondly, simple lexicographical comparisons of two Dewey paths may not always be accurate [1]. Comparing 1. and 10. in lexicographical order will indicate that 10. appears before 1. [1]. Hence, we define DeweyOrderSum for this purpose:

Definition 2 [DeweyOrderSum] Consider an XML document T and a leaf node n at level l in T. Ord(n, k) = i iff a is either an ancestor of n or n itself; k is the level of a; and a is the i-th child of its parent. DeweyOrderSum of n, n.DeweyOrderSum, is defined as $\sum_{j=2}^{l} \Phi(j)$ where $\Phi(j) = [\mathit{Ord}(n, j) - 1] \times R'_{j-1}$.
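As an illustration of Definition 2, the following Python sketch computes DeweyOrderSum in two ways: directly from a Dewey path (the list of local orders Ord(n, j) from the root down), and incrementally from BranchOrder values in the manner of the algorithm in Figure 3. The ModifiedRValue table {1: 19, 2: 3, 3: 1} comes from the running example; the BranchOrder sequence of the seven leaf nodes and all function names are assumptions made for this sketch.

```python
MOD_RVALUE = {1: 19, 2: 3, 3: 1}  # R'_1, R'_2, R'_3 of the running example


def dewey_order_sum(dewey_path, mod_rvalue=MOD_RVALUE):
    """Definition-based: sum over levels j >= 2 of (Ord(n, j) - 1) * R'_(j-1)."""
    total = 0
    for level, local_order in enumerate(dewey_path, start=1):
        if level >= 2:
            total += (local_order - 1) * mod_rvalue[level - 1]
    return total


def dewey_order_sums_from_branch_orders(branch_orders, max_depth,
                                        mod_rvalue=MOD_RVALUE):
    """Incremental variant: one pass over the leaves in document order."""
    dsum = [0] * (max_depth + 1)  # dsum[level]; index 0 is unused
    result = []
    for b in branch_orders:
        if b > 0:
            # The ancestor at level b + 1 moves to its next sibling,
            # which adds exactly one R'_b to the running sum.
            value = dsum[b + 1] + mod_rvalue[b]
            for level in range(b + 1, max_depth + 1):
                dsum[level] = value
        else:
            # First leaf node: everything starts at zero.
            value = 0
            for level in range(1, max_depth + 1):
                dsum[level] = 0
        result.append(value)
    return result


print(dewey_order_sum([1, 1, 2, 1]))  # first para of the first book: 3
print(dewey_order_sum([1, 2, 2]))     # node with Dewey path 1.2.2: 22
# Assumed BranchOrder values of the seven leaves of the example tree.
print(dewey_order_sums_from_branch_orders([0, 2, 3, 2, 1, 2, 1], max_depth=4))
# [0, 3, 4, 6, 19, 22, 38]
```

Both routes yield plain integers, so node order can be compared with ordinary integer comparisons instead of string operations on Dewey paths.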

Algorithm to Compute SiblingSum
01 for each non-attribute leaf node n_i arranged in document order {
02   if (n_i.BranchOrder > 0) {
03     sord = SiblingOrder.Order(n_i.BranchOrder + 1, elementname[n_i.BranchOrder + 1]);
       // SiblingOrder.Order(level, elementname) keeps track of the same-tag-sibling order
       // for a node, given the level and the element name for that level
04     n_i.SiblingSum = ssum[n_i.BranchOrder] + (sord - 1) * ModRValue(n_i.BranchOrder);
05     for (level = n_i.BranchOrder + 1; level <= maxdepthofxmldoc; level++)
06       ssum[level] = n_i.SiblingSum;
07     for (level = n_i.BranchOrder + 2; level <= n_i.depth; level++)
08       SiblingOrder.Order(level, elementname[level]);
09   }
10   else { // initialization for the first leaf node
11     n_i.SiblingSum = 0;
12     for (level = 1; level <= maxdepthofxmldoc; level++)
13       ssum[level] = 0;
14     for (level = 1; level <= n_i.depth; level++)
15       SiblingOrder.Order(level, elementname[level]);
16   }
17 }
Figure 4: Algorithm to compute SiblingSum

For example, consider the rightmost chapter node in Figure 1(c), which has the Dewey path 1.2.2. Using the ModifiedRValue values derived previously, the DeweyOrderSum of this node can be calculated as follows: $n.\mathit{DeweyOrderSum} = (\mathit{Ord}(n, 2) - 1) \times R'_1 + (\mathit{Ord}(n, 3) - 1) \times R'_2 = 1 \times 19 + 1 \times 3 = 22$.

Figure 3 shows the algorithm to derive DeweyOrderSum during document shredding. dsum is an array that stores the DeweyOrderSum for each level with respect to the current node $n_i$. Since $n_i$.BranchOrder is the level of the nearest common ancestor of the current node $n_i$ and the previous node, the local order of $n_i$ at level BranchOrder + 1 is increased by 1 while the local order of $n_i$ at level BranchOrder remains unchanged. Therefore, $n_i$.DeweyOrderSum equals dsum[BranchOrder + 1] + ModifiedRValue(BranchOrder) (line 03). dsum is also updated accordingly (lines 04-05). Note that DeweyOrderSum is not sufficient to compute position-based predicates with QName name tests, e.g., chapter[2].
Hence, the SiblingSum attribute is introduced to the PathValue table.

Definition 3 [SiblingSum] Consider an XML document T and a leaf node n at level l in T. Sibling(n, k) = i iff a is either an ancestor of n or n itself; k is the level of a; and a is the i-th τ-child of its parent (τ is the tag name of a). SiblingSum of n, n.SiblingSum, is $\sum_{j=2}^{l} \Psi(j)$ where $\Psi(j) = [\mathit{Sibling}(n, j) - 1] \times R'_{j-1}$.

SiblingSum encodes the local order of nodes that have the same tag name as n, namely the same-tag-sibling order. For example, consider the children of the first book element in Figure 1(c). The local orders of title and the first and second chapter nodes are 1, 2 and 3, respectively. On the other hand, the same-tag-sibling orders of these nodes are 1, 1 and 2, respectively. The algorithm to compute SiblingSum is shown in Figure 4. SiblingOrder.Order(level, elementname) is used to calculate Sibling($n_i$, k) for the current node $n_i$ where k = level and τ = elementname. $n_i$.SiblingSum equals ssum[BranchOrder] + [Sibling($n_i$, $n_i$.BranchOrder + 1) - 1] × ModifiedRValue(BranchOrder) (line 04).

4.4 Preservation of SUCXENT++'s Features

The above modifications do not adversely affect the document reconstruction process or the efficient evaluation of non-ordered XPATH queries discussed in [10]. Recall that, given a pair of leaf nodes, Property 1 was used in [10] to efficiently determine the nearest common ancestor of the nodes. Since we have modified the definition of RValue and replaced the BranchOrderSum attribute with the DeweyOrderSum attribute, this property is not applicable to the extended SUCXENT++ scheme. It is necessary to ensure that a corresponding property holds in the extended system.

LEMMA 1 $\sum_{j=k}^{l} \Phi(j) \le (R'_{k-2} - 1)/2$, where $\Phi(j) = [\mathit{Ord}(n, j) - 1] \times R'_{j-1}$, $k \in (2, l]$ and n is a leaf node in an XML document at level l.

Based on the above lemma, it is straightforward to show that $\sum_{j=k}^{l} \Phi(j) < (R'_{k-2} - 1)/2 + 1$.

Theorem 1 Let $n_1$ and $n_2$ be two leaf nodes in an XML document. If $(R'_{l+1} - 1)/2 + 1 \le |n_1.\mathit{DeweyOrderSum} - n_2.\mathit{DeweyOrderSum}| < (R'_l - 1)/2 + 1$ then the level of the nearest common ancestor of $n_1$ and $n_2$ is $l + 1$.

For example, consider the second leaf node in Figure 1(c). DeweyOrderSum of this node is 3. Let $D_1$ be the DeweyOrderSum of the leaf nodes whose nearest common ancestor with this node is at level 2. Using the above theorem, $D_1$ falls within the following range: $(R'_2 - 1)/2 + 1 \le |D_1 - 3| < (R'_1 - 1)/2 + 1$, that is, $2 \le |D_1 - 3| < 10$, which returns the first and fourth leaf nodes (DeweyOrderSum = 0 and 6, respectively). Let $D_2$ be the DeweyOrderSum of the leaf nodes whose nearest common ancestor with this node is at level 3. $D_2$ falls within the following range: $(R'_3 - 1)/2 + 1 \le |D_2 - 3| < (R'_2 - 1)/2 + 1$, that is, $1 \le |D_2 - 3| < 2$, which returns the third leaf node (DeweyOrderSum = 4). Now suppose we want the leaf nodes whose nearest common ancestor with this node is at level 2 or deeper, and let $D_3$ be the DeweyOrderSum of these nodes. $D_3$ falls within the following range: $|D_3 - 3| < (R'_1 - 1)/2 + 1$, that is, $|D_3 - 3| < 10$, which returns the first four leaf nodes.

We illustrate Theorem 1 further with an XQUERY example. Consider the following XQUERY on the XML tree in Figure 1(c).

  FOR $b IN document("catalog")/catalog/book[1]
  RETURN $b/chapter/para

Let $D_a$ be the DeweyOrderSum of the first leaf node satisfying /catalog/book[1], which is the first title node. The RETURN clause implies the path /catalog/book/chapter/para. In this particular case, the nearest common ancestor of the para nodes and the book node is at level 2. This implies that the nearest common ancestor of the para nodes and the first leaf node representing the book node is at level 2 or deeper. Let $D_b$ be the DeweyOrderSum of the resulting nodes that satisfy both paths. Since the level of intersection is 2 or deeper, $|D_b - D_a| < (R'_1 - 1)/2 + 1$. From Figure 1(c), $D_a = 0$ and $R'_1 = 19$. Hence, nodes in the query result set must satisfy the inequality: $|D_b - 0| < (19 - 1)/2 + 1$.
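The range checks used in these examples can be sketched in Python as follows. The sketch assumes that the thresholds of Theorem 1 have the form $(R'_l - 1)/2 + 1$, which matches the bounds 1, 2 and 10 of the worked example, and uses the ModifiedRValue values of the running example. Returning level 1 when the difference exceeds every threshold is an extra assumption of this sketch (only the root can then be shared); `threshold` and `nca_level` are hypothetical helper names.

```python
MOD_RVALUE = {1: 19, 2: 3, 3: 1}  # R'_1, R'_2, R'_3 of the running example
L_MAX = 4                         # largest level of the example tree


def threshold(level):
    """(R'_level - 1) / 2 + 1; exact because every R' is odd."""
    return (MOD_RVALUE[level] - 1) // 2 + 1


def nca_level(d1, d2):
    """Level of the nearest common ancestor of two distinct leaf nodes,
    given only their DeweyOrderSum values (Theorem 1)."""
    diff = abs(d1 - d2)
    for l in range(1, L_MAX - 1):
        if threshold(l + 1) <= diff < threshold(l):
            return l + 1
    # Assumption of this sketch: beyond all thresholds only the root is shared.
    return 1 if diff >= threshold(1) else None


print(nca_level(3, 0))   # 2: second and first leaf meet at the book node
print(nca_level(3, 4))   # 3: second and third leaf meet at the chapter node
print(nca_level(3, 19))  # 1: leaves of different books meet at the root
```

Because each check is a comparison of an absolute difference against a per-level constant, it maps naturally onto a predicate over the DeweyOrderSum column.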
Note that the DeweyOrderSum values of the second and third leaf nodes (para nodes) are 3 and 4. Since $|3 - 0| < 10$ and $|4 - 0| < 10$, these nodes satisfy the above query, whereas the fifth leaf node, whose DeweyOrderSum is 19, does not. The proofs of the lemma, theorems and propositions are given in Appendix A.

5 Ordered XPath Processing

This section describes how ordered XPATH queries are supported by the modified schema. First, we propose a method of node order comparison in the absence of non-leaf nodes. Next, we show in detail how ordered XPATH queries are supported. Finally, we present an algorithm for translating ordered XPATH queries into SQL.

5.1 Non-leaf Node Order Comparison

Our strategy for comparing the order of non-leaf nodes is based on the following observation. If node $n_0$ precedes (resp. follows) another node $n_1$, then the descendants of $n_0$ must also precede (resp. follow) the descendants of $n_1$. Therefore, instead of comparing the order between non-leaf nodes, we compare the order between their descendant leaf nodes. For this reason, we define the representative leaf node of a non-leaf node n to be its first descendant leaf node.

Note that the BranchOrder attribute records the level of the nearest common ancestor of two consecutive leaf nodes. Let C be the sequence of descendant leaf nodes of n and $n_1$ be the first node in C. We know that the nearest common ancestor of any two consecutive nodes in C is either n or a descendant of n. This implies that (1) except for $n_1$, BranchOrder of a node in C is at least the level of node n and (2) the nearest common ancestor of $n_1$ and its immediately preceding leaf node is not a descendant of node n. Therefore, BranchOrder of $n_1$ is always smaller than the level of n. We summarize this property in Property 2.

Property 2 Let n be a non-leaf node at level l and $C = \langle n_1, n_2, n_3, \ldots, n_r \rangle$ be the sequence of descendant leaf nodes of n in document order. Then, $n_1.\mathit{BranchOrder} < l$ and $n_i.\mathit{BranchOrder} \ge l$, where $i \in (1, r]$.

Definition 4 [DeweyOrderSum of non-leaf nodes] Let $S = \langle i_1, i_2, i_3, \ldots, i_{r_1} \rangle$ be a sequence of non-leaf sibling nodes of a non-leaf node $i_0$ in document order. Let $C = \langle n_1, n_2, \ldots, n_{r_2} \rangle$ be the sequence of leaf nodes of S, and let $n_{j_2}$ denote the first descendant leaf node of $i_{j_1}$. Then, $i_{j_1}.\mathit{DeweyOrderSum} = n_{j_2}.\mathit{DeweyOrderSum}$.

In the above definition, the DeweyOrderSum of a leaf node is conceptually propagated to its ancestor nodes. Consequently, the following proposition holds.

Proposition 1 Let $C = \langle n_1, n_2, n_3, \ldots, n_r \rangle$ be a sequence of sibling nodes. Consider $n_i$ where $1 < i \le r$ and the level of $n_i$ is l, where $l > 1$. Let m be $n_i$ or a descendant of $n_i$. Then, $n_1.\mathit{DeweyOrderSum} + [\mathit{Ord}(n_i) - \mathit{Ord}(n_1)] \times R'_{l-1} \le m.\mathit{DeweyOrderSum} < n_1.\mathit{DeweyOrderSum} + [(\mathit{Ord}(n_i) - \mathit{Ord}(n_1)) + 1] \times R'_{l-1}$, where $\mathit{Ord}(n_i)$ and $\mathit{Ord}(n_1)$ are the local orders of $n_i$ and $n_1$, respectively.

By using the above proposition, we can compare the order of two non-leaf nodes without evaluating every sibling node in the sequence. Also, since $n_1$ is the first sibling, $\mathit{Ord}(n_1) = 1$. Therefore, based on the above proposition, the following holds: $n_1.\mathit{DeweyOrderSum} + [\mathit{Ord}(n_i) - 1] \times R'_{l-1} \le m.\mathit{DeweyOrderSum} < n_1.\mathit{DeweyOrderSum} + \mathit{Ord}(n_i) \times R'_{l-1}$. Similar propositions for SiblingSum can be established in a straightforward manner.

5.2 Support for Ordered XPath Queries

We now present how various types of ordered XPATH queries are supported by the modified SUCXENT++. Due to space constraints, we only focus on how DeweyOrderSum and ModifiedRValue are used for query processing. A similar technique can be applied to evaluations with SiblingSum.

Position predicates. Position-based predicates, i.e., predicates of the form position()=i, select the node at the i-th position of the sequence of inner focus context nodes.
We propose to compute the i-th node without evaluating every node in the sequence by applying Proposition 1. For example, let $n_1$ be the first book node of the sequence of book nodes (the context nodes) in Figure 1(c). Observe that $n_1.\mathit{DeweyOrderSum} = 0$, as its representative leaf node is the first leaf node of the XML tree. We now employ the inequality in Proposition 1 to select a sibling node, e.g., the second book node $n_2$. Here, $\mathit{Ord}(n_2) = 2$, $l = 2$, $R'_1 = 19$, and $n_1.\mathit{DeweyOrderSum} = 0$. Then, $19 \le n_i.\mathit{DeweyOrderSum} < 38$. The nodes in this range are the descendant leaf nodes of $n_2$. Such simple arithmetic calculations can be efficiently implemented in a relational database. fn:last() can be computed by first determining all sibling nodes that satisfy the specific path and then finding the node with the largest DeweyOrderSum.

Position predicate on child axes. This class of queries can be translated into a child axis followed by a position predicate, in which one must select the i-th child of the context node. Our strategy is to determine the first child of a context node and then the child's i-th sibling node as described above. First, by using Definition 4, we know that if $n_2$ is the first child of $n_1$, then $n_1.\mathit{DeweyOrderSum} = n_2.\mathit{DeweyOrderSum}$. Second, Proposition 1 provides us with a method for selecting the i-th sibling node of a node. Reconsider Figure 1(c) and the XPATH query /catalog/*[2]. The query result is the second child of the catalog node. Suppose $n_0$ is the context node /catalog. Let $n_1$ be the first sibling node in the sequence returned by the expression /catalog/*. Then, $n_0.\mathit{DeweyOrderSum} = n_1.\mathit{DeweyOrderSum}$. Since the first sibling in that sequence is $n_1$ and all siblings of $n_1$ are in that sequence, we can now utilize Proposition 1 to select the leaf nodes of the second node in the context. The range operator, e.g., [position()=2 TO 10], can be handled in a similar fashion.

Following and preceding axes.
The following axis selects all nodes which follow the context node, excluding the descendants of the context node. The preceding axis, on the other hand, selects all nodes which precede

the context node excluding the ancestors of the context node. Similar to position predicates, we summarize a property of DeweyOrderSum to facilitate efficient processing of these axes.

Proposition 2 Let $n_a$ and $n_b$ be two nodes in the XML tree T, where $n_b$ is a context node at level $l_b$ and $l_b > 1$. Then, the following statements hold:
1. $n_a.\mathit{DeweyOrderSum} \ge n_b.\mathit{DeweyOrderSum} + R'_{l_b - 1}$ if and only if $n_a$ follows $n_b$ and is not a descendant of $n_b$;
2. Similarly, $n_a.\mathit{DeweyOrderSum} < n_b.\mathit{DeweyOrderSum}$ if and only if $n_a$ precedes $n_b$ and $n_a$ is neither a descendant nor an ancestor of $n_b$.

Figure 5: Relationship between DeweyOrderSum and RValue. (Number line over DeweyOrderSum marking $D_i - ((R'_{L-2} - 1)/2 + 1)$, $D_i$, $D_i + R'_{L-1}$ and $D_i + ((R'_{L-2} - 1)/2 + 1)$, together with the preceding, preceding-sibling, descendant-or-self, following-sibling and following regions; $D_i$ is the DeweyOrderSum of the context node and $L$ is the level of the context node.)

We illustrate this proposition in Figure 5. Now let us consider the following examples. Suppose that we evaluate the following axis on the first book node $n_b$ in Figure 1(c). Here, $n_b.\mathit{DeweyOrderSum} = 0$, $l = 2$ and $R'_1 = 19$. Let N be the nodes in the result of the evaluation of the following axis. Then, by using Proposition 2, $n \in N$ must satisfy this inequality: $n.\mathit{DeweyOrderSum} \ge 0 + 19 = 19$. Similarly, suppose we evaluate the preceding axis on the last book node $n_b$ in Figure 1(c), for which $n_b.\mathit{DeweyOrderSum} = 38$. Denote the sequence of nodes satisfying the preceding axis by N. Then $n \in N$ must satisfy the following inequality: $n.\mathit{DeweyOrderSum} < 38$. Another example can be found in Figure 1(c).

Note that since SUCXENT++ only stores the leaf nodes, returning the internal nodes and the whole subtree, as required by the following::* and preceding::* axes, requires extra processing, as the resulting XML document may have a very different structure or schema from the original XML document. This is achieved in SUCXENT++ during the result construction phase.
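To make the two inequalities of Proposition 2 concrete, here is a minimal sketch that evaluates them as SQL range predicates through Python's built-in `sqlite3` module. It is not the translation algorithm of the paper: the three-column PathValue table is a cut-down, hypothetical version of the schema in Figure 1(b), and the seven DeweyOrderSum values are assumed leaf values for the running example.

```python
import sqlite3

MOD_RVALUE = {1: 19, 2: 3, 3: 1}             # R'_1, R'_2, R'_3 (running example)
LEAF_DEWEY_SUMS = [0, 3, 4, 6, 19, 22, 38]   # assumed leaf values, document order

conn = sqlite3.connect(":memory:")
# Hypothetical, reduced PathValue table: only the columns this sketch needs.
conn.execute(
    "CREATE TABLE PathValue (DocId INTEGER, LeafOrder INTEGER, DeweyOrderSum INTEGER)"
)
conn.executemany(
    "INSERT INTO PathValue VALUES (1, ?, ?)",
    list(enumerate(LEAF_DEWEY_SUMS, start=1)),
)


def following_leaves(ctx_dsum, ctx_level):
    """Leaves with DeweyOrderSum >= ctx_dsum + R'_(ctx_level - 1)."""
    bound = ctx_dsum + MOD_RVALUE[ctx_level - 1]
    rows = conn.execute(
        "SELECT DeweyOrderSum FROM PathValue "
        "WHERE DocId = 1 AND DeweyOrderSum >= ? ORDER BY DeweyOrderSum",
        (bound,),
    )
    return [row[0] for row in rows]


def preceding_leaves(ctx_dsum):
    """Leaves with DeweyOrderSum < ctx_dsum."""
    rows = conn.execute(
        "SELECT DeweyOrderSum FROM PathValue "
        "WHERE DocId = 1 AND DeweyOrderSum < ? ORDER BY DeweyOrderSum",
        (ctx_dsum,),
    )
    return [row[0] for row in rows]


print(following_leaves(0, 2))  # following axis of the first book: [19, 22, 38]
print(preceding_leaves(38))    # preceding axis of the last book: [0, 3, 4, 6, 19, 22]
```

Each axis reduces to a single comparison on one integer column, which is the kind of predicate a relational engine can answer with an index range scan.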
Observe that if the query uses a QName as the name test (for example, following::title) and the path from the root to the QName element is unique, then no extra processing is required. If the level of the QName is greater than the level of the context node (e.g., /catalog/book[2]/preceding::title in Figure 1(c)), then we can use Proposition 2 and PathId to return the resulting nodes. The same applies if the level of the QName equals the level of the context node (e.g., /catalog/*[1]/following::book in Figure 1(c)). However, if the level of the QName is less than the level of the context node (e.g., /catalog/*[1]/chapter/para/following::chapter in Figure 1(c)), we need to use Theorem 1 to exclude the nodes that have a common ancestor at the QName level or deeper. These three cases are illustrated in Figure 6.

Figure 6: following::QName. (a) Case 1: QName level ($L_Q$) > context node level ($L_C$). (b) Case 2: QName level ($L_Q$) = context node level ($L_C$). (c) Case 3: QName level ($L_Q$) < context node level ($L_C$). Each panel shows the nodes selected by preceding::QName and following::QName, and distinguishes the context node from the leaf node that represents it.

Following-sibling and preceding-sibling axes. The following-sibling axis selects the children of the context node's parent that occur after the context node in document order, whereas the preceding-sibling axis selects the children of the context node's parent that occur before the context node in document order. Support for the following-sibling (resp. preceding-sibling) axis can be achieved with an additional constraint on the following (resp. preceding) axis: the selected nodes must be siblings of the context node.

Proposition 3 Let $n_a$ and $n_b$ be two nodes in the XML tree $T$, where $n_b$ is the context node at level $l_b$ and $l_b > 2$. Then the following statements hold:

1. $n_b.\mathrm{DeweyOrderSum} + R_{l_b - 1} \le n_a.\mathrm{DeweyOrderSum} < n_b.\mathrm{DeweyOrderSum} + (R_{l_b - 2} - 1)/2 + 1$ if and only if $n_a$ is a sibling of $n_b$ and $n_a$ follows $n_b$.

2. $n_b.\mathrm{DeweyOrderSum} - (R_{l_b - 2} - 1)/2 - 1 < n_a.\mathrm{DeweyOrderSum} < n_b.\mathrm{DeweyOrderSum}$ if and only if $n_a$ is a sibling of $n_b$ and $n_a$ precedes $n_b$.

The above proposition is illustrated in Figure 5. Now let us consider the following examples. Suppose we evaluate the following-sibling axis on the first title node $n_t$ in Figure 1(c). Here $n_t.\mathrm{DeweyOrderSum} = 0$, $l = 3$, $R_1 = 19$, and $R_2 = 3$. Let $N$ denote the nodes reachable via the following-sibling axis from $n_t$. Using Proposition 3, $0 + 3 \le n_k.\mathrm{DeweyOrderSum} < 0 + (19 - 1)/2 + 1$ where $n_k \in N$. That is, $3 \le n_k.\mathrm{DeweyOrderSum} < 10$. Hence, the second (DeweyOrderSum = 3) and the third (DeweyOrderSum = 6) chapter nodes are in this range.

Now, suppose we evaluate the preceding-sibling axis at the last chapter node $n_c$ in Figure 1(c). Here $n_c.\mathrm{DeweyOrderSum} = 22$. Let $N$ be the nodes which satisfy the preceding-sibling axis. Therefore, $22 - (19 - 1)/2 - 1 < n_r.\mathrm{DeweyOrderSum} < 22$ where $n_r \in N$. That is, $12 < n_r.\mathrm{DeweyOrderSum} < 22$. Hence, the chapter node with DeweyOrderSum = 19 satisfies this bound. Another example can be found in Figure 1(c).
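The sibling ranges of Proposition 3 and the two worked examples above can be checked with a short Python sketch. The function names are hypothetical, the sample list of DeweyOrderSum values is an assumption built from the values quoted in the text, and integer division is used on the assumption that $R_{l_b - 2}$ is odd (as with $R_1 = 19$).

```python
def following_sibling_range(d_b, r_lm1, r_lm2):
    """Proposition 3(1): half-open range [lo, hi) of DeweyOrderSum values
    of the following siblings of the context node.

    d_b   -- DeweyOrderSum of the context node
    r_lm1 -- R_{l_b - 1}
    r_lm2 -- R_{l_b - 2}
    """
    return d_b + r_lm1, d_b + (r_lm2 - 1) // 2 + 1


def preceding_sibling_range(d_b, r_lm2):
    """Proposition 3(2): open range (lo, hi) of DeweyOrderSum values
    of the preceding siblings of the context node."""
    return d_b - (r_lm2 - 1) // 2 - 1, d_b


# Assumed sample of leaf DeweyOrderSum values (hypothetical).
leaf_sums = [0, 3, 6, 19, 22, 38]
R1, R2 = 19, 3  # values quoted in the text

# following-sibling on the first title node (DeweyOrderSum = 0, level 3)
lo, hi = following_sibling_range(0, R2, R1)
following_siblings = [d for d in leaf_sums if lo <= d < hi]
print((lo, hi), following_siblings)  # (3, 10) [3, 6]

# preceding-sibling on the last chapter node (DeweyOrderSum = 22, level 3)
lo, hi = preceding_sibling_range(22, R1)
preceding_siblings = [d for d in leaf_sums if lo < d < hi]
print((lo, hi), preceding_siblings)  # (12, 22) [19]
```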

processpathexpr (XPath)
01 for every step in the XPath {
02   if (step.getaxis() == CHILD and step.haspredicate() == FALSE)
03     currentPath.add(nametest, step.getaxis())
04   else {
05     from_sql.add("PathValue as Vi")
06     if (currentPath.level() > 1) {
07       where_sql.add("Vi.pathid in currentPath.getpathid()")
08       where_sql.add("Vi.branchorder < currentPath.level()")
09     }
10     processaxis(step, currentPath)
11     processpredicate(step, currentPath)
12   }
13   if (step.islast() and currentPath.needupdate()) {
14     from_sql.add("PathValue as Vi")
15     where_sql.add("Vi.pathid in currentPath.getpathid()")
16   }
17 }
18 select_sql.add("Vi.leafvalue, Vi.leafOrder, ...")
19 return select_sql + from_sql + where_sql + where_sql.unionwithattribute()
(a) The processpathexpr Algorithm

processaxis (step, currentPath)
01 switch (step.getaxis()) {
02   child:
03     where_sql.add("Vi.DeweyOrderSum BETWEEN Vi-1.DeweyOrderSum - RValue(currentPath.level() - 1) + 1 AND Vi-1.DeweyOrderSum + RValue(currentPath.level() - 1) - 1 ")
04   following:
05     where_sql.add("Vi.DeweyOrderSum >= Vi-1.DeweyOrderSum + 2 * RValue(currentPath.level()) - 1 ")
06     if (currentPath.QNameLevel(nametest) < currentPath.level())
07       where_sql.add("+ RValue(currentPath.QNameLevel(nametest) - 1) - 1")
08   preceding:
09     where_sql.add("Vi.DeweyOrderSum < Vi-1.DeweyOrderSum ")
10     if (currentPath.QNameLevel(nametest) < currentPath.level())
11       where_sql.add("- RValue(currentPath.QNameLevel(nametest) - 1) + 1")
12   following-sibling:
13     where_sql.add("Vi.DeweyOrderSum BETWEEN Vi-1.DeweyOrderSum + 2 * RValue(currentPath.level()) - 1 AND Vi-1.DeweyOrderSum + RValue(currentPath.level() - 1) - 1 ")
14   preceding-sibling:
15     where_sql.add("Vi.DeweyOrderSum BETWEEN Vi-1.DeweyOrderSum - RValue(currentPath.level() - 1) + 1 AND Vi-1.DeweyOrderSum - 1 ")
16 }
17 currentPath.add(nametest, step.getaxis())
(b) The processaxis Algorithm

Figure 7: Procedure processpathexpr and Procedure processaxis.

processpredicate (step, currentPath)
01 switch (step.getaxis()) {
02   CHILD:
03     n_from = step.getpredicatefrom()
04     n_to = step.getpredicateto()
05   FOLLOWING-SIBLING:
06     n_from = step.getpredicatefrom()
07     n_to = step.getpredicateto()
08   PRECEDING-SIBLING:
09     n_from = - step.getpredicatefrom()
10     n_to = - step.getpredicateto()
11 }
12 switch (step.getpredicatetype()) {
13   position based predicate without name test:
14     where_sql.add("Vi.DeweyOrderSum BETWEEN Vi-1.DeweyOrderSum + n_from * (2 * RValue(currentPath.level()) - 1) AND Vi-1.DeweyOrderSum + n_to * (2 * RValue(currentPath.level()) - 1) - 1 ")
15   position based predicate with name test:
16     where_sql.add("Vi.SiblingSum BETWEEN Vi-1.SiblingSum + n_from * (2 * RValue(currentPath.level()) - 1) AND Vi-1.SiblingSum + n_to * (2 * RValue(currentPath.level()) - 1) - 1 ")
17 }

Figure 8: Procedure processpredicate.

5.3 Ordered XPath Query Translation Algorithm

Based on the properties defined in the previous subsection, we present an algorithm, shown in Figures 7 and 8, for generating SQL from ordered XPATH queries. Our algorithm assumes that an XPATH expression is represented as a sequence of steps, where a step may be associated with predicates. A SQL statement consists of three clauses: select_sql, from_sql and where_sql. We assume that a clause has an add() method which encapsulates some simple string manipulations and simple SUCXENT++ joins for constructing valid SQL statements. In addition to preprocessing PathId as mentioned in [10], for a single XML document, we also preprocess RValue to reduce the number of joins. The translation consists of three main procedures.

processpathexpr (Figure 7(a)): It analyzes the steps of an input XPATH expression (Line 01) and outputs a SQL statement. If the step consists of a child axis only (Lines 02-03), then we simply maintain a global variable currentPath, which records the simple downward path from the root to the context nodes.¹ Otherwise, when the step involves ordered predicates or other axes, we add predicates which select a superset of the next context nodes (Lines 05-09) and then call processaxis and processpredicate (Lines 10-11) with currentPath to obtain the next context nodes. We add the predicate in Line 08 to determine the representative nodes of the context nodes. Finally, we collect the final results (Line 19).

processaxis (Figure 7(b)): This procedure translates a step, together with currentPath, based on the step type (Line 01). Lines 02-03, 04-11 and 12-15 encode Theorem 1, Proposition 2 and Proposition 3, respectively.

¹ The details of maintaining currentPath are simple but lengthy. For simplicity, we omit such discussions.

processpredicate (Figure 8): This procedure mainly translates position predicates. Lines 01-11 determine the range of positions specified by the predicate. Given these, Lines 12-17 implement Proposition 1.

We now illustrate the details of the translation algorithm with five examples covering position-based predicates and ordered axes.

01 WITH V (leafvalue, pathid, branchorder, DeweyOrderSum, DocId, LeafOrder ) AS (
02   SELECT V.leafValue, V.pathID, V.branchOrder, V.DeweyOrderSum, V.DocId, V.LeafOrder
03   FROM PathValue V1, PathValue V
04   WHERE V1.docId = 1
05   AND V1.SiblingSum BETWEEN * ( * 10-1) AND * ( * 10-1)
06   AND V1.pathid in (5,4,3,)
07   AND V1.branchOrder <
08   AND V.docId = V1.docId
09   AND V.DeweyOrderSum BETWEEN V1.DeweyOrderSum AND V1.DeweyOrderSum
10   AND V.pathid in (4,3,)
11   AND V.DeweyOrderSum BETWEEN V1.DeweyOrderSum + 1 * ( * - 1) AND V1.DeweyOrderSum + 3 * ( * - 1)
12 )
13 SELECT V.*, 1 AS Attr
14 FROM V
15 UNION ALL
16 SELECT A.leafValue, A.pathID, V.branchOrder, V.DeweyOrderSum, A.DocId, A.LeafOrder, 0 AS Attr
17 FROM Attribute A, V
18 WHERE A.DocId = V.DocId AND A.LeafOrder = V.LeafOrder
19 AND A.PathId in (0)
20 ORDER BY DocId, DeweyOrderSum, Attr

Figure 9: SQL example: /catalog/book[1]/*[position()=2 to 3]

01 WITH V (leafvalue, pathid, branchorder, DeweyOrderSum, DocId, LeafOrder ) AS (
02   SELECT DISTINCT V.leafValue, V.pathID, V.branchOrder, V.DeweyOrderSum, V.DocId, V.LeafOrder
03   FROM PathValue V1, PathValue V
04   WHERE V1.docId = 1
05   AND V1.pathid in (4,3)
06   AND V1.branchOrder < 3
07   AND V.docId = V1.docId
08   AND V.DeweyOrderSum >= V1.DeweyOrderSum + *
10   AND V.pathid in (5,4,3,)
11 )
12 SELECT V.*, 1 AS Attr
13 FROM V
14 UNION ALL
15 SELECT A.leafValue, A.pathID, V.branchOrder, V.DeweyOrderSum, A.DocId, A.LeafOrder, 0 AS Attr
16 FROM Attribute A, V
17 WHERE A.DocId = V.DocId AND A.LeafOrder = V.LeafOrder
18 AND A.PathId in (1)
19 ORDER BY DocId, DeweyOrderSum, Attr

Figure 10: SQL example: /catalog/book/chapter/following::book
Example 1 [Position-based predicates] Consider the path expression /catalog/book[1]/*[position()=2 to 3]. The translated SQL is shown in Figure 9. /catalog/book[1] is translated to the predicates on V1. Theorem 1 is used to get the children of /catalog/book[1] (Lines 08-10), and /*[position()=2 to 3] is translated to Line 11. The remaining lines union the resulting element nodes with the attribute nodes, and the last line orders the result in document order.

Example 2 [SQL for following axis] Consider the path expression /catalog/book/chapter/following::book. The translated SQL is shown in Figure 10. /catalog/book/chapter is translated to the predicates on V1. Proposition 2 is used to get the following nodes (Line 08) and, since the level of book is higher than the level of /catalog/book/chapter, Theorem 1 is used to exclude the nodes that have a common ancestor at level 2 or deeper (Line 09). PathId is used to return only the book nodes (Line 10).

Example 3 [SQL for preceding axis] Consider the path expression /catalog/book/chapter/preceding::book. The translated SQL is similar to Figure 10 except that the predicate in Line 08 is replaced with V.DeweyOrderSum < V1.DeweyOrderSum due to Proposition 2, and Line 09 is replaced with the corresponding predicate for the preceding axis due to Theorem 1.

01 WITH V (leafvalue, pathid, branchorder, DeweyOrderSum, DocId, LeafOrder ) AS (
02   SELECT DISTINCT V.leafValue, V.pathID, V.branchOrder, V.DeweyOrderSum, V.DocId, V.LeafOrder
03   FROM PathValue V1, PathValue V
04   WHERE V1.docId = 1
05   AND V1.SiblingSum BETWEEN * ( * 10-1) AND 0 + * ( * 10-1)
06   AND V1.pathid in (5,4,3,)
07   AND V1.branchOrder <
08   AND V.docId = V1.docId
09   AND V.DeweyOrderSum >= V1.DeweyOrderSum + *
10   AND V.pathid in (5,4,3,)
11   AND V.DeweyOrderSum BETWEEN V1.DeweyOrderSum + 1 * ( * 10-1) AND V1.DeweyOrderSum + * ( * 10-1)
12 )
13 SELECT V.*, 1 AS Attr
14 FROM V
15 UNION ALL
16 SELECT A.leafValue, A.pathID, V.branchOrder, V.DeweyOrderSum, A.DocId, A.LeafOrder, 0 AS Attr
17 FROM Attribute A, V
18 WHERE A.DocId = V.DocId AND A.LeafOrder = V.LeafOrder
19 AND A.PathId in (1)
20 ORDER BY DocId, DeweyOrderSum, Attr

Figure 11: SQL example: /catalog/book[2]/following-sibling::*[1]

Example 4 [SQL for following-sibling axis] Consider the path expression /catalog/book[2]/following-sibling::*[1]. The translated SQL is shown in Figure 11. /catalog/book[2] is translated to the predicates on V1, and /following-sibling::* to the predicates on V. Since the level of /catalog/book is 2, the translated SQL for following-sibling is similar to that for the following axis (Line 09). *[1] is translated to Line 11.

Example 5 [SQL for preceding-sibling axis] Consider the path expression /catalog/book[2]/preceding-sibling::*[1]. The translated SQL is similar to Figure 11 except that the predicate in Line 09 is replaced with V.DeweyOrderSum < V1.DeweyOrderSum.

6 Join Order Enforcement

Due to the tree-unaware nature of the underlying relational storage scheme, as well as the lack of appropriate XML statistics, relational optimizers may generate inefficient query plans.
In order to address this problem, some approaches have resorted to manual tuning of query plans [1], while others invade the database kernel to make it tree-aware [3, 4]. The former approach is not scalable as it requires significant human intervention, whereas the latter approach may require non-trivial modifications to the internals of an RDBMS. In this section, we propose a simple yet effective technique to generate better query plans automatically without invading the database kernel.

As discussed in Section 5.3, in order to evaluate an (ordered) XPATH query in SUCXENT++, each XPATH axis is translated into a join between the PathValue table and intermediate results (i.e., the context nodes). For example, in Figures 9-11, PathValue V1 returns the representative nodes of the context nodes to calculate PathValue V. Due to the lack of tree awareness, the relational optimizer is not capable of transforming the order of joins intelligently. Consequently, it may generate a poor join order that typically requires caching large intermediate results in the database bufferpool. This is particularly important for nested-loop (NL) joins, where large and deep loops are prohibitive. For example, the first few joins of a right-to-left join order may easily yield a large number of context nodes.

Figure 12: Dataset and Benchmark Queries.

(a) Features of datasets

  ID       Node (total number)   Attribute (total number)   Total       Size (MB)   Max Depth
  DC10     5,34                  15,000                     40,34
  DC100    ,4,00                 150,000                    ,39,00
  DC1000   ,44,61                1,500,000                  3,94,61
  DBLP     8,,945                1,665,930                  9,888,875

(b) Benchmark queries for DC10, DC100, and DC1000

  ID    Query                                                           Res. Card. (10MB)   Res. Card. (100MB)   Res. Card. (1000MB)
  Q1    /catalog/item[1000]
  Q2    /catalog/*[1000]
  Q3    /catalog/item[position()=1000 to 10000]/*[position()=2 to 7]    104,7               66,81                67,00
  Q4    /catalog/item[position()=1000 to 10000]/authors/author          65,161              39,                  ,350
  Q5    /catalog/*[1500]/publisher/following-sibling::*
  Q6    /catalog/*[1500]/publisher/following-sibling::*[5]
  Q7    /catalog/*[1500]/publisher/preceding-sibling::*
  Q8    /catalog/*[1500]/publisher/preceding-sibling::*[2]
  Q9    /catalog/*[x]/following::title                                  50,500              5,000
  Q10   /catalog/*[y]/preceding::title                                  49,499              4,499

  x = 50, 500, 5000 for DC10, DC100, DC1000 respectively; y = 50, 500, 5000 for DC10, DC100, DC1000 respectively.

(c) Benchmark queries for DBLP

  ID   Query                                          Res. Card.
  D1   /dblp/*[100000]/author
  D2   /dblp/article/author[2]                        190,838
  D3   /dblp/*[600000]/pages/preceding-sibling::*     6
  D4   /dblp/*[600000]/pages/following-sibling::*     5

To respond to this, we propose to enforce a left-to-right join order on the translated SQL query. This evaluation order naturally corresponds to the order of XPATH steps specified in the XPATH expression. By employing this technique, the relational optimizer does not explore the large number of permutations of join order. We apply join order enforcement if the translated SQL query involves more than one PathValue relation. In addition, if the PathValue table appears in the SQL query only once, we let the relational optimizer decide the plan for the join between the PathValue table and the Attribute table. The above enforcement can easily be implemented by query hints in commercial databases. In our implementation, we use OPTION(FORCE ORDER) to implement the above technique in SUCXENT++. The strength of this approach lies in the simplicity of implementing it on any commercial RDBMS that supports query hints.
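The join order enforcement rule described above (apply the hint only when more than one PathValue relation is involved) can be sketched as a small post-processing step on the translated SQL string. The function name and the regular-expression count of PathValue occurrences are assumptions made for illustration; the text only states that OPTION(FORCE ORDER) is used, not how the check is implemented.

```python
import re


def enforce_join_order(sql: str) -> str:
    """Append the FORCE ORDER query hint when the translated query
    references the PathValue table more than once.

    Counting occurrences of the word "PathValue" is a crude, assumed proxy
    for "the query involves more than one PathValue relation".
    """
    path_value_refs = len(re.findall(r"\bPathValue\b", sql, flags=re.IGNORECASE))
    if path_value_refs > 1:
        # The OPTION clause goes at the very end of the statement.
        return sql.rstrip().rstrip(";") + "\nOPTION (FORCE ORDER)"
    # A single PathValue relation: leave the plan to the optimizer.
    return sql


two_relations = (
    "SELECT V2.leafValue FROM PathValue V1, PathValue V2 "
    "WHERE V2.docId = V1.docId"
)
one_relation = (
    "SELECT V1.leafValue FROM PathValue V1, Attribute A "
    "WHERE A.DocId = V1.DocId"
)

print(enforce_join_order(two_relations))
print(enforce_join_order(one_relation))
```

Because the FROM clause lists the relations in the order of the XPATH steps, forcing the written order yields the left-to-right evaluation described in this section.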
7 Performance Study

In this section, we present the results of our performance evaluation of our proposed approach, a tree-unaware schema-oblivious approach (GLOBAL-ORDER [1]), a tree-unaware schema-conscious approach (SHARED-INLINING [11]), and a tree-aware approach (MonetDB [3]). Prototypes for modified SUCXENT++ (denoted as SX), SUCXENT++ with join order enforcement (denoted as SX-JO), GLOBAL-ORDER (denoted as GO) and SHARED-INLINING (denoted as SI) were implemented with JDK 1.5. We used the Windows version of MONETDB/XQuery (denoted as MXQ). The experiments were conducted on an Intel Xeon 2GHz machine running Windows XP with 1GB of RAM. The RDBMS used was Microsoft SQL Server 2005 Developer Edition. Note that we did not study the performance of the XML support of SQL Server 2005, as it can only evaluate the first two ordered queries in Figure 12(b).

7.1 Experimental Setup

Data and query sets. In our experiments, the XBENCH [13] dataset was used for synthetic data. Data-centric (DC) documents were considered, with data sizes ranging from 10MB to 1GB. In addition, we used a real dataset, namely DBLP XML [16]. Figure 12(a) shows the characteristics of the datasets used. Two sets of queries were designed to cover different types of ordered XPATH queries. In addition, the cardinality of the results was varied. Figures 12(b) and 12(c) show the benchmark queries on XBENCH and DBLP, respectively. XPATH queries with descendant axes were not included as they had been studied in [10].

Test methodology. The XPATH queries were executed in the reconstruct mode, where not only the non-leaf nodes but also all their descendants were selected. Appropriate indexes were constructed for all approaches (except for MONETDB) through a careful analysis of the benchmark queries. Prior to our experiments, we ensured that statistics on relations were collected. The bufferpool of the RDBMS was cleared before each run. Each query was executed 6 times and the results from the first run were always discarded.
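The measurement protocol above (clear the bufferpool, run each query 6 times, discard the first run) could be scripted along the following lines. This is a hedged sketch: the function and parameter names are hypothetical, `run_query` and `clear_cache` stand in for whatever database calls a given prototype uses, and averaging the remaining five runs is an assumption, since the text only states that the first run is discarded.

```python
import time
from statistics import mean


def timed_runs(run_query, clear_cache=lambda: None, runs=6, discard=1):
    """Execute run_query `runs` times and return (average, kept timings).

    clear_cache is called before every run, mirroring "the bufferpool of the
    RDBMS was cleared before each run". The first `discard` timings are
    dropped; averaging the rest is an assumption, not stated in the text.
    """
    timings = []
    for _ in range(runs):
        clear_cache()
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)
    kept = timings[discard:]
    return mean(kept), kept


# Usage with a dummy query that just records that it was called.
calls = []
average, kept = timed_runs(lambda: calls.append(1))
print(len(calls), len(kept))  # 6 5
```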
