A Structural Numbering Scheme for XML Data


Alfred M. Martin
WS2002/2003, February/March 2003

Based on work presented at the EDBT 2002 Workshops by Dao Dinh Kha, Masatoshi Yoshikawa, and Shunsuke Uemura
Graduate School of Information Science, Nara Institute of Science and Technology, Takayama, Ikoma, Nara, Japan
Information Technology Center, Nagoya University, Furo-cho, Chikusa-ku, Nagoya, Japan

Abstract. Identifier generation is a common but crucial task in many XML applications. In addition, the structural information of XML data is essential for evaluating XML queries. To meet both requirements, several numbering schemes, including the powerful UID technique, have been proposed. We introduce a new numbering scheme based on the UID technique, called multilevel recursive UID (ruid). The proposed ruid is robust, scalable, and hierarchical. ruid generates identifiers level by level and takes the XML tree topology into account. ruid not only enables the computation of the parent node's identifier from the child node's identifier, as in the original UID, but also deals effectively with XML structural updates and can be applied to arbitrarily large XML documents.

1. Introduction

The goals of a structural numbering scheme for XML data are to generate stable identifiers, to support query evaluation, and to manage large XML trees. To achieve these goals, the original UID (Unique Identifier) technique is extended to the new ruid, the recursive UID.

Among existing numbering schemes, the technique referred to as the Unique Identifier (UID) enumerates nodes using a k-ary tree, where k is the maximal fan-out of the nodes. Every internal node is treated as having the same fan-out k, with virtual children assigned where needed. Consecutive integers starting from 1 are assigned to the nodes, including the virtual nodes, in order from top to bottom and from left to right in each level.
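The UID enumeration described above can be sketched in a few lines of Python. This is only an illustrative sketch: the tree representation (a `(label, children)` tuple) and the function names `assign_uids`, `uid_children`, and `uid_parent` are assumptions made for this example and do not come from the paper. The child range and parent computation follow the k-ary tree arithmetic used throughout this text.

```python
from collections import deque


def uid_children(p: int, k: int) -> range:
    """UIDs of the k child slots (real or virtual) of the node with UID p."""
    return range((p - 1) * k + 2, p * k + 2)


def uid_parent(i: int, k: int) -> int:
    """UID of the parent of node i (i > 1) in the k-ary enumeration."""
    return (i - 2) // k + 1


def assign_uids(tree):
    """Assign UIDs to a tree given as (label, [subtrees]).

    Returns (k, {uid: label}); virtual nodes simply have no entry.
    """
    def fanout(node):
        _, kids = node
        return max([len(kids)] + [fanout(c) for c in kids])

    k = max(fanout(tree), 1)          # maximal fan-out of the whole tree
    uids = {}
    queue = deque([(1, tree)])        # the root always receives UID 1
    while queue:
        uid, (label, kids) = queue.popleft()
        uids[uid] = label
        first = uid_children(uid, k)[0]
        for offset, child in enumerate(kids):
            queue.append((first + offset, child))
    return k, uids


# Hypothetical example tree: a has children b and c; b has d, e, f; c has g.
tree = ("a", [("b", [("d", []), ("e", []), ("f", [])]),
              ("c", [("g", [])])])
k, uids = assign_uids(tree)
```

Here the maximal fan-out is 3, so slot 4 under the root is a virtual node, and the parent of `g` can be recovered from its UID alone with `uid_parent(8, 3)`.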

Whereas other numbering schemes can only compare two identifiers that are already known in order to determine the parent-child relationship, the UID technique has an interesting property: the parent node can be determined from the identifier of the child node. Given a node with the identifier i, the identifier of its parent is computed using the formula:

$\mathrm{parent}(i) = \lfloor (i - 2)/k + 1 \rfloor$

The main features of the ruid technique:
1. Parent-child determination property: given the identifier of a node, the parent node's identifier can be efficiently computed. Using small-size global information stored in main memory, the new technique allows the ancestor-descendant relationship to be determined without any I/O.
2. Robustness for structural change: the scope of data amendment when a structural update occurs is effectively reduced.
3. Scalability: the new representation overcomes the identifier limitation of the original UID technique and can be applied to arbitrarily large XML documents.
4. Structural richness: ruid is effective in representing the structural components in XPath expressions.

2. Multilevel Recursive UID

2.1 Definitions and pictures

Definition 1. (A frame) Given an XML tree T rooted at r, a frame F is a tree (1) rooted at r, (2) whose node set is a subset of the node set of T, and (3) in which, for any two nodes u and v, an edge connects them if and only if one of the nodes is an ancestor of the other in T and no other node x of the frame lies between u and v in T.

A tree and one of its frames are shown in Fig. 2(a). The dotted arrows connect the c.

Fig. 2 Frame and UID-local area

Definition 2. (UID-local area) Given an XML tree T rooted at r, a frame F of T, and a node n of F, a UID-local area of n is an induced subtree of T rooted at n such that each of the subtree's node paths is terminated either by a child node of n in F or by a leaf node of T, provided that no other node belonging to F exists between the leaf node and n in T.

Definition 3. (2-level ruid) The full 2-level ruid of a node n is a triple (g_i, l_i, r_i), where g_i, l_i, and r_i are called the global index, local index, and root indicator, respectively. If n is a non-root node of a UID-local area, then g_i is the index of the UID-local area containing n, l_i is the index of n inside the area, and r_i is false. If n is the root node of a UID-local area, then g_i is the index of the area, l_i is the index of n as a leaf node in the upper UID-local area, and r_i is true. The identifier of the root of the main XML tree is (1, 1, true).

Input: An XML tree T
Output: The 2-level ruid identifiers of nodes in T
// Global enumeration
1. Partition the XML tree into UID-local areas and build the frame F upon their roots
2. Find the maximal fan-out k of F
3. Compute the global indices g_i using a k-ary tree representation of F
// Local enumerations
4. for each i-th UID-local area
5.   find the local maximal fan-out, denoted by k_i
6.   compute the local indices l_ij of nodes in the area via a k_i-ary tree
7.   if l_ij = 1 then
8.     recompute l_ij in the upper UID-local area
9.     r_ij := true
10.    update K using (g_i, l_ij, k_i)
11.  else
12.    r_ij := false
13.  end
14.  Generate the identifiers of the nodes from (g_i, l_ij, r_ij)
15. end
16. Save k and K

Fig. 3 Outline of the algorithm used to compute 2-level ruid

Fig. 4 An original UID and its corresponding 2-level ruid counterpart

Fig. 5 Global parameter table for the 2-level ruid shown in Fig. 4(b)

2.2 Parent-Child Relationship in 2-Level ruid

Lemma 1. Given an XML tree T and a node n, the identifier of the parent of n can be computed from the value k and the table K if the identifier of n is known.

Input: An XML tree T, its k and K, and the 2-level ruid (g_i, l_i, r_i) of a node
Output: The 2-level ruid (g, l, r) of the parent node
1. if (r_i == true) then
2.   g := $\lfloor (g_i - 2)/k + 1 \rfloor$
3. else
4.   g := g_i
5. end
6. get the fan-out k_g of the row with the global index g in K
7. l := $\lfloor (l_i - 2)/k_g + 1 \rfloor$
8. if (l == 1) then
9.   set l equal to the local index of the row with the global index g in K
10.  r := true
11. else
12.  r := false
13. end
14. return (g, l, r)

Fig. 6 rparent(): the algorithm to compute the 2-level ruid of the parent of a node

Example: Suppose that k equals 4 and the table K is given in Fig. 5. Let c and p denote a node and its parent node, respectively. I illustrate how to determine the identifier of p from the identifier of c by considering several configurations of the child node:
- c is the non-root node (2, 7, false): From the second line of K we know that the local fan-out of the UID-local area containing c is 2. Therefore, the local index of the identifier of p is $\lfloor (7 - 2)/2 + 1 \rfloor$, which is equal to 3. Hence, p is the non-area-root node (2, 3, false). In Fig. 4, the node p is depicted by a fine-lined circle containing the numbers (2, 3).
- c is the root node (10, 9, true): The upper UID-local area containing p must be determined. Because k equals 4, the index of the upper UID-local area is $\lfloor (10 - 2)/4 + 1 \rfloor$, or 3. The local fan-out of that UID-local area is shown in the third line of K and is equal to 3. The local index of p is $\lfloor (9 - 2)/3 + 1 \rfloor$, which is equal to 3. The value is greater than 1, so p is the non-area-root node (3, 3, false).
- c is the non-root node (3, 3, false): From the third line of K we know that the local fan-out of the UID-local area containing c is equal to 3, so the index of p in the UID-local area is $\lfloor (3 - 2)/3 + 1 \rfloor$, which is equal to 1. This means that p is the root of the considered UID-local area. Therefore, the local index of p must equal the index of the node in the upper UID-local area. From K, the value is found to be 3, and p is the area root node (3, 3, true).

Note that if the value k and the table K are known and loaded into main memory, then all of the steps in the algorithm rparent() can be performed entirely in main memory without any disk I/O.

2.3 Adjustment of the Maximal Fan-out of Frame

2.4 Description of Multilevel ruid

Definition 4. (Multilevel ruid) Given an XML tree T, the l-level ruid of a node n has the form {θ, (α_{l-1}, β_{l-1}), ..., (α_2, β_2), (α_1, β_1)}, where:
- for j = 1, ..., l-1: α_j is the local index and β_j is the root indicator of n in its UID-local area, which is identified by {θ, (α_{l-1}, β_{l-1}), ..., (α_{j+1}, β_{j+1})} in level j+1;
- θ is the original UID in level l.
The symbols θ, α_i, and β_i have meanings similar to the first, second, and third components of the 2-level ruid.

Fig. 8 A multilevel ruid example

Example: In Fig. 8, each polygon denotes a UID-local area. Suppose that, using 2-level ruid, the node n has the identifier {8, (a, true)}, where the boolean value true indicates that n is the root of a UID-local area, 8 is the index of n in the second level's frame, and the integer a is the index of n in the upper UID-local area, which has the index 2. Using 3-level ruid, the index 8 is decomposed into (2, 4, false) and the full identifier of n becomes {2, (4, false), (a, true)}.

A multilevel ruid is constructed by building the UID levels consecutively, each on top of the previous level. First, the 2-level ruid of the form {x_l, (α_l, β_l)} is constructed. If needed, the 3-level ruid of the form {x_{l-1}, (α_{l-1}, β_{l-1}), (α_l, β_l)} is constructed, and so on. The process stops when the top level becomes small enough to be stored.

3. Properties

3.1 Robustness with Structural Update

In the original UID, if a new node is inserted into an XML tree when space is available, then the insertion causes the identifiers of the sibling nodes to the right of the inserted node, as well as those of their descendant nodes, to be modified. In the worst case, when the insertion increases the tree's maximal fan-out, the entire enumeration has to be performed again. The identifiers of all of the nodes must be changed, which leads to an expensive reconstruction.

The ruid copes better with structural updates of XML data than the original UID does. The scope of identifier update due to a node insertion is reduced by a magnitude of two. If a node is inserted, at first only the nodes in the UID-local area where the update occurs need to be considered. If an appropriate space is available for the new node, then among the descendants of the sibling nodes to the right of the inserted node, only those that belong to the same UID-local area will have their identifiers modified. The nodes in the descendant areas are not affected because the frame F is unchanged. Otherwise, if no such space exists for the newly inserted node, the fan-out of the tree used in enumerating the UID-local area must be enlarged. Rather than modifying the identifiers of every XML component, the enlargement changes only the identifiers of the nodes in this area. In both cases, since a UID-local area is much smaller than the entire data tree, the scope of the identifier update is greatly reduced.

The ruid handles node deletion, the other structural operation, in a similar way. Note that any node deletion in an XML tree is cascading: all of the descendant nodes of the deleted node are deleted as well. The change of the identifiers of the sibling nodes to the right of the deleted node affects only those descendant nodes that belong to the UID-local area where the deletion occurs.

3.2 XPath Axes Expressiveness

In this section, we investigate the power of ruid to express XPath expressions. This property is important for the applicability of ruid in XML query processing. We consider XPath because it has become the standard on which many newly proposed XML query languages are based. Furthermore, XPath expressions have additional concepts specific to XML data, such as axes, that do not exist in regular path expressions.

XPath is a language for addressing parts of an XML document and was designed to be used by other languages such as XSLT and XPointer. In addition, XPath provides basic facilities for manipulating strings, numbers, and booleans. XPath operates on the logical structure of an XML document. One important kind of XPath expression is the location path. A location path selects a set of nodes relative to a context node. The result of evaluating a location path is the node-set containing the nodes selected by the location path. I will focus only on the core rules of XPath, such as the following:

[1] LocationPath ::= RelativeLocationPath | AbsoluteLocationPath
[2] AbsoluteLocationPath ::= '/' RelativeLocationPath?
[3] RelativeLocationPath ::= Step | RelativeLocationPath '/' Step

Therefore, a location path can be written in the form

δ Step_1 τ_1 Step_2 τ_2 ... Step_l

where l ≥ 0, δ can be an empty symbol (indicating that nothing appears) or '/', τ_i (i = 1, ..., l-1) is '/', and Step_i (i = 1, ..., l) is a location step. A location step has three parts:
1) an axis, which specifies the hierarchical relationship between the nodes considered in the location step and the context node,
2) a node test, which specifies the node type and expanded-name of the nodes selected by the location step, and
3) zero or more predicates to further refine the set of nodes.
An initial node-set is generated from the axis and the node test and is then filtered by each of the predicates in turn. A predicate filters a node-set with respect to an axis to produce a new node-set.

As described above, generating and filtering the axes is essential in the evaluation of location steps in XPath expressions. The general task is as follows: "Given a context node n identified by (θ, α, β), generate the set of nodes that belong to a specific axis of n and satisfy a condition C". The condition C may be "to satisfy a logical expression related to data content", "to belong to a specific element type", etc. Depending on the particular C, the processing order may be either to generate the set of nodes satisfying C and then check which of them belong to the specified axis, or to generate the specified axis and then check which nodes satisfy C. The first approach is good only for cases in which C is specific, so that the set of nodes satisfying C is small. The second approach is more generally applicable, and thus we shall focus on it.
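As a small illustration of the location-path form δ Step_1 / Step_2 / ... / Step_l and of the three parts of a location step, the following Python sketch splits a path in unabbreviated syntax into its pieces. It is a deliberately naive sketch, not an XPath parser: the function names `split_location_path` and `split_step` are invented for this example, and the code assumes that predicates are not nested and contain no '/' characters. When a step has no explicit axis, the `child` axis is assumed, which is the XPath default.

```python
import re


def split_location_path(path: str):
    """Return (is_absolute, [step strings]) for a location path.

    Naive split on '/': assumes no '/' occurs inside predicates.
    """
    absolute = path.startswith("/")          # δ is '/' or empty
    body = path[1:] if absolute else path
    steps = body.split("/") if body else []  # l may be 0 for the path '/'
    return absolute, steps


def split_step(step: str):
    """Return (axis, node_test, [predicates]) for one location step."""
    head = step.split("[", 1)[0]             # everything before the first predicate
    if "::" in head:
        axis, node_test = head.split("::", 1)
    else:
        axis, node_test = "child", head      # default axis when none is written
    # Non-nested predicates only: text between matching square brackets.
    predicates = re.findall(r"\[([^\[\]]*)\]", step[len(head):])
    return axis, node_test, predicates


absolute, steps = split_location_path("/child::book/descendant::title[position()=1]")
parts = [split_step(s) for s in steps]
```

With this input, `steps` holds the two location steps and `parts` holds one (axis, node test, predicates) triple per step, which is the information a routine needs before it can generate an axis and then filter it by the condition C.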
We demonstrate the XPath axes expressiveness of ruid by proposing several routines to generate the axes. We limit the scope of discussion to the axes that specify sets of nodes in terms of the node position in XML documents. Because they are trivial, we exclude the -or-self portions of axes from consideration. Specifically, the following axes will be considered: (1) parent and ancestor, (2) attribute, child, and descendant, (3) preceding-sibling and following-sibling, and (4) preceding and following.

Parent and Ancestor axes: As shown in Section 2.2, after loading the value k and the table K, the parent's identifier for a given node can be computed in main memory using rparent(). The routine rancestor(n), used to generate the list of the ancestors of n, is a repetition of rparent(). Note that numbering schemes based on a loose hierarchical order require additional parameters to express the hierarchical level, such as grandparent or great-grandparent. This task can be accomplished much more simply using ruid. For example, let us consider an expression in abbreviated syntax such as "element1/*/element2", which explicitly requires that exactly one element lie between "element1" and "element2". Naturally, we do not have to know the exact intermediate element. Using ruid, we can avoid scanning the entire collection of available elements to find the parent of "element2". We need only to list the grandparents of the elements of type "element2", by applying rparent() twice, and exclude those that are not of type "element1".

Child and Descendant axes: In the 1-level UID, if p is the parent's UID, then the identifiers of its children belong to the range [(p-1)*k + 2, p*k + 1], where k is the fan-out of the enumerating tree. In the 2-level ruid, the routine rchildren(n) that creates the list L of possible children of n is as follows. First, use k and θ to compute the sorted list L_1 of children of θ in the frame of T. Let k_θ denote the local fan-out corresponding to θ, obtained from K. Let L_2 denote the list of integers in the interval [2, k_θ + 1] if β is true, or in the interval [(α-1)*k_θ + 2, α*k_θ + 1] if β is false. For each i in L_2, if there exists no θ' in L_1 such that (θ', i) is found in K as the global and local indices of a row, then add (θ, i, false) to L. Otherwise, add (θ', i, true) to L. To confirm the existence of such a θ', we first find in K the list of the local indices corresponding to the values in L_1 as the global indices. We then intersect the list with L_2. Note that both L_1 and K are sorted, so this process is fast.

The routine rdescendant(n), which generates the list of the descendants of n, may be designed as a repetition of rchildren(). Another method is based on the following observation. Given two nodes n_1 and n_2, let r_1 and r_2 be the roots of the UID-local areas containing n_1 and n_2, respectively. If r_1 is a descendant of n_2, then n_1 is a descendant of n_2. Therefore, we first need to find the descendants of n inside its own UID-local area only, using rchildren(). Among these nodes, consider the UID-local area root nodes. In F, find all the nodes that are descendant-or-self of these roots. All nodes in the areas rooted at the newly found nodes are descendants of n.

Preceding-sibling and Following-sibling axes: We explain the routine rpsibling(n), which generates the list L of the preceding siblings of n. Using k and θ, we generate the sorted list L_1 of child nodes of θ in the frame F of T. In the context UID-local area, compute the sorted list L_2 of the preceding siblings of α. For each α_i in L_2, if there exists no θ_j such that (θ_j, α_i) is found in K as the global index and the local index of a row, then add (θ, α_i, false) to L. Otherwise, add (θ_j, α_i, true) to L. This argument is similar to the routine for the child and descendant axes. Similarly, we can design the routine rfsibling(n) to generate the list of the following siblings of n.

In general, the multilevel ruid has the following property: for the axes 'preceding' and 'following', the relative position of two nodes can be determined by the first components of their multilevel ruids that differ and for which the preceding-following order is decidable. In the 2-level ruid, the order among nodes is reflected in the frame F. We can use this property to accelerate the axis constructions.
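The routines rparent(), rancestor(), and rchildren() discussed in Sections 2.2 and 3.2 can be sketched in Python as follows. This is a sketch under stated assumptions rather than the authors' implementation: the table K is modelled as a dictionary from a global index to a pair (local index in the upper area, local fan-out), and because Fig. 5 is not reproduced in this text, only the entries used by the worked example in Section 2.2 (fan-out 2 for area 2; local index 3 and fan-out 3 for area 3; k = 4) are taken from the source. All other table entries are hypothetical values chosen to make the example runnable.

```python
from typing import Dict, List, Tuple

Ruid = Tuple[int, int, bool]         # (global index, local index, root indicator)
KTable = Dict[int, Tuple[int, int]]  # global index -> (local index in upper area, fan-out)

ROOT: Ruid = (1, 1, True)            # identifier of the root of the main tree


def rparent(n: Ruid, k: int, K: KTable) -> Ruid:
    """Parent's 2-level ruid, following the steps of Fig. 6."""
    if n == ROOT:
        raise ValueError("the root of the tree has no parent")
    g_i, l_i, r_i = n
    g = (g_i - 2) // k + 1 if r_i else g_i   # area roots live in the upper area
    upper_index, fan_out = K[g]
    l = (l_i - 2) // fan_out + 1
    if l == 1:                               # the parent is the root of area g
        return (g, upper_index, True)
    return (g, l, False)


def rancestors(n: Ruid, k: int, K: KTable) -> List[Ruid]:
    """Ancestors of n, nearest first, by repeating rparent()."""
    result = []
    while n != ROOT:
        n = rparent(n, k, K)
        result.append(n)
    return result


def rchildren(n: Ruid, k: int, K: KTable) -> List[Ruid]:
    """Possible children of n (child slots, including virtual ones)."""
    theta, alpha, beta = n
    frame_children = range((theta - 1) * k + 2, theta * k + 2)     # L_1
    fan_out = K[theta][1]
    low = 2 if beta else (alpha - 1) * fan_out + 2
    # local index -> global index of the area rooted at that slot
    area_root_at = {K[t][0]: t for t in frame_children if t in K}
    out = []
    for i in range(low, low + fan_out):                            # L_2
        if i in area_root_at:
            out.append((area_root_at[i], i, True))
        else:
            out.append((theta, i, False))
    return out


k = 4
K: KTable = {
    1: (1, 4),   # hypothetical fan-out for the top area
    2: (2, 2),   # fan-out 2 is from the example; local index 2 is hypothetical
    3: (3, 3),   # local index 3 and fan-out 3 are from the example
    10: (9, 2),  # local index 9 is from the example; fan-out 2 is hypothetical
}
```

Under these assumptions, `rparent((2, 7, False), k, K)`, `rparent((10, 9, True), k, K)`, and `rparent((3, 3, False), k, K)` reproduce the three cases of the worked example in Section 2.2. Applying `rparent()` twice gives the grandparent needed for an expression such as "element1/*/element2", and every identifier returned by `rchildren(n, k, K)` maps back to `n` under `rparent()`.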

4. Conclusion and Summary

- Generating stable identifiers that are robust against structural updates
- Query evaluation
- Managing large XML documents and their large trees
