A Structural Numbering Scheme for XML Data

Dao Dinh Kha (1), Masatoshi Yoshikawa (1,2), and Shunsuke Uemura (1)

(1) Graduate School of Information Science, Nara Institute of Science and Technology, Takayama, Ikoma, Nara, Japan
(2) Information Technology Center, Nagoya University, Furo-cho, Chikusa-ku, Nagoya, Japan

Abstract. Identifier generation is a common but crucial task in many XML applications. In addition, the structural information of XML data is essential for evaluating XML queries. To meet both requirements, several numbering schemes, including the powerful UID technique, have been proposed. In this paper, we introduce a new numbering scheme based on the UID technique, called the multilevel recursive UID (ruid). The proposed ruid is robust, scalable, and hierarchical. ruid generates identifiers by level and takes the XML tree topology into account. ruid not only enables the parent node's identifier to be computed from the child node's identifier, as in the original UID, but also deals effectively with XML structural updates and can be applied to arbitrarily large XML documents. In addition, we investigate the effectiveness of ruid in representing the XPath axes and in query processing, and briefly discuss other applications of ruid.

1 Introduction

Extensible Markup Language (XML) [12] has been accepted as a standard for information exchange over the Internet and is supported by major software vendors. The main components of an XML document are elements of various lengths and positions within a hierarchical structure. In order to process XML data, XML elements must be assigned unique identifiers. Therefore, identifier generation is a common but crucial task in many XML applications. The method by which the task is accomplished significantly affects the organization and storage of data, the construction of indices, and the processing of queries.
Unlike relational databases, in which data are projected into relations using fixed schemas, the structure of an XML document may change. To process queries on XML data effectively, the structural information of XML documents is essential. Therefore, numerous studies have examined how to present the logical structure of XML data in a concise manner. Since the structure of XML data can be viewed as a tree, a numbering scheme can be used to represent it. Normally, a numbering scheme is a method of generating the identifiers of the elements in such a manner that the hierarchical orders of the elements can be re-established based on their identifiers. For a numbering scheme, the ability to express orders, for example, parent-child, ancestor-descendant, and preceding-following, is essential. Hierarchical orders are used extensively in processing XML data; hence, reducing the workload of re-establishing the hierarchy as far as possible is desired.

A.B. Chaudhri et al. (Eds.): EDBT 2002 Workshops, LNCS 2490, © Springer-Verlag Berlin Heidelberg 2002

In [3,8,6,7], several numbering schemes for XML data have been proposed, in which an identifier is either an integer or a combination of integers. A hierarchical order between two elements exists if and only if their identifiers observe a predefined numerical order described by a formula. Among these schemes, the technique referred to as the Unique Identifier (UID) [7] enumerates nodes using a k-ary tree, where k is the maximal fan-out of the nodes. Every internal node is given the same fan-out k by assigning it virtual children where needed. Consecutive integers starting from 1 are assigned to the nodes, including the virtual nodes, in order from top to bottom and from left to right in each level. Whereas other numbering schemes can only compare two identifiers that are already known in order to determine the parent-child relationship, the UID technique has an interesting property whereby the parent node can be determined based on the identifier of the child node. Given a node having the identifier i, we can compute the identifier of its parent using the formula:

$$\mathrm{parent}(i) = \left\lfloor \frac{i-2}{k} \right\rfloor + 1 \qquad (1)$$

Using this property, given two nodes, the question of whether one node is an ancestor of the other can be easily answered based on their identifiers. This also allows the identifiers of the ancestors of a node to be generated quickly. This is a promising property for the evaluation of the structural part of XML queries.
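As a minimal sketch of how formula (1) can be used, the following Python functions compute a parent identifier, list the ancestors of a node, and answer the ancestor question from identifiers alone. The function names are illustrative and do not come from the paper; the fan-out k = 3 in the usage note is an assumed value.

```python
def uid_parent(i: int, k: int) -> int:
    """Formula (1): identifier of the parent of node i in a k-ary enumeration."""
    if i <= 1:
        raise ValueError("the root (identifier 1) has no parent")
    return (i - 2) // k + 1


def uid_ancestors(i: int, k: int) -> list[int]:
    """Identifiers of all ancestors of i, nearest first, ending at the root 1."""
    ancestors = []
    while i > 1:
        i = uid_parent(i, k)
        ancestors.append(i)
    return ancestors


def is_ancestor(a: int, d: int, k: int) -> bool:
    """True if node a is a proper ancestor of node d."""
    if a >= d:
        return False  # an ancestor always has a smaller identifier
    while d > a:
        d = uid_parent(d, k)  # climb until we reach or pass a
    return d == a
```

With k = 3, `uid_ancestors(23, 3)` climbs through 8 and 3 to the root 1, and no node data has to be read to do so.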
Ascertaining the identifiers of data items prior to loading data from the disk can also help to reduce disk access. However, despite this useful property, the original UID technique has some drawbacks, especially regarding structural update. In practice, the content and structure of an XML document may be updated frequently. When a new node is inserted, a new identifier must be assigned to it. Because the UID technique enumerates from left to right, in order to keep the siblings' identifiers consecutive the new node takes the identifier of the node that it pushes to the right, and the identifiers of all the sibling nodes to the right of the inserted node are increased by one. Because of the strong dependency of the identifier of a child node on the identifier of its parent node in the original UID technique, the identifiers of the descendants of all the sibling nodes to the right of the inserted node will also change. The nearer to the root the new node is inserted, the larger the scope of the identifier modification.

Furthermore, the fact that the UID technique enumerates nodes using one tree of fixed fan-out k also contributes to the update problem. The problem becomes more serious when the number of children of a node becomes larger than the pre-defined value k: the initial value of k overflows and there is no space for a new child node. This situation may occur frequently in practice. Modifying k results in an overhaul of the identifier system in which the identifiers of all nodes must be recomputed. The reconstruction is costly and may severely degrade system performance, especially for large XML documents. In general, since any change in the identifiers usually triggers a costly reconstruction, reducing the scope of the identifier update as far as possible is desired.

Figure 1 illustrates a node insertion and the consequent changes in identifiers. In Fig. 1(a), the nodes of a tree are enumerated using the original UID method. Virtual nodes are denoted by dotted lines. Suppose that a node is inserted between nodes 2 and 3. The new enumeration is shown in Fig. 1(b). The previous nodes 3, 8, 9, 23, 26, and 27 are re-enumerated as nodes 4, 11, 12, 32, 35, and 36, respectively. If another node is inserted behind the new node 4 in Fig. 1(b), the entire tree must be re-enumerated.

Fig. 1. An original UID before and after a node insertion: (a) before insertion, (b) after insertion

In addition, the original UID technique not only enumerates the real nodes but also reserves identifiers for the virtual nodes. Because the fan-outs of nodes vary, the UID technique may enumerate a large number of virtual nodes. The value of the identifiers grows exponentially, with a base equal to the maximal fan-out of the nodes and an exponent equal to the length of the longest path in the tree. Therefore, in many cases, the value easily exceeds the maximal manageable integer value, even when the real nodes in the data source are few. Additional purpose-specific libraries are necessary to deal with the oversized values, but they incur extra computation cost.

The above-mentioned drawbacks make the original UID technique impractical in several cases.
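The renumbering in Fig. 1 can be replayed with a short sketch. It assumes a fan-out of k = 3, which is inferred here from the identifiers quoted in the text rather than stated by the authors, and it represents a node by the 0-based child offsets on its path from the root. All function names are illustrative. The last helper gives a rough feel for how quickly identifiers grow.

```python
def uid_from_path(path, k):
    """UID of the node reached from the root via 0-based child offsets."""
    uid = 1
    for offset in path:
        uid = (uid - 1) * k + 2 + offset
    return uid


def path_from_uid(uid, k):
    """Inverse of uid_from_path: child offsets from the root down to uid."""
    path = []
    while uid > 1:
        parent = (uid - 2) // k + 1
        path.append(uid - ((parent - 1) * k + 2))
        uid = parent
    return path[::-1]


def renumber_after_insert(uid, parent_uid, position, k):
    """New UID of an existing node after a sibling is inserted as child
    number `position` (0-based) of parent_uid; other nodes are unchanged."""
    path = path_from_uid(uid, k)
    prefix = path_from_uid(parent_uid, k)
    depth = len(prefix)
    in_subtree = len(path) > depth and path[:depth] == prefix
    if in_subtree and path[depth] >= position:
        if path[depth] + 1 >= k:
            # No free slot: k must grow and every node be renumbered.
            raise OverflowError("fan-out k exceeded")
        path[depth] += 1  # the sibling (and its subtree) moves right
    return uid_from_path(path, k)


def max_uid(k, levels):
    """Largest identifier of a k-ary enumeration with `levels` levels,
    counting real and virtual nodes alike."""
    return sum(k ** j for j in range(levels))
```

Calling `renumber_after_insert(u, 1, 1, 3)` for the old identifiers 3, 8, 9, 23, 26, and 27 gives the new identifiers quoted for Fig. 1(b). With an assumed fan-out of 100, `max_uid(100, 11)` already exceeds the largest signed 64-bit integer.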
In this paper, we propose a new numbering scheme called the recursive UID (ruid), an extended version of the UID technique. The ruid technique has been designed to eliminate the above-mentioned drawbacks. Specifically, we grade and localize the value k, the fan-out of the enumerating tree. The node identifiers are created by levels. In each level, the fan-out of the enumerating tree may vary based on the local topology of the XML tree. This approach preserves the advantages of the original UID technique while reducing its drawbacks. The main features of the ruid technique are as follows:

1. Parent-child determination property: given the identifier of a node, the parent node's identifier can be efficiently computed. Using small-size global information stored in main memory, the new technique allows the ancestor-descendant relationship to be determined without any I/O.
2. Robustness for structural change: the scope of data amendment when a structural update occurs is effectively reduced.
3. Scalability: the new presentation can overcome the identifier limitation of the original UID technique and can be applied to arbitrarily large XML documents.
4. Structural richness: ruid is effective in representing the structural components of XPath expressions.

The remainder of the paper is organized as follows. Section 2 describes the 2-level recursive UID technique and its multilevel version. Section 3 discusses the main properties of the new numbering scheme. Section 4 outlines possible practical applications of the technique. Section 5 presents a number of observations made during our preliminary experiments. Section 6 briefly reviews related work. Section 7 concludes the paper and presents suggestions for future research.

2 Multilevel Recursive UID

In this section, we present the multilevel recursive UID. We first introduce the 2-level ruid numbering scheme. We describe ruid based on the notation for an XML tree.

2.1 Description of the 2-Level ruid

Given an XML tree, the 2-level ruid numbering scheme manages the identifiers of nodes in a global level and a local level. The set of nodes is divided into subsets; the identifiers of the subsets are created in the global level, whereas the nodes of each subset are managed in the local level. We generate a number of parameters that are used in both the global and local levels.
These additional data are small enough to be comfortably loaded into main memory, allowing fast access when navigating inside the XML tree. Concretely, the construction of the 2-level ruid numbering scheme for an XML tree consists of the following steps: (1) partition the XML tree into areas, each of which is an induced subtree of the XML tree; (2) enumerate the newly created areas according to the original UID scheme in order to generate the global indices; (3) for each area, enumerate the nodes of that area in order to generate the local indices; and (4) compose the synthetic identifiers of the nodes from the global and local indices. The 2-level ruid is formalized by the following three definitions:

Definition 1. (A frame) Given an XML tree T rooted at r, a frame F is a tree: (1) rooted at r, (2) the node set of which is a subset of the node set of T, and (3) for any two nodes u and v in the frame, an edge exists connecting the nodes if and only if one of the nodes is an ancestor of the other in T and there is no other node x of the frame that lies between u and v in T.

A tree and one of its frames are shown in Fig. 2(a). The dotted arrows connect the corresponding nodes between these trees.

Fig. 2. Frame and UID-local area: (a) a source tree and one of its frames, (b) a UID-local area

Definition 2. (UID-local area) Given an XML tree T rooted at r, a frame F of T, and a node n of F, a UID-local area of n is an induced subtree of T rooted at n such that each of the subtree's node paths is terminated either by a child node of n in F or by a leaf node of T, if between the leaf node and n in T there exists no other node that belongs to F.

A UID-local area is depicted in Fig. 2(b). We cover an XML tree using a set of UID-local areas such that the intersection of any two of these areas is either empty or consists of only one node from the frame. Hereafter, let us refer to the full identifier of a node as its identifier and to the number assigned to a node locally inside a UID-local area as its index. Let κ denote the maximal fan-out of nodes in F. We use a κ-ary tree to enumerate the nodes of F and let the number assigned to each node in F be the index of the UID-local area rooted at that node. The 2-level ruid numbering scheme based on F is defined as follows:

Definition 3. (2-level ruid) The full 2-level ruid of a node n is a triple $(g_i, l_i, r_i)$, where $g_i$, $l_i$, and $r_i$ are called the global index, local index, and root indicator, respectively. If n is a non-root node, then $g_i$ is the index of the UID-local area containing n, $l_i$ is the index of n inside the area, and $r_i$ is false. If n is the root node of a UID-local area, then $g_i$ is the index of the area, $l_i$ is the index of n as a leaf node in the upper UID-local area, and $r_i$ is true. The identifier of the root of the main XML tree is (1, 1, true).
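To make Definitions 1 to 3 concrete, here is a minimal sketch that takes a tree and a chosen set of area roots and produces the triples of Definition 3. It is an illustration under stated assumptions, not the authors' implementation: the tree is a plain dictionary, the choice of area roots is given by the caller, and the names `build_ruid`, `Ruid`, and the returned per-area dictionary `K` (global index mapped to the root's index in the upper area and the local fan-out) are hypothetical.

```python
from collections import namedtuple

Ruid = namedtuple("Ruid", "g l root")  # global index, local index, root flag


def build_ruid(children, root, area_roots):
    """children: node -> ordered child list; area_roots must contain root."""
    frame = {a: [] for a in area_roots}   # frame F: area root -> child roots
    fanout = {a: 1 for a in area_roots}   # local maximal fan-out per area

    def scan(a, n):
        kids = children.get(n, [])
        fanout[a] = max(fanout[a], len(kids))
        for c in kids:
            if c in area_roots:
                frame[a].append(c)        # c ends this area and roots its own
            else:
                scan(a, c)

    for a in area_roots:
        scan(a, a)
    kappa = max(1, max(len(v) for v in frame.values()))

    # Global indices: enumerate the frame as a kappa-ary tree, top-down.
    g = {root: 1}
    queue = [root]
    for a in queue:
        for pos, c in enumerate(frame[a]):
            g[c] = (g[a] - 1) * kappa + 2 + pos
            queue.append(c)

    # Local indices: enumerate each area as a fanout[a]-ary tree.
    ids, upper_local = {}, {root: 1}

    def number(a, n, l):
        for pos, c in enumerate(children.get(n, [])):
            lc = (l - 1) * fanout[a] + 2 + pos
            if c in area_roots:
                upper_local[c] = lc       # index of c as a leaf of area a
            else:
                ids[c] = Ruid(g[a], lc, False)
                number(a, c, lc)

    for a in area_roots:
        number(a, a, 1)
    K = {}
    for a in area_roots:
        ids[a] = Ruid(g[a], upper_local[a], True)
        K[g[a]] = (upper_local[a], fanout[a])
    return ids, kappa, K


# A small hypothetical tree: r is the document root, c roots a second area.
tree = {"r": ["a", "b"], "a": ["c", "d"], "b": ["e"], "c": ["f", "g", "h"]}
ids, kappa, K = build_ruid(tree, "r", {"r", "c"})
```

In this toy tree, node c receives the triple (2, 4, true) while its child h receives (2, 4, false); the two share the first two components and are told apart only by the root indicator, which is why Definition 3 needs the third component.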

From these definitions, the ruid of a node in an XML tree is determined uniquely. The values of the first and second components of the identifier of a node should be interpreted based on whether the node is an area-root node. This information is indicated by the third component of the identifier. For implementation, the global and local indices can be expressed by integers, whereas a boolean value is sufficient to express the root indicator. If XML data is stored in an RDBMS, one way to express ruid is to use three separate fields, whose types correspond to the component types, to store the components of the ruid. The data items are sorted first by the global index and then by the local index.

    Input:  an XML tree T
    Output: the 2-level ruid identifiers of the nodes in T

    // Global enumeration
    partition the XML tree into UID-local areas and build the frame F upon their roots
    find the maximal fan-out κ of F
    compute the global indices g_i using a κ-ary tree presentation of F

    // Local enumerations
    foreach i-th UID-local area
        find the local maximal fan-out, denoted by k_i
        compute the local indices l_ij of the nodes in the area via a k_i-ary tree
        if l_ij = 1 then
            recompute l_ij in the upper UID-local area
            r_ij := true
            update K using (g_i, l_ij, k_i)
        else
            r_ij := false
        end
        generate the identifiers of the nodes from (g_i, l_ij, r_ij)
    end
    save κ and K

Fig. 3. Outline of the algorithm used to compute 2-level ruid

We have established a one-to-one mapping between the nodes of F and the UID-local areas. Therefore, for two UID-local areas, we refer to one area as preceding the other area if the root of the former precedes the root of the latter in F. The other orders among the UID-local areas, such as ancestor, descendant, and following, are determined similarly. We construct a table K having three columns: global index, local index, and fan-out.
Each row of K corresponds to a UID-local area and contains the global index of the area, the index of the area's root in the upper area, and the maximal fan-out of nodes in the area, respectively. The table K is sorted according to the global index. The value κ and the table K are global parameters, which are loaded into main memory while traversing T. The process used to compute the 2-level ruid is outlined in Fig. 3.

Example 1. Fig. 4 depicts examples of the original UID and the new 2-level ruid. In the tree shown on the left, the number inside each node is its original UID. In the tree shown on the right, the integers shown inside each node are the global and local indices of the node's 2-level ruid. Rather than showing root indicators, the root nodes are encircled by bold circles and the non-root nodes are encircled by fine-lined circles. Using the ruid, the global fan-out κ is 4 and six UID-local areas exist. The table K of the global parameters is shown in Fig. 5.

Fig. 4. An original UID and its corresponding 2-level ruid counterpart

Fig. 5. Global parameter table for the 2-level ruid shown in Fig. 4(b) (columns: global index, local index, local fan-out)

2.2 Parent-Child Relationship in 2-Level ruid

Formula (1) can be used to check the parent-child relationship in the original UID. For the 2-level ruid, we need a more sophisticated function, denoted herein by rparent(), in the form of an algorithm. First, let us show that the 2-level ruid enables the parent-child relationship to be determined.

Lemma 1. Given an XML tree T and a node n, based on the value κ and the table K, the identifier of the parent of n can be computed if the identifier of n is known.

Proof. Let the parent node of n be denoted by p. Since the intersection of any two UID-local areas is either empty or consists of only one node, which is the root of the lower area, it is sufficient to consider the following cases. First, if n and p belong to the same UID-local area and p is not the root of this area, then these nodes have the same global index. The local index of p can be computed using formula (1), where i is replaced by the local index of n and k is replaced by the maximal fan-out of the UID-local area that contains n; k can be obtained from the table K. Second, if n and p belong to the same UID-local area but p is the root of this area (this means that the value $\lfloor (l-2)/k \rfloor + 1$ is equal to 1, where l is the local index of n), then p has the same global index as n, because by Definition 3 an area root carries the index of its own area, and the local index of p, that is, its index in the upper area, can be obtained from the table K. Third, if n is the root of a UID-local area, so that p belongs to the upper UID-local area and n is the joint of the upper and lower areas corresponding to a parent-child pair in the frame of T, then the global index of the upper area can be computed using formula (1), where i is replaced by the global index of n and k is replaced by κ. Because both p and n belong to this upper UID-local area, the index of which is now known, the local index of p is determined in a manner similar to that used in the first case. If the result is equal to 1, then p is the root of the upper area and its local index must be obtained from the table K.

The algorithm used to determine the parent's identifier from a node's identifier is shown in Fig. 6. We illustrate this algorithm through Example 2.

    Input:  an XML tree T, its κ and K, and the 2-level ruid (g_i, l_i, r_i) of a node
    Output: the 2-level ruid (g, l, r) of the parent node

    if r_i == true then
        g := ⌊(g_i − 2)/κ⌋ + 1
    else
        g := g_i
    end
    get the fan-out k_j of the row with the global index g in K
    l := ⌊(l_i − 2)/k_j⌋ + 1
    if l == 1 then
        set l equal to the local index of the row with the global index g in K
        r := true
    else
        r := false
    end
    return (g, l, r)

Fig. 6. rparent() - the algorithm to compute the 2-level ruid of the parent of a node

Example 2. Suppose that κ equals 4 and the table K is given in Fig. 5. Let c and p denote a node and its parent node, respectively.
We illustrate how to determine the identifier of p from the identifier of c by considering several configurations of the child node:

- c is the non-root node (2, 7, false): from the second line of K we know that the local fan-out of the UID-local area containing c is 2. Therefore, the local index of p is $\lfloor (7-2)/2 \rfloor + 1$, which is equal to 3. Hence, p is the non-area-root node (2, 3, false). In Fig. 4, the node p is depicted by a fine-lined circle containing the numbers (2, 3).
- c is the root node (10, 9, true): the upper UID-local area containing p must be determined. Because κ equals 4, the upper UID-local area's index is $\lfloor (10-2)/4 \rfloor + 1$, or 3. The local fan-out of this UID-local area is shown in the third line of K and is equal to 3. The local index of p is $\lfloor (9-2)/3 \rfloor + 1$, which is equal to 3. The value is greater than 1, so p is the non-area-root node (3, 3, false).
- c is the non-root node (3, 3, false): from the third line of K we know that the local fan-out of the UID-local area containing c is equal to 3, so the index of p in the UID-local area is $\lfloor (3-2)/3 \rfloor + 1$, which is equal to 1. This means that p is the root of the considered UID-local area. Therefore, the local index of p must equal the index of the node in the upper UID-local area. From K, the value is found to be 3, and p is the area-root node (3, 3, true).

Note that if the value κ and the table K are known and loaded into main memory, then all of the steps in the algorithm rparent() can be performed completely inside main memory without any disk I/O.

2.3 Adjustment of the Maximal Fan-out of Frame

The maximal fan-out κ of the frame F should not be greater than the maximal fan-out of the source data tree. However, in a naive partition of an XML tree into UID-local areas, the maximal fan-out κ of the frame F may exceed the maximal fan-out of the XML tree. Such a partition is illustrated in Fig. 7(a). Suppose that the maximal fan-outs of the subtrees rooted at $u_1$, $u_2$, and $u_3$ are less than or equal to 4. Although the node $n_1$ is not an area-root node, this node has three area-root descendants $u_1$, $u_2$, and $u_3$ in three separate paths. In the frame F, these three nodes are connected directly to n, and n has six children, as shown in Fig. 7(b). Therefore, the maximal fan-out of the frame is larger than that of the source data.

Fig. 7. Adding a marked node in order to reduce the fan-out

A simple solution in this case is to make the node $n_1$ an area-root node, as shown in Fig. 7(c).
Generally, if necessary, we can add further area-root nodes to reduce the value of κ. This technique guarantees that the fan-out of the frame is always less than or equal to the fan-out of the source XML tree. We omit the technical details of the solution here.
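Looking back at Section 2.2, the rparent() algorithm of Fig. 6 can be sketched in a few lines of Python and checked against the three configurations of Example 2. This is an illustrative sketch, not the authors' code. The table K of Fig. 5 did not survive in the text, so only the two rows that Example 2 quotes are reproduced, and the root's local index for area 2, which the example never uses, is left as `None`. The dictionary layout of `K` is an assumption.

```python
from collections import namedtuple

Ruid = namedtuple("Ruid", "g l root")  # global index, local index, root flag


def rparent(node, kappa, K):
    """Parent's 2-level ruid, following Fig. 6.

    K maps a global index to (local index of the area root in the upper
    area, local fan-out of the area). Returns None for the document root.
    """
    if node == Ruid(1, 1, True):
        return None
    # An area root is a leaf of the upper area: move up one area in the frame.
    g = (node.g - 2) // kappa + 1 if node.root else node.g
    root_local, k = K[g]
    l = (node.l - 2) // k + 1
    if l == 1:
        # The parent is the root of area g; its local index lives in K.
        return Ruid(g, root_local, True)
    return Ruid(g, l, False)


# Rows quoted in Example 2 (assumed layout; other rows of Fig. 5 are unknown).
KAPPA = 4
K = {
    2: (None, 2),  # area 2: local fan-out 2; root's upper index not quoted
    3: (3, 3),     # area 3: root has index 3 in the upper area; fan-out 3
}
```

Every lookup here is a dictionary access on κ and K, which reflects the point made at the end of Section 2.2: once these global parameters are in memory, computing a parent needs no access to the stored nodes.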

2.4 Description of Multilevel ruid

In this section, we generalize the concept of the 2-level ruid. The idea is that the frame in the 2-level ruid is considered as an original tree, and a new frame of this tree is constructed in order to establish the 3-level ruid, and so on. The multilevel ruid may be used when the size of the frame is too large or when we need a more compact frame. Let us refer to T and the frames recursively built one upon the other as the data levels. We enumerate the levels such that the original T is level one, its frame is level two, and so on.

Definition 4. (Multilevel ruid) Given an XML tree T, the l-level ruid of a node n has the form

$$\{\theta, (\alpha_{l-1}, \beta_{l-1}), \ldots, (\alpha_2, \beta_2), (\alpha_1, \beta_1)\}$$

where, for $j = 1, \ldots, l-1$, $\alpha_j$ is the local index and $\beta_j$ is the root indicator of n in its UID-local area identified by $\{\theta, (\alpha_{l-1}, \beta_{l-1}), \ldots, (\alpha_{j+1}, \beta_{j+1})\}$ in level $j+1$, and θ is the original UID in level l. The symbols θ, $\alpha_i$, and $\beta_i$ have meanings similar to the first, second, and third components of the 2-level ruid.

Fig. 8. A multilevel ruid example (levels 2, 3, and 4 (top); the node n is labeled {8, (a, true)} and {2, (4, false), (a, true)})

Example 3. In Fig. 8, each polygon denotes a UID-local area. Suppose that, using the 2-level ruid, the node n has the identifier {8, (a, true)}, where the boolean value true indicates that n is the root of a UID-local area, 8 is the index of n in the second level's frame, and the integer a is the index of n in the upper UID-local area that has the index 2. Using the 3-level ruid, the index 8 is decomposed into (2, 4, false) and the full identifier of n becomes {2, (4, false), (a, true)}.

Construction of multilevel ruid: For a large XML tree, we consecutively build the UID levels, each created on top of the previous level. First, the 2-level ruid of the form $\{x_l, (\alpha_l, \beta_l)\}$ is constructed. If needed, the 3-level ruid of the form $\{x_{l-1}, (\alpha_{l-1}, \beta_{l-1}), (\alpha_l, \beta_l)\}$ is constructed, and so on. The process stops when the top level becomes small enough to be stored. In practice, only a few levels are required to encode a large XML tree.

3 Properties of Multilevel ruid

The multilevel ruid has several properties that are crucial for a numbering scheme to be applicable to the management of a large amount of XML data.

3.1 Scalability

The proposed ruid can be used to present the identifiers of nodes for arbitrarily large trees. If the number of nodes that can be enumerated by the original UID is denoted by e, then using the m-level ruid we can enumerate approximately $e^m$ nodes. Practically, the 2-level ruid is capable of enumerating any XML data set currently in use.

Furthermore, ruid reduces the number of virtual children to be added. Normally, the fan-outs of the nodes in a tree vary, and in many cases the disparity in fan-outs is very significant. Since the set of nodes in any UID-local area is a subset of the nodes of the entire XML tree, the maximal fan-out of each UID-local area fits the nodes in the area more closely than does the global maximal fan-out. By appropriately dividing an XML tree into UID-local areas and using local enumerating trees for the local nodes, we can avoid enumerating nodes having small fan-outs with a large-sized tree.

3.2 Robustness with Structural Update

In the original UID, if a new node is inserted into an XML tree when space is available, then the insertion causes the identifiers of the sibling nodes to the right of the inserted node, as well as those of their descendant nodes, to be modified. In the worst case, when the insertion increases the tree's maximal fan-out, the entire enumeration has to be performed again. The identifiers of all of the nodes must be changed, which leads to an expensive reconstruction.

The ruid copes better with structural updates of XML data than does the original UID. The scope of the identifier update due to a node insertion is reduced by a magnitude of two. If a node is inserted, at first only the nodes in the UID-local area where the update occurs need to be considered. If an appropriate space is available for the new node, then among the descendants of the sibling nodes to the right of the inserted node, only those that belong to the same UID-local area will have their identifiers modified. The nodes in the descendant areas are not affected because the frame F is unchanged. Otherwise, if no such space exists for the newly inserted node, then the fan-out of the tree used in enumerating the UID-local area must be enlarged. Rather than modifying the identifiers of every XML component, the enlargement changes only the identifiers of the nodes in this area. In both cases, since the size of a UID-local area is much smaller than the size of the entire data tree, the scope of the identifier update is greatly reduced.

The ruid deals similarly with the other structural operation, node deletion. Note that any node deletion in an XML tree is cascading; that is, all of the descendant nodes of the deleted node are also deleted. The change in the identifiers of the sibling nodes to the right of the deleted node affects only their descendant nodes that belong to the UID-local area in which the deletion occurs.

3.3 Parent-Child Relationship Determination

The ruid preserves an important property of the original UID whereby, given the identifier of a node, the parent node's identifier can be computed entirely in main memory without any I/O. The ancestor-descendant relationship can be examined based on parent-child determination. This property facilitates the evaluation of the structural part of XML queries, and is also important for the fast reconstruction of a portion of an XML document from a set of elements. The output is a portion of an XML document generated from these elements, respecting the ancestor-descendant order existing in the source data.

3.4 Determination of Preceding and Following Orders Using Frame

The organization of ruid by level provides an interesting feature in that the global index can be used to determine the relative position of two nodes located anywhere in the entire data tree. First, let us show the correspondence between the preceding (following, respectively) order of nodes and that of their projections to the set of children of their lowest common ancestor.

Lemma 2. Let $n_1$ and $n_2$ be two distinct nodes of an XML tree such that $n_1$ is neither an ancestor nor a descendant of $n_2$. Let c be the lowest common ancestor of $n_1$ and $n_2$. Also, let $c_1$ ($c_2$, respectively) be the child of c located on the path between c and $n_1$ ($n_2$, respectively). Then $n_1$ precedes (or follows, respectively) $n_2$ if and only if $c_1$ is a preceding (or following, respectively) sibling of $c_2$.

Proof. Because $n_1$ and $n_2$ are not in an ancestor-descendant relationship, $c_1$ and $c_2$ are not the same node (see Fig. 9(a)). A node precedes another node if the former is not an ancestor of the latter and comes before the latter in the preorder traversal. For a given node, the traversal passes all nodes of the induced subtree rooted at the node before leaving the subtree for its parent node. This means that if $c_1$ precedes (follows, respectively) $c_2$, then any node in the induced subtree rooted at $c_1$ precedes (follows, respectively) any node in the induced subtree rooted at $c_2$.

Fig. 9. Projection to the set of children of their lowest common ancestor
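Lemma 2 can be tried out directly on the original UID, where siblings are numbered from left to right, so comparing the two projected children reduces to comparing two integers. The sketch below is illustrative only; the function names and the fan-out k = 3 used in the usage note are assumptions.

```python
def uid_parent(i, k):
    """Formula (1) for the original UID."""
    return (i - 2) // k + 1


def root_path(i, k):
    """Identifiers on the path from the root (first) down to node i (last)."""
    path = [i]
    while i > 1:
        i = uid_parent(i, k)
        path.append(i)
    return path[::-1]


def precedes(n1, n2, k):
    """True if n1 precedes n2 in document order and neither node is an
    ancestor of the other (the setting of Lemma 2)."""
    for a, b in zip(root_path(n1, k), root_path(n2, k)):
        if a != b:
            # a and b are the children c1 and c2 of the lowest common
            # ancestor; sibling order equals numeric order in the UID scheme.
            return a < b
    # One path is a prefix of the other: ancestor-descendant or same node.
    return False
```

With k = 3, `precedes(5, 4, 3)` is true even though 5 > 4: node 5 is a child of node 2, which is a preceding sibling of node 4. The raw identifiers therefore do not give document order by themselves, whereas the projection onto the children of the lowest common ancestor does.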

The following lemma states the relationship between the global index and the preceding or following order in the frame.

Lemma 3. Given an XML tree T, a frame F of T, and two nodes $n_1$ and $n_2$ having the identifiers $(\theta_1, \alpha_1, \beta_1)$ and $(\theta_2, \alpha_2, \beta_2)$, the following claims hold:

- If $\theta_1$ is a preceding node of $\theta_2$ in F, then $n_1$ is a preceding node of $n_2$ in the entire T.
- If $\theta_1$ is a following node of $\theta_2$ in F, then $n_1$ is a following node of $n_2$ in the entire T.

Proof. We discuss the case in which $\theta_1$ precedes $\theta_2$. Let c denote the lowest common ancestor of the nodes corresponding to $\theta_1$ and $\theta_2$ in F, and let $c_1$ and $c_2$ denote the children of c on these node paths, respectively, as shown in Fig. 9(b). $\theta_1$ precedes $\theta_2$, and therefore, from Lemma 2, $c_1$ precedes $c_2$. The node path to a node inside an induced subtree always includes the root of the subtree; therefore, the node paths of $n_1$ and $n_2$ also include $c_1$ and $c_2$, respectively. Applying Lemma 2 again, we find that $n_1$ precedes $n_2$.

XPath Axes Expressiveness

In this section, we investigate the power of ruid to express XPath expressions. This property is important for the applicability of ruid in XML query processing. We consider XPath because it has become the standard on which many newly proposed XML query languages are based. Furthermore, XPath expressions have additional concepts specific to XML data, such as axes, that do not exist in regular path expressions.

XPath [13] is a language for addressing parts of an XML document, and was designed to be used by other languages such as XSLT and XPointer. In addition, XPath provides basic facilities for the manipulation of strings, numbers, and booleans in the logical structure of an XML document. One important kind of XPath expression is the location path. A location path selects a set of nodes relative to a context node.
The result of evaluating a location path is the node-set containing the nodes selected by the location path. We will focus only on the core rules of XPath, such as the following: [1] LocationPath ::= RelativeLocationPath AbsoluteLocationPath [2] AbsoluteLocationPath ::= / RelativeLocationPath? [3] RelativeLocationPath ::= Step RelativeLocationPath / Step Therefore, a location path can be written in the form: Step 1 τ 1 Step 2 τ 2 Step l (2) where l 0, can be an empty symbol (indicating that nothing appears) or /, τ i (i =1..l 1) is /, and Step i (i =1..l) isalocation step. A location step has three parts: 1) an axis, which specifies the hierarchical relationship between the nodes considered in the location step and the context node, 2) a node test, which

specifies the node type and expanded-name of the nodes selected by the location step, and 3) zero or more predicates to further refine the set of nodes. An initial node-set is generated from the axis and the node test and is then filtered by each of the predicates in turn. A predicate filters a node-set with respect to an axis to produce a new node-set.

As described above, generating and filtering the axes is essential in the evaluation of location steps in XPath expressions. The general task is as follows: given a context node n identified by $(\theta, \alpha, \beta)$, generate the set of nodes that belong to a specific axis of n and satisfy a condition C. The condition C may be to satisfy a logical expression related to data content, to belong to a specific element type, etc. Depending on the particular C, the processing order may be either (a) generating the set of nodes satisfying C and then checking which of them belong to the specified axis, or (b) generating the specified axis and then checking which nodes satisfy C. The first approach is good only when C is selective, so that the set of nodes satisfying C is small. The second approach is more generally applicable, and thus we focus on it.

We demonstrate the XPath axes expressiveness of ruid by proposing several routines to generate the axes. We limit the scope of discussion to the axes that specify sets of nodes in terms of the node position in XML documents. Because they are trivial, we exclude the -or-self variants of the axes from consideration. Specifically, the following axes will be considered: (1) parent and ancestor, (2) attribute, child, and descendant, (3) preceding-sibling and following-sibling, and (4) preceding and following.

Parent and Ancestor axes. As shown in Section 2.2, after loading the value $\kappa$ and the table K, the parent's identifier for a given node can be computed using rparent() in main memory.
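As a point of comparison for the original 1-level UID (this is not the ruid routine rparent() of Section 2.2), the parent formula can be derived from the child range $[(p-1)k+2,\ pk+1]$ that this section gives for a parent UID $p$ and fan-out $k$. The derivation below is a sketch; the symbol $c$ for a child's UID is introduced here only for illustration, and $c$ is assumed to be an integer.

```latex
\begin{align*}
(p-1)k + 2 \le c \le pk + 1
  &\iff (p-1)k \le c - 2 \le pk - 1 \\
  &\iff p - 1 \le \frac{c-2}{k} < p \\
  &\iff \left\lfloor \frac{c-2}{k} \right\rfloor = p - 1 \\
  &\iff p = \left\lfloor \frac{c-2}{k} \right\rfloor + 1 .
\end{align*}
```

Applying this formula repeatedly until UID 1 (the root of the enumerating tree) is reached yields all ancestors of a node in the 1-level case.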
The routine rancestor(n), used to generate the list of the ancestors of n, is a repetition of rparent(). Note that numbering schemes based on a loose hierarchical order require additional parameters to express the hierarchical level, such as grandparent or great-grandparent. This task can be accomplished much more simply using ruid. For example, consider an expression in abbreviated syntax such as element_1/*/element_2, which explicitly requires that exactly one element exist between element_1 and element_2. Naturally, we do not have to know the exact intermediate element. Using ruid, we can avoid scanning the entire collection of available elements to find the parent of element_2. We need only list the grandparents of the elements of type element_2, by applying rparent() twice, and exclude those that are not of type element_1.

Child and Descendant axes. In the 1-level UID, if p is the parent's UID, then the identifiers of its children belong to the range $[(p-1)k+2,\ pk+1]$, where k is the fan-out of the enumerating tree. In the 2-level ruid, the routine rchildren(n) to create the list L of possible children of n is as follows. First, use $\kappa$ and $\theta$ to compute the sorted list $L_1$ of children of $\theta$ in the frame of T. Let k denote the local fan-out corresponding to $\theta$, obtained from K. Let $L_2$ denote the list of integers in the interval $[2,\ k+1]$ if $\beta$ is true, or in the interval $[(\alpha-1)k+2,\ \alpha k+1]$ if $\beta$ is false. For each i in $L_2$, if there exists no $\theta$ in

$L_1$ such that $(\theta, i)$ is found in K as the global and local indices of a row, then add $(\theta, i, \text{false})$ to L, with the global index of n as the first component. Otherwise, add $(\theta', i, \text{true})$ to L, where $\theta'$ is the element of $L_1$ that was found. In order to confirm the existence of such an element, we first find in K the list of the local indices corresponding to the values in $L_1$ as the global indices. We then intersect this list with $L_2$. Note that both $L_1$ and K are sorted, so this process is fast.

The routine rdescendant(n) to generate the list of the descendants of n may be designed as a repetition of rchildren(). Another method is based on the following observation. Given two nodes $n_1$ and $n_2$, let $r_1$ and $r_2$ be the roots of the UID-local areas containing $n_1$ and $n_2$, respectively. If $r_1$ is a descendant of $n_2$, then $n_1$ is a descendant of $n_2$. Therefore, we first need to find the descendants of n inside its own UID-local area only, using rchildren(). Among these nodes, consider those that are roots of UID-local areas. In F, find all the nodes that are descendant-or-self of these roots. All nodes in the areas rooted at the newly found nodes are descendants of n.

Preceding-sibling and Following-sibling axes. We explain the routine rpsibling(n), which generates the list L of the preceding siblings of n. Using $\kappa$ and $\theta$, we generate the sorted list $L_1$ of child nodes of $\theta$ in the frame F of T. In the context UID-local area, compute the sorted list $L_2$ of the preceding siblings of $\alpha$. For each $\alpha_i$ in $L_2$, if there exists no $\theta_j$ in $L_1$ such that $(\theta_j, \alpha_i)$ is found in K as the global index and the local index of a row, then add $(\theta, \alpha_i, \text{false})$ to L. Otherwise, add $(\theta_j, \alpha_i, \text{true})$ to L. This argument is similar to the routine for the child and descendant axes. Similarly, we can design the routine rfsibling(n) to generate the list of the following siblings of n.

Preceding and Following axes. We will explain the routine for the preceding order.
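The preceding-order test for the original 1-level UID can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes a k-ary enumerating tree whose root has UID 1, so that the children of p occupy $[(p-1)k+2,\ pk+1]$, and it assumes that among siblings the smaller UID comes first in document order. The function names uid_parent, uid_path, and preceding_node are hypothetical.

```python
from typing import List, Optional


def uid_parent(uid: int, k: int) -> Optional[int]:
    """Parent UID in a k-ary enumerating tree whose root has UID 1 (assumed)."""
    if uid == 1:
        return None  # the root has no parent
    return (uid - 2) // k + 1


def uid_path(uid: int, k: int) -> List[int]:
    """Node path from the root (UID 1) down to `uid`, inclusive."""
    path = [uid]
    while path[-1] != 1:
        path.append(uid_parent(path[-1], k))
    path.reverse()
    return path


def preceding_node(n1: int, n2: int, k: int) -> Optional[int]:
    """Return whichever of n1, n2 precedes the other, or None if one of
    them is an ancestor of the other (or they are the same node)."""
    p1, p2 = uid_path(n1, k), uid_path(n2, k)
    # Walk both root paths in parallel; the last shared entry is the
    # lowest common ancestor.
    depth = 0
    while depth < min(len(p1), len(p2)) and p1[depth] == p2[depth]:
        depth += 1
    # One path is a prefix of the other: the common ancestor is n1 or n2.
    if depth == len(p1) or depth == len(p2):
        return None
    # Compare the children of the common ancestor on each path
    # (assumption: the smaller sibling UID comes first).
    c1, c2 = p1[depth], p2[depth]
    return n1 if c1 < c2 else n2
```

For example, with k = 3, nodes 6 (a child of 2) and 8 (a child of 3) diverge directly below the root, so preceding_node(6, 8, 3) reports node 6, whereas preceding_node(2, 6, 3) yields None because node 2 is an ancestor of node 6.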
Based on Lemma 2, the routine in Fig. 10 is designed to determine the preceding order between two nodes $n_1$ and $n_2$. This routine can be performed exclusively in main memory. A routine to determine the preceding order using ruid can be designed similarly.

We can apply Lemma 3 to design rpreceding(n). All nodes that belong to the UID-local areas preceding the area containing n are preceding nodes of n. Hence, we need only check inside the UID-local areas that are ancestors of the area containing n. We omit the details of the algorithm here.

In general, the multilevel ruid has the following property: for the preceding and following axes, the relative position of two nodes can be determined by the first component of their multilevel ruids that differs and for which the preceding-following order is decidable. In the 2-level ruid, the orders among nodes are reflected in the frame F. We can use this property to accelerate the axis constructions.

4 Applications of ruid

In this section, we briefly discuss possible applications of ruid in processing XML data. A detailed investigation of these applications is being conducted.

Managing large XML trees. The ruid is a realistic method to process large XML documents. We believe that this property enables management of various data sources scattered over several sites on a network. With respect to application,

ruid allows full enumeration of all components of XML document trees generated by parsers based on the Document Object Model [14], without the need for additional software modules as required by the original UID.

    Input: An XML tree T, nodes n_1 and n_2
    Output: The preceding node p between n_1 and n_2
     1. Compute the sorted set A_1 of ancestors of n_1
     2. Compute the sorted set A_2 of ancestors of n_2
     3. Compare A_1 and A_2 to determine the lowest common ancestor c of n_1 and n_2
     4. if (c is n_1 or n_2) then
     5.     p := null
     6. else
     7.     Determine the children c_1 and c_2 of c in A_1 and A_2
     8.     Compare the UIDs of c_1 and c_2 to get p
     9. end
    10. return p

Fig. 10. Routine to determine the preceding order in 1-level UID

Generating stable identifiers. ruid generates identifiers that require little recomputation when structural updates, such as node insertion or node deletion, occur. Therefore, ruid can be used in applications that manage data with frequent structural updates.

Query evaluation. As shown in Section 3.5, ruid is an effective tool to express the structural part of XPath expressions. The axes of a node can be constructed if the nodes are identified by ruid. This property facilitates an efficient method by which to process queries on XML data.

Database file/table selection. In some applications, the data files or tables may be very large, and query evaluation therefore becomes slow. Decomposing the data into smaller tables thus becomes necessary in order to speed up the queries. However, the decomposition raises the question of how to choose the correct data files or tables from which to select the candidates. One solution is to build the name of each data file or table from two parts: the first part is extracted from a text value such as the element or attribute name, and the second part is the common global index of the ruids of the items.
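The two-part naming idea for file/table selection can be illustrated with a short Python sketch. Everything specific here is an assumption made for illustration: the tuple layout (global index, local index, root indicator), the underscore-joined name format, and the function names table_name and partition are hypothetical and are not taken from the experiments reported in this paper.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Assumed layout of a 2-level ruid: (global index, local index, root indicator).
Ruid = Tuple[int, int, bool]


def table_name(element_name: str, ruid: Ruid) -> str:
    """Hypothetical naming rule: element name plus the item's global index."""
    global_index = ruid[0]
    return f"{element_name}_{global_index}"


def partition(items: List[Tuple[str, Ruid]]) -> Dict[str, List[Ruid]]:
    """Group (element name, ruid) pairs into per-table lists of identifiers."""
    tables: Dict[str, List[Ruid]] = defaultdict(list)
    for name, ruid in items:
        tables[table_name(name, ruid)].append(ruid)
    return dict(tables)
```

Under this rule, a query that is known to target elements of one name within one global index only needs to open the single table whose name combines the two, rather than scanning all tables.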
5 Preliminary Experiment Evaluation

In this section, we report observations made during preliminary experiments conducted in the early stage of this study. We conducted a number of tests to generate the UID and ruid for several sample XML documents and to process simple queries. The application programs were written in Java and were connected to an RDBMS through JDBC-ODBC. All of these test components ran on the Windows XP Professional operating system.

From these preliminary tests, we made the following observations. First, ruid is more capable than the original UID with respect to the enumeration of nodes of large trees, for example, trees having a high degree of recursion. Second, even though the function to find the parent node's identifier from a child node's identifier in ruid is more complicated

than the one in the original UID, the difference is not significant because the computation occurs mostly in main memory. Third, querying speed using ruid in main memory is quite competitive. However, because an RDBMS was used to store and index the data at the time of the preliminary tests, the connection between the test programs and the RDBMS was slow in comparison to the computing speed.

6 Related Works

Several structural summaries for semistructured data, a general form of XML data, have been introduced [1,4,9]. Structural information, such as node paths, is extracted from the data source, classified, and then represented in a structure graph. The graph can be used both as an indexing structure and as a guide by which users can perform meaningful and valid queries.

A method to determine the ancestor-descendant relationship using preorder and postorder traversal has been introduced in [3]. Extensions of the preorder and postorder traversal numbering scheme have been presented in [6,2]. Another approach, described in [11], uses the position and depth of a tree node to index XML elements.

Management of identifiers by XID-map has been discussed in [8]. The XID-map provides identification for nodes in the change management of XML documents. A possible variant of the XID-map is based on node positions within a tree. For instance, for indexing purposes the triplet (prefix, postfix, level coding) is used. However, as mentioned in [8], these identifications are not robust.

The UID technique was first introduced in [7]. Some applications of the original technique were proposed in [5,10], in which the numbering scheme was implemented to facilitate indexing. In these studies, the problems of structural update and identifier overflow were not discussed.

7 Discussion and Conclusion

Application of numbering schemes in processing XML data is an effective technique.
The technique makes it possible to achieve two goals: generating identifiers for XML components and providing the structural information of XML documents. Among the several proposed numbering schemes, the UID technique is promising because it allows the parent-child relationship to be determined effectively. However, the original numbering scheme is ineffective when dealing with structural updates. Furthermore, in many cases the original UID uses up too many identifier values and requires extra tools for processing.

In this study, we proposed a multilevel recursive numbering scheme called ruid. While preserving the efficient properties of the original UID, ruid is more robust under structural updates and enables the coding of arbitrarily large XML documents. In addition, ruid can express the axes of XPath expressions. Preliminary experiments have shown that ruid can be applied to process queries on XML

data. Extensions of the present study are in progress, including performance experiments using various configurations. The refinement of the experiment scheme and more detailed tests are currently in progress.

References

1. P. Buneman, S. Davidson, M. Fernandez, D. Suciu. Adding Structure to Unstructured Data. Proc. of the ICDT, Greece.
2. S. Chien, V.J. Tsotras, C. Zaniolo, D. Zhang. Storing and Querying Multiversion XML Documents using Durable Node Numbers. Proc. of the Inter. Conf. on WISE, Japan.
3. P.F. Dietz. Maintaining order in a linked list. Proceedings of the Fourteenth ACM Symposium on Theory of Computing, California.
4. R. Goldman, J. Widom. DataGuides: enabling query formulation and optimization in semistructured databases. Proc. of the Inter. Conf. on VLDB.
5. H. Jang, Y. Kim, D. Shin. An Effective Mechanism for Index Update in Structured Documents. Proc. of CIKM, USA.
6. Q. Li, B. Moon. Indexing and Querying XML Data for Regular Path Expressions. Proc. of the Inter. Conf. on VLDB, Italy.
7. Y.K. Lee, S-J. Yoo, K. Yoon, P.B. Berra. Index Structures for structured documents. ACM First Inter. Conf. on Digital Libraries, Maryland, 91-99.
8. A. Marian, S. Abiteboul, G. Cobena, L. Mignet. Change-Centric Management of Versions in an XML Warehouse. Proc. of the Inter. Conf. on VLDB, Italy.
9. T. Milo, D. Suciu. Index Structures for Path Expression. Proc. of the ICDT.
10. D. Shin. XML Indexing and Retrieval with a Hybrid Storage Model. J. of Knowledge and Information Systems, 3.
11. C. Zhang, J. Naughton, D. DeWitt, Q. Luo, G. Lohman. On Supporting Containment Queries in Relational Database Management Systems. Proc. of the ACM SIGMOD, USA.
12. World Wide Web Consortium. Extensible Markup Language (XML).
13. World Wide Web Consortium. XML Path Language (XPath).
14. World Wide Web Consortium. Document Object Model (DOM) Level 2 Core Specification.


A Structural Numbering Scheme for XML Data. Alfred M. Martin, WS2002/2003, February/March 2003. Based on work from the EDBT 2002 Workshops by Dao Dinh Kha, Masatoshi Yoshikawa, and Shunsuke Uemura.