A Structural Numbering Scheme for XML Data

Dao Dinh Kha (1), Masatoshi Yoshikawa (1,2), and Shunsuke Uemura (1)

(1) Graduate School of Information Science, Nara Institute of Science and Technology, Takayama, Ikoma, Nara, Japan
(2) Information Technology Center, Nagoya University, Furo-cho, Chikusa-ku, Nagoya, Japan

Abstract. Identifier generation is a common but crucial task in many XML applications. In addition, the structural information of XML data is essential for evaluating XML queries. To meet both requirements, several numbering schemes, including the powerful UID technique, have been proposed. In this paper, we introduce a new numbering scheme based on the UID technique, called multilevel recursive UID (ruid). The proposed ruid is robust, scalable and hierarchical. ruid generates identifiers by level and takes into account the XML tree topology. ruid not only enables the computation of the parent node's identifier from the child node's identifier, as in the original UID, but also deals effectively with XML structural updates and can be applied to arbitrarily large XML documents. In addition, we investigate the effectiveness of ruid in representing the XPath axes and in query processing, and briefly discuss other applications of ruid.

1 Introduction

Extensible Markup Language (XML) [12] has been accepted as a standard for information exchange over the Internet and is supported by major software vendors. The main components of an XML document are elements of various lengths and positions within a hierarchical structure. In order to process XML data, XML elements must be assigned unique identifiers. Therefore, identifier generation is a common but crucial task in many XML applications. The method by which the task is accomplished significantly affects the organization and storage of data, the construction of indices, and the processing of queries.
(A.B. Chaudhri et al. (Eds.): EDBT 2002 Workshops, LNCS 2490, pp , © Springer-Verlag Berlin Heidelberg 2002)

Unlike relational databases, in which data are projected into relations using fixed schemes, the structure of an XML document may change. To process queries on XML data effectively, the structural information of XML documents is essential. Therefore, numerous studies have examined how to present the logical structure of XML data in a concise manner. Since the structure of XML data can be viewed as a tree, a numbering scheme can be used to represent the structure. Normally, a numbering scheme is a method of generating the identifiers of the elements in such a manner that the hierarchical

orders of the elements can be re-established based on their identifiers. For a numbering scheme, the ability to express orders, for example, parent-child, ancestor-descendant, and preceding-following, is essential. Hierarchical orders are used extensively in processing XML data; hence it is desirable to reduce the computational workload of re-establishing the hierarchy as far as possible.

In [3,8,6,7], several numbering schemes for XML data have been proposed, in which an identifier is either an integer or a combination of integers. A hierarchical order between two elements exists if and only if their identifiers observe a predefined numerical order described by a formula. Among these schemes, the technique referred to as the Unique Identifier (UID) [7] enumerates nodes using a k-ary tree, where k is the maximal fan-out of the nodes. Every internal node is given the same fan-out k by adding virtual children where needed. Consecutive integers starting from 1 are assigned to the nodes, including the virtual nodes, in order from top to bottom and from left to right in each level. Other numbering schemes can only compare two identifiers that are already known in order to determine the parent-child relationship. The UID technique, by contrast, has an interesting property whereby the identifier of the parent node can be computed from the identifier of the child node. Given a node having the identifier i, we can compute the identifier of the parent of the node using the formula:

$parent(i) = \lfloor (i-2)/k \rfloor + 1$    (1)

Using this property, given two nodes, the question of whether one node is an ancestor of another node can be easily answered based on their identifiers. This also allows the identifiers of the ancestors of a node to be generated quickly. This is a promising property for the evaluation of the structural part in XML queries.
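Formula (1) and the ancestor test it enables can be illustrated with a minimal Python sketch. The function names (`uid_parent`, `uid_ancestors`, `is_ancestor`) are hypothetical and chosen only for this illustration; the sketch assumes the root has UID 1 and that k >= 1.

```python
def uid_parent(i: int, k: int) -> int:
    """Formula (1): UID of the parent of node i in a k-ary enumeration."""
    if i <= 1:
        raise ValueError("the root (UID 1) has no parent")
    return (i - 2) // k + 1


def uid_ancestors(i: int, k: int) -> list[int]:
    """UIDs of all ancestors of node i, nearest ancestor first."""
    result = []
    while i > 1:
        i = uid_parent(i, k)
        result.append(i)
    return result


def is_ancestor(a: int, d: int, k: int) -> bool:
    """True if node a is a proper ancestor of node d."""
    # Ancestor UIDs are strictly smaller, so climbing can stop at a.
    while d > a:
        d = uid_parent(d, k)
    return d == a and a in uid_ancestors_guard(a, d)


def uid_ancestors_guard(a: int, d: int) -> list[int]:
    # Helper for is_ancestor: after climbing, d == a means a was reached.
    return [a] if d == a else []
```

For example, with k = 3, `uid_ancestors(23, 3)` climbs from 23 to 8, then 3, then the root 1, using integer arithmetic only and no access to the stored data.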
Moreover, ascertaining the identifiers of data items prior to loading data from the disk can help to reduce disk access.

However, despite this useful property, the original UID technique has some drawbacks, especially regarding structural update. In practice, the content and structure of an XML document may be updated frequently. When a new node is inserted, a new identifier must be assigned to the node. Because the UID technique enumerates nodes from left to right, the identifiers of siblings must remain consecutive; therefore the new node takes the identifier of the node that it pushes to the right, and the identifiers of all the sibling nodes to the right of the inserted node are increased by one. Because the identifier of a child node depends strongly on the identifier of its parent node in the original UID technique, the identifiers of the descendants of all the sibling nodes to the right of the inserted node will also change. The nearer to the root node the new node is inserted, the larger the scope of the identifier modification. Furthermore, the fact that the UID technique enumerates nodes using one tree of fixed fan-out k also contributes to the update problem. This problem becomes more serious when the number of child nodes of a node becomes larger than the pre-defined value k, because the initial value of k is exceeded

and there is no space for a new child node. This situation may occur frequently in practice. Modifying k results in an overhaul of the identifier system in which the identifiers of all nodes must be recomputed. The reconstruction is costly and may severely degrade system performance, especially for large XML documents. In general, since any change in the identifiers usually triggers a costly reconstruction, it is desirable to reduce the scope of the identifier update as far as possible.

Figure 1 illustrates a node insertion and the consequent changes in identifiers. In Fig. 1(a), the nodes of a tree are enumerated using the original UID method. Virtual nodes are denoted by dotted lines. Suppose that a node is inserted between nodes 2 and 3. The new enumeration is shown in Fig. 1(b). The previous nodes 3, 8, 9, 23, 26 and 27 are renumbered as nodes 4, 11, 12, 32, 35, and 36, respectively. If another node is inserted behind the new node 4 in Fig. 1(b), the entire tree must be renumbered.

Fig. 1. An original UID before and after a node insertion: (a) before insertion, (b) after insertion

In addition, the original UID technique not only enumerates the real nodes but also reserves identifiers for the virtual nodes. Because the fan-outs of nodes vary, the UID technique may enumerate a large number of virtual nodes. Identifier values grow exponentially: the base is the maximal fan-out of the nodes and the exponent is the length of the longest path in the tree. Therefore, in many cases, the value easily exceeds the maximal manageable integer value, even when the real nodes in the data source are few. Additional purpose-specific libraries are necessary to deal with the oversized values, but they incur extra computation cost. The above-mentioned drawbacks make the original UID technique impractical in several cases.
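The renumbering in Fig. 1 and the growth of identifier values can be reproduced with a short Python sketch. The helper names (`renumber`, `max_uid`) are hypothetical, and the fan-out k = 3 is not stated in the text; it is inferred here from the identifiers quoted for Fig. 1 (for instance, nodes 8 and 9 being children of node 3).

```python
def uid_parent(i: int, k: int) -> int:
    # Formula (1) from the text.
    return (i - 2) // k + 1


def first_child(p: int, k: int) -> int:
    # Smallest UID among the k child slots of node p.
    return (p - 1) * k + 2


def renumber(node: int, old_root: int, new_root: int, k: int) -> int:
    """New UID of `node` when the root of its subtree moves from
    `old_root` to `new_root` (e.g. after a sibling is inserted)."""
    slots = []  # child slot (0..k-1) taken at each step, bottom-up
    while node != old_root:
        if node < old_root:
            raise ValueError("node is not in the subtree of old_root")
        parent = uid_parent(node, k)
        slots.append(node - first_child(parent, k))
        node = parent
    new_id = new_root
    for slot in reversed(slots):  # replay the same path below new_root
        new_id = first_child(new_id, k) + slot
    return new_id


def max_uid(k: int, depth: int) -> int:
    """Largest UID reserved by a complete k-ary tree (k >= 2) with
    `depth` levels, virtual nodes included."""
    return (k ** depth - 1) // (k - 1)
```

Under these assumptions, moving node 3 to position 4 maps 8, 9, 23, 26 and 27 to 11, 12, 32, 35 and 36, matching the values given for Fig. 1(b). As an illustration of the overflow issue, a fan-out of 100 with 11 levels already reserves identifiers beyond the range of a signed 64-bit integer, regardless of how many real nodes exist.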
In this paper, we propose a new numbering scheme called the recursive UID (ruid), which is an extended version of the UID technique. The newly proposed ruid technique has been designed to eliminate the above-mentioned drawbacks. Specifically, we grade and localize the value k, the fan-out of the enumerating tree. The node identifiers are created by levels. In each level, the fan-out of the enumerating tree may vary based on the local topology of the XML

tree. This approach preserves the advantages of the original UID technique while reducing its drawbacks. The main features of the ruid technique are as follows:

1. Parent-child determination property: given the identifier of a node, the parent node's identifier can be efficiently computed. Using small-size global information stored in main memory, the new technique allows the ancestor-descendant relationship to be determined without any I/O.
2. Robustness for structural change: the scope of data amendment when a structural update occurs is effectively reduced.
3. Scalability: the new presentation can overcome the identifier limitation of the original UID technique and can be applied to arbitrarily large XML documents.
4. Structural richness: ruid is effective in representing the structural components in XPath expressions.

The remainder of the paper is organized as follows. Section 2 describes the 2-level recursive UID technique and its multilevel version. Section 3 discusses the main properties of the new numbering scheme. Section 4 outlines possible practical applications of the technique. Section 5 presents a number of observations made during our preliminary experiments. Section 6 briefly reviews related work. Section 7 concludes this paper and presents suggestions for future research.

2 Multilevel Recursive UID

In this section, we present the multilevel recursive UID. We first introduce the 2-level ruid numbering scheme. We shall describe ruid based on the notation for an XML tree.

2.1 Description of the 2-Level ruid

Given an XML tree, the 2-level ruid numbering scheme manages the identifiers of nodes in global and local levels. The set of nodes is divided into subsets, the identifiers of which are created in the global level, whereas the nodes of each subset are managed in the local level. We generate a number of parameters that are used in both the global and local levels.
The additional data is small enough to be comfortably loaded into main memory, allowing fast access when navigating inside the XML tree. Concretely, the construction of the 2-level ruid numbering scheme for an XML tree consists of the following steps: (1) partition the XML tree into areas, each of which is an induced subtree of the XML tree; (2) enumerate the newly created areas according to the original UID scheme in order to generate the global indices; (3) for each area, enumerate the nodes of that area in order to generate the local indices; and (4) compose the synthetic identifiers of nodes from the global and local indices. The 2-level ruid is formalized by the following three definitions:

Definition 1. (A frame) Given an XML tree T rooted at r, a frame F is a tree: (1) rooted at r, (2) the node set of which is a subset of the node set of T, and (3) for any two nodes u and v in the frame, an edge exists connecting the nodes if and only if one of the nodes is an ancestor of the other in T and there is no other node x that lies between u and v in T and belongs to the frame.

A tree and one of its frames are shown in Fig. 2(a). The dotted arrows connect the corresponding nodes between these trees.

Fig. 2. Frame and UID-local area: (a) a source tree and a frame, (b) a UID-local area

Definition 2. (UID-local area) Given an XML tree T rooted at r, a frame F of T, and a node n of F, a UID-local area of n is an induced subtree of T rooted at n such that each of the subtree's node paths is terminated either by a child node of n in F or by a leaf node of T, if between the leaf node and n in T there exists no other node that belongs to F.

A UID-local area is depicted in Fig. 2(b). We cover an XML tree using a set of UID-local areas such that the intersection of any two of these areas is either empty or consists of only one node from the frame. Hereafter, let us refer to the full identifier of a node as its identifier and the number assigned to a node locally inside a UID-local area as its index. Let κ denote the maximal fan-out of nodes in F. We use a κ-ary tree to enumerate the nodes of F and let the number assigned to each node in F be the index of the UID-local area rooted at the node. The 2-level ruid numbering scheme based on F is defined as follows:

Definition 3. (2-level ruid) The full 2-level ruid of a node n is a triple $(g_i, l_i, r_i)$, where $g_i$, $l_i$, and $r_i$ are called the global index, local index, and root indicator, respectively.
If n is a non-root node, then $g_i$ is the index of the UID-local area containing n, $l_i$ is the index of n inside the area, and $r_i$ is false. If n is the root node of a UID-local area, then $g_i$ is the index of the area, $l_i$ is the index of n as a leaf node in the upper UID-local area, and $r_i$ is true. The identifier of the root of the main XML tree is (1, 1, true).

From these definitions, the ruid of a node in an XML tree is determined uniquely. The values of the first and second components of the identifier of a node should be interpreted based on whether the node is an area-root node. This information is indicated by the third component of the identifier. For implementation, the global and local indices can be expressed by integers, whereas a boolean value is sufficient to express the root indicator. If XML data is stored in an RDBMS, one way to express ruid is to use three separate fields, whose types correspond to the component types, to store the components of the ruid. The data items are sorted first by the global index, and then by the local index.

Input: An XML tree T
Output: The 2-level ruid identifiers of nodes in T
//Global enumeration
1. Partition the XML tree into UID-local areas and build the frame F upon their roots
2. Find the maximal fan-out κ of F
3. Compute the global indices g_i using the κ-ary tree presentation of F
//Local enumerations
4. foreach i-th UID-local area
5.   find the local maximal fan-out, denoted by k_i
6.   compute the local indices l_ij of nodes in the area via a k_i-ary tree
7.   if l_ij = 1 then
8.     recompute l_ij in the upper UID-local area
9.     r_ij := true
10.    update K using (g_i, l_ij, k_i)
11.  else
12.    r_ij := false
13.  end
14.  Generate the identifiers of the nodes from (g_i, l_ij, r_ij)
15. end
16. Save κ and K

Fig. 3. Outline of the algorithm used to compute 2-level ruid

We have established a one-to-one mapping between the nodes of F and UID-local areas. Therefore, for two UID-local areas, we refer to an area as preceding the other area if the root of the former precedes the root of the latter in F. The other orders among the UID-local areas, such as ancestor, descendant, and following, are determined similarly. We construct a table K having three columns: global index, local index, and fan-out.
Each row of K corresponds to a UID-local area and contains the global index of the area, the index of the area's root in the upper area, and the maximal fan-out of nodes in the corresponding area, respectively. The table K is sorted according to the global index. The value κ and the table K are global parameters, which are loaded into main memory while traversing T. The process used to compute 2-level ruid is briefly shown in Fig. 3.

Example 1. Fig. 4 depicts examples of the original UID and the new 2-level ruid. In the tree shown on the left, the number inside each node is its original UID. In the tree shown on the right, the integers shown inside each node are the global and local indices of the node's 2-level ruid. Rather than showing root indicators, the root nodes are encircled by bold circles and the non-root nodes

are encircled by fine-lined circles. Using the ruid, the global fan-out κ is 4 and six UID-local areas exist. The table K of the global parameters is shown in Fig. 5.

Fig. 4. An original UID and its corresponding 2-level ruid counterpart

Fig. 5. Global parameter table for the 2-level ruid shown in Fig. 4(b) (columns: global index, local index, local fan-out)

2.2 Parent-Child Relationship in 2-Level ruid

Formula (1) can be used to check the parent-child relationship in the original UID. For the 2-level ruid, we need a more sophisticated function, denoted herein by rparent(), in the form of an algorithm. First, let us show that 2-level ruid enables the parent-child relationship to be determined.

Lemma 1. Given an XML tree T and a node n, based on the value κ and the table K, the identifier of the parent of n can be computed if the identifier of n is known.

Proof. Let the parent node of n be denoted by p. Since the intersection of any two UID-local areas is either empty or consists of only one node that is the root of the lower area, it is sufficient to consider the following cases. First, if n and p belong to the same UID-local area and p is not the root of this area, then these nodes have the same global index. The local index of p can be computed using

formula (1), where i is replaced by the local index of n, k is replaced by the maximal fan-out of the UID-local area that contains n, and k can be obtained from the table K. Second, if n and p belong to the same UID-local area, but p is the root of this area (this means that the value $\lfloor (l-2)/k \rfloor + 1$ is equal to 1), then the global index of p can be computed using formula (1), where i is replaced by the global index of n, and k is replaced by κ. The local index of p can be obtained from the table K. Third, if p belongs to an upper UID-local area and n is the joint of the upper and lower UID-local areas corresponding to a pair of parent and child nodes in the frame of T, then the global index of p can be computed using the above-mentioned formula, where i is replaced by the global index of n, and k is replaced by κ. Because both p and n belong to the same UID-local area, the index of which is known, the local index of p is determined in a manner similar to that used in the first case. If the result is equal to 1, then the local index of p must be obtained from the table K.

The algorithm by which to determine the parent's identifier from a node's identifier is shown in Fig. 6. We illustrate this algorithm through Example 2.

Input: An XML tree T, its κ and K, and the 2-level ruid (g_i, l_i, r_i) of a node
Output: The 2-level ruid (g, l, r) of the parent node
1. if (r_i == true) then
2.   g := floor((g_i - 2)/κ) + 1
3. else
4.   g := g_i
5. end
6. get the fan-out k_j of the row with the global index g in K
7. l := floor((l_i - 2)/k_j) + 1
8. if (l == 1) then
9.   set l equal to the local index of the row with the global index g in K
10.  r := true
11. else
12.  r := false
13. end
14. return (g, l, r)

Fig. 6. rparent() - the algorithm to compute the parent's 2-level ruid of a node

Example 2. Suppose that κ equals 4 and the table K is given in Fig. 5. Let c and p denote a node and its parent node, respectively.
We illustrate how to determine the identifier of p from the identifier of c by considering several configurations of the child node:

c is the non-root node (2, 7, false): From the second line of K we know that the local fan-out of the UID-local area containing c is 2. Therefore, the local index of the identifier of p is $\lfloor (7-2)/2 \rfloor + 1$, which is equal to 3. Hence, p is the non-area-root node (2, 3, false). In Fig. 4, the node p is depicted by a fine-lined circle containing the numbers (2, 3).

c is the root node (10, 9, true): the upper UID-local area containing p must be determined. Because κ equals 4, the upper UID-local area's index

is $\lfloor (10-2)/4 \rfloor + 1$, or 3. The local fan-out of the UID-local area is shown in the third line of K and is equal to 3. The local index of p is $\lfloor (9-2)/3 \rfloor + 1$, which is equal to 3. The value is greater than 1, so p is the non-area-root node (3, 3, false).

c is the non-root node (3, 3, false): from the third line of K we know that the local fan-out of the UID-local area containing c is equal to 3, so the index of p in the UID-local area is $\lfloor (3-2)/3 \rfloor + 1$, which is equal to 1. This means that p is the root of the considered UID-local area. Therefore, the local index of p must equal the index of the node in the upper UID-local area. From K, the value is found to be 3, and p is the area-root node (3, 3, true).

Note that if the value κ together with the table K are known and are loaded into main memory, then all of the steps in the algorithm rparent() can be performed completely inside main memory without any disk I/O.

2.3 Adjustment of the Maximal Fan-out of Frame

The maximal fan-out κ of the frame F should not be greater than the maximal fan-out of the source data tree. However, in naive partitions of an XML tree into UID-local areas, the maximal fan-out κ of the frame F may exceed the maximal fan-out of the XML tree. Such a partition is illustrated in Fig. 7(a). Suppose that the maximal fan-outs of the subtrees rooted at $u_1$, $u_2$, and $u_3$ are less than or equal to 4. Although the node $n_1$ is not an area-root node, this node has three area-root descendants $u_1$, $u_2$, and $u_3$ in three separate paths. In the frame F, these three nodes are connected directly to n, and n has six children, as shown in Fig. 7(b). Therefore, the maximal fan-out of the frame is larger than that of the source data.

Fig. 7. Adding a marked node in order to reduce the fan-out

A simple solution in this case is to make the node $n_1$ an area-root node, as shown in Fig. 7(c).
Generally, if necessary, we can add further area-root nodes to reduce the value of κ. This trick guarantees that the fan-out of the frame is always less than or equal to the fan-out of the source XML tree. We omit the technical details of the solution here.
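Returning to the rparent() algorithm of Fig. 6 in Section 2.2, the following Python sketch shows one way the steps could be coded. The names `Ruid`, `AreaRow`, and `rparent`, and the dictionary layout of K, are assumptions made for this illustration. Only the entries of K that are quoted in Example 2 are used (fan-out 2 for area 2; local index 3 and fan-out 3 for area 3); the local index of area 2's root is not given in the text and the value below is a placeholder.

```python
from typing import NamedTuple


class Ruid(NamedTuple):
    g: int        # global index
    l: int        # local index
    root: bool    # root indicator


class AreaRow(NamedTuple):
    local_index: int  # index of the area's root in the upper area
    fan_out: int      # maximal fan-out inside the area


KAPPA = 4
K = {
    2: AreaRow(local_index=2, fan_out=2),  # local_index is a placeholder
    3: AreaRow(local_index=3, fan_out=3),  # values quoted in Example 2
}


def rparent(node: Ruid, kappa: int, table: dict[int, AreaRow]) -> Ruid:
    """Parent's 2-level ruid, following the steps of Fig. 6."""
    if node == Ruid(1, 1, True):
        raise ValueError("the root of the XML tree has no parent")
    # Steps 1-5: an area root's parent lives in the upper area.
    g = (node.g - 2) // kappa + 1 if node.root else node.g
    # Steps 6-7: formula (1) with the local fan-out of area g.
    row = table[g]
    l = (node.l - 2) // row.fan_out + 1
    # Steps 8-13: local index 1 means the parent is the area root.
    if l == 1:
        return Ruid(g, row.local_index, True)
    return Ruid(g, l, False)
```

With these inputs the sketch reproduces the three cases of Example 2: (2, 7, false) yields (2, 3, false), (10, 9, true) yields (3, 3, false), and (3, 3, false) yields (3, 3, true). Every step is a table lookup or integer division, which is consistent with the observation that no disk I/O is required once κ and K are in memory.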

2.4 Description of Multilevel ruid

In this section we generalize the concept of the 2-level ruid. The idea is that the frame in the 2-level ruid is considered as an original tree, and a new frame of this tree is constructed in order to establish the 3-level ruid, and so on. The multilevel ruid may be used when the size of the frame is too large or when we need a more compact frame. Let us refer to T and the frames recursively built one upon the other as the data levels. We enumerate the levels such that the original T is level one, its frame is level two, and so on.

Definition 4. (Multilevel ruid) Given an XML tree T, the l-level ruid of a node n has the form:

$\{\theta, (\alpha_{l-1}, \beta_{l-1}), \ldots, (\alpha_2, \beta_2), (\alpha_1, \beta_1)\}$

where, for $j = 1, \ldots, l-1$, $\alpha_j$ is the local index and $\beta_j$ is the root indicator of n in its UID-local area identified by $\{\theta, (\alpha_{l-1}, \beta_{l-1}), \ldots, (\alpha_{j+1}, \beta_{j+1})\}$ in the level j+1, and θ is the original UID in the level l. The symbols θ, $\alpha_i$, and $\beta_i$ have meanings similar to the first, second, and third components of 2-level ruid.

Fig. 8. A multilevel ruid example (levels 2, 3, and 4 (top); the node n is labeled {8,(a,true)} and {2,(4,false),(a,true)})

Example 3. In Fig. 8, each polygon denotes a UID-local area. Suppose that, using 2-level ruid, the node n has the identifier {8, (a, true)}, where the boolean value true indicates that n is the root of a UID-local area, 8 is the index of n in the second level's frame, and the integer a is the index of n in the upper UID-local area that has the index 2. Using 3-level ruid, the index 8 is decomposed into (2, 4, false) and the full identifier of n becomes {2, (4, false), (a, true)}.

Construction of multilevel ruid: For a large XML tree, we consecutively build the UID levels, each created on top of the previous level. First, the 2-level ruid of the form $\{x_l, (\alpha_l, \beta_l)\}$ is constructed. If needed, the 3-level ruid of the form $\{x_{l-1}, (\alpha_{l-1}, \beta_{l-1}), (\alpha_l, \beta_l)\}$ is constructed, and so on.
The process stops when the top level becomes small enough to be stored. In practice, this requires only a few levels to encode a large XML tree.

3 Properties of Multilevel ruid

The multilevel ruid has several properties that are crucial for a numbering scheme to be applicable to the management of a large amount of XML data.

3.1 Scalability

The newly proposed ruid can be used to present the identifiers of nodes for arbitrarily large trees. If the number of nodes that can be enumerated by the original UID is denoted by e, then using m-level ruid, we can enumerate approximately $e^m$ nodes. Practically, 2-level ruid is capable of enumerating any XML data set currently in use. Furthermore, ruid reduces the number of virtual children to be added. Normally, the fan-outs of nodes in a tree vary. In many cases, the disparity in fan-outs is very significant. Since the set of nodes in any UID-local area is a subset of the nodes of the entire XML tree, the maximal fan-out of each UID-local area fits the nodes in the area more closely than does the global maximal fan-out. By appropriately dividing an XML tree into UID-local areas, and using local enumerating trees for enumerating local nodes, we can avoid enumerating nodes having small fan-outs by a large-sized tree.

3.2 Robustness with Structural Update

In the original UID, if a new node is inserted into an XML tree when space is available, then the insertion causes the identifiers of the sibling nodes to the right of the inserted node, as well as those of their descendant nodes, to be modified. In the worst case, when the insertion increases the tree's maximal fan-out, the entire enumeration has to be performed again. The identifiers of all of the nodes must be changed, which leads to an expensive reconstruction. The ruid copes better with structural updates of XML data than does the original UID. The scope of identifier update due to a node insertion is reduced by a magnitude of two. If a node is inserted, at first only the nodes in the UID-local area where the update occurs need to be considered.
If an appropriate space is available for the new node, then among the descendants of the sibling nodes to the right of the inserted node, only those which belong to the same UID-local area will have their identifiers modified. The nodes in the descendant areas are not affected because the frame F is unchanged. Otherwise, if such a space does not exist for the newly inserted node then the fan-out of the tree used in enumerating the UID-local area must be enlarged. Rather than modifying the identifiers of every XML component, the enlargement changes only the identifiers of the nodes in this area. In both cases, since the size of an UID-local area is much smaller than the size of the entire data set tree, the scope of the identifier update is greatly reduced. Similarly, the new ruid deals with another structural operation called node deletion. Note that any node deletion in an XML tree is cascading. That means all of the descendant nodes of the deleted node are deleted. The change of the identifiers of the sibling nodes to the right of the deleted node will affect the descendant nodes belonging to the UID-local area, where the deletion occurs.

3.3 Parent-Child Relationship Determination

The ruid preserves an important property of the original UID whereby, given the identifier of a node, the parent node's identifier can be computed entirely in main memory without any I/O. The ancestor-descendant relationship can be examined based on parent-child determination. This property facilitates the evaluation of the structural part in XML queries, and is also important for the fast reconstruction of a portion of an XML document from a set of elements. The output is a portion of an XML document generated from these elements that respects the ancestor-descendant order existing in the source data.

3.4 Determination of Preceding and Following Orders Using Frame

The organization of ruid by level provides an interesting feature in that the global index can be used to determine the relative position of two nodes located anywhere in the entire data tree. First, let us show that the preceding (respectively, following) order of two nodes coincides with that of their projections to the set of children of their lowest common ancestor.

Lemma 2. Let $n_1$ and $n_2$ be two distinct nodes of an XML tree such that $n_1$ is neither an ancestor nor a descendant of $n_2$. Let c be the lowest common ancestor of $n_1$ and $n_2$. Also, let $c_1$ ($c_2$, respectively) be a child of c located on the path between c and $n_1$ ($n_2$, respectively). $n_1$ precedes (or follows, respectively) $n_2$ if and only if $c_1$ is a preceding (or following, respectively) sibling of $c_2$.

Proof. Because $n_1$ and $n_2$ are not in an ancestor-descendant relationship, $c_1$ and $c_2$ are not the same node (see Fig. 9(a)). A node precedes another node if the former is not an ancestor of the latter and comes before the latter in the preorder traversal. For a given node, the traversal passes all nodes of the induced subtree rooted at the node before leaving the subtree for its parent node.
This means that any node in the induced subtree rooted at $c_1$ precedes (follows, respectively) any node in the induced subtree rooted at $c_2$.

Fig. 9. Projection to the set of children of their lowest common ancestor
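Lemma 2 can be illustrated on the original UID, where siblings are numbered from left to right and the parent is given by formula (1). The Python sketch below uses hypothetical function names (`path_from_root`, `precedes`) and assumes both arguments are valid UIDs in the same k-ary enumeration; it projects the two nodes onto the children of their lowest common ancestor and compares those children.

```python
def uid_parent(i: int, k: int) -> int:
    # Formula (1) from the text.
    return (i - 2) // k + 1


def path_from_root(i: int, k: int) -> list[int]:
    """UIDs on the path from the root (UID 1) down to node i."""
    path = [i]
    while i > 1:
        i = uid_parent(i, k)
        path.append(i)
    return path[::-1]


def precedes(n1: int, n2: int, k: int) -> bool:
    """True if n1 precedes n2 in document order and the two nodes are
    not in an ancestor-descendant relationship (Lemma 2)."""
    p1, p2 = path_from_root(n1, k), path_from_root(n2, k)
    for c1, c2 in zip(p1, p2):
        if c1 != c2:
            # c1 and c2 are the children of the lowest common ancestor;
            # sibling order equals numeric order in the UID scheme.
            return c1 < c2
    # One path is a prefix of the other: ancestor/descendant or same node.
    return False
```

For instance, with k = 3, node 5 (a child of node 2) precedes node 23 (a descendant of node 3) because their projections 2 and 3 are siblings with 2 to the left of 3.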

The following lemma states the relationship between the global index and the preceding or following order in the frame.

Lemma 3. Given an XML tree T, a frame F of T, and two nodes $n_1$ and $n_2$ having the identifiers $(\theta_1, \alpha_1, \beta_1)$ and $(\theta_2, \alpha_2, \beta_2)$, the following claims hold: If $\theta_1$ is a preceding node of $\theta_2$ in F, then $n_1$ is a preceding node of $n_2$ in the entire T. If $\theta_1$ is a following node of $\theta_2$ in F, then $n_1$ is a following node of $n_2$ in the entire T.

Proof. We shall discuss the case in which $\theta_1$ precedes $\theta_2$. Let c denote the lowest common ancestor of the nodes corresponding to $\theta_1$ and $\theta_2$ in F, and let $c_1$ and $c_2$ denote the children of c in these node paths, respectively, as shown in Fig. 9(b). $\theta_1$ precedes $\theta_2$, and therefore, from Lemma 2, $c_1$ precedes $c_2$. The node path to a node inside an induced subtree always includes the root of the subtree; therefore, the node paths of $n_1$ and $n_2$ also include $c_1$ and $c_2$, respectively. Applying Lemma 2 again, we find that $n_1$ precedes $n_2$.

3.5 XPath Axes Expressiveness

In this section, we shall investigate the power of ruid to express XPath expressions. This property is important for the applicability of ruid in XML query processing. We consider XPath because it has become the standard on which many newly proposed XML query languages are based. Furthermore, XPath expressions have additional concepts specific to XML data, such as axes, that do not exist in regular path expressions. XPath [13] is a language for addressing parts of an XML document, and was designed to be used by other languages such as XSLT and XPointer. In addition, XPath provides basic facilities for the manipulation of strings, numbers and booleans in the logical structure of an XML document. One important kind of XPath expression is the location path. A location path selects a set of nodes relative to a context node.
The result of evaluating a location path is the node-set containing the nodes selected by the location path. We will focus only on the core rules of XPath, such as the following:

[1] LocationPath ::= RelativeLocationPath | AbsoluteLocationPath
[2] AbsoluteLocationPath ::= '/' RelativeLocationPath?
[3] RelativeLocationPath ::= Step | RelativeLocationPath '/' Step

Therefore, a location path can be written in the form:

$Step_1 \; \tau_1 \; Step_2 \; \tau_2 \; \cdots \; Step_l$    (2)

optionally preceded by a leading symbol that is either empty (indicating that nothing appears) or '/', where $l \geq 0$, $\tau_i$ ($i = 1..l-1$) is '/', and $Step_i$ ($i = 1..l$) is a location step. A location step has three parts: 1) an axis, which specifies the hierarchical relationship between the nodes considered in the location step and the context node, 2) a node test, which

14 104 D.D. Kha, M. Yoshikawa, and S. Uemura specifies the node type and expanded-name of the nodes selected by the location step, and 3) zero or more predicates to further refine the set of nodes. An initial node-set is generated from the axis and the node test and is then filtered by each of the predicates in turn. A predicate filters a node-set with respect to an axis to produce a new node-set. As described above, generating and filtering the axes is essential in evaluation of location steps in XPath expressions. The general task is as follows: Given a context node n identified by (θ, α, β), generate the node set belonging to a specific axis of n and satisfy a condition C. The condition C may be to satisfy a logical expression related to data content, to belong to a specific element type, etc. Depending on the particular C, the order to process may be: generating the set of nodes satisfying C and checking which nodes belong to the specific axis, or generating the specified axis and then checking which nodes satisfy C. The first approach is good only for the cases in which C is specific, so the set of nodes satisfying C is small. The second approach is more generally applicable and thus we shall focus on discussing it. We demonstrate the XPath axes expressiveness of ruid by proposing several routines to generate the axes. We limit the scope of discussion to the axes that specify sets of nodes in term of the node position in XML documents. Due to triviality, we exclude the -or-self portion of axes from consideration. Specifically, the following axes will be considered: (1) parent and ancestor, (2) attribute, child, and descendant, (3) preceding-sibling and following-sibling, and (4) preceding and following. Parent and Ancestor axes. As shown in Section 2.2, after loading the value κ and the table K, the parent s identifier for a given a node can be computed using rparent() in main memory. 
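As a rough illustration of why this computation can stay in main memory, consider the simpler 1-level UID, in which the children of a node with UID $p$ occupy the range $[(p-1)k+2,\ pk+1]$ for fan-out $k$. Inverting that range gives the parent of node $i$ as $\lfloor (i-2)/k \rfloor + 1$. The Python sketch below shows this arithmetic and its repetition to list ancestors. The function names `uid_parent` and `uid_ancestors` are illustrative only, and the sketch does not reproduce rparent(), which additionally consults $\kappa$ and $K$.

```python
def uid_parent(i, k):
    """Parent UID of node i in a k-ary enumerating tree whose root has UID 1.

    Assumes the children of p occupy [(p - 1) * k + 2, p * k + 1].
    Returns None for the root.
    """
    if i == 1:
        return None
    return (i - 2) // k + 1


def uid_ancestors(i, k):
    """Ancestors of node i, nearest first, obtained by repeating uid_parent."""
    ancestors = []
    p = uid_parent(i, k)
    while p is not None:
        ancestors.append(p)
        p = uid_parent(p, k)
    return ancestors


if __name__ == "__main__":
    # With fan-out 3, node 8 has children 23..25 and parent 3.
    print(uid_parent(23, 3))     # parent of 23
    print(uid_ancestors(23, 3))  # ancestors of 23 up to the root
```

Applying `uid_parent` twice yields the grandparent, with no lookup beyond integer arithmetic.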
The routine rancestor(n), used to generate the list of the ancestors of $n$, is a repetition of rparent(). Note that numbering schemes based on the loose hierarchical order require additional parameters to express the hierarchical level, such as grandparent or great-grandparent. This task can be accomplished much more simply using ruid. For example, consider an expression in abbreviated syntax such as $\mathit{element}_1$/*/$\mathit{element}_2$, which explicitly requires that exactly one element lie between $\mathit{element}_1$ and $\mathit{element}_2$. Naturally, we do not have to know what the intermediate element is. Using ruid, we can avoid scanning the entire collection of available elements to find the parent of $\mathit{element}_2$. We need only list the grandparents of the elements of type $\mathit{element}_2$, by applying rparent() twice, and exclude those that are not of type $\mathit{element}_1$.

Child and Descendant axes. In the 1-level UID, if $p$ is the parent's UID, then the identifiers of its children belong to the range $[(p-1)k+2,\ pk+1]$, where $k$ is the fan-out of the enumerating tree. In the 2-level ruid, the routine rchildren(n) to create the list $L$ of possible children of $n$ is as follows. First, use $\kappa$ and $\theta$ to compute the sorted list $L_1$ of children of $\theta$ in the frame of $T$. Let $k$ denote the local fan-out corresponding to $\theta$, obtained from $K$. Let $L_2$ denote the list of integers in the interval $[2,\ k+1]$ if $\beta$ is true, or in the interval $[(\alpha-1)k+2,\ \alpha k+1]$ if $\beta$ is false. For each $i$ in $L_2$, if there exists no $\theta'$ in $L_1$ such that $(\theta', i)$ is found in $K$ as the global and local indices of a row, then add $(\theta, i, \mathit{false})$ to $L$. Otherwise, add $(\theta', i, \mathit{true})$ to $L$. To confirm the existence of such a $\theta'$, we first find in $K$ the list of the local indices whose global indices are the values in $L_1$. We then intersect this list with $L_2$. Note that both $L_1$ and $K$ are sorted, so this process is fast.

The routine rdescendant(n) to generate the list of the descendants of $n$ may be designed as a repetition of rchildren(). Another method is based on the following observation. Given two nodes $n_1$ and $n_2$, let $r_1$ and $r_2$ be the roots of the UID-local areas containing $n_1$ and $n_2$, respectively. If $r_1$ is a descendant of $n_2$, then $n_1$ is a descendant of $n_2$. Therefore, we first need to find the descendants of $n$ inside its own UID-local area only, using rchildren(). Among these nodes, consider the UID-local area root nodes. In $F$, find all the nodes that are descendant-or-self of these roots. All nodes in the areas rooted at the newly found nodes are descendants of $n$.

Preceding-sibling and Following-sibling axes. We explain the routine rpsibling(n), which generates the list $L$ of the preceding siblings of $n$. Using $\kappa$ and $\theta$, we generate the sorted list $L_1$ of child nodes of $\theta$ in the frame $F$ of $T$. In the context UID-local area, compute the sorted list $L_2$ of the preceding siblings of $\alpha$. For each $\alpha_i$ in $L_2$, if there exists no $\theta_j$ such that $(\theta_j, \alpha_i)$ is found in $K$ as the global index and the local index of a row, then add $(\theta, \alpha_i, \mathit{false})$ to $L$. Otherwise, add $(\theta_j, \alpha_i, \mathit{true})$ to $L$. The argument is similar to that for the child and descendant axes. Similarly, we can design the routine rfsibling(n) to generate the list of the following siblings of $n$.

Preceding and Following axes. We will explain the routine for the preceding order.
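Before turning to that routine, the child routine rchildren(n) described above can be sketched in Python. The sketch assumes an in-memory layout that the text does not prescribe: `K` is a dictionary mapping each global index to a pair (local index of the area root within its upper area, local fan-out), and an identifier is a tuple `(theta, alpha, beta)`. The helper name `uid_children` is likewise illustrative. As in the text, the result lists possible children, so some identifiers in it may not be used by any actual node.

```python
def uid_children(p, k):
    """UIDs of the children of node p in a k-ary enumerating tree."""
    return range((p - 1) * k + 2, p * k + 2)


def rchildren(node, kappa, K):
    """List the possible children of node = (theta, alpha, beta).

    Assumed layout (not fixed by the text): K maps a global index to
    (local index of that area's root in its upper area, local fan-out).
    """
    theta, alpha, beta = node
    _, k = K[theta]
    # L1: children of theta in the frame that are present in K.
    l1 = [t for t in uid_children(theta, kappa) if t in K]
    # L2: local indices of the children. An area root has local index 1
    # inside its own area, so its children occupy [2, k + 1].
    l2 = uid_children(1 if beta else alpha, k)
    # Map local index -> global index for sub-areas rooted below theta.
    root_at = {K[t][0]: t for t in l1}
    return [(root_at[i], i, True) if i in root_at else (theta, i, False)
            for i in l2]


if __name__ == "__main__":
    # Hypothetical data: frame fan-out 2; area 1 has local fan-out 3 and
    # two sub-areas rooted at its local indices 3 and 4. The local index
    # stored for the top area (global index 1) is a placeholder.
    K = {1: (1, 3), 2: (3, 2), 3: (4, 2)}
    print(rchildren((1, 1, True), 2, K))
    print(rchildren((1, 2, False), 2, K))
```

Building the `root_at` dictionary plays the role of intersecting the local indices found in $K$ with $L_2$; with sorted lists, a merge-style intersection would serve equally well.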
Based on Lemma 2, the routine in Fig. 10 is designed to determine the preceding order between two nodes denoted by $n_1$ and $n_2$. This routine can be performed exclusively in main memory. A routine to determine the preceding order using ruid can be designed similarly. We can apply Lemma 3 to design rpreceding(n). All nodes that belong to the UID-local areas preceding the area containing $n$ are preceding nodes of $n$. Hence, we need only check inside the UID-local areas that are ancestors of the area containing $n$. We omit the details of the algorithm here.

In general, the multilevel ruid has the following property: for the preceding and following axes, the relative position of two nodes can be determined by the first components of their multilevel ruids that differ and for which the preceding-following order is decidable. In the 2-level ruid, the orders among nodes are reflected in the frame $F$. We can use this property to accelerate the axis constructions.

4 Applications of ruid

In this section, we briefly discuss possible applications of ruid in processing XML data. A detailed investigation of these applications is being conducted.

Managing large XML trees. The ruid is a realistic method to process large XML documents. We believe that this property enables management of various data sources scattered over several sites on a network. With respect to application, ruid allows full enumeration of all components of XML document trees generated by parsers based on the Document Object Model [14], without the need for additional software modules as required by the original UID.

Generating stable identifiers. The ruid generates identifiers that do not require much recomputation when structural updates, such as node insertion or node deletion, occur. Therefore, ruid can be applied in applications that manage data with frequent structural updates.

Query evaluation. As shown in Section 3.5, ruid is an effective tool to express the structural part of XPath expressions. The axes of a node can be constructed if the nodes are identified by ruid. This property facilitates an efficient method by which to process queries on XML data.

Database file/table selection. In some applications, the data files or tables may be very large, and query evaluation therefore becomes slow. Decomposition of the data into smaller tables then becomes necessary in order to speed up the queries. However, the decomposition raises the question of how to choose the correct data files or tables from which to select the candidates. One solution is to create the names of data files or tables from two parts: the first part is extracted from a text value, such as the element or attribute name, and the second part is the common global index of the ruid of the items.

Input: An XML tree $T$, nodes $n_1$ and $n_2$
Output: The preceding node $p$ between $n_1$ and $n_2$
  Compute the sorted set $A_1$ of ancestors of $n_1$
  Compute the sorted set $A_2$ of ancestors of $n_2$
  Compare $A_1$ and $A_2$ to determine the lowest common ancestor $c$ of $n_1$ and $n_2$
  if ($c$ is $n_1$ or $n_2$) then
    $p$ := null
  else
    Determine the children $c_1$ and $c_2$ of $c$ in $A_1$ and $A_2$
    Compare the UIDs of $c_1$ and $c_2$ to get $p$
  end
  return $p$
Fig. 10. Routine to determine the preceding order in 1-level UID
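The routine of Fig. 10 can be sketched in Python for the 1-level UID as follows. The sketch assumes a k-ary enumerating tree whose root has UID 1 and in which the children of $p$ occupy $[(p-1)k+2,\ pk+1]$, so that among siblings a smaller UID means an earlier position. It also reads the ancestor sets $A_1$ and $A_2$ as root-to-node paths that include the nodes themselves, which is one interpretation of the figure. The names `node_path` and `preceding_node` are illustrative.

```python
def uid_parent(i, k):
    """Parent UID of a non-root node i in a k-ary enumerating tree."""
    return (i - 2) // k + 1


def node_path(i, k):
    """UIDs on the path from the root (UID 1) down to node i, inclusive."""
    path = [i]
    while path[-1] != 1:
        path.append(uid_parent(path[-1], k))
    path.reverse()
    return path


def preceding_node(n1, n2, k):
    """Return whichever of n1, n2 precedes the other, or None if one of
    them is an ancestor-or-self of the other."""
    a1, a2 = node_path(n1, k), node_path(n2, k)
    # Walk down the common prefix; a1[depth - 1] is the lowest common
    # ancestor c once the loop stops.
    depth = 0
    while depth < min(len(a1), len(a2)) and a1[depth] == a2[depth]:
        depth += 1
    if depth == len(a1) or depth == len(a2):
        return None  # c is n1 or n2 itself
    # a1[depth] and a2[depth] are the children c1 and c2 of c.
    return n1 if a1[depth] < a2[depth] else n2


if __name__ == "__main__":
    # Fan-out 2: node 2 has children 4 and 5, node 3 has children 6 and 7.
    print(preceding_node(5, 6, 2))
    print(preceding_node(2, 5, 2))
```

Only integer arithmetic on the two identifiers is involved, which is consistent with the remark that the routine can run entirely in main memory.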
5 Preliminary Experiment Evaluation

In this section, we report observations made during preliminary experiments conducted in the early stage of this study. We conducted a number of tests to generate the UID and ruid for several sample XML documents and to process simple queries. The application programs were written in Java and were connected to an RDBMS through JDBC-ODBC. All of these test components ran on the Windows XP Professional operating system.

Preliminarily, we made the following observations. First, ruid is more capable than the original UID with respect to the enumeration of nodes of large trees, for example, trees having a high degree of recursion. Second, even though the function to find the parent node's identifier from a child node's identifier in ruid is more complicated than the one in the original UID, the distinction is not significant because the computation occurs mostly in main memory. Third, querying speed using ruid in main memory is quite competitive, although the connection between the test programs and the RDBMS was slow in comparison to the computing speed, because at the time of the preliminary tests an RDBMS was used to store and index the data in our experiments.

6 Related Works

Several structural summaries for semistructured data, a general form of XML data, have been introduced [1,4,9]. Structural information, such as node paths, is extracted from the data source, classified, and then represented in a structure graph. The graph can be used both as an indexing structure and as a guide by which users can perform meaningful and valid queries.

A method to determine the ancestor-descendant relationship using preorder and postorder traversal has been introduced in [3]. Extensions of the preorder and postorder traversal numbering scheme have been presented in [6,2]. Another approach, in [11], uses the position and depth of a tree node to index XML elements.

Management of identifiers by XID-map has been discussed in [8]. The XID-map provides identification for nodes in the change management of XML documents. A possible variant of the XID-map is based on node positions within a tree. For instance, for indexing purposes the triplet (prefix, postfix, level coding) is used. However, as mentioned in [8], the identifications are not robust.

The UID technique was first introduced in [7]. Some applications of the original technique were proposed in [5,10], in which the numbering scheme was implemented to facilitate indexing. In these studies, the problems of structural update and identifier overflow were not discussed.

7 Discussion and Conclusion

Application of numbering schemes in processing XML data is an effective technique.
The technique achieves two goals: generating identifiers for XML components and providing the structural information of XML documents. Among several proposed numbering schemes, the UID technique is promising because it allows the parent-child relationship to be determined effectively. However, the numbering scheme is ineffective when dealing with structural updates. Furthermore, in many cases the original UID consumes too many identifier values and requires extra tools for processing.

In this study, we proposed a multilevel recursive numbering scheme called ruid. While preserving the efficient properties of the original UID, ruid is more robust under structural updates and enables the coding of arbitrarily large XML documents. In addition, ruid can express the axes of XPath expressions. Preliminary experiments have shown that ruid can be applied to process queries on XML data. Extensions of the present study, including performance experiments using various configurations, refinement of the experiment scheme, and more detailed tests, are currently in progress.

References

1. P. Buneman, S. Davidson, M. Fernandez, D. Suciu. Adding Structure to Unstructured Data. Proc. of the ICDT, Greece.
2. S. Chien, V.J. Tsotras, C. Zaniolo, D. Zhang. Storing and Querying Multiversion XML Documents using Durable Node Numbers. Proc. of the Inter. Conf. on WISE, Japan.
3. P.F. Dietz. Maintaining order in a linked list. Proceedings of the Fourteenth ACM Symposium on Theory of Computing, California.
4. R. Goldman, J. Widom. DataGuides: enabling query formulation and optimization in semistructured databases. Proc. of the Inter. Conf. on VLDB.
5. H. Jang, Y. Kim, D. Shin. An Effective Mechanism for Index Update in Structured Documents. Proc. of CIKM, USA.
6. Q. Li, B. Moon. Indexing and Querying XML Data for Regular Path Expressions. Proc. of the Inter. Conf. on VLDB, Italy.
7. Y.K. Lee, S-J. Yoo, K. Yoon, P.B. Berra. Index Structures for structured documents. ACM First Inter. Conf. on Digital Libraries, Maryland, 91-99.
8. A. Marian, S. Abiteboul, G. Cobena, L. Mignet. Change-Centric Management of Versions in an XML Warehouse. Proc. of the Inter. Conf. on VLDB, Italy.
9. T. Milo, D. Suciu. Index Structures for Path Expression. Proc. of the ICDT.
10. D. Shin. XML Indexing and Retrieval with a Hybrid Storage Model. J. of Knowledge and Information Systems, 3.
11. C. Zhang, J. Naughton, D. DeWitt, Q. Luo, G. Lohman. On Supporting Containment Queries in Relational Database Management Systems. Proc. of the ACM SIGMOD, USA.
12. World Wide Web Consortium. Extensible Markup Language (XML)
13. World Wide Web Consortium. XML Path Language (XPath) Version
14. World Wide Web Consortium. Document Object Model (DOM) Level 2 Core Specification Version

