Integrating Path Index with Value Index for XML data

Jing Wang ¹, Xiaofeng Meng ², Shan Wang ²

¹ Institute of Computing Technology, Chinese Academy of Sciences, 100080 Beijing, China
cuckoowj@btamail.net.cn
² Information School, Renmin University of China, 100872 Beijing, China
{xfmeng, suang}@public.bta.net.cn

Abstract. XML is becoming the de facto standard for data used by Web applications. To facilitate path expression processing, we propose an index structure adopted in our native XML database system Orient-X. Our index is constructed by using the DTD to obtain the paths that can appear in the XML documents. It represents a structural summary of an XML data collection conforming to a given DTD, so any label path query can be processed without accessing the original data. In addition, it is integrated with value indexes. Preliminary experiments show promising results.

1 Introduction

As more and more data sources on the Internet express their content in XML [1], the volume of XML data is increasing rapidly. This trend calls for efficient XML data management solutions. In line with the tree-centric nature of XML data, path expressions play an important role in XML queries [2]. Without an index, path expressions must be evaluated by tree traversal, which is not efficient. Recent proposals focus on efficient support for sequences of / steps from the document root, but they are not efficient for partial path matching and // steps because they require exhaustive navigation of the index. Furthermore, their construction cost is high, and the index may be very large.

In this paper, we propose SUPEX (Schema guided Path index for XML data). * In contrast to a traditional path index, SUPEX is constructed by using the DTD to obtain the paths that can appear in XML documents. With SUPEX, we can process any label path query without accessing the original data. Value based conditions are crucial in querying any kind of data. In SUPEX, the path index and value indexes are integrated to facilitate query evaluation.

The remainder of this paper is organized as follows. Section 2 presents background knowledge and related work. Section 3 gives an overview of SUPEX. Section 4 describes the procedure of query processing. Section 5 contains the results of our experiments. Finally, conclusions and future work are given in Section 6.

* The work was partly supported by the grants from 863 High Technology Foundation of China (No. 2002AA116030) and the Natural Science Foundation of China (No. 60073014).

2 Background and Related Work

A Document Type Definition (DTD) is part of the XML standard [1]. It specifies the structure of an XML element by specifying the names of its sub-elements and attributes. We define an XML data set as a set of XML documents conforming to a given DTD.

A key issue in XML query processing is how to efficiently determine the ancestor-descendant relationship between any two elements. We adopt the encoding scheme proposed in [5]. Every node in an XML document tree is associated with a tuple (DocId, Order, Size, Level). This numbering scheme is applied to both our document tree and our index graph.

Recent proposals on path indexes include DataGuides [3] and 1-indexes [4]. These indexes are not efficient for partial path matching because they require exhaustive navigation of the index. Furthermore, their construction cost is high. Cooper et al. [6] presented the Index Fabric, which encodes the label path to each XML element that has a value as a string and inserts these strings into an efficient string index structure. This index loses the relationships between elements and supports only label path queries from the document root. XISS [5] proposed an index-based approach to path evaluation and supports path queries through structural joins.

3 Overview of SUPEX

3.1 SUPEX: Its Structure

SUPEX consists of a structural graph (SG) and an element map (EM). The SG is constructed from the DTD and represents a structural summary of the XML data, so every possible path starting from the root of an XML document conforming to the given DTD appears in the SG. The EM provides fast entry points to nodes in the SG and is useful for finding all elements with the same tag.

Fig. 1. SUPEX structure with value indexes (components shown: Structural Graph, Element Map, Index Records, Value Indexes)

The SG has one root node. Each node in the SG except the root has a label defined in the DTD, called its E-Label. All SG nodes with the same E-Label are linked through a pointer named Next-Element. Each SG node corresponds to a set of fixed-length index records, called the extent of that node. The extent of an SG node contains the index records of the elements that share an identical incoming label path, and these records are sorted by DocId and Order. Each index record includes an element descriptor and other related information. The SG is tree-shaped when the DTD graph has no cycle. When the DTD graph is cyclic, the SG is still tree-like except for the reverse edges from descendant nodes to ancestor nodes.

The Element Map (EM) is implemented as a B+-tree keyed on element name. Each entry in a leaf node points to the first node of the list of SG nodes with an identical E-Label. The EM allows us to quickly find all SG nodes with the same E-Label.

In traditional database systems such as relational database systems, value indexes are usually created on the columns of a relation. Because XML data is tree-shaped, however, it is difficult to define the granularity of value indexes. In SUPEX, value indexes are constructed with respect to the context of data elements. Each SG node may have one or several pointers to value indexes that are built on the attributes or text values of the elements in its extent. These value indexes are implemented as B+-trees, and their creation and removal are the user's decisions. Fig. 1 gives the structure of SUPEX.

3.2 Construction of SUPEX

DTDs have proved important in a variety of areas, such as transformation between XML and databases and XML data storage. SUPEX is generated from the DTD, and the main issues that must be addressed are:

1. Simplifying DTD. Practical DTDs can be very complex, and most of that complexity comes from the complex specification of elements. We choose a set of transformations to eliminate constraints on the number of occurrences of elements, transform to,, and group sub-elements having the same name. Such simplification loses information such as the relative order of the elements, but retains information about all possible sub-elements, which is enough to generate our structural graph.

2. Constructing structural graph. The simplified DTD can be represented as a DTD graph. Through a depth-first traversal of the DTD graph starting at the element nodes without incoming edges, we expand the DTD graph into the structural graph. The SG nodes with an identical E-Label are linked to form a list. The Element Map can then be constructed on all element tags.

3. Data Loading. SUPEX can be constructed before data loading. While XML documents are loaded, each element is encoded, and its index record is inserted into the extent of the corresponding SG node. As for value indexes, users can choose to create appropriate value indexes on the attributes or text values of elements in a given context.
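The second construction step can be illustrated with a minimal sketch. It assumes the simplified DTD is already available as a mapping from each element name to its possible sub-elements; the names `SGNode`, `build_sg`, and `back_edges`, the sample DTD, and the use of a Python dictionary in place of the B+-tree Element Map are all illustrative assumptions, not part of Orient-X.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class SGNode:
    label: str                                       # E-Label from the DTD
    children: list = field(default_factory=list)
    back_edges: list = field(default_factory=list)   # reverse edges (cyclic DTD)
    extent: list = field(default_factory=list)       # filled during data loading


def build_sg(dtd_graph):
    """Expand a simplified DTD graph {element: [sub-elements]} into an SG.

    Returns (root, element_map); element_map maps a tag to the list of SG
    nodes with that E-Label (a dict stands in for the B+-tree here).
    """
    element_map = defaultdict(list)
    has_incoming = {sub for subs in dtd_graph.values() for sub in subs}
    root = SGNode(label="")                          # unlabeled SG root

    def expand(tag, path):
        node = SGNode(label=tag)
        element_map[tag].append(node)                # extend the E-Label list
        path.append((tag, node))
        for sub in dtd_graph.get(tag, []):
            ancestor = next((n for t, n in path if t == sub), None)
            if ancestor is not None:
                node.back_edges.append(ancestor)     # cycle: link back, stop
            else:
                node.children.append(expand(sub, path))
        path.pop()
        return node

    # Depth-first expansion from element nodes without incoming edges.
    for tag in dtd_graph:
        if tag not in has_incoming:
            root.children.append(expand(tag, []))
    return root, element_map


# Hypothetical DTD fragment, loosely modeled on an auction site.
dtd = {
    "site": ["open_auction", "item"],
    "open_auction": ["description"],
    "item": ["description"],
    "description": ["text"],
    "text": [],
}
root, em = build_sg(dtd)
```

In this sketch the element `description` is reachable through two label paths, so `em["description"]` holds two SG nodes, one per incoming path. This mirrors the role of the Next-Element list: one lookup by tag yields every SG node carrying that E-Label.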

4 Query Processing with SUPEX

SUPEX contains the information of both a path index and value indexes. With SUPEX, we can efficiently process path expressions with value based condition predicates. SUPEX supports two basic queries: (1) given a tag, all elements with this tag can be obtained by a lookup in the EM; (2) simple label paths from the document root can be matched by traversing the SG starting from its root node. Beyond these two, SUPEX can be used to evaluate queries in the following ways.

4.1 Path Expression

A complex path expression can be decomposed into a set of basic structural relationships between nodes. These basic structural relationships are the ancestor-descendant and parent-child relationships. Path queries like E1/E2 and E1//E2 can be supported by the Parent-Child(E1, E2) and Ancestor-Descendant(E1, E2) algorithms, respectively.

The procedure of algorithm Ancestor-Descendant(E1, E2) is as follows. By a lookup in the EM, we get the two SG nodes that are the head nodes of the lists with E-Label E1 and E2, respectively. Following the two lists, we determine the ancestor-descendant relationship between the current nodes of the two lists according to their numbers. If they are ancestor and descendant, the element records in their extents are sort-merged and appended to the result. Otherwise, one pointer is moved to the next node accordingly.

Algorithm Ancestor-Descendant(E1, E2)
Input: Ancestor element E1, descendant element E2
Output: Pairs of matching nodes
  Get the head node of List E1 in SG through EM;
  Get the head node of List E2 in SG through EM;
  For each node in List E1 do
    Skip over unmatchable nodes in List E2;
    For each matching node in List E2 do
      Sort-merge the extents of current nodes of List E1 and E2;
      Append the result to output;
    End for
  End for

In addition to these basic structural relationships, our index can support partial label path matching. For label paths like //E1/E2//En, we need not traverse the whole index graph to get the result. By a lookup in the EM, we obtain the head node of the list with E-Label E1. For each node in this list, the sub-tree rooted at it is traversed to find nodes matching E1/E2//En. So only a part of the SG is traversed to get the result, which greatly reduces the cost of partial label path matching. The detailed procedure is omitted due to space limits.
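The Ancestor-Descendant procedure of this section can be sketched as follows. The sketch makes several assumptions that are not stated in the paper: the containment test is the usual one for an (Order, Size) numbering, namely that x is an ancestor of y when order(x) < order(y) <= order(x) + size(x); a nested loop over the two SG node lists replaces the list-skipping step; elements within one extent are assumed not to nest, since they share one incoming label path; and the Level field is omitted because it is only needed for parent-child tests. The names `Rec`, `SGNode`, `merge_extents`, and the sample numbers are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Rec:
    """Index record of one element: (DocId, Order, Size)."""
    doc_id: int
    order: int
    size: int


@dataclass
class SGNode:
    label: str
    order: int                      # numbering of the SG node itself
    size: int
    extent: list = field(default_factory=list)  # sorted by (doc_id, order)


def contains(a_order, a_size, d_order):
    # Assumed containment test for the (Order, Size) numbering.
    return a_order < d_order <= a_order + a_size


def merge_extents(anc, desc):
    """Two-pointer merge of two extents sorted by (doc_id, order)."""
    pairs, j = [], 0
    for a in anc:
        # Skip descendant candidates that precede a in document order.
        while j < len(desc) and (desc[j].doc_id, desc[j].order) <= (a.doc_id, a.order):
            j += 1
        # Collect candidates that fall inside a's interval.
        while (j < len(desc) and desc[j].doc_id == a.doc_id
               and desc[j].order <= a.order + a.size):
            pairs.append((a, desc[j]))
            j += 1
    return pairs


def ancestor_descendant(element_map, e1, e2):
    """Evaluate e1//e2 using SG node lists obtained from the element map."""
    result = []
    for n1 in element_map.get(e1, []):
        for n2 in element_map.get(e2, []):
            if contains(n1.order, n1.size, n2.order):   # SG-level check
                result.extend(merge_extents(n1.extent, n2.extent))
    return result


# Hypothetical SG fragment: site/open_auction/description and site/item/description.
oa = SGNode("open_auction", 2, 2, [Rec(1, 2, 2), Rec(1, 5, 1)])
d_oa = SGNode("description", 3, 1, [Rec(1, 3, 1), Rec(1, 6, 0)])
d_item = SGNode("description", 6, 1, [Rec(1, 8, 0)])
em = {"open_auction": [oa], "description": [d_oa, d_item]}

pairs = ancestor_descendant(em, "open_auction", "description")
```

Because the SG-level check rejects the `description` node under `item` before any record is read, only the extents of structurally related SG nodes are merged. This is the source of the savings over joining the full element lists for both tags.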

4.2 Query Evaluation with Value Indexes

Value based condition predicates are important in query evaluation. In XML queries, condition predicates are often placed on elements matching a certain label path expression. In SUPEX, value indexes are created according to the requirements of users and can be used in the evaluation of condition predicates. By traversing the SG, we get the SG nodes matching a given label path expression. If there are value based conditions on these nodes and appropriate value indexes exist, the query can be evaluated through the existing indexes. When a large number of data nodes match the path expression, using value indexes costs less than executing the predicates on all candidate nodes.

5 Preliminary Experiment Results

We empirically evaluated the performance of SUPEX on a variety of XML documents. We report results here for a representative dataset: the XMark benchmark [7]. The experiments were performed on a Pentium IV 1.4 GHz platform with MS-Windows 2000 and 256 MB of main memory. The XERCES-C++ parser was used to parse and generate XML data. We implemented our index in the C++ programming language. The data sets were stored on a local disk.

To get controllable document sizes, we used the XML generator XMLgen developed by the XMark benchmark project. For a fixed DTD modeling an Internet auction site, XMLgen produces document instances of controllable size. Table 1 lists the characteristics of the data sets used in our experiments. The columns of the table give the XMLgen parameter, the size of the generated document, and the number of elements in the generated document, respectively.

Table 1. XML document size and number of elements in documents

Scaling Factor    Document Size (MB)    Element number
0.01              1.12                  17132
0.05              5.6                   83533
0.1               11.3                  167865
0.5               56.2                  832911
1.0               113                   1666315

We implemented the element index and the element-element join algorithm of XISS [5] and compared its performance with SUPEX. Figs. 2 and 3 report the query response times for //open_auction//description and //description/text, respectively, against XMark documents of increasing size. As shown in these figures, SUPEX is faster than XISS, and its cost reduction relative to XISS grows as the document size increases. We find these preliminary results encouraging. Further performance evaluation will be carried out in the future.

Fig. 2. Open_auction//description (query response time in milliseconds versus data size in MB, for SUPEX and XISS)

Fig. 3. Description/text (query response time in milliseconds versus data size in MB, for SUPEX and XISS)

6 Conclusion and Future Work

Our research group is working on a native XML data management system. We are implementing SUPEX as the index module of our system. In the future, we will test our method with large volumes of data and compare it with existing index schemes. Furthermore, value indexes will be added into the SUPEX structure to accelerate the evaluation of predicate conditions.

References

1. T. Bray, J. Paoli, C. M. Sperberg-McQueen, and E. Maler (Eds). Extensible Markup Language (XML) 1.0 (Second Edition). W3C Recommendation, 6 October 2000, http://www.w3.org/tr/2000/rec-xml-20001006
2. D. Chamberlin, D. Florescu, J. Robie, J. Simeon, and M. Stefanescu (Eds). XQuery: A Query Language for XML. W3C Working Draft, 15 February 2001, http://www.w3.org/tr/2001/wd-xquery-2001215
3. R. Goldman, J. Widom. DataGuide: Enabling Query Formulation and Optimization in Semistructured Databases. In Proceedings of the 23rd International Conference on Very Large Data Bases, Athens, Greece, 1997
4. T. Milo and D. Suciu. Index structures for path expression. In Proceedings of the 7th International Conference on Database Theory, pages 277-295, January 1999
5. Quanzhong Li, Bongki Moon. Indexing and Querying XML Data for Regular Path Expressions. In Proceedings of the 27th International Conference on Very Large Data Bases, Roma, Italy, 2001
6. Brian F. Cooper, Neal Sample, Michael J. Franklin, Gisli R. Hjaltason, Moshe Shadmon. A Fast Index for Semistructured Data. In Proceedings of the 27th International Conference on Very Large Data Bases, Roma, Italy, 2001
7. Albrecht R. Schmidt, Florian Waas, Martin L. Kersten, Daniela Florescu, Ioana Manolescu, Michael J. Carey, and Ralph Busse. The XML Benchmark Project. Technical Report INS-R0103, CWI, Amsterdam, the Netherlands, April 2001