Integrating Path Index with Value Index for XML data

Jing Wang 1, Xiaofeng Meng 2, Shan Wang 2

1 Institute of Computing Technology, Chinese Academy of Sciences, 100080 Beijing, China
cuckoowj@btamail.net.cn
2 Information School, Renmin University of China, 100872 Beijing, China
{xfmeng, suang}@public.bta.net.cn

Abstract. XML is becoming the de facto standard required by Web applications. To facilitate path expression processing, we propose an index structure adopted in our native XML database system OrientX. Our index is constructed by using the DTD to obtain the paths that will appear in the XML documents. It represents a structural summary of an XML data collection conforming to a certain DTD, so we can process any label path query without accessing the original data. In addition, it is integrated with value indexes. Preliminary experiments show quite promising results.

1 Introduction

As more and more data sources on the Internet switch over and express their data content in XML [1] format, the volume of XML data is increasing rapidly. This trend calls for efficient XML data management solutions. In line with the tree-centric nature of XML data, path expressions play an important role in XML queries [2]. Without an index, tree traversal is not an efficient way to evaluate them. Recent proposals focus on efficient support for sequences of / steps from the document root, but they are not efficient for partial path matching and // steps because the indexes must be navigated exhaustively. Furthermore, their construction cost is high, and the index size may be very large.

In this paper, we propose SUPEX (Schema guided Path index for XML data).* In contrast to traditional path indexes, SUPEX is constructed by using the DTD to obtain the paths that will appear in XML documents. With SUPEX, we can process any label path query without accessing the original data. Value-based conditions are crucial in querying any kind of data. In SUPEX, the path index and value indexes are integrated to facilitate query evaluation.

The remainder of this paper is organized as follows. In Section 2, we present some background knowledge and related work. An overview of SUPEX is given in Section 3. Section 4 describes the procedure of query processing. Section 5 contains the results of our experiments. Finally, conclusions and future work are given in Section 6.

* The work was partly supported by grants from the 863 High Technology Foundation of China (No. 2002AA116030) and the Natural Science Foundation of China (No. 60073014).

2 Background and Related Work

Document Type Definition (DTD) is part of the XML standard [1]. It specifies the structure of an XML element by giving the names of its sub-elements and attributes. We define an XML data set as a set of XML documents conforming to a certain DTD.

A key issue in XML query processing is how to efficiently determine the ancestor-descendant relationship between any two elements. We adopt the encoding scheme proposed in [5]. Every node in an XML document tree is associated with a tuple (DocId, Order, Size, Level). This numbering scheme is applied to both our document tree and our index graph.

Recent proposals on path indexes include DataGuides [3], 1-indexes [4], and so on. These indexes are not efficient for partial path matching because the indexes must be navigated exhaustively. Furthermore, their construction cost is high. Cooper et al. [6] presented the Index Fabric, which encodes each label path leading to an XML element with a value as a string and inserts these strings into an efficient string index structure. This index loses the relationships between elements and only supports label path queries from the document root. XISS [5] proposed an index-based approach to path evaluation and supported path queries through structural joins.

3 Overview of SUPEX

3.1 SUPEX: Its Structure

SUPEX consists of a structural graph (SG) and an element map (EM). SG is constructed from the DTD and represents the structural summary of the XML data. Hence every possible path starting from the root of an XML document conforming to the given DTD appears in SG. EM provides fast entry points to nodes in SG and is useful for finding all elements with the same tag.

Fig. 1. SUPEX structure with value indexes (components shown: Structural Graph, Element Map, Index Records, Value Indexes)
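As a brief illustration of the numbering scheme adopted in Section 2, the sketch below shows how an ancestor-descendant test can reduce to integer comparisons on (DocId, Order, Size, Level). The containment rule used here, Order(a) < Order(d) <= Order(a) + Size(a), is the interval rule commonly associated with the scheme of [5]; this paper does not spell it out, so treat it as an assumption. The names Label, is_ancestor and is_parent and the sample numbers are hypothetical.

```python
from typing import NamedTuple


class Label(NamedTuple):
    """Node label; field names follow the tuple named in the text."""
    doc_id: int
    order: int   # position in document order
    size: int    # assumed: at least the number of descendants
    level: int   # depth in the tree


def is_ancestor(a: Label, d: Label) -> bool:
    # Assumed interval rule: d lies strictly inside a's (order, order + size] range.
    return a.doc_id == d.doc_id and a.order < d.order <= a.order + a.size


def is_parent(a: Label, d: Label) -> bool:
    # Parent-child is ancestor-descendant restricted to adjacent levels.
    return is_ancestor(a, d) and d.level == a.level + 1


# Hypothetical labels for a three-level fragment of one document.
site = Label(doc_id=1, order=1, size=100, level=1)
open_auction = Label(doc_id=1, order=10, size=20, level=2)
description = Label(doc_id=1, order=12, size=5, level=3)

print(is_ancestor(site, description))        # ancestor, two levels apart
print(is_parent(site, description))          # not a direct parent
print(is_parent(open_auction, description))  # direct parent
```

Because the test needs only the two labels, it works equally on document nodes and on SG nodes, which is what lets the same scheme serve both.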

SG has one root node. Each node in SG except the root has a label defined in the DTD, called its E-Label. All nodes with the same E-Label in SG are linked through a pointer named Next-Element. Each node in SG corresponds to a set of fixed-length index records, which is called the extent of that SG node. The extent of an SG node contains the index records of the elements that share an identical incoming label path, and these index records are sorted by DocId and Order values. Each index record includes an element descriptor and other related information. SG is tree-shaped when there is no cycle in the DTD graph. When the DTD graph is cyclic, SG is still tree-like except for the reverse edges from descendant nodes to ancestor nodes.

The Element Map (EM) is implemented as a B+-tree using the element name as key. Each entry in a leaf node points to the first node of a list of SG nodes with an identical E-Label. EM allows us to quickly find all SG nodes with the same E-Label.

In traditional database systems such as relational database systems, value indexes are usually created on columns of a relation. Because of the tree-shaped nature of XML data, however, it is difficult to define the granularity of value indexes. In SUPEX, value indexes are constructed with respect to the context of data elements. Each SG node may have one or several pointers to value indexes that are built on the attributes or text values of the elements in its extent. These value indexes are implemented as B+-trees, and their creation and destruction are the user's decisions. Fig. 1 gives the structure of SUPEX.

3.2 Construction of SUPEX

DTDs have proved important in a variety of areas: transformation between XML and databases, XML data storage, and so on. SUPEX is generated from the DTD, and the main issues that must be addressed include:

1. Simplifying DTD. Practical DTDs can be very complex, and most of this complexity comes from the complex specification of elements. We choose a set of transformations to eliminate constraints on the number of occurrences of elements, transform to,, and group sub-elements having the same name. Such simplification loses information such as the relative order of the elements, but retains information about all possible sub-elements, which is enough to generate our structural graph.

2. Constructing structural graph. The simplified DTD can be represented as a DTD graph. Through a depth-first traversal of the DTD graph starting at the element nodes without incoming edges, we expand the DTD graph into the structural graph. The SG nodes with an identical E-Label are linked to form a list. The Element Map can be constructed on all element tags.

3. Data Loading. SUPEX can be constructed before data loading. During the loading of XML documents, each element is encoded, and its index record is inserted into the extent of the corresponding node in SG. As for value indexes, users can choose to create appropriate value indexes on the attributes or text values of elements conforming to a certain context.
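Step 2 above can be sketched as follows. This is a minimal illustration under stated assumptions, not the OrientX implementation: the simplified DTD is assumed to be given as a dictionary from element name to its sub-element names, a plain dictionary stands in for the B+-tree Element Map, and the names SGNode, build_sg and nodes_with_label, as well as the sample DTD, are hypothetical.

```python
class SGNode:
    """One structural-graph node; field names are illustrative."""

    def __init__(self, label):
        self.label = label        # E-Label (None for the SG root)
        self.children = []        # tree edges
        self.back_edges = []      # reverse edges, only for cyclic DTDs
        self.next_element = None  # Next-Element pointer
        self.extent = []          # index records, filled at load time


def build_sg(dtd):
    """Expand a simplified DTD graph {element: [sub-elements]} into SG and EM."""
    referenced = {sub for subs in dtd.values() for sub in subs}
    starts = [e for e in dtd if e not in referenced]  # no incoming edge
    root = SGNode(None)
    element_map = {}  # label -> head of its Next-Element list
    tails = {}        # label -> current tail of that list

    def link(node):
        if node.label in tails:
            tails[node.label].next_element = node
        else:
            element_map[node.label] = node
        tails[node.label] = node

    def expand(label, parent, on_path):
        node = SGNode(label)
        parent.children.append(node)
        link(node)
        on_path[label] = node
        for sub in dtd.get(label, []):
            if sub in on_path:  # cycle: point back at the ancestor node
                node.back_edges.append(on_path[sub])
            else:
                expand(sub, node, on_path)
        del on_path[label]

    for label in starts:
        expand(label, root, {})
    return root, element_map


def nodes_with_label(element_map, label):
    """Follow the Next-Element list for one E-Label."""
    node = element_map.get(label)
    while node is not None:
        yield node
        node = node.next_element


# Hypothetical simplified DTD with one cycle (parlist -> listitem -> parlist).
dtd = {
    "site": ["open_auction", "item"],
    "open_auction": ["description"],
    "item": ["description"],
    "description": ["text", "parlist"],
    "parlist": ["listitem"],
    "listitem": ["text", "parlist"],
    "text": [],
}
root, element_map = build_sg(dtd)
print(len(list(nodes_with_label(element_map, "description"))))
```

In this sketch an element reachable by several label paths, such as description, yields one SG node per path, and the cycle is cut by a reverse edge rather than expanded again, matching the tree-like shape described in Section 3.1.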

4 Query Processing with SUPEX

SUPEX contains the information of both a path index and value indexes. With SUPEX, we can efficiently process path expressions with value-based condition predicates. SUPEX supports two basic queries: (1) given a tag, all elements with this tag can be obtained by a lookup of EM; (2) simple label paths from the root of a document can be matched by traversing SG starting from the root node. Beyond these two, SUPEX can be used to evaluate queries in the following ways.

4.1 Path Expression

A complex path expression can be decomposed into a set of basic structural relationships between nodes. These basic structural relationships include the ancestor-descendant and parent-child relationships. Path queries like E1/E2 and E1/*/E2 can be supported by the Parent-Child(E1, E2) and Ancestor-Descendant(E1, E2) algorithms, respectively.

The procedure of algorithm Ancestor-Descendant(E1, E2) is as follows. By a lookup of EM, we get the two nodes in SG that are the head nodes of the lists with E-Label E1 and E2, respectively. Following the two lists, we determine the ancestor-descendant relationship between the current nodes in the two lists according to their numbers. If they are ancestor and descendant, the element records in their extents are sort-merged and appended to the result. Otherwise, one pointer is moved to the next node accordingly.

Algorithm Ancestor-Descendant(E1, E2)
Input: ancestor element E1, descendant element E2
Output: pairs of matching nodes
  Get the head node of List E1 in SG through EM;
  Get the head node of List E2 in SG through EM;
  For each node in List E1 do
    Skip over unmatchable nodes in List E2;
    For each matching node in List E2 do
      Sort-merge the extents of current nodes of List E1 and E2;
      Append the result to output;
    End for
  End for

In addition to these basic structural relationships, our index can support partial label path matching. For label paths like //E1/E2//En, we need not traverse the whole index graph to get the result. By a lookup of EM, we obtain the head node of the list with E-Label E1. For each node in this list, the sub-tree rooted at it is traversed to find nodes matching E1/E2//En. So only a part of SG is traversed to get the result, which greatly reduces the cost of partial label path matching. The detailed procedure is omitted due to space limits.
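The Ancestor-Descendant procedure of this section can be sketched in a few lines. Several points are assumptions rather than details from the paper: SG nodes are taken to carry (order, size) numbers tested with the interval rule Order(a) < Order(d) <= Order(a) + Size(a); a plain dictionary stands in for EM; the extent join uses binary search over the sorted descendant extent in place of the sort-merge, whose details the paper does not give; and the inner loop simply rescans List E2 instead of skipping unmatchable nodes. All names and sample numbers are hypothetical.

```python
from bisect import bisect_right
from typing import NamedTuple


class Rec(NamedTuple):
    """Index record reduced to the fields needed for the join."""
    doc_id: int
    order: int
    size: int


class SGNode:
    def __init__(self, label, order, size, extent):
        self.label, self.order, self.size = label, order, size
        self.extent = sorted(extent)  # sorted by (doc_id, order)
        self.next_element = None      # Next-Element pointer


def sg_contains(a, d):
    # Assumed interval rule applied to the numbers of two SG nodes.
    return a.order < d.order <= a.order + a.size


def join_extents(anc, desc):
    """Pair each ancestor record with the descendant records it contains."""
    keys = [(r.doc_id, r.order) for r in desc]
    out = []
    for a in anc:
        lo = bisect_right(keys, (a.doc_id, a.order))
        hi = bisect_right(keys, (a.doc_id, a.order + a.size))
        out.extend((a, d) for d in desc[lo:hi])
    return out


def ancestor_descendant(element_map, e1, e2):
    result = []
    a = element_map.get(e1)
    while a is not None:              # walk List E1
        d = element_map.get(e2)
        while d is not None:          # walk List E2
            if sg_contains(a, d):
                result.extend(join_extents(a.extent, d.extent))
            d = d.next_element
        a = a.next_element
    return result


# Hypothetical SG: site(1) -> open_auction(2) -> description(3),
#                  site(1) -> item(4) -> description(5).
oa = SGNode("open_auction", 2, 1, [Rec(1, 2, 1), Rec(1, 4, 1)])
desc_a = SGNode("description", 3, 0, [Rec(1, 3, 0), Rec(1, 5, 0)])
item = SGNode("item", 4, 1, [Rec(1, 6, 1)])
desc_b = SGNode("description", 5, 0, [Rec(1, 7, 0)])
desc_a.next_element = desc_b
em = {"open_auction": oa, "item": item, "description": desc_a}

pairs = ancestor_descendant(em, "open_auction", "description")
print(pairs)
```

The point the sketch tries to make visible is that the structural test runs first on the small SG lists, so element records are only joined for SG node pairs that can actually match; the description records under item are never compared against open_auction records.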

4.2 Query Evaluation with Value Indexes

Value-based condition predicates are important in query evaluation. In XML queries, condition predicates are often placed on elements matching a certain label path expression. In SUPEX, value indexes are created according to the requirements of users and can be used in the evaluation of condition predicates. By traversing SG, we get the SG nodes matching a given label path expression. If there are value-based conditions on these nodes and appropriate value indexes exist, the query can be evaluated through the existing indexes. When a large number of data nodes match the path expression, using value indexes costs less than executing the predicates on all candidate nodes.

5 Preliminary Experiment Results

We empirically evaluated the performance of SUPEX on a variety of XML documents. We report results here for a representative dataset: the XMark benchmark [7]. The experiments were performed on a Pentium IV 1.4 GHz platform with MS-Windows 2000 and 256 Mbytes of main memory. The XERCES-C++ parser was used to parse and generate XML data. We implemented our index in the C++ programming language. The data sets were stored on a local disk.

To get controllable document sizes, we used the XML generator XMLgen developed by the XMark benchmark project. For a fixed DTD modeling an Internet auction site, XMLgen produces document instances of controllable size. Table 1 lists the characteristics of the data sets used in our experiments. The columns of the table give the parameter of XMLgen, the size of the generated document, and the number of elements in the generated document, respectively.

Table 1. XML document size and number of elements in documents

Scaling Factor   Document Size (MB)   Element number
0.01             1.12                 17132
0.05             5.6                  83533
0.1              11.3                 167865
0.5              56.2                 832911
1.0              113                  1666315

We implemented the element index and the element-element join algorithm of XISS [5] and compared its performance with SUPEX. Fig. 2 and Fig. 3 report the query response time for //open_auction//description and //description/text, respectively, against XMark documents of increasing size. As shown in these figures, SUPEX is faster than XISS, and its cost reduction relative to XISS grows as the document size increases. We have found the preliminary experiment results quite encouraging. Further performance evaluation will be made in the future.

Fig. 2. Open_auction//description (time in milliseconds versus data size in Mbytes; series: SUPEX, XISS)

Fig. 3. Description/text (time in milliseconds versus data size in Mbytes; series: SUPEX, XISS)

6 Conclusion and Future Work

Our research group is working on a native XML data management system. We are implementing SUPEX as the index module of our system. In the future, we will test our method with large volumes of data and compare it with existing index schemes. Furthermore, value indexes will be added into the SUPEX structure to accelerate the evaluation of predicate conditions.

References

1. T. Bray, J. Paoli, C. M. Sperberg-McQueen, and E. Maler (Eds). Extensible Markup Language (XML) 1.0 (Second Edition). W3C Recommendation, 6 October 2000, http://www.w3.org/tr/2000/rec-xml-20001006
2. D. Chamberlin, D. Florescu, J. Robie, J. Simeon, and M. Stefanescu (Eds). XQuery: A Query Language for XML. W3C Working Draft, 15 February 2001, http://www.w3.org/tr/2001/wd-xquery-2001215
3. R. Goldman, J. Widom. DataGuide: Enabling Query Formulation and Optimization in Semistructured Databases. In Proceedings of the 23rd International Conference on Very Large Data Bases, Athens, Greece, 1997
4. T. Milo and D. Suciu. Index structures for path expression. In Proceedings of the 7th International Conference on Database Theory, pages 277-295, January 1999
5. Quanzhong Li, Bongki Moon. Indexing and Querying XML Data for Regular Path Expressions. In Proceedings of the 27th International Conference on Very Large Data Bases, Roma, Italy, 2001
6. Brian F. Cooper, Neal Sample, Michael J. Franklin, Gisli R. Hjaltason, Moshe Shadmon. A Fast Index for Semistructured Data. In Proceedings of the 27th International Conference on Very Large Data Bases, Roma, Italy, 2001
7. Albrecht R. Schmidt, Florian Waas, Martin L. Kersten, Daniela Florescu, Ioana Manolescu, Michael J. Carey, and Ralph Busse. The XML Benchmark Project. Technical Report INS-R0103, CWI, Amsterdam, the Netherlands, April 2001