Using an Oracle Repository to Accelerate XPath Queries


Colm Noonan, Cian Durrigan, and Mark Roantree

Interoperable Systems Group, Dublin City University, Dublin 9, Ireland
{cnoonan, cdurrigan,

Abstract. One of the problems associated with XML databases is the poor performance of XPath queries. Although this has attracted much attention from the research community, solutions are either partial (not providing full XPath functionality) or unable to manage database updates. In this work, we exploit features of Oracle 10g in order to rapidly build indexes that improve the processing times of XML databases. Significantly, we can also support XML database updates, as the rebuild time for entire indexes is reasonably fast and thus provides for flexible update strategies. This paper discusses the process for building the index repository and describes a series of experiments that demonstrate our improved query response times.

1 Introduction

Native XML databases perform poorly on many complex queries when the database is large or its structure is complex. Efforts to use an index are hampered by the fact that the reconstruction of these indexes (after update operations) is time-consuming. In previous research [9], we devised an indexing method to improve the performance of XPath queries. That work focused on the theoretical aspects of indexing and devised a method of providing fast access to XML nodes based on the principal axes used by XPath to generate query results. It introduced the PreLevel indexing method, provided theoretical proofs of its ability to cover the full functionality of the XPath Query Language, and presented optimised algorithms for XML tree traversals [9]. For each of the primary XPath axes, conjunctive range predicates were derived from the intrinsic properties of the preorder traversal ranks and level ranks.
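To make the idea of range predicates over preorder and level ranks concrete, the following is a minimal sketch in Python. It relies only on a general property of preorder traversal (the descendants of a node are the contiguous run of following nodes whose level is greater than the node's own level); it is not claimed to be the exact formulation or algorithm of [9]. The function names and the tiny sample tree are assumptions made for illustration.

```python
from itertools import takewhile

# Each node is a (pre, level) pair. The list is sorted by preorder rank and
# ranks start at 0, so a node's list index equals its preorder rank.

def descendants(nodes, pre):
    """Nodes after `pre` in preorder, up to the first whose level is not deeper."""
    level = nodes[pre][1]
    return list(takewhile(lambda n: n[1] > level, nodes[pre + 1:]))

def children(nodes, pre):
    """Descendants that sit exactly one level below the context node."""
    level = nodes[pre][1]
    return [n for n in descendants(nodes, pre) if n[1] == level + 1]

def ancestors(nodes, pre):
    """Walk backwards: the nearest preceding node one level up is the parent, and so on."""
    result, wanted = [], nodes[pre][1] - 1
    for n in reversed(nodes[:pre]):
        if n[1] == wanted:
            result.append(n)
            wanted -= 1
    return result[::-1]  # root first

# Hypothetical tree: dblp(0) > book(1) > author(2), title(3); dblp > article(4) > title(5)
SAMPLE = [(0, 0), (1, 1), (2, 2), (3, 2), (4, 1), (5, 2)]
```

In a relational deployment the same conditions would typically become range predicates over the pre and level columns rather than list scans.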
By recording both preorder and level rankings (together with appropriate element values) in the PreLevel index, we provided algorithms to facilitate optimised query response times.

The work presented in this paper was carried out as part of the FAST project (Flexible indexing Algorithm using Semantic Tags). This research is funded by a Proof of Concept grant, under which theoretical ideas are deployed to provide state-of-the-art solutions to current problems. The contribution of this work is the provision of a new XPath Query Interface that performs well against current efforts in the area. Furthermore, we have extended the work in [9] by delivering a fast method of creating the Semantic Repository and by introducing a multi-index system to fine-tune query performance.

The paper is structured as follows: in Section 2, we discuss similar research in XML querying; in Section 3, an outline of the FAST Repository is provided, together with the process used to construct it; in Section 4, we describe our process for semantic query routing; in Section 5, we provide experimental data; and in Section 6, we offer some conclusions.

Funded by Enterprise Ireland Grant PC/2005/0049.
S. Bressan, J. Küng, and R. Wagner (Eds.): DEXA 2006, LNCS 4080, pp , © Springer-Verlag Berlin Heidelberg 2006

2 Related Research

Our approach is to employ a native XML database and support the query processing effort with an indexing method deployed using traditional database techniques. In this section, we examine similar efforts in this area.

XML Enabled Databases. Many researchers, including [5,13], have chosen to XML-enable relational database systems rather than employ a native XML database. The authors of [13] explore how XML enabled databases can support XML queries by:

1. Utilising two separate indexes on element and text values.
2. Incorporating the MPMGJN algorithm, which differs from the standard merge join algorithms found in commercial databases.
3. Converting XQuery expressions into SQL statements.

The results in [13] suggest that the MPMGJN algorithm is more efficient at supporting XML queries than any of the join algorithms found in commercial XML enabled databases. By incorporating the MPMGJN algorithm into the query execution machinery of the RDBMS, such systems can become efficient at XML storage and processing. However, this approach requires that all XML queries be converted to SQL, and [7] shows that not all XQuery expressions can be translated into SQL and that some translate into inefficient SQL statements.
A further disadvantage of XML enabled databases is their inefficiency at retrieving an entire XML document or subsets of it, as several costly joins may be required to construct the result [3].

Native XML Databases. Our PreLevel index structure is an extension of the XPath Accelerator [4], which uses an index structure designed to support the evaluation of XPath queries. This index is capable of supporting the evaluation of all XPath axes (including ancestor, following, child, etc.) [2]. It employs a SAX processor to create pre and post order encodings of nodes that capture the structural and value-based semantics of an XML document. Furthermore, the ability to start traversals from arbitrary context nodes in an XML document allows the index to support XPath expressions that are embedded in XQuery expressions.

In [11], the experience of building Jungle, a secondary storage manager for Galax (an open source native XML database), is detailed. In order to optimise query processing, they used the XPath Accelerator and its indexing structure. However, one major limitation they encountered was the evaluation of the child axis, which they found to be as expensive as evaluating the descendant axis. They deemed this limitation to be unacceptable and designed their own indexes to support the child axis. Although the XPath Accelerator's pre/post encoding scheme has since been updated in [5] to use pre/level/size, our PreLevel structure, as demonstrated in [9], supports highly efficient evaluation not just of the children but also of the descendants of any arbitrary node. The Jungle implementation experience also highlighted the significant overhead imposed at document loading time by a postorder traversal, which is not required by the PreLevel index structure.

3 The FAST Repository

The processing architecture illustrated in Fig. 1 has three levels: document level, metadata level and storage level. In this section, we begin by describing the role of the processors connecting the levels and, at the end of the section, we present the comparative build times for a range of XML databases. For reasons of clarity, we first present some of the terminology we use. The metadata extraction file (label 1 in Fig. 1) is a text document; the Oracle table labelled 2 in the same figure is called the Base Index Table (BIT); and the set of tables labelled 3 are called the Primary Index Tables (PIT).

3.1 Metadata Extraction

The Metadata Extractor processes the document set at level 1 to generate the metadata document set at level 2. A basic SAX parser has been enhanced with semantic rules that trigger events to extract the data required for the PreLevel indexing method. These events deliver the attributes stored in the metadata document and are described below:

- As each node is visited, the PreOrder and Level events obtain the pre-order value of the node and the level at which it occurs in the XML document tree.
- For performance reasons, the Position event determines the position of each node as it occurs at each level in the hierarchy (left to right). This is used to optimise algorithms that operate across a single level.
- As each node is read, the Parent event returns the preorder value of the node's parent.
- The Type event is used to distinguish between elements and attributes.
- The Name and Value events record the name and value of the node.
- The FullPath event is used to record the entire XML path (from node to root).
- The DocID event is fired by the eXist database to provide its internal document identifier for the XML document.
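The events listed above can be sketched with Python's standard xml.sax module. This is an illustrative reconstruction under stated assumptions, not the FAST extractor itself: attributes are given their own preorder rank one level below their owning element, ranks start at 0, FullPath is written root-first, and the DocID event is omitted because it depends on the eXist server. The class name, row keys and function name are hypothetical.

```python
import xml.sax
from collections import defaultdict

class PreLevelExtractor(xml.sax.ContentHandler):
    """Collects one metadata row per element or attribute node."""

    def __init__(self):
        super().__init__()
        self.rows = []                       # row index == preorder rank
        self._open = []                      # row indexes of currently open elements
        self._per_level = defaultdict(int)   # nodes seen so far at each level

    def _add(self, kind, name, value):
        parent = self.rows[self._open[-1]] if self._open else None
        level = parent["level"] + 1 if parent else 0
        self._per_level[level] += 1          # Position: left-to-right rank on this level
        prefix = parent["fullpath"] if parent else ""
        step = "/@" + name if kind == "attribute" else "/" + name
        self.rows.append({
            "pre": len(self.rows), "level": level,
            "position": self._per_level[level],
            "parent": parent["pre"] if parent else None,
            "type": kind, "name": name, "value": value,
            "fullpath": prefix + step,
        })
        return len(self.rows) - 1

    def startElement(self, name, attrs):
        self._open.append(self._add("element", name, ""))
        for attr_name in attrs.getNames():
            self._add("attribute", attr_name, attrs.getValue(attr_name))

    def characters(self, content):
        if self._open:                       # text belongs to the innermost open element
            self.rows[self._open[-1]]["value"] += content

    def endElement(self, name):
        row = self.rows[self._open.pop()]
        row["value"] = row["value"].strip()

def extract_metadata(xml_text):
    handler = PreLevelExtractor()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.rows
```

Each returned row corresponds to one line of the metadata extraction file; a single pass over the document is enough because every attribute is known by the time the node is opened, except Value, which is completed when the element closes.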

[Fig. 1. Repository Processing Architecture. Level 1 (Document): XML document set. Level 2 (Metadata): metadata extraction file (1). Level 3 (Storage): eXist, Base Index Table (2) and Primary Index Tables (3). Processors: Metadata Extractor, eXist Storage, Bulk Storage, Semantic Indexing.]

3.2 eXist Storage Processor

The eXist database [6] provides schema-less storage of XML documents in hierarchical collections and can store a large amount of XML data. It uses built-in methods to store XML documents in its document store with indexed paged files. The eXist Storage Processor stores an XML document in the form of collections (of sub-documents) in a hierarchical structure. A single parent collection acts as the root, and this organisational structure is intended to speed up querying and information retrieval.

3.3 Bulk Storage Processor

One of the problems with constructing indexes for XML documents is that they tend to be very large, with one or more tuples for each node. Thus, building the index can be time-consuming, which makes document updates difficult. While some form of incremental updating process can be used to address this problem, we sought a means of building entire indexes quickly. Using Oracle, it is possible to significantly reduce the time required to insert large amounts of information (in the case of DBLP, millions of rows) through a bulk loading process. Oracle's SQL*Loader provides a means of bypassing time-consuming SQL INSERT commands: the loader is presented with a control file containing a metadata description of a large text file (generated by our Metadata Extraction processor). The Bulk Storage Processor loads the text file generated by the Metadata Extractor and creates the control file required by SQL*Loader. The output of this processor is the Base Index Table (BIT) in Oracle.
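As a rough illustration of this step, the sketch below writes metadata rows as delimited text and generates a matching SQL*Loader control file. The control file uses standard SQL*Loader clauses (LOAD DATA, INFILE, TRUNCATE, INTO TABLE, FIELDS TERMINATED BY, TRAILING NULLCOLS), but the table name, column names, file name and delimiter are assumptions made for this example; the paper does not give the actual layout of the Base Index Table.

```python
# Hypothetical column layout for the Base Index Table. "lvl" is used instead
# of "level" because LEVEL is a reserved word in Oracle SQL.
BIT_COLUMNS = ["pre", "lvl", "position", "parent", "type", "name", "value", "fullpath", "docid"]

def to_data_line(row, columns=BIT_COLUMNS, sep="|"):
    """Serialise one metadata row; missing values become empty fields.

    Assumes field values never contain the separator character.
    """
    return sep.join("" if row.get(c) is None else str(row[c]) for c in columns)

def control_file(data_path, table, columns=BIT_COLUMNS, sep="|"):
    """Build the text of a control file describing the delimited data file.

    TRUNCATE empties the target table first, which suits a full index rebuild.
    """
    column_list = ",\n  ".join(columns)
    return (
        "LOAD DATA\n"
        f"INFILE '{data_path}'\n"
        "TRUNCATE\n"
        f"INTO TABLE {table}\n"
        f"FIELDS TERMINATED BY '{sep}'\n"
        "TRAILING NULLCOLS\n"
        f"(\n  {column_list}\n)\n"
    )
```

The generated text would be saved to a file and passed to the sqlldr command-line tool through its control parameter, so the rows are loaded in bulk rather than through one INSERT statement per node.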

3.4 Semantic Indexing Processor

This processor creates a set of Primary Index Tables that can be used to improve times for different types of queries. At present, the semantic rules are quite simple. Initially, a Metadata Table is created containing useful statistics on the XML data. One set of statistics contains the total number of data elements in the document set (EntireTotal), each element name (ElemName), the number of element types (ElemTypeCount) and the number of occurrences of each element (ElemTotal). For each element that exceeds a threshold Tix, the Semantic Index Processor generates a Primary Index Table. Tix is calculated by multiplying the average number of elements per element type (ElemAvg = EntireTotal/ElemTypeCount) by an IndexFactor that is currently set at 2, based on our empirical study of XML document content. Thus, an element whose ElemTotal exceeds Tix will have a Primary Index Table created. We have found that this has performance advantages over the creation of multiple indexes on the Base Index Table.

3.5 Repository Build Times

In our experiments, we built the semantic repository for five XML databases on a Dell Optiplex GX620 (3.20GHz) workstation with 1GB RAM on a Windows platform. In Fig. 2 we provide the build times for five standard XML databases [12]. The Bulk Storage Processor played a significant role: the time required to generate the Semantic Repository for DBLP using SQL INSERT commands was 8.13 hours and is now reduced to seconds using the same workstation.

Name       Size   Rows  Elems.  Atts.  Levels  BIT  PIT     BIT+PIT
DBLP       127MB                                    299.0s  557.9s
Line Item  30MB                                     3.7s    67.1s
UWM        2MB                                      0.3s    4.1s
AT_meta    28MB                                     75.7s   135.1s
Mondial    1MB                                      1.6s    5.0s

Fig. 2. Repository Construction Times

4 Query Processing

The first role of the Query Router (QR) is to classify the query into one of the following three categories:

- Index Query. These queries are resolved at the index level and are regarded as text node queries (see Section 5).
- Partial XPath Query. These queries are processed and routed to more precise locations in the database. They then use the XQuery processor of the native XML database.
- Full XPath Query. These queries are resolved at the index level to generate a set of unique eXist identifiers. These identifiers allow direct access to the result documents in the database and do not require eXist's XQuery Interface.

For reasons of space, we concentrate on the Partial XPath Query category, as this employs all features of the Semantic Repository and works with the native XML database to generate the query result set. The QR currently accepts only XPath queries as input but converts these to XQuery FLWOR expressions on output.

4.1 Query Router

The Query Router breaks the location path of an XPath expression into its location steps. Each location step comprises an axis, a node test (which specifies the node type and name) and zero or more predicates. The role of this processor is to modify the XPath axis to provide a more precise location, using the algorithms described in [9] and the data stored in the Semantic Repository. In Example 1, the XPath query retrieves the book titles for the named author.

Example 1. //book[author = "Bertrand Meyer"]/title

In Fig. 3(a), the location steps generated by the XPath parser are shown. The XPath axes, the appropriate nodes and the predicate for the book node are displayed. The main role of the Query Router is to provide a more precise location path and thus improve query performance. With the descendant-or-self axis from the root, query processing involves the root node and all of its descendants until it finds the appropriate book node. Our optimiser (using the PreLevel index) can instead quickly identify a precise path from the root to this book node. In this example, it requires a modified location step, as displayed in Fig. 3(b). The Base Index Table is the default index used to improve performance, but before it is used a check is made to determine whether one of the Primary Index Tables can be used.
In the node test part of the XPath expression, the Query Router checks whether one of the PIT set is sorted on that particular element and, if so, uses that index.

Example 2. for $title in doc("/db/dblp/dblp.xml")/dblp/book[author = "Bertrand Meyer"]/title return $title

(a)
Step  Axis                Node Test  Predicates
1     descendant-or-self  node()
2     child               book       [author = "Bertrand Meyer"]
3     child               title

(b)
Step  Axis   Node Test  Predicates
1     child  dblp
2     child  book       [author = "Bertrand Meyer"]
3     child  title

Fig. 3. Location Steps
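The routing decisions described in Sections 3.4 and 4.1 can be sketched as follows. This is a simplified illustration, not the Query Router's actual implementation: the function names are hypothetical, the set of known full paths stands in for a lookup against the Semantic Repository, and only a leading // step followed by a plain element name is rewritten.

```python
import re

def pit_elements(elem_totals, index_factor=2):
    """Element names whose ElemTotal exceeds Tix = ElemAvg * IndexFactor."""
    entire_total = sum(elem_totals.values())
    elem_avg = entire_total / len(elem_totals)   # EntireTotal / ElemTypeCount
    tix = elem_avg * index_factor
    return {name for name, total in elem_totals.items() if total > tix}

def choose_index(node_test, pits):
    """Prefer a Primary Index Table for the node test; otherwise use the BIT."""
    return "PIT:" + node_test if node_test in pits else "BIT"

FIRST_STEP = re.compile(r"^//([A-Za-z_][\w.-]*)")

def rewrite(xpath, fullpaths):
    """Replace a leading //name step by each known root-to-name path.

    An empty result means no such element exists, so the query cannot match.
    """
    match = FIRST_STEP.match(xpath)
    if not match:
        return [xpath]
    name = match.group(1)
    targets = sorted(p for p in fullpaths if p.rsplit("/", 1)[-1] == name)
    return [p + xpath[match.end():] for p in targets]

def to_flwor(doc_uri, path, var="n"):
    """Wrap a rewritten path in an XQuery FLWOR expression."""
    return f'for ${var} in doc("{doc_uri}"){path} return ${var}'
```

With the paths /dblp, /dblp/book and /dblp/book/title known to the repository, rewriting Example 1 and wrapping it with the variable name title yields the expression of Example 2. A query whose first element has no known path, such as Q5 in Section 5, produces no rewritten paths and need not be sent to the native XQuery processor at all.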

The final part of this process is the construction of one or more XQuery FLWOR expressions by inserting the modified axis expressions into the for clause. The expression in Example 2 is passed to the native XQuery processor to complete the result set.

5 Details of Query Performance

All experiments were run using a 3GHz Pentium IV machine with 1GB memory on a Windows XP platform. The Query Router runs using Eclipse 3.1 with Java virtual machine (JVM) version 1.5. The Repository was deployed using Oracle 10g (running a Linux operating system, with a 2.8 GHz Pentium IV processor and 1GB of memory) and eXist (Windows platform with a 1.8 GHz Pentium IV processor and 512MB of memory) database servers. The default JVM settings of eXist were increased from -Xmx128000k to -Xmx256000k to maximise efficiency. The DBLP XML database was chosen (see Fig. 2 for details) for its size. For the purpose of this paper, we extracted a subset of the original query set [8] and extended a categorisation originally used in [1].

- Empty queries (Q5) are those that return zero matches.
- Text node queries (Q6) are queries that return text nodes, i.e. XPath queries that end with the text() function.
- Wildcard queries (Q2) are queries that contain a wildcard character.
- Punctual queries (Q1, Q3, Q4) query only a small portion of the database and have a high selectivity; thus they return a small number of matches.
- Low selectivity queries (Q6, Q7) are queries that may return a large number of matches.

Table 1. DBLP Queries used in our experiments

Query  XPath Expression                                                        Matches
Q1     //inproceedings[./title/text() = "Semantic Analysis Patterns."]/author  2
Q2     //inproceedings[./*/text() = "Semantic Analysis Patterns."]/author      2
Q3     //book[author = "Bertrand Meyer"]/title                                 13
Q4     //inproceedings[./author = "Jim Gray"][./year = "1990"]/@key            6
Q5     //site/people/person[@id = "person"]                                    0
Q6     //title/text()                                                          328,859
Q7     /dblp/book/series

5.1 Queries and Performance Measures

In order to obtain contrasting results, we ran all queries under varying support modes:

1. eXist. The XPath query is executed using the eXist query processor only.
2. eXist + BIT. The XPath query is pre-processed using the Base Index Table before being passed to the eXist query processor.

3. eXist + PIT. The XPath query is pre-processed using a Primary Index Table before being passed to the eXist query processor.
4. PreLevel. The XPath query is processed at the PreLevel index without using the eXist query processor.

For each query, all the execution times are recorded in milliseconds (ms), together with the number of matches. The times were averaged (with the first run eliminated) to ensure that all results are warm cache numbers. Table 2 displays the execution times for each of the seven queries in each support mode. Some of the queries (Q2, Q6) in the eXist mode failed to return any results (R2, R6), as eXist continually returned an out of memory error. Although the eXist user guide suggests altering the JVM settings in order to address the problem, even with optimum JVM settings these queries fail to generate a result.

Table 2. Query Execution Times

Result  eXist  eXist+BIT  eXist+PIT  PreLevel
R1 20, , ,408.2 N/A
R2 19, ,412.9 N/A
R3 As eXist + BIT N/A
R4 3,209 2, ,721.3 N/A
R5 As eXist + BIT N/A
R6 102, , ,648.2
R7 ,480 As eXist + BIT N/A

5.2 Performance Analysis

Results R1, R3 and R4 show that the Query Router (QR) can make a significant difference to punctual queries. Furthermore, the results indicate that the PIT (where available) is more efficient than the BIT at routing XPath queries. Result R5 suggests that the QR can efficiently handle queries that return empty result sets. This is because the QR always consults its Semantic Repository (SR) to ensure that a query has a positive number of matches before passing the query to eXist. The multi-index feature of the SR ensures that this type of query is identified quickly. As text node queries (Q6) require only the PreLevel index for processing, they run far more efficiently in the PreLevel mode. Result R7 indicates that the QR will not improve the performance of child queries. This is not unexpected, as eXist handles this form of query well.
The QR performs less favourably because it must perform an index look-up in order to determine the respective document URI(s), which are required in each XQuery expression. However, we believe that with the incorporation of a meta-metatable containing summarised data for each fullpath expression and their respective document URIs, the QR will be able to outperform eXist even for this form of query. This forms part of ongoing research and initial results are positive [10].

The eXist processor cannot handle the upper scale of low selectivity queries such as Q6 that return a very large number of matches. However, the QR can process the upper scale of low selectivity queries by:

- Utilising our Semantic Repository to calculate the number of matches for a low selectivity query.
- If the number of matches is greater than a set threshold (50,000 in our current experiment setup), breaking the low selectivity query into a number (equal to (number of matches/threshold) + 1) of child queries. Taken together, the resulting child queries are equivalent to the low selectivity query.

The eXist processor also fails to handle wildcards where the search range is high or the database is very large (Q2). If there is a wildcard character in the node test or predicate clauses, the QR removes the wildcard by processing the wildcard option at the PreLevel index. This aspect of the QR is not yet fully functional: it can only remove wildcards in certain queries.

6 Conclusions

In this paper, we presented our approach to improving the performance of XPath queries. In this context, we discussed the construction of the FAST Semantic Repository, which includes an indexing structure based upon our prior work on level based indexing for XPath performance. In our current work, we provide an extended indexing structure deployed using Oracle 10g and exploit some of Oracle's features to ensure fast rebuilding of the index. Using our Query Router, we can then exploit our indexing structures in one of two broad modes: using the indexing method to fully resolve the query, or using either the Base or Primary Index Table together with the eXist XQuery processor to generate the result set. The Query Router accepts XPath expressions as input and creates XQuery expressions (where necessary) for the eXist database. We also described a series of experiments to support our claim that XPath expressions can be optimised using our indexing structures. Together with the fast rebuilding of the index, this method supports not only fast XPath queries but also a strong basis for the provision of updates.
The construction time for a large index is between 60 and 260 seconds, which allows the index to be rebuilt multiple times during the course of a day. This provides the basis for appending to updateable indexes, with full rebuilds at set intervals. Thus, our current research focus is on managing update queries. We are also examining the cost of PIT builds against their increase in query performance as a fine-tuning measure for the index. Finally, our next version of the FAST prototype should include an interface for both XPath and XQuery expressions.

References

1. Barta A., Consens M. and Mendelzon A. Benefits of Path Summaries in an XML Query Optimizer Supporting Multiple Access Methods. In Proceedings of the 31st VLDB Conference, pp , Morgan Kaufmann,
2. Berglund A. et al. XML Path Language (XPath 2.0), Technical Report W3C Working Draft, WWW Consortium (
3. Beyer K. et al. System RX: One Part Relational, One Part XML. In Proceedings of ACM SIGMOD Conference on Management of Data, pp , ACM Press,
4. Grust T. Accelerating XPath Location Steps. In Proceedings of the 2002 ACM SIGMOD International Conference on the Management of Data, volume 31, SIGMOD Record, pp , ACM Press,
5. Grust T., Sakr S. and Teubner J. XQuery on SQL Hosts. In Proceedings of the 30th International Conference on Very Large Databases (VLDB), pp , Morgan Kaufmann,
6. Meier W. eXist: An Open Source Native XML Database. In Web, Web-Services, and Database Systems, LNCS 2593, pp , Springer,
7. Manolescu I., Florescu D., Kossmann D. Answering XML Queries on Heterogeneous Data Sources. In Proceedings of the 27th International Conference on Very Large Databases (VLDB), pp , Morgan Kaufmann,
8. Noonan C. XPath Query Routing in the FAST Project. Technical Report ISG-06-01, isg,
9. O'Connor M., Bellahsene Z. and Roantree M. An Extended Preorder Index for Optimising XPath Expressions. In Proceedings of 3rd XML Database Symposium (XSym), LNCS Vol. 3671, pp , Springer,
10. Roantree M. The FAST Prototype: a Flexible indexing Algorithm using Semantic Tags. Technical Report ISG-06-02, isg,
11. Vyas A., Fernández M. and Siméon J. The Simplest XML Storage Manager Ever. In Proceedings of the First International Workshop on XQuery Implementation, Experience and Perspectives <XIME-P/>, in cooperation with ACM SIGMOD, pp 27-42,
12. The XML Data Repository. xmldatasets/,
13. Zhang C. et al. On Supporting Containment Queries in Relational Database Management Systems. In Proceedings of the 2001 ACM SIGMOD International Conference on the Management of Data, pp , ACM Press, 2001.
