

XMLDBMS
Computer Science 764
December 22, 1998
Kevin Beach, Vuk Ercegovac, Michael Henderson, Amy Rea, Suan Yong

Introduction:

XML-QL is a query language for obtaining data from XML documents on the World Wide Web. From a database viewpoint, an XML document serves as a database from which a query extracts results. While the semi-structured nature of XML lends itself to an object data model, the relational data model has been shown to perform well with queries posed over large data sets. Thus, we have designed and implemented a simple database system that executes relational-like queries over XML data sets that have been transformed into the relational model. Specifically, we execute XML-QL queries in a system that dynamically loads XML data sets and transforms them into relations. The queries are transformed into intermediate execution plans, from which an optimizer produces a less costly plan that accesses the relations with RDBMS-like operators.

Since we are primarily interested in issues concerning the use of relations to store and query XML data sets, we do not handle issues relating to recovery, concurrency, or the use of secondary and non-volatile storage. This decision is also supported by the expected normal usage of such a system: the intended user is an XML surfer who, given a set of XML documents, poses queries in XML-QL via an applet in a browser that can display the results of the query. In essence, the system serves as an XML document filter that transforms XML data sets into relations to facilitate more efficient processing.

We have initially developed our system to support only a subset of the features provided by XML-QL. Supporting the complete XML-QL specification is not necessary to achieve our goals. With respect to the query language, we have implemented the features that most completely demonstrate the querying aspect of the language rather than its data manipulation aspect. As such, the optimizer can only take advantage of operators for which language support has been added. Similarly, the GUI attempts to provide a clean interface for constructing queries and displaying results in a straightforward way. We do not deal with the problem of displaying XML graphically. Our goal is to build a system with which we can gain some insight into the design considerations that arise when using relations to store and query XML data sets.

Architecture Overview:

Figure 1 is a schematic of the XMLDBMS system, showing the steps involved in processing a query. Initially, the client applet submits to the server an XML-QL query (or a document with an embedded query). The server strips out the query and forwards it to the XML-QL to SQL translator. The translator identifies the URLs of the XML documents that the query needs and passes them to the storage manager. The storage manager loads the DTD document associated with each URL and converts it into an internal schema data structure. The storage manager also gets the catalog associated with the data in the XML document (at present, we load the document and build the catalog from scratch; in the future we envision having precomputed catalog information stored in a separate file; see Future Work). The schema and catalog are returned to the translator, which uses the schema to verify the validity of the XML-QL query. The translator then produces an SQL query and combines the catalogs it has collected into a single catalog. The SQL query and catalog are passed on to the query optimizer, which generates the execution plan. The plan execution component obtains the tables from the storage manager (which fetches the XML document and translates it into an internal table data structure) and produces a result table that is returned to the translator. The results are then converted into the desired XML formatting and returned to the server, which passes them along to the client applet (or embeds them into the document that contained the embedded XML-QL query).

[Figure 1 - flowchart of the XMLDBMS system. Components: Client (applet), Server, Translator, Storage Manager, Optimizer, Plan Execution. Labeled flows: XML-QL query; URL; DTD schema; schema, catalog; SQL, catalog; plan; table name; table; result table; XML-QL formatted results; XML results. The Storage Manager fetches the DTD document, fetches the XML document, and builds the catalog.]

The Storage Manager

The XMLDBMS storage manager plays the role of a buffer manager for data that could potentially be scattered throughout the web. Specifically, it is responsible for acquiring, for a given XML document, a schema, a catalog, and a table containing the data in that document. It is also in charge of assigning to each XML document a unique page ID that is prepended to the name of each attribute in that document's table. This ensures, for example, that tables from two XML documents that happen to have the same name will have different internal names.

When the schema for a given document is needed, the storage manager fetches the DTD for that document, and the DTD parser translates it into an internal schema data structure (which is actually just a table). At present we assume the DTD for a given document is in a separate file in the same directory, and has the filename of the document plus a .dtd suffix (e.g., the DTD for the file is in ). When the table for a given XML document is needed, the storage manager fetches the document and gives it to the XML parser, which builds the tables associated with that document. When the catalog for a given document is needed, the storage manager gets the tables associated with the document and builds a catalog from scratch.

We treat the fetching of the catalog as a separate function of the storage manager because a possible extension of this project is to have the query execution distributed among multiple servers. In that case, it would be desirable to be able to obtain the catalog for a given XML document without having to fetch the document itself (the catalog information would, for example, be stored in a separate file, like the DTD). We describe this extension further in Future Work.

Our current implementation of the storage manager caches the schemas, tables, and catalogs it has built. This is desirable if we assume that a client who queries a given XML document is likely to issue more queries over the same document. It also assumes that tables fit in memory. In the current implementation the cache is never flushed. Possible future work could incorporate a more sophisticated buffer management system that deletes stale tables from the cache, or that supports tables that do not fit in memory.

The XML-QL to SQL Translator

The translator component of XMLDBMS uses an XML-QL parser that was constructed using the ANTLR parser-generating tool [Ant98]. The grammar for XML-QL as presented by Deutsch et al. in the W3C proposal [DFF+98] is incomplete, buggy, and at times confusing. As such, our parser supports a modified subset of XML-QL, the grammar for which is given in Table 1. In particular, we have excluded support for i) functions; ii) nested queries and query blocks; iii) Skolem functions; iv) regular path expressions. Additionally, we do not at present support the use of tag variables in queries.

queryblock ::= where ( orderby )? construct
where      ::= "WHERE" condition ( "," condition )*
condition  ::= element "IN" datasource | predicate
element    ::= starttag ( STRING | LITERAL | VAR | ( element )+ ) endtag ( ( "ELEMENT_AS" VAR ) | ( "CONTENT_AS" VAR ) )*
starttag   ::= "<" ( VAR | ID ) ( attribute )* ">"
endtag     ::= "</" ( VAR | ID )? ">"
attribute  ::= ID "=" ( STRING | VAR )
datasource ::= VAR | STRING
predicate  ::= expression oprel expression
expression ::= VAR | STRING | LITERAL
oprel      ::= "<" | "<=" | ">" | ">=" | "=" | "!="
orderby    ::= "ORDER-BY" ( VAR ) ( "," VAR )*
construct  ::= "CONSTRUCT" ( result | VAR )
result     ::= starttag ( STRING | LITERAL | ( VAR | result )+ ) endtag

Table 1 - subset of XML-QL grammar supported by XMLDBMS.

The parser builds an abstract syntax tree (AST) representing the XML-QL query, which at the root level consists of a "WHERE" clause and a "CONSTRUCT" clause. The translator walks through the "WHERE" clause to first identify the URLs of the datasources over which the query is searching. It then requests from the storage manager the schemata (from the DTDs) and catalog information for the datasources. Note that the storage manager first checks its cache to see if the information has been previously loaded. The storage manager also assigns to each datasource a unique internal identifier (we use strings of the form pagen ) which is prepended to the name of each table in that datasource. This ensures that each table in the storage manager can be uniquely identified (specifically, we will not be confused if two different XML pages contain tables with the same name).

The schemata are then used to verify the validity of the query, i.e. the translator checks that the elements described in the query do indeed exist in the schema of their datasource. After this, the translator can translate the "WHERE" clause of the query into a SQL query. This SQL query, along with the catalogs, is fed into the plan generation and execution components of XMLDBMS. The plan-generation component (the query optimizer) uses the catalogs to generate a plan tree, which is used by the plan-execution component to fetch the appropriate tables (through the storage manager) and perform the required operations to produce a result table. This result table is returned unprojected to the translator.

The final task of the translator is to walk through the "CONSTRUCT" clause of the AST of the query, which describes the desired (XML) output format of the results. The translator converts the result table into a string containing the formatted (and projected) results and returns it to the server front end.
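The identifier assignment and never-flushed cache described above can be illustrated with a minimal Java sketch. Everything here is an assumption made for illustration rather than the project's actual code: the class and method names, the reading of the identifier format as "page" followed by a counter, the underscore separator, and the use of a plain string as a stand-in for the loaded tables.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch of a storage manager that assigns page IDs and caches loads. */
final class StorageManagerSketch {
    private final Map<String, String> pageIds = new HashMap<>();    // URL -> internal page ID
    private final Map<String, String> tableCache = new HashMap<>(); // URL -> loaded tables (placeholder type)
    private final Function<String, String> loader;                  // stands in for "fetch and parse the document"
    private int loads = 0;

    StorageManagerSketch(Function<String, String> loader) {
        this.loader = loader;
    }

    /** Returns the page ID for a URL, assigning page1, page2, ... on first sight (assumed format). */
    String pageIdFor(String url) {
        String id = pageIds.get(url);
        if (id == null) {
            id = "page" + (pageIds.size() + 1);
            pageIds.put(url, id);
        }
        return id;
    }

    /** Prepends the page ID so that equal table names from different documents stay distinct. */
    String internalName(String url, String tableName) {
        return pageIdFor(url) + "_" + tableName; // the separator is an assumption
    }

    /** Loads the tables for a URL once; later requests are served from the cache, which is never flushed. */
    String tablesFor(String url) {
        String cached = tableCache.get(url);
        if (cached == null) {
            cached = loader.apply(url);
            loads++;
            tableCache.put(url, cached);
        }
        return cached;
    }

    int loadCount() {
        return loads;
    }
}
```

With this sketch, a table named book in two different documents would receive two different internal names, and repeated requests for the same document would trigger only one load.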

Parsing the DTD

The DTD was parsed using a third-party open source parser, which can be found at the following URL: ( Reading in the DTD and creating the schema of the database simply involves traversing the parse tree that is created by the DTD parser. The first level of children in the tree represents the relations in the database. For each node at this level, a new relation is created and placed at the end of a vector stored by the DBMS. It is possible that a node at this level does not need to generate a new table; however, we leave that decision to future generations of the software. The second level of children in the tree represents the elements and attributes of each relation (the field names). Two vectors are maintained in each relation: one stores the names of the elements/attributes and the other stores the corresponding types. At this point the tables are unaware of any links between each other.

<!ELEMENT book (author+, title, publisher)>
<!ATTLIST book year CDATA #REQUIRED>
<!ELEMENT publisher (name, address)>
<!ELEMENT author (firstname?, lastname)>

DTD Node
  BOOK: title, year, AUTHOR, PUBLISHER
  AUTHOR: firstname, lastname
  PUBLISHER: name, address

Figure 2. Example DTD Parse Tree
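A minimal Java sketch of this two-level traversal follows. It assumes a simplified parse-tree node type defined here for illustration (not the third-party parser's own classes), and the type names stored in the second list are placeholders.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: derive relation schemas from a two-level DTD parse tree. */
final class DtdSchemaSketch {

    /** Minimal stand-in for a parse-tree node; not the third-party parser's own type. */
    static final class DtdNode {
        final String name;
        final String type; // e.g. "text"; the type names are assumptions
        final List<DtdNode> children = new ArrayList<>();

        DtdNode(String name, String type) {
            this.name = name;
            this.type = type;
        }

        DtdNode add(DtdNode child) {
            children.add(child);
            return this;
        }
    }

    /** One relation: two parallel lists holding field names and field types. */
    static final class RelationSchema {
        final String name;
        final List<String> fieldNames = new ArrayList<>();
        final List<String> fieldTypes = new ArrayList<>();

        RelationSchema(String name) {
            this.name = name;
        }
    }

    /** First-level children become relations; second-level children become their fields. */
    static List<RelationSchema> buildSchema(DtdNode root) {
        List<RelationSchema> relations = new ArrayList<>();
        for (DtdNode relNode : root.children) {
            RelationSchema rel = new RelationSchema(relNode.name);
            for (DtdNode field : relNode.children) {
                rel.fieldNames.add(field.name);
                rel.fieldTypes.add(field.type);
            }
            relations.add(rel); // appended at the end, as in the text
        }
        return relations;
    }
}
```

Applied to the tree of Figure 2, this would yield three relations (BOOK, AUTHOR, PUBLISHER), with BOOK carrying the fields title, year, AUTHOR, and PUBLISHER and no link yet between the tables.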

Translating XML to a Relational Database

Reading in the actual data from the XML page follows a similar process. An instance of the XML parser is needed to fetch the data into the tables, and a new parse tree is therefore created. The nodes created by the data parser contain a lot of information, but we needed only a small number of their features. For this project the key features of each node in this tree are Node Type, Node Name, and a possible set of children. A depth-first approach was used to traverse the tree. The parser does have methods to examine sibling nodes very easily, so a breadth-first traversal would also have worked; however, the depth-first traversal was more intuitive to code.

As the parser traverses the tree, if the Node Name matches one of the table names in the schema of the database, a new record is created. All of the fields in a newly created record are initialized to null. (The way that the XML DTD is set up, there is no possibility of duplicate table naming, nor is there any crossover in field/relation names.) At this point, the children of this node are checked to see if their Node Names match any of the column names in this relation. If a node is found that does not match, the document is not consistent with the DTD that it specified. If the Node Name does match one of the column headings and the node has only one child, that child is in fact the text/data associated with this node. All of the text that is stored in the database is found in such leaf nodes. In this case, the text is just added to the current record. When the tree traversal pointer goes back up to the main parent (the table name), the record is appended to the table.

Using the example in Figure 3, if one of the nodes in level 2 has more than one child, this means that the node is the parent node of a new record. In this case, a new record is created to hold the data in the children. The name and the schema of the relation that the child record belongs to are stored in the parent node. In order to link the parent field to the nested record, the text for this field is an id value that represents the record that is now stored in the child relation. The type of this field is also changed to lookup. An integer value of type lookup is actually an index to the child records.

The way that the parse tree is set up naturally lends itself to having set-valued attributes. Since our project is designed around the relational model, a second pass through the data was needed to break down the records. The second pass is also the best time to test integrity constraints on the data, since the data parser interprets everything as text.

Level 1: Book
Level 2: title, year, AUTHOR, PUBLISHER
Level 3: data (under title), data (under year), firstname, lastname (under AUTHOR), name, address (under PUBLISHER), each with a data child

Figure 3. Example XML Parse Tree

Catalogs

Catalogs are stored as a set of three relations in the DBMS. Every XML page that is fetched and transformed into relational tables results in the generation of one such set, using the schemas in Figure 4.
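As a rough illustration of how the per-page catalog entries could be derived from an in-memory relation, the Java sketch below fills in rows for the Relations and Attribute schemas, with field order taken from Figure 4. The Relation class, the method names, and the example type names are assumptions made for this sketch; index entries are omitted.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: build catalog rows from a relation held in memory. */
final class CatalogSketch {

    /** Minimal in-memory relation; tuples are kept in a list, so the count is always known. */
    static final class Relation {
        final String name;
        final List<String> attrNames;
        final List<String> attrTypes;
        final List<Object[]> tuples = new ArrayList<>();

        Relation(String name, List<String> attrNames, List<String> attrTypes) {
            this.name = name;
            this.attrNames = attrNames;
            this.attrTypes = attrTypes;
        }
    }

    /** Row for the Relations schema: relation name, number of tuples, number of attributes. */
    static Object[] relationEntry(Relation r) {
        return new Object[] { r.name, r.tuples.size(), r.attrNames.size() };
    }

    /** Rows for the Attribute schema: attribute name, relation name, attribute type. */
    static List<String[]> attributeEntries(Relation r) {
        List<String[]> rows = new ArrayList<>();
        for (int i = 0; i < r.attrNames.size(); i++) {
            rows.add(new String[] { r.attrNames.get(i), r.name, r.attrTypes.get(i) });
        }
        return rows;
    }
}
```

Because the entries are computed from the relation at the time it is built, they would go stale as soon as tuples were added or removed, which is acceptable only while the data is treated as read-only.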

Relations Schema: Relation Name, Number of Tuples, Number of Attributes
Indices Schema: Index Name, Relation Name, Key, Num. of Entries, Num. Unique, Max Value, Min Value, Index Type
Attribute Schema: Attribute Name, Relation Name, Attribute Type

Figure 4. Schemas for the catalogs used by XMLDBMS

As the set of relations is generated, it is appended to a master catalog set in the storage manager. Because our array-based approach to relations automatically keeps track of the number of tuples in a relation, generating the catalogs is trivial given the relation and its schema. This approach would not be satisfactory, however, if we were to allow updates to be made on the relations.

Plan Generation and Execution

For plan generation, we chose to use an existing query optimizer framework and its support code, Opt++ [KD95]. Our decision was driven by two factors: development time and usefulness to a prototype system. With the sample optimizer provided with Opt++, we were able to output a plan representation immediately, which enabled us to develop the plan execution infrastructure in parallel with the modifications to the optimizer. Since Opt++ is designed to be extensible, we were able to customize it to handle our operators as we developed them and to modify the cost calculations to reflect our system more closely. More importantly, since Opt++ is designed for flexibility and since many factors contributing to the design of XMLDBMS, such as workloads, data sets, and specifications, are either non-existent or in flux, the integration of such a system is of significantly greater importance when prototyping solutions and running experiments. For example, by specifying the catalog of a hypothetical data set, we can get preliminary numbers when trying different search strategies or indices.

Interface to Opt++

In line with the prototyping argument above, we chose to use Java to implement XMLDBMS. However, Opt++ was written in C++, so we needed an interface between the optimizer and XMLDBMS, which we call the OptClient. The responsibilities of the OptClient are to manage the Opt++ process, feed it queries and catalog data, and translate the optimized plan from Opt++ into a form that can be executed using the services provided by XMLDBMS. Since one may want to use more than one optimizer and, more importantly, since Opt++ is meant to be extended (i.e. its outputs or inputs may change), we designed OptClient to be a generic interface that a developer could use to hook up such an optimizer to a database written in Java, such as XMLDBMS. The methods that must be supported are startoptimizer and optimize. The first simply execs an optimizer and sets up communication streams with it; the second returns the root of an optimized plan given a query and a catalog. It is assumed that the query is valid for the instance of the catalog passed in with the query. If no catalog is passed in, the query is assumed to be valid over the previous catalog. We have implemented a class that supports this interface to manage and communicate with the current version of Opt++.

When a result comes back from Opt++, it must be parsed and translated into the operators supported by XMLDBMS. The output from Opt++ consists of pairs of lines: the operator name on the first line and the operator's arguments on the second. The set of such pairs is output as a tree traversal of the plan found in Opt++. Hence, the process used for the plan translation is: 1) bind the name of the operator or access method to its corresponding operator in XMLDBMS, 2) let the operator parse its own arguments, and 3) set the children of the node if it is not a leaf node.

Plan Execution

Previously we described how the output from Opt++, a description of an optimized plan, is converted into an execution tree that is ready to process the given query over the data set. We now describe in more detail the operators that make up the execution tree and what is required during execution. All elements of the tree are referred to as operators even though, logically, they can be either implementations of relational operators or access methods. In either case, they conform to an Operator interface that enforces the implementation of the methods open, next, close, and getoutput. Note that next returns the next satisfying tuple, or null if there are none left, and getoutput provides a way for the parent to get the schema of a child.

For an access method such as filescan, the only criterion for passing a tuple to the caller is whether all tuples in the relation have already been seen. However, for an index such as a B-tree, or an internal operator such as a Select or Join, a predicate is required to decide whether or not the tuple is passed on to the caller. Such a predicate is implemented as a generic set of OR predicates inside an AND predicate, i.e. the predicate is in conjunctive normal form. Each OR expression is composed of standard predicates that take a value or an attribute reference as arguments. The predicates handled are >, >=, <, <=, and =, and they are implemented in such a way as to make extending them to handle new types relatively painless. Thus the top-level AND predicate is ultimately composed of values and attribute references.

One or two tuples, depending on whether the operator is unary or binary, must be evaluated against this top-level predicate. To do this, the predicate has a left and a right input, which receive tuples from the left and right child of the operator. This is done so that the values used from the outer tuple of a join are not re-referenced as the other child's tuples stream by. Thus, when parsing a predicate, the operator must determine which side of its predicate an attribute reference belongs to. This is done by checking the output schemas of the children to see which side of the tree an attribute originates from. Once the origin is known, the position of an attribute reference in a given operator is found before execution from the schemas in the outputs of the operator's children. If the node is a leaf, the schema is obtained from its base relation and the attributes are rewritten to provide a unique name, composed of the table or variable name and the attribute name. Since the attribute names are unique at the leaves, and since the output of any transition from child to parent is some composition of the children's fields, every attribute referenced in an operator corresponds to a unique name.

The preceding discussion provides a guideline for how the methods in the interface should be written. The open method is responsible for setting up the operator's output schema, either by opening its children (if it has any) and using their output schemas or, if the operator is a leaf, by using the base relation. In addition, if a predicate exists, the mapping from attribute to position is done at this point. The next method simply returns the next satisfying tuple, after applying the predicate if the node requires one, and null if no more tuples satisfy it. Though only nested loops is currently implemented, this framework would also support the implementation of an algorithm such as hash-join or sort-merge, where there might be a materialization stage. Furthermore, there is nothing that precludes an implementation that sets up its source to be at a remote node, as long as the local operator adheres to the above interface.
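The iterator-style interface described above can be sketched in Java as follows. The method names open, next, close, and getoutput come from the text (written here as getOutput); the signatures, the tuple representation as Object[], the class names, and the use of a BiPredicate in place of the conjunctive-normal-form predicate with attribute-position mapping are all assumptions made to keep the sketch short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiPredicate;

/** Hypothetical sketch of the Operator interface with a scan and a nested-loops join. */
final class OperatorSketch {

    interface Operator {
        void open();
        Object[] next();          // next satisfying tuple, or null when none are left
        void close();
        List<String> getOutput(); // output schema, as unique attribute names
    }

    /** Access method: passes on every tuple of an in-memory base relation. */
    static final class FileScan implements Operator {
        private final String table;
        private final List<String> attrs;
        private final List<Object[]> tuples;
        private int pos;

        FileScan(String table, List<String> attrs, List<Object[]> tuples) {
            this.table = table;
            this.attrs = attrs;
            this.tuples = tuples;
        }

        public void open() { pos = 0; }
        public Object[] next() { return pos < tuples.size() ? tuples.get(pos++) : null; }
        public void close() { }

        public List<String> getOutput() {
            List<String> out = new ArrayList<>();
            for (String a : attrs) out.add(table + "." + a); // unique name: table plus attribute
            return out;
        }
    }

    /** Binary operator: for each outer (left) tuple, rescans the inner (right) child. */
    static final class NestedLoopsJoin implements Operator {
        private final Operator left, right;
        private final BiPredicate<Object[], Object[]> pred; // stands in for the CNF predicate
        private Object[] outer;

        NestedLoopsJoin(Operator left, Operator right, BiPredicate<Object[], Object[]> pred) {
            this.left = left;
            this.right = right;
            this.pred = pred;
        }

        public void open() { left.open(); right.open(); outer = left.next(); }

        public Object[] next() {
            while (outer != null) {
                Object[] inner;
                while ((inner = right.next()) != null) {
                    if (pred.test(outer, inner)) return concat(outer, inner);
                }
                right.close();
                right.open();        // restart the inner scan for the next outer tuple
                outer = left.next();
            }
            return null;
        }

        public void close() { left.close(); right.close(); }

        public List<String> getOutput() {
            List<String> out = new ArrayList<>(left.getOutput());
            out.addAll(right.getOutput());
            return out;
        }

        private static Object[] concat(Object[] a, Object[] b) {
            Object[] r = new Object[a.length + b.length];
            System.arraycopy(a, 0, r, 0, a.length);
            System.arraycopy(b, 0, r, a.length, b.length);
            return r;
        }
    }

    /** Drives a plan: pulls tuples from the root until next returns null. */
    static List<Object[]> execute(Operator root) {
        root.open();
        List<Object[]> result = new ArrayList<>();
        Object[] t;
        while ((t = root.next()) != null) result.add(t);
        root.close();
        return result;
    }
}
```

Because every operator exposes the same four methods, a parent never needs to know whether its child is a scan, a join, or something that materializes its input first; that uniformity is what would let a hash-join or a remote source be swapped in without changing the driver loop.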

Given this interface, once the root of such a plan is obtained, execution proceeds by getting the output schema of the root, making a relation from that schema, and filling it with tuples from next calls to the root until no tuples remain.

Modifications to Opt++

Out of the box, Opt++ parsed SQL, took a catalog of a fixed format, had a number of algorithms to implement operators, and had preset cost parameters. However, these were all designed for an ORDBMS and made different system assumptions than XMLDBMS does. Thus a number of modifications had to be made to the sample code provided with Opt++ so that it would make sense for use with XMLDBMS. However, since XMLDBMS manages relational data, there were a number of similarities that could be preserved and tweaked. The following details what could be salvaged, what had to be completely rewritten, and what major parts had to be modified to work with a system such as XMLDBMS.

The primary component of Opt++ that remained was the SQL parser. The motivation for this decision was that translating to SQL and translating to relational algebra are equivalent in difficulty for our subset of XML-QL, but SQL is easier to understand when translating. Furthermore, the translation from SQL to relational algebra already existed in Opt++, so we chose to write new code in Java rather than new code in C++. More importantly, we found XML-QL to be a weakly specified and clunky language, so it is foreseeable that XML-QL will not be used in the future. Therefore, for such a prototype, we felt it was more useful to translate to SQL, a language that is widely accepted and already implemented in Opt++, than to hardcode the translation from XML-QL to relational algebra.

While the use of SQL remains, the catalogs over which the optimizer estimates the best plan were rewritten, since the statistics and terminology of the original catalogs were often irrelevant in a relational system and difficult to map from one to the other. The goal for the catalog schemas was to start simple, with the possibility of adding statistics when interesting trends in the data become apparent. For example, some of our data sets produced many null values when converted from XML to relations. If this statistic were recorded with a relation and there were no index on the often-null attribute, the selectivity factor would be significantly reduced. Also, we should mark attributes as being foreign keys of another table, as this seems to be a common feature in our modeling of complex objects and sets. This would save the optimizer the trouble of inferring the same from the indices, where we would expect to see an index on the primary key of the relation pointed to. Furthermore, it is currently assumed that all query processing occurs locally; if this is not the case, information regarding the remote properties of relations, such as round-trip time, may be useful. This might be stored in another relation containing server data and maintained in a catalog proxy.

Another issue that fits well in the ORDBMS version but not in the relational system was that of types. In the original version of Opt++, the type system was driven by the catalog information, as expected, since each relation is a type. In addition, even primitive types such as integer and float are not distinguished from relation types; thus, when type checking a query, Opt++ treats the information for all types as originating from the catalog. Since this dependence is everywhere, it sufficed to maintain this catalog-driven type system, place the primitive types in globally known locations, and make relatively minor changes throughout the code.

Yet another issue that had to be dealt with was the set of terms used to calculate the estimated cost of a plan. In the original version of Opt++, the costs assumed that I/O was necessary when processing a query. In the case of XMLDBMS, the database is assumed to be in main memory and, for simplicity, we assume that the OS will not page the process out. In addition, we do not include the overhead incurred by using Java to store the data, as this can be reduced to current DBMS standards in a more realistic system. A more realistic approach might be to replace the original disk I/O terms with network I/O terms; however, we assume the data is in memory at the time of optimization. Thus the modified cost estimates remove the costs associated with I/O and leave only the costs due to in-memory processing, such as the expected number of tuples per operator output and the expected number of operations per operator given such sizes. These changes were made wherever the cost of an operator implementation was considered in the search space. It should be noted that these implementations existed in Opt++ and remain, so that they can either be used as they are developed in XMLDBMS or be exercised in hypothetical situations that can be mimicked by supplying Opt++ with a hypothetical catalog.

Because XMLDBMS supports a limited set of operator implementations, the performance impact of using such an optimizer is questionable when the only join is a nested loops join. However, our translation to the relational model provides much room for gains from join reordering and selection pushes. In addition, the choice of Opt++ was driven by the benefit such a flexible system provides to a prototype such as XMLDBMS.

Experiments

The XMLDBMS system was tested on a 200 MHz Pentium Pro running Solaris 2.6. When querying over small test datafiles (less than 10 kilobytes) we were able to obtain results almost in real time. When we ran queries over larger files (about 250 kilobytes), the response time varied greatly depending on the nature of the datafile. When querying over an XML document containing data that translates into a dense table, the response time was relatively quick. However, when querying over an XML document containing very sparse data (the document contains the play The Tragedy of King Richard the Third [Sha98]), the time it took to translate the document into a table was overwhelmingly long.

Among the notable functionality we were able to achieve with XMLDBMS:

i) Querying over multiple data sources. For example, the following query performs a join on tables from two XML documents, and constructs a resulting document containing attributes from both sources.

WHERE <book>
        <title> $t </>
        <author><lastname> $l </></>
      </book> IN "
      <Item>
        <Title> $t </>
        <UnitPrice> $p </>
      </Item> IN "file:///u/k/b/kbeach/764/src/data/book.xml"
CONSTRUCT <Book>
            <Title> $t </>
            <Author> $l </>
            <Price> $p </>
          </Book>

ii) DTD Translation. The following example translates XML data that conforms to one DTD into XML that conforms to another.

WHERE <book year=$y>
        <title> $t </>
        <author> <lastname> $l </> <firstname> $f </> </>
        <publisher> <name> $p </> <address> $a </> </>
      </book> IN "file:///u/k/b/kbeach/764/src/data/bib.xml"
CONSTRUCT <thebook>
            <theyear> $y </>
            <thetitle> $t </>
            <theauthor firstname=$f> $l </>
            <thepublisher address=$a> $p </>
          </thebook>

Conclusions

The XML-QL query language as described by the W3C proposal [DFF+98] tries to do too much, and so becomes very difficult to understand. The stripped-down subset of the language we have chosen to support, however, appears to lend itself very well to querying XML data, since both use the <tag> syntax of SGML. While this subset of XML-QL is not very powerful, it would be interesting to see if a more powerful, and at the same time more intuitive, query language for XML can be developed.

One significant conflict that arose was the management of set-valued attributes. XML naturally lends itself to specifying set-valued attributes. This results in a significantly time-consuming process of flattening out the tuples that contain sets, and it reduces the ability to efficiently perform more complex queries over those tuples. Furthermore, since we duplicate a tuple containing a set of values N times, where N is the cardinality of the set, it drastically increases the memory and disk consumption of the database. While the relational model is possible, we believe that XML is more naturally adaptable to an object-relational model or a pure object-oriented database. A comparison study between the different approaches has not been done at this point in time.

Future Work

The XMLDBMS system appears to be easily extensible into a distributed database system. At present, the execution plan (which is in the form of a tree) generated by the query optimizer is executed entirely locally, with the only remote action being the fetching of the tables (at the leaf nodes). For example, if it can be detected that a subtree of the execution plan uses only tables from a remote site, and the remainder of the tree does not depend on tables from that site, it should be possible to migrate the entire subtree to the remote site and initiate execution there. If the resulting table that is sent back is significantly smaller than the original tables on which the subtree depended, there will be a significant gain in performance. In order to fully implement this, however, the process of deciding whether to migrate a subtree must be more involved, and would potentially require extending the query optimizer and having more catalog information available.

Our current implementation assumes that an XML datasource is never altered. Thus, once the document is cached in the storage manager it is never refreshed. Because XML data is available in the same manner as a web page, we encounter the same problem that the web caching community faces: maintaining the consistency of cached data. This problem is probably even more significant when these URLs are treated as entries or tables within a database, since it is database data that could be stale and not just a news article or personal web page. Currently, we cache all data and assume it will never become stale; however, future improvements should maintain strong consistency between the XML document and what is stored in XMLDBMS.

Bibliography

[Ant98] ANTLR version 2.4.0, Magelang Institute.
[DFF+98] A. Deutsch, M. Fernandez, D. Florescu, A. Levy, D. Suciu. XML-QL: A Query Language for XML, submission to the World Wide Web Consortium, 19 August
[KD95] N. Kabra, D. DeWitt. OPT++: An Object-Oriented Implementation for Extensible Database Query Optimization.
