Efficient schema-based XML-to-Relational data mapping


Information Systems

Mustafa Atay, Artem Chebotko, Dapeng Liu, Shiyong Lu, Farshad Fotouhi
Department of Computer Science, Wayne State University, Detroit, MI 48202, USA
E-mail addresses: matay@wayne.edu (M. Atay), artem@wayne.edu (A. Chebotko), dliu@wayne.edu (D. Liu), shiyong@wayne.edu (S. Lu), fotouhi@wayne.edu (F. Fotouhi)

Received 2 March 2005; received in revised form 4 December 2005; accepted 15 December 2005
Recommended by: Prof. J. Van den Bussche

Abstract

Storing and querying XML documents using an RDBMS is a challenging problem, since one needs to resolve the conflict between the hierarchical, ordered nature of the XML data model and the flat, unordered nature of the relational data model. This conflict can be resolved by the following XML-to-Relational mappings: schema mapping, data mapping and query mapping. In this paper, we propose: (i) a lossless schema mapping algorithm to generate a database schema from a DTD, which makes several improvements over existing algorithms, and (ii) two linear data mapping algorithms, based on DOM and SAX, respectively, to map ordered XML data to relational data. To the best of our knowledge, there is no published linear schema-based data mapping algorithm for mapping ordered XML data to relational data. Experimental results are presented to show that our algorithms are efficient and scalable.

© 2006 Elsevier B.V. All rights reserved.

Keywords: XML; Relational; Schema-based; Ordered; Mapping; Shredding

1. Introduction

XML has emerged as a standard for representing and exchanging data over the World Wide Web. The increasing number of XML documents creates the need to store and query them efficiently. Numerous researchers have proposed using relational databases to store and query XML documents [1–9]. The main challenge of this relational approach is that one needs to resolve the conflict between the hierarchical, ordered nature of the XML data model and the flat, unordered nature of the relational data model. This conflict can be resolved by the following XML-to-Relational mappings:

Schema mapping: Either a fixed generic database schema is used (schema-oblivious XML storage), or a database schema is generated from an XML schema or DTD (schema-based XML storage) for the storage of XML documents. To support the ordered nature of the XML data model, an order encoding scheme such as those proposed in [8] can be used, and additional columns are introduced to store the ordinals of XML elements.

Data mapping, which shreds an input XML document into relational tuples and inserts them into the relational database whose schema is generated in the schema mapping phase.

Query mapping, which translates an XML query into its relational equivalent (i.e., SQL statements or relational algebra expressions), executes it against the database and returns the query result to the user. If the query result is to be returned as XML documents, then a reconstruction algorithm [10] is needed to reconstruct the XML subtrees rooted at the matching nodes.

While existing work has focused on the problems of schema mapping [1–7,9] and query mapping [8,11–19], there is no published linear schema-based data mapping algorithm for mapping ordered XML documents to relational data. Firstly, the schema-oblivious storage schemes [1–3,16] use a simple, fixed database schema for XML storage, and the data mapping problem in this context has been addressed by Grust et al. in [20]. Secondly, while the schema-based storage schemes [4–7,9] have presented different strategies to generate a good database schema from an XML schema, there has been no published work presenting algorithms for mapping XML documents to relational data that will fit into the generated database schema and preserve the XML document order. Tatarinov et al. [8] focus on the investigation of three order encoding schemes for storing and querying XML documents. Although they present a brief discussion of schema-based order-preserving schema mapping, no algorithmic details are given for the schema-based data mapping. Thirdly, existing works on query mapping [8,11–15,17–19] assume that the database has already been populated with XML documents, and no algorithms have been published for shredding XML documents into relational data in the context where the database schema is generated from an XML schema. The data translation algorithm presented in [21] does not support recursive XML schemas and does not consider the ordered nature of XML documents. The data loading algorithms defined in [16,20] support the schema-oblivious storage scheme and use a SAX-based approach. Finally, our previous data mapping algorithm presented in [22] is not order-preserving and uses only a DOM-based approach.

Since the target database schema might be complex and its corresponding XML-to-Relational schema mapping is non-trivial, it is challenging to design an efficient schema-based data mapping algorithm. This is one major motivation of our research. The main contributions of this paper are:

1. We propose a schema mapping algorithm, ODTDMap, which generates a database schema from an XML DTD for storing and querying ordered XML documents. Although the main idea of ODTDMap is similar to the shared inlining algorithm [4,8] and its variant [9], ODTDMap makes several improvements over them, as discussed at the end of Section 4.

2. We propose an efficient DOM-based linear data mapping algorithm, OXInsert, which shreds and composes input XML documents into relational tuples and inserts them into the relational database according to the schema generated by ODTDMap. OXInsert is based on our previous data mapping algorithm XInsert [22], but it takes into account the ordered nature of the input XML documents and set-valued attributes, neither of which was considered by XInsert.

3. We propose an efficient and linear SAX-based data mapping algorithm, SDM, which shreds and composes ordered XML documents into relational tuples and inserts them into the relational database according to the schema generated by ODTDMap.

Our experimental study shows that the proposed algorithms ODTDMap, OXInsert and SDM are efficient and scalable. It also shows that OXInsert and SDM remain efficient under schema mapping algorithms other than ODTDMap.

Although query mapping is an essential part of a complete mapping scheme, mapping XML queries into their SQL counterparts is not the focus of this paper. We refer interested readers to recently proposed query mapping algorithms [8,11,12,14,15,17–19]. We assume the reader is familiar with XML [23] and its related technologies, such as DTD [23], DOM [24] and SAX [25].

Organization: The rest of the paper is organized as follows. Section 2 presents an overview of related work. The formalization of a schema-based relational XML storage system is given in Section 3. Section 4 gives a brief description of our schema mapping algorithm ODTDMap. Section 5 identifies the main issues for data mapping and describes our proposed data mapping algorithms OXInsert and SDM. Section 6 presents an experimental study of the time performance of the ODTDMap, OXInsert and SDM algorithms. Finally, Section 7 concludes the paper and points out some potential future work.

2. Related work

Three major approaches have been proposed for storing and querying XML data.

The first approach is to develop native XML databases that support the XML data model and XML query languages directly. This includes Software AG's Tamino XML Server [26], IXIA's TEXTML Server [27], Sonic Software's eXtensible Information Server [28] (formerly eXcelon's XIS) and MODIS's Sedna Native XML DBMS [29]. The advantage of this native approach is that XML data can be stored and retrieved in their original formats and no additional mappings or translations are needed. Furthermore, most native XML databases can perform sophisticated full-text searches, including full thesaurus support, word stemming (to match all forms of a word: run, ran, running) and proximity searches. The disadvantage is that, due to the document-centric nature of these databases, complex searches or aggregations might be cumbersome.

The second approach is to use existing mature technologies, such as relational DBMSs or object-oriented DBMSs, to store and query XML data [1–9]. The main challenge of this approach is that one needs to resolve the conflict between the hierarchical, ordered nature of the XML data model and the target data model. This usually requires various mappings, such as schema mapping, data mapping and query mapping, to be performed between the two data models. Therefore, the main issue is to develop efficient algorithms to perform these mappings. This approach includes two categories of methods: schema-oblivious XML storage [1–3,16], which uses a fixed generic database schema for XML storage, and schema-based XML storage [4–7,9], which uses a database schema generated from an XML schema for XML storage.

The third approach is to use the XML support enabled by commercial database systems. Currently, most major databases, such as SQL Server [30], Oracle [31] and DB2 [32], provide mechanisms to store and query XML data by extending the existing data model with an additional XML data type (e.g., XMLType in Oracle 10g), so that a column of this data type can be defined and used to store XML data. In addition, a set of methods is associated with this new XML data type to process, manipulate and query stored XML data.

As discussed above, these approaches have their pros and cons, and the choice depends on the requirements of the application at hand and on the maturity of each approach at the time the choice is made. Readers are referred to an evaluation study of alternative XML storage strategies [33] for more details.

3. Schema-based relational XML storage system

Our schema-based relational XML storage system contains two major components:

1. Schema mapping, which takes an XML DTD as input, and outputs a database schema and a σ-mapping, which assigns each element/attribute in the DTD to the relation in which the element/attribute is going to be stored.

2. Data mapping, which takes a valid XML document and the output of a schema mapping as input, shreds the XML document into relational tuples, and inserts them into the relational database.

In the following, we formalize the notions of σ-mapping, schema mapping and data mapping, respectively:

Definition 3.1 (σ-mapping). Given a DTD $D$ with element-type set $E$ and attribute-type set $A$, and a database schema $R$, a σ-mapping is a function $\sigma : (E \cup A) \rightarrow R$, such that given an attribute/element type $e \in (A \cup E)$, $\sigma(e)$ is the relation in which the instances of $e$ will be stored.

Definition 3.2 (Schema mapping). A schema mapping is a function $SM$ that assigns to each DTD $D$ a pair $(R, \sigma)$ to store the XML documents conforming to $D$, where $R$ is a database schema and $\sigma$ is a σ-mapping over $R$.

Definition 3.3 (Data mapping).
A data mapping $DM$ is a function that assigns to each triple $(X, R, \sigma)$ a set of relational tuples $T$, where $X$ is a valid XML document, $R$ is a database schema, $\sigma$ is a σ-mapping over $R$, and $T$ is the result of shredding $X$ into relational tuples according to the layout described by $R$ and $\sigma$.

4. Schema mapping algorithm ODTDMap

In this section, we propose our schema mapping algorithm, ODTDMap, which generates a database schema from an XML DTD for storing and querying ordered XML documents. Several approaches exist in the literature. One approach is to map each DTD element to a separate table [4]. The drawback of this approach is that it might result in too many tables and thus expensive joins of multiple tables for a query. Another approach is to map all DTD elements into a single fixed table [2]. This approach might result in a large table and expensive self-joins of the table for a query. A better approach, which our ODTDMap algorithm takes, is to map a child and its parent to the same table when the child appears at most once under its parent. This operation is called inlining and was first introduced in [4]. The inlining approach reduces the number of tables in the generated database schema and thus the number of joins for a query.

Our ODTDMap algorithm is shown in Fig. 1. It is inspired by the shared inlining algorithm introduced in [4]. However, we made several improvements over it, which are described in Section 4.4. The ODTDMap algorithm consists of the following three main steps:

1. Simplifying DTD: Since a DTD expression might be very complex due to its hierarchical nesting capability, this step greatly simplifies the mapping procedure.

2. Creating and inlining DTD graph: We create the corresponding DTD graph based on the simplified DTD, and then inline as many descendant nodes as possible to a parent node in the DTD graph. Thus, all descendants of an XML element e which occur at most once under e will be mapped to the same relation as e.

3. Generating database schema and σ-mapping: After a DTD graph is inlined, we generate a database schema and σ-mapping based on the inlined DTD graph.

The section ends with a discussion of the improvements we made over existing schema mapping algorithms.
    Algorithm ODTDMap
    Input: DTD D
    Output: Database schema R, σ-mapping σ
    Begin
        Simplify the DTD D
        Create the DTD graph G
        IG = Inline(G)            // create the inlined DTD graph
        GenerateRelSigma(IG)      // generate the relations and σ-mapping
    End

Fig. 1. Schema mapping algorithm ODTDMap.

4.1. Simplifying DTDs

DTDs, in general, can be complex, and generating database schemas for them can be an awkward task. The first step in our schema mapping algorithm is to simplify a DTD into a canonical form such that it can easily be translated into a database schema which will be able to store the XML documents conforming to the original, unsimplified DTD.

The occurrence operators in a DTD can be classified into two groups based on the underlying relationship between parent and child elements: (i) operators that lead to a one-to-one relationship: {?, no operator}, and (ii) operators that lead to a one-to-many relationship: {+, *}. To generate a complete relational schema for the given DTD, it is sufficient to distinguish between these two relationship groups. Thus, we can replace the first operator in each group with the second one, which reduces the types of occurrence operators from four to two.

Although the processing of the choice operator | seems to be a problematic issue in the schema mapping process, we can deal with it easily. Let us consider the following DTD expression: <!ELEMENT a (b | c)>. The element a can contain element b or element c, but not both at the same time. However, we can introduce columns b and c together in the table corresponding to element a. During the data mapping phase, if a contains child b, then we assign null to the c column, and vice versa. Thus, there is not much difference between the given DTD expression and <!ELEMENT a (b, c)> regarding the target database schema. We define a set of transformation rules in Fig. 2 to transform a DTD into a canonical form.

Example 4.1. Using the simplification rules shown in Fig. 2, one can transform <!ELEMENT a ((b+, c*, d?)?, (e?, f, (g*, h?)+)?)> to the simplified version <!ELEMENT a (b*, c*, d, e, f, g*, h*)>. The following DTD expressions are the ones which are changed as a result of applying the simplification rules given in Fig. 2 to the DTD shown in Fig. 3:

    <!ELEMENT book (title, author*, chapter*, citation)>
    <!ELEMENT section (paragraph*, section*)>
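The effect of the simplification step on Example 4.1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the rule set of Fig. 2 (which is not reproduced here): the function names `simplify` and `render` are hypothetical, `?` and "no operator" are treated as one-to-one, `+` and `*` as one-to-many, `|` is treated like `,`, a starred group stars all of its members, and a name that occurs more than once is assumed to become starred.

```python
import re


def simplify(content_model):
    """Flatten a DTD content model into {element name: is_starred}.

    Hypothetical helper; the flattening rules are assumptions drawn from
    the prose of Section 4.1, not a transcription of Fig. 2.
    """
    tokens = re.findall(r"#?[A-Za-z_][\w.-]*|[()|,?*+]", content_model)
    pos = 0

    def parse_item():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token == "(":
            items = []
            while True:
                items.extend(parse_item())
                separator = tokens[pos]  # ',' or '|' continues, ')' closes
                pos += 1
                if separator == ")":
                    break
        else:
            items = [(token, False)]
        # An occurrence operator may follow a name or a closing parenthesis.
        if pos < len(tokens) and tokens[pos] in ("?", "*", "+"):
            if tokens[pos] in ("*", "+"):  # one-to-many: star every member
                items = [(name, True) for name, _ in items]
            pos += 1                       # '?' is simply dropped
        return items

    merged = {}
    for name, starred in parse_item():
        if name != "#PCDATA":
            # Assumption: a repeated name can occur many times, so star it.
            merged[name] = starred or name in merged
    return merged


def render(element, content_model):
    """Print the simplified model back as an <!ELEMENT ...> declaration."""
    body = ", ".join(n + ("*" if s else "")
                     for n, s in simplify(content_model).items())
    return f"<!ELEMENT {element} ({body})>"


print(render("a", "((b+, c*, d?)?, (e?, f, (g*, h?)+)?)"))
# prints <!ELEMENT a (b*, c*, d, e, f, g*, h*)>
```

The sketch only handles parenthesized content models; keywords such as EMPTY or ANY would need separate treatment.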

Fig. 2. DTD simplification rules.

Fig. 3. Sample XML DTD xbib.dtd.

Our simplification rules transform complex DTD expressions into a flat canonical form, and in doing so loosen some DTD constraints. However, the DTD simplification procedure preserves sufficient information to generate a database schema with the necessary tables and columns to store XML data. The actual constraint information can be derived from the original DTD and introduced to the database schema by revisiting the original DTD later. Interested readers are referred to [5,34], where capturing semantic knowledge from a DTD and introducing it to a database schema through semantic constraints are discussed in detail.

Two pieces of information are essential for the reconstruction of an XML document from its relational representation and for answering XML queries against the relational storage of an XML document: (1) the parent–child relationships between XML elements and (2) the document order. While the above simplification procedure maintains the parent–child relationships, it does not maintain the document order. However, we introduce additional ordinal attributes to record the order of the document. Thus, any XML query, including the ones which require the document order information, can be evaluated over the generated database schema.

Our set of rules is essentially an improvement of the transformation rules defined in the shared inlining algorithm [4]. Our set of rules is complete since we consider all possible combinations of operators and XML elements, whereas the shared inlining algorithm only lists some important combinations. For example, there is no rule that corresponds to $(e_1 | \cdots | e_n)?$ in the shared inlining algorithm.

4.2. Creating and inlining DTD graphs

In this step, we create the corresponding DTD graph based on the simplified DTD and perform the inlining operation on the DTD graph. The notion of the DTD graph is defined as follows:

Definition 4.2 (DTD graph). The structure of a DTD $D$ can be represented by a directed graph $G = (V, E)$, where $V$ is the set of vertices and $E$ is the set of edges. The vertices represent element and attribute types in $D$, and the edges represent their parent–child relationships. Each vertex is labeled with the name of the corresponding element or attribute type. An edge is labeled by * if it is incident to a vertex which can appear more than once under its parent in the corresponding XML document; otherwise no label is used.

For example, the DTD graph shown in Fig. 4 corresponds to the simplified form of the DTD given in Fig. 3. While each element appears only once in the DTD graph, attributes appear as many times as they appear in the DTD. Node identifiers for the attributes in the DTD graph are preceded by @.¹ For a set-valued attribute such as IDREFS or NMTOKENS, the edge between the set-valued attribute and its parent (the owner element) is labeled by * in the DTD graph. Thus, we can manage the set-valued attributes easily in further steps. In Fig. 4, cites is a set-valued attribute. Therefore, its incoming edge is labeled with *.

¹ In an implementation, to ensure the uniqueness of attribute names, we can use the concatenation of an attribute name and its owner element name as the attribute identifier. For example, attribute @id of element book can have the label book.id.

Fig. 4. DTD graph of xbib.dtd.

After we create the DTD graph for the simplified DTD, we inline as many descendant elements to an element as possible. The rationale is that these inlined elements will eventually produce a single relation. Therefore, we only inline a child c to a parent p when p can contain at most one occurrence of c, in order to avoid introducing redundancy into the generated relation.

After the simplification procedure, any input DTD is in a canonical form, i.e., each DTD expression is a tuple of distinct element names or their stars (*). As a result, in the corresponding DTD graph, an edge is labeled by a star (*) if the edge leads to an element with a *, and no label is used otherwise. Thus, if an edge has a * as its label, we call it a star edge; otherwise, we call it a normal edge. We define the notions of an inlinable node, an inlinable subtree and a shared node in a DTD graph as follows:

Definition 4.3 (Inlinable node). Given a DTD graph, a node is inlinable if and only if it has exactly one incoming edge and that edge is a normal edge.

Definition 4.4 (Inlinable subtree). Given a DTD graph and a node e in the graph, e and all other inlinable nodes that are reachable from e by normal edges constitute a subtree rooted at e. This subtree is called the inlinable subtree for the node e.

Definition 4.5 (Shared node). Given a DTD graph, a node is called a shared node if it has more than one incoming edge.

Fig. 5. Inlined DTD graph of xbib.dtd.

Using our inlining procedure, the DTD graph shown in Fig. 4 will be transformed into the inlined DTD graph shown in Fig. 5.
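Definitions 4.3 to 4.5 can be made concrete with a short Python sketch. It is a minimal illustration rather than the inlining procedure of Fig. 8: the function name `inline`, the edge-triple representation and the sample edge list are all assumptions made for this example (the edges are hypothetical and merely chosen so that one node is shared and one is reached by a star edge). The traversal uses an explicit stack and only descends through inlinable children, so a node below a shared node is inlined into that shared node rather than into an ancestor.

```python
from collections import defaultdict


def inline(edges):
    """Return {non-inlinable node: [nodes inlined into it]}.

    edges: iterable of (parent, child, is_star_edge) triples describing a
    DTD graph (hypothetical representation chosen for this sketch).
    """
    incoming = defaultdict(list)  # child -> star flags of its incoming edges
    children = defaultdict(list)  # parent -> [(child, is_star_edge), ...]
    nodes = {}                    # insertion-ordered set of node names
    for parent, child, is_star in edges:
        nodes[parent] = nodes[child] = None
        incoming[child].append(is_star)
        children[parent].append((child, is_star))

    def inlinable(node):
        # Definition 4.3: exactly one incoming edge, and it is a normal edge.
        flags = incoming[node]
        return len(flags) == 1 and not flags[0]

    inlined = {}
    for root in nodes:
        if inlinable(root):
            continue              # will be absorbed by some other node
        members, stack = [], [root]
        while stack:              # depth-first walk over normal edges
            node = stack.pop()
            for child, is_star in children[node]:
                if not is_star and inlinable(child):
                    members.append(child)
                    stack.append(child)
        inlined[root] = members
    return inlined


# Hypothetical graph: e is shared (parents d and g), g hangs off a star edge.
EDGES = [
    ("a", "b", False), ("b", "c", False), ("b", "d", False),
    ("d", "e", False), ("g", "e", False), ("e", "f", False),
    ("a", "g", True),
]
print(inline(EDGES))
# prints {'a': ['b', 'c', 'd'], 'e': ['f'], 'g': []}
```

Because an inlinable node has exactly one incoming edge, it is reached from exactly one place, so every node and edge is examined at most once. A cycle made up solely of inlinable nodes has no non-inlinable entry point and is not handled by this sketch.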
Our inlining procedure considers the following three cases, which are illustrated in Fig. 6:

1. Case 1: Node a is connected to node b by a normal edge and b has no other incoming edges. In this case, a can contain at most one occurrence of b, and we combine node b into a while maintaining the parent–child relationships between b and its children.

2. Case 2: Node a is connected to node b by a normal edge and b has other incoming edges. In this case, we do not combine b into a since b has multiple parents.

3. Case 3: Node a is connected to node b by a star edge. In this case, each a can contain multiple occurrences of b, and we do not combine b into a.

Fig. 6. Three cases for inlining.

Only Case 1 allows us to inline an element to its parent. Case 2 does not allow inlining because b is a shared node, and Case 3 does not allow inlining in order to avoid the redundancy caused by multiple occurrences of a child element in its parent under the * operator.

Example 4.6. In Fig. 7A, nodes b and d are inlinable but nodes a and c are not inlinable. The inlinable subtree for a contains nodes a and b, whereas the inlinable subtree for c contains nodes c and d. In Fig. 7C, nodes b–d and f are inlinable, but nodes a, e and g are not inlinable. The inlinable subtree for a contains nodes a–d, and the inlinable subtree for node e contains nodes e and f. While there is no shared node in Fig. 7A, the only shared node in Fig. 7C is node e. The DTD graph shown in Fig. 7A will be inlined into the one shown in Fig. 7B, and the DTD graph shown in Fig. 7C will be inlined into the one shown in Fig. 7D.

Fig. 7. Inlining DTD graphs.

The notion of the inlinable subtree formalizes the intuition of inlining as many descendant elements as possible to an element. We illustrate our inlining algorithm in pseudocode in Fig. 8. Essentially, it uses a depth-first search strategy to identify the inlinable subtree for each node and then inlines that subtree to its root. A field inlinedset of set type is introduced for each node e to represent the set of nodes that have been inlined to this node e (initially e.inlinedset = {}). For example, in Fig. 7C, after the inlining procedure, a.inlinedset = {b, c, d}. The algorithm is efficient, as indicated by the following theorem.

Theorem 4.7 (Time complexity). The time complexity of our inlining algorithm is $O(n)$, where $n$ is the number of nodes in the input DTD graph.

Proof. This is obvious since each node of the DTD graph is visited at most once. □

Fig. 8. The inlining procedure.

4.3. Generating database schema and σ-mapping

After a simplified DTD graph is inlined, the last step is to generate a database schema based on this inlined DTD graph and to generate the schema mapping information which will be used in the data mapping process later. The procedure to generate the database schema and σ-mapping is given in Fig. 9. For each node e in the inlined DTD graph, a relation e is generated. Basically, in the generated database schema, we associate each element e with a unique ID.
We also introduce a unique f.ID for each element type f in the inlined set of e. The rationale behind introducing an ID or f.ID for each element is to be able to store the order of XML elements in the relational tables. It is mentioned in [8] that no ordinal ID will be required for inlined elements. However, as we will show in Section 4.4, such a mapping scheme is lossy. Our mapping scheme is lossless and stores sufficient information in the relational database to reconstruct the original XML document. A complete proof, which shows that our mapping scheme is lossless, is in [10].

Attribute parentid is introduced for each non-inlinable element to preserve the parent–child relationship and, thus, the tree structure of an XML document. We do not need to introduce an attribute parentid for inlinable elements since they are stored in the same tuple as their parent. To facilitate the processing of recursive XML queries (queries with the // axis), each element e is associated with an attribute endid, which stores the maximum ID of the descendants of e.² We introduce f.endid for each element type f in the inlined set of e for the same purpose. We introduce attribute parenttype if the node in the inlined DTD graph has more than one parent (shared node). Thus, the attribute parenttype facilitates efficient selection of descendants of a particular parent. A column e is introduced in the database schema for each non-inlinable leaf element type to store its textual content. Similarly, column f is introduced for each leaf element or attribute type f in the inlined set of e. Obviously, if the element type is EMPTY, we do not introduce such a column.

² Leaf elements have the same ID and endid values. As such, we can omit the endid to save space.

Fig. 9. The database schema and σ-mapping generation procedure.

Fig. 10. Database schema for xbib.dtd.

The database schema shown in Fig. 10 is generated for the inlined DTD graph given in Fig. 5 by the schema generation procedure explained above. After generating the database schema, both the database schema and the σ-mapping, which maps element and attribute types to the relational schemas in which they should be stored, are output. This output is used by our data mapping algorithms in Section 5 to actually shred XML documents into relational tuples.

4.4. Discussion

Although the main idea of ODTDMap is similar to existing algorithms [4,8,9], ODTDMap made several improvements over them:

Recursion: The standard shared-inlining algorithm [4] defines a rule to deal with two mutually recursive elements. It is not clear how to handle a DTD with a cycle consisting of more than two elements. ODTDMap can handle arbitrary cycles in DTDs by just checking the incoming edges to the current node, irrespective of other nodes in the DTD graph. Therefore, ODTDMap deals with cycles naturally, without the explicit check for the existence of cycles that the standard shared-inlining algorithm requires.

Losslessness: ODTDMap is lossless in the sense that the generated database schema can store enough structural information to reconstruct the original XML documents and to support the storage and query of ordered XML documents. We are able to reconstruct the original XML document in the given document order. In contrast, the shared-inlining algorithm [4] and its variant [9] do not support the ordered nature of XML documents. Although [8] proposes the Global, Local and Dewey Order schemes and discusses their applications to the schema-less case, no details are presented for the schema-based case. The authors suggest that there is no need to have a separate column for storing the order information of inlined elements, since the position of such elements can be determined from the position of their parent element and the document schema. This is not true. For example, consider the DTD and the sample XML document shown in Fig. 11A and B, respectively. The ordered shared-inlining will create the database shown in Fig. 11C, in which the order information of the inlined element C is lost, and there is no way to determine whether the element B comes before or after element C; therefore, the original XML document cannot be reconstructed. On the other hand, our ODTDMap will create the database shown in Fig. 11D, where we associate an ID with the inlined element C as well.
Thus, it will support the reconstruction of the original XML document. Efficient support for XML queries: To facilitate the processing of XML queries, each non-leaf element e is associated with an endid which stores the maximum ID of the descendants of e. In this way, one can efficiently identify all the following and preceding elements of a given element as well as its descendants. (A) (C) Set-valued attributes: Existing schema mapping algorithms [4,8,9] have not considered set-valued attributes such as IDREFS and NMTOKENS. In ODTDMap, we connect a set-valued attribute to its owner element with a star edge in the DTD graph and map it to a separate relation (see how the cites attribute in Fig. 3 is mapped). 5. Data mapping As the target database schema might be complex and its corresponding XML-to-Relational schema mapping is non-trivial, it is challenging to design an efficient schema-based data mapping algorithm. The main challenging issues include the following: Varying document structure: XML documents have varying structures due to the optional occurrence operators?,, and choice operator j used in the underlying DTD, unlike relational tables which always have a fixed structure. For example, in the XML document tree given in Fig. 13, which corresponds to the sample XML document shown in Fig. 12, the nodes with ordinal numbers 10 and 16 are of the same element type. However, their subtrees are quite different. While there is no paragraph node among the child nodes of node 10, there is no section node among the child nodes of node 16. A data mapping algorithm should keep track of the missing child nodes and handle structural differences between the same type of element nodes due to the optional operators using efficient data structures. Scalability: In an online environment, where new XML documents might be inserted into the database on-the-fly, a data mapping algorithm (B) (D) Fig. 11. A lossy versus a lossless mapping.

10 10 M. Atay et al. / Information Systems ] (]]]]) ]]] ]]] Fig. 12. A sample XML document xbib.xml. will be used frequently. Thus, it is critical that a data mapping algorithm is efficient and scales well with the size of XML documents. It is obvious that a linear data mapping algorithm will fulfill this requirement the best. In the following sections, we present two data mapping algorithms, DOM-based OXInsert and SAX-based SDM to address these issues. An appropriate ordering technique is needed to keep the ordered XML documents in the unordered structure of relational tables. Several order encoding methods are proposed in [8]. Their experimental results show that the global order encoding performs the best on query intensive workloads. However, our data mapping algorithms can be easily adapted to other order encoding schemes proposed in [8] DOM-based approach We use a tree data model to represent the XML documents since each valid XML document is rooted at a unique element which is specified by DOCTYPE declaration in the DTD. We first introduce our XML Tree data model, which is based on W3C s Document Object Model (DOM) [24]. The details of our XML document model are given in Definition 5.1. Definition 5.1 (XML Tree). We model an XML document D as an XML element tree (XML Tree) T, in which nodes represent XML elements and edges represent parent child relationships between XML elements. The XML Tree T is an ordered tree and its nodes can have attributes and values associated with them. The root of XML Tree T is denoted by T:root. For each element node e in T,we use the following notations: e:name, the name of XML element e. e:eid, the global ID of XML element e which is given based on the pre-order tree traversal. e:endid, the largest descendant ID of node e and e:id ¼ e:endid if e is a leaf node in T. e:attributes, the set of XML attributes of e. 
We also denote the attributes of e by e.a_1, ..., e.a_n and the names and values of these attributes by e.a_i.name and e.a_i.value, respectively (i = 1, ..., n).
e.value, the value of e, where e.value = NULL if e is a non-leaf node.
e.parent, the parent node of e, where e.parent = NULL if e is the root node of T.
e.children, the ordered sequence of child nodes of e, where e.children = NULL if e is a leaf node of T. We also denote the children of e by e.c_1, ..., e.c_m.
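The notation of Definition 5.1 can be illustrated with a small sketch. The class `XMLNode`, the helpers `build_xml_tree` and `is_descendant`, and the sample document are hypothetical names and data chosen for illustration; they are not taken from the paper. The sketch uses recursion only for brevity, which is an assumption that suits small inputs; it assigns eid in pre-order and sets endid to the largest ID in the subtree, so that a leaf has eid = endid.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional
import xml.etree.ElementTree as ET


@dataclass
class XMLNode:
    """One element node of an XML Tree (illustrative, after Definition 5.1)."""
    name: str
    eid: int
    endid: int = 0
    value: Optional[str] = None
    attributes: Dict[str, str] = field(default_factory=dict)
    # The parent link is excluded from repr/eq to avoid cyclic traversal.
    parent: Optional["XMLNode"] = field(default=None, repr=False, compare=False)
    children: List["XMLNode"] = field(default_factory=list)


def build_xml_tree(xml_text: str) -> XMLNode:
    """Parse xml_text and assign eid (pre-order) and endid to every node."""
    next_id = 0

    def convert(elem: ET.Element, parent: Optional[XMLNode]) -> XMLNode:
        nonlocal next_id
        next_id += 1  # pre-order: a node is numbered before its children
        node = XMLNode(name=elem.tag, eid=next_id,
                       attributes=dict(elem.attrib), parent=parent)
        for child in elem:
            node.children.append(convert(child, node))
        if not node.children:
            node.value = elem.text  # PCDATA is a field of the leaf, not a node
        node.endid = next_id  # largest ID handed out inside this subtree
        return node

    return convert(ET.fromstring(xml_text), None)


def is_descendant(d: XMLNode, e: XMLNode) -> bool:
    """d is a proper descendant of e iff e.eid < d.eid <= e.endid."""
    return e.eid < d.eid <= e.endid


# Hypothetical sample, much smaller than the xbib.xml of Fig. 12.
SAMPLE = ("<xbib><book><title>T</title>"
          "<author><fname>A</fname><lname>B</lname></author>"
          "<chapter><section><paragraph>P</paragraph></section></chapter>"
          "</book></xbib>")
root = build_xml_tree(SAMPLE)
book = root.children[0]
title, author, chapter = book.children
```

With eid and endid in place, the descendant test is a pair of integer comparisons, which is the property the endid field is meant to provide for query processing.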

Fig. 13. XML Tree of xbib.xml. Nodes in pre-order: xbib(1), book(2), title(3), author(4), fname(5), lname(6), author(7), fname(8), lname(9), chapter(10), section(11), section(12), paragraph(13), section(14), paragraph(15), chapter(16), paragraph(17), paragraph(18).

The XML Tree data model has some distinctions from W3C's DOM specification. In contrast to the traditional XML DOM tree, the XML Tree does not consider XML PCDATA values as nodes but considers them as data fields of XML element nodes. It has an ID field for each node, which is assigned based on the pre-order tree traversal as an XML document is being parsed. Besides an ID field, each node is assigned an endid field, which denotes the largest descendant ID of that node. This distinction is only for the convenience of presentation; thus, the algorithm proposed in this paper can be implemented directly on the standard DOM model.

The XML Tree for the XML document shown in Fig. 12 is illustrated in Fig. 13. In an XML Tree, each node e is labeled by e.name(e.eid, e.endid, e.value, e.a_1.name = e.a_1.value, ..., e.a_n.name = e.a_n.value), and e.value is omitted when e is a non-leaf node, where e.value = NULL. However, in Fig. 13, we just include e.name and e.eid for simplicity. We differentiate an element node e in an XML Tree from its corresponding type in the DTD, which is denoted by type(e). For example, we use the expression σ(type(e)) to find the corresponding table for e.

Our DOM-based data mapping algorithm OXInsert is shown in Fig. 14. We design OXInsert as an iterative algorithm. The documents conforming to a DTD might be nested with arbitrary depth if the input DTD is recursive (cyclic XML schema). One concern with a recursive data mapping algorithm is the memory space required by numerous recursive calls. Therefore, we avoid a recursive design for the OXInsert algorithm.
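A rough sketch of an iterative, queue-driven traversal in the spirit of OXInsert is given here; it is not the algorithm of Fig. 14. It makes several simplifying assumptions that are not from the paper: the function name `ox_insert_sketch` is hypothetical, the σ-mapping is reduced to a set of element names treated as inlinable, column names are formed by joining element names with dots, tables are plain lists of dictionaries, and endid and attributes are omitted. One queue holds elements that start a new tuple, and a second queue collects the elements folded into that tuple.

```python
from collections import defaultdict, deque
import xml.etree.ElementTree as ET


def ox_insert_sketch(root, inlinable):
    """Shred an ElementTree into {table name: [tuple dict, ...]} without recursion.

    inlinable is an assumed stand-in for the sigma-mapping: a set of tag names
    whose data is stored in the tuple of the nearest non-inlinable ancestor.
    """
    # Global IDs in document (pre-order) order; Element.iter() yields that order.
    ids = {node: i for i, node in enumerate(root.iter(), start=1)}
    parent_of = {child: parent for parent in root.iter() for child in parent}
    tables = defaultdict(list)

    q = deque([root])  # elements that get a tuple of their own
    while q:
        e = q.popleft()
        parent = parent_of.get(e)
        tp = {"ID": ids[e],
              "parentID": ids[parent] if parent is not None else None}
        if len(e) == 0:
            tp["value"] = e.text
        # r holds descendants of e that may be inlined into tp,
        # together with the column prefix they would use.
        r = deque((child, child.tag) for child in e)
        while r:
            f, prefix = r.popleft()
            if f.tag not in inlinable:
                q.append(f)  # f starts its own tuple later
                continue
            tp[prefix + ".ID"] = ids[f]
            if len(f) == 0:
                tp[prefix + ".value"] = f.text
            r.extend((g, prefix + "." + g.tag) for g in f)
        tables[e.tag].append(tp)  # columns never set stay absent (NULL)
    return tables


# Hypothetical sample document and inlining choice.
DOC = ("<book><title>T</title>"
       "<author><fname>A</fname><lname>B</lname></author>"
       "<author><fname>C</fname><lname>D</lname></author></book>")
tables = ox_insert_sketch(ET.fromstring(DOC), {"title", "fname", "lname"})
```

In this sketch every element enters a queue a bounded number of times, so the traversal itself is linear in the number of elements; a column that is never set plays the role of a NULL field for a missing optional child.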
The main idea of OXInsert is that it uses queue q to process all non-inlinable XML elements, and for each such element e, it uses queue r to process all XML elements that are inlinable to e. The main loop processes each non-inlinable XML element e dequeued from q. In particular, a tuple tp is created in the table corresponding to type(e), denoted by σ(type(e)). The data values of node e are retrieved and loaded into the corresponding fields of tuple tp in procedure loadtupledata() (line 09). Set-valued attributes are handled by procedure processsetattr(), where the values of a set-valued attribute are stored in a separate table.

Note that we deal with the issue of varying document structure as follows. On one hand, all missing nodes will have NULL values in their corresponding columns, as all columns are initialized to NULL; the corresponding column of a node is filled with a value only when the node is present. On the other hand, for two elements of the same type, even though the structures of their subtrees might vary, we process each of their descendants using the σ-mapping in a consistent and correct manner.

Since the information of inlinable elements is stored in the same tuple as their parents, for each non-inlinable element e, we need to retrieve the data values of the elements that are inlinable to e. This is achieved by using another queue r to process the descendants of e that are inlinable to e. During this process, if we encounter any non-inlinable element, it will be enqueued into q for further processing (line 17). For each element f that is inlinable to e, we fill the appropriate fields of the tuple tp corresponding to e with the data values retrieved from node f in procedure loadtupledata(). The set-valued attributes of f are handled by procedure processsetattr(), where the values of a set-valued attribute are stored in a separate table. The descendants of f are enqueued into r for further processing (lines 20–22).

Fig. 14. DOM-based data mapping algorithm OXInsert.

Procedure loadtupledata() retrieves the data of any node n and loads it into the tuple tp. Parameter prefix helps to overcome the difference in relational attribute names of non-inlinable and inlinable nodes in XML Tree T. The shredding of set-valued attributes of node n is processed by procedure processsetattr().

Procedure processsetattr() processes the set-valued attribute e.a of a particular element e. Each such attribute is mapped to a separate table, which is denoted by σ(type(e.a)), unlike a single-valued attribute, which is mapped to the same table as its owner element. For each value of the set-valued attribute e.a, a tuple with a sequential index ID (which is disjoint from the IDs in the XML tree), a parent ID and a value is inserted into the table σ(type(e.a)) (lines 3–7).

To analyze the time complexity of algorithm OXInsert, we first present some properties of the algorithm in the following lemmas.

Lemma 5.2. Each non-inlinable element e in XML Tree T is enqueued into queue q exactly once, and q only contains non-inlinable elements.

Proof. The operation of enqueue into q is performed only at line 5 and at line 17. Line 5 enqueues the root element, which is non-inlinable. Line 17 is in

the body of the If statement whose condition indicates that the element f to be enqueued into queue q is non-inlinable. Therefore, q only contains non-inlinable elements. The acyclicity of T implies that each non-inlinable element of T can be enqueued into q at most once. In addition, except for the root element, the While statement (lines 15–24) ensures that each non-inlinable element is enqueued into q at least once in line 17. Finally, the root element is enqueued into q exactly once. Therefore, each non-inlinable element e is enqueued into q exactly once. □

Lemma 5.3. Each XML element e, except the root element in XML Tree T, is enqueued into queue r exactly once.

Proof. Lemma 5.2 implies that each non-inlinable element e is dequeued from q exactly once (line 7), and for each such e, the While statement (lines 15–24) enqueues each of e's descendant elements f exactly once into queue r, where f satisfies the following: (1) f is e's child (line 14) or (2) f is a descendant of e, where f's parent is inlinable to e (line 21). Therefore, each element of T, except the root element, will satisfy one of these two cases for some e and, thus, will be enqueued into r at least once. The acyclicity of T implies that each element of T can be enqueued into r at most once. Therefore, each XML element in T, except the root element, is enqueued into r exactly once. □

The following theorem demonstrates that OXInsert is an efficient linear algorithm.

Theorem 5.4 (Time complexity). The time complexity of algorithm OXInsert is O(n), where database schema R and σ-mapping σ are fixed and n is the total number of XML elements and attribute values in XML Tree T.

Proof (Sketch). From Lemma 5.2, each non-inlinable element e in XML Tree T is enqueued into queue q exactly once, and q only contains non-inlinable elements. Therefore, lines 7–11 will be executed exactly once for each non-inlinable element.
In addition, the execution time of lines 7–11 is constant when we ignore lines 05 and 06 of procedure loadtupledata(), whose execution time is attributed to XML attributes. From Lemma 5.3, each XML element is enqueued into queue r exactly once; thus, the lines that process an element dequeued from r are executed exactly once for each XML element. The execution time of these lines is likewise constant when we ignore lines 05 and 06 of procedure loadtupledata(), whose execution time is attributed to XML attributes. Since n counts both XML elements and attribute values, the time complexity of OXInsert is O(n). □

Table 1 shows how the XML Tree given in Fig. 13 is mapped to the relational database using our data mapping algorithm OXInsert.

Table 1. The state of the database after xbib.xml is stored.

5.2. SAX-based approach

DOM-based algorithms are popular because W3C adopts DOM as its standard for XML description. For a big XML file, or for multiple XML files processed in a multi-tasking environment, creating DOM trees is expensive. A DOM-based data mapping algorithm processes a document in two runs: in the first run, the parser browses the

document and creates an XML tree in the main memory. In the second run, the data mapping algorithm accesses this DOM tree and processes it. On the other hand, a SAX-based [25] data mapping approach only needs to process the document in one run.

Our SAX-based data mapping algorithm, called SDM hereafter, is given in Fig. 15. The data mapping algorithm SDM takes an XML document X, a database schema R and a σ-mapping σ as input, as described in Definition 3.3. The event-driven SDM algorithm makes a sequential scan of the whole document from top to bottom. It triggers procedures startelement(), characters(), and endelement() for start tags, character data and end tags, respectively.

When a start tag for an element e is encountered, SDM triggers the procedure startelement(). startelement() generates a sequential global ID (GID) for the element e. This global ID helps to maintain XML document order in the relational database. If e is a non-inlinable element, then startelement() creates a new tuple t of table σ(type(e)) and starts to fill out the fields of tuple t with the information obtained from e. It pushes the element type of e and its GID onto the stack GST, and pushes the tuple t onto the stack ST_σ(type(e)), where t stays until all the descendants of e are processed and t is completely filled out. If e is an inlinable element, then no new tuple is created. Instead, the tuple on top of the stack ST_σ(type(e)) is updated with the GID and the attribute values of e. Then, the element type of e and its GID are pushed onto the stack GST. Set-valued attributes of e are handled by procedure processsetattr() as in the DOM-based algorithm OXInsert, since values of a set-valued attribute are stored in a separate table.

When any character data between the start and the end tags are encountered, SDM triggers the procedure characters().
Since element e on top of GST is the owner of the scanned character data, these data are mapped to the tuple on top of the stack ST_σ(type(e)).

When the end tag for element e is encountered, SDM triggers the procedure endelement(). If e is non-inlinable, then endelement() pops the tuple t from the stack ST_σ(type(e)), assigns the current GID as the endid of tuple t, and inserts t into the table σ(type(e)). Otherwise, it updates the tuple on top of the stack ST_σ(type(e)), assigning the current GID as the endid of e.

SDM maintains a global stack, GST, and a separate stack, ST_σ(type(e)), for each table σ(type(e)), where σ(type(e)) is the table corresponding to the type of e in the underlying DTD. The global stack GST keeps the parent-child relationships. The stacks for tables are used to fill in the required context information for a particular tuple t of table σ(type(e)). SDM pushes an item onto a table stack ST_σ(type(e)) when a start tag for a non-inlinable element e is encountered. It pops the stack ST_σ(type(e)) when it reads the end tag of e. Hence, a table stack never grows beyond one item, unless there exists a descendant element which is of the same type as its ancestor (recursive XML schema). Table stacks in SDM allow processing such elements easily without interfering with the context of a pending ancestor element, which has the same type as its descendant and for which a tuple has already been created.

Theorem 5.5 (Time complexity). The time complexity of algorithm SDM is O(n), where n is the number of elements and attribute values in the input XML document.

We skip the proof since it is trivial.

6. Experimental study

We implemented the ODTDMap, OXInsert and SDM algorithms in Java. We used a Pentium IV computer with a 2.4 GHz processor and 1 GB of main memory for the experiments. The experiments were run using the Java software development kit. We minimized the usage of system resources during the experiments to get more realistic results.
We ran each program 6 times and report the average over the runs, excluding the first run, to obtain more accurate results.

6.1. The experiment of schema mapping ODTDMap

We applied ODTDMap to a set of DTDs to evaluate the performance of our proposed schema mapping algorithm. We used 6 test DTDs from the XBench XML Benchmark [35] for our experiments. First, we identified the properties of each DTD, such as the number of elements and attributes and the number of * and + operators. Then, we ran ODTDMap and measured the time it takes to map the input DTD to the output database schema. The time spent is measured by running the schema mapping procedure 1000 times to get significant results. The number of tables generated for each DTD was recorded. The experimental results are shown in Table 2.
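The measurement procedure described above can be sketched as a small timing harness. This is only an illustration under assumptions: the paper's experiments were written in Java, the function names `time_batch`, `average_excluding_first` and `benchmark` are hypothetical, and the way the 6 runs and the 1000 repetitions are combined here (each run times one batch of repetitions) is an assumed reading of the text.

```python
import time
from statistics import mean


def time_batch(procedure, repetitions=1000, clock=time.perf_counter):
    """Time `repetitions` back-to-back calls so that a very short procedure
    yields a measurable total."""
    start = clock()
    for _ in range(repetitions):
        procedure()
    return clock() - start


def average_excluding_first(samples):
    """Average all samples except the first one, which acts as a warm-up run."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to drop the first one")
    return mean(samples[1:])


def benchmark(procedure, runs=6, repetitions=1000, clock=time.perf_counter):
    """Assumed protocol: `runs` batches, the first batch discarded, rest averaged."""
    samples = [time_batch(procedure, repetitions, clock) for _ in range(runs)]
    return average_excluding_first(samples)
```

Discarding the first run is a common way to keep one-time costs such as class loading or cache warm-up out of the reported average.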

Fig. 15. SAX-based data mapping algorithm SDM.

While the total number of elements in the 6 DTDs is 125 and the total number of attributes is 14, the total number of tables generated for those DTDs is 23. The total number of tables is around one-sixth of the total number of elements and attributes. Thus, ODTDMap generates considerably fewer tables than there are elements. We also observed that the running time of the ODTDMap algorithm is proportional to the number of elements in the input DTD. This is not surprising, since ODTDMap visits each element only once and spends constant time on each element.

6.2. DOM-based versus SAX-based data mapping

We chose auction.xml of the XMark benchmark [36] as our data set to compare the performance of the DOM-based algorithm OXInsert with the SAX-based algorithm SDM. We generated the test documents in six different sizes ranging from 25 to 125 MB. We constructed the XML Tree for each document using W3C's DOM specification. Our performance metric is the time to map the input XML document to the target relational data. Loading data into the database is not included in this time, whereas the time for parsing the input XML documents is included. The chart given in Fig. 16 shows the average time spent for each document using the two data mapping approaches.

As shown in Fig. 16, SDM shows linear performance and scales very well with the size of the input XML documents, while OXInsert shows linear performance up to the 75 MB document. The DOM-based data mapping algorithm OXInsert has much better performance than the SAX-based

algorithm SDM up to the 75 MB XML document. However, after 75 MB, the SAX-based algorithm starts to outperform it, as XML documents beyond 75 MB could no longer be represented as a DOM tree in the main memory in our experiments. For a large XML document whose XML tree does not fit in the main memory, part of the tree will be swapped between the disk and the main memory, causing considerable time to be spent on I/O operations and degrading the performance of the DOM-based approach. In this case, the event-driven SAX-based approach does not suffer. We observed from our experiments that, as long as the document tree can fit in the main memory, the DOM-based approach for data mapping should be chosen. Otherwise, the SAX-based approach should be the choice for data mapping.

Table 2. Experimental results of schema mapping. Columns: DTD file; file size (bytes); # of elements; # of attributes; # of * operators; # of + operators; # of tables; running time (ms). Rows: country.dtd, address.dtd, customer.dtd, item.dtd, order.dtd, catalog.dtd.

Fig. 16. The performance of OXInsert versus SDM (time in seconds versus document size in MB).

6.3. Data mapping across different schema mappings

In order to study the performance of both the DOM-based data mapping algorithm OXInsert and the SAX-based algorithm SDM across various schema mapping schemes, we conducted experiments on the following three classic schema mappings [4]:

Basic, which inlines a child element to its parent if the parent can contain at most one occurrence of the child. Basic creates a separate relation for each element type. Therefore, an element type might be represented in multiple relations. One disadvantage of Basic is that it might generate a large number of relations, causing low performance for some queries.

Shared, which inlines a child element type to its parent if the parent can contain at most one occurrence of the child.
However, to avoid the problem of Basic, each element type is represented in exactly one relation. A shared element type is always mapped to a separate table in Shared.

Hybrid, which inlines the shared element types that are not reached through a *-edge, in addition to the inlining performed by Shared. This approach combines the features of both Basic and Shared.

We added support for set-valued attributes to these three schema mapping algorithms. To see the impact of inlining on data mapping performance, we did not implement the inlining feature of Basic, since we already implemented the same notion of inlining in Shared. The database schemas generated by Basic, Shared and Hybrid for the DTD given in Fig. 3 are shown in Fig. 17. The database schemas generated by Hybrid and Shared are the same.

We used auction.xml as our data set and generated test documents of sizes from 25 to 125 MB for OXInsert and from 100 MB to 1 GB for SDM. OXInsert does not terminate normally for test documents beyond 125 MB due to its memory space limitation.
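The reason a SAX-based mapper such as SDM can handle documents that exceed the DOM-based memory limit is that it keeps only stacks of pending tuples rather than the whole tree. The following sketch illustrates this idea with Python's standard `xml.sax` module; it is not the algorithm of Fig. 15. The class name `SDMSketch`, the reduction of the σ-mapping to a set of inlinable element names, the dotted column names, and the in-memory `tables` dictionary are all assumptions made for illustration, and attributes are omitted.

```python
import xml.sax
from collections import defaultdict


class SDMSketch(xml.sax.ContentHandler):
    """Stack-based, single-pass shredding (illustrative only)."""

    def __init__(self, inlinable):
        super().__init__()
        self.inlinable = inlinable     # assumed stand-in for the sigma-mapping
        self.gid = 0                   # last global ID handed out
        self.gst = []                  # global stack: (table, column prefix)
        self.st = defaultdict(list)    # one stack of pending tuples per table
        self.tables = defaultdict(list)

    def startElement(self, name, attrs):
        self.gid += 1
        if name in self.inlinable and self.gst:
            # Inlinable: update the pending tuple of the owning table.
            table, parent_prefix = self.gst[-1]
            prefix = parent_prefix + name + "."
            self.st[table][-1][prefix + "ID"] = self.gid
        else:
            # Non-inlinable: open a new pending tuple in its own table.
            table, prefix = name, ""
            self.st[table].append({"ID": self.gid})
        self.gst.append((table, prefix))

    def characters(self, content):
        # May be called several times per text node, so values are appended.
        if content.strip() and self.gst:
            table, prefix = self.gst[-1]
            tp = self.st[table][-1]
            tp[prefix + "value"] = tp.get(prefix + "value", "") + content

    def endElement(self, name):
        table, prefix = self.gst.pop()
        if prefix == "":
            tp = self.st[table].pop()  # tuple is complete: emit it
            tp["endID"] = self.gid
            self.tables[table].append(tp)
        else:
            self.st[table][-1][prefix + "endID"] = self.gid


# Hypothetical sample document and inlining choice.
DOC = (b"<book><title>T</title>"
       b"<author><fname>A</fname><lname>B</lname></author>"
       b"<author><fname>C</fname><lname>D</lname></author></book>")
handler = SDMSketch({"title", "fname", "lname"})
xml.sax.parseString(DOC, handler)
```

In a real loader the completed tuple would be written to the database in `endElement` instead of being kept in `tables`, so the memory in use at any time is bounded by the nesting depth of the document rather than by its size.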
