Schemaless Approach of Mapping XML Document into Relational Database

Ibrahim Dweib 1, Ayman Awadi 2, Seif Elduola Fath Elrhman 1, Joan Lu 1
University of Huddersfield 1, Alkhoja Group 2
ibrahim_thweib@yahoo.com, aymawadi@yahoo.com, seifelduolaf@yahoo.com, J.Lu@hud.ac.uk

Abstract

The Extensible Markup Language (XML) is used for representing and exchanging data over the Internet, but this technology needs a suitable medium for storing these data. At present, three common technologies can be used to store and retrieve XML documents: native XML databases, Object Oriented Databases (OODB) and Relational Databases (RDB). This paper describes a general method for mapping XML documents to an RDB. The method does not need a DTD or XML schema, and it can be applied as a general solution for any tree data structure, not just for XML data. It can also be used for both data-centric and document-centric documents. Experiments on this method show its ability to maintain document structure easily and at low cost; rebuilding the original document is straightforward, and first-level semantic search is achievable either on a single document or on all documents.

1. Introduction

The World Wide Web (WWW) is nowadays the most important medium used by most people in their daily activities (e.g., e-business, e-mail). Most enterprises collaborate with other enterprises in long-running read-write workflows through XML-based data exchange technologies. A large amount of data needs to be exchanged through the web (e.g., in XML format) and stored somewhere as a digital copy. Storing the huge amount of web services data is an attractive area of research for researchers and database vendors, but the important issue is how to retrieve and query these data efficiently.
The use of XML for data exchange and representation, together with a Relational Database Management System (RDBMS) for storage and querying, represents a sophisticated hybrid approach to solving most data problems. Following this track, the key challenges in previous studies with fixed shredding are that information from the original XML documents is lost, that reconstructing the original XML documents is very difficult, and that the generated RDB is very large because XML elements are inlined into the relational tables.

Existing techniques for mapping XML to relational databases can generally be classified into two tracks. The first is the structure-centric technique, which depends on the XML document structure to guide the mapping process [1, 2]. The second is the schema-centric technique, which makes use of schema information such as a DTD or XML schema to derive an efficient relational storage for XML documents [3, 4].

In this research we focus on a method for mapping XML documents to a relational database. To simplify the mapping process, the method does not require a DTD or XML schema, and it can be applied as a general solution for any tree data structure, not just for XML data. In this method, the description of each XML document's structure is kept in a large text field, doc_structure, containing a coded string. Any change to the document structure should be reflected in this field, such as adding a new tag or property, deleting an existing tag or property, or relocating a given tag or property to a different location in the same document.

The method aims to overcome the challenges of fixed shredding, i.e.:
1) No loss of information while shredding.
2) Easier and much faster reconstruction of the original XML documents.
3) Maintaining the XML document structure.
4) Preserving the ordering of XML data.
5) Ability to perform semantic search on the stored data.
The rest of the paper is organized as follows: Section 2 discusses related works, Section 3 presents the theory, method and guidance, Section 4 shows experimental results, and Section 5 draws the conclusions and outlines future works.

978-1-4244-2358-3/08/$20.00 2008 IEEE, CIT 2008

2. Related works

Different approaches have been proposed for labelling the XML document tree, since labelling plays a significant role in XML query processing. Global, Local and Dewey labelling were proposed in [1].

In Global labelling, each node is assigned a number that represents the node's absolute position in the document. With this label, dynamic update is very difficult, since all nodes after an inserted node need to be relabelled, and extracting parent-child and ancestor-descendant relationships is impossible.

In Local labelling, each node is assigned a number that represents its relative position among its siblings. The combination of a node's position with those of its ancestors, as a path vector, uniquely identifies the absolute position of the node within the document. Dynamic update has less overhead with Local labels than with Global labels, because only the following siblings of the new node need to be renumbered. However, extracting parent-child and ancestor-descendant relationships is still very difficult.

In Dewey labelling, each node is given a label that combines its parent's label with a private integer. This makes it easy to derive the labels of a node's ancestors from its own label. For example, if an element's label is 1.2.5.3, then its parent is 1.2.5 and a further ancestor is 1.2. However, this method generates a large RDB from the mapping process, since it gives a private label to each node in the tree, and it requires updating the labels of the following nodes when a new node is inserted.

ORDPATH, a hierarchical labelling scheme implemented in Microsoft SQL Server 2005, was introduced in [5]. It is used to label the nodes of an XML tree without requiring a schema. It supports insertion of new nodes at arbitrary positions in the XML tree without updating the labels of old nodes, since it assigns only positive odd integers to nodes during initial loading and reserves even and negative integer values for later insertions into the existing tree.
The advantages of ORDPATH labelling are that no overhead is incurred for updates and that it preserves the structure of the XML document. However, it fails to support semantic search or path search.

A clustering-based scheme for labelling XML trees was proposed in [2]. In this scheme, a group of elements is labelled instead of a single element. Elements are separated into groups, with all sibling elements placed in one group; one label is assigned to the group, instead of one label to each element, and the group is stored in one relational record. A clustering-based scheme reduces the size of the database needed to store the XML tree by reducing the number of records generated by the mapping process, since it uses one label for a group of elements (a cluster) stored in one relational record, in contrast to other labelling methods that need a label for each node. However, this method suffers from the problem of dynamic updating after the insertion of a new node, i.e., many nodes must be relabelled. It also fails to support semantic search or path search.

Oracle XML DB [6] and IBM DB2 XML Extender [7] provide a schemaless way of storing XML data. The entire XML document is stored in a column using the CLOB data type. There is no need for XML-to-SQL query translation, since XML queries are processed much as they are in a native XML database.

3. Theory method and guidance

The main goal of mapping XML documents to an RDB is to combine the main advantages of the two technologies by finding an efficient method for storing, retrieving and querying the huge amount of web data exchanged through the Internet. In this section we focus on a method of mapping XML documents to an RDB. To simplify the mapping process, the method does not require a DTD or XML schema, and it can be applied as a general solution for any tree, not just for XML data.

3.1. Theory guidance

The main mathematical concepts used in this research are presented in this section.
Definition 1: An XML tree is composed of many sub-trees of different levels; it can be defined as follows:

$$T = \bigcup_{i=1}^{n} (E_i, A_i, X_i, r_{i-1})$$

where i = 1, 2, ..., n represents the levels of the XML tree, 0 represents the root, and
$E_i$ is a finite set of elements in level i,
$A_i$ is a finite set of attributes in level i,
$X_i$ is a finite set of texts in level i,
$r_{i-1}$ is the root of the sub-tree of level i.

Definition 2: A dynamic fragment (shred) df(i) is defined to be the attributes and texts (leaf children) of sub-tree i of the XML tree plus its root $r_{i-1}$, as follows:

$$df(i) = (A_i, X_i, r_{i-1})$$

where
$A_i$ is a finite set of attributes in level i,
$X_i$ is a finite set of texts in level i,
$r_{i-1}$ is the root of the sub-tree of level i.

3.2. Method employed

The method is very simple: a global labelling approach is applied to give a label to XML elements and attributes. The label is unique for each token, i.e., document element, tag, or property, but it is not required to be in sequence as in [1]. Any initial traversal of the XML document (in-order, pre-order, or post-order) is applicable. No re-labelling of XML document
items is needed if a new item or sub-tree is added. The relational schema consists of two tables: the "documents" table, which keeps the required information about the structure of the XML documents, and the "tokens" table, which keeps the contents of the XML documents. The following sub-sections give more details about the approach.

3.3. Design framework

The solution is built on a simple idea based on Definitions 1 and 2:

1. A master table for documents is needed, called "documents". This table keeps information about the documents themselves; at minimum it has the structure documents(doc_id, doc_structure). Additional fields may be added to keep further information about the document itself, such as dates, statistics, types, etc.
   a. doc_id is a unique id generated per document to identify documents.
   b. doc_structure is a large text field containing a coded string describing the document structure. Any change to the document structure should be reflected in this field, such as adding a new tag or property, deleting an existing tag or property, or relocating a given tag or property to a different location in the same document (details below).
2. A second table is needed to store the actual contents of all documents. Documents are shredded into pieces of data called tokens; each document element, tag, or property is considered a token. The tokens table has at minimum the structure tokens(doc_id, token_id, token_name, token_value).
   a. token_id is the generated primary id of each token.
   b. doc_id is the foreign key linking the tokens table to the documents table.
   c. token_name is the tag name or property name as found in the original XML document.
   d. token_value is the text value of the XML tag or property.

The rules for constructing the doc_structure field are as follows:

1. The doc_structure field is where the document structure is maintained.
2. It consists of a long series of related keys.
3. Each key starts with a given alphabetic character, say the letter 'T' for an element (child) and the letter 'A' for an attribute; this is necessary to delimit keys in the sequence. The letter is followed by a number representing the token_id that the key refers to, e.g., T120 is a key referring to the token in the tokens table whose token_id = 120.
4. If the token referred to has properties defined in the original XML document, then the key representing this token in doc_structure is followed by a set of keys defining these properties. For example, T120A12A17A2 is a valid key string, which can be read as: token number 120 has three properties defined by tokens number 12, 17, and 2, and these properties appear in the original document in this order.
5. If the token referred to has children tags (a sub-tree) in the original XML document, then these children are represented as a key-string surrounded by angle brackets. For example, T120<T12T7<T2T1>T77> is a valid string, which can be read as: token 120 has three sub-tags in this order, token 12, followed by token 7, then token 77, and token 7 itself has two sub-tags, numbered 2 and 1, in the given order.

So, the relational schema for this method has two tables, as shown in Figure 1.

Documents(doc_id, doc_structure)
Tokens(doc_id, token_id, token_name, token_value)

Figure 1: Relational schema

3.3.1. Mapping XML to RDB algorithm.

The data model used for the mapping algorithm relies on the W3C's Document Object Model (DOM) to represent XML documents in memory before mapping them. It also uses a stack to traverse the XML document, pushing the children of each node onto the stack in reverse order so as to preserve their order in the doc_structure field. Figure 2 shows the MapXMLtoRDB algorithm, with a DOM Document containing the XML document to be mapped and a DocID as input, and RDB tables as output. Line 5 pushes the root element of the document onto the stack.
The do loop (lines 6-28) constructs the doc_structure field and inserts the XML tokens (elements and attributes) into the tokens table. In line 7, the top of the stack is popped. If the popped item is ">", all the children of the parent element have been added to the database, and the ">" symbol is appended to the "struc" string (lines 8-10). Otherwise (i.e., the popped item is a node), the element's name and value are inserted into the database, and its id is appended to the "struc" string. If the element has attributes, all of them are inserted into the database and their ids are appended to the "struc" string.
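The key-string rules of Section 3.3 can be checked mechanically. The following Python sketch is an illustration only and is not taken from the paper: it assumes the rules exactly as stated (a key is 'T' or 'A' followed by a token_id, and the children of an element are wrapped in angle brackets), and the names KEY and parse_structure are hypothetical. It reads a key-string back into nested dictionaries of token ids, using the two example strings given in the rules.

```python
import re

# One key ("T" or "A" followed by a token_id) or one angle bracket.
KEY = re.compile(r"([TA])(\d+)|([<>])")


def parse_structure(struc):
    """Parse a doc_structure key-string into nested dicts of token ids."""
    if KEY.sub("", struc):
        raise ValueError("unexpected characters in key-string")
    holder = {"id": None, "attrs": [], "children": []}  # artificial parent of the root
    open_parents = [holder]   # elements whose "<" has not been closed yet
    last = holder             # most recent element; owns the A keys that follow it
    for kind, number, bracket in KEY.findall(struc):
        if kind == "T":
            last = {"id": int(number), "attrs": [], "children": []}
            open_parents[-1]["children"].append(last)
        elif kind == "A":
            last["attrs"].append(int(number))
        elif bracket == "<":
            open_parents.append(last)
        else:  # ">" closes the innermost open child list
            open_parents.pop()
    return holder["children"][0]


tree = parse_structure("T120<T12T7<T2T1>T77>")
print([child["id"] for child in tree["children"]])                 # [12, 7, 77]
print([child["id"] for child in tree["children"][1]["children"]])  # [2, 1]
print(parse_structure("T120A12A17A2")["attrs"])                    # [12, 17, 2]
```

Because every key carries its own letter prefix, a single left-to-right pass with a stack of open parents is enough to recover the nesting and the sibling order.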

1  MapXMLtoRDB Algorithm
2  Input: DOM Document containing the XML document to be mapped, DocID.
3  Output: XML tokens inserted in Relational Database tables.
4  Begin
5    Initialize stack with document Element
6    Do loop
7      Pop top of stack Element
8      If Element = ">"
9        Append ">" to struc string
10     Else
11       Write token to database, element name, element value
12       Get token id for the added token
13       Append Id to struc string
14       If element has attributes
15         For each attribute in attributes collection do
16           Add to database as token, att. name & att. value
17           Get token id
18           Append token id to struc string
19         End for
20       End if
21       If element has child nodes
22         Append "<" to struc string
23         Push ">" to stack
24         Push all children to stack in reverse order
25       End if
26     If stack is empty exit loop
27     End if
28   End loop
29   Write struc string to database
30 End algorithm

Figure 2: Mapping XML to RDB algorithm

Lines 21-25 check whether the element has children. If so, "<" is appended to the "struc" string, ">" is pushed onto the stack, and all its children are pushed onto the stack in reverse order. Line 26 checks the status of the stack; if it is empty, the do loop terminates. After that, the "struc" string is inserted into the database (documents table). All of an element's children are enclosed in angle brackets. The nested brackets differentiate between the document's levels, while the letters 'T' and 'A' differentiate between an element's children and its attributes. The reconstruction algorithm for building an XML document from the relational database is omitted for reasons of space.

3.4. Theory implementation on simple case study

In this subsection, we give an example to illustrate the application of the mapping method described in Subsection 3.3. Consider the XML document in Figure 3. Any XML document can be represented as a rooted, labelled tree. Figure 4 presents an XML tree for the XML document in Figure 3.
In our method, each node in the tree is given a generated label in pre-order traversal. This label is unique since it identifies each token in the document.

<books>
  <book id="11210" category="fiction">
    <author id="a1" sex="m">m. John</author>
    <name>computer Science 101</name>
  </book>
  <book id="11211">
    <author>a. Mark</author>
    <name>applied Math 101</name>
    <subject>math</subject>
  </book>
</books>

Figure 3: XML document

99 books
  100 book
    101 id "11210"
    102 category "fiction"
    103 author: M. John
      104 id "a1"
      105 sex "m"
    106 name: CS 101
  107 book
    108 id "11211"
    109 author: A. Mark
    110 name: Applied Math 101
    111 subject: Math

Figure 4: A tree representation for XML document in Figure 3
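The algorithm of Figure 2 can be sketched in Python and applied to the document of Figure 3. This is a minimal sketch under stated assumptions, not the authors' implementation: it uses Python's xml.etree.ElementTree in place of the W3C DOM, keeps the tokens in an in-memory list instead of a relational table, generates token ids from a simple counter, and ignores mixed content (text following a child element). The names map_xml_to_rdb, first_token_id and FIGURE_3 are hypothetical; the starting id 99 is chosen only to match the labels in Figure 4.

```python
import xml.etree.ElementTree as ET

CLOSE = ">"  # sentinel pushed on the stack to mark the end of a child list


def map_xml_to_rdb(xml_text, doc_id, first_token_id=1):
    """Return (documents_row, token_rows) for one XML document."""
    tokens = []               # stands in for the tokens table
    next_id = first_token_id  # stands in for a database-generated key
    struc = []                # pieces of the doc_structure string
    stack = [ET.fromstring(xml_text)]

    def add_token(name, value):
        nonlocal next_id
        tokens.append((doc_id, next_id, name, value))
        next_id += 1
        return next_id - 1

    while stack:                       # loop ends when the stack is empty
        item = stack.pop()
        if isinstance(item, str):      # the ">" sentinel
            struc.append(CLOSE)
            continue
        text = (item.text or "").strip() or None
        struc.append(f"T{add_token(item.tag, text)}")
        for name, value in item.attrib.items():
            struc.append(f"A{add_token(name, value)}")
        children = list(item)
        if children:
            struc.append("<")
            stack.append(CLOSE)
            stack.extend(reversed(children))  # reverse push keeps document order
    return (doc_id, "".join(struc)), tokens


FIGURE_3 = """<books>
  <book id="11210" category="fiction">
    <author id="a1" sex="m">m. John</author>
    <name>computer Science 101</name>
  </book>
  <book id="11211">
    <author>a. Mark</author>
    <name>applied Math 101</name>
    <subject>math</subject>
  </book>
</books>"""

documents_row, token_rows = map_xml_to_rdb(FIGURE_3, doc_id=10, first_token_id=99)
print(documents_row[1])
# T99<T100A101A102<T103A104A105T106>T107A108<T109T110T111>>
```

Pushing the ">" sentinel before the children means it is popped only after the whole sub-tree has been written, which is what closes each bracket at the right depth.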

After transformation, this document is represented by a single record in the documents table, with doc_id = 10 for example, as in Figure 5. The tokens table contains the records for the document contents, as shown in Figure 6. The doc_structure field for this document is:

T99<T100A101A102<T103A104A105T106>T107A108<T109T110T111>>

doc_id  doc_structure
10      T99<T100A101A102<T103A104A105T106>T107A108<T109T110T111>>

Figure 5: Documents table

doc_id  token_id  token_name  token_value
10      99        books       Null
10      100       book        Null
10      101       id          11210
10      102       category    fiction
10      103       author      M. John
10      104       id          a1
10      105       sex         m
10      106       name        Computer Science 101
10      107       book        Null
10      108       id          11211
10      109       author      A. Mark
10      110       name        Applied Math 101
10      111       subject     Math

Figure 6: Tokens table

Notice that the document structure can easily be maintained in this way. For example, if we wish to delete the "sex" property of the first author, and we know that this property is A105, then all we need is a simple string operation that removes the substring A105 from the doc_structure field. If we need to add a new book tag between the existing ones, this is nothing more than inserting the proper code into the above string at the right place. For example, suppose the newly added book has the structure in Figure 7 and has been shredded into the records in the tokens table shown in Figure 8.

<book id="106">
  <author>abc</author>
  <name>applied Geo 106</name>
</book>

Figure 7: A portion of XML document

doc_id  token_id  token_name  token_value
10      200       book        Null
10      201       id          106
10      202       author      abc
10      203       name        Applied Geo 106

Figure 8: Equivalent Tokens table

Its equivalent key-string is then T200A201<T202T203>. This new substring is inserted into doc_structure at the place reflecting its order in the original document; the doc_structure field therefore now looks like this:

T99<T100A101A102<T103A104A105T106>T200A201<T202T203>T107A108<T109T110T111>>

4. Experimental results

An Intel Core 2 Duo computer with a 2 GHz CPU, 1 GB RAM and 256 MB shared cache, running Windows Vista, is used for the experimental test. Visual Basic 6 is used as the software development kit, with Microsoft Access 2003 as the target relational database. Five XML documents of different sizes are used in the experiment. The performance metrics are the time spent mapping XML documents to the relational database and the time spent reconstructing these documents from the relational database. The experiment is repeated five times and the mean value of those times is reported, in order to obtain realistic and accurate results. Table 1 shows both times, i.e., the time spent mapping XML to RDB and the time spent reconstructing those documents from the relational database, for different document sizes. The data are taken from the XML data repository available at the web site of the School of Computer Science and Engineering, University of Washington [8].

The results in Table 1 show that the time for mapping an XML document to the RDB and reconstructing it from the RDB is acceptable, and that the relation between document size and mapping and reconstruction time is approximately linear.

5. Conclusion and future works

By using this method, we are able to maintain document structure easily and at low cost; rebuilding the original document is straightforward, and first-level semantic search is achievable either on a single document or on all documents.
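As an illustration of what a first-level search over the tokens table might look like, the following Python sketch loads the rows of Figure 6 into an in-memory SQLite database and matches tokens by name and value, either within one document or across all documents. It is a sketch under assumptions, not the system used in the experiments (which targeted Microsoft Access): the column types, the function name first_level_search, and the reading of "first-level search" as a name/value match on single tokens are all assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE documents (doc_id INTEGER PRIMARY KEY, doc_structure TEXT);
CREATE TABLE tokens (
    doc_id      INTEGER REFERENCES documents(doc_id),
    token_id    INTEGER PRIMARY KEY,
    token_name  TEXT,
    token_value TEXT
);
""")
conn.execute(
    "INSERT INTO documents VALUES (?, ?)",
    (10, "T99<T100A101A102<T103A104A105T106>T107A108<T109T110T111>>"),
)
# Rows of Figure 6 (Null becomes None).
conn.executemany("INSERT INTO tokens VALUES (?, ?, ?, ?)", [
    (10, 99, "books", None), (10, 100, "book", None),
    (10, 101, "id", "11210"), (10, 102, "category", "fiction"),
    (10, 103, "author", "M. John"), (10, 104, "id", "a1"),
    (10, 105, "sex", "m"), (10, 106, "name", "Computer Science 101"),
    (10, 107, "book", None), (10, 108, "id", "11211"),
    (10, 109, "author", "A. Mark"), (10, 110, "name", "Applied Math 101"),
    (10, 111, "subject", "Math"),
])


def first_level_search(conn, name, value, doc_id=None):
    """Find tokens by name and value, in one document or in all of them."""
    sql = ("SELECT doc_id, token_id FROM tokens "
           "WHERE token_name = ? AND token_value = ?")
    params = [name, value]
    if doc_id is not None:            # restrict to a single document
        sql += " AND doc_id = ?"
        params.append(doc_id)
    return conn.execute(sql + " ORDER BY doc_id, token_id", params).fetchall()


print(first_level_search(conn, "author", "A. Mark"))       # [(10, 109)]
print(first_level_search(conn, "id", "11211", doc_id=10))  # [(10, 108)]
```

A query of this kind sees each token in isolation: the row for id "a1" does not record which element owns it, so relating a token to its parent requires consulting the doc_structure string.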

Table 1: The time spent for mapping XML documents and the time for reconstructing them.

Document size               4 KB         28 KB       64 KB      602 KB     1 MB
Mapping time (secs)         0.01988238   0.14977736  0.3551445  3.574335   5.85278136
Reconstructing time (secs)  0.018990234  0.44980958  1.926836   18.305544  32.06255104

Complex semantic search is not easily achievable with this structure; for example, we cannot write a select statement to retrieve all records where the id of an author equals a given value. The next step of this research is to improve the method to achieve complex semantic search and to differentiate between XML data types (e.g., strings, dates, integers) in order to apply less-than or greater-than queries. We will then carry out intensive testing and compare our method with other methods in the literature to assess its performance.

References

[1] I. Tatarinov, S. Viglas, K. Beyer, J. Shanmugasundaram, E. Shekita, and C. Zhang, (2002), "Storing and Querying Ordered XML using a Relational Database System", in Proc. of SIGMOD, pp. 204-215.
[2] S. Soltan, and M. Rahgozar, (2006), "A Clustering-based Scheme for Labeling XML Trees", IJCSNS International Journal of Computer Science and Network Security, Vol. 6, No. 9A.
[3] K. Fujimoto, M. Yoshikawa, D. Kha, and T. Amagasa, (2005), "A Mapping Scheme of XML Documents into Relational Databases Using Schema-based Path Identifiers", Proceedings of the 2005 International Workshop on Challenges in Web Information and Integration (WIRI'05), IEEE.
[4] G. Xing, X. Zhonghang, and A. Douglas, (2007), "X2R: A System for Managing XML Documents and Key Constraints Using RDBMS", in Proc. of ACMSE 2007, March 23-24, 2007, Winston-Salem, North Carolina, USA.
[5] P. O'Neil, E. O'Neil, S. Pal, I. Cseri, G. Schaller, and N. Westbury, (2004), "ORDPATHs: Insert-Friendly XML Node Labels", SIGMOD 2004, June 13-18, Paris, France.
[6] Oracle, (n. a.), Oracle XML DB Developer's Guide 10g. Retrieved 1st Nov 2006, from http://www.databasebooks.us/oracle_0016.php
[7] IBM, (n. a.), DB2 XML Extender. Retrieved Oct 10, 2006, from http://www-306.ibm.com/software/data/db2/extenders/xmlext/index.html
[8] U. Washington, Computer Science & Engineering Research, (2002), XMLData Repository. Retrieved Jun 15, 2007, from http://www.cs.washington.edu/research/xmldatasets/