An approach to the model-based fragmentation and relational storage of XML-documents


Christian Süß
Fakultät für Mathematik und Informatik, Universität Passau, D Passau, Germany

Abstract

A flexible method for storing XML documents in relational or object-relational databases is presented that is based on an adaptable fragmentation. Whereas most known approaches decompose XML documents into minimal units, we propose to store fragments of variable granularity, ranging from single elements to whole documents. Different fragmentation strategies, depending on the specific access and query requirements, can be applied to the same XML documents. Experiments have shown that the response times are much better than those for the complete decomposition. Furthermore, our storage model, which is based on directed acyclic graphs, facilitates the reuse of XML subdocuments and supports different views on XML documents.

1 Introduction

Today, there are numerous approaches to storing and querying XML data. Besides storing XML data in the file system, which is straightforward but does not support querying, candidates include object-oriented database systems, systems based on semi-structured data models, and native XML systems. However, comparative studies [4, 6, 7, 8] indicate that relational or object-relational database systems are still very competitive.

In general, there are many ways to store XML data in relational or object-relational database systems. For example, in [1, 2] the user or database administrator can decide how to store XML elements in relational tables. Appropriate relational schemata can also be derived automatically from a given XML schema, e.g. a DTD [3, 8]. There are also generic approaches that store documents without any user interaction and without requiring any kind of schema, and that support the storage and retrieval of different types of XML documents, e.g. XSL documents, using the same relational schema.
For example, [4] presents different strategies to completely decompose arbitrary XML documents into relational tables. These strategies show good overall performance, but reconstructing complete documents is rather expensive. In relational databases, clustering and indexing are applied to compensate for the performance loss caused by the (more or less) rigorous decomposition of the relational schemata that normalization requires. However, it is not yet clear what clustering and indexing of XML documents exactly mean.

In this paper we propose a flexible fragmentation of XML documents that at least avoids unnecessary joins when reconstructing frequently accessed parts of a document. Fragmentation has also been suggested in [9]. However, [9] relies on a fragmentation specification supplied by the user and stored in the source document or the DTD, whereas our generic approach is completely independent of such hard-wired directives. Furthermore, in our approach multiple fragmentation strategies can be defined and applied to the document whenever appropriate. The fragmentation strategies presented in this paper are guided by an underlying domain model, which is used to specify the fragments at a high level of abstraction. In contrast to [4], we preserve the distinction between subelements and attributes as well as between subelements and references. Furthermore, when applying our approach to object-relational databases, we can benefit from specific user-defined datatypes [8] and indexing techniques [1, 2]. Finally, our storage model is based on directed acyclic graphs. In contrast to tree-based storage structures, it facilitates the reuse of XML subdocuments and supports different views on XML documents.

The rest of this paper is organized as follows: Section 2 formally defines the fragmentation of XML documents. Section 3 introduces relational database schemata as well as algorithms to store and retrieve XML fragments. Section 4 focuses on model-based fragmentation strategies and presents experimental results. The paper concludes with a short summary in Section 5.

2 Fragmentation of XML documents

Definition 1 (Notation for XML documents) Let doc be an XML document. Then elements(doc) denotes the set of elements in doc, where elements having the same tag name are distinguished by unique subindices. tree(doc) is the tree structure of the elements defined by doc. root(doc) is the root element of doc and therefore also of tree(doc). Let $e \in elements(doc)$ be an element of doc. Then doc(e) denotes the subdocument of doc having e as its root element. tag(e) is the tag name of e. value(e, attr) is the value of the attribute attr of e. xml(e) denotes the XML representation, or serialization, of e in doc, including the opening and closing tags of e and all of its contents. In particular, xml(root(doc)) is the XML representation of the entire document doc. $children(e) \subseteq elements(doc)$ denotes the set of elements directly contained in e, excluding e itself. Conversely, $parent(e) \in elements(doc)$ denotes the element that contains e. Obviously, parent is not defined for the root element.

    <course title="XML Tutorial">
      < title="introduction">
        <motivation>
          <>XML is needed for many reasons...</>
          <image src="xml.gif"/>
        </motivation>
      < title="basics">
        <definition>
          <>An XML document is...</>
        </definition>
        <example>
          <>Consider the following XML document...</>
        </example>
    </course>

Figure 1: Sample XML document

Definition 2 (Fragment) Let doc be an XML document. Then $f \subseteq elements(doc)$ is a fragment of doc iff the subgraph of tree(doc) that is induced by f is connected. This subgraph forms a tree, denoted by tree(f), with the root element root(f).

Definition 3 (Fragmentation) Let doc be an XML document. A fragmentation $F = \{f_1, \ldots, f_n\}$ of doc is a partitioning of elements(doc) into fragments $f_1$ to $f_n$, i.e., the $f_i$ are pairwise disjoint and their union equals elements(doc). roots(F) denotes the set of root elements of the fragments in F. The elements of roots(F) uniquely determine F.
We assume that each fragment $f_i$ in F, $1 \le i \le n$, has a unique identifier $id(f_i)$. The XML representation xml(f) of a fragment $f \in F$ is the result of replacing, in xml(root(f)), the XML representation xml(e) of the root element e of any other fragment $g \in F$ occurring as a subtree in f by the element

    <tag(e) fragment-id="id(g)" />

where tag(e) is the tag name of the root element of g and id(g) is the unique identifier of g. Thus, essentially every fragment subtree is replaced by a reference to the fragment.

Let doc be an XML document and let F be a fragmentation of doc. Then the tree structure tree(doc) induces a graph graph(F) on the fragments of F, which is also a tree, because in XML each element is contained in exactly one other element. However, we also allow directed acyclic fragmentation graphs, in which a fragment can have more than one parent fragment.

Definition 4 (Graph of a fragmentation) Let F be a fragmentation and $f, g \in F$. Then f is called a parent fragment of g, and g a child fragment of f in F, iff in the tree of the original document doc it holds that $parent(root(g)) \in f$. children(f) denotes the child fragments of f in F, and parents(f) denotes the parent fragments of f in F.

Example 5 (Fragments) Figure 2 shows the tree structure of the XML document of figure 1 using four fragments with root elements course, motivation, definition and example, identified by the unique identifiers 1 through 4. The XML representation of fragment 1, containing three references to the child fragments, can be found in figure 3.
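The construction of xml(f) can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: the tag names section and paragraph in the sample document, the helper names fragment_ids and fragment_xml, and the numbering of fragments in document order are assumptions made for illustration and are not prescribed by the definitions above. The fragment roots are those of example 5.

```python
import xml.etree.ElementTree as ET

# Sample document; the tag names "section" and "paragraph" are assumptions.
DOC = (
    '<course title="XML Tutorial">'
    '<section title="introduction">'
    '<motivation><paragraph>XML is needed for many reasons...</paragraph>'
    '<image src="xml.gif"/></motivation>'
    '</section>'
    '<section title="basics">'
    '<definition><paragraph>An XML document is...</paragraph></definition>'
    '<example><paragraph>Consider the following XML document...</paragraph></example>'
    '</section>'
    '</course>'
)


def fragment_ids(doc_root, root_tags):
    """Map each fragment root element to an identifier (document order).

    The document root always starts a fragment; every element whose tag is
    in root_tags starts a further one (hypothetical selection rule).
    """
    ids = {}
    for e in doc_root.iter():
        if e is doc_root or e.tag in root_tags:
            ids[e] = len(ids) + 1
    return ids


def fragment_xml(frag_root, ids):
    """Serialize one fragment, replacing nested fragment subtrees by references."""

    def copy_of(e):
        if e is not frag_root and e in ids:
            # Root of another fragment: keep only <tag fragment-id="..."/>.
            clone = ET.Element(e.tag, {"fragment-id": str(ids[e])})
        else:
            clone = ET.Element(e.tag, dict(e.attrib))
            clone.text = e.text
            for child in e:
                clone.append(copy_of(child))
        if e is not frag_root:
            clone.tail = e.tail  # keep text that follows the element
        return clone

    return ET.tostring(copy_of(frag_root), encoding="unicode")


ROOT = ET.fromstring(DOC)
IDS = fragment_ids(ROOT, {"motivation", "definition", "example"})
FRAGMENT_XML = {fid: fragment_xml(e, IDS) for e, fid in IDS.items()}

for fid in sorted(FRAGMENT_XML):
    print(fid, FRAGMENT_XML[fid])
```

With these fragment roots, FRAGMENT_XML[1] has the shape of figure 3: every nested fragment subtree is reduced to an empty element that carries only the fragment-id attribute.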

Figure 2: Tree structure with four fragments

    <course title="XML Tutorial">
      < title="introduction">
        <motivation fragment-id="2" />
      < title="basics">
        <definition fragment-id="3" />
        <example fragment-id="4" />
    </course>

Figure 3: XML representation of a fragment

3 Relational storage of XML fragments

Definition 6 (Relational schema) To store the graph of a fragmentation in a relational database, we use a relational schema consisting of the three tables fragment(id, tag, xml), attribute(id, name, value) and child(parid, childid, pos), where underlined attributes denote primary keys. Attribute id of table attribute as well as attributes parid and childid of table child are foreign keys referencing table fragment. Attribute xml is of a type appropriate for storing large character sequences (e.g. CLOB). Note that we show only the essential attributes.

Algorithm 7 (Storage of an XML document) Let doc be an XML document and let F be a fragmentation of doc. Then doc is stored according to F in a relational database using the schema of definition 6 by the following algorithm:

1. For each fragment $f \in F$, insert into table fragment the tuple (id(f), tag(root(f)), xml(f)).
2. For each attribute-value pair name=value of the root element root(f) of a fragment f, insert into table attribute the tuple (id(f), name, value).
3. For each pair of fragments f and g, where g is the i-th child fragment of f according to the element ordering in the original document, insert into table child the tuple (id(f), id(g), i).

Example 8 (Storage of an XML document) Figure 4 shows the extension of the tables after storing the XML document of figure 1 according to the fragmentation shown in figure 2. For the complete contents of column xml in the first row of table fragment see figure 3.

Algorithm 9 (Retrieval of a fragment) Let the XML document doc be stored as described in algorithm 7 according to a fragmentation F.
Let e = root(f) be the root element of a fragment $f \in F$. Then we obtain the subdocument xml(e), which contains e and all its XML contents, using the following algorithm:

1. Execute the SQL query SELECT xml FROM fragment WHERE id = id(f). The result of this query is the XML representation xml(f) of fragment f.
2. Replace each element <tag($e_2$) fragment-id="id" /> by the XML representation xml($e_2$) of the root element $e_2$ of the fragment with identifier id, obtained by a recursive application of this algorithm.

According to algorithm 9, the tables attribute and child are not necessary for retrieving a document, because their information is also contained in the XML representation of the fragments. Nevertheless, they are important for the efficient retrieval of fragments and for navigation in the fragmentation graph.

    fragment
    id  tag         xml
    1   course      <course...
    2   motivation  <motivation>...
    3   definition  <definition>...
    4   example     <example>...

    attribute
    id  name   value
    1   title  XML Tutorial

    child
    parid  childid  pos

Figure 4: Extension of tables for fragmented sample XML document
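Algorithms 7 and 9 can be sketched with Python's standard sqlite3 and xml.etree.ElementTree modules. This is a minimal sketch under stated assumptions, not the implementation used for the experiments: the column types, the choice of primary keys, the function names store, retrieve and expand, and the tag names section and paragraph in the sample fragments are assumptions, and a TEXT column stands in for a CLOB.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Schema of definition 6; key choices and column types are assumptions.
SCHEMA = """
CREATE TABLE fragment (id INTEGER PRIMARY KEY, tag TEXT, xml TEXT);
CREATE TABLE attribute (id INTEGER REFERENCES fragment(id),
                        name TEXT, value TEXT, PRIMARY KEY (id, name));
CREATE TABLE child (parid INTEGER REFERENCES fragment(id),
                    childid INTEGER REFERENCES fragment(id),
                    pos INTEGER, PRIMARY KEY (parid, pos));
"""

# id(f) -> xml(f) for the four fragments of example 5 (tag names assumed).
FRAGMENTS = {
    1: '<course title="XML Tutorial">'
       '<section title="introduction"><motivation fragment-id="2" /></section>'
       '<section title="basics"><definition fragment-id="3" />'
       '<example fragment-id="4" /></section></course>',
    2: '<motivation><paragraph>XML is needed for many reasons...</paragraph>'
       '<image src="xml.gif" /></motivation>',
    3: '<definition><paragraph>An XML document is...</paragraph></definition>',
    4: '<example><paragraph>Consider the following XML document...</paragraph></example>',
}


def store(conn, fragments):
    """Algorithm 7: fill the tables fragment, attribute and child."""
    for fid, xml in fragments.items():
        root = ET.fromstring(xml)
        # Step 1: one row per fragment.
        conn.execute("INSERT INTO fragment VALUES (?, ?, ?)", (fid, root.tag, xml))
        # Step 2: attributes of the fragment's root element.
        for name, value in root.attrib.items():
            conn.execute("INSERT INTO attribute VALUES (?, ?, ?)", (fid, name, value))
        # Step 3: child fragments in document order.
        refs = [e for e in root.iter() if e is not root and "fragment-id" in e.attrib]
        for pos, ref in enumerate(refs, start=1):
            conn.execute("INSERT INTO child VALUES (?, ?, ?)",
                         (fid, int(ref.get("fragment-id")), pos))


def retrieve(conn, fid):
    """Algorithm 9: load xml(f) and expand fragment references recursively."""
    row = conn.execute("SELECT xml FROM fragment WHERE id = ?", (fid,)).fetchone()
    if row is None:
        raise KeyError(fid)
    root = ET.fromstring(row[0])
    expand(conn, root)
    return root


def expand(conn, elem):
    """Replace every reference element below elem by the referenced fragment."""
    for i, child in enumerate(list(elem)):
        if "fragment-id" in child.attrib:
            sub = retrieve(conn, int(child.get("fragment-id")))
            sub.tail = child.tail
            elem[i] = sub
        else:
            expand(conn, child)


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
store(conn, FRAGMENTS)
document = ET.tostring(retrieve(conn, 1), encoding="unicode")
print(document)
```

Retrieval touches only the table fragment, as stated above. The tables attribute and child are filled from the same fragment representations, so in this sketch they are redundant for reconstruction but can be queried directly, for example to list the child fragments of a fragment in order.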

Figure 5: Relative retrieval time for (a) uninformed fragmentation, plotted against levels per fragment, and (b) model-based fragmentation, plotted against chapters per document. Legend: Single Elements, ContentModules, Cou/Sec/Ex, Cou/Sec, Course, 1 large Fragment.

4 Model-based fragmentation strategies

Definition 10 (Fragmentation strategy) Let doc be an XML document. A fragmentation strategy $S \subseteq elements(doc)$ is a subset of the set of elements of doc specifying the root elements of the fragments.

Strategies specify those elements that are stored in separate fragments. They use predicates that an element has to satisfy to qualify as the root element of a fragment, e.g. match patterns for tag names and attribute values of elements (see example 12) or more sophisticated structural conditions (see example 11). Strategies can also be based on the nesting depth, i.e., on how many levels of nested elements are stored in a fragment. Figure 5 (a) shows the experimental results of retrieving an entire XML document having 14 levels of element nesting. As expected, retrieval with the rightmost strategy, where all 14 levels, i.e. the whole document, are stored in one fragment, is about 15 times as fast as retrieval with the leftmost strategy, where each fragment contains only one element, i.e., the document is completely decomposed. These strategies are uninformed and therefore produce unpredictable fragmentations, resulting in the different response times shown in figure 5 (a).

To meet the specific access and query requirements of a given application, we define fragmentation strategies that are guided by the domain model of the application.
For example, figure 6 shows a simplified part of the teachware model presented in [10], which describes learning material using specialized DTDs [5] (see the sample document in figure 1).

Figure 6: Domain model for teachware (node labels: Course, Module, StructureModule, ContentModule, motivation, definition, paragraph, illustration, example, exercise, remark)

Example 11 (Sequential Leaf Access) From the teachware domain model we know that a learner mostly accesses the leaf sections of a course document doc sequentially. These sections directly contain at least one ContentModule, while their enclosing sections do not directly contain any ContentModule. To meet this specific access requirement, we define a fragmentation strategy $S_1$ that stores such leaf sections in separate fragments:

$$S_1 = \{root(doc)\} \cup \{\, e \in elements(doc) \mid (tag(e) = \text{section} \land (\exists c \in children(e) : tag(c) \in CM)) \land \forall p \in ancestors(e) : \neg(\exists c \in children(p) : tag(c) \in CM) \,\}$$

Here ancestors(e) is the set of all ancestors of e in tree(doc), and CM = {motivation, definition, ...} is the set of all tag names of ContentModule elements according to the given model.
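The strategy of example 11 can be written as a predicate over an element tree. The following Python sketch follows the prose description: a section qualifies if it directly contains a ContentModule and none of its ancestors does. The tag name section, the contents of CM (taken from the labels of figure 6), the nested sample course, and the names has_cm_child and strategy_s1 are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Tag names of ContentModule elements; this list is an assumption.
CM = {"motivation", "definition", "paragraph", "illustration",
      "example", "exercise", "remark"}

# Hypothetical course with one flat and one nested chapter.
COURSE = ET.fromstring(
    '<course title="XML Tutorial">'
    '<section title="introduction"><motivation/></section>'
    '<section title="basics">'
    '<section title="syntax"><definition/></section>'
    '<section title="usage"><example/></section>'
    '</section>'
    '</course>'
)


def has_cm_child(e):
    """True iff e directly contains at least one ContentModule."""
    return any(c.tag in CM for c in e)


def strategy_s1(doc_root):
    """Return the fragment roots chosen by S1, in document order."""
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent = {c: p for p in doc_root.iter() for c in p}

    def ancestors(e):
        while e in parent:
            e = parent[e]
            yield e

    selected = [doc_root]  # the document root always starts a fragment
    for e in doc_root.iter("section"):
        if e is doc_root:
            continue
        if has_cm_child(e) and not any(has_cm_child(p) for p in ancestors(e)):
            selected.append(e)
    return selected


S1 = strategy_s1(COURSE)
print([e.get("title") for e in S1])
```

In the sample course, the chapter "basics" is not selected because it contains only further sections, whereas its two subsections and the flat chapter "introduction" are selected as leaf sections.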

Example 12 (Supporting Reuse of Modules) We know that authors usually reuse ContentModules in more than one course document. Thus, we define a corresponding strategy $S_2 = \{e \in elements(doc) \mid tag(e) \in CM\}$, which stores each ContentModule in a separate fragment. Note that our storage model, which is based on directed acyclic graphs (see definition 4), directly supports the reuse of XML subdocuments. Figure 5 (b) shows the experimental results of retrieving complete course documents containing one to five chapters, i.e., top-level sections. The response time for $S_2$, depicted in the second line from the top, is significantly shorter than the response time of the complete decomposition, depicted in the top line.

From examples 11 and 12 we can see that there can be more than one fragmentation strategy for a single XML document. Both strategies $S_1$ and $S_2$ can be applied whenever appropriate. Moreover, we can define a combined strategy $S_1 \cup S_2$, which produces a finer-grained fragmentation and which facilitates the sequential leaf access of example 11 as well as the reuse of ContentModules of example 12.

Our approach allows the granularity of fragments to be adjusted when appropriate. For example, statistical information, such as which subdocuments are reused most often, can be used to dynamically determine the appropriate fragmentation strategy. Documents can thus be re-fragmented and re-organized in storage when needed without changing the document source itself. This supports the cooperative authoring of an XML document base.

Example 13 (Dynamically Changing Strategies) To support reuse at the storage layer and to improve $S_2$ of example 12 accordingly, we define the strategy $S_3 = \{e \in ReusedElements\}$, where ReusedElements is the dynamically changing set of reused elements.

5 Conclusion

In this paper we have presented a generic, model-based approach to the relational storage of XML documents.
Arbitrary XML documents are automatically stored in a database, where they are clustered in fragments of different sizes that can be tailored to specific access and query requirements. We have specified different strategies that make use of information provided by an underlying model. Experimental results show that the corresponding queries can be answered more efficiently than with a complete decomposition. The storage model is based on directed acyclic graphs. In contrast to a tree model, it directly supports multiple hierarchies, which facilitate the reuse of XML subdocuments and allow the definition of different views on the same XML document. Future work will focus on the concept of views, which could only be touched upon in this paper. Furthermore, we will study the application of our approach to the modularization of XML documents.

References

[1] S. Banerjee et al. Oracle8i - The XML Enabled Data Management System. In Proc. ICDE 2000, San Diego, USA.
[2] J. M. Cheng and J. Xu. XML and DB2. In Proc. ICDE 2000, San Diego, USA.
[3] A. Deutsch et al. Storing Semistructured Data with STORED. In Proc. ACM SIGMOD, Philadelphia, PA, 1999.
[4] D. Florescu and D. Kossmann. A Performance Evaluation of Alternative Mapping Schemes for Storing XML Data in a Relational Database. Technical Report 3680, INRIA, France.
[5] C. Süß. Learning Material Markup Language.
[6] A. Schmidt et al. Efficient Relational Storage and Retrieval of XML Documents. In Proc. WebDB 2000, Dallas, USA.
[7] J. Shanmugasundaram et al. Relational Databases for Querying XML Documents: Limitations and Opportunities. In Proc. 25th VLDB Conference, Edinburgh, Scotland.
[8] T. Shimura et al. Storage and Retrieval of XML Documents Using Object-Relational Databases. In Proc. DEXA 99, Florence, Italy.
[9] B. Surjanto et al. XML Content Management based on Object-Relational Database Technology. In Proc. WISE 2000, Hong Kong.
[10] C. Süß et al. Metamodeling for Web-Based Teachware Management. In Advances in Conceptual Modeling. ER 99 Workshop on the World-Wide Web and Conceptual Modeling, Paris, France, 1999.
