Data Structures for Maintaining Path Statistics in Distributed XML Stores


© Yury Soldak
Department of Computer Science, Saint-Petersburg State University
University Prospekt 28, Saint-Petersburg, Russian Federation
ysoldak@acm.org

Abstract

This paper describes a distributed XML store model based on the notion of a distributed XML document. A classification of XPath expressions is defined and the notion of a distributed XML document is introduced. A DataGuide-based statistical structure for XML stores is proposed, and two possible approaches to keeping it up to date are discussed. The stability of the feedback-based approach is shown. A generalization of the structure to the distributed case is described.

1 Introduction & Related Work

Developed for data exchange on the Web, XML has become widely accepted. It is very likely that most data on the Web will be reachable in the form of XML documents in the near future, and many sites already store their data in XML. The Web itself can be characterized as a fairly unpredictable network of heterogeneous data sources [9]. Developing the various aspects of XML query evaluation on the Web is therefore a topical problem, particularly for a set of remote servers that form a distributed XML store. Many open issues remain in the area of efficient distributed XML query evaluation (sec. 2.2).

Two papers on related problems form the background of the current work. The first [1] proposes two techniques for estimating the selectivity of simple path expressions over large-scale XML data: path trees and Markov tables. Both techniques summarize complex, large-scale data in a small amount of memory and then use this summary for selectivity estimation. The idea of using a path tree to store statistical information is taken from that paper and used heavily in the current work.
The second paper [7] introduces XPathLearner, a technique for estimating the selectivity of simple path expressions based on feedback analysis. XPathLearner stores statistics in a Markov table. As discussed below, this is not the best solution for a distributed XML store. The primary goal of the current paper is to define a structure that is (a) suited to the distributed case and (b) a convenient basis for developing an XPathLearner-like solution.

Both papers study the harvesting, updating and storing of selectivity statistics in a global scope. In other words, they offer no way to estimate the selectivity of a path expression for a particular site of the store. The techniques therefore lack one of the features most needed for efficient distributed query evaluation (sec. 2.2).

The rest of the paper is organized as follows. Section 2 describes the distributed data model used in the paper and discusses the problems related to distributed query evaluation. Section 3 discusses the structure of the query optimizer and the place and importance of the statistics module. Section 4 presents the classification of XPath expressions we use. The XML Tree Sibling Summarization structure and related issues are considered in Section 5. Finally, Section 6 contains conclusions.

Proceedings of the Spring Young Researcher's Colloquium on Database and Information Systems, Moscow, Russia, 2006

2 Distributed XML Store

2.1 Model Definition

A distributed XML document (DXML document) is a document that contains at least one XInclude [10] or XLink [11] element in its body.

Definition. We call a DXML document locally distributed if all included (or linked) XML fragments and the including document itself reside on the same server.

DXML documents as defined above are the building blocks of our distributed XML store. An example of an employee list for a multi-office company is shown in Figure 1.
Every office has its own employee list, which is managed independently of the other offices and is located at a separate site. Every such list changes constantly and unpredictably with hires, dismissals and small changes in the personal data of any employee. Any HR manager can add elements to (or remove them from) their own part of the employee list, even when a common person description structure has been developed. As a result we have a truly distributed semistructured XML store.

    <company xmlns:xi="..." xmlns:xl="...">
      <name>The very big company</name>
      <staff>
        <office id="main">
          <person position="ceo" office="main">
            <name>John Smith</name>
          </person>
          <xi:include xi:href="/db/rnd.xml" xi:xpointer="element(/persons/person)" />
          <xi:include xi:href="/db/qa.xml" xi:xpointer="element(/persons/person)" />
          <xl:link xl:type="simple" xl:href="/db/managers.xml#xpointer(//person)" />
        </office>
        <office id="o1">
          <xi:include xi:href="..." xi:xpointer="element(/staff/person)" />
        </office>
        <office id="o2">
          <xi:include xi:href="..." xi:xpointer="element(/staff/persons/person)" />
        </office>
      </staff>
    </company>

Figure 1: Distributed XML document

This store is the simplest example, based on a single distributed document. Of course, a distributed store can contain any number of documents (distributed or local). Furthermore, the roots of the distributed documents do not have to belong to a single server.

Given a DXML document, we can define several separate parts, one for each site. We assume that these sites are maintained independently, so they may well belong to different companies. Sites are black boxes to each other. The only requirement is an interface for querying the XML data on each site. There are no restrictions on the type of the interface: sites can understand queries in any known XML query language. We use an XQuery-over-HTTP approach for prototyping.

2.2 Query Evaluation

It would be really useful to query DXML with the conventional XQuery language. The result we need to obtain is equal to the result obtained when all parts of the DXML document are downloaded from the remote servers and merged into a temporary local XML document on which an existing XQuery evaluator runs our query. This is the naive implementation, and it is expected to be very slow. Queries on DXML must be evaluated more efficiently. In other words, the query optimizer for local documents should be extended to generate optimal query plans for DXML, and the query evaluator should be able to evaluate these new query plans.

Figure 2 presents a simple example of a query on a distributed XML store. The query obtains information about persons from office 1 who possibly have relatives (i.e. persons with the same family name) working at office 2. Here company.xml is the distributed XML document, and information about the two offices is included in it with the help of two XInclude elements (see Figure 1).

    for $p1 in doc("company.xml")//office[@id="o1"]//person,
        $p2 in distinct-values(doc("company.xml")//office[@id="o2"]//person/familyname)
    where $p1/familyname = $p2
    return $p1

Figure 2: Query on distributed XML store

It is necessary to join the two sequences $p1 and $p2 during evaluation of the query listed in Figure 2. These sequences might be obtained in several different ways. For example, the query evaluator naively obtains all the person elements for each office (by sending simple queries to the corresponding servers) and then locally joins two (possibly) big sequences. Obviously, this approach is not optimal. Approaches similar to semi-joins in distributed RDBMSs are more attractive. To use them, we have to know the selectivity of the path expressions (for the person and familyname elements in our case). Moreover, the number of distinct values in the resulting node sequences (the so-called distinct selectivity) is of interest too. Finally, it is important to know the selectivity with regard to a particular server, not just an abstract selectivity in the global scope.

This example shows the crucial role of XPath selectivity estimation in the evaluation of queries on DXML documents. The structure of an XML query optimizer and the place of an XPath selectivity estimator in it are discussed in the next section.

3 XML Queries Optimization

3.1 Optimizer structure & general issues

Figure 3: Query optimizer

The classic query optimizer structure is shown in Figure 3. It consists of two main blocks: the logical and the physical optimization modules.
The logical module rewrites a query using the rules of a chosen XML algebra; the physical module generates various physical execution plans and selects the best of them using an execution cost estimator. The cost estimator in turn requires various statistics to estimate a cost.

Both relational and XML query optimizers are expected to implement this simple architecture; the differences lie in the implementation. XML optimizers are harder to implement. First of all, semistructured optimizers work with more complex data structures (trees or graphs) than their relational counterparts. Furthermore, the XML database area is not as well developed as the relational one. As a result, XML optimizer developers are forced to make many ad hoc decisions that are not grounded theoretically and are not proven to be the best, as is the case for RDBMSs. For example, XML database developers do not have even a single widely accepted XML algebra to use. Implementing a physical optimizer for XML databases is a real challenge today. We try to take one step in that direction by developing a statistics module that can be used by a cost estimator.

3.2 Cost Estimator

Another challenge is related to the distributed nature of the source data. The cost estimator should be aware of this specific feature of the store. Different cost models exist for distributed environments [6]. A very simple model estimates the cost of evaluating a path expression as $k \cdot n$, where n is the estimated selectivity of the expression and k is a host-specific parameter. The parameter is small for fast database components and large for slow ones, or for components that are reachable only over a slow communication link. The k parameter can also depend on the size of the elements reachable by the path expression, because both serialization and transmission of large elements are very costly operations. The structures developed to supply distributed cost models with the necessary statistical information are described in Section 5.

4 XPath Expressions Classification

The XQuery language uses XPath expressions to define the sequences of XML elements on which operations are performed. XPath expressions are used in XQuery queries in many different ways, so the notation of these expressions varies significantly. Several expression types are defined here. These definitions will be used in the following sections.
The list below contains the four characteristics that must be checked in order to classify an XPath expression:

- First sequence construction method (the very first step)
- Presence of predicates (which are not identically true) in step definitions
- Direction of step axes
- Presence of branchpoints

An element sequence is the input and the output of any XPath step. The very first sequence in an XPath expression may be defined either by a function call (document(), collection()) or by a variable reference. In the first case we call the expression functional, in the second case variational.

Every XPath step contains a predicate expression (omitted from the step notation when identically true). An XPath expression is predicative if at least one predicate is present in its notation, and simple otherwise.

There are 13 axes defined in the XPath specification. In [4] four major directions (ancestor, descendant, following and preceding) are defined. As a result, some XPath axes are co-directed in terms of the major directions. For example, the axes parent and ancestor are co-directed, but parent and following-sibling are not. An XPath expression is directed if and only if all its steps are co-directed, and multidirected otherwise. In special cases the number of major directions can be stated explicitly for multidirected expressions.

Finally, an XPath expression is branched if and only if at least one of its steps has a branchpoint (a name test of the kind (a | b | ... | c)).

Examples:

    doc("foo.xml")/a/b/c                        functional simple directed nonbranched
    doc("foo.xml")/a//b/c[@e = "1"]             functional predicative directed nonbranched
    doc("foo.xml")//a[@b = "3"]/following::c    functional predicative multidirected nonbranched
    $v/a/(b[@c = $w] | d)                       variational predicative directed branched

5 XML Tree Sibling Summarization

5.1 Definition

The concept of DataGuides is widely known to semistructured data researchers. It was originally introduced in [3].
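As an aside on the classification of Section 4, the four checks can be approximated with a few string heuristics. The sketch below is only an illustration, not part of the proposed structures: the function name `classify`, the axis-to-direction table and the decision to ignore axes that occur inside predicates are all assumptions of this example, and it does not parse XPath properly.

```python
import re

# Axis -> major direction. The four major directions come from the text;
# the assignment of each individual axis is an assumption of this sketch.
MAJOR_DIRECTION = {
    "child": "descendant", "descendant": "descendant",
    "descendant-or-self": "descendant",
    "parent": "ancestor", "ancestor": "ancestor", "ancestor-or-self": "ancestor",
    "following": "following", "following-sibling": "following",
    "preceding": "preceding", "preceding-sibling": "preceding",
}


def strip_predicates(path):
    """Drop everything inside [...] so axes used in predicates are ignored."""
    out, depth = [], 0
    for ch in path:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        elif depth == 0:
            out.append(ch)
    return "".join(out)


def classify(expr):
    """Return the four-word class of an XPath expression (heuristic)."""
    expr = expr.strip()
    call = re.match(r"(?:doc|document|collection)\s*\([^)]*\)", expr)
    var = re.match(r"\$[\w.-]+", expr)
    if call:
        source, path = "functional", expr[call.end():]
    elif var:
        source, path = "variational", expr[var.end():]
    else:
        raise ValueError("expression must start with a function call or a variable")

    predicates = "predicative" if "[" in path else "simple"

    directions = set()
    # The text before the first '/' is empty, so skip the first piece.
    for step in strip_predicates(path).split("/")[1:]:
        step = step.strip()
        if step == "":                      # produced by the '//' abbreviation
            directions.add("descendant")
        elif "::" in step:                  # explicit axis
            axis = step.split("::", 1)[0]
            if axis in MAJOR_DIRECTION:
                directions.add(MAJOR_DIRECTION[axis])
        elif step == "..":                  # abbreviated parent step
            directions.add("ancestor")
        elif step != "." and not step.startswith("@"):
            directions.add("descendant")    # abbreviated child step
    direction = "directed" if len(directions) <= 1 else "multidirected"

    # A branchpoint is a parenthesised alternative such as (b | d).
    branching = "branched" if re.search(r"\([^()]*\|[^()]*\)", path) else "nonbranched"
    return f"{source} {predicates} {direction} {branching}"
```

Under these assumptions, `classify("$v/a/(b[@c = $w] | d)")` yields the same label as the fourth example of Section 4.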
Since then, DataGuides have been used as a basis for indexes (for example [2]) and for structures that represent statistical information [1].

All statistical structures defined in this paper are based on the DataGuide notion. This gives us several benefits. The small amount of memory required to store the statistics is one major benefit, but not the only one: DataGuide-like structures can also be easily extended to support the distributed case, as shown in Section 5.6.

The original DataGuide was developed for an area where only parent-child relations are used. As a result, it is impossible to extend the structure to support all kinds of XPath expressions. Only simple directed XPath expressions whose major direction is descendant (the child, descendant and descendant-or-self axes) are supported in all the structures considered below. Furthermore, only functional XPath expressions are studied for now, not variational ones. Each branched expression is split into several nonbranched ones, which are studied separately.

The XML Tree Sibling Summarization (XTSS) structure is developed for maintaining XPath selectivity statistics for any ordinary XML document. It is a DataGuide tree in which every node keeps the number of sibling XML elements with the same name that were joined to construct the node, as well as the name of these elements. Every arc defines a parent-child relationship between source elements. Figure 4 shows an example of an XTSS on the right and its source XML document on the left.

Figure 4: XML Tree Sibling Summarization

5.2 Construction in Offline

The XQuery query in Figure 5 recursively constructs the XTSS for an XML document (in our particular case the XML document contains the text of Shakespeare's Macbeth).

    declare function local:xtss($seq as node()*, $deep as xs:integer) as element()*
    {
      let $newdeep := $deep + 1
      let $names := for $s in $seq return name($s)
      let $dnames := distinct-values($names)
      for $name in $dnames
      let $nodes := $seq[name() = $name]
      let $nextnodes := $nodes/*
      return
        element {$name} {
          attribute {"c"} {count($nodes)},
          attribute {"d"} {$deep},
          local:xtss($nextnodes, $newdeep)
        }
    };

    local:xtss(doc("db/plays/macbeth.xml")/*, 0)

Figure 5: XTSS generation XQuery

This query was evaluated on the Ipedo [5] and eXist [8] native XML DBMSs in order to obtain approximations of the time required to construct the whole XTSS for a middle-sized XML document. Of course, this routine should work much faster when coded as part of the query engine; here we try to establish an upper bound on the XTSS construction time. Node constructors were removed from the query to minimize query execution time. The results of this quick experiment are shown in Table 1.

    XML DBMS    Time (secs)
    Ipedo       [missing]
    eXist       [missing]

Table 1: XTSS offline construction time

Obviously, constructing the whole XTSS is a costly operation. Constructing an XTSS for each document in a database with thousands of documents will run too long. Moreover, the XTSS has to be maintained, which forces us to restart the described process from time to time. This approach can be very resource-consuming and so is not suitable for a statistical structure. The solution is to construct a partial XTSS following the online (or feedback) approach.

5.3 Feedback Approach

The feedback approach lets us construct the XTSS branch by branch, exploiting the results of user queries. An XTSS constructor fetches the XPath expressions from the user query and their respective selectivities from the query result. It then builds an XTSS branch for each XPath expression and adds it to the (partial) XTSS. If the branch already exists in the XTSS, the selectivity value of the respective node is updated. Theoretically the whole XTSS can be built this way. Practically, however, we will always have only part of the whole XTSS, depending on the queries evaluated. Moreover, selectivity values are expected to be found only in the leaves and rarely in the intermediate nodes.

Figure 6: Partial XTSS. (a) /a/b/c & /a/d; (b) /a//e added

The partial XTSS obtained after evaluating the /a/b/c and /a/d path expressions is shown in Figure 6(a).

Any information about processed XPath expressions is valuable when the feedback approach is used. Unfortunately, complete information is not always available. The most frequent cause of incompleteness is the evaluation of steps with the descendant axis. To fill the gap, the partial XTSS notion was enriched with ancestor-descendant (also called generalized) arcs, marked by * in the figures. Figure 6(b) presents an example of a partial XTSS with one generalized arc added.

Generalized arcs can introduce data duplication into our structure, which is bad for both structure size and statistics accuracy. Assume two path expressions were evaluated: first /a//e and then /a/b/e. Depending on the structure of the source data, the first expression may (or may not) define a sequence of more XML elements. Leaving both branches in the XTSS creates the data duplication problem. On the other hand, it is possible to leave only one of these branches and possibly hurt accuracy. We decided to leave the most concrete branch (/a/b/e in our case) each time such a situation arises. Following that rule, we avoid data duplication but can lose accuracy when the generalized expression defines a larger sequence than the concrete one. This is the price we pay for a graceful and predictable statistical structure. The decision is based on our experience in real-world application development, where the problem mentioned above rarely appears. In many cases generalized expressions are used in place of more efficient concrete ones only to reduce the size of the query's textual representation.
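The branch-by-branch construction and the "keep the most concrete branch" rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the author's implementation: the names `XtssNode`, `parse_steps`, `has_concrete_descendant` and `feedback` are hypothetical, only simple paths built from `/` and `//` steps are handled, and the selectivity numbers in the usage note are invented.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class XtssNode:
    name: str
    count: Optional[int] = None  # selectivity; None until a query reports it
    # Children keyed by (element name, generalized?); generalized=True is a '*' arc.
    children: Dict[Tuple[str, bool], "XtssNode"] = field(default_factory=dict)


def parse_steps(path):
    """Split a simple path such as '/a//e' into [(name, generalized), ...]."""
    return [(name, sep == "//") for sep, name in re.findall(r"(//?)([\w.-]+)", path)]


def has_concrete_descendant(node, name):
    """True if a node called `name` is reachable via parent-child arcs only."""
    for (child_name, generalized), child in node.children.items():
        if not generalized and (
            child_name == name or has_concrete_descendant(child, name)
        ):
            return True
    return False


def feedback(root, path, selectivity):
    """Record the observed selectivity of one evaluated path expression."""
    steps = parse_steps(path)
    if not steps:
        raise ValueError("empty path")
    leaf_name, leaf_generalized = steps[-1]
    node, visited = root, []
    for index, (name, generalized) in enumerate(steps):
        is_leaf = index == len(steps) - 1
        if is_leaf and generalized and has_concrete_descendant(node, name):
            return  # keep the more concrete branch, ignore the generalized one
        visited.append(node)
        node = node.children.setdefault((name, generalized), XtssNode(name))
    node.count = selectivity
    if not leaf_generalized:
        # A concrete branch now exists: drop generalized arcs that duplicate it.
        for ancestor in visited:
            ancestor.children.pop((leaf_name, True), None)
```

For example, feeding `/a/b/c`, `/a/d` and `/a//e` into an empty root reproduces the shape of Figure 6(b); a later `feedback(root, "/a/b/e", 5)` replaces the generalized arc under `a` with the concrete branch, and a subsequent `/a//e` observation is ignored. Intermediate nodes keep `count = None`, matching the remark that selectivities are expected mostly in the leaves.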
When the selectivity of an evaluated expression equals 0, the branch is not added to the XTSS, or is removed if it already exists there. In some cases we do not remove the whole branch but cut it at the first branching node reached from the branch leaf.

5.4 Ambiguity During Updates

Handling generalized path expressions raises a maintenance problem: ambiguity in distributing the new selectivity of a generalized path expression among all satisfying XTSS nodes.

Given a generalized path expression, let us assume that all the satisfying paths in the source data have corresponding nodes in the XTSS. Otherwise a correct selectivity distribution is impossible. Following the feedback approach, we always have to assume something.

The new selectivity is obtained after the expression is evaluated. The question is how to distribute this selectivity among all satisfying nodes. Clearly the common selectivity may decrease or increase. Let us assume the latter is true and the difference is $d_n$. Let $S_n$ and $S_{n-1}$ be the current and previous common selectivities respectively, where

$S_n - S_{n-1} = d_n, \qquad S_n = \sum_{i=1}^{m} s_i^n$

where m is the number of nodes and $s_i^n$ is the selectivity of the i-th node.

At least three distribution approaches exist: equal, proportional and history-based. The equal method distributes the difference by the simplest formula:

$s_i^n = s_i^{n-1} + d_n / m$

With the proportional method the difference is distributed in the following way:

$s_i^n = s_i^{n-1} \cdot S_n / S_{n-1}$

To use the third approach, additional information has to be stored in each XTSS node. This information is the value of the increment $\hat{d}_i$ made during the most recent node update after evaluation of a concrete path expression. The selectivity distribution formula for this case is:

$s_i^n = s_i^{n-1} + d_n \cdot \hat{d}_i / \hat{d}, \qquad \hat{d} = \sum_{i=1}^{m} \hat{d}_i$

Clearly, the described approaches are just the simplest ones, not all the possible ones. More approaches can be developed. For example, it is possible to maintain the frequency of selectivity alteration for each node and then use this information to distribute selectivity, either as a separate (fourth) method or to improve the third approach.

The structure of the store and the behavior of the stored data determine the choice of distribution approach. Unfortunately, no approach can guarantee an accurate distribution of the difference.
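The three distribution formulas above translate directly into code. The sketch below is illustrative only: the function names are hypothetical, and falling back to the equal method when a denominator is zero is an assumption of this example, since the text does not say what to do in that case.

```python
def distribute_equal(prev, new_total):
    """s_i^n = s_i^(n-1) + d_n / m"""
    d_n = new_total - sum(prev)
    return [s + d_n / len(prev) for s in prev]


def distribute_proportional(prev, new_total):
    """s_i^n = s_i^(n-1) * S_n / S_(n-1)"""
    old_total = sum(prev)
    if old_total == 0:
        # Assumption: nothing to scale, so fall back to the equal method.
        return distribute_equal(prev, new_total)
    return [s * new_total / old_total for s in prev]


def distribute_history(prev, new_total, increments):
    """s_i^n = s_i^(n-1) + d_n * dhat_i / dhat, with dhat = sum of dhat_i"""
    d_n = new_total - sum(prev)
    d_hat = sum(increments)
    if d_hat == 0:
        # Assumption: no usable history, so fall back to the equal method.
        return distribute_equal(prev, new_total)
    return [s + d_n * inc / d_hat for s, inc in zip(prev, increments)]
```

With invented stored selectivities `[10, 30, 60]` and a newly observed common selectivity of 120, the proportional method gives `[12.0, 36.0, 72.0]`, while the history-based method with recent increments `[1, 1, 2]` gives `[15.0, 35.0, 70.0]`. In every case the distributed values sum to the new common selectivity.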
However, we state that, regardless of the method used, the XTSS will contain accurate (or very close to accurate) selectivity values if the queries that affect the nodes of interest are evaluated several times. In other words, the XTSS is a stable structure. The next section proves the statement.

5.5 XTSS Stability

Suppose we have an XTSS part of m nodes, where each node defines the concrete path expression corresponding to the branch ending in that node. These m nodes, and only they, satisfy a generalized path expression q. The selectivity values stored in those nodes are accurate:

$d_i = \dot{s}_i - s_i = 0, \qquad i = 1, \dots, m$

where $\dot{s}_i$ is the actual selectivity and $s_i$ is the stored selectivity. Let us assume the source data has changed so that the selectivity values of k XTSS nodes with indexes $i \in I$, $|I| = k$, are no longer accurate and should be updated:

$D = \sum_{i=1}^{m} d_i$, where $d_i = 0$ for $i \notin I$ and $d_i \neq 0$ for $i \in I$.

After evaluating the expression q we know the new common selectivity of the m nodes and, having the previous common selectivity, we know D. We should distribute the difference D among the m nodes. We do not know which of the XTSS nodes should actually be updated, and we do not even know the number k of such nodes. So we distribute D among all m candidates following one of the approaches described in the previous section. After that, the common selectivity of the selected XTSS nodes equals the actual selectivity of the expression q, while the selectivity values of particular nodes can still be wrong (that is, not accurate). Stability means that the structure tends to contain accurate values.

Definition. An XTSS is accurate if and only if each of its nodes has an accurate selectivity value.

Since an XTSS can be divided into several parts, an XTSS is also accurate if and only if each of its parts is accurate. We use the following formula to measure the accuracy of an XTSS part:

$A = \sum_{i=1}^{m} |d_i|$

where m is the number of nodes in the measured part and $d_i$ is the difference between the accurate selectivity value and the one stored in the XTSS node. The part is accurate if and only if A = 0.
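The accuracy measure is easy to compute once the actual selectivities are known. The short sketch below assumes the absolute-value form of A given above; `accuracy` and `refresh` are hypothetical helper names and the numbers in the usage note are invented for illustration.

```python
def accuracy(actual, stored):
    """A = sum of |actual_i - stored_i| over the nodes of one XTSS part."""
    return sum(abs(a - s) for a, s in zip(actual, stored))


def refresh(stored, index, observed):
    """Evaluating a concrete path expression overwrites one node's stored value."""
    updated = list(stored)
    updated[index] = observed
    return updated
```

For instance, with actual selectivities `[14, 10, 10]` and stored values `[12, 12, 12]`, `accuracy` returns 6. Refreshing the first node with its observed value 14 lowers A to 4, and refreshing every node in turn brings A to 0, at which point the part is accurate.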
Proposition 5.1. Let q be a generalized path expression that defines an XTSS part of m nodes as described above. Suppose that

- all elements reachable by q have corresponding nodes in the XTSS,
- select queries are more frequent than update ones,
- user queries contain both generalized and concrete path expressions.

Then $A \to 0$.

Proof. Indeed, A decreases each time a concrete path expression is evaluated, because the value of the particular node becomes accurate (and therefore the corresponding d = 0). A remains the same when a generalized expression is evaluated, because the distribution approaches do not change A.

In some special cases A can stay unchanged for a long time even if concrete and generalized expressions are evaluated constantly. This depends on the set of concrete path expressions that are evaluated. These expressions are used frequently, and the selectivity values stored for them in the XTSS are accurate, so the XTSS is accurate for these expressions. If an XTSS node is never accessed to obtain its selectivity (and so is never refined after an expression is evaluated), it may never have an accurate selectivity. As a result, the XTSS is accurate, but only for the frequent expressions. This is natural for the feedback approach and is enough to say that the XTSS is stable.

The speed at which A decreases depends on the properties of the store, and the only advice to be given is to experiment with the distribution approaches. It is a good idea to turn off the feedback evaluation of generalized path expressions when the source data changes so frequently that the XTSS has no time to stabilize.

5.6 Generalization of XTSS for Distributed XML

Statistical information is required to evaluate queries on distributed XML documents efficiently, just as it is for local documents. Having defined the XTSS for local documents, we extend the definition to the distributed case by introducing the notion of a Distributed XTSS (DXTSS). The DXTSS plays the same role for distributed documents as the XTSS does for local ones.

Both parent-child and ancestor-descendant arc types share a property: they define relations between nodes of the same local document. Arcs for cross-document references appear in the distributed case. They can be not only cross-document but also cross-server, when fragments of a distributed document reside on different servers. We call such arcs associative and mark them with a symbol. Thus a DXTSS is a set of XTSSs connected to each other by associative arcs. One of the XTSSs is considered the main one and contains the root of the structure. See Figure 7 for an example of a DXTSS.
Figure 7: Distributed XTSS

It is worth mentioning that only one associative arc is allowed in a DXTSS branch. This restriction follows from the fact that we consider remote servers to be independent and atomic, so we cannot demand any private information (for example, the store schema) from them. It is possible that distributed documents form a chain (or even a cycle) by including parts of each other, but we will never know that for certain when evaluating a query on a distributed document. It is acceptable to measure and store the connection and/or transmission speed (the k parameter of the simple cost function considered in Section 3.2) for each associative arc in our statistical structure in order to use it for cost estimation. Large values do not necessarily mean that a chain exists; they can be the result of outdated hardware or an overloaded remote server. The real situation is not so important for successful cost estimation. The only crucial information is how long the remote operation lasts. Associative arcs are suited to storing these statistics.

6 Conclusions

The paper describes a distributed XML store model based on the notion of a distributed XML document. It is shown how the conventional XQuery language can be used to query stores of that type. Query evaluation issues are discussed and the value of path selectivity statistics is shown.

A classification of XPath expressions based on four characteristics is introduced. It can be used by researchers and developers to refer easily to classes of XPath expressions, as we do in the paper.

DataGuide-like XML Tree Sibling Summarization structures are defined in the paper. They are suited to holding statistical information about XPath expression selectivities and are used by the cost estimator module of our XML query optimizer. A generalization of the structure to the distributed case is described, using associative arcs to put local XTSSs together. The feedback approach to maintaining the partial XTSS structure is described and its stability is shown.

7 Acknowledgements

I would like to thank my scientific adviser Boris Novikov for his support and valuable comments. Many issues and application patterns of XTSS were discussed with Anton Gubanov and Maxim Lukichev; thank you, colleagues, for that.

References

[1] Ashraf Aboulnaga, Alaa R. Alameldeen, and Jeffrey F. Naughton. Estimating the selectivity of XML path expressions for internet scale applications. In The VLDB Journal.
[2] A. Fomichev. XML Storing and Processing Techniques. In SYRCoDIS. NIIMM.
[3] Roy Goldman and Jennifer Widom. DataGuides: Enabling query formulation and optimization in semistructured databases. In VLDB'97, Proceedings of 23rd International Conference on Very Large Data Bases. Morgan Kaufmann.
[4] T. Grust. Accelerating XPath location steps. In Proceedings of ACM Conference on Management of Data (SIGMOD).
[5] Ipedo XML database website.
[6] Donald Kossmann. The State of the Art in Distributed Query Processing. In ACM Computing Surveys, volume 32, 2000.
[7] L. Lim, M. Wang, S. Padmanabhan, J. S. Vitter, and R. Parr. XPathLearner: An On-line Self-Tuning Markov Histogram for XML Path Selectivity Estimation. In VLDB.
[8] Wolfgang Meier. eXist: An Open Source Native XML Database. In Web, Web-Services, and Database Systems.
[9] Marko Smiljanic, Henk M. Blanken, Maurice van Keulen, and Willem Jonker. Distributed XML Database Systems. Technical Report TR-CTIT-02-46, CTIT, University of Twente, The Netherlands, October.
[10] XML Inclusions (XInclude) Version 1.0, 20 December. W3C Recommendation.
[11] XML Linking Language (XLink) Version 1.0, 27 June. W3C Recommendation.
