Structural XML Querying


VŠB Technical University of Ostrava
Faculty of Electrical Engineering and Computer Science
Department of Computer Science

Structural XML Querying

2018
Radim Bača

Abstract

A well-formed XML document or a set of documents can be viewed as an XML database, and the associated DTD or XML Schema is its database schema. XQuery and XPath are usually the query languages of an XML database. Compared to relational databases, the main differences are the hierarchical data model and the implicit order of the XML data. Therefore, the major novel issues related to query processing in XML databases are (1) handling the query logic related to the hierarchical structure of an XML document, and (2) dealing with the implicit order. We call such problems structural XML querying. In this work, we provide a comprehensive survey of the state-of-the-art approaches and related aspects of efficient structural XML querying. In particular, we start with a description of labeling schemes that capture the structure of the data and of the respective storage strategies. Then we deal with the key part of every XML query processing algorithm, a twig query join, as well as with optimizations of XML query processing. Moreover, we describe two twig query joins that can be used in structural XML querying.

Key Words: Structural XML querying, XQuery, Cost-based optimizations

Contents

List of Figures
List of Tables
1 Introduction
  1.1 Motivation
  1.2 XML Model and XML Query Languages
  1.3 Labeling Schemes
2 XML Storage Techniques
  2.1 Partitioning of XML Document
  2.2 Schema Tree
  2.3 Node Indices
  2.4 Balancing XML Storage
3 Twig Query Join Algorithms
  3.1 GTPStack
  3.2 GTPStack Experimental Results
  3.3 Top-down Filtering Optimality
4 XML Query Processing Optimizations
  4.1 Cost-based Optimizations
  4.2 Selectivity Estimation Techniques
  4.3 CostTwigJoin Algorithm
  4.4 CostTwigJoin Experimental Results
5 Conclusions
References


List of Figures

An XML document and its XML tree model
XPath queries and their corresponding TPQs
XQuery queries and their corresponding GTP representations
(a) Containment labeling scheme (b) Dewey order labeling scheme
Properties of various XML document partitionings
(a) Document index (b) Partition index (c) Partition index with indexed lists
Categories of algorithms utilized for twig query joins
Performance of the output enumeration
Number of queries violating the properties for different k values (a) for all ZIPF queries (b) for queries corresponding to the first and to the fourth ZIPF template
Variants of the TB12 query with different number of output nodes
(a) Processing time of GTPStack+M and (b) number of nodes stored on stacks by GTPStack for each variant of TB
Example of three TPQs and their corresponding checking query nodes (underlined)
(a) Sample XML tree (b) QC
An empirical evaluation of the selection of α and β
Results of our cost-based optimization framework
Overhead of the cost-based optimization
Average speed-up of greedy forward compared to Top-k


List of Tables

Major features of twig query join categories
Numbers of nodes inserted into an intermediate storage and numbers of nodes relevant to the GTP result
Number of the filtering function calls corresponding to the inner query nodes
ZIPF query templates for XPath queries
GTPStack+M compared to all tested approaches for all queries
Ratio of the number of nodes stored in various intermediate storages and the number of relevant nodes for each collection
Median values of nodes stored in an intermediate storage for each collection
Categories of twig query joins
A comparison of cost-based approaches
Characteristics of data collections


1 Introduction

A well-formed XML document or a set of documents can be viewed as an XML database, and the associated DTD or XML Schema is its database schema. XQuery and XPath are usually the query languages of an XML database. Compared to relational databases, the main differences are the hierarchical data model and the implicit order of the XML data. Therefore, the major novel issues related to query processing in XML databases are (1) handling the query logic related to the hierarchical structure of an XML document, and (2) dealing with the implicit order. We call such problems structural XML querying. In this work, we provide a comprehensive survey of the state-of-the-art approaches and related aspects of efficient structural XML querying. In particular, we start with a description of labeling schemes that capture the structure of the data and of the respective storage strategies. Then we deal with the key part of every XML query processing algorithm, a twig query join, as well as with optimizations of XML query processing. Moreover, we describe two twig query joins that can be used in structural XML querying.

The main contributions of this work are as follows:

1. We present a detailed description of up-to-date storages and indices for XML data, as well as a classification of the physical access methods with regard to the node labeling, document partitioning, and twig query joins used.
2. We provide a thorough description of the state-of-the-art twig query joins and compare them in terms of their compatibility, features, and supported query models.
3. We describe XML query algebras and outline their compatibility with twig query joins.
4. We discuss the main aspects of cost-based optimization techniques and selectivity estimation approaches for XML queries. We compare these techniques in terms of supported query models, twig query join algorithms, and several other practical features.
5. We describe in detail two of our novel twig query join algorithms, GTPStack and CostTwigJoin. We compare these algorithms with other state-of-the-art algorithms and describe our contributions thoroughly.

The content of this work is based mainly on the following publications:

R. Bača, M. Krátký, T. W. Ling, and J. Lu. Optimal and Efficient Generalized Twig Pattern Processing: a Combination of Preorder and Postorder Filterings. The VLDB Journal, 22:1-25, Springer. [8]
R. Bača, P. Lukáš, and M. Krátký. Cost-based holistic twig joins. Information Systems, 52:21-33, Elsevier. [9]

R. Bača, M. Krátký, I. Holubová, M. Nečaský, T. Skopal, M. Svoboda, and S. Sakr. Structural XML Query Processing. Accepted in ACM Computing Surveys. [12]

1.1 Motivation

The adoption of the eXtensible Markup Language (XML) [19], proposed by the W3C as a standard for information exchange, has gained so much momentum that XML is now undoubtedly one of the main standards for the representation and exchange of data. Before this happened, we witnessed a massive boom of techniques that enable efficient storing and querying of XML data. Now we can observe that the boom in proposals of new techniques for efficient structural XML querying is over and that the research world has shifted its attention towards other kinds of data models and data formats (e.g., JSON [18], NoSQL [88], RDF [14], or linked data [16]). However, according to Gartner [36], there is an increasing trend towards a new generation of multi-model database systems (e.g., OrientDB, MarkLogic, or HPE Vertica) which are designed to store data in a combination of related models and to query across them. Therefore, we believe that the work done on XML database management systems (XDBMSs) is still relevant and can be used in a new generation of database systems.

1.2 XML Model and XML Query Languages

This section contains a brief introduction to the structural XML querying problem.

1.2.1 XML and XML Model

For the purpose of machine processing of XML data, we do not view an XML document as a textual document; instead, we work with its model. Every XML document has to be well-formed, which basically means that its tags form a hierarchical structure. Hence, the model is a tree with several types of data nodes corresponding to elements, attributes, or textual data. An example of an XML document and its XML tree model is depicted in Figure 1. In the following text, we use the term labeled path for the sequence of node names on a path in an XML tree.

1.2.2 XML Querying

Issues and difficulties of structural XML querying are mostly observed with respect to the XPath [31] and XQuery [17] languages, where the latter is an extension of the former. Both languages are the key standards among XML query languages, and so the corresponding XML query processing algorithms are essential for any XDBMS.

[Figure 1: An XML document and its XML tree model]

XML document:

  <notes>
    <note status="important">
      <to>roope</to>
      <from>jani</from>
      <body>call me!</body>
    </note>
    <note status="new">
      <to>radim</to>
      <due> </due>
      <body>finish article</body>
    </note>
  </notes>

XML tree model: the root node notes has two note children; the first note has the children status, to, from, and body, and the second note has the children status, to, due, and body. The leaf values are those of the document.

XPath is a query language for selecting nodes from an XML document. The XPath language provides the ability to navigate within an XML document and to select particular nodes by a variety of criteria. We can find two major types of constructs in any XPath query: (1) structural constructs, including navigation in an XML tree, element or attribute name selection, wildcards, Boolean expressions, and quantifiers, and (2) content constructs, including predicates on the node content (i.e., the element content or the attribute value) and comparisons of the node content. The content constructs can often be handled by methods designed for query processing in RDBMSs [13, 80, 24]. Moreover, a number of works deal with the content constructs in XML [63, 55, 10, 4]. Recall that this work focuses on the structural constructs.

The major structural query model used by most approaches is called a twig pattern query (TPQ). A basic TPQ Q = (V, E) is a tree with a set V of query nodes and a set E of edges. Query nodes represent the nodes of an XML document to be retrieved, and edges represent structural relationships between the nodes. A query node q ∈ V is labeled by a node name. An edge e ∈ E can be of the parent-child (PC) or the ancestor-descendant (AD) type. PC and AD edges are visualized by single and double lines, respectively.
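The definition of a basic TPQ can be made concrete with a small data structure. The following is only a minimal sketch, not the representation used by any of the surveyed algorithms; the class and function names (`Axis`, `QueryNode`, `edges`) are hypothetical. It encodes the query //a//b[.//d]/e as a tree of query nodes whose edges are typed PC or AD.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class Axis(Enum):
    PC = "/"    # parent-child edge
    AD = "//"   # ancestor-descendant edge


@dataclass
class QueryNode:
    name: str                       # node name labeling the query node
    axis: Axis = Axis.AD            # type of the edge to the parent query node
    children: list[QueryNode] = field(default_factory=list)

    def add(self, name: str, axis: Axis) -> QueryNode:
        """Attach a child query node connected by an edge of the given type."""
        child = QueryNode(name, axis)
        self.children.append(child)
        return child


def edges(root: QueryNode) -> list[tuple[str, str, str]]:
    """Return the edge set E as (parent name, edge type, child name) triples."""
    out = []
    for child in root.children:
        out.append((root.name, child.axis.name, child.name))
        out.extend(edges(child))
    return out


# The query //a//b[.//d]/e as a TPQ: V = {a, b, d, e}
q1 = QueryNode("a")
b = q1.add("b", Axis.AD)
b.add("d", Axis.AD)   # predicate branch [.//d]
b.add("e", Axis.PC)   # main branch /e

q1_edges = edges(q1)
```

Here `q1_edges` lists one AD edge from a to b, one AD edge from b to d, and one PC edge from b to e, which is all the structural information a basic TPQ carries.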
[Figure 2: XPath queries and their corresponding TPQs]

  Q1: //a//b[.//d]/e                     (TPQ)
  Q2: //b/d/following-sibling::e         (TPQ^>)
  Q3: //a//b[./ancestor::b][./e]/d       (X-TPQ)
  Q4: //a//*[./b and not(./e)]//d        (TPQ^*)

TPQs can represent XPath queries which contain child and descendant axes in their steps and predicates. A sample XPath expression of this kind is depicted in Figure 2 (Q1). The XPath language also contains other constructs. Thus, there exist different extensions of TPQ handling various structural aspects of XPath queries [73, 68]. Examples of the major TPQ extensions are provided in Figure 2 (Q2-Q4). We use a notation for TPQ extension names in which a structural construct not belonging to the basic TPQ definition is usually written as a superscript of the TPQ name (e.g., Q2 and Q4).

Algorithms evaluating a TPQ on an XML document are called twig query joins or structural joins. Any twig query join finds all the occurrences of a TPQ in an XML tree (this is also called TPQ matching). We discuss them in Section 3. Let us note that TPQ matching is often considered the core problem of XML querying.

[Figure 3: XQuery queries and their corresponding GTP representations. The legend distinguishes optional edges, output nodes, and optional output nodes.]

  Q5: for $a in //a[//c//f],
          $aa in $a//b//a
      return <o> {$a,$aa} </o>

  Q6: for $a in //a,
          $d in $a//b//d
      return <o> {$d,$a/b} </o>

  Q7: for $b in //b[not(.//d/c) or /c]
      let $cc := $b//c/f
      return <o> {$b,$cc} </o>

The XQuery language [17] extends XPath in many ways. Two important extensions are the ability to specify output query nodes and the introduction of sequences in values. If we used only the TPQ model, we would have to postprocess the TPQ output in order to get the expected XQuery output. To overcome this problem, a generalized query pattern (GTP) [30] was introduced, making such postprocessing unnecessary. Figure 3 shows three XQuery queries with their corresponding GTP models. We use a simplified notation of the GTP model introduced in [28]. The optional edge and the optional output node enable a simple representation of the let clause or of an optional XPath construct in the return clause (see Q6 and Q7 for an example).

1.3 Labeling Schemes

Most twig query joins use a labeling scheme that assigns a unique node label to each node. Node labels allow us to resolve basic operations between two nodes u and v during query processing. The most significant operations are (1) finding the lowest common ancestor (LCA) of u and v in an XML tree, (2) resolving the AD or PC relationship, and (3) deciding whether u precedes v in document order.

There are two major types of labeling schemes: (1) fixed-length labeling schemes, where the label has a fixed length, e.g., the Containment labeling scheme [106] (see an example in Figure 4(a)), and (2) prefix-based labeling schemes, where the length of the label is equal to the node depth, e.g., the Dewey order [94] (see an example in Figure 4(b)). In many labeling schemes, the node label serves as a unique identifier (node id) of a node in an XML document.

[Figure 4: (a) Containment labeling scheme (b) Dewey order labeling scheme. Panel (a) labels the sample tree with the intervals a (1,16), b (2,13), a (3,4), c (5,8), b (6,7), c (9,12), b (10,11), and b (14,15); panel (b) labels the same tree with Dewey labels starting at a 1, b 1.1, and b 1.2.]

The labeling schemes mentioned in the previous paragraph [106, 34, 94] are early works with certain issues such as poor update performance or a lack of support for the LCA operation. These issues were the main reason for the introduction of several other labeling schemes such as ORDPath [74], DDE [103], or Branch code [102]. A specific way to improve the update capability of fixed-length labeling schemes is the introduction of bulk operations [51]. CT-label [59] is an approach that aims at reducing the label size when an XML document becomes nearly static. We do not cover labeling schemes such as compressed branch code [102], DFPD [60], or DPLS [62], as they can introduce false hits under certain circumstances. The prefix-based labeling schemes have fast insertion, but this is balanced out by the cost of the basic operations. Therefore, there is no holy grail of labeling schemes, and a trade-off has to be made between the performance of the basic operations and the performance of updates. When selecting an appropriate labeling scheme, we also have to consider support for the LCA operation, as it is required by various twig query joins.
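To illustrate how node labels resolve the basic operations, the following minimal sketch implements them for both scheme types. It is not taken from the cited papers: the function names are hypothetical, and the containment label is assumed to carry a level component in addition to the (start, end) interval, which is one common way to decide PC relationships. The sample labels follow the containment intervals of Figure 4(a); the Dewey labels of the deeper nodes are assumed to continue the 1, 1.1, 1.2 pattern.

```python
from typing import NamedTuple


class Containment(NamedTuple):
    start: int
    end: int
    level: int  # assumption: depth is stored with the interval to decide PC


def cont_is_ancestor(u: Containment, v: Containment) -> bool:
    """u is an ancestor of v iff the interval of u strictly contains that of v."""
    return u.start < v.start and v.end < u.end


def cont_is_parent(u: Containment, v: Containment) -> bool:
    return cont_is_ancestor(u, v) and u.level + 1 == v.level


def cont_precedes(u: Containment, v: Containment) -> bool:
    """Document order follows the start positions."""
    return u.start < v.start


def dewey_is_ancestor(u: tuple, v: tuple) -> bool:
    """u is an ancestor of v iff u is a proper prefix of v."""
    return len(u) < len(v) and v[:len(u)] == u


def dewey_is_parent(u: tuple, v: tuple) -> bool:
    return len(u) + 1 == len(v) and v[:len(u)] == u


def dewey_precedes(u: tuple, v: tuple) -> bool:
    """Document order is the lexicographic order of the label components."""
    return u < v


def dewey_lca(u: tuple, v: tuple) -> tuple:
    """The label of the lowest common ancestor is the longest common prefix."""
    k = 0
    while k < min(len(u), len(v)) and u[k] == v[k]:
        k += 1
    return u[:k]


# Containment labels from Figure 4(a), with levels added
root_a = Containment(1, 16, 1)
b_left = Containment(2, 13, 2)
c_first = Containment(5, 8, 3)
b_deep = Containment(6, 7, 4)

# Assumed Dewey labels of the two deepest b nodes of the same tree
dewey_b1 = (1, 1, 2, 1)
dewey_b2 = (1, 1, 3, 1)
```

The sketch also shows the trade-off mentioned above: the fixed-length label answers AD, PC, and order checks with a few integer comparisons but offers no direct LCA, whereas the prefix-based label yields the LCA as a common prefix at the cost of comparisons proportional to the node depth.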


2 XML Storage Techniques

Native XDBMSs build structural indices that allow them to avoid accessing an XML document when resolving the structural constructs of XML queries. Section 2.1 describes a general storage concept called the partitioning of an XML document. Sections 2.2 and 2.3 describe two data structures called a schema tree and a node index (both structures can be built during the preprocessing of an XML document), while Section 2.4 summarizes the advantages of various storage settings.

2.1 Partitioning of XML Document

The nodes of an XML document can be easily divided into disjoint sets (partitions), where each set is identified by its partition label. Partitioning approaches can be divided as follows: (1) those based on the document structure [106, 42, 26, 49], and (2) those considering a typical XML query workload [50, 27]. Tag partitioning [106, 5] is the most common partitioning based on the document structure: nodes are divided according to their tags, and the number of partitions is equal to the number of unique tags in the XML document. There are also other partitionings based on tag+level [29], labeled paths [105, 29, 52], or forward and backward (F&B) paths [49]. Each of these partitionings is a decomposition (refinement) of another one, as illustrated in Figure 5. On the one hand, tag partitioning produces a low number of partitions with a high number of nodes in each partition. On the other hand, F&B has the opposite property: a possibly high number of partitions with a low number of nodes in each partition. Another type of partitioning is semantic partitioning [6], which partitions the XML document according to the structure specified in an XML schema.

[Figure 5: Properties of various XML document partitionings. From coarse to refined: tag, tag+level, labeled path, F&B. Coarser partitionings have a low number of partitions with a high number of nodes in a partition; more refined partitionings have a high number of partitions with a low number of nodes in a partition.]

2.2 Schema Tree

A schema tree for an XML document is a labeled tree in which each labeled path occurs only once. An example of a schema tree is depicted in Figure ??(b). In the literature, there are many names for this tree: DataGuide [42], path tree [3], summary tree, summary index, path index, and so on. A schema tree is useful for the following purposes: (1) to determine the partition labels corresponding to a query [11] when a more refined partitioning is used (e.g., tag+level or labeled paths), (2) to support a simple selectivity estimation [3], and (3) to get general knowledge about the structure of an XML document (e.g., to support an auto-completion feature in an XML editor).

2.3 Node Indices

Node labels are not stored in the schema tree structure; instead, they are stored in a node index as values. There are two basic types of node indices: (1) those having a node id as a key (document indices), and (2) those having a partition label as a key (partition indices). In the partition index, nodes corresponding to one partition label are sorted according to the node label; therefore, a sequential scan of the key's list is usually necessary during query processing. This can be improved by an index (XB-tree [20] or XR-tree [48]) built over each list. All types of node indices are schematically depicted in Figure 6.

[Figure 6: (a) Document index: a B-tree with the node id as the key and the node label plus some other information as the value. (b) Partition index: a B-tree with the partition label as the key and a list of node labels as the value. (c) Partition index with indexed lists: a B-tree with the partition label as the key and an XB-tree as the value.]

From the query processing perspective, a document index is very useful when we have a small set of context nodes and want to use it to resolve the remaining relationships of a query. This type of query processing can be considered navigational. On the other hand, many twig query joins are based on the partition index (see Section 3). This type of join mainly performs a merge during one sequential scan of the lists, which removes irrelevant nodes [20, 5]. A list in the partition index is called a stream in the join literature; we use the term sequence in the following text.

2.4 Balancing XML Storage

From the query processing point of view, the selection of a partitioning influences the severity of three problems: (1) the overall number of nodes that have to be read from a node index, (2) the number of random accesses into a node index, and (3) the necessity to find all the partition labels in a potentially large schema tree. If we reduce the first problem by using a more refined partitioning, then the latter two problems grow, and vice versa. This behavior also depends on the query workload. As mentioned in Section 2.1, there are also partitioning techniques which are based on a typical XML query workload [50, 27]. These approaches take typical queries into account and create the partitioning with respect to them. They minimize all three query processing problems mentioned above; however, they are effective only for a specific workload.
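The difference between a coarse and a refined partitioning can be shown on a toy document. The sketch below is only an illustration under stated assumptions: the document, the preorder numbers used as node ids, and the function name `partition` are all hypothetical, and Python's standard `xml.etree.ElementTree` parser stands in for a real storage layer. It builds a tag partitioning and a labeled-path partitioning; the keys of the latter are exactly the labeled paths a schema tree would contain.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical toy document: the tag b occurs under two different labeled paths
DOC = "<a><b><a/><c><b/></c><c><b/></c></b><b/></a>"


def partition(root):
    """Return (tag partitioning, labeled-path partitioning) of node ids.

    Node ids are preorder numbers, so every partition list is sorted in
    document order, like a sequence in a partition index.
    """
    by_tag = defaultdict(list)
    by_path = defaultdict(list)
    counter = 0

    def visit(elem, path):
        nonlocal counter
        counter += 1
        node_id = counter
        labeled_path = path + "/" + elem.tag
        by_tag[elem.tag].append(node_id)
        by_path[labeled_path].append(node_id)
        for child in elem:
            visit(child, labeled_path)

    visit(root, "")
    return dict(by_tag), dict(by_path)


by_tag, by_path = partition(ET.fromstring(DOC))
schema_paths = sorted(by_path)  # the distinct labeled paths of the schema tree
```

For this document the tag partitioning has three partitions and the labeled-path partitioning has five. A query such as /a/b/c/b reads two nodes from the partition /a/b/c/b instead of four nodes from the tag partition b, but it first has to locate that partition label among the schema tree paths.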


3 Twig Query Join Algorithms

As discussed in Section 1.2.2, the basic problem of XML querying is finding all the occurrences (query matches) of a TPQ in an XML tree [33]. Algorithms addressing this problem are usually called twig query joins, and they represent a basic operator in an XML query algebra. Unlike in the relational domain, where the term join stands for an algorithm-independent operation, in this work we use the term twig query join for algorithms that solve the TPQ matching problem, as is common in the referenced papers.

In the full version of this work we define a data structure API that abstracts the data structures of the storage system. Let us only mention that there are two basic access patterns: (1) the navigation API, which uses a node for navigation in the XML document, and (2) the stream API, which uses just a sequential scan of the node sequence corresponding to one partition label. Operations in the APIs have significantly different performance characteristics depending on the available node indices (see Section 2.3). In general, the document index supports both the navigation and stream APIs, whereas the partition index supports only the stream API. However, when serving the stream API, the document index has to access nodes using a sequential scan of a B-tree containing many irrelevant nodes, which is inefficient. A query processed in the partition index accesses only the nodes relevant to a partition label; therefore, the partition index is crucial for the efficient processing of all merge-like algorithms. The selection of an appropriate index and algorithm is a task for a cost-based optimizer (see Section 4.1), and we propose an algorithm based on such a selection in Section 4.3.

Before we describe particular algorithms, we specify the steps that can be identified as parts of any twig query join:

Filtering: An algorithm scans sequences and filters out nodes not corresponding to any query match. In this stage, algorithms use main-memory filtering data structures to get rid of the maximum number of nodes that are irrelevant to a query. The most common filtering data structure is a stack or a set of stacks. The main feature of the filtering data structure is that it helps to decide in constant time whether a node is useless or not. Of course, every filtering can have false hits.

Intermediate storage: Every join uses some type of intermediate storage where nodes are stored before the query output is enumerated. In some approaches, the filtering data structure is used as the intermediate storage as well.

Output enumeration: In this step, algorithms read the intermediate storage and generate an output which is usually in the form of ordered tuples. The major task of this step is often the ordering of tuples according to an XML query model.

One of the most significant differences among joins is the method that is used during the filtering step. The first type of filtering method focuses only on a pair of query nodes, and the twig query join is processed as a set of binary structural joins. On the other hand, holistic joins filter input nodes based on information from all query nodes. Figure 7 summarizes the major categories of twig query joins and Table 1 lists their main features.

[Figure 7: Categories of algorithms utilized for twig query joins. Node labels: Twig query joins; Binary structural joins; Holistic joins; Navigational; Merge-like; Top-down; Bottom-up.]

Table 1: Major features of twig query join categories

Binary structural joins
  Pros: They can be easily integrated into any XML query algebra and support all XPath axes.
  Cons: Their efficiency is significantly dependent on the selection of a good query plan. They can produce a large intermediate result when compared to the query output.

Top-down holistic joins
  Pros: Linear I/O complexity of the query processing with respect to the sum of the output and input sizes for some query types; query plan optimization is unnecessary.
  Cons: A sequential scan of an intermediate result is required even if it contains many useless nodes (see Section ??).

Bottom-up holistic joins
  Pros: Linear CPU and I/O complexities of the output enumeration with respect to the output size.
  Cons: They can produce a large intermediate result compared to the query output (see Section ??).

3.1 GTPStack

In this section, we outline the basic ideas of our holistic join algorithm GTPStack introduced in [8].

Node filtering

As described in the introduction of this section, every holistic algorithm has a filtering mechanism that skips irrelevant input data nodes, i.e., nodes which are not a part of any query match, before these nodes are stored in an intermediate storage. Holistic algorithms use stacks during the filtering. In the following text, #q denotes a query node q. If the filtering skips irrelevant nodes so that they are not stored on stacks at all, we speak about top-down filtering.
Bottom-up filtering skips irrelevant nodes (i.e., they are not stored in the intermediate storage) when they are popped from their stacks. The simplest top-down filtering is represented by PathStack [20], which skips an irrelevant node n corresponding to #q when there is no occurrence of a path from #root to #q containing n. Another type of top-down algorithm uses a recursive filtering function such as getnext [20] or getpart [43] and skips irrelevant nodes which are not a part of a whole TPQ occurrence. On the other hand, a bottom-up filtering algorithm (e.g., Twig²Stack [28] or TwigList [81]) skips an irrelevant node n corresponding to #q if there is no occurrence of a subtree rooted at #q containing n.

We say that a filtering is optimal for a query Q if it skips all irrelevant nodes during the sequential scan of the input, which means that an algorithm with such filtering has a linear worst-case time and I/O complexity for Q with respect to the sum of the input and TPQ result sizes. For example, the TwigStack [20] and TJStrictPre [43] algorithms are optimal for TPQs having only ancestor-descendant relationships.

Table 2 shows the number of nodes stored in an intermediate storage by various filtering algorithms. In this table, we use three TreeBank queries. The last column shows the number of nodes which are a part of a GTP result tuple. Evidently, the combination of PathStack and bottom-up filtering can store an enormous number of irrelevant nodes; however, it stores fewer nodes than bottom-up filtering alone. A top-down filtering algorithm such as TwigStack stores significantly fewer nodes, but it still typically stores a large number of irrelevant nodes because it filters only according to the TPQ model and no bottom-up filtering is included.
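The two relevance conditions above can be checked directly on a small in-memory tree. The following sketch is a brute-force illustration only; it is not PathStack, Twig²Stack, or any other stack-based algorithm, and the classes `Node` and `QNode`, the functions `path_ok` and `subtree_ok`, and the sample tree are all hypothetical. It evaluates the query //a//b[.//d]/e and reports which b nodes survive the path condition used by the simplest top-down filtering and which survive the subtree condition used by bottom-up filtering.

```python
class Node:
    """A data node of a toy XML tree."""

    def __init__(self, tag, parent=None, name=None):
        self.tag, self.parent, self.name = tag, parent, name
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def ancestors(self):
        n = self.parent
        while n is not None:
            yield n
            n = n.parent

    def descendants(self):
        for c in self.children:
            yield c
            yield from c.descendants()


class QNode:
    """A query node; axis is the type of the edge to its parent (PC or AD)."""

    def __init__(self, tag, parent=None, axis="AD"):
        self.tag, self.parent, self.axis = tag, parent, axis
        self.children = []
        if parent is not None:
            parent.children.append(self)


def path_ok(n, q):
    """Path condition: n ends an occurrence of the query path #root..#q."""
    if n.tag != q.tag:
        return False
    if q.parent is None:
        return True
    candidates = [n.parent] if q.axis == "PC" else n.ancestors()
    return any(m is not None and path_ok(m, q.parent) for m in candidates)


def subtree_ok(n, q):
    """Subtree condition: n roots an occurrence of the query subtree at #q."""
    if n.tag != q.tag:
        return False
    for qc in q.children:
        candidates = n.children if qc.axis == "PC" else n.descendants()
        if not any(subtree_ok(m, qc) for m in candidates):
            return False
    return True


# Query //a//b[.//d]/e
qa = QNode("a")
qb = QNode("b", qa, "AD")
QNode("d", qb, "AD")
QNode("e", qb, "PC")

# Toy data tree: b1 lacks e, b2 lacks d, b3 is a full match, b4 has no a ancestor
r = Node("r")
a1 = Node("a", r)
b1 = Node("b", a1, "b1"); Node("d", b1)
b2 = Node("b", a1, "b2"); Node("e", b2)
b3 = Node("b", a1, "b3"); Node("d", b3); Node("e", b3)
b4 = Node("b", r, "b4"); Node("d", b4); Node("e", b4)

b_nodes = [n for n in r.descendants() if n.tag == "b"]
kept_by_path = [n.name for n in b_nodes if path_ok(n, qb)]
kept_by_subtree = [n.name for n in b_nodes if subtree_ok(n, qb)]
```

In this example each condition alone keeps a node that is not part of any query match (b1 and b2 under the path condition, b4 under the subtree condition), which is the kind of irrelevant node that inflates the intermediate storage when only one direction of filtering is applied.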
Table 2: Numbers of nodes inserted into an intermediate storage and numbers of nodes relevant to the GTP result

Column headings: Query; PathStack + Bottom-up (Twig²Stack+PathStack); Top-down (TwigStack); Bottom-up (Twig²Stack); Nodes in GTP result

  TB1  172,851  92,972  32,
  TB2  170,874  49,765  24,
  TB3  404, ,961  12,

In this section, we briefly outline the features of the GTPStack algorithm, which combines a top-down filtering function with bottom-up filtering (let us call it combined filtering). To the best of our knowledge, it is the first correct algorithm able to apply both filterings before a node is stored in the intermediate storage. Our combined filtering enables optimal filtering according to the GTP; therefore, only the nodes relevant to the GTP result are stored in the intermediate storage if the algorithm is optimal. In other words, if GTPStack is optimal, then it has a linear worst-case I/O complexity with respect to the GTP result size. GTPStack's combined filtering significantly reduces the number of nodes in the intermediate storage even if GTPStack is not optimal for a query. Moreover, in order to speed up the query processing, we use the following two improvements in the filtering mechanism: (1) we introduce a novel top-down filtering function called getmatch which always outperforms the getpart function [43], and (2) we avoid storing predicate nodes on stacks.

GTPStack processing time improvements

Let us briefly describe the ideas behind the above improvements using examples. A filtering function such as getnext or getpart is typically called many times, as shown in Table 3. This table gives the numbers of function calls corresponding to the inner query nodes and the numbers of unnecessary calls for three TreeBank queries. An unnecessary call of the filtering function works with exactly the same data nodes as the last function call; therefore, it returns the same query node. As observed for the query TB3, almost half of the function calls can be unnecessary. Our novel getmatch function avoids all these unnecessary calls. Additionally, as shown in Section 3.2.3, the efficiency of the filtering function is significantly dependent on the ability to skip irrelevant nodes. The getpart and getnext functions sometimes return irrelevant nodes which are not subsequently stored on stacks. Another advantage of getmatch is that it skips all these irrelevant nodes.

[Table 3: Number of the filtering function calls corresponding to the inner query nodes. Columns: Query; getnext calls [10^3]; getnext unnecessary calls [10^3]; getpart calls [10^3]; getpart unnecessary calls [10^3].]

We use the term main branch query node for the query nodes which are on a query path from the root to an output node. The rest of the query nodes are called predicate query nodes. GTPStack separates the node filtering from the output enumeration, which yields the following optimization: it allows us to avoid storing the nodes corresponding to the predicate query nodes on stacks.

Optimality

An important feature of holistic algorithms using top-down filtering is that they have a linear worst-case I/O complexity with respect to the TPQ result size (i.e., they are optimal) for some query classes. Different holistic approaches define their optimality conditions in different ways; however, all of them specify only the query requirements. To the best of our knowledge, GTPStack is the first algorithm that is optimal for some query classes with respect to the GTP result size. GTPStack optimality is defined only by XPath axes and XML document characteristics. In other words, the semantics related to the output nodes, Boolean expressions, and quantifiers do not influence its optimality.

getmatch Filtering Function

The existing getnext and getpart top-down filtering functions have two shortcomings: (1) they often perform unnecessary recursive calls, and (2) they return a query node #q even if there is no ancestor of H(#q) on S_{parent(#q)}. As a result, they both cause many unnecessary computations which are completely avoided by the getmatch function introduced in this section.

25 The getmatch function introduces three improvements of the existing top-down filtering functions: (1) dynamic programming that avoids unnecessary recursive calls, (2) a filtering procedure which advance the sequence according to the bottom node of the parent s stack, and (3) a cycle for the inner nodes which does not terminate the getmatch call until all sequences are ended or a promising node is found. The usage of the parent s stack and the above cycle cause a more progressive advancing of sequences which is a major parameter influencing the processing time of an filtering function as is shown in Section A top-down filtering function has the following property for a specific (see below) class of queries: it returns only a query node #q such that there is a query match of #q containing H(#q). Therefore, the top-down filtering removes all nodes irrelevant to the TPQ. The getpart and getnext procedures guarantee this property for queries having only AD relationships [20, 43]. Since getmatch only avoids unnecessary function calls and skips irrelevant nodes having no ancestor on parent s stack, its optimality properties are the same as for getpart and getnext. This means that getmatch is optimal for queries having only AD relationships. Let us note that in Section 3.3 we prove that a holistic algorithm with an top-down filtering function (e.g., getmatch) can be optimal even for a query containing any combination of PC and AD relationships depending on XML document characteristics Summary of GTPStack GTPStack is the first algorithm with a linear worst-case I/O complexity with respect to the sum of the input and GTP result sizes (in this case, GTPStack is optimal for the GTP). This is mainly achieved by the combination of the top-down and bottom-up filterings and, to our best knowledge, GTPStack is the first correct holistic algorithm using a combined filtering before storing a node in an intermediate storage. 
The combined approach used in GTPStack has the following advantages: (1) it allows us to avoid storing nodes corresponding to predicate query nodes on stacks, which speeds up the query processing, and (2) it significantly decreases the number of nodes in the intermediate storage even when GTPStack is not optimal for a query. GTPStack uses our novel filtering function called getmatch, which avoids unnecessary function calls and improves sequence advancing, which further speeds up the query processing. All these features make GTPStack superior to the state-of-the-art holistic approaches, as is shown experimentally in the following section.

GTPStack Experimental Results

We implemented five state-of-the-art holistic algorithms in C++: TwigStack [20], TwigList [81], TJStrictPre [43], TJStrictPost [43], and Twig²Stack+PathStack [28] (abbreviated to T2PS). We do not include experimental results of the TwigList algorithm since both TJStrictPost and TJStrictPre use an improved version of TwigList. In our experiments, we use more than one version of GTPStack. We combine GTPStack with existing top-down filtering functions; therefore, we use the following simple notation: GTPStack+N, GTPStack+P, and GTPStack+M stand for GTPStack combined with getnext, getpart, and getmatch, respectively. By writing GTPStack we mean any of the above versions of GTPStack. Since GTPStack+N always outperforms TwigStack, we include the results for the TwigStack query processing only in the test of intermediate storages for TPQs. The main shortcomings of TwigStack are its redundant intermediate storage and inefficient output enumeration.

We use one synthetic XML document of our own, called ZIPF, and three real-world XML collections. The ZIPF document contains seven different elements named from a to g, spread randomly using the Zipfian distribution, where a has the highest occurrence (approximately 50%) and g has the lowest occurrence (approximately 1%). Every element of ZIPF has exactly two children and the depth of the collection is 24, which means that all paths in ZIPF have the same length. The real-world collections are XMark [91] with factor 10, INEX 1.9 [40], and TreeBank [97].

Queries for the XMark and TreeBank collections are selected from several existing articles on TPQ processing [64, 28, 56]. Queries for the INEX collection were selected in order to show differences between the algorithms. A list of the selected queries for the real-world collections can be found in the full version of this work. The largest number of queries was generated for the ZIPF collection. The ZIPF queries are generated according to the five query templates shown in Table 4. A template only specifies relationships between query nodes, the output query node, and predicate query nodes.

Query template                          Number of generated queries
1. //τ[/υ and /ω]
2. //α/β[//χ and //δ]
3. //α/α[//β]/χ//χ[//δ and //ϵ]
4. //α[/τ and //υ and //ω]
5. //α/β[//χ/δ]                         81

Table 4: ZIPF query templates for XPath queries

We ran our experiments on a PC with an Intel Xeon 2.93 GHz CPU and the Windows Server 2008 operating system.
When measuring the processing time, each query is processed fifteen times in the main memory and then we compute the average result, omitting the two worst and the two best results. If we want to compare processing times T_x and T_y of two approaches x and y for a set of queries, we first compute the geometric mean of the ratios T_y/T_x over all queries. Subsequently, since we want to have the value in percent, we simply subtract one from the calculated geometric mean and multiply it by 100. We call it the relative processing time improvement (RPTI) of approach x compared to approach y for a set of queries. For example, if the geometric mean is 1.68, we write that the RPTI of x compared to y is 68%.

Since GTPStack's improvements of processing time relate only to the filtering part of holistic algorithms, we also measured the filtering time of the algorithms (i.e., the processing time without the time spent on reading the input data), and therefore, we also compute the relative filtering time improvement (RFTI) of approach x compared to approach y for a set of queries. In order to minimize the processing time measurement error, we say that a method is faster than the other one for a query Q if its RPTI for Q is at least 2% and their processing time difference for Q is at least 10 milliseconds.

Processing Time

Table 5 gives the RPTI and RFTI of GTPStack+M compared to all tested approaches for all queries. Table 5 also contains the number of queries for which GTPStack+M is faster and slower.

Filtering approach    RPTI    RFTI    Number of queries
                                      Faster    Slower
T2PS                  77%     172%
TJStrictPost          88%     222%
TJStrictPre           12%     43%
GTPStack+N            68%     150%
GTPStack+P            13%     46%

Table 5: GTPStack+M compared to all tested approaches for all queries

We can observe that GTPStack+M outperforms both approaches using the getpart function (TJStrictPre and GTPStack+P) for all queries, since the getmatch function is always faster than or as fast as getpart. This comes from the fact that getmatch improves getpart without any additional overhead. The remaining approaches (T2PS, TJStrictPost, and GTPStack+N) perform significantly worse on average than GTPStack+M (the RFTI of GTPStack+M ranges from 150% to 222% when compared to these approaches).

To better understand the differences among the algorithms and the advantages of GTPStack+M, we need to compare their corresponding parts separately. We first compare only the result enumeration time; then we compare the top-down filterings; then we show how GTPStack optimizes its processing time for queries with many predicate nodes; and, in the last experiment, we compare the bottom-up filterings.

Test of Intermediate Storages for TPQs

Let us start with a test showing the properties of various intermediate storages and mainly the performance of the output enumeration.
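As an aside on the measurement protocol defined earlier in this section, the following minimal Python sketch shows how the trimmed average of fifteen runs, the RPTI (or RFTI) over a set of queries, and the "faster" criterion could be computed. The function names and the choice of milliseconds as the time unit are assumptions made for illustration; they are not taken from the original implementation.

```python
import math
from statistics import mean


def trimmed_mean(times, trim=2):
    """Average of the runs after dropping the `trim` best and `trim` worst."""
    if len(times) <= 2 * trim:
        raise ValueError("not enough runs to trim")
    kept = sorted(times)[trim:len(times) - trim]
    return mean(kept)


def relative_improvement(t_x, t_y):
    """RPTI (or RFTI) of approach x compared to approach y, in percent.

    t_x[i] and t_y[i] are the measured times of x and y for query i.
    The result is (geometric mean of t_y[i] / t_x[i] - 1) * 100.
    """
    if len(t_x) != len(t_y) or not t_x:
        raise ValueError("need one time per query for both approaches")
    log_sum = sum(math.log(y / x) for x, y in zip(t_x, t_y))
    geo_mean = math.exp(log_sum / len(t_x))
    return (geo_mean - 1.0) * 100.0


def is_faster(t_x_ms, t_y_ms, min_rpti=2.0, min_diff_ms=10.0):
    """Single-query criterion: x is faster than y only if its RPTI is
    at least 2% and the absolute difference is at least 10 ms."""
    rpti = (t_y_ms / t_x_ms - 1.0) * 100.0
    return rpti >= min_rpti and (t_y_ms - t_x_ms) >= min_diff_ms
```

Using the geometric mean of ratios keeps a single very slow query from dominating the comparison, which an arithmetic mean of raw times would not.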
Note that the LIS intermediate storage is used by the TJStrictPost, TJStrictPre, and GTPStack algorithms. The results in this section serve as a hint for the selection of the most appropriate intermediate storage for our approach. In this test, we present only the XMark queries since the results for the other collections are similar. However, we ignore the GTP semantics (i.e., we consider all query nodes as output ones) since TwigStack cannot enumerate GTPs. As a result, queries 8 and 13 did not finish since their TPQ result sizes were over one billion and the available main memory was not sufficient.

[Figure 8: Performance of the output enumeration. Axes: Query, Time [s]; series: TwigStack, LIS, T2PS; DNF marks queries that did not finish.]

Figure 8 shows the results of this experiment. As expected, TwigStack performs very poorly, which corresponds to the results published in [28]. The inefficiency of the TwigStack intermediate storage comes from the duplicate work with nodes and the sequential scan of the whole intermediate storage during the output enumeration. If we compare the output enumeration times of T2PS and LIS, the difference is not significant. This result comes from the fact that they use the bottom-up filtering; therefore, their output enumeration time is linear with respect to the result size. Finally, we decided to use the LIS storage since the Twig²Stack intermediate storage requires a global pop order for its correct functionality. LIS can work with our bottom-up filtering and its enumeration time performance is comparable to the Twig²Stack intermediate storage.

Analysis of Top-down Filtering

We can find three major types of top-down filtering approaches: (1) the PathStack algorithm, which filters only according to the node query path from the root query node and does not perform any sequence advance, (2) the getnext function, which performs a sequence advance according to the query node descendants, and (3) getpart and getmatch, which perform a sequence advance according to the query node descendants and ancestor. We ignore getpart in the following text since getmatch always outperforms this function for our queries, as is shown in the Processing Time results above.

The PathStack algorithm is very simple and fast and its processing time is linear with respect to the input size (the correlation coefficient between the PathStack processing time and the input size for the ZIPF queries is 0.98). Since PathStack does not use any advanced sequence forwarding, its processing time is not influenced by the query result. On the other hand, a filtering function (i.e., getnext or getmatch) can skip irrelevant nodes more quickly using the sequence advancing. Therefore, these functions outperform PathStack if they advance the sequence sufficiently often; this is discussed further in this section.

The main attribute indicating the efficiency of a top-down filtering is the average sequence advance (denoted as AvgFwd) during one FwdToAncOf or FwdToDescOf function call. Since the getmatch function uses both forwarding functions while getnext uses only the FwdToAncOf function, we also define: (1) an average sequence advance during one FwdToAncOf function call, and (2) an average sequence advance during one FwdToDescOf function call. Let us call them the average ancestor forward movement and the average descendant forward movement and denote them AvgAncFwd and AvgDescFwd, respectively.

We first compare the getnext and getmatch functions (i.e., we compare GTPStack+N and GTPStack+M). The getnext function uses only FwdToAncOf, and therefore, getmatch performs better if its AvgDescFwd is sufficiently large. Our goal is to find a threshold value k with the following two properties:

if AvgDescFwd > k, then T_getnext > T_getmatch,
if AvgDescFwd < k, then T_getnext < T_getmatch.

[Figure 9: Number of queries violating the properties for different k values: (a) for all ZIPF queries, (b) for queries corresponding to the first and to the fourth ZIPF template. Axes: Threshold value k, Number of queries [%]; series: First property violation, Second property violation, Sum of the violations.]

Figure 9(a) shows how many queries in our ZIPF query set violate the above properties for different k values.
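The two properties can be checked mechanically for any candidate threshold. The sketch below is a minimal illustration in Python, assuming that each query is summarised by a hypothetical triple (AvgDescFwd, T_getnext, T_getmatch); the function names and the input layout are not part of the original work.

```python
def violation_rates(queries, k):
    """Share of queries (in percent) violating each property for threshold k.

    `queries` is a list of (avg_desc_fwd, t_getnext, t_getmatch) triples.
    Returns (first_property, second_property, sum_of_violations).
    """
    if not queries:
        raise ValueError("no queries given")
    first = second = 0
    for avg_desc_fwd, t_getnext, t_getmatch in queries:
        # Property 1: AvgDescFwd > k implies T_getnext > T_getmatch.
        if avg_desc_fwd > k and not (t_getnext > t_getmatch):
            first += 1
        # Property 2: AvgDescFwd < k implies T_getnext < T_getmatch.
        elif avg_desc_fwd < k and not (t_getnext < t_getmatch):
            second += 1
    n = len(queries)
    return (100.0 * first / n, 100.0 * second / n, 100.0 * (first + second) / n)


def best_threshold(queries, candidates):
    """Candidate k with the smallest sum of violations (first one on ties)."""
    return min(candidates, key=lambda k: violation_rates(queries, k)[2])
```

Scanning a grid of candidate values with `best_threshold` corresponds to reading off the minimum of the "sum of the violations" curve in Figure 9.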
As we can see, there is no optimal k value for which all queries satisfy both properties since the sum of violations never reaches zero. In other words, we cannot find any exact threshold value of AvgDescFwd which would say whether to use getnext or getmatch. The reason for this is that various query nodes in a query can have significantly different AvgDescFwd values; therefore, selection of the same filtering function for all query nodes is not always the best solution. Figure 9(b) shows the number of violations for the ZIPF queries generated by templates 1 and 4. These queries are very simple and the descendant forwarding is always performed in relation to the same parent query node. We can observe that the threshold value k = 0.03 is a good selection for many of them since the sum of violations is less than 5%.

Based on the above heuristics, we evaluated a simple combination of the getmatch and the getnext functions, which works as follows. We first process a query using getmatch and collect the statistics about AvgDescFwd in each query node. Secondly, we use the getmatch function for a query node with AvgDescFwd larger than 0.03 and use the getnext function in the other cases. This combination of getmatch and getnext always performs at least as well as any other filtering function.

Similarly, we looked for another two threshold values which would indicate that an algorithm using PathStack performs better than algorithms using getnext or getmatch. Both algorithms have the threshold value of AvgFwd approximately equal to 0.3, where only 5% of the ZIPF queries corresponding to the first and fourth template violate the corresponding processing time properties. In other words, if AvgFwd is lower than 0.3, then PathStack performs better in many cases.

Optimization of Predicate Query Nodes

None of the existing holistic algorithms can optimize the query processing time with respect to the number of nodes corresponding to predicate query nodes; therefore, their processing time is the same regardless of the GTP semantics. Their performance is mainly dependent on their top-down filtering, as is shown in the above section. If GTPStack is optimal, then it stores only the nodes corresponding to main branch query nodes on stacks and thus it saves some time during the bottom-up filtering.
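The threshold-based combination of filtering functions described in the analysis of top-down filtering can be sketched as a simple selection step. The Python code below is only an illustration under stated assumptions: the per-node AvgDescFwd statistics are assumed to come from a first getmatch run, and the function names and the dictionary layout are hypothetical.

```python
GETMATCH_THRESHOLD = 0.03   # AvgDescFwd threshold reported in the text
PATHSTACK_THRESHOLD = 0.3   # AvgFwd threshold reported in the text


def choose_filters(avg_desc_fwd_by_node, threshold=GETMATCH_THRESHOLD):
    """Pick a filtering function for each query node.

    `avg_desc_fwd_by_node` maps a query node name to the AvgDescFwd value
    collected for that node during a first run with getmatch.
    """
    return {
        node: "getmatch" if avg_desc_fwd > threshold else "getnext"
        for node, avg_desc_fwd in avg_desc_fwd_by_node.items()
    }


def prefer_pathstack(avg_fwd, threshold=PATHSTACK_THRESHOLD):
    """Heuristic only: a low AvgFwd suggests that PathStack may be faster."""
    return avg_fwd < threshold
```

Both thresholds are empirical values for the ZIPF queries of templates 1 and 4, so they are kept as overridable parameters rather than fixed constants of the method.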
[Figure 10: Variants of the TB12 query with different number of output nodes. The variants are twig patterns over the query nodes #S, #NN, #NP, #DT, #PP, and #IN.]

[Figure 11: (a) Processing time of GTPStack+M and (b) number of nodes stored on stacks by GTPStack for each variant of TB12. Axes: (a) Query, Processing time [s]; (b) Query, Number of nodes stored on stacks [10^5].]

In Figure 10, we can observe seven variants of the TB12 query with a different number of output nodes. We selected the TB12 query because GTPStack+M is optimal for it. Figure 11a shows how the processing time of GTPStack+M decreases with the decreasing number of nodes stored on stacks. The AvgDescFwd and AvgAncFwd values are equal to 0.09 and 0.14, respectively, for the TB12 query, which indicates (according to the results of the above section) that getmatch should be slower than T2PS and TJStrictPost having the processing times and 0.202, respectively. However, for the last query variant, where the number of nodes corresponding to predicate query nodes is six times larger than the number of the other nodes, GTPStack+M performs equally to both algorithms (its processing time is 0.216).

Intermediate Storage Size

Table 6: Ratio of the number of nodes stored in various intermediate storages and the number of relevant nodes for each collection (columns: ZIPF, TB, XM, INEX; rows: T2PS and TJStrictPost, TJStrictPre, GTPStack)

Table 7: Median values of nodes stored in an intermediate storage for each collection (columns: ZIPF, TB, XM, INEX, each in units of 10^3; rows: T2PS and TJStrictPost, TJStrictPre, GTPStack, Relevant nodes)

Another important property of every algorithm related to the I/O complexity is the intermediate result size. We evaluate the intermediate result size in terms of the number of nodes stored there. For each method we compute the ratio of the number of nodes stored in the intermediate storage and the number of relevant nodes for each collection and filtering method. Table 6 shows this ratio for all queries in each collection computed using the geometric mean. Table 7 shows the median value of nodes stored in an intermediate storage for each collection. Generally, GTPStack stores one order of magnitude fewer nodes than the rest of the tested approaches because it uses the combined approach. The intermediate result size does not have a significant relationship to the processing time during the main memory run since every algorithm has to perform some extra operations if it wants to avoid storing useless nodes. However, the difference will be huge if an intermediate storage is larger than the main memory, since, in this case, I/O operations have to be included.

3.3 Top-down Filtering Optimality

Optimality of a top-down filtering for a query, or at least a subquery, has several important impacts: (1) we can guarantee that we skip all nodes irrelevant to a TPQ during the top-down filtering, (2) the algorithm optimality is necessary for a more efficient top-down node filtering of a query with the NOT operator, (3) we store only nodes corresponding to the output query nodes in the intermediate storage, and (4) we avoid storing all nodes corresponding to the predicate query nodes on stacks.

Every top-down filtering algorithm has its specific query classes for which the algorithm optimality is proved. The most common top-down algorithm optimality is the AD query optimality (an AD query is a query having only AD relationships), which is also GTPStack's optimality if the tag partitioning is used. However, we show that the top-down algorithm optimality can be significantly extended using the tag+level or labeled path partitioning. For a more thorough comparison of the optimality query classes of the algorithms, see [12].
When we speak about a holistic algorithm in this section, we mean a holistic algorithm using a top-down filtering, such as TwigStack, TJStrictPre, or GTPStack. Let us define several terms related to the query nodes of a TPQ:

A query node with PC in its subtree is a query node #q having the PC relationship between some two nodes from the subtree(#q) set.
A checking query node #q is a query node with PC in its subtree and having the AD relationship with its parent. It is an important query node type from the optimality point of view.
Checkingnodes(Q) is the set of checking query nodes in the query Q.
A tag is called single level if all nodes in the XML tree with this tag are on the same level.
PRU_#q is the set of partition labels that corresponds to a query node #q.

Note that under tag partitioning we can have just one partition label in every PRU_#q. If we use some other partitioning, then we need to do the DataGuide search first. By the term DataGuide search we mean determining the partition labels corresponding to a query (see Section 2.2).
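The definitions above translate directly into small tree traversals. The following Python sketch is an illustration only: the `QueryNode` class, the encoding of the edge to the parent as "PC" or "AD", and the decision to treat the query root as having no parent (so it is never a checking node) are assumptions, not part of the original formalism.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class QueryNode:
    name: str
    axis: Optional[str] = None  # "PC" or "AD" edge to the parent; None for the root
    children: List["QueryNode"] = field(default_factory=list)


def has_pc_in_subtree(q):
    """True if some edge between two nodes of subtree(q) is a PC edge."""
    return any(c.axis == "PC" or has_pc_in_subtree(c) for c in q.children)


def checking_nodes(root):
    """Names of the checking query nodes of the query rooted at `root`."""
    result = []
    stack = [root]
    while stack:
        q = stack.pop()
        if q.axis == "AD" and has_pc_in_subtree(q):
            result.append(q.name)
        stack.extend(q.children)
    return sorted(result)


def single_level_tags(nodes):
    """Tags whose nodes all lie on one level; `nodes` holds (tag, level) pairs."""
    levels = {}
    for tag, level in nodes:
        levels.setdefault(tag, set()).add(level)
    return {tag for tag, seen in levels.items() if len(seen) == 1}


# Example: the query //a//b[/c]//d has one checking query node, b.
query = QueryNode("a", None, [
    QueryNode("b", "AD", [QueryNode("c", "PC"), QueryNode("d", "AD")]),
])
```

In the example, b has the AD relationship with its parent and a PC edge to c in its subtree, while d and c fail one of the two conditions.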
