Bottom Up and Top Down Twig Pattern Matching on Indexed Trees


Nils Grimsmo Bottom Up and Top Down Twig Pattern Matching on Indexed Trees Thesis for the degree of philosophiae doctor Trondheim, Norwegian University of Science and Technology. Faculty of Information Technology, Mathematics and Electrical Engineering. Department of Computer and Information Science.

NTNU Norwegian University of Science and Technology
Thesis for the degree of philosophiae doctor
Faculty of Information Technology, Mathematics and Electrical Engineering
Department of Computer and Information Science
© Nils Grimsmo
ISBN (printed version)
ISBN (electronic version)
ISSN
Doctoral theses at NTNU, 2011:96
Printed by NTNU-trykk

Preface

This thesis is submitted to the Norwegian University of Science and Technology (NTNU) for partial fulfillment of the requirements for the degree of philosophiae doctor. The doctoral work has been performed at the Department of Computer and Information Science, NTNU, Trondheim, with Bjørn Olstad as main supervisor, and Øystein Torbjørnsen and Magnus Lie Hetland as co-supervisors. The candidate was supported by the Research Council of Norway under the grant NFR , and by the iad project, also funded by the Research Council of Norway.

Summary

This PhD thesis is a collection of papers presented with a general introduction to the topic, which is twig pattern matching (TPM) on indexed tree data. TPM is a pattern matching problem where occurrences of a query tree are found in a usually much larger data tree. This has applications in XML search, where the data is tree shaped and the queries specify tree patterns. The papers included present contributions on how to construct and use structure indexes, which can speed up pattern matching, and on how to efficiently join together results for the different parts of the query with so-called twig joins.

Paper 1 [18] shows how to perform more efficient matching of root-to-leaf query paths in so-called path indexes, by using new opportunistic algorithms on existing data structures. Paper 2 [19] proves a tight bound on the worst-case space usage for a data structure used to implement path indexes. Paper 3 [24] presents an XML indexing system which combines existing techniques in a novel way, and has orders of magnitude improved performance over existing commercial and open-source systems. Paper 4 [20] reviews and creates a taxonomy for the many advances in the field of TPM on indexed data, and proposes opportunities for further research. Paper 5 [21] bridges the gap between worst-case optimality and practical performance in current twig join algorithms. Paper 6 [22] improves the construction cost of so-called forward and backward path indexes for tree data from loglinear to linear.

Acknowledgments

The day-to-day supervision of the PhD work during the first years was mostly done by the external supervisor Dr. Øystein Torbjørnsen from Fast Search and Transfer, who has been a good source of ideas and clever technical solutions. Dr. Magnus Lie Hetland from my department has been supervising the last year, and has given substantial help both scientifically and in the writing process of some papers. The visits of my formal supervisor Dr. Bjørn Olstad have been inspirational. The discussions with Dr. Felix Weigel during his internship at FAST resulted in many ideas. I would like to thank fellow PhD student Truls Amundsen Bjørklund for good times, fruitful discussions and honest feedback during our work together. Thank you Nina, for your patience, beauty and delicious cooking.

Contents

Preface
Summary
Acknowledgments
Contents
1 Introduction
    Indexing/search in semi-structured data
    Use-case: XML
    XPath and XQuery
    Abstract problem: Twig Pattern Matching
    Research scope: TPM on indexed data
    Research questions
2 Background
    Twig joins
    Twig join work-flow
    Result enumeration
    Single output query node
    Simple intermediate result architecture
    Tree position encoding
    Partial match filtering
    Intermediate result construction
    Merging input streams
    Data locality and updatability
    Twig join conclusion
    Partitioning data
    Motivation for fragmentation
    Path partitioning
    Backward and forward path partitioning
    Balancing fragmentation
    Reading data
    2.3.1 Skipping
    Skipping child matches
    Skipping parent matches
    Holistic skipping
    Virtual streams
    Virtual matches for non-branching internal query nodes
    Tree position encoding allowing ancestor reconstruction
    Virtual matches for branching query nodes
    Related problems and solutions
3 Research Summary
    Formalities
    Publications and research process
    Paper 1
    Paper 2
    Paper 3
    Paper 4
    Paper 5
    Paper 6
    Research methodology
    Evaluation of contributions
    Research questions revisited
    Opportunities revisited
    Future work
    Strong structure summaries for independent documents
    A simpler fast optimal twig join
    Simpler and faster evaluation with non-output nodes
    Ultimate data access shoot-out
    Conclusions
Bibliography
4 Included papers
    Paper 1: Faster Path Indexes for Search in XML Data
    Paper 2: On the Size of Generalised Suffix Trees Extended with String ID Lists
    Paper 3: XLeaf: Twig Evaluation with Skipping Loop Joins and Virtual Nodes
    Paper 4: Towards Unifying Advances in Twig Join Algorithms
    Paper 5: Fast Optimal Twig Joins
    Paper 6: Linear Computation of the Maximum Simultaneous Forward and Backward Bisimulation for Node-Labeled Trees
A Other Papers
    Paper 7: On performance and cache effects in substring indexes
    Paper 8: Inverted Indexes vs. Bitmap Indexes in Decision Support Systems
    Paper 9: Search Your Friends And Not Your Enemies

Chapter 1

Introduction

Research is formalized curiosity. It is poking and prying with a purpose.
Zora Neale Hurston

The thesis is submitted as a paper collection bound together by a general introduction. This chapter presents the context of the research, which is indexing and querying semi-structured data, and the abstract problem investigated, which is twig pattern matching (TPM). Chapter 2 gives a high-level introduction to techniques used in state-of-the-art TPM on indexed data. Chapter 3 lists the included published papers with short qualitative assessments, evaluates the total contribution of this thesis, and proposes future work.

1.1 Indexing/search in semi-structured data

So-called semi-structured data gives both flexibility and expressive power, and is commonly used for storing and exchanging data in heterogeneous information systems. In the semi-structured data model, documents have a structure that specifies how the different parts of the content relate to each other. This means information is contained both in the structure and the content. Documents are usually structurally self-contained, meaning that the structure can be understood from the document alone, without additional meta-data.

The focus of this thesis is algorithms and data structures for indexing and querying semi-structured data, where queries specify both structure and content. The use of semi-structured data can functionally cover both traditional structure-oriented and content-oriented data management, and the thesis therefore touches the fields of both databases and information retrieval.

1.2 Use-case: XML

XML is a simple yet flexible markup language [46], and has become the de facto standard for storing semi-structured data. An example XML document is shown in Figure 1.1. Standard XML has a tree model, where there are mainly three types of nodes in a document tree: element, attribute and text. All internal nodes in the document tree are of type element, and are given by start and end tags, such as the node with name book in the example. Text and attribute nodes are always leaf nodes. Text nodes have simple string values, while attributes have both a name and a value, such as the ISBN node in the example.

    <library>
      <book ISBN="13">
        <title>Kritik der Unvollständigkeit</title>
        <author>Kant</author>
        <author>Gödel</author>
      </book>
      ...
    </library>

Figure 1.1: Example XML document.

XPath and XQuery

XPath [45] and XQuery [47] have become standard languages for querying XML. Comparing the two, XPath is a simpler declarative language, while XQuery is a more complex language that uses XPath expressions as building blocks. The XPath expression in Figure 1.2a asks for the title of all books coauthored by Kant and Gödel. In XPath, single and double forward slashes specify child and descendant relationships between nodes, respectively. Square brackets contain predicates, and the rightmost node not part of a predicate is the output node, also called the return node. XPath queries are trees, and the tree representation of the example is shown in Figure 1.2b. In XPath there are 11 so-called axes in addition to descendant and child: parent, ancestor, following-sibling, preceding-sibling, following, preceding, attribute, namespace, self, descendant-or-self and ancestor-or-self [45]. There can also be more complex value predicates than simple tests on string equality, using functions such as count(), contains(), sum(), etc.
XQuery is a powerful language where small programs are built with path expressions as building blocks, in so-called FLWOR expressions (for, let, where, order, return). Figure 1.3 shows an XQuery program similar to the XPath expression in Figure 1.2a, which in addition orders books by title and retrieves both title and ISBN.
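As a rough illustration of what the example query asks for, the following Python sketch evaluates it by hand with the standard library's `xml.etree.ElementTree`. It is only a sketch under stated assumptions: the second book (ISBN "7", title "Solo") and the function name `coauthored_titles` are hypothetical and added only so that the predicates filter something out; this is not how an indexed XML system would evaluate the query.

```python
import xml.etree.ElementTree as ET

# Example document from Figure 1.1, plus one hypothetical extra book.
XML = """<library>
  <book ISBN="13">
    <title>Kritik der Unvollständigkeit</title>
    <author>Kant</author>
    <author>Gödel</author>
  </book>
  <book ISBN="7">
    <title>Solo</title>
    <author>Kant</author>
  </book>
</library>"""


def coauthored_titles(xml_text, required_authors):
    """Return titles of books that have an author child for every required name."""
    root = ET.fromstring(xml_text)
    titles = []
    for book in root.iter("book"):                      # descendant step: //book
        authors = {a.text for a in book.findall("author")}  # child step: author
        if set(required_authors) <= authors:            # every predicate must hold
            titles.extend(t.text for t in book.findall("title"))  # output node
    return titles


print(coauthored_titles(XML, ["Kant", "Gödel"]))
```

Each predicate in square brackets corresponds to one required author here, and the `title` children of the surviving `book` elements play the role of the output node.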

    //book[author/text()="Kant"][author/text()="Gödel"]/title

(a) Expression.

(b) Tree representation: a book node with two author children and a title child, where the author nodes have the text children "Kant" and "Gödel".

Figure 1.2: XPath example finding books coauthored by Kant and Gödel.

    for $b in doc("lib.xml")/library/book
    let $t := $b/title
    where $b/author = "Kant" and $b/author = "Gödel"
    order by $t
    return ($t, $b/@ISBN)

Figure 1.3: Example XQuery.

1.3 Abstract problem: Twig Pattern Matching

In XPath a large number of functions can be used in value predicates, and thirteen different axes dictate the relationships between nodes. The many details in the language make it hard to reason about the complexity of evaluation algorithms and hard to implement prototypes. TPM is a more abstract tree matching problem that covers a subset of XPath. It is of academic interest because a TPM solution covers the majority of the workload in most XML search systems [15].

In TPM both query and data are node-labeled trees, as shown in the example in Figure 1.4. Node predicates are on label equality, and all nodes have the same type. There are two types of query edges that dictate the relationship between data nodes in a match, ancestor-descendant (A-D) and parent-child (P-C), denoted in figures by double and single edges, respectively. The result of a TPM query is the set of mappings of query nodes to data nodes that both respect the node labels and satisfy the A-D and P-C relationships specified by the query edges. In settings with XML document collections, the data is a forest of trees, but this can easily be transformed into a single tree by adding a virtual super-root node.
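To make the matching semantics concrete, here is a brute-force Python sketch that tries every label-respecting assignment of query nodes to data nodes and keeps those that satisfy all query edges. The tiny trees, the node identifiers and the names `find_matches`, `edge_holds` and `is_ancestor` are hypothetical, and the exhaustive search is for illustration only; it is not one of the algorithms discussed in this thesis.

```python
from itertools import product

# Hypothetical data tree: node id -> (label, parent id or None).
DATA = {
    "a1": ("a", None),
    "b1": ("b", "a1"),
    "c1": ("c", "b1"),
    "c2": ("c", "a1"),
    "b2": ("b", "c2"),
}

# Hypothetical query: node id -> (label, parent id or None, edge type to parent).
# "AD" is an ancestor-descendant edge, "PC" a parent-child edge.
QUERY = {
    "qa": ("a", None, None),
    "qb": ("b", "qa", "AD"),
    "qc": ("c", "qa", "PC"),
}


def is_ancestor(data, anc, desc):
    """True if anc is a proper ancestor of desc in the data tree."""
    node = data[desc][1]
    while node is not None:
        if node == anc:
            return True
        node = data[node][1]
    return False


def edge_holds(data, parent_node, child_node, edge_type):
    """Check a P-C or A-D relationship between two data nodes."""
    if edge_type == "PC":
        return data[child_node][1] == parent_node
    return is_ancestor(data, parent_node, child_node)


def find_matches(query, data):
    """Return all mappings of query nodes to data nodes that are matches."""
    qnodes = list(query)
    # Candidate data nodes per query node: those with an equal label.
    candidates = [
        [d for d, (label, _) in data.items() if label == query[q][0]]
        for q in qnodes
    ]
    found = []
    for combo in product(*candidates):
        mapping = dict(zip(qnodes, combo))
        if all(
            parent is None or edge_holds(data, mapping[parent], mapping[q], edge)
            for q, (_, parent, edge) in query.items()
        ):
            found.append(mapping)
    return found


for m in find_matches(QUERY, DATA):
    print(m)
```

The running time is exponential in the query size, which is exactly the cost that indexing and twig joins aim to avoid.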

Figure 1.4: TPM example with a query tree on the left and a data tree on the right. One of the matches for the query in the data is shown with arrows from query nodes to data nodes. In the following, query nodes are drawn with circles and data nodes with rounded rectangles. Node labels are written with typewriter font, and the superscripts in query nodes and subscripts in data nodes are used to identify the nodes (together with the labels).

Research scope: TPM on indexed data

The scope of this thesis is twig pattern matching on indexed data, and we assume that the processes of preparing the index and evaluating queries are separate. For this strategy to be viable, the performance gain for query evaluation, compared to evaluation without an index, must justify the cost of index construction.

We use the following abstract view of an index: It is a mechanism which provides a function from some feature of a node, to nodes in the data tree that have this feature. The simplest non-trivial such feature is node label, as used in the index shown in Figure 1.5a. In a typical implementation, entries in a so-called dictionary on label point to so-called occurrence lists containing nodes with matching label.

When indexing on label, a query can be evaluated by reading the label-matching data nodes for each query node, and joining these into full query matches. The number of full query matches may be small compared to the total number of query node matches read, but if the labels on the query nodes are selective, far fewer data nodes will be processed than when evaluating the query on the data tree without an index.

Indexing on node label can be extended to indexing on path labels, the string of labels from the root to a node, as illustrated in Figure 1.5b.
This can again be extended to classify nodes not only on labels of the ancestor nodes on the path above, but also on the labels of the children in the subtree below. These indexing strategies, called structure indexing, will be discussed in the next chapter, together with so-called twig join algorithms.
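The abstract view of an index as a function from a node feature to the nodes having that feature can be sketched in a few lines of Python. The small data tree and the names `label_index`, `path_index` and `path_of` are hypothetical, and plain dictionaries stand in for the dictionary and occurrence lists of a real implementation.

```python
from collections import defaultdict

# Hypothetical data tree: node id -> (label, parent id or None).
DATA = {
    "a1": ("a", None),
    "b1": ("b", "a1"),
    "c1": ("c", "b1"),
    "b2": ("b", "a1"),
    "c2": ("c", "a1"),
}


def label_index(data):
    """Map each label to the occurrence list of nodes carrying it."""
    index = defaultdict(list)
    for node, (label, _) in data.items():
        index[label].append(node)
    return dict(index)


def path_of(data, node):
    """Return the tuple of labels from the root down to the given node."""
    labels = []
    while node is not None:
        label, parent = data[node]
        labels.append(label)
        node = parent
    return tuple(reversed(labels))


def path_index(data):
    """Map each root-to-node label path to the nodes reached by it."""
    index = defaultdict(list)
    for node in data:
        index[path_of(data, node)].append(node)
    return dict(index)


print(label_index(DATA))
print(path_index(DATA))
```

In this sketch the path index separates `c1` (path a, b, c) from `c2` (path a, c), while the label index puts both in the same occurrence list; that finer partitioning is what lets a path index read fewer nodes for selective path queries.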

(a) Indexing on label.

    a    a_1 a_2 a_3 a_4
    b    b_1 b_2 b_3 b_4 b_5 b_6
    c    c_1 c_2 c_3 c_4 c_5 c_6
    d    d_1 d_2
    e    e_1
    f    f_1

(b) Indexing on path.

    c          c_1
    c a        a_1
    c a a      a_2 a_4
    c a a a    a_3
    c a a b    b_2 b_3 b_5
    c d        d_2
    ...

Figure 1.5: Indexing the data tree from Figure 1.4.

1.4 Research questions

The following are the main research questions I have investigated during the work on this thesis:

RQ1: How can matches for tree queries be joined more efficiently?
RQ2: How can pattern matching in the dictionary be done more efficiently?
RQ3: How can structure indexes be constructed faster and using less space?

These questions will be revisited in Section 3.4.1, where I will evaluate to what extent they have been answered by my research. Note that more efficient query evaluation can mean either that all or most queries are evaluated using less time, or that queries from some important group are evaluated using less time. Preferably, faster evaluation for one group of queries should not cause slower evaluation for other groups.

Chapter 2

Background

Research is what I'm doing when I don't know what I'm doing.
Wernher von Braun

This chapter presents some underlying concepts for state-of-the-art approaches for TPM on indexed data, which will hopefully ease the understanding of the contributions in the research papers included in this thesis. A high-level conceptual overview is given instead of an in-depth description of details in state-of-the-art solutions, because this is better covered by the included papers where the specific techniques are discussed.

The following discussion divides the problem of TPM on indexed data into three somewhat orthogonal issues: how to construct full query matches from individual query node matches in so-called twig joins, how to partition the underlying data nodes such that as few as possible are read to evaluate a query, and how to efficiently read streams of data nodes during a join.

Notation. The following notation is used in the discussion: A graph $G$ has node set $V_G$ and edge set $E_G \subseteq V_G \times V_G$. All graphs are directed. A graph is a tree if all nodes have one incoming edge except the root, which has zero incoming edges. Nodes with zero outgoing edges are called leaves. A graph is called a forest if it consists of many unconnected trees, i.e., if all nodes have zero or one incoming edges. If a relation $R$ relates $x$ to $y$, this may be denoted by $x R y$, by $\langle x, y \rangle \in R$, or by $x \mapsto y \in R$. We primarily use angle brackets for graph edges, as in $\langle u, v \rangle \in E_G$, and the maps-to arrow for mappings of query nodes to data nodes, as in $q \mapsto d \in M$. The transitive closure of a relation $R$ is denoted by $R^+$.

In the problems discussed there will mostly be a query tree $Q$ and a data tree $D$, where each node $v \in V_Q \cup V_D$ has a $\mathrm{Label}(v) \in \mathcal{A}$. Assume $|\mathcal{A}| \in O(|D|)$ for simplicity. Each query edge $\langle u, v \rangle \in E_Q$ has an $\mathrm{EdgeType}(u, v) \in \{\text{A-D}, \text{P-C}\}$, specifying an ancestor-descendant or a parent-child relationship.
Remember from Section 1.3 that in TPM we have a single node type, and only differentiate nodes by label, while in XML there are different node types. We can generalize TPM to cover this by using different label codings for different node types, for example by starting element node labels with <, attribute node labels with @, and text node labels with ".

Definition 1 (Query match). Given a query tree $Q$ and a data tree $D$, a match for $Q$ in $D$ is a total¹ function $M : V_Q \to V_D$ such that $\mathrm{Label}(v) = \mathrm{Label}(M(v))$ for all $v \in V_Q$, and whenever there is an edge $\langle u, v \rangle \in E_Q$, if $\mathrm{EdgeType}(u, v) = \text{P-C}$, then there is an edge $\langle M(u), M(v) \rangle \in E_D$, or if $\mathrm{EdgeType}(u, v) = \text{A-D}$, then there is an edge $\langle M(u), M(v) \rangle \in E_D^+$.²

Revisit the example in Figure 1.4 for an illustration of a match.

Definition 2 (Twig pattern matching problem). Given a query tree $Q$ and a data tree $D$, the twig pattern matching problem is to find the set of functions that are matches for $Q$ in $D$. Denote this set of matches by $\mathcal{M}_{Q,D}$, or just $\mathcal{M}$ when there is no ambiguity.

2.1 Twig joins

In a twig join, a query is evaluated by considering a set of candidate data nodes $I_v \subseteq V_D$ for each query node $v \in V_Q$, which are joined into full query matches, where the query node $v$ is mapped to a data node in $I_v$. With label-indexing, $I_v = \{v' \in V_D \mid \mathrm{Label}(v') = \mathrm{Label}(v)\}$, the set of all data nodes with label matching that of $v$. For the example in Figure 1.4, the candidate set for query node $a^1$ would be $I_{a^1} = \{a_1, a_2, a_3, a_4\}$. Denote the total input by $I = \{v \mapsto v' \mid v \in V_Q, v' \in I_v\}$, and the set of query matches that can be constructed from this input by $O = \{M \mid M \subseteq I, M \in \mathcal{M}\}$. The following discussion assumes label indexing, but the techniques for constructing twig matches presented here are also applicable when changing the assumptions on how to index the underlying data, as discussed in Section 2.2.

In practice, a twig join accesses each set $I_v$ through an enumeration $S_v$, which typically follows a given ordering, such as tree preorder. As a base case, $S_v$ is implemented as a simple stream, where you can read out the current element, the so-called head, or forward to the next element.
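A minimal Python sketch of such a simple stream is given below. The class name `Stream`, its method names and the helper `drain` are hypothetical, and a real implementation would read from an index's occurrence list rather than an in-memory list.

```python
class Stream:
    """A forward-only stream over the candidate data nodes of one query node."""

    def __init__(self, nodes):
        self._nodes = list(nodes)  # assumed to be sorted, e.g. in tree preorder
        self._pos = 0

    def at_end(self):
        """True when no elements are left to read."""
        return self._pos >= len(self._nodes)

    def head(self):
        """Return the current element without consuming it."""
        return self._nodes[self._pos]

    def advance(self):
        """Move forward to the next element."""
        self._pos += 1


def drain(stream):
    """Read all remaining elements of a stream, in order."""
    out = []
    while not stream.at_end():
        out.append(stream.head())
        stream.advance()
    return out


s = Stream(["a1", "a2", "a3", "a4"])
print(s.head())
s.advance()
print(drain(s))
```

The interface is deliberately minimal: reading the head repeatedly does not consume it, and the only way to make progress is to advance.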
Later in this chapter we also consider cases where you can fast-forward to search for a given element in $S_v$. In some settings $S_v$ could also have random access to elements at given positions.

¹ A function $f : X \to Y$ is called total iff $f(x)$ is defined for all $x \in X$.
² There is an edge $\langle u, v \rangle \in E^+$ iff there is a simple path from $u$ to $v$ using edges from $E$.

Twig join work-flow

Early approaches used multiple binary joins to construct full query matches [56, 1], but this can give intermediate results of exponential size when the query contains A-D edges [5]. This deficiency led to the introduction of multi-way joins [5, 7, 27, 39, 33]. Current multi-way twig join algorithms generally use the following strategy, illustrated in Figure 2.1: There are two phases, temporally separate, where the first phase constructs an intermediate result data structure, and the second phase traverses this data structure to enumerate and output the set of full query matches $O$. The first phase has two components, where the first merges the streams $S_v$, materializing $I_v$ for each $v \in V_Q$, into a single stream $S$, materializing the total input set $I$, and the second constructs the intermediate results.

[Figure panels: Phase 1, Component 1: input stream merger. Phase 1, Component 2: intermediate result construction. Phase 2: result enumeration.]

Figure 2.1: Work-flow of twig join algorithms.

Figure 2.2 illustrates why the two phases are temporally separate: in the worst case, all the data must be read before it is known whether or not the nodes in the input are useful. On the other hand, use of the two components in Phase 1 can be temporally overlapping, because Component 2 reads data and query node pairs from Component 1 in some order that can be implemented without lookahead in the individual streams. Note that for some combinations of query and data, the construction of intermediate results is not necessary for linear evaluation (as we exploit in Paper 3 included in Chapter 4).

Figure 2.2: Example showing why Phase 1 and Phase 2 are temporally separate. When the input streams are sorted in tree preorder, it cannot be known whether $b_1, \ldots, b_n$ are part of a query match before $c_{n+1}$ is seen, or whether $c_1, \ldots, c_n$ are part of a query match before $b_{n+1}$ is seen.

Note that there is no stream ordering such that all twig queries can be evaluated without storing intermediate results [10].

To understand the design choices in the approach depicted in Figure 2.1, it is easiest to start with the last step, result enumeration, and work backwards. The section "Result enumeration" sketches a generic algorithm for enumerating results, and the section "Simple intermediate result architecture" sketches the layout of a generic data structure that enables evaluating that algorithm in linear time.
With this as a starting point, I go through various techniques and strategies for implementing the generic approach. The section "Tree position encoding" briefly presents a common tree position encoding that makes it possible to decide A-D and P-C relationships between data nodes in the various streams in constant time. The section "Partial match filtering" describes two common data node filtering strategies, and the section "Intermediate result construction" shows how one of these can be used to realize the conceptual data structure from the section "Simple intermediate result architecture" in linear time. The section "Merging input streams" describes the input stream merge component, where filtering strategies can be used for practical speedups.

Result enumeration

Algorithm 1 gives a high-level description of how to output all unique query matches that can be constructed from the input. The approach is a generalization of what is used in state-of-the-art twig joins [7, 27, 39, 33]. The algorithm recursively constructs full query matches from partial matches that are known to be part of full query matches, denoted here as partial full matches. Formally, a partial full match is an $M'$ such that $M' \subseteq M$ for some full query match $M \in \mathcal{M}$. The set of all partial full matches is $\mathcal{M}' = \{M' \mid \exists M \in \mathcal{M} : M' \subseteq M\}$.

Algorithm 1 Result enumeration

    Denote the set of partial full matches by $\mathcal{M}'$.
    Start with $M' = \{\}$, an empty partial full twig match.
    Assume any fixed ordering of the nodes in $Q$, and let $v \in Q$ be the first node in this ordering.
    For all $v'$ such that $\{v \mapsto v'\} \in \mathcal{M}'$:
        Call Recurse($v \mapsto v'$).

    The function Recurse($u \mapsto u'$):
        Insert $u \mapsto u'$ into $M'$.
        If $|M'| = |Q|$:
            Output $M'$.
        Otherwise:
            Let $v$ be the node following $u$ in $Q$.
            For all $v \mapsto v'$ such that $M' \cup \{v \mapsto v'\} \in \mathcal{M}'$:
                Recurse($v \mapsto v'$).
        Remove $u \mapsto u'$ from $M'$.

Example 1. We evaluate the query and data in Figure 1.4 using Algorithm 1, and order query nodes in tree preorder. A candidate match for query node $a^1$ that is part of a full match is data node $a_1$, and hence one of the top-level calls to Recurse will be with the parameter $u \mapsto u'$ set to $a^1 \mapsto a_1$. After this pair has been inserted into $M'$, we consider the query node $b^1$, which follows $a^1$ in tree preorder. Since $M' = \{a^1 \mapsto a_1\}$, and $M' \cup \{b^1 \mapsto b_1\}$ is a partial full match, $b^1 \mapsto b_1$ is one of the pairs we recurse with. In that recursive call we have $M' = \{a^1 \mapsto a_1, b^1 \mapsto b_1\}$, and consider matches for the final query node $c^1$.
As $\{a^1 \mapsto a_1, b^1 \mapsto b_1\} \cup \{c^1 \mapsto c_1\}$ is a partial full match, we again recurse with $c^1 \mapsto c_1$, and output the new $M'$, since it is a complete full match.

Assume that the set of partial full matches $\mathcal{M}'$ does not have to be materialized, and that given a partial full match $M' \in \mathcal{M}'$, where all nodes $u$ preceding $v$ have a mapping $u \mapsto u' \in M'$, all $v \mapsto v'$ such that $M' \cup \{v \mapsto v'\} \in \mathcal{M}'$ can be traversed in time linear in their number. Under these assumptions the algorithm can be evaluated in $O(|O| \cdot |Q|)$ time, linear in the total number of data nodes in the output. The intuition is that each recursive call constructs in constant time a partial full match not seen before, and that each partial full match is a prefix of at least one full query match. Since a full match has $|Q|$ non-empty prefixes in the fixed ordering, there are at most $|O| \cdot |Q|$ recursive calls.

Single output query node

In TPM the answers in the result set are all legal ways of matching the query nodes to the data nodes, but in many information retrieval settings other semantics may be more useful. In the XPath language [45] queries have a single output node, and the result set contains all matches for this query node that are part of some full query match. In the XQuery language [47], which is used for more complex information retrieval and processing, there can be any number of output and non-output nodes in the query.

Only minor changes are needed in Algorithm 1 for this generalized case with both output and non-output query nodes. A simple solution is to put the output query nodes first in the fixed ordering, and stop the recursion before non-output nodes are considered. Note that practical data structures that enable linear enumeration for any combination of output and non-output nodes [7] are not as simple as the data structures described in the following sections.

Simple intermediate result architecture

Figure 2.2 illustrated why it is not possible to output query matches directly by just inspecting the heads of the streams for each query node. In the example all the nodes labeled c must be read before it can be known whether or not any of the nodes labeled b are useful, and vice versa. The purpose of storing intermediate results is to organize the data nodes in such a way that an implementation of the approach in Algorithm 1 can be evaluated efficiently.
If the query nodes are ordered in tree preorder, it is natural to maintain for each $u \mapsto u'$ that is part of a full query match, for each child $v$ of $u$, the list of pairs $v \mapsto v'$ used together with $u \mapsto u'$ in some full query match. Figure 2.3 illustrates this strategy. In addition to the lists of pointers to useful child query node matches for each pair, there must be a list of pointers to the data nodes that match the query root in full query matches.

Figure 2.3: Generic intermediate results for the data tree in Figure 1.4.
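The following Python sketch shows how Algorithm 1 could be run over intermediate results of this shape, with query nodes taken in tree preorder. It is a sketch under stated assumptions: the structure is written out by hand rather than constructed by a join, and the names `PREORDER`, `PARENT`, `CHILD_MATCHES`, `ROOT_MATCHES` and `enumerate_matches`, as well as the node identifiers, are all hypothetical.

```python
# Hypothetical query with root qa and children qb and qc, in tree preorder.
PREORDER = ["qa", "qb", "qc"]
PARENT = {"qa": None, "qb": "qa", "qc": "qa"}

# Hand-written intermediate results: for each (query node, data node) pair
# that is part of a full match, the usable data nodes per child query node.
CHILD_MATCHES = {
    ("qa", "a1"): {"qb": ["b1", "b2"], "qc": ["c1"]},
    ("qa", "a2"): {"qb": ["b2"], "qc": ["c2"]},
}

# Data nodes matching the query root in some full match.
ROOT_MATCHES = ["a1", "a2"]


def enumerate_matches(preorder, parent, roots, child_matches):
    """Yield every full match by extending partial full matches in preorder."""
    match = {}

    def recurse(i, data_node):
        q = preorder[i]
        match[q] = data_node                 # insert q -> data_node
        if i + 1 == len(preorder):
            yield dict(match)                # all query nodes are mapped
        else:
            nxt = preorder[i + 1]
            p = parent[nxt]                  # parent precedes nxt in preorder
            for d in child_matches[(p, match[p])][nxt]:
                yield from recurse(i + 1, d)
        del match[q]                         # remove q -> data_node again

    for r in roots:
        yield from recurse(0, r)


for m in enumerate_matches(PREORDER, PARENT, ROOT_MATCHES, CHILD_MATCHES):
    print(m)
```

Every pair followed from a root pointer or a child pointer leads to at least one output, so no work is spent on dead ends; this is the property that makes enumeration linear in the size of the output.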

This data structure takes $O(|I| + |O| \cdot |Q|)$ space, linear in the size of the input and output, because the lists of data nodes take $O(|I|)$ space, and each root pointer or child match pointer is used at least once in Algorithm 1, which has time complexity $O(|O| \cdot |Q|)$.

The following intuition shows how this data structure can be used to efficiently implement Algorithm 1 when query nodes are ordered in tree preorder:

(i) The pairs $v \mapsto v'$ in the initial calls in the outer for-loop are trivially found by traversing the list of pointers to full match roots.

(ii) In a recursive call, after $u \mapsto u'$ has been added to $M'$, the current $M'$ is a partial full match by assumption. Let $v$ be the node following $u$ in preorder, and let $p$ be the parent of $v$ (possibly $p = u$). All query nodes preceding $v$ have a mapping in $M'$, and assume $M'(p) = p'$. Let $Q_p$ and $Q_v$ be the subgraphs resulting from removing the edge $\langle p, v \rangle$ from $Q$. These subqueries can be matched independently when the mapping of both $p$ and $v$ is fixed in a way such that $\mathrm{EdgeType}(p, v)$ is satisfied. If $v \mapsto v'$ is used in some full query match together with $p \mapsto p'$, we know that $\langle p', v' \rangle$ satisfies $\mathrm{EdgeType}(p, v)$. Then, if $M'$ is a partial full match, $M' \cup \{v \mapsto v'\}$ must also be a partial full match.

Example 2. This example illustrates how to implement the data access for Example 1 using the data structure in Figure 2.3. The first match for the query root $a^1$ that is part of a full match is the data node $a_1$, and hence the first non-empty partial full match in Algorithm 1 is $M' = \{a^1 \mapsto a_1\}$. When considering the next query node in preorder, $b^1$, we see from the pointers in the data structure that $b_2$ is the first data node usable together with $a_1$. Hence the next partial full match is $M' = \{a^1 \mapsto a_1, b^1 \mapsto b_2\}$. Then, when considering the next query node $c^1$, we see that the data node $c_5$ is the only data node usable with $a_1$, the current match for $a^1$, the parent of query node $c^1$.
We insert $c^1 \mapsto c_5$ to get the full match $M' = \{a^1 \mapsto a_1, b^1 \mapsto b_2, c^1 \mapsto c_5\}$.

Tree position encoding

To construct the intermediate results efficiently it must be decidable from position information following the data nodes whether or not they satisfy A-D and P-C relationships. A common solution is the interval-based BEL encoding [56], where each node is given integer numbers begin, end and level, as shown in Figure 2.4.

    1,10,1
        2,5,2
            3,3,3
            4,4,3
        6,9,2
            7,7,3
            8,8,3

Figure 2.4: The BEL encoding for a tree, with begin, end and level numbers.

This encoding is similar to preorder and postorder traversal numbers, and can be computed in a depth-first traversal of the tree. The reason the encoding is often preferred is probably that the begin and end numbers correspond to the document position of opening and closing tags in XML.
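As a sketch of how such an encoding can be computed in one depth-first traversal, the Python code below reproduces the numbers of Figure 2.4. The numbering convention is an assumption inferred from that figure (a leaf gets end equal to begin, an internal node gets a fresh end number after its children), and the names `bel_encode`, `is_ancestor` and `is_parent` and the node identifiers are hypothetical. The relationship tests use the usual interval check for this encoding: a is an ancestor of b when a.begin < b.begin < a.end, and a parent when the levels also differ by exactly one.

```python
def bel_encode(tree, root):
    """Compute (begin, end, level) for every node in a depth-first traversal.

    tree maps a node to the ordered list of its children; leaves may be absent.
    """
    encoding = {}
    counter = 0

    def visit(node, level):
        nonlocal counter
        counter += 1
        begin = counter
        children = tree.get(node, [])
        for child in children:
            visit(child, level + 1)
        if children:
            counter += 1  # internal nodes get a separate end number
        encoding[node] = (begin, counter, level)

    visit(root, 1)
    return encoding


def is_ancestor(a, b):
    """Interval test on (begin, end, level) triples."""
    return a[0] < b[0] < a[1]


def is_parent(a, b):
    """Ancestor test plus a level difference of exactly one."""
    return is_ancestor(a, b) and a[2] + 1 == b[2]


# The tree shape of Figure 2.4, with hypothetical node names.
TREE = {"r": ["x", "y"], "x": ["x1", "x2"], "y": ["y1", "y2"]}
ENC = bel_encode(TREE, "r")
for node, triple in sorted(ENC.items(), key=lambda item: item[1][0]):
    print(node, triple)
```

Both tests need only the two triples being compared, which is why relationships between nodes from different streams can be decided in constant time.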

With the BEL encoding, a node a is an ancestor of a node b iff a.begin < b.begin and b.begin < a.end, and it is a parent if also a.level + 1 = b.level. Sorting on begin or end numbers respectively gives the same sorting orders as preorder and postorder traversal numbers.

There exists a large number of tree position encodings with different properties [50]. Some allow decision of more types of node relationships, and some allow reconstruction of related nodes. They differ in the computational cost of evaluating relationships, space usage, and how well they handle updates in the data tree.

Partial match filtering

When constructing intermediate results it is often possible to filter out some query and data node pairs that will never be part of a full query match. In current twig join algorithms filtering is used for practical speedup [5, 27, 33], and as a necessity for worst-case efficient result enumeration [7]. A filtering strategy does not have to be perfect, but it must not remove pairs that are part of full query matches. In other words, it can have false positives, but not false negatives.

Most filtering strategies are based on the observation that if there is some subquery (a subgraph of the query), such that the pair $v \mapsto v'$ is not part of any match for the subquery, then $v \mapsto v'$ is not part of any match for the entire query, and can safely be thrown away [21]. The two most common filtering strategies are illustrated in Figure 2.5. The first is based on checking if query prefix paths are matched [5, 27, 33], and the second on checking if query subtrees are matched [7, 39, 33]. The prefix path of a query node is the subquery containing the nodes on the path from the root down to the node.

(a) Query. (b) Matching prefix paths. (c) Matching subtrees.

Figure 2.5: Matching query parts.

We call a pair $v \mapsto v'$ that is part of a prefix path match for $v$ a prefix path matcher. Filtering query and data node pairs on whether or not they are prefix path matchers is easy to implement with an inductive strategy: Assuming that $v \in Q$ has parent $u$, the pair $v \mapsto v'$ is a prefix path matcher for $v$ if and only if there exists a pair $u \mapsto u'$ that is a prefix path matcher for $u$ such that $\langle u', v' \rangle$ satisfies the A-D or P-C relationship specified by $\mathrm{EdgeType}(u, v)$ [5]. Prefix path filtering is easiest to implement when data nodes are seen in tree preorder, where ancestors are seen before descendants.

Example 3. Figure 2.5b illustrates prefix path match checking. The pair $a^1 \mapsto a_1$ is trivially a prefix path matcher, and $b^1 \mapsto b_3$ must then be a prefix path matcher because $\mathrm{EdgeType}(a^1, b^1) = \text{A-D}$ and $\langle a_1, b_3 \rangle \in E_D^+$. This again implies that $f^1 \mapsto f_1$ must be a prefix path matcher because $\mathrm{EdgeType}(b^1, f^1) = \text{P-C}$ and $\langle b_3, f_1 \rangle \in E_D$.

Filtering pairs on whether or not they are subtree matchers can be implemented with a similar strategy: The pair $v \mapsto v'$ is a subtree matcher if and only if for each child $w$ of $v$, there exists a subtree matcher $w \mapsto w'$ such that $\langle v', w' \rangle$ satisfies the A-D or P-C relationship specified by $\mathrm{EdgeType}(v, w)$ [7]. Subtree match filtering is easiest to implement when data nodes are seen in tree postorder.

Example 4. Figure 2.5c illustrates subtree match checking. The pairs $f^1 \mapsto f_1$, $b^2 \mapsto b_4$ and $c^1 \mapsto c_5$ are trivially subtree matchers because the query nodes are leaves. The pair $b^1 \mapsto b_3$ is a subtree matcher because $f^1 \mapsto f_1$ is a subtree matcher and $\langle b_3, f_1 \rangle \in E_D$ satisfies $\mathrm{EdgeType}(b^1, f^1) = \text{P-C}$, and because $b^2 \mapsto b_4$ is a subtree matcher and $\langle b_3, b_4 \rangle \in E_D^+$ satisfies $\mathrm{EdgeType}(b^1, b^2) = \text{A-D}$. The pair $a^1 \mapsto a_1$ is a subtree matcher, because $b^1 \mapsto b_3$ is a subtree matcher and $\langle a_1, b_3 \rangle \in E_D^+$ satisfies $\mathrm{EdgeType}(a^1, b^1) = \text{A-D}$, and because $c^1 \mapsto c_5$ is a subtree matcher and $\langle a_1, c_5 \rangle \in E_D$ satisfies $\mathrm{EdgeType}(a^1, c^1) = \text{P-C}$.

Intermediate result construction

The filtering on matched subtrees described in the previous section is strongly related to a strategy that can be used to efficiently build a data structure that realizes the conceptual structure depicted in Figure 2.3.
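The two inductive filtering rules, prefix path matching (top-down) and subtree matching (bottom-up), can be sketched directly in Python. This is a simple quadratic illustration under stated assumptions, not the streaming filters used in actual twig joins: the trees, the node identifiers and the names `satisfies`, `prefix_path_matchers` and `subtree_matchers` are hypothetical, and the query dictionary is assumed to list parents before children.

```python
# Hypothetical data tree: node id -> (label, parent id or None).
DATA = {
    "a1": ("a", None),
    "b1": ("b", "a1"),
    "a2": ("a", "b1"),
    "b2": ("b", "a2"),
    "c1": ("c", "a1"),
    "c2": ("c", "b2"),
}

# Hypothetical query, parents listed before children:
# node id -> (label, parent id or None, edge type to parent).
QUERY = {
    "qa": ("a", None, None),
    "qb": ("b", "qa", "AD"),
    "qc": ("c", "qa", "PC"),
}


def satisfies(data, upper, lower, edge_type):
    """Check that data nodes upper and lower satisfy a P-C or A-D edge."""
    if edge_type == "PC":
        return data[lower][1] == upper
    node = data[lower][1]
    while node is not None:
        if node == upper:
            return True
        node = data[node][1]
    return False


def prefix_path_matchers(query, data):
    """Top-down: keep (v, d) iff some kept pair for the parent of v fits d."""
    kept = {}
    for q, (label, parent, edge) in query.items():  # parents come first
        cands = [d for d, (lab, _) in data.items() if lab == label]
        if parent is not None:
            cands = [d for d in cands
                     if any(satisfies(data, p, d, edge) for p in kept[parent])]
        kept[q] = cands
    return kept


def subtree_matchers(query, data):
    """Bottom-up: keep (v, d) iff every child of v has a kept pair below d."""
    kept = {}
    for q in reversed(list(query)):                 # children come first
        children = [c for c, (_, p, _) in query.items() if p == q]
        kept[q] = [
            d for d, (lab, _) in data.items()
            if lab == query[q][0] and all(
                any(satisfies(data, d, w, query[c][2]) for w in kept[c])
                for c in children)
        ]
    return kept


print(prefix_path_matchers(QUERY, DATA))
print(subtree_matchers(QUERY, DATA))
```

On this example the two filters remove different pairs: prefix path filtering drops `c2` (no matching parent) but keeps `a2`, while subtree filtering drops `a2` (no child labeled c) but keeps `c2`. Neither filter removes a pair that takes part in a full match.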
What is described in the following is a slight simplification of what is used in the Twig$^2$Stack [7] algorithm, which was the first twig join algorithm with cost linear in the size of the input data and the output result set. The reason preorder processing of data nodes and filtering on matched prefix paths is not a suitable starting point for a worst-case efficient algorithm is that, even though paths in the data do match paths in the query, it is hard to figure out on the fly during preorder processing whether or not other paths in the query can use the same branching nodes. On the other hand, with postorder processing, matches for the query can be constructed bottom up by combining subtree matches into bigger subtree matches. The storage order of data nodes in the index does not have to be changed for postorder processing, as a preorder stream of match pairs can be translated to a postorder stream with a stack: When a pair $v \mapsto v'$ is read in preorder, all pairs $u \mapsto u'$ on the stack such that $u'$ is not an ancestor of $v'$ are popped off and processed one by one, before $v \mapsto v'$ is pushed on the stack. When following the strategy from Sections and 2.1.3, the key to efficient enumeration of results is the ability to efficiently find usable subtree matches. Given a candidate $v \mapsto v'$, we need to find for all children $w$ of $v$ the list of matchers $w \mapsto w'$ such that $\langle v', w' \rangle$ satisfies EdgeType($v$, $w$). Subtree matches for the query root are trivially full query matches.
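The stack-based translation from a preorder stream to a postorder stream can be sketched as follows. This is only an illustration under stated assumptions: data nodes are represented by hypothetical (begin, end) interval positions in the style of a BEL encoding, the function name `preorder_to_postorder` is made up for this example, and pairs for the same data node are kept on the stack together so that they leave it in reverse order.

```python
def preorder_to_postorder(pairs):
    """Translate a preorder stream of match pairs into a postorder stream.

    pairs: iterable of (query_node, (begin, end)), sorted on begin.
    The (begin, end) interval of an ancestor encloses the intervals of its
    descendants (assumed encoding, not prescribed by the text).
    """
    stack = []
    for query_node, (begin, end) in pairs:
        # Pop every pair whose data node has ended before the new node
        # begins, i.e. whose data node is not an ancestor of the new one.
        while stack and stack[-1][1][1] < begin:
            yield stack.pop()
        stack.append((query_node, (begin, end)))
    # The nodes left on the stack form a root-to-leaf path; pop them bottom up.
    while stack:
        yield stack.pop()


example = [("a", (1, 10)), ("b", (2, 5)), ("c", (3, 4)),
           ("c", (6, 7)), ("b", (8, 9))]
postorder = list(preorder_to_postorder(example))
```

Each pair is pushed and popped exactly once, so the translation adds only linear overhead to the stream processing.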

The overall strategy for the proposed data structure is to maintain for each query node $v$ a list of disjoint trees $T_v$ consisting of node matches from the stream $S_v$, as shown in Figure 2.6. Some additional dummy nodes are used to bind the trees together. For each data node in the trees for a query node, there is a list of pointers to usable child query node matches. P-C matches are pointed to directly, while A-D matches are found in the entire subtrees pointed to.

Figure 2.6: Postorder construction of intermediate results for the data and query in Figure 1.4.

Algorithm 2 shows how this data structure can be constructed, specifying the processing of a single pair $v \mapsto v'$ in postorder. For each query node $v$, there is a list $T_v$ of disjoint trees consisting of subtree matchers $v \mapsto v'$ where $v' \in S_v$. When processing a pair $v \mapsto v'$, the trees where the root data nodes are descendants of $v'$ are joined into single trees, both in the lists $T_w$ for the children $w$ of $v$, and in the list $T_v$ for $v$ itself. For P-C edges, pointers from $v \mapsto v'$ to $w \mapsto w'$ denote single direct child matches, while for A-D edges, pointers denote that entire subtrees contain matches. A pair $v \mapsto v'$ is only added if there is at least one pointer for each child $w$ of $v$, and this effectively implements the subtree match filtering described earlier.

Example 5. Figure 2.7 shows the step processing $a_1 \mapsto a_1$ when constructing intermediate results for the data and query from Figure 1.4 with Algorithm 2. The trees at the end of $T_{b_1}$, where the roots are $b_2$ and a dummy node, are joined into a single tree. So are the trees at the end of $T_{c_1}$, where the roots are $c_3$, $c_4$ and $c_5$. Pointers are added from $a_1 \mapsto a_1$ to the tree of descendants in $T_{b_1}$, and to the child match $c_1 \mapsto c_5$ in $T_{c_1}$. Since $a_1 \mapsto a_1$ has pointers both to matches for $b_1$ and $c_1$, it is a subtree match, and is added to $T_{a_1}$.
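A minimal Python sketch of the processing step of Algorithm 2 may help to make the description concrete. It is not the Twig$^2$Stack implementation: the class `Match`, the function names, the (begin, end, level) position tuples and the `"PC"`/`"AD"` edge labels are all assumptions made for this illustration, and result enumeration is left out.

```python
from dataclasses import dataclass, field


@dataclass
class Match:
    begin: int                  # start of the data node interval
    end: int                    # end of the data node interval
    level: int                  # depth of the data node (-1 for dummy nodes)
    dummy: bool = False
    children: list = field(default_factory=list)   # tree children in the same list T_v
    pointers: dict = field(default_factory=dict)   # child query node -> pointed-to matches


def inside(node, begin, end):
    """True if the node's interval is strictly inside (begin, end)."""
    return begin < node.begin and node.end < end


def trailing_descendants(trees, begin, end):
    """Index of the first trailing root that is a descendant of (begin, end)."""
    k = len(trees)
    while k > 0 and inside(trees[k - 1], begin, end):
        k -= 1
    return k


def process(query, T, v, data_node):
    """Process one pair (v, data_node) in postorder; return the match or None.

    query: dict mapping a query node to a list of (child, edge_type) tuples.
    T:     dict mapping a query node to its list of disjoint trees.
    """
    begin, end, level = data_node
    match = Match(begin, end, level)
    for w, edge in query[v]:
        k = trailing_descendants(T[w], begin, end)
        tail = T[w][k:]
        if edge == "PC":
            # Direct children can only be real (non-dummy) roots one level down.
            direct = [r for r in tail if not r.dummy and r.level == level + 1]
            if direct:
                match.pointers[w] = direct
        if len(tail) > 1:
            # Bind the trailing trees together under a single dummy root.
            merged = Match(tail[0].begin, tail[-1].end, -1, dummy=True, children=tail)
            T[w][k:] = [merged]
            tail = [merged]
        if edge == "AD" and tail:
            match.pointers[w] = tail        # one pointer to a whole subtree
    if any(w not in match.pointers for w, _ in query[v]):
        return None                         # not a subtree matcher: discard
    k = trailing_descendants(T[v], begin, end)
    match.children = T[v][k:]
    T[v][k:] = [match]
    return match
```

The function is meant to be called once per pair of a postorder stream, with `T` initialised to an empty list for every query node.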
Algorithm 2: Postorder intermediate result construction

Function Process($v \mapsto v'$):
    For each child $w$ of $v$:
        Let $T'_w$ be the trees at the end of $T_w$ where root nodes are descendants of $v'$.
        If EdgeType($v$, $w$) = P-C:
            Add pointers from $v \mapsto v'$ to all $w \mapsto w'$ in $T'_w$ where depth($w'$) = depth($v'$) + 1.
        If $|T'_w| > 1$:
            Replace $T'_w$ by a dummy node with the trees from $T'_w$ as children.
        If EdgeType($v$, $w$) = A-D and $|T'_w| > 0$:
            Add a descendant pointer from $v \mapsto v'$ to the single node in $T'_w$.
    If $v \mapsto v'$ does not have at least one pointer per child $w$ of $v$:
        Discard $v \mapsto v'$ and return failure.
    Remove from the end of $T_v$ all roots where data nodes are descendants of $v'$, add them as children of $v \mapsto v'$, and append $v \mapsto v'$ to $T_v$.

Figure 2.7: A step in postorder construction of intermediate results for the data and query in Figure 1.4. (a) Before adding $a_1 \mapsto a_1$. (b) After adding $a_1 \mapsto a_1$. Dotted boxes give the current list of trees $T_v$ for each $v \in V_Q$.

When evaluating the input $I$ with Algorithm 2, the total number of calls to the procedure Process() would be $\sum_{v \in V_Q} |S_v| = |I|$, and the total number of rounds in the for-loop would be $\sum_{v \in V_Q} |I_v| \cdot b_v \in O(|I| \cdot b_Q)$, where $b_v$ is the number of children of $v$ and $b_Q$ is the maximal number of children for any node in $Q$. Apart from constant time operations for each input $v \mapsto v'$ and each child $w$ of $v$, there is some non-trivial cost associated with merging trees and adding pointers to P-C and A-D child matches. A merge attempt either inspects only one tree root and does not change $T_v$, or inspects $k > 1$ roots, removes $k - 1$ roots from $T_v$ and adds a new one. This means that the cost of merge operations is bounded by the number of attempts and the sizes of the trees, i.e., $\sum_{v \in V_Q} O(|I_v| + |I_v| \cdot b_v)$. Now consider the cost of adding pointers from matches for a query node $v$ to matches for a child query node $w$. If EdgeType($v$, $w$) = A-D, then only a single edge is added from each $v \mapsto v'$. If EdgeType($v$, $w$) = P-C, then only a single edge is added to each $w \mapsto w'$, as a node can have only one parent. In conclusion, the total cost of using Algorithm 2 is $\sum_{v \in V_Q} O(|I_v| + |I_v| \cdot b_v) \subseteq O(|I| + |I| \cdot b_Q)$.

What is presented here is a slight simplification of the Twig$^2$Stack algorithm [7]. The main difference between the above depiction and Twig$^2$Stack is that in the latter, the data structure for each query node is a list of trees of stacks of nodes, instead of simply lists of trees of nodes. Many alternative twig join algorithms have been presented [27, 39, 33] in the years following the publication of the Twig$^2$Stack algorithm. What is common to these algorithms is that they have improved practical performance, but higher worst-case complexity in the result enumeration phase. An example is the TwigList algorithm, which stores intermediate nodes in simple vectors instead of trees, and implements a weaker form of subtree filtering, where all query edges are considered to have type A-D.

Merging input streams

The final component missing to implement the strategy in Figure 2.1 is the input stream merger. The input to the merge is one preorder sorted stream representing $I_v$ for each $v \in Q$, and the desired output is a sorted stream representing $I$.
The sort order required by the intermediate result construction is that the pairs $v \mapsto v' \in I$ are sorted primarily on the preorder of the data nodes, and secondarily on the postorder of the query nodes. This means that after translating the stream into data node postorder with a stack, the new stream is sorted secondarily on query node preorder. This is required by Algorithm 2 for cases where a single data node matches multiple query nodes, as a data node could hide useful children of itself if the sorting was not secondarily on query node preorder.

The simplest merge approach is to traverse the query in postorder, and find some minimum $v \mapsto v'$ by taking a preorder minimum $v'$ that is head of a stream $I_v$ for a postorder minimal $v$. This takes $\Theta(|Q|)$ time per extraction, and gives a total cost of $\Theta(|I| \cdot |Q|)$ for the merge. An asymptotically better approach is to organize the individual streams in a priority queue implemented with a binary heap, sorted primarily on the heads of the streams and secondarily on the query nodes. Extractions then take $O(\log |Q|)$ time, and the total cost is $O(|I| \log |Q|)$ [11].

Since the preorder and postorder tree traversal numbers we are sorting on are bounded by the size of the input, the sorting complexity is not loglinear, but linear under the unit cost assumption. The entire set $I$ can be put in a single array, and sorted using radix sort in $\Theta(|I|)$ time [11]. As the intermediate result construction is already $O(|I| \cdot b_Q)$, the radix sort approach gives no asymptotic advantage over the heap based approach when $\log |Q| \in O(b_Q)$. Since the latter uses much less memory in practice, $\Theta(|Q|)$ instead of $\Theta(|I|)$, it is preferable in most real-world scenarios.
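The heap-based merge can be illustrated with Python's standard `heapq` module. This is a sketch under assumptions rather than the merger of any particular system: streams are plain lists of integer preorder positions, `query_postorder` is a hypothetical mapping from query nodes to their postorder ranks, and the function name `merge_streams` is made up for this example.

```python
import heapq


def merge_streams(streams, query_postorder):
    """Merge per-query-node streams into one sorted stream of pairs.

    streams:         dict query_node -> list of data node preorder positions,
                     each list sorted ascending.
    query_postorder: dict query_node -> postorder rank in the query tree.
    Yields (query_node, position), sorted primarily on position and
    secondarily on the postorder rank of the query node.
    """
    iterators = {}
    heap = []
    for q, stream in streams.items():
        it = iter(stream)
        head = next(it, None)
        if head is not None:
            iterators[q] = it
            heap.append((head, query_postorder[q], q))
    heapq.heapify(heap)                      # one heap entry per non-empty stream
    while heap:
        position, rank, q = heap[0]          # current minimum stream head
        yield q, position
        following = next(iterators[q], None)
        if following is None:
            heapq.heappop(heap)              # stream exhausted
        else:
            heapq.heapreplace(heap, (following, rank, q))


streams = {"a": [1], "b": [2, 8], "c": [3, 6]}
ranks = {"b": 0, "c": 1, "a": 2}
merged = list(merge_streams(streams, ranks))
```

The heap never holds more than one entry per query node, which corresponds to the $\Theta(|Q|)$ memory usage mentioned above.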

Some of the newer twig join algorithms storing intermediate results in preorder [27, 33] use an $O(|I| \cdot |Q|)$ input stream merge component that implements a weak form of subtree match filtering, where all query edges are considered to have type A-D [5]. The merger uses only $O(|Q|)$ memory and is very fast in practice because queries are typically small. It returns data nodes in a relaxed preorder, where the ordering is only guaranteed between matches for query nodes related by ancestry. This stream is not easily translated into postorder, and hence the merger is not used for postorder processing algorithms [21].

Data locality and updatability

This chapter does in general not make a distinction between data stored in main memory and on disk, but in practical implementations it is important to consider the costs of different access patterns in different media. While main memory on modern computers does not really have a uniform memory access cost, due to the use of caches, we can design usable systems that use random memory reads and writes. On the other hand, if the data is so large it must reside on disk, a system that uses a lot of random access will not be efficient in practice.

Consider now the different phases and components in our twig join strategy. The input stream merger is assumed to only inspect stream heads and store a minimal amount of state. Hence it should work well on an architecture where the candidate matches for each query node are streamed from disk. The intermediate result construction, as shown in Algorithm 2, inspects in each call a number of tree roots stored contiguously at the end of the current list of trees for each query node. This in itself is simple to implement with good spatial locality, but it should also be considered how the layout of data affects the result enumeration phase.
Luckily, if intermediate nodes are streamed onto disk and inserted into blocks in postorder, most nodes that are close in the data tree will be stored closely on disk. This strategy will give fairly good spatial locality during result enumeration [7]. The problem of intermediate results exceeding the size of main memory can be avoided in many practical cases, by observing that when the uppermost candidate match for the root query node is closed, none of the data nodes seen so far in the tree preorder will be used in any match involving data nodes later in the tree preorder [7]. This means that when the uppermost query root match candidate is closed, the current intermediate data can be used to enumerate the current set of query matches, before this data is discarded.

Example 6. Consider the data in Figure 1.4, and an algorithm that pushes nodes onto a stack in preorder and pops them off in postorder. When the data node $b_6$ is processed, it causes the popping of $a_1$, and there are no more a-nodes on the stack. As a match for the query node $a_1$ must be above the match for any other query node in a full query match, no nodes preceding $b_6$ in the data will be involved in a match together with nodes following and including $b_6$. Hence we can enumerate results, and delete the current intermediate data structures.

In many practical cases with large amounts of data, the underlying information is stored in a large number of independent documents of moderate size, and in these cases the above trick is always applicable. Data updates are also easy to handle in such a setting. A way of encoding global data node positions is by combining document identifiers and local node position encodings, such as BEL, and this simplifies updates: Updating a document can be viewed as deleting it and then re-adding it with a new document identifier, as is common in search systems for unstructured data [51]. Note that when the data is a single large tree that cannot easily be partitioned into independent documents, we need a node position encoding that has affordable cost for tree updates. There exist a number of such encodings with different properties [50].

Twig join conclusion

We have now discussed all the components in a state-of-the-art twig join algorithm, and the costs of the different components are: input stream merge: $O(|I| \log |Q|)$ for the heap-based approach, intermediate results construction: $O(|I| \cdot b_Q)$, and result enumeration: $O(|O| \cdot |Q|)$. This gives a total combined data, query and result complexity of $O(|I| \log |Q| + |I| \cdot b_Q + |O| \cdot |Q|)$. Commonly the size of the query is viewed as a constant, and twig join algorithms are called linear and optimal if the combined data and result complexity is $O(|I| + |O|)$.

2.2 Partitioning data

In the previous discussion it was assumed that the data nodes were partitioned on label in the index. This section considers the advantages and challenges that arise from more advanced indexing strategies.

Motivation for fragmentation

Let us first recap the introduction to the general strategy for TPM on indexed data from Section 1.3.1: The index is a mechanism which provides a function from some feature of a node to the set of nodes in the data that have this feature. The main motivation for using an index is of course reading and processing less data during query processing. If node labels are selective then simple label partitioning is an efficient approach, but this is not always the case. Figure 2.8 shows a case with many label-matches for the individual query nodes in the data, but only a few full matches for the query.
The above example may be unrealistic, but reconsider the data in Figure 1.1 and the query in Figure 1.2 on page 14. If the given library has billions of books, then the cost of reading the data nodes labeled book will be huge compared to the size of the output result set. This motivates the use of a more fragmented partitioning of the data to improve the selectivity of query nodes. Note that another way of improving performance in these cases is to use skipping, discussed later in Section

Figure 2.8: Partitioning on label. (a) Example query and data, showing first of four matches. (b) Example query and streams read. Marked stream nodes are useful.

Path partitioning

A natural extension of label partitioning is to partition data nodes on the paths by which they are reachable [37, 13, 36, 8]. It was described earlier how useless data nodes could be filtered out during intermediate result construction if they did not match prefix paths in the query. When indexing data nodes on prefix path, the same filtering is performed in advance, and we only process data nodes from classes where the prefix paths match the prefix paths in the query.

To identify useful partitions when evaluating a query, we need some form of dictionary. In Figure 1.5b on page 17 a simple dictionary of path strings was used in the index, but this approach does not have attractive worst-case properties. There may be many unique paths in the data, and the size of this naive dictionary can be $O(|D|^2)$ if the tree is deep. A more robust approach is to use a dictionary tree called a path summary, where shared prefixes of paths are only encoded once. Figure 2.9a shows the path partitioning for the data tree in Figure 2.8a. A path summary can be constructed from this partitioning by creating one node for each block in the partition, and creating edges between summary nodes whenever there are edges between data nodes in the related blocks, as shown on the left in Figure 2.9b.
Prefix path matches for each query node can be found individually by using a matching algorithm on the summary tree, but this may give many individual matches that never take part in full query matches. A robust and efficient way to find useful prefix path matches is to index the summary itself on label, and use a twig join algorithm to evaluate queries directly on the summary to find relevant nodes [2].
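The construction of a path summary from a data tree can be sketched as below. The representation is an assumption made for illustration only: the data tree is given as nested (label, children) tuples, data nodes are identified by their preorder numbers, and `SummaryNode` and `build_path_summary` are hypothetical names. Each summary node stands for one block of the path partition and stores the data nodes in that block.

```python
from dataclasses import dataclass, field


@dataclass
class SummaryNode:
    label: str
    children: dict = field(default_factory=dict)   # label -> SummaryNode
    extent: list = field(default_factory=list)     # preorder numbers of data nodes in the block


def build_path_summary(root):
    """Build a path summary for a data tree of nested (label, children) tuples.

    Data nodes reachable by the same sequence of labels from the root end up
    in the extent of the same summary node, so shared path prefixes are
    encoded only once.
    """
    summary = SummaryNode(root[0])
    preorder = 0
    stack = [(root, summary)]
    while stack:
        (label, kids), summary_node = stack.pop()
        preorder += 1
        summary_node.extent.append(preorder)
        # Push children in reverse so that they are popped in document order.
        for kid in reversed(kids):
            kid_label = kid[0]
            if kid_label not in summary_node.children:
                summary_node.children[kid_label] = SummaryNode(kid_label)
            stack.append((kid, summary_node.children[kid_label]))
    return summary


data = ("a", [("b", [("a", [])]),
              ("b", [("a", []), ("b", [])])])
summary = build_path_summary(data)
```

The summary has one node per unique path, so its size is bounded by the size of the data tree, and each data node is visited once during construction.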
