Structural XML Querying


VŠB Technical University of Ostrava
Faculty of Electrical Engineering and Computer Science
Department of Computer Science

Structural XML Querying

2018
Radim Bača


Abstract

A well-formed XML document or a set of documents can be viewed as an XML database, and the associated DTD, or XML Schema, is its database schema. XQuery and XPath are usually the query languages of an XML database. Compared to relational databases, the main differences are the hierarchical data model and the implicit order of the XML data. Therefore, the major novel issues related to query processing in XML databases are (1) handling the query logic related to the hierarchical structure of an XML document, and (2) dealing with the implicit order. Let us call these problems structural XML querying. In this work, we provide a comprehensive survey of state-of-the-art approaches and related aspects of efficient structural XML querying. In particular, we start with a description of labeling schemes that capture the structure of the data and of the respective storage strategies. Then we deal with the key part of every XML query processing algorithm, a twig query join, as well as with optimizations of XML query processing. Moreover, we describe two twig query joins that can be used in structural XML querying.

Key Words: Structural XML querying, XQuery, Cost-based optimizations


Contents

List of Figures
List of Tables
1 Introduction
1.1 Motivation
1.2 XML Model and XML Query Languages
1.3 Labeling Schemes
2 XML Storage Techniques
2.1 Partitioning of XML Document
2.2 Schema Tree
2.3 Node Indices
2.4 Balancing XML Storage
3 Twig Query Join Algorithms
3.1 GTPStack
3.2 GTPStack Experimental Results
3.3 Top-down Filtering Optimality
4 XML Query Processing Optimizations
4.1 Cost-based Optimizations
4.2 Selectivity Estimation Techniques
4.3 CostTwigJoin Algorithm
4.4 CostTwigJoin Experimental Results
5 Conclusions
References


List of Figures

An XML document and its XML tree model
XPath queries and their corresponding TPQs
XQuery queries and their corresponding GTP representations
(a) Containment labeling scheme (b) Dewey order labeling scheme
Properties of various XML document partitionings
(a) Document index (b) Partition index (c) Partition index with indexed lists
Categories of algorithms utilized for twig query joins
Performance of the output enumeration
Number of queries violating the properties for different k values (a) for all ZIPF queries (b) for queries corresponding to the first and to the fourth ZIPF template
Variants of the TB12 query with different number of output nodes
(a) Processing time of GTPStack+M and (b) number of nodes stored on stacks by GTPStack for each variant of TB
Example of three TPQs and their corresponding checking query nodes (underlined)
(a) Sample XML tree (b) QC
An empirical evaluation of the selection of α and β
Results of our cost-based optimization framework
Overhead of the cost-based optimization
Average speed-up of greedy forward compared to Top-k


List of Tables

Major features of twig query join categories
Numbers of nodes inserted into an intermediate storage and numbers of nodes relevant to the GTP result
Number of the filtering function calls corresponding to the inner query nodes
ZIPF query templates for XPath queries
GTPStack+M compared to all tested approaches for all queries
Ratio of the number of nodes stored in various intermediate storages and the number of relevant nodes for each collection
Median values of nodes stored in an intermediate storage for each collection
Categories of twig query joins
A comparison of cost-based approaches
Characteristics of data collections


1 Introduction

A well-formed XML document or a set of documents can be viewed as an XML database, and the associated DTD, or XML Schema, is its database schema. XQuery and XPath are usually the query languages of an XML database. Compared to relational databases, the main differences are the hierarchical data model and the implicit order of the XML data. Therefore, the major novel issues related to query processing in XML databases are (1) handling the query logic related to the hierarchical structure of an XML document, and (2) dealing with the implicit order. Let us call these problems structural XML querying. In this work, we provide a comprehensive survey of state-of-the-art approaches and related aspects of efficient structural XML querying. In particular, we start with a description of labeling schemes that capture the structure of the data and of the respective storage strategies. Then we deal with the key part of every XML query processing algorithm, a twig query join, as well as with optimizations of XML query processing. Moreover, we describe two twig query joins that can be used in structural XML querying.

In particular, we summarize the main contributions of this work as follows:

1. We present a detailed description of up-to-date storages and indices for XML data, as well as a classification of the physical access methods with regard to the node labeling, document partitioning, and twig query joins used.
2. We provide a thorough description of the state-of-the-art twig query joins and compare them in terms of their compatibility, features, and supported query models.
3. We describe XML query algebras and outline their compatibility with twig query joins.
4. We discuss the main aspects of cost-based optimization techniques and selectivity estimation approaches for XML queries. We compare these techniques in terms of supported query models, twig query join algorithms, and several other practical features.
5. We describe in detail two of our novel twig query join algorithms, GTPStack and CostTwigJoin. We compare these algorithms with other state-of-the-art algorithms and thoroughly describe our contributions.

The content of this work is based mainly on the following publications:

R. Bača, M. Krátký, T. W. Ling, and J. Lu. Optimal and Efficient Generalized Twig Pattern Processing: a Combination of Preorder and Postorder Filterings. The VLDB Journal, 22:1-25, Springer. [8]
R. Bača, P. Lukáš, and M. Krátký. Cost-based holistic twig joins. Information Systems, 52:21-33, Elsevier. [9]
R. Bača, M. Krátký, I. Holubová, M. Nečaský, T. Skopal, M. Svoboda, and S. Sakr. Structural XML Query Processing. Accepted in ACM Computing Surveys. [12]

1.1 Motivation

The adoption of the Extensible Markup Language (XML) [19], proposed by the W3C as a standard for information exchange, has gained so much momentum that XML is now undoubtedly a main standard for the representation and exchange of data. Before this happened, we witnessed a massive boom of techniques that enable efficient storing and querying of XML data. Now we can observe that the boom in proposals of new techniques for efficient structural XML querying is over, and the research world has shifted its attention towards other kinds of data models and data formats (e.g., JSON [18], NoSQL [88], RDF [14], or linked data [16]). However, according to Gartner [36], there is an increasing trend towards a new generation of multi-model database systems (e.g., OrientDB, MarkLogic, or HPE Vertica), which are designed to store data in a combination of related models and to query across them. Therefore, we believe that the work done on XML database management systems (XDBMSs) is still relevant and can be used in a new generation of database systems.

1.2 XML Model and XML Query Languages

This section contains a brief introduction to the structural XML querying problem.

1.2.1 XML and XML Model

For the purpose of machine processing of XML data, we do not view an XML document as a textual document; instead, we work with its model. Every XML document has to be well-formed, which basically means that its tags form a hierarchical structure. Hence, the model is a tree with several types of data nodes corresponding to elements, attributes, or textual data. An example of an XML document and its XML tree model is depicted in Figure 1. In the following text, we use the term labeled path for the sequence of node names of a path in an XML tree.

1.2.2 XML Querying

Issues and difficulties of structural XML querying are mostly observed with respect to the XPath [31] and XQuery [17] languages, where the latter is an extension of the former. Both languages are the key standards among XML query languages, and so the corresponding XML query processing algorithms are essential for any XDBMS.

XML document:

<notes>
  <note status="important">
    <to>Roope</to>
    <from>Jani</from>
    <body>Call me!</body>
  </note>
  <note status="new">
    <to>Radim</to>
    <due> </due>
    <body>Finish article</body>
  </note>
</notes>

XML tree model: the root notes has two note children. The first note has the children status (important), to (Roope), from (Jani), and body (Call me!). The second note has the children status (new), to (Radim), due, and body (Finish article).

Figure 1: An XML document and its XML tree model

XPath is a query language for selecting nodes from an XML document. The XPath language provides the ability to navigate within an XML document and to select particular nodes by a variety of criteria. We can find two major types of constructs in any XPath query: (1) structural constructs, including navigation in an XML tree, element or attribute name selection, wildcards, Boolean expressions, and quantifiers, and (2) content constructs, including predicates on the node content (i.e., the element content or the attribute value) and comparisons of the node content. The content constructs can often be handled by methods designed for query processing in RDBMSs [13, 80, 24]; moreover, a number of works deal with the content constructs in XML [63, 55, 10, 4]. Recall that this work focuses on the structural constructs.

The major structural query model used by most approaches is called a twig pattern query (TPQ). A basic TPQ Q = (V, E) is a tree with a set V of query nodes and a set E of edges. Query nodes represent nodes of an XML document to be retrieved, and edges represent structural relationships between the nodes. A query node q ∈ V is labeled by a node name. An edge e ∈ E can be of the parent-child (PC) or the ancestor-descendant (AD) type. PC and AD edges are visualized by single and double lines, respectively.
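The TPQ definition above can be illustrated with a naive matcher. The following is only a minimal sketch under stated assumptions: the classes `Node` and `QueryNode` and the functions `embeddings` and `find_all` are hypothetical names introduced here, the data tree is a made-up example, and the exhaustive recursion is not one of the twig query joins discussed in this work. It enumerates every embedding of a query shaped like //a//b[.//d]/e.

```python
from dataclasses import dataclass, field
from itertools import product


@dataclass
class Node:
    """A data node of the XML tree model (element names only)."""
    name: str
    nid: int                      # hypothetical node id (preorder number)
    children: list = field(default_factory=list)


@dataclass
class QueryNode:
    """A TPQ query node; `edge` is the type of the edge from its parent."""
    name: str
    edge: str = "AD"              # "PC" (parent-child) or "AD" (ancestor-descendant)
    children: list = field(default_factory=list)


def descendants(node):
    """Yield all proper descendants of `node` in document order."""
    for child in node.children:
        yield child
        yield from descendants(child)


def candidates(node, edge):
    """Data nodes reachable from `node` over an edge of the given type."""
    return node.children if edge == "PC" else list(descendants(node))


def embeddings(q, n):
    """Yield each embedding of the query subtree rooted at q with q mapped to n.

    An embedding is a list of data nodes in query preorder.
    """
    if q.name != n.name:
        return
    per_child = []
    for c in q.children:
        found = [e for m in candidates(n, c.edge) for e in embeddings(c, m)]
        if not found:             # one unsatisfied branch rules out n
            return
        per_child.append(found)
    for combo in product(*per_child):
        yield [n] + [x for part in combo for x in part]


def find_all(q, root):
    """All embeddings of the TPQ anywhere in the tree (leading // axis)."""
    out = []
    for n in [root, *descendants(root)]:
        out.extend(embeddings(q, n))
    return out


# Made-up data tree: a(b(d, e), b(e)); only the first b has a d descendant.
doc = Node("a", 1, [
    Node("b", 2, [Node("d", 3), Node("e", 4)]),
    Node("b", 5, [Node("e", 6)]),
])
# Query shaped like //a//b[.//d]/e
q1 = QueryNode("a", "AD", [
    QueryNode("b", "AD", [QueryNode("d", "AD"), QueryNode("e", "PC")]),
])
result = [[n.nid for n in emb] for emb in find_all(q1, doc)]
print(result)
```

The sketch revisits subtrees repeatedly, which is the kind of redundant work that labeling schemes and twig query joins are designed to avoid.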
Q1: //a//b[.//d]/e (TPQ)
Q2: //b/d/following-sibling::e (TPQ^>)
Q3: //a//b[./ancestor::b][./e]/d (X-TPQ)
Q4: //a//*[./b and not(./e)]//d (TPQ^*)

Figure 2: XPath queries and their corresponding TPQs

TPQs can represent XPath queries which contain child and descendant axes in their steps and predicates. A sample XPath expression of this kind is depicted in Figure 2 (Q1). The XPath language also contains other constructs. Thus, there exist different extensions of TPQ handling various structural aspects of XPath queries [73, 68]. An example of the major TPQ extensions is provided in Figure 2 (Q2 to Q4). We use a notation for TPQ extension names in which a structural construct not belonging to the basic TPQ definition is usually written in the superscript of the TPQ name (i.e., Q2 and Q4).

Algorithms evaluating a TPQ on an XML document are called twig query joins or structural joins. Any twig query join finds all the occurrences of a TPQ in an XML tree (also called TPQ matching). We discuss them in Section 3. Let us note that TPQ matching is often considered the core problem of XML querying.

Q5: for $a in //a[//c//f], $aa in $a//b//a return <o> {$a,$aa} </o>
Q6: for $a in //a, $d in $a//b//d return <o> {$d,$a/b} </o>
Q7: for $b in //b[not(.//d/c) or /c] let $cc := $b//c/f return <o> {$b,$cc} </o>

Figure 3: XQuery queries and their corresponding GTP representations (the legend distinguishes optional edges, output nodes, and optional output nodes)

The XQuery language [17] extends XPath in many ways. Two important extensions are the ability to specify output query nodes and the introduction of sequences in values. If we used only the TPQ model, we would have to postprocess the TPQ output in order to get the expected XQuery output. To overcome this problem, a generalized query pattern (GTP) [30] was introduced, which makes such postprocessing unnecessary. Figure 3 shows an example of three XQuery queries with their corresponding GTP models. We use a simplified notation of the GTP model introduced in [28]. The optional edge and the optional output node enable a simple representation of the let clause or of an optional XPath construct in the return clause (see Q6 and Q7 for an example).

1.3 Labeling Schemes

Most twig query joins use a labeling scheme that assigns a unique node label to each node. Node labels allow us to resolve basic operations between two nodes u and v during query processing. The most significant operations are (1) finding the lowest common ancestor (LCA) of u and v in an XML tree, (2) resolving the AD or PC relationship, and (3) deciding whether u precedes v in document order.

There are two major types of labeling schemes: (1) fixed-length labeling schemes, where the label has a fixed length, e.g., the Containment labeling scheme [106] (see an example in Figure 4(a)), and (2) prefix-based labeling schemes, where the length of the label is equal to the node depth, e.g., the Dewey order [94] (see an example in Figure 4(b)). In many labeling schemes, the node label serves as a unique identifier (node id) of a node in an XML document.

Figure 4: (a) Containment labeling scheme (b) Dewey order labeling scheme. In (a), the nodes carry the interval labels (1,16), (2,13), (3,4), (5,8), (6,7), (9,12), (10,11), and (14,15); in (b), the nodes carry Dewey labels such as 1, 1.1, and 1.2.

The labeling schemes mentioned in the previous paragraph [106, 34, 94] are early works with certain issues, such as poor update performance or a lack of support for the LCA operation. These issues were the main reason for the introduction of several other labeling schemes such as ORDPath [74], DDE [103], or Branch code [102]. A specific way to improve the update capability of fixed-length labeling schemes is the introduction of bulk operations [51]. CT-label [59] is an approach that aims at reducing the label size, while the XML document becomes nearly static. We do not cover labeling schemes such as compressed branch code [102], DFPD [60], or DPLS [62], as they can introduce false hits under certain circumstances. Prefix-based labeling schemes offer fast insertion, but this advantage is balanced out by the higher cost of the basic operations. Therefore, no labeling scheme is best in all respects, and a trade-off must be made between the performance of basic operations and the performance of updates. When selecting a labeling scheme, we also have to consider support for the LCA operation, as it is required by various twig query joins.
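To make the basic operations on node labels concrete, the sketch below resolves AD, PC, and document order for containment labels, and additionally the LCA for Dewey labels. It is an illustration under assumptions rather than the definition used in the cited papers: the class and function names are hypothetical, and the `level` field is an assumed addition, since a bare (start, end) interval cannot distinguish a parent from a more distant ancestor. The interval values are borrowed from Figure 4(a).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Containment:
    """Containment label; `level` (node depth) is an assumed extra field."""
    start: int
    end: int
    level: int


def is_ancestor(u, v):
    """AD: the interval of u strictly contains the interval of v."""
    return u.start < v.start and v.end < u.end


def is_parent(u, v):
    """PC: containment plus a depth difference of exactly one."""
    return is_ancestor(u, v) and v.level == u.level + 1


def precedes(u, v):
    """Document order follows the start positions."""
    return u.start < v.start


# Dewey labels are modelled as tuples of integers, e.g. 1.2 -> (1, 2).
def dewey_is_ancestor(u, v):
    """AD: u is a proper prefix of v."""
    return len(u) < len(v) and v[:len(u)] == u


def dewey_is_parent(u, v):
    """PC: u is a prefix of v that is shorter by one component."""
    return len(v) == len(u) + 1 and v[:len(u)] == u


def dewey_precedes(u, v):
    """Document order: lexicographic tuple order (a prefix sorts first)."""
    return u < v


def dewey_lca(u, v):
    """LCA: the longest common prefix of both labels."""
    i = 0
    while i < min(len(u), len(v)) and u[i] == v[i]:
        i += 1
    return u[:i]


root = Containment(1, 16, 1)
inner = Containment(5, 8, 3)
leaf = Containment(6, 7, 4)
print(is_ancestor(root, leaf), is_parent(inner, leaf), is_parent(root, leaf))
print(dewey_lca((1, 1, 2), (1, 2)))
```

The Dewey functions inspect up to depth-many components, whereas the containment checks compare a constant number of integers; this is one way to read the trade-off between operation cost and update cost described above.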


2 XML Storage Techniques

Native XDBMSs build structural indices that allow them to avoid accessing the XML document itself when resolving structural constructs of XML queries. Section 2.1 describes a general storage concept called the partitioning of an XML document. Sections 2.2 and 2.3 describe two data structures called a schema tree and a node index (both structures can be built during preprocessing of an XML document), while Section 2.4 summarizes the advantages of various storage settings.

2.1 Partitioning of XML Document

Nodes of an XML document can be easily divided into disjoint sets (partitions), where each set is identified by its partition label. Partitioning approaches can be divided into (1) those based on the document structure [106, 42, 26, 49], and (2) those considering a typical XML query workload [50, 27]. Tag partitioning [106, 5] is the most common partitioning based on the document structure: nodes are divided according to their tags, and the number of partitions is equal to the number of unique tags in the XML document. There are also partitionings based on tag+level [29], labeled paths [105, 29, 52], or forward & backward (F&B) paths [49]. Each of these partitionings is a decomposition (refinement) of another one, as illustrated in Figure 5. On the one hand, tag partitioning produces a low number of partitions with a high number of nodes in each partition. On the other hand, F&B partitioning has the opposite property: a possibly high number of partitions with a low number of nodes in each partition. Another type of partitioning is semantic partitioning [6], which partitions the XML document according to the structure specified in an XML schema.

Figure 5: Properties of various XML document partitionings. Refinement proceeds from tag through tag+level and labeled path to F&B partitioning (coarsening proceeds in the opposite direction); the tag end has a low number of partitions with a high number of nodes per partition, the F&B end a high number of partitions with a low number of nodes per partition.

2.2 Schema Tree

A schema tree for an XML document is a labeled tree in which each labeled path occurs only once. An example of a schema tree is depicted in Figure ??(b). In the literature, there are many names for this tree: DataGuide [42], path tree [3], summary tree, summary index, path index, and so on. A schema tree is useful for the following purposes: (1) to determine the partition labels corresponding to a query [11] when a more refined partitioning is used (e.g., tag+level or labeled paths), (2) to support a simple selectivity estimation [3], and (3) to obtain general knowledge about the structure of an XML document (e.g., to support an auto-completion feature in an XML editor).

2.3 Node Indices

Node labels are not stored in the schema tree; instead, they are stored as values in a node index. There are two basic types of node indices: (1) those having a node id as the key (document indices), and (2) those having a partition label as the key (partition indices). In a partition index, the nodes corresponding to one partition label are sorted according to their node labels; therefore, a sequential scan of the key's list is usually necessary during query processing. This can be improved by an index (XB-tree [20] or XR-tree [48]) built over each list. All types of node indices are schematically depicted in Figure 6.

Figure 6: (a) Document index (b) Partition index (c) Partition index with indexed lists. In (a), a B-tree maps a node id (key) to the node label plus some other information (value). In (b), a B-tree maps a partition label (key) to a list of node labels (value). In (c), a B-tree maps a partition label (key) to an XB-tree (value).

From the query processing perspective, a document index is very useful when we have a small set of context nodes and want to use it to resolve the remaining relationships of a query. This type of query processing can be considered navigational. On the other hand, many twig query joins are based on the partition index (see Section 3). This type of join is mainly focused on merging the lists during one sequential scan, which removes irrelevant nodes [20, 5]. A list in the partition index is called a stream in join algorithms; we use the term sequence in the following text.

2.4 Balancing XML Storage

From the query processing point of view, the selection of a partitioning influences the extent of three problems: (1) the overall number of nodes that have to be read from a node index, (2) the number of random accesses into a node index, and (3) the necessity to find all the partition labels in a potentially large schema tree. If we minimize the first problem by using a more refined partitioning, then the latter two problems grow, and vice versa. This behavior also depends on the query workload.

As mentioned in Section 2.1, there are also partitioning techniques based on a typical XML query workload [50, 27]. These approaches take typical queries into account and create the partitioning with respect to them. They minimize all three query processing problems mentioned above; however, they are only effective for a specific workload.
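The relationship between partitionings of different granularity and the schema tree can be sketched in a few lines. The following is an illustrative example under assumptions, not the storage layout of any cited system: nodes are given as hypothetical (node label, labeled path) pairs, a partition index is modelled as a plain dictionary from partition label to a sorted sequence, and the schema tree is a nested dictionary in which each labeled path occurs once.

```python
from collections import defaultdict


def partition(nodes, key):
    """Group node labels by partition label; each sequence is kept sorted."""
    parts = defaultdict(list)
    for label, path in nodes:
        parts[key(path)].append(label)
    for sequence in parts.values():
        sequence.sort()
    return dict(parts)


def by_tag(path):
    """Tag partitioning: the partition label is the node's own tag."""
    return path[-1]


def by_tag_level(path):
    """Tag+level partitioning: tag combined with the node depth."""
    return (path[-1], len(path))


def by_labeled_path(path):
    """Labeled-path partitioning: the full sequence of tags from the root."""
    return path


def schema_tree(nodes):
    """Nested dict in which every labeled path occurs exactly once."""
    tree = {}
    for _, path in nodes:
        current = tree
        for tag in path:
            current = current.setdefault(tag, {})
    return tree


# Made-up document a(b(c), c(c)) with preorder numbers as node labels.
nodes = [
    (1, ("a",)),
    (2, ("a", "b")),
    (3, ("a", "b", "c")),
    (4, ("a", "c")),
    (5, ("a", "c", "c")),
]

tag_parts = partition(nodes, by_tag)
level_parts = partition(nodes, by_tag_level)
path_parts = partition(nodes, by_labeled_path)
print(len(tag_parts), len(level_parts), len(path_parts))
print(schema_tree(nodes))
```

In this small example the number of partitions grows with each refinement while the sequences shrink, which mirrors the balance described in Section 2.4: a query for //c reads one long sequence under tag partitioning, but has to locate several partition labels in the schema tree under labeled-path partitioning.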


3 Twig Query Join Algorithms

As discussed in Section 1.2.2, the basic problem of XML querying is finding all the occurrences (query matches) of a TPQ in an XML tree [33]. Algorithms addressing this problem are usually called twig query joins, and they represent a basic operator in an XML query algebra. Unlike in the relational domain, where the term join stands for an algorithm-independent operation, in this work we use the term twig query join for algorithms that solve the TPQ matching problem, as is common in the referenced papers.

In the full version of this work we define a data structure API that abstracts the data structures of the storage system. Let us only mention that there are two basic access patterns: (1) a navigation API, which uses a node for navigation in the XML document, and (2) a stream API, which uses just a sequential scan of the node sequence corresponding to one partition label. Operations in the APIs have significantly different performance characteristics depending on the available node indices (see Section 2.3). In general, the document index supports both the navigation and stream APIs, whereas the partition index supports only the stream API. However, when processing the stream API, the document index has to access nodes using a sequential scan of a B-tree containing many irrelevant nodes, which is inefficient. A query processed in the partition index accesses only nodes relevant to a partition label; therefore, the partition index is crucial for the efficient processing of all merge-like algorithms. The selection of an appropriate index and algorithm is a task for a cost-based optimizer (see Section 4.1), and we propose an algorithm based on such a selection in Section 4.3.

Before we describe particular algorithms, we specify the steps that can be identified as parts of any twig query join:

Filtering: An algorithm scans sequences and filters out nodes not corresponding to any query match. In this stage, algorithms use main-memory filtering data structures to get rid of the maximum number of nodes that are irrelevant to a query. The most common filtering data structure is a stack or a set of stacks. The main feature of the filtering data structure is that it helps to decide in constant time whether a node is useless or not. Of course, every filtering can have false hits.

Intermediate storage: Every join uses some type of intermediate storage, where nodes are stored before the query output is enumerated. In some approaches, the filtering data structure is used as the intermediate storage as well.

Output enumeration: In this step, algorithms read the intermediate storage and generate the output, which is usually in the form of ordered tuples. The major task of this step is often the ordering of tuples according to the XML query model.

One of the most significant differences among joins is the method used during the filtering step. The first type of filtering method focuses only on a pair of query nodes, and the twig query join is processed as a set of binary structural joins. On the other hand, holistic joins filter input nodes based on information from all query nodes. Figure 7 summarizes the major categories of twig query joins and Table 1 depicts their main features.

Figure 7: Categories of algorithms utilized for twig query joins. Twig query joins are divided into binary structural joins (navigational and merge-like) and holistic joins (top-down and bottom-up).

Category | Pros | Cons
Binary structural joins | They can be easily integrated into any XML query algebra and support all XPath axes. | Their efficiency is significantly dependent on the selection of a good query plan. They can produce a large intermediate result when compared to the query output.
Top-down holistic joins | Linear I/O complexity of the query processing with respect to the sum of output and input sizes for some query types; query plan optimizations are unnecessary. | A sequential scan of an intermediate result is required even if it contains many useless nodes (see Section ??).
Bottom-up holistic joins | Linear CPU and I/O complexities of the output enumeration with respect to the output size. | They can produce a large intermediate result compared to the query output (see Section ??).

Table 1: Major features of twig query join categories

3.1 GTPStack

In this section, we outline the basic ideas of our holistic join algorithm GTPStack introduced in [8].

Node filtering

As described in the introduction of this section, every holistic algorithm has a filtering mechanism that skips irrelevant input data nodes, i.e., nodes which are not a part of any query match, before these nodes are stored in an intermediate storage. Holistic algorithms use stacks during the filtering. In the following text, #q denotes a query node q. If the filtering skips irrelevant nodes so that they are not stored on stacks at all, we speak about top-down filtering.
Bottom-up filtering skips irrelevant nodes (i.e., they are not stored in the intermediate storage) when they are popped from their stacks. The simplest top-down filtering is represented by PathStack [20], which skips an irrelevant node n corresponding to #q when there is no occurrence of a path from #root to #q containing n. Another type of top-down algorithm uses a recursive filtering function such as getnext [20] or getpart [43] and skips irrelevant nodes which are not a part of a whole TPQ occurrence. On the other hand, a bottom-up filtering algorithm (e.g., Twig2Stack [28] or TwigList [81]) skips an irrelevant node n corresponding to #q if there is no occurrence of a subtree rooted at #q containing n.

We say that a filtering is optimal for a query Q if it skips all irrelevant nodes during the sequential scan of the input, which means that an algorithm with such filtering has a linear worst-case time and I/O complexity for Q with respect to the sum of the input and TPQ result sizes. For example, the TwigStack [20] and TJStrictPre [43] algorithms are optimal for TPQs having only ancestor-descendant relationships.

Table 2 shows the number of nodes stored in an intermediate storage by various filtering algorithms. In this table, we use three TreeBank queries. The last column shows the number of nodes which are a part of a GTP result tuple. Evidently, the combination of PathStack and bottom-up filtering can store an enormous number of irrelevant nodes; however, it stores fewer nodes than bottom-up filtering alone. A top-down filtering algorithm such as TwigStack stores significantly fewer nodes, but it still typically stores a large number of irrelevant nodes, because it filters only according to the TPQ model and no bottom-up filtering is included.

Columns: Query | Bottom-up (Twig2Stack) | PathStack + Bottom-up (Twig2Stack+PathStack) | Top-down (TwigStack) | Nodes in GTP result
TB1 172,851 92,972 32,
TB2 170,874 49,765 24,
TB3 404, ,961 12,

Table 2: Numbers of nodes inserted into an intermediate storage and numbers of nodes relevant to the GTP result

In this section, we briefly outline the features of the GTPStack algorithm, which combines a top-down filtering function with bottom-up filtering (let us call this a combined filtering). To the best of our knowledge, it is the first correct algorithm able to apply such combined filtering before a node is stored in the intermediate storage. Our combined filtering enables optimal filtering according to the GTP; therefore, only the nodes relevant to the GTP result are stored in the intermediate storage if the algorithm is optimal. In other words, if GTPStack is optimal, then it has a linear worst-case I/O complexity with respect to the GTP result size. GTPStack's combined filtering significantly reduces the number of nodes in the intermediate storage even if GTPStack is not optimal for a query. Moreover, in order to speed up query processing, we use the following two improvements in the filtering mechanism: (1) we introduce a novel top-down filtering function called getmatch, which always outperforms the getpart function [43], and (2) we avoid storing predicate nodes on stacks.
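The role of a stack during the merge of two sorted sequences can be illustrated with a classic stack-based binary structural join for a single AD edge. This is a deliberately simpler technique than the holistic algorithms discussed here, and it is not GTPStack or getmatch. The function name `structural_join` is hypothetical, nodes are represented only by containment labels (start, end), and the interval values are borrowed from Figure 4(a); which tags those intervals belong to is an assumption made for the example.

```python
def structural_join(ancestors, descendants):
    """Stack-based merge join for one ancestor-descendant edge.

    Both inputs are lists of containment labels (start, end) sorted by start.
    Returns all pairs (a, d) such that a is an ancestor of d.
    """
    out, stack = [], []
    i = j = 0
    while j < len(descendants):
        d = descendants[j]
        if i < len(ancestors) and ancestors[i][0] < d[0]:
            # The next node in document order is an ancestor candidate.
            a = ancestors[i]
            while stack and stack[-1][1] < a[0]:
                stack.pop()           # closed before a starts
            stack.append(a)           # the stack stays a chain of nested nodes
            i += 1
        else:
            # The next node in document order is a descendant candidate.
            while stack and stack[-1][1] < d[0]:
                stack.pop()           # closed before d starts
            out.extend((a, d) for a in stack)
            j += 1
    return out


# Assumed sequences of a partition index (tag partitioning).
a_seq = [(1, 16), (3, 4)]
b_seq = [(2, 13), (6, 7), (10, 11), (14, 15)]

a_b = structural_join(a_seq, b_seq)   # pairs for //a//b
b_b = structural_join(b_seq, b_seq)   # pairs for //b//b
print(a_b)
print(b_b)
```

Each sequence is read once and every stack operation takes constant time, which is the property the filtering step relies on. The join considers only one pair of query nodes at a time, so a twig evaluated as a series of such joins may produce intermediate pairs that belong to no complete query match; holistic filtering aims to avoid exactly this.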

GTPStack processing time improvements

Let us briefly describe the ideas behind the above improvements using examples. A filtering function such as getnext or getpart is typically called many times, as shown in Table 3. This table gives the numbers of function calls corresponding to the inner query nodes and the numbers of unnecessary calls for three TreeBank queries. An unnecessary call of the filtering function works with exactly the same data nodes as the last function call; therefore, it returns the same query node. As observed for the query TB3, almost half of the function calls can be unnecessary. Our novel getmatch function avoids all these unnecessary calls. Additionally, as shown in Section 3.2.3, the efficiency of the filtering function depends significantly on its ability to skip irrelevant nodes. The getpart and getnext functions sometimes return irrelevant nodes which are not subsequently stored on stacks. Another advantage of getmatch is that it skips all these irrelevant nodes.

Columns: Query | getnext: Calls [10^3] | getnext: Unnecessary calls [10^3] | getpart: Calls [10^3] | getpart: Unnecessary calls [10^3]
Rows: three TreeBank queries (TB)

Table 3: Number of the filtering function calls corresponding to the inner query nodes

We use the term main branch query node for the query nodes which are on a query path from the root to an output node. The rest of the query nodes are called predicate query nodes. GTPStack separates the node filtering from the output enumeration, which yields the following optimization: it allows us to avoid storing the nodes corresponding to the predicate query nodes on stacks.

Optimality

An important feature of holistic algorithms using top-down filtering is that they have a linear worst-case I/O complexity with respect to the TPQ result size (i.e., they are optimal) for some query classes. Different holistic approaches define their optimality conditions in different ways; however, all of them specify only requirements on the query. To the best of our knowledge, GTPStack is the first algorithm that is optimal for some query classes with respect to the GTP result size. GTPStack optimality is defined only by XPath axes and XML document characteristics. In other words, semantics related to the output nodes, Boolean expressions, and quantifiers do not influence its optimality.

getmatch Filtering Function

The existing getnext and getpart top-down filtering functions have two shortcomings: (1) they often perform unnecessary recursive calls, and (2) they return a query node #q even if there is no ancestor of H(#q) on the stack S_parent(#q). As a result, they both cause many unnecessary computations, which are completely avoided by the getmatch function introduced in this section.

The getmatch function introduces three improvements over the existing top-down filtering functions: (1) dynamic programming that avoids unnecessary recursive calls, (2) a filtering procedure which advances the sequence according to the bottom node of the parent's stack, and (3) a cycle for the inner nodes which does not terminate the getmatch call until all sequences are ended or a promising node is found. The use of the parent's stack and of the above cycle causes a more progressive advancing of sequences, which is a major parameter influencing the processing time of a filtering function, as shown in Section ??.

A top-down filtering function has the following property for a specific class of queries (see below): it returns only a query node #q such that there is a query match of #q containing H(#q). Therefore, the top-down filtering removes all nodes irrelevant to the TPQ. The getpart and getnext procedures guarantee this property for queries having only AD relationships [20, 43]. Since getmatch only avoids unnecessary function calls and skips irrelevant nodes having no ancestor on the parent's stack, its optimality properties are the same as those of getpart and getnext. This means that getmatch is optimal for queries having only AD relationships. Let us note that in Section 3.3 we prove that a holistic algorithm with a top-down filtering function (e.g., getmatch) can be optimal even for a query containing any combination of PC and AD relationships, depending on XML document characteristics.

Summary of GTPStack

GTPStack is the first algorithm with a linear worst-case I/O complexity with respect to the sum of the input and GTP result sizes (in this case, GTPStack is optimal for the GTP). This is mainly achieved by the combination of top-down and bottom-up filtering and, to the best of our knowledge, GTPStack is the first correct holistic algorithm using a combined filtering before storing a node in an intermediate storage.
The combined approach used in GTPStack has the following advantages: (1) it allows us to avoid storing nodes corresponding to predicate query nodes on stacks, which speeds up the query processing, and (2) it significantly decreases the number of nodes in the intermediate storage even when GTPStack is not optimal for a query. GTPStack uses our novel filtering function called getmatch, which avoids unnecessary function calls and improves sequence advancing, which further speeds up the query processing. All these features make GTPStack superior to the state-of-the-art holistic approaches, as is shown experimentally in Section.

GTPStack Experimental Results

We implemented five state-of-the-art holistic algorithms in C++: TwigStack [20], TwigList [81], TJStrictPre [43], TJStrictPost [43], and Twig²Stack+PathStack [28] (abbreviated to T2PS). We do not include experimental results of the TwigList algorithm since both TJStrictPost and TJStrictPre use an improved version of TwigList. In our experiments, we use more than one version of GTPStack. We combine GTPStack with existing top-down filtering functions; therefore, we use the following simple notation, where GTPStack+N, GTPStack+P, and GTPStack+M stand for GTPStack combined with getnext, getpart, and getmatch, respectively. By writing GTPStack we mean any of the above versions of GTPStack. Since GTPStack+N always outperforms TwigStack, we include the results for the TwigStack query processing only in the intermediate storage test. The main shortcomings of TwigStack are its redundant intermediate storage and its inefficient output enumeration.

We use one synthetic XML document of our own, called ZIPF, and three real-world XML collections. The ZIPF document contains seven different elements named from a to g spread randomly using the Zipfian distribution, where a has the highest occurrence (approximately 50%) and g has the lowest occurrence (approximately 1%). Every element of ZIPF has exactly two children and the depth of the collection is 24, which means that all paths in ZIPF have the same length. The real-world collections are XMark [91] with factor 10, INEX 1.9 [40], and TreeBank [97].

Queries for the XMark and TreeBank collections are selected from several existing articles on TPQ processing [64, 28, 56]. Queries for the INEX collection were selected in order to show differences between the algorithms. A list of the selected queries for the real-world collections can be found in the full version of this work. The largest number of queries was generated for the ZIPF collection. The ZIPF queries are generated according to the five query templates shown in Table 4. A template only specifies the relationships between query nodes, the output query node, and the predicate query nodes.

Query template                        Number of generated queries
1. //τ[/υ and /ω]
2. //α/β[//χ and //δ]
3. //α/α[//β]/χ//χ[//δ and //ϵ]
4. //α[/τ and //υ and //ω]
5. //α/β[//χ/δ]                       81
Table 4: ZIPF query templates for XPath queries

We run our experiments on a PC with an Intel Xeon 2.93GHz CPU and the Windows Server 2008 operating system.
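A document with the shape described above for ZIPF can be generated with a few lines of Python. This is only an illustrative sketch: the function names, the Zipf exponent s, and the seed are assumptions, since the text gives only the approximate shares of a and g, and a small depth is used in place of 24 to keep the output tiny.

```python
import random
from typing import List

TAGS = "abcdefg"  # the seven element names of the ZIPF document


def zipf_weights(n: int, s: float) -> List[float]:
    """Normalized Zipfian weights: rank r gets a weight proportional to 1 / r**s."""
    raw = [1.0 / (r ** s) for r in range(1, n + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def generate_zipf_xml(depth: int, s: float = 1.5, seed: int = 0) -> str:
    """Serialize a complete binary tree of the given depth with Zipf-distributed tags.

    Every inner element has exactly two children, so all root-to-leaf paths
    have the same length. The exponent s is an assumed parameter.
    """
    rng = random.Random(seed)
    weights = zipf_weights(len(TAGS), s)

    def build(level: int) -> str:
        tag = rng.choices(TAGS, weights=weights)[0]
        if level == depth:
            return f"<{tag}/>"  # leaf element
        left = build(level + 1)
        right = build(level + 1)
        return f"<{tag}>{left}{right}</{tag}>"

    return build(1)
```

A tree of depth d contains 2^d - 1 elements, so the full depth of 24 yields roughly 16.7 million elements; the recursive string building used here is meant for small depths only.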
When measuring the processing time, each query is processed fifteen times in the main memory and then we compute the average result, omitting the two worst and the two best results. If we want to compare the processing times T_x and T_y of two approaches x and y for a set of queries, we first compute the geometric mean of the ratios T_y/T_x over all queries. Subsequently, since we want to express the value as a percentage, we subtract one from the calculated geometric mean and multiply the result by 100. We call it the relative processing time improvement (RPTI) of approach x compared to approach y for a set of queries. For example, if the geometric mean is 1.68, we say that the RPTI of x compared to y is 68%. Since GTPStack's improvements of the processing time relate only to the filtering part of holistic algorithms, we also measured the filtering time of the algorithms (i.e., the processing time without the time spent on reading the input data), and therefore, we also compute the relative filtering time improvement (RFTI) of approach x compared to approach y for a set of queries. In order to minimize the processing time measurement error, we say that a method is faster than another one for a query Q if its RPTI for Q is at least 2% and their processing time difference for Q is at least 10 milliseconds.

Processing Time

Table 5 gives the RPTI and RFTI of GTPStack+M compared to all tested approaches for all queries. Table 5 also contains the number of queries for which GTPStack+M is faster and slower.

Filtering approach   RPTI   RFTI   Number of queries
                                   Faster   Slower
T2PS                 77%    172%
TJStrictPost         88%    222%
TJStrictPre          12%    43%
GTPStack+N           68%    150%
GTPStack+P           13%    46%
Table 5: GTPStack+M compared to all tested approaches for all queries

We can observe that GTPStack+M outperforms both approaches using the getpart function (TJStrictPre and GTPStack+P) for all queries since the getmatch function is always faster than or equally fast as getpart. This comes from the fact that getmatch improves getpart without any additional overhead. The remaining approaches (T2PS, TJStrictPost, and GTPStack+N) perform significantly worse on average than GTPStack+M (the RFTI of GTPStack+M ranges from 150% to 222% when compared to these approaches).

To better understand the differences among the algorithms and the advantages of GTPStack+M, we need to compare their corresponding parts separately. We first compare only the result enumeration time; then we compare the top-down filtering; then we show how GTPStack optimizes its processing time for queries with many predicate nodes; and, in the last experiment, we compare the bottom-up filtering.

Test of Intermediate Storages for TPQs

Let us start with a test showing the properties of various intermediate storages and mainly the performance of the output enumeration.
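As a side note on the measures reported in Table 5, the trimmed average of repeated runs, the RPTI measure, and the criterion for calling a method faster translate directly into code. The Python sketch below uses hypothetical function names and assumes that per-query times are given in milliseconds; the same function yields the RFTI when it is fed filtering times instead of processing times.

```python
import math
from typing import Sequence


def trimmed_mean(times: Sequence[float], drop: int = 2) -> float:
    """Average of the runs after omitting the `drop` best and `drop` worst results."""
    if len(times) <= 2 * drop:
        raise ValueError("not enough runs to trim")
    kept = sorted(times)[drop:len(times) - drop]
    return sum(kept) / len(kept)


def relative_improvement(t_x: Sequence[float], t_y: Sequence[float]) -> float:
    """RPTI (or RFTI) of approach x compared to approach y, in percent.

    Geometric mean of the per-query ratios T_y / T_x, minus one, times 100.
    """
    ratios = [y / x for x, y in zip(t_x, t_y)]
    geo_mean = math.exp(sum(math.log(r) for r in ratios) / len(ratios))
    return (geo_mean - 1.0) * 100.0


def is_faster(t_x: float, t_y: float,
              min_rpti: float = 2.0, min_diff_ms: float = 10.0) -> bool:
    """True if x counts as faster than y on a single query (times in ms)."""
    rpti = (t_y / t_x - 1.0) * 100.0
    return rpti >= min_rpti and (t_y - t_x) >= min_diff_ms
```

The geometric mean is computed through logarithms so that a long list of ratios does not overflow or underflow when multiplied together.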
Note that the LIS intermediate storage is used by the TJStrictPost, TJStrictPre, and GTPStack algorithms. The results in this section serve as a hint for the selection of the most appropriate intermediate storage for our approach. In this test, we present only the XMark queries since the results for the other collections are similar. However, we ignore the GTP semantics (i.e., we consider all query nodes as output ones) since TwigStack cannot enumerate GTPs. As a result, queries 8 and 13 did not finish since their TPQ result sizes were over one billion and the available main memory was not sufficient.

[Figure 8: Performance of the output enumeration. Axes: Query, Time [s]; series: TwigStack, LIS, T2PS; DNF labels appear for some queries.]

Figure 8 shows the results of this experiment. As expected, TwigStack performs very poorly, which corresponds to the results published in [28]. The inefficiency of the TwigStack intermediate storage comes from the duplicate work with nodes and the sequential scan of the whole intermediate storage during the output enumeration. If we compare the output enumeration times of T2PS and LIS, the difference is not significant. This result comes from the fact that they use the bottom-up filtering; therefore, their output enumeration time is linear with respect to the result size. Finally, we decided to use the LIS storage since the Twig²Stack intermediate storage requires a global pop order for its correct functionality. LIS can work with our bottom-up filtering and its enumeration time performance is comparable to that of the Twig²Stack intermediate storage.

Analysis of Top-down Filtering

We can find three major types of top-down filtering approaches: (1) the PathStack algorithm, which filters only according to the node query path from the root query node and does not perform any sequence advance, (2) the getnext function, which performs a sequence advance according to the query node descendants, and (3) getpart and getmatch, which perform a sequence advance according to the query node descendants and ancestor. We ignore getpart in the following text since getmatch always outperforms this function for our queries, as is shown in Section.

The PathStack algorithm is very simple and fast and its processing time is linear with respect to the input size (the correlation coefficient between the PathStack processing time and the input size for the ZIPF queries is 0.98). Since PathStack does not use any advanced sequence forwarding, its processing time is not influenced by the query result. On the other hand, a filtering function (i.e., getnext or getmatch) can skip irrelevant nodes more quickly using the sequence advancing. Therefore, these functions outperform PathStack if they advance the sequence sufficiently often; this is discussed further in this section.

The main attribute indicating the efficiency of a top-down filtering is the average sequence advance (denoted as AvgFwd) during one FwdToAncOf or FwdToDescOf function call. Since the getmatch function uses both forwarding functions while getnext uses only the FwdToAncOf function, we also define: (1) an average sequence advance during one FwdToAncOf function call, and (2) an average sequence advance during one FwdToDescOf function call. Let us call them the average ancestor forward movement and the average descendant forward movement and denote them AvgAncFwd and AvgDescFwd, respectively.

We first compare the getnext and getmatch functions (i.e., we compare GTPStack+N and GTPStack+M). The getnext function uses only FwdToAncOf, and therefore, getmatch performs better if its AvgDescFwd is sufficiently large. Our goal is to find a threshold value k with the following two properties:

if AvgDescFwd > k, then T_getnext > T_getmatch,
if AvgDescFwd < k, then T_getnext < T_getmatch.

[Figure 9: Number of queries violating the properties for different k values (a) for all ZIPF queries (b) for queries corresponding to the first and to the fourth ZIPF template. Axes: Threshold value k, Number of queries [%]; series: First property violation, Second property violation, Sum of the violations.]

Figure 9(a) shows how many queries in our ZIPF query set violate the above properties for different k values.
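Counting the violations of the two properties for a candidate k is a simple scan over per-query measurements. The following Python sketch assumes a hypothetical record QueryStat holding AvgDescFwd and the two processing times of one query; any numbers fed to it in an example are illustrative only and are not experimental results.

```python
from typing import Iterable, List, NamedTuple, Tuple


class QueryStat(NamedTuple):
    avg_desc_fwd: float  # AvgDescFwd measured for the query
    t_getnext: float     # processing time with getnext
    t_getmatch: float    # processing time with getmatch


def count_violations(stats: List[QueryStat], k: float) -> Tuple[int, int]:
    """Return (first-property violations, second-property violations) for k."""
    first = sum(1 for q in stats
                if q.avg_desc_fwd > k and not q.t_getnext > q.t_getmatch)
    second = sum(1 for q in stats
                 if q.avg_desc_fwd < k and not q.t_getnext < q.t_getmatch)
    return first, second


def best_threshold(stats: List[QueryStat], candidates: Iterable[float]) -> float:
    """Candidate k with the smallest sum of violations (first one wins ties)."""
    return min(candidates, key=lambda k: sum(count_violations(stats, k)))
```

Dividing the counts by the number of queries gives the percentages plotted on the vertical axis of such a chart.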
As we can see, there is no optimal k value for which all queries satisfy both properties since the sum of violations never reaches zero. In other words, we cannot find any exact threshold value of AvgDescFwd which would say whether to use getnext or getmatch. The reason for this is that various query nodes in a query can have significantly different AvgDescFwd values; therefore, the selection of the same filtering function for all query nodes is not always the best solution. Figure 9(b) shows the number of violations for the ZIPF queries generated by templates 1 and 4. These queries are very simple and the descendant forwarding is always performed in relation to the same parent query node. We can observe that the threshold value k = 0.03 is a good selection for many of them since the sum of violations is less than 5%.

Based on the above heuristic, we evaluated a simple combination of the getmatch and getnext functions which works as follows. First, we process a query using getmatch and collect the statistics about AvgDescFwd in each query node. Second, we use the getmatch function for a query node with AvgDescFwd larger than 0.03 and the getnext function in the other cases. This combination of getmatch and getnext always performs at least as well as any other filtering function.

Similarly, we looked for another two threshold values which would indicate that an algorithm using PathStack performs better than algorithms using getnext or getmatch. Both algorithms have the threshold value of AvgFwd approximately equal to 0.3, where only 5% of the ZIPF queries corresponding to the first and fourth templates violate the corresponding processing time properties. In other words, if AvgFwd is lower than 0.3, then PathStack performs better in many cases.

Optimization of Predicate Query Nodes

None of the existing holistic algorithms can optimize the query processing time with respect to the number of nodes corresponding to predicate query nodes; therefore, their processing time is the same regardless of the GTP semantics. Their performance is mainly dependent on their top-down filtering, as is shown in the above section. If GTPStack is optimal, then it stores only the nodes corresponding to main branch query nodes on stacks and thus it saves some time during the bottom-up filtering.
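The two-phase combination of getmatch and getnext described earlier in this section can be sketched in Python as follows. The class and function names are hypothetical, and the sketch assumes that a sequence advance is measured as the number of nodes skipped by one FwdToDescOf call, which the text does not state explicitly.

```python
from typing import Dict


class FwdStats:
    """Running totals of the FwdToDescOf calls of one query node (phase one)."""

    def __init__(self) -> None:
        self.calls = 0
        self.advanced = 0

    def record(self, skipped: int) -> None:
        """Register one call that skipped the given number of nodes."""
        self.calls += 1
        self.advanced += skipped

    @property
    def avg_desc_fwd(self) -> float:
        """AvgDescFwd of the query node; zero if no call was recorded."""
        return self.advanced / self.calls if self.calls else 0.0


def choose_filters(stats: Dict[str, FwdStats], k: float = 0.03) -> Dict[str, str]:
    """Phase two: pick getmatch for nodes with AvgDescFwd > k, getnext otherwise."""
    return {q: ("getmatch" if s.avg_desc_fwd > k else "getnext")
            for q, s in stats.items()}
```

The default k = 0.03 is the value reported above for the queries of templates 1 and 4; other query shapes may call for a different value.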
[Figure 10: Variants of the TB12 query with a different number of output nodes. Query nodes shown: #S, #NN, #NP, #DT, #PP, #IN.]

[Figure 11: (a) Processing time of GTPStack+M and (b) number of nodes stored on stacks by GTPStack for each variant of TB12. Axes: Query; Processing time [s] in (a); Number of nodes stored on stacks [10^5] in (b).]

In Figure 10, we can observe seven variants of the TB12 query with a different number of output nodes. We selected the TB12 query because GTPStack+M is optimal for it. Figure 11(a) shows how the processing time of GTPStack+M decreases with the decreasing number of nodes stored on stacks. The AvgDescFwd and AvgAncFwd values are equal to 0.09 and 0.14, respectively, for the TB12 query, which indicates (according to the results of the above section) that getmatch should be slower than T2PS and TJStrictPost having the processing times and 0.202, respectively. However, for the last query variant, where the number of nodes corresponding to predicate query nodes is six times larger than the number of the other nodes, GTPStack+M performs equally to both algorithms (its processing time is 0.216).

Intermediate Storage Size

Method               ZIPF   TB   XM   INEX
T2PS, TJStrictPost
TJStrictPre
GTPStack
Table 6: Ratio of the number of nodes stored in various intermediate storages and the number of relevant nodes for each collection

                     ZIPF [10^3]   TB [10^3]   XM [10^3]   INEX [10^3]
T2PS, TJStrictPost
TJStrictPre
GTPStack
Relevant nodes
Table 7: Median values of nodes stored in an intermediate storage for each collection

Another important property of every algorithm related to the I/O complexity is the intermediate result size. We evaluate the intermediate result size in terms of the number of nodes stored there. For each method we compute the ratio of the number of nodes stored in the intermediate storage and the number of relevant nodes for each collection and filtering method. Table 6 shows this ratio for all queries in each collection computed using the geometric mean. Table 7 shows the median value of nodes stored in an intermediate storage for each collection. Generally, GTPStack stores one order of magnitude fewer nodes than the rest of the tested approaches due to the fact that it uses the combined approach. The intermediate result size does not have a significant relationship to the processing time during the main memory run since every algorithm has to perform some extra operations if it wants to avoid storing useless nodes. However, the difference will be huge if an intermediate storage is larger than the main memory, and, in this case, I/O operations have to be included.

3.3 Top-down Filtering Optimality

Optimality of a top-down filtering for a query or at least a subquery has several important impacts: (1) we can guarantee that we skip all nodes irrelevant to a TPQ during the top-down filtering, (2) the algorithm optimality is necessary for a more efficient top-down node filtering of a query with the NOT operator, (3) we store only nodes corresponding to the output query nodes in the intermediate storage, and (4) we avoid storing all nodes corresponding to the predicate query nodes on stacks.

Every top-down filtering algorithm has its specific query classes for which the algorithm optimality is proved. The most common top-down algorithm optimality is the AD query optimality (an AD query is a query having only AD relationships), which is also GTPStack's optimality if the tag partitioning is used. However, we show that the top-down algorithm optimality can be significantly extended using the tag+level or labeled path partitioning. For a more thorough comparison of the query classes for which the algorithms are optimal, see [12].
When we speak about a holistic algorithm in this section, we mean a holistic algorithm using a top-down filtering, such as TwigStack, TJStrictPre, or GTPStack. Let us define several terms related to the query nodes of a TPQ:

A query node with PC in its subtree is a query node #q having the PC relationship between some two nodes from the subtree(#q) set.
A checking query node #q is a query node with PC in its subtree and having the AD relationship with its parent. It is an important query node type from the optimality point of view.
Checkingnodes(Q) is the set of checking query nodes in the query Q.
A tag is called single level if all nodes in the XML tree with this tag are on the same level.
PRU_#q is the set of partition labels that corresponds to a query node #q. Note that under tag partitioning we can have just one partition label in every PRU_#q. If we use some other partitioning, then we need to do the DataGuide search first. By the term DataGuide search we mean determining the partition labels corresponding to a query (see Section 2.2).
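The definitions above can be made concrete with a small Python sketch. The QueryNode class and the function names are hypothetical. The sketch assumes that subtree(#q) includes #q itself, and it leaves to the caller the decision whether the root of a query starting with // is modelled as having an AD relationship with its (virtual) parent.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional, Set, Tuple


@dataclass
class QueryNode:
    name: str
    axis: Optional[str] = None  # "PC" or "AD": relationship to the parent
    children: List["QueryNode"] = field(default_factory=list)


def has_pc_in_subtree(q: QueryNode) -> bool:
    """True if some edge between two nodes of subtree(q) is a PC edge."""
    return any(c.axis == "PC" or has_pc_in_subtree(c) for c in q.children)


def checking_nodes(root: QueryNode) -> List[str]:
    """Names of the query nodes with PC in their subtree and AD to their parent."""
    result: List[str] = []
    pending = [root]
    while pending:
        q = pending.pop()
        if q.axis == "AD" and has_pc_in_subtree(q):
            result.append(q.name)
        pending.extend(q.children)
    return result


def single_level_tags(nodes: Iterable[Tuple[str, int]]) -> Set[str]:
    """Tags whose nodes all occur on one level, given (tag, level) pairs."""
    levels: Dict[str, Set[int]] = {}
    for tag, level in nodes:
        levels.setdefault(tag, set()).add(level)
    return {tag for tag, seen in levels.items() if len(seen) == 1}
```

For the query //a[./b]//c/d with the root modelled as AD, the sketch reports #a and #c: both have a PC edge below them and an AD edge above them, while #b and #d are connected to their parents by PC edges.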


More information

Web Data Management. XML query evaluation. Philippe Rigaux CNAM Paris & INRIA Saclay

Web Data Management. XML query evaluation. Philippe Rigaux CNAM Paris & INRIA Saclay http://webdam.inria.fr/ Web Data Management XML query evaluation Serge Abiteboul INRIA Saclay & ENS Cachan Ioana Manolescu INRIA Saclay & Paris-Sud University Philippe Rigaux CNAM Paris & INRIA Saclay

More information

Parallelizing Structural Joins to Process Queries over Big XML Data Using MapReduce

Parallelizing Structural Joins to Process Queries over Big XML Data Using MapReduce Parallelizing Structural Joins to Process Queries over Big XML Data Using MapReduce Huayu Wu Institute for Infocomm Research, A*STAR, Singapore huwu@i2r.a-star.edu.sg Abstract. Processing XML queries over

More information

On Boosting Holism in XML Twig Pattern Matching Using Structural Indexing Techniques

On Boosting Holism in XML Twig Pattern Matching Using Structural Indexing Techniques On Boosting Holism in XML Twig Pattern Matching Using Structural Indexing Techniques Ting Chen, Jiaheng Lu and Tok Wang Ling School of Computing National University of Singapore 3 Science Drive, Singapore

More information

XML Systems & Benchmarks

XML Systems & Benchmarks XML Systems & Benchmarks Christoph Staudt Peter Chiv Saarland University, Germany July 1st, 2003 Main Goals of our talk Part I Show up how databases and XML come together Make clear the problems that arise

More information

Multi-Way Number Partitioning

Multi-Way Number Partitioning Proceedings of the Twenty-First International Joint Conference on Artificial Intelligence (IJCAI-09) Multi-Way Number Partitioning Richard E. Korf Computer Science Department University of California,

More information
