Structural XML Querying
VŠB Technical University of Ostrava
Faculty of Electrical Engineering and Computer Science
Department of Computer Science
Structural XML Querying
2018 Radim Bača
Abstract

A well-formed XML document or a set of documents can be viewed as an XML database, and the associated DTD or XML Schema is its database schema. XQuery and XPath are usually the query languages of an XML database. Compared to relational databases, the main differences are the hierarchical data model and the implicit order of the XML data. Therefore, the major novel issues related to query processing in XML databases are (1) handling the query logic related to the hierarchical structure of the XML document, and (2) dealing with the implicit order. We call these problems structural XML querying. In this work, we provide a comprehensive survey of state-of-the-art approaches and related aspects of efficient structural XML querying. In particular, we start with a description of labeling schemes that capture the structure of the data and of the respective storage strategies. Then we deal with the key part of every XML query processing algorithm, the twig query join, as well as with optimizations of XML query processing. Moreover, we describe two twig query joins that can be used in structural XML querying.

Key Words: Structural XML querying, XQuery, Cost-based optimizations
Contents

List of Figures
List of Tables
1 Introduction
1.1 Motivation
1.2 XML Model and XML Query Languages
1.3 Labeling Schemes
2 XML Storage Techniques
2.1 Partitioning of XML Document
2.2 Schema Tree
2.3 Node Indices
2.4 Balancing XML Storage
3 Twig Query Join Algorithms
3.1 GTPStack
3.2 GTPStack Experimental Results
3.3 Top-down Filtering Optimality
4 XML Query Processing Optimizations
4.1 Cost-based Optimizations
4.2 Selectivity Estimation Techniques
4.3 CostTwigJoin Algorithm
4.4 CostTwigJoin Experimental Results
5 Conclusions
References
List of Figures

An XML document and its XML tree model
XPath queries and their corresponding TPQs
XQuery queries and their corresponding GTP representations
(a) Containment labeling scheme (b) Dewey order labeling scheme
Properties of various XML document partitionings
(a) Document index (b) Partition index (c) Partition index with indexed lists
Categories of algorithms utilized for twig query joins
Performance of the output enumeration
Number of queries violating the properties for different k values (a) for all ZIPF queries (b) for queries corresponding to the first and to the fourth ZIPF template
Variants of the TB12 query with different number of output nodes (a) Processing time of GTPStack+M and (b) number of nodes stored on stacks by GTPStack for each variant of TB
Example of three TPQs and their corresponding checking query nodes (underlined)
(a) Sample XML tree (b) QC
An empirical evaluation of the selection of α and β
Results of our cost-based optimization framework
Overhead of the cost-based optimization
Average speed-up of greedy forward compared to Top-k
List of Tables

Major features of twig query join categories
Numbers of nodes inserted into an intermediate storage and numbers of nodes relevant to the GTP result
Number of the filtering function calls corresponding to the inner query nodes
ZIPF query templates for XPath queries
GTPStack+M compared to all tested approaches for all queries
Ratio of the number of nodes stored in various intermediate storages and the number of relevant nodes for each collection
Median values of nodes stored in an intermediate storage for each collection
Categories of twig query joins
A comparison of cost-based approaches
Characteristics of data collections
1 Introduction

A well-formed XML document or a set of documents can be viewed as an XML database, and the associated DTD or XML Schema is its database schema. XQuery and XPath are usually the query languages of an XML database. Compared to relational databases, the main differences are the hierarchical data model and the implicit order of the XML data. Therefore, the major novel issues related to query processing in XML databases are (1) handling the query logic related to the hierarchical structure of the XML document, and (2) dealing with the implicit order. We call these problems structural XML querying. In this work, we provide a comprehensive survey of state-of-the-art approaches and related aspects of efficient structural XML querying. In particular, we start with a description of labeling schemes that capture the structure of the data and of the respective storage strategies. Then we deal with the key part of every XML query processing algorithm, the twig query join, as well as with optimizations of XML query processing. Moreover, we describe two twig query joins that can be used in structural XML querying.

We summarize the main contributions of this work as follows:
1. We present a detailed description of up-to-date storages and indices for XML data, as well as a classification of the physical access methods with regard to the node labeling, document partitioning, and twig query joins used.
2. We provide a thorough description of the state-of-the-art twig query joins and compare them in terms of their compatibility, features, and supported query models.
3. We describe XML query algebras and outline their compatibility with twig query joins.
4. We discuss the main aspects of cost-based optimization techniques and selectivity estimation approaches for XML queries. We compare these techniques in terms of supported query models, twig query join algorithms, and several other practical features.
5. We describe in detail two of our novel twig query join algorithms, GTPStack and CostTwigJoin. We compare these algorithms with other state-of-the-art algorithms and describe our contributions thoroughly.

The content of this work is based mainly on the following publications:
R. Bača, M. Krátký, T. W. Ling, and J. Lu. Optimal and Efficient Generalized Twig Pattern Processing: a Combination of Preorder and Postorder Filterings. The VLDB Journal, 22:1-25, Springer. [8]
R. Bača, P. Lukáš, and M. Krátký. Cost-based holistic twig joins. Information Systems, 52:21-33, Elsevier. [9]
R. Bača, M. Krátký, I. Holubová, M. Nečaský, T. Skopal, M. Svoboda, and S. Sakr. Structural XML Query Processing. Accepted in ACM Computing Surveys. [12]

1.1 Motivation

The adoption of the Extensible Markup Language (XML) [19], proposed by the W3C as a standard for information exchange, has gained so much momentum that XML is now undoubtedly a main standard for the representation and exchange of data. Before this happened, we witnessed a massive boom of techniques that enable efficient storing and querying of XML data. That boom in proposals of new techniques for efficient structural XML querying is now over, and the research world has shifted its attention towards other kinds of data models and data formats (e.g., JSON [18], NoSQL [88], RDF [14], or linked data [16]). However, according to Gartner [36], there is an increasing trend towards a new generation of multi-model database systems (e.g., OrientDB, MarkLogic, or HPE Vertica), which are designed to store data in a combination of related models and to query across them. Therefore, we believe that the work done on XML database management systems (XDBMSs) is still relevant and can be used in a new generation of database systems.

1.2 XML Model and XML Query Languages

This section contains a brief introduction to the structural XML querying problem.

1.2.1 XML and XML Model

For the purpose of machine processing of XML data, we do not view an XML document as a textual document; instead, we work with its model. Every XML document has to be well-formed, which basically means that its tags form a hierarchical structure. Hence, the model is a tree with several types of data nodes corresponding to elements, attributes, or textual data. An example of an XML document and its XML tree model is depicted in Figure 1.
In the following text, we use the term labeled path for the sequence of node names along a path in an XML tree.

1.2.2 XML Querying

Issues and difficulties of structural XML querying are mostly observed with respect to the XPath [31] and XQuery [17] languages, where the latter is an extension of the former. Both languages are the key standards among XML query languages, and so the corresponding XML query processing algorithms are essential for any XDBMS.
XML document:

<notes>
  <note status="important">
    <to>roope</to>
    <from>jani</from>
    <body>call me!</body>
  </note>
  <note status="new">
    <to>radim</to>
    <due> </due>
    <body>finish article</body>
  </note>
</notes>

XML tree model:

notes
  note
    status: important
    to: Roope
    from: Jani
    body: Call me!
  note
    status: new
    to: Radim
    due
    body: Finish article

Figure 1: An XML document and its XML tree model

XPath is a query language for selecting nodes from an XML document. It provides the ability to navigate within an XML document and to select particular nodes by a variety of criteria. We can find two major types of constructs in any XPath query: (1) structural constructs, including navigation in an XML tree, selection by element or attribute name, wildcards, Boolean expressions, and quantifiers, and (2) content constructs, including predicates on the node content (i.e., the element content or the attribute value) and comparisons of the node content. The content constructs can often be handled by methods designed for query processing in RDBMSs [13, 80, 24]; moreover, a number of works deal with the content constructs in XML [63, 55, 10, 4]. In this work, we focus on the structural constructs.

The major structural query model used by most approaches is called a twig pattern query (TPQ). A basic TPQ Q = (V, E) is a tree with a set V of query nodes and a set E of edges. Query nodes represent nodes of an XML document to be retrieved, and edges represent structural relationships between the nodes. A query node q ∈ V is labeled by a node name. An edge e ∈ E can be of the parent-child (PC) or the ancestor-descendant (AD) type. PC and AD edges are visualized by single and double lines, respectively.
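The TPQ definition above can be made concrete with a small data structure. The following is a minimal sketch only: the names `Axis`, `QueryNode`, and `edges`, as well as the hand-built encoding of the query `//a//b[.//d]/e`, are assumptions introduced here for illustration and do not come from the original work.

```python
from dataclasses import dataclass, field
from enum import Enum


class Axis(Enum):
    PC = "/"    # parent-child edge
    AD = "//"   # ancestor-descendant edge


@dataclass
class QueryNode:
    name: str                      # node name that labels the query node
    axis: Axis = Axis.AD           # type of the edge leading from the parent
    children: list = field(default_factory=list)

    def add(self, name, axis):
        """Attach a child query node and return it."""
        child = QueryNode(name, axis)
        self.children.append(child)
        return child


def edges(q):
    """Enumerate the edge set E as (parent name, axis, child name) triples."""
    for c in q.children:
        yield (q.name, c.axis.value, c.name)
        yield from edges(c)


# Hypothetical encoding of //a//b[.//d]/e: b has a predicate child d (AD)
# and a child e (PC).
a = QueryNode("a")
b = a.add("b", Axis.AD)
b.add("d", Axis.AD)
b.add("e", Axis.PC)
q1_edges = list(edges(a))
```

In this representation V is the set of `QueryNode` objects reachable from the root, and E is what `edges` enumerates; output nodes and the TPQ extensions are deliberately left out.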
Q1: //a//b[.//d]/e (TPQ)
Q2: //b/d/following-sibling::e (TPQ^>)
Q3: //a//b[./ancestor::b][./e]/d (X-TPQ)
Q4: //a//*[./b and not(./e)]//d (TPQ^*)

Figure 2: XPath queries and their corresponding TPQs

TPQs can represent XPath queries which contain child and descendant axes in their steps and predicates. A sample XPath expression of this kind is depicted in Figure 2 (Q1). The XPath language also contains other constructs; thus, there exist different extensions of the TPQ that handle various structural aspects of XPath queries [73, 68]. An example of the major TPQ extensions is provided in Figure 2 (Q2 to Q4). We use a notation for TPQ extension names in which a structural construct not belonging to the basic TPQ definition is usually written as a superscript of the TPQ name (e.g., Q2 and Q4).

Algorithms evaluating a TPQ on an XML document are called twig query joins or structural joins. Any twig query join finds all the occurrences of a TPQ in an XML tree (also called TPQ matching). We discuss them in Section 3. Let us note that TPQ matching is often considered the core problem of XML querying.

Q5: for $a in //a[//c//f], $aa in $a//b//a return <o> {$a,$aa} </o>
Q6: for $a in //a, $d in $a//b//d return <o> {$d,$a/b} </o>
Q7: for $b in //b[not(.//d/c) or /c] let $cc := $b//c/f return <o> {$b,$cc} </o>

Figure 3: XQuery queries and their corresponding GTP representations (legend: optional edge, output node, optional output node)

The XQuery language [17] extends XPath in many ways. Two important extensions are the ability to specify output query nodes and the introduction of sequences in values. If we used only the TPQ model, we would have to postprocess the TPQ output in order to get the expected XQuery output. To overcome this problem, the generalized twig pattern (GTP) [30] was introduced, making such postprocessing unnecessary. Figure 3 shows an example of three XQuery queries with their corresponding GTP models. We use a simplified notation of the GTP model introduced in [28]. The optional edge and the optional output node enable a simple representation of the let clause or of an optional XPath construct in the return clause (see Q6 and Q7 for an example).

1.3 Labeling Schemes

Most twig query joins use a labeling scheme that assigns a unique node label to each node. Node labels allow us to resolve basic operations between two nodes u and v during query processing.
The most significant operations are (1) finding the lowest common ancestor (LCA) of u and v in an XML tree, (2) resolving the AD or PC relationship between u and v, and (3) deciding whether u precedes v in document order.
There are two major types of labeling schemes: (1) fixed-length labeling schemes, where the label has a fixed length, e.g., the Containment labeling scheme [106] (see an example in Figure 4(a)), and (2) prefix-based labeling schemes, where the length of the label is equal to the node depth, e.g., the Dewey order [94] (see an example in Figure 4(b)). In many labeling schemes, the node label serves as a unique identifier (node id) of a node in an XML document.

Figure 4: (a) Containment labeling scheme (b) Dewey order labeling scheme
(a) a (1,16); b (2,13), b (14,15); a (3,4), c (5,8), c (9,12); b (6,7), b (10,11)
(b) a 1; b 1.1, b 1.2; deeper nodes a, c, c, b, b

The labeling schemes mentioned in the previous paragraph [106, 34, 94] are early works with certain issues, such as poor update performance or a lack of support for the LCA operation. These issues were the main reason for the introduction of several other labeling schemes such as ORDPath [74], DDE [103], or Branch code [102]. A specific way to improve the update capability of fixed-length labeling schemes is the introduction of bulk operations [51]. CT-label [59] is an approach that aims at reducing the label size when an XML document becomes nearly static. Some labeling schemes, such as compressed branch code [102], DFPD [60], or DPLS [62], are not considered here because they can introduce false hits under certain circumstances. The prefix-based labeling schemes have fast insertion, but this is balanced out by the cost of the basic operations. Therefore, there is no holy grail of labeling schemes, and a trade-off has to be made between the performance of the basic operations and the performance of updates. When selecting an appropriate labeling scheme, we also have to consider support for the LCA operation, as it is required by various twig query joins.
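The basic operations on the two label types can be sketched as follows. This is an illustration under stated assumptions, not the implementation of any cited scheme: the function names are hypothetical, and the `level` component of the containment label is an assumption added here because a parent-child test cannot be decided from the (start, end) interval alone.

```python
from typing import NamedTuple, Tuple


class Containment(NamedTuple):
    start: int
    end: int
    level: int  # assumed extra component, used only to tell PC from AD


def c_is_ancestor(u, v):
    """AD test: the interval of u strictly contains the interval of v."""
    return u.start < v.start and v.end < u.end


def c_is_parent(u, v):
    """PC test: containment plus a depth difference of exactly one."""
    return c_is_ancestor(u, v) and u.level + 1 == v.level


def c_precedes(u, v):
    """Document order follows the start positions."""
    return u.start < v.start


Dewey = Tuple[int, ...]  # e.g. the label 1.1.2 is written (1, 1, 2)


def d_is_ancestor(u, v):
    """AD test: u is a proper prefix of v."""
    return len(u) < len(v) and v[:len(u)] == u


def d_is_parent(u, v):
    """PC test: u is a prefix of v that is shorter by one component."""
    return len(u) + 1 == len(v) and v[:len(u)] == u


def d_precedes(u, v):
    """Document order is the lexicographic order of the components."""
    return u < v


def d_lca(u, v):
    """The LCA label is the longest common prefix of both labels."""
    i = 0
    while i < min(len(u), len(v)) and u[i] == v[i]:
        i += 1
    return u[:i]
```

The containment tests compare a constant number of integers, whereas every Dewey operation may touch as many components as the node depth, which is one way to read the trade-off described above. The Dewey label, on the other hand, yields the LCA directly as a common prefix.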
2 XML Storage Techniques

Native XDBMSs build structural indices that allow them to avoid accessing the XML document when resolving structural constructs of XML queries. Section 2.1 describes a general storage concept called the partitioning of an XML document. Sections 2.2 and 2.3 describe two data structures called a schema tree and a node index (both structures can be built during preprocessing of an XML document), while Section 2.4 summarizes the advantages of various storage settings.

2.1 Partitioning of XML Document

Nodes of an XML document can easily be divided into disjoint sets (partitions), where each set is identified by its partition label. Partitioning approaches can be divided as follows: (1) those based on the document structure [106, 42, 26, 49], and (2) those considering a typical XML query workload [50, 27]. Tag partitioning [106, 5] is the most common partitioning based on the document structure: nodes are divided according to their tags, and the number of partitions is equal to the number of unique tags in the XML document. There are also other partitionings based on tag+level [29], labeled paths [105, 29, 52], or forward & backward paths (F&B) [49]. Each of these partitionings is a decomposition (refinement) of another one, as illustrated in Figure 5. On the one hand, tag partitioning produces a low number of partitions with a high number of nodes in each partition. On the other hand, F&B has the opposite property: a possibly high number of partitions with a low number of nodes in each partition. Another type of partitioning is semantic partitioning [6], which partitions the XML document according to the structure specified in an XML schema.

2.2 Schema Tree

A schema tree for an XML document is a labeled tree in which each labeled path occurs only once. An example of a schema tree is depicted in Figure ??(b).
In the literature, there are many names for this tree: DataGuide [42], path tree [3], summary tree, summary index, path index, and so on. A schema tree is useful for the following purposes: (1) to determine the partition labels corresponding to a query [11] when a more refined partitioning is used (e.g., tag+level or labeled paths), (2) to support a simple selectivity estimation [3], and (3) to obtain general knowledge about the structure of an XML document (e.g., to support an auto-completion feature in an XML editor).

2.3 Node Indices

Node labels are not stored in the schema tree structure; instead, they are stored as values in a node index. There are two basic types of node indices: (1) those having a node id as a key (document indices), and (2) those having a partition label as a key (partition indices). In the partition index, nodes corresponding to one partition label are sorted according to the node label; therefore, a sequential scan of the key's list is usually necessary during query processing. This can be improved by an index (XB-tree [20] or XR-tree [48]) built over each list. All types of node indices are schematically depicted in Figure 6.

Figure 5: Properties of various XML document partitionings
Refinement order: tag, tag+level, labeled path, F&B (coarsening in the opposite direction). The tag end has a low number of partitions and a high number of nodes in a partition; the F&B end has a high number of partitions and a low number of nodes in a partition.

Figure 6: (a) Document index (b) Partition index (c) Partition index with indexed lists
(a) B-tree; key: node id; value: node label + some other information
(b) B-tree; key: partition label; value: list of node labels
(c) B-tree; key: partition label; value: XB-tree

From the query processing perspective, a document index is very useful when we have a small set of context nodes and want to use it to resolve the remaining relationships of a query. This type of query processing can be considered navigational. On the other hand, many twig query joins are based on the partition index (see Section 3). This type of join focuses mainly on a merge performed during one sequential scan of the lists, which removes irrelevant nodes [20, 5]. A list in the partition index is called a stream in the join literature; we use the term sequence in the following text.

2.4 Balancing XML Storage

From the query processing point of view, the selection of a partitioning influences the size of three problems: (1) the overall number of nodes that have to be read from a node index, (2) the number of random accesses into a node index, and (3) the necessity to find all the partition labels in a potentially large schema tree. If we reduce the first problem by using a more refined partitioning, then the latter two problems grow, and vice versa. This behavior also depends on the query workload.

As mentioned in Section 2.1, there are also partitioning techniques based on a typical XML query workload [50, 27]. These approaches take typical queries into account and create the partitioning with respect to them. They minimize all three query processing problems mentioned above; however, they are only effective for a specific workload.
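To make the notions of partitioning and partition index from Sections 2.1 to 2.3 more tangible, the following sketch builds an in-memory partition index over a small document similar to the one in Figure 1. It rests on several assumptions: node ids are plain preorder numbers rather than labels of a real labeling scheme, attributes and text nodes are ignored, a Python dictionary stands in for the B-tree, and the helper names (`build_partition_index`, `by_tag`, `by_tag_level`, `by_labeled_path`) are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

DOC = """<notes>
  <note status="important"><to>roope</to><from>jani</from><body>call me!</body></note>
  <note status="new"><to>radim</to><due/><body>finish article</body></note>
</notes>"""


def build_partition_index(root, key):
    """Map each partition label to the list of node ids in that partition."""
    index = defaultdict(list)
    counter = 0

    def visit(elem, path):
        nonlocal counter
        counter += 1                       # preorder number used as node id
        path = path + (elem.tag,)
        index[key(path)].append(counter)   # preorder visit keeps lists sorted
        for child in elem:
            visit(child, path)

    visit(root, ())
    return dict(index)


def by_tag(path):
    return path[-1]


def by_tag_level(path):
    return (path[-1], len(path))


def by_labeled_path(path):
    return "/" + "/".join(path)


root = ET.fromstring(DOC)
tag_index = build_partition_index(root, by_tag)
path_index = build_partition_index(root, by_labeled_path)
```

The keys of `path_index` are exactly the labeled paths of the document, so they also enumerate the nodes of its schema tree. Each list is sorted by node id, which is the property that merge-like joins rely on when they scan a sequence.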
3 Twig Query Join Algorithms

As discussed in Section 1.2.2, the basic problem of XML querying is finding all the occurrences (query matches) of a TPQ in an XML tree [33]. Algorithms addressing this problem are usually called twig query joins, and they represent a basic operator in an XML query algebra. Unlike in the relational domain, where the term join stands for an algorithm-independent operation, in this work we use the term twig query join for algorithms that solve the TPQ matching problem, as is common in the referenced papers.

In the full version of this work we define a data structure API that abstracts the data structures of the storage system. Let us only mention that there are two basic access patterns: (1) a navigation API, which uses a node for navigation in the XML document, and (2) a stream API, which uses only a sequential scan of the node sequence corresponding to one partition label. Operations in these APIs have significantly different performance characteristics depending on the available node indices (see Section 2.3). In general, the document index supports both the navigation and stream APIs, whereas the partition index supports only the stream API. However, when processing the stream API, the document index has to access nodes using a sequential scan of a B-tree containing many irrelevant nodes, which is inefficient. A query processed over the partition index accesses only the nodes relevant to a partition label; therefore, the partition index is crucial for the efficient processing of all merge-like algorithms. The selection of an appropriate index and algorithm is a task for a cost-based optimizer (see Section 4.1), and we propose an algorithm based on such a selection in Section 4.3.

Before we describe particular algorithms, we specify the steps that can be identified as parts of any twig query join:

Filtering: An algorithm scans sequences and filters out nodes not corresponding to any query match.
In this stage, algorithms use main-memory filtering data structures to get rid of as many nodes irrelevant to the query as possible. The most common filtering data structure is a stack or a set of stacks. The main feature of the filtering data structure is that it helps to decide in constant time whether a node is useless or not. Of course, every filtering can have false hits.

Intermediate storage: Every join uses some type of intermediate storage, where nodes are stored before the query output is enumerated. In some approaches, the filtering data structure is used as the intermediate storage as well.

Output enumeration: In this step, algorithms read the intermediate storage and generate an output, which is usually in the form of ordered tuples. The major task of this step is often the ordering of tuples according to the XML query model.

One of the most significant differences among joins is the method used during the filtering step. The first type of filtering method focuses only on a pair of query nodes, and the twig query join is processed as a set of binary structural joins. Holistic joins, on the other hand, filter input nodes based on information from all query nodes. Figure 7 summarizes the major categories of twig query joins, and Table 1 lists their main features.

Figure 7: Categories of algorithms utilized for twig query joins (diagram labels: Twig query joins; Binary structural joins; Holistic joins; Navigational; Merge-like; Top-down; Bottom-up)

Table 1: Major features of twig query join categories

Binary structural joins
  Pros: They can be easily integrated into any XML query algebra and support all XPath axes.
  Cons: Their efficiency depends significantly on the selection of a good query plan. They can produce a large intermediate result compared to the query output.
Top-down holistic joins
  Pros: Linear I/O complexity of the query processing with respect to the sum of the output and input sizes for some query types; no query plan optimization is necessary.
  Cons: A sequential scan of the intermediate result is required even if it contains many useless nodes (see Section ??).
Bottom-up holistic joins
  Pros: Linear CPU and I/O complexities of the output enumeration with respect to the output size.
  Cons: They can produce a large intermediate result compared to the query output (see Section ??).

3.1 GTPStack

In this section, we outline the basic ideas of our holistic join algorithm GTPStack introduced in [8].

Node filtering

As described in the introduction of this section, every holistic algorithm has a filtering mechanism that skips irrelevant input data nodes, i.e., nodes which are not a part of any query match, before these nodes are stored in an intermediate storage. Holistic algorithms use stacks during the filtering. In the following text, #q denotes a query node q. If the filtering skips irrelevant nodes so that they are not stored on stacks at all, we speak about top-down filtering.
Bottom-up filtering skips irrelevant nodes (i.e., they are not stored in the intermediate storage) when they are popped from their stacks. The simplest top-down filtering is represented by PathStack [20], which skips an irrelevant node n corresponding to #q when there is no occurrence of a path from #root to #q containing n. Another type of top-down algorithm uses a recursive filtering function such as getnext [20] or getpart [43] and skips irrelevant nodes which are not a part of a whole TPQ occurrence. On the other hand, a bottom-up filtering algorithm (e.g., Twig²Stack [28] or TwigList [81]) skips an irrelevant node n corresponding to #q if there is no occurrence of the subtree rooted at #q containing n. We say that a filtering is optimal for a query Q if it skips all irrelevant nodes during the sequential scan of the input, which means that an algorithm with such a filtering has a linear worst-case time and I/O complexity for Q with respect to the sum of the input and TPQ result sizes. For example, the TwigStack [20] and TJStrictPre [43] algorithms are optimal for TPQs having only ancestor-descendant relationships.

Table 2 shows the number of nodes stored in an intermediate storage by various filtering algorithms. In this table, we use three TreeBank queries. The last column shows the number of nodes which are a part of a GTP result tuple. Evidently, the combination of PathStack and the bottom-up filtering can store an enormous number of irrelevant nodes; however, it stores fewer nodes than the bottom-up filtering alone. A top-down filtering algorithm such as TwigStack stores significantly fewer nodes, but it still typically stores a large number of irrelevant nodes because it filters only according to the TPQ model and no bottom-up filtering is included.
Table 2: Numbers of nodes inserted into an intermediate storage and numbers of nodes relevant to the GTP result
Columns: Query; Bottom-up (Twig²Stack); PathStack + Bottom-up (Twig²Stack+PathStack); Top-down (TwigStack); Nodes in GTP result
TB1 172,851 92,972 32,
TB2 170,874 49,765 24,
TB3 404, ,961 12,

In this section, we briefly outline the features of the GTPStack algorithm, which combines a top-down filtering function with a bottom-up filtering (let us call it a combined filtering). To the best of our knowledge, it is the first correct algorithm that is able to apply such a combined filtering before a node is stored in the intermediate storage. Our combined filtering enables optimal filtering according to the GTP; therefore, only the nodes relevant to the GTP result are stored in the intermediate storage if the algorithm is optimal. In other words, if GTPStack is optimal, then it has a linear worst-case I/O complexity with respect to the GTP result size. GTPStack's combined filtering significantly reduces the number of nodes in the intermediate storage even if GTPStack is not optimal for a query. Moreover, in order to speed up query processing, we use the following two improvements in the filtering mechanism: (1) we introduce a novel top-down filtering function called getmatch, which always outperforms the getpart function [43], and (2) we avoid storing predicate nodes on stacks.
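The simplest top-down filtering discussed in this section can be illustrated with a short sketch in the spirit of PathStack. It is not the published algorithm: it handles only a linear path query with AD edges and distinct node names, it performs no output enumeration, and the function name `path_filter` as well as the input layout are assumptions made for this example.

```python
def path_filter(streams):
    """Stack-based top-down filtering for a root-to-leaf, AD-only path query.

    streams[i] is the sequence of (start, end) containment labels for the
    i-th query node on the path, sorted by start. The result lists, per
    query node, the labels that were pushed on a stack (i.e., not skipped).
    """
    n = len(streams)
    pos = [0] * n                      # current position in each sequence
    stacks = [[] for _ in range(n)]    # one stack per query node
    kept = [[] for _ in range(n)]
    while True:
        live = [q for q in range(n) if pos[q] < len(streams[q])]
        if not live:
            break
        # Pick the query node whose next label comes first in document order.
        q = min(live, key=lambda i: streams[i][pos[i]][0])
        start, end = streams[q][pos[q]]
        # Pop labels ending before the current node starts: they cannot be
        # ancestors of this node or of any later node.
        for s in stacks:
            while s and s[-1][1] < start:
                s.pop()
        # Keep the node only if a partial path from the root reaches it.
        if q == 0 or stacks[q - 1]:
            stacks[q].append((start, end))
            kept[q].append((start, end))
        pos[q] += 1
    return kept


# Sequences taken from the containment labels of Figure 4(a), query //a//c//b.
a_seq = [(1, 16), (3, 4)]
c_seq = [(5, 8), (9, 12)]
b_seq = [(2, 13), (6, 7), (10, 11), (14, 15)]
kept = path_filter([a_seq, c_seq, b_seq])
```

On this input the two `b` nodes without a `c` ancestor are skipped, while the node `a (3,4)` is kept although it has no `c` descendant. This is the kind of irrelevant node that a path-only top-down filtering cannot remove and that a bottom-up step would discard.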
GTPStack processing time improvements

Let us briefly illustrate the ideas behind the above improvements with examples. A filtering function such as getnext or getpart is typically called many times, as shown in Table 3. This table gives the numbers of function calls corresponding to the inner query nodes and the numbers of unnecessary calls for three TreeBank queries. An unnecessary call of the filtering function works with exactly the same data nodes as the previous call; therefore, it returns the same query node. As observed for query TB3, almost half of the function calls can be unnecessary. Our novel getmatch function avoids all these unnecessary calls. Additionally, as shown in Section 3.2.3, the efficiency of the filtering function depends significantly on its ability to skip irrelevant nodes. The getpart and getnext functions sometimes return irrelevant nodes which are subsequently not stored on stacks. Another advantage of getmatch is that it skips all these irrelevant nodes.

Table 3: Number of the filtering function calls corresponding to the inner query nodes
Columns: Query; getnext calls [10^3]; getnext unnecessary calls [10^3]; getpart calls [10^3]; getpart unnecessary calls [10^3]
Rows: three TreeBank queries

We use the term main branch query node for the query nodes which lie on a query path from the root to an output node. The rest of the query nodes are called predicate query nodes. GTPStack separates the node filtering from the output enumeration, which allows us to avoid storing the nodes corresponding to the predicate query nodes on stacks.

Optimality

An important feature of holistic algorithms using top-down filtering is that, for some query classes, they have a linear worst-case I/O complexity with respect to the TPQ result size (i.e., they are optimal). Different holistic approaches define their optimality conditions in different ways; however, all of them specify only requirements on the query.
To the best of our knowledge, GTPStack is the first algorithm that is optimal for some query classes with respect to the GTP result size. GTPStack's optimality is determined only by the XPath axes and the XML document characteristics. In other words, the semantics related to output nodes, Boolean expressions, and quantifiers do not influence its optimality.

getmatch Filtering Function

The existing getnext and getpart top-down filtering functions have two shortcomings: (1) they often perform unnecessary recursive calls, and (2) they return a query node #q even if there is no ancestor of H(#q) on S_parent(#q). As a result, they both cause many unnecessary computations, which are completely avoided by the getmatch function introduced in this section.
The getmatch function introduces three improvements over the existing top-down filtering functions: (1) dynamic programming that avoids unnecessary recursive calls, (2) a filtering procedure which advances the sequence according to the bottom node of the parent's stack, and (3) a cycle for the inner nodes which does not terminate the getmatch call until all sequences are ended or a promising node is found. The usage of the parent's stack and the above cycle lead to a more progressive advancing of sequences, which is a major parameter influencing the processing time of a filtering function, as is shown in Section ??.

A top-down filtering function has the following property for a specific class of queries (see below): it returns only a query node #q such that there is a query match of #q containing H(#q). Therefore, the top-down filtering removes all nodes irrelevant to the TPQ. The getpart and getnext procedures guarantee this property for queries having only AD relationships [20, 43]. Since getmatch only avoids unnecessary function calls and skips irrelevant nodes having no ancestor on the parent's stack, its optimality properties are the same as those of getpart and getnext. This means that getmatch is optimal for queries having only AD relationships. Let us note that in Section 3.3 we prove that a holistic algorithm with a top-down filtering function (e.g., getmatch) can be optimal even for a query containing any combination of PC and AD relationships, depending on the XML document characteristics.

Summary of GTPStack

GTPStack is the first algorithm with a linear worst-case I/O complexity with respect to the sum of the input and GTP result sizes (in this case, GTPStack is optimal for the GTP). This is mainly achieved by the combination of the top-down and bottom-up filterings and, to the best of our knowledge, GTPStack is the first correct holistic algorithm using a combined filtering before storing a node in an intermediate storage.
The combined approach used in GTPStack has the following advantages: (1) it allows us to avoid storing nodes corresponding to predicate query nodes on stacks, which speeds up the query processing, and (2) it significantly decreases the number of nodes in the intermediate storage even when GTPStack is not optimal for a query. GTPStack uses our novel filtering function called getmatch, which avoids unnecessary function calls and improves sequence advancing, which further speeds up the query processing. All these features make GTPStack superior to the state-of-the-art holistic approaches, as is shown experimentally in the following section.

GTPStack Experimental Results

We implemented five state-of-the-art holistic algorithms in C++: TwigStack [20], TwigList [81], TJStrictPre [43], TJStrictPost [43], and Twig²Stack+PathStack [28] (abbreviated to T2PS). We do not include experimental results of the TwigList algorithm since both TJStrictPost and TJStrictPre use an improved version of TwigList. In our experiments, we use more than one version of GTPStack. We combine GTPStack with existing top-down filtering functions; therefore,
we use the following simple notation: GTPStack+N, GTPStack+P, and GTPStack+M stand for GTPStack combined with getnext, getpart, and getmatch, respectively. By writing GTPStack we mean any of the above versions of GTPStack. Since GTPStack+N always outperforms TwigStack, we include the results for the TwigStack query processing only in the intermediate storage test (see the test of intermediate storages below). The main shortcomings of TwigStack are its redundant intermediate storage and its inefficient output enumeration.

We use one synthetic XML document of our own, called ZIPF, and three real-world XML collections. The ZIPF document contains seven different elements named from a to g, spread randomly using the Zipfian distribution, where a has the highest occurrence (approximately 50%) and g has the lowest occurrence (approximately 1%). Every element of ZIPF has exactly two children and the depth of the collection is 24, which means that all paths in ZIPF have the same length. The real-world collections are XMark [91] with factor 10, INEX 1.9 [40], and TreeBank [97].

Queries for the XMark and TreeBank collections are selected from several existing articles on TPQ processing [64, 28, 56]. Queries for the INEX collection were selected in order to show differences between the algorithms. A list of the selected queries for the real-world collections can be found in the full version of this work. The largest number of queries was generated for the ZIPF collection. The ZIPF queries are generated according to the five query templates shown in Table 4. A template only specifies the relationships between query nodes, the output query node, and the predicate query nodes.

Query template                        Number of generated queries
1. //τ[/υ and /ω]
2. //α/β[//χ and //δ]
3. //α/α[//β]/χ//χ[//δ and //ϵ]
4. //α[/τ and //υ and //ω]
5. //α/β[//χ/δ]                       81

Table 4: ZIPF query templates for XPath queries

We ran our experiments on a PC with an Intel Xeon 2.93 GHz CPU and the Windows Server 2008 operating system.
When measuring the processing time, each query is processed fifteen times in the main memory and then we compute the average result, omitting the two worst and the two best results. If we want to compare the processing times $T_x$ and $T_y$ of two approaches x and y for a set of queries, we first compute the geometric mean of the ratios $T_y / T_x$ over all queries. Subsequently, since we want to express the value as a percentage, we simply subtract one from the calculated geometric mean and multiply the result by 100. We call it the relative processing time improvement (RPTI) of approach x compared to approach y for a set of queries. For example, if the geometric mean is 1.68, we say that x has an RPTI of 68% compared to y. Since GTPStack's improvements of processing time relate only to the filtering part of holistic algorithms, we also measured the filtering time of the algorithms (i.e., the processing time
without the time spent on reading the input data), and therefore, we also compute the relative filtering time improvement (RFTI) of approach x compared to approach y for a set of queries. In order to minimize the processing time measurement error, we say that one method is faster than another for a query Q if its RPTI for Q is at least 2% and their processing time difference for Q is at least 10 milliseconds.

Processing Time

Table 5 gives RPTI and RFTI of GTPStack+M compared to all tested approaches for all queries. Table 5 also contains the number of queries for which GTPStack+M is faster and slower.

Filtering approach   RPTI   RFTI   Number of queries (Faster)   Number of queries (Slower)
T2PS                 77%    172%
TJStrictPost         88%    222%
TJStrictPre          12%    43%
GTPStack+N           68%    150%
GTPStack+P           13%    46%

Table 5: GTPStack+M compared to all tested approaches for all queries

We can observe that GTPStack+M outperforms both approaches using the getpart function (TJStrictPre and GTPStack+P) for all queries, since the getmatch function is always faster than or equally fast as getpart. This comes from the fact that getmatch improves getpart without any additional overhead. The remaining approaches (T2PS, TJStrictPost, and GTPStack+N) perform significantly worse on average than GTPStack+M (RFTI of GTPStack+M ranges from 150% to 222% when compared to these approaches).

To better understand the differences among the algorithms and the advantages of GTPStack+M, we need to compare their corresponding parts separately. We first compare only the result enumeration time; then we compare the top-down filterings; then we show how GTPStack optimizes its processing time for queries with many predicate nodes; and, in the last experiment, we compare the bottom-up filterings.

Test of Intermediate Storages for TPQs

Let us start with a test showing the properties of various intermediate storages and mainly the performance of the output enumeration.
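The measurement procedure and the RPTI measure defined at the beginning of this section can be summarized in a few lines of code. The following Python sketch is only an illustration and not the code used in the experiments; the function names trimmed_mean, relative_improvement, and is_faster are hypothetical, and the times passed to is_faster are assumed to be in milliseconds.

```python
import math


def trimmed_mean(times, drop=2):
    """Average of the runs after omitting the `drop` best and `drop` worst."""
    if len(times) <= 2 * drop:
        raise ValueError("not enough runs to trim")
    kept = sorted(times)[drop:len(times) - drop]
    return sum(kept) / len(kept)


def relative_improvement(times_x, times_y):
    """RPTI (or RFTI) of approach x compared to approach y, in percent.

    times_x[i] and times_y[i] are the times of the i-th query.
    """
    if not times_x or len(times_x) != len(times_y):
        raise ValueError("both approaches need the same non-empty query set")
    # Geometric mean of the ratios T_y / T_x, computed via logarithms.
    log_sum = sum(math.log(ty / tx) for tx, ty in zip(times_x, times_y))
    geometric_mean = math.exp(log_sum / len(times_x))
    return (geometric_mean - 1.0) * 100.0


def is_faster(time_x, time_y, min_rpti=2.0, min_diff_ms=10.0):
    """x counts as faster than y for one query only if both limits are met."""
    rpti = relative_improvement([time_x], [time_y])
    return rpti >= min_rpti and (time_y - time_x) >= min_diff_ms
```

With fifteen measured runs per query, trimmed_mean yields the per-query time that enters relative_improvement; the same function gives RFTI when it is fed filtering times instead of processing times.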
Note that the LIS intermediate storage is used by the TJStrictPost, TJStrictPre, and GTPStack algorithms. The results in this section serve as a hint for selecting the most appropriate intermediate storage for our approach. In this test, we present only the XMark queries since the results for the other collections are similar. However, we ignore the GTP semantics (i.e., we consider all query nodes as output ones)
since TwigStack cannot enumerate GTPs. As a result, queries 8 and 13 did not finish since their TPQ result sizes were over one billion and the available main memory was not sufficient.

Figure 8: Performance of the output enumeration (time in seconds per query for TwigStack, LIS, and T2PS; DNF marks a query that did not finish)

Figure 8 shows the results of this experiment. As expected, TwigStack performs very poorly, which corresponds to the results published in [28]. The inefficiency of the TwigStack intermediate storage comes from the duplicate work with nodes and the sequential scan of the whole intermediate storage during the output enumeration. If we compare the output enumeration times of T2PS and LIS, the difference is not significant. This result comes from the fact that they both use the bottom-up filtering; therefore, their output enumeration time is linear with respect to the result size. Finally, we decided to use the LIS storage since the Twig²Stack intermediate storage requires a global pop order for its correct functionality. LIS can work with our bottom-up filtering and its enumeration time performance is comparable to that of the Twig²Stack intermediate storage.

Analysis of Top-down Filtering

We can find three major types of top-down filtering approaches: (1) the PathStack algorithm, which filters only according to the node's query path from the root query node and does not perform any sequence advance, (2) the getnext function, which performs a sequence advance according to the query node descendants, and (3) getpart and getmatch, which perform a sequence advance according to the query node descendants and ancestor. We ignore getpart in the following text since getmatch always outperforms this function for our queries, as is shown by the processing time results in Table 5.

The PathStack algorithm is very simple and fast and its processing time is linear with respect to the input size (the correlation coefficient between the PathStack processing time
and the input size for the ZIPF queries is 0.98). Since PathStack does not use any advanced sequence forwarding, its processing time is not influenced by the query result. On the other hand, a filtering function (i.e., getnext or getmatch) can skip irrelevant nodes more quickly using the sequence advancing. Therefore, these functions outperform PathStack if they advance the sequence sufficiently often; this is discussed further in this section.

The main attribute indicating the efficiency of a top-down filtering is the average sequence advance (denoted as AvgFwd) during one FwdToAncOf or FwdToDescOf function call. Since the getmatch function uses both forwarding functions while getnext uses only the FwdToAncOf function, we also define: (1) an average sequence advance during one FwdToAncOf function call, and (2) an average sequence advance during one FwdToDescOf function call. Let us call them the average ancestor forward movement and the average descendant forward movement and denote them AvgAncFwd and AvgDescFwd, respectively.

We first compare the getnext and getmatch functions (i.e., we compare GTPStack+N and GTPStack+M). The getnext function uses only FwdToAncOf, and therefore, getmatch performs better if its AvgDescFwd is sufficiently large. Our goal is to find a threshold value k with the following two properties:

if AvgDescFwd > k, then $T_{getnext} > T_{getmatch}$,
if AvgDescFwd < k, then $T_{getnext} < T_{getmatch}$.

Figure 9: Number of queries violating the properties for different k values (a) for all ZIPF queries (b) for queries corresponding to the first and to the fourth ZIPF template (x-axis: threshold value k; y-axis: number of queries [%]; series: first property violation, second property violation, sum of the violations)

Figure 9(a) shows how many queries in our ZIPF query set violate the above properties for different k values.
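The search for the threshold k can be illustrated by counting, for a candidate k, the queries that violate each of the two properties above. The sketch below is a hypothetical Python illustration: the tuple layout and the function names count_violations and best_threshold are assumptions, and equal processing times are counted as violations because both properties use strict inequalities.

```python
def count_violations(queries, k):
    """Count queries violating the first and the second property for a given k.

    `queries` is a list of (avg_desc_fwd, t_getnext, t_getmatch) tuples,
    one tuple per query (an assumed layout, used only for illustration).
    """
    # First property: AvgDescFwd > k should imply T_getnext > T_getmatch.
    first = sum(1 for a, tn, tm in queries if a > k and not tn > tm)
    # Second property: AvgDescFwd < k should imply T_getnext < T_getmatch.
    second = sum(1 for a, tn, tm in queries if a < k and not tn < tm)
    return first, second


def best_threshold(queries, candidates):
    """Candidate k with the smallest sum of violations (first one wins ties)."""
    return min(candidates, key=lambda k: sum(count_violations(queries, k)))
```

Dividing the returned counts by the number of queries gives the percentages plotted on the vertical axis of Figure 9.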
As we can see, there is no optimal k value for which all queries satisfy both properties since the sum of violations never reaches zero. In other words, we cannot find any exact threshold value of AvgDescFwd that would tell us whether to use getnext or getmatch. The reason for this is that various query nodes in a query can have significantly different AvgDescFwd values; therefore, selecting the same filtering function for all query nodes is not always the best solution. Figure 9(b) shows the number of violations for the ZIPF queries
generated by templates 1 and 4. These queries are very simple and the descendant forwarding is always performed in relation to the same parent query node. We can observe that the threshold value k = 0.03 is a good selection for many of them since the sum of violations is less than 5%. Based on the above heuristic, we evaluated a simple combination of the getmatch and the getnext functions which works as follows: first, we process a query using getmatch and collect the statistics about AvgDescFwd in each query node. Second, we use the getmatch function for a query node with AvgDescFwd larger than 0.03 and the getnext function in the other cases. This combination of getmatch and getnext always performs at least as well as any other filtering function.

Similarly, we looked for another two threshold values which would indicate that an algorithm using PathStack performs better than algorithms using getnext or getmatch. Both algorithms have the threshold value of AvgFwd approximately equal to 0.3, where only 5% of the ZIPF queries corresponding to the first and fourth template violate the corresponding processing time properties. In other words, if AvgFwd is lower than 0.3, then PathStack performs better in many cases.

Optimization of Predicate Query Nodes

None of the existing holistic algorithms can optimize the query processing time with respect to the number of nodes corresponding to predicate query nodes; therefore, their processing time is the same regardless of the GTP semantics. Their performance is mainly dependent on their top-down filtering, as is shown in the above section. If GTPStack is optimal, then it stores only the nodes corresponding to main branch query nodes on stacks and thus it saves some time during the bottom-up filtering.
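The two heuristics from the analysis of top-down filtering (the per-node choice between getmatch and getnext based on AvgDescFwd, and the AvgFwd value below which PathStack tends to be faster) can be written down as simple decision rules. The Python sketch below is a hypothetical illustration; the function names and the dictionary layout are assumptions, and the default thresholds 0.03 and 0.3 are the values observed in the text for the ZIPF queries of templates 1 and 4, so they may not carry over to other data.

```python
def choose_node_filters(avg_desc_fwd, threshold=0.03):
    """Pick a filtering function for every query node.

    `avg_desc_fwd` maps a query node name to the AvgDescFwd value collected
    during a first run that used getmatch for all query nodes.
    """
    return {
        node: "getmatch" if value > threshold else "getnext"
        for node, value in avg_desc_fwd.items()
    }


def pathstack_likely_faster(avg_fwd, threshold=0.3):
    """True if AvgFwd is below the value under which PathStack often wins."""
    return avg_fwd < threshold
```

For example, choose_node_filters({"a": 0.09, "b": 0.01}) assigns getmatch to query node a and getnext to query node b.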
Figure 10: Variants of the TB12 query with different numbers of output nodes (query nodes #S, #NN, #NP, #DT, #PP, and #IN)
Figure 11: (a) Processing time of GTPStack+M and (b) number of nodes stored on stacks by GTPStack for each variant of TB12 (x-axis: query; y-axes: processing time [s] and number of nodes stored on stacks [10^5])

In Figure 10, we can observe seven variants of the TB12 query with different numbers of output nodes. We selected the TB12 query because GTPStack+M is optimal for it. Figure 11(a) shows how the processing time of GTPStack+M decreases with the decreasing number of nodes stored on stacks. The AvgDescFwd and AvgAncFwd values are equal to 0.09 and 0.14, respectively, for the TB12 query, which indicates (according to the results of the above section) that getmatch should be slower than T2PS and TJStrictPost, with processing times of … and 0.202, respectively. However, for the last query variant, where the number of nodes corresponding to predicate query nodes is six times larger than the number of the other nodes, GTPStack+M performs equally to both algorithms (its processing time is 0.216).

Intermediate Storage Size

Table 6: Ratio of the number of nodes stored in various intermediate storages and the number of relevant nodes for each collection (rows: T2PS and TJStrictPost, TJStrictPre, GTPStack; columns: ZIPF, TB, XM, INEX)

Table 7: Median values of nodes stored in an intermediate storage for each collection (rows: T2PS and TJStrictPost, TJStrictPre, GTPStack, relevant nodes; columns: ZIPF, TB, XM, INEX, each in units of 10^3)

Another important property of every algorithm related to the I/O complexity is the intermediate result size. We evaluate the intermediate result size in terms of the number of nodes stored there. For each method we compute a ratio of the number of nodes stored in the intermediate
storage and the number of relevant nodes for each collection and filtering method. Table 6 shows this ratio for all queries in each collection, computed using the geometric mean. Table 7 shows the median value of nodes stored in an intermediate storage for each collection. Generally, GTPStack stores one order of magnitude fewer nodes than the rest of the tested approaches because it uses the combined approach. The intermediate result size does not have a significant relationship to the processing time during the main memory run since every algorithm has to perform some extra operations if it wants to avoid storing useless nodes. However, the difference will be huge if an intermediate storage is larger than the main memory, because in this case I/O operations have to be included.

3.3 Top-down Filtering Optimality

Optimality of a top-down filtering for a query, or at least for a subquery, has several important consequences: (1) we can guarantee that we skip all nodes irrelevant to a TPQ during the top-down filtering, (2) the algorithm optimality is necessary for a more efficient top-down node filtering of a query with the NOT operator, (3) we store only nodes corresponding to the output query nodes in the intermediate storage, and (4) we avoid storing all nodes corresponding to the predicate query nodes on stacks.

Every top-down filtering algorithm has its specific query classes for which the algorithm optimality is proved. The most common optimality class of top-down algorithms is the AD query (a query having only AD relationships), which is also the optimality class of GTPStack if tag partitioning is used. However, we show that the top-down algorithm optimality can be significantly extended using the tag+level or labeled path partitioning. For a more thorough comparison of the optimality query classes of the algorithms, see [12].
When we speak about a holistic algorithm in this section, we mean a holistic algorithm using a top-down filtering, such as TwigStack, TJStrictPre, or GTPStack. Let us define several terms related to the query nodes of a TPQ:

A query node with PC in its subtree is a query node #q having the PC relationship between some two nodes from the subtree(#q) set.

A checking query node #q is a query node with PC in its subtree and having the AD relationship with its parent. It is an important query node type from the optimality point of view.

Checkingnodes(Q) is the set of checking query nodes in the query Q.

A tag is called single level if all nodes in the XML tree with this tag are on the same level.

$PRU_{\#q}$ is a set of partition labels that corresponds to a query node #q. Note that under tag partitioning we can have just one partition label in every $PRU_{\#q}$. If we use some other partitioning, then we need to do the DataGuide search first. By the term DataGuide search we mean determining the partition labels corresponding to a query (see Section 2.2).
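The definitions above translate directly into a small computation over a query tree. The following Python sketch is a hypothetical illustration: the QueryNode class and the function names are assumptions, and so is the treatment of the root query node, whose leading axis (for example // in an XPath query) is taken here as its relationship with its parent.

```python
from dataclasses import dataclass, field


@dataclass
class QueryNode:
    name: str
    axis: str = "AD"  # relationship with the parent: "PC" or "AD"
    children: list = field(default_factory=list)


def has_pc_in_subtree(node):
    """True if a PC relationship connects two nodes of subtree(node)."""
    return any(
        child.axis == "PC" or has_pc_in_subtree(child)
        for child in node.children
    )


def checking_nodes(root):
    """Names of the checking query nodes of the query, in preorder."""
    found = []

    def visit(node):
        # Checking node: AD relationship with the parent and PC in its subtree.
        if node.axis == "AD" and has_pc_in_subtree(node):
            found.append(node.name)
        for child in node.children:
            visit(child)

    visit(root)
    return found


def single_level_tags(nodes):
    """Tags whose nodes all lie on one level; `nodes` holds (tag, level) pairs."""
    levels = {}
    for tag, level in nodes:
        levels.setdefault(tag, set()).add(level)
    return {tag for tag, seen in levels.items() if len(seen) == 1}
```

For the query //a[b]//c/d, modeled as node a with a PC child b and an AD child c that has a PC child d, checking_nodes returns a and c, while a query with only AD relationships has no checking query nodes.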
More informationTrees. Q: Why study trees? A: Many advance ADTs are implemented using tree-based data structures.
Trees Q: Why study trees? : Many advance DTs are implemented using tree-based data structures. Recursive Definition of (Rooted) Tree: Let T be a set with n 0 elements. (i) If n = 0, T is an empty tree,
More informationCBSL A Compressed Binary String Labeling Scheme for Dynamic Update of XML Documents
CIT. Journal of Computing and Information Technology, Vol. 26, No. 2, June 2018, 99 114 doi: 10.20532/cit.2018.1003955 99 CBSL A Compressed Binary String Labeling Scheme for Dynamic Update of XML Documents
More informationCardinality estimation of navigational XPath expressions
University of Twente Department of Electrical Engineering, Mathematics and Computer Science Database group Cardinality estimation of navigational XPath expressions Gerben Broenink M.Sc. Thesis 16 June
More informationProceedings of the 2012 International Conference on Industrial Engineering and Operations Management Istanbul, Turkey, July 3 6, 2012
Proceedings of the 2012 International Conference on Industrial Engineering and Operations Management Istanbul, Turkey, July 3 6, 2012 Solving Assembly Line Balancing Problem in the State of Multiple- Alternative
More informationBottom-Up Evaluation of Twig Join Pattern Queries in XML Document Databases
Bottom-Up Evaluation of Twig Join Pattern Queries in XML Document Databases Yangjun Chen Department of Applied Computer Science University of Winnipeg Winnipeg, Manitoba, Canada R3B 2E9 y.chen@uwinnipeg.ca
More informationDatabase System Concepts
Chapter 14: Optimization Departamento de Engenharia Informática Instituto Superior Técnico 1 st Semester 2007/2008 Slides (fortemente) baseados nos slides oficiais do livro c Silberschatz, Korth and Sudarshan.
More informationIntegrating Path Index with Value Index for XML data
Integrating Path Index with Value Index for XML data Jing Wang 1, Xiaofeng Meng 2, Shan Wang 2 1 Institute of Computing Technology, Chinese Academy of Sciences, 100080 Beijing, China cuckoowj@btamail.net.cn
More informationFrom Passages into Elements in XML Retrieval
From Passages into Elements in XML Retrieval Kelly Y. Itakura David R. Cheriton School of Computer Science, University of Waterloo 200 Univ. Ave. W. Waterloo, ON, Canada yitakura@cs.uwaterloo.ca Charles
More informationBioinformatics Programming. EE, NCKU Tien-Hao Chang (Darby Chang)
Bioinformatics Programming EE, NCKU Tien-Hao Chang (Darby Chang) 1 Tree 2 A Tree Structure A tree structure means that the data are organized so that items of information are related by branches 3 Definition
More informationCMSC424: Database Design. Instructor: Amol Deshpande
CMSC424: Database Design Instructor: Amol Deshpande amol@cs.umd.edu Databases Data Models Conceptual representa1on of the data Data Retrieval How to ask ques1ons of the database How to answer those ques1ons
More informationPart XII. Mapping XML to Databases. Torsten Grust (WSI) Database-Supported XML Processors Winter 2008/09 321
Part XII Mapping XML to Databases Torsten Grust (WSI) Database-Supported XML Processors Winter 2008/09 321 Outline of this part 1 Mapping XML to Databases Introduction 2 Relational Tree Encoding Dead Ends
More informationBottom Up and Top Down Twig Pattern Matching on Indexed Trees
Nils Grimsmo Bottom Up and Top Down Twig Pattern Matching on Indexed Trees Thesis for the degree of philosophiae doctor Trondheim, 2010-09-02 Norwegian University of Science and Technology. Faculty of
More informationPoint Cloud Filtering using Ray Casting by Eric Jensen 2012 The Basic Methodology
Point Cloud Filtering using Ray Casting by Eric Jensen 01 The Basic Methodology Ray tracing in standard graphics study is a method of following the path of a photon from the light source to the camera,
More informationDATA STRUCTURE AND ALGORITHM USING PYTHON
DATA STRUCTURE AND ALGORITHM USING PYTHON Advanced Data Structure and File Manipulation Peter Lo Linear Structure Queue, Stack, Linked List and Tree 2 Queue A queue is a line of people or things waiting
More informationIntroduction to Query Processing and Query Optimization Techniques. Copyright 2011 Ramez Elmasri and Shamkant Navathe
Introduction to Query Processing and Query Optimization Techniques Outline Translating SQL Queries into Relational Algebra Algorithms for External Sorting Algorithms for SELECT and JOIN Operations Algorithms
More informationIndex-Driven XQuery Processing in the exist XML Database
Index-Driven XQuery Processing in the exist XML Database Wolfgang Meier wolfgang@exist-db.org The exist Project XML Prague, June 17, 2006 Outline 1 Introducing exist 2 Node Identification Schemes and Indexing
More informationChapter 13 XML: Extensible Markup Language
Chapter 13 XML: Extensible Markup Language - Internet applications provide Web interfaces to databases (data sources) - Three-tier architecture Client V Application Programs Webserver V Database Server
More information<=chapter>... XML. book. allauthors (1,5:60,2) title (1,2:4,2) XML author author author. <=author> jane. Origins (1,1:150,1) (1,61:63,2) (1,64:93,2)
Holistic Twig Joins: Optimal XML Pattern Matching Nicolas Bruno Columbia University nicolas@cscolumbiaedu Nick Koudas AT&T Labs Research koudas@researchattcom Divesh Srivastava AT&T Labs Research divesh@researchattcom
More informationLabeling Dynamic XML Documents: An Order-Centric Approach
1 Labeling Dynamic XML Documents: An Order-Centric Approach Liang Xu, Tok Wang Ling, and Huayu Wu School of Computing National University of Singapore Abstract Dynamic XML labeling schemes have important
More informationDepartment of Computer Science and Technology
UNIT : Stack & Queue Short Questions 1 1 1 1 1 1 1 1 20) 2 What is the difference between Data and Information? Define Data, Information, and Data Structure. List the primitive data structure. List the
More informationPerformance Improvement of Hardware-Based Packet Classification Algorithm
Performance Improvement of Hardware-Based Packet Classification Algorithm Yaw-Chung Chen 1, Pi-Chung Wang 2, Chun-Liang Lee 2, and Chia-Tai Chan 2 1 Department of Computer Science and Information Engineering,
More information9/29/2016. Chapter 4 Trees. Introduction. Terminology. Terminology. Terminology. Terminology
Introduction Chapter 4 Trees for large input, even linear access time may be prohibitive we need data structures that exhibit average running times closer to O(log N) binary search tree 2 Terminology recursive
More informationAlgorithms Exam TIN093/DIT600
Algorithms Exam TIN093/DIT600 Course: Algorithms Course code: TIN 093 (CTH), DIT 600 (GU) Date, time: 22nd October 2016, 14:00 18:00 Building: M Responsible teacher: Peter Damaschke, Tel. 5405 Examiner:
More informationSecurity-Conscious XML Indexing
Security-Conscious XML Indexing Yan Xiao, Bo Luo, and Dongwon Lee The Pennsylvania State University, University Park, USA xiaoyan515@gmail.com, {bluo,dongwon}@psu.edu Abstract. To support secure exchanging
More informationChapter 14: Query Optimization
Chapter 14: Query Optimization Database System Concepts 5 th Ed. See www.db-book.com for conditions on re-use Chapter 14: Query Optimization Introduction Transformation of Relational Expressions Catalog
More informationXML Filtering Technologies
XML Filtering Technologies Introduction Data exchange between applications: use XML Messages processed by an XML Message Broker Examples Publish/subscribe systems [Altinel 00] XML message routing [Snoeren
More informationXML: Extensible Markup Language
XML: Extensible Markup Language CSC 375, Fall 2015 XML is a classic political compromise: it balances the needs of man and machine by being equally unreadable to both. Matthew Might Slides slightly modified
More information6. Relational Algebra (Part II)
6. Relational Algebra (Part II) 6.1. Introduction In the previous chapter, we introduced relational algebra as a fundamental model of relational database manipulation. In particular, we defined and discussed
More informationPathfinder/MonetDB: A High-Performance Relational Runtime for XQuery
Introduction Problems & Solutions Join Recognition Experimental Results Introduction GK Spring Workshop Waldau: Pathfinder/MonetDB: A High-Performance Relational Runtime for XQuery Database & Information
More information12 Abstract Data Types
12 Abstract Data Types 12.1 Foundations of Computer Science Cengage Learning Objectives After studying this chapter, the student should be able to: Define the concept of an abstract data type (ADT). Define
More informationIntroduction. for large input, even access time may be prohibitive we need data structures that exhibit times closer to O(log N) binary search tree
Chapter 4 Trees 2 Introduction for large input, even access time may be prohibitive we need data structures that exhibit running times closer to O(log N) binary search tree 3 Terminology recursive definition
More informationCopyright 2007 Ramez Elmasri and Shamkant B. Navathe. Slide 27-1
Slide 27-1 Chapter 27 XML: Extensible Markup Language Chapter Outline Introduction Structured, Semi structured, and Unstructured Data. XML Hierarchical (Tree) Data Model. XML Documents, DTD, and XML Schema.
More informationDATA STRUCTURE : A MCQ QUESTION SET Code : RBMCQ0305
Q.1 If h is any hashing function and is used to hash n keys in to a table of size m, where n
More informationTwig Pattern Search in XML Database
Twig Pattern Search in XML Database By LEPING ZOU A thesis submitted to the Department of Applied Computer Science in conformity with the requirements for the degree of Master of Science University of
More informationSection 5.5. Left subtree The left subtree of a vertex V on a binary tree is the graph formed by the left child L of V, the descendents
Section 5.5 Binary Tree A binary tree is a rooted tree in which each vertex has at most two children and each child is designated as being a left child or a right child. Thus, in a binary tree, each vertex
More informationProblem Set 5 Solutions
Introduction to Algorithms November 4, 2005 Massachusetts Institute of Technology 6.046J/18.410J Professors Erik D. Demaine and Charles E. Leiserson Handout 21 Problem Set 5 Solutions Problem 5-1. Skip
More informationHow to Store XML Data
How to Store XML Data Technical Report No.: 2010/2 Dept. of Software Engineering Faculty of Mathematics and Physics Charles University in Prague November 2010 Pavel Loupal 1, Irena Mlýnková 2, Martin Nečaský
More informationPerforming Grouping and Aggregate Functions in XML Queries
Performing Grouping and Aggregate Functions in XML Huayu Wu, Tok Wang Ling, Liang Xu, and Zhifeng Bao School of Computing National University of Singapore wuhuayu@comp.nus.edu.sg, lingtw@comp.nus.edu.sg,
More informationBlossomTree: Evaluating XPaths in FLWOR Expressions
BlossomTree: Evaluating XPaths in FLWOR Expressions Ning Zhang University of Waterloo School of Computer Science nzhang@uwaterloo.ca Shishir K. Agrawal Indian Institute of Technology, Bombay Department
More informationLeveraging Set Relations in Exact Set Similarity Join
Leveraging Set Relations in Exact Set Similarity Join Xubo Wang, Lu Qin, Xuemin Lin, Ying Zhang, and Lijun Chang University of New South Wales, Australia University of Technology Sydney, Australia {xwang,lxue,ljchang}@cse.unsw.edu.au,
More informationCSE 190D Spring 2017 Final Exam Answers
CSE 190D Spring 2017 Final Exam Answers Q 1. [20pts] For the following questions, clearly circle True or False. 1. The hash join algorithm always has fewer page I/Os compared to the block nested loop join
More informationCh 5 : Query Processing & Optimization
Ch 5 : Query Processing & Optimization Basic Steps in Query Processing 1. Parsing and translation 2. Optimization 3. Evaluation Basic Steps in Query Processing (Cont.) Parsing and translation translate
More information