Evaluating XPath Queries


Chapter 8 Evaluating XPath Queries Peter Wood (BBK) XML Data Management 201 / 353

Introduction
When XML documents are small enough to fit in memory, XPath expressions can be evaluated efficiently.
But what if we have very large documents stored on disk?
- How should they be stored (fragmented)?
- How can we query them efficiently (by reducing the number of disk accesses needed)?

Fragmentation
- A large document will not fit on a single disk page (block).
- It will need to be fragmented over a possibly large number of pages.
- Updates to the document may result in further fragmentation.

Pre-order traversal
Recall pre-order traversal of a tree. To traverse a non-empty tree in pre-order, perform the following operations recursively at each node, starting with the root node:
1. Visit the node.
2. Traverse the subtrees rooted at the node's children, from left to right.

Fragmentation based on pre-order traversal
A very simple method to store the document nodes on disk is as follows:
- A pre-order traversal of the document, starting from the root, groups as many nodes as possible within the current page.
- When the page is full, a new page is used to store the nodes that are encountered next, and so on, until the entire tree has been traversed.
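The paging scheme above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: a page holds a fixed number of nodes (real pages have a byte capacity), a node is a hypothetical `(label, children)` tuple, and the small tree is invented for the example.

```python
def preorder(node):
    """Yield the labels of a (label, [children]) tree in pre-order."""
    label, children = node
    yield label                      # 1. visit the node
    for child in children:           # 2. traverse the children's subtrees, left to right
        yield from preorder(child)


def fragment(root, page_size):
    """Fill pages with at most page_size nodes each, in pre-order."""
    pages = [[]]
    for label in preorder(root):
        if len(pages[-1]) == page_size:   # current page is full: start a new one
            pages.append([])
        pages[-1].append(label)
    return pages


# Hypothetical toy document, not the CD library from the slides.
tree = ("L", [("C", [("p", [("c", []), ("m", [])])]),
              ("C", [("c", []), ("s", [])])])
pages = fragment(tree, page_size=3)
print(pages)
```

With a page size of 3, the two C nodes end up on different pages, which is the effect the next slides discuss.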

CD library example: first two CDs
[Figure: the tree for the first two CDs of the CD library, with abbreviated node labels (L, C, p, c, s, ...), stored as 3 fragments.]

Simple XPath queries
- Selecting both CD nodes requires accessing 2 fragments.
- Evaluating /CD-library/CD/performance requires accessing all 3 fragments.
- This is a very small example, but one can see that such fragmentation could lead to very bad performance.
Two improvements:
- Smart fragmentation: group together those nodes that are often accessed at the same time.
- Rich node identifiers: use more sophisticated node identifiers that reduce the cost of the join operations needed to stitch fragments back together.

Representation on disk
One of the simplest ways to represent an XML document on disk is to:
- assign an identifier to each node;
- store the XML document as one relation (which may be fragmented) representing a set of edges.

Simple node identifiers
Here node identifiers are simply integers, assigned in some order.
[Figure: the CD library tree with integer node identifiers]
L 1
  C 2
    p 3
      leaves 4, 5, 6, 7
  C 8
    c 9
      leaf 10
    s 11
      leaf 12
    p 13
      leaves 14, 15, 16, 17
    p 18
      leaves 19, 20

The Edge relation

pid  cid  clabel
-    1    CD-library
1    2    CD
2    3    performance
3    4    composer
3    5    composition
3    6    soloist
3    7    date
1    8    CD
...  ...  ...

- pid is the id of the parent node
- cid is the id of the child node
- clabel is the element name of the child node
(attributes and text nodes can be handled similarly)

Processing XPath queries
- //composer can be evaluated by a simple lookup:
  $\pi_{cid}(\sigma_{clabel='composer'}(Edge))$
- /CD-library/CD requires one join:
  $\pi_{cid}(\sigma_{clabel='CD-library'}(Edge) \bowtie_{cid=pid} \sigma_{clabel='CD'}(Edge))$

Processing XPath queries (2)
/CD-library//composer: many joins are potentially needed.
Let $A := \sigma_{clabel='CD-library'}(Edge)$ and $B := \sigma_{clabel='composer'}(Edge)$. Then:
- /CD-library/composer: $\pi_{cid}(A \bowtie_{cid=pid} B)$
- /CD-library/*/composer: $\pi_{cid}(A \bowtie_{cid=pid} Edge \bowtie_{cid=pid} B)$
- /CD-library/*/*/composer: ...
- ...
This assumes that the query processor has no schema information available that might constrain where composer elements are located.
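To make the cost of a non-leading // step concrete, here is a small Python sketch over an in-memory copy of the Edge relation. The rows are those listed on the slides for this example (the row for node 9 comes from the composer relation on the next slide); the helper names `select`, `children` and `descendants_with_label` are hypothetical, and each loop iteration stands for one more join with Edge.

```python
# (pid, cid, clabel) rows; None stands for the "-" parent of the root.
EDGE = [
    (None, 1, "CD-library"),
    (1, 2, "CD"),
    (2, 3, "performance"),
    (3, 4, "composer"),
    (3, 5, "composition"),
    (3, 6, "soloist"),
    (3, 7, "date"),
    (1, 8, "CD"),
    (8, 9, "composer"),
]


def select(label):
    """pi_cid(sigma_clabel=label(Edge)): ids of all elements with this name."""
    return {cid for _, cid, clabel in EDGE if clabel == label}


def children(ids):
    """One join with Edge on cid = pid: ids of the children of the given nodes."""
    return {cid for pid, cid, _ in EDGE if pid in ids}


def descendants_with_label(start_ids, label):
    """Evaluate start//label by joining with Edge until no nodes are left."""
    targets = select(label)
    result, frontier = set(), set(start_ids)
    while frontier:
        frontier = children(frontier)     # one more level, one more join
        result |= frontier & targets
    return result


print(select("composer"))                                  # //composer
print(children(select("CD-library")) & select("CD"))       # /CD-library/CD
print(descendants_with_label(select("CD-library"), "composer"))  # /CD-library//composer
```

The number of iterations in `descendants_with_label` grows with the depth of the document, which is the problem the following slides address.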

Element-partitioned Edge relations
A simple improvement is to use element-partitioned Edge relations. Here, the Edge relation is partitioned into many relations, one for each element name:

CD-library
pid  cid
-    1

CD
pid  cid
1    2
1    8

performance
pid  cid
2    3
8    13
8    18

composer
pid  cid
3    4
8    9

Element-partitioned Edge relations (2)
- This saves some space (element names are not repeated).
- It also reduces the disk I/O needed to retrieve the identifiers of elements having a given name.
- However, it does not solve the problem of evaluating queries with // steps in non-leading positions.

Path-partitioned approach to fragmentation
- Path-partitioning tries to solve the problem of // steps at arbitrary positions in a query.
- This approach uses one relation for each distinct path in the document, e.g., /CD-library/CD/performance.
- There is also another relation, called Paths, which contains all the unique paths.

Path-partitioned storage

/CD-library
pid  cid
-    1

/CD-library/CD
pid  cid
1    2
1    8

/CD-library/CD/composer
pid  cid
8    9

/CD-library/CD/performance/composer
pid  cid
3    4

Paths
path
/CD-library
/CD-library/CD
/CD-library/CD/performance
/CD-library/CD/performance/composer
...

Path-partitioned storage (2)
Based on a path-partitioned store, a query such as //CD//composer can be evaluated in two steps:
1. Scan the Paths relation to identify all the paths matching the given XPath query.
2. For each such path, scan the corresponding path-partitioned relation.
For //CD//composer, the matching paths are
- /CD-library/CD/composer
- /CD-library/CD/performance/composer
so only these two relations need to be scanned.
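The two steps can be sketched as follows. This is a minimal illustration, assuming queries that consist only of / and // steps over element names (no predicates or wildcards); the dictionary `PATHS` plays the role of both the Paths relation (its keys) and the path-partitioned relations (its values), and the function names are hypothetical. The rows for /CD-library/CD/performance are taken from the element-partitioned performance relation shown earlier.

```python
import re

# path -> list of (pid, cid) rows; None stands for the "-" parent of the root.
PATHS = {
    "/CD-library": [(None, 1)],
    "/CD-library/CD": [(1, 2), (1, 8)],
    "/CD-library/CD/composer": [(8, 9)],
    "/CD-library/CD/performance": [(2, 3), (8, 13), (8, 18)],
    "/CD-library/CD/performance/composer": [(3, 4)],
}


def xpath_to_regex(query):
    """Turn a query made of / and // steps into a regular expression on paths."""
    pattern = ""
    for sep, name in re.findall(r"(//|/)([^/]+)", query):
        # "//" allows any number of intermediate steps, "/" allows none.
        pattern += ("(/[^/]+)*/" if sep == "//" else "/") + re.escape(name)
    return re.compile(pattern)


def evaluate(query):
    rx = xpath_to_regex(query)
    # Step 1: scan the Paths relation for matching paths.
    matching = sorted(p for p in PATHS if rx.fullmatch(p))
    # Step 2: scan only the relations of the matching paths.
    cids = sorted(cid for p in matching for _, cid in PATHS[p])
    return matching, cids


matching, cids = evaluate("//CD//composer")
print(matching)
print(cids)
```

No join is needed here: the // steps are resolved entirely against the (usually small) set of paths.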

Path-partitioned storage (3)
- The evaluation of XPath queries with many branches will still require joins across the relations.
- However, the evaluation of // steps is simplified, thanks to the first processing step, performed on the path relation.
- For very structured data, the path relation is typically small.
- Thus, the cost of the first processing step is likely negligible, while the performance benefits of avoiding numerous joins are quite important.
- However, for some data, the path relation can be large, and in some cases even larger than the data itself.

Node identifiers
- Node identifiers are needed to indicate how nodes are related to one another in an XML tree.
- This is particularly important when the data is fragmented and we need to reconnect children with their parents.
- However, it is often even more useful to be able to identify other kinds of relationships between nodes, just by looking at their identifiers.
- This means we need to use identifiers that are richer than simple consecutive integers.
- We will see later how this information can be used in query processing.

Region-based identifiers
- The region-based identifier scheme assigns to each XML node n a pair of integers.
- The pair consists of the offset of the node's start tag and the offset of its end tag.
- We denote this pair by (n.start, n.end).
Consider the following offsets of tags:

<a> ... <b> ... </b> ... </a>
0       30      50       90

- The region-based identifier of the <a> element is the pair (0, 90).
- The region-based identifier of the <b> element is the pair (30, 50).

Using region-based identifiers
Comparing the region-based identifiers of two nodes $n_1$ and $n_2$ allows us to decide whether $n_1$ is an ancestor of $n_2$. This is the case if and only if:
- $n_1.start < n_2.start$, and
- $n_1.end > n_2.end$.
There is no need to use byte offsets. Alternatives are:
- (start tag, end tag): count only opening and closing tags (as one unit each) and assign the resulting counter values to each element;
- (pre, post): the pre-order and post-order indexes (see next slides).
Region-based identifiers are quite compact, as their size grows only logarithmically with the number of nodes in a document.

Post-order traversal
Recall post-order traversal of a tree. To traverse a non-empty tree in post-order, perform the following operations recursively at each node, starting with the root node:
1. Traverse the subtrees rooted at the node's children, from left to right.
2. Visit the node.

Example of (pre, post) node identifiers
[Figure: the CD library tree with (pre, post) identifiers]
L (1,20)
  C (2,6)
    p (3,5)
      leaves (4,1), (5,2), (6,3), (7,4)
  C (8,19)
    c (9,8)
      leaf (10,7)
    s (11,10)
      leaf (12,9)
    p (13,15)
      leaves (14,11), (15,12), (16,13), (17,14)
    p (18,18)
      leaves (19,16), (20,17)

Using (pre, post) identifiers to find ancestors
The same method as for other region-based identifiers allows us to decide, for two nodes $n_1$ and $n_2$, whether $n_1$ is an ancestor of $n_2$. As before, this is the case if and only if:
- $n_1.pre < n_2.pre$, and
- $n_1.post > n_2.post$,
where $n_i.pre$ and $n_i.post$ are the pre-order and post-order numbers assigned to node $n_i$, respectively.

Using (pre, post) identifiers to find parents
- One can add another number to a node identifier which indicates the depth of the node in the tree.
- The root is assigned a depth of 1; the depth increases as we go down the tree.
- Using (pre, post, depth), we can decide whether node $n_1$ is a parent of node $n_2$.
- Node $n_1$ is a parent of node $n_2$ if and only if $n_1$ is an ancestor of $n_2$ and $n_1.depth = n_2.depth - 1$.
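As an illustration, the following Python sketch assigns (pre, post, depth) identifiers in a single traversal and implements the two tests. The `(name, children)` tuple representation, the function names and the five-node tree are assumptions made for the example, not part of the slides.

```python
from itertools import count


def label_tree(tree):
    """Return [(name, pre, post, depth)] for a (name, [children]) tree, in pre-order."""
    pre_counter, post_counter = count(1), count(1)
    labelled = []

    def visit(node, depth):
        name, children = node
        pre = next(pre_counter)        # numbered on the way down
        for child in children:
            visit(child, depth + 1)
        post = next(post_counter)      # numbered on the way back up
        labelled.append((name, pre, post, depth))

    visit(tree, 1)                     # the root has depth 1
    return sorted(labelled, key=lambda row: row[1])


def is_ancestor(n1, n2):
    """n1 is an ancestor of n2 iff n1.pre < n2.pre and n1.post > n2.post."""
    return n1[1] < n2[1] and n1[2] > n2[2]


def is_parent(n1, n2):
    """n1 is the parent of n2 iff it is an ancestor exactly one level above."""
    return is_ancestor(n1, n2) and n1[3] == n2[3] - 1


# Hypothetical toy tree: a has children b and d; b has children e and g.
tree = ("a", [("b", [("e", []), ("g", [])]), ("d", [])])
a, b, e, g, d = label_tree(tree)
print(a, b, e, g, d)
print(is_ancestor(a, g), is_parent(a, g), is_parent(b, g))
```

Both tests use only the identifiers of the two nodes; no access to the tree or to the Edge relation is needed.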

Dewey-based identifiers
- These identifiers use the principle of the Dewey classification system, used in libraries for decades.
- To get the identifier of a child node, one adds a suffix to the identifier of its parent (including a separator).
- For example, if the parent's identifier is 1.2.3 and the child is the second child of this parent, then its identifier is 1.2.3.2.

Example of Dewey-based identifiers
[Figure: the CD library tree with Dewey-based identifiers]
L 1
  C 1.1
    p 1.1.1
      leaves 1.1.1.1, 1.1.1.2, 1.1.1.3, 1.1.1.4
  C 1.2
    c 1.2.1
      leaf 1.2.1.1
    s 1.2.2
      leaf 1.2.2.1
    p 1.2.3
      leaves 1.2.3.1, 1.2.3.2, 1.2.3.3, 1.2.3.4
    p 1.2.4
      leaves 1.2.4.1, 1.2.4.2

Using Dewey-based identifiers
Let $n_1$ and $n_2$ be two identifiers, of the form $n_1 = x_1.x_2.\cdots.x_m$ and $n_2 = y_1.y_2.\cdots.y_n$.
- The node identified by $n_1$ is an ancestor of the node identified by $n_2$ if and only if $n_1$ is a prefix of $n_2$.
- When this is the case, the node identified by $n_1$ is the parent of the node identified by $n_2$ if and only if $n = m + 1$.
- Dewey IDs allow finding other relationships such as preceding-sibling and preceding (respectively, following-sibling and following).
- The node identified by $n_1$ is a preceding sibling of the node identified by $n_2$ if and only if
  1. $x_1.x_2.\cdots.x_{m-1} = y_1.y_2.\cdots.y_{n-1}$, and
  2. $x_m < y_n$.
The main drawback of Dewey identifiers is their length: it is variable and can get large.
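These tests translate directly into code. The sketch below assumes Dewey IDs are given as dot-separated strings and that an ancestor must be a proper prefix; the function names are hypothetical. Components are compared as integers, because a plain string prefix test would wrongly treat 1.1 as a prefix of 1.10.1.

```python
def parse(dewey):
    """Split a Dewey ID such as '1.2.3' into a tuple of integers."""
    return tuple(int(part) for part in dewey.split("."))


def is_ancestor(n1, n2):
    """n1 is an ancestor of n2 iff n1 is a proper prefix of n2."""
    x, y = parse(n1), parse(n2)
    return len(x) < len(y) and y[:len(x)] == x


def is_parent(n1, n2):
    """n1 is the parent of n2 iff it is a prefix and n = m + 1."""
    return is_ancestor(n1, n2) and len(parse(n2)) == len(parse(n1)) + 1


def is_preceding_sibling(n1, n2):
    """Same parent prefix, and the last component of n1 is smaller."""
    x, y = parse(n1), parse(n2)
    return len(x) == len(y) and x[:-1] == y[:-1] and x[-1] < y[-1]


print(is_ancestor("1.2", "1.2.3.2"))           # 1.2 is a prefix of 1.2.3.2
print(is_parent("1.2.3", "1.2.3.2"))           # prefix, and one component longer
print(is_preceding_sibling("1.2.1", "1.2.3"))  # same parent 1.2, and 1 < 3
```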

Structural identifiers and updates
- Consider a node with Dewey ID 1.2.2.3.
- Suppose we insert a new first child for node 1.2.
- Then the ID of node 1.2.2.3 becomes 1.2.3.3.
In general:
- Offset-based identifiers need to be updated as soon as a character is inserted or removed in the document.
- (start, end), (pre, post), and Dewey IDs need to be updated when the elements of the document change.
- It is possible to avoid re-labelling on deletions, but gaps will appear in the labelling scheme.
- Re-labelling operations are quite expensive.

Tree pattern query evaluation
- Assume we have element-partitioned relations using (pre, post) identifiers.
- Assume we want to evaluate a tree pattern query.
- One way is to decompose the query into its basic patterns.
- Each basic pattern is just a pair of nodes connected by a child edge or a descendant edge.
- We particularly want an efficient way of evaluating basic patterns that use the descendant operator.

Tree Pattern Example
[Figure: a tree pattern query over the element names bookstore, magazine, date, title, day and month.]

Decomposed Tree Pattern Example
[Figure: the tree pattern decomposed into five basic patterns, one per edge: (bookstore, magazine), (magazine, date), (magazine, title), (date, day), (date, month).]

Example tree with (pre, post) identifiers
(Taken from the book Web Data Management)
a (1,16)
  b (2,5)
    b (3,3)
      e (4,1)
      g (5,2)
    d (6,4)
  b (7,14)
    c (8,8)
      e (9,6)
      g (10,7)
    b (11,12)
      e (12,9)
      g (13,10)
      g (14,11)
    d (15,13)
  f (16,15)

Element-partitioned relations for example

a
pre  post
1    16

b
pre  post
2    5
3    3
7    14
11   12

c
pre  post
8    8

d
pre  post
6    4
15   13

e
pre  post
4    1
9    6
12   9

f
pre  post
16   15

g
pre  post
5    2
10   7
13   10
14   11

Evaluation of descendant patterns
- Assume we want to evaluate the basic pattern corresponding to b//g.
- This pattern may need to be joined to the results calculated for other basic patterns.
- So, in general, we need to find all pairs (x, y) of nodes where
  - x is an element with name b,
  - y is an element with name g,
  - y is a descendant of x.

Evaluation of descendant patterns (2)
- We could take every node ID from the b relation and compare it to every node ID from the g relation.
- Each time we can test whether the g-node is a descendant of the b-node using the (pre, post) identifiers.
- But this method will take time proportional to $n \times m$, if there are $n$ b-nodes and $m$ g-nodes.
- In particular, one of the relations is scanned many times.
- This is similar to a nested-loops implementation of a relational join, which is known to be inefficient.
- Can we do better?
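For reference, the naive method looks as follows in Python, using the b and g relations of the example tree. The function name and the explicit comparison counter are additions for illustration only.

```python
# (pre, post) identifiers from the b and g relations of the example.
B_IDS = [(2, 5), (3, 3), (7, 14), (11, 12)]
G_IDS = [(5, 2), (10, 7), (13, 10), (14, 11)]


def nested_loop_join(ancestors, descendants):
    """Compare every ancestor candidate with every descendant candidate."""
    pairs, comparisons = [], 0
    for a in ancestors:
        for d in descendants:          # the descendant list is rescanned for each a
            comparisons += 1
            if a[0] < d[0] and a[1] > d[1]:
                pairs.append((a, d))
    return pairs, comparisons


pairs, comparisons = nested_loop_join(B_IDS, G_IDS)
print(len(pairs), "result pairs after", comparisons, "comparisons")
```

The number of comparisons is always the product of the two input sizes, whatever the size of the result.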

Stack-based join algorithm
- We will look at an elegant method for evaluation of descendant patterns that uses an auxiliary stack.
- This is called the stack-based join (SBJ) algorithm.
- SBJ reads each ID from each relation only once.
- SBJ assumes that the IDs in each relation are sorted, essentially by their pre-order values (as in the earlier slide).
- We will illustrate the method by example.

Stack-based join algorithm example
Initial state:
  b IDs: (2,5) (3,3) (7,14) (11,12)
  g IDs: (5,2) (10,7) (13,10) (14,11)
  Stack: empty
- SBJ starts by pushing the first ancestor (that is, b node) ID, namely (2,5), on the stack.
- Then SBJ continues to examine the IDs in the b (ancestor) input. While the current ancestor ID is a descendant of the top of the stack, the current ancestor ID is pushed on the stack.
- So the second b ID, (3,3), is pushed on the stack, since it is a descendant of (2,5).
State after these steps:
  b IDs: (7,14) (11,12)
  g IDs: (5,2) (10,7) (13,10) (14,11)
  Stack (top first): (3,3) (2,5)

Stack-based join algorithm example (2)
- The third ID in the b input, (7,14), is not a descendant of the current stack top, namely (3,3).
- Therefore, SBJ stops pushing b IDs on the stack and considers the first descendant ID, to see if it has matches on the stack.
- The first g node, namely (5,2), is a descendant of both b nodes on the stack, leading to the first two output tuples.
- Note that the stack does not change when output is produced. This is because there may be further descendant IDs to match the ancestor IDs on the stack.
State after these steps:
  b IDs: (7,14) (11,12)
  g IDs: (10,7) (13,10) (14,11)
  Stack (top first): (3,3) (2,5)
  Output: ((3,3), (5,2)), ((2,5), (5,2))

Stack-based join algorithm example (3)
- A descendant ID which has been compared with ancestor IDs on the stack and has produced output tuples can be discarded.
- Now the g ID (10,7) encounters no matches on the stack.
- Moreover, (10,7) occurs in the document after the nodes on the stack.
- Therefore, no descendant node ID yet to be examined can have ancestors on this stack. This is because the input g IDs are sorted.
- Therefore, at this point, the stack is emptied.
State after these steps:
  b IDs: (7,14) (11,12)
  g IDs: (10,7) (13,10) (14,11)
  Stack: empty
  Output: ((3,3), (5,2)), ((2,5), (5,2))

Stack-based join algorithm example (4)
- Next the ancestor ID (7,14) is pushed on the stack, followed by its descendant in the ancestor input, (11,12).
- The next descendant ID is (10,7). It is a descendant of (7,14) but not of (11,12), so it produces a single result with (7,14) and is then discarded.
State after these steps:
  b IDs: none left
  g IDs: (13,10) (14,11)
  Stack (top first): (11,12) (7,14)
  Output: ((3,3), (5,2)), ((2,5), (5,2)), ((7,14), (10,7))

Stack-based join algorithm example (5)
- The next descendant ID is (13,10). This leads to two new tuples added to the output.
- The next descendant ID is (14,11). This also leads to two more output tuples.
Final state:
  b IDs: none left
  g IDs: none left
  Stack (top first): (11,12) (7,14)
  Output: ((3,3), (5,2)), ((2,5), (5,2)), ((7,14), (10,7)), ((11,12), (13,10)), ((7,14), (13,10)), ((11,12), (14,11)), ((7,14), (14,11))
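The example can be reproduced with a short Python sketch of a stack-based structural join. This is one common formulation, written under the assumption that both inputs are lists of (pre, post) pairs sorted by pre; it is not necessarily the exact variant traced on the slides. In particular, it pushes an ancestor ID only once the next descendant ID lies after it in document order, so the order of pushes differs slightly from the trace above, but each input is still read once and the output is the same. The function name is hypothetical.

```python
def stack_based_join(ancestors, descendants):
    """Return all (a, d) pairs such that a is an ancestor of d.

    Both inputs are lists of (pre, post) pairs sorted by pre.
    """
    output, stack = [], []
    i = j = 0
    while j < len(descendants):
        d = descendants[j]
        # Pick whichever input node comes next in document (pre) order.
        take_ancestor = i < len(ancestors) and ancestors[i][0] < d[0]
        node = ancestors[i] if take_ancestor else d
        # Entries with a smaller post number ended before `node` started,
        # so they cannot be ancestors of it or of anything that follows.
        while stack and stack[-1][1] < node[1]:
            stack.pop()
        if take_ancestor:
            stack.append(node)         # nested inside everything left on the stack
            i += 1
        else:
            # Everything still on the stack is an ancestor of d (top first).
            output.extend((a, d) for a in reversed(stack))
            j += 1
    return output


b_ids = [(2, 5), (3, 3), (7, 14), (11, 12)]
g_ids = [(5, 2), (10, 7), (13, 10), (14, 11)]
for a, d in stack_based_join(b_ids, g_ids):
    print(a, d)
```

Because popped entries are never revisited, the work is bounded by the sizes of the two inputs plus the size of the output, rather than by the product of the input sizes.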

Other approaches
- The stack-based join algorithm is as efficient as possible for single descendant basic patterns.
- But an overall algorithm for tree pattern evaluation still has to join the answers from basic patterns back together.
- The size of intermediate results can be unnecessarily large.
- Another approach is to evaluate the entire pattern in one operation.
- One algorithm for this is called holistic twig join.

Summary
- We considered some issues for dealing with querying large XML documents.
- These included methods for fragmenting documents and efficient evaluation methods, particularly for ancestor-descendant basic patterns.
- For more information, see Chapter 4 on XML Query Evaluation in the book Web Data Management.
- The original stack-based join algorithm is from S. Al-Khalifa, H.V. Jagadish, J.M. Patel, Y. Wu, N. Koudas, and D. Srivastava. Structural joins: A primitive for efficient XML query pattern matching. In Proc. Int. Conf. on Data Engineering (ICDE), 2002.
- Holistic twig join is described in N. Bruno, N. Koudas, and D. Srivastava. Holistic twig joins: optimal XML pattern matching. In Proc. ACM Int. Conf. on the Management of Data (SIGMOD), 2002.