XML Storage and Indexing Web Data Management and Distribution Serge Abiteboul Ioana Manolescu Philippe Rigaux Marie-Christine Rousset Pierre Senellart Web Data Management and Distribution http://webdam.inria.fr/textbook September 29, 2011 WebDam (INRIA) XML Storage and Indexing September 29, 2011 1 / 26
Introduction Outline 1 Introduction 2 XML fragmentation 3 XML identifiers WebDam (INRIA) XML Storage and Indexing September 29, 2011 2 / 26
Introduction Motivation PTIME algorithms for evaluating XPath queries: Simple tree navigation Translation into logic Tree-automata techniques These techniques assume XML data in memory Very large XML documents: what is critical is the number of disk accesses WebDam (INRIA) XML Storage and Indexing September 29, 2011 3 / 26
Introduction Naïve page-based storage auctions item open_auctions item auction auction comment description name description name initial bids initial bids WebDam (INRIA) XML Storage and Indexing September 29, 2011 4 / 26
Introduction Processing Simple Paths /auctions/item requires the traversal of two disk pages; /auctions/item/description and /auctions/open_auctions/auction/initial both require traversing three disk pages; //initial requires traversing all the pages of the document. WebDam (INRIA) XML Storage and Indexing September 29, 2011 5 / 26
Introduction Remedies Smart fragmentation. Group nodes that are often accessed simultaneously, so that the number of pages that need to be accessed is globally reduced. Rich node identifiers. Sophisticated node identifiers reducing the cost of join operations. WebDam (INRIA) XML Storage and Indexing September 29, 2011 6 / 26
XML fragmentation Outline 1 Introduction 2 XML fragmentation Tag-partitioning Path-partitioning 3 XML identifiers WebDam (INRIA) XML Storage and Indexing September 29, 2011 7 / 26
XML fragmentation Simple node identifiers 1 auctions 2 3 4 item open_auctions item 8 9 auction auction 5 comment 6 name 7 description 10 11 12 13 initial bids initial bids 14 15 name description WebDam (INRIA) XML Storage and Indexing September 29, 2011 8 / 26
XML fragmentation Partial instance of the Edge relation pid cid clabel - 1 auctions 1 2 item 2 5 comment 2 6 name 2 7 description 1 3 open_auctions 3 8 auction WebDam (INRIA) XML Storage and Indexing September 29, 2011 9 / 26
XML fragmentation Processing XPath queries with simple IDs //initial Direct index lookup. π cid (σ clabel=initial (Edge)) /auctions/item A join is needed. π cid ((σ clabel=auctions (Edge)) cid=pid (σ clabel=item (Edge))) //auction//bid Unbounded number of joins! //auction/bid π cid (A cid=pid B) //auction/*/bid π cid (A cid=pid Edge cid=pid B) //auction/*/*/bid WebDam (INRIA) XML Storage and Indexing September 29, 2011 10 / 26
XML fragmentation Tag-partitioning Outline 1 Introduction 2 XML fragmentation Tag-partitioning Path-partitioning 3 XML identifiers WebDam (INRIA) XML Storage and Indexing September 29, 2011 11 / 26
XML fragmentation Tag-partitioning Tag-partitioned Edge relations auctionsedge pid cid - 1 itemedge pid cid 1 2 1 4 open_auctionsedge pid cid 1 3 auctionedge pid cid 3 8 3 9 Reduces the disk I/O needed to retrieve the identifiers of elements having a given tag. Partitioning of queries with // steps in non-leading position remains as difficult as before. WebDam (INRIA) XML Storage and Indexing September 29, 2011 12 / 26
XML fragmentation Path-partitioning Outline 1 Introduction 2 XML fragmentation Tag-partitioning Path-partitioning 3 XML identifiers WebDam (INRIA) XML Storage and Indexing September 29, 2011 13 / 26
XML fragmentation Path-partitioning Path-partitioned storage /auctions pid cid - 1 /auctions/item/name pid cid 2 6 4 14 /auctions/item path pid cid 1 2 1 4 Paths /auctions /auctions/item /auctions/item/comment /auctions/item/name WebDam (INRIA) XML Storage and Indexing September 29, 2011 14 / 26
XML fragmentation Path-partitioning Query evaluation using path partitioning //item//bid can be evaluated in two steps: Scan the path relation and identify all the parent-child paths matching the given linear XPath query; For each of the paths thus obtained, scan the corresponding path-partitioned table. Many branches still require joins across the relations. For very structured data, the path relation is typically much smaller than the data set itself. Thus, the cost of the first processing step is likely negligible, while the performance benefits of avoiding numerous joins are quite important. WebDam (INRIA) XML Storage and Indexing September 29, 2011 15 / 26
XML identifiers Outline 1 Introduction 2 XML fragmentation 3 XML identifiers Region-based identifiers Dewey-based identifiers Structural identifiers and updates WebDam (INRIA) XML Storage and Indexing September 29, 2011 16 / 26
XML identifiers Identifiers Within a persistent XML store, each node must be assigned a unique identifier (or ID, in short). We want to encapsulate structure information into these identifiers to facilitate query evaluation. WebDam (INRIA) XML Storage and Indexing September 29, 2011 17 / 26
XML identifiers Region-based identifiers Outline 1 Introduction 2 XML fragmentation 3 XML identifiers Region-based identifiers Dewey-based identifiers Structural identifiers and updates WebDam (INRIA) XML Storage and Indexing September 29, 2011 18 / 26
XML identifiers Region-based identifiers Region-based identifiers <a>... <b>... </b>... </a> 0 30 50 90 Offsets in the XML file The so-called region-based identifier scheme simply assigns to each XML node n, the pair composed of the offset of its begin tag, and the offset of its end tag. We denote this pair by (n.begin,n.end). the region-based identifier of the <a> element is the pair (0, 90); the region-based identifier of the <b> element is pair (30, 50). WebDam (INRIA) XML Storage and Indexing September 29, 2011 19 / 26
XML identifiers Region-based identifiers Using region-based identifiers Comparing the region-based identifiers of two nodes n 1 and n 2 allows deciding whether n 1 is an ancestor of n 2. Observe that this is the case if and only if: n 1.start < n 2.start, and n 2.end < n 1.end. No need to use byte offset: (Begin tag, end tag). Count only opening and closing tags (as one unit each) and assign the resulting counter values to each element. (Pre, post). Preorder and postorder index (see next slide). (Pre, post, depth). The depth can be used to decide whether a node is a parent of another one. Region-based identifiers are quite compact, as their size only grows logarithmically with the number of nodes in a document. WebDam (INRIA) XML Storage and Indexing September 29, 2011 20 / 26
XML identifiers Region-based identifiers (pre, post, depth) node identifiers (1,15,1) auctions (2,4,2) (3,11,2) (4,14,2) item open_auctions item (8,7,3) (9,10,3) auction auction (5,1,3) comment (6,2,3) name (7,3,3) description (10,5,4) (11,6,4) (12,8,4) (13,9,4) initial bids initial bids (14,12,3) (15,13,3) name description WebDam (INRIA) XML Storage and Indexing September 29, 2011 21 / 26
XML identifiers Dewey-based identifiers Outline 1 Introduction 2 XML fragmentation 3 XML identifiers Region-based identifiers Dewey-based identifiers Structural identifiers and updates WebDam (INRIA) XML Storage and Indexing September 29, 2011 22 / 26
XML identifiers Dewey-based identifiers Dewey node identifiers 1 auctions 1.1 1.2 1.3 item open_auctions item 1.2.1 1.2.2 auction auction 1.1.1 comment 1.1.2 name 1.1.3 description 1.2.1.1 1.2.1.2 1.2.2.1 1.2.2.2 initial bids initial bids 1.3.1 1.3.2 name description WebDam (INRIA) XML Storage and Indexing September 29, 2011 23 / 26
XML identifiers Dewey-based identifiers Using Dewey identifiers Ancestor decided by looking if an ID is a prefix of another one (largest non-identical prefix for parent). Possible to decide preceding-sibling, following-sibling, preceding, following. Given two Dewey IDs n 1 and n 2, one can find the ID of the lowest common ancestor (LCA) of the corresponding nodes. The ID of the LCA is the longest common prefix of n 1 and n 2. Useful for keyword search. Main drawback: length (large, variable) of the identifiers. WebDam (INRIA) XML Storage and Indexing September 29, 2011 24 / 26
XML identifiers Structural identifiers and updates Outline 1 Introduction 2 XML fragmentation 3 XML identifiers Region-based identifiers Dewey-based identifiers Structural identifiers and updates WebDam (INRIA) XML Storage and Indexing September 29, 2011 25 / 26
XML identifiers Structural identifiers and updates Re-labeling Consider a node with Dewey ID 1.2.2.3. Suppose we insert a new first child to node 1.2. Then the ID of node 1.2.2.3 becomes 1.2.3.3. In general: Offset-based identifiers need to be updated as soon as a character is inserted or removed in the document. (start, end), (pre, post), Dewey IDs need to be updated when the elements of the documents change. Possible to avoid relabeling on deletions, but gaps will appear in the labeling scheme. Relabeling operation quite costly. WebDam (INRIA) XML Storage and Indexing September 29, 2011 26 / 26