XML Storage and Indexing

Web Data Management and Distribution
Serge Abiteboul, Ioana Manolescu, Philippe Rigaux, Marie-Christine Rousset, Pierre Senellart
http://webdam.inria.fr/textbook
WebDam (INRIA), September 29, 2011

Outline
1 Introduction
2 XML fragmentation
3 XML identifiers

Motivation
PTIME algorithms for evaluating XPath queries:
- Simple tree navigation
- Translation into logic
- Tree-automata techniques
These techniques assume that the XML data is in memory. For very large XML documents, what is critical is the number of disk accesses.

Naïve page-based storage
[Figure: the sample auctions document tree distributed over disk pages. Node tags: auctions, item, open_auctions, auction, comment, name, description, initial, bids.]

Processing Simple Paths
- /auctions/item requires the traversal of two disk pages;
- /auctions/item/description and /auctions/open_auctions/auction/initial both require traversing three disk pages;
- //initial requires traversing all the pages of the document.

Remedies
- Smart fragmentation. Group nodes that are often accessed simultaneously, so that the number of pages that need to be accessed is globally reduced.
- Rich node identifiers. Sophisticated node identifiers reduce the cost of join operations.

Outline
1 Introduction
2 XML fragmentation
  Tag-partitioning
  Path-partitioning
3 XML identifiers

Simple node identifiers
1 auctions
  2 item
    5 comment
    6 name
    7 description
  3 open_auctions
    8 auction
      10 initial
      11 bids
    9 auction
      12 initial
      13 bids
  4 item
    14 name
    15 description

Partial instance of the Edge relation
  pid  cid  clabel
  -    1    auctions
  1    2    item
  2    5    comment
  2    6    name
  2    7    description
  1    3    open_auctions
  3    8    auction

Processing XPath queries with simple IDs
- //initial: direct index lookup.
    π_cid(σ_{clabel=initial}(Edge))
- /auctions/item: a join is needed.
    π_cid((σ_{clabel=auctions}(Edge)) ⋈_{cid=pid} (σ_{clabel=item}(Edge)))
- //auction//bid: unbounded number of joins! With A = σ_{clabel=auction}(Edge) and B = σ_{clabel=bid}(Edge), one expression is needed for each possible depth:
    //auction/bid        π_cid(A ⋈_{cid=pid} B)
    //auction/*/bid      π_cid(A ⋈_{cid=pid} Edge ⋈_{cid=pid} B)
    //auction/*/*/bid    ...
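
The three cases above can be sketched in Python over an in-memory Edge relation. This is only an illustration under stated assumptions: the rows beyond the partial instance are derived from the simple-identifier tree, the helper names (select, child_join, descendants_with_label) are hypothetical, and the tag bids of the sample document is used in place of bid.

```python
# Edge relation as (pid, cid, clabel) rows; None stands for the "-" parent of the root.
# Rows not shown in the partial instance are derived from the simple-identifier tree.
EDGE = [
    (None, 1, "auctions"),
    (1, 2, "item"), (1, 3, "open_auctions"), (1, 4, "item"),
    (2, 5, "comment"), (2, 6, "name"), (2, 7, "description"),
    (3, 8, "auction"), (3, 9, "auction"),
    (8, 10, "initial"), (8, 11, "bids"),
    (9, 12, "initial"), (9, 13, "bids"),
    (4, 14, "name"), (4, 15, "description"),
]


def select(label):
    """Selection on clabel followed by projection on cid."""
    return {cid for _, cid, clabel in EDGE if clabel == label}


def child_join(parents, label=None):
    """One join on cid = pid: the children of `parents`, optionally filtered by tag."""
    return {cid for pid, cid, clabel in EDGE
            if pid in parents and (label is None or clabel == label)}


def descendants_with_label(ancestors, label):
    """a//label: repeat the child join until no new node is reached.

    The number of joins depends on the depth of the data, not on the query.
    """
    wanted = select(label)
    result, frontier = set(), set(ancestors)
    while frontier:
        frontier = child_join(frontier)
        result |= frontier & wanted
    return result


initial_nodes = select("initial")                                    # //initial
items = child_join(select("auctions"), "item")                       # /auctions/item
bids = descendants_with_label(select("auction"), "bids")             # //auction//bids
```

The loop in descendants_with_label makes the "unbounded number of joins" concrete: each iteration corresponds to one more join with Edge.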

Tag-partitioned Edge relations
  auctionsedge        pid  cid
                      -    1
  itemedge            pid  cid
                      1    2
                      1    4
  open_auctionsedge   pid  cid
                      1    3
  auctionedge         pid  cid
                      3    8
                      3    9
- Reduces the disk I/O needed to retrieve the identifiers of elements having a given tag.
- Processing of queries with // steps in non-leading position remains as difficult as before.
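
A minimal sketch of how such tag-partitioned relations could be derived from the Edge relation, assuming an in-memory list of (pid, cid, clabel) rows. The function name tag_partition is hypothetical; the table names follow the naming shown above.

```python
from collections import defaultdict

# Rows of the Edge relation for the tags shown above; None stands for "-".
EDGE = [
    (None, 1, "auctions"),
    (1, 2, "item"), (1, 3, "open_auctions"), (1, 4, "item"),
    (3, 8, "auction"), (3, 9, "auction"),
]


def tag_partition(edge_rows):
    """Split Edge into one (pid, cid) table per tag, keyed by '<tag>edge'."""
    tables = defaultdict(list)
    for pid, cid, clabel in edge_rows:
        tables[clabel + "edge"].append((pid, cid))
    return dict(tables)


tables = tag_partition(EDGE)
# Retrieving all item elements now touches only the itemedge table.
item_ids = [cid for _, cid in tables["itemedge"]]
```

The label column disappears from the stored rows, since it is implied by the table name.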

Path-partitioned storage
  /auctions             pid  cid
                        -    1
  /auctions/item        pid  cid
                        1    2
                        1    4
  /auctions/item/name   pid  cid
                        2    6
                        4    14
  Paths                 path
                        /auctions
                        /auctions/item
                        /auctions/item/comment
                        /auctions/item/name

Query evaluation using path partitioning
//item//bid can be evaluated in two steps:
1. Scan the path relation and identify all the parent-child paths matching the given linear XPath query;
2. For each of the paths thus obtained, scan the corresponding path-partitioned table.
Queries with many branches still require joins across the relations.
For very structured data, the path relation is typically much smaller than the data set itself. Thus, the cost of the first processing step is likely negligible, while the performance benefits of avoiding numerous joins are quite important.
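
The two steps can be sketched as follows, assuming the path-partitioned tables are held in a Python dictionary keyed by path. Matching a linear query against the paths with a regular expression is one possible technique chosen for this illustration, not necessarily what a real system does; the function names are hypothetical, and the comment and bids tables are derived from the sample document rather than shown on the slides.

```python
import re

# Path-partitioned tables: path -> list of (pid, cid); None stands for "-".
PATH_TABLES = {
    "/auctions": [(None, 1)],
    "/auctions/item": [(1, 2), (1, 4)],
    "/auctions/item/comment": [(2, 5)],
    "/auctions/item/name": [(2, 6), (4, 14)],
    "/auctions/open_auctions/auction/bids": [(8, 11), (9, 13)],
}


def xpath_to_regex(query):
    """Translate a linear XPath query (only /, // and * steps) into a regex on paths."""
    pattern = ""
    for axis, name in re.findall(r"(//|/)([^/]+)", query):
        step = "[^/]+" if name == "*" else re.escape(name)
        if axis == "//":
            # Any number of intermediate parent-child steps, possibly none.
            pattern += "(?:/[^/]+)*"
        pattern += "/" + step
    return re.compile(pattern)


def evaluate(query):
    """Step 1: find the matching paths. Step 2: scan only their tables."""
    rx = xpath_to_regex(query)
    matching = [path for path in PATH_TABLES if rx.fullmatch(path)]
    return sorted(cid for path in matching for _, cid in PATH_TABLES[path])


names = evaluate("//item//name")
bids = evaluate("//auction//bids")
```

No join is needed for these queries: the // step is resolved against the small set of paths, and each matching table is read once.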

Outline
1 Introduction
2 XML fragmentation
3 XML identifiers
  Region-based identifiers
  Dewey-based identifiers
  Structural identifiers and updates

Identifiers
Within a persistent XML store, each node must be assigned a unique identifier (or ID, in short).
We want to encapsulate structure information into these identifiers to facilitate query evaluation.

Region-based identifiers
  <a> ... <b> ... </b> ... </a>
  0       30      50      90      (offsets in the XML file)
The so-called region-based identifier scheme simply assigns to each XML node n the pair composed of the offset of its begin tag and the offset of its end tag. We denote this pair by (n.begin, n.end).
- The region-based identifier of the <a> element is the pair (0, 90);
- the region-based identifier of the <b> element is the pair (30, 50).

Using region-based identifiers
Comparing the region-based identifiers of two nodes n1 and n2 allows deciding whether n1 is an ancestor of n2. This is the case if and only if:
- n1.begin < n2.begin, and
- n2.end < n1.end.
There is no need to use byte offsets:
- (Begin tag, end tag). Count only opening and closing tags (as one unit each) and assign the resulting counter values to each element.
- (Pre, post). Preorder and postorder index (see next slide).
- (Pre, post, depth). The depth can be used to decide whether a node is a parent of another one.
Region-based identifiers are quite compact, as their size only grows logarithmically with the number of nodes in a document.
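
The ancestor and parent tests can be written directly from these conditions. The sketch below assumes identifiers are stored as (begin, end, depth) triples; the Region type and the depth values given to the <a> and <b> elements are assumptions made for the example.

```python
from typing import NamedTuple


class Region(NamedTuple):
    begin: int   # offset (or counter value) of the begin tag
    end: int     # offset (or counter value) of the end tag
    depth: int   # depth in the tree, root at depth 1 (assumed convention)


def is_ancestor(n1, n2):
    """n1 is an ancestor of n2 iff the region of n1 strictly contains that of n2."""
    return n1.begin < n2.begin and n2.end < n1.end


def is_parent(n1, n2):
    """A parent is an ancestor located exactly one level above."""
    return is_ancestor(n1, n2) and n2.depth == n1.depth + 1


a = Region(0, 90, 1)    # the <a> element
b = Region(30, 50, 2)   # the <b> element, assumed to be a direct child of <a>
```

Both tests use only the two identifiers, so they can serve as join predicates without any access to the document itself.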

(pre, post, depth) node identifiers
(1,15,1) auctions
  (2,4,2) item
    (3,1,3) comment
    (4,2,3) name
    (5,3,3) description
  (6,11,2) open_auctions
    (7,7,3) auction
      (8,5,4) initial
      (9,6,4) bids
    (10,10,3) auction
      (11,8,4) initial
      (12,9,4) bids
  (13,14,2) item
    (14,12,3) name
    (15,13,3) description
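
A sketch of how (pre, post, depth) identifiers can be assigned in a single depth-first traversal, with pre taken as the rank at which a node is first visited and post as the rank at which it is left. The nested-tuple encoding of the sample document and the function names are assumptions made for the illustration.

```python
# The sample document as (tag, children) tuples.
TREE = ("auctions", [
    ("item", [("comment", []), ("name", []), ("description", [])]),
    ("open_auctions", [
        ("auction", [("initial", []), ("bids", [])]),
        ("auction", [("initial", []), ("bids", [])]),
    ]),
    ("item", [("name", []), ("description", [])]),
])


def label(tree):
    """Return (pre, post, depth, tag) tuples in document order."""
    out = []
    counters = {"pre": 0, "post": 0}

    def visit(node, depth):
        tag, children = node
        counters["pre"] += 1                 # rank when the node is entered
        entry = [counters["pre"], None, depth, tag]
        out.append(entry)
        for child in children:
            visit(child, depth + 1)
        counters["post"] += 1                # rank when the node is left
        entry[1] = counters["post"]

    visit(tree, 1)
    return [tuple(e) for e in out]


def is_ancestor(n1, n2):
    """n1 is an ancestor of n2 iff it is entered earlier and left later."""
    return n1[0] < n2[0] and n2[1] < n1[1]


labels = label(TREE)
root, first_item, open_auctions = labels[0], labels[1], labels[5]
```

With these labels, is_ancestor(open_auctions, n) holds exactly for the two auction elements and their initial and bids children.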

Dewey node identifiers
1 auctions
  1.1 item
    1.1.1 comment
    1.1.2 name
    1.1.3 description
  1.2 open_auctions
    1.2.1 auction
      1.2.1.1 initial
      1.2.1.2 bids
    1.2.2 auction
      1.2.2.1 initial
      1.2.2.2 bids
  1.3 item
    1.3.1 name
    1.3.2 description

Using Dewey identifiers
- A node is an ancestor of another one if its ID is a proper prefix of the other's ID (the parent's ID is the longest proper prefix).
- It is possible to decide preceding-sibling, following-sibling, preceding, and following.
- Given two Dewey IDs n1 and n2, one can find the ID of the lowest common ancestor (LCA) of the corresponding nodes. The ID of the LCA is the longest common prefix of n1 and n2. This is useful for keyword search.
- Main drawback: the length (large, variable) of the identifiers.
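
These prefix tests are short to write down. The sketch below assumes Dewey IDs are given as dotted strings and compares them component by component (so that 1.1 is not mistaken for a prefix of 1.10); all function names are hypothetical.

```python
def parse(dewey):
    """'1.2.1' -> (1, 2, 1); comparing components avoids string-prefix pitfalls."""
    return tuple(int(part) for part in dewey.split("."))


def fmt(components):
    return ".".join(str(c) for c in components)


def is_ancestor(a, b):
    """a is an ancestor of b iff a is a proper prefix of b."""
    ta, tb = parse(a), parse(b)
    return len(ta) < len(tb) and tb[:len(ta)] == ta


def parent(dewey):
    """The parent's ID is the longest proper prefix; the root has no parent."""
    components = parse(dewey)[:-1]
    return fmt(components) if components else None


def lca(a, b):
    """Longest common prefix of the two IDs."""
    common = []
    for x, y in zip(parse(a), parse(b)):
        if x != y:
            break
        common.append(x)
    return fmt(common) if common else None


lowest = lca("1.2.1.1", "1.2.2.2")   # the open_auctions node in the sample document
```

When one node is an ancestor of the other, lca as written returns the ancestor's own ID.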

Re-labeling
Consider a node with Dewey ID 1.2.2.3. Suppose we insert a new first child to node 1.2. Then the ID of node 1.2.2.3 becomes 1.2.3.3.
In general:
- Offset-based identifiers need to be updated as soon as a character is inserted or removed in the document.
- (Begin tag, end tag), (pre, post), and Dewey IDs need to be updated when the elements of the document change.
- It is possible to avoid relabeling on deletions, but gaps will then appear in the labeling scheme.
The relabeling operation is quite costly.
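
The Dewey example can be replayed with a small sketch that relabels a set of IDs after a first-child insertion. It assumes IDs are dotted strings held in a plain list; the function insert_first_child is hypothetical and only illustrates why the operation is costly.

```python
def parse(dewey):
    return tuple(int(part) for part in dewey.split("."))


def fmt(components):
    return ".".join(str(c) for c in components)


def insert_first_child(ids, parent):
    """Return (new_ids, relabeled_count) after inserting a first child under `parent`.

    Every node in the subtree of `parent` has the component that follows the
    parent's prefix shifted by one; the new node takes position 1.
    """
    prefix = parse(parent)
    k = len(prefix)
    new_ids, relabeled = [], 0
    for dewey in ids:
        t = parse(dewey)
        if len(t) > k and t[:k] == prefix:
            t = t[:k] + (t[k] + 1,) + t[k + 1:]
            relabeled += 1
        new_ids.append(fmt(t))
    new_ids.append(fmt(prefix + (1,)))
    return new_ids, relabeled


ids = ["1", "1.1", "1.2", "1.2.1", "1.2.2", "1.2.2.3", "1.3"]
new_ids, relabeled = insert_first_child(ids, "1.2")
```

A single insertion relabels the whole subtree of the parent, which is why the cost grows with the size of that subtree rather than with the size of the update.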