An Extended Byte Carry Labeling Scheme for Dynamic XML Data


Procedia Engineering 15 (2011) 5488-5492

YU Sheng a,b, WU Minghui a,b,*, LIU Lin a,b
a School of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China
b Department of Computer Science and Engineering, Zhejiang University City College, Hangzhou 310015, China
* Corresponding author. Tel.: +86-571-88284308; fax: +86-571-88015567. E-mail address: mhwu@zucc.edu.cn.

Abstract

Determining the structural relationship between any two nodes efficiently is a crucial problem for an XML query system. By assigning each XML tree node a unique label that implies its position, the structural relationship between any two nodes can be judged directly from their labels. We propose a new XML tree labeling scheme named EBCL, which is based on the prefix-based labeling scheme. EBCL not only supports dynamic updates to XML documents fully and efficiently, but also greatly reduces the space cost.

© 2011 Published by Elsevier Ltd. Selection and/or peer-review under responsibility of [Advanced in Control Engineering and Information Science]. Open access under CC BY-NC-ND license. ISSN 1877-7058. doi:10.1016/j.proeng.2011.08.1018

Keywords: XML Tree, XML Labeling, Labeling Scheme, Dynamic Update

1. Introduction

Owing to its semi-structured nature, openness, heterogeneity, flexibility and many other characteristics, XML has become the de facto standard for data representation and exchange on the Internet. With the rapid development of network technology, more and more data is stored in XML format, so how to store XML data and query it accurately and efficiently has become a hot research topic. The W3C has developed the XPath and XQuery languages for querying XML data; their common characteristic is that they query an XML document by path expressions. The fundamental technique behind path-based queries is the structural join, which determines the structural relationship between any two nodes and depends on an efficient encoding of the XML tree nodes.

XML encoding assigns each node a unique code that implies its position relative to other nodes, so any two nodes can be compared directly to determine whether a parent-child, ancestor-descendant, sibling, predecessor, successor or other structural relationship holds between them, without traversing the whole XML document tree.

2. Related works

Generally speaking, existing XML encoding methods fall into two categories: region-based encoding and path-based encoding. The Zhang [1] encoding scheme is a typical region-based scheme; it exploits the fact that an XML document is ordered and determines the structural relationships between any given pair of nodes by tree traversal.

In a path-based encoding scheme, the label of a given node u encodes the nodes on the path from the document root down to u as a sequence of components, each of which uniquely denotes an ancestor of u on the path. Thus, given two nodes u and v, their relationship can be determined precisely: u is an ancestor of v iff label(u) is a prefix of label(v). DeweyID [2] is an integer-based prefix encoding scheme. It labels the n-th child of a node with the integer n, and this n is concatenated to the prefix, which is its parent's label, to form the complete label of the child node.

Dynamic update of XML means supporting insert, delete and update operations without changing the existing labels of the XML document tree nodes. Since delete and update operations do not change the labels of other nodes, dynamic update comes down to supporting insert operations dynamically. In fact, the problem can be reduced to how to insert a new code into a sequence of existing codes without disturbing their order or relabeling them.

ORDPATH [3] is an extension of DeweyID; the main difference is that in ORDPATH even numbers are reserved for later node insertions. LSDX [4] uses a combination of letters and numbers to make up labels: the number represents the level of the node and the letter marks its position relative to its siblings. QED [5] uses quaternary strings in conjunction with a recursive algorithm to assign unique and persistent labels to each node.
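As a minimal illustration of the prefix test used by path-based schemes, the following Python sketch models Dewey-style labels as tuples of integers. The function names (`child_label`, `is_ancestor`, `is_parent`) are hypothetical and are not taken from DeweyID [2]; the sketch also assumes a proper prefix, so that a node is not reported as its own ancestor.

```python
# Illustrative sketch only: Dewey-style labels as tuples of integers.
# All names here are assumptions made for the example.

def child_label(parent, n):
    """Label of the n-th child: the parent's label with n appended."""
    return parent + (n,)

def is_ancestor(u, v):
    """u is an ancestor of v iff label(u) is a proper prefix of label(v)."""
    return len(u) < len(v) and v[:len(u)] == u

def is_parent(u, v):
    """u is the parent of v iff v is u plus exactly one more component."""
    return len(v) == len(u) + 1 and v[:len(u)] == u

root = (1,)
a = child_label(root, 2)   # second child of the root: 1.2
b = child_label(a, 3)      # third child of that node: 1.2.3
print(is_ancestor(root, b), is_parent(a, b), is_ancestor(b, root))
```

With children numbered 1, 2, 3, ... as in this sketch, there is no unused integer between two adjacent siblings, which is why a plain integer numbering needs relabeling on insertion and why the dynamic schemes discussed in this section exist.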
An important feature of such schemes, QED in particular, is that they compare codes in lexicographic rather than numerical order, which allows a code to be inserted between two existing codes by increasing the length of the inserted code. QED uses the symbols 0, 1, 2 and 3 to encode a node; in storage each symbol is represented by two bits and 0 is used as the separator, which yields variable-length codes and avoids overflow. The most serious drawback of QED is that it needs to know the range of codes to be represented before encoding, and then generates the codes at the 1/3 and 2/3 points recursively.

3. EBCL encoding scheme

3.1. Basic definition

Definition 1 (Basic symbol). EBCL codes are made up of 256 basic symbols 0, 1, ..., 254 and 255, each of which is called a one-bit code. Logically each basic symbol is a single EBCL symbol; physically it occupies one byte (8 bits), from 00000000 to 11111111. To avoid confusion, in this paper a one-bit code is written in brackets if its decimal representation has more than one digit; e.g., [10][25]96 denotes a code made up of the four basic symbols 10, 25, 9 and 6.

Definition 2 (Extended byte code). Every position ("bit") of an extended byte code is a basic symbol other than 0 and [255]; the left-most position is the highest bit and the right-most position is the lowest bit.
Add 1 (+1) operation: add 1 to the least significant bit. If the resulting sum in a given bit is [255], that bit is changed to 1 and a carry is passed to the next higher bit.
Subtract 1 (-1) operation: the reverse of the add 1 operation.
Greater than (>): for an extended byte code a, a+1 > a; the relation is transitive.
Less than (<): for an extended byte code a, a-1 < a; the relation is transitive.

Definition 3 (Self-coding and length). A self-code (also called self-label) is made up of sections separated by the basic symbol 0 (00000000 in binary representation). Every section is an extended byte code, and the number of sections is called the length of the self-code.

Definition 4 (EBCL encoding). Each node is assigned a globally unique EBCL code, formed by connecting its parent's label and its own self-label with the segment separator [255] (11111111 in binary representation). Two segments are ordered by comparing their corresponding sections from left to right, which can be done directly in lexicographic order because the section separator is 0. Two EBCL labels are ordered by comparing their corresponding segments from left to right.

3.2. Static EBCL encoding

Static EBCL encoding assigns a unique code to every node while the XML document is being loaded, which can be written as the mapping {XML node ∈ XML document tree} -> {EBCL encoding}. The rules of the static EBCL encoding scheme are as follows:
The root node is labeled 1.
The EBCL code of every other node is its parent's label connected with its self-label by the segment separator [255].
The self-label of the left-most child is 1, and the self-label of every other node is its left sibling's self-label plus 1.

Algorithm 1: Static_Label
Input: Node E, which has already been encoded
Output: Label of each child of node E

Static_Label(E) {
    selfLabel = 1;                                   /* the self-code of the first child is 1 */
    for (int i = 0; i < E.childCount; i++) {         /* encode all of E's children from left to right */
        iChild = E.getChild(i);                      /* iChild is the i-th child of E */
        iChild.label = E.label + [255] + selfLabel;  /* connect the parent's code and the self-code */
        selfLabel = addOne(selfLabel);               /* add 1 to the self-code to generate the next self-code */
        Static_Label(iChild);                        /* recurse to encode iChild's child nodes */
    }
}

Algorithm 1 shows the static EBCL encoding function. As shown in Fig. 1, the root node has 354 child nodes; the self-code of the left-most child is 1 and the self-codes of the other children are 2, 3, ..., [254], 11, ..., 1[100].

Fig. 1 Static EBCL encoding scheme
Fig. 2 Dynamic EBCL encoding scheme

3.3. Dynamic EBCL encoding

Dynamic EBCL encoding inserts a new node into an XML document tree that has already been encoded by the static EBCL scheme, generating a new EBCL label for the inserted node without changing any existing label. We divide insertion into three cases by target location: left-most insertion, middle insertion and right-most insertion. Since the target location is known, we only need to compute the self-code and connect it with the parent's code to form the EBCL code. Sub-tree insertion reduces to inserting the root of the sub-tree and then generating the labels of its child nodes with the static EBCL algorithm.

Left-most insertion inserts a new node before all its siblings. Suppose the self-code of the current left-most node is a, and scan a from left to right by section: if a has multiple sections, preserve its first non-null section to minimize the length; if a has a single section and a > 1, return a-1; if a equals 1, return 0[254]. Here [254] is used so that further nodes can conveniently be inserted before this new node. The case in which the new node has no siblings is also handled as a left-most insertion.

Middle insertion inserts a new node c between two adjacent nodes a and b; a and b are compared section by section from left to right. The core idea of EBCL is that if a and b differ by 1 in their last section, c can be formed by appending a new section to a. Note that the new section starts with 2 rather than 1, which leaves room for a later insertion before it. Table 1 shows several cases of middle insertion.

Right-most insertion inserts a new node at the right-most position. Let the self-code of the current right-most node be a; to minimize the length of the new self-code, we add 1 to the first non-null section of a to form the self-code of the new node. In Fig. 2, nodes drawn with dotted lines are newly inserted nodes.

Table 1 Example of middle insertion
left sibling     2070[23]   2070[23]   207     2     2
right sibling    20908      208        208     207   202
new self-label   208        2070[24]   20702   202   20102

4. Performance experiment

We conduct experiments to evaluate the performance and compare the code size of the ORDPATH, LSDX, QED and EBCL encoding schemes (the last being the one proposed in this paper) on static encoding and dynamic update.
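To make concrete which operations these experiments exercise, the following Python sketch models the add 1 operation of Definition 2, the static self-codes of Section 3.2 and one possible reading of middle insertion. It is an illustration under stated assumptions, not the Java implementation used in the experiments: all function names are hypothetical, `middle_insert` models each section as a single integer rather than as a multi-byte extended byte code, and its rules were chosen only so that they reproduce the five columns of Table 1. Cases that Table 1 does not show raise an error instead of guessing.

```python
# Illustrative sketch; names and the reading of Table 1 are assumptions.
SECTION_SEP = 0      # separates sections inside a self-code (Definition 3)
SEGMENT_SEP = 255    # separates the parent's label from the self-code (Definition 4)
MAX_SYMBOL = 254     # largest symbol allowed inside an extended byte code

def add_one(code):
    """Add 1 to an extended byte code given as a list of symbols in 1..254."""
    digits = list(code)
    for i in range(len(digits) - 1, -1, -1):
        digits[i] += 1
        if digits[i] <= MAX_SYMBOL:      # no carry needed
            return digits
        digits[i] = 1                    # the [255] becomes 1, the carry moves left
    return [1] + digits                  # carry out of the highest position

def static_self_codes(child_count):
    """Self-codes of the children of one node, from left to right."""
    codes, code = [], [1]
    for _ in range(child_count):
        codes.append(code)
        code = add_one(code)
    return codes

def child_label(parent_label, self_code):
    """Physical label: parent's label, segment separator, self-code."""
    return bytes(parent_label) + bytes([SEGMENT_SEP]) + bytes(self_code)

def middle_insert(a, b):
    """Self-code between adjacent siblings a < b (tuples of integer sections)."""
    assert a < b, "a must be the left sibling"
    i = 0
    while i < len(a) and i < len(b) and a[i] == b[i]:
        i += 1
    if i == len(a):                      # a is a prefix of b
        if b[i] > 2:
            return a + (2,)              # a new section starts with 2
        if b[i] == 2:
            return a + (1, 2)
        raise NotImplementedError("case not shown in Table 1")
    if b[i] - a[i] >= 2:
        return a[:i] + (a[i] + 1,)       # room for a shorter code
    if i == len(a) - 1:
        return a + (2,)                  # differ by 1 in a's last section
    return a[:i + 1] + (a[i + 1] + 1,)   # bump a's next section

codes = static_self_codes(354)
print(codes[253], codes[254], codes[-1])
print(middle_insert((2, 7), (2, 8)))
```

In this model the 354th self-code comes out as the two symbols 1 and 100, matching the 1[100] of Fig. 1, and the tuple (2, 7, 2) corresponds to the self-label written 20702 in Table 1.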
All four schemes are implemented in Java, and all experiments are carried out on an Intel Core2 Duo E8500 CPU @ 3.16 GHz with 2 GB RAM running Windows XP Professional. The experiments use five data sets. The first three are generated by XMark [6], a tool that generates documents automatically, with the arguments 0.1 (D1), 0.2 (D2) and 0.5 (D3). The other two data sets [7], NASA (D4) and TREEBANK_E (D5), are provided by the Department of Computer Science, University of Washington.

Fig. 3. Code size of the encoding schemes (code size in MB for ORDPATH, LSDX, QED and EBCL on D1 to D5)
Fig. 4. Time cost on static encoding (static labelling time in s for ORDPATH, LSDX, QED and EBCL on D1 to D5)

Fig. 3 shows the space cost of the encoding schemes. In ORDPATH every self-code is represented by an integer, which takes 4 bytes on a 32-bit system, while the minimal EBCL label takes only 1 byte. In EBCL, 0 and 255 serve as the section separator and the segment separator, so every byte can represent 2^8 - 2 = 254 effective codes. If the fan-out is no more than 254, EBCL needs only 1/4 of the storage space of ORDPATH. QED is a variable-length encoding scheme as well, but it costs much more space than EBCL. LSDX is a string-based encoding scheme in which every character takes one byte, so every byte in LSDX can represent only 26 effective codes; its space cost grows rapidly when the fan-out is very large.

Fig. 4 shows the time cost of the encoding schemes on static encoding. Since ORDPATH involves only integer addition, its performance is the best. QED needs to pre-generate the self-codes of all children and then distribute them, so its performance is the worst. EBCL needs the add 1 operation, which takes somewhat more work than the integer addition of ORDPATH, but its performance is only slightly worse than that of ORDPATH.

Table 2 Time cost of inserting 100 nodes (ms)
Location to be inserted   ORDPATH   LSDX   QED    EBCL
Left-most                 1109      1563   1376   1252
Middle                    3168      2164   2353   1645
Right-most                1103      1923   2071   961

Table 2 shows the average time cost of the ORDPATH, LSDX, QED and EBCL encoding schemes when inserting 100 new nodes. EBCL is the fastest of the four for middle and right-most insertion and second only to ORDPATH for left-most insertion.

Conclusion

We propose a new efficient dynamic XML tree encoding scheme based on our previous work. It not only performs well in static encoding, with high space efficiency, but also supports dynamic updates well, completely avoiding the relabelling of existing codes. Finally, we evaluate the performance of the EBCL encoding scheme experimentally, in terms of code size, static labelling time and insertion time.

References

[1] Zhang C, Naughton J, DeWitt D. On supporting containment queries in relational database management systems. In: Proceedings of the ACM SIGMOD Conference 2001; 426-437.
[2] Tatarinov I, Viglas S, Beyer K S. Storing and querying ordered XML using a relational database system. In: Proc. of SIGMOD 2002; 1:204-215.
[3] O'Neil P E, O'Neil E J, Pal S. ORDPATHs: Insert-friendly XML node labels. In: SIGMOD Conference 2004; 903-908.
[4] Duong M, Zhang Y. LSDX: A new labeling scheme for dynamically updating XML data. In: 16th Australasian Database Conference 2005.
[5] Li C, Ling T W, Hu M. Efficient updates in dynamic XML data: from binary string to quaternary string. VLDB Journal 2008; 17(3):573-601.
[6] XMark: The XML benchmark project. http://monetdb.ewi.nl/xml/downloads.html
[7] University of Washington XML Repository. http://www.cs.washington.edu/research/xmldatasets