Path Query Reduction and Diffusion for Distributed Semi-structured Data Retrieval+

Jaehyung Lee, Yon Dohn Chung, Myoung Ho Kim
Division of Computer Science, Department of EECS
Korea Advanced Institute of Science and Technology (KAIST)
373-1, Kusong-dong, Yusong-gu, Taejon, 305-701, Korea
{jlee, ydchung, mhkim}@dbserver.kaist.ac.kr

+This work was supported by grant No. 1999-1-303-007-3 from the interdisciplinary research program of the KOSEF.
0-7695-0789-1/00 $10.00 © 2000 IEEE

Abstract

In this paper, we address the problem of query processing on distributed semi-structured data. Distributed semi-structured data can be modeled as a rooted and edge-labeled graph whose nodes are located in a single site or in a number of sites. For efficient retrieval of distributed semi-structured data, we propose a query processing model that is based on the query reduction and diffusion method. In this method, a user query is reduced in a site and distributed to other sites for data retrieval. We also propose a set of algorithms for the proposed model.

1. Introduction

Semi-structured data is generally described as data whose structure or format is not separated from its contents. Examples of semi-structured data are HTML documents, BibTeX documents, genome data, etc. Such data has an irregular and changeable structure.

Most previous studies on semi-structured data try to extend existing database technologies. That is, they complement relational and object-oriented database technologies for semi-structured data retrieval. Detailed research areas include data integration, web site management, general purpose semi-structured data management [6], data model [5], query language design [1, 2], query processing [8], and indexing techniques [7].

The various data models proposed for semi-structured data share a common property: they model the semi-structured data as a rooted and edge-labeled graph. In the graph, nodes are objects and labels are strings, integers, images, sounds, etc.

Several query languages have been proposed for semi-structured data, which vary in style and expressive power. They are generally based on regular path expressions and describe the nodes reachable via the given path in a declarative way. Below we show an example of a regular path query taken from UnQL [2].

    Q = select t where _* ⇒ CS-Dept ⇒ _* ⇒ Paper ⇒ t in DB

The query retrieves all papers accessible from a CS-Dept link in the DB. Here, _* ⇒ CS-Dept ⇒ _* ⇒ Paper is a regular path expression. This expression denotes a path which has an edge labeled CS-Dept and an edge labeled Paper, in this order.

In this paper, for distributed semi-structured data retrieval, we propose a query processing model that can be applied to the current web environment consisting of HTML and XML documents. The model uses path query reduction and diffusion for local and distributed query processing. In local query processing, regular path queries are processed with query reduction, where regular path queries are reduced as they move towards children nodes. In distributed query processing, queries are forwarded to other sites when there are edges that connect two sites. We call this query diffusion.

In distributed query processing, the query originating site must know when the query processing has terminated. We propose a set of algorithms with which the query originating site detects the termination of distributed query processing both in the normal case (i.e., all query processing ends normally) and in the user abort case (i.e., the user wants to stop the query processing).

2. Background

A general model for semi-structured data is a rooted and edge-labeled graph. Figure 1 illustrates an example of a graph representation, which is a fragment of a university web site. In the graph, nodes are web pages and edges are hyperlinks between the nodes. The numbers in the nodes are the identifiers of the nodes.
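
As a rough illustration of this data model, the following sketch stores a rooted, edge-labeled graph as an adjacency list and naively evaluates the path expression _* ⇒ CS-Dept ⇒ _* ⇒ Paper by enumerating label sequences from the root. The node numbers, the labels, and the names `GRAPH`, `ROOT`, `label_paths`, and `papers_under_cs_dept` are hypothetical and are not taken from Figure 1; the enumeration also assumes an acyclic graph.

```python
from typing import Dict, Iterator, List, Tuple

# Hypothetical fragment of a university site: node -> list of (edge label, child).
GRAPH: Dict[int, List[Tuple[str, int]]] = {
    1: [("CS-Dept", 2), ("Math-Dept", 3)],
    2: [("People", 4), ("Paper", 5)],
    3: [("Paper", 6)],
    4: [("Paper", 7)],
    5: [], 6: [], 7: [],
}
ROOT = 1


def label_paths(node: int, prefix: Tuple[str, ...] = ()) -> Iterator[Tuple[Tuple[str, ...], int]]:
    """Yield (label sequence, node) for every path starting at `node` (acyclic graphs only)."""
    yield prefix, node
    for label, child in GRAPH[node]:
        yield from label_paths(child, prefix + (label,))


def papers_under_cs_dept() -> List[int]:
    """Nodes reached by a path that has a CS-Dept edge and ends with a later Paper edge."""
    hits = []
    for labels, node in label_paths(ROOT):
        if labels and labels[-1] == "Paper" and "CS-Dept" in labels[:-1]:
            hits.append(node)
    return sorted(hits)


print(papers_under_cs_dept())
```

Enumerating every path is only workable for tiny acyclic examples, which is one motivation for an incremental evaluation scheme.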

Figure 1. A graph representation of semistructured data

Semi-structured query languages share a common feature: they use path expressions to traverse graphs. A path expression on semi-structured data denotes a sequence of edge labels. The query results are the nodes that satisfy the given path expression from the root node of the data. Regular expressions are an effective query language for semi-structured data retrieval, since the user need not spell out every sequence of labels. The regular expression has the following grammar:

    R ::= P | a | _ | R|R | R⇒R | R*

Here, P is a user-defined condition statement or a boolean combination of such condition statements, a is a label constant, _ is any label, R1|R2 is an alternation, R1 ⇒ R2 is the concatenation of R1 and R2, and R* is the closure of R. The following regular expression Query1 finds all the papers of the computer science department in Figure 1.

    Query1 : _* ⇒ CS-Dept ⇒ _* ⇒ Paper

The result is the set of nodes {57, 69, 70, 86}. In the rest of the paper, we omit '⇒' in regular expressions if there is no ambiguity. In this paper, we consider a regular expression R in the following query.

    Q(DB) = select t where R ⇒ t in DB

We call this query a regular path query or regular query. The result of this query is the set of nodes in the graph reachable from the root via the given regular expression R.

Today, many web sites are spread over several locations and connected to each other by hyperlinks. If we consider those web sites as semi-structured data, we can apply the semi-structured data model to them. An example of university web sites distributed over three different sites is depicted in Figure 2. If Query1 is applied to node 1 in Figure 2, the result is the set of nodes drawn with slanted lines. Suciu [8] proposed a query decomposition method for query processing on distributed semi-structured data.
In Suciu's method, queries are transferred to all other sites, computed in each site, and then the result of each site is returned to the query originating site. This method assumes that semi-structured data is distributed on fixed and known sites, and that every site knows its input and output nodes¹. However, if we consider the current web environment, this assumption is not realistic. In HTML and XML documents, identifying output nodes is very easy, but identifying input nodes is almost impossible.

¹ For every cross link u → v from site α to site β, we call u an output node in α and v an input node in β.

Figure 2. An example of distributed semistructured data

3. Proposed Query Processing Model

In this section, we propose a query processing model for distributed semi-structured data retrieval.

- Local Query Processing

The query in a local site is processed through query reduction, which is done by the following Reduce function. The Reduce function takes two inputs: (i) a path query given as a regular expression and (ii) a label used for a state transition in an automaton constructed from the regular expression. If a transition from the start state of the automaton using the label is possible, then the Reduce function returns a regular path query with a new start state, which is obtained from the transition. Otherwise, the Reduce function returns ∅. Figure 3 shows an example automaton.

Figure 3. An example of path query reduction

    R = a*bc*d | a*e
    Reduce(R, a) = a*bc*d | a*e
    Reduce(R, b) = c*d
    Reduce(R, c) = ∅
    Reduce(R, d) = ∅
    Reduce(R, e) = ε
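
The Reduce function can be sketched in Python as follows, assuming the regular expression has already been compiled into a deterministic automaton. The names `PathQuery`, `reduce_query`, and `DELTA` are hypothetical, the transition table is written by hand for R = a*bc*d | a*e, and `None` stands for the empty query ∅; this is an illustration, not the authors' implementation.

```python
from typing import Dict, FrozenSet, NamedTuple, Optional, Tuple


class PathQuery(NamedTuple):
    """A regular path query: a deterministic automaton plus its current start state."""
    delta: Dict[Tuple[str, str], str]  # (state, label) -> next state
    start: str
    finals: FrozenSet[str]


def reduce_query(query: PathQuery, label: str) -> Optional[PathQuery]:
    """Advance the start state over `label`; return None (the empty query) if impossible."""
    nxt = query.delta.get((query.start, label))
    if nxt is None:
        return None
    return query._replace(start=nxt)


# Hand-built automaton for R = a*bc*d | a*e.
# Each state is named after the expression that remains to be matched.
DELTA = {
    ("a*bc*d|a*e", "a"): "a*bc*d|a*e",
    ("a*bc*d|a*e", "b"): "c*d",
    ("a*bc*d|a*e", "e"): "eps",
    ("c*d", "c"): "c*d",
    ("c*d", "d"): "eps",
}
R = PathQuery(DELTA, "a*bc*d|a*e", frozenset({"eps"}))

for label in "abcde":
    reduced = reduce_query(R, label)
    print(label, "->", reduced.start if reduced else "empty")
```

Because the automaton itself never changes, a reduced query only needs to carry the new start state, which keeps the query messages exchanged between sites small.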

The Reduce function results in Figure 3 are computed for each label of the regular expression R = a*bc*d | a*e.

We assume that each local site uses Algorithm 1 for path query processing. The LQP (Local Query Processing) function in Algorithm 1 takes two inputs: (i) a path query given as a regular expression and (ii) a node identifier. It applies the Reduce function to each label on the edges that are adjacent to the given node. If the result of the Reduce function is not ∅, the LQP function recursively calls itself on the children nodes using the reduced query. Figure 4 shows an example, where a path query a*bc*d is given to node 1 and the query result is {3, 4}.

Algorithm 1 Local query processing

    Visited ← ∅  {nodes already visited}
    Result ← ∅  {query result}
    LQP(R, Root(DB))

    function LQP(R, u)
    begin
        if R's start state ∈ R's final states then
            Result ← Result ∪ {u}
        if <R, u> ∈ Visited then return
        Visited ← Visited ∪ {<R, u>}
        for all edges u → v labeled a do
            R2 = Reduce(R, a)
            if R2 ≠ ∅ then LQP(R2, v)
        end for
    end

Figure 4. An example of local query processing

- Distributed Query Processing

For distributed query processing, we assume that each site knows its output nodes only. Unlike local query processing, in distributed semi-structured data retrieval queries have to be transferred to several sites using query diffusion. The proposed query processing model for distributed semi-structured data is based on Algorithm 2. Algorithm 2 takes two inputs: (i) distributed semi-structured data and (ii) a regular path query.

Algorithm 2 Distributed query processing

    s : query originating site
    O_p : identifier for a node in site p
    R : regular expression

    {query message (s, O_p, R) arrives at site p}
    receive (s, O_p, R)
    evaluate (s, O_p, R)

    {query message (s, O_r, R') is computed}
    send (s, O_r, R') to site r

    {query processing result result_p is computed}
    send (result_p) to site s

Figure 5. An example of distributed query processing

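
Algorithms 1 and 2 can be combined into one per-site evaluation step, sketched below under several assumptions: the query a*bc*d is represented by a hand-written automaton, a reduced query is just an automaton state, and the graph `EDGES`, the node-to-site map `SITE`, and the names `reduce_state` and `evaluate` are hypothetical. The graph is not the one in Figure 4 or Figure 5.

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical automaton for the path query a*bc*d: (state, label) -> next state.
DELTA: Dict[Tuple[int, str], int] = {(0, "a"): 0, (0, "b"): 1, (1, "c"): 1, (1, "d"): 2}
FINALS = {2}


def reduce_state(state: int, label: str) -> Optional[int]:
    """Reduce: advance the query's start state over `label`; None is the empty query."""
    return DELTA.get((state, label))


# Hypothetical data: node -> [(label, child)], and the site that stores each node.
EDGES: Dict[int, List[Tuple[str, int]]] = {
    1: [("a", 2), ("b", 3)], 2: [("b", 3)], 3: [("d", 4), ("c", 5)],
    4: [], 5: [("d", 6)], 6: [],
}
SITE = {1: "A", 2: "A", 3: "A", 4: "A", 5: "B", 6: "B"}


def evaluate(site: str, node: int, state: int):
    """Run LQP inside `site`; return (local result, reduced queries bound for other sites)."""
    visited: Set[Tuple[int, int]] = set()
    result: Set[int] = set()
    outgoing: Set[Tuple[str, int, int]] = set()  # (target site, node, reduced state)

    def lqp(state: int, u: int) -> None:
        if state in FINALS:
            result.add(u)
        if (state, u) in visited:
            return
        visited.add((state, u))
        for label, v in EDGES[u]:
            reduced = reduce_state(state, label)
            if reduced is None:
                continue                      # Reduce returned the empty query
            if SITE[v] == site:
                lqp(reduced, v)               # stay inside the local site
            else:
                outgoing.add((SITE[v], v, reduced))  # query message for another site

    lqp(state, node)
    return result, outgoing


print(evaluate("A", 1, 0))
print(evaluate("B", 5, 1))
```

In this sketch site A only needs to know its own outgoing edges; site B never has to know which nodes of other sites point at it.
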
Query diffusion means that, in each local site, if an edge to another site is reachable and the result of the Reduce function on that edge is not ∅, then the reduced query is transferred along that edge. When a site receives a query, the query is processed through the LQP algorithm in that site. When an edge to another site that satisfies the Reduce function is found, a reduced query message is sent to that site. After the local query processing, the query result is sent to the query originating site. As shown here, the distributed query processing model is based on both query reduction and query diffusion. An example of query processing using this model is illustrated in Figure 5.

4. Termination Detection

The query processing model we have proposed is applicable to the current web environment, which consists of HTML and XML documents. However, since the query originating site gets query results incrementally from several sites, it cannot detect when the query processing has terminated. In this section, we propose an algorithm with which the query originating site detects when the query processing is finished in our distributed query processing model. In addition, we propose an extended algorithm that covers the user abort case.

With the extended algorithm, the query originating site detects the termination of the query processing when the user wants to abort the query processing. Table 1 describes some notations for the termination detection of query processing in the normal and user abort cases.

Table 1. Notations

    Name           Description
    1, ..., N      site (s is the query originating site)
    state_p        each site's state (active, passive)
    parent_p       the site which sent the query to site p (parent_p = s, for p ≠ s)
    log_p          the log for queries given to site p
    result_p       the result for a given query in site p
    n(query_p)     the number of (query-ack) messages to be received in site p
    n(result_p)    the number of (result-ack) messages to be received in site p
    n(query_p^r)   the number of (query-ack) messages to be received in site p for site r

- Normal Case

For the query originating site to detect the termination of distributed query processing, each participating site p acts as follows.

    when query message (s, q, O_p, R)² arrives at site p
        receive (s, q, O_p, R)
        if (O_p, R) ∈ log_p then
            send (query-ack) to q
            return
        log_p ← log_p ∪ {(O_p, R)}
        if […] then
            parent_p ← q
        else
            send (query-ack) to q
        state_p ← active
        evaluate (s, O_p, R)

    when (query-ack) arrives at site p
        n(query_p)--
        if n(query_p) = 0 and state_p = passive then […]

    when query processing result result_p is computed
        send (result_p) to site s
        n(result_p)++

    when (result-ack) arrives at site p
        receive (result-ack)
        n(result_p)--
        if n(result_p) = 0 then state_p ← passive
        if n(query_p) = 0 and state_p = passive then […]

    when query message (s, p, O_r, R') is computed
        send (s, p, O_r, R') to site r
        n(query_p)++

² (s, q, O_p, R): s is the query originating site, q is the parent site which sent the query, O_p is the identifier of the node in site p, and R is the path query.

Figure 6 is an example of distributed query processing and termination detection. Queries are transferred from Site 1 to Site 2, from Site 2 to Site 3 and Site 4, and from Site 3 to Site 4. Each transferred query is processed and the query result is transferred to Site 1.
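
The message flow of this example can be replayed with a small single-process simulation. The sketch below is not the authors' algorithm verbatim; it makes the following assumptions: a site adopts the sender of its first query as its parent and acknowledges every later query at once, a site reports to its parent with a (query-ack) once it is passive and expects no further acknowledgements, the originating site answers every result with a (result-ack), and messages are delivered in FIFO order. The per-site log is omitted, and the names `FORWARDS`, `Site`, and `run` are hypothetical.

```python
from collections import deque

# Hypothetical topology following the walk-through: which site forwards queries to which.
FORWARDS = {1: [2], 2: [3, 4], 3: [4], 4: []}
ORIGIN = 1


class Site:
    def __init__(self, sid):
        self.sid = sid
        self.state = "passive"
        self.parent = None
        self.n_query = 0    # (query-ack) messages still expected
        self.n_result = 0   # (result-ack) messages still expected


def run():
    sites = {sid: Site(sid) for sid in FORWARDS}
    queue = deque([("query", ORIGIN, ORIGIN)])   # (kind, sender, receiver)
    trace, terminated = [], False

    def maybe_finish(site):
        nonlocal terminated
        # Assumed rule: a passive site with no pending acks reports to its parent.
        if site.n_query == 0 and site.state == "passive" and site.parent is not None:
            parent, site.parent = site.parent, None
            if site.sid == ORIGIN:
                terminated = True
            else:
                queue.append(("query-ack", site.sid, parent))

    while queue:
        kind, src, dst = queue.popleft()
        site = sites[dst]
        trace.append((kind, src, dst))
        if kind == "query":
            if site.parent is not None:            # already engaged: acknowledge at once
                queue.append(("query-ack", dst, src))
                continue
            site.parent, site.state = src, "active"
            for r in FORWARDS[dst]:                # query diffusion
                queue.append(("query", dst, r))
                site.n_query += 1
            queue.append(("result", dst, ORIGIN))  # local result goes to the origin
            site.n_result += 1
        elif kind == "result":
            queue.append(("result-ack", dst, src))
        elif kind == "query-ack":
            site.n_query -= 1
            maybe_finish(site)
        elif kind == "result-ack":
            site.n_result -= 1
            if site.n_result == 0:
                site.state = "passive"
            maybe_finish(site)
    return terminated, trace


terminated, trace = run()
print(terminated, trace[-1])
```

Under these assumptions the run ends with Site 2 acknowledging Site 1, at which point the originating site can declare termination.
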
Site 3 and Site 4 notify the end of the query processing to Site 2, and Site 2 to Site 1. Finally, Site 1 detects the termination of the distributed query processing.

Figure 6. An example of normal termination

For the query that Site 3 sent to Site 4, Site 4 immediately sends an acknowledgement to Site 3 because its parent variable was already set to Site 2. This immediate acknowledgement maintains the path on which query acknowledgement messages are delivered as a tree.

- Cycle Prevention

Let query R be delivered to node a of site A. While R is being processed, the same query R can be delivered to node b of site B, and then delivered to node a of site A again. To prevent this kind of query cycling, each site must not process a query message that has already been processed or is currently being processed. For this purpose, each site uses a log variable for storing received query messages. The log variables are managed as follows. After detecting the termination of query processing, the query originating site notifies the termination to all the sites which sent a query result. Then each site clears its log variable.

- User Abort Case

While the given query is in process, the user may want to abort it. To support termination detection in the case of a user-requested abort, we use a variable n(query_p^r) (in Table 1) and add the following to the previous algorithm: n(query_p^r) is increased when a query message to site r is computed in site p, and decreased when the (query-ack) is delivered from site r. In addition, we add the following abort message handling algorithm.

    when (abort) arrives at site p
        if […] then return
        stop its evaluation
        state_p ← passive
        if n(query_p) = 0 then […]
        if n(result_p) = 0 then […]
        send (abort) to all sites r where n(query_p^r) ≠ 0
        n(query_p^r) = 0 for all r

- Correctness

Now, we prove the correctness of the termination detection algorithms for the normal case and for the user abort case. In each case, we prove the following: (i) after the termination of distributed query processing, the query originating site returns to a passive state, and (ii) after the query originating site returns to a passive state, all distributed query processing is actually terminated. First, we define the termination of distributed query processing.

Definition 1 The distributed query processing is terminated when all the relevant sites are in passive states and there is no undelivered (query), (query-ack), (result) or (result-ack) message.
There may be (abort) messages in the network in the case of a user-requested abort. However, such a message does not change the state of a site to active.

Theorem 1 The distributed query processing for a regular path query is terminated if and only if the query originating site returns to a passive state.

[Proof] We can prove the theorem by transforming Dijkstra's diffusing computation method [3]. For details, see [4].

5. Conclusion

In this paper, we have proposed a query processing model for distributed semi-structured data retrieval. The proposed model is based on query reduction and query diffusion. Query reduction is used for local query processing and query diffusion is used for distributed query processing. In the model we assume that each site knows its output nodes only, which is practically applicable to the current web environment. In addition, we have proposed termination detection algorithms for the proposed model.

References

[1] S. Abiteboul, D. Quass, J. McHugh, J. Widom, and J. Wiener. The Lorel query language for semistructured data. International Journal on Digital Libraries, 1(1), April 1997.
[2] Peter Buneman, Susan Davidson, Gerd Hillebrand, and Dan Suciu. A query language and optimization techniques for unstructured data. In Proceedings of ACM SIGMOD Conference, 1996.
[3] Edsger W. Dijkstra and C. S. Scholten. Termination Detection for Diffusing Computations. Information Processing Letters, 11, 1980.
[4] J. Lee, Y. D. Chung, and M. H. Kim. A Path Query Processing Scheme for Distributed Semi-structured Data Retrieval. Journal of KISS - Databases, (revision).
[5] R. Goldman, S. Chawathe, A. Crespo, and J. McHugh. A Standard Textual Interchange Format for the Object Exchange Model (OEM). Technical report, Stanford University, 1996.
[6] J. McHugh, S. Abiteboul, R. Goldman, D. Quass, and J. Widom. Lore: A database management system for semistructured data. SIGMOD Record, 26(3), September 1997.
[7] Tova Milo and Dan Suciu. Index structures for path expressions. In International Conference on Database Theory, 1999.
[8] Dan Suciu. Query Decomposition and View Maintenance for Query Languages for Unstructured Data. In Proceedings of Very Large Databases Conference, 1996.