Path Query Reduction and Diffusion for Distributed Semi-structured Data Retrieval+

Jaehyung Lee, Yon Dohn Chung, Myoung Ho Kim
Division of Computer Science, Department of EECS
Korea Advanced Institute of Science and Technology (KAIST)
373-1, Kusong-dong, Yusong-gu, Taejon, 305-701, Korea
{jlee, ydchung, mhkim}@dbserver.kaist.ac.kr

Abstract

In this paper, we address the problem of query processing on distributed semi-structured data. Distributed semi-structured data can be modeled as a rooted, edge-labeled graph whose nodes are located at one or more sites. For efficient retrieval of distributed semi-structured data, we propose a query processing model based on a query reduction and diffusion method. In this method, a user query is reduced at a site and distributed to other sites for data retrieval. We also propose a set of algorithms for the proposed model.

1. Introduction

Semi-structured data is generally described as data whose structure or format is not separated from its contents. Examples of semi-structured data are HTML documents, BibTeX documents, genome data, etc. Such data have an irregular and changeable structure. Most previous studies on semi-structured data try to extend existing database technologies. That is, they complement relational and object-oriented database technologies for semi-structured data retrieval. Research areas include data integration, web site management, general-purpose semi-structured data management [6], data models [5], query language design [1, 2], query processing [8], and indexing techniques [7]. The various data models proposed for semi-structured data share a common property: they model the data as a rooted, edge-labeled graph. In the graph, nodes are objects and labels are strings, integers, images, sounds, etc. Several query languages have been proposed for semi-structured data, which vary in style and expressive power.
They are generally based on regular path expressions and describe, in a declarative way, the nodes reachable via a given path. Below is an example of a regular path query, taken from UnQL [2].

Q = select t where _* => CS-Dept => _* => Paper => t in DB

The query retrieves all papers accessible from a CS-Dept link in the DB. Here, _* => CS-Dept => _* => Paper is a regular path expression. It denotes a path that has an edge labeled CS-Dept and an edge labeled Paper, in this order.

In this paper, for distributed semi-structured data retrieval, we propose a query processing model that can be applied to the current web environment consisting of HTML and XML documents. The model uses path query reduction and diffusion for local and distributed query processing. In local query processing, regular path queries are processed by query reduction: a query is reduced as it moves towards children nodes. In distributed query processing, queries are forwarded to other sites when there are edges that connect two sites. We call this query diffusion. In distributed query processing, the query originating site must also know when query processing has terminated. We propose a set of algorithms with which the query originating site detects the termination of distributed query processing both in the normal case (i.e., all query processing ends normally) and in the user abort case (i.e., the user wants to stop the query processing).

+ This work was supported by grant No. 1999-1-303-007-3 from the interdisciplinary research program of the KOSEF.
0-7695-0789-1/00 $10.00 © 2000 IEEE

2. Background

A general model for semi-structured data is a rooted, edge-labeled graph. Figure 1 illustrates an example graph representation, a fragment of a university web site. In the graph, nodes are web pages and edges are hyperlinks between the nodes. The numbers in the nodes are the node identifiers.
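To make the graph model concrete, the following minimal sketch stores a small rooted, edge-labeled graph as an adjacency list and evaluates the introductory path expression (an edge labeled CS-Dept followed, after any number of further edges, by an edge labeled Paper). The graph contents, the node numbers, and the function name `papers_under_cs_dept` are hypothetical; they are not the graph of Figure 1, and the hand-written three-state matcher is only one possible way to evaluate this particular expression.

```python
from collections import deque

# Hypothetical fragment of a university site: node id -> [(edge label, child id)].
GRAPH = {
    1: [("CS-Dept", 2), ("EE-Dept", 3)],
    2: [("People", 4), ("Paper", 6)],
    3: [("Paper", 7)],
    4: [("Paper", 5)],
    5: [],
    6: [],
    7: [],
}


def papers_under_cs_dept(graph, root):
    """Nodes reachable from root via: any labels, CS-Dept, any labels, Paper."""
    # State 0: no CS-Dept edge followed yet.
    # State 1: a CS-Dept edge has been followed.
    # State 2: a Paper edge was followed after CS-Dept (accepting).
    results = set()
    seen = {(root, 0)}
    queue = deque(seen)
    while queue:
        node, state = queue.popleft()
        if state == 2:
            results.add(node)
            continue  # the expression ends at the Paper edge
        for label, child in graph.get(node, []):
            next_states = {state}  # the wildcard closure keeps the current state
            if state == 0 and label == "CS-Dept":
                next_states.add(1)
            if state == 1 and label == "Paper":
                next_states.add(2)
            for nxt in next_states:
                if (child, nxt) not in seen:
                    seen.add((child, nxt))
                    queue.append((child, nxt))
    return results
```

On this hypothetical graph, `papers_under_cs_dept(GRAPH, 1)` returns `{5, 6}`; node 7 is excluded because its Paper edge is not below a CS-Dept edge.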

Figure 1. A graph representation of semi-structured data

Semi-structured query languages have a common feature: they use path expressions to traverse graphs. A path expression on semi-structured data denotes a sequence of edge labels. The query result is the set of nodes that satisfy the given path expression starting from the root node of the data. Regular expressions are effective as a query language for semi-structured data retrieval, since a query need not enumerate every sequence of labels. The regular expressions have the following grammar:

R ::= P  |  a  |  _  |  R|R  |  R=>R  |  R*

Here, P is a user-defined condition statement or a boolean combination of such condition statements, a is a label constant, _ is any label, R1|R2 is an alternation, R1=>R2 is the concatenation of R1 and R2, and R* is the closure of R. The following regular expression Query1 finds all the papers of the computer science department in Figure 1.

Query1 : _* => CS-Dept => _* => Paper

The result is the set of nodes {57, 69, 70, 86}. In the rest of the paper, we omit '=>' in regular expressions if there is no ambiguity. We consider a regular expression R in a query of the following form.

Q(DB) = select t where R => t in DB

We call this query a regular path query or regular query. Its result is the set of nodes in the graph reachable from the root via the given regular expression R.

Nowadays, many web sites are spread over several locations and connected to each other by hyperlinks. If we regard those web sites as semi-structured data, we can apply the semi-structured data model to them. Figure 2 depicts an example of university web sites distributed over three different sites. If Query1 is applied to node 1 in Figure 2, the result is the set of nodes drawn with slanted lines.

Suciu [8] proposed a query decomposition method for query processing on distributed semi-structured data.
In the method, queries are transferred to all other sites, computed at each site, and then the result of each site is returned to the query originating site. This method assumes that semi-structured data is distributed over fixed and known sites, and that every site knows its input and output nodes^1. However, in the current web environment this assumption is not realistic. In HTML and XML documents, identifying output nodes is very easy, but identifying input nodes is almost impossible.

^1 For every cross link u -> v from site α to site β, we call u an output node in α and v an input node in β.

Figure 2. An example of distributed semi-structured data

3. Proposed Query Processing Model

In this section, we propose a query processing model for distributed semi-structured data retrieval.

- Local Query Processing

The query at a local site is processed through query reduction, which is done by the following Reduce function. The Reduce function takes two inputs: (i) a path query given as a regular expression and (ii) a label used for a state transition in an automaton constructed from the regular expression. If a transition from the start state of the automaton on the label is possible, then the Reduce function returns a regular path query whose start state is the state reached by that transition. Otherwise, the Reduce function returns ∅. Figure 3 is an example of an automaton constructed from a regular expression.

Figure 3. An example of path query reduction
    R : a*bc*d | a*e
    Reduce(R,a) = a*bc*d | a*e
    Reduce(R,b) = c*d
    Reduce(R,c) = ∅
    Reduce(R,d) = ∅
    Reduce(R,e) = ε
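The Reduce results of Figure 3 can be reproduced with a short sketch. Note the substitution: instead of building an explicit automaton, the code below computes the reduced query symbolically with Brzozowski derivatives over a small expression tree, which yields the same reduced language. All class and function names (`Label`, `Alt`, `Cat`, `Star`, `reduce_query`, `nullable`) are hypothetical, and user-defined condition statements P are not modeled.

```python
from dataclasses import dataclass


class Regex:
    """Base class for the path-expression tree (hypothetical names)."""


@dataclass(frozen=True)
class Empty(Regex):      # the empty query: matches no path
    pass


@dataclass(frozen=True)
class Eps(Regex):        # epsilon: matches the empty path
    pass


@dataclass(frozen=True)
class Label(Regex):      # a label constant
    name: str


@dataclass(frozen=True)
class Any(Regex):        # the wildcard _
    pass


@dataclass(frozen=True)
class Alt(Regex):        # alternation R1|R2
    left: Regex
    right: Regex


@dataclass(frozen=True)
class Cat(Regex):        # concatenation of R1 and R2
    left: Regex
    right: Regex


@dataclass(frozen=True)
class Star(Regex):       # closure R*
    inner: Regex


def alt(a, b):
    """Alternation that drops empty and duplicate branches."""
    if isinstance(a, Empty):
        return b
    if isinstance(b, Empty) or a == b:
        return a
    return Alt(a, b)


def cat(a, b):
    """Concatenation that simplifies empty and epsilon operands."""
    if isinstance(a, Empty) or isinstance(b, Empty):
        return Empty()
    if isinstance(a, Eps):
        return b
    if isinstance(b, Eps):
        return a
    return Cat(a, b)


def nullable(r):
    """True if r accepts the empty path (its start state is a final state)."""
    if isinstance(r, (Eps, Star)):
        return True
    if isinstance(r, Alt):
        return nullable(r.left) or nullable(r.right)
    if isinstance(r, Cat):
        return nullable(r.left) and nullable(r.right)
    return False


def reduce_query(r, label):
    """Query that remains after following one edge carrying the given label."""
    if isinstance(r, Label):
        return Eps() if r.name == label else Empty()
    if isinstance(r, Any):
        return Eps()
    if isinstance(r, Alt):
        return alt(reduce_query(r.left, label), reduce_query(r.right, label))
    if isinstance(r, Cat):
        head = cat(reduce_query(r.left, label), r.right)
        tail = reduce_query(r.right, label) if nullable(r.left) else Empty()
        return alt(head, tail)
    if isinstance(r, Star):
        return cat(reduce_query(r.inner, label), r)
    return Empty()  # Empty and Eps allow no transition


a, b, c, d, e = (Label(x) for x in "abcde")
R = Alt(Cat(Star(a), Cat(b, Cat(Star(c), d))), Cat(Star(a), e))  # a*bc*d | a*e
```

With this encoding, `reduce_query(R, "a")` returns R itself, `reduce_query(R, "b")` returns `Cat(Star(c), d)` (that is, c*d), labels c and d give `Empty()`, and label e gives `Eps()`, in agreement with Figure 3.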

Figure 3 lists the Reduce function results for each label of the regular expression R : a*bc*d | a*e.

We assume that each local site uses Algorithm 1 for path query processing. The LQP (Local Query Processing) function in Algorithm 1 takes two inputs: (i) a path query given as a regular expression and (ii) a node identifier. It applies the Reduce function to the label of each edge leaving the given node. If the result of the Reduce function is not ∅, the LQP function recursively calls itself on the child node with the reduced query. Figure 4 shows an example, where the path query a*bc*d is given to node 1 and the query result is {3, 4}.

Algorithm 1 Local query processing
    Visited ← ∅ {nodes already visited}
    Result ← ∅ {query result}
    LQP(R, Root(DB))

    function LQP(R, u)
    begin
        if R's start state ∈ R's final states then
            Result ← Result ∪ {u}
        if <R, u> ∈ Visited then return
        Visited ← Visited ∪ {<R, u>}
        for all u -a-> v do
            R2 = Reduce(R, a)
            if R2 ≠ ∅ then LQP(R2, v)
        end for
    end

Figure 4. An example of local query processing

- Distributed Query Processing

For distributed query processing, we assume that each site knows its output nodes only. Unlike local query processing, in distributed semi-structured data retrieval queries have to be transferred to several sites using query diffusion. Query diffusion means that, at each local site, if an edge to another site is reachable and the result of the Reduce function on that edge is not ∅, then the reduced query is transferred along that edge. The proposed query processing model for distributed semi-structured data is based on Algorithm 2. Algorithm 2 takes two inputs: (i) distributed semi-structured data and (ii) a regular path query.

Algorithm 2 Distributed query processing
    s : query originating site
    O_p : identifier for a node in site p
    R : regular expression

    {query message (s, O_p, R) arrives at site p}
    receive (s, O_p, R)
    evaluate (s, O_p, R)

    {query message (s, O_r, R') is computed}
    send (s, O_r, R') to site r

    {query processing result result_p is computed}
    send (result_p) to site s

Figure 5. An example of distributed query processing

If a site receives a query, the query is processed through the LQP algorithm at that site. When an edge to another site is found for which the result of the Reduce function is not ∅, a reduced query message is sent to that site. After the local query processing, the query result is sent to the query originating site. As shown here, the distributed query processing model is based on both query reduction and query diffusion. Figure 5 illustrates an example of query processing using this model.

4. Termination Detection

The query processing model we have proposed is applicable to the current web environment, which consists of HTML and XML documents. However, since the query originating site receives query results incrementally from several sites, it cannot detect when query processing has terminated. In this section, we propose an algorithm with which the query originating site detects when query processing is finished in our distributed query processing model. In addition, we propose an extended algorithm for the case in which the user aborts the query processing.

With the extended algorithm, the query originating site detects the termination of query processing when the user wants to abort it. Table 1 describes the notations used for termination detection in the normal and user abort cases.

Table 1. Notations

Name            Description
1, ..., N       site (s is the query originating site)
state_p         each site's state (active, passive)
parent_p        the site which sent the query to site p (parent_p = s, for p ≠ s)
log_p           the log of queries given to site p
result_p        the result for a given query in site p
n(query_p)      the number of (query-ack) messages to be received in site p
n(result_p)     the number of (result-ack) messages to be received in site p
n(query_p^r)    the number of (query-ack) messages to be received in site p for queries sent to site r

- Normal Case

For the query originating site to detect the termination of distributed query processing, each participating site p acts as follows.

• when query message (s, q, O_p, R)^2 arrives at site p
    receive (s, q, O_p, R)
    if (O_p, R) ∈ log_p then
        send (query-ack) to q
        return
    log_p ← log_p ∪ {(O_p, R)}
    if parent_p is not set then
        parent_p ← q
    else
        send (query-ack) to q
    state_p ← active
    evaluate (s, O_p, R)

• when (query-ack) arrives at site p
    n(query_p)--
    if n(query_p) = 0 and state_p = passive then
        send (query-ack) to parent_p

• when query processing result result_p is computed
    send (result_p) to site s
    n(result_p)++

• when (result-ack) arrives at site p
    receive (result-ack)
    n(result_p)--
    if n(result_p) = 0 then state_p ← passive
    if n(query_p) = 0 and state_p = passive then
        send (query-ack) to parent_p

• when query message (s, p, O_r, R') is computed
    send (s, p, O_r, R') to site r
    n(query_p)++

^2 (s, q, O_p, R) : s is the query originating site, q is the parent site which sent the query, O_p is the identifier of the node in site p, and R is the path query.

Figure 6 is an example of distributed query processing and termination detection. Queries are transferred from Site 1 to Site 2, from Site 2 to Site 3 and Site 4, and from Site 3 to Site 4. Each transferred query is processed and the query result is transferred to Site 1.
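The message flow just described can be traced with a small single-process simulation. This is only a sketch under several assumptions that are not stated in the paper: all messages travel through one FIFO queue, the query originating site acknowledges every result as soon as it arrives, local evaluation is replaced by a fixed forwarding table that mirrors the Figure 6 transfers, a site adopts a parent only when none is set and clears it after acknowledging that parent, and the query names (`q0`, `q12`, ...) are placeholders.

```python
from collections import deque

ORIGIN = 1
# Transfers from the Figure 6 description: site -> [(target site, reduced query)].
# The query names are placeholders for the reduced queries.
FORWARD = {1: [(2, "q12")], 2: [(3, "q23"), (4, "q24")], 3: [(4, "q34")], 4: []}


class Site:
    def __init__(self, sid):
        self.sid = sid
        self.state = "passive"
        self.parent = None
        self.log = set()      # queries already received (cycle prevention)
        self.n_query = 0      # outstanding (query-ack) messages
        self.n_result = 0     # outstanding (result-ack) messages


def simulate():
    sites = {sid: Site(sid) for sid in FORWARD}
    # Message format: (kind, sender, receiver, payload).
    queue = deque([("query", ORIGIN, ORIGIN, "q0")])
    trace = []
    terminated = False

    def notify_parent_if_done(site):
        nonlocal terminated
        if site.parent is None or site.n_query or site.state != "passive":
            return
        if site.sid == ORIGIN:
            terminated = True  # the originating site detects termination
        else:
            queue.append(("query-ack", site.sid, site.parent, None))
        site.parent = None     # assumption: disengage after acknowledging

    while queue:
        kind, src, dst, payload = queue.popleft()
        trace.append((kind, src, dst))
        site = sites[dst]
        if kind == "query":
            if payload in site.log:
                queue.append(("query-ack", dst, src, None))
                continue
            site.log.add(payload)
            if site.parent is None:
                site.parent = src
            else:
                queue.append(("query-ack", dst, src, None))  # immediate ack
            site.state = "active"
            for target, reduced in FORWARD[dst]:
                queue.append(("query", dst, target, reduced))
                site.n_query += 1
            queue.append(("result", dst, ORIGIN, None))
            site.n_result += 1
        elif kind == "result":
            queue.append(("result-ack", dst, src, None))
        elif kind == "result-ack":
            site.n_result -= 1
            if site.n_result == 0:
                site.state = "passive"
            notify_parent_if_done(site)
        elif kind == "query-ack":
            site.n_query -= 1
            notify_parent_if_done(site)
    return terminated, trace, sites
```

Calling `simulate()` returns a flag, the ordered message trace, and the final site states. In this run the last message processed is the acknowledgement from Site 2 to Site 1, after which the flag is set and every site is passive with both counters at zero.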
Site 3 and Site 4 notify Site 2 of the end of their query processing, and Site 2 notifies Site 1. Finally, Site 1 detects the termination of the distributed query processing.

Figure 6. An example of normal termination

For the query that Site 3 sent to Site 4, Site 4 immediately sends an acknowledgement to Site 3 because its parent variable was already set to Site 2. This immediate acknowledgement keeps the paths along which query acknowledgement messages are delivered in the form of a tree.

- Cycle Prevention

Let query R be delivered to node a of site A. While R is processed, the same query R can be delivered to node b of site B, and from there it can be delivered to node a of site A again. To prevent this kind of query cycling, a site must not process a query message that it has already processed or is currently processing. For this purpose, each site uses a log variable that stores received query messages. The log variables are managed as follows. After detecting the termination of query processing, the query originating site notifies all the sites which sent a query result of the termination. Then each site clears its log variable.

- User Abort Case

While the given query is in process, the user may want to abort it. To support termination detection in the case of a user-requested abort, we use the variable n(query_p^r) (in Table 1) and add the following to the previous algorithm: n(query_p^r) is increased when a query message to site r is computed in site p and decreased when the (query-ack) is delivered from site r. In addition, we add the following abort message handling algorithm.

• when (abort) arrives at site p
    if ... then return
    stop its evaluation
    state_p ← passive
    if n(query_p) = 0 then ...
    if n(result_p) = 0 then ...
    send (abort) to all sites r where n(query_p^r) ≠ 0
    n(query_p^r) = 0 for all r

- Correctness

Now we prove the correctness of the termination detection algorithms for the normal case and for the user abort case. In each case, we prove the following: (i) after the termination of distributed query processing, the query originating site returns to a passive state, and (ii) after the query originating site returns to a passive state, all distributed query processing has actually terminated. First, we define the termination of distributed query processing.

Definition 1 The distributed query processing is terminated when all the relevant sites are in passive states and there is no undelivered (query), (query-ack), (result), or (result-ack) message.
There may be (abort) messages in the network in the case of a user-requested abort. However, such a message never changes the state of a site to active.

Theorem 1 The distributed query processing for a regular path query is terminated if and only if the query originating site returns to a passive state.

[Proof] The theorem can be proved by adapting Dijkstra's diffusing computation method [3]. For details, see reference [4].

5. Conclusion

In this paper, we have proposed a query processing model for distributed semi-structured data retrieval. The proposed model is based on query reduction and query diffusion. Query reduction is used for local query processing and query diffusion is used for distributed query processing. In the model we assume that each site knows its output nodes only, which is practically applicable to the current web environment. In addition, we have proposed termination detection algorithms for the proposed model.

References

[1] S. Abiteboul, D. Quass, J. McHugh, J. Widom, and J. Wiener. The Lorel query language for semistructured data. International Journal on Digital Libraries, 1(1), April 1997.
[2] Peter Buneman, Susan Davidson, Gerd Hillebrand, and Dan Suciu. A query language and optimization techniques for unstructured data. In Proceedings of ACM SIGMOD Conference, 1996.
[3] Edsger W. Dijkstra and C. S. Scholten. Termination Detection for Diffusing Computations. Information Processing Letters, 11, 1980.
[4] J. Lee, Y. D. Chung, and M. H. Kim. A Path Query Processing Scheme for Distributed Semi-structured Data Retrieval. Journal of KISS - Databases, (revision).
[5] R. Goldman, S. Chawathe, A. Crespo, and J. McHugh. A Standard Textual Interchange Format for the Object Exchange Model (OEM). Technical report, Stanford University, 1996.
[6] J. McHugh, S. Abiteboul, R. Goldman, D. Quass, and J. Widom. Lore: A database management system for semistructured data. SIGMOD Record, 26(3), September 1997.
[7] Tova Milo and Dan Suciu. Index structures for path expressions. In International Conference on Database Theory, 1999.
[8] Dan Suciu. Query Decomposition and View Maintenance for Query Languages for Unstructured Data. In Proceedings of Very Large Databases Conference, 1996.