Query Containment in the Presence of Limited Access Patterns

Chen Li, Computer Science Department, Stanford University
Edward Chang, Electrical & Computer Engineering, University of California, Santa Barbara

Abstract

In information-integration systems, sources may have access pattern limitations, i.e., they require values for certain attributes before they return tuples. In this paper we study the following problem: given views with access pattern limitations, how do we test whether the maximal answer to a conjunctive query (CQ) is contained in that to another CQ? Since a datalog program is necessary to compute the maximal answer to a CQ, as shown in [9, 21], the containment appears undecidable. However, because these programs have a special form, their containment can be reduced to containment of monadic programs, which is known to be decidable [7]. We prove the decidability for both the source-centric approach and the query-centric approach to information integration [8]. The results can be extended to the case where the contained CQ and the containing CQ have different initial bindings. Our work complements the recent paper by Millstein, Levy, and Friedman [25]. In addition, deciding containment of monadic programs involves a complex algorithm. We develop a polynomial-time algorithm for testing boundedness of these programs [27], and show that when a program in the test is bounded, we can perform the containment test efficiently.

Keywords: information-integration systems, access pattern limitations, query containment, query equivalence, datalog programs.

1 Introduction

The goal of information-integration systems (e.g., [3, 12, 14, 15, 16, 17, 23, 24, 26]) is to support seamless access to heterogeneous data sources. In these systems, sources may have access pattern limitations (also called binding restrictions). For instance, many Web sources such as The IMDB [13] and Cinemachine [6] return movie information only if values are specified for certain attributes, such as movie title or star name. These sources do not accept queries such as "return all the movie information you know about." Given sources with access pattern limitations, Duschka and Levy [9] showed how to compute the maximal answer to a query by translating the query and the source limitations into a datalog program [31]. In a recent paper [21], we developed an algorithm for finding the relevant sources that need to be accessed to answer a query. In this paper we address the problem of whether the maximal answer to one query is contained in that to another query. The following is a motivating example.

EXAMPLE 1.1 Assume we have four source views as shown in Figure 1(a). Their schemas can be represented as a hypergraph [31] as shown in Figure 1(b), in which each node is an attribute and each hyperedge is a view schema. For example, source view $v_1$ has the information about studios and their movies, and source view $v_2$ has the information about movie awards and stars. The access limitation of each view is described as a binding pattern [31], in which each attribute is adorned as b (a value

must be specified for this attribute) or f (it can be free). For instance, the binding pattern bf of view $v_1$ says that every query using view $v_1$ must specify a studio name.

(a) View schemas with binding patterns

View schema                    Binding pattern
$v_1(Studio, Movie)$           bf
$v_2(Movie, Star, Award)$      bbf
$v_3(Movie, Star)$             bf
$v_4(Star, Addr)$              bf

(b) The hypergraph representation [drawing not reproduced: the nodes are the attributes Studio, Movie, Star, Award, and Addr, and each hyperedge is one view schema]

Figure 1: Four movie sources

Consider the following two conjunctive queries (CQs):

$Q_1$: $ans(A)$ :- $v_1(disney, M), v_2(M, S, W), v_4(S, A)$
$Q_2$: $ans(A)$ :- $v_1(disney, M), v_3(M, S), v_4(S, A)$

Both queries ask for the addresses of stars in Disney movies by taking the joins of different views. To answer $Q_2$, we can first send a query $v_1(disney, M)$ to $v_1$ to retrieve its Disney movies. For each of these movies $m$, we send query $v_3(m, S)$ to retrieve its stars. Then for each star $s$, we send query $v_4(s, A)$ to retrieve his or her addresses. The binding restrictions of these three views make the above plan executable.

When answering query $Q_1$, we cannot get any bindings for the Star attribute by using only the views $v_1$, $v_2$, and $v_4$ in $Q_1$, due to their binding restrictions. However, we can use view $v_3(Movie, Star)$ with the binding pattern bf to obtain some Star bindings, although $v_3$ is not in $Q_1$. In addition, in order to query view $v_3$, we can access view $v_1(Studio, Movie)$ to retrieve the necessary Movie bindings using disney as a studio name.

Although the two queries use different views, surprisingly, if all the available information about studios, movies, and stars is from the two queries and the four source views, every answer to query $Q_1$ that we can compute is contained in the answer to query $Q_2$. (We give the formal proof in Section 2.3.) Therefore, if a user submits a query $Q_1 \cup Q_2$, we can answer this query by just answering query $Q_2$, and thus save the queries to view $v_2$.
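The access plans of Example 1.1 can be simulated on a toy database. The sketch below is only an illustration under stated assumptions: the database contents, the function names (`retrievable`, `answer_q1`, `answer_q2`), and the naive fixpoint loop are hypothetical and not taken from the paper. A tuple is treated as retrievable once every bound position of its view holds a value that is already known for that attribute.

```python
# Binding patterns from Figure 1: "b" = value required, "f" = may be free.
VIEWS = {
    "v1": (("Studio", "Movie"), "bf"),
    "v2": (("Movie", "Star", "Award"), "bbf"),
    "v3": (("Movie", "Star"), "bf"),
    "v4": (("Star", "Addr"), "bf"),
}

def retrievable(db, initial):
    """Collect every tuple obtainable through legal source accesses (naive fixpoint)."""
    known = {attr: set(vals) for attr, vals in initial.items()}  # bindings per attribute
    got = {v: set() for v in VIEWS}                              # tuples fetched so far
    changed = True
    while changed:
        changed = False
        for v, (attrs, pattern) in VIEWS.items():
            bound = [i for i, p in enumerate(pattern) if p == "b"]
            for t in db.get(v, ()):
                if t in got[v]:
                    continue
                # A source query can return t only if all bound positions are known.
                if all(t[i] in known.get(attrs[i], ()) for i in bound):
                    got[v].add(t)
                    for attr, val in zip(attrs, t):
                        known.setdefault(attr, set()).add(val)
                    changed = True
    return got

def answer_q1(got):
    # Q1: ans(A) :- v1(disney, M), v2(M, S, W), v4(S, A)
    return {a for (st, m) in got["v1"] if st == "disney"
              for (m2, s, _w) in got["v2"] if m2 == m
              for (s2, a) in got["v4"] if s2 == s}

def answer_q2(got):
    # Q2: ans(A) :- v1(disney, M), v3(M, S), v4(S, A)
    return {a for (st, m) in got["v1"] if st == "disney"
              for (m2, s) in got["v3"] if m2 == m
              for (s2, a) in got["v4"] if s2 == s}

# Hypothetical database, invented for this illustration only.
db = {
    "v1": {("disney", "bambi"), ("disney", "dumbo"), ("fox", "alien")},
    "v2": {("bambi", "ann", "oscar"), ("dumbo", "bob", "emmy")},
    "v3": {("bambi", "ann"), ("dumbo", "dan")},
    "v4": {("ann", "la"), ("bob", "nyc"), ("dan", "rome")},
}
got = retrievable(db, {"Studio": {"disney"}})
a1, a2 = answer_q1(got), answer_q2(got)
print(a1, a2, a1 <= a2)
```

On this database the $v_2$ tuple for star bob is never retrieved, because no source supplies bob as a Star binding, so the computed answer to $Q_1$ stays inside the answer to $Q_2$.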
In this paper we study the following problem: given conjunctive queries on views with binding-pattern restrictions, is the maximal answer to one query contained in that to another query? The solution to this problem can help us avoid unnecessary source accesses, as shown in the example above. However, [9, 21] show that we may need a potentially recursive datalog program to compute the maximal answer to a CQ. Thus this containment problem seems undecidable, since containment of datalog programs is undecidable [30].

In Section 2 we prove that this containment problem is decidable by showing that the datalog program to compute the maximal answer to a CQ is monadic. That is, all its recursive predicates [31] are monadic (i.e., with arity one); its nonrecursive IDB predicates can have arbitrary arity. Therefore, our containment problem can be reduced to containment of monadic programs, which is known to be decidable [7].

The test for containment of monadic programs in [7] involves a complex algorithm (using tree-automata theory). Therefore, we are interested in the case where a program in the test is bounded, and

the containment can be tested efficiently using the algorithms in [4, 5, 29]. A datalog program is bounded if it is equivalent to a finite union of CQs [27]. [7] shows that boundedness is decidable for monadic datalog programs, although it is not decidable in general [11]. However, testing boundedness of monadic programs also involves a complex algorithm. In Section 3, we study a class of CQs, called connection queries [21]. We develop a polynomial-time algorithm for testing boundedness of datalog programs for connection queries.

There are two approaches to information integration [8]: the query-centric approach (in which user queries are in terms of views synthesized on source views), and the source-centric approach (in which user queries and source views are in terms of global views). In Sections 2 and 3 we take the query-centric approach. In Section 4 we extend our decidability result to the source-centric approach.

1.1 Related work

Recently Millstein, Levy, and Friedman [25] studied relative containment in the context of answering queries using views [18, 19]. The problem is to test query containment relative to source views with binding restrictions that are available to an information-integration system. Suppose $P_1$ (resp. $P_2$) is the program to compute the maximal answer to a (potentially recursive) query $Q_1$ (resp. $Q_2$). The authors show that, surprisingly, the program $P_1$ is contained in $P_2$ if and only if $P_1$ is contained in $Q_2$. The authors then prove that when $Q_2$ is a CQ, the relative containment is decidable using the results of [5]. Our paper complements the paper [25] as follows:

1. [25] uses the source-centric approach to information integration. In this paper we give the decidability result for both the query-centric approach and the source-centric approach.

2. The decidability proof in [25] is based on the assumption that the set of bindings for the contained query is a subset of the bindings for the containing query. We loosen this assumption by showing that containment is decidable even if the two CQs have different initial bindings. However, we assume that the contained query is a CQ, while in [25] the contained query can be a recursive datalog program.

Some other studies on answering queries in the presence of binding restrictions include: how to derive equivalent rewritings of CQs [28], how to optimize CQs [10, 22, 35], and how to test whether the complete answer to a query can be computed [20].

2 Relative containment in the query-centric approach

In this section we take the query-centric approach to information integration. We give the formal definition of query containment in the presence of binding restrictions, and prove that this containment is decidable.

2.1 Source views and queries

Let $v_1, \ldots, v_n$ be $n$ source views. Each view has a binding pattern representing the possible queries that the view can accept. In each binding pattern, an attribute is adorned as b (a value must be specified

for this attribute) or f (the attribute can be free). Let $V$ denote the views with their adornments ("source descriptions" for short). We consider conjunctive queries (CQs) in the form

$ans(\bar{X})$ :- $g_1(\bar{X}_1), \ldots, g_n(\bar{X}_n)$.

In each subgoal $g_i(\bar{X}_i)$, predicate $g_i$ is a view in $V$, and every argument in the subgoal is either a variable or a constant. We consider safe CQs, i.e., every variable in the head appears in the body.

2.2 The maximal answer to a query

Let $Q$ be a query on source descriptions $V$. For a database of $V$, the maximal answer to $Q$ is the answer we can compute if we retrieve as many tuples as possible from the views, using only the initial bindings in $Q$ and the bindings retrievable from $V$. It is known that a recursive datalog program may be necessary to compute the maximal answer to a query [9, 21], since we could access sources repeatedly to retrieve more bindings, use these bindings to retrieve more tuples, and then more bindings, and so on. We construct a datalog program, denoted $\Pi(Q, V)$, which can be evaluated on $V$ to compute the maximal answer to $Q$. In [21] we discuss how the program $\Pi(Q, V)$ is constructed for connection queries (see Section 3 for the definition of connection queries). Here we generalize the way of constructing $\Pi(Q, V)$ to any CQ $Q$.

$r_1$: $ans(A)$ :- $\hat{v}_1(disney, M), \hat{v}_2(M, S, W), \hat{v}_4(S, A)$
$r_2$: $\hat{v}_1(T, M)$ :- $studio(T), v_1(T, M)$
$r_3$: $movie(M)$ :- $studio(T), v_1(T, M)$
$r_4$: $\hat{v}_2(M, S, W)$ :- $movie(M), star(S), v_2(M, S, W)$
$r_5$: $award(W)$ :- $movie(M), star(S), v_2(M, S, W)$
$r_6$: $\hat{v}_3(M, S)$ :- $movie(M), v_3(M, S)$
$r_7$: $star(S)$ :- $movie(M), v_3(M, S)$
$r_8$: $\hat{v}_4(S, A)$ :- $star(S), v_4(S, A)$
$r_9$: $addr(A)$ :- $star(S), v_4(S, A)$
$r_{10}$: $studio(disney)$ :-

Figure 2: The program $\Pi(Q_1, V)$ for the query $Q_1$ in Example 1.1

Figure 2 shows the program $\Pi(Q_1, V)$ for the four views and the query $Q_1$ in Example 1.1. We use this program to show in general how to construct the program $\Pi(Q, V)$ for a CQ $Q$:

1. For each view $v_i \in V$, introduce an IDB predicate $\hat{v}_i$, called the $\alpha$-predicate of $v_i$, to store the tuples that can be retrieved from the view. For each domain $A$ of the attributes in $V$, introduce a domain predicate $dom_A$ to store the bindings for this domain that are retrievable from $Q$ and the source views. For instance, IDB predicates $\hat{v}_1, \ldots, \hat{v}_4$ in Figure 2 are the corresponding $\alpha$-predicates of the four source views; IDB predicates studio, movie, star, award, and addr are the domain predicates for the domains of the five corresponding attributes.

2. Replace each subgoal in the query $Q$ with the $\alpha$-predicate of the corresponding view. The new rule is called the connection rule of $Q$. For instance, query $Q_1$ is rewritten as rule $r_1$.

3. For each view $v_i$, write the following $\alpha$-rule and domain rules based on its binding pattern. Suppose $v_i$ has $m$ attributes, say $A_1, \ldots, A_m$, and the binding pattern of $v_i$ says that the arguments in positions $1, \ldots, p$ need to be bound, and the arguments in positions $p+1, \ldots, m$ can be free. The following rules are the $\alpha$-rule and domain rules of $v_i$:

$\alpha$-rule: $\hat{v}_i(A_1, \ldots, A_m)$ :- $dom_{A_1}(A_1), \ldots, dom_{A_p}(A_p), v_i(A_1, \ldots, A_m)$
domain rules: $dom_{A_k}(A_k)$ :- $dom_{A_1}(A_1), \ldots, dom_{A_p}(A_p), v_i(A_1, \ldots, A_m)$ $(k = p+1, \ldots, m)$

in which each $dom_{A_j}$ $(j = 1, \ldots, p)$ is the domain predicate for attribute $A_j$. For instance, rule $r_2$ in Figure 2 is the $\alpha$-rule of $v_1$, and $r_3$ is the domain rule.

4. For each binding $a$ for a domain $A$ that can be derived from $Q$, write a fact rule $dom_A(a)$ :- . For instance, $r_{10}$ in Figure 2 is a fact rule representing the fact that from query $Q_1$ we know disney is a studio name.

During the construction of the program $\Pi(Q, V)$, we make the following assumptions: (1) Each binding for an attribute must be from the domain of the attribute. (2) If a source view requires a value, say, a string, as a particular argument, we do not allow the "strategy" of trying all the possible strings for this argument to test the source, since this strategy will not terminate. (3) Each binding we use is obtained either from the user query or from a tuple returned by another source query. If we have more bindings, we can incorporate them into the program $\Pi(Q, V)$ by adding the corresponding fact rules. For instance, consider the program $\Pi(Q_1, V)$ in Figure 2. If we know that there is a movie titled "King Kong," we can add the following fact rule to the program $\Pi(Q_1, V)$: $movie(\text{'King Kong'})$ :- .

For any database of $V$, by evaluating the program $\Pi(Q, V)$ on the views $V$, we can obtain all the retrievable tuples from the source views, since every possible source query is captured by an evaluation of a rule in $\Pi(Q, V)$. Therefore, the program can compute the maximal answer to the query.

2.3 Problem definition

Definition 2.1 (relative containment) Given two CQs $Q_1$ and $Q_2$ on source descriptions $V$, we say $Q_1$ is contained in $Q_2$ relative to $V$, denoted $Q_1 \sqsubseteq_V Q_2$, if for any database of $V$, the maximal answer to $Q_1$ is contained in the maximal answer to $Q_2$; that is, $\Pi(Q_1, V) \subseteq \Pi(Q_2, V)$.

EXAMPLE 2.1 In Example 1.1, the programs $\Pi(Q_1, V)$ and $\Pi(Q_2, V)$ are equivalent to the following two programs $P_1$ and $P_2$, respectively.
$P_1$: $ans(A)$ :- $v_1(disney, M), v_3(M, S), v_2(M, S, W), v_4(S, A)$
$P_2$: $ans(A)$ :- $v_1(disney, M), v_3(M, S), v_4(S, A)$

For instance, we can simplify the program $\Pi(Q_1, V)$ as follows. Consider the subgoals $\hat{v}_1(disney, M)$, $\hat{v}_2(M, S, W)$, and $\hat{v}_4(S, A)$ in rule $r_1$. We substitute each of them by the body of the $\alpha$-rule ($r_2$, $r_4$, and $r_8$) of the corresponding view, with the necessary variable unification. After the substitutions, we can remove all the $\alpha$-rules. The new program is shown in Figure 3. We then substitute the domain predicates studio, movie, and star in rule $r_1'$ with the body of the corresponding domain rule. Then the rule $r_1'$ becomes query $P_1$. Similarly, program $\Pi(Q_2, V)$ can be simplified to $P_2$.

$r_1'$: $ans(A)$ :- $studio(disney), v_1(disney, M), movie(M), star(S), v_2(M, S, W), star(S), v_4(S, A)$
$r_3$: $movie(M)$ :- $studio(T), v_1(T, M)$
$r_5$: $award(W)$ :- $movie(M), star(S), v_2(M, S, W)$
$r_7$: $star(S)$ :- $movie(M), v_3(M, S)$
$r_9$: $addr(A)$ :- $star(S), v_4(S, A)$
$r_{10}$: $studio(disney)$ :-

Figure 3: An equivalent program for the program in Figure 2

Footnote 1 (on the equivalence of programs): Two datalog programs are equivalent if they produce the same ans facts for any database.
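Containment between two conjunctive queries such as $P_1$ and $P_2$ can be checked by searching for a containment mapping [4]. The brute-force sketch below is a hypothetical illustration, not the algorithm of [4]: the tuple encoding of queries, the name `contained_in`, and the convention that variables start with an upper-case letter are all assumptions made here. It tries every assignment of the variables of the containing query to terms of the contained query.

```python
from itertools import product

def is_var(term):
    # Assumption of this sketch: variables start with an upper-case letter.
    return term[0].isupper()

def contained_in(q1, q2):
    """True if CQ q1 is contained in CQ q2, i.e. a containment mapping q2 -> q1 exists."""
    head1, body1 = q1
    head2, body2 = q2
    vars2 = sorted({t for _, args in [head2] + body2 for t in args if is_var(t)})
    terms1 = sorted({t for _, args in [head1] + body1 for t in args})
    atoms1 = {(pred, tuple(args)) for pred, args in body1}
    for image in product(terms1, repeat=len(vars2)):
        h = dict(zip(vars2, image))

        def apply(args):
            # Constants map to themselves; variables follow h.
            return tuple(h.get(t, t) for t in args)

        head_ok = head2[0] == head1[0] and apply(head2[1]) == tuple(head1[1])
        if head_ok and all((pred, apply(args)) in atoms1 for pred, args in body2):
            return True
    return False

# P1 and P2 from Example 2.1, encoded as (head, body).
P1 = (("ans", ("A",)),
      [("v1", ("disney", "M")), ("v3", ("M", "S")),
       ("v2", ("M", "S", "W")), ("v4", ("S", "A"))])
P2 = (("ans", ("A",)),
      [("v1", ("disney", "M")), ("v3", ("M", "S")), ("v4", ("S", "A"))])

print(contained_in(P1, P2), contained_in(P2, P1))
```

The search is exponential in the number of variables of the containing query, which is acceptable for queries of this size; the identity mapping is among the candidates tried.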

Since the identity mapping on subgoals of $P_1$ and $P_2$ gives us a containment mapping [4] from $P_2$ to $P_1$, we have $P_1 \subseteq P_2$. Therefore, $\Pi(Q_1, V) \subseteq \Pi(Q_2, V)$, and $Q_1 \sqsubseteq_V Q_2$.

In general, our problem is: given two queries $Q_1$ and $Q_2$ on source views $V$ with binding restrictions, how do we test whether $Q_1 \sqsubseteq_V Q_2$?

2.4 Relative containment is decidable

Since the programs $\Pi(Q_1, V)$ and $\Pi(Q_2, V)$ can be recursive [9, 21], the relative containment seems undecidable [30]. In this section we prove that relative containment is decidable using the results on monadic programs. A datalog program is monadic if its recursive IDB predicates are monadic (the nonrecursive predicates can have arbitrary arity). An IDB predicate is nonrecursive if the predicate is not on any cycle in the dependency graph of the program [31]. Cosmadakis et al. [7] showed that containment of monadic programs is decidable (see footnote 2).

Lemma 2.1 The program $\Pi(Q, V)$ of a query $Q$ on source descriptions $V$ can be translated into an equivalent monadic datalog program.

Proof: Consider the rule in $\Pi(Q, V)$ with the ans predicate as the head. Each $\alpha$-predicate $\hat{v}_i$ in its body can be substituted by the body of the corresponding $\alpha$-rule of $v_i$, with the necessary variable unification. After the substitutions, remove the $\alpha$-rules from $\Pi(Q, V)$; the new program computes the same ans facts as $\Pi(Q, V)$ for any database. (For instance, the program $\Pi(Q_1, V)$ in Figure 2 can be rewritten to the equivalent program in Figure 3.) The IDB predicates of the new program include the ans predicate and the domain predicates. The predicate ans is not recursive, since it only appears in the head of the connection rule. All other IDB predicates, i.e., the domain predicates, are monadic. Therefore, the new program is monadic.

By Lemma 2.1 and the results in [7], we have:

Theorem 2.1 Relative containment with binding restrictions is decidable.

Notice that Theorem 2.1 is correct even if the two queries have different initial bindings. The reason is that we can incorporate their different initial bindings into their datalog programs by adding the corresponding fact rules, which are also monadic.

Footnote 2: The notion of nonrecursive predicates in [7] is slightly different from our definition above. In that paper nonrecursive predicates cannot depend on recursive predicates. That is, a predicate is nonrecursive if it either does not depend on another predicate, or it depends only on nonrecursive predicates. Thus, the program can be unfolded so that nonrecursive predicates do not depend on any other IDB predicate. However, its decidability result can be generalized to our stronger definition of nonrecursive predicates [34].

3 Testing program boundedness

[7] uses a complex algorithm involving automata theory to test containment of monadic programs. If one of the two programs in the test is bounded, the containment can be tested more efficiently using

the algorithms in [4, 5, 29]. A datalog program is bounded if it is equivalent to a finite union of CQs. For instance, the programs for the two queries in Example 1.1 are both bounded, because each of them can be rewritten to an equivalent CQ. In this section, we study the following problem: given a query $Q$ on source views $V$ with binding restrictions, how do we test the boundedness of $\Pi(Q, V)$? [7] also gives an algorithm for testing boundedness of monadic programs using automata theory, although boundedness of datalog programs is undecidable in general [11]. We develop a polynomial-time algorithm for testing boundedness of programs of a class of CQs, called connection queries.

3.1 Connection queries

Assume we have a set of global attributes, and different attributes are from different domains. Let $V$ be a set of views with binding restrictions. Each view schema is a subset of the global attributes. A connection query $Q$ is a natural join of a set of views (denoted $T(Q)$) with selections on some attributes (called the input attributes of $Q$, denoted $I(Q)$) and projections on some other attributes (called the output attributes of $Q$, denoted $O(Q)$). The set of views $T(Q)$ is also called the connection of $Q$. The user specifies values for the input attributes $I(Q)$, and is interested in the values of the output attributes $O(Q)$. (See [21] for details about connection queries.)

EXAMPLE 3.1 In Example 1.1, studio, movie, star, award, and addr are global attributes from different domains. Each of the four view schemas is a subset of these attributes. Queries $Q_1$ and $Q_2$ are two connection queries. Query $Q_1$ has a connection $T(Q_1) = \{v_1, v_2, v_4\}$, a set of input attributes $I(Q_1) = \{Studio\}$, and a set of output attributes $O(Q_1) = \{Addr\}$. Similarly, $T(Q_2) = \{v_1, v_3, v_4\}$, $I(Q_2) = \{Studio\}$, and $O(Q_2) = \{Addr\}$. Suppose we have two views $R(A, B, C)$ and $S(B, C, D)$. The following query

$ans(C)$ :- $R(A, A, C), S(X, C, C)$

is not a connection query, since it is not a natural join of the two views.

3.2 Boundedness of connection queries

Given a connection query $Q$ on views $V$ with binding restrictions, we say $Q$ is bounded if the datalog program $\Pi(Q, V)$ is bounded. For instance, the two queries in Example 1.1 are both bounded. The following connection query is unbounded.

View schema      Binding pattern
$v_1(A, B)$      fb
$v_2(B, C)$      bf
$v_3(B, D)$      fb
$v_4(B, D)$      bf
$v_5(D, E)$      ff

Figure 4: The source descriptions in Example 3.2 [hypergraph drawing over the attributes A, B, C, D, E not reproduced; the schemas and binding patterns it shows are listed above]

EXAMPLE 3.2 Consider the five source views in Figure 4, whose schemas are represented as a hypergraph. The five attributes have five different domains. Assume a user knows the value of $A$ is $a$, and wants to get the $C$ values by joining the views $v_1$ and $v_2$. The following is the corresponding query $Q$:

$ans(C)$ :- $v_1(a, B), v_2(B, C)$

That is, $T(Q) = \{v_1, v_2\}$, $I(Q) = \{A\}$, and $O(Q) = \{C\}$. Figure 5 shows the program $\Pi(Q, V)$, which is unbounded. Intuitively, since the binding pattern of $v_3(B, D)$ is fb, and the binding pattern of $v_4(B, D)$ is bf, we can visit these two source views repeatedly to retrieve more $B$ bindings. Each new $B$ binding may participate in $v_1 \bowtie v_2$ and generate more answers to $Q$. (We will give a formal proof of the unboundedness in Section 3.5.)

$r_1$: $ans(C)$ :- $\hat{v}_1(a, B), \hat{v}_2(B, C)$
$r_2$: $\hat{v}_1(A, B)$ :- $dom_B(B), v_1(A, B)$
$r_3$: $dom_A(A)$ :- $dom_B(B), v_1(A, B)$
$r_4$: $\hat{v}_2(B, C)$ :- $dom_B(B), v_2(B, C)$
$r_5$: $dom_C(C)$ :- $dom_B(B), v_2(B, C)$
$r_6$: $\hat{v}_3(B, D)$ :- $dom_D(D), v_3(B, D)$
$r_7$: $dom_B(B)$ :- $dom_D(D), v_3(B, D)$
$r_8$: $\hat{v}_4(B, D)$ :- $dom_B(B), v_4(B, D)$
$r_9$: $dom_D(D)$ :- $dom_B(B), v_4(B, D)$
$r_{10}$: $\hat{v}_5(D, E)$ :- $v_5(D, E)$
$r_{11}$: $dom_D(D)$ :- $v_5(D, E)$
$r_{12}$: $dom_E(E)$ :- $v_5(D, E)$
$r_{13}$: $dom_A(a)$ :-

Figure 5: The program $\Pi(Q, V)$ in Example 3.2

We want to solve the following problem: given a connection query $Q$ on views $V$ with binding restrictions, how do we test the boundedness of $Q$?

3.3 Forward-closure and independent connections

We first review some definitions in [21]. Given a source view $v_i$, let $B(v_i)$ and $F(v_i)$ respectively denote the bound attributes and free attributes in the binding pattern of $v_i$. Let $A(v_i) = B(v_i) \cup F(v_i)$ be all the attributes in $v_i$. Suppose $W$ is a set of source views in $V$, and let $A(W)$ denote the attributes in $W$. For instance, in Example 3.2, $B(v_1) = \{A\}$, $F(v_1) = \{B\}$, $A(v_1) = \{A, B\}$, and $A(\{v_1, v_2\}) = \{A, B, C\}$.

Given a set of source views $W \subseteq V$ and a set of attributes $X \subseteq A(V)$, the forward-closure of $X$ given $W$, denoted $\text{f-closure}(X, W)$, is the set of source views in $W$ such that, starting from the attributes in $X$ as the initial bindings, the binding requirements of these source views are satisfied by using only the source views in $W$. For instance, in Example 3.2, $\text{f-closure}(\{A\}, \{v_1, v_2\}) = \emptyset$, and $\text{f-closure}(\{B\}, \{v_1, v_2\}) = \{v_1, v_2\}$.

Let $V_q = \text{f-closure}(I(Q), V)$ be all the queryable source views, i.e., the source views that we may eventually query, starting with the initial bindings in $I(Q)$, and perhaps using several preliminary queries to other sources in order to obtain the necessary bindings for these source views. The nonqueryable source views in $V - V_q$ can be ignored without changing the maximal answer to $Q$, since we cannot retrieve any tuples from them. If there is a nonqueryable view in $T(Q)$, then the maximal answer to $Q$ is empty.

A query $Q$ is independent if its connection $T(Q)$ satisfies $\text{f-closure}(I(Q), T(Q)) = T(Q)$. For instance, the query $Q_2$ in Example 1.1 is independent, since $\text{f-closure}(I(Q_2), T(Q_2)) = \text{f-closure}(\{Studio\}, \{v_1, v_3, v_4\}) = T(Q_2)$. Query $Q_1$ is not independent, since $\text{f-closure}(I(Q_1), T(Q_1)) = \text{f-closure}(\{Studio\}, \{v_1, v_2, v_4\}) = \{v_1\} \neq T(Q_1)$. Similarly, the connection query in Example 3.2 is not independent.
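The forward-closure and the independence test can be sketched as a small fixpoint computation. The code below is an illustrative sketch only: the encoding of each view as a pair (bound attributes, free attributes) and the names `f_closure` and `is_independent` are assumptions made here, and the data are the four views of Example 1.1.

```python
# Views of Example 1.1 as (bound attributes, free attributes), read off Figure 1.
MOVIE_VIEWS = {
    "v1": ({"Studio"}, {"Movie"}),
    "v2": ({"Movie", "Star"}, {"Award"}),
    "v3": ({"Movie"}, {"Star"}),
    "v4": ({"Star"}, {"Addr"}),
}

def f_closure(x, w, views=MOVIE_VIEWS):
    """Views in w whose binding requirements can be met starting from attributes x,
    using only views in w to obtain further bindings."""
    bound_attrs = set(x)
    closure = set()
    changed = True
    while changed:
        changed = False
        for v in sorted(w):
            bound, free = views[v]
            if v not in closure and bound <= bound_attrs:
                closure.add(v)
                bound_attrs |= bound | free   # every attribute of v is now bound
                changed = True
    return closure

def is_independent(inputs, connection, views=MOVIE_VIEWS):
    """A query is independent if f-closure(I(Q), T(Q)) = T(Q)."""
    return f_closure(inputs, connection, views) == set(connection)

queryable = f_closure({"Studio"}, MOVIE_VIEWS)            # V_q for Example 1.1
q1_closure = f_closure({"Studio"}, {"v1", "v2", "v4"})    # connection of Q1
print(sorted(queryable), sorted(q1_closure),
      is_independent({"Studio"}, {"v1", "v3", "v4"}))
```

Each pass of the outer loop adds at least one view or stops, so the loop runs at most once per view in the given set.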

Theorem 3.1 If a connection query $Q$ on source views $V$ with binding restrictions is independent, then the program $\Pi(Q, V)$ is bounded.

Proof: See the appendix.

If a connection query $Q$ is not independent, then the program $\Pi(Q, V)$ may not be bounded, as shown in Example 3.2.

3.4 BF-chain, BF-loop, backward-closure, and kernel

A sequence of views $w_1, \ldots, w_k$ forms a BF-chain (bound-free chain) if for $i = 1, \ldots, k-1$, $F(w_i) \cap B(w_{i+1}) \neq \emptyset$. That is, for two adjacent views $w_i$ and $w_{i+1}$ in the BF-chain, $w_i$ can contribute some bindings to $w_{i+1}$. The source views $w_1$ and $w_k$ are the head and the tail of the BF-chain, respectively. A sequence of views forms a BF-loop if it forms a BF-chain, and the bound attributes of the head overlap with the free attributes of the tail (as shown in Figure 6). In Figure 4, $(v_3, v_4)$ forms a BF-loop, because $F(v_3) \cap B(v_4) = \{B\}$ and $F(v_4) \cap B(v_3) = \{D\}$.

Figure 6: A BF-loop [drawing not reproduced: views $w_1, w_2, \ldots, w_n$ arranged in a cycle, with the links between adjacent views labeled free and bound]

Suppose $A$ is an attribute in the queryable views $V_q$, i.e., $A \in A(V_q)$. The backward-closure of $A$, denoted $\text{b-closure}(A)$, is the set of queryable source views that can be backtracked from $A$ by following some BF-chain in reverse order, in which $A$ is a free attribute of the tail of the BF-chain. The backward-closure of a set of attributes $X \subseteq A(V_q)$, denoted $\text{b-closure}(X)$, is the union of the backward-closures of the attributes in $X$, i.e., $\text{b-closure}(X) = \bigcup_{A \in X} \text{b-closure}(A)$. For instance, in Example 3.2, $\text{b-closure}(B) = \{v_1, v_3, v_4, v_5\}$, and $\text{b-closure}(\{B, C\}) = \{v_1, v_2, v_3, v_4, v_5\}$.

Definition 3.1 (BF-graph) The BF-graph of a set of source views $W$ is a directed graph in which each vertex corresponds to a view in $W$, and there is an edge from vertex $v_i$ to vertex $v_j$ if and only if $F(v_i) \cap B(v_j) \neq \emptyset$.

Intuitively, there is an edge from vertex $v_i$ to vertex $v_j$ if view $v_i$ can provide some bindings for view $v_j$. Figures 7 and 8 show the BF-graphs of the source views in Example 1.1 and Example 3.2, respectively. For instance, in Figure 7, there is an edge from vertex $v_1$ to vertex $v_2$ because $F(v_1) \cap B(v_2) = \{Movie\} \neq \emptyset$. Clearly there is a BF-loop among a set of source views if and only if the BF-graph of these views is cyclic.
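The BF-graph, the backward-closure, and the BF-loop test translate directly into a few graph routines. The following is a minimal sketch under stated assumptions: views are encoded as (bound attributes, free attributes) pairs, the function names are hypothetical, and the caller is assumed to pass only queryable views. The data are the four views of Example 1.1 and the pair $(v_3, v_4)$ of Figure 4.

```python
MOVIE_VIEWS = {  # Example 1.1: name -> (bound attributes, free attributes)
    "v1": ({"Studio"}, {"Movie"}),
    "v2": ({"Movie", "Star"}, {"Award"}),
    "v3": ({"Movie"}, {"Star"}),
    "v4": ({"Star"}, {"Addr"}),
}
LOOP_VIEWS = {   # v3 (pattern fb) and v4 (pattern bf) from Figure 4
    "v3": ({"D"}, {"B"}),
    "v4": ({"B"}, {"D"}),
}

def bf_graph(views):
    """Edge vi -> vj iff F(vi) and B(vj) share an attribute (Definition 3.1)."""
    return {vi: {vj for vj, (bound_j, _) in views.items() if views[vi][1] & bound_j}
            for vi in views}

def b_closure(attr, views):
    """Views reached by walking BF-chains backwards from a view that has attr free."""
    graph = bf_graph(views)
    closure = {v for v, (_, free) in views.items() if attr in free}
    frontier = list(closure)
    while frontier:
        tail = frontier.pop()
        for v, successors in graph.items():
            if tail in successors and v not in closure:
                closure.add(v)
                frontier.append(v)
    return closure

def has_bf_loop(views):
    """Depth-first search for a cycle in the BF-graph."""
    graph = bf_graph(views)
    state = {}  # view -> "active" while on the DFS stack, "done" afterwards

    def visit(v):
        state[v] = "active"
        for w in graph[v]:
            if state.get(w) == "active" or (w not in state and visit(w)):
                return True
        state[v] = "done"
        return False

    return any(v not in state and visit(v) for v in graph)

print(sorted(b_closure("Star", MOVIE_VIEWS)),
      has_bf_loop(MOVIE_VIEWS), has_bf_loop(LOOP_VIEWS))
```

For the movie views the graph has the edge from `v1` to `v2` mentioned above and no cycle, while the two views taken from Figure 4 feed each other and form a loop.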

Figure 7: BF-graph for Example 1.1 [drawing not reproduced: vertices $v_1, v_2, v_3, v_4$]

Figure 8: BF-graph for Example 3.2 [drawing not reproduced: vertices $v_1, v_2, v_3, v_4, v_5$]

Definition 3.2 (kernel) Assume $Q$ is a connection query on source descriptions $V$. A set of attributes $K$ is a kernel of $Q$ if $\text{f-closure}(K \cup I(Q), T(Q)) = T(Q)$, and removing any attribute $A$ from $K$ gives $\text{f-closure}((K - \{A\}) \cup I(Q), T(Q)) \neq T(Q)$.

Intuitively, a kernel $K$ of a connection query $Q$ is a minimal set of attributes in $A(Q)$ such that, if the attributes in $K$ have been bound, together with the initial bindings in $I(Q)$, we can bind all the attributes $A(Q)$ by using only the source views in $T(Q)$. For instance, in Example 3.2, $\{B\}$ is the only kernel of the connection $\{v_1, v_2\}$. It is shown in [21] that a connection query may have multiple kernels, and all its kernels have the same backward-closure. In addition, we can compute the maximal answer to $Q$ using only the views in $\text{b-closure}(K) \cup T(Q)$, in which $K$ is a kernel of $Q$.

3.5 Testing boundedness of connection queries

Theorem 3.2 If $Q$ is a connection query on source descriptions $V$, and all the source views in $T(Q)$ are queryable, then $\Pi(Q, V)$ is bounded if and only if there is no BF-loop among the views in $\text{b-closure}(K)$, in which $K$ is a kernel of $Q$.

Proof: See the appendix. Intuitively, for the "only if" part, if there is a BF-loop among the views in $\text{b-closure}(K)$, we can populate the views in the loop, following the loop as many times as we like, such that only after a certain number $k$ of source accesses can we retrieve a tuple in the answer to the query, and $k$ can be arbitrarily large. Thus $\Pi(Q, V)$ is unbounded.

EXAMPLE 3.3 Consider the two connection queries in Example 1.1. Query $Q_1$ has one kernel $\{Star\}$, whose backward-closure is $\{v_1, v_3\}$. Clearly there is no BF-loop in $\{v_1, v_3\}$, thus $\Pi(Q_1, V)$ is bounded. Similarly, query $Q_2$ has one kernel, and there is no BF-loop in its backward-closure, thus $\Pi(Q_2, V)$ is also bounded. In Example 3.2, $\{B\}$ is the only kernel of the connection $\{v_1, v_2\}$, and the backward-closure of $\{B\}$ is $\{v_1, v_3, v_4, v_5\}$. Since there is a BF-loop, $(v_3, v_4)$, among these four views, by Theorem 3.2, this connection query is unbounded.

If a query $Q$ is independent, it has only one kernel, the empty set. Thus the backward-closure of this kernel is empty, and there is no BF-loop in the backward-closure. By Theorem 3.2, $\Pi(Q, V)$

is bounded, which is consistent with Theorem 3.1.

Based on Theorem 3.2, we give an algorithm called TestBoundedness for testing boundedness of connection queries, as shown in Figure 9.

Algorithm TestBoundedness: Test boundedness of connection queries
Input: $V$: source views with binding restrictions. $Q$: a connection query on $V$.
Output: Decision about the boundedness of $\Pi(Q, V)$.
Method:
(1) Compute the queryable views $V_q = \text{f-closure}(I(Q), V)$;
(2) If there is a view $v \in T(Q)$ that is not in $V_q$, then $\Pi(Q, V)$ is bounded; return;
(3) Compute a kernel $K$ of $Q$;
(4) Compute $\text{b-closure}(K)$;
(5) Build the BF-graph $G$ of $\text{b-closure}(K)$;
(6) Test the acyclicity of $G$. If $G$ is acyclic, then $\Pi(Q, V)$ is bounded; otherwise, $\Pi(Q, V)$ is unbounded.

Figure 9: Testing the boundedness of a connection

Let us analyze the complexity of the algorithm TestBoundedness. Assume $V$ has $n$ views, $T(Q)$ has $m$ views and $k$ attributes, and $\text{b-closure}(K)$ has $p$ views. [21] gives the details of how steps 1 to 4 are executed in $O(kn^2)$ time. We can test the cyclicity of the BF-graph $G$ using a depth-first search algorithm for directed graphs, as described in [1]. The complexity of deciding the cyclicity of a directed graph $G(V, E)$ is $O(|E|)$, where $E$ is the set of edges in graph $G(V, E)$. There are at most $p^2$ edges in the BF-graph, so step 5 can be done in $O(p^2)$ time, including the time of building the structure of the adjacent vertices for each vertex, as described in [1]. Step 6 can be done in $O(p^2)$ time using a depth-first search algorithm. Therefore, the complexity of the algorithm TestBoundedness is:

$O(kn^2) + O(p^2) + O(p^2) = O(kn^2)$

4 Extend the decidability result to the source-centric approach

In this section we extend the decidability result for the query-centric approach in Section 2 to the source-centric approach to information integration [8]. ([32] is a good survey on the differences between these two approaches.)

4.1 Notation in the source-centric approach

Let $Q$ be a CQ, and $V$ be a set of conjunctive source views with binding restrictions. Both $Q$ and $V$ are defined on some global predicates. A rewriting of a query $Q$ relative to $V$ is a datalog program $P$, such that the EDB predicates in $P$ are the views in $V$, and the expansion of $P$ is contained in the query $Q$. The expansion of $P$, denoted $P^{exp}$, is obtained from $P$ by replacing all source-view literals by their definitions. Existentially quantified variables in a source view are replaced by fresh variables in the expansion. The following example is borrowed from [9].

EXAMPLE 4.1 Assume parent, male, and female are three global predicates. The following two views $v_1$ and $v_2$ store the father and mother relation, respectively.

$v_1(X, Y)$ :- $parent(X, Y), male(X)$
$v_2(X, Y)$ :- $parent(X, Y), female(X)$

The following query asks for the grandparents of smith:

$ans(X)$ :- $parent(X, Z), parent(Z, smith)$

The following is a rewriting of the query:

$ans(X)$ :- $v_1(X, Z), v_1(Z, smith)$
$ans(X)$ :- $v_1(X, Z), v_2(Z, smith)$
$ans(X)$ :- $v_2(X, Z), v_1(Z, smith)$
$ans(X)$ :- $v_2(X, Z), v_2(Z, smith)$

A rewriting of a query $Q$ relative to a set of views $V$ is the maximally-contained rewriting of $Q$ relative to $V$ if it is not contained in any other rewriting of $Q$ relative to $V$. Let $P_1$ and $P_2$ be the maximally-contained rewritings of $Q_1$ and $Q_2$ relative to $V$, respectively. Query $Q_1$ is contained in $Q_2$ relative to $V$, denoted $Q_1 \sqsubseteq_V Q_2$, if for any database of the source views $V$, the answer computed by $P_1$ is a subset of that computed by $P_2$, i.e., $P_1 \subseteq P_2$ [25].

4.2 Decidability result

Given a set $V$ of views with binding restrictions and two CQs $Q_1$ and $Q_2$, our goal is to test whether $Q_1 \sqsubseteq_V Q_2$. We prove that this containment is decidable by showing that the maximally-contained rewriting of a query is also a monadic program. We give the proof in two steps:

1. If we do not consider the binding restrictions, then the maximally-contained rewriting is inherently nonrecursive, i.e., it is equivalent to a finite union of CQs.

2. Then we consider the binding restrictions by adding monadic rules, so the final maximally-contained rewriting is monadic.

In the first step, we want to know whether we should consider recursive datalog programs to find the maximally-contained rewriting of a CQ. [19] shows how to obtain an equivalent rewriting of a CQ in the space of unions of CQs. [8] shows how to get the maximally-contained rewriting of a query in the space of datalog programs, but does not show whether the rewriting is bounded or not. The following lemma shows that we do not need to consider recursive datalog programs to find the maximally-contained rewriting of a CQ, since the maximally-contained rewriting is equivalent to a finite union of CQs.
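The expansion step of Section 4.1 can be illustrated on Example 4.1. The sketch below is hypothetical: the tuple encoding, the name `expand`, and the `_F` prefix for fresh variables are assumptions made for this illustration. It replaces each view literal of one rewriting rule by the view definition, renaming existential variables (none occur in this example) to fresh ones.

```python
from itertools import count

# View definitions from Example 4.1: head variables, then a body over global predicates.
VIEW_DEFS = {
    "v1": (("X", "Y"), [("parent", ("X", "Y")), ("male", ("X",))]),
    "v2": (("X", "Y"), [("parent", ("X", "Y")), ("female", ("X",))]),
}

def expand(body, view_defs=VIEW_DEFS):
    """Replace every view literal in a rule body by the definition of the view."""
    fresh = count(1)
    expansion = []
    for pred, args in body:
        head_vars, view_body = view_defs[pred]
        sub = dict(zip(head_vars, args))          # head variable -> actual argument
        for p, terms in view_body:
            new_terms = []
            for t in terms:
                if t not in sub:
                    # Existential variables get fresh names; constants stay as they are.
                    sub[t] = f"_F{next(fresh)}" if t[0].isupper() else t
                new_terms.append(sub[t])
            expansion.append((p, tuple(new_terms)))
    return expansion

# Second rule of the rewriting: ans(X) :- v1(X, Z), v2(Z, smith)
rule_body = [("v1", ("X", "Z")), ("v2", ("Z", "smith"))]
expanded = expand(rule_body)

# Body of the query: ans(X) :- parent(X, Z), parent(Z, smith)
query_body = [("parent", ("X", "Z")), ("parent", ("Z", "smith"))]
print(expanded, all(atom in expanded for atom in query_body))
```

Here every subgoal of the query already appears in the expansion, so the identity mapping witnesses that the expansion of this rule is contained in the query.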
Lemma 4.1 In the source-centric approach, if the source views are conjunctive without binding restrictions, then the maximally-contained rewriting of a CQ is a bounded datalog program.

Proof: Suppose Q is a CQ, and V is a set of conjunctive views without binding restrictions. Using the inverse-rule algorithm in [8], we can obtain a maximally-contained rewriting $P_{DL}$, which is a datalog program. We now prove that $P_{DL}$ is inherently bounded. $P_{DL}$ can be expanded into a (possibly infinite) union of CQs. Each CQ $C_i$ in this union uses only views in V, and its expansion $C_i^{exp}$ is contained in Q. It is shown in [33] that there must be a conjunctive rewriting $C'_i$ of Q such that $C'_i$ has no more subgoals than Q and $C_i \sqsubseteq C'_i$. Since there is a finite number of conjunctive rewritings of Q with no more subgoals than Q, we can find a finite union $P_{UCQ}$ of CQs as a rewriting of Q such that $P_{DL} \sqsubseteq P_{UCQ}$. Since $P_{DL}$ is the maximally-contained rewriting, $P_{UCQ} \sqsubseteq P_{DL}$. Thus $P_{DL} = P_{UCQ}$; that is, the maximally-contained rewriting of Q in the space of datalog programs is a finite union of CQs.

Lemma 4.2 In the source-centric approach, if the source views are conjunctive with binding restrictions, then the maximally-contained rewriting of a CQ relative to the views is a monadic datalog program.

Proof: Let P be the maximally-contained rewriting of a CQ Q relative to V without binding restrictions. By Lemma 4.1, P is equivalent to a finite union of CQs, in which each EDB predicate is a view in V. [9] shows how to construct the maximally-contained rewriting of Q relative to V with binding restrictions in two steps: (1) Add a set of domain rules domain(V, Q) (see footnote 3). (2) For each rule r in P, insert a subgoal dom(X) before each subgoal g in r that has a variable X in an argument position that is required to be bound, where X does not appear in the subgoals to the left of g in the body. In the new program, the only recursive predicates are the domain predicates, which are monadic. Therefore, the program is monadic.

By Lemma 4.2 and the results in [7], we have:

Theorem 4.1 In the source-centric approach, query containment relative to views with binding restrictions is decidable.

Notice that the decidability result holds even if the two queries have different initial bindings, since we can incorporate their initial bindings into their corresponding maximally-contained rewritings by adding the necessary monadic rules.

5 Conclusion

In this paper we solved the following problem: given views with access pattern limitations, how to test whether the maximal answer to a conjunctive query (CQ) is contained in that to another CQ? We proved that the problem is decidable using the results on monadic programs. We gave the decidability results for both the source-centric approach and the query-centric approach to information integration.
We also developed a polynomial-time algorithm for testing boundedness of these programs, and showed that when a program in the containment test is bounded, we can perform the test efficiently.

Acknowledgments: We thank Jeff Ullman for his valuable comments and many discussions on this material. We thank Anand Rajaraman for helpful discussions on Lemma 4.1. We also thank Rada Chirkova for her helpful comments on this material.

Footnote 3: The domain rules in [8] are slightly different from the domain rules in Section 2.

References

[1] A. V. Aho, J. E. Hopcroft, and J. D. Ullman. Data Structures and Algorithms. Addison-Wesley Publishing Company.

[2] C. Beeri and R. Ramakrishnan. On the power of magic. In Proc. of ACM Symposium on Principles of Database Systems (PODS), pages 269–283.
[3] T. Catarci. Web-based information access. IFCIS International Conference on Cooperative Information Systems (CoopIS), pages 10–19.
[4] A. K. Chandra and P. M. Merlin. Optimal implementation of conjunctive queries in relational data bases. STOC, pages 77–90.
[5] S. Chaudhuri and M. Y. Vardi. On the equivalence of recursive and nonrecursive datalog programs. In Proc. of ACM Symposium on Principles of Database Systems (PODS), pages 55–66.
[6] Cinemachine.
[7] S. S. Cosmadakis, H. Gaifman, P. C. Kanellakis, and M. Y. Vardi. Decidable optimization problems for database logic programs. ACM Symposium on Theory of Computing (STOC), pages 477–490.
[8] O. M. Duschka. Query planning and optimization in information integration. Ph.D. Thesis, Computer Science Dept., Stanford Univ.
[9] O. M. Duschka and A. Y. Levy. Recursive plans for information gathering. Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence, IJCAI-97.
[10] D. Florescu, A. Levy, I. Manolescu, and D. Suciu. Query optimization in the presence of limited access patterns. In Proc. of ACM SIGMOD, pages 311–322.
[11] H. Gaifman, H. G. Mairson, Y. Sagiv, and M. Y. Vardi. Undecidable optimization problems for database logic programs. Journal of the ACM, pages 683–713.
[12] L. M. Haas, D. Kossmann, E. L. Wimmers, and J. Yang. Optimizing queries across diverse data sources. In Proc. of VLDB, pages 276–285.
[13] IMDB. The Internet Movie Database Ltd. Search Engine.
[14] Z. Ives, D. Florescu, M. Friedman, A. Levy, and D. Weld. An adaptive query execution engine for data integration. In Proc. of ACM SIGMOD, pages 299–310.
[15] V. Josifovski and T. Risch. Integrating heterogenous overlapping databases through object-oriented transformations. In Proc. of VLDB, pages 435–446.
[16] Z. Kedad and M. Bouzeghoub. Discovering view expressions from a multi-source information system. IFCIS International Conference on Cooperative Information Systems (CoopIS), pages 57–68.
[17] S. Kerr, A. Gal, and J. Mylopoulos. Information services for the web: Building and maintaining domain models. IFCIS International Conference on Cooperative Information Systems (CoopIS), pages 4–13.
[18] A. Y. Levy. Answering queries using views: A survey. In
[19] A. Y. Levy, A. O. Mendelzon, Y. Sagiv, and D. Srivastava. Answering queries using views. In Proc. of ACM Symposium on Principles of Database Systems (PODS), pages 95–104.
[20] C. Li. Computing complete answers to queries in the presence of limited access patterns (extended version). Technical report, Computer Science Dept., Stanford Univ.
[21] C. Li and E. Chang. Query planning with limited source capabilities. International Conference on Data Engineering (ICDE), pages 401–412.
[22] C. Li, R. Yerneni, V. Vassalos, H. Garcia-Molina, Y. Papakonstantinou, J. D. Ullman, and M. Valiveti. Capability based mediation in TSIMMIS. In Proc. of ACM SIGMOD, pages 564–566.
[23] L. Liu, C. Pu, and W. Han. XWRAP: An XML-enabled wrapper construction system for web information sources. International Conference on Data Engineering (ICDE), pages 611–621.
[24] R. J. Miller. Using schematically heterogeneous structures. In Proc. of ACM SIGMOD, pages 189–200.
[25] T. Millstein, A. Levy, and M. Friedman. Query containment for data integration systems. In Proc. of ACM Symposium on Principles of Database Systems (PODS).
[26] T. Milo and S. Zohar. Using schema matching to simplify heterogeneous data translation. In Proc. of VLDB, pages 122–133.
[27] J. F. Naughton and Y. Sagiv. A decidable class of bounded recursions. In Proc. of ACM Symposium on Principles of Database Systems (PODS), pages 227–236. ACM.
[28] A. Rajaraman, Y. Sagiv, and J. D. Ullman. Answering queries using templates with binding patterns. In Proc. of ACM Symposium on Principles of Database Systems (PODS), pages 105–112.
[29] Y. Sagiv and M. Yannakakis. Equivalences among relational expressions with the union and difference operators. Journal of the ACM, 27(4):633–655.
[30] O. Shmueli. Equivalence of datalog queries is undecidable. Journal of Logic Programming, 15(3):231–241.

[31] J. D. Ullman. Principles of Database and Knowledge-base Systems, Volume II: The New Technologies. Computer Science Press, New York.
[32] J. D. Ullman. Information integration using logical views. International Conference on Database Theory (ICDT), pages 19–40.
[33] J. D. Ullman. Lecture notes on principles of database systems.
[34] M. Vardi. Personal communication.
[35] R. Yerneni, C. Li, J. D. Ullman, and H. Garcia-Molina. Optimizing large join queries in mediation systems. International Conference on Database Theory (ICDT), pages 348–364.

A Appendix

A.1 Proof of Theorem 3.1

Proof: Assume $T(Q) = \{w_1, \ldots, w_k\}$ is the connection in the query Q on source descriptions V. Since f-closure$(I(Q), T(Q)) = T(Q)$, there exists a sequence of the views in connection $T(Q)$, say $w_{i_1}, \ldots, w_{i_k}$, that satisfies: (i) $B(w_{i_1}) \subseteq I(Q)$; (ii) for $j = 2, \ldots, k$, $B(w_{i_j}) \subseteq I(Q) \cup A(w_{i_1}) \cup \cdots \cup A(w_{i_{j-1}})$. For any database of V, we can compute the maximal answer to Q as follows. Compute the corresponding sequence of n supplementary relations [2, 31] $I_1, \ldots, I_n$, where $I_i$ is the supplementary relation after the first i subgoals have been processed. The supplementary relation $I_n$ is the answer to query Q. Therefore, we can compute the answer to the query using n + 1 applications of the rules in $\Pi(Q, V)$ (the last application is to evaluate the connection rule).

A.2 Proof of Theorem 3.2

Proof: If: Assume $T(Q) = \{w_1, \ldots, w_n\}$, and b-closure$(K) = \{v_1, \ldots, v_k\}$. Since there is no BF-loop among the views in b-closure$(K)$, there exists a BF-chain in b-closure$(K)$ with distinct views $v_{i_1}, \ldots, v_{i_k}$, such that the free attributes of each view $v_{i_j}$ do not overlap with the bound attributes of any previous source view. Starting with the initial bindings in Q and following the sequence, we use the views in this sequence to send source queries and retrieve all the possible bindings for the attributes in K.
With these bindings and the initial bindings in $I(Q)$, there exists a sequence of the views in $T(Q)$, say $w_{l_1}, \ldots, w_{l_n}$, such that the binding requirements of each view in the sequence can be satisfied by previous subgoals. We follow this sequence to send source queries, collect tuples from the sources in the connection, and evaluate the connection rule in $\Pi(Q, V)$ to compute the maximal answer to Q. Therefore, we can evaluate the rules in a finite number of steps to compute the maximal answer to Q, and the number is independent of the source relations. Thus $\Pi(Q, V)$ is bounded.

Only If: If there is a BF-loop among the views in b-closure$(K)$, we prove $\Pi(Q, V)$ is unbounded by showing that for any integer $k > 0$, there exists some database such that only after k applications of the rules in $\Pi(Q, V)$ can we compute a tuple in the maximal answer to Q. Since there is a BF-loop among b-closure$(K)$, there exists an attribute A in K such that there is a BF-loop among b-closure$(A)$. For any integer $k > 0$, there is a BF-chain $v_1, \ldots, v_k$ with length k, such that $A \in F(v_k)$. We can add tuples to the relations on the BF-chain, such that only by following the BF-chain can we retrieve a tuple in the answer to Q. In other words, we populate the relations in a BF-loop of the views in b-closure$(K)$ along the loop as many times as we want. By the way the database is constructed, we can only compute a tuple in the answer to Q after k applications of the rules in $\Pi(Q, V)$. Thus $\Pi(Q, V)$ is unbounded.
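Both proofs above rely on ordering the views so that the binding requirements of each view are satisfied by the initial bindings together with the attributes returned by earlier views, as in conditions (i) and (ii) of the proof of Theorem 3.1. The following Python sketch shows one way such an ordering could be computed greedily. It is an illustration only: the function name `executable_order`, the dictionary encoding of views, and the example views are hypothetical, and $A(w)$ is read here as the full attribute set of view w, which is an assumption about notation defined earlier in the paper.

```python
def executable_order(initial_bindings, views):
    """Order views so that each view's bound attributes are already available.

    `views` maps a view name w to a pair (B(w), A(w)): its bound attributes
    and, by assumption, all of its attributes. Views whose binding
    requirements can never be satisfied are left out of the result.
    """
    available = set(initial_bindings)   # plays the role of I(Q)
    remaining = dict(views)
    order = []
    progress = True
    while progress:
        progress = False
        for name in sorted(remaining):
            bound, attrs = remaining[name]
            if set(bound) <= available:
                # The view can be queried now; its attributes become available.
                order.append(name)
                available |= set(attrs)
                del remaining[name]
                progress = True
    return order

# Hypothetical source views, not taken from the paper.
EXAMPLE_VIEWS = {
    "w1": ({"B"}, {"B", "C"}),
    "w2": ({"A"}, {"A", "B"}),
    "w3": ({"D"}, {"D", "E"}),
}
order = executable_order({"A"}, EXAMPLE_VIEWS)
print(order)
```

With attribute A initially bound, view w2 can be queried first, which makes B available for w1, while w3 is never reachable because nothing supplies a value for D. When every view in the connection appears in the returned order, the sequence satisfies conditions (i) and (ii).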
