Efficiently Enumerating Results of Keyword Search


Benny Kimelfeld and Yehoshua Sagiv
The Selim and Rachel Benin School of Engineering and Computer Science, The Hebrew University of Jerusalem, Edmond J. Safra Campus, Jerusalem 91904, Israel

Abstract. Various approaches for keyword search have been explored in different settings, including databases, XML and the Web. It is shown that in many cases, systems that incorporate keyword search actually solve similar problems. This paper describes, for this type of problem, the first algorithms that are provably efficient, that is, run with polynomial delay. Specifically, algorithms for enumerating K-fragments are given, where a K-fragment is a subtree T of the given data graph, such that T contains all the keywords of K and no proper subtree of T has this property. Three types of K-fragments are considered: rooted, undirected and strong. For all three types, there are algorithms that enumerate all K-fragments with polynomial delay. For rooted K-fragments and acyclic data graphs, there is an algorithm that enumerates with polynomial delay in the order of increasing weight, assuming that K is of a fixed size.

1 Introduction

The advent of the World-Wide Web and the proliferation of search engines have transformed keyword search from a niche role to a major player in the information-technology field. Modern database languages should have both querying and searching capabilities. In recent years, different approaches for developing such capabilities have been investigated. An early example is keyword search in databases [8]. More recently, several papers [1, 3, 9, 10] proposed systems that support keyword search in relational databases.

Naturally, keyword search is highly relevant to XML. There are, however, two facets of XML. Data-centric XML is essentially semistructured data, and query languages (e.g., XQuery) are usually used for retrieving information.
Document-centric XML consists of large chunks of text, with XML tags that are used mostly for indicating the structure of documents rather than relationships among data items. Consequently, there are different approaches for handling keyword search in XML: some are aimed at data-centric XML, while others are tailored for document-centric XML. INEX [6] is an initiative that focuses on document-centric XML and evaluates retrieval techniques for different types of queries. One type is just a list of keywords, while another type consists of both keywords and structural conditions

that are written in a style similar to XPath. For both types of queries, a result is an element of a document (with all its descendant elements) rather than a whole document. XKeyword [11] is a tool that generates, from a given XML document, descriptive portions containing all the specified keywords. Query evaluation in XKeyword is based on the method that was developed in DISCOVER [10] for keyword search in relational databases. The approach of [4] is aimed at data-centric XML and its goal is to find semantic relationships among nodes of XML documents. Efficient solutions for tree documents are given in [4]. XSEarch [5] combines the approach of [4] with information-retrieval techniques. The notion of semantic relationships is generalized in [14] to graph documents (i.e., XML documents that may have ID references).

The above approaches consider different settings and use a variety of techniques. At the core, however, many of these approaches deal with similar graph problems. The goal of this paper is to clearly identify these graph problems and provide provably efficient algorithms (rather than heuristics) for solving them.

Essentially, in all of the above approaches, data are (or can be) represented as a graph that has two types of nodes: structural nodes and keyword nodes. For example, in the systems that implement keyword search in relational databases [1, 3, 9, 10], structural nodes represent tuples. Two tuples are connected by an edge if they can be joined on a foreign key. A tuple t and a keyword k are connected if t contains k.

A formal framework for keyword search in data graphs is presented in [15]. A key concept in this framework is reduced subtrees. Given a data graph G and a set of keywords K, a subtree T of G is reduced with respect to (abbr. w.r.t.) K if T contains the keywords of K, but no proper subtree of T contains all of these keywords. A K-fragment is a subtree of G that is reduced w.r.t. K. The results of a keyword search are K-fragments.
Actually, there are three types of K-fragments: rooted (i.e., directed), undirected and strong. A strong K-fragment is an undirected K-fragment, such that all its keyword nodes are leaves (and since it is reduced w.r.t. K, all its leaves are keywords of K). Note that in a directed data graph, keyword nodes do not have outgoing edges. Hence, in a rooted K-fragment, all keyword nodes must be leaves. In an undirected K-fragment, on the other hand, keyword nodes are not necessarily leaves.

In many of the approaches mentioned earlier, processing a keyword search is simply an enumeration of all K-fragments. This is also true for the information-unit approach [16] to searching the Web. Typically, results of a keyword search are either strong K-fragments [1, 9–11, 14, 16] or rooted K-fragments [3, 14]; however, they could also be undirected K-fragments [14]. Thus far, heuristics have been employed to solve the different variants of this enumeration problem. These heuristics may perform well in practice, but they either lack a clear upper bound or have an exponential upper bound, even if the number of results is small.

In this paper, we give efficient algorithms for enumerating K-fragments. Since the output of an enumeration algorithm can be exponential in the size of the input, we use the yardstick of enumeration with polynomial delay as an indication of efficiency. We show that all rooted, undirected or strong K-fragments can be enumerated with polynomial delay. We also consider the problem of enumerating by increasing weight. Specifically, we show that if the size of K is fixed, then all rooted K-fragments of an acyclic data graph can be enumerated by increasing weight with polynomial delay. Note that a known NP-complete problem [7] implies that this result can hold only if the size of K is assumed to be fixed. Making this assumption is realistic and in line with the notion of data complexity [17], which is commonly used for measuring the complexity of query evaluation.

In summary, the main contribution of this paper is in giving, for the first time, provably efficient algorithms for enumeration problems that need to be solved in many different settings of keyword search. These settings include relational databases, data-centric as well as document-centric XML, and the Web.

This paper is organized as follows. Section 2 defines basic concepts and notations. The notion of enumeration algorithms, their complexity measures, and threaded enumerators are discussed in Section 3. The algorithms are described in Sections 4, 5 and 6. We present a heuristic for sorted enumerations in Section 7. We conclude and discuss future work in Section 8. In Appendix A, we describe two algorithms that cannot be given in the paper itself due to a lack of space. In Appendices B and C, we give proofs of correctness for two of our algorithms. In Appendix D, we give a detailed complexity analysis for the first algorithm.

2 Preliminaries

2.1 Data Graphs

A data graph G consists of a set V(G) of nodes and a set E(G) of edges.
There are two types of nodes: structural nodes and keyword nodes (or keywords for short). $S(G)$ denotes the set of structural nodes and $K(G)$ denotes the set of keyword nodes. Unless explicitly stated otherwise, edges are directed, i.e., an edge is a pair $(n_1, n_2)$ of nodes. Keywords have only incoming edges, while structural nodes may have both incoming and outgoing edges. Hence, no edge can connect two keywords. These restrictions mean that $E(G) \subseteq S(G) \times V(G)$. The edges of a data graph $G$ may have weights. The weight function $w_G$ assigns a positive weight $w_G(e)$ to every edge $e \in E(G)$. The weight of the data graph $G$, denoted $w(G)$, is the sum of the weights of all the edges of $G$, i.e., $w(G) = \sum_{e \in E(G)} w_G(e)$. A data graph is rooted if it contains some node $r$, such that every node of $G$ is reachable from $r$ through a directed path. The node $r$ is called a root of $G$. (Note that a rooted data graph may have several roots.) A data graph is connected if its underlying undirected graph is connected.

As an example, consider the data graph $G_1$ depicted in Figure 1. (This data graph is a subgraph of the Mondial XML database.) In this graph, filled circles represent structural nodes and keywords are written in italic font. Note that the keyword Monarchy appears twice in this figure; however, in the actual data graph, the keyword Monarchy is represented by a single node that has two incoming edges. Also note that the structural nodes of $G_1$ have labels, but these are ignored in this paper. The data graph $G_1$ is rooted and the node labeled with continent is the only root.

Fig. 1. A data graph $G_1$

We use two types of data trees. A rooted tree is a rooted data graph, such that there is only one root and for every node $u$, there is a unique path from the root to $u$. An undirected tree is a connected data graph that contains no cycles, even when ignoring the directions of the edges.

We say that $G'$ is a subgraph of the data graph $G$, denoted $G' \subseteq G$, if $V(G') \subseteq V(G)$ and $E(G') \subseteq E(G)$. The weights of edges in $G'$ are the same as those in $G$. Rooted and undirected subtrees are special cases of subgraphs. For a data graph $G$ and a subset $U \subseteq V(G)$, we denote by $G - U$ the induced subgraph of $G$ that consists of the nodes of $V(G) \setminus U$ and all the edges of $G$ between these nodes. If $u \in V(G)$, then we may write $G - u$ instead of $G - \{u\}$. If $G_1$ and $G_2$ are subgraphs of $G$, we use $G_1 \cup G_2$ to denote the subgraph that consists of all the nodes and edges of both $G_1$ and $G_2$; that is, the graph $G'$ that satisfies $V(G') = V(G_1) \cup V(G_2)$ and $E(G') = E(G_1) \cup E(G_2)$. Given a data graph $G$, a subset $U \subseteq V(G)$ and an edge $e = (v, u) \in E(G)$, we use $U^{\pm e}$ to denote the set $(U \setminus \{u\}) \cup \{v\}$. Given two nodes $u$ and $v$ in a data graph $G$, we use $u \leadsto_G v$ to denote the fact that $v$ is reachable from $u$ through a directed path in $G$.

A rooted (respectively, undirected) subtree $T$ of a data graph $G$ is reduced w.r.t. a subset $U$ of the nodes of $G$ if $T$ contains $U$, but no proper rooted (respectively, undirected) subtree of $T$ contains $U$.
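The notation of this section maps onto a small data structure. The following is a minimal Python sketch and not part of the paper: the class name DataGraph, the methods minus, reaches and roots, the helper swap, and the toy graph at the bottom are hypothetical names chosen only for illustration. Edge weights are omitted, and a node is assumed to reach itself by the empty path.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataGraph:
    """A directed data graph (illustrative sketch, weights omitted)."""
    nodes: frozenset
    edges: frozenset  # directed pairs (n1, n2)

    def minus(self, removed):
        """G - U: the induced subgraph on the nodes of V(G) not in U."""
        keep = self.nodes - frozenset(removed)
        kept_edges = frozenset((a, b) for (a, b) in self.edges
                               if a in keep and b in keep)
        return DataGraph(keep, kept_edges)

    def reaches(self, u, v):
        """True if v is reachable from u through a directed path in G."""
        if u not in self.nodes or v not in self.nodes:
            return False
        stack, seen = [u], {u}
        while stack:
            x = stack.pop()
            if x == v:
                return True
            for (a, b) in self.edges:
                if a == x and b not in seen:
                    seen.add(b)
                    stack.append(b)
        return False

    def roots(self):
        """All nodes r from which every node of G is reachable."""
        return {r for r in self.nodes
                if all(self.reaches(r, n) for n in self.nodes)}


def swap(U, edge):
    """U^{±e} for e = (v, u): remove u from U and add v."""
    v, u = edge
    return (frozenset(U) - {u}) | {v}


# Hypothetical toy graph: one root, two structural nodes, two keywords.
g = DataGraph(
    frozenset({"continent", "c1", "c2", "Belgium", "Netherlands"}),
    frozenset({("continent", "c1"), ("continent", "c2"),
               ("c1", "Belgium"), ("c2", "Netherlands")}),
)
print(g.roots())
print(swap({"Belgium", "Netherlands"}, ("c1", "Belgium")))
```

Immutable sets are used so that $G - U$ and $U^{\pm e}$ return new values instead of modifying their arguments, which mirrors how the recursive algorithms of the later sections pass modified inputs to recursive calls.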

Fig. 2. Fragments of $G_1$ (panels $F_1$, $F_2$ and $F_3$)

2.2 Keyword Search

A query is simply a finite set K of keywords. Given a data graph G, a rooted K-fragment (abbr. RKF) is a rooted subtree of G that is reduced w.r.t. K. Similarly, an undirected K-fragment (abbr. UKF) is an undirected subtree of G that is reduced w.r.t. K. A strong K-fragment (abbr. SKF) is a UKF, such that all the keywords are leaves. Note that an RKF is also an SKF and an SKF is also a UKF. Figure 2 shows three K-fragments of $G_1$, where K is the query {Belgium, Netherlands}. $F_3$ is a UKF, $F_2$ is an SKF and $F_1$ is an RKF.

In some approaches to keyword search (e.g., [1, 9–11, 16]), the goal is to solve the SKF problem, that is, to enumerate all SKFs for a given K. In other approaches (e.g., [3]), the goal is to solve the RKF problem. The work of [14] also considers the UKF problem.

3 Enumeration Algorithms

3.1 Threaded Enumerators

In order to construct efficient enumeration algorithms, we employ threaded enumerators that enable one algorithm to use the elements enumerated by another algorithm (or even by itself, recursively) as soon as these elements are generated, rather than waiting for termination.

Formally, an enumeration algorithm $E$ generates, for a given input $x$, a sequence $E_1(x), \ldots, E_{n(x)}(x)$. Each element $E_i(x)$ is enumerated by the operation print(·). We say that $E(x)$ enumerates a set $S$ if $\{E_1(x), \ldots, E_{n(x)}(x)\} = S$ and $E_i(x) \neq E_j(x)$ for all $1 \le i < j \le n(x)$. Sometimes one enumeration algorithm $E$ uses another enumeration algorithm $E'$, or may even use itself recursively. An important property of an enumeration algorithm is the ability to start generating elements as soon as possible. This property is realized by enabling $E$ to use each element generated by $E'$ when that

element is created, rather than having to wait until $E'$ finishes its enumeration. In Java [2], for example, each enumeration algorithm can be implemented as a distinct thread. By using the wait and notify mechanisms, $E$ can stop $E'$ after every output and later resume $E'$ in order to generate the next output. Java threads are rather complex, since they can be executed concurrently. We need a simpler notion of threads, since concurrency is not essential; threads are needed for writing enumeration algorithms that realize the desired time complexity, even if the execution is serial. Next, we describe our notion of threads.

We write algorithms in pseudo code using threaded enumerators. A specific threaded enumerator TE is constructed by the command TE := new[E](x), where $E$ is some enumeration algorithm and $x$ is an input for $E$. The elements $E_1(x), \ldots, E_{n(x)}(x)$ are enumerated by repeatedly executing the command next[TE]. The ith execution of next[TE] generates the element $E_i(x)$ if $1 \le i \le n(x)$; otherwise, if $i > n(x)$, the null element, denoted $\bot$, is generated. We assume that $\bot$ is not an element in the output of $E(x)$. An enumeration algorithm $E$ may use a threaded enumerator recursively, i.e., a threaded enumerator for $E(x')$, where $x'$ is usually different from $x$. As an example, consider the pseudo code of the algorithm ReducedSubtrees, presented in Figure 4. In Line 21, a threaded enumerator is constructed for the algorithm RSExtensions (shown in Figure 5(a)). Line 18 is an example of a recursive construction of a threaded enumerator.

3.2 Measuring the Complexity of Enumeration Algorithms

Polynomial time complexity is not a suitable yardstick of efficiency when analyzing an enumeration algorithm, since the output size could be exponential in the input size. In [13], several definitions of efficiency for enumeration algorithms are discussed. The weakest definition is polynomial total time, that is, the running time is polynomial in the combined size of the input and the output.
Two stronger definitions consider the time that is needed for generating the ith element, after the first $i - 1$ elements have already been created. Incremental polynomial time means that the ith element is generated in time that is polynomial in the combined size of the input and the first $i - 1$ elements. The strongest definition is polynomial delay, that is, the ith element is generated in time that is polynomial only in the input size.

For characterizing space efficiency, we use two definitions. Note that the amount of space needed for writing the output is ignored; only the space used for storing intermediate results is measured. The usual definition is polynomial space, that is, the amount of space used by the algorithm is polynomial in the input size. Linearly incremental polynomial space means that the space needed for generating the first i elements is bounded by i times a polynomial in the input size. Note that an enumeration algorithm that runs with polynomial delay uses (at most) linearly incremental polynomial space.

All the algorithms in this paper, except for one version of the heuristic of Section 7, run with polynomial delay. The algorithms of the next two sections use polynomial space.
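The serial, non-concurrent threads of Section 3.1 behave much like generators in languages that support them. The following Python sketch is an illustration under that assumption and is not the paper's implementation: the class ThreadedEnumerator and the toy enumeration algorithm binary_strings are hypothetical, Python's yield stands in for print, and None stands in for the null element $\bot$.

```python
class ThreadedEnumerator:
    """Wraps an enumeration algorithm so that elements are pulled one at a time."""

    def __init__(self, algorithm, *args):
        # Corresponds to TE := new[E](x); nothing is computed yet.
        self._generator = algorithm(*args)

    def next(self):
        # Corresponds to next[TE]; None plays the role of the null element.
        return next(self._generator, None)


def binary_strings(n):
    """Toy enumeration algorithm: all binary strings of length n."""
    if n == 0:
        yield ""  # "print" the single element
        return
    # Recursive construction of a threaded enumerator for a smaller input.
    te = ThreadedEnumerator(binary_strings, n - 1)
    s = te.next()
    while s is not None:
        yield s + "0"
        yield s + "1"
        s = te.next()


te = ThreadedEnumerator(binary_strings, 2)
print(te.next())  # the first element is available before the rest is computed
```

Each call to next resumes the wrapped algorithm only until its next output, so the caller can use an element as soon as it is generated, and repeated calls after exhaustion keep returning the null value.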

Fig. 3. (a) A data graph $G_2$. (b) Extensions: (1) by a directed path, and (2) by a reduced subtree

4 Enumerating Rooted K-Fragments

4.1 The Algorithm

In this section, we describe an algorithm for enumerating RKFs. Our algorithm solves the more general problem of enumerating reduced subtrees. That is, given a data graph $G$ and a subset $U \subseteq V(G)$, the algorithm enumerates, with polynomial delay, the set $RS(G, U)$ of all rooted subtrees of $G$ that are reduced w.r.t. $U$. Hence, to solve the RKF problem, we execute the algorithm with $U = K$, where $K$ is the given set of keywords.

If $U$ has only two nodes, the enumeration is done by a rather straightforward algorithm, PairRS(G, u, v), that is described in Appendix A. The problem is more difficult for larger sets of nodes, because for some subsets $U' \subset U$, the set $RS(G, U')$ might be much larger than the set $RS(G, U)$. For example, for the graph $G_2$ of Figure 3(a), $RS(G_2, \{a, b, c\})$ has only one subtree, whereas the size of $RS(G_2, \{a, b\})$ is exponential in the size of $G_2$.

In the algorithm ReducedSubtrees(G, U) of Figure 4, every intermediate result, obtained from the recursive calls in Lines 11 and 18, can be extended into at least one distinct element of $RS(G, U)$. Thus, the complexity is not worse than polynomial total time. Next, we describe this algorithm in detail.

In Lines 1–3, the algorithm ReducedSubtrees(G, U) terminates after printing a single tree that has one node and no edges, if $|U| = 1$. In Lines 4–5, the algorithm terminates if $RS(G, U)$ is empty. Note that $RS(G, U) = \emptyset$ if and only if there is no node $w$ of $G$, such that all the nodes of $U$ are reachable from $w$. An arbitrary node $u \in U$ is chosen in Line 6 and if the test of Line 7 is true, then $u$ is a leaf in

every tree of $RS(G, U)$. If so, Line 9 iterates over all nodes $w$, such that $(w, u)$ is an edge of $G$ and $RS(G - u, U^{\pm(w,u)}) \neq \emptyset$. All the trees of $RS(G - u, U^{\pm(w,u)})$ are enumerated by the threaded enumerator that is constructed in Line 11. The edge $(w, u)$ is added to each of these trees and the result is printed in Line 14.

If the test of Line 7 is false, then Line 17 arbitrarily chooses a node $v \in U$ ($v \neq u$) that is reachable from $u$. All the trees of $RS(G, U \setminus \{v\})$ are enumerated starting at Line 18. Each of these trees can be extended to a tree of $RS(G, U)$ in two different ways, as illustrated in Figure 3(b). For each $T \in RS(G, U \setminus \{v\})$, all extensions $T'$ of $T$ are enumerated starting at Line 21 by calling RSExtensions(G, T, v). These extensions are printed in Line 24.

ReducedSubtrees(G, U)
 1: if $|U| = 1$ then
 2:     print($(U, \emptyset)$)
 3:     exit
 4: if $RS(G, U) = \emptyset$ then
 5:     exit
 6: choose an arbitrary node $u \in U$
 7: if $\forall v \in U \setminus \{u\}$, $u \not\leadsto_G v$ then
 8:     $W := \{w \mid (w, u)$ is an edge of $G$ and $RS(G - u, U^{\pm(w,u)}) \neq \emptyset\}$
 9:     for all $w \in W$ do
10:         $U_w := U^{\pm(w,u)}$
11:         TE := new[ReducedSubtrees]($G - u$, $U_w$)
12:         $T$ := next[TE]
13:         while $T \neq \bot$ do
14:             print($T \cup (w, u)$)
15:             $T$ := next[TE]
16: else
17:     let $v \in U$ be a node s.t. $u \neq v$ and $u \leadsto_G v$
18:     $TE_1$ := new[ReducedSubtrees]($G$, $U \setminus \{v\}$)
19:     $T$ := next[$TE_1$]
20:     while $T \neq \bot$ do
21:         $TE_2$ := new[RSExtensions]($G$, $T$, $v$)
22:         $T'$ := next[$TE_2$]
23:         while $T' \neq \bot$ do
24:             print($T'$)
25:             $T'$ := next[$TE_2$]
26:         $T$ := next[$TE_1$]

Fig. 4. Enumerating $RS(G, U)$

Next, we explain how RSExtensions(G, T, v) works. Given a node $v \in U$ and a subtree $T \in RS(G, U \setminus \{v\})$ having a root $r$, the algorithm RSExtensions(G, T, v) of Figure 5(a) enumerates all subtrees $T'$, such that $T'$ contains $T$ and $T' \in RS(G, U)$. In Lines 5–12, $T$ is extended by directed simple paths. Each path $P$ is from a node $u$ ($u \neq r$) of $T$ to $v$, and $u$ is the only node in both $P$ and $T$. These paths are enumerated by the algorithm

Paths(G, u, v) of Figure 5(b). The extensions of $T$ by these paths are printed in Line 11. In Lines 13–19, $T$ is extended by reduced subtrees $T'$ of $G$. Each $T'$ is reduced w.r.t. $\{r, v\}$ and $r$ is the only node in both $T$ and $T'$. Note that the root of the new tree is the root of $T'$. The trees $T'$ are enumerated by PairRS, which is described in Appendix A. The extensions of $T$ by these trees are printed in Line 18.

The following theorem shows the correctness of ReducedSubtrees, and its proof is given in Appendix B.

Theorem 1. Let $G$ be a data graph and $U$ be a subset of the nodes of $G$. The algorithm ReducedSubtrees(G, U) enumerates $RS(G, U)$.

Interestingly, the algorithm remains correct even if Line 6 and the test of Line 7 are ignored, and only the else part (i.e., Lines 17–26) is executed in all cases, where in Line 17 $v$ can be any node in $U$. However, the complexity is no longer polynomial total time, since the enumerator $TE_1$ may generate trees $T$ that cannot be extended by RSExtensions(G, T, v). For example, consider the graph $G_2$ of Figure 3(a) and let $U = \{a, b, c\}$. If we choose $v = c$, then all directed paths from $a$ to $b$ will be generated by $TE_1$. However, none of those paths can be extended to a subtree of $RS(G_2, U)$. If, on the other hand, only the then part (i.e., Lines 8–15) is executed, then the algorithm will not be correct.

4.2 Complexity Analysis

In this section, we show that the algorithm ReducedSubtrees enumerates with polynomial delay. We first discuss the complexity of enumeration algorithms in general. To prove that an enumeration algorithm enumerates with polynomial delay, we have to calculate the computation cost between successive print commands. Formally, let $E(x)$ enumerate the sequence $E_1(x), \ldots, E_n(x)$. For $1 < i \le n$, the ith interval starts immediately after the printing of $E_{i-1}(x)$ and ends with the printing of $E_i(x)$. The first interval starts at the beginning of the execution of $E(x)$ and ends with the printing of $E_1(x)$.
The $(n + 1)$st interval starts immediately after the printing of $E_n(x)$ and ends when the execution of $E(x)$ terminates. The ith delay of $E(x)$ is the execution cost of the ith interval. The cost of each command, other than a next command, is defined as usual. For a threaded enumerator TE of $E'(x')$, the cost of the ith execution of next[TE] is $1 + C$, where $C$ is the ith delay of $E'(x')$. (Note that this is a recursive definition.) The ith space usage of $E(x)$ is the amount of space used for printing the first i elements $E_1(x), \ldots, E_i(x)$. Note that the $(n + 1)$st space usage is equal to the total space used by $E(x)$ from start to finish.

It is not always easy to evaluate the ith delay directly, since an enumeration algorithm may recursively use threads of other enumeration algorithms, leading to complex recursive equations. So, we take a different approach. First, we evaluate the basic ith delay that is defined as the cost of the ith interval, assuming that each next command has a unit cost. Second, for each interval, we count the total number of next commands that are executed during that interval, including next commands of threaded enumerators that are created recursively. The ith

delay is the product of (upper bounds on) the basic ith delay and the number of next commands in the ith interval. In the algorithm ReducedSubtrees, for example, it is rather easy to see that for all threaded enumerators used during that algorithm, the basic ith delay is polynomial in the input size. It is more difficult to show that only a polynomial number of next commands are executed during each interval. Note that this would not be true if the algorithm created threaded enumerators that return empty results (e.g., by ignoring the test of either Line 4 of Figure 4 or Line 14 of Figure 5(a)). The complexity of ReducedSubtrees is summarized in the following theorem and the detailed analysis is given in Appendix D.

Theorem 2. Let $K$ be a query of size $k$ and $G$ be a data graph with $n$ nodes and $m$ edges. Consider the execution of ReducedSubtrees(G, K). Let $F_i$ denote the ith rooted K-fragment printed and $|F_i|$ denote its number of nodes. Then,
• The first delay is $O(mk|F_1|)$;
• For $i > 1$, the ith delay is $O(mk(|F_i| + |F_{i-1}|))$; and
• The ith space usage is $O(mn)$.

(a) RSExtensions(G, T, v)
 1: let $r$ be the root of $T$
 2: if $v \in V(T)$ then
 3:     print($T$)
 4:     exit
 5: for all $u \in V(T) \setminus \{r\}$ do
 6:     $\bar{G} := G - (V(T) \setminus \{u\})$
 7:     if $u \leadsto_{\bar{G}} v$ then
 8:         TE := new[Paths]($\bar{G}$, $u$, $v$)
 9:         $P$ := next[TE]
10:         while $P \neq \bot$ do
11:             print($T \cup P$)
12:             $P$ := next[TE]
13: $G_r := G - (V(T) \setminus \{r\})$
14: if $RS(G_r, \{r, v\}) \neq \emptyset$ then
15:     TE := new[PairRS]($G_r$, $r$, $v$)
16:     $T'$ := next[TE]
17:     while $T' \neq \bot$ do
18:         print($T \cup T'$)
19:         $T'$ := next[TE]

(b) Paths(G, u, v)
 1: if $u = v$ then
 2:     let $P$ be the path containing $u$ only
 3:     print($P$)
 4:     exit
 5: $W := \{w \mid (u, w)$ is an edge of $G$ and $w \leadsto_{G - u} v\}$
 6: for all $w \in W$ do
 7:     TE := new[Paths]($G - u$, $w$, $v$)
 8:     $P$ := next[TE]
 9:     while $P \neq \bot$ do
10:         print($P \cup (u, w)$)
11:         $P$ := next[TE]

Fig. 5. (a) Enumerating subtree extensions. (b) Enumerating simple directed paths
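The procedure Paths of Figure 5(b) is small enough to try out directly. The following Python sketch is an illustration under stated assumptions rather than the authors' implementation: the graph is a hypothetical adjacency dictionary, the subgraph $G - u$ is represented by a set of blocked nodes instead of a copied graph, a generator stands in for the threaded enumerator, and the function names reachable and paths are chosen here.

```python
def reachable(adj, blocked, src, dst):
    """True if dst is reachable from src without entering blocked nodes."""
    stack, seen = [src], {src}
    while stack:
        x = stack.pop()
        if x == dst:
            return True
        for y in adj.get(x, ()):
            if y not in blocked and y not in seen:
                seen.add(y)
                stack.append(y)
    return False


def paths(adj, u, v, blocked=frozenset()):
    """Yield every simple directed path from u to v as a list of nodes."""
    if u == v:
        yield [u]  # the path containing u only
        return
    without_u = blocked | {u}  # plays the role of G - u
    for w in adj.get(u, ()):
        # Recurse only into successors that can still reach v in G - u,
        # so that no recursive enumerator returns an empty result.
        if w not in without_u and reachable(adj, without_u, w, v):
            for p in paths(adj, w, v, without_u):
                yield [u] + p


# Hypothetical example graph; node "d" is a dead end.
adj = {"u": ["a", "b"], "a": ["b", "v"], "b": ["a", "v", "d"], "d": []}
for p in paths(adj, "u", "v"):
    print(" -> ".join(p))
```

The reachability test before each recursive call corresponds to the condition in Line 5 of Figure 5(b). Without it, the search could explore dead ends such as node "d" for a long time between two consecutive outputs.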

Corollary 1. The RKF problem can be solved with polynomial delay and polynomial space.

A simple optimization that can be applied to the algorithm is to first remove irrelevant nodes. A node $v$ is considered irrelevant if either no keyword of $K$ can be reached from $v$ or $v$ cannot be reached from any node $u$, such that all the keywords of $K$ are reachable from $u$. We implemented the algorithm ReducedSubtrees and tested it on the data graph of the Mondial XML document (ID references were replaced with edges). We found that the running time was usually improved by an order of magnitude due to this optimization. Also note that the space usage can be reduced to $O(m)$ by implementing the algorithm so that different threaded enumerators share data structures.

5 Enumerating Strong and Undirected K-Fragments

Enumerating SKFs is simpler than enumerating RKFs. It suffices to choose an arbitrary keyword $k \in K$ and recursively enumerate all the strong $(K \setminus \{k\})$-fragments. Each strong $(K \setminus \{k\})$-fragment $T$ is extended to a strong K-fragment by adding all simple undirected paths $P$, such that $P$ starts at some structural node $u$ of $T$, ends at $k$ and passes only through structural nodes that are not in $T$. These paths are enumerated by U-Paths(G, u, k), which is similar to Paths(G, u, v) and its description is omitted. The complete algorithm StrongFragments for enumerating SKFs is described in Appendix A. In order to enumerate UKFs, the algorithm StrongFragments should be modified so that the generated paths may include, between $u$ and $k$, both structural and keyword nodes that are not in $T$ (note that $u$ itself may also be a keyword).

Theorem 3. The SKF and UKF problems can be solved with polynomial delay and polynomial space.

The algorithms of this and the previous sections can be easily parallelized by assigning a processor to each threaded enumerator that executes a recursive call for a smaller set of nodes (in Line 18 of ReducedSubtrees and in Line 11 of StrongFragments).
The processor that does the recursive call sends the results to the processor that generated that call. The latter extends those results to fragments with one more keyword of K. Note that there is no point in assigning a processor to each threaded enumerator that is created in Line 11 of ReducedSubtrees, since the extension process in this case is very simple (i.e., adding just one edge).

6 Enumerating Rooted K-Fragments in Sorted Order

In this section, we present an efficient algorithm for enumerating RKFs by increasing weight, assuming that the query is of a fixed size and the data graph is acyclic. As in the unordered case, we solve this problem by solving the more

general problem of enumerating reduced subtrees by increasing weight. Thus, the input is an acyclic data graph and a subset of nodes. Note that a related, but simpler problem is that of enumerating the k shortest paths (e.g., [12]).

SortedRS(G, U)
 1: Initialize($|U|$)
 2: $i := 1$
 3: while $T[U, i] \neq \bot$ do
 4:     print($T[U, i]$)
 5:     $i := i + 1$
 6:     Generate($U$, $i$)

Initialize(K)
 1: for all subsets $W \subseteq V$, such that $1 \le |W| \le K$, in the $\le_s$ order do
 2:     $I[W] := 0$
 3:     $u := \max W$
 4:     for all edges $e = (v, u)$ in $G$ do
 5:         $N[W^{\pm e}, e] := 1$
 6:     Generate($W$, 1)

NextSubtree(W, e)
 1: $l := N[W^{\pm e}, e]$
 2: if $T[W^{\pm e}, l] \neq \bot$ then
 3:     return $T[W^{\pm e}, l] \cup e$
 4: else
 5:     return $\bot$

Generate(W, i)
 1: if $I[W] \ge i$ then
 2:     return
 3: if $|W| = 1$ then
 4:     $T[W, 1] := (W, \emptyset)$, $T[W, 2] := \bot$, $I[W] := 2$
 5:     return
 6: $u := \max W$
 7: if $u$ has no incoming edges in $G$ then
 8:     $T[W, 1] := \bot$, $I[W] := 1$, return
 9: let $e$ be an incoming edge of $u$, such that $w(\text{NextSubtree}(W, e))$ is minimal
10: $T[W, i]$ := NextSubtree($W$, $e$)
11: if NextSubtree($W$, $e$) $\neq \bot$ then
12:     Generate($W^{\pm e}$, $N[W^{\pm e}, e] + 1$)
13:     $N[W^{\pm e}, e] := N[W^{\pm e}, e] + 1$
14: $I[W] := i$

Fig. 6. Enumerating $RS(G, U)$ by increasing weight

We use $\preceq$ to denote a topological order on the nodes of $G$. The maximal element of a nonempty set $W$ is denoted as $\max W$. Given the input $G$ and $U$, the algorithm generates the reduced subtrees w.r.t. every set of nodes $W$, such that $|W| \le |U|$, and stores them in the array $T[W, i]$, where $T[W, 1]$ is the smallest, etc. Values are assigned to $T[W, i]$ in sorted order, and the array $I[W]$ stores the largest $i$, such that the subtree $T[W, i]$ has already been created. If $T[W, i] = \bot$ ($i \ge 1$), it means that the graph $G$ has $i - 1$ subtrees that are reduced w.r.t. $W$. Consider an edge $e$ entering $\max W$. A sorted sequence of reduced subtrees w.r.t. $W$ can be obtained by adding $e$ to each subtree $T[W^{\pm e}, i]$. Let $\{T[W^{\pm e}, i] \cup e\}$ denote this sequence. The complete sequence $\{T[W, i]\}$ is generated by merging all the sequences $\{T[W^{\pm e}, i] \cup e\}$ of edges $e$ that enter $\max W$.
We use $N[W^{\pm e}, e]$ to denote the smallest $j$, such that the subtree $T[W^{\pm e}, j] \cup e$ has not yet been merged into the sequence $T[W, i]$.
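The merging idea described above can be tried out with lazy generators. The following Python sketch is a simplification and not the algorithm of Figure 6: instead of the tables $T$, $I$ and $N$, it uses the standard-library function heapq.merge to merge the per-edge sequences lazily, so subproblems are recomputed and no claim about the delay is made. It assumes that nodes are integers numbered in a topological order ($v < u$ for every edge $(v, u)$); the function name sorted_reduced_subtrees and the example graph are hypothetical.

```python
import heapq


def sorted_reduced_subtrees(weighted_edges, targets):
    """Yield (weight, edge set) of the rooted subtrees reduced w.r.t. targets.

    Assumptions of this sketch: the graph is acyclic, nodes are integers
    numbered in topological order, and weighted_edges maps every edge
    (v, u) to a positive weight.
    """
    incoming = {}
    for (v, u), w in weighted_edges.items():
        incoming.setdefault(u, []).append((v, u, w))

    def add_edge(stream, edge, w):
        # Adding one fixed edge to every tree of a sorted stream keeps it sorted.
        for weight, tree in stream:
            yield weight + w, tree | {edge}

    def enum(W):
        if len(W) == 1:
            yield 0, frozenset()  # a single node and no edges
            return
        u = max(W)  # the maximal node of W in the topological order
        streams = [add_edge(enum((W - {u}) | {v}), (v, u), w)
                   for (v, u, w) in incoming.get(u, [])]
        # Lazily merge the sequences of all edges entering max W by weight.
        yield from heapq.merge(*streams, key=lambda item: item[0])

    return enum(frozenset(targets))


# Hypothetical acyclic graph with nodes 0..3 in topological order.
edges = {(0, 1): 1, (0, 2): 4, (0, 3): 4, (1, 2): 1, (1, 3): 1}
for weight, tree in sorted_reduced_subtrees(edges, {2, 3}):
    print(weight, sorted(tree))
```

Because the maximal node of $W$ has no outgoing edge to another node of $W$, it is a leaf of every reduced subtree, and each subtree arises from exactly one incoming edge. Merging the per-edge sequences therefore yields every reduced subtree exactly once.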

The algorithm is shown in Figure 6. Subtrees are assigned to $T[W, i]$ in Line 10 of Generate. It can be shown that $i = I[W] + 1$ whenever this line is reached. Let $e_1, \ldots, e_m$ be all the edges entering $\max W$. The reduced subtree w.r.t. $W$ that is assigned to $T[W, i]$ is chosen in Line 9 and is a minimal subtree among $T[W^{\pm e_1}, N[W^{\pm e_1}, e_1]] \cup e_1, \ldots, T[W^{\pm e_m}, N[W^{\pm e_m}, e_m]] \cup e_m$, which are obtained by calling NextSubtree. Clearly, all the subtrees $T[W^{\pm e_j}, N[W^{\pm e_j}, e_j]]$ ($1 \le j \le m$) should have been generated before $T[W, i]$. For that reason, if $T[W^{\pm e_k}, N[W^{\pm e_k}, e_k]] \cup e_k$ is the subtree that has just been assigned to $T[W, i]$, then in Line 12 the subtree $T[W^{\pm e_k}, N[W^{\pm e_k}, e_k] + 1]$ is generated. Note that $T[W^{\pm e_k}, N[W^{\pm e_k}, e_k] + 1] = \bot$ may hold after executing Line 12; it happens if $|RS(G, W^{\pm e_k})| < N[W^{\pm e_k}, e_k] + 1$. (Note that $w(\bot) = \infty$.) It is also possible that $T[W^{\pm e_k}, N[W^{\pm e_k}, e_k] + 1]$ may have already been created before executing Line 12; hence, the test in Line 1 of Generate.

The enumeration algorithm SortedRS(G, U) starts by calling the algorithm Initialize($|U|$) in order to compute $T[W, 1]$ for every nonempty subset $W$, such that $|W| \le |U|$. The loop in Line 1 of Initialize traverses the sets $W$ in the $\le_s$ order, where $W_1 \le_s W_2$ if $\max W_1 \preceq \max W_2$. After initialization, the subtrees $T[U, i]$ are generated in sorted order. The algorithm terminates when $T[U, i] = \bot$.

The following theorem states the correctness of SortedRS. The crux of the proof (given in Appendix C) is in showing that each of the arrays $T$, $I$, and $N$ holds the correct information described above.

Theorem 4. Let $G$ be an acyclic data graph and $U$ be a subset of the nodes of $G$. SortedRS(G, U) enumerates $RS(G, U)$ by increasing weight.

Theorem 5. Let $K$ be a query of size $k$ and $G$ be an acyclic data graph with $n$ nodes and $m$ edges. In the execution of SortedRS(G, K),
• The first delay is $O(mn^k)$;
• For $i > 1$, the ith delay is $O(m)$;
• The ith space usage is $O(n^{k+2} + in^2)$.

Corollary 2. If queries are of fixed size and data graphs are acyclic, then the sorted RKF problem can be solved with polynomial delay.

Note that in practice, for each set $W$, the array $T[W, i]$ should be implemented as a linked list and the array $N[W, e]$ should store pointers to that list. This does not change the running time and it limits the amount of space just to the size of the subtrees that are actually explored for $W$.

7 A Heuristic for Sorted Enumerations

Usually, the goal is enumeration by increasing weight. There are two approaches for achieving this goal. In [1, 10], the enumeration is by increasing weight, but the worst-case upper bound on the running time is (at best) exponential. In [3, 16], a heuristic approach is used to enumerate in an order that is likely to be close to the sorted order. Note that there is no guarantee by how much the actual

order may deviate from the sorted order. The upper bound on the running time is exponential [16] or not stated [3]. In comparison, the algorithms of Sections 4 and 5 imply that enumeration by increasing weight can be done in polynomial total time (even if the size of the query is unbounded) simply by first generating all the fragments and then sorting them. None of the current systems achieves this worst-case upper bound. Generating and then sorting would work well when there are not too many results.

Next, we outline a heuristic that runs with polynomial delay (even if the query is of unbounded size) and enumerates in an order that is likely to be close to the sorted order. The general idea is to apply the algorithms of Sections 4 and 5 in a neighborhood of the data graph around the keywords of K, starting with the neighborhood comprising just the keywords of K and then enlarging this neighborhood in stages. The heuristic for building the successive neighborhoods is based on assigning a cost $C(n)$ to each node $n$ and then adding the nodes, one at a time, in the order of increasing cost. $C(n)$ could be, for example, the sum of (or the maximal value among) the distances between $n$ and the keywords of K. Alternatively, $C(n)$ could be a number that is at most twice the weight of a minimal undirected subtree that contains all the keywords of K and $n$. Note that in either case, $C(n)$ can be computed efficiently.

For a given neighborhood, we should generate all K-fragments that contain $v$, where $v$ is the most recently added node. One way of doing that is by directly applying the algorithms of Sections 4 and 5 and printing only those K-fragments that contain the node $v$. This would result in an enumeration that runs in incremental polynomial time. To realize enumeration with polynomial delay, we should have algorithms that can enumerate, with polynomial delay, K-fragments that contain a given node $v \notin K$ (note that $v$ must be an interior node).
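The cost-based ordering of nodes described above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: distances are taken in an undirected, nonnegatively weighted view of the data graph (the text does not fix the distance notion), and the names `distances_from`, `node_costs` and `neighborhood_order` are hypothetical. Passing `combine=max` gives the "maximal value" variant of $C(n)$.

```python
import heapq
from typing import Callable, Dict, Iterable, List, Set, Tuple

# Assumed representation: undirected adjacency lists with edge weights.
Graph = Dict[str, List[Tuple[str, float]]]
INF = float("inf")


def distances_from(graph: Graph, source: str) -> Dict[str, float]:
    """Dijkstra's algorithm: shortest distance from source to each reachable node."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, x = heapq.heappop(heap)
        if d > dist[x]:
            continue  # stale heap entry
        for y, w in graph.get(x, []):
            nd = d + w
            if nd < dist.get(y, INF):
                dist[y] = nd
                heapq.heappush(heap, (nd, y))
    return dist


def node_costs(graph: Graph, keywords: Set[str],
               combine: Callable[[Iterable[float]], float] = sum) -> Dict[str, float]:
    """C(n) for every non-keyword node n: sum (or max) of distances to the keywords."""
    per_keyword = [distances_from(graph, k) for k in keywords]
    return {n: combine([d.get(n, INF) for d in per_keyword])
            for n in graph if n not in keywords}


def neighborhood_order(graph: Graph, keywords: Set[str],
                       combine: Callable[[Iterable[float]], float] = sum) -> List[str]:
    """Order in which nodes would be added to the growing neighborhood."""
    costs = node_costs(graph, keywords, combine)
    reachable = [n for n in costs if costs[n] < INF]
    return sorted(reachable, key=lambda n: (costs[n], n))


# Small example: k1 - a - k2 - b - c, with keywords k1 and k2.
example: Graph = {
    "k1": [("a", 1.0)],
    "a": [("k1", 1.0), ("k2", 1.0)],
    "k2": [("a", 1.0), ("b", 1.0)],
    "b": [("k2", 1.0), ("c", 2.0)],
    "c": [("b", 2.0)],
}
order = neighborhood_order(example, {"k1", "k2"})
```

Each prefix of `order`, together with the keywords, would form one of the successive neighborhoods; the last node of the prefix plays the role of the most recently added node $v$.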
We can show that such algorithms exist for enumerating SKFs and UKFs. For RKFs, we can show the existence of such an algorithm if the data graph is acyclic, and that for cyclic data graphs no such algorithm exists unless P=NP. The proof of these results is beyond the scope of this paper.

8 Conclusion and Future Work

We have given provably efficient algorithms for problems that occur in different settings of keyword search, including databases, data-centric XML as well as document-centric XML, and the Web. Ours are the first algorithms, for this type of problems, that run with polynomial delay (or even polynomial total time). We have also shown how our algorithms can lead to a heuristic, and we believe that this heuristic will outperform existing ones [1, 3, 10, 11, 16]. Experimentation with this heuristic, however, is beyond the scope of this paper.

The results of this paper can be extended in two ways. First, for queries of fixed size, all K-fragments can be enumerated with polynomial delay and in the order of increasing weight. This result holds for all three types of K-fragments (i.e., RKFs, SKFs and UKFs), but for RKFs the polynomial delay is not as good

as the polynomial delay of the algorithm SortedRS of Section 6, which works under the additional assumption that the data graph is acyclic. The second extension is a formal definition of enumeration in an approximate order, as well as algorithms for enumerating all three types of K-fragments in an approximate order and with polynomial delay, even if queries are of unbounded size. Note that the heuristic of Section 7 does not satisfy the notion of an approximate order, but it has a better polynomial delay. These extensions are summarized in [15] and will be described in detail in a future paper. Additional future work includes the development of indices and other optimizations that would enhance the efficiency of our algorithms.

References

1. S. Agrawal, S. Chaudhuri, and G. Das. DBXplorer: enabling keyword search over relational databases. In SIGMOD Conference, page 627.
2. K. Arnold, J. Gosling, and D. Holmes. The Java Programming Language. Addison-Wesley Longman Publishing Co., Inc.
3. G. Bhalotia, A. Hulgeri, C. Nakhe, S. Chakrabarti, and S. Sudarshan. Keyword searching and browsing in databases using BANKS. In ICDE.
4. S. Cohen, Y. Kanza, and Y. Sagiv. Generating relations from XML documents. In ICDT.
5. S. Cohen, J. Mamou, Y. Kanza, and Y. Sagiv. XSEarch: A semantic search engine for XML. In VLDB, pages 45-56.
6. N. Fuhr, M. Lalmas, and S. Malik, editors. INitiative for the Evaluation of XML Retrieval (INEX). Proceedings of the Second INEX Workshop.
7. M. R. Garey, R. L. Graham, and D. S. Johnson. The complexity of computing Steiner minimal trees. SIAM Journal on Applied Mathematics, 32.
8. R. Goldman, N. Shivakumar, S. Venkatasubramanian, and H. Garcia-Molina. Proximity search in databases. In VLDB, pages 26-37.
9. V. Hristidis, L. Gravano, and Y. Papakonstantinou. Efficient IR-style keyword search over relational databases. In HDMS.
10. V. Hristidis and Y. Papakonstantinou. DISCOVER: Keyword search in relational databases. In VLDB.
11. V. Hristidis, Y. Papakonstantinou, and A. Balmin. Keyword proximity search on XML graphs. In ICDE.
12. V. M. Jiménez and A. Marzal. Computing the K shortest paths: A new algorithm and an experimental comparison. In Algorithm Engineering, pages 15-29.
13. D. S. Johnson, M. Yannakakis, and C. H. Papadimitriou. On generating all maximal independent sets. Information Processing Letters, 27, March.
14. B. Kimelfeld. Interconnection semantics for XML. Master's thesis, The Hebrew University of Jerusalem.
15. B. Kimelfeld and Y. Sagiv. Efficient engines for keyword proximity search. In WebDB.
16. W.-S. Li, K. S. Candan, Q. Vu, and D. Agrawal. Retrieving and organizing web pages by information unit. In WWW.
17. M. Y. Vardi. The complexity of relational query languages (extended abstract). In STOC.

A Additional Algorithms

In this appendix, we describe two algorithms. The algorithm PairRS(G, u, v) of Figure 7 enumerates all rooted subtrees of a data graph G that are reduced w.r.t. $\{u, v\}$, where u and v are two given nodes of G. The algorithm StrongFragments(G, K) of Figure 8 enumerates all SKFs for a data graph G and a set of keywords K.

PairRS(G, u, v)
 1: if $u = v$ then
 2:     print($(\{u\}, \emptyset)$)
 3:     exit
 4: if $u \leadsto_G v$ then
 5:     TE := new[Paths](G, u, v)
 6:     T := next[TE]
 7:     while $T \neq \bot$ do
 8:         print(T)
 9:         T := next[TE]
10: let W be the set of nodes w s.t. (w, u) is an edge of G and w and v are reachable from a common node in $G - u$
11: for all $w \in W$ do
12:     TE := new[PairRS]($G - u$, w, v)
13:     T := next[TE]
14:     while $T \neq \bot$ do
15:         print($T \cup (w, u)$)
16:         T := next[TE]

Fig. 7. An algorithm for enumerating $RS(G, \{u, v\})$.

B Proof of Theorem 1 (Correctness of ReducedSubtrees)

Lemma 1 (correctness of RSExtensions). Suppose that T is a subtree of G that is reduced w.r.t. W and let v be a node of G. RSExtensions(G, T, v) enumerates all subtrees $\hat{T}$, such that $T \subseteq \hat{T} \subseteq G$ and $\hat{T}$ is reduced w.r.t. $U = W \cup \{v\}$.

Proof. Let $RS_T(G, U)$ be the set of trees in $RS(G, U)$ that have T as a subtree. Let $T_1, \ldots, T_k$ be the subtrees printed by RSExtensions(G, T, v). The lemma follows from the following claims.

Claim 1. $T_i \neq T_j$ for all $1 \le i < j \le k$.

Claim 2. $\{T_1, \ldots, T_k\} \subseteq RS_T(G, U)$.

StrongFragments(G, K)
 1: if G does not contain a strong K-fragment then
 2:     exit
 3: if $|K| = 1$ then
 4:     print($(K, \emptyset)$)
 5:     exit
 6: if $|K| = 2$ then
 7:     let $G_K$ be obtained from G by removing all keywords not in K
 8:     U-Paths($G_K$, K)
 9:     exit
10: choose an arbitrary keyword $k \in K$
11: $TE_t$ := new[StrongFragments](G, $K \setminus \{k\}$)
12: T := next[$TE_t$]
13: while $T \neq \bot$ do
14:     for all $u \in S(T)$ do
15:         $\bar{G} := (G - (S(T) \setminus \{u\})) - (K(G) \setminus \{k\})$
16:         $TE_p$ := new[U-Paths]($\bar{G}$, u, k)
17:         P := next[$TE_p$]
18:         while $P \neq \bot$ do
19:             print($T \cup P$)
20:             P := next[$TE_p$]
21:     T := next[$TE_t$]

Fig. 8. Enumerating strong K-fragments.

Claim 3. $RS_T(G, U) \subseteq \{T_1, \ldots, T_k\}$.

Claims 1 and 2 are rather straightforward. We will only prove Claim 3. Suppose that $\hat{T} \in RS_T(G, U)$. We will show that $\hat{T}$ is generated by the algorithm. Consider the undirected tree obtained by ignoring the directions of the edges of $\hat{T}$. Let r be the root of T and let u be a node of T that is closest to v in this undirected tree. We consider three cases.

Case 1: $u = v$, i.e., v is a node of T and so T is reduced w.r.t. U. In this case, T itself is the only tree in $RS_T(G, U)$. This tree is printed in Line 3 of RSExtensions(G, T, v).

Case 2: $r \neq u \neq v$. By the choice of u, there is an undirected path between u and v in the undirected tree, and u is the only node of T on that path. Thus, v must be a leaf of $\hat{T}$ that is reachable from u by the directed path P, which is obtained from the undirected path by directing the edges as in $\hat{T}$; otherwise, some node of the path would have in-degree two in $\hat{T}$. Therefore, $\hat{T}$ can be obtained by concatenating T and P. Node u is chosen in some iteration of the loop of Line 5. By the correctness of Paths, P is returned by the threaded enumerator TE that is created in Line 8. Hence, $\hat{T} = T \cup P$ is eventually printed in Line 11.

Case 3: $r = u \neq v$. Let $T'$ be the subtree of $\hat{T}$ that is reduced w.r.t. $\{r, v\}$. The concatenation $T \cup T'$ is a subtree of $\hat{T}$ that includes all the nodes of U, and hence, must be $\hat{T}$ itself. Note that r is the only node that $T$ and $T'$ have

in common. By the correctness of PairRS, it follows that the subtree of $\hat{T}$ that is reduced w.r.t. $\{r, v\}$ is returned by the threaded enumerator TE that is created in Line 15. Hence, $\hat{T}$ is eventually printed in Line 18. We conclude that $\hat{T}$ is generated by the algorithm, as claimed.

Proof of Theorem 1. The proof follows from the next three claims, which are proved by induction on $|V(G)| + |U|$.

Claim 1. $R(G, U) \subseteq RS(G, U)$, where $R(G, U)$ is the set of subtrees generated by ReducedSubtrees(G, U).

Claim 2. $RS(G, U) \subseteq R(G, U)$, where $R(G, U)$ is the same as above.

Claim 3. No tree is generated twice by the algorithm.

For the basis of the induction, suppose that $|U| = 1$; this includes the case where $|V(G)| = 1$. Thus, $RS(G, U)$ has a single tree, consisting of the only node in U. ReducedSubtrees(G, U) prints this tree in Line 2 and terminates. Hence, all three claims hold when $|U| = 1$.

For the inductive step of Claim 1, suppose that $|U| \ge 2$. Note that all the trees generated by ReducedSubtrees(G, U) are printed either in Line 14 or in Line 24, depending on the test of Line 7. If this test is true, then the threaded enumerator TE is initialized in Line 11 and, by the inductive hypothesis, it generates reduced subtrees T of $G - u$ w.r.t. $U^{\pm(w,u)}$. Thus, each T is a reduced subtree of G w.r.t. $U^{\pm(w,u)}$, such that T contains w and does not contain u. Therefore, adding the edge (w, u) to T results in a reduced subtree of G w.r.t. U. If the test of Line 7 is false, then by the inductive hypothesis, the threaded enumerator $TE_1$ (initialized in Line 18) generates reduced subtrees T of G w.r.t. $U \setminus \{v\}$. By Lemma 1, the threaded enumerator $TE_2$ (initialized in Line 21) extends T to reduced subtrees of G w.r.t. U.

For the inductive step of Claim 2, suppose that $|U| \ge 2$ and let $T \in RS(G, U)$. Thus, neither the test of Line 1 nor the test in Line 4 is true, and the algorithm proceeds by choosing a node u in Line 6. To show that T is printed by the algorithm, we consider two cases depending on the test in Line 7.
If this test is true, then u must be a leaf of T. Let w be the parent of u in T and let $T_u$ be obtained by removing the edge (w, u) from T. The tree $T_u$ is a reduced subtree of $G - u$ w.r.t. $U^{\pm(w,u)}$. By the inductive hypothesis, $T_u$ is generated in either Line 12 or 15 and hence, the subtree T is printed in Line 14. If the test in Line 7 is false, then let v be the node chosen in Line 17. Let $T'$ be the subtree of T that is reduced w.r.t. $U \setminus \{v\}$ (note that $T'$ may contain v). By the inductive hypothesis, the threaded enumerator $TE_1$ (which is initialized in Line 18) generates $T'$. By Lemma 1, the subtree T is generated by the threaded enumerator $TE_2$ and printed in Line 24.

To prove the inductive step of Claim 3, recall that the algorithm prints all subtrees in either Line 14 or Line 24. Consider two subtrees $T_1$ and $T_2$ that are printed by the algorithm. If they are printed in Line 14, then either each one has a different node as the parent of u or they are different by the inductive

hypothesis. If they are printed in Line 24, let v be the node chosen in Line 17. Suppose that $T_1$ and $T_2$ are generated as extensions of the trees $\hat{T}_1$ and $\hat{T}_2$, respectively, where $\hat{T}_1$ and $\hat{T}_2$ are created by the threaded enumerator $TE_1$. By Lemma 1, each $\hat{T}_i$ is the subtree of $T_i$ that is reduced w.r.t. $U \setminus \{v\}$. Hence, if $\hat{T}_1$ and $\hat{T}_2$ are not identical, then neither are $T_1$ and $T_2$. So suppose that $\hat{T}_1$ and $\hat{T}_2$ are identical. By the inductive hypothesis, the subtree $\hat{T}_1$ is generated only once by the enumerator $TE_1$. Thus, $T_1$ and $T_2$ are printed in the same iteration of the loop of Line 20. Hence, the claim follows from Lemma 1.

C Proof of Theorem 4

In this section, we prove Theorem 4, that is, the correctness of the algorithm SortedRS. We assume that G is an acyclic data graph and that U is a set of $K > 1$ nodes in G. We first prove the following lemma.

Lemma 2. Let W be a nonempty set of K or fewer nodes in G. During the execution of SortedRS(G, U), whenever Generate(W, i) is being called, $i \le I[W] + 1$ holds.

Proof. For $i = 1$, the lemma is trivial, since $I[W]$ is a nonnegative integer. So suppose that $i > 1$. We will first prove the following claim.

Claim 1. If Generate(W, i) is called during the execution of SortedRS, then Generate(W, $i - 1$) is called in some prior step.

To prove this claim, we consider several cases.

Case 1: Generate(W, i) is called in Line 6 of SortedRS(G, U). In this case, Generate(W, $i - 1$) was called in the previous iteration of Line 3 of SortedRS.

Case 2: $i > 2$ and Generate(W, i) is called in Line 12 of Generate($\hat{W}$, $\hat{\imath}$), for some $\hat{W}$ and $\hat{\imath}$. In that call, $W = \hat{W}^{\pm e}$ for some edge e, and i is the value $N[W, e] + 1$. Now, consider the step in which $N[W, e]$ was set to its current value $i - 1$. Since $i - 1 > 1$, this step is an execution of Line 13 of Generate, in some previous execution of Generate with $\hat{W}$ as an argument. In that execution, Generate(W, $i - 1$) was called in Line 12.
Case 3: $i = 2$ and Generate(W, i) is called in Line 12 of Generate($\hat{W}$, $\hat{\imath}$), for some $\hat{W}$ and $\hat{\imath}$. Since W precedes $\hat{W}$ in the $\preceq_s$ order, Generate(W, 1) is called in Line 6 of Initialize before the set $\hat{W}$ is even considered.

This completes the proof of Claim 1. The following observation follows from the tests of Line 3 of SortedRS and Line 11 of Generate.

Observation 1. If, during the execution of SortedRS(G, U), $T[W, i]$ is set to $\bot$, then Generate(W, j) is never called for $j > i$.

Now, consider a specific call to Generate(W, i). Let k be the value of $I[W]$ at that call. We will show that $k \ge i - 1$. From Claim 1 it follows that Generate(W, $i - 1$) was previously called. Let j be the value of $I[W]$ at that

call. Obviously, $j \le k$. From Observation 1 it follows that after the call to Generate(W, $i - 1$), the value of $T[W, i - 1]$ is not $\bot$. We now consider that execution of Generate(W, $i - 1$). If the test of Line 1 is true, then $j \ge i - 1$ holds, and hence $k \ge i - 1$. Otherwise, if the test of Line 3 is true, then both i and k must be 2. From Observation 1, the test of Line 7 cannot hold. Finally, if none of these tests is true, then Line 14 is reached, and hence $j = i - 1$. It follows that $k \ge i - 1$, as claimed.

For simplification, we assume that for every nonempty set W of K or fewer nodes in G, no two subtrees in $RS(G, W)$ have the same weight. Note that this assumption is not required for the correctness of the algorithm SortedRS.

The notion of safety relates to the values that are stored about a subset of the nodes of G during the execution of SortedRS(G, U). Let W be a nonempty set of K or fewer nodes in G. We say that W is safe if the following conditions hold:

1. For every i, such that $1 \le i \le I[W]$, the value $T[W, i]$ is defined, and it forms the $i$th lightest tree in $RS(G, W)$, or $\bot$ if $i > |RS(G, W)|$; and
2. If $|W| > 1$, then for every incoming edge e of $\max W$, $T[W^{\pm e}, N[W^{\pm e}, e]]$ is defined, and it forms the lightest tree T, such that $T \cup e \in RS(G, W) \setminus \{T[W, i] \mid 1 \le i \le I[W]\}$. If no such T exists, then $T[W^{\pm e}, N[W^{\pm e}, e]] = \bot$.

Lemma 3. Let W be a nonempty set of K or fewer nodes in G. During the execution of SortedRS(G, U), whenever Generate(W, i) is being called, the set W is safe.

Proof. To prove this lemma, we need the following observation.

Observation 1. The first time Generate is called with the argument W is in Line 6 of Initialize.

This observation follows from the order in which the sets are traversed in Initialize. Another observation we make use of is the following.

Observation 2. If W is safe at the first call to Generate(W, 1), then safety of W can only be impaired during some execution of Generate with W as an argument.
We prove this lemma by induction on the position of W in the $\preceq_s$ order. For the base case, we assume that W consists of only one node. Observation 2 implies that it suffices to show that W is safe at the first call to Generate(W, 1), and that safety is not impaired during any execution of Generate with W as an argument. From Observation 1 it follows that, when Generate is first called with W as an argument, the value of $I[W]$ is 0. Hence, W is trivially safe. Furthermore, Lines 3-5 of Generate imply that W remains safe at the end of each execution of Generate(W, i). We conclude that the lemma holds for W, as required.

For the inductive step, assume that $|W| > 1$. Let $u = \max W$. We first show that W is safe on the first call to Generate(W, 1) in Initialize. Since

$I[W] = 0$ at that time, Condition 1 of safety is satisfied vacuously. For Condition 2, consider an incoming edge e of u. Then, $N[W^{\pm e}, e] = 1$. Since $W^{\pm e}$ precedes W in the $\preceq_s$ order, Generate($W^{\pm e}$, 1) was called in a previous iteration of Line 1 of Initialize. Hence, $I[W^{\pm e}] \ge 1$. From the induction hypothesis it follows that $W^{\pm e}$ is safe. In particular, $T[W^{\pm e}, N[W^{\pm e}, e]]$ is defined and it forms the Steiner tree of $W^{\pm e}$, or $\bot$ if $RS(G, W^{\pm e}) = \emptyset$. It follows that $T[W^{\pm e}, 1]$ is a smallest tree T, such that $T \cup e \in RS(G, W)$, if $RS(G, W^{\pm e}) \neq \emptyset$; or $\bot$, otherwise. We conclude that W is safe at the first call to Generate with W as an argument, as claimed.

By Observation 2, it now suffices to prove that if W is safe at the beginning of some execution of Generate with W as an argument, then W is also safe at the end of that execution. Consider an execution of Generate(W, i), for some i. If the test of Line 1 of Generate is true, then the values of the arrays that relate to the set W remain unchanged, and hence, W remains safe. So assume that this test is false (i.e., $i > I[W]$). From Lemma 2 it follows that $i = I[W] + 1$. Let $u = \max W$. If the test of Line 7 is true, then $RS(G, W) = \emptyset$. In that case, Line 8 implies that W remains safe. Otherwise, let e be the edge that is chosen in Line 9. Since W is safe at the beginning of the execution, the tree $T[W, i]$ that is defined in Line 10 is the smallest tree in $RS(G, W) \setminus \{T[W, j] \mid 1 \le j \le I[W]\}$, or $\bot$ if no such tree exists. Since, in Line 14, $I[W]$ is set to i (that is, $I[W]$ is increased by 1), Condition 1 of safety is satisfied at the end of the execution. To show that Condition 2 of safety is also satisfied, it is enough to show that it is satisfied w.r.t. the edge e. If the test of Line 11 is false, then the value $N[W^{\pm e}, e]$ does not change, as required. Otherwise, Generate($W^{\pm e}$, $N[W^{\pm e}, e] + 1$) is executed in Line 12.
Hence, in Line 13, $I[W^{\pm e}] \ge N[W^{\pm e}, e] + 1$, and by the induction hypothesis, the set $W^{\pm e}$ is safe. Thus, $T[W^{\pm e}, N[W^{\pm e}, e] + 1]$ is the smallest tree T, such that $T \cup e \in RS(G, W)$ and the weight of T is greater than the weight of $T[W^{\pm e}, N[W^{\pm e}, e]]$; or $\bot$ if no such tree exists. Since the value of $N[W^{\pm e}, e]$ is incremented in Line 13, Condition 2 is satisfied. We conclude that W remains safe at the end of that execution of Generate, as claimed.

Theorem 4 follows directly from Lemma 3, when taking W to be the set U.

D Complexity of ReducedSubtrees

In this section, we analyze the time complexity of the algorithm ReducedSubtrees. Let E be an enumeration algorithm and x be an input for E. We use $N_i[E(x)]$ to denote the number of next commands that are executed during the $i$th interval of $E(x)$. The following two lemmas give upper bounds on $N_i[E(x)]$ for ReducedSubtrees and the other three enumeration algorithms used by ReducedSubtrees. Note that these upper bounds hold under the assumption that each algorithm generates a nonempty result. During the execution of ReducedSubtrees, tests are always made to guarantee that when a threaded enumerator is initialized, it will return a nonempty result.
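To make the notion of counting next commands concrete, the following sketch models a threaded enumerator as a Python generator and tallies every next command issued on it. It is an illustration under assumptions rather than the paper's algorithms: `paths` is a simple stand-in for Paths that works only on acyclic graphs, `NextCounter` and `enumerate_paths` are hypothetical names, and `None` plays the role of $\bot$. The loop in `enumerate_paths` mirrors the new/next pattern of Lines 5-9 of PairRS.

```python
from typing import Dict, Iterator, List, Optional, Set, Tuple

Graph = Dict[str, List[str]]  # assumed: directed acyclic adjacency lists
Path = Tuple[str, ...]


def can_reach(graph: Graph, target: str) -> Set[str]:
    """Nodes from which target is reachable (reverse depth-first search)."""
    reverse: Dict[str, List[str]] = {}
    for u, successors in graph.items():
        for v in successors:
            reverse.setdefault(v, []).append(u)
    seen, stack = {target}, [target]
    while stack:
        x = stack.pop()
        for p in reverse.get(x, []):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen


def paths(graph: Graph, u: str, v: str) -> Iterator[Path]:
    """Stand-in for Paths: yield the directed paths from u to v in a DAG."""
    alive = can_reach(graph, v)  # prune branches that cannot reach v

    def walk(x: str, prefix: Path) -> Iterator[Path]:
        if x == v:
            yield prefix
            return
        for y in graph.get(x, []):
            if y in alive:
                yield from walk(y, prefix + (y,))

    if u in alive:
        yield from walk(u, (u,))


class NextCounter:
    """Counts next commands; returns None (standing for bottom) on exhaustion."""

    def __init__(self) -> None:
        self.count = 0

    def next(self, enumerator: Iterator[Path]) -> Optional[Path]:
        self.count += 1
        return next(enumerator, None)


def enumerate_paths(graph: Graph, u: str, v: str,
                    counter: NextCounter) -> Iterator[Path]:
    te = paths(graph, u, v)   # TE := new[Paths](G, u, v)
    t = counter.next(te)      # T := next[TE]
    while t is not None:      # while T is not bottom
        yield t               # print(T)
        t = counter.next(te)  # T := next[TE]


dag: Graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
counter = NextCounter()
results = list(enumerate_paths(dag, "a", "d", counter))
```

In this toy run each interval between consecutive outputs contains exactly one next command, and one final next command detects exhaustion; nested enumerators would accumulate more next commands per interval, which is the quantity that the lemmas of this section bound.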
