Selecting and Using Views to Compute Aggregate Queries


Foto Afrati
National Technical University of Athens, Athens, Greece

Rada Chirkova
Computer Science, North Carolina State University
Contact author: NCSU, 900 Main Campus Dr, Venture III Ste 165-C Rm 196, Raleigh, NC 27695, USA; chirkova@csc.ncsu.edu

Due to space limitations, we do not provide proofs in the text. Selected proofs are in the appendix.

Abstract

We consider the problem of obtaining equivalent rewritings of aggregate queries using views. We assume conjunctive views and rewritings, with or without aggregation; in each rewriting, only one view contributes to computing the aggregated query output. Our focus is on minimizing the cost of computing a query workload; we look at query rewriting using existing views and at view selection. For the query-rewriting problem, we give necessary and sufficient conditions for a rewriting to exist. For view selection, we prove complexity results. We also give algorithms for obtaining rewritings and selecting views.

1 Introduction

The problem of answering and rewriting queries using views, for conjunctive queries and views, has received considerable attention (see, e.g., [ALU01, CDGLV03, CHS02, Hal01] and references therein). However, only a small amount of work addresses the case where the language is extended with aggregation [ACN00, CNS99, GHQ95, GHRU97, NSS98, SDJL96]. Few complete algorithms are known for finding rewritings; moreover, the existing results address special cases. At the same time, using materialized views to compute aggregate queries results in potentially greater benefits than for purely conjunctive queries, as a view with aggregation precomputes some of the grouping/aggregation on some of the query's subgoals. Also, because aggregate queries are often computed on large amounts of data, in many applications it is beneficial to use previously cached results as views to answer a new query [HRU96, GHRU97, ACN00].
We consider aggregate queries and views and address the problems of (1) how to answer the queries using the views and (2) how to optimally select views to materialize.

EXAMPLE 1.1 This is a simple motivating example. On a database with schema {P(A, B), S(B, C, D), T(C, G), U(A, H)} we consider three queries, Q1, Q2, and Q3:

q1(A, B, max(C)) :- p(A, B), s(B, C, D), t(C, G), u(A, H).
q2(B, C, sum(H)) :- p(A, B), s(B, C, D), u(A, H).
q3(B, count) :- s(B, C, D), t(C, G).

We consider the following views:

v1(B, max(C)) :- s(B, C, D), t(C, G).
v2(A, B, sum(H)) :- p(A, B), u(A, H).
v3(B, C) :- s(B, C, D).
v4(C, count) :- t(C, G).

We can rewrite the three queries as Q1', Q2', Q3' using the four views:

q1'(A, B, W) :- v1(B, W), v2(A, B, X).
q2'(B, C, sum(W)) :- v2(A, B, W), v3(B, C).
q3'(B, sum(W)) :- v4(C, W), v3(B, C).

Each rewriting uses more than one view, and the views in a rewriting are not necessarily of the same type: some are without aggregation (view V3), and some use an aggregation different from the aggregation of the query they rewrite (view V4 in Q3'). However, in each rewriting only one view (the first) in the body contributes to the value of the aggregated attribute in the head; we call it the central view. We call these rewritings central rewritings. Also, rewritings Q2' and Q3' are themselves aggregate queries, whereas rewriting Q1' is not. Finally, the grouping attributes in the head of the rewriting are, in general, different from the ones in the views used in the body.

It is not straightforward to argue that these rewritings are indeed equivalent to the queries. To see that, take Q2' slightly modified as

q2''(B, C, W) :- v2(A, B, W), v3(B, C).

Interestingly, rewriting Q2'' is not equivalent to query Q2, although its body is the same as that of Q2' and its head contains the same attributes. Also, Q2' (Q3') is equivalent to Q2 (Q3) only if the view V3 is computed under bag semantics [CV93].

One contribution of this paper is a complete algorithm that constructs central rewritings given a query and a set of views. The aggregate operators we consider are the common operators max, min, count, sum, and count(*). As aggregation is not a relational operator, proving equivalence of queries to rewritings is more complicated than when queries have no aggregation. Thus, we investigate this problem first and use the results we obtain to develop our algorithm.

When addressing the view-selection problem, we also consider multiaggregate views and queries with the HAVING clause, as in the following example.

EXAMPLE 1.2 Consider a database with three relations, one relation that stores transactions and two that store information about store branches: P(storeId, product, saleprice, profit, dayofsale, monthofsale, yearofsale); T(storeId, storechain); W(storeId, storecity). We consider three queries. Query Q1 gives maximal profit per store chain per product for a given year. Query Q2 gives total sales per product per year per city, for all stores.
Query Q3 uses a HAVING clause in its definition and returns all product names, together with total sales, for each year after 1997 and only for the city of Seattle. Here is one possible SQL expression for Q3:

    SELECT product, yearofsale, sum(saleprice)
    FROM P, W
    WHERE P.storeId = W.storeId
    GROUP BY product, yearofsale, storecity
    HAVING yearofsale > 1997 AND storecity = 'Seattle';

These three queries can be rewritten using a single multiaggregate view. In our datalog rule notation, the queries, the view, and the rewritings can be written as:

q1(S, Y, max(T)) :- p(X, Y, Z, T, N, L, 02), t(X, S).
q2(Y, M, U, sum(Z)) :- p(X, Y, Z, T, N, L, M), w(X, U).
q3(Y, M, F) :- q2(Y, M, U, F), M > 97, U = Seattle.
v1(X, Y, M, sum(Z), max(T)) :- p(X, Y, Z, T, N, L, M).
q1'(S, Y, max(K)) :- v1(X, Y, 02, F, K), t(X, S).
q2'(Y, M, U, sum(J)) :- v1(X, Y, M, J, K), w(X, U).
q3'(Y, M, F) :- q2'(Y, M, U, F), M > 97, U = Seattle.

View V1 can be used as a central view to rewrite all three queries.

Our second main result is an algorithm that selects multiaggregate central views optimally given a query workload. We also prove complexity results for the view-selection problem.

The structure of this paper is as follows. Section 2 defines aggregate queries and equivalence among aggregate queries. Section 3 presents our framework, in particular the types of rewritings we consider, the cost model for view selection, and a more technical presentation of our results. In Section 4, we prove necessary and sufficient conditions for each type of rewriting to exist and also provide negative results. In Section 5, we prove that the view-selection problem is NP-complete for sum and count, and provide an exponential-time lower bound on the complexity of view selection for max and min. In Section 6, we give algorithms for obtaining rewritings given a query and views and for selecting views given a query workload.
Related Work and Comparison to Ours

The problems of rewriting queries using views and of view selection for aggregate queries have been considered in papers related to data warehouses and datacubes [GCB+97, Wid95]; in general, the problem considered in this context was to answer each query (or part of a query) using a single view [ACN00, GHQ95, GHRU97, SDJL96]. Recent work [CNS99] has considered the problem of rewriting a query with aggregation using multiple views with aggregation; to determine whether a rewriting that uses views is equivalent to a query with aggregation, the method is to determine whether the rewriting's unfolding (defined similarly to expansion [Ull97]), which uses base relations only, is equivalent to the query [NSS98]. In this way, complete algorithms are obtained that construct rewritings that use multiplication as an aggregate operator and use only aggregate views in the body of the rewritings. In the present paper, we use unfoldings to determine equivalence of a central rewriting to a query and obtain complete algorithms. Our central rewritings use only standard aggregation operators and may use any views in the body, including multiaggregate views.

On view selection, considerable work has been done on efficiently selecting views, for instance in the datacube context (e.g., [GHRU97]), where the focus was on obtaining efficient algorithms for interesting special cases of the problem. Here we focus on obtaining results on the complexity of the view-selection problem for central rewritings in a framework similar to [CHS02].

Other related work on aggregate query rewriting includes [GT03], which considers rewriting aggregate queries using multiple aggregate views over a single relation, and [AAD+96], which presents fast algorithms for computing the cube operator. [YW01] considers the problem of using views with aggregation to compute queries in temporal databases. Work related to query languages with aggregate capabilities can be found in [BL02], [RSSS98], [ÖÖM87], [LSV02].
[PDST00] proposes a new method for generating alternative query plans, using an interaction of indexes, materialized views, semantic optimization, and query minimization. Finally, results on equivalence of aggregate queries are presented in [CNS99], which establishes that checking the equivalence of unions of sum- or count-queries is GI-hard and in PSPACE. (GI is the class of problems that are many-one reducible to the graph isomorphism problem.) It is also shown in [CNS99] that checking equivalence of unions of max-queries is $\Pi^p_2$-complete, whereas checking equivalence of unions of conjunctive queries without aggregation is NP-complete.

2 Preliminaries

A database is a collection of relations. A query is a mapping from databases to databases, where usually the output database (the answer) is a database with a single relation. A relation is viewed as either a set or a bag (a.k.a. multiset) of tuples. A bag can be thought of as a set of elements (we call it the core-set of the bag) with multiplicities attached to each element.

A conjunctive query is of the following form:

$h(\bar{s})$ :- $g_1(\bar{s}_1), \ldots, g_k(\bar{s}_k)$.

In each subgoal $g_i(\bar{s}_i)$, predicate $g_i$ is a base relation, and every argument in the subgoal is either a variable or a constant. We denote the part on the right-hand side of the :- (called the body) by $A$. The part on the left-hand side is called the head. An attribute or variable that is not in the head is called a nondistinguished attribute or variable.

An assignment $\gamma$ for $A$ is a mapping of the variables appearing in $A$ to constants, and of the constants appearing in $A$ to themselves. Assignments are naturally extended to tuples and atoms. For a tuple of variables $\bar{s} = (s_1, \ldots, s_k)$ we let $\gamma\bar{s}$ denote the tuple $(\gamma(s_1), \ldots, \gamma(s_k))$. Satisfaction of atoms (and of conjunctions of atoms) by an assignment w.r.t. a database is defined as follows: $g(\gamma\bar{s})$ is satisfied if the tuple $\gamma\bar{s}$ is in the relation that corresponds to the predicate of subgoal $g$.
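As an illustration of assignments and satisfaction, the following Python sketch enumerates all assignments that satisfy the body of a conjunctive query over a small database. It is not part of the paper's formalism: the function names, the dictionary-based database encoding, and the convention that variables are strings starting with an uppercase letter (so constants must be numbers or lowercase strings) are assumptions made for this example only.

```python
def is_variable(term):
    # Assumed convention: variables are strings starting with an uppercase letter.
    return isinstance(term, str) and term[:1].isupper()


def extend(gamma, args, row):
    """Extend assignment gamma so that subgoal arguments args map onto row.

    Returns the extended assignment, or None if the subgoal is not satisfied.
    """
    if len(args) != len(row):
        return None
    new = dict(gamma)
    for term, value in zip(args, row):
        if is_variable(term):
            if new.setdefault(term, value) != value:
                return None  # variable already bound to a different constant
        elif term != value:
            return None  # constants are mapped to themselves
    return new


def satisfying_assignments(body, db):
    """Return every assignment that satisfies all subgoals of body w.r.t. db."""
    assignments = [{}]
    for predicate, args in body:
        extended = []
        for gamma in assignments:
            for row in db[predicate]:
                new = extend(gamma, args, row)
                if new is not None:
                    extended.append(new)
        assignments = extended
    return assignments


def apply_to_head(gamma, head_args):
    """Image of the head tuple under the assignment gamma."""
    return tuple(gamma[t] if is_variable(t) else t for t in head_args)


# Hypothetical data: body p(A, B), s(B, C, D) with head arguments (A, C).
example_db = {"p": {(1, 2), (1, 3)}, "s": {(2, 5, 0), (3, 5, 0)}}
example_body = [("p", ("A", "B")), ("s", ("B", "C", "D"))]
example_images = [apply_to_head(g, ("A", "C"))
                  for g in satisfying_assignments(example_body, example_db)]
```

In this example two assignments satisfy the body, and both map the head tuple (A, C) to (1, 5).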
Under set semantics, a conjunctive query $q(\bar{s}) \leftarrow A$ defines a new relation $q^D$, for a given set database $D$, as follows: $q^D := \{\gamma\bar{s} \mid \gamma \text{ satisfies } A \text{ w.r.t. } D\}$. Under bag-set semantics [CV93], a conjunctive query $q(\bar{s}) \leftarrow A$ defines a new multiset relation $\{\{q\}\}^D$, for a given set database $D$, as follows: $\{\{q\}\}^D := \{\{\gamma\bar{s} \mid \gamma \text{ satisfies } A \text{ w.r.t. } D\}\}$. We say that the query is computed under bag semantics [CV93] if both the input database and the answer are bags. In this case, the collection of satisfying assignments is viewed as a multiset.

We define equivalence under each of the three types of semantics. Two queries are set-equivalent (bag-set-equivalent, bag-equivalent, respectively) if they produce the same set (multiset, respectively) of answers on every database (every set database for the first two cases, every bag database for the third case). When we compute a query, we will say whether we compute it as a bag or as a set, unless this is obvious from the context.

We assume in this paper that the data we want to aggregate are real numbers, $\mathbb{R}$. If $S$ is a set, then $M(S)$ denotes the set of finite multisets over $S$. A $k$-ary aggregate function is a function $\alpha : M(\mathbb{R}^k) \to \mathbb{R}$ that maps multisets of $k$-tuples of real numbers to real numbers. An aggregate term is an expression built up using variables and aggregate functions. Every aggregate term gives rise to an aggregate function in a natural way. We use $\alpha(y)$ as an abstract notation for an aggregate term, where $y$ is the variable in the term. The aggregate queries that we consider here have the aggregate functions count, count(*), sum, max, and min. Note that count is over an argument, whereas count(*) is the only function considered here that takes no argument. In the rest of the paper, we do not refer again to this distinction, as our results carry over.

An aggregate query is a conjunctive query augmented by an aggregate term in its head. Thus it has the syntax

$q(\bar{s}, \alpha(y)) \leftarrow A$, (1)

where $A$ is a conjunction of predicate atoms that represent relations; $\alpha(y)$ is an aggregate term; $\bar{s}$ are the grouping attributes of the query; $y$ does not appear among $\bar{s}$; and all the variables in the head occur in the body.
With each aggregate query $q$ as in Equation 1, we associate its core $\breve{q}$, which is a conjunctive query:

$\breve{q}(\bar{s}, y) \leftarrow A$. (2)

The semantics of an aggregate query is as follows. Let $D$ be a database and $q$ an aggregate query as in Equation 1. When $q$ is applied on $D$, it yields a new relation $q^D$ that is defined by the following three steps. First, we compute the core $\breve{q}$ on $D$ as a bag $B$. In the second step, we form equivalence classes in $B$: two tuples belong to the same equivalence class if they agree on the values of the grouping attributes. This is the grouping step. The third step is aggregation; it associates with each equivalence class the value of the aggregate function computed on the bag of all values of the aggregated attribute in this class. For each class, it returns one tuple that contains the values of the grouping attributes and the computed aggregated value.

We say that an aggregate function $\alpha$ is duplicate-insensitive if the result of $\alpha$ computed over a bag of values is the same as the result of $\alpha$ computed over the core-set of this bag. Otherwise $\alpha$ is duplicate-sensitive [GHQ95]. We say that an aggregate function $\alpha$ is distributive [GCB+97] if there is a function $\gamma$ such that $\alpha(A) = \gamma(\{\{\alpha(A_1), \ldots, \alpha(A_n)\}\})$ whenever the multiset $A$ is partitioned into multisets $A_1, \ldots, A_n$. All four functions we consider are distributive. In fact, for all $\alpha$, $\gamma = \alpha$, except that for count, $\gamma$ = sum.

The following are useful observations.

Proposition 2.1 Let $Q$ be an aggregate query with $X$ the grouping tuple and $Y$ the aggregated attribute. Then the following hold: (1) there is a functional dependency $X \to Y$; (2) the answer to $Q$ is set-valued; (3) the projection of the answer to $Q$ on $X$ is set-valued.
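The grouping and aggregation steps of this semantics can be sketched in Python as follows. The sketch assumes that the core has already been computed as a bag, represented here as a list of pairs (grouping tuple, aggregated value); the function name `evaluate_aggregate`, the `AGGREGATES` table, and the sample data are hypothetical and serve only to illustrate the definitions.

```python
from collections import defaultdict

# The aggregate functions considered in the text, as Python callables on a bag of values.
AGGREGATES = {"count": len, "sum": sum, "max": max, "min": min}


def evaluate_aggregate(core_bag, agg):
    """Apply grouping and aggregation to a bag of (grouping_tuple, y) pairs."""
    groups = defaultdict(list)
    for group_key, y in core_bag:  # grouping step: one class per grouping tuple
        groups[group_key].append(y)
    fn = AGGREGATES[agg]
    # aggregation step: one output tuple per equivalence class
    return {key + (fn(values),) for key, values in groups.items()}


# Hypothetical core answer computed as a bag; the pair ((2,), 3) occurs twice.
core_bag = [((1,), 3), ((1,), 5), ((2,), 3), ((2,), 3)]
core_set = sorted(set(core_bag))  # the core-set of the bag

sum_on_bag = evaluate_aggregate(core_bag, "sum")  # contains (1, 8) and (2, 6)
sum_on_set = evaluate_aggregate(core_set, "sum")  # contains (1, 8) and (2, 3)
max_on_bag = evaluate_aggregate(core_bag, "max")
max_on_set = evaluate_aggregate(core_set, "max")

# Distributivity of count with gamma = sum: counting two parts of the bag
# separately and summing the partial counts gives the count of the whole bag.
part_one, part_two = core_bag[:1], core_bag[1:]
total_count = len(part_one) + len(part_two)
```

On this data, sum returns different answers on the bag and on its core-set (it is duplicate-sensitive), whereas max returns the same answer on both (it is duplicate-insensitive). The answer is a Python set keyed by the grouping tuple, in line with Proposition 2.1.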

Now we define equivalence between aggregate queries. Two aggregate queries with different aggregate functions may be equivalent, but we do not want to treat such cases here, so we define equivalence only among compatible queries.

Definition 2.1 (Compatible queries) [NSS98] Two queries are compatible if they have identical heads, up to variable renaming.

Definition 2.2 (Equivalence of compatible aggregate queries) [NSS98] For two compatible aggregate queries $Q(\bar{x}, \alpha(y)) \leftarrow B(\bar{s})$ and $Q'(\bar{x}, \alpha(y)) \leftarrow B'(\bar{s}')$, $Q \equiv Q'$ if $Q(D) = Q'(D)$ for every database $D$.

Equivalence among aggregate queries is investigated in [CNS99, NSS98], where it is shown that: (1) two conjunctive queries are bag-set equivalent if and only if they are isomorphic; (2) equivalence of sum-queries and count-queries can be reduced to bag-set equivalence among their cores; (3) equivalence of max-queries can be reduced to set-equivalence between their cores.

3 Our Framework and Contributions

3.1 Rewritings for Aggregate Queries

Suppose $\mathcal{V}$ is a set of views defined on a database schema $S$, and suppose $D$ is a database instance with schema $S$. Then by $D_{\mathcal{V}}$ we denote the database obtained by computing all the view relations in $\mathcal{V}$ on the database $D$: $D_{\mathcal{V}} = \bigcup_{V \in \mathcal{V}} V(D)$.

Definition 3.1 (Equivalent Rewriting) Let $Q$ be a query defined on database schema $S$, and let $\mathcal{V}$ be a set of views defined on $S$; let $R$ be a query defined in terms of the views in $\mathcal{V}$. Then $Q$ and $R$ are equivalent, denoted $Q \equiv R$, if and only if for any database $D$, $Q(D) = R(D_{\mathcal{V}})$.

We say that a view $V$ is set-valued if $V$ is computed and stored to be accessed as a set, and we say that $V$ is bag-valued if $V$ is computed and stored to be accessed as a bag. In a rewriting, a bag-valued view $V$ is denoted by the adornment $V^b$. The following example shows that equivalence of a rewriting to a query depends on whether conjunctive views are set- or bag-valued.

EXAMPLE 3.1 We have the following query and one view which is the core of the query.
Q(X, count) :- p(X, Y, Z).
V(X) :- p(X, Y, Z).
Q'(X, count) :- V^b(X).

The rewriting is equivalent to the query as it is, i.e., when the view is bag-valued. However, if the view is set-valued, then there is no equivalence. (Consider the following database: P = {(1, 3, 4), (1, 5, 6)}. On P, the answer to Q has one tuple (1, 2), the answer to the view computed as a set has one tuple (1), and hence the answer to Q' has one tuple (1, 1).)

3.2 Central Rewritings

Finding rewritings for aggregate queries introduces additional complications when compared to finding rewritings for conjunctive queries without aggregation. A decision has to be made on the following parameters: (1) What kinds of queries the views are. (2) What kind of query the rewriting is. (3) Whether the views are computed under set or bag-set semantics. (4) Moreover, as a consequence of the choices we make, the aggregate function may or may not depend on some aggregated attributes of the views.

Our choice is to depend only on the aggregated attribute of a single view, which we call the central view. The rest of the views in the rewriting are called noncentral views.

Aggregate queries (and views that are defined by aggregate queries) are not symmetrical w.r.t. all their attributes. We call the aggregated attribute the output argument of the query. We do not allow joins on output arguments. Thus, in the setting of our paper, we make the following assumptions on the rewritings we consider:

1. The argument of aggregation in the head of the rewriting comes from exactly one (central) view in the body of the rewriting. We call the central aggregate operator either the aggregate operator of the central view that contributes to the aggregation in the head (a multiaggregate central view may have several aggregate operators) or, when the central view is purely conjunctive, the aggregate operator in the head of the rewriting.
2. Aggregated outputs of noncentral views are not used in the head of the rewriting.
3. There is no join on output arguments of views.

We call such rewritings central rewritings. In all our results, we consider only central rewritings.

We may now view our problem as belonging to one of the following three classes: CQ/CQA, when the central view is purely conjunctive and the rewriting has aggregation; CQA/CQ, when the central view has aggregation and the rewriting is purely conjunctive; and CQA/CQA, when both the central view and the rewriting have aggregation. It is easier to state our results for each class separately. Our rewriting template $R$ for all three classes is

$r(\bar{x}, \alpha(y)) \leftarrow v_0(\bar{x}_0, y), v_1^b(\bar{x}_1, y_1), \ldots, v_k^b(\bar{x}_k, y_k)$, (3)

where $\alpha$ is a nontrivial aggregate operator in cases CQ/CQA and CQA/CQA, and is the identity in case CQA/CQ (i.e., the head is $r(\bar{x}, y)$). In the case CQ/CQA, we also assume a central view, which covers all subgoals that contain the variable $y$.

Our contribution presented in Section 4 is the following: for each type of central rewriting, we obtain necessary and sufficient conditions for a rewriting to exist. This is achieved by using unfoldings of rewritings, as explained in the following section.

3.3 Unfoldings of Rewritings

For conjunctive queries without aggregation, it is straightforward to define and use expansions [Ull97] (unfoldings reduce to expansions in this case); in the presence of aggregation there are more complications.
Sometimes unfoldings are not equivalent to the rewritings, as we will prove in the section that follows. Here we define unfoldings.

We are given a set of views defined as conjunctive aggregate queries over the base predicates, and a conjunctive query $R$ over the views. We refer to $R$ as a rewriting even when we have not associated it with any particular query (whose rewriting is to be obtained). The unfolding $R^u$ of $R$ is a join of all the subgoals of the views in $R$, followed by some grouping/aggregation. If we denote by $B_{v_i}$ the body of a view $V_i$, then an unfolding $R^u$ of $R$ is defined as follows:

$r^u(\bar{x}, \beta(y)) \leftarrow B_{v_0} \,\&\, B_{v_1} \,\&\, \ldots \,\&\, B_{v_k}$, (4)

where (1) $\beta$ is the aggregate operator of the central view of $R$, if the central view is aggregated, or else is the aggregate operator in the head of $R$; (2) the variables in the $B_{v_i}$'s that are also contained in the $\bar{x}_i$ are kept the same as in the rewriting, whereas the other variables (the nondistinguished variables of the view definition) are replaced by fresh variables that are not used in any other $B_{v_j}$ with $j \neq i$. Moreover, $y$ is the attribute that is aggregated in the definition of the central view $V_0$ of $R$ (in case $V_0$ has aggregation). In the purely conjunctive case, the unfolding is equivalent to the expansion [Ull97] of the rewriting.

In our framework, we also consider multiaggregate queries and views. In this case, we assume again that only one aggregated attribute from one (central) view is used to compute the aggregated value in the head of the rewriting. Our central rewritings are extended naturally.

3.4 View Selection and Cost Model

We want to design minimal-cost views, i.e., those views whose use in the rewriting of a query results in the cheapest computation of the query. We assume that the view relations have been precomputed and stored in the database. Thus, we do not assume any cost for computing the views.

We assume that the size of a database relation is the number of tuples in it, and that the cost of computing a join is the sum of the sizes of the input relations and of the output relation (this faithfully models the cost of, e.g., hash joins). For conjunctive queries, we measure the cost of query evaluation as the sum of the costs of all the joins during the computation of the query. (We assume that all selections are pushed down as far as they go, and consider only left-linear query trees for joins.) For queries with aggregation, our sum-cost model measures the cost of evaluating a query as the sum of the costs of the three steps in the computation of the query: computation of the conjunctive core, grouping, and aggregation. (Let $N$ be the size of the input relation to a unary operator. Then the cost of the grouping operator, which is the same as sorting, is proportional to $N \log N$; the cost of the aggregate operator, which can be computed in a single scan, is $N$.)

Now we present our formulation of the view-selection problem. We assume that we must satisfy a bound (storage limit) on the sum of the sizes of the relations for the views that will be selected to be materialized.

Definition 3.2 (view-selection problem) Given a query workload, an oracle that gives view sizes (alternatively, a specific database), and a storage limit (a positive integer), return a set of view definitions such that: the views in the set give an equivalent rewriting (of one of our three central rewriting types) of each query in the workload; the view relations satisfy the storage limit; and the total cost of computing the queries using the rewritings is minimum among the view sets that satisfy the previous two conditions.

For the view-selection problem, we prove the following (in Section 5): (1) Decidability. (2) NP-hardness, even in the case of queries and views without aggregation.
(3) Membership in NP for sum and count aggregate queries. (4) An exponential-time lower bound on the complexity for min and max aggregate queries.

4 Results on Equivalence of Unfoldings and Rewritings

We present results that prove that the unfoldings defined in Section 3 are equivalent to the rewritings. We also present negative results that show that the constraints that need to be satisfied for this to hold are tight. As a consequence of the results in this section, equivalence of a rewriting to a query is reduced to equivalence between two aggregate queries (which is known how to check [NSS98]). In brief, for the cases where we prove that the rewriting is equivalent to its unfolding, it suffices to check whether the unfolding is equivalent to the query.

4.1 Case CQ/CQA: central view CQ and rewriting CQA

Theorem 4.1 Let $R$ be a CQ/CQA rewriting. Suppose that all noncentral views are without aggregation and are bag-valued. Then $R \equiv R^u$.

Proof: Here all views are without aggregation. Given a database $D$ on the base relations, computing the bag-join of all the subgoals of the views in the body of $R$ is equivalent to computing each view relation separately as a bag and then computing the bag-join of all the views in the body of the rewriting. After that, the same grouping and aggregation is applied in both $R$ and $R^u$.

The following result relaxes the requirement for noncentral views in the case of duplicate-insensitive functions.

Theorem 4.2 For a CQ/CQA rewriting $R$ with central aggregation max (min), $R \equiv R^u$.
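The role of bag-valued noncentral views in Theorems 4.1 and 4.2 can be checked on a small instance. The Python sketch below uses a hypothetical query q(X, α(Y)) :- p(X, Y), s(X, Z) with central view v0(X, Y) :- p(X, Y) and noncentral view v1(X) :- s(X, Z); the query, the data, and all function names are assumptions made for this illustration and do not come from the paper.

```python
from collections import defaultdict


def project(relation, columns, as_bag):
    """Project a relation on the given column positions, keeping or dropping duplicates."""
    rows = [tuple(t[c] for c in columns) for t in relation]
    return rows if as_bag else sorted(set(rows))


def join_on_first(left, right):
    """Bag-join two lists of tuples on their first column."""
    return [l + r[1:] for l in left for r in right if l[0] == r[0]]


def group_aggregate(rows, fn):
    """Group (x, y) pairs by x and apply the aggregate fn to each bag of y values."""
    groups = defaultdict(list)
    for x, y in rows:
        groups[x].append(y)
    return {x: fn(ys) for x, ys in groups.items()}


def unfolding_answer(p, s, fn):
    """Evaluate r_u(X, fn(Y)) :- p(X, Y), s(X, Z) directly on the base relations."""
    joined = join_on_first(p, s)
    return group_aggregate([(x, y) for x, y, _z in joined], fn)


def rewriting_answer(p, s, fn, noncentral_as_bag):
    """Evaluate r(X, fn(Y)) :- v0(X, Y), v1(X) over materialized views."""
    v0 = project(p, (0, 1), as_bag=True)        # central view v0(X, Y) :- p(X, Y)
    v1 = project(s, (0,), noncentral_as_bag)    # noncentral view v1(X) :- s(X, Z)
    return group_aggregate(join_on_first(v0, v1), fn)


p = [(1, 10), (1, 20)]
s = [(1, "a"), (1, "b")]

sum_unfolding = unfolding_answer(p, s, sum)
sum_bag_view = rewriting_answer(p, s, sum, noncentral_as_bag=True)
sum_set_view = rewriting_answer(p, s, sum, noncentral_as_bag=False)
max_unfolding = unfolding_answer(p, s, max)
max_set_view = rewriting_answer(p, s, max, noncentral_as_bag=False)
```

On this instance the unfolding and the rewriting with a bag-valued v1 both give sum 60 for X = 1, while a set-valued v1 gives 30; for max, all variants give 20. This is only a single-database check, not a proof of equivalence.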

Negative Results

Proposition 4.1 Let $R$ be a CQ/CQA rewriting with central aggregation sum or count. Suppose that either there is a noncentral view with aggregation, or there is a set-valued noncentral view. Then the unfolding is not set-equivalent to the rewriting.

4.2 Case CQA/CQ: central view CQA and rewriting CQ

Lemma 4.1 For every CQA/CQ rewriting $R$, if $R$ is equivalent to its unfolding $R^u$, then all grouping attributes of the central view of $R$ appear in the head of $R$.

Query Q1 in Example 1.1 is rewritten using a view whose grouping attributes are a proper subset of the arguments in the head of the rewriting. The following theorem proves equivalence of a rewriting to its unfolding for all aggregate functions that we consider, under some restrictions on the view definitions and on the form of the rewriting.

Theorem 4.3 Consider a CQA/CQ rewriting $R$. Suppose that (i) all noncentral views of $R$ have no aggregation, (ii) $R$ does not have nondistinguished attributes in its body (except possibly noncentral aggregated arguments in $R$'s central view in case of multiaggregate views), (iii) noncentral views do not have nondistinguished attributes in their definition, and (iv) all grouping attributes of the central view appear in the head of $R$. Then the following hold: $R$ is equivalent to its unfolding $R^u$, and the answer to $R$ on any set-valued database is a set.

Although, as we prove in the negative-results section, none of the conditions in Theorem 4.3 can be relaxed for sum or count queries, they can be relaxed for max and min queries:

Theorem 4.4 Let $R$ be a CQA/CQ rewriting with central aggregation max (min). Suppose that all the grouping arguments of the central view of $R$ appear in the head of $R$. Then $R$ is set-valued and is equivalent to its unfolding $R^u$.

Negative Results

The question arises whether we can extend Theorem 4.3 by relaxing one of the restrictions. Here we prove that it is not possible for aggregate functions sum and count.
A counterexample is rewriting Q2'' in Example 1.1. However, it might seem that there could be cases where the unfolding we defined in the previous section does work. In the following proposition, we prove that, for aggregate operators sum or count, the following holds: for any rewriting and its unfolding (the way we define unfoldings) such that some of the restrictions in Theorem 4.3 are relaxed, the unfolding is not equivalent to the rewriting.

Proposition 4.2 Consider a CQA/CQ query $R$ with central aggregation sum or count. Suppose that noncentral aggregated arguments of $R$ cannot be used in the head of $R$ or in joins in the body of $R$. Moreover, suppose that at least one of the following holds:

1. There is a noncentral view in $R$ defined by an aggregate query.
2. There is a noncentral view in $R$ defined by a query with nondistinguished variables.
3. There are nondistinguished variables in $R$ (other than noncentral aggregation in the central view of $R$).

Then $R$ is not set-equivalent to its unfolding $R^u$ (the way we define $R^u$).

We prove this proposition in three propositions, each relaxing one of the restrictions. The proof techniques are similar in all three cases.

4.3 Case CQA/CQA: central view CQA and rewriting CQA

Here, to prove $R \equiv R^u$, we choose to prove that the standard query plans for $R$ and $R^u$ can be transformed to the same plan $R^{int}$. We give an example to show our technique.

EXAMPLE 4.1 Consider the following rewriting and its unfolding:

r(X, T, sum(W)) :- v4(X, Z, W), v5(Z, T).
v4(X, Z, count(*)) :- p(X, Y, Z).
v5(Z, T) :- u(Z, T, L).
r^u(X, T, count(*)) :- p(X, Y, Z), u(Z, T, L).

Let R^int be defined as follows:

r^int(X, T, sum(W)) :- r'^int(X, T, Z, W).
r'^int(X, T, Z, count(*)) :- p(X, Y, Z), u(Z, T, L).

We show that $R \equiv R^u$ by showing that $R \equiv R^{int}$ and $R^{int} \equiv R^u$.

Theorem 4.5 Let $R$ be a CQA/CQA rewriting. Suppose that noncentral views are without aggregation and are bag-valued. Then $R \equiv R^u$.

Negative Results

Proposition 4.3 Let $R$ be a CQA/CQA rewriting with central aggregation sum or count. Suppose that either there is a noncentral view with aggregation, or there is a set-valued noncentral view. Then the unfolding is not set-equivalent to the rewriting.

5 View Selection

5.1 Decidability

Theorem 5.1 The view-selection problem under the storage limit is decidable for finite workloads of conjunctive queries with aggregation and for conjunctive views and rewritings, with or without aggregation, for the three central rewritings we consider.

The query workloads we consider may contain queries both with and without aggregation.

5.2 NP-completeness for sum or count

In this section we present an NP-completeness result for the view-selection problem for workloads of sum or count queries. As the proof also works for purely conjunctive queries, views, and rewritings under bag semantics, the view-selection problem for that case is also NP-complete. (Interestingly, under bag-set semantics, the view-selection problem for conjunctive queries, views, and rewritings has an exponential-time lower bound; cf. [CHS02].)

Theorem 5.2 The view-selection problem under the storage limit is NP-complete for finite workloads of conjunctive queries with sum- or count-aggregation and for conjunctive views and rewritings, with or without aggregation, for the three central rewritings we consider.

5.3 Lower Bound for Workloads of max or min

We prove an exponential-time lower bound for view selection under a storage limit for max- and min-queries.
Theorem 5.3 The view-selection problem under the storage limit has an exponential-time lower bound for finite workloads of conjunctive queries with max- or min-aggregation and for conjunctive views and rewritings, with or without aggregation, for the three central rewritings we consider.

6 Algorithms

As a consequence of the results in Section 4, we obtain algorithms that are based on the following observations.

Proposition 6.1 In a CQA/CQ rewriting, the set of all grouping attributes of the central view is a subset of the set of all grouping attributes of the rewriting. We call this central view grouping-complete. In a CQA/CQA rewriting, the set of the grouping attributes of the rewriting is a union of subsets of the grouping attributes in the central view and the non-aggregated attributes in noncentral views. We call this central view grouping-incomplete.

We consider a rewriting $R$ and define its reduced-core rewriting $R^r$ to be a conjunctive rewriting whose head attributes are $R$'s grouping attributes only, and whose body uses reduced-core views. Given an aggregate view $V$, we define its reduced-core view $V^r$ to be a view whose body is the body of $V$ and whose head is a new predicate name $V^r$; the arguments in the head of $V^r$ are all the grouping attributes of $V$. The reduced-core rewriting is a conjunctive query, and the following holds:

Proposition 6.2 Let $R^{r}$ be a reduced-core rewriting of a CQA/CQA or CQA/CQ rewriting R. Then $R^{r}$ is an equivalent rewriting of the reduced-core query using the reduced-core views.

6.1 Constructing Rewritings

In this section, given a query and a set of views, we construct all equivalent rewritings of the query using the views. The problem reduces to that of obtaining rewritings for purely conjunctive queries. For lack of space, we describe only the case of max queries and CQA/CQA or CQA/CQ rewritings. The other cases are similar, with the additional observation that, in the duplicate-sensitive cases, we find rewritings for the purely conjunctive queries whose unfolding is isomorphically mapped onto the query. In the following algorithm, $Q^{r}$ and $V^{r}$ are the reduced-core versions of a query Q and of the views, respectively. We use an algorithm from the literature [ALU01] to find all rewritings of $Q^{r}$ using $V^{r}$.

Procedure Find-R.
Input: query Q, set of views V
Consider $Q^{r}$, $V^{r}$.
Find all rewritings of $Q^{r}$ using $V^{r}$.
For each rewriting $R^{r}$ do:
    Consider the expansion $R^{r}_{exp}$.
    For each containment mapping from $Q^{r}$ to $R^{r}_{exp}$ do:
        If there is a view in the rewriting such that its aggregated attribute is the image of the aggregated attribute of the query, do:
            Call this the central view.
            If the central view is grouping-incomplete then construct a CQA/CQA rewriting.
            If the central view is grouping-complete then construct a CQA/CQ rewriting.
        end
    end
end

Theorem 6.1 If there is a central rewriting of a query Q using views V, then the algorithm will find it.

6.2 Selecting Views

We present an algorithm that selects multiaggregate views to be used as central views, given a query workload. It is particularly efficient in the case of queries with the HAVING clause, where a single multiaggregate central view avoids joins of several aggregate views. The algorithm selects all maximal such views.
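The benefit of a multiaggregate central view for HAVING queries can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and does not come from the paper: the base relation `sales(store, amount)`, the view names, and the example query (sum of `amount` per store, restricted to stores with at least `min_count` sales). It only shows that one view carrying both aggregates answers the query in a single scan, whereas two single-aggregate views must be joined on the grouping attribute.

```python
from collections import defaultdict

# Hypothetical base relation sales(store, amount), given as a bag of tuples.
SALES = [("s1", 10), ("s1", 30), ("s2", 5), ("s2", 7), ("s2", 9), ("s3", 100)]


def multiaggregate_view(rows):
    """v(store, sum(amount), count(*)): one view carrying two aggregates."""
    acc = defaultdict(lambda: [0, 0])
    for store, amount in rows:
        acc[store][0] += amount
        acc[store][1] += 1
    return {store: (total, n) for store, (total, n) in acc.items()}


def sum_view(rows):
    """v_sum(store, sum(amount)): a single-aggregate view."""
    return {s: total for s, (total, _n) in multiaggregate_view(rows).items()}


def count_view(rows):
    """v_cnt(store, count(*)): a second single-aggregate view."""
    return {s: n for s, (_total, n) in multiaggregate_view(rows).items()}


def answer_from_multiaggregate(view, min_count):
    """Sum per store HAVING count(*) >= min_count, from one scan of one view."""
    return {s: total for s, (total, n) in view.items() if n >= min_count}


def answer_from_join(v_sum, v_cnt, min_count):
    """The same query, answered by joining two aggregate views on store."""
    return {s: v_sum[s] for s in v_sum if s in v_cnt and v_cnt[s] >= min_count}


if __name__ == "__main__":
    print(answer_from_multiaggregate(multiaggregate_view(SALES), 2))
    print(answer_from_join(sum_view(SALES), count_view(SALES), 2))
```

Both functions return the same answer on this data; the difference is only in how many stored views must be read and joined.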
For a query workload, a view is maximal if there does not exist another multiaggregate view with more aggregated arguments that can replace it in all the rewritings in the workload. The algorithm considers each query Q in the workload and constructs a pair of views $(V_c^Q, V_n^Q)$, which essentially represent a central minimal view and a collective noncentral view. We may think of the pair $(V_c^Q, V_n^Q)$ as providing a rewriting for Q with the minimum number of subgoals in the central view $V_c^Q$. We call these views the characteristic views of the query Q. In the next step, the algorithm considers all combinations of those pairs and finds compatible pairs of characteristic views. Two pairs are compatible if (1) the two central views can be combined in a single multiaggregate view $V_m$, and (2) $V_m$ can be used to rewrite both queries.

Proposition
1. Each query has a bounded number of characteristic views.
2. In any central rewriting of a query Q, the views used in the rewriting can also be used to produce central rewritings of characteristic views.
3. It is decidable whether two pairs of characteristic views are compatible.

Theorem 6.2 The algorithm finds all maximal multiaggregate views for a query workload.

References

[AAD+96] S. Agarwal, R. Agrawal, P. Deshpande, A. Gupta, J.F. Naughton, R. Ramakrishnan, and S. Sarawagi. On the computation of multidimensional aggregates. In Proceedings of VLDB, pages ,

[ACN00] S. Agrawal, S. Chaudhuri, and V.R. Narasayya. Automated selection of materialized views and indexes in SQL databases. In Proceedings of VLDB, pages ,
[ALU01] F. Afrati, C. Li, and J.D. Ullman. Generating efficient plans for queries using views. In Proceedings of ACM SIGMOD,
[BL02] M. Benedikt and L. Libkin. Aggregate operators in constraint query languages. JCSS, 64: ,
[CDGLV03] D. Calvanese, G. De Giacomo, M. Lenzerini, and M.Y. Vardi. View-based query containment. In Proc. PODS, pages 56-67,
[CHS02] R. Chirkova, A.Y. Halevy, and D. Suciu. A formal perspective on the view selection problem. VLDB Journal, 11(3): ,
[CNS99] S. Cohen, W. Nutt, and A. Serebrenik. Rewriting aggregate queries using views. In Proceedings of PODS, pages ,
[CV93] S. Chaudhuri and M. Vardi. Optimization of real conjunctive queries. In Proc. PODS, pages 59-70,
[GCB+97] J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, and M. Venkatrao. Data cube: A relational aggregation operator generalizing Group-by, Cross-Tab, and sub totals. Data Mining and Knowledge Discovery, 1(1):29-53,
[GHQ95] A. Gupta, V. Harinarayan, and D. Quass. Aggregate-query processing in data warehousing environments. In Proceedings of VLDB, pages ,
[GHRU97] H. Gupta, V. Harinarayan, A. Rajaraman, and J.D. Ullman. Index selection for OLAP. In Proceedings of ICDE, pages ,
[GT03] S. Grumbach and L. Tininini. On the content of materialized aggregate views. JCSS, 66: ,
[Hal01] Alon Y. Halevy. Answering queries using views: A survey. VLDB Journal, 10(4): ,
[HRU96] V. Harinarayan, A. Rajaraman, and J. Ullman. Implementing data cubes efficiently. In Proceedings of SIGMOD, pages ,
[LSV02] J. Lechtenbörger, H. Shu, and G. Vossen. Aggregate queries over conditional tables. Journal of Intelligent Information Systems, 19(3): ,
[NSS98] W. Nutt, Y. Sagiv, and S. Shurin. Deciding equivalences among aggregate queries. In Proceedings of PODS, pages ,
[ÖÖM87] G. Özsoyoglu, Z.M. Özsoyoglu, and V. Matos. Extending relational algebra and relational calculus with set-valued attributes and aggregate functions. TODS, 12: ,
[PDST00] L. Popa, A. Deutsch, A. Sahuguet, and V. Tannen. A chase too far? SIGMOD Record, 29(2),
[RSSS98] K.A. Ross, D. Srivastava, P.J. Stuckey, and S. Sudarshan. Foundations of aggregation constraints. Theoretical Computer Science, 193(1-2): ,
[SDJL96] D. Srivastava, S. Dar, H.V. Jagadish, and A.Y. Levy. Answering queries with aggregation using views. In Proceedings of VLDB, pages ,
[Ull97] Jeffrey D. Ullman. Information integration using logical views. In Proceedings of ICDT,
[Wid95] Jennifer Widom. Research problems in data warehousing. In Proceedings of CIKM,
[YW01] J. Yang and J. Widom. Incremental computation and maintenance of temporal aggregates. In Proceedings of ICDE, pages 51-62,

A From Section 4

A.1 Proof of Theorem 4.3

Theorem 4.3 Consider a CQA/CQ rewriting R. Suppose that (i) all noncentral views of R have no aggregation, (ii) R does not have nondistinguished attributes in its body (except possibly noncentral aggregated arguments in R's central view in case of multiaggregate views), (iii) noncentral views do not have nondistinguished attributes in their definition, and (iv) all grouping attributes of the central view appear in the head of R. Then the following hold: R is equivalent to its unfolding $R^{u}$, and the answer to R on any set-valued database is a set.

Proof (sketch): The proof has two parts.

Part 1: Suppose the central view V of the rewriting has just one aggregated argument (i.e., we do not consider multiaggregate views). We first show that the answer to R is a set on any set-valued database; thus, it is enough to show set equivalence of R and $R^{u}$ on set-valued databases. We then transform the standard query plan for R into a set-equivalent query plan that is the standard query plan for $R^{u}$, as follows. We fix a set-valued database D. We observe that V can be computed on D by taking a bag projection of the body of V on the head attributes of V, and by then doing V's grouping and aggregation on the result.
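The observation about computing V can be made concrete with a small Python sketch. The view $v(X, sum(W)) \leftarrow p(X, Y), s(Y, W)$, the relation names, and the data are hypothetical and chosen only for illustration. The sketch computes the body of the view on a set-valued database, takes a bag projection on the head attributes, and then groups and aggregates; a set projection is included only to show why duplicates must be kept.

```python
from collections import defaultdict


def body_of_view(p, s):
    """All satisfying assignments (X, Y, W) of the body p(X, Y), s(Y, W)."""
    return [(x, y, w) for (x, y) in p for (y2, w) in s if y == y2]


def bag_project(assignments):
    """Bag projection on (X, W): Y is dropped and duplicates are kept."""
    return [(x, w) for (x, _y, w) in assignments]


def set_project(assignments):
    """Set projection on (X, W), shown only for contrast."""
    return set(bag_project(assignments))


def group_and_sum(projected):
    """Group by X and sum W, i.e., the grouping and aggregation of the view."""
    totals = defaultdict(int)
    for x, w in projected:
        totals[x] += w
    return dict(totals)


# A set-valued database for the hypothetical relations p and s.
P = {("a", 1), ("a", 2), ("b", 1)}
S = {(1, 10), (2, 10)}

if __name__ == "__main__":
    body = body_of_view(P, S)
    print(group_and_sum(bag_project(body)))  # answer of the view
    print(group_and_sum(set_project(body)))  # loses one contribution for "a"
```

On this data the group for `a` has two body tuples that agree on (X, W), so the bag projection yields a sum of 20 while the set projection would yield 10.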
We use this observation to argue that we can compute R on D as follows: (1) take a join of the bodies of all the views in R; (2) project the resulting relation on the head attributes of R under bag semantics; (3) group the resulting tuples into equivalence classes, based on the union of the grouping arguments of V and of the head arguments of R, and then aggregate using V's aggregation function; as a result, we obtain the value of V's aggregation for each equivalence class w.r.t. the grouping attributes of the view V. Because the grouping attributes of V are a subset of the head arguments of R, the result of this computation is the relation for R on D. We then observe that it is trivial to transform this plan into the standard computation for $R^{u}$.

Part 2 (multiaggregate central view V): We reduce this case to the previous case by projecting out extra aggregate arguments of the central view V and thus obtaining a new rewriting R'. We then argue that R and R' have the same unfolding, and use transitivity of set equivalence to show $R \equiv R^{u}$.

A.2 Proof of Proposition 4.2

Proposition 4.2 is proven in three parts with similar proofs, each for one of the clauses in the statement. We give here one of the three proofs.

Proposition A.1 Consider a CQA/CQ query R with central aggregation sum or count. If at least one noncentral view in R has aggregation (with any aggregation function(s)), and if noncentral aggregated arguments of R cannot be used in the head of R or in joins in the body of R, then R is not set-equivalent to its unfolding $R^{u}$ (the way we define $R^{u}$).

Proof (sketch): Consider an arbitrary CQA/CQ query R with central aggregation sum or count, such that R has a noncentral view with aggregation; let $R^{u}$ be the unfolding of R. We prove the Proposition by assuming $R \equiv R^{u}$ and by then constructing a database on which the answers to R and $R^{u}$ are different as sets; we thus arrive at a contradiction. Recall that, by definition of $R^{u}$, the head variables of R and $R^{u}$ are the same.

Here is the idea of what we show on a counterexample database D, for the case where R has a noncentral aggregate view. For a fixed assignment x of the grouping attributes in the head of $R^{u}$, we ascertain that the answer to $R^{u}$ on D has a tuple, with some value z of the aggregated argument Z of $R^{u}$. We argue that, for the same assignment x, none of the tuples in the answer to R on D has a value of Z that is equal to z. Thus, the answers to R and $R^{u}$ on D are different as sets.

To produce this counterexample, we build a database D in such a way that the body of the aggregate noncentral view $V_1$ in R has exactly two tuples that correspond to a fixed assignment x of the grouping arguments X of $R^{u}$; we build the rest of the database D to ensure that the answer to each of R and $R^{u}$ on D has at least one tuple whose values of X are x. (We build the database D as a union, on each base relation separately, of two canonical databases for R, which result from assigning two different variable names to the argument to be aggregated in $V_1$.) Now, because $V_1$ has aggregation, the answer to $V_1$ on D has exactly one tuple that corresponds to this assignment x; recall that the body of $V_1$ has two tuples for x. For this reason, when we compute R and $R^{u}$ on the database D, the result of joining all the subgoals of $R^{u}$ has at least two copies of each tuple in the body of the central view of R. Recall that the aggregated argument Z in the head of $R^{u}$ is also the aggregated argument in the head of the central view V of R. We argue that, for this reason, for the assignment x of the grouping arguments of $R^{u}$, the (only) tuple in the answer to $R^{u}$ on D has the value of Z that is at least twice the value of Z in any tuple for x in the answer to R on D.
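The counterexample idea can be checked on a tiny instance with the following Python sketch. The concrete rewriting is an assumption made for illustration and is not taken from the paper: the central view is $v(X, sum(Y)) \leftarrow p(X, K, Y)$, the noncentral aggregate view is $v_1(X, max(U)) \leftarrow t(X, U)$, the rewriting is $r(X, Z) \leftarrow v(X, Z), v_1(X, M)$, and its unfolding is $r^{u}(X, sum(Y)) \leftarrow p(X, K, Y), t(X, U)$. The database has j tuples with Y = 1 in the body of the central view and exactly two tuples for the same group in the body of the noncentral view.

```python
from collections import defaultdict


def central_view(p):
    """v(X, sum(Y)) :- p(X, K, Y)."""
    totals = defaultdict(int)
    for x, _k, y in p:
        totals[x] += y
    return dict(totals)


def noncentral_view(t):
    """v1(X, max(U)) :- t(X, U); aggregation leaves one tuple per group."""
    best = {}
    for x, u in t:
        best[x] = u if x not in best else max(best[x], u)
    return best


def rewriting(p, t):
    """r(X, Z) :- v(X, Z), v1(X, M); M is not used in the head."""
    v, v1 = central_view(p), noncentral_view(t)
    return {(x, z) for x, z in v.items() if x in v1}


def unfolding(p, t):
    """r_u(X, sum(Y)) :- p(X, K, Y), t(X, U)."""
    totals = defaultdict(int)
    for x, _k, y in p:
        for x2, _u in t:
            if x == x2:
                totals[x] += y
    return set(totals.items())


J = 3
P = {("x", k, 1) for k in range(J)}  # j tuples for group x, each with Y = 1
T = {("x", 1), ("x", 2)}             # exactly two tuples for group x

if __name__ == "__main__":
    print(rewriting(P, T))  # Z equals j
    print(unfolding(P, T))  # Z equals 2j
```

In this instance the rewriting returns Z = j for the group, while the unfolding returns Z = 2j, so the two answers differ as sets.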
Indeed, let there be j tuples in the body, on the database D, of the central view V of R. We construct D in such a way that each tuple in the body of V has the value 1 of the argument Y that is aggregated in the head of V. Therefore, the value of $Z = \alpha(Y)$ in the head of V is exactly j (recall that $\alpha$ is either sum or count). By definition, the answer to R on D is obtained by taking a projection, on the head arguments of R, of the result of joining all the subgoals of R. As the subgoals of R include the view V, any tuple for x in the answer to R on D has Z = j. On the other hand, the answer to $R^{u}$ on D is the result of performing $R^{u}$'s aggregation, which is the central aggregation (sum or count) of the central view of R, on the body of $R^{u}$. (The body of $R^{u}$ is the result of joining all its subgoals.) The body of $R^{u}$ has at least 2j tuples; the value, in each tuple, of the argument to be aggregated is 1. Thus, the tuple for x in the answer to $R^{u}$ on D has the value of Z that is at least 2j.

A.3 Proof of Theorem 4.5

Theorem 4.5 Let R be a CQA/CQA rewriting. Suppose that noncentral views are without aggregation and are bag-valued. Then $R \equiv R^{u}$.

Proof (sketch): We show that each of R and $R^{u}$ is equivalent to a query $R^{int}$ (see Example 4.1) whose definition is based on the definitions of R and $R^{u}$; then $R \equiv R^{u}$ follows from transitivity of equivalence. For a rewriting R defined as

$r(\bar{x}, \alpha(y)) \leftarrow v_0(\bar{x}_0, y), v^{b}_1(\bar{x}_1, y_1), \ldots, v^{b}_k(\bar{x}_k, y_k).$

and for its unfolding $R^{u}$,

$r^{u}(\bar{x}, \beta(y)) \leftarrow B_{v_0} \,\&\, B_{v_1} \,\&\, \ldots \,\&\, B_{v_k}.$

$R^{int}$ is defined as

$r^{int}(\bar{x}, \alpha(z)) \leftarrow r^{int}(\bar{x} \cup \bar{x}_0, z).$ (5)
$r^{int}(\bar{x} \cup \bar{x}_0, \beta(y)) \leftarrow B_{v_0} \,\&\, B_{v_1} \,\&\, \ldots \,\&\, B_{v_k}.$

Here, $\alpha$ is the aggregate function of R, and $\beta$ is the aggregate function of R's central view V. We give an intuition for the proof for the case where the aggregation function of the central view V in R is count(*); the proof carries over in a straightforward way to any distributive aggregation function [GCB+97].

In the computation of R on an arbitrary database, consider any group G(t) that results, after grouping and aggregation, in a tuple t in the answer to R. Any tuple p in G(t) is the result of joining tuples in views in R, one tuple from each view. Consider a tuple s in V (the central view of R) that contributes to the tuple p, and let k be the aggregated value in s. As V's aggregation is count(*), s corresponds to k tuples in the body of V. Thus, each tuple p (with some value k) in each group in R corresponds to k tuples in the body of the central view V of R.

We use this observation to see that we can use a query plan for $R^{int}$ to compute R. For each tuple p (with some value k) in the body of R, we have k tuples in the body of $R^{int}$. After doing $R^{int}$'s aggregation (the same as V's aggregation) on the union of the grouping attributes of R and V, we obtain, from these k tuples, exactly the tuple p in the body of $R^{int}$. As the grouping and aggregation are the same in the heads of R and $R^{int}$, we conclude that R and $R^{int}$ have the same answer on any database.

To show that $R^{u}$ and $R^{int}$ have the same answer on any database, we first observe that they are computed on the same relation $B = B_{v_0} \,\&\, \ldots \,\&\, B_{v_k}$. We then use the fact that R's aggregate function is distributive, to argue that the two grouping/aggregation steps in computing $R^{int}$ result in the same answer, on the relation B, as the single grouping/aggregation step in computing $R^{u}$.

B From Section 5

B.1 Proof of Theorem 5.1

Theorem 5.1 The view-selection problem under the storage limit is decidable for finite workloads of conjunctive queries with aggregation and for conjunctive views and rewritings, with or without aggregation, for the three central rewritings we consider.

Proof (sketch): The proof is a consequence of the fact that views in equivalent rewritings have definitions whose length is bounded by the size of the query.
This is true for conjunctive queries and carries over to aggregate queries and central rewritings because of our results on equivalence of unfoldings and rewritings (proved in Section 4) and the results on equivalence of aggregate queries. Together, these results imply that the core of the rewriting must be equivalent to the core of the query. We then argue as in the purely conjunctive case under either semantics.

B.2 NP-hardness proof (Theorem 5.2)

Proposition B.1 The view-selection problem under the storage limit is NP-hard for finite workloads of conjunctive queries with sum or count aggregation and for conjunctive views and rewritings, with or without aggregation, for the three central rewritings we consider.

Proof (sketch): We prove the Proposition by reducing the NP-complete problem Partition to the problem of view selection for a single query with sum or count aggregation, for each of our three central rewritings. Consider an instance I of Partition, which has n elements $a_1, \ldots, a_n$. We construct an instance J of view selection, in time at most polynomial in the size of I. The instance J has:

1. A sum or count query Q, with n subgoals $p_i$ that correspond to the elements in I, and with an extra subgoal $p_0$ that provides an aggregation argument of Q.

2. An oracle, which gives the size of the relation for each subgoal of the query Q, as 1 for $p_0$ and as $2^{s(a_i)}$ for each $p_i$, i > 0; here the integer $s(a_i)$ is the size of the element $a_i$ in the instance I of Partition. For any view defined on a subset of the subgoals of the query Q, the oracle gives the size of the relation for the view as a product of the sizes of the relations for the relevant subgoals; the size of Q is $2^{S(A)}$, where S(A) is the sum of sizes of all the elements in I. In the full proof we argue that the oracle gives view sizes consistently on some database.

3. A central rewriting type (one of CQ/CQA, CQA/CQ, and CQA/CQA).

The problem for J is: For the query Q and the set of databases for which the oracle gives the sizes of views as described above, does there exist a rewriting R of the specified type, such that the sum cost (see Section 3) of answering the query Q on these databases using the rewriting R does not exceed a numeric value M, which depends on the type of the rewriting R?

We then show that an instance I of Partition has a solution if and only if the corresponding instance J of view selection has a solution. Consider the value of M in J: the component of M that represents the cost of computing the body of the rewriting (i.e., all except the final grouping/aggregation) is $M = 2^{S(A)/2} + 2^{S(A)/2} + 2^{S(A)}$. The remainder of the proof is an argument that on the databases described by the oracle, the cost of computing the body of a rewriting does not exceed M only if there are exactly two views that have the same size $M_0$, as given by the oracle, and such that the join of the two views gives the body of the query Q. Now the size $M_0$ of any such view can only be $2^{S(A)/2}$, otherwise the (sum) cost of joining the views cannot be M. But by construction of the query and of the oracle, the size of a view can be $2^{S(A)/2}$ only if in the instance I of Partition there is a subset A' of the set A, such that the total size of the elements of A' is S(A)/2.

B.3 Proof of Theorem 5.3

Theorem 5.3 The view-selection problem under the storage limit has an exponential-time lower bound for finite workloads of conjunctive queries with max or min aggregation and for conjunctive views and rewritings, with or without aggregation, for the three central rewritings we consider.
Proof (sketch): We use the construction given in the proof of Theorem 6 in [CHS02]; we take the conjunctive query in the construction and modify it to obtain definitions of queries with aggregation. We then consider rewritings of these queries, one for each of the three rewriting types we consider, and prove that in each case, an exponential number of fixed views (some of them with aggregation) are the only possible viewset that satisfies a chosen storage limit and gives a minimal-cost rewriting of the query.

Here are some details. We take the database schema (relations $S_1$ through $S_n$) from the construction in the proof in [CHS02], and change the schemas of two of the relations, to accommodate attributes that would justify the aggregation in each type of central rewriting that we consider; we then construct a database D on which to compute our queries and rewritings. After defining the queries and rewritings on the new schema, we use our results on equivalence of queries with aggregation to their central rewritings to argue that each of the rewritings we produce is equivalent to the corresponding query. Each rewriting has an exponential number of filtering views [ALU01] that, when applied (i.e., joined) together to one of the nonfiltering views in the plan for computing the rewriting on the database, reduce the relation for the view in a way that minimizes the cost of the plan. Finally, for each rewriting we set a storage limit as the amount of space that is just enough to store the relations, on the database D, for an exponential number of views that we have fixed in each rewriting. Similarly to the proof in [CHS02], we show that (1) the cost of computing the queries using the chosen views and rewritings is lower than the cost of computing the queries without views, and (2) for any other viewset that could produce lower-cost plans to compute the queries on the database D, the relations for the viewset do not satisfy the storage limit.
In particular, by construction of the database D, our fixed views with aggregation are more beneficial than views (without aggregation) that are their cores.
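As a supplementary illustration of the reduction from Partition in B.2 (not of the lower-bound argument in B.3), the following Python sketch evaluates the cost bound on small instances. It rests on assumptions that are not spelled out in this appendix: the sum cost of a two-view plan is taken to be the sizes of the two views plus the size of their join, only plans that split the subgoals between exactly two views are enumerated, and the function names are hypothetical. View sizes follow the oracle of B.2, i.e., a view over a set of subgoals has size 2 raised to the sum of the corresponding element sizes.

```python
from itertools import combinations


def view_size(element_sizes):
    """Oracle size of a view: product of 2**s(a_i) over its subgoals."""
    return 2 ** sum(element_sizes)


def body_cost(left, right):
    """Assumed sum cost: both input views plus the result of joining them."""
    return view_size(left) + view_size(right) + view_size(left + right)


def find_two_view_plan(elements):
    """Return a split (left, right) whose body cost is within the bound M.

    M = 2**(S/2) + 2**(S/2) + 2**S, with S the total size of all elements.
    Returns None if no split meets the bound (including the case of odd S).
    """
    total = sum(elements)
    if total % 2:
        return None
    bound = 2 * 2 ** (total // 2) + 2 ** total
    n = len(elements)
    for r in range(n + 1):
        for chosen in combinations(range(n), r):
            left = [elements[i] for i in chosen]
            right = [elements[i] for i in range(n) if i not in chosen]
            if body_cost(left, right) <= bound:
                return left, right
    return None


if __name__ == "__main__":
    print(find_two_view_plan([3, 1, 1, 2, 2, 1]))  # a balanced split exists
    print(find_two_view_plan([1, 1, 4]))           # no balanced split
```

Because $2^{a} + 2^{S-a}$ is minimized exactly at $a = S/2$, a split meets the bound only when both halves have total size S/2, which is the Partition condition. The exhaustive search is exponential and serves only to check the correspondence on toy inputs.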
