A Nested Relational Approach to Processing SQL Subqueries


Bin Cao, Antonio Badia
Computer Engineering and Computer Science Department
University of Louisville
Louisville, KY 40292

ABSTRACT

One of the most powerful features of SQL is the use of nested queries. Most research work on the optimization of nested queries focuses on aggregate subqueries. However, the solutions proposed for non-aggregate subqueries are still limited, especially for queries having multiple subqueries and null values. In this paper, we show that existing approaches to queries containing non-aggregate subqueries proposed in the literature (including rewrites) are not adequate. We then propose a new efficient approach, the nested relational approach, based on the nested relational algebra. Our approach directly unnests non-aggregate subqueries using hash joins and treats all subqueries in a uniform manner, so it can deal with nested queries of any type and any nesting level. We report on experimental work that confirms that existing approaches have difficulties dealing with non-aggregate subqueries, and that our approach offers better performance. We also discuss some possibilities for algebraic optimization and the issue of integrating our approach in a relational database system.

1. INTRODUCTION

SQL is the standard language for data retrieval and manipulation in relational database systems. One of the most powerful features of SQL is nested queries (queries having subqueries). Theoretically, a query can have an arbitrary number of subqueries nested within it. A subquery can be either aggregate or non-aggregate. An aggregate subquery has an aggregate function in its SELECT clause; it always returns a single value as the result. A non-aggregate subquery is linked to the outer query by one of the following operators: EXISTS, NOT EXISTS, IN, NOT IN, θ SOME/ANY, and θ ALL, where $\theta \in \{<, \le, >, \ge, =, \ne\}$; the result is either a set of values or empty.
(Footnote) This research was sponsored by NSF under grant IIS. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage, and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGMOD 2005, June 14-16, 2005, Baltimore, Maryland, USA. Copyright 2005 ACM.

Since it is usually inefficient to directly execute nested queries in their original form [1], query unnesting, i.e. rewriting nested queries into flat forms, has been proposed as a better solution [1, 8, 5, 13, 14, 18]. Unfortunately, most proposed approaches concentrate on aggregate subqueries, and the optimization of non-aggregate subqueries remains limited. Some proposed approaches are derived from those for aggregate subqueries [8, 6, 1]. The few solutions proposed specifically for non-aggregate subqueries [5, 3, 2] are limited, especially for queries with multiple subqueries and null values. These approaches share two problems: first, queries cannot be unnested directly and transformations are required; second, each operator is evaluated in a different manner.

In this paper, we focus on non-aggregate subqueries. We propose a new, efficient approach, the nested relational approach, for evaluating nested queries containing non-aggregate subqueries in a uniform manner. To directly unnest non-aggregate subqueries, we use the nested relational algebra instead of the standard relational algebra. The motivation for using the nested relational algebra is the observation that the subquery result is either a set of values or empty, and can therefore be considered a set-valued attribute in the nested relational model.
Conceptually, our nested relational approach unnests a nested query top-down, and then uses our extended nested relational algebra to compute the predicates associated with the subqueries bottom-up. We will show that our approach not only allows unnesting non-aggregate subqueries directly without transformation, but also allows each operator to be evaluated in a uniform manner. Furthermore, our approach does not require indexes; only hash joins are necessary. Finally, being algebraic, our approach has clear semantics and can be further optimized.

The rest of this paper is organized as follows: Section 2 summarizes related work and gives the motivation. Section 3 defines the nested relational model and our extended nested relational algebra. Section 4 describes the original algorithm for evaluating queries having non-aggregate subqueries and some special cases for optimization. Section 5 shows our experiments. Section 6 concludes the paper.

2. RELATED WORK AND MOTIVATION

Significant research efforts have been devoted to the optimization of nested queries since the 1980s. Kim [1] was motivated by the observation that executing correlated nested queries using the traditional nested iteration method can be

very inefficient. As a solution, Kim developed query transformation algorithms to rewrite nested queries into equivalent, flat queries which can be processed more efficiently. Several problems of these algorithms were later pointed out and solved in [8]. Dayal [5] refined and extended all of the previous optimization work into a unified approach for processing queries that contain nested subqueries, aggregates and quantifiers, which enables unnesting queries with more than one nesting level. Muralikrishna [13] extended Dayal's approach to process queries that have an arbitrary number of blocks nested within any given block. Finally, work on magic decorrelation optimization within the logic programming community was brought to SQL optimization [18, 17]. Unfortunately, most proposed strategies focus on aggregate subqueries. Unnesting non-aggregate subqueries, especially those with certain operators, still poses problems.

Before presenting the problems, we first describe the terminology used in this paper. We introduce the term linking predicate to refer to the predicate that connects a subquery and an outer query. In a linking predicate, the attribute of the outer query is called the linking attribute, the attribute of the subquery is called the linked attribute, and the operator is called the linking operator. We call EXISTS, SOME/ANY and IN positive linking operators, and NOT EXISTS, ALL and NOT IN negative linking operators. If a query has both positive and negative linking operators, we say it has mixed linking operators. If a subquery contains a predicate which references a relation in the outer query, we say the subquery is correlated to the outer query, and the predicate is called the correlated predicate. The attribute of the outer query in a correlated predicate is called the correlating attribute, and the attribute of the subquery is called the correlated attribute. If a query has no nested subqueries, we call it a flat query.
If a query has nested subqueries, but they are all flat, we call it a one-level nested query; if a query has nested subqueries, but they are all one-level, we call it a two-level nested query, and so on. Since SQL is a block-structured language, the terms inner query block and outer query block are used interchangeably with subquery and outer query respectively in this paper.

As an example, assume relations R(A, B, C, D), S(E, F, G, H, I), T(J, K, L), and consider the following query:

Query Q:
    select R.B, R.C, R.D
    from R
    where R.A > 1
      and R.B not in
        (select S.E
         from S
         where S.F = 5
           and R.D = S.G
           and S.H > all
             (select T.J
              from T
              where T.K = R.C
                and T.L <> S.I))

Query Q is a two-level nested query. Going top-down, the second query block is correlated to the first query block by the predicate R.D = S.G, and the third query block is correlated to the other two query blocks by the predicates T.K = R.C and T.L <> S.I. The query has two negative linking operators, NOT IN and ALL.

Unnesting Query Q using existing techniques presents several problems. First, it cannot be unnested directly; instead, rewriting the predicates NOT IN and ALL is required. However, rewriting such predicates may not preserve semantics when null values are present. Because of null values, R.A > ALL (select S.B ...) is not equivalent to an antijoin of R and S on the condition R.A <= S.B. Furthermore, R.A > ALL (select S.B ...) is not equivalent to R.A > (select max(S.B) ...) or to 0 = (select count(S.B) ...) with the condition R.A <= S.B added into the subquery. Readers can convince themselves by assuming that R.A is 5 and S.B is {2, 3, 4, null}: the ALL predicate evaluates to unknown because of the comparison with null, whereas each of the rewrites accepts the tuple. Second, even when rewriting is possible, the resulting query tree may have several outer joins and antijoins that cannot be moved (except under certain circumstances; see [7, 16]), as well as extra operations.
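The null-value counterexample above (R.A = 5, S.B = {2, 3, 4, null}) can be checked mechanically. The following is a minimal sketch in Python, assuming SQL three-valued logic is modeled with `None` standing for both null and the truth value unknown; the function names (`gt_all`, `antijoin_keeps`, `gt_max`, `count_is_zero`) are illustrative and not part of the paper.

```python
def gt(a, b):
    """a > b in three-valued logic: None (unknown) if either side is null."""
    return None if a is None or b is None else a > b

def le(a, b):
    """a <= b in three-valued logic."""
    return None if a is None or b is None else a <= b

def gt_all(a, values):
    """a > ALL values: false if any comparison is false, else unknown if any is unknown."""
    results = [gt(a, v) for v in values]
    if any(r is False for r in results):
        return False
    return None if any(r is None for r in results) else True

def antijoin_keeps(a, values):
    """Antijoin on a <= v: the outer tuple survives if no v makes a <= v true."""
    return not any(le(a, v) is True for v in values)

def gt_max(a, values):
    """a > (select max(v)): max ignores nulls and is null on an empty input."""
    non_null = [v for v in values if v is not None]
    return gt(a, max(non_null) if non_null else None)

def count_is_zero(a, values):
    """0 = (select count(v) ... where a <= v): only rows where a <= v is true are counted."""
    return sum(1 for v in values if le(a, v) is True) == 0

s_b = [2, 3, 4, None]
print(gt_all(5, s_b))          # unknown, so a WHERE clause rejects the tuple
print(antijoin_keeps(5, s_b))  # the antijoin rewrite accepts it
print(gt_max(5, s_b))          # the max rewrite accepts it
print(count_is_zero(5, s_b))   # the count rewrite accepts it
```

Since a WHERE clause keeps a tuple only when its condition is true, the original predicate rejects the tuple while all three rewrites accept it, which is the discrepancy the text describes.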
Even though Muralikrishna [14] proposed to extract (left) antijoins from (left) outer joins, we note that in general such reuse may not be possible: here, the outer join is introduced to deal with the correlation, and the antijoin with the linking; therefore, they have distinct, independent conditions attached to them (and such approaches transform the query tree into a query graph, making it harder for the optimizer to consider alternatives). Also, magic decorrelation [18, 17] would be able to improve the above plan by pushing selections down to the relations; however, this does not improve the overall situation, since outer joins and antijoins are still present.

Third, (outer) joins are introduced to deal with correlations, which means that all correlated subqueries become one query block. However, when dealing with negative linking predicates, this creates a problem. To see why, note that in Query Q we have to outer join R with S and T to determine which tuples of T must be tested for the ALL linking predicate. However, if the set of tuples of T related to a tuple in R and S fails the test, we cannot throw the whole set away. The reason is that an S tuple that fails to qualify for the subquery answer may make the NOT IN linking predicate true, and hence qualify the R tuple. Thus, tuples in S and T should be antijoined separately to determine which tuples in S pass or fail the ALL test. Then the result should be separately antijoined with R to determine which tuples in R pass or fail the NOT IN test.

Different approaches are developed in [1, 6, 2, 3]. They all involve extending the standard relational algebra. In [1], a special operator called the multidimensional join (MD-join) is used to join two relations, group the result by one of them and compute aggregates on different partitions of the resulting join. Queries having non-aggregate subqueries are rewritten as counts.
However, this approach also suffers from the problem above, and it requires a double join of two relations (although, if implemented with care, the approach is efficient). Finally, the MD-join only commutes with other joins and selections in a selective manner. Similar to [1], non-aggregate subqueries are rewritten as aggregate subqueries with counts in [6]; the transformed queries are then evaluated by a second-order APPLY operator. While such an operator is very powerful, it may not yield the best possible plan for each case. In [2], queries having non-aggregate subqueries are computed by using a Boolean aggregate, which applies a condition to a set of tuples by applying the condition to each tuple and computing the conjunction or disjunction of the resulting truth values. In [2], tuples that fail the test are not discarded but marked and kept for further processing. In [3], queries having non-aggregate subqueries are evaluated by transforming nested queries into flat queries first, and then the flat queries are incrementally computed. Their transformation leads to a Cartesian product followed by difference operations, which is unlikely to be efficient.

In conclusion, existing approaches either have difficulties in dealing with mixed and negative linking operators, or call for special operations. What is needed is an approach which uniformly deals with all types of linking predicates without introducing undue complexity. We propose to use the nested relational algebra because it explicitly represents the intuition that for a given tuple, a non-aggregate subquery provides a set of values (perhaps empty). As a consequence, linking predicates become set predicates which can be represented in a straightforward manner.

3. DEFINITION OF EXTENDED NESTED RELATIONAL ALGEBRA

Several well-known, basically equivalent definitions of the nested relational algebra have been introduced [19, 15, 11]. For the purpose of the nested relational approach, the definitions need to be extended and slightly modified.

Definition 1. Let $U = \{A_1, \dots, A_n\}$ be a finite set of attributes. A schema over $U$ and the depth of the schema are defined recursively as follows:
1. If $A_1, \dots, A_n$ are atomic attributes from $U$, then $R = (A_1, \dots, A_n)$ is a (flat) schema over $U$ with the name $R$. The depth of the schema $R$ is 0, denoted by $depth(R) = 0$.
2. If $A_1, \dots, A_n$ are atomic attributes from $U$, and $R_1, \dots, R_m$ are distinct names of schemas with sets of attributes $attr(R_1), \dots, attr(R_m)$ such that $\{A_1, \dots, A_n\}$ and $attr(R_1), \dots, attr(R_m)$ are pairwise disjoint, then $R = (A_1, \dots, A_n, R_1, \dots, R_m)$ is a (nested) schema with the name $R$. $R_1, \dots, R_m$ are called subschemas. The depth of the schema $R$ is defined as $depth(R) = 1 + \max_{i=1}^{m} depth(R_i)$.

Definition 2. Let $R$ denote a schema over a finite set $U$ of attributes. The domain of $R$, denoted by $DOM(R)$, is defined recursively as follows:
1.
If $R = (A_1, \dots, A_n)$, where $A_i$ ($1 \le i \le n$) are atomic attributes, then $DOM(R) = DOM(A_1) \times \dots \times DOM(A_n)$, where $\times$ denotes Cartesian product.
2. If $R = (A_1, \dots, A_n, R_1, \dots, R_m)$, where $A_i$ ($1 \le i \le n$) are atomic attributes and $R_j$ ($1 \le j \le m$) are subschemas nested within $R$, then $DOM(R) = DOM(A_1) \times \dots \times DOM(A_n) \times 2^{DOM(R_1)} \times \dots \times 2^{DOM(R_m)}$, where $\times$ denotes Cartesian product and $2^{DOM(R_j)}$ denotes the power set of the set $DOM(R_j)$ ($1 \le j \le m$).

A nested tuple over $R$ is an element of $DOM(R)$. A nested relation $r$ over $R$ is a finite set of nested tuples over $R$, which is denoted by $sch(r) = R$.

The nested relational algebra has the standard operations of the relational algebra: selection ($\sigma$), projection ($\pi$), Cartesian product ($\times$), join ($\bowtie$), union ($\cup$), intersection ($\cap$), difference ($-$), plus the nest and unnest operators. Here we modify this algebra slightly to suit our purpose, redefining nest and modifying selection.

Definition 3. Let $R = (A_1, \dots, A_n)$ be a flat relational schema, where $A_i$ ($1 \le i \le n$) are atomic attributes. Let $attr(R)$ denote the names of all attributes, that is, $attr(R) = \{A_1, \dots, A_n\}$. Let $r$ be a flat relation over $R$, that is, $sch(r) = R$. Let $N_1$ and $N_2$ be two disjoint subsets of $attr(R)$. Then the nest of $r$ by $N_1$ keeping $N_2$, $\upsilon_{N_1,N_2}(r)$, is defined as:

$\upsilon_{N_1,N_2}(r) := \{ t \mid \exists t' \in r \wedge t[N_1] = t'[N_1] \wedge t[N_2] = \{ t''[N_2] \mid t'' \in r \wedge t''[N_1] = t[N_1] \} \}$

$N_1$ is called the set of nesting attributes, $N_2$ the set of nested attributes. Note that in the traditional definition, only $N_2$ is specified, and $N_1$ is understood as $attr(R) - N_2$. The definition presented here has an implicit projection onto $N_1 \cup N_2$ and will be more convenient for our approach; it also highlights the connection between nesting and grouping. Note also that, for simplicity (and since this will be our most frequent use), we have defined nesting over flat ($depth(R) = 0$) relations only; however, the definition can be extended to the general case without problems. The unnest operator can be defined as usual to be the inverse of nest.

Definition 4.
Let $R(A_1, \dots, A_n, R_1, \dots, R_m)$ be a nested relational schema, where $A_i$ ($1 \le i \le n$) are atomic attributes and $R_j$ ($1 \le j \le m$) are subschemas. Let $r$ be a nested relation over $R$, that is, $sch(r) = R$. Let $attr(R_j)$ denote the names of the attributes in $R_j$ ($1 \le j \le m$). Then a linking predicate over $r$ is defined as one of:
- $A \, \theta \, L \, \{B\}$, where $A \in \{A_1, \dots, A_n\}$, $B \in attr(R_j)$ ($1 \le j \le m$), $\theta \in \{<, \le, >, \ge, =, \ne\}$, and $L \in \{\text{SOME/ANY}, \text{ALL}\}$.
- $\{B\} \, \theta \, \emptyset$, where $\theta \in \{=, \ne\}$ and $B$ is as above.

The semantics of each predicate are obvious. Note that (again for simplicity) we only define the linking predicate over one-level ($depth(R) = 1$) nested relations. For a multi-level ($depth(R) \ge 2$) nested relation, $A$ and $B$ might belong to subschemas with depth $d$ and $d + 1$ respectively. Thus, the above definition can still be used.

Definition 5. Let $r$ be a relation over schema $R$, that is, $sch(r) = R$. The selection of $r$ with respect to $C$, $\sigma_C(r)$, where $C$ is a usual predicate or a linking predicate, is defined as usual:

$\sigma_C(r) := \{ t \mid t \in r \wedge C(t) \text{ is true} \}$

Let $attr(R)$ denote the names of all attributes in $R$, $A$ a subset of $attr(R)$, and $C$ a usual predicate or a linking predicate. The pseudo-selection of $r$ with respect to $C$ keeping $A$, $\bar{\sigma}_{C,A}(r)$, is defined as:

$\bar{\sigma}_{C,A}(r) := \{ t \mid \exists t' \in r \, ((C(t') \text{ is true} \wedge t = t') \vee (C(t') \text{ is false} \wedge t[A] = \{null\} \wedge t[attr(R) - A] = t'[attr(R) - A])) \}$

Thus, a pseudo-selection keeps all tuples that pass the condition (as the usual selection does); for the tuples that fail, it keeps the tuple, but pads the attributes in $A$ with null values.

In this paper, either $\sigma$ or $\bar{\sigma}$ is called a linking selection if $C$ is a linking predicate. The linking selection with $\sigma$ follows the usual definition; the linking selection with $\bar{\sigma}$ applies the pseudo-selection definition. As usual, the definitions of join, semijoin and outer join carry over from the regular (flat) algebra to the nested algebra. To help understand the above definitions, we give an example.

Example 1. Assume R(A, B, C, D), S(E, F, G, H, I), T(J, K, L) are the relations shown in figures 1(a), 1(b) and 1(c), where R.D, S.I and T.L are the primary keys of the respective relations. The relation Temp1 shown in figure 1(d) is obtained by projecting R.B, R.C, R.D, S.E, S.H, S.I, T.J and T.L from the result of a left outer join of R and S on the predicate R.D = S.G, followed by a left outer join with T on the predicates T.K = R.C and T.L <> S.I.

Figure 1: Base Relations. (a) Relation R(A, B, C, D#). (b) Relation S(E, F, G, H, I#). (c) Relation T(J, K, L#). (d) $Temp1 = \pi_{R.B,R.C,R.D,S.E,S.H,S.I,T.J,T.L}((R ⟕_{R.D=S.G} S) ⟕_{T.K=R.C \wedge T.L<>S.I} T)$. Primary keys are marked with #.

Figure 2: Example of nest and linking selection. (a) $Temp2 = \upsilon_{\{R.B,R.C,R.D,S.E,S.H,S.I\},\{T.J,T.L\}}(Temp1)$. (b) $Temp3 = \bar{\sigma}_{S.H > ALL\{T.J\} \vee T.L \text{ is null},\{S.E,S.H,S.I\}}(Temp2)$. (c) $Temp4 = \sigma_{S.H > ALL\{T.J\} \vee T.L \text{ is null}}(Temp2)$.

The relation Temp2 shown in figure 2(a) is a one-level nested relation resulting from nesting by {R.B, R.C, R.D, S.E, S.H, S.I}, keeping all of {T.J, T.L}. The reason why we keep the primary keys of R, S and T is that they are used to identify whether the corresponding tuple is empty. We assume that each relation has a unique non-null attribute serving as a primary key. In our case, a primary key with the null value must have been padded by a left outer join operation. If a tuple does not match the join condition, the left outer join operation pads null values on its attributes, including the primary key. Thus, a tuple whose primary key is null can be considered empty.
Another reason we keep the primary keys of R, S and T is that we have to distinguish between an empty tuple, with all attributes being null, and a tuple in which a certain attribute was originally null. As a result, our extended relational algebra can be used on relations containing null values without any problem.

The relation Temp3 shown in figure 2(b) is the projection of R.B, R.C, R.D, S.E, S.H and S.I on the result of the linking selection $\bar{\sigma}_{S.H > ALL\{T.J\} \vee T.L \text{ is null},\{S.E,S.H,S.I\}}(Temp2)$. Note that it is a pseudo-selection. A negative linking predicate returns true if the subquery result is empty, which is identified by the primary key being null. Thus, the linking selection carries the additional condition T.L is null. Under our definition, even though the linking selection over the second tuple returns false, we cannot discard this tuple. We have to keep it, padding null values on S.E, S.H and S.I. The linking selection over all other tuples returns true, thus we keep these tuples in their original forms. One notable point is that for the fourth and the fifth tuples, although the linking selection compares S.H (null) to {T.J} ({null}), the linking selection returns true because the condition T.L is null is true. From this example, we can see that a linking selection only compares the linking attribute to those linked attribute values whose corresponding primary key is not null. The result of the comparison is based on the standard definition.

The relation Temp4 shown in figure 2(c) is obtained by the projection of R.B, R.C, R.D, S.E, S.H and S.I on the result of the linking selection $\sigma_{S.H > ALL\{T.J\} \vee T.L \text{ is null}}(Temp2)$. The linking selection over the second tuple returns false, thus we discard this tuple. All other tuples pass the linking selection and become the result. Note that the projection operation in each subfigure is omitted.
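The nest operator and the two forms of linking selection used in Example 1 can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: relations are lists of dicts, `None` models null and unknown, attribute names are shortened (D, I and L stand for the keys R.D, S.I and T.L), and the rows of `temp1` are hypothetical data, not the contents of figure 1.

```python
def nest(rel, n1, n2):
    """Nest the flat relation rel by attributes n1, keeping n2 as a set-valued attribute."""
    groups = {}
    for t in rel:
        key = tuple(t[a] for a in n1)
        groups.setdefault(key, []).append({a: t[a] for a in n2})
    return [{**dict(zip(n1, key)), "nested": members} for key, members in groups.items()]

def gt_all(t, attr, linked, key):
    """Linking predicate attr > ALL {linked}, or key is null (empty subquery result)."""
    unknown = False
    for m in t["nested"]:
        if m[key] is None:            # padding produced by the outer join: skip it
            continue
        if t[attr] is None or m[linked] is None:
            unknown = True            # a comparison with null is unknown
        elif t[attr] <= m[linked]:
            return False
    return None if unknown else True

def select(rel, pred):
    """Usual selection: keep only tuples for which the predicate is true."""
    return [t for t in rel if pred(t) is True]

def pseudo_select(rel, pred, pad):
    """Pseudo-selection: keep failing tuples too, but null out the attributes in pad."""
    return [t if pred(t) is True else {**t, **dict.fromkeys(pad)} for t in rel]

# Hypothetical Temp1: the result of the two left outer joins, already projected.
temp1 = [
    {"D": 1, "E": 7, "H": 9, "I": 1, "J": 3, "L": 1},
    {"D": 1, "E": 7, "H": 9, "I": 1, "J": 5, "L": 2},
    {"D": 1, "E": 8, "H": 4, "I": 2, "J": 6, "L": 3},
    {"D": 2, "E": None, "H": None, "I": None, "J": None, "L": None},  # no S match
    {"D": 3, "E": 6, "H": 2, "I": 3, "J": None, "L": None},           # no T match
]

def pred(t):
    return gt_all(t, "H", "J", "L")

temp2 = nest(temp1, ["D", "E", "H", "I"], ["J", "L"])
temp3 = pseudo_select(temp2, pred, ["E", "H", "I"])   # 4 tuples, one padded
temp4 = select(temp2, pred)                           # 3 tuples
```

In this sketch the tuple with I = 2 fails the test (4 is not greater than 6): the pseudo-selection keeps it with E, H and I set to null, while the usual selection drops it. The two tuples whose T part is padding pass because their subquery result is empty.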

4. THE NESTED RELATIONAL APPROACH TO PROCESSING SUBQUERIES

The motivation of the nested relational approach is the observation that the linking predicate is actually a set computation. The basic idea of the nested relational approach is straightforward: a nested query is first unnested top-down, and then the linking predicates are computed bottom-up, which requires: (1) the subquery result to be a set (perhaps empty) and (2) a comparison between a single-valued attribute and a set-valued attribute. Such operations can be achieved by the nest operator and the linking selection operator defined in the previous section. In our approach, non-correlated subqueries are executed once, and the result is used by every tuple (virtual Cartesian product). Correlated subqueries can be executed and then connected to outer queries by join or outer join operations. We first present an original approach and then introduce some optimizations.

4.1 Original approach

For a nested query with $n$ query blocks, in each query block, going top-down, let $R_i$ ($1 \le i \le n$) denote the relations in the FROM clause; $L_i$ ($1 \le i \le n-1$) denote the linking predicate between blocks $i$ and $i+1$; $C_{ij}$ ($2 \le i \le n$ and $1 \le j \le n$) represent the correlated predicate(s) between blocks $i$ and $j$ ($i > j$); and $\Delta_i$ ($1 \le i \le n$) represent the predicates in the WHERE clause except $L_i$ and $C_{ij}$. Our algorithm proceeds in three steps.

First, we reduce each query block to one relation by doing all operations in the WHERE clause except the linking predicate and the correlated predicate(s), i.e., at each block $i$, we produce $T_i = \sigma_{\Delta_i}(R_i)$ (see footnote 1). Note that this is equivalent to producing the complementary set in the magic decorrelation technique [18, 17]; however, we do not produce a magic set.

Second, we create a tree expression for the query as follows: walk through the query in depth-first, left-to-right order; create one node for each query block. We label each node with the corresponding $T_i$.
Between any two adjacent nodes $T_i$ and $T_{i+1}$, we add an edge directed from $T_i$ to $T_{i+1}$ labeled with the linking predicate $L_i$. If $T_{i+1}$ is correlated to $T_i$, we add the correlated predicate $C_{(i+1)i}$ to the edge. If $T_i$ is correlated to a non-adjacent node $T_j$ ($i > j$), we add the correlated predicate $C_{ij}$ to the edge between $T_i$ and $T_{i-1}$ if all edges between $T_j$ and $T_i$ have been labeled with correlated predicates; otherwise, we add an edge directed from $T_j$ to $T_i$ labeled with the correlated predicate $C_{ij}$. The root is labeled by the name of the outermost query block, leaves are labeled by the names of the innermost query blocks, and other nodes are labeled by the names of the middle query blocks. A node is called a subroot if it has more than one child. All nodes under a subroot are called a subtree of the subroot. For a given node $n$, let $name(n)$ be the $T_i$ that serves as the name of the node; let $link_C(n, m)$ be the $C_{ij}$ (if one exists) and $link_L(n, m)$ be the $L_i$ which label the link between $n$ and one of its children $m$.

(Footnote 1) We assume all relations are connected, i.e. no Cartesian product is present.

Third, we call compute(root, $T_1$). The algorithm, shown as algorithm 1, recursively goes down the tree in a depth-first manner, creating a single relation through the use of join or outer join. Note that the structure created in the previous step may be a graph. In this step, we restrict our attention to edges labeled with correlated predicates, in which case we get a maximal spanning query tree for the graph (when all query blocks are correlated). When a leaf is reached, the algorithm goes bottom-up, nesting the relation obtained and applying the corresponding linking selection to reduce the relation. When a subroot is found on the way down, the algorithm chooses a child to continue towards the leaves; on the way up, however, the algorithm will go down again until all paths in the subtree of the subroot have been covered before proceeding up past the subroot.
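To make the three steps concrete for the linear case, here is a minimal Python sketch that evaluates Query Q of section 2 both by the join, nest and linking-selection pipeline and by plain nested iteration, and compares the results. It is an illustration under stated assumptions rather than the authors' implementation: `None` models null and unknown, the joins are simple loops rather than hash joins, and all helper names (`tv`, `all_`, `louter`, `link_all`) and the sample rows are hypothetical.

```python
import operator

def tv(op):
    """Lift a comparison to three-valued logic (None means unknown)."""
    return lambda a, b: None if a is None or b is None else op(a, b)

eq, ne, gt = tv(operator.eq), tv(operator.ne), tv(operator.gt)

def all_(op, a, values):
    """a op ALL values in three-valued logic; true on an empty collection."""
    results = [op(a, v) for v in values]
    if any(x is False for x in results):
        return False
    return None if any(x is None for x in results) else True

def louter(left, right, pred, right_attrs):
    """Left outer join: unmatched left tuples are padded with nulls."""
    out = []
    for row in left:
        matches = [{**row, **r} for r in right if pred(row, r) is True]
        out.extend(matches or [{**row, **dict.fromkeys(right_attrs)}])
    return out

def nest(rel, n1, n2):
    groups = {}
    for t in rel:
        groups.setdefault(tuple(t[a] for a in n1), []).append({a: t[a] for a in n2})
    return [{**dict(zip(n1, k)), "nested": m} for k, m in groups.items()]

def link_all(op, attr, linked, key):
    """attr op ALL {linked}; nested members whose key is null are padding."""
    return lambda t: all_(op, t[attr], [m[linked] for m in t["nested"] if m[key] is not None])

S_ATTRS = ["S.E", "S.F", "S.G", "S.H", "S.I"]
T_ATTRS = ["T.J", "T.K", "T.L"]

def nested_relational(R, S, T):
    t1 = [r for r in R if gt(r["R.A"], 1) is True]       # step 1: local selections
    t2 = [s for s in S if eq(s["S.F"], 5) is True]
    rel = louter(t1, t2, lambda x, s: eq(x["R.D"], s["S.G"]), S_ATTRS)   # going down
    rel = louter(rel, T, lambda x, t: eq(t["T.K"], x["R.C"]) is True
                 and ne(t["T.L"], x["S.I"]) is True, T_ATTRS)
    outer, mid = ["R.B", "R.C", "R.D"], ["S.E", "S.H", "S.I"]
    rel = nest(rel, outer + mid, ["T.J", "T.L"])          # going up: nest, then select
    p2 = link_all(gt, "S.H", "T.J", "T.L")
    rel = [t if p2(t) is True else {**t, **dict.fromkeys(mid)} for t in rel]  # pseudo-selection
    rel = nest(rel, outer, ["S.E", "S.I"])
    p1 = link_all(ne, "R.B", "S.E", "S.I")                # NOT IN is <> ALL
    return [tuple(t[a] for a in outer) for t in rel if p1(t) is True]

def nested_iteration(R, S, T):
    out = []
    for r in R:
        if gt(r["R.A"], 1) is not True:
            continue
        sub = []
        for s in S:
            if eq(s["S.F"], 5) is True and eq(r["R.D"], s["S.G"]) is True:
                js = [t["T.J"] for t in T
                      if eq(t["T.K"], r["R.C"]) is True and ne(t["T.L"], s["S.I"]) is True]
                if all_(gt, s["S.H"], js) is True:
                    sub.append(s["S.E"])
        if all_(ne, r["R.B"], sub) is True:
            out.append((r["R.B"], r["R.C"], r["R.D"]))
    return out

def rows(names, *vals):
    return [dict(zip(names, v)) for v in vals]

R = rows(["R.A", "R.B", "R.C", "R.D"],
         (2, 7, 1, 1), (3, 8, 1, 2), (5, None, 2, 3), (0, 9, 2, 4), (4, 6, 3, 5))
S = rows(S_ATTRS, (1, 5, 1, 9, 1), (8, 5, 2, 1, 2), (3, 5, 3, 4, 3), (None, 5, 5, 9, 4),
         (7, 4, 1, 9, 5), (7, 5, 1, 2, 6), (8, 5, 2, 7, 7))
T = rows(T_ATTRS, (3, 1, 1), (5, 1, 2), (None, 2, 9))

print(nested_relational(R, S, T))   # [(7, 1, 1), (None, 2, 3)]
print(nested_iteration(R, S, T))    # same result
```

The sample data exercises the situation discussed in section 2: the S tuple with key 6 fails the ALL test, so the pseudo-selection pads it with nulls instead of discarding the R tuple it belongs to, and that R tuple still qualifies through NOT IN.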
We do not provide a formal proof of the correctness of algorithm 1 due to lack of space. Basically, we unnest a query in the traditional way, and then nest by each tuple of the outer query, which preserves tuple iteration semantics. Then, the linking selection operator computes linking predicates in a straightforward manner.

Algorithm 1 Compute(node, relational-expression)
Require: a nested query with non-aggregate subqueries
Ensure: the result of the query
PROCEDURE compute(node, rel) {
  if (node is a leaf) then
    return;
  else
    for each n ∈ children(node) do
      $T_i$ = name(n);
      $C_{ij}$ = $link_C$(node, n);
      $L_i$ = $link_L$(node, n);
      if ($C_{ij}$ exists) then
        rel = rel $\bowtie_{C_{ij}}$ $T_i$ or rel = rel $⟕_{C_{ij}}$ $T_i$;
      else
        rel = rel $\times$ $T_i$;
      end if
      compute(n, rel);
      rel = $\upsilon_{\{T_1.*, \dots\},\{T_i.*\}}$(rel);
      rel = $\sigma_{L_i}$(rel) or $\bar{\sigma}_{L_i}$(rel);
    end for
  end if
}

The algorithm works equally for nested linear queries and nested tree queries (see footnote 2). In the first case, there is only one child for each node; the net effect is that of going down the tree joining or outer joining, or using the Cartesian product when there is no correlation (this Cartesian product is really virtual), and then going up nesting and evaluating the predicates. In the second case, each subroot makes us go down all paths before continuing on the way up.

(Footnote 2) A nested linear query is a query in which at most one query block is nested within any query block. A nested tree query is a query in which there is at least one query block which has two or more query blocks nested within it at the same level.

To show how the original nested relational approach processes a nested query, we give an example.

Example 2. Consider Query Q in section 2. The tree expression for this query is shown in figure 3(a). To process this query, we start from the root node $T_1$: R, performing a left outer join of R and S on the correlated predicate R.D = S.G. Since $T_2$: S is not a leaf, we continue by performing a left outer join with T on the correlated predicates T.K = R.C and T.L <> S.I. Node $T_3$: T is a leaf node, thus we compute the linking predicate $L_2$: S.H > ALL {T.J}, which

is achieved by nesting by {R.B, R.C, R.D, S.E, S.H, S.I}, keeping all of {T.J, T.L}, followed by the projection of R.B, R.C, R.D, S.E, S.H, S.I and the linking selection S.H > ALL {T.J}. Then, the algorithm goes back to node $T_2$: S. Since there are no other children under node $T_2$: S, we compute the linking predicate R.B <> ALL {S.E} (the NOT IN linking operator is equivalent to <> ALL) by nesting by {R.B, R.C, R.D}, keeping all of {S.E, S.I}, followed by the projection of R.B, R.C, R.D and the linking selection R.B <> ALL {S.E}, which takes us back to the root $T_1$: R. The final result is obtained by the projection of the desired attributes. Note that we use both the $\bar{\sigma}$ and the $\sigma$ linking selection in this example. Generally, $\bar{\sigma}$ is used for computing negative or mixed linking predicates; $\sigma$ is used for computing the last unfinished linking predicate, or when all unfinished linking predicates are positive.

We use a query tree to represent the process of a query evaluation, in which $\pi$ denotes projection; ⟕ left outer join; $\sigma$ or $\bar{\sigma}$ (linking) selection; and $\upsilon$ nest. The query tree for processing Query Q is shown in figure 3(b) (intermediate projections are omitted).

Figure 3: The Nested Relational Approach Applied to Query Q. (a) Tree Expression: the edge from $T_1$: R to $T_2$: S is labeled $L_1$: R.B <> ALL {S.E} and $C_{21}$: R.D = S.G; the edge from $T_2$: S to $T_3$: T is labeled $L_2$: S.H > ALL {T.J}, $C_{32}$: T.L <> S.I and $C_{31}$: T.K = R.C. (b) Query Tree: $\pi_{R.B,R.C,R.D}(\sigma_{R.B <> ALL\{S.E\} \vee S.I \text{ is null}}(\upsilon_{\{R.B,R.C,R.D\},\{S.E,S.I\}}(\bar{\sigma}_{S.H > ALL\{T.J\} \vee T.L \text{ is null},\{S.E,S.H,S.I\}}(\upsilon_{\{R.B,R.C,R.D,S.E,S.H,S.I\},\{T.J,T.L\}}((\sigma_{R.A>1}(R) ⟕_{R.D=S.G} \sigma_{S.F=5}(S)) ⟕_{T.K=R.C \wedge T.L<>S.I} T)))))$.

4.2 Optimizations

Algorithm 1 can evaluate nested queries containing non-aggregate subqueries with any type of linking predicates and any level of nesting in a uniform manner. However, there are several alternatives and optimizations possible. We briefly discuss some of the more interesting ones.

4.2.1 Reduce nesting operations

In the original approach, we compute each linking predicate by using one nesting operation followed by one linking selection.
However, examining the parameters of the nest operator, it is clear that higher levels nest by a prefix of the nesting attributes used by lower levels, and use part of the postfix of those nesting attributes as the nested attributes. For instance, see figure 3(b). To compute the linking predicate S.H > ALL {T.J}, we nest by the nesting attributes {R.B, R.C, R.D, S.E, S.H, S.I}; next, to compute R.B <> ALL {S.E}, we nest by the prefix {R.B, R.C, R.D} of the previous nesting attributes, and choose part of the postfix of the previous nesting attributes, {S.E, S.I}, as the nested attributes. This feature gives rise to an optimization of the original approach: doing all nesting operations first in a single step, followed by executing the linking selections one by one, instead of interleaving nesting and linking selection. This gives a feasible and efficient implementation because only the deepest (first) nesting involves a true (physical) reordering of the tuples in the relation; all others are conceptual. For example, the nest and linking selection operations in figure 3(b) can be rewritten as two consecutive nests followed by two linking selections. Even though there are still two nest operators, the operations can be done in one step. Note that the result of two consecutive nestings is a two-level nested relation. As pointed out in section 3, computing the linking predicate S.H > ALL {T.J} only involves S and T, so it can still be considered a linking selection over a one-level nested relation resulting from the projection of S and T on the two-level nested relation. Similarly, computing the linking predicate R.B <> ALL {S.E} can be regarded as a linking selection over the projection of R and S on the two-level nested relation.

4.2.2 Pipelining

Pipelining is possible in the context of our algorithm.
In particular, it seems clear that it should be possible to pipeline the linking selection with the nesting that is immediately adjacent to it; in some cases, the condition may be evaluated at the same time that the nesting is taking place. Thus, the cost of such plans can be further reduced even if no modification to the plan takes place.

4.2.3 Linear correlation

Algorithm 1 can be further optimized for some special queries to gain better performance. One such case is linear correlation. A nested query is linearly correlated if each inner query block is only correlated to its adjacent outer query block. Since the evaluation of an outer query block only depends on its adjacent inner query block, linearly correlated queries can be processed bottom-up instead of top-down. For instance, Query Q becomes a linearly correlated query by removing the correlated predicate T.K = R.C from the innermost query block and changing T.L <> S.I to T.L = S.I. Instead of top-down, this query can be efficiently processed bottom-up by performing nesting on the result of a left outer join of S and T with the corresponding selections pushed down, followed by computing the linking predicate S.H > ALL {T.J}; then nesting again on the result of a left outer join of R and the previous resulting tuples, followed by computing the linking predicate R.B <> ALL {S.E}. Note that pipelining can be applied for computing the linking predicate and nesting. Clearly, this strategy benefits from small intermediate results, since only qualified tuples participate in further (outer) join operations.

4.2.4 Push down nesting

Another idea is to push nesting operations down past an (outer) join. The original nested relational approach uses the standard approach to unnest the subquery, which may produce a very large intermediate relation for later processing. To avoid this problem, we can push the nesting operation down before the (outer) join. This is not always possible; the conditions under which it can be done are similar to the conditions for pushing a group-by operator down past a join [9]. In particular, one situation in which the push down is possible is when the nesting attribute is also the attribute in the condition of the join, and this condition is an equality. In symbols, if $R$ and $S$ are flat relations and $B, C \in sch(S)$, $A \in sch(R)$, then $\upsilon_{\{B\},\{C\}}(R \bowtie_{A=C} S) = R \bowtie (\upsilon_{\{B\},\{C\}} S)$. This is a common pattern in our approach. For example, consider Query Q with the third query block removed. It can be processed as follows: first, nest the relation S using $\upsilon_{\{S.G\},\{S.E,S.I\}}$ with the selection S.F = 5 pushed down. Note that these two steps can be pipelined. Then R is left outer joined with the resulting one-level nested relation on the predicate R.D = S.G, followed by computing the linking predicate R.B <> ALL {S.E}. The final result is obtained by the projection of the desired attributes.

4.2.5 Positive linking operators

Although the nested relational approach is focused on dealing efficiently with mixed and negative linking operators, we would like the approach to be efficient for positive linking operators as well. Existing approaches already evaluate positive linking operators very efficiently. In the case of IN, for instance, the linking predicate is transformed into a semijoin.
However, our approach would create an outer join, a nest and a selection. That is, for A IN (SELECT B FROM S...), it would generate the expression $\sigma_{A = \mathrm{SOME}\{B\}}(\upsilon_{\{A\},\{B\}}(R ⟕_{C} S))$, where A is the nesting attribute with $A \in sch(R)$, B is the nested attribute with $B \in sch(S)$, and C is the correlated predicate. The trick in these cases is to realize that the expression above can be simplified to $R \bowtie_{C \wedge A=B} S$. In a more general setting, the expression $\sigma_{A \theta \mathrm{SOME}\{B\}}(\upsilon_{\{A\},\{B\}}(R ⟕_{C} S))$ can be shown to be equivalent to $R \bowtie_{C \wedge A \theta B} S$. If, furthermore, projection push down shows that only attributes from R are needed, the join can be transformed into a semijoin. Thus, through algebraic rewriting, our approach can be shown to be equivalent to the standard one for positive cases. More discussion about positive linking operators is given later in the paper.

EXPERIMENTS AND PERFORMANCE ANALYSIS

In this section, we compare the performance of the nested relational approach with that of a popular commercial database management system (DBMS), which we call System A, evaluating nested queries in its latest version using its native approach. Our experiments focus on nested queries containing negative and mixed linking operators, which are not efficiently evaluated by direct unnesting using existing techniques.

5.1 Implementation

As described in the previous sections, our nested relational algebra is an extension of the standard relational algebra; thus only the nest operator and the linking selection operator are not supported by current DBMSs. To implement the nested relational approach, we wrote stored procedures in procedural SQL, an extension of SQL that adds programming language-like capabilities (variable declarations, loops and conditional statements). The program works in two stages: first, an SQL query unnests the nested query by executing (left outer) joins of the base relations in each query block, with the corresponding selections pushed down.
Second, code in the procedure implements the nest operator and the linking selection operator by processing the tuples fetched from the first stage, which we call the intermediate result. In order to simulate nesting effectively, we make the database sort the intermediate result. This is equivalent to implementing nest by sorting, which we believe is a realistic possibility (as with group-by, the two obvious options for implementing nest are sorting and hashing). We implemented two variants of the nested relational approach: the original nested relational approach implements the nest operator and the linking selection operator separately (which requires two passes over the intermediate result), and the optimized nested relational approach pipelines the nest operator and the linking selection operator (which requires only one pass over the intermediate result).

We use stored procedures to implement the nested relational approach for two reasons: (1) they run inside the database, so the communication overhead is reduced significantly compared to external processing; (2) they can be called by other applications, which makes the nested relational approach better suited for practical use. However, there is still communication overhead when the stored procedure fetches data from the SQL engine (as observed in [1], this is a considerable disadvantage that all experimental settings similar to ours must bear). For that reason, in reporting our results, one of the main parameters we use is the size of the intermediate result.

In our experiments, we created a TPC-H database [4] at scale factor 1 (total data size 1GB) in System A, hosted on a server with an Intel Pentium 4 2.8GHz processor, two 36GB SCSI disks, and 1GB memory, running Red Hat Enterprise Linux WS release 3. We configured a buffer cache of size 32MB, and installed all data and indexes on a single disk. B+ tree indexes on the primary key of each base table were automatically built by System A.
Additional indexes on the selected foreign keys were created manually when needed (more on this below).

5.2 Performance analysis

To verify the efficiency of the nested relational approach, we tested three queries derived from the TPC-H benchmark, together with their variations, at four different sizes. For each query, we measured the average execution time of multiple runs of the query as the primary performance metric. Before each run, the buffer cache of System A was flushed. The graphs of the results plot the elapsed time on the Y-axis and the size of each query block (outer/inner) on the X-axis. The size of each query block denotes the size of the base table (or a join of base tables) in a query block with the corresponding selections pushed down, but before the linking predicates are executed. We chose this size as a parameter because it directly relates to the intermediate result, which in turn relates to the overhead of fetching tuples from the SQL engine. This size is controlled by changing the constants in the selections and thus varying their selectivity factor. Note that the size of the final result is proportional to the size of the intermediate result.

Our first experiment was done on Query 1, which is a one-level nested query with an ALL linking operator.

Query 1:
    select o_orderkey, o_orderpriority
    from orders
    where o_orderdate>=x1 and o_orderdate<x2
      and o_totalprice > all (select l_extendedprice
                              from lineitem
                              where l_orderkey=o_orderkey
                                and l_commitdate<l_receiptdate
                                and l_shipdate<l_commitdate)

[Figure 4: Query 1. X-axis: Size of Query Block (1/2)]

The conditions o_orderdate>=x1 and o_orderdate<x2 and l_commitdate<l_receiptdate and l_shipdate<l_commitdate are used to regulate the size of each query block. The size of the outer query block ranges from 4K to 16K tuples, and the inner query block has 7K tuples. The attributes l_orderkey and o_orderkey are automatically indexed. The native approach evaluates Query 1 by nested iteration; that is, for each tuple of orders that satisfies the conditions o_orderdate>=x1 and o_orderdate<x2, the inner query block is computed once, and then the ALL linking predicate is evaluated. Note that each time the inner query block is computed, lineitem is accessed by index rowid, which is more efficient than a full access. For the nested relational approach, the stored procedure fetches tuples from System A, which performs an outer hash join of orders and lineitem on the correlated predicate l_orderkey=o_orderkey (requiring full accesses of orders and lineitem), and then processes the intermediate result in the nested relational way. The size of the intermediate result is about 4K, 81K, 123K and 165K for the four tests. Accordingly, the processing time of nest and linking selection is .24, .47, .71, .98 seconds for the original nested relational approach, and .3, .6, .1, .13 seconds for the optimized nested relational approach.

The experimental results are shown in figure 4. Both the original and the optimized nested relational approaches outperform the native approach, although the native approach benefits from indexes. One notable point about Query 1 is that, with a NOT NULL constraint on the attribute l_extendedprice, System A directly performs an antijoin of orders and lineitem, and the performance is about the same as ours. However, if the NOT NULL constraint is dropped, antijoin is not used, even though there are no null values in l_extendedprice. In general, the ALL or NOT IN linking predicate cannot be evaluated using antijoin when null values exist.

The second experiment was on two variations of Query 2, a two-level nested query. The term [any | all] refers to choosing either one.

Query 2:
    select p_partkey, p_name
    from part
    where p_size>=x1 and p_size<=x2
      and p_retailprice < [any | all] (select ps_supplycost
                                       from partsupp
                                       where ps_partkey=p_partkey
                                         and ps_availqty<y
                                         and not exists (select *
                                                         from lineitem
                                                         where ps_partkey=l_partkey
                                                           and ps_suppkey=l_suppkey
                                                           and l_quantity=z))

[Figure 5: Query 2a (mixed: ANY/NOT EXISTS)]

The conditions p_size>=x1 and p_size<=x2, ps_availqty<y and l_quantity=z are again used to give each query block a different size.
The size of the first query block ranges from 12K to 48K tuples; the second and the third query blocks have 16K and 12K tuples respectively. The attributes p_partkey and (ps_partkey, ps_suppkey) are automatically indexed. Additional indexes on the foreign keys of lineitem, l_partkey and l_suppkey, are created manually for fast access to data within lineitem. To test the effect of indexes on the native approach, we created a combined index on (l_partkey,l_suppkey) and two single indexes on l_partkey and l_suppkey respectively.

[Figure 6: Query 2b (negative: ALL/NOT EXISTS)]

Note that Query 2 is linear correlated, which implies that there are two possible approaches to the query: top-down and bottom-up. Our first variation of Query 2 is Query 2a, with the mixed ANY and NOT EXISTS operators. The native approach evaluates Query 2a bottom-up: it first performs an antijoin of partsupp and lineitem to form a view for the NOT EXISTS linking predicate, and then performs a semijoin of part and the resulting view for the ANY linking predicate. Each table is fully accessed once. The execution result is shown in figure 5. If the linking operators are any combination of ANY/SOME, IN, EXISTS and NOT EXISTS, the native approach is the same, that is, a combination of semijoins and/or antijoins.

However, if one of the linking operators is ALL or NOT IN, in the general case the native approach has to introduce nested iteration, which gives rise to our second variation of Query 2: Query 2b, with the negative ALL and NOT EXISTS operators. If there is a NOT NULL constraint on the attribute ps_supplycost, the native approach evaluates it in a manner similar to Query 2a, with two antijoins instead of one antijoin and one semijoin. However, in the general case, or if the NOT NULL constraint is dropped, the native approach can only unnest the NOT EXISTS linking predicate by antijoin, and performs nested iteration for the ALL linking predicate. Thus, for each tuple of part that satisfies the conditions p_size>=x1 and p_size<=x2, the native approach performs a nested loop antijoin of partsupp and lineitem using the combined index on (l_partkey,l_suppkey). Figure 6 shows the execution result. One important point is that the additional indexes on the foreign keys of lineitem play an important role in the nested loop antijoin: the native approach performs much worse if these indexes are not available.
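The dependence on the NOT NULL constraint described above follows from SQL's three-valued logic. The following is a minimal sketch, not System A's implementation: the function names `lt_all` and `naive_antijoin_keeps` are hypothetical, Python's `None` stands in for SQL NULL, and the values are made up for illustration. It compares the SQL meaning of `p_retailprice < ALL (...)` with a naive antijoin that simply looks for a violating inner tuple.

```python
from typing import Optional, Sequence


def lt_all(value: Optional[float], items: Sequence[Optional[float]]) -> Optional[bool]:
    """SQL-style `value < ALL (items)`: True, False, or None (unknown)."""
    unknown = False
    for item in items:
        if value is None or item is None:
            unknown = True          # comparison with NULL is unknown
        elif not value < item:
            return False            # one definite violation decides the result
    return None if unknown else True


def naive_antijoin_keeps(value: Optional[float], items: Sequence[Optional[float]]) -> bool:
    """Keep the outer tuple unless some inner tuple satisfies `value >= item`.

    A NULL never satisfies the join condition, so it is silently ignored.
    """
    return not any(
        value is not None and item is not None and value >= item
        for item in items
    )


# Hypothetical supply costs for one part; the second one is NULL.
retail_price = 10.0
supply_costs = [20.0, None]

# A WHERE clause keeps a tuple only when the predicate is True.
sql_keeps = lt_all(retail_price, supply_costs) is True
antijoin_keeps = naive_antijoin_keeps(retail_price, supply_costs)
print(sql_keeps, antijoin_keeps)
```

With no NULL in the nested set the two functions agree, which is consistent with the observation that the antijoin plan is only chosen when a NOT NULL constraint is declared.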
No matter what the linking operator is, for both queries the nested relational approach processes, in the nested relational way, the intermediate result obtained from System A, which executes outer hash joins of part, partsupp and lineitem. The size of the intermediate result is the same for both queries, that is, 14K, 29K, 44K and 58K, and thus the processing time of nest and linking selection is almost the same: .18, .36, .54 and .72 seconds for the original nested relational approach, and .2, .4, .6 and .8 seconds for the optimized nested relational approach.

For convenience of comparison, we use the same scale on the Y-axes of figure 5 and figure 6. Comparing figure 5 with figure 6, we obtain the following points for nested linear queries: (1) the nested relational approach has similar performance on nested linear queries regardless of the linking operators; (2) the performance of the native approach depends on the presence of the ALL or NOT IN linking operator: the native approach (semijoin and/or antijoin) performs significantly worse than the nested relational approach if the ALL or NOT IN linking operator is used (see figure 6), but slightly better than the nested relational approach when neither is used (see figure 5), which is partly because of the processing, but mostly because of the communication overhead required by the nested relational approach.

In the literature, antijoin and semijoin have been considered the most efficient ways of processing the NOT EXISTS predicate and the EXISTS or IN predicates, respectively. However, as pointed out before, when nulls exist antijoin cannot be used directly without a delicate transformation or a constraint. Furthermore, antijoin and semijoin cannot always be extended to evaluate multi-level queries, because either operation keeps the information of only one of the tables that participate in it, so the information from the other table required by further processing might be lost.
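The nested relational processing referred to in this comparison (nest by sorting, pipelined with the linking selection, as in the optimized variant of section 5.1) can be sketched for a one-level query such as Query 1. This is only an illustrative sketch under stated assumptions, not the stored procedure used in the experiments: the `Row` type, the function `nest_and_select`, and the sample rows are hypothetical, `None` stands in for SQL NULL, and a `None` in `l_orderkey` is assumed to mark a tuple padded by the left outer join.

```python
from itertools import groupby
from typing import Iterable, Iterator, NamedTuple, Optional, Tuple


class Row(NamedTuple):
    """One tuple of the sorted outer-join result of orders and lineitem."""
    o_orderkey: int
    o_orderpriority: str
    o_totalprice: Optional[float]
    l_orderkey: Optional[int]        # None: padded tuple, no matching lineitem
    l_extendedprice: Optional[float]


def nest_and_select(rows: Iterable[Row]) -> Iterator[Tuple[int, str]]:
    """One pass: nest by the orders attributes and evaluate
    o_totalprice > ALL {l_extendedprice} while each group is scanned."""
    def nesting_key(r: Row):
        return (r.o_orderkey, r.o_orderpriority, r.o_totalprice)

    # groupby only needs equal keys to be adjacent, which sorting guarantees.
    for (orderkey, priority, totalprice), group in groupby(rows, key=nesting_key):
        verdict: Optional[bool] = True       # ALL over an empty set is True
        for r in group:
            if r.l_orderkey is None:
                continue                     # padding, not a nested value
            if totalprice is None or r.l_extendedprice is None:
                if verdict is True:
                    verdict = None           # unknown, unless later falsified
            elif not totalprice > r.l_extendedprice:
                verdict = False
        if verdict is True:                  # the linking selection keeps only True
            yield (orderkey, priority)


sample = [
    Row(1, "1-URGENT", 100.0, 1, 50.0),
    Row(1, "1-URGENT", 100.0, 1, 80.0),
    Row(2, "2-HIGH", 100.0, 2, 50.0),
    Row(2, "2-HIGH", 100.0, 2, None),        # real NULL in the nested set
    Row(3, "3-MEDIUM", 100.0, None, None),   # no matching lineitem
    Row(4, "5-LOW", 100.0, 4, 150.0),
]
result = list(nest_and_select(sample))
print(result)
```

Because the verdict is updated tuple by tuple, no nested set is materialized, and the outer tuple survives the outer join even when its nested set is empty, which is the information a plain antijoin or semijoin would not carry forward.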
This led to our third experiment, on three variations of Query 3, which is derived from Query 2 by modifying the predicate ps_partkey=l_partkey in the third query block to p_partkey=l_partkey. This modification makes Query 3 a more general two-level nested query: the third query block is correlated to both of the other two query blocks. It also significantly affects the native approach. As in Query 2, the terms [all | any], [exists | not exists] and [= | <>] denote choosing either one.

Query 3:
    select p_partkey, p_name
    from part
    where p_size>=x1 and p_size<=x2
      and p_retailprice < [all | any] (select ps_supplycost
                                       from partsupp
                                       where ps_partkey=p_partkey
                                         and ps_availqty<y
                                         and [exists | not exists] (select *
                                                                    from lineitem
                                                                    where p_partkey [= | <>] l_partkey
                                                                      and ps_suppkey [= | <>] l_suppkey
                                                                      and l_quantity=z))

The size of each query block and the available indexes are the same as for Query 2. The variations of Query 3 are used to test mixed linking operators, positive linking operators, and negative linking operators, with equal and non-equal correlated predicates. The first variation is Query 3a, with the mixed linking operators ALL and EXISTS; the second variation is Query 3b, with two negative linking operators, ALL and NOT EXISTS; and the third variation is Query 3c, with two positive linking operators, ANY and EXISTS. In general, the optimizer generates a query plan depending not only on the linking operators but also on the correlated predicates. Thus, each variation again has three cases based on the correlated predicates of the third query block: (a) p_partkey=l_partkey and ps_suppkey=l_suppkey, (b) p_partkey<>l_partkey and ps_suppkey=l_suppkey, (c) p_partkey=l_partkey and ps_suppkey<>l_suppkey. For this discussion, we use the same scale on the Y-axes in figures 7, 8 and 9 for easy comparison.

[Figure 7: Query 3a (mixed: ALL/EXISTS). Panels: (a) Query 3a(a): p_partkey=l_partkey and ps_suppkey=l_suppkey; (b) Query 3a(b): p_partkey<>l_partkey and ps_suppkey=l_suppkey; (c) Query 3a(c): p_partkey=l_partkey and ps_suppkey<>l_suppkey]

[Figure 8: Query 3b (negative: ALL/NOT EXISTS). Panels: (a) Query 3b(a): p_partkey=l_partkey and ps_suppkey=l_suppkey; (b) Query 3b(b): p_partkey<>l_partkey and ps_suppkey=l_suppkey; (c) Query 3b(c): p_partkey=l_partkey and ps_suppkey<>l_suppkey]

It is important to point out that System A is unable to use antijoin in these queries, even though the NOT NULL constraint is present; this is due to the problems mentioned in section 2. System A has different plans for all three queries. For Query 3a and Query 3c, System A always tries to unnest the third query block for the EXISTS linking predicate, while for Query 3b, System A has to perform nested iteration over three query blocks. Also, System A treats different correlated predicates differently. More importantly, System A is greatly affected by indexes. We explain System A's behavior in detail.

For Query 3a with mixed linking operators (see figure 7), the ALL linking predicate has to be evaluated using nested iteration, but the EXISTS linking predicate can be unnested by nested loop join.
For each tuple of part that satisfies p_size>=x1 and p_size<=x2, lineitem is accessed by index rowid; for each index, the nested loop join is performed on partsupp using the index on (ps_partkey,ps_suppkey). For Query 3a(a) (see figure 7(a)) and Query 3a(c) (see figure 7(c)), the combined index on (l_partkey,l_suppkey) is used to access lineitem, while for Query 3a(b) (see figure 7(b)), the single index on l_suppkey is used. Comparing the plots in figure 7, we can see that Query 3a(b) performs much better than Query 3a(a) and Query 3a(c), which have similar performance. The reason is that the single index structure on l_suppkey is much smaller than the combined index structure on (l_partkey,l_suppkey).

For Query 3b with negative linking operators (see figure 8), the ALL and NOT EXISTS linking predicates have to be evaluated by nested iteration. For each tuple of part that satisfies p_size>=x1 and p_size<=x2, partsupp is accessed using the index on (ps_partkey, ps_suppkey), and in turn, lineitem is accessed using the appropriate index: the combined index on (l_partkey,l_suppkey) for Query 3b(a) (see figure 8(a)) and Query 3b(c) (see figure 8(c)); the single index on l_suppkey for Query 3b(b) (see figure 8(b)). Figure 8 shows that Query 3b(a) and
