Optimizing relational queries in connection hypergraphs: nested queries, views, and binding propagations


The VLDB Journal (1998) 7: 1-11
© Springer-Verlag 1998

Jia Liang Han
Bell Labs, Lucent Technologies, 6200 East Broad Street, Rm 0B115, Columbus, OH USA; jlhan@acm.org

Edited by Y. Vassiliou. Received September 1, 1993 / Accepted January 8, 1996

Abstract. We optimize relational queries using connection hypergraphs (CHGs). All operations, including value-passing between SQL blocks, can be set-oriented. By introducing partial evaluations, reordering operations can be achieved for nested queries. For a query using views, we merge CHGs for the views and the query into one CHG and then apply query optimization. Furthermore, we may simulate magic sets methods elegantly in a CHG. Sideways information-passing strategies (SIPS) in a CHG amount to partial evaluations of SIPS paths. We introduce the maximum SIPS strategy, which performs SIPS for all bindings and all SIPS paths for a query. The new method has several advantages. First, the maximum SIPS strategy can be more efficient than the previous SIPS based on simple heuristics. Second, it is conceptually simple and easy to implement. Third, the processing strategies may be incorporated into the search space for query execution plans, which is a proven optimization strategy introduced by System R. Fourth, it provides a general framework of query optimization and may potentially be used to optimize next-generation database systems.

Key words: Relational query optimization, Connection hypergraphs, Partial evaluations, SIPS, Search space

1 Introduction

One of the main advantages of the relational model is the declarativeness of its query languages. A declarative language frees application programmers from procedural details such as data structures, access paths to relations, and join methods, thus greatly simplifying the task of querying and programming. However, such power does not come for free.
Great effort is required to design and optimize relational database systems to achieve good performance. For a query in a high-level declarative query language, there are often a number of potential evaluation procedures. Which procedure is the most efficient depends on many factors, including the query type, the ranges of attributes (image sizes), the size of relations, and so on.

Query processing may be viewed as a sequence of basic operations. A basic operation (or an operation in short) refers to either a selection, a projection, a join, or a cartesian product. Query evaluation is usually divided into two phases. Phase one does not materialize relations but considers various query execution plans (QEPs). A QEP specifies procedural details, such as the order of operations, access paths to relations, buffer management, and management of temporary relations. For a not too simple relational query, the number of QEPs is large. One possible approach is to map all QEPs into a discrete search space, carry out an exhaustive search while using statistics and other information to estimate the computation cost of each QEP, and then choose the most efficient QEP. Such a strategy was proposed in System R (Selinger et al. 1979) and has proven to be effective in practice. In phase two, relations are materialized and operations are carried out to obtain the answer.

The relational language SQL allows nested queries and views. Nested queries and views are important, since they allow programmers to adopt a modular approach to application problems. They may also play an important role in some new-generation database systems, for example, object-oriented database systems. Query optimizers in System R and many current systems were designed for simple SQL queries, i.e., one query having only one SQL block (one SELECT-FROM-WHERE structure). They are often very inefficient when used to evaluate nested queries and views.
The inefficiency may be attributed largely to two causes: (1) the value-passing between SQL blocks is tuple-oriented, not set-oriented; (2) reordering operations is restricted to one SQL block.

To optimize nested queries, Kim (1982) proposed query rewriting methods which transform nested SQL queries into unnested ones. Mumick et al. (1990b) noted that Kim's transformations have similarities with semijoins (Bernstein and Chiu 1981) and magic sets methods (Bancilhon et al. 1986; Beeri and Ramakrishnan 1991; Ullman 1989). Magic sets methods were proposed to optimize recursive datalog queries. They rewrite programs and queries so that irrelevant tuples are not generated during query evaluation. The key to magic sets methods is sideways information-passing strategies (SIPS) (Bancilhon et al. 1986; Beeri and Ramakrishnan 1991; Ullman 1989), which pass binding information in a relation sideways (without evaluating the relation or query fully) to other relations. Mumick et al. (1990a,b,c) applied and extended magic sets methods to optimize relational queries, as well as datalog queries. Magic sets methods are more powerful than Kim's transformations and are applicable to general relational queries. Pirahesh et al. (1992) considered program transformations in the presence of duplicates and aggregates. Analytical and experimental results in these papers, and more recently in Mumick and Pirahesh (1994), showed that these program transformations may improve the performance of some queries by several orders of magnitude.

In this paper, we use the connection hypergraph (CHG) to optimize relational queries, including nested queries and views. The CHG was used in Ullman (1982, 1989) to give a graph version of the Wong-Youssefi algorithm (Wong and Youssefi 1976). Further study using the CHG to optimize unnested SQL queries was given in Han (1994a, 1995). The CHG was used to construct a search space of QEPs, which has several largely orthogonal dimensions (Han 1995): (1) evaluation orders of operations, (2) evaluation methods for each operation, (3) access paths to relations and operations restricted to one base relation, (4) various groupings of join clusters for pipelining.

In this paper, we first generalize the graph method to nested queries and views. Each SQL block is represented by one CHG. For a nested query, CHGs for SQL blocks are connected by association hyperedges into one CHG. Each view definition is represented by one CHG, and a query using views may be represented by a CHG merged from CHGs (association hyperedges might be required). We introduce partial evaluations of association hyperedges for the purpose of reordering operations and value-passing across SQL blocks. All operations in a CHG, including partial evaluations, can be set-oriented. Program transformations are unnecessary to evaluate nested queries and views efficiently.
Furthermore, partial evaluations may be applied to relation hyperedges as well. Partial evaluations may propagate bindings in a CHG and can be used to simulate magic sets methods elegantly. A sideways information-passing strategy (SIPS) path is a path that starts from a relation with some bindings and ends at another relation (a more precise definition is given in Sect. 5). There may be a few bindings in a query and many different SIPS paths available for binding propagations. To remove as many irrelevant tuples as possible during query evaluation, we propose the maximum SIPS strategy, which performs SIPS for all bindings and all SIPS paths. The search space for QEPs is revised to incorporate SIPS. An exhaustive search can then find the most efficient QEP. This addresses the interplay problem of cost estimates and program transformations mentioned in Mumick et al. (1990b). In addition, it is straightforward in a CHG to generalize SIPS to bindings of type A > a and the like.

The rest of this paper is arranged as follows. Section 2 introduces the CHG for SQL queries and shows query evaluation in a CHG. In Sect. 3, we examine some important query optimization heuristics: push-selections, project-out-irrelevant-attributes-early, and pipelining. The search space for QEPs is constructed. Section 4 addresses query optimization in a CHG for nested queries and queries using views. In Sect. 5, we explain binding propagations in a CHG and simulate magic sets methods by partial evaluations. We consider various SIPS paths and propose the maximum SIPS strategy. We modify the search space so that an exhaustive search may take SIPS into consideration. Summary and further research topics are given in Sect. 6.

2 Connection hypergraph and query evaluation

We first represent SQL queries by CHGs. The SQL language has many features and it is impossible to consider all of them here. We are limited to a subset of SQL in this paper.
An SQL query may be represented by a CHG (Ullman 1989; Han 1994a). An SQL query may have several SQL blocks, each of which has a SELECT-FROM-WHERE structure. A WHERE clause consists of a set of conditions joined by AND. Conditions may be classified into four categories: (1) $A = a$, (2) $A = B$, (3) $A \,\theta\, B$, where $\theta$ is one of $<, \le, \ne, >, \ge$, (4) $A \,\theta\, B$, where $\theta$ is one of [...], $\ne$, [...]. A simple SQL query is an SQL query of only one SQL block. A nested SQL query has more than one SQL block. Each SQL block may be represented by a CHG as follows.

1. An attribute $A$ of relation $R_i$ is represented by a node, $R_i.A$. If both $R_i$ and $R_j$ (they can be different occurrences of the same relation) have attribute $A$, we create different nodes $R_i.A$ and $R_j.A$. We use the terms node and attribute interchangeably.
2. A relation hyperedge is drawn for each relation in the FROM clause, which is a solid circle enclosing all its attributes.
3. To represent a condition of type (1) in a CHG, we label the node by $A = a$.
4. For a condition of type (3) or (4) in the WHERE clause, we draw a condition hyperedge, which is a dotted circle enclosing all nodes in the condition.
5. For a condition of type (2), we merge the corresponding nodes.
6. Attributes in the SELECT clause are known as distinguished attributes.
7. We draw a hyperedge, also a solid circle, which encloses the distinguished nodes. The hyperedge corresponds to the output relation.

A relation hyperedge or hyperedge corresponds to a relation; when there is no confusion, hyperedge and relation are used interchangeably. On notation, we use different fonts for hyperedge and for relation, e.g., emp as the hyperedge for the emp relation.

To represent a nested SQL query, we may draw a CHG for each SQL block and connect CHGs by association hyperedges. An association hyperedge is drawn as a dashed circle enclosing related attributes. Association hyperedges are labeled by IS IN, UNION, EXISTS, etc.

Example 2.1. The following example is taken from Mumick et al. (1990b). It has the following relations: emp(Eno, Ename, Sal, Bonus, Job, Dno, EKidsN), dept(Dno, Mgr, Loc), where Eno and Dno are employee number and department number, respectively, Mgr stands for manager, and Loc for location. The query below finds every senior programmer whose salary plus bonus is greater than 50,000 and whose department is located in San Jose.

    SELECT Ename, Mgr
    FROM emp, dept
    WHERE Job = 'Sr Programmer' AND Sal + Bonus > 50000
      AND emp.Dno = dept.Dno AND Loc = 'San Jose'

The above query is slightly different from that in Mumick et al. (1990b) (the original example had a subquery P(emp, dept)). The CHG for this query is shown in Fig. 1.

[Fig. 1. The CHG for Example 2.1]

Example 2.2. The following example is a variation (for the purpose of binding propagations later) of an example in Kim (1982). Consider relations shipment(SNo, PNo, JNo, Qty), part(PNo, PName, Dim, Price, Color). The nested SQL query below finds the supplier numbers of suppliers that supply parts of quantity = 20 whose unit price is greater than 25:

    SELECT SNo
    FROM shipment
    WHERE Qty = 20 AND PNo IS IN
      (SELECT PNo FROM part WHERE Price > 25)

The CHG for this SQL query is shown in Fig. 2.

[Fig. 2. The CHG for Example 2.2]

Definition 2.1. A base relation is a relation known originally in the database (usually stored on disk). A transient relation is a temporary relation generated during query evaluation and used in later evaluation steps.

We traverse the CHG for a relational query and mark hyperedges and conditions as we proceed. The traversal determines a sequence of events, $S = (E_1, \ldots, E_n)$, where an event $E_i$ can be either a relation hyperedge, a condition hyperedge, a condition, or an association hyperedge.

Definition 2.2. Any relation hyperedge is evaluable. A condition hyperedge is evaluable if the relation hyperedges it intersects precede it, i.e., they have been marked.
A condition (of type $A = a$) is evaluable if the relation hyperedge has been marked. (Conditions and condition hyperedges may be combined with retrieval of a relation in Sect. 3.)

Until Sect. 4, we consider only sequences in each of which all events are evaluable and without association hyperedges (except for the semantics on nested queries later in this section). We map each event in $S$ into a basic relational operation. The transient relation obtained after an event $E_j$ is denoted as $\mathrm{TRAN}(E_1, \ldots, E_j)$. The first event $E_1$ must be a relation hyperedge. We now define a procedure $\mathrm{EVAL}(\mathrm{TRAN}(E_1, \ldots, E_{j-1}), E_j)$, which takes $\mathrm{TRAN}(E_1, \ldots, E_{j-1})$ and $E_j$ and results in $\mathrm{TRAN}(E_1, \ldots, E_j)$.

1. Initially, $\mathrm{EVAL}(E_1)$ gives $\mathrm{TRAN}(E_1) = R_1$, where $R_1$ corresponds to the relation hyperedge $E_1$. The corresponding relational operation is just retrieval of $R_1$.
2. If $E_j$ is a condition or a condition hyperedge, then $\mathrm{EVAL}(\mathrm{TRAN}(E_1, \ldots, E_{j-1}), E_j)$ gives $\mathrm{TRAN}(E_1, \ldots, E_j) = \sigma_F(\mathrm{TRAN}(E_1, \ldots, E_{j-1}))$, where $F$ is the condition.
3. If $E_j$ is a relation hyperedge for relation $R_j$ which intersects the marked part of the CHG, then $\mathrm{EVAL}(\mathrm{TRAN}(E_1, \ldots, E_{j-1}), E_j)$ gives the natural join $\mathrm{TRAN}(E_1, \ldots, E_j) = \mathrm{TRAN}(E_1, \ldots, E_{j-1}) \bowtie R_j$.
4. If $E_j$ is a relation hyperedge not intersecting the marked part of the CHG, then $\mathrm{EVAL}(\mathrm{TRAN}(E_1, \ldots, E_{j-1}), E_j)$ gives the cartesian product $\mathrm{TRAN}(E_1, \ldots, E_j) = \mathrm{TRAN}(E_1, \ldots, E_{j-1}) \times R_j$.

For example, consider $S = (\mathit{emp}, \mathit{dept}, \ldots)$ in Example 2.1: $\mathrm{TRAN}(\mathit{emp}, \mathit{dept}) = \mathrm{TRAN}(\mathit{emp}) \bowtie \mathit{dept}$. In Example 2.2, let $S = (\mathit{shipment}, \mathit{part}, \ldots)$: $\mathrm{TRAN}(\mathit{shipment}, \mathit{part}) = \mathrm{TRAN}(\mathit{shipment}) \times \mathit{part}$.

After all hyperedges and conditions in a CHG have been marked and the corresponding relational operations have been carried out, we project the transient relation onto the distinguished attributes. This results in the relation for the output hyperedge, which is the answer to the query. After an intermediate step, some attributes in the transient relation may be neither in the query nor required in future processing.
They may be projected out. This issue will be addressed in detail in the next section.
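The EVAL steps above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the paper's implementation: relations are lists of dict rows, merged nodes are modeled by shared attribute names, and the helper names (`natural_join`, `eval_event`, `evaluate`) and the sample tuples are hypothetical. A natural join with no shared attribute degenerates to a cartesian product, which covers the non-intersecting case.

```python
from itertools import product

def natural_join(left, right):
    """Natural join of two lists of dict rows on their shared attribute names.

    With no shared names every pair matches, so the result is a
    cartesian product (the non-intersecting hyperedge case).
    """
    joined = []
    for l_row, r_row in product(left, right):
        shared = l_row.keys() & r_row.keys()
        if all(l_row[a] == r_row[a] for a in shared):
            joined.append({**l_row, **r_row})
    return joined

def eval_event(tran, event, db):
    """One EVAL step: map an event onto a basic relational operation."""
    kind, arg = event
    if kind == "rel":    # relation hyperedge: retrieval, join, or product
        rows = db[arg]
        return list(rows) if tran is None else natural_join(tran, rows)
    if kind == "cond":   # condition or condition hyperedge: selection
        return [row for row in tran if arg(row)]
    raise ValueError(f"unknown event kind: {kind}")

def evaluate(sequence, db, distinguished):
    """Run every event in order, then project onto the distinguished attributes."""
    tran = None
    for event in sequence:
        tran = eval_event(tran, event, db)
    return {tuple(row[a] for a in distinguished) for row in tran}

# Hypothetical sample data for the emp/dept query of Example 2.1.
db = {
    "emp": [
        {"Eno": 1, "Ename": "Ann", "Sal": 48000, "Bonus": 5000, "Job": "Sr Programmer", "Dno": 10},
        {"Eno": 2, "Ename": "Bob", "Sal": 40000, "Bonus": 2000, "Job": "Sr Programmer", "Dno": 10},
        {"Eno": 3, "Ename": "Cy", "Sal": 60000, "Bonus": 0, "Job": "Analyst", "Dno": 20},
        {"Eno": 4, "Ename": "Di", "Sal": 52000, "Bonus": 1000, "Job": "Sr Programmer", "Dno": 20},
    ],
    "dept": [
        {"Dno": 10, "Mgr": "Mia", "Loc": "San Jose"},
        {"Dno": 20, "Mgr": "Ned", "Loc": "Columbus"},
    ],
}
is_sr = ("cond", lambda r: r["Job"] == "Sr Programmer")
pay = ("cond", lambda r: r["Sal"] + r["Bonus"] > 50000)
in_sj = ("cond", lambda r: r["Loc"] == "San Jose")

# Two different evaluation orders for the same CHG.
s1 = [("rel", "emp"), is_sr, pay, ("rel", "dept"), in_sj]
s2 = [("rel", "dept"), in_sj, ("rel", "emp"), pay, is_sr]
answer1 = evaluate(s1, db, ("Ename", "Mgr"))
answer2 = evaluate(s2, db, ("Ename", "Mgr"))
```

Both sequences produce the same answer set, which is the point of treating the CHG as a declarative representation: the evaluation order affects cost, not the result.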

Algorithm 2.1.
(1) Traverse the CHG, which results in a sequence $S = (E_1, E_2, \ldots, E_n)$;
(2) for $j = 1$ to $n$ do
(3)   $\mathrm{EVAL}(\mathrm{TRAN}(E_1, \ldots, E_{j-1}), E_j)$;
(4) project $\mathrm{TRAN}(E_1, \ldots, E_n)$ onto the distinguished attributes, which results in the answer.

Theorem 2.1. Algorithm 2.1 terminates and evaluates the answer correctly.

Example 2.3. For a nontrivial CHG, there can be a large number of sequences due to different evaluation orders. Consider Example 2.1 and the sequence S = (emp, Job = 'Sr Programmer', Sal + Bonus > 50000, dept, Loc = 'San Jose'). The evaluation proceeds as follows. (1) Retrieve the emp relation. (2) Apply the condition Job = 'Sr Programmer', which amounts to a selection on emp. (3) Condition hyperedge Sal + Bonus > 50000 is now evaluable and is evaluated next. This corresponds to a selection on the previous transient relation and results in a new transient relation, which is a subset of the emp relation. (4) Evaluate the hyperedge dept. This corresponds to a natural join with the transient relation. (5) Evaluate the condition Loc = 'San Jose', which corresponds to a selection. Finally, we project the transient relation onto (Ename, Mgr) to obtain the answer to the query.

There are many other possible evaluation orders. Consider another sequence S = (dept, Loc = 'San Jose', emp, Sal + Bonus > 50000, Job = 'Sr Programmer'). The operations are as follows. (1) Retrieve the dept relation. (2) Apply the selection Loc = 'San Jose'. (3) Evaluate hyperedge emp, which corresponds to a natural join. (4) Apply the condition hyperedge Sal + Bonus > 50000. (5) Apply the selection Job = 'Sr Programmer', followed by the final projection. This gives another evaluation order.

Semantics of a nested query are often stated as "evaluate inner blocks first". To give a more precise definition, we introduce the dependency graph (DG) (Ullman 1988). Every node in a DG represents a relation, either a base relation or a derived relation (for a hyperedge).
In each SQL block, we draw an arc from each node for a relation in the FROM clause to the node for the block's output relation. We also draw an arc from the node for the relation of an inner block to the node for the relation of the outer block. For example, the DG for Example 2.2 is given in Fig. 3. If recursion is not allowed, as is assumed in this paper, then the DG is a DAG (directed acyclic graph). A topological order, strata, may be assigned to nodes in a DAG. Derived relations are evaluated bottom-up in a topological order (Ullman 1988; Han 1994b), and all relations may be determined in this way. The stratum of an association hyperedge is equal to the highest stratum of the relation hyperedges it intersects. Whether an association hyperedge is evaluable is determined as follows: an association hyperedge is evaluable if all the SQL blocks below its stratum have been marked. The operation for an evaluable association hyperedge is determined by the meaning of its label.

[Fig. 3. The DG for Example 2.2]

There is one optimization technique for memory management. If the marked part of the CHG consists of several disjoint components, then the transient relation may be regarded as a cartesian product of subrelations. It is more space-efficient to store such a transient relation as subrelations than to construct and store it explicitly. Such a decomposition technique was used in Wong and Youssefi (1976).

The CHG was used to represent QUEL queries (Ullman 1989). QUEL is a language based on relational calculus; thus, it is more declarative than SQL and may be represented by a CHG. Since SQL is based on both relational algebra and relational calculus, it has some procedural aspects. However, an SQL query may still be represented by a CHG because of the commutative and associative properties of relational algebra reviewed in Ullman (1989). This should come as no surprise, since relational algebra and relational calculus have the same expressive power.
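The bottom-up order on the dependency graph described earlier in this section can be illustrated with the standard-library `graphlib` module. This is a sketch under assumptions: the node names `inner(PNo)` and `outer(SNo)` are hypothetical labels for the derived relations of the two SQL blocks in Example 2.2, and numbering strata as "0 for base relations, otherwise one more than the highest predecessor" is one possible convention that the text does not spell out.

```python
from graphlib import TopologicalSorter

# Dependency graph for Example 2.2: each node maps to its predecessors,
# i.e. the relations it is derived from.
dg = {
    "part": set(),
    "shipment": set(),
    "inner(PNo)": {"part"},                    # inner block reads part
    "outer(SNo)": {"shipment", "inner(PNo)"},  # outer block reads shipment and the inner result
}

# A topological order lists every relation after all relations it depends on.
order = list(TopologicalSorter(dg).static_order())

# Assumed stratum convention: base relations at 0, derived relations one
# level above their highest predecessor.
strata = {}
for node in order:
    preds = dg[node]
    strata[node] = 0 if not preds else 1 + max(strata[p] for p in preds)
```

Evaluating derived relations in increasing stratum order realizes the "inner blocks first" semantics for this example.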
Declarativeness of the CHG representation implies many evaluation orders. The CHG approach to query evaluation has several advantages:

1. The CHG gives a natural abstract representation of queries. A logical unit of data in a relational database is a relation, which is represented by a hyperedge in the CHG.
2. All operations in the CHG representation can be set-oriented.
3. Using a CHG, it is possible to separate semantics from query evaluation. For example, adding duplicate adornments to a CHG is simple and the semantics are not connected to an evaluation.
4. The CHG is convenient for ordering basic operations and constructing a search space for QEPs. By enlarging the search space, we may incorporate many optimization strategies (more in later sections).
5. A CHG contains information on relation schemes and attributes sufficient to determine the relevance of attributes to the query (see the next section).
6. It is easy to generalize query optimization in a CHG to nested queries and views.
7. Graph algorithms have been well studied and efficient algorithms are well known.

3 The search space

In this section, we review results presented in Han (1995). We first describe in a CHG two well-known optimization heuristics, push-selections and project-out-(irrelevant)-attributes. These heuristics simplify the search space, while

leading to the most efficient or close to the most efficient QEP.

The push-selections heuristic performs selections before joins or cartesian products. Selections remove tuples irrelevant to a query. This has cascading effects if performed early. The push-selections heuristic often reduces the total cost by several orders of magnitude. There can be extreme cases in which this heuristic does not lead to the most efficient QEP. To be certain of finding the most efficient procedure, we may consider all evaluation orders, including those performing selections after joins or products. However, this significantly increases the size of the search space. For example, consider a query of three selections and two joins. If we use the push-selections heuristic, then only two operations need to be ordered. On the other hand, there may be as many as 5! evaluation orders (some may be invalid) without the heuristic. Since extreme cases are rare and the heuristic still performs well even when it does not find the best QEP, it is widely used.

The following two rules realize the push-selections heuristic in the CHG representation.
(R1) If possible, use a constraint (a condition of type A = a, A > a, ... or a restricted set of values for an attribute) to retrieve a relation.
(R2) Evaluate a condition or a condition hyperedge as soon as it is evaluable.
Note that (R1) combines two operations into one and, strictly speaking, might not observe the evaluable condition in the previous section.

During query evaluation, some attributes of a transient relation may be neither queried nor useful in future processing. These attributes may be projected out to reduce the size of a transient relation.

Definition 3.1. An attribute in a relation is irrelevant to a query if projecting it out from this relation will not change the answer to the query. Otherwise, the attribute is relevant. A relation is irrelevant to a query if its deletion will not change the answer to the query.
Otherwise, the relation is relevant.

Note that the relevance concept is used for three different objects in this paper: relations, attributes, and tuples in a relation. After irrelevant attributes are projected out, we obtain a subrelation of the original relation. After irrelevant tuples are removed or filtered out, for example by push-selections or sideways information passing, we obtain a subset of the original relation. Relevance of relations and attributes may be determined by the following theorems (Han 1995).

Theorem 3.1. A relation hyperedge is irrelevant to a query iff it is not connected to any distinguished node. All nodes in an irrelevant hyperedge are irrelevant.

Theorem 3.2. Assume that initially all hyperedges are relevant. A node in the marked part of a CHG (an attribute in the transient relation) is relevant to the query iff either it is a distinguished node or it is in a hyperedge or condition yet to be marked.

The following rule corresponds to the project-out-(irrelevant)-attributes-early heuristic:
(R3) After a hyperedge has been marked, project out irrelevant attributes from the transient relation.

As an example, in Fig. 1, after the emp hyperedge and the condition Job = 'Sr Programmer' have been evaluated, attributes Eno, Job, and EKidsN become irrelevant and may be projected out. If the condition hyperedge Sal + Bonus > 50000 has also been evaluated, then attributes Sal and Bonus are irrelevant as well.

Although in principle the projection heuristic can offer great savings, in practice the savings may not be as significant as they look. Even though transient relations are usually large, if pipelining is used, e.g., in nested-loop methods, then they need not be materialized. Projecting out irrelevant attributes in main memory is useful, but its impact on the total cost is not great. This is probably why this heuristic is not used as widely as it could be. If disk is used for temporary relations, then this heuristic should be applied.
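The relevance test of Theorem 3.2 and rule (R3) can be sketched as a small set computation. This is a hedged illustration, not code from the paper: the function name `relevant_nodes` and the dictionary encoding of unmarked hyperedges and conditions are assumptions made for the example, and merged nodes are again modeled by a shared attribute name (Dno).

```python
def relevant_nodes(transient_attrs, distinguished, unmarked):
    """Attributes of the transient relation that must be kept (Theorem 3.2).

    An attribute stays relevant if it is distinguished or if it occurs in
    some hyperedge or condition that has not been marked yet.
    `unmarked` maps each pending hyperedge/condition to its node set.
    """
    pending = set().union(*unmarked.values())
    return {a for a in transient_attrs if a in distinguished or a in pending}

# State of Fig. 1 after emp and the condition on Job have been marked.
emp_attrs = {"Eno", "Ename", "Sal", "Bonus", "Job", "Dno", "EKidsN"}
distinguished = {"Ename", "Mgr"}
unmarked = {
    "Sal + Bonus > 50000": {"Sal", "Bonus"},
    "dept": {"Dno", "Mgr", "Loc"},
    "Loc = 'San Jose'": {"Loc"},
}
keep = relevant_nodes(emp_attrs, distinguished, unmarked)
drop = emp_attrs - keep

# After the condition hyperedge on Sal + Bonus has also been marked.
del unmarked["Sal + Bonus > 50000"]
keep_later = relevant_nodes(keep, distinguished, unmarked)
```

Under these assumptions `drop` contains Eno, Job, and EKidsN, and after the salary condition is marked only Ename and Dno remain, matching the Fig. 1 discussion.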
Pipelining is important in reducing costs related to transient relations. Consider two adjacent operations, $E_i$ and $E_{i+1}$. The cost of $E_i$ consists of $C_i^{in}$, $C_i^{aux}$, $C_i^{cpu}$, and $C_i^{out}$, for the input, auxiliary relations, CPU cost, and output, respectively. An auxiliary relation is a temporary relation created to facilitate an operation. For example, sorted relations used for a sort-join are auxiliary relations. The cost of $E_{i+1}$ consists of similar terms. If we do not write $\mathrm{TRAN}(E_1, \ldots, E_i)$ to disk but use it directly to evaluate $E_{i+1}$, known as pipelining $E_i$ to $E_{i+1}$, we save $C_i^{out}$ and $C_{i+1}^{in}$. Since transient relations are usually large, pipelining often gives great savings. However, pipelining is not always possible or efficient, mainly because pipelined operations compete for resources, especially memory space.

One type of pipelining, pipelining a join/cartesian product to a sequence of selections and/or projections, is always feasible and does not cause adverse effects in efficiency (Han 1995). This forms operation clusters, each cluster having one join or cartesian product followed by zero or more selections/projections. Thus, the problem of evaluation orders is reduced to ordering all relation hyperedges in a CHG.

Additional pipelining is also possible. In particular, more than one join may be pipelined into a join cluster. However, there are restrictions for such pipelining. If $\mathrm{TRAN}(E_1, \ldots, E_i)$ is large with respect to the buffer size, then $E_{i+1}$ cannot use join methods that require auxiliary relations. Some join methods, mainly sort-join, require auxiliary relations. To take this into consideration, we group joins into various join clusters.

We may now construct the search space. The search space consists of four largely orthogonal dimensions: (1) evaluation orders of operations, (2) evaluation methods for each operation, (3) access paths and operations on one base relation, and (4) various groupings of join clusters.
As an example for (2), there are many potentially efficient methods to evaluate a join. For (3), operations on one base relation and access paths are considered together because both evaluate one relation hyperedge. Orthogonality of these dimensions means that various choices in a dimension are available after other dimensions are fixed. Let $d_i$ be the number of possible choices for dimension $i$. $d_1 = e!$, where $e$ is the number of relation hyperedges. $d_2$ is determined by the system designer. $d_3$ depends on the query. $d_4 = 2^{e-1}$. The total number of choices, $D$, is

$$D = d_1 \, d_2 \, d_3 \, 2^{e-1}. \qquad (3.1)$$

If all the dimensions are orthogonal to each other, then $D = e! \, d_2 \, d_3 \, 2^{e-1}$. An exhaustive search may be used to find the best QEP among $D$ states. If the number of states is so large that an exhaustive search is impractical, then some optimization strategies proposed in Swami and Gupta (1988) and Swami (1989) may be considered. The experiments in Swami and Gupta (1988) and Swami (1989) used a cost model for main memory databases. More studies may be required for other cost models and on pipelining.

4 Nested queries and queries using views

In many current database systems, query optimizers were designed for unnested SQL queries. When such a system processes a nested query, it first partially evaluates the outer block, resulting in some value (either a constant or a tuple). This is passed to the inner blocks. The system then evaluates the inner blocks using this value. Results from the inner blocks are passed back to the outer block and are used to finish processing the outer block. This simple method has two performance problems. (1) The value-passing between blocks is tuple-oriented, not set-oriented. The inner blocks are evaluated once for each value passed from the outer block. (2) The search for an efficient evaluation order is restricted to one SQL block. Similar problems also exist for queries using views.

To solve the above problems, Kim (1982) proposed methods to transform nested queries into unnested ones. Magic sets methods also carry out query transformations (Mumick et al. 1990a,b,c). In this section, we show that the above inefficiency problem may be easily solved by a processing strategy in the CHG representation. In addition, we incorporate the processing strategy into the search space.

4.1 Nested queries

First, in a CHG, all operations may be implemented as set-oriented, not tuple-oriented.
For example, in Example 2.2, when evaluation proceeds from the outer block to the inner block, a set of values of PNo instead of a single value may be passed to the inner block. Therefore, the first performance problem of nested queries does not exist.

Reordering operations for a nested query is also possible in the CHG representation. For operations within one SQL block, reordering can be carried out as before. Let us consider the meaning of reordering operations of different SQL blocks. First, according to the semantics, an association hyperedge is not evaluable until all SQL blocks below its stratum have been evaluated. This condition will be relaxed and an association hyperedge may be evaluated partially. Second, it is usually inefficient to evaluate hyperedges in other SQL blocks, since the corresponding operation is a cartesian product. However, partial evaluations of association hyperedges change this. An example is given below.

Example 4.1. Consider Example 2.2. The condition Qty = 20 is a selection on the relation shipment and may be evaluated early. This results in a transient relation (SNo, PNo) after projecting out irrelevant attributes. At this point, the association hyperedge is not evaluable, since the semantics require that the inner block be evaluated before the association hyperedge. However, we may project TRAN(SNo, PNo) onto PNo, which results in a set {PNo}. According to the meaning of IS IN, tuples in the relation part whose PNo value is not in the set {PNo} are irrelevant to the query. Thus, this set may be used to restrict the relation part when the inner block is evaluated. Evaluation of the inner block results in a set {PNo}. This set is then passed back to the outer block and joined with the previous transient relation (SNo, PNo). Finally, we project the result onto attribute SNo to obtain the answer.

Definition 4.1. A subtransient relation is a subrelation of a transient relation, i.e., a projection of a transient relation.

Definition 4.2.
A partial evaluation of an association hyperedge is as follows. (1) Project the transient relation onto those attributes in the association hyperedge, resulting in a subtransient relation. (2) Find the set of attributes intersecting other parts of the CHG. (3) Use the subtransient relation and the meaning of the association hyperedge (IS IN, UNION, EXISTS, etc.) to obtain possible values for these attributes, i.e., new relations known as association relations, one association relation for each SQL block.

A partial evaluation results in one or more association relations, which are then used to evaluate the inner block(s). (Query optimization using sideways information passing is another type of partial evaluation, which is addressed in Sect. 5.) Partial evaluations allow us to arrange association hyperedges or relation hyperedges in any order. If an association hyperedge is not evaluable, we may evaluate it partially. A partially evaluated association hyperedge needs to be evaluated fully again when all SQL blocks below its stratum have been evaluated. This may be represented in the event sequence $S$ by adding one additional association hyperedge. Normally, the place of such an addition should be immediately after the association hyperedge becomes evaluable, since such an operation usually reduces the size of the transient relation. More formally, let the event sequence be $S = (E_1, \ldots, A_i, E_j, \ldots, E_k, \ldots)$, where $A_i$ is an association hyperedge that is not evaluable and $E_k$ is the last hyperedge whose evaluation makes $A_i$ evaluable. The sequence will be rewritten as $S' = (E_1, \ldots, A'_i, E_j, \ldots, E_k, A_i, \ldots)$, where $A'_i$ denotes a partial evaluation. The last $A_i$ is evaluated fully. Attributes in an association hyperedge are relevant to the query until it is fully evaluated.

A transient relation after a partial evaluation usually may be decomposed into the original transient relation and the relevant part of the association relations.
As before, it is unnecessary to construct a decomposable relation explicitly. We may store the original transient relation on disk and use the relevant part of the association relations to evaluate the inner blocks. For example, in Example 4.1, we may keep the transient relation (SNo, PNo) on disk and use only the set {PNo} after the partial evaluation to process the inner block. The transient relation for the outer block is used only

after the inner block has been evaluated and the results are passed back to the outer block.

Fig. 4. The CHG for Example 4.2

Example 4.2. Consider query (C) in Mumick et al. (1990b), which was used as an example for magic sets methods.

    SELECT Ename
    FROM emp e1
    WHERE Job = 'Sr Programmer'
      AND Sal > (SELECT AVG(e2.Sal)
                 FROM emp e2
                 WHERE e2.dno = e1.dno)

The CHG for this query is shown in Fig. 4. Note that the two SQL blocks overlap. The main causes of inefficiency given in Mumick et al. (1990b) are: (1) tuple-oriented (repeated computation of the same department); (2) a fixed evaluation order. The first problem does not exist for the CHG approach. The second problem has been addressed above.

This query has several interesting properties. First, here the inner block and the outer block are not disjoint. The condition e2.dno = e1.dno in the inner WHERE clause merges the two blocks. Second, this query has an aggregate operator AVG. A partial evaluation is often not possible when aggregate operators are involved. Aggregate operators and grouping operations (GROUP BY) should be carried out with care, because a set can be empty (Han 1994b).

It is possible that more than one partial evaluation may be applied to one association hyperedge (although probably of little practical significance). Suppose a transient relation intersects an association hyperedge and a partial evaluation passes the relevant values to the inner block. Later, more hyperedges or conditions in the outer block might be evaluated, which results in a new transient relation. A projection of the new transient relation onto the intersecting attribute gives a set of values that usually is a subset of the values passed earlier. The new values may be passed through the association hyperedge again to replace the old ones.
In addition, a partial evaluation may be made either from an inner block to the outer block or from an outer block to the inner blocks.

4.2 A method to enumerate QEPs

Figure 5 is an algorithm which enumerates all QEPs according to our analysis so far. It has four loops, each for one search dimension.

Algorithm 4.1. (search space for queries with or without nesting)
(1) order relation and association hyperedges into lists {L1};
(2) for each list L1 do
(3)   for each hyperedge do
(4)     if it is an unevaluable association hyperedge A_i
(5)     then add A_i to L1 after the last hyperedge below its stratum;
(6) pipeline conditions and condition hyperedges;
(7) the result is lists {L1};
(8) for each list L1 do
(9)   for each relation hyperedge do
(10)    enumerate evaluation methods for the corresponding operation;
(11) the result is lists {L2};
(12) for each list L2 do
(13)   for each hyperedge do
(14)     enumerate access paths;
(15) the result is lists {L3};
(16) for each list L3 do
(17)   for each hyperedge do
(18)     enumerate join clusters;
(19) the result is lists {L4}; /* {L4} is the search space for QEPs */
Fig. 5. An algorithm to enumerate QEPs

Lines (4) and (5) determine whether or not an association hyperedge $A_i$ is evaluable for the given evaluation order in L1. If not, we insert $A_i$ in L1 after the last hyperedge below its stratum, as discussed earlier. The newly added hyperedge can be fully evaluated, while the preceding hyperedge has to be evaluated partially. This algorithm may be combined with an appropriate cost model to find the most efficient QEP. Its complexity is bounded by the complexity of the search space. In a practical implementation, the designer might use heuristics to reduce the search space by pruning unlikely choices.

4.3 Queries using views

For an SQL query using views, we first draw one CHG for each view and for the query.
If a relation is given by one and only one view, then we merge the view CHG with the query CHG, i.e., merge the hyperedge of the view with the corresponding hyperedge in the query CHG. If a relation is defined by several views, then we need to specify the semantics. We adopt the semantics used in Starburst (Mumick et al. 1990b), i.e., a derived relation is the union of all the relation definitions. This coincides with Prolog. In a CHG, we use an association hyperedge labeled UNION to connect these CHGs. After a CHG has been constructed, we may apply the usual query optimization strategy (Fig. 5).

Example 4.3. The following example is from Pirahesh et al. (1992). The view keeps the item number and vendor names for an item that vendors have supplied since the year 85.

    CREATE VIEW itpv AS
      (SELECT DISTINCT itp.itemn, pur.vendn
       FROM itp, pur
       WHERE itp.ponum = pur.ponum AND pur.odate > 85)

    SELECT itm.itemn, itpv.vendn
    FROM itm, itpv
    WHERE itm.itemn = itpv.itemn AND itm.itemn >= 01 AND itm.itemn < 20

Fig. 6. a The CHG for the original query in Example 4.3. b The CHG merged from a

The merged hypergraph is shown in Fig. 6. From experimental results shown in Pirahesh et al. (1992), current database systems do not evaluate such queries efficiently. The main causes of inefficiency are the same as for nested queries: (1) tuple-oriented value-passing between the query and views; (2) restricted reordering of operations among the query and views. Magic sets methods (Mumick et al. 1990b; Pirahesh et al. 1992) have been proposed to rewrite queries using views. In our approach, we first obtain a CHG by merging the query and views, then use the same strategy as for nested queries. The CHG approach removes the causes of inefficiency. Similar to nested queries, this method integrates cost estimates and evaluation methods and can find the most efficient QEP after an exhaustive search.

5 Binding propagations and magic sets methods

In Sect. 5.1, we simulate magic sets methods in a CHG. The maximum SIPS strategy is proposed in Sect. 5.2. In Sect. 5.3, we enumerate QEPs in combination with the maximum SIPS strategy.

5.1 Simulate magic sets methods

Magic sets methods were proposed to optimize recursive datalog programs (Bancilhon et al. 1986; Beeri and Ramakrishnan 1991; Ullman 1989). Magic sets methods use binding information in a query to rewrite the program and query so that irrelevant tuples are not generated during query evaluation. Note that the relevance here, which refers to a subset of a relation, is different from the relevance for projection in Sect. 3, which refers to a subset of attributes in a relation.
One key idea of magic sets methods is SIPS, which means passing binding information in a query or relation to other relations without computing the relation fully. Mumick et al. (1990a,b,c) observed that Kim's transformations, semijoin methods, and magic sets methods share a common heuristic: filtering out irrelevant tuples in query evaluation. They applied magic sets methods to relational queries and showed that the former are often more powerful and more efficient than other program transformations.

SIPS and magic sets methods may be simulated elegantly in the CHG representation. Consider Example 2.1 again. Suppose we first evaluate emp. As discussed in Sect. 3, we may use the binding Job = 'Sr Programmer' to restrict tuples in emp so that only tuples of senior programmers are constructed for the transient relation. However, even among these tuples, many perhaps work at a department not located in San Jose. These tuples are also irrelevant to the final answer and it is more efficient if they are removed. To do so, we may first find those departments that are located in San Jose. This can be achieved by looking up the dept relation and finding the corresponding values of Dno as a set {Dno}. Now there are two conditions on the emp relation, Job = 'Sr Programmer' and {Dno}. One condition may be used to retrieve relevant tuples in emp and the other to restrict the tuples. Which choice is more efficient depends on the indices and on the values. This may be incorporated into the search dimension on access paths to emp.

Definition 5.1. Suppose R1 has some bindings and intersects with other hyperedges. R1 may be evaluated partially as follows. (1) Apply the bindings to relation R1; this results in a temporary relation T. (2) Perform semijoins of T onto the intersecting hyperedges, which results in subtransient relations, known as magic relations, or magic sets when the arity is one. (We borrow the term magic from magic sets methods.)
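As a minimal sketch of Definition 5.1 on the emp/dept example, the Python below takes the magic relation to be the projection of T onto the attributes shared with the intersecting hyperedge, and then restricts that hyperedge with it (a semijoin). The attribute name Loc, the helper names partial_evaluate and restrict, and all data values are assumptions; Example 2.1 is not reproduced in this section.

```python
def partial_evaluate(relation, bindings, shared_attrs):
    """Steps (1) and (2) of Definition 5.1 for one intersecting hyperedge."""
    # Step (1): apply the bindings, giving the temporary relation T.
    temp = [
        t for t in relation
        if all(t[attr] == value for attr, value in bindings.items())
    ]
    # Step (2): keep only the shared attributes, giving the magic relation.
    return {tuple(t[attr] for attr in shared_attrs) for t in temp}


def restrict(relation, magic, shared_attrs):
    """Semijoin: keep the tuples whose shared attributes occur in the magic relation."""
    return [
        t for t in relation
        if tuple(t[attr] for attr in shared_attrs) in magic
    ]


# Hypothetical data; Loc is an assumed attribute name for the department location.
dept = [
    {"Dno": "d1", "Mgr": "m1", "Loc": "San Jose"},
    {"Dno": "d2", "Mgr": "m2", "Loc": "New York"},
    {"Dno": "d3", "Mgr": "m3", "Loc": "San Jose"},
]
emp = [
    {"Eno": "e1", "Job": "Sr Programmer", "Dno": "d1"},
    {"Eno": "e2", "Job": "Sr Programmer", "Dno": "d2"},
    {"Eno": "e3", "Job": "Clerk", "Dno": "d1"},
    {"Eno": "e4", "Job": "Sr Programmer", "Dno": "d3"},
]

# Partial evaluation of dept yields the magic set {Dno}.
magic_dno = partial_evaluate(dept, {"Loc": "San Jose"}, ["Dno"])
# The magic set restricts emp; the binding on Job is applied as a second condition.
candidates = restrict(emp, magic_dno, ["Dno"])
relevant = [t for t in candidates if t["Job"] == "Sr Programmer"]

print(sorted(magic_dno))
print([t["Eno"] for t in relevant])
```

Which of the two conditions on emp is applied first is exactly the access-path choice mentioned above; the sketch fixes one order only for brevity.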
In the above example the first step evaluates the dept relation partially, which results in a set (the magic set) {Dno}. If the first step evaluates dept fully, then both Dno and Mgr are relevant to the query and should be kept in the transient relation (Dno, Mgr), which is a procedure already considered in the search space in Sect. 3. The magic set {Dno} has the minimum size to pass sideways to the emp relation. Hyperedge dept is evaluated twice, and only partially the first time. Partial evaluation separates magic sets methods from other heuristics in Sect. 3. Partial evaluation of a relation hyperedge is similar to the partial evaluation of an association hyperedge discussed before; both evaluate a hyperedge partially and project the result onto a relation scheme for further evaluation. The difference is that, for an association hyperedge, partial evaluation may have to be used, because the hyperedge may be unevaluable, while here partial evaluation is used for the purpose of optimization.

The above query processing strategy is quite complex. Let us understand why it may improve efficiency. If all bindings in a query are in the same base relation, then the push-selections heuristic together with a search on the access paths (discussed in Sect. 3) may be sufficient. If bindings appear in different relations, then it might be beneficial to perform a semijoin to pass the binding information in a relation, say, R1, to another relation, R2. The result of the semijoin may then be used to restrict R2, which reduces the size of R2. (Note that this in fact introduces an additional join. However, the cost of such a join is on the order of a selection, since one relation scheme encloses the other scheme completely.) The reduced R2 is then joined with R1. The cost of the final join is usually less than that of the original join.

We may also view the above strategy in a different way. Computation cost in query evaluation may be classified into three levels. Level 0 is at the scheme level, which costs almost nothing when compared with other operations. Level 1 includes selections and projections, which involve one relation and are not expensive. Level 2 includes joins and products and is the most expensive. In general, it pays to perform operations at a lower level as much as possible to reduce costs at a higher level. Magic sets methods require more low-level operations, but save the costs of some joins.

5.2 Maximum SIPS

For a query that is not too simple, there may be a few bindings and many ways for SIPS. We introduce a new graph named cograph to illustrate SIPS paths.

Definition 5.2. A cograph is an ordinary undirected graph for a CHG. Each relation or association hyperedge corresponds to one node in the cograph. An edge is drawn in the cograph between two nodes if their corresponding hyperedges intersect. We indicate a binding or bindings for a relation by adding a prime to the relation.

Definition 5.3. By a path in a CHG we refer to a sequence of relation hyperedges, association hyperedges, and sets of nodes (intersections of hyperedges) which maps one-to-one to a path in its cograph.

Definition 5.4.
A SIPS path is a path in a cograph that starts from a relation (node) with some bindings, known as the initial relation, and ends at a relation (node), known as the destination relation. For example, in Example 2.1, there are two SIPS paths: one from emp to dept and another from dept to emp.

Definition 5.5. A SIPS operation is a sequence of semijoins along a SIPS path (partial evaluations) that results in a subrelation (known as the magic relation, a magic set if its arity is one) of the destination relation.

Definition 5.6. The maximum SIPS strategy maximizes the effects of SIPS on a relation by performing SIPS operations for all bindings in the query and over all SIPS paths for this relation.

Let us focus on one destination relation R. Bindings of other relations may be passed sideways to R. There can be many SIPS paths. SIPS from different bindings and paths usually have different effects on R. If a cograph contains cycles, there can be an infinite number of SIPS paths. To solve this problem, we note that a cyclic SIPS path is not useful as far as SIPS are concerned. A magic relation generated by an acyclic SIPS path contains the magic relations generated by those cyclic SIPS paths that correspond to the acyclic one. To preserve the completeness of the answer, only acyclic SIPS paths need to be used for SIPS operations.

Example 5.1. To illustrate various SIPS paths, consider a query on abstract relations: R1(A, B, C, D), R2(E, F, G, H), R3(J, K, L).

    SELECT C
    FROM R1, R2, R3
    WHERE R1.A = a AND R1.C = R2.E AND R1.D = R2.F
      AND R1.A = R3.J AND R2.H = R3.L AND R3.K = k

Its CHG is drawn in Fig. 7 and its cograph in Fig. 8. There are two bindings in the query: R1.A = a in R1 and R3, and R3.K = k in R3. The potential SIPS paths for R2 are R1 → R2, R3 → R2, R3 → R1 → R2 (R1 → R3 → R2 is ignored, because the binding on R1 is contained by the bindings on R3). We need not consider cyclic SIPS paths, e.g., R1 → R2 → R3 → R1 → R2.
Consider the SIPS path R1 → R2. Since R1 and R2 intersect on two attributes, C, D, the SIPS operation gives a magic relation (C, D) for R2. The SIPS path R3 → R2 determines a magic set (H) for R2. Note that both (H) and (C, D) restrict R2. In fact, we may take the benefit of all bindings if the SIPS operations keep the magic relation (C, D, H) for R2.

The above example has an interesting feature. Different SIPS paths for the same destination relation may give magic relations of different relational schemes, all being subrelations of the destination relation. A natural join of these magic relations results in a large magic relation. For the above example, this is in fact a cartesian product. However, we cannot claim that this simple approach finds the most restrictive magic relation, since information could be lost when semijoins are performed on SIPS paths. Further research is required on finding the most restrictive magic relation efficiently. The term maximum SIPS strategy does not mean that the obtained magic relation is the most restrictive. It simply means that SIPS are carried out for all bindings and all SIPS paths.

Example 5.2. Let us see how the maximum SIPS strategy will process the query in Example 5.1. Consider the event sequence S = {R1, R2, R3}. For R1, there is one SIPS path R3 → R2 → R1 (another SIPS path R3 → R1 is ignored since the constraint is contained by the binding on R1). The magic relation is (C, D). For R2, the SIPS paths have been given in Example 5.1. For R3, there is one SIPS path R1 → R2 → R3. Other possible event sequences are {R1, R3, R2}, {R2, R1, R3}, {R2, R3, R1}, {R3, R1, R2}, {R3, R2, R1}.

It is not difficult to implement the maximum SIPS strategy. Let the event sequence be S. Every relation R in S can be a destination relation. For a node corresponding to

a destination relation, we first find all acyclic SIPS paths in the cograph. Various graph algorithms, including depth-first search, may be used for this purpose. SIPS paths in the CHG are easy to obtain, since there is a one-to-one mapping between paths in the CHG and paths in the cograph. Clearly, given a CHG with bindings, we may carry out the maximum SIPS strategy.

Fig. 7. The CHG for Example 5.1

Fig. 8. The cograph for Fig. 7

Example 5.3. As a practical example on SIPS, consider the query in Example 4.2. Let us first pass the binding on emp1 to emp2. A partial evaluation of emp1 with binding Job = 'Sr Programmer' gives the magic set {Dno}. {Dno} is used to evaluate emp2, which results in a transient relation (Avg(Sal), Dno) after projecting out irrelevant attributes. The transient relation is then joined with emp1, followed by evaluation of the association hyperedge. Finally, a projection is used to obtain the answer. This gives the same procedure as program (M) in Mumick et al. (1990b). This example has only one binding and one SIPS path.

Algorithm 5.1. (search space with maximum SIPS)
(1) order relation and association hyperedges into lists {L1};
(2) for each list L1 do
(3)   for each hyperedge do
(4)     if it is an unevaluable association hyperedge A_i
(5)     then add A_i to L1 after the last hyperedge below its stratum;
(6) the result is lists {L1'};
(7) for each list L1' do
(8)   for each relation hyperedge H in L1' do begin
(9)     find all acyclic SIPS paths; /* for maximum SIPS */
(10)    perform semijoins for these SIPS paths; /* SIPS */
(11)  end
(12) pipeline conditions and condition hyperedges;
(13) the result is lists {L1}; /* go to line 8 of Fig. 5 */
Fig. 9. An algorithm to perform maximum SIPS

5.3 A method to enumerate QEPs

Not only are SIPS simple in the CHG representation, they can also be integrated easily with other query optimization strategies.
In particular, they may be combined with the processing strategy for nested queries and for queries using views, and with the search space. To enumerate QEPs, only minor changes are needed for Fig. 5. After the order of relation and association hyperedges has been decided, we may perform the maximum SIPS strategy for each relation. Such an algorithm is given in Fig. 9. It should be used in conjunction with Fig. 5. In Fig. 9, lines 1-5 are the same as in Fig. 5. Line 9 finds all acyclic SIPS paths (as discussed before, depth-first searches may be used for this purpose). Line 10 performs SIPS for all SIPS paths. The output of Fig. 9 should be connected to line 8 of Fig. 5.

We might consider SIPS of various degrees, ranging from none to the maximum SIPS strategy, for all relation hyperedges and add another search dimension to the search space. However, this is unlikely to be useful in practice. Our view is that an implementation will likely use either the maximum SIPS strategy or none at all for simplicity.

A word of caution on SIPS. SIPS might not offer great savings if pipelining is used for magic relations. The reason is similar to that for projections (Sect. 3 and Han 1995). SIPS are based on two ideas: push-selections and semijoins. The benefit of semijoins is greatly reduced if pipelining is used. Without semijoins, the other idea of SIPS is fully accounted for by methods discussed in Sect. 3. A further study of cost estimation with memory management is required to determine when semijoins and SIPS are beneficial.

In the above discussion, bindings of type A = a are implied. It is also possible to apply SIPS using a restriction of type A > a (A < a, etc.) similar to Mumick et al. (1990a). Such SIPS often gain less in efficiency. This is because a condition like B > 500 often restricts a relation by a small factor, while a selection A = a may restrict a relation by a larger factor.
For example, if only 5% of all employees are senior programmers, then the selection Job = 'Sr Programmer' restricts the relation emp by a factor of 20.

In summary, the method proposed here is a simple, elegant alternative to magic sets methods for relational queries. This new method has the following advantages over magic sets methods proposed earlier.
(1) The new method is relatively intuitive. It is easier to implement.
(2) Unlike magic sets methods which may introduce many rules, the new method is simpler and more efficient implementations are possible.
(3) It uses the maximum SIPS strategy as discussed above. The earlier magic sets methods, in fact, choose one SIPS path if there is more than one. The new method gives smaller magic sets, and thus, is more efficient.
(4) It incorporates SIPS with the search space, thus solving the interplay problem of cost estimates and program transformations mentioned in Mumick et al. (1990b).
(5) The undesirable effect of generating recursions from nonrecursive programs (Mumick et al. 1990b) does not occur here.
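The search for acyclic SIPS paths used by the maximum SIPS strategy (line 9 of Fig. 9) can be sketched as a depth-first search over the cograph. The Python below is a hedged illustration only: the function name, the adjacency-set encoding of the cograph, and the choice to stop a path at its first arrival at the destination are assumptions. It also enumerates every acyclic path and does not apply the pruning of Example 5.1, where a path is ignored because its binding is contained by another.

```python
def acyclic_sips_paths(cograph, bound, destination):
    """Return all acyclic paths from a relation with bindings to the destination.

    cograph maps each node to the set of its neighbours; bound is the set of
    relations that carry bindings (primed nodes in the cograph).
    """
    paths = []

    def dfs(node, path):
        if node == destination:
            paths.append(path)
            return
        for neighbour in sorted(cograph[node]):
            if neighbour not in path:  # never revisit a node: the path stays acyclic
                dfs(neighbour, path + [neighbour])

    for start in sorted(bound):
        if start != destination:
            dfs(start, [start])
    return paths


# Cograph of Example 5.1: R1, R2 and R3 intersect pairwise.
cograph = {
    "R1": {"R2", "R3"},
    "R2": {"R1", "R3"},
    "R3": {"R1", "R2"},
}
bound = {"R1", "R3"}  # R1.A = a (shared with R3 through J) and R3.K = k

paths_r2 = acyclic_sips_paths(cograph, bound, "R2")
paths_r1 = acyclic_sips_paths(cograph, bound, "R1")

for path in paths_r2:
    print(" -> ".join(path))
```

For destination R2 the sketch returns the three paths listed in Example 5.1 plus R1 → R3 → R2, which the text discards by the containment argument; a fuller implementation would add that check before performing the semijoins.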
