The Enhancement of Semijoin Strategies in Distributed Query Optimization

F. Najjar and Y. Slimani
Dept. Informatique - Faculté des Sciences de Tunis
Campus Universitaire - 1060 Tunis, Tunisie
yahya.slimani@fst.rnu.tn

Abstract. We investigate the problem of optimizing distributed queries by using semijoins in order to minimize the amount of data communication between sites. The problem is reduced to that of finding an optimal semijoin sequence that locally fully reduces the relations referenced in a general query graph before processing the join operations.

1 Introduction

The optimization of general queries in a distributed database system is an important and challenging research issue. The problem is to determine a sequence of database operations that processes the query while minimizing some predetermined cost function.

Join is a frequently used database operation. It is also the most expensive, especially in a distributed database system, where it may involve large communication costs when the relations are located at different sites. Hence, instead of performing joins in one step, semijoins [1] are performed first to reduce the size of the relations so as to minimize the data transmission cost for processing queries [2]. In the next step, joins are performed on the reduced relations.

The join of two relations $R$ and $S$ on an attribute $A$ is denoted by $R \bowtie_A S$, while the semijoin from $R$ to $S$ on an attribute $A$ is denoted by $S \ltimes_A R$. Thus, $S \ltimes_A R$ is defined as follows: (i) project $R$ on the join attribute $A$ (i.e., compute $R(A)$); (ii) ship $R(A)$ to the site containing $S$; (iii) join $S$ with $R(A)$. The transmission cost of sending $S$ to the site containing $R$ for the join $R \bowtie_A S$ can thus be reduced.

There are two main methods to process a join operation between two relations. One is called the nondistributed join, where a join is performed between two unfragmented relations.
The other is called the distributed join, where the join operation is performed between the fragments of relations.

As pointed out in [5], the problem of query processing has been proved to be NP-hard. This fact justifies resorting to heuristics. The remainder of this paper is organized as follows: preliminaries are given in Section 2. Section 3 defines the main characteristics of two semijoin-based query optimization heuristics; Section 4 presents and discusses join query optimization in a fragmented database. Finally, Section 5 concludes the paper.
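The three-step definition of the semijoin given in the Introduction (project, ship, join) can be sketched in Python as follows. This is a minimal in-memory illustration, not the authors' implementation: relations are modeled as lists of dictionaries, the "shipping" step is only simulated by passing a set between functions, and the names `project`, `semijoin`, `R`, and `S` are hypothetical.

```python
from typing import Dict, List, Set

Row = Dict[str, object]


def project(relation: List[Row], attr: str) -> Set[object]:
    """Step (i): the distinct values of the join attribute, i.e. R(A)."""
    return {row[attr] for row in relation}


def semijoin(s: List[Row], r: List[Row], attr: str) -> List[Row]:
    """Reduce S by R on attr: keep the tuples of S whose value occurs in R(A)."""
    # Steps (i) and (ii): only the projection R(A) would travel to S's site.
    shipped_values = project(r, attr)
    # Step (iii): join S with R(A), which filters S without adding columns.
    return [row for row in s if row[attr] in shipped_values]


# Toy relations (made-up data, for illustration only).
R = [{"A": 1, "X": "a"}, {"A": 2, "X": "b"}]
S = [{"A": 2, "Y": "p"}, {"A": 3, "Y": "q"}, {"A": 2, "Y": "r"}]

reduced_S = semijoin(S, R, "A")
print(reduced_S)
```

Only the reduced relation `reduced_S` would then have to be sent to the site of `R` for the final join, which is where the saving in transmission cost comes from.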

2 Preliminaries

A join query graph is denoted by a graph $G = (V, E)$, where $V$ is the set of relations and $E$ is the set of edges. An edge $(R_i, R_j) \in E$ exists if there is a join predicate on some attribute of $R_i$ and $R_j$. Without loss of generality, only cyclic query graphs are considered. In addition, all attributes are renamed in such a way that two join attributes have the same name if and only if they have a join predicate between them.

The relations referenced in the query are assumed to be located at different sites. The query processing problem is thus simplified to estimating the data statistics and optimizing the transmission order so that the total data transmission is minimized.

We denote by $|S|$ the cardinality of a relation $S$. Let $w_A$ be the width of an attribute $A$ and $w_{R_i}$ be the width of a tuple in $R_i$. The size of the total amount of data in $R_i$ can then be denoted by $\|R_i\| = w_{R_i} |R_i|$. For notational simplicity, we use $|A|$ to denote the cardinality of the extant domain of the attribute $A$. $R_i(A)$ denotes the set of distinct values for the attribute $A$ appearing in $R_i$.

For each semijoin $R_j \ltimes_A R_i$, a selectivity factor $\rho_{iA} = |R_i(A)| / |A|$ is used to predict the reduction effect. After the execution of $R_j \ltimes_A R_i$, the size of $R_j$ becomes $\rho_{iA} \|R_j\|$. Moreover, it is important to verify that a semijoin $R_j \ltimes_A R_i$ is profitable, i.e., that the cost incurred by this semijoin, $w_A |R_i(A)|$, is less than the cost of the reduction (called the benefit), which is computed in terms of avoided future transmission cost, $w_{R_j}(|R_j| - \rho_{iA} |R_j|) = (1 - \rho_{iA}) \|R_j\|$. The profit is defined as (benefit - cost).

3 Nondistributed Join Method

In this section, we propose two heuristics. The first, one-phase Parallel SemiJoins (1-PSJ), determines a set of parallel semijoins. The second, the Hybrid A* heuristic (HA*), finds a general sequence of semijoins, which is a combination of parallel and sequential semijoins.

3.1 1-PSJ

We say that $R_i$ is fully locally reduced if it has been reduced by every relation in $\{j \mid R_i \ltimes_A R_j \text{ is feasible}\}$.
We denote by $RD_i = \{j \mid R_i \ltimes R_j \text{ is profitable}\}$ the set of index reducers of the relation $R_i$. Our objective is to find the set of the most locally profitable semijoins (called applicable semijoins), $AP_i \subseteq RD_i$, such that the overall profit is maximized and, consequently, the total transmission cost ($TC_i$) of $R_i$ is minimized. Furthermore, removing a profitable semijoin from this set may increase the total profit and reduce the extra costs incurred by semijoins. Since all applicable semijoins are executed simultaneously, local optimality (with respect to $R_i$) can be attained. Finally, in order to reduce each relation in the query, we apply a divide-and-conquer algorithm. The total cost ($TC$) is minimized if all transmission costs ($TC_i$) are minimized simultaneously. The details of this algorithm are given in [4].
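The selectivity, cost, benefit, and profit definitions of Section 2, together with the reducer set $RD_i$ of Section 3.1, can be sketched as below. This is only an illustration under stated assumptions: the class `Semijoin`, the function names, and all numeric values are hypothetical, and the domain size $|A|$ in particular is not listed in the paper's profile tables. The sketch computes the set of individually profitable reducers, not the optimal subset $AP_i$, whose construction is described in [4].

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Semijoin:
    """A candidate semijoin that reduces a target relation with a reducer."""
    reducer: str        # name of the relation whose projection is shipped
    w_attr: int         # width w_A of the join attribute
    distinct_vals: int  # number of distinct A-values in the reducer, |R(A)|
    domain_size: int    # |A|; hypothetical, not given in the profile tables


def selectivity(sj: Semijoin) -> float:
    """rho = |R(A)| / |A|."""
    return sj.distinct_vals / sj.domain_size


def cost(sj: Semijoin) -> int:
    """Cost of shipping the projection: w_A * |R(A)|."""
    return sj.w_attr * sj.distinct_vals


def benefit(sj: Semijoin, target_size: float) -> float:
    """Avoided transmission: (1 - rho) * ||target||."""
    return (1 - selectivity(sj)) * target_size


def profit(sj: Semijoin, target_size: float) -> float:
    return benefit(sj, target_size) - cost(sj)


def profitable_reducers(candidates: List[Semijoin], target_size: float) -> List[str]:
    """Names of the reducers whose semijoin has a positive profit (the set RD)."""
    return [sj.reducer for sj in candidates if profit(sj, target_size) > 0]


# Made-up numbers: a target of size ||R|| = 10000 and two candidate reducers.
target_size = 10000
s = Semijoin(reducer="S", w_attr=2, distinct_vals=500, domain_size=1000)
t = Semijoin(reducer="T", w_attr=4, distinct_vals=900, domain_size=1000)

print(profit(s, target_size))                       # cost 1000, benefit 5000
print(profitable_reducers([s, t], target_size))     # T costs more than it saves
```

In this toy setting the semijoin with `T` is rejected because shipping its projection costs more than the transmission it would avoid.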

3.2 Hybrid A*

The well-known A* algorithm can be used to determine a sequence of semijoin reducers [6] for distributed query processing. The key issue of the A* algorithm is to derive a heuristic function that can intelligently guide the search for a sequence of semijoins. In the A* algorithm, the search is controlled by a heuristic function $f$ with two components: the cost of reaching a node $p$ from the initial node (the original query graph with its corresponding profile), and the cost of reaching the goal node from $p$. Accordingly, $f(p) = g(p) + h(p)$, where $g(p)$ estimates the minimum cost of a trajectory from the initial state to $p$, and $h(p)$ estimates the minimum cost from $p$ to the goal state. The node $p$ chosen for expansion (i.e., whose immediate successors will be generated) is the one that has the smallest value $f(p)$ among all generated nodes that have not been expanded so far.

In order to derive a general sequence of semijoins, for a node $p$,

$$g(p) = g(q) + \sum_{j \in AP_i} \mathrm{cost}(R_i \ltimes R_j) + \|R_i'\|,$$

where $p$ is an immediate successor of $q$, and $R_i'$ denotes the relation resulting from applying the applicable semijoins to the original relation $R_i$. The function $h$ is defined as the sum of the sizes of the remaining relations such that the effect of the total reduction (with respect to neighboring relations) gives the best estimation,

$$h(p) = \sum_{k} \Big( \sum_{j} \mathrm{cost}(R_k \ltimes R_j) + \|R_k'\| \Big),$$

where the outer sum ranges over the relations $R_k$ that are not yet reduced and the inner sum over their reducers $R_j$.

Example 1: Consider the following join query:

    Select A, D from R1, R2, R3, R4 where (R1.A = R3.A) and (R1.B = R2.B) and (R2.C = R3.C) and (R3.D = R4.D).

We suppose that $R_1$, $R_2$, $R_3$ and $R_4$ are located at different sites. The corresponding query graph and profile are given in Figure 1 and Table 1, respectively.

[Fig. 1. Join query graph for Example 1.]

1-PSJ finds the set $\{R_1 \ltimes R_2, R_3 \ltimes R_1, R_3 \ltimes R_4\}$, with a total transmission cost of 18,370 to the final site, $R_2$. HA*, in contrast, finds the general sequence of semijoins $R_1 \ltimes R_2, \{R_3 \ltimes R_1, R_3 \ltimes R_4\}$, with a cost of 16,681.
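The expansion rule of Section 3.2 (always expand the unexpanded node with the smallest $f(p) = g(p) + h(p)$) is that of generic A* search, which can be sketched as follows. This is a textbook best-first search, not the HA* heuristic itself: in HA* a node would be a query graph with its profile and a step would apply a set of semijoins, whereas here the state space is a small made-up graph, and `a_star`, `graph`, and `h_table` are hypothetical names.

```python
import heapq
from typing import Callable, Dict, Hashable, Iterable, List, Optional, Tuple


def a_star(
    start: Hashable,
    is_goal: Callable[[Hashable], bool],
    successors: Callable[[Hashable], Iterable[Tuple[Hashable, float]]],
    h: Callable[[Hashable], float],
) -> Optional[Tuple[float, List[Hashable]]]:
    """Return (cost, path) of a cheapest path, assuming h never overestimates."""
    best_g: Dict[Hashable, float] = {start: 0}
    parent: Dict[Hashable, Optional[Hashable]] = {start: None}
    counter = 0  # tie-breaker so the heap never has to compare nodes
    frontier = [(h(start), counter, 0, start)]  # entries are (f, tie, g, node)
    while frontier:
        _, _, g, node = heapq.heappop(frontier)
        if g > best_g[node]:
            continue  # stale entry: a cheaper route to this node was found later
        if is_goal(node):
            path = []
            cur: Optional[Hashable] = node
            while cur is not None:
                path.append(cur)
                cur = parent[cur]
            return g, path[::-1]
        for nxt, step_cost in successors(node):
            new_g = g + step_cost
            if new_g < best_g.get(nxt, float("inf")):
                best_g[nxt] = new_g
                parent[nxt] = node
                counter += 1
                heapq.heappush(frontier, (new_g + h(nxt), counter, new_g, nxt))
    return None


# Toy state space (made-up costs); h_table is an admissible estimate.
graph = {
    "start": [("a", 4), ("b", 1)],
    "a": [("goal", 1)],
    "b": [("a", 2), ("goal", 5)],
    "goal": [],
}
h_table = {"start": 3, "a": 1, "b": 2, "goal": 0}

result = a_star("start", lambda n: n == "goal", lambda n: graph[n], h_table.get)
print(result)
```

The quality of the search depends entirely on how tightly $h$ estimates the remaining cost, which is why the paper concentrates on the definitions of $g$ and $h$.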
Table 1. Profile table for Example 1.

    R_i   |R_i|   X   w_X   |R_i(X)|
    R1    1190    A   2     830
                  B   1     850
    R2    3440    B   1     850
                  C   3     900
    R3    2152    A   2     800
                  C   3     900
                  D   1     720
    R4    3100    D   1     700

To give more insight into the performance of the 1-PSJ and HA* heuristics, simulations were carried out on different queries, with $n$ relations ($n$ ranging from 5 to 12) involved in each query. For comparison purposes, in addition to 1-PSJ and HA*, we also apply the original method, OM, which consists of sending all the relations directly to the final site. In Figure 2, it is apparent that as the number of relations increases ($n > 8$), the HA* heuristic becomes better than 1-PSJ. When $n \geq 9$, HA* outperforms the other heuristics significantly (the cost reduction is about 45%).

[Fig. 2. Impact of the number of relations on transmission cost. Horizontal axis: number of referenced relations.]

4 Distributed Join Method

A relation can be horizontally partitioned (declustered) into disjoint subsets, called fragments, distributed across several sites. We associate with each fragment its qualification, which is expressed by a predicate describing the common properties of the tuples in that fragment.

One major issue in developing a horizontal fragmentation technique is determining the criteria to be used in guiding the fragmentation. A major difficulty is that there are no known significant criteria that can be used to partition relations horizontally. In the context of our study, we suggest a bipartition of each relation $R_i$, such that the relation is divided into two mutually exclusive fragments. To represent the fragments more specifically, we propose the following formula:

$$|R_i| = a|R_i| + (1 - a)|R_i| = |R_{i1}| + |R_{i2}|,$$

where $a$ is a rational number ranging from 0 to 1 and $R_{i1}$, $R_{i2}$ are the fragments of $R_i$. The above fragmentation satisfies the three conditions [2], completeness, reconstruction, and disjointness, which ensure a correct horizontal fragmentation. Note that bipartitioning can be applied to a relation repeatedly.

To estimate the cost of an execution plan, the database profile includes the following statistics: $|R_{ij}|$ denotes the cardinality of fragment number $j$ of relation $R_i$, and $|R_{ij}(A)|$ represents the number of distinct values of attribute $A$ in that fragment.

When semijoins are used in a fragmented database system, they have to be performed in a relation-to-fragment manner, so that they do not cause the elimination of contributive tuples. At each site containing a fragment $R_{jk}$ to be reduced, we proceed as follows:
(i) every fragment of $R_i$ [footnote 1] must participate in reducing $R_{jk}$; so, find the optimal set of applicable semijoins and send the values of the semijoin attributes from each fragment of $R_i$ to $R_{jk}$;
(ii) merge the fragments of $R_i$ before eliminating any tuple of $R_{jk}$.

Example 2: We illustrate the distributed join method (HA* on fragmented relations) with the same example discussed for the nondistributed join. After partitioning, the corresponding profile is given in Table 2.

Table 2. Profile table for Example 2.

    R_i1/R_i2   |R_i1|/|R_i2|   X   w_X   |R_i1(X)|/|R_i2(X)|
    R11/R12     119/1071        A   2     90/809
                                B   1     91/818
    R21/R22     344/3096        B   1     91/818
                                C   3     97/872
    R31/R32     215/1936        A   2     87/782
                                C   3     97/872
                                D   1     79/800
    R41/R42     310/2790        D   1     76/683

The optimal general sequence is: $\{R_{12} \ltimes R_{22}, R_{12} \ltimes (R_{31} \cup R_{32})\}$, $\{R_{21} \ltimes R_{11}, R_{21} \ltimes (R_{31} \cup R_{32})\}$, $R_{42} \ltimes R_{31}$, $\{R_{32} \ltimes (R_{11} \cup R_{12}), R_{32} \ltimes (R_{21} \cup R_{22}), R_{32} \ltimes R_{42}\}$; it incurs a cost of 13,959, which is less than with the nondistributed join method. A general conclusion is that the communication cost is substantially reduced if we use a "good fragmentation".
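The bipartition formula and the relation-to-fragment rule of Section 4 can be sketched as follows. Several things here are assumptions made for illustration only: the helper names are hypothetical, the split is done by tuple position rather than by a qualification predicate, and the value a = 1/10 is not stated in the paper but is consistent with the fragment cardinalities of the partitioned profile (for example 1190 splits into 119 and 1071).

```python
from fractions import Fraction
from typing import Dict, List, Tuple

Row = Dict[str, int]


def bipartition_sizes(cardinality: int, a: Fraction) -> Tuple[int, int]:
    """Split |R_i| into |R_i1| = floor(a * |R_i|) and |R_i2| = the remainder."""
    first = int(a * cardinality)
    return first, cardinality - first


def bipartition(rows: List[Row], a: Fraction) -> Tuple[List[Row], List[Row]]:
    """Two disjoint fragments whose union is the original relation.

    Splitting by position is an assumption of this sketch.
    """
    cut, _ = bipartition_sizes(len(rows), a)
    return rows[:cut], rows[cut:]


def reduce_fragment(target: List[Row], reducer_fragments: List[List[Row]], attr: str) -> List[Row]:
    """Relation-to-fragment semijoin.

    The attribute values shipped by every fragment of the reducer are merged
    first; only then are tuples of the target fragment eliminated.
    """
    merged = set()
    for fragment in reducer_fragments:
        merged |= {row[attr] for row in fragment}
    return [row for row in target if row[attr] in merged]


# Cardinalities of R1 and R2 from the unfragmented profile, with a = 1/10.
print(bipartition_sizes(1190, Fraction(1, 10)))
print(bipartition_sizes(3440, Fraction(1, 10)))

# Toy data: reducing with a single fragment would lose a contributive tuple.
R_i = [{"A": v} for v in (1, 2, 3, 4)]
R_i1, R_i2 = bipartition(R_i, Fraction(1, 2))
R_jk = [{"A": v} for v in (2, 3, 5)]

reduced = reduce_fragment(R_jk, [R_i1, R_i2], "A")   # keeps A = 2 and A = 3
fragment_only = reduce_fragment(R_jk, [R_i1], "A")   # wrongly drops A = 3
print(reduced, fragment_only)
```

The second call shows why a fragment-to-fragment semijoin is unsafe: the tuple with A = 3 matches a tuple stored in the other fragment and would be eliminated too early.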
In the absence of a formal definition of a good fragmentation, we can approximate it by the a-factor. In effect, we have noted that a good choice of this criterion leads to a good fragmentation. Fig. 2 shows the effect of the a-factor on the communication cost for a given query in which the number of relations is constant and $a$ is varied.

Footnote 1. $R_i$ such that $R_j \ltimes R_i$ is applied in the query.

5 Conclusion

In this paper, we proposed two distributed query processing strategies for join queries using semijoins as a query processing tactic. For these two strategies, we presented new heuristics that "intelligently" guide the search and return a general reducer sequence of semijoins. For the distributed join strategy, we proposed a technique to bipartition each relation assuming a fixed a-factor.

References

1. P.A. Bernstein, N. Goodman, E. Wong, C. Reeve, and J.B. Rothnie. Query Processing in a System for Distributed Databases (SDD-1). ACM TODS, vol. 6(4), Dec. 1981, pp. 602-625.
2. S. Ceri and G. Pelagatti. Distributed Databases: Principles and Systems. McGraw-Hill, 1985.
3. M-S. Chen and P.S. Yu. Combining Join and Semijoin Operations for Distributed Query Processing. IEEE TKDE, vol. 5(3), Jun. 1993, pp. 534-542.
4. F. Najjar, Y. Slimani, S. Tlili, and J. Boughizane. Heuristics to determine a general sequence of semijoins in distributed query processing. Proc. of the 9th IASTED Int. Conf., PDCS, Washington D.C. (USA), Oct. 1997, pp. 354-359.
5. C. Wang and M-S. Chen. On the Complexity of Distributed Query Optimization. IEEE TKDE, vol. 8(4), Aug. 1996, pp. 650-662.
6. H. Yoo and S. Lafortune. An Intelligent Search Method for Query Optimization by Semijoins. IEEE TKDE, vol. 1(2), Jun. 1989, pp. 226-237.