The Enhancement of Semijoin Strategies in Distributed Query Optimization

F. Najjar and Y. Slimani
Dept. Informatique - Faculté des Sciences de Tunis
Campus Universitaire - 1060 Tunis, Tunisie
yahya.slimani@fst.rnu.tn

Abstract. We investigate the problem of optimizing distributed queries by using semijoins in order to minimize the amount of data communication between sites. The problem is reduced to that of finding an optimal semijoin sequence that locally fully reduces the relations referenced in a general query graph before processing the join operations.

1 Introduction

The optimization of general queries in a distributed database system is an important and challenging research issue. The problem is to determine a sequence of database operations that processes the query while minimizing some predetermined cost function.

Join is a frequently used database operation. It is also the most expensive, particularly in a distributed database system, where it may involve large communication costs when the relations are located at different sites. Hence, instead of performing joins in one step, semijoins [1] are performed first to reduce the size of the relations, so as to minimize the data transmission cost of processing queries [2]. In the next step, joins are performed on the reduced relations.

The join of two relations $R$ and $S$ on an attribute $A$ is denoted by $R \bowtie_A S$, while the semijoin from $R$ to $S$ on an attribute $A$ is denoted by $S \ltimes_A R$. Thus, $S \ltimes_A R$ is defined as follows: (i) project $R$ on the join attribute $A$ (i.e., $R(A)$); (ii) ship $R(A)$ to the site containing $S$; (iii) join $S$ with $R(A)$. The transmission cost of sending $S$ to the site containing $R$ for the join $R \bowtie_A S$ can thus be reduced.

There are two main methods to process a join operation between two relations. One is called the nondistributed join, where a join is performed between two unfragmented relations. The other is called the distributed join, where the join operation is performed between the fragments of relations.

As pointed out in [5], the problem of query processing has been proved to be NP-hard. This fact justifies the necessity of resorting to heuristics.

The remainder of this paper is organized as follows. Preliminaries are given in Section 2. Section 3 defines the main characteristics of two semijoin-based query optimization heuristics. Section 4 presents and discusses join query optimization in a fragmented database. Finally, Section 5 concludes the paper.
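The three steps of the semijoin $S \ltimes_A R$ can be illustrated with a minimal Python sketch. It assumes relations are represented as lists of dictionaries held in a single process; the function name `semijoin` and the sample tuples are hypothetical, and the "shipping" step is only simulated by counting the values that would be transmitted.

```python
def semijoin(s_rows, r_rows, attr):
    """Sketch of S semijoin R on attr; returns the reduced S and the shipped value count."""
    # (i) Project R on the join attribute (distinct values only).
    r_projection = {row[attr] for row in r_rows}
    # (ii) "Ship" the projection to the site of S (simulated: we only count values).
    shipped_values = len(r_projection)
    # (iii) Join S with the projection: keep the tuples of S that have a match.
    reduced_s = [row for row in s_rows if row[attr] in r_projection]
    return reduced_s, shipped_values


# Hypothetical sample data, not taken from the paper.
R = [{"A": 1, "x": "r1"}, {"A": 2, "x": "r2"}, {"A": 2, "x": "r3"}]
S = [{"A": 2, "y": "s1"}, {"A": 3, "y": "s2"}, {"A": 4, "y": "s3"}]

reduced_S, shipped = semijoin(S, R, "A")
```

Only the single matching tuple of S would later have to be sent to the site of R, at the price of shipping two attribute values first.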

2 Preliminaries

A join query graph can be denoted by a graph $G = (V, E)$, where $V$ is the set of relations and $E$ is the set of edges. An edge $(R_i, R_j) \in E$ if there exists a join predicate on some attribute of $R_i$ and $R_j$. Without loss of generality, only cyclic query graphs are considered. In addition, all attributes are renamed in such a way that two join attributes have the same name if and only if they have a join predicate between them. The relations referenced in the query are assumed to be located at different sites. The query problem is simplified to the estimation of the data statistics and the optimization of the transmission order, so that the total data transmission is minimized.

We denote by $|S|$ the cardinality of a relation $S$. Let $w_A$ be the width of an attribute $A$ and $w_{R_i}$ be the width of a tuple in $R_i$. The size of the total amount of data in $R_i$ can then be denoted by $\|R_i\| = w_{R_i} |R_i|$. For notational simplicity, we use $|A|$ to denote the size of the extant domain of the attribute $A$. $R_i(A)$ denotes the set of distinct values for the attribute $A$ appearing in $R_i$.

For each semijoin $R_j \ltimes_A R_i$, a selectivity factor $\rho_{iA} = |R_i(A)| / |A|$ is used to predict the reduction effect. After the execution of $R_j \ltimes_A R_i$, the size of $R_j$ becomes $\rho_{iA} \|R_j\|$. Moreover, it is important to verify that a semijoin $R_j \ltimes_A R_i$ is profitable, i.e., that the cost incurred by this semijoin, $w_A |R_i(A)|$, is less than the cost of the reduction (called the benefit), which is computed in terms of avoided future transmission cost, $w_{R_j} (|R_j| - \rho_{iA} |R_j|)$. The profit is defined as (benefit - cost).

3 Nondistributed Join Method

In this section, we propose two heuristics. The first, one-phase Parallel SemiJoins (1-PSJ), determines a set of parallel semijoins. The second, the Hybrid A* heuristic (HA*), finds a general sequence of semijoins, which is a combination of parallel and sequential semijoins.

3.1 1-PSJ

We say that $R_i$ is fully locally reduced if it is reduced by the semijoins indexed by the set $\{ j \mid R_i \ltimes_A R_j \text{ is feasible} \}$. We denote by $RD_i = \{ j \mid R_i \ltimes R_j \text{ is profitable} \}$ the set of index reducers of the relation $R_i$. Our objective is to find the set of the most locally profitable semijoins (called applicable semijoins), $AP_i \subseteq RD_i$, such that the overall profit is maximized and, consequently, the total transmission cost ($TC_i$) of $R_i$ is minimized. Furthermore, removing a profitable semijoin may increase the total profit and minimize the extra costs incurred by semijoins. Since all applicable semijoins are executed simultaneously, local optimality (with respect to $R_i$) can be attained.

Finally, in order to reduce each relation in the query, we apply a divide-and-conquer algorithm. The total cost ($TC$) is minimized if all transmission costs ($TC_i$) are minimized simultaneously. The details of this algorithm are given in [4].
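The profitability test from the preliminaries can be worked through numerically. The sketch below assumes that, for a semijoin that reduces $R_j$ using the values $R_i(A)$, the cost is $w_A |R_i(A)|$ and the benefit is the avoided transmission $(1 - \rho_{iA}) \, w_{R_j} |R_j|$ with $\rho_{iA} = |R_i(A)| / |A|$. The function name and all input numbers (in particular the domain size and the tuple width) are hypothetical values chosen for illustration; exact fractions are used to avoid floating-point rounding.

```python
from fractions import Fraction


def semijoin_profit(w_attr, distinct_vals, domain_size, w_tuple, card):
    """Return (selectivity, cost, benefit, profit) for one candidate semijoin."""
    # Selectivity factor: fraction of the attribute domain present in the reducer.
    rho = Fraction(distinct_vals, domain_size)
    # Cost: shipping the projected join-attribute values of the reducer.
    cost = w_attr * distinct_vals
    # Benefit: data of the reduced relation that no longer has to be sent.
    benefit = w_tuple * card * (1 - rho)
    return rho, cost, benefit, benefit - cost


# Hypothetical inputs: w_A = 2, |R_i(A)| = 800, |A| = 1000,
# tuple width of R_j = 10, |R_j| = 1190.
rho, cost, benefit, profit = semijoin_profit(
    w_attr=2, distinct_vals=800, domain_size=1000, w_tuple=10, card=1190
)
is_profitable = profit > 0
```

With these assumed numbers the semijoin ships 1600 units to save 2380, so it would be kept as a candidate reducer; a selectivity closer to 1 would make the same semijoin unprofitable.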

3.2 Hybrid A*

The well-known A* algorithm can be used to determine a sequence of semijoin reducers [6] for distributed query processing. The key issue of the A* algorithm is to derive a heuristic function which can intelligently guide the search for a sequence of semijoins. In the A* algorithm, the search is controlled by a heuristic function $f$ with two components: the cost of reaching a node $p$ from the initial node (the original query graph with its corresponding profile), and the cost of reaching the goal node from $p$. Accordingly, $f(p) = g(p) + h(p)$, where $g(p)$ estimates the minimum cost of the trajectory from the initial state to $p$, and $h(p)$ estimates the minimum cost from $p$ to the goal state. The node $p$ chosen for expansion (i.e., whose immediate successors will be generated) is the one which has the smallest value $f(p)$ among all generated nodes that have not been expanded so far.

In order to derive a general sequence of semijoins, for a node $p$,

$g(p) = g(q) + \sum_{j \in AP_i} \mathrm{cost}(R_i \ltimes R_j) + \|R_i'\|$,

where $p$ is an immediate successor of $q$, and $R_i'$ denotes the resulting relation after performing the applicable semijoins on the original relation $R_i$. The function $h$ is defined as the sum of the sizes of the remaining relations such that the effect of the total reduction (with respect to neighboring relations) gives the best estimation:

$h(p) = \sum_{k} \Big( \sum_{j} \mathrm{cost}(R_k \ltimes R_j) + \|R_k'\| \Big)$,

where the sum over $k$ ranges over the relations $R_k$ that are not yet reduced.

Example 1: Consider the following join query: Select A, D from R1, R2, R3, R4 where (R1.A = R3.A) and (R1.B = R2.B) and (R2.C = R3.C) and (R3.D = R4.D). We suppose that R1, R2, R3 and R4 are located at different sites. The corresponding query graph and profile are given in Figure 1 and Table 1, respectively.

Fig. 1. Join query graph for Example 1.

1-PSJ finds the set $\{R_1 \ltimes R_2,\ R_3 \ltimes R_1,\ R_3 \ltimes R_4\}$, with a total transmission cost to the final site R2 of 18,370, whereas HA* finds the general sequence of semijoins $R_1 \ltimes R_2$, $\{R_3 \ltimes R_1,\ R_3 \ltimes R_4\}$, with a cost of 16,681.
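The expansion rule described above (always expand the unexpanded node with the smallest $f(p) = g(p) + h(p)$) can be sketched as a generic A* loop. This is only a skeleton of the textbook algorithm, not the HA* heuristic itself: the toy state graph, its edge costs, and the heuristic values are hypothetical stand-ins for query-graph states and semijoin costs.

```python
import heapq
from itertools import count


def a_star(start, is_goal, successors, h):
    """Generic A*: returns (cost, path) of a cheapest path, or None if no goal is reachable."""
    tie = count()  # tie-breaker so the heap never has to compare states directly
    frontier = [(h(start), next(tie), 0, start, [start])]
    best_g = {start: 0}
    while frontier:
        _, _, g, node, path = heapq.heappop(frontier)  # smallest f = g + h first
        if is_goal(node):
            return g, path
        if g > best_g.get(node, float("inf")):
            continue  # stale entry: a cheaper route to this node was found later
        for nxt, step_cost in successors(node):
            g_next = g + step_cost
            if g_next < best_g.get(nxt, float("inf")):
                best_g[nxt] = g_next
                heapq.heappush(
                    frontier,
                    (g_next + h(nxt), next(tie), g_next, nxt, path + [nxt]),
                )
    return None


# Hypothetical toy search space; in the paper a node would be a query graph
# with its profile, and an edge would apply a set of semijoins.
graph = {
    "start": [("a", 4), ("b", 1)],
    "a": [("goal", 1)],
    "b": [("a", 1), ("goal", 5)],
    "goal": [],
}
estimate = {"start": 2, "a": 1, "b": 2, "goal": 0}  # never overestimates the true cost

result = a_star("start", lambda s: s == "goal", lambda s: graph[s], estimate.get)
```

Because the estimate never exceeds the true remaining cost in this toy graph, the first goal node popped from the frontier corresponds to a cheapest path.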
Table 1. Profile table for Example 1.

R_i   |R_i|   X   w_X   |R_i(X)|
R1    1190    A   2     830
              B   1     850
R2    3440    B   1     850
              C   3     900
R3    2152    A   2     800
              C   3     900
              D   1     720
R4    3100    D   1     700

To give more insight into the performance of the 1-PSJ and HA* heuristics, simulations were carried out on different queries with $n$ relations ($n$ ranging from 5 to 12) involved in each query. For comparison purposes, in addition to 1-PSJ and HA*, we also apply the original method, OM, which consists of sending all the relations directly to the final site. In Figure 2, it is apparent that as the number of relations increases ($n > 8$), the HA* heuristic becomes better than 1-PSJ. When $n \geq 9$, HA* outperforms the other heuristics significantly (the cost reduction is about 45%).

Fig. 2. Impact of the number of relations on transmission cost (x-axis: number of referenced relations).

4 Distributed Join Method

A relation can be horizontally partitioned (declustered) into disjoint subsets, called fragments, distributed across several sites. We associate with each fragment its qualification, which is expressed by a predicate describing the common properties of the tuples in that fragment.

One major issue in developing a horizontal fragmentation technique is determining the criteria to be used in guiding the fragmentation. A major difficulty is that there are no known significant criteria that can be used to partition relations horizontally. In the context of our study, we suggest a bipartition of each relation $R_i$, such that the relation is divided into mutually exclusive fragments. To represent the fragments more specifically, we propose the following formula:

$|R_i| = \alpha |R_i| + (1 - \alpha) |R_i| = |R_{i1}| + |R_{i2}|$,

where $\alpha$ is a rational number ranging from 0 to 1 and $R_{i1}$, $R_{i2}$ are the fragments of $R_i$. The above fragmentation satisfies the three conditions [2], completeness, reconstruction, and disjointness, which ensure a correct horizontal fragmentation. Note that bipartitioning can be applied to a relation repeatedly.

To estimate the cost of an execution plan, the database profile may have the following statistics: $|R_{ij}|$ denotes the cardinality of fragment number $j$ of relation $R_i$, and $|R_{ij}(A)|$ represents the number of distinct values of attribute $A$ in that fragment.

When semijoins are used in a fragmented database system, they have to be performed in a relation-to-fragment manner, so that they do not cause the elimination of contributive tuples. At each site containing a fragment $R_{jk}$ to be reduced, we proceed as follows: (i) every fragment of $R_i$ (where $R_i$ is such that $R_j \ltimes R_i$ is applied in the query) must participate in reducing $R_{jk}$; so, find the optimal set of applicable semijoins and send the values of the semijoin attributes from each fragment of $R_i$ to $R_{jk}$; (ii) merge the fragments of $R_i$ before eliminating any tuple of $R_{jk}$.

Example 2: We illustrate the distributed join method (HA* on fragmented relations) with the same example discussed for the nondistributed join. After partitioning, the corresponding profile is given in Table 2.

Table 2. Profile table for Example 1.

R_ij      |R_ij|     X   w_X   |R_ij(X)|
R11/R12   119/1071   A   2     90/809
                     B   1     91/818
R21/R22   344/3096   B   1     91/818
                     C   3     97/872
R31/R32   215/1936   A   2     87/782
                     C   3     97/872
                     D   1     79/800
R41/R42   310/2790   D   1     76/683

The optimal general sequence is: $\{R_{12} \ltimes R_{22},\ R_{12} \ltimes (R_{31} \cup R_{32})\}$, $\{R_{21} \ltimes R_{11},\ R_{21} \ltimes (R_{31} \cup R_{32})\}$, $R_{42} \ltimes R_{31}$, $\{R_{32} \ltimes (R_{11} \cup R_{12}),\ R_{32} \ltimes (R_{21} \cup R_{22}),\ R_{32} \ltimes R_{42}\}$; it incurs a cost of 13,959, which is less than in the nondistributed join method.

A general conclusion is that the communication cost is substantially reduced if we use a "good fragmentation". In the absence of a formal definition of a good fragmentation, we can approximate it by the $\alpha$-factor. In effect, we have noted that a good choice of this criterion leads to a good fragmentation. Fig. 2 shows the effect of the $\alpha$-factor on the communication cost for a given query in which the number of relations is constant and $\alpha$ is varied.
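The bipartition and the relation-to-fragment rule can be illustrated together with a small Python sketch. It assumes relations are lists of dictionaries and that the split simply takes the first $\alpha |R_i|$ tuples (the paper does not prescribe which tuples go to which fragment); the function names and the sample data are hypothetical.

```python
def bipartition(rows, alpha):
    """Split a relation into two disjoint fragments of about alpha and (1 - alpha) of its size."""
    cut = int(alpha * len(rows))
    return rows[:cut], rows[cut:]


def reduce_fragment(target_fragment, reducer_fragments, attr):
    """Relation-to-fragment semijoin: merge the values of ALL reducer fragments, then filter."""
    merged_values = set()
    for fragment in reducer_fragments:
        merged_values |= {row[attr] for row in fragment}
    return [row for row in target_fragment if row[attr] in merged_values]


# Hypothetical data: R_i is bipartitioned, R_jk is the fragment to be reduced.
R_i = [{"A": 1}, {"A": 2}, {"A": 3}, {"A": 4}]
R_i1, R_i2 = bipartition(R_i, 0.5)
R_jk = [{"A": 2, "v": "x"}, {"A": 3, "v": "y"}, {"A": 9, "v": "z"}]

# Correct: every fragment of R_i participates before any tuple is eliminated.
fully_reduced = reduce_fragment(R_jk, [R_i1, R_i2], "A")
# Incorrect fragment-to-fragment reduction: the contributive tuple with A = 3 is lost.
wrongly_reduced = reduce_fragment(R_jk, [R_i1], "A")
```

The two fragments are disjoint and their concatenation reconstructs the original relation, which mirrors the completeness, reconstruction, and disjointness conditions; the second call shows why reducing with a single fragment is unsafe.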

5 Conclusion

In this paper, we proposed two distributed query processing strategies for join queries using the semijoin as a query processing tactic. For these two strategies, we presented new heuristics that "intelligently" guide the search and return a general reducer sequence of semijoins. For the distributed join strategy, we proposed a technique to bipartition each relation assuming a fixed $\alpha$-factor.

References

1. P.A. Bernstein, N. Goodman, E. Wong, C. Reeve, and J.B. Rothnie. Query Processing in a System for Distributed Databases (SDD-1). ACM TODS, vol. 6(4), Dec. 1981, pp. 602-625.
2. S. Ceri and G. Pelagatti. Distributed Databases: Principles and Systems. McGraw-Hill, 1985.
3. M-S. Chen and P.S. Yu. Combining Join and Semijoin Operations for Distributed Query Processing. IEEE TKDE, vol. 5(3), Jun. 1993, pp. 534-542.
4. F. Najjar, Y. Slimani, S. Tlili, and J. Boughizane. Heuristics to determine a general sequence of semijoins in distributed query processing. Proc. of the 9th IASTED Int. Conf., PDCS, Washington D.C. (USA), Oct. 1997, pp. 354-359.
5. C. Wang and M-S. Chen. On the Complexity of Distributed Query Optimization. IEEE TKDE, vol. 8(4), Aug. 1996, pp. 650-662.
6. H. Yoo and S. Lafortune. An Intelligent Search Method for Query Optimization by Semijoins. IEEE TKDE, vol. 1(2), Jun. 1989, pp. 226-237.