Mobile and Heterogeneous databases Distributed Database System Query Processing. A.R. Hurson Computer Science Missouri Science & Technology

Similar documents
Database Systems External Sorting and Query Optimization. A.R. Hurson 323 CS Building

Mobile and Heterogeneous databases

CS5300 Database Systems

COSC344 Database Theory and Applications. σ a= c (P) S. Lecture 4 Relational algebra. π A, P X Q. COSC344 Lecture 4 1

Algorithms for Query Processing and Optimization. 0. Introduction to Query Processing (1)

Chapter 3. Algorithms for Query Processing and Optimization

Copyright 2007 Ramez Elmasri and Shamkant B. Navathe Slide 25-1

Distributed Databases Systems

CSC 742 Database Management Systems

Mobile and Heterogeneous databases Distributed Database System Transaction Management. A.R. Hurson Computer Science Missouri Science & Technology

Optimization of Distributed Queries

Introduction to Query Processing and Query Optimization Techniques. Copyright 2011 Ramez Elmasri and Shamkant Navathe

Chapter 4 Distributed Query Processing

Query Processing SL03

ECE 650 Systems Programming & Engineering. Spring 2018

Query Decomposition and Data Localization

Introduction Background Distributed DBMS Architecture Distributed Database Design Semantic Data Control Distributed Query Processing

QUERY PROCESSING & OPTIMIZATION CHAPTER 19 (6/E) CHAPTER 15 (5/E)

Relational Algebra 1

Mobile and Heterogeneous databases

Introduction Background Distributed DBMS Architecture Distributed Database Design Semantic Data Control Distributed Query Processing

Chapter 19 Query Optimization

CS 348 Introduction to Database Management Assignment 2

Query Processing. high level user query. low level data manipulation. query processor. commands

CS54200: Distributed. Introduction

Outline. Query Processing Overview Algorithms for basic operations. Query optimization. Sorting Selection Join Projection

Query Processing & Optimization. CS 377: Database Systems

Chapter 8: Relational Algebra

Some different database system architectures. (a) Shared nothing architecture.

Database design process

Chapter 18 Strategies for Query Processing. We focus this discussion w.r.t RDBMS, however, they are applicable to OODBS.

Chapter 6 Part I The Relational Algebra and Calculus

RELATIONAL DATA MODEL: Relational Algebra

QUERY PROCESSING & OPTIMIZATION CHAPTER 19 (6/E) CHAPTER 15 (5/E)

CS 5803 Introduction to High Performance Computer Architecture: Arithmetic Logic Unit. A.R. Hurson 323 CS Building, Missouri S&T

Chapter 8 SQL-99: Schema Definition, Basic Constraints, and Queries

Outline. Distributed DBMS Page 5. 1

Introduction to SQL. ECE 650 Systems Programming & Engineering Duke University, Spring 2018

SQL STRUCTURED QUERY LANGUAGE

DISTRIBUTED DATABASES CS561-SPRING 2012 WPI, MOHAMED ELTABAKH

UNIVERSITY OF TORONTO MIDTERM TEST SUMMER 2017 CSC343H Introduction to Databases Instructor Tamanna Chhabra Duration 120 min No aids allowed

Chapter 6 The Relational Algebra and Calculus

SQL-99: Schema Definition, Basic Constraints, and Queries. Create, drop, alter Features Added in SQL2 and SQL-99

Query Processing and Optimization *

CS 377 Database Systems

Advanced Databases (SE487) Prince Sultan University College of Computer and Information Sciences

CSE 444: Database Internals. Lecture 22 Distributed Query Processing and Optimization

COSC344 Database Theory and Applications. Lecture 5 SQL - Data Definition Language. COSC344 Lecture 5 1

Relational Algebra. Relational Algebra Overview. Relational Algebra Overview. Unary Relational Operations 8/19/2014. Relational Algebra Overview

What happens. 376a. Database Design. Execution strategy. Query conversion. Next. Two types of techniques

Chapter 19: Distributed Databases

Chapter 12: Query Processing

Relational Algebra. Ron McFadyen ACS

SQL. Copyright 2013 Ramez Elmasri and Shamkant B. Navathe

Chapter 6: RELATIONAL DATA MODEL AND RELATIONAL ALGEBRA

Distributed KIDS Labs 1

QUERY OPTIMIZATION [CH 15]

CS 5803 Introduction to High Performance Computer Architecture: RISC vs. CISC. A.R. Hurson 323 Computer Science Building, Missouri S&T

QUERY OPTIMIZATION. CS 564- Spring ACKs: Jeff Naughton, Jignesh Patel, AnHai Doan

We shall represent a relation as a table with columns and rows. Each column of the table has a name, or attribute. Each row is called a tuple.

2.2.2.Relational Database concept

Part 1 on Table Function

ECE 650 Systems Programming & Engineering. Spring 2018

Distributed Query Optimization: Use of mobile Agents Kodanda Kumar Melpadi

Relational Model. CS 377: Database Systems

Database Management Systems

Relational Database design. Slides By: Shree Jaswal

Mobile and Heterogeneous databases Security. A.R. Hurson Computer Science Missouri Science & Technology

SQL: A COMMERCIAL DATABASE LANGUAGE. Data Change Statements,

Database Design Theory and Normalization. CS 377: Database Systems

Ian Kenny. November 28, 2017

Relational Model History. COSC 304 Introduction to Database Systems. Relational Model and Algebra. Relational Model Definitions.

DATABASE CONCEPTS. Dr. Awad Khalil Computer Science & Engineering Department AUC

Chapter 6 5/2/2008. Chapter Outline. Database State for COMPANY. The Relational Algebra and Calculus

RELATIONAL DATA MODEL

Copyright 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley. Chapter 6 Outline. Unary Relational Operations: SELECT and

Database Architectures

management systems Elena Baralis, Silvia Chiusano Politecnico di Torino Pag. 1 Distributed architectures Distributed Database Management Systems

SQL Introduction. CS 377: Database Systems

Overview Relational data model

Final Exam Review 2. Kathleen Durant CS 3200 Northeastern University Lecture 23

Database Architectures

NOTE: DO NOT REMOVE THIS EXAM PAPER FROM THE EXAM VENUE

Relational Model History. COSC 416 NoSQL Databases. Relational Model (Review) Relation Example. Relational Model Definitions. Relational Integrity

Ch 5 : Query Processing & Optimization

Query Processing & Optimization

CS54200: Distributed Database Systems

Relational Query Optimization. Highlights of System R Optimizer

Relational Algebra Part I. CS 377: Database Systems

CS 338 Functional Dependencies

Information Systems Development 37C Lecture: Final notes. 30 th March 2017 Dr. Riitta Hekkala

Integration of Transactional Systems

CS403- Database Management Systems Solved Objective Midterm Papers For Preparation of Midterm Exam

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Fall 2007 Lecture 9 - Query optimization

UNIT 3 DATABASE DESIGN

Contents Contents Introduction Basic Steps in Query Processing Introduction Transformation of Relational Expressions...

Slides by: Ms. Shree Jaswal

Chapter 10. Normalization. Chapter Outline. Chapter Outline(contd.)

Examination examples

Query Processing and Query Optimization. Prof Monika Shah

Transcription:

Mobile and Heterogeneous databases Distributed Database System Query Processing A.R. Hurson Computer Science Missouri Science & Technology 1

Note, this unit will be covered in four lectures. In case you finish it earlier, then you have the following options: 1) Take the early test and start CS6302.module3 2) Study the supplement module (supplement CS6302.module2) 3) Act as a helper to help other students in studying CS6302.module2 Note, options 2 and 3 have extra credits as noted in course outline. 2

Enforcement of background Current Module { Glossary of prerequisite topics No Familiar with the topics? Yes Take Test No Pass? Remedial action Yes { Glossary of topics No Familiar with the topics? Take the Module Yes Take Test No Pass? Yes Options Study next module? Review CS6302 module2 background Lead a group of students in this module (extra credits)? Study more advanced related topics (extra credits)? At the end: take exam, record the score, impose remedial action if not successful Extra Curricular activities 3

You are expected to be familiar with: Centralized database configuration and query processing and query optimization in a centralized database environment. If not, you need to study CS6302.module2.background 4

In this module, we define: Distributed Databases Issues in distributed databases, Query processing in general Query optimization Optimization process Query transformation (equivalence rules) Distributed query processing 5

From our discussion in the last module, one could conclude that distributed databases is one form of database organization to represent and process the data. It is a natural extension of client-server and peer-to-peer paradigms. We will look at the major issues in a distributed database organization and then concentrate on query processing. 6

Distributed Database Systems A distributed database in an environment in which related data sources: Are residing on different geographically distributed sites data is closed to the application domain (s) that uses it. Might be replicated to improve performance. Are slit (Fragmented), horizontally and/or vertically, and distributed among the sites to balance the load and improve performance. 7

Data Distribution Data might be distributed to minimize communication cost and/or response time. Data might be kept at the site where it was created to allow higher control and security. 8

Data Replication Data replication allows more availability (in the event of failure if one site is down, data can be accessed from another site), Data replication increases the degree of parallelism (reduce response time), Data replication increases overhead on update. 9

Data Replication In general, data replication increases the performance of read operations and increases the availability of the data to read-only transactions. However, data replication incurs overhead for update transactions, since all copies of the data must be synchronized. In addition, controlling concurrent updates by several transactions on replicated data is becoming very complex. 10

Data Fragmentation Horizontal fragmentation In horizontal fragmentation, data set is partitioned (broken) into subsets with the same structure original data set can be constructed by taking the union of the subsets. Horizontal fragmentation keeps the subsets at the sites where they are used the most. Horizontal fragmentation is lossless. 11

Data Fragmentation Horizontal fragmentation T = T i 1 i n T 1 T 2 T 3 T 4 T 5 12

Data Fragmentation Vertical fragmentation In vertical fragmentation, data set is decomposed (broken) into subsets with different attributes Original data set is constructed by performing join on the fragments. Vertical fragmentation is also lossless. 13

Data Fragmentation Vertical fragmentation T i =Π Li (T) 1 i n T = T 1 T 2 T n 14

Two types of applications access distributed databases: Application that accesses data at the level of SQL statements. Application that accesses data using only stored procedures provided by the database management system. Only applications of the first type can access data directly and hence subject to employ query optimization, concurrency control, strategies. 15

Issues in Distributed Database Systems Data Distribution How should a distributed database be designed? How should the data be distributed? 16

Issues in Distributed Database Systems Transparency Distributed data should be managed transparently details regarding the location of the data must be hidden. Transparency comes in different types: Network transparency Replication transparency Fragmentation transparency 17

Issues in Distributed Database Systems Transparency Location transparency (Network transparency) Freedom from operational details of the network. Replication transparency Unaware from the existence of duplicated copies. Fragmentation transparency Unaware from the existence of fragments. 18

Issues in Distributed Database Systems Transparency The degree of transparency is a compromise between: Ease of use and the difficulty, and The overhead cost of providing high levels of performance. 19

Issues in Distributed Database Systems Keeping track of data Ability to keep track of the data distribution, fragmentation, and replication. Application of catalog, meta-data, 20

Issues in Distributed Database Systems Distributed Query processing The ability to access remote sites and transmitting queries (sub-queries) and data among different sites via the network: How should queries that access multiple databases be processed? How do we improve query processing? 21

Issues in Distributed Database Systems Distributed Transaction management The ability to devise execution strategies for queries and transactions that access data from more than one site and to synchronize the access to distributed data (efficiently) and maintain integrity of the whole database. 22

Issues in Distributed Database Systems Replicated data management The ability to decide which copy of a replicated data set to access, to maintain the consistency of the replicated data, and to control number of replica. 23

Issues in Distributed Database Systems Distributed database recovery The ability to recover from individual site crashes and system failures. 24

Issues in Distributed Database Systems Security The ability to authenticate users, and enforce authorization and access control. 25

To appreciate the issue of query processing in a distributed environment, one has to have a good appreciation and understanding of query processing in a centralized database environment. This also brings out the issue of query optimization, equivalency between queries, and equivalence rules. 26

Query processing A query processing involves three steps: Parsing and Translation Optimization Evaluation 27

Query processing Query Parser & Translator Internal Representation Optimizer Query Output Execution Engine Execution Plan DATA BASE Statistics about data 28

Query Optimization A Simple Example Get names of suppliers who supply part P 2 : SELECT DISTINCT Sname FROM S, SP WHERE S.S# = SP.S# AND SP.P# = P 2 ; Suppose that the cardinality of S and SP are 100 and 10,000, respectively. Furthermore, assume 50 tuples in SP are for part P 2. 29

Query Optimization A Simple Example Without an optimizer, the system will: Generates Cartesian product of S and SP. This will generate a relation of size 1,000,000 tuples Too large to be kept in the main memory. Restricts results of previous step as specified by WHERE clause. This means reading 1,000,000 tuples of which 50 will be selected. Projects the result of previous step over Sname to produce the final result. 30

Query Optimization A Simple Example An Optimizer on the other hand: Restricts SP to just the tuples for part P 2. This will involve reading 10,000 tuples, but produces a relation with 50 tuples. Joins the result of the previous step with S relation over S#. This involves the retrieval of only 100 tuples and the generation of a relation with at most 50 tuples. Projects the result of the last operation over Sname. 31

Query Optimization A Simple Example If the number of tuples I/O s is used as the performance measure, then it is clear that the second approach is far faster that the first approach. In the first case we read about 3,000,000 tuples and in the second case we read about 10,000 tuples. So a simple policy doing restriction and then join instead of doing product and then a restriction sounds a good heuristic. 32

Optimization Process Cast the query into some internal representation Convert the query to some internal representation that is more suitable for machine manipulation, relational algebra. Now we can build a query tree very easily. Π (Sname) (σ P# = P2 (S S.S# =SP.S# SP )) 33

Optimization Process Result Project (Sname) Restrict (Sp.P# = P 2 Join (S.S# = SP.S#) S SP 34

Optimization Process Choose candidate low-level procedure After transferring the query into more desirable form, the optimizer must then decide how to evaluate the transformed query. At this stage issues such as: existence of indexes or other access paths To reduce I/O cost, physical clustering of records To reduce I/O cost, size of base relations, Selectivity factors, comes into play. 35

Optimization Process Choosing the cheapest plan, naturally, requires a method for assigning a cost to any given plan This cost formula should estimate the number of disk accesses, CPU utilization and execution time, space utilization,. 36

Query processing An Example Factors such as number of accesses to the disks, CPU time, Communication cost must be taken into consideration to estimate cost of a plan. Techniques such as pipelining and parallelism can be applied to execute basic operations. Different algorithms can be developed to execute basic operations. 37

Optimization Process There are two optimization: Heuristic rules Systematically estimate main techniques for query In this course, we will talk about the heuristic rules. 38

Optimization Process heuristic rules Reduce the search space, Reduce the size of intermediate results, Perform selection operations as early as possible, Perform projections early, It is usually better to perform selections earlier than projections, Convert Cartesian product to join. 39

Optimization Process heuristic rules Based on heuristic rules the optimizer uses equivalence relationships to reorder operations in a query for execution. 40

Cost of Plan The cost associated with each plan needs to be estimated. This will be accomplished by estimating the cost of each operation. Factors such as: size of relation (s), underlying architecture, buffer size, size of the memory, reduction factor for each operation, need to be taken into consideration.

Assume the following relations: Department (Dname, Dnumber, Mgr-ssn, ) Project (Pname, Pnumber, Plocation, Dnum) Employee (Fname, Lname, Ssn, Bdate, address, Dno, )

Query Optimization An Example SELECT FROM WHERE AND AND Pnumber, Dnum, Lname, Bdate, Address Project, Department, Employee Dnum = Dnumber MGRSSN = SSN Plocation = California ; 43

Query Optimization An Example The above query based on the schemas of the relations can be translated into: Π Pnumber,Dnum,Lname,Address,Bdate (((σ Plocation= california (Project)) Dnum=Dnumber (Department ) ) MNGSSN=SSN (Employee)) 44

Query Optimization An Example Π Pnumber,Dnum,Lname,Address,Bdate MNGSSN=SSN Dnum=Dnumber Employee σ Plocation= california Department Project 45

Distributed Query processing In a distributed database system, relative to the centralized database, several additional factors complicate query processing. These includes: Data Transfer cost size of data, communication cost, communication reliability, The potential performance gain data replication and fragmentation, Potential parallelism data replication and fragmentation, Security, and Processing capability and speed. 46

Distributed Query processing Query translation must be done correctly (same semantic as the original query) and efficiently. In a centralized context, query execution strategies can be well expressed in an extension of relational algebra. 47

Distributed Query processing In a distributed system, relational algebra is not enough to express execution strategy. It must be supplemented with operations for exchanging data between sites. In addition, the distributed query processor must also select the best sites to process data and possibly the way data should be transferred. 48

Example Assume the following relations: EMP(ENO, ENAME, TITLE) and ASG(ENO, PNO, RESP, DUR) Find the names of employees who are managing a project Π (ENAME) (EMP SELECT FROM WHERE ENAME EMP, ASG EMP.ENO = ASG.ENO RESP = manager ENO (σ RESP = Manager (ASG))) 49

Example Assume that EMP and ASG are horizontally fragmented as follows: EMP 1 = σ ENO E3 (EMP) EMP 2 = σ ENO < E3 (EMP) ASG 1 = σ ENO < E3 (ASG) ASG 2 = σ ENO > E3 (ASG) Fragments of ASG 1, ASG 2, EMP 1, and EMP 2 are stored at sites 1, 2, 3, 4, respectively and the result is expected at site 5. 50

Example Result = Π (ENAME) (EMP 1 EMP 2 ) Site 5 EMP 1 EMP 2 Site 3 Site 4 EMP 1 = EMP 1 ENO ASG 1 EMP 2 = EMP 2 ENO ASG 2 ASG 1 ASG 2 Site 1 Site 2 ASG 1 = σ RESP < Manager (ASG 1 ) ASG 2 = σ RESP < Manager (ASG 2 ) Strategy1 51

Example Result = Π (ENAME) (EMP 1 EMP 2 ) ENO (σ RESP = Manager (ASG 1 ASG 2 ))) Site 5 ASG 1 ASG 2 EMP 1 EMP 2 Site 1 Site 2 Site 3 Site 4 Strategy2 52

Example Assume tuple access (tupacc) is 1 unit and tuple transfer (tuptrans) is 10 units, relations EMP and ASG have 400 and 1000 tuples, respectively, and there are 20 managers. Further assume data is distributed uniformly among sites and ASG and EMP are locally clustered on attributes RESP and ENO. 53

Example Total cost of strategy1 1) Produce ASG (10 + 10)*tupacc = 20 2) Transfer ASG (10 + 10)*tuptrans = 200 3) Produce EMP (10 + 10)*tupacc*2 = 40 4) Transfer EMP (10 + 10)*tuptrans = 200 Total cost of strategy1 1) Transfer EMP 400*tuptrans = 4000 2) Transfer ASG 1000*tuptrans = 10000 3) Produce ASG 1000*tupacc = 1000 4) Join EMP and ASG 400*20*tupacc = 8000 Note: in strategy2, the access methods to EMP and ASG are lost because of the data transfer. 54

Distributed Query processing Types of optimizations: Exhaustive approach Heuristic approach Reduce search space at each site Reduce size of intermediate results Reduce communication cost Optimization Overhead Static optimization Dynamic optimization Hybrid 55

Distributed Query processing Statistics: Effectiveness of optimization (specially static optimization) relies on statistics on the databases. Decision sites: When static optimization is employed, either a single site or multiple sites may participate in selection of the strategy for query processing. Most system use the centralized decision approach. Naturally, the centralized approach is easier to implement, however, it requires knowledge of the entire distributed databases. The distributed approach requires only local information. One can employ a hybrid approach where one site makes the major decisions and other sites can make local decisions. 56

Distributed Query processing Exploitation of the Network Topology Local area network Wide area network Exploitation of Replicated Fragments: Global query expressed on global relations must be mapped into queries on physical fragments of relations (localization) Use of Semijoins: A semijoin is of particular importance since it improves the processing of distributed join operations by reducing the size of data exchanged between sites. However, semijoins may result in an increase in the number of messages and local processing time. 57

Global query Query Decomposition Global schema Algebraic query on distributed data Data Localization Fragment schema Fragment query Global Optimization Statistics on Fragment Optimized fragment query with communication operations Local Optimization Local schema Optimized local queries 58

Distributed Query processing Cost of an execution strategy can be expressed in terms of either total execution time or the response time. Total-exec.time = T CPU * #insts + T I/O * #I/Os + T MGS * #MSGs + T TR * #bytes T CPU * #insts + T I/O * #I/Os measures the local processing time and T MGS * #MSGs + T TR * #bytes represents the cost of communications. 59

Distributed Query processing Size of intermediate result(s) is one of the major factors affecting the cost: Length(A i ) denotes length of attribute A i in bytes Card(Π Ai (R)) denotes the number of distinct values of A i in R. Card(dom [A i ]) denotes the cardinality of the domain A i. Card(R) denotes number the entities (tuples) in R. Size(R) = Card(R) * Length(R) where Length(R) is the length of each entity (tuple) in R. 60

Distributed Query processing Card(σ F (R)) = SF S (F) * card(r) denotes the number of data entities (tuples) responding to a selection condition. SF S (F) is the selection factor which is dependent on selection formula and can be determined as follows: SF S (A = value) = 1/(Card(Π Ai (R))) SF S (A > value) = (max(a) value)/(max(a) min(a)) SF S (A < value) = (value min(a))/(max(a) min(a)) SF S (p(a i ) p(a j )) = SF S (p(a i )) * SF S (p(a j )) SF S (p(a i ) p(a j )) = SF S (p(a i )) + SF S (p(a j )) SF S (p(a i )) * SF S (p(a j )) SF S (A ε {value}) = SF S (A = value) * Card({value}) Where p(a i ) and p(a j ) are the predicates over respective attributes. 61

Distributed Query processing Join selectivity factor is defined as: SF J (R, S)= Card(R S) /(Card(R) * Card(S)) And hence Card(R S) = SF J (R, S) * (Card(R) * Card(S)) For Cartesian product Card(R S) = Card(R) * Card(S) For projection Card(Π A (R)) = card(r) if A contains a key 62

Distributed Query processing Query transformation Based on the query, location of data sets, size of the data sets, communication cost, processing capability, a dynamic strategy should be laid out. According to the strategy, then the query is decomposed into sub-queries. Sub-queries are sent to the designated sites for execution. 63

Distributed Query Processing In distributed systems, several additional factors further complicate query processing: Cost of data transmission this includes intermediate data and the final result. Hence, the query optimization algorithm must attempt to reduce the amount of data transfer. 64

Distributed Query Processing Use of semijoin the idea is to reduce the number of tuples in a relation before transferring it to another site. If A and B are domain compatible attributes of R and S then we have R A=B S Note semijoin is not commutative: R S S R 65

Questions Calculate selectivity factor of semijoin operation and cardinality of semijoin operation? Calculate cardinality of Union and Difference operations? 66

Distributed Query Processing A query is decomposed into a set of sub-queries that can be executed at the individual sites. A strategy for combining the results of the sub-queries to form the query result must be generated: An estimate on the size of data transmission must be made to minimize communication cost: In case of data replication and fragmentation attempt must be made to choose the closest replica and/or fragments. If possible, semijoin operation should be performed to reduce data transfer size. Where ever possible attempt must be made to enforce heuristics and equivalence rules, at each site, to reduce the execution cost. 67

Distributed Query Processing An example Assume the following queries on two relations: Π FNAME,LNAME,DNAME ((EMPLOYEE) DNO=DNUMBER (DEPARTMENT) ) Site1: EMPLOYEE FNAME MINIT LNAME SSN BDATE ADDRESS SEX DNO 10,000 records Each record is 100 bytes SSN field is 9 bytesfname field is 15 bytes DNO field is 4 bytes LNAME field is 15 bytes 68

Distributed Query Processing An example Site2: DEPARTMENT DNAME DNUMBER MGRSSN MGRSTARTDATE 100 records Each record is 35 bytes DNUMBER field is 4 bytes MGRSSN field is 9bytes DNAME field is 10 bytes 69

Distributed Query Processing An example The result of this query include 10,000 records, each 40 bytes long. The query is generated at site3. 70

Distributed Query Processing An example Transfer both relations to site3 and process the request there. Here we have to transfer 1,000,000+3500 bytes. Transfer employee relation to site2, process the request at site2, and transfer the result to site3. Here we have to transfer 400,000+1,000,000 bytes. Transfer department relation to site1, process the request at site1, and transfer the result to site3. Here we have to transfer 400,000+3500 bytes. 71

Distributed Query Processing An example Suppose we have the following query initiated at site3: Π FNAME,LNAME,DNAME ((DEPARTMENT) MGRSSN=SSN (EMPLOYEE ) ) Note that this query generate 100 records. 72

Distributed Query Processing An example Transfer both relations to site3 and process the request there. Here we have to transfer 1,000,000+3500 bytes. Transfer employee relation to site2, process the request at site2, and transfer the result to site3. Here we have to transfer 4000+1,000,000 bytes. Transfer department relation to site1, process the request at site1, and transfer the result to site3. Here we have to transfer 4000+3500 bytes. 73

Question What is a semijoin Strategy? Apply equivalence relations to the aforementioned examples, enumerate different strategies, and calculate the data transfer accordingly. Apply semijoin operation to carry out the aforementioned queries and calculate the data transfer size. How can we perform database operations on a parallel system (assume relational model)? 74