Outline. q Database integration & querying. q Peer-to-Peer data management q Stream data management q MapReduce-based distributed data management

Similar documents
Mobile and Heterogeneous databases

Distributed Query Optimization: Use of mobile Agents Kodanda Kumar Melpadi

Chapter 18: Parallel Databases Chapter 19: Distributed Databases ETC.

Query Processing SL03

Distributed DBMS. Concepts. Concepts. Distributed DBMS. Concepts. Concepts 9/8/2014

Conjunctive queries. Many computational problems are much easier for conjunctive queries than for general first-order queries.

Chapter 18: Parallel Databases

QUERY PROCESSING & OPTIMIZATION CHAPTER 19 (6/E) CHAPTER 15 (5/E)

QUERY PROCESSING & OPTIMIZATION CHAPTER 19 (6/E) CHAPTER 15 (5/E)

Query Processing. high level user query. low level data manipulation. query processor. commands

Query Decomposition and Data Localization

Outline. Introduction. Introduction Definition -- Selection and Join Semantics The Cost Model Load Shedding Experiments Conclusion Discussion Points

Mobile and Heterogeneous databases Distributed Database System Query Processing. A.R. Hurson Computer Science Missouri Science & Technology

Distributed KIDS Labs 1

Introduction to Query Processing and Query Optimization Techniques. Copyright 2011 Ramez Elmasri and Shamkant Navathe

Data Integration Systems

Distributed Databases Systems

Distributed Data Management

Parallel Query Optimisation

Distributed Data Management

Query Processing and Query Optimization. Prof Monika Shah

Algorithms for Query Processing and Optimization. 0. Introduction to Query Processing (1)

Outline. Distributed DBMS Page 5. 1

Distributed Databases Systems

Lecture 8. Database Management and Queries

Review for Exam 1 CS474 (Norton)

Copyright 2007 Ramez Elmasri and Shamkant B. Navathe Slide 25-1

Introduction Background Distributed DBMS Architecture Distributed Database Design Semantic Data Control Distributed Query Processing

Contents Contents Introduction Basic Steps in Query Processing Introduction Transformation of Relational Expressions...

Relational Model History. COSC 416 NoSQL Databases. Relational Model (Review) Relation Example. Relational Model Definitions. Relational Integrity

Something to think about. Problems. Purpose. Vocabulary. Query Evaluation Techniques for large DB. Part 1. Fact:

Outline. File Systems. Page 1

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe

Copyright 2004 Pearson Education, Inc.

Query Processing & Optimization

Relational Databases

Query Rewriting Using Views in the Presence of Inclusion Dependencies

Data Integration Services

CSE 544, Winter 2009, Final Examination 11 March 2009

CSE 344 MAY 7 TH EXAM REVIEW

Query Processing and Optimization on the Web

Query Processing and Optimization on the Web

On Using an Automatic, Autonomous and Non-Intrusive Approach for Rewriting SQL Queries

INSTITUTE OF AERONAUTICAL ENGINEERING (Autonomous) Dundigal, Hyderabad

Chapter 12: Query Processing

DISTRIBUTED DATABASES CS561-SPRING 2012 WPI, MOHAMED ELTABAKH

Introduction Background Distributed DBMS Architecture Distributed Database Design Semantic Data Control Distributed Query Processing

Chapter 19: Distributed Databases

Chapter 19 Query Optimization

DISTRIBUTED DATABASES

Relational Model History. COSC 304 Introduction to Database Systems. Relational Model and Algebra. Relational Model Definitions.

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2014/15

Chapter 11: Query Optimization

Query optimization. Elena Baralis, Silvia Chiusano Politecnico di Torino. DBMS Architecture D B M G. Database Management Systems. Pag.

Module 9: Selectivity Estimation

Database System Concepts and Architecture

DISTRIBUTED DATABASES

Optimization of Distributed Queries

Dr. Awad Khalil. Computer Science & Engineering department

Chapter 13: Query Optimization. Chapter 13: Query Optimization

CS54200: Distributed. Introduction

Advanced Databases: Parallel Databases A.Poulovassilis

Streaming Data Integration: Challenges and Opportunities. Nesime Tatbul

Overview of Query Evaluation

GUJARAT TECHNOLOGICAL UNIVERSITY

Rajiv GandhiCollegeof Engineering& Technology, Kirumampakkam.Page 1 of 10

Distributed Query Processing. Banking Example

COMP9311 Week 10 Lecture. DBMS Architecture. DBMS Architecture and Implementation. Database Application Performance

Chapter 14: Query Optimization

Information Management (IM)

Distributed Databases

CS54200: Distributed Database Systems

Advanced Databases. Lecture 4 - Query Optimization. Masood Niazi Torshiz Islamic Azad university- Mashhad Branch

Relational Model: History

Integration of Transactional Systems

Distributed Data Management

Principles of Data Management. Lecture #9 (Query Processing Overview)

management systems Elena Baralis, Silvia Chiusano Politecnico di Torino Pag. 1 Distributed architectures Distributed Database Management Systems

Parser: SQL parse tree

Database Heterogeneity

Overview of Query Evaluation. Chapter 12

Module 9: Query Optimization

An Adaptive Query Execution Engine for Data Integration

Basant Group of Institution

2.2.2.Relational Database concept

CSE 444: Database Internals. Lecture 22 Distributed Query Processing and Optimization

Lecture 1: Conjunctive Queries

Mahathma Gandhi University

Data Integration and Data Warehousing Database Integration Overview

Outline. Query Processing Overview Algorithms for basic operations. Query optimization. Sorting Selection Join Projection

Introduction Alternative ways of evaluating a given query using

Module 4. Implementation of XQuery. Part 0: Background on relational query processing

Chapter 3. Algorithms for Query Processing and Optimization

CSE 544: Principles of Database Systems

Scaling Access to Heterogeneous Data Sources with DISCO

Parallel Relational Database Systems

CS 377 Database Systems

Fundamentals of. Database Systems. Shamkant B. Navathe. College of Computing Georgia Institute of Technology PEARSON.

Distributed Database Management Systems. Data and computation are distributed over different machines Different levels of complexity

Mastro Studio: a system for Ontology-Based Data Management

Transcription:

Outline n Introduction & architectural issues n Data distribution n Distributed query processing n Distributed query optimization n Distributed transactions & concurrency control n Distributed reliability n Data replication n Parallel database systems q Database integration & querying q Query rewriting q Optimization issues q Peer-to-Peer data management q Stream data management q MapReduce-based distributed data management Page 9.32

Multidatabase Query Processing n Mediator/wrapper architecture n MDB query processing architecture n Query rewriting using views n Query optimization and execution n Query translation and execution Page 9.33

Mediator/Wrapper Architecture Query Processing Mediator Local Schema Same Interface Wrapper1 Different Interfaces DBMS1 Query Results Global view Result Integration local schema local schema Wrapper2 Wrapper3 DBMS2 DBMS3 DBMS4 Page 9.34

Advantages of M/W Architecture n Wrappers encapsulate the details of component DBMS l Export schema and cost information l Manage communication with Mediator n Mediator provides a global view to applications and users l Single point of access u May be itself distributed l Can specialize in some application domain l Perform query optimization using global knowledge l Perform result integration in a single format Page 9.35

Issues in MDB Query Processing n Component DBMSs are autonomous and may range from full-fledge relational DBMS to flat file systems l Different computing capabilities u Prevents uniform treatment of queries across DBMSs l Different processing cost and optimization capabilities u Makes cost modeling difficult l Different data models and query languages u Makes query translation and result integration difficult l Different runtime performance and unpredictable behavior u Makes query execution difficult Page 9.36

Mediator Data Model n Relational model l Simple and regular data structures l Mandatory schema n Object model l Complex (graphs) and regular data structures l Mandatory schema n Semi-structured (XML) model l Complex (trees) and irregular data structures l Optional schema (DTD or XSchema) In this chapter, we use the relational model which is sufficient to explain MDB query processing Page 9.37

MDB Query Processing Architecture Global/local correspondences Allocation and capabilities Local/DBMS mappings Page 9.38

Query Rewriting Using Views n Views used to describe the correspondences between global and local relations l Global As View: the global schema is integrated from the local databases and each global relation is a view over the local relations l Local As View: the global schema is defined independently of the local databases and each local relation is a view over the global relations n Query rewriting best done with Datalog, a logic-based language l More expressive power than relational calculus l Inline version of relational domain calculus Page 9.39

Datalog Terminology n Conjunctive (SPJ) query: a rule of the form l Q(T) :- R 1 (T 1 ), R n (T n ) l Q(T) : head of the query denoting the result relation l R 1 (T 1 ), R n (T n ): subgoals in the body of the query l R 1, R n : predicate names corresponding to relation names l T 1, T n : refer to tuples with variables and constants l Variables correspond to attributes (as in domain calculus) l - means unnamed variable n Disjunctive query = n conjunctive queries with same head predicate Page 9.40

Datalog Example With EMP(ENAME,TITLE,CITY) and ASG(ENAME,PNAME,DUR) SELECT!ENAME,TITLE, PNAME! FROM!EMP, ASG! WHERE!EMP.ENAME = ASG.ENAME! AND!TITLE = "Programmer" OR DUR=24! Q(ename,title,pname) :- Q(ename,title,pname) :- Emp(ename,title,-) Asg(ename,pname,-), title = Programmer. Emp(ename,title,-) Asg(ename,pname,24). Page 9.41

Rewriting in GAV n Global schema similar to that of homogeneous DDBMS l Local relations can be fragments l But no completeness: a tuple in the global relation may not exist in local relations u Yields incomplete answers l And no disjointness: the same tuple may exist in different local databases u Yields duplicate answers n Rewriting (unfolding) l Similar to query modification u Apply view definition rules to the query and produce a union of conjunctive queries, one per rule application u Eliminate redundant queries Page 9.42

GAV Example Schema Global relations EMP(ENAME,CITY) ASG(ENAME,PNAME,TITLE, DUR) Local relations EMP1(ENAME,TITLE,CITY) EMP2(ENAME,TITLE,CITY) ASG1(ENAME,PNAME,DUR) Emp(ename,city) :- Emp1(ename,title,city). (r 1 ) Emp(ename,city) :- Emp2(ename,title,city). (r 2 ) Asg(ename,pname,title,dur) :- Emp1(ename,title,city), (r 3 ) Asg1(ename,pname,dur). Asg(ename,pname,title,dur) :- Emp2(ename,title,city), (r 4 ) Asg1(ename,pname,dur). Page 9.43

GAV Example Query Let Q: name and project for employees in Paris Q(e,p) :- Emp(e, Paris ), Asg(e,p,-,-). Unfolding produces Q Q (e,p) :- Emp1(e,-, Paris ), Asg1(e,p,-,). (q 1 ) Q (e,p) :- Emp2(e,-, Paris ), Asg1(e,p,-,). (q 2 ) where q 1 is obtained by applying r 3 only or both r 1 and r 3 In the latter case, there are redundant queries same for q 2 with r 2 only or both r 2 and r 4 Page 9.44

Rewriting in LAV n More difficult than in GAV l No direct correspondence between the terms in GS (emp, ename) and those in the views (emp1, emp2, ename) l There may be many more views than global relations l Views may contain complex predicates to reflect the content of the local relations u e.g. a view Emp3 for only programmers n Often not possible to find an equivalent rewriting l Best is to find a maximally-contained query which produces a maximum subset of the answer u e.g. Emp3 can only return a subset of the employees Page 9.45

Rewriting Algorithms n The problem to find an equivalent query is NPcomplete in the number of views and number of subgoals of the query n Thus, algorithms try to reduce the numbers of rewritings to be considered n Three main algorithms l Bucket l Inverse rule l MiniCon Page 9.46

LAV Example Schema Local relations EMP1(ENAME,TITLE,CITY) EMP2(ENAME,TITLE,CITY) ASG1(ENAME,PNAME,DUR) Global relations EMP(ENAME,CITY) ASG(ENAME,PNAME,TITLE, DUR) Emp1(ename,title,city) :- Emp(ename,city), Asg(ename,-,title,-). (r 1 ) Emp2(ename,title,city) :- Emp(ename,city), Asg(ename,-,title,-). (r 2 ) Asg1(ename,pname,dur) :- Asg(ename,pname,-,dur) (r 3 ) Page 9.47

Bucket Algorithm n Considers each predicate of the query Q independently to select only the relevant views Step 1 l Build a bucket b for each subgoal q of Q that is not a comparison predicate l Insert in b the heads of the views which are relevant to answer q Step 2 l For each view V of the Cartesian product of the buckets, produce a conjunctive query u If it is contained in Q, keep it n The rewritten query is a union of conjunctive queries Page 9.48

LAV Example Query Let Q be Q(e,p) :- Emp(e, Paris ), Asg(e,p,-,-). Step1: we obtain 2 buckets (one for each subgoal of Q) b 1 = Emp1(ename,title,city), Emp2(ename,title,city) b 2 = Asg1(ename,pname,dur ) (the prime variables (title and dur ) are not useful) Step2: produces Q (e,p) :- Emp1(e,-, Paris ), Asg1(e,p,-,). (q 1 ) Q (e,p) :- Emp2(e,-, Paris ), Asg1(e,p,-,). (q 2 ) Page 9.49

Query Optimization and Execution n Takes a query expressed on local relations and produces a distributed QEP to be executed by the wrappers and mediator n Three main problems l Heterogeneous cost modeling u To produce a global cost model from component DBMS l Heterogeneous query optimization u To deal with different query computing capabilities l Adaptive query processing u To deal with strong variations in the execution environment Page 9.50

Heterogeneous Cost Modeling n Goal: determine the cost of executing the subqueries at component DBMS n Three approaches l Black-box: treats each component DBMS as a black-box and determines costs by running test queries l Customized: customizes an initial cost model l Dynamic: monitors the run-time behavior of the component DBMS and dynamically collect cost information Page 9.51

Black-box Approach n Define a logical cost expression l Cost = init cost + cost to find qualifying tuples + cost to process selected tuples u The terms will differ much with different DBMS n Run probing queries on component DBMS to compute cost coefficients l Count the numbers of tuples, measure cost, etc. l Special case: sample queries for each class of important queries u Use of classification to identify the classes n Problems l The instantiated cost model (by probing or sampling) may change over time l The logical cost function may not capture important details of component DBMS Page 9.52

Customized Approach n Relies on the wrapper (i.e. developer) to provide cost information to the mediator n Two solutions l Wrapper provides the logic to compute cost estimates u Access_cost = reset + (card-1)*advance s s s reset = time to initiate the query and receive a first tuple advance = time to get the next tuple (advance) card = result cardinality l Hierarchical cost model u Each node associates a query pattern with a cost function u The wrapper developer can give cost information at various levels of details, depending on knowledge of the component DBMS Page 9.53

Hierarchical Cost Model Page 9.54

Dynamic Approach n Deals with execution environment factors which may change l Frequently: load, throughput, network contention, etc. l Slowly: physical data organization, DB schemas, etc. n Two main solutions l Extend the sampling method to consider some new queries as samples and correct the cost model on a regular basis l Use adaptive query processing which computes cost during query execution to make optimization decisions Page 9.55

Heterogeneous Query Optimization n Deals with heterogeneous capabilities of component DBMS l One DBMS may support complex SQL queries while another only simple select on one fixed attribute n Two approaches, depending on the M/W interface level l Query-based u All wrappers support the same query-based interface (e.g. ODBC or SQL/MED) so they appear homogeneous to the mediator u Capabilities not provided by the DBMS must be supported by the wrappers l Operator-based u Wrappers export capabilities as compositions of operators u Specific capabilities are available to mediator More flexibility in defining the level of M/W interface Page 9.56

Query-based Approach n We can use 2-step query optimization with a heterogeneous cost model l But centralized query optimizers produce left-linear join trees whereas in MDB, we want to push as much processing in the wrappers, i.e. exploit bushy trees n Solution: convert a left-linear join tree into a bushy tree such that l The initial total cost of the QEP is maintained l The response time is improved n Algorithm l Iterative improvement of the initial left-linear tree by moving down subtrees while response time is improved Page 9.57

Left Linear vs Bushy Join Tree Page 9.58

Operator-based Approach n M/W communication in terms of subplans n Use of planning functions (Garlic) l Extension of cost-based centralized optimizer with new operators u Create temporary relations u Retrieve locally stored data u Push down operators in wrappers u accessplan and joinplan rules l Operator nodes annotated with u Location of operands, materialization, etc. Page 9.59

Planning Functions Example n Consider 3 component databases with 2 wrappers: l w 1.db 1 : EMP(ENO,ENAME,CITY) l w 1.db 2 : ASG(ENO,PNAME,DUR) l w 2. db 3 : EMPASG(ENAME,CITY,PNAME,DUR) n Planning functions of w 1 l AccessPlan (R: rel, A: attlist, P: pred) = scan(r, A, P, db(r)) l JoinPlan (R 1, R 2 : rel, A: attlist, P: joinpred) = join(r 1, R 2, A, P) u condition: db(r 1 ) db(r 2 ) u implemented by w 1 n Planning functions of w 2 l AccessPlan (R: rel, A: attlist, P: pred) = fetch(city=c) u condition: (city=c) included in P l AccessPlan (R: rel, A: attlist, P: pred) = scan(r, A, P, db(r)) u implemented by w 2 Page 9.60

Heterogenous QEP SELECT FROM WHERE!ENAME,PNAME,DUR!!EMPASG!!CITY = "Paris" AND DUR>24! Page 9.61

Adaptive Query Processing - Motivations n Assumptions underlying heterogeneous query optimization l The optimizer has sufficient knowledge about runtime u Cost information l Runtime conditions remain stable during query execution n Appropriate for MDB systems with few data sources in a controlled environment n Inappropriate for changing environments with large numbers of data sources and unpredictable runtime conditions Page 9.62

Example: QEP with Blocked Operator n Assume ASG, EMP, PROJ and PAY each at a different site n If ASG site is down, the entire pipeline is blocked n However, with some reorganization, the join of EMP and PAY could be done while waiting for ASG Page 9.63

Adaptive Query Processing Definition n A query processing is adaptive if it receives information from the execution environment and determines its behavior accordingly l Feed-back loop between optimizer and runtime environment l Communication of runtime information between mediator, wrappers and component DBMS u Hard to obtain with legacy databases n Additional components l Monitoring, assessment, reaction l Embedded in control operators of QEP n Tradeoff between reactiveness and overhead of adaptation Page 9.64

Adaptive Components n Monitoring parameters (collected by sensors in QEP) l Memory size l Data arrival rates l Actual statistics l Operator execution cost l Network throughput n Adaptive reactions l Change schedule l Replace an operator by an equivalent one l Modify the behavior of an operator l Data repartitioning Page 9.65

Eddy Approach n Query compilation: produces a tuple D, P, C, Eddy l D: set of data sources (e.g. relations) l P: set of predicates l C: ordering constraints to be followed at runtime l Eddy: n-ary operator between D and P n Query execution: operator ordering on a tuple basis using Eddy l On-the-fly tuple routing to operators based on cost and selectivity l Change of join ordering during execution u Requires symmetric join algorithms such Ripple joins Page 9.66

QEP with Eddy n D= {R, S, T} n P = {σ P (R), R JN 1 S, S JN 2 T) n C = {S < T} where < imposes S tuples to probe T tuples using an index on join attribute l Access to T is wrapped by JN Result tuples Page 9.67

Query Translation and Execution n Performed by wrappers using the component DBMS l Conversion between common interface of mediator and DBMS-dependent interface u Query translation from wrapper to DBMS u Result format translation from DBMS to wrapper l Wrapper has the local schema exported to the mediator (in common interface) and the mapping to the DBMS schema l Common interface can be query-based (e.g. ODBC or SQL/ MED) or operator-based n In addition, wrappers can implement operators not supported by the component DBMS, e.g. join Page 9.68

Wrapper Placement n Depends on the level of autonomy of component DB n Cooperative DB l May place wrapper at component DBMS site l Efficient wrapper-dbms com. n Uncooperative DB l May place wrapper at mediator l Efficient mediator-wrapper com. n Impact on cost functions Page 9.69