DBMS Query evaluation

Data Management for Data Science
DBMS
Maurizio Lenzerini, Riccardo Rosati
Master's degree programme in Data Science, Sapienza Università di Roma
Academic Year 2016/2017
http://www.dis.uniroma1.it/~rosati/dmds/

Evaluation of relational algebra operators

Our purpose is to analyze how the SQL engine computes the result of a query. We assume that the SQL engine translates the SQL expression into relational algebra. We therefore analyze the methods for evaluating single relational algebra operators, and then present query plans that combine the evaluation of multiple operators. We focus on selection, projection and join. The evaluation of the selection operation has already been described in the slides on file organizations.

Projection

The main problem with projection is the (possibly requested) elimination of duplicates. Two algorithms can be used for this purpose:
1. Sort-based projection
2. Hash-based projection

We analyze the first one, which works as follows (M is the number of pages of R):
1. Scan the relation R (the relation being projected) and project it, writing the result to a temporary file (let T be its number of pages).
2. Sort the temporary file, using all the attributes that remain after projection as the sort key.
3. Scan the sorted temporary file, deleting contiguous repeated occurrences of the same record, and write the result to a new file.

Cost of step 1: O(M) + O(T), where T = O(M)
Cost of step 2: O(T log_H T), where H is the number of free buffer slots for the merge sort
Cost of step 3: O(T)
Summing up, the total cost is O(M log_H M).
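The three steps of sort-based projection can be sketched in memory as follows. This is only an illustration under simplifying assumptions: records are Python dicts, the "temporary file" is a list, and the function name and sample data are hypothetical. A real engine would use an external merge sort over pages.

```python
def sort_based_projection(records, attributes):
    """Project records (dicts) on the given attributes, removing duplicates by sorting."""
    # Step 1: scan the relation and keep only the projected attributes.
    projected = [tuple(r[a] for a in attributes) for r in records]
    # Step 2: sort, using all the projected attributes as the sort key.
    projected.sort()
    # Step 3: scan the sorted data and skip contiguous duplicates.
    result = []
    for t in projected:
        if not result or result[-1] != t:
            result.append(t)
    return result


# Hypothetical sample relation.
students = [
    {"id": 1, "city": "Rome", "year": 1991},
    {"id": 2, "city": "Milan", "year": 1991},
    {"id": 3, "city": "Rome", "year": 1991},
]
print(sort_based_projection(students, ["city", "year"]))
```

After sorting, equal records are adjacent, so a single comparison with the last emitted record is enough to detect duplicates.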

Projection with index-only scan

If we have to compute the projection on attributes A1,...,An of relation R, and we have a B+-tree index whose search key includes all of these attributes, then we can execute a so-called index-only scan. It consists of scanning the leaves of the index, which is much faster than scanning the data file. Furthermore, if A1,...,An are a prefix of the search key of the index, then during the scan we can efficiently:
- eliminate the attributes that are projected out
- eliminate the duplicate tuples

Join

We refer to the equi-join of the relations R and S. As a running example, we assume that:
- R has M pages and p_R tuples per page
- S has N pages and p_S tuples per page
and we consider the following values: M = 1000, N = 500, p_R = 100, p_S = 10, and 10 ms to access a block in mass memory.

We will analyze the following algorithms:
- Nested loop
- Block nested loop
- Index nested loop
- Sort merge join

Nested loop

One of the two relations is the outer relation (say R), the other one is the inner relation (S).

  scan the outer relation R, and:
    for each page P of R:
      scan S, looking for the records that join with the records in P

The cost is M + (M × N) page accesses. With the previous values, the cost of computing the join of R and S through the nested loop is (1000 + (1000 × 500)) × 10 ms = 5,010,000 ms, i.e., about 1.4 hours. In general, the smaller of R and S should be chosen as the outer relation. If we do that in our example, the cost is (500 + (500 × 1000)) × 10 ms = 5,005,000 ms.
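The arithmetic above can be checked with a few lines of Python. The function name and the constant name are illustrative only; the values come from the running example.

```python
ACCESS_MS = 10  # time to access one page, from the running example


def nested_loop_cost(outer_pages, inner_pages):
    """Page accesses of the page-oriented nested loop: outer + outer * inner."""
    return outer_pages + outer_pages * inner_pages


M, N = 1000, 500  # pages of R and S

r_outer_ms = nested_loop_cost(M, N) * ACCESS_MS  # R as the outer relation
s_outer_ms = nested_loop_cost(N, M) * ACCESS_MS  # S (the smaller one) as the outer relation

print(r_outer_ms, s_outer_ms)
print(round(r_outer_ms / 3_600_000, 2), "hours with R as the outer relation")
```

The product term M × N is the same for both choices, so swapping the relations only changes the cost of scanning the outer relation once.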

Block nested loop

If the outer relation fits into the buffer and there are at least 2 extra free pages, we can load the whole outer relation into the buffer, use one of the extra buffer pages to read the inner relation one page at a time, and use the remaining free page to store the result (one page at a time). The cost is M + N (optimal). In our example, the cost of the block nested loop is 1500 × 10 ms = 15,000 ms, i.e., 15 seconds.

Block nested loop

In many cases the outer relation cannot fit into the buffer: we only have G free pages, so we can devote only G-2 buffer pages to the outer relation, where G-2 is less than the number of pages of the outer relation. The block nested loop algorithm changes as follows: load the first G-2 pages of the outer relation into the buffer, then scan the inner relation, loading one page at a time into the buffer; for each such page, compute the join between the tuples in this page and the tuples of the outer relation that are currently in the buffer, writing the result to the remaining extra page. Then repeat with the next G-2 pages of the outer relation, until the outer relation is exhausted.

If the outer relation is R, the cost is M + (N × M/(G-2)), with M/(G-2) rounded up to the next integer. Of course, to make the algorithm effective, G must be sufficiently large. In our example, assuming G = 102 and choosing S as the outer relation, the cost is (500 + (1000 × 500/100)) × 10 ms = 55,000 ms (i.e., about a minute).
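A minimal sketch of the block nested loop, assuming that a relation is a list of pages and a page is a list of dict records. The function names, the `key` parameter and this in-memory representation are assumptions made for illustration; the cost function reproduces the formula above.

```python
import math


def block_nested_loop_cost(outer_pages, inner_pages, free_pages):
    """Page accesses when G-2 of the G free pages hold a block of the outer relation."""
    blocks = math.ceil(outer_pages / (free_pages - 2))
    return outer_pages + inner_pages * blocks


def block_nested_loop_join(outer, inner, block_size, key):
    """Equi-join two relations given as lists of pages, one outer block at a time."""
    result = []
    for start in range(0, len(outer), block_size):
        # Load the next block of outer pages into the "buffer".
        block = [t for page in outer[start:start + block_size] for t in page]
        # Scan the inner relation once per block, one page at a time.
        for page in inner:
            for s in page:
                for t in block:
                    if t[key] == s[key]:
                        result.append({**t, **s})
    return result


# Running example: S (500 pages) as outer relation, R (1000 pages) as inner, G = 102.
cost_pages = block_nested_loop_cost(500, 1000, 102)
print(cost_pages, "page accesses,", cost_pages * 10, "ms")
```

When the whole outer relation fits in G-2 pages there is a single block, and the cost function reduces to M + N as on the previous slide.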

Index nested loop

The index nested loop algorithm can be executed only if there is an index on the inner relation whose search key is equal to (or contains) the attributes involved in the equi-join. If the inner relation is S, the algorithm is:

  scan R, and for each page P of R:
    for each tuple t in P:
      use the index on S to find the tuples of S that join with t

The cost depends on the type of index and on how many tuples of the inner relation join with a tuple of the outer relation. In particular, if the equi-join attributes are a key for the inner relation S, then at most one tuple of the inner relation joins with each tuple of the outer relation.

Index nested loop

Assume that there is a hash index on S (the inner relation), that the equi-join attributes are a key for S, and that a single access through the hash index costs 1.2 page accesses. Since R has 1000 × 100 = 100,000 tuples, the cost of the index nested loop algorithm is 1000 + 100,000 × 1.2 = 121,000 page accesses, i.e., 1,210,000 ms (about 20 minutes).
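The following sketch uses a Python dict as a stand-in for the hash index on S. The function names and the sample data are hypothetical; the cost function only reproduces the arithmetic of the example above.

```python
def index_nested_loop_join(outer, inner_index, key):
    """Join each outer tuple with the inner tuples found through the index."""
    result = []
    for t in outer:
        # One index probe per outer tuple replaces a scan of the inner relation.
        for s in inner_index.get(t[key], []):
            result.append({**t, **s})
    return result


def index_nested_loop_cost(outer_pages, tuples_per_page, probe_cost):
    """Scan the outer relation once, then pay one index access per outer tuple."""
    return outer_pages + outer_pages * tuples_per_page * probe_cost


# Hypothetical data: the join attribute k is a key for the inner relation.
r = [{"k": 1, "a": "x"}, {"k": 2, "a": "y"}, {"k": 4, "a": "w"}]
s_index = {1: [{"k": 1, "b": "p"}], 2: [{"k": 2, "b": "q"}]}
print(index_nested_loop_join(r, s_index, "k"))

cost_pages = index_nested_loop_cost(1000, 100, 1.2)
print(cost_pages, "page accesses, about", round(cost_pages * 10 / 60000), "minutes")
```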

Sort merge join

The sort merge join algorithm is as follows:
1. (if necessary) sort R on the equi-join attribute(s), using the merge sort algorithm
2. (if necessary) sort S on the equi-join attribute(s), using the merge sort algorithm
3. scan the pages of R (outer relation) and, for each tuple of each page of R, look for the tuples in the current page of S (inner relation) that join with it, exploiting the ordering

Cost of steps 1 and 2: the cost of the merge sort algorithm in mass memory is O(P log_H P), where P is the number of pages in the file and H is the number of free buffer slots that the algorithm can use. Therefore:
- the cost of step 1 is M log_H M
- the cost of step 2 is N log_H N

Sort merge join

Cost of step 3: in the worst case (all tuples of R join with all tuples of S) the cost is M × N. In the favourable cases the cost is much smaller. For instance, when the equi-join attributes are a key for the inner relation, every page of S is analyzed only once, so the cost is M + N (optimal). In the average case, even if the same page of S is requested multiple times for the join operation, it is very likely still in the buffer, so it is still realistic to assume a cost of M + N.

In our case, assuming that the buffer has 10 pages available for the merge sort:
- the cost of sorting R (step 1) is 1000 × log_10 1000 = 3000
- the cost of sorting S (step 2) is 500 × log_10 500 ≈ 1500 (rounding log_10 500 up to 3)
- the cost of step 3 is 1000 + 500 = 1500

Therefore, the total cost is 6000 × 10 ms = 60,000 ms (i.e., one minute).
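Step 3 and the cost estimate can be sketched as follows. This is an in-memory illustration under stated assumptions: both inputs are already sorted lists of dicts, and the logarithm is rounded up to a whole number of passes, as in the example. All names are hypothetical.

```python
def merge_join(r_sorted, s_sorted, key):
    """Merge step of the sort merge join on two inputs already sorted on key."""
    result = []
    i = j = 0
    while i < len(r_sorted) and j < len(s_sorted):
        if r_sorted[i][key] < s_sorted[j][key]:
            i += 1
        elif r_sorted[i][key] > s_sorted[j][key]:
            j += 1
        else:
            value = r_sorted[i][key]
            # Find the end of the group of S tuples with this join value.
            j_end = j
            while j_end < len(s_sorted) and s_sorted[j_end][key] == value:
                j_end += 1
            # Join every R tuple with this value against the whole S group.
            while i < len(r_sorted) and r_sorted[i][key] == value:
                for s in s_sorted[j:j_end]:
                    result.append({**r_sorted[i], **s})
                i += 1
            j = j_end
    return result


def log_ceil(pages, h):
    """Smallest k with h**k >= pages, i.e. log_h(pages) rounded up."""
    k, reach = 0, 1
    while reach < pages:
        reach *= h
        k += 1
    return k


def sort_merge_cost(m, n, h):
    """Sort both relations, then read each of them once in the merge step."""
    return m * log_ceil(m, h) + n * log_ceil(n, h) + (m + n)


total_pages = sort_merge_cost(1000, 500, 10)
print(total_pages, "page accesses,", total_pages * 10, "ms")
```

The cost function assumes the favourable M + N case for step 3; in the worst case described above that term would be M × N instead.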

Structure of the SQL engine in a DBMS (figure):
query -> Query parser -> parsed query -> Query optimizer -> query plan -> Query plan evaluator
The Query optimizer contains the Query plan generator and the Query plan cost estimator, and makes use of the Catalog manager.

Query optimizer

The query optimizer is the module that defines the query evaluation plan of the SQL query. A query evaluation plan (or simply query plan) is a tree whose nodes are relational algebra operators (select, project, join, etc.), each annotated with a description of the access methods and the algorithms used to execute that operation.

The query optimizer consists of:
- Query plan generator: the module that generates a set of possible query plans, each of which corresponds to a possible execution plan of the SQL query
- Query plan cost estimator: for every query plan produced by the query plan generator, it computes an estimate of the execution cost of that plan

The query optimizer chooses, among all the generated plans, a plan with the minimum cost.
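As a toy illustration of the final choice, the costs computed earlier for the running example (in page accesses) can stand in for the estimates produced by the cost estimator. The dictionary and its labels are hypothetical; a real optimizer derives its estimates from catalog statistics.

```python
# Estimated costs in page accesses, taken from the join examples above.
estimated_costs = {
    "nested loop (S outer)": 500 + 500 * 1000,                 # 500,500
    "block nested loop (S outer, G=102)": 500 + 1000 * 5,      # 5,500
    "index nested loop (hash index on S)": 1000 + 120_000,     # 121,000
    "sort merge join (10 buffer pages)": 3000 + 1500 + 1500,   # 6,000
}

# The optimizer keeps a plan with the minimum estimated cost.
best_plan = min(estimated_costs, key=estimated_costs.get)
print(best_plan, estimated_costs[best_plan])
```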

Query plan

Example of a query plan (a tree, listed here from the root down, with children indented):

π {ID} (on-the-fly)
  ⋈ taxcode=taxcode (sort merge join)
    Student
    σ year>1990 (scan and materialization of the result)
      Citizen

Query plan

Example: the following query Q:

  SELECT S.ID
  FROM Student S, Citizen R
  WHERE S.taxCode = R.taxCode AND R.year>1990

can be formulated in relational algebra through the expression

  π {ID} (σ year>1990 (Student ⋈ taxcode=taxcode Citizen))

Query plan

Every query plan for Q, without a specification of how the single operators are executed, corresponds to a relational algebra expression equivalent to Q. For instance, starting from π {ID} (σ year>1990 (Student ⋈ taxcode=taxcode Citizen)), the following trees (written here as expressions):

1. π {ID} (σ year>1990 (Student ⋈ taxcode=taxcode Citizen))
2. π {ID} (Student ⋈ taxcode=taxcode (σ year>1990 (Citizen)))
3. π {ID} ((σ year>1990 (Citizen)) ⋈ taxcode=taxcode Student)

correspond to expressions that are equivalent to the initial one.

Partitioning the SQL query

The query parser partitions the SQL query into blocks. Every block is a SELECT FROM WHERE query (possibly with HAVING and GROUP BY clauses) with no nested queries. A block essentially corresponds to a relational algebra expression that only uses the selection (σ), projection (π), join (⋈) and cartesian product (×) operators. Every block is processed independently by the subsequent query optimizer.

Query plan generator

The query plan generator takes as input a block, that is, a select-project-join relational algebra expression, and generates a set of query plans for that expression. To reach this goal:
- it computes equivalent transformations of the expression
- it uses the information about the existing access methods of the relations (namely, about the physical organization of the files storing the relations)

The query plan generator does not generate all the possible plans, because the number of such plans is usually huge (even for relatively simple queries). The generation of query plans is restricted by rules that allow only predefined kinds of query plans to be generated. Example of such a rule: every join and cartesian product must have a primitive relation as its inner relation (right child).

Query plan cost estimator

This is the module that assigns a cost to every query plan generated by the query plan generator. The cost is assigned according to a cost model that considers a set of parameters relevant to the real execution cost of the query. To compute the cost, the query plan cost estimator makes use of the meta-information given by the catalog manager, in particular:
- size of every relation (number of records, number of pages)
- type and size (number of bytes) of attributes
- size of indices (number of pages)
- for every index, min and max value of the search key currently stored
- etc.

Query plan evaluator

The query optimizer selects, among all query plans generated by the query plan generator, one having the minimum cost according to the query plan cost estimator. The query plan evaluator executes the chosen plan, that is, it executes the tree corresponding to the query plan.

The plan is executed bottom-up and left-to-right (in the order corresponding to a depth-first post-order traversal of the tree): first the innermost operations (the ones closest to the leaves of the tree) are executed, from left to right, and execution then proceeds upwards until the root, which is the last operation executed. During the execution, the operations are executed according to the access methods and the algorithms defined in the annotations of the query plan.
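The post-order execution can be sketched with a small recursive evaluator. The plan representation (nested dicts with an `op` field), the operator implementations and the sample tuples are assumptions made for this illustration; they ignore the access-method annotations of a real plan.

```python
def evaluate(node):
    """Evaluate a plan tree bottom-up: children first (left to right), then the node."""
    op = node["op"]
    if op == "scan":
        return node["rows"]
    inputs = [evaluate(child) for child in node["children"]]
    if op == "select":
        return [row for row in inputs[0] if node["pred"](row)]
    if op == "project":
        return [{a: row[a] for a in node["attrs"]} for row in inputs[0]]
    if op == "join":
        left, right = inputs
        return [{**l, **r} for l in left for r in right if l[node["on"]] == r[node["on"]]]
    raise ValueError(f"unknown operator: {op}")


# Hypothetical contents of the Student and Citizen relations.
student = [{"ID": 1, "taxcode": "A"}, {"ID": 2, "taxcode": "B"}]
citizen = [{"taxcode": "A", "year": 1995}, {"taxcode": "B", "year": 1985}]

# Plan for: project ID from (Student join (select year > 1990 from Citizen)).
plan = {
    "op": "project", "attrs": ["ID"], "children": [{
        "op": "join", "on": "taxcode", "children": [
            {"op": "scan", "rows": student},
            {"op": "select", "pred": lambda row: row["year"] > 1990,
             "children": [{"op": "scan", "rows": citizen}]},
        ],
    }],
}
print(evaluate(plan))
```

Because each node evaluates its children before applying its own operator, the root projection is the last operation executed.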

Relational algebra equivalences

The query plan generator uses equivalences between relational algebra expressions to generate new trees, starting from the one corresponding to the initial expression. Such equivalences are very important for producing a query plan with minimum cost. This step of the query plan generator is independent of the access methods of the relations and of the available algorithms for executing the operations.

Relational algebra equivalences

Examples (subscripts written with an underscore):

- R ⋈_C S ≡ S ⋈_C R
- (R ⋈_C1 S) ⋈_C2 T ≡ R ⋈_C1 (S ⋈_C2 T)
- R ⋈_C S ≡ σ_C (R × S)
- σ_C1 (σ_C2 (R)) ≡ σ_C2 (σ_C1 (R))
- π_A (σ_C (R)) ≡ σ_C (π_A (R)) if condition C only involves attributes of A
- π_A (σ_C (R × S)) ≡ σ_C (π_A (R) × π_A (S)) if condition C only involves attributes of A
- σ_(C1 ∧ C2 ∧ C3) (R × S) ≡ σ_C3 (σ_C1 (R) × σ_C2 (S)), where
  C1 are the conditions involving only attributes of R,
  C2 are the conditions involving only attributes of S,
  C3 are the conditions involving both attributes of R and attributes of S

Equivalence-based optimizations

Main heuristics:
- Push-down selections: in the query plan, execute selections as soon as possible (that is, push selections as far down as possible in the plan)
- Push-down projections: in the query plan, execute projections as soon as possible (that is, push projections as far down as possible in the plan)

Both heuristics are based on the fact that the cost of selection and projection operations is usually smaller than the cost of join operations. For instance, if a selection and a join must be executed, it is very likely that executing the selection first significantly reduces the size of (at least) one of the two relations on which the join will be executed, which usually implies that the cost of the join operation is significantly reduced.

Push-down selection

Example (the two trees, written here as expressions):

1. σ year>1990 (Student ⋈ taxcode=taxcode Citizen)
2. Student ⋈ taxcode=taxcode (σ year>1990 (Citizen))

Intuitively, the second query plan should be more efficient than the first one, because the join operation is performed after selecting, in the Citizen relation, the records whose year is greater than 1990. Hence, the intermediate relation on which the join is executed could be much smaller than the whole Citizen relation.
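The effect of pushing the selection down can be observed on a small, made-up instance. The relation contents and the helper `join` below are hypothetical; the point is only that both plans return the same tuples while the second one feeds fewer Citizen tuples to the join.

```python
def join(left, right, key):
    """Naive equi-join of two lists of dicts on a shared attribute."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]


# Hypothetical data: five students and five citizens with years 1986, 1988, ..., 1994.
students = [{"ID": i, "taxcode": f"T{i}"} for i in range(5)]
citizens = [{"taxcode": f"T{i}", "year": 1986 + 2 * i} for i in range(5)]

# Plan 1: join first, then select.
plan1 = [row for row in join(students, citizens, "taxcode") if row["year"] > 1990]

# Plan 2: select on Citizen first, then join with the smaller intermediate relation.
selected = [c for c in citizens if c["year"] > 1990]
plan2 = join(students, selected, "taxcode")

print(len(citizens), "Citizen tuples joined in plan 1,", len(selected), "in plan 2")
print(plan1 == plan2)
```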

Limits of heuristics

Such heuristics do not always produce optimal solutions. Indeed, we should evaluate the consequences of the priorities among operators with respect to the existing access methods of the relations and the algorithms available in the DBMS for executing the single operators.

For instance, if in the previous example there is an index on the Citizen relation with search key taxcode, we could use that index when executing the first tree, but not when executing the second tree, because there the join operates on the result of the selection rather than on the indexed Citizen relation. In many situations, this might make the execution of the first tree more efficient than the execution of the second tree.

Generation of intermediate results

The results of intermediate operations (corresponding to the intermediate nodes in the query plan tree) can be handled in two ways:
- materialization of the result: a temporary DB table is generated, which stores the intermediate result in mass memory
- no materialization of the result: the result is not stored in mass memory; as soon as it is produced, it is passed to the next operation one chunk at a time (like in an assembly line). In this way, a pipeline of operations is created

Materialization implies a slower execution of the operation (because of the need to access mass memory). On the other hand, not all operations can be executed without materializing the result. For instance, if the result must be ordered, then it is necessary to materialize it.
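Python generators give a convenient analogy for the two options: a chain of generators passes one record at a time to the next operator (a pipeline), while sorting forces the whole intermediate result to be collected first. The operator functions and the sample data are hypothetical, and the "materialized" result here is an in-memory list rather than a temporary table in mass memory.

```python
def scan(rows):
    """Produce the records of a relation one at a time."""
    for row in rows:
        yield row


def select(rows, predicate):
    """Pass on only the records that satisfy the predicate."""
    for row in rows:
        if predicate(row):
            yield row


def project(rows, attributes):
    """Keep only the given attributes of each record."""
    for row in rows:
        yield {a: row[a] for a in attributes}


# Hypothetical contents of the Citizen relation.
citizens = [
    {"taxcode": "C", "year": 1995},
    {"taxcode": "A", "year": 1985},
    {"taxcode": "B", "year": 1992},
]


def late_taxcodes():
    """Pipeline: no intermediate result is stored between the operators."""
    return project(select(scan(citizens), lambda r: r["year"] > 1990), ["taxcode"])


# Pipelined: the first result is available after reading a single input record.
first = next(late_taxcodes())

# Ordering needs the complete input, so the pipeline output is collected before sorting.
materialized = sorted(late_taxcodes(), key=lambda r: r["taxcode"])
print(first, materialized)
```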

Generation of intermediate results

Example: a plan that joins T with the result of R ⋈ S.

If sort merge join is the only available algorithm for executing the join, then the intermediate result of the R ⋈ S operation must be materialized, because the sort merge join algorithm has to sort the relations on which the join is executed (and so the execution of the outermost join requires that the result of R ⋈ S is materialized).