CSE 544 Principles of Database Management Systems. Magdalena Balazinska Fall 2007 Lecture 9 - Query optimization

Similar documents
CSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 7 - Query optimization

Review: Query Evaluation Steps. Example Query: Logical Plan 1. What We Already Know. Example Query: Logical Plan 2.

Lecture 20: Query Optimization (2)

Introduction to Data Management CSE 344. Lecture 12: Cost Estimation Relational Calculus

Database Systems CSE 414

CSE 344 APRIL 27 TH COST ESTIMATION

CSE 344 FEBRUARY 21 ST COST ESTIMATION

Announcements. Two typical kinds of queries. Choosing Index is Not Enough. Cost Parameters. Cost of Reading Data From Disk

Lecture 19: Query Optimization (1)

Reminders. Query Optimizer Overview. The Three Parts of an Optimizer. Dynamic Programming. Search Algorithm. CSE 444: Database Internals

CSE 544 Principles of Database Management Systems

CSE 444: Database Internals. Lecture 11 Query Optimization (part 2)

Review. Administrivia (Preview for Friday) Lecture 21: Query Optimization (1) Where We Are. Relational Algebra. Relational Algebra.

Lecture 21: Query Optimization (1)

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Fall 2007 Lecture 7 - Query execution

CompSci 516 Data Intensive Computing Systems. Lecture 11. Query Optimization. Instructor: Sudeepa Roy

Principles of Data Management. Lecture #12 (Query Optimization I)

Data Storage. Query Performance. Index. Data File Types. Introduction to Data Management CSE 414. Introduction to Database Systems CSE 414

Database Management Systems CSEP 544. Lecture 6: Query Execution and Optimization Parallel Data processing

Relational Query Optimization. Highlights of System R Optimizer

CSE 344 APRIL 20 TH RDBMS INTERNALS

Introduction to Database Systems CSE 414. Lecture 26: More Indexes and Operator Costs

Query Optimization. Schema for Examples. Motivating Example. Similar to old schema; rname added for variations. Reserves: Sailors:

Query Optimization. Schema for Examples. Motivating Example. Similar to old schema; rname added for variations. Reserves: Sailors:

Relational Query Optimization

Hash table example. B+ Tree Index by Example Recall binary trees from CSE 143! Clustered vs Unclustered. Example

Magda Balazinska - CSE 444, Spring

ATYPICAL RELATIONAL QUERY OPTIMIZER

Administrivia. CS 133: Databases. Cost-based Query Sub-System. Goals for Today. Midterm on Thursday 10/18. Assignments

Schema for Examples. Query Optimization. Alternative Plans 1 (No Indexes) Motivating Example. Alternative Plans 2 With Indexes

Introduction to Database Systems CSE 444

Principles of Data Management. Lecture #9 (Query Processing Overview)

Relational Query Optimization. Overview of Query Evaluation. SQL Refresher. Yanlei Diao UMass Amherst October 23 & 25, 2007

QUERY OPTIMIZATION E Jayant Haritsa Computer Science and Automation Indian Institute of Science. JAN 2014 Slide 1 QUERY OPTIMIZATION

Overview of Implementing Relational Operators and Query Evaluation

CSE 444: Database Internals. Lectures 5-6 Indexing

Query Processing & Optimization

An SQL query is parsed into a collection of query blocks optimize one block at a time. Nested blocks are usually treated as calls to a subroutine

Introduction to Database Systems CSE 344

Introduction to Database Systems CSE 444, Winter 2011

Relational Query Optimization

Faloutsos 1. Carnegie Mellon Univ. Dept. of Computer Science Database Applications. Outline

Announcements. From SQL to RA. Query Evaluation Steps. An Equivalent Expression

CSE 344 FEBRUARY 14 TH INDEXING

Overview of Query Evaluation. Overview of Query Evaluation

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe

Database Applications (15-415)

CompSci 516 Data Intensive Computing Systems

Overview of Query Processing

Announcement. Reading Material. Overview of Query Evaluation. Overview of Query Evaluation. Overview of Query Evaluation 9/26/17

R & G Chapter 13. Implementation of single Relational Operations Choices depend on indexes, memory, stats, Joins Blocked nested loops:

Introduction to Data Management CSE 344. Lectures 9: Relational Algebra (part 2) and Query Evaluation

Database Systems. Announcement. December 13/14, 2006 Lecture #10. Assignment #4 is due next week.

CS 4604: Introduction to Database Management Systems. B. Aditya Prakash Lecture #10: Query Processing

Overview of Query Evaluation

QUERY OPTIMIZATION [CH 15]

Evaluation of Relational Operations. Relational Operations

Chapter 14 Query Optimization

Chapter 14 Query Optimization

Chapter 14 Query Optimization

Review. Relational Query Optimization. Query Optimization Overview (cont) Query Optimization Overview. Cost-based Query Sub-System

Relational Query Optimization. Overview of Query Evaluation. SQL Refresher. Yanlei Diao UMass Amherst March 8 and 13, 2007

Administrivia. Relational Query Optimization (this time we really mean it) Review: Query Optimization. Overview: Query Optimization

CSE 444: Database Internals. Lecture 22 Distributed Query Processing and Optimization

Advances in Data Management Query Processing and Query Optimisation A.Poulovassilis

15-415/615 Faloutsos 1

Database Applications (15-415)

Lecture 17: Query execution. Wednesday, May 12, 2010

Introduction Alternative ways of evaluating a given query using

Evaluation of Relational Operations: Other Techniques. Chapter 14 Sayyed Nezhadi

Evaluation of Relational Operations

Outline. Query Processing Overview Algorithms for basic operations. Query optimization. Sorting Selection Join Projection

DBMS Y3/S5. 1. OVERVIEW The steps involved in processing a query are: 1. Parsing and translation. 2. Optimization. 3. Evaluation.

Query Evaluation Overview, cont.

Introduction to Query Processing and Query Optimization Techniques. Copyright 2011 Ramez Elmasri and Shamkant Navathe

Implementation of Relational Operations

CS330. Query Processing

Query Evaluation Overview, cont.

Hash-Based Indexing 165

Chapter 14: Query Optimization

Advanced Databases. Lecture 4 - Query Optimization. Masood Niazi Torshiz Islamic Azad university- Mashhad Branch

Lassonde School of Engineering Winter 2016 Term Course No: 4411 Database Management Systems

Principles of Database Management Systems

Evaluation of Relational Operations: Other Techniques

Evaluation of relational operations

Examples of Physical Query Plan Alternatives. Selected Material from Chapters 12, 14 and 15

Chapter 11: Query Optimization

Overview of Query Evaluation

Chapter 3. Algorithms for Query Processing and Optimization

Evaluation of Relational Operations

Evaluation of Relational Operations: Other Techniques

Chapter 13: Query Optimization. Chapter 13: Query Optimization

Contents Contents Introduction Basic Steps in Query Processing Introduction Transformation of Relational Expressions...

Implementing Relational Operators: Selection, Projection, Join. Database Management Systems, R. Ramakrishnan and J. Gehrke 1

Implementation of Relational Operations. Introduction. CS 186, Fall 2002, Lecture 19 R&G - Chapter 12

Query Optimization. Chapter 13

RELATIONAL OPERATORS #1

Implementation of Relational Operations: Other Operations

Query Evaluation (i)

Principles of Data Management. Lecture #13 (Query Optimization II)

Transcription:

CSE 544 Principles of Database Management Systems Magdalena Balazinska Fall 2007 Lecture 9 - Query optimization

References Access path selection in a relational database management system. Selinger. et. al. SIMOD 1979 Database management systems. Ramakrishnan and Gehrke. Third Ed. Chapter 15. CSE 544 - Fall 2007 2

Outline Basic query optimization algorithm Typical query optimizer (based on System R) Estimating the cost of a query plan Search space Algorithm for enumerating query plans Other types of optimizers CSE 544 - Fall 2007 3

Query Optimization Algorithm For a query There exists many physical query plans Query optimizer needs to pick a good one Basic query optimization algorithm Enumerate alternative plans Compute estimated cost of each plan Compute number of I/Os Optionally take into account other resources Choose plan with lowest cost This is called cost-based optimization CSE 544 - Fall 2007 4

Outline Basic query optimization algorithm Typical query optimizer (based on System R) Estimating the cost of a query plan Search space Algorithm for enumerating query plans Other types of optimizers CSE 544 - Fall 2007 5

Estimating Cost of a Query Plan We already how to Compute the cost of different operations We still need to Compute cost of retrieving tuples from disk with different access paths (for more sophisticated predicates than equality) Compute cost of a complete plan CSE 544 - Fall 2007 6

Access Path Access path: a way to retrieve tuples from a table A file scan An index plus a matching selection condition Index matches selection condition if it can be used to retrieve just tuples that satisfy the condition Example: Supplier(sid,sname,scity,sstate) B+-tree index on (scity,sstate) matches scity= Seattle does not match sid=3, does not match sstate= WA CSE 544 - Fall 2007 7

Access Path Selection Supplier(sid,sname,scity,sstate) Selection condition: sid > 300 scity= Seattle Indexes: B+-tree on sid and B+-tree on scity Which access path should we use? We should pick the most selective access path CSE 544 - Fall 2007 8

Access Path Selectivity Access path selectivity is the number of pages retrieved if we use this access path Most selective retrieves fewest pages As we saw earlier, for equality predicates Selection on equality: σ a=v (R) V(R, a) = # of distinct values of attribute a 1/V(R,a) is thus the reduction factor Clustered index on a: cost B(R)/V(R,a) Unclustered index on a: cost T(R)/V(R,a) (we are ignoring I/O cost of index pages for simplicity) CSE 544 - Fall 2007 9

Selectivity for Range Predicates Selection on range: σ a>v (R) How to compute the selectivity? Assume values are uniformly distributed Reduction factor X X = (Max(R,a) - v) / (Max(R,a) - Min(R,a)) Clustered index on a: cost B(R)*X Unclustered index on a: cost T(R)*X CSE 544 - Fall 2007 10

Back to Our Example Selection condition: sid > 300 scity= Seattle Index I1: B+-tree on sid clustered Index I2: B+-tree on scity unclustered Let s assume V(Supplier,scity) = 20 Max(Supplier, sid) = 1000, Min(Supplier,sid)=1 B(Supplier) = 100, T(Supplier) = 1000 Cost I1: B(R) * (Max-v)/(Max-Min) = 100*700/999 70 Cost I2: T(R) * 1/V(Supplier,scity) = 1000/20 = 50 CSE 544 - Fall 2007 11

Selectivity with Multiple Conditions What if we have an index on multiple attributes? Example selection σ a=v1 b= v2 (R) and index on <a,b> How to compute the selectivity? Assume attributes are independent X = 1 / (V(R,a) * V(R,b)) Clustered index on <a,b>: cost B(R)*X Unclustered index on <a,b>: cost T(R)*X CSE 544 - Fall 2007 12

We already how to Back to Estimating Cost of a Query Plan Compute the cost of different operations Compute cost of retrieving tuples from disk with different access paths (for more sophisticated predicates than equality) We still need to Compute cost of a complete plan CSE 544 - Fall 2007 13

Computing the Cost of a Plan Collect statistical summaries of stored data Compute cost in a bottom-up fashion For each operator compute Estimate cost of executing the operation Estimate statistical summary of the output data CSE 544 - Fall 2007 14

Statistics on Base Data Collected information for each relation Number of tuples (cardinality) Indexes, number of keys in the index Number of physical pages, clustering info Statistical information on attributes Min value, max value, number distinct values Histograms Correlations between columns (hard) Collection approach: periodic, using sampling CSE 544 - Fall 2007 15

Computing Cost of an Operator The cost of executing an operator depends On the operator implementation On the input data We learned how to compute this cost last two lectures CSE 544 - Fall 2007 16

Statistics on the Output Data Most important piece of information Size of operator result I.e., the number of output tuples Projection: output size same as input size Selection: multiply input size by reduction factor Similar to what we did for estimating access path selectivity Assume independence between conditions in the predicate (use product of the reduction factors for the terms) CSE 544 - Fall 2007 17

Estimating Result Sizes For joins R S Take product of cardinalities of relations R and S Apply reduction factors for each term in join condition Terms are of the form: column1 = column2 Reduction: 1/ ( MAX( V(R,column1), V(S,column2)) Assumes each value in smaller set has a matching value in the larger set CSE 544 - Fall 2007 18

Our Example Suppliers(sid,sname,scity,sstate) Supplies(pno,sid,quantity) Some statistics T(Supplier) = 1000 records B(Supplier) = 100 pages T(Supplies) = 10,000 records B(Supplies) = 100 pages V(Supplier,scity) = 20, V(Supplier,state) = 10 V(Supplies,pno) = 3,000 Both relations are clustered CSE 544 - Fall 2007 19

Physical Query Plan 1 (On the fly) (On the fly) π sname σ scity= Seattle sstate= WA pno=2 Selection and project on-the-fly -> No additional cost. (Nested loop) sno = sno Total cost of plan is thus cost of join: = B(Supplier)+B(Supplier)*B(Supplies) = 100 + 100 * 100 = 10,100 I/Os Suppliers (File scan) Supplies (File scan) CSE 544 - Fall 2007 20

Physical Query Plan 2 (On the fly) (Sort-merge join) (Scan write to T1) π sname sno = sno (1) σ scity= Seattle sstate= WA (4) (3) (2) σ pno=2 Total cost = 100 + 100 * 1/20 * 1/10 (1) + 100 + 100 * 1/3000 (2) + 2 (3) + 0 (4) Total cost 204 I/Os (Scan write to T2) Suppliers (File scan) Supplies (File scan) CSE 544 - Fall 2007 21

Plan 2 with Different Numbers What if we had: π sname 10K pages of Suppliers 10K pages of Supplies (Sort-merge join) (Scan write to T1) sno = sno (1) σ scity= Seattle sstate= WA (4) (3) (Scan write to T2) (2) σ pno=2 Total cost = 10000 + 50 (1) + 10000 + 4 (2) + 4*50 + 2*4 + 4 + 50 (3) + 0 (4) Total cost 20,316 I/Os Suppliers (File scan) Supplies (File scan) Assuming naive two-pass sort algorithm CSE 544 - Fall 2007 22

Physical Query Plan 3 (On the fly) (On the fly) (3) (4) π sname σ scity= Seattle sstate= WA (2) sno = sno (Index nested loop) Total cost = 1 (1) + 4 (2) + 0 (3) + 0 (3) Total cost 5 I/Os (Use hash index) 4 tuples (1) σ pno=2 Supplies (Hash index on pno ) Assume: clustered Suppliers (Hash index on sno) Clustering does not matter CSE 544 - Fall 2007 23

Simplifications In the previous examples, we assumed that all index pages were in memory When this is not the case, we need to add the cost of fetching index pages from disk (see lecture 6) CSE 544 - Fall 2007 24

Summary What we know Different types of physical query plans How to compute the cost of a query plan Although it is hard to compute the cost accurately We can now compare query plans Let s now consider how the query optimizer searches through the space of possible plans CSE 544 - Fall 2007 25

Outline Basic query optimization algorithm Typical query optimizer (based on System R) Estimating the cost of a query plan Search space Algorithm for enumerating query plans Other types of optimizers CSE 544 - Fall 2007 26

Relational Algebra Equivalences Selections Commutative: σ c1 (σ c2 (R)) same as σ c2 (σ c1 (R)) Cascading: σ c1 c2 (R) same as σ c2 (σ c1 (R)) Projections Cascading Joins Commutative : R S same as S R Associative: R (S T) same as (R S) T CSE 544 - Fall 2007 27

Left-Deep Plans and Bushy Plans R4 R2 R3 R1 Left-deep plan R3 R1 R2 R4 Bushy plan CSE 544 - Fall 2007 28

Relational Algebra Equivalences Selects, projects, and joins We can commute and combine all three types of operators We just have to be careful that the fields we need are available when we apply the operator Relatively straightforward. See book 15.3. If you like this topic, more info in optional paper (by Chaudhuri), Section 4. CSE 544 - Fall 2007 29

Search Space Challenges Search space is huge! Many possible equivalent trees Many implementations for each operator Many access paths for each relation Cannot consider ALL plans Want a search space that includes low-cost plans CSE 544 - Fall 2007 30

System R Search Space Only left-deep plans Enable dynamic programming for enumeration Facilitate tuple pipelining from outer relation Consider plans with all interesting orders Perform cross-products after all other joins (heuristic) Only consider nested loop & sort-merge joins Consider both file scan and indexes Try to evaluate predicates early CSE 544 - Fall 2007 31

Plan Enumeration Algorithm Idea: use dynamic programming For each subset of {R1,, Rn}, compute the best plan for that subset In increasing order of set cardinality: Step 1: for {R1}, {R2},, {Rn} Step 2: for {R1,R2}, {R1,R3},, {Rn-1, Rn} Step n: for {R1,, Rn} It is a bottom-up strategy A subset of {R1,, Rn} is also called a subquery CSE 544 - Fall 2007 32

Dynamic Programming Algo. For each subquery Q {R1,, Rn} compute the following: Size(Q) A best plan for Q: Plan(Q) The cost of that plan: Cost(Q) CSE 544 - Fall 2007 33

Dynamic Programming Algo. Step 1: Enumerate all single-relation plans Consider selections on attributes of relation Consider all possible access paths Consider attributes that are not needed Compute cost for each plan Keep cheapest plan per interesting output order CSE 544 - Fall 2007 34

Dynamic Programming Algo. Step 2: Generate all two-relation plans For each each single-relation plan from step 1 Consider that plan as outer relation Consider every other relation as inner relation Compute cost for each plan Keep cheapest plan per interesting output order CSE 544 - Fall 2007 35

Dynamic Programming Algo. Step 3: Generate all three-relation plans For each each two-relation plan from step 2 Consider that plan as outer relation Consider every other relation as inner relation Compute cost for each plan Keep cheapest plan per interesting output order Steps 4 through n: repeat until plan contains all the relations in the query CSE 544 - Fall 2007 36

Commercial Query Optimizers DB2, Informix, Microsoft SQL Server, Oracle 8 Inspired by System R Left-deep plans and dynamic programming Cost-based optimization (CPU and IO) Go beyond System R style of optimization Also consider right-deep and bushy plans (e.g., Oracle and DB2) Variety of additional strategies for generating plans (e.g., DB2 and SQL Server) CSE 544 - Fall 2007 37

Other Query Optimizers Randomized plan generation Genetic algorithm PostgreSQL uses it for queries with many joins Rule-based Extensible collection of rules Rule = Algebraic law with a direction Algorithm for firing these rules Generate many alternative plans, in some order Prune by cost Startburst (later DB2) and Volcano (later SQL Server) CSE 544 - Fall 2007 38