QUERY OPTIMIZATION

E0 261 Jayant Haritsa Computer Science and Automation Indian Institute of Science JAN 2014 Slide 1

Database Engines: Main Components
Query Processing
Transaction Processing
Access Methods

Declarative Queries
SQL, the standard database query interface, is a declarative language: it specifies only what is wanted, not how the query should be evaluated (i.e., ends, not means).
Example: List the names of students with their registered courses.
    select StudentName, CourseName
    from STUDENT, COURSE, REGISTER
    where STUDENT.RollNo = REGISTER.RollNo
      and REGISTER.CourseNo = COURSE.CourseNo
Unspecified: the join order [ ((S ⋈ R) ⋈ C) or ((R ⋈ C) ⋈ S)? ] and the join techniques [ Nested-Loops, Sort-Merge, or Hash? ].
The DBMS query optimizer identifies an efficient execution strategy: the query execution plan.

Query Processing Architecture
User Queries (SQL) -> PARSER (SYNTAX) -> VALIDATOR (SEMANTICS) -> OPTIMIZER (QUERY PLAN) -> CODE GENERATOR -> EXECUTION

Overview of Query Optimization
Algebra Tree: tree of relational algebra operators (σ, π, ⋈, and so on) that implement the user's query.
Query Plan: tree of relational algebra operators, with a choice of algorithm for each operator.
Design Goal: find the best plan.

Motivating Example
Student (sid: integer, sname: string, age: integer, gpa: real)
Register (sid: integer, cid: integer, section: integer, desc: string)
    SELECT S.sname
    FROM Student S, Register R
    WHERE S.sid = R.sid AND R.cid = 261 AND S.age > 35
(Equivalent English query: senior registrants for the DBMS course)

Example (contd)
Student: each tuple is 50 bytes long, 80 tuples per page, 500 pages.
Register: each tuple is 40 bytes long, 100 tuples per page, 1000 pages.
RA tree: π_sname( σ_{cid=261 ∧ age>35}( Register ⋈_{sid=sid} Student ) )
Plan: the same tree, with the join done by nested loops and the selection and projection applied on the fly.
Cost (5 buffers): 500 + 167*1000 = 167500 page I/Os, i.e., one scan of Student plus 167 = ⌈500/3⌉ scans of Register, which matches a block nested loops join that reads Student in blocks of 3 pages.
This plan misses several opportunities: selections could have been pushed earlier, no use is made of any available indexes, etc.
Goal of optimization: to find more efficient plans that compute the same answer.
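The 167500 figure can be reproduced with a short calculation. The sketch below is illustrative only: it assumes block nested loops with Student as the outer relation, and it assumes that two of the five buffer pages are reserved (one for scanning the inner relation, one for output), which leaves three pages per outer block. The function name is hypothetical.

```python
import math

def block_nested_loops_cost(outer_pages, inner_pages, buffers):
    """Page I/Os for a block nested loops join (writing the result is not counted)."""
    # Assumption: 1 buffer page for the inner scan, 1 for output, the rest for the outer block.
    block_size = buffers - 2
    outer_blocks = math.ceil(outer_pages / block_size)
    # Read the outer once, and scan the whole inner once per outer block.
    return outer_pages + outer_blocks * inner_pages

# Student (500 pages) as outer, Register (1000 pages) as inner, 5 buffers.
cost = block_nested_loops_cost(outer_pages=500, inner_pages=1000, buffers=5)
print(cost)  # 167500
```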

Alternative Plan
Main difference: push selects.
Plan: π_sname (on-the-fly) over a sort-merge join on sid=sid of T1 = σ_{cid=261}(Register) and T2 = σ_{age>35}(Student); each selection scans its relation and writes the result to a temp (T1, T2).
With 5 buffers, cost of plan:
Scan Register (1000) + write temp T1 (10 pages, if we have 100 different courses, uniform distribution).
Scan Student (500) + write temp T2 (250 pages, if we have 10 different ages in range 30-40, uniform distribution).
Sort T1 (2*2*10), sort T2 (2*4*250), merge-join (10+250).
Total: 4060 page I/Os.
If we used BNL join of T1 and T2, join cost = 10 + 4*250, total cost = 2770.
If we push projections, T1 has only sid, T2 only sid and sname: T1 fits in 3 pages, cost of BNL drops to under 250 pages, total < 2000.
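The totals on this slide follow from standard external-sort and block-nested-loops arithmetic. A minimal sketch, assuming a general external merge sort (pass 0 creates runs of B pages, each later pass merges B-1 runs) and the temp sizes quoted on the slide; all names are illustrative.

```python
import math

def external_sort_passes(pages, buffers):
    """Number of passes of an external merge sort with `buffers` buffer pages."""
    runs = math.ceil(pages / buffers)   # pass 0: sorted runs of `buffers` pages
    passes = 1
    while runs > 1:                     # each later pass merges (buffers - 1) runs
        runs = math.ceil(runs / (buffers - 1))
        passes += 1
    return passes

def sort_cost(pages, buffers):
    # Every pass reads and writes every page once.
    return 2 * pages * external_sort_passes(pages, buffers)

BUFFERS = 5
T1_PAGES, T2_PAGES = 10, 250            # temp sizes taken from the slide

# Scan each relation and write the selected tuples to a temp.
selection_cost = (1000 + T1_PAGES) + (500 + T2_PAGES)

# Sort-merge join: sort both temps, then one merging pass over both.
sort_merge_join = (sort_cost(T1_PAGES, BUFFERS)
                   + sort_cost(T2_PAGES, BUFFERS)
                   + (T1_PAGES + T2_PAGES))

# Block nested loops with T1 as outer, read in blocks of BUFFERS - 2 pages.
bnl_join = T1_PAGES + math.ceil(T1_PAGES / (BUFFERS - 2)) * T2_PAGES

total_sort_merge = selection_cost + sort_merge_join
total_bnl = selection_cost + bnl_join
print(total_sort_merge, total_bnl)  # 4060 2770
```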

Design Framework Issues
For a given query, what plans are considered (i.e., plan space)?
How is the cost of a plan estimated (i.e., cost model, statistics)?
How to efficiently search through the plan space for the cheapest plan?

Design Framework (contd)
Ideally: want to find the best plan.
Practically: avoid the worst plans!
System R approach: the 1979 ACM SIGMOD conference paper is considered the Bible of query optimization. It is the most widely used approach currently and works well for < 10 joins.
P. Selinger, M. Astrahan, D. Chamberlin, R. Lorie and T. Price, "Access Path Selection in a Relational Database System", Proc. of ACM SIGMOD Intl. Conf. on Management of Data, June 1979.

Overview of System R Optimizer

Issue 1: Plan Space
Operator algorithms:
Access: Sequential, Index (clustered / unclustered)
Join: Nested-Loop, Sort-Merge
Plan structure:
Only left-deep plans are considered. Left-deep plans allow the output of each operator to be pipelined into the next operator without storing it in a temporary relation.
Cartesian products are avoided.

Join Orders
[Figure: three join-tree shapes over relations A, B, C, D: bushy, left-deep, and right-deep.]

Issue 2: Cost Model
Weighted combination of CPU and I/O costs.
Simple statistics of relations and indexes are maintained in system catalogs.
Assume independence across predicates and uniform data distribution across the domain.
"Wet finger" technique (rough default guesses) when statistics are not available!
Statistics are only periodically updated, which ensures that metadata locking does not become a bottleneck.

Issue 3: Search Algorithm
Given an n-relation join, there are n! plans possible (not factoring in the choice of join method).
Plan prefixes are memoryless about order, therefore use Dynamic Programming (DP).
DP works for n up to about 10.

Details of System R Optimizer

Query Blocks: Units of Optimization
Outer block:
    SELECT S.sname
    FROM Student S
    WHERE (S.age, S.gpa) IN
Nested block:
        (SELECT S2.age, MAX (S2.gpa)
         FROM Student S2
         GROUP BY S2.age)
An SQL query is parsed into a collection of query blocks, and these are optimized one block at a time.
Nested blocks are usually treated as calls to a subroutine, made once per outer tuple.

Estimations
For each plan considered:
Must estimate the cost of each operation in the plan tree.
Must estimate the size of the result (cardinality) for each operation in the tree.

Size Estimation
Consider a query block:
    SELECT attribute list
    FROM relation list
    WHERE term 1 AND ... AND term k
Maximum # tuples in result is the product of the cardinalities of the relations in the FROM clause.
The reduction factor (RF) associated with each term reflects the impact of the term in reducing result size.
Result cardinality = Max # tuples * product of all RFs.
Implicit assumption that terms are independent!
Term col = value has RF 1/NKeys(I), given index I on col; otherwise 1/10.
Term col1 = col2 has RF 1/MAX(NKeys(I1), NKeys(I2)); otherwise 1/10.
Term col > value has RF (High(I) - value)/(High(I) - Low(I)); otherwise 1/3.
Term min < col < max has RF (max - min)/(High(I) - Low(I)); otherwise 1/4.
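The reduction-factor rules above translate directly into a few small functions. The sketch below applies them to the motivating query; the catalog values are assumptions made for illustration (40000 distinct sid values, 100 distinct courses, ages ranging over 30 to 40), and the function names are hypothetical. Exact fractions are used to avoid floating-point noise.

```python
from fractions import Fraction

def rf_equals_value(nkeys=None):
    """RF for `col = value`: 1/NKeys(I) with an index on col, else the default 1/10."""
    return Fraction(1, nkeys) if nkeys else Fraction(1, 10)

def rf_equals_column(nkeys1=None, nkeys2=None):
    """RF for `col1 = col2`: 1/MAX(NKeys(I1), NKeys(I2)), else the default 1/10."""
    known = [k for k in (nkeys1, nkeys2) if k]
    return Fraction(1, max(known)) if known else Fraction(1, 10)

def rf_greater_than(value, low=None, high=None):
    """RF for `col > value`: (High - value)/(High - Low), else the default 1/3."""
    if low is None or high is None:
        return Fraction(1, 3)
    return Fraction(high - value, high - low)

def result_cardinality(cardinalities, reduction_factors):
    """Product of the input cardinalities times the product of all RFs."""
    size = Fraction(1)
    for n in cardinalities:
        size *= n
    for rf in reduction_factors:
        size *= rf
    return size

# Student: 500 pages * 80 tuples; Register: 1000 pages * 100 tuples.
n_student, n_register = 500 * 80, 1000 * 100
estimate = result_cardinality(
    [n_student, n_register],
    [
        rf_equals_column(nkeys1=40000),       # S.sid = R.sid (assumed 40000 distinct sids)
        rf_equals_value(nkeys=100),           # R.cid = 261 (assumed 100 courses)
        rf_greater_than(35, low=30, high=40), # S.age > 35 (assumed age range 30..40)
    ],
)
print(estimate)  # 500
```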

Query Blocks
Various cases:
Single-relation query block
Multiple-relation query block
Nested-query blocks

Single-relation Block
Queries over a single relation consist of a combination of selects, projects, and aggregate ops.
Each available access path (file scan / index) is considered, and the one with the least estimated cost is chosen.
The different operations are essentially carried out together (e.g., if an index is used for a selection, projection is done for each retrieved tuple, and the resulting tuples are pipelined into the aggregate computation).

Cost Estimates for Single-Relation Plans
Index I on primary key matches selection: cost is Height(I)+1 for a B+ tree.
Clustered index I matching one or more selects: (NPages(I) + NPages(R)) * product of RFs of matching selects.
Non-clustered index I matching one or more selects: (NPages(I) + NTuples(R)) * product of RFs of matching selects.
Sequential scan of file: NPages(R).
Note: typically, no duplicate elimination on projections! (Exception: done on answers if the user says DISTINCT.)
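To see how these formulas pick an access path, the sketch below compares three options for the selection cid = 261 on Register. The index size of 50 pages is a made-up value for illustration, and the RF of 1/100 assumes 100 distinct courses; the function names are hypothetical.

```python
from fractions import Fraction

def clustered_index_cost(npages_index, npages_rel, rf):
    # (NPages(I) + NPages(R)) * product of RFs of matching selects
    return (npages_index + npages_rel) * rf

def unclustered_index_cost(npages_index, ntuples_rel, rf):
    # (NPages(I) + NTuples(R)) * product of RFs of matching selects
    return (npages_index + ntuples_rel) * rf

def sequential_scan_cost(npages_rel):
    return npages_rel

RF_CID = Fraction(1, 100)   # assumption: 100 distinct courses
NPAGES_INDEX = 50           # hypothetical index size, not from the slides
NPAGES_REGISTER = 1000
NTUPLES_REGISTER = 1000 * 100

costs = {
    "clustered index": clustered_index_cost(NPAGES_INDEX, NPAGES_REGISTER, RF_CID),
    "unclustered index": unclustered_index_cost(NPAGES_INDEX, NTUPLES_REGISTER, RF_CID),
    "sequential scan": sequential_scan_cost(NPAGES_REGISTER),
}
cheapest = min(costs, key=costs.get)
for path, cost in costs.items():
    print(f"{path}: {float(cost)} page I/Os")
print("cheapest:", cheapest)
```

Under these assumed numbers the unclustered index is slightly worse than a plain file scan, because it may fetch one page per qualifying tuple.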

Multi-relation Query Block
Join of relations R1, R2, R3, ..., Rn.
Equivalently, r1 ⋈ r2 ⋈ r3 ⋈ ... ⋈ rn, where the optimizer must:
Decide the mapping of r_i to R_j (the join order).
Decide j_i (the join method for each join).
Decide a(r_i) (the access method for R_j).

Cost Model and Statistics
Cost = I/O + w * CPU = Page Fetches + w * (# of tuples returned from RSS)
Catalogs contain:
# tuples (NTuples) and # pages (NPages) for each relation.
# distinct key values (NKeys) and # pages (NPages) for each index.
Index height (H(I)) and low/high key values (Low/High) for each tree index.

Queries Over Multiple Relations
As the number of joins increases, the number of alternative plans grows rapidly; we need to restrict the search space and process it efficiently.
In System R, only left-deep join trees are considered.
Left-deep trees allow us to generate fully pipelined plans, in which intermediate results are not written to temporary files.
Not all left-deep trees are fully pipelined (e.g., Sort-Merge join).

Plan Space Search
Once the first k relations are joined, the method to join the composite to the (k+1)st relation is independent of the order in which the first k were joined; that is, the applicable predicates, join orderings, etc. are all the same.
The search is therefore Markovian (the future depends only on the present, not the past, i.e., "memoryless").
This permits Dynamic Programming (a globally optimal plan requires locally optimal subplans).

Comment on Search Complexity
Consider finding the best join-tree for r1 ⋈ r2 ⋈ ... ⋈ rn.
There are (2(n-1))! / (n-1)! different join-trees: for n = 5 the number is 1680; for n = 7 it is 665280; for n = 10 it is greater than 17.6 billion!
No need to generate all join-trees. Using DP, the least-cost join order for any subset of {r1, r2, ..., rn} is computed only once and stored for future use.
This reduces time complexity to around O(3^n); for n = 10, 3^n is about 59000.
With restriction to left-deep plans, the time complexity reduces to O(n 2^n).
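The counts quoted above can be checked with a few lines of arithmetic. This is only a numerical check of the formula (2(n-1))! / (n-1)! and of the DP bounds, with an illustrative function name.

```python
from math import factorial

def num_join_trees(n):
    """Number of distinct join trees for n relations: (2(n-1))! / (n-1)!."""
    return factorial(2 * (n - 1)) // factorial(n - 1)

for n in (3, 5, 7, 10):
    print(n, num_join_trees(n))
# n = 3 gives 12, n = 5 gives 1680, n = 7 gives 665280,
# and n = 10 gives 17643225600 (about 17.6 billion).

# The DP bounds for n = 10, for comparison with the exhaustive count.
print(3 ** 10)       # 59049, the "around 59000" of the O(3^n) bound
print(10 * 2 ** 10)  # 10240, the n * 2^n term for left-deep plans
```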

Example Search Space Reduction
Consider r1 ⋈ r2 ⋈ r3. There are 12 different join orders:
r1 ⋈ (r2 ⋈ r3)   r1 ⋈ (r3 ⋈ r2)   (r2 ⋈ r3) ⋈ r1   (r3 ⋈ r2) ⋈ r1
r2 ⋈ (r1 ⋈ r3)   r2 ⋈ (r3 ⋈ r1)   (r1 ⋈ r3) ⋈ r2   (r3 ⋈ r1) ⋈ r2
r3 ⋈ (r1 ⋈ r2)   r3 ⋈ (r2 ⋈ r1)   (r1 ⋈ r2) ⋈ r3   (r2 ⋈ r1) ⋈ r3
To find the best join order for (r1 ⋈ r2 ⋈ r3) ⋈ r4 ⋈ r5, there are 12 different join orders for computing r1 ⋈ r2 ⋈ r3, and 12 orders for computing the join of this result with r4 and r5. Thus, there appear to be 144 join orders to examine.
However, once we have found the best join order for the subset of relations {r1, r2, r3}, we can use only that order for further joins with r4 and r5. Thus, instead of 144 choices to examine, we need to examine only 12 + 12 = 24 choices.
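The 12 join orders can also be generated mechanically. The sketch below is a brute-force enumeration (not how an optimizer searches): it takes every permutation of the relations and every way of parenthesizing it into a binary tree, representing a join as a nested (left, right) tuple. The helper names are illustrative.

```python
from itertools import permutations

def tree_shapes(leaves):
    """Yield every binary tree whose leaves, read left to right, are `leaves`."""
    if len(leaves) == 1:
        yield leaves[0]
        return
    for split in range(1, len(leaves)):
        for left in tree_shapes(leaves[:split]):
            for right in tree_shapes(leaves[split:]):
                yield (left, right)

def join_orders(relations):
    """Yield every join order: each leaf permutation combined with each tree shape."""
    for perm in permutations(relations):
        yield from tree_shapes(perm)

orders = list(join_orders(["r1", "r2", "r3"]))
print(len(orders))  # 12
for order in orders:
    print(order)     # e.g. ('r1', ('r2', 'r3')) stands for r1 ⋈ (r2 ⋈ r3)
```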

Principle of Optimality
Query: R1 ⋈ R2 ⋈ R3 ⋈ R4 ⋈ R5
Optimal plan: [figure: join tree over R3, R2, R4, R1, R5]
Within it, the subplan joining R3, R2, R4, R1 is the optimal plan for joining R3, R2, R4, R1.
(Optimality slides from Shivnath Babu)

Principle of Optimality
Query: R1 ⋈ R2 ⋈ R3 ⋈ R4 ⋈ R5
Optimal plan: [figure: join tree over R3, R2, R4, R1, R5]
Within it, the subplan joining R3, R2, R4 is the optimal plan for joining R3, R2, R4.

Selinger Algorithm
Query: R1 ⋈ R2 ⋈ R3 ⋈ R4
Progress of algorithm (bottom row first):
{ R1, R2, R3, R4 }
{ R1, R2, R3 } { R1, R2, R4 } { R1, R3, R4 } { R2, R3, R4 }
{ R1, R2 } { R1, R3 } { R1, R4 } { R2, R3 } { R2, R4 } { R3, R4 }
{ R1 } { R2 } { R3 } { R4 }

Selinger Algorithm
Query: R1 ⋈ R2 ⋈ R3 ⋈ R4
Optimal plan: [figure: join tree over R2, R4, R3, R1]

Enumeration of Left-Deep Plans
Left-deep plans differ only in the order of relations, the access method for each relation, and the join method for each join.
Enumerated using N passes (if N relations are joined):
Pass 1: Find the best 1-relation plan for each relation.
Pass 2: Find the best way to join the result of each 1-relation plan (as outer) to another relation. (All 2-relation plans.)
Pass N: Find the best way to join the result of an (N-1)-relation plan (as outer) to the Nth relation. (All N-relation plans.)
For each subset of relations, retain only:
the cheapest plan overall, plus
the cheapest plan for each interesting order of the tuples (explicit: GROUP BY, ORDER BY; implicit: join column).

Enumeration of Plans (Contd.)
ORDER BY, GROUP BY, aggregates, etc. are handled as a final step, using either an interestingly ordered plan or an additional sorting operator.
An (N-1)-way plan is not combined with an additional relation unless there is a join condition between them or all predicates in the WHERE clause have been used up, i.e., avoid Cartesian products if possible.
In spite of pruning the plan space, this approach is still exponential in the # of tables (hence the limitation to 10 joins).
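The pass structure described above can be sketched as a small dynamic program over subsets of relations. This is a deliberately simplified illustration, not the System R implementation: the cost of a plan is assumed to be the sum of its intermediate result sizes, access paths, join methods, and interesting orders are ignored, and Cartesian products are never generated (so a disconnected join graph yields no plan). All names and the sample statistics are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

def best_left_deep_plan(cards, join_rf):
    """Return (cost, result_size, join_order) of the cheapest left-deep plan.

    cards:   dict mapping relation name -> cardinality
    join_rf: dict mapping frozenset({a, b}) -> reduction factor of the join predicate
    """
    relations = sorted(cards)
    # Pass 1: the best 1-relation "plan" is just the relation itself (toy cost 0).
    best = {frozenset([r]): (0, cards[r], (r,)) for r in relations}

    # Pass k: extend the best (k-1)-relation plan (as outer) by one more relation.
    for k in range(2, len(relations) + 1):
        for combo in combinations(relations, k):
            subset = frozenset(combo)
            for inner in combo:
                outer = subset - {inner}
                if outer not in best:
                    continue  # no Cartesian-product-free plan for the outer side
                cost, size, order = best[outer]
                rfs = [join_rf[frozenset([o, inner])]
                       for o in outer if frozenset([o, inner]) in join_rf]
                if not rfs:
                    continue  # no join predicate: skip the Cartesian product
                new_size = size * cards[inner]
                for rf in rfs:
                    new_size *= rf
                new_cost = cost + new_size
                # Retain only the cheapest plan for each subset of relations.
                if subset not in best or new_cost < best[subset][0]:
                    best[subset] = (new_cost, new_size, order + (inner,))
    return best.get(frozenset(relations))

cards = {"R1": 1000, "R2": 100, "R3": 10}
join_rf = {
    frozenset(["R1", "R2"]): Fraction(1, 100),
    frozenset(["R2", "R3"]): Fraction(1, 10),
}
cost, size, order = best_left_deep_plan(cards, join_rf)
print(cost, order)
```

With these made-up statistics, joining the two small relations first (toy cost 1100) beats starting from R1 ⋈ R2 (toy cost 2000), and the pair {R1, R3} is never considered because no predicate links them.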

Example
[Figure: query tree with π_sname over the join sid=sid of σ_{cid=261}(Register) and σ_{age>35}(Student).]
Indexes: Student has a B+ tree on age and a B+ tree on sid; Register has a B+ tree on cid.
Pass 1:
Student: B+ tree matches age>35, and is probably cheapest. However, if this selection is expected to retrieve a lot of tuples, and the index is unclustered, a file scan may be cheaper. Still, the B+ tree plan is kept (because tuples are in age order).
Register: B+ tree on cid matches cid=261; cheapest.
Pass 2:
We consider each plan retained from Pass 1 as the outer, and consider how to join it with the (only) other relation.
e.g., Register as outer: the B+ tree index on sid can be used to get Student tuples that satisfy sid = outer tuple's sid value.

Detailed Example
Work through Figures 3, 4, 5, 6 in the paper.

Nested Query Blocks
Uncorrelated nested queries are basically constants, to be computed once during execution.
Correlated nested queries are like function calls.

Example
Independent:
    SELECT S.sname
    FROM Student S
    WHERE S.salary = (SELECT avg(salary) FROM Student)
Correlated:
    SELECT S.sname
    FROM Student S
    WHERE S.salary > (SELECT A.salary FROM Advisor A WHERE A.id=S.advisor)

Handling
The nested block is optimized independently, with the outer tuple considered as providing a selection condition.
The outer block is optimized with the cost of calling the nested block computation taken into account.
The implicit ordering of these blocks means that some good strategies are not considered. The non-nested version of the query is typically optimized better.
    SELECT S.sname
    FROM Student S
    WHERE EXISTS (SELECT * FROM Register R WHERE R.cid=261 AND R.sid=S.sid)
Nested block to optimize:
    SELECT * FROM Register R WHERE R.cid=261 AND R.sid=outer value
Equivalent non-nested query:
    SELECT S.sname
    FROM Student S, Register R
    WHERE S.sid=R.sid AND R.cid=261
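The nested and non-nested forms can be compared on a toy database. The sketch below uses Python's built-in sqlite3 module with made-up rows; it only checks that both queries name the same students, and says nothing about how any particular optimizer handles them. Note that the flat join would return a name more than once if a student had several Register rows for course 261, so the comparison is on the set of names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Student (sid INTEGER PRIMARY KEY, sname TEXT, age INTEGER, gpa REAL);
    CREATE TABLE Register (sid INTEGER, cid INTEGER, section INTEGER, "desc" TEXT);
    -- Made-up sample rows, for illustration only.
    INSERT INTO Student VALUES (1, 'Asha', 36, 8.1), (2, 'Ravi', 29, 7.4), (3, 'Mei', 41, 9.0);
    INSERT INTO Register VALUES (1, 261, 1, 'DBMS'), (2, 105, 1, 'Algo'), (3, 261, 1, 'DBMS');
""")

nested_query = """
    SELECT S.sname FROM Student S
    WHERE EXISTS (SELECT * FROM Register R WHERE R.cid = 261 AND R.sid = S.sid)
"""
flat_query = """
    SELECT S.sname FROM Student S, Register R
    WHERE S.sid = R.sid AND R.cid = 261
"""

nested_names = sorted({row[0] for row in conn.execute(nested_query)})
flat_names = sorted({row[0] for row in conn.execute(flat_query)})
print(nested_names, flat_names)
conn.close()
```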

Limitations
Only considers left-deep plans.
Statistics make major assumptions of uniformity and independence (will address these issues in following papers on histograms and sampling).
Nested queries are handled in a straightforward manner without un-nesting.
Instead of dynamic programming, could also consider alternatives such as randomized algorithms based on simulated annealing.

END QUERY PROCESSING E0 261