CMPT 354: Database System I. Lecture 7. Basics of Query Optimization

Similar documents
Optimization Overview

CSC 261/461 Database Systems Lecture 19

CMPT 354: Database System I. Lecture 3. SQL Basics

CMPT 354: Database System I. Lecture 1. Course Introduction

CMPT 354: Database System I. Lecture 4. SQL Advanced

CMPT 354: Database System I. Lecture 2. Relational Model

Lecture 14. Lecture 14: Joins!

CMPT 354: Database System I. Lecture 11. Transaction Management

Lecture 12. Lecture 12: Access Methods

L20: Joins. CS3200 Database design (sp18 s2) 3/29/2018

Query and Join Op/miza/on 11/5

Lecture Query evaluation. Cost-based transformations. By Marina Barsky Winter 2016, University of Toronto

L22: The Relational Model (continued) CS3200 Database design (sp18 s2) 4/5/2018

CMPT 354: Database System I. Lecture 5. Relational Algebra

Announcements. Two typical kinds of queries. Choosing Index is Not Enough. Cost Parameters. Cost of Reading Data From Disk

Final Review. CS 145 Final Review. The Best Of Collection (Master Tracks), Vol. 2

Lecture 15: The Details of Joins

Fundamentals of Database Systems

Database Systems CSE 414

Lecture Query evaluation. Combining operators. Logical query optimization. By Marina Barsky Winter 2016, University of Toronto

R & G Chapter 13. Implementation of single Relational Operations Choices depend on indexes, memory, stats, Joins Blocked nested loops:

QUERY OPTIMIZATION. CS 564- Spring ACKs: Jeff Naughton, Jignesh Patel, AnHai Doan

Database Applications (15-415)

Announcement. Reading Material. Overview of Query Evaluation. Overview of Query Evaluation. Overview of Query Evaluation 9/26/17

Principles of Data Management. Lecture #9 (Query Processing Overview)

Review. Support for data retrieval at the physical level:

CSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 7 - Query optimization

Chapter 2: The Relational Algebra

Database Applications (15-415)

Introduction to Data Management CSE 344. Lecture 12: Cost Estimation Relational Calculus

Overview of Implementing Relational Operators and Query Evaluation

Administrivia. CS 133: Databases. Cost-based Query Sub-System. Goals for Today. Midterm on Thursday 10/18. Assignments

Query Processing and Advanced Queries. Query Optimization (4)

Principles of Data Management. Lecture #13 (Query Optimization II)

Outline. Database Management and Tuning. Outline. Join Strategies Running Example. Index Tuning. Johann Gamper. Unit 6 April 12, 2012

Query Optimization. Schema for Examples. Motivating Example. Similar to old schema; rname added for variations. Reserves: Sailors:

EECS 647: Introduction to Database Systems

Query Optimization. Vishy Poosala Bell Labs

Hash table example. B+ Tree Index by Example Recall binary trees from CSE 143! Clustered vs Unclustered. Example

Data Storage. Query Performance. Index. Data File Types. Introduction to Data Management CSE 414. Introduction to Database Systems CSE 414

Lecture 16. The Relational Model

CSE 444: Database Internals. Lecture 11 Query Optimization (part 2)

Database Systems. Announcement. December 13/14, 2006 Lecture #10. Assignment #4 is due next week.

Introduction to Database Systems CSE 414. Lecture 26: More Indexes and Operator Costs

Introduction to Database Systems CSE 414. Lecture 16: Query Evaluation

Schema for Examples. Query Optimization. Alternative Plans 1 (No Indexes) Motivating Example. Alternative Plans 2 With Indexes

Parser: SQL parse tree

Query Optimization. Schema for Examples. Motivating Example. Similar to old schema; rname added for variations. Reserves: Sailors:

CompSci 516 Data Intensive Computing Systems. Lecture 11. Query Optimization. Instructor: Sudeepa Roy

Advanced Database Systems

Relational Query Optimization

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Fall 2007 Lecture 9 - Query optimization

Chapter 2: The Relational Algebra

Principles of Database Management Systems

CSE 344 FEBRUARY 21 ST COST ESTIMATION

Relational Query Optimization. Overview of Query Evaluation. SQL Refresher. Yanlei Diao UMass Amherst October 23 & 25, 2007

Review: Query Evaluation Steps. Example Query: Logical Plan 1. What We Already Know. Example Query: Logical Plan 2.

Query Processing & Optimization. CS 377: Database Systems

Query Processing. Debapriyo Majumdar Indian Sta4s4cal Ins4tute Kolkata DBMS PGDBA 2016

An SQL query is parsed into a collection of query blocks optimize one block at a time. Nested blocks are usually treated as calls to a subroutine

Lassonde School of Engineering Winter 2016 Term Course No: 4411 Database Management Systems

QUERY OPTIMIZATION E Jayant Haritsa Computer Science and Automation Indian Institute of Science. JAN 2014 Slide 1 QUERY OPTIMIZATION

CompSci 516 Data Intensive Computing Systems

Relational Query Optimization

Database Applications (15-415)

CS 4604: Introduction to Database Management Systems. B. Aditya Prakash Lecture #10: Query Processing

4/10/2018. Relational Algebra (RA) 1. Selection (σ) 2. Projection (Π) Note that RA Operators are Compositional! 3.

CSE 544: Principles of Database Systems

Query Processing & Optimization

DBMS Query evaluation

Module 9: Selectivity Estimation

Database Systems CSE 414

CSE 344 APRIL 20 TH RDBMS INTERNALS

Relational Query Optimization. Overview of Query Evaluation. SQL Refresher. Yanlei Diao UMass Amherst March 8 and 13, 2007

Implementing Joins 1

Examples of Physical Query Plan Alternatives. Selected Material from Chapters 12, 14 and 15

CS330. Query Processing

Query Processing. Introduction to Databases CompSci 316 Fall 2017

ATYPICAL RELATIONAL QUERY OPTIMIZER

ECS 165B: Database System Implementa6on Lecture 7

Review. Relational Query Optimization. Query Optimization Overview (cont) Query Optimization Overview. Cost-based Query Sub-System

Database System Concepts

CSC 261/461 Database Systems Lecture 13. Fall 2017

Lecture 20: Query Optimization (2)

CSE 544 Principles of Database Management Systems. Fall 2016 Lecture 14 - Data Warehousing and Column Stores

Principles of Data Management. Lecture #12 (Query Optimization I)

Chapter 9. Cardinality Estimation. How Many Rows Does a Query Yield? Architecture and Implementation of Database Systems Winter 2010/11

R has a ordered clustering index file on its tuples: Read index file to get the location of the tuple with the next smallest value

CSE 344 APRIL 27 TH COST ESTIMATION

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe

Evaluation of relational operations

CS 4604: Introduc0on to Database Management Systems. B. Aditya Prakash Lecture #3: SQL and Rela2onal Algebra- - - Part 1

Outline. Database Tuning. Join Strategies Running Example. Outline. Index Tuning. Nikolaus Augsten. Unit 6 WS 2014/2015

Introduction to Query Processing and Query Optimization Techniques. Copyright 2011 Ramez Elmasri and Shamkant Navathe

Notes. Some of these slides are based on a slide set provided by Ulf Leser. CS 640 Query Processing Winter / 30. Notes

Reminders. Query Optimizer Overview. The Three Parts of an Optimizer. Dynamic Programming. Search Algorithm. CSE 444: Database Internals

Faloutsos 1. Carnegie Mellon Univ. Dept. of Computer Science Database Applications. Outline

External Sorting Implementing Relational Operators

Query Evaluation Overview, cont.

Overview of Query Processing. Evaluation of Relational Operations. Why Sort? Outline. Two-Way External Merge Sort. 2-Way Sort: Requires 3 Buffer Pages

Transcription:

CMPT 354: Database System I Lecture 7. Basics of Query Optimization 1

Why should you care? https://databricks.com/glossary/catalyst-optimizer https://sigmod.org/sigmod-awards/people/goetz-graefe-2017-sigmod-edgar-f-codd-innovations-award/ 2

Query Processing Steps SQL query SQL Parser Query optimization Logical Optimization Physical Optimization Query Execution Disk 3

IBM System R Optimizer First implementation of a queryoptimizer Make people believe that the DBMS can beat a human developer A lot of the concepts are still used today 4

How to build a query optimization? 1. Plan Space Figure out all possible query plans Too large, must be pruned 2. Cost Estimation Estimate the cost of each plan 3. Search Algorithm Find the best plan CPU + I/O don t go for best plan, go for least worst plan 5

Outline Recap of Logical Optimization Selection Pushdown Projection Pushdown Physical Optimization Join Algorithms Selectivity Estimation 6

Translating to RA R(A,B) S(B,C) T(C,D) SELECT R.A,S.D FROM R,S,T WHERE R.B = S.B AND S.C = T.C AND R.A < 10; Π ",$ s A<10 T(C,D) Π ",$ (σ "'() T R S ) R(A,B) S(B,C) 7

Optimizing RA Plan R(A,B) S(B,C) T(C,D) SELECT R.A,S.D FROM R,S,T WHERE R.B = S.B AND S.C = T.C AND R.A < 10; Push down selection on A so it occurs earlier Π ",$ s A<10 T(C,D) Π ",$ (σ "'() T R S ) R(A,B) S(B,C) 8

Optimizing RA Plan R(A,B) S(B,C) T(C,D) SELECT R.A,S.D FROM R,S,T WHERE R.B = S.B AND S.C = T.C AND R.A < 10; Push down selection on A so it occurs earlier Π ",$ T(C,D) Π ",$ T σ "'() (R) S s A<10 R(A,B) S(B,C) 9

Optimizing RA Plan R(A,B) S(B,C) T(C,D) SELECT R.A,S.D FROM R,S,T WHERE R.B = S.B AND S.C = T.C AND R.A < 10; Push down projection so it occurs earlier Π ",$ T(C,D) Π ",$ T σ "'() (R) S s A<10 R(A,B) S(B,C) 10

Optimizing RA Plan R(A,B) S(B,C) T(C,D) We eliminate B earlier! SELECT R.A,S.D FROM R,S,T WHERE R.B = S.B AND S.C = T.C AND R.A < 10; Π ",0 Π ",$ T(C,D) Π ",$ T Π ",/ σ "'() (R) S s A<10 R(A,B) S(B,C) 11

Outline Recap of Logical Optimization Selection Pushdown Projection Pushdown Physical Optimization Join Algorithms Histogram 12

Join Algorithms Nested loop Join Hash Join Sort-merge join Student sname uid Mike 0 Joe 1 Alice 0 Marry 1 Bob 0 Tim 2 University uid uname 0 SFU 1 UBC 2 UT Student University 13

Dive into Nested Loop Joins 14

Notes Cost = I/O + CPU + Network IO aware algorithms We will focus on I/O Given a relation R, let: T(R) = # of tuples in R P(R) = # of pages in R 10 CMPT 345 SP 2018 Jiannan 20 CMPT 454 FA 2018 Martin 30 40 50 60 70 80 Page 1 Page 2 Page 3 Page 4 15

Nested Loop Join (NLJ) Compute R S on A: for r in R: for s in S: if r[a] == s[a]: yield (r,s) 16

Nested Loop Join (NLJ) Compute R S on A: for r in R: for s in S: if r[a] == s[a]: yield (r,s) Cost: P(R) 1. Loop over the tuples in R Note that our IO cost is based on the number of pages loaded, not the number of tuples! 17

Nested Loop Join (NLJ) Compute R S on A: for r in R: for s in S: if r[a] == s[a]: yield (r,s) Cost: P(R) + T(R)*P(S) 1. Loop over the tuples in R 2. For every tuple in R, loop over all the tuples in S Have to read all of S from disk for every tuple in R! 18

Nested Loop Join (NLJ) Compute R S on A: for r in R: for s in S: if r[a] == s[a]: yield (r,s) Cost: P(R) + T(R)*P(S) 1. Loop over the tuples in R 2. For every tuple in R, loop over all the tuples in S 3. Check against join conditions Note that NLJ can handle things other than equality constraints just check in the if statement! 19

Nested Loop Join (NLJ) Compute R S on A: for r in R: for s in S: if r[a] == s[a]: yield (r,s) Cost: P(R) + T(R)*P(S) + OUT 1. Loop over the tuples in R 2. For every tuple in R, loop over all the tuples in S 3. Check against join conditions 4. Write out (to page, then when page full, to disk) Is this the same as a cross product? 20

Nested Loop Join (NLJ) Compute R S on A: for r in R: for s in S: if r[a] == s[a]: yield (r,s) Cost: P(R) + T(R)*P(S) + OUT What if R ( outer ) and S ( inner ) switched? P(S) + T(S)*P(R) + OUT Outer vs. inner selection could make a huge difference- DBMS needs to know which relation is smaller! 21

IO-Aware Approach 22

Block Nested Loop Join (BNLJ) Compute R S on A: for each B-1 pages pr of R: for page ps of S: for each tuple r in pr: for each tuple s in ps: if r[a] == s[a]: yield (r,s) Cost: P(R) Given B+1 pages of memory 1. Load in B-1 pages of R at a time (leaving 1 page each free for S & output) Note: There could be some speedup here due to the fact that we re reading in multiple pages sequentially however we ll ignore this here! 23

Block Nested Loop Join (BNLJ) Compute R S on A: for each B-1 pages pr of R: for page ps of S: for each tuple r in pr: for each tuple s in ps: if r[a] == s[a]: yield (r,s) Cost: Given B+1 pages of memory P R + P R B 1 P(S) 1. Load in B-1 pages of R at a time (leaving 1 page each free for S & output) 2. For each (B-1)-page segment of R, load each page of S Note: Faster to iterate over the smaller relation first! 24

Block Nested Loop Join (BNLJ) Compute R S on A: for each B-1 pages pr of R: for page ps of S: for each tuple r in pr: for each tuple s in ps: if r[a] == s[a]: yield (r,s) Cost: Given B+1 pages of memory P R + P R B 1 P(S) 1. Load in B-1 pages of R at a time (leaving 1 page each free for S & output) 2. For each (B-1)-page segment of R, load each page of S 3. Check against the join conditions BNLJ can also handle non-equality constraints 25

Block Nested Loop Join (BNLJ) Compute R S on A: for each B-1 pages pr of R: for page ps of S: for each tuple r in pr: for each tuple s in ps: if r[a] == s[a]: yield (r,s) Cost: Given B+1 pages of memory P R + < = >?( P(S) + OUT 1. Load in B-1 pages of R at a time (leaving 1 page each free for S & output) 2. For each (B-1)-page segment of R, load each page of S 3. Check against the join conditions 4. Write out 26

BNLJ vs. NLJ: Benefits of IO Aware NLJ Read all of S from disk for every page of R BNLJ Read all of S from disk for every (B-1)-page segment of R NLJ P(R) + T(R)*P(S) + OUT BNLJ P R + < = >?( P(S) + OUT BNLJ is faster by roughly (>?()@(=) <(=)! 27

BNLJ vs. NLJ: Benefits of IO Aware Example: R: 500 pages S: 1000 pages 100 tuples / page We have 12 pages of memory (B = 11) Ignoring OUT here NLJ: Cost = 500 + 50,000*1000 = 50 Million IOs BNLJ: Cost = 500 + A)) ())) () = 50 Thousand IOs A very real difference from a small change in the algorithm! 28

Outline Recap of Logical Optimization Selection Pushdown Projection Pushdown Physical Optimization Join Algorithms Histogram 29

Motivation Imagine you build an index on name, and then run the following queries SELECT sid FROM Student WHERE name =? Willqueryoptimizer use the index? Yes! Imagine you build an index on gender, and then run the following queries SELECT sid FROM Student WHERE gender =? Willqueryoptimizer use the index? No! How does query optimizer figure these out? 30

Histograms A histogram is a set of value ranges ( buckets ) and the frequencies of values in those buckets occurring How to choose the buckets? Equiwidth & Equidepth Turns out high-frequency values are very important 31

Example Frequency 10 5 0 1 2 3 4 5 6 7 8 9 101112131415 Values How do we compute how many values between 8 and 10? (Yes, it s obvious) Problem: counts take up too much space! 32

Full vs. Uniform Counts 10 8 6 4 2 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 How much space do the full counts (bucket_size=1) take? How much space do the uniform counts (bucket_size=all) take? 33

Fundamental Tradeoffs Want high resolution (like the full counts) Want low space (like uniform) Histograms are a compromise! So how do we compute the bucket sizes? 34

Equi-width 10 5 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 All buckets roughly the same width 35

Equidepth 10 8 6 4 2 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 All buckets contain roughly the same number of items (total frequency) 36

Histograms Simple, intuitive and popular Parameters: # of buckets and type Can extend to many attributes (multidimensional) 37

Maintaining Histograms Histograms require that we update them! Typically, you must run/schedule a command to update statistics on the database Out of date histograms can be terrible! There is research work on self-tuning histograms and the use of query feedback Oracle 11g 38

Nasty example 10 5 0 1 2 3 4 5 6 7 8 9 101112131415 1. we insert many tuples with value > 16 2. we do not update the histogram 3. we ask for values > 20? 39

Compressed Histograms One popular approach: 1. Store the most frequent values and their counts explicitly 2. Keep an equiwidth or equidepth one for the rest of the values People continue to try all manner of fanciness here wavelets, graphical models, entropy models, 40

Summary Logical Optimization SQL -- > RA à RA Tree Selection Pushdown Projection Pushdown Physical Optimization Nested Loop Join / Hash Join / Sort-Merge Join I/O Aware Algorithm Histogram 41

Acknowledge Some lecture slides were copied from or inspired by the following course materials W4111: Introduction to databases by Eugene Wu at Columbia University CSE344: Introduction to Data Management by Dan Suciu at University of Washington CMPT354: Database System I by John Edgar at Simon Fraser University CS186: Introduction to Database Systems by Joe Hellerstein at UC Berkeley CS145: Introduction to Databases by Peter Bailis at Stanford CS 348: Introduction to Database Management by Grant Weddell at University of Waterloo 42