Optimization Overview

Similar documents
L22: The Relational Model (continued) CS3200 Database design (sp18 s2) 4/5/2018

CMPT 354: Database System I. Lecture 7. Basics of Query Optimization

Chapter 2: The Relational Algebra

Chapter 2: The Relational Algebra

Lecture Query evaluation. Cost-based transformations. By Marina Barsky Winter 2016, University of Toronto

Final Review. CS 145 Final Review. The Best Of Collection (Master Tracks), Vol. 2

Lecture Query evaluation. Combining operators. Logical query optimization. By Marina Barsky Winter 2016, University of Toronto

Lecture 16. The Relational Model

CSC 261/461 Database Systems Lecture 13. Fall 2017

4/10/2018. Relational Algebra (RA) 1. Selection (σ) 2. Projection (Π) Note that RA Operators are Compositional! 3.

Query and Join Op/miza/on 11/5

Relational Model and Relational Algebra

CompSci 516 Data Intensive Computing Systems

QUERY OPTIMIZATION. CS 564- Spring ACKs: Jeff Naughton, Jignesh Patel, AnHai Doan

CMSC424: Database Design. Instructor: Amol Deshpande

EECS 647: Introduction to Database Systems

Module 9: Selectivity Estimation

Faloutsos 1. Carnegie Mellon Univ. Dept. of Computer Science Database Applications. Outline

Query Processing Strategies and Optimization

Chapter 12: Query Processing

Relational Query Optimization

CS 4604: Introduction to Database Management Systems. B. Aditya Prakash Lecture #10: Query Processing

ATYPICAL RELATIONAL QUERY OPTIMIZER

Relational Query Optimization

Chapter 13: Query Optimization

Informationslogistik Unit 4: The Relational Algebra

Cost Models. the query database statistics description of computational resources, e.g.

Relational Model, Relational Algebra, and SQL

Chapter 13: Query Optimization. Chapter 13: Query Optimization

CompSci 516 Data Intensive Computing Systems. Lecture 11. Query Optimization. Instructor: Sudeepa Roy

CSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 7 - Query optimization

Review. Support for data retrieval at the physical level:

Transactions and Concurrency Control

Query Processing & Optimization

Announcement. Reading Material. Overview of Query Evaluation. Overview of Query Evaluation. Overview of Query Evaluation 9/26/17

Chapter 11: Query Optimization

Database Systems. Announcement. December 13/14, 2006 Lecture #10. Assignment #4 is due next week.

Lecture 20: Query Optimization (2)

Announcements. Two typical kinds of queries. Choosing Index is Not Enough. Cost Parameters. Cost of Reading Data From Disk

Query Optimization. Schema for Examples. Motivating Example. Similar to old schema; rname added for variations. Reserves: Sailors:

Introduction to Query Processing and Query Optimization Techniques. Copyright 2011 Ramez Elmasri and Shamkant Navathe

An SQL query is parsed into a collection of query blocks optimize one block at a time. Nested blocks are usually treated as calls to a subroutine

Evaluation of relational operations

Parser: SQL parse tree

Lassonde School of Engineering Winter 2016 Term Course No: 4411 Database Management Systems

Advanced Database Systems

CSE 344 FEBRUARY 14 TH INDEXING

R & G Chapter 13. Implementation of single Relational Operations Choices depend on indexes, memory, stats, Joins Blocked nested loops:

Algorithms for Query Processing and Optimization. 0. Introduction to Query Processing (1)

Chapter 2: Intro to Relational Model

Relational Algebra and SQL

Chapter 6 The Relational Algebra and Relational Calculus

15-415/615 Faloutsos 1

CMSC 424 Database design Lecture 18 Query optimization. Mihai Pop

CSC 261/461 Database Systems Lecture 19

DBMS Query evaluation

Query Optimization. Schema for Examples. Motivating Example. Similar to old schema; rname added for variations. Reserves: Sailors:

Query Processing. Debapriyo Majumdar Indian Sta4s4cal Ins4tute Kolkata DBMS PGDBA 2016

Query Processing & Optimization. CS 377: Database Systems

Contents Contents Introduction Basic Steps in Query Processing Introduction Transformation of Relational Expressions...

Module 4. Implementation of XQuery. Part 0: Background on relational query processing

Introduction to Database Systems CSE 414. Lecture 16: Query Evaluation

Chapter 12: Query Processing

CS 4604: Introduc0on to Database Management Systems. B. Aditya Prakash Lecture #3: SQL and Rela2onal Algebra- - - Part 1

Schema for Examples. Query Optimization. Alternative Plans 1 (No Indexes) Motivating Example. Alternative Plans 2 With Indexes

CMP-3440 Database Systems

Relational Databases. Relational Databases. Extended Functional view of Information Manager. e.g. Schema & Example instance of student Relation

Examples of Physical Query Plan Alternatives. Selected Material from Chapters 12, 14 and 15

Outline. Query Processing Overview Algorithms for basic operations. Query optimization. Sorting Selection Join Projection

Evaluation of Relational Operations

Review. Relational Query Optimization. Query Optimization Overview (cont) Query Optimization Overview. Cost-based Query Sub-System

Data on External Storage

Principles of Data Management. Lecture #13 (Query Optimization II)

Relational Query Optimization. Highlights of System R Optimizer

Query optimization. Elena Baralis, Silvia Chiusano Politecnico di Torino. DBMS Architecture D B M G. Database Management Systems. Pag.

SQL QUERY EVALUATION. CS121: Relational Databases Fall 2017 Lecture 12

Chapter 12: Query Processing. Chapter 12: Query Processing

Topics to Learn. Important concepts. Tree-based index. Hash-based index

Relational Algebra. [R&G] Chapter 4, Part A CS4320 1

Overview of Implementing Relational Operators and Query Evaluation

Relational Query Languages: Relational Algebra. Juliana Freire

Relational Algebra for sets Introduction to relational algebra for bags

What s a database system? Review of Basic Database Concepts. Entity-relationship (E/R) diagram. Two important questions. Physical data independence

Something to think about. Problems. Purpose. Vocabulary. Query Evaluation Techniques for large DB. Part 1. Fact:

Relational Algebra. Note: Slides are posted on the class website, protected by a password written on the board

Overview of Query Processing and Optimization

Introduction to Data Management. Lecture #11 (Relational Algebra)

CSE 444: Database Internals. Lecture 11 Query Optimization (part 2)

Implementation of Relational Operations

QUERY PROCESSING & OPTIMIZATION CHAPTER 19 (6/E) CHAPTER 15 (5/E)

Chapter 3. Algorithms for Query Processing and Optimization

Selection and Projection

Faloutsos - Pavlo CMU SCS /615

Overview. Carnegie Mellon Univ. School of Computer Science /615 - DB Applications. Concepts - reminder. History

Chapter 13: Query Processing

CS 377 Database Systems

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe

2.2.2.Relational Database concept

Lecture 12. Lecture 12: The IO Model & External Sorting

Chapter 9. Cardinality Estimation. How Many Rows Does a Query Yield? Architecture and Implementation of Database Systems Winter 2010/11

Transcription:

Lecture 17 Optimization Overview Lecture 17

Lecture 17 Today s Lecture 1. Logical Optimization 2. Physical Optimization 3. Course Summary 2

Lecture 17 Logical vs. Physical Optimization Logical optimization: Find equivalent plans that are more efficient Intuition: Minimize # of tuples at each step by changing the order of RA operators Physical optimization: Find algorithm with lowest IO cost to execute our plan Intuition: Calculate based on physical parameters (buffer size, etc.) and estimates of data size (histograms) SQL Query Relational Algebra (RA) Plan Optimized RA Plan Execution

Lecture 17 > Section 1 1. Logical Optimization 4

Lecture 17 > Section 1 What you will learn about in this section 1. Optimization of RA Plans 2. ACTIVITY: RA Plan Optimization 5

Lecture 17 > Section 1 > Plan Optimization RDBMS Architecture How does a SQL engine work? SQL Query Relational Algebra (RA) Plan Optimized RA Plan Execution Declarative query (from user) Translate to relational algebra expresson Find logically equivalent- but more efficient- RA expression Execute each operator of the optimized plan!

Lecture 17 > Section 1 > Plan Optimization RDBMS Architecture How does a SQL engine work? SQL Query Relational Algebra (RA) Plan Optimized RA Plan Execution Relational Algebra allows us to translate declarative (SQL) queries into precise and optimizable expressions!

Lecture 17 > Section 1 > Plan Optimization Recall: Relational Algebra (RA) Five basic operators: 1. Selection: s 2. Projection: P 3. Cartesian Product: 4. Union: È 5. Difference: - We ll look at these first! Derived or auxiliary operators: Intersection, complement Joins (natural,equi-join, theta join, semi-join) Renaming: r Division And also at one example of a derived operator (natural join) and a special operator (renaming)

Lecture 17 > Section 1 > Plan Optimization Recall: Converting SFW Query -> RA Students(sid,sname,gpa) People(ssn,sname,address) SELECT DISTINCT gpa, address FROM Students S, People P WHERE gpa > 3.5 AND sname = pname; Π "#$,$&&'()) (σ "#$,-./ (S P)) How do we represent this query in RA?

Lecture 17 > Section 1 > Plan Optimization Recall: Logical Equivalece of RA Plans Given relations R(A,B) and S(B,C): Here, projection & selection commute: σ 45/ (Π 4 (R)) = Π 4 (σ 45/ (R)) What about here? σ 45/ (Π 8 (R))? = Π 8 (σ 45/ (R)) We ll look at this in more depth later in the lecture

Lecture 17 > Section 1 > Plan Optimization RDBMS Architecture How does a SQL engine work? SQL Query Relational Algebra (RA) Plan Optimized RA Plan Execution We ll look at how to then optimize these plans now

Lecture 17 > Section 1 > Plan Optimization Note: We can visualize the plan as a tree Π 8 Π 8 (R A, B S B, C ) R(A,B) S(B,C) Bottom-up tree traversal = order of operation execution!

Lecture 17 > Section 1 > Plan Optimization A simple plan Π 8 What SQL query does this correspond to? Are there any logically equivalent RA expressions? R(A,B) S(B,C)

Lecture 17 > Section 1 > Plan Optimization Pushing down projection Π 8 Π 8 Π 8 R(A,B) S(B,C) R(A,B) S(B,C) Why might we prefer this plan?

Lecture 17 > Section 1 > Plan Optimization Takeaways This process is called logical optimization Many equivalent plans used to search for good plans Relational algebra is an important abstraction.

Lecture 17 > Section 1 > Plan Optimization RA commutators The basic commutators: Push projection through (1) selection, (2) join Push selection through (3) selection, (4) projection, (5) join Also: Joins can be re-ordered! Note that this is not an exhaustive set of operations This covers local re-writes; global re-writes possible but much harder This simple set of tools allows us to greatly improve the execution time of queries by optimizing RA plans!

Lecture 17 > Section 1 > Plan Optimization Optimizing the SFW RA Plan

Lecture 17 > Section 1 > Plan Optimization Translating to RA R(A,B) S(B,C) T(C,D) SELECT R.A,S.D FROM R,S,T WHERE R.B = S.B AND S.C = T.C AND R.A < 10; Π 4,> s A<10 T(C,D) Π 4,> (σ 4?@A T R S ) R(A,B) S(B,C)

Lecture 17 > Section 1 > Plan Optimization Logical Optimization Heuristically, we want selections and projections to occur as early as possible in the plan Terminology: push down selections and pushing down projections. Intuition: We will have fewer tuples in a plan. Could fail if the selection condition is very expensive (say runs some image processing algorithm). Projection could be a waste of effort, but more rarely.

Lecture 17 > Section 1 > Plan Optimization Optimizing RA Plan R(A,B) S(B,C) T(C,D) SELECT R.A,S.D FROM R,S,T WHERE R.B = S.B AND S.C = T.C AND R.A < 10; Push down selection on A so it occurs earlier Π 4,> s A<10 T(C,D) Π 4,> (σ 4?@A T R S ) R(A,B) S(B,C)

Lecture 17 > Section 1 > Plan Optimization Optimizing RA Plan R(A,B) S(B,C) T(C,D) SELECT R.A,S.D FROM R,S,T WHERE R.B = S.B AND S.C = T.C AND R.A < 10; Push down selection on A so it occurs earlier Π 4,> T(C,D) Π 4,> T σ 4?@A (R) S s A<10 R(A,B) S(B,C)

Lecture 17 > Section 1 > Plan Optimization Optimizing RA Plan R(A,B) S(B,C) T(C,D) SELECT R.A,S.D FROM R,S,T WHERE R.B = S.B AND S.C = T.C AND R.A < 10; Push down projection so it occurs earlier Π 4,> T(C,D) Π 4,> T σ 4?@A (R) S s A<10 R(A,B) S(B,C)

Lecture 17 > Section 1 > Plan Optimization Optimizing RA Plan R(A,B) S(B,C) T(C,D) SELECT R.A,S.D FROM R,S,T WHERE R.B = S.B AND S.C = T.C AND R.A < 10; We eliminate B earlier! In general, when is an attribute not needed? Π 4,D Π 4,> T(C,D) Π 4,> T Π 4,C σ 4?@A (R) S s A<10 R(A,B) S(B,C)

Lecture 17 > Section 1 > ACTIVITY Activity-17-1.ipynb 24

Lecture 17 > Section 2 2. Physical Optimization 25

Lecture 17 > Section 2 What you will learn about in this section 1. Index Selection 2. Histograms 3. ACTIVITY 26

Lecture 17 > Section 2 > Index Selection Index Selection Input: Schema of the database Workload description: set of (query template, frequency) pairs Goal: Select a set of indexes that minimize execution time of the workload. Cost / benefit balance: Each additional index may help with some queries, but requires updating This is an optimization problem!

Lecture 17 > Section 2 > Index Selection Example Workload description: SELECT pname FROM Product WHERE year =? AND category =? Frequency 10,000,000 SELECT pname, FROM Product WHERE year =? AND Category =? AND manufacturer =? Frequency 10,000,000 Which indexes might we choose?

Lecture 17 > Section 2 > Index Selection Example Workload description: SELECT pname FROM Product WHERE year =? AND category =? Frequency 10,000,000 SELECT pname FROM Product WHERE year =? AND Category =? AND manufacturer =? Frequency 100 Now which indexes might we choose? Worth keeping an index with manufacturer in its search key around?

Lecture 17 > Section 2 > Index Selection Simple Heuristic Can be framed as standard optimization problem: Estimate how cost changes when we add index. We can ask the optimizer! Search over all possible space is too expensive, optimization surface is really nasty. Real DBs may have 1000s of tables! Techniques to exploit structure of the space. In SQLServer Autoadmin. NP-hard problem, but can be solved!

Lecture 17 > Section 2 > Index Selection Estimating index cost? Note that to frame as optimization problem, we first need an estimate of the cost of an index lookup Need to be able to estimate the costs of different indexes / index types We will see this mainly depends on getting estimates of result set size!

Lecture 17 > Section 2 > Index Selection Ex: Clustered vs. Unclustered Cost to do a range query for M entries over N-page file (P per page): Clustered: To traverse: Log f (1.5N) To scan: 1 random IO + EF@ G Unclustered: To traverse: Log f (1.5N) To scan: ~ M random IO sequential IO Suppose we are using a B+ Tree index with: Fanout f Fill factor 2/3

Lecture 17 > Section 2 > Index Selection Plugging in some numbers Clustered: To traverse: Log F (1.5N) To scan: 1 random IO + EF@ G Unclustered: To traverse: Log F (1.5N) To scan: ~ M random IO sequential IO To simplify: Random IO = ~10ms Sequential IO = free ~ 1 random IO = 10ms ~ M random IO = M*10ms If M = 1, then there is no difference! If M = 100,000 records, then difference is ~10min. Vs. 10ms! If only we had good estimates of M

Lecture 17 > Section 2 > Histograms Histograms & IO Cost Estimation 34

Lecture 17 > Section 2 > Histograms IO Cost Estimation via Histograms For index selection: What is the cost of an index lookup? Also for deciding which algorithm to use: Ex: To execute R S, which join algorithm should DBMS use? What if we want to compute σ A,10 (R) σ B51 (S)? In general, we will need some way to estimate intermediate result set sizes Histograms provide a way to efficiently store estimates of these quantities

Lecture 17 > Section 2 > Histograms Histograms A histogram is a set of value ranges ( buckets ) and the frequencies of values in those buckets occurring How to choose the buckets? Equiwidth & Equidepth Turns out high-frequency values are very important

Lecture 17 > Section 2 > Histograms Example Frequency 10 8 6 4 2 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Values How do we compute how many values between 8 and 10? (Yes, it s obvious) Problem: counts take up too much space!

Lecture 17 > Section 2 > Histograms Full vs. Uniform Counts 10 8 6 4 2 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 How much space do the full counts (bucket_size=1) take? How much space do the uniform counts (bucket_size=all) take?

Lecture 17 > Section 2 > Histograms Fundamental Tradeoffs Want high resolution (like the full counts) Want low space (like uniform) Histograms are a compromise! So how do we compute the bucket sizes?

Lecture 17 > Section 2 > Histograms Equi-width 10 8 6 4 2 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 All buckets roughly the same width

Lecture 17 > Section 2 > Histograms Equidepth 10 8 6 4 2 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 All buckets contain roughly the same number of items (total frequency)

Lecture 17 > Section 2 > Histograms Histograms Simple, intuitive and popular Parameters: # of buckets and type Can extend to many attributes (multidimensional)

Lecture 17 > Section 2 > Histograms Maintaining Histograms Histograms require that we update them! Typically, you must run/schedule a command to update statistics on the database Out of date histograms can be terrible! There is research work on self-tuning histograms and the use of query feedback Oracle 11g

Lecture 17 > Section 2 > Histograms Nasty example 10 8 6 4 2 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 1. we insert many tuples with value > 16 2. we do not update the histogram 3. we ask for values > 20?

Lecture 17 > Section 2 > Histograms Compressed Histograms One popular approach: 1. Store the most frequent values and their counts explicitly 2. Keep an equiwidth or equidepth one for the rest of the values People continue to try all manner of fanciness here wavelets, graphical models, entropy models,

Lecture 17 > Section 2 > ACTIVITY Activity-17-2.ipynb 46

Lecture 17 > Section 3 3. Course Summary 47

Lecture 17 > Section 3 Course Summary We learned 1. How to design a database 2. How to query a database, even with concurrent users and crashes / aborts 3. How to optimize the performance of a database We got a sense (as the old joke goes) of the three most important topics in DB research: Performance, performance, and performance 1. Intro 2-3. SQL 4. ER Diagrams 5-6. DB Design 7-8. TXNs 11-12. IO Cost 14-15. Joins 16. Rel. Algebra

Lecture 17 > Section 3 Course Summary We learned 1. How to design a database 2. How to query a database, even with concurrent users and crashes / aborts 3. How to optimize the performance of a database We got a sense (as the old joke goes) of the three most important topics in DB research: Performance, performance, and performance 1. Intro 2-3. SQL 4. ER Diagrams 5-6. DB Design 7-8. TXNs 11-12. IO Cost 14-15. Joins 16. Rel. Algebra

Lecture 17 > Section 3 Course Summary We learned 1. How to design a database 2. How to query a database, even with concurrent users and crashes / aborts 3. How to optimize the performance of a database We got a sense (as the old joke goes) of the three most important topics in DB research: Performance, performance, and performance 1. Intro 2-3. SQL 4. ER Diagrams 5-6. DB Design 7-8. TXNs 11-12. IO Cost 14-15. Joins 16. Rel. Algebra

Lecture 17 > Section 3 Course Summary We learned 1. How to design a database 2. How to query a database, even with concurrent users and crashes / aborts 3. How to optimize the performance of a database We got a sense (as the old joke goes) of the three most important topics in DB research: Performance, performance, and performance 1. Intro 2-3. SQL 4. ER Diagrams 5-6. DB Design 7-8. TXNs 11-12. IO Cost 14-15. Joins 16. Rel. Algebra