Operator Implementation Wrap-Up Query Optimization

Similar documents
Evaluation of Relational Operations: Other Techniques. Chapter 14 Sayyed Nezhadi

Evaluation of Relational Operations

Announcement. Reading Material. Overview of Query Evaluation. Overview of Query Evaluation. Overview of Query Evaluation 9/26/17

Implementation of Relational Operations

Evaluation of Relational Operations. Relational Operations

Examples of Physical Query Plan Alternatives. Selected Material from Chapters 12, 14 and 15

Database Applications (15-415)

Evaluation of relational operations

CompSci 516 Data Intensive Computing Systems

Evaluation of Relational Operations

CAS CS 460/660 Introduction to Database Systems. Query Evaluation II 1.1

Overview of Implementing Relational Operators and Query Evaluation

Relational Query Optimization. Highlights of System R Optimizer

CS330. Query Processing

Principles of Data Management. Lecture #9 (Query Processing Overview)

Database Systems. Announcement. December 13/14, 2006 Lecture #10. Assignment #4 is due next week.

Overview of Query Evaluation. Overview of Query Evaluation

R & G Chapter 13. Implementation of single Relational Operations Choices depend on indexes, memory, stats, Joins Blocked nested loops:

Relational Query Optimization

Implementation of Relational Operations. Introduction. CS 186, Fall 2002, Lecture 19 R&G - Chapter 12

Faloutsos 1. Carnegie Mellon Univ. Dept. of Computer Science Database Applications. Outline

External Sorting Implementing Relational Operators

CS 4604: Introduction to Database Management Systems. B. Aditya Prakash Lecture #10: Query Processing

Overview of Query Evaluation

Principles of Data Management. Lecture #12 (Query Optimization I)

Evaluation of Relational Operations. SS Chung

Overview of Query Processing

Database Applications (15-415)

Overview of Query Processing. Evaluation of Relational Operations. Why Sort? Outline. Two-Way External Merge Sort. 2-Way Sort: Requires 3 Buffer Pages

15-415/615 Faloutsos 1

CompSci 516 Data Intensive Computing Systems. Lecture 11. Query Optimization. Instructor: Sudeepa Roy

Implementing Relational Operators: Selection, Projection, Join. Database Management Systems, R. Ramakrishnan and J. Gehrke 1

Administriva. CS 133: Databases. General Themes. Goals for Today. Fall 2018 Lec 11 10/11 Query Evaluation Prof. Beth Trushkowsky

Overview of Query Evaluation. Chapter 12

Query Evaluation (i)

Query Processing and Query Optimization. Prof Monika Shah

Evaluation of Relational Operations

Administrivia. CS 133: Databases. Cost-based Query Sub-System. Goals for Today. Midterm on Thursday 10/18. Assignments

CSIT5300: Advanced Database Systems

Database Applications (15-415)

Overview of DB & IR. ICS 624 Spring Asst. Prof. Lipyeow Lim Information & Computer Science Department University of Hawaii at Manoa

Query Optimization. Schema for Examples. Motivating Example. Similar to old schema; rname added for variations. Reserves: Sailors:

Implementing Joins 1

Query Processing and Optimization

Query Optimization. Schema for Examples. Motivating Example. Similar to old schema; rname added for variations. Reserves: Sailors:

Relational Query Optimization. Overview of Query Evaluation. SQL Refresher. Yanlei Diao UMass Amherst March 8 and 13, 2007

Query Evaluation Overview, cont.

Relational Query Optimization. Overview of Query Evaluation. SQL Refresher. Yanlei Diao UMass Amherst October 23 & 25, 2007

Review. Relational Query Optimization. Query Optimization Overview (cont) Query Optimization Overview. Cost-based Query Sub-System

Spring 2017 QUERY PROCESSING [JOINS, SET OPERATIONS, AND AGGREGATES] 2/19/17 CS 564: Database Management Systems; (c) Jignesh M.

Query Evaluation Overview, cont.

Basic form of SQL Queries

Database Applications (15-415)

Query Evaluation! References:! q [RG-3ed] Chapter 12, 13, 14, 15! q [SKS-6ed] Chapter 12, 13!

Database Applications (15-415)

Introduction to Data Management. Lecture 14 (SQL: the Saga Continues...)

Implementation of Relational Operations: Other Operations

Administrivia. Relational Query Optimization (this time we really mean it) Review: Query Optimization. Overview: Query Optimization

ATYPICAL RELATIONAL QUERY OPTIMIZER

Enterprise Database Systems

Database Management Systems. Chapter 5

SQL: Queries, Programming, Triggers

An SQL query is parsed into a collection of query blocks optimize one block at a time. Nested blocks are usually treated as calls to a subroutine

SQL: Queries, Constraints, Triggers

QUERY EXECUTION: How to Implement Relational Operations?

CSE 444: Database Internals. Section 4: Query Optimizer

Evaluation of Relational Operations: Other Techniques

QUERY OPTIMIZATION E Jayant Haritsa Computer Science and Automation Indian Institute of Science. JAN 2014 Slide 1 QUERY OPTIMIZATION

SQL. Chapter 5 FROM WHERE

Evaluation of Relational Operations: Other Techniques

Cost-based Query Sub-System. Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications. Last Class.

Evaluation of Relational Operations: Other Techniques

Chapter 12: Query Processing. Chapter 12: Query Processing

Final Exam Review 2. Kathleen Durant CS 3200 Northeastern University Lecture 23

Schema for Examples. Query Optimization. Alternative Plans 1 (No Indexes) Motivating Example. Alternative Plans 2 With Indexes

SQL: Queries, Programming, Triggers. Basic SQL Query. Conceptual Evaluation Strategy. Example of Conceptual Evaluation. A Note on Range Variables

DBMS Query evaluation

Database Management System

Chapter 12: Query Processing

Advances in Data Management Query Processing and Query Optimisation A.Poulovassilis

Chapter 13: Query Processing

Introduction to Query Processing and Query Optimization Techniques. Copyright 2011 Ramez Elmasri and Shamkant Navathe

Database Applications (15-415)

QUERY OPTIMIZATION [CH 15]

! A relational algebra expression may have many equivalent. ! Cost is generally measured as total elapsed time for

Chapter 13: Query Processing Basic Steps in Query Processing

Lecture 3 More SQL. Instructor: Sudeepa Roy. CompSci 516: Database Systems

Query Optimization in Relational Database Systems

CSE 444: Database Internals. Sec2on 4: Query Op2mizer

Relational Query Optimization

Chapter 12: Query Processing

Query Processing. Debapriyo Majumdar Indian Sta4s4cal Ins4tute Kolkata DBMS PGDBA 2016

Midterm Review CS634. Slides based on Database Management Systems 3 rd ed, Ramakrishnan and Gehrke

ECS 165B: Database System Implementa6on Lecture 7

CIS 330: Applied Database Systems

Database System Concepts

CSC 261/461 Database Systems Lecture 19

Relational Algebra 1

CSE 544 Principles of Database Management Systems

Database Systems. October 12, 2011 Lecture #5. Possessed Hands (U. of Tokyo)

Transcription:

Operator Implementation Wrap-Up Query Optimization 1

Last time: Nested loop join algorithms: TNLJ PNLJ BNLJ INLJ Sort Merge Join Hash Join 2

General Join Conditions Equalities over several attributes (e.g., R.sid=S.sid AND R.rname=S.sname): Index NL works if we have an index on both sid and sname (together or separately) For Sort-Merge and Hash Join, sort/partition on combination of the two join columns. 3

General Join Conditions Inequality conditions (e.g., R.rname < S.sname): For Index NL, need (clustered!) B+ tree index. Range probes on inner; # matches likely to be much higher than for equality joins. Hash Join, Sort Merge Join not applicable. Block NLJ quite likely to be the best join method here. 4

Blocking vs non blocking algorithms Suppose your join is not evaluated in isolation, but sits within a bigger RA tree Interesting to think about overall sid=sid bid=100 sname evaluation Reserves Sailors 5

Blocking vs non blocking algorithms Selection produces R tuples one at a time Some join algorithms can get started right away, others can't sid=sid bid=100 sname Reserves Sailors 6

Blocking vs non blocking algorithms Hash join has to wait for all of the tuples - blocking Nested loops joins can get started right away non blocking Sort-merge join? Note: not the same usage of "blocking" as in BNLJ!! Here: blocking = needs to read at least one input completely before producing output sname sid=sid bid=100 Reserves Sailors 7

Other implementations The implementations you have seen are geared to specific requirements Compute entire result as fast as possible Sometimes you may have a setting with different desiderata Which may call for totally different implementations! 8

Case Study: Online Aggregation Setting: interactive exploration of large data sets to discover general trends Example: SELECT P.zipcode, AVG(E.salary) FROM EmploymentData E, PersonalData P WHERE E.ssn = P.ssn GROUP BY P.zipcode; 9

Case Study: Online Aggregation This query may take a lot of time to compute But we often don't need precise results Some estimate (with a confidence interval) is enough So, goal is to compute partial results fast And give a running confidence interval so user can stop evaluation once satisfied Means we want to sample from DB (definitely don't want sorted order!) 10

Joins for online aggregation Definitely do not want to have to read both relations before starting to output result So the blocking algorithms (hash join and sort merge) are out 11

Joins for online aggregation Nested loops might work: Pick random tuple from R (not trivial!) Join it with all of S, update confidence interval Get another random tuple and repeat until enough Note would want smaller relation as inner because can update confidence intervals more often 12

Ripple joins Even better Avoid scanning all of the inner relation each time Basic idea: Pick random tuple from R and random tuple from S Join them Pick two more random tuples, one from each Join all 4 tuples together Continue until confidence interval acceptable 13

Ripple Join: Square Two-Table Join R S X 14

Ripple Join: Square Two-Table Join R S X X X X 15

Ripple Join: Square Two-Table Join R S X X X X X X X X X 16

Ripple Join: Square Two-Table Join R S X X X X X X X X X X X X X X X X 17

Pseudocode for (max = 1 to infty) for (i = 1 to max 1) if (match (R[i], S[max]) output tuple for (i=1 to max) if (match (R[max], S[i])) output tuple 18

Ripple Joins Can have non-square aspect ratios if desired due to statistical properties of data i.e. if want to sample one relation more often than the other This method of join evaluation allows confidence intervals to shrink quite fast Theory and experimental results in research paper if you're interested (link in CMS) 19

Ripple Joins Extremely inefficient if we wanted to compute the whole join Even worse than tuple nested loops join Because would need to keep track of the alreadyseen tuples from both relations (eventually won't fit in memory) Can use indexes, blocking or hashing to help But the main reason this works is that we almost always stop computation very early 20

Set Operations Intersection and cross-product special cases of join. Intersection: equality on all fields is join condition Cross product: no equality condition 21

Set Operations - Union Main challenge: eliminating duplicates Sorting based approach Sort both relations (on combination of all attributes). Scan sorted relations and merge them. 22

Set Operations - Union Hash based approach to union: Partition R and S using hash function h. For each S partition, build in-memory hash table (using h2), scan corresponding R partition and add tuples to hash table while discarding duplicates When done with partition, write out hashtable as (part of) result 23

Set Operations Set difference (R S) similar to union Sorting-based approach During merge pass, write out tuples in R after checking that do not appear in S Hashing-based approach For each tuple of R, probe hashtable for partition of S and only write tuple (to output) if not found in hashtable. 24

Aggregate Operations (AVG, MIN, etc.) Without grouping: In general, requires scanning the relation. Keep track of some "running information" SUM? MAX/MIN? AVG? 25

Aggregate Operations (AVG, MIN, etc.) With grouping: Sort on group-by attributes, then scan relation and compute aggregate for each group. Can improve upon this by combining sorting and aggregate computation (total cost = cost of sort in this case) Hashing approach: build a hash table with entries <grouping-value, running-info> and update it as you scan the entire relation 26

Aggregate Operations (AVG, MIN, etc.) May be able to use indexes Index-only scan Retrieve records in sorted order instead of having to sort 27

Summary so far Understand how to implement basic Relational Algebra operators Select Project Join Set operators Aggregation/GROUP BY Understand that other implementations may be appropriate if setting/requirements are different 28

Putting it all together How to use these implementation algorithms to process your SQL queries efficiently? Query evaluation and optimization 29

Putting it all together Query Query Parser Logical Query Plan Query Optimizer Plan generator Plan cost estimator Catalog Manager Physical Query Plan Query Plan Evaluator 30

Parsing and decomposition SELECT S.sname, S.age FROM Sailors S WHERE S.age = (SELECT MAX(S2.age) FROM Sailors S2); This query will generate two blocks Blocks correspond to a single SELECT-FROM- WHERE clause Blocks optimized one at a time 31

Optimizing a block A block is basically a Relational Algebra selectproject-join ( ) expression With additional "operators"/annotations to handle features like Aggregation GROUP BY ORDER BY Etc Core of the optimizer's work: find the best physical query plan for the SPJ part 32

Query & logical and physical plans SELECT S.sname FROM Reserves R, Sailors S WHERE R.sid=S.sid AND R.bid=100 AND S.rating>5 sname (On-the-fly) rating > 5 (On-the-fly) Logical query plan: sname bid=100 rating > 5 (Use hash index; do not write result to temp) sid=sid bid=100 Reserves (Index Nested Loops, with pipelining ) Sailors Reserves sid=sid Sailors Physical query plan = RA tree annotated with info on access methods and operator implementation 33

The work of an optimizer Generate some different physical plans Reorder operators (using our handy RA equivalences) Experiment with different implementations for each operator For each plan, estimate the cost Choose the lowest-cost plan 34

The work of an optimizer Ideally: Want to find best plan. Practically: Avoid worst plans! Optimization can't take too long, otherwise might defeat the purpose (if takes longer to optimize than to run a "dumb" plan) 35

Two main questions How to generate (a good subset of) the possible plans? How to compute the cost of each plan? Let's start with the second question. Basically it's about putting together the peroperator calculations we have been doing already 36

Let's look at some examples Sailors (sid: integer, sname: string, rating: integer, age: real) Reserves (sid: integer, bid: integer, day: date, rname: string) Reserves: Each tuple is 40 bytes long, 100 tuples per page, 1000 pages. Sailors: Each tuple is 50 bytes long, 80 tuples per page, 500 pages. 37

A first physical query plan SELECT S.sname FROM Reserves R, Sailors S WHERE R.sid=S.sid AND R.bid=100 AND S.rating>5 sname bid=100 rating > 5 (On-the-fly) (On-the-fly) Logical plan: sname sid=sid (Page Nested Loops) bid=100 rating > 5 Reserves Sailors Reserves sid=sid Sailors Cost: 1000+500*1000 = 501,000 page I/Os Convention: left child of join = outer relation 38

FAQ: what does "on the fly" mean? Just means that we can apply the operation while the tuple is already in memory So no extra I/O cost 39