CSE 544 Principles of Database Management Systems. Magdalena Balazinska Fall 2007 Lecture 7 - Query execution

Similar documents
CSE 544 Principles of Database Management Systems

Introduction to Database Systems CSE 444

Lecture 17: Query execution. Wednesday, May 12, 2010

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2009 Lecture 6 - Storage and Indexing

CSE 444: Database Internals. Lecture 3 DBMS Architecture

CSE 444: Database Internals. Lecture 3 DBMS Architecture

CSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 7 - Query optimization

Introduction to Data Management CSE 344. Lectures 9: Relational Algebra (part 2) and Query Evaluation

Review: Query Evaluation Steps. Example Query: Logical Plan 1. What We Already Know. Example Query: Logical Plan 2.

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Fall 2007 Lecture 9 - Query optimization

Announcements. From SQL to RA. Query Evaluation Steps. An Equivalent Expression

What we already know. Late Days. CSE 444: Database Internals. Lecture 3 DBMS Architecture

Announcements. Two typical kinds of queries. Choosing Index is Not Enough. Cost Parameters. Cost of Reading Data From Disk

CSE 444: Database Internals. Lectures 5-6 Indexing

CSE 344 APRIL 20 TH RDBMS INTERNALS

Database Systems CSE 414

Introduction to Data Management CSE 344. Lecture 12: Cost Estimation Relational Calculus

Introduction to Database Systems CSE 444, Winter 2011

CSE 344 APRIL 27 TH COST ESTIMATION

CSE 344 FEBRUARY 14 TH INDEXING

CSE 344 JANUARY 26 TH DATALOG

Data Storage. Query Performance. Index. Data File Types. Introduction to Data Management CSE 414. Introduction to Database Systems CSE 414

CSE 344 FEBRUARY 21 ST COST ESTIMATION

Hash table example. B+ Tree Index by Example Recall binary trees from CSE 143! Clustered vs Unclustered. Example

CSEP 544: Lecture 04. Query Execution. CSEP544 - Fall

Introduction to Database Systems CSE 414. Lecture 26: More Indexes and Operator Costs

EECS 647: Introduction to Database Systems

Announcement. Reading Material. Overview of Query Evaluation. Overview of Query Evaluation. Overview of Query Evaluation 9/26/17

Lassonde School of Engineering Winter 2016 Term Course No: 4411 Database Management Systems

Agenda. Discussion. Database/Relation/Tuple. Schema. Instance. CSE 444: Database Internals. Review Relational Model

Evaluation of Relational Operations. Relational Operations

Faloutsos 1. Carnegie Mellon Univ. Dept. of Computer Science Database Applications. Outline

Database Systems. Announcement. December 13/14, 2006 Lecture #10. Assignment #4 is due next week.

Principles of Database Management Systems

CS 4604: Introduction to Database Management Systems. B. Aditya Prakash Lecture #10: Query Processing

Chapter 13: Query Processing

Evaluation of Relational Operations

Implementation of Relational Operations

CompSci 516 Data Intensive Computing Systems. Lecture 11. Query Optimization. Instructor: Sudeepa Roy

Query Processing. Debapriyo Majumdar Indian Sta4s4cal Ins4tute Kolkata DBMS PGDBA 2016

Principles of Data Management. Lecture #9 (Query Processing Overview)

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Fall 2006 Lecture 3 - Relational Model

Database System Concepts

Chapter 12: Query Processing

! A relational algebra expression may have many equivalent. ! Cost is generally measured as total elapsed time for

Chapter 13: Query Processing Basic Steps in Query Processing

Query Execution [15]

Introduction to Query Processing and Query Optimization Techniques. Copyright 2011 Ramez Elmasri and Shamkant Navathe

Principles of Data Management. Lecture #12 (Query Optimization I)

Query Optimization. Query Optimization. Optimization considerations. Example. Interaction of algorithm choice and tree arrangement.

Introduction to Database Systems CSE 344

15-415/615 Faloutsos 1

Implementing Relational Operators: Selection, Projection, Join. Database Management Systems, R. Ramakrishnan and J. Gehrke 1

Lecture 20: Query Optimization (2)

Query Processing & Optimization. CS 377: Database Systems

Chapter 12: Query Processing. Chapter 12: Query Processing

University of Waterloo Midterm Examination Sample Solution

Announcements. Agenda. Database/Relation/Tuple. Discussion. Schema. CSE 444: Database Internals. Room change: Lab 1 part 1 is due on Monday

Query Processing. Introduction to Databases CompSci 316 Fall 2017

Query Optimization. Schema for Examples. Motivating Example. Similar to old schema; rname added for variations. Reserves: Sailors:

Database Applications (15-415)

Overview of Implementing Relational Operators and Query Evaluation

CSE 544 Principles of Database Management Systems

ATYPICAL RELATIONAL QUERY OPTIMIZER

Relational Query Optimization. Highlights of System R Optimizer

CSIT5300: Advanced Database Systems

Query Optimization. Schema for Examples. Motivating Example. Similar to old schema; rname added for variations. Reserves: Sailors:

Cost-based Query Sub-System. Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications. Last Class.

Database Applications (15-415)

Evaluation of relational operations

Announcements. Agenda. Database/Relation/Tuple. Schema. Discussion. CSE 444: Database Internals

CSE Midterm - Spring 2017 Solutions

R & G Chapter 13. Implementation of single Relational Operations Choices depend on indexes, memory, stats, Joins Blocked nested loops:

The query processor turns user queries and data modification commands into a query plan - a sequence of operations (or algorithm) on the database

RELATIONAL OPERATORS #1

Relational Query Optimization

Chapter 12: Query Processing

CMSC424: Database Design. Instructor: Amol Deshpande

Administrivia. CS 133: Databases. Cost-based Query Sub-System. Goals for Today. Midterm on Thursday 10/18. Assignments

Evaluation of Relational Operations

Outline. Query Processing Overview Algorithms for basic operations. Query optimization. Sorting Selection Join Projection

Relational Query Optimization. Overview of Query Evaluation. SQL Refresher. Yanlei Diao UMass Amherst October 23 & 25, 2007

Implementation of Relational Operations. Introduction. CS 186, Fall 2002, Lecture 19 R&G - Chapter 12

Database Management Systems (COP 5725) Homework 3

Database Management Systems. Chapter 4. Relational Algebra. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1

Plan for today. Query Processing/Optimization. Parsing. A query s trip through the DBMS. Validation. Logical plan

Relational Query Optimization. Overview of Query Evaluation. SQL Refresher. Yanlei Diao UMass Amherst March 8 and 13, 2007

Operator Implementation Wrap-Up Query Optimization

Algorithms for Query Processing and Optimization. 0. Introduction to Query Processing (1)

Query Processing & Optimization

Overview of Query Evaluation. Overview of Query Evaluation

Advanced Database Systems

Midterm Review. March 27, 2017

Query Processing. Solutions to Practice Exercises Query:

CS 564 Final Exam Fall 2015 Answers

CSE 444: Database Internals. Lecture 22 Distributed Query Processing and Optimization

What happens. 376a. Database Design. Execution strategy. Query conversion. Next. Two types of techniques

Query Processing and Advanced Queries. Query Optimization (4)

Overview of Query Evaluation

Evaluation of Relational Operations

Transcription:

CSE 544 Principles of Database Management Systems Magdalena Balazinska Fall 2007 Lecture 7 - Query execution

References Generalized Search Trees for Database Systems. J. M. Hellerstein, J. F. Naughton and A. Pfeffer. VLDB 1995. [To finish talking about GiST] Query evaluation techniques for large databases. G. Graefe. ACM Computing Survey 25(2). 1993. Sec 1. Database management systems. Ramakrishnan and Gehrke. Third Ed. Chapter 12.

Outline Finish talking about GiST Steps involved in processing a query Logical query plan Physical query plan Query execution overview Operator implementations (part 1)

Generalized Search Tree (GiST) Goal: facilitate database extensibility When adding a new data type Want to add indexes for the data type Overview GiST is an index structure Basically, this is a template for indexes Supports extensible set of queries and data types

B+ Tree Example d = 2 Find the key 40 40 80 80 20 60 100 120 140 20 < 40 60 10 15 18 20 30 40 50 60 65 80 85 90 30 < 40 40 10 15 18 20 30 40 50 60 65 80 85 90

R-Tree Example Designed for spatial data Search key values are bounding boxes R1 Q R2 Q R1 R3 R3 R3 R4 R5 R6 R7 Q R4 R5 R6 R7 R4 R5 Q R6 R7 R2 For insertion: at each level, choose child whose bounding box needs least enlargement (in terms of area)

GiST Key Insights Canonical database search tree Balanced tree with high fanout Leaf nodes contain pointers to actual data Leaf nodes stored as a linked list Internal nodes used as a directory Contain <key,pointers> pairs If key consistent with query, data may be found if we follow pointer Generalized search key: predicate that holds for each entry below key B+-tree key is pair of integers <a,b> and predicate is Contains([a,b),v) R-tree key is bounding box and predicate is also containment test Generalized search tree: hierarchy of partitions

GiST Key Methods: Consistent Consistent(E,q) Entry E = (p, ptr) and query predicate q Returns false if p q can be guaranteed unsatisfiable Returns true otherwise See Algo Search(R,q) [also FindMin(R,q) and Next(R,q,E)]

GiST Key Methods: Consistent In a B+-tree, query predicates q can be either Contains( [x,y), v ) returns true if x <= v < y and false otherwise Equal(x,v) returns true if x = v and false otherwise In a B+-tree, Consistent(E,q) p = Contains( [x p,y p ), v ) (1) q = Contains( [x q,y q ), v ) or (2) q = Equal(x q,v) For (1), return true if ( x p < y q ) (y p > x q ) For (2), return true if x p <= x q < y p In R-tree, Consistent returns true if bounding boxes overlap

GiST Key Methods Penalty Used during insert operations to pick subtree where to insert See algorithms Insert(R,E,I) and ChooseSubtree(R,E,l) B+-tree: returns zero when value to insert falls within subtree range R-tree: returns change in area PickSplit Used to split nodes during insert operations See algorithm Split(R,N,E) B+-tree: half the entries go into left group and half into right group R-tree: e.g., minimize total area of bounding boxes after split

GiST Key Methods Union Once a key is inserted, need to adjust predicates at parent nodes See algorithm AdjustKeys(R,N) B+-tree: computes interval that covers all given intervals R-tree: computes bigger bounding box Compress/Decompress: for storage performance

Outline Finish talking about GiST Steps involved in processing a query Logical query plan Physical query plan Query execution overview Operator implementations (part 1)

Query Evaluation Steps SQL query Parse & Rewrite Query Query optimization Select Logical Plan Select Physical Plan Query Execution Logical plan Physical plan Disk

Example Database Schema Supplier(sno,sname,scity,sstate) Part(pno,pname,psize,pcolor) Supply(sno,pno,price) View: Suppliers in Seattle CREATE VIEW NearbySupp AS SELECT sno, sname FROM Supplier WHERE scity='seattle' AND sstate='wa'

Example Query Find the names of all suppliers in Seattle who supply part number 2 SELECT sname FROM NearbySupp WHERE sno IN ( SELECT sno FROM Supplies WHERE pno = 2 )

Steps in Query Evaluation Step 0: admission control User connects to the db with username, password User sends query in text format Step 1: Query parsing Parses query into an internal format Performs various checks using catalog Step 2: Query rewrite View rewriting, flattening, etc.

Rewritten Version of Our Query Original query: SELECT sname FROM NearbySupp WHERE sno IN ( SELECT sno FROM Supplies WHERE pno = 2 ) Rewritten query: SELECT S.sname FROM Supplier S, Supplies U WHERE S.scity='Seattle' AND S.sstate='WA AND S.sno = U.sno AND U.pno = 2;

Continue with Query Evaluation Step 3: Query optimization Find an efficient query plan for executing the query We will spend a whole lecture on this topic A query plan is Logical query plan: an extended relational algebra tree Physical query plan: with additional annotations at each node Access method to use for each relation Implementation to use for each relational operator

Extended Algebra Operators Union, intersection, difference - Selection σ Projection π Join Duplicate elimination δ Grouping and aggregation γ Sorting τ Rename ρ

Logical Query Plan π sname σ sscity= Seattle sstate= WA pno=2 sno = sno Suppliers Supplies

Query Block Most optimizers operate on individual query blocks A query block is an SQL query with no nesting Exactly one SELECT clause FROM clause At most one WHERE clause GROUP BY clause HAVING clause

Typical Plan for Block (1/2)... π fields σ selection condition join condition SELECT-PROJECT-JOIN Query join condition R S

Typical Plan For Block (2/2) having condition γ fields, sum/count/min/max(fields) π fields σ selection condition join condition

How about Subqueries? SELECT Q.name FROM Person Q WHERE Q.age > 25 and not exists SELECT * FROM Purchase P WHERE P.buyer = Q.name and P.price > 100

How about Subqueries? SELECT Q.name FROM Person Q WHERE Q.age > 25 and not exists SELECT * FROM Purchase P WHERE P.buyer = Q.name and P.price > 100 σ name age>25 - name σ Price > 100 buyer=name Person Purchase Person

Physical Query Plan Logical query plan with extra annotations Access path selection for each relation Use a file scan or use an index Implementation choice for each operator Scheduling decisions for operators

Physical Query Plan (On the fly) π sname (On the fly) σ sscity= Seattle sstate= WA pno=2 (Nested loop) sno = sno Suppliers (File scan) Supplies (File scan)

Final Step in Query Processing Step 4: Query execution How to synchronize operators? How to pass data between operators? What techniques are possible (paper Sec. 1)? One thread per process Iterator interface Pipelined execution Intermediate result materialization

Iterator Interface Each operator implements this interface Interface has only three methods open() Initializes operator state Sets parameters such as selection condition get_next() Operator invokes get_next recursively on its inputs Performs processing and produces an output tuple close(): clean-up state Examples: Table 1 in the paper

Pipelined Execution Applies parent operator to tuples directly as they are produced by child operators Benefits No operator synchronization issues Saves cost of writing intermediate data to disk Saves cost of reading intermediate data from disk Good resource utilizations on single processor This approach is used whenever possible

Pipelined Execution (On the fly) π sname (On the fly) σ sscity= Seattle sstate= WA pno=2 (Nested loop) sno = sno Suppliers (File scan) Supplies (File scan)

Intermediate Tuple Materialization Writes the results of an operator to an intermediate table on disk No direct benefit but Necessary for some operator implementations When operator needs to examine the same tuples multiple times

Intermediate Tuple Materialization (On the fly) π sname (Sort-merge join) sno = sno (Scan: write to T1) σ sscity= Seattle sstate= WA σ pno=2 (Scan: write to T2) Suppliers (File scan) Supplies (File scan)

Outline Finish talking about GiST Steps involved in processing a query Logical query plan Physical query plan Query execution overview Operator implementations (part 1)

Cost Parameters In database systems the data is on disk Cost = total number of I/Os Parameters: B(R) = # of blocks (i.e., pages) for relation R T(R) = # of tuples in relation R V(R, a) = # of distinct values of attribute a

Cost Cost of an operation = number of disk I/Os to read the operands compute the result Cost of writing the result to disk is not included Need to count it separately when applicable

Notions of Clustering Clustered-file organization (aka co-clustering) Tuples of one relation R are placed with a tuple of another relation S with a common value Clustered relation Tuples of relation are stored on blocks predominantly devoted to storing that relation Sometimes also called clustered file organization Clustered index (aka clustering index) When ordering of data records is close to the ordering of data entries in the index

Cost Parameters Clustered relation R: Blocks consists mostly of records from this table B(R) T(R) / blocksize Unclustered relation R: Its records are placed on blocks with other tables When R is unclustered: B(R) T(R) When a is a key, V(R,a) = T(R) When a is not a key, V(R,a)

Cost of Scanning a Table Clustered relation: Result may be unsorted: B(R) Result needs to be sorted: 3B(R) Unclustered relation Unsorted: T(R) Sorted: T(R) + 2B(R)

One-pass Algorithms Selection σ(r), projection Π(R) Both are tuple-at-a-time algorithms Cost: B(R), the cost of scanning the relation Input buffer Unary operator Output buffer

Join Algorithms Logical operator: Product(pname, cname) Company(cname, city) Propose three physical operators for the join, assuming the tables are in main memory: Hash join Nested loop join Sort-merge join

Hash Join Hash join: R S Scan R, build buckets in main memory Then scan S and join Cost: B(R) + B(S) One pass algorithm when B(R) <= M

Nested Loop Joins Tuple-based nested loop R S R is the outer relation, S is the inner relation for each tuple r in R do for each tuple s in S do if r and s join then output (r,s) Cost: B(R) + T(R) B(S) when S is clustered Cost: B(R) + T(R) T(S) when S is unclustered

Page-at-a-time Refinement for each page of tuples r in R do for each page of tuples s in S do for all pairs of tuples if r and s join then output (r,s) Cost: B(R) + B(R)B(S) if S is clustered Cost: B(R) + B(R)T(S) if S is unclustered

Nested Loop Joins We can be much more clever How would you compute the join in the following cases? What is the cost? B(R) = 1000, B(S) = 2, M = 4 B(R) = 1000, B(S) = 3, M = 4 B(R) = 1000, B(S) = 6, M = 4

Nested Loop Joins Block Nested Loop Join Group of (M-2) pages of S is called a block for each (M-2) pages ps of S do for each page pr of R do for each tuple s in ps for each tuple r in pr do if r and s join then output(r,s)

Nested Loop Joins R & S... Hash table for block of S (M-2 pages)... Join Result... Input buffer for R Output buffer

Nested Loop Joins Cost of block-based nested loop join Read S once: cost B(S) Outer loop runs B(S)/(M-2) times, and each time need to read R: costs B(S)B(R)/(M-2) Total cost: B(S) + B(S)B(R)/(M-2) Notice: it is better to iterate over the smaller relation first

Sort-Merge Join Sort-merge join: R S Scan R and sort in main memory Scan S and sort in main memory Merge R and S Cost: B(R) + B(S) One pass algorithm when B(S) + B(R) <= M Typically, this is NOT a one pass algorithm

One-pass Algorithms Duplicate elimination δ(r) Need to keep tuples in memory When new tuple arrives, need to compare it with previously seen tuples Balanced search tree or hash table Cost: B(R) Assumption: B(δ(R)) <= M

One-pass Algorithms Grouping: Product(name, department, quantity) γ department, sum(quantity) (Product) Answer(department, sum) How can we compute this in main memory?

One-pass Algorithms Grouping: γ department, sum(quantity) (R) Need to store all departments in memory Also store the sum(quantity) for each department Balanced search tree or hash table Cost: B(R) Assumption: number of depts fits in memory