Principles of Data Management. Lecture #9 (Query Processing Overview)

Instructor: Mike Carey (mjcarey@ics.uci.edu)
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke

Today's Notable News
- Midterm time is upon us!
  - In-class exam on Thursday (or HW for CS122c)
  - Closed book but open cheat sheet (8.5 x 11, x2)
  - Sample midterm exam + answers on wiki
  - Grading will take a while (but I will do it)
- Today sets the stage for the next major phase of the course (which starts after the midterm)

DBMS Structure (from Lecture 1)
[Figure: DBMS architecture diagram. Components shown: SQL input; Query Parser; Query Optimizer; Plan Executor; Relational Operators (+ Utilities); Files of Records; Access Methods (Indices); Buffer Manager; Disk Space and I/O Manager; Transaction Manager; Lock Manager; Log Manager; Data Files, Index Files, Catalog Files; WAL. The figure carries a "You are now here" marker.]

Component Roles (from Lecture 1)
- Query Parser
  - Parse and analyze SQL query
  - Produce data structure capturing SQL statement and the objects that it refers to in the system catalogs
- Query Optimizer (often w/2 steps)
  - Rewrite query logically (e.g., views/unnesting)
  - Perform cost-based optimization (a la System R)
  - Goal is to pick a good query plan considering:
    - Physical table structures
    - Available access paths (indexes)
    - Data statistics (if known)
    - Cost model (for relational operations)
  - (Cost differences can be orders of magnitude!)

Component Roles (cont.)
- Plan Executor + Relational Operators
  - Runtime side of query processing
  - Usually based on a pull-based "tree of iterators" model, e.g.:
    [Figure: operator tree with steps numbered (1) through (4); step (1) is labeled getnexttuple().]
  - Nodes are relational operators (actually physical implementations of relational operators/combos)

Some Common Techniques
- Algorithms for evaluating relational operators all use some simple ideas extensively*:
  - Indexing: Can use WHERE conditions to retrieve small set of tuples (selections, joins)
  - Iteration: Sometimes faster to scan all tuples even if there's an index. (And sometimes, we can scan the data entries in an index instead of the table itself.)
  - Partitioning: By using sorting or hashing, we can partition the input tuples and replace an expensive operation by similar operations on smaller inputs.
* Watch for these techniques as we discuss query evaluation!
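The pull-based tree-of-iterators model named above can be sketched in a few lines. This is only an illustration, not how any particular DBMS implements it: the class names (Scan, Select, Project), the get_next method, the run helper, and the toy Sailors rows are all assumptions made for the example. Each operator pulls one tuple at a time from its child, so nothing is materialized between operators.

```python
class Scan:
    """Leaf operator: hands out stored tuples one at a time."""
    def __init__(self, rows):
        self.rows = rows
        self.pos = 0

    def get_next(self):
        if self.pos >= len(self.rows):
            return None  # end of input
        row = self.rows[self.pos]
        self.pos += 1
        return row


class Select:
    """Pulls from its child until a tuple satisfies the predicate."""
    def __init__(self, child, predicate):
        self.child = child
        self.predicate = predicate

    def get_next(self):
        while (row := self.child.get_next()) is not None:
            if self.predicate(row):
                return row
        return None


class Project:
    """Keeps only the requested columns (no duplicate elimination)."""
    def __init__(self, child, columns):
        self.child = child
        self.columns = columns

    def get_next(self):
        row = self.child.get_next()
        if row is None:
            return None
        return {c: row[c] for c in self.columns}


def run(root):
    """Drive the plan by repeatedly pulling from the root operator."""
    out = []
    while (row := root.get_next()) is not None:
        out.append(row)
    return out


# Toy data (hypothetical), shaped like the Sailors relation.
sailors = [
    {"sid": 1, "sname": "Dustin", "rating": 7},
    {"sid": 2, "sname": "Lubber", "rating": 3},
    {"sid": 3, "sname": "Rusty", "rating": 10},
]
plan = Project(Select(Scan(sailors), lambda r: r["rating"] > 5), ["sname"])
result = run(plan)
```

The root is the only operator the executor calls directly; every other call happens because a parent asked its child for the next tuple.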

(Schema and Data for Examples)
    Sailors (sid: integer, sname: string, rating: integer, age: real)
    Reserves (sid: integer, bid: integer, day: date, rname: string)
- Sailors: Members of the UW-Madison Hoofers sailing club.
  - 40,000 tuples, each one 50 bytes long, with 80 tuples/page (4000-byte pages), so 500 pages total.
- Reserves: Members' reservations for the club's sailboats.
  - 100,000 tuples, each one 40 bytes long, so 100 tuples/page (4000-byte pages), yielding 1000 pages total.

Access Paths
- An access path is a method of retrieving tuples:
  - File scan, or index that matches a selection (in the query)
- A tree index matches (a conjunction of) terms that involve only attributes in a prefix of the search key.
  - E.g., a tree index on <a, b, c> matches the selections a=5 AND b=3, and a=5 AND b>6, but not b=3.
- A hash index matches (a conjunction of) terms that has a term attribute = value for every attribute in the search key of the index.
  - E.g., a hash index on <a, b, c> matches a=5 AND b=3 AND c=5; but not b=3, or a=5 AND b=3, or a>5 AND b=3 AND c=5.
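The two matching rules above can be checked mechanically. The sketch below is a direct reading of the slide's definitions, under the assumption that a selection is given as a list of (attribute, operator) pairs; the function names are made up for this example.

```python
def tree_index_matches(search_key, terms):
    """True if the terms involve exactly the attributes of some prefix
    of the search key (any comparison operator is allowed)."""
    attrs = {attr for attr, _op in terms}
    if not attrs:
        return False
    prefix = set(search_key[:len(attrs)])
    return attrs == prefix


def hash_index_matches(search_key, terms):
    """True if every attribute of the search key has an equality term."""
    equality_attrs = {attr for attr, op in terms if op == "="}
    return all(attr in equality_attrs for attr in search_key)


key = ("a", "b", "c")

# The slide's tree-index examples.
tree_results = [
    tree_index_matches(key, [("a", "="), ("b", "=")]),  # a=5 AND b=3
    tree_index_matches(key, [("a", "="), ("b", ">")]),  # a=5 AND b>6
    tree_index_matches(key, [("b", "=")]),              # b=3
]

# The slide's hash-index examples.
hash_results = [
    hash_index_matches(key, [("a", "="), ("b", "="), ("c", "=")]),
    hash_index_matches(key, [("b", "=")]),
    hash_index_matches(key, [("a", "="), ("b", "=")]),
    hash_index_matches(key, [("a", ">"), ("b", "="), ("c", "=")]),
]
```

The hash rule is stricter because a hash function can only be evaluated once a value is known for every attribute in the search key.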

A Note on Complex Selections
    (day<8/9/94 AND rname='Paul') OR bid=5 OR sid=3
- Selection conditions are first converted to conjunctive normal form (CNF):
    (day<8/9/94 OR bid=5 OR sid=3) AND (rname='Paul' OR bid=5 OR sid=3)
- We'll mostly discuss cases with no ORs; see the text if you're curious about the general case.

One Approach to Selections
- Find the most selective access path, retrieve tuples using it, then apply any remaining terms that don't match the index:
  - Most selective access path: An index or file scan that we estimate will require the fewest page I/Os.
  - Terms that match this index reduce the number of tuples retrieved; other terms are used to filter the retrieved tuples on the fly, but don't prevent retrieval of the tuples/pages.
  - Consider day<8/9/94 AND bid=5 AND sid=3. A B+ tree index on day can be used; bid=5 and sid=3 must then be checked for each retrieved tuple. Similarly, a hash index on <bid, sid> could be used; day<8/9/94 must then be checked.
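The last example above (hash index on <bid, sid>, with day<8/9/94 checked on the fly) can be sketched as follows. Everything here is illustrative: the index is a plain in-memory dictionary, the rows are made-up Reserves tuples, and select_reserves is a hypothetical helper name.

```python
from collections import defaultdict
from datetime import date

# Made-up Reserves tuples for illustration only.
reserves = [
    {"sid": 3, "bid": 5, "day": date(1994, 7, 1), "rname": "Paul"},
    {"sid": 3, "bid": 5, "day": date(1994, 9, 1), "rname": "Ann"},
    {"sid": 4, "bid": 5, "day": date(1994, 7, 1), "rname": "Paul"},
]

# Stand-in for a hash index on <bid, sid>: key -> positions of matching rows.
bid_sid_index = defaultdict(list)
for pos, row in enumerate(reserves):
    bid_sid_index[(row["bid"], row["sid"])].append(pos)


def select_reserves(bid, sid, before):
    """Use the index for the matching terms, then filter the rest on the fly."""
    # Terms bid=... AND sid=... match the index and limit what is retrieved.
    candidates = [reserves[pos] for pos in bid_sid_index.get((bid, sid), [])]
    # The day term does not match the index, so it is applied to each
    # retrieved tuple; it cannot prevent those tuples from being fetched.
    return [row for row in candidates if row["day"] < before]


matches = select_reserves(bid=5, sid=3, before=date(1994, 8, 9))
```

Two tuples are fetched through the index here, and the residual day check discards one of them.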

Using an Index for Selections
    SELECT * FROM Reserves R WHERE R.rname < 'C%'
- Cost depends on #qualifying tuples and clustering.
  - Cost of finding qualifying data entries (typically small) plus cost of retrieving actual records themselves (can be large w/o clustering).
  - E.g., assuming uniform distribution of names, about 10% of tuples qualify (100 pages, 10,000 tuples). With a clustered index, cost will be just over 100 I/Os; if index is unclustered, can cost up to 10,000 I/Os!

Projection
    SELECT DISTINCT R.sid, R.bid FROM Reserves R
- Expensive part is possibly removing duplicates.
  - SQL systems don't remove duplicates unless the keyword DISTINCT is specified in a query. (Bags vs. sets.)
- Sorting Approach: Sort on <sid, bid> and remove duplicates. (Can optimize by dropping unwanted information while sorting.)
- Hashing Approach: Hash on <sid, bid> to create partitions. Load partitions into memory one at a time, build an in-memory hash structure, and eliminate duplicates within it. (Q: See why this works?)
- If there is an index with both R.sid and R.bid in the search key, may be cheaper to sort data entries!
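The hashing approach to projection can be sketched in memory. This is a simplification: real partitions would be written to and read back from disk, and the function name, partition count, and sample pairs below are assumptions for illustration.

```python
def hash_distinct(rows, num_partitions=4):
    """Duplicate elimination by partitioning, then per-partition hashing."""
    # Phase 1: partition on a hash of the projected fields. Equal tuples
    # hash equally, so all copies of a tuple land in the same partition.
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[hash(row) % num_partitions].append(row)

    # Phase 2: handle one partition at a time with an in-memory hash set.
    result = []
    for part in partitions:
        seen = set()
        for row in part:
            if row not in seen:
                seen.add(row)
                result.append(row)
    return result


# (sid, bid) pairs as they might come out of projecting Reserves.
projected = [(1, 101), (2, 102), (1, 101), (3, 101), (2, 102), (1, 103)]
distinct = hash_distinct(projected)
```

This also answers the slide's question: duplicates can never end up in different partitions, so removing them within each partition removes them globally.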

Highlights of System R Optimizer
- Impact:
  - Most widely used approach; works well for < 10 joins.
- Cost estimation: Approximate cost at best.
  - Statistics, maintained in system catalogs, used to estimate cost of operations and result sizes.
  - Considers combination of CPU and I/O costs.
- Plan Space: Too large, so must be pruned.
  - Only the space of left-deep plans is considered.
    - Left-deep plans allow the output of each operator to be pipelined into the next operator without storing it in a temporary relation.
  - Cartesian products always avoided.

Statistics and Catalogs
- Need information about the relations and indexes involved. Catalogs typically contain at least:
  - # tuples and # pages for each relation.
  - # distinct key values and # pages for each index.
  - Index height, low/high key values for each tree index.
- Catalogs updated periodically.
  - Updating each time data changes is too expensive; lots of approximation anyway, so slight inconsistencies are okay.
- More detailed information (e.g., histograms of the values in some field) is sometimes stored.
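To give a feel for the pruned plan space, here is a rough sketch that enumerates left-deep join orders and discards any order that would force a Cartesian product. It is not the System R algorithm itself (which uses dynamic programming and cost estimates); the function name, the brute-force enumeration, and the three-relation example with a Boats relation are assumptions for illustration.

```python
from itertools import permutations


def left_deep_orders(relations, join_edges):
    """Yield join orders in which each newly added relation shares a join
    predicate with some relation already in the left-deep prefix."""
    edges = {frozenset(edge) for edge in join_edges}
    for order in permutations(relations):
        joined = {order[0]}
        valid = True
        for rel in order[1:]:
            if not any(frozenset((rel, prev)) in edges for prev in joined):
                valid = False  # joining rel here would be a Cartesian product
                break
            joined.add(rel)
        if valid:
            yield order


# Hypothetical query joining Reserves with both Sailors and Boats.
relations = ["Reserves", "Sailors", "Boats"]
join_edges = [("Reserves", "Sailors"), ("Reserves", "Boats")]
orders = list(left_deep_orders(relations, join_edges))
```

Of the six possible orders, the two that start by combining Sailors and Boats are dropped, because no predicate connects those two relations.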

Cost Estimation
- For each plan considered, must estimate cost:
  - Estimate cost of each operation in plan tree.
    - Depends on input cardinalities.
    - We've already discussed how to estimate the cost of operations (sequential scan, index scan, joins, etc.)
  - Estimate size of result for each operation in tree!
    - Use information about the input relations.
    - For selections and joins, assume independence of predicates.

Size Estimation and Reduction Factors
- Consider a query block:
    SELECT attribute list
    FROM relation list
    WHERE term 1 AND ... AND term k
- Maximum # tuples in result is the product of the cardinalities of the relations in the FROM clause.
- Reduction factor (RF) associated with each term reflects the impact of the term in reducing result size. Result cardinality = Max # tuples * product of all RFs.
  - Implicit assumption that terms are independent!
  - More on this later in the quarter.
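The result-cardinality formula above is easy to try out. The sketch below applies it to a join of Reserves and Sailors with two selections. The reduction factors are assumptions chosen for illustration: 1/100 for bid=100 (100 boats, uniform), 1/2 for rating>5 (10 ratings, uniform), and 1/40,000 for the join term R.sid=S.sid (one over the number of distinct sid values in Sailors, a common textbook rule of thumb that this lecture has not yet stated).

```python
from fractions import Fraction
from math import prod


def estimate_cardinality(relation_sizes, reduction_factors):
    """Result cardinality = max # tuples * product of all RFs,
    assuming the terms are independent."""
    max_tuples = prod(relation_sizes)  # size of the cross product
    return max_tuples * prod(reduction_factors)


relation_sizes = [100_000, 40_000]  # Reserves, Sailors
reduction_factors = [
    Fraction(1, 40_000),  # R.sid = S.sid   (assumed RF)
    Fraction(1, 100),     # R.bid = 100     (assumed RF)
    Fraction(1, 2),       # S.rating > 5    (assumed RF)
]

estimate = estimate_cardinality(relation_sizes, reduction_factors)
```

Fractions are used so the estimate stays exact. If the terms are in fact correlated, the independence assumption can make the estimate badly wrong.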

A Motivating Example
    SELECT S.sname
    FROM Reserves R, Sailors S
    WHERE R.sid=S.sid AND R.bid=100 AND S.rating>5
[Figure: Algebra tree: projection on sname, over selection bid=100 AND rating>5, over the join sid=sid of Reserves and Sailors.]
[Figure: Plan: the same tree, with the join sid=sid of Sailors and Reserves done by Block Nested Loops, and the selection and projection both done on the fly.]
- Cost: 500+500*1000 = 500,500 I/Os
- And not the worst possible plan!
- Misses several opportunities: selections could have been "pushed" earlier, no use is made of any available indexes, etc.
- Goal of optimization: To consider more efficient plans that compute the same logical answer.

Alternative Plan 1 (No Indexes)
[Figure: Plan: projection on sname (on the fly), over a Sort-Merge Join on sid=sid. Left input: selection bid=100 on Reserves (scan; write to temp T1). Right input: selection rating>5 on Sailors (scan; write to temp T2).]
- Main difference: push selects.
- With 5 buffers, cost of plan:
  - Scan Reserves (1000) + write temp T1 (10 pages, if we have 100 boats, uniform distribution), with cost = 1010.
  - Scan Sailors (500) + write temp T2 (250 pages, if 10 ratings), cost = 750.
  - Join by sort T1 (2*2*10 = 40), sort T2 (2*3*250 = 1500), merge (10+250).
  - Total: 3560 page I/Os.
- If we use block NL join, join cost = 10+10*250, total cost = 4270.
- If we push projections, T1 has only sid, T2 only sid and sname:
  - T1 now fits in 3 pages, so block NL cost drops lower, (etc.)
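The I/O totals quoted for the motivating plan and Alternative Plan 1 can be re-derived step by step. The sketch below only restates the slide's arithmetic; the variable names are made up, and the sort pass counts (2 for T1, 3 for T2) are taken from the slide as given rather than recomputed.

```python
SAILORS_PAGES = 500
RESERVES_PAGES = 1000

# Motivating plan: nested loops join with no selections pushed down.
naive_cost = SAILORS_PAGES + SAILORS_PAGES * RESERVES_PAGES

# Alternative Plan 1: push the selections and write the results to temps.
t1_pages = 10    # Reserves tuples with bid=100 (100 boats, uniform)
t2_pages = 250   # Sailors tuples with rating>5 (10 ratings, uniform)
select_reserves = RESERVES_PAGES + t1_pages  # scan + write T1
select_sailors = SAILORS_PAGES + t2_pages    # scan + write T2


def sort_cost(pages, passes):
    """Each pass reads and writes every page once."""
    return 2 * passes * pages


# Sort-merge join, with pass counts as stated on the slide.
smj_cost = (sort_cost(t1_pages, passes=2)
            + sort_cost(t2_pages, passes=3)
            + (t1_pages + t2_pages))          # final merge reads both temps
plan1_smj_total = select_reserves + select_sailors + smj_cost

# Same plan with the nested loops join costed as on the slide.
nl_cost = t1_pages + t1_pages * t2_pages
plan1_nl_total = select_reserves + select_sailors + nl_cost
```

Pushing the selections alone takes the plan from 500,500 I/Os to a few thousand, which is the orders-of-magnitude difference mentioned earlier in the lecture.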

Alternative Plan 2 (Using Indexes)
[Figure: Plan: projection on sname (on the fly), over selection rating>5 (on the fly), over the join sid=sid (Index Nested Loops, with pipelining). Outer input: selection bid=100 on Reserves (use hash index; do not write result to temp). Inner input: Sailors.]
- With clustered index on bid of Reserves, we get 100,000/100 = 1000 tuples on 1000/100 = 10 pages.
- INL join with pipelining (outer not materialized).
  - Projecting out unnecessary fields from outer doesn't help.
- Join column sid is a key for Sailors.
  - At most one matching Sailors tuple, so unclustered index on sid is OK.
- Decision not to push rating>5 before the join is based on the availability of the sid index on Sailors.
- Cost: Selection of Reserves tuples (10 I/Os); then, for each one, must get matching Sailors tuple (1000*1.2); total = 1210 I/Os.

QP Summary
- There are several alternative evaluation algorithms for each relational operator.
- A query is evaluated by converting it to a tree of operators and then evaluating the operators in the tree.
- DBA must understand query optimization to fully understand the performance impact of a given database design (relations, indexes) on a workload (set of queries).
- Two parts to optimizing a query (coming later):
  - Enumerate a set of alternative plans.
    - Must prune search space; typically, left-deep plans only.
  - Estimate the cost of each plan that is considered.
    - Must estimate size of result and cost for each plan node.
    - Key issues: Statistics, indexes, operator implementations.
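The 1210 I/O figure for Alternative Plan 2 follows from the numbers on the slide. The sketch below restates that arithmetic; the variable names are made up, and the 1.2 I/Os per index probe is the slide's figure, represented as an exact fraction to avoid floating-point noise.

```python
from fractions import Fraction

RESERVES_TUPLES = 100_000
RESERVES_TUPLES_PER_PAGE = 100
NUM_BOATS = 100  # uniform distribution of bid values

# Clustered index on bid: matching tuples sit on consecutive pages.
matching_tuples = RESERVES_TUPLES // NUM_BOATS                 # 1000 tuples
selection_cost = matching_tuples // RESERVES_TUPLES_PER_PAGE   # 10 pages

# Index nested loops: one probe of the Sailors sid index per outer tuple.
# sid is a key, so each probe finds at most one matching Sailors tuple.
PROBE_COST = Fraction(12, 10)  # 1.2 I/Os per probe, as given on the slide
join_cost = matching_tuples * PROBE_COST

plan2_total = selection_cost + join_cost
```

Because the outer tuples are pipelined straight into the join, no temporary relation is written, so the selection and the probes are the only I/O charged.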