ECS 165B: Database System Implementa6on Lecture 7

Similar documents
Overview of Query Evaluation. Overview of Query Evaluation

CS330. Query Processing

Overview of Query Evaluation

Evaluation of relational operations

Overview of Implementing Relational Operators and Query Evaluation

Announcement. Reading Material. Overview of Query Evaluation. Overview of Query Evaluation. Overview of Query Evaluation 9/26/17

Principles of Data Management. Lecture #9 (Query Processing Overview)

Overview of Query Evaluation

Database Systems. Announcement. December 13/14, 2006 Lecture #10. Assignment #4 is due next week.

Overview of Query Evaluation. Chapter 12

Query Optimization. Schema for Examples. Motivating Example. Similar to old schema; rname added for variations. Reserves: Sailors:

Evaluation of Relational Operations. Relational Operations

Schema for Examples. Query Optimization. Alternative Plans 1 (No Indexes) Motivating Example. Alternative Plans 2 With Indexes

Evaluation of Relational Operations

Evaluation of Relational Operations

Relational Query Optimization

Implementation of Relational Operations

Query Optimization. Schema for Examples. Motivating Example. Similar to old schema; rname added for variations. Reserves: Sailors:

Implementation of Relational Operations: Other Operations

Relational Query Optimization. Overview of Query Evaluation. SQL Refresher. Yanlei Diao UMass Amherst October 23 & 25, 2007

Query Evaluation! References:! q [RG-3ed] Chapter 12, 13, 14, 15! q [SKS-6ed] Chapter 12, 13!

Evaluation of Relational Operations: Other Techniques

Evaluation of Relational Operations

CAS CS 460/660 Introduction to Database Systems. Query Evaluation II 1.1

Implementation of Relational Operations. Introduction. CS 186, Fall 2002, Lecture 19 R&G - Chapter 12

Implementing Joins 1

Evaluation of Relational Operations: Other Techniques

CompSci 516 Data Intensive Computing Systems

CS 4604: Introduction to Database Management Systems. B. Aditya Prakash Lecture #10: Query Processing

Evaluation of Relational Operations: Other Techniques. Chapter 14 Sayyed Nezhadi

Overview of Query Processing. Evaluation of Relational Operations. Why Sort? Outline. Two-Way External Merge Sort. 2-Way Sort: Requires 3 Buffer Pages

Implementing Relational Operators: Selection, Projection, Join. Database Management Systems, R. Ramakrishnan and J. Gehrke 1

Relational Query Optimization. Highlights of System R Optimizer

Administriva. CS 133: Databases. General Themes. Goals for Today. Fall 2018 Lec 11 10/11 Query Evaluation Prof. Beth Trushkowsky

CS330. Some Logistics. Three Topics. Indexing, Query Processing, and Transactions. Next two homework assignments out today Extra lab session:

R & G Chapter 13. Implementation of single Relational Operations Choices depend on indexes, memory, stats, Joins Blocked nested loops:

Database Applications (15-415)

Query Evaluation Overview, cont.

15-415/615 Faloutsos 1

Principles of Data Management. Lecture #12 (Query Optimization I)

Review. Relational Query Optimization. Query Optimization Overview (cont) Query Optimization Overview. Cost-based Query Sub-System

Evaluation of Relational Operations: Other Techniques

Database Management System

Query Evaluation Overview, cont.

Relational Query Optimization. Overview of Query Evaluation. SQL Refresher. Yanlei Diao UMass Amherst March 8 and 13, 2007

Administrivia. CS 133: Databases. Cost-based Query Sub-System. Goals for Today. Midterm on Thursday 10/18. Assignments

QUERY EXECUTION: How to Implement Relational Operations?

Database Applications (15-415)

Midterm Review CS634. Slides based on Database Management Systems 3 rd ed, Ramakrishnan and Gehrke

Faloutsos 1. Carnegie Mellon Univ. Dept. of Computer Science Database Applications. Outline

Cost-based Query Sub-System. Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications. Last Class.

Database Applications (15-415)

Database Applications (15-415)

Overview of Query Processing

Overview of DB & IR. ICS 624 Spring Asst. Prof. Lipyeow Lim Information & Computer Science Department University of Hawaii at Manoa

Dtb Database Systems. Announcement

Relational Query Optimization

Tree-Structured Indexes

QUERY OPTIMIZATION E Jayant Haritsa Computer Science and Automation Indian Institute of Science. JAN 2014 Slide 1 QUERY OPTIMIZATION

An SQL query is parsed into a collection of query blocks optimize one block at a time. Nested blocks are usually treated as calls to a subroutine

Tree-Structured Indexes ISAM. Range Searches. Comments on ISAM. Example ISAM Tree. Introduction. As for any index, 3 alternatives for data entries k*:

Indexing. Chapter 8, 10, 11. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1

Presenter: Eunice Tan Lam Ziyuan Yan Ming fei. Join Algorithm

External Sorting Implementing Relational Operators

Principles of Data Management. Lecture #5 (Tree-Based Index Structures)

Examples of Physical Query Plan Alternatives. Selected Material from Chapters 12, 14 and 15

Tree-Structured Indexes

ECS 165B: Database System Implementa6on Lecture 14

Evaluation of Relational Operations. SS Chung

CSE 444: Database Internals. Sec2on 4: Query Op2mizer

Tree-Structured Indexes

Announcements. Reading Material. Recap. Today 9/17/17. Storage (contd. from Lecture 6)

Relational Algebra. Note: Slides are posted on the class website, protected by a password written on the board

Introduction to Data Management. Lecture 15 (More About Indexing)

Sorting a file in RAM. External Sorting. Why Sort? 2-Way Sort of N pages Requires Minimum of 3 Buffers Pass 0: Read a page, sort it, write it.

Tree-Structured Indexes

Tree-Structured Indexes

Introduction to Data Management. Lecture 21 (Indexing, cont.)

Tree-Structured Indexes. Chapter 10

Chapter 12: Query Processing

Query and Join Op/miza/on 11/5

Tree-Structured Indexes

Administrivia. Tree-Structured Indexes. Review. Today: B-Tree Indexes. A Note of Caution. Introduction

Tree-Structured Indexes. A Note of Caution. Range Searches ISAM. Example ISAM Tree. Introduction

Final Exam Review. Kathleen Durant CS 3200 Northeastern University Lecture 22

Final Exam Review. Kathleen Durant PhD CS 3200 Northeastern University

Database Management Systems. Chapter 4. Relational Algebra. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1

Administrivia. Relational Query Optimization (this time we really mean it) Review: Query Optimization. Overview: Query Optimization

Introduction to Data Management. Lecture #11 (Relational Algebra)

CSIT5300: Advanced Database Systems

CSE 444: Database Internals. Section 4: Query Optimizer

Introduction. Choice orthogonal to indexing technique used to locate entries K.

Tree-Structured Indexes

Query Evaluation (i)

QUERY OPTIMIZATION [CH 15]

Relational algebra. Iztok Savnik, FAMNIT. IDB, Algebra

Relational Algebra. Chapter 4, Part A. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1

Query Processing. Debapriyo Majumdar Indian Sta4s4cal Ins4tute Kolkata DBMS PGDBA 2016

Advances in Data Management Query Processing and Query Optimisation A.Poulovassilis

Introduction to Data Management. Lecture #14 (Relational Languages IV)

Transcription:

ECS 165B: Database System Implementa6on Lecture 7 UC Davis April 12, 2010 Acknowledgements: por6ons based on slides by Raghu Ramakrishnan and Johannes Gehrke.

Class Agenda Last 6me: Dynamic aspects of B+ Trees Today: Summary: tree- structured indices Overview of query evalua6on Reading Chapter 12 in Ramakrishan and Gehrke (or Chapter 13 in Silberschatz, Korth, and Sudarshan)

Announcements Expanded set of tests posted: /home/cs165b/davisdb/testrm.cpp Page file manager bugfixes: /home/cs165b/davisdb/pagefilemanager.cpp, h /home/cs165b/davisdb/filehandle.cpp Have you done an svn commit lately?

Summary: Tree- Structured Indices

Summary Tree-structured indexes are ideal for rangesearches, also good for equality searches. ISAM is a static structure. Only leaf pages modified; overflow pages needed. Overflow chains can degrade performance unless size of data set and data distribution stay constant. B+ tree is a dynamic structure. Inserts/deletes leave tree height-balanced; log F N cost. High fanout (F) means depth rarely more than 3 or 4. Almost always better than maintaining a sorted file. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 25

Summary (Contd.) Typically, 67% occupancy on average. Usually preferable to ISAM, modulo locking considerations; adjusts to growth gracefully. If data entries are data records, splits can change rids! Key compression increases fanout, reduces height. Bulk loading can be much faster than repeated inserts for creating a B+ tree on a large data set. Most widely used index in database management systems because of its versatility. One of the most optimized components of a DBMS. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 26

Overview of Query Evalua6on Reading: Chapter 12 of Ramakrishnan and Gehrke (Chapter 13 of Silberschatz et al)

Overview of Query Evaluation Plan: Tree of R.A. ops, with choice of alg for each op. Each operator typically implemented using a `pull interface: when an operator is `pulled for the next output tuples, it `pulls on its inputs and computes them. Two main issues in query optimization: For a given query, what plans are considered? Algorithm to search plan space for cheapest (estimated) plan. How is the cost of a plan estimated? Ideally: Want to find best plan. Practically: Avoid worst plans! We will study the System R approach. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 2

Recall: "Evalua6on Plan" (ECS 165A) π CName (σ Price>5000 ((CUSTOMERS ORDERS) OFFERS)) π CName ((CUSTOMERS ORDERS) (σ Price>5000 (OFFERS))) Representation as evaluation plan (query tree): CName o Price > 5000 CName CUSTOMERS ORDERS OFFERS CUSTOMERS ORDERS OFFERS o Price > 5000

Recall: "Annotated Evalua6on Plan" (ECS 165A) Query: List the name of all customers who have ordered a product that costs more than $5,000. Assume that for both CUSTOMERS and ORDERS an index on CName exists: I 1 (CName, CUSTOMERS), I 2 (CName, ORDERS). CName (sort to remove duplicates) block nested!loop join get tuples for tids of I 2 pipeline index!nested loop join pipeline OFFERS Price > 5000 full table scan I (CName, CUSTOMERS) 1 2 I (CName, ORDERS) ORDERS

Some Common Techniques Algorithms for evaluating relational operators use some simple ideas extensively: Indexing: Can use WHERE conditions to retrieve small set of tuples (selections, joins) Iteration: Sometimes, faster to scan all tuples even if there is an index. (And sometimes, we can scan the data entries in an index instead of the table itself.) Partitioning: By using sorting or hashing, we can partition the input tuples and replace an expensive operation by similar operations on smaller inputs. * Watch for these techniques as we discuss query evaluation! Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 3

Statistics and Catalogs Need information about the relations and indexes involved. Catalogs typically contain at least: # tuples (NTuples) and # pages (NPages) for each relation. # distinct key values (NKeys) and NPages for each index. Index height, low/high key values (Low/High) for each tree index. Catalogs updated periodically. Updating whenever data changes is too expensive; lots of approximation anyway, so slight inconsistency ok. More detailed information (e.g., histograms of the values in some field) are sometimes stored. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 4

Catalogs in DavisDB Two catalog tables: relations and attributes!!relations : rela6on name, tuple length, number of abributes, cardinality,!attributes: rela6on name, abribute name, offset in tuple, abribute type, abribute length, index name, More details when Part 3 is assigned

Access Paths An access path is a method of retrieving tuples: File scan, or index that matches a selection (in the query) A tree index matches (a conjunction of) terms that involve only attributes in a prefix of the search key. E.g., Tree index on <a, b, c> matches the selection a=5 AND b=3, and a=5 AND b>6, but not b=3. A hash index matches (a conjunction of) terms that has a term attribute = value for every attribute in the search key of the index. E.g., Hash index on <a, b, c> matches a=5 AND b=3 AND c=5; but it does not match b=3, or a=5 AND b=3, or a>5 AND b=3 AND c=5. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 5

A Note on Complex Selections (day<8/9/94 AND rname= Paul ) OR bid=5 OR sid=3 Selection conditions are first converted to conjunctive normal form (CNF): (day<8/9/94 OR bid=5 OR sid=3 ) AND (rname= Paul OR bid=5 OR sid=3) We only discuss case with no ORs; see text if you are curious about the general case. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 6

One Approach to Selections Find the most selective access path, retrieve tuples using it, and apply any remaining terms that don t match the index: Most selective access path: An index or file scan that we estimate will require the fewest page I/Os. Terms that match this index reduce the number of tuples retrieved; other terms are used to discard some retrieved tuples, but do not affect number of tuples/pages fetched. Consider day<8/9/94 AND bid=5 AND sid=3. A B+ tree index on day can be used; then, bid=5 and sid=3 must be checked for each retrieved tuple. Similarly, a hash index on <bid, sid> could be used; day<8/9/94 must then be checked. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 7

Using an Index for Selections Cost depends on # qualifying tuples, and clustering. Cost of finding qualifying data entries (typically small) plus cost of retrieving records (could be large w/o clustering). In example, assuming uniform distribution of names, about 10% of tuples qualify (100 pages, 10000 tuples). With a clustered index, cost is little more than 100 I/Os; if unclustered, upto 10000 I/Os! SELECT * FROM WHERE Reserves R R.rname < C% Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 8

Projection SELECT FROM DISTINCT R.sid, R.bid Reserves R The expensive part is removing duplicates. SQL systems don t remove duplicates unless the keyword DISTINCT is specified in a query. Sorting Approach: Sort on <sid, bid> and remove duplicates. (Can optimize this by dropping unwanted information while sorting.) Hashing Approach: Hash on <sid, bid> to create partitions. Load partitions into memory one at a time, build in-memory hash structure, and eliminate duplicates. If there is an index with both R.sid and R.bid in the search key, may be cheaper to sort data entries! Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 9

Join: Index Nested Loops foreach tuple r in R do foreach tuple s in S where ri == sj do add <r, s> to result If there is an index on the join column of one relation (say S), can make it the inner and exploit the index. Cost: M + ( (M*p R ) * cost of finding matching S tuples) For each R tuple, cost of probing S index is about 1.2 for hash index, 2-4 for B+ tree. Cost of then finding S tuples (assuming Alt. (2) or (3) for data entries) depends on clustering. Clustered index: 1 I/O (typical), unclustered: upto 1 I/O per matching S tuple. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 10

Examples of Index Nested Loops Hash-index (Alt. 2) on sid of Sailors (as inner): Scan Reserves: 1000 page I/Os, 100*1000 tuples. For each Reserves tuple: 1.2 I/Os to get data entry in index, plus 1 I/O to get (the exactly one) matching Sailors tuple. Total: 220,000 I/Os. Hash-index (Alt. 2) on sid of Reserves (as inner): Scan Sailors: 500 page I/Os, 80*500 tuples. For each Sailors tuple: 1.2 I/Os to find index page with data entries, plus cost of retrieving matching Reserves tuples. Assuming uniform distribution, 2.5 reservations per sailor (100,000 / 40,000). Cost of retrieving them is 1 or 2.5 I/Os depending on whether the index is clustered. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 11

Join: Sort-Merge (R S) i=j Sort R and S on the join column, then scan them to do a ``merge (on join col.), and output result tuples. Advance scan of R until current R-tuple >= current S tuple, then advance scan of S until current S-tuple >= current R tuple; do this until current R tuple = current S tuple. At this point, all R tuples with same value in Ri (current R group) and all S tuples with same value in Sj (current S group) match; output <r, s> for all pairs of such tuples. Then resume scanning R and S. R is scanned once; each S group is scanned once per matching R tuple. (Multiple scans of an S group are likely to find needed pages in buffer.) Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 12

Example of Sort-Merge Join sid sname rating age 22 dustin 7 45.0 28 yuppy 9 35.0 31 lubber 8 55.5 44 guppy 5 35.0 58 rusty 10 35.0 sid bid day rname 28 103 12/4/96 guppy 28 103 11/3/96 yuppy 31 101 10/10/96 dustin 31 102 10/12/96 lubber 31 101 10/11/96 lubber 58 103 11/12/96 dustin Cost: M log M + N log N + (M+N) The cost of scanning, M+N, could be M*N (very unlikely!) With 35, 100 or 300 buffer pages, both Reserves and Sailors can be sorted in 2 passes; total join cost: 7500. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 13

Highlights of System R Optimizer Impact: Most widely used currently; works well for < 10 joins. Cost estimation: Approximate art at best. Statistics, maintained in system catalogs, used to estimate cost of operations and result sizes. Considers combination of CPU and I/O costs. Plan Space: Too large, must be pruned. Only the space of left-deep plans is considered. Left-deep plans allow output of each operator to be pipelined into the next operator without storing it in a temporary relation. Cartesian products avoided. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 14