Implementing Relational Operators: Selection, Projection, Join. Database Management Systems, R. Ramakrishnan and J. Gehrke 1


Readings
[RG] Sec. 14.1-14.4

Last time
Started discussion on how to implement relational algebra (RA) operators (selection)
Metric: number of I/Os required
  Does not include the cost of writing out results, since that is the same for all implementations

No index
Unsorted file (or sorted on the "wrong" attribute)
  Only option is to scan the whole file linearly
  Cost: # pages in the file
File is sorted on the selection attribute
  Binary search, then retrieve all qualifying tuples
  Cost: log2(# pages in file) + # pages with qualifying tuples
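
The two cost formulas above can be written down as a small sketch. The function names are hypothetical, and the page counts in the usage lines are illustrative assumptions (they happen to match the Reserves example used later in these slides); the logarithm is taken base 2 and rounded up, which is an assumption about how the binary search is counted.

```python
import math

def scan_cost(num_pages: int) -> int:
    """I/Os for a linear scan: every page of the file is read once."""
    return num_pages

def sorted_file_cost(num_pages: int, qualifying_pages: int) -> int:
    """I/Os for binary search on a sorted file, plus reading the qualifying pages."""
    return math.ceil(math.log2(num_pages)) + qualifying_pages

# Illustrative numbers: a 1000-page file in which 100 pages hold qualifying tuples.
linear = scan_cost(1000)
binary = sorted_file_cost(1000, 100)
```

With these numbers the sorted file costs roughly a tenth of the scan, because only the qualifying pages are read after the search.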

Using indexes for selection
Tree index cost:
  Traverse tree from root
    Done only once
    2 to 3 I/Os (trees are short)
  Scan leaf-level pages to find all data entries
  Retrieve actual data records

Cost Components (Tree Index)
Component 1: Traversing the index
Component 2: Traversing the subset of data entries in the index (may be OK to skip if clustered)
Component 3: Fetching the actual data records (alternative 2 or 3)
[Figure: a tree index drawn above the data file, with the relevant component highlighted]

What about hash indexes?
Cost depends on the implementation (linear, extendible, etc.)
  But it is typically small
  Reasonable assumption: 1 or 2 I/Os to find the right bucket
Plus the cost of retrieving actual data records (depends on the number of matching records)

Example
Selection on Reserves, condition is "R.rname < 'C%'"
Assume a uniform distribution of names, so about 10% of the relation should be retrieved
  That is 10K tuples, or 100 pages

Example
Have clustered B+ tree index on rname
  Traverse index to find first leaf page: 1 or 2 I/Os
  Start scan and retrieve tuples: 100 I/Os
Have unclustered B+ tree index on rname
  Worst case, retrieving each tuple requires a separate I/O, so 10K I/Os
  Much cheaper to just scan all 1000 original pages!
Heuristic: a scan is cheaper than an unclustered index if we expect to retrieve more than 5% of tuples
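
The arithmetic in this example can be checked with a rough sketch. The function name and the `traversal_ios` parameter are hypothetical; the slide quotes the retrieval costs without the 1 or 2 traversal I/Os, so the totals below are slightly higher than the slide's round figures.

```python
def selection_costs(num_pages, tuples_per_page, selectivity, traversal_ios=2):
    """Estimate I/Os for one selection under three access paths."""
    matching_tuples = round(num_pages * tuples_per_page * selectivity)
    matching_pages = round(num_pages * selectivity)
    return {
        # Clustered: qualifying tuples sit on consecutive data pages.
        "clustered_index": traversal_ios + matching_pages,
        # Unclustered, worst case: one I/O per qualifying tuple.
        "unclustered_index_worst_case": traversal_ios + matching_tuples,
        # Sequential scan: read every page once.
        "sequential_scan": num_pages,
    }

# Reserves as in the slides: 1000 pages, 100 tuples per page, about 10% selected.
costs = selection_costs(1000, 100, 0.10)
```

Under these assumptions the unclustered index is about ten times worse than simply scanning the file, which is the point of the 5% heuristic.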

Optimizations are possible
Alternative 2 or 3, unclustered index
  Find qualifying data entries from the index
  Sort the rids of the data entries to be retrieved
    Remember: rid = (page ID, slot #)
  Fetch rids in order
Ensures each data page is read from disk just once!
  Although the number of data pages retrieved is still likely to be more than with clustering
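
A minimal sketch of why sorting the rids helps. It assumes a deliberately pessimistic buffer that holds only one data page at a time, and the rid values are made up for illustration; `page_reads` is a hypothetical helper, not part of any real system.

```python
def page_reads(rids):
    """Count data-page reads when fetching rids in the given order,
    assuming only the most recently read page stays in memory."""
    reads = 0
    current_page = None
    for page_id, _slot in rids:
        if page_id != current_page:
            reads += 1
            current_page = page_id
    return reads

# rid = (page ID, slot #), in the order the index happens to return them.
rids = [(3, 1), (1, 0), (3, 0), (2, 5), (1, 7)]

unsorted_reads = page_reads(rids)        # pages 3 and 1 are each read twice
sorted_reads = page_reads(sorted(rids))  # each distinct page is read once
```

Sorting tuples of (page ID, slot #) orders them by page first, so all rids on the same page become adjacent.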

Example
SELECT * FROM Sailor S WHERE S.Age = 25 AND S.Salary > 100K
Have hash index on Age

Evaluation Options
Option 1
  Use available index (on Age) to get a superset of relevant data entries
  Retrieve the tuples corresponding to the set of data entries
  Apply remaining predicates on retrieved tuples
  Return those tuples that satisfy all predicates
Option 2
  Sequential scan! (always available)
  May be better depending on selectivity

And another example...
SELECT * FROM Sailor S WHERE S.Age = 25 AND S.Salary > 100K
Have hash index on Age
Have B+ tree index on Salary

Evaluation Options
Option 1
  Choose the most selective access path (index)
    Could be the index on Age or Salary, depending on the selectivity of the corresponding predicates
  Use this index to get a superset of relevant data entries
  Retrieve the tuples corresponding to the set of data entries
  Apply remaining predicates on retrieved tuples
  Return those tuples that satisfy all predicates

Evaluation Alternatives
Option 2
  Get rids of data records using each index
    Use index on Age and index on Salary
  Intersect the rids
    We'll discuss intersection soon
  Retrieve the tuples corresponding to the rids
Option 3
  Sequential scan!
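
Option 2 can be sketched as follows. The two "indexes" are toy in-memory stand-ins (a dict for the hash index on Age, a sorted list for the B+ tree on Salary), and all names and data values are assumptions made for illustration.

```python
from bisect import bisect_right

# Hypothetical hash index on Age: age -> list of rids (page ID, slot #).
age_index = {25: [(1, 0), (1, 3), (2, 1)], 30: [(2, 0)]}

# Hypothetical B+ tree leaf level on Salary: (salary, rid) pairs sorted by salary.
salary_index = [(80_000, (1, 0)), (120_000, (1, 3)),
                (150_000, (2, 0)), (200_000, (2, 1))]

def rids_age_equals(index, age):
    """Equality lookup in the hash index."""
    return set(index.get(age, []))

def rids_salary_greater(index, threshold):
    """Range scan of the sorted leaf entries, starting just past the threshold."""
    keys = [salary for salary, _ in index]
    start = bisect_right(keys, threshold)
    return {rid for _, rid in index[start:]}

def intersect_rids(age, min_salary):
    """Rids that satisfy both predicates; only these tuples need fetching."""
    matches = (rids_age_equals(age_index, age)
               & rids_salary_greater(salary_index, min_salary))
    return sorted(matches)  # sorted so each data page is fetched once

result = intersect_rids(25, 100_000)
```

Only the rids in `result` are fetched from the data file, so no tuple is retrieved and then discarded.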

More complex selection conditions
When can we use an index?
  When it "matches" at least some of our selection condition
E.g. suppose we have a tree index for R on <sid, bid, day>
  Will help for "sid > 10"
  Will help for "sid > 10 AND bid = 100"
  But not helpful for "bid = 100"

Using indexes for selection
A hash index matches a conjunction of terms if the conjunction has a term of the form attribute = value for every attribute in the search key of the index.
E.g., a hash index on <a, b, c> matches a=5 AND b=3 AND c=5; but it does not match b=3, or a=5 AND b=3, or a>5 AND b=3 AND c=5.

Using indexes for selection
A tree index matches a conjunction of terms that involve only attributes in a prefix of the search key.
E.g., a tree index on <a, b, c> matches the selections a=5 AND b=3, and a=5 AND b>6, but not b=3.
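
The two matching rules can be expressed as a short sketch. Terms are represented as hypothetical (attribute, operator, value) triples, and the function names are assumptions; the tree rule returns the longest prefix of the search key whose attributes all appear in the condition, with an empty prefix meaning no match.

```python
def hash_index_matches(search_key, terms):
    """A hash index matches only if every search-key attribute has an equality term."""
    equality_attrs = {attr for attr, op, _ in terms if op == "="}
    return all(attr in equality_attrs for attr in search_key)

def tree_index_matched_prefix(search_key, terms):
    """Longest prefix of the search key covered by the terms ([] means no match)."""
    term_attrs = {attr for attr, _, _ in terms}
    prefix = []
    for attr in search_key:
        if attr not in term_attrs:
            break
        prefix.append(attr)
    return prefix

key = ("a", "b", "c")
full_equality = [("a", "=", 5), ("b", "=", 3), ("c", "=", 5)]
partial = [("a", "=", 5), ("b", "=", 3)]
range_on_a = [("a", ">", 5), ("b", "=", 3), ("c", "=", 5)]
only_b = [("b", "=", 3)]
```

The same check reproduces the earlier <sid, bid, day> example: a condition on bid alone yields an empty prefix, so the tree index does not help.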

Complex Selections
(day<8/9/94 AND rname='Paul') OR bid=5 OR sid=3
Selection conditions are first converted to conjunctive normal form (CNF):
(day<8/9/94 OR bid=5 OR sid=3) AND (rname='Paul' OR bid=5 OR sid=3)

Complex Selections
Combination of techniques seen so far
  If we have an index for one (or more) of the conjuncts, we can use it to filter tuples and apply the remaining conditions only to those
  Can use the union of retrieved tuples (or RIDs) for "OR"
But really, at some point a sequential scan becomes the best bet

Selection summary
Depends on what is available
  File sorted on selection attribute(s)
  Indexes (clustered or not)
Options:
  Sequential scan (not always totally stupid)
  Binary search
  Use index or combination of indexes

Projection
SELECT S.Name, S.Age FROM Sailors S
SELECT DISTINCT S.Name, S.Age FROM Sailors S

Implementing Projection
In the general case, will need to scan the whole file
  Because we want something from each tuple in the result
The expensive part is eliminating duplicates, if desired

Sorting-based projection
Basic idea:
  Make a pass through the tuples and retain only the desired attributes
  Sort the result of the above
  Go through the resulting file and output only non-duplicates
Fortunately, we know how to do external sorting for good performance!

General External Merge Sort
Make multiple passes to merge runs
  Pass 1: Produce runs of length B(B-1) pages
  Pass 2: Produce runs of length B(B-1)^2 pages
  Pass P: Produce runs of length B(B-1)^P pages
[Figure: B main memory buffers between input and output disks: B-1 input buffers (INPUT 1, INPUT 2, ..., INPUT B-1) and one OUTPUT buffer]

Modifications to External Sorting
Pass 0
  Project out unwanted columns here (so we don't need a separate pass before)
  Still produce runs of length B pages
    Tuples in runs are smaller than input tuples
Merge passes
  Eliminate duplicates during merge
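
The modified sort can be sketched in memory as follows. This is only an illustration under simplifying assumptions: Python lists stand in for disk-resident runs, `run_size` counts tuples rather than pages, and a single merge pass is used, which assumes the number of runs is at most B-1. The function name and sample data are made up.

```python
import heapq

def sort_based_projection(tuples, keep, run_size):
    """Project onto the column positions in `keep`, then sort and drop duplicates."""
    # Pass 0: project out unwanted columns while producing sorted runs.
    runs = []
    for start in range(0, len(tuples), run_size):
        chunk = tuples[start:start + run_size]
        projected = [tuple(t[i] for i in keep) for t in chunk]
        runs.append(sorted(projected))

    # Merge pass: duplicates arrive adjacent to each other, so compare
    # each row with the last one written out.
    result = []
    for row in heapq.merge(*runs):
        if not result or result[-1] != row:
            result.append(row)
    return result

# Hypothetical Sailors tuples: (name, age, rating).
sailors = [("Ann", 25, 7), ("Bob", 30, 3), ("Ann", 25, 9),
           ("Cat", 25, 1), ("Bob", 30, 8)]
distinct = sort_based_projection(sailors, keep=(0, 1), run_size=2)
```

Because the merge emits rows in sorted order, duplicate elimination needs no extra memory beyond the last row written.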

Hashing-based projection
Idea: hash all tuples into buckets based on the attributes we want
  Bucket = partition (subset) of the input
All duplicates will go into the same bucket
Problem: hash table may not fit in memory!

Hashing-based projection
If the whole hash table won't fit into memory, maybe a single bucket will
All (pairs of) duplicates will be in the same bucket
So we can eliminate duplicates bucket by bucket

Projection Based on Hashing
Assume the relation does not fit in memory
First pass
  Divide the relation into partitions
  Eliminate unwanted fields as you go
  No duplicate elimination yet!
[Figure: the original relation is read from disk through one INPUT buffer, hash function h routes each tuple to one of B-1 OUTPUT buffers, and the B-1 partitions are written back to disk; B main memory buffers in total]

Now process buckets one at a time
Use a different hash function h2
If the hash table for a bucket does not fit in memory, can recursively apply the hashing algorithm from the previous phase.
[Figure: partition Ri is read from disk through an input buffer, h2 builds an in-memory hash table for Ri, and the duplicate-free bucket is written out; B main memory buffers in total]
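
Both phases can be sketched together. As before, lists stand in for partitions on disk, and the names are hypothetical. Python's built-in `hash` plays the role of h; the per-partition `set` stands in for the second hash table built with h2 (an assumption made for brevity, since a real system would pick an independent function). Recursive repartitioning of oversized buckets is not shown.

```python
def hash_based_projection(tuples, keep, num_partitions):
    """Project onto `keep` and eliminate duplicates one partition at a time."""
    # Phase 1: project each tuple and route it to a partition using h.
    # No duplicate elimination yet.
    partitions = [[] for _ in range(num_partitions)]
    for t in tuples:
        row = tuple(t[i] for i in keep)
        partitions[hash(row) % num_partitions].append(row)

    # Phase 2: duplicates always land in the same partition, so an
    # in-memory table per partition is enough to remove them.
    result = []
    for partition in partitions:
        seen = set()
        for row in partition:
            if row not in seen:
                seen.add(row)
                result.append(row)
    return result

# Hypothetical Sailors tuples: (name, age, rating).
sailors = [("Ann", 25, 7), ("Bob", 30, 3), ("Ann", 25, 9),
           ("Cat", 25, 1), ("Bob", 30, 8)]
distinct = hash_based_projection(sailors, keep=(0, 1), num_partitions=3)
```

Unlike the sort-based version, the output order depends on the hash function, so the result should be treated as an unordered set of rows.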

Comments on Projection
Advantages of sort-based projection
  Better handling of skew
  Results in sorted order
Hash-based projection lends itself better to parallelizing

Projection
SELECT DISTINCT S.Name, S.Age FROM Sailors S
Have B+ tree index on (Name, Age)

Evaluation Using Covering Index
Simply scan the leaf level of the index structure
  No need to retrieve actual data records
  Index-only scan
Works so long as the index search key includes all the projection attributes
  Extra attributes in the search key are okay
  Best if the projection attributes are a prefix of the search key
    Can eliminate duplicates in a single pass of the index-only scan

Example
SELECT DISTINCT S.Name, S.Age FROM Sailors S
Have hash index on Name
Have B+ tree index on Age
Sailors relation has 100 other attributes!

Evaluation Using RID Joins
Retrieve (SearchKey1, RID) pairs from the first index
Retrieve (SearchKey2, RID) pairs from the second index
Join these based on RID to get (SearchKey1, SearchKey2, RID) triples
  We will discuss joins soon!
Project out the third column to get the desired result
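
A minimal sketch of the RID join, assuming the data entries of both indexes fit in memory. The entry lists are made-up stand-ins for what the Name and Age indexes would return, and the join itself is done with a dictionary on RID, which is one possible choice rather than the method the slides prescribe.

```python
def rid_join(first_entries, second_entries):
    """Join (key, rid) pairs from two indexes on rid, then drop the rid column."""
    # Build a lookup table on rid from the first index's entries.
    by_rid = {rid: key for key, rid in first_entries}

    # Probe with the second index's entries to form (key1, key2, rid) triples.
    triples = [(by_rid[rid], key2, rid)
               for key2, rid in second_entries if rid in by_rid]

    # Project out the rid; a set gives the DISTINCT semantics of the query.
    return {(key1, key2) for key1, key2, _rid in triples}

# Hypothetical data entries: (search key, rid) with rid = (page ID, slot #).
name_entries = [("Ann", (1, 0)), ("Bob", (1, 1)), ("Ann", (2, 0))]
age_entries = [(25, (1, 0)), (30, (1, 1)), (25, (2, 0))]

result = rid_join(name_entries, age_entries)
```

No data record of Sailors is touched, which matters when each record carries 100 other attributes.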

Projection Summary
General case: scan and eliminate duplicates
  Sorting and hashing approaches
Indexes: can use those if available
  Index-only scans
  RID joins

Next up: Joins!
SELECT * FROM Reserves R, Sailors S WHERE R.sid = S.sid
No indices on Sailors or Reserves

Tuple Nested Loop Join
foreach tuple r in R do
  foreach tuple s in S do
    if r.sid == s.sid then add <r, s> to result
R is the outer relation
S is the inner relation
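
The pseudocode translates almost directly into runnable form. In this sketch tuples are plain dictionaries and the relations are small in-memory lists, both assumptions made for illustration; the sample rows are invented.

```python
def tuple_nested_loop_join(outer, inner, key="sid"):
    """Compare every outer tuple with every inner tuple on the join key."""
    result = []
    for r in outer:          # R is the outer relation
        for s in inner:      # S is the inner relation, rescanned per outer tuple
            if r[key] == s[key]:
                result.append((r, s))
    return result

reserves = [{"sid": 1, "bid": 101}, {"sid": 2, "bid": 102}, {"sid": 1, "bid": 103}]
sailors = [{"sid": 1, "sname": "Ann"}, {"sid": 3, "sname": "Cat"}]

joined = tuple_nested_loop_join(reserves, sailors)
```

The inner relation is scanned once per outer tuple, which is what makes this variant so expensive on disk.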

Example relations
Sailors(sid, sname, rating, age)
Reserves(sid, bid, day, rname)
Each tuple of Sailors is 50 bytes long (80 tuples per page), and the relation has 500 pages
Each tuple of Reserves is 40 bytes long (100 tuples per page), and the relation has 1000 pages

Analysis
Assume
  M pages in R, p_R tuples per page
    M = 1000, p_R = 100
  N pages in S, p_S tuples per page
    N = 500, p_S = 80
Total cost = M + p_R * M * N
  Ignore the cost of writing out the result
    Same for all join methods
This is 50,001,000 I/Os in our example!

Page Nested Loop Join
foreach page p1 in R do
  foreach page p2 in S do
    foreach r in p1 do
      foreach s in p2 do
        if r.sid == s.sid then add <r, s> to result
R is the outer relation
S is the inner relation
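
A runnable sketch of the page-oriented version, with a counter that mimics the I/O accounting. Pages are modelled as lists of dictionaries and a "read" is just a counter increment; both are assumptions for illustration, as is the tiny sample data.

```python
def page_nested_loop_join(outer_pages, inner_pages, key="sid"):
    """Join page by page and count how many page reads a disk-based run would need."""
    result = []
    page_reads = 0
    for p1 in outer_pages:
        page_reads += 1              # read one page of the outer relation
        for p2 in inner_pages:
            page_reads += 1          # rescan the inner relation page by page
            for r in p1:
                for s in p2:
                    if r[key] == s[key]:
                        result.append((r, s))
    return result, page_reads

# Hypothetical data: Reserves on 3 pages, Sailors on 2 pages.
reserves_pages = [[{"sid": 1}, {"sid": 2}], [{"sid": 3}, {"sid": 1}], [{"sid": 4}]]
sailors_pages = [[{"sid": 1}, {"sid": 2}], [{"sid": 3}]]

joined, reads = page_nested_loop_join(reserves_pages, sailors_pages)
```

The read count comes out as M + M * N for M outer pages and N inner pages, since the inner relation is rescanned once per outer page rather than once per outer tuple.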

Analysis
Assume
  M pages in R, p_R tuples per page
    M = 1000, p_R = 100
  N pages in S, p_S tuples per page
    N = 500, p_S = 80
Total cost = M + M * N
  Which is 501,000 I/Os
Note: the smaller relation should be the outer
  Better for S to be the outer in this case!
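
The cost formulas for both nested loop variants can be evaluated for either choice of outer relation. The function names are hypothetical; the page and tuple counts are the ones given for Reserves (R) and Sailors (S) in the slides.

```python
def tuple_nlj_cost(outer_pages, outer_tuples_per_page, inner_pages):
    """Scan the outer once, and scan the whole inner once per outer tuple."""
    return outer_pages + outer_tuples_per_page * outer_pages * inner_pages

def page_nlj_cost(outer_pages, inner_pages):
    """Scan the outer once, and scan the whole inner once per outer page."""
    return outer_pages + outer_pages * inner_pages

M, P_R = 1000, 100   # Reserves: pages, tuples per page
N, P_S = 500, 80     # Sailors: pages, tuples per page

tuple_r_outer = tuple_nlj_cost(M, P_R, N)   # R outer, S inner
tuple_s_outer = tuple_nlj_cost(N, P_S, M)   # S outer, R inner
page_r_outer = page_nlj_cost(M, N)
page_s_outer = page_nlj_cost(N, M)
```

For the page-oriented join the product term M * N is the same either way, so swapping the relations only changes the single scan of the outer: 500 pages instead of 1000.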