Implementing Joins 1

Similar documents
Implementation of Relational Operations

Evaluation of Relational Operations. Relational Operations

Evaluation of Relational Operations

Evaluation of Relational Operations

Dtb Database Systems. Announcement

Evaluation of relational operations

CAS CS 460/660 Introduction to Database Systems. Query Evaluation II 1.1

CompSci 516 Data Intensive Computing Systems

Evaluation of Relational Operations

Implementation of Relational Operations. Introduction. CS 186, Fall 2002, Lecture 19 R&G - Chapter 12

Evaluation of Relational Operations: Other Techniques. Chapter 14 Sayyed Nezhadi

Evaluation of Relational Operations. SS Chung

Presenter: Eunice Tan Lam Ziyuan Yan Ming fei. Join Algorithm

Announcement. Reading Material. Overview of Query Evaluation. Overview of Query Evaluation. Overview of Query Evaluation 9/26/17

Overview of Query Evaluation. Overview of Query Evaluation

CS330. Query Processing

Database Applications (15-415)

Administriva. CS 133: Databases. General Themes. Goals for Today. Fall 2018 Lec 11 10/11 Query Evaluation Prof. Beth Trushkowsky

Overview of Query Processing. Evaluation of Relational Operations. Why Sort? Outline. Two-Way External Merge Sort. 2-Way Sort: Requires 3 Buffer Pages

Overview of Query Evaluation

Implementing Relational Operators: Selection, Projection, Join. Database Management Systems, R. Ramakrishnan and J. Gehrke 1

Database Management System

Database Applications (15-415)

ECS 165B: Database System Implementa6on Lecture 7

CS 4604: Introduction to Database Management Systems. B. Aditya Prakash Lecture #10: Query Processing

15-415/615 Faloutsos 1

Query and Join Op/miza/on 11/5

Sorting a file in RAM. External Sorting. Why Sort? 2-Way Sort of N pages Requires Minimum of 3 Buffers Pass 0: Read a page, sort it, write it.

Database Systems. Announcement. December 13/14, 2006 Lecture #10. Assignment #4 is due next week.

Overview of Implementing Relational Operators and Query Evaluation

Cost-based Query Sub-System. Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications. Last Class.

Midterm Review CS634. Slides based on Database Management Systems 3 rd ed, Ramakrishnan and Gehrke

Faloutsos 1. Carnegie Mellon Univ. Dept. of Computer Science Database Applications. Outline

QUERY EXECUTION: How to Implement Relational Operations?

Examples of Physical Query Plan Alternatives. Selected Material from Chapters 12, 14 and 15

Evaluation of Relational Operations: Other Techniques

Database Applications (15-415)

CompSci 516 Data Intensive Computing Systems. Lecture 11. Query Optimization. Instructor: Sudeepa Roy

Implementation of Relational Operations: Other Operations

External Sorting Implementing Relational Operators

Evaluation of Relational Operations: Other Techniques

Database Management System. Relational Algebra and operations

R & G Chapter 13. Implementation of single Relational Operations Choices depend on indexes, memory, stats, Joins Blocked nested loops:

Overview of Query Processing

Evaluation of Relational Operations: Other Techniques

Spring 2017 QUERY PROCESSING [JOINS, SET OPERATIONS, AND AGGREGATES] 2/19/17 CS 564: Database Management Systems; (c) Jignesh M.

Overview of DB & IR. ICS 624 Spring Asst. Prof. Lipyeow Lim Information & Computer Science Department University of Hawaii at Manoa

CSE 444: Database Internals. Section 4: Query Optimizer

Query Evaluation Overview, cont.

Database Applications (15-415)

Query Evaluation Overview, cont.

CS330. Some Logistics. Three Topics. Indexing, Query Processing, and Transactions. Next two homework assignments out today Extra lab session:

Operator Implementation Wrap-Up Query Optimization

Relational Algebra. Chapter 4, Part A. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1

Relational Algebra. Note: Slides are posted on the class website, protected by a password written on the board

Database Management Systems. Chapter 4. Relational Algebra. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1

Relational Algebra. [R&G] Chapter 4, Part A CS4320 1

Overview of Query Evaluation. Chapter 12

Relational Algebra 1

Relational Query Optimization

Final Exam Review. Kathleen Durant CS 3200 Northeastern University Lecture 22

CSE 444: Database Internals. Sec2on 4: Query Op2mizer

Lecture 15: The Details of Joins

Final Exam Review. Kathleen Durant PhD CS 3200 Northeastern University

Lecture 14. Lecture 14: Joins!

Locality. Christoph Koch. School of Computer & Communication Sciences, EPFL

Principles of Data Management. Lecture #9 (Query Processing Overview)

Relational Algebra 1

Query Optimization. Schema for Examples. Motivating Example. Similar to old schema; rname added for variations. Reserves: Sailors:

Review. Relational Query Optimization. Query Optimization Overview (cont) Query Optimization Overview. Cost-based Query Sub-System

Question 1. Part (a) [2 marks] what are different types of data independence? explain each in one line. Part (b) CSC 443 Term Test Solutions Fall 2017

Query Optimization. Schema for Examples. Motivating Example. Similar to old schema; rname added for variations. Reserves: Sailors:

SQL Part 2. Kathleen Durant PhD Northeastern University CS3200 Lesson 6

Triggers. PL/SQL reminder. Example from last week. PL/SQL reminder- cont. Triggers- Trigger introduction cont. introduction

Course No: 4411 Database Management Systems Fall 2008 Midterm exam

Schema for Examples. Query Optimization. Alternative Plans 1 (No Indexes) Motivating Example. Alternative Plans 2 With Indexes

Administrivia. Relational Query Optimization (this time we really mean it) Review: Query Optimization. Overview: Query Optimization

University of Waterloo Midterm Examination Sample Solution

CSIT5300: Advanced Database Systems

Introduction to Data Management. Lecture #11 (Relational Algebra)

This lecture. Step 1

Relational Query Optimization. Highlights of System R Optimizer

Relational algebra. Iztok Savnik, FAMNIT. IDB, Algebra

Join Algorithms. Lecture #12. Andy Pavlo Computer Science Carnegie Mellon Univ. Database Systems / Fall 2018

Administrivia. CS 133: Databases. Cost-based Query Sub-System. Goals for Today. Midterm on Thursday 10/18. Assignments

University of Waterloo Midterm Examination Solution

Relational Query Optimization. Overview of Query Evaluation. SQL Refresher. Yanlei Diao UMass Amherst October 23 & 25, 2007

CS 4604: Introduc0on to Database Management Systems. B. Aditya Prakash Lecture #11: Query Processing and Midterm Review

Database Applications (15-415)

TotalCost = 3 (1, , 000) = 6, 000

Database Systems. Course Administration. 10/13/2010 Lecture #4

Relational Query Languages

CSIT5300: Advanced Database Systems

Relational Algebra and Calculus. Relational Query Languages. Formal Relational Query Languages. Yanlei Diao UMass Amherst Feb 1, 2007

Query Processing. Introduction to Databases CompSci 316 Fall 2017

CAS CS 460/660 Introduction to Database Systems. Relational Algebra 1.1

Assignment 6 Solutions

Query Processing and Query Optimization. Prof Monika Shah

RELATIONAL OPERATORS #1

Administriva# CS#133:#Databases# Logical#Plan#to#Physical#Plan# Goals#for#Today# Spring#2017# Lec#11# #2/21# Query#Evalua?on# Prof.

Transcription:

Implementing Joins 1

Last Time Selection Scan, binary search, indexes Projection Duplicate elimination: sorting, hashing Index-only scans Joins 2

Tuple Nested Loop Join foreach tuple r in R do foreach tuple s in S do if r.sid == s.sid then add <r, s> to result R is outer relation S is inner relation 3

Page Nested Loop Join foreach page p1 in R do foreach page p2 in S do foreach r in p1 do foreach s in p2 do if r.sid == s.sid then add <r, s> to result R is outer relation S is inner relation 4

Analysis Assume M pages in R, p R tuples per page M = 1000, p R = 100 N pages in S, p S tuples per page Select N = 500, p S = 80 Total cost = M + M * N Which is 501,000 I/O s Note: Smaller relation should be outer Better for S to be outer in this case! 5

Block Nested Loop Join Notice: previous algorithms do not use all the available buffer pages We could keep all (or at least some) of the outer relation in memory while we scan the inner one 6

Block Nested Loops Join Use one page as input buffer for scanning S, one page as output buffer, and all remaining pages to hold ``block of R. For each matching tuple r in R-block, s in S-page, add <r, s> to result. Then read next R-block, scan S, etc. R & S... Block of R... Join Result... Input buffer for S Output buffer 7

Examples of Block Nested Loops Cost: Scan of outer + #outer blocks * scan of inner # of pages of outer / blocksize #outer blocks = With Reserves (R) as outer, and 100 page blocks: Cost of scanning R is 1000 I/Os; a total of 10 blocks. Per block of R, we scan Sailors (S); 10*500 I/Os. 8

Examples of Block Nested Loops With 100-page block of Sailors as outer: Cost of scanning S is 500 I/Os; a total of 5 blocks. Per block of S, we scan Reserves; 5*1000 I/Os. With sequential reads considered, analysis changes: may be best to divide buffers evenly between R and S. 9

Index Nested Loops Join foreach tuple r in R do foreach tuple s in S where r i == s j do add <r, s> to result Suppose we have an index on S, on the join attribute No need to scan all of S just use index to retrieve tuples that match this r This will probably be faster, especially if there are few matching tuples and the index is clustered 10

Index Nested Loops Join foreach tuple r in R do foreach tuple s in S where r i == s j do add <r, s> to result Cost: M + ( (M*p R ) * cost of finding matching S tuples) For each R tuple, cost of probing S index is about 1.2 for hash index, 2-4 for B+ tree. Cost of then finding S tuples (assuming Alt. (2) or (3) for data entries) depends on clustering. Clustered index: 1 I/O (typical), unclustered: up to 1 I/O per matching S tuple. 11

Examples of Index Nested Loops Join sailors and reserves on sid Hash index (Alt. 2) on sid of Sailors (as inner): Scan Reserves: 1000 page I/Os, 100*1000 tuples. For each Reserves tuple: 1.2 I/Os to get data entry in index, plus 1 I/O to get (the exactly one) matching Sailors tuple. Total: 221,000 I/Os. 12

Examples of Index Nested Loops Hash index (Alt. 2) on sid of Reserves (as inner): Scan Sailors: 500 page I/Os, 80*500 tuples. For each Sailors tuple: 1.2 I/Os to find index page with data entries, plus cost of retrieving matching Reserves tuples. Assuming uniform distribution, 2.5 reservations per sailor (100,000 / 40,000). Cost of retrieving them is 1 or 2.5 I/Os depending on whether the index is clustered. Final numbers: 88,500 (clustered), 148,500 (unclustered) Still not as good as the numbers we saw for BNLJ, but much better than TNLJ. 13

Sort-merge join motivation sid sname rating age 22 dustin 7 45.0 28 yuppy 9 35.0 31 lubber 8 55.5 44 guppy 5 35.0 58 rusty 10 35.0 sid bid day rname 28 103 12/4/96 guppy 28 103 11/3/96 yuppy 31 101 10/10/96 dustin 31 102 10/12/96 lubber 31 101 10/11/96 lubber 58 103 11/12/96 dustin 14

Sort-Merge Join Sort R and S on the join column, then scan them to do a ``merge (on join col.), and output result tuples. 15

Sort-Merge Join on Ri = Sj Scan R and S until current R tuple = current S tuple. At this point, all R tuples with same value in Ri (current R group) and all S tuples with same value in Sj (current S group) match; output <r, s> for all pairs of such tuples. Then resume scanning R and S. R is scanned once; each S group is scanned once per matching R tuple. If the groups are huge the whole S-group may not fit in memory Although this is rare in practice 16

Cost of Sort-Merge Join sid sname rating age 22 dustin 7 45.0 28 yuppy 9 35.0 31 lubber 8 55.5 44 guppy 5 35.0 58 rusty 10 35.0 sid bid day rname 28 103 12/4/96 guppy 28 103 11/3/96 yuppy 31 101 10/10/96 dustin 31 102 10/12/96 lubber 31 101 10/11/96 lubber 58 103 11/12/96 dustin Cost: (cost to sort R) + (cost to sort S) + (scan cost) M < scan cost <= M+ M*N (typically close to M+N) With 35, 100 or 300 buffer pages, both Reserves and Sailors can be sorted in 2 passes; total join cost: 7500. (BNL cost: 2500 to 15000 I/Os) 17

Refinement of Sort-Merge Join We can combine the merging phases in the sorting of R and S with the merging required for the join. Let L be # pages of larger relation, S of smaller relation After pass 0, have L/B runs of larger, S/B runs of smaller Can merge them in one pass if total # runs <= B-1 18

Refinement of sort-merge join Want I.e. As a rule of thumb, need 19

SMJ cost with refinement 3 (L + S) Pass 0 for each relation contributes 2L and 2S Merge pass is L + S (typical case) In our example, 4500 I/Os In practice SMJ has linear cost (in I/Os) 20

Hash Join Similar idea to hashing for projection Hash both relations on join attribute Matching tuples fall into same partition Then can compute join on partitions one at a time 21

Hash Join Partition both relations using hash fn h: R tuples in partition i will only match S tuples in Original Relation... INPUT OUTPUT partition i. Disk B main memory buffers Disk 1 2 hash function h B-1 Partitions 1 2 B-1 22

Hash Join Read in a partition of R, hash it using h2 (<> h!). Scan matching partition of S, search for matches. Partitions of R & S Disk hash fn h2 Hash table for partition Ri h2 Input buffer for Si Output buffer B main memory buffers Join Result Disk 23

Observations on Hash Join One or more R partitions may not fit in memory. If so, can apply hash join technique recursively: Take partition of R and corresponding partition of S Use different hash function to split each of them up into subpartitions Join the subpartitions one at a time 24

Observations on Hash Join How much memory do we need so we don't have to recurse? #partitions k <= B-1, and B-2 >= size of largest partition to be held in memory. Assuming uniformly sized partitions, letting S = size of smaller relation, and maximizing k, we get: k= B-1, and S/(B-1) <= B-2, i.e., B must be (approx) > S Contrast with sort-merge join, where "enough buffer pages" was based on size of the larger relation 25

Cost of Hash Join In partitioning phase, read+write both relns; 2(M+N). In matching phase, read both relns; M+N I/Os. In our running example, this is a total of 4500 I/Os. 26

Sort-Merge Join vs Hash Join Given a minimum amount of memory both have a cost of 3(M+N) I/Os. Hash Join superior on this count if relation sizes differ greatly (depends on size of smaller relation vs larger relation for sort-merge) Also, Hash Join is highly parallelizable. Sort-Merge Join less sensitive to data skew; also, result is sorted. 27