Lecture 15: The Details of Joins

Similar documents
Lecture 14. Lecture 14: Joins!

CSC 261/461 Database Systems Lecture 19

Lecture 12. Lecture 12: The IO Model & External Sorting

Lecture 12. Lecture 12: Access Methods

Lecture 12. Lecture 12: The IO Model & External Sorting

Implementing Joins 1

L20: Joins. CS3200 Database design (sp18 s2) 3/29/2018

Implementation of Relational Operations: Other Operations

Query and Join Op/miza/on 11/5

Final Review. CS 145 Final Review. The Best Of Collection (Master Tracks), Vol. 2

Evaluation of Relational Operations: Other Techniques

Database Applications (15-415)

CMPT 354: Database System I. Lecture 7. Basics of Query Optimization

Announcement. Reading Material. Overview of Query Evaluation. Overview of Query Evaluation. Overview of Query Evaluation 9/26/17

Evaluation of Relational Operations: Other Techniques

6.830 Lecture 8 10/2/2017. Lab 2 -- Due Weds. Project meeting sign ups. Recap column stores & paper. Join Processing:

Database Applications (15-415)

Query Processing. Introduction to Databases CompSci 316 Fall 2017

Review. Support for data retrieval at the physical level:

Faloutsos 1. Carnegie Mellon Univ. Dept. of Computer Science Database Applications. Outline

Evaluation of Relational Operations

University of Waterloo Midterm Examination Sample Solution

Advances in Data Management Query Processing and Query Optimisation A.Poulovassilis

RELATIONAL OPERATORS #1

Advanced Database Systems

CS 4604: Introduction to Database Management Systems. B. Aditya Prakash Lecture #10: Query Processing

Query Processing. Debapriyo Majumdar Indian Sta4s4cal Ins4tute Kolkata DBMS PGDBA 2016

Fundamentals of Database Systems

15-415/615 Faloutsos 1

University of Waterloo Midterm Examination Solution

Cost-based Query Sub-System. Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications. Last Class.

Evaluation of Relational Operations: Other Techniques. Chapter 14 Sayyed Nezhadi

Evaluation of relational operations

Datenbanksysteme II: Implementing Joins. Ulf Leser

CAS CS 460/660 Introduction to Database Systems. Query Evaluation II 1.1

Roadmap DB Sys. Design & Impl. Overview. Reference. What is Join Processing? Example #1. Join Processing

Database Applications (15-415)

Evaluation of Relational Operations: Other Techniques

Administriva. CS 133: Databases. General Themes. Goals for Today. Fall 2018 Lec 11 10/11 Query Evaluation Prof. Beth Trushkowsky

Spring 2017 QUERY PROCESSING [JOINS, SET OPERATIONS, AND AGGREGATES] 2/19/17 CS 564: Database Management Systems; (c) Jignesh M.

CompSci 516 Data Intensive Computing Systems

Database System Concepts

Chapter 13: Query Processing

Midterm Review CS634. Slides based on Database Management Systems 3 rd ed, Ramakrishnan and Gehrke

Chapter 12: Query Processing. Chapter 12: Query Processing

TotalCost = 3 (1, , 000) = 6, 000

R has a ordered clustering index file on its tuples: Read index file to get the location of the tuple with the next smallest value

Implementation of Relational Operations

Chapter 12: Query Processing

Evaluation of Relational Operations

Database Applications (15-415)

Lecture Query evaluation. Cost-based transformations. By Marina Barsky Winter 2016, University of Toronto

! A relational algebra expression may have many equivalent. ! Cost is generally measured as total elapsed time for

Chapter 13: Query Processing Basic Steps in Query Processing

Implementing Relational Operators: Selection, Projection, Join. Database Management Systems, R. Ramakrishnan and J. Gehrke 1

Database Applications (15-415)

Query Optimization. Query Optimization. Optimization considerations. Example. Interaction of algorithm choice and tree arrangement.

Query Processing: A Systems View. Announcements (March 1) Physical (execution) plan. CPS 216 Advanced Database Systems

Announcements (March 1) Query Processing: A Systems View. Physical (execution) plan. Announcements (March 3) Physical plan execution

Something to think about. Problems. Purpose. Vocabulary. Query Evaluation Techniques for large DB. Part 1. Fact:

Notes. Some of these slides are based on a slide set provided by Ulf Leser. CS 640 Query Processing Winter / 30. Notes

CSE 544 Principles of Database Management Systems

Chapter 12: Query Processing

CMSC424: Database Design. Instructor: Amol Deshpande

Implementation of Relational Operations. Introduction. CS 186, Fall 2002, Lecture 19 R&G - Chapter 12

R & G Chapter 13. Implementation of single Relational Operations Choices depend on indexes, memory, stats, Joins Blocked nested loops:

Introduction to Database Systems CSE 444, Winter 2011

Datenbanksysteme II: Caching and File Structures. Ulf Leser

Evaluation of Relational Operations. Relational Operations

Sorting & Aggregations

Dtb Database Systems. Announcement

192 Chapter 14. TotalCost=3 (1, , 000) = 6, 000

Advanced Databases: Parallel Databases A.Poulovassilis

CSIT5300: Advanced Database Systems

Database Management System

Parser: SQL parse tree

EECS 647: Introduction to Database Systems

CompSci 516: Database Systems. Lecture 20. Parallel DBMS. Instructor: Sudeepa Roy

Project. Building a Simple Query Optimizer with Performance Evaluation Experiment on Query Rewrite Optimization

CSE 444: Database Internals. Lectures 5-6 Indexing

External Sorting Implementing Relational Operators

CMSC424: Database Design. Instructor: Amol Deshpande

CSE 544, Winter 2009, Final Examination 11 March 2009

Database Systems. Announcement. December 13/14, 2006 Lecture #10. Assignment #4 is due next week.

PARALLEL & DISTRIBUTED DATABASES CS561-SPRING 2012 WPI, MOHAMED ELTABAKH

Today's Class. Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications. Example Database. Query Plan Example

CSE 344 APRIL 20 TH RDBMS INTERNALS

Examples of Physical Query Plan Alternatives. Selected Material from Chapters 12, 14 and 15

Database Applications (15-415)

Advanced Database Systems

Hash-Based Indexing 165

Principles of Data Management. Lecture #9 (Query Processing Overview)

Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications. Administrivia. Administrivia. Faloutsos/Pavlo CMU /615

Data Storage. Query Performance. Index. Data File Types. Introduction to Data Management CSE 414. Introduction to Database Systems CSE 414

Principles of Database Management Systems

Architecture and Implementation of Database Systems Query Processing

Query Optimization. Vishy Poosala Bell Labs

CMPUT 391 Database Management Systems. Query Processing: The Basics. Textbook: Chapter 10. (first edition: Chapter 13) University of Alberta 1

Advanced Databases. Lecture 15- Parallel Databases (continued) Masood Niazi Torshiz Islamic Azad University- Mashhad Branch

INSTITUTO SUPERIOR TÉCNICO Administração e optimização de Bases de Dados

Transcription:

Lecture 15 Lecture 15: The Details of Joins (and bonus!)

Lecture 15 > Section 1 What you will learn about in this section 1. How to choose between BNLJ, SMJ 2. HJ versus SMJ 3. Buffer Manager Detail (PS#3!) 4. ACTIVITY: Buffer Manager DETAIL! 2

Lecture 15 > Section 2 > NLJ Nested Loop Joins 3

Lecture 15 > Section 2 > NLJ Notes We are again considering IO aware algorithms: care about disk IO Given a relation R, let: T(R) = # of tuples in R P(R) = # of pages in R Note also that we omit ceilings in calculations good exercise to put back in!

Lecture 15 > Section 2 > BNLJ Block Nested Loop Join (BNLJ) Compute R S on A: for each B-1 pages pr of R: for page ps of S: for each tuple r in pr: for each tuple s in ps: if r[a] == s[a]: yield (r,s) Cost: P(R) Given B+1 pages of memory 1. Load in B-1 pages of R at a time (leaving 1 page each free for S & output)

Lecture 15 > Section 2 > BNLJ Block Nested Loop Join (BNLJ) Compute R S on A: for each B-1 pages pr of R: for page ps of S: for each tuple r in pr: for each tuple s in ps: if r[a] == s[a]: yield (r,s) Cost: Given B+1 pages of memory P R + P R B 1 P(S) 1. Load in B-1 pages of R at a time (leaving 1 page each free for S & output) 2. For each (B-1)-page segment of R, load each page of S This line is called 1 2 345 times; the loop iterates over the entire relation S 1 2 345 times (ceiling!)

Lecture 15 > Section 2 > BNLJ Block Nested Loop Join (BNLJ) Compute R S on A: for each B-1 pages pr of R: for page ps of S: for each tuple r in pr: for each tuple s in ps: if r[a] == s[a]: yield (r,s) Cost: Given B+1 pages of memory P R + P R B 1 P(S) 1. Load in B-1 pages of R at a time (leaving 1 page each free for S & output) 2. For each (B-1)-page segment of R, load each page of S 3. Check against the join conditions BNLJ can also handle non-equality constraints

Lecture 15 > Section 2 > BNLJ Block Nested Loop Join (BNLJ) Compute R S on A: for each B-1 pages pr of R: for page ps of S: for each tuple r in pr: for each tuple s in ps: if r[a] == s[a]: yield (r,s) Cost: Given B+1 pages of memory P R + 1 2 345 P(S) + OUT 1. Load in B-1 pages of R at a time (leaving 1 page each free for S & output) 2. For each (B-1)-page segment of R, load each page of S 3. Check against the join conditions 4. Write out

Lecture 15 > Section 2 > BNLJ BNLJ: Some quick facts. We use B+1 buffer pages as: 1 page for S 1 page for output B-1 Pages for R P R + 1 2 345 P(S) + OUT If P(R) <= B-1 then we do one pass over S, and we run in time P(R) + P(S) + OUT. Note: This is optimal for our cost model! Thus, if min {P(R), P(S)} <= B-1 we should always use BNLJ We use this at the end of hash join. We define end condition, one of the buckets is smaller than B-1!

Lecture 15 > Section 1 1. Sort-Merge Join (SMJ) 10

Lecture 15 > Section 1 > SMJ Sort Merge Join (SMJ): Basic Procedure To compute R S on A: 1. Sort R, S on A using external merge sort Note that we are only considering equality join conditions here 2. Scan sorted files and merge 3. [May need to backup - if there are many duplicate join keys] Note that if R, S are already sorted on A, SMJ will be awesome!

Lecture 15 > Section 1 > Backup Simple SMJ Optimization Unsorted input relations Given B+1 buffer pages Sort Phase (Ext. Merge Sort) Split & sort R S Split & sort <= B total runs Merge Merge Merge / Join Phase B-Way Merge / Join Joined output file created!

Lecture 15 > Section 1 > Backup Simple SMJ Optimization On this last pass, we only do P(R) + P(S) + OUT IOs to complete the join! If we can initially split R and S into B total runs each of length approx then we only need 3(P(R) + P(S)) + OUT for SMJ! 2 R/W per page to sort runs in memory, 1 R per page to B-way merge / join! How much memory for this to happen? 1 2 61(7) 3 2 B + 1 ~ P R + P S 2B = Thus, max{p R, P S } B 2 is an approximate sufficient condition Given B+1 buffer pages If the larger of R,S has <= B 2 pages, then SMJ costs 3(P(R)+P(S)) + OUT!

Bonus questions. Q1: Fast dog. If max {P(R), P(S)} < B2 then SMJ takes 3(P(R) + P(S)) + OUT What is the similar condition to obtain 5(P(R) + P(S)) + OUT? What is the condition for (2k+1)(P(R) + P(S)) + OUT Q2: BNLJ V. SMJ Under what conditions will BNLJ outperform SMJ? Size of R, S and # of buffer pages Discuss! And We ll put up a google form.

Lecture 15 > Section 2 2. Hash Join (HJ) 15

Lecture 15 > Section 2 > HJ Hash Join: High-level procedure To compute R S on A: 1. Partition Phase: Using one (shared) hash function h B per pass partition R and S into B buckets. Each phase creates B more buckets that are a factor of B smaller. Repeatedly partition with a new hash function Stop when all buckets for one relation are smaller than B-1 (Why?) Each pass takes 2(P(R) + P(S)) 2. Matching Phase: Take pairs of buckets whose tuples have the same values for h, and join these Use BNLJ here for each matching pair. P(R) + P(S) + OUT We decompose the problem using h B, then complete the join

Lecture 15 > Section 2 > HJ Hash Join: High-level procedure R S 1. Partition Phase: Using one (shared) hash function h B, partition R and S into B buckets (0,j) Disk (3,j) (0,j) (5,b) (3,b) h B R 1 R 2 S 1 (3,j) (3,b) (0,j) Disk (0,j) Note our new convention: pages each have two tuples (one per row) S 2 (5,b)

Lecture 15 > Section 2 > HJ Hash Join: High-level procedure 2. Matching Phase: Take pairs of buckets whose tuples have the same values for h B, and join these Disk Disk R S (0,j) (3,j) (0,j) (5,b) (3,b) h B R 1 R 2 S 1 (3,j) (3,b) (0,j) (0,j) Join matching buckets S 2 (5,b)

Lecture 15 > Section 2 > HJ Hash Join: High-level procedure 2. Matching Phase: Take pairs of buckets whose tuples have the same values for h B, and join these Disk Disk R S (0,j) (3,j) (0,j) (5,b) (3,b) h B R 1 R 2 S 1 S 2 (3,j) (3,b) (0,j) (5,b) (0,j) Don t have to join the others! E.g. these!

Bonus questions #2 Q1: Fast little dog. If min {P(R), P(S)} < B2 then HJ takes 3(P(R) + P(S)) + OUT What is the similar condition to obtain 5(P(R) + P(S)) + OUT? What is the condition for (2k+1)(P(R) + P(S)) + OUT Q2: SMJ V. HJ Under what conditions will HJ outperform SMJ? Under what conditions will SMJ outperform SMJ? Size of R, S and # of buffer pages Discuss! And We ll put up a google form.

Lecture 15 > Section 3 > The Cage Match Sort-Merge v. Hash Join Given enough memory, both SMJ and HJ have performance: Enough memory = ~3(P(R)+P(S)) + OUT SMJ: B 2 > max{p(r), P(S)} HJ: B 2 > min{p(r), P(S)} Hash Join superior if relation sizes differ greatly. Why?

Lecture 15 > Section 3 > The Cage Match Further Comparisons of Hash and Sort Joins Hash Joins are highly parallelizable. Sort-Merge less sensitive to data skew and result is sorted

Lecture 15 > Section 3 > The Cage Match Summary I will ask you to compute costs on the final and PS Walk through the algorithms, you ll be able to compute the costs! Memory sizes key in hash versus sort join Hash Join = Little dog (depends on smaller relation) Skew is a major factor (more on PS) Message: The database can compute IO costs, and these are different than a traditional system.

Lecture 15 > Section 3 3. Buffer Manager 24

Lecture 11 > Section 2 > The Buffer The Buffer A buffer is a region of physical memory used to store temporary data In this lecture: a region in main memory used to store intermediate data between disk and processes Main Memory Buffer Key idea: Reading / writing to disk is slowneed to cache data! Disk

Lecture 11 > Section 2 > The Buffer The (Simplified) Buffer In this class: We ll consider a buffer located in main memory that operates over pages and files: Read(page): Read page from disk -> buffer if not already in buffer Main Memory Buffer 1,0,3 Disk

Lecture 11 > Section 2 > The Buffer The (Simplified) Buffer In this class: We ll consider a buffer located in main memory that operates over pages and files: Read(page): Read page from disk -> buffer if not already in buffer Main Memory 2 1,0,30 Buffer Processes can then read from / write to the page in the buffer 1,0,3 Disk

Lecture 11 > Section 2 > The Buffer The (Simplified) Buffer In this class: We ll consider a buffer located in main memory that operates over pages and files: Read(page): Read page from disk -> buffer if not already in buffer Flush(page): Evict page from buffer & write to disk Main Memory 1,2,3 Buffer 1,0,3 Disk

Lecture 11 > Section 2 > The Buffer The (Simplified) Buffer In this class: We ll consider a buffer located in main memory that operates over pages and files: Read(page): Read page from disk -> buffer if not already in buffer Flush(page): Evict page from buffer & write to disk Release(page): Evict page from buffer without writing to disk Main Memory 1,2,3 Buffer 1,0,3 Disk

Lecture 11 > Section 2 > The Buffer Managing Disk: The DBMS Buffer Main Memory Database maintains its own buffer Buffer Why? The OS already does this DB knows more about access patterns. Watch for how this shows up! (cf. Sequential Flooding) Recovery and logging require ability to flush to disk. Disk

Lecture 11 > Section 2 > The Buffer The Buffer Manager A buffer manager handles supporting operations for the buffer: Primarily, handles & executes the replacement policy i.e. finds a page in buffer to flush/release if buffer is full and a new page needs to be read in DBMSs typically implement their own buffer management routines

Activity 15!