Aufgabe 2. Join-Methoden Differential Snapshots. Ulf Leser Wissensmanagement in der Bioinformatik

Similar documents
Datenbanksysteme II: Implementing Joins. Ulf Leser

Query Processing. Introduction to Databases CompSci 316 Fall 2017

Query Processing and Advanced Queries. Query Optimization (4)

Datenbanksysteme II: Finish. Ulf Leser

Datenbanksysteme II: Caching and File Structures. Ulf Leser

CS-245 Database System Principles

Principles of Database Management Systems

Outline. Database Management and Tuning. Outline. Join Strategies Running Example. Index Tuning. Johann Gamper. Unit 6 April 12, 2012

Advanced Database Systems

CSEP 544: Lecture 04. Query Execution. CSEP544 - Fall

University of Waterloo Midterm Examination Solution

R has a ordered clustering index file on its tuples: Read index file to get the location of the tuple with the next smallest value

CS54100: Database Systems

Lecture 14. Lecture 14: Joins!

Chapters 15 and 16b: Query Optimization

Introduction to Database Systems CSE 444, Winter 2011

CSE 544 Principles of Database Management Systems

Announcement. Reading Material. Overview of Query Evaluation. Overview of Query Evaluation. Overview of Query Evaluation 9/26/17

Statistics. Duplicate Elimination

Database Systems CSE 414

EECS 647: Introduction to Database Systems

Hash table example. B+ Tree Index by Example Recall binary trees from CSE 143! Clustered vs Unclustered. Example

Selection Queries. to answer a selection query (ssn=10) needs to traverse a full path.

Query Execution [15]

From SQL-query to result Have a look under the hood

Introduction to Database Systems CSE 414. Lecture 26: More Indexes and Operator Costs

Evaluation of relational operations

Evaluation of Relational Operations

CAS CS 460/660 Introduction to Database Systems. Query Evaluation II 1.1

Implementing Joins 1

Overview of Implementing Relational Operators and Query Evaluation

Notes. Some of these slides are based on a slide set provided by Ulf Leser. CS 640 Query Processing Winter / 30. Notes

Outline. Query Processing Overview Algorithms for basic operations. Query optimization. Sorting Selection Join Projection

Question 1. Part (a) [2 marks] what are different types of data independence? explain each in one line. Part (b) CSC 443 Term Test Solutions Fall 2017

CSE 344 APRIL 27 TH COST ESTIMATION

Faloutsos 1. Carnegie Mellon Univ. Dept. of Computer Science Database Applications. Outline

Data Storage. Query Performance. Index. Data File Types. Introduction to Data Management CSE 414. Introduction to Database Systems CSE 414

Query Processing. Solutions to Practice Exercises Query:

Introduction to Query Processing and Query Optimization Techniques. Copyright 2011 Ramez Elmasri and Shamkant Navathe

COSC 404 Database System Implementation. Query Processing. Dr. Ramon Lawrence University of British Columbia Okanagan

Query Processing with Indexes. Announcements (February 24) Review. CPS 216 Advanced Database Systems

Chapter 13: Query Processing

Implementing Relational Operators: Selection, Projection, Join. Database Management Systems, R. Ramakrishnan and J. Gehrke 1

Query Optimization. Query Optimization. Optimization considerations. Example. Interaction of algorithm choice and tree arrangement.

Chapter 12: Query Processing

Ulrich Stärk

Query Processing: A Systems View. Announcements (March 1) Physical (execution) plan. CPS 216 Advanced Database Systems

Implementation of Relational Operations

Implementation of Relational Operations. Introduction. CS 186, Fall 2002, Lecture 19 R&G - Chapter 12

TotalCost = 3 (1, , 000) = 6, 000

Principles of Data Management. Lecture #9 (Query Processing Overview)

Database Systems. Project 2

! A relational algebra expression may have many equivalent. ! Cost is generally measured as total elapsed time for

Chapter 13: Query Processing Basic Steps in Query Processing

PARALLEL & DISTRIBUTED DATABASES CS561-SPRING 2012 WPI, MOHAMED ELTABAKH

Database System Concepts

Evaluation of Relational Operations. Relational Operations

15-415/615 Faloutsos 1

L20: Joins. CS3200 Database design (sp18 s2) 3/29/2018

Announcements. Two typical kinds of queries. Choosing Index is Not Enough. Cost Parameters. Cost of Reading Data From Disk

Assignment 6 Solutions

CMSC424: Database Design. Instructor: Amol Deshpande

Algorithms for Query Processing and Optimization. 0. Introduction to Query Processing (1)

Evaluation of Relational Operations: Other Techniques

1. SQL definition SQL is a declarative query language 2. Components DRL: Data Retrieval Language DML: Data Manipulation Language DDL: Data Definition

Schema for Examples. Query Optimization. Alternative Plans 1 (No Indexes) Motivating Example. Alternative Plans 2 With Indexes

CSE 344 FEBRUARY 21 ST COST ESTIMATION

Chapter 12: Query Processing. Chapter 12: Query Processing

Data Warehousing (Special Indexing Techniques)

CS 4604: Introduction to Database Management Systems. B. Aditya Prakash Lecture #10: Query Processing

CS330. Query Processing

Chapter 18: Parallel Databases

Chapter 18: Parallel Databases. Chapter 18: Parallel Databases. Parallelism in Databases. Introduction

Something to think about. Problems. Purpose. Vocabulary. Query Evaluation Techniques for large DB. Part 1. Fact:

Overview of Query Evaluation. Overview of Query Evaluation

Module 10: Parallel Query Processing

Fundamentals of Database Systems

Evaluation of Relational Operations: Other Techniques. Chapter 14 Sayyed Nezhadi

Query Optimization. Schema for Examples. Motivating Example. Similar to old schema; rname added for variations. Reserves: Sailors:

Announcements (March 1) Query Processing: A Systems View. Physical (execution) plan. Announcements (March 3) Physical plan execution

RELATIONAL OPERATORS #1

Architecture and Implementation of Database Systems Query Processing

Query Processing. Debapriyo Majumdar Indian Sta4s4cal Ins4tute Kolkata DBMS PGDBA 2016

CS122 Lecture 8 Winter Term,

Course No: 4411 Database Management Systems Fall 2008 Midterm exam

Evaluation of Relational Operations: Other Techniques

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Fall 2007 Lecture 7 - Query execution

Advanced Databases. Lecture 15- Parallel Databases (continued) Masood Niazi Torshiz Islamic Azad University- Mashhad Branch

Search Engines Chapter 2 Architecture Felix Naumann

Query Evaluation! References:! q [RG-3ed] Chapter 12, 13, 14, 15! q [SKS-6ed] Chapter 12, 13!

Chapter 3. Algorithms for Query Processing and Optimization

Parser: SQL parse tree

Name: Problem 1 Consider the following two transactions: T0: read(a); read(b); if (A = 0) then B = B + 1; write(b);

6.830 Lecture 8 10/2/2017. Lab 2 -- Due Weds. Project meeting sign ups. Recap column stores & paper. Join Processing:

Query Evaluation and Optimization

CMSC424: Database Design. Instructor: Amol Deshpande

CMPT 354: Database System I. Lecture 7. Basics of Query Optimization

192 Chapter 14. TotalCost=3 (1, , 000) = 6, 000

Chapter 17: Parallel Databases

Evaluation of Relational Operations

Transcription:

Aufgabe 2 Join-Methoden Differential Snapshots Ulf Leser Wissensmanagement in der Bioinformatik

Join Operator JOIN: Most important relational operator Potentially very expensive Required in all practical queries and applications Often appears in groups of joins Many variations with different characteristics, suited for different situations Example: Relations R (A, B) and S (B, C) SELECT * FROM R, S WHERE R.B = S.B A B C R A B A1 0 A2 1 A3 2 A4 1 S B C 1 C1 2 C2 1 C3 3 C4 1 C5 R S A2 1 C1 A2 1 C3 A2 1 C5 A3 2 C2 A4 1 C1 A4 1 C3 A4 1 C5 Ulf Leser: DWH und DM, Übung, Sommersemester 2007 2

Nested-loop Join Super-naïve FOR EACH r IN R DO FOR EACH s IN S DO IF ( r.b=s.b) THEN OUTPUT (r s) Slight improvement FOR EACH block x IN R DO FOR EACH block y IN S DO FOR EACH r in x DO FOR EACH s in y DO IF ( r.b=s.b) THEN OUTPUT (r s) Cost estimations b(r), b(s) number of blocks in R and in S Each block of outer relation is read once Inner relation is read once for each block of outer relation Inner two loops are free (only main memory operations) Altogether: b(r)+b(r)*b(s) IO S R Ulf Leser: DWH und DM, Übung, Sommersemester 2007 3

Example Assume b(r)=10.000, b(s)=2.000 R as outer relation IO = 10.000 + 10.000*2.000 = 20.010.000 S as outer relation IO = 2.000 + 2.000*10.000 = 20.002.000 Use smaller relation as outer relation For large relation, choice doesn t really matter Can t we do better?? Ulf Leser: DWH und DM, Übung, Sommersemester 2007 4

... There is no m in the formula m: Size of main memory in blocks This should make you suspicious We are not using our available main memory Ulf Leser: DWH und DM, Übung, Sommersemester 2007 5

Blocked nested-loop join Rule of thumb: Use all memory you can get Use all memory the buffer manager allocates to your process Blocked-nested-loop FOR i=1 TO b(r)/(m-1) DO READ NEXT m-1 blocks of R into M FOR EACH block y IN S DO FOR EACH r in R-chunk DO FOR EACH s in y do IF ( r.b=s.b) THEN OUTPUT (r s) Cost estimation Outer relation is read once Inner relation is read once for every chunk of R There are ~b(r)/m chunks IO = b(r) + b(r)*b(s)/m Further advantage: Outer relation is read in chunks sequential IO Ulf Leser: DWH und DM, Übung, Sommersemester 2007 6

Example Example Assume b(r)=10.000, b(s)=2.000, m=500 R as outer relation IO = 10.000 + 10.000*2.000/500 = 50.000 S as outer relation IO = 2.000 + 2.000*10.000/500 = 42.000 Compare to one-block NL: 20.002.000 IO Use smaller relation as outer relation Again, difference irrelevant as tables get larger But sizes of relations do matter If one relation fits into memory (b<m) Total cost: b(r) + b(s) One pass blocked-nested-loop We can do a little better with blocked-nested loop?? Ulf Leser: DWH und DM, Übung, Sommersemester 2007 7

Zig-Zag Join When finishing a chunk of outer relation, hold last block of inner relation in memory Load next chunk of outer relation and compare with the still available last block of inner relation For each chunk, we need to read one block less Thus: Saves b(r)/m IO If R is outer relation Ulf Leser: DWH und DM, Übung, Sommersemester 2007 8

Sort-Merge Join How does it work? Sort both relations on join attribute(s) Merge both sorted relations Caution if duplicates exist The result size still is R * S in worst case If there are r/s tuples with value x in the join attribute in R / S, we need to output r*s tuples for x So what is the worst case?? Example R A B A1 0 A2 1 A4 1 A3 2 S B C 1 C1 1 C3 1 C5 2 C2 3 C4 Ulf Leser: DWH und DM, Übung, Sommersemester 2007 9

Example Continued R A B S B C A1 0 A2 1 A4 1 A3 2 1 C1 1 C3 1 C5 2 C2 3 C4 Partial join {<A1,0>,<A2,1>} S A B C A2 1 C1 A2 1 C3 A2 1 C5 R A B S B C A B C A1 0 A2 1 1 C1 1 C3 Partial join A2 1 C1 A2 1 C3 A4 1 A3 2 1 C5 2 C2 3 C4 {<A1,0>,<A2,1>, <A4,1>} S A2 1 C5 A4 1 C1 A4 1 C3 new A4 1 C5 Ulf Leser: DWH und DM, Übung, Sommersemester 2007 10

Merge Phase r := first (R); s := first (S); WHILE NOT EOR (R) and NOT EOR (S) DO IF r[b] < s[b] THEN r := next (R) ELSEIF r[b] > s[b] THEN s := next (S) ELSE /* r[b] = s[b]*/ b := r[b]; B := ; WHILE NOT EOR(S) and s[b] = b DO B := B {s}; s = next (S); END DO; WHILE NOT EOR(R) and r[b] = b DO FOR EACH e in B DO OUTPUT (r,e); r := next (R); END DO; END DO; Ulf Leser: DWH und DM, Übung, Sommersemester 2007 11

Cost estimation Sorting R costs 2*b(R) * ceil(log m (b(r))) Sorting S costs 2*b(S) * ceil(log m (b(s))) Merge phase reads each relation once Total IO b(r) + b(s) + 2*b(R)*ceil(log m (b(r))) + 2*b(S)*ceil(log m (b(s))) Improvement While sorting, do not perform last read/write phase Open all sorted runs in parallel for merging Saves 2*b(R) + 2*b(S) IO Sometimes, we are able to save the sorting phase If sort is performed somewhere down in the tree Needs to be a sort on the same attribute(s) Merge-sort join: No inner/outer relation Ulf Leser: DWH und DM, Übung, Sommersemester 2007 12

Merge-Join and Main Memory We have no m in the formula of the merge phase Implicitly, it is in the number of runs required More memory doesn t decrease number of blocks to read, but can be used for sequential reads Always fill memory with m/2 blocks from R and m/2 blocks from S Use asynchronous IO 1. Schedule request for m/4 blocks from R and m/4 blocks from S 2. Wait until loaded 3. Schedule request for next m/4 blocks from R and next m/4 blocks from S 4. Do not wait perform merge on first 2 chunks of m/4 blocks 5. Wait until previous request finished 1. We used this waiting time very well 6. Jump to 3, using m/4 chunks of M in turn Ulf Leser: DWH und DM, Übung, Sommersemester 2007 13

Hash Join As always, we may save sorting if good hash function available Assume a very good hash function Distributes hash values almost uniformly over hash table If we have good histograms (later), a simple interval-based hash function will usually work How can we apply hashing to joins?? Ulf Leser: DWH und DM, Übung, Sommersemester 2007 14

Hash Join Idea Use join attributes as hash keys in both R and S Choose hash function for hash table of size m Each bucket has size b(r)/m, b(s)/m Hash phase Scan R, compute hash table, writing full blocks to disk immediately Scan S, compute hash table, writing full blocks to disk immediately Notice: Probably better to use some n<b(r)/m to allow for sequential writes Merge phase Iteratively, load same bucket of R and of S in memory Compute join Ulf Leser: DWH und DM, Übung, Sommersemester 2007 15

Hash Join Cost Hash phase costs 2*b(R)+2*b(S) Merge phase costs b(r) + b(s) Total: 3*(b(R)+b(S)) Under what assumption?? Ulf Leser: DWH und DM, Übung, Sommersemester 2007 16

Hybrid Hash Join Assume that min(b(r),b(s)) < m 2 /2 Notice: During merge phase, we used only (b(r)+b(s))/m memory blocks (size of two buckets), although there might be much more Improvement Chose smaller relation (assume S) Chose a number k of buckets (with k<m) Again, assuming perfect hash functions, each bucket has size b(s)/k When hashing S, keep first x buckets completely in memory, but only one block for each of the (k-x) other buckets These x buckets are never written to disk When hashing R If hash value maps into buckets 1..x, perform join immediately Otherwise, map to the k-x other buckets and write to disk After first round, we have performed the join on x buckets and have k- x buckets of both relations on disk Perform normal merge phase on k-x buckets Ulf Leser: DWH und DM, Übung, Sommersemester 2007 17

Hybrid Hash Join - Cost Total saving (compared to normal hash join) We save 2 IO for every block in either relation that is never written to disk We keep x buckets in memory, having ~ b(s)/k and ~b(r)/k blocks Together, we save 2*x*(b(S)+b(R))/k IO operations How should we choose k and x? Optimal solution x=1 and k as small as possible Build buckets as large as possible, such that still one entire bucket and one block for all other buckets fits into memory Thus, we use as much memory as possible for savings Optimum reached at ~k=b(s)/m Actually, k must be smaller, so that M can accommodate 1 block for each other bucket Together, we save 2*(b(S)+b(R))*m/b(S) Total cost: (3-2m/b(S))*(b(S)+b(R)) Compared to 3*(b(R)+b(S)) for normal hash join Ulf Leser: DWH und DM, Übung, Sommersemester 2007 18

Comparing Hash Join and Sort-Merge Join If enough memory provided, both require approximately the same number of IO 3*(b(R)+b(S)) Hybrid-hash join improves slightly SM generates sorted results sort phase of other joins in query plan can be dropped Advantage propagates up the tree HJ does not need to perform O(n*log(n)) sorting in main memory HJ requires that only one relation is small enough, SM needs two small relations Because both are sorted independently HJ depends on roughly equally sized buckets Otherwise, performance might degrade due to unexpected paging To prevent, estimate k more conservative and do not fill m completely Some memory remains unused Both can be tuned to generate mostly sequential IO Ulf Leser: DWH und DM, Übung, Sommersemester 2007 19

Comparing Join Methods Ulf Leser: DWH und DM, Übung, Sommersemester 2007 20

HashJoin: Schritt 1 Value1 Value2 Value3 1 2 1 2 Value1 Value4 Value8 Value4 3 3 Value5 Value5 Value6 Value7 Value8 4 5 6 4 5 6 Value6 Value9 Value3 Value7 Ulf Leser: DWH und DM, Übung, Sommersemester 2007 21

HashJoin: Schritt 2 1 1 2 2 3 3 4 4 5 5 6 6 Ulf Leser: DWH und DM, Übung, Sommersemester 2007 22

Zu beachten Die erste Hashfunktion sollte die Tupel gleichmäßig auf die s verteilen Es sollten beim ersten Schritt nicht zu viele s entstehen Benötigen jeweils ein Filehandle Datei öffnen & schließen etc ist langsam Bei zu wenigen s enthält aber jedes zu viele Daten für den späteren Vergleich im Hauptspeicher Also: Maximale Größe der s irgendwie abschätzen Bezüglich der benötigten Zeit für die Berechnung lohnt es sich, manches selber zu programmieren Java bietet die Klasse util.hashtable, die aber langsamer ist als eine eigene Implementierung Vor allem auf die IO Routinen muss man achten, da hier die Hauptlast entsteht Ulf Leser: DWH und DM, Übung, Sommersemester 2007 23