External Sorting Implementing Relational Operators

Similar documents
Implementing Relational Operators: Selection, Projection, Join. Database Management Systems, R. Ramakrishnan and J. Gehrke 1

Database Applications (15-415)

Announcement. Reading Material. Overview of Query Evaluation. Overview of Query Evaluation. Overview of Query Evaluation 9/26/17

Database Applications (15-415)

Evaluation of Relational Operations

Examples of Physical Query Plan Alternatives. Selected Material from Chapters 12, 14 and 15

Overview of Implementing Relational Operators and Query Evaluation

Overview of Query Evaluation. Chapter 12

Evaluation of Relational Operations

Evaluation of Relational Operations: Other Techniques. Chapter 14 Sayyed Nezhadi

R & G Chapter 13. Implementation of single Relational Operations Choices depend on indexes, memory, stats, Joins Blocked nested loops:

Faloutsos 1. Carnegie Mellon Univ. Dept. of Computer Science Database Applications. Outline

Administriva. CS 133: Databases. General Themes. Goals for Today. Fall 2018 Lec 11 10/11 Query Evaluation Prof. Beth Trushkowsky

External Sorting. Chapter 13. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1

Overview of Query Evaluation. Overview of Query Evaluation

CS330. Query Processing

CS542. Algorithms on Secondary Storage Sorting Chapter 13. Professor E. Rundensteiner. Worcester Polytechnic Institute

Principles of Data Management. Lecture #9 (Query Processing Overview)

CS 4604: Introduction to Database Management Systems. B. Aditya Prakash Lecture #10: Query Processing

Cost-based Query Sub-System. Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications. Last Class.

15-415/615 Faloutsos 1

Overview of Query Evaluation

Database Systems. Announcement. December 13/14, 2006 Lecture #10. Assignment #4 is due next week.

Spring 2017 EXTERNAL SORTING (CH. 13 IN THE COW BOOK) 2/7/17 CS 564: Database Management Systems; (c) Jignesh M. Patel,

Implementation of Relational Operations

Database Management System

External Sorting. Chapter 13. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1

Dtb Database Systems. Announcement

Evaluation of Relational Operations. Relational Operations

Query Evaluation Overview, cont.

Implementation of Relational Operations. Introduction. CS 186, Fall 2002, Lecture 19 R&G - Chapter 12

Implementation of Relational Operations: Other Operations

CAS CS 460/660 Introduction to Database Systems. Query Evaluation II 1.1

Relational Query Optimization

Evaluation of relational operations

Query Evaluation Overview, cont.

Evaluation of Relational Operations

CompSci 516 Data Intensive Computing Systems

Relational Query Optimization. Highlights of System R Optimizer

CSE 444: Database Internals. Section 4: Query Optimizer

EXTERNAL SORTING. Sorting

Overview of Query Processing

Database Applications (15-415)

Operator Implementation Wrap-Up Query Optimization

Query Processing and Query Optimization. Prof Monika Shah

Announcements. Reading Material. Today. Different File Organizations. Selection of Indexes 9/24/17. CompSci 516: Database Systems

Sorting a file in RAM. External Sorting. Why Sort? 2-Way Sort of N pages Requires Minimum of 3 Buffers Pass 0: Read a page, sort it, write it.

External Sorting. Chapter 13. Comp 521 Files and Databases Fall

Implementing Joins 1

Last Class Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications

CompSci 516: Database Systems

Database Applications (15-415)

Database Applications (15-415)

Overview of Query Processing. Evaluation of Relational Operations. Why Sort? Outline. Two-Way External Merge Sort. 2-Way Sort: Requires 3 Buffer Pages

NOTE: sorting using B-trees to be assigned for reading after we cover B-trees.

Why Is This Important? Overview of Storage and Indexing. Components of a Disk. Data on External Storage. Accessing a Disk Page. Records on a Disk Page

Evaluation of Relational Operations: Other Techniques

Query Optimization. Schema for Examples. Motivating Example. Similar to old schema; rname added for variations. Reserves: Sailors:

Query Evaluation (i)

Evaluation of Relational Operations: Other Techniques

ECS 165B: Database System Implementa6on Lecture 7

Principles of Data Management. Lecture #12 (Query Optimization I)

Query Optimization. Schema for Examples. Motivating Example. Similar to old schema; rname added for variations. Reserves: Sailors:

Carnegie Mellon Univ. Dept. of Computer Science Database Applications. Why Sort? Why Sort? select... order by

Evaluation of Relational Operations: Other Techniques

CompSci 516 Data Intensive Computing Systems. Lecture 11. Query Optimization. Instructor: Sudeepa Roy

CSE 444: Database Internals. Sec2on 4: Query Op2mizer

Overview of DB & IR. ICS 624 Spring Asst. Prof. Lipyeow Lim Information & Computer Science Department University of Hawaii at Manoa

Midterm Review CS634. Slides based on Database Management Systems 3 rd ed, Ramakrishnan and Gehrke

Relational Query Optimization. Overview of Query Evaluation. SQL Refresher. Yanlei Diao UMass Amherst March 8 and 13, 2007

Schema for Examples. Query Optimization. Alternative Plans 1 (No Indexes) Motivating Example. Alternative Plans 2 With Indexes

Hash-Based Indexing 165

Data on External Storage

CSIT5300: Advanced Database Systems

Relational Query Optimization. Overview of Query Evaluation. SQL Refresher. Yanlei Diao UMass Amherst October 23 & 25, 2007

Evaluation of Relational Operations. SS Chung

Review. Relational Query Optimization. Query Optimization Overview (cont) Query Optimization Overview. Cost-based Query Sub-System

Spring 2017 QUERY PROCESSING [JOINS, SET OPERATIONS, AND AGGREGATES] 2/19/17 CS 564: Database Management Systems; (c) Jignesh M.

QUERY EXECUTION: How to Implement Relational Operations?

RAID in Practice, Overview of Indexing

Modern Database Systems Lecture 1

CS330. Some Logistics. Three Topics. Indexing, Query Processing, and Transactions. Next two homework assignments out today Extra lab session:

Administrivia. Relational Query Optimization (this time we really mean it) Review: Query Optimization. Overview: Query Optimization

Overview of Storage and Indexing

Overview of Storage and Indexing

Query Evaluation! References:! q [RG-3ed] Chapter 12, 13, 14, 15! q [SKS-6ed] Chapter 12, 13!

Query Processing: The Basics. External Sorting

Last Class. Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications. Why do we need to sort? Today's Class

Database Applications (15-415)

CSE 444: Database Internals. Lectures 5-6 Indexing

Database Applications (15-415)

Final Exam Review. Kathleen Durant CS 3200 Northeastern University Lecture 22

RELATIONAL OPERATORS #1

Indexing. Chapter 8, 10, 11. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1

QUERY OPTIMIZATION E Jayant Haritsa Computer Science and Automation Indian Institute of Science. JAN 2014 Slide 1 QUERY OPTIMIZATION

TotalCost = 3 (1, , 000) = 6, 000

Overview of Storage and Indexing

IMPORTANT: Circle the last two letters of your class account:

Why Sort? Data requested in sorted order. Sor,ng is first step in bulk loading B+ tree index. e.g., find students in increasing GPA order

Administrivia. CS 133: Databases. Cost-based Query Sub-System. Goals for Today. Midterm on Thursday 10/18. Assignments

Transcription:

External Sorting Implementing Relational Operators 1

Readings [RG] Ch. 13 (sorting) 2

Where we are Working our way up from hardware Disks File abstraction that supports insert/delete/scan Indexing for efficient access 3

External sorting Sorting a large amount of data that does not fit in RAM E.g., sort 1TB of data with 16G of RAM Why sorting? Useful building block in a variety of DB queries Eliminate duplicates in a relation Some join implementations Or may just want data in sorted order Benchmarks and competitions See http://sortbenchmark.org/ 4

Remember merge sort? 5

2-Way External Merge Sort Pass 0: Read a page at a time, sort it using your favorite algorithm, write it Only one buffer page used (could use more if we have more) 1 Page Disk Main memory buffers Disk 6

Two-Way External Merge Sort: Pass 0 3,4 6,2 9,4 8,7 5,6 3,1 2 3,4 2,6 4,9 7,8 5,6 1,3 2 Input file 1-page runs Assume input file with N data pages What is the cost of Pass 0 (in terms of # I/Os)? 7

Definition: A run is a sorted portion of the file Pass 0 generates some number of runs (how many??) 8

2-Way External Merge Sort Now: make more passes to merge runs Pass 1: Merge two runs of length 1 (page) Pass 2: Merge two runs of length 2 (pages) until 1 run of length N Three buffer pages used INPUT 1 INPUT 2 OUTPUT Disk Main memory buffers Disk 9

2-Way External Merge Sort 3,4 6,2 9,4 8,7 5,6 3,1 2 Input file 3,4 2,6 4,9 7,8 5,6 1,3 2 4,6 4,4 6,7 8,9 4,7 8,9 1,3 5,6 2 1,2 3,5 6 1-page runs PASS 1 2-page runs PASS 2 4-page runs PASS 3 1,2 3,4 4,5 6,6 7,8 9 8-page runs 10

2-Way External Merge Sort: Analysis Total I/O cost for sorting file with N pages Cost of Pass 0 = 2N Number of merge passes = Cost of each merge pass = log 2 N 2N 2 log Cost of all merge passes = N 2 N Total cost = ( log N + ) 2 N 1 2 11

General External Merge Sort General case where B buffer pages are available Let's see how the calculations change 12

2-Way External Merge Sort Pass 0: read a page at a time, sort it using your favorite algorithm, write it Only one buffer page used How can this be modified if B buffer pages are available? 1 Page Disk Main memory buffers Disk 13

General External Merge Sort Read B pages at a time, sort B pages in main memory, and write out B pages Length of each run = B pages Assuming N input pages, number of runs = N/B Cost = 2N B Pages Disk Main memory buffers Disk 14

General External Merge Sort: Pass 0 # buffer pages B = 4 3,4 6,2 9,4 8,7 5,6 3,1 9,2 6,1 8,2 3,4 5,5 6,3 Input file 8,9 6,7 6,9 5,6 6,8 5,5 4-page runs 4,4 1,1 3,4 15

2-Way External Merge Sort Merge passes: Make multiple passes to merge runs Pass 1: Merge two runs of length 1 (page) Pass 2: Merge two runs of length 2 (pages) until 1 run of length N Three buffer pages used How can this be modified if B buffer pages available? INPUT 1 INPUT 2 OUTPUT Disk Main memory buffers Disk 16

General External Merge Sort Make multiple passes to merge runs Pass 1: Produce runs of length B(B-1) pages Pass 2: Produce runs of length B(B-1) 2 pages Pass P: Produce runs of length B(B-1) P pages INPUT 1... INPUT 2...... OUTPUT Disk INPUT B-1 B Main memory buffers Disk 17

Definition Merge fan-in number of runs being merged in each merge pass What is the value of this fan-in? 18

General External Merge Sort: Merge # buffer pages B = 4 3,4 6,2 9,4 8,7 5,6 3,1 9,2 6,1 8,2 3,4 5,5 6,3 Input file 8,9 6,7 6,9 5,6 6,8 5,5 4-page runs Pass 0 4,4 1,1 3,4 Pass 1 Main Memory 19

General External Merge Sort: Phase 2 # buffer pages B = 4 3,4 6,2 9,4 8,7 5,6 3,1 9,2 6,1 8,2 3,4 5,5 6,3 Input file 8,9 6,7 6,9 5,6 6,8 5,5 4-page runs Pass 0 4,4 1,1 3,4 1,1 Pass 1 Main Memory 20

General External Merge Sort: Phase 2 # buffer pages B = 4 3,4 6,2 9,4 8,7 5,6 3,1 9,2 6,1 8,2 3,4 5,5 6,3 Input file 8,9 6,7 6,9 5,6 6,8 5,5 4-page runs Pass 0 4,4 1,1 3,4 1,1 Pass 1 Main Memory 21

General External Merge Sort: Phase 2 # buffer pages B = 4 3,4 6,2 9,4 8,7 5,6 3,1 9,2 6,1 8,2 3,4 5,5 6,3 Input file 8,9 6,7 6,9 5,6 6,8 5,5 4-page runs Pass 0 4,4 1,1 3,4 1,1 Pass 1 Main Memory 22

General External Merge Sort: Phase 2 # buffer pages B = 4 3,4 6,2 9,4 8,7 5,6 3,1 9,2 6,1 8,2 3,4 5,5 6,3 Input file 8,9 6,7 6,9 5,6 6,8 5,5 4-page runs Pass 0 4,4 1,1 3,4 1 Pass 1 1 Main Memory 23

General External Merge Sort: Phase 2 # buffer pages B = 4 3,4 6,2 9,4 8,7 5,6 3,1 9,2 6,1 8,2 3,4 5,5 6,3 Input file 8,9 6,7 6,9 5,6 6,8 5,5 4-page runs Pass 0 4,4 1,1 3,4 Pass 1 1,1 Main Memory 24

General External Merge Sort: Phase 2 # buffer pages B = 4 3,4 6,2 9,4 8,7 5,6 3,1 9,2 6,1 8,2 3,4 5,5 6,3 Input file 8,9 6,7 6,9 5,6 6,8 5,5 4-page runs Pass 0 4,4 1,1 3,4 Pass 1 Main Memory 1,1 25

General External Merge Sort: Phase 2 # buffer pages B = 4 3,4 6,2 9,4 8,7 5,6 3,1 9,2 6,1 8,2 3,4 5,5 6,3 Input file 8,9 6,7 6,9 5,6 6,8 5,5 4-page runs Pass 0 4,4 1,1 3,4 Pass 1 Main Memory 1,1 26

General External Merge Sort: Phase 2 # buffer pages B = 4 3,4 6,2 9,4 8,7 5,6 3,1 9,2 6,1 8,2 3,4 5,5 6,3 Input file 8,9 6,7 6,9 5,6 6,8 5,5 4-page runs Pass 0 4,4 1,1 3,4 Pass 1 Main Memory 1,1 27

General External Merge Sort: Phase 2 # buffer pages B = 4 3,4 6,2 9,4 8,7 5,6 3,1 9,2 6,1 8,2 3,4 5,5 6,3 Input file 8,9 6,7 6,9 5,6 6,8 5,5 4-page runs Pass 0 4,4 1,1 3,4 3 Pass 1 2 Main Memory 1,1 28

General External Merge Sort: Phase 2 # buffer pages B = 4 3,4 6,2 9,4 8,7 5,6 3,1 9,2 6,1 8,2 3,4 5,5 6,3 Input file 8,9 6,7 6,9 5,6 6,8 5,5 4-page runs Pass 0 4,4 1,1 3,4 3 3 Pass 1 2,2 Main Memory 1,1 29

General External Merge Sort: Phase 2 # buffer pages B = 4 3,4 6,2 9,4 8,7 5,6 3,1 9,2 6,1 8,2 3,4 5,5 6,3 Input file 8,9 6,7 6,9 5,6 6,8 5,5 4-page runs Pass 0 4,4 1,1 3,4 3 3 Pass 1 2,2 Main Memory 1,1 30

General External Merge Sort: Phase 2 # buffer pages B = 4 3,4 6,2 9,4 8,7 5,6 3,1 9,2 6,1 8,2 3,4 5,5 6,3 Input file 8,9 6,7 6,9 5,6 6,8 5,5 4-page runs Pass 0 4,4 1,1 3,4 3 3 3 Pass 1 2 Main Memory 2,2 1,1 31

General External Merge Sort: Phase 2 # buffer pages B = 4 3,4 6,2 9,4 8,7 5,6 3,1 9,2 6,1 8,2 3,4 5,5 6,3 Input file 8,9 6,7 6,9 5,6 6,8 5,5 4-page runs Pass 0 4,4 1,1 3,4 3 3 Pass 1 Main Memory 2,2 1,1 32

General External Merge Sort: Phase 2 # buffer pages B = 4 3,4 6,2 9,4 8,7 5,6 3,1 9,2 6,1 8,2 3,4 5,5 6,3 Input file 8,9 6,7 6,9 5,6 6,8 5,5 4-page runs Pass 0 4,4 1,1 3,4 3 3 Pass 1 2,2 Main Memory 1,1 33

General External Merge Sort: Phase 2 # buffer pages B = 4 3,4 6,2 9,4 8,7 5,6 3,1 9,2 6,1 8,2 3,4 5,5 6,3 Input file 8,9 6,7 6,9 5,6 6,8 5,5 4-page runs Pass 0 4,4 1,1 3,4 4,4 3 3 Pass 1 2,2 Main Memory 1,1 34

General External Merge Sort: Phase 2 # buffer pages B = 4 3,4 6,2 9,4 8,7 5,6 3,1 9,2 6,1 8,2 3,4 5,5 6,3 Input file 8,9 6,7 6,9 5,6 6,8 5,5 4-page runs Pass 0 4,4 1,1 3,4 4,4 3 3 Pass 1 2,2 Main Memory 1,1 35

General External Merge Sort: Analysis Total I/O cost for sorting file with N pages Cost of Pass 0 = 2N If # merge passes is P then: B(B-1) P = N Therefore P = logb 1 N / B Cost of each merge pass = Cost of all merge passes = 2N 2 1 N logb N / B Total cost = ( log N / B 1) 2N 1 + B 36

Number of Passes of External Sort N B=3 B=5 B=9 B=17 B=129 B=257 100 7 4 3 2 1 1 1,000 10 5 4 3 2 2 10,000 13 7 5 4 2 2 100,000 17 9 6 5 3 3 1,000,000 20 10 7 5 3 3 10,000,000 23 12 8 6 4 3 100,000,000 26 14 9 7 4 4 1,000,000,000 30 15 10 8 5 4 37

External Merge Sort: Optimizations Currently, do one page I/O at a time But can read/write a block of pages sequentially! Make each buffer input/output a block of pages 38

General External Merge Sort: Phase 2 # buffer pages B = 8 3,4 6,2 9,4 8,7 5,6 3,1 9,2 6,1 8,2 3,4 5,5 6,3 Input file 8,9 6,7 6,9 5,6 6,8 5,5 4-page runs Pass 0 4,4 1,1 3,4 4,4 Pass 1 Main Memory 39

General External Merge Sort: Phase 2 # buffer pages B = 8 3,4 6,2 9,4 8,7 5,6 3,1 9,2 6,1 8,2 3,4 5,5 6,3 Input file 8,9 6,7 6,9 5,6 6,8 5,5 4-page runs Pass 0 4,4 1,1 3,4 4,4 1,1 Pass 1 Main Memory 40

General External Merge Sort: Phase 2 # buffer pages B = 8 3,4 6,2 9,4 8,7 5,6 3,1 9,2 6,1 8,2 3,4 5,5 6,3 Input file 8,9 6,7 6,9 5,6 6,8 5,5 4-page runs Pass 0 4,4 1,1 3,4 4,4 1,1 3,4 Pass 1 Main Memory 41

General External Merge Sort: Phase 2 # buffer pages B = 8 3,4 6,2 9,4 8,7 5,6 3,1 9,2 6,1 8,2 3,4 5,5 6,3 Input file 8,9 6,7 6,9 5,6 6,8 5,5 4-page runs Pass 0 4,4 1,1 3,4 4,4 1,1 3,4 Pass 1 Main Memory 42

External Merge Sort: Optimizations Tradeoff: we reduce the merge fan-in So more passes will be needed 43

General Merge Sort: Optimizations Double buffering: to reduce I/O wait time, prefetch into `shadow block. Again, potentially more passes; in practice, most files still sorted in 2-3 passes. INPUT 1 INPUT 1' INPUT 2 INPUT 2' OUTPUT OUTPUT' Disk INPUT k INPUT k' b block size Disk B main memory buffers, k-way merge 44

Using B+ Trees for Sorting Scenario: Table to be sorted has B+ tree index on sorting column(s). Idea: Can retrieve records in order by traversing leaf pages. Is this a good idea? Cases to consider: B+ tree is clustered Good idea! B+ tree is not clustered Could be a very bad idea! 45

Clustered B+ Tree Used for Sorting Cost: root to the leftmost leaf, then retrieve all leaf pages (Alternative 1) If Alternative 2 is used? Additional cost of retrieving data records: each page fetched just once. Index (Directs search) Data Entries Data Records Always better than external sorting! 46

Unclustered B+ Tree Used for Sorting Alternative (2) for data entries; each data entry contains rid of a data record. In general, one I/O per data record! Index (Directs search) Data Entries Data Records 47

Summary External sorting is important External merge sort minimizes disk I/O cost: Pass 0: Produces sorted runs of size B (# buffer pages). Later passes: merge runs. # of runs merged at a time depends on B In practice, # of passes rarely more than 2 or 3. 48

So far Index entries Data entries 49

So far SELECT S.sname FROM Reserves R, Sailors S WHERE R.sid=S.sid AND R.bid=100 AND S.rating>5 RA Tree: sname bid=100 rating > 5 sid=sid Reserves Sailors 50

Now for the "middle piece" How is a query actually evaluated in your DBMS? Two aspects: How to implement each relational operator Several options depending if indexes available How to evaluate the whole plan Maybe optimize plan first using RA equivalences Choose implementation for each operator based on what's best for the overall plan (not just locally) 51

Implementing RA Operators Selection Projection Join Set operations Extended RA: GROUP BY, Aggregates 52

Readings [RG] Sec 14.1-14.3 (implementing select) 53

Example relations Sailors (sid, sname, rating, age) Reserves(sid, bid, day, rname) Each tuple of Sailors 50 bytes long (80 tuples per page) and have 500 pages Each tuple of Reserves 40 bytes long (100 tuples per page) and have 1000 pages 54

Cost model Number of page I/Os incurred by algorithm Ignore things like benefits of sequential access, CPU computation cost, etc. Crude but disk I/O is likely to dominate cost so a good way to compare implementation alternatives 55

Select Operator SELECT * FROM Reserves R WHERE R.rname = 'Alice'; 56

Various cases Case 1: No index on any selection attribute File could be sorted on selection attr. or not Case 2: Have matching index on all selection attributes Case 3: Have matching index on some (but not all) selection attributes e.g. want to select "name = 'Alice' AND rating > 5" but only have index on name 57

No index on any selection attribute If file unsorted, need to scan whole thing 1000 I/Os Plus cost to write out results, but we ignore that aspect Want to compare evaluation costs for different implementations But all of them need to write out result If sorted, can do binary search Logarithmic cost for 1000 pages about 10 I/Os Plus cost to retrieve qualifying tuples (0 to 1000 extra I/Os) 58

Using indexes for selection Start with simple case where only one selection attribute (e.g. "name = 'Alice'") Suppose we have a matching index Should we use it? Probably, but not always! Depends on cost of accessing data through index 59

Using indexes for selection Tree index cost: Traverse tree from root Done only once 2 to 3 I/Os (trees are short) Scan leaf level pages to find all data entries Retrieve actual data records 60

Cost Components Tree Index Index Component 1: Traversing index File 61

Cost Components Tree Index Index Component 2: Traversing subset of data entries in index File 62

Cost Components Tree Index Index Component 3: Fetching actual data records (alternative 2 or 3) File 63

Clustered indexes If index is truly 100% clustered, don't need second component (scan from first leaf) But "clustered" may mean "data is stored close to search key order", not "exactly in search key order" In this case do need to access leaf index pages In your calculations, both are ok but make it very clear which you're choosing and why State your assumptions 64

What about hash indexes? Depends on implementation (linear, extendible etc) But typically small Reasonable assumption: 1 or 2 I/Os to find right bucket Plus cost of retrieving actual data records (depends on number of matching records) 65

Example Selection on Reserves, condition is "R.rname < 'C%'" Assume uniform distribution of names, so about 10% of relation should be retrieved 10K tuples, 100 pages 66

Example Have clustered B+ tree index on rname Traverse index to find first leaf page - 1 or 2 I/Os Start scan and retrieve tuples - 100 I/Os Have unclustered B+ tree index on rname Worst case, retrieving each tuple requires a separate I/O so 10K I/Os Much cheaper to just scan all 1000 original pages! Heuristic: scan cheaper than unclustered index if expect to retrieve more than 5% of tuples 67

Optimizations are possible Alternative 2 or 3, unclustered index Find qualifying data entries from index Sort the rids of the data entries to be retrieved Remember rid = (page ID, slot #) Fetch rids in order Ensures each data page is read from disk just once! Although number of data pages retrieved still likely to be more than with clustering 68