Sorting & Aggregations

Similar documents
Today's Class. Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications. Example Database. Query Plan Example

Last Class Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications

Last Class. Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications. Why do we need to sort? Today's Class

Join Algorithms. Lecture #12. Andy Pavlo Computer Science Carnegie Mellon Univ. Database Systems / Fall 2018

EECS 647: Introduction to Database Systems

CSIT5300: Advanced Database Systems

Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications. Administrivia. Administrivia. Faloutsos/Pavlo CMU /615

RELATIONAL OPERATORS #1

Database Applications (15-415)

Database Applications (15-415)

Carnegie Mellon Univ. Dept. of Computer Science Database Applications. Why Sort? Why Sort? select... order by

Cost-based Query Sub-System. Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications. Last Class.

Database Management System

Chapter 12: Query Processing. Chapter 12: Query Processing

Faloutsos 1. Carnegie Mellon Univ. Dept. of Computer Science Database Applications. Outline

CS542. Algorithms on Secondary Storage Sorting Chapter 13. Professor E. Rundensteiner. Worcester Polytechnic Institute

Advanced Database Systems

Spring 2017 QUERY PROCESSING [JOINS, SET OPERATIONS, AND AGGREGATES] 2/19/17 CS 564: Database Management Systems; (c) Jignesh M.

Implementing Relational Operators: Selection, Projection, Join. Database Management Systems, R. Ramakrishnan and J. Gehrke 1

Evaluation of Relational Operations: Other Techniques

Evaluation of Relational Operations

Evaluation of Relational Operations: Other Techniques

External Sorting Implementing Relational Operators

Database Applications (15-415)

Query Processing. Introduction to Databases CompSci 316 Fall 2017

Goals for Today. CS 133: Databases. Relational Model. Multi-Relation Queries. Reason about the conceptual evaluation of an SQL query

Chapter 12: Query Processing

Dtb Database Systems. Announcement

Evaluation of Relational Operations: Other Techniques

Implementation of Relational Operations: Other Operations

University of Waterloo Midterm Examination Sample Solution

Implementing Joins 1

Query Processing. Debapriyo Majumdar Indian Sta4s4cal Ins4tute Kolkata DBMS PGDBA 2016

Database Applications (15-415)

15-415/615 Faloutsos 1

Chapter 13: Query Processing

Announcement. Reading Material. Overview of Query Evaluation. Overview of Query Evaluation. Overview of Query Evaluation 9/26/17

External Sorting. Chapter 13. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1

! A relational algebra expression may have many equivalent. ! Cost is generally measured as total elapsed time for

Chapter 13: Query Processing Basic Steps in Query Processing

Overview of Query Processing. Evaluation of Relational Operations. Why Sort? Outline. Two-Way External Merge Sort. 2-Way Sort: Requires 3 Buffer Pages

CS 4604: Introduction to Database Management Systems. B. Aditya Prakash Lecture #10: Query Processing

CMPUT 391 Database Management Systems. Query Processing: The Basics. Textbook: Chapter 10. (first edition: Chapter 13) University of Alberta 1

External Sorting. Chapter 13. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1

Database System Concepts

CAS CS 460/660 Introduction to Database Systems. Query Evaluation II 1.1

NOTE: sorting using B-trees to be assigned for reading after we cover B-trees.

CSE 444: Database Internals. Sec2on 4: Query Op2mizer

University of Waterloo Midterm Examination Solution

CSC 261/461 Database Systems Lecture 19

CSE 444: Database Internals. Section 4: Query Optimizer

Tree-Structured Indexes

Review. Support for data retrieval at the physical level:

Implementation of Relational Operations

Modern Database Systems Lecture 1

CompSci 516: Database Systems

EXTERNAL SORTING. Sorting

Course No: 4411 Database Management Systems Fall 2008 Midterm exam

Query Processing & Optimization

Lecture 14. Lecture 14: Joins!

Data Modeling and Databases Ch 10: Query Processing - Algorithms. Gustavo Alonso Systems Group Department of Computer Science ETH Zürich

Data Modeling and Databases Ch 9: Query Processing - Algorithms. Gustavo Alonso Systems Group Department of Computer Science ETH Zürich

Evaluation of Relational Operations: Other Techniques. Chapter 14 Sayyed Nezhadi

Lecture 8 Index (B+-Tree and Hash)

Lecture 12. Lecture 12: Access Methods

Something to think about. Problems. Purpose. Vocabulary. Query Evaluation Techniques for large DB. Part 1. Fact:

External Sorting. Chapter 13. Comp 521 Files and Databases Fall

Database Applications (15-415)

CS 222/122C Fall 2017, Final Exam. Sample solutions

QUERY EXECUTION: How to Implement Relational Operations?

Evaluation of Relational Operations

Question 1. Part (a) [2 marks] what are different types of data independence? explain each in one line. Part (b) CSC 443 Term Test Solutions Fall 2017

R & G Chapter 13. Implementation of single Relational Operations Choices depend on indexes, memory, stats, Joins Blocked nested loops:

CS222P Fall 2017, Final Exam

Hash table example. B+ Tree Index by Example Recall binary trees from CSE 143! Clustered vs Unclustered. Example

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2014/15

CSE 444: Database Internals. Lectures 5-6 Indexing

Overview of Query Evaluation. Chapter 12

Introduction to Database Systems CSE 414. Lecture 26: More Indexes and Operator Costs

Evaluation of Relational Operations. Relational Operations

Query Processing: The Basics. External Sorting

Lecture 11. Lecture 11: External Sorting

CS-245 Database System Principles

Introduction to Data Management CSE 344. Lecture 12: Cost Estimation Relational Calculus

Evaluation of relational operations

Lecture 15: The Details of Joins

Announcements (March 1) Query Processing: A Systems View. Physical (execution) plan. Announcements (March 3) Physical plan execution

ECS 165B: Database System Implementa6on Lecture 7

Database Applications (15-415)

SQL - Data Query language

Final Exam Review. Kathleen Durant CS 3200 Northeastern University Lecture 22

1.1 - Basics of Query Processing in SQL Server

Sorting a file in RAM. External Sorting. Why Sort? 2-Way Sort of N pages Requires Minimum of 3 Buffers Pass 0: Read a page, sort it, write it.

CS 186/286 Spring 2018 Midterm 1

Principles of Data Management. Lecture #9 (Query Processing Overview)

Announcements. Two typical kinds of queries. Choosing Index is Not Enough. Cost Parameters. Cost of Reading Data From Disk

Query Processing. Lecture #10. Andy Pavlo Computer Science Carnegie Mellon Univ. Database Systems / Fall 2018

CMSC424: Database Design. Instructor: Amol Deshpande

What happens. 376a. Database Design. Execution strategy. Query conversion. Next. Two types of techniques

Data Storage. Query Performance. Index. Data File Types. Introduction to Data Management CSE 414. Introduction to Database Systems CSE 414

Transcription:

Sorting & Aggregations Lecture #11 Database Systems /15-645 Fall 2018 AP Andy Pavlo Computer Science Carnegie Mellon Univ.

2 Sorting Algorithms Aggregations TODAY'S AGENDA

3 WHY DO WE NEED TO SORT? Tuples in a table have no specific order But users often want to retrieve tuples in a specific order. Trivial to support duplicate elimination (DISTINCT) Bulk loading sorted tuples into a B+ tree index is faster Aggregations (GROUP BY)

4 SORTING ALGORITHMS If data fits in memory, then we can use a standard sorting algorithm like quick-sort. If data does not fit in memory, then we need to use a technique that is aware of the cost of writing data out to disk.

5 EXTERNAL MERGE SORT Sorting Phase Sort small chunks of data that fit in main-memory, and then write back the sorted data to a file on disk. Merge Phase Combine sorted sub-files into a single larger file.

6 OVERVIEW We will start with a simple example of a 2-way external merge sort. Files are broken up into N pages. The DBMS has a finite number of B fixed-size buffers.

7 2-WAY EXTERNAL MERGE SORT Pass #0 Reads every B pages of the table into memory Sorts them, and writes them back to disk. Each sorted set of pages is called a run. Pass #1,2,3, Recursively merges pairs of runs into runs twice as long Uses three buffer pages (2 for input pages, 1 for output) Memory Disk

7 2-WAY EXTERNAL MERGE SORT Pass #0 Reads every B pages of the table into memory Sorts them, and writes them back to disk. Each sorted set of pages is called a run. Pass #1,2,3, Recursively merges pairs of runs into runs twice as long Uses three buffer pages (2 for input pages, 1 for output) Memory Memory Disk

7 2-WAY EXTERNAL MERGE SORT Pass #0 Reads every B pages of the table into memory Sorts them, and writes them back to disk. Each sorted set of pages is called a run. Pass #1,2,3, Recursively merges pairs of runs into runs twice as long Uses three buffer pages (2 for input pages, 1 for output) Memory Memory Disk

7 2-WAY EXTERNAL MERGE SORT Pass #0 Reads every B pages of the table into memory Sorts them, and writes them back to disk. Each sorted set of pages is called a run. Pass #1,2,3, Recursively merges pairs of runs into runs twice as long Uses three buffer pages (2 for input pages, 1 for output) Memory Memory Memory Disk

7 2-WAY EXTERNAL MERGE SORT Pass #0 Reads every B pages of the table into memory Sorts them, and writes them back to disk. Each sorted set of pages is called a run. Pass #1,2,3, Recursively merges pairs of runs into runs twice as long Uses three buffer pages (2 for input pages, 1 for output) Memory Memory Memory Disk

8 2-WAY EXTERNAL MERGE SORT EOF In each pass, we read and write each page in file. Number of passes = 1 + log 2 N Total I/O cost = 2N (# of passes) 3,4 6,2 9,4 8,7 5,6 3,1 2

8 2-WAY EXTERNAL MERGE SORT EOF In each pass, we read and write each page in file. Number of passes = 1 + log 2 N Total I/O cost = 2N (# of passes) 3,4 6,2 9,4 8,7 5,6 3,1 2 PASS #0 3,4 2,6 4,9 7,8 5,6 1,3 2 1-PAGE RUNS

8 2-WAY EXTERNAL MERGE SORT EOF In each pass, we read and write each page in file. Number of passes = 1 + log 2 N Total I/O cost = 2N (# of passes) PASS #0 PASS #1 3,4 6,2 9,4 8,7 5,6 3,1 2 3,4 2,6 4,9 7,8 5,6 1,3 2 2,3 4,6 4,7 8,9 1,3 5,6 2 1-PAGE RUNS 2-PAGE RUNS

8 2-WAY EXTERNAL MERGE SORT EOF In each pass, we read and write each page in file. Number of passes = 1 + log 2 N Total I/O cost = 2N (# of passes) PASS #0 PASS #1 PASS #2 3,4 6,2 9,4 8,7 5,6 3,1 2 3,4 2,6 4,9 7,8 5,6 1,3 2 2,3 4,6 4,7 8,9 1,3 5,6 2 2,3 4,4 6,7 8,9 1,2 3,5 6 1-PAGE RUNS 2-PAGE RUNS 4-PAGE RUNS

8 2-WAY EXTERNAL MERGE SORT EOF In each pass, we read and write each page in file. Number of passes = 1 + log 2 N Total I/O cost = 2N (# of passes) PASS #0 PASS #1 PASS #2 PASS #3 3,4 3,4 2,6 4,9 7,8 5,6 1,3 2 2,3 4,6 6,2 9,4 8,7 5,6 3,1 2 4,7 8,9 1,3 5,6 2,3 4,4 6,7 8,9 1,2 2,3 3,4 4,5 6,6 7,8 9 1,2 3,5 6 2 1-PAGE RUNS 2-PAGE RUNS 4-PAGE RUNS 8-PAGE RUNS

9 2-WAY EXTERNAL MERGE SORT This algorithm only requires three buffer pages (B=3). Even if we have more buffer space available (B>3), it does not effectively utilize them. Let s next generalize the algorithm to make use of extra buffer space.

10 GENERAL EXTERNAL MERGE SORT Pass #0 Use B buffer pages. Produce N / B sorted runs of size B Pass #1,2,3, Merge B-1 runs (i.e., K-way merge). Number of passes = 1 + log B-1 N / B Total I/O Cost = 2N (# of passes)

10 GENERAL EXTERNAL MERGE SORT Pass #0 Use B buffer pages. Produce N / B sorted runs of size B Pass #1,2,3, Merge B-1 runs (i.e., K-way merge). Number of passes = 1 + log B-1 N / B Total I/O Cost = 2N (# of passes)

10 GENERAL EXTERNAL MERGE SORT Pass #0 Use B buffer pages. Produce N / B sorted runs of size B Pass #1,2,3, Merge B-1 runs (i.e., K-way merge). Number of passes = 1 + log B-1 N / B Total I/O Cost = 2N (# of passes)

12 EXAMPLE Sort 108 page file with 5 buffer pages: N=108, B=5 Pass #0: N / B = 108 / 5 = 22 sorted runs of 5 pages each (last run is only 3 pages) Pass #1: N / B-1 = 22 / 4 = 6 sorted runs of 20 pages each (last run is only 8 pages) Pass #2: N / B-1 = 6 / 4 = 2 sorted runs, 80 pages and 28 pages Pass #3: Sorted file of 108 pages 1+ log B-1 N / B = 1+ log 4 22 = 1+ 2.229... = 4 passes

13 USING B+TREES If the table that must be sorted already has a B+ tree index on the sort attribute(s), then we can use that to accelerate sorting. Retrieve tuples in desired sort order by simply traversing the leaf pages of the tree. Cases to consider: Clustered B+ tree Unclustered B+ tree

14 CASE 1: CLUSTERED B+TREE Traverse to the left-most leaf page, and then retrieve tuples from all leaf pages. Index (Directs search) Data Entries ("Sequence set") This will always better than external sorting. 101 102 103 104 Data Records

15 CASE 2: UNCLUSTERED B+TREE Chase each pointer to the page that contains the data. Index (Directs search) Data Entries ("Sequence set") This is almost always a bad idea. In general, one I/O per data record. 101 102 103 104 Data Records

17 AGGREGATIONS Collapse multiple tuples into a single scalar value. Two implementation choices: Sorting Hashing

18 SORTING AGGREGATION SELECT DISTINCT cid FROM enrolled WHERE grade IN ('B','C') enrolled(sid,cid,grade) sid cid grade 53666 C 53688 15-721 A 53688 15-826 B 53666 15-721 C 53655 C Filter sid cid grade 53666 C 53688 15-826 B 53666 15-721 C 53655 C

18 SORTING AGGREGATION SELECT DISTINCT cid FROM enrolled WHERE grade IN ('B','C') enrolled(sid,cid,grade) sid cid grade 53666 C 53688 15-721 A 53688 15-826 B 53666 15-721 C 53655 C Filter sid cid grade 53666 C 53688 15-826 B 53666 15-721 C 53655 C Remove Columns cid 15-826 15-721

18 SORTING AGGREGATION SELECT DISTINCT cid FROM enrolled WHERE grade IN ('B','C') enrolled(sid,cid,grade) sid cid grade 53666 C 53688 15-721 A 53688 15-826 B 53666 15-721 C 53655 C Filter sid cid grade 53666 C 53688 15-826 B 53666 15-721 C 53655 C Remove Columns cid 15-826 15-721 Sort cid 15-721 15-826

18 SELECT DISTINCT cid FROM enrolled WHERE grade IN ('B','C') SORTING AGGREGATION enrolled(sid,cid,grade) sid cid grade 53666 C 53688 15-721 A 53688 15-826 B 53666 15-721 C 53655 C Filter sid cid grade 53666 C 53688 15-826 B 53666 15-721 C 53655 C Remove Columns cid 15-826 15-721 Sort cid X 15-721 15-826 Eliminate Dupes

20 ALTERNATIVES TO SORTING What if we don t need the data to be ordered? Forming groups in GROUP BY (no ordering) Removing duplicates in DISTINCT (no ordering)

20 ALTERNATIVES TO SORTING What if we don t need the data to be ordered? Forming groups in GROUP BY (no ordering) Removing duplicates in DISTINCT (no ordering) Hashing is a better alternative in this scenario. Only need to remove duplicates, no need for ordering. Can be computationally cheaper than sorting.

21 HASHING AGGREGATE Populate an ephemeral hash table as the DBMS scans the table. For each record, check whether there is already an entry in the hash table: DISTINCT: Discard duplicate. GROUP BY: Perform aggregate computation. If everything fits in memory, then it's easy. If we have to spill to disk, then we need to be smarter

22 HASHING AGGREGATE Partition Phase Divide tuples into buckets based on hash key. ReHash Phase Build in-memory hash table for each partition and compute the aggregation.

23 HASHING AGGREGATE PHASE #1: PARTITION Use a hash function h 1 to split tuples into partitions on disk. We know that all matches live in the same partition. Partitions are "spilled" to disk via output buffers. Assume that we have B buffers.

24 HASHING AGGREGATE PHASE #1: PARTITION SELECT DISTINCT cid FROM enrolled WHERE grade IN ('B','C') enrolled(sid,cid,grade) sid cid grade 53666 C 53688 15-721 A 53688 15-826 B 53666 15-721 C 53655 C Filter sid cid grade 53666 C 53688 15-826 B 53666 15-721 C 53655 C

24 HASHING AGGREGATE PHASE #1: PARTITION SELECT DISTINCT cid FROM enrolled WHERE grade IN ('B','C') enrolled(sid,cid,grade) sid cid grade 53666 C 53688 15-721 A 53688 15-826 B 53666 15-721 C 53655 C Filter sid cid grade 53666 C 53688 15-826 B 53666 15-721 C 53655 C Remove Columns cid 15-826 15-721

24 HASHING AGGREGATE PHASE #1: PARTITION SELECT DISTINCT cid FROM enrolled WHERE grade IN ('B','C') Filter sid cid grade 53666 C 53688 15-826 B 53666 15-721 C 53655 C Remove Columns cid 15-826 15-721 enrolled(sid,cid,grade) sid cid grade 53666 C 53688 15-721 A 53688 15-826 B 53666 15-721 C 53655 C h 1 B-1 partitions 15-826 15-721

25 HASHING AGGREGATE PHASE #2: REHASH For each partition on disk: Read it into memory and build an in-memory hash table based on a second hash function h 2. Then go through each bucket of this hash table to bring together matching tuples. This assumes that each partition fits in memory.

26 HASHING AGGREGATE PHASE #2: REHASH SELECT DISTINCT cid FROM enrolled WHERE grade IN ('B','C') enrolled(sid,cid,grade) sid cid grade 53666 C 53688 15-721 A 53688 15-826 B 53666 15-721 C 53655 C Phase #1 Buckets 15-826 15-721

26 HASHING AGGREGATE PHASE #2: REHASH SELECT DISTINCT cid FROM enrolled WHERE grade IN ('B','C') Phase #1 Buckets 15-826 15-721 h 2 h 2 h 2 Hash Table Key Value XXX YYY 15-826 ZZZ 15-721 enrolled(sid,cid,grade) sid cid grade 53666 C 53688 15-721 A 53688 15-826 B 53666 15-721 C 53655 C

26 HASHING AGGREGATE PHASE #2: REHASH SELECT DISTINCT cid FROM enrolled WHERE grade IN ('B','C') Phase #1 Buckets 15-826 15-721 h 2 h 2 h 2 Hash Table Key Value XXX YYY 15-826 ZZZ 15-721 enrolled(sid,cid,grade) sid cid grade 53666 C 53688 15-721 A 53688 15-826 B 53666 15-721 C 53655 C cid 15-721 15-826

27 HASHING SUMMARIZATION During the ReHash phase, store pairs of the form (GroupKey RunningVal) When we want to insert a new tuple into the hash table: If we find a matching GroupKey, just update the RunningVal appropriately Else insert a new GroupKey RunningVal

28 HASHING SUMMARIZATION SELECT cid, AVG(s.gpa) FROM student AS s, enrolled AS e WHERE s.sid = e.sid GROUP BY cid Phase #1 Buckets 15-826 15-721

28 HASHING SUMMARIZATION SELECT cid, AVG(s.gpa) FROM student AS s, enrolled AS e WHERE s.sid = e.sid GROUP BY cid Phase #1 Buckets 15-826 15-721 h 2 h 2 h 2 Hash Table key value XXX (2,7.32) YYY 15-826 (1,3.33) ZZZ 15-721 (1,2.89)

28 HASHING SUMMARIZATION SELECT cid, AVG(s.gpa) FROM student AS s, enrolled AS e WHERE s.sid = e.sid GROUP BY cid Phase #1 Buckets 15-826 15-721 h 2 h 2 h 2 key XXX YYY ZZZ Hash Table value (2,7.32) 15-826 (1,3.33) 15-721 (1,2.89) Running Totals AVG(col) (COUNT,SUM) MIN(col) (MIN) MAX(col) (MAX) SUM(col) (SUM) COUNT(col) (COUNT)

28 HASHING SUMMARIZATION SELECT cid, AVG(s.gpa) FROM student AS s, enrolled AS e WHERE s.sid = e.sid GROUP BY cid Phase #1 Buckets 15-826 15-721 h 2 h 2 h 2 key XXX YYY ZZZ Hash Table value (2,7.32) 15-826 (1,3.33) 15-721 (1,2.89) Running Totals AVG(col) (COUNT,SUM) MIN(col) (MIN) MAX(col) (MAX) SUM(col) (SUM) COUNT(col) (COUNT) Final Result cid AVG(gpa) 3.66 15-826 3.33 15-721 2.89

29 COST ANALYSIS How big of a table can we hash using this approach? B-1 "spill partitions" in Phase #1 Each should be no more than B blocks big Answer: B (B-1) A table of N pages needs about sqrt(n) buffers Assumes hash distributes records evenly. Use a "fudge factor" f>1 for that: we need B sqrt(f N)

30 CONCLUSION Choice of sorting vs. hashing is subtle and depends on optimizations done in each case. We already discussed the optimizations for sorting: Chunk I/O into large blocks to amortize seek+rd costs. Double-buffering to overlap CPU and I/O.

31 Nested Loop Join Sort-Merge Join Hash Join "Exotic" Joins NEXT CL ASS