External Sorting. Chapter 13. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1

Similar documents
External Sorting. Chapter 13. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1

External Sorting. Chapter 13. Comp 521 Files and Databases Fall

Why Sort? Data requested in sorted order. Sor,ng is first step in bulk loading B+ tree index. e.g., find students in increasing GPA order

CS542. Algorithms on Secondary Storage Sorting Chapter 13. Professor E. Rundensteiner. Worcester Polytechnic Institute

EXTERNAL SORTING. Sorting

Sorting a file in RAM. External Sorting. Why Sort? 2-Way Sort of N pages Requires Minimum of 3 Buffers Pass 0: Read a page, sort it, write it.

Carnegie Mellon Univ. Dept. of Computer Science Database Applications. Why Sort? Why Sort? select... order by

Spring 2017 EXTERNAL SORTING (CH. 13 IN THE COW BOOK) 2/7/17 CS 564: Database Management Systems; (c) Jignesh M. Patel,

Evaluation of Relational Operations

NOTE: sorting using B-trees to be assigned for reading after we cover B-trees.

Last Class. Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications. Why do we need to sort? Today's Class

Database Applications (15-415)

CompSci 516: Database Systems

Last Class Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications

External Sorting Implementing Relational Operators

Announcements. Reading Material. Today. Different File Organizations. Selection of Indexes 9/24/17. CompSci 516: Database Systems

Dtb Database Systems. Announcement

Database Applications (15-415)

Motivation for Sorting. External Sorting: Overview. Outline. CSE 190D Database System Implementation. Topic 3: Sorting. Chapter 13 of Cow Book

Lecture 11. Lecture 11: External Sorting

Overview of Query Processing. Evaluation of Relational Operations. Why Sort? Outline. Two-Way External Merge Sort. 2-Way Sort: Requires 3 Buffer Pages

Evaluation of Relational Operations: Other Techniques

External Sorting Sorting Tables Larger Than Main Memory

EECS 647: Introduction to Database Systems

Tree-Structured Indexes

Evaluation of relational operations

Principles of Data Management. Lecture #5 (Tree-Based Index Structures)

Tree-Structured Indexes

Evaluation of Relational Operations: Other Techniques

Tree-Structured Indexes

Tree-Structured Indexes ISAM. Range Searches. Comments on ISAM. Example ISAM Tree. Introduction. As for any index, 3 alternatives for data entries k*:

Tree-Structured Indexes

Overview of Storage and Indexing

Database Applications (15-415)

Implementing Relational Operators: Selection, Projection, Join. Database Management Systems, R. Ramakrishnan and J. Gehrke 1

Overview of Storage and Indexing

Hash-Based Indexing 165

Overview of Storage and Indexing

Indexing. Chapter 8, 10, 11. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1

Tree-Structured Indexes

Database Management System

Unit 3 Disk Scheduling, Records, Files, Metadata

Overview of Storage and Indexing

Announcements. Reading Material. Recap. Today 9/17/17. Storage (contd. from Lecture 6)

Lecture 8 Index (B+-Tree and Hash)

Chapter 12: Query Processing. Chapter 12: Query Processing

THE B+ TREE INDEX. CS 564- Spring ACKs: Jignesh Patel, AnHai Doan

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2009 Lecture 6 - Storage and Indexing

Introduction to Data Management. Lecture 14 (Storage and Indexing)

Introduction. Choice orthogonal to indexing technique used to locate entries K.

Lecture 12. Lecture 12: The IO Model & External Sorting

Introduction to Data Management. Lecture #13 (Indexing)

Principles of Data Management. Lecture #3 (Managing Files of Records)

Data Management for Data Science

Tree-Structured Indexes (Brass Tacks)

Indexing. Jan Chomicki University at Buffalo. Jan Chomicki () Indexing 1 / 25

Midterm Review CS634. Slides based on Database Management Systems 3 rd ed, Ramakrishnan and Gehrke

Implementation of Relational Operations: Other Operations

Storage hierarchy. Textbook: chapters 11, 12, and 13

Storage and Indexing

RAID in Practice, Overview of Indexing

Chapter 12: Query Processing

Kathleen Durant PhD Northeastern University CS Indexes

CSE 530A. B+ Trees. Washington University Fall 2013

CSE 444: Database Internals. Lectures 5-6 Indexing

Modern Database Systems Lecture 1

Query Processing. Debapriyo Majumdar Indian Sta4s4cal Ins4tute Kolkata DBMS PGDBA 2016

CS 186/286 Spring 2018 Midterm 1

Lecture 13. Lecture 13: B+ Tree

Midterm 1: CS186, Spring I. Storage: Disk, Files, Buffers [11 points] SOLUTION. cs186-

Introduction to Data Management. Lecture 15 (More About Indexing)

C has been and will always remain on top for performancecritical

Evaluation of Relational Operations

Introduction to Indexing 2. Acknowledgements: Eamonn Keogh and Chotirat Ann Ratanamahatana

Evaluation of Relational Operations: Other Techniques

Main Memory and the CPU Cache

Readings. Important Decisions on DB Tuning. Index File. ICOM 5016 Introduction to Database Systems

CS 5614: (Big) Data Management Systems. B. Aditya Prakash Lecture #5: Hashing and Sor5ng

Chapter 12: Indexing and Hashing (Cnt(

Data on External Storage

Sorting & Aggregations

Administriva. CS 133: Databases. General Themes. Goals for Today. Fall 2018 Lec 11 10/11 Query Evaluation Prof. Beth Trushkowsky

EXTERNAL SORTING. CS 564- Spring ACKs: Dan Suciu, Jignesh Patel, AnHai Doan

Module 4: Tree-Structured Indexing

Introduction to Data Management. Lecture 21 (Indexing, cont.)

Overview of Storage and Indexing

Why Is This Important? Overview of Storage and Indexing. Components of a Disk. Data on External Storage. Accessing a Disk Page. Records on a Disk Page

CSE 544 Principles of Database Management Systems

Tree-Structured Indexes

Sorting a File of Records

Unit 2 Buffer Pool Management

The use of indexes. Iztok Savnik, FAMNIT. IDB, Indexes

Database Applications (15-415)

Tree-Structured Indexes

Evaluation of Relational Operations. Relational Operations

Background: disk access vs. main memory access (1/2)

Review of Storage and Indexing

CS 186/286 Spring 2018 Midterm 1

Overview of Storage and Indexing

Tree-Structured Indexes. Chapter 10

Transcription:

External Sorting Chapter 13 Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1

Why Sort? v A classic problem in computer science! v Data requested in sorted order e.g., find students in increasing gpa order v Sorting is first step in bulk loading B+ tree index. v Sorting useful for eliminating duplicate copies in a collection of records (Why?) v Sort-merge join algorithm involves sorting. v Problem: sort 1Gb of data with 1Mb of RAM. why not virtual memory? Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 2

2-Way Sort: Requires 3 Buffers v Pass 1: Read a page, sort it, write it. only one buffer page is used v Pass 2, 3,, etc.: three buffer pages used. INPUT 1 OUTPUT Disk INPUT 2 Main memory buffers Disk Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 3

Two-Way External Merge Sort v Each pass we read + write each page in file. v N pages in the file => the number of passes = log + 2 N 1 v So total cost is: v Idea: Divide and conquer: sort subfiles and merge ( log N + ) 2 N 1 2 3,4 6,2 9,4 8,7 5,6 3,1 2 3,4 2,6 4,9 7,8 5,6 1,3 2 2,3 4,6 Input file PASS 0 1-page runs PASS 1 2-page runs PASS 2 4-page runs PASS 3 8-page runs 9 Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 4 2,3 4,4 6,7 8,9 4,7 8,9 1,2 2,3 3,4 4,5 6,6 7,8 1,3 5,6 2 1,2 3,5 6

General External Merge Sort More than 3 buffer pages. How can we utilize them? v To sort a file with N pages using B buffer pages: Pass 0: use B buffer pages. Produce N / B sorted runs of B pages each. Pass 2,, etc.: merge B-1 runs. INPUT 1... INPUT 2... OUTPUT... Disk INPUT B-1 B Main memory buffers Disk Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 5

Cost of External Merge Sort v Number of passes: v Cost = 2N * (# of passes) 1 + 1 v E.g., with 5 buffer pages, to sort 108 page file: Pass 0: 108 / 5 = 22 sorted runs of 5 pages each (last run is only 3 pages) Pass 1: 22 / 4 = 6 sorted runs of 20 pages each (last run is only 8 pages) Pass 2: 2 sorted runs, 80 pages and 28 pages Pass 3: Sorted file of 108 pages log B N / B Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 6

Number of Passes of External Sort N B=3 B=5 B=9 B=17 B=129 B=257 100 7 4 3 2 1 1 1,000 10 5 4 3 2 2 10,000 13 7 5 4 2 2 100,000 17 9 6 5 3 3 1,000,000 20 10 7 5 3 3 10,000,000 23 12 8 6 4 3 100,000,000 26 14 9 7 4 4 1,000,000,000 30 15 10 8 5 4 Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 7

Internal Sort Algorithm v Quicksort is a fast way to sort in memory. v An alternative is tournament sort (a.k.a. heapsort ) Top: Read in B blocks Output: move smallest record to output buffer Read in a new record r insert r into heap if r not smallest, then GOTO Output else remove r from heap output heap in order; GOTO Top Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 8

More on Heapsort v Fact: average length of a run in heapsort is 2B The snowplow analogy v Worst-Case: What is min length of a run? How does this arise? v Best-Case: What is max length of a run? How does this arise? B v Quicksort is faster, but... Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 9

I/O for External Merge Sort v longer runs often means fewer passes! v Actually, do I/O a page at a time v In fact, read a block of pages sequentially! v Suggests we should make each buffer (input/ output) be a block of pages. But this will reduce fan-out during merge passes! In practice, most files still sorted in 2-3 passes. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 10

Number of Passes of Optimized Sort N B=1,000 B=5,000 B=10,000 100 1 1 1 1,000 1 1 1 10,000 2 2 1 100,000 3 2 2 1,000,000 3 2 2 10,000,000 4 3 3 100,000,000 5 3 3 1,000,000,000 5 4 3 Block size = 32, initial pass produces runs of size 2B. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 11

Double Buffering v To reduce wait time for I/O request to complete, can prefetch into `shadow block. Potentially, more passes; in practice, most files still sorted in 2-3 passes. INPUT 1 INPUT 1' INPUT 2 OUTPUT INPUT 2' OUTPUT' Disk INPUT k INPUT k' b block size Disk B main memory buffers, k-way merge Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 12

Sorting Records! v Sorting has become a blood sport! Parallel sorting is the name of the game... v Datamation: Sort 1M records of size 100 bytes Typical DBMS: 15 minutes World record: 3.5 seconds 12-CPU SGI machine, 96 disks, 2GB of RAM v New benchmarks proposed: Minute Sort: How many can you sort in 1 minute? Dollar Sort: How many can you sort for $1.00? Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 13

Using B+ Trees for Sorting v Scenario: Table to be sorted has B+ tree index on sorting column(s). v Idea: Can retrieve records in order by traversing leaf pages. v Is this a good idea? v Cases to consider: B+ tree is clustered Good idea! B+ tree is not clustered Could be a very bad idea! Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 14

Clustered B+ Tree Used for Sorting v Cost: root to the leftmost leaf, then retrieve all leaf pages (Alternative 1) v If Alternative 2 is used? Additional cost of retrieving data records: each page fetched just once. Data Records Index (Directs search) Data Entries ("Sequence set") Always better than external sorting! Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 15

Unclustered B+ Tree Used for Sorting v Alternative (2) for data entries; each data entry contains rid of a data record. In general, one I/O per data record! Index (Directs search) Data Entries ("Sequence set") Data Records Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 16

External Sorting vs. Unclustered Index N Sorting p=1 p=10 p=100 100 200 100 1,000 10,000 1,000 2,000 1,000 10,000 100,000 10,000 40,000 10,000 100,000 1,000,000 100,000 600,000 100,000 1,000,000 10,000,000 1,000,000 8,000,000 1,000,000 10,000,000 100,000,000 10,000,000 80,000,000 10,000,000 100,000,000 1,000,000,000 p: # of records per page B=1,000 and block size=32 for sorting p=100 is the more realistic value. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 17

Summary v External sorting is important; DBMS may dedicate part of buffer pool for sorting! v External merge sort minimizes disk I/O cost: Pass 0: Produces sorted runs of size B (# buffer pages). Later passes: merge runs. # of runs merged at a time depends on B, and block size. Larger block size means less I/O cost per page. Larger block size means smaller # runs merged. In practice, # of runs rarely more than 2 or 3. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 18

Summary, cont. v Choice of internal sort algorithm may matter: Quicksort: Quick! Heap/tournament sort: slower (2x), longer runs v The best sorts are wildly fast: Despite 40+ years of research, we re still improving! v Clustered B+ tree is good for sorting; unclustered tree is usually very bad. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 19