External Sorting. Chapter 13. Comp 521 Files and Databases Fall

Similar documents
External Sorting. Chapter 13. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1

External Sorting. Chapter 13. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1

Why Sort? Data requested in sorted order. Sor,ng is first step in bulk loading B+ tree index. e.g., find students in increasing GPA order

CS542. Algorithms on Secondary Storage Sorting Chapter 13. Professor E. Rundensteiner. Worcester Polytechnic Institute

EXTERNAL SORTING. Sorting

Sorting a file in RAM. External Sorting. Why Sort? 2-Way Sort of N pages Requires Minimum of 3 Buffers Pass 0: Read a page, sort it, write it.

Spring 2017 EXTERNAL SORTING (CH. 13 IN THE COW BOOK) 2/7/17 CS 564: Database Management Systems; (c) Jignesh M. Patel,

Carnegie Mellon Univ. Dept. of Computer Science Database Applications. Why Sort? Why Sort? select... order by

Database Applications (15-415)

Evaluation of Relational Operations

External Sorting Implementing Relational Operators

NOTE: sorting using B-trees to be assigned for reading after we cover B-trees.

Last Class. Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications. Why do we need to sort? Today's Class

CompSci 516: Database Systems

Last Class Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications

Database Applications (15-415)

External Sorting Sorting Tables Larger Than Main Memory

Announcements. Reading Material. Today. Different File Organizations. Selection of Indexes 9/24/17. CompSci 516: Database Systems

Lecture 11. Lecture 11: External Sorting

Motivation for Sorting. External Sorting: Overview. Outline. CSE 190D Database System Implementation. Topic 3: Sorting. Chapter 13 of Cow Book

Dtb Database Systems. Announcement

Extra: B+ Trees. Motivations. Differences between BST and B+ 10/27/2017. CS1: Java Programming Colorado State University

Lecture 12. Lecture 12: The IO Model & External Sorting

Database Applications (15-415)

Tree-Structured Indexes

Lecture 13. Lecture 13: B+ Tree

Tree-Structured Indexes

Tree-Structured Indexes

Tree-Structured Indexes

Hash-Based Indexing 165

Introduction to Indexing 2. Acknowledgements: Eamonn Keogh and Chotirat Ann Ratanamahatana

THE B+ TREE INDEX. CS 564- Spring ACKs: Jignesh Patel, AnHai Doan

EECS 647: Introduction to Database Systems

Indexing. Jan Chomicki University at Buffalo. Jan Chomicki () Indexing 1 / 25

Implementing Relational Operators: Selection, Projection, Join. Database Management Systems, R. Ramakrishnan and J. Gehrke 1

Announcements. Reading Material. Recap. Today 9/17/17. Storage (contd. from Lecture 6)

Principles of Data Management. Lecture #5 (Tree-Based Index Structures)

Midterm 1: CS186, Spring I. Storage: Disk, Files, Buffers [11 points] cs186-

Chapter 12: Query Processing. Chapter 12: Query Processing

Midterm 1: CS186, Spring I. Storage: Disk, Files, Buffers [11 points] SOLUTION. cs186-

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2009 Lecture 6 - Storage and Indexing

Overview of Storage and Indexing

Advanced Database Systems

Evaluation of Relational Operations

CSE 444: Database Internals. Lectures 5-6 Indexing

Physical Level of Databases: B+-Trees

CSE 544 Principles of Database Management Systems

Storage hierarchy. Textbook: chapters 11, 12, and 13

Evaluation of relational operations

Introduction. Choice orthogonal to indexing technique used to locate entries K.

EXTERNAL SORTING. CS 564- Spring ACKs: Dan Suciu, Jignesh Patel, AnHai Doan

Database System Concepts, 6 th Ed. Silberschatz, Korth and Sudarshan See for conditions on re-use

Tree-Structured Indexes

Tree-Structured Indexes

C has been and will always remain on top for performancecritical

Chapter 12: Query Processing

Overview of Storage and Indexing

Kathleen Durant PhD Northeastern University CS Indexes

Problem. Indexing with B-trees. Indexing. Primary Key Indexing. B-trees: Example. B-trees. primary key indexing

Overview of Query Processing. Evaluation of Relational Operations. Why Sort? Outline. Two-Way External Merge Sort. 2-Way Sort: Requires 3 Buffer Pages

CS 186/286 Spring 2018 Midterm 1

Evaluation of Relational Operations: Other Techniques

Query Processing. Debapriyo Majumdar Indian Sta4s4cal Ins4tute Kolkata DBMS PGDBA 2016

Lecture 8 Index (B+-Tree and Hash)

COMP2012H Spring 2014 Dekai Wu. Sorting. (more on sorting algorithms: mergesort, quicksort, heapsort)

Tree-Structured Indexes. Chapter 10

Data Management for Data Science

Why Is This Important? Overview of Storage and Indexing. Components of a Disk. Data on External Storage. Accessing a Disk Page. Records on a Disk Page

Main Memory and the CPU Cache

CSE 530A. B+ Trees. Washington University Fall 2013

CS 186/286 Spring 2018 Midterm 1

Chapter 13: Query Processing

CMSC 424 Database design Lecture 13 Storage: Files. Mihai Pop

Sorting & Aggregations

Database Management System

Evaluation of Relational Operations: Other Techniques

Sorting Algorithms. CptS 223 Advanced Data Structures. Larry Holder School of Electrical Engineering and Computer Science Washington State University

Indexing. Chapter 8, 10, 11. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1

Query Processing. Introduction to Databases CompSci 316 Fall 2017

CS127: B-Trees. B-Trees

Tree-Structured Indexes ISAM. Range Searches. Comments on ISAM. Example ISAM Tree. Introduction. As for any index, 3 alternatives for data entries k*:

Tree-Structured Indexes (Brass Tacks)

! A relational algebra expression may have many equivalent. ! Cost is generally measured as total elapsed time for

Chapter 13: Query Processing Basic Steps in Query Processing

CS350: Data Structures B-Trees

Query Processing: A Systems View. Announcements (March 1) Physical (execution) plan. CPS 216 Advanced Database Systems

Review. Support for data retrieval at the physical level:

Intro to DB CHAPTER 12 INDEXING & HASHING

Background: disk access vs. main memory access (1/2)

I think that I shall never see A billboard lovely as a tree. Perhaps unless the billboards fall I ll never see a tree at all.

CSIT5300: Advanced Database Systems

Chapter 11: Indexing and Hashing

Track Join. Distributed Joins with Minimal Network Traffic. Orestis Polychroniou! Rajkumar Sen! Kenneth A. Ross

Lecture 12. Lecture 12: Access Methods

Overview of Storage and Indexing

Database System Concepts

B-Trees. Disk Storage. What is a multiway tree? What is a B-tree? Why B-trees? Insertion in a B-tree. Deletion in a B-tree

Roadmap DB Sys. Design & Impl. Reference. Detailed Roadmap. Motivation. Outline of LRU-K. Buffering - LRU-K

Indexing. Week 14, Spring Edited by M. Naci Akkøk, , Contains slides from 8-9. April 2002 by Hector Garcia-Molina, Vera Goebel

Sorting a File of Records

Transcription:

External Sorting Chapter 13 Comp 521 Files and Databases Fall 2012 1

Why Sort? A classic problem in computer science! Advantages of requesting data in sorted order gathers duplicates allows for efficient searches Sorting is first step in bulk loading B+ tree index. Sort-merge join algorithm involves sorting. Problem: sort 20Gb of data with 1Gb of RAM. why not let the OS handle it with virtual memory? Comp 521 Files and Databases Fall 2012 2

2-Way Sort: Requires 3 Buffers Pass 1: Read a page, sort it, write it. only one buffer page is used Pass 2, 3,, N etc.: Read two pages, merge them, and write merged page Requires three buffer pages. INPUT 1 INPUT 2 OUTPUT Disk Main memory buffers Disk Comp 521 Files and Databases Fall 2012 3

Two-Way External Merge Sort Each pass we read + write each page in file. N pages in the file => the number of passes So toal cost is: 3,4 6,2 9,4 8,7 5,6 3,1 2 3,4 2,6 4,9 7,8 5,6 1,3 2 2,3 4,6 2,3 4,4 6,7 8,9 4,7 8,9 1,3 5,6 2 1,2 3,5 6 Input file PASS 0 1-page runs PASS 1 2-page runs PASS 2 4-page runs PASS 3 Idea: Divide and conquer: sort pages and merge 8-page runs Comp 521 Files and Databases Fall 2012 9 4 1,2 2,3 3,4 4,5 6,6 7,8

General External Merge Sort More than 3 buffer pages. How can we utilize them? To sort a file with N pages using B buffer pages: Pass 0: use B buffer pages. Produce sorted runs of B pages each. Pass 2,, etc.: merge B-1 runs. INPUT 1... INPUT 2... OUTPUT... Disk INPUT B-1 B Main memory buffers Disk Comp 521 Files and Databases Fall 2012 5

General External Merge Sort More than 3 buffer pages. How can we utilize them? Key Insight #1: We can merge more than 2 input buffers at a time affects fanout base of log! Key Insight #2: The output buffer is generated incrementally, so only one buffer page is needed for any size of run! To sort a file with N pages using B buffer pages: Pass 0: use B buffer pages. Produce pages each. sorted runs of B Pass 2,, etc.: merge B-1 runs, leaving one page for output. Comp 521 Files and Databases Fall 2012 6

Cost of External Merge Sort Number of passes: Cost = 2N * (# of passes) E.g., with 5 buffer pages, to sort 108 page file: Pass 0: = 22 sorted runs of 5 pages each (last run is only 3 pages) Pass 1: = 6 sorted runs of 20 pages each (last run is only 8 pages) Pass 2: 6 / 4 = 2 sorted runs, 80 pages and 28 pages Pass 3: Sorted file of 108 pages Comp 521 Files and Databases Fall 2012 7

Number of Passes of External Sort Comp 521 Files and Databases Fall 2012 8

Internal Sort Algorithm Quicksort is a fast way to sort in memory. Very fast on average Alternative is replacement sort Top: Read in B blocks Fill: Find the smallest record greater than the largest value to output buffer add it to the end of the output buffer fill moved record s slot with next value from the input buffer, if empty refill input buffer else end run goto Fill Comp 521 Files and Databases Fall 2012 9

More on Replacement Sort Fact: average length of a run is 2B The snowplow analogy Imagine a snowplow moving around a circular track on which snow falls at a steady rate. At any instant, there is a certain amount of snow S on the track. Some falling snow comes in front of the plow, some behind. During the next revolution of the plow, all of this is removed, plus 1/2 of what falls during that revolution. Thus, the plow removes 2S amount of snow. B Comp 521 Files and Databases Fall 2012 10

More on Replacement Sort Fact: average length of a run in heapsort is 2B The snowplow analogy Worst-Case: What is min length of a run? How does this arise? Best-Case: What is max length of a run? How does this arise? Quicksort is faster, but... B Comp 521 Files and Databases Fall 2012 11

I/O for External Merge Sort longer runs imply fewer passes! Actually, do I/O a page at a time In fact, read a block of pages sequentially! Suggests we should make each buffer (input/ output) be a block of pages. But this will reduce fan-out during merge passes! In practice, most files still sorted in 2-3 passes. Comp 521 Files and Databases Fall 2012 12

Number of Passes of Optimized Sort Block size = 32, initial pass produces runs of size 2B. Comp 521 Files and Databases Fall 2012 13

Double Buffering To reduce wait time for I/O request to complete, can prefetch into a shadow block. Potentially, more passes; in practice, most files still sorted in 2-3 passes. INPUT 1 INPUT 1' INPUT 2 OUTPUT INPUT 2' OUTPUT' Disk INPUT k INPUT k' b block size Disk B main memory buffers, k-way merge Comp 521 Files and Databases Fall 2012 14

Sorting Records! Sorting has become a blood sport! Parallel external sorting is the name of the game... 2005 IBM Almaden Sort 1Tbyte of 100 byte records Typical DBMS: > 5 days World record: 17 min, 37 seconds RS/6000 SP with 488 nodes Each node: 4, 332MHz 604 processors, 1.5GB of RAM, and a 9GB SCSI disk New benchmarks proposed: Minute Sort: How many can you sort in 1 minute? Dollar Sort: How many can you sort for $1.00? Comp 521 Files and Databases Fall 2012 15

Using B+ Trees for Sorting Scenario: Table to be sorted has B+ tree index on sorting column(s). Idea: Can retrieve records in order by traversing leaf pages. Is this a good idea? Cases to consider: B+ tree is clustered Good idea! B+ tree is not clustered Could be a very bad idea! Comp 521 Files and Databases Fall 2012 16

Clustered B+ Tree Used for Sorting Cost: root to the left-most leaf, then retrieve all leaf pages (Alternative 1) If Alternative 2 is used? Additional cost of retrieving data records: each page fetched just once. Fill factor of < 100% Data Records introduces a small overhead extra pages fetched Index (Directs search) Data Entries ("Sequence set") Always better than external sorting! Comp 521 Files and Databases Fall 2012 17

Unclustered B+ Tree Used for Sorting Alternative (2) for data entries; each data entry contains rid of a data record. In general, one I/O per data record! Index (Directs search) Data Entries ("Sequence set") Data Records Comp 521 Files and Databases Fall 2012 18

External Sorting vs. Unclustered Index p: # of records per page B=1,000 and block size=32 for sorting p=100 is the more realistic value. Comp 521 Files and Databases Fall 2012 19

Summary External sorting is important; DBMS may dedicate part of buffer pool just for sorting! External merge sort minimizes disk I/O cost: Pass 0: Produces sorted runs of size B (# buffer pages). Later passes: merge runs. # of runs merged at a time depends on B, and block size. Larger block size means less I/O cost per page. Larger block size means smaller # runs merged. In practice, # of runs rarely more than 2 or 3. Comp 521 Files and Databases Fall 2012 20

Summary, cont. Choice of internal sort algorithm may matter: Quicksort: Quick! Replacement sort: slower (2x), but with longer runs The best sorts are wildly fast: Despite 40+ years of research, we re still improving! Clustered B+ tree is good for sorting; unclustered tree is usually very bad. Comp 521 Files and Databases Fall 2012 21