CS542 Algorithms on Secondary Storage
Sorting (Chapter 13)
Professor E. Rundensteiner
Lesson: Using Secondary Storage Effectively
- Data is too large to fit in memory
- Conventional algorithms work only at small scale
- We must design algorithms for large data
- I/O costs dominate, so the goal is to reduce I/O
Why Sort?
- A classic problem in computer science!
- Data is requested in sorted order via SQL: e.g., find students in increasing GPA order
- Sorting is the first step in bulk loading a B+ tree index
- Sorting is useful for eliminating duplicate copies in a collection of records
- The sort-merge join algorithm involves sorting
- Problem: sort 1 GB of data with 1 MB of RAM. Why not just use virtual memory?
2-Way Sort: Requires 3 Buffers
- PREPARE (1 pass): read a page, sort it, write it (uses 1 input/output buffer).
- MERGE (many passes): repeatedly merge pairs of smaller runs into larger runs (uses buffers INPUT 1, INPUT 2, and OUTPUT).
[Diagram: pages flow from disk through the main memory buffers and back to disk in each pass.]
Two-Way External Merge Sort
Idea: divide and conquer. Sort subfiles and merge them into longer sorted runs.
Example (7-page input file; keys shown page by page):
Input file:            3,4 | 6,2 | 9,4 | 8,7 | 5,6 | 3,1 | 2
PASS 0 (1-page runs):  3,4 | 2,6 | 4,9 | 7,8 | 5,6 | 1,3 | 2
PASS 1 (2-page runs):  2,3 4,6 | 4,7 8,9 | 1,3 5,6 | 2
PASS 2 (4-page runs):  2,3 4,4 6,7 8,9 | 1,2 3,5 6
PASS 3 (8-page run):   1,2 2,3 3,4 4,5 6,6 7,8 9
Cost?
[Same merge-tree example as above: PASS 0 produces 1-page runs, PASS 1 2-page runs, PASS 2 4-page runs, PASS 3 the final 8-page run.]
Cost per pass: read and write all pages.
# of passes: height of the merge tree.
Total cost: product of the two.
[Diagram: same merge-tree example as above.]
Each pass reads and writes each page in the file: N pages in the file => 2N I/Os per pass.
Number of passes = ⌈log_2 N⌉ + 1.
So total cost is: 2N * (⌈log_2 N⌉ + 1).
[Diagram: same merge-tree example as above.]
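To make the passes concrete, here is a minimal Python sketch (not the course's reference code) that simulates two-way external merge sort, with disk pages modeled as small lists; the function names are illustrative assumptions.

```python
def two_way_external_sort(pages):
    """Sort a 'file' given as a list of pages; returns the single final run."""
    # Pass 0: read each page, sort it in memory, write it back (1-page runs).
    runs = [sorted(p) for p in pages]

    # Passes 1..k: merge pairs of runs until a single run remains.
    while len(runs) > 1:
        merged = []
        for i in range(0, len(runs), 2):
            if i + 1 < len(runs):
                merged.append(merge_two(runs[i], runs[i + 1]))
            else:
                merged.append(runs[i])          # odd run is carried to the next pass
        runs = merged
    return runs[0]

def merge_two(a, b):
    """Merge two sorted runs, consuming one record at a time from each (3-buffer merge)."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    out.extend(a[i:]); out.extend(b[j:])
    return out

# The 7-page example from the slide: pass 0 plus 3 merge passes = 4 passes.
pages = [[3, 4], [6, 2], [9, 4], [8, 7], [5, 6], [3, 1], [2]]
print(two_way_external_sort(pages))  # [1, 2, 2, 3, 3, 4, 4, 5, 6, 6, 7, 8, 9]
```

Counting the while-loop iterations plus pass 0 reproduces the ⌈log_2 N⌉ + 1 pass count (4 passes for the 7-page example).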
External Merge Sort What if we had more buffer pages? How do we utilize them wisely? - Two main ideas!
Phase 1: Prepare
Use all B main memory buffers: read B pages at a time, sort them in memory, and write out a sorted run of B pages.
Goal: construct starter runs that are as long as possible.
[Diagram: disk -> INPUT 1 ... INPUT B (B main memory buffers) -> disk.]
Phase 2: Merge
Use B-1 buffers for input runs and 1 buffer for output: merge B-1 sorted runs at a time into one longer sorted run.
[Diagram: disk -> INPUT 1 ... INPUT B-1, OUTPUT (B main memory buffers) -> disk.]
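A minimal sketch (illustrative, not the course's code) of the two phases with B buffer pages, again simulating the file in memory; heapq.merge stands in for the (B-1)-way merge, and the names PAGE_SIZE and B are assumptions for illustration.

```python
import heapq

PAGE_SIZE = 2   # records per page (assumption for the simulation)
B = 5           # number of buffer pages

def make_runs(records):
    """Phase 1: sort B pages (B * PAGE_SIZE records) at a time into initial runs."""
    chunk = B * PAGE_SIZE
    return [sorted(records[i:i + chunk]) for i in range(0, len(records), chunk)]

def merge_passes(runs):
    """Phase 2: repeatedly merge up to B-1 runs at a time until one run remains."""
    fan_in = B - 1
    while len(runs) > 1:
        runs = [list(heapq.merge(*runs[i:i + fan_in]))
                for i in range(0, len(runs), fan_in)]
    return runs[0]

records = [3, 4, 6, 2, 9, 4, 8, 7, 5, 6, 3, 1, 2]
print(merge_passes(make_runs(records)))  # fully sorted output
```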
Cost of External Merge Sort
Cost = 2N * (# of passes)
# of passes = 1 + ⌈log_{B-1} ⌈N / B⌉⌉
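A small helper (illustrative names, not from the slides) that evaluates this formula by counting passes exactly rather than via floating-point logarithms:

```python
from math import ceil

def num_passes(N, B):
    """Pass 0 builds ceil(N/B) runs; each merge pass then merges B-1 runs at a time."""
    runs = ceil(N / B)
    passes = 1
    while runs > 1:
        runs = ceil(runs / (B - 1))
        passes += 1
    return passes

def total_io(N, B):
    """Each pass reads and writes every page: 2N I/Os per pass."""
    return 2 * N * num_passes(N, B)
```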
Example
Buffer: 5 buffer pages; file to sort: 108 pages.
Pass 0: size of each run? number of runs?
Pass 1: size of each run? number of runs?
etc.?
Example
Buffer: 5 buffer pages; file to sort: 108 pages.
Pass 0: ⌈108 / 5⌉ = 22 sorted runs of 5 pages each (last run is only 3 pages)
Pass 1: ⌈22 / 4⌉ = 6 sorted runs of 20 pages each (last run is only 8 pages)
Pass 2: 2 sorted runs, 80 pages and 28 pages
Pass 3: sorted file of 108 pages
Total I/O cost: 2N * (4 passes)
Example
Buffer: 5 buffer pages; file to sort: 108 pages.
Total I/O cost = 2N * (# of passes)
              = 2N * (1 + ⌈log_{B-1} ⌈N / B⌉⌉)
              = 2 * 108 * (1 + ⌈log_4 ⌈108 / 5⌉⌉)
              = 2 * 108 * (1 + ⌈log_4 22⌉)
              = 2 * 108 * 4 = 864 I/Os
Number of Passes of External Sort
- gain from utilizing all available buffers
- importance of a high fan-in during merging

N              B=3  B=5  B=9  B=17  B=129  B=257
100              7    4    3     2      1      1
1,000           10    5    4     3      2      2
10,000          13    7    5     4      2      2
100,000         17    9    6     5      3      3
1,000,000       20   10    7     5      3      3
10,000,000      23   12    8     6      4      3
100,000,000     26   14    9     7      4      4
1,000,000,000   30   15   10     8      5      4
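As a quick usage example, the num_passes sketch from the cost-formula slide reproduces the worked 108-page example and a couple of table entries:

```python
# Usage of the num_passes helper sketched above (values checked against the table).
print(num_passes(108, 5))              # 4  (the worked example: 864 I/Os total)
print(num_passes(10_000_000, 17))      # 6  (matches the N=10,000,000, B=17 entry)
print(num_passes(1_000_000_000, 257))  # 4  (matches the N=1,000,000,000, B=257 entry)
```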
Optimizing External Sorting
Cost metric? So far, I/O only. But CPU cost is nontrivial and worth reducing too.
Optimizing Internal Sorting: Phase 1
Buffers: 1 input, 1 output, B-2 pages for the "current set".
Main idea: repeatedly pick the tuple in the current set with the smallest key value that is still greater than the largest key value in the output buffer, and append it to the output buffer.
[Diagram: INPUT (12, 4, ...) -> CURRENT SET (2, 8, 10, 3, 5) -> OUTPUT.]
Internal Sort Algorithm
[Diagram: two successive steps of the current-set example: a tuple moves from the current set to the output buffer, and a new tuple is read in from the input buffer to take its place.]
Internal Sort Algorithm
When are input and output pages handled? A new input page is read in when the current one is fully consumed; the output buffer is written out when it is full.
When does the current run terminate? When all tuples in the current set are smaller than the largest tuple in the output buffer.
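Here is a minimal Python sketch of this run-generation idea (commonly known as replacement selection). The function and parameter names are assumptions, and the current set is modeled as a heap of individual records rather than B-2 pages.

```python
import heapq

def replacement_selection(records, set_size=3):
    """Yield sorted runs; runs are typically longer than set_size (the 'current set')."""
    it = iter(records)
    current = [next(it) for _ in range(set_size)]   # assume at least set_size records
    heapq.heapify(current)
    frozen = []        # records too small to join the current run
    run = []
    while current:
        smallest = heapq.heappop(current)
        run.append(smallest)                        # append to the output buffer
        nxt = next(it, None)                        # refill the current set from input
        if nxt is not None:
            if nxt >= smallest:
                heapq.heappush(current, nxt)        # still fits into the current run
            else:
                frozen.append(nxt)                  # must wait for the next run
        if not current:                             # current run terminates
            yield run
            run, current, frozen = [], frozen, []
            heapq.heapify(current)

print(list(replacement_selection([12, 4, 2, 8, 10, 3, 5, 9, 1, 7, 6])))
# [[2, 4, 8, 10, 12], [3, 5, 7, 9], [1, 6]]
```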
Analysis of Heapsort
Fact: the average length of a run is 2B (the "snowplow" analogy).
Worst case: what is the minimum length of a run? How does this arise?
Best case: what is the maximum length of a run? How does this arise?
Quicksort is faster, but...
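As a quick, illustrative sanity check of the 2B claim, one can measure average run length with the replacement_selection sketch above using a current set of 100 random records; the exact numbers will vary from run to run.

```python
import random
import statistics

random.seed(542)
data = [random.random() for _ in range(100_000)]
runs = list(replacement_selection(data, set_size=100))
print(statistics.mean(len(r) for r in runs))   # typically close to 200, i.e. about 2 * 100
```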
Optimizing External Sorting: Phase 2
Further optimizations for external sorting:
- Blocked I/O
- Double buffering
Blocking of Buffer Pages: External Sort
Thus far: one I/O per page at a time.
But: reading a block of pages sequentially is cheaper!
So make each input and output unit a block of b pages.
[Diagram: k input blocks of b pages each plus an output block of b pages, within B main memory buffers.]
Example: Blocking for Merge Sort
Idea: use buffer blocks of b pages each, one block per input run and one block for output, so we can merge ⌊(B - b) / b⌋ runs at a time in each merge pass (see the sketch below).
Example: 10 buffer pages.
- With one-page input and output blocks, we can merge 9 runs at a time.
- With two-page input and output blocks, we can merge 4 runs at a time.
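A tiny helper (illustrative) for the merge fan-in with blocked I/O: B buffer pages, blocks of b pages, one block reserved for output.

```python
def blocked_fan_in(B, b):
    """Number of runs that can be merged at a time with b-page blocks."""
    return (B - b) // b

print(blocked_fan_in(10, 1))  # 9
print(blocked_fan_in(10, 2))  # 4
```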
Blocking of Buffer Pages
[Diagram: k-way merge with blocks of b pages, B main memory buffers.]
Pro: lowers total I/O cost (sequential block reads and writes are cheaper).
Con: reduces fan-in during merge passes, which may add passes.
In practice, most files are still sorted in 2-3 passes.
Double Buffering
Overlap CPU and I/O: to reduce the time spent waiting for an I/O request to complete, prefetch the next block into a 'shadow' block while the current block is being processed.
[Diagram: each of the k input blocks and the output block has a shadow block of b pages; B main memory buffers, k-way merge.]
Potentially more passes (fewer buffers are available for the merge itself).
In practice, most files are still sorted in 2-3 passes.
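A minimal sketch of the double-buffering idea (illustrative only): a background thread prefetches the next block into a shadow buffer while the CPU processes the current one. read_block and process_block are hypothetical stand-ins for real disk I/O and merge work.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def read_block(i):                 # stands in for a blocked read from disk
    time.sleep(0.01)
    return list(range(i * 4, i * 4 + 4))

def process_block(block):          # stands in for the CPU work of merging
    time.sleep(0.01)

def consume_with_double_buffering(num_blocks):
    with ThreadPoolExecutor(max_workers=1) as io:
        pending = io.submit(read_block, 0)              # fill the first buffer
        for i in range(num_blocks):
            block = pending.result()                    # current buffer is ready
            if i + 1 < num_blocks:
                pending = io.submit(read_block, i + 1)  # prefetch into the shadow buffer
            process_block(block)                        # CPU works while the I/O proceeds

consume_with_double_buffering(8)
```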
Using B+ Trees for Sorting
Scenario: the table to be sorted has a B+ tree index on the sorting column(s).
Idea: retrieve records in order by traversing the leaf pages.
Is this a good idea? Cases to consider:
- B+ tree is clustered: good idea!
- B+ tree is not clustered: bad idea!
Clustered B+ Tree Used for Sorting
Traverse the tree from the root to the left-most leaf, then:
- If Alternative (1), retrieve all leaf pages in order.
- If Alternative (2), pay the additional cost of retrieving the actual data records.
[Diagram: index (directs search) -> data entries ("sequence set") -> data records.]
Always better than external sorting!
Unclustered B+ Tree Used for Sorting
Alternative (2) for data entries: each data entry contains the rid of a data record.
In general, one I/O per data record!
[Diagram: index (directs search) -> data entries ("sequence set") -> data records.]
External Sorting vs. Unclustered Index

N             Sorting       p=1           p=10           p=100
100           200           100           1,000          10,000
1,000         2,000         1,000         10,000         100,000
10,000        40,000        10,000        100,000        1,000,000
100,000       600,000       100,000       1,000,000      10,000,000
1,000,000     8,000,000     1,000,000     10,000,000     100,000,000
10,000,000    80,000,000    10,000,000    100,000,000    1,000,000,000

p: # of records per page (p=100 is realistic); sorting assumes B=1,000 buffers and block size 32.
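For illustration, the following sketch reproduces both sides of the table under the stated parameters, assuming the merge fan-in is ⌊B/b⌋ - 1 = 30 with B=1,000 buffers and 32-page blocks, and charging the unclustered index one I/O per record (p * N).

```python
from math import ceil

def ext_sort_io(N, B=1000, b=32):
    """External-sort I/O cost: initial runs of B pages, then merges with fan-in B//b - 1."""
    runs, passes = ceil(N / B), 1
    fan_in = B // b - 1
    while runs > 1:
        runs = ceil(runs / fan_in)
        passes += 1
    return 2 * N * passes

for N in (100, 1_000, 10_000, 100_000, 1_000_000, 10_000_000):
    print(N, ext_sort_io(N), [p * N for p in (1, 10, 100)])
```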
Summary
External sorting is important; a DBMS may dedicate part of its buffer pool to sorting!
External merge sort minimizes disk I/O cost:
- In practice, the number of passes is rarely more than 3.
- The choice of internal sort algorithm may matter.
The best sorts are wildly fast: despite 40+ years of research, they are still improving!
A clustered B+ tree is good for sorting; an unclustered tree is usually very bad.