External Sorting. Chapter 13. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1

Similar documents
External Sorting. Chapter 13. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1

CS542. Algorithms on Secondary Storage Sorting Chapter 13. Professor E. Rundensteiner. Worcester Polytechnic Institute

External Sorting. Chapter 13. Comp 521 Files and Databases Fall

Why Sort? Data requested in sorted order. Sor,ng is first step in bulk loading B+ tree index. e.g., find students in increasing GPA order

EXTERNAL SORTING. Sorting

Sorting a file in RAM. External Sorting. Why Sort? 2-Way Sort of N pages Requires Minimum of 3 Buffers Pass 0: Read a page, sort it, write it.

Carnegie Mellon Univ. Dept. of Computer Science Database Applications. Why Sort? Why Sort? select... order by

CompSci 516: Database Systems

Evaluation of Relational Operations

Last Class Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications

Last Class. Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications. Why do we need to sort? Today's Class

Database Applications (15-415)

External Sorting Implementing Relational Operators

Announcements. Reading Material. Today. Different File Organizations. Selection of Indexes 9/24/17. CompSci 516: Database Systems

Spring 2017 EXTERNAL SORTING (CH. 13 IN THE COW BOOK) 2/7/17 CS 564: Database Management Systems; (c) Jignesh M. Patel,

NOTE: sorting using B-trees to be assigned for reading after we cover B-trees.

Dtb Database Systems. Announcement

Evaluation of relational operations

EECS 647: Introduction to Database Systems

Evaluation of Relational Operations: Other Techniques

Evaluation of Relational Operations: Other Techniques

Principles of Data Management. Lecture #5 (Tree-Based Index Structures)

Overview of Query Processing. Evaluation of Relational Operations. Why Sort? Outline. Two-Way External Merge Sort. 2-Way Sort: Requires 3 Buffer Pages

Tree-Structured Indexes

Tree-Structured Indexes

Tree-Structured Indexes ISAM. Range Searches. Comments on ISAM. Example ISAM Tree. Introduction. As for any index, 3 alternatives for data entries k*:

Database Applications (15-415)

Tree-Structured Indexes

Tree-Structured Indexes

Motivation for Sorting. External Sorting: Overview. Outline. CSE 190D Database System Implementation. Topic 3: Sorting. Chapter 13 of Cow Book

Overview of Storage and Indexing

Indexing. Chapter 8, 10, 11. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1

Overview of Storage and Indexing

Overview of Storage and Indexing

Tree-Structured Indexes

Database Management System

External Sorting Sorting Tables Larger Than Main Memory

Implementing Relational Operators: Selection, Projection, Join. Database Management Systems, R. Ramakrishnan and J. Gehrke 1

Chapter 12: Query Processing. Chapter 12: Query Processing

Hash-Based Indexing 165

Introduction to Data Management. Lecture 14 (Storage and Indexing)

Overview of Storage and Indexing

Introduction to Data Management. Lecture #13 (Indexing)

Database Applications (15-415)

Lecture 8 Index (B+-Tree and Hash)

Chapter 12: Query Processing

Kathleen Durant PhD Northeastern University CS Indexes

Announcements. Reading Material. Recap. Today 9/17/17. Storage (contd. from Lecture 6)

THE B+ TREE INDEX. CS 564- Spring ACKs: Jignesh Patel, AnHai Doan

Storage and Indexing

Query Processing. Debapriyo Majumdar Indian Sta4s4cal Ins4tute Kolkata DBMS PGDBA 2016

Lecture 11. Lecture 11: External Sorting

Unit 3 Disk Scheduling, Records, Files, Metadata

CS 5614: (Big) Data Management Systems. B. Aditya Prakash Lecture #5: Hashing and Sor5ng

Implementation of Relational Operations: Other Operations

Lecture 12. Lecture 12: The IO Model & External Sorting

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2009 Lecture 6 - Storage and Indexing

Introduction to Data Management. Lecture 21 (Indexing, cont.)

Module 4: Tree-Structured Indexing

Midterm Review CS634. Slides based on Database Management Systems 3 rd ed, Ramakrishnan and Gehrke

Evaluation of Relational Operations. Relational Operations

Introduction. Choice orthogonal to indexing technique used to locate entries K.

CSE 444: Database Internals. Lectures 5-6 Indexing

Introduction to Data Management. Lecture 15 (More About Indexing)

Chapter 13: Query Processing

Tree-Structured Indexes. Chapter 10

Evaluation of Relational Operations: Other Techniques

Examples of Physical Query Plan Alternatives. Selected Material from Chapters 12, 14 and 15

CSE 530A. B+ Trees. Washington University Fall 2013

Review of Storage and Indexing

Lassonde School of Engineering Winter 2016 Term Course No: 4411 Database Management Systems

Principles of Data Management. Lecture #3 (Managing Files of Records)

Readings. Important Decisions on DB Tuning. Index File. ICOM 5016 Introduction to Database Systems

! A relational algebra expression may have many equivalent. ! Cost is generally measured as total elapsed time for

Chapter 13: Query Processing Basic Steps in Query Processing

RAID in Practice, Overview of Indexing

Data on External Storage

Lecture 13. Lecture 13: B+ Tree

Advanced Database Systems

CS 186/286 Spring 2018 Midterm 1

Tree-Structured Indexes

RELATIONAL OPERATORS #1

Overview of Storage and Indexing

Tree-Structured Indexes

Tree-Structured Indexes (Brass Tacks)

The use of indexes. Iztok Savnik, FAMNIT. IDB, Indexes

Administriva. CS 133: Databases. General Themes. Goals for Today. Fall 2018 Lec 11 10/11 Query Evaluation Prof. Beth Trushkowsky

Announcement. Reading Material. Overview of Query Evaluation. Overview of Query Evaluation. Overview of Query Evaluation 9/26/17

Step 4: Choose file organizations and indexes

Principles of Data Management. Lecture #9 (Query Processing Overview)

Introduction to Indexing 2. Acknowledgements: Eamonn Keogh and Chotirat Ann Ratanamahatana

Overview of Storage and Indexing

Overview of Storage and Indexing. Data on External Storage

Storage hierarchy. Textbook: chapters 11, 12, and 13

ICOM 6005 Database Management Systems Design. Dr. Manuel Rodríguez-Martínez Electrical and Computer Engineering Department

CMSC424: Database Design. Instructor: Amol Deshpande

Extra: B+ Trees. Motivations. Differences between BST and B+ 10/27/2017. CS1: Java Programming Colorado State University

Module 9: Selectivity Estimation

Hash table example. B+ Tree Index by Example Recall binary trees from CSE 143! Clustered vs Unclustered. Example

CS 186/286 Spring 2018 Midterm 1

Transcription:

External Sorting Chapter 13 Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1

Why Sort? A classic problem in computer science! Data requested in sorted order e.g., find students in increasing gpa order Sorting is first step in bulk loading B+ tree index. Sorting useful for eliminating duplicate copies in a collection of records (Why?) Sort-merge join algorithm involves sorting. Problem: sort 1Gb of data with 1Mb of RAM. why not virtual memory? Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 2

2-Way Sort: Requires 3 Buffers Pass 1: Read a page, sort it, write it. only one buffer page is used Pass 2, 3,, etc.: three buffer pages used. INPUT 1 INPUT 2 OUTPUT Disk Main memory buffers Disk Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 3

Two-Way External Merge Sort Each pass we read + write each page in file. N pages in the file => the number of passes = log 2 N + 1! " So toal cost is: ( log N + ) 2 N 1! " 2 Idea: Divide and conquer: sort subfiles and merge 3,4 6,2 9,4 8,7 5,6 3,1 2 3,4 2,6 4,9 7,8 5,6 1,3 2 2,3 4,6 Input file PASS 0 1-page runs PASS 1 2-page runs PASS 2 4-page runs PASS 3 8-page runs 9 Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 4 2,3 4,4 6,7 8,9 4,7 8,9 1,2 2,3 3,4 4,5 6,6 7,8 1,3 5,6 2 1,2 3,5 6

General External Merge Sort More than 3 buffer pages. How can we utilize them? To sort a file with N pages using B buffer pages: Pass 0: use B buffer pages. Produce! N / B" sorted runs of B pages each. Pass 2,, etc.: merge B-1 runs. INPUT 1... INPUT 2...... OUTPUT Disk INPUT B-1 B Main memory buffers Disk Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 5

Cost of External Merge Sort Number of passes: 1 +! log B# 1! N / B" " Cost = 2N * (# of passes) E.g., with 5 buffer pages, to sort 108 page file: Pass 0:! 108 / 5" = 22 sorted runs of 5 pages each (last run is only 3 pages) Pass 1:! 22 / 4" = 6 sorted runs of 20 pages each (last run is only 8 pages) Pass 2: 2 sorted runs, 80 pages and 28 pages Pass 3: Sorted file of 108 pages Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 6

Number of Passes of External Sort N B=3 B=5 B=9 B=17 B=129 B=257 100 7 4 3 2 1 1 1,000 10 5 4 3 2 2 10,000 13 7 5 4 2 2 100,000 17 9 6 5 3 3 1,000,000 20 10 7 5 3 3 10,000,000 23 12 8 6 4 3 100,000,000 26 14 9 7 4 4 1,000,000,000 30 15 10 8 5 4 Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 7

I/O for External Merge Sort longer runs often means fewer passes! Actually, do I/O a page at a time In fact, read a block of pages sequentially! Suggests we should make each buffer (input/output) be a block of pages. But this will reduce fan-out during merge passes! In practice, most files still sorted in 2-3 passes. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 8

Number of Passes of Optimized Sort N B=1,000 B=5,000 B=10,000 100 1 1 1 1,000 1 1 1 10,000 2 2 1 100,000 3 2 2 1,000,000 3 2 2 10,000,000 4 3 3 100,000,000 5 3 3 1,000,000,000 5 4 3 Block size = 32, initial pass produces runs of size 2B. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 9

Double Buffering To reduce wait time for I/O request to complete, can prefetch into `shadow block. Potentially, more passes; in practice, most files still sorted in 2-3 passes. INPUT 1 INPUT 1' INPUT 2 OUTPUT INPUT 2' OUTPUT' Disk INPUT k INPUT k' b block size Disk B main memory buffers, k-way merge Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 10

Using B+ Trees for Sorting Scenario: Table to be sorted has B+ tree index on sorting column(s). Idea: Can retrieve records in order by traversing leaf pages. Is this a good idea? Cases to consider: B+ tree is clustered Good idea! B+ tree is not clustered Could be a very bad idea! Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 11

Clustered B+ Tree Used for Sorting Cost: root to the leftmost leaf, then retrieve all leaf pages (Alternative 1) If Alternative 2 is used? Additional cost of retrieving data records: each page fetched just once. Data Records Index (Directs search) Data Entries ("Sequence set") Always better than external sorting! Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 12

Unclustered B+ Tree Used for Sorting Alternative (2) for data entries; each data entry contains rid of a data record. In general, one I/O per data record! Index (Directs search) Data Entries ("Sequence set") Data Records Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 13

External Sorting vs. Unclustered Index N Sorting p=1 p=10 p=100 100 200 100 1,000 10,000 1,000 2,000 1,000 10,000 100,000 10,000 40,000 10,000 100,000 1,000,000 100,000 600,000 100,000 1,000,000 10,000,000 1,000,000 8,000,000 1,000,000 10,000,000 100,000,000 10,000,000 80,000,000 10,000,000 100,000,000 1,000,000,000 p: # of records per page B=1,000 and block size=32 for sorting p=100 is the more realistic value. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 14

Effects on query processing We have mentioned two factors that lead to large number of query execution plans Equivalent algebraic expressions Choice of query processing algorithms Now we need to add to the list: Do we sort intermediate results? Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 15

Summary External sorting is important; DBMS may dedicate part of buffer pool for sorting! External merge sort minimizes disk I/O cost: Pass 0: Produces sorted runs of size B (# buffer pages). Later passes: merge runs. # of runs merged at a time depends on B, and block size. Larger block size means less I/O cost per page. Larger block size means smaller # runs merged. In practice, # of runs rarely more than 2 or 3. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 16

Summary, cont. Choice of internal sort algorithm may matter: Quicksort: Quick! Heap/tournament sort: slower (2x), longer runs The best sorts are wildly fast: Despite 40+ years of research, we re still improving! Clustered B+ tree is good for sorting; unclustered tree is usually very bad. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 17