Last Class Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications

Similar documents
Last Class. Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications. Why do we need to sort? Today's Class

Carnegie Mellon Univ. Dept. of Computer Science Database Applications. Why Sort? Why Sort? select... order by

External Sorting. Chapter 13. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1

Sorting & Aggregations

External Sorting. Chapter 13. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1

Database Applications (15-415)

CS542. Algorithms on Secondary Storage Sorting Chapter 13. Professor E. Rundensteiner. Worcester Polytechnic Institute

External Sorting Implementing Relational Operators

External Sorting. Chapter 13. Comp 521 Files and Databases Fall

Database Applications (15-415)

Dtb Database Systems. Announcement

Cost-based Query Sub-System. Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications. Last Class.

EXTERNAL SORTING. Sorting

CompSci 516: Database Systems

Why Sort? Data requested in sorted order. Sor,ng is first step in bulk loading B+ tree index. e.g., find students in increasing GPA order

Spring 2017 EXTERNAL SORTING (CH. 13 IN THE COW BOOK) 2/7/17 CS 564: Database Management Systems; (c) Jignesh M. Patel,

EECS 647: Introduction to Database Systems

Today's Class. Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications. Example Database. Query Plan Example

Evaluation of Relational Operations

Announcements. Reading Material. Today. Different File Organizations. Selection of Indexes 9/24/17. CompSci 516: Database Systems

NOTE: sorting using B-trees to be assigned for reading after we cover B-trees.

Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications. Last Class. Today s Class. Faloutsos/Pavlo CMU /615

Sorting a file in RAM. External Sorting. Why Sort? 2-Way Sort of N pages Requires Minimum of 3 Buffers Pass 0: Read a page, sort it, write it.

Last Class Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications

Faloutsos 1. Carnegie Mellon Univ. Dept. of Computer Science Database Applications. Outline

Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications. Administrivia. Administrivia. Faloutsos/Pavlo CMU /615

Overview of Query Processing. Evaluation of Relational Operations. Why Sort? Outline. Two-Way External Merge Sort. 2-Way Sort: Requires 3 Buffer Pages

15-415/615 Faloutsos 1

THE B+ TREE INDEX. CS 564- Spring ACKs: Jignesh Patel, AnHai Doan

University of Waterloo Midterm Examination Solution

Database Management System

Database Applications (15-415)

External Sorting Sorting Tables Larger Than Main Memory

CS 5614: (Big) Data Management Systems. B. Aditya Prakash Lecture #5: Hashing and Sor5ng

Query Processing. Lecture #10. Andy Pavlo Computer Science Carnegie Mellon Univ. Database Systems / Fall 2018

RELATIONAL OPERATORS #1

CSE 544 Principles of Database Management Systems

Hash-Based Indexing 165

Motivation for Sorting. External Sorting: Overview. Outline. CSE 190D Database System Implementation. Topic 3: Sorting. Chapter 13 of Cow Book

CSE 444: Database Internals. Lectures 5-6 Indexing

Evaluation of Relational Operations: Other Techniques

Implementing Relational Operators: Selection, Projection, Join. Database Management Systems, R. Ramakrishnan and J. Gehrke 1

Data on External Storage

R & G Chapter 13. Implementation of single Relational Operations Choices depend on indexes, memory, stats, Joins Blocked nested loops:

Tree-Structured Indexes

Overview of Storage and Indexing

Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications. Administrivia. Last Class. Faloutsos/Pavlo CMU /615

Announcements (March 1) Query Processing: A Systems View. Physical (execution) plan. Announcements (March 3) Physical plan execution

Datenbanksysteme II: Caching and File Structures. Ulf Leser

Extra: B+ Trees. Motivations. Differences between BST and B+ 10/27/2017. CS1: Java Programming Colorado State University

Announcements. Reading Material. Recap. Today 9/17/17. Storage (contd. from Lecture 6)

Introduction to Database Systems CSE 344

Spring 2017 QUERY PROCESSING [JOINS, SET OPERATIONS, AND AGGREGATES] 2/19/17 CS 564: Database Management Systems; (c) Jignesh M.

University of Waterloo Midterm Examination Sample Solution

Carnegie Mellon Univ. Dept. of Computer Science /615 DB Applications. Outline. (Static) Hashing. Faloutsos - Pavlo CMU SCS /615

COSC-4411(M) Midterm #1

Introduction to Indexing 2. Acknowledgements: Eamonn Keogh and Chotirat Ann Ratanamahatana

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2009 Lecture 6 - Storage and Indexing

CSE544: Principles of Database Systems. Lectures 5-6 Database Architecture Storage and Indexes

Database Applications (15-415)

Evaluation of Relational Operations: Other Techniques

CS 186/286 Spring 2018 Midterm 1

CMU SCS CMU SCS Who: What: When: Where: Why: CMU SCS

Hash table example. B+ Tree Index by Example Recall binary trees from CSE 143! Clustered vs Unclustered. Example

Join Algorithms. Lecture #12. Andy Pavlo Computer Science Carnegie Mellon Univ. Database Systems / Fall 2018

7. Query Processing and Optimization

CSE 530A. B+ Trees. Washington University Fall 2013

CS 222/122C Fall 2017, Final Exam. Sample solutions

What Developers must know about DB2 for z/os indexes

Introduction to Database Systems CSE 414. Lecture 26: More Indexes and Operator Costs

Lecture 12. Lecture 12: Access Methods

Implementation of Relational Operations: Other Operations

Database Systems CSE 414

Chapter 13: Indexing. Chapter 13. ? value. Topics. Indexing & Hashing. value. Conventional indexes B-trees Hashing schemes (self-study) record

What happens. 376a. Database Design. Execution strategy. Query conversion. Next. Two types of techniques

Introduction to Query Processing and Query Optimization Techniques. Copyright 2011 Ramez Elmasri and Shamkant Navathe

CS143: Index. Book Chapters: (4 th ) , (5 th ) , , 12.10

Data Modeling and Databases Ch 10: Query Processing - Algorithms. Gustavo Alonso Systems Group Department of Computer Science ETH Zürich

Readings. Important Decisions on DB Tuning. Index File. ICOM 5016 Introduction to Database Systems

Evaluation of Relational Operations: Other Techniques

Topics to Learn. Important concepts. Tree-based index. Hash-based index

Database Applications (15-415)

CARNEGIE MELLON UNIVERSITY DEPT. OF COMPUTER SCIENCE DATABASE APPLICATIONS

Data Modeling and Databases Ch 9: Query Processing - Algorithms. Gustavo Alonso Systems Group Department of Computer Science ETH Zürich

Midterm 1: CS186, Spring I. Storage: Disk, Files, Buffers [11 points] SOLUTION. cs186-

Lecture 8 Index (B+-Tree and Hash)

Query Processing: A Systems View. Announcements (March 1) Physical (execution) plan. CPS 216 Advanced Database Systems

Administração e Optimização de Bases de Dados 2012/2013 Index Tuning

Lecture 12. Lecture 12: The IO Model & External Sorting

Guest Lecture. Daniel Dao & Nick Buroojy

RAID in Practice, Overview of Indexing

PS2 out today. Lab 2 out today. Lab 1 due today - how was it?

Storage and Indexing

CS 186/286 Spring 2018 Midterm 1

Query Processing. Introduction to Databases CompSci 316 Fall 2017

Something to think about. Problems. Purpose. Vocabulary. Query Evaluation Techniques for large DB. Part 1. Fact:

Module 4: Tree-Structured Indexing

Lecture 13. Lecture 13: B+ Tree

Announcement. Reading Material. Overview of Query Evaluation. Overview of Query Evaluation. Overview of Query Evaluation 9/26/17

Review: Memory, Disks, & Files. File Organizations and Indexing. Today: File Storage. Alternative File Organizations. Cost Model for Analysis

Transcription:

Last Class Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 - DB Applications C. Faloutsos A. Pavlo Lecture#12: External Sorting (R&G, Ch13) Static Hashing Extendible Hashing Linear Hashing Hashing vs. B-trees 15-415/615 2 Today's Class Why Do We Need to Sort? Sorting Overview Two-way Merge Sort External Merge Sort Optimizations B+trees for Sorting SELECT...ORDER BY Bulk loading B+ tree index. Duplicate elimination (DISTINCT) SELECT...GROUP BY Sort-merge join algorithm. 15-415/615 3 15-415/615 4

Why Do We Need to Sort? Overview What do we do if the data that we want to sort is larger than the amount of memory that is available to the DBMS? What if multiple queries are running at the same time and they all want to sort data? Files are broken up into N pages. The DBMS has a finite number of B fixedsize buffers. Let s start with a simple example Why not just use virtual memory? 15-415/615 5 15-415/615 6 Two-way External Merge Sort Pass 0: Read a page, sort it, write it. Only one buffer page is used Pass 1,2,3, : requires 3 buffer pages Merge pairs of runs into runs twice as long Three buffer pages used. Memory Memory Memory 7 Two-way External Merge Sort Each pass we read + write each page in file. # of passes = log 2 N + 1 So total I/O cost is: = 2N ( log 2 N + 1) Divide and conquer: Sort subfiles and merge #0 #1 3,4 #2 #3 Input file 6,2 9,4 8,7 5,6 3,1 2 3,4 2,6 4,9 7,8 5,6 1,3 2 2,3 4,6 2,3 4,4 6,7 8,9 4,7 8,9 1,2 2,3 3,4 4,5 6,6 7,8 9 1,3 5,6 2 1,2 3,5 6 NULL 8 1-Page 2-Page 4-Page 8-Page

Two-way External Merge Sort General External Merge Sort This only requires three buffer pages. Even if we have more buffer space available, this algorithm does not utilize it effectively. B>3 buffer pages. How to sort a file with N pages? Let s look at the general algorithm 15-415/615 9 15-415/615 10 General External Merge Sort Example Pass 0: Use B buffer pages. Produce N / B sorted runs of size B. Pass 1,2,3, : Merge B-1 runs. Number of passes: 1+ log B-1 N / B Cost = 2N (# of passes) Sort 108 page file with 5 buffer pages: Pass 0: 108 / 5 = 22 sorted runs of 5 pages each (last run is only 3 pages) Pass 1: 22 / 4 = 6 sorted runs of 20 pages each (last run is only 8 pages) Pass 2: 2 sorted runs, 80 pages and 28 pages Pass 3: Sorted file of 108 pages 1+ log B-1 N / B = 1+ log 4 22 = 1+ 2.229... 4 passes 15-415/615 11 15-415/615 12

Today's Class Blocked I/O & Double-buffering Sorting Overview Two-way Merge Sort External Merge Sort Optimizations B+trees for Sorting So far, we assumed random disk access. The cost changes if we consider that runs are written (and read) sequentially. What could we do to exploit it? Blocked I/O: Exchange a few random reads for several sequential ones using bigger pages. Double-buffering: Mask I/O delays with prefetching. 15-415/615 13 15-415/615 14 Blocked I/O Blocked I/O Normally, B buffers of size (say) 4K INSTEAD: B/b buffers, of size b kilobytes Normally, B buffers of size (say) 4K INSTEAD: B/b buffers, of size b kilobytes Advantages? Fewer random disk accesses because some of them are sequential. Disadvantages? Less parallelization. 15 15-415/615 16

Blocked I/O & Double-buffering Double-buffering So far, we assumed random disk access Cost changes, if we consider that runs are written (and read) sequentially What could we do to exploit it? Blocked I/O: Exchange a few random reads for several sequential ones using bigger pages. Double-buffering: Mask I/O delays with prefetching. 15-415/615 17 Normally, when, say INPUT1 is exhausted We issue a read request and then we wait 6 Main memory buffers 18 Double-buffering Double-buffering We can prefetch the next block into a spare buffer using a asynchronous thread. This potentially requires more passes. But in practice, most files still sorted in 2-3 passes. 6 Main memory buffers 19 6 Main memory buffers 20

Today's Class Using B+ Trees for Sorting Sorting Overview Two-way Merge Sort External Merge Sort Optimizations B+trees for Sorting Scenario: Table to be sorted has B+ tree index on sorting column(s). Idea: Can retrieve records in order by traversing leaf pages. Is this a good idea? Cases to consider: B+ tree is clustered Good Idea! B+ tree is not clustered Could be Bad! 15-415/615 21 15-415/615 22 Clustered B+Tree for Sorting Unclustered B+Tree for Sorting Traverse to the leftmost leaf, then retrieve all leaf pages. Index (Directs search) Chase each pointer to the page that contains the data. Index (Directs search) Data Entries ("Sequence set") Data Entries ("Sequence set") Data Records Always better than external sorting! Data Records In general, one I/O per data record! 15-415/615 23 15-415/615 24

Summary External sorting is important. External merge sort minimizes disk I/O: Pass 0: Produces sorted runs of size B (# buffer pages). Later Passes: Merge runs. Clustered B+ tree is good for sorting; unclustered tree is usually bad. 15-415/615 25