Announcements. Reading Material. Today. Different File Organizations. Selection of Indexes 9/24/17. CompSci 516: Database Systems

Similar documents
CompSci 516: Database Systems

Storage and Indexing

Review of Storage and Indexing

Overview of Storage and Indexing

Overview of Storage and Indexing

The use of indexes. Iztok Savnik, FAMNIT. IDB, Indexes

Overview of Indexing. Chapter 8 Part II. A glimpse at indices and workloads

Overview of Storage and Indexing

Single Record and Range Search

Overview of Storage and Indexing. Data on External Storage

Why Is This Important? Overview of Storage and Indexing. Components of a Disk. Data on External Storage. Accessing a Disk Page. Records on a Disk Page

Overview of Storage and Indexing

Indexing. Chapter 8, 10, 11. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1

INDEXES MICHAEL LIUT DEPARTMENT OF COMPUTING AND SOFTWARE MCMASTER UNIVERSITY

Lecture 34 11/30/15. CMPSC431W: Database Management Systems. Instructor: Yu- San Lin

Lecture 8 Index (B+-Tree and Hash)

External Sorting. Chapter 13. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1

Lecture #16 (Physical DB Design)

CS 443 Database Management Systems. Professor: Sina Meraji

Introduction to Data Management. Lecture #17 (Physical DB Design!)

CSIT5300: Advanced Database Systems

Announcements. Reading Material. Recap. Today 9/17/17. Storage (contd. from Lecture 6)

Modern Database Systems Lecture 1

Friday Nights with Databases!

Data on External Storage

Physical Database Design and Tuning. Review - Normal Forms. Review: Normal Forms. Introduction. Understanding the Workload. Creating an ISUD Chart

External Sorting Implementing Relational Operators

Announcement. Reading Material. Overview of Query Evaluation. Overview of Query Evaluation. Overview of Query Evaluation 9/26/17

Review: Memory, Disks, & Files. File Organizations and Indexing. Today: File Storage. Alternative File Organizations. Cost Model for Analysis

Physical Database Design and Tuning. Chapter 20

Physical Database Design and Tuning

External Sorting. Chapter 13. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1

Step 4: Choose file organizations and indexes

CS542. Algorithms on Secondary Storage Sorting Chapter 13. Professor E. Rundensteiner. Worcester Polytechnic Institute

Lecture 36 12/4/15. CMPSC431W: Database Management Systems. Instructor: Yu- San Lin

Spring 2017 EXTERNAL SORTING (CH. 13 IN THE COW BOOK) 2/7/17 CS 564: Database Management Systems; (c) Jignesh M. Patel,

Context. File Organizations and Indexing. Cost Model for Analysis. Alternative File Organizations. Some Assumptions in the Analysis.

Hash-Based Indexing 165

CS330. Some Logistics. Three Topics. Indexing, Query Processing, and Transactions. Next two homework assignments out today Extra lab session:

Overview. Understanding the Workload. Physical Database Design And Database Tuning. Chapter 20

Overview of Storage and Indexing

EXTERNAL SORTING. Sorting

CAS CS 460/660 Introduction to Database Systems. File Organization and Indexing

Database Applications (15-415)

Chapter 1: overview of Storage & Indexing, Disks & Files:

CompSci 516 Data Intensive Computing Systems. Lecture 11. Query Optimization. Instructor: Sudeepa Roy

External Sorting. Chapter 13. Comp 521 Files and Databases Fall

Implementing Relational Operators: Selection, Projection, Join. Database Management Systems, R. Ramakrishnan and J. Gehrke 1

Last Class Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications

Database Applications (15-415)

CompSci 516 Data Intensive Computing Systems

Evaluation of Relational Operations

Readings. Important Decisions on DB Tuning. Index File. ICOM 5016 Introduction to Database Systems

Discuss physical db design and workload What choises we have for tuning a database How to tune queries and views

CSE 444: Database Internals. Lectures 5-6 Indexing

Indexes. File Organizations and Indexing. First Question to Ask About Indexes. Index Breakdown. Alternatives for Data Entries (Contd.

Administrivia. Physical Database Design. Review: Optimization Strategies. Review: Query Optimization. Review: Database Design

Records in a file are grouped into buckets. Search key values are organized in a tree. The highest level is the root

CSE 544 Principles of Database Management Systems

Lecture 12. Lecture 12: Access Methods

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2009 Lecture 6 - Storage and Indexing

CompSci 516: Database Systems. Lecture 20. Parallel DBMS. Instructor: Sudeepa Roy

Introduction to Database Systems CSE 344

CPSC 421 Database Management Systems. Lecture 19: Physical Database Design Concurrency Control and Recovery

Evaluation of Relational Operations: Other Techniques

Lecture 13. Lecture 13: B+ Tree

Overview of Storage and Indexing

CSC 261/461 Database Systems Lecture 17. Fall 2017

Administriva. CS 133: Databases. General Themes. Goals for Today. Fall 2018 Lec 11 10/11 Query Evaluation Prof. Beth Trushkowsky

Database Systems CSE 414

Data Storage. Query Performance. Index. Data File Types. Introduction to Data Management CSE 414. Introduction to Database Systems CSE 414

Database Management System

Database Management Systems (COP 5725) Homework 3

Database Systems CSE 414

Principles of Data Management. Lecture #9 (Query Processing Overview)

Sorting a file in RAM. External Sorting. Why Sort? 2-Way Sort of N pages Requires Minimum of 3 Buffers Pass 0: Read a page, sort it, write it.

Database Systems CSE 414

Database Applications (15-415)

Overview of Storage & Indexing (i)

Lecture 5 Design Theory and Normalization

192 Chapter 14. TotalCost=3 (1, , 000) = 6, 000

Dtb Database Systems. Announcement

Introduction to Data Management. Lecture #13 (Indexing)

Introduction to Database Systems CSE 344

CompSci 516: Database Systems

Lecture 3 SQL - 2. Today s topic. Recap: Lecture 2. Basic SQL Query. Conceptual Evaluation Strategy 9/3/17. Instructor: Sudeepa Roy

Lecture 2 SQL. Instructor: Sudeepa Roy. CompSci 516: Data Intensive Computing Systems

Why Sort? Data requested in sorted order. Sor,ng is first step in bulk loading B+ tree index. e.g., find students in increasing GPA order

Indexing: B + -Tree. CS 377: Database Systems

RAID in Practice, Overview of Indexing

Introduction to Data Management. Lecture 14 (Storage and Indexing)

Evaluation of Relational Operations: Other Techniques

Query Processing. Lecture #10. Andy Pavlo Computer Science Carnegie Mellon Univ. Database Systems / Fall 2018

Lecture 3 More SQL. Instructor: Sudeepa Roy. CompSci 516: Database Systems

RELATIONAL OPERATORS #1

Evaluation of Relational Operations: Other Techniques

Database Design and Tuning

Tree-Structured Indexes

Carnegie Mellon Univ. Dept. of Computer Science Database Applications. Why Sort? Why Sort? select... order by

Introduction to Data Management. Lecture 21 (Indexing, cont.)

Transcription:

CompSci 516 Database Systems Lecture 9 Index Selection and External Sorting Announcements Private project threads created on piazza Please use these threads (and not emails) for all communications on your project Project proposal deadline extended until 9/28 Thursday 5 pm Instructor: Sudeepa Roy Duke CS, Fall 2017 CompSci 516: Database Systems 1 Duke CS, Fall 2017 CompSci 516: Database Systems 2 Today Reading Material Index selection External sort Index: as in Lecture 7/8 External sorting: [RG] External sorting: Chapter 13 [GUW] Chapter 14 Acknowledgement: The following slides have been created adapting the instructor material of the [RG] book provided by the authors Dr. Ramakrishnan and Dr. Gehrke. Duke CS, Fall 2017 CompSci 516: Database Systems 3 Duke CS, Fall 2016 CompSci 516: Data Intensive Computing Systems 4 Different File Organizations Search key = <age, sal> We need to understand the importance of appropriate file organization and index Consider following options: Selection of Indexes Heap files random order; insert at end-of-file Sorted files sorted on <age, sal> Clustered B+ tree file search key <age, sal> Heap file with unclustered B + -tree index on search key <age, sal> Heap file with unclustered hash index on search key <age, sal> Duke CS, Fall 2017 CompSci 516: Database Systems 5 Duke CS, Fall 2017 CompSci 516: Database Systems 6 1

Possible Operations Scan Fetch all records from disk to buffer pool Equality search Find all employees with age = 23 and sal = 50 Fetch page from disk, then locate qualifying record in page Range selection Find all employees with age > 35 Insert a record identify the page, fetch that page from disk, inset record, write back to disk (possibly other pages as well) Delete a record similar to insert Duke CS, Fall 2017 CompSci 516: Database Systems 7 Understanding the Workload A workload is a mix of queries and updates For each query in the workload: Which relations does it access? Which attributes are retrieved? Which attributes are involved in selection/join conditions? How selective are these conditions likely to be? For each update in the workload: Which attributes are involved in selection/join conditions? How selective are these conditions likely to be? The type of update (INSERT/DELETE/UPDATE), and the attributes that are affected Duke CS, Fall 2017 CompSci 516: Database Systems 8 Choice of Indexes What indexes should we create? Which relations should have indexes? What field(s) should be the search key? Should we build several indexes? For each index, what kind of an index should it be? Clustered? Hash/tree? More on Choice of Indexes One approach: Consider the most important queries Consider the best plan using the current indexes See if a better plan is possible with an additional index. If so, create it. Obviously, this implies that we must understand how a DBMS evaluates queries and creates query evaluation plans We will learn query execution and optimization later - For now, we discuss simple 1-table queries. Before creating an index, must also consider the impact on updates in the workload Duke CS, Fall 2017 CompSci 516: Database Systems 9 Duke CS, Fall 2017 CompSci 516: Database Systems 10 Trade-offs for Indexes Index Selection Guidelines Attributes in WHERE clause are candidates for index keys Exact match condition suggests hash index Range query suggests tree index Clustering is especially useful for range queries can also help on equality queries if there are many duplicates Try to choose indexes that benefit as many queries as possible Since only one index can be clustered per relation, choose it based on important queries that would benefit the most from clustering Multi-attribute search keys should be considered when a WHERE clause contains several conditions Order of attributes is important for range queries Duke CS, Fall 2017 CompSci 516: Database Systems 11 Note: clustered index should be used judiciously expensive updates, although cheaper than sorted files Duke CS, Fall 2017 CompSci 516: Database Systems 12 2

Examples of Clustered Indexes Examples of Clustered Indexes What is a good indexing strategy? Group-By query What is a good indexing strategy? SELECT E.dno WHERE E.age>40 SELECT E.dno, COUNT (*) WHERE E.age>10 GROUP BY E.dno Which attribute(s)? Clustered/Unclustered? B+ tree/hash? Which attribute(s)? Clustered/Unclustered? B+ tree/hash? Duke CS, Fall 2017 CompSci 516: Database Systems 13 Duke CS, Fall 2017 CompSci 516: Database Systems 14 Examples of Clustered Indexes Equality queries and duplicates What is a good indexing strategy? SELECT E.dno WHERE E.hobby= Stamps Which attribute(s)? Clustered/Unclustered? B+ tree/hash? SELECT E.dno WHERE E.eid=50 Indexes with Composite Search Keys Composite Search Keys: Search on a combination of fields Equality query: Every field value is equal to a constant value. E.g. wrt <sal,age> index: age=20 and sal =75 Range query: Some field value is not a constant. E.g.: sal > 10 which combination(s) would help? Examples of composite key indexes using lexicographic order. 11,80 12,10 12,20 13,75 <age, sal> 10,12 20,12 75,13 80,11 name age sal bob 12 10 cal 11 80 joe 12 <sal, age> Data entries in index sorted by <sal,age> 20 sue 13 75 Data records sorted by name 11 12 12 13 <age> 10 20 75 80 <sal> Data entries sorted by <sal> Duke CS, Fall 2017 CompSci 516: Database Systems 15 Composite Search Keys To retrieve Emp records with age=30 AND sal=4000, an index on <age,sal> would be better than an index on age or an index on sal first find age = 30, among them search sal = 4000 If condition is: 20<age<30 AND 3000<sal<5000: Clustered tree index on <age,sal> or <sal,age> is best. Index-Only Plans A number of queries can be answered without retrieving any tuples from one or more of the relations involved if a suitable index is available SELECT E.dno, COUNT(*) GROUP BY E.dno SELECT E.dno, MIN(E.sal) GROUP BY E.dno If condition is: age=30 AND 3000<sal<5000: Clustered <age,sal> index much better than <sal,age> index more index entries are retrieved for the latter Composite indexes are larger, updated more often For index-only strategies, clustering is not important SELECT AVG(E.sal) WHERE E.age=25 AND E.sal BETWEEN 3000 AND 5000 Duke CS, Fall 2017 CompSci 516: Database Systems 17 Duke CS, Fall 2017 CompSci 516: Database Systems 18 3

Why Sort? External Sorting quick review of mergesort on whiteboard Duke CS, Fall 2016 CompSci 516: Data Intensive Computing Systems 19 Duke CS, Fall 2016 CompSci 516: Data Intensive Computing Systems 20 2-Way Sort: Requires 3 Buffers Suppose N = 2 k pages in the file Pass 0: Read a page, sort it, write it. repeat for all 2 k pages only one buffer page is used Pass 1: Read two pages, sort (merge) them using one output page, write them to disk repeat 2 k-1 times three buffer pages used Pass 2, 3, 4,.. continue INPUT 1 INPUT 2 OUTPUT Main memory buffers Duke CS, Fall 2016 CompSci 516: Data Intensive Computing Systems 21 Two-Way External Merge Sort Each sorted sub-file is called a run each run can contain multiple pages Each pass we read + write each page in file. N pages in the file, => the number of passes = élog ù + 2 N 1 So toal cost is: 2 N ( élog 2 N ù + 1) Not too practical, but useful to learn basic concepts for external sorting 3,4 6,2 9,4 8,7 5,6 3,1 2 3,4 2,6 4,9 7,8 5,6 1,3 2 2,3 4,6 2,3 4,4 6,7 8,9 4,7 8,9 1,2 2,3 3,4 4,5 6,6 7,8 9 1,3 5,6 2 1,2 3,5 6 Input file PASS 0 1-page runs PASS 1 2-page runs PASS 2 4-page runs PASS 3 8-page runs 9 General External Merge Sort Suppose we have more than 3 buffer pages. How can we utilize them? To sort a file with N pages using B buffer pages: Pass 0: use B buffer pages: Produce N/B sorted runs of B pages each. Pass 1, 2,, etc.: merge B-1 runs to one output page keep writing to disk once the output page is full... INPUT 1 INPUT 2 OUTPUT...... INPUT B-1 B Main memory buffers Duke CS, Fall 2016 CompSci 516: Data Intensive Computing Systems 23 Cost of External Merge Sort Number of passes:1 + log B-1 N/B Cost = 2N * (# of passes) why 2 times? E.g., with 5 buffer pages, to sort 108 page file: Pass 0: sorting 5 pages at a time 108/5 = 22 sorted runs of 5 pages each (last run is only 3 pages) Pass 1: 4-way merge 22/4 = 6 sorted runs of 20 pages each (last run is only 8 pages) Pass 2: 4-way merge (but 2-way for the last two runs) [6/4 = 2 sorted runs, 80 pages and 28 pages Pass 3: 2-way merge (only 2 runs remaining) Sorted file of 108 pages Duke CS, Fall 2016 CompSci 516: Data Intensive Computing Systems 24 4

Number of Passes of External Sort High B is good, although CPU cost increases N B=3 B=5 B=9 B=17 B=129 B=257 100 7 4 3 2 1 1 1,000 10 5 4 3 2 2 10,000 13 7 5 4 2 2 100,000 17 9 6 5 3 3 1,000,000 20 10 7 5 3 3 10,000,000 23 12 8 6 4 3 100,000,000 26 14 9 7 4 4 1,000,000,000 30 15 10 8 5 4 12 I/O for External Merge Sort If 10 buffer pages either merge 9 runs at a time with one output buffer or 8 runs with two output buffers If #page I/O is the metric goal is minimize the #passes each page is read and written in each pass If we decide to read a block of b pages sequentially Suggests we should make each buffer (input/output) be a block of pages But this will reduce fan-out during merge passes i.e. not as many runs can be merged again any more In practice, most files still sorted in 2-3 passes Duke CS, Fall 2016 CompSci 516: Data Intensive Computing Systems 26 Double Buffering To reduce CPU wait time for I/O request to complete, can prefetch into `shadow block. INPUT 1 INPUT 1' INPUT 2 INPUT 2' INPUT k INPUT k' OUTPUT OUTPUT' b block size Using B+ Trees for Sorting Scenario: Table to be sorted has B+ tree index on sorting column(s) Idea: Can retrieve data entries (then records) in order by traversing leaf pages. Is this a good idea? Cases to consider: B+ tree is clustered: Good idea! B+ tree is not clustered: Could be a very bad idea! B main memory buffers, k-way merge Duke CS, Fall 2016 CompSci 516: Data Intensive Computing Systems 27 Duke CS, Fall 2016 CompSci 516: Data Intensive Computing Systems 28 Clustered B+ Tree Used for Sorting Unclustered B+ Tree Used for Sorting Cost: root to the left-most leaf, then retrieve all leaf pages (Alternative 1) Index (Directs search) Alternative (2) for data entries; each data entry contains rid of a data record In general, one I/O per data record! If Alternative 2 is used? Additional cost of retrieving data records: each page fetched just once Data Records Data Entries ("Sequence set") Index (Directs search) Data Entries ("Sequence set") Always better than external sorting! 17 Data Records Duke CS, Fall 2016 CompSci 516: Data Intensive Computing Systems 30 5

Summary External sorting is important; DBMS may dedicate part of buffer pool for sorting! External merge sort minimizes disk I/O cost: Pass 0: Produces sorted runs of size B (# buffer pages) Later passes: merge runs # of runs merged at a time depends on B, and block size. Larger block size means less I/O cost per page. Larger block size means smaller # runs merged. In practice, # of runs rarely more than 2 or 3 Duke CS, Fall 2016 CompSci 516: Data Intensive Computing Systems 31 6