CSC 261/461 Database Systems Lecture 17. Fall 2017


Announcement: Quiz 6 due tonight at 11:59 pm. Project 1 Milepost 3 due Nov 10. Project 2 Part 2 (optional) due Nov 15.

The IO Model & External Sorting

Today's Lecture 1. Chapter 16 (Disk Storage, File Structure and Hashing) 2. Chapter 17 (Indexing)

Self-Study: Buffering of Blocks; Pin Count and Dirty Bit; Buffer Replacement Strategies: Least Recently Used (LRU), Clock Policy, First In First Out (FIFO), Most Recently Used (MRU)

RECORDS AND FILES

Records and Record Types Data in a database = a set of records organized into a set of files. Each record consists of a collection of related data values or items. Each value corresponds to a particular attribute and takes one or more bytes.

Fixed Length Records vs Variable Length Records Fig 16.5 (from textbook)

Record Blocking Block size = B bytes. Fixed record size = R bytes. With B > R: Records per block (blocking factor): $bfr = \lfloor B / R \rfloor$. Unused space per block: $B - (bfr \times R) = B \bmod R$ bytes. Number of blocks required for a file of r records: $\lceil r / bfr \rceil$. Two scenarios: Spanned: records can span more than one block. Unspanned: records can't span more than one block.
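The blocking arithmetic on this slide can be checked with a few lines of Python. This is a minimal sketch for the unspanned, fixed-length case; the function name `blocking_stats` and the parameter `num_records` (the number of records r in the file) are illustrative names, not taken from the slides.

```python
import math

def blocking_stats(block_size: int, record_size: int, num_records: int):
    """Unspanned, fixed-length records: return (bfr, unused_bytes, blocks_needed)."""
    if record_size > block_size:
        raise ValueError("unspanned organization needs record_size <= block_size")
    bfr = block_size // record_size            # records per block: floor(B / R)
    unused = block_size - bfr * record_size    # same value as B mod R
    blocks = math.ceil(num_records / bfr)      # ceil(r / bfr)
    return bfr, unused, blocks

# Hypothetical numbers: 512-byte blocks, 100-byte records, 1000 records.
print(blocking_stats(512, 100, 1000))  # (5, 12, 200)
```

With spanned records the 12 unused bytes per block could hold the start of the next record, at the cost of reading two blocks for some records.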

Unspanned vs Spanned Unspanned Spanned

Heap Files vs Sorted Files
- Insertion (heap vs sorted). For sorted files: overflow.
- Deletion: deletion marker.
- Modifying. For sorted files: may consist of a deletion and an insertion.
- Searching: how many block accesses? (by record number) What if the records are numbered and are of fixed size?
- Searching by range (heap vs sorted).

Another variety Another type of files beyond: Heap Files and Sorted Files Hash Files

2. HASHING TECHNIQUES

IF YOU DON'T FIND IT IN THE INDEX, LOOK VERY CAREFULLY THROUGH THE ENTIRE CATALOG - Sears, Roebuck and Co., Consumers Guide, 1897

HASHING: GENERAL IDEAS - Hashing first proposed by Arnold Dumey (1956) - Hash codes - Chaining - Open addressing

Example Java HashCode

Top level view: arbitrary objects (strings, doubles, ints) → hash code → int with wide range → compression function → {0, 1, ..., m-1}. n = number of objects actually used; m = size of the target range. The overall mapping is h(object); we will also call this the hash function.

Example (Once again!) Java HashCode

True / False: 1. Unequal objects must have different hash codes. 2. Objects with the same hash code must be equal. Both wrong! We would like this, but it's impossible to achieve in general. It is still possible for a particular, known set of values, but impossible if the input values are unknown prior to applying the hash code.

Good Hash Function If $key_1 \neq key_2$, then it's extremely unlikely that $h(key_1) = h(key_2)$. Collision problem! Pigeonhole principle: K+1 pigeons, K holes → at least one hole holds at least 2 pigeons.

Division method $h(s) = s \bmod m$. How does this function perform for different m?
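One way to explore the question is to count how many keys land in each bucket for different values of m. The sketch below is only an illustration: the key set (ten multiples of 8) and the helper name `bucket_loads` are assumptions chosen to make the effect visible.

```python
from collections import Counter

def h(s: int, m: int) -> int:
    """Division method: the hash of integer key s in a table of size m."""
    return s % m

def bucket_loads(keys, m):
    """Return a list whose i-th entry is the number of keys hashing to bucket i."""
    counts = Counter(h(k, m) for k in keys)
    return [counts.get(i, 0) for i in range(m)]

keys = range(0, 80, 8)          # 0, 8, 16, ..., 72: all multiples of 8
print(bucket_loads(keys, 8))    # [10, 0, 0, 0, 0, 0, 0, 0]
print(bucket_loads(keys, 7))    # [2, 2, 2, 1, 1, 1, 1]
```

When m shares a factor with a pattern in the keys (here m = 8), many keys collide; a prime m such as 7 spreads the same keys far more evenly.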

COLLISION RESOLUTION: Separate chaining, Open addressing

Separate Chaining [Figure: a hash table with indexes 0 to 4; each index holds a pointer to a chain of the keys that hash to it. Keys shown: Turing, Cantor, Knuth, Dijkstra, Karp.]
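Separate chaining can be sketched as a list of chains, one per index. This is a minimal illustration, not the slide's exact table: the class name `ChainedHashTable` and the toy hash (sum of character codes mod m) are assumptions, so the keys will not necessarily land at the same indexes as in the figure.

```python
class ChainedHashTable:
    def __init__(self, m: int = 5):
        self.m = m
        self.slots = [[] for _ in range(m)]   # one chain (Python list) per index

    def _index(self, key: str) -> int:
        # Toy hash function (an assumption for this sketch).
        return sum(ord(ch) for ch in key) % self.m

    def insert(self, key: str, value) -> None:
        chain = self.slots[self._index(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:                      # key already present: overwrite
                chain[i] = (key, value)
                return
        chain.append((key, value))            # collision or empty slot: extend the chain

    def get(self, key: str):
        for k, v in self.slots[self._index(key)]:
            if k == key:
                return v
        raise KeyError(key)

table = ChainedHashTable(m=5)
for name in ["Turing", "Cantor", "Knuth", "Dijkstra", "Karp"]:
    table.insert(name, len(name))
print(table.get("Knuth"))   # 5
```

Lookups cost one hash computation plus a scan of a single chain, so performance depends on keeping chains short.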

Open Addressing Store all entries in the hash table itself, with no pointers to the outside. If a particular index i is full, try the next index (i + c), where c is a constant. Advantages: less wasted space; perhaps good cache usage. Disadvantages: more complex collision resolution; slower operations.

Open Addressing (c = 2) [Figure: a hash table with indexes 0 to 7 holding Turing, Cantor, Knuth, Karp and Dijkstra. Probes shown: h(Knuth, 0), h(Knuth, 1), h(Karp, 0), h(Karp, 1), h(Dijkstra, 0), h(Dijkstra, 1), h(Dijkstra, 2).]
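The probing rule "if index i is full, try i + c" can be sketched as below. Several details are assumptions for illustration: the probe function h(key, i) = (h0(key) + i*c) mod m, the toy base hash h0, and a table size of m = 7 (chosen so that c = 2 and m share no common factor and every slot is reachable). Deletion is left out because it needs extra bookkeeping such as deletion markers.

```python
class OpenAddressingTable:
    def __init__(self, m: int = 7, c: int = 2):
        self.m, self.c = m, c
        self.slots = [None] * m               # all entries live in the table itself

    def _h0(self, key: str) -> int:
        # Toy base hash (an assumption for this sketch).
        return sum(ord(ch) for ch in key) % self.m

    def probe(self, key: str, i: int) -> int:
        """Index tried on the i-th attempt: h(key, i) = (h0(key) + i*c) mod m."""
        return (self._h0(key) + i * self.c) % self.m

    def insert(self, key: str) -> int:
        for i in range(self.m):
            idx = self.probe(key, i)
            if self.slots[idx] is None or self.slots[idx] == key:
                self.slots[idx] = key
                return idx                    # slot where the key ended up
        raise RuntimeError("no free slot on this probe sequence")

    def contains(self, key: str) -> bool:
        for i in range(self.m):
            idx = self.probe(key, i)
            if self.slots[idx] is None:       # an empty slot ends the probe sequence
                return False
            if self.slots[idx] == key:
                return True
        return False

table = OpenAddressingTable()
for name in ["Turing", "Cantor", "Knuth", "Karp", "Dijkstra"]:
    table.insert(name)
print(table.contains("Karp"), table.contains("Hopper"))   # True False
```

If c and m share a factor (for example c = 2 with m = 8), a probe sequence visits only part of the table, so an insert can fail even when free slots remain.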

External Hashing: hashing for disk files. The target address space is made of buckets. The hashing function maps a key into a relative bucket number. Convert the bucket number into the corresponding disk block address.

Bucket Number to Disk Block address


REVIEW

What Did We Learn Disk Storage Hardware Description of Disk Devices Sections to study: 16.2.1 and 16.2.2 Buffering of Blocks Buffer Management Buffer Replacement Strategies Sections to study: 16.3 Placing File Records on Disk Records and Record Types Fixed-Length, Variable-Length, Spanned, Unspanned Sections to study: 16.4.1 to 16.4.4

What did we learn Operations on Files Insert, Modify, and Delete And others. Sections to study: 16.5 Heap Files vs Sorted Files Sections to study: 16.6 and 16.7 Hash Files Internal Hashing and External Hashing Sections to study: 16.8.1 and 16.8.2

INDEXING

Types of Indexing Primary Indexes Clustering Indexes Secondary Indexes Multilevel Indexes Dynamic Multilevel Indexes Hash Indexes

What you will learn about in this section 1. Indexes: Motivation 2. Indexes: Basics

Index Motivation Person(name, age). Suppose we want to search for people of a specific age. First idea: sort the records by age; we know how to do this fast! How many IO operations to search over N sorted records? Simple scan: $O(N)$. Binary search: $O(\log_2 N)$. Could we get even cheaper search? E.g. go from $\log_2 N \to \log_{200} N$?
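To get a feel for the gap between base 2 and base 200, the sketch below counts how many levels a search needs when each step narrows the candidates by a given factor. The helper name `levels` and the choice of N = 1,000,000 are illustrative assumptions; an integer loop is used instead of a floating-point logarithm to avoid rounding surprises.

```python
def levels(n: int, fanout: int) -> int:
    """Smallest h with fanout**h >= n, i.e. ceil(log_fanout(n)) for n >= 1."""
    h, reach = 0, 1
    while reach < n:
        reach *= fanout
        h += 1
    return h

n = 1_000_000
print(levels(n, 2))     # 20 steps when each step halves the candidates
print(levels(n, 200))   # 3 steps when each step narrows by a factor of 200
```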

Index Motivation What about if we want to insert a new person, but keep the list sorted? [Figure: inserting 2 into the sorted pages (1,3) (4,5) (6,7) shifts records along to give (1,2) (3,4) (5,6) (7, ).] We would have to potentially shift N records, requiring up to ~ 2*N/P IO operations (where P = # of records per page)! We could leave some slack in the pages. Could we get faster insertions?

Index Motivation What about if we want to be able to search quickly along multiple attributes (e.g. not just age)? We could keep multiple copies of the records, each sorted by one attribute set, but this would take a lot of space. Can we get fast search over multiple attribute (sets) without taking too much space? We'll create separate data structures called indexes to address all these points.

Further Motivation for Indexes: NoSQL! NoSQL engines are (basically) just indexes! A lot more is left to the user in NoSQL; one of the primary remaining functions of the DBMS is still to provide indexes over the data records, for the reasons we just saw! Sometimes they use B+ Trees, sometimes hash indexes. Indexes are critical across all DBMS types.

Indexes: High-level An index on a file speeds up selections on the search key fields for the index. Search key properties: any subset of fields; it is not the same as the key of a relation. Example: On which attributes of Product(name, maker, price) would you build indexes?

More precisely: An index is a data structure mapping search keys to sets of rows in a database table. It provides efficient lookup & retrieval by search key value, usually much faster than searching through all the rows of the database table. An index can store: the full rows it points to (primary index), or pointers to those rows (secondary index).

Operations on an Index Search: Quickly find all records which meet some condition on the search key attributes. More sophisticated variants as well. Why? Indexing is one of the most important features provided by a database for performance.

Conceptual Example
Russian_Novels
BID  Title                 Author       Published  Full_text
001  War and Peace         Tolstoy      1869
002  Crime and Punishment  Dostoyevsky  1866
003  Anna Karenina         Tolstoy      1877
SELECT * FROM Russian_Novels WHERE Published > 1867
What if we want to return all books published after 1867? The above table might be very expensive to search over row-by-row.

Conceptual Example
By_Yr_Index
Published  BID
1866       002
1869       001
1877       003
Russian_Novels
BID  Title                 Author       Published  Full_text
001  War and Peace         Tolstoy      1869
002  Crime and Punishment  Dostoyevsky  1866
003  Anna Karenina         Tolstoy      1877
Maintain an index for this, and search over that! Why might just keeping the table sorted by year not be good enough?

Conceptual Example
By_Yr_Index
Published  BID
1866       002
1869       001
1877       003
Russian_Novels
BID  Title                 Author       Published  Full_text
001  War and Peace         Tolstoy      1869
002  Crime and Punishment  Dostoyevsky  1866
003  Anna Karenina         Tolstoy      1877
By_Author_Title_Index
Author       Title                 BID
Dostoyevsky  Crime and Punishment  002
Tolstoy      Anna Karenina         003
Tolstoy      War and Peace         001
Can have multiple indexes to support multiple search keys. Indexes are shown here as tables, but in reality we will use more efficient data structures.

Covering Indexes
By_Yr_Index
Published  BID
1866       002
1869       001
1877       003
We say that an index is covering for a specific query if the index contains all the needed attributes, meaning the query can be answered using the index alone! The needed attributes are the union of those in the SELECT and WHERE clauses.
Example: SELECT Published, BID FROM Russian_Novels WHERE Published > 1867
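The covering test is a subset check, which the following sketch spells out. The function name `is_covering` and the representation of a query as two lists of column names are assumptions made for illustration; a real optimizer works on parsed queries.

```python
def is_covering(index_columns, select_columns, where_columns) -> bool:
    """True if every attribute the query needs is stored in the index."""
    needed = set(select_columns) | set(where_columns)   # union of SELECT and WHERE
    return needed <= set(index_columns)

by_yr_index = ["Published", "BID"]

# SELECT Published, BID FROM Russian_Novels WHERE Published > 1867
print(is_covering(by_yr_index, ["Published", "BID"], ["Published"]))   # True

# SELECT Title FROM Russian_Novels WHERE Published > 1867
print(is_covering(by_yr_index, ["Title"], ["Published"]))              # False
```

In the second query the index can still locate the matching rows, but the table must be read to fetch Title.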

TYPES OF INDEXES

Types of Indexing Primary Indexes Clustering Indexes Secondary Indexes Multilevel Indexes Dynamic Multilevel Indexes Hash Indexes

Primary Indexes: Index for Sorted (Ordered) Files

Clustering Indexes (Index for Sorted (on non-key) Files)

Secondary Indexes (on a key field): a secondary means of accessing a data file. File records could be ordered, unordered, or hashed.

Secondary Indexes (on a non-key field): extra level of indirection. Provides logical ordering, though records are not physically ordered.

High-level Categories of Index Types
- Multilevel Indexes: very good for range queries, sorted data. Some old databases only implemented B-Trees. We will mostly look at a variant called B+ Trees.
- Hash Tables: very good for searching (exact-match lookups).
Real difference between structures: the costs of operations determine which index you pick and why.

MULTILEVEL INDEXES

What you will learn about in this section 1. ISAM 2. B+ Tree

1. ISAM

ISAM: Indexed Sequential Access Method. For an index with $b_i$ blocks: Earlier: $\log_2 b_i$ block accesses. Now: $\log_{fo} b_i$ block accesses (fo = fanout).

2. B+ TREES

What you will learn about in this section 1. B+ Trees: Basics 2. B+ Trees: Design & Cost 3. Clustered Indexes

B+ Trees Search trees: B does not mean binary! Idea in B Trees: make 1 node = 1 physical page. Balanced, height-adjusted tree (not the B either). Idea in B+ Trees: make leaves into a linked list (for range queries).

B+ Tree Basics Example node keys: 10 20 30. Parameter d = degree: the minimum number of keys an interior node can have. Each non-leaf (interior) node has between d and 2d keys*. *Except for the root node, which can have between 1 and 2d keys.

B+ Tree Basics Node keys: 10 20 30. Ranges: k < 10; 10 ≤ k < 20; 20 ≤ k < 30; 30 ≤ k. The n keys in a node define n+1 ranges.

B+ Tree Basics Non-leaf or internal node: 10 20 30. Child node for the range 20 ≤ k < 30: 22 25 28. For each range in a non-leaf node, there is a pointer to another node with keys in that range.

B+ Tree Basics Non-leaf or internal node: 10 20 30. Leaf nodes: [12 17] [22 25 28 29] [32 34 37 38]. Leaf nodes also have between d and 2d keys, and are different in that:

B+ Tree Basics Non-leaf or internal node: 10 20 30. Leaf nodes: [12 17] [22 25 28 29] [32 34 37 38]. Leaf nodes also have between d and 2d keys, and are different in that: their key slots contain pointers to data records. Data records: 11 15 | 21 22 27 28 | 30 33 35 37.

B+ Tree Basics Non-leaf or internal node: 10 20 30. Leaf nodes: [12 17] [22 25 28 29] [32 34 37 38]. Leaf nodes also have between d and 2d keys, and are different in that: their key slots contain pointers to data records, and they contain a pointer to the next leaf node as well, for faster sequential traversal. Data records: 11 15 | 21 22 27 28 | 30 33 35 37.

B+ Tree Basics Non-leaf or internal node: 10 20 30. Leaf nodes: [12 17] [22 25 28 29] [32 34 37 38]. Note that the pointers at the leaf level will be to the actual data records (rows). We might truncate these for simpler display (as before). Data records (Name, Age): (Joe, 11), (Jake, 15), (John, 21), (Bess, 22), (Bob, 27), (Sally, 28), (Sal, 30), (Sue, 33), (Jess, 35), (Alf, 37).

Some finer points of B+ Trees

Searching a B+ Tree
- For exact key values: start at the root; proceed down to the leaf.
SELECT name FROM people WHERE age = 25
- For range queries: as above, then sequential traversal.
SELECT name FROM people WHERE 20 <= age AND age <= 30

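B+ Tree Exact Search Animation K = 30? Root [80]: 30 < 80. Internal node [20 60]: 30 in [20,60). Leaf [20 30 40 50]: 30 in [30,40). To the data! Other nodes shown: internal node [100 120 140]; leaves [10 15 18], [60 65], [80 85 90]. Data records: 10 12 15 20 28 30 40 60 63 80 84 89. Not all nodes pictured.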

B+ Tree Range Search Animation K in [30,85]? Root [80]: 30 < 80. Internal node [20 60]: 30 in [20,60). Leaf [20 30 40 50]: 30 in [30,40). To the data! Other nodes shown: internal node [100 120 140]; leaves [10 15 18], [60 65], [80 85 90]. Data records: 10 12 15 20 28 30 40 59 63 80 84 89. Not all nodes pictured.
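Both searches can be sketched in a few lines. The tree here is a simplified two-level variant of the one in the animation (an assumption: a single root with keys 20, 60, 80 over the four pictured leaves), and the leaves hold only keys rather than pointers to data records. The class and function names are illustrative.

```python
from bisect import bisect_right

class Leaf:
    def __init__(self, keys):
        self.keys = keys
        self.next = None              # pointer to the next leaf (linked list)

class Internal:
    def __init__(self, keys, children):
        self.keys = keys              # n keys define n+1 ranges
        self.children = children      # children[i] covers keys[i-1] <= k < keys[i]

def find_leaf(node, key):
    """Start at the root and proceed down to the leaf whose range contains key."""
    while isinstance(node, Internal):
        node = node.children[bisect_right(node.keys, key)]
    return node

def search(root, key) -> bool:
    return key in find_leaf(root, key).keys

def range_search(root, lo, hi):
    """Descend to the leaf for lo, then follow next-leaf pointers until a key > hi."""
    leaf, out = find_leaf(root, lo), []
    while leaf is not None:
        for k in leaf.keys:
            if k > hi:
                return out
            if k >= lo:
                out.append(k)
        leaf = leaf.next
    return out

leaves = [Leaf([10, 15, 18]), Leaf([20, 30, 40, 50]), Leaf([60, 65]), Leaf([80, 85, 90])]
for a, b in zip(leaves, leaves[1:]):
    a.next = b
root = Internal([20, 60, 80], leaves)

print(search(root, 30))             # True
print(range_search(root, 30, 85))   # [30, 40, 50, 60, 65, 80, 85]
```

The range query pays for one root-to-leaf descent and then only touches leaves, which is why the leaf-level linked list matters.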

B+ Tree Design How large is d? Example: key size = 4 bytes, pointer size = 8 bytes, block size = 4096 bytes. We want each node to fit on a single block/page: $2d \times 4 + (2d+1) \times 8 \le 4096 \Rightarrow d \le 170$.
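The inequality can be solved for d in general. The sketch below assumes a node stores exactly 2d keys and 2d+1 pointers with no header or other per-node overhead, as in the slide's example; `max_degree` is an illustrative name.

```python
def max_degree(block_size: int, key_size: int, pointer_size: int) -> int:
    """Largest d with 2d*key_size + (2d+1)*pointer_size <= block_size.

    Rearranged: 2d * (key_size + pointer_size) <= block_size - pointer_size.
    """
    return (block_size - pointer_size) // (2 * (key_size + pointer_size))

d = max_degree(block_size=4096, key_size=4, pointer_size=8)
print(d)                                # 170
print(2 * d * 4 + (2 * d + 1) * 8)      # 4088 bytes used, which fits in 4096
```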

B+ Tree: High Fanout = Smaller & Lower IO As compared to e.g. binary search trees, B+ Trees have high fanout (between d+1 and 2d+1). This means that the depth of the tree is small → getting to any element requires very few IO operations! Also, we can often store most or all of the B+ Tree in main memory! The fanout is defined as the number of pointers to child nodes coming out of a node. Note that fanout is dynamic; we'll often assume it's constant just to come up with approximate equations!

Simple Cost Model for Search Let:
- f = fanout, which is in [d+1, 2d+1] (we'll assume it's constant for our cost model)
- N = the total number of pages we need to index
- F = fill-factor (usually ~= 2/3)
Our B+ Tree needs to have room to index N / F pages! We have the fill factor in order to leave some open slots for faster insertions. What height (h) does our B+ Tree need to be?
- h = 1 → just the root node: room to index f pages
- h = 2 → f leaf nodes: room to index $f^2$ pages
- h = 3 → $f^2$ leaf nodes: room to index $f^3$ pages
- h → $f^{h-1}$ leaf nodes: room to index $f^h$ pages!
→ We need a B+ Tree of height $h = \lceil \log_f (N / F) \rceil$
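The height estimate can be computed directly from this model. This is a sketch under the slide's assumptions (constant fanout f, fill-factor F, a tree of height h has room for f^h pages); the function name `btree_height` and the sample numbers are illustrative, and exact fractions are used so that rounding does not affect the result.

```python
import math
from fractions import Fraction

def btree_height(n_pages: int, fanout: int, fill_factor=Fraction(2, 3)) -> int:
    """Smallest h >= 1 such that fanout**h >= n_pages / fill_factor."""
    room_needed = math.ceil(Fraction(n_pages) / Fraction(fill_factor))
    h, capacity = 1, fanout           # h = 1: just the root, room for f pages
    while capacity < room_needed:
        capacity *= fanout            # each extra level multiplies capacity by f
        h += 1
    return h

# Hypothetical numbers: 1,000,000 pages, fanout 200, fill-factor 2/3.
print(btree_height(1_000_000, 200))   # 3, since 200**3 = 8,000,000 >= 1,500,000
```

Because the capacity grows as a power of f, even very large files need only three or four levels at typical fanouts.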

Fast Insertions & Self-Balancing Same cost as exact search. Self-balancing: the B+ Tree remains balanced (with respect to height) even after an insert. B+ Trees are also (relatively) fast for single insertions! However, insertion can become a bottleneck if there are many insertions (if the fill-factor slack is used up).

Insertion
1. Perform a search to determine what bucket the new record should go into.
2. If the bucket is not full (at most (b-1) entries after the insertion), add the record.
3. Otherwise, split the bucket: allocate a new leaf and move half the bucket's elements to the new bucket. Insert the new leaf's smallest key and address into the parent.
4. If the parent is full, split it too, and add the middle key to its parent node. Repeat until a parent is found that need not split.
5. If the root splits, create a new root which has one key and two pointers. (That is, the value that gets pushed to the new root gets removed from the original node.)
Note: B-trees grow at the root and not at the leaves.

Insertion (Insert 85) b = f = branching factor / fan-out = 3. Root: [40 80]. Leaves: [20] [40] [80 90].

Insertion (Insert 85) b = f = branching factor / fan-out = 3. Root: [40 80]. Leaves: [20] [40] [80 85 90]. This is what we would like. But the maximum number of keys in any node is (3-1) = 2. So, split.

Insertion (Insert 85) b = f = branching factor / fan-out = 3. Root: [40 80]. Leaves: [20] [40] [80] [85 90]. Allocate new leaf and move half the bucket's elements to the new bucket.

Insertion (Insert 85) b = f = branching factor / fan-out = 3. Root: [40 80 85]. Leaves: [20] [40] [80] [85 90]. Insert the new leaf's smallest key and address into the parent.

Insertion (Insert 85) b = f = branching factor / fan-out = 3. Root: [40 80 85]. Leaves: [20] [40] [80] [85 90]. This is not allowed, as the parent is full. Need to split.

Insertion (Insert 85) b = f = branching factor / fan-out = 3. Root: [80]. Internal nodes: [40] [85]. Leaves: [20] [40] [80] [85 90]. If the parent is full, split it too. Add the middle key to the parent node. Repeat until a parent is found that needs no splitting.

Insertion (Insert 85) b = f = branching factor / fan-out = 3. Root: [80]. Internal nodes: [40] [85]. Leaves: [20] [40] [80] [85 90]. If the root splits, create a new root which has one key and two pointers. (That is, the value that gets pushed to the new root gets removed from the original node.)
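The insertion steps can be sketched recursively: insert into the leaf, and let each split hand a separator key and a new right node back to its parent. This is a minimal sketch, not a full B+ tree: the `Node` class and function names are illustrative, nodes hold at most b-1 keys as in the example, and next-leaf pointers and data-record pointers are omitted.

```python
from bisect import bisect_right, insort

class Node:
    def __init__(self, keys, children=None):
        self.keys = keys
        self.children = children          # None for a leaf

def insert(root, key, b):
    """Insert key and return the (possibly new) root."""
    split = _insert(root, key, b)
    if split is None:
        return root
    sep, right = split
    return Node([sep], [root, right])     # root split: new root, one key, two pointers

def _insert(node, key, b):
    """Return None, or (separator, new_right_node) if node had to split."""
    if node.children is None:             # leaf
        insort(node.keys, key)
        if len(node.keys) <= b - 1:
            return None
        mid = len(node.keys) // 2
        right = Node(node.keys[mid:])     # move the upper half to a new leaf
        node.keys = node.keys[:mid]
        return right.keys[0], right       # copy the new leaf's smallest key up
    i = bisect_right(node.keys, key)
    split = _insert(node.children[i], key, b)
    if split is None:
        return None
    sep, right = split
    node.keys.insert(i, sep)
    node.children.insert(i + 1, right)
    if len(node.keys) <= b - 1:
        return None
    mid = len(node.keys) // 2
    up = node.keys[mid]                   # middle key moves up and leaves this node
    new_right = Node(node.keys[mid + 1:], node.children[mid + 1:])
    node.keys, node.children = node.keys[:mid], node.children[:mid + 1]
    return up, new_right

# The "Insert 85" example with b = 3.
root = Node([40, 80], [Node([20]), Node([40]), Node([80, 90])])
root = insert(root, 85, b=3)
print(root.keys)                                # [80]
print([child.keys for child in root.children])  # [[40], [85]]
```

Note the asymmetry: a leaf split copies its separator up (85 stays in the leaf), while an internal split moves the middle key up (80 is removed from the node that split).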

Deletion
1. Start at root, find leaf L where the entry belongs.
2. Remove the entry. If L is at least half-full, done!
3. If L has fewer entries than it should: if a sibling (adjacent node with same parent as L) is more than half-full, redistribute, borrowing an entry from it. Otherwise, the sibling is exactly half-full, so we can merge L and the sibling.
4. If a merge occurred, must delete the entry (pointing to L or sibling) from the parent of L. A merge could propagate to the root, decreasing height.
https://www.cs.usfca.edu/~galles/visualization/BPlusTree.html
The degree in this visualization is actually the fan-out f or branching factor b.

Find 86

Delete it

Stealing from right sibling (redistribute). Modify the parent node

Finally,
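The leaf-level choice between redistributing and merging can be sketched as follows. This covers only one leaf and its right sibling, represented as sorted Python lists; the function name, the `min_keys` parameter and the sample keys are assumptions for illustration, and the parent update is described in comments rather than implemented.

```python
def delete_from_leaf(leaf, right_sibling, key, min_keys):
    """Remove key from leaf, then borrow from or merge with the right sibling if needed."""
    leaf.remove(key)
    if len(leaf) >= min_keys:
        return "done"                          # still at least half-full
    if len(right_sibling) > min_keys:
        leaf.append(right_sibling.pop(0))      # steal the sibling's smallest key
        # The parent's separator must now become right_sibling[0].
        return "redistributed"
    leaf.extend(right_sibling)                 # sibling is exactly half-full: merge
    right_sibling.clear()
    # The parent must drop the entry that pointed to the right sibling.
    return "merged"

leaf, sibling = [85, 86], [90, 92, 95]
print(delete_from_leaf(leaf, sibling, 86, min_keys=2))   # redistributed
print(leaf, sibling)                                     # [85, 90] [92, 95]
```

A full implementation would also consider the left sibling and would repeat the same check on the parent after a merge, which is how the height can shrink.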

Acknowledgement Some of the slides in this presentation are taken from the slides provided by the authors. Many of these slides are taken from cs145 course offered by Stanford University.