Introduction to Data Management. Lecture 21 (Indexing, cont.)

Similar documents
Introduction to Data Management. Lecture 15 (More About Indexing)

Principles of Data Management. Lecture #5 (Tree-Based Index Structures)

Tree-Structured Indexes

Tree-Structured Indexes ISAM. Range Searches. Comments on ISAM. Example ISAM Tree. Introduction. As for any index, 3 alternatives for data entries k*:

Tree-Structured Indexes

Tree-Structured Indexes. Chapter 10

Tree-Structured Indexes

Introduction. Choice orthogonal to indexing technique used to locate entries K.

Tree-Structured Indexes

Lecture 8 Index (B+-Tree and Hash)

Administrivia. Tree-Structured Indexes. Review. Today: B-Tree Indexes. A Note of Caution. Introduction

Tree-Structured Indexes

Tree-Structured Indexes

Tree-Structured Indexes. A Note of Caution. Range Searches ISAM. Example ISAM Tree. Introduction

Announcements. Reading Material. Recap. Today 9/17/17. Storage (contd. from Lecture 6)

Introduction to Indexing 2. Acknowledgements: Eamonn Keogh and Chotirat Ann Ratanamahatana

Introduction to Data Management. Lecture #13 (Indexing)

Tree-Structured Indexes

Indexing. Chapter 8, 10, 11. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1

Tree-Structured Indexes (Brass Tacks)

Hash-Based Indexes. Chapter 11. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1

Introduction to Data Management. Lecture 14 (Storage and Indexing)

Indexes. File Organizations and Indexing. First Question to Ask About Indexes. Index Breakdown. Alternatives for Data Entries (Contd.

Database Applications (15-415)

THE B+ TREE INDEX. CS 564- Spring ACKs: Jignesh Patel, AnHai Doan

CSIT5300: Advanced Database Systems

Lecture 13. Lecture 13: B+ Tree

Department of Computer Science University of Cyprus EPL446 Advanced Database Systems. Lecture 6. B+ Trees: Structure and Functions

Selection Queries. to answer a selection query (ssn=10) needs to traverse a full path.

Spring 2017 B-TREES (LOOSELY BASED ON THE COW BOOK: CH. 10) 1/29/17 CS 564: Database Management Systems, Jignesh M. Patel 1

Hash-Based Indexes. Chapter 11

Kathleen Durant PhD Northeastern University CS Indexes

Database Applications (15-415)

Chapter 12: Indexing and Hashing (Cnt(

CSC 261/461 Database Systems Lecture 17. Fall 2017

ECS 165B: Database System Implementa6on Lecture 7

Hashed-Based Indexing

Hash-Based Indexes. Chapter 11 Ramakrishnan & Gehrke (Sections ) CPSC 404, Laks V.S. Lakshmanan 1

CS122A: Introduction to Data Management. Lecture #14: Indexing. Instructor: Chen Li

INDEXES MICHAEL LIUT DEPARTMENT OF COMPUTING AND SOFTWARE MCMASTER UNIVERSITY

Data Organization B trees

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2014/15

Database System Concepts, 6 th Ed. Silberschatz, Korth and Sudarshan See for conditions on re-use

Hash-Based Indexing 1

Topics to Learn. Important concepts. Tree-based index. Hash-based index

Database Applications (15-415)

Database Applications (15-415)

Why Is This Important? Overview of Storage and Indexing. Components of a Disk. Data on External Storage. Accessing a Disk Page. Records on a Disk Page

Principles of Data Management. Lecture #14 (Spatial Data Management)

CSE 444: Database Internals. Lectures 5-6 Indexing

CS143: Index. Book Chapters: (4 th ) , (5 th ) , , 12.10

CARNEGIE MELLON UNIVERSITY DEPT. OF COMPUTER SCIENCE DATABASE APPLICATIONS

Instructor: Amol Deshpande

Storage and Indexing

Goals for Today. CS 133: Databases. Example: Indexes. I/O Operation Cost. Reason about tradeoffs between clustered vs. unclustered tree indexes

Physical Level of Databases: B+-Trees

Chapter 11: Indexing and Hashing

Advances in Data Management Principles of Database Systems - 2 A.Poulovassilis

Extra: B+ Trees. Motivations. Differences between BST and B+ 10/27/2017. CS1: Java Programming Colorado State University

Find the block in which the tuple should be! If there is free space, insert it! Otherwise, must create overflow pages!

Principles of Data Management. Lecture #9 (Query Processing Overview)

CSE 530A. B+ Trees. Washington University Fall 2013

Chapter 11: Indexing and Hashing

I think that I shall never see A billboard lovely as a tree. Perhaps unless the billboards fall I ll never see a tree at all.

More B-trees, Hash Tables, etc. CS157B Chris Pollett Feb 21, 2005.

Data Management for Data Science

Intro to DB CHAPTER 12 INDEXING & HASHING

Introduction to Data Management. Lecture #25 (Transactions II)

RAID in Practice, Overview of Indexing

Indexing. Announcements. Basics. CPS 116 Introduction to Database Systems

External Sorting. Chapter 13. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1

CSE 544 Principles of Database Management Systems

Physical Disk Structure. Physical Data Organization and Indexing. Pages and Blocks. Access Path. I/O Time to Access a Page. Disks.

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2009 Lecture 6 - Storage and Indexing

Chapter 11: Indexing and Hashing" Chapter 11: Indexing and Hashing"

Problem. Indexing with B-trees. Indexing. Primary Key Indexing. B-trees: Example. B-trees. primary key indexing

Material You Need to Know

Introduction to Indexing R-trees. Hong Kong University of Science and Technology

Overview of Storage and Indexing

Indexing. Week 14, Spring Edited by M. Naci Akkøk, , Contains slides from 8-9. April 2002 by Hector Garcia-Molina, Vera Goebel

File Structures and Indexing

Multidimensional Indexing The R Tree

Overview of Storage and Indexing

Final Exam Review. Kathleen Durant PhD CS 3200 Northeastern University

CSE 373 OCTOBER 25 TH B-TREES

(2,4) Trees Goodrich, Tamassia. (2,4) Trees 1

Introduction to Data Management. Lecture #26 (Transactions, cont.)

Final Exam Review. Kathleen Durant CS 3200 Northeastern University Lecture 22

Overview of Storage and Indexing

Modern Database Systems Lecture 1

amiri advanced databases '05

Principles of Data Management. Lecture #3 (Managing Files of Records)

Overview of Storage and Indexing

Chapter 12: Indexing and Hashing

Introduction to Data Management. Lecture #5 Relational Model (Cont.) & E-Rà Relational Mapping

M-ary Search Tree. B-Trees. B-Trees. Solution: B-Trees. B-Tree: Example. B-Tree Properties. Maximum branching factor of M Complete tree has height =

Main Memory and the CPU Cache

Chapter 12: Indexing and Hashing. Basic Concepts

Introduction to Data Management. Lecture #24 (Transactions)

Overview of Storage and Indexing

Transcription:

Introduction to Data Management Lecture 21 (Indexing, cont.) Instructor: Mike Carey mjcarey@ics.uci.edu Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1 Announcements v Midterm #2 grading is underway (and a priority) Endterm exam is non-cumulative - and in class too! v Next HW assignment is now underway (but) Indexes and physical DB design Due Tuesday, May 29 th (7 days plus one late day) You don t have all the material yet, read ahead (!) v Today s lecture plan: Finish up B+ tree indexes Quick overview of (static) hash-based indexes Start on physical DB design (use of indexes) Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 2

B+ Tree Deletion (Review) v Start at root, find leaf L where entry belongs. v Remove the entry. If L is still at least half-full, done! If L has only d-1 entries, Try to redistribute, borrowing from sibling (adjacent node with same parent as L). If re-distribution fails, merge L and sibling. v If merge occurred, must delete search-guiding entry (pointing to L or sibling) from parent of L. v Merge could propagate to root, decreasing height. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 3 (Our previous B+ Tree including 8*) 17 (d=2) 5 13 24 30 2* 3* 5* 7* 8* 14* 16* 19* 20* 22* 24* 27* 29* 33* 34* 38* 39* Note (see Piazza): Very cool online B+ tree viz tool available (J) https://www.cs.usfca.edu/~galles/visualization/bplustree.html Only slight differences from our defs (e.g., notice key 13 above) Their Max. Degree is our 2d+1 (e.g., limit of 5 ptrs/node above) Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 4

Example Tree After (Inserting 8*, Then) Deleting 19* and 20*... 17 5 13 27 30 2* 3* 5* 7* 8* 14* 16* 22* 24* 27* 29* 33* 34* 38* 39* v Deleting 19* is easy. v Deleting 20* is done with redistribution. Notice how middle key is copied up. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 5... And Then Deleting 24* v Must merge. v Observe toss of index entry (on right), and pull down of index entry (below). 30 22* 27* 29* 33* 34* 38* 39* 5 13 17 30 2* 3* 5* 7* 8* 14* 16* 22* 27* 29* 33* 34* 38* 39* Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 6

Example of Non-leaf Redistribution v New/different example B+ tree is shown below during deletion of 24* v In contrast to previous example, can redistribute entry from left child of root to right child. 22 (Note: This shows a temporary illegal tree state w.r.t. d!) 5 13 17 20 30 2* 3* 5* 7* 8* 14* 16* 17* 18* 20* 21* 22* 27* 29* 33* 34* 38* 39* Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 7 After Redistribution v Intuitively, entries are redistributed by pushing through (or rotating if you prefer) the splitting entry in the parent node. 17 5 13 20 22 30 2* 3* 5* 7* 8* 14* 16* 17* 18* 20* 21* 22* 27* 29* 33* 34* 38* 39* Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 8

Bulk Loading of a B+ Tree v If we have a large collection of records, and we want to create a B+ tree on some field, doing so by repeatedly inserting records is very slow. v Bulk Loading can be done much more efficiently! v Initialization: Sort all data entries, insert pointer to first (leaf) page in a new (root) page. Root Sorted pages of data entries; not yet in B+ tree 3* 4* 6* 9* 10* 11* 12* 13* 20* 22* 23* 31* 35* 36* 38* 41* 44* Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 9 Bulk Loading (cont d.) v Index entries for leaf pages always entered into right-most index page just above leaf level. When one fills, it splits. (A split may go up the rightmost path to the root.) v Much faster than repeated inserts! v Can also control the leaf fill factor (%) Root Data entry pages not yet in B+ tree 3* 4* 6* 9* 10*11* 12*13* 20*22* 23* 31* 35*36* 38*41* 44* 6 6 12 Root 10 10 20 20 12 23 23 35 35 38 Data entry pages not yet in B+ tree 3* 4* 6* 9* 10* 11* 12*13* 20*22* 23* 31* 35*36* 38*41* 44* Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 10

A Note on B+ Tree Order v (Mythical!) order (d) concept replaced by physical space criterion in practice ( at least half-full ). Index pages can typically hold many more entries than leaf pages. Variable-sized records and search keys mean that different nodes will contain different numbers of entries. Even with fixed length fields, multiple records with the same search key value (duplicates) can lead to variable-sized data entries in the tree s leaf pages. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 11 (Page Implementation Details) Q: What if you were to open up a B+ Tree page? Control info (e.g., level, # children, free space offset) Search key array (with possible on-page indirection for variablelength data, using offsets), or key/data array for non-leaf vs. leaf pages, respectively 0 Child pointer array, where pointer Level (1) = page id on disk! NumChildren (3) 8... Free offset (40) P3 Ralph Timothy Key 0 offset (32) Key 1 offset (37) 20 Child 1 page id (P0) Child 2 page id (P1) P0 P1 P2 Child 3 page id (P2) } 32 Leaf 1 Leaf 2 Leaf 3 Key 0 ( Ralph ) not to 37 scale... Key 1 ( Timothy ) 40... Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 12

(Leaf Page I(k) Alternatives Revisited) Ex: Emp(eid, ename, sal, deptid) P2 1 2 3 4 (P2) Alternative 1: (records) 555 Smith 18K 3 666 Jones 90K 5 Smith 23K 4 Krishan 60K 8 Alternative 2: (RIDs) 444 (P1,4) 555 (P2,1) 666 (P2,2) (P2,3) (P2,4) (P3,1) Alternative 3: (RID lists) 3K (P1,1) 12K (P1,4) 18K (P2,1), (P10000,1) 23K (P2,3) 60K (P2,4) Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 13 (Leaf Page I(k) Alternatives, cont.) Ex: Emp(eid, ename, sal, deptid) Alternative 1: (records) P2 1 2 3 4 555 Smith 18K 3 666 Jones 90K 5 Smith 23K 4 Krishan 60K 8 (P2) Note: Must use PKs in secondary indexes when primary index uses Alternative 1! Alternative 2 : (PKs) 444 444 555 555 666 666 999 Alternative 3 : (PK lists) 3K 111 12K 444 18K 555, 4439667 23K 60K Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 14

A Brief Aside: Hash-Based Indexes v As for any index, 3 alternatives for data entries k*: Data record with key value k <k, rid of data record with search key value k> <k, list of rids of data records with search key k> Choice is orthogonal to the indexing technique! v Hash-based indexes are fast for equality selections. Cannot support range searches. v Static and dynamic hashing techniques exist; trade-offs similar to ISAM vs. B+ trees. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 15 Static Hashed Indexes v # primary pages fixed, allocated sequentially, never de-allocated; overflow pages if needed. v h(k) mod N = bucket (page) to which data entry with key k belongs. (N = # of buckets) h(k) mod N k Ex: Using N = 8 and h(k) = k (a bad h(k)...!) h 0 1 2 N-1 0*, 24*, -- 9*, 17*, 81* 25*, --, -- 26*, 42*, -- 15*, 71*, 7* Primary bucket pages Ex: Find data with key=25 à 25* 23*, 31*, -- Overflow pages Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 16

Static Hashed Indexes (Cont d.) v Buckets contain data entries (like for ISAM or B+ trees) very similar to what we just looked at. v Hash function works on search key field of record r. Must distribute values over range 0... M-1. h(key) = (a * key + b) mod M works fairly well. a and b are constants; lots known about how to tune h. v Long overflow chains can develop and degrade performance. (Analogous to ISAM.) Extendible Hashing and Linear Hashing: More dynamic approaches that address this problem. (Take CS122c!) Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 17 Indexing Summary v Tree-structured indexes are ideal for rangesearches, also good for equality searches. v ISAM is a static structure. (Prehistoric B+ Tree!) Only leaf pages modified; overflow pages needed. Overflow chains can degrade performance unless size of data set and data distribution stay constant. v B+ tree is a dynamic structure. Inserts/deletes leave tree height-balanced; log F N cost. High fanout F à tree depth rarely more than 3-4. https://www.cs.usfca.edu/~galles/visualization/bplustree.html Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 18

Indexing Summary (Cont d.) v Bulk loading can be much faster than repeated inserts for creating a B+ tree on a large data set. v Most widely used index in DBMS land, and also outside of DBMSs, because of its versatility. Also the most optimized (e.g., for bulk loads, locking, crash recovery, and so on). v Other database indexes to be aware of: Hash-based (for exact-match queries). R-tree (for spatial indexing and queries). Inverted keyword (for text indexing and queries). Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 19