Introduction. Choice orthogonal to indexing technique used to locate entries K.

Similar documents
Tree-Structured Indexes

Tree-Structured Indexes ISAM. Range Searches. Comments on ISAM. Example ISAM Tree. Introduction. As for any index, 3 alternatives for data entries k*:

Tree-Structured Indexes

Tree-Structured Indexes

Tree-Structured Indexes. Chapter 10

Principles of Data Management. Lecture #5 (Tree-Based Index Structures)

Tree-Structured Indexes

Tree-Structured Indexes

Tree-Structured Indexes. A Note of Caution. Range Searches ISAM. Example ISAM Tree. Introduction

Tree-Structured Indexes

Administrivia. Tree-Structured Indexes. Review. Today: B-Tree Indexes. A Note of Caution. Introduction

Tree-Structured Indexes (Brass Tacks)

Tree-Structured Indexes

Announcements. Reading Material. Recap. Today 9/17/17. Storage (contd. from Lecture 6)

Spring 2017 B-TREES (LOOSELY BASED ON THE COW BOOK: CH. 10) 1/29/17 CS 564: Database Management Systems, Jignesh M. Patel 1

Indexing. Chapter 8, 10, 11. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1

Lecture 8 Index (B+-Tree and Hash)

Introduction to Indexing 2. Acknowledgements: Eamonn Keogh and Chotirat Ann Ratanamahatana

Introduction to Data Management. Lecture 15 (More About Indexing)

Department of Computer Science University of Cyprus EPL446 Advanced Database Systems. Lecture 6. B+ Trees: Structure and Functions

Indexes. File Organizations and Indexing. First Question to Ask About Indexes. Index Breakdown. Alternatives for Data Entries (Contd.

Lecture 13. Lecture 13: B+ Tree

Introduction to Data Management. Lecture 21 (Indexing, cont.)

THE B+ TREE INDEX. CS 564- Spring ACKs: Jignesh Patel, AnHai Doan

Kathleen Durant PhD Northeastern University CS Indexes

Goals for Today. CS 133: Databases. Example: Indexes. I/O Operation Cost. Reason about tradeoffs between clustered vs. unclustered tree indexes

CSIT5300: Advanced Database Systems

Chapter 12: Indexing and Hashing (Cnt(

Extra: B+ Trees. Motivations. Differences between BST and B+ 10/27/2017. CS1: Java Programming Colorado State University

I think that I shall never see A billboard lovely as a tree. Perhaps unless the billboards fall I ll never see a tree at all.

Database Applications (15-415)

CSE 530A. B+ Trees. Washington University Fall 2013

Physical Level of Databases: B+-Trees

Introduction to Data Management. Lecture 14 (Storage and Indexing)

M-ary Search Tree. B-Trees. Solution: B-Trees. B-Tree: Example. B-Tree Properties. B-Trees (4.7 in Weiss)

Introduction to Indexing R-trees. Hong Kong University of Science and Technology

2-3 Tree. Outline B-TREE. catch(...){ printf( "Assignment::SolveProblem() AAAA!"); } ADD SLIDES ON DISJOINT SETS

Overview of Storage and Indexing

Lecture 4. ISAM and B + -trees. Database Systems. Tree-Structured Indexing. Binary Search ISAM. B + -trees

CSC 261/461 Database Systems Lecture 17. Fall 2017

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2014/15

ICOM 6005 Database Management Systems Design. Dr. Manuel Rodríguez-Martínez Electrical and Computer Engineering Department

Find the block in which the tuple should be! If there is free space, insert it! Otherwise, must create overflow pages!

Main Memory and the CPU Cache

Physical Disk Structure. Physical Data Organization and Indexing. Pages and Blocks. Access Path. I/O Time to Access a Page. Disks.

Hash-Based Indexes. Chapter 11. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1

Chapter 4. ISAM and B + -trees. Architecture and Implementation of Database Systems Summer 2016

Intro to DB CHAPTER 12 INDEXING & HASHING

Database System Concepts, 6 th Ed. Silberschatz, Korth and Sudarshan See for conditions on re-use

Module 4: Tree-Structured Indexing

(2,4) Trees Goodrich, Tamassia. (2,4) Trees 1

ACCESS METHODS: FILE ORGANIZATIONS, B+TREE

ECS 165B: Database System Implementa6on Lecture 7

Topics to Learn. Important concepts. Tree-based index. Hash-based index

INDEXES MICHAEL LIUT DEPARTMENT OF COMPUTING AND SOFTWARE MCMASTER UNIVERSITY

M-ary Search Tree. B-Trees. B-Trees. Solution: B-Trees. B-Tree: Example. B-Tree Properties. Maximum branching factor of M Complete tree has height =

Chapter 12: Indexing and Hashing. Basic Concepts

Multiway searching. In the worst case of searching a complete binary search tree, we can make log(n) page faults Everyone knows what a page fault is?

Information Systems (Informationssysteme)

Chapter 11: Indexing and Hashing

Introduction to Data Management. Lecture #13 (Indexing)

Chapter 12: Indexing and Hashing

Material You Need to Know

amiri advanced databases '05

Chapter 4. ISAM and B + -trees. Architecture and Implementation of Database Systems Summer 2013

Chapter 18 Indexing Structures for Files. Indexes as Access Paths

Database Systems. File Organization-2. A.R. Hurson 323 CS Building

CSE 444: Database Internals. Lectures 5-6 Indexing

CS143: Index. Book Chapters: (4 th ) , (5 th ) , , 12.10

Multi-way Search Trees! M-Way Search! M-Way Search Trees Representation!

CS 310 B-trees, Page 1. Motives. Large-scale databases are stored in disks/hard drives.

2-3 and Trees. COL 106 Shweta Agrawal, Amit Kumar, Dr. Ilyas Cicekli

Multi-way Search Trees. (Multi-way Search Trees) Data Structures and Programming Spring / 25

(2,4) Trees. 2/22/2006 (2,4) Trees 1

Chapter 4. ISAM and B + -trees. Architecture and Implementation of Database Systems Winter 2010/11

Problem. Indexing with B-trees. Indexing. Primary Key Indexing. B-trees: Example. B-trees. primary key indexing

Data Management for Data Science

Database Applications (15-415)

Data Organization B trees

Chapter 11: Indexing and Hashing" Chapter 11: Indexing and Hashing"

Chapter 17. Disk Storage, Basic File Structures, and Hashing. Records. Blocking

Hash-Based Indexes. Chapter 11

User Perspective. Module III: System Perspective. Module III: Topics Covered. Module III Overview of Storage Structures, QP, and TM

Chapter 11: Indexing and Hashing

Chapter 11: Indexing and Hashing

B + -Trees. Fundamentals of Database Systems Elmasri/Navathe Chapter 6 (6.3) Database System Concepts Silberschatz/Korth/Sudarshan Chapter 11 (11.

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2009 Lecture 6 - Storage and Indexing

Design and Analysis of Algorithms Lecture- 9: B- Trees

Final Exam Review. Kathleen Durant PhD CS 3200 Northeastern University

Chapter 12: Indexing and Hashing

Data Structures and Algorithms

CARNEGIE MELLON UNIVERSITY DEPT. OF COMPUTER SCIENCE DATABASE APPLICATIONS

B+ TREES Examples. B Trees are dynamic. That is, the height of the tree grows and contracts as records are added and deleted.

B-Trees. Disk Storage. What is a multiway tree? What is a B-tree? Why B-trees? Insertion in a B-tree. Deletion in a B-tree

CSE 544 Principles of Database Management Systems

Background: disk access vs. main memory access (1/2)

External Sorting. Chapter 13. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1

Indexing. Announcements. Basics. CPS 116 Introduction to Database Systems

Final Exam Review. Kathleen Durant CS 3200 Northeastern University Lecture 22

Instructor: Amol Deshpande

Transcription:

Tree-Structured Indexes Werner Nutt Introduction to Database Systems Free University of Bozen-Bolzano 2 Introduction As for any index, three alternatives for data entries K : Data record with key value K K, r, wherer is rid of a record with search key value K K, [r 1,...,r n ],where[r 1,...,r n ] is a list or rid s of records with search key value K Choice orthogonal to indexing technique used to locate entries K.. Tree-structured indexing techniques support both range searches and equality searches. ISAM: static structure; B+-tree: dynamic, adjusts gracefully under inserts and deletes.

3 Range Searches Find all employees with sal > 1500 If data is in sorted file, do binary search to find first such employee, then scan to find others Cost of binary search can be quite high Simple idea: create an index file k1 k2 kn Index File Page 1 Page 2 Page 3 Page N Data File can do binary search on (smaller) index file! 4 ISAM (= Indexed Sequential Access Method) index entry P 0 K 1 P 1 K 2 P 2 K m P m Index file may still be quite large. But we can apply the idea repeatedly! Index Pages Leaf Pages Overflow page Primary pages Leaf pages contain data entries

5 Comments on ISAM File creation: Leaf (data) pages allocated sequentially, sorted by search key; then index pages allocated, then space for overflow pages. Index entries: search key value, page id ; direct search for data entries, which are in leaf pages Search: Start at root; use key comparisons to go to leaf. Cost log F N where F =#entries/index page ( fanout ) and N =#leaf pages Insert: Find leaf data that entry belongs to, and put it there Delete: Find leaf and remove from leaf; if empty overflow page, de-allocate Static tree structure: inserts/deletes affect only leaf pages 6 Example ISAM Tree Each node can hold 2 entries No need for next-leaf-page pointers (Why?) 40 20 33 51 63 10* 15* 20* 27* 33* 37* 40* 46* 51* 55* 63* 97*

7 Index Pages Primary Leaf Pages After Inserting 23, 48, 41, 42,... 40 20 33 51 63 10* 15* 20* 27* 33* 37* 40* 46* 51* 55* 63* 97* Overflow Pages 23* 48* 41* 42* 8 Then Deleting 42, 51, 97 40 20 33 51 63 10* 15* 20* 27* 33* 37* 40* 46* 55* 63* 23* 48* 41* Note that 51 appears in index levels, but not in leaf!

9 B+-Tree: The Most Widely Used Index Insert/delete at log F N cost (F = fan out and N =#leaf pages); keep tree height-balanced. Minimum 50% occupancy (except for root). Each node contains d m 2d entries (d is the order of the tree). Supports equality and range-searches efficiently. Index (Directs search) Data Entries ("Sequence set") 10 Example B+-Tree Search begins at root, and key comparisons direct it to a leaf (as in ISAM) Search for 5,15, all data entries with key 24 13 17 24 30 2* 3* 5* 7* 14* 16* 19* 20* 22* 24* 27* 29* 33* 34* 38* 39* Basedonthesearchfor15,weknow it is not in the tree!

11 Average fill-factor: 66% (= ln 2) Typical order: 100 average fanout = 133 B+-Trees in Numbers Typical capacities: Height 4: 133 4 = 312,900,700 records Height 3: 133 3 = 2,352,637 records Can often hold top levels in buffer pool: Level 1 = 1 page = 8 KBytes Level 2 = 133 pages = 1 MByte Level 3 = 17,689 pages = 133 MBytes 12 Inserting a Data Entry into a B+-Tree Find correct leaf L Put data entry onto L If L has enough space, done! Else, must split L (into L and a new node L ) Redistribute entries evenly, copy up middle key Insert index entry pointing to L into parent of L This can happen recursively To split index note, redistribute entries evenly, but push up middle key (contrast with leaf splits!) Splits grow three; root split increases height Tree growth: gets wider or one level taller at top

13 Observe how minimum occupancy is guaranteed in both leaf and index page splits. Inserting 8 into Example B+-Tree 2* 3* 5* 7* 8* 5 Entry to be inserted in parent node. (Note that 5 is copied up and continues to appear in the leaf.) Note difference between copy up and push up! What s the reason? 17 5 13 24 30 Entry to be inserted in parent node. (Note that 17 is pushed up and and appears once in the index. Contrast this with a leaf split.) 14...After Inserting 8 17 5 13 24 30 2* 3* 5* 7* 8* 14* 16* 19* 20* 22* 24* 27* 29* 33* 34* 38* 39* Notice that root was split, leading to increase in height In this example, we can avoid split be re-distributing entries; however, this is usually not done in practice

15 Deleting a Data Entry from a B+-Tree Startatroot,findleafL where entry belongs Remove the entry If L is at least half-full, done! If L has only d 1 entries, Try to re-distribute, borrowing from sibling If re-distribution fails, merge L and sibling (adjacent node with same parent as L) If merge occurred, must delete entry (pointing to L or sibling) from parent of L Merge could propagate to root, decreasing height 16... after (Inserting 8, then) Deleting 19 and 20 17 5 13 27 30 2* 3* 5* 7* 8* 14* 16* 22* 24* 27* 29* 33* 34* 38* 39* Deleting 19 is easy Deleting 20 is done with re-distribution. Notice how middle key is copied up!

17 Must merge observe toss of index entry (on right), and pull down of index entry (below)...and then Deleting 24 5 13 17 30 30 22* 27* 29* 33* 34* 38* 39* 2* 3* 5* 7* 8* 14* 16* 22* 27* 29* 33* 34* 38* 39* 18 Example of Non-leaf Re-distribution Tree is shown below during deletion of 24 (What could be a possible tree?) In contrast to previous example, can re-distribute entry form left child of root to right child 22 5 13 17 20 30 2* 3* 5* 7* 8* 14* 16* 17* 18* 20* 21* 22* 27* 29* 33* 34* 38* 39*

19 After Re-distribution Intuitively, entries are re-distributed by pushing through the splitting entry in the parent node It suffices to re-distribute index entry with key 20; (we have re-distributed 17 as well for illustration) 5 13 17 20 22 30 2* 3* 5* 7* 8* 14* 16* 17* 18* 20* 21* 22* 27* 29* 33* 34* 38* 39* 20 Bulk Loading of a B+-Tree If we have a large collection of records, and we want to create a B+-tree on some filed, doing so by repeatedly inserting records is very slow Bulk loading can be done much more efficiently Initialisation: Sort all data entries, insert pointer to first (leaf) page in a new (root) page Sorted pages of data entries; not yet in B+ tree 3* 4* 6* 9* 10* 11* 12* 13* 20* 22* 23* 31* 35* 36* 38* 41* 44*

21 Bulk Loading of a B+-Tree (Cntd.) 6 10 Data entry pages not yet in B+ tree 3* 4* 6* 9* 10* 11* 12* 13* 20* 22* 23* 31* 35* 36* 38* 41* 44* 10 6 12 Data entry pages not yet in B+ tree 3* 4* 6* 9* 10* 11* 12* 13* 20* 22* 23* 31* 35* 36* 38* 41* 44* 22 Bulk Loading of a B+-Tree (Cntd.) Index entries for leaf pages always entered into right-most index page just above leaf level. When this fills up, it splits. (Split may go up right-most path to the root) Much faster than repeated inserts, especially when one considers locking!

23 Bulk Loading of a B+-Tree (Cntd.) 6 10 20 12 Data entry pages not yet in B+ tree 3* 4* 6* 9* 10* 11* 12* 13* 20* 22* 23* 31* 35* 36* 38* 41* 44* 20 23 35 10 35 Data entry pages not yet in B+ tree 6 12 23 38 3* 4* 6* 9* 10* 11* 12* 13* 20* 22* 23* 31* 35* 36* 38* 41* 44* 24 Summary of Bulk Loading Option 1: multiple inserts Slow Does not give sequential storage of leaves Option 2: Bulk Loading Has advantages for concurrency control Fewer I/O s during build Leaves will be stored sequentially (and linked, of course) Can control fill factor on pages

25 Storage and Access Cost for an Average B+-tree Example: Relation Orders with attribute Orders.CustId Assumptions: Page Size: 4KBytes (including 96 Bytes page header) Occupancy of Page: 70 % Number of records in Orders: 10,000,000 Number of distinct Customer ID s: 100,000 (for every customer, there is an equal number of orders) Length of a Customer ID: 24 Bytes Length of an rid: 6Bytes Length of a pointer in B+-tree: 6Bytes 26 We Conclude: Length of Rid List: 24 + 100 6 Bytes = 624 Bytes Number of Rid Lists on an Index Page:.7 (4096 96)/624 =4 Number of Index Pages: 100, 000/4 =25, 000 Length of a Signpost to a Non-leaf Node: 24 + 6 Bytes =30Bytes Fanout:.7 (4096 96)/30 =93 Height of Index: log 93 25, 000 +1=4 (3 Levels for non-leaf nodes plus leaf level) Number of Pages in Index: 25, 000 pages on Level 4, 25, 000/93 = 269 non-leaf nodes on Level 3 269/93 =3non-leaf nodes on Level 2 plus 1 root node Storage Space: 25, 270 4 KBytes 100 MBytes Reading all orders for a CustId requires 4 + 100 = 104 page accesses

27 Tree-structured Indexes: Summary Ideal for range-searches, also good for equality searches ISAM is a static structure Only leaf pages modified; overflow pages needed Overflow chains can degrade performance unless size of data set and data distribution stay constant B+-tree is a dynamic structure Inserts/deletes leave tree height-balanced (log F N cost) High fanout F means depth rarely more than 3 or 4 Almost always better than maintaining a sorted file Typically, 66% (= ln 2) occupancy on average If data entries are data records, splits can change rids! 28 Tree-structured Indexes: Summary Bulk loading can be much faster than repeated inserts for creating a B+-tree on a large data set Most widely used index in database management systems because of its versatility. On of the most optimized components of a DBMS.

29 References These slides are based on Chapter 10 of the book Database Management Systems by R. Ramakrishnan and J. Gehrke, and on slides by the authors published at www.cs.wisc.edu/~dbbook/openaccess/thirdedition/slides/slides3ed.html