Extra: B+ Trees. Motivations. Differences between BST and B+ 10/27/2017. CS1: Java Programming Colorado State University

Extra: B+ Trees
CS1: Java Programming, Colorado State University
Slides by Wim Bohm and Russ Wakefield

Motivations
Many times you want to minimize the number of disk accesses during a search. A binary search tree holds one key per node (with two children); a B+ tree node holds as many keys as will fit on a page.

Differences between BST and B+
A B+ tree is balanced: every path from the root to a leaf has the same length. This is not guaranteed for a BST.
A B+ tree keeps data only in the leaf nodes; the index nodes are simply for traffic control. A BST stores data with the keys.
A B+ tree has many keys per node; a BST has one.

Objectives
To understand the basic structure of a B+ tree
To understand how the way keys are stored allows for efficient searching
To understand the insertion algorithm
To understand the deletion algorithm
To see how rotation minimizes the cost of the insertion and deletion algorithms

What is a B+ tree?
A B+ tree is a dynamic structure that adjusts gracefully to changes in the file. It is the most widely used index structure because it adjusts well to changes and supports both equality and range queries.
It is a balanced tree in which the internal nodes direct the search and the leaf nodes contain the data entries.
The leaf nodes are organized into a doubly linked list, allowing us to easily traverse the leaf pages in either direction.

Example of a B+ tree
[Figure: example B+ tree]

Advantages / Disadvantages
Advantage of B+-tree index files: the tree automatically reorganizes itself with small, local changes in the face of insertions and deletions. Reorganization of the entire file is not required to maintain performance.
(Minor) disadvantage of B+-trees: extra insertion and deletion overhead, and space overhead.
The advantages of B+-trees outweigh the disadvantages, so B+-trees are used extensively.

B+-Tree Index Files (Cont.)
A B+-tree is a rooted tree satisfying the following properties:
All paths from root to leaf are of the same length.
Each node that is not a root or a leaf has between $\lceil n/2 \rceil$ and $n$ children.
A leaf node has between $\lceil (n-1)/2 \rceil$ and $n-1$ values.
Special cases:
If the root is not a leaf, it has at least 2 children.
If the root is a leaf (that is, there are no other nodes in the tree), it can have between 0 and $n-1$ values.

B+-Tree Node Structure
Typical node: $P_1, K_1, P_2, \ldots, P_{n-1}, K_{n-1}, P_n$
$K_i$ are the search-key values.
$P_i$ are pointers to children (for non-leaf nodes) or pointers to records or buckets of records (for leaf nodes).
The search-keys in a node are ordered: $K_1 < K_2 < K_3 < \ldots < K_{n-1}$
(Initially assume no duplicate keys; duplicates are addressed later.)
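The occupancy limits listed under "B+-Tree Index Files (Cont.)" are easy to tabulate for a given n. The following is a minimal sketch; the function name `occupancy_bounds` and the dictionary keys are illustrative choices, not part of the slides.

```python
import math

def occupancy_bounds(n):
    """Return (min, max) limits for a B+-tree whose nodes hold at most n pointers.

    The names used as dictionary keys are hypothetical labels for this sketch.
    """
    return {
        # Non-root internal nodes: between ceil(n/2) and n children.
        "internal_children": (math.ceil(n / 2), n),
        # Leaf nodes: between ceil((n-1)/2) and n-1 values.
        "leaf_values": (math.ceil((n - 1) / 2), n - 1),
        # Special cases for the root.
        "root_children_if_not_leaf": (2, n),
        "root_values_if_leaf": (0, n - 1),
    }

if __name__ == "__main__":
    for name, (low, high) in occupancy_bounds(5).items():
        print(f"{name}: {low}..{high}")
```

For n = 5 this gives 3 to 5 children for internal nodes and 2 to 4 values for leaves.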

Non-Leaf Nodes in B+-Trees
Non-leaf nodes form a multi-level sparse index on the leaf nodes. For a non-leaf node with $m$ pointers:
All the search-keys in the subtree to which $P_1$ points are less than $K_1$.
For $2 \le i \le m-1$, all the search-keys in the subtree to which $P_i$ points have values greater than or equal to $K_{i-1}$ and less than $K_i$.
All the search-keys in the subtree to which $P_m$ points have values greater than or equal to $K_{m-1}$.

Example of B+-tree
[Figure: B+-tree for instructor file (n = 5)]
Leaf nodes must have between 2 and 4 values ($\lceil (n-1)/2 \rceil$ and $n-1$, with $n = 5$).
Non-leaf nodes other than the root must have between 3 and 5 children ($\lceil n/2 \rceil$ and $n$, with $n = 5$).
The root must have at least 2 children.

Observations about B+-trees
Operations (insert, delete) on the tree keep it balanced.
Search cost is $\log_f N$, where f = fanout and N = number of leaf pages.
Minimum occupancy of 50% is guaranteed for each node except the root if the deletion algorithm presented here is used. (In practice, deletes just remove the data entry, because files usually grow rather than shrink.)
Each node contains m entries, where d <= m <= 2d. d is referred to as the order of the tree.
A search for a record is just a traversal from the root to the appropriate leaf. Its cost is the height of the tree, which is the same for every leaf because the tree is balanced.
Because of the high fan-out, the height of a B+ tree is rarely more than 3 or 4.
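To see why high fan-out keeps the tree so shallow, the $\log_f N$ cost can be evaluated directly. This is a rough sketch under the assumption that every index node is filled to the same fanout; `levels_needed` is a hypothetical helper name, and it uses integer arithmetic to avoid floating-point rounding in the logarithm.

```python
def levels_needed(num_leaf_pages, fanout):
    """Smallest h such that fanout**h >= num_leaf_pages.

    This is ceil(log_fanout(num_leaf_pages)), computed with integers only.
    """
    levels, reach = 0, 1
    while reach < num_leaf_pages:
        reach *= fanout   # one more index level multiplies the reach by the fanout
        levels += 1
    return levels

if __name__ == "__main__":
    for f in (2, 50, 100):
        print(f"fanout {f}: {levels_needed(1_000_000, f)} levels for 1,000,000 leaf pages")
```

With one million leaf pages, a fanout of 2 needs 20 levels, while a fanout of 50 needs 4 and a fanout of 100 needs 3.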

Queries on B+-Trees
Find the record with search-key value V.
1. C = root
2. While C is not a leaf node {
   1. Let i be the least value such that $V \le K_i$.
   2. If no such i exists, set C = last non-null pointer in C.
   3. Else { if ($V = K_i$) set C = $P_{i+1}$ else set C = $P_i$ }
   }
   // Now C is the leaf node that would contain V
3. Let i be the least value such that $K_i = V$.
4. If there is such a value i, follow pointer $P_i$ to the desired record.
5. Else no record with search-key value V exists.

Queries on B+-Trees (Cont.)
If there are K search-key values in the file, the height of the tree is no more than $\lceil \log_{\lceil n/2 \rceil}(K) \rceil$ (n = number of index entries per node).
A node is generally the same size as a disk block, typically 4 kilobytes, and n is typically around 100 (40 bytes per index entry).
With 1 million search-key values and n = 100, at most $\lceil \log_{50}(1{,}000{,}000) \rceil = 4$ nodes are accessed in a lookup.
Contrast this with a balanced binary tree with 1 million search-key values: around 20 nodes are accessed in a lookup ($\log_2 1{,}000{,}000 \approx 20$).
The difference is significant, since every node access may need a disk I/O, each costing on the order of milliseconds.

Insertion Algorithm
Case 1. Leaf page full? No. Index page full? No.
Action: Place the record in sorted position in the appropriate leaf page.

Case 2. Leaf page full? Yes. Index page full? No.
Action:
1. Split the leaf page, including the inserted record.
2. Place the middle key in the index page in sorted order.
3. The left leaf page contains records with keys below the middle key.
4. The right leaf page contains records with keys equal to or greater than the middle key.

Case 3. Leaf page full? Yes. Index page full? Yes.
Action:
1. Split the leaf page, including the inserted record.
2. Records with keys < middle key go to the left leaf page.
3. Records with keys >= middle key go to the right leaf page.
4. Split the index page.
5. Keys < middle key go to the left index page.
6. Keys > middle key go to the right index page.
7. The middle key goes to the next (higher-level) index. If the next-level index page is full, continue splitting the index pages.
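The lookup steps listed under "Queries on B+-Trees" can be sketched as follows. This is an illustration, not the slides' implementation: the `Node` class, its field names, and the `example_tree` helper are assumptions made for the sketch, and `bisect_right` is used as a compact way to pick the same child pointer the numbered steps select.

```python
from bisect import bisect_left, bisect_right

class Node:
    """Hypothetical node layout: internal nodes have children, leaves have records."""
    def __init__(self, keys, children=None, records=None):
        self.keys = keys            # sorted search-key values K_1..K_{m-1}
        self.children = children    # m child pointers (internal nodes only)
        self.records = records      # one record per key (leaf nodes only)

    @property
    def is_leaf(self):
        return self.children is None

def find(root, v):
    """Return the record stored under key v, or None if v is absent."""
    node = root
    while not node.is_leaf:
        # bisect_right counts the keys <= v, which selects:
        #   P_i      when v <  K_i,
        #   P_{i+1}  when v == K_i,
        #   the last pointer when v is greater than every key.
        node = node.children[bisect_right(node.keys, v)]
    i = bisect_left(node.keys, v)   # least i with K_i >= v
    if i < len(node.keys) and node.keys[i] == v:
        return node.records[i]
    return None

def example_tree():
    """A small two-level tree with made-up keys, for demonstration only."""
    leaves = [
        Node([5, 10], records=["r5", "r10"]),
        Node([25, 30], records=["r25", "r30"]),
        Node([50, 55], records=["r50", "r55"]),
    ]
    return Node([25, 50], children=leaves)

if __name__ == "__main__":
    tree = example_tree()
    print(find(tree, 25), find(tree, 26))
```

Each loop iteration corresponds to one node access, so the number of iterations equals the number of index levels in the tree.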

Insertion
Examples of insertion with a B+ tree with order = 1. We start with the tree below:
[Figure: the starting tree]

Insert 28
Insert 28 into the tree below:
[Figure: the tree before inserting 28]

Insert 28
Our first insertion has key 28. We look at the leaf node where it belongs to see if there is room. Finding an empty slot, we place the key in the node in sorted order.
[Figure: the tree after inserting 28]

Insert 25
Insert 25 into the tree below:
[Figure: the tree before inserting 25]

Insert 25
Our next insertion is 25. We look at the leaf node it would go in and find there is no room. We split the node and roll the middle value up to the index node above it.
[Figure: the tree after inserting 25, with 25 in the index node]

Insert
Insert the next key into the tree below:
[Figure: the tree before the insertion]

Insert
Our next case occurs when the leaf node is full, so we split it and attempt to roll the middle key up to the index node. The index node is full as well, so we must split it too.
[Figure: the tree after the leaf split and the index split]

Insert
Insert the next key into the tree below:
[Figure: the tree before the insertion]

Insert
Our last case results in the root node being split. The leaf node is full, as are the two index nodes above it. This gives us:
[Figure: the tree after the root split]
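The two kinds of split used in the insertion cases differ in what happens to the middle key: a leaf split keeps the middle key in the right leaf and copies it into the index, while an index split moves the middle key up and keeps it in neither half. The sketch below illustrates this on plain Python lists; the function names, the `capacity` parameter, and the choice of the midpoint are assumptions for the illustration.

```python
from bisect import insort

def insert_into_leaf(leaf, key, capacity):
    """Insert key into a sorted leaf (a list of keys).

    Returns (left, right, middle_key). When no split is needed,
    right and middle_key are None.
    """
    insort(leaf, key)                 # keep the leaf sorted
    if len(leaf) <= capacity:
        return leaf, None, None
    mid = len(leaf) // 2
    left, right = leaf[:mid], leaf[mid:]
    # Keys below the middle key stay left; keys >= middle key go right.
    # The middle key is copied up into the index page.
    return left, right, right[0]

def split_index(keys, children):
    """Split an overfull index page.

    Returns (left_keys, left_children, middle_key, right_keys, right_children).
    The middle key moves to the next higher index level.
    """
    mid = len(keys) // 2
    return (keys[:mid], children[:mid + 1],
            keys[mid],
            keys[mid + 1:], children[mid + 1:])

if __name__ == "__main__":
    print(insert_into_leaf([10, 20, 30], 25, capacity=3))
    print(split_index([10, 20, 30, 40, 50], list("abcdef")))
```

When `split_index` is applied to the root, the returned middle key becomes the only key of a new root, which is how the tree grows in height.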

Delete Algorithm
Case 1. Leaf page below half full? No. Index page below half full? No.
Action: Delete the record from the leaf page. Arrange the remaining keys in ascending order to fill the void. If the key of the deleted record appears in the index pages, use the next key to replace it.

Case 2. Leaf page below half full? Yes. Index page below half full? No.
Action: Combine the leaf page and its sibling. Change the index page to reflect the change.

Case 3. Leaf page below half full? Yes. Index page below half full? Yes.
Action:
1. Combine the leaf page and its sibling.
2. Adjust the index page to reflect the change.
3. Combine the index page with its sibling.
4. Delete the entry from the parent.
Continue combining index pages until you reach a page with the correct fill factor or you reach the root page.

Delete
Let's take our tree from the insert example with a minor modification (we have added 30 to give us an index node with 2 keys in it):
[Figure: the tree from the insert example with 30 added]

Delete 18
Delete 18 from the tree below:
[Figure: the tree before deleting 18]

Delete 18
Our first delete is of 18. This is the simplest case: the key does not appear in an index node, and deleting it does not take its leaf node below half full.
[Figure: the tree after deleting 18]

Delete 25
Delete 25 from the tree below:
[Figure: the tree before deleting 25]

Delete 25
Our next delete is similar, except that the key also appears in an index node. In that case, the next key replaces it in the index node. Let's delete 25 from the previous slide.
[Figure: the tree after deleting 25, with 28 replacing 25 in the index node]

Delete 28
Delete 28 from the tree below:
[Figure: the tree before deleting 28]

Delete 28
Our next case takes the leaf node below d. Let's delete 28. For this one we combine the leaf page (in our case it is empty) with its sibling and update the index appropriately. That gives us:
[Figure: the tree after deleting 28]

Delete 30
Delete 30 from the tree below:
[Figure: the tree before deleting 30]

Delete 30
Next we delete 30. This takes the index node below d. We combine the index nodes, which in turn takes the index node above below d. This continues up to the root.
[Figure: the tree after deleting 30]

Delete 30
Whoa. That seemed like magic. What process got us to that? OK, let's go through it. Deleting 30 took the leaf node that 30 was in below d, so we have to merge it with its sibling. The merge goes into the sibling on the left, which means the pointer in the index node above is no longer valid. We remove that pointer (which leaves the index node with fewer than d entries), pull down the key from the level above, and merge the index node with its sibling.

Delete 30
[Figure: the tree after the leaf merge and the first index merge]

Delete 30
Repeating the process gets us back to:
[Figure: the tree after deleting 30]

Delete 5
Delete 5 from the tree below:
[Figure: the tree before deleting 5]

Delete 5
Our last example deletes 5. This takes both the leaf node and the index node above it below d. We remove the leaf node and combine the index node with its neighbor.
[Figure: the tree after deleting 5]
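The first two rows of the Delete Algorithm table can be sketched on a two-level tree. This is a simplified illustration with assumed names: `leaves` is a list of sorted key lists, `seps[i]` is the index key separating `leaves[i]` from `leaves[i + 1]`, and `min_keys` (at least 1) is the minimum leaf occupancy. It assumes the sibling has room for the merged keys; when it does not, redistribution between siblings would be needed instead. Merging of index pages (the third row) is not modeled.

```python
from bisect import bisect_right

def delete_key(leaves, seps, key, min_keys):
    """Delete key from a two-level tree, mutating leaves and seps in place.

    Returns True if the key was found and removed, False otherwise.
    """
    i = bisect_right(seps, key)       # leaf that would hold the key
    leaf = leaves[i]
    if key not in leaf:
        return False
    leaf.remove(key)

    if len(leaf) >= min_keys:
        # Row 1: no underflow. If the deleted key also appears in the
        # index, replace it with the next key in the leaf.
        if i > 0 and seps[i - 1] == key:
            seps[i - 1] = leaf[0]
        return True

    if len(leaves) == 1:
        return True                   # a lone root leaf is allowed to underflow

    # Row 2: underflow. Combine the leaf with a sibling (the left one when
    # it exists) and drop the index key that separated the two pages.
    left = i - 1 if i > 0 else i
    leaves[left] = leaves[left] + leaves[left + 1]
    del leaves[left + 1]
    del seps[left]
    return True

if __name__ == "__main__":
    leaves = [[1, 4], [10, 12], [40, 45, 50]]
    seps = [10, 40]
    delete_key(leaves, seps, 40, min_keys=2)
    print(leaves, seps)
    delete_key(leaves, seps, 45, min_keys=2)
    print(leaves, seps)
```

In the demo, deleting 40 only replaces the index key with 45, while deleting 45 afterwards underflows the last leaf and merges it into its left sibling.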

Rotation
It is also possible to rebalance a tree to reduce the number of splits. This is called rotation. If you are trying to insert and a leaf page is full but its sibling isn't, you can move a record to the sibling and avoid splitting. Let's go back to a tree from our insert example:

Add 3
[Figure: the tree before adding 3]

Add 3 (Rotation)
We want to add 3, but in this case we first check the sibling to see if it has room. It does, so we move a record to it and adjust the index. Now we have:
[Figure: the tree after adding 3 with a rotation]
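Rotation on insert can be sketched for a single pair of adjacent leaves. The sketch assumes the new key belongs in the left leaf, that keys are unique, that the right sibling is not empty, and that the record moved is the left leaf's largest key; `insert_with_rotation` and `capacity` are illustrative names, not taken from the slides.

```python
from bisect import insort

def insert_with_rotation(left, right, key, capacity):
    """Insert key into the sorted leaf `left`, rotating into `right` on overflow.

    Returns the index key that should separate the two leaves afterwards
    (the smallest key in `right`), or None when both leaves are full and
    a split is still required. Both lists are mutated in place.
    """
    insort(left, key)
    if len(left) <= capacity:
        return right[0]               # no overflow, separator unchanged
    if len(right) >= capacity:
        left.remove(key)              # undo; the caller must split instead
        return None
    # Move the largest key of the full leaf to the front of its sibling
    # and report the new separator so the index can be adjusted.
    right.insert(0, left.pop())
    return right[0]

if __name__ == "__main__":
    left, right = [1, 4, 5], [9, 10]
    sep = insert_with_rotation(left, right, 3, capacity=3)
    print(left, right, sep)
```

The only index change is the single separator key between the two leaves, which is why rotation is cheaper than a split that may propagate upward.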

Delete (Rotation)
The same concept works with deletes.
[Figure: the tree before the delete]

Delete (Rotation)
[Figure: the tree after the delete, rebalanced by rotation]

Bulk Loading of a B+ Tree
If we have a large collection of records and we want to create a B+ tree on some field, doing so by repeatedly inserting records is very slow.
Bulk loading can be done much more efficiently.
Initialization: Sort all data entries, then insert a pointer to the first (leaf) page in a new (root) page.
[Figure: an empty root page pointing to the first of the sorted pages of data entries, which are not yet in the B+ tree]

Bulk Loading (Contd.)
Index entries for leaf pages are always entered into the right-most index page just above the leaf level. When this page fills up, it splits. (The split may propagate up the right-most path to the root.)
This is much faster than repeated inserts.
[Figure: two stages of bulk loading, showing the root and right-most index pages filling up while the remaining data entry pages are not yet in the B+ tree]

Summary of Bulk Loading
Option 1: multiple inserts.
Slow.
Does not give sequential storage of leaves.
Option 2: bulk loading.
Has advantages for concurrency control.
Fewer I/Os during build.
Leaves will be stored sequentially (and linked, of course).
Can control the fill factor on pages.
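The effect of bulk loading can be sketched with a simpler level-by-level variant: pack the sorted keys into leaf pages, then build each index level from the smallest key under every node of the level below. This is not the right-most-page insertion process described above, only a compact way to arrive at a packed tree; `bulk_load`, `leaf_capacity`, and `fanout` are assumed names, and the last node on a level may end up underfull.

```python
def bulk_load(sorted_keys, leaf_capacity, fanout):
    """Build a B+ tree bottom-up from already sorted keys.

    Returns a list of levels, leaves first and root last. A leaf is a list
    of keys; an index node is the list of separator keys between its children.
    """
    # Leaf level: consecutive pages filled in key order.
    level = [sorted_keys[i:i + leaf_capacity]
             for i in range(0, len(sorted_keys), leaf_capacity)]
    low_keys = [page[0] for page in level]   # smallest key under each node
    levels = [level]

    while len(level) > 1:
        parents, parent_lows = [], []
        for i in range(0, len(level), fanout):
            group = low_keys[i:i + fanout]
            # Separators are the smallest keys of every child but the first.
            parents.append(group[1:])
            parent_lows.append(group[0])
        level, low_keys = parents, parent_lows
        levels.append(level)
    return levels

if __name__ == "__main__":
    for depth, nodes in enumerate(bulk_load(list(range(1, 13)), 2, 3)):
        print(depth, nodes)
```

Because the leaves are created in key order, they can be written sequentially, and `leaf_capacity` plays the role of the fill factor mentioned in the summary.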