Introduction to Indexing 2. Acknowledgements: Eamonn Keogh and Chotirat Ann Ratanamahatana


Indexed Sequential Access Method
We have seen that an index that is too small or too large (in other words, one with too few or too many pointers) can be a problem. But suppose the index does not fit in main memory? The key observation is that the index itself is a sort of database, so let's build an index on the index!
[Figure: an index file of keys and pointers (p) over the data file, whose pages are Page 1, Page 2, Page 3, ..., Page N-1, Page N.]

Tree Based Indexing
An index of indices is a tree! We can use this structure to do fast equality search (for example, find 15, or find 0). What about range search?
It looks like we have solved our fast indexing problem, but there is a catch: what happens if we have a deletion or an insertion?
Define: root, internal node, leaf.
[Figure: a tree index with index keys 5, 13, 14, 16, 18, 30, 35, 43 over data pages Page 1 through Page 10.]
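The idea of an index on the index can be sketched in a few lines. This is a minimal illustration, not an actual ISAM implementation: the page contents, the fanout of 2, and the names build_index, find_page and contains are all assumptions made for the example.

```python
from bisect import bisect_right

# Hypothetical data file: sorted pages of keys.
pages = [[2, 3], [5, 7], [13, 14], [16, 18], [30, 35], [43, 50]]

def build_index(first_keys, fanout):
    """Build index levels bottom-up and return them top level first.
    Each node is a list of (child position, smallest key below) pairs."""
    levels = []
    entries = list(enumerate(first_keys))
    while True:
        nodes = [entries[i:i + fanout] for i in range(0, len(entries), fanout)]
        levels.append(nodes)
        if len(nodes) == 1:          # a single node is the root
            break
        # Index the level just built: one entry per node.
        entries = [(pos, node[0][1]) for pos, node in enumerate(nodes)]
    levels.reverse()
    return levels

def find_page(levels, key):
    """Walk from the root down to the one data page that could hold key."""
    pos = 0
    for level in levels:
        node = level[pos]
        keys = [k for _, k in node]
        # Last entry whose key is <= the search key (first entry if none).
        slot = max(bisect_right(keys, key) - 1, 0)
        pos = node[slot][0]
    return pos

def contains(pages, levels, key):
    return key in pages[find_page(levels, key)]

levels = build_index([page[0] for page in pages], fanout=2)
print(contains(pages, levels, 16))   # equality search that succeeds
print(contains(pages, levels, 15))   # equality search that fails
print(contains(pages, levels, 0))    # key below every page
```

Each lookup touches one node per level plus one data page, which is why the structure stays fast even when the bottom index level is too large for main memory.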

Tree Based Indexing
What happens if we have a deletion? (Not much.) What happens if we have an insertion? (Trouble!)
Solution: overflow buckets. Suppose we add a bunch of 15-year-olds to the database: the new records that do not fit on their page go into an overflow page. But if we have enough overflow buckets, we might as well have no index at all.
[Figure: the same tree index (index keys 5, 13, 14, 16, 18, 30, 35, 43 over Page 1 through Page 10), now with an overflow page, Overflow 1.]

B+-Tree Index Files
B+-tree indices are an alternative to indexed-sequential files.
Disadvantage of indexed-sequential files: performance degrades as the file grows, since many overflow blocks get created. Periodic reorganization of the entire file is required.
Advantage of B+-tree index files: the tree automatically reorganizes itself with small, local changes in the face of insertions and deletions. Reorganization of the entire file is not required to maintain performance.
Disadvantage of B+-trees: extra insertion and deletion overhead, and space overhead.
The advantages of B+-trees outweigh the disadvantages, and they are used extensively.

B+-Tree Index Files (Cont.)
A B+-tree is a rooted tree satisfying the following properties:
- All paths from root to leaf are of the same length.
- There are two types of nodes: index (internal) nodes and data (leaf) nodes.
- Each node is one disk page.
- Each node must have a minimum of 50% occupancy (except for the root).
- Each node contains m entries/pointers, where d <= m <= 2d; d is the order (also called the branching factor or capacity) of the tree.
- The root must have at least 2 children (unless it is itself a leaf).

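B+-Trees Example
[Figure: a B+-tree with root key 17, annotated "Entries <= 17" on the left and "Entries > 17" on the right. Index nodes: 5 13 (left) and 27 30 (right). Leaves: 2* 3* | 5* 7* 8* | 14* 16* | 22* 24* | 27* 29* | 33* 34* 38* 39*.]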

Queries on B+-Trees
Find all records with a search-key value of k.
1. Start with the root node.
   a. Examine the node for the smallest search-key value > k.
   b. If such a value exists, call it K_i. Then follow P_i to the child node.
   c. Otherwise k >= K_{m-1}, where there are m pointers in the node. Then follow P_m to the child node.
2. If the node reached by following the pointer above is not a leaf node, repeat the above procedure on that node, and follow the corresponding pointer.
3. Eventually reach a leaf node. If for some i, key K_i = k, follow pointer P_i to the desired record. Otherwise no record with search-key value k exists.
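The lookup procedure above can be sketched as follows. This is only an illustration under assumptions: the Node class, the leaf helper, the record strings and the small hand-built tree are hypothetical, and a real B+-tree would keep nodes on disk pages rather than as Python objects.

```python
from bisect import bisect_right
from dataclasses import dataclass, field

@dataclass
class Node:
    keys: list
    children: list = field(default_factory=list)  # empty for leaf nodes
    records: list = field(default_factory=list)   # used by leaf nodes only

    @property
    def is_leaf(self):
        return not self.children

def search(root, k):
    """Return the record stored under key k, or None if k is absent."""
    node = root
    while not node.is_leaf:
        # bisect_right gives the position of the smallest key > k, which is
        # also the (0-based) pointer to follow; if no key is larger than k,
        # it returns len(keys), i.e. the last pointer.
        i = bisect_right(node.keys, k)
        node = node.children[i]
    for key, record in zip(node.keys, node.records):
        if key == k:
            return record
    return None

def leaf(*keys):
    # Hypothetical helper: a leaf whose records are just labelled strings.
    return Node(keys=list(keys), records=[f"rec{key}" for key in keys])

# Hypothetical two-level tree, built by hand for the example.
root = Node(
    keys=[13, 17, 24, 30],
    children=[leaf(2, 3, 5, 7), leaf(14, 16), leaf(19, 20, 22),
              leaf(24, 27, 28), leaf(40, 41, 45, 77)],
)

print(search(root, 28))  # found in the leaf between separators 24 and 30
print(search(root, 0))   # no such key
```

Using bisect_right means a key equal to a separator is sought in the subtree to the separator's right, which matches step 1 of the procedure (follow the pointer before the smallest key strictly greater than k).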

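Queries on B+-Trees
Example queries: find 28*; find 0*; find all records > 25.
[Figure: the same B+-tree as before. Root key 17, annotated "Entries <= 17" and "Entries > 17". Index nodes: 5 13 (left) and 27 30 (right). Leaves: 2* 3* | 5* 7* 8* | 14* 16* | 22* 24* | 27* 29* | 33* 34* 38* 39*.]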

Queries on B+-Trees (Cont.)
In processing a query, a path is traversed in the tree from the root to some leaf node. If there are K search-key values in the file, the path is no longer than ⌈log_{⌈n/2⌉}(K)⌉, where n is the maximum number of pointers in a node.
A node is generally the same size as a disk block, e.g. 4 kilobytes, and n = 2d is typically around 100 (40 bytes per index entry).
With 1 million search-key values and n = 100, at most ⌈log_50(1,000,000)⌉ = 4 nodes are accessed in a lookup.
Contrast this with a balanced binary tree with 1 million search-key values, where around 20 nodes are accessed in a lookup. The difference is significant, since every node access may need a disk I/O, costing around 20 milliseconds!
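The arithmetic on this slide is easy to check. The short script below recomputes the figures quoted above; the function name max_path_length is an assumption made for the example, and the 20 ms per I/O is the slide's own estimate.

```python
import math

def max_path_length(num_keys, n):
    """Upper bound on nodes visited: ceil(log base ceil(n/2) of num_keys)."""
    return math.ceil(math.log(num_keys) / math.log(math.ceil(n / 2)))

entries_per_node = 4096 // 40            # 4 KB block, 40 bytes per index entry
btree_nodes = max_path_length(1_000_000, 100)
binary_nodes = math.ceil(math.log2(1_000_000))
io_ms = 20                               # cost of one disk I/O, per the slide

print(entries_per_node)                  # 102, i.e. "around 100"
print(btree_nodes, btree_nodes * io_ms)  # 4 nodes, 80 ms worst case
print(binary_nodes, binary_nodes * io_ms)  # 20 nodes, 400 ms worst case
```

The base of the logarithm is ⌈n/2⌉ rather than n because a node is only guaranteed to be half full, so the worst-case fanout is 50, not 100.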

Updates on B+-Trees: Insertion
Find the leaf node in which the search-key value would appear.
If the search-key value is already in the leaf node, the record is added to the file and, if necessary, a pointer is inserted into the bucket.
If the search-key value is not there, add the record to the main file and create a bucket if necessary. Then:
- If there is room in the leaf node, insert the (key-value, pointer) pair in the leaf node.
- Otherwise, split the node (along with the new (key-value, pointer) entry), as discussed in the following slides.

Updates on B+-Trees: Insertion
Insert 23. This is the easy case!
[Figure, before: index node 13 17 24 30. Leaves: 2* 3* 5* 7* | 14* 16* | 19* 20* 22* | 24* 27* 28* | 40* 41* 45* 77*.]
[Figure, after: index node 13 17 24 30. Leaves: 2* 3* 5* 7* | 14* 16* | 19* 20* 22* 23* | 24* 27* 28* | 40* 41* 45* 77*.]

Updates on B+-Trees: Insertion
Insert 8.
[Figure, before: index node 13 17 24 30. Leaves: 2* 3* 5* 7* | 14* 16* | 19* 20* 22* | 24* 27* 28* | 40* 41* 45* 77*.]
[Figure, after the leaf split: index node 13 17 24 30, with 5 waiting to be inserted. Leaves: 2* 3* | 5* 7* 8* | 14* 16* | 19* 20* 22* | 24* 27* 28* | 40* 41* 45* 77*.]
Because the insertion would overfill the leaf, we split the leaf node into two nodes and distribute the data evenly between them. 5 is special, since it discriminates between the two new siblings, so it is copied up. We now need to insert 5 into the parent node.
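The leaf split described above can be sketched as a small function. This is a simplified illustration: leaves are plain sorted lists of keys, the capacity of 4 is an assumption read off the example, and split_leaf is a hypothetical name.

```python
def split_leaf(keys, new_key, capacity):
    """Insert new_key into a sorted leaf, splitting it on overflow.

    Returns (left, right, separator). When the key fits without a split,
    right and separator are both None.
    """
    merged = sorted(keys + [new_key])
    if len(merged) <= capacity:
        return merged, None, None
    mid = len(merged) // 2
    left, right = merged[:mid], merged[mid:]
    # The separator is COPIED up: it goes to the parent but also stays
    # in the right leaf, because leaves must hold every data entry.
    return left, right, right[0]

print(split_leaf([19, 20, 22], 23, capacity=4))   # easy case, no split
print(split_leaf([2, 3, 5, 7], 8, capacity=4))    # split, 5 is copied up
```

The second call reproduces the slide's example: the leaf 2* 3* 5* 7* becomes 2* 3* and 5* 7* 8*, and 5 is handed to the parent.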

Updates on B+-Trees: Insertion
We now need to insert 5 into the parent node.
[Figure, before: index node 13 17 24 30, with 5 waiting to be inserted. Leaves: 2* 3* | 5* 7* 8* | 14* 16* | 19* 20* 22* | 24* 27* 28* | 40* 41* 45* 77*.]
[Figure, after: the overfull index node 5 13 17 24 30 is split into 5 13 and 24 30, over the same leaves.]
Because the insertion would overfill the parent, we split that node into two nodes as well. 17 is special, since it discriminates between the two new siblings, so it is pushed up.

Updates on B+-Trees: Insertion
The insertion of 8 has increased the height of the tree by one (this is rare).
[Figure: the final tree. Root: 17. Index nodes: 5 13 (left) and 24 30 (right). Leaves: 2* 3* | 5* 7* 8* | 14* 16* | 19* 20* 22* | 24* 27* 28* | 40* 41* 45* 77*.]
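Splitting an index node differs from splitting a leaf in one detail: the middle key is pushed up rather than copied up. A minimal sketch, assuming an index node is a list of keys plus a list of child pointers (the labels "L0" to "L5" stand in for leaf pages and are hypothetical, as is the name split_internal):

```python
def split_internal(keys, children, capacity):
    """Split an overfull index node into two halves.

    The middle key is PUSHED up: it moves to the parent and is kept in
    neither half, since index keys only direct the search.
    """
    assert len(keys) > capacity and len(children) == len(keys) + 1
    mid = len(keys) // 2
    separator = keys[mid]
    left = (keys[:mid], children[:mid + 1])
    right = (keys[mid + 1:], children[mid + 1:])
    return left, separator, right

# The overfull parent from the example, after 5 has been added to it.
left, separator, right = split_internal(
    [5, 13, 17, 24, 30],
    ["L0", "L1", "L2", "L3", "L4", "L5"],
    capacity=4,
)
print(left)        # keys 5 13 with the first three leaves
print(separator)   # the key that becomes the new root entry
print(right)       # keys 24 30 with the last three leaves
```

When the node being split is the old root, the separator becomes the only key of a new root, which is the one situation in which the tree grows taller.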

Updates on B+-Trees: Deletion
Find the record to be deleted, and remove it from the main file and from the bucket (if present).
Remove the (search-key value, pointer) pair from the leaf node if there is no bucket or if the bucket has become empty.
If the node has too few entries due to the removal, and a sibling node has extra entries, then:
- Redistribute the pointers between the node and the sibling such that both have at least the minimum number of entries.
- Update the corresponding search-key value in the parent of the node.

Updates on B+-Trees: Deletion
Otherwise, if the node has too few entries due to the removal and redistribution is not possible, then:
- Insert all the search-key values in the two nodes into a single node (the one on the left), and delete the other node.
- Delete the pair (K_{i-1}, P_i), where P_i is the pointer to the deleted node, from its parent, recursively using the above procedure.
The node deletions may cascade upwards until a node which has ⌈n/2⌉ or more pointers is found. If the root node has only one pointer after deletion, it is deleted and the sole child becomes the root.

Example B+ Tree
[Figure: Root: 17. Index nodes: 5 13 (left) and 24 30 (right). Leaves: 2* 3* | 5* 7* 8* | 14* 16* | 19* 20* 22* | 24* 27* 29* | 33* 34* 38* 39*.]

Example Tree After Deleting 19* and 20* ...
Deleting 19* is easy. Deleting 20* is done with re-distribution. Notice how the middle key is copied up.
[Figure: Root: 17. Index nodes: 5 13 (left) and 27 30 (right). Leaves: 2* 3* | 5* 7* 8* | 14* 16* | 22* 24* | 27* 29* | 33* 34* 38* 39*.]

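... And Then Deleting 24*
Must merge. Observe the 'toss' of an index entry (on the right), and the 'pull down' of an index entry (below).
[Figure, right: after the leaf merge, the index node 30 over leaves 22* 27* 29* | 33* 34* 38* 39*.]
[Figure, below: the resulting tree. Root: 5 13 17 30. Leaves: 2* 3* | 5* 7* 8* | 14* 16* | 22* 27* 29* | 33* 34* 38* 39*.]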
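The choice between re-distribution and merging at the leaf level can be sketched as below. This is a simplified illustration, assuming leaves are sorted lists, a minimum of 2 entries per leaf (as the example suggests), and a hypothetical function name fix_underflow; updating the parent is left to the caller.

```python
def fix_underflow(left, right, min_entries):
    """Repair two adjacent sibling leaves after a deletion left one too small.

    Returns (left, right, separator). After a merge, right and separator
    are None and the caller must delete the separator from the parent.
    """
    combined = left + right
    if len(combined) >= 2 * min_entries:
        # Re-distribute: enough entries for both leaves to stay legal.
        mid = len(combined) // 2
        left, right = combined[:mid], combined[mid:]
        return left, right, right[0]   # new separator is copied up
    # Merge: everything fits in the left node; the right node is deleted.
    return combined, None, None

# Deleting 19* and 20* leaves [22] next to [24, 27, 29]: re-distribute.
print(fix_underflow([22], [24, 27, 29], min_entries=2))
# Deleting 24* then leaves [22] next to [27, 29]: must merge.
print(fix_underflow([22], [27, 29], min_entries=2))
```

The first call yields the separator 27 seen in the tree after deleting 19* and 20*; the second yields the merged leaf 22* 27* 29*, after which the parent loses an entry and may itself underflow.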

Example of Non-leaf Re-distribution
Look at a different tree below. Suppose we're in the middle of the deletion of 24*. In contrast to the previous example, we can redistribute an entry from the left child of the root to the right child.
[Figure: Root: 22. Index nodes: 5 13 17 20 (left) and 30 (right). Leaves: 2* 3* | 5* 7* 8* | 14* 16* | 17* 18* | 20* 21* | 22* 27* 29* | 33* 34* 38* 39*.]

After Re-distribution
Intuitively, entries are re-distributed by 'pushing through' the splitting entry in the parent node. It suffices to re-distribute the index entry with key 20; we've re-distributed 17 as well for illustration.
[Figure: Root: 17. Index nodes: 5 13 (left) and 20 22 30 (right). Leaves: 2* 3* | 5* 7* 8* | 14* 16* | 17* 18* | 20* 21* | 22* 27* 29* | 33* 34* 38* 39*.]

B+-Tree File Optimization
Good space utilization is important, since records use more space than pointers.
To improve space utilization, involve more sibling nodes in redistribution during splits and merges. Involving 2 siblings in redistribution (to avoid a split or merge where possible) results in each node having at least ⌊2n/3⌋ entries.
Example: insert 23.
[Figure: index node 13 17 24 30. Leaves: 2* 3* 5* 7* | 14* 16* | 19* 20* 21* 22* | 24* 27* 28* | 40* 41* 45* 77*.]

Bulk Loading of a B+ Tree
If we have a large collection of records and we want to create a B+ tree on some field, doing so by repeatedly inserting records is very slow. Bulk loading can be done much more efficiently.
Initialization: sort all data entries, and insert a pointer to the first (leaf) page in a new (root) page.
[Figure: an empty root pointing to the first of the sorted pages of data entries, which are not yet in the B+ tree: 3* 4* | 6* 9* | 10* 11* | 12* 13* | 20* 22* | 23* 31* | 35* 36* | 38* 41* | 44*.]

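Bulk Loading (Cont.)
[Figure, first step: the root holds index entries 6 and 10; the remaining sorted pages of data entries are not yet in the B+ tree. Data entry pages: 3* 4* | 6* 9* | 10* 11* | 12* 13* | 20* 22* | 23* 31* | 35* 36* | 38* 41* | 44*.]
[Figure, next step: the root holds 10, with index nodes 6 and 12 below it; the remaining data entry pages are not yet in the B+ tree.]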

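Bulk Loading (Cont.)
Index entries for leaf pages are always entered into the rightmost index page just above the leaf level. When this fills up, it splits. (The split may go up the rightmost path to the root.)
[Figure, first tree: the root holds 10 and 20, with index nodes 6, 12, and 23 35 below it; the remaining data entry pages are not yet in the B+ tree.]
[Figure, second tree: the root holds 20, with index nodes 10 and 35 below it, and index nodes 6, 12, 23, and 38 at the level above the leaves; the remaining data entry pages are not yet in the B+ tree.]
[Both figures show the data entry pages 3* 4* | 6* 9* | 10* 11* | 12* 13* | 20* 22* | 23* 31* | 35* 36* | 38* 41* | 44*.]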

Summary of Bulk Loading
Option 1: multiple inserts.
- Slow.
- Does not give sequential storage of leaves.
Option 2: bulk loading.
- Has advantages for concurrency control.
- Fewer I/Os during build.
- Leaves will be stored sequentially (and linked, of course).
- Can control the fill factor on pages.
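A bottom-up build can be sketched compactly. Note that this is a simplified level-by-level variant, not the rightmost-path insertion shown on the slides: it sorts the entries, packs them into leaves, and then builds each index level from the one below. The dict layout, the function names, the leaf capacity of 2 and the fanout of 3 are assumptions made for the example.

```python
def bulk_load(keys, leaf_capacity, fanout):
    """Build a tree bottom-up from keys; returns the root as nested dicts.

    Each node is {"keys": [...], "children": [...], "low": smallest key below}.
    Simplification: a trailing group may end up under-full, which a real
    loader would fix by rebalancing the last two nodes of each level.
    """
    keys = sorted(keys)
    if not keys:
        return {"keys": [], "children": [], "low": None}
    # Leaf level: consecutive runs of sorted keys, stored in order.
    level = [{"keys": keys[i:i + leaf_capacity], "children": [], "low": keys[i]}
             for i in range(0, len(keys), leaf_capacity)]
    while len(level) > 1:
        parents = []
        for i in range(0, len(level), fanout):
            group = level[i:i + fanout]
            parents.append({
                # One separator per child after the first: its smallest key.
                "keys": [child["low"] for child in group[1:]],
                "children": group,
                "low": group[0]["low"],
            })
        level = parents
    return level[0]

def leaf_keys(node):
    """Collect all data entries left to right (for checking the build)."""
    if not node["children"]:
        return list(node["keys"])
    return [k for child in node["children"] for k in leaf_keys(child)]

data = [3, 4, 6, 9, 10, 11, 12, 13, 20, 22, 23, 31, 35, 36, 38, 41, 44]
root = bulk_load(data, leaf_capacity=2, fanout=3)
print(root["keys"])
print([child["keys"] for child in root["children"]])
```

Because every leaf is written once, in key order, the build needs far fewer I/Os than repeated insertion, and lowering leaf_capacity below the page capacity is one way to control the fill factor.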

Summary
Tree-structured indexes are ideal for range searches, and also good for equality searches.
ISAM is a static structure. Only leaf pages are modified; overflow pages are needed. Overflow chains can degrade performance unless the size of the data set and the data distribution stay constant.
The B+ tree is a dynamic structure. Inserts and deletes leave the tree height-balanced, at log_F N cost. High fanout (F) means the depth is rarely more than 3 or 4. Almost always better than maintaining a sorted file.

Summary (Cont.)
Typically, 67% occupancy on average.
Usually preferable to ISAM; adjusts to growth gracefully.
Bulk loading can be much faster than repeated inserts for creating a B+ tree on a large data set.
Most widely used index in database management systems because of its versatility. One of the most optimized components of a DBMS.

Indices in SQL
CREATE INDEX gpa_ranking ON Students WITH STRUCTURE = BTREE KEY = gpa
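The exact CREATE INDEX syntax varies between database systems. As a runnable illustration, the sketch below uses SQLite through Python's sqlite3 module, where indexes are stored as B-trees and the statement takes the column list in parentheses. Apart from Students, gpa and gpa_ranking, every name and value here is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Students (sid INTEGER PRIMARY KEY, name TEXT, gpa REAL)"
)
conn.executemany(
    "INSERT INTO Students (name, gpa) VALUES (?, ?)",
    [("Ann", 3.9), ("Bob", 3.1), ("Cy", 3.6)],   # made-up sample rows
)

# SQLite's form of the statement: the key column goes in parentheses.
conn.execute("CREATE INDEX gpa_ranking ON Students (gpa)")

# A range query of the kind a tree-structured index is designed to serve.
rows = conn.execute(
    "SELECT name, gpa FROM Students WHERE gpa > 3.5 ORDER BY gpa"
).fetchall()
index_names = [
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'index'")
]

print(rows)
print(index_names)
```

Whether the query planner actually uses the index depends on the system and its statistics; the query returns the same rows either way.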