CS 525: Advanced Database Organization 04: Indexing

Similar documents
CS 245: Database System Principles

CSE 562 Database Systems

Chapter 13: Indexing. Chapter 13. ? value. Topics. Indexing & Hashing. value. Conventional indexes B-trees Hashing schemes (self-study) record

CS232A: Database System Principles INDEXING. Indexing. Indexing. Given condition on attribute find qualified records Attr = value

Access Methods. Basic Concepts. Index Evaluation Metrics. search key pointer. record. value. Value

Physical Level of Databases: B+-Trees

Indexing. Week 14, Spring Edited by M. Naci Akkøk, , Contains slides from 8-9. April 2002 by Hector Garcia-Molina, Vera Goebel

Topics to Learn. Important concepts. Tree-based index. Hash-based index

CS143: Index. Book Chapters: (4 th ) , (5 th ) , , 12.10

Chapter 12: Indexing and Hashing

CSIT5300: Advanced Database Systems

Chapter 12: Indexing and Hashing. Basic Concepts

Material You Need to Know

Intro to DB CHAPTER 12 INDEXING & HASHING

Chapter 12: Indexing and Hashing (Cnt(

Chapter 11: Indexing and Hashing

Introduction to Indexing 2. Acknowledgements: Eamonn Keogh and Chotirat Ann Ratanamahatana

Database System Concepts, 6 th Ed. Silberschatz, Korth and Sudarshan See for conditions on re-use

Find the block in which the tuple should be! If there is free space, insert it! Otherwise, must create overflow pages!

Chapter 11: Indexing and Hashing

Chapter 12: Indexing and Hashing

Extra: B+ Trees. Motivations. Differences between BST and B+ 10/27/2017. CS1: Java Programming Colorado State University

Chapter 11: Indexing and Hashing

7RSLFV ,QGH[HV 7HUPVÃDQGÃ'LVWLQFWLRQV. Conventional Indexes B-Tree Indexes Hashing Indexes

Some Practice Problems on Hardware, File Organization and Indexing

Kathleen Durant PhD Northeastern University CS Indexes

Indexing and Hashing

Data Organization B trees

Chapter 11: Indexing and Hashing" Chapter 11: Indexing and Hashing"

Database System Concepts, 5th Ed. Silberschatz, Korth and Sudarshan See for conditions on re-use

Problem. Indexing with B-trees. Indexing. Primary Key Indexing. B-trees: Example. B-trees. primary key indexing

amiri advanced databases '05

Chapter 11: Indexing and Hashing

CARNEGIE MELLON UNIVERSITY DEPT. OF COMPUTER SCIENCE DATABASE APPLICATIONS

Indexing. Announcements. Basics. CPS 116 Introduction to Database Systems

Storage hierarchy. Textbook: chapters 11, 12, and 13

Tree-Structured Indexes

Announcements. Reading Material. Recap. Today 9/17/17. Storage (contd. from Lecture 6)

Indexing. Jan Chomicki University at Buffalo. Jan Chomicki () Indexing 1 / 25

Advanced Database Systems

Spring 2017 B-TREES (LOOSELY BASED ON THE COW BOOK: CH. 10) 1/29/17 CS 564: Database Management Systems, Jignesh M. Patel 1

Chapter 13: Query Processing

Hashed-Based Indexing

Database System Concepts

Tree-Structured Indexes

Selection Queries. to answer a selection query (ssn=10) needs to traverse a full path.

Indexing. Chapter 8, 10, 11. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1

Advances in Data Management Principles of Database Systems - 2 A.Poulovassilis

Chapter 12: Query Processing. Chapter 12: Query Processing

Background: disk access vs. main memory access (1/2)

Database index structures

! A relational algebra expression may have many equivalent. ! Cost is generally measured as total elapsed time for

Chapter 13: Query Processing Basic Steps in Query Processing

Physical Database Design: Outline

THE B+ TREE INDEX. CS 564- Spring ACKs: Jignesh Patel, AnHai Doan

Indexing: Overview & Hashing. CS 377: Database Systems

CS 525: Advanced Database Organization 03: Disk Organization

Chapter 12: Query Processing

Datenbanksysteme II: Caching and File Structures. Ulf Leser

CSE 530A. B+ Trees. Washington University Fall 2013

CS 525: Advanced Database Organization

Indexing: B + -Tree. CS 377: Database Systems

CSC 261/461 Database Systems Lecture 17. Fall 2017

Database Technology. Topic 7: Data Structures for Databases. Olaf Hartig.

Physical Disk Structure. Physical Data Organization and Indexing. Pages and Blocks. Access Path. I/O Time to Access a Page. Disks.

More B-trees, Hash Tables, etc. CS157B Chris Pollett Feb 21, 2005.

Remember. 376a. Database Design. Also. B + tree reminders. Algorithms for B + trees. Remember

M-ary Search Tree. B-Trees. Solution: B-Trees. B-Tree: Example. B-Tree Properties. B-Trees (4.7 in Weiss)

Lecture 13. Lecture 13: B+ Tree

Introduction. Choice orthogonal to indexing technique used to locate entries K.

Query Processing. Debapriyo Majumdar Indian Sta4s4cal Ins4tute Kolkata DBMS PGDBA 2016

Introduction to Indexing R-trees. Hong Kong University of Science and Technology

System Structure Revisited

C SCI 335 Software Analysis & Design III Lecture Notes Prof. Stewart Weiss Chapter 4: B Trees

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2009 Lecture 6 - Storage and Indexing

Tree-Structured Indexes

Tree-Structured Indexes. Chapter 10

M-ary Search Tree. B-Trees. B-Trees. Solution: B-Trees. B-Tree: Example. B-Tree Properties. Maximum branching factor of M Complete tree has height =

Query Processing & Optimization

An AVL tree with N nodes is an excellent data. The Big-Oh analysis shows that most operations finish within O(log N) time

Data Structures and Algorithms

COMP 430 Intro. to Database Systems. Indexing

ACCESS METHODS: FILE ORGANIZATIONS, B+TREE

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2014/15

Overview of Storage and Indexing

DATABASE PERFORMANCE AND INDEXES. CS121: Relational Databases Fall 2017 Lecture 11

Indexing and Hashing

Database Applications (15-415)

(2,4) Trees Goodrich, Tamassia. (2,4) Trees 1

Hash-Based Indexes. Chapter 11. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1

Introduction to Data Management. Lecture 15 (More About Indexing)

Chapter 17 Indexing Structures for Files and Physical Database Design

Indexing Methods. Lecture 9. Storage Requirements of Databases

Lecture 8 Index (B+-Tree and Hash)

Administrivia. Tree-Structured Indexes. Review. Today: B-Tree Indexes. A Note of Caution. Introduction

CS Fall 2010 B-trees Carola Wenk

File Structures and Indexing

CS F-11 B-Trees 1

Lecture 3: B-Trees. October Lecture 3: B-Trees

CS-245 Database System Principles

Transcription:

CS 5: Advanced Database Organization 04: Indexing Boris Glavic Part 04 Indexing & Hashing value record? value Slides: adapted from a course taught by Hector Garcia-Molina, Stanford InfoLab CS 5 Notes 4 - Indexing 1 CS 5 Notes 4 - Indexing 2 Topics Conventional indexes B-trees Hashing schemes Advanced Index Techniques Sequential File 0 CS 5 Notes 4 - Indexing CS 5 Notes 4 - Indexing 4 Dense Index Sequential File Sparse Index Sequential File 0 1 1 0 1 1 1 1 1 2 2 0 CS 5 Notes 4 - Indexing 5 CS 5 Notes 4 - Indexing 6 1

Sparse 2nd level 1 2 4 4 5 1 1 1 1 1 2 2 Sequential File 0 Comment: {FILE,INDEX} may be contiguous or not (blocks chained) CS 5 Notes 4 - Indexing 7 CS 5 Notes 4 - Indexing 8 Question: Can we build a dense, 2nd level index for a dense index? Notes on pointers: (1) Block pointer (sparse index) can be smaller than record pointer BP RP CS 5 Notes 4 - Indexing 9 CS 5 Notes 4 - Indexing Notes on pointers: (2) If file is contiguous, then we can omit pointers (i.e., compute them) K1 K2 K K4 R1 R2 R R4 CS 5 Notes 4 - Indexing 11 CS 5 Notes 4 - Indexing 12 2

R1 Sparse vs. Dense Tradeoff K1 K2 K K4 if we want K block: get it at offset (-1)24 = 48 bytes R2 R R4 say: 24 B per block Sparse: Less index space per record can keep more of index in memory Dense: Can tell if any record exists without accessing file (Later: sparse better for insertions dense needed for secondary indexes) CS 5 Notes 4 - Indexing 1 CS 5 Notes 4 - Indexing 14 Terms Index sequential file Search key ( primary key) Primary index (on Sequencing field) Secondary index Dense index (all Search Key values in) Sparse index Multi-level index Next: Duplicate keys Deletion/Insertion Secondary indexes CS 5 Notes 4 - Indexing 15 CS 5 Notes 4 - Indexing 16 Duplicate keys Duplicate keys Dense index, one way to implement? CS 5 Notes 4 - Indexing 17 CS 5 Notes 4 - Indexing 18

Duplicate keys Duplicate keys Dense index, better way? Sparse index, one way? CS 5 Notes 4 - Indexing 19 CS 5 Notes 4 - Indexing Duplicate keys Sparse index, one way? careful if looking for or! Duplicate keys Sparse index, another way? place first new key from block CS 5 Notes 4 - Indexing 21 CS 5 Notes 4 - Indexing 22 Duplicate keys Sparse index, another way? place first new key from block should this be? CS 5 Notes 4 - Indexing 2 Summary Duplicate values, primary index Index may point to first instance of each value only File Index a a a CS 5 Notes 4 - Indexing 24.. b 4

Deletion from sparse index Deletion from sparse index delete record 1 1 1 1 1 1 CS 5 Notes 4 - Indexing CS 5 Notes 4 - Indexing 26 Deletion from sparse index delete record Deletion from sparse index delete record 1 1 1 1 1 1 CS 5 Notes 4 - Indexing 27 CS 5 Notes 4 - Indexing 28 Deletion from sparse index delete record Deletion from sparse index delete records & 1 1 1 1 1 1 CS 5 Notes 4 - Indexing 29 CS 5 Notes 4 - Indexing 5

Deletion from sparse index delete records & Deletion from sparse index delete records & 1 1 1 1 1 1 CS 5 Notes 4 - Indexing 1 CS 5 Notes 4 - Indexing 2 Deletion from dense index Deletion from dense index delete record CS 5 Notes 4 - Indexing CS 5 Notes 4 - Indexing 4 Deletion from dense index delete record Deletion from dense index delete record CS 5 Notes 4 - Indexing 5 CS 5 Notes 4 - Indexing 6 6

Insertion, sparse index case Insertion, sparse index case insert record 4 CS 5 Notes 4 - Indexing 7 CS 5 Notes 4 - Indexing 8 Insertion, sparse index case insert record 4 Insertion, sparse index case insert record 15 4 our lucky day! we have free space where we need it! CS 5 Notes 4 - Indexing 9 CS 5 Notes 4 - Indexing Insertion, sparse index case insert record 15 Insertion, sparse index case insert record 15 15 Illustrated: Immediate reorganization Variation: insert new block (chained file) update index 15 CS 5 Notes 4 - Indexing 41 CS 5 Notes 4 - Indexing 42 7

Insertion, sparse index case insert record Insertion, sparse index case insert record overflow blocks (reorganize later...) CS 5 Notes 4 - Indexing 4 CS 5 Notes 4 - Indexing 44 Insertion, dense index case Similar Often more expensive... Secondary indexes 0 Sequence field CS 5 Notes 4 - Indexing CS 5 Notes 4 - Indexing 46 Secondary indexes Sparse index Sequence field Secondary indexes Sparse index Sequence field 0 0...... 0 does not make sense! 0 CS 5 Notes 4 - Indexing 47 CS 5 Notes 4 - Indexing 48 8

Secondary indexes Dense index Sequence field Secondary indexes Dense index Sequence field 0... 0 CS 5 Notes 4 - Indexing 49 CS 5 Notes 4 - Indexing Secondary indexes Dense index Sequence field With secondary indexes:... sparse high level... 0 Lowest level is dense Other levels are sparse Also: Pointers are record pointers (not block pointers; not computed) CS 5 Notes 4 - Indexing 51 CS 5 Notes 4 - Indexing 52 Duplicate values & secondary indexes Duplicate values & secondary indexes one option...... CS 5 Notes 4 - Indexing 5 CS 5 Notes 4 - Indexing 54 9

Duplicate values & secondary indexes one option... Problem: excess overhead! disk space search time... Duplicate values & secondary indexes another option... CS 5 Notes 4 - Indexing 55 CS 5 Notes 4 - Indexing 56 Duplicate values & secondary indexes another option... Problem: variable size records in index! Duplicate values & secondary indexes... Another idea: Chain records with same key? λ λ λ λ CS 5 Notes 4 - Indexing 57 CS 5 Notes 4 - Indexing 58 Duplicate values & secondary indexes Duplicate values & secondary indexes... Another idea (suggested in class): Chain records with same key? Problems: Need to add fields to records Need to follow chain to know records λ λ λ λ... buckets CS 5 Notes 4 - Indexing 59 CS 5 Notes 4 - Indexing

Why bucket idea is useful Query: Get employees in (Toy Dept) ^ (2nd floor) Indexes Name: primary Dept: secondary Floor: secondary Records EMP (name,dept,floor,...) Dept. index EMP Floor index Toy 2nd CS 5 Notes 4 - Indexing 61 CS 5 Notes 4 - Indexing 62 Query: Get employees in (Toy Dept) ^ (2nd floor) Dept. index EMP Floor index This idea used in text information retrieval Documents Toy 2nd...the cat is fat......was raining cats and dogs... Intersect toy bucket and 2nd Floor bucket to get set of matching EMP s CS 5 Notes 4 - Indexing 6 CS 5 Notes 4 - Indexing 64...Fido the dog... This idea used in text information retrieval cat dog Documents...the cat is fat......was raining cats and dogs... IR QUERIES Find articles with cat and dog Find articles with cat or dog Find articles with cat and not dog Inverted lists...fido the dog... CS 5 Notes 4 - Indexing 65 CS 5 Notes 4 - Indexing 66 11

IR QUERIES Find articles with cat and dog Find articles with cat or dog Find articles with cat and not dog Find articles with cat in title Find articles with cat and dog within 5 words Common technique: more info in inverted list cat Title 5 dog Author Abstract 57 Title 0 Title 12 type position location d d 2 d 1 CS 5 Notes 4 - Indexing 67 CS 5 Notes 4 - Indexing 68 Posting: an entry in inverted list. Represents occurrence of term in article Size of a list: 1 Rare words or (in postings) miss-spellings 6 Common words Size of a posting: -15 bits (compressed) IR DISCUSSION Stop words Truncation Thesaurus Full text vs. Abstracts Vector model CS 5 Notes 4 - Indexing 69 CS 5 Notes 4 - Indexing Vector space model w1 w2 w w4 w5 w6 w7 DOC = <1 0 0 1 1 0 0 > Query= <0 0 1 1 0 0 0 > Vector space model w1 w2 w w4 w5 w6 w7 DOC = <1 0 0 1 1 0 0 > Query= <0 0 1 1 0 0 0 > PRODUCT = 1 +. = score CS 5 Notes 4 - Indexing 71 CS 5 Notes 4 - Indexing 72 12

Tricks to weigh scores + normalize e.g.: Match on common word not as useful as match on rare words... How to process V.S. Queries? w1 w2 w w4 w5 w6 Q = < 0 0 0 1 1 0 > CS 5 Notes 4 - Indexing 7 CS 5 Notes 4 - Indexing 74 Try Stanford Libraries Try Google, Yahoo,... Summary so far Conventional index Basic Ideas: sparse, dense, multi-level Duplicate Keys Deletion/Insertion Secondary indexes Buckets of Postings List CS 5 Notes 4 - Indexing 75 CS 5 Notes 4 - Indexing 76 Conventional indexes Advantage: - Simple - Index is sequential file good for scans Disadvantage: - Inserts expensive, and/or - Lose sequentiality & balance Example continuous free space Index (sequential) CS 5 Notes 4 - Indexing 77 CS 5 Notes 4 - Indexing 78 1

Example continuous free space Index (sequential) 9 1 5 6 2 8 4 overflow area (not sequential) Outline: Conventional indexes B-Trees NEXT Hashing schemes Advanced Index Techniques CS 5 Notes 4 - Indexing 79 CS 5 Notes 4 - Indexing NEXT: Another type of index Give up on sequentiality of index Try to get balance B+Tree Example n= Root 5 11 5 0 1 1 1 1 0 1 156 179 1 0 1 1 1 CS 5 Notes 4 - Indexing 81 CS 5 Notes 4 - Indexing 82 Sample non-leaf Sample leaf node: From non-leaf node 57 81 95 57 81 95 to next leaf in sequence to keys to keys to keys to keys < 57 57 k<81 81 k<95 95 To record with key 57 To record with key 81 To record with key 85 CS 5 Notes 4 - Indexing 8 CS 5 Notes 4 - Indexing 84 14

In textbook s notation n= Leaf: 5 5 Size of nodes: n+1 pointers (fixed) n keys Non-leaf: CS 5 Notes 4 - Indexing 85 CS 5 Notes 4 - Indexing 86 Don t want nodes to be too empty Use at least (balance) n= Full node min. node Non-leaf: (n+1)/2 pointers Non-leaf 1 1 1 Leaf: (n+1)/2 pointers to data Leaf 5 11 5 counts even if null CS 5 Notes 4 - Indexing 87 CS 5 Notes 4 - Indexing 88 B+tree rules tree of order n () Number of pointers/keys for B+tree (1) All leaves at same lowest level (balanced tree) (2) Pointers in leaves point to records except for sequence pointer Max Max Min ptrs keys ptrs data Min keys Non-leaf (non-root) n+1 n (n+1)/2 (n+1)/2-1 Leaf (non-root) n+1 n (n+1)/2 (n+1)/2 Root n+1 n 1 1 CS 5 Notes 4 - Indexing 89 CS 5 Notes 4 - Indexing 15

Search Algorithm Search for key k Start from root until leaf is reached For current node find i so that Key[i] < k < Key[i + 1] Follow i+1 th pointer If current node is leaf return pointer to record or fail (no such record in tree) Search Example k= 1 n= 5 11 5 0 1 1 Root 0 1 1 1 1 1 1 156 179 1 0 CS 5 Notes 4 - Indexing 91 CS 5 Notes 4 - Indexing 92 Remarks Search If n is large, e.g., 0 Keys inside node are sorted -> use binary search to find I Performance considerations Linear search O(n) Binary search O(log 2 (n)) Insert into B+tree (a) simple case space available in leaf (b) leaf overflow (c) non-leaf overflow (d) new root CS 5 Notes 4 - Indexing 9 CS 5 Notes 4 - Indexing 94 (a) Insert key = 2 n= (a) Insert key = 2 n= 5 11 1 5 11 1 2 0 0 CS 5 Notes 4 - Indexing 95 CS 5 Notes 4 - Indexing 96 16

(a) Insert key = 7 n= (a) Insert key = 7 n= 5 11 1 5 7 5 11 1 0 0 CS 5 Notes 4 - Indexing 97 CS 5 Notes 4 - Indexing 98 (a) Insert key = 7 n= (c) Insert key = 1 n= 5 7 5 11 1 1 156 179 1 0 7 1 1 1 0 0 CS 5 Notes 4 - Indexing 99 CS 5 Notes 4 - Indexing 0 (c) Insert key = 1 n= (c) Insert key = 1 n= 1 156 179 1 179 1 0 1 156 179 1 179 1 0 1 1 1 1 1 1 1 0 0 CS 5 Notes 4 - Indexing 1 CS 5 Notes 4 - Indexing 2 17

(c) Insert key = 1 n= (d) New root, insert n= 1 156 179 1 179 1 0 1 2 12 2 1 1 1 1 0 1 CS 5 Notes 4 - Indexing CS 5 Notes 4 - Indexing 4 (d) New root, insert n= (d) New root, insert n= 1 2 12 2 1 2 12 2 CS 5 Notes 4 - Indexing 5 CS 5 Notes 4 - Indexing 6 (d) New root, insert n= 1 2 new root 12 2 Insertion Algorithm Insert Record with key k Search leaf node for k Leaf node has at least one space Insert into leaf Leaf is full Split leaf into two nodes (new leaf) Insert new leaf s smallest key into parent CS 5 Notes 4 - Indexing 7 CS 5 Notes 4 - Indexing 8 18

Insertion Algorithm cont. Non-leaf node is full Split parent Insert median key into parent Root is full Split root Create new root with two pointers and single key -> B-trees grow at the root Deletion from B+tree (a) Simple case - no example (b) Coalesce with neighbor (sibling) (c) Re-distribute keys (d) Cases (b) or (c) at non-leaf CS 5 Notes 4 - Indexing 9 CS 5 Notes 4 - Indexing 1 (b) Coalesce with sibling Delete n=4 (b) Coalesce with sibling Delete n=4 0 0 CS 5 Notes 4 - Indexing 111 CS 5 Notes 4 - Indexing 112 (c) Redistribute keys Delete n=4 (c) Redistribute keys Delete n=4 5 5 5 0 0 5 CS 5 Notes 4 - Indexing 11 CS 5 Notes 4 - Indexing 114 19

(d) Non-leaf coalese Delete 7 n=4 (d) Non-leaf coalese Delete 7 n=4 1 14 22 26 7 1 14 22 26 7 CS 5 Notes 4 - Indexing 115 CS 5 Notes 4 - Indexing 116 (d) Non-leaf coalese Delete 7 n=4 (d) Non-leaf coalese Delete 7 n=4 new root 1 14 22 26 7 1 14 22 26 7 CS 5 Notes 4 - Indexing 117 CS 5 Notes 4 - Indexing 118 Deletion Algorithm Delete record with key k Search leaf node for k Leaf has more than min entries Remove from leaf Leaf has min entries Try to borrow from sibling One direct sibling has more min entries Move entry from sibling and adapt key in parent Deletion Algorithm cont. Both direct siblings have min entries Merge with one sibling Remove node or sibling from parent ->recursive deletion Root has two children that get merged Merged node becomes new root CS 5 Notes 4 - Indexing 119 CS 5 Notes 4 - Indexing 1

B+tree deletions in practice Often, coalescing is not implemented Too hard and not worth it! Assumption: nodes will fill up in time again Comparison: B-trees vs. static indexed sequential file Ref #1: Held & Stonebraker B-Trees Re-examined CACM, Feb. 1978 CS 5 Notes 4 - Indexing 121 CS 5 Notes 4 - Indexing 122 Ref # 1 claims: - Concurrency control harder in B-Trees - B-tree consumes more space For their comparison: block = 512 bytes key = pointer = 4 bytes 4 data records per block CS 5 Notes 4 - Indexing 12 Example: 1 block static index 127 keys k1 k2 k (127+1)4 = 512 Bytes -> pointers in index implicit! up to 127 blocks CS 5 Notes 4 - Indexing 124 k1 k2 k 1 data block Example: 1 block B-tree Size comparison Ref. #1 k1 k1 k2 k2... 6 keys k6 k - next 6x(4+4)+8 = 512 Bytes -> pointers needed in B-tree up to 6 blocks because index is blocks not contiguous CS 5 Notes 4 - Indexing 1 1 data block Static Index B-tree # data # data blocks height blocks height 2 -> 127 2 2 -> 6 2 128 -> 16,129 64 -> 968 16,1 -> 2,048,8 4 969 -> 2,047 4 2,048 -> 15,752,961 5 CS 5 Notes 4 - Indexing 126 21

Ref. #1 analysis claims For an 8,000 block file, after 2,000 inserts after 16,000 lookups Static index saves enough accesses to allow for reorganization Ref. #1 analysis claims For an 8,000 block file, after 2,000 inserts after 16,000 lookups Static index saves enough accesses to allow for reorganization Ref. #1 conclusion Static index better!! CS 5 Notes 4 - Indexing 127 CS 5 Notes 4 - Indexing 128 Ref #2: M. Stonebraker, Retrospective on a database system, TODS, June 19 Ref. #2 conclusion B-trees better!! Ref. #2 conclusion B-trees better!! DBA does not know when to reorganize DBA does not know how full to load pages of new index CS 5 Notes 4 - Indexing 129 CS 5 Notes 4 - Indexing 1 Ref. #2 conclusion B-trees better!! Buffering B-tree: has fixed buffer requirements Static index: must read several overflow blocks to be efficient (large & variable size buffers needed for this) Speaking of buffering Is LRU a good policy for B+tree buffers? CS 5 Notes 4 - Indexing 11 CS 5 Notes 4 - Indexing 12 22

Speaking of buffering Is LRU a good policy for B+tree buffers? Of course not! Should try to keep root in memory at all times (and perhaps some nodes from second level) Interesting problem: For B+tree, how large should n be? n is number of keys / node CS 5 Notes 4 - Indexing 1 CS 5 Notes 4 - Indexing 14 Sample assumptions: (1) Time to read node from disk is (S+Tn) msec. Sample assumptions: (1) Time to read node from disk is (S+Tn) msec. (2) Once block in memory, use binary search to locate key: (a + b LOG 2 n) msec. For some constants a,b; Assume a << S CS 5 Notes 4 - Indexing 15 CS 5 Notes 4 - Indexing 16 Sample assumptions: (1) Time to read node from disk is (S+Tn) msec. (2) Once block in memory, use binary search to locate key: (a + b LOG 2 n) msec. For some constants a,b; Assume a << S () Assume B+tree is full, i.e., # nodes to examine is LOG n N where N = # records CS 5 Notes 4 - Indexing 17 Can get: f(n) = time to find a record f(n) n opt n CS 5 Notes 4 - Indexing 18 2

FIND n opt by f (n) = 0 Answer is n opt = few hundred FIND n opt by f (n) = 0 Answer is n opt = few hundred What happens to n opt as Disk gets faster? CPU get faster? Memory hierarchy? CS 5 Notes 4 - Indexing 19 CS 5 Notes 4 - Indexing 1 Variation on B+tree: B-tree (no +) Idea: Avoid duplicate keys Have record pointers in non-leaf nodes K1 P1 K2 P2 K P to record to record to record with K1 with K2 with K to keys to keys to keys to keys < K1 K1<x<K2 K2<x<k >k CS 5 Notes 4 - Indexing 141 CS 5 Notes 4 - Indexing 142 B-tree example n=2 65 1 B-tree example n=2 sequence pointers not useful now! (but keep space for simplicity) 0 1 1 1 1 1 1 1 1 0 85 5 1 165 65 1 1 1 1 1 1 1 1 1 85 5 1 165 CS 5 Notes 4 - Indexing 14 CS 5 Notes 4 - Indexing 144 24

Note on inserts Say we insert record with key = Note on inserts Say we insert record with key = leaf n= leaf n= Afterwards: CS 5 Notes 4 - Indexing 1 CS 5 Notes 4 - Indexing 146 So, for B-trees: MAX MIN Tree Rec Keys Tree Rec Keys Ptrs Ptrs Ptrs Ptrs Non-leaf non-root n+1 n n (n+1)/2 (n+1)/2-1 (n+1)/2-1 Leaf non-root 1 n n 1 n/2 n/2 Root non-leaf n+1 n n 2 1 1 Root Leaf 1 n n 1 1 1 Tradeoffs: J B-trees have faster lookup than B+trees L in B-tree, non-leaf & leaf different sizes L in B-tree, deletion more complicated CS 5 Notes 4 - Indexing 147 CS 5 Notes 4 - Indexing 148 Tradeoffs: J B-trees have faster lookup than B+trees L in B-tree, non-leaf & leaf different sizes L in B-tree, deletion more complicated But note: If blocks are fixed size (due to disk and buffering restrictions) Then lookup for B+tree is actually better!! B+trees preferred! CS 5 Notes 4 - Indexing 149 CS 5 Notes 4 - Indexing 1

Example: - Pointers 4 bytes - Keys 4 bytes - Blocks 0 bytes (just example) - Look at full 2 level tree B-tree: Root has 8 keys + 8 record pointers + 9 son pointers = 8x4 + 8x4 + 9x4 = 0 bytes CS 5 Notes 4 - Indexing 151 CS 5 Notes 4 - Indexing 152 B-tree: Root has 8 keys + 8 record pointers + 9 son pointers = 8x4 + 8x4 + 9x4 = 0 bytes Each of 9 sons: 12 rec. pointers (+12 keys) = 12x(4+4) + 4 = 0 bytes B-tree: Root has 8 keys + 8 record pointers + 9 son pointers = 8x4 + 8x4 + 9x4 = 0 bytes Each of 9 sons: 12 rec. pointers (+12 keys) = 12x(4+4) + 4 = 0 bytes 2-level B-tree, Max # records = 12x9 + 8 = 116 CS 5 Notes 4 - Indexing 15 CS 5 Notes 4 - Indexing 154 B+tree: Root has 12 keys + 1 son pointers = 12x4 + 1x4 = 0 bytes B+tree: Root has 12 keys + 1 son pointers = 12x4 + 1x4 = 0 bytes Each of 1 sons: 12 rec. ptrs (+12 keys) = 12x(4 +4) + 4 = 0 bytes CS 5 Notes 4 - Indexing 155 CS 5 Notes 4 - Indexing 156 26

B+tree: Root has 12 keys + 1 son pointers = 12x4 + 1x4 = 0 bytes So... B+ B 8 records Each of 1 sons: 12 rec. ptrs (+12 keys) = 12x(4 +4) + 4 = 0 bytes 2-level B+tree, Max # records = 1x12 = 156 ooooooooooooo ooooooooo 156 records 8 records Total = 116 CS 5 Notes 4 - Indexing 157 CS 5 Notes 4 - Indexing 158 So... B+ B ooooooooooooo ooooooooo 156 records 8 records Total = 116 Conclusion: For fixed block size, B+ tree is better because it is bushier 8 records Additional B-tree Variants B*-tree Internal notes have to be 2/ full CS 5 Notes 4 - Indexing 159 CS 5 Notes 4 - Indexing 1 An Interesting Problem... What is a good index structure when: records tend to be inserted with keys that are larger than existing values? (e.g., banking records with growing data/time) we want to remove older data One Solution: Multiple Indexes Example: I1, I2 day days indexed days indexed I1 I2 1,2,,4,5 6,7,8,9, 11 11,2,,4,5 6,7,8,9, 12 11,12,,4,5 6,7,8,9, 1 11,12,1,4,5 6,7,8,9, advantage: deletions/insertions from smaller index disadvantage: query multiple indexes CS 5 Notes 4 - Indexing 161 CS 5 Notes 4 - Indexing 162 27

Another Solution (Wave Indexes) day I1 I2 I I4 1,2, 4,5,6 7,8,9 11 1,2, 4,5,6 7,8,9,11 12 1,2, 4,5,6 7,8,9,11, 12 1 1 4,5,6 7,8,9,11, 12 14 1,14 4,5,6 7,8,9,11, 12 15 1,14,15 4,5,6 7,8,9,11, 12 16 1,14,15 16 7,8,9,11, 12 advantage: no deletions disadvantage: approximate windows CS 5 Notes 4 - Indexing 16 Concurrent Access To B-trees Multiple processes/threads accessing the B-tree Can lead to corruption Serialize access to complete tree for updates Simple Unnecessary restrictive Not feasible for high concurrency CS 5 Notes 4 - Indexing 164 Locks Nodes One solution Read and exclusive locks Safe and unsafe updates of nodes Safe: No ancestor of node will be effected by update Unsafe: Ancestor may be affected Can be determined locally E.g., deletion is safe is node has more than n/2 CS 5 Notes 4 - Indexing 165 Read Write Read X - Write - - Lock Nodes Reading Use standard search algorithm Hold lock on current node Release when navigating to child Writing Lock each node on search for key Release all locks on parents of node if the node is safe CS 5 Notes 4 - Indexing 166 Improvements? Try locking only the leaf for update Let update use read locks and only lock leaf node with write lock If leaf node is unsafe then use previous protocol Many more locking approaches have been proposed Outline/summary Conventional Indexes Sparse vs. dense Primary vs. secondary B trees B+trees vs. B-trees B+trees vs. indexed sequential Hashing schemes --> Next Advanced Index Techniques CS 5 Notes 4 - Indexing 167 CS 5 Notes 4 - Indexing 168 28