CARNEGIE MELLON UNIVERSITY
DEPT. OF COMPUTER SCIENCE
15-415 DATABASE APPLICATIONS
C. Faloutsos

Indexing and Hashing
http://www.cs.cmu.edu/~christos/courses/dbms.s00/

general overview
- relational model
  - SQL
  - formal & commercial query languages
- functional dependencies
- normalization
- physical design
- indexing

overview - detailed
- ordered indices
  - primary / secondary indices
  - index-sequential
  - multilevel (ISAM)
- B-trees, B+-trees
- hashing
  - static hashing
  - dynamic hashing

motivation
- once the records are stored in a file, how do you search efficiently?
- brute force: retrieve all records, report the qualifying ones
- better: use indices (pointers) to locate the records directly

we need additional structures
- what structures?
- how many indices / pointers?

[figure: an indexing structure over the record file]
  123  smith    main st.
  234  jones    forbes ave
  300  stevens  main st.

what is a good indexing technique?
..depends on the database & the queries we want to answer
- range queries?
- retrieval time?
- insertion / deletion?
- space overhead?
- reorganization?

ordered indices
- search keys are sorted in the index file and point to the actual records
- primary vs. secondary indices

[figure: two ordered indices over the same record file]
  index file 1:  123 | 234 | 300
  index file 2:  forbes ave | main st.
  record file:
    123  smith    main st.
    234  jones    forbes ave
    300  stevens  main st.

index-sequential files (primary indices)
- records are organized sequentially within the file (linked-list), according to a chosen key
- index file on the same key

[figure]
  index file:  forbes ave | main st.
  record file:
    forbes ave  jones    234
    main st.    smith    123
    main st.    stevens  300

dense vs. sparse index

[figure: dense index, one index entry per record]
  index file:  123 | 150 | 234 | 236 | 300
  record file:
    123  smith    main st.
    150  gates    walnut st.
    234  jones    forbes ave
    236  holmes   walnut st.
    300  stevens  main st.

dense vs. sparse index

[figure: sparse index, index entries for only some of the records]
  index file:  123 | 234 | 300
  record file:
    123  smith    main st.
    150  gates    walnut st.
    234  jones    forbes ave
    236  holmes   walnut st.
    300  stevens  main st.

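The difference between the two lookups can be sketched in a few lines of Python. This is only an illustration under assumptions: the record file is modeled as a sorted in-memory list, "pointers" are list positions, and the function names are hypothetical. The data is the example from the slides.

```python
from bisect import bisect_right

# Record file, sorted on the key (data from the slides).
records = [
    (123, "smith", "main st."),
    (150, "gates", "walnut st."),
    (234, "jones", "forbes ave"),
    (236, "holmes", "walnut st."),
    (300, "stevens", "main st."),
]

# Dense index: one (key, position) entry per record.
dense = [(key, pos) for pos, (key, _, _) in enumerate(records)]

# Sparse index: entries for only some keys (123, 234, 300 on the slide).
sparse = [(123, 0), (234, 2), (300, 4)]


def lookup_dense(key):
    """Binary-search the index; the entry, if present, points at the record."""
    keys = [k for k, _ in dense]
    i = bisect_right(keys, key) - 1
    if i >= 0 and keys[i] == key:
        return records[dense[i][1]]
    return None


def lookup_sparse(key):
    """Find the last index entry with key <= search key, then scan the file."""
    keys = [k for k, _ in sparse]
    i = bisect_right(keys, key) - 1
    if i < 0:
        return None
    pos = sparse[i][1]
    # The file is sorted, so we can stop as soon as we pass the search key.
    while pos < len(records) and records[pos][0] <= key:
        if records[pos][0] == key:
            return records[pos]
        pos += 1
    return None
```

With the sparse index, looking up 150 lands on the entry for 123 and scans forward one record; this only works because the record file is sorted on the same key.
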
multilevel indices (ISAM)
- if the index is too large to fit in main memory, store it on disk and keep an index on the index (in memory)

[figure]
  memory:             123 | 234
  disk, index file:   123 150 ..  |  234 236
  disk, record file:  .. smith ..  .. holmes ..

multilevel indices (ISAM)
- usually two levels of indices, one first-level entry per disk block (why?)
- typically, blocks 80% full initially (why? what are potential problems / inefficiencies?)

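A two-level lookup might look like the sketch below. It is a toy model, not ISAM as implemented in any particular system: disk blocks are plain Python lists, the block size of two entries and all names are assumptions, and the record names for keys 150 and 234 are taken from the earlier dense/sparse slides.

```python
from bisect import bisect_right

# Second-level index, stored "on disk" as blocks of (key, record position).
index_blocks = [
    [(123, 0), (150, 1)],
    [(234, 2), (236, 3)],
]

# First-level index, kept in memory: one entry (the smallest key) per index block.
top_level = [block[0][0] for block in index_blocks]   # [123, 234]

record_file = ["smith", "gates", "jones", "holmes"]


def isam_lookup(key):
    """In-memory search picks the index block; then one block read, one record read."""
    b = bisect_right(top_level, key) - 1
    if b < 0:
        return None
    for k, pos in index_blocks[b]:     # models reading one index block from disk
        if k == key:
            return record_file[pos]    # models reading the record's block
    return None
```

One first-level entry per block is enough here because a block is the unit read from disk anyway: once the right block is in memory, searching inside it costs no further I/O.
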
secondary indices
- the record file is already sorted on some other attribute

[figure]
  sec. index:  forbes ave | main st. | walnut st.
  buckets:     (between the index entries and the records)
  record file:
    123  smith    main st.
    150  gates    walnut st.
    234  jones    forbes ave
    236  holmes   walnut st.
    300  stevens  main st.

secondary indices
- only dense
- clustering index
- how to organize the sec. index?
- performance? search is very good, insertions / deletions are expensive

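One way to picture the buckets is the sketch below: each distinct value of the secondary attribute gets a bucket holding the positions of all records with that value. The function name and the tuple layout are assumptions made for the example; the data is from the slide.

```python
from collections import defaultdict

# Record file, sorted on the primary key (first field).
records = [
    (123, "smith", "main st."),
    (150, "gates", "walnut st."),
    (234, "jones", "forbes ave"),
    (236, "holmes", "walnut st."),
    (300, "stevens", "main st."),
]


def build_secondary_index(records, field):
    """Return a sorted list of (value, bucket of record positions) for one field."""
    buckets = defaultdict(list)
    for pos, rec in enumerate(records):
        buckets[rec[field]].append(pos)
    return sorted(buckets.items())


street_index = build_secondary_index(records, field=2)
```

The index has to be dense: the file is not sorted on the street, so no record can be found by scanning forward from a neighbor, and every record needs its own pointer.
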
summary of ordered indices

[table: primary index / sec. index vs. dense / sparse]

..ordered indices suffer in the presence of frequent updates
alternative indexing structure: B-trees

B-trees
- the most successful family of index schemes
- balanced n-way search trees
- a B-tree node: k_1, k_2, ..., k_{n-1}

B-trees, definition
each node, in a B-tree of order n:
- at most n pointers
- at least ceil(n/2) pointers (except root)
- all leaves at the same level
- if number of pointers is k, then node has exactly k-1 keys
- (leaves are empty)

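The definition can be turned into a small checker. The sketch below is an illustration only: the `Node` class and `check` function are hypothetical, empty leaves are modeled as `None`, and the lower bound on pointers is taken as n/2 rounded up. It does not verify that keys in a subtree fall between the neighboring keys of the parent.

```python
from math import ceil


class Node:
    def __init__(self, keys, children=None):
        self.keys = keys
        # children[i] is a Node or None; None stands for an empty leaf.
        if children is None:
            children = [None] * (len(keys) + 1)
        self.children = children


def check(node, n, is_root=True):
    """Return the number of levels below and including `node`.

    Raises AssertionError if a rule of the definition is violated.
    """
    k = len(node.children)
    assert k <= n, "at most n pointers"
    assert is_root or k >= ceil(n / 2), "at least n/2 pointers (except root)"
    assert len(node.keys) == k - 1, "k pointers means exactly k-1 keys"
    assert node.keys == sorted(node.keys), "keys are kept in order"
    depths = {0 if c is None else check(c, n, False) for c in node.children}
    assert len(depths) == 1, "all leaves at the same level"
    return 1 + depths.pop()
```

For example, `check(Node([20], [Node([10]), Node([30, 40])]), 3)` accepts a two-level tree of order 3, while a node that mixes an empty leaf with a non-empty subtree is rejected.
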
B-trees, properties
- block-aware nodes
- O(log(N)) for everything (search, insertion, deletion), where N is the number of keys
- typically, if the order n = 50-100, then 2-3 levels
- utilization >= 50%, guaranteed; on average 69%

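The "few levels" claim can be checked with a little arithmetic. The sketch below counts levels of key-bearing nodes for a tree of a given order. It assumes the usual rule that the root has at least 2 pointers (the slides only say "except root") and that every other node has at least half of its pointers, rounded up, in use; the function names are made up for the example.

```python
def min_levels(num_keys, n):
    """Fewest levels that can hold num_keys keys: every node completely full."""
    levels, capacity = 0, 0
    while capacity < num_keys:
        levels += 1
        capacity = n ** levels - 1      # a full tree of h levels holds n^h - 1 keys
    return levels


def max_levels(num_keys, n):
    """Most levels possible: root with 2 pointers, other nodes half full."""
    d = (n + 1) // 2                    # n/2 rounded up
    levels = 1
    # A tree with levels+1 levels needs at least 2 * d^levels - 1 keys.
    while 2 * d ** levels - 1 <= num_keys:
        levels += 1
    return levels
```

Under these assumptions, 100,000 keys with order 100 give `min_levels` = `max_levels` = 3, in line with the 2-3 levels quoted on the slide.
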
B-trees, operations
- insertion: split, which preserves the B-tree property
  - notice how it grows: level increases when root overflows
- deletion: may need to merge

insertion

INSERTION OF KEY K
  find the correct leaf node L;
  if ( adding K makes L overflow ) {
      split L (with K included), by pushing the middle key upstairs to parent node P;
      if ( P overflows ) {
          repeat the split recursively;
      }
  } else {
      add the key K in node L; /* maintaining the key order in L */
  }

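The insertion routine can be sketched as runnable Python. This is one possible rendering under assumptions, not the course's reference code: nodes are in-memory objects, the bottom-level nodes hold keys and have no children (the empty leaves are not materialized), duplicates are not handled, and the order n is assumed to be at least 3. It inserts first and splits afterwards, which gives the same result as the pseudocode.

```python
from bisect import bisect_left


class Node:
    def __init__(self, keys=None, children=None):
        self.keys = keys or []
        self.children = children or []    # empty list: bottom-level node


def insert(root, key, n):
    """Insert key into a B-tree of order n; return the (possibly new) root."""
    split = _insert(root, key, n)
    if split is None:
        return root
    mid, right = split
    return Node([mid], [root, right])     # root overflowed: one more level


def _insert(node, key, n):
    """Insert below node; return (middle key, new right node) if node split."""
    i = bisect_left(node.keys, key)
    if node.children:
        split = _insert(node.children[i], key, n)
        if split is None:
            return None
        mid, right = split                # child split: absorb its middle key
        node.keys.insert(i, mid)
        node.children.insert(i + 1, right)
    else:
        node.keys.insert(i, key)          # keep the keys in order
    if len(node.keys) <= n - 1:
        return None                       # no overflow
    m = len(node.keys) // 2
    mid = node.keys[m]                    # middle key is PUSHED upstairs
    right = Node(node.keys[m + 1:], node.children[m + 1:])
    node.keys = node.keys[:m]
    node.children = node.children[:m + 1]
    return mid, right


def inorder(node):
    """All keys in sorted order (used here only to inspect the result)."""
    if not node.children:
        return list(node.keys)
    out = []
    for i, child in enumerate(node.children):
        out += inorder(child)
        if i < len(node.keys):
            out.append(node.keys[i])
    return out
```

Inserting 1 through 7 into an empty tree of order 3 ends with root key 4 over nodes holding 2 and 6: the tree gained a level each time the root overflowed, which is the only way a B-tree grows in height.
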
deletion (ouch!)

DELETION OF KEY K
  locate key K, in node N;
  if ( N is a non-leaf node ) {
      delete K from N;
      find the immediately larger key K1;
          /* which is guaranteed to be on a leaf node L */
      copy K1 in the old position of K;
      invoke this DELETION routine on K1 from the leaf node L;
  } else { /* N is a leaf node */
      ... (next slide..)
  }

ouch! ouch!

  /* N is a leaf node */
  delete K from N;
  if ( N underflows ) {
      let N1 be the sibling of N;
      if ( N1 is "rich" ) { /* i.e., N1 can lend us a key */
          borrow a key from N1 THROUGH the parent node;
      } else { /* N1 is 1 key away from underflowing */
          MERGE: pull the key from the parent P,
                 and merge it with the keys of N and N1 into a new node;
          if ( P underflows ) { repeat recursively }
      }
  }

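The borrow and merge steps are easier to follow on a concrete case. The sketch below covers only a fragment of deletion, under simplifying assumptions: nodes are plain lists of keys at the bottom level, the sibling N1 is always the right-hand neighbor, and the recursive repair of the parent is left out. The function name is hypothetical.

```python
def repair_underflow(parent_keys, children, i, min_keys):
    """children[i] has underflowed; fix it using its right sibling children[i+1]."""
    node, sibling = children[i], children[i + 1]
    if len(sibling) > min_keys:
        # N1 is "rich": borrow THROUGH the parent.
        node.append(parent_keys[i])           # separator key comes down into N
        parent_keys[i] = sibling.pop(0)       # N1's smallest key goes up
        return "borrowed"
    # N1 is one key away from underflowing: MERGE.
    merged = node + [parent_keys.pop(i)] + sibling
    children[i:i + 2] = [merged]              # N and N1 become one node
    return "merged"                           # the parent may now underflow
```

Borrowing goes through the parent because the separator key must stay between the keys of the two children; moving a key directly from N1 to N would break the search order.
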
B-trees come in different flavors
- what about range queries, proximity searches?
- B+-trees facilitate sequential ops
  - leaf nodes have all the keys
  - replicate keys in non-leaf nodes

B+-trees, insertion

INSERTION OF KEY K
  insert search-key value to L such that the keys are in order;
  if ( L overflows ) {
      split L;
      insert (i.e., COPY) smallest search-key value of new node to parent node P;
      if ( P overflows ) {
          repeat the B-tree split procedure recursively;
          /* Notice: the B-TREE split; NOT the B+-tree */
      }
  }
  /* ATTENTION:
     a split at the leaf level is handled by COPYING the middle key upstairs;
     a split at a higher level is handled by PUSHING the middle key upstairs. */

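The COPY versus PUSH distinction comes down to which half keeps the middle key. A minimal sketch, with hypothetical function names, operating on sorted key lists only (child pointers are omitted):

```python
def split_leaf(keys):
    """B+-tree leaf split: the separator is COPIED up and stays in the new leaf."""
    m = len(keys) // 2
    left, right = keys[:m], keys[m:]
    return left, right[0], right


def split_internal(keys):
    """Non-leaf (B-tree style) split: the separator is PUSHED up and leaves both halves."""
    m = len(keys) // 2
    return keys[:m], keys[m], keys[m + 1:]
```

For the overflowing key list `[10, 20, 30, 40]`, the leaf split sends 30 upstairs and keeps it in the right leaf, so the leaf level still holds every key; the non-leaf split sends 30 upstairs and keeps it in neither half.
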
still more flavors
- should leaves be empty? - practical B-trees
- how to increase the utilization of B-trees?
  ..with B*-trees!

B-trees, summary
- a great structure. block aware
- all B-trees can be used either as primary (= sparse, clustering), or secondary (= dense, non-clustering) index

hashing: the idea
- it would be nice to be able to map key values to record positions
- e.g. (123, smith) is stored in block number 123
- what is the problem with this mapping?

hash functions
- key value -> bucket (with pointer to records): k -> h(k)
- suppose we have M buckets. this is a hash function, based on division:
    h(k) = k mod M

[figure: M buckets]

hash functions
- another hash function, using multiplication:
    h(k) = floor( M * ((k * φ) mod 1) )
- good hash functions: uniformity
- good hash functions: randomness

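Both hash functions fit in a few lines. In the sketch below the function names are made up, and the value of φ is an assumption: the slides do not say which constant is meant, and the fractional part of the golden ratio, (sqrt(5) - 1) / 2, is a common choice for multiplicative hashing. The result of the multiplication method is rounded down so that it is a valid bucket number.

```python
from math import floor, sqrt

PHI = (sqrt(5) - 1) / 2      # assumption: ~0.618, not specified on the slides


def h_div(k, M):
    """Division method: the remainder of k divided by the number of buckets."""
    return k % M


def h_mul(k, M, phi=PHI):
    """Multiplication method: scale the fractional part of k * phi up to [0, M)."""
    return floor(M * ((k * phi) % 1.0))
```

For example, `h_div(123, 10)` is 3. The division method is sensitive to the choice of M (keys that share a factor with M pile up in few buckets), which is one reason to look at alternatives such as the multiplication method.
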
hashing: ups and downs
- speed!
- ..but at the cost of loss of key ordering
  - no range queries
  - no proximity queries
  - no sequential scan

hashing flavors
- fixed or variable number of buckets?
- how to handle overflows?
- 2 main hashing categories:
  - static hashing
  - dynamic hashing

static hashing
- number of buckets, M, is fixed
- collision resolution?
  - open addressing
    - linear probing
    - double hashing
  - chaining

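Of the collision-resolution schemes listed, linear probing is the shortest to write down. The sketch below is a toy in-memory table with one record per slot; the class name is hypothetical, the hash function is the division method, and deletion is not supported.

```python
class LinearProbingTable:
    def __init__(self, M):
        self.M = M                       # fixed number of slots
        self.slots = [None] * M

    def insert(self, key, value):
        """Store (key, value) in the first free slot at or after h(key); return the slot."""
        for step in range(self.M):
            i = (key % self.M + step) % self.M
            if self.slots[i] is None or self.slots[i][0] == key:
                self.slots[i] = (key, value)
                return i
        raise OverflowError("table is full")

    def get(self, key):
        """Probe from h(key) until the key or an empty slot is found."""
        for step in range(self.M):
            i = (key % self.M + step) % self.M
            if self.slots[i] is None:
                return None
            if self.slots[i][0] == key:
                return self.slots[i][1]
        return None
```

With M = 7, keys 234 and 150 both hash to slot 3; 150 is inserted after 123 (slot 4) and 234 (slot 3), so it probes past both and lands in slot 5. Because M is fixed, the table cannot accept an eighth key, and inserting one raises `OverflowError`.
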
static hashing
- problem: overflow?
- problem: underflow? (underutilization)
- idea: shrink / expand hash table on demand..
  ..dynamic hashing

dynamic hashing
- many approaches; we examine extendable hashing
- hash each key to an infinite bit string, and use as many bits as necessary
- idea: directory that doubles on demand

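The doubling directory can be sketched as follows. This is a compact illustration under several assumptions, not a full implementation: buckets are in-memory dicts with a small fixed capacity, Python's built-in `hash` stands in for the "infinite bit string", the low-order bits are used to index the directory, and buckets are never merged back. Class and method names are hypothetical.

```python
class Bucket:
    def __init__(self, depth):
        self.depth = depth               # local depth: hash bits this bucket uses
        self.items = {}


class ExtendableHash:
    def __init__(self, capacity=2):
        self.capacity = capacity         # max records per bucket
        self.global_depth = 0            # hash bits used by the directory
        self.directory = [Bucket(0)]

    def _slot(self, key):
        return hash(key) & ((1 << self.global_depth) - 1)

    def get(self, key):
        return self.directory[self._slot(key)].items.get(key)

    def put(self, key, value):
        while True:
            bucket = self.directory[self._slot(key)]
            if key in bucket.items or len(bucket.items) < self.capacity:
                bucket.items[key] = value
                return
            self._split(bucket)          # overflow: use one more bit, then retry

    def _split(self, bucket):
        if bucket.depth == self.global_depth:
            # Directory doubles; both halves point to the same buckets at first.
            self.directory = self.directory + self.directory
            self.global_depth += 1
        bucket.depth += 1
        new_bucket = Bucket(bucket.depth)
        high_bit = 1 << (bucket.depth - 1)
        for i, b in enumerate(self.directory):
            if b is bucket and i & high_bit:
                self.directory[i] = new_bucket
        old_items, bucket.items = bucket.items, {}
        for k, v in old_items.items():   # redistribute by the extra bit
            self.directory[self._slot(k)].items[k] = v
```

Only the overflowing bucket is split; the directory doubles only when that bucket was already using all of the directory's bits. The rest of the table is untouched, which is what makes growth "on demand".
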
discussion
- comparison
- multiple-key access?
- SQL statements
  - create index <index-name> on <relation-name> (<attribute-list>)
  - create unique index <index-name> on <relation-name> (<attribute-list>)
  - drop index <index-name>
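
The three statements can be tried out with Python's built-in `sqlite3` module. The sketch below uses an in-memory database; the table name, column names and index names are invented for the example, and other database systems may differ in details of index handling.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table customer (id integer, name text, street text)")
conn.executemany(
    "insert into customer values (?, ?, ?)",
    [(123, "smith", "main st."), (234, "jones", "forbes ave"),
     (300, "stevens", "main st.")],
)

# create index <index-name> on <relation-name> (<attribute-list>)
conn.execute("create index street_idx on customer (street)")

# create unique index ...: additionally rejects duplicate key values
conn.execute("create unique index id_idx on customer (id)")
try:
    conn.execute("insert into customer values (123, 'doe', 'fifth ave')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# drop index <index-name>
conn.execute("drop index street_idx")

# sqlite_master is SQLite's catalog table; list the indices that remain.
index_names = [
    row[0]
    for row in conn.execute("select name from sqlite_master where type = 'index'")
]
```

After the script runs, the duplicate insert has been rejected by the unique index and only `id_idx` remains in the catalog.
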