Hashing. Data organization in main memory or disk

Data organization methods for main memory or disk: sequential, indexed sequential, binary trees
  - the location of a record depends on other keys
  - unnecessary key comparisons are made to find a key
The goal of hashing is to retrieve a record with a single key comparison
  - the location of a record is computed using its key only
  - good for random accesses (usually 1-3 comparisons)
  - slow for range queries
E.G.M. Petrakis, Hashing

Hash Table
  - Hash function h(key): transforms keys to array indices
  [Figure: h(key) maps a key to one position of the table; each position holds an index and the data.]

[Figure: hash table with columns position, key, record and positions 0 to 999; each key is stored at position h(key) = key mod 1000.]

Properties of Good Hash Functions
  1. Two records cannot occupy the same location
     hash collision: $k_i \ne k_j,\ i \ne j : h(k_i) = h(k_j)$
  2. uniform: distributes keys evenly in hash space
  3. perfect: $k_i \ne k_j,\ i \ne j : h(k_i) \ne h(k_j)$
  4. order preserving: $k_i < k_j,\ i \ne j : h(k_i) < h(k_j)$
  - It is difficult to find a hash function with all these properties
  - Property 2 (uniformity) is the most essential to good hashing
  - most functions are no better than h(key) = key mod m
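As a minimal sketch of the division-method hash h(key) = key mod m mentioned above, the following Python snippet counts how many keys land on each table position and reports the collisions. The helper names `h` and `bucket_loads` and the sample keys are illustrative assumptions, not taken from the slides.

```python
from collections import Counter

def h(key: int, m: int) -> int:
    """Division-method hash: map an integer key to an index in 0..m-1."""
    return key % m

def bucket_loads(keys, m):
    """Count how many keys hash to each table position."""
    return Counter(h(k, m) for k in keys)

# Hypothetical sample keys for a table of m = 1000 positions.
keys = [4967000, 468396, 86399, 1000396]
loads = bucket_loads(keys, 1000)

# Positions that received more than one key are collisions.
collisions = {pos: count for pos, count in loads.items() if count > 1}
print(collisions)
```

Two of the sample keys end in 396, so they collide at position 396 even though the keys differ.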

Collision Resolution Techniques
  1. Open Addressing (rehashing): compute a new position to store the key in the same table
     - no extra space, stores at most m records
     i. linear probing
     ii. double hashing
  2. Separate Chaining: lists of keys mapped to the same position
     - uses extra space
     - can store more than m records (m: size of hash table)

Open Addressing
  - Compute an address to store the key
  - if it is occupied, compute a new address (rehashing)
  - if this is occupied too, compute a new address again (and so on) until an empty position is found
  - primary hash function: h(key)
  - rehash function: rh(i) = rh(h(key))
  - hash sequence: $(h_0, h_1, h_2, \dots) = (h(key),\ rh(h(key)),\ rh(rh(h(key))),\ \dots)$
  - Retrieval: to find a key, follow the same hash sequence

Problem 1: Locate Empty Positions
  - No empty position can be found:
    i. the table is full (check the number of empty positions)
    ii. the rehash function fails to find an empty position although the table is not full!!
  - h(key) = key mod 1000, rh(i) = (i + 200) mod 1000 checks only 5 positions on a table of 1000 positions
  - use an rh that checks the entire table: rh(i) = (i + c) mod 1000 where GCD(c, m) = 1
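The coverage claim can be checked with a short sketch. The function name `probed_positions` is an assumption made for illustration; it simply iterates rh(i) = (i + c) mod m until a position repeats.

```python
from math import gcd

def probed_positions(start: int, c: int, m: int) -> list[int]:
    """Distinct positions visited by repeatedly applying rh(i) = (i + c) mod m."""
    order, seen, i = [], set(), start
    while i not in seen:
        seen.add(i)
        order.append(i)
        i = (i + c) % m
    return order

m = 1000
bad = probed_positions(7, 200, m)   # GCD(200, 1000) = 200
good = probed_positions(7, 7, m)    # GCD(7, 1000) = 1

print(len(bad), len(good))          # number of positions each rehash can reach
```

In general the sequence reaches m / GCD(c, m) positions, which is why a step c that is relatively prime to m covers the whole table.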

Problem 2: Primary Clustering
  - Keys that hash into different addresses compete with each other in successive rehashes
  - h(key) = key mod 100, rh(i) = (i + 1) mod 100
  - keys: 990, 991, 992, 993, 994
  [Figure: table of 100 positions in which these keys fill the consecutive positions 90 to 94.]

Problem 3: Secondary Clustering
  - Different keys which hash to the same hash value have the same rehash sequence
  - h(key) = key mod 10, rh(i, j) = (i + j) mod 10
    i. key 13: h(13) = 3, rh = 4, 6, 9, 3, ...
    ii. key 23: h(23) = 3, rh = 4, 6, 9, 3, ...
  [Figure: hash table with 10 positions, several of them occupied.]
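A small sketch of the rehash function rh(i, j) = (i + j) mod m, where j is the number of the rehash, shows why two keys with the same primary hash value follow identical sequences. The function name `rehash_sequence` is an illustrative assumption.

```python
def rehash_sequence(key: int, m: int, steps: int) -> list[int]:
    """First `steps` rehash positions for rh(i, j) = (i + j) mod m, j = 1, 2, ..."""
    i = key % m              # primary hash h(key) = key mod m
    sequence = []
    for j in range(1, steps + 1):
        i = (i + j) % m      # the j-th rehash adds j to the previous position
        sequence.append(i)
    return sequence

print(rehash_sequence(13, 10, 4))
print(rehash_sequence(23, 10, 4))
```

Both calls print the same list because the sequence depends only on h(key), not on the key itself.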

(i) Linear Probing
  - Store the key into the next free position
  - $h_0 = h(key)$, usually $h_0 = key \bmod m$
  - $h_i = (h_{i-1} + 1) \bmod m,\ i \ge 1$
  [Figure: table of 10 positions after inserting the keys of a set S with linear probing.]

Observation 1
  - Different insertion sequences produce different hash sequences
  - inserting the same set of keys S in two different orders leads to different tables and a different total number of probes
  [Figure: the two resulting tables with the number of probes per key, for h(key) = key mod 13.]

Observation 2
  - Deletions are not easy:
  - h(key) = key mod 10, rh(i) = (i + 1) mod 10
  - Action: delete(65) and search(5)
  - Problem: search will stop at the empty position and will not find 5
  - Solution: mark the position as deleted rather than empty; the marked position can be reused
  [Figure: table of 10 positions in which a run of occupied positions starting at position 5 separates key 65 from key 5, which is stored further down the run.]
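The "mark as deleted" solution can be sketched as follows. This is a minimal illustration, assuming integer keys, h(key) = key mod m and rh(i) = (i + 1) mod m; the class name `LinearProbingTable`, the sentinels `EMPTY` and `DELETED`, and the sample keys are assumptions, not the slides' own code.

```python
EMPTY, DELETED = object(), object()   # sentinels: never used vs. deleted slot

class LinearProbingTable:
    def __init__(self, m: int):
        self.m = m
        self.slots = [EMPTY] * m

    def _probe(self, key):
        """Yield the hash sequence h_0, h_1, ... (at most m positions)."""
        i = key % self.m
        for _ in range(self.m):
            yield i
            i = (i + 1) % self.m

    def insert(self, key):
        first_free = None
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is EMPTY:
                if first_free is None:
                    first_free = i
                break                      # key cannot appear past an empty slot
            if slot is DELETED:
                if first_free is None:
                    first_free = i         # remember reusable slot, keep scanning
            elif slot == key:
                return i                   # already stored
        if first_free is None:
            raise OverflowError("table is full")
        self.slots[first_free] = key
        return first_free

    def search(self, key):
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is EMPTY:
                return None                # a truly empty slot ends the search
            if slot is not DELETED and slot == key:
                return i
        return None

    def delete(self, key):
        i = self.search(key)
        if i is not None:
            self.slots[i] = DELETED        # mark, do not empty
        return i

table = LinearProbingTable(10)
for k in (55, 65, 75, 5):                  # hypothetical keys, all hashing to 5
    table.insert(k)
table.delete(65)
found_at = table.search(5)                 # search skips the deleted mark
reused_at = table.insert(15)               # the marked position is reused
```

Here key 5 is still found after 65 is deleted, and the next insertion that probes the marked slot reuses it.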

Observation 3
  - Linear probing tends to create long sequences of occupied positions
  - the longer a sequence is, the longer it tends to become
  - for a cluster of B occupied positions in a table of m positions: $P = \frac{B + 2}{m}$
  - P: probability to use a position in the cluster

Observation 4
  - Linear probing suffers from both primary and secondary clustering
  - Solution: double hashing, which uses two hash functions $h_1$, $h_2$ and a rehashing function rh

Double Hashing
  - Two hash functions and a rehashing function
  - primary hash function: $h_1(key) = key \bmod m$
  - secondary hash function: $h_2(key)$
  - rehashing function: $rh(i, key) = (i + h_2(key)) \bmod m$
  - $h_2(m, key)$ is some function of m and key
  - it helps rh in computing random positions in the hash table
  - $h_2$ is computed once for each key!

Example of Double Hashing
  i. hash functions: $h_1(key) = key \bmod m$ and
     $h_2(key) = \begin{cases} 1 & q = 0 \\ q & q \ne 0 \end{cases}$, where $q = (key \ \mathrm{div}\ m) \bmod m$
  ii. rehash function: $rh(i, key) = (i + h_2(key)) \bmod m$

Example (continued)
  A. m = 10, key = 23: $h_1(23) = 3$, $h_2(23) = 2$, rh: 5, 7, 9, 1, ...
  B. m = 10, key = 3: $h_1(3) = 3$, q = 0 so $h_2(3) = 1$, and rh advances one position at a time
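The example can be reproduced with a short sketch that follows the definitions h1(key) = key mod m, q = (key div m) mod m, h2(key) = q (or 1 when q = 0) and rh(i, key) = (i + h2(key)) mod m. The function names are illustrative assumptions.

```python
def h1(key: int, m: int) -> int:
    """Primary hash function."""
    return key % m

def h2(key: int, m: int) -> int:
    """Secondary hash function: q = (key div m) mod m, replaced by 1 when q = 0."""
    q = (key // m) % m
    return q if q != 0 else 1

def probe_sequence(key: int, m: int, count: int) -> list[int]:
    """First `count` positions of the hash sequence, starting with h1(key)."""
    i, step = h1(key, m), h2(key, m)   # h2 is computed once per key
    sequence = [i]
    for _ in range(count - 1):
        i = (i + step) % m             # rh(i, key) = (i + h2(key)) mod m
        sequence.append(i)
    return sequence

seq_a = probe_sequence(23, 10, 5)      # case A: step h2 = 2
seq_b = probe_sequence(3, 10, 4)       # case B: q = 0, so step h2 = 1
```

With m = 10 a step of 2 reaches only 5 positions, which ties back to the requirement GCD(step, m) = 1; choosing a prime m is one common way to satisfy it for every step.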

Performance of Open Addressing
  - Distinguish between successful and unsuccessful search
  - Assume a series of probes to random positions (independent events)
  - load factor: λ = n/m
  - λ: probability to probe an occupied position
  - each position has the same probability P = 1/m

Unsuccessful Search
  - The hash sequence is exhausted
  - let u be the expected number of probes
  - u equals the expected length of the hash sequence
  - P(k): probability to search k positions in the hash sequence

$u = \sum_{k \ge 1} k\,P(k) = P(1) + 2P(2) + 3P(3) + \dots$
$\quad = [P(1) + P(2) + P(3) + \dots] + [P(2) + P(3) + \dots] + [P(3) + \dots] + \dots$
$\quad = \sum_{k \ge 1} P(\ge k \text{ probes})$

$P(\ge k \text{ probes}) = P(\text{first } k-1 \text{ positions occupied}) = \lambda^{k-1}$ (independent events)

$u = \sum_{k \ge 1} \lambda^{k-1} = \frac{1}{1 - \lambda}$
  - u increases with λ: performance drops as λ increases

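Successful Search
  - The hash sequence is not exhausted
  - the number of probes to find a key equals the number of occupied positions probed when the key was inserted plus 1, which is the cost u of an unsuccessful search at that time
  - λ was less at that time: consider all values of the load factor from 0 to λ

$s = \frac{1}{\lambda} \int_0^{\lambda} u(x)\,dx = \frac{1}{\lambda} \int_0^{\lambda} \frac{dx}{1 - x} = \frac{1}{\lambda} \ln \frac{1}{1 - \lambda}$
  - u: equivalent to unsuccessful search
  - s increases with λ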

Performance
  - The performance drops as λ increases
  - the higher the value of λ is, the higher the probability of collisions
  - Unsuccessful search is more expensive than successful search
  - unsuccessful search exhausts the hash sequence

Experimental Results (average number of probes)

                   SUCCESSFUL                  UNSUCCESSFUL
LOAD FACTOR   LINEAR  i + bkey  DOUBLE    LINEAR  i + bkey  DOUBLE
25%             1.17      1.16    1.15      1.39      1.37    1.33
50%             1.50      1.44    1.39      2.50      2.19    2.00
75%             2.50      2.01    1.85      8.50      4.64    4.00
90%             5.50      2.85    2.56     50.50     11.40   10.00
95%            10.50      3.52    3.15    200.50     22.04   20.00
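For comparison with the table, the random-probing model of the previous slides can be evaluated directly. This sketch assumes the closed forms u = 1/(1 - λ) for unsuccessful search and s = (1/λ) ln(1/(1 - λ)) for successful search; the function names are illustrative.

```python
from math import log

def unsuccessful(lam: float) -> float:
    """Expected probes of an unsuccessful search: u = 1 / (1 - lambda)."""
    return 1.0 / (1.0 - lam)

def successful(lam: float) -> float:
    """Expected probes of a successful search: s = (1/lambda) ln(1 / (1 - lambda))."""
    return log(1.0 / (1.0 - lam)) / lam

for lam in (0.25, 0.50, 0.75, 0.90, 0.95):
    print(f"{lam:.0%}  s = {successful(lam):.2f}  u = {unsuccessful(lam):.2f}")
```

The model ignores clustering, so it is expected to track double hashing most closely and to underestimate linear probing at high load factors.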

Performance on Full Table (average number of probes)

                          SUCCESSFUL
TABLE SIZE (m)   LINEAR  i + bkey  DOUBLE   UNSUCCESSFUL   LOG2 m
100                6.60      4.62    4.12          50.50     6.64
500               14.35      6.22    5.72         250.50     8.97
1000              20.15      6.91    6.41         500.50     9.97
5000              44.64      8.52    8.02        2500.50    12.29
10000             63.00      9.21    8.71        5000.50    13.29

Separate Chaining
  - Keys hashing to the same hash value are stored in separate lists
  - one list per hash position
  - can store more than m records
  - easy to implement
  - the keys in each list can be ordered

[Figure: separate chaining with h(key) = key mod m on a table of 10 positions; each position points to a nil-terminated list of the keys that hash to it, and empty positions hold nil.]

Performance of Separate Chaining
  - Depends on the average chain size
  - insertions are independent events
  - let P(c,n,m) be the probability that a position has been selected c times after n insertions
  - P(c,n,m) is also the probability that the chain has length c
  - P(c,n,m) follows the binomial distribution:
    $P(c,n,m) = \binom{n}{c} p^{c} q^{\,n-c}$
  - p = 1/m: success case, q = 1 - p: failure case

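$P(c,n,m) = \binom{n}{c} \left(\frac{1}{m}\right)^{c} \left(1 - \frac{1}{m}\right)^{n-c} = \frac{n!}{c!\,(n-c)!} \cdot \frac{1}{m^{c}} \left(1 - \frac{1}{m}\right)^{n-c}$
For $n, m \to \infty$ with $\lambda = n/m$: $\frac{n!}{(n-c)!\,m^{c}} \to \lambda^{c}$ and $\left(1 - \frac{1}{m}\right)^{n-c} \to e^{-\lambda}$, so
$P(c,n,m) \approx \frac{1}{c!}\,\lambda^{c} e^{-\lambda}$ (Poisson)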

Unsuccessful Search
  - The entire chain is searched
  - the average number of comparisons u equals the average chain length

$u = \sum_{c \ge 0} c\,P(c, \lambda) = \sum_{c \ge 0} c\,\frac{1}{c!}\,\lambda^{c} e^{-\lambda} = \lambda$

Successful Search
  - Not the whole chain is searched
  - the average number of comparisons s equals the length of the chain at the time the key was inserted plus 1
  - the performance at the time a key was inserted equals that of unsuccessful search!

$s = \frac{1}{\lambda} \int_0^{\lambda} (u + 1)\,dx = \frac{1}{\lambda} \int_0^{\lambda} (x + 1)\,dx = 1 + \frac{\lambda}{2}$

Performance
  - The performance drops with the length of the chains
  - worst case: all keys are stored in a single chain
  - worst case performance: O(N)
  - unsuccessful search performs better than successful search (u = λ < 1 + λ/2 = s whenever λ < 2)!! WHY?
  - no problem with deletions!!
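A minimal sketch of separate chaining, assuming integer keys, h(key) = key mod m and unordered Python lists as chains. The class name `ChainedHashTable` and the sample keys are illustrative assumptions.

```python
class ChainedHashTable:
    def __init__(self, m: int):
        self.m = m
        self.chains = [[] for _ in range(m)]   # one list per hash position
        self.n = 0                             # number of stored keys

    def insert(self, key: int) -> None:
        chain = self.chains[key % self.m]
        if key not in chain:
            chain.append(key)
            self.n += 1

    def search(self, key: int):
        """Return (found, number of key comparisons made)."""
        chain = self.chains[key % self.m]
        for comparisons, stored in enumerate(chain, start=1):
            if stored == key:
                return True, comparisons
        return False, len(chain)               # whole chain was searched

    def delete(self, key: int) -> bool:
        chain = self.chains[key % self.m]
        if key in chain:
            chain.remove(key)                  # no deletion marks are needed
            self.n -= 1
            return True
        return False

    @property
    def load_factor(self) -> float:
        return self.n / self.m

table = ChainedHashTable(10)
for k in (40, 30, 75, 66, 87, 67, 47, 49):     # hypothetical keys
    table.insert(k)

hit = table.search(47)     # third key in the chain of position 7
miss = table.search(17)    # scans the whole chain of position 7
avg_chain = sum(len(c) for c in table.chains) / table.m
```

The average chain length equals the load factor λ = n/m, which is the cost u of an unsuccessful search derived above; note that λ may exceed 1 because a chained table can hold more than m keys.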

Coalesced Hashing
  - The hash sequence is implemented as a linked list within the hash table
  - no rehash function
  - the next hash position is the next available position in the linked list
  - extra space for the list
  - h(key) = key mod 10, keys: 19, 29, 49, 59
  [Figure: table of 10 positions with a key field and a link field; the four keys all hash to position 9 and are linked into one list inside the table.]

Initialization: every position holds nilkey with link -1; initially avail = 9
h(key) = key mod 10, keys: 14, 29, 34, 28, 42, 39, 84, 38

Table after the insertions:

position   key      link
0          nilkey   -1
1          nilkey   -1
2          42       -1
3          38       -1
4          14        8
5          84        3
6          39       -1
7          28        5
8          34        7
9          29        6
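The example can be reproduced with a short sketch. It assumes that a colliding key is appended to the end of the list that starts at its hash position and is stored at the highest free position, found by moving `avail` downward; the class name `CoalescedHashTable` and the use of `None` for nilkey are illustrative assumptions.

```python
NIL = -1

class CoalescedHashTable:
    def __init__(self, m: int):
        self.m = m
        self.keys = [None] * m        # None plays the role of nilkey
        self.link = [NIL] * m
        self.avail = m - 1            # free positions are taken from the bottom up

    def insert(self, key: int) -> int:
        i = key % self.m
        if self.keys[i] is None:      # home position is free
            self.keys[i] = key
            return i
        while True:                   # walk the list to its last element
            if self.keys[i] == key:
                return i              # already stored
            if self.link[i] == NIL:
                break
            i = self.link[i]
        while self.avail >= 0 and self.keys[self.avail] is not None:
            self.avail -= 1           # skip occupied positions
        if self.avail < 0:
            raise OverflowError("table is full")
        j = self.avail
        self.keys[j] = key
        self.link[i] = j              # append the new position to the list
        return j

    def search(self, key: int):
        i = key % self.m
        if self.keys[i] is None:
            return None
        while i != NIL:
            if self.keys[i] == key:
                return i
            i = self.link[i]
        return None

table = CoalescedHashTable(10)
for k in (14, 29, 34, 28, 42, 39, 84, 38):
    table.insert(k)
```

Lists that start at different hash positions can merge (coalesce): key 28 hashes to position 8, which already belongs to the list of position 4, so both lists share their tail.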

Performance of Coalesced Hashing
  - Unsuccessful search: $\frac{e^{2\lambda} - 2\lambda}{4} + 0.75$ probes/search
  - Successful search: $\frac{e^{2\lambda} - 1}{8\lambda} + \frac{\lambda}{4} + 0.75$ probes/search

Hashing on the Disk
  - Use disk pages or buckets to store records
  - several records fit within one bucket
  - first retrieve the appropriate page into the main memory
  - then searching for the appropriate record within the bucket comes for free

[Figure: the hash function maps the key space Σ to data pages 0 to m-1, each of size b.]
  - page size b: maximum number of records in a page
  - space utilization u: measure of the use of space

$u = \frac{\#\,\text{stored records}}{\#\,\text{pages} \cdot b}$

Collisions
  - Keys that hash to the same hash position are stored within the same page
  - If the page is full, collisions cause
    i. page splits: split the page contents between the old and a new page
    ii. overflows: list of overflow pages
  - The larger the page size, the fewer the overflows
  [Figure: a full page with a chained overflow page.]

Access Time
  - Goal: find a key in one disk access
  - access time ~ number of page accesses
  - many overflows cause more disk accesses
  - large u means good space utilization but many overflows

Problems
  - File expansion (dynamic growth): file size adapts to space requirements
  - Searching with overflows: bad performance
  - Non-uniform distribution of keys: many keys map to the same addresses
Categories of Methods
  - pseudo-dynamic: open addressing, chaining
  - dynamic: dynamic hashing, extendible hashing, linear hashing, spiral storage

Dynamic Hashing Schemes
  - Dynamic growth without total reorganization
  - typically 1-3 disk accesses to access a key
  - access time and space utilization are a typical trade-off
  - space utilization between 50-100%, typically 69%
  - implementation is not always easy

Dynamic Schemes With Index
  - At least two disk accesses: one to access the index and one to access the data (overflows cause more disk accesses)
  - if we keep the index in main memory: one disk access less
  - problem: the index may become too large
  [Figure: an index pointing to data pages.]
  - Methods
    - Dynamic hashing (Larson 1978)
    - Extendible hashing (Fagin et al. 1979)

Dynamic Schemes Without Index
  - Ideally, less space and fewer disk accesses (at least one disk access)
  - overflows are allowed: more disk accesses
  [Figure: the address space mapped directly to the data space.]
  - Methods
    - Linear Hashing (Litwin 1980)
    - Linear Hashing with Partial Expansions (Larson 1980)
    - Spiral Storage (Martin 1979)

Hash Functions
  - Support a shrinking or growing hash file (shrinking or growing address space)
  - the hash function adapts to these changes
  - hash functions using the first (last) bits of the key
  - $key = b_{n-1} b_{n-2} \dots b_i b_{i-1} \dots b_2 b_1 b_0$
  - $h_i(key) = b_{i-1} \dots b_2 b_1 b_0$ supports $2^i$ addresses
  - $h_i$: one more bit than $h_{i-1}$ to address larger files
  - $h_i(key) = h_{i-1}(key)$ or $h_i(key) = h_{i-1}(key) + 2^{i-1}$
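The family h_i can be sketched with bit masking, assuming non-negative integer keys and the last (least significant) i bits for addressing. The function name `h` is illustrative.

```python
def h(i: int, key: int) -> int:
    """h_i(key): the i least significant bits of key, i.e. key mod 2**i."""
    return key & ((1 << i) - 1)

key = 0b101101                      # 45
addresses = [h(i, key) for i in range(1, 5)]
print(addresses)

# Going from h_{i-1} to h_i either keeps the address or adds 2**(i-1),
# depending on the newly used bit b_{i-1}.
for i in range(1, 8):
    assert h(i, key) in (h(i - 1, key), h(i - 1, key) + 2 ** (i - 1))
```

This is why doubling the address space moves each record either nowhere or to exactly one new address.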

Dynamic Hashing (Larson 1978)
  - Two-level index
  - primary index $h_1(key)$: accesses a hash table
  - secondary index $h_2(key)$: accesses a binary tree
  [Figure: 1st level index addressed by $h_1(k)$, 2nd level binary trees addressed by $h_2(k)$, and data pages of size b.]

  - Primary index is fixed: $h_1(key) = key \bmod m$
  - Dynamic behavior on the secondary index
  - $h_2(key)$ uses i bits of the key
  - the bit sequence of $h_2$ denotes which path on the binary tree of the secondary index to follow in order to access the data page
  - scan $h_2$ from left to right
  - bit 1: follow right path
  - bit 0: follow left path

[Figure: 1st level index with positions 0 to 5, $h_1(key) = key \bmod 6$; 2nd level binary trees addressed by the bits of $h_2(key)$; data pages of size b labelled with their $h_1$ and $h_2$ values.]

Insertions
  - Initially: fixed size primary index and no data
  - insert record: store in a new page on the $h_1$ location
  - if the page is full, allocate one extra page and split the contents of the old page between the old and the new page
  - use one extra bit in $h_2$ for addressing
  [Figure: a page of size b splits into two pages that share the same $h_1$ value and differ in one bit of $h_2$.]

[Figure: index and storage after successive page splits; each data page is labelled with its $h_1$ value and its $h_2$ bits, with $h_2$ = any for pages that have not split.]

Deletions
  - Find the record to be deleted using $h_1$ and $h_2$
  - Delete the record and check the sibling page: less than b records in both pages?
  - if yes, merge the two pages and delete one empty page
  - shrink the secondary index by one level and reduce $h_2$ by one bit

[Figure: a deletion followed by the merging of two sibling pages.]

Extendible Hashing (Fagin et al. 1979)
  - Dynamic hashing without the primary index: primary hashing is omitted
  - hash function similar to the secondary hash function, with all binary trees at the same level
  - the index shrinks and grows according to file size
  - data pages are attached to the index

[Figure: from dynamic hashing, to dynamic hashing with all binary trees at the same level, to extendible hashing, where the address bits index a flat table.]

Insertions
  - Initially 1 index entry and 1 data page
  - 0 address bits
  - insert records in this data page
  [Figure: index and storage with a single data page of size b.]

Page 0 overflows:
  - 1 more key bit for addressing, 1 extra page
  - index doubles!!
  - split the contents of the previous page into 2 pages according to the next bit of the key
  - global depth d: # index bits, $2^d$ index size
  - local depth l: max # bits for record addressing
  [Figure: index with entries 0 and 1; d: global depth = 1, l: local depth = 1.]

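[Figure: page 0 overflows and the index doubles to entries 00, 01, 10, 11 (d = 2); another page overflows and the index doubles to entries 000 to 111 (d = 3).]
  - 1 more key bit for addressing at each doubling
  - a page with l = 2 contains records with the same 2 bits of key
  - a page with l = 1 contains records with the same 1st bit of key
  - $2^{d-l}$: number of pointers to a page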

A page with l < d overflows:
  - no need to double the index
  - the page splits into two (1 new page)
  - local depth l is increased by 1
  [Figure: index with entries 000 to 111 (d = 3); the split page and the new page have l + 1 bits in common.]

Insertion Algorithm
  - If l < d: split the overflowed page (1 extra page)
  - If l = d: double the index, split the page; d is increased by 1 (1 more bit for addressing)
  - update the pointers in either way:
    a) if d prefix bits are used for addressing:
         d = d + 1;
         for (i = 2^d - 1; i >= 0; i--) index[i] = index[i/2];
    b) if d suffix bits are used:
         for (i = 0; i <= 2^d - 1; i++) index[i + 2^d] = index[i];
         d = d + 1;
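The insertion algorithm can be sketched as below. This is a minimal illustration under stated assumptions: integer keys, the d suffix (least significant) bits address the index, no overflow pages, and in-memory lists stand in for disk pages. The names `Page` and `ExtendibleHash` are hypothetical.

```python
class Page:
    def __init__(self, depth: int):
        self.depth = depth                     # local depth l
        self.keys = []

class ExtendibleHash:
    def __init__(self, b: int):
        self.b = b                             # page capacity
        self.d = 0                             # global depth
        self.index = [Page(0)]                 # 2**d pointers to pages

    def _slot(self, key: int) -> int:
        return key & ((1 << self.d) - 1)       # d suffix bits of the key

    def search(self, key: int) -> bool:
        return key in self.index[self._slot(key)].keys

    def insert(self, key: int) -> None:
        while True:
            page = self.index[self._slot(key)]
            if key in page.keys:
                return
            if len(page.keys) < self.b:
                page.keys.append(key)
                return
            self._split(page)                  # may have to repeat (see l = d)

    def _split(self, page: Page) -> None:
        if page.depth == self.d:               # l = d: double the index
            self.index = self.index + self.index   # index[i + 2^d] = index[i]
            self.d += 1
        bit = 1 << page.depth                  # next key bit used for addressing
        page.depth += 1
        new_page = Page(page.depth)
        old_keys, page.keys = page.keys, []
        for k in old_keys:                     # redistribute on the new bit
            (new_page if k & bit else page).keys.append(k)
        for i, p in enumerate(self.index):     # redirect half of the pointers
            if p is page and i & bit:
                self.index[i] = new_page

table = ExtendibleHash(b=2)
inserted = [4, 8, 12, 1, 3]                    # hypothetical keys
for k in inserted:
    table.insert(k)
```

Keys 4, 8 and 12 share their two lowest bits, so inserting 12 doubles the index three times in a row; distinct integer keys always separate eventually, but many keys with equal low-order bits make the index grow quickly.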

Deletion Algorithm
  - Find and delete the record
  - Check the sibling page
  - If there are less than b records in both pages:
    - merge the pages and free the empty page
    - decrease local depth l by 1 (records in the merged page have 1 less common bit)
    - if l < d everywhere: reduce the index (half size) and update the pointers

[Figure: delete with merging on an index with entries 000 to 111; after the merge l < d for every page, so the index is halved to entries 00, 01, 10, 11.]

Observations
  - A page splits and there are more than b keys with the same next bit:
    - take one more bit for addressing (increase l)
    - if d = l the index is doubled again!!
  - Hashing might fail for non-uniform distributions of keys (e.g., multiple keys with the same value)
    - if the distribution of keys is known, transform it to uniform
  - Dynamic hashing performs better for non-uniform distributions (affected locally)

Storage utilization:
  - before splitting: 1 full page with b records; after splitting: the b records occupy 2 pages
  - $u = \frac{b}{2b} = 50\%$ after splitting
  - in general 50% < u < 100%
  - on the average $u \approx \ln 2 \approx 69\%$ (no overflows)

Overflows
  - storage utilization: allow overflows in order to achieve higher u and to avoid index doubling (if d = l)
  - with overflow pages of size b: $u = \frac{2b}{3b} \approx 66\%$ after splitting
  - higher u is achieved for small overflow pages
  - for smaller overflow pages (e.g., b/2): $u = \frac{b + b/2}{2b} = 75\%$
  - double the index only if the overflow overflows!!

Performance of Extendible Hashing
  - For n records and page size b, expected size of index (Flajolet):
    $\frac{e}{b \ln 2}\, n^{1 + 1/b} \approx \frac{3.92}{b}\, n^{1 + 1/b}$
  - 1 disk access/retrieval when the index is in main memory
  - 2 disk accesses when the index is stored on disk
  - overflows increase the number of disk accesses

Linear Hashing (Litwin 1980)
  - Dynamic hashing scheme without index
  - Hash values refer directly to page addresses
  - Overflows are allowed
  - The file grows one page at a time
  - The page which splits is not always the one which overflowed
  - The pages split in a predetermined order

  - Initially n empty pages of size b
  - p points to the page that splits
  - Overflows are allowed
  [Figure: the file before and after an overflow, with p pointing to the first page.]

  - A page splits whenever the splitting criterion is satisfied
  - a new page is added at the end of the file
  - pointer p points to the next page
  - split the contents of the old page between the old and the new page based on key values

[Figure: file with pages 0 to 4 and one overflow page, p pointing to page 0; inserting a new element raises u above 80%, which triggers a split.]
  - $b = b_{page} = 4$, $b_{overflow} = 1$
  - initially n = 5 pages
  - hash function $h_0 = k \bmod n$
  - splitting criterion: u > A% (here 80%)
  - alternatively: split when the overflow overflows, etc.

[Figure: the file after the split; page 0 and the new page 5 are addressed by $h_1$, pages 1 to 4 by $h_0$, and u drops below 80%.]
  - Page 5 is added at the end of the file
  - The contents of page 0 are split between pages 0 and 5 based on hash function $h_1$
  - p points to the next page

Hash Functions for Linear Hashing
  - Initially as simple as $h_0 = key \bmod n$
  - As pages are added, $h_0$ alone becomes insufficient
  - The file will eventually double its size; in that case use $h_1 = key \bmod 2n$
  - In the meantime:
    - use $h_0$ for pages not yet split
    - use $h_1$ for pages that have already split
  - Split the contents of the page pointed to by p based on $h_1$

Hash functions (continued)
  - When the file has doubled its size, $h_0$ is no longer needed: set $h_0 \leftarrow h_1$ and continue (e.g., $h_0 = k \bmod 10$)
  - The file will eventually double its size again
  - Deletions cause merging of pages whenever a merging criterion is satisfied, e.g., u < B%

Hash functions for linear hashing
  - Initially n pages and $0 \le h_0(k) \le n - 1$
  - Series of hash functions: $h_{i+1}(k) = h_i(k)$ or $h_{i+1}(k) = h_i(k) + 2^{i} n$
  - Selection of hash function: if $h_i(k) \ge p$ then use $h_i(k)$ else use $h_{i+1}(k)$
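The selection rule and the split order can be sketched as follows. The sketch assumes h_i(k) = k mod (n 2^i), a splitting criterion u > 80%, and a simplification in which records beyond the page size b simply stay in the same Python list instead of separate overflow pages (so u counts primary pages only). The class name `LinearHashFile` is hypothetical.

```python
class LinearHashFile:
    def __init__(self, n: int, b: int = 4, max_load: float = 0.8):
        self.n = n                    # initial number of pages
        self.b = b                    # records per primary page
        self.max_load = max_load      # splitting criterion: u > max_load
        self.i = 0                    # current level: h_i(k) = k mod (n * 2**i)
        self.p = 0                    # next page to split
        self.pages = [[] for _ in range(n)]
        self.count = 0

    def address(self, key: int) -> int:
        a = key % (self.n * 2 ** self.i)                 # h_i(k)
        if a < self.p:                                   # page already split
            a = key % (self.n * 2 ** (self.i + 1))       # use h_{i+1}(k)
        return a

    def search(self, key: int) -> bool:
        return key in self.pages[self.address(key)]

    def insert(self, key: int) -> None:
        self.pages[self.address(key)].append(key)
        self.count += 1
        if self.count / (len(self.pages) * self.b) > self.max_load:
            self._split()

    def _split(self) -> None:
        """Split page p (not necessarily the page that overflowed)."""
        old, self.pages[self.p] = self.pages[self.p], []
        self.pages.append([])                            # new page at end of file
        modulus = self.n * 2 ** (self.i + 1)
        for k in old:                                    # redistribute with h_{i+1}
            self.pages[k % modulus].append(k)
        self.p += 1
        if self.p == self.n * 2 ** self.i:               # file has doubled
            self.i += 1                                  # h_i <- h_{i+1}
            self.p = 0

file = LinearHashFile(n=5)
for k in range(40):                   # hypothetical keys 0..39
    file.insert(k)
```

With n = 5 and b = 4 the first split happens at the 17th record, the file doubles to 10 pages after five splits, and a second round of splits then starts again from page 0.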

Linear Hashing with Partial Expansions (Larson 1980)
  - Problem with Linear Hashing: pages to the right of p are split late, so large chains of overflows build up on the rightmost pages
  - Solution: do not wait that long to split a page
  - k partial expansions: take pages in groups of k
  - all k pages of a group split together
  - the file grows at a lower rate

Two partial expansions on a file with 2n pages
  - initially: n groups with k = 2 pages each
  - groups: (0, n), (1, 1+n), ..., (i, i+n), ..., (n-1, 2n-1)
  [Figure: pages 0, n, 2n with pointers to the pages of the same group.]
  - all pages of the same group split together: in the first split, pages 0 and n split together
  - some records go to page 2n (new page)

1st expansion: after n splits, all pages are split
  - at the end of the 1st expansion the file has 3n pages
  - the file grows at a lower rate: its size is 1.5 times larger
  - after the 1st expansion take pages in groups of 3
  - groups: (j, j+n, j+2n), 0 <= j <= n-1
  [Figure: pages 0, n, 2n, 3n with pointers to the pages of the same group.]

2nd expansion: after n splits the file has size 4n
  - repeat the same process, having initially 4n pages in 2n groups
  [Figure: pages 0, 2n, 4n with pointers to the pages of the same group.]

[Figure: disk accesses per retrieval versus relative file size, for Linear Hashing and for Linear Hashing with 2 partial expansions.]
[Table: cost of retrieval, insertion and deletion for Linear Hashing, Linear Hashing with 2 partial expansions and Linear Hashing with 3 partial expansions, at u = 0.85.]

Dynamic Hashing Schemes
  - Very good performance on membership, insert and delete operations
  - Suitable for both main memory and disk
  - Critical parameter: space utilization u
    - large u: more overflows, bad performance
    - small u: fewer overflows, better performance
  - Suitable for direct access queries (random accesses) but not for range queries