Hashing Techniques. Material based on slides by George Bebis

Source slides: https://www.cse.unr.edu/~bebis/cs477/lect/hashing.ppt

The Search Problem
Find items with keys matching a given search key. Given an array A containing n keys and a search key x, find the index i such that x = A[i]. As in the case of sorting, a key could be part of a large record.

Applications
Keeping track of customer account information at a bank: search through records to check balances and perform transactions.
Keeping track of reservations on flights: search to find empty seats and to cancel or modify reservations.
Search engine: look for all documents containing a given word.

Direct Addressing
Assumptions: key values are distinct, and each key is drawn from a universe U = {0, 1, ..., m - 1}.
Idea: store the items in an array, indexed by keys.
Direct-address table representation: an array T[0 .. m - 1], where each slot (position) in T corresponds to a key in U. For an element x with key k, a pointer to x (or x itself) is placed in location T[k]. If the set contains no element with key k, T[k] is empty, represented by NIL.
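A minimal Python sketch of a direct-address table, assuming small non-negative integer keys. The class name DirectAddressTable and the sample records are illustrative and not taken from the slides; None plays the role of NIL.

```python
class DirectAddressTable:
    """Direct-address table for distinct integer keys in {0, ..., m - 1}."""

    def __init__(self, m):
        self.slots = [None] * m      # None plays the role of NIL

    def insert(self, key, value):
        self.slots[key] = value      # store the element in T[k]

    def search(self, key):
        return self.slots[key]       # None if no element has this key

    def delete(self, key):
        self.slots[key] = None


table = DirectAddressTable(10)
table.insert(3, "record for key 3")
table.insert(7, "record for key 7")
```

Every operation is a single array access, but the array needs one slot for every possible key in U, whether or not that key is ever used.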

Direct Addressing (cont'd) [figure not reproduced]

Examples Using Direct Addressing: Example 1 [not reproduced]

Examples Using Direct Addressing: Example 2 [not reproduced]

Hashing
Hashing provides a means for accessing data without the use of an index structure. Instead, data is addressed on disk by computing a function on a search key.

Organization
A bucket in a hash file is a unit of storage (typically a disk block) that can hold one or more records. The hash function, h, is a function from the set of all search keys, K, to the set of all bucket addresses, B. Insertion, deletion, and lookup are done in constant time on average, provided the hash function spreads the keys evenly across the buckets.

Hash Tables
When |K| is much smaller than |U|, a hash table requires much less space than a direct-address table. Storage requirements can be reduced to |K|. Search time can still be O(1), but in the average case, not the worst case.

Hash Tables
Idea: use a function h to compute the slot for each key, and store the element in slot h(k).
A hash function h transforms a key into an index in a hash table T[0 .. m - 1]:
h : U -> {0, 1, ..., m - 1}
We say that k hashes to slot h(k).
Advantages: the range of array indices handled is reduced to m instead of |U|, and storage is reduced as well.

Example: Hash Tables
[Figure: the actual keys k1, ..., k5 in K, a subset of the universe of keys U, are mapped by h to slots of a table indexed 0 .. m - 1. The slots shown are h(k1), h(k4), h(k2) = h(k5), and h(k3).]

Revisit Example 2

Do you see any problems with this approach?
[Figure: keys k1, ..., k5 mapped by h to slots 0 .. m - 1, with h(k2) = h(k5).]

Collisions! Keys k2 and k5 hash to the same slot: h(k2) = h(k5).

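Store names using the hash function h(k) = k mod m, where k is the sum of the alphabet positions of the letters (A = 1, ..., Z = 26). Let m = 51.
MOHIT: 13 + 15 + 8 + 9 + 20 = 65, and 65 mod 51 = 14
RANA: 18 + 1 + 14 + 1 = 34
STOP: 19 + 20 + 15 + 16 = 70, and 70 mod 51 = 19
BISWAS: 2 + 9 + 19 + 23 + 1 + 19 = 73, and 73 mod 51 = 22
XEROX: 24 + 5 + 18 + 15 + 24 = 86, and 86 mod 51 = 35
TOPS: same letters as STOP, so it also hashes to 19 (a collision)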
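The letter-sum hash can be checked with a short Python sketch. The function names alphabet_sum and h_name are illustrative, and the code assumes names made up only of the letters A to Z.

```python
def alphabet_sum(name):
    """Sum of alphabet positions: A = 1, B = 2, ..., Z = 26."""
    return sum(ord(ch) - ord("A") + 1 for ch in name.upper())


def h_name(name, m=51):
    """h(k) = k mod m, where k is the letter sum of the name."""
    return alphabet_sum(name) % m


names = ["MOHIT", "RANA", "STOP", "BISWAS", "XEROX", "TOPS"]
slots = {name: h_name(name) for name in names}
```

Because the hash depends only on which letters occur, any two anagrams such as STOP and TOPS always land in the same slot.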

Collisions
Two or more keys hash to the same slot! For a given set K of keys:
If |K| <= m, collisions may or may not happen, depending on the hash function.
If |K| > m, collisions will definitely happen (i.e., there must be at least two keys that have the same hash value).
Avoiding collisions completely is hard, even with a good hash function.

Handling Collisions
We will review the following methods:
Chaining
Open addressing: linear probing, quadratic probing, double hashing

Handling Collisions Using Chaining
Idea: put all elements that hash to the same slot into a linked list. Slot j contains a pointer to the head of the list of all elements that hash to j.

Collision with Chaining - Discussion
Choosing the size of the table: small enough not to waste space, large enough that the lists remain short.
How should we keep the lists: ordered or not? Not ordered! Insertion is fast, and the most recently inserted elements can easily be removed.

Insertion in Hash Tables
Worst-case running time is O(1). This assumes that the element being inserted isn't already in the list; checking whether it was already inserted would take an additional search.

Searching in Hash Tables
Search for an element with key k in the list T[h(k)]. Running time is proportional to the length of the list of elements in slot h(k).
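A minimal Python sketch of chaining with unordered singly linked lists. The class names, integer keys, and h(k) = k mod m are assumptions for illustration; insert adds at the head of the list and does not check for duplicates.

```python
class Node:
    def __init__(self, key, value, next_node=None):
        self.key, self.value, self.next = key, value, next_node


class ChainedHashTable:
    def __init__(self, m=7):
        self.m = m
        self.heads = [None] * m          # slot j points to the head of list j

    def _h(self, key):
        return key % self.m              # assumed hash function

    def insert(self, key, value):
        # O(1): the new node becomes the head of the list, no duplicate check
        j = self._h(key)
        self.heads[j] = Node(key, value, self.heads[j])

    def search(self, key):
        # time proportional to the length of the list in slot h(key)
        node = self.heads[self._h(key)]
        while node is not None and node.key != key:
            node = node.next
        return None if node is None else node.value

    def delete(self, key):
        j = self._h(key)
        prev, node = None, self.heads[j]
        while node is not None and node.key != key:
            prev, node = node, node.next
        if node is None:
            return False                 # key not present
        if prev is None:
            self.heads[j] = node.next    # removing the head of the list
        else:
            prev.next = node.next
        return True


chained = ChainedHashTable(7)
chained.insert(3, "a")
chained.insert(10, "b")                  # 10 mod 7 = 3, same slot as key 3
```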

Hash Functions
A hash function transforms a key into a table address. What makes a good hash function?
(1) It is easy to compute.
(2) It approximates a random function: for every input, every output is equally likely (simple uniform hashing).
In practice, it is very hard to satisfy the simple uniform hashing property, because we don't know in advance the probability distribution that keys are drawn from.

Good Approaches for Hash Functions
Minimize the chance that closely related keys hash to the same slot: strings such as "pt" and "pts" should hash to different slots.
Derive a hash value that is independent of any patterns that may exist in the distribution of the keys.

The Division Method
Idea: map a key k into one of the m slots by taking the remainder of k divided by m: h(k) = k mod m.
Advantage: fast, requires only one operation.
Disadvantage: certain values of m are bad, e.g., powers of 2 and non-prime numbers.

Example - The Division Method
If m = 2^p, then h(k) is just the least significant p bits of k:
p = 1: m = 2, h(k) is in {0, 1}, the least significant 1 bit of k
p = 2: m = 4, h(k) is in {0, 1, 2, 3}, the least significant 2 bits of k
Choose m to be a prime, not close to a power of 2.
[Table not reproduced: column 2 lists k mod 97 (m = 97) and column 3 lists k mod 100 (m = 100).]
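A small Python check of why a power of 2 is a poor choice of m. The key list is a hypothetical example (multiples of 4, so all keys share their two lowest bits); m = 97 is the prime used in the slide's table.

```python
def h_div(k, m):
    """Division method: h(k) = k mod m."""
    return k % m


keys = [4, 8, 12, 16, 20, 24, 28, 32]    # hypothetical keys, all multiples of 4

# For m = 2^p, k mod m is exactly the p least significant bits of k.
low_bits_ok = all(h_div(k, 2 ** p) == k & (2 ** p - 1)
                  for k in keys for p in (1, 2, 3))

slots_pow2 = sorted({h_div(k, 4) for k in keys})     # m = 4, a power of 2
slots_prime = sorted({h_div(k, 97) for k in keys})   # m = 97, a prime
```

With m = 4 every key in this list lands in slot 0, because the hash ignores all but the two lowest bits. With m = 97 the eight keys occupy eight different slots.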

Probing
Resolve collisions without using linked lists: use a larger table and try successive locations.

Common Open Addressing Methods
Linear probing
Quadratic probing
Double hashing

Linear probing: Inserting a key
Idea: when there is a collision, check the next available position in the table (i.e., probing).
h(k, i) = (h1(k) + i) mod m, for i = 0, 1, 2, ...
First slot probed: h1(k). Second slot probed: h1(k) + 1. Third slot probed: h1(k) + 2, and so on.
Probe sequence: <h1(k), h1(k) + 1, h1(k) + 2, ...>, wrapping around to slot 0 after slot m - 1.
At most m distinct probe sequences can be generated. Why? The starting slot h1(k) determines the whole sequence, and there are only m possible starting slots.

Insert keys 89, 18, 49, 58, 69 with H(x) = x mod 10.
Slot:  0   1   2   3   4   5   6   7   8   9
Key:   49  58  69  -   -   -   -   -   18  89
49 collides with 89 and is placed in the next location, 0. 58 collides with 18; the next available place is 1. 69 collides with 89, 49, and 58 in turn and lands in location 2. Some slots tend to be crowded, forming a cluster.

Insert keys 89, 18, 49, 58, 69 with H(x) = x mod 12.
Slot:  0   1   2   3   4   5   6   7   8   9   10  11
Key:   -   49  -   -   -   89  18  -   -   69  58  -
89: 5, 18: 6, 49: 1, 58: 10, 69: 9. There are no collisions, so the time to search is O(1).

Linear probing: Searching for a key
Three cases:
(1) The position in the table is occupied by an element with an equal key.
(2) The position in the table is empty.
(3) The position in the table is occupied by a different element.
Case 3: probe the next higher index until the element is found or an empty position is found. The process wraps around to the beginning of the table.

Linear probing: Deleting a key
Problem: we cannot simply mark the slot as empty, because it would then be impossible to retrieve keys that were inserted after that slot was occupied and probed past it.
Solution: mark the slot with a sentinel value DELETED. The deleted slot can later be used for insertion, and searching will still be able to find all the keys.
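The insert, search, and delete rules for linear probing can be sketched in Python as follows. The class name, the two sentinel objects, and h1(k) = k mod m are assumptions for illustration; insert assumes the key is not already in the table.

```python
EMPTY, DELETED = object(), object()      # sentinels: never used / deleted


class LinearProbingTable:
    def __init__(self, m=10):
        self.m = m
        self.slots = [EMPTY] * m

    def _probe(self, key):
        start = key % self.m             # h1(k) = k mod m (assumed)
        for i in range(self.m):
            yield (start + i) % self.m   # h(k, i) = (h1(k) + i) mod m

    def insert(self, key):
        for j in self._probe(key):
            if self.slots[j] is EMPTY or self.slots[j] is DELETED:
                self.slots[j] = key      # a DELETED slot can be reused
                return j
        raise OverflowError("hash table is full")

    def search(self, key):
        for j in self._probe(key):
            if self.slots[j] is EMPTY:   # case 2: the key cannot be further on
                return None
            if self.slots[j] == key:     # case 1: found
                return j
            # case 3: different element or DELETED marker, keep probing
        return None

    def delete(self, key):
        j = self.search(key)
        if j is not None:
            self.slots[j] = DELETED      # do not mark the slot as EMPTY
        return j


probing = LinearProbingTable(10)
positions = [probing.insert(k) for k in (89, 18, 49, 58, 69)]
```

Deleting 89 leaves a DELETED marker in slot 9, so a later search for 49 still probes past slot 9 and finds it in slot 0.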

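Primary Clustering Problem
Some slots become more likely than others to be filled next. Long runs of occupied slots are created, so search time increases!
Initially, all slots have probability 1/m. Once clusters form, an empty slot that follows a run of occupied slots becomes more likely to be filled. In the slide's figure: slot b: 2/m, slot d: 4/m, slot e: 5/m.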

Quadratic Probing
To overcome primary clustering, quadratic probing uses the offset function f(i) = i^2: the i-th probe examines location (h(x) + i^2) mod TS, where TS is the table size.

Insert 89, 18, 49, 58, 69 (table size 10).
49 collides with 89; try i = 1, which goes to location 0.
58 collides with 18, then collides with 89 with i = 1; try i = 2 (4 cells away), which goes to location 2.
69 collides with 89, then with 49, so try i = 2, which finds an empty slot at 3.
Slot:  0   1   2   3   4   5   6   7   8   9
Key:   49  -   58  69  -   -   -   -   18  89
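The quadratic probing example can be replayed with a short Python sketch. The function name quadratic_insert and the use of None for an empty slot are illustrative choices, and h(x) = x mod 10 is assumed as in the linear probing example.

```python
def quadratic_insert(table, key):
    """Insert key, probing (h(x) + i^2) mod TS for i = 0, 1, 2, ..."""
    ts = len(table)
    home = key % ts                      # h(x) = x mod TS (assumed)
    for i in range(ts):
        j = (home + i * i) % ts
        if table[j] is None:
            table[j] = key
            return j
    raise OverflowError("no free slot found along the probe sequence")


quad_table = [None] * 10
quad_positions = [quadratic_insert(quad_table, k) for k in (89, 18, 49, 58, 69)]
```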

Quadratic probing with prime TS
If the table size TS is chosen as a prime, a place to hold an element can always be found as long as the table is not yet half full: the first TS/2 alternative locations are all distinct.
Take two of these locations, (h(x) + i^2) mod TS and (h(x) + j^2) mod TS, where i and j are both less than TS/2. To prove that the locations are distinct, suppose for the sake of contradiction that i and j are different but the two locations turn out to be the same.

Can i and j point to the same location?
Then
h(x) + i^2 = h(x) + j^2 (mod TS)
i^2 = j^2 (mod TS)
i^2 - j^2 = 0 (mod TS)
(i + j)(i - j) = 0 (mod TS)
Since the table size is prime, it follows that either (i + j) = 0 (mod TS), which is not possible because i and j are distinct, non-negative, and both less than TS/2, so 0 < i + j < TS; or (i - j) = 0 (mod TS), which is not possible because i and j are distinct and |i - j| < TS. Thus the first TS/2 locations are distinct.

Example
Consider a table of size 37 and let h(x) be 26. What are the alternative locations?
for i = 1: 26 + 1 = 27
for i = 2: 26 + 4 = 30
for i = 3: 26 + 9 = 35
for i = 4: 26 + 16 = 42, and 42 mod 37 = 5
for i = 5: 26 + 25 = 51, and 51 mod 37 = 14
Such data structures do not support ordinary deletion, because an occupied cell might have caused an earlier insertion to probe past it. One could use lazy deletion (mark the cell with a flag). When the table gets half full, enlarge the hash table.
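The distinctness claim for prime table sizes can be checked numerically with a few lines of Python. The helper name quadratic_locations is illustrative; the table of size 16 is an added contrast case that is not in the slides.

```python
def quadratic_locations(home, ts, count):
    """Locations (home + i^2) mod ts for i = 0, 1, ..., count - 1."""
    return [(home + i * i) % ts for i in range(count)]


# Prime table size 37, h(x) = 26: take every i below TS/2, i.e. i = 0 .. 18.
locs_prime = quadratic_locations(26, 37, 19)
all_distinct = len(set(locs_prime)) == len(locs_prime)

# Contrast (not from the slides): non-prime size 16, i = 0 .. 7.
locs_16 = quadratic_locations(0, 16, 8)
distinct_16 = len(set(locs_16))
```

For the prime size all 19 locations differ, as the proof guarantees. For size 16 the eight probes visit only the four locations 0, 1, 4, and 9.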


Double Hashing
We use a second hash function h2(x). After a collision at h(x), we probe at offsets h2(x), 2 h2(x), 3 h2(x), ... from h(x), modulo the table size TS. A good choice is h2(x) = R - (x mod R), where R is a prime number smaller than TS.

Double Hashing: example
Consider the problem of inserting 89, 18, 49, 58, 69 into a table of size 10. Let R = 7, so h2(x) = 7 - (x mod 7).
89 goes to position 9 and 18 goes to position 8.
49 gets a collision at position 9. h2(49) = 7 - (49 mod 7) = 7 - 0 = 7, so count 7 positions from there: (9 + 7) mod 10 = 6.
Slot:  0   1   2   3   4   5   6   7   8   9
Key:   -   -   -   -   -   -   49  -   18  89

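Double Hashing
58: collides with 18 at position 8. h2(58) = 7 - (58 mod 7) = 7 - 2 = 5, so it goes to (8 + 5) mod 10 = 3.
69: collides with 89 at position 9. h2(69) = 7 - (69 mod 7) = 7 - 6 = 1, so it goes to (9 + 1) mod 10 = 0.
Now try 60: it collides with 69 at position 0, and h2(60) = 7 - (60 mod 7) = 7 - 4 = 3.
Offset h2(60) = 3: collision with 58. Try 2 h2(60) = 6: collision with 49. Try 3 h2(60) = 9: collision with 89. Try 4 h2(60) = 12, and 12 mod 10 = 2: free, so 60 goes to position 2.
Slot:  0   1   2   3   4   5   6   7   8   9
Key:   69  -   60  58  -   -   49  -   18  89
Now try with 23. Problem? 23 collides with 58 at position 3, and h2(23) = 7 - (23 mod 7) = 5, so the probes alternate between positions 3 and 8, both occupied, and a free slot is never reached. The table size is small and not prime.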
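The whole example, including the failure for key 23, can be replayed with a short Python sketch. The function names are illustrative, h(x) = x mod TS is assumed for the first hash, and the insert function returns None when the probe sequence is exhausted.

```python
def h2(x, r=7):
    """Second hash: R - (x mod R), always in 1 .. R."""
    return r - (x % r)


def double_hash_insert(table, x, r=7):
    ts = len(table)
    home = x % ts                        # first hash, assumed to be x mod TS
    step = h2(x, r)
    for i in range(ts):
        j = (home + i * step) % ts       # home, home + h2, home + 2 h2, ...
        if table[j] is None:
            table[j] = x
            return j
    return None                          # no free slot on the probe sequence


dh_table = [None] * 10
dh_positions = [double_hash_insert(dh_table, x) for x in (89, 18, 49, 58, 69, 60)]
stuck = double_hash_insert(dh_table, 23)   # step 5 shares a factor with 10
```

Key 23 fails even though four slots are free, because a step of 5 in a table of size 10 only ever visits two positions. With a prime table size every step from 1 to TS - 1 visits all slots.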

A different Double Hashing style
(1) Use one hash function to determine the first slot.
(2) Use a second hash function to determine the increment for the probe sequence.
h(k, i) = (h1(k) + i h2(k)) mod m, for i = 0, 1, ...
Initial probe: h1(k). The second probe is offset by h2(k) mod m, and so on.
Advantage: avoids clustering.

Different Double Hashing: Example
h1(k) = k mod 13
h2(k) = 1 + (k mod 11)
h(k, i) = (h1(k) + i h2(k)) mod 13
Insert key 14:
h(14, 0) = h1(14) = 14 mod 13 = 1 (occupied by 79)
h(14, 1) = (h1(14) + h2(14)) mod 13 = (1 + 4) mod 13 = 5 (occupied by 98)
h(14, 2) = (h1(14) + 2 h2(14)) mod 13 = (1 + 8) mod 13 = 9 (free, so 14 is stored here)
Slot:  0   1   2   3   4   5   6   7   8   9   10  11  12
Key:   -   79  -   -   69  98  -   72  -   14  -   50  -
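A Python sketch of this probe sequence. The occupied slots (79 in slot 1, 69 in slot 4, 98 in slot 5, 72 in slot 7, 50 in slot 11) are an assumption read off the flattened table in the example; they are consistent with the two hash functions. The function name probe_sequence is illustrative.

```python
def probe_sequence(k, m=13):
    """All m probes of h(k, i) = (h1(k) + i * h2(k)) mod m."""
    h1 = k % m
    h2 = 1 + (k % 11)
    return [(h1 + i * h2) % m for i in range(m)]


clrs_table = [None] * 13
for slot, key in {1: 79, 4: 69, 5: 98, 7: 72, 11: 50}.items():
    clrs_table[slot] = key               # assumed contents before inserting 14

seq_14 = probe_sequence(14)
slot_for_14 = next(j for j in seq_14 if clrs_table[j] is None)
clrs_table[slot_for_14] = 14
```

Because 13 is prime and h2(k) is always between 1 and 11, the probe sequence visits every slot exactly once, so an insertion can only fail when the table is completely full.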

Rehashing
When the table gets too full, running time starts getting large. Solution: double the table size and, using a new hash function, insert the old elements into the new table.
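A minimal Python sketch of rehashing, assuming a linear-probing table with h(x) = x mod (table size), so that doubling the size automatically changes the hash function. The function names build_table and rehash are illustrative, and build_table assumes there are fewer keys than slots.

```python
def build_table(keys, m):
    """Insert keys into a fresh linear-probing table of size m."""
    table = [None] * m
    for k in keys:
        j = k % m                        # hash function depends on table size
        while table[j] is not None:      # linear probing with wrap-around
            j = (j + 1) % m
        table[j] = k
    return table


def rehash(table):
    """Double the table size and reinsert every stored key."""
    old_keys = [k for k in table if k is not None]
    return build_table(old_keys, 2 * len(table))


small = build_table([89, 18, 49, 58, 69], 10)
big = rehash(small)
```

Every stored key is touched once, so a rehash costs time proportional to the table size. That is acceptable in main memory when done rarely, but expensive when the table lives on disk.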

Extendible Hashing
When the data is too large to fit in main memory, the main consideration is the number of disk accesses. Rehashing is very expensive, because every entry has to be inserted all over again into a new table.
Borrow an idea from B-trees. Let M be the number of records fitting in one disk block. As M increases, the depth of the B-tree decreases (but the branching factor increases, so processing time increases).

The strategy used in extendible hashing is to reduce the time needed to search for the appropriate leaf. Let the keys be hashed to 6-bit integers. Create a pointer table (directory) of size 4, in which each cell points to the leaf holding the numbers that start with the same first 2 bits (D = 2). Let us assume that each leaf can hold up to M = 4 elements.

Directory (D = 2), with the leaf each entry points to:
00: 000100, 001000, 001010, 001011
01: 010100, 011000
10: 100000, 101000, 101100, 101110
11: 111000, 111001

What happens when a leaf gets full?
Suppose we want to insert 100100. This should go to the 3rd leaf, but that leaf is already full, so we split it into 2 leaves. This results in D being changed to 3 (each leaf is now determined by 3 bits).
Note that all leaves not involved in the split are now pointed to by two adjacent directory entries. The directory is new, but the other leaves are not disturbed.

Directory (D = 3), with the leaf each entry points to:
000, 001: 000100, 001000, 001010, 001011
010, 011: 010100, 011000
100: 100000, 100100
101: 101000, 101100, 101110
110, 111: 111000, 111001

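If key 000000 is now inserted, then the first leaf needs to be split (the others are not disturbed), and the directory keeps its size because D = 3 is already enough to tell the two new leaves apart. The scheme is a very simple strategy that gives quick access times for insert and search operations on large databases.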
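An in-memory Python sketch of the directory and leaf-splitting logic, under stated assumptions: keys are distinct 6-bit integers, the leading D bits index the directory, and each leaf records how many leading bits its keys share. The class and attribute names are illustrative, and a real implementation would keep the leaves in disk blocks.

```python
class Leaf:
    def __init__(self, depth):
        self.depth = depth               # leading bits shared by all keys here
        self.keys = []


class ExtendibleHash:
    def __init__(self, bits=6, leaf_capacity=4, depth=2):
        self.bits, self.capacity, self.depth = bits, leaf_capacity, depth
        self.directory = [Leaf(depth) for _ in range(1 << depth)]

    def _index(self, key):
        return key >> (self.bits - self.depth)   # leading D bits of the key

    def search(self, key):
        return key in self.directory[self._index(key)].keys

    def insert(self, key):
        while True:
            leaf = self.directory[self._index(key)]
            if len(leaf.keys) < self.capacity:
                leaf.keys.append(key)
                return
            if leaf.depth == self.bits:
                raise OverflowError("cannot split: no more bits to use")
            if leaf.depth == self.depth:
                # double the directory: each leaf gets two adjacent entries
                self.directory = [l for l in self.directory for _ in (0, 1)]
                self.depth += 1
            self._split(leaf)

    def _split(self, leaf):
        new_depth = leaf.depth + 1
        low, high = Leaf(new_depth), Leaf(new_depth)
        for k in leaf.keys:              # redistribute on the next bit
            bit = (k >> (self.bits - new_depth)) & 1
            (high if bit else low).keys.append(k)
        for i, entry in enumerate(self.directory):
            if entry is leaf:            # repoint only entries of the old leaf
                bit = (i >> (self.depth - new_depth)) & 1
                self.directory[i] = high if bit else low


eh = ExtendibleHash()
for k in (0b000100, 0b001000, 0b001010, 0b001011, 0b010100, 0b011000,
          0b100000, 0b101000, 0b101100, 0b101110, 0b111000, 0b111001):
    eh.insert(k)
eh.insert(0b100100)                      # third leaf is full: D becomes 3
```

After the last insert, directory entries 000 and 001 still share the untouched first leaf, while entries 100 and 101 point to the two halves of the split leaf.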

It helps if the bits are fairly random. This can be accomplished by hashing the keys into a reasonably long integer.
Balanced search trees are quite expensive to implement for storing a large number of data values. If there is any suspicion that the data might be sorted (the worst case for a plain, unbalanced binary search tree), hashing would be the data structure of choice.
