CSE 30 Notes 3: Hashing (Last updated 0/7/7 : PM) CLRS.-. (skip.3.3) 3.A. CONCEPTS Goal: Achieve faster operations than balanced trees (nearly Ο( ) epected time) by using randomness in key sets by sacrificing ) generality and ) ordered retrieval. Set of Potential Keys Mapping (h) Table Subscripts Regardless of the hash function, a dynamic set of keys will lead to collisions. Birthday parado 3 different birthdays available How many (random) persons are needed to have at least even odds of two persons with the same birthday? 3 Probability of k persons having k different birthdays is k i= 3 i 3 probability of unique birthdays among 0 people is probability of unique birthdays among people is probability of unique birthdays among people is 0.9978 probability of unique birthdays among 3 people is 0.9988 probability of unique birthdays among people is 0.557 probability of unique birthdays among people is 0.559 probability of unique birthdays among 3 people is 0.9377 probability of unique birthdays among people is 0.5 probability of unique birthdays among 57 people is 0.0000 probability of unique birthdays among 58 people is 0.0085
3.B. HASH FUNCTIONS Modular (AKA remaindering or division method) Multiplicative h(key) = key % m m is the table size Folklore: Make m prime, regardless of collision handling technique. Double hashing requires. hash = m * (0.703587*key - (int)(0.703587*key)); Universal Hashing - aside Use parameterized hash function to minimize chance of getting collisions beyond epectation. Parameters are randomly generated when hash structure is initialized. Tet Strings as Key scanf("%s",str); hash=0; for (i=0; str[i]!=0; i++) hash = (hash*0 + str[i]) % m; printf("%s => %d\n",str,hash); A string s signature may be stored in a data structure, even if hashing is not used. 3.C. COLLISION HANDLING BY CHAINING Concept Use table of pointers to unordered linked lists. Elements of a list have the same signature. Load Factor = α = # elements stored # in table Often stated as a per cent. For some methods, such as chaining, α can eceed 00%. Epected s is n m = α for hits and n = α for misses. m
3 3.D. COLLISION HANDLING BY OPEN ADDRESSING Saves space when records are small, so chaining would waste a large fraction of space for links. Collisions are handled by using a sequence for each key a permutation of the table s subscripts. Hash function is h(key, i) where i is the number of re attempts tried. Two special key values (or flags) are often used: never-used (-) and recycled (-). Searches stop on never-used, but continue on recycled. (For linear probing, but not double hashing - can also reinsert records past the emptied slot for a deletion.) Linear Probing - h(key, i) = (key + i) % m Properties:. Probe sequences eventually hit all.. Probe sequences wrap back to beginning of table. 3. Long clusters of contiguous occupied are costly for misses.. There are only m sequences. Two keys hashing to same initial slot have the same sequence. What about using h(key, i) = (key + *i) % 0 or h(key, i) = (key + 50*i) % 000?
Suppose all keys are equally likely to be accessed. Is there a best order for inserting keys? Insert keys: 0, 7, 0, 03, 0, 05, 0 0 3 5 0 3 5 Double Hashing h(key, i) = (h (key) + i*h (key)) % m Properties:. Probe sequences will hit all only if m is prime.. m(m ) sequences. Unlikely that two keys hashing to the same initial slot will have the same sequence. 3. Minimizes effect of clustering. Typical Hash Functions: h = key % m h = + key % (m )
5 0 3 5 7 8 9 0 Key h h 33 0 0 9 8 3 5 0 0 9 3 0 9 3 00 0 9 305 5 0 3.E. UPPER BOUNDS ON EXPECTED PERFORMANCE FOR OPEN ADDRESSING Double hashing comes very close to these results, but analysis assumes that hash function provides all m! permutations of subscripts.. Unsuccessful search when load factor is α = n. Each successive has the effect of decreasing m both the number of in the table and the number of occupied by one. Before first first second third fourth n n- n- n-3 n- m m- m- m-3 m-
a. Probability that a search has a first b. Probability that search goes on to a second α = n m c. Probability that search goes on to a third α n m < α n m < α d. Probability that search goes on to a fourth α n n m m < α n m < α3... Suppose the table is large. Sum the probabilities for s to get upper bound on epected number of s: α i = α i=0 (much worse than chaining). Inserting a key when load factor is α a. Eactly like unsuccessful search b. Upper bound of 3. Successful search α s a. Searching for a key takes as many s as inserting that particular key. b. Each inserted key increases the load factor, so the inserted key number i + is epected to take no more than i m = m m i s
c. Find epected s for n keys inserted into an empty table (each key is equally likely to be requested): 7 n n i=0 m m i = m n Sum is n i=0 m i m + m +...+ m n + = m m m n i=m n+ i n = m n m d Upper bound on sum for decreasing function. m n ( ln m ln ( m n )) = α ln m m n = α ln α alpha 0.00 unsuccessful (insert).50 successful. alpha 0.50 unsuccessful (insert).333 successful.5 alpha 0.300 unsuccessful (insert).9 successful.89 alpha 0.350 unsuccessful (insert).538 successful.3 alpha 0.00 unsuccessful (insert).7 successful.77 alpha 0.50 unsuccessful (insert).88 successful.39 alpha 0.500 unsuccessful (insert).000 successful.38 alpha 0.550 unsuccessful (insert). successful.5 alpha 0.00 unsuccessful (insert).500 successful.57 alpha 0.50 unsuccessful (insert).857 successful.5 alpha 0.700 unsuccessful (insert) 3.333 successful.70 alpha 0.750 unsuccessful (insert).000 successful.88 alpha 0.800 unsuccessful (insert) 5.000 successful.0 alpha 0.850 unsuccessful (insert).7 successful.3 alpha 0.900 unsuccessful (insert) 0.000 successful.558 alpha 0.90 unsuccessful (insert). successful. alpha 0.90 unsuccessful (insert).500 successful.75 alpha 0.930 unsuccessful (insert).8 successful.859 alpha 0.90 unsuccessful (insert). successful.993 alpha 0.950 unsuccessful (insert) 0.000 successful 3.53 alpha 0.90 unsuccessful (insert) 5.000 successful 3.353 alpha 0.970 unsuccessful (insert) 33.333 successful 3.5 alpha 0.980 unsuccessful (insert) 9.998 successful 3.99 alpha 0.990 unsuccessful (insert) 99.993 successful.5 Fast and Powerful Hashing Using Tabulation, CACM 0 (7), July 07, https://dl-acm-org.ezproy.uta.edu/citation.cfm?id=30877