CS 241: Analysis of Algorithms
Professor Eric Aaron
Lecture: T Th 9:00am
Lecture Meeting Location: OLB 205

Business:
- HW5 extended, due November 19
- HW6 to be out Nov. 14, due November 26
- Make-up lecture: Wed, Nov. 13, 4:30pm, OLB 205 (tentative)
- Exam back today
- Reading: CLRS Ch. 12.1-12.3, and the unstarred parts of Ch. 11
Business pt. 2
A Note From Your (Vassar CS) Majors Committee:
- What: Java Review Session!
- When: Tonight! 8pm!
- Where: OLB 104!

Hashing; Hash Tables
Consider a case where there are many possible keys (or elements) that could be stored, but relatively few of them are actually used:
- U: universe of keys; K: keys being used; K << U
- Then, look for an option with efficient operations, but space on the order of K, not on the order of U
An important approach to dictionaries in this case: hashing
- Elements are stored (and searched for) in a hash table: an array T that's typically of a size related to K, not related to U
- Instead of using the key value k as the index into T, compute h(k) using a hash function h, and use h(k) as the array index
Goals:
- Fast operations (O(1) time)
- Space-efficient data structure (O(n) space to store n elements)
Things to think about:
-- How do we define h?
-- What if there's a collision, i.e., h(x) == h(y) for some x != y?
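As a minimal illustration of the idea (a sketch with names of my choosing, not the lecture's code), here is the "use h(k) as the array index" step in Java, with h(k) = k mod m and collision handling deferred to the later slides:

```java
// A minimal sketch: store integer keys from a large universe U in a small
// table T by using h(k) = k mod m as the array index. Collisions are not
// handled here; that is the subject of the following slides.
public class HashIdea {
    static final int M = 13;                 // table size m (a prime)

    static Integer[] buildTable(int[] keys) {
        Integer[] T = new Integer[M];
        for (int k : keys) {
            int h = k % M;                   // hash function h(k) = k mod m
            T[h] = k;                        // use h(k) as the index into T
        }
        return T;
    }

    public static void main(String[] args) {
        // U is all ints, but only the few keys in K are actually used
        Integer[] T = buildTable(new int[]{41, 290, 7});
        System.out.println(T[41 % M]);       // 41 is stored at index 41 mod 13 = 2
    }
}
```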
Hash Functions and Collisions
Hash table: array T[0..m-1], where m is prime and m << U
Hash function: maps any key value into [0..m-1], so every element can be stored / referenced in T
Things to think about: How do we define h? And what if there's a collision, i.e., h(x) == h(y) for some x != y?
- Because hash functions compute search keys, it is essential that hash functions map equal keys to the same table index
- Good hash functions minimize the chance of collisions
- Managing collisions is essential for hashing
What kinds of problems do collisions cause for hashing as a method to implement a dictionary? What approaches can you think of to handle collisions when they occur?
- Remember, hashing needs to support the insert, delete, and search operations
Two approaches to managing collisions:
- Chaining: each cell in the table is a linked list of the elements mapped to that index
- Open addressing: when there's a collision, move along through the table until an open index, or the element being searched for, is found
Hash Functions
- A good hash function is efficient to compute
- A really good hash function satisfies (or almost satisfies) the assumption of simple uniform hashing: each key is equally likely to hash to any slot in the hash table
- In practice, it's generally not possible to achieve this: it's not known in advance what the likelihoods are for keys to be chosen
- Heuristics or other intelligent choices, however, can yield good performance
- Hash functions essentially compute an integer summary of the object (e.g., the object being searched for)
- Thus, hash functions return natural numbers, and they often presume that their input keys are natural numbers
- If keys aren't numbers, they must somehow be mapped to numbers
- For example: How could a character string be represented as an integer?

Hash Function Examples: Strings
Some possible hash functions for String objects. Treat each String s as a sequence of Unicode characters; then:
- Sum of Unicode codes: hash(s) = s[0] + s[1] + ... + s[n-1]
  e.g.: hash("now") = 110 + 111 + 119 = 340
  Not a great choice: few codes, uneven distribution
- Shifted sum of Unicode codes: hash(s) = s[0]*b^(n-1) + s[1]*b^(n-2) + ... + s[n-2]*b + s[n-1]
  e.g.: hash("now") = 110*b^2 + 111*b^1 + 119*b^0
  Choices for b: 2^16, or prime numbers
- But wouldn't this result in very big hash values? How would we get them down to smaller values for a space-efficient table?
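The two String hash codes above can be sketched in Java (the method names and the choice b = 31 are mine, for illustration):

```java
// Sketch of the two String hash codes from the slide.
public class StringHashes {
    // Sum of Unicode codes: hash(s) = s[0] + s[1] + ... + s[n-1]
    static int sumHash(String s) {
        int sum = 0;
        for (int i = 0; i < s.length(); i++)
            sum += s.charAt(i);              // each char's Unicode code
        return sum;
    }

    // Shifted sum: hash(s) = s[0]*b^(n-1) + s[1]*b^(n-2) + ... + s[n-1],
    // computed in the Horner form that the next slide discusses
    static long shiftedSumHash(String s, long b) {
        long h = 0;
        for (int i = 0; i < s.length(); i++)
            h = h * b + s.charAt(i);
        return h;
    }

    public static void main(String[] args) {
        System.out.println(sumHash("now"));            // 110 + 111 + 119 = 340
        System.out.println(shiftedSumHash("now", 31)); // 110*31^2 + 111*31 + 119
    }
}
```

Note that the shifted sum grows quickly with string length, which is exactly the "very big hash values" problem the slide raises; compression mod m is the answer on the next slide.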
Compression; and The Division Method
For an object, its index into a hash table can be computed by a two-step hash function:
- A hash-code function h1 (not the same as the hash function h) computes a value, such as the shifted Unicode sum
- Then, to compress the range of hash codes into the range of table indices, compute the remainder mod m of that hash code
- Thus, h(k) = h1(k) mod m, for an m-sized hash table
Notes:
- Simple case: h1(k) = k, so h(k) = k mod m. CLRS calls this the division method
- Typically, m is chosen to be a prime number; this helps spread out hash values and avoid collisions

Digression: Horner's Method
Recall the hash function "shifted sum of Unicode codes":
  hash(s) = s[0]*b^(n-1) + s[1]*b^(n-2) + ... + s[n-2]*b + s[n-1]
Horner's method can simplify calculation of such a code:
  a[0]*b^(n-1) + a[1]*b^(n-2) + ... + a[n-2]*b + a[n-1] = ((a[0]*b + a[1])*b + ... + a[n-2])*b + a[n-1]
The right-hand side is efficient to compute!
Horner's method:
  x = 0
  for i = 0 to n-1:
      x = x*b + a[i]
For hashing, we often want the hash value mod M. By properties of the % operator:
  (a*b) % n == ((a % n) * (b % n)) % n
  (a+b) % n == ((a % n) + (b % n)) % n
We can use these facts in Horner's method to reduce mod M at every step, keeping intermediate values small:
Horner's method with % M:
  x = 0
  for i = 0 to n-1:
      x = (x*b + a[i]) % M
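The second pseudocode block can be sketched directly in Java (array, base, and modulus values here are my own illustration, using the character codes of "now"):

```java
// Horner's method with reduction mod M at each step, so intermediate
// values never grow beyond roughly b*M and cannot overflow.
public class HornerMod {
    // Computes (a[0]*b^(n-1) + a[1]*b^(n-2) + ... + a[n-1]) mod M
    static int hornerMod(int[] a, int b, int M) {
        int x = 0;
        for (int i = 0; i < a.length; i++)
            x = (x * b + a[i]) % M;   // reduce mod M at every step
        return x;
    }

    public static void main(String[] args) {
        int[] now = {110, 111, 119};  // Unicode codes of 'n', 'o', 'w'
        // Same result as computing 110*31^2 + 111*31 + 119 = 109270, then mod 97
        System.out.println(hornerMod(now, 31, 97)); // 109270 mod 97 = 48
    }
}
```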
Collisions
Recall: two approaches to managing hash-value collisions
- Chaining: use a general hash function, and put all elements that hash to the same location in a linked list at that location
- Open addressing: use a general hash function as in chaining, and then increment from the original position until an empty slot (or the element you are looking for) is found. Each index position in the table holds at most one element

Chaining
Idea: Each cell T[k] in the hash table is a linked list (chain)
- T[k] is the head node of a linked list containing all hashed objects x with h(x) = k
- Default: an unsorted, singly-linked list
List lengths after storing n elements in a table of size m:
- The load factor (average list length) is α = n / m
- With a good hash function, each list is likely to have length close to α
What are the running times of the Insert, Delete, Search operations?
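A minimal chaining sketch in Java (class and method names are mine): each cell is a linked list, insert prepends to the chain in O(1), and search/delete walk the chain at h(k), whose expected length is the load factor α under simple uniform hashing:

```java
import java.util.LinkedList;

// Minimal hash table with chaining, for integer keys.
public class ChainedTable {
    private final LinkedList<Integer>[] T;

    @SuppressWarnings("unchecked")
    ChainedTable(int m) {
        T = new LinkedList[m];
        for (int i = 0; i < m; i++)
            T[i] = new LinkedList<>();       // each cell starts as an empty chain
    }

    private int h(int k) { return Math.floorMod(k, T.length); }

    // O(1): prepend to the (unsorted) chain at h(k)
    void insert(int k) { T[h(k)].addFirst(k); }

    // Expected O(1 + alpha): walk the chain at h(k)
    boolean search(int k) { return T[h(k)].contains(k); }

    // Expected O(1 + alpha) to find the node in a singly-linked chain
    void delete(int k) { T[h(k)].remove(Integer.valueOf(k)); }
}
```

With m = 13, the keys 41 and 54 both hash to index 2, so they end up in the same chain rather than colliding destructively.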
Open Addressing
Instead of chaining with linked lists, collisions can be resolved by storing every element directly in the table
- Each slot in the table contains either NIL or an element (or a reference to that element)
- Note: the table may fill up! But α will never be greater than 1
Idea: compute the hash value as with chaining, but if a collision occurs, successively index into (i.e., probe) T:
- When inserting an element, systematically find an open slot into which the element can be placed
- When searching, systematically examine slots until either finding the element or determining it's not in the table
- (How about deleting? Is deleting simple? Think about it as it relates to searching...)
The probe sequence depends both on the key and on the probe increment

Open Addressing; Probe Sequence
With table size m, the probe sequence (the sequence of table indices examined for a key value) must be a permutation of <0, 1, ..., m-1>
- If it weren't, then some slots in the table might never be considered as the table becomes full
- Recall also that the probe sequence depends in part on the key being hashed (as well as other factors that determine the probe increment)
For theoretical analysis, we assume uniform hashing: the probe sequence of each key is equally likely to be any of the m! permutations of <0, 1, ..., m-1>
- (In practice, we try to approach that performance with approximations)
Summary: hashing with open addressing
- Compute the hash index h(k,i) for the i-th index in the probe sequence, where k is the key for hashing
- Thus, the probe sequence for key k is h(k,0), h(k,1), ..., h(k,m-1)
- In the worst case, every slot in T is examined before finding an empty slot (or the element being searched for)
An Example: Linear Probing
Three common techniques generate probe sequences that are guaranteed to be permutations of <0, 1, ..., m-1>:
- Linear probing; quadratic probing; double hashing
- Each of these uses auxiliary hash functions as part of the full hash function, i.e., the hash function h is in terms of some h'(k) or h1(k), etc.
Linear probing is the simplest of the three
- Probe sequence: if a slot is full, go to the next; wrap around if needed
- Function: h(k, i) = (h'(k) + i) mod m
  [k is the element's key; i is the probe number, which goes from 0 to m-1]
- Trade-offs: simple to implement, but subject to primary clustering: long runs of consecutive filled slots build up, making probe sequences longer
  (If an empty slot is preceded by i filled slots, the next inserted key fills that slot with probability (i+1)/m, so long runs tend to grow even longer!)

Hash-Search with Open Addressing
An example of a function on a hash table with open addressing: searching in the hash table T
- Note: the index probed at the i-th probe is h(k,i)
- Because each key has a unique probe sequence, the sequence followed when searching will be the same as the one followed when inserting the element
How would this work with deleting from the table? What if deleting an element were simply replacing it by NIL in T?
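Linear-probing insert and search can be sketched as follows (a Java sketch with names of my choosing; deletion is deliberately left out, since the slide's closing question is about why NIL-ing out a slot would break search):

```java
// Linear probing with h'(k) = k mod m and h(k, i) = (h'(k) + i) mod m.
// Insert and search only; deletion is the slide's question to think about.
public class LinearProbing {
    private final Integer[] T;   // each slot is either NIL (null) or a key

    LinearProbing(int m) { T = new Integer[m]; }

    private int h(int k, int i) {
        return (Math.floorMod(k, T.length) + i) % T.length;
    }

    // Probe h(k,0), h(k,1), ... until an open slot is found
    boolean insert(int k) {
        for (int i = 0; i < T.length; i++) {
            int j = h(k, i);
            if (T[j] == null) { T[j] = k; return true; }
        }
        return false;            // table is full: every slot was examined
    }

    // Search follows the same probe sequence as insert did for this key;
    // reaching an empty slot proves the key is not in the table
    boolean search(int k) {
        for (int i = 0; i < T.length; i++) {
            int j = h(k, i);
            if (T[j] == null) return false;
            if (T[j] == k) return true;
        }
        return false;
    }
}
```

With m = 7, inserting 10 and then 17 (both hash to slot 3) places 17 in slot 4 on its second probe; searching for either key retraces exactly those probes.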