Hash Tables CS 321 Spring 2015
Today's Topics: HW1 available on web site. PA 2 (CPU Scheduling) due Fri Feb 13th. PA 1 grades out today; all codes ran. Max-Heap Methods. Full Radix Sort. Hash Tables.
Max-Heap Methods: Max(), Build-MaxHeap(), Max-Heapify(), Insert(), replacekey(i,key), parheapify() (useful convenience method), findkey(key).
Full Radix Sort: A combination sort that runs in linear time. Uses multiple passes to sort. Running time = O(p*n), where p is the number of passes.
Counting Sort
Running Time of Radix Sort: O((w/d) * (n + b^d)), where w = number of digits in the numbers, d = number of digits handled in each counting sort pass, and b = the base of those digits. Can be done for any base: for example, base 100.
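The passes described above can be sketched as a least-significant-digit radix sort built on a stable counting sort. This is a minimal illustration, assuming non-negative int keys and one digit per pass; the class and method names are hypothetical and not from the course code.

```java
class RadixSortSketch {
    // One stable counting-sort pass on the digit selected by exp (1, base, base^2, ...).
    static void countingPass(int[] a, int exp, int base) {
        int[] count = new int[base];
        int[] out = new int[a.length];
        for (int x : a) count[(x / exp) % base]++;
        // Prefix sums: count[d] becomes one past the last output slot for digit d.
        for (int d = 1; d < base; d++) count[d] += count[d - 1];
        // Walk backwards so keys with equal digits keep their relative order.
        for (int i = a.length - 1; i >= 0; i--) {
            int d = (a[i] / exp) % base;
            out[--count[d]] = a[i];
        }
        System.arraycopy(out, 0, a, 0, a.length);
    }

    // Sorts non-negative ints (assumption). The number of passes p is the
    // number of base-`base` digits in the largest key.
    static void radixSort(int[] a, int base) {
        int max = 0;
        for (int x : a) max = Math.max(max, x);
        for (long exp = 1; max / exp > 0; exp *= base) {
            countingPass(a, (int) exp, base);
        }
    }
}
```

Calling radixSort(a, 100) instead of radixSort(a, 10) roughly halves the number of passes at the cost of a larger count array.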
Trouble with Arrays: What are some troubles with arrays? Arrays can only store data by a numeric index. Arrays can waste a lot of space, but they are fast. How can I store data by a more general key? What is a real-world example of an object sorted by a non-numeric key with data associated? Why? Example: keep track of equipment for a player in a game. Shirt: diamond armor. Legs: chainmail. Head: gold helmet.
Specific Goals for Our Solution: Lookups should be very quick: O(1) if at all possible, or as close as possible, with as few steps as possible to find an item. Inserts and deletes should be fast, like arrays. We will assume that objects use unique keys. A key may be a single value or may be created from multiple values. We will only consider single-value keys.
Common Solution: Hash Table. A data structure that holds values indexed by keys. Keys are usually strings. The location of the value for a given key is found by passing the key to a hash function that returns an index to the correct value. Hash tables are often called dictionaries, and also often called tables of key/value pairs. Standard implementations are extremely efficient: close to O(1) for all operations.
What About Other Data Structures? Must have Insert(), Delete() and Find(). Arrays: can accomplish these in O(1) time but are not space efficient (assumes we leave empty space for keys not currently in the dictionary). Binary search trees: can accomplish these in O(log n) time when balanced and are space efficient, but we want faster. Hash tables: with constraints, ~O(1) for Insert/Delete/Find.
Example Array: Use the SSN for the key. Use an array with range 0 to 999,999,999 to hold the person objects. Using the SSN as a key, you have O(1) access to any person object. Unfortunately, the number of active keys (Social Security Numbers) is much less than the array size (1 billion entries). Est. US population, Oct. 20th 2004: 294,564,209. Over 60% of the array would be unused. But it would be fast and fit in memory.
Hash Table Solution: Hashing on your SSN yields an index into a table. The hash function must choose a good index. Very useful when ID numbers are widely spread out and when you don't need access in ID order. Fits our SSN example.
Hash Table Abstract Data Type: Core methods for a hash table: Insert(key,value), ~O(1): add key and value. Delete(key), ~O(1): remove key and value. Search/Find(key), ~O(1): find key and value in table. Critical internal method: Hash(key), O(1): compute an index for the given key.
Hash Tables Conceptual View. [Figure: a table of buckets with hash values/indexes 0 to 7. Index = hash(key); for example, 7 = hash(15). Objects shown: Obj1 key=15, Obj2 key=30, Obj3 key=4, Obj4 key=2, Obj5 key=1.]
Hash Index/Value: A hash value or hash index is used to index the hash table (array). A hash function takes a key and returns a hash value/index. The hash index is an integer (to index an array). The key is a specific value associated with a specific object being stored in the hash table. It is important that the key remain constant for the lifetime of the object.
Hash Functions & insert(): Usage summary:
    int hashvalue = hashfunction(int key);
    or hashvalue = hashfunction(String key);
    or hashvalue = hashfunction(itemtype item);
Insert method:
    public void insert(int key, itemtype item) {
        int hashvalue = hashfunction(key);  // compute the index for this key
        table[hashvalue] = item;            // store the item at that index
    }
Hash Function Requirements: You want a hash function/algorithm that is fast and distributes keys throughout the table. Hash functions can use as input: integer key values, string key values, multipart key values, multipart fields, and/or multiple fields.
Simple Hash Function: Mod. Stands for modulo: the remainder of X/Y in integer arithmetic. Example mod results: 8 mod 5 = 3, 9 mod 5 = 4, 10 mod 5 = 0, 15 mod 5 = 0. Key mod M = 0 if key = M*c. What if M is prime and keys != M*c?
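The mod results above can be reproduced in Java. One detail worth hedging: Java's % operator can return a negative remainder for a negative key, so this sketch uses Math.floorMod to keep the result in the range 0 to M-1. The class and method names are illustrative only.

```java
class ModSketch {
    // key mod m for a table of size m; floorMod keeps the result in 0..m-1
    // even when key is negative (plain % would give a negative remainder).
    static int mod(int key, int m) {
        return Math.floorMod(key, m);
    }
}
```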
Hash Tables: Insert Example. For example, if we hash keys 0 to 1000 into a hash table with 5 entries and use h(key) = key mod 5, we get the following sequence of events:
    entry   after Insert 2   after Insert 21   after Insert 34
    0
    1                        21                21
    2       2                2                 2
    3
    4                                          34
Insert 54: there is a collision at array entry #4???
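The insert sequence above can be traced with a few lines of code. This sketch only detects the first collision and does not resolve it; it assumes non-negative keys, and the names are hypothetical.

```java
class NaiveInsertSketch {
    // Returns the table index of the first insert that lands on an occupied
    // slot, or -1 if every key found an empty slot.
    static int firstCollision(int[] keys, int m) {
        boolean[] used = new boolean[m];
        for (int key : keys) {
            int index = key % m;          // h(key) = key mod m, keys assumed non-negative
            if (used[index]) return index;
            used[index] = true;
        }
        return -1;
    }
}
```

With keys 2, 21, 34, 54 and m = 5, the first three keys land in entries 2, 1 and 4, and the fourth collides at entry 4.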
Dealing with Collisions: A problem arises when two keys hash to the same array entry; this is called a collision. There are two ways to resolve collisions. Hashing with chaining (a.k.a. separate chaining): every hash table entry contains a pointer to a linked list of keys that hash to the same entry. Hashing with open addressing: every hash table entry contains only one key. If a new key hashes to a table entry that is filled, systematically examine other table entries until you find an empty entry in which to place the new key.
Hashing with Chaining: The problem is that keys 34 and 54 hash to the same entry (4). We solve this collision by placing all keys that hash to the same hash table entry in a chain (linked list) or bucket (array) pointed to by this entry:
    After Insert 54:   entry 1: 21        entry 2: 2    entry 4: 54, 34 (chain)
    After Insert 101:  entry 1: 21, 101   entry 2: 2    entry 4: 54, 34 (chain)
Hashing with Chaining: What is the running time for insert/search/delete? Insert: it takes O(1) time to compute the hash function and insert at the head of the linked list. Search: proportional to the maximum linked list length. Delete: same as search. Therefore, in the unfortunate event that we have a bad hash function, all n keys may hash to the same table entry, giving an O(n) run time! So how can we create a good hash function?
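A minimal sketch of hashing with chaining, assuming integer keys, h(key) = key mod m, and no stored data besides the key. The class name and the chainLength helper are hypothetical, added only to make the costs above visible.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

class ChainedTableSketch {
    // One linked list (chain) per table entry.
    private final List<LinkedList<Integer>> table = new ArrayList<>();

    ChainedTableSketch(int m) {
        for (int i = 0; i < m; i++) table.add(new LinkedList<>());
    }

    private int hash(int key) { return Math.floorMod(key, table.size()); }

    // O(1): compute the hash and insert at the head of the chain.
    // Keys are assumed unique, so no duplicate check is made.
    void insert(int key) { table.get(hash(key)).addFirst(key); }

    // Proportional to the length of the chain for this entry.
    boolean search(int key) { return table.get(hash(key)).contains(key); }

    // Same cost as search; returns true if the key was present.
    boolean delete(int key) { return table.get(hash(key)).remove(Integer.valueOf(key)); }

    int chainLength(int index) { return table.get(index).size(); }
}
```

If every key hashed to the same entry, chainLength for that entry would be n and search would degrade to O(n), which is the bad case described above.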
Choosing a Hash Function 1: Uniform hashing = keys distributed evenly throughout the table. Choosing a good hash function requires taking into account the kind of data that will be used. The statistics of the key distribution need to be accounted for. E.g., choosing the first letter of a last name will likely cause lots of collisions, depending on the nationality of the population. Many programming systems have hash functions built in.
Choosing a Hash Function 2: Division/modulo method: key mod m, where m is the array size; in general, it should be prime. Multiplication method: floor(((key * someFraction) mod 1) * arraySize), where someFraction is typically 0.618. Java HashMap method: create a hash by performing a series of shifts, adds, and xors on the key; index = hash mod arraySize.
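The division and multiplication methods can be sketched as follows. This assumes non-negative integer keys for the multiplication method and uses 0.6180339887 as a more precise form of the 0.618 fraction; the names are illustrative. The Java HashMap method is not reproduced here.

```java
class HashMethodsSketch {
    // Division/modulo method: key mod m, where m is the table size.
    static int divisionHash(int key, int m) {
        return Math.floorMod(key, m);
    }

    // Multiplication method: floor(((key * a) mod 1) * m).
    // Assumes key >= 0 so the fractional part is non-negative.
    static int multiplicationHash(int key, int m) {
        double a = 0.6180339887;           // about (sqrt(5) - 1) / 2
        double frac = (key * a) % 1.0;     // fractional part of key * a
        return (int) Math.floor(frac * m);
    }
}
```

For example, with m = 10 the keys 1, 2 and 3 give fractional parts of about 0.618, 0.236 and 0.854, so they land at indexes 6, 2 and 8.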
Prime Number Distribution: For example, assume the keys (key values) are multiples of 5 (5, 10, 15, 20, 25, ...), evenly distributed from 5 to 245, and M (the divisor) is 7. Then the hash values will be evenly distributed from 0 to 6 for the keys (see table). If M was 5, then you would have what kind of distribution? hash value = key mod m (m is typically the table size).
    Key mod M     Total
    0             7
    1             7
    2             7
    3             7
    4             7
    5             7
    6             7
    Grand Total   49
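The table above can be regenerated, and the M = 5 question explored, with a short counting loop. The class and method names are hypothetical.

```java
class ModDistributionSketch {
    // Counts how many multiples of 5 in the range 5..245 land on each
    // hash value key mod m. Index i of the result is the count for value i.
    static int[] counts(int m) {
        int[] c = new int[m];
        for (int key = 5; key <= 245; key += 5) c[key % m]++;
        return c;
    }
}
```

With m = 7 each of the seven hash values receives 7 keys. With m = 5 every key is a multiple of the divisor, so all 49 keys land on hash value 0.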
Choosing a Hash Function 3: If keys are non-random (e.g., part numbers), use all of the data to contribute to the hash function to get a better distribution. Consider folding: sum the natural (or arbitrary) groups of digits in the key. Don't use redundant or non-data fields (e.g., checksum values). Do not use information that might change! Analyze your expected key values (or some representative subset) to make sure your hash function gives a good distribution!
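Folding can be sketched as follows: split the key into groups of decimal digits, sum the groups, and reduce the sum to a table index. The group size and the final mod step are assumptions made for this illustration, and the names are hypothetical.

```java
class FoldingSketch {
    // Folds a non-negative key by summing groups of groupDigits decimal
    // digits (taken from the right), then reduces the sum mod m.
    static int foldHash(long key, int groupDigits, int m) {
        long chunk = 1;
        for (int i = 0; i < groupDigits; i++) chunk *= 10;
        long sum = 0;
        while (key > 0) {
            sum += key % chunk;   // lowest remaining group of digits
            key /= chunk;         // drop that group
        }
        return (int) (sum % m);
    }
}
```

For key 123456789 with groups of 3 digits, the groups are 789, 456 and 123, which sum to 1368 before the mod step.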
Hashing with Open Addressing: So far we have studied hashing with chaining, using a list to store the items that hash to the same location. Another option is to store all the items (references to single items) directly in the table. In open addressing, collisions are resolved by systematically examining other table indexes i_0, i_1, i_2, ... until an empty slot is located.
Hash Tables Open Addressing. [Figure: a table with hash values/indexes 0 to 7, using index = key mod 8. Objects shown: Obj1 key=15, Obj2 key=28, Obj3 key=4, Obj4 key=2, Obj5 key=1. Keys 4 and 28 both hash to index 4.]
Open Addressing: The key is first mapped to an array cell using the hash function (e.g., key % array-size). If there is a collision, find an available array cell. There are different algorithms to find (to probe for) the next array cell. Linear: H+1, H+2, H+3, ... until an empty slot is found. Quadratic: H+1*1, H+2*2, H+3*3, H+4*4, ... Double hashing: hash again with a different hash function.
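The linear and quadratic probe sequences can be written as small index functions, where h is the initial hash value H, i is the probe number (1, 2, 3, ...), and m is the table size. Wrapping with % m is an assumption added here so the index stays inside the array; the names are illustrative.

```java
class ProbeSequenceSketch {
    // i-th linear probe: H + i, wrapped to the table size.
    static int linear(int h, int i, int m) {
        return (h + i) % m;
    }

    // i-th quadratic probe: H + i*i, wrapped to the table size.
    static int quadratic(int h, int i, int m) {
        return (h + i * i) % m;
    }
}
```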
Probe Algorithms (Collision Resolution): Linear probing: choose the next available array cell. First try arrayindex = hash value + 1, then try arrayindex = hash value + 2. Be sure to wrap around the end of the array: arrayindex = (arrayindex + 1) % arraysize. Stop when you have tried all possible array indices. If the array is full, you need to throw an exception or, better yet, resize the array. Quadratic probing: a variation of linear probing that uses a more complex function to calculate the next cell to try.
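A minimal sketch of linear probing with wrap-around, assuming integer keys, h(key) = key mod m, and no deletions. It throws an exception when the array is full rather than resizing; the class name is hypothetical.

```java
class LinearProbingSketch {
    private final Integer[] table;   // null marks an empty slot

    LinearProbingSketch(int m) { table = new Integer[m]; }

    // Returns the index where key was stored; throws if every slot is occupied.
    int insert(int key) {
        int index = Math.floorMod(key, table.length);
        for (int tries = 0; tries < table.length; tries++) {
            if (table[index] == null) {
                table[index] = key;
                return index;
            }
            index = (index + 1) % table.length;   // wrap around the end of the array
        }
        throw new IllegalStateException("hash table is full");
    }

    // Returns the index holding key, or -1 if it is absent.
    // An empty slot ends the search, which is only valid without deletions.
    int find(int key) {
        int index = Math.floorMod(key, table.length);
        for (int tries = 0; tries < table.length; tries++) {
            if (table[index] == null) return -1;
            if (table[index] == key) return index;
            index = (index + 1) % table.length;
        }
        return -1;
    }
}
```

With a table of size 8, inserting keys 15, 1, 4, 2 and then 28 places 28 at index 5, because its home index 4 is already taken by key 4.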
Double Hashing: Apply a second hash function after the first. The second hash function, like the first, is dependent on the key. The secondary hash function must be different from the first and, obviously, must not generate a zero. Good algorithm: arrayindex = (arrayindex + stepsize) % arraysize; where stepsize = constant - (key % constant) and constant is a prime less than the array size.
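A minimal sketch of double hashing using the step-size formula stepsize = constant - (key % constant). It assumes non-negative integer keys and a prime table size, so that every step size visits all slots; the table size 11 and constant 7 used in the note below are hypothetical example values.

```java
class DoubleHashingSketch {
    // Secondary hash: always in the range 1..constant, so it is never zero.
    // Key is assumed non-negative.
    static int stepSize(int key, int constant) {
        return constant - (key % constant);
    }

    // Inserts key into table (null = empty slot). Returns the index used,
    // or -1 if no free slot was reached within table.length probes.
    static int insert(Integer[] table, int key, int constant) {
        int m = table.length;
        int index = key % m;                  // primary hash
        int step = stepSize(key, constant);   // secondary hash
        for (int tries = 0; tries < m; tries++) {
            if (table[index] == null) {
                table[index] = key;
                return index;
            }
            index = (index + step) % m;       // jump by the key-dependent step
        }
        return -1;
    }
}
```

With a table of size 11 and constant 7, keys 22, 33 and 44 all have home index 0 but different step sizes (6, 2 and 5), so 33 and 44 land at indexes 2 and 5 instead of piling up next to each other.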
Problems: Linear probing yields clusters. Quadratic probing yields secondary clusters. Double hashing can avoid both, depending on the secondary hash function.
The End