Hash Table Ric Glassey glassey@kth.se
Overview: Hash Table
Aim: Describe the map abstract data type with efficient insertion, deletion and search operations
Motivation: List data structures are divided by their underlying implementation, and combining their respective best properties is desirable
Key concepts:
- Maps and hash tables
- Hashing and compression
- Collisions and chaining
- Load and efficiency
THE MISSING LIST
B & C search the Web
B: I'd like to search for cats...
B: Hej, C, what's the IP address for Google?
C: Why, it's 216.58.209.142...
C: Or type google in the search box
Not a user-friendly system; rather:
- We need a simple addressing scheme for websites (URL)
- We want the URL to reliably map to an IP address
In general, mapping some arbitrary key k to some value v is a useful construct for many applications
So far the only keys used have been integers
List Limitations
Recall that behind the List abstract data type are two implementations (array and linked list), each with advantages and disadvantages

Operation   Array   Linked List
Search*     O(1)    O(n)
Insert      O(n)    O(1)
Delete      O(n)    O(1)

* Index-based retrieval
Note: assumes a doubly linked list
Array: slow to update. Linked list: slow to search.
Desirable Properties
There are many applications that require both efficient search and efficient update:
- Cheap to insert and delete items
- Fast to search for items
This leads to a classic engineer's dilemma: Cheap, Fast, Simple. You can only pick two :(
(perhaps combined with Reliable, Secure, or some other property)
MAP ABSTRACT DATA TYPE
Abstract data type: Map
- Efficiently stores and retrieves values, based upon a unique search key
- A map is said to store key-value pairs (k, v)
- Keys must be unique, such that k maps only to v
- A key acts like an index
- A key can be of arbitrary type (not just numeric)
Examples:
Key "Blue" maps to value RGB(0, 0, 255)
Key "Red" maps to value RGB(255, 0, 0)
Key "Fuchsia" maps to value RGB(255, 0, 255)
Map Operations
Primary operations:
- Insert(key, value)
- Delete(key)
- Search(key)
Depending upon the specific implementation, many more operations are included (see later), mostly utility functions
Implementations are also commonly referred to as: Hash Table, Dictionary, Associative Array
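The three primary operations can be written down as a small Java interface. This is only a sketch: the names `SimpleMap`, `HashMapBacked` and `MapAdtDemo` are invented for illustration, and the implementation simply delegates to `java.util.HashMap` rather than showing how a hash table works internally.

```java
import java.util.HashMap;

/** Demo of the three primary map operations; all names here are illustrative. */
class MapAdtDemo {

    /** Hypothetical interface naming the primary operations of the Map ADT. */
    interface SimpleMap<K, V> {
        void insert(K key, V value); // store (key, value), replacing any old value
        V delete(K key);             // remove the pair; returns the old value or null
        V search(K key);             // returns the value for key, or null if absent
    }

    /** Simplest possible implementation: delegate to java.util.HashMap. */
    static class HashMapBacked<K, V> implements SimpleMap<K, V> {
        private final HashMap<K, V> backing = new HashMap<>();

        @Override public void insert(K key, V value) { backing.put(key, value); }
        @Override public V delete(K key) { return backing.remove(key); }
        @Override public V search(K key) { return backing.get(key); }
    }

    public static void main(String[] args) {
        SimpleMap<String, String> colours = new HashMapBacked<>();
        colours.insert("Blue", "RGB(0, 0, 255)");
        colours.insert("Red", "RGB(255, 0, 0)");
        colours.insert("Fuchsia", "RGB(255, 0, 255)");

        System.out.println(colours.search("Red")); // RGB(255, 0, 0)
        colours.delete("Red");
        System.out.println(colours.search("Red")); // null
    }
}
```

Because keys are unique, inserting an existing key replaces its value instead of adding a second pair.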
Simple Direct Addressing
[Figure: universe of keys U = {0, ..., 9}, a smaller set of actual keys K, and a table T with slots 0 to 9 in which each actual key indexes its own entry (key, value)]
Essentially acting as a random access array
However, what happens if all keys in U have to be anticipated and exist in T?
Accommodating all keys?
[Figure: universe of keys U, actual keys K, and table T]
1) If the set of keys in U becomes large
2) Whilst the set of actual keys K in use is relatively small
3) The amount of wasted space in T becomes a resource concern
Direct addressing is not a space-efficient approach
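A direct-address table can be sketched in a few lines of Java, which also makes the wasted space easy to count. The class name `DirectAddressTable`, the helper `unusedSlots` and the example keys are assumptions made for illustration.

```java
/** Direct addressing: the key itself is the array index (illustrative sketch). */
class DirectAddressTable {
    private final String[] slots; // slots[k] holds the value for key k, or null

    DirectAddressTable(int universeSize) {
        slots = new String[universeSize]; // one slot per possible key in U
    }

    void insert(int key, String value) { slots[key] = value; } // O(1)
    void delete(int key) { slots[key] = null; }                // O(1)
    String search(int key) { return slots[key]; }              // O(1)

    /** Counts slots that are reserved but hold no entry. */
    int unusedSlots() {
        int unused = 0;
        for (String s : slots) {
            if (s == null) unused++;
        }
        return unused;
    }

    public static void main(String[] args) {
        DirectAddressTable t = new DirectAddressTable(10); // U = {0, ..., 9}
        t.insert(3, "v3");
        t.insert(5, "v5");
        t.insert(8, "v8");
        System.out.println(t.search(5));     // v5
        System.out.println(t.unusedSlots()); // 7
    }
}
```

With a universe of all 32-bit integers the same design would reserve over four billion slots, however few keys are actually stored.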
HASHING AND COMPRESSION
Hashing
"Chopping & mixing"
Ideally we want to:
- Avoid direct addressing
- Maintain a more space-efficient table T of size N
- Allow arbitrary types as keys (not just integers)
We can design some function h(k) that converts a key k into an integer i in the range [0, N-1], which indexes a position in T
Hash function: key -> hash code -> compression function -> index in [0, N-1]
Hash Code
Aim is to generate an integer from the input key
- No need to be bounded by the table size
- Can be negative
- But should avoid collisions, i.e. h(k1) == h(k2) for distinct keys k1 and k2, as much as possible
Bit representation strategy
- Applies if the data type uses no more bits than the hash code integer
- e.g. Java uses 32-bit hash codes, so byte, char, int and short can simply be cast to int, so h(13) = 0...1101
Other schemes
- Polynomial hash codes
- Cyclic-shift hash codes
- Override Java's hashCode() method and make your own
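The two named schemes can be sketched for strings as follows. The class and method names are illustrative, and the constants used in `main` (31 and a 5-bit shift) are common choices, not values taken from the slides. With a = 31 the polynomial scheme matches the formula documented for Java's own `String.hashCode()`.

```java
/** Two hash-code schemes for strings (illustrative sketch). */
class HashCodes {

    /**
     * Polynomial hash code: s[0]*a^(n-1) + s[1]*a^(n-2) + ... + s[n-1],
     * evaluated with Horner's rule. Overflow simply wraps around in int.
     */
    static int polynomial(String s, int a) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = a * h + s.charAt(i);
        }
        return h;
    }

    /** Cyclic-shift hash code: rotate the running value left, then add the next char. */
    static int cyclicShift(String s, int shift) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = Integer.rotateLeft(h, shift);
            h += s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        // Character order matters, so anagrams get different codes.
        System.out.println(polynomial("stop", 31) != polynomial("pots", 31)); // true
        // Same formula as String.hashCode() when a = 31.
        System.out.println(polynomial("stop", 31) == "stop".hashCode());      // true
        System.out.println(cyclicShift("ab", 5));                              // 3202
    }
}
```

Neither result is bounded by a table size and both can be negative, which is why a compression step is still required.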
Compression Function
A hash code may not lie within the bounds [0, N-1] of a table with size N, so it needs to be converted to fall within this range. A good compression function should also seek to minimise the number of collisions.
Division method
- i mod N
- Simple approach, but suffers from repeated patterns of hash codes being copied through to hash values
MAD method (Multiply-Add-and-Divide)
- [(ai + b) mod p] mod N
- p is a prime > N
- a, b are random integers from [0, p-1], with a > 0
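Both compression methods are short enough to sketch directly. The class name `Compression` and the fixed parameters in `main` (N = 8, p = 11, a = 3, b = 4) are assumptions chosen so the output is reproducible; in practice a and b would be drawn at random as described above. `Math.floorMod` is used instead of `%` so that negative hash codes still land in [0, N-1].

```java
/** Division and MAD compression functions (illustrative sketch). */
class Compression {

    /** Division method: i mod N, kept non-negative for negative hash codes. */
    static int division(int hashCode, int n) {
        return Math.floorMod(hashCode, n);
    }

    /** MAD method: [(a*i + b) mod p] mod N, with p a prime larger than N. */
    static int mad(int hashCode, long a, long b, long p, int n) {
        long scaled = Math.floorMod(a * hashCode + b, p); // always in [0, p-1]
        return (int) (scaled % n);
    }

    public static void main(String[] args) {
        int n = 8;
        // Hash codes that are all multiples of N: a repeated pattern.
        int[] codes = {8, 16, 24, 32};
        for (int i : codes) {
            System.out.println(i + ": division=" + division(i, n)
                    + " mad=" + mad(i, 3, 4, 11, n));
        }
        // 8: division=0 mad=6
        // 16: division=0 mad=0
        // 24: division=0 mad=2
        // 32: division=0 mad=1
    }
}
```

The division method sends every multiple of N to slot 0, while the MAD method breaks up the pattern.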
COLLISIONS AND CHAINING
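Managing Collisions
Collisions are a consequence of using hash functions: eventually some pair of distinct keys will share a slot, e.g. h(k2) == h(k5)
[Figure: actual keys k1, k2, k4, k5, k6 from the universe of keys U are hashed into table T; h(k1), h(k4) and h(k6) occupy separate slots, while h(k2) == h(k5) land in the same slot]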
Separate Chaining
To deal with collisions, we can simply extend the capacity of a slot to have its own doubly-linked list (DL-List)
[Figure: table T in which each occupied slot points to a chain of the keys that hash to it, e.g. k1 and k4 share a chain, as do k6 and k8; empty slots hold a null pointer (/)]
Where collisions occur, use a doubly-linked list
Why a DL-List?
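A minimal sketch of separate chaining follows, assuming non-null keys and a fixed number of slots. The class `ChainedHashMap` is invented for illustration; each bucket is a `java.util.LinkedList`, which is doubly linked, and the division method is used for compression.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;

/** Hash map with separate chaining (illustrative sketch, fixed capacity). */
class ChainedHashMap<K, V> {

    private static final class Entry<K, V> {
        final K key;
        V value;
        Entry(K key, V value) { this.key = key; this.value = value; }
    }

    private final List<LinkedList<Entry<K, V>>> slots = new ArrayList<>();
    private int size;

    ChainedHashMap(int capacity) {
        for (int i = 0; i < capacity; i++) slots.add(new LinkedList<>());
    }

    /** Hash code followed by division-method compression. */
    private LinkedList<Entry<K, V>> bucketFor(K key) {
        return slots.get(Math.floorMod(key.hashCode(), slots.size()));
    }

    /** Inserts or overwrites; returns the previous value or null. */
    V put(K key, V value) {
        LinkedList<Entry<K, V>> bucket = bucketFor(key);
        for (Entry<K, V> e : bucket) {
            if (e.key.equals(key)) { V old = e.value; e.value = value; return old; }
        }
        bucket.addFirst(new Entry<>(key, value)); // colliding keys share the chain
        size++;
        return null;
    }

    V get(K key) {
        for (Entry<K, V> e : bucketFor(key)) {
            if (e.key.equals(key)) return e.value;
        }
        return null;
    }

    /** Removes the entry for key; returns its value or null if absent. */
    V remove(K key) {
        Iterator<Entry<K, V>> it = bucketFor(key).iterator();
        while (it.hasNext()) {
            Entry<K, V> e = it.next();
            if (e.key.equals(key)) { it.remove(); size--; return e.value; }
        }
        return null;
    }

    int size() { return size; }

    public static void main(String[] args) {
        ChainedHashMap<Integer, String> m = new ChainedHashMap<>(8);
        m.put(2, "two");
        m.put(10, "ten"); // 10 mod 8 == 2, so this collides with key 2
        System.out.println(m.get(2));  // two
        System.out.println(m.get(10)); // ten
    }
}
```

Both colliding keys remain retrievable because a lookup walks the chain and compares keys with `equals`.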
Back to lists?
- Ideally, the size of a bucket should never become too large
- Operations within a bucket are proportional to its size
  - Insert and Remove on the list itself are still O(1)
  - Search within a bucket of n entries is O(n)
- Pathological case: only one slot is active, with a bucket containing all entries in the hash table :(
- As more collisions occur, the load on the table increases and efficiency begins to decrease
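The pathological case is easy to reproduce by counting how many keys land in each slot. In this sketch the helper `longestChain` and the constant hash function are made up for illustration; it compares a hash code that spreads keys evenly with one that sends every key to the same slot.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToIntFunction;

/** Measures the longest bucket produced by a given hash code (illustrative sketch). */
class BucketSizes {

    /** Returns the size of the fullest bucket in a table with n slots. */
    static <K> int longestChain(Iterable<K> keys, int n, ToIntFunction<K> hashCode) {
        int[] counts = new int[n];
        int longest = 0;
        for (K key : keys) {
            int slot = Math.floorMod(hashCode.applyAsInt(key), n);
            longest = Math.max(longest, ++counts[slot]);
        }
        return longest;
    }

    public static void main(String[] args) {
        List<Integer> keys = new ArrayList<>();
        for (int i = 0; i < 16; i++) keys.add(i);

        // Integer's own hash code spreads 16 keys evenly over 8 slots.
        System.out.println(longestChain(keys, 8, k -> k.hashCode())); // 2
        // A constant hash code puts every key in one bucket.
        System.out.println(longestChain(keys, 8, k -> 42));           // 16
    }
}
```

In the second case a search degenerates into scanning a single list holding all 16 entries.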
LOAD AND SIZE
Load Factor
Simple measure of health:
α = number of entries (n) / number of slots (N)
Examples:
α = n/N = 3/8 = 0.375
α = n/N = 8/8 = 1.0
As α → 1, what problems can we expect to occur?
What is the solution?
Resizing
To maintain efficiency and limit collisions, we set a threshold of α < 1, and resize the table once it is reached
- Use a dynamic table that doubles its size once the threshold is reached
- Then, rehash all keys*
[Figure: with k1, k2, k3, k4 stored, α = n/N = 4/8 = 0.5 and the threshold is reached; the table is doubled and k1, k2, k3, k4 are rehashed into it]
* We may only have to re-compress
We may also want to shrink or contract the table... why?
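The doubling policy can be sketched with a small chained table of string keys. The class `ResizingTable` and the threshold value 0.5 are assumptions for illustration, the latter chosen to match the 4/8 example. Only the compression step is repeated during a resize, since each key's hash code does not depend on the table size.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

/** Chained table of keys that doubles when the load factor hits a threshold (sketch). */
class ResizingTable {
    private static final double MAX_LOAD = 0.5; // assumed threshold for alpha

    private List<LinkedList<String>> slots;
    private int n; // number of stored entries

    ResizingTable(int capacity) { slots = newSlots(capacity); }

    private static List<LinkedList<String>> newSlots(int capacity) {
        List<LinkedList<String>> fresh = new ArrayList<>();
        for (int i = 0; i < capacity; i++) fresh.add(new LinkedList<>());
        return fresh;
    }

    private static int indexFor(String key, int capacity) {
        return Math.floorMod(key.hashCode(), capacity); // re-compress only
    }

    int capacity() { return slots.size(); }
    double loadFactor() { return (double) n / slots.size(); } // alpha = n / N

    boolean contains(String key) {
        return slots.get(indexFor(key, slots.size())).contains(key);
    }

    /** Adds key if absent; doubles the table once the threshold is reached. */
    boolean add(String key) {
        LinkedList<String> bucket = slots.get(indexFor(key, slots.size()));
        if (bucket.contains(key)) return false;
        bucket.add(key);
        n++;
        if (loadFactor() >= MAX_LOAD) resize(2 * slots.size());
        return true;
    }

    /** Moves every key into a new table of the given capacity. */
    private void resize(int newCapacity) {
        List<LinkedList<String>> old = slots;
        slots = newSlots(newCapacity);
        for (LinkedList<String> bucket : old) {
            for (String key : bucket) {
                slots.get(indexFor(key, newCapacity)).add(key);
            }
        }
    }

    public static void main(String[] args) {
        ResizingTable t = new ResizingTable(8);
        t.add("k1"); t.add("k2"); t.add("k3");
        System.out.println(t.capacity() + " " + t.loadFactor()); // 8 0.375
        t.add("k4"); // alpha reaches 4/8 = 0.5, so the table doubles
        System.out.println(t.capacity() + " " + t.loadFactor()); // 16 0.25
    }
}
```

Shrinking would follow the same pattern with a lower threshold and a smaller new capacity, which reclaims space after many deletions.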
PERFORMANCE
Summary of Hash Table Performance

Operation   Array   Linked List   Hash Table (average)   Hash Table (worst)
Search*     O(1)    O(n)          O(1)                   O(n)
Insert      O(n)    O(1)          O(1)                   O(n)
Delete      O(n)    O(1)          O(1)                   O(n)

* Index or key based search
Note: assumes a doubly linked list
JAVA'S MAP INTERFACE & IMPLEMENTATIONS
Java's Map Interface
Subset of operations include:
    boolean containsKey(Object key)
    boolean containsValue(Object value)
    V get(Object key)
    V put(K key, V value)
    V remove(Object key)
    int size()    (n, the number of (k, v) mappings)
    Set<K> keySet()
    Collection<V> values()
    Set<Map.Entry<K, V>> entrySet()
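The operations listed above can be exercised in a short program. The class name `MapInterfaceDemo` and the helper `colours` are illustrative; `TreeMap` is used here only because it keeps keys sorted, which makes the printed output predictable.

```java
import java.util.Map;
import java.util.TreeMap;

/** Exercises a subset of java.util.Map operations (illustrative sketch). */
class MapInterfaceDemo {

    /** Builds a small map from colour names to RGB strings. */
    static Map<String, String> colours() {
        Map<String, String> m = new TreeMap<>(); // sorted by key
        m.put("Blue", "RGB(0, 0, 255)");
        m.put("Red", "RGB(255, 0, 0)");
        m.put("Fuchsia", "RGB(255, 0, 255)");
        return m;
    }

    public static void main(String[] args) {
        Map<String, String> m = colours();
        System.out.println(m.containsKey("Red"));            // true
        System.out.println(m.containsValue("RGB(0, 0, 0)")); // false
        System.out.println(m.get("Fuchsia"));                // RGB(255, 0, 255)
        System.out.println(m.remove("Red"));                 // RGB(255, 0, 0)
        System.out.println(m.size());                        // 2
        System.out.println(m.keySet());                      // [Blue, Fuchsia]

        // entrySet() gives access to each (k, v) pair.
        for (Map.Entry<String, String> e : m.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

Swapping `TreeMap` for `HashMap` changes only the iteration order, since both implement the same `Map` interface.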
Implementation and Usage of Map
e.g. Hashtable, HashMap, TreeMap

    import java.util.*;

    public class Freq {
        public static void main(String[] args) {
            Map<String, Integer> m = new HashMap<String, Integer>();

            // Initialize frequency table from command line
            for (String a : args) {
                Integer freq = m.get(a);
                m.put(a, (freq == null) ? 1 : freq + 1);
            }

            // Report the number of distinct words and their counts
            System.out.println(m.size() + " distinct words:");
            System.out.println(m);
        }
    }

http://docs.oracle.com/javase/tutorial/collections/interfaces/map.html
Readings
Algorithms and Data Structures
- Stefan Nilsson's text on Hash Tables
- http://www.nada.kth.se/~snilsson/algoritmer/hashtabell/
Introduction to Algorithms, 3rd Edition
- Chapter 11: Hash Tables
- Full text available via KTH Library
- http://kth-primo.hosted.exlibrisgroup.com/ KTH:KTH_SFX2560000000068328
Data Structures and Algorithms in Java, 6th Edition, Goodrich et al.
- Chapter 10: Maps, Hash Tables and Skip Lists
- Full text available via KTH Library
- http://kth-primo.hosted.exlibrisgroup.com/ KTH:KTH_SFX3710000000333147