Dictionaries (Maps) - Hash tables

ADT Dictionary (or Map)
Has the following operations:
- INSERT: inserts a new element, associated to a unique value of a field (the key)
- SEARCH: searches for an element with a given value of the key; if it exists, it is returned
- DELETE: removes the element with the given key, if it exists
Uses of dictionaries
- Symbol table in a compiler
  - Key: name of the identifier
  - Values: type, context
- Citizens in a country
  - Key: social security number
  - Values: name, surname, age, address

Associative array
A dictionary could be easily implemented with an associative array (the index of a value is its key, instead of its position).
Ex:
- Citizens = { { jr50, john, red }, { bg40, bill, green }, ... }
- Citizens[ jr50 ] = { jr50, john, red }
Goal
Complexity of insert/search/delete:
- O(1) average case
- Θ(n) worst case

Hash tables
An implementation of associative arrays: an array containing the elements, where the address of an element is computed by a hash function in time O(1).
Ex:
- Hash( jr50 ) = 117: element (john, red) is in position 117 of the array
Associative array
[Figure: universe U of all keys (0..9); the used keys K = {2, 3, 5, 8} index directly into slots 0..9 of table T, each slot holding a key and a value]

Dictionary implemented with an associative array
- T: associative array; key[x]: the key of element x; x: the element (value)
- Search(T, key)
  - return T[key]
- Insert(T, x)
  - T[key[x]] ← x
- Delete(T, x)
  - T[key[x]] ← NIL
- Complexity O(1), memory O(|U|)
  - |U| = number of different values of the key
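As a sketch (not from the slides), the direct-address scheme above can be written in Python, assuming keys are small non-negative integers so they can index the array directly:

```python
class DirectAddressTable:
    """Direct-address table: one slot per possible key, so memory is O(|U|)."""

    def __init__(self, universe_size):
        self.slots = [None] * universe_size  # O(|U|) memory

    def insert(self, key, value):
        self.slots[key] = value              # O(1)

    def search(self, key):
        return self.slots[key]               # O(1)

    def delete(self, key):
        self.slots[key] = None               # O(1)
```

All three operations are a single array access, which is exactly why the memory cost O(|U|) becomes the problem discussed next.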
Assumptions
Two assumptions are needed:
- No two elements have the same key (keys are unique)
- The size of T equals the number of possible values of the key, |U|
  - This is critical: if U is large, the array is unfeasible
  - Ex: key = SSN, 10 characters over a 24-letter alphabet: |U| = 24^10 ≈ 10^13
  - But the citizens of a country are on the order of 10^7-10^9
- It is essential that the size of the array be O(|K|) and not O(|U|)

Hash tables
- A kind of associative array with size O(|K|) and not O(|U|)
- Insert/search/delete are O(1) on average
- However, the way of computing the index from the key must be different: a hash function
Hash function
- The hash table is an array of size m (m << |U|)
- The hash function h maps a key to a position in the array (an index)
- h: U → { 0, 1, ..., m-1 }
- Element x is stored in T[h(key[x])]

Hash function
[Figure: h maps keys k1, ..., k5 of U into the slots of T (size m); h(k2) = h(k5), a collision]
Collision
- A collision occurs when h(ki) = h(kj) and ki ≠ kj
- It is essential to:
  - Minimize the number of collisions
    - Depends on the hash function
  - Manage collisions

Example
The key is a string of characters.
Hash function: h(k) = ( Σ ci ) mod m, with
- ci = ASCII code of the i-th character of string k
- m = size (number of slots) of array T
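The example hash function above is one line of Python:

```python
def h(k, m):
    """Sum of the ASCII codes of the characters of k, modulo m."""
    return sum(ord(c) for c in k) % m
```

Running it on the strings of the next slide reproduces the collisions shown there (e.g. "paperino" and "paperoga" both land in slot 7 when m = 15).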
Ex (II)
m = 15. Collision between strings paperino and paperoga:
- h("pippo") = (112+105+112+112+111) mod 15 = 552 mod 15 = 12
- h("pluto") = (112+108+117+116+111) mod 15 = 564 mod 15 = 9
- h("paperino") = (112+97+112+101+114+105+110+111) mod 15 = 862 mod 15 = 7
- h("topolino") = (116+111+112+111+108+105+110+111) mod 15 = 884 mod 15 = 14
- h("paperoga") = (112+97+112+101+114+111+103+97) mod 15 = 847 mod 15 = 7

Ex (III)
m = 15.
- h("Mickey") = (77+105+99+107+101+121) mod 15 = 610 mod 15 = 10
- h("Minnie") = (77+105+110+110+105+101) mod 15 = 608 mod 15 = 8
- h("Donald") = (68+111+110+97+108+100) mod 15 = 594 mod 15 = 9
- h("Daisy") = (68+97+105+115+121) mod 15 = 506 mod 15 = 11
- h("foo") = (102+111+111) mod 15 = 324 mod 15 = 9
- h("bar") = (98+97+114) mod 15 = 309 mod 15 = 9
Collision between strings foo and bar (note that Donald also hashes to 9)
Collision mitigation
The best hash functions distribute the |K| elements as uniformly (randomly) as possible among the m positions available.
Typical strategies:
- pick m as a prime number
- manipulate the bits of k

Collision management
- Chaining
- Open addressing
Chaining (I)
Position i can contain more than one element. This can be implemented with a linked list.

Chaining (II)
[Figure: table T of size m; keys that collide, e.g. k2 and k5, are stored in a linked list attached to their slot]
Chaining (III)
- T[i] is a pointer to a list, initially NIL
- CHAINED-HASH-INSERT(T, x)
  - insert x at the head of list T[h(key[x])]
- CHAINED-HASH-SEARCH(T, k)
  - search for an element with key k in list T[h(k)]
- CHAINED-HASH-DELETE(T, x)
  - delete x from list T[h(key[x])]

Chaining - Complexity
- Assumption: unordered lists, singly linked
- Insert: O(1)
- Search: O(length of the list)
- Delete: O(length of the list)
  - Requires a search
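A minimal Python sketch of the three chained operations above, using Python lists as the chains and the character-sum hash from the earlier example (an assumption, since the slides leave h unspecified here):

```python
class ChainedHashTable:
    """Hash table with chaining; each slot holds a list of (key, value) pairs."""

    def __init__(self, m):
        self.m = m
        self.slots = [[] for _ in range(m)]

    def _h(self, key):
        # Assumed hash function: sum of character codes mod m (as in the example)
        return sum(ord(c) for c in key) % self.m

    def insert(self, key, value):
        # Head insertion: O(1), no search needed
        self.slots[self._h(key)].insert(0, (key, value))

    def search(self, key):
        # O(length of the chain)
        for k, v in self.slots[self._h(key)]:
            if k == key:
                return v
        return None

    def delete(self, key):
        # O(length of the chain): requires a search first
        chain = self.slots[self._h(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:
                del chain[i]
                return
```

Note that two colliding keys (such as "paperino" and "paperoga" with m = 15) simply end up in the same chain and remain individually retrievable.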
Search (hash + chaining) - complexity
We have:
- n: number of elements in hash table T
- m: size of hash table T
- α = n/m: load factor of hash table T
  - Normally α > 1
- What if m, n → ∞ (with the same α)?

Search (hash + chaining) - complexity (II)
Search:
- Worst case: a single linked list, not ordered
  - time to compute h(k) + time to traverse the list, Θ(n)
- Best case: depends on how uniformly h(k) distributes the elements
- Let's assume h(k) is capable of simple uniform hashing (it distributes the elements in a perfectly uniform way)
  - (this requires that the table grow with the elements, so that α remains constant)
Search (hash + chaining) - complexity (III)
Search:
- Time to compute h(k): O(1)
- Time to traverse the list: depends on the length of list T[h(k)], and on whether the element is found or not
- In both cases the complexity is Θ(1+α)
- Summing up: O(1) + Θ(1+α) = Θ(1+α), which is O(1) when α is kept constant

Open addressing
- T[i] can contain only one element
- In case of collision, another free cell is searched for: the next one, the one after that, etc.
- It must be α < 1
Hash-Insert
HASH-INSERT(T, k)
1  i ← 0
2  repeat j ← h(k, i)
3         if T[j] = NIL
4            then T[j] ← k
5                 return
6            else i ← i + 1
7  until i = m
8  error "hash table overflow"

Hash-Search
HASH-SEARCH(T, k)
1  i ← 0
2  repeat j ← h(k, i)
3         if T[j] = k
4            then return j
5         i ← i + 1
6  until T[j] = NIL or i = m
7  return NIL
Re-hash functions
- Linear probing
  - h(k, i) = (h′(k) + i) mod m
- Quadratic probing
  - h(k, i) = (h′(k) + c1·i + c2·i²) mod m
- Double hashing
  - h(k, i) = (h1(k) + i·h2(k)) mod m

Ex - insert
- m = 10 (cells numbered 1..10)
- open addressing with linear probing. Sequence of hash values:
- h(a)=5, h(b)=4, h(c)=9, h(d)=4, h(e)=8, h(f)=8, h(g)=10
Ex - insert (II)
Successive states of the table:
- a → cell 5
- b → cell 4
- c → cell 9
- d → h(d)=4 occupied, 5 occupied, placed in cell 6
- e → cell 8
- f → h(f)=8 occupied, 9 occupied, placed in cell 10
- g → h(g)=10 occupied, wraps around, placed in cell 1
Final table: 1=G, 4=B, 5=A, 6=D, 8=E, 9=C, 10=F

Ex - search (III)
Search:
- D (h(d)=4):
  - read 4, read 5, read 6: found
- G (h(g)=10):
  - read 10, read 1: found
- M (h(m)=4):
  - read 4, read 5, read 6, read 7: not found (cell 7 is empty)
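A Python sketch of this worked example (not from the slides): linear probing over cells numbered 1..10, with the slides' hash values hard-coded in a lookup table, since no formula for them is given:

```python
# Cells are numbered 1..M, as in the slides' example (index 0 unused).
M = 10

# The slides' hash values, hard-coded (an assumption for this sketch);
# "m" is the extra key searched for in the example, with h(m) = 4.
H = {"a": 5, "b": 4, "c": 9, "d": 4, "e": 8, "f": 8, "g": 10, "m": 4}

def make_table():
    return [None] * (M + 1)

def probe(h0, i):
    """Linear probing over 1-based cells: h0, h0+1, ..., wrapping from M to 1."""
    return (h0 - 1 + i) % M + 1

def insert(T, k):
    for i in range(M):
        j = probe(H[k], i)
        if T[j] is None:
            T[j] = k
            return j
    raise OverflowError("hash table overflow")

def search(T, k):
    for i in range(M):
        j = probe(H[k], i)
        if T[j] == k:
            return j          # found
        if T[j] is None:
            return None       # hit an empty cell: not in the table
    return None
```

Inserting a..g in order reproduces the final table of the slide (g wraps from cell 10 to cell 1), and searching for m stops at the empty cell 7.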
Delete
- Very complex, because a deletion changes the rehash/collision sequence
- In practice, open addressing is used only if no deletions are needed

Complexity
Assuming uniform hashing:
- The expected number of probes in an insert (or unsuccessful search) is 1/(1-α)
- The expected number of probes in a successful search is at most (1/α)·ln(1/(1-α)) + 1/α
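A small numeric check of the two bounds above (a sketch, not from the slides), showing how the expected number of probes blows up as α approaches 1:

```python
import math

def probes_unsuccessful(alpha):
    """Expected probes in an insert / unsuccessful search: 1/(1 - alpha)."""
    return 1 / (1 - alpha)

def probes_successful(alpha):
    """Bound on expected probes in a successful search:
    (1/alpha) * ln(1/(1 - alpha)) + 1/alpha."""
    return (1 / alpha) * math.log(1 / (1 - alpha)) + 1 / alpha

for a in (0.25, 0.5, 0.9):
    print(a, probes_unsuccessful(a), probes_successful(a))
```

At α = 0.5 an unsuccessful search costs 2 probes on average, but at α = 0.9 it already costs 10, which is why open addressing needs α well below 1.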
Hash functions

Uniform hashing
The best hash functions do uniform hashing: if all keys have the same probability, then every value of h(k) should also have the same probability:

P( h(k) = j ) = 1/m,  for j = 0, 1, ..., m-1
Keys are not uniform
However, keys often are not uniformly distributed (ex: the words of a language, or names and surnames), so:
- use all the characters
- amplify the differences

Keys as numbers
- Usually keys are strings of characters
- The easiest thing is to treat them as integers
  - Ex: "abc" becomes ('a'·256²) + ('b'·256) + 'c', using the ASCII codes of the characters
- However, with very long strings this is impractical, and variants have to be used
- In the following, the key is an integer
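The string-to-integer conversion above, written out in Python (each character is one base-256 digit):

```python
def key_to_int(s):
    """Interpret the string as an integer in base 256 (one byte per character)."""
    k = 0
    for c in s:
        k = k * 256 + ord(c)
    return k
```

For ASCII strings this is the same as reading the bytes as one big-endian integer; the result grows by 8 bits per character, which is exactly why very long strings make it impractical.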
Hash function = mod m
- k is an integer:
  - h(k) = k mod m
- Requires m ≥ n/α
  - m = size of the table, n = number of elements

Choice of m
- Avoid:
  - Powers of 2
    - division by m keeps only the low bits of k (the high bits are lost)
  - Powers of 10
    - same as above, if k is a decimal number
- Use:
  - a prime number
  - far from powers of 2
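A quick illustration (a sketch, not from the slides) of why powers of 2 are a poor choice: k mod 2^p depends only on the p lowest bits of k, so keys that differ only in their high bits all collide:

```python
m = 16  # a power of 2: 2^4

# Keys that differ only in their high bits...
keys = [0x13, 0x113, 0xA13, 0xFF13]

# ...all hash to the same slot, because k mod 16 keeps only the low 4 bits:
hashes = [k % m for k in keys]
print(hashes)

# k mod 2^p is exactly the p low bits of k:
assert all(k % m == (k & 0b1111) for k in keys)
```

With a prime m, the same keys would be spread across different slots, since every bit of k then influences the remainder.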
Ex
- n = 2000
- we accept 3 comparisons on average per search
- m = 701 is a prime, close to 2000/3 but far from powers of 2
- h(k) = k mod 701

Hash function = multiply
- k integer:
  - A: a constant, 0 < A < 1
  - frac(x) = x - ⌊x⌋
  - h(k) = ⌊ m · frac(k·A) ⌋
- k·A shuffles the bits of k
- multiplying by m expands [0,1) to [0,m)
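The multiplication method above, sketched in Python with the constant A suggested on the next slide:

```python
import math

A = (math.sqrt(5) - 1) / 2   # ≈ 0.6180339887, the suggested constant

def frac(x):
    """Fractional part: x - floor(x)."""
    return x - math.floor(x)

def h(k, m):
    """Multiplication method: h(k) = floor(m * frac(k * A))."""
    return math.floor(m * frac(k * A))
```

Note this floating-point version is only illustrative: real implementations use fixed-point integer arithmetic to avoid rounding, which is also where a power-of-2 m simplifies the computation to shifts.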
Choice of m and A
- m is not critical; using a power of 2 simplifies the multiplication
- the best A depends on how the keys are statistically distributed
- A = (√5 - 1)/2 = 0.6180339887... is a good choice