Topic HashTable and Table ADT

Hashing, Hash Function & Hashtable

Search, Insertion & Deletion of elements based on Keys So far: by comparing keys, in both linear and non-linear data structures. Time complexity?

Search, Insertion & Deletion of elements based on Keys A different approach: By calculating the location from keys! Time complexity?

Search, Insertion & Deletion of elements by calculating the location from keys [Figure: a search key is fed to a location calculator, which yields an index into an array with locations 0 to N-1.]

Hashing The technique used for ordering and accessing elements in an array of some fixed size N (the hashtable) by manipulating the key of an element to identify its location in the array. Each key is mapped to an array position (0..N-1) by a hash function, in a relatively constant amount of time.

Hashtable The array of elements based on hashing: an unordered, sparse table with locations 0, 1, 2, 3, ..., N-1, where N = the hashtable size (the fixed size of the array).

Keys of Elements K = The set of keys of elements. The size of K, written |K|, is relatively large or even unbounded. If K = 9-digit numbers, then |K| = 1,000,000,000. If K = arbitrary character strings of arbitrary length, then |K| is unbounded.

Keys Depending on the application, the keys might be integers, letters, strings, and so on.

Elements E = The number of elements to be stored. E is significantly less than |K|.

Size of Hashtable N = The size of the hashtable. N is at least as great as the maximum number of elements to be stored, i.e. N >= E.

Hash Function A function used to manipulate the key of an element to identify its location in the array (hashtable).

Hash Function Hashing a key to an array index: h takes a key value in the set of possible keys K and returns an integer value between 0 and N-1. h: K -> {0, 1, ..., N-1}

Example Employees using their SSNs as a key: K = the set of keys, |K| = 10^9 = 1,000,000,000. Let E = the number of employees to be stored = 10,000, so E << |K|. Let N = 10,000. Hash function h: SSN -> {0, 1, ..., 9999}. Let h(key) = key % 10000.

Example Employees using their five-digit ID numbers as a key: K = the set of keys, |K| = 10^5 = 100,000. Let E = the number of employees to be stored = 100, so E << |K|. Let N = 100. Hash function h: ID -> {0, 1, ..., 99}. Let h(key) = key % 100.
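The two examples above both use the division method. A minimal sketch follows, assuming non-negative integer keys; the function name hashByDivision is illustrative and does not come from the slides.

```cpp
#include <cstddef>
#include <cstdint>

// Division-method hash: maps a non-negative integer key to an index in
// the range 0 .. tableSize-1. The name is hypothetical.
std::size_t hashByDivision(std::uint64_t key, std::size_t tableSize) {
    return static_cast<std::size_t>(key % tableSize);
}
```

With tableSize = 10000 the function keeps the last four digits of an SSN, and with tableSize = 100 it keeps the last two digits of an ID.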

Hash Function Methods Selecting digits; folding method (add digits); middle square method; multiplication method; division method (modulo arithmetic).

Access using Hash Function A hash function has two uses: as a method of determining where to store the element, and as a method of accessing the element.

Access using Hash Function [Figure: a key passes through h to select one of the hashtable locations 0 to N-1.]
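As one illustration of the methods listed above, the folding method in its "add digits" form could be sketched as follows. The final reduction modulo the table size and the name hashByFolding are assumptions made for this sketch; the slides only name the method.

```cpp
#include <cstddef>
#include <cstdint>

// Folding (add digits): sum the decimal digits of the key, then reduce
// the sum modulo the table size so it is a valid array index.
std::size_t hashByFolding(std::uint64_t key, std::size_t tableSize) {
    std::uint64_t sum = 0;
    while (key > 0) {
        sum += key % 10;  // take the last decimal digit
        key /= 10;        // and drop it
    }
    return static_cast<std::size_t>(sum % tableSize);
}
```

For the key 7597 the digit sum is 7 + 5 + 9 + 7 = 28, so with a table of size 101 the element would go to location 28.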

Perfect Hash Function Transforms different keys to different numbers.

Collision

Collisions The condition resulting when two or more different keys produce the same hash location, so that two or more items would have to be kept in the same location of the hashtable.

Why Collisions? In general, |K| >> N, so the mapping defined by the hash function H: K -> {0, 1, ..., N-1} is a many-to-one mapping! There will exist many pairs of distinct keys K1 and K2 s.t. H(K1) = H(K2).

Example Employees using their SSNs as a key: K = the set of keys, |K| = 10^9 = 1,000,000,000. Let N = 10,000. Since |K| >> N, the hash function h: SSN -> {0, 1, ..., 9999} is a many-to-one mapping! There will exist many pairs of distinct keys SSN1 and SSN2 s.t. h(SSN1) = h(SSN2). With h(key) = key % 10000: h(999991234) = 1234, h(111111234) = 1234, ...

Example Employees using their five-digit ID numbers as a key: K = the set of keys, |K| = 10^5 = 100,000. Let N = 100. Since |K| >> N, the hash function h: ID -> {0, 1, ..., 99} is a many-to-one mapping! There will exist many pairs of distinct keys ID1 and ID2 s.t. h(ID1) = h(ID2). With h(key) = key % 100: h(91234) = 34, h(11234) = 34, ...

Collisions? How often they occur depends on the hash function and on the hash table size.
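To make the many-to-one mapping concrete, the sketch below counts how many keys land on each location under h(key) = key % N. It assumes the keys are simply the integers 0 to keyCount-1; the function name keysPerSlot is illustrative.

```cpp
#include <cstddef>
#include <vector>

// Counts how many keys in 0 .. keyCount-1 hash to each location
// under h(key) = key % tableSize.
std::vector<std::size_t> keysPerSlot(std::size_t keyCount, std::size_t tableSize) {
    std::vector<std::size_t> counts(tableSize, 0);
    for (std::size_t key = 0; key < keyCount; ++key) {
        ++counts[key % tableSize];
    }
    return counts;
}
```

For the five-digit ID example (100,000 possible keys, N = 100), every location is the hash value of exactly 1,000 different keys, so any two of those keys collide.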

Collision Resolution Schemes

How to Resolve Collisions? Two approaches to collision resolution: through open addressing (closed hashing), and through restructuring the hash table (chained addressing, open hashing).

Example N = 101 & h(key) = Key mod 101. 7597 mod 101 = 22, so 7597 is stored at location 22 of the hashtable (locations 0 to 100).

Example N = 101 & h(key) = Key mod 101. 4567 mod 101 = 22, but location 22 already holds 7597. Where should 4567 go?

Collision Resolution through Open Addressing A method of finding an open location for insertion into a hashtable after a collision has occurred. How to find an open location? (Probe) Linear probing, quadratic probing, double hashing, random probing.

1. Linear Probing for Open Addressing An open addressing technique in which we continue from the hash location on, looking for the next available position sequentially. The size of step = 1. The probe sequence: HT[h(SearchKey)], HT[h(SearchKey) + 1], HT[h(SearchKey) + 2], HT[h(SearchKey) + 3], ..., with each index taken mod N so that the probe wraps around from location N-1 to location 0.

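Example: Insertion h(key) = Key mod 101. Keys 7597, 4567, 0628 and 3658 all hash to 22. 7597 is stored at 22; 4567 finds 22 occupied and goes to 23; 0628 goes to 24; 3658 goes to 25.

Example: Insertion h(key) = Key mod 101. Keys 1110 and 1211 both hash to 100. 1110 is stored at 100; 1211 finds 100 occupied, wraps around, and is stored at 0. Hashtable: 0: 1211, 22: 7597, 23: 4567, 24: 0628, 25: 3658, 100: 1110.

Example: Deletion 4567 h(key) = Key mod 101. 4567 hashes to 22; probing from 22 finds it at location 23, where it is deleted.

Example: Search 3658 h(key) = Key mod 101. 3658 hashes to 22, so the probe examines 22, 23, 24 and then 25. If the deletion of 4567 had simply left location 23 empty, the search would stop there and miss 3658.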

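Status of Each Location Each location is in one of three states: valid, empty, or deleted.

Example: Deletion 4567 & Search 3658 h(key) = Key mod 101. After 4567 is deleted, location 23 is marked deleted rather than empty. The search for 3658 starts at 22 and continues past the deleted location 23 to 24 and 25, where 3658 is found. Hashtable: 0: 1211 (valid), 22: 7597 (valid), 23: 4567 (deleted), 24: 0628 (valid), 25: 3658 (valid), 100: 1110 (valid); all other locations are empty.

Example: Insert 4567 Again h(key) = Key mod 101. 4567 hashes to 22, which is valid (7597). The next location, 23, is marked deleted, so it can be reused.

Example: Insert 4567 Again h(key) = Key mod 101. After the insertion, location 23 holds 4567 and is valid again. Hashtable: 0: 1211 (valid), 22: 7597 (valid), 23: 4567 (valid), 24: 0628 (valid), 25: 3658 (valid), 100: 1110 (valid); all other locations are empty.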
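The linear probing examples above, including the valid / empty / deleted states, can be sketched roughly as follows. This is one possible reading of the slides, not their implementation: the class name, the member names, the restriction to non-negative int keys, and the use of -1 as a "not found" result are all assumptions.

```cpp
#include <cstddef>
#include <vector>

// Open-addressing table of non-negative int keys using linear probing.
class LinearProbingTable {
    enum class State { Empty, Valid, Deleted };
    struct Slot {
        int key = 0;
        State state = State::Empty;
    };

public:
    explicit LinearProbingTable(std::size_t size) : slots_(size) {}

    // Stores key and returns its location, or -1 if no usable location exists.
    int insert(int key) {
        int target = -1;  // first deleted or empty location seen so far
        for (std::size_t i = 0; i < slots_.size(); ++i) {
            std::size_t pos = (home(key) + i) % slots_.size();  // wraps around
            if (slots_[pos].state == State::Valid) {
                if (slots_[pos].key == key) return static_cast<int>(pos);  // already stored
            } else {
                if (target < 0) target = static_cast<int>(pos);
                if (slots_[pos].state == State::Empty) break;  // key cannot lie beyond an empty location
            }
        }
        if (target >= 0) {
            slots_[target].key = key;
            slots_[target].state = State::Valid;
        }
        return target;
    }

    // Returns the location of key, or -1 if it is not in the table.
    int search(int key) const {
        for (std::size_t i = 0; i < slots_.size(); ++i) {
            std::size_t pos = (home(key) + i) % slots_.size();
            if (slots_[pos].state == State::Empty) return -1;  // only an empty location ends the probe
            if (slots_[pos].state == State::Valid && slots_[pos].key == key) {
                return static_cast<int>(pos);
            }
        }
        return -1;
    }

    // Marks the location as deleted (not empty) so later probes keep going.
    bool remove(int key) {
        int pos = search(key);
        if (pos < 0) return false;
        slots_[pos].state = State::Deleted;
        return true;
    }

private:
    std::size_t home(int key) const {
        return static_cast<std::size_t>(key) % slots_.size();
    }
    std::vector<Slot> slots_;
};
```

With a table of size 101, inserting 7597, 4567, 628 and 3658 fills locations 22 to 25, and 1211 wraps around from 100 to 0. The key 0628 is written as 628 in C++ because a leading zero would start an octal literal.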

Clustering Problem with Linear Probing The tendency of elements to become unevenly distributed in the hashtable, with many elements clustering around a single hash location. Clustering causes long probe searches!

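2. Quadratic Probing for Open Addressing An open addressing technique in which we continue from the hash location on, looking for the next available position at quadratically growing distances. The size of step = 1^2, 2^2, 3^2, ... The probe sequence: HT[h(SearchKey)], HT[h(SearchKey) + 1^2], HT[h(SearchKey) + 2^2], HT[h(SearchKey) + 3^2], ..., with each index taken mod N.

Example: Insertion h(key) = Key mod 101. Keys 7597, 4567, 0628 and 3658 all hash to 22. 7597 is stored at 22, 4567 at 22 + 1^2 = 23, 0628 at 22 + 2^2 = 26, and 3658 at 22 + 3^2 = 31.

Example: Insertions h(key) = Key mod 7. Keys 9, 23, 16 and 2 all hash to 2. 9 is stored at 2, 23 at 2 + 1^2 = 3, 16 at 2 + 2^2 = 6, and 2 at (2 + 3^2) mod 7 = 4. Hashtable: 2: 9, 3: 23, 4: 2, 6: 16.

Example: Insertions h(key) = Key mod 7. 30 also hashes to 2. Its probe sequence only ever visits locations 2, 3, 6 and 4, all of which are occupied, so 30 cannot be inserted even though locations 0, 1 and 5 are empty.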

Quadratic Probing Virtually eliminates clustering! Cannot guarantee successful insertion if the hash table is half full or more.
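The insertion failure described above can be reproduced with a short sketch of quadratic probing. The function name insertQuadratic, the -1 sentinel for an unused location, and the restriction to non-negative int keys are assumptions of this sketch.

```cpp
#include <cstddef>
#include <vector>

const int kEmpty = -1;  // sentinel for an unused location (keys assumed non-negative)

// Quadratic probing insert with h(key) = key mod N, where N = table.size().
// Returns the location used, or -1 if no probe reached an empty location.
int insertQuadratic(std::vector<int>& table, int key) {
    const std::size_t n = table.size();
    const std::size_t home = static_cast<std::size_t>(key) % n;
    // (i + n)^2 and i^2 are congruent mod n, so n probes cover every
    // location this probe sequence can ever reach.
    for (std::size_t i = 0; i < n; ++i) {
        std::size_t pos = (home + i * i) % n;
        if (table[pos] == kEmpty) {
            table[pos] = key;
            return static_cast<int>(pos);
        }
    }
    return -1;
}
```

On a table of size 7, inserting 9, 23, 16 and 2 places them at 2, 3, 6 and 4. A following insert of 30 returns -1, because the squares mod 7 only take the values 0, 1, 2 and 4, so locations 0, 1 and 5 are never probed from home location 2.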

3. Double Hashing for Open Addressing An open addressing technique in which we continue from the hash location on, looking for the next available position in steps whose size is given by a second hash function h2. The size of step = h2(SearchKey). The probe sequence: HT[h(SearchKey)], HT[h(SearchKey) + h2(SearchKey)], HT[h(SearchKey) + 2 * h2(SearchKey)], HT[h(SearchKey) + 3 * h2(SearchKey)], ..., with each index taken mod N. The probe sequence is key-dependent.

Example: Insertion h(key) = Key mod 11, h2(key) = 7 - (Key mod 7). 58 is stored at location 3. 14 also hashes to 3; its step is h2(14) = 7 - 0 = 7, so it is stored at 3 + 7 = 10.

Example: Insertion h(key) = Key mod 11, h2(key) = 7 - (Key mod 7). 91 hashes to 3, which holds 58; its step is h2(91) = 7 - 0 = 7, and 3 + 7 = 10 holds 14, so it is stored at (3 + 2 * 7) mod 11 = 6. Hashtable: 3: 58, 6: 91, 10: 14.
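A minimal sketch of the double hashing example above, using Key mod N as the primary hash and 7 - (Key mod 7) as the step size, as on the slides. The function name insertDoubleHash, the -1 sentinel, and the non-negative int keys are assumptions of this sketch.

```cpp
#include <cstddef>
#include <vector>

const int kEmpty = -1;  // sentinel for an unused location (keys assumed non-negative)

// Double hashing insert: home location is key mod N, step is 7 - (key mod 7).
// Returns the location used, or -1 if no probe reached an empty location.
int insertDoubleHash(std::vector<int>& table, int key) {
    const std::size_t n = table.size();
    const std::size_t home = static_cast<std::size_t>(key) % n;
    const std::size_t step = 7 - static_cast<std::size_t>(key) % 7;  // always in 1..7
    for (std::size_t i = 0; i < n; ++i) {
        std::size_t pos = (home + i * step) % n;
        if (table[pos] == kEmpty) {
            table[pos] = key;
            return static_cast<int>(pos);
        }
    }
    return -1;
}
```

With a table of size 11, inserting 58, 14 and 91 places them at 3, 10 and 6. Because 11 is prime and the step is never a multiple of 11, the probe sequence visits every location before repeating.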

Rehashing What happens when the hash table is full or nearly full? Rehashing! Enlarge the hashtable. Rehashing means: create a new, larger hash table, then insert each element of the old hash table into the new one. How much larger? Double the hashtable size.

Collision Resolution through Restructuring the Hashtable Change the structure of the hashtable so that it can accommodate more than one element in the same location! Two ways: using buckets, or using separate chaining.
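Rehashing as described above might look like the sketch below. It assumes a linear probing table of non-negative int keys with -1 marking unused locations; the names insertLinear and rehash are illustrative.

```cpp
#include <cstddef>
#include <vector>

const int kEmpty = -1;  // sentinel for an unused location (keys assumed non-negative)

// Linear probing insert with h(key) = key mod N.
// Returns false when every location is occupied.
bool insertLinear(std::vector<int>& table, int key) {
    const std::size_t n = table.size();
    for (std::size_t i = 0; i < n; ++i) {
        std::size_t pos = (static_cast<std::size_t>(key) + i) % n;
        if (table[pos] == kEmpty) {
            table[pos] = key;
            return true;
        }
    }
    return false;
}

// Creates a table twice as large and re-inserts every stored key.
// Keys cannot simply be copied across: h(key) = key mod N changes with N.
std::vector<int> rehash(const std::vector<int>& oldTable) {
    std::vector<int> newTable(oldTable.size() * 2, kEmpty);
    for (int key : oldTable) {
        if (key != kEmpty) insertLinear(newTable, key);
    }
    return newTable;
}
```

Plain doubling keeps the sketch short, but it produces an even table size. A variant that follows the deck's advice on prime table sizes would round the doubled size up to the next prime.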

1. Using Buckets A technique to resolve collisions by implementing a hashtable as an array of arrays. A bucket is an element of a hashtable that is itself an array.

Example: Insertion h(key) = Key mod 101, bucket size = 3. 7597, 4567 and 0628 all hash to 22 and share bucket 22. 1110 and 1211 both hash to 100 and share bucket 100.

Using Buckets The size of the bucket: too small? Too big?

2. Using Separate Chaining A technique to resolve collisions by implementing a hashtable as an array of pointers, where each pointer is the head of a linked list of records whose keys hash to that location. Each linked list is called a chain.

Example: Insertion h(key) = Key mod 101. 7597, 4567 and 0628 all hash to 22; with insertion at the front, chain 22 reads 0628 -> 4567 -> 7597. 1110 and 1211 both hash to 100; chain 100 reads 1211 -> 1110.

Separate Chaining Using a linked list: an unsorted linked list or a sorted linked list.
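A hashtable of fixed-size buckets could be sketched as below. The slides do not say what happens when a bucket overflows, so this sketch simply rejects the insertion; that policy, the class name BucketTable, and the non-negative int keys are all assumptions.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hashtable whose locations are buckets holding at most bucketSize keys.
class BucketTable {
public:
    BucketTable(std::size_t tableSize, std::size_t bucketSize)
        : buckets_(tableSize), bucketSize_(bucketSize) {}

    // Returns false if the bucket for this key is already full
    // (an assumed overflow policy, not one given in the slides).
    bool insert(int key) {
        std::vector<int>& bucket = buckets_[index(key)];
        if (bucket.size() >= bucketSize_) return false;
        bucket.push_back(key);
        return true;
    }

    bool contains(int key) const {
        const std::vector<int>& bucket = buckets_[index(key)];
        return std::find(bucket.begin(), bucket.end(), key) != bucket.end();
    }

private:
    std::size_t index(int key) const {
        return static_cast<std::size_t>(key) % buckets_.size();
    }
    std::vector<std::vector<int>> buckets_;
    std::size_t bucketSize_;
};
```

With 101 buckets of size 3, the keys 7597, 4567 and 628 fill bucket 22, and a fourth key hashing to 22 (such as 3658) is rejected. This is the "too small" side of the question above.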
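A minimal sketch of separate chaining with unsorted chains and insertion at the front, matching the example above. The class name ChainedTable, the use of std::forward_list for the chains, and the non-negative int keys are assumptions of this sketch.

```cpp
#include <cstddef>
#include <forward_list>
#include <vector>

// Hashtable in which each location heads an unsorted singly linked chain.
class ChainedTable {
public:
    explicit ChainedTable(std::size_t size) : chains_(size) {}

    // Inserts at the front of the chain; no probing is needed.
    void insert(int key) { chains_[index(key)].push_front(key); }

    // Walks only the chain that the key hashes to.
    bool contains(int key) const {
        for (int stored : chains_[index(key)]) {
            if (stored == key) return true;
        }
        return false;
    }

    // Read-only access to one chain, for inspection.
    const std::forward_list<int>& chain(std::size_t location) const {
        return chains_[location];
    }

private:
    std::size_t index(int key) const {
        return static_cast<std::size_t>(key) % chains_.size();
    }
    std::vector<std::forward_list<int>> chains_;
};
```

After inserting 7597, 4567 and 628 into a table of size 101, chain 22 starts with 628, the most recently inserted key. A sorted chain would instead insert each key at its ordered position, trading a slower insert for searches that can stop early.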

Good Hash Functions? A good hash function minimizes collisions. A good hash function tends to spread keys evenly. A good hash function is easy to compute: the running time should be O(1).

Good Hash Functions The calculation of the hash function should involve the entire search key. If a hash function uses modulo arithmetic, the base (the hashtable size) should be prime: h(key) = f(key) % table_size.

Size of Hashtable? Too big? Memory waste. Too small? More collisions & rehashing. Should be as large as practical & a prime number!

Hash Table Implementation

Hash Table ADT

#include <list>
#include <utility>
#include <vector>
using namespace std;

template < class DT, class KT >
class HashTbl {
public:
    HashTbl ( int inittablesize );
    ~HashTbl ();

    void insert ( const KT &searchkey, const DT &newdataitem );
    bool remove ( KT searchkey );
    bool retrieve ( KT searchkey, DT &dataitem );
    void clear ();
    bool isempty () const;
    bool isfull () const;
    void showstructure () const;

private:
    int tablesize;                                // number of locations in the table
    vector< list< pair<KT, DT> > > datatable;     // one chain of (key, data) pairs per location
};
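One way the member functions of this ADT could be implemented is sketched below. The slides show only the declaration, so everything in the bodies is an assumption: std::hash<KT> as the hash function, overwriting the data when a key is inserted twice, isfull always returning false because chains grow on demand, retrieve being const, and a positive inittablesize. The destructor is left to the compiler.

```cpp
#include <cstddef>
#include <functional>
#include <iostream>
#include <list>
#include <utility>
#include <vector>

template <class DT, class KT>
class HashTbl {
public:
    explicit HashTbl(int inittablesize)
        : tablesize(inittablesize), datatable(inittablesize) {}

    // Inserts a new pair, or overwrites the data if the key already exists.
    void insert(const KT& searchkey, const DT& newdataitem) {
        auto& chain = datatable[index(searchkey)];
        for (auto& entry : chain) {
            if (entry.first == searchkey) { entry.second = newdataitem; return; }
        }
        chain.push_front(std::make_pair(searchkey, newdataitem));
    }

    bool remove(KT searchkey) {
        auto& chain = datatable[index(searchkey)];
        for (auto it = chain.begin(); it != chain.end(); ++it) {
            if (it->first == searchkey) { chain.erase(it); return true; }
        }
        return false;
    }

    // Copies the stored data into dataitem and returns true if the key is found.
    bool retrieve(KT searchkey, DT& dataitem) const {
        for (const auto& entry : datatable[index(searchkey)]) {
            if (entry.first == searchkey) { dataitem = entry.second; return true; }
        }
        return false;
    }

    void clear() {
        for (auto& chain : datatable) chain.clear();
    }

    bool isempty() const {
        for (const auto& chain : datatable) {
            if (!chain.empty()) return false;
        }
        return true;
    }

    bool isfull() const { return false; }  // chains can always grow

    // Prints each location followed by the keys in its chain.
    void showstructure() const {
        for (std::size_t i = 0; i < datatable.size(); ++i) {
            std::cout << i << ":";
            for (const auto& entry : datatable[i]) std::cout << " " << entry.first;
            std::cout << "\n";
        }
    }

private:
    std::size_t index(const KT& key) const {
        return std::hash<KT>()(key) % static_cast<std::size_t>(tablesize);
    }
    int tablesize;
    std::vector<std::list<std::pair<KT, DT>>> datatable;
};
```

Note the template parameter order: the data type comes first and the key type second, so a table mapping int keys to int data is declared as HashTbl<int, int> and one mapping int keys to names would be HashTbl<std::string, int>.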

Analysis of Hashing

Search - Time Analysis Search operation: The worst-case time: every key gets hashed to the same array index, O(N) linear time. The average-case time: O(1) constant time (about 2-5 elements examined).

Insert - Time Analysis Insert operation: The worst-case time: every key gets hashed to the same array index, O(N) linear time; O(1) for separate chaining (insert at the front always). The average-case time: O(1) constant time (about 2-5 elements examined).

Delete - Time Analysis Delete operation: The worst-case time: every key gets hashed to the same array index, O(N) linear time. The average-case time: O(1) constant time (about 2-5 elements examined).

Load Factor of a Hashtable In general, hashtables have some unused locations. Load factor = The number of occupied hashtable locations (entries) / The size of the hashtable!

Search - Average Case The average number of hashtable elements examined for Search with linear probing, where lf is the load factor: (1 + 1/(1 - lf)) / 2 for successful search; (1 + 1/(1 - lf)^2) / 2 for unsuccessful search.

Search - Average Case The average number of hashtable elements examined for Search with quadratic probing and double hashing: -ln(1 - lf) / lf for successful search (ln is the natural logarithm); 1/(1 - lf) for unsuccessful search.

Search - Average Case The average number of hashtable elements examined for Search with separate chaining: 1 + lf / 2 for successful search lf for unsuccessful search

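Search - Average Case (successful search)
load factor   linear probing   double hashing   separate chaining
0.5           1.50             1.39             1.25
0.6           1.75             1.53             1.30
0.7           2.17             1.72             1.35
0.8           3.00             2.01             1.40
0.9           5.50             2.56             1.45
1.0           -                -                1.50
2.0           -                -                2.00
3.0           -                -                2.50
Open addressing has no entries for load factors of 1.0 and above, where its formulas no longer apply.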
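The average-case formulas above can be evaluated directly to check or extend the table. The sketch below assumes the logarithm is the natural logarithm, which is what reproduces the double hashing column; the function names are illustrative.

```cpp
#include <cmath>

// Average number of elements examined, as a function of the load factor lf.
// The open addressing formulas require 0 < lf < 1.
double linearProbingHit(double lf)  { return 0.5 * (1.0 + 1.0 / (1.0 - lf)); }
double linearProbingMiss(double lf) { return 0.5 * (1.0 + 1.0 / ((1.0 - lf) * (1.0 - lf))); }
double doubleHashingHit(double lf)  { return -std::log(1.0 - lf) / lf; }
double doubleHashingMiss(double lf) { return 1.0 / (1.0 - lf); }

// Separate chaining also allows lf >= 1, since chains can hold several elements.
double chainingHit(double lf)  { return 1.0 + lf / 2.0; }
double chainingMiss(double lf) { return lf; }
```

The load factor itself is the number of occupied locations divided by the table size. For example, 50 stored elements in a table of size 101 give lf of about 0.495, for which linearProbingHit returns roughly 1.49.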

FindMin & FindMax - Time Analysis FindMin & FindMax operations: O(N) linear time

Traverse - Time Analysis Ordered traversal operation: O(N) linear time