Hashing Dr. Ronaldo Menezes Hugo Serrano
Agenda Motivation Prehash Hashing Hash Functions Collisions Separate Chaining Open Addressing
Motivation Hash Table It is one of the most important data structures in computer science: Databases Authentication Systems Spell Checking Network Routers Cryptography Compilers File/Directory Synchronization
Hash Tables We are familiar with direct access structures and linear access structures Both have their advantages and disadvantages
Hash Tables The main reason one might avoid direct access structures is that their size must be allocated in advance; we tend to assume that the number of keys actually stored is as large as the universe of possible keys
Hash Tables 1. In some cases we cannot access an element directly because the key is non-trivial: not necessarily an index into an array, nor something that can easily be used as an index 2. In other problems the number of keys to be stored is (much) smaller than the number of keys in the universe; if we use an array in this case, we may waste a lot of space
Prehash Map non-trivial keys to nonnegative integers. In theory, keys are finite and discrete; anything on a computer can be written down as a string of bits, and strings of bits represent integers. In practice, it is slightly different. Ideally, prehash(x) = prehash(y) if and only if x = y
Hash Functions Reduce the universe U of all keys (integers) down to a reasonable size m for the table. h : U --> {0, 1, ..., m-1}, where m is the size of the table
Hash Functions Pigeonhole Principle If n pigeons (items) are put into m pigeonholes with n>m, then at least one pigeonhole must contain more than one item.
Hash Functions Most hash functions assume the keys are natural numbers What makes a good hash function? One that satisfies the assumption of simple uniform hashing
Simple Uniform Hashing Uniformity: each key is equally likely to be hashed to any slot of the table. Independence: the slot a key hashes to is independent of the slots other keys hash to. Unfortunately the above is rarely achievable, since we would need to know the probability distribution of the keys
Common Hash Functions The division method is based on h(k) = k mod m, where m is the size of the hash table. Good values of m are crucial: these are normally prime numbers close to n divided by the desired average number of probes. For instance, if we want to store 4000 numbers and we don't mind doing 4 probes, choose m to be a prime close to 4000/4 = 1000; in this case 997
Common Hash Functions The multiplication method is based on h(k) = floor(m x (kA mod 1)), where A is a constant between 0 and 1. This is a good choice because its quality does not depend on m alone. Knuth suggested that a good general A is (sqrt(5) - 1)/2. This comes from the golden ratio, which is given (approximately) by 1.6180339887. Example: storing the number 765 into a table of size 45 gives us h(765) = floor(45 x (765A mod 1)) = floor(45 x 0.796) = 35
Pictorial view of Hash Tables (figure: keys k1, k2, k3, k4 mapped to slots of the table)
Pictorial view of a Collision (figure: a fifth key k5 hashes to a slot that is already occupied)
Order Preservation Order preservation of hash functions is similar to the stability property in sorting. Given keys k1 <= k2 <= ... <= kn to be hashed, we should expect that h(k1) <= h(k2) <= ... <= h(kn). What is the importance of this characteristic? Let's discuss it when we talk about conflict resolution.
Collision Resolution Because we are mapping elements to a normally smaller domain of slots, collisions are likely to happen There are two classes of collision resolution: Separate chaining The table points to structures holding the element that collides Open addressing The elements being hashed are actually stored in the table and not on a separate structure
Separate Chaining The most common resolution mechanism is called separate chaining or just chaining It consists of mixing the concepts of linked lists and direct access structures like arrays Each slot of a hash table is a pointer to a dynamic structure (say a linked list or a binary search tree)
Collision Resolution When hashing a key, if a collision happens the new key is stored in the linked list at that location Let's see a real example. Suppose that we're mapping the universe of integers into a hash table of size 10 Our hash function may be based on the division method for creating hash values h(k) = k mod size
Hashing(103): h(103) = 103 mod 10 = 3 → slot 3: 103
Hashing(69): h(69) = 69 mod 10 = 9 → slot 9: 69
Hashing(20): h(20) = 20 mod 10 = 0 → slot 0: 20
Hashing(13): h(13) = 13 mod 10 = 3 → slot 3: 103 → 13
Hashing(110): h(110) = 110 mod 10 = 0 → slot 0: 20 → 110
Hashing(53): h(53) = 53 mod 10 = 3 → slot 3: 103 → 13 → 53
Final Hash Table: slot 0: 20 → 110, slot 3: 103 → 13 → 53, slot 9: 69
Searching in a Hash Table (assuming chaining) Like any other structure, searching is a common task with hash tables Searching works as below Hash the target Go to the slot given by the hash value; if the target exists it must be in that slot Search the list in that slot using a linear search (assuming linked lists)
Searching for 53: hash(53) = 3; a temp pointer walks the chain in slot 3 (103 → 13 → 53) until 53 is found
hashsearch(n) NodeType *hashsearch(NodeType *table[], int target) { int index = hash(target); NodeType *temp = table[index]; return linearsearch(temp, target); }
Analysis of Hash Search Expected length of a chain for n keys and m slots: α = n/m, the load factor
Analysis of Hash Search Discussion Using big-O notation, express the performance of hash search Worst Case Best Case
Analysis of Hash Search Discussion Using big-O notation, express the performance of hash search Average Case T(n) = O(1 + α), which is O(1) when α is a constant (i.e., m grows with n)
Collision techniques The techniques based on open addressing are: Linear Probing: if position h(key) is occupied, do a linear search in the table until you find an empty slot. The slots are searched in this order: h(key), h(key)+1, h(key)+2, ... Suffers from primary clustering. Values above are taken mod m.
More open addressing techniques Quadratic probing: a variant of the above where the term added to the hash result is squared: h(key)+1^2, h(key)+2^2, ..., i.e. h(key)+i^2 (mod m). Suffers from secondary clustering: a milder version of primary clustering. Random probing: another variant where the term added to the hash result is a random number: h(key)+random(). No clustering, but conflicts rapidly lead to O(n) search. Rehashing: a technique where a sequence of hash functions is defined (h1, h2, ..., hk); if a collision occurs the functions are tried in this order.