Data Structures And Algorithms Hashing Eng. Anis Nazer First Semester 2017-2018
Searching
Search: find whether a key exists in a given set.
Searching algorithms: linear (sequential) search, binary search, and search based on a hash function.
Linear/sequential Search
Algorithm: go through the elements one by one; if the key is found, return true.
Code:
    bool linearsearch(int A[], int size, int key) {
        // compare the key with every element in turn
        for (int i = 0; i < size; i++)
            if (A[i] == key)
                return true;
        return false;
    }
What is the complexity?
Binary Search
Assumption: the array elements are sorted in ascending order.
Algorithm: compare the key with the element at the middle:
    if (key == element) return true;
    if (key > element) search the right subarray
    else search the left subarray
Question: when to stop? How to determine that the key is not found? What is the complexity?
Binary Search
Code:
    bool binarysearch(int A[], int size, int key) {
        int L = 0, R = size - 1;
        int M = (L + R) / 2;
        while (L <= R) {
            if (key == A[M])
                return true;
            else if (key > A[M])
                L = M + 1;   // key can only be in the right half
            else
                R = M - 1;   // key can only be in the left half
            M = (L + R) / 2;
        }
        return false;        // L > R: the search range is empty
    }
Hash function
A hash function computes its result (here, an array index) from the input key or from part of the key.
Example of a hash function: f(x) = x % 10
Assume we store the elements in an array based on the hash function: the index of value x is f(x), so A[ f(x) ] = x
Hash function
Example: store the following in an array of size 10, given that the hash function is f(x) = x % 10
1, 18, 15, 930, 77, 29
index:  0    1    2    3    4    5    6    7    8    9
value:  930  1    .    .    .    15   .    77   18   29
(. marks an empty position)
Is 44 in the array? f(44) = 44 % 10 = 4, and A[4] is empty, so 44 is not in the array.
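The example above can be sketched in C++ as follows. This is a minimal illustration, not part of the original slides: the names buildTable and contains, and the EMPTY sentinel, are assumptions made here, and the sketch only works when no two keys collide.

```cpp
#include <vector>

const int EMPTY = -1;  // assumed sentinel: all keys are non-negative

// Hash function from the slide.
int f(int x) { return x % 10; }

// Store each key at index f(key). Assumes no two keys hash to the same index.
std::vector<int> buildTable(const std::vector<int>& keys) {
    std::vector<int> A(10, EMPTY);
    for (int x : keys) A[f(x)] = x;
    return A;
}

// A lookup inspects exactly one position, regardless of how many keys are stored.
bool contains(const std::vector<int>& A, int key) {
    return A[f(key)] == key;
}
```

With the keys 1, 18, 15, 930, 77, 29, a call to contains(A, 44) inspects only A[4], finds it empty, and returns false.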
Hash function
What is the advantage of using a hash function?
What is the problem when using a hash function? Two inputs can hash to the same value.
Example: with f(x) = x % 10, f(15) = 5 and f(225) = 5.
What to do if two values hash to the same index?
Collision
Collision: when two distinct values v1 and v2 hash to the same index.
How to deal with collisions?
Use a perfect hash function, i.e. one where no two values hash to the same index. This is practically impossible when the data is unknown.
A good hash function is one that avoids collisions as much as possible.
Hash functions
Some examples of hash functions: division, folding, mid-square, extraction, radix transformation.
Hash functions
Division: based on the modulo operator: h(x) = x % (array size)
It is better for the array size to be a prime number.
Hash functions
Folding: the key is divided into parts, and the parts are combined to generate the index (address).
Example: divide the key into parts of three digits, add the parts, then take the sum modulo the array size.
ID = 199805535, array size = 101
h(199805535) = (199 + 805 + 535) % 101 = 1539 % 101 = 24
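A minimal sketch of the folding example, assuming the three-digit parts are taken from the right end of the decimal key (for a nine-digit ID this gives the same parts as splitting from the left). The name foldingHash is chosen here for illustration.

```cpp
// Folding: split the decimal key into three-digit parts (starting from the
// rightmost digits), add the parts, and reduce the sum modulo the table size.
int foldingHash(long long key, int tableSize) {
    long long sum = 0;
    while (key > 0) {
        sum += key % 1000;  // lowest three digits form one part
        key /= 1000;        // drop them and continue with the rest
    }
    return static_cast<int>(sum % tableSize);
}
```

For the ID above, foldingHash(199805535, 101) adds 535 + 805 + 199 = 1539 and returns 24.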
Hash functions
Mid-Square: the key is squared and the middle part of the result is taken.
Example: key = 3121, size = 1000
3121^2 = 9740641, middle = 406
It is better to use a size that is a power of 2 and take the middle of the binary representation.
Example: key = 3121, size = 1024
3121^2 = 9740641 = 100101001010000101100001 (binary)
h(3121) = 0101000010 (binary) = 322
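The binary variant can be sketched with shifts and masks. This is an illustrative version, not the slides' code: midSquareHash and its outBits parameter (log2 of the table size) are names assumed here, and when the number of surplus bits is odd the "middle" is ambiguous, so this sketch drops the smaller half on the right.

```cpp
#include <cstdint>

// Mid-square for a table of size 2^outBits: square the key, then keep
// outBits bits from the middle of the square's binary representation.
unsigned midSquareHash(std::uint32_t key, int outBits) {
    std::uint64_t sq = static_cast<std::uint64_t>(key) * key;

    // Count the significant bits of the square.
    int totalBits = 0;
    for (std::uint64_t t = sq; t > 0; t >>= 1) ++totalBits;

    if (totalBits <= outBits) return static_cast<unsigned>(sq);

    int shift = (totalBits - outBits) / 2;          // bits dropped on the right
    std::uint64_t mask = (1ULL << outBits) - 1;     // keeps the lowest outBits bits
    return static_cast<unsigned>((sq >> shift) & mask);
}
```

For key = 3121 the square has 24 bits; with outBits = 10 the sketch drops 7 bits on each side and returns 322, matching the slide.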
Hash functions
Extraction: take a part of the key.
Example: take the last 4 digits of the ID number: h(199805535) = 5535
This method is useful when part of the key is common across the data: ID numbers usually start with the same digits, so those digits are left out.
Hash functions
Radix transformation: the key is converted to another number system, and the result is taken modulo the array size.
Example: size = 100, base 9
key = 345: 345 in base 9 is 423, so h(345) = 423 % 100 = 23
key = 245: 245 in base 9 is 302, so h(245) = 302 % 100 = 2
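A small sketch of radix transformation, assuming a base of at most 10 so that the converted digits can be read back as a decimal number. The function name radixHash is chosen here for illustration.

```cpp
// Radix transformation: write the key in the given base, read the resulting
// digit string as a decimal number, and reduce it modulo the table size.
// Assumes 2 <= base <= 10 and key >= 0.
int radixHash(int key, int base, int tableSize) {
    long long value = 0;  // the base-`base` digits, read as a decimal number
    long long place = 1;  // decimal place of the next digit: 1, 10, 100, ...
    while (key > 0) {
        value += (key % base) * place;
        place *= 10;
        key /= base;
    }
    return static_cast<int>(value % tableSize);
}
```

For example, radixHash(345, 9, 100) converts 345 to the digits 4, 2, 3 and returns 423 % 100 = 23.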
Collision resolution
Collision: two keys hash to the same address (index).
How to deal with collisions:
Use a perfect hash function: not practical.
Open addressing: find an available position to place the colliding key, using linear probing, quadratic probing, or double hashing.
Chaining: use a linked list to store the keys.
Collision resolution
Linear probing: look for the next available position, wrapping around at the end of the array.
Ex. h(x) = x % 10, size = 10
16, 22, 77, 48, 35, 62, 47, 99
0 1 2 3 4 5 6 7 8 9
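Insertion with linear probing can be sketched as below. The EMPTY sentinel and the name insertLinear are assumptions of this sketch, which also assumes non-negative keys.

```cpp
#include <vector>

const int EMPTY = -1;  // assumed sentinel: all keys are non-negative

// Insert key using h(x) = x % size. On a collision, step to the next position,
// wrapping around at the end of the array.
// Returns the index used, or -1 if the table is full.
int insertLinear(std::vector<int>& table, int key) {
    int size = static_cast<int>(table.size());
    int home = key % size;
    for (int i = 0; i < size; ++i) {
        int pos = (home + i) % size;
        if (table[pos] == EMPTY) {
            table[pos] = key;
            return pos;
        }
    }
    return -1;
}
```

Inserting 16, 22, 77, 48, 35, 62, 47, 99 in that order into a table of size 10 places 62 at index 3 (index 2 is taken by 22), 47 at index 9 (indices 7 and 8 are taken), and 99 at index 0 after wrapping around.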
Collision resolution
Linear probing tends to create clusters: elements tend to group near each other.
The empty position following a cluster has a higher chance of being filled. This chance is proportional to the cluster size: the bigger the cluster, the higher the probability.
Collision resolution
Quadratic probing: look for positions using a quadratic formula: h(x) + i, where i = 1, -1, 4, -4, 9, -9, ... (that is, ±1^2, ±2^2, ±3^2, ...), taken modulo the array size.
Ex. h(x) = x % 10, size = 10
16, 22, 77, 48, 35, 62, 47, 99
0 1 2 3 4 5 6 7 8 9
Collision resolution
Assume key = 9, h(x) = x % 19, and the array is full except A[3]. What is the sequence of indices (probes) that are tried?
Quadratic probing avoids the primary clustering of linear probing, but it generates secondary clusters, since two elements that hash to the same index follow the same probe sequence.
Collision resolution
How to know when to stop if the key is not in the array?
If the size of the array is a prime number of the form 4j + 3, where j is an integer, the probing sequence is guaranteed to cover all the indices.
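The probe order for quadratic probing can be generated with a short function. This is a sketch under the assumption that the offsets are +i^2 and -i^2 for i = 1 up to (size - 1) / 2; the name quadraticProbes is chosen here.

```cpp
#include <vector>

// Probe order for quadratic probing with h(x) = x % size:
// h, h+1, h-1, h+4, h-4, h+9, h-9, ... (all modulo size).
std::vector<int> quadraticProbes(int key, int size) {
    int home = key % size;
    std::vector<int> probes{home};
    for (int i = 1; i <= (size - 1) / 2; ++i) {
        int step = (i * i) % size;
        probes.push_back((home + step) % size);
        probes.push_back((home - step + size) % size);  // + size keeps the index non-negative
    }
    return probes;
}
```

For key = 9 and size = 19 (a prime of the form 4j + 3 with j = 4), the sequence starts 9, 10, 8, 13, 5, 18, 0, 6, 12, 15, 3, so index 3 is reached on the eleventh probe, and the full sequence of 19 probes visits every index exactly once.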
Collision resolution
Double hashing: if a collision occurs, use a second hash function to compute the step between probes.
Probe sequence: h(x), h(x) + h2(x), h(x) + 2h2(x), h(x) + 3h2(x), ... (each taken modulo the array size)
Example: h(x) = x % 19, h2(x) = x % 13
What are the probe sequences for x = 3 and x = 22?
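The probe sequence can be computed directly from the two hash functions. The sketch below assumes an array size of 19 (implied by h(x) = x % 19 but not stated on the slide); the name doubleHashProbes is chosen here.

```cpp
#include <vector>

// First `count` probe indices for double hashing with
// h(x) = x % 19 and h2(x) = x % 13, in a table of assumed size 19.
// Note: h2(x) is 0 when x is a multiple of 13, in which case the sequence
// never leaves h(x); a practical h2 should never return 0.
std::vector<int> doubleHashProbes(int x, int count) {
    const int size = 19;
    int h = x % 19;
    int h2 = x % 13;
    std::vector<int> probes;
    for (int i = 0; i < count; ++i)
        probes.push_back((h + i * h2) % size);
    return probes;
}
```

Both x = 3 and x = 22 start at index 3, but their steps differ (3 and 9), so the sequences diverge: 3, 6, 9, 12 for x = 3 and 3, 12, 2, 11 for x = 22. This is how double hashing avoids the secondary clustering of quadratic probing.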
Comparison
Collision resolution
Chaining: store a pointer to a linked list in each array position, and store the data in the linked list.
The list can be kept sorted for efficiency.
Chaining requires more space to store the pointers.
Collision resolution Separate chaining:
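A minimal sketch of separate chaining, using std::forward_list as the linked list. The struct name ChainedTable and its members are assumptions of this sketch; keys are assumed non-negative, and the lists are left unsorted for brevity.

```cpp
#include <forward_list>
#include <vector>

// Separate chaining: each array position holds a linked list of the keys
// that hash to that index.
struct ChainedTable {
    std::vector<std::forward_list<int>> buckets;

    explicit ChainedTable(int size) : buckets(size) {}

    int h(int key) const { return key % static_cast<int>(buckets.size()); }

    // Colliding keys simply share a list, so insertion never fails.
    void insert(int key) { buckets[h(key)].push_front(key); }

    // Only the list at index h(key) has to be searched.
    bool contains(int key) const {
        for (int k : buckets[h(key)])
            if (k == key) return true;
        return false;
    }
};
```

With a table of size 10, the keys 15 and 225 both land in the list at index 5, and both can still be found.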
Collision resolution
Coalesced chaining: use a 2D array of size x 2, A[size][2]; the second column stores the index of the next element in the chain.
    -2: position is available
    -1: element is last in the chain
Example: store the following data, h(x) = x % 10
12, 23, 15, 72, 49, 35, 9, 22
Collision resolution: linear probing
Example 12, 23, 15, 72, 49, 35, 9, 22 0 1 2 3 4 5 6 7 8 9
Example
12, 23, 15, 72, 49, 35, 9, 22
index  value  next
0      9      -1
1             -2
2      12      4
3      23     -1
4      72      7
5      15      6
6      35     -1
7      22     -1
8             -2
9      49      0
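The table above can be reproduced with the following sketch. It uses a struct with two fields in place of the two-column array, and it assumes that the linear probe for a free position starts just after the last element of the chain; the slides do not say where the probe starts, and for this data starting from the home position gives the same table. The names Slot, FREE, LAST, and insertCoalesced are chosen here.

```cpp
#include <vector>

const int FREE = -2;  // position is available
const int LAST = -1;  // element is last in its chain

struct Slot {
    int key = 0;
    int next = FREE;  // index of the next element in the chain, or FREE / LAST
};

// Insert with h(x) = x % size. Returns false if the table is full.
bool insertCoalesced(std::vector<Slot>& A, int key) {
    int size = static_cast<int>(A.size());
    int pos = key % size;

    if (A[pos].next == FREE) {  // home position is available
        A[pos].key = key;
        A[pos].next = LAST;
        return true;
    }

    // Collision: walk to the end of the chain that passes through pos.
    while (A[pos].next != LAST) pos = A[pos].next;

    // Linear probing for a free position, then link it to the chain's tail.
    for (int i = 1; i < size; ++i) {
        int cand = (pos + i) % size;
        if (A[cand].next == FREE) {
            A[cand].key = key;
            A[cand].next = LAST;
            A[pos].next = cand;
            return true;
        }
    }
    return false;
}
```

For instance, 22 hashes to index 2, follows the chain 2, 4, then probes 5 and 6 (both occupied) and settles at index 7, which becomes the new end of that chain.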
Deletion
What happens if you delete a value from a hash table?
Example: arrange the data 11, 34, 62, 4, 91 using h(x) = x % 10 and linear probing, then delete 34 and 62, then search for 4.
0 1 2 3 4 5 6 7 8 9
Deletion
The position of a deleted item should not be marked as empty. Why?
Can we reuse the position of the deleted element?
If you have many delete operations and few insert operations, you should rehash the table after a number of deletions.
Rehash: arrange the data again using a different table size and/or a different hash function.
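One common way to handle deletion under linear probing is to mark a deleted position with a special value (often called a tombstone) instead of marking it empty. The sketch below illustrates this; the sentinels EMPTY and DELETED and the function names are assumptions of this sketch, keys are assumed non-negative, and insertKey assumes the key is not already in the table.

```cpp
#include <vector>

const int EMPTY = -1;    // position has never held a key
const int DELETED = -2;  // position held a key that was removed (tombstone)

// Search with linear probing. Stops at a truly EMPTY position,
// but keeps probing past DELETED positions.
// Returns the index of the key, or -1 if it is not in the table.
int findSlot(const std::vector<int>& t, int key) {
    int size = static_cast<int>(t.size());
    int home = key % size;
    for (int i = 0; i < size; ++i) {
        int pos = (home + i) % size;
        if (t[pos] == EMPTY) return -1;
        if (t[pos] == key) return pos;
    }
    return -1;
}

// Insert into the first EMPTY or DELETED position on the probe path,
// so deleted positions are reused. Returns false if the table is full.
bool insertKey(std::vector<int>& t, int key) {
    int size = static_cast<int>(t.size());
    int home = key % size;
    for (int i = 0; i < size; ++i) {
        int pos = (home + i) % size;
        if (t[pos] == EMPTY || t[pos] == DELETED) {
            t[pos] = key;
            return true;
        }
    }
    return false;
}

// Delete by leaving a tombstone rather than marking the position EMPTY.
void removeKey(std::vector<int>& t, int key) {
    int pos = findSlot(t, key);
    if (pos != -1) t[pos] = DELETED;
}
```

In the example above, 4 is stored at index 5 because index 4 was taken by 34. After 34 is deleted, a search for 4 starts at index 4; if that position were marked empty the search would stop there and wrongly report that 4 is missing, whereas the tombstone lets it continue to index 5.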