CPSC 9 admin notes
- TA office hours next week:
  - Monday during LA 9 - Pearl
  - Monday during LB Andrew
  - Monday during LF Marika
  - Monday during LE Angad
  - Tuesday during LH 9 Giorgio
  - Tuesday during LG - Pearl
  - Friday during LC 9 - Pearl
  - Friday during LD Lawrence
- Online quizzes are due next Friday, Nov 8.
CPSC 9 Hashing Page

CPSC 9: Data Structures and Algorithms for Electrical Engineers
Hashing
Textbook Reference:
- Thareja first edition: Chapter, pages -7
- Thareja second edition: Chapter, pages -88
Instructor: Dr. Hassan Khosravi
September-December

Learning Goals
- Provide examples of the types of problems that can benefit from a hash data structure.
- Compare and contrast open addressing and chaining.
- Evaluate collision resolution policies.
- Describe the conditions under which hashing can degenerate from O(1) expected complexity to O(n).
- Determine the characteristics of a good hash function (e.g., uniform distribution, fast to compute, appropriate table size).
- Determine how well a hashing solution scales. Explain how to solve scalability problems.

A journey of several weeks, filled with data structures, complexity, and algorithms!
We have learned:
- Pointers, structs, and dynamic memory allocation
- Computing the complexity of our algorithms
- Linked lists, stacks, and queues
- Recursion and its applications
- Binary trees and binary search trees
- Binary heaps and heapsort
- Sorting

Hashing
A data structure that allows O(1) search, insert, and delete.
Example application areas:
- Biometrics and Identity Management
- Digital signatures
- Public-Key Infrastructure

Introductory Examples
- We have studied two search algorithms:
  - Linear search has running time O(n).
  - Binary search has running time O(log n).
- Can we do better?
- Example (natural, numeric keys): In a small company of 100 employees, each employee is assigned an Emp_ID number in the range 0 to 99. To store the employees' records in an array, each employee's Emp_ID acts as the index of the array cell where that employee's record is stored:

  KEY       ARRAY OF EMPLOYEES' RECORDS
  Key 0     [0]   Record of employee having Emp_ID 0
  Key 1     [1]   Record of employee having Emp_ID 1
  ...
  Key 99    [99]  Record of employee having Emp_ID 99

Introductory Examples
- Let's assume that the same company uses a five-digit Emp_ID number as the primary key. In this case, key values will range from 00000 to 99999. If we want to use the same technique as above, we will need an array of size 100,000, of which only 100 elements will be used.

  KEY          ARRAY OF EMPLOYEES' RECORDS
  Key 00000    [0]      Record of employee having Emp_ID 00000
  ...
  Key n        [n]      Record of employee having Emp_ID n
  ...
  Key 99999    [99999]  Record of employee having Emp_ID 99999

- It is impractical to waste that much storage just to ensure that each employee's record is in a unique and predictable location.

Introductory Examples
- Another possible option is to use just the last two digits of the key to identify each employee, so that a record is stored at the index given by the last two digits of the employee's Emp_ID.
- Would this be a good idea? Why? Why not?

Example (characters): Suppose we want to design a frequency table for counting the number of times that each character appears in a file. We can use the character's ASCII code (8 bits = 256 possibilities) as the key to index into an array of frequencies (the values): e.g., 'A' = 65, 'a' = 97.
In this example, our array of frequencies will have size 256: one cell for each of the 256 different character codes. It may well be that not all of these characters appear in a given file, so we are potentially wasting memory. What we gain is the benefit of accessing frequencies in O(1) time. Is this a good time/space trade-off?
Suppose we make the call:
    int freqa = getfrequency('A');
Since the ASCII code for 'A' has the numeric value 65, freqa is set to the frequency found at array index 65.

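The frequency table described above can be sketched in C as follows. This is a minimal illustration, not the course's implementation: it counts characters in a string rather than a file, and the names countcharacters and NUM_CHARS are assumptions introduced here (only getfrequency appears on the slide).

```c
#define NUM_CHARS 256  /* 2^8 possible values of an 8-bit character code (assumed name) */

/* Frequency table: the character code itself is the array index. */
static int frequencies[NUM_CHARS];

/* Count every character of a NUL-terminated string (hypothetical helper). */
void countcharacters(const char *text) {
    for (int i = 0; text[i] != '\0'; i++) {
        /* Cast to unsigned char so codes above 127 never give a negative index. */
        frequencies[(unsigned char)text[i]]++;
    }
}

/* O(1) lookup: a single array access, no searching. */
int getfrequency(char c) {
    return frequencies[(unsigned char)c];
}
```

After countcharacters("ABRACADABRA"), getfrequency('A') reads the count stored at index 65. Most of the 256 cells stay at zero, which is the memory cost paid for constant-time access.
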
Example (long numeric keys): Suppose we want to keep track of Visa credit card accounts. Visa card numbers are long: 16 digits, allowing for a total of 10^16 different numbers. If we were to build an array capable of holding 10^16 accounts, it would have far more cells than there are people on Earth! (We couldn't hold the entire table in memory, either.) It is not reasonable to use the Visa credit card number to index into an array. We'll see how to solve this problem momentarily.
Example (strings): What if the key is a string? For example, National Insurance numbers in the United Kingdom are alphanumeric. We cannot use a non-numeric string to index into an array. Solution: convert the string to an int?
Hashing solves these problems. How? We define a function that transforms keys into a numeric array index. Such a function is called a hash function.

Hash function
- A hash function h is simply a mathematical formula which, when applied to the key, produces an integer that can be used as the index for that key in the hash table.
- The goal of a hash function is to produce a unique set of integers within some suitable range, dispersing the keys in an apparently random, uniform way. Why? To avoid clusters (collisions) and keep access O(1).
- As much as we would like every key to get a unique index, this is not always possible, so two or more keys might map to the same memory location. This is called a collision.

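Hash Table Terminology
[Figure: the keys Alan, Steve, Hassan, Kendra, and Ed are passed through a hash function f(x) into the cells of a hash table; two keys landing in the same cell are labelled a collision.]
load factor λ = (# of entries in table) / tablesize
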
A Good Hash Function
- Is easy (fast) to compute: O(1).
- Distributes the data evenly (hash(a) ≠ hash(b), as much as possible).
- Uses the whole hash table (for all 0 ≤ k < size, there's an i such that hash(i) % size = k).

Strategies for Building Hash Functions
Truncation: Use only part of the key as the hash index. For example, if we want to build a table of student records at UBC, then rather than using the entire student number (8 digits) as the key, we could use only its first few digits. Using the first d digits, we would require an array with space for up to 10^d student records. Note that some students may have those leading digits of their student number in common, and so collisions would occur.
Folding: Partition the key into parts and combine the parts using arithmetic operations. For example, we could take a Visa number and partition it into three shorter numbers. We could then add these three numbers together to come up with a hash index.

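Both strategies can be sketched in a few lines of C. The function names, the keep parameter, and the 4 + 6 + 6 digit split of the card number are assumptions made for illustration; the slide does not prescribe them.

```c
/* Truncation: keep only the leading `keep` digits of an 8-digit student number. */
unsigned int truncatehash(unsigned int studentnumber, int keep) {
    unsigned int divisor = 1;
    for (int i = 0; i < 8 - keep; i++) {
        divisor *= 10;   /* each factor of 10 drops one trailing digit */
    }
    return studentnumber / divisor;
}

/* Folding: split a 16-digit number into three parts
   (4 + 6 + 6 digits, an assumed split) and add the parts together. */
unsigned long long foldhash(unsigned long long cardnumber) {
    unsigned long long low  = cardnumber % 1000000ULL;                 /* last 6 digits   */
    unsigned long long mid  = (cardnumber / 1000000ULL) % 1000000ULL;  /* middle 6 digits */
    unsigned long long high = cardnumber / 1000000000000ULL;           /* first 4 digits  */
    return high + mid + low;
}
```

With these definitions, truncatehash(12345678, 3) keeps 123, and foldhash(1234567890123456) adds 1234 + 567890 + 123456. The folded sum can still exceed the table size, so in practice it would be reduced further, for example with the modulus operator.
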
Strategies for Building Hash Functions
Modular Arithmetic: Use the modulus (remainder) operator (% in C) to produce a hash index in a given range: key % tablesize always lies between 0 and tablesize - 1.
For example, suppose a company wants to use each employee's Social Insurance Number as a key. We might use an array (hash table) with more cells than there are employees, thereby leaving room for expansion. We can then generate a hash index from the Social Insurance Number as follows:
    hashindex = SIN % tablesize;
Again, we would expect this hash function to lead to collisions.

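As a small sketch of the modular approach, the function below maps any non-negative key into the valid index range. The function name and the table size of 1000 used in the note that follows are illustrative assumptions, not values from the slide.

```c
/* Map a non-negative key into the range [0, tablesize - 1]. */
int modhash(long key, int tablesize) {
    return (int)(key % tablesize);
}
```

With tablesize = 1000, the keys 123456789 and 987654789 both map to index 789: any two keys that differ by a multiple of the table size collide.
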
Alphanumeric Keys
Suppose we have a table capable of holding tablesize records, whose keys are strings that are 6 characters long. We can apply numeric operations to the ASCII codes of the characters in the string in order to determine a hash index:

    int hash(char *key) {
        int hashcode = 0;
        int index = 0;
        /* add up the ASCII codes of all characters in the key */
        while (key[index] != '\0') {
            hashcode += (int)key[index];
            index++;
        }
        return hashcode % tablesize;   /* tablesize: number of cells in the table */
    }

Example: for the key CAMECO, C = 67, A = 65, M = 77, E = 69, C = 67, O = 79, so hashcode = 424 before the modulus is taken.

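Alphanumeric Keys
What is a significant problem with this approach?
- Any two strings made of the same six letters, in any order, have the same hash.
- The sum of six ASCII codes is small: with 7-bit ASCII the maximum is 6 * 127 = 762, so every table index above the largest possible sum is wasted.
A solution: multiply the running hashcode by a fixed constant (multiplier below) before adding each character, so that the position of each character affects the result:

    hashcode = 0;
    while (key[index] != '\0') {
        hashcode = multiplier * hashcode + key[index];
        index++;
    }
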
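The difference between the two string hashes can be checked with a short sketch. The multiplier 31, the table size passed by the caller, and both function names are assumptions chosen for illustration; the slide's own constants are not reproduced here.

```c
/* Additive hash: sums the character codes, so anagrams always collide. */
unsigned int additivehash(const char *key, unsigned int tablesize) {
    unsigned int hashcode = 0;
    for (int i = 0; key[i] != '\0'; i++) {
        hashcode += (unsigned char)key[i];
    }
    return hashcode % tablesize;
}

/* Positional hash: multiplying by a constant before each addition
   makes the result depend on the order of the characters. */
unsigned int positionalhash(const char *key, unsigned int tablesize) {
    unsigned int hashcode = 0;
    for (int i = 0; key[i] != '\0'; i++) {
        /* 31 is an assumed multiplier; reducing at every step keeps the value small. */
        hashcode = (31 * hashcode + (unsigned char)key[i]) % tablesize;
    }
    return hashcode;
}
```

For a table of 5000 cells, additivehash gives 424 for both CAMECO and its reversal OCEMAC, while positionalhash sends them to different cells (4354 and 494).
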
Collisions
If a hash function h maps two different keys x and y to the same index (i.e., x ≠ y and h(x) = h(y)), then x and y collide.
A perfect hash function causes no collisions. Unfortunately, creating a perfect hash function requires knowledge of what keys will be hashed.
Even a hash function that distributes items randomly (why is this not a practical hash function?) will cause collisions even when the number of items hashed is small. To see why, consider the birthday paradox from statistics: the probability of finding 2 or more people that share the same birthday, in a room of n people, is > 50% if n ≥ 23.
So, we must design a collision handling scheme.

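The birthday figure can be verified numerically. The sketch below assumes 365 equally likely birthdays and multiplies the probabilities that each new person avoids all earlier birthdays; the function name is an assumption introduced here.

```c
/* Probability that at least two of n people share a birthday,
   assuming 365 equally likely birthdays. */
double sharedbirthdayprobability(int n) {
    double alldistinct = 1.0;
    for (int i = 0; i < n; i++) {
        /* person i must avoid the i birthdays already taken */
        alldistinct *= (365.0 - i) / 365.0;
    }
    return 1.0 - alldistinct;
}
```

The result first exceeds 0.5 at n = 23. In hashing terms, 23 keys dropped uniformly at random into a 365-cell table are more likely than not to produce at least one collision, even though the table is under 7% full.
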
The Pigeonhole Principle (informal)
You can't put k+1 pigeons into k holes without putting two pigeons in the same hole.
[Image by en:user:mckay, used under CC attr/share-alike. Caption: "This place just isn't coo anymore."]
The pigeonhole principle says we can't avoid all collisions:
- try to hash m keys into n slots with m > n without a collision
- try to put more pigeons into holes than there are holes

Clicker question
Suppose we have several colours of Halloween candy, and that there's lots of candy in a bag. How many pieces of candy do we have to pull out of the bag if we want to be sure to get a given number of pieces of the same colour?

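Questions of this kind can be reasoned about with the generalized pigeonhole bound. As a sketch, write k for the number of colours and r for the number of same-coloured pieces wanted (both symbols are introduced here, not taken from the slide). The worst case is drawing r - 1 pieces of every colour before the next piece completes a set:

```latex
\[
  \underbrace{k\,(r-1)}_{\text{worst case: } r-1 \text{ of each colour}} \;+\; 1
  \;=\; k(r-1) + 1
  \quad \text{pieces guarantee } r \text{ of one colour.}
\]
```

For r = 2 this reduces to k + 1, which is the informal statement of the pigeonhole principle.
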
Collision Resolution
- The pigeonhole principle says we can't avoid all collisions:
  - try to hash m keys into n slots with m > n without a collision
  - try to put more pigeons into holes than there are holes
- What do we do when two keys hash to the same entry?
  - chaining: put a linked list in each entry (shove extra pigeons in one hole!)
  - open addressing: pick a next entry to try

Hashing Method #1: Chaining
Each table location points to a linked list (chain) of items that hash to this location. When we search for an item, we apply the hash function to get its hash index and then perform a linear search of that linked list. If the item is not in the list, then it's not in the table.
Properties:
- λ can be greater than 1
- performance degrades with the length of the chains
[Figure: a table in which a and d share one chain (h(a) = h(d)), e and b share another (h(e) = h(b)), and c is alone in its cell.]
load factor λ = (# of entries in table) / tablesize

In-class exercise
Example: Hash a list of integer keys into a table using chaining with the given hash function, then find the node with a given key.
[Figure: the resulting table of chains.]

Deleting when using chaining
To delete a key, hash it to find its chain, then remove its node from the linked list.
[Figure: the table of chains before and after the deletion.]

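Insertion, search, and deletion with chaining can be sketched as below. This is one possible arrangement, not the course's reference code: the table size of 7, the integer keys, the hash key % TABLESIZE, and all function names are assumptions made for the example.

```c
#include <stdlib.h>

#define TABLESIZE 7   /* illustrative table size (assumption) */

struct node {
    int key;
    struct node *next;
};

/* Each cell is the head of a chain; NULL means the chain is empty. */
static struct node *chains[TABLESIZE];

static int chainhash(int key) {
    return key % TABLESIZE;   /* assumes non-negative keys */
}

/* Insert at the head of the chain: O(1). Returns 0 if allocation fails. */
int chaininsert(int key) {
    struct node *n = malloc(sizeof *n);
    if (n == NULL) return 0;
    int index = chainhash(key);
    n->key = key;
    n->next = chains[index];
    chains[index] = n;
    return 1;
}

/* Linear search of a single chain. Returns 1 if the key is present. */
int chainsearch(int key) {
    for (struct node *cur = chains[chainhash(key)]; cur != NULL; cur = cur->next) {
        if (cur->key == key) return 1;
    }
    return 0;
}

/* Unlink and free the first node holding key. Returns 1 if a node was removed. */
int chaindelete(int key) {
    struct node **link = &chains[chainhash(key)];
    while (*link != NULL) {
        if ((*link)->key == key) {
            struct node *victim = *link;
            *link = victim->next;   /* bypass the node, then release it */
            free(victim);
            return 1;
        }
        link = &(*link)->next;
    }
    return 0;
}
```

The keys 10, 17, and 24 all hash to cell 3 here, so they end up in one chain. Deleting 17 only relinks that chain, and no other cell of the table is touched.
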
Pros and cons of chaining
Advantages of chaining:
- The size s of the hash table can be smaller than the number of items n hashed. Why is this often a good thing?
- Fewer blank/wasted cells (especially in the case where the number of cells greatly exceeds the number of keys).
- Collision handling is easy: O(1).
- Can accommodate overflows.
Disadvantages of chaining:
- Search time can become O(n) due to long chains.
- Overhead of pointers: extra space, extra work.

CPSC 9 admin notes
- TA office hours for today:
  - Friday during LC 9 - Pearl
  - Friday during LD Lawrence
- My office hour: Monday -
- Online quizzes are due tonight :9pm
- Participation (clicker) grades are now available on Connect. Check your grade. I still have active clickers that have not been registered.
- Final: CPSC9 DEC, : PM OSBO A
- Other than a calculator, no electronic equipment is allowed. You may not use your cell phone as a calculator.

Hashing Method #2: Open Addressing
There is no linked list here. We have a hash table containing key-value pairs, and we handle collisions by trying a sequence of table entries until either the item is found in the table or we reach an empty cell.
Properties:
- λ ≤ 1
- performance degrades with the difficulty of finding the right spot
[Figure: a, d, e, b, and c are each stored directly in a table cell, with h(a) = h(d) and h(e) = h(b).]
load factor λ = (# of entries in table) / tablesize

Probing
The probing sequence is:
- h(k) mod size
- if h(k) mod size is occupied, try (h(k) + f(1)) mod size
- if (h(k) + f(1)) mod size is occupied, try (h(k) + f(2)) mod size
- and so forth
Probing properties:
- the i-th probe is to (h(k) + f(i)) mod size, where f(0) = 0
- if i reaches size, the insert has failed
- depending on f(), the insert may fail sooner
- long sequences of probes are costly!

Linear probing: f(i) = i
The probe sequence is:
- h(k) mod size
- (h(k) + 1) mod size
- (h(k) + 2) mod size
- ...
For any λ < 1, linear probing will find an empty slot.

In-class exercise
Using the hash function h(x) = x % 7, insert a sequence of keys into a 7-cell table using linear probing, and record the number of probes each insert needs.
[Table: contents of cells 0 to 6 after each insert, with the probe count for each insert.]

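A linear probing insert that also reports its probe count might look like the sketch below. The EMPTY marker, the function name, and the sample keys mentioned afterwards are assumptions chosen for illustration.

```c
#define SIZE 7       /* table size, matching h(x) = x % 7 */
#define EMPTY (-1)   /* marks a free cell; assumes keys are non-negative */

/* Insert key using linear probing, f(i) = i.
   Returns the number of probes used, or 0 if the table is full. */
int linearinsert(int table[], int key) {
    for (int i = 0; i < SIZE; i++) {
        int index = (key % SIZE + i) % SIZE;   /* i-th probe */
        if (table[index] == EMPTY) {
            table[index] = key;
            return i + 1;
        }
    }
    return 0;
}
```

Starting from an empty table and inserting the illustrative keys 76, 93, 40, 47, 10, 55 in that order takes 1, 1, 1, 3, 1, and 3 probes. The key 47 collides at cell 5, finds cell 6 taken as well, and wraps around to cell 0; 55 then has to walk past that growing cluster to reach cell 1.
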
Problems with linear probing
- The algorithm results in clustering: where there has been one collision, there will be more.
- Clusters of consecutive cells are formed, and the time required for a search increases with the size of the cluster.
- Also, when a new value has to be inserted at a position that is already occupied, it is placed at the end of the cluster, which makes the cluster even longer.
- This phenomenon is called primary clustering. To avoid primary clustering, other techniques such as quadratic probing and double hashing are used.

Quadratic Probing: f(i) = i^2
The probe sequence is:
- h(k) mod size
- (h(k) + 1) mod size
- (h(k) + 4) mod size
- (h(k) + 9) mod size
- ...

Quadratic Probing Example
Using the hash function h(x) = x % 7, insert a sequence of keys into a 7-cell table using quadratic probing, and record the number of probes each insert needs.
[Table: contents of cells 0 to 6 after each insert, with the probe count for each insert.]

Quadratic Probing Example (unsuccessful insert)
Using the hash function h(x) = x % 7, insert a second sequence of keys into a 7-cell table using quadratic probing, and record the number of probes each insert needs.
[Table: contents of cells 0 to 6 after each insert, with the probe count for each insert.]

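Quadratic probing changes only the offset added on each probe. The sketch below uses an assumed EMPTY marker, function name, and key set to show that an insert can fail even when free cells remain.

```c
#define EMPTY (-1)   /* marks a free cell; assumes keys are non-negative */

/* Insert key using quadratic probing, f(i) = i * i.
   Returns the number of probes used, or 0 if no free cell was reached. */
int quadraticinsert(int table[], int size, int key) {
    for (int i = 0; i < size; i++) {
        int index = (key % size + i * i) % size;   /* i-th probe */
        if (table[index] == EMPTY) {
            table[index] = key;
            return i + 1;
        }
    }
    return 0;   /* the probe sequence never reached a free cell */
}
```

With size 7 and the illustrative keys 76, 93, 40, 35 already placed in cells 6, 2, 5, and 0, inserting 47 probes cells 5, 6, 2, 0, 0, 2, 6 and gives up, although cells 1, 3, and 4 are still free: the quadratic offsets modulo 7 only ever reach four distinct cells.
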
Clicker question
Example: Using a hash function of the form h(x) = x mod tablesize, hash a given sequence of keys using quadratic probing. Which cell in the array does a particular key hash to?
[Figure: an array with cells numbered up to 9.]

Problems with quadratic probing
- Quadratic probing resolves the primary clustering problem that exists in linear probing.
- One of its major drawbacks is that a sequence of successive probes may explore only a fraction of the table, and this fraction may be quite small. In that case we will not be able to find an empty location even though the table is by no means full (see the example from a few pages ago).
- Quadratic probing also suffers from secondary clustering: if two keys collide, the same probe sequence is followed for both.

Double Hashing: f(i) = i * h2(k)
The probe sequence is:
- h1(k) mod size
- (h1(k) + h2(k)) mod size
- (h1(k) + 2 * h2(k)) mod size
- ...
A good second hash function:
- is quick to evaluate,
- differs from the original hash function,
- never evaluates to 0 (mod size).

Double Hashing Example
Using the primary hash function h1(x) = x % 7 and a secondary hash function of the form h2(x) = c - (x % c) for a small constant c, insert a sequence of keys into a 7-cell table using double hashing, and record the number of probes each insert needs.
[Table: contents of cells 0 to 6 after each insert, with the probe count for each insert.]

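Double hashing can be sketched the same way as the other probing schemes. Here the secondary function 5 - (key % 5), the EMPTY marker, the function names, and the sample keys are all assumptions chosen for illustration; the step is always between 1 and 5, so it is never 0 modulo a table of size 7.

```c
#define EMPTY (-1)   /* marks a free cell; assumes keys are non-negative */

/* Primary hash: the first cell to try. */
static int h1(int key, int size) { return key % size; }

/* Secondary hash (assumed form): the step between probes, always in 1..5. */
static int h2(int key) { return 5 - (key % 5); }

/* Insert key using double hashing, f(i) = i * h2(key).
   Returns the number of probes used, or 0 if no free cell was reached. */
int doubleinsert(int table[], int size, int key) {
    for (int i = 0; i < size; i++) {
        int index = (h1(key, size) + i * h2(key)) % size;   /* i-th probe */
        if (table[index] == EMPTY) {
            table[index] = key;
            return i + 1;
        }
    }
    return 0;
}
```

Inserting the illustrative keys 76, 93, 40, 47, 10, 55 into a 7-cell table takes 1, 1, 1, 2, 1, and 2 probes. The keys 40 and 47 both start at cell 5, but 47 then steps by 3 to cell 1, and 55 (starting at cell 6) steps by 5 to cell 4, so colliding keys do not pile up in one cluster.
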
Clicker question
The primary hash function h1(k) is a linear function of k taken mod the table size. The secondary hash function is h2(k) = 7 - (k mod 7). Hash a given list of keys, in order, using double hashing. Which cell in the array does a particular key hash to?
Clicker question (answer)
For each key, compute h1(k) first. Whenever that cell is occupied, step forward by h2(k) = 7 - (k mod 7) cells (mod the table size) until a free cell is found.

Deleting when using probing
Example: Suppose locations [97] to [101] are occupied in our hash table:

  [0]  [1]  ...  [97]  [98]  [99]  [100]  [101]  [102]
                 dog   cat   goat  owl    deer

Suppose a new key hashes to [97]. Assuming a linear collision resolution policy, the key goes to [102].
Later, suppose we delete the key that was hashed to [98]. Add a tombstone (i.e., a flag or marker) there, for two reasons:
1. If searching, keep going when you hit a tombstone.
2. If inserting, stop and add the item here.
This means that table entries can be occupied, deleted, or free.

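The three cell states can be sketched with a small enum. This example assumes linear probing, integer keys, the hash key % size, and that an inserted key is not already in the table; the type and function names are introduced here for illustration.

```c
enum state { FREE, OCCUPIED, DELETED };   /* DELETED is the tombstone */

struct cell {
    enum state state;
    int key;
};

/* Search: skip tombstones, stop at the first FREE cell.
   Returns the index of the key, or -1 if it is absent. */
int probesearch(const struct cell table[], int size, int key) {
    for (int i = 0; i < size; i++) {
        int index = (key % size + i) % size;
        if (table[index].state == FREE) return -1;
        if (table[index].state == OCCUPIED && table[index].key == key) return index;
    }
    return -1;
}

/* Delete: leave a tombstone so later searches keep probing past this cell. */
int probedelete(struct cell table[], int size, int key) {
    int index = probesearch(table, size, key);
    if (index < 0) return 0;
    table[index].state = DELETED;
    return 1;
}

/* Insert: take the first FREE or DELETED cell (assumes key is not present).
   Returns the index used, or -1 if the table is full. */
int probeinsert(struct cell table[], int size, int key) {
    for (int i = 0; i < size; i++) {
        int index = (key % size + i) % size;
        if (table[index].state != OCCUPIED) {
            table[index].state = OCCUPIED;
            table[index].key = key;
            return index;
        }
    }
    return -1;
}
```

If the deleted cell were simply marked FREE, a search for a key stored further along the same probe sequence would stop too early and wrongly report the key as missing. The tombstone keeps the sequence intact while still letting a later insert reuse the cell.
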
Open Addressing summarized
Advantage of open addressing:
- No memory is wasted on pointers.
Performance of hashing depends on:
- the quality of the hash function (includes N)
- the collision resolution algorithm
- the available space in the hash table
To estimate how full a table is, we calculate the load factor α (written λ on earlier slides):
    α = (# entries in table) / (table size)
For open addressing, α ≤ 1. For chaining, α has no upper bound and can be greater than 1.

Learning Goals revisited
- Provide examples of the types of problems that can benefit from a hash data structure.
- Compare and contrast open addressing and chaining.
- Evaluate collision resolution policies.
- Describe the conditions under which hashing can degenerate from O(1) expected complexity to O(n).
- Identify the types of search problems that do not benefit from hashing (e.g., range searching) and explain why.
- Determine the characteristics of a good hash function (e.g., uniform distribution, onto function, appropriate table size).
- Determine how well a hashing solution scales. Explain how to solve scalability problems.