Announcements
Today's topic: Hashing (Ch. 10)
Next topic: Graphs
Break around 11:45am

Container structures so far
Array lists
  O(1) access
  O(n) insertion/deletion (average case), better at end
Linked lists
  O(n) access
  O(n) insertion/deletion (average case), better at front and back
Binary search trees
  O(log n) access if balanced
  O(log n) insertion/deletion if balanced
Heaps
  O(1) access of min/max
  O(log n) insertion
  O(log n) deletion (average case)
Can we do even better?

Sets
set: A collection of unique values (no duplicates allowed) that can perform the following operations efficiently: add, remove, search (contains).
The client doesn't think of a set as having indices; we just add things to the set in general and don't worry about order.
[Figure: a set holding the words "the", "if", "of", "to", "down", "from", "by", "she", "in", "you", "why", "him"; set.contains("to") is true, set.contains("be") is false]

IntSet ADT interface
Let's think about how to write an implementation of a set.
To simplify the problem, we only store ints in our set for now.
As is (usually) done in the Java Collections Framework, we will define sets as an ADT by creating a Set interface.
Core operations are: add, contains, remove.

    public interface IntSet {
        void add(int value);
        boolean contains(int value);
        void clear();
        boolean isEmpty();
        void remove(int value);
        int size();
    }
BST as a set
We can implement a set as a binary search tree.
O(log n) performance (if balanced) for: add, contains, remove
But there are other ways to implement a set, perhaps with better performance.
Is there a way to use an array's fast, O(1), access?
[Figure: binary search tree with root 55 holding -3, 29, 42, 55, 60, 87, 91]

Unfilled array set?
Consider storing a set in an unfilled array.
It doesn't really matter what order the elements appear in a set, so long as they can be added and searched quickly.
What would make a good ordering for the elements?
If we store them in the next available index, as in a list:

    set.add(9);
    set.add(23);
    set.add(8);
    set.add(-3);
    set.add(49);
    set.add(12);

    value    9  23   8  -3  49  12   0   0   0   0
    size     6

How efficient is add? contains? remove?
  O(1), O(n), O(n)
  (add is O(1) only if it does not first check for a duplicate)

Sorted array set?
Suppose we store the elements in an unfilled array, but in sorted order rather than order of insertion.

    set.add(9);
    set.add(23);
    set.add(8);
    set.add(-3);
    set.add(49);
    set.add(12);

    value   -3   8   9  12  23  49   0   0   0   0
    size     6

How efficient is add? contains? remove?
  O(n), O(log n), O(n)

A strange idea
Silly idea: When the client adds i, store it at index i in the array.
Would this work?
Problems/drawbacks of this approach? How to work around them?

    set.add(7);
    set.add(1);
    set.add(9);

    index    0   1   2   3   4   5   6   7   8   9
    value    0   1   0   0   0   0   0   7   0   9
    size     3

    set.add(18);
    set.add(12);

    index    0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19
    value    0   1   0   0   0   0   0   7   0   9   0   0  12   0   0   0   0   0  18   0
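The sorted array set above can be sketched as follows. This is only an illustration, not course code: the class name SortedArraySetDemo and the doubling policy are assumptions. Binary search gives the O(log n) contains, while add still pays O(n) to shift elements.

```java
import java.util.Arrays;

class SortedArraySetDemo {
    private int[] elements = new int[10];   // unfilled array, first `size` slots used
    private int size = 0;

    // O(log n): binary search over the filled prefix.
    boolean contains(int value) {
        return Arrays.binarySearch(elements, 0, size, value) >= 0;
    }

    // O(n): finding the position is fast, but shifting elements is linear.
    void add(int value) {
        int pos = Arrays.binarySearch(elements, 0, size, value);
        if (pos >= 0) return;                       // already present
        int at = -(pos + 1);                        // insertion point
        if (size == elements.length) {
            elements = Arrays.copyOf(elements, 2 * size);
        }
        System.arraycopy(elements, at, elements, at + 1, size - at);
        elements[at] = value;
        size++;
    }

    int[] toArray() {
        return Arrays.copyOf(elements, size);
    }

    public static void main(String[] args) {
        SortedArraySetDemo set = new SortedArraySetDemo();
        for (int v : new int[] {9, 23, 8, -3, 49, 12}) set.add(v);
        System.out.println(Arrays.toString(set.toArray()));  // [-3, 8, 9, 12, 23, 49]
        System.out.println(set.contains(12));                // true
        System.out.println(set.contains(10));                // false
    }
}
```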
Hashing
hash: To map a large domain of values to a smaller fixed domain.
  Typically, mapping a set of elements to integer indices in an array.
  Idea: Store any given element value in a particular predictable index.
  That way, adding/removing/looking for it are constant-time (O(1)).
hash table: An array that stores elements via hashing.
hash function: An algorithm that maps values to indices.
hash code: The output of a hash function for a given value.
In the previous slide, our "hash function" was: hash(i) = i
  Potentially requires a large array.
  Doesn't work for negative numbers.
  Array could be very sparse, mostly empty (memory waste).

Hashing overview
The hash function, hash(), maps a range of elements of arbitrary type into an integer range [0, M-1].
[Figure: element -> hash() -> hash code (integer index) -> hash table (array, M buckets, indices 0, 1, 2, 3, ..., M-1)]

Improved hash function
To deal with negative numbers: hash(i) = abs(i)
To deal with large numbers: hash(i) = abs(i) % length

    set.add(37);   // abs(37) % 10 == 7
    set.add(-2);   // abs(-2) % 10 == 2
    set.add(49);   // abs(49) % 10 == 9

    index    0   1   2   3   4   5   6   7   8   9
    value    0   0  -2   0   0   0   0  37   0  49
    size     3

    // inside HashIntSet class
    private int hash(int i) {
        return Math.abs(i) % elements.length;
    }

Sketch of implementation

    public class HashIntSet implements IntSet {
        private int[] elements;   // the hash table

        public void add(int value) {
            elements[hash(value)] = value;
        }

        public boolean contains(int value) {
            return elements[hash(value)] == value;
        }

        public void remove(int value) {
            elements[hash(value)] = 0;
        }
    }

Runtime of add, contains, and remove: O(1)
Are there any problems with this approach?
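To make the closing question concrete, here is a minimal sketch of the scheme above with no collision handling. The class name NaiveHashDemo and the fixed table length of 10 are assumptions for illustration. It shows two of the problems: a second value with the same index overwrites the first, and 0 cannot be told apart from an empty slot.

```java
class NaiveHashDemo {
    private final int[] elements = new int[10];   // the hash table, 0 = empty

    private int hash(int value) {
        return Math.abs(value) % elements.length;
    }

    void add(int value) {
        elements[hash(value)] = value;
    }

    boolean contains(int value) {
        return elements[hash(value)] == value;
    }

    public static void main(String[] args) {
        NaiveHashDemo set = new NaiveHashDemo();
        set.add(24);
        set.add(54);                            // same index 4, overwrites 24
        System.out.println(set.contains(54));   // true
        System.out.println(set.contains(24));   // false: 24 was silently lost
        System.out.println(set.contains(0));    // true, although 0 was never added
    }
}
```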
Hash function
In general, any function that maps from the space of elements to the space of array indices is a valid hash function, but a good hash function spreads the indices out over the entire hash table (array).
A good hash function also tries to avoid collisions: multiple elements having the same index in the hash table.

Collisions
collision: When a hash function maps 2 values to the same index.

    set.add(11);
    set.add(49);
    set.add(24);
    set.add(37);
    set.add(54);   // collides with 24!

    index    0   1   2   3   4   5   6   7   8   9
    value    0  11   0   0  54   0   0  37   0  49

Uniform hashing assumption: Hashing is most efficient when the index values are spread evenly throughout the table.
collision resolution: An algorithm for fixing collisions.
  Probing
  Separate chaining
  etc.

Probing
probing: Resolving a collision by moving to another index.
linear probing: Moves to the next available index (wraps if needed).

    set.add(11);
    set.add(49);
    set.add(24);
    set.add(37);
    set.add(54);   // collides with 24; must probe

    index    0   1   2   3   4   5   6   7   8   9
    value    0  11   0   0  24  54   0  37   0  49

variation: quadratic probing moves increasingly far away: +1, +4, +9, ...

Implementing HashIntSet
Let's implement an int set using a hash table with linear probing.
For simplicity, assume that the set cannot store 0s for now.

    public class HashIntSet implements IntSet {
        private int[] elements;
        private int size;

        // constructs new empty set
        public HashIntSet() {
            elements = new int[10];
            size = 0;
        }

        // hash function maps values to indices
        private int hash(int value) {
            return Math.abs(value) % elements.length;
        }
    }
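The two probing variants described above differ only in which indices they examine after the home index. The sketch below lists those indices; the class and method names are made up for illustration, and the quadratic offsets k*k follow the "+1, +4, +9" pattern from the slide.

```java
import java.util.Arrays;

class ProbeSequences {
    // Linear probing examines home, home+1, home+2, ... (wrapping around).
    static int[] linear(int home, int length, int count) {
        int[] seq = new int[count];
        for (int k = 0; k < count; k++) {
            seq[k] = (home + k) % length;
        }
        return seq;
    }

    // Quadratic probing examines home, home+1, home+4, home+9, ... (wrapping around).
    static int[] quadratic(int home, int length, int count) {
        int[] seq = new int[count];
        for (int k = 0; k < count; k++) {
            seq[k] = (home + k * k) % length;
        }
        return seq;
    }

    public static void main(String[] args) {
        // 54 hashes to index 4 in a table of length 10.
        System.out.println(Arrays.toString(linear(4, 10, 4)));      // [4, 5, 6, 7]
        System.out.println(Arrays.toString(quadratic(4, 10, 4)));   // [4, 5, 8, 3]
    }
}
```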
The add operation
How do we add an element to the hash table?
  Use the hash function to find the proper bucket index.
  If we see a 0, put it there, i.e., 0 means an available slot.
  If not, move forward until we find an empty (0) index to store it.
  If we see that the value is already in the table, don't re-add it.

    set.add(54);   // client code
    set.add();

    value    0 11 0 0 24 54 37 0 49
    size     6

Implementing add
How do we add an element to the hash table?

    public void add(int value) {
        int h = hash(value);
        while (elements[h] != 0 && elements[h] != value) {   // linear probing
            h = (h + 1) % elements.length;                   // for empty slot
        }
        if (elements[h] != value) {                          // avoid duplicates
            elements[h] = value;                             // add it here
            size++;
        }
    }

    index    0   1   2   3   4   5   6   7   8   9
    value    0  11   0   0  24  54   0  37   0  49

The contains operation
How do we search for an element in the hash table?
  Use the hash function to find the proper bucket index.
  Loop forward until we either find the value, or an empty index (0).
  If we find the value, it is contained (true). If we find 0, it is not (false).
  We assume that the table is never full.

    set.contains(24)   // true
    set.contains()     // true
    set.contains(35)   // false

    value    0 11 0 0 24 54 37 0 49
    size     6

Implementing contains

    public boolean contains(int value) {
        int h = hash(value);
        while (elements[h] != 0) {               // linear probing
            if (elements[h] == value) {          // to search
                return true;
            }
            h = (h + 1) % elements.length;
        }
        return false;                            // not found
    }

    index    0   1   2   3   4   5   6   7   8   9
    value    0  11   0   0  24  54   0  37   0  49
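Putting add and contains together, the following self-contained sketch reproduces the table from the slides for the values 11, 49, 24, 37, 54. The class name ProbingSetDemo and the table() accessor are assumptions added so the result can be printed; as in the slides, 0 cannot be stored and the table is assumed never to fill up.

```java
import java.util.Arrays;

class ProbingSetDemo {
    private final int[] elements = new int[10];   // 0 marks an empty slot
    private int size = 0;

    private int hash(int value) {
        return Math.abs(value) % elements.length;
    }

    void add(int value) {
        int h = hash(value);
        while (elements[h] != 0 && elements[h] != value) {
            h = (h + 1) % elements.length;        // probe the next index
        }
        if (elements[h] != value) {               // skip duplicates
            elements[h] = value;
            size++;
        }
    }

    boolean contains(int value) {
        int h = hash(value);
        while (elements[h] != 0) {
            if (elements[h] == value) return true;
            h = (h + 1) % elements.length;
        }
        return false;
    }

    int size() { return size; }

    int[] table() { return elements.clone(); }

    public static void main(String[] args) {
        ProbingSetDemo set = new ProbingSetDemo();
        for (int v : new int[] {11, 49, 24, 37, 54, 54}) set.add(v);
        System.out.println(Arrays.toString(set.table()));  // [0, 11, 0, 0, 24, 54, 0, 37, 0, 49]
        System.out.println(set.size());                     // 5
        System.out.println(set.contains(54));               // true
        System.out.println(set.contains(35));               // false
    }
}
```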
The remove operation
We cannot remove by simply zeroing out an element:

    set.remove(54);   // set index 5 to 0
    set.contains()    // false??? oops

    value    0 11 0 0 24 0 34 0 49

Instead, we replace it by a special "removed" placeholder value (can be re-used on add, but keep searching on contains).

    value    0 11 0 0 24 XX 34 0 49

Implementing remove

    public void remove(int value) {
        int h = hash(value);
        while (elements[h] != 0 && elements[h] != value) {
            h = (h + 1) % elements.length;
        }
        if (elements[h] == value) {
            elements[h] = -999;   // "removed" flag value
            size--;
        }
    }

    set.remove(54);   // client code
    set.remove(11);
    set.remove(34);

    value    0 11 0 0 24 -999 34 0 49

Patching add, contains

    private static final int REMOVED = -999;

    // add needs patching.
    // Caveat: stopping at the first REMOVED slot can re-add a value
    // that is already stored further along the probe sequence.
    public void add(int value) {
        int h = hash(value);
        while (elements[h] != 0 && elements[h] != value
                && elements[h] != REMOVED) {
            h = (h + 1) % elements.length;
        }
        if (elements[h] != value) {
            elements[h] = value;
            size++;
        }
    }

    // contains does not need patching;
    // it should keep going on a -999, which it already does
    public boolean contains(int value) {
        int h = hash(value);
        while (elements[h] != 0 && elements[h] != value) {
            h = (h + 1) % elements.length;
        }
        return elements[h] == value;
    }

Problem: full array
clustering: Clumps of elements at neighboring indexes.
  Slows down the hash table lookup; you must loop through them.

    set.add(11);
    set.add(49);
    set.add(24);
    set.add(37);
    set.add(54);   // collides with 24
    set.add();     // collides with 24, then 54
    set.add(86);   // collides with, then 37

    value    0 0 0 0 0 0 0 0 0 0
    size     0

Where does each value go in the array?
How many indices must be examined to answer contains(94)?
What will happen if the array completely fills up?
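One subtlety with the "removed" placeholder: if add stops at the first -999 slot it meets, a value that already sits further along the probe sequence can end up stored twice. A possible variant, offered as a sketch rather than the course's reference code, remembers the first reusable slot but keeps probing until it either finds the value or reaches an empty slot. The class name TombstoneSetDemo and the EMPTY constant are assumptions; the values 0 and -999 still cannot be stored, and the table is assumed to always keep at least one empty slot.

```java
class TombstoneSetDemo {
    private static final int EMPTY = 0;
    private static final int REMOVED = -999;      // same placeholder as the slides
    private final int[] elements = new int[10];
    private int size = 0;

    private int hash(int value) {
        return Math.abs(value) % elements.length;
    }

    void add(int value) {
        int h = hash(value);
        int firstFree = -1;                       // first REMOVED slot seen, if any
        while (elements[h] != EMPTY) {
            if (elements[h] == value) return;     // already present, nothing to do
            if (elements[h] == REMOVED && firstFree < 0) firstFree = h;
            h = (h + 1) % elements.length;
        }
        elements[firstFree >= 0 ? firstFree : h] = value;
        size++;
    }

    boolean contains(int value) {
        int h = hash(value);
        while (elements[h] != EMPTY) {            // REMOVED slots do not stop the search
            if (elements[h] == value) return true;
            h = (h + 1) % elements.length;
        }
        return false;
    }

    void remove(int value) {
        int h = hash(value);
        while (elements[h] != EMPTY && elements[h] != value) {
            h = (h + 1) % elements.length;
        }
        if (elements[h] == value) {
            elements[h] = REMOVED;
            size--;
        }
    }

    int size() { return size; }

    public static void main(String[] args) {
        TombstoneSetDemo set = new TombstoneSetDemo();
        set.add(24);
        set.add(54);
        set.add(64);                              // all three hash to index 4
        set.remove(24);                           // index 4 becomes REMOVED
        set.add(64);                              // must not create a second 64
        System.out.println(set.size());           // 2
        System.out.println(set.contains(64));     // true
        System.out.println(set.contains(24));     // false
    }
}
```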
Rehashing
rehash: Using a larger array when the table is too full.
  Cannot simply copy the old array to a new one. (Why not?)
load factor: ratio of (# of elements) / (hash table length)
  Many collections rehash when the load factor reaches about 0.75.
[Figure: table before rehashing, indices 0 to 9: 95 11 0 0 24 54 37 66 48, size 8]
[Figure: table after rehashing into a larger array: 0 0 0 0 24 0 66 0 48 0 0 11 0 0 54 95 37 0 0, size 8]

Hash table sizes
Can use prime numbers as hash table sizes to reduce collisions.
Also improves spread / reduces clustering on rehash.

    set.add(11);    // 11 % 13 == 11
    set.add(39);    // 39 % 13 == 0
    set.add(21);    // 21 % 13 == 8
    set.add(29);    // 29 % 13 == 3
    set.add(71);    // 71 % 13 == 6
    set.add(41);    // 41 % 13 == 2
    set.add(101);   // 101 % 13 == 10

    index    0   1   2   3   4   5   6   7   8   9  10  11  12
    value   39   0  41  29   0   0  71   0  21   0 101  11   0
    size     7

Google: Why setting Hash Table length to a Prime Number is a good practice?

Iterator for a hash table
How would you implement an iterator for a hash table using linear probing, e.g., HashIntSet?
And also for one with separate chaining (next page)?
How would we implement toString on our HashIntSet?

    index    0   1   2   3   4   5   6   7   8   9
    value    0  11   0   0  24  54   0  37   0  49

    System.out.println(set);   // [11, 24, 54, 37, 49]

Separate chaining
separate chaining: Solving collisions by storing a list at each index.
  add/contains/remove must traverse lists, but the lists are short
  impossible to "run out" of indices, unlike with probing
[Figure: array of buckets, each pointing to a linked list of nodes; the nodes hold 11, 24, 54, 7, 49]

    private class Node {
        public int data;
        public Node next;

        // constructor assumed here; later code calls new Node(value)
        public Node(int data) {
            this.data = data;
        }
    }

Will see an alternative approach to implement chains later in MyHashMap.java
Iterator for one with separate chaining?
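Returning to rehashing with linear probing: the old array cannot simply be copied because each index depends on the table length (54 % 10 is 4, but 54 % 20 is 14). The sketch below re-adds every value into an array of twice the length. The class name RehashDemo, the doubling policy, and the 0.75 threshold check are assumptions for illustration; values are assumed nonzero, as in the slides.

```java
class RehashDemo {
    private static final double MAX_LOAD = 0.75;  // assumed threshold
    private int[] elements = new int[10];         // 0 marks an empty slot
    private int size = 0;

    private int hash(int value) {
        return Math.abs(value) % elements.length;
    }

    void add(int value) {
        if (contains(value)) return;
        if ((size + 1.0) / elements.length > MAX_LOAD) rehash();
        insert(value);
        size++;
    }

    boolean contains(int value) {
        int h = hash(value);
        while (elements[h] != 0) {
            if (elements[h] == value) return true;
            h = (h + 1) % elements.length;
        }
        return false;
    }

    // Probes for the first empty slot; the caller guarantees value is absent.
    private void insert(int value) {
        int h = hash(value);
        while (elements[h] != 0) {
            h = (h + 1) % elements.length;
        }
        elements[h] = value;
    }

    // Re-adds every value, because its index depends on the new length.
    private void rehash() {
        int[] old = elements;
        elements = new int[2 * old.length];
        for (int v : old) {
            if (v != 0) insert(v);
        }
    }

    int capacity() { return elements.length; }

    public static void main(String[] args) {
        RehashDemo set = new RehashDemo();
        for (int v : new int[] {11, 24, 54, 14, 37, 66, 48}) set.add(v);
        System.out.println(set.capacity());    // 10
        set.add(95);                           // eighth element would exceed 0.75
        System.out.println(set.capacity());    // 20
        System.out.println(set.contains(14));  // true
    }
}
```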
Implementing HashIntSet
Let's implement a hash set of ints using separate chaining.

    public class HashIntSet implements IntSet {
        // array of linked lists;
        // elements[i] = front of list #i (null if empty)
        private Node[] elements;
        private int size;

        // constructs new empty set
        public HashIntSet() {
            elements = new Node[10];
            size = 0;
        }

        // hash function maps values to indexes
        private int hash(int value) {
            return Math.abs(value) % elements.length;
        }
    }

The add operation
How do we add an element to the hash table?
  When you want to modify a linked list, you must either change the list's front reference, or the next field of a node in the list.
  Where in the list should we add the new element?
  Must make sure to avoid duplicates.

    set.add(24);

[Figure: buckets holding 11, 54, 7, 49; a new node 24 is linked in]

Implementing add

    public void add(int value) {
        if (!contains(value)) {
            int h = hash(value);              // add to front
            Node newNode = new Node(value);   // of list #h
            newNode.next = elements[h];
            elements[h] = newNode;
            size++;
        }
    }

The contains operation
How do we search for an element in the hash table?
  Must loop through the linked list for the appropriate hash index, looking for the desired value.

    set.contains()     // true
    set.contains(84)   // false
    set.contains(53)   // false

[Figure: buckets holding 11, 24, 54, 7, 49]
Implementing contains

    public boolean contains(int value) {
        Node current = elements[hash(value)];
        while (current != null) {
            if (current.data == value) {
                return true;
            }
            current = current.next;
        }
        return false;
    }

The remove operation
How do we remove an element from the hash table?
  Cases to consider: front (24), non-front (), not found (94), null (32)
  To remove a node from a linked list, you must either change the list's front reference, or the next field of the previous node in the list.

    set.remove(54);

[Figure: buckets holding 11, 24, 54, 7, 49; current refers to the node before 54]

Implementing remove

    public void remove(int value) {
        int h = hash(value);
        if (elements[h] != null && elements[h].data == value) {
            elements[h] = elements[h].next;         // front case
            size--;
        } else {
            Node current = elements[h];             // non-front case
            while (current != null && current.next != null) {
                if (current.next.data == value) {
                    current.next = current.next.next;
                    size--;
                    return;
                }
                current = current.next;
            }
        }
    }

Rehashing with chaining
Separate chaining handles rehashing similarly to linear probing.
  Loop over the list in each hash bucket; re-add each element.
[Figure: the elements 11, 24, 54, 7, 49 before rehashing and redistributed into a table with indices up to 19 after rehashing]
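The rehashing step for separate chaining can be sketched as below. This is one possible approach under stated assumptions, not the course's implementation: the class name ChainedSetDemo, the doubling policy, and the rule "rehash once there are as many elements as buckets" are all made up for illustration. Instead of allocating new nodes, it relinks each existing node into its new bucket.

```java
class ChainedSetDemo {
    private static class Node {
        final int data;
        Node next;
        Node(int data, Node next) {
            this.data = data;
            this.next = next;
        }
    }

    private Node[] elements = new Node[10];   // elements[i] = front of list #i
    private int size = 0;

    private int hash(int value) {
        return Math.abs(value) % elements.length;
    }

    boolean contains(int value) {
        for (Node cur = elements[hash(value)]; cur != null; cur = cur.next) {
            if (cur.data == value) return true;
        }
        return false;
    }

    void add(int value) {
        if (contains(value)) return;
        if (size >= elements.length) rehash();   // assumed policy: load factor 1.0
        int h = hash(value);
        elements[h] = new Node(value, elements[h]);   // add to front of list #h
        size++;
    }

    // Walks every bucket's list and moves each node to its bucket in the larger array.
    private void rehash() {
        Node[] old = elements;
        elements = new Node[2 * old.length];
        for (Node front : old) {
            Node cur = front;
            while (cur != null) {
                Node next = cur.next;          // save before relinking
                int h = hash(cur.data);        // index under the new length
                cur.next = elements[h];
                elements[h] = cur;
                cur = next;
            }
        }
    }

    int size() { return size; }

    int buckets() { return elements.length; }

    public static void main(String[] args) {
        ChainedSetDemo set = new ChainedSetDemo();
        for (int v = 1; v <= 11; v++) set.add(v);
        System.out.println(set.buckets());      // 20
        System.out.println(set.size());         // 11
        System.out.println(set.contains(7));    // true
    }
}
```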
Hash set of objects

    public class HashSet<E> implements Set<E> {
        private class Node {
            public E data;
            public Node next;
        }
    }

It is easy to hash an integer i (use index abs(i) % length).
How can we hash other types of values (such as objects)?

The hashCode method in Java
All Java objects contain the following method (in Object):

    public int hashCode();

Returns an integer hash code for this object.
We can call hashCode on any object to find its preferred index.
HashSet, HashMap, and the other built-in "hash" collections call hashCode internally on their elements to store the data.
We can modify our set's hash function to be the following:

    private int hash(E e) {
        return Math.abs(e.hashCode()) % elements.length;
    }

Hash tables in Java
Hashtable class
  stores key/value pairs
  does not allow null for either key or value
  older, slower class (thread-safe, synchronized)
HashSet class
  implements Set interface, internal storage container is a hash table
  fast (unsynchronized)
  cf. TreeSet class, internal storage container is a Red-Black tree
HashMap class
  implements Map interface, internal storage container for keys is a hash table
  allows null for key or value
  fast (unsynchronized)
  cf. TreeMap class

Maps
Also known as: table, search table, dictionary, associative array, or associative container
A data structure optimized for a very specific kind of search / access
  with a bag we access by asking "is X present"
  with a list we access by asking "give me item number X"
  with a queue we access by asking "give me the item that has been in the collection the longest."
  In a map we access by asking "give me the value associated with this key."
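A note on the hash(E e) helper shown earlier on this page: Math.abs returns a negative number for Integer.MIN_VALUE, so abs(hashCode) % length can produce a negative index for an object whose hash code is exactly that value. One common alternative, shown here as a sketch with made-up method names, is Math.floorMod from the standard library, whose result is never negative for a positive length.

```java
class BucketIndexDemo {
    // The formula from the slides.
    static int absMod(Object e, int length) {
        return Math.abs(e.hashCode()) % length;
    }

    // Alternative: floorMod yields a result in [0, length - 1] for positive length.
    static int floorModIndex(Object e, int length) {
        return Math.floorMod(e.hashCode(), length);
    }

    public static void main(String[] args) {
        Integer key = Integer.MIN_VALUE;        // an Integer's hash code is its int value
        System.out.println(absMod(key, 10));         // -8, not a valid array index
        System.out.println(floorModIndex(key, 10));  // 2
        System.out.println(absMod(37, 10));          // 7
        System.out.println(floorModIndex(37, 10));   // 7
    }
}
```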
Keys and values
Dictionary analogy:
  The key in a dictionary is a word: foo
  The value in a dictionary is the definition: First on the standard list of metasyntactic variables used in syntax examples
A key and its associated value form a pair that is stored in a map.
To retrieve a value, the key for that value must be supplied.
  A List can be viewed as a Map with integer keys (indices).
Keys must be unique, meaning a given key can only represent one value,
  but one value may be represented by multiple keys.

Implementing a HashMap
A hash map is like a set where the nodes store key/value pairs:

    public class HashMap<K, V> implements Map<K, V> {
        ...
    }

    //      key      value
    map.put("Marty", );
    map.put("Jeff", 21);
    map.put("Kasey", 20);
    map.put("Stef", 35);

[Figure: hash table whose nodes hold the pairs "Stef" 35, "Marty", "Jeff" 21, "Kasey" 20]
Must modify your Node class to store a key and a value.

Map ADT interface
Let's think about how to write our own implementation of a map.
As is (usually) done in the Java Collections Framework, we will define map as an ADT by creating a Map interface.
Core operations: put (add), get, containsKey, remove

    public interface Map<K, V> {
        void clear();
        boolean containsKey(K key);
        V get(K key);
        boolean isEmpty();
        void put(K key, V value);
        void remove(K key);
        int size();
    }

HashMap vs. HashSet
The hashing is always done on the keys, not the values.
The contains method is now containsKey; and in remove, you search for a node whose key matches a given key.
The add method is now put; if the given key is already there, you must replace its old value with the new one.

    map.put("Bill", 66);   // replace 49 with 66

[Figure: hash table whose nodes hold "Stef" 35, "Marty", "Abby" 57, "Bill" 49 (replaced by 66), "Jeff" 21, "Kasey" 20]
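The differences listed above (hash the key, replace the value on a repeated put) can be sketched with separate chaining as follows. This is a minimal illustration, not MyHashMap.java: the class name ChainedMapDemo is made up, null keys are not supported, there is no rehashing, and Math.floorMod is used in place of the slides' abs-based formula.

```java
class ChainedMapDemo<K, V> {
    // Each node stores a key and a value instead of a single element.
    private static class Node<K, V> {
        final K key;
        V value;
        Node<K, V> next;
        Node(K key, V value, Node<K, V> next) {
            this.key = key;
            this.value = value;
            this.next = next;
        }
    }

    @SuppressWarnings("unchecked")
    private final Node<K, V>[] buckets = (Node<K, V>[]) new Node[10];
    private int size = 0;

    // Hashing is done on the key only.
    private int hash(K key) {
        return Math.floorMod(key.hashCode(), buckets.length);
    }

    void put(K key, V value) {
        int h = hash(key);
        for (Node<K, V> cur = buckets[h]; cur != null; cur = cur.next) {
            if (cur.key.equals(key)) {
                cur.value = value;            // key already present: replace its value
                return;
            }
        }
        buckets[h] = new Node<>(key, value, buckets[h]);
        size++;
    }

    // Returns null when the key is absent.
    V get(K key) {
        for (Node<K, V> cur = buckets[hash(key)]; cur != null; cur = cur.next) {
            if (cur.key.equals(key)) return cur.value;
        }
        return null;
    }

    boolean containsKey(K key) {
        for (Node<K, V> cur = buckets[hash(key)]; cur != null; cur = cur.next) {
            if (cur.key.equals(key)) return true;
        }
        return false;
    }

    int size() { return size; }

    public static void main(String[] args) {
        ChainedMapDemo<String, Integer> map = new ChainedMapDemo<>();
        map.put("Bill", 49);
        map.put("Jeff", 21);
        map.put("Bill", 66);                            // replaces 49 with 66
        System.out.println(map.get("Bill"));            // 66
        System.out.println(map.size());                 // 2
        System.out.println(map.containsKey("Abby"));    // false
    }
}
```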
Java's TreeMap
Uses a Red-Black tree to implement a Map
relies on the compareTo method of the keys
slower than HashMap
keys stored in sorted order (cf. Are keys in HashMap in sorted order?)

Sample map problem
Determine the frequency of words in a file.

    // uses java.io.File, java.util.Scanner, java.util.Map, java.util.HashMap
    File f = new File(fileName);
    Scanner s = new Scanner(f);
    Map<String, Integer> counts = new HashMap<String, Integer>();
    while (s.hasNext()) {
        String word = s.next();
        if (!counts.containsKey(word))
            counts.put(word, 1);
        else
            counts.put(word, counts.get(word) + 1);
    }

Implementing hashCode
You can write your own hashCode methods in classes you write.
  All classes come with a default version based on memory address.
  Your overridden version should somehow "add up" the object's state.
  Often you scale/multiply parts of the result to distribute the results.

    public class Point {
        private int x;
        private int y;

        public int hashCode() {
            // better than just returning (x + y);
            // spreads out numbers, fewer collisions
            return 137 * x + 23 * y;
        }
    }

Good hashCode behavior
A well-written hashCode method should behave:
Consistently with itself (must produce same results on each call):
  o.hashCode() == o.hashCode(), if o's state doesn't change
Consistently with equality:
  a.equals(b) must imply a.hashCode() == b.hashCode()
  !a.equals(b) does NOT necessarily imply that a.hashCode() != b.hashCode() (why not?)
  When a class has an equals or hashCode, it should have both.
Good distribution of hash codes:
  For a large set of objects with distinct states, they will generally return unique hash codes rather than all colliding into the same hash bucket.
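The rules above can be illustrated with the Point class from the earlier slide, extended with an equals method. This is a sketch: the wrapper class PointDemo, the constructor, and the equals implementation are assumptions, while the hashCode formula is the one from the slide. It also answers the "(why not?)" question: two unequal points may share a hash code.

```java
import java.util.HashSet;
import java.util.Set;

class PointDemo {
    static final class Point {
        private final int x;
        private final int y;

        Point(int x, int y) {
            this.x = x;
            this.y = y;
        }

        // equals and hashCode are defined together and use the same fields.
        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Point)) return false;
            Point p = (Point) o;
            return x == p.x && y == p.y;
        }

        @Override
        public int hashCode() {
            return 137 * x + 23 * y;     // formula from the slide
        }
    }

    public static void main(String[] args) {
        Set<Point> set = new HashSet<>();
        set.add(new Point(3, 4));
        // Equal points have equal hash codes, so the lookup finds the element.
        System.out.println(set.contains(new Point(3, 4)));    // true

        // Unequal points can still collide: 137 * 23 == 23 * 137.
        Point a = new Point(23, 0);
        Point b = new Point(0, 137);
        System.out.println(a.hashCode() == b.hashCode());     // true
        System.out.println(a.equals(b));                      // false
    }
}
```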
Example: String hashCode
The hashCode function inside the String class looks like this:

    public int hashCode() {
        int hash = 0;
        for (int i = 0; i < this.length(); i++) {
            hash = 31 * hash + this.charAt(i);
        }
        return hash;
    }

As with any general hashing function, collisions are possible.
  Example: "Ea" and "FB" have the same hash value.
Early versions of Java examined only the first 16 characters.
For some common data this led to poor hash table performance.

hashCode tricks
If one of your object's fields is an object, call its hashCode:

    public int hashCode() {   // Student
        return 531 * firstName.hashCode() + ...;
    }

To incorporate a double or boolean, use the hashCode method from the Double or Boolean wrapper classes:

    public int hashCode() {   // BankAccount
        return 37 * Double.valueOf(balance).hashCode()
                + Boolean.valueOf(isCheckingAccount).hashCode();
    }

Guava includes an Objects.hashCode() method that takes any number of values and combines them into one hash code.

    public int hashCode() {   // BankAccount
        return Objects.hashCode(name, id, balance);
    }

Hash tables vs. BST vs. heaps on search
BSTs: has complete ordering information
Heaps: has incomplete ordering information
Hash tables: has no order information

Example: using hash tables
See UseHashSet.java, Student.java, StudentReader.java
See UseHashMap.java
See Hash.java
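The String hashCode loop shown earlier on this page can be checked by hand or in code. The sketch below re-implements it as a standalone method (polyHash is a made-up name) and confirms the "Ea" / "FB" collision: 31 * 'E' + 'a' = 31 * 69 + 97 = 2236, and 31 * 'F' + 'B' = 31 * 70 + 66 = 2236.

```java
class StringHashDemo {
    // Same loop as the slide, written as a static helper for experimentation.
    static int polyHash(String s) {
        int hash = 0;
        for (int i = 0; i < s.length(); i++) {
            hash = 31 * hash + s.charAt(i);
        }
        return hash;
    }

    public static void main(String[] args) {
        System.out.println(polyHash("Ea"));                      // 2236
        System.out.println(polyHash("FB"));                      // 2236
        System.out.println(polyHash("Ea") == "Ea".hashCode());   // true
    }
}
```

For combining several fields without Guava, the standard library's java.util.Objects.hash(Object...) plays a similar role.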
Example: implementing hash tables
Using java.util.LinkedList as a chain in each bucket
  See MyHashSet.java
  See MyHashMap.java

Next topic
Graphs