COMP 103 2017-T1
Lecture 31
Hashing: collisions
Marcus Frean, Lindsay Groves, Peter Andreae and Thomas Kuehne, VUW
Lindsay Groves, School of Engineering and Computer Science, Victoria University of Wellington

2  RECAP-TODAY
RECAP
- Fast sets: sets with O(1) contains, add, remove
- Bitsets: use a Boolean array with one cell for each possible value that could be in the set
- Extending the bitset idea to bags and maps
- Hashing: use a hash function to calculate the position
TODAY
- Dealing with collisions: where do you put colliding values?
  - Put them in the same array: closed hashing/probing
  - Put them somewhere else: open hashing/chaining

3  Dealing with Collisions: Two approaches
- Put colliding values in the same array: look for an empty place in the hashtable
  (closed hashing, open addressing, probing)
- Put them somewhere else: use a collection (eg list) at each place
  (open hashing, closed addressing, buckets, chaining)
[Figure: example strings ("Show me the book th", "The barber shaves everyone") passed through HASH into an array with cells labelled 0 1 2 ... 9 ... 581 ... N]

4  Collisions: open hashing/buckets/chaining
Store a Set in each cell: hash value -> which set
[Figure: an array whose cells each hold a small set of words: ant fox hen dog bee kea cow elk owl pig sow tui eel gnu ape bat bug cat jay ray yak roe cod]
- Open hashing: not everything is in the same table
- Closed addressing: hash code takes you to the right place
- Array is top level index into a larger structure
- This is what Java's HashMap does.
- If the sets get too big... Resize and rehash!
- What kind of set?
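The chaining idea on slide 4 can be sketched in a few lines of Java. This is a minimal sketch, not the course's implementation: the class name `ChainedHashSet` and the method `bucketFor` are hypothetical, each cell holds a `LinkedList` (one possible answer to "What kind of set?"), and resizing is left out.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

// Hypothetical sketch of open hashing (chaining): one small list per array cell.
// Resize-and-rehash is deliberately omitted to keep the sketch short.
class ChainedHashSet<E> {
    private final List<LinkedList<E>> buckets = new ArrayList<>();
    private int size = 0;

    ChainedHashSet(int capacity) {
        for (int i = 0; i < capacity; i++) buckets.add(new LinkedList<>());
    }

    // The hash code picks the bucket; the bucket does the rest of the work.
    private LinkedList<E> bucketFor(Object value) {
        return buckets.get(Math.abs(value.hashCode() % buckets.size()));
    }

    public boolean contains(Object value) {
        return value != null && bucketFor(value).contains(value);
    }

    public boolean add(E value) {
        if (value == null) throw new NullPointerException();
        LinkedList<E> bucket = bucketFor(value);
        if (bucket.contains(value)) return false;   // already there
        bucket.add(value);
        size++;
        return true;
    }

    public boolean remove(Object value) {
        if (value == null || !bucketFor(value).remove(value)) return false;
        size--;
        return true;
    }

    public int size() { return size; }

    public static void main(String[] args) {
        ChainedHashSet<String> set = new ChainedHashSet<>(4);
        for (String s : new String[] {"ant", "fox", "hen", "dog", "bee"}) set.add(s);
        System.out.println(set.contains("fox") + " " + set.size());
    }
}
```

With only 4 buckets and 5 values, at least two values must share a bucket, yet every operation still works: a collision only makes one bucket's list a little longer.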
5  Collisions: open hashing/buckets/chaining
Performance?
- If the array is of size k, each subset will be about 1/k th of size().
- Cost is roughly: cost of hashCode + cost of method applied to subset
- Eg, using linked lists and an array of size 100: this is about 100 times faster than a simple linked list (if the values are spread evenly)
- Good when the subsets are mostly small
- Trie: needs dynamic memory management; for strings, index on first character, then on second, ... Lookup time is proportional to length of key!

6  Collisions: closed hashing/probing
- Closed hashing: all data is stored in the same array. If the location given by the hash function is occupied, look for another location.
- Open addressing: hash value tells us where to start looking
- Probing: looking at successive locations till we find the value we're looking for, or an empty location
- Where do we look? Next location? One further away? What will give best performance?

7  Linear Probing: Look in next location
- Hash value tells us where to start looking.
- If value.hashCode() gives p, start at index p
- If cell is used, try p+1, p+2, p+3, ...
- Wrap round to 0 at the end of the array.
Values (hash): Stu (2), Sven (5), Sam (4), Steve (2), Stig (2), Sun (3)

8  Linear Probing: contains
Search for: Stu (2), Sven (5), Sam (4), Steve (2), Sun (3)

  index:  0     1     2     3      4     5     6
  value:  Sun   -     Stu   Steve  Sam   Sven  Stig

Problem: remove Sam!!
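The table on slide 8 can be reproduced by replaying the insertions from slide 7. The sketch below is only an illustration: the class `LinearProbeTrace` and method `insertAll` are hypothetical names, and the hash values are passed in directly (taken from the slide) instead of being computed by `hashCode()`.

```java
// Hypothetical trace of linear probing using the names and hash values on the slide.
class LinearProbeTrace {
    // Places each name at its hash index, or at the next free cell (wrapping round).
    static String[] insertAll(String[] names, int[] hashes, int length) {
        String[] table = new String[length];
        for (int i = 0; i < names.length; i++) {
            int p = hashes[i] % length;
            while (table[p] != null) p = (p + 1) % length;   // assumes a free cell exists
            table[p] = names[i];
        }
        return table;
    }

    public static void main(String[] args) {
        String[] names = {"Stu", "Sven", "Sam", "Steve", "Stig", "Sun"};
        int[] hashes = {2, 5, 4, 2, 2, 3};
        String[] table = insertAll(names, hashes, 7);
        for (int i = 0; i < table.length; i++) System.out.println(i + ": " + table[i]);
    }
}
```

Stig hashes to 2 but ends up in cell 6, and Sun hashes to 3 but wraps round to cell 0, because every cell in between was already taken.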
9  Linear Probing: contains

    public boolean contains(Object value) {
        if (value == null) return false;              // or error
        int hash = Math.abs(value.hashCode() % data.length);
        int p = hash;
        while (true) {
            if (data[p] == null) return false;        // not there
            if (data[p].equals(value)) return true;   // found
            p = (p+1) % data.length;
            if (p == hash) return false;              // not there
        }
    }

The last test: you've gone right around to where you started... How can this happen?

10  Linear Probing: add

    public boolean add(E value) {
        if (value == null) throw new NullPointerException();   // better!
        ensurecapacity();                                       // !!
        int hash = Math.abs(value.hashCode() % data.length);
        int p = hash;
        while (true) {
            if (data[p] == null) {
                data[p] = value;
                size++;
                return true;                                    // added
            }
            if (data[p].equals(value)) return false;            // already there
            p = (p+1) % data.length;
            if (p == hash) return false;                        // ummm.????
        }
    }

11  ensurecapacity
If table is full (or nearly full), double its size and copy: how do you copy?

12  Hash Tables and Load Factor
When is the hashtable too full?
[Figure: tables holding cat, bee, fox, pig, owl, hen before and after the array is doubled]
- Index depends on hashcode and length (division method)!
- and it depends on previous collisions...
- Have to rehash everything!
When number of items is close to array size:
- May have to probe a large number of cells to find an empty cell
- Performance becomes very slow.
- Linear probing is particularly bad!
- Should not let table get more than 70% - 80% full (maximum load factor)
- With a low load factor, cost is O(1); with a high load factor, cost approaches O(N)
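The point on slide 12 that the index depends on both the hash code and the array length can be seen with a few numbers. This is a small illustration only: the class `RehashIndexDemo`, the method `indexFor` and the hash codes 13, 21 and 29 are all made up for the example.

```java
// Hypothetical illustration: the same hash codes land in different cells after doubling.
class RehashIndexDemo {
    // Division method, the same calculation contains and add use.
    static int indexFor(int hashCode, int length) {
        return Math.abs(hashCode % length);
    }

    public static void main(String[] args) {
        int[] hashCodes = {13, 21, 29};   // made-up hash codes
        for (int h : hashCodes) {
            System.out.println(h + ": index " + indexFor(h, 8) + " -> " + indexFor(h, 16));
        }
    }
}
```

All three hash codes collide at index 5 when the length is 8. After doubling to 16, two of them move to index 13 and one stays at 5, so the old cell positions cannot simply be copied across.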
13  ensurecapacity

    public void ensurecapacity() {
        if (size <= maximumload) return;
        E[] olddata = data;
        data = (E[]) new Object[data.length * 2];
        maximumload = (int) (data.length * max_load_factor);
        size = 0;                       // add() counts every item again
        for (E v : olddata) {           // rehash
            if (v != null) {
                add(v);
            }
        }
    }

- maximumload: a field, initially set to data.length*max_load_factor
- Within ensurecapacity, calling add is unnecessarily expensive:
  - checks capacity each time (we know it is OK)
  - checks if item is present already (we know it isn't)

14  ensurecapacity (more efficient)

    public void ensurecapacity() {
        if (size <= maximumload) return;
        E[] olddata = data;
        data = (E[]) new Object[data.length * 2];
        maximumload = maximumload * 2;
        for (E v : olddata) {           // rehash
            if (v != null) {
                int p = Math.abs(v.hashCode() % data.length);
                while (true) {
                    if (data[p] == null) {
                        data[p] = v;
                        break;
                    }
                    p = (p+1) % data.length;
                }
            }
        }
    }

15  Linear Probing: remove
Inserted: Stu (2), Sven (5), Sam (4), Steve (2), Sun (4)
[Figure: table with cells 0 to 6 holding Sun, Stu, Steve, Sam, Sven, Stig]
Now remove: Sam (4)
- What's the problem? contains(Sun) will return false!
- To remove, need to leave a tombstone (not null, not a value!), ignored by add, etc.
- How do we count tombstones in ensurecapacity?

16  Linear Probing: Runs and Clustering
Linear probing is particularly bad:
- Repeated collisions at one index create runs
- Runs -> linear performance
- With linear probing, runs join up, so they grow fast: the bigger the run, the faster it grows
- This is called "clustering"
[Figure: a table of words (cat, bee, fox, hen, owl, pig, gnu, emu, rat, tui) with probes numbered 1 to 5]
Can we do better by increasing the step size?
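The tombstone idea from slide 15 can be sketched as follows. This is a minimal sketch under several assumptions: the class `TombstoneSet` is hypothetical, the table has a fixed size (no `ensurecapacity`), and `add` simply steps over tombstones as if they were occupied, which is only one of the possible policies.

```java
// Hypothetical sketch: linear probing with tombstones. Fixed-size table, no resizing.
class TombstoneSet {
    private static final Object TOMBSTONE = new Object();   // not null, not a value
    private final Object[] data;
    private int size = 0;

    TombstoneSet(int capacity) { data = new Object[capacity]; }

    private int start(Object value) { return Math.abs(value.hashCode() % data.length); }

    public boolean contains(Object value) {
        if (value == null) return false;
        int hash = start(value);
        int p = hash;
        do {
            if (data[p] == null) return false;        // a real gap ends the search
            if (data[p].equals(value)) return true;   // a tombstone never equals a value
            p = (p + 1) % data.length;
        } while (p != hash);
        return false;                                 // went all the way round
    }

    public boolean add(Object value) {
        if (value == null) throw new NullPointerException();
        int hash = start(value);
        int p = hash;
        do {
            if (data[p] == null) { data[p] = value; size++; return true; }
            if (data[p].equals(value)) return false;  // already there
            p = (p + 1) % data.length;                // tombstones are stepped over
        } while (p != hash);
        return false;                                 // no empty cell left
    }

    public boolean remove(Object value) {
        if (value == null) return false;
        int hash = start(value);
        int p = hash;
        do {
            if (data[p] == null) return false;
            if (data[p].equals(value)) { data[p] = TOMBSTONE; size--; return true; }
            p = (p + 1) % data.length;
        } while (p != hash);
        return false;
    }

    public int size() { return size; }

    public static void main(String[] args) {
        TombstoneSet set = new TombstoneSet(7);
        set.add(2); set.add(9); set.add(16);          // Integer hash codes 2, 9, 16 all map to cell 2
        set.remove(9);                                // leaves a tombstone in the middle of the run
        System.out.println(set.contains(16));
    }
}
```

Because tombstones are never reclaimed here, they keep occupying cells even though `size` goes down, which is exactly why the slide asks how tombstones should be counted in ensurecapacity.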
17  Quadratic Probing
Make the sequence of probes have increasing steps: runs don't join up so fast
- Probes: h, h+1, h+4, h+9, h+16, ...
- In code: p = h, p += 1, p += 3, p += 5, p += 7, p += 9, ...
- Quadratic probing uses a quadratic formula: $\text{probe}_i = \text{hash} + a\,i + b\,i^2 \quad (b \neq 0)$
- Eg: with a = b = ½, the step sizes become 1, 2, 3, ... instead of 1, 3, 5, ...
[Figure: table holding hen, bee, cat, fox, owl]

18  Quadratic Probing
Another problem, perhaps?
- Sequence might wrap back on itself before checking each cell
- If we choose a = b = ½, and length is a power of 2... guaranteed not to wrap until it has checked every cell!
- $\text{probe}_i = \text{hash} + \tfrac{1}{2}(i + i^2)$
- Probes are hash, hash+1, hash+3, hash+6, hash+10, hash+15, ...
- Step sizes are 1, 2, 3, 4, 5, ...

19  Quadratic Probing: contains

    private static final int INITIAL_CAPACITY = 16;   // a power of 2
    // ...
    public boolean contains(Object value) {
        if (value == null) return false;
        int hash = Math.abs(value.hashCode() % data.length);
        int p = hash;
        int step = 1;
        while (true) {
            if (data[p] == null) return false;        // not there
            if (data[p].equals(value)) return true;   // found
            p = (p + (step++)) % data.length;
        }
    }

This does not check for cycles! It relies on:
- the array not being full, and
- the probe sequence checking every cell

20  Iterator
Iterating through a hash table is not simple:
- there will be nulls to skip over
- the order that items are returned appears random (and may change when the array is doubled!)
At each call to next(), the Iterator must advance the index to the next non-null cell.
[Figure: table holding cat, bee, fox with empty cells between them]
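Returning to the quadratic probing slides: the claim that steps of 1, 2, 3, ... reach every cell when the length is a power of 2 can be checked by brute force. The sketch below is only a check, not part of the hash table; the class `TriangularProbeCheck` and the method `cellsVisited` are hypothetical names.

```java
// Hypothetical check of the probe sequence hash + (i + i*i)/2, i.e. a = b = 1/2.
class TriangularProbeCheck {
    // Counts the distinct cells reached by the first `length` probes.
    static int cellsVisited(int hash, int length) {
        boolean[] seen = new boolean[length];
        int count = 0;
        int p = hash % length;
        for (int step = 1; step <= length; step++) {
            if (!seen[p]) { seen[p] = true; count++; }
            p = (p + step) % length;      // step sizes 1, 2, 3, ...
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println("length 16: " + cellsVisited(3, 16) + " cells");
        System.out.println("length 12: " + cellsVisited(3, 12) + " cells");
    }
}
```

With length 16 the first 16 probes land on 16 different cells, while with length 12 the sequence revisits cells before covering the table. That is why the contains method on slide 19 depends on the capacity being a power of 2.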
21, 22  Hash Table with Probing: iterator

    private class HashSetIterator implements Iterator<E> {
        private E[] data;
        private int nextindex = 0;

        private HashSetIterator(E[] d) {
            data = d;
            // move to the first non-null cell
            while (nextindex < data.length && data[nextindex] == null)
                nextindex++;
        }

        public E next() {
            if (nextindex >= data.length) throw new NoSuchElementException();
            E ans = data[nextindex++];
            // move to the next non-null cell, ready for the following call
            while (nextindex < data.length && data[nextindex] == null)
                nextindex++;
            return ans;
        }

        public boolean hasNext() {
            return (nextindex < data.length);
        }

        public void remove() {
            throw new UnsupportedOperationException();
        }
    }

23  Other Probing Techniques
- Quadratic probing: step sizes 1, 2, 3, ...; still suffers from secondary clustering
- Double hashing: use a second hash function to compute the next probe index: p = hash2(value, p); less clustering, but more expensive
- Cuckoo hashing... Use two hash functions. Try both indexes.
  - the new hash depends on the value as well (unlike with probing)
  - If both are full, kick out one of the values, and put it in its alternate place (kicking out a value if necessary, ...)

24  Extending the Bitset idea
- Bitsets use a Boolean array with one cell for each possible value that could be in the set
- Can we extend this idea to bags and maps?
  - For bag, store the number of times the value is in the bag
  - For map, store the value that the key maps to (oh, that's just an ordinary array!)
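The bag extension on slide 24 could look roughly like this. It is a sketch under stated assumptions: the class `CountingBag` is a hypothetical name, the values are small non-negative ints used directly as array indexes, and out-of-range values are not handled (they would throw an ArrayIndexOutOfBoundsException).

```java
// Hypothetical sketch: a bag of small non-negative ints, stored as an array of counts.
class CountingBag {
    private final int[] counts;   // counts[v] = number of copies of v in the bag
    private int size = 0;

    CountingBag(int maxValue) { counts = new int[maxValue + 1]; }

    public void add(int value) {
        counts[value]++;
        size++;
    }

    public boolean remove(int value) {
        if (counts[value] == 0) return false;   // not in the bag
        counts[value]--;
        size--;
        return true;
    }

    public int count(int value) { return counts[value]; }

    public boolean contains(int value) { return counts[value] > 0; }

    public int size() { return size; }

    public static void main(String[] args) {
        CountingBag bag = new CountingBag(9);
        bag.add(3); bag.add(3); bag.add(7);
        System.out.println(bag.count(3) + " " + bag.size());
    }
}
```

As with the bitset, add, remove and contains are each a single array access, at the price of one cell for every possible value.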
Comment on slide 24 (Microsoft Office User, 10/17/2017): Now covered in lecture 30