SFU CMPT-307 2008-2 1 Lecture: Week 8 SFU CMPT-307 2008-2 Lecture: Week 8 Ján Maňuch E-mail: jmanuch@sfu.ca Lecture on June 24, 2008, 5.30pm-8.20pm
SFU CMPT-307 2008-2 2 Lecture: Week 8 Universal hashing Overcomes problem of bad inputs by using randomization Mentioned: if adversary chooses keys to be hashed by fixed hash function, he can choose Ò keys that all hash into same slot All fixed hash functions have this problem Universal hashing: design class of parametrized hash functions À pick random ¾ À at beginning of execution (select at random parameters)
SFU CMPT-307 2008-2 3 Lecture: Week 8 À Let be a finite collection of hash functions that map given Í universe ¼ ½ Ñ ½ into À is said to be universal if for each pair ¾ Í of keys, the number of hash functions ¾ À with µ µ is at most À Ñ, i.e., the chance of a collision is at most À Ñ ½ Ñ À In other words: with randomly chosen from À, the chance of collision between arbitrary distinct keys and is just the same (½ Ñ) as a chance of collision µ if µ and were randomly and independently chosen ¼ Ñ ½ from
SFU CMPT-307 2008-2 4 Lecture: Week 8 Properties: average-case Let À be a universal collection of hash functions Theorem. Choose ¾ À randomly to hash Ò keys (set Ã) into table Ì of size Ñ, using chaining to resolve collisions. 1. If key is not in the table, then expected length Ò µ of list that hashes to is at most «. 2. If key is in the table, then expected length Ò µ of list containing is at most ½ «. Proof. Important: expectations over the choice of a hash function, do not depend on assumptions regarding distribution of keys! For each pair of distinct keys define a random variable with ½ iff µ µ, i.e, collision By definition, È ½µ ½ Ñ, and thus ½ Ñ
¾Ã Ð ¾Ã SFU CMPT-307 2008-2 5 Lecture: Week 8 Now define for each key a random variable i.e. the number of keys other than that hash to same slot as. Clearly, ¾ ½ ¾ Ã Ñ ¾Ã ¾Ã Ñ
SFU CMPT-307 2008-2 6 Lecture: Week 8 It remains to count the number of ¾ Ã with Case 1: If ¾ Ã, then Ò µ and ¾ Ã Ò Thus Ò µ Ò Ñ «Case 2: If ¾ Ã, key appears in Ì µ. Also, count does not include itself. Thus Ò µ ½ and Thus ¾ Ã Ò ½ Ò µ ½ Ò ½µ Ñ ½ ½ «½ Ñ ½ «
SFU CMPT-307 2008-2 7 Lecture: Week 8 Universal hashing: sequence of operations Corollary. Using the universal hashing and chaining in a table Ñ with slots, it takes expected µ time to handle any sequence of Insert, Search, and Delete operations, Ç Ñµ containing Insert operations. Proof. the number of Insertions is Ç Ñµ thus the number of elements in the hash table at any time is Ò Ç Ñµ and «Ç ½µ. Insert and Delete take constant time, Search takes expected constant time (due «to Ç ½µ), by linearity of expectation, expected time for sequence of operations is Ç µ. Notice: Impossible for adversary to pick sequence of operations that forces the worst-case time This is now without the assumption of simple uniform hashing
SFU CMPT-307 2008-2 8 Lecture: Week 8 Construction of a universal class of hash functions Simple solution. take all functions mapping Í into ¼ ½ Ñ disadvantages: ½ size of the collection: Ñ Í number of random bits needed to pick a random function: Í ÐÓ Ñ a lot of preprocessing needed how to encode functions? we have to store the chosen function space needed: Í ÐÓ Ñ we need a much smaller collection of hash functions
SFU CMPT-307 2008-2 9 Lecture: Week 8 Assignment Problem 8.1. (deadline: July 8, 5:30pm) Prove that the class of all functions from Í to ¼ ½ Ñ ½ is universal.
SFU CMPT-307 2008-2 10 Lecture: Week 8 Number-theoretical solution. Let Ô be a prime number and let Ô ¼ ½ Ô ½ Ô ½ ¾ Ô ½ Lemma. For any ¾ Ô, ¾ Ô, the equation Ü ÑÓ Ô has a unique solution Ü ¾ Ô. (see Chapter 31.4 for a proof)
SFU CMPT-307 2008-2 11 Lecture: Week 8 Let Ô be a prime, large enough such that every possible key is in ¼ Ô ½, inclusive. for all ¾ Ô and ¾ Ô define hash functions : µ µ ÑÓ Ôµ ÑÓ Ñ (linear transformation followed by reductions modulo Ô and then modulo Ñ) Note: Ô Ñ since by the assumption, the size of universe of keys Í is greater than the number of slots the function maps Í Ô into Ñ Consider the universal class of hash functions: Note: size of Ô Ô ½µ class: can be chosen such that Ô ¾ Í (Bertrand s postulate) Ô picking randomly À from Ô Ñ means picking random number of bits needed to pick a function À from Ô Ñ Ç ÐÓ Í µ : À Ô Ñ ¾ Ô ¾ Ô
SFU CMPT-307 2008-2 12 Lecture: Week 8 Theorem. À Class Ô Ñ of hash functions is universal. Proof. Take two distinct keys and. How many functions are there À in Ô Ñ mapping and to the same slot? Consider ¾ À Ô Ñ. Let Ö µ ÑÓ Ô µ ÑÓ Ô Is it possible that Ö? By Lemma: No. If Ö then and are solutions to equation Ü Ö ÑÓ Ô and by uniqueness of solution,. no collisions at modulo Ô level
SFU CMPT-307 2008-2 13 Lecture: Week 8 There Ô Ô ½µ are functions À in Ô Ñ and there Ô Ô ½µ are possible Ö µ pairs such Ö that. Is there a one-to-one correspondence? Yes Values Ö and determine the function (remember and are still fixed): µ ¼ ÑÓ Ô Ö by Lemma (unknown ), is unique Ö ½ ÑÓ Ô by Lemma (unknown ), is unique implies uniquely determined by Ö and Conclusion: If we pick ¾ À Ô Ñ uniformly at random, the resulting pair Ö µ is equally likely to be any pair of distinct values from Ô.
Ô Ñ ½ Ô Ñ SFU CMPT-307 2008-2 14 Lecture: Week 8 When a collision between and happens? Hence, probability that and collide is equal to probability Ö that Ñ when Ö and are randomly chosen as distinct values from Ô. ÑÓ and collide when Ö ÑÓ Ñ Note: doesn t depend on and anymore pick any Ö, there are at most Ô Ñ choices of such that Ö ÑÓ Ñ, one of them is Ö hence, there are at most Ô Ñ ½ choices for such that Ö and hence, probability that Ö ÑÓ Ñ and Ö is at most Ö ÑÓ Ñ ½µ Ñ ½ Ô ½ ½ Ô Ô ½µ Ñ ½ Ñ Ô Final conclusion: À Ô Ñ is universal. ½
SFU CMPT-307 2008-2 15 Lecture: Week 8 Open addressing the second method for dealing with collisions recall, the first method was chaining elements are stored directly in hash table (not in linked lists) hence: the load «is always at most ½ (but the table can overflow) for every key we have a sequence of slots, called the probe sequence when inserting we try (probe) the first slot, if used, the second slot, etc., until we find an empty slot when searching we go over slots in sequence until we either find the searched element or empty slot no pointers used we can use more memory for the hash table instead, which results in fewer collisions, and so, faster searching
SFU CMPT-307 2008-2 16 Lecture: Week 8 Hash functions have now two parameters: the key ¾ Í the probe number ¾ ¼ ½ Ñ ½ Furthermore, the hash function Í ¼ ½ Ñ ½ ¼ ½ Ñ ½ must satisfy that for every key, the probe sequence ¼µ ½µ Ñ ½µ is a permutation of ¼ ½ Ñ ½ Now: every position is eventually considered as a slot for a new key (the algorithm fails with overflow only when there is no place in the entire table to insert a key)
SFU CMPT-307 2008-2 17 Lecture: Week 8 Assume that an empty slot contains the value NIL µ Hash-Insert Ì ¼ 1: 2: repeat 3: µ 4: if Ì NIL then 5: Ì 6: return 7: else ½ 8: 9: end if 10: until Ñ 11: error hash table overflow Idea: probing the probe sequence of until we find an empty slot
SFU CMPT-307 2008-2 18 Lecture: Week 8 Hash-Search Ì µ probes the same sequence, therefore the search can be terminated when an empty slot is found assuming that we didn t delete anything from the table µ Hash-Search Ì ¼ 1: 2: repeat 3: µ 4: if Ì then 5: return 6: else ½ 7: 8: end if 9: until Ì NIL or Ñ 10: return NIL returns where slot contains the key, or NIL, if is not in the table
SFU CMPT-307 2008-2 19 Lecture: Week 8 Hash-Delete Ì µ difficult operation: when we delete a key from the table, we cannot just mark the slot containing the key by NIL (empty) we could disconnect a sequence of occupied elements leading to the key Solution. mark the place with the special value DELETED (instead of NIL) modify Hash-Insert Ì µ so that it treats slot containing DELETED as empty Hash-Search Ì µ will treat such a slot as occupied, which we want, so no modification is needed Problem. the search time doesn t depend on «load #elements #slots, but rather on #elements #DELETED #slots, which can be quite bad Conclusion. when Delete operation is allowed, prefer to use hashing by chaining
SFU CMPT-307 2008-2 20 Lecture: Week 8 Assignment Problem 8.2. (deadline: July 8, 5:30pm) Write a pseudo-code for the modified Hash-Insert able to handle also the special value DELETED, and for the procedure Hash-Delete.
SFU CMPT-307 2008-2 21 Lecture: Week 8 Uniform hashing generalization of simple uniform hashing very strong conditions, in practice just some approximations are used useful for analysis assumption: each key is equally likely to have any Ñ of permutations of as its probe sequence ½ requires that all Ñ possible permutations (probe sequence) could be generated ¼ ½ Ñ Common techniques used are much worse: linear probing (only Ñ distinct probe sequences) quadratic probing (only Ñ distinct probe sequences) double (Ñ hashing distinct probe sequences) the best ¾ results
SFU CMPT-307 2008-2 22 Lecture: Week 8 Linear probing needs an ordinary hash function ¼ Í ¼ ½ Ñ ½, called an auxiliary hash function hash function: µ ¼ µ µ ÑÓ Ñ hence, for a key, the probe sequence is ¼ µ ¼ µ ½ Ò ½ ¼ ¼ µ ½ we start at the slot given by auxiliary hash function, and then probe consecutive slots first probe determines the entire sequence µ only Ñ distinct probe sequences
SFU CMPT-307 2008-2 23 Lecture: Week 8 Problem. primary clustering long clusters (sequences) of occupied slots increasing average search time why? consider a cluster with occupied slots, if the consecutive slot is empty, then it has a ½µ Ñ probability that it will be filled during next Insert operation
SFU CMPT-307 2008-2 24 Lecture: Week 8 Quadratic probing hash function: µ ¼ µ ½ ¾ ¾ µ ÑÓ Ñ where ¼ is again an auxiliary hash function, ¾ ¼ Problems: doesn t work for all values of ½ ¾ and Ñ still: a probe sequence for one key is shifted version of the probe sequence for another µ key Ñ only distinct probe sequences secondary clustering milder form of clustering (the keys mapped to the same slot will form a cluster)
SFU CMPT-307 2008-2 25 Lecture: Week 8 Assignment Problem 8.3. (deadline: July 8, 5:30pm) Consider a quadratic probing scheme with constants Ñ ¾ Ø for some integer Ø (i.e., Ñ is a power of ¾) and ½ ¾ ½ ¾. Prove that the function µ ¼ µ ½ ¾ ¾ µ ÑÓ Ñ is indeed a hash function, i.e., prove that all probe sequences are permutations ¼ ½ Ñ of ½. Hint: ¾ Congruence equivalent to ¾ ÑÓ ¾ congruence. Ø ½ ÑÓ ¾ Ø makes sense only if is even, and is
SFU CMPT-307 2008-2 26 Lecture: Week 8 Double hashing hash function: µ ½ µ ¾ µµ ÑÓ Ñ where ½ and ¾ are auxiliary hash functions first probe doesn t determine the entire sequence, depends also on the value of the second hash function (assuming we don t use ¾ ½ ) approximates behavior of uniform hashing, generated permutations have some characteristics of randomly chosen permutation
SFU CMPT-307 2008-2 27 Lecture: Week 8 Requires. The values of ¾ µ have to be relatively prime (coprime) to the size of hash Ñ table two easy ways: Ñ 1. is power of ¾, and ¾ produce only odd numbers Ñ 2. is a prime, and ¾ returns positive values smaller Ñ than each pair ½ µ ¾ µµ yields a distinct probe sequence, hence Ñ ¾ µ probing sequences are used
SFU CMPT-307 2008-2 28 Lecture: Week 8 Analysis of open-address hashing assume that we use uniform hashing : a new key has a probe sequence chosen uniformly at random from all possible permutations when we are looking for a key which is already in the table, we assume that every key is equally likely to be searched for we will express the expected number of probes in terms of the load «Ò Ñ ½
SFU CMPT-307 2008-2 29 Lecture: Week 8 Unsuccessful search Theorem. Given an open-address hash table with load «Ò Ñ ½, the expected number of probes in an unsuccessful search is at most ½ ½ «µ, assuming uniform hashing. Example. if «½ ¾ ( ¼± of table occupied) then the expected number of probes for unsuccessful search will be ½ ½ ½ ¾µ ¾ if «¼ ( ¼± of table occupied), then ½ ½ ½¼µ ½¼
SFU CMPT-307 2008-2 30 Lecture: Week 8 Proof. let be a random variable denoting the number of probes made in an unsuccessful search for ½ Ñ, let be the event that the -th probe to an occupied slot iff events ½ ½ happened and event didn t event is equivalent to event ½ ¾ ½, hence ½µ is È µ È ½ ¾ ½ µ we are looking for why do we need È µ? answer: Assignment Problem 8.4
½ ½ SFU CMPT-307 2008-2 31 Lecture: Week 8 Assignment Problem 8.4. (deadline: July 8, 5:30pm) Consider a random variable having only positive integer values (i.e., it s mapping the probability space to the set of natural numbers). Prove that È ½µ È ¾µ È µ Hint: events ½, ¾,,..., are mutually exclusive (disjoint), hence ½ È µ È µ
SFU CMPT-307 2008-2 32 Lecture: Week 8 We will use the formula from Assignment Problem 5.2 again: È ½ ¾ ½ µ È ½ µ È ¾ ½ µ È ½ ¾ ¾ ½ µ Hence, to find out È µ it s enough to evaluate conditional probabilities È ½ ½ µ. case ½: È ½ µ Ò Ñ slots, Ò are occupied, uniform hashing, the first probe can be to any of (Ñ slots) Ñ
SFU CMPT-307 2008-2 33 Lecture: Week 8 case ½: assume ½ ½ occurred hence, first ½ probes were to occupied slots -th probe cannot be to any of these slots (the probe sequence is a permutation) hence, we have Ñ ½µ slots to choose from, and Ò ½µ of them are occupied conclusion: probability È ½ ½ µ is Ò ½µ Ñ ½µ
«½ ½ ½ «½ SFU CMPT-307 2008-2 34 Lecture: Week 8 We can now calculate È µ: ¾ µ Ò È Ò ½ Ñ ½ Ò Ñ Ñ ¾ Ò ½ Ñ Using Assignment Problem 8.4, È µ ½ ½ ½ ½ ««½ ¼ if «is constant, then ½ ½ «µ is also constant, and so an unsuccessful search runs in Ç ½µ time
the same expected number of probes: ½ ½ «µ SFU CMPT-307 2008-2 35 Lecture: Week 8 Insert Corollary. Inserting an element into an open-address hash table with load «requires at most ½ ½ «µ probes on average, assuming uniform hashing. Proof. there must be space in the table, «½ inserting a key is equivalent to unsuccessful search, where we insert the key to the final (empty) slot of unsuccessful search
SFU CMPT-307 2008-2 36 Lecture: Week 8 Successful search Theorem. Given an open-address hash table with load «½, the expected number of probes of successful search is at most ½ ÐÒ «½ «Ç ½µ ½ assuming uniform hashing and assuming that each key in the table is equally likely to be searched for. Example. 50% table full: expected number of probes in successful search: at most ½ 90% table full: ¾
SFU CMPT-307 2008-2 37 Lecture: Week 8 Proof. when looking for a key, we have to repeat the same probe sequence as when inserting into the table assume that keys were inserted to the table in the order ½ ¾ Ò hence, if we are looking for the key, the number of probe is the same as the number of probes we needed when inserting to the table at that time, table contained ½ elements, and the load factor was «½µ Ñ search for requires at most ½ «½ ½ ½µ Ñ Ñ ½ Ñ ½ probes on average
#probes ½ Ò Ñ Ò ½ «SFU CMPT-307 2008-2 38 Lecture: Week 8 as it can be any of Ò keys in the table with the same probability, we have to take the average of those expected numbers of probes: Ò Ñ Ñ ½ ½ ½ Ò ½ Ñ ¼ ½ À Ñ À Ñ Ò µ «À using ÐÒ Ç ½µ ÐÒ Ñ ÐÒ Ñ Òµ Ç ½µµ ½ «ÐÒ Ñ Ò Ç ½µ Ñ ½ «ÐÒ ½ «Ç ½µ ½