SFU CMPT Lecture: Week 8


Ján Maňuch, jmanuch@sfu.ca
Lecture on June 24, 2008, 5.30pm-8.20pm

Universal hashing
- Overcomes the problem of bad inputs by using randomization.
- Mentioned earlier: if an adversary chooses the keys to be hashed by a fixed hash function, he can choose $n$ keys that all hash into the same slot.
- All fixed hash functions have this problem.
- Universal hashing: design a class $\mathcal{H}$ of parametrized hash functions and pick a random $h \in \mathcal{H}$ at the beginning of execution (i.e., select the parameters at random).

- Let $\mathcal{H}$ be a finite collection of hash functions that map a given universe $U$ of keys into $\{0, 1, \ldots, m-1\}$.
- $\mathcal{H}$ is said to be universal if for each pair of distinct keys $k, l \in U$, the number of hash functions $h \in \mathcal{H}$ with $h(k) = h(l)$ is at most $|\mathcal{H}|/m$, i.e., the chance of a collision is at most $\frac{|\mathcal{H}|/m}{|\mathcal{H}|} = \frac{1}{m}$.
- In other words: with $h$ chosen randomly from $\mathcal{H}$, the chance of a collision between arbitrary distinct keys $k$ and $l$ is no more than the chance ($1/m$) of a collision if $h(k)$ and $h(l)$ were chosen randomly and independently from $\{0, 1, \ldots, m-1\}$.

Properties: average-case
Let $\mathcal{H}$ be a universal collection of hash functions.
Theorem. Choose $h \in \mathcal{H}$ randomly to hash $n$ keys (set $K$) into a table $T$ of size $m$, using chaining to resolve collisions, and let $\alpha = n/m$ be the load.
1. If key $k$ is not in the table, then the expected length $E[n_{h(k)}]$ of the list that $k$ hashes to is at most $\alpha$.
2. If key $k$ is in the table, then the expected length $E[n_{h(k)}]$ of the list containing $k$ is at most $1 + \alpha$.
Proof.
- Important: the expectations are over the choice of the hash function and do not depend on any assumptions regarding the distribution of keys!
- For each pair $k, l$ of distinct keys define a random variable $X_{kl}$ with $X_{kl} = 1$ iff $h(k) = h(l)$, i.e., iff there is a collision.
- By the definition of universality, $\Pr(X_{kl} = 1) \le 1/m$, and thus $E[X_{kl}] \le 1/m$.

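Now define for each key $k$ a random variable
$$Y_k = \sum_{l \in K,\ l \ne k} X_{kl},$$
i.e., the number of keys other than $k$ that hash to the same slot as $k$. Clearly,
$$E[Y_k] = E\Big[\sum_{l \in K,\ l \ne k} X_{kl}\Big] = \sum_{l \in K,\ l \ne k} E[X_{kl}] \le \sum_{l \in K,\ l \ne k} \frac{1}{m}.$$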

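It remains to count the number of $l \in K$ with $l \ne k$.
Case 1: If $k \notin K$, then $n_{h(k)} = Y_k$ and $|\{l \in K : l \ne k\}| = n$. Thus
$$E[n_{h(k)}] = E[Y_k] \le \frac{n}{m} = \alpha.$$
Case 2: If $k \in K$, key $k$ appears in the list $T[h(k)]$. Also, the count $Y_k$ does not include $k$ itself. Thus $n_{h(k)} = Y_k + 1$ and $|\{l \in K : l \ne k\}| = n - 1$. Thus
$$E[n_{h(k)}] = E[Y_k] + 1 \le \frac{n-1}{m} + 1 = 1 + \alpha - \frac{1}{m} < 1 + \alpha.$$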
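The two bounds of the theorem can be checked exactly on a small instance. The sketch below is only an illustration under assumptions: it uses the family $h_{a,b}(k) = ((ak + b) \bmod p) \bmod m$ that is constructed later in this lecture, and the values of `p`, `m` and the key set are arbitrary choices made here. The function name `expected_list_length` is hypothetical.

```python
from fractions import Fraction

def expected_list_length(p, m, keys, k):
    """Exact E[n_{h(k)}], averaged over all h_{a,b} with a in 1..p-1, b in 0..p-1."""
    total = 0
    for a in range(1, p):
        for b in range(p):
            slot = ((a * k + b) % p) % m
            # n_{h(k)}: number of stored keys that land in the same slot as k
            total += sum(1 for l in keys if ((a * l + b) % p) % m == slot)
    return Fraction(total, p * (p - 1))

p, m = 17, 4                      # assumed small prime and table size
keys = [2, 3, 5, 7, 11, 13]       # the set K, n = 6
alpha = Fraction(len(keys), m)

print(expected_list_length(p, m, keys, 1) <= alpha)      # key not in the table
print(expected_list_length(p, m, keys, 5) <= 1 + alpha)  # key in the table
```

Because the average is taken over every function in the family rather than over sampled ones, the comparison involves no randomness.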

Universal hashing: sequence of operations
Corollary. Using universal hashing and chaining in a table with $m$ slots, it takes expected $\Theta(n)$ time to handle any sequence of $n$ Insert, Search, and Delete operations containing $O(m)$ Insert operations.
Proof.
- The number of insertions is $O(m)$, thus the number of elements in the hash table at any time is $O(m)$ and $\alpha = O(1)$.
- Insert and Delete take constant time; Search takes expected constant time (due to $\alpha = O(1)$).
- By linearity of expectation, the expected time for the whole sequence of $n$ operations is $O(n)$.
Notice:
- It is impossible for an adversary to pick a sequence of operations that forces the worst-case time.
- This now holds without the assumption of simple uniform hashing.

Construction of a universal class of hash functions
Simple solution. Take all functions mapping $U$ into $\{0, 1, \ldots, m-1\}$.
Disadvantages:
- size of the collection: $m^{|U|}$
- number of random bits needed to pick a random function: $|U| \log m$
- a lot of preprocessing needed
- how to encode the functions? We have to store the chosen function; space needed: $|U| \log m$
We need a much smaller collection of hash functions.

Assignment Problem 8.1. (deadline: July 8, 5:30pm) Prove that the class of all functions from $U$ to $\{0, 1, \ldots, m-1\}$ is universal.

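Number-theoretical solution. Let $p$ be a prime number and let
$$\mathbb{Z}_p = \{0, 1, \ldots, p-1\}, \qquad \mathbb{Z}_p^* = \{1, 2, \ldots, p-1\}.$$
Lemma. For any $a \in \mathbb{Z}_p^*$ and $b \in \mathbb{Z}_p$, the equation $ax \equiv b \pmod p$ has a unique solution $x \in \mathbb{Z}_p$. (See Chapter 31.4 for a proof.)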

- Let $p$ be a prime, large enough that every possible key $k$ is in the range $0$ to $p-1$, inclusive.
- For all $a \in \mathbb{Z}_p^*$ and $b \in \mathbb{Z}_p$ define the hash function
$$h_{a,b}(k) = ((ak + b) \bmod p) \bmod m$$
  (a linear transformation followed by reductions modulo $p$ and then modulo $m$).
- Note: $p > m$, since by assumption the size of the universe of keys $U$ is greater than the number of slots.
- The function $h_{a,b}$ maps $U \subseteq \mathbb{Z}_p$ into $\mathbb{Z}_m$.
Consider the following class of hash functions (shown to be universal in the theorem that follows):
$$\mathcal{H}_{p,m} = \{h_{a,b} : a \in \mathbb{Z}_p^*,\ b \in \mathbb{Z}_p\}.$$
Note:
- size of the class: $p(p-1)$
- $p$ can be chosen such that $p \le 2|U|$ (Bertrand's postulate)
- picking $h$ randomly from $\mathcal{H}_{p,m}$ means picking random $a$ and $b$
- number of bits needed to pick a function from $\mathcal{H}_{p,m}$: $O(\log |U|)$
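As a concrete illustration of picking a function from this class at the beginning of execution, here is a minimal Python sketch. The names `make_hash` and `ChainedTable` are hypothetical, and the prime `p = 101` and table size `m = 10` in the usage lines are arbitrary assumptions that require all keys to lie in 0..100.

```python
import random

def make_hash(p, m, rng=random):
    """Draw h_{a,b} from the class: a uniform in 1..p-1, b uniform in 0..p-1."""
    a = rng.randrange(1, p)
    b = rng.randrange(0, p)

    def h(k):
        return ((a * k + b) % p) % m

    return h, (a, b)

class ChainedTable:
    """Hash table with chaining; the hash function is chosen once, at creation."""

    def __init__(self, p, m, rng=random):
        self.h, self.params = make_hash(p, m, rng)
        self.slots = [[] for _ in range(m)]

    def insert(self, k):
        chain = self.slots[self.h(k)]
        if k not in chain:
            chain.append(k)

    def search(self, k):
        return k in self.slots[self.h(k)]

    def delete(self, k):
        chain = self.slots[self.h(k)]
        if k in chain:
            chain.remove(k)

table = ChainedTable(p=101, m=10)
for key in (5, 17, 42, 99):
    table.insert(key)
print(table.search(42), table.search(6))
```

Only the two numbers `a` and `b` have to be stored to remember the chosen function, which is where the $O(\log |U|)$ bit count comes from.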

Theorem. The class $\mathcal{H}_{p,m}$ of hash functions is universal.
Proof.
- Take two distinct keys $k$ and $l$. How many functions in $\mathcal{H}_{p,m}$ map $k$ and $l$ to the same slot?
- Consider $h_{a,b} \in \mathcal{H}_{p,m}$. Let
$$r = (ak + b) \bmod p, \qquad s = (al + b) \bmod p.$$
- Is it possible that $r = s$? By the Lemma: no. If $r = s$, then $k$ and $l$ are both solutions to the equation $ax \equiv r - b \pmod p$, and by uniqueness of the solution, $k = l$.
- Hence there are no collisions at the "modulo $p$ level".

- There are $p(p-1)$ functions in $\mathcal{H}_{p,m}$ and there are $p(p-1)$ possible pairs $(r, s)$ such that $r \ne s$. Is there a one-to-one correspondence? Yes.
- Values $r$ and $s$ determine the function $h_{a,b}$ (remember, $k$ and $l$ are still fixed):
  - $a(k - l) \equiv r - s \pmod p$ with $k - l \not\equiv 0 \pmod p$: by the Lemma (unknown $a$), $a$ is unique;
  - $1 \cdot b \equiv r - ak \pmod p$: by the Lemma (unknown $b$), $b$ is unique.
- This implies that $h_{a,b}$ is uniquely determined by $r$ and $s$.
- Conclusion: if we pick $h_{a,b} \in \mathcal{H}_{p,m}$ uniformly at random, the resulting pair $(r, s)$ is equally likely to be any pair of distinct values from $\mathbb{Z}_p$.

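When does a collision between $k$ and $l$ happen?
- $k$ and $l$ collide when $r \equiv s \pmod m$.
- Hence, the probability that $k$ and $l$ collide is equal to the probability that $r \equiv s \pmod m$ when $r$ and $s$ are randomly chosen as distinct values from $\mathbb{Z}_p$.
- Note: this does not depend on $k$ and $l$ anymore.
- Pick any $r$. There are at most $\lceil p/m \rceil$ choices of $s$ such that $s \equiv r \pmod m$, and one of them is $r$ itself.
- Hence, there are at most $\lceil p/m \rceil - 1$ choices for $s$ such that $s \equiv r \pmod m$ and $s \ne r$, and
$$\lceil p/m \rceil - 1 \le \frac{p + m - 1}{m} - 1 = \frac{p-1}{m}.$$
- Hence, the probability that $s \equiv r \pmod m$ and $s \ne r$ is at most
$$\frac{1}{p(p-1)} \cdot p \cdot \frac{p-1}{m} = \frac{1}{m}.$$
Final conclusion: $\mathcal{H}_{p,m}$ is universal.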
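For small parameters the statement of the theorem can also be confirmed by brute force: for every pair of distinct keys, count the functions $h_{a,b}$ that make them collide and compare with $|\mathcal{H}_{p,m}|/m = p(p-1)/m$. This is only a sanity check on assumed small values of `p` and `m`, not a substitute for the proof; the function names are hypothetical.

```python
from itertools import combinations

def colliding_functions(p, m, k, l):
    """Number of pairs (a, b) with h_{a,b}(k) == h_{a,b}(l)."""
    return sum(
        1
        for a in range(1, p)
        for b in range(p)
        if ((a * k + b) % p) % m == ((a * l + b) % p) % m
    )

def is_universal(p, m):
    """Check the universality condition for every pair of distinct keys in Z_p."""
    family_size = p * (p - 1)
    # count <= |H| / m, written without division to stay in integers
    return all(
        colliding_functions(p, m, k, l) * m <= family_size
        for k, l in combinations(range(p), 2)
    )

print(is_universal(17, 6))
print(is_universal(13, 4))
```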

Open addressing
- The second method for dealing with collisions (recall, the first method was chaining).
- Elements are stored directly in the hash table (not in linked lists).
- Hence: the load $\alpha$ is always at most $1$ (but the table can overflow).
- For every key we have a sequence of slots, called the probe sequence.
- When inserting, we try (probe) the first slot; if it is used, the second slot, etc., until we find an empty slot.
- When searching, we go over the slots in sequence until we find either the searched element or an empty slot.
- No pointers are used, so we can use more memory for the hash table instead, which results in fewer collisions and hence faster searching.

Hash functions now have two parameters:
- the key $k \in U$
- the probe number $i \in \{0, 1, \ldots, m-1\}$
Furthermore, the hash function
$$h : U \times \{0, 1, \ldots, m-1\} \to \{0, 1, \ldots, m-1\}$$
must satisfy that for every key $k$, the probe sequence $\langle h(k,0), h(k,1), \ldots, h(k,m-1) \rangle$ is a permutation of $\langle 0, 1, \ldots, m-1 \rangle$.
Now: every position is eventually considered as a slot for a new key (the algorithm fails with overflow only when there is no place in the entire table to insert a key).

Assume that an empty slot contains the value NIL.

Hash-Insert(T, k)
    i ← 0
    repeat
        j ← h(k, i)
        if T[j] = NIL then
            T[j] ← k
            return j
        else
            i ← i + 1
        end if
    until i = m
    error "hash table overflow"

Idea: probe along the probe sequence of k until we find an empty slot.

Hash-Search(T, k) probes the same sequence; therefore the search can be terminated when an empty slot is found (assuming that we did not delete anything from the table).

Hash-Search(T, k)
    i ← 0
    repeat
        j ← h(k, i)
        if T[j] = k then
            return j
        else
            i ← i + 1
        end if
    until T[j] = NIL or i = m
    return NIL

Returns j where slot j contains the key k, or NIL if k is not in the table.
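The two procedures translate almost line by line into Python. In the sketch below the probe function `probe` is an assumption made only so the example runs: it is linear probing with the auxiliary function $k \bmod m$, and any other function whose probe sequences are permutations could be substituted. `None` plays the role of NIL.

```python
NIL = None

def probe(k, i, m):
    """Assumed probe function h(k, i): linear probing with h'(k) = k mod m."""
    return (k % m + i) % m

def hash_insert(T, k):
    """Store k in the first empty slot of its probe sequence; return the slot."""
    m = len(T)
    for i in range(m):
        j = probe(k, i, m)
        if T[j] is NIL:
            T[j] = k
            return j
    raise OverflowError("hash table overflow")

def hash_search(T, k):
    """Return the slot holding k, or NIL if k is not in the table."""
    m = len(T)
    for i in range(m):
        j = probe(k, i, m)
        if T[j] == k:
            return j
        if T[j] is NIL:       # an empty slot ends the search early
            return NIL
    return NIL

T = [NIL] * 5
print(hash_insert(T, 3), hash_insert(T, 8), hash_insert(T, 13))  # 3 4 0
print(hash_search(T, 13), hash_search(T, 18))                    # 0 None
```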

Hash-Delete(T, k)
- A difficult operation: when we delete a key from the table, we cannot just mark the slot containing the key with NIL (empty), because we could disconnect a sequence of occupied slots leading to another key.
Solution.
- Mark the slot with the special value DELETED (instead of NIL).
- Modify Hash-Insert(T, k) so that it treats a slot containing DELETED as empty.
- Hash-Search(T, k) will treat such a slot as occupied, which is what we want, so no modification is needed.
Problem.
- The search time no longer depends on the load $\alpha$ = #elements / #slots, but rather on (#elements + #DELETED) / #slots, which can be quite bad.
Conclusion. When the Delete operation is allowed, prefer hashing by chaining.
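The effect of the two deletion strategies can be seen on a five-slot table. This is a sketch under assumptions: the probe function is again linear probing with $k \bmod m$, `DELETED` is represented by a unique sentinel object, and `insert` assumes the key is not already present in the table.

```python
NIL = None
DELETED = object()   # sentinel distinct from every key

def probe(k, i, m):
    """Assumed probe function: linear probing with h'(k) = k mod m."""
    return (k % m + i) % m

def insert(T, k):
    """Insert k (assumed absent), treating DELETED slots as empty."""
    m = len(T)
    for i in range(m):
        j = probe(k, i, m)
        if T[j] is NIL or T[j] is DELETED:
            T[j] = k
            return j
    raise OverflowError("hash table overflow")

def search(T, k):
    """Search as before: a DELETED slot counts as occupied, so probing continues."""
    m = len(T)
    for i in range(m):
        j = probe(k, i, m)
        if T[j] is NIL:
            return NIL
        if T[j] is not DELETED and T[j] == k:
            return j
    return NIL

def delete(T, k, marker=DELETED):
    """Overwrite the slot of k with marker; marker=NIL shows the naive approach."""
    j = search(T, k)
    if j is not NIL:
        T[j] = marker
    return j

naive = [NIL] * 5
for key in (3, 8, 13):          # all three keys start probing at slot 3
    insert(naive, key)
delete(naive, 8, marker=NIL)
print(search(naive, 13))        # None: key 13 is still stored but unreachable

marked = [NIL] * 5
for key in (3, 8, 13):
    insert(marked, key)
delete(marked, 8)
print(search(marked, 13))       # 0
print(insert(marked, 18))       # 4: the DELETED slot is reused
```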

Assignment Problem 8.2. (deadline: July 8, 5:30pm) Write pseudo-code for the modified Hash-Insert that can also handle the special value DELETED, and for the procedure Hash-Delete.

Uniform hashing
- A generalization of simple uniform hashing.
- Very strong conditions; in practice only approximations are used.
- Useful for analysis.
- Assumption: each key is equally likely to have any of the $m!$ permutations of $\langle 0, 1, \ldots, m-1 \rangle$ as its probe sequence.
- Requires that all $m!$ possible permutations (probe sequences) can be generated.
Common techniques used are much worse:
- linear probing (only $m$ distinct probe sequences)
- quadratic probing (only $m$ distinct probe sequences)
- double hashing ($m^2$ distinct probe sequences), which gives the best results

Linear probing
- Needs an ordinary hash function $h' : U \to \{0, 1, \ldots, m-1\}$, called an auxiliary hash function.
- Hash function: $h(k, i) = (h'(k) + i) \bmod m$.
- Hence, for a key $k$, the probe sequence is $h'(k),\ h'(k)+1,\ \ldots,\ m-1,\ 0,\ 1,\ \ldots,\ h'(k)-1$.
- We start at the slot given by the auxiliary hash function and then probe consecutive slots.
- The first probe determines the entire sequence, so there are only $m$ distinct probe sequences.
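A short sketch of the probe sequences, assuming a table of size `m = 8` chosen only for illustration. The argument `start` stands for the value $h'(k)$, and the function name is hypothetical.

```python
def linear_probe_sequence(start, m):
    """Probe sequence (start + i) mod m for i = 0..m-1, where start = h'(k)."""
    return [(start + i) % m for i in range(m)]

m = 8
print(linear_probe_sequence(5, m))   # [5, 6, 7, 0, 1, 2, 3, 4]

# Every key shares its sequence with all keys that have the same h'(k),
# so at most m different sequences ever occur.
distinct = {tuple(linear_probe_sequence(s, m)) for s in range(m)}
print(len(distinct))                 # 8
```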

Problem. Primary clustering
- Long clusters (sequences) of occupied slots build up, increasing the average search time.
- Why? Consider a cluster of $i$ occupied slots. If the slot following the cluster is empty, then it has probability $(i+1)/m$ of being filled during the next Insert operation.
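The probability $(i+1)/m$ can be reproduced by enumerating all starting slots. The sketch assumes that the auxiliary hash value of the next inserted key is uniformly distributed over the $m$ slots; the table size and the occupied slots are arbitrary illustrative choices, and the function names are hypothetical.

```python
from fractions import Fraction

def slot_filled_by_insert(occupied, start, m):
    """Slot that linear probing fills when the first probe goes to `start`."""
    j = start
    while j in occupied:          # assumes at least one slot is free
        j = (j + 1) % m
    return j

def fill_probability(occupied, target, m):
    """Probability that the next Insert fills `target`, for a uniform first probe."""
    hits = sum(1 for s in range(m) if slot_filled_by_insert(occupied, s, m) == target)
    return Fraction(hits, m)

m = 10
occupied = {2, 3, 4}                       # a cluster of i = 3 occupied slots
print(fill_probability(occupied, 5, m))    # 2/5, i.e. (3 + 1)/10
print(fill_probability(occupied, 8, m))    # 1/10 for an isolated empty slot
```

The slot right after the cluster is four times as likely to be filled as an isolated empty slot, which is why clusters tend to grow.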

Quadratic probing
- Hash function: $h(k, i) = (h'(k) + c_1 i + c_2 i^2) \bmod m$, where $h'$ is again an auxiliary hash function and $c_2 \ne 0$.
Problems:
- Does not work for all values of $c_1$, $c_2$ and $m$.
- Still: the probe sequence for one key is a shifted version of the probe sequence for another key, so there are only $m$ distinct probe sequences.
- Secondary clustering: a milder form of clustering (the keys mapped to the same slot will form a cluster).
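The first problem is easy to observe experimentally. In the sketch below, `m = 8` and the two pairs of constants are illustrative assumptions; exact rational arithmetic is used so that fractional constants such as $1/2$ can be tried. Checking one table size is of course not a proof that a choice of constants works in general.

```python
from fractions import Fraction

def quadratic_sequence(start, m, c1, c2):
    """Probe sequence (start + c1*i + c2*i^2) mod m for i = 0..m-1."""
    seq = []
    for i in range(m):
        offset = Fraction(c1) * i + Fraction(c2) * i * i
        if offset.denominator != 1:
            raise ValueError("c1*i + c2*i^2 must be an integer")
        seq.append((start + int(offset)) % m)
    return seq

def is_permutation(seq, m):
    return sorted(seq) == list(range(m))

m = 8
bad = quadratic_sequence(0, m, 1, 1)
good = quadratic_sequence(0, m, Fraction(1, 2), Fraction(1, 2))
print(bad, is_permutation(bad, m))     # slots repeat, some are never probed
print(good, is_permutation(good, m))   # [0, 1, 3, 6, 2, 7, 5, 4] True

# The sequence of another key is the same sequence shifted by its h'(k).
shifted = quadratic_sequence(3, m, Fraction(1, 2), Fraction(1, 2))
print(shifted == [(x + 3) % m for x in good])   # True
```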

Assignment Problem 8.3. (deadline: July 8, 5:30pm) Consider a quadratic probing scheme with constants $m = 2^t$ for some integer $t$ (i.e., $m$ is a power of $2$) and $c_1 = c_2 = 1/2$. Prove that the function
$$h(k, i) = (h'(k) + c_1 i + c_2 i^2) \bmod m$$
is indeed a hash function, i.e., prove that all probe sequences are permutations of $\langle 0, 1, \ldots, m-1 \rangle$.
Hint: The congruence $a/2 \equiv b \pmod{2^t}$ makes sense only if $a$ is even, and is equivalent to the congruence $a \equiv 2b \pmod{2^{t+1}}$.

Double hashing
- Hash function: $h(k, i) = (h_1(k) + i\,h_2(k)) \bmod m$, where $h_1$ and $h_2$ are auxiliary hash functions.
- The first probe does not determine the entire sequence; it also depends on the value of the second hash function (assuming we do not use $h_2 = h_1$).
- Approximates the behavior of uniform hashing: the generated permutations have some characteristics of randomly chosen permutations.

Requires. The values of $h_2(k)$ have to be relatively prime (coprime) to the size $m$ of the hash table. Two easy ways:
1. $m$ is a power of $2$, and $h_2$ produces only odd numbers.
2. $m$ is a prime, and $h_2$ returns positive values smaller than $m$.
Each pair $(h_1(k), h_2(k))$ yields a distinct probe sequence, hence $\Theta(m^2)$ probing sequences are used.
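Both the coprimality requirement and the count of probe sequences can be checked on small tables. The sketch takes the two auxiliary values $h_1(k)$ and $h_2(k)$ directly as arguments; the table sizes 7 and 8 are arbitrary assumptions, and the function name is hypothetical.

```python
from math import gcd

def double_hash_sequence(h1, h2, m):
    """Probe sequence (h1 + i*h2) mod m for i = 0..m-1."""
    return [(h1 + i * h2) % m for i in range(m)]

def is_permutation(seq, m):
    return sorted(seq) == list(range(m))

# m prime, 1 <= h2 < m: every pair (h1, h2) gives its own permutation.
m = 7
sequences = {
    tuple(double_hash_sequence(h1, h2, m))
    for h1 in range(m)
    for h2 in range(1, m)
}
print(len(sequences))                                   # 42 = m * (m - 1)
print(all(is_permutation(list(s), m) for s in sequences))

# m a power of 2: odd h2 works, an even h2 misses slots.
print(gcd(3, 8), is_permutation(double_hash_sequence(5, 3, 8), 8))   # 1 True
print(gcd(2, 8), double_hash_sequence(5, 2, 8))   # 2 [5, 7, 1, 3, 5, 7, 1, 3]
```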

Analysis of open-address hashing
- Assume that we use uniform hashing: a new key has a probe sequence chosen uniformly at random from all possible permutations.
- When we are looking for a key that is already in the table, we assume that every key is equally likely to be searched for.
- We will express the expected number of probes in terms of the load $\alpha = n/m < 1$.

Unsuccessful search
Theorem. Given an open-address hash table with load $\alpha = n/m < 1$, the expected number of probes in an unsuccessful search is at most $1/(1-\alpha)$, assuming uniform hashing.
Example.
- If $\alpha = 1/2$ (50% of the table occupied), then the expected number of probes for an unsuccessful search is at most $1/(1 - 1/2) = 2$.
- If $\alpha = 0.9$ (90% of the table occupied), then it is at most $1/(1 - 9/10) = 10$.

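Proof.
- Let $X$ be a random variable denoting the number of probes made in an unsuccessful search.
- For $i = 1, 2, \ldots, m$, let $A_i$ be the event that the $i$-th probe is to an occupied slot.
- $X = i$ iff events $A_1, \ldots, A_{i-1}$ happened and event $A_i$ did not.
- The event $X \ge i$ is equivalent to the event $A_1 \cap A_2 \cap \cdots \cap A_{i-1}$, hence
$$\Pr(X \ge i) = \Pr(A_1 \cap A_2 \cap \cdots \cap A_{i-1}) \quad (\text{and } \Pr(X \ge 1) = 1).$$
- We are looking for $E[X]$.
- Why do we need $\Pr(X \ge i)$? Answer: Assignment Problem 8.4.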

Assignment Problem 8.4. (deadline: July 8, 5:30pm) Consider a random variable $X$ having only positive integer values (i.e., it maps the probability space to the set of natural numbers). Prove that
$$E[X] = \Pr(X \ge 1) + \Pr(X \ge 2) + \cdots = \sum_{i=1}^{\infty} \Pr(X \ge i).$$
Hint: the events $X = i$, $X = i+1$, $X = i+2$, ... are mutually exclusive (disjoint), hence
$$\Pr(X \ge i) = \sum_{j=i}^{\infty} \Pr(X = j).$$

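We will use the formula from Assignment Problem 5.2 again:
$$\Pr(A_1 \cap A_2 \cap \cdots \cap A_{i-1}) = \Pr(A_1) \cdot \Pr(A_2 \mid A_1) \cdots \Pr(A_{i-1} \mid A_1 \cap A_2 \cap \cdots \cap A_{i-2}).$$
Hence, to find $\Pr(X \ge i)$ it is enough to evaluate the conditional probabilities $\Pr(A_j \mid A_1 \cap \cdots \cap A_{j-1})$.
Case $j = 1$: $\Pr(A_1) = n/m$ (there are $m$ slots, $n$ of them are occupied, and under uniform hashing the first probe can be to any of the $m$ slots).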

Case $j > 1$:
- Assume $A_1 \cap \cdots \cap A_{j-1}$ occurred; hence, the first $j-1$ probes were to occupied slots.
- The $j$-th probe cannot be to any of these slots (the probe sequence is a permutation).
- Hence, we have $m - (j-1)$ slots to choose from, and $n - (j-1)$ of them are occupied.
- Conclusion: the probability $\Pr(A_j \mid A_1 \cap \cdots \cap A_{j-1})$ is
$$\frac{n - (j-1)}{m - (j-1)}.$$

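We can now calculate $\Pr(X \ge i)$:
$$\Pr(X \ge i) = \frac{n}{m} \cdot \frac{n-1}{m-1} \cdots \frac{n-i+2}{m-i+2} \le \left(\frac{n}{m}\right)^{i-1} = \alpha^{i-1}.$$
Using Assignment Problem 8.4,
$$E[X] = \sum_{i=1}^{\infty} \Pr(X \ge i) \le \sum_{i=1}^{\infty} \alpha^{i-1} = \sum_{i=0}^{\infty} \alpha^{i} = \frac{1}{1-\alpha}.$$
If $\alpha$ is constant, then $1/(1-\alpha)$ is also constant, and so an unsuccessful search runs in $O(1)$ time.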
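The sum of tail probabilities can be evaluated exactly for concrete $n$ and $m$ and compared with the bound $1/(1-\alpha)$. The following sketch multiplies the conditional probabilities $(n-j+1)/(m-j+1)$ one at a time; the function name and the sample values of $n$ and $m$ are assumptions chosen for illustration.

```python
from fractions import Fraction

def expected_unsuccessful_probes(n, m):
    """E[X] = sum over i >= 1 of Pr(X >= i), under uniform hashing, for n < m."""
    expected = Fraction(0)
    tail = Fraction(1)            # Pr(X >= 1) = 1
    i = 1
    while tail > 0:
        expected += tail
        # Pr(X >= i+1) = Pr(X >= i) * (n - (i-1)) / (m - (i-1))
        tail *= Fraction(n - (i - 1), m - (i - 1))
        i += 1
    return expected

for n, m in ((5, 10), (9, 10)):
    alpha = Fraction(n, m)
    print(n, m, expected_unsuccessful_probes(n, m), 1 / (1 - alpha))
# 5 10 11/6 2
# 9 10 11/2 10
```

For these table sizes the exact expectation is noticeably below the bound, since the bound replaces every factor $(n-j)/(m-j)$ by the larger value $n/m$.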

Insert
Corollary. Inserting an element into an open-address hash table with load $\alpha$ requires at most $1/(1-\alpha)$ probes on average, assuming uniform hashing.
Proof.
- There must be space in the table, so $\alpha < 1$.
- Inserting a key is equivalent to an unsuccessful search, where we insert the key into the final (empty) slot found by the unsuccessful search.
- Hence the same expected number of probes: $1/(1-\alpha)$.

Successful search
Theorem. Given an open-address hash table with load $\alpha < 1$, the expected number of probes in a successful search is at most
$$\frac{1}{\alpha} \ln \frac{1}{1-\alpha} + O(1),$$
assuming uniform hashing and assuming that each key in the table is equally likely to be searched for.
Example (values of the leading term $\frac{1}{\alpha} \ln \frac{1}{1-\alpha}$):
- 50% of the table full: $2 \ln 2 < 1.387$
- 90% of the table full: $\frac{10}{9} \ln 10 < 2.559$

Proof.
- When looking for a key $k$, we have to repeat the same probe sequence as when inserting $k$ into the table.
- Assume that the keys were inserted into the table in the order $k_1, k_2, \ldots, k_n$.
- Hence, if we are looking for the key $k_i$, the number of probes is the same as the number of probes we needed when inserting $k_i$ into the table.
- At that time, the table contained $i-1$ elements, and the load factor was $\alpha_i = (i-1)/m$.
- Hence the search for $k_i$ requires at most
$$\frac{1}{1-\alpha_i} = \frac{1}{1 - (i-1)/m} = \frac{m}{m-i+1}$$
  probes on average.

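As $k$ can be any of the $n$ keys in the table with the same probability, we have to take the average of those expected numbers of probes:
$$\#\text{probes} \le \frac{1}{n} \sum_{i=1}^{n} \frac{m}{m-i+1} = \frac{m}{n} \sum_{i=0}^{n-1} \frac{1}{m-i} = \frac{1}{\alpha}\,(H_m - H_{m-n}),$$
where $H_i$ denotes the $i$-th harmonic number. Using $H_i = \ln i + O(1)$ (and treating $\alpha$ as a constant, so that $O(1)/\alpha = O(1)$):
$$\frac{1}{\alpha}\,(H_m - H_{m-n}) = \frac{1}{\alpha}\,\big(\ln m - \ln(m-n) + O(1)\big) = \frac{1}{\alpha} \ln \frac{m}{m-n} + O(1) = \frac{1}{\alpha} \ln \frac{1}{1-\alpha} + O(1).$$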
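The averaged sum and the logarithmic expression can be compared numerically. In the sketch below, `average_insert_bound` evaluates $\frac{1}{n}\sum_{i=1}^{n} \frac{m}{m-i+1}$ exactly and `log_term` evaluates $\frac{1}{\alpha}\ln\frac{1}{1-\alpha}$; both names and the sample values of $n$ and $m$ are assumptions made for illustration.

```python
from fractions import Fraction
from math import log

def average_insert_bound(n, m):
    """(1/n) * sum_{i=1..n} m / (m - i + 1), i.e. (1/alpha) * (H_m - H_{m-n})."""
    return sum(Fraction(m, m - i + 1) for i in range(1, n + 1)) / n

def log_term(n, m):
    """(1/alpha) * ln(1 / (1 - alpha)) for alpha = n/m < 1."""
    alpha = n / m
    return (1 / alpha) * log(1 / (1 - alpha))

for n, m in ((5, 10), (9, 10), (500, 1000), (900, 1000)):
    print(n, m, round(float(average_insert_bound(n, m)), 4), round(log_term(n, m), 4))
```

In these instances the averaged sum stays below the logarithmic term, which is consistent with the stated upper bound.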
