Notes on Bloom filters


Computer Science B63, Winter 2017
Scarborough Campus, University of Toronto
Vassos Hadzilacos

A Bloom filter is an approximate, or probabilistic, dictionary. Let $S$ be a dynamic set of keys drawn from a universe $U$. A Bloom filter maintains a summary $F_S$ of $S$, supporting the following operations:

BF-Insert($F_S$, $x$): $S := S \cup \{x\}$; add $x$ to the underlying dynamic set $S$.

BF-Search($F_S$, $x$): return "no" if $x \notin S$, and "probably yes" if, with high probability (to be discussed later), $x \in S$; it is possible, however, that "probably yes" is returned even though $x \notin S$.

An instance where BF-Search($F_S$, $x$) returns "probably yes" even though $x \notin S$ is called a false positive. Note that there is no BF-Delete operation. As we will see, deletions are problematic for Bloom filters; we will discuss partial remedies for this weakness.

Bloom filters are very space efficient; they consume only a small fraction of the space needed to store the full dynamic set $S$ using, say, an AVL tree or a hash table. Consequently they also achieve time efficiencies: they can be stored in main memory, rather than in secondary storage, and so they can be accessed much faster. The disadvantage of Bloom filters is that there is a non-zero probability of false positive searches. Note that there is no possibility of a false negative: if BF-Search($F_S$, $x$) returns "no", then $x$ is definitely not in $S$. This asymmetry between the positive and negative responses is critical in making Bloom filters useful, as will be seen when we discuss some applications of Bloom filters.

How Bloom filters work. A Bloom filter consists of an array of $m$ bits, $BF[0..m-1]$, initially all 0, corresponding to an empty set. Let $h_1, h_2, \ldots, h_t$ be hash functions that map $U$ to $\{0, 1, \ldots, m-1\}$. The Bloom filter operations are then implemented as follows:

To insert a key $x$ into the Bloom filter, we set all the bits $BF[h_1(x)], \ldots, BF[h_t(x)]$ to 1.

To search for a key $x$, we look at all the bits $BF[h_1(x)], \ldots, BF[h_t(x)]$.
If any one of them is still 0, we return "no": had $x$ been inserted into the Bloom filter, all these bits would have been set to 1. If all of them are 1, we return "probably yes".

Note that a search for $x$ may find all the bits set to 1 even though $x$ was never inserted into the dictionary. For example, suppose we use two hash functions, which map $x$ to bit positions 1 and 3, $y$ to bit positions 1 and 2, and $z$ to bit positions 2 and 3. If we insert $y$ and $z$, and then search for $x$, the search algorithm will return "probably yes" even though $x$ was not inserted into the Bloom filter. This is an example of a false positive search.

The algorithms for BF-Insert and BF-Search are shown in pseudocode in Figure 1.

    BF-Insert(F_S, x)
        for i := 1 to t do
            BF[h_i(x)] := 1

    BF-Search(F_S, x)
        for i := 1 to t do
            if BF[h_i(x)] = 0 then return "no"
        return "probably yes"

Figure 1: Insert and search operations with Bloom filters
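The pseudocode of Figure 1 can be turned into a small runnable sketch. The notes do not prescribe a particular family of hash functions, so the version below assumes, purely for illustration, that $h_i(x)$ is obtained by hashing the key together with the index $i$ using SHA-256; the class name `BloomFilter` and its method names are likewise illustrative choices. It also stores one byte per bit for readability, which a space-conscious implementation would not do.

```python
import hashlib


class BloomFilter:
    """Bit-array Bloom filter with t salted hash functions (illustrative sketch)."""

    def __init__(self, m: int, t: int) -> None:
        self.m = m                # number of positions in BF[0..m-1]
        self.t = t                # number of hash functions
        self.bits = bytearray(m)  # one byte per bit, for clarity rather than compactness

    def _positions(self, key: str):
        # Assumption: h_i(key) is simulated by hashing the index i together with the key.
        for i in range(self.t):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def insert(self, key: str) -> None:
        # BF-Insert: set every bit the key is mapped to.
        for pos in self._positions(key):
            self.bits[pos] = 1

    def search(self, key: str) -> bool:
        # BF-Search: False means "no" (definitely absent), True means "probably yes".
        return all(self.bits[pos] for pos in self._positions(key))
```

With this sketch, a key that has been inserted is always reported as "probably yes", while a key that was never inserted is usually, but not always, reported as "no".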

Assuming that we can evaluate each hash function in $O(1)$ time, these algorithms run in $\Theta(t)$ time. In typical uses, the number $t$ of hash functions is a small constant, so the algorithms run in $\Theta(1)$ time.

We use multiple hash functions, rather than just one, to reduce the probability of false positives. If we used only one hash function and we inserted a key $x$, then the search for any key $x'$ that collides with $x$ under that hash function will return "probably yes", even if $x'$ was never inserted. If we use two hash functions, a key $x'$ that collides with $x$ under one hash function is unlikely to also collide with $x$ under the other, provided the hash functions are independent (informally, they tend to map the same key to different positions). We will make this more precise later, when we analyze the performance of Bloom filters. We will explore shortly what the optimal number of hash functions is.

Probability of false positive. The probability of a false positive search depends on three factors: the size $m$ of the Bloom filter; the number of items $n$ inserted into the Bloom filter; and the number of hash functions $t$ used for the Bloom filter. Intuitively, the larger the $m$, the lower the probability of collisions and therefore of false positives. Similarly, the smaller the $n$, the lower the probability of collisions and therefore of false positives. The ratio $\alpha = n/m$ is called the load factor; we encountered this quantity in our analysis of hash tables. From the preceding discussion it is clear that the smaller the load factor, the lower the probability of false positives. The optimal value of the third parameter $t$ occupies a sweet spot between too few hash functions (leading to a higher probability of collisions, and therefore a higher probability of false positives) and too many hash functions (causing each inserted item to set many bits to 1, and therefore also a higher probability of false positives).
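The sweet-spot intuition can be checked with a small experiment. The sketch below is only an illustration under stated assumptions: salted SHA-256 stands in for the $t$ hash functions, the keys are made-up strings, and the parameters ($m = 1000$, $n = 100$, and the three values of $t$) are arbitrary choices rather than values taken from these notes. It inserts $n$ keys and then measures how often keys that were never inserted are reported as "probably yes".

```python
import hashlib


def positions(key: str, t: int, m: int) -> list[int]:
    # Assumption: salted SHA-256 stands in for t independent hash functions.
    return [
        int.from_bytes(hashlib.sha256(f"{i}:{key}".encode()).digest()[:8], "big") % m
        for i in range(t)
    ]


def false_positive_rate(m: int, n: int, t: int, queries: int = 2000) -> float:
    bits = bytearray(m)
    for j in range(n):  # insert n distinct keys
        for p in positions(f"in-{j}", t, m):
            bits[p] = 1
    hits = sum(  # search for keys that were never inserted
        all(bits[p] for p in positions(f"out-{j}", t, m))
        for j in range(queries)
    )
    return hits / queries


# Too few, a moderate number, and too many hash functions at load factor 0.1.
rates = {t: false_positive_rate(m=1000, n=100, t=t) for t in (1, 7, 30)}
```

Under these assumptions, the measured rate for the moderate choice of $t$ should come out well below the rates for both the very small and the very large choice, which is the behaviour the analysis that follows makes precise.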
To analyze the probability of a false positive search, we consider a two-stage process.

(A) We insert $n$ distinct keys $x_1, x_2, \ldots, x_n$ into the Bloom filter. We model these insertions by the following experiment. Start with a Bloom filter all of whose bits are set to 0. Repeat the following a total of $nt$ times, independently: choose a bit position in the Bloom filter uniformly at random (i.e., each position is chosen with probability $1/m$), and set that bit to 1. This models the insertion of $n$ distinct keys, drawn at random from $U$, where each insertion uses $t$ hash functions to set some bits to 1.

(B) Next we search for a randomly chosen key $x \neq x_1, x_2, \ldots, x_n$ in $U$, and we want to determine the probability of a false positive, i.e., the probability that the bits to which $x$ is mapped by the $t$ hash functions have all been set to 1 by the insertion process. We model this by repeating $t$ times, independently, the following: choose a position in the Bloom filter uniformly at random. We then compute the probability of the event that all of the positions chosen were set to 1 during Stage (A).

This is an idealized model, like the simple uniform hashing assumption (SUHA) that we used to analyze hashing: it assumes that there are no dependencies or regularities in the set of keys inserted into the Bloom filter, and that the hash functions distribute the keys uniformly at random to the positions of the Bloom filter. With suitably designed hash functions, this idealized model captures well enough the reality of many situations that arise in practice.

Fix an arbitrary position $\ell$, $0 \le \ell < m$, of the Bloom filter. We first compute the probability that $BF[\ell] = 0$ at the end of Stage (A), i.e., after the keys $x_1, x_2, \ldots, x_n$ have been inserted. According to our model, the probability that one of these keys under one of the hash functions hits position $\ell$ is $1/m$; and therefore the probability that it misses position $\ell$ is $1 - 1/m$. Since the positions of the Bloom filter set to 1 during Stage (A) are chosen independently and uniformly at random, the probability that all $n$ keys inserted under all hash functions miss position $\ell$ is $(1 - 1/m)^{nt}$. That is,

$$\Pr\bigl[BF[\ell] = 0 \text{ after } x_1, \ldots, x_n \text{ are inserted}\bigr] = \left(1 - \frac{1}{m}\right)^{nt} \approx e^{-nt/m} = e^{-\alpha t},$$

where the approximation is justified by the fact that, for values of $x$ close to 0, $1 - x \approx e^{-x}$.

Now consider any key $x$ different from all the $n$ keys inserted into the Bloom filter. The probability that a search for $x$ yields a false positive is the probability that, after the insertion of $x_1, \ldots, x_n$ into the Bloom filter, the positions to which the hash functions map $x$ are all set to 1. As we just saw, the probability that any particular bit of $BF$ is 0 after the insertions is $e^{-\alpha t}$, and so the probability that any particular bit is 1 is $1 - e^{-\alpha t}$. By the model assumption that the hash functions map $x$ to positions of $BF$ chosen independently and uniformly at random, the probability that all of the bits to which the hash functions map $x$ are 1 is $(1 - e^{-\alpha t})^t$.

Suppose now that the size of the Bloom filter $m$ and the number of elements in it $n$ are fixed; therefore the load factor $\alpha = n/m$ is fixed. For this fixed $\alpha$, the probability of a false positive becomes a function only of $t$, the number of hash functions:

$$P(t) = \left(1 - e^{-\alpha t}\right)^t \tag{1}$$

We can therefore compute the value of $t$ that minimizes this function, by taking its derivative and setting it to 0. We have:

$$\frac{dP(t)}{dt} = \left(1 - e^{-\alpha t}\right)^t \left( \ln\left(1 - e^{-\alpha t}\right) + \frac{\alpha t\, e^{-\alpha t}}{1 - e^{-\alpha t}} \right)$$

Setting the derivative to 0 and solving for $t$, we get $t = 0$ or $t = \alpha^{-1} \ln 2$. (For the second solution, note that when $e^{-\alpha t} = 1/2$ we have $\alpha t = \ln 2$, and the bracketed factor becomes $-\ln 2 + \ln 2 = 0$.) The value $t = 0$ is not feasible (since we need a positive number of hash functions!), so the optimal number of hash functions is given by

$$t = \alpha^{-1} \ln 2 \tag{2}$$

Note that this is a non-integer value, so we will use the positive integer $t$ that is closest to $\alpha^{-1} \ln 2$.
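Equations (1) and (2) are easy to evaluate numerically. The following is a minimal sketch; the function names `false_positive_probability` and `optimal_hash_count` are illustrative choices, not names used in these notes, and the result is only as accurate as the idealized model behind equation (1).

```python
import math


def false_positive_probability(alpha: float, t: int) -> float:
    """Equation (1): P(t) = (1 - e^(-alpha * t))^t, for load factor alpha = n/m."""
    return (1.0 - math.exp(-alpha * t)) ** t


def optimal_hash_count(alpha: float) -> int:
    """Positive integer closest to equation (2), t = ln(2) / alpha."""
    return max(1, round(math.log(2) / alpha))
```

For instance, with a load factor of $\alpha = 0.1$ (ten bits per inserted key), `optimal_hash_count(0.1)` gives 7, and `false_positive_probability(0.1, 7)` is roughly 0.008.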
Substituting (2) into (1), we get that the probability of a false positive search using the optimal number of hash functions is

$$P(\alpha^{-1} \ln 2) = \left(1 - e^{-\alpha \alpha^{-1} \ln 2}\right)^{\alpha^{-1} \ln 2} = \left(\frac{1}{2^{\ln 2}}\right)^{\alpha^{-1}} \approx (0.6185)^{\alpha^{-1}} \tag{3}$$

Example. Suppose we have a dictionary consisting of 10 million URLs, i.e., $n = 10^7$. If we allocate a Bloom filter with $m = 32n = 3.2 \times 10^8$ bits, we have $\alpha^{-1} = 32$. Applying (3), we get that the probability of a false positive in this case is approximately $2.1 \times 10^{-7}$. A more accurate calculation would be to first find the optimal number $t$ of hash functions as the positive integer closest to the value given by (2), and then apply (1) for that value of $t$. Doing so, we obtain that $t$ should be the positive integer closest to $32 \ln 2 \approx 22.18$, i.e., $t = 22$. Plugging this value into (1), we get that the probability of a false positive search is $P(22) = (1 - e^{-22/32})^{22} \approx 2.1 \times 10^{-7}$.

The inverse of the load factor, $\alpha^{-1} = m/n$, can be thought of as the number of bits we allocate per element inserted into the Bloom filter. Note that this interpretation should not be viewed as meaning that we allocate a specific set of positions in the Bloom filter for each item we insert: each item inserted into the Bloom filter gets (up to) $t$ bits, the positions to which it is mapped by the $t$ hash functions. Rather, $\alpha^{-1}$ is a measure of how much space we save by using a Bloom filter instead of storing the dictionary explicitly. In our example, $\alpha^{-1} = 32$; thus we allocate 32 bits, i.e., 4 bytes, for each URL in the dictionary. This is much shorter than is required to store an actual URL.

Deletions. Deletions are problematic in Bloom filters. Note that we cannot delete an element merely by setting to 0 the bits to which it is mapped by the hash functions: doing so would result in false negatives, which would render Bloom filters useless. To see how this can happen, suppose we have inserted three keys: $x$, which is mapped to bit positions 1 and 3; $y$, which is mapped to bit positions 1 and 2; and $z$, which is mapped to bit positions 2 and 3. If we delete $y$ and $z$ by setting their bits to 0, and we then search for $x$, the Bloom filter would return "no", even though $x$ was not deleted.

A partial solution to this limitation is to use so-called counting Bloom filters. In a counting Bloom filter, each position in the array $BF$ is not a bit but a small counter. Initially, every counter is 0, indicating an empty Bloom filter. Each time a key $x$ is inserted (respectively, deleted), the counters in the positions to which $x$ is mapped by the hash functions are incremented (respectively, decremented) by 1. To search for a key $x$, we look at all the counters to which $x$ is mapped by the hash functions; if any of them is 0 we return "no"; otherwise, we return "probably yes". Pseudocode for these operations is shown in Figure 2.

    BF-Insert(F_S, x)
        for i := 1 to t do
            BF[h_i(x)] := BF[h_i(x)] + 1

    BF-Delete(F_S, x)
        for i := 1 to t do
            BF[h_i(x)] := BF[h_i(x)] - 1

    BF-Search(F_S, x)
        for i := 1 to t do
            if BF[h_i(x)] = 0 then return "no"
        return "probably yes"

Figure 2: Insert, delete, and search operations with counting Bloom filters

We don't want to allocate many bits to each counter, as this would undermine the space savings that Bloom filters are designed to deliver.
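The counting operations of Figure 2 can be sketched as follows. This is an illustration only: the class name `CountingBloomFilter` is hypothetical, salted SHA-256 is assumed as a stand-in for the hash functions, and the counters are ordinary Python integers, so the sketch deliberately ignores the question of how few bits each counter can get away with.

```python
import hashlib


class CountingBloomFilter:
    """Counting Bloom filter following Figure 2 (illustrative sketch)."""

    def __init__(self, m: int, t: int) -> None:
        self.m = m
        self.t = t
        # Assumption: unbounded Python ints; a space-conscious version would use small counters.
        self.counters = [0] * m

    def _positions(self, key: str) -> list[int]:
        # Assumption: salted SHA-256 stands in for the hash functions h_1, ..., h_t.
        return [
            int.from_bytes(hashlib.sha256(f"{i}:{key}".encode()).digest()[:8], "big") % self.m
            for i in range(self.t)
        ]

    def insert(self, key: str) -> None:
        for pos in self._positions(key):
            self.counters[pos] += 1

    def delete(self, key: str) -> None:
        # Only meaningful for a key that was inserted earlier and not yet deleted.
        for pos in self._positions(key):
            self.counters[pos] -= 1

    def search(self, key: str) -> bool:
        # False means "no"; True means "probably yes".
        return all(self.counters[pos] > 0 for pos in self._positions(key))
```

Replaying the three-key example with this sketch, inserting `x`, `y`, and `z` and then deleting `y` and `z` still leaves `x` reported as "probably yes", because the deletions only undo the increments made for `y` and `z`.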
On the other hand, if the counters are too small, they will wrap around and again produce false negatives. For these reasons, counting Bloom filters are only a limited solution. As we will see, Bloom filters are typically used in applications where there are no, or only very few, deletions.

Applications. We now briefly describe some applications of Bloom filters.

Refusing service to black-listed sites. A web server may keep a (long) list of black-listed sites, known to contain malware or to distribute spam. Whenever the web server receives a request from such a site, it does not respond to it. Almost all requests that the web server receives are from clean sites. Nevertheless, the list of black-listed sites is too long to keep in main memory. It would be very inefficient to rely on the disk copy alone: doing so would mean that the web server would have to perform time-consuming disk accesses at each request to verify that the requesting site is not black-listed. Instead, the web server keeps the full list of black-listed sites on disk, and keeps in main memory a Bloom filter of the black-listed sites. This is feasible because the Bloom filter is much shorter than the actual list of black-listed sites. When a request arrives from a site $s$, the web server checks to see if $s$ is in the Bloom filter. In most cases (and assuming that the probability of a false positive is low), the answer is "no", in which case the web server replies to $s$'s request. In the rare instances where the answer is "probably yes", the web server performs a disk access to search the actual list of black-listed sites for $s$. If $s$ is not found on that list, the web server replies to $s$'s request; otherwise, i.e., if $s$ is actually a black-listed site, the web server ignores $s$'s request.

Approximate counting. Suppose we want to count how many different IP addresses have visited a web page.
The obvious way to do this is to keep the set $V$ of all IP addresses that have visited the web page in the past, and a counter giving the cardinality of that set. Each time a request arrives from IP address $a$, we check if $a \in V$; if not, we add $a$ to $V$ and increment the counter. It is, however, too expensive to remember all IP addresses that visited the page in the past. If (as is often the case) it is acceptable to provide an approximate counter that slightly undercounts unique visitors, we can use a Bloom filter of the visitors' IP addresses, rather than the set of addresses themselves. When IP address $a$ visits the web page, we check if $a$ is in the Bloom filter. If the answer is "no", we know for sure that $a$ is a new visitor, so we insert $a$ into the Bloom filter and increment the counter. If the answer is "probably yes", we don't increment the counter. Note that if this was a false positive, by not incrementing the counter we have missed a new visitor. If false positives are rare, our approximate counter will be close enough. To be honest about the service provided, the counter should be used to report "at least $x$ unique visitors" (rather than "$x$ unique visitors", as if $x$ were the exact number).

These applications share the following characteristics:

1. Saving space is a key objective. In both applications, we don't want to allocate the space needed to store the entire set of items we are interested in. In the case of the web server managing black-listed sites, we store the full set of black-listed sites on disk, but we access the disk copy only rarely. In the case of the approximate counter, we don't store the full set of past visitors at all.

2. Objects are rarely (if ever) deleted from the dynamic set of objects contained in the Bloom filter. Once a site is compromised, it remains black-listed forever; and once an IP address has visited the web page, it remains a past visitor (by definition!) forever.

3. The fact that there are no false negatives is crucial for the Bloom filter to be useful. In the case of the web server managing black-listed sites, if a Bloom filter search returns "no", we know for sure that the requesting site is not black-listed and it is therefore safe to respond to its requests.
In the case of the approximate counter, if a Bloom filter search returns "no", we know for sure that the visitor is new and it is correct to increment the counter of unique visitors.

4. There is an effective way to mitigate the effect of false positives. In the case of the web server managing black-listed sites, the mitigation strategy is to access the list of black-listed sites stored on disk when the Bloom filter answers "probably yes". This is slow, but it is tolerable because it happens rarely. In the case of the approximate counter, the mitigation strategy is to provide an undercount of the unique visitors, rather than an exact count.

These characteristics (saving space, rare deletions, tolerance of (rare) false positives, and existence of a mitigation strategy for false positives) are typical of applications in which Bloom filters can be brought to bear.
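As an illustration of the approximate-counting application described above, the sketch below keeps only a Bloom filter and a counter. The class name `UniqueVisitorCounter`, the method `visit`, and the use of salted SHA-256 as the hash family are all assumptions made for this example; the addresses in the usage note are reserved documentation addresses, not real data.

```python
import hashlib


class UniqueVisitorCounter:
    """Reports a lower bound on the number of distinct visitors (illustrative sketch)."""

    def __init__(self, m: int, t: int) -> None:
        self.m = m
        self.t = t
        self.bits = bytearray(m)  # the Bloom filter of past visitors
        self.count = 0            # "at least this many" unique visitors

    def _positions(self, address: str) -> list[int]:
        # Assumption: salted SHA-256 stands in for the hash functions h_1, ..., h_t.
        return [
            int.from_bytes(hashlib.sha256(f"{i}:{address}".encode()).digest()[:8], "big") % self.m
            for i in range(self.t)
        ]

    def visit(self, address: str) -> None:
        positions = self._positions(address)
        if not all(self.bits[p] for p in positions):
            # "no": the address is definitely new, so insert it and count it.
            for p in positions:
                self.bits[p] = 1
            self.count += 1
        # "probably yes": leave the counter alone; a false positive here causes an undercount.
```

Feeding the visits `192.0.2.1`, `192.0.2.2`, `192.0.2.1`, `192.0.2.3`, `192.0.2.2` to a counter created as `UniqueVisitorCounter(m=4096, t=4)` leaves `count` at no more than 3, the true number of distinct addresses, because repeat visits are never counted twice.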
