Bloom Filters
References:
Li Fan, Pei Cao, Jussara Almeida, and Andrei Broder, "Summary Cache: A Scalable Wide-Area Web Cache Sharing Protocol," IEEE/ACM Transactions on Networking, Vol. 8, No. 3, June 2000.
B. Bloom, "Space/Time Trade-offs in Hash Coding with Allowable Errors," CACM, Vol. 13, No. 7, July 1970.
11/21/05 (SSL)
The Problem
A Bloom filter represents a set $A = \{a_1, a_2, \ldots, a_n\}$ of $n$ stored elements (also called keys). A sequence of keys is later tested, one by one, for membership in the set. The great majority of the keys to be tested do not belong to the set, so a fast reject time is desired, with no false negatives. The tradeoff is between the storage space for the Bloom filter and the false positive rate.
The idea
Allocate a vector $v$ of $m$ bits, initially all set to 0. Choose $k$ independent hash functions $H_1, H_2, \ldots, H_k$, each with range $\{1, \ldots, m\}$. For each stored element $a$, set the bits at positions $H_1(a), H_2(a), \ldots, H_k(a)$ in $v$ to 1. A particular bit might be set to 1 multiple times.
[Figure: Bloom filter with 4 hash functions]
Membership query
To determine whether $b$ is in the set, check the bits at positions $H_1(b), H_2(b), \ldots, H_k(b)$. If any of them is 0, then $b$ is certainly not in the set $A$, so there are no false negatives. If all of them are 1, $b$ is reported as present, but this may be a false positive. The parameters $k$ and $m$ are chosen to trade memory space for a small false positive probability. The false positive probability decreases as $m/n$ increases and, for a fixed $m/n$, as $k$ increases up to its optimum value.
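The insertion and query steps can be sketched in a few lines of Python. This is a minimal illustration, not a tuned implementation: the class name `BloomFilter`, the use of SHA-256 salted with the index `i` as a stand-in for the independent hash functions $H_i$, and the 0-based bit positions (the slides use $1, \ldots, m$) are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter sketch: m bits, k hash functions, 0-based positions."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.bits = bytearray(m)  # one byte per bit, for clarity rather than compactness

    def _positions(self, key):
        # Assumption: SHA-256 salted with the index i stands in for H_i.
        return [
            int.from_bytes(hashlib.sha256(f"{i}:{key}".encode()).digest()[:8], "big") % self.m
            for i in range(self.k)
        ]

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = 1  # setting an already-set bit is harmless

    def __contains__(self, key):
        # Any 0 bit proves absence; all 1s only means "probably present".
        return all(self.bits[p] for p in self._positions(key))


bf = BloomFilter(m=1024, k=4)
for url in ["a.example/x", "b.example/y"]:
    bf.add(url)
print("a.example/x" in bf)  # True: a stored key is never rejected
```

A key that was never added is usually rejected, but the filter may occasionally answer `True` for it; that is the false positive the slides refer to.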
The math
After inserting $n$ keys into a filter of size $m$ (bits), the probability that a particular bit in the filter is still 0 is
$(1 - 1/m)^{kn} \approx e^{-kn/m}$
The probability of a false positive in this situation is
$\left(1 - (1 - 1/m)^{kn}\right)^k \approx \left(1 - e^{-kn/m}\right)^k$
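As a quick numerical check, the standard expressions $(1 - (1 - 1/m)^{kn})^k$ and its approximation $(1 - e^{-kn/m})^k$ can be evaluated directly. The function names and the sample values of $m$, $n$, and $k$ below are illustrative assumptions, not figures from the slides.

```python
import math


def bit_zero_probability(m, n, k):
    """Probability that a given bit is still 0 after inserting n keys."""
    return (1.0 - 1.0 / m) ** (k * n)


def false_positive_exact(m, n, k):
    """All k probed bits are 1 for a key that was never inserted."""
    return (1.0 - bit_zero_probability(m, n, k)) ** k


def false_positive_approx(m, n, k):
    """Same quantity, using (1 - 1/m)^(kn) ~ exp(-kn/m)."""
    return (1.0 - math.exp(-k * n / m)) ** k


m, n, k = 8000, 1000, 4  # hypothetical sizes: 8 bits per key, 4 hash functions
print(false_positive_exact(m, n, k))   # roughly 0.024
print(false_positive_approx(m, n, k))  # agrees closely with the exact form
```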
Optimal tradeoff
The right-hand side of the previous equation, $(1 - e^{-kn/m})^k$, is minimized for $k = (\ln 2)(m/n)$, in which case it becomes
$(1/2)^k = (0.6185)^{m/n}$
The false positive probability decreases as $m/n$ increases.
[Figure: false positive probability versus $m/n$, with one curve for $k = 4$ and one for the optimum integral number of hash functions]
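In practice $k$ must be an integer, so one simple approach is to compare the two integers on either side of $(\ln 2)(m/n)$. The sketch below does that using the approximation $(1 - e^{-kn/m})^k$; the helper names and the sample sizes are assumptions made for illustration.

```python
import math


def optimal_k(m, n):
    """Real-valued optimum k = ln(2) * m / n."""
    return math.log(2) * m / n


def best_integer_k(m, n):
    """Compare floor and ceiling of the real optimum and keep the better one."""
    def fp(k):
        return (1.0 - math.exp(-k * n / m)) ** k

    k_real = optimal_k(m, n)
    candidates = {max(1, math.floor(k_real)), max(1, math.ceil(k_real))}
    return min(candidates, key=fp)


m, n = 8000, 1000  # hypothetical sizes: 8 bits per key
print(optimal_k(m, n))       # about 5.55
print(best_integer_k(m, n))  # 6
print(0.6185 ** (m / n))     # false positive probability at the real-valued optimum
```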
Handling membership changes
For each location $h$ in the bit array, $h = 1, \ldots, m$, maintain a count $c(h)$, initially zero, equal to the number of times the bit location has been set to 1. When a key $a$ joins or leaves $A$, the counts $c(H_1(a)), c(H_2(a)), \ldots, c(H_k(a))$ are incremented or decremented by 1, respectively. A bit location is turned on when its count changes from 0 to 1, and turned off when its count changes from 1 to 0.
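A counting variant along these lines might look as follows. The class name, the salted SHA-256 hashing, the 0-based positions, and the saturating cap `max_count` (modelling a small fixed-width counter) are assumptions for this sketch; the bit array is implied by whether each count is nonzero.

```python
import hashlib


class CountingBloomFilter:
    """Sketch of a Bloom filter with per-position counts, so keys can be removed."""

    def __init__(self, m, k, max_count=15):
        self.m = m
        self.k = k
        self.max_count = max_count  # assumption: 4-bit counters that saturate at 15
        self.counts = [0] * m

    def _positions(self, key):
        # Assumption: SHA-256 salted with the index i stands in for H_i.
        return [
            int.from_bytes(hashlib.sha256(f"{i}:{key}".encode()).digest()[:8], "big") % self.m
            for i in range(self.k)
        ]

    def add(self, key):
        for p in self._positions(key):
            if self.counts[p] < self.max_count:
                self.counts[p] += 1  # bit turns on when the count goes 0 -> 1

    def remove(self, key):
        # Only call this for keys that were previously added.
        for p in self._positions(key):
            if self.counts[p] > 0:
                self.counts[p] -= 1  # bit turns off when the count goes 1 -> 0

    def __contains__(self, key):
        return all(self.counts[p] > 0 for p in self._positions(key))


cbf = CountingBloomFilter(m=1024, k=4)
cbf.add("a.example/x")
cbf.add("b.example/y")
cbf.remove("a.example/x")
print("b.example/y" in cbf)  # True: removing one key does not disturb another
```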
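How much memory for counts
After inserting $n$ keys with $k$ hash functions into an array of $m$ bits, the probability that any count is greater than or equal to $i$ is bounded by
$\Pr(\max_h c(h) \ge i) \le m \binom{nk}{i} \frac{1}{m^i} \le m \left( \frac{e\,nk}{i\,m} \right)^i$
Assuming the number of hash functions is at most $(\ln 2)(m/n)$, which is the optimum, this gives
$\Pr(\max_h c(h) \ge i) \le m \left( \frac{e \ln 2}{i} \right)^i$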
How much memory for counts (cont.)
For $i = 16$, we have
$\Pr(\max_h c(h) \ge 16) \le 1.37 \times 10^{-15} \times m$
Allowing 4 bits per count, the probability of overflow is negligible for any practical value of $m$. If a count ever reaches 15, it stays at 15 when it should be incremented. The consequence is that, many deletions later, the Bloom filter may allow a false negative.
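The figure for 4-bit counters can be reproduced from the bound $m \, (e \ln 2 / i)^i$ given in the Summary Cache paper, which holds when $k \le (\ln 2)(m/n)$. The function name and the sample array size below are assumptions for illustration.

```python
import math


def overflow_bound(m, i):
    """Upper bound m * (e * ln 2 / i)^i on Pr(some count >= i), assuming k <= ln(2) * m / n."""
    return m * (math.e * math.log(2) / i) ** i


print(overflow_bound(1, 16))        # about 1.37e-15, the per-position factor
print(overflow_bound(2 ** 23, 16))  # hypothetical array of about 8 million counters
```

Even for an array with millions of positions, the bound stays far below any probability of practical concern, which is why 4 bits per count is considered enough.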
Application: Summary Cache
Cooperating proxies sit behind an Internet bottleneck and serve each other's cache misses.
ICP: a cache miss causes queries to be sent to all other proxies (not scalable).
Summary Cache: each proxy computes a summary (Bloom filter) of the URLs of its cached documents, together with counts for the bit locations; sends the bit array to every other proxy; and sends a summary update when the percentage of new documents reaches a threshold.
More on Summary Cache
A local cache miss results in queries sent only to the proxies whose summaries indicate that they have the requested document, which gives a large reduction in message traffic. Summaries do not have to be up to date or accurate; the costs of inaccuracy are:
False misses: the total hit rate is reduced, due to delayed updates.
False hits: some bandwidth is wasted.
Remote stale hits: some bandwidth is wasted.
The memory required increases with the number of proxies.
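The query-routing idea can be sketched as below. This is a simplified model under stated assumptions: the `Proxy` class, `publish_summary`, and `peers_to_query` are hypothetical names, the summary is rebuilt from the cache contents rather than maintained with counts, and salted SHA-256 stands in for the hash functions.

```python
import hashlib


def positions(url, m, k):
    # Assumption: SHA-256 salted with the index i stands in for the k hash functions.
    return [
        int.from_bytes(hashlib.sha256(f"{i}:{url}".encode()).digest()[:8], "big") % m
        for i in range(k)
    ]


class Proxy:
    def __init__(self, name, m=4096, k=4):
        self.name, self.m, self.k = name, m, k
        self.cache = set()             # URLs actually cached right now
        self.published = bytearray(m)  # summary last sent to peers (may be stale)

    def store(self, url):
        self.cache.add(url)

    def publish_summary(self):
        # Simplification: rebuild the bit array from scratch on every update.
        bits = bytearray(self.m)
        for url in self.cache:
            for p in positions(url, self.m, self.k):
                bits[p] = 1
        self.published = bits

    def summary_says_yes(self, url):
        return all(self.published[p] for p in positions(url, self.m, self.k))


def peers_to_query(url, peers):
    """On a local miss, contact only the peers whose summary matches the URL."""
    return [p for p in peers if p.summary_says_yes(url)]


a, b = Proxy("a"), Proxy("b")
a.store("example.org/1")
a.publish_summary()
a.store("example.org/2")  # cached, but not yet reflected in the published summary
print([p.name for p in peers_to_query("example.org/1", [a, b])])  # ['a']
print([p.name for p in peers_to_query("example.org/2", [a, b])])  # likely a false miss until the next update
```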
Other applications
Sarang Dharmapurikar, et al., "Longest Prefix Matching using Bloom Filters," Proceedings ACM SIGCOMM 2003, August 2003.
Bloom filter $i$ holds the set of IP address prefixes of length $i$, for $i = 1, \ldots, 32$ (some filters may be empty). To find the next hop for a particular IP address, the address is used to probe all Bloom filters in parallel to get the matching prefix lengths. The hash tables associated with the matching prefix lengths are then probed, starting with the longest.
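A software sketch of this lookup is shown below. It is only an illustration of the idea: the class `LpmTable`, the filter parameters `M` and `K`, and the salted SHA-256 hashing are assumptions, and the filters are probed one after another here rather than in parallel as a hardware design would do.

```python
import hashlib
import ipaddress

M, K = 2048, 4  # hypothetical filter size (bits) and number of hash functions


def positions(key):
    # Assumption: SHA-256 salted with the index i stands in for the hash functions.
    return [
        int.from_bytes(hashlib.sha256(f"{i}:{key}".encode()).digest()[:8], "big") % M
        for i in range(K)
    ]


def prefix_of(addr, length):
    """Top `length` bits of a 32-bit address, as a bit string."""
    return format(addr, "032b")[:length]


class LpmTable:
    def __init__(self):
        self.filters = {n: bytearray(M) for n in range(1, 33)}  # one filter per prefix length
        self.tables = {n: {} for n in range(1, 33)}             # exact hash table per length

    def insert(self, addr, length, next_hop):
        p = prefix_of(addr, length)
        for pos in positions(p):
            self.filters[length][pos] = 1
        self.tables[length][p] = next_hop

    def lookup(self, addr):
        # Probe every filter to collect candidate prefix lengths.
        candidates = [
            n for n in range(1, 33)
            if all(self.filters[n][pos] for pos in positions(prefix_of(addr, n)))
        ]
        # Check the hash tables longest first; a filter false positive just falls through.
        for n in sorted(candidates, reverse=True):
            hop = self.tables[n].get(prefix_of(addr, n))
            if hop is not None:
                return hop
        return None


def ip(text):
    return int(ipaddress.IPv4Address(text))


table = LpmTable()
table.insert(ip("10.0.0.0"), 8, "hop-A")
table.insert(ip("10.1.0.0"), 16, "hop-B")
print(table.lookup(ip("10.1.2.3")))     # hop-B: the /16 is the longest match
print(table.lookup(ip("10.9.9.9")))     # hop-A: only the /8 matches
print(table.lookup(ip("192.168.0.1")))  # None: no stored prefix matches
```

Because every candidate length is confirmed in its hash table, a false positive in a filter costs an extra table probe but never a wrong next hop.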
Other applications
Alex C. Snoeren, et al., "Hash-Based IP Traceback," Proceedings ACM SIGCOMM 2001, August 2001.
Routers compute a 32-bit digest over the invariant portion of the IP header and the first 8 bytes of payload of every packet forwarded. The digests are stored in Bloom filters to save memory (down to 0.5% of link bandwidth per unit time). The stored digests in the routers are then used to trace the source of attack packets.
Other applications
Intrusion detection, content-based routing.
A. Broder and M. Mitzenmacher, "Network Applications of Bloom Filters: A Survey," Proceedings 40th Annual Allerton Conference, October 2002.