Compact data structures: Bloom filters

1 Compact data structures: Bloom filters
Luca Becchetti
Sapienza Università di Roma, Rome, Italy
April 7, 2010

3 Dictionaries
A dictionary is a dynamic set $S$ of objects from a discrete universe $U$ that supports (at least) the following operations:
- Item insertion
- Item deletion
- Set membership: decide whether item $x \in S$
Typically, each element of $S$ is assumed to be uniquely identified by a key. Let obj(k) be the object with key k.
Operations
- insert(x, S): insert item x
- delete(k, S): delete the item whose key is k
- retrieve(k, S): retrieve obj(k)
This is a minimal set of operations. Any database implements a (greatly augmented) dictionary.

4 Testing for membership
- Dictionaries are large or huge in many applications
- Any of the operations above potentially involves access to secondary storage
Set membership
- Retrieval (deletion) can be restated as follows: if $\mathrm{obj}(k) \in S$ then retrieve(k, S) (delete(k, S))
- ismember(k, S): if it returns false then $\mathrm{obj}(k) \notin S$
Why this
- Membership can be tested efficiently using compact data structures
- The check is often done in main memory
- No need to access secondary storage if the answer is false

5 Example: spell-checker
- Provide a first level of spell checking for a text editor
- Must quickly report spelling mistakes to the user
Exact check
- Needs an efficient data structure
- Trees are typically used
- Terms correspond to nodes (typically leaves) of the tree
- The thesaurus contains many terms (the worked example at the end of these slides assumes $10^5$)
- May be too large for quick response times
Idea: trade accuracy for efficiency

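6 Bloom filters
- Used to provide a compact summary of a set of keys
- Each key $k$ is hashed $t$ times on $[m] = \{0, \ldots, m-1\}$ using $t$ independent hash functions $h_1, \ldots, h_t$
- Binary array $B$ of size $m$ ($m$ typically a prime)
- For the moment: only insertions and set membership
Figure: Bloom filter. Key $k$ is mapped by $h_1(k), h_2(k), \ldots, h_t(k)$ to positions of the array $B$, indexed $0, \ldots, m-1$.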

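7 Use of Bloom filters (object retrieval)
Figure: timeline of a retrieval. (1) ismember(k) is issued to the Bloom filter in main memory; (2) the filter answers true; (3) retrieve(k) is issued to the database; (4) the database returns obj(k).
Potential savings for retrieval (insertion/deletion)
- Steps (3) and (4) do not occur if ismember(k) returns false
- The Bloom filter is stored in main memory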

8 Bloom filters: insertion and set membership

insert(k)
    Require: k: object key
    for j = 1, ..., t do
        i = h_j(k)
        if B_i == 0 then
            B_i = 1
        end if
    end for

ismember(k)
    Require: k: object key
    member = true; j = 1
    while member == true and j <= t do
        i = h_j(k)
        if B_i == 0 then
            member = false
        end if
        j = j + 1
    end while
    return member

Figure: Bloom filter: insertion and set membership (S is implicit)
- Initially, $B_i = 0$ for every $i$
- $B$ is a compact summary of the keys of the elements in $S$
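
The two procedures can be sketched in Python as follows. This is only an illustration under stated assumptions: the class name `BloomFilter` is hypothetical, and salted SHA-256 reduced modulo m stands in for the t independent hash functions, which the slides leave unspecified.

```python
import hashlib


class BloomFilter:
    """Binary array B of size m, probed by t salted hash functions."""

    def __init__(self, m: int, t: int) -> None:
        self.m = m
        self.t = t
        self.bits = bytearray(m)  # one byte per bit: clear, but not compact

    def _positions(self, key: str):
        # Assumption: h_j(k) = SHA-256("j:k") mod m plays the role of h_j.
        for j in range(self.t):
            digest = hashlib.sha256(f"{j}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def insert(self, key: str) -> None:
        for i in self._positions(key):
            self.bits[i] = 1

    def ismember(self, key: str) -> bool:
        # False means "definitely not in S"; True means "possibly in S".
        return all(self.bits[i] for i in self._positions(key))


bf = BloomFilter(m=1009, t=5)
for word in ("alpha", "beta", "gamma"):
    bf.insert(word)
print(bf.ismember("beta"))  # inserted keys are always reported as members
```

A production filter would pack the bits (for instance 8 per byte) so that the array really occupies m bits.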

9 False positives
- No false negatives but...
- Assume $h_1(k) = (2k + 1) \bmod 5$ and $h_2(k) = (k + 2) \bmod 5$
- ismember(4) returns true: a false positive
Figure: Bloom filter with $t = 2$ and $m = 5$: insertion of keys (5, 2, 3).
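
The example can be replayed directly. The sketch below assumes that the second hash function is $h_2(k) = (k + 2) \bmod 5$ (the slide's "x" read as the key k) and uses a plain Python list for B.

```python
M = 5  # size of the binary array


def h1(k: int) -> int:
    return (2 * k + 1) % M


def h2(k: int) -> int:
    return (k + 2) % M


B = [0] * M
for key in (5, 2, 3):
    B[h1(key)] = 1
    B[h2(key)] = 1


def ismember(k: int) -> bool:
    return B[h1(k)] == 1 and B[h2(k)] == 1


print(B)            # [1, 1, 1, 0, 1]
print(ismember(4))  # True, although 4 was never inserted
```

Key 4 hashes to positions 4 and 1, which were set by keys 2 and 5 respectively, so the filter cannot tell it apart from an inserted key.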

11 The mathematics of Bloom filters
- Having false positives means that we might access the database even if it contains no element with the searched key
- This can be acceptable if P[false positive] is small
Probability of false positives
- Assume $n$ elements in the Bloom filter
- Assume every $h_j(\cdot)$ is ideal, i.e., it hashes every item uniformly at random and independently of the others (for the sake of the analysis)
- Consider ismember(k), with $\mathrm{obj}(k) \notin S$
- What is $P[\text{ismember}(k) == \text{true}]$? It is small if $m$ is large enough

12 Fraction of 0s
- Assume ideal $h_j(\cdot)$s
- Assume that, after $n$ insertions, the fraction of 0s in $B$ is $p$
- Consider a key $k$ that was never inserted ($k \notin B$): $P[\text{ismember}(k) == \text{true}] = (1 - p)^t$
- The fraction of 0s determines the probability of a false positive
- $p$ is itself a random variable that depends on $t$ and $m$

13 Fraction of 0s cont.
- The $B_i$s are random variables that depend on the input and the hash functions
- After $n$ insertions we have:
$$P[B_i = 0] = \left(1 - \frac{1}{m}\right)^{tn}$$
$$E[p] = \frac{1}{m}\sum_{i=0}^{m-1} P[B_i = 0] = \left(1 - \frac{1}{m}\right)^{tn} \approx e^{-tn/m}$$
- If $X$ = number of 0s, then $X = mp$ and $E[X] = m\,E[p]$
Theorem ([Mitzenmacher, 2002])
Let $X$ denote the number of 0s in the Bloom filter after $n$ insertions. Then
$$P\big[\,|p - E[p]| > \epsilon\,\big] = P\big[\,|X - m\,E[p]| > \epsilon m\,\big] \le 2e^{-2\epsilon^2 m^2/(tn)}$$
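
A quick simulation gives a feel for how tightly $p$ concentrates around $E[p]$. This is a sketch under the ideal-hashing assumption of the analysis: each of the $tn$ probes is drawn uniformly at random with Python's `random` module, and the parameter values are arbitrary choices for illustration.

```python
import math
import random


def zero_fraction(m: int, n: int, t: int, rng: random.Random) -> float:
    """Fraction of 0s in B after n insertions with t ideal hash functions."""
    bits = [0] * m
    for _ in range(n * t):  # each insertion sets t uniformly random positions
        bits[rng.randrange(m)] = 1
    return bits.count(0) / m


m, n, t = 8000, 1000, 5
rng = random.Random(0)
p = zero_fraction(m, n, t, rng)
exact = (1 - 1 / m) ** (t * n)
approx = math.exp(-t * n / m)
print(f"simulated p = {p:.4f}, E[p] = {exact:.4f}, e^(-tn/m) = {approx:.4f}")
```

With these values the theorem bounds the probability of a deviation larger than 0.03 by $2e^{-2 \cdot 0.03^2 \cdot 8000^2 / 5000}$, which is negligible.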

14 Fraction of 0s cont.
Remarks
- The $B_i$s are not statistically independent (why?)
- The proof uses an extension of Chernoff bounds
- Note that $p$ is very close to $E[p]$ with high probability. Example: if $m \ge 17\sqrt{nt}$, then $p \in [0.9\,E[p],\ 1.1\,E[p]]$ with probability at least 99% (verify)
- In practice (see further), the condition above or a similar one is easy to satisfy
- In the rest of this section we assume that $p \approx E[p] \approx e^{-tn/m}$ deterministically
- This can be made rigorous at the cost of some complication in the analysis

15 Choice of m and t
- We have seen that, with good approximation: $P[\text{ismember}(k) == \text{true}] = (1 - p)^t \approx \left(1 - e^{-tn/m}\right)^t$
- We can play with the parameters $m$ (size of the Bloom filter) and $t$ (number of hash functions)
- In the remainder of the analysis, we fix $m$ and minimize the expression $f(t) = \left(1 - e^{-tn/m}\right)^t$ w.r.t. $t$ ($n$ is given, $m$ is fixed)
- We next take $g(t) = \ln f(t) = t \ln\left(1 - e^{-tn/m}\right)$. Minimizing $f(t)$ is equivalent to minimizing $g(t)$, but the latter is easier

16 Choice of m and t cont.
We have:
$$\frac{dg}{dt} = \ln\left(1 - e^{-tn/m}\right) + \frac{tn}{m} \cdot \frac{e^{-tn/m}}{1 - e^{-tn/m}}$$
- The derivative is 0 when $t = \frac{m}{n}\ln 2$, and this is a global minimum
- With this choice: $P[\text{ismember}(k) == \text{true}] \approx f(t) = \left(\frac{1}{2}\right)^t \approx (0.6185)^{m/n}$
- Of course, the number $t$ of hash functions has to be an integer
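
The stationary point can be checked by substitution. The short derivation below uses only the symbols already defined, with $t^\ast$ introduced here as shorthand for the optimal value of $t$.

```latex
\begin{align*}
t^\ast = \frac{m}{n}\ln 2
  &\;\Longrightarrow\; e^{-t^\ast n/m} = e^{-\ln 2} = \tfrac{1}{2}, \\
\left.\frac{dg}{dt}\right|_{t = t^\ast}
  &= \ln\tfrac{1}{2} + \ln 2 \cdot \frac{1/2}{1 - 1/2}
   = -\ln 2 + \ln 2 = 0, \\
f(t^\ast) &= \left(\tfrac{1}{2}\right)^{t^\ast}
   = \left(2^{-\ln 2}\right)^{m/n} \approx (0.6185)^{m/n}.
\end{align*}
```

At the optimum each bit of $B$ is 0 with probability about 1/2, i.e., the array is half full.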

17 Recap
- $n$ is given
- For any given $m$: $t = \frac{m}{n}\ln 2$ ideally, $\left\lfloor \frac{m}{n}\ln 2 \right\rfloor$ or $\left\lceil \frac{m}{n}\ln 2 \right\rceil$ in practice
- Bloom filters are highly effective if $m = cn$, with $c$ a small constant
- Example: $c = 8$, $t = 5$ or $6$ gives a false positive probability of about 0.02
- Fixing $m$: in practice, choose a value a few times higher than the maximum predictable size of your database
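
The recap translates into a few lines of arithmetic. The helper names `false_positive` and `choose_t` are hypothetical; the sketch compares the floor and the ceiling of $(m/n)\ln 2$ and keeps whichever gives the lower approximate false positive probability.

```python
import math


def false_positive(m: int, n: int, t: int) -> float:
    """Approximate false positive probability (1 - e^{-tn/m})^t."""
    return (1 - math.exp(-t * n / m)) ** t


def choose_t(m: int, n: int) -> int:
    """Better of floor and ceiling of (m/n) ln 2, never below 1."""
    ideal = (m / n) * math.log(2)
    candidates = {max(1, math.floor(ideal)), max(1, math.ceil(ideal))}
    return min(candidates, key=lambda t: false_positive(m, n, t))


n = 1_000_000
m = 8 * n  # c = 8 bits per element
for t in (5, 6):
    print(t, round(false_positive(m, n, t), 4))  # both are close to 0.02
print(choose_t(m, n))
```

For $c = 8$ the two candidates differ only in the fourth decimal place, so using the smaller $t$ (fewer hash evaluations per operation) costs almost nothing in accuracy.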

18 Recap cont.
- Assume a database with $n = 10^6$ documents; keys are document digests of 1 Kbit each, i.e., about 125 MBytes of keys ($10^6 \times 1$ Kbit $= 10^9$ bits)
- A retrieve operation can be very expensive; caching can only mitigate this in part
- Using $m = 8n$, we have a 1 MByte Bloom filter that occupies only a small fraction of main memory
Still missing...
- Deletions
- They can be implemented at the expense of a moderate increase in memory

19 Handling deletions
- Substitute the binary array with a counter array (counting Bloom filter)
Figure: Counting Bloom filter with $t = 2$ and $m = 5$: insertion of keys (5, 2, 3).

20 Counting Bloom filters: insertion and deletion

insert(k)
    Require: k: object key
    for j = 1, ..., t do
        i = h_j(k)
        C_i = C_i + 1
    end for

delete(k)
    Require: k: object key
    if ismember(k) then
        for j = 1, ..., t do
            i = h_j(k)
            C_i = C_i - 1
        end for
    end if

Figure: Counting Bloom filter: insertion and deletion (S is implicit)
- It is possible to prove that 4 bits per counter suffice for most applications [Broder and Mitzenmacher, 2004]
- ismember(k) is unchanged: position $i$ counts as set when $C_i > 0$
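
A counting variant can be sketched along the same lines. As before, this is an illustration only: the class name `CountingBloomFilter` is hypothetical, salted SHA-256 is an assumed stand-in for the hash functions, and the counters are ordinary Python integers rather than the 4-bit counters mentioned above.

```python
import hashlib


class CountingBloomFilter:
    """Counter array C of size m; supports insert, delete and ismember."""

    def __init__(self, m: int, t: int) -> None:
        self.m = m
        self.t = t
        self.counters = [0] * m  # unbounded ints, not 4-bit counters

    def _positions(self, key: str) -> list:
        # Assumption: salted SHA-256 stands in for h_1, ..., h_t.
        return [
            int.from_bytes(hashlib.sha256(f"{j}:{key}".encode()).digest()[:8], "big") % self.m
            for j in range(self.t)
        ]

    def insert(self, key: str) -> None:
        for i in self._positions(key):
            self.counters[i] += 1

    def ismember(self, key: str) -> bool:
        return all(self.counters[i] > 0 for i in self._positions(key))

    def delete(self, key: str) -> None:
        # As in the pseudocode: decrement only if the key tests positive.
        if self.ismember(key):
            for i in self._positions(key):
                self.counters[i] -= 1


cbf = CountingBloomFilter(m=1009, t=4)
cbf.insert("a")
cbf.insert("b")
cbf.delete("a")
print(cbf.ismember("b"))  # "b" is still reported after "a" is deleted
```

Like the pseudocode, the sketch does not protect against deleting a key that was never inserted but happens to test positive; doing so would decrement counters owned by other keys.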

21 Applications [Broder and Mitzenmacher, 2004]
- Database maintenance (since the early '80s)
- Cooperative distributed caching (see also [Fan et al., 2000])
- P2P/overlay networks
- Resource routing
- Packet routing

22 Summary cache [Fan et al., 2000]
- Internet Caching Protocol (ICP)
- Proxies cooperate

23 Summary cache cont.
- On a cache miss, a proxy contacts its neighbour proxies instead of requesting the page from the Web server
- ICP traffic can cause great overhead even for a few proxies
Idea
- Each proxy stores a (counting) Bloom filter of every other proxy's contents
- Keys are the URLs
- On a cache miss:
  1. Check the locally stored Bloom filters for key membership
  2. Contact a proxy whose relevant Bloom filter is positive for the key

24 Questions
Q1
- Consider two dictionaries over the same universe of objects (and therefore keys)
- Describe how and why Bloom filters allow one to easily construct a compact summary of their union
Q2
- Dictionary in secondary storage with $n$ items, no insertions/deletions
- retrieve(k) has a cost equal to the time to access disk
- Access to main memory is negligible
- 70% of the requested items are not in the dictionary
- Let $T$ be the response time
- Design a Bloom filter such that the speed-up in $E[T]$ is at least 2, i.e., a 100% improvement

25 Example: spell-checker
- Text editor spell-checker
- Must quickly report spelling mistakes to the user
- The thesaurus contains $10^5$ terms
- Average term length: 10 bytes
- Design a Bloom filter that performs spell-checking with probability of error at most 0.01

26 Example: spell-checker cont.
Solution
- Impose that $(0.6185)^{m/n} \le 0.01$, which gives $\frac{m}{n} \ge 9.59$
- Then $t = \frac{m}{n}\ln 2 \approx 6.6$, to be rounded to an integer
- We can use a Bloom filter of size 1 Mbit with 7 hash functions
- Note that storing all words requires 1 MByte plus the data structure
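
The sizing can be reproduced numerically. The sketch below uses the exact base $2^{-\ln 2}$ in place of the rounded 0.6185, so the bound reads $m/n \ge \ln(1/0.01)/(\ln 2)^2$; the variable names are chosen here for illustration.

```python
import math

n = 10**5      # terms in the thesaurus
target = 0.01  # maximum acceptable false positive probability

# (2^{-ln 2})^(m/n) <= target  <=>  m/n >= ln(1/target) / (ln 2)^2
bits_per_term = math.log(1 / target) / math.log(2) ** 2
m_min = math.ceil(bits_per_term * n)

m = 10**6      # rounded up to 1 Mbit, as in the solution
t = round((m / n) * math.log(2))
fp = (1 - math.exp(-t * n / m)) ** t

print(bits_per_term)  # about 9.59 bits per term
print(m_min, t, fp)   # minimum size in bits, number of hash functions, resulting error
```

With 1 Mbit (about 125 KBytes) and 7 hash functions the estimated error stays below the 0.01 target, against roughly 1 MByte for storing the words themselves.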

27 References
- Broder, A. and Mitzenmacher, M. (2004). Network applications of Bloom filters: A survey. Internet Mathematics, volume 1. A K Peters, Ltd.
- Fan, L., Cao, P., Almeida, J., and Broder, A. Z. (2000). Summary cache: a scalable wide-area Web cache sharing protocol. IEEE/ACM Transactions on Networking, 8(3).
- Mitzenmacher, M. (2002). Compressed Bloom filters. IEEE/ACM Transactions on Networking, 10(5).
