Notes on Bloom filters


Computer Science B63, Winter 2017
Scarborough Campus, University of Toronto
Vassos Hadzilacos

A Bloom filter is an approximate or probabilistic dictionary. Let $S$ be a dynamic set of keys drawn from a universe $U$. A Bloom filter maintains a summary $F_S$ of $S$, supporting the following operations:

BF-Insert($F_S$, $x$): $S := S \cup \{x\}$; add $x$ to the underlying dynamic set $S$.

BF-Search($F_S$, $x$): return "no" if $x \notin S$, and "probably yes" if, with high probability (to be discussed later), $x \in S$; it is possible, however, that "probably yes" is returned even though $x \notin S$.

An instance where BF-Search($F_S$, $x$) returns "probably yes" even though $x \notin S$ is called a false positive.

Note that there is no BF-Delete operation. As we will see, deletions are problematic for Bloom filters; we will discuss partial remedies for this weakness.

Bloom filters are very space efficient; they consume only a small fraction of the space needed to store the full dynamic set $S$ using, say, an AVL tree or a hash table. Consequently they also achieve time efficiencies: they can be stored in main memory, rather than in secondary storage, and so they can be accessed much faster. The disadvantage of Bloom filters is that there is a non-zero probability of false positive searches. Note that there is no possibility of a false negative: If BF-Search($F_S$, $x$) returns "no", then $x$ is definitely not in $S$. This asymmetry between the positive and negative responses is critical in making Bloom filters useful, as will be seen when we discuss some applications of Bloom filters.

How Bloom filters work. A Bloom filter consists of an array of $m$ bits, $BF[0..m-1]$, initially all 0, corresponding to an empty set. Let $h_1, h_2, \ldots, h_t$ be hash functions that map $U$ to $\{0, 1, \ldots, m-1\}$. The Bloom filter operations are then implemented as follows:

To insert a key $x$ into the Bloom filter, we set all the bits $BF[h_1(x)], \ldots, BF[h_t(x)]$ to 1.

To search for a key $x$, we look at all the bits $BF[h_1(x)], \ldots, BF[h_t(x)]$. If any one of them is still 0, we return "no": had $x$ been inserted to the Bloom filter, all these bits would have been set to 1. If all of them are 1, we return "probably yes".

Note that a search for $x$ may find all the bits set to 1 even though $x$ was never inserted into the dictionary. For example, suppose we use two hash functions, which map $x$ to bit positions 1 and 3, $y$ to bit positions 1 and 2, and $z$ to bit positions 2 and 3. If we insert $y$ and $z$, and then search for $x$, the search algorithm will return "probably yes" even though $x$ was not inserted to the Bloom filter. This is an example of a false positive search.

The algorithms for BF-Insert and BF-Search are shown in pseudocode in Figure 1.

    BF-Insert(F_S, x)
        for i := 1 to t do
            BF[h_i(x)] := 1

    BF-Search(F_S, x)
        for i := 1 to t do
            if BF[h_i(x)] = 0 then return "no"
        return "probably yes"

Figure 1: Insert and search operations with Bloom filters
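The pseudocode of Figure 1 translates almost line for line into Python. The following is only a minimal sketch: the class name `BloomFilter`, the use of one byte per bit, and the hash family $h_i(x) = \text{SHA-256}(i \,\|\, x) \bmod m$ are assumptions made for illustration and do not come from the notes, which leave the choice of hash functions open.

```python
import hashlib


class BloomFilter:
    """Illustrative Bloom filter: m bit positions, t hash functions."""

    def __init__(self, m: int, t: int) -> None:
        self.m = m                    # number of positions, BF[0..m-1]
        self.t = t                    # number of hash functions h_1..h_t
        self.bits = bytearray(m)      # one byte per bit, for readability only

    def _positions(self, key: str):
        # Assumed hash family (not from the notes): SHA-256 of "i:key" mod m.
        for i in range(1, self.t + 1):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def insert(self, key: str) -> None:
        # BF-Insert: set BF[h_i(x)] := 1 for i = 1..t.
        for pos in self._positions(key):
            self.bits[pos] = 1

    def search(self, key: str) -> bool:
        # BF-Search: False means "no", True means "probably yes".
        return all(self.bits[pos] for pos in self._positions(key))


bf = BloomFilter(m=1024, t=3)
for url in ("a.example", "b.example"):
    bf.insert(url)
print(bf.search("a.example"))   # True: an inserted key is never answered "no"
```

A space-conscious implementation would pack eight positions into each byte; the byte-per-position layout here only keeps the correspondence with Figure 1 easy to see.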

Assuming that we can evaluate each hash function in $O(1)$ time, it is obvious that these algorithms run in $\Theta(t)$ time. In typical uses, the number $t$ of hash functions is a small constant, so the algorithms run in $\Theta(1)$ time.

We use multiple hash functions, rather than just one, to reduce the probability of false positives. If we used only one hash function and we inserted a key $x$, then the search for any key $x'$ that collides with $x$ under that hash function will return "probably yes", even if $x'$ was never inserted. If we use two hash functions, a key $x'$ that collides with $x$ under one hash function is unlikely to also collide with $x$ under the other, provided the hash functions are independent (informally, they tend to map the same key to different positions). We will make this more precise later, when we analyze the performance of Bloom filters. We will explore shortly what is the optimal number of hash functions to use.

Probability of false positive. The probability of a false positive search depends on three factors: the size $m$ of the Bloom filter; the number of items $n$ inserted into the Bloom filter; and the number of hash functions $t$ used for the Bloom filter. Intuitively, the larger the $m$, the lower the probability of collisions and therefore of false positives. Similarly, the smaller the $n$, the lower the probability of collisions and therefore of false positives. The ratio $\alpha = n/m$ is called the load factor, and we encountered this quantity in our analysis of hash tables. From the preceding discussion it is clear that the smaller the load factor, the lower the probability of false positives.

The optimal value of the third parameter $t$ occupies some sweet spot between too few hash functions (leading to higher probability of collisions, and therefore higher probability of false positives), and too many hash functions (causing each item inserted to set many bits to 1, and therefore higher probability of false positives).
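The sweet spot can be observed empirically before any analysis. The sketch below fills a filter at a fixed load factor and measures how often keys that were never inserted are answered "probably yes", for a small, a moderate, and a large number of hash functions. The sizes ($m = 8000$, $n = 1000$), the key names, and the SHA-256-based hash family are all assumptions chosen for illustration.

```python
import hashlib


def positions(key: str, t: int, m: int):
    """Assumed hash family: h_i(x) = SHA-256("i:x") mod m, for i = 1..t."""
    return [
        int.from_bytes(hashlib.sha256(f"{i}:{key}".encode()).digest()[:8], "big") % m
        for i in range(1, t + 1)
    ]


def false_positive_rate(m: int, n: int, t: int, probes: int = 5000) -> float:
    bits = bytearray(m)
    for j in range(n):                        # insert n distinct keys
        for pos in positions(f"in-{j}", t, m):
            bits[pos] = 1
    hits = sum(                               # search for keys never inserted
        all(bits[pos] for pos in positions(f"out-{j}", t, m))
        for j in range(probes)
    )
    return hits / probes


m, n = 8000, 1000                             # load factor n/m = 1/8
rates = {t: false_positive_rate(m, n, t) for t in (1, 4, 30)}
for t, rate in rates.items():
    print(f"t = {t:2d}: measured false positive rate = {rate:.4f}")
```

With these parameters one would expect the middle value of $t$ to give the lowest measured rate: a single hash function leaves too many collisions, while thirty hash functions set so many bits that most positions end up 1.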
To analyze the probability of a false positive search, we consider a two-stage process.

(A) We insert $n$ distinct keys $x_1, x_2, \ldots, x_n$ into the Bloom filter. We model these insertions by the following experiment. Start with a Bloom filter all of whose bits are set to 0. Repeat the following for a total of $nt$ times, independently: choose a bit position in the Bloom filter uniformly at random (i.e., each position is chosen with probability $1/m$), and set that bit to 1. This models the insertion of $n$ distinct keys, drawn at random from $U$, where each insertion uses $t$ hash functions to set some bits to 1.

(B) Next we search for a randomly chosen key $x \neq x_1, x_2, \ldots, x_n$ in $U$, and we want to determine the probability of a false positive, i.e., the probability that the bits to which $x$ is mapped by the $t$ hash functions have all been set to 1 by the insertion process. We model this by repeating $t$ times, independently, the following: choose a position in the Bloom filter uniformly at random. We then compute the probability of the event that all of the positions chosen were set to 1 during Stage (A).

This is an idealized model, like the simple uniform hashing assumption (SUHA) that we used to analyze hashing: It assumes that there are no dependencies or regularities in the set of keys inserted to the Bloom filter, and that the hash functions distribute the keys uniformly at random to the positions of the Bloom filter. With suitably designed hash functions, this idealized model captures well enough the reality of many situations that arise in practice.

Fix an arbitrary position $\ell$, $0 \le \ell < m$, of the Bloom filter. We first compute the probability that $BF[\ell] = 0$ at the end of Stage (A), i.e., after the keys $x_1, x_2, \ldots, x_n$ have been inserted. According to our model, the probability that one of these keys under one of the hash functions hits position $\ell$ is $1/m$; and therefore the probability that it misses position $\ell$ is $1 - 1/m$. Since the positions of the Bloom filter set to 1 during Stage (A) are chosen independently and uniformly at random, the probability that all $n$ keys inserted under all $t$ hash functions miss position $\ell$ is $(1 - 1/m)^{nt}$. That is,

$$\Pr\bigl[BF[\ell] = 0 \text{ after } x_1, \ldots, x_n \text{ are inserted}\bigr] = \left(1 - \frac{1}{m}\right)^{nt} \approx e^{-nt/m} = e^{-\alpha t}$$

where the approximation is justified by the fact that, for values of $x$ close to 0, $1 - x \approx e^{-x}$.

Now consider any key $x$ different from all the $n$ keys inserted into the Bloom filter. The probability that a search for $x$ yields a false positive is the probability that, after the insertion of $x_1, \ldots, x_n$ in the Bloom filter, the positions to which the hash functions map $x$ are all set to 1. As we just saw, the probability that any particular bit of $BF$ is 0 after the insertions is $e^{-\alpha t}$, and so the probability that any particular bit is 1 is $1 - e^{-\alpha t}$. By the model assumption that the hash functions map $x$ to positions of $BF$ chosen independently and uniformly at random, the probability that all of the bits to which the hash functions map $x$ are 1 is $(1 - e^{-\alpha t})^t$.

Suppose now that the size of the Bloom filter $m$ and the number of elements in it $n$ are fixed; therefore the load factor $\alpha = n/m$ is fixed. For this fixed $\alpha$, the probability of a false positive becomes a function only of $t$, the number of hash functions:

$$P(t) = \left(1 - e^{-\alpha t}\right)^t \tag{1}$$

We can therefore compute the value of $t$ that minimizes this function, by taking its derivative and setting it to 0. We have:

$$\frac{dP(t)}{dt} = \left(1 - e^{-\alpha t}\right)^t \left( \ln\left(1 - e^{-\alpha t}\right) + \alpha t \, \frac{e^{-\alpha t}}{1 - e^{-\alpha t}} \right)$$

Setting the derivative to 0 and solving for $t$ we get that the value of $t$ that minimizes the probability of false positive is $t = 0$ or $t = \alpha^{-1} \ln 2$. The value $t = 0$ is not feasible (since we need a positive number of hash functions!), so the optimal number of hash functions is given by

$$t = \alpha^{-1} \ln 2 \tag{2}$$

Note that this is a non-integer value, so we will use the positive integer $t$ that is closest to $\alpha^{-1} \ln 2$.
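Equations (1) and (2) are easy to check numerically. The sketch below tabulates $P(t)$ for an assumed load factor of $\alpha = 1/8$ (8 bits per key, a value picked only for illustration), marks the integer closest to $\alpha^{-1} \ln 2$, and then runs the two-stage experiment of the idealized model with a fixed random seed to compare it against the formula. The function names are hypothetical.

```python
import math
import random


def false_positive_probability(alpha: float, t: int) -> float:
    """Equation (1): P(t) = (1 - e^{-alpha t})^t."""
    return (1.0 - math.exp(-alpha * t)) ** t


def optimal_t(alpha: float) -> int:
    """Positive integer closest to equation (2), t = alpha^{-1} ln 2."""
    return max(1, round(math.log(2) / alpha))


def simulate_model(m: int, n: int, t: int, trials: int, rng: random.Random) -> float:
    bits = bytearray(m)
    for _ in range(n * t):                  # Stage (A): nt uniform positions set to 1
        bits[rng.randrange(m)] = 1
    hits = 0
    for _ in range(trials):                 # Stage (B), repeated: t uniform positions
        if all(bits[rng.randrange(m)] for _ in range(t)):
            hits += 1
    return hits / trials


alpha = 1 / 8                               # assumed load factor
best = optimal_t(alpha)                     # 8 ln 2 is about 5.55, so best == 6
for t in range(1, 11):
    marker = "  <- closest integer to (1/alpha) ln 2" if t == best else ""
    print(f"t = {t:2d}  P(t) = {false_positive_probability(alpha, t):.5f}{marker}")

rng = random.Random(0)                      # fixed seed so the run is repeatable
m, n = 8000, 1000
measured = simulate_model(m, n, best, trials=20000, rng=rng)
predicted = false_positive_probability(n / m, best)
print(f"model simulation: {measured:.4f}   formula (1): {predicted:.4f}")
```

For this particular $\alpha$ the rounded value also minimizes $P(t)$ over the integers; the simulated and predicted rates should agree up to sampling noise and the $1 - x \approx e^{-x}$ approximation.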
Substituting (2) into (1) we get that the probability of a false positive search using the optimal number of hash functions is

$$P(\alpha^{-1} \ln 2) = \left(1 - e^{-\alpha \alpha^{-1} \ln 2}\right)^{\alpha^{-1} \ln 2} = \left(\frac{1}{2}\right)^{\alpha^{-1} \ln 2} \approx 0.6185^{\alpha^{-1}} \tag{3}$$

Example. Suppose we have a dictionary consisting of 10 million URLs, i.e., $n = 10^7$. If I allocate a Bloom filter with $m = 3.2 \times 10^8$ bits, we have $\alpha^{-1} = 32$. Applying (3), we get that the probability of a false positive in this case is approximately $2.1 \times 10^{-7}$. A more accurate calculation would be to first find the optimal number $t$ of hash functions as the positive integer closest to the value given by (2), and then apply (1) for that value of $t$. Doing so we obtain that $t$ should be the positive integer closest to $32 \ln 2 \approx 22.18$, i.e., $t = 22$. Plugging this value into (1) we get that the probability of false positive search is $P(22) = (1 - e^{-22/32})^{22} \approx 2.1 \times 10^{-7}$.

The inverse of the load factor $\alpha^{-1} = m/n$ can be thought of as the number of bits we allocate per element inserted in the Bloom filter. Note that this interpretation should not be viewed as meaning that we allocate a specific set of positions in the Bloom filter for each item we insert: Each item inserted to the Bloom filter gets (up to) $t$ bits, the positions to which it is mapped by the $t$ hash functions. Rather, $\alpha^{-1}$ is a measure of how much space we save by using a Bloom filter instead of storing the dictionary explicitly. In our example, $\alpha^{-1} = 32$; thus we allocate 32 bits, i.e., 4 bytes, for each URL in the dictionary. This is much shorter than is required to store an actual URL.

Deletions. Deletions are problematic in Bloom filters. Note that we cannot delete an element merely by setting to 0 the bits to which it is mapped by the hash functions: Doing so would result in false negatives, which would render Bloom filters useless. To see how this can happen, suppose we have inserted three keys: $x$ that is mapped to bit positions 1 and 3, $y$ that is mapped to bit positions 1 and 2, and $z$ that is mapped to bit positions 2 and 3. If we delete $y$ and $z$ by setting their bits to 0, and we then search for $x$, the Bloom filter would return "no", even though $x$ was not deleted.

A partial solution to this limitation is to use so-called counting Bloom filters. In a counting Bloom filter, each position in the array $BF$ is not a bit but a small counter. Initially, every counter is 0, indicating an empty Bloom filter. Each time a key $x$ is inserted (respectively, deleted), the counters in the positions to which $x$ is mapped by the hash functions are incremented (respectively, decremented) by 1. To search for a key $x$, we look at all the counters to which $x$ is mapped by the hash functions; if any of them is 0 we return "no"; otherwise, we return "probably yes". Pseudocode for these operations is shown in Figure 2.

    BF-Insert(F_S, x)
        for i := 1 to t do
            BF[h_i(x)] := BF[h_i(x)] + 1

    BF-Delete(F_S, x)
        for i := 1 to t do
            BF[h_i(x)] := BF[h_i(x)] - 1

    BF-Search(F_S, x)
        for i := 1 to t do
            if BF[h_i(x)] = 0 then return "no"
        return "probably yes"

Figure 2: Insert, delete, and search operations with counting Bloom filters

We don't want to allocate many bits to each counter, as this would undermine the space savings advantage that Bloom filters are designed to deliver.
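Before weighing the size of the counters further, here is a minimal Python sketch of the operations in Figure 2. The class name `CountingBloomFilter` and the SHA-256-based hash family are assumptions for illustration, and `delete` is only meaningful for a key that was previously inserted and not yet deleted.

```python
import hashlib


class CountingBloomFilter:
    """Illustrative counting Bloom filter: m counters, t hash functions."""

    def __init__(self, m: int, t: int) -> None:
        self.m = m
        self.t = t
        self.counters = [0] * m           # every counter starts at 0 (empty filter)

    def _positions(self, key: str):
        # Assumed hash family (not from the notes): SHA-256 of "i:key" mod m.
        for i in range(1, self.t + 1):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def insert(self, key: str) -> None:
        for pos in self._positions(key):
            self.counters[pos] += 1       # BF[h_i(x)] := BF[h_i(x)] + 1

    def delete(self, key: str) -> None:
        # Caller must only delete keys that are currently in the set.
        for pos in self._positions(key):
            self.counters[pos] -= 1       # BF[h_i(x)] := BF[h_i(x)] - 1

    def search(self, key: str) -> bool:
        # False means "no", True means "probably yes".
        return all(self.counters[pos] > 0 for pos in self._positions(key))


cbf = CountingBloomFilter(m=64, t=2)
for key in ("x", "y", "z"):
    cbf.insert(key)
cbf.delete("y")
cbf.delete("z")
print(cbf.search("x"))   # True: deleting y and z cannot zero a counter that x uses
```

Python integers are unbounded, so this sketch spends far more than a few bits per counter and never overflows; a space-efficient implementation would pack small fixed-width counters instead.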
On the other hand, if the counters are too small, they will wrap around and again produce false negatives. For these reasons, counting Bloom filters are only a limited solution. As we will see, Bloom filters are typically used in applications where there are no, or only very few, deletions.

Applications. We now briefly describe some applications of Bloom filters.

Refusing service to black-listed sites. A web server may keep a (long) list of black-listed sites, known to contain malware or to distribute spam. Whenever the web server receives a request from such a site, it does not respond to it. Almost all requests that the web server receives are from "clean" sites. Nevertheless, the list of black-listed sites is too long to keep in main memory. It would be very inefficient to keep the list only on disk: Doing so would mean that the web server would have to perform time-consuming disk accesses at each request to verify that the requesting site is not black-listed. Instead, the web server keeps the full list of black-listed sites on disk, and keeps in main memory a Bloom filter of the black-listed sites. This is feasible because the Bloom filter is much shorter than the actual list of black-listed sites. When a request arrives from a site $s$, the web server checks to see if $s$ is in the Bloom filter. In most cases (and assuming that the probability of false positive is low), the answer is "no", in which case the web server replies to $s$'s request. In the rare instances where the answer is "probably yes", the web server performs a disk access to search the actual list of black-listed sites for $s$. If $s$ is not found on that list, the web server replies to $s$'s request; otherwise, i.e., if $s$ is actually a black-listed site, the web server ignores $s$'s request.

Approximate counting. Suppose we want to count how many different IP addresses have visited a web page.
The obvious way to do this is to keep the set $V$ of all IP addresses that have visited the web page in the past, and a counter giving the cardinality of that set. Each time a request arrives from IP address $a$, we check if $a \in V$; if not, we add $a$ to $V$ and increment the counter. It is, however, too expensive to remember all IP addresses that visited the page in the past. If (as is often the case) it is acceptable to provide an approximate counter that slightly undercounts unique visitors, we can use a Bloom filter of the visitors' IP addresses, rather than the set of addresses themselves. When IP address $a$ visits the web page, we check if $a$ is in the Bloom filter. If the answer is "no", we know for sure that $a$ is a new visitor. So we insert $a$ to the Bloom filter and increment the counter. If the answer is "probably yes", we don't increment the counter. Note that if this was a false positive, by not incrementing the counter we have missed a new visitor. If false positives are rare, our approximate counter will be close enough. To be honest about the service provided, the counter should be used to report "at least $x$ unique visitors" (rather than report "$x$ unique visitors" as if $x$ were the exact number).

These applications share the following characteristics:

1. Saving space is a key objective. In both applications, we don't want to allocate the space needed to store the entire set of items we are interested in. In the case of the web server managing black-listed sites we store the full set of black-listed sites on disk, but we access the disk copy only rarely. In the case of the approximate counter we don't even bother to store the full set of past visitors at all.

2. Objects are rarely (if ever) deleted from the dynamic set of objects contained in the Bloom filter. Once a site is compromised, it remains black-listed forever; and once an IP address has visited the web page, it remains a past visitor (by definition!) forever.

3. The fact that there are no false negatives is crucial for the Bloom filter to be useful. In the case of the web server managing black-listed sites, if a Bloom filter search returns "no", we know for sure that the requesting site is not black-listed and it is therefore safe to respond to its requests. In the case of the approximate counter, if a Bloom filter search returns "no", we know for sure that the visitor is new and it is correct to increment the counter of unique visitors.

4. There is an effective way to mitigate the effect of false positives. In the case of the web server managing black-listed sites, the mitigation strategy is to access the list of black-listed sites stored on disk when the Bloom filter answers "probably yes". This is slow, but it is tolerable because it happens rarely. In the case of the approximate counter, the mitigation strategy is to provide an undercount of the unique visitors, rather than an exact one.

These characteristics (saving space, rare deletions, tolerance to (rare) false positives, and existence of a mitigation strategy for false positives) are typical of applications in which Bloom filters can be brought to bear.
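Both applications can be sketched in a few lines of Python on top of a basic Bloom filter. Everything named here is hypothetical: the `BloomFilter` class with its SHA-256-based hash family, the `UniqueVisitorCounter` class, the `should_reply` function, and the in-memory `set` that stands in for the full black list kept on disk.

```python
import hashlib


class BloomFilter:
    def __init__(self, m: int, t: int) -> None:
        self.m, self.t, self.bits = m, t, bytearray(m)

    def _positions(self, key: str):
        # Assumed hash family (not from the notes): SHA-256 of "i:key" mod m.
        for i in range(1, self.t + 1):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def insert(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos] = 1

    def search(self, key: str) -> bool:
        return all(self.bits[pos] for pos in self._positions(key))


class UniqueVisitorCounter:
    """Maintains a lower bound on the number of distinct IP addresses seen."""

    def __init__(self, m: int, t: int) -> None:
        self.seen = BloomFilter(m, t)
        self.at_least = 0

    def visit(self, ip: str) -> None:
        if not self.seen.search(ip):      # "no": definitely a new visitor
            self.seen.insert(ip)
            self.at_least += 1
        # "probably yes": not counted, so a false positive causes an undercount


def should_reply(site: str, blacklist_filter: BloomFilter, blacklist_on_disk: set) -> bool:
    if not blacklist_filter.search(site):
        return True                        # "no": certainly not black-listed
    return site not in blacklist_on_disk   # rare slow path: consult the full list


counter = UniqueVisitorCounter(m=8192, t=4)
addresses = [f"10.0.{i // 256}.{i % 256}" for i in range(500)]
for ip in addresses + addresses:           # every address visits twice
    counter.visit(ip)
print(f"at least {counter.at_least} unique visitors")

on_disk = {"bad.example"}                  # stand-in for the list stored on disk
in_memory = BloomFilter(m=1024, t=3)
for site in on_disk:
    in_memory.insert(site)
print(should_reply("bad.example", in_memory, on_disk))    # False
print(should_reply("good.example", in_memory, on_disk))   # True
```

The reported count can never exceed the true number of distinct addresses, and a clean site is always served: either the filter answers "no" immediately, or a false positive is resolved by the lookup in the full list.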


More information

CHAPTER 5 PROPAGATION DELAY

CHAPTER 5 PROPAGATION DELAY 98 CHAPTER 5 PROPAGATION DELAY Underwater wireless sensor networks deployed of sensor nodes with sensing, forwarding and processing abilities that operate in underwater. In this environment brought challenges,

More information

1 Black Box Test Data Generation Techniques

1 Black Box Test Data Generation Techniques 1 Black Box Test Data Generation Techniques 1.1 Equivalence Partitioning Introduction Equivalence partitioning is based on the premise that the inputs and outputs of a component can be partitioned into

More information

Lecture Notes on Hash Tables

Lecture Notes on Hash Tables Lecture Notes on Hash Tables 15-122: Principles of Imperative Computation Frank Pfenning Lecture 13 February 24, 2011 1 Introduction In this lecture we introduce so-called associative arrays, that is,

More information

Hashing and sketching

Hashing and sketching Hashing and sketching 1 The age of big data An age of big data is upon us, brought on by a combination of: Pervasive sensing: so much of what goes on in our lives and in the world at large is now digitally

More information

Bloom filters and their applications

Bloom filters and their applications Bloom filters and their applications Fedor Nikitin June 11, 2006 1 Introduction The bloom filters, as a new approach to hashing, were firstly presented by Burton Bloom [Blo70]. He considered the task of

More information

Data Structures and Algorithms. Chapter 7. Hashing

Data Structures and Algorithms. Chapter 7. Hashing 1 Data Structures and Algorithms Chapter 7 Werner Nutt 2 Acknowledgments The course follows the book Introduction to Algorithms, by Cormen, Leiserson, Rivest and Stein, MIT Press [CLRST]. Many examples

More information

142

142 Scope Rules Thus, storage duration does not affect the scope of an identifier. The only identifiers with function-prototype scope are those used in the parameter list of a function prototype. As mentioned

More information

Data Stream Processing

Data Stream Processing Data Stream Processing Part II 1 Data Streams (recap) continuous, unbounded sequence of items unpredictable arrival times too large to store locally one pass real time processing required 2 Reservoir Sampling

More information

Hashing. 1. Introduction. 2. Direct-address tables. CmSc 250 Introduction to Algorithms

Hashing. 1. Introduction. 2. Direct-address tables. CmSc 250 Introduction to Algorithms Hashing CmSc 250 Introduction to Algorithms 1. Introduction Hashing is a method of storing elements in a table in a way that reduces the time for search. Elements are assumed to be records with several

More information

Chapter 27 Hashing. Objectives

Chapter 27 Hashing. Objectives Chapter 27 Hashing 1 Objectives To know what hashing is for ( 27.3). To obtain the hash code for an object and design the hash function to map a key to an index ( 27.4). To handle collisions using open

More information

Hash Tables. Hashing Probing Separate Chaining Hash Function

Hash Tables. Hashing Probing Separate Chaining Hash Function Hash Tables Hashing Probing Separate Chaining Hash Function Introduction In Chapter 4 we saw: linear search O( n ) binary search O( log n ) Can we improve the search operation to achieve better than O(

More information

Bloom Filters. From this point on, I m going to refer to search queries as keys since that is the role they

Bloom Filters. From this point on, I m going to refer to search queries as keys since that is the role they Bloom Filters One of the fundamental operations on a data set is membership testing: given a value x, is x in the set? So far we have focused on data structures that provide exact answers to this question.

More information

CHAPTER 4 BLOOM FILTER

CHAPTER 4 BLOOM FILTER 54 CHAPTER 4 BLOOM FILTER 4.1 INTRODUCTION Bloom filter was formulated by Bloom (1970) and is used widely today for different purposes including web caching, intrusion detection, content based routing,

More information

Data Structures and Algorithms. Roberto Sebastiani

Data Structures and Algorithms. Roberto Sebastiani Data Structures and Algorithms Roberto Sebastiani roberto.sebastiani@disi.unitn.it http://www.disi.unitn.it/~rseba - Week 07 - B.S. In Applied Computer Science Free University of Bozen/Bolzano academic

More information

Algorithms and Data Structures

Algorithms and Data Structures Algorithms and Data Structures Spring 2019 Alexis Maciel Department of Computer Science Clarkson University Copyright c 2019 Alexis Maciel ii Contents 1 Analysis of Algorithms 1 1.1 Introduction.................................

More information

Hash Table. A hash function h maps keys of a given type into integers in a fixed interval [0,m-1]

Hash Table. A hash function h maps keys of a given type into integers in a fixed interval [0,m-1] Exercise # 8- Hash Tables Hash Tables Hash Function Uniform Hash Hash Table Direct Addressing A hash function h maps keys of a given type into integers in a fixed interval [0,m-1] 1 Pr h( key) i, where

More information

Hashing. Yufei Tao. Department of Computer Science and Engineering Chinese University of Hong Kong

Hashing. Yufei Tao. Department of Computer Science and Engineering Chinese University of Hong Kong Department of Computer Science and Engineering Chinese University of Hong Kong In this lecture, we will revisit the dictionary search problem, where we want to locate an integer v in a set of size n or

More information

Operating system Dr. Shroouq J.

Operating system Dr. Shroouq J. 2.2.2 DMA Structure In a simple terminal-input driver, when a line is to be read from the terminal, the first character typed is sent to the computer. When that character is received, the asynchronous-communication

More information

AAL 217: DATA STRUCTURES

AAL 217: DATA STRUCTURES Chapter # 4: Hashing AAL 217: DATA STRUCTURES The implementation of hash tables is frequently called hashing. Hashing is a technique used for performing insertions, deletions, and finds in constant average

More information

Practice Midterm Exam Solutions

Practice Midterm Exam Solutions CSE 332: Data Abstractions Autumn 2015 Practice Midterm Exam Solutions Name: Sample Solutions ID #: 1234567 TA: The Best Section: A9 INSTRUCTIONS: You have 50 minutes to complete the exam. The exam is

More information

Question Points Score Total 100

Question Points Score Total 100 Midterm #2 CMSC 412 Operating Systems Fall 2005 November 22, 2004 Guidelines This exam has 7 pages (including this one); make sure you have them all. Put your name on each page before starting the exam.

More information

CSE 5311 Notes 5: Hashing

CSE 5311 Notes 5: Hashing CSE 5311 Notes 5: Hashing (Last updated 2/18/18 1:33 PM) CLRS, Chapter 11 Review: 11.2: Chaining - related to perfect hashing method 11.3: Hash functions, skim universal hashing (aside: https://dl-acm-org.ezproy.uta.edu/citation.cfm?doid=3116227.3068772

More information

Introduction to Algorithms April 21, 2004 Massachusetts Institute of Technology. Quiz 2 Solutions

Introduction to Algorithms April 21, 2004 Massachusetts Institute of Technology. Quiz 2 Solutions Introduction to Algorithms April 21, 2004 Massachusetts Institute of Technology 6.046J/18.410J Professors Erik Demaine and Shafi Goldwasser Quiz 2 Solutions Quiz 2 Solutions Do not open this quiz booklet

More information

CS369G: Algorithmic Techniques for Big Data Spring

CS369G: Algorithmic Techniques for Big Data Spring CS369G: Algorithmic Techniques for Big Data Spring 2015-2016 Lecture 11: l 0 -Sampling and Introduction to Graph Streaming Prof. Moses Charikar Scribe: Austin Benson 1 Overview We present and analyze the

More information

4.1 COMPUTATIONAL THINKING AND PROBLEM-SOLVING

4.1 COMPUTATIONAL THINKING AND PROBLEM-SOLVING 4.1 COMPUTATIONAL THINKING AND PROBLEM-SOLVING 4.1.2 ALGORITHMS ALGORITHM An Algorithm is a procedure or formula for solving a problem. It is a step-by-step set of operations to be performed. It is almost

More information

Chapter 17. Disk Storage, Basic File Structures, and Hashing. Records. Blocking

Chapter 17. Disk Storage, Basic File Structures, and Hashing. Records. Blocking Chapter 17 Disk Storage, Basic File Structures, and Hashing Records Fixed and variable length records Records contain fields which have values of a particular type (e.g., amount, date, time, age) Fields

More information

Variables and Constants

Variables and Constants HOUR 3 Variables and Constants Programs need a way to store the data they use. Variables and constants offer various ways to work with numbers and other values. In this hour you learn: How to declare and

More information

Hashing 1. Searching Lists

Hashing 1. Searching Lists Hashing 1 Searching Lists There are many instances when one is interested in storing and searching a list: A phone company wants to provide caller ID: Given a phone number a name is returned. Somebody

More information

Module 5: Hashing. CS Data Structures and Data Management. Reza Dorrigiv, Daniel Roche. School of Computer Science, University of Waterloo

Module 5: Hashing. CS Data Structures and Data Management. Reza Dorrigiv, Daniel Roche. School of Computer Science, University of Waterloo Module 5: Hashing CS 240 - Data Structures and Data Management Reza Dorrigiv, Daniel Roche School of Computer Science, University of Waterloo Winter 2010 Reza Dorrigiv, Daniel Roche (CS, UW) CS240 - Module

More information

1 Defining Message authentication

1 Defining Message authentication ISA 562: Information Security, Theory and Practice Lecture 3 1 Defining Message authentication 1.1 Defining MAC schemes In the last lecture we saw that, even if our data is encrypted, a clever adversary

More information

CSCI Analysis of Algorithms I

CSCI Analysis of Algorithms I CSCI 305 - Analysis of Algorithms I 04 June 2018 Filip Jagodzinski Computer Science Western Washington University Announcements Remainder of the term 04 June : lecture 05 June : lecture 06 June : lecture,

More information

Preview. Memory Management

Preview. Memory Management Preview Memory Management With Mono-Process With Multi-Processes Multi-process with Fixed Partitions Modeling Multiprogramming Swapping Memory Management with Bitmaps Memory Management with Free-List Virtual

More information

Algorithms in Systems Engineering ISE 172. Lecture 12. Dr. Ted Ralphs

Algorithms in Systems Engineering ISE 172. Lecture 12. Dr. Ted Ralphs Algorithms in Systems Engineering ISE 172 Lecture 12 Dr. Ted Ralphs ISE 172 Lecture 12 1 References for Today s Lecture Required reading Chapter 5 References CLRS Chapter 11 D.E. Knuth, The Art of Computer

More information

CS/COE 1501

CS/COE 1501 CS/COE 1501 www.cs.pitt.edu/~lipschultz/cs1501/ Hashing Wouldn t it be wonderful if... Search through a collection could be accomplished in Θ(1) with relatively small memory needs? Lets try this: Assume

More information

Layered Network Architecture. CSC358 - Introduction to Computer Networks

Layered Network Architecture. CSC358 - Introduction to Computer Networks Layered Network Architecture Layered Network Architecture Question: How can we provide a reliable service on the top of a unreliable service? ARQ: Automatic Repeat Request Can be used in every layer TCP

More information

HASH TABLES. Hash Tables Page 1

HASH TABLES. Hash Tables Page 1 HASH TABLES TABLE OF CONTENTS 1. Introduction to Hashing 2. Java Implementation of Linear Probing 3. Maurer s Quadratic Probing 4. Double Hashing 5. Separate Chaining 6. Hash Functions 7. Alphanumeric

More information

CSE100. Advanced Data Structures. Lecture 21. (Based on Paul Kube course materials)

CSE100. Advanced Data Structures. Lecture 21. (Based on Paul Kube course materials) CSE100 Advanced Data Structures Lecture 21 (Based on Paul Kube course materials) CSE 100 Collision resolution strategies: linear probing, double hashing, random hashing, separate chaining Hash table cost

More information

Lesson n.11 Data Structures for P2P Systems: Bloom Filters, Merkle Trees

Lesson n.11 Data Structures for P2P Systems: Bloom Filters, Merkle Trees Lesson n.11 : Bloom Filters, Merkle Trees Didactic Material Tutorial on Moodle 15/11/2013 1 SET MEMBERSHIP PROBLEM Let us consider the set S={s 1,s 2,...,s n } of n elements chosen from a very large universe

More information

Cpt S 223. School of EECS, WSU

Cpt S 223. School of EECS, WSU Hashing & Hash Tables 1 Overview Hash Table Data Structure : Purpose To support insertion, deletion and search in average-case constant t time Assumption: Order of elements irrelevant ==> data structure

More information

Final Exam in Algorithms and Data Structures 1 (1DL210)

Final Exam in Algorithms and Data Structures 1 (1DL210) Final Exam in Algorithms and Data Structures 1 (1DL210) Department of Information Technology Uppsala University February 0th, 2012 Lecturers: Parosh Aziz Abdulla, Jonathan Cederberg and Jari Stenman Location:

More information

Lecture 16. Reading: Weiss Ch. 5 CSE 100, UCSD: LEC 16. Page 1 of 40

Lecture 16. Reading: Weiss Ch. 5 CSE 100, UCSD: LEC 16. Page 1 of 40 Lecture 16 Hashing Hash table and hash function design Hash functions for integers and strings Collision resolution strategies: linear probing, double hashing, random hashing, separate chaining Hash table

More information

CS 161 Problem Set 4

CS 161 Problem Set 4 CS 161 Problem Set 4 Spring 2017 Due: May 8, 2017, 3pm Please answer each of the following problems. Refer to the course webpage for the collaboration policy, as well as for helpful advice for how to write

More information

CS 350 Algorithms and Complexity

CS 350 Algorithms and Complexity CS 350 Algorithms and Complexity Winter 2019 Lecture 12: Space & Time Tradeoffs. Part 2: Hashing & B-Trees Andrew P. Black Department of Computer Science Portland State University Space-for-time tradeoffs

More information

CMSC 341 Lecture 16/17 Hashing, Parts 1 & 2

CMSC 341 Lecture 16/17 Hashing, Parts 1 & 2 CMSC 341 Lecture 16/17 Hashing, Parts 1 & 2 Prof. John Park Based on slides from previous iterations of this course Today s Topics Overview Uses and motivations of hash tables Major concerns with hash

More information

CS 561, Lecture 2 : Randomization in Data Structures. Jared Saia University of New Mexico

CS 561, Lecture 2 : Randomization in Data Structures. Jared Saia University of New Mexico CS 561, Lecture 2 : Randomization in Data Structures Jared Saia University of New Mexico Outline Hash Tables Bloom Filters Skip Lists 1 Dictionary ADT A dictionary ADT implements the following operations

More information

CS 561, Lecture 2 : Hash Tables, Skip Lists, Bloom Filters, Count-Min sketch. Jared Saia University of New Mexico

CS 561, Lecture 2 : Hash Tables, Skip Lists, Bloom Filters, Count-Min sketch. Jared Saia University of New Mexico CS 561, Lecture 2 : Hash Tables, Skip Lists, Bloom Filters, Count-Min sketch Jared Saia University of New Mexico Outline Hash Tables Skip Lists Count-Min Sketch 1 Dictionary ADT A dictionary ADT implements

More information

FILE SYSTEM IMPLEMENTATION. Sunu Wibirama

FILE SYSTEM IMPLEMENTATION. Sunu Wibirama FILE SYSTEM IMPLEMENTATION Sunu Wibirama File-System Structure Outline File-System Implementation Directory Implementation Allocation Methods Free-Space Management Discussion File-System Structure Outline

More information

Hashing. 5/1/2006 Algorithm analysis and Design CS 007 BE CS 5th Semester 2

Hashing. 5/1/2006 Algorithm analysis and Design CS 007 BE CS 5th Semester 2 Hashing Hashing A hash function h maps keys of a given type to integers in a fixed interval [0,N-1]. The goal of a hash function is to uniformly disperse keys in the range [0,N-1] 5/1/2006 Algorithm analysis

More information

CS 344/444 Computer Network Fundamentals Final Exam Solutions Spring 2007

CS 344/444 Computer Network Fundamentals Final Exam Solutions Spring 2007 CS 344/444 Computer Network Fundamentals Final Exam Solutions Spring 2007 Question 344 Points 444 Points Score 1 10 10 2 10 10 3 20 20 4 20 10 5 20 20 6 20 10 7-20 Total: 100 100 Instructions: 1. Question

More information

4.1 Paging suffers from and Segmentation suffers from. Ans

4.1 Paging suffers from and Segmentation suffers from. Ans Worked out Examples 4.1 Paging suffers from and Segmentation suffers from. Ans: Internal Fragmentation, External Fragmentation 4.2 Which of the following is/are fastest memory allocation policy? a. First

More information

Chapter 3 - Memory Management

Chapter 3 - Memory Management Chapter 3 - Memory Management Luis Tarrataca luis.tarrataca@gmail.com CEFET-RJ L. Tarrataca Chapter 3 - Memory Management 1 / 222 1 A Memory Abstraction: Address Spaces The Notion of an Address Space Swapping

More information

of characters from an alphabet, then, the hash function could be:

of characters from an alphabet, then, the hash function could be: Module 7: Hashing Dr. Natarajan Meghanathan Professor of Computer Science Jackson State University Jackson, MS 39217 E-mail: natarajan.meghanathan@jsums.edu Hashing A very efficient method for implementing

More information