Notes on Bloom filters
Computer Science B63, Winter 2017
Scarborough Campus, University of Toronto
Vassos Hadzilacos

A Bloom filter is an approximate, or probabilistic, dictionary. Let S be a dynamic set of keys drawn from a universe U. A Bloom filter maintains a summary F_S of S, supporting the following operations:

BF-Insert(F_S, x): S := S ∪ {x}, i.e., add x to the underlying dynamic set S.

BF-Search(F_S, x): return "no" if x ∉ S, and "probably yes" if, with high probability (to be discussed later), x ∈ S; it is possible, however, that "probably yes" is returned even though x ∉ S.

An instance where BF-Search(F_S, x) returns "probably yes" even though x ∉ S is called a false positive. Note that there is no BF-Delete operation. As we will see, deletions are problematic for Bloom filters; we will discuss partial remedies for this weakness.

Bloom filters are very space efficient: they consume only a small fraction of the space needed to store the full dynamic set S using, say, an AVL tree or a hash table. Consequently they also achieve time efficiencies: they can be stored in main memory, rather than in secondary storage, and so they can be accessed much faster. The disadvantage of Bloom filters is that there is a non-zero probability of false positive searches. Note that there is no possibility of a false negative: if BF-Search(F_S, x) returns "no", then x is definitely not in S. This asymmetry between the positive and negative responses is critical in making Bloom filters useful, as will be seen when we discuss some applications of Bloom filters.

How Bloom filters work. A Bloom filter consists of an array of m bits, BF[0..m-1], initially all 0, corresponding to an empty set. Let h_1, h_2, ..., h_t be hash functions that map U to {0, 1, ..., m-1}. The Bloom filter operations are then implemented as follows: To insert a key x into the Bloom filter, we set all the bits BF[h_1(x)], ..., BF[h_t(x)] to 1. To search for a key x, we look at all the bits BF[h_1(x)], ..., BF[h_t(x)].
If any one of them is still 0, we return "no": had x been inserted into the Bloom filter, all these bits would have been set to 1. If all of them are 1, we return "probably yes".

Note that a search for x may find all the bits set to 1 even though x was never inserted into the dictionary. For example, suppose we use two hash functions, which map x to bit positions 1 and 3, y to bit positions 1 and 2, and z to bit positions 2 and 3. If we insert y and z, and then search for x, the search algorithm will return "probably yes" even though x was not inserted into the Bloom filter. This is an example of a false positive search. The algorithms for BF-Insert and BF-Search are shown in pseudocode in Figure 1.

    BF-Insert(F_S, x)
        for i := 1 to t do
            BF[h_i(x)] := 1

    BF-Search(F_S, x)
        for i := 1 to t do
            if BF[h_i(x)] = 0 then return "no"
        return "probably yes"

    Figure 1: Insert and search operations with Bloom filters
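The operations of Figure 1 can be sketched in Python. This is an illustrative sketch, not part of the notes: in particular, the t hash functions h_1, ..., h_t are simulated here by salting a cryptographic hash with the index i, which is one common way to approximate independent hash functions.

```python
import hashlib

class BloomFilter:
    """A minimal Bloom filter sketch: m bits and t simulated hash functions."""

    def __init__(self, m, t):
        self.m = m
        self.t = t
        self.bits = [0] * m  # BF[0..m-1], initially all 0 (empty set)

    def _positions(self, x):
        # Simulate t hash functions h_1, ..., h_t by salting SHA-256 with i.
        for i in range(self.t):
            digest = hashlib.sha256(f"{i}:{x}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def insert(self, x):
        # BF-Insert: set all bits BF[h_1(x)], ..., BF[h_t(x)] to 1.
        for pos in self._positions(x):
            self.bits[pos] = 1

    def search(self, x):
        # BF-Search: "no" if any bit is still 0, else "probably yes".
        if all(self.bits[pos] for pos in self._positions(x)):
            return "probably yes"
        return "no"
```

For example, after `bf = BloomFilter(m=1000, t=5)` and `bf.insert("cat")`, a search for "cat" returns "probably yes", while a search for a key that was never inserted almost certainly returns "no".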
Assuming that we can evaluate each hash function in O(1) time, it is obvious that these algorithms run in Θ(t) time. In typical uses, the number t of hash functions is a small constant, so the algorithms run in Θ(1) time.

We use multiple hash functions, rather than just one, to reduce the probability of false positives. If we used only one hash function and we inserted a key x, then a search for any key x' that collides with x under that hash function would return "probably yes", even if x' was never inserted. If we use two hash functions, a key x' that collides with x under one hash function is unlikely to also collide with x under the other, provided the hash functions are independent (informally: they tend to map the same key to different positions). We will make this more precise later, when we analyze the performance of Bloom filters. We will also determine shortly the optimal number of hash functions to use.

Probability of false positive. The probability of a false positive search depends on three factors: the size m of the Bloom filter; the number n of items inserted into the Bloom filter; and the number t of hash functions used. Intuitively, the larger m is, the lower the probability of collisions and therefore of false positives. Similarly, the smaller n is, the lower the probability of collisions and therefore of false positives. The ratio α = n/m is called the load factor; we encountered this quantity in our analysis of hash tables. From the preceding discussion it is clear that the smaller the load factor, the lower the probability of false positives. The optimal value of the third parameter, t, occupies a sweet spot between too few hash functions (leading to a higher probability of collisions, and therefore a higher probability of false positives) and too many hash functions (causing each inserted item to set many bits to 1, and therefore again a higher probability of false positives).
To analyze the probability of a false positive search, we consider a two-stage process.

(A) We insert n distinct keys x_1, x_2, ..., x_n into the Bloom filter. We model these insertions by the following experiment. Start with a Bloom filter all of whose bits are set to 0. Repeat the following for a total of nt times, independently: choose a bit position in the Bloom filter uniformly at random (i.e., each position is chosen with probability 1/m), and set that bit to 1. This models the insertion of n distinct keys, drawn at random from U, where each insertion uses t hash functions to set some bits to 1.

(B) Next we search for a randomly chosen key x ∉ {x_1, x_2, ..., x_n} in U, and we want to determine the probability of a false positive, i.e., the probability that the bits to which x is mapped by the t hash functions have all been set to 1 by the insertion process. We model this by repeating t times, independently, the following: choose a position in the Bloom filter uniformly at random. We then compute the probability of the event that all of the positions chosen were set to 1 during Stage (A).

This is an idealized model, like the simple uniform hashing assumption (SUHA) that we used to analyze hashing: it assumes that there are no dependencies or regularities in the set of keys inserted into the Bloom filter, and that the hash functions distribute the keys uniformly at random over the positions of the Bloom filter. With suitably designed hash functions, this idealized model captures well enough the reality of many situations that arise in practice.

Fix an arbitrary position l, 0 ≤ l < m, of the Bloom filter. We first compute the probability that BF[l] = 0 at the end of Stage (A), i.e., after the keys x_1, x_2, ..., x_n have been inserted. According to our model, the probability that one of these keys under one of the hash functions hits position l is 1/m; therefore the probability that it misses position l is 1 - 1/m. Since the positions of the Bloom filter set
to 1 during Stage (A) are chosen independently and uniformly at random, the probability that all n keys inserted under all hash functions miss position l is (1 - 1/m)^{nt}. That is,

    Pr[BF[l] = 0 after x_1, ..., x_n are inserted] = (1 - 1/m)^{nt} ≈ e^{-nt/m} = e^{-αt},

where the approximation is justified by the fact that, for values of x close to 0, 1 - x ≈ e^{-x}.

Now consider any key x different from all the n keys inserted into the Bloom filter. The probability that a search for x yields a false positive is the probability that, after the insertion of x_1, ..., x_n into the Bloom filter, the positions to which the hash functions map x are all set to 1. As we just saw, the probability that any particular bit of BF is 0 after the insertions is e^{-αt}, and so the probability that any particular bit is 1 is 1 - e^{-αt}. By the model assumption that the hash functions map x to positions of BF chosen independently and uniformly at random, the probability that all of the bits to which the hash functions map x are 1 is (1 - e^{-αt})^t.

Suppose now that the size m of the Bloom filter and the number n of elements in it are fixed; therefore the load factor α = n/m is fixed. For this fixed α, the probability of a false positive becomes a function only of t, the number of hash functions:

    P(t) = (1 - e^{-αt})^t                                                (1)

We can therefore compute the value of t that minimizes this function by taking its derivative and setting it to 0. We have:

    dP(t)/dt = (1 - e^{-αt})^t ( ln(1 - e^{-αt}) + αt e^{-αt} / (1 - e^{-αt}) )

Setting the derivative to 0 and solving for t, we get that the value of t that minimizes the probability of a false positive is t = 0 or t = α^{-1} ln 2. The value t = 0 is not feasible (since we need a positive number of hash functions!), so the optimal number of hash functions is given by

    t = α^{-1} ln 2                                                       (2)

Note that this is generally a non-integer value, so we use the positive integer t that is closest to α^{-1} ln 2.
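As a sanity check (my addition, not part of the notes), equations (1) and (2) can be evaluated numerically; for concreteness this uses α = 1/32, the load factor of the example that follows:

```python
import math

def fp_probability(t, alpha):
    """P(t) = (1 - e^{-alpha*t})^t, the false-positive probability from (1)."""
    return (1 - math.exp(-alpha * t)) ** t

alpha = 1 / 32
t_opt = math.log(2) / alpha     # optimal t from (2): alpha^{-1} ln 2, about 22.18

# No integer number of hash functions beats the continuous optimum,
# and the nearest integer, t = 22, comes very close to it.
p_opt = fp_probability(t_opt, alpha)
assert all(fp_probability(t, alpha) >= p_opt for t in range(1, 100))
assert fp_probability(round(t_opt), alpha) < 1.01 * p_opt
```

The scan over t = 1, ..., 99 confirms that the critical point found by the derivative is indeed the minimum of P(t).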
Substituting (2) into (1), we get that the probability of a false positive search using the optimal number of hash functions is

    P(α^{-1} ln 2) = (1 - e^{-α·α^{-1} ln 2})^{α^{-1} ln 2} = (1/2^{ln 2})^{α^{-1}} ≈ 0.6185^{α^{-1}}    (3)

Example. Suppose we have a dictionary consisting of 10 million URLs, i.e., n = 10^7. If we allocate a Bloom filter with m = 32 × 10^7 bits, we have α^{-1} = 32. Applying (3), we get that the probability of a false positive in this case is approximately 0.6185^32 ≈ 2.1 × 10^{-7}. A more accurate calculation is to first find the optimal number t of hash functions as the positive integer closest to the value given by (2), and then apply (1) for that value of t. Doing so, we find that t should be the positive integer closest to 32 ln 2 ≈ 22.18, i.e., t = 22. Plugging this value into (1), we get that the probability of a false positive search is P(22) = (1 - e^{-22/32})^22 ≈ 2.1 × 10^{-7}.

The inverse of the load factor, α^{-1} = m/n, can be thought of as the number of bits we allocate per element inserted in the Bloom filter. This interpretation should not be taken to mean that we allocate a specific set of positions in the Bloom filter to each item we insert: each item inserted into the Bloom filter gets (up to) t bits, the positions to which it is mapped by the t hash functions. Rather, α^{-1} is
a measure of how much space we save by using a Bloom filter instead of storing the dictionary explicitly. In our example, α^{-1} = 32; thus we allocate 32 bits, i.e., 4 bytes, for each URL in the dictionary. This is much shorter than the space required to store an actual URL.

Deletions. Deletions are problematic in Bloom filters. Note that we cannot delete an element merely by setting to 0 the bits to which it is mapped by the hash functions: doing so would result in false negatives, which would render Bloom filters useless. To see how this can happen, suppose we have inserted three keys: x, mapped to bit positions 1 and 3; y, mapped to bit positions 1 and 2; and z, mapped to bit positions 2 and 3. If we delete y and z by setting their bits to 0, and we then search for x, the Bloom filter would return "no", even though x was not deleted.

A partial solution to this limitation is to use so-called counting Bloom filters. In a counting Bloom filter, each position in the array BF is not a bit but a small counter. Initially, every counter is 0, indicating an empty Bloom filter. Each time a key x is inserted (respectively, deleted), the counters in the positions to which x is mapped by the hash functions are incremented (respectively, decremented) by 1. To search for a key x, we look at all the counters to which x is mapped by the hash functions; if any of them is 0, we return "no"; otherwise, we return "probably yes". Pseudocode for these operations is shown in Figure 2.

    BF-Insert(F_S, x)
        for i := 1 to t do
            BF[h_i(x)] := BF[h_i(x)] + 1

    BF-Delete(F_S, x)
        for i := 1 to t do
            BF[h_i(x)] := BF[h_i(x)] - 1

    BF-Search(F_S, x)
        for i := 1 to t do
            if BF[h_i(x)] = 0 then return "no"
        return "probably yes"

    Figure 2: Insert, delete, and search operations with counting Bloom filters

We don't want to allocate many bits to each counter, as this would undermine the space savings that Bloom filters are designed to deliver.
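The operations of Figure 2 can be sketched by replacing the bit array with an array of counters (again an illustrative sketch, with the salted-hash scheme as my assumption, not part of the notes):

```python
import hashlib

class CountingBloomFilter:
    """Counting Bloom filter sketch: counters instead of bits, enabling delete."""

    def __init__(self, m, t):
        self.m = m
        self.t = t
        self.counters = [0] * m  # small counters, all initially 0

    def _positions(self, x):
        # Simulate t hash functions by salting SHA-256 with the index i.
        for i in range(self.t):
            digest = hashlib.sha256(f"{i}:{x}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def insert(self, x):
        for pos in self._positions(x):
            self.counters[pos] += 1

    def delete(self, x):
        # Caution: deleting a key that was never inserted corrupts the filter.
        for pos in self._positions(x):
            self.counters[pos] -= 1

    def search(self, x):
        if all(self.counters[pos] > 0 for pos in self._positions(x)):
            return "probably yes"
        return "no"
```

Deleting y now decrements a counter that x also uses back to 1 rather than clearing it, so a subsequent search for x still answers "probably yes".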
On the other hand, if the counters are too small, they will wrap around and again produce false negatives. For these reasons, counting Bloom filters are only a limited solution. As we will see, Bloom filters are typically used in applications where there are no, or only very few, deletions.

Applications. We now briefly describe some applications of Bloom filters.

Refusing service to black-listed sites. A web server may keep a (long) list of black-listed sites, known to contain malware or to distribute spam. Whenever the web server receives a request from such a site, it does not respond to it. Almost all requests that the web server receives are from clean sites. Nevertheless, the list of black-listed sites is too long to keep in main memory. It would be very inefficient to keep the list only on disk: doing so would mean that the web server would have to perform time-consuming disk accesses on each request to verify that the requesting site is not black-listed. Instead, the web server keeps the full list of black-listed sites on disk, and keeps in main memory a Bloom filter of the black-listed sites. This is feasible because the Bloom filter is much smaller than the actual list of black-listed sites. When a request arrives from a site s, the web server checks whether s is in the Bloom filter. In most cases (and assuming that the probability of a false positive is low), the answer is "no", in which case the web server replies to s's request. In the rare instances where the answer is "probably yes", the web server performs a disk access to search the actual list of black-listed sites for s. If s is not found on that list, the web server replies to s's request; otherwise, i.e., if s actually is a black-listed site, the web server ignores s's request.

Approximate counting. Suppose we want to count how many different IP addresses have visited a web page.
The obvious way to do this is to keep the set V of all IP addresses that have visited the web page in the past, and a counter giving the cardinality of that set. Each time a request arrives from IP address a, we check if a ∈ V; if not, we add a to V and increment the counter. It is, however, too expensive to
remember all IP addresses that visited the page in the past. If (as is often the case) it is acceptable to provide an approximate counter that slightly undercounts unique visitors, we can use a Bloom filter of the visitors' IP addresses, rather than the set of addresses itself. When IP address a visits the web page, we check if a is in the Bloom filter. If the answer is "no", we know for sure that a is a new visitor, so we insert a into the Bloom filter and increment the counter. If the answer is "probably yes", we do not increment the counter. Note that if this was a false positive, by not incrementing the counter we have missed a new visitor. If false positives are rare, our approximate counter will be close enough. To be honest about the service provided, the counter should be used to report "at least x unique visitors" (rather than report "x unique visitors", as if x were the exact number).

These applications share the following characteristics:

1. Saving space is a key objective. In both applications, we don't want to allocate the space needed to store the entire set of items we are interested in. In the case of the web server managing black-listed sites, we store the full set of black-listed sites on disk, but we access the disk copy only rarely. In the case of the approximate counter, we don't even bother to store the full set of past visitors at all.

2. Objects are rarely (if ever) deleted from the dynamic set summarized by the Bloom filter. Once a site is compromised, it remains black-listed forever; and once an IP address has visited the web page, it remains a past visitor (by definition!) forever.

3. The fact that there are no false negatives is crucial for the Bloom filter to be useful. In the case of the web server managing black-listed sites, if a Bloom filter search returns "no", we know for sure that the requesting site is not black-listed and it is therefore safe to respond to its requests.
In the case of the approximate counter, if a Bloom filter search returns "no", we know for sure that the visitor is new and it is correct to increment the counter of unique visitors.

4. There is an effective way to mitigate the effect of false positives. In the case of the web server managing black-listed sites, the mitigation strategy is to access the list of black-listed sites stored on disk when the Bloom filter answers "probably yes". This is slow, but it is tolerable because it happens rarely. In the case of the approximate counter, the mitigation strategy is to report an undercount of the unique visitors, rather than an exact count.

These characteristics (saving space, rare deletions, tolerance to rare false positives, and existence of a mitigation strategy for false positives) are typical of applications in which Bloom filters can be brought to bear.
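The approximate-counting scheme can be sketched by pairing a Bloom filter with a counter. This is my illustrative sketch, not part of the notes; the salted-hash construction and all names are assumptions:

```python
import hashlib

class ApproximateVisitorCounter:
    """Undercounting unique-visitor counter backed by a Bloom filter."""

    def __init__(self, m, t):
        self.m, self.t = m, t
        self.bits = [0] * m
        self.count = 0  # lower bound on the number of unique visitors

    def _positions(self, addr):
        # Simulate t hash functions by salting SHA-256 with the index i.
        for i in range(self.t):
            digest = hashlib.sha256(f"{i}:{addr}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def visit(self, addr):
        positions = list(self._positions(addr))
        if all(self.bits[p] for p in positions):
            return            # "probably yes": seen before, or a rare false positive
        for p in positions:   # "no": definitely a new visitor
            self.bits[p] = 1
        self.count += 1

    def report(self):
        # Honest reporting: the count may miss visitors lost to false positives.
        return f"at least {self.count} unique visitors"
```

A false positive silently skips the increment, which is exactly why the report is phrased as a lower bound.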
More informationData Structures and Algorithms. Roberto Sebastiani
Data Structures and Algorithms Roberto Sebastiani roberto.sebastiani@disi.unitn.it http://www.disi.unitn.it/~rseba - Week 07 - B.S. In Applied Computer Science Free University of Bozen/Bolzano academic
More informationAlgorithms and Data Structures
Algorithms and Data Structures Spring 2019 Alexis Maciel Department of Computer Science Clarkson University Copyright c 2019 Alexis Maciel ii Contents 1 Analysis of Algorithms 1 1.1 Introduction.................................
More informationHash Table. A hash function h maps keys of a given type into integers in a fixed interval [0,m-1]
Exercise # 8- Hash Tables Hash Tables Hash Function Uniform Hash Hash Table Direct Addressing A hash function h maps keys of a given type into integers in a fixed interval [0,m-1] 1 Pr h( key) i, where
More informationHashing. Yufei Tao. Department of Computer Science and Engineering Chinese University of Hong Kong
Department of Computer Science and Engineering Chinese University of Hong Kong In this lecture, we will revisit the dictionary search problem, where we want to locate an integer v in a set of size n or
More informationOperating system Dr. Shroouq J.
2.2.2 DMA Structure In a simple terminal-input driver, when a line is to be read from the terminal, the first character typed is sent to the computer. When that character is received, the asynchronous-communication
More informationAAL 217: DATA STRUCTURES
Chapter # 4: Hashing AAL 217: DATA STRUCTURES The implementation of hash tables is frequently called hashing. Hashing is a technique used for performing insertions, deletions, and finds in constant average
More informationPractice Midterm Exam Solutions
CSE 332: Data Abstractions Autumn 2015 Practice Midterm Exam Solutions Name: Sample Solutions ID #: 1234567 TA: The Best Section: A9 INSTRUCTIONS: You have 50 minutes to complete the exam. The exam is
More informationQuestion Points Score Total 100
Midterm #2 CMSC 412 Operating Systems Fall 2005 November 22, 2004 Guidelines This exam has 7 pages (including this one); make sure you have them all. Put your name on each page before starting the exam.
More informationCSE 5311 Notes 5: Hashing
CSE 5311 Notes 5: Hashing (Last updated 2/18/18 1:33 PM) CLRS, Chapter 11 Review: 11.2: Chaining - related to perfect hashing method 11.3: Hash functions, skim universal hashing (aside: https://dl-acm-org.ezproy.uta.edu/citation.cfm?doid=3116227.3068772
More informationIntroduction to Algorithms April 21, 2004 Massachusetts Institute of Technology. Quiz 2 Solutions
Introduction to Algorithms April 21, 2004 Massachusetts Institute of Technology 6.046J/18.410J Professors Erik Demaine and Shafi Goldwasser Quiz 2 Solutions Quiz 2 Solutions Do not open this quiz booklet
More informationCS369G: Algorithmic Techniques for Big Data Spring
CS369G: Algorithmic Techniques for Big Data Spring 2015-2016 Lecture 11: l 0 -Sampling and Introduction to Graph Streaming Prof. Moses Charikar Scribe: Austin Benson 1 Overview We present and analyze the
More information4.1 COMPUTATIONAL THINKING AND PROBLEM-SOLVING
4.1 COMPUTATIONAL THINKING AND PROBLEM-SOLVING 4.1.2 ALGORITHMS ALGORITHM An Algorithm is a procedure or formula for solving a problem. It is a step-by-step set of operations to be performed. It is almost
More informationChapter 17. Disk Storage, Basic File Structures, and Hashing. Records. Blocking
Chapter 17 Disk Storage, Basic File Structures, and Hashing Records Fixed and variable length records Records contain fields which have values of a particular type (e.g., amount, date, time, age) Fields
More informationVariables and Constants
HOUR 3 Variables and Constants Programs need a way to store the data they use. Variables and constants offer various ways to work with numbers and other values. In this hour you learn: How to declare and
More informationHashing 1. Searching Lists
Hashing 1 Searching Lists There are many instances when one is interested in storing and searching a list: A phone company wants to provide caller ID: Given a phone number a name is returned. Somebody
More informationModule 5: Hashing. CS Data Structures and Data Management. Reza Dorrigiv, Daniel Roche. School of Computer Science, University of Waterloo
Module 5: Hashing CS 240 - Data Structures and Data Management Reza Dorrigiv, Daniel Roche School of Computer Science, University of Waterloo Winter 2010 Reza Dorrigiv, Daniel Roche (CS, UW) CS240 - Module
More information1 Defining Message authentication
ISA 562: Information Security, Theory and Practice Lecture 3 1 Defining Message authentication 1.1 Defining MAC schemes In the last lecture we saw that, even if our data is encrypted, a clever adversary
More informationCSCI Analysis of Algorithms I
CSCI 305 - Analysis of Algorithms I 04 June 2018 Filip Jagodzinski Computer Science Western Washington University Announcements Remainder of the term 04 June : lecture 05 June : lecture 06 June : lecture,
More informationPreview. Memory Management
Preview Memory Management With Mono-Process With Multi-Processes Multi-process with Fixed Partitions Modeling Multiprogramming Swapping Memory Management with Bitmaps Memory Management with Free-List Virtual
More informationAlgorithms in Systems Engineering ISE 172. Lecture 12. Dr. Ted Ralphs
Algorithms in Systems Engineering ISE 172 Lecture 12 Dr. Ted Ralphs ISE 172 Lecture 12 1 References for Today s Lecture Required reading Chapter 5 References CLRS Chapter 11 D.E. Knuth, The Art of Computer
More informationCS/COE 1501
CS/COE 1501 www.cs.pitt.edu/~lipschultz/cs1501/ Hashing Wouldn t it be wonderful if... Search through a collection could be accomplished in Θ(1) with relatively small memory needs? Lets try this: Assume
More informationLayered Network Architecture. CSC358 - Introduction to Computer Networks
Layered Network Architecture Layered Network Architecture Question: How can we provide a reliable service on the top of a unreliable service? ARQ: Automatic Repeat Request Can be used in every layer TCP
More informationHASH TABLES. Hash Tables Page 1
HASH TABLES TABLE OF CONTENTS 1. Introduction to Hashing 2. Java Implementation of Linear Probing 3. Maurer s Quadratic Probing 4. Double Hashing 5. Separate Chaining 6. Hash Functions 7. Alphanumeric
More informationCSE100. Advanced Data Structures. Lecture 21. (Based on Paul Kube course materials)
CSE100 Advanced Data Structures Lecture 21 (Based on Paul Kube course materials) CSE 100 Collision resolution strategies: linear probing, double hashing, random hashing, separate chaining Hash table cost
More informationLesson n.11 Data Structures for P2P Systems: Bloom Filters, Merkle Trees
Lesson n.11 : Bloom Filters, Merkle Trees Didactic Material Tutorial on Moodle 15/11/2013 1 SET MEMBERSHIP PROBLEM Let us consider the set S={s 1,s 2,...,s n } of n elements chosen from a very large universe
More informationCpt S 223. School of EECS, WSU
Hashing & Hash Tables 1 Overview Hash Table Data Structure : Purpose To support insertion, deletion and search in average-case constant t time Assumption: Order of elements irrelevant ==> data structure
More informationFinal Exam in Algorithms and Data Structures 1 (1DL210)
Final Exam in Algorithms and Data Structures 1 (1DL210) Department of Information Technology Uppsala University February 0th, 2012 Lecturers: Parosh Aziz Abdulla, Jonathan Cederberg and Jari Stenman Location:
More informationLecture 16. Reading: Weiss Ch. 5 CSE 100, UCSD: LEC 16. Page 1 of 40
Lecture 16 Hashing Hash table and hash function design Hash functions for integers and strings Collision resolution strategies: linear probing, double hashing, random hashing, separate chaining Hash table
More informationCS 161 Problem Set 4
CS 161 Problem Set 4 Spring 2017 Due: May 8, 2017, 3pm Please answer each of the following problems. Refer to the course webpage for the collaboration policy, as well as for helpful advice for how to write
More informationCS 350 Algorithms and Complexity
CS 350 Algorithms and Complexity Winter 2019 Lecture 12: Space & Time Tradeoffs. Part 2: Hashing & B-Trees Andrew P. Black Department of Computer Science Portland State University Space-for-time tradeoffs
More informationCMSC 341 Lecture 16/17 Hashing, Parts 1 & 2
CMSC 341 Lecture 16/17 Hashing, Parts 1 & 2 Prof. John Park Based on slides from previous iterations of this course Today s Topics Overview Uses and motivations of hash tables Major concerns with hash
More informationCS 561, Lecture 2 : Randomization in Data Structures. Jared Saia University of New Mexico
CS 561, Lecture 2 : Randomization in Data Structures Jared Saia University of New Mexico Outline Hash Tables Bloom Filters Skip Lists 1 Dictionary ADT A dictionary ADT implements the following operations
More informationCS 561, Lecture 2 : Hash Tables, Skip Lists, Bloom Filters, Count-Min sketch. Jared Saia University of New Mexico
CS 561, Lecture 2 : Hash Tables, Skip Lists, Bloom Filters, Count-Min sketch Jared Saia University of New Mexico Outline Hash Tables Skip Lists Count-Min Sketch 1 Dictionary ADT A dictionary ADT implements
More informationFILE SYSTEM IMPLEMENTATION. Sunu Wibirama
FILE SYSTEM IMPLEMENTATION Sunu Wibirama File-System Structure Outline File-System Implementation Directory Implementation Allocation Methods Free-Space Management Discussion File-System Structure Outline
More informationHashing. 5/1/2006 Algorithm analysis and Design CS 007 BE CS 5th Semester 2
Hashing Hashing A hash function h maps keys of a given type to integers in a fixed interval [0,N-1]. The goal of a hash function is to uniformly disperse keys in the range [0,N-1] 5/1/2006 Algorithm analysis
More informationCS 344/444 Computer Network Fundamentals Final Exam Solutions Spring 2007
CS 344/444 Computer Network Fundamentals Final Exam Solutions Spring 2007 Question 344 Points 444 Points Score 1 10 10 2 10 10 3 20 20 4 20 10 5 20 20 6 20 10 7-20 Total: 100 100 Instructions: 1. Question
More information4.1 Paging suffers from and Segmentation suffers from. Ans
Worked out Examples 4.1 Paging suffers from and Segmentation suffers from. Ans: Internal Fragmentation, External Fragmentation 4.2 Which of the following is/are fastest memory allocation policy? a. First
More informationChapter 3 - Memory Management
Chapter 3 - Memory Management Luis Tarrataca luis.tarrataca@gmail.com CEFET-RJ L. Tarrataca Chapter 3 - Memory Management 1 / 222 1 A Memory Abstraction: Address Spaces The Notion of an Address Space Swapping
More informationof characters from an alphabet, then, the hash function could be:
Module 7: Hashing Dr. Natarajan Meghanathan Professor of Computer Science Jackson State University Jackson, MS 39217 E-mail: natarajan.meghanathan@jsums.edu Hashing A very efficient method for implementing
More information