Practical Session 8 - Hash Tables


Exercise #8 - Hash Tables

Hash Function
A hash function h maps keys of a given type into integers in a fixed interval [0, m-1].

Uniform Hash
Pr[h(key) = i] = 1/m for every index i in [0, m-1], where m is the size of the hash table.

Hash Table
A hash table for a given key type consists of:
- A hash function h: keys-set -> [0, m-1]
- An array (called the table) of size m

Direct Addressing
K is a set whose elements' keys are in the range [0, m-1]. Use a table of size m and store each element x at index x.key.
Disadvantage: when |K| << m, this wastes space.

Chaining
h(k) = k mod m (an example of a common hash function). If index h(k) is occupied, add the new element at the head of the chain at index h(k).

Open Addressing
Basic idea: if a slot is full, try another slot until an open slot is found (probing). Search is similar to insert, and open addressing is a good solution for fixed sets.
Deletion: a slot must be marked during deletion, and search time is then no longer dependent on α.

Linear Probing
h(k, i) = (h'(k) + i) mod m, for 0 <= i <= m-1, where h'(k) is a common hash function.
First try h(k, 0) = h'(k); if it is occupied, try h(k, 1), and so on.
Advantage: simplicity.
Disadvantage: clustering; uses only Θ(m) permutations of index addressing sequences.

Double Hashing
h(k, i) = (h1(k) + i*h2(k)) mod m, for 0 <= i <= m-1, where h1 is a hash function and h2 is a step function (also a hash function).
First try h(k, 0) = h1(k); if it is occupied, try h(k, 1), and so on.
Advantage: fewer clusters; uses Θ(m^2) permutations of index addressing sequences.
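The chaining scheme above can be sketched in a few lines of Python (an illustrative sketch, not part of the session; the table size m = 11 and the hash h(k) = k mod m are taken from the examples in this document):

```python
def chain_insert(table, k, h):
    # Chaining: each slot holds a chain (a Python list here);
    # new elements are inserted at the head of the chain.
    table[h(k)].insert(0, k)

def chain_search(table, k, h):
    # Search scans only the chain at index h(k).
    return k in table[h(k)]

m = 11
table = [[] for _ in range(m)]
for key in [22, 1, 13, 11]:
    chain_insert(table, key, lambda k: k % m)
# 22 and 11 both hash to index 0; 11 was inserted last, so it sits at the head.
```

Note that the most recently inserted key sits at the head of its chain, matching the insert-at-head rule above.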

Load Factor
α = n/m, for a hash table with m slots that stores n elements (keys).

Average (Expected) Search Time
Open addressing:
- unsuccessful search: O(1 + 1/(1-α))
- successful search: O(1 + (1/α) ln(1/(1-α)))
Chaining:
- unsuccessful search: Θ(1 + α)
- successful search: Θ(1 + α/2) = Θ(1 + α)

Dictionary ADT
The dictionary ADT models a searchable collection of key-element items and supports the following operations: Insert, Delete, Search.

Bloom Filter
A Bloom filter consists of a bit-array of size m (the Bloom filter) and k hash functions h1, h2, ..., hk. It supports insertion and search queries only:
- Insert in O(1)
- Search(x) in O(1), with probability ~(1 - e^(-kn/m))^k of getting a false positive

Remarks:
1. The efficiency of hashing is examined in the average case, not the worst case.
2. The load factor defined above is the average number of elements that are hashed to the same value.
3. Consider chaining: the load factor is the average number of elements stored in a single linked list. If we examine the worst case, all n keys are hashed to the same cell and form a linked list of length n; the running time of searching for an element in the worst case is therefore Θ(n) plus the time of computing the hash function. So it is clear that we do not use hash tables for their worst-case performance, and when analyzing the running time of hashing operations we refer to the average case.
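A minimal Bloom filter sketch in Python, assuming salted SHA-256 as a stand-in for the k hash functions (an illustrative choice, not the specific h1..hk of the session):

```python
import hashlib

class BloomFilter:
    """Bit-array of size m plus k derived hash functions."""
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = [0] * m

    def _indices(self, x):
        # Derive k indices by salting a cryptographic hash with i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{x}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def insert(self, x):
        # O(k) = O(1) for constant k.
        for idx in self._indices(x):
            self.bits[idx] = 1

    def search(self, x):
        # May return a false positive, never a false negative.
        return all(self.bits[idx] for idx in self._indices(x))

bf = BloomFilter(m=256, k=3)
for word in ["cat", "dog", "fish"]:
    bf.insert(word)
```

A search for any inserted element always answers True; a search for an absent element answers True only with the (small) false-positive probability given above.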

Question 1
Given a hash table with m = 11 entries and the following hash function h1 and step function h2:
h1(key) = key mod m
h2(key) = (key mod (m-1)) + 1
Insert the keys {22, 1, 13, 11, 24, 33, 18, 42, 31} in the given order (from left to right) into the hash table, using each of the following hash methods:
a. Chaining with h1: h(k) = h1(k)
b. Linear probing with h1: h(k, i) = (h1(k) + i) mod m
c. Double hashing with h1 as the hash function and h2 as the step function: h(k, i) = (h1(k) + i*h2(k)) mod m

Solution:

Index | Chaining       | Linear Probing | Double Hashing
    0 | 33 -> 11 -> 22 | 22             | 22
    1 | 1              | 1              | 1
    2 | 24 -> 13       | 13             | 13
    3 |                | 11             |
    4 |                | 24             | 11
    5 |                | 33             | 18
    6 |                |                | 31
    7 | 18             | 18             | 24
    8 |                |                | 33
    9 | 31 -> 42       | 42             | 42
   10 |                | 31             |
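A short Python sketch (not part of the original solution) can simulate part c, double hashing with h1(k) = k mod 11 and h2(k) = (k mod 10) + 1:

```python
def double_hash_insert(table, k):
    m = len(table)
    h1 = k % m                 # hash function h1(key) = key mod m
    h2 = (k % (m - 1)) + 1     # step function h2(key) = (key mod (m-1)) + 1
    for i in range(m):
        idx = (h1 + i * h2) % m
        if table[idx] is None:
            table[idx] = k
            return
    raise RuntimeError("hash table is full")

table = [None] * 11
for key in [22, 1, 13, 11, 24, 33, 18, 42, 31]:
    double_hash_insert(table, key)
# Resulting table, indices 0..10:
# [22, 1, 13, None, 11, 18, 31, 24, 33, 42, None]
```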

Question 2
a. For the previous h1 and h2, is it OK to use h2 as the hash function and h1 as the step function?
b. Why is it important that the result of the step function and the table size have no common divisor? That is, if hstep is the step function, why is it important to enforce GCD(hstep(k), m) = 1 for every k?

Solution:
a. No, since h1 can return 0 (while h2 never maps a key to entry 0). For example, take key 11: h1(11) = 0, so all the steps are of length 0, and h(11, i) = h2(11) + i*h1(11) = h2(11) = 2. The sequence of index addressing for key 11 consists of only one index, 2.
b. Suppose GCD(hstep(k), m) = d > 1 for some k. The search for k would address only 1/d of the hash table, as the steps always lead to the same cells. If those cells are occupied, we may not find an available slot even if the hash table is not full.
Example: m = 48, hstep(k) = 24, GCD(48, 24) = 24, so only 48/24 = 2 cells of the table would be addressed.
There are two guidelines for choosing m and the step function so that GCD(hstep(k), m) = 1:
1. The size of the table m is a prime number, and hstep(k) < m for every k. (An example of this is illustrated in Question 1.)
2. The size of the table m is 2^x, and the step function returns only odd values.

Question 3
Consider two sets of integers, S = {s1, s2, ..., sm} and T = {t1, t2, ..., tn}, with m <= n.
a. Describe a deterministic algorithm that checks whether S is a subset of T. What is the running time of your algorithm?
b. Devise an algorithm that uses a hash table of size m to test whether S is a subset of T. What is the average running time of your algorithm?
c. Suppose we have a Bloom filter of size r and k hash functions h1, h2, ..., hk : U -> [0, r-1] (U is the set of all possible keys). Describe an algorithm that checks whether S is a subset of T in O(n) worst-case time. What is the probability of getting a positive result although S is not a subset of T? (Express it as a function of n, m, r and |S ∩ T|; you can assume k = O(1).)
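Part b of Question 2 can be checked numerically. The sketch below (illustrative, using the solution's numbers m = 48 and step 24) counts how many distinct cells a probe sequence visits:

```python
from math import gcd

def probed_cells(m, start, step):
    # Indices visited by the probe sequence (start + i*step) mod m, i = 0..m-1.
    return {(start + i * step) % m for i in range(m)}

# A step sharing divisor d with m reaches only m/d distinct cells.
assert len(probed_cells(48, 0, 24)) == 48 // gcd(48, 24)   # only 2 cells
# With a prime table size and a step below m, every cell is reachable.
assert len(probed_cells(11, 5, 3)) == 11
```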

Solution:
a)
Subset(T, S, n, m)
    T = Sort(T)
    for each sj in S
        found = BinarySearch(T, sj)
        if (!found)
            return "S is not a subset of T"
    return "S is a subset of T"

Time complexity: O(nlogn + mlogn) = O(nlogn)

b) The solution uses a hash table HT of size m with chaining:

SubsetWithHashTable(T, S, n, m)
    for each ti in T
        Insert(HT, ti)
    for each sj in S
        found = Search(HT, sj)
        if (!found)
            return "S is not a subset of T"
    return "S is a subset of T"

Time complexity analysis:
- Inserting all elements ti into HT: n times O(1) = O(n).
- If S ⊆ T, there are m successful searches: m(1 + α/2) = O(m).
- If S ⊄ T, then in the worst case all the elements except the last one, sm, are in T, so there are (m-1) successful searches and one unsuccessful search: (m-1)(1 + α/2) + (1 + α) = O(m).
Time complexity: O(n + m) = O(n)

c) We first insert all the elements of T into the Bloom filter. Then, for each element x in S, we query Search(x); once we get one negative result we return FALSE and stop. If we get positive results on all m queries, we return TRUE.

Time complexity analysis: n insertions and m search queries to the Bloom filter: n·O(k) + m·O(k) = (n+m)·O(1) = O(n).

The probability of getting a false positive result: denote t = |S ∩ T|. For all t elements in S ∩ T we will get a positive result. The probability of getting a positive result for each of the m-t elements in S\T is ~(1 - e^(-kn/r))^k, and thus the probability of getting a positive result for all these m-t elements, and so returning a TRUE answer although S ⊄ T, is ~(1 - e^(-kn/r))^(k(m-t)).

Note that the above analysis (b and c) gives an estimate of the average running time.
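Part b corresponds closely to Python's built-in hash-based set; a sketch (the set plays the role of the chained hash table HT):

```python
def is_subset(S, T):
    # Build a hash table from T (n expected-O(1) insertions),
    # then probe it once per element of S (m expected-O(1) searches).
    ht = set(T)
    return all(s in ht for s in S)

# Expected O(n + m) = O(n) overall, matching the analysis above.
```

For example, is_subset([3, 1], [1, 2, 3]) returns True, while is_subset([1, 4], [1, 2, 3]) returns False.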

Question 4
You are given an array A[1..n] with n distinct real numbers and some value X. You need to determine whether there are two numbers in A[1..n] whose sum equals X.
a. Do it in deterministic O(nlogn) time.
b. Do it in expected O(n) time.

Solution a: Sort A (O(nlogn) time). Pick A[1]. Perform a binary search (O(logn) time) on A for an element equal to X - A[1]. If found, you're done. Otherwise, repeat the same with A[2], A[3], ... Overall O(nlogn) time.

Solution b:
1. Choose a universal hash function h from a universal collection of hash functions H.
2. Store all elements of array A in a hash table of size n, using h and chaining.
3. For (i = 1; i <= n; i++)
       If (X - A[i] is in the table and X - A[i] != A[i])
           Return true
   Return false

Time complexity: We don't know whether the elements of the array are randomly distributed. If they are not, using a "regular" hash function can result in α = O(n), and then the runtime could be O(n^2). If we use a universal hash function, then regardless of the key distribution the probability that two keys collide is no more than 1/m = 1/n. As seen in class, if we use chaining and a universal hash function, the expected search time is O(1) and the expected runtime of the algorithm is O(n).

Question 5 - Rolling hash
A rolling hash is used when we want to enable fast search for parts of the data, for example searching for sub-paragraphs in a long paragraph. In a rolling hash, a hash value is calculated for every fixed-size part of the input, in a sliding-window manner. Example:

Input: Hello my name is Inigo Montoya
Window size: 7
The result is a hash value for every window of 7 characters:

Window      Hash value (example)
"Hello m"   20
"ello my"   571
"llo my "   12
"lo my n"   322
"o my na"   9
" my nam"   111
"my name"   7

Usually the window size is 512 characters, which means 512 bytes, and each window differs from its predecessor by one character at the beginning and one at the end. Because mod is a heavy operation (if we are using mod as the hash), and mod on a number of 512*8 bits (up to 2^4096) is especially heavy, it is much easier to calculate the hash value of a window based on its difference from the preceding window.

In a given text, the 1st character is a (ASCII between 0 and 255) and the (k+1)th character is b (ASCII between 0 and 255). If the hash value of the first window of size k, w1, is X, and the hash function is mod m, what is the value of the second window, w2?

Solution:
a is the most significant byte of the 8k-bit number w1, and b is the least significant byte of w2. Thus:

w2 = (w1 - a*2^(8(k-1)))*2^8 + b

w2 mod m = ((w1 - a*2^(8(k-1)))*2^8 + b) mod m
         = (w1*2^8 - a*2^(8k) + b) mod m
         = ((w1 mod m)*(2^8 mod m) - (a mod m)*(2^(8k) mod m) + b mod m) mod m
         = (X*(2^8 mod m) - (a mod m)*(2^(8k) mod m) + b mod m) mod m

Note that X is known from the previous window; a, b, and 2^8 are small numbers whose mod is easy to calculate, and 2^(8k) mod m can be calculated once at the beginning and saved.

Question 6
You have a large number of data chunks that need to be stored on servers in the cloud. In order for the data to be distributed evenly, each data chunk is hashed to decide which server it will be stored on. If a server stops working, its data chunks are supposed to be distributed evenly among the remaining servers, while the data chunks of the remaining servers obviously need to stay stored on their servers. If we implement such a system by numbering the servers and using modulo the number of servers as the hash function, we have a problem when a server stops working: the number of servers changes, so all the hash values change. Suggest a way to implement such a system.
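The rolling update can be sketched in Python; the modulus m = 101 and window size k = 7 are illustrative choices, and pow(256, k, m) plays the role of the precomputed 2^(8k) mod m:

```python
def window_hash(s, m):
    # Hash of a window: interpret its bytes as one big base-256 number, mod m.
    x = 0
    for byte in s.encode():
        x = (x * 256 + byte) % m
    return x

def roll(prev_hash, out_ch, in_ch, k, m):
    # w2 mod m = (X * 2^8 - a * 2^(8k) + b) mod m,
    # where a is the outgoing byte and b is the incoming byte.
    a, b = ord(out_ch), ord(in_ch)
    return (prev_hash * 256 - a * pow(256, k, m) + b) % m

m, k = 101, 7
text = "Hello my name is Inigo Montoya"
h = window_hash(text[:k], m)
for i in range(1, len(text) - k + 1):
    h = roll(h, text[i - 1], text[i + k - 1], k, m)
    assert h == window_hash(text[i:i + k], m)  # O(1) update matches full recompute
```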

Solution: Consistent hashing
The servers and the data chunks are hashed into the same (circular) table. Each server is mapped by a number of hash functions, so it has several values in the table. Each server gets the chunks to its right until another server's value is reached. In the case illustrated in the original exercise:
Chunk 1 is mapped to server 1
Chunk 2 is mapped to server 3 (the array is circular)
Chunk 3 is mapped to server 2
Chunk 4 is mapped to server 2
Chunk 5 is mapped to server 2
Chunk 6 is mapped to server 2
Chunk 7 is mapped to server 3
Chunk 8 is mapped to server 1
If server 2 stops working, the other servers still have their data, and the data of server 2 is distributed among them. Now:
Chunk 1 is mapped to server 1
Chunk 2 is mapped to server 3 (the array is circular)
Chunk 3 is mapped to server 1
Chunk 4 is mapped to server 3
Chunk 5 is mapped to server 1
Chunk 6 is mapped to server 3
Chunk 7 is mapped to server 3
Chunk 8 is mapped to server 1
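A consistent-hashing sketch in Python (illustrative assumptions: SHA-256 stands in for the hash functions, three ring positions per server play the role of "a number of hash functions", and a chunk goes to the first server position clockwise from its own hash):

```python
import bisect
import hashlib

def h(value, m=2**32):
    # Illustrative hash into a circular [0, m) ring.
    return int(hashlib.sha256(str(value).encode()).hexdigest(), 16) % m

class ConsistentHashRing:
    def __init__(self, servers, replicas=3):
        # Each server is mapped to several positions on the ring.
        self.ring = sorted((h(f"{s}#{r}"), s) for s in servers for r in range(replicas))

    def server_for(self, chunk):
        # A chunk belongs to the first server position clockwise from its hash.
        keys = [pos for pos, _ in self.ring]
        i = bisect.bisect_right(keys, h(chunk)) % len(self.ring)
        return self.ring[i][1]

    def remove(self, server):
        # Removing a server only moves the chunks that belonged to it.
        self.ring = [(pos, s) for pos, s in self.ring if s != server]

ring = ConsistentHashRing(["server1", "server2", "server3"])
before = {c: ring.server_for(c) for c in range(1, 9)}
ring.remove("server2")
after = {c: ring.server_for(c) for c in range(1, 9)}
# Chunks not on server2 keep their server; server2's chunks are redistributed.
assert all(after[c] == before[c] for c in before if before[c] != "server2")
```

The key property, checked by the final assertion, is exactly the one the solution describes: when a server disappears, only its own chunks move.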

Question 7
You have a hash table with α = 0.5. What is the probability that 3 or more keys are mapped to the same index?

Solution:
According to the Poisson distribution, if the expected number of keys in an index is λ, then the probability of getting exactly k keys in one index is:
(λ^k * e^(-λ)) / k!
In our case λ = 0.5, so:
The probability that an index is empty is (0.5^0 * e^(-0.5)) / 0! = e^(-0.5)
The probability that an index holds exactly 1 key is (0.5^1 * e^(-0.5)) / 1! = 0.5e^(-0.5)
The probability that an index holds exactly 2 keys is (0.5^2 * e^(-0.5)) / 2! = 0.125e^(-0.5)
And thus the probability of 3 or more keys in an index is 1 - e^(-0.5) - 0.5e^(-0.5) - 0.125e^(-0.5) ≈ 0.014
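The arithmetic can be checked directly (a small sketch using the Poisson formula above):

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    # P(exactly k keys in a given index) under the Poisson approximation.
    return lam**k * exp(-lam) / factorial(k)

lam = 0.5  # λ = α, the expected number of keys per index
p_three_or_more = 1 - sum(poisson_pmf(k, lam) for k in range(3))
# p_three_or_more ≈ 0.014, as computed above
```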