Tirgul 7. Hash Tables


Tirgul 7
Find an efficient implementation of a dynamic collection of elements with unique keys. Supported operations: Insert, Search and Delete. The keys belong to a universe of keys, U = {1, ..., M}.

Direct-address tables
An array of size M: an element with key i is mapped to cell i. Time complexity: O(1) per operation. What might be the problem? If the range of the keys is much larger than the number of stored elements, a lot of space is wasted; the array might be too large to be practical.

Hash tables
In a hash table we allocate an array of size m, which is much smaller than |U| (the size of the universe of keys). We use a hash function h to determine the entry of each key. What property should a good hash function have? The crucial point: the hash function should spread the keys of U evenly among all the entries of the array. An example of a hash function: h(k) = k mod m.
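As a small illustration of the division-method hash above (a sketch; the universe size, table size and sample keys are illustrative choices, not the tirgul's):

```python
# Sketch: the division-method hash h(k) = k mod m from the notes.
# M, m and the sample keys below are made-up illustrative values.
M = 10**6          # size of the key universe U
m = 13             # table size, much smaller than M

def h(k):
    return k % m   # every key lands in a slot in {0, ..., m-1}

keys = [15, 28, 41, 702]
print([h(k) for k in keys])   # -> [2, 2, 2, 0]
```

Note that three of the four sample keys already share slot 2, which is exactly why collision handling comes up later.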

Hash choices
Assume we want to hash strings for country-club management software, where each string contains a full name (e.g. "Parker Anthony"). Suggested hash function: the ASCII code of the first character of the string (e.g. h("Parker Anthony") = ASCII of 'P' = 80). Is this a good hash function? No, since it is likely that many families register together. This means that we will have many collisions even with few entries. We would like the function to look random. The exact meaning of this will be explained next week; however, it is important to keep this point in mind when choosing a hash function.

How to choose hash functions
Can we choose a hash function that will spread the keys uniformly among the cells? Unfortunately, we don't know in advance which keys will arrive, so there can always be a bad input. We can, however, try to find a hash function that usually maps only a small number of keys to the same cell. Two common methods: the division method and the multiplication method.

The division method
Denote by m the size of the table. The division method is h(k) = k mod m. What value of m should we choose? For example, if |U| = 2000 and we want each search to take three operations on average, we choose a number close to 2000/3 ≈ 666. For technical reasons it is preferable to choose m prime (for a prime p, Z_p is a field). In our example we choose m = 701.
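The prime-size advice can be sketched as follows (the notes settle on m = 701; the code below, under my own simplification, just takes the first prime at or above 2000 // 3 to show the idea):

```python
# Sketch: choosing a prime table size near |U| / 3, following the
# division-method example above. Helper names are mine.
def is_prime(n):
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

def next_prime(n):
    # smallest prime >= n
    while not is_prime(n):
        n += 1
    return n

m = next_prime(2000 // 3)
print(m)   # -> 673, a prime close to 2000 / 3
```

Any prime of comparable size works; 701, the notes' choice, is also prime.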

The multiplication method
Multiply k by a constant 0 < A < 1, take the fractional part of kA, and multiply by m. Formally:
h(k) = floor(m * (kA mod 1))
The multiplication method does not depend as much on m, since A helps randomize the hash function. Which values of A are good choices? Which are bad choices?

A bad choice of A, for example: if m = 100 and A = 1/3, then for k = 10, h(k) = 33; for k = 11, h(k) = 66; and for k = 12, h(k) = 0. This is not a good choice of A, since we'll have only three values of h(k)... The optimal choice of A depends on the keys themselves. Knuth claims that A = (sqrt(5) - 1)/2 = 0.6180339887... is likely to be a good choice.

What if keys are not numbers?
The hash functions we showed work only for numbers. When keys are not numbers, we should first convert them to numbers. How can we convert a string into a number? A string can be treated as a number in base 256, where each character is a digit between 0 and 255. The string "key" will be translated to
int('k') * 256^2 + int('e') * 256^1 + int('y') * 256^0
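The multiplication method, and the bad choice A = 1/3, can be checked directly (a sketch; the function name is mine, and floating-point arithmetic is precise enough at this scale):

```python
import math

# Sketch of the multiplication method above: h(k) = floor(m * (kA mod 1)).
def mult_hash(k, m, a):
    return int(m * ((k * a) % 1.0))

# Bad constant: with A = 1/3 only three of the 100 slots are ever used.
bad = sorted({mult_hash(k, 100, 1/3) for k in range(1, 100)})
print(bad)                            # -> [0, 33, 66]

# Knuth's suggested constant spreads the same keys over many more slots.
A = (math.sqrt(5) - 1) / 2
good = {mult_hash(k, 100, A) for k in range(1, 100)}
print(len(good) > 3)                  # -> True
```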

Translating long strings to numbers
The disadvantage of the conversion is that a long string creates a large number: strings longer than 4 characters already exceed the capacity of a 32-bit integer. How can this problem be solved when using the division method? We can write the integer value of "word" as
((w * 256 + o) * 256 + r) * 256 + d
When using the division method, the following facts can be used:
(a + b) mod n = ((a mod n) + b) mod n
(a * b) mod n = ((a mod n) * b) mod n
The expression we reach is:
(((((w * 256 + o) mod m) * 256 + r) mod m) * 256 + d) mod m
Using the properties of mod, we get the simple algorithm:
int hash(string s, int m)
    int h = s[0]
    for (i = 1; i < s.length; i++)
        h = ((h * 256) + s[i]) mod m
    return h
Notice that h is always smaller than m. This also improves the performance of the algorithm.

The birthday paradox
There are 28 people in the room. Assuming that birthdays are distributed uniformly over the year, how many pairs of people who share a birthday will you expect? Surprisingly, the expected number of such pairs is approximately 1. Analysis: there are n people, and the probability that two given people share a birthday is 1/365. The number of pairs is (n choose 2) = n(n-1)/2, so the expected number of pairs with the same birthday is n(n-1)/(2 * 365). For n = 28 this is (28 * 27)/730 ≈ 1.
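Here is a runnable version of the pseudocode above, in Python rather than the notes' C-style pseudocode; the behavior is the same, reducing mod m at every step so intermediate values stay small:

```python
# Base-256 string hashing with the mod applied at every step,
# as in the algorithm above.
def string_hash(s, m):
    h = ord(s[0]) % m
    for ch in s[1:]:
        h = (h * 256 + ch if isinstance(ch, int) else h * 256 + ord(ch)) % m
    return h

m = 701
# Same answer as converting the whole string to one big base-256 number:
word = 0
for ch in "word":
    word = word * 256 + ord(ch)
print(string_hash("word", m) == word % m)   # -> True
```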

Collisions
Can several keys have the same entry? Yes, since U is much larger than m. A collision occurs when several elements are mapped into the same cell; the analysis is similar to the birthday-paradox analysis. When are collisions more likely to happen? When the hash table is almost full. We define the load factor as α = n/m, where n is the number of keys in the hash table and m is the size of the table. How can we solve collisions?

Chaining
There are two approaches to handling collisions: chaining and open addressing. Chaining: each entry in the table is a linked list, which holds all the keys that are mapped to this entry. A search operation on a hash table which applies chaining takes O(1 + α) time. This complexity is calculated under the assumption of uniform hashing. Notice that in the chaining method, the load factor may be greater than one.
Analysis:
Insert: O(1).
Unsuccessful search: calculating the hash value takes O(1); scanning the linked list takes expected O(α). Total: O(1 + α).
Successful search: let x be the i-th element inserted into the hash table. When searching for x, the expected number of scanned elements is (n - i)/m (the elements inserted after x into the same list). Averaged over all elements, the expected number of scanned elements is O(α). Total: O(1 + α).
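A minimal chaining table along the lines described above (a sketch; class and method names are my own choices):

```python
# Minimal chaining hash table: each slot holds a list of (key, value) pairs.
class ChainedHashTable:
    def __init__(self, m):
        self.m = m
        self.table = [[] for _ in range(m)]      # one chain per entry

    def _h(self, key):
        return key % self.m                      # division-method hash

    def insert(self, key, value):
        self.table[self._h(key)].insert(0, (key, value))   # at the head: O(1)

    def search(self, key):
        for k, v in self.table[self._h(key)]:    # expected O(1 + alpha) scan
            if k == key:
                return v
        return None

t = ChainedHashTable(11)
t.insert(6, "a")
t.insert(28, "b")        # 28 mod 11 == 6, so both keys chain in slot 6
print(t.search(6), t.search(28), t.search(7))   # -> a b None
```

Note the load factor can exceed 1 here: chains simply grow longer.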

Open addressing
In this method, the table itself holds all the keys. We change the hash function to receive two parameters: the first is the key, the second is the probe number. We first try to place the key at h(k,0); if that fails we try h(k,1), and so on. It is required that (h(k,0), ..., h(k,m-1)) be a permutation of (0, ..., m-1), so after at most m probes we will definitely find a place for k (unless the table is full). Notice that here the load factor cannot exceed one.

There is a problem with deleting keys. What is it? How can it be solved? When searching for key i and reaching an empty slot, we don't know whether key i does not exist in the table, or key i does exist but at the time it was inserted this slot was occupied, and we should continue our search.

What functions can we use to implement open addressing? We will discuss two ways: linear probing and double hashing.
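The deletion question above is commonly answered with a special DELETED marker, so searches keep probing past deleted slots; that fix is not spelled out in the notes, but a sketch of it (using h(k,i) = (k + i) mod m as the probe sequence) looks like this:

```python
# Open addressing with a DELETED marker (sketch; the marker idea is the
# standard fix for the deletion problem described above).
EMPTY, DELETED = object(), object()

class OpenAddressTable:
    def __init__(self, m):
        self.m = m
        self.slots = [EMPTY] * m

    def _probe(self, k, i):
        return (k + i) % self.m            # h(k, i) = (h(k) + i) mod m

    def insert(self, k):
        for i in range(self.m):
            j = self._probe(k, i)
            if self.slots[j] is EMPTY or self.slots[j] is DELETED:
                self.slots[j] = k
                return j
        raise RuntimeError("table is full")

    def search(self, k):
        for i in range(self.m):
            j = self._probe(k, i)
            if self.slots[j] is EMPTY:     # a truly empty slot ends the search
                return None
            if self.slots[j] == k:
                return j
        return None

    def delete(self, k):
        j = self.search(k)
        if j is not None:
            self.slots[j] = DELETED        # not EMPTY: that would break searches

t = OpenAddressTable(11)
t.insert(4)
t.insert(15)       # 15 mod 11 == 4, so 15 is pushed to slot 5
t.delete(4)
print(t.search(15))   # -> 5: the search probes past the deleted slot
```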

Open addressing
Linear probing: h(k,i) = (h(k) + i) mod m. The problem: primary clustering. If several consecutive slots are occupied, the next free slot has a high probability of becoming occupied, and search time increases as large clusters are created. Primary clustering stems from the fact that there are only m different probe sequences. How can we change the hash function to avoid clusters?

Double hashing: h(k,i) = (h1(k) + i * h2(k)) mod m. Better than linear probing. It is best if for all k, h2(k) has no common divisor with m, so that the probe sequence visits every slot; this way we get Θ(m^2) different probe sequences.

Performance
Insertion and unsuccessful search: 1/(1 - α) probes on average. Intuition: we always check at least one cell. With probability α this cell is occupied; with probability (n/m) * ((n-1)/m) ≈ α^2 the next cell is also occupied, and so on. The expected number of probes is 1 + α + α^2 + α^3 + ... = 1/(1 - α).
A successful search: (1/α) ln(1/(1 - α)) probes on average. Idea: when searching for x, the number of probes is the same as when the element was inserted, and it depends on the number of elements inserted before x. For the derivation of the formula, look in CLR. For example: if the table is 50% full, then a search will take about 1.4 probes on average. If the table is 90% full, the search will take about 2.6 probes on average.
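The performance numbers quoted above can be reproduced directly (a quick numeric check of the two formulas; function names are mine):

```python
import math

# Average-probe formulas for open addressing with load factor alpha.
def unsuccessful(alpha):
    return 1 / (1 - alpha)                       # 1 + a + a^2 + ... = 1/(1-a)

def successful(alpha):
    return (1 / alpha) * math.log(1 / (1 - alpha))

print(round(successful(0.5), 1))   # -> 1.4 probes at 50% full
print(round(successful(0.9), 1))   # -> 2.6 probes at 90% full
print(unsuccessful(0.5))           # -> 2.0 probes for an unsuccessful search
```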

Example for open addressing
Let's look at an example. First assume we use linear probing and insert the numbers 4, 28, 6, 38, 26, with m = 11 and h(k) = k mod m:

[ ] [ ] [ ] [ ] [4] [38] [28] [6] [26] [ ] [ ]
 0   1   2   3   4   5    6    7   8   9  10

Note the clustering effect on 26!

Now let's try double hashing, again with m = 11. Use h1(k) = k mod m, h2(k) = 1 + (k mod (m - 1)), and h(k,i) = (h1(k) + i * h2(k)) mod m. Insert the same numbers 4, 28, 6, 38, 26:

[26] [ ] [6] [ ] [4] [38] [28] [ ] [ ] [ ] [ ]
  0   1   2   3   4   5    6   7   8   9  10

The clustering effect disappeared.

When should hash tables be used
Hash tables are very useful for implementing dictionaries if we don't have an order on the elements, or we have an order but need only the standard operations. On the other hand, hash tables are less useful if we have an order and need more than just the standard operations. For example, last(), or an iterator over all elements, which is problematic if the load factor is very low.
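The double-hashing insertion sequence can be traced with a short sketch (helper name is mine; m = 11, h1 and h2 as above):

```python
# Tracing double hashing: h1(k) = k mod m, h2(k) = 1 + (k mod (m - 1)),
# h(k, i) = (h1(k) + i * h2(k)) mod m.
def double_hash_insert(keys, m):
    table = [None] * m
    for k in keys:
        h1, h2 = k % m, 1 + (k % (m - 1))
        for i in range(m):
            j = (h1 + i * h2) % m
            if table[j] is None:        # probe until a free slot is found
                table[j] = k
                break
    return table

print(double_hash_insert([4, 28, 6, 38, 26], 11))
# -> [26, None, 6, None, 4, 38, 28, None, None, None, None]
```

For instance, 26 first probes slot 4 (occupied by 4), then jumps by h2(26) = 7 to slot 0.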

When should hash tables be used
We should have a good estimate of the number of elements we need to store. For example, HUJI (the Hebrew University) has about 30,000 students each year, but it is still a dynamic database. Re-hashing: if we don't know the number of elements a priori, we might need to perform re-hashing, increasing the size of the table and re-assigning all elements.
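Re-hashing as described above can be sketched as follows (the growth factor and load-factor threshold are my own illustrative choices, not the tirgul's):

```python
# Sketch of re-hashing: when the load factor would pass a threshold,
# double the table size and re-assign every element.
class RehashingTable:
    def __init__(self, m=8, max_load=0.75):
        self.m, self.n, self.max_load = m, 0, max_load
        self.table = [[] for _ in range(m)]

    def insert(self, key):
        if (self.n + 1) / self.m > self.max_load:
            self._rehash(2 * self.m)
        self.table[key % self.m].append(key)
        self.n += 1

    def _rehash(self, new_m):
        old = [k for chain in self.table for k in chain]
        self.m = new_m
        self.table = [[] for _ in range(new_m)]
        for k in old:                      # re-assign every element
            self.table[k % new_m].append(k)

t = RehashingTable()
for k in range(100):
    t.insert(k)
print(t.m)   # -> 256: the table doubled from 8 as elements arrived
```

Each element is re-inserted on every rehash, but since the size doubles each time, the amortized cost per insertion stays O(1).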