CS/COE 1501

Similar documents
Hash Tables. Hash functions Open addressing. March 07, 2018 Cinda Heeren / Geoffrey Tien 1

Hashing 1. Searching Lists

Dictionary. Dictionary. stores key-value pairs. Find(k) Insert(k, v) Delete(k) List O(n) O(1) O(n) Sorted Array O(log n) O(n) O(n)

Hash Tables. Hash functions Open addressing. November 24, 2017 Hassan Khosravi / Geoffrey Tien 1

AAL 217: DATA STRUCTURES

TABLES AND HASHING. Chapter 13

Data Structures And Algorithms

Understand how to deal with collisions

Comp 335 File Structures. Hashing

Hashing. 1. Introduction. 2. Direct-address tables. CmSc 250 Introduction to Algorithms

Hash Table and Hashing

Hash Tables Hash Tables Goodrich, Tamassia

Hash Tables. Hashing Probing Separate Chaining Hash Function

Hash[ string key ] ==> integer value

Hash Table. A hash function h maps keys of a given type into integers in a fixed interval [0,m-1]

CPSC 259 admin notes

Data Structures and Algorithm Analysis (CSC317) Hash tables (part2)

Hash Tables. Gunnar Gotshalks. Maps 1

CS1020 Data Structures and Algorithms I Lecture Note #15. Hashing. For efficient look-up in a table

Hashing. CptS 223 Advanced Data Structures. Larry Holder School of Electrical Engineering and Computer Science Washington State University

CS 241 Analysis of Algorithms

Hashing Algorithms. Hash functions Separate Chaining Linear Probing Double Hashing

Hashing Techniques. Material based on slides by George Bebis

DATA STRUCTURES AND ALGORITHMS

Cpt S 223. School of EECS, WSU

Unit #5: Hash Functions and the Pigeonhole Principle

HO #13 Fall 2015 Gary Chan. Hashing (N:12)

CMSC 341 Hashing (Continued) Based on slides from previous iterations of this course

Hash Tables Outline. Definition Hash functions Open hashing Closed hashing. Efficiency. collision resolution techniques. EECS 268 Programming II 1

Introducing Hashing. Chapter 21. Copyright 2012 by Pearson Education, Inc. All rights reserved

DATA STRUCTURES AND ALGORITHMS

Dynamic Dictionaries. Operations: create insert find remove max/ min write out in sorted order. Only defined for object classes that are Comparable

CMSC 341 Hashing. Based on slides from previous iterations of this course

CMSC 341 Lecture 16/17 Hashing, Parts 1 & 2

Data Structure Lecture#22: Searching 3 (Chapter 9) U Kang Seoul National University

COSC160: Data Structures Hashing Structures. Jeremy Bolton, PhD Assistant Teaching Professor

Hashing 1. Searching Lists

HASH TABLES. Goal is to store elements k,v at index i = h k

Topic HashTable and Table ADT

CS 350 : Data Structures Hash Tables

Hashing. Dr. Ronaldo Menezes Hugo Serrano. Ronaldo Menezes, Florida Tech

HASH TABLES.

Tirgul 7. Hash Tables. In a hash table, we allocate an array of size m, which is much smaller than U (the set of keys).

Introduction. hashing performs basic operations, such as insertion, better than other ADTs we ve seen so far

COMP171. Hashing.

Hashing. Yufei Tao. Department of Computer Science and Engineering Chinese University of Hong Kong

Hashing. Hashing Procedures

Open Addressing: Linear Probing (cont.)

Hashing. mapping data to tables hashing integers and strings. aclasshash_table inserting and locating strings. inserting and locating strings

Priority Queue Sorting

Hashing. inserting and locating strings. MCS 360 Lecture 28 Introduction to Data Structures Jan Verschelde, 27 October 2010.

HASH TABLES. Hash Tables Page 1

SFU CMPT Lecture: Week 8

Adapted By Manik Hosen

CMSC 132: Object-Oriented Programming II. Hash Tables

Worst-case running time for RANDOMIZED-SELECT

IS 2610: Data Structures

Algorithms with numbers (2) CISC4080, Computer Algorithms CIS, Fordham Univ.! Instructor: X. Zhang Spring 2017

Algorithms with numbers (2) CISC4080, Computer Algorithms CIS, Fordham Univ. Acknowledgement. Support for Dictionary

Lecture 16. Reading: Weiss Ch. 5 CSE 100, UCSD: LEC 16. Page 1 of 40

of characters from an alphabet, then, the hash function could be:

Module 5: Hashing. CS Data Structures and Data Management. Reza Dorrigiv, Daniel Roche. School of Computer Science, University of Waterloo

CS 310 Advanced Data Structures and Algorithms

ECE 242 Data Structures and Algorithms. Hash Tables II. Lecture 25. Prof.

Hashing. Manolis Koubarakis. Data Structures and Programming Techniques

Fast Lookup: Hash tables

More on Hashing: Collisions. See Chapter 20 of the text.

HashTable CISC5835, Computer Algorithms CIS, Fordham Univ. Instructor: X. Zhang Fall 2018

General Idea. Key could be an integer, a string, etc e.g. a name or Id that is a part of a large employee structure

9/24/ Hash functions

DS ,21. L11-12: Hashmap

Fundamental Algorithms

Symbol Table. Symbol table is used widely in many applications. dictionary is a kind of symbol table data dictionary is database management

Figure 1: Chaining. The implementation consists of these private members:

Dictionaries and Hash Tables

Midterm 3 practice problems

Hash Tables. Hash Tables

DATA STRUCTURES/UNIT 3

Acknowledgement HashTable CISC4080, Computer Algorithms CIS, Fordham Univ.

Data Structures Lecture 12

Outline. hash tables hash functions open addressing chained hashing

The dictionary problem

Practical Session 8- Hash Tables

Outline. Computer Science 331. Desirable Properties of Hash Functions. What is a Hash Function? Hash Functions. Mike Jacobson.

Hashing. So what we do instead is to store in each slot of the array a linked list of (key, record) - pairs, as in Fig. 1. Figure 1: Chaining

Fall 2017 Mentoring 9: October 23, Min-Heapify This. Level order, bubbling up. Level order, bubbling down. Reverse level order, bubbling up

Algorithms in Systems Engineering ISE 172. Lecture 12. Dr. Ted Ralphs

Successful vs. Unsuccessful

CSE100. Advanced Data Structures. Lecture 21. (Based on Paul Kube course materials)

! A Hash Table is used to implement a set, ! The table uses a function that maps an. ! The function is called a hash function.

Introduction hashing: a technique used for storing and retrieving information as quickly as possible.

Algorithms and Data Structures

Dictionaries-Hashing. Textbook: Dictionaries ( 8.1) Hash Tables ( 8.2)

III Data Structures. Dynamic sets

Data and File Structures Chapter 11. Hashing

5. Hashing. 5.1 General Idea. 5.2 Hash Function. 5.3 Separate Chaining. 5.4 Open Addressing. 5.5 Rehashing. 5.6 Extendible Hashing. 5.

Data Structures - CSCI 102. CS102 Hash Tables. Prof. Tejada. Copyright Sheila Tejada

This lecture. Iterators ( 5.4) Maps. Maps. The Map ADT ( 8.1) Comparison to java.util.map

Lecture 18. Collision Resolution

CITS2200 Data Structures and Algorithms. Topic 15. Hash Tables

Transcription:

CS/COE 1501 www.cs.pitt.edu/~lipschultz/cs1501/ Hashing

Wouldn t it be wonderful if... Search through a collection could be accomplished in Θ(1) with relatively small memory needs? Lets try this: Assume we have an array of length m (call it HT) Assume we have a function h(x) that maps from our key space to {0, 1, 2,, m-1} E.g., Z {0, 1, 2,, m-1} for integer keys Let s also assume h(x) is efficient to compute This is the basic premise of hash tables

How do we search/insert with a hash map? Insert: i = h(x) HT[i] = x Search: i = h(x) if (HT[i] == x) return true; else return false; This is a very general, simple approach to a hash table implementation Where will it run into problems?

What do we do if h(x) == h(y) where x!= y? Called a collision

Can we ever guarantee collisions will not occur? Yes, if the our keyspace is smaller than our hashmap If keyspace < m, perfect hashing can be used i.e., a hash function that maps every key to a distinct integer < m Note it can also be used if n < m and the keys to be inserted are known in advance E.g., hashing the keywords of a programming language during compilation If keyspace > m, collisions cannot be avoided

Consider an example Company has 500 employees Stores records using a hashmap with 1000 entries Employee SSNs are hashed to store records in the hashmap Keys are SSNs, so keyspace == 10 9 Specifically what keys are needed can t be known in advance Due to employee turnover What if one employee (with SSN x) is fired and replacement has an SSN of y? Can we design a hash function that guarantees h(y) does not collide with the 499 other employees' hashed SSNs?

Living with collisions Can we reduce the number of collisions? Using a good hash function is a start What makes a good hash function? Utilize the entire key Exploit differences between keys Uniform distribution of hash values should be produced

Examples Hash list of classmates by phone number Bad? Use first 3 digits Better? Consider it a single int Take that value modulo m Hash words Bad? Add up the ASCII values Better? Use Horner s method to do modular hashing again See Section 3.4 of the text

Horner's method Base 10 12345 = 1 * 10 4 + 2 * 10 3 + 3 * 10 2 + 4 * 10 1 + 5 * 10 0 Base 2 10100 = 1 * 2 4 + 0 * 2 3 + 1 * 2 2 + 0 * 2 1 + 0 * 2 0 Base 16 BEEF3 = 11 * 16 4 + 14 * 16 3 + 14 * 16 2 + 15 * 16 1 + 3 * 16 0 ASCII Strings BEEF3 = 'B' * 256 4 + 'E' * 256 3 + 'E' * 256 2 + 'F' * 256 1 + '3' * 256 0 = 66 * 256 4 + 69 * 256 3 + 69 * 256 2 + 70 * 256 1 + 51 * 256 0

Modular hashing Overall a good simple, general approach to implement a hash map Basic formula: h(x) = c(x) mod m Where c(x) converts x into a (possibly) large integer Generally want m to be a prime number Consider m = 100 Only the least significant digits matter h(1) = h(401) = h(4372901)

Back to collisions We ve done what we can to cut down the number of collisions, but we still need to deal with them Collision resolution: two main approaches Open Addressing Closed Addressing

Open Addressing I.e., if a pigeon s hole is taken, it has to find another If h(x) == h(y) == i And x is stored at index i in an example hash table If we want to insert y, we must try alternative indices This means y will not be stored at HT[h(y)] We must select alternatives in a consistent and predictable way so that they can be located later

Linear Probing Insert: Search: If we cannot store a key at index i due to collision Attempt to insert the key at index i+1 Then i+2 And so on mod m Until an open space is found If another key is stored at index i Check i+1, i+2, i+3 until Key is found Empty location is found We circle through the buffer back to i

Linear probing example h(x) = x mod 11 Insert 14, 17, 25, 37, 34, 16, 26 0 1 2 3 4 5 6 7 8 9 10 34 14 25 37 17 16 26

Alright! We solved collisions! Well, not quite Consider the load factor α = n/m As α increases, what happens to hash table performance? Consider an empty table using a good hash function What is the probability that a key x will be inserted into any index in the hash table? 1/m Consider a table that has a cluster of c consecutive indices occupied What is the probability that a key x will be inserted into the index directly after the cluster? (c + 1)/m

Avoiding clustering We must make sure that even after a collision, all of the indices of the hash table are possible for a key Probability of filled locations need to be distributed throughout the table

Double hashing After a collision, instead of attempting to place the key x in i+1 mod m, look at i+h2(x) mod m h2() is a second, different hash function Should still follow the same general rules as h() to be considered good, but needs to be different from h() h(x) == h(y) AND h2(x) == h2(y) should be very unlikely Hence, it should be unlikely for two keys to use the same increment

Double hashing h(x) = x mod 11 h2(x) = (x mod 7) +1 Insert 14, 17, 25, 37, 34, 16, 26 0 1 2 3 4 5 6 7 8 9 10 34 14 37 16 17 25 26 Insert 2401

A few extra rules for h2() Second hash function cannot map a value to 0 You should try all indices once before trying one twice Were either of these issues for linear probing?

As α 1... Meaning n approaches m Both linear probing and double hashing degrade to Θ(n) How? Multiple collisions will occur in both schemes Consider inserts and misses Both continue until an empty index is found With few indices available, close to m probes will need to be performed Θ(m) n is approaching m, so this turns out to be Θ(n)

Open addressing issues Must keep a portion of the table empty to maintain respectable performance For linear hashing ½ is a good rule of thumb Can go higher with double hashing

Closed addressing Most commonly done with separate chaining I.e., if a pigeon s hole is taken, it lives with a roommate Create a linked-list of keys at each index in the table As with DLBs, performance depends on chain length Which is determined by α and the quality of the hash function

In general... Closed-addressing hash tables are fast and efficient for a large number of applications