Hash Tables. Hash Tables

Similar documents
1 CSE 100: HASH TABLES

Fast Lookup: Hash tables

stacks operation array/vector linked list push amortized O(1) Θ(1) pop Θ(1) Θ(1) top Θ(1) Θ(1) isempty Θ(1) Θ(1)

1 Probability Review. CS 124 Section #8 Hashing, Skip Lists 3/20/17. Expectation (weighted average): the expectation of a random quantity X is:

Randomized Algorithms, Hash Functions

CS 241 Analysis of Algorithms

Integer keys. Databases and keys. Chaining. Analysis CS206 CS206

UNIT III BALANCED SEARCH TREES AND INDEXING

Lecture 16. Reading: Weiss Ch. 5 CSE 100, UCSD: LEC 16. Page 1 of 40

Dynamic Dictionaries. Operations: create insert find remove max/ min write out in sorted order. Only defined for object classes that are Comparable

Understand how to deal with collisions

Hash tables. hashing -- idea collision resolution. hash function Java hashcode() for HashMap and HashSet big-o time bounds applications

CMSC 341 Lecture 16/17 Hashing, Parts 1 & 2

CMSC 341 Hashing (Continued) Based on slides from previous iterations of this course

The dictionary problem

Randomized Algorithms: Element Distinctness

Hash Tables. Hash tables combine the random access ability of an array with the dynamism of a linked list.

Hash Tables Outline. Definition Hash functions Open hashing Closed hashing. Efficiency. collision resolution techniques. EECS 268 Programming II 1

Hash Open Indexing. Data Structures and Algorithms CSE 373 SP 18 - KASEY CHAMPION 1

Hashing Techniques. Material based on slides by George Bebis

AAL 217: DATA STRUCTURES

Final Examination CSE 100 UCSD (Practice)

Worst-case running time for RANDOMIZED-SELECT

CSE100. Advanced Data Structures. Lecture 21. (Based on Paul Kube course materials)

General Idea. Key could be an integer, a string, etc e.g. a name or Id that is a part of a large employee structure

Advanced Algorithmics (6EAP) MTAT Hashing. Jaak Vilo 2016 Fall

Programming Abstractions

COMP 250. Lecture 27. hashing. Nov. 10, 2017

Chapter 27 Hashing. Objectives

Lecture 12 Notes Hash Tables

1 / 22. Inf 2B: Hash Tables. Lecture 4 of ADS thread. Kyriakos Kalorkoti. School of Informatics University of Edinburgh

Randomized Algorithms Part Four

Randomized Algorithms Part Four

===Lab Info=== * 90 points. * Due 11:59pm on Sunday, 9/20/2015 for Monday and Wednesday lab. ==Assignment==

Lecture 12 Hash Tables

CS/COE 1501

Sample Solutions CSC 263H. June 9, 2016

Lecture 16: HashTables 10:00 AM, Mar 2, 2018

Cpt S 223. School of EECS, WSU

Algorithms with numbers (2) CISC4080, Computer Algorithms CIS, Fordham Univ.! Instructor: X. Zhang Spring 2017

Algorithms with numbers (2) CISC4080, Computer Algorithms CIS, Fordham Univ. Acknowledgement. Support for Dictionary

Dictionary. Dictionary. stores key-value pairs. Find(k) Insert(k, v) Delete(k) List O(n) O(1) O(n) Sorted Array O(log n) O(n) O(n)

CIS 190: C/C++ Programming. Lecture 12 Student Choice

CSE 332: Data Structures & Parallelism Lecture 10:Hashing. Ruth Anderson Autumn 2018

Hash[ string key ] ==> integer value

COSC160: Data Structures Hashing Structures. Jeremy Bolton, PhD Assistant Teaching Professor

HASH TABLES.

CSI33 Data Structures

TABLES AND HASHING. Chapter 13

III Data Structures. Dynamic sets

9/24/ Hash functions

CMSC 132: Object-Oriented Programming II. Hash Tables

Introducing Hashing. Chapter 21. Copyright 2012 by Pearson Education, Inc. All rights reserved

of characters from an alphabet, then, the hash function could be:

Algorithms in Systems Engineering ISE 172. Lecture 12. Dr. Ted Ralphs

CSC Design and Analysis of Algorithms. Lecture 9. Space-For-Time Tradeoffs. Space-for-time tradeoffs

1. [1 pt] What is the solution to the recurrence T(n) = 2T(n-1) + 1, T(1) = 1

Introduction to. Algorithms. Lecture 5

Hashing. Hashing Procedures

CSC263 Week 5. Larry Zhang.

Hash table basics. ate à. à à mod à 83

Tirgul 7. Hash Tables. In a hash table, we allocate an array of size m, which is much smaller than U (the set of keys).

Hashing. Manolis Koubarakis. Data Structures and Programming Techniques

Announcements. Submit Prelim 2 conflicts by Thursday night A6 is due Nov 7 (tomorrow!)

Introduction. hashing performs basic operations, such as insertion, better than other ADTs we ve seen so far

Hashing and sketching

Hash Table and Hashing

Data Structures. Topic #6

CS 261 Data Structures

Chapter 7. Space and Time Tradeoffs. Copyright 2007 Pearson Addison-Wesley. All rights reserved.

Unit #5: Hash Functions and the Pigeonhole Principle

Finding Similar Sets. Applications Shingling Minhashing Locality-Sensitive Hashing

CS 350 : Data Structures Hash Tables

CSE373: Data Structures & Algorithms Lecture 17: Hash Collisions. Kevin Quinn Fall 2015

Chapter 27 Hashing. Liang, Introduction to Java Programming, Eleventh Edition, (c) 2017 Pearson Education, Inc. All rights reserved.

HashTable CISC5835, Computer Algorithms CIS, Fordham Univ. Instructor: X. Zhang Fall 2018

5. Hashing. 5.1 General Idea. 5.2 Hash Function. 5.3 Separate Chaining. 5.4 Open Addressing. 5.5 Rehashing. 5.6 Extendible Hashing. 5.

Figure 1: Chaining. The implementation consists of these private members:

CHAPTER 9 HASH TABLES, MAPS, AND SKIP LISTS

Data and File Structures Chapter 11. Hashing

1.00 Lecture 32. Hashing. Reading for next time: Big Java Motivation

CS369G: Algorithmic Techniques for Big Data Spring

HASH TABLES. Goal is to store elements k,v at index i = h k

BINARY HEAP cs2420 Introduction to Algorithms and Data Structures Spring 2015

Midterm Exam 2B Answer key

Chapter 5 Hashing. Introduction. Hashing. Hashing Functions. hashing performs basic operations, such as insertion,

CS 161 Problem Set 4

More on Hashing: Collisions. See Chapter 20 of the text.

Computer Science Foundation Exam

INF2220: algorithms and data structures Series 3

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Dictionaries. Application

CS 251, LE 2 Fall MIDTERM 2 Tuesday, November 1, 2016 Version 00 - KEY

Acknowledgement HashTable CISC4080, Computer Algorithms CIS, Fordham Univ.

Outline. hash tables hash functions open addressing chained hashing

Quiz 1 Practice Problems

Why do we need hashing?

lecture23: Hash Tables

Hash Tables Hash Tables Goodrich, Tamassia

Hashing. October 19, CMPE 250 Hashing October 19, / 25

Transcription:

Hash Tables Hash Tables Insanely useful One of the most useful and used data structures that you ll encounter They do not support many operations But they are amazing at the operations they do support They act like an array with a couple of key differences 1

Hash Tables Operations: Insert Delete Look-up O(1) What are they not good for? Guaranteed constant running time for those operations if: 1. If the hash table is properly implemented, and 2. The data is non-pathological. Example Applications Remove duplicates Given a stream of objects (linear scan or data arriving over time) Don t add object if it already exists (distinct visitors to a web site, creating an efficient web crawler) 2-SUM Problem Given an array of integers A and a target sum T Goal: determine if any two numbers sum to T What is a naïve approach to this problem? What is a slightly better approach? What is an optimal approach based on hash-tables? 2

Other applications Used for symbol tables in compilers Block traffic from black listed IP addresses In search algorithms you can ensure that you don t test the same configuration twice (for example, in chess) Great Data Structure Easy to butcher Let U be the universe of all possible objects (all possible IP address, all possible student names, all chess configurations, etc.) We want to maintain an evolving subset S of U that is a reasonable size Naïve solution #1: is to create an array that has equal to U Naïve solution #2: use a linked list instead 3

Great Data Structure Easy to butcher Let U be the universe of all possible objects (all possible IP address, all possible student names, all chess configurations, etc.) We want to maintain an evolving subset S of U that is a reasonable size Hash table: Let n be approximately equal to S, where n is the # of buckets Choose a hash function h(x) à {0, 1,.., n-1} where x is an object in U Use an array A of length n, and store objects at A[h(x)] Hash Table A with n buckets 0 U x 1 2 3 4 5 6 7 4

Hash Table A with n buckets U x h(x) x 0 1 2 3 4 5 6 7 Hash Table A with n buckets y U x z h(y) h(x) h(z) z x 0 1 2 3 4 5 S y 6 7 5

Collisions What if two keys (objects) result in the same index? Is this really a problem? Birthday problem Consider n people with birthdays distributed uniformly at random. How large does n need to be before there is at least a 50% chance that two people have the same birthday? Birthday Problem a b c d a a a a a a a a a a a a a a a a a a n 50% Change of Collision 1 2 3 365 6

Collisions What if two keys (objects) result in the same index? Is this really a problem? Birthday problem Consider n people with birthdays distributed uniformly at random. How large does n need to be before there is at least a 50% chance that two people have the same birthday? a. 367 b. 57 c. 184 d. 23 Collisions What if two keys (objects) result in the same index? Is this really a problem? Birthday problem Consider n people with birthdays distributed uniformly at random. How large does n need to be before there is at least a 50% chance that two people have the same birthday? a. 367 à 100% b. 57 à 99% c. 184 à 99.9999% d. 23 à 50% 7

Collisions Even with a uniformly random hash function you still get quite a few collisions with a small data set. Collisions occur often, so we need to handle them carefully. Two common methods for resolving collisions 1. Separate Chaining 2. Open Addressing Method 1: Separate Chaining 8

Method 1: Separate Chaining B A Method 1: Separate Chaining B C A D 9

Method 1: Separate Chaining B E C A D Method 2: Open Addressing Only one object per bucket Hash functions now specify a sequence Y X 1. Linear probing: call the hash function and find the next available spot in the array 10

Method 2: Open Addressing Only one object per bucket Hash functions now specify a sequence Y X 1. Linear probing: call the hash function and find the next available spot in the array Method 2: Open Addressing Only one object per bucket Hash functions now specify a sequence Y X 1. Linear probing: call the hash function and find the next available spot in the array 2. Double hashing: requires two hash functions Call the first hash function to get X If there is a collision call the second hash function to get Y, add Y to X If there is another collision add Y again 11

Method 2: Open Addressing Only one object per bucket Hash functions now specify a sequence Y X 1. Linear probing: call the hash function and find the next available spot in the array 2. Double hashing: requires two hash functions Call the first hash function to get X If there is a collision call the second hash function to get Y, add Y to X If there is another collision add Y again Collisions Which resolution method is better? Separate chaining or open addressing Depends on the data and what you want to prioritize Which one is better if you worry about space? Which one is better if you worry about data locality? Which one is easier if you want to easily support deletion? 12

Hash Functions What makes a good hash function? Properties of a good hash function 1. Should lead to the smallest number of collisions as possible 2. It shouldn t be too much work to compute the hash (required for every lookup, insertion, or deletion) For separate chaining, what is the worst case for a hash function? Example Keys are 10-digit phone numbers, U = 10^10, we select n = 10^3 Terrible hash function: h(x) = 1 st 3 digits of x OK (not great) hash function: h(x) = x % 1000 (final 3 digits of x) 417-836-6646 417-836-5438 417-836-4944 417-836-5930 417-836-5026 417-836-8745 417-836-4834 417-836-5789 417-836-5224 417-836-4157 13

Passable Hash Function Hash Function Compression Objects Integers Buckets Take strings as objects for example Hash code could be to create a unique number from the characters Compression function would be to take the integer mod n How do you choose n (the number of buckets)? Choose n to be a prime number (on the order of the # of objects to store) Don t choose a value too near to a power of 2 Don t choose a value too near to a power of 10 Implementations C++ std::map and std::set are implemented as Red-Black trees They did this to preserve an ordering among elements and to eliminate the poor worst case search time std::unordered_map and std::unordered_set are proper hash tables Python Python dictionaries and sets are implemented as a hash tables Python does not include a BST in the standard library Rust Rust includes hash maps and has sets as well as B-Tree maps and sets 14

Rust Documentation you should probably just use Vec or HashMap. A hash map implementation which uses linear probing with Robin Hood bucket stealing. By default, HashMap uses a hashing algorithm selected to provide resistance against HashDoS attacks. The algorithm is randomly seeded, and a reasonable best-effort is made to generate this seed from a high quality, secure source of randomness provided by the host without blocking the program. Hash Table Summary 1. Make a big array 2. Create a function that converts elements into integers (hashing) 3. Store elements in the array at the index specified by the hash function 4. Do something interesting if two elements get the same index (collision) 15