Hash Open Indexing. Data Structures and Algorithms CSE 373 SP 18 - KASEY CHAMPION 1

Similar documents
Lecture 10: Introduction to Hash Tables

Implementing Hash and AVL

CMSC 341 Lecture 16/17 Hashing, Parts 1 & 2

CMSC 341 Hashing (Continued) Based on slides from previous iterations of this course

Open Addressing: Linear Probing (cont.)

Hash Tables: Handling Collisions

BINARY HEAP cs2420 Introduction to Algorithms and Data Structures Spring 2015

Hash Tables. Gunnar Gotshalks. Maps 1

Section 05: Midterm Review

Practice Midterm Exam Solutions

CSE wi: Practice Midterm

Data Structures - CSCI 102. CS102 Hash Tables. Prof. Tejada. Copyright Sheila Tejada

Understand how to deal with collisions

CSE 332: Data Structures & Parallelism Lecture 10:Hashing. Ruth Anderson Autumn 2018

CSE373: Data Structures & Algorithms Lecture 17: Hash Collisions. Kevin Quinn Fall 2015

Cpt S 223. School of EECS, WSU

HashTable CISC5835, Computer Algorithms CIS, Fordham Univ. Instructor: X. Zhang Fall 2018

Section 05: Solutions

! A Hash Table is used to implement a set, ! The table uses a function that maps an. ! The function is called a hash function.

Hashing Techniques. Material based on slides by George Bebis

HASH TABLES. CSE 332 Data Abstractions: B Trees and Hash Tables Make a Complete Breakfast. An Ideal Hash Functions.

lecture23: Hash Tables

Section 05: Solutions

CSE 373: Data Structures and Algorithms. Hash Tables. Autumn Shrirang (Shri) Mare

COMP171. Hashing.

AAL 217: DATA STRUCTURES

Algorithms with numbers (2) CISC4080, Computer Algorithms CIS, Fordham Univ.! Instructor: X. Zhang Spring 2017

Algorithms with numbers (2) CISC4080, Computer Algorithms CIS, Fordham Univ. Acknowledgement. Support for Dictionary

CMSC 341 Hashing. Based on slides from previous iterations of this course

Introduction. hashing performs basic operations, such as insertion, better than other ADTs we ve seen so far

Hash[ string key ] ==> integer value

Data Structures And Algorithms

CSE 332 Spring 2013: Midterm Exam (closed book, closed notes, no calculators)

CSE 332 Spring 2014: Midterm Exam (closed book, closed notes, no calculators)

1. [1 pt] What is the solution to the recurrence T(n) = 2T(n-1) + 1, T(1) = 1

General Idea. Key could be an integer, a string, etc e.g. a name or Id that is a part of a large employee structure

Data Structures and Algorithm Analysis (CSC317) Hash tables (part2)

HASH TABLES. Goal is to store elements k,v at index i = h k

Hash Table and Hashing

Hash table basics mod 83 ate. ate. hashcode()

Algorithms and Data Structures

Hashing. October 19, CMPE 250 Hashing October 19, / 25

Table ADT and Sorting. Algorithm topics continuing (or reviewing?) CS 24 curriculum

Acknowledgement HashTable CISC4080, Computer Algorithms CIS, Fordham Univ.

Fast Lookup: Hash tables

CSE 332 Autumn 2013: Midterm Exam (closed book, closed notes, no calculators)

Worst-case running time for RANDOMIZED-SELECT

9/24/ Hash functions

HASH TABLES cs2420 Introduction to Algorithms and Data Structures Spring 2015

DATA STRUCTURES AND ALGORITHMS

Lecture 16. Reading: Weiss Ch. 5 CSE 100, UCSD: LEC 16. Page 1 of 40

csci 210: Data Structures Maps and Hash Tables

CSCI 104 Hash Tables & Functions. Mark Redekopp David Kempe

THINGS WE DID LAST TIME IN SECTION

Hashing. Dr. Ronaldo Menezes Hugo Serrano. Ronaldo Menezes, Florida Tech

Hash table basics mod 83 ate. ate

UNIT III BALANCED SEARCH TREES AND INDEXING

HO #13 Fall 2015 Gary Chan. Hashing (N:12)

Chapter 5 Hashing. Introduction. Hashing. Hashing Functions. hashing performs basic operations, such as insertion,

Introduction hashing: a technique used for storing and retrieving information as quickly as possible.

CITS2200 Data Structures and Algorithms. Topic 15. Hash Tables

COSC160: Data Structures Hashing Structures. Jeremy Bolton, PhD Assistant Teaching Professor

CS 241 Analysis of Algorithms

AP Computer Science 4325

Chapter 27 Hashing. Objectives

CS 350 : Data Structures Hash Tables

Hash Tables Outline. Definition Hash functions Open hashing Closed hashing. Efficiency. collision resolution techniques. EECS 268 Programming II 1

ASSIGNMENTS. Progra m Outcom e. Chapter Q. No. Outcom e (CO) I 1 If f(n) = Θ(g(n)) and g(n)= Θ(h(n)), then proof that h(n) = Θ(f(n))

CS 310 Advanced Data Structures and Algorithms

Introducing Hashing. Chapter 21. Copyright 2012 by Pearson Education, Inc. All rights reserved

5. Hashing. 5.1 General Idea. 5.2 Hash Function. 5.3 Separate Chaining. 5.4 Open Addressing. 5.5 Rehashing. 5.6 Extendible Hashing. 5.

Lecture 10 March 4. Goals: hashing. hash functions. closed hashing. application of hashing

HASH TABLES. Hash Tables Page 1

Module 5: Hashing. CS Data Structures and Data Management. Reza Dorrigiv, Daniel Roche. School of Computer Science, University of Waterloo

CSE 373: Data Structures and Algorithms. Memory and Locality. Autumn Shrirang (Shri) Mare

MA/CSSE 473 Day 23. Binary (max) Heap Quick Review

Selection, Bubble, Insertion, Merge, Heap, Quick Bucket, Radix

SFU CMPT Lecture: Week 8

Hash Tables. Hashing Probing Separate Chaining Hash Function

Module 2: Classical Algorithm Design Techniques

CSE 214 Computer Science II Searching

University of Waterloo Department of Electrical and Computer Engineering ECE250 Algorithms and Data Structures Fall 2014

UNIVERSITY OF WATERLOO DEPARTMENT OF ELECTRICAL AND COMPUTER ENGINEERING E&CE 250 ALGORITHMS AND DATA STRUCTURES

Hashing. 1. Introduction. 2. Direct-address tables. CmSc 250 Introduction to Algorithms

Unit #5: Hash Functions and the Pigeonhole Principle

CS 2150 (fall 2010) Midterm 2

CSE 373: Data Structures and Algorithms. Final Review. Autumn Shrirang (Shri) Mare

Today: Finish up hashing Sorted Dictionary ADT: Binary search, divide-and-conquer Recursive function and recurrence relation

CSCE 210/2201 Data Structures and Algorithms. Prof. Amr Goneid

Hashing. Introduction to Data Structures Kyuseok Shim SoEECS, SNU.

Hashing Algorithms. Hash functions Separate Chaining Linear Probing Double Hashing

Data Structures and Algorithms. Chapter 7. Hashing

CSCE 210/2201 Data Structures and Algorithms. Prof. Amr Goneid. Fall 2018

CSE100. Advanced Data Structures. Lecture 21. (Based on Paul Kube course materials)

Hashing. CptS 223 Advanced Data Structures. Larry Holder School of Electrical Engineering and Computer Science Washington State University

Hashing. Manolis Koubarakis. Data Structures and Programming Techniques

Topic HashTable and Table ADT

Part I Anton Gerdelan

CPSC 331 Term Test #2 March 26, 2007

More on Hashing: Collisions. See Chapter 20 of the text.

Transcription:

Hash Open Indexing Data Structures and Algorithms CSE 373 SP 18 - KASEY CHAMPION 1

Warm Up Consider a StringDictionary using separate chaining with an internal capacity of 10. Assume our buckets are implemented using a LinkedList. Use the following hash function: public int hashcode(string input) { return input.length() % arr.length; } Now, insert the following key-value pairs. What does the dictionary internally look like? ( cat, 1) ( bat, 2) ( mat, 3) ( a, 4) ( abcd, 5) ( abcdabcd, 6) ( five, 7) ( hello world, 8) 0 1 2 3 4 5 6 7 8 9 ( a, 4) ( cat, 1) ( abcd, 5) ( abcdabcd, 6) ( hello world, 8) ( bat, 2) ( five, 7) ( mat, 3) CSE 373 SP 18 - KASEY CHAMPION 2

Administrivia HW 2 due HW 3 out CSE 373 SP 18 - KASEY CHAMPION 3

Midterm Topics ADTs and Data structures - Lists, Stacks, Queues, Maps - Array vs Node implementations of each Asymptotic Analysis - Proving Big O by finding C and N 0 - Modeling code runtime with math functions, including recurrences and summations - Finding closed form of recurrences using unrolling, tree method and master theorem - Looking at code models and giving Big O runtimes - Definitions of Big O, Big Omega, Big Theta BST and AVL Trees - Binary Search Property, Balance Property - Insertions, Retrievals - AVL rotations Hashing - Understanding hash functions - Insertions and retrievals from a table - Collision resolution strategies: chaining, linear probing, quadratic probing, double hashing Heaps - Heap properties - Insertions, retrievals while maintaining structure with bubbling up Homework - ArrayDictionary - DoubleLinkedList - ChainedHashDictionary - ChainedHashSet CSE 373 SP 18 - KASEY CHAMPION 4

Can we do better? Idea 1: Take in better keys - Can t do anything about that right now Idea 2: Optimize the bucket - Use an AVL tree instead of a Linked List - Java starts off as a linked list then converts to AVL tree when collisions get large Idea 3: Modify the array s internal capacity - When load factor gets too high, resize array - Double size of array - Increase array size to next prime number that s roughly double the array size - Prime numbers reduce collisions when using % because of divisors - Resize when λ 1.0 - When you resize, you have to rehash CSE 373 SP 18 - KASEY CHAMPION 5

What about non integer keys? Hash Function An algorithm that maps a given key to an integer representing the index in the array for where to store the associated value Goals Avoid collisions - The more collisions, the further we move away from O(1) - Produce a wide range of indices Uniform distribution of outputs - Optimize for memory usage Low computational costs - Hash function is called every time we want to interact with the data CSE 373 SP 18 - KASEY CHAMPION 6

How to Hash non Integer Keys Implementation 1: Simple aspect of values public int hashcode(string input) { return input.length(); } Implementation 2: More aspects of value public int hashcode(string input) { int output = 0; for(char c : input) { out += (int)c; } return output; } Pro: super fast O(1) Con: lots of collisions! Pro: fast O(n) Con: some collisions Implementation 3: Multiple aspects of value + math! public int hashcode(string input) { int output = 1; Pro: few collisions for (char c : input) { int nextprime = getnextprime(); Con: slow, gigantic integers out *= Math.pow(nextPrime, (int)c); } return Math.pow(nextPrime, input.length()); } CSE 373 SP 18 - KASEY CHAMPION 7

Practice 3 Minutes Consider a StringDictionary using separate chaining with an internal capacity of 10. Assume our buckets are implemented using a LinkedList. Use the following hash function: public int hashcode(string input) { return input.length() % arr.length; } Now, insert the following key-value pairs. What does the dictionary internally look like? ( a, 1) ( ab, 2) ( c, 3) ( abc, 4) ( abcd, 5) ( abcdabcd, 6) ( five, 7) ( hello world, 8) 0 1 2 3 4 5 6 7 8 9 ( a, 1) ( ab, 2) ( abc, 4) ( abcd, 5) ( c, 3) ( abcdabcd, 6) ( hello world, 8) ( five, 7) CSE 373 SP 18 - KASEY CHAMPION 8

Review: Handling Collisions Solution 1: Chaining Each space holds a bucket that can store multiple values. Bucket is often implemented with a LinkedList put(key,value) Operation Array w/ indices as keys best O(1) average O(1 + λ) worst O(n) Average Case: Depends on average number of elements per chain get(key) remove(key) best O(1) average O(1 + λ) worst O(n) best O(1) average O(1 + λ) Load Factor λ If n is the total number of keyvalue pairs Let c be the capacity of array Load Factor λ =! " worst O(n) CSE 373 SP 18 - KASEY CHAMPION 9

Handling Collisions Solution 2: Open Addressing Resolves collisions by choosing a different location to tore a value if natural choice is already full. Type 1: Linear Probing If there is a collision, keep checking the next element until we find an open spot. public int hashfunction(string s) int naturalhash = this.gethash(s); if(natural hash in use) { int i = 1; while (index in use) { try (naturalhash + i); i++; CSE 373 SP 18 - KASEY CHAMPION 10

Linear Probing Insert the following values into the Hash Table using a hashfunction of % table size and linear probing to resolve collisions 1, 5, 11, 7, 12, 17, 6, 25 0 1 2 3 4 5 6 7 8 9 111 12 255 6 177 CSE 373 SP 18 - KASEY CHAMPION 11

Linear Probing 3 Minutes Insert the following values into the Hash Table using a hashfunction of % table size and linear probing to resolve collisions 38, 19, 8, 109, 10 0 1 2 3 4 5 6 7 8 9 8 10 388 19 109 Problem: Linear probing causes clustering Clustering causes more looping when probing Primary Clustering When probing causes long chains of occupied slots within a hash table CSE 373 SP 18 - KASEY CHAMPION 12

Runtime 2 Minutes When is runtime good? Empty table When is runtime bad? Table nearly full When we hit a cluster Maximum Load Factor? λ at most 1.0 When do we resize the array? λ ½ CSE 373 SP 18 - KASEY CHAMPION 13

Can we do better? Clusters are caused by picking new space near natural index Solution 2: Open Addressing Type 2: Quadratic Probing If we collide instead try the next i 2 space public int hashfunction(string s) int naturalhash = this.gethash(s); if(natural hash in use) { int i = 1; while (index in use) { try (naturalhash + i i); * i); i++; CSE 373 SP 18 - KASEY CHAMPION 14

Quadratic Probing Insert the following values into the Hash Table using a hashfunction of % table size and quadratic probing to resolve collisions 89, 18, 49, 58, 79 0 1 2 3 4 5 6 7 8 9 58 79 18 49 89 (49 % 10 + 0 * 0) % 10 = 9 (49 % 10 + 1 * 1) % 10 = 0 (58 % 10 + 0 * 0) % 10 = 8 (58 % 10 + 1 * 1) % 10 = 9 (58 % 10 + 2 * 2) % 10 = 2 Problems: If λ ½ we might never find an empty spot Infinite loop! Can still get clusters (79 % 10 + 0 * 0) % 10 = 9 (79 % 10 + 1 * 1) % 10 = 0 (79 % 10 + 2 * 2) % 10 = 3 CSE 373 SP 18 - KASEY CHAMPION 15

Secondary Clustering 3 Minutes Insert the following values into the Hash Table using a hashfunction of % table size and quadratic probing to resolve collisions 19, 39, 29, 9 0 1 2 3 4 5 6 7 8 9 39 29 9 19 Secondary Clustering When using quadratic probing sometimes need to probe the same sequence of table cells, not necessarily next to one another CSE 373 SP 18 - KASEY CHAMPION 16

Probing - h(k) = the natural hash - h (k, i) = resulting hash after probing - i = iteration of the probe - T = table size Linear Probing: h (k, i) = (h(k) + i) % T Quadratic Probing h (k, i) = (h(k) + i 2 ) % T For both types there are only O(T) probes available - Can we do better? CSE 373 SP 18 - KASEY CHAMPION 17

Double Hashing Probing causes us to check the same indices over and over- can we check different ones instead? Use a second hash function! h (k, i) = (h(k) + i * g(k)) % T <- Most effective if g(k) returns value prime to table size public int hashfunction(string s) int naturalhash = this.gethash(s); if(natural hash in use) { int i = 1; while (index in use) { try (naturalhash + i * jump_hash(key)); i++; CSE 373 SP 18 - KASEY CHAMPION 18

Second Hash Function Effective if g(k) returns a value that is relatively prime to table size - If T is a power of 2, make g(k) return an odd integer - If T is a prime, make g(k) return any smaller, non-zero integer - g(k) = 1 + (k % T(-1)) How many different probes are there? - T different starting positions - T 1 jump intervals - O(T 2 ) different probe sequences - Linear and quadratic only offer O(T) sequences CSE 373 SP 18 - KASEY CHAMPION 19

Resizing How do we resize? - Remake the table - Evaluate the hash function over again. - Re-insert. When to resize? - Depending on our load factor! -Heuristic: - for separate chaining! between 1 and 3 is a good time to resize. - For open addressing! between 0.5 and 1 is a good time to resize.

Separate chaining: Running Times What are the running times for: insert find delete Best:!(1) Worst:!(%) (if insertions are always at the end of the linked list) Best:!(1) Worst:!(%) Best:!(1) Worst:!(%) CSE 332 SU 18 ROBBIE WEBER

Linear probing: Average-case insert If! < 1 we ll find a spot eventually. What s the average running time? Uniform Hashing Assumption for any pair of elements x, y the probability that h(x) = h(y) is If find is unsuccessful: $ % 1 + $ $'( ) If find is successful: $ 1 + $ % ($'() We won t ask you to prove these $,-./01230 CSE 332 SU 18 ROBBIE WEBER

Summary 1. Pick a hash function to: - Avoid collisions - Uniformly distribute data - Reduce hash computational costs 2. Pick a collision strategy - Chaining - LinkedList - AVL Tree - Probing - Linear - Quadratic - Double Hashing No clustering Potentially more compact (λ can be higher) Managing clustering can be tricky Less compact (keep λ < ½) Array lookups tend to be a constant factor faster than traversing pointers CSE 373 SP 18 - KASEY CHAMPION 23

Summary Separate Chaining - Easy to implement - Running times!(1 + %) Open Addressing - Uses less memory. - Various schemes: - Linear Probing easiest, but need to resize most frequently - Quadratic Probing middle ground - Double Hashing need a whole new hash function, but low chance of clustering. Which you use depends on your application and what you re worried about.

Other applications of hashing - Cryptographic hash functions: Hash functions with some additional properties - Commonly used in practice: SHA-1, SHA-265 - To verify file integrity. When you share a large file with someone, how do you know that the other person got the exact same file? Just compare hash of the file on both ends. Used by file sharing services (Google Drive, Dropbox) - For password verification: Storing passwords in plaintext is insecure. So your passwords are stored as a hash. - For Digital signature - Lots of other crypto applications - Finding similar records: Records with similar but not identical keys - Spelling suggestion/corrector applications - Audio/video fingerprinting - Clustering - Finding similar substrings in a large collection of strings - Genomic databases - Detecting plagiarism - Geometric hashing: Widely used in computer graphics and computational geometry CSE 373 AU 18 SHRI MARE 25

Wrap Up Hash Tables: - Efficient find, insert, delete on average, under some assumptions - Items not in sorted order - Tons of real world uses - and really popular in tech interview questions. Need to pick a good hash function. - Have someone else do this if possible. - Balance getting a good distribution and speed of calculation. Resizing: - Always make the table size a prime number. -! determines when to resize, but depends on collision resolution strategy.

CSE 373 SP 18 - KASEY CHAMPION 27