Lecture 7: Efficient Collections via Hashing


Lecture 7: Efficient Collections via Hashing These slides include material originally prepared by Dr. Ron Cytron, Dr. Jeremy Buhler, and Dr. Steve Cole. 1

Announcements Lab 6 due Friday. Lab 7 out tomorrow, all about hashing! Pre-lab due 3/19; code and post-lab due 3/22. Spring Break hours: no official course communication from 3/8 evening (Friday) until 3/18 morning (Monday). Please be patient on Piazza: instructor and TAs are on break too :-) 2

Let's Talk About Dictionaries 3

Let's Talk About Dictionaries No, not that kind 4

Dictionary ADT A dictionary is a data structure that stores a collection of objects Each object is associated with a key Objects can be dynamically inserted and removed Can efficiently find an object in the dictionary by its key 5

Dictionary Operations (One of Several Versions) insert(record r): add r to the dictionary. find(key k): return one/some/all records whose key matches k, if any. remove(key k): remove all records whose key matches k, if any. Other versions are possible, e.g. remove() might take a Record to remove rather than a key. Other ops may exist, e.g. isEmpty(), size(), iterator(). 8
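One version of this ADT can be sketched in Java. This is an illustrative baseline backed by an unsorted list (all names here are mine, not the Lab 7 API); it makes the cheap-insert / linear-find trade-off discussed below concrete.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative dictionary sketch backed by an unsorted list:
// insert is Theta(1), but find and remove must scan the whole list.
class ListDictionary<K, V> {
    private static class Rec<A, B> {
        final A key; final B value;
        Rec(A key, B value) { this.key = key; this.value = value; }
    }

    private final List<Rec<K, V>> records = new ArrayList<>();

    // add a record to the dictionary
    void insert(K key, V value) { records.add(new Rec<>(key, value)); }

    // return one matching record's value, or null if no key matches
    V find(K key) {
        for (Rec<K, V> r : records)
            if (r.key.equals(key)) return r.value;
        return null;
    }

    // remove all records whose key matches
    void remove(K key) { records.removeIf(r -> r.key.equals(key)); }

    int size() { return records.size(); }
    boolean isEmpty() { return records.isEmpty(); }
}
```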

Dictionary Examples An actual dictionary: a collection that maps words to definitions. Class list: a set of students, with name (or possibly ID) as key. DMV database: a collection of cars accessed by license plate number. 9

Some Questions about Dictionary Variants Can multiple records exist in dictionary with same key? What happens if find() does not find a record with a specified key? Is key the entire record (as in Java Set interface), is it internal to the record (as in Lab 7), or is it external (as in Java Map interface)? 10

How to Build a Dictionary Conceptually, it's just a bag of records. What concrete data structure do we use to implement it? Must support efficient dynamic add/remove and find. 11

How to Build a Dictionary (slides 12-17 animate the bag): insert 1, insert 8, insert 5; find(8) succeeds; delete(1); then find(1) reports not found. 17

Some bad implementations Time complexities for dictionary operations:

Structure       insert     delete  find      space
unsorted list   Θ(1)       Θ(n)    Θ(n)      Θ(n)
sorted list     Θ(n)       Θ(n)    Θ(n)      Θ(n)
sorted array    Θ(n)       Θ(n)    Θ(log n)  Θ(n)
min-heap        Θ(log n)   XXX     XXX       Θ(n)

What assumption is being made about delete? Any other assumptions here? Heaps don't support delete-by-key or find (but find would be Θ(n)). None of these structures achieves sublinear time complexity for all three ops, and all take space proportional to the # of records stored. 27

Key Question Is it possible to implement a dictionary with sublinear time for all of insert, find, and remove? We'll show that the answer is yes... depending on what you mean by sublinear time. (Guarantees will not be worst-case.) 30

Idea: Direct-Addressed Table Let U be the set ("universe") of all possible keys. Allocate an array of size |U|. If we get a record with key k, put it in k's array cell. 31

Direct-Addressed Tables (Fig 11.2) The key space is a compact index of small, nonnegative integers. Darkened cells are all null. Example: record whether each key is divisible by 2, so each occupied cell holds True or False. 35
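The divisible-by-2 example can be written out directly: the key itself is the array index, so every operation is a single array access. A hypothetical sketch (not course code):

```java
// Illustrative direct-addressed table over the key universe [0, U).
// The key is the index: insert/find/delete are all Theta(1),
// but space is Theta(|U|) no matter how few records we store.
class DirectTable {
    private final Boolean[] cells;   // null marks an empty ("darkened") cell

    DirectTable(int universeSize) { cells = new Boolean[universeSize]; }

    void insert(int key, boolean value) { cells[key] = value; }
    Boolean find(int key) { return cells[key]; }   // null if absent
    void delete(int key) { cells[key] = null; }
}
```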

A Less Bad Implementation? Time complexities for dictionary operations:

Structure       insert  delete  find      space
unsorted list   Θ(1)    Θ(n)    Θ(n)      Θ(n)
sorted list     Θ(n)    Θ(n)    Θ(n)      Θ(n)
sorted array    Θ(n)    Θ(n)    Θ(log n)  Θ(n)
direct table    Θ(1)    Θ(1)    Θ(1)      Θ(|U|)

We can look up any entry in the table in constant time, given its key. But the space cost is |U|, no matter how small n (# of records) is. 38

Problems with Direct-Addressed Tables Challenge #1: What if |U| >> n? Ex. IPv6 addresses (~10^38), Unix passwords (~10^15). Challenge #2: What if keys aren't integers? What does T[blue] mean? T[5.7281934]? T["hello world"]? How do you index an array using an arbitrary object type?

Idea: Hash Functions A hash function h maps keys k of some type to integers h(k) in a fixed range [0, N). The integer h(k) is the key's hashcode under h. If N = |U|, h could map every key to a distinct integer, giving us a way to index our direct table. 41

What if our key is not Integer? (Fig 11.2: String keys dog, cat, fossa pass through a Hash Function to reach the table.) I can turn a String into an Integer. Can you think of some ways this could be done? 46

But What About Sparsity? We often can't afford to store a table of size |U|. What if our hash function mapped keys to a smaller space, i.e. [0, m) for m << |U|? We'd need a table of size only m. This smaller table is called a hash table. 47

The Good m = 3: the Hash Function maps dog → 2, cat → 0, fossa → 1. A hash table lets us allocate arrays much smaller than |U|. 48

The Bad m = 3: okapi also maps to 1. Uh oh. 49

The Bad m = 3: okapi, duiker, axolotl, and coypu all map to 1, along with fossa. Oh dear. 50

When Keys Collide What happens if multiple keys hash to the same table cell? This must happen if m < |U| (pigeonhole principle). When two keys hash to the same cell, we say they collide. A hash table must work even in the presence of collisions. 51

A Simple Strategy: Chaining Each table cell becomes a bucket that can hold multiple records. A bucket holds a list of all records whose keys map to it. find(k) must traverse bucket h(k)'s list, looking for a record with key k. Analogous extensions for insert(), remove(). 52
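A minimal chaining sketch in Java (illustrative: the bucket index here is just Java's hashCode() reduced mod m, standing in for the h(k) of the slides):

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

// Minimal chained hash table: each cell is a bucket (a linked list) of
// all keys that hash there; find traverses only that one bucket's list.
class ChainedSet<K> {
    private final List<LinkedList<K>> buckets;
    private final int m;

    ChainedSet(int m) {
        this.m = m;
        buckets = new ArrayList<>(m);
        for (int i = 0; i < m; i++) buckets.add(new LinkedList<>());
    }

    // Math.floorMod keeps the index in [0, m) even for negative hashcodes.
    private int h(K key) { return Math.floorMod(key.hashCode(), m); }

    void insert(K key) { buckets.get(h(key)).add(key); }
    boolean find(K key) { return buckets.get(h(key)).contains(key); }
    void remove(K key) { buckets.get(h(key)).removeIf(k -> k.equals(key)); }
}
```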

Hash Table with Chaining h maps dog → 2, cat → 0, and fossa, okapi, duiker, axolotl, coypu → 1. The table: bucket 0 holds { cat }; bucket 1 holds { fossa }, { okapi }, { duiker }, { axolotl }, { coypu }; bucket 2 holds { dog }. 53

Hash Table with Chaining find(axolotl): h(axolotl) = 1, so traverse bucket 1's list { fossa }, { okapi }, { duiker }, { axolotl }, { coypu } until { axolotl } is found. 59

Hash Table with Chaining find(potrzebie): h(potrzebie) = 1, so traverse bucket 1's list { fossa }, { okapi }, { duiker }, { axolotl }, { coypu } to the end. NOT FOUND. 67

What is the Performance of a Hash Table? Performance = cost to do a find(). (remove, and maybe insert, similarly traverse the list for some bucket; insert traverses the list if we must check for duplicates.) Suppose the table holds n records. In the worst case, all n records hash to one bucket, and searching this bucket takes time Θ(n). 71

Cost of Hash Table (Worst-Case) Time complexities for dictionary operations:

Structure       insert  delete  find      space
unsorted list   Θ(1)    Θ(n)    Θ(n)      Θ(n)
sorted list     Θ(n)    Θ(n)    Θ(n)      Θ(n)
sorted array    Θ(n)    Θ(n)    Θ(log n)  Θ(n)
hash table      Θ(n)    Θ(n)    Θ(n)      Θ(m+n)

I thought the point was to get sublinear-time ops! 72

A Weaker Performance Estimate Assume that, given a key k in U, hash function h is equally likely to map k to each value in [0, m), independent of all other keys. This assumption is called Simple Uniform Hashing. Now suppose we hash n keys k1, ..., kn from U into the table, then call find(k*) for some key k*. What is the average [over choice of keys] cost to search the table for k*? 73

Average Cost of Search A total of n elements is distributed over m slots, so the average size of a bucket is n/m. Suppose k* is not in the table. The cost of find(k*) is Θ(1) to compute h(k*), plus Θ(bucket size) to search. h(k*) is equally likely to be any bucket (this follows from Simple Uniform Hashing), so the average cost of an unsuccessful find is Θ(1 + n/m). 77

Average Cost of Search Average cost of unsuccessful find is Θ(1 + n/m). Similar arguments from SUH show that average cost of successful find is also Θ(1 + n/m). Defn: α = n/m is called the load factor of the hash table. 78

Cost of Hash Table (Average Under SUH) Time complexities for dictionary operations:

Structure       insert    delete    find      space
unsorted list   Θ(1)      Θ(n)      Θ(n)      Θ(n)
sorted list     Θ(n)      Θ(n)      Θ(n)      Θ(n)
sorted array    Θ(n)      Θ(n)      Θ(log n)  Θ(n)
hash table      Θ(1 + α)  Θ(1 + α)  Θ(1 + α)  Θ(m+n)

Load factor determines performance of hash table 79

Controlling the Load Factor If we know that the table will hold at most n records, we can make the # of buckets m proportional to n, say m = cn (e.g. c = 0.75). This choice makes our load factor n/m a constant (called α). Ex: if we set m = n/4, the load factor α is 4. But then the expected search cost is Θ(1 + α) = Θ(1). 80

Cost of Hash Table (Average Under SUH, m = cn) Time complexities for dictionary operations:

Structure       insert  delete  find      space
unsorted list   Θ(1)    Θ(n)    Θ(n)      Θ(n)
sorted list     Θ(n)    Θ(n)    Θ(n)      Θ(n)
sorted array    Θ(n)    Θ(n)    Θ(log n)  Θ(n)
hash table      Θ(1)    Θ(1)    Θ(1)      Θ(n)

Hashing gives expected constant-time dictionary ops in linear space! 81

How Do We Approach Ideal Performance? The hash function h(k) must approximate the SUH assumptions: it must distribute keys equally and independently across the range [0, m). [We need to talk about how to design a good hash function h(k)!] Moreover, the input keys we see must have average behavior. (Alternative: an attacker with knowledge of h(k) chooses keys so as to elicit worst-case behavior from your table!) 82

And Now, Some Hash Function Design 83

Hash Function Pipeline: Two Steps Objects (keys k) map to Integers (hashcodes c) via c = h(k), and Integers map to Buckets (indices j) via j = b(c). 84

Hash Function Pipeline: Two Steps The first step, c = h(k), is next week's topic; the second, j = b(c), is today's. 85

Assumptions Objects to be hashed have been converted to integer hashcodes. Hashcodes are in range [0, N). (NB: Java hashcodes can be positive or negative; we may need to take the absolute value or otherwise make them ≥ 0!) We need to convert hashcodes to indices in [0, m), where m = table size. 87

Goals for Mapping to Indices (from SUH) Each hashcode should be about equally likely to map to any value in [0, m). Mappings for different hashcodes should be independent, hence uncorrelated: knowing the mapping for one should give little or no information about the mapping for another. 88

Two Main Approaches to Index Mapping Division hashing Multiplicative hashing (Other strategies exist; beyond scope of 247) 89

Division Hashing b(c) = c mod m bucket index = hashcode modulo table-size Very easy to implement (mod in Java is %) Result is surely in range [0, m) (if c is non-negative!) 90
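In Java this is nearly a one-liner; the only wrinkle is the non-negativity caveat above, which Math.floorMod handles (illustrative sketch):

```java
// Division hashing: bucket index = hashcode mod table size.
// Java's % can yield a negative result for a negative hashcode,
// so Math.floorMod is used to guarantee an index in [0, m).
class DivisionHash {
    static int bucket(int hashcode, int m) {
        return Math.floorMod(hashcode, m);
    }
}
```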

The Perils of Division Hashing Does every choice of m yield SUH-like behavior? Ex: Suppose that m is divisible by a small integer d. Claim: if j = c mod m, then j mod d = c mod d So what? 91

The Perils of Division Hashing Ex: Suppose that m is divisible by a small integer d. Claim: if j = c mod m, then j mod d = c mod d. E.g., if d = 2, then even hashcodes map to even indices. Natural subsets of all hashcodes do not map uniformly across the entire table: not SUH behavior! 92

The Perils of Division Hashing (Proof) Claim: if j = c mod m, then j mod d = c mod d. Pf: Write c = x + ym with x = c mod m = j. Since d | m, we have m = wd for some w, so c = x + (yw)d. Hence c mod d = x mod d = (c mod m) mod d = j mod d. QED 93

A Particularly Bad Case Ex: Suppose that m = 2^v. Hashcodes with the same v low-order bits map to the same index. E.g., take the 32-bit hashcode c = 10011010111101100101111000010101 with v = 10 (m = 1024): c mod m keeps only the low 10 bits, and the other 22 bits are ignored! 96

Advice on Division Hashing Table size m should be chosen so that there are no (obvious) correlations between hashcode bit pattern and index, and the index depends on all bits of the hashcode, not just some. Idea: make m a prime number (no small factors). Avoid choices of m close to powers of 2 or 10. 97

What's Wrong with m Near a Power of 2 or 10? Ex: Suppose m = 2^v - 1. If c = c0 + 2^v c1 + 2^2v c2 + 2^3v c3 + ..., then c mod m = (c0 + c1 + c2 + c3 + ...) mod m. Could permute the chunks of v bits in c and get the same index! (Think about strings encoded using v bits per character.) 98
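This chunk-permutation failure is easy to check concretely. Below, v = 8 (so m = 2^8 - 1 = 255) and a two-character "string" is packed into an integer one byte per character; swapping the characters changes the hashcode but not the bucket (an illustrative check, not course code):

```java
// With m = 2^v - 1 we have 2^v ≡ 1 (mod m), so permuting the v-bit
// chunks of a hashcode never changes c mod m. Here v = 8, m = 255.
class ChunkCollision {
    static int pack(int hi, int lo) { return hi * 256 + lo; }  // two byte-chunks
    static int bucket(int c) { return c % 255; }               // division hashing
}
```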

Other Thoughts on Division Hashing The operation c mod m is expensive on most computers (unless m is a constant known at compile time). The modulo op is most efficient when m is a power of 2, but this is a poor choice for division hashing! 99

Two Main Approaches to Index Mapping Division hashing Multiplicative hashing (Other strategies exist; beyond scope of 247) 100

Multiplicative Hashing Let A be a real number in [0, 1). b(c) = floor((cA mod 1.0) x m). Here x mod 1.0 means the fractional part of x, e.g. 47.2465 mod 1.0 = 0.2465. cA mod 1.0 is in [0, 1), so b(c) is an integer in [0, m): an index! 101

Initial Observations A should not be too small: that would map many hashcodes to 0. Suggest picking A from [0.5, 1). If q = cA mod 1.0 is distributed uniformly in [0, 1), then we can use any value for m and still get uniform indices. In particular, we can use m = 2^v if we want. 102

Why Is Multiplication a Good Hashing Strategy? The mapping c → q = cA mod 1.0 is a diffusing operation. I.e., the most significant digits of q depend (in a complex way) on many digits of c. (Makes q look uniform and obscures correlations among the c's.) Hence the bin number floor(q x m) looks uniform, uncorrelated with c. (The same is true if we replace digits by bits and work in binary.) 103

Example of Diffusion 1234 x 0.6734 Assumed: integer c has some fixed # of digits, and we use the same # of digits of A after the decimal. The partial products are .4936, 3.7020, 86.3800, and 740.4000; summing them, the first digit after the decimal is a middle digit of the product. Middle digits depend on all (or most) digits of c and all (or most) digits of A, and these digits determine the bin number. 109

Is Every Choice of A Equally Good? Not all A's have equally good diffusion/complexity properties. Fractions with few nonzero digits (e.g. 0.75) or repeating decimals (e.g. 7/9 = 0.7777777...) have poor diffusion and/or low complexity. Advice: pick an irrational number between 0.5 and 1. Ex: A = (sqrt(5) - 1) / 2 ≈ 0.61803398874989484820458683436564 [Knuth] 110
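A floating-point sketch of the scheme with Knuth's A (illustrative; assumes a non-negative hashcode c):

```java
// Multiplicative hashing, floating-point version:
// b(c) = floor(m * (c*A mod 1.0)), with Knuth's A = (sqrt(5) - 1) / 2.
class MultHash {
    static final double A = (Math.sqrt(5.0) - 1.0) / 2.0;  // ~0.6180339887...

    // assumes c >= 0; % 1.0 takes the fractional part of c*A
    static int bucket(int c, int m) {
        double frac = (c * A) % 1.0;
        return (int) (m * frac);
    }
}
```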

Multiplication Hashing Without Floating-Point Math What if you can't / don't want to use floating-point math? (It may be more expensive than integer math.) If we know our hashcodes c have at most d digits, we can multiply A by 10^d initially and do everything we need using only integer arithmetic. Similarly, if hashcodes have at most w bits, we can multiply A by 2^w initially. This trick is called fixed-point arithmetic. 111

Previous Example, in Fixed-Point Decimal 1234 x 0.6734 Assumed: integer c has at most 4 digits, and we use the same # of digits of A after the decimal. Replace 0.6734 by the integer 6734 and carry the scale factor 10^-4 along (multiply, but remember how to undo). The integer partial products are 4936, 37020, 863800, and 7404000; their sum is 8309756. We know the decimal point goes four digits from the right, so cA mod 1 = 9756 x 10^-4. 118

Index Computation in Fixed-Point Decimal Suppose m = 100 = 10^2. Then (cA mod 1) x m = (9756 x 10^-4) x 10^2 = 9756 x 10^-(4-2) = 9756 x 10^-2. Again, we know where the decimal point goes. Hence floor((cA mod 1.0) x m) = 97. 121

What About Fixed-Point Binary? The book presents the binary version. It's also how you would typically implement it on a computer! If you have had 132, the following slides will make more sense; if not, follow along as best you can, and look at this again after you've had 132. 122

For base 2 (let's assume w = 32): k is the result of .hashCode(), and we multiply it by A shifted left by 32 bits. The product of two w-bit numbers yields a 2w-bit result; the binary point belongs in the middle, with the result shifted right by 32 bits, so the low 32 bits of the product are the fractional part of k x A. If m = 2^p, then multiplying the fractional part by m yields its top p bits. 131

For base 2 (let's assume w = 32): Assume we use Knuth's A, shifted left by 32 bits to an integer. 132

Example (page 264 in text) w = 32, p = 14, m = 16384, k = 123456, A x 2^32 = 2654435769. Then k x (A x 2^32) = 327706022297664 = 76300 x 2^32 + 17612864. The 32-bit representation of 17612864 is 01 0C C0 40 in hex, i.e. 0000 0001 0000 1100 1100 0000 0100 0000 in binary. The top 14 bits are 0000 0001 0000 11 (the hex digit C = 1100, but we only need its first two bits to make 14 total) = 64 + 3 = 67, so k hashes to bucket 67. 144

A Good Implementation Choose m = 2^p buckets. Assume .hashCode() yields a 32-bit unsigned integer k [does not exist in Java]. Pre-compute the constant s = 2^32 x A. Assume that if sk overflows 32 bits, we get only the lower 32 bits of the result. The index computation on input k is then floor(sk / 2^(32-p)) = sk >> (32 - p). This is a close relative of the function you'll play with in Studio 7. 145
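That recipe translates almost directly to Java. Since Java has no unsigned 32-bit integer type, this illustrative sketch keeps k in a long; with the book's inputs (k = 123456, p = 14) it lands in bucket 67, matching the worked example above.

```java
// Fixed-point multiplicative hashing with w = 32:
// s = floor(A * 2^32) = 2654435769; index = (s*k mod 2^32) >> (32 - p).
class FixedPointHash {
    static final long S = 2654435769L;   // floor(2^32 * (sqrt(5) - 1) / 2)

    // m = 2^p buckets; k is a non-negative 32-bit hashcode held in a long
    static int bucket(long k, int p) {
        long r0 = (S * k) & 0xFFFFFFFFL;   // low 32 bits: fractional part of k*A
        return (int) (r0 >>> (32 - p));    // top p bits of that fraction
    }
}
```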

End of Lecture 7 146