Hash Tables. Nate Foster Spring 2018


A3: Index and search books. Extensive use of data structures and the OCaml module system. Competition for most scalable implementation. Staff solution: ~500 LoC.

Review
Previously in 3110: streams, balanced trees, refs
Today: hash tables

Maps
Maps bind keys to values. Aka associative array, dictionary, symbol table.
Abstract notation: {k1 : v1, k2 : v2, ..., kn : vn}
e.g. {3110 : "Fun", 2110 : "OO"}
e.g. {"Harvard" : 1636, "Princeton" : 1746, "Penn" : 1740, "Williams" : 1792, "Cornell" : 1865}

Maps

module type Map = sig
  type ('k, 'v) t
  val insert : 'k -> 'v -> ('k, 'v) t -> ('k, 'v) t
  val find : 'k -> ('k, 'v) t -> 'v option
  val remove : 'k -> ('k, 'v) t -> ('k, 'v) t
  ...
end

Map implementations
Up next: four implementations. For each implementation:
- What is the representation type?
- What is the abstraction function?
- What are the representation invariants?
- What is the efficiency of each operation?

FUNCTIONAL MAPS

Maps as lists
Representation type: type ('k, 'v) t = ('k * 'v) list
AF: [(k1,v1); (k2,v2); ...] represents {k1:v1, k2:v2, ...}. If k occurs more than once in the list, then in the map it is bound to the leftmost value in the list. The empty list represents the empty map.
RI: none
Efficiency:
- insert: cons to front of list: O(1)
- find: traverse entire list: O(n)
- remove: traverse entire list: O(n)
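This list implementation can be sketched directly in OCaml. A minimal sketch matching the slide's representation type and AF; the module name ListMap and the function bodies are my reconstruction, not the staff code:

```ocaml
(* Maps as association lists. insert conses to the front, so the
   leftmost binding for a key shadows any older ones, matching the AF. *)
module ListMap = struct
  type ('k, 'v) t = ('k * 'v) list

  let empty : ('k, 'v) t = []

  (* O(1): cons the new binding onto the front *)
  let insert k v m = (k, v) :: m

  (* O(n): List.assoc_opt returns the leftmost binding *)
  let find k m = List.assoc_opt k m

  (* O(n): drop every binding for k *)
  let remove k m = List.filter (fun (k', _) -> k' <> k) m
end
```

For example, ListMap.(empty |> insert 3110 "Fun" |> insert 2110 "OO") builds the map {3110 : "Fun", 2110 : "OO"}.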

Maps as red-black trees
Representation type: RB tree where nodes contain (key, value) pairs
AF & RI: as in the balanced trees lecture; the keys form a BST
Efficiency: traverse a path from root to node or leaf
- insert: O(log n)
- find: O(log n)
- remove: O(log n)

MAPS AS ARRAYS

Maps as arrays
Array index operation: efficiently maps integers to values

Index  Value
459    Fan
460    Gries
461    Clarkson
462    Birrell
463    (does not exist)

Maps as arrays
Aka direct address table. Keys must be integers.
Representation type: type 'v t = 'v option array

Maps as arrays

module type ArrayMapSig = sig
  type 'v t
  val create : int -> 'v t
  val insert : int -> 'v -> 'v t -> unit
  val find : int -> 'v t -> 'v option
  val remove : int -> 'v t -> unit
end

Maps as arrays
AF: [ Some v0; Some v1; ... ] represents {0:v0, 1:v1, ...}. But if element i is None, then i is not bound in the map.
RI: none
Efficiency: find, insert, remove all O(1)
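A sketch of an implementation satisfying ArrayMapSig under this AF; the module name ArrayMap and the bodies are my reconstruction:

```ocaml
(* Direct address table: the key is the array index. *)
module ArrayMap = struct
  type 'v t = 'v option array

  let create n : 'v t = Array.make n None

  (* Each operation is a single array access: O(1). *)
  let insert k v (m : 'v t) = m.(k) <- Some v
  let find k (m : 'v t) = m.(k)
  let remove k (m : 'v t) = m.(k) <- None
end
```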

Map implementations

                       insert    find      remove
Arrays                 O(1)      O(1)      O(1)
Association lists      O(1)      O(n)      O(n)
Balanced search trees  O(log n)  O(log n)  O(log n)

Arrays: fast, but keys must be integers.
Lists and trees: allow any keys, but slower.
We'd like the best of all worlds: constant efficiency with arbitrary keys.

HASH TABLES

Mutable Maps

module type MutableMap = sig
  type ('k, 'v) t
  val insert : 'k -> 'v -> ('k, 'v) t -> unit
  val find : 'k -> ('k, 'v) t -> 'v option
  val remove : 'k -> ('k, 'v) t -> unit
end

Key idea: convert keys to integers
Assume we have a conversion function hash : 'k -> int. Then we want to implement insert by:
- hashing the key to an int within the array bounds
- storing the binding at that index
The conversion should be fast: ideally, constant time.
Problem: what if the conversion function is not injective?

Injective: one-to-one
[Figure: two mappings from the domain {A, B, C, D} to the co-domain {0, 1, 2, 3, 4}. In the left mapping each key gets its own output, so it is injective; in the right mapping two keys share an output (a collision), so it is not injective.]

Hash tables
If the hash function is not injective, multiple keys will collide at the same array index. We're okay with that. Strategy:
- The integer output of hash is called a bucket.
- Store multiple key-value pairs in a list at a bucket.
- This is called open hashing, closed addressing, or separate chaining. OCaml's Hashtbl does this.
Alternative: try to find an empty bucket somewhere else. Called closed hashing, or open addressing.

Maps as hash tables
Representation type combines association list with array:
type ('k, 'v) t = ('k * 'v) list array
Abstraction function: an array
[ [(k11,v11); (k12,v12); ...];
  [(k21,v21); (k22,v22); ...];
  ... ]
represents the map
{k11:v11, k12:v12, ..., k21:v21, k22:v22, ...}

Maps as hash tables
Representation invariants:
- No key appears more than once in the array (so, no duplicate keys in the association lists).
- All keys are in the right buckets: k appears in array index b iff hash(k) = b.
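Putting the representation type, AF, and RI together gives a bucketed hash table. A sketch, using OCaml's built-in Hashtbl.hash as the conversion function; the module name ChainMap is mine:

```ocaml
(* Hash table with separate chaining: an array of association-list
   buckets. Any function from keys to valid indices would do as [bucket_of];
   here we reduce Hashtbl.hash modulo the array length. *)
module ChainMap = struct
  type ('k, 'v) t = ('k * 'v) list array

  let create n : ('k, 'v) t = Array.make n []

  let bucket_of m k = Hashtbl.hash k mod Array.length m

  (* Maintain the RI: remove any old binding for k before consing. *)
  let insert k v m =
    let b = bucket_of m k in
    m.(b) <- (k, v) :: List.filter (fun (k', _) -> k' <> k) m.(b)

  let find k m = List.assoc_opt k m.(bucket_of m k)

  let remove k m =
    let b = bucket_of m k in
    m.(b) <- List.filter (fun (k', _) -> k' <> k) m.(b)
end
```

Filtering out the old binding in insert is what enforces the no-duplicate-keys invariant.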

Maps as hash tables
Efficiency: we have to search through an association list to find a key, so efficiency depends on how long the lists are, which in turn depends on the hash function...
Terrible hash function: hash(k) = 42. All keys collide and are stored in a single bucket, so the table degenerates to an association list (with no dups) in that bucket. insert, find, remove: O(n).

Efficiency of hash tables
New goal: constant-time efficiency on average.
Desired property of the hash function: distribute keys randomly among buckets, to keep the average bucket length small. (Hard; similar to designing a good PRNG.)
If the bucket length is on average L, then insert, find, and remove will have expected running time that is O(L).
New problem: how to make L a constant that doesn't depend on the number of bindings in the table?

Independence from # bindings
Load factor = (# bindings in hash table) / (# buckets in array)
- e.g., 10 bindings, 10 buckets: load factor = 1.0
- e.g., 20 bindings, 10 buckets: load factor = 2.0
- e.g., 5 bindings, 10 buckets: load factor = 0.5
The load factor is the average bucket length.
Both OCaml's Hashtbl and java.util.HashMap provide functionality to find out the current load factor.

Independence from # bindings
Load factor = # bindings / # buckets. The # bindings is not under the implementer's control, but the # buckets is.
Resize the array to be bigger or smaller to control the average bucket length L. If L is bounded by a constant, then insert, find, and remove will on average be O(L): constant time.

Resizing the array
Requires a new representation type:
type ('k, 'v) t = ('k * 'v) list array ref
Mutate an array element to insert or remove; mutate the array ref to resize.

Rehashing
If load factor ≥ 2.0 (or perhaps just > 2.0), then:
- double the array size
- rehash elements into new buckets
thus bringing the load factor back to around 1.0. Both OCaml's Hashtbl and java.util.HashMap do this.
Efficiency: find and remove are expected O(2), which is still constant time. But insert is O(n), because it can require rehashing all elements. So why is the common wisdom that hash tables offer constant-time performance?
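A sketch of the resizable version, using the representation type ('k * 'v) list array ref and the doubling-at-2.0 policy described above. This is my reconstruction, not OCaml's actual Hashtbl code; size recomputes the binding count on each call, which a real implementation would cache:

```ocaml
(* Resizing hash table: the bucket array sits behind a ref so that
   insert can swap in a bigger array when the load factor exceeds 2.0. *)
module ResizingMap = struct
  type ('k, 'v) t = ('k * 'v) list array ref

  let create n : ('k, 'v) t = ref (Array.make n [])

  let bucket a k = Hashtbl.hash k mod Array.length a

  let size m = Array.fold_left (fun n b -> n + List.length b) 0 !m

  let load_factor m =
    float_of_int (size m) /. float_of_int (Array.length !m)

  (* Double the array and rehash every binding into its new bucket. *)
  let resize m =
    let a' = Array.make (2 * Array.length !m) [] in
    Array.iter
      (List.iter (fun (k, v) ->
           let b = bucket a' k in
           a'.(b) <- (k, v) :: a'.(b)))
      !m;
    m := a'

  let insert k v m =
    if load_factor m > 2.0 then resize m;
    let a = !m in
    let b = bucket a k in
    a.(b) <- (k, v) :: List.filter (fun (k', _) -> k' <> k) a.(b)

  let find k m =
    let a = !m in
    List.assoc_opt k a.(bucket a k)
end
```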

Improved analysis of insert
Key insight: rehashing occurs very rarely, and more and more rarely as the size of the table doubles! E.g., at the 16th insert, then at the 32nd, then at the 64th, ...
Consider: what is the average cost of each insert operation in a long sequence?

Improved analysis of insert
Example: 8 buckets with 8 bindings; do 8 more inserts.
- The first 7 inserts are (on average) constant-time.
- The 8th insert is linear time: rehashing the 15 existing bindings plus the final insert causes 16 inserts.
- Total cost: 7 + 16 inserts; round up to 16 + 16 = 32 inserts.
- So the average cost per insert in the sequence is equivalent to 4 non-rehashing inserts, i.e., we could just pretend every insert costs quadruple!
This "creative accounting" technique is a simple example of amortized analysis.
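This accounting can be checked mechanically. The helper below is hypothetical (not from the lecture): it charges 1 per ordinary insert, and size + 1 for an insert that triggers a rehash (rehash every existing binding, then insert), doubling the bucket count each time the load factor would reach 2.0:

```ocaml
(* Tally the total cost of a sequence of inserts into a chained hash
   table that doubles when the next insert would make size = 2 * buckets. *)
let simulate ~buckets ~bindings ~inserts =
  let cap = ref buckets and size = ref bindings and cost = ref 0 in
  for _ = 1 to inserts do
    if !size + 1 = 2 * !cap then begin
      cost := !cost + !size + 1;  (* rehash !size bindings, then insert *)
      cap := 2 * !cap
    end else incr cost;            (* ordinary insert: constant cost *)
    incr size
  done;
  !cost
```

For the slide's example, simulate ~buckets:8 ~bindings:8 ~inserts:8 yields 23 (= 7 + 16), and over long sequences the total stays within 4x the number of inserts, matching the "pretend every insert costs quadruple" bound.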

Amortized analysis of efficiency
Amortize: "put aside money at intervals for gradual payment of debt" [Webster's 1964]; from Latin "mort-" as in "death".
- Pay extra money for some operations, as a credit.
- Use that credit to pay the higher cost of some later operations.
Invented by Sleator and Tarjan (1985)

Robert Tarjan
Turing Award winner (1986), with Prof. John Hopcroft, "for fundamental achievements in the design and analysis of algorithms and data structures." Cornell CS faculty 1972-1973. b. 1948.

Hash tables Conclusion: resizing hash tables have amortized expected worst-case running time that is constant!

Upcoming events [Last Night] Prelim grades out [Today 3-4pm] Foster Office Hours