Hashing Algorithms. Hash functions Separate Chaining Linear Probing Double Hashing

Similar documents
IS 2610: Data Structures

CMSC 132: Object-Oriented Programming II. Hash Tables

Hashing. 1. Introduction. 2. Direct-address tables. CmSc 250 Introduction to Algorithms

Dictionary. Dictionary. stores key-value pairs. Find(k) Insert(k, v) Delete(k) List O(n) O(1) O(n) Sorted Array O(log n) O(n) O(n)

CSE100. Advanced Data Structures. Lecture 21. (Based on Paul Kube course materials)

Hashing. Dr. Ronaldo Menezes Hugo Serrano. Ronaldo Menezes, Florida Tech

Lecture 16. Reading: Weiss Ch. 5 CSE 100, UCSD: LEC 16. Page 1 of 40

Hash Table and Hashing

Today: Finish up hashing Sorted Dictionary ADT: Binary search, divide-and-conquer Recursive function and recurrence relation

Hash[ string key ] ==> integer value

1. [1 pt] What is the solution to the recurrence T(n) = 2T(n-1) + 1, T(1) = 1

CMSC 341 Hashing (Continued) Based on slides from previous iterations of this course

Cpt S 223. School of EECS, WSU

Hashing. CptS 223 Advanced Data Structures. Larry Holder School of Electrical Engineering and Computer Science Washington State University

Hash Tables. Hashing Probing Separate Chaining Hash Function

Data Structures and Algorithm Analysis (CSC317) Hash tables (part2)

Hashing. Hashing Procedures

Algorithms and Data Structures

5. Hashing. 5.1 General Idea. 5.2 Hash Function. 5.3 Separate Chaining. 5.4 Open Addressing. 5.5 Rehashing. 5.6 Extendible Hashing. 5.

! A Hash Table is used to implement a set, ! The table uses a function that maps an. ! The function is called a hash function.

Data Structure Lecture#22: Searching 3 (Chapter 9) U Kang Seoul National University

Hash Tables Hash Tables Goodrich, Tamassia

Understand how to deal with collisions

HO #13 Fall 2015 Gary Chan. Hashing (N:12)

Solutions to Exam Data structures (X and NV)

COMP171. Hashing.

CS/COE 1501

Dictionaries and Hash Tables

HASH TABLES. Goal is to store elements k,v at index i = h k

CSE 214 Computer Science II Searching

Outline. 1 Hashing. 2 Separate-Chaining Symbol Table 2 / 13

Table ADT and Sorting. Algorithm topics continuing (or reviewing?) CS 24 curriculum

COSC160: Data Structures Hashing Structures. Jeremy Bolton, PhD Assistant Teaching Professor

Hashing. Manolis Koubarakis. Data Structures and Programming Techniques

Tables. The Table ADT is used when information needs to be stored and acessed via a key usually, but not always, a string. For example: Dictionaries

HASH TABLES. Hash Tables Page 1

CSCI Analysis of Algorithms I

Data Structures And Algorithms

HASHING BBM ALGORITHMS DEPT. OF COMPUTER ENGINEERING ERKUT ERDEM. Mar. 25, 2014

Unit #5: Hash Functions and the Pigeonhole Principle

General Idea. Key could be an integer, a string, etc e.g. a name or Id that is a part of a large employee structure

DATA STRUCTURES AND ALGORITHMS

Hash Open Indexing. Data Structures and Algorithms CSE 373 SP 18 - KASEY CHAMPION 1

Introduction. hashing performs basic operations, such as insertion, better than other ADTs we ve seen so far

TABLES AND HASHING. Chapter 13

Chapter 20 Hash Tables

Hash Tables. Hash functions Open addressing. March 07, 2018 Cinda Heeren / Geoffrey Tien 1

Hash Table. A hash function h maps keys of a given type into integers in a fixed interval [0,m-1]

Hash Tables Outline. Definition Hash functions Open hashing Closed hashing. Efficiency. collision resolution techniques. EECS 268 Programming II 1

Dictionaries and Hash Tables

Topic HashTable and Table ADT

DATA STRUCTURES/UNIT 3

Hash Tables. Computer Science S-111 Harvard University David G. Sullivan, Ph.D. Data Dictionary Revisited

HASH TABLES.

HashTable CISC5835, Computer Algorithms CIS, Fordham Univ. Instructor: X. Zhang Fall 2018

Module 5: Hashing. CS Data Structures and Data Management. Reza Dorrigiv, Daniel Roche. School of Computer Science, University of Waterloo

sequential search (unordered list) binary search (ordered array) BST N N N 1.39 lg N 1.39 lg N N compareto()

Fast Lookup: Hash tables

Symbol Table. Symbol table is used widely in many applications. dictionary is a kind of symbol table data dictionary is database management

Linked lists (6.5, 16)

Data Structures and Algorithms. Chapter 7. Hashing

Implementations III 1

CMSC 341 Lecture 16/17 Hashing, Parts 1 & 2

III Data Structures. Dynamic sets

CS1020 Data Structures and Algorithms I Lecture Note #15. Hashing. For efficient look-up in a table

CHAPTER 9 HASH TABLES, MAPS, AND SKIP LISTS

Today s Outline. CS 561, Lecture 8. Direct Addressing Problem. Hash Tables. Hash Tables Trees. Jared Saia University of New Mexico

Algorithms and Data Structures

University of Toronto Department of Electrical and Computer Engineering. Midterm Examination. ECE 345 Algorithms and Data Structures Fall 2012

Data Structures and Algorithms. Roberto Sebastiani

Data structures. Organize your data to support various queries using little time and/or space

Introduction hashing: a technique used for storing and retrieving information as quickly as possible.

UNIT III BALANCED SEARCH TREES AND INDEXING

Tirgul 7. Hash Tables. In a hash table, we allocate an array of size m, which is much smaller than U (the set of keys).

csci 210: Data Structures Maps and Hash Tables

Algorithms in Systems Engineering ISE 172. Lecture 12. Dr. Ted Ralphs

Hash Tables. Hash functions Open addressing. November 24, 2017 Hassan Khosravi / Geoffrey Tien 1

Hash Tables and Hash Functions

Data and File Structures Chapter 11. Hashing

Introduction to Hashing

Fundamental Algorithms

Hashing 1. Searching Lists

Algorithms with numbers (2) CISC4080, Computer Algorithms CIS, Fordham Univ.! Instructor: X. Zhang Spring 2017

Algorithms with numbers (2) CISC4080, Computer Algorithms CIS, Fordham Univ. Acknowledgement. Support for Dictionary

1. Attempt any three of the following: 15

Worst-case running time for RANDOMIZED-SELECT

CSCD 326 Data Structures I Hashing

3.4 Hash Tables. hash functions separate chaining linear probing applications. Optimize judiciously

CS 350 Algorithms and Complexity

9/24/ Hash functions

BBM371& Data*Management. Lecture 6: Hash Tables

Chapter 9: Maps, Dictionaries, Hashing

Hash Tables. CS 311 Data Structures and Algorithms Lecture Slides. Wednesday, April 22, Glenn G. Chappell

Standard ADTs. Lecture 19 CS2110 Summer 2009

CS 241 Analysis of Algorithms

Design and Analysis of Algorithms - Chapter 6 6

Hashing. October 19, CMPE 250 Hashing October 19, / 25

Taking Stock. IE170: Algorithms in Systems Engineering: Lecture 7. (A subset of) the Collections Interface. The Java Collections Interfaces

Hashing. 7- Hashing. Hashing. Transform Keys into Integers in [[0, M 1]] The steps in hashing:

Hashing Techniques. Material based on slides by George Bebis

Transcription:

Hashing Algorithms Hash functions Separate Chaining Linear Probing Double Hashing

Symbol-Table ADT Records with keys (priorities) basic operations insert search create test if empty destroy copy generic operations common to many ADTs not needed for one-time use but critical in large systems Problem solved (?) balanced, randomized trees use O(lg N) comparisons Is lg N required? no (and yes) Are comparisons necessary? no ST.h void STinit(); void STinsert(Item); Item STsearch(Key); int STempty(); ST interface in C 2

ST implementations cost summary Guaranteed asymptotic costs for an ST with N items insert search delete find kth largest sort unordered array 1 N 1 N NlgN N BST N N N N N N randomized BST * lg N lg N lg N lgn N lgn red-black BST lg N lg N lg N lg N lg N lg N hashing * 1 1 1 N NlgN N join * assumes system can produce random numbers Can we do better? 3

Hashing: basic plan Save items in a key-indexed table (index is a function of the key) Hash function method for computing table index from key Collision resolution strategy algorithm and data structure to handle two keys that hash to the same index Classic time-space tradeoff no space limitation: trivial hash function with key as address no time limitation: trivial collision resolution: sequential search limitations on both time and space (the real world) hashing 4

Hash function Goal: random map (each table position equally likely for each key) Treat key as integer, use prime table size M hash function: h(k) = K mod M Ex: 4-char keys, table size 101 binary 01100001011000100110001101100100 hex 6 1 6 2 6 3 6 4 ascii a b c d 26 4 ~.5 million different 4-char keys 101 values ~50,000 keys per value Huge number of keys, small table: most collide! abcd hashes to 11 0x61626364 = 1633831724 16338831724 % 101 = 11 dcba hashes to 57 0x64636261 = 1684234849 1633883172 % 101 = 57 abbc also hashes to 57 0x61626263 = 1633837667 1633837667 % 101 = 57 5 25 items, 11 table positions ~2 items per table position 5 items, 11 table positions ~.5 items per table position

Hash function (long keys) Goal: random map (each table position equally likely for each key) Treat key as long integer, use prime table size M use same hash function: h(k) = K mod M compute value with Horner s method Ex: abcd hashes to 11 0x61626364 = 256*(256*(256*97+98)+99)+100 16338831724 % 101 = 11 numbers too big? OK to take mod after each op 256*97+98 = 24930 % 101 = 84 256*84+99 = 21603 % 101 = 90 256*90+100 = 23140 % 101 = 11... can continue indefinitely, for any length key How much work to hash a string of length N? N add, multiply, and mod ops 6 0x61 hash.c scramble by using 117 instead of 256 int hash(char *v, int M) { int h, a = 117; for (h = 0; *v!= '\0'; v++) h = (a*h + *v) % M; return h; } hash function for strings in C Uniform hashing: use a different random multiplier for each digit.

Collision Resolution Two approaches Separate chaining M much smaller than N ~N/M keys per table position put keys that collide in a list need to search lists Open addressing (linear probing, double hashing) M much larger than N plenty of empty table slots when a new key collides, find an empty slot complex collision patterns 7

Separate chaining Hash to an array of linked lists Hash map key to value between 0 and M-1 Array constant-time access to list with key Linked lists constant-time insert search through list using elementary algorithm M too large: too many empty array entries M too small: lists too long 0 1 L A A A 2 M X 3 N C 4 5 E P E E 6 7 G R 8 H S 9 I 10 Trivial: average list length is N/M Worst: all keys hash to same list Theorem (from classical probability theory): Probability that any list length is > tn/m is exponentially small in t Typical choice M ~ N/10: constant-time search/insert Guarantee depends on hash function being random map 8

Linear probing Hash to a large array of items, use sequential search within clusters Hash map key to value between 0 and M-1 Large array at least twice as many slots as items Cluster contiguous block of items search through cluster using elementary algorithm for arrays A S A S A E S A E R S A C E R S H A C E R S H A C E R I S H A C E R I N G S H A C E R I N G X S H A C E R I N G X M S H A C E R I N G X M S H P A C E R I N Trivial: average list length is N/M Worst: all keys hash to same list Theorem (beyond classical probability theory): M too large: too many empty array entries M too small: clusters coalesce Typical choice M ~ 2N: constant-time search/insert 9 insert: search: 1 2 1 2 1 (1 + ) (1 α) 2 1 (1 + ) (1 α) Guarantees depend on hash function being random map

Double hashing Avoid clustering by using second hash to compute skip for search Hash map key to array index between 0 and M-1 Second hash map key to nonzero skip value (best if relatively prime to M) quick hack OK Ex: 1 + (k mod 97) Avoids clustering skip values give different search paths for keys that collide G X S C E R I N G X S P C E R I N Trivial: average list length is N/M Worst: all keys hash to same list and same skip Theorem (deep): insert: search: 1 1 α 1 α ln(1+α) Typical choice M ~ 2N: constant-time search/insert Disadvantage: delete cumbersome to implement Guarantees depend on hash functions being random map 10

Double hashing ST implementation static Item *st; code assumes Items are pointers, initialized to NULL void STinsert(Item x) { Key v = ITEMkey(x); } int i = hash(v, M); int skip = hashtwo(v, M); 11 linear probing: take skip = 1 while (st[i]!= NULL) i = (i+skip) % M; st[i] = x; N++; Item STsearch(Key v) { } int i = hash(v, M); int skip = hashtwo(v, M); while (st[i]!= NULL) if eq(v, ITEMkey(st[i])) return st[i]; else i = (i+skip) % M; return NULL; insert probe loop search probe loop

Hashing tradeoffs Separate chaining vs. linear probing/double hashing space for links vs. empty table slots small table + linked allocation vs. big coherant array Linear probing vs. double hashing load factor (α) 50% 66% 75% 90% linear probing double hashing search 1.5 2.0 3.0 5.5 insert 2.5 5.0 8.5 55.5 search 1.4 1.6 1.8 2.6 insert 1.5 2.0 3.0 5.5 Hashing vs. red-black BSTs arithmetic to compute hash vs. comparison hashing performance guarantee is weaker (but with simpler code) easier to support other ST ADT operations with BSTs 12

ST implementations cost summary Guaranteed asymptotic costs for an ST with N items insert search delete find kth largest sort unordered array 1 N 1 N NlgN N BST N N N N N N randomized BST * lg N lg N lg N lgn N lgn red-black BST lg N lg N lg N lg N N lg N hashing * 1 1 1 N NlgN N join Not really: need lgn bits to distinguish N keys * assumes system can produce random numbers * assumes our hash functions can produce random values for all keys Can we do better? 13 tough to be sure...