CSE 5311 Notes 9: Hashing


CSE 5311 Notes 9: Hashing (Last updated 7/5/15 1:07 PM)

CLRS, Chapter 11 Review:
11.2: Chaining - related to perfect hashing method
11.3: Hash functions, skim universal hashing
11.4: Open addressing

COLLISION HANDLING BY OPEN ADDRESSING

Saves space when records are small and chaining would waste a large fraction of the space for links.

Collisions are handled by using a probe sequence for each key - a permutation of the table's subscripts. The hash function is h(key, i), where i is the number of reprobe attempts tried.

Two special key values (or flags) are used: never-used (-1) and recycled (-2). Searches stop on never-used, but continue on recycled.

Linear Probing - h(key, i) = (key + i) % m

Properties:
1. Probe sequences eventually hit all slots.
2. Probe sequences wrap back to the beginning of the table.
3. Exhibits lots of primary clustering - the end of one probe sequence coincides with another probe sequence:
   [diagram: probe sequence i_0 i_1 i_2 i_3 i_4 ... running into another sequence i_j i_j+1 i_j+2 ...]
4. There are only m probe sequences.
5. Exhibits lots of secondary clustering: if two keys have the same initial probe, then their probe sequences are the same.

What about using h(key, i) = (key + 2*i) % 10 or h(key, i) = (key + 50*i) % 1000?

Suppose all keys are equally likely to be accessed. Is there a best order for inserting keys?

Insert keys: 101, 17, 102, 103, 104, 105, 106 (two blank tables with slots 0 1 2 3 4 5 6 are provided for working this exercise).

Double Hashing - h(key, i) = (h1(key) + i*h2(key)) % m

Properties:
1. Probe sequences will hit all slots only if m is prime.
2. There are m*(m-1) probe sequences.
3. Eliminates most clustering.

Hash functions:
   h1 = key % m
   a. h2 = 1 + key % (m - 1)
   b. h2 = 1 + (key/m) % (m - 1)

Load factor: α = (# elements stored) / (# slots in table)

When a very small α is desirable, constant-time initialization may be a useful trade-off - see p. 37 of http://dl.acm.org.ezproxy.uta.edu/citation.cfm?doid=2597757.2535933

UPPER BOUNDS ON EXPECTED PERFORMANCE FOR OPEN ADDRESSING

Double hashing comes very close to these results, but the analysis assumes that the hash function provides all m! permutations of subscripts.

1. Unsuccessful search with load factor α = n/m. Each successive probe has the effect of decreasing the table size and the number of slots in use by one:

                          in use   table size
   Before first probe       n         m
   After first probe        n-1       m-1
   After second probe       n-2       m-2
   After third probe        n-3       m-3
   After fourth probe       n-4       m-4

   a. Probability that all searches have a first probe: 1
   b. Probability that a search goes on to a second probe: α = n/m
   c. Probability that a search goes on to a third probe: (n/m)((n-1)/(m-1)) < α^2, since (n-1)/(m-1) < n/m when n < m
   d. Probability that a search goes on to a fourth probe: (n/m)((n-1)/(m-1))((n-2)/(m-2)) < α^3

   Suppose the table is large. Sum the probabilities for probes to get an upper bound on the expected number of probes:

      sum(i = 0 to infinity) α^i = 1/(1 - α)   (much worse than chaining)

2. Inserting a key with load factor α:
   a. Exactly like an unsuccessful search.
   b. 1/(1 - α) probes.

3. Successful search:
   a. Searching for a key takes as many probes as inserting that particular key took.
   b. Each inserted key increases the load factor, so inserted key number i+1 is expected to take no more than 1/(1 - i/m) = m/(m - i) probes.
   c. The expected number of probes for n consecutively inserted keys (each key equally likely to be requested) is:

      (1/n) sum(i = 0 to n-1) m/(m - i) = (m/n) sum(i = m-n+1 to m) 1/i

   d. Upper bound on the sum for a decreasing function (CLRS, p. 1067, (A.12)):

      (m/n) sum(i = m-n+1 to m) 1/i <= (m/n) integral(m-n to m) dx/x
         = (1/α)(ln m - ln(m - n)) = (1/α) ln(m/(m - n)) = (1/α) ln(1/(1 - α))

BRENT'S REHASH - On-the-fly reorganization of a double hash table (not in book)

http://dl.acm.org.ezproxy.uta.edu/citation.cfm?doid=361952.361964

During insertion, no more than one other key is moved, to avoid the expected penalty on recently inserted keys. The diagram below shows how i+1, the increase in the total number of probes to search for all keys, is minimized. Expected probes for a successful search stay around 2.5, even at high load factors. (Assumes uniform access probabilities.)

keynew uses the (j+1)-prefix of its probe sequence to reach the slot holding keyold; keyold is then moved i-j additional positions down its own probe sequence. Insertion is more expensive, but typically needs only three searches per key to break even.

Unsuccessful search performance is the same.

The (i, j) combinations are tried in this order, so that the total number of added probes, i+1, grows as slowly as possible:

   (i, j, i-j): (0,0,0), (1,1,0), (1,0,1), (2,2,0), (2,1,1), (2,0,2), (3,3,0), (3,2,1), (3,1,2), (3,0,3), (4,4,0), ...

keynew reaches slot jj = (h1(keynew) + j*h2(keynew)) % m using j+1 probes; keyold is moved from jj to ii = (jj + (i-j)*h2(keyold)) % m using i-j probes.

void insert(int keynew, int r[])
{
  int i, ii, inc, init, j, jj, keyold;

  init = hashfunction(keynew);       // h1(keynew)
  inc = increment(keynew);           // h2(keynew)
  for (i = 0; i <= tabsize; i++)
  {
    printf("trying to add just %d to total of probe lengths\n", i + 1);
    for (j = i; j >= 0; j--)
    {
      jj = (init + inc*j) % tabsize;                    // jj is the subscript for probe j+1 for keynew
      keyold = r[jj];
      ii = (jj + increment(keyold)*(i - j)) % tabsize;  // Next reprobe position for keyold
      printf("i=%d j=%d jj=%d ii=%d\n", i, j, jj, ii);
      if (r[ii] == (-1))                                // keyold may be moved (when i == j, ii == jj and the slot is simply free)
      {
        r[ii] = keyold;
        r[jj] = keynew;
        n++;
        return;
      }
    }
  }
}

Test Problem: The hash table below was created using double hashing with Brent's rehash. The initial slot (h1(key)) and rehashing increment (h2(key)) are given for each key. Show the result from inserting 9000 using Brent's rehash when h1(9000) = 5 and h2(9000) = 4. (10 points)

   slot   key    h1(key)   h2(key)   Probe sequence (part of solution)
   0      -
   1      -
   2      5000   2         1         2 3 4 5 6 0 1
   3      4000   3         4         3 0 4 1 5 2 6
   4      3000   4         1         4 5 6 0 1 2 3
   5      2000   5         5         5 3 1 6 4 2 0
   6      1000   6         2         6 1 3 5 0 2 4

   9000 (h1 = 5, h2 = 4):            5 2 6 3 0 4 1

GONNET'S BINARY TREE HASHING - Generalizes Brent's rehash (aside)

http://dl.acm.org.ezproxy.uta.edu/citation.cfm?doid=8000580340

Has the same goal as Brent's rehash (minimize the increase in the total number of probes to search for all keys), but allows moving an arbitrary number of other keys. "Binary tree" refers to the search strategy for minimizing the increase in probes:

a. The root of the tree corresponds to placing the (new) key at its h1 slot (assuming the slot is empty).
b. A left child corresponds to placing the key of the parent using the same strategy as the parent, but then moving the key from the taken slot using one additional h2 offset. (The old key is associated with this left child.)
c. A right child corresponds to placing the key of the parent using one additional h2 offset (assuming the slot is empty). (The key of the parent is also associated with this right child.)

The search can be implemented either using a queue for generated tree nodes or by recursive iterative deepening.

CUCKOO HASHING (aside, not in book)

A more recent open addressing scheme with very simple reorganization. Presumes that a low load factor is maintained by using dynamic tables (CLRS 17.4).

Based on having two hash tables, but no reprobing is used:

http://www.it-c.dk/people/pagh/papers/cuckoo-undergrad.pdf

PERFECT HASHING (CLRS 11.5)

Static key set. Obtain O(1) hashing ("no collisions") using:
1. Preprocessing (constructing hash functions), and/or
2. Extra space (makes success more likely) - want cn slots, where c is small.

Many informal approaches - a typical application is the table of reserved words in a compiler.

CLRS approach: Suppose n keys.
1. With m = n^2 slots, randomly assigning keys to slots gives probability < 0.5 of any collisions.
2. Use a two-level structure (in one array): primary slot j, holding n_j keys, points to a secondary table of size n_j^2, stored along with its hash parameters a_j and b_j (see CLRS, p. 278).

E[sum over j of n_j^2] < 2n, but three other values are stored per slot when n_j > 1, and one other value otherwise.

Brent's method - about 96 million keys:

   lf 0.910: expected=1.86   CPU 568
   lf 0.920: expected=1.895  CPU 5877
   lf 0.930: expected=1.934  CPU 60
   lf 0.940: expected=1.979  CPU 6365
   lf 0.950: expected=2.03   CPU 6693
   lf 0.960: expected=2.097  CPU 749
   lf 0.970: expected=2.188  CPU 7934
   lf 0.980: expected=2.336  CPU 9922

   Retrievals took 35.25 secs. Worst case probes is 372.

   Probe counts:
   Number of keys using 1 probe is 960647
   Number of keys using 2 probes is 475688
   Number of keys using 3 probes is 224767
   Number of keys using 4 probes is 5302
   Number of keys using 5 probes is 635040
   Number of keys using 6 probes is 376082
   Number of keys using 7 probes is 23966
   Number of keys using 8 probes is 583
   Number of keys using 9 probes is 10259
   Number of keys using 10 probes is 7484
   Number of keys using 11 probes is 5254
   Number of keys using 12 probes is 3826
   Number of keys using 13 probes is 2838
   Number of keys using 14 probes is 22448
   Number of keys using 15 probes is 7626
   Number of keys using 16 probes is 4066
   Number of keys using 17 probes is 247
   Number of keys using 18 probes is 9468
   Number of keys using 19 probes is 794
   Number of keys using 20 probes is 6766
   Number of keys using 21 probes is 573
   Number of keys using 22 probes is 4904
   Number of keys using 23 probes is 4295
   Number of keys using 24 probes is 374
   Number of keys using 25 probes is 3339
   Number of keys using 26 probes is 2858
   Number of keys using 27 probes is 2677
   Number of keys using 28 probes is 238
   Number of keys using 29 probes is 202
   Number of keys using >=30 probes is 2925

CLRS Perfect Hashing - 20 million keys:

   malloc'ed 240000000 bytes
   realloc for 25287052 more bytes
   final structure will use 6644 bytes per key

   Subarray statistics:
   Number with 0 keys is 7359244
   Number with 1 key is 735536
   Number with 2 keys is 3678240
   Number with 3 keys is 227685
   Number with 4 keys is 306079
   Number with 5 keys is 6508
   Number with 6 keys is 1060
   Number with 7 keys is 54
   Number with 8 keys is 93
   Number with 9 keys is 4
   Number with 10 keys is 2
   Number with 11 keys is 0
   Number with 12 keys is 0
   Number with 13 keys is 0
   Number with >=14 keys is 0

   Time to build perfect hash structure: 0.78
   Time to retrieve each key once: 62.80

RIVEST'S OPTIMAL HASHING (not in book)

An application of the assignment problem (weighted bipartite matching): m rows, n columns (m <= n), positive weights (some may simulate infinity). Choose an entry in each row of M such that:
1. No column has two entries chosen.
2. The sum of the m chosen entries is minimized.
(similar to binary tree hashing)

Aside: This is solved using the Hungarian method in O(m^2 n) time (https://en.wikipedia.org/wiki/Hungarian_algorithm, http://www.amazon.com/Stanford-GraphBase-Platform-Combinatorial-Computing/dp/0201542757), or by iterative use of Floyd-Warshall to find sub-optimal negative cycles in O(n^4) time (http://ranger.uta.edu/~weems/notes5311/assignfw.c; reviewed in these notes, with details in CSE 2320 Notes 6).

Minimizing the expected number of probes by placement of keys to be retrieved by a specific open addressing technique (i.e., it finishes what other methods started):

Given a static set of m keys for a table with n slots, where key i has request probability P_i:
1. Follow the m-prefix of the probe sequence for key i: S_i1, S_i2, ..., S_im.
2. Assign weight j*P_i to M[i][S_ij]. (Unassigned entries get infinity.)
3. Solve the resulting assignment problem. If M[i][j] is chosen, key i is stored at slot j.

Minimizing the worst-case number of probes (i.e., an alternative to perfect hashing): No probabilities are needed. The approach is similar to the expected case, but now assign weight m^j to M[i][S_ij]. Any matching that uses a smaller maximum j must have a smaller total weight, since m entries of weight at most m^(j-1) sum to at most m * m^(j-1) = m^j. The wide range of weight magnitudes will be inconvenient when using the Hungarian method, but the result will not leave empty elements in the prefix S_i1, S_i2, ..., S_ij for a selected M[i][S_ij].

Also possible to use binary search on instances of (unweighted) bipartite matching:

   low = 1
   high = m                  // Can also hash all m keys to get a better upper bound
   while low <= high
      k = (low + high) / 2
      Generate an instance of bipartite matching that connects each key to the k-prefix of that key's probe sequence
      Attempt to find a matching with m edges
      if successful
         high = k - 1
         Save this matching, since it may be the final solution
      else
         low = k + 1
   <low is the worst-case number of probes needed>

Iteratively patch the solution so that prefix elements of probe sequences are not left empty.

Generating an instance of bipartite matching as a maximum flow problem (unit capacity edges): the source S has an edge to each key i; key i has edges to the slots in its k-prefix S_i1, S_i2, ..., S_ik; and each of the n slots has an edge to the sink T.

BLOOM FILTERS (not in book)

An m-bit array is used as a filter to avoid accessing a slow external data structure for misses. k independent hash functions are applied to a static set of n elements to be represented in the filter, but not explicitly stored:

   for (i=0; i<m; i++)
     bloom[i] = 0;
   for (i=0; i<n; i++)
     for (j=0; j<k; j++)
       bloom[(*hash[j])(element[i])] = 1;

Testing if a candidate element is possibly in the set of n:

   for (j=0; j<k; j++)
   {
     if (!bloom[(*hash[j])(candidate)])
       <Can't be in the set>
   }
   <Possibly in set>

The relationship among m, n, and k determines the false positive probability p. Given m and n, the optimal number of functions is k = (m/n) ln 2, which minimizes p = (1/2)^k. Alternatively, m may be determined for a desired n and p: m = -n ln p / (ln 2)^2 (and k = -lg p).

What interesting property can be expected for an optimally configured Bloom filter? (m coupon types, nk cereal boxes...)

http://en.wikipedia.org/wiki/Bloom_filter

M. Mitzenmacher and E. Upfal, Probability and Computing: Randomized Algorithms and Probabilistic Analysis, Cambridge Univ. Press, 2005.

B.H. Bloom, "Space/Time Trade-Offs in Hash Coding with Allowable Errors", CACM 13 (7), July 1970, 422-426 (http://dl.acm.org.ezproxy.uta.edu/citation.cfm?doid=362686.362692).