Lecture 10, March 4. Goals: hashing; hash functions; chaining; closed hashing; applications of hashing


Computing hash function for a string

Horner's rule: ((...(a_0 x + a_1) x + a_2) x + ... + a_{n-2}) x + a_{n-1}

int hash( const string & key )
{
    int hashval = 0;
    for( int i = 0; i < key.length( ); i++ )
        hashval = 37 * hashval + key[ i ];
    return hashval;
}

Computing hash function for a string

int myhash( const HashedObj & x ) const
{
    int hashval = hash( x );
    hashval %= thelists.size( );
    return hashval;
}

Alternatively, we can apply % thelists.size( ) after each iteration of the loop in the hash function:

int myhash( const string & key )
{
    int hashval = 0;
    int s = thelists.size( );
    for( int i = 0; i < key.length( ); i++ )
        hashval = (37 * hashval + key[ i ]) % s;
    return hashval % s;
}

Analysis of open hashing/chaining Open hashing uses more memory than open addressing (because of pointers), but is generally more efficient in terms of time. If the keys arriving are random and the hash function is good, keys will be nicely distributed across the buckets, so each list will be roughly the same size. Let n = the number of keys present in the hash table and m = the number of buckets (lists) in the hash table. If there are n elements in the set, then each bucket will hold roughly n/m of them. If we can estimate n and choose m ~ n, then the average bucket size will be about 1. (Most buckets will have a small number of items.)

Analysis continued Average time per dictionary operation: with m buckets and n elements in the dictionary, there are on average n/m elements per bucket. The ratio n/m = λ is called the load factor. The insert, search, and remove operations each take O(1 + n/m) = O(1 + λ) time (the 1 accounts for computing the hash function). If we can choose m ~ n, we get constant time per operation on average. (Assuming each element is equally likely to be hashed to any bucket, the running time is constant, independent of n.)

Closed Hashing Associated with closed hashing is a rehash strategy: if we try to place x in bucket h(x) and find it occupied, we try alternative locations h_1(x), h_2(x), etc., in order; if none is empty, the table is full. h(x) is called the home bucket. The simplest rehash strategy is called linear hashing: h_i(x) = (h(x) + i) % D. In general, our collision resolution strategy generates a sequence of hash table slots (the probe sequence) that could hold the record; we test each slot until we find an empty one (probing).

Closed Hashing (open addressing) Example: m = 8; keys a, b, c, d have hash values h(a)=3, h(b)=0, h(c)=4, h(d)=3. Where do we insert d? Bucket 3 is already filled. Probe sequence using linear hashing:
h_1(d) = (h(d)+1)%8 = 4%8 = 4 (occupied by c)
h_2(d) = (h(d)+2)%8 = 5%8 = 5 (empty: d goes here)
h_3(d) = (h(d)+3)%8 = 6%8 = 6
Etc. The sequence wraps around the beginning of the table.

Resulting table:
  slot:  0  1  2  3  4  5  6  7
  key:   b  .  .  a  c  d  .  .

Operations Using Linear Hashing Test for membership (search): examine h(k), h_1(k), h_2(k), ..., until we find k, reach an empty bucket, or return to the home bucket.
Case 1: successful search -> return true.
Case 2: unsuccessful search (empty bucket reached) -> return false.
Case 3: unsuccessful search and the table is full (probe sequence returns to the home bucket).
If deletions are not allowed, this strategy works! What if deletions are allowed?

Operations Using Linear Hashing What if deletions are allowed? If we reach an empty bucket, we cannot be sure that k is not somewhere else: the bucket may have been occupied when k was inserted and emptied afterward. We need a special placeholder, DELETED, to distinguish a bucket that was never used from one that once held a value.

Implementation of closed hashing Code slightly modified from the text.

// CONSTRUCTION: an approximate initial size or default of 101
//
// ******************PUBLIC OPERATIONS*********************
// bool insert( x )       --> Insert x
// bool remove( x )       --> Remove x
// bool contains( x )     --> Return true if x is present
// void makeempty( )      --> Remove all items
// int hash( string str ) --> Global method to hash strings

There is no distinction between the hash functions used in closed hashing and open hashing; they can be used in either context interchangeably.

template <typename HashedObj>
class HashTable
{
  public:
    explicit HashTable( int size = 101 ) : array( nextprime( size ) )
      { makeempty( ); }

    bool contains( const HashedObj & x ) const
    {
        return isactive( findpos( x ) );
    }

    void makeempty( )
    {
        currentsize = 0;
        for( int i = 0; i < array.size( ); i++ )
            array[ i ].info = EMPTY;
    }

    bool insert( const HashedObj & x )
    {
        int currentpos = findpos( x );
        if( isactive( currentpos ) )
            return false;

        array[ currentpos ] = HashEntry( x, ACTIVE );
        if( ++currentsize > array.size( ) / 2 )
            rehash( );    // rehash when load factor exceeds 0.5
        return true;
    }

    bool remove( const HashedObj & x )
    {
        int currentpos = findpos( x );
        if( !isactive( currentpos ) )
            return false;

        array[ currentpos ].info = DELETED;
        return true;
    }

    enum EntryType { ACTIVE, EMPTY, DELETED };

  private:
    struct HashEntry
    {
        HashedObj element;
        EntryType info;
    };

    vector<HashEntry> array;
    int currentsize;

    bool isactive( int currentpos ) const
      { return array[ currentpos ].info == ACTIVE; }

    int findpos( const HashedObj & x )
    {
        int offset = 1;            // int offset = s_hash( x );  /* double hashing */
        int currentpos = myhash( x );

        while( array[ currentpos ].info != EMPTY &&
               array[ currentpos ].element != x )
        {
            currentpos += offset;  // compute ith probe
                                   // offset += 2;  /* quadratic probing */
            if( currentpos >= array.size( ) )
                currentpos -= array.size( );
        }
        return currentpos;
    }

Performance Analysis - Worst Case Initialization: O(m), where m = number of buckets. Insert and search: O(n), where n = number of elements currently in the table. Suppose close to n elements in the table form one contiguous cluster. Now we search for some x that is not in the table. It may happen that h(x) is the start of a very long cluster; then it takes O(c) time to conclude failure, with c ~ n. No better than a linear list for maintaining a dictionary! This is not a rare occurrence when the table is nearly full (which is why we rehash when the load factor α reaches some value like 0.5).

Example. h(k) = k % 11, table size 11.

I. Table after inserting 1001, 9537, 3016, 9874, 2009, 9875:

  slot:  0     1     2     3  4  5  6  7     8     9     10
  key:   1001  9537  3016  .  .  .  .  9874  2009  9875  .

1. What if the next element has home bucket 0 (h(k) = k % 11 = 0)? It goes to bucket 3. The same holds for elements with home bucket 1 or 2! Only a record with home position 3 will stay in 3, so p = 4/11 that the next record will go to bucket 3.
2. Similarly, records hashing to 7, 8, 9 will end up in 10.
3. Only records hashing to 4 will end up in 4 (p = 1/11); same for 5 and 6.

II. Insert 1052 (home bucket 7). Buckets 7, 8, 9 are occupied, so 1052 lands in bucket 10:

  slot:  0     1     2     3  4  5  6  7     8     9     10
  key:   1001  9537  3016  .  .  .  .  9874  2009  9875  1052

Now the next element ends up in bucket 3 with p = 8/11.

Performance Analysis - Average Case We distinguish between successful and unsuccessful searches. A delete is a successful search for the record to be deleted; an insert is an unsuccessful search along its probe sequence. The expected cost of hashing is a function of how full the table is: the load factor λ = n/m.

Random probing model vs. linear probing model It can be shown that the average costs under linear hashing (probing) are:
  Insertion (unsuccessful search): (1/2)(1 + 1/(1 - λ)^2)
  Deletion (successful search): (1/2)(1 + 1/(1 - λ))
Random probing: suppose we use the following approach: we create a sequence of hash functions h_1, h_2, ..., all of which are independent of each other. Then:
  Insertion: 1/(1 - λ)
  Deletion: (1/λ) ln(1/(1 - λ))

Random probing: analysis of insertion (unsuccessful search) What is the expected number of times one should roll a die before getting a 4? Answer: 6 (the probability of success is 1/6). More generally, if the probability of success is p, the expected number of trials until the first success is 1/p. If the current load factor is λ, then the probability of success is 1 - λ, since the proportion of empty slots is 1 - λ; this gives the expected insertion cost of 1/(1 - λ).

Improved Collision Resolution Linear probing: h_i(x) = (h(x) + i) % D. All buckets in the table are candidates for a new record before the probe sequence returns to the home position; clustering of records leads to long probe sequences.
Linear probing with increment c > 1: h_i(x) = (h(x) + i*c) % D, for a constant c other than 1. Records with adjacent home buckets will not follow the same probe sequence.
Double hashing: h_i(x) = (h(x) + i*g(x)) % D, where g is another hash function used as the increment amount. This avoids the clustering problems associated with linear probing.

Comparison with Closed Hashing Worst-case performance is O(n) for both. The average case is a small constant in both cases when α is small. Closed hashing uses less space. Open hashing behavior is not sensitive to the load factor, and there is no need to resize the table, since memory is dynamically allocated.