CS2210 Data Structures and Algorithms


CS2210 Data Structures and Algorithms
Lecture 5: Hash Tables
Instructor: Olga Veksler
© 2004 Goodrich, Tamassia

Outline
- Hash Tables
  - Motivation
  - Hash functions
  - Collision handling
    - Separate chaining: each bucket is a linked list (or any other container class)
    - Open addressing: linear probing, double hashing

Dictionary
- A dictionary stores key-value pairs (k,v), called entries
- Linked list implementation: O(1) to insert, O(n) to find an element
- Ordered array implementation: O(n) to insert, O(log n) to find an element
- Can we have a more efficient dictionary, with O(1) insert, find, and delete?
- Hash tables have O(1) expected time per operation

Hash Table Motivation
- Suppose we know we will have at most N entries (k,v)
- Suppose keys k are unique integers in the range 0 to N-1
- Create an initially empty array A of size N and store (k,v) in A[k]
- Example: (1,'ape'), (3,'or'), (N-1,'sad') gives A = [null, ape, null, or, null, ..., sad]
- Main operations (insert, find, remove) are O(1), but we need O(N) space

Hash Table Motivation
- What if we still have N keys, but they may not be unique? e.g. (1,A), (1,R), (1,C), (3,D)
- Use a bucket array: A[1] holds the bucket with A, R, C, and A[3] holds the bucket with D
- A bucket can be implemented as a linked list
- If there are at most a constant number (say 3) of repeated keys, find(), remove(), insert() are still O(1)

Hash Table Motivation
1. What if we will have at most 100 entries with integer keys, but the keys are in the range 0 to 1,000,000,000?
   - we still want O(1) insert(), delete(), find()
   - but we do not want to use 1,000,000,000 memory cells to store only 100 entries
2. What if keys are not integers?
These two issues motivate the hash table data structure:
- insert, find, delete are O(1) expected (average) time
- worst-case time is still O(n)

Hash Table: Main Idea
- Bucket array A of size N
- Design a function h(k) that maps key k into the integer range 0, 1, ..., N-1:
  key k → function h(k) → integer
- The entry with key k is stored at index h(k) of array A

Hash Functions and Hash Tables
- A hash function h maps keys of a given type to integers in the interval [0, N-1]
  - example for integer keys: h(x) = x mod N
- The integer h(k) is called the hash value of key k
- A hash table for a given key type consists of
  - a hash function h
  - an array (called a table) of size N
- Store entry (k,v) at index i = h(k)
- A collision happens when k1 ≠ k2 but h(k1) = h(k2)
  - a bucket array is one way to handle collisions

Hash Table Example
- A company has 5,000 employees; store information with key = SSN
- An SSN is a nine-digit positive integer
- If we stored info under the full nine-digit SSN, we would need an array of size 1,000,000,000
- Hash table for storing entries (SSN, info):
  - array of size N = 10,000
  - hash function h(x) = last four digits of x
  - e.g. 025-612-0001 is stored at index 1, 981-101-0002 at index 2, 451-229-0004 at index 4, 200-751-9998 at index 9998
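The slide's scheme can be sketched in a few lines of Java; the class name `SsnHash` and the long-valued SSN representation are my own choices, not from the slides:

```java
public class SsnHash {
    static final int N = 10_000;  // table size: one slot per 4-digit suffix

    // Mirror the slide's h(x) = "last four digits of x",
    // treating the nine-digit SSN as a single number.
    static int h(long ssn) {
        return (int) (ssn % N);
    }

    public static void main(String[] args) {
        // 200-751-9998 hashes to index 9998, as in the slide's figure
        System.out.println(h(2_007_519_998L)); // 9998
        // 025-612-0001 hashes to index 1
        System.out.println(h(256_120_001L));   // 1
    }
}
```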

Hash Functions
- Specify a hash function as the composition of two functions:
  - hash code h1: keys → integers
  - compression function h2: integers → [0, N-1]
- The hash code is applied first and the compression function second: h(x) = h2(h1(x))
- Collision: two different keys are assigned the same hash value
- To avoid collisions, the goal of the hash function is to disperse the keys in an apparently random way
  - both the hash code and the compression function should be designed to avoid collisions as much as possible

Hash Codes
- A hash code maps a key k to an integer
  - not necessarily in the desired range [0, ..., N-1]; it may even be negative
- We assume a hash code is a 32-bit integer
- A hash code should be designed to avoid collisions as much as possible
  - if the hash code causes a collision, the compression function will not resolve it

Memory Address Hash Code
- Interpret the memory address of the key object as an integer
  - sometimes done in Java implementations: any object inherits the hashCode() method
- Sometimes sufficient, but often does not make sense
- Usually we want objects with equal content to have the same hash code; this may not happen with the inherited hashCode()
  - we want s1 = "Hello" and s2 = "Hello" to have the same hash code
  - but if s1 and s2 are stored at different memory locations, they do not have the same inherited hashCode()
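The distinction is easy to see in Java, where String overrides the inherited identity-based hashCode() with a content-based one (a small demo of my own, not from the slides):

```java
public class MemoryAddressHashDemo {
    public static void main(String[] args) {
        // Two distinct String objects with equal content
        String s1 = new String("Hello");
        String s2 = new String("Hello");

        // Different objects in memory...
        System.out.println(s1 == s2);                       // false
        // ...but String's hashCode() depends on content,
        // so equal strings get equal hash codes:
        System.out.println(s1.hashCode() == s2.hashCode()); // true

        // The inherited identity-based hash (memory-address style)
        // will almost always differ for these two objects:
        System.out.println(System.identityHashCode(s1) == System.identityHashCode(s2));
    }
}
```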

Integer Interpretation Hash Code
- An object is represented as a sequence of bits
  - for objects with the same content, the bit representations are equal, so their hash codes will be the same
- Reinterpret the sequence of bits representing a key as an integer
- Suitable with no modification for keys whose bit representation is at most 32 bits long
  - for byte, short, int, char: cast into type int
  - for float: use Float.floatToIntBits()
- What about types longer than 32 bits, i.e. long or double?
  - we can cast into int, but the upper 32 bits will not be used
  - this causes many collisions if most of the key differences are in those lost bits

Component Sum Hash Code
- Partition the bits of the key into components of fixed length (32 bits) and sum the components, ignoring overflows:
  hash code = (first 32 bits) + (next 32 bits) + ...
- Suitable for numeric keys of fixed length greater than or equal to the number of bits of the integer type
  - e.g. long and double in Java
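For a 64-bit key this folding can be sketched in Java as follows (the class and method names are my own):

```java
public class ComponentSumHash {
    // Fold a 64-bit key into a 32-bit hash code by summing its
    // upper and lower 32-bit halves, ignoring overflow.
    static int hash(long key) {
        int high = (int) (key >>> 32);  // first 32 bits
        int low  = (int) key;           // last 32 bits
        return high + low;              // int addition simply wraps on overflow
    }

    // A double is hashed by first reinterpreting its bits as a long.
    static int hash(double key) {
        return hash(Double.doubleToLongBits(key));
    }

    public static void main(String[] args) {
        System.out.println(hash(0x0000_0001_0000_0002L)); // 1 + 2 = 3
    }
}
```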

Component Sum Hash Code
- This works for an object of any length
- Suppose we have a 128-bit object: add the four 32-bit parts
  hash code = (first 32 bits) + (next 32 bits) + (next 32 bits) + (last 32 bits)

Component Sum Hash Code
- This works even for variable-length objects
- A variable-length object can be viewed as a tuple (x0, x1, ..., xk-1)
- The component sum hash code is x0 + x1 + ... + xk-1

Component Sum Hash Code: Strings
- It is convenient to break strings into 8-bit components, i.e. characters
- Sum up the characters in string s: "abracadabra" = 'a' + 'b' + ... + 'a'
- Many collisions: "stop", "tops", "pots", "spot" all have the same hash code
- Problem: the position of individual characters is important, but it is not taken into account by the component sum hash code
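The collision on the slide is easy to verify in Java (a small demo of my own):

```java
public class StringComponentSum {
    // Sum the character codes of s, ignoring their positions.
    static int componentSum(String s) {
        int sum = 0;
        for (int i = 0; i < s.length(); i++)
            sum += s.charAt(i);
        return sum;
    }

    public static void main(String[] args) {
        // All four words are permutations of the same letters,
        // so they all collide under a component-sum hash code:
        System.out.println(componentSum("stop")); // 454
        System.out.println(componentSum("tops")); // 454
        System.out.println(componentSum("pots")); // 454
        System.out.println(componentSum("spot")); // 454
    }
}
```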

Hash Code: Polynomial Accumulation
- The object is a variable-length tuple (x0, x1, ..., xk-1), and the order of the xi's is significant
- Choose a non-zero constant c ≠ 1
- The polynomial hash code is p(c) = x0 + x1·c + x2·c² + ... + xk-1·c^(k-1)
  - overflows are ignored
- Multiplication by c makes room for each component in a tuple of values, while also preserving a characterization of the previous components

Hash Code: Polynomial Accumulation
- Especially suitable for strings
- Experimentally, good choices are c = 33, 37, 39, 41
  - the choice c = 33 gives at most 6 collisions on a set of 50,000 English words
- Let c = 33:
  "top" = 't' + 'o'·33 + 'p'·33² = 116 + 111·33 + 112·33² = 125747
  "pot" = 'p' + 'o'·33 + 't'·33² = 112 + 111·33 + 116·33² = 130099

Hash Code: Polynomial Accumulation
- The polynomial p(c) can be evaluated in O(k) time using Horner's rule:
  p(c) = x0 + x1·c + x2·c² + ... + xk-1·c^(k-1)
       = x0 + c·(x1 + c·(x2 + ... + c·(xk-2 + xk-1·c) ... ))
- Compute: p1 = xk-1,  pi = c·pi-1 + xk-i,  p(c) = pk

Algorithm:
  p = x[k-1]
  i = k - 2
  while i ≥ 0 do
      p = p·c + x[i]
      i = i - 1
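Horner's rule above translates directly into Java; this sketch reproduces the slide's worked values for c = 33 (the class and method names are mine):

```java
public class PolynomialHash {
    // Evaluate p(c) = x0 + x1*c + ... + x_{k-1}*c^(k-1) with Horner's rule,
    // where xi is the i-th character of s; int overflow is simply ignored.
    static int hash(String s, int c) {
        int k = s.length();
        int p = s.charAt(k - 1);        // p1 = x[k-1]
        for (int i = k - 2; i >= 0; i--)
            p = p * c + s.charAt(i);    // pi = c*p(i-1) + x[k-i]
        return p;
    }

    public static void main(String[] args) {
        // Values from the slides, with c = 33:
        System.out.println(hash("top", 33)); // 125747
        System.out.println(hash("pot", 33)); // 130099
    }
}
```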

Hash Code: Polynomial Accumulation
- p(c) = x0 + x1·c + x2·c² + ... + xk-1·c^(k-1)
- Alternatively, we can accumulate powers of c; this is also O(k)

Algorithm:
  p = x[0]
  accum = c
  for i = 1 to k-1 do
      p = p + x[i]·accum
      accum = accum·c

Compression Functions
- We now know how to map objects to integers using a suitable hash code
- The hash code for key k will typically not be in the legal range [0, ..., N-1]
- We need a compression function to map the hash code into the legal range [0, ..., N-1]
- A good compression function will minimize the number of collisions

Division Compression Function
- h2(y) = y mod N
- N should be a prime number
  - this helps to spread out the hashed values; the reason is beyond the scope of this course
- Example: hash codes {200, 205, 210, 300, 305, 310, 400, 405, 410}
  - N = 100: hashed values {0, 5, 10, 0, 5, 10, 0, 5, 10} = {0, 5, 10}
  - N = 101: hashed values {99, 3, 8, 98, 2, 7, 97, 1, 6}
- mod N compression does not work well if there is a repeated pattern of hash codes of the form pN + q for different p's
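The slide's example can be checked in Java by counting how many distinct buckets each table size uses (a small demo of my own; `distinctBuckets` is not a library function):

```java
import java.util.Set;
import java.util.TreeSet;

public class DivisionCompression {
    // Count how many distinct buckets y mod n uses for the given hash codes.
    static int distinctBuckets(int[] codes, int n) {
        Set<Integer> buckets = new TreeSet<>();
        for (int y : codes)
            buckets.add(y % n);
        return buckets.size();
    }

    public static void main(String[] args) {
        int[] codes = {200, 205, 210, 300, 305, 310, 400, 405, 410};
        // N = 100 resonates with the pattern 100p + q: only 3 buckets used
        System.out.println(distinctBuckets(codes, 100)); // 3
        // N = 101 (prime): all 9 codes land in distinct buckets
        System.out.println(distinctBuckets(codes, 101)); // 9
    }
}
```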

MAD Compression Function (Multiply, Add, and Divide)
- h2(y) = (ay + b) mod N
- N is a prime number
- a > 0 and b ≥ 0 are integers such that a mod N ≠ 0
  - otherwise, every integer would map to the same value, namely b mod N
- a and b are usually chosen randomly when the MAD compression function is created
- This compression function spreads hash codes fairly evenly in the range [0, ..., N-1]
- MAD compression is better than plain mod N
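A MAD compression function might be sketched in Java as below; the class name and the example constants a = 3, b = 7, N = 101 are my own. Math.floorMod is used so the result stays non-negative even for negative hash codes:

```java
public class MadCompression {
    final int n;     // table size, a prime
    final int a, b;  // chosen once, typically at random, with a % n != 0

    MadCompression(int n, int a, int b) {
        this.n = n; this.a = a; this.b = b;
    }

    // h2(y) = (a*y + b) mod N; floorMod keeps the result in [0, n-1]
    // even when y (or a*y + b) is negative.
    int compress(int y) {
        return Math.floorMod(a * y + b, n);
    }

    public static void main(String[] args) {
        MadCompression h2 = new MadCompression(101, 3, 7); // example constants
        System.out.println(h2.compress(42)); // (3*42 + 7) mod 101 = 32
        System.out.println(h2.compress(-5)); // still lands in [0, 100]
    }
}
```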

Collision Handling via Separate Chaining
- Handle collisions with a bucket array
- Each bucket is a linked list or any other container data structure
- This is called separate chaining
- e.g. with h(x) = last four digits, 451-229-0004 and 981-101-0004 collide at index 4 and are stored in the same bucket, while 025-612-0001 sits alone at index 1

Load Factor
- It is useful to keep track of the load factor of a hash table
  - it tells us if the hash array is too small to store the current number of entries
- Suppose the bucket array has size N and there are n entries
- The load factor is defined as λ = n/N
- Suppose the hash function is good, i.e. it spreads keys evenly in the array A[0..N-1]
  - then n/N is the expected number of items in each bucket
  - find, insert, remove take O(n/N) expected time
- Ideally, each bucket should have at most 1 item
- Keep the load factor λ < 1 for separate chaining; λ < 0.9 is recommended
- If the load factor becomes too large, rehash: make the hash array larger (at least twice the size) and re-insert all entries into the new hash array

Dictionary Methods with Separate Chaining
- Implement each bucket as a list-based dictionary

Algorithm insert(k,v):
  Input: a key k and value v
  Output: entry (k,v) is added to dictionary D
  if (n+1)/N > maximum allowed load factor then
      double the size of A and rehash all existing entries
  e = A[h(k)].insert(k,v)   // A[h(k)] is a linked list
  n = n + 1                 // n is the number of entries in the hash table
  return e

Dictionary Methods with Separate Chaining
Algorithm findAll(k):
  Input: a key k
  Output: an iterator of entries with key equal to k
  return A[h(k)].findAll(k)

Dictionary Methods with Separate Chaining
Algorithm remove(e):
  Input: an entry e
  Output: the (removed) entry e, or null if e was not in dictionary D
  t = A[h(k)].remove(e)   // delegate the remove to the dictionary at A[h(k)]
  if t ≠ null then        // e was found
      n = n - 1           // update the number of entries in the hash table
  return t

Open Addressing
- Separate chaining
  - advantage: simple implementation
  - disadvantage: additional storage for the auxiliary data structure (the lists)
- We can handle collisions without an additional data structure, i.e. store entries directly in the hash array A
- This is called open addressing
  - A is not a bucket array in this case
  - the load factor must always be at most 1

Open Addressing: Linear Probing
- Handle collisions by placing the colliding item in the next (circularly) available table cell
- Each table cell inspected is referred to as a probe
- Example: h(x) = x mod 13, insert keys 18, 41, 22, 44, 59, 32, 31, 73, 5, 6
  - final array (indices 0-12): [6, -, 41, -, -, 18, 44, 59, 32, 22, 31, 73, 5]

find with Linear Probing
Algorithm find(k)
  hashval ← h(k)
  p ← 0
  repeat
      i ← (hashval + p) mod N
      c ← A[i]
      if c = null
          return null
      if c.key() = k
          return c
      p ← p + 1
  until p = N
  return null

- Start at cell h(k) and probe consecutive locations until one of the following occurs:
  - an item with key k is found, or
  - an empty cell is found, or
  - N cells have been unsuccessfully probed

What about remove?
- h(x) = x mod 13, table: [6, -, 41, -, -, 18, 44, 59, 32, 22, 31, 73, 5]
- remove(18) empties cell 5
- find(31) now starts at cell 5, sees an empty cell, and stops
- Oops: 31 is not found, even though it is still in the table!

Solution for remove
- Replace a deleted entry with a special marker to signal that an entry was deleted from that cell
- null is a special marker, but we already use it for another purpose
- Create an Entry object, call it AVAILABLE, and make sure to instantiate it only once in your hash table:
  Entry AVAILABLE = new Entry(someKey, someValue)
- We do not care what someKey and someValue are; we never use them
- We use only the reference (address) of the AVAILABLE object: if A[index] == AVAILABLE ...

Solution for remove
- remove(e):
  - search for entry e
  - if entry e is found, replace it with AVAILABLE
  - else return null
- Example: h(x) = x mod 13
  - before: [-, -, 41, -, -, 18, 44, 59, 32, 22, 31, 73, -]
  - remove(18) puts AVAILABLE at cell 5
  - find(31) probes cells 5 (AVAILABLE), 6, 7, 8, 9, and finds 31 at cell 10

Linear Probing: fixed find
Algorithm find(k)
  hval ← h(k)
  p ← 0
  repeat
      i ← (hval + p) mod N
      c ← A[i]
      if c = null
          return null
      if c ≠ AVAILABLE and c.key() = k
          return c
      p ← p + 1
  until p = N
  return null

- Do not access A[i].key() before making sure A[i] ≠ AVAILABLE

insert with Linear Probing
- insert(k,v):
  - rehash if the load factor becomes too large
  - start at cell h(k)
  - probe consecutive cells until a null or AVAILABLE cell is found
  - insert the new entry at that cell
- Example: h(x) = x mod 13, table [-, -, 41, -, -, A, 44, 59, 32, 22, 31, 73, -]
  - insert(57): 57 mod 13 = 5, and cell 5 is AVAILABLE, so 57 goes there
  - insert(4): 4 mod 13 = 4, and cell 4 is null, so 4 goes there
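The three operations and the AVAILABLE marker fit in a short Java class. This is a minimal sketch (the class name and the use of a sentinel Integer object as AVAILABLE are my own choices); it reproduces the slides' size-13 example, including find(31) still succeeding after remove(18):

```java
public class LinearProbingTable {
    static final Integer AVAILABLE = Integer.MIN_VALUE; // single marker object
    final Integer[] a;

    LinearProbingTable(int n) { a = new Integer[n]; }

    int h(int k) { return k % a.length; }

    // Probe consecutive cells until a null or AVAILABLE cell is found.
    void insert(int k) {
        for (int p = 0; p < a.length; p++) {
            int i = (h(k) + p) % a.length;
            if (a[i] == null || a[i] == AVAILABLE) { a[i] = k; return; }
        }
        throw new IllegalStateException("table full");
    }

    // Return the index of k, or -1; stop at null, but walk past AVAILABLE.
    int find(int k) {
        for (int p = 0; p < a.length; p++) {
            int i = (h(k) + p) % a.length;
            if (a[i] == null) return -1;
            if (a[i] != AVAILABLE && a[i] == k) return i;
        }
        return -1;
    }

    // Replace a removed entry by AVAILABLE so later finds keep probing.
    boolean remove(int k) {
        int i = find(k);
        if (i < 0) return false;
        a[i] = AVAILABLE;
        return true;
    }

    public static void main(String[] args) {
        LinearProbingTable t = new LinearProbingTable(13);
        for (int k : new int[]{18, 41, 22, 44, 59, 32, 31, 73, 5, 6}) t.insert(k);
        t.remove(18);                    // marks index 5 as AVAILABLE
        System.out.println(t.find(31));  // 10: the probe walks past AVAILABLE
    }
}
```

Note the comparison a[i] == AVAILABLE is deliberately by reference, exactly as the slides describe: only the marker object's identity matters, never its key or value.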

Problems with Linear Probing
- Entries tend to cluster into contiguous regions
- This results in many probes per find, insert, and remove
- The more probes per operation, the slower the code

Open Addressing: Double Hashing
- Linear probing places an item in the first available cell in the series (h(k) + p·1) mod N, for p = 0, 1, ..., N-1
- Double hashing uses a secondary hash function h'(k) and places the item in the first available cell in the series (h(k) + p·h'(k)) mod N, for p = 0, 1, ..., N-1
- We must have 0 < h'(k) < N
- The table size N must be a prime to allow probing of all cells
- Linear probing is the special case of double hashing with h'(k) = 1 for all k
- Double hashing spreads entries more evenly through the hash array

Open Addressing: Double Hashing
- A good choice for the secondary hash function is h'(k) = q - (k mod q), where
  - q < N
  - q is a prime
- The possible values for h'(k) are 1, 2, ..., q
- This assumes k is an integer; if not, use h'(k) = q - (HashCode(k) mod q)

Double Hashing Example
- Probe series: (h(k) + p·h'(k)) mod N, for p = 0, 1, ..., N-1
- N = 13, h(k) = k mod 13, h'(k) = 7 - (k mod 7)
- Insert keys in order: 18, 41, 22, 44, 59, 32, 31, 73

Insert 18: h(18) = 18 mod 13 = 5, h'(18) = 7 - (18 mod 7) = 3
- first empty location in the sequence (5 + 3·0) mod 13, (5 + 3·1) mod 13, ... is cell 5

Insert 41: h(41) = 41 mod 13 = 2, h'(41) = 7 - (41 mod 7) = 1
- first empty location in the sequence (2 + 1·0) mod 13, (2 + 1·1) mod 13, ... is cell 2

Insert 22: h(22) = 22 mod 13 = 9, h'(22) = 7 - (22 mod 7) = 6
- first empty location in the sequence (9 + 6·0) mod 13, (9 + 6·1) mod 13, ... is cell 9

Insert 44: h(44) = 44 mod 13 = 5, h'(44) = 7 - (44 mod 7) = 5
- cell (5 + 5·0) mod 13 = 5 is occupied; cell (5 + 5·1) mod 13 = 10 is empty, so 44 goes to cell 10

Insert 59: h(59) = 59 mod 13 = 7, h'(59) = 7 - (59 mod 7) = 4
- first empty location in the sequence (7 + 4·0) mod 13, (7 + 4·1) mod 13, ... is cell 7

Insert 32: h(32) = 32 mod 13 = 6, h'(32) = 7 - (32 mod 7) = 3
- first empty location in the sequence (6 + 3·0) mod 13, (6 + 3·1) mod 13, ... is cell 6

Insert 31: h(31) = 31 mod 13 = 5, h'(31) = 7 - (31 mod 7) = 4
- cells (5 + 4·0) mod 13 = 5 and (5 + 4·1) mod 13 = 9 are occupied; cell (5 + 4·2) mod 13 = 0 is empty, so 31 goes to cell 0

Insert 73: h(73) = 73 mod 13 = 8, h'(73) = 7 - (73 mod 7) = 4
- first empty location in the sequence (8 + 4·0) mod 13, (8 + 4·1) mod 13, ... is cell 8

Double Hashing Example Summary
- Probe summary: 8 insertions, 11 probes, so 11/8 probes on average
- We want the average number of probes to be small

  k  | h(k) | h'(k) | cells probed
  18 |  5   |  3    | 5
  41 |  2   |  1    | 2
  22 |  9   |  6    | 9
  44 |  5   |  5    | 5, 10
  59 |  7   |  4    | 7
  32 |  6   |  3    | 6
  31 |  5   |  4    | 5, 9, 0
  73 |  8   |  4    | 8

- Final table (indices 0-12): [31, -, 41, -, -, 18, 32, 59, 73, 22, 44, -, -]
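The whole worked example can be replayed in Java; this sketch (names are my own) counts probes exactly as the summary table does and ends with the same layout:

```java
public class DoubleHashingDemo {
    static final int N = 13;
    static final Integer[] table = new Integer[N];
    static int probes = 0;

    static int h(int k)  { return k % N; }      // primary hash: k mod 13
    static int h2(int k) { return 7 - k % 7; }  // secondary hash: 7 - (k mod 7)

    // Probe (h(k) + p*h2(k)) mod N for p = 0, 1, ... until an empty cell is found.
    static void insert(int k) {
        for (int p = 0; p < N; p++) {
            int i = (h(k) + p * h2(k)) % N;
            probes++;
            if (table[i] == null) { table[i] = k; return; }
        }
        throw new IllegalStateException("table full");
    }

    public static void main(String[] args) {
        for (int k : new int[]{18, 41, 22, 44, 59, 32, 31, 73}) insert(k);
        // Resulting layout matches the slides: 31 at index 0, 41 at 2,
        // 18 at 5, 32 at 6, 59 at 7, 73 at 8, 22 at 9, 44 at 10.
        System.out.println(probes + " probes for 8 insertions"); // 11 probes
    }
}
```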

find with Double Hashing
Algorithm find(k)
  hval ← h(k)
  hval2 ← h'(k)          // e.g. h'(k) = q - (HashCode(k) mod q)
  p ← 0
  repeat
      i ← (hval + p·hval2) mod N
      c ← A[i]
      if c = null
          return null
      if c ≠ AVAILABLE and c.key() = k
          return c
      p ← p + 1
  until p = N
  return null

- The only change from linear probing: the probe step is hval2 = h'(k) instead of 1

Open Addressing Performance
- In the worst case, find(), insert() and remove() are O(n)
  - the worst case occurs when all inserted keys collide
- The load factor λ = n/N affects performance
- Assuming that hash values are like random numbers, it can be shown that the expected number of probes for insert(), find(), remove() is 1/(1 - λ)
- It is recommended to keep λ < 0.5; then 1/(1 - λ) < 2
  - e.g. λ = 0.5 gives 2 expected probes, while λ = 0.9 gives 10

Separate Chaining vs. Open Addressing
- Open addressing saves space over separate chaining
- Separate chaining is usually faster than open addressing (depending on the load factor of the bucket array), both theoretically and experimentally
- Thus, if memory space is not a major issue, use separate chaining; otherwise use open addressing

Hash Code Implementation

public interface HashCode<K> {
    public int giveCode(K k);
}

public class HashDict<K,V> implements Dict<K,V> {
    private HashCode<K> hcode;

    public HashDict(HashCode<K> inCode, float maxLFactor) {
        hcode = inCode;
        // ...
    }

    public Entry<K,V> find(K key) {
        int h = hcode.giveCode(key);
        // ...
    }
    // ...
}

Hash Code Implementation

public class WordPuzzle {
    StringHashCode hc = new StringHashCode();
    HashDict<String,Integer> D = new HashDict<String,Integer>(hc, (float) 0.6);
}

public class StringHashCode implements HashCode<String> {
    public int giveCode(String key) {
        // polynomial accumulation with c = 33 (see earlier slides)
        int p = 0;
        for (int i = key.length() - 1; i >= 0; i--)
            p = p * 33 + key.charAt(i);
        return p;
    }
}