Dictionaries and Hash Tables


Nicholas Mainardi, Dipartimento di Elettronica e Informazione, Politecnico di Milano (nicholas.mainardi@polimi.it), 14th June 2017

Dictionaries

What is a dictionary? A dictionary is a collection of data accessible through a key in O(1). The key is usually an integer and may be associated with arbitrary data, called satellite data. Note that any raw data can be converted to an integer and used as a key (e.g. an ASCII string can be read as a base-128 integer). Keys are generally unique.

Naive implementation: suppose the keys are integers in the range [0, ..., M-1]. Then we can implement the dictionary with a simple array of size M. This solution is known as a direct-address table.
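A direct-address table is short enough to sketch directly. A minimal Python version (class and method names are my own, not from the slides):

```python
class DirectAddressTable:
    """Direct-address dictionary for integer keys in [0, M-1].
    Every operation is a single array access, hence O(1)."""

    def __init__(self, M):
        self.T = [None] * M        # T[k] holds the satellite data for key k

    def search(self, k):
        return self.T[k]           # None means "key absent"

    def insert(self, k, v):
        self.T[k] = v

    def delete(self, k):
        self.T[k] = None
```

The price is Θ(M) memory regardless of how many keys are actually stored, which is what motivates hashing later on.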

Integer Sets and Direct-Address Tables

Dictionaries as sets: sets are usually represented with a dictionary. Indeed, the elements of a set are unique, so a unique key can be used to identify each of them. When the elements of the set are integers, the element is the key itself and no satellite data is needed.

Maximum of a set of integers: we want to find the maximum of a set of integers in the range [0, ..., M-1]. Idea: with a direct-address table implementation, the index of the array equals the element of the set, so the maximum is simply the last non-empty position of the array. We can find it by scanning backwards from position M-1.

Integer Sets and Direct-Address Tables

Can we get a more concise representation of a set of integers? Consider a bit vector, i.e. an array whose elements are single bits. Can we represent a set of integers with a bit vector, and how many bits are needed? Yes: use the bit vector as a bit mask. Integer k belongs to the set if and only if the k-th bit is 1. In this way we can represent the set with M bits instead of M·w bits, where w is the size of a machine word.
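The bit-mask idea can be sketched as follows (names are my own; Python's arbitrary-precision integers stand in for the bit vector):

```python
class BitSet:
    """Set of integers in [0, M) stored as a bit mask:
    k belongs to the set iff the k-th bit is 1."""

    def __init__(self):
        self.bits = 0

    def insert(self, k):
        self.bits |= 1 << k        # set the k-th bit

    def delete(self, k):
        self.bits &= ~(1 << k)     # clear the k-th bit

    def contains(self, k):
        return (self.bits >> k) & 1 == 1

    def maximum(self):
        # index of the highest set bit: the "last non-empty position"
        # of the previous slide (returns -1 when the set is empty)
        return self.bits.bit_length() - 1
```

Note that `maximum` also answers the maximum-of-a-set question from the direct-address table slide, in O(1) on a machine with a highest-bit instruction.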

Algorithms for Direct-Address Tables

Problem: suppose we implement a dictionary with a direct-address table T whose size is huge. The table may contain garbage in its empty records, but we cannot initialize them to a constant value, since the table is too big. Is it possible to design a scheme that performs the basic operations SEARCH, INSERT and DELETE in O(1) on this huge table, with constant initialization cost as well?

We can employ an additional array S, whose size equals the number of keys currently stored in the dictionary, to determine whether a record of T is empty or not. The size of S is kept in a variable size.

Algorithms for Direct-Address Tables

A possible solution. Idea: we do not need to store the key in the record of the table, so T[k] can instead store an index into the additional array S, which holds the information about that record. How do we recognize an empty cell? It is not sufficient to check that T[k] is a valid index for S (i.e. less than size): a garbage value in a record may happen to be a valid index into S. Idea: the difference between a garbage access to S and a genuine one is that in the latter case we know the cell of S was reached from T[k]. Therefore, when we fill a cell of S upon insertion of a key k, we store k in that cell. On subsequent searches, S[T[k]] = k if and only if the record for k is not empty.

Algorithms for Direct-Address Tables

Insertions and deletions. For insertion, it suffices to verify that the element is not already present and then append a new cell to S holding the value k. On the first insertion we know for sure that the table is empty, so no search is needed. The initialization phase is O(1).

What about deletions? We can break the relation S[T[k]] = k by overwriting the cell with a different key, but we must avoid holes: size must decrease when we erase an element, but simply decrementing it would also erase the last element of S, leaving a hole with no element somewhere in S. To avoid this, we move the last record of S into the hole:

T[S[size-1]] ← T[k]
S[T[k]] ← S[size-1]
size ← size - 1
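The whole scheme, including the three deletion assignments, can be sketched in Python (a set-only version; the garbage content of T is simulated with random values, and all names are mine):

```python
import random

class ConstantInitTable:
    """Direct-address table with O(1) initialization cost.
    T is the huge, never-initialized table (simulated here with
    random garbage); S records which keys are actually stored."""

    def __init__(self, M):
        self.T = [random.randrange(M) for _ in range(M)]  # simulated garbage
        self.S = []        # S[i] = key of the i-th stored record
        self.size = 0

    def search(self, k):
        i = self.T[k]
        # k is present iff T[k] is a valid index of S that points back at k
        return i < self.size and self.S[i] == k

    def insert(self, k):
        if not self.search(k):
            self.T[k] = self.size   # T[k] points into S ...
            self.S.append(k)        # ... and S points back at k
            self.size += 1

    def delete(self, k):
        if self.search(k):
            last = self.S[self.size - 1]
            self.T[last] = self.T[k]     # T[S[size-1]] <- T[k]
            self.S[self.T[k]] = last     # S[T[k]] <- S[size-1]
            self.size -= 1
            self.S.pop()
```

The order of the two deletion assignments matters: S[T[k]] must be overwritten only after T[S[size-1]] has been redirected, since both read T[k] and S[size-1] before they are invalidated.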

Hash Tables

Direct-address tables are feasible only when the range of the keys to be stored is not too large. We can use a hash function to map a wide range of keys into a fixed one, which is then used as a direct-address table. The new key is h(k), with h : U → [0, ..., M-1], where U is the universe of keys (generally N).

Collisions: unless |U| ≤ M, there necessarily exist keys k1 ≠ k2 with h(k1) = h(k2). We say that k1 and k2 collide. There are different methods to resolve collisions; let's look at them through an example.

Hash Tables: Example

M = 11
K = [35, 83, 57, 26, 15, 63, 97, 46]
h(k) = k mod M

Hash values:
h(35) = 2, h(83) = 6, h(57) = 2, h(26) = 4, h(15) = 4, h(63) = 8, h(97) = 9, h(46) = 2

Hash Tables: Collisions

Chaining: each record of the table holds a list of the elements mapped to that record by h:

slot 0:  -
slot 1:  -
slot 2:  35 → 57 → 46
slot 3:  -
slot 4:  26 → 15
slot 5:  -
slot 6:  83
slot 7:  -
slot 8:  63
slot 9:  97
slot 10: -
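The chains above can be reproduced in a few lines of Python (keys are appended at the tail of each chain here; a real implementation would more likely prepend, which is O(1) on a linked list):

```python
from collections import defaultdict

M = 11
keys = [35, 83, 57, 26, 15, 63, 97, 46]

# Separate chaining: slot h(k) holds the list of all keys mapping to it.
chains = defaultdict(list)
for k in keys:
    chains[k % M].append(k)

for slot in sorted(chains):
    print(slot, chains[slot])
# 2 [35, 57, 46]
# 4 [26, 15]
# 6 [83]
# 8 [63]
# 9 [97]
```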

Hash Tables: Open Addressing

In open addressing, the idea is that if record h(k) is full, an inspection sequence determines the next record to be checked. The hash function is replaced by H : U × [0, ..., M-1] → [0, ..., M-1] satisfying, for every k: ∀i, j ∈ [0, ..., M-1], H(k, i) = H(k, j) ⟹ i = j. That is, the sequence H(k, 0), H(k, 1), ..., H(k, M-1) must be a permutation of the records, so that the key is eventually stored in a free record if one exists. To store k, we compute H(k, i) starting from i = 0 until we find a free record.

Linear probing: H(k, i) = (h(k) + i) mod 11.

slot: 0  1  2   3   4   5   6   7   8   9   10
key:  -  -  35  57  26  15  83  46  63  97  -
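The linear-probing table above can be checked mechanically. This sketch takes the inspection sequence H(k, i) as a parameter so other sequences can be plugged in (function names are my own):

```python
M = 11
keys = [35, 83, 57, 26, 15, 63, 97, 46]

def insert_all(keys, M, H):
    """Open addressing: probe H(k, 0), H(k, 1), ... until a free slot."""
    table = [None] * M
    for k in keys:
        i = 0
        while table[H(k, i)] is not None:
            i += 1
        table[H(k, i)] = k
    return table

def linear(k, i):
    return (k % M + i) % M

print(insert_all(keys, M, linear))
# [None, None, 35, 57, 26, 15, 83, 46, 63, 97, None]
```

Note how 46 probes slots 2, 3, 4, 5, 6 (all full) before landing in 7: that run of consecutive full slots is exactly the clustering discussed next.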

Hash Tables: Open Addressing

With linear probing, keys cluster around the records that have experienced collisions. We would like a scheme that places elements not only in the records closest to h(k).

Quadratic probing: H(k, i) = (h(k) + i²) mod 11.

slot: 0   1  2   3   4   5   6   7  8   9   10
key:  46  -  35  57  26  15  83  -  63  97  -

Clustering is gone, but keys that collide on the same initial slot still follow the same inspection sequence, which may lead to longer searches for a free record. We need an inspection sequence that depends on the key itself!

Hash Tables: Open Addressing

Double hashing: employ a second hash function, h2(k) = 1 + (k mod (M-1)) = 1 + (k mod 10), and set H(k, i) = (h(k) + i·h2(k)) mod 11.

Hash values:
k:     35  83  57  26  15  63  97  46
h2(k): 6   4   8   7   6   4   8   7

Hash table:
slot: 0  1   2   3  4   5   6   7  8   9   10
key:  -  46  35  -  26  15  83  -  63  97  57
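Both probe sequences can be verified the same way; this sketch is self-contained, so it repeats the generic insertion helper (all names are mine):

```python
M = 11
keys = [35, 83, 57, 26, 15, 63, 97, 46]

def insert_all(keys, M, H):
    """Open addressing: probe H(k, 0), H(k, 1), ... until a free slot."""
    table = [None] * M
    for k in keys:
        i = 0
        while table[H(k, i)] is not None:
            i += 1
        table[H(k, i)] = k
    return table

def quadratic(k, i):
    return (k % M + i * i) % M

def double_hashing(k, i):
    h2 = 1 + (k % (M - 1))          # second hash, never 0
    return (k % M + i * h2) % M

print(insert_all(keys, M, quadratic))
# [46, None, 35, 57, 26, 15, 83, None, 63, 97, None]
print(insert_all(keys, M, double_hashing))
# [None, 46, 35, None, 26, 15, 83, None, 63, 97, 57]
```

One caveat worth knowing: for prime M, (h(k) + i²) mod M is guaranteed to visit only about half of the slots, so quadratic probing is not a full permutation of the records; double hashing with h2(k) never a multiple of M avoids this issue in practice.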

Hash Tables: In-Place Chaining

Problem: we want to implement a hash table where collisions are resolved by chaining, but without allocating external memory for the elements: they must be stored in the free slots of the table itself. Suppose each slot of the table can store a flag plus either an element followed by a pointer, or two pointers. We keep a list of all the free slots. All the operations on the hash table and on the free list must be O(1).

Idea: each slot stores an element together with a pointer to the slot holding the next element in its chain of collisions; the chain ends when this pointer is NIL. How do we perform INSERT, DELETE and SEARCH in O(1)?

Hash Tables: In-Place Chaining

INSERT - 1. The boolean flag tells us whether a slot is empty. If the slot is empty, we set the flag to true, store the element and set the pointer to NIL. If the slot is full, there are two cases:
1. a chain of elements starts in this slot: the stored element is in the right slot;
2. the stored element belongs to a chain starting in another slot: it is in the wrong slot.
How do we distinguish the two cases? From the hash of the element already stored there. Suppose we have to insert element e into slot l, which already holds an element with key k.

Hash Tables: In-Place Chaining

INSERT - 2. If h(k) = l, we are in case 1, so we insert e at the head of the chain: we pop the first slot from the free list, copy the element currently in l into it, then overwrite l with e and set its pointer to the slot holding the copied element.

Conversely, if h(k) ≠ l, we must start a new chain in slot l, moving the stored element elsewhere: we pop the first slot from the free list, copy the element currently in l into it, and then insert e into l as if it were empty. There is a problem, however: the element that was in l belonged to a chain, so the previous element of that chain still points to l, which no longer holds it. Therefore we must also update that pointer to the new slot.

Hash Tables: In-Place Chaining

DELETE. We need to know whether the element is the head of a chain, and again we can use the value h(k): from INSERT we know that the first element of a chain of collisions is stored in the right slot, i.e. a slot l such that h(k) = l.

If h(k) = l, save the pointer to the second element of the chain, then replace the whole slot with the content of the pointed slot; the slot being freed (saved before overwriting) becomes the new head of the free list. If h(k) ≠ l, simply erase the content of the slot and move it to the head of the free list. As before, we have changed the content of a slot in the middle of a chain, so we must update the pointer of the previous element; we find it by searching from slot h(k).
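A minimal sketch of the in-place chaining scheme, covering INSERT and SEARCH only (DELETE follows the case analysis above). The free slots form a doubly linked list, using the "two pointers" slot layout, so that an arbitrary free slot can be unlinked in O(1); all names are mine. Note that, as for DELETE above, fixing the predecessor pointer in case 2 requires walking the chain from its head, so that step is O(chain length) rather than strictly O(1).

```python
NIL = None  # end-of-chain / end-of-list marker

class InPlaceChainingTable:
    """Chained hashing stored entirely inside the table's own slots."""

    def __init__(self, m):
        self.m = m
        self.key = [NIL] * m            # element stored in each slot
        self.next = [NIL] * m           # next slot in chain, or in free list
        self.prev = [NIL] * m           # previous slot, used only while free
        self.occupied = [False] * m     # the boolean flag of the slides
        # initially every slot is free: link 0 <-> 1 <-> ... <-> m-1
        for i in range(m):
            self.next[i] = i + 1 if i + 1 < m else NIL
            self.prev[i] = i - 1 if i > 0 else NIL
        self.free_head = 0

    def _h(self, k):
        return k % self.m

    def _unlink_free(self, l):
        # O(1) removal of slot l from the doubly linked free list
        p, n = self.prev[l], self.next[l]
        if p is not NIL: self.next[p] = n
        else:            self.free_head = n
        if n is not NIL: self.prev[n] = p

    def _pop_free(self):
        l = self.free_head
        self._unlink_free(l)
        return l

    def insert(self, k):
        l = self._h(k)
        if not self.occupied[l]:                  # empty slot: start a chain
            self._unlink_free(l)
            self.key[l], self.next[l], self.occupied[l] = k, NIL, True
        elif self._h(self.key[l]) == l:           # case 1: chain starts at l
            new = self._pop_free()                # move the old head to 'new'
            self.key[new], self.next[new] = self.key[l], self.next[l]
            self.occupied[new] = True
            self.key[l], self.next[l] = k, new    # k becomes the new head
        else:                                     # case 2: wrong-slot element
            new = self._pop_free()                # relocate it to 'new'
            self.key[new], self.next[new] = self.key[l], self.next[l]
            self.occupied[new] = True
            p = self._h(self.key[l])              # fix its predecessor's pointer,
            while self.next[p] != l:              # walking from the chain's head
                p = self.next[p]
            self.next[p] = new
            self.key[l], self.next[l] = k, NIL    # start k's chain at l

    def search(self, k):
        l = self._h(k)
        if not self.occupied[l] or self._h(self.key[l]) != l:
            return False                          # no chain starts at h(k)
        while l is not NIL:
            if self.key[l] == k:
                return True
            l = self.next[l]
        return False
```

Running it on the example keys reproduces the chains of the chaining slide (35 → 57 → 46 rooted at slot 2, and so on), with every element living inside the 11-slot table.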