Algorithms and Data Structures

Similar documents
COMP171. Hashing.

Hash Table and Hashing

AAL 217: DATA STRUCTURES

Hashing Techniques. Material based on slides by George Bebis

Algorithms and Data Structures

Worst-case running time for RANDOMIZED-SELECT

Hashing. Hashing Procedures

! A Hash Table is used to implement a set, ! The table uses a function that maps an. ! The function is called a hash function.

Algorithms and Data Structures

III Data Structures. Dynamic sets

Algorithms and Data Structures

COSC160: Data Structures Hashing Structures. Jeremy Bolton, PhD Assistant Teaching Professor

HASH TABLES.

5. Hashing. 5.1 General Idea. 5.2 Hash Function. 5.3 Separate Chaining. 5.4 Open Addressing. 5.5 Rehashing. 5.6 Extendible Hashing. 5.

Topic HashTable and Table ADT

CSE 214 Computer Science II Searching

HashTable CISC5835, Computer Algorithms CIS, Fordham Univ. Instructor: X. Zhang Fall 2018

Introducing Hashing. Chapter 21. Copyright 2012 by Pearson Education, Inc. All rights reserved

Chapter 9: Maps, Dictionaries, Hashing

Open Addressing: Linear Probing (cont.)

DATA STRUCTURES AND ALGORITHMS

Algorithms with numbers (2) CISC4080, Computer Algorithms CIS, Fordham Univ.! Instructor: X. Zhang Spring 2017

Algorithms with numbers (2) CISC4080, Computer Algorithms CIS, Fordham Univ. Acknowledgement. Support for Dictionary

9/24/ Hash functions

Hash Tables Outline. Definition Hash functions Open hashing Closed hashing. Efficiency. collision resolution techniques. EECS 268 Programming II 1

Fundamental Algorithms

HASH TABLES. Goal is to store elements k,v at index i = h k

Acknowledgement HashTable CISC4080, Computer Algorithms CIS, Fordham Univ.

General Idea. Key could be an integer, a string, etc e.g. a name or Id that is a part of a large employee structure

TABLES AND HASHING. Chapter 13

Introduction. hashing performs basic operations, such as insertion, better than other ADTs we ve seen so far

Data Structures and Algorithms. Roberto Sebastiani

Dictionaries and Hash Tables

Dynamic Dictionaries. Operations: create insert find remove max/ min write out in sorted order. Only defined for object classes that are Comparable

Data Structures and Algorithms. Chapter 7. Hashing

CS 350 : Data Structures Hash Tables

Hashing. Dr. Ronaldo Menezes Hugo Serrano. Ronaldo Menezes, Florida Tech

Tables. The Table ADT is used when information needs to be stored and acessed via a key usually, but not always, a string. For example: Dictionaries

CMSC 341 Hashing (Continued) Based on slides from previous iterations of this course

Today s Outline. CS 561, Lecture 8. Direct Addressing Problem. Hash Tables. Hash Tables Trees. Jared Saia University of New Mexico

Fast Lookup: Hash tables

Tirgul 7. Hash Tables. In a hash table, we allocate an array of size m, which is much smaller than U (the set of keys).

Dictionary. Dictionary. stores key-value pairs. Find(k) Insert(k, v) Delete(k) List O(n) O(1) O(n) Sorted Array O(log n) O(n) O(n)

CS 241 Analysis of Algorithms

CHAPTER 9 HASH TABLES, MAPS, AND SKIP LISTS

Question Bank Subject: Advanced Data Structures Class: SE Computer

Chapter 5 Hashing. Introduction. Hashing. Hashing Functions. hashing performs basic operations, such as insertion,

Data Structures And Algorithms

lecture23: Hash Tables

DATA STRUCTURES/UNIT 3

Hashing. CptS 223 Advanced Data Structures. Larry Holder School of Electrical Engineering and Computer Science Washington State University

Hashing. 1. Introduction. 2. Direct-address tables. CmSc 250 Introduction to Algorithms

Symbol Table. Symbol table is used widely in many applications. dictionary is a kind of symbol table data dictionary is database management

Lecture 16 More on Hashing Collision Resolution

Hash Tables: Handling Collisions

CSE100. Advanced Data Structures. Lecture 21. (Based on Paul Kube course materials)

HO #13 Fall 2015 Gary Chan. Hashing (N:12)

UNIT III BALANCED SEARCH TREES AND INDEXING

CPSC 259 admin notes

Cpt S 223. School of EECS, WSU

Dictionaries and Hash Tables

Hash Open Indexing. Data Structures and Algorithms CSE 373 SP 18 - KASEY CHAMPION 1

Hash[ string key ] ==> integer value

Hashing. October 19, CMPE 250 Hashing October 19, / 25

Hash table basics mod 83 ate. ate. hashcode()

Hash Tables. Hashing Probing Separate Chaining Hash Function

CSI33 Data Structures

Data Structure Lecture#22: Searching 3 (Chapter 9) U Kang Seoul National University

Lecture 18. Collision Resolution

Today: Finish up hashing Sorted Dictionary ADT: Binary search, divide-and-conquer Recursive function and recurrence relation

Algorithms in Systems Engineering ISE 172. Lecture 12. Dr. Ted Ralphs

Lecture 17: Hash Tables, Maps, Finish Linked Lists

Advanced Algorithmics (6EAP) MTAT Hashing. Jaak Vilo 2016 Fall

Lecture 16. Reading: Weiss Ch. 5 CSE 100, UCSD: LEC 16. Page 1 of 40

Hash Table. A hash function h maps keys of a given type into integers in a fixed interval [0,m-1]

Hashing Algorithms. Hash functions Separate Chaining Linear Probing Double Hashing

CITS2200 Data Structures and Algorithms. Topic 15. Hash Tables

Dictionaries-Hashing. Textbook: Dictionaries ( 8.1) Hash Tables ( 8.2)

Data Structures. Topic #6

Hash table basics. ate à. à à mod à 83

CSCD 326 Data Structures I Hashing

CS 561, Lecture 2 : Randomization in Data Structures. Jared Saia University of New Mexico

Chapter 1 Disk Storage, Basic File Structures, and Hashing.

Hashing. 5/1/2006 Algorithm analysis and Design CS 007 BE CS 5th Semester 2

CS301 - Data Structures Glossary By

Lecture 4. Hashing Methods

Hash table basics mod 83 ate. ate

CMSC 341 Lecture 16/17 Hashing, Parts 1 & 2

This lecture. Iterators ( 5.4) Maps. Maps. The Map ADT ( 8.1) Comparison to java.util.map

Hash Tables. Gunnar Gotshalks. Maps 1

csci 210: Data Structures Maps and Hash Tables

Chapter 17. Disk Storage, Basic File Structures, and Hashing. Records. Blocking

CS1020 Data Structures and Algorithms I Lecture Note #15. Hashing. For efficient look-up in a table

Data and File Structures Chapter 11. Hashing

Lecture 7: Efficient Collections via Hashing

Maps, Hash Tables and Dictionaries. Chapter 10.1, 10.2, 10.3, 10.5

Table ADT and Sorting. Algorithm topics continuing (or reviewing?) CS 24 curriculum

Lecture Notes on Hash Tables

HASH TABLES. Hash Tables Page 1

STRUKTUR DATA. By : Sri Rezeki Candra Nursari 2 SKS

Transcription:

Lesson 4: Sets, Dictionaries and Hash Tables Luciano Bononi <bononi@cs.unibo.it> http://www.cs.unibo.it/~bononi/ (slide credits: these slides are a revised version of slides created by Dr. Gabriele D Angelo) International Bologna Master in Bioinformatics University of Bologna 29/04/2011, Bologna Outline of the lesson Sets Definition Abstract Data Type (ADT) Implementions Dictionaries Definition Hash tables 2 1

Sets: definition DEFINITION: a Set is a collection (or family) of elements that are of the same type The elements in a set are also called components or members Sets are one of the most fundamental concepts in mathematics The elements of a set are homogeneous There is no notion of order among elements Multiple instances of the same element are not allowed 3 Set: Abstract Data Type DEFINITION: a set S is an unordered collection of elements from a universe an element cannot appear more than once in S The cardinality of S is the number of elements in S an empty set is a set whose cardinality is zero a singleton is a set whose cardinality is one 4 2

Set ADT: some operations S, T = sets e = element Creates a new Set Given (e) returns {e} Returns S union T Returns S intersection T Returns the complement of S Returns True if e is in S, Create(S) Singleton(e) Union(S, T) Intersection(S, T) Complement(S) ElementOf(e, S) otherwise returns False Returns the cardinality of S Cardinality(S) 5 Set ADT: implementation As usual we have many possible implementations that are suitable for the same ADT: each one with PROs and CONs CHOICE 1: Boolean vector CHOICE 2: Unordered list CHOICE 3: Ordered List 6 3

Set ADT: implementation with Boolean vectors CHOICE 1: Boolean vector each Boolean element in the vector (true / false) represents an element of the set SET T F F T F T F F ELEMENTS 0 1 2 3 k N-1 Advantages and disadvantages: Memory footprint? Implementation of ADT operations Cost of the operations? 7 Set ADT: implementation with unsorted lists CHOICE 2: Unsorted List Singly or doubly linked? Head 27 Next 3 Next 99 Next Advantages and disadvantages: Memory footprint? Implementation of ADT operations Cost of the operations? 8 4

Set ADT: implementation with sorted lists CHOICE 3: Sorted List Head 3 Next 27 Next 99 Next Advantages and disadvantages: Memory footprint? Implementation of ADT operations Cost of the operations? 9 Dictionary: definition DEFINITION: a Dictionary is a special kind of set Only few operations are allowed: Find an element Insert a new element Delete an existing element The elements of a dictionary are usually called keys Many different implementations are suitable, in the following we ll see the hash tables 10 5

Dictionary: goal GOAL: to define a data structure that permits an efficient implementation of the find(), insert() and delete() operations The universe cardinality is often huge It often not known a priori the number of elements to accommodate in the dictionary Dictionaries are very often used in software, for example: compilers and databases 11 Hash tables: unrealistic solution The hash table T is an array of pointers Each position (slot) in the hash table (T) corresponds to a key in the universe of keys T[k] corresponds to an element with key k If the set contains no element with key k, then T[k]=NULL 12 6

Hash tables: unrealistic solution Insert(), Delete() and Find() all take O(1) (worst-case) time PROBLEM: The scheme wastes too much space if the universe is large compared with the actual number of elements to be stored In many real world cases this solution is impossible to implement (i.e. each time the cardinality of the universe is quite large) 13 Hash tables: the concept of hashing Usually, m << N h(k i ) = an integer in [0,, m-1] called the hash value of K i 14 7

Hash tables: hash function definition With hashing, an element having key k is stored in T[h(k)] h: hash function definition maps the universe U of keys into the slots of a hash table T[0,1,...,m-1] an element of key k hashes to slot h(k) h(k) is the hash value of key k 15 Hash tables: collisions PROBLEM: collision two keys may hash to the same slot can we ensure that any two distinct keys get different cells? No, if U >m, where m is the size of the hash table 1. Design a good hash function that is fast to compute and minimize the number of collisions 2. Design a method to resolve the collisions when they occur 16 8

Hash tables: hash function examples The division method h(k) = k mod m e.g. m=12, k=100, h(k)=4 Requires only a single division operation (quite fast) Certain values of m should be avoided e.g. if m=2p, then h(k) is just the p lowest-order bits of k; the hash function does not depend on all the bits Similarly, if the keys are decimal numbers, should not set m to be a power of 10 It s a good practice to set the table size m to be a prime number Good values for m: primes not too close to exact powers of 2 e.g. the hash table is to hold 2000 numbers, and we don t mind an average of 3 numbers being hashed to the same entry choose m=701 17 Hash tables: hash function examples Can the keys be strings? Most hash functions assume that the keys are natural numbers If keys are not natural numbers, a way must be found to interpret (translate) them as natural numbers EXAMPLE Add up the ASCII value of the characters in the string Problems: different permutations of the same set of characters would have the same hash value the keys are well distributed in the table? 18 9

Hash tables: open addressing Open addressing hash tables store the records directly within the array (also called closed hashing) Open addressing: relocate the key K to be inserted if it collides with an existing key. That is, we store K at an entry different from T[h(K)] Two issues arise: what is the relocation scheme? how to search for K later? Three common methods for resolving a collision in open addressing: 1) linear probing, 2) quadratic probing, 3) double hashing 19 Hash tables: open addressing To insert a key K, compute h 0 (K) if T[h 0 (K)] is empty, insert it there If collision occurs, probe alternative cell h 1 (K), h 2 (K),... until an empty cell is found h i (K) = (hash(k) + f(i)) mod m, with f(0) = 0 The function f(): is called collision resolution strategy 20 10

Hash tables: open addressing, linear probing f(i) =i cells are probed sequentially (with wraparound) h i (K) = (hash(k) + i) mod m Insertion: Let K be the new key to be inserted. We compute hash(k) for i = 0 to m-1 compute L = ( hash(k) + i ) mod m if T[L] is empty, then we put K there and stop If we cannot find an empty entry to put K, it means that the table is full and we should report an error 21 Hash tables: exercise Given an hash table that is initially empty and with: size = 17 H(K) = k mod 17 and with linear probing Show the content of the hash table: after inserting the keys: B, I, O, I, N, F, O, R, M, A, T, I, C, S after deleting the keys: I, S and finally after inserting the keys: C, O, O, L 22 11

Hash tables: open addressing, clustering Quadratic probing f(i) = i 2 hi(k) = ( hash(k) + i 2 ) mod m A block of contiguously occupied table entries is a cluster On the average, when we insert a new key K, we may hit the middle of a cluster. Therefore, the time to insert K would be proportional to half the size of a cluster. That is, the larger the cluster, the slower the performance Linear probing and quadratic probing are both affected by problems of clustering (primary and secondary clustering) 23 Hash tables: open addressing, double hashing To alleviate the problem of clustering, the sequence of probes for a key should be independent of its primary position => use two hash functions: hash() and hash2() f(i) = i * hash2(k) E.g. hash2(k) = R - (K mod R), with R is a prime number smaller than m 24 12

Hash tables: deletion in open addressing How is it possible to implement deletion in open addressing? In open addressing the deletion of a key can not be implemented simply erasing the key from the table, otherwise this will isolate records further down the probe sequence SOLUTION: add an extra bit to each table entry, and mark a deleted slot by storing a special value DELETED Using this approach, how is it possible to implement the Find(), Insert() and Delete() operations? 25 Hash tables: separate chaining Instead of a hash table, we use a table of linked lists: keep a linked list of keys that hash to the same value 26 13

Hash tables: separate chaining To insert a key K compute h(k) to determine which list to traverse if T[h(K)] contains a null pointer, initialize this entry to point to a linked list that contains K alone if T[h(K)] is a non-empty list, we add K at the beginning of this list To delete a key K compute h(k), then search for K within the list at T[h(K)] delete K if it is found 27 Hash tables: separate chaining Assume that we will be storing n keys. Then we should make m the next larger prime number. If the hash function works well, the number of keys in each linked list will be a small constant Therefore, we expect that each search, insertion, and deletion can be done in constant time. The value of this constant depends on the average length of lists Disadvantage: memory allocation in linked list manipulation will slow down the program Advantage: deletion is easy 28 14

Hash tables: separate chaining, exercise EXERCISE: in a dictionary some keys are accessed much more frequently than others (i.e. in Italian some words are much more frequently used than others) The dictionary is implemented using and hash table with separate chaining Is it possible to modify the hash table to reduce the cost of the Find() operation in the average case? SUGGESTION: implement a very simple heuristic to modify the list management and to speed up accesses to the data structure 29 References Part of this material is inspired / taken by the following freely available resources: http://www.cs.ust.hk/~huamin/comp171/index.htm 30 15

Algorithms and Data Structures 2010-2011 Lesson 4: Sets, Dictionaries and Hash Tables Luciano Bononi <bononi@cs.unibo.it> http://www.cs.unibo.it/~bononi/ (slide credits: these slides are a revised version of slides created by Dr. Gabriele D Angelo) International Bologna Master in Bioinformatics University of Bologna 29/04/2011, Bologna 16