Hashing
CmSc 250 Introduction to Algorithms

1. Introduction

Hashing is a method of storing elements in a table in a way that reduces the time for search. Elements are assumed to be records with several fields. One of the fields is called the "key" - the field that we use to search. For example, in a student record the key might be the student's ID number. In a database with personal records the key might be the SSN. In a lexical database (electronic dictionary) the key might be the word itself.

Idea: map the keys to indexes in an array (table).

Array elements are accessed by index. If we can find a mapping between the search keys and indices, then we can store each record in the element with the corresponding index. Thus each record would be found with one operation only.

Example: To keep information about 1000 students, we can assign each student an identification number between 0 and 999, and then use an array of 1000 elements. Suppose, however, that we want to use the SSN of each student as the key. An SSN is a 9-digit number: used directly as an index, it would require a table with far more elements than the number of students, which is a great waste of space. But if we had a mapping between the 1000 SSNs of the students and the numbers 0-999, then we could use the SSNs as search keys and still store the records in an array of 1000 elements.

Approach: directly reference records in a table by doing arithmetic operations on keys to map them onto table addresses.

Advantage: the records can be referenced directly - ideally the search time is a constant, complexity O(1).

Question: how do we find such a correspondence?

Answers:
- direct-address tables
- hash tables

2. Direct-address tables

Direct-address tables are the most elementary form of hashing. Assumption: a direct one-to-one correspondence between the keys and the numbers 0, 1, ..., m-1, with m not very large. Searching is fast, but there is a cost: the size of the array is determined by the largest key, so the method is not very useful if the keys are few but widely distributed.
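A minimal sketch of a direct-address table in C (the record layout, the payload field, and the key range 0..999 are illustrative assumptions, not from the notes): because each key is itself the index, insert, search, and delete all take one array access.

#include <stddef.h>

#define MAX_KEY 1000                 /* keys assumed to lie in 0..999 */

typedef struct {
    int key;
    const char *name;                /* example payload field */
} Record;

static Record *table[MAX_KEY];       /* slot i holds the record with key i */

void da_insert(Record *r)  { table[r->key] = r; }
Record *da_search(int key) { return table[key]; }   /* NULL if absent */
void da_delete(int key)    { table[key] = NULL; }

The cost noted above is visible here: the array has MAX_KEY slots regardless of how many records are actually stored.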

3. Hash Functions

Hash function: a function that transforms the search key into a table address. Hash functions transform the keys into numbers within a predetermined interval. These numbers are then used as indices in an array (table, hash table) to store the records (keys and pointers).

3.1. Keys - numbers

If M is the size of the array, then the hash function h(key) can be computed in this way: h(key) = key % M. This maps all the keys into numbers within the interval [0, M-1].

3.2. Keys - strings of characters

Treat the binary representation of a key as a number, and then apply 3.1. Here is how keys are treated as numbers: if each character is represented with p bits, then the string can be treated as a base-2^p number.

Example:

    A     K     E     Y
  00001 01011 00101 11001

= 1*32^3 + 11*32^2 + 5*32^1 + 25*32^0 = 44217

If the keys are very long, an overflow may occur. A solution to this is to apply Horner's method in computing the hash function.

Horner's method - representation of polynomials:

a_n*x^n + a_(n-1)*x^(n-1) + a_(n-2)*x^(n-2) + ... + a_1*x + a_0
    = x(x(...x(x(a_n*x + a_(n-1)) + a_(n-2)) + ...) + a_1) + a_0

Example: 4x^5 + 2x^4 + 3x^3 + x^2 + 7x + 9 = x(x(x(x(4x + 2) + 3) + 1) + 7) + 9

The polynomial can be computed by alternating the multiplication and addition operations.

    V     E     R     Y     L     O     N     G     K     E     Y
  10110 00101 10010 11001 01100 01111 01110 00111 01011 00101 11001
   22     5    18    25    12    15    14     7    11     5    25

22*32^10 + 5*32^9 + 18*32^8 + 25*32^7 + 12*32^6 + 15*32^5 + 14*32^4 + 7*32^3 + 11*32^2 + 5*32^1 + 25*32^0
= (((((((((22*32 + 5)*32 + 18)*32 + 25)*32 + 12)*32 + 15)*32 + 14)*32 + 7)*32 + 11)*32 + 5)*32 + 25

With this representation we can compute the hash function by applying the mod operation at each step, thus avoiding overflow. First we compute h_0 = (22*32 + 5) % M, then h_1 = (32*h_0 + 18) % M, then h_2 = (32*h_1 + 25) % M, etc.

Here is the code of a hash function using base = 32. The key is a string of characters stored in an array key of length Key_Length; tbl_size is the size of the table.

enum { Key_Length = 11 };      /* length of the (fixed-length) key strings */
static int tbl_size = 101;     /* size of the table, ideally prime         */

int hash32(char key[Key_Length]) {
    int h = 0;
    for (int i = 0; i < Key_Length; i++)
        h = (32 * h + key[i]) % tbl_size;   /* one Horner step, with mod */
    return h;
}
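To see that reducing modulo M at every step is sound, the sketch below (the values M = 101 and the a = 1..26 letter codes are illustrative assumptions) computes the base-32 value of "AKEY" directly, and then with Horner's method taking the mod at each step; both yield the same residue.

#include <stdio.h>

int main(void) {
    int digits[] = {1, 11, 5, 25};   /* A K E Y in the a=1..z=26 encoding */
    int n = 4, M = 101;              /* M: example (prime) table size     */

    long direct = 0;                 /* full base-32 value, one mod at end */
    for (int i = 0; i < n; i++)
        direct = direct * 32 + digits[i];

    int horner = 0;                  /* mod applied at every Horner step */
    for (int i = 0; i < n; i++)
        horner = (32 * horner + digits[i]) % M;

    printf("%ld mod %d = %ld\n", direct, M, direct % M);  /* 44217 mod 101 = 80 */
    printf("horner       = %d\n", horner);                /* 80 as well         */
    return 0;
}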

4. Hash Tables

Once we have found a method of mapping keys to indexes, the questions to be solved are how to choose the size of the table (array) that stores the records, and how to perform the basic operations:

- insert
- search
- delete

4.1. Basic concepts

Let N be the number of records to be stored and M the size of the array (hash table). The integer between 0 and M-1 generated by the hash function is used as an index in a hash table of M elements.

Initially all slots in the table are blank. This is shown either by a sentinel value or by a special field in each slot.

To insert: use the hash function to generate an address for each value to be inserted. To search for a key in the table: the same hash function is used. To delete a record with a given key: first apply the search method, and when the key is found, delete the record.

Size of the table: Ideally we would like to store N records in a table of size N. However, in many cases we don't know in advance the exact number of records. Also, the hash function can map two keys to one and the same index, so some cells in the array will not be used. Hence we assume that the size of the table can be different from the number of records.

A characteristic of the hash table is its load factor α: the ratio between the number of records to be stored and the size of the table, α = N/M. The method of choosing the size of the table depends on the chosen method of collision resolution, discussed below. M should be a prime number: it has been proved that when M is prime we obtain a better (more even) distribution of the keys over the table. (A small sketch for choosing such an M follows at the end of this subsection.)

4.2. Collision resolution

Problem: Obviously, a mapping from a potentially huge set of strings to a small set of integers cannot be unique. The hash function maps keys into indices in many-to-one fashion. Hashing a second key into a previously used slot is called a collision.

Collision resolution deals with keys that are mapped to the same address. Methods:

- Separate chaining
- Open addressing:
  - Linear probing
  - Quadratic probing
  - Double hashing
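As promised above, here is a minimal sketch for picking a prime table size (trial division; the helper names is_prime and next_prime are ours, not from the notes):

#include <stdbool.h>

/* Trial-division primality test - adequate for table-size magnitudes. */
static bool is_prime(int n) {
    if (n < 2) return false;
    for (int d = 2; (long)d * d <= n; d++)
        if (n % d == 0) return false;
    return true;
}

/* Smallest prime >= n: call with the desired minimum table size. */
int next_prime(int n) {
    while (!is_prime(n)) n++;
    return n;
}

For example, next_prime(1000) returns 1009, a suitable M for storing roughly a thousand records.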

4.2.1. Separate chaining

Invented by H. P. Luhn, an IBM engineer, in January 1953.

Idea: Keys hashing to the same address are kept in lists attached to that address. For each table address, a linked list of the records whose keys hash to that address is built. This method is useful for highly dynamic situations, where the number of search keys cannot be predicted in advance.

Example (each character is a key):

Key:           A S E A R C H I N G E X A M P L E
Hash (M = 11): 1 8 5 1 7 3 8 9 3 7 5 2 1 2 5 1 5

Separate chaining (each chain is a linked list; insertion is at the head, so the newest element comes first):

 0: -
 1: L -> A -> A -> A
 2: M -> X
 3: N -> C
 4: -
 5: E -> P -> E -> E
 6: -
 7: G -> R
 8: H -> S
 9: I
10: -

Thus we maintain M lists with M list header nodes, and we can use the usual list insert and search procedures for sequential searching (see the sketch below).

Complexity of separate chaining: The time to compute the index of a given key is a constant; after that we have to search a list for the record, so the time depends on the length of the list. It has been shown empirically that on average the list length is N/M (the load factor α), provided M is prime and we use a function that gives a good distribution. Unsuccessful searches go to the end of some list, hence they take about α comparisons. Successful searches are expected to go half way down some list, so on average a successful search takes α/2 comparisons. Therefore we can say that the runtime complexity of separate chaining is O(α). Note that what really matters is the load factor, rather than the size of the table or the number of records taken separately.
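A minimal sketch of separate chaining in C (integer keys; the node layout and M = 11 are illustrative assumptions). Insertion prepends to the chain, which matches the newest-first chains drawn above.

#include <stdlib.h>

#define M 11                          /* table size, prime */

typedef struct Node {
    int key;
    struct Node *next;
} Node;

static Node *table[M];                /* M list headers, initially NULL */

static int hash(int key) { return key % M; }

void insert(int key) {                /* prepend: newest element first */
    Node *n = malloc(sizeof *n);
    n->key  = key;
    n->next = table[hash(key)];
    table[hash(key)] = n;
}

Node *search(int key) {               /* sequential search of one chain */
    for (Node *p = table[hash(key)]; p != NULL; p = p->next)
        if (p->key == key) return p;
    return NULL;                      /* reached end of list: not found */
}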

How to choose M in separate chaining? Since the method is used in cases when we cannot predict the number of records in advance, the choice of M basically depends on other factors, such as available memory. Typically M is chosen relatively small, so as not to use up a large area of contiguous memory, but large enough that the lists are short, for more efficient sequential search. Recommendations in the literature vary from M being about one tenth of N (the number of records) to M being equal, or close, to N.

Other possibilities for organizing the chains:

- Keep the lists ordered: useful if there are many more searches than inserts, and if most of the searches are unsuccessful.
- Represent the chains as binary search trees: the extra effort needed makes this inefficient.

Advantages of separate chaining: used when memory is of concern; easy to implement.

Disadvantages: in case of unevenly distributed keys, there may be long lists alongside many empty slots in the table.

4.2.2. Open Addressing

Invented independently by A. P. Ershov and W. W. Peterson in 1957.

Idea: Store collisions in the hash table itself. There are different ways to implement this idea. If a collision occurs, the next probes are performed following the formula

h_i(x) = (hash(x) + f(i)) mod TableSize

where:
- h_i(x) is an index in the table at which to try to insert x
- hash(x) is the hash function
- f(i) is the collision resolution function
- i is the number of the current attempt to insert the element

Open addressing methods are distinguished by the type of f(i):

- Linear probing: f(i) = i
- Quadratic probing: f(i) = i^2
- Double hashing: f(i) = i * hash2(x), where hash2(x) is a second hash function

Problem with delete: a special flag is needed to distinguish deleted from empty positions. It is necessary for the search function: if we come to a deleted position, the search has to continue, because the deletion might have been done after the insertion of the sought key - the sought key might be further along in the table (see the sketch below).

The total amount of memory space is smaller than with chaining, since no pointers are maintained.
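The probe loop can be written directly from the formula above. A minimal sketch in C (integer keys, with f(i) = i, i.e. the linear probing of the next subsection; the slot-state flags implement the deletion marking just described; all names are illustrative):

#define M 19                              /* table size */

typedef enum { EMPTY, OCCUPIED, DELETED } State;

static int   keys[M];
static State state[M];                    /* static: all EMPTY initially */

static int hash(int key) { return key % M; }

/* Try h_i(x) = (hash(x) + f(i)) mod M for i = 0, 1, 2, ... */
int insert(int key) {
    for (int i = 0; i < M; i++) {
        int pos = (hash(key) + i) % M;
        if (state[pos] != OCCUPIED) {     /* EMPTY or DELETED: slot reusable */
            keys[pos]  = key;
            state[pos] = OCCUPIED;
            return pos;
        }
    }
    return -1;                            /* table full */
}

/* Search skips DELETED slots and stops only at an EMPTY one. */
int search(int key) {
    for (int i = 0; i < M; i++) {
        int pos = (hash(key) + i) % M;
        if (state[pos] == EMPTY) return -1;   /* truly absent */
        if (state[pos] == OCCUPIED && keys[pos] == key) return pos;
    }
    return -1;
}

void delete_key(int key) {                /* lazy deletion via the flag */
    int pos = search(key);
    if (pos >= 0) state[pos] = DELETED;
}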

4.2.2.1. Linear probing (linear hashing, sequential probing)

f(i) = i

Insert: When there is a collision, we just probe the next slot in the table. If it is unoccupied, we store the key there; if it is occupied, we continue probing the next slot.

Search: If the key hashes to a position that is occupied and there is no match, we probe the next position:

a) match - successful search
b) empty position - unsuccessful search
c) occupied and no match - continue probing

When the end of the table is reached, the probing continues from the beginning, until the original starting position is reached.

Disadvantage: primary clustering. Large clusters tend to build up: if an empty slot is preceded by i filled slots, the probability that this empty slot is the next one to be filled is (i+1)/M, while if the preceding slot is empty the probability is only 1/M. This means that when the table begins to fill up, many other slots are examined, and linear probing runs slowly for nearly full tables.

Example: h_i(x) = (hash(x) + f(i)) mod TableSize = (hash(x) + i) mod TableSize, where i is the number of the insertion attempt (starting with 0).

Key:           A S E A R C H I N G E X A M P L E
Hash (M = 19): 1 0 5 1 18 3 8 9 14 7 5 5 1 13 16 12 5

The collisions are resolved as follows: the second A finds position 1 occupied and lands at 2; the second E finds 5 occupied and lands at 6; X probes 5, 6, 7, 8, 9 and lands at 10; the third A probes 1, 2, 3 and lands at 4; the last E probes 5, 6, 7, 8, 9, 10 and lands at 11. The final table is:

Index: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
Key:   S A A C A E E G H I  X  E  L  M  N  -  P  -  R

4.2.2.2. Quadratic probing

f(i) = i^2

Similar to linear probing. The difference: instead of using a linear function to advance through the table in search of an empty slot, we use a quadratic function to compute the next index to be probed. In linear probing, if the home position is occupied we check the next position, then the one after that, etc. In quadratic probing, if position hash(x) is occupied we check hash(x)+1, then hash(x)+4, then hash(x)+9, etc. (all mod table size). The idea is to skip over regions of the table with possible clusters.

4.2.2.3. Double hashing

f(i) = i * hash2(x)

The purpose is the same as in quadratic probing: to overcome the disadvantage of clustering. Instead of examining each successive entry following a collided position, we use a second hash function to get a fixed increment for the probe sequence. The second function should be chosen so that the increment and M are relatively prime; otherwise there will be slots that remain unexamined. An example of a second hash function: hash2(x) = R - (x mod R), with R a prime smaller than the table size (see the sketch at the end of this section).

5. Analysis of Open Addressing. Rehashing

The time to hash a key to an address in the table is a constant, O(1). If there were no collisions, that would be both the search time and the time to insert a new record. In the case of collisions, however, we have to count all positions in the hash table that must be probed in order to find the wanted record. The run time in the case of collisions is computed to be O(1/(1-α)) for unsuccessful search and O((1/α) ln(1/(1-α))) for successful search. Thus the run time is determined by the value of the load factor; recall that in open addressing the load factor is less than 1. The smaller the load factor, the faster the search and insertion. A good strategy is α < 0.5, i.e. the size of the table is more than twice the number of records: at α = 0.5, for example, an unsuccessful search probes about 1/(1 - 0.5) = 2 positions on average. If the table is close to full, the search time grows and may become comparable to the table size.
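Returning to double hashing, here is a minimal sketch of the suggested second hash function (R = 7 and M = 19 are illustrative; since x mod R lies in 0..R-1, the increment lies in 1..R, so it is never zero, and with M prime every increment is relatively prime to M):

#define M 19     /* table size, prime             */
#define R 7      /* prime smaller than table size */

static int hash(int key)  { return key % M; }
static int hash2(int key) { return R - (key % R); }  /* fixed increment */

/* i-th probe position for key x: (hash(x) + i * hash2(x)) mod M */
int probe(int key, int i) {
    return (hash(key) + i * hash2(key)) % M;
}

For key 24, for instance, hash(24) = 5 and hash2(24) = 7 - 3 = 4, so the probe sequence is 5, 9, 13, 17, 2, ... rather than the consecutive run 5, 6, 7, ... that linear probing would examine.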

If the load factor exceeds a certain value (greater than 0.5), we do rehashing: build a second table twice as large as the original and rehash into it all the keys of the original table. This is an expensive operation, with running time O(N); however, once it is done, the new hash table will have good performance. (A sketch of rehashing appears after the next section.)

6. Extendible Hashing

Used when the amount of data is too large to fit in main memory, so external storage is used: there are N records in total to store, and one disk block holds M records.

The problem: with ordinary hashing, several disk blocks may have to be examined to find an element - a time-consuming process. With extendible hashing, no more than two blocks are examined.

Idea: Keys are grouped according to the first m bits in their code, and each group is stored in one disk block. If a block becomes full and no more records can be inserted, the group is split into two, and m+1 bits are considered to determine the location of a record.

Example: assume we have 4 groups of keys according to the first two bits, stored in 4 disk blocks, each able to contain 3 records:

00: 00010, 00100
01: 01001, 01010, 01100
10: 10001, 10100
11: 11000, 11010

A new key is to be inserted: 01011. Its block (01) is full, so we start considering 3 bits. Only the full group is split; each of the other groups still occupies a single block, now shared by two directory entries:

000/001: 00010, 00100
010:     01001, 01010, 01011
011:     01100
100/101: 10001, 10100
110/111: 11000, 11010

Size of the directory (the pointers from the bit patterns above to the blocks):

2^D = O(N^(1+1/M) / M)

where D is the number of bits considered, N is the number of records, and M is the number of records per disk block.
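As promised under rehashing above, a minimal sketch in C (integer keys with linear probing, matching the earlier open-addressing sketch; next_prime is the helper sketched in section 4; the Table layout is an illustrative assumption):

#include <stdlib.h>

int next_prime(int n);            /* sizing helper sketched in section 4 */

typedef struct {
    int  *keys;
    char *used;                   /* 1 if the slot is occupied */
    int   size;
} Table;

/* Build a table roughly twice as large and re-insert every key with the
   new table size - an O(N) operation, performed once. */
Table rehash(const Table *old) {
    Table t;
    t.size = next_prime(2 * old->size);
    t.keys = calloc(t.size, sizeof *t.keys);
    t.used = calloc(t.size, sizeof *t.used);

    for (int i = 0; i < old->size; i++) {
        if (!old->used[i]) continue;
        int pos = old->keys[i] % t.size;              /* hash with new size */
        while (t.used[pos]) pos = (pos + 1) % t.size; /* linear probing     */
        t.keys[pos] = old->keys[i];
        t.used[pos] = 1;
    }
    return t;
}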

7. Conclusion

Hashing is the best search method (constant running time) if we don't need to have the records sorted. The choice of the hash function remains the most difficult part of the task and depends very much on the nature of the keys.

Separate chaining or open addressing? If there is enough memory to keep a table more than twice as large as the number of records, open addressing is the preferred method. Separate chaining is used when we don't know in advance the number of records to be stored; it requires additional time for list processing, but it is simpler to implement.

Some application areas: dictionaries, on-line spell checkers, compiler symbol tables.