Hash Table and Hashing


The tree structures discussed so far assume that we can only work with input keys by comparing them; no other operation is considered. In practice, an input key can often be decomposed into smaller units:

1. Integers consist of digits.
2. Strings consist of letters.

Moreover, we can perform operations on these smaller units. For example, we can perform arithmetic operations on the digits of an integer. Even for strings, we can map each letter to its ASCII code and then perform arithmetic operations on the numeric codes. We can also use these values as array indices.

In general, the universe is U = {u_0, u_1, ..., u_(N-1)}, and it is relatively easy to compute the index of an element. A hash table supports the operations Find, Insert, and Delete. (Deletions may be unnecessary in some applications.) Unlike a binary search tree, an AVL tree, or a B+-tree, however, the following operations cannot be supported:

1. minimum and maximum;
2. approximate searching (such as the completion of a URL, as in programming assignment 2);
3. successor and predecessor;
4. reporting data within a given range;
5. listing the data in order.

Example Applications

Compilers use hash tables (symbol tables) to keep track of declared variables. A hash table is useful when the input keys come in sorted order, which is a bad case for a binary search tree; an AVL tree or a B+-tree could be difficult to implement, and they are not necessarily more efficient. Another application is on-line spell checkers: after prehashing the entire dictionary, one can check each word in constant time and print the misspelled words in order of their appearance in the document.

Hashing

Let {K_0, K_1, ..., K_(N-1)} be the universe of keys. Let T[0..m-1] be an array representing the hash table, where m is much smaller than N. The heart of the hashing scheme is the hash function

    h : key universe -> [0..m-1]

that maps each key in the universe to an integer between 0 and m-1. For each key K, h(K) is called the hash value of K, and K is supposed to be stored at T[h(K)]. What do we do if two keys have the same hash value? There are two aspects to this. First, we should design the hash function so that it spreads the keys uniformly among the entries of T; this decreases the likelihood that two keys share a hash value. Nevertheless, we still need a solution for when this event happens.

Modular Hash Function

Suppose that we are dealing with strings of letters. A naive attempt is to add up the ASCII codes of the letters in the string, divide the sum by m, and take the remainder. This hash function is easy to implement and runs fast. However, it is a very bad hash function in general, because every permutation of a string has the same hash value. It is better to improve the hash function so that all the letters are involved and each contributes differently according to its position. Let c_(n-1) ... c_1 c_0 be a string of letters, where we also use c_i to denote the ASCII code of the letter c_i. (An ASCII code has a value between 0 and 127.) Then

    h(c_(n-1) ... c_0) = ( sum for i = 0 to n-1 of c_i * 128^i ) % m.

Basically, we treat c_(n-1) ... c_0 as a number represented in base 128, convert it to the equivalent decimal number, and then take the remainder of the division by the hash table size. This is called a modular hash function. The same idea can be used for keys that are integers (base = 10) and bit strings (base = 2).

Modular Hash Function (continued)

When computing the hash function, one has to be careful about overflow, since we raise the base to a large power. One solution is to do all the computation in modular arithmetic. That is, instead of computing (c_i * 128^i) % m, we compute (c_i * (128^i % m)) % m, and instead of computing 128^i directly, we maintain the power reduced modulo m as we go:

    sum = 0;
    product = 1;                       /* product = 128^i % m */
    for (int i = 0; i < n; i++) {
        sum = (sum + c_i * product) % m;
        product = (product * 128) % m;
    }

If m = 128^k, then the hash value of a key K is the same as the hash value of the last k letters of K. This is not good, since the hash function should take all the letters into consideration. If we are dealing with integers instead of strings of letters, then m = 10^k has a similar effect. The simplest way to avoid such effects is to make m prime.
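The overflow-safe computation above can be written as a small C function. This is a sketch; the name mod_hash and the bound on m are our own choices, not from the notes.

```c
#include <string.h>

/* Modular hash of a string, treating it as a base-128 number.
 * The i-th letter from the RIGHT is c_i in the formula.
 * All arithmetic is done modulo m to avoid overflow; we assume
 * m < 2^24 so that c * product fits in a 32-bit unsigned. */
unsigned mod_hash(const char *key, unsigned m) {
    unsigned sum = 0;
    unsigned product = 1;              /* 128^i % m */
    size_t n = strlen(key);
    for (size_t i = 0; i < n; i++) {
        unsigned c = (unsigned char)key[n - 1 - i];  /* c_i */
        sum = (sum + c * product) % m;
        product = (product * 128) % m;
    }
    return sum;
}
```

Unlike the naive sum-of-codes hash, permutations of a string now hash differently in general: with m = 101, "abc" and "cab" get different hash values.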

Collision Handling

When the hash values of two keys to be stored in the hash table are the same, we have a collision. Collisions can happen even if we design our hash function carefully, since N is much larger than m. There are two approaches to resolving collisions. We can convert the hash table into a table of linked lists; this is called separate chaining. Alternatively, we can relocate a key to a different entry in case of a collision; this is called open addressing.

Separate Chaining

[Figure: a table T[0..m-1] in which each entry points to a linked list of keys, e.g. T[k] pointing to the list K_1, K_5.]

The hash table becomes a table of linked lists. To insert a key K, we compute h(K). If T[h(K)] contains a null pointer, we initialize this entry to point to a linked list that contains K alone. If T[h(K)] is a nonempty list, we add K at the beginning of this list. To delete a key K, we compute h(K), search for K within the list at T[h(K)], and delete K if it is found. Assume that we will be storing n keys; then we should make m the smallest prime number larger than n. If the hash function works well, the number of keys in each linked list will be a small constant, so we expect each search, insertion, and deletion to be done in constant time.
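A minimal separate-chaining sketch for integer keys, in the C style of these notes. The names Node, chain_insert, and chain_find, and the table size 11, are illustrative only; keys are assumed non-negative.

```c
#include <stdlib.h>

#define M 11                       /* table size; a prime, as recommended */

typedef struct Node {
    int key;
    struct Node *next;
} Node;

static Node *table[M];             /* T[0..M-1]; statics start out NULL */

static unsigned h(int key) { return (unsigned)key % M; }

/* Insert K at the beginning of the list at T[h(K)]. */
void chain_insert(int key) {
    Node *node = malloc(sizeof(Node));
    node->key = key;
    node->next = table[h(key)];
    table[h(key)] = node;
}

/* Return 1 if K is in the table, 0 otherwise. */
int chain_find(int key) {
    for (Node *p = table[h(key)]; p != NULL; p = p->next)
        if (p->key == key)
            return 1;
    return 0;
}
```

Inserting at the head of the list keeps insertion O(1); search time is proportional to the list length, which is a small constant if the hash function spreads keys well.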

Open Addressing

Separate chaining has the disadvantage of using linked lists: memory allocation during list manipulation slows down the program. An alternative method is to relocate the key K to be inserted if it collides with an existing key; that is, we store K at an entry different from T[h(K)]. Two issues arise. First, what is the relocation scheme? Second, how do we search for K later? There are two common methods for resolving a collision in open addressing: linear probing and double hashing.

Linear Probing

Insertion: Let K be the new key to be inserted. We compute h(K). For i = 0 to m-1:

1. Compute L = (h(K) + i) % m.
2. If T[L] is empty, put K there and stop.

If we cannot find an empty entry to put K, the table is full and we should report an error.

Searching: Let K be the key to search for. We compute h(K). For i = 0 to m-1:

1. Compute L = (h(K) + i) % m.
2. If T[L] contains K, we are done and can stop.
3. If T[L] is empty, then K is not in the table and we can stop too. (If K were in the table, our insertion strategy would have placed it at T[L].)

If we have not found K by the end of the for-loop, we have scanned the entire table, so we can report that K is not in the table.
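The insertion and search loops above can be sketched in C as follows. This is a sketch: lp_insert and lp_find are our own names, and an occupancy flag array stands in for "empty"; keys are assumed non-negative.

```c
#define M 11                       /* table size, matching the example below */

static int T[M];
static int occupied[M];            /* 0 = empty; statics start out zeroed */

static unsigned h(int key) { return (unsigned)key % M; }

/* Probe h(K), h(K)+1, ... (wrapping around) for an empty slot.
 * Returns 1 on success, 0 if the table is full. */
int lp_insert(int key) {
    for (unsigned i = 0; i < M; i++) {
        unsigned L = (h(key) + i) % M;
        if (!occupied[L]) {
            T[L] = key;
            occupied[L] = 1;
            return 1;
        }
    }
    return 0;                      /* table full */
}

/* Search along the same probe sequence; an empty slot proves absence. */
int lp_find(int key) {
    for (unsigned i = 0; i < M; i++) {
        unsigned L = (h(key) + i) % M;
        if (!occupied[L]) return 0;
        if (T[L] == key) return 1;
    }
    return 0;
}
```

With h(K) = K % 11 and the insertion sequence 21, 42, 31, 12, 54 used in the notes' example, 31 ends up at T[0] and 54 at T[2].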

Example: Linear Probing

h(K) = K % 11. Insert 21, 42, 31, 12, 54 in this order into an initially empty table of size 11:

Index | After 21 | After 42 | After 31 | After 12 | After 54
  0   |          |          |    31    |    31    |    31
  1   |          |          |          |    12    |    12
  2   |          |          |          |          |    54
 3-8  |          |          |          |          |
  9   |          |    42    |    42    |    42    |    42
 10   |    21    |    21    |    21    |    21    |    21

If we search for 31 at the end, we access the entries T[9], T[10], and T[0]; 31 is found at T[0]. If we search for 20 at the end, we access the entries T[9], T[10], T[0], T[1], T[2], and T[3] in this sequence. Since T[3] is empty, we know that 20 is not in the table.

Primary Clustering

We call a block of contiguously occupied table entries a cluster. On average, when we insert a new key K, we may hit the middle of a cluster, so the time to insert K is proportional to half the size of the cluster: the larger the cluster, the slower the performance. Linear probing has the following disadvantages:

1. Once h(K) falls into a cluster, the cluster definitely grows by one, which may worsen the performance of future insertions.
2. If two clusters are separated by only one entry, inserting a single key into that gap merges the two clusters. The cluster size can thus increase drastically from one insertion, so insertion performance can deteriorate drastically after a single insertion.
3. Large clusters are easy targets for further collisions.

Double Hashing

To alleviate the problem of primary clustering, we should examine alternative positions in a more random fashion when resolving a collision. To this end, we work with two hash functions, h and h_2.

Insertion: Let K be the new key to be inserted. We compute h(K). For i = 0 to m-1:

1. Compute L = (h(K) + i * h_2(K)) % m.
2. If T[L] is empty, put K there and stop.

If we cannot find an empty entry to put K, the table is full and we should report an error.

Searching: Let K be the key to search for. We compute h(K). For i = 0 to m-1:

1. Compute L = (h(K) + i * h_2(K)) % m.
2. If T[L] contains K, we are done and can stop.
3. If T[L] is empty, then K is not in the table and we can stop too. (If K were in the table, our insertion strategy would have placed it at T[L].)

If we have not found K by the end of the for-loop, we have scanned the entire table, so we can report that K is not in the table.
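The same loops with step i * h_2(K) can be sketched as follows. The names dh_insert and dh_find are illustrative; h_2 uses the r - (K % r) form with r = 7 and m = 11 (the choice of h_2 is discussed in these notes), and keys are assumed non-negative.

```c
#define M 11                       /* table size; prime */

static int T[M];
static int occupied[M];            /* 0 = empty; statics start out zeroed */

static unsigned h(int key)  { return (unsigned)key % M; }
static unsigned h2(int key) { return 7 - ((unsigned)key % 7); } /* r - (K % r), r = 7 */

/* Probe h(K), h(K)+h2(K), h(K)+2*h2(K), ... (mod M) for an empty slot. */
int dh_insert(int key) {
    for (unsigned i = 0; i < M; i++) {
        unsigned L = (h(key) + i * h2(key)) % M;
        if (!occupied[L]) {
            T[L] = key;
            occupied[L] = 1;
            return 1;
        }
    }
    return 0;                      /* table full */
}

/* Search along the same probe sequence; an empty slot proves absence. */
int dh_find(int key) {
    for (unsigned i = 0; i < M; i++) {
        unsigned L = (h(key) + i * h2(key)) % M;
        if (!occupied[L]) return 0;
        if (T[L] == key) return 1;
    }
    return 0;
}
```

Because M is prime, every nonzero step h2(K) is relatively prime to M, so the probe sequence visits all M entries.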

Choice of h_2

For any key K, h_2(K) must be relatively prime to the table size m. Otherwise, we will only be able to examine a fraction of the table entries. For example, if h(K) = 0 and h_2(K) = m/2, then we can only examine the entries T[0] and T[m/2], and nothing else! One solution is to make m prime, choose a prime r smaller than m, and set

    h_2(K) = r - (K % r).

Then 1 <= h_2(K) <= r < m, and since m is prime, every such step is relatively prime to m.

Example: Double Hashing

h(K) = K % 11 and h_2(K) = 7 - (K % 7). Insert 21, 42, 31, 12, 54 in this order into an initially empty table of size 11:

Index | After 21 | After 42 | After 31 | After 12 | After 54
  0   |          |          |          |          |
  1   |          |          |          |    12    |    12
  2   |          |          |    31    |    31    |    31
  3   |          |          |          |          |    54
 4-8  |          |          |          |          |
  9   |          |    42    |    42    |    42    |    42
 10   |    21    |    21    |    21    |    21    |    21

For example, 31 collides with 42 at T[9]; since h_2(31) = 4, the next probe is (9 + 4) % 11 = 2, which is empty, so 31 is stored at T[2]. Similarly, 54 collides at T[10] and T[1] before landing at T[3].

Deletion in Open Addressing

Actual deletion cannot be performed under open addressing strategies. To see why, suppose the table stores three keys K_1, K_2, and K_3 that have identical probe sequences: K_1 is stored at T[h(K_1)], K_2 at the second probe location, and K_3 at the third. If K_2 is deleted by simply making its slot empty, then a search for K_3 will find an empty slot before finding K_3, and will incorrectly report that K_3 does not exist in the table. Instead, we add an extra bit to each entry to indicate whether it has been deleted; such "deleted" entries do not stop a search, but can be reused by later insertions.
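The lazy-deletion scheme can be sketched by giving each slot one of three states instead of a single bit. This is a sketch using linear probing; the names od_insert, od_find, and od_delete are illustrative, and keys are assumed non-negative.

```c
#define M 11

enum { EMPTY_SLOT, OCCUPIED, DELETED }; /* EMPTY_SLOT == 0 matches zeroed statics */

static int T[M];
static int state[M];

static unsigned h(int key) { return (unsigned)key % M; }

/* A DELETED slot may be reused by insertion. Returns 0 if the table is full. */
int od_insert(int key) {
    for (unsigned i = 0; i < M; i++) {
        unsigned L = (h(key) + i) % M;
        if (state[L] != OCCUPIED) {
            T[L] = key;
            state[L] = OCCUPIED;
            return 1;
        }
    }
    return 0;
}

/* Only a truly EMPTY slot stops the search; DELETED slots are probed past. */
int od_find(int key) {
    for (unsigned i = 0; i < M; i++) {
        unsigned L = (h(key) + i) % M;
        if (state[L] == EMPTY_SLOT) return 0;
        if (state[L] == OCCUPIED && T[L] == key) return 1;
    }
    return 0;
}

/* Mark the key's slot DELETED rather than emptying it. */
int od_delete(int key) {
    for (unsigned i = 0; i < M; i++) {
        unsigned L = (h(key) + i) % M;
        if (state[L] == EMPTY_SLOT) return 0;
        if (state[L] == OCCUPIED && T[L] == key) {
            state[L] = DELETED;
            return 1;
        }
    }
    return 0;
}
```

For instance, inserting 9, 20, and 31 (all hashing to 9 mod 11) and then deleting 20 leaves a DELETED marker at T[10], so a subsequent search still finds 31 beyond it.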