Introduction to Hashing

Lecture 11 Hashing

Introduction to Hashing We have learned that the most efficient search in a sorted list runs in O(log2 n) time, and that the most efficient sort by key comparison runs in O(n log2 n) time. There are, however, even faster ways to store and retrieve items that do not require comparing key values at all. Using a technique called hashing, we can insert an object into a collection, or find an object in it, in O(1) average time. Hashing works by using the key of the value being stored or searched for to directly compute a location (index) for that value in a table. A hash function obtains an index from a key, and that index is used to store and retrieve the value. In other words, hashing retrieves a value directly from the index computed from its key, without performing a search.

Building a Hash Function A typical hash function first converts a search key to an integer value called a hash code, and then compresses the hash code into an index into the hash table. Ideally, we would like to design a function that maps each search key to a different index in the hash table. Such a function is called a perfect hash function. For most applications there is no perfect hash function: typically the range of possible key values exceeds the range of desired or allotted memory space. For example, you might want to store a few thousand integers drawn from the entire range of integers. For 32-bit integers there are over four billion possible values, but to store, say, 5000 integers you would not want to reserve memory to hold billions of cells. A simple hash function for this task is hash_code = val % 5000 (called hash-code compression), followed by list[hash_code] = val. The compression generates an integer between 0 and 4999, which permits us to store any value val in an array of size 5000. When two different values generate the same hash_code, a collision occurs.
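
The compression step above can be sketched in a few lines of Java; the class and method names here (CompressionDemo, compress) are illustrative, not from the text.

```java
// A minimal sketch of hash-code compression into a fixed-size table;
// the names CompressionDemo, compress, and TABLE_SIZE are illustrative.
public class CompressionDemo {
    static final int TABLE_SIZE = 5000;

    // Compress a non-negative key into an index in 0..4999.
    static int compress(int val) {
        return val % TABLE_SIZE;
    }

    public static void main(String[] args) {
        int[] list = new int[TABLE_SIZE];
        int val = 123456;
        list[compress(val)] = val;                       // store at the computed index
        System.out.println(list[compress(val)]);         // 123456
        // 1234 and 6234 collide: both compress to index 1234
        System.out.println(compress(1234) == compress(6234));  // true
    }
}
```

Note that two different values (here 1234 and 6234) can compress to the same index; that is exactly the collision the text describes.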

Hash Codes for Primitive Types For a search key of type byte, short, int, or char, simply cast it to int. Thus, two different search keys of any one of these types will have different hash codes. For a search key of type float, use Float.floatToIntBits(key) as the hash code. Note that floatToIntBits(float f) returns an int whose bit representation is the same as the bit representation of the floating-point number f. For a search key of type long, divide the 64 bits into two halves and perform the exclusive-or operation to combine them. This process is called folding: int hashCode = (int)(key ^ (key >> 32)); Note that >> is the right-shift operator; key >> 32 shifts the bits 32 positions to the right. For example, 1010110 >> 2 yields 0010101. The ^ is the bitwise exclusive-or operator; it operates on corresponding bits of its two binary operands. For example, 1010110 ^ 0110111 yields 1100001. For a search key of type double, first convert it to a long value using Double.doubleToLongBits, then perform folding as follows: long bits = Double.doubleToLongBits(key); int hashCode = (int)(bits ^ (bits >> 32));
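
The folding described above can be exercised with a short sketch; the method names longHash and doubleHash are illustrative.

```java
// A sketch of folding a 64-bit value into a 32-bit hash code,
// following the expressions in the text; method names are illustrative.
public class FoldingDemo {
    // Fold a long into an int by xor-ing its upper and lower halves.
    static int longHash(long key) {
        return (int) (key ^ (key >> 32));
    }

    // Convert a double to its long bit pattern, then fold.
    static int doubleHash(double key) {
        long bits = Double.doubleToLongBits(key);
        return (int) (bits ^ (bits >> 32));
    }

    public static void main(String[] args) {
        // For small non-negative values the upper 32 bits are all zeros,
        // so folding leaves the value unchanged.
        System.out.println(longHash(42L));              // 42
        System.out.println(Float.floatToIntBits(1.0f)); // bit pattern of 1.0f
        System.out.println(doubleHash(3.14));           // folded bit pattern
    }
}
```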

Hash Codes for Strings Search keys are often strings, so it is important to design a good hash function for strings. An intuitive approach is to sum the Unicode values of all characters as the hash code for the string. This approach may work if no two search keys in an application contain the same letters, but it produces many collisions when keys contain the same letters, such as "tod" and "dot". A better approach is to generate a hash code that takes the positions of the characters into account. Specifically, let the hash code be s0*b^(N-1) + s1*b^(N-2) + ... + s(N-1), where si is s.charAt(i). This expression is a polynomial in some positive b, so it is called a polynomial hash code. By Horner's rule, it can be evaluated efficiently as (...((s0*b + s1)*b + s2)*b + ... + s(N-2))*b + s(N-1). This computation can cause an overflow for long strings, but arithmetic overflow is ignored in Java. You should choose an appropriate value of b to minimize collisions; experiments show that good choices for b are 31, 33, 37, 39, and 41. In the String class, hashCode is overridden using the polynomial hash code with b = 31.
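
Horner's rule above translates directly into a loop; since Java's String.hashCode uses the same polynomial with b = 31, this sketch reproduces it (the class name PolyHash is illustrative).

```java
// Polynomial hash code with b = 31, evaluated by Horner's rule.
// With this choice of b the result matches String.hashCode.
public class PolyHash {
    static int hash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = h * 31 + s.charAt(i);   // overflow is silently ignored in Java
        }
        return h;
    }

    public static void main(String[] args) {
        // "tod" and "dot" contain the same letters but hash differently,
        // because the polynomial weights each character by its position.
        System.out.println(PolyHash.hash("tod") != PolyHash.hash("dot")); // true
        System.out.println(PolyHash.hash("tod") == "tod".hashCode());     // true
    }
}
```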

When Hash Codes Collide A collision occurs when two distinct keys are mapped to the same index in a hash table. Generally, there are two ways of handling collisions: open addressing and separate chaining. Open addressing finds another open location in the hash table in the event of a collision; it has several variations: linear probing, quadratic probing, and double hashing. The separate chaining scheme places all entries with the same hash index into the same location, rather than finding new locations. Each location in the separate chaining scheme is called a bucket; a bucket is a container that holds multiple entries.

Linear Probing When a collision occurs during the insertion of an entry into a hash table, linear probing finds the next available location sequentially. For example, if a collision occurs at hashtable[k % N], check whether hashtable[(k+1) % N] is available. If not, check hashtable[(k+2) % N], and so on, until an available cell is found. Linear probing tends to cause groups of consecutive cells in the hash table to be occupied. Each group is called a cluster. Each cluster is actually a probe sequence that you must search when retrieving, adding, or removing an entry. As clusters grow in size, they can merge into even larger clusters, further slowing down the search time.
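
A minimal sketch of linear-probing insertion, assuming a small table of size 11 and int keys, with no deletion or resizing; the class name LinearProbing is illustrative.

```java
// A sketch of insertion with linear probing into a table of size 11.
// Assumes the table is never full; no deletion or resizing is handled.
public class LinearProbing {
    static final int N = 11;
    static Integer[] table = new Integer[N];   // null marks an empty cell

    // Insert key; returns the index where it was placed.
    static int insert(int key) {
        int i = key % N;
        while (table[i] != null) {
            i = (i + 1) % N;                   // probe the next cell
        }
        table[i] = key;
        return i;
    }

    public static void main(String[] args) {
        System.out.println(insert(4));   // 4  (cell 4 was empty)
        System.out.println(insert(15));  // 5  (15 % 11 = 4 is taken)
        System.out.println(insert(26));  // 6  (cells 4 and 5 are taken)
    }
}
```

Keys 4, 15, and 26 all hash to index 4 and end up in the consecutive cells 4, 5, 6 — a small cluster of exactly the kind the text describes.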

Quadratic Probing Quadratic probing can avoid the clustering problem of linear probing. Linear probing looks at consecutive cells beginning at index k. Quadratic probing, on the other hand, looks at the cells at indices (k + j^2) % n for j = 0, 1, 2, ..., i.e., k % n, (k + 1) % n, (k + 4) % n, (k + 9) % n, and so on. Quadratic probing works in the same way as linear probing except for this change of probe sequence. It avoids the clustering problem of linear probing, but it has its own clustering problem, called secondary clustering: entries that collide at the same initial index use the same probe sequence. Linear probing guarantees that an available cell can be found for insertion as long as the table is not full; there is no such guarantee for quadratic probing.
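
The probe sequence (k + j^2) % n can be generated with a short loop; the class and method names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

// Generates the first m indices of the quadratic probe sequence
// (k + j^2) % n for j = 0, 1, 2, ...; names are illustrative.
public class QuadraticProbe {
    static List<Integer> probes(int k, int n, int m) {
        List<Integer> seq = new ArrayList<>();
        for (int j = 0; j < m; j++) {
            seq.add((k + j * j) % n);
        }
        return seq;
    }

    public static void main(String[] args) {
        // For k = 3 and n = 11: 3, 3+1=4, 3+4=7, (3+9) % 11 = 1
        System.out.println(probes(3, 11, 4)); // [3, 4, 7, 1]
    }
}
```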

Double Hashing Double hashing uses a secondary hash function on the keys to determine the probe increment, which avoids the clustering problem. For example, let the primary hash function h and the secondary hash function h' on a hash table of size 11 be defined as follows: h(k) = k % 11; h'(k) = 7 - k % 7. For a search key of 12, we have h(12) = 12 % 11 = 1 and h'(12) = 7 - 12 % 7 = 2. The probe sequence for key 12 therefore starts at index 1 and advances with an increment of 2: 1, 3, 5, 7, 9, 0, and so on.
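
The example above can be checked with a short sketch that computes the j-th probe index (h(k) + j * h'(k)) % 11; the class name DoubleHashing is illustrative.

```java
// The primary and secondary hash functions from the text (table size 11),
// and the resulting probe sequence; the class name is illustrative.
public class DoubleHashing {
    static int h(int k)      { return k % 11; }
    static int hPrime(int k) { return 7 - k % 7; }

    // The j-th probe index for key k.
    static int probe(int k, int j) {
        return (h(k) + j * hPrime(k)) % 11;
    }

    public static void main(String[] args) {
        // Key 12: start index h(12) = 1, increment h'(12) = 2
        for (int j = 0; j < 6; j++) {
            System.out.print(probe(12, j) + " ");   // 1 3 5 7 9 0
        }
    }
}
```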

Separate Chaining As described above, the separate chaining scheme places all entries with the same hash index into the same bucket, rather than finding new locations. You may implement a bucket using an array, an ArrayList, or a LinkedList.
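
A minimal sketch of separate chaining using LinkedList buckets, as suggested above, for int keys and a table of size 11; the class name ChainedTable is illustrative.

```java
import java.util.LinkedList;

// A sketch of separate chaining: each table location is a bucket
// (a LinkedList) holding all keys with that hash index.
public class ChainedTable {
    static final int N = 11;
    @SuppressWarnings("unchecked")
    static LinkedList<Integer>[] buckets = new LinkedList[N];

    static void insert(int key) {
        int i = key % N;
        if (buckets[i] == null) buckets[i] = new LinkedList<>();
        buckets[i].add(key);          // colliding keys share the bucket
    }

    static boolean contains(int key) {
        int i = key % N;
        return buckets[i] != null && buckets[i].contains(key);
    }

    public static void main(String[] args) {
        insert(4);
        insert(15);                        // both hash to index 4
        System.out.println(contains(15));  // true
        System.out.println(contains(26));  // false
    }
}
```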

Load Factor and Rehashing The load factor measures how full the hash table is. It is the ratio of the number of entries in the map to the size of the hash table: λ = n / N, where n denotes the number of entries and N the number of locations in the hash table. Note that λ is zero if the map is empty. For the open addressing scheme, λ is between 0 and 1; λ is 1 if the hash table is full. For the separate chaining scheme, λ can be any value. As λ increases, the probability of a collision increases. Studies show that you should keep the load factor under 0.5 for the open addressing scheme and under 0.9 for the separate chaining scheme. Keeping the load factor under a certain threshold is important for the performance of hashing. The implementation of the java.util.HashMap class in the Java API uses a threshold of 0.75. Whenever the load factor exceeds the threshold, you need to increase the hash-table size and rehash all the entries in the map into the new hash table. Notice that you need to change the hash function, since the hash-table size has changed. Because rehashing is costly, you should at least double the hash-table size to reduce how often it is needed. www.cs.armstrong.edu/liang/animation/hashingusingseparatechaininganimation.html
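
The load-factor check and the doubling rule above can be sketched as follows; the class name and the 0.75 threshold variable are illustrative (0.75 is the HashMap default mentioned in the text).

```java
// A sketch of the load-factor check that triggers rehashing;
// names are illustrative, and the threshold follows the text's
// mention of java.util.HashMap's default of 0.75.
public class LoadFactorDemo {
    static final double THRESHOLD = 0.75;

    // Load factor lambda = n / N.
    static double loadFactor(int n, int N) {
        return (double) n / N;
    }

    public static void main(String[] args) {
        int n = 12;       // number of entries
        int N = 16;       // table size
        System.out.println(loadFactor(n, N));    // 0.75
        if (loadFactor(n, N) >= THRESHOLD) {
            N = 2 * N;    // at least double the table size before rehashing
        }
        System.out.println(N);                   // 32
    }
}
```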