AAL 217: DATA STRUCTURES

Chapter 4: Hashing

The implementation of hash tables is frequently called hashing. Hashing is a technique used for performing insertions, deletions, and finds in constant average time. The ideal hash table data structure is merely an array of some fixed size containing the items. Generally a search is performed on some part (that is, some data member) of the item; this part is called the key. For instance, an item could consist of a string (that serves as the key) and additional data members (for instance, a name that is part of a large employee structure). We will refer to the table size as TableSize, with the understanding that this is part of a hash data structure and not merely some variable floating around globally.

4.1. Dictionary Data Structure

Hashing algorithms are often used on a special data structure called the dictionary. A dictionary is a dynamic data structure consisting of a set of keys, and it supports three basic operations: insertion, deletion, and search. Generally, the keys in a dictionary can have additional related elements, called satellite data, as illustrated in the diagram. Many real-life applications use dictionaries, with keys based on numbers and/or alphabets:

- a set of personnel numbers: {13456, 7890, 2348, 1256, ...}
- a set of part numbers: {111223-5, 67890-6, 2345-8, 789011-29, ...}
- the symbol table used by a compiler
- an online dictionary for spell checking

Hashing is the procedure of mapping dictionary keys into a set of m integers in the range 0, 1, ..., m-1. The mapped keys are stored in a table called the hash table, which consists of m cells.
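As a concrete illustration, the personnel-number keys above can be mapped into the range 0..m-1 with a simple modular hash. This is a minimal Python sketch; the table size m = 11 is an illustrative choice, not from the text:

```python
# A minimal sketch of hashing dictionary keys into the range 0..m-1.
# The personnel numbers come from the text; m = 11 is an illustrative choice.
def h(key, m):
    """Map an integer key into a cell index in the range 0..m-1."""
    return key % m

m = 11
personnel = [13456, 7890, 2348, 1256]
slots = {k: h(k, m) for k in personnel}
print(slots)   # {13456: 3, 7890: 3, 2348: 5, 1256: 2}
# Note that 13456 and 7890 both map to cell 3 -- a collision,
# the problem that Section 4.3 deals with.
```

Two distinct keys landing in the same cell is unavoidable whenever there are more possible keys than cells, which is why collision resolution is treated as a core part of the data structure.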

4.2. Hash Function

A hash function is any algorithm or subroutine that maps large data sets of variable length to smaller data sets of a fixed length. For example, a person's name, which has a variable length, could be hashed to a single integer. The values returned by a hash function are called hash values, hash codes, hash sums, checksums, or simply hashes. Keys that are mapped to the same slot in the hash table are said to collide.

Consider the hash function h(k) = k mod 11. The keys {12, 10, 13, 2, 14, 3} would map as follows:

h(12) = 1, h(10) = 10, h(13) = 2, h(2) = 2, h(14) = 3, h(3) = 3

so 13 and 2 collide at slot 2, and 14 and 3 collide at slot 3.

Each key is mapped into some number in the range 0 to TableSize - 1 and placed in the appropriate cell. The mapping is called a hash function, which ideally should be simple to compute and should ensure that any two distinct keys get different cells. Since there are a finite number of cells and a virtually inexhaustible supply of keys, this is clearly impossible, and thus we seek a hash function that distributes the keys evenly among the cells. The figure below shows a typical perfect situation: john hashes to 3, phil hashes to 4, dave hashes to 6, and mary hashes to 7.

If the input keys are integers, then simply returning Key mod TableSize is generally a reasonable strategy, unless Key happens to have some undesirable properties. In this case, the

choice of hash function needs to be carefully considered. For instance, if the table size is 10 and the keys all end in zero, then the standard hash function is a bad choice. To avoid situations like this, it is often a good idea to ensure that the table size is prime. When the input keys are random integers, this function is not only very simple to compute but also distributes the keys evenly.

Usually, the keys are strings; in this case, the hash function needs to be chosen carefully. One option is to add up the ASCII values of the characters in the string. If the table size is large, this function does not distribute the keys well. For instance, suppose that TableSize = 10,007 (a prime number), and suppose all the keys are eight or fewer characters long. Since an ASCII character has an integer value that is always at most 127, the hash function can only assume values between 0 and 1,016, which is 127 × 8. This is clearly not an equitable distribution.

Example (a): Consider the string MOIZ.
ASCII codes: 77 79 73 90
Hash code = 77 + 79 + 73 + 90 = 319

Example (b): Consider the string SATTAR.
ASCII codes: 83 65 84 84 65 82
Hash code = 83 + 65 + 84 + 84 + 65 + 82 = 463

The ASCII-sum method is easy and produces short hash codes. However, it produces a large number of collisions, because all permutations of a character string hash to the same value. For example, ABC, ACB, BAC, BCA, CBA, and CAB have the same hash code and, therefore, hash to the same slot of the hash table.

Another hash function assumes that Key has at least three characters and computes

h(key) = key[0] + 27 · key[1] + 729 · key[2]

where 27 represents the number of letters in the English alphabet plus the blank, and 729 is 27². This function examines only the first three characters, but if these are random and the table size is 10,007, as before, then we would expect a reasonably equitable distribution. Unfortunately, English is not random. Although there are 26³ = 17,576 possible combinations of three characters (ignoring blanks), a check of a reasonably large online dictionary reveals that the number of different combinations is actually only 2,851. Even if none of these combinations collide, only 28 percent of the table can actually be hashed to. Thus, this function, although easily computable, is also not appropriate if the hash table is reasonably large. A better distribution is obtained by extending the idea to all characters of the string, treating the string as a number in base 27:

Example (1): Consider the string MOIZ.
ASCII codes: 77 79 73 90
Hash code = 77 + 79 × 27¹ + 73 × 27² + 90 × 27³ = 1,826,897
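Both string hash functions discussed above can be sketched in a few lines of Python. The function names are illustrative; the base-27 version follows the Example (1) computation:

```python
# Two string hash functions from the text: the ASCII-sum method and the
# base-27 polynomial used in Example (1). Function names are illustrative.
def ascii_sum_hash(s):
    """Add up the ASCII values of the characters (order-insensitive)."""
    return sum(ord(c) for c in s)

def base27_hash(s):
    """Treat the string as a number in base 27: sum of s[i] * 27**i."""
    return sum(ord(c) * 27 ** i for i, c in enumerate(s))

print(ascii_sum_hash("MOIZ"))     # 319, as in Example (a)
print(ascii_sum_hash("SATTAR"))   # 463, as in Example (b)
# All permutations collide under the ASCII sum:
print(ascii_sum_hash("ABC") == ascii_sum_hash("CBA"))   # True
print(base27_hash("MOIZ"))        # 1826897, as in Example (1)
```

The base-27 hash distinguishes permutations (base27_hash("ABC") != base27_hash("CBA")), which is exactly the weakness of the ASCII-sum method that it repairs.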

4.3. Collision Resolution

If, when an element is inserted, it hashes to the same value as an already inserted element, then we have a collision and need to resolve it. There are several methods for dealing with this. We will discuss two of the simplest: separate chaining (chained hashing) and open addressing.

4.3.1. Separate Chaining (Chained Hashing)

In chained hashing, the elements of a hash table are stored in a set of linked lists. All colliding elements are kept in one linked list, and the list head pointers are usually stored in an array. Chained hashing is also known as open hashing.

This strategy, commonly known as separate chaining, keeps a list of all elements that hash to the same value. We can use the Standard Library list implementation, although if space is tight it might be preferable to avoid it (since these lists are doubly linked and waste space). To perform a search, we use the hash function to determine which list to traverse, and then search the appropriate list. To perform an insert, we check the appropriate list to see whether the element is already in place (if duplicates are expected, an extra data member is usually kept, and this data member would be incremented in the event of a match). If the element turns out to be new, it can be inserted at the front of the list, since this is convenient and also because, frequently, recently inserted elements are the most likely to be accessed in the near future.

4.3.2. Open Address Hashing

Separate chaining has the disadvantage of using linked lists. This could slow the algorithm down a bit because of the time required to allocate new cells (especially in other languages), and it essentially requires the implementation of a second data structure. An alternative to resolving collisions with linked lists is to try alternative cells until an empty cell is found. Because all the data go inside the table, a bigger table is needed in such a scheme than for separate chaining.
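The separate-chaining strategy of Section 4.3.1 can be sketched as follows. This is a minimal Python version that uses plain lists for the chains; the table size of 11 is an illustrative choice:

```python
# Minimal separate-chaining hash table: an array of lists, with new
# elements inserted at the front of their chain, as described in 4.3.1.
# The table size (11) is an illustrative choice.
class ChainedHashTable:
    def __init__(self, size=11):
        self.size = size
        self.chains = [[] for _ in range(size)]   # one list per cell

    def _hash(self, key):
        return key % self.size

    def insert(self, key):
        chain = self.chains[self._hash(key)]
        if key not in chain:        # check whether the element is already in place
            chain.insert(0, key)    # new elements go at the front of the list

    def search(self, key):
        return key in self.chains[self._hash(key)]

t = ChainedHashTable()
for k in (12, 10, 13, 2, 14, 3):    # the keys from the mod-11 example
    t.insert(k)
print(t.chains[2])    # [2, 13] -- colliding keys share one chain
print(t.search(14))   # True
```

Note how the colliding keys 13 and 2 simply coexist in the chain for cell 2; the table itself never fills up, which is the main structural difference from the open addressing scheme below.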
Generally, the load factor should be kept below λ = 0.5 for a hash table that doesn't use separate chaining. We call such tables probing hash tables. In open address hashing, the hashed keys are stored in the hash table itself, and colliding keys are allocated distinct cells in the table. Open address hashing is also referred to as closed hashing. It can be performed using three techniques: linear probing, quadratic probing, and double hashing.

Linear Probing

Linear probing is a scheme for resolving hash collisions by sequentially searching the hash table for a free location. Linear probing

is accomplished using two values: one as a starting value and one as an interval between successive values, in modular arithmetic. The second value, which is the same for all keys and known as the step size, is repeatedly added to the starting value until a free space is found or the entire table is traversed. (In order to traverse the entire table, the step size should be relatively prime to the array size, which is why the array size is often chosen to be a prime number.)

newLocation = (startingValue + stepSize) % arraySize

In linear probing, f is a linear function of i, typically f(i) = i. This amounts to trying cells sequentially in search of an empty cell. Figure 5.11 shows the result of inserting the keys {89, 18, 49, 58, 69} into a hash table using the same hash function as before (h(k) = k mod 10) and the collision resolution strategy f(i) = i. The first collision occurs when 49 is inserted; it is put in the next available spot, namely spot 0, which is open. The key 58 collides with 18, 89, and then 49 before an empty cell is found three away. The collision for 69 is handled in a similar manner.

As long as the table is big enough, a free cell can always be found, but the time to do so can get quite large. Worse, even if the table is relatively empty, blocks of occupied cells start forming. This effect, known as primary clustering, means that any key that hashes into the cluster will require several attempts to resolve the collision, and then it will add to the cluster.

Quadratic Probing

Quadratic probing is a collision resolution method that eliminates the primary clustering problem of linear probing. Quadratic probing is what you would expect: the collision function is quadratic. The popular choice is f(i) = i². Figure 5.13 shows the resulting hash table with this collision function on the same input used in the linear probing example. When 49 collides with 89, the next position attempted is one cell away. This cell is empty, so 49 is placed there. Next, 58 collides at position 8. The cell one away is tried, but another collision occurs. A vacant cell is found at the next cell tried, which is 2² = 4 away; 58 is thus placed in cell 2. The same thing happens for 69.

For linear probing, it is a bad idea to let the hash table get nearly full, because performance degrades. For quadratic probing, the situation is even more drastic: there is no guarantee of finding an empty cell once the table gets more than half full, or even before the table gets half full if the table size is not prime. This is because at most half of the table can be used as alternative locations to resolve collisions. Indeed, we prove now that if the table is half empty and the table size is prime, then we are always guaranteed to be able to insert a new element.
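Both probing strategies can be sketched with one small helper. Assuming h(k) = k mod 10 and the keys from the figures, passing f(i) = i reproduces the linear probing result and f(i) = i² the quadratic one:

```python
# Sketch of open addressing: insert each key by trying cells
# (h(k) + f(i)) % TableSize for i = 0, 1, 2, ... until an empty cell
# is found. Reproduces the linear and quadratic probing examples
# above, assuming h(k) = k mod 10.
def probe_insert(keys, size, f):
    table = [None] * size
    for k in keys:
        home = k % size                 # h(k) = k mod TableSize
        i = 0
        while table[(home + f(i)) % size] is not None:
            i += 1                      # collision: try the next cell in the sequence
        table[(home + f(i)) % size] = k
    return table

keys = [89, 18, 49, 58, 69]
print(probe_insert(keys, 10, lambda i: i))       # linear probing, f(i) = i
# [49, 58, 69, None, None, None, None, None, 18, 89]
print(probe_insert(keys, 10, lambda i: i * i))   # quadratic probing, f(i) = i**2
# [49, None, 58, 69, None, None, None, None, 18, 89]
```

In the linear output, 58 and 69 pile up in cells 1 and 2 next to 49 (primary clustering); under quadratic probing 58 jumps 4 away to cell 2 and 69 lands in cell 3, spreading the cluster out. This sketch loops forever if no free cell is reachable, which is why probing tables keep the load factor below 0.5.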
