Hashing Techniques. Material based on slides by George Bebis

Hashing Techniques Material based on slides by George Bebis https://www.cse.unr.edu/~bebis/cs477/lect/hashing.ppt

The Search Problem
Find items with keys matching a given search key
Given an array A containing n keys and a search key x, find the index i such that x = A[i]
As in the case of sorting, a key could be part of a large record.

Applications
Keeping track of customer account information at a bank
  Search through records to check balances and perform transactions
Keeping track of reservations on flights
  Search to find empty seats, cancel/modify reservations
Search engine
  Looks for all documents containing a given word

Direct Addressing
Assumptions:
  Key values are distinct
  Each key is drawn from a universe U = {0, 1, ..., m - 1}
Idea: Store the items in an array, indexed by keys
Direct-address table representation:
  An array T[0 ... m - 1]
  Each slot, or position, in T corresponds to a key in U
  For an element x with key k, a pointer to x (or x itself) will be placed in location T[k]
  If there are no elements with key k in the set, T[k] is empty, represented by NIL
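
The direct-address table described above can be sketched in a few lines of Python. This is only an illustration: the class name DirectAddressTable and its method names are hypothetical, and None plays the role of NIL.

```python
NIL = None  # assumption: None stands in for the NIL marker


class DirectAddressTable:
    """Array T[0 .. m-1] indexed directly by key (hypothetical helper class)."""

    def __init__(self, m):
        self.slots = [NIL] * m      # one slot per key in U = {0, ..., m-1}

    def insert(self, key, item):
        self.slots[key] = item      # T[k] holds the element with key k

    def search(self, key):
        return self.slots[key]      # NIL if no element has this key

    def delete(self, key):
        self.slots[key] = NIL


table = DirectAddressTable(10)
table.insert(3, "record for key 3")
print(table.search(3))
print(table.search(4))
```

Every operation is a single array access, but the array needs one slot for every key in U, whether or not that key is ever used.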

Direct Addressing (cont'd)

Examples Using Direct Addressing
Example 1:

Examples Using Direct Addressing
Example 2:

Hashing
Hashing provides a means for accessing data without the use of an index structure.
Data is addressed on disk by computing a function on a search key instead.

Organization
A bucket in a hash file is a unit of storage (typically a disk block) that can hold one or more records.
The hash function, h, is a function from the set of all search keys, K, to the set of all bucket addresses, B.
Insertion, deletion, and lookup are done in constant time on average, as long as collisions are few.

Hash Tables
When |K| is much smaller than |U|, a hash table requires much less space than a direct-address table
  Can reduce storage requirements to |K|
  Can still get O(1) search time, but in the average case, not the worst case

Hash Tables
Idea:
  Use a function h to compute the slot for each key
  Store the element in slot h(k)
A hash function h transforms a key into an index in a hash table T[0 ... m - 1]:
  h : U -> {0, 1, ..., m - 1}
We say that k hashes to slot h(k)
Advantages:
  Reduce the range of array indices handled: m instead of |U|
  Storage is also reduced

Example: HASH TABLES
(Figure: the actual keys k1, ..., k5 form a subset K of the universe of keys U; h maps them to slots h(k1), h(k4), h(k2) = h(k5), h(k3) of a table indexed 0 to m - 1.)

Revisit Example 2

Do you see any problems with this approach?
(Figure: the same mapping from the actual keys K, a subset of the universe U, to slots 0 to m - 1; k2 and k5 land in the same slot, h(k2) = h(k5).)
Collisions!

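Example: store names using a hashing function
h(k) = k mod m, where k = sum of the alphabet positions of the letters
Let m = 51
MOHIT: 13+15+8+9+20 = 65, 65 mod 51 = 14
RANA: 18+1+14+1 = 34, 34 mod 51 = 34
STOP: 19+20+15+16 = 70, 70 mod 51 = 19
BISWAS: 2+9+19+23+1+19 = 73, 73 mod 51 = 22
XEROX: 24+5+18+15+24 = 86, 86 mod 51 = 35
TOPS: 20+15+16+19 = 70, 70 mod 51 = 19 (same slot as STOP)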
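
The name-hashing example can be checked with a short Python sketch. The function name name_hash is hypothetical; it assumes upper-case letters A to Z with A = 1, ..., Z = 26 and m = 51, as on the slide.

```python
def name_hash(name, m=51):
    """h(k) = k mod m, where k is the sum of alphabet positions (A=1, ..., Z=26)."""
    k = sum(ord(ch) - ord("A") + 1 for ch in name.upper())
    return k % m


for name in ["MOHIT", "RANA", "STOP", "BISWAS", "XEROX", "TOPS"]:
    print(name, name_hash(name))
```

Because the key is a plain sum of letter positions, any two anagrams (such as STOP and TOPS) get the same k and therefore always collide, whatever m is.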

Collisions
Two or more keys hash to the same slot!
For a given set K of keys:
  If |K| <= m, collisions may or may not happen, depending on the hash function
  If |K| > m, collisions will definitely happen (i.e., there must be at least two keys that have the same hash value)
Avoiding collisions completely is hard, even with a good hash function

Handling Collisions
We will review the following methods:
  Chaining
  Open addressing
    Linear probing
    Quadratic probing
    Double hashing

Handling Collisions Using Chaining
Idea: Put all elements that hash to the same slot into a linked list
Slot j contains a pointer to the head of the list of all elements that hash to j

Collision with Chaining - Discussion
Choosing the size of the table
  Small enough not to waste space
  Large enough such that lists remain short
How should we keep the lists: ordered or not?
  Not ordered!
  Insert is fast
  Can easily remove the most recently inserted elements

Insertion in Hash Tables
Worst-case running time is O(1)
Assumes that the element being inserted isn't already in the list
It would take an additional search to check if it was already inserted

Searching in Hash Tables
Search for an element with key k in list T[h(k)]
Running time is proportional to the length of the list of elements in slot h(k)
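
A minimal sketch of chaining in Python, assuming integer keys and the division hash h(k) = k mod m. The class names Node and ChainedHashTable are hypothetical. New elements are placed at the head of the list, so insertion takes O(1) and performs no duplicate check, as discussed above.

```python
class Node:
    """One element of a singly linked chain."""

    def __init__(self, key, value, next_node):
        self.key = key
        self.value = value
        self.next = next_node


class ChainedHashTable:
    def __init__(self, m):
        self.m = m
        self.heads = [None] * m          # slot j points to the head of chain j

    def insert(self, key, value):
        j = key % self.m                 # assumption: h(k) = k mod m
        self.heads[j] = Node(key, value, self.heads[j])   # O(1) head insertion

    def search(self, key):
        node = self.heads[key % self.m]
        while node is not None:          # time proportional to the chain length
            if node.key == key:
                return node.value
            node = node.next
        return None


table = ChainedHashTable(10)
table.insert(89, "first")
table.insert(49, "second")               # 49 and 89 share slot 9
print(table.search(49), table.search(89), table.search(19))
```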

Hash Functions
A hash function transforms a key into a table address
What makes a good hash function?
  (1) Easy to compute
  (2) Approximates a random function: for every input, every output is equally likely (simple uniform hashing)
In practice, it is very hard to satisfy the simple uniform hashing property
  i.e., we don't know in advance the probability distribution that keys are drawn from

Good Approaches for Hash Functions
Minimize the chance that closely related keys hash to the same slot
  Strings such as "pt" and "pts" should hash to different slots
Derive a hash value that is independent from any patterns that may exist in the distribution of the keys

The Division Method
Idea: Map a key k into one of the m slots by taking the remainder of k divided by m
  h(k) = k mod m
Advantage: fast, requires only one operation
Disadvantage: Certain values of m are bad, e.g., powers of 2 and non-prime numbers

Example - The Division Method
If m = 2^p, then h(k) is just the least significant p bits of k
  p = 1: m = 2, h(k) in {0, 1}, least significant 1 bit of k
  p = 2: m = 4, h(k) in {0, 1, 2, 3}, least significant 2 bits of k
Choose m to be a prime, not close to a power of 2
(Table: Column 2: k mod 97, m = 97; Column 3: k mod 100, m = 100)
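
The effect of the choice of m can be seen with a small Python sketch. The function name division_hash and the sample keys are hypothetical, chosen only to make the pattern visible.

```python
def division_hash(k, m):
    """Division method: h(k) = k mod m."""
    return k % m


# With m = 2^p the hash is exactly the low p bits of k.
for p in (1, 2, 3):
    m = 2 ** p
    assert all(division_hash(k, m) == (k & (m - 1)) for k in range(1000))

# Hypothetical keys that share a pattern (all multiples of 100).
keys = [100, 200, 300, 400]
print([division_hash(k, 100) for k in keys])   # every key lands in slot 0
print([division_hash(k, 97) for k in keys])    # the prime 97 spreads them out
```

With m = 100 only the last two decimal digits of the key matter, so keys that agree in those digits collide; a prime modulus such as 97 depends on all digits.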

Probing
Without using linked lists
Use a larger table and try successive locations

Common Open Addressing Methods
Linear probing
Quadratic probing
Double hashing

Linear probing: Inserting a key
Idea: when there is a collision, check the next available position in the table (i.e., probing)
  h(k, i) = (h1(k) + i) mod m, i = 0, 1, 2, ...
First slot probed: h1(k)
Second slot probed: h1(k) + 1
Third slot probed: h1(k) + 2, and so on
Probe sequence: < h1(k), h1(k) + 1, h1(k) + 2, ... >, wrapping around at the end of the table
Can generate m probe sequences maximum, why?

Example: Insert keys 89, 18, 49, 58, 69 with H(x) = x mod 10
Slot:  0   1   2   3   4   5   6   7   8   9
Key:   49  58  69  -   -   -   -   -   18  89
49 collides with 89, place in next location, 0
58 collides with 18, next available place is 1
69 lands in location 2
Some slots tend to be crowded, forming a cluster
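
The insertion example above can be reproduced with a short Python sketch. The function name linear_probe_insert is hypothetical, and None marks an empty slot.

```python
def linear_probe_insert(table, key):
    """Insert key using h(k, i) = (k mod m + i) mod m; returns the slot used."""
    m = len(table)
    home = key % m
    for i in range(m):
        slot = (home + i) % m        # wraps around to the start of the table
        if table[slot] is None:
            table[slot] = key
            return slot
    raise OverflowError("hash table is full")


table = [None] * 10
for key in (89, 18, 49, 58, 69):
    linear_probe_insert(table, key)
print(table)
```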

Example: Insert keys 89, 18, 49, 58, 69 with H(x) = x mod 12
Slot:  0   1   2   3   4   5   6   7   8   9   10  11
Key:   -   49  -   -   -   89  18  -   -   69  58  -
89: 5, 18: 6, 49: 1, 58: 10, 69: 9
No collisions
Time to search: O(1)

Linear probing: Searching for a key
Three cases:
  (1) Position in table is occupied with an element of equal key
  (2) Position in table is empty
  (3) Position in table is occupied with a different element
Case 3: probe the next higher index until the element is found or an empty position is found
The process wraps around to the beginning of the table

Linear probing: Deleting a key
Problems
  Cannot mark the slot as empty
  Impossible to retrieve keys inserted after that slot was occupied
Solution
  Mark the slot with a sentinel value DELETED
  The deleted slot can later be used for insertion
  Searching will be able to find all the keys
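
A sketch of searching and deleting with a DELETED sentinel, in Python. The function names and the use of a unique object as the sentinel are assumptions made for illustration; the hash is again k mod m.

```python
DELETED = object()   # sentinel: the slot once held a key that was removed


def insert(table, key):
    """Place key in the first empty or DELETED slot of its probe sequence.
    Simplification: no check for a copy of key further along the sequence."""
    m = len(table)
    for i in range(m):
        slot = (key % m + i) % m
        if table[slot] is None or table[slot] is DELETED:
            table[slot] = key
            return slot
    raise OverflowError("hash table is full")


def search(table, key):
    """Return the slot holding key, or None. Only a truly empty slot stops the search."""
    m = len(table)
    for i in range(m):
        slot = (key % m + i) % m
        if table[slot] is None:          # case 2: empty, key is not present
            return None
        if table[slot] is not DELETED and table[slot] == key:
            return slot                  # case 1: found
        # case 3 (different key or DELETED): keep probing
    return None


def delete(table, key):
    slot = search(table, key)
    if slot is not None:
        table[slot] = DELETED            # do not reset to None
    return slot


table = [None] * 10
for key in (89, 18, 49):
    insert(table, key)
delete(table, 89)
print(search(table, 49))                 # 49 is still reachable past the deleted slot
```

Had slot 9 been reset to empty instead, the search for 49 would stop immediately at slot 9 and report the key as missing.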

Primary Clustering Problem
Some slots become more likely than others
Long chunks of occupied slots are created
  search time increases!
Initially, all slots have probability 1/m
(Figure: after some insertions, Slot b: 2/m, Slot d: 4/m, Slot e: 5/m)

Quadratic Probing
To overcome primary clusters, the scheme of quadratic probing is proposed
The i-th probe is f(i) = i^2 cells away from the original position

Example: Insert 89, 18, 49, 58, 69
49 collides with 89, try with i = 1, goes to location 0
58 collides with 18, collides with 89 with i = 1, try with i = 2 (4 cells away), goes to location 2
69 collides with 89, then 49, so try with i = 2, finds an empty slot at 3
Slot:  0   1   2   3   4   5   6   7   8   9
Key:   49  -   58  69  -   -   -   -   18  89
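
The quadratic probing example can be traced with a short Python sketch. The function name quadratic_probe_insert is hypothetical; it assumes h(x) = x mod 10 and f(i) = i^2 as in the example.

```python
def quadratic_probe_insert(table, key):
    """Probe slots (h(key) + i*i) mod m for i = 0, 1, 2, ...; returns the slot used."""
    m = len(table)
    home = key % m
    for i in range(m):
        slot = (home + i * i) % m
        if table[slot] is None:
            table[slot] = key
            return slot
    raise OverflowError("no empty slot found on the probe sequence")


table = [None] * 10
for key in (89, 18, 49, 58, 69):
    quadratic_probe_insert(table, key)
print(table)
```

Unlike linear probing, a failure here does not mean the table is full: the quadratic sequence does not visit every slot.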

Quadratic probing with prime TS
If the table size TS is chosen as a prime, a place to hold an element can be found as long as the table is not yet half full.
The first TS/2 alternative locations are going to be all distinct.
Two of these locations are (h(x) + i^2) mod TS and (h(x) + j^2) mod TS, where i and j are both less than TS/2
To prove that the locations are going to be distinct, for the sake of contradiction, let us suppose that while i and j are different, the two locations turn out to be the same.

Can i and j point to the same location?
Then
  h(x) + i^2 = h(x) + j^2 (mod TS)
  i^2 = j^2 (mod TS)
  i^2 - j^2 = 0 (mod TS)
  (i + j)(i - j) = 0 (mod TS)
Since the table size is prime, it follows that either
  (i + j) = 0 (mod TS) (not possible since both i and j are less than TS/2, so 0 < i + j < TS)
OR
  (i - j) = 0 (mod TS) (not possible since i and j are distinct and |i - j| < TS)
Thus the first TS/2 locations are distinct.

Example
Consider a table of size 37. Let h(x) be 26; what are the alternative locations?
  for i = 2: 26 + 4 = 30
  for i = 3: 26 + 9 = 35
  for i = 4: 26 + 16 = 42, 42 mod 37 = 5
  for i = 5: 26 + 25 = 51, 51 mod 37 = 14
Such data structures do not support ordinary deletion, as an occupied cell might have caused a collision that sent another key past it.
One could use lazy deletion (mark the cell with a flag)
When the table gets half full, enlarge the hash table.
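
The distinctness claim can be checked numerically for this example. The following Python sketch uses a hypothetical helper, quadratic_locations, and compares the prime table size 37 with a non-prime size, 16, picked only for contrast.

```python
def quadratic_locations(home, ts, count):
    """First `count` probe locations (home + i^2) mod ts, for i = 0 .. count-1."""
    return [(home + i * i) % ts for i in range(count)]


# Prime table size: all i < TS/2 (i = 0 .. 18) give different locations.
prime_locs = quadratic_locations(26, 37, 19)
print(prime_locs[2:6])                      # the locations for i = 2, 3, 4, 5
print(len(set(prime_locs)) == len(prime_locs))

# Non-prime table size (hypothetical contrast): locations repeat early.
other_locs = quadratic_locations(0, 16, 8)
print(other_locs)
```

With TS = 16 the probes for i = 0..7 visit only four different cells, so an insertion can fail even though most of the table is empty.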

Double Hashing
We use a second hash function h2(x).
We probe at distances h2(x), 2 h2(x), 3 h2(x), ... from the original position
A good choice: h2(x) = R - (x mod R), where R is a prime number smaller than TS.
Consider the problem of inserting 89, 18, 49, 58, 69 into a table of size 10. Let R = 7
49 gets a collision at position 9
h2(49) = 7 - (49 mod 7) = 7, so count 7 positions from there: (9 + 7) mod 10 = 6
Slot:  0   1   2   3   4   5   6   7   8   9
Key:   -   -   -   -   -   -   49  -   18  89

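Double Hashing
58: h2(58) = 7 - (58 mod 7) = 5, goes to (8 + 5) mod 10 = 3
69: h2(69) = 7 - (69 mod 7) = 1, goes to (9 + 1) mod 10 = 0
Now try 60: it collides with 69 at position 0, and h2(60) = 7 - (60 mod 7) = 3
  Collision with 58 at 3, try 2 h2(60) = 6
  Collision with 49 at 6, try 3 h2(60) = 9
  Collision with 89 at 9, try 4 h2(60) = 12, 12 mod 10 = 2
Slot:  0   1   2   3   4   5   6   7   8   9
Key:   69  -   60  58  -   -   49  -   18  89
Now try with 23, problem? h2(23) = 5, so the probes alternate between positions 3 and 8, which are both occupied.
Table size is small and not prime.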
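
The double hashing walk-through can be replayed with a short Python sketch. The function name double_hash_insert is hypothetical; it assumes h(x) = x mod TS for the first probe and h2(x) = R - (x mod R) with R = 7 for the step.

```python
def double_hash_insert(table, key, r=7):
    """Probe (key mod m + i * h2(key)) mod m with h2(key) = r - (key mod r).
    Returns the slot used, or None if the probe sequence finds no empty slot."""
    m = len(table)
    step = r - (key % r)                 # always between 1 and r
    for i in range(m):
        slot = (key % m + i * step) % m
        if table[slot] is None:
            table[slot] = key
            return slot
    return None


table = [None] * 10
for key in (89, 18, 49, 58, 69, 60):
    double_hash_insert(table, key)
print(table)
print(double_hash_insert(table, 23))     # step 5 on a table of size 10
```

For key 23 the step is 5, which shares a factor with the table size 10, so only two slots are ever probed. A prime table size rules this out, because every step smaller than the table size then reaches all slots.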

A different Double Hashing style
(1) Use one hash function to determine the first slot
(2) Use a second hash function to determine the increment for the probe sequence
  h(k, i) = (h1(k) + i h2(k)) mod m, i = 0, 1, ...
Initial probe: h1(k)
Second probe is offset by h2(k) mod m, and so on
Advantage: avoids clustering

Different Double Hashing: Example
h1(k) = k mod 13
h2(k) = 1 + (k mod 11)
h(k, i) = (h1(k) + i h2(k)) mod 13
Insert key 14:
  h(14, 0) = h1(14) = 14 mod 13 = 1
  h(14, 1) = (h1(14) + h2(14)) mod 13 = (1 + 4) mod 13 = 5
  h(14, 2) = (h1(14) + 2 h2(14)) mod 13 = (1 + 8) mod 13 = 9
Slot:  0   1   2   3   4   5   6   7   8   9   10  11  12
Key:   -   79  -   -   69  98  -   72  -   14  -   50  -
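
The probe sequence for this style of double hashing can be generated directly. In the Python sketch below the function names are hypothetical, and the order in which the earlier keys (79, 69, 72, 98, 50) were inserted is an assumption; the slides only show the resulting table.

```python
def probe_sequence(k, m=13):
    """All m probes h(k, i) = (h1(k) + i * h2(k)) mod m for i = 0 .. m-1."""
    h1 = k % m
    h2 = 1 + (k % 11)                    # never 0, so the probe always moves
    return [(h1 + i * h2) % m for i in range(m)]


def insert(table, k):
    for slot in probe_sequence(k, len(table)):
        if table[slot] is None:
            table[slot] = k
            return slot
    raise OverflowError("hash table is full")


table = [None] * 13
for key in (79, 69, 72, 98, 50):         # assumed insertion order
    insert(table, key)
print(probe_sequence(14)[:3])            # first three probes for key 14
print(insert(table, 14))                 # slot where 14 ends up
```

Because m = 13 is prime and h2(k) lies between 1 and 11, every probe sequence visits all 13 slots exactly once.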

Rehashing
When the table gets too full, running time starts getting large
Solution: Double the table size and, with a new hash function, insert the old elements into the new table.
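
A minimal sketch of rehashing in Python, assuming linear probing and the division hash. The function name rehash is hypothetical, and the "new hash function" is simply taken to be k mod (new table size).

```python
def rehash(old_table):
    """Build a table of twice the size and re-insert every key with h(k) = k mod new_m."""
    new_m = 2 * len(old_table)
    new_table = [None] * new_m
    for key in old_table:
        if key is None:
            continue
        slot = key % new_m               # the new hash function
        while new_table[slot] is not None:
            slot = (slot + 1) % new_m    # linear probing in the new table
        new_table[slot] = key
    return new_table


old = [49, 58, 69, None, None, None, None, None, 18, 89]
new = rehash(old)
print(len(new), new)
```

Every key has to be hashed and placed again, so one rehash costs time proportional to the size of the table. The sketch doubles the size exactly; since prime table sizes are recommended above, a variant could pick the next prime after the doubled size instead.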

Extendible Hashing
When data is too large to fit in main memory, the main consideration is the number of disk accesses.
Rehashing is very expensive, as all entries have to be inserted all over again into a new table
Borrow an idea from B-trees. Let M be the number of records fitting in one disk block.
As M increases, the depth of the B-tree decreases (but the branching factor increases, so processing time increases)

The strategy used in extendible hashing is to reduce the time to search for the appropriate leaf.
Let the numbers be hashed to 6-bit integers.
Create a pointer table (directory) of size 4, with each cell pointing to the leaf that holds the numbers starting with those 2 bits (D = 2).
Let us assume that each leaf can hold up to M = 4 elements

Directory (D = 2): 00, 01, 10, 11
Leaf 00: 000100, 001000, 001010, 001011
Leaf 01: 010100, 011000
Leaf 10: 100000, 101000, 101100, 101110
Leaf 11: 111000, 111001

What happens when a leaf gets full?
Suppose we want to insert 100100.
This should go to the 3rd leaf, but it is already full
So we split this leaf into 2 leaves
This results in D being changed to 3 (each leaf being determined by 3 bits)
Note that all leaves not involved in the split are now pointed to by two adjacent directory entries.
The directory is new, but the other leaves are not disturbed

Directory (D = 3): 000, 001, 010, 011, 100, 101, 110, 111
Leaf 000, 001: 000100, 001000, 001010, 001011
Leaf 010, 011: 010100, 011000
Leaf 100: 100000, 100100
Leaf 101: 101000, 101100, 101110
Leaf 110, 111: 111000, 111001

If key 000000 is now inserted, then the first leaf needs to be split (others are not disturbed).
The scheme is a very simple strategy that gives quick access times for insert and search operations on large databases.
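
The splitting behaviour described above can be sketched in Python. This is an in-memory illustration only: the class names Leaf and ExtendibleHash are hypothetical, keys are 6-bit integers indexed by their leading bits, each leaf holds M = 4 keys, and disk blocks are not modelled.

```python
KEY_BITS = 6   # keys are 6-bit integers, as in the example
M = 4          # leaf capacity, as in the example


class Leaf:
    def __init__(self, depth):
        self.depth = depth       # number of leading bits shared by all keys in this leaf
        self.keys = []


class ExtendibleHash:
    def __init__(self, depth=2):
        self.depth = depth       # D: the directory has 2^D entries
        self.directory = [Leaf(depth) for _ in range(2 ** depth)]

    def _index(self, key):
        return key >> (KEY_BITS - self.depth)        # leading D bits of the key

    def insert(self, key):
        leaf = self.directory[self._index(key)]
        while len(leaf.keys) >= M:                   # full: split, then look again
            self._split(leaf)
            leaf = self.directory[self._index(key)]
        leaf.keys.append(key)

    def _split(self, leaf):
        if leaf.depth == KEY_BITS:
            raise OverflowError("cannot split further: all key bits are used")
        if leaf.depth == self.depth:
            # Double the directory: each old entry becomes two adjacent entries.
            self.directory = [l for l in self.directory for _ in range(2)]
            self.depth += 1
        new_depth = leaf.depth + 1
        low, high = Leaf(new_depth), Leaf(new_depth)
        for k in leaf.keys:                          # redistribute on the next bit
            bit = (k >> (KEY_BITS - new_depth)) & 1
            (high if bit else low).keys.append(k)
        for i, l in enumerate(self.directory):       # repoint only the split leaf's entries
            if l is leaf:
                bit = (i >> (self.depth - new_depth)) & 1
                self.directory[i] = high if bit else low


table = ExtendibleHash(depth=2)
for bits in ("000100", "001000", "001010", "001011", "010100", "011000",
             "100000", "101000", "101100", "101110", "111000", "111001"):
    table.insert(int(bits, 2))
table.insert(int("100100", 2))   # third leaf is full: directory doubles, D becomes 3
table.insert(int("000000", 2))   # first leaf splits; the directory keeps its size
print(table.depth, len(table.directory))
```

Only the leaf that overflows is rewritten; all other leaves are untouched and, after the directory doubles, are shared by two adjacent directory entries.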

It helps if the bits are fairly random. This can be accomplished by hashing the keys into a reasonably long integer.
Balanced search trees are quite expensive to implement for storing a large number of data values.
If there is any suspicion that the data might be sorted, hashing would be the data structure of choice.

The end