Comp 335 File Structures. Hashing

What is Hashing?

Hashing is a technique used with record files that tries to achieve O(1) (i.e. constant) access to a record's location in the file. An algorithm, called a hash function (h), is given a primary key as input; the resulting output is the location of the record within the file: h(key) = address.

Hashing Example

Assume you want to store 5,000 data records in a file, and you want this to be a hashed file for quick access. Each record is fixed in length, and the primary key for each record is an employee number that is 8 digits long. A common hash function is modulo arithmetic:

h(key) = key mod n; n = 5000
h(82461792) = 82461792 mod 5000 = 1792

The address (RRN) of the record with this key is 1792.
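The modulo example above can be sketched in a few lines of Python. The function name `mod_hash` is hypothetical and chosen only for illustration; the key and file size are the ones from the slide.

```python
def mod_hash(key, n):
    """Map an integer key to an address in the range 0..n-1."""
    return key % n


# The slide's example: an 8-digit employee number in a 5,000-address file.
print(mod_hash(82461792, 5000))  # 1792
```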

Other Hashing Methods: Folding

Folding requires extracting certain groupings from the key and then adding or multiplying the groupings in some fashion to form the hash address.

Example: Key = BISON, Address Space = 101
Step 1: Get the ASCII value of each character in the string: B(66), I(73), S(83), O(79), N(78)
Step 2: Add the values at even indices (0, 2, 4): 66 + 83 + 78 = 227
Step 3: Add the values at odd indices (1, 3): 73 + 79 = 152
Step 4: Multiply the results: 227 * 152 = 34504
Step 5: Take the result modulo the address space: 34504 mod 101 = 63 (hash address)
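A minimal sketch of the five folding steps, assuming zero-based character positions as in the BISON example. The name `fold_hash` is hypothetical.

```python
def fold_hash(key, address_space):
    """Fold a string key: sum even-index and odd-index character codes,
    multiply the two sums, and reduce modulo the address space."""
    codes = [ord(ch) for ch in key]   # Step 1: character codes
    even_sum = sum(codes[0::2])       # Step 2: indices 0, 2, 4, ...
    odd_sum = sum(codes[1::2])        # Step 3: indices 1, 3, ...
    product = even_sum * odd_sum      # Step 4
    return product % address_space    # Step 5


print(fold_hash("BISON", 101))  # 63
```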

Other Hashing Methods: Mid-Square

Mid-square involves squaring the numeric form of a key and extracting some of the digits from the middle of the square.

Example: Assume the address space is 1000.
Key (4-digit int) = 2973
2973 * 2973 = 8838729
Extract the middle digits = 387 (hash address)
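The mid-square example can be sketched as follows. How to pick "the middle" when the square cannot be split evenly is not specified in the slide, so the rule used here (centre the window, leaving any extra digit on the right) is an assumption, as is the name `mid_square_hash`.

```python
def mid_square_hash(key, digits=3):
    """Square the key and return `digits` digits taken from the middle
    of the decimal representation of the square."""
    squared = str(key * key)
    # Assumption: centre the window; an odd leftover digit stays on the right.
    start = (len(squared) - digits) // 2
    return int(squared[start:start + digits])


print(mid_square_hash(2973))  # 2973 * 2973 = 8838729, middle digits 387
```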

Other Hashing Methods: Radix Transformation

Convert the key to a different base and then use modulo arithmetic.

Example: Address space is 100.
Key is 453 (base 10)
Conversion: 382 (base 11), since 3 * 121 + 8 * 11 + 2 = 453
382 mod 100 = 82 (hash address)
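A sketch of radix transformation. The demonstration uses the key 453, which is the decimal value whose base-11 representation is 382. The helper names `to_base` and `radix_hash` are hypothetical, and the handling of a base-11 digit equal to ten (it is carried into the next decimal place) is an assumption the slide does not address.

```python
def to_base(value, base):
    """Return the digits of a non-negative integer in the given base,
    most significant digit first."""
    digits = []
    while value > 0:
        value, remainder = divmod(value, base)
        digits.append(remainder)
    return digits[::-1] or [0]


def radix_hash(key, base, address_space):
    """Rewrite the key in `base`, read those digits back as a decimal
    number, and reduce modulo the address space."""
    as_decimal = 0
    for digit in to_base(key, base):
        # Assumption: a digit >= 10 simply carries into the next place.
        as_decimal = as_decimal * 10 + digit
    return as_decimal % address_space


print(to_base(453, 11))          # [3, 8, 2]
print(radix_hash(453, 11, 100))  # 82
```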

Other Hashing Methods: Multiplicative Function

The key is multiplied by some constant less than one, and the hash function returns some of the digits of the fractional part of the result.

Example: Address space = 1000
Key (5-digit integer): 82165
Multiplier: 0.39731
82165 * 0.39731 = 32644.97615
The first three digits of the fractional part are the hash address = 976
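A sketch of the multiplicative method using Python's `decimal` module so that the fractional digits are exact rather than subject to binary floating-point rounding. The name `multiplicative_hash` and the choice to pass the multiplier as a string are assumptions made for this illustration.

```python
from decimal import Decimal


def multiplicative_hash(key, multiplier, digits=3):
    """Multiply the key by a constant below one and return the first
    `digits` digits of the fractional part of the product."""
    product = Decimal(multiplier) * key  # exact decimal arithmetic
    fraction = product % 1               # keep only the fractional part
    return int(fraction * 10 ** digits)  # truncate to the leading digits


print(multiplicative_hash(82165, "0.39731"))  # 976
```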

Major Problem with Hashing

Given a random set of keys and a hash function (h), it is highly probable that some keys in the set will be hash synonyms. In other words, different keys in the set can produce the same hash function output.

A hashing algorithm can yield three different types of address distributions:
Perfect: no synonyms for a given set of keys. The probability of obtaining a perfect distribution from a large set of unknown keys is very, very low (the textbook gives 1 in 10^120,000).
Random: few synonyms generated. This is what we strive for!
Scud: many synonyms generated.

If the set of keys is known beforehand, it is possible to generate a perfect hashing algorithm (Pearson, Cichelli).

Collisions

When two or more keys hash to the same address, this is called a collision. Collisions have to be accounted for with random hashing algorithms. How collisions are handled becomes a critical issue in the overall search efficiency of a given file. Remember that each search could mean a disk access.

Decreasing the Probability of Collisions

Increase the address space: a common technique is to allocate more addresses in the file than there are records to store. This can greatly decrease the possibility of collisions, assuming the hashing algorithm is random. The obvious disadvantage is wasted space.

Place more than one record at an address: this is commonly referred to as using buckets. A single address can store an array of records. This has been shown to increase search efficiency.

Collision Resolution

Even if you have tried to decrease the probability of collisions, they still can and will happen. Ways to resolve collisions:
Linear Probing
Double Hashing
Prime area with overflow
Chaining

Linear Probing

If a key is hashed to an address that is already occupied or full, search the address space linearly until the first free space is found. This is easy to implement, but it can lead to poor search efficiency. It can take home addresses away from other keys, resulting in more collision handling. It can also take many accesses to determine that a key does not exist. Deletion needs care as well: if the slot of a deleted key is simply marked free, a later search can stop at that slot and miss keys that were placed beyond it.
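A minimal in-memory sketch of linear probing, with a Python list standing in for the file's address space. The class name, the use of `key mod n` as the home address, and the tombstone marker for deleted slots are all assumptions for illustration; tombstones are one common way to handle the deletion problem, not necessarily the method intended in the course.

```python
EMPTY = None
DELETED = object()  # tombstone: slot was used once, so searches must continue


class LinearProbingTable:
    def __init__(self, n):
        self.n = n
        self.slots = [EMPTY] * n

    def _probe(self, key):
        """Yield addresses starting at the home address, wrapping around."""
        home = key % self.n
        for step in range(self.n):
            yield (home + step) % self.n

    def insert(self, key):
        first_free = None
        for addr in self._probe(key):
            slot = self.slots[addr]
            if slot is EMPTY:
                if first_free is None:
                    first_free = addr
                break                      # key cannot be stored past here
            if slot is DELETED:
                if first_free is None:
                    first_free = addr      # reusable, but keep scanning
            elif slot == key:
                return addr                # already present
        if first_free is None:
            raise OverflowError("address space is full")
        self.slots[first_free] = key
        return first_free

    def search(self, key):
        for addr in self._probe(key):
            slot = self.slots[addr]
            if slot is EMPTY:
                return None                # a never-used slot ends the search
            if slot is not DELETED and slot == key:
                return addr
        return None

    def delete(self, key):
        addr = self.search(key)
        if addr is None:
            return False
        self.slots[addr] = DELETED
        return True
```

With 7 addresses, the keys 10, 17 and 24 all have home address 3 and land at 3, 4 and 5. After deleting 17, a search for 24 still succeeds because the tombstone at address 4 does not stop the probe.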

Double Hashing

Upon a collision, the key is re-hashed using a different algorithm; the result determines the increment to use when searching for an open address. The same problems exist as with linear probing. Research has shown that this technique gives better performance than linear probing.
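A sketch of double hashing. The second hash function used here, `1 + key mod (n - 1)`, and the requirement that `n` be prime (so that every increment eventually visits every address) are assumptions chosen for the illustration; the slide does not say which second algorithm is used. The function names are hypothetical.

```python
def probe_sequence(key, n):
    """Return the order in which addresses are tried for this key.
    Assumes n is prime so the sequence covers all n addresses."""
    home = key % n              # first hash: home address
    step = 1 + key % (n - 1)    # second hash: increment, never zero
    return [(home + i * step) % n for i in range(n)]


def insert(slots, key):
    """Store key in the first open address along its probe sequence."""
    for addr in probe_sequence(key, len(slots)):
        if slots[addr] is None:
            slots[addr] = key
            return addr
    raise OverflowError("address space is full")


slots = [None] * 7
for key in (10, 17, 24):        # all three keys have home address 3
    print(key, "->", insert(slots, key))
```

Unlike linear probing, the three synonyms do not pile up in adjacent slots, because each key gets its own increment.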

Prime Area with Overflow

Usually used with buckets. A bucket holds x records in the prime address space and also contains a pointer to an overflow area of the file, which is entry-sequenced. This pointer points to the first overflow record, and each overflow record contains a pointer to the next overflow record. This is a common technique and gives excellent search efficiency.
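A small in-memory sketch of a prime area with an entry-sequenced overflow area. Lists stand in for the two regions of the file, `-1` stands in for a null pointer, and all names are hypothetical. Appending new overflow records to the tail of the chain is an assumption; the slide only says that the bucket points to the first overflow record.

```python
class BucketFile:
    def __init__(self, n_buckets, bucket_size):
        self.bucket_size = bucket_size
        # Prime area: each bucket holds records plus an overflow pointer.
        self.prime = [{"records": [], "overflow": -1} for _ in range(n_buckets)]
        # Overflow area: entry-sequenced list of [key, next_index].
        self.overflow = []

    def insert(self, key):
        bucket = self.prime[key % len(self.prime)]
        if len(bucket["records"]) < self.bucket_size:
            bucket["records"].append(key)
            return
        self.overflow.append([key, -1])          # append in arrival order
        new_index = len(self.overflow) - 1
        if bucket["overflow"] == -1:
            bucket["overflow"] = new_index       # first overflow record
            return
        i = bucket["overflow"]
        while self.overflow[i][1] != -1:         # walk to the end of the chain
            i = self.overflow[i][1]
        self.overflow[i][1] = new_index

    def search(self, key):
        """Return the number of accesses needed to find key, or None."""
        bucket = self.prime[key % len(self.prime)]
        accesses = 1                             # reading the bucket itself
        if key in bucket["records"]:
            return accesses
        i = bucket["overflow"]
        while i != -1:
            accesses += 1                        # one access per overflow record
            if self.overflow[i][0] == key:
                return accesses
            i = self.overflow[i][1]
        return None
```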

Chaining

The file consists of a hash table, which is simply an array of pointers. When a key is hashed, the result is an index into the hash table. At this location is a pointer to the first record that has this hash address. All the records with that address are then chained together as a linked list. The data record portion of the file can be entry-sequenced.
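A sketch of chaining with an entry-sequenced data area. Two lists stand in for the hash table and the data records, `-1` stands in for a null pointer, and the names are hypothetical. Linking each new record at the head of its chain is an assumption made to keep insertion to a single append.

```python
class ChainedFile:
    def __init__(self, n):
        self.table = [-1] * n   # hash table: RRN of the first record in each chain
        self.records = []       # entry-sequenced data: (key, next_rrn)

    def insert(self, key):
        addr = key % len(self.table)
        # The new record points at the old head of the chain, then becomes the head.
        self.records.append((key, self.table[addr]))
        self.table[addr] = len(self.records) - 1
        return self.table[addr]

    def search(self, key):
        rrn = self.table[key % len(self.table)]
        while rrn != -1:
            stored_key, next_rrn = self.records[rrn]
            if stored_key == key:
                return rrn
            rrn = next_rrn
        return None
```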

Hash Address Distributions

Assuming you have a random hash function, the Poisson function can be used to compute various probabilities, such as:
How many empty hash slots will there be?
What percentage of the time will access to a key take more than one access to find it?
What is the probability that a certain hash address will have x keys assigned to it?

Poisson Function

$$p(x) = \frac{(r/n)^{x} \, e^{-r/n}}{x!}$$

n = the address space (number of addresses)
r = number of keys to hash
x = number of records assigned to a given address
r/n = packing density (load factor)
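The Poisson function translates directly into code. This is a minimal sketch using only the standard `math` module; the function name `poisson` is hypothetical, and the parameters follow the symbols defined above.

```python
import math


def poisson(x, r, n):
    """Probability that a given address receives exactly x of the r keys
    when they are hashed randomly into n addresses."""
    load = r / n  # packing density
    return load ** x * math.exp(-load) / math.factorial(x)


# Fraction of addresses expected to stay empty at a packing density of 1.
print(round(poisson(0, 1000, 1000), 3))
```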

Poisson Function Example

Assume 1,000 records are to be hashed into a 1,000-address hashed file.

1) What is the probability that a given address will have two keys hashed to it?

$$p(2) = \frac{(1{,}000/1{,}000)^{2} \, e^{-1{,}000/1{,}000}}{2!} = \frac{e^{-1}}{2} = \frac{.368}{2} = .184$$

2) 1,000 (number of addresses) * .184 = 184

Therefore approximately 184 addresses will each have 2 keys hashed to them, which means these addresses produce 184 overflow records.

Poisson Function Example

Assume 1,000 records are to be hashed into a 1,500-address hashed file.

1) What is the probability that a given address will have two keys hashed to it? With r/n rounded to .67:

$$p(2) = \frac{(1{,}000/1{,}500)^{2} \, e^{-1{,}000/1{,}500}}{2!} = \frac{(.67)^{2} \, e^{-.67}}{2} = \frac{(.449)(.512)}{2} = \frac{.230}{2} = .115$$

2) 1,500 (number of addresses) * .115 = 172.5 (173)

Therefore approximately 173 addresses will each have 2 keys hashed to them, which means these addresses produce 173 overflow records. (Carrying r/n = 2/3 without rounding gives p(2) of about .114 and roughly 171 addresses.)
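The two worked examples can be checked numerically. The sketch below recomputes the expected number of addresses holding exactly two keys, without intermediate rounding, so the 1,500-address result comes out slightly below the hand calculation. The function `expected_overflow` goes one step beyond the slides: it also counts addresses with three or more keys, assuming each address holds a single record. That extension, the cutoff `max_x`, and all function names are assumptions for illustration.

```python
import math


def poisson(x, r, n):
    """Probability that a given address receives exactly x of r keys."""
    load = r / n
    return load ** x * math.exp(-load) / math.factorial(x)


def expected_addresses(x, r, n):
    """Expected number of addresses that receive exactly x keys."""
    return n * poisson(x, r, n)


def expected_overflow(r, n, max_x=50):
    """Expected overflow records if each address holds one record:
    an address with x keys contributes x - 1 overflow records."""
    return sum((x - 1) * expected_addresses(x, r, n)
               for x in range(2, max_x + 1))


print(round(expected_addresses(2, 1000, 1000)))  # 184
print(round(expected_addresses(2, 1000, 1500)))  # 171
```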