Data and File Structures Chapter 11. Hashing

Similar documents
Comp 335 File Structures. Hashing

Data Structure Lecture#22: Searching 3 (Chapter 9) U Kang Seoul National University

5. Hashing. 5.1 General Idea. 5.2 Hash Function. 5.3 Separate Chaining. 5.4 Open Addressing. 5.5 Rehashing. 5.6 Extendible Hashing. 5.

Hash Tables. Hashing Probing Separate Chaining Hash Function

Data Structures And Algorithms

Hashing IV and Course Overview

Introduction. hashing performs basic operations, such as insertion, better than other ADTs we ve seen so far

Successful vs. Unsuccessful

COMP171. Hashing.

CARNEGIE MELLON UNIVERSITY DEPT. OF COMPUTER SCIENCE DATABASE APPLICATIONS

Module 3: Hashing Lecture 9: Static and Dynamic Hashing. The Lecture Contains: Static hashing. Hashing. Dynamic hashing. Extendible hashing.

Chapter 5 Hashing. Introduction. Hashing. Hashing Functions. hashing performs basic operations, such as insertion,

Hashing. Introduction to Data Structures Kyuseok Shim SoEECS, SNU.

Hashing for searching

Hashing. Given a search key, can we guess its location in the file? Goal: Method: hash keys into addresses

Topic HashTable and Table ADT

Symbol Table. Symbol table is used widely in many applications. dictionary is a kind of symbol table data dictionary is database management

Hash-Based Indexes. Chapter 11

CMSC 341 Hashing (Continued) Based on slides from previous iterations of this course

Hash-Based Indexing 1

Hash-Based Indexes. Chapter 11. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1

Hashed-Based Indexing

Hashing Techniques. Material based on slides by George Bebis

Hash Tables Outline. Definition Hash functions Open hashing Closed hashing. Efficiency. collision resolution techniques. EECS 268 Programming II 1

Hashing. Hashing Procedures

BBM371& Data*Management. Lecture 6: Hash Tables

Direct File Organization Hakan Uraz - File Organization 1

General Idea. Key could be an integer, a string, etc e.g. a name or Id that is a part of a large employee structure

Introduction hashing: a technique used for storing and retrieving information as quickly as possible.

TABLES AND HASHING. Chapter 13

HASH TABLES.

Selection Queries. to answer a selection query (ssn=10) needs to traverse a full path.

Hash Table and Hashing

Hashing HASHING HOW? Ordered Operations. Typical Hash Function. Why not discard other data structures?

CITS2200 Data Structures and Algorithms. Topic 15. Hash Tables

CSCD 326 Data Structures I Hashing

CMSC 341 Lecture 16/17 Hashing, Parts 1 & 2

Hashing. Dr. Ronaldo Menezes Hugo Serrano. Ronaldo Menezes, Florida Tech

Algorithms and Data Structures

Understand how to deal with collisions

More on Hashing: Collisions. See Chapter 20 of the text.

COSC160: Data Structures Hashing Structures. Jeremy Bolton, PhD Assistant Teaching Professor

More B-trees, Hash Tables, etc. CS157B Chris Pollett Feb 21, 2005.

Fundamental Algorithms

CSE 214 Computer Science II Searching

Chapter 17. Disk Storage, Basic File Structures, and Hashing. Records. Blocking

Hashing file organization

Database Technology. Topic 7: Data Structures for Databases. Olaf Hartig.

File System Interface and Implementation

Introduction to Hashing

Algorithms in Systems Engineering ISE 172. Lecture 12. Dr. Ted Ralphs

Adapted By Manik Hosen

Question Bank Subject: Advanced Data Structures Class: SE Computer

Hash Tables. Hash functions Open addressing. November 24, 2017 Hassan Khosravi / Geoffrey Tien 1

Open Addressing: Linear Probing (cont.)

Worst-case running time for RANDOMIZED-SELECT

key h(key) Hash Indexing Friday, April 09, 2004 Disadvantages of Sequential File Organization Must use an index and/or binary search to locate data

Material You Need to Know

Hash-Based Indexes. Chapter 11 Ramakrishnan & Gehrke (Sections ) CPSC 404, Laks V.S. Lakshmanan 1

! A Hash Table is used to implement a set, ! The table uses a function that maps an. ! The function is called a hash function.

III Data Structures. Dynamic sets

Data Structures and Algorithm Analysis (CSC317) Hash tables (part2)

Review of Elementary Data. Manoj Kumar DTU, Delhi

CSE373: Data Structures & Algorithms Lecture 17: Hash Collisions. Kevin Quinn Fall 2015

FILE SYSTEM IMPLEMENTATION. Sunu Wibirama

Hashing. Data organization in main memory or disk

AAL 217: DATA STRUCTURES

Hashing. October 19, CMPE 250 Hashing October 19, / 25

Dictionary. Dictionary. stores key-value pairs. Find(k) Insert(k, v) Delete(k) List O(n) O(1) O(n) Sorted Array O(log n) O(n) O(n)

CS 261 Data Structures

Chapter 1 Disk Storage, Basic File Structures, and Hashing.

Fundamentals of Database Systems Prof. Arnab Bhattacharya Department of Computer Science and Engineering Indian Institute of Technology, Kanpur

Hash Tables. Hash functions Open addressing. March 07, 2018 Cinda Heeren / Geoffrey Tien 1

File Organization and Storage Structures

CMSC 132: Object-Oriented Programming II. Hash Tables

Data and File Structures Laboratory

Algorithms with numbers (2) CISC4080, Computer Algorithms CIS, Fordham Univ.! Instructor: X. Zhang Spring 2017

Algorithms with numbers (2) CISC4080, Computer Algorithms CIS, Fordham Univ. Acknowledgement. Support for Dictionary

Indexing: Overview & Hashing. CS 377: Database Systems

Carnegie Mellon Univ. Dept. of Computer Science /615 DB Applications. Outline. (Static) Hashing. Faloutsos - Pavlo CMU SCS /615

Today: Finish up hashing Sorted Dictionary ADT: Binary search, divide-and-conquer Recursive function and recurrence relation

Algorithms and Data Structures

9/24/ Hash functions

HASH TABLES. Goal is to store elements k,v at index i = h k

Data Structures & File Management

Module 5: Hashing. CS Data Structures and Data Management. Reza Dorrigiv, Daniel Roche. School of Computer Science, University of Waterloo

Data Structures (CS 1520) Lecture 23 Name:

1. [1 pt] What is the solution to the recurrence T(n) = 2T(n-1) + 1, T(1) = 1

CS 241 Analysis of Algorithms

HashTable CISC5835, Computer Algorithms CIS, Fordham Univ. Instructor: X. Zhang Fall 2018

Chapter 12: Indexing and Hashing (Cnt(

Dictionaries and Hash Tables

Introducing Hashing. Chapter 21. Copyright 2012 by Pearson Education, Inc. All rights reserved

4. SEARCHING AND SORTING LINEAR SEARCH

COMP 430 Intro. to Database Systems. Indexing

Use PageUp and PageDown to move from screen to screen. Click on speaker to play sound.

ASSIGNMENTS. Progra m Outcom e. Chapter Q. No. Outcom e (CO) I 1 If f(n) = Θ(g(n)) and g(n)= Θ(h(n)), then proof that h(n) = Θ(f(n))

CS 245 Midterm Exam Winter 2014

Hashing. 1. Introduction. 2. Direct-address tables. CmSc 250 Introduction to Algorithms

Data Structures. Topic #6

Transcription:

Data and File Structures Chapter 11 Hashing 1

Motivation Sequential Searching can be done in O(N) access time, meaning that the number of seeks grows in proportion to the size of the file. B-Trees improve on this greatly, providing O(Log k N) access where k is a measure of the leaf size - i.e., the number of records that can be stored in a leaf. However, what we would like to achieve is an O(1) access, which means that no matter how big a file grows, access to a record always takes the same small number of seeks. Hashing techniques can achieve such performance provided that the file does not increase in time. 2

What is Hashing? A Hash function is a function h(k) which transforms a key K into an address. Hashing is like indexing in that it involves associating a key with a relative record address. However, Hashing is different from indexing in two important ways: With hashing, there is no obvious connection between the key and the location. With hashing two different keys may be transformed to the same address. 3

Example 1/2 4

Example 2/2 5

Collisions When two different keys produce the same address, there is a collision. The keys involved are called synonyms. Coming up with a hashing function that avoids collision is extremely difficult. It is best to simply find ways to deal with them. Possible Solutions: Spread out the records Use extra memory Put more than one record at a single address. 6

Distribution of Records among Address 1/2 Records can be distributed among addresses in different ways: there may be - (a) no synonyms (uniform distribution); - (b) only synonyms (worst case); - (c) a few synonyms (happens with random distributions). 7

Distribution of Records among Address 1/2 Purely uniform distributions are difficult to obtain. Random distributions can be easily derived, but they are not perfect since they may generate a fair number of synonyms. We want better hashing methods. 8

Better than Random Distribution Here are some methods that are potentially better than random: Examine keys for a pattern Ex., F(year)=(year-1970) mod (2012-1970+1) Fold parts of the key Folding means extracting digits from a key and adding the parts together as in the previous example Divide the key by a prime number Square the key and take the middle Ex., key=453 453^2 = 205209 52 Radix transformation Transform the number into another base and then divide by the maximum address Ex., key=453 base 11: 382 382 mod 99 85 9

Collision Resolution: Progressive Overflow How do we deal with records that cannot fit into their home address? A simple approach: Progressive Overflow or Linear Probing. If a key, k1, hashes into the same address, a1, as another key, k2, then look for the first available address, a2, and place k1 in a2. If the end of the address space is reached, then wrap around it. When searching for a key that is not in, if the address space is not full, then an empty address will be reached or the search will come back to where it began. 10

Example 11

Search Length when using Progressive Overflow Progressive Overflow causes extra searches and thus extra disk accesses. If there are many collisions, then many records will be far from home. Definitions: Search length refers to the number of accesses required to retrieve a record from secondary memory. The average search length is the average number of times you can expect to have to access the disk to retrieve a record. Average search length = (Total search length) / (Total number of records) 12

Example 13

Storing More than One Record per Address: Buckets Definition: A bucket describes a block of records sharing the same address that is retrieved in one disk access. When a record is to be stored or retrieved, its home bucket address is determined by hashing. When a bucket is filled, we still have to worry about the record overflow problem, but this occurs much less often than when each address can hold only one record. 14

Example overflow 15

Effect of Buckets on Performance To compute how densely packed a file is, we need to consider: 1) the number of addresses, N, (buckets) 2) the number of records we can put at each address, b, (bucket size) and 3) the number of records, r. Then, Packing Density = r/bn Though the packing density does not change when halving the number of addresses and doubling the size of the buckets, the expected number of overflows decreases dramatically. 16

Example (N=1000, r=750) 17

Making Deletions Deleting a record from a hashed file is more complicated than adding a record for two reasons: The slot freed by the deletion must not be allowed to hinder later searches It should be possible to reuse the freed slot for later additions. In order to deal with deletions we use tombstones, i.e., a marker indicating that a record once lived there but no longer does. Tombstones solve both the problems caused by deletion. Insertion of records is slightly different when using tombstones. 18

Example (Progressive Overflow) 19

Other Collision Resolution Techniques There are a few variations on random hashing that may improve performance: Double Hashing: When an overflow occurs, use a second hashing function to map the record to its overflow location. Chained Progressive Overflow: Like Progressive overflow except that synonyms are linked together with pointers. Chaining with a Separate Overflow Area: Like chained progressive overflow except that overflow addresses do not occupy home addresses. Scatter Tables: The Hash file contains no records, but only pointers to records. I.e., it is an index. 20

Example (Double Hashing) 21

Example (Chained Progressive Overflow) 22

Example (Chained with a Separate Overflow Area) 23

Example (Scatter Tables) 24