The dictionary problem

Similar documents
7 Hashing: chaining. Summer Term 2010

Dictionary. Dictionary. stores key-value pairs. Find(k) Insert(k, v) Delete(k) List O(n) O(1) O(n) Sorted Array O(log n) O(n) O(n)

Hashing. Dr. Ronaldo Menezes Hugo Serrano. Ronaldo Menezes, Florida Tech

Introducing Hashing. Chapter 21. Copyright 2012 by Pearson Education, Inc. All rights reserved

Tirgul 7. Hash Tables. In a hash table, we allocate an array of size m, which is much smaller than U (the set of keys).

Topic HashTable and Table ADT

Hash[ string key ] ==> integer value

Lecture 16. Reading: Weiss Ch. 5 CSE 100, UCSD: LEC 16. Page 1 of 40

CSE 332: Data Structures & Parallelism Lecture 10:Hashing. Ruth Anderson Autumn 2018

Lecture 4. Hashing Methods

Hashing. Manolis Koubarakis. Data Structures and Programming Techniques

Advanced Algorithmics (6EAP) MTAT Hashing. Jaak Vilo 2016 Fall

Introduction. hashing performs basic operations, such as insertion, better than other ADTs we ve seen so far

Cpt S 223. School of EECS, WSU

HASH TABLES.

Fast Lookup: Hash tables

Week 9. Hash tables. 1 Generalising arrays. 2 Direct addressing. 3 Hashing in general. 4 Hashing through chaining. 5 Hash functions.

CS 270 Algorithms. Oliver Kullmann. Generalising arrays. Direct addressing. Hashing in general. Hashing through chaining. Reading from CLRS for week 7

COSC160: Data Structures Hashing Structures. Jeremy Bolton, PhD Assistant Teaching Professor

Algorithms with numbers (2) CISC4080, Computer Algorithms CIS, Fordham Univ.! Instructor: X. Zhang Spring 2017

Algorithms with numbers (2) CISC4080, Computer Algorithms CIS, Fordham Univ. Acknowledgement. Support for Dictionary

HASH TABLES. Goal is to store elements k,v at index i = h k

Hashing. CptS 223 Advanced Data Structures. Larry Holder School of Electrical Engineering and Computer Science Washington State University

5. Hashing. 5.1 General Idea. 5.2 Hash Function. 5.3 Separate Chaining. 5.4 Open Addressing. 5.5 Rehashing. 5.6 Extendible Hashing. 5.

Hash Table. Ric Glassey

1 CSE 100: HASH TABLES

HashTable CISC5835, Computer Algorithms CIS, Fordham Univ. Instructor: X. Zhang Fall 2018

AAL 217: DATA STRUCTURES

COMP171. Hashing.

CSE100. Advanced Data Structures. Lecture 21. (Based on Paul Kube course materials)

ECE 242 Data Structures and Algorithms. Hash Tables I. Lecture 24. Prof.

1/9/2014. Unary operation: min, max, sort, makenull, Set + Operations define an ADT.

More on Hashing: Collisions. See Chapter 20 of the text.

General Idea. Key could be an integer, a string, etc e.g. a name or Id that is a part of a large employee structure

BBM371& Data*Management. Lecture 6: Hash Tables

Lecture 18. Collision Resolution

Lecture 7: Efficient Collections via Hashing

Unit #5: Hash Functions and the Pigeonhole Principle

CS1020 Data Structures and Algorithms I Lecture Note #15. Hashing. For efficient look-up in a table

Lecture 16 More on Hashing Collision Resolution

1.00 Lecture 32. Hashing. Reading for next time: Big Java Motivation

Successful vs. Unsuccessful

CS61B Lecture #24: Hashing. Last modified: Wed Oct 19 14:35: CS61B: Lecture #24 1

of characters from an alphabet, then, the hash function could be:

Acknowledgement HashTable CISC4080, Computer Algorithms CIS, Fordham Univ.

Data Structures And Algorithms

Data Structures and Algorithms(10)

Hash Tables Hash Tables Goodrich, Tamassia

CS2210 Data Structures and Algorithms

CHAPTER 9 HASH TABLES, MAPS, AND SKIP LISTS

Comp 335 File Structures. Hashing

Hashing. Yufei Tao. Department of Computer Science and Engineering Chinese University of Hong Kong

Introduction hashing: a technique used for storing and retrieving information as quickly as possible.

Hash Table and Hashing

Algorithms and Data Structures

Chapter 20 Hash Tables

ECE 242 Data Structures and Algorithms. Hash Tables II. Lecture 25. Prof.

Chapter 5 Hashing. Introduction. Hashing. Hashing Functions. hashing performs basic operations, such as insertion,

1 / 22. Inf 2B: Hash Tables. Lecture 4 of ADS thread. Kyriakos Kalorkoti. School of Informatics University of Edinburgh

Hashing. Hashing Procedures

Data and File Structures Chapter 11. Hashing

Hash Tables Outline. Definition Hash functions Open hashing Closed hashing. Efficiency. collision resolution techniques. EECS 268 Programming II 1

Hash Tables. Hashing Probing Separate Chaining Hash Function

DATA STRUCTURES AND ALGORITHMS

CSE373 Fall 2013, Second Midterm Examination November 15, 2013

1 Probability Review. CS 124 Section #8 Hashing, Skip Lists 3/20/17. Expectation (weighted average): the expectation of a random quantity X is:

Dynamic Dictionaries. Operations: create insert find remove max/ min write out in sorted order. Only defined for object classes that are Comparable

Outline. 1 Hashing. 2 Separate-Chaining Symbol Table 2 / 13

Open Addressing: Linear Probing (cont.)

Hash Tables. Hash Tables

Review. CSE 143 Java. A Magical Strategy. Hash Function Example. Want to implement Sets of objects Want fast contains( ), add( )

Outline. hash tables hash functions open addressing chained hashing

Data Structures and Algorithms. Chapter 7. Hashing

Worst-case running time for RANDOMIZED-SELECT

ADT asscociafve array INSERT, SEARCH, DELETE. Advanced Algorithmics (6EAP) Some reading MTAT Hashing. Jaak Vilo.

Introduction to Hashing

Data Structures and Object-Oriented Design VIII. Spring 2014 Carola Wenk

Hash tables. hashing -- idea collision resolution. hash function Java hashcode() for HashMap and HashSet big-o time bounds applications

Data Structures - CSCI 102. CS102 Hash Tables. Prof. Tejada. Copyright Sheila Tejada

Hash Tables. Hash functions Open addressing. November 24, 2017 Hassan Khosravi / Geoffrey Tien 1

CS 241 Analysis of Algorithms

Dictionaries. Application

Fall 2017 Mentoring 9: October 23, Min-Heapify This. Level order, bubbling up. Level order, bubbling down. Reverse level order, bubbling up

CMSC 132: Object-Oriented Programming II. Hash Tables

CS 310 Hash Tables, Page 1. Hash Tables. CS 310 Hash Tables, Page 2

Announcements. Submit Prelim 2 conflicts by Thursday night A6 is due Nov 7 (tomorrow!)

Hashing (Κατακερματισμός)

CMSC 341 Lecture 16/17 Hashing, Parts 1 & 2

CSCI 104. Rafael Ferreira da Silva. Slides adapted from: Mark Redekopp and David Kempe

Standard ADTs. Lecture 19 CS2110 Summer 2009

ADT asscociafve array. Advanced Algorithmics (6EAP) Some reading INSERT, SEARCH, DELETE. MTAT Hashing. Jaak Vilo.

CSE 100: HASHING, BOGGLE

MIDTERM EXAM THURSDAY MARCH

CSCI 104 Hash Tables & Functions. Mark Redekopp David Kempe

CS 350 Algorithms and Complexity

Dictionaries-Hashing. Textbook: Dictionaries ( 8.1) Hash Tables ( 8.2)

III Data Structures. Dynamic sets

HO #13 Fall 2015 Gary Chan. Hashing (N:12)

Algorithms and Data Structures

Hash table basics mod 83 ate. ate. hashcode()

Transcription:

6 Hashing

The dictionary problem Different approaches to the dictionary problem: previously: Structuring the set of currently stored keys: lists, trees, graphs,... structuring the complete universe of all possible keys: hashing Hashing describes a special way of storing the elements of a set by breaking down the universe of possible keys. The position of the data element in the memory is given by computing a so called hash value directly from the key. 08.05.2012 Theory 1 - Hashing 2

Hashing Dictionary problem: lookup, insertion, deletion of data set (keys) Place of data set d: computed from the key k of d no comparisons constant time Data structure: linear field (array) of size m hash table key k 0 1 2 i m-2 m-1.. The memory is divided in m containers (buckets) of the same size. 08.05.2012 Theory 1 - Hashing 3

Implementation in Java class TableEntry { private Object key,value; } abstract class HashTable { private TableEntry[] tableentry; private int capacity; // Construktor HashTable (int capacity) { this.capacity = capacity; tableentry = new TableEntry [capacity]; for (int i = 0; i <= capacity-1; i++) tableentry[i] = null; } // the hash function protected abstract int h (Object key); // insert element with given key and value (if not there already) public abstract void insert (Object key Object value); // delete element with given key (if there) public abstract void delete (Object key); // locate element with given key public abstract Object search (Object key); } // class hashtable 08.05.2012 Theory 1 - Hashing 4

Hashing - problems 1. Size of the hash table only a small subset S of all possible keys (the universe) U actually occurs 2. Calculation of the adress of a data set - keys are not necessarily integers - index depends on the size of hash table In Java: public class Object {... public int hashcode() { }... } The universe U should be distributed as evenly as possibly to the numbers -2 31,, 2 31-1. 08.05.2012 Theory 1 - Hashing 5

Hash function (1) Set of keys S U Universe U of all possible keys hash function h 0,,m-1 hash table T h(s) = hash address h(s) = h(s ) s and s are synonyms with respect to h address collision 08.05.2012 Theory 1 - Hashing 6

Hash function (2) Definition: Let U be a universe of possible keys and {B 0,...,B m-1 } a set of m buckets for storing elements from U. Then a hash function h : U {0,..., m - 1} maps each key s U to a value h(s) (and the corresponding element to the bucket B h(s) ). The indices of the buckets also called hash addresses, the complete set of buckets is called hash table. B 0 B 1 B m-1 08.05.2012 Theory 1 - Hashing 7

Address collisions A hash function h calculates for each key s the index of the associated bucket. It would be ideal if the mapping of a data set with key s to a bucket h(s) was unique (one-to-one): insertion and lookup could be carried out in constant time (O(1)). In reality, there will be collisions: several elements can be mapped to the same hash address. Collisions have to be addressed (in one way or another). 08.05.2012 Theory 1 - Hashing 8

Hashing methods Example for U: all names in Java with length 40 U = 62 40 If U > m : address collisions are inevitable Hashing methods: 1. Choice of a hash function that is as good as possible 2. Strategy for resolving address collisions Load factor : Assumption: table size m is fixed 08.05.2012 Theory 1 - Hashing 9

Requirements for good hash functions Requirements A collision occurs if the bucket B h(s) for a newly inserted element with key s is not empty A hash function h is called perfect for a set S of keys if no collisions occur for S If h is perfect and S = n, then n m. The load factor of the hash table is n/m 1. A hash function is well chosen if the load factor is as high as possible, for many sets of keys the # of collisions is as small as possible, it can be computed efficiently. 08.05.2012 Theory 1 - Hashing 10

Example of a hash function Example: hash function for strings public static int h (String s){ int k = 0, m = 13; for (int i=0; i < s.length(); i++) k += (int)s.charat (i); return ( k%m ); } The following hash addresses are generated for m = 13 key s h(s) Test 0 Hallo 2 SE 9 Algo 10 The greater the choice of m, the more perfect h becomes. 08.05.2012 Theory 1 - Hashing 11

Probability of collision (1) Choice of the hash function The requirements high load factor and small number of collisions are in conflict with each other. We need to find a suitable compromise. For the set S of keys with S = n and buckets B 0,..., B m-1 : for n > m conflicts are inevitable for n < m there is a (residual) probability P K (n,m) for the occurrence of at least one collision. How can we find an estimate for P K (n,m)? For any key s the probability that h(s) = j with j {0,..., m - 1} is: P K [h(s) = j] = 1/m, provided that there is an equal distribution. We have P K (n,m) = 1 - P K (n,m), if P K (n,m) is the probability that storing of n elements in m buckets leads to no collision. 08.05.2012 Theory 1 - Hashing 12

Probability of collision (2) On the probability of collisions If n keys are distributed sequentially to the buckets B 0,..., B m-1 (with equal distribution), each time we have P [h(s) = j ] = 1/m. The probability P(i) for no collision in step i is P(i) = (m - (i - 1))/m Hence, we have For example, if m = 365, P(23) > 50% and P(50) 97% ( birthday paradox ) 08.05.2012 Theory 1 - Hashing 13

Probability of collision (3) Birthday paradox 08.05.2012 Theory 1 - Hashing 14

Common hash functions Hash fuctions used in practice: see: D.E. Knuth: The Art of Computer Programming for U = integer the [divisions-residue method] is used: h(s) = (a s) mod m (a 0, a m, m prime) for strings of characters of the form s = s 0 s 1... s k-1 one can use: e.g. B = 131 and w = word width (bits) of the computer (w = 32 or w = 64 is common). 08.05.2012 Theory 1 - Hashing 15

Simple hash functions Choice of the hash function - simple and quick computation - even distribution of the data (example: compiler) (Simple) division-residue method How to choose of m? Examples: h(k) = k mod m a) m even h(k) even k even Problematic if the last bit has a meaning (e.g. 0 = female, 1 = male) b) m = 2 p yields the p lowest dual digits of k Rule: choose m prime, and m is not a factor of any r i +/- j, where i and j are small, non-negative numbers and r is the radix of the representation. 08.05.2012 Theory 1 - Hashing 16

Multiplicative method (1) Choose constant 1. Compute 2. Choice of m is uncritical 08.05.2012 Theory 1 - Hashing 17

Multiplicative method (2) Example: Of all numbers leads to the most even distribution. 08.05.2012 Theory 1 - Hashing 18

Universal hashing Problem: if h is fixed there are S U with many collisions Idea of universal hashing: Choose hash function h randomly (h is independent of the keys that are going to be stored) H finite set of hash functions Definition: H is universal, if for arbitrary distinct keys x,y U: Hence: if x, y U, H universal, h H picked randomly 08.05.2012 Theory 1 - Hashing 19

A universal class of hash functions Assumptions: U = p (p prime) and U = {0,, p-1} let a {1,, p-1}, b {0,, p-1} and h a,b : U {0,,m-1} be defined as follows h a,b = ((ax+b) mod p) mod m Then: The set H = {h a,b 1 a < p-1, 0 b < p-1} is a universal class of hash functions. 08.05.2012 Theory 1 - Hashing 20

Universal hashing example Hash table T of size 3, U = 5 Consider the 20 functions (set H ): x+0 2x+0 3x+0 4x+0 x+1 2x+1 3x+1 4x+1 x+2 2x+2 3x+2 4x+2 x+3 2x+3 3x+3 4x+3 x+4 2x+4 3x+4 4x+4 each (mod 5) (mod 3) and the keys 1 und 4 We get: (1*1+0) mod 5 mod 3 = 1 = (1*4+0) mod 5 mod 3 (1*1+4) mod 5 mod 3 = 0 = (1*4+4) mod 5 mod 3 (4*1+0) mod 5 mod 3 = 1 = (4*4+0) mod 5 mod 3 (4*1+4) mod 5 mod 3 = 0 = (4*4+4) mod 5 mod 3 08.05.2012 Theory 1 - Hashing 21

Possible ways of treating collisions Treatment of collisions: collisions are treated differently in different methods. a data set with key s is called a colliding element if bucket B h(s) is already taken by another data set. what can we do with colliding elements? 1. chaining: Implement the buckets as linked lists. Colliding elements are stored in these lists. 2. open Addressing: Colliding elements are stored in other vacant buckets. During storage and lookup, these are found through so-called probing. 08.05.2012 Theory 1 - Hashing 22