Advanced Algorithmics (6EAP), MTAT.03.238: Hashing. Jaak Vilo, 2016 Fall

ADT associative array: INSERT, SEARCH, DELETE. An associative array (also associative container, map, mapping, dictionary, finite map, and in query processing an index or index file) is an abstract data type composed of a collection of unique keys and a collection of values, where each key is associated with one value (or set of values). The operation of finding the value associated with a key is called a lookup or indexing, and this is the most important operation supported by an associative array.
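As a concrete illustration of these three operations, here is a minimal sketch using Python's built-in dict, which is itself a hash-table-backed associative array; the phone_book name and the stored values are made-up example data, not part of the lecture.

# INSERT, SEARCH (lookup) and DELETE on an associative array,
# illustrated with Python's built-in dict.
phone_book = {}                     # empty associative array

phone_book["alice"] = "555-0101"    # INSERT: associate a key with a value
phone_book["bob"] = "555-0102"

number = phone_book.get("alice")    # SEARCH / lookup: returns "555-0101"
missing = phone_book.get("carol")   # lookup of an absent key returns None

del phone_book["bob"]               # DELETE: remove the key and its value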

Some reading: MIT http://ocw.mit.edu/ocwweb/electrical-engineering-and-computer-science/6-046jfall-2005/VideoLectures/detail/embed07.htm http://ocw.mit.edu/ocwweb/electrical-engineering-and-computer-science/6-046jfall-2005/VideoLectures/detail/embed08.htm CMU: http://www.cs.cmu.edu/afs/cs/academic/class/15451-s09/www/lectures/lect0212.pdf

Why? Speed: expected O(1) operations. Space efficient.

Keys: integers, strings, floating point numbers. Usual assumption: keys are integers. Step 1: map keys to integers.
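One common way to carry out that first step for string keys is a polynomial (Horner's rule) hash; the sketch below is an illustrative choice, with the base 131 and the function names picked for the example rather than taken from the lecture.

# Map a string key to an integer by treating its characters as digits
# in base `base`, then fold the result into table slots 0..m-1.
def string_to_int(key: str, base: int = 131) -> int:
    value = 0
    for ch in key:
        value = value * base + ord(ch)   # Horner's rule
    return value

def hash_index(key: str, m: int) -> int:
    return string_to_int(key) % m

print(hash_index("hashing", 17))         # some slot in 0..16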

Primary clustering: with linear probing, occupied slots form long contiguous runs; any key that hashes anywhere into a run is appended to its end, so long runs tend to grow even longer.

Implementation of open addressing. How do we define h(k,i)? Linear probing: h(k,i) = (h'(k) + i) mod m. Quadratic probing: h(k,i) = (h'(k) + c1·i + c2·i²) mod m. Double hashing: h(k,i) = (h1(k) + i·h2(k)) mod m.
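The following sketch spells out the three probe sequences under the standard definitions above; the auxiliary hash functions h1 and h2 and the constants c1, c2 are illustrative choices, not necessarily the ones used in the lecture.

# Probe sequences for open addressing: slot visited on the i-th attempt for key k.
def h1(k: int, m: int) -> int:
    return k % m                         # auxiliary hash (illustrative)

def h2(k: int, m: int) -> int:
    return 1 + (k % (m - 1))             # never 0, so every slot can be reached

def linear_probe(k: int, i: int, m: int) -> int:
    return (h1(k, m) + i) % m            # h(k,i) = (h'(k) + i) mod m

def quadratic_probe(k: int, i: int, m: int, c1: int = 1, c2: int = 1) -> int:
    return (h1(k, m) + c1 * i + c2 * i * i) % m

def double_hash_probe(k: int, i: int, m: int) -> int:
    return (h1(k, m) + i * h2(k, m)) % m

m = 17
print([linear_probe(42, i, m) for i in range(5)])
print([quadratic_probe(42, i, m) for i in range(5)])
print([double_hash_probe(42, i, m) for i in range(5)])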

Quadratic Probing. Suppose that an element should appear in bin h: if bin h is occupied, then check the following sequence of bins: h + 1², h + 2², h + 3², h + 4², h + 5², ..., that is, h + 1, h + 4, h + 9, h + 16, h + 25, ... For example, with M = 17 all probe positions are taken modulo 17.

Quadratic Probing. If one of the bins h + i² falls into a cluster, this does not imply the next one will. Secondary clustering: all keys k that collide at h(k) follow the same probe sequence and therefore cluster onto the same locations.

Birthday Paradox. In probability theory, the birthday problem, or birthday paradox [1], pertains to the probability that in a set of randomly chosen people some pair of them will have the same birthday. In a group of at least 23 randomly chosen people, there is a more than 50% probability that some pair of them share a birthday.
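A short calculation confirms the 23-people claim, assuming 365 equally likely birthdays:

# Probability that at least two of n people share a birthday.
def p_shared_birthday(n: int, days: int = 365) -> float:
    p_all_distinct = 1.0
    for i in range(n):
        p_all_distinct *= (days - i) / days
    return 1.0 - p_all_distinct

print(p_shared_birthday(22))   # about 0.476
print(p_shared_birthday(23))   # about 0.507, crossing 50% at 23 people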

Problem. An adversary can choose a really bad set of keys, e.g. the identifiers seen by a compiler. By mapping all keys to the same location, the worst-case scenario can be forced. http://www.cs.cmu.edu/afs/cs/academic/class/15451-s09/www/lectures/lect0212.pdf

Two possible views of hashing. (1) The hash function h is fixed; the keys are random. (2) The hash function h is chosen randomly from a family of hash functions; the keys are fixed. Typical assumption (in both scenarios): every key is equally likely to be mapped to any of the m slots (uniform hashing).

Universal hashing From Wikipedia, the free encyclopedia Universal hashing is a randomized algorithm for selecting a hash function F with the following property: for any two distinct inputs x and y, the probability that F(x)=F(y) (i.e., that there is a hash collision between x and y) is the same as if F was a random function. Thus, if F has function values in a range of size r, the probability of any particular hash collision should be at most 1/r. There are universal hashing methods that give a function F that can be evaluated in a handful of computer instructions.

Universal families of hash functions. A family H of hash functions from U to [m] is said to be universal if and only if for every pair of distinct keys x ≠ y in U, the probability over a randomly chosen h ∈ H that h(x) = h(y) is at most 1/m.

A simple universal family: h_a,b(x) = ((a·x + b) mod p) mod m, where p is a prime larger than any key, a ∈ {1, ..., p-1} and b ∈ {0, ..., p-1}. To represent a function from the family we only need two numbers, a and b. The size m of the hash table is arbitrary.

Example: p = 17, m = 6. h_5,7(3): 5·3 + 7 = 22, 22 mod 17 = 5, 5 mod 6 = 5. h_3,4(28): 3·28 + 4 = 88, 88 mod 17 = 3, 3 mod 6 = 3. h_10,2(3): 10·3 + 2 = 32, 32 mod 17 = 15, 15 mod 6 = 3.
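A small sketch reproduces this example with the family h_a,b(x) = ((a·x + b) mod p) mod m; the function name h and the random selection at the end are only illustrative.

# The simple universal family, with the slide's parameters p = 17, m = 6.
import random

p, m = 17, 6

def h(a: int, b: int, x: int) -> int:
    return ((a * x + b) % p) % m

print(h(5, 7, 3))    # 5*3 + 7 = 22, 22 mod 17 = 5, 5 mod 6 = 5
print(h(3, 4, 28))   # 3*28 + 4 = 88, 88 mod 17 = 3, 3 mod 6 = 3
print(h(10, 2, 3))   # 10*3 + 2 = 32, 32 mod 17 = 15, 15 mod 6 = 3

# Picking a random member of the family (a in 1..p-1, b in 0..p-1):
a, b = random.randint(1, p - 1), random.randint(0, p - 1)
print(h(a, b, 3))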

Universal hashing in practice: randomly select a and b and use them throughout one run. Store a and b for each hash table separately.

Perfect hashing Suppose that D is static. We want to implement Find in O(1) worst case time. Perfect hashing: No collisions Can we achieve it?

PHF and MPHF. A perfect hash function (PHF) maps a static set of n keys into a table with no collisions; a minimal perfect hash function (MPHF) is a PHF whose table size equals n.

Expected no. of collisions. If we are willing to use m = n², then any universal family contains a perfect hash function: for a random h from the family the expected number of colliding pairs is at most n(n-1)/(2m) < 1/2, so some h in the family gives no collisions at all.

2-level hashing. Assume that each h_i can be represented using 2 words (its parameters a and b). Total size: bucket i gets a second-level table of size n_i², and for a universal first-level function E[Σ n_i²] = O(n), so the whole structure uses O(n) words in expectation.

A randomized algorithm for constructing a perfect 2-level hash table: Choose a random h from H(n) and compute the number of collisions. If there are more than n collisions, repeat. For each cell i, if n_i > 1, choose a random hash function from H(n_i²). If there are any collisions, repeat. Expected construction time: O(n). Worst-case Find time: O(1).
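The sketch below follows this randomized construction (FKS-style 2-level hashing), using the simple universal family ((a·x + b) mod p) mod m from earlier; the prime p = 1009, the key range and all function names are illustrative choices, not the lecture's own code.

# Randomized construction of a perfect 2-level hash table.
import random

P = 1009                                  # a prime larger than every key below

def random_universal(m):
    # a random member h_{a,b}(x) = ((a*x + b) mod P) mod m of the universal family
    a, b = random.randint(1, P - 1), random.randint(0, P - 1)
    return lambda x: ((a * x + b) % P) % m

def count_collisions(keys, h):
    buckets = {}
    for k in keys:
        buckets.setdefault(h(k), []).append(k)
    pairs = sum(len(v) * (len(v) - 1) // 2 for v in buckets.values())
    return pairs, buckets

def build_perfect_table(keys):
    n = len(keys)
    # Level 1: retry until the number of colliding pairs is at most n.
    while True:
        h = random_universal(n)
        collisions, buckets = count_collisions(keys, h)
        if collisions <= n:
            break
    # Level 2: bucket i with n_i keys gets a table of size n_i^2;
    # retry its hash function until that bucket is collision-free.
    second = {}
    for i, bucket in buckets.items():
        size = len(bucket) ** 2
        while True:
            hi = random_universal(size)
            if len({hi(k) for k in bucket}) == len(bucket):
                second[i] = (hi, {hi(k): k for k in bucket})
                break
    return h, second

def find(key, h, second):
    entry = second.get(h(key))
    if entry is None:
        return False
    hi, table = entry
    return table.get(hi(key)) == key      # O(1) worst-case lookup

keys = random.sample(range(P), 60)
h, second = build_perfect_table(keys)
print(all(find(k, h, second) for k in keys))           # True
non_key = next(x for x in range(P) if x not in keys)
print(find(non_key, h, second))                        # False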

Other applications of hashing: comparing files, cryptographic applications, Web indexing and keyword lookup, very large-scale applications.

Problems with hashing. Sorting and next-by-value queries are hard. Q: how to make hash functions that maintain order? Q: how to scale hashing to very large data, with low memory use and high speed?

Examples. Enumerating a search space: have we seen this state before? Dynamic programming / functional programming / memoization: has the function been previously evaluated with the same input parameters?
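A minimal memoization sketch: a hash table keyed by the call's input records results that were already computed, so repeated evaluations become O(1) lookups. The Fibonacci function is a stock example, not something from the lecture.

# Hand-rolled memoization with a dict, plus the standard-library equivalent.
from functools import lru_cache

cache = {}                               # input value -> previously computed result

def fib(n: int) -> int:
    if n in cache:                       # have we evaluated fib with this input before?
        return cache[n]
    result = n if n < 2 else fib(n - 1) + fib(n - 2)
    cache[n] = result
    return result

print(fib(60))                           # fast: every subproblem is computed once

@lru_cache(maxsize=None)                 # same bookkeeping done automatically
def fib2(n: int) -> int:
    return n if n < 2 else fib2(n - 1) + fib2(n - 2)

print(fib2(60))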

CMPH - C Minimal Perfect Hashing Library. http://cmph.sourceforge.net/ The use of minimal perfect hash functions has so far been restricted to scenarios where the set of keys being hashed is small, because of the limitations of current algorithms. But in many cases it is crucial to deal with huge sets of keys. So this project gives the free software community an API that works with sets on the order of billions of keys. The CMPH library was conceived to create minimal perfect hash functions for very large sets of keys.

Minimal perfect hash functions are widely used for memory-efficient storage and fast retrieval of items from static sets, such as words in natural languages, reserved words in programming languages or interactive systems, universal resource locations (URLs) in Web search engines, or item sets in data mining techniques. Therefore, there are applications for minimal perfect hash functions in information retrieval systems, database systems, language translation systems, electronic commerce systems, compilers, and operating systems, among others.

Compress, Hash and Displace: the CHD Algorithm. It is the fastest algorithm to build PHFs and MPHFs in linear time, and it generates the most compact PHFs and MPHFs we know of. It can generate PHFs with a load factor up to 99%. It can be used to generate t-perfect hash functions: a t-perfect hash function allows at most t collisions in a given bin. It is a well-known fact that modern memories are organized as blocks which constitute the transfer unit; examples of such blocks are cache lines for internal memory or sectors for hard disks. Thus it can be very useful for devices that carry out I/O operations in blocks. It is a two-level scheme: a first-level hash function splits the key set into buckets of average size determined by a parameter b in the range [1,32]; in the second level, displacement values resolve the collisions that gave rise to the buckets. It can generate MPHFs that can be stored in approximately 2.07 bits per key. For a load factor equal to the maximum one achieved by the BDZ algorithm (81%), the resulting PHFs are stored in approximately 1.40 bits per key.

Probabilistic data structures. Just use hash tables to record what has been seen. If the probability of a collision is small, then do not worry if some false positive hits occur. See: http://pycon.blip.tv/file/4881076/

Bloom filters (1970). Simple query: is the key in the set? The answer is probabilistic: a 'no' is 100% correct; a 'yes' is correct only with some probability p ≤ 1. Idea: make p as large as possible, i.e. avoid false positive answers.

Bloom Filters. An example of a Bloom filter representing the set {x, y, z}: the colored arrows show the positions in the bit array that each set element is mapped to. The element w is not in the set {x, y, z}, because it hashes to one bit-array position containing 0. For this figure, m = 18 and k = 3.
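A minimal Bloom filter sketch with m bits and k hash functions follows; deriving the k hash functions by salting Python's built-in hash() is an illustrative shortcut rather than the construction assumed in the lecture.

# A Bloom filter: k hash functions set/check k positions in an m-bit array.
class BloomFilter:
    def __init__(self, m: int, k: int):
        self.m, self.k = m, k
        self.bits = [0] * m

    def _positions(self, item):
        # k pseudo-independent positions derived from the item
        return [hash((i, item)) % self.m for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # "no" is always correct; "yes" may be a false positive
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter(m=18, k=3)              # the parameters from the figure
for word in ("x", "y", "z"):
    bf.add(word)

print("x" in bf)   # True
print("w" in bf)   # almost certainly False (a false positive is possible)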

Bloom filters. [Figure: lookups using two hash functions h1 and h2, illustrating a successful lookup and a failed one.]

A Bloom filter can be used to speed up answers in a key-value storage system. Values are stored on a disk which has slow access times, while Bloom filter decisions are much faster. However, some unnecessary disk accesses are made when the filter reports a positive (in order to weed out the false positives). Overall answer speed is better with the Bloom filter than without it. Using a Bloom filter for this purpose does, however, increase memory usage.
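A sketch of that pattern, reusing the BloomFilter class from the previous sketch; slow_disk_lookup and on_disk_store are hypothetical placeholders standing in for the real disk-backed store.

# Consult the in-memory Bloom filter first; only go to disk on "maybe present".
on_disk_store = {"alice": "555-0101"}     # placeholder for the slow disk-backed store

def slow_disk_lookup(key):
    return on_disk_store.get(key)         # imagine this costs a disk seek

bf = BloomFilter(m=1024, k=4)             # BloomFilter as defined in the sketch above
for key in on_disk_store:
    bf.add(key)

def lookup(key):
    if key not in bf:                     # definite miss: no disk access needed
        return None
    return slow_disk_lookup(key)          # possible hit (or a rare false positive)

print(lookup("alice"))   # found via the store
print(lookup("zoe"))     # filtered out without touching the disk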

Summary: load factors and handling collisions, hash functions and families, universal hashing, perfect hashing, minimal perfect hashing, Bloom filters.