Lecture 6: Hashing Steven Skiena

Similar documents
Questions. 6. Suppose we were to define a hash code on strings s by:

Lecture 4: Elementary Data Structures Steven Skiena

of characters from an alphabet, then, the hash function could be:

Lecture 7: Efficient Collections via Hashing

DATA STRUCTURES AND ALGORITHMS

CSE 332: Data Structures & Parallelism Lecture 10:Hashing. Ruth Anderson Autumn 2018

Introduction to. Algorithms. Lecture 6. Prof. Constantinos Daskalakis. CLRS: Chapter 17 and 32.2.

Hashing. Hashing Procedures

Module 5: Hashing. CS Data Structures and Data Management. Reza Dorrigiv, Daniel Roche. School of Computer Science, University of Waterloo

Data Structures and Algorithms. Chapter 7. Hashing

Outline for Today. How can we speed up operations that work on integer data? A simple data structure for ordered dictionaries.

1 Probability Review. CS 124 Section #8 Hashing, Skip Lists 3/20/17. Expectation (weighted average): the expectation of a random quantity X is:

HO #13 Fall 2015 Gary Chan. Hashing (N:12)

Hash Table and Hashing

Data Structures and Algorithms. Roberto Sebastiani

Hashing Techniques. Material based on slides by George Bebis

AAL 217: DATA STRUCTURES

Hashing. Yufei Tao. Department of Computer Science and Engineering Chinese University of Hong Kong

Understand how to deal with collisions

Algorithms and Data Structures Lesson 3

Hashing. 7- Hashing. Hashing. Transform Keys into Integers in [[0, M 1]] The steps in hashing:

HashTable CISC5835, Computer Algorithms CIS, Fordham Univ. Instructor: X. Zhang Fall 2018

Hash Tables. Gunnar Gotshalks. Maps 1

Lecture 16. Reading: Weiss Ch. 5 CSE 100, UCSD: LEC 16. Page 1 of 40

Dictionary. Dictionary. stores key-value pairs. Find(k) Insert(k, v) Delete(k) List O(n) O(1) O(n) Sorted Array O(log n) O(n) O(n)

Worst-case running time for RANDOMIZED-SELECT

CSE100. Advanced Data Structures. Lecture 21. (Based on Paul Kube course materials)

III Data Structures. Dynamic sets

Module 2: Classical Algorithm Design Techniques

University of Toronto Department of Electrical and Computer Engineering. Midterm Examination. ECE 345 Algorithms and Data Structures Fall 2012

Quiz 1 Practice Problems

ECE 122. Engineering Problem Solving Using Java

Outline for Today. How can we speed up operations that work on integer data? A simple data structure for ordered dictionaries.

CMSC 341 Hashing. Based on slides from previous iterations of this course

DATA STRUCTURES AND ALGORITHMS

Final Exam in Algorithms and Data Structures 1 (1DL210)

Algorithms with numbers (2) CISC4080, Computer Algorithms CIS, Fordham Univ.! Instructor: X. Zhang Spring 2017

Algorithms with numbers (2) CISC4080, Computer Algorithms CIS, Fordham Univ. Acknowledgement. Support for Dictionary

CMSC 341 Lecture 16/17 Hashing, Parts 1 & 2

CS 270 Algorithms. Oliver Kullmann. Generalising arrays. Direct addressing. Hashing in general. Hashing through chaining. Reading from CLRS for week 7

THINGS WE DID LAST TIME IN SECTION

Introduction to Algorithms

Integer Algorithms and Data Structures

Beyond Counting. Owen Kaser. September 17, 2014

Quiz 1 Practice Problems

Fast Lookup: Hash tables

CARNEGIE MELLON UNIVERSITY DEPT. OF COMPUTER SCIENCE DATABASE APPLICATIONS

Randomized Algorithms, Hash Functions

Design and Analysis of Algorithms

COSC160: Data Structures Hashing Structures. Jeremy Bolton, PhD Assistant Teaching Professor

DSA - Seminar 6. Deadline for Project stage for projects without trees: today, until 11:59 PM

1 / 22. Inf 2B: Hash Tables. Lecture 4 of ADS thread. Kyriakos Kalorkoti. School of Informatics University of Edinburgh

CSCI E 119 Section Notes Section 11 Solutions

/633 Introduction to Algorithms Lecturer: Michael Dinitz Topic: Sorting lower bound and Linear-time sorting Date: 9/19/17

Hash Tables. Reading: Cormen et al, Sections 11.1 and 11.2

IS 2610: Data Structures

Lecture 8: Mergesort / Quicksort Steven Skiena

Today s Outline. CS 561, Lecture 8. Direct Addressing Problem. Hash Tables. Hash Tables Trees. Jared Saia University of New Mexico

Acknowledgement HashTable CISC4080, Computer Algorithms CIS, Fordham Univ.

CSE 214 Computer Science II Searching

Implementation with ruby features. Sorting, Searching and Haching. Quick Sort. Algorithm of Quick Sort

String Matching. Pedro Ribeiro 2016/2017 DCC/FCUP. Pedro Ribeiro (DCC/FCUP) String Matching 2016/ / 42

Practice Midterm Exam Solutions

CSE373: Data Structures & Algorithms Lecture 6: Hash Tables

Outline. hash tables hash functions open addressing chained hashing

CMSC 132: Object-Oriented Programming II. Hash Tables

Module 3: Hashing Lecture 9: Static and Dynamic Hashing. The Lecture Contains: Static hashing. Hashing. Dynamic hashing. Extendible hashing.

Hashing 1. Searching Lists

Introduction to. Algorithms. Lecture 5

Lecture 17. Improving open-addressing hashing. Brent s method. Ordered hashing CSE 100, UCSD: LEC 17. Page 1 of 19

Algorithms and Data Structures

Week 9. Hash tables. 1 Generalising arrays. 2 Direct addressing. 3 Hashing in general. 4 Hashing through chaining. 5 Hash functions.

Hash Table. A hash function h maps keys of a given type into integers in a fixed interval [0,m-1]

1. [1 pt] What is the solution to the recurrence T(n) = 2T(n-1) + 1, T(1) = 1

Hash Tables. Computer Science S-111 Harvard University David G. Sullivan, Ph.D. Data Dictionary Revisited

Hashing. Dr. Ronaldo Menezes Hugo Serrano. Ronaldo Menezes, Florida Tech

CSCI Analysis of Algorithms I

Randomized Algorithms: Element Distinctness

Lecture 12 Hash Tables

Hashing and sketching

Goal of the course: The goal is to learn to design and analyze an algorithm. More specifically, you will learn:

Introduction hashing: a technique used for storing and retrieving information as quickly as possible.

Algorithms and Data Structures

x-fast and y-fast Tries

Taking Stock. IE170: Algorithms in Systems Engineering: Lecture 7. (A subset of) the Collections Interface. The Java Collections Interfaces

CS2 Algorithms and Data Structures Note 4

CSci 231 Final Review

Section 05: Solutions

CS 561, Lecture 2 : Hash Tables, Skip Lists, Bloom Filters, Count-Min sketch. Jared Saia University of New Mexico

Lecture 20: Satisfiability Steven Skiena. Department of Computer Science State University of New York Stony Brook, NY

Hashing. Manolis Koubarakis. Data Structures and Programming Techniques

Hash Open Indexing. Data Structures and Algorithms CSE 373 SP 18 - KASEY CHAMPION 1

CS 251, LE 2 Fall MIDTERM 2 Tuesday, November 1, 2016 Version 00 - KEY

Tables. The Table ADT is used when information needs to be stored and acessed via a key usually, but not always, a string. For example: Dictionaries

CS 350 Algorithms and Complexity

CMSC 341 Hashing (Continued) Based on slides from previous iterations of this course

CS61B Lecture #24: Hashing. Last modified: Wed Oct 19 14:35: CS61B: Lecture #24 1

String quicksort solves this problem by processing the obtained information immediately after each symbol comparison.

COMP3121/3821/9101/ s1 Assignment 1

Topics Applications Most Common Methods Serial Search Binary Search Search by Hashing (next lecture) Run-Time Analysis Average-time analysis Time anal

Transcription:

Lecture 6: Hashing Steven Skiena Department of Computer Science State University of New York Stony Brook, NY 11794 4400 http://www.cs.stonybrook.edu/ skiena

Dictionary / Dynamic Set Operations Perhaps the most important class of data structures maintain a set of items, indexed by keys. Search(S, k) A query that, given a set S and a key k, returns a pointer x to an element in S such that key[x] = k, or nil if no such element belongs to S. Insert(S,x) A modifying operation that augments the set S with the element x. Delete(S,x) Given a pointer x to an element in the set S, remove x from S. Observe we are given a pointer to an element x, not a key.

Min(S), Max(S) Returns the element of the totally ordered set S which has the smallest (largest) key. Logical Precessor(S,x), Successor(S,x) Given an element x whose key is from a totally ordered set S, returns the next largest (smallest) element in S, or NIL if x is the maximum (minimum) element. There are a variety of implementations of these dictionary operations, each of which yield different time bounds for various operations.

Problem of the Day You are given the task of reading in n numbers and then printing them out in sorted order. Suppose you have access to a balanced dictionary data structure, which supports each of the operations search, insert, delete, minimum, maximum, successor, and predecessor in O(log n) time. Explain how you can use this dictionary to sort in O(n log n) time using only the following abstract operations: minimum, successor, insert, search.

Explain how you can use this dictionary to sort in O(n log n) time using only the following abstract operations: minimum, insert, delete, search. Explain how you can use this dictionary to sort in O(n log n) time using only the following abstract operations: insert and in-order traversal.

Hash Tables Hash tables are a very practical way to maintain a dictionary. The idea is simply that looking an item up in an array is Θ(1) once you have its index. A hash function is a mathematical function which maps keys to integers.

Collisions Collisions are the set of keys mapped to the same bucket. If the keys are uniformly distributed, then each bucket should contain very few keys! The resulting short lists are easily searched! 0 1 2 3 4 5 6 7 8 9 10 11 Chaining is easy, but devotes a considerable amount of memory to pointers, which could be used to make the table larger.

Hash Functions It is the job of the hash function to map keys to integers. A good hash function: 1. Is cheap to evaluate 2. Tends to use all positions from 0... M with uniform frequency. The first step is usually to map the key to a big integer, for example h = keylength i=0 128 i char(key[i])

Modular Arithmetic This large number must be reduced to an integer whose size is between 1 and the size of our hash table. One way is by h(k) = k mod M, where M is best a large prime not too close to 2 i 1, which would just mask off the high bits. This works on the same principle as a roulette wheel!

Bad Hash Functions The first three digits of the Social Security Number 0 1 2 3 4 5 6 7 8 9

Good Hash Functions The last three digits of the Social Security Number 0 1 2 3 4 5 6 7 8 9

Performance on Set Operations With either chaining or open addressing: Search - O(1) expected, O(n) worst case Insert - O(1) expected, O(n) worst case Delete - O(1) expected, O(n) worst case Min, Max and Predecessor, Successor Θ(n + m) expected and worst case Pragmatically, a hash table is often the best data structure to maintain a dictionary. However, the worst-case time is unpredictable. The best worst-case bounds come from balanced binary trees.

Hashing, Hashing, and Hashing Udi Manber says that the three most important algorithms at Google are hashing, hashing, and hashing. Hashing has a variety of clever applications beyond just speeding up search, by giving you a short but distinctive representation of a larger document. Is this new document different from the rest in a large corpus? Hash the new document, and compare it to the hash codes of corpus. Is part of this document plagerized from part of a document in a large corpus? Hash overlapping windows of length w in the document and the corpus. If there is a match of hash codes, there is possibly a text match.

How can I convince you that a file isn t changed? Check if the cryptographic hash code of the file you give me today is the same as that of the original. Any changes to the file will change the hash code.

Substring Pattern Matching Input: A text string t and a pattern string p. Problem: Does t contain the pattern p as a substring, and if so where? E.g: Is Skiena in the Bible?

Brute Force Search The simplest algorithm to search for the presence of pattern string p in text t overlays the pattern string at every position in the text, and checks whether every pattern character matches the corresponding text character. This runs in O(nm) time, where n = t and m = p.

String Matching via Hashing Suppose we compute a given hash function on both the pattern string p and the m-character substring starting from the ith position of t. If these two strings are identical, clearly the resulting hash values will be the same. If the two strings are different, the hash values will almost certainly be different. These false positives should be so rare that we can easily spend the O(m) time it take to explicitly check the identity of two strings whenever the hash values agree.

The Catch This reduces string matching to n m + 2 hash value computations (the n m + 1 windows of t, plus one hash of p), plus what should be a very small number of O(m) time verification steps. The catch is that it takes O(m) time to compute a hash function on an m-character string, and O(n) such computations seems to leave us with an O(mn) algorithm again.

The Trick Look closely at our string hash function, applied to the m characters starting from the jth position of string S: H(S, j) = m 1 i=0 αm (i+1) char(s i+j ) A little algebra reveals that H(S, j + 1) = (H(S, j) α m 1 char(s j ))α + s j+m Thus once we know the hash value from the j position, we can find the hash value from the (j + 1)st position for the cost of two multiplications, one addition, and one subtraction. This can be done in constant time.

Hashing as a Representation Custom-designed hashcodes can be used to bucket items by a cannonical representation. Which five letters of the alphabet can make the most different words? Hash each word by the letters it contains: skiena aeikns! Observe that dog and god collide! Proximity-preserving hashing techniques put similar items in the same bucket. Use hashing for everything, except worst-case analysis!