Introduction to. Algorithms. Lecture 5

Similar documents
Introduction to. Algorithms. Lecture 6. Prof. Constantinos Daskalakis. CLRS: Chapter 17 and 32.2.

CMSC 341 Lecture 16/17 Hashing, Parts 1 & 2

Dictionary. Dictionary. stores key-value pairs. Find(k) Insert(k, v) Delete(k) List O(n) O(1) O(n) Sorted Array O(log n) O(n) O(n)

Hashing Techniques. Material based on slides by George Bebis

Hash Open Indexing. Data Structures and Algorithms CSE 373 SP 18 - KASEY CHAMPION 1

COMP171. Hashing.

CSC263 Week 5. Larry Zhang.

HashTable CISC5835, Computer Algorithms CIS, Fordham Univ. Instructor: X. Zhang Fall 2018

Hashing 1. Searching Lists

HASH TABLES.

Hash Table and Hashing

Cpt S 223. School of EECS, WSU

Lecture 6: Hashing Steven Skiena

CMSC 341 Hashing (Continued) Based on slides from previous iterations of this course

Algorithms with numbers (2) CISC4080, Computer Algorithms CIS, Fordham Univ.! Instructor: X. Zhang Spring 2017

Algorithms with numbers (2) CISC4080, Computer Algorithms CIS, Fordham Univ. Acknowledgement. Support for Dictionary

Acknowledgement HashTable CISC4080, Computer Algorithms CIS, Fordham Univ.

Hash[ string key ] ==> integer value

1 CSE 100: HASH TABLES

HASH TABLES. Goal is to store elements k,v at index i = h k

Lecture 17. Improving open-addressing hashing. Brent s method. Ordered hashing CSE 100, UCSD: LEC 17. Page 1 of 19

CMSC 341 Hashing. Based on slides from previous iterations of this course

Hash Tables. Gunnar Gotshalks. Maps 1

AAL 217: DATA STRUCTURES

Hash Tables. Hash Tables

Hashing. Dr. Ronaldo Menezes Hugo Serrano. Ronaldo Menezes, Florida Tech

Quiz 1 Practice Problems

lecture23: Hash Tables

DATA STRUCTURES AND ALGORITHMS

Hashing. 1. Introduction. 2. Direct-address tables. CmSc 250 Introduction to Algorithms

Today: Finish up hashing Sorted Dictionary ADT: Binary search, divide-and-conquer Recursive function and recurrence relation

Data Structure Lecture#22: Searching 3 (Chapter 9) U Kang Seoul National University

Introduction. hashing performs basic operations, such as insertion, better than other ADTs we ve seen so far

Quiz 1 Practice Problems

Symbol Table. Symbol table is used widely in many applications. dictionary is a kind of symbol table data dictionary is database management

Fast Lookup: Hash tables

Module 3: Hashing Lecture 9: Static and Dynamic Hashing. The Lecture Contains: Static hashing. Hashing. Dynamic hashing. Extendible hashing.

CS 241 Analysis of Algorithms

Open Addressing: Linear Probing (cont.)

5. Hashing. 5.1 General Idea. 5.2 Hash Function. 5.3 Separate Chaining. 5.4 Open Addressing. 5.5 Rehashing. 5.6 Extendible Hashing. 5.

Data and File Structures Laboratory

Understand how to deal with collisions

Lecture 10 March 4. Goals: hashing. hash functions. closed hashing. application of hashing

Hash Tables Outline. Definition Hash functions Open hashing Closed hashing. Efficiency. collision resolution techniques. EECS 268 Programming II 1

1 Probability Review. CS 124 Section #8 Hashing, Skip Lists 3/20/17. Expectation (weighted average): the expectation of a random quantity X is:

Topic HashTable and Table ADT

CHAPTER 9 HASH TABLES, MAPS, AND SKIP LISTS

CSE100. Advanced Data Structures. Lecture 21. (Based on Paul Kube course materials)

Quiz 1 Solutions. (a) f(n) = n g(n) = log n Circle all that apply: f = O(g) f = Θ(g) f = Ω(g)

Hashing. CptS 223 Advanced Data Structures. Larry Holder School of Electrical Engineering and Computer Science Washington State University

Fundamentals of Database Systems Prof. Arnab Bhattacharya Department of Computer Science and Engineering Indian Institute of Technology, Kanpur

Dictionaries and Hash Tables

Outline. hash tables hash functions open addressing chained hashing

Hash table basics mod 83 ate. ate

CPSC 259 admin notes

Hashing. October 19, CMPE 250 Hashing October 19, / 25

CSE 332: Data Structures & Parallelism Lecture 10:Hashing. Ruth Anderson Autumn 2018

Today s Outline. CS 561, Lecture 8. Direct Addressing Problem. Hash Tables. Hash Tables Trees. Jared Saia University of New Mexico

Data and File Structures Chapter 11. Hashing

Hash table basics. ate à. à à mod à 83

Algorithms in Systems Engineering ISE 172. Lecture 12. Dr. Ted Ralphs

Data Structures and Algorithms. Roberto Sebastiani

The dictionary problem

COSC 2007 Data Structures II Final Exam. Part 1: multiple choice (1 mark each, total 30 marks, circle the correct answer)

Hashing. Hashing Procedures

Lecture 7: Efficient Collections via Hashing

Hashing. Manolis Koubarakis. Data Structures and Programming Techniques

W4231: Analysis of Algorithms

Standard ADTs. Lecture 19 CS2110 Summer 2009

Data Structures and Algorithm Analysis (CSC317) Hash tables (part2)

Lecture 4: Advanced Data Structures

Data Structures and Algorithms. Chapter 7. Hashing

CSCI Analysis of Algorithms I

Questions. 6. Suppose we were to define a hash code on strings s by:

Lecture 16. Reading: Weiss Ch. 5 CSE 100, UCSD: LEC 16. Page 1 of 40

HO #13 Fall 2015 Gary Chan. Hashing (N:12)

Problem Set 4. Problem 4-1. [35 points] Hash Functions and Load

Unit #5: Hash Functions and the Pigeonhole Principle

Module 5: Hashing. CS Data Structures and Data Management. Reza Dorrigiv, Daniel Roche. School of Computer Science, University of Waterloo

DATA STRUCTURES AND ALGORITHMS

1. [1 pt] What is the solution to the recurrence T(n) = 2T(n-1) + 1, T(1) = 1

Hashing. Yufei Tao. Department of Computer Science and Engineering Chinese University of Hong Kong

DS ,21. L11-12: Hashmap

General Idea. Key could be an integer, a string, etc e.g. a name or Id that is a part of a large employee structure

Outline for Today. Towards Perfect Hashing. Cuckoo Hashing. The Cuckoo Graph. Analysis of Cuckoo Hashing. Reducing worst-case bounds

UNIT III BALANCED SEARCH TREES AND INDEXING

MIDTERM EXAM THURSDAY MARCH

CSI33 Data Structures

stacks operation array/vector linked list push amortized O(1) Θ(1) pop Θ(1) Θ(1) top Θ(1) Θ(1) isempty Θ(1) Θ(1)

String Matching. Pedro Ribeiro 2016/2017 DCC/FCUP. Pedro Ribeiro (DCC/FCUP) String Matching 2016/ / 42

ECE 122. Engineering Problem Solving Using Java

CSE373: Data Structures & Algorithms Lecture 17: Hash Collisions. Kevin Quinn Fall 2015

Cuckoo Hashing for Undergraduates

Chapter 5 Hashing. Introduction. Hashing. Hashing Functions. hashing performs basic operations, such as insertion,

Algorithms and Data Structures Lesson 3

Introduction to Algorithms March 11, 2009 Massachusetts Institute of Technology Spring 2009 Professors Sivan Toledo and Alan Edelman Quiz 1

CSC263 Week 12. Larry Zhang

Hash Tables. Hash functions Open addressing. November 24, 2017 Hassan Khosravi / Geoffrey Tien 1

Symbol Tables. For compile-time efficiency, compilers often use a symbol table: associates lexical names (symbols) with their attributes

Hash-Based Indexing 1

Transcription:

6.006- Introduction to Algorithms Lecture 5 Prof. Manolis Kellis

Unit #2 Genomes, Hashing, and Dictionaries 2

(hashing out ) Our plan ahead Today: Genomes, Dictionaries, i and Hashing Intro, basic operations, collisions and chaining Simple uniform hashing hi assumption Hash functions, python implementation Thursday: Speeding up hash tables Faster comparison: Signatures Faster hashing: Rolling Hash Next week: Space issues Dynamic resizing and amortized analysis Open addressing, deletions, and probing 3

Our plan for today: Hashing I Today: Genomes, Dictionaries, and Hashing Matching genome segments Introduction to dictionaries Hash function: definition Resolving collisions with chaining Simple uniform hashing assumption Hash functions in practice: mod / mult Python implementation Thursday: Speeding up hash tables Next week: Space issues 4

Comparing two genomes bit by bit Human chr 1 Mouse chrs Mouse chromosomes 1 19, X

DNA matching: All about strings How to find corresponding pieces of DNA Given two DNA sequences Strings over 4-letter alphabet Find longest substring that appears in both Algorithm vs. Arithmetic Algorithm vs. Arithmetic L19: Subsequence - much harder (e.g. Algorithm) Other applications: Plagiarism detection Word autocorrect Watson Jeopardy!

Naïve Algorithm Say strings S and T of length n For L = n downto 1 n* for all length L substrings X1 of S n* for all length L substrings X2 of ft n* if X1=X2, return L n Runtime analysis n candidate lengths n strings ti of fthtl that length thin X1 n strings of that length in X2 L time to compare the strings Total runtime: (n 4 )

Improvement 1: Binary Search on L Start with L=n/2 for all length L substrings X1 of S for all length L substrings X2 of T if X1=X2, success, try larger L if failed, try smaller L Runtime analysis (n 4 ) (n 3 log n)

Improvement 2: Python Dictionaries For every possible length L=n,,1 Insert all length L substrings of S into a dictionary For each length L substring of T, check if it exists in dictionary Possible lengths for outer loop: n For each length: at most n substrings of S inserted into dictionary, each insertion takes time O(1) * L (L is paid because we have to read string to insert it) at most n substrings of T checked for existence inside dictionary, each check takes time O(1) * L Overall time spent to deal with a particular length L is O(Ln) Hence overall (n 3 ) With binary search on length, total is (n 2 log n) Rolling hash dictionaries improve to (n log n) (next time)

Our plan for today: Hashing I Today: Genomes, Dictionaries, and Hashing Matching genome segments Introduction to dictionaries Hash function: definition Resolving collisions with chaining Simple uniform hashing assumption Hash functions in practice: mod / mult Python implementation Thursday: Speeding up hash tables Next week: Space issues 10

Dictionaries: Formal Definition It is a set containing items; ; each item has a key what keys and items are is quite flexible Supported Operations: Insert(key, item): add item to set, indexed by key Delete(key): dl delete item indexed id dby key Search(key): return the item corresponding to the given key, if such an item exists Random_key(): return a random key in dictionary Assumption: every item has its own key (or that inserting new item clobbers old Application (and origin of name): Dictionaries Key is word in English, item is word in French

Dictionaries are everywhere Spelling correction Key is misspelled word, item is correct spelling Python Interpreter Executing program, see a variable name (key) Need to look up its current assignment (item) Web server Thousands of network connections open When a packet arrives, must give to right process Key is source IP address of packet, item is handler

use BSTs! Implementation can keep keys in a BST, keeping a pointer from each key to its value O(log n) time per operation Often not fast enough for these applications! Can we beat BSTs? if only we could do all operations in O(1)

Dictionaries: Attempt #1 0 1 2 Forget about BSTs.. key1 item1 Use table, indexed by keys! key2 key3 item2 item3

Problems What if keys aren aren tt numbers? How can I then index a table? Everything E thi is i a number b Pythagoras

Interpreting words as numbers What if keys aren t numbers? Anything in the computer is a sequence of bits So we can pretend it s a number Example: English words 26 letters in alphabet can represent each with 5 bits Antidisestablishmentarianism has 28 letters 28*5 = 140 bits So, store in array of size 2 140.oops Isn t this too much space for 100,000 words?

Our plan for today: Hashing I Today: Genomes, Dictionaries, and Hashing Matching genome segments Introduction to dictionaries Hash function: definition Resolving collisions with chaining Simple uniform hashing assumption Hash functions in practice: mod / mult Python implementation Thursday: Speeding up hash tables Next week: Space issues 17

Hash Functions Exploit sparsity Huge universe U of possible keys But only n keys actually present Want to store in table (array) of size m n Define hash function h:u {1..m} Filter key k through h( ) to find table position Table entries are called buckets Time to insert/find key is Time to compute h (generally length of key) Plus one time step to look in array

The magic of hash functions PHENOMENAL COSMIC POWERS!! itty bitty living space With apologies to Disney

Hashing exploits sparsity of space K K : universe of all possible keys; huge set : actual keys; small set but not known in advance

All keys map to small space (i)insertitem1 insert item1, with key k1 (iii) insert item3, with ihkey k3 item1 item3 Ø 1 h(k1) h(k3) : universe of all possible keys (ii) insert item2, with key k2 item2 h(k2) (iv) suppose we now try to inset ( ) pp y item4, with key k4 and h(k4)=h(k2) m 1

leading to collisions (i)insertitem1 insert item1, with key k1 (iii) insert item3, with ihkey k3 item1 item3 Ø 1 h(k1) h(k3) : universe of all possible keys (ii) insert item2, with key k2 problem h(k2) = h(k4) (collision) (iv) suppose we now try to inset ( ) pp y item4, with key k4 and h(k4)=h(k2) m 1

Our plan for today: Hashing I Today: Genomes, Dictionaries, and Hashing Matching genome segments Introduction to dictionaries Hash function: definition Resolving collisions with chaining Simple uniform hashing assumption Hash functions in practice: mod / mult Python implementation Thursday: Speeding up hash tables Next week: Space issues 23

Collisions What went/can go wrong? Distinct keys x and y But h(x) = h(y) Called a collision This is unavoidable: if table smaller than range, some keys must collide Pigeonhole principle What do you put in the bucket?

Coping with collisions Idea1: Change to a new uncoliding hash function and re-hash all elements in the table Hard to find, and can take a long time if m=o(n) Idea2: Chaining Linked list of hashed items for each bucket (today) Idea3: Open addressing Find a different, empty bucket for y (next lecture) Idea4: Perfect hashing (not covered in 6.006) Create a 2 nd -level hash table of size k 2 for each k-element bin, and try several 2 nd -level hash functions until no collisions are found (see 6.046)

Chaining - Each bucket, linked list of contained items K h(k1) h(k3) k1 item1 k3 item3 - Space used is space of table plus one unit per item (size of key and item) h(k2) = h(k4) k2 item2 k4 item4 K : universe of all possible keys : actual keys, not known in advance

Problem Solved? To find dkey, must scan whole list in key s bucket Length L list costs L key comparisons If all keys hash to same bucket, lookup cost (n) Solution: optimism

Our plan for today: Hashing I Today: Genomes, Dictionaries, and Hashing Matching genome segments Introduction to dictionaries Hash function: definition Resolving collisions with chaining Simple uniform hashing assumption Hash functions in practice: mod / mult Python implementation Thursday: Speeding up hash tables Next week: Space issues 28

Simple uniform hashing assumption Definition: Each key k K of keys is equally likely to be hashed h dto any slot of table tbl T, independent d of where other keys are hashed. Let n be the number of keys in the table, and let m be the number of slots. Define the load factor of T to be = n/m = average number of keys per slot.

Chaining Analysis under SUHA Average case analysis: n items in table of m buckets Average number of fi items/bucket is =n/m/ So expected time to find some key x is (1+ O(1) if =O(1), i.e. m= (n) apply hash function and access slot search the list

Summary (rehash) Matching big genomes is a hard problem And you will tackle it in your problem set! Dictionaries are pervasive Hash htbl tables implement tthem efficiently ffii Under an optimistic assumption of random keys Can be made true by heuristic hash hfunctions Key idea for beating BSTs: Indexing S ifi d i i Sacrificed operations: previous, successor Chaining strategy for collision resolution Next two lectures: speed & space improvements

Unit #2: Genomes, Hashing, Dictionaries Today: Genomes, Dictionaries, i and Hashing Intro, basic operations, collisions and chaining Simple uniform hashing hi assumption Hash functions, Python implementation Thursday: Speeding up hash tables Faster comparison: Signatures Faster hashing: Rolling Hash Next week: Space issues Dynamic resizing and amortized analysis Open addressing, deletions, and probing 42