18.417 Introduction to Computational Molecular Biology
Lecture 10: October 12, 2004
Lecturer: Ross Lippert    Scribe: Lele Yu    Editor: Mark Halsey

Exact Matching: Hash-tables and Automata

While edit distances and dynamic programming will give a solution to exact matching, they take a long time to return an answer. If n is the length of the text we are searching and m is the length of the string we are searching for, the obvious method takes O(nm) time. The point of this lecture is to find an exact match in better than O(nm) time.
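For reference, here is the obvious method as a minimal sketch (our own, not from the lecture notes), written in Python:

def naive_match(T, q):
    # O(nm): try every start position and compare the m characters of q against T
    n, m = len(T), len(q)
    return [i for i in range(n - m + 1) if T[i:i + m] == q]

# e.g. naive_match("ACACTAAC", "ACTAAC") returns [2]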

History

The study of exact matching has a longer history than anything else we have been studying, because more people than just biologists care about the subject. Examples of where it is used are the world-wide web, linguistic texts, and biology. The result of so much interest in the subject is that we have gotten better results.

Notation

To formally define the problem, we use the following notation. The inputs to our exact matching algorithm are:

q = query (|q| = m)
T = text (|T| = n)
A = alphabet

The query q is the string we are searching for, T is the entire text, and A is the alphabet: the set of all values each character of the text can take. The output of our algorithm is all occurrences of q in T.

A variation on this problem replaces q with a set q_1, q_2, ..., q_k, so that finding any one of the queries is fine. Other variations include:

1. The q_i are given in advance, so we can preprocess all the queries.
2. T is given in advance, so we can preprocess T.

Text Preprocessing

To find occurrences of q in T, formally, each query q_i ∈ A^l, where A is the alphabet, and we build a table

H[] : A^l → {locations in T}.

In other words, preprocessing the text means that for all possible queries of a certain length l we find all the locations where such queries may occur. Obviously, this gets very expensive very quickly. However, it might be worth it if you are dealing with some huge T but the number of possible queries is limited. For the alphabet we are interested in ({A, T, G, C}), it is very possible for H to be an array of 4^16 elements, which is very expensive. Instead, let H be a hashtable: for an array H[] of size 2^v, we have a hash function h that maps the 4^l possible l-mers onto the 2^v table entries, and we store

H[h(q)] = H[h(q)] ∪ {location},

where each q is a query. Now, it can happen that h(q_i) = h(q_j) even though q_i ≠ q_j; when this is the case, a collision occurs. To deal with that, let the entries of H[] be pointers to lists of locations.

Figure 10.1: A hashtable collision.

One can argue that the hashtable supports one pattern query in time proportional to the query length, since you only have to examine each character of your pattern once:

Time = O(|q|)
Space = O(|T|)

One drawback of hashing is that it is relatively inflexible, because creating hash functions is not easy. Ideally, you want random-looking output in order to have the fewest collisions, but it is difficult to come up with a function that produces such output, so hash functions tend to be very specific. If you slightly change your query type (e.g. from l-mers to (l+1)-mers), the hash function might prove ineffective.
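As a concrete illustration (a minimal sketch of our own, not from the lecture), the text can be indexed by hashing every l-mer of T, and a length-l query is then answered by scanning a single bucket and verifying candidates against T to weed out collisions:

from collections import defaultdict

def build_index(T, l):
    # map h(l-mer) -> list of start positions in T; Python's built-in hash stands in for h
    index = defaultdict(list)
    for i in range(len(T) - l + 1):
        index[hash(T[i:i + l])].append(i)
    return index

def lookup(T, index, q):
    # collisions are possible, so each candidate position is verified against T
    return [i for i in index.get(hash(q), []) if T[i:i + len(q)] == q]

# e.g. lookup("ACACTAAC", build_index("ACACTAAC", 4), "CTAA") returns [3]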

Query Preprocessing

Now let us try preprocessing the query (we just talked about preprocessing the text). Take an example of applying the slow algorithm:

q = A C T A A C
T = A C A C T A A C

In this case, you look at the first C and the second A twice, because after a mismatch you start all the way back at the beginning. Traditionally, in order to find q in T, you start at the first A and continue to the C. At the second A the test fails, and you start over at the first C, where the test fails immediately. Finally you start at the second A and succeed. Optimally, however, you should not have to look at any character of the text more than once.

Figure 10.2: An FSM.

It is possible to look at every element of the text only once, because every letter in the text is absorbed by an FSM transition. In other words, the scan time for T is O(|T|); we can see this through amortized analysis. The FSM fits into a small table:

Figure 10.3: An FSM table.

fail[i] = max{ f : f < i, S[0:f-1] == S[i-f:i-1] and S[f] != S[i] }

Here S[0:f-1] is the longest such prefix of S, S[i-1] is what is being matched now, and S[i-f:i-1] is a suffix of the string we have matched so far. The code follows:

kmp-match(T, S):
    S = S + '*'
    kmp-process(S, F)
    j = 0
    for 0 <= i < length(T):
        j = kmp-next(T[i], j, S, F)
        if S[j] == '*':
            report a match ending at i

kmp-next(c, j, S, F):
    if j == -1 or S[j] == c:
        return j + 1
    return kmp-next(c, F[j], S, F)

kmp-process(S, F):
    F[0] = -1
    f = 0
    for i in range(1, length(S)):
        # loop invariant: f = max{ f : f < i, S[0:f-1] == S[i-f:i-1] }
        F[i] = f
        if S[F[i]] == S[i] and F[F[i]] != -1:
            F[i] = F[F[i]]
        f = kmp-next(S[i], f, S, F)

The method kmp-process(S, F) computes the failure array. It takes O(|q|) time, since there is only one loop and it iterates through the length of S (the string we are matching). The time to match is O(|T|), since the loop in kmp-match iterates through all elements of the text T. Hence the preprocessing time is O(|S|) and the matching time is O(|T|).

A common variation is to build an |A| × |S| lookup table for kmp-next, so that each transition takes constant time; this avoids the recursive calls in kmp-next.
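As a sketch of that lookup-table variation (our own illustration, not the lecture's code), the following self-contained Python builds the full transition table delta[j][c] from the standard (weak) prefix function, rather than the strong failure array above, and then scans the text with one table lookup per character; the names build_table and table_search are ours:

def prefix_function(p):
    # pi[i] = length of the longest proper border of p[0..i]
    pi = [0] * len(p)
    k = 0
    for i in range(1, len(p)):
        while k > 0 and p[k] != p[i]:
            k = pi[k - 1]
        if p[k] == p[i]:
            k += 1
        pi[i] = k
    return pi

def build_table(p, alphabet):
    # delta[j][c] = number of pattern characters matched after reading c in state j
    m, pi = len(p), prefix_function(p)
    delta = [{} for _ in range(m + 1)]
    for j in range(m + 1):
        for c in alphabet:
            if j < m and p[j] == c:
                delta[j][c] = j + 1                 # extend the current match
            elif j == 0:
                delta[j][c] = 0                     # nothing matched, stay at the start
            else:
                delta[j][c] = delta[pi[j - 1]][c]   # fall back to the longest border
    return delta

def table_search(T, p, alphabet="ACGT"):
    m, delta = len(p), build_table(p, alphabet)
    j, hits = 0, []
    for i, c in enumerate(T):
        j = delta[j].get(c, 0)          # characters outside the alphabet reset the automaton
        if j == m:
            hits.append(i - m + 1)      # a match ends at position i
    return hits

# e.g. table_search("ACACTAAC", "ACTAAC") returns [2]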

Multiple Queries

Now we will discuss multiple patterns.

Aho-Corasick

Figure 10.4: A multiple-pattern FSM.

Implementation: you can again fit the FSM into one table. The first line in the table encodes the fact that you can branch to siblings. The FSM fits into a small table:

Figure 10.5: A multiple-pattern FSM table.

The failure row explains where to go if you failed to match. A failure moves you up in the BFS tree. The code:

ac-match(T, P_1, P_2, ..., P_N):
    S = P_1 + '*' + ... + P_N + '*'
    ac-process(S, F)
    # then the same as kmp-match

ac-next(c, j, S, F):
    # same as kmp-next

ac-process(S, F):
    i = 0; j = 0
    while i < length(S):
        if S[i] == S[j]:
            if S[i] == '*':
                j = -1
            j = j + 1
            i = i + 1
        else:
            if F[j] == -1:
                F[j] = i
            j = F[j]
    failures(0, -1, S, F)

failures(i, f, S, F):
    if F[i] == -1:
        F[i] = f
    else:
        failures(F[i], f, S, F)
    if S[i] != '*':
        failures(i + 1, ac-next(S[i], f, S, F), S, F)

This code is completely the same as kmp-match except for the ac-process method. The differences in ac-process are that:

1. Failures do not take effect immediately, because we could still branch off to siblings; only after a series of soft failures do we have a hard failure.
2. The code incorporates a linked list of siblings.

In this case, the preprocessing takes O(|A| · |Q|) time (where |Q| is the sum of the lengths of the q_i), despite the fact that it looks quadratic. The running time of the match process is then O(|A| · |T|). We can also consider building a lookup table for these transitions; this takes O(|A| · |Q|) space, but then the running time is O(|T|).

For next time: Suffix Trees.
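For reference, here is a compact, self-contained sketch of an Aho-Corasick automaton in Python (our own illustration, not the lecture's table-based implementation): it builds a trie of the patterns, sets failure links by breadth-first search, and propagates outputs along the failure links.

from collections import deque

def build_ac(patterns):
    # goto[v]: outgoing trie edges, fail[v]: failure link, out[v]: patterns ending at v
    goto, fail, out = [{}], [0], [[]]
    for p in patterns:
        v = 0
        for c in p:
            if c not in goto[v]:
                goto.append({}); fail.append(0); out.append([])
                goto[v][c] = len(goto) - 1
            v = goto[v][c]
        out[v].append(p)
    queue = deque(goto[0].values())          # depth-1 nodes fail to the root
    while queue:
        v = queue.popleft()
        for c, u in goto[v].items():
            queue.append(u)
            f = fail[v]
            while f and c not in goto[f]:    # walk failure links until c can be extended
                f = fail[f]
            fail[u] = goto[f][c] if c in goto[f] else 0
            out[u] += out[fail[u]]           # inherit matches ending at the failure node
    return goto, fail, out

def ac_search(T, patterns):
    goto, fail, out = build_ac(patterns)
    v, hits = 0, []
    for i, c in enumerate(T):
        while v and c not in goto[v]:
            v = fail[v]
        v = goto[v].get(c, 0)
        for p in out[v]:
            hits.append((i - len(p) + 1, p)) # (start position, pattern)
    return hits

# e.g. ac_search("ACACTAAC", ["ACTAAC", "TAA", "CA"]) returns [(1, 'CA'), (4, 'TAA'), (2, 'ACTAAC')]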