kvjlixapejrbxeenpphkhthbkwyrwamnugzhppfx

Similar documents
Data Structures and Algorithms. Course slides: String Matching, Algorithms growth evaluation

Algorithms and Data Structures

CS/COE 1501

Knuth-Morris-Pratt. Kranthi Kumar Mandumula Indiana State University Terre Haute IN, USA. December 16, 2011

Chapter 7. Space and Time Tradeoffs. Copyright 2007 Pearson Addison-Wesley. All rights reserved.

CSC152 Algorithm and Complexity. Lecture 7: String Match

String Matching. Pedro Ribeiro 2016/2017 DCC/FCUP. Pedro Ribeiro (DCC/FCUP) String Matching 2016/ / 42

String matching algorithms تقديم الطالب: سليمان ضاهر اشراف المدرس: علي جنيدي

Algorithms and Data Structures Lesson 3

CSCI S-Q Lecture #13 String Searching 8/3/98

6.3 Substring Search. brute force Knuth-Morris-Pratt Boyer-Moore Rabin-Karp !!!! Substring search

CSC Design and Analysis of Algorithms. Lecture 9. Space-For-Time Tradeoffs. Space-for-time tradeoffs

String matching algorithms

5.3 Substring Search

String Matching Algorithms

Algorithms. Algorithms 5.3 SUBSTRING SEARCH. introduction brute force Knuth-Morris-Pratt Boyer-Moore Rabin-Karp ROBERT SEDGEWICK KEVIN WAYNE

Applied Databases. Sebastian Maneth. Lecture 14 Indexed String Search, Suffix Trees. University of Edinburgh - March 9th, 2017

String Matching Algorithms

Application of String Matching in Auto Grading System

A string is a sequence of characters. In the field of computer science, we use strings more often as we use numbers.

String Matching in Scribblenauts Unlimited

COMP4128 Programming Challenges

Hashing Algorithms. Hash functions Separate Chaining Linear Probing Double Hashing

Indexing and Searching

Practical and Optimal String Matching

String Algorithms. CITS3001 Algorithms, Agents and Artificial Intelligence. 2017, Semester 2. CLRS Chapter 32

Fast Hybrid String Matching Algorithms

String Processing Workshop

SUBSTRING SEARCH BBM ALGORITHMS TODAY DEPT. OF COMPUTER ENGINEERING. Substring search applications. Substring search.

Given a text file, or several text files, how do we search for a query string?

String Patterns and Algorithms on Strings

Clever Linear Time Algorithms. Maximum Subset String Searching. Maximum Subrange

University of Waterloo CS240R Winter 2018 Help Session Problems

CSED233: Data Structures (2017F) Lecture12: Strings and Dynamic Programming

Introduction to Algorithms

Study of Selected Shifting based String Matching Algorithms

Chapter. String Algorithms. Contents

University of Waterloo CS240R Fall 2017 Review Problems

Lecture 7 February 26, 2010

of characters from an alphabet, then, the hash function could be:

University of Waterloo CS240 Spring 2018 Help Session Problems

A New String Matching Algorithm Based on Logical Indexing

Data Structures and Algorithms Dr. Naveen Garg Department of Computer Science and Engineering Indian Institute of Technology, Delhi.

Exact String Matching. The Knuth-Morris-Pratt Algorithm

Clever Linear Time Algorithms. Maximum Subset String Searching

String quicksort solves this problem by processing the obtained information immediately after each symbol comparison.

Multiple Skip Multiple Pattern Matching Algorithm (MSMPMA)

String Matching. Geetha Patil ID: Reference: Introduction to Algorithms, by Cormen, Leiserson and Rivest

ECE 122. Engineering Problem Solving Using Java

Introduction. hashing performs basic operations, such as insertion, better than other ADTs we ve seen so far

Lecture 16. Reading: Weiss Ch. 5 CSE 100, UCSD: LEC 16. Page 1 of 40

International Journal of Computer Engineering and Applications, Volume XI, Issue XI, Nov. 17, ISSN

Efficient Algorithm for Two Dimensional Pattern Matching Problem (Square Pattern)

SORTING. Practical applications in computing require things to be in order. To consider: Runtime. Memory Space. Stability. In-place algorithms???

Problem Set 9 Solutions

Module 2: Classical Algorithm Design Techniques

An introduction to suffix trees and indexing

Cpt S 223. School of EECS, WSU

Suffix links are stored for compact trie nodes only, but we can define and compute them for any position represented by a pair (u, d):

Text Analytics. Index-Structures for Information Retrieval. Ulf Leser

Questions. 6. Suppose we were to define a hash code on strings s by:

Volume 3, Issue 9, September 2015 International Journal of Advance Research in Computer Science and Management Studies

Parallel and Sequential Data Structures and Algorithms Lecture (Spring 2012) Lecture 25 Suffix Arrays

Applications of Suffix Tree

Lecture 7: Efficient Collections via Hashing

Chapter 4. Transform-and-conquer

Data Structures and Algorithms Course Introduction. Google Camp Fall 2016 Yanqiao ZHU

COS 226 Algorithms and Data Structures Spring Final

PROGRAM EFFICIENCY & COMPLEXITY ANALYSIS

Implementation of Pattern Matching Algorithm on Antivirus for Detecting Virus Signature

Dictionaries and Hash Tables

CMSC423: Bioinformatic Algorithms, Databases and Tools. Exact string matching: introduction

Habanero Extreme Scale Software Research Project

CSE100. Advanced Data Structures. Lecture 21. (Based on Paul Kube course materials)

Topics Applications Most Common Methods Serial Search Binary Search Search by Hashing (next lecture) Run-Time Analysis Average-time analysis Time anal

Hashing. CptS 223 Advanced Data Structures. Larry Holder School of Electrical Engineering and Computer Science Washington State University

Algorithms in Systems Engineering IE172. Midterm Review. Dr. Ted Ralphs

Fast Substring Matching

Hash Table and Hashing

CS2 Algorithms and Data Structures Note 4

Lecture 18 April 12, 2005

Hash[ string key ] ==> integer value

Lecture Notes on Hash Tables

HashTable CISC5835, Computer Algorithms CIS, Fordham Univ. Instructor: X. Zhang Fall 2018

CSC 421: Algorithm Design & Analysis. Spring Space vs. time

4. Suffix Trees and Arrays

Fast Exact String Matching Algorithms

4.1 Performance. Running Time. The Challenge. Scientific Method. Reasons to Analyze Algorithms. Algorithmic Successes

Introduction to. Algorithms. Lecture 6. Prof. Constantinos Daskalakis. CLRS: Chapter 17 and 32.2.

LING/C SC/PSYC 438/538. Lecture 18 Sandiway Fong

Lecture L16 April 19, 2012

Hashing. 1. Introduction. 2. Direct-address tables. CmSc 250 Introduction to Algorithms

An analysis of the Intelligent Predictive String Search Algorithm: A Probabilistic Approach

Text Algorithms (6EAP) Lecture 3: Exact pa;ern matching II

4.1 Performance. Running Time. The Challenge. Scientific Method

CHAPTER 9 HASH TABLES, MAPS, AND SKIP LISTS

Data Structures and Algorithms. Chapter 7. Hashing

Running Time. Analytic Engine. Charles Babbage (1864) how many times do you have to turn the crank?

Data Types. Data Types. Integer Types. Signed Integers

DATA STRUCTURES + SORTING + STRING

Transcription:

COS 226 Lecture 12: String searching String search analysis TEXT: N characters PATTERN: M characters Idea to test algorithms: use random pattern or random text Existence: Any occurrence of pattern in text? Enumerate: How many occurrences of pattern in text? Search: Find an occurence of pattern in text Search all: Find all occurences of pattern in Three parameters N, M, C (number of occurences) start with N >> M >> C Ex: N = 100,000, M = 100, C = 5 Ex: binary text ~N possible different patterns Probability of search success < N/2^M <.00000000000000000000001 for N = 100,000, M = 100 Probabilities are much lower for bigger alphabets NOT FOUND is a simple algorithm that works Other factors multiple patterns (preprocessing) binary vs. ascii text avoid backup in text 12.1 Better idea to test algorithms: search for fixed patterns in fixed text use random position for successful search use random perturbations for unsuccessful search 12.3 String searching examples Brute-force string searching Text string: find gjkmxorzeoa in kvjlixapejrbxeenpphkhthbkwyrwamnugzhppfx iyjyanhapfwbghxmshwrlyujfjhrsovkvveylnbx nawavgizyvmfohigeabgksfnbkmffxjffqbualey tqrphyrbjqdjqavctgxjifqgfgydhoiwhrvwqbxg rixydzbpajnhopvlamhhfavoctdfytvvggikngkw zixgjtlxkozjlefilbrboignbzsudssvqymnapbp qvlubdoyxkkwhcoudvtkmikansgsutdjythzlapa wlvliygjkmxorzeoafeoffbfxuhkzukeftnrfmoc ylculksedgrdivayjpgkrtedehwhrvvbbltdkctq Binary: find 101001111010111011 in 1000100101011010001001101011010110110110 0111111110101001110101000000010111101001 1011100000011000110000100110000111101101 1100000011011110101111101111000111100001 1011111001110011001111011001000000011101 1111100001011001110011010010110101001001 0001111111111011100111101100101010011110 1111001110110010010010011000100010011100 0010000011100010101110101001101100100011 12.2 Check for pattern at every text position int brutesearch(char *p, char *a) int i, j, cnt = 0; for (i = 0; i < strlen(a); i++) for (j = 0; j < strlen(p); j++) if (a[i+j]!= p[j]) break; if (j == strlen(p)-1) return i; return strlen(a)+1; DON T USE THIS PROGRAM! 12.4

Problem with brute-force implementation Brute-force algorithm (alternative) for (i = 0; i < strlen(a); i++) In C, strlen takes time proportional to string length running time at least N^2 even for simpler programs (count the a s) PERFORMANCE BUG Performance matters in ADT design! Exercise: implement string ADT with fast strlen need space to store length need to update length when changing string... Different implementation (same algorithm) char match: increment both i and j char mismatch: set j to 0, reset i int brutesearch(char *p, char *a) int i, j, M = strlen(p), N = strlen(a); for (i = 0, j = 0; j < M && i < N; i++, j++) while (a[i]!= p[j]) i -= j-1; j = 0; if (j == M) return i-m; else return i; 12.5 12.7 Brute-force algorithm (bug fixed) Analysis of brute-force algorithm int brutesearch(char *p, char *a) int i, j; int M = strlen(p), N = strlen(a), cnt = 0; for (i = 0; i < N; i++) for (j = 0; j < M; j++) if (a[i+j]!= p[j]) break; if (j == M-1) return i; return N+1; Running time depends on text and pattern Worst case: search for 000001. 000000000000000000000000000001 M*N bit compares [Too slow when N and M are both large] 12.6 12.8

Average-case analysis Rabin-Karp algorithm (continued) Fixed pattern, random text. 10011010010100010100111000111. 00*. 0*. 0*. 001 long pattern: 2*N compares short pattern: precise cost depends on pattern (first 001 appears before first 000, on average) Ref: Flajolet and Sedgewick 12.9 Solution: use previous hash to compute next hash 31415 = 84 (mod 97) 14159 = (31415-30000)*10 + 9 = (84-3*9)*10 + 9 (mod 97) = 579 = 94 (mod 97) 41592 = (94-1*9)*10 + 2 = 76 (mod 97) 15926 = (76-4*9)*10 + 6 = 18 (mod 97) 59265 = (18-1*9)*10 + 5 = 95 (mod 97) Improves running time from N*M to N+M Slight problem: need full compare on collision 92653 = (95-5*9)*10 + 3 = 18 (mod 97) Solution: use giant (virtual) table! limit on table size: overflow on arithmetic ops 12.11 Rabin-Karp algorithm RK algorithm implementation Idea: Use hashing compute hash function for each text position NO TABLE! (just compare with pattern hash) Ex: "table" size 97, M = 5 Search for 15926 = 18 (mod 97). 31415926535897932384626433. 31415 = 84 (mod 97). 14159 = 94 (mod 97). 41592 = 76 (mod 97). 15926 = 18 (mod 97). 59265 = 95 (mod 97) Problem: hash function depends on M characters (running time N*M) 12.10 #define q 3355439 #define d 256 int rksearch(char *p, char *a) int i, j, dm = 1, h1 = 0, h2 = 0; int M = strlen(p), N = strlen(a); for (j = 1; j < M; j++) dm = (d*dm) % q; for (j = 0; j < M; j++) h1 = (h1*d + p[j]) % q; h2 = (h2*d + a[j]) % q; for (i = M; i < N; i++) if (h1 == h2) return i-m; h2 = (h2 + d*q - a[i-m]*dm) % q; h2 = (h2*d + a[i]) % q; return N; Randomized algorithm: take random (huge) table size 12.12

Knuth-Morris-Pratt algorithm KMP FSA typical example Text characters that match are in the pattern! Finite-state automaton for 10100110 PRECOMPUTE what to do on mismatch Ex: when searching for 000001, mismatch 00000* implies text had 000000 if next text char is 1, full match found if next text char is 0, back in same state Ex: when searching for 000001, mismatch 000* implies text had 0001 restart search on next text char 5 0 2 4 6 8 1 3 7 More complicated in general, but. 2 3 4 5 6 7 can always move to next text character. 0 0 2 0 4 5 0 2 8 next action uniquely determined. 1 1 1 3 1 3 6 7 1 Search example: Nonobvious algorithm. 100111010010101010100110000111 solved open theoretical and practical problems 12.13. 0120111234562343434345678 12.15 KMP finite state automaton KMP automaton construction "next action uniquely determined" Build FSA from pattern FSA builds itself (!) mismatch state is defined earlier in FSA Run FSA on text (inner loop: S = FSA[S][a[i++]]) N+M running time NO BACKUP in text Ex: FSA for 000001. 2 3 4 5. 2 3 4 5 5. 1 0 0 0 0 0 6 worst-case text. 000000000000000000000000000001 Ex: FSA for 10100110 State 6 1010011: go to state 7 1010010: go to state for 010010 (state 2) remember state for 010011 (X = state 1) State 7 10100110: go to state 8 10100111: go to state for 0100111 (state 1) remember state for 0100110 (X = state 2). 012345555555555555555555555556 random text. 100111010010100010100111000111. 001200010120101230101200012300 12.14 Main point: states needed are successors of old X X: mismatch state match: FSA[match][i] = i+1 mismatch: FSA[mismatch][i] = FSA[mismatch][X] X = FSA[match][X] 12.16

KMP automaton construction example Pattern: 10100110 KMP implementation (continued).. 0 0 2 1. 1 1 1 01 Easy to create specialized C program for pattern. 2. 0 0 2 0 00. 1 1 1 3 000. 2 3. 0 0 2 0 4 011. 1 1 1 3 1 0011. 2 3 4. 0 0 2 0 4 5 0101. 1 1 1 3 0 3 00123. 2 3 4 5. 0 0 2 0 4 5 0 01000. 1 1 1 3 0 0 6 001200. 2 3 4 5 6. 0 0 2 0 4 5 0 2 010010. 1 1 1 3 0 0 6 7 0012012 int kmpsearch(char *a) int i = 0; s0: if (a[i]!= 1 ) goto s0; i++; s1: if (a[i]!= 0 ) goto s1; i++; s2: if (a[i]!= 1 ) goto s0; i++; s3: if (a[i]!= 0 ) goto s1; i++; s4: if (a[i]!= 0 ) goto s3; i++; s5: if (a[i]!= 1 ) goto s0; i++; s6: if (a[i]!= 1 ) goto s2; i++; s7: if (a[i]!= 0 ) goto s1; i++; return i-8;. 2 3 4 5 6 7. 0 0 2 0 4 5 0 2 8 0100111. 1 1 1 3 0 0 6 7 1 00120111 12.17 Ultimate search program for pattern: machine language version of FSA 12.19 KMP implementation Right-left pattern scan Build FSA from pattern, run FSA on text string Improvement: match state in FSA unnecessary (always i+1) use mismatch state only ("next" array) small hack: next[0] = -1 int kmpsearch(char *p, char *a) int i, j, M = strlen(p), N = strlen(a); initnext(p); for (i = 0, j = 0; j < M && i < N; i++, j++) while ((j>=0) && (a[i]!=p[j])) j = next[j]; if (j == M) return i-m; else return i; initnext function constructs automaton (see text) 12.18 SUBLINEAR ALGORITHMS move right to left in pattern move left to right in text Ex: Find string of 9 consecutive 0s. 100111010010100010100111000111. 0. 1. 1. 0. 0. 1 Same idea effective in large alphabet Search time proportional to N/M for practical problems Time decreases as pattern length increases (!) 12.20

Right-left pattern scan examples Multiple patterns Text character not in pattern: skip forward M chars. now is the time for all good people to come * * * e. l. o. e Text character in pattern: skip forward to pattern end. you can fool some of the people some of the time * * o. e. l. o. e 12.21 DICTIONARY (symbol table ADT) build trie from text (preprocess) pattern lookups in O(lgN) steps EXCEPTION DICTIONARY build trie from set of patterns (preprocess) find patterns in given text in N lg N steps generalization of KMP build trie from set of patterns convert to FSA with KMP-like computation find patterns in given text in N steps 12.23 Right-left pattern scan implementation initskip(char *p) int j, M = strlen(p); for (j = 0; j < 256; j++) skip[j] = M; for (j = 0; j < M; j++) skip[p[j]] = M-j-1; #define max(a, B) (A > B)? A : B; int mischarsearch(char *p, char *a) int i, j; int M = strlen(p), N = strlen(a); initskip(p); for (i = M-1, j = M-1; j >= 0; i--, j--) while (a[i]!= p[j]) i += max(m-j, skip[a[i]]); if (i >= N) return N; j = M-1; return i+1; 12.22 Boyer-Moore: KMP-like computation to improve skip, if possible Multiple patterns example Ex: FSA for 000, 011, and 1010 3 1 4 7 8 0 5 9 2 6. 2 3 4 5 6. 3 5 7 5 3 9. 1 2 4 2 4 8 6 8 Simultaneous search for all patterns in text. 111100100100101110100000. 02222534534534568 12.24