String matching algorithms
Deliverables
String Basics
Naïve String Matching Algorithm
Boyer-Moore Algorithm
Rabin-Karp Algorithm
Knuth-Morris-Pratt Algorithm
Copyright @ gdeepak.com
String Basics
A string is a sequence of characters.
Examples of strings: C++ program, HTML document, DNA sequence, digitized image.
An alphabet Σ is the set of possible characters for a family of strings.
Examples of alphabets: ASCII (used by C and C++), Unicode (used by Java), {0, 1}, {A, C, G, T}.
String Basics
Let P be a string of size m.
A substring P[i..j] of P is the subsequence of P consisting of the characters with ranks between i and j.
A prefix of P is a substring of the type P[0..i].
A suffix of P is a substring of the type P[i..m - 1].
Given strings T (text) and P (pattern), the pattern matching problem consists of finding a substring of T equal to P.
Applications: text editors, search engines, biological research.
Brute Force String Matching
Naive-String-Matcher(T, P)
    n ← length[T]
    m ← length[P]
    for s ← 0 to n - m
        do if P[1..m] = T[s+1..s+m]
            then print "Pattern occurs with shift" s
Worst case: O(mn), for example T = aaa…ah and P = aaah.
Compare the pattern with the text at the current shift; if the characters do not match, shift the pattern by one position and repeat.
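The brute-force procedure above can be sketched in C as follows. This is a minimal sketch, not the slide's exact routine: the name `naive_match` is hypothetical, indices are 0-based rather than 1-based, and it returns the first matching shift instead of printing every shift.

```c
#include <string.h>

/* Brute-force matcher (hypothetical helper, 0-based indexing).
   Returns the smallest shift s such that t[s..s+m-1] equals p,
   or -1 if the pattern does not occur in the text. */
int naive_match(const char *t, const char *p)
{
    int n = (int)strlen(t);
    int m = (int)strlen(p);

    for (int s = 0; s + m <= n; s++) {
        int j = 0;
        /* Compare the pattern with the text window starting at s. */
        while (j < m && t[s + j] == p[j])
            j++;
        if (j == m)
            return s;          /* all m characters matched */
        /* Otherwise shift the pattern by one position. */
    }
    return -1;
}
```

On the worst-case style input T = aaaaaaah, P = aaah, every shift before the last one compares almost the whole pattern before failing, which is where the O(mn) bound comes from.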
Boyer-Moore Algorithm
It uses two heuristics.
Looking-glass heuristic: compare P with T moving backwards.
Character-jump heuristic: when a mismatch occurs at T[i] = c:
    if P contains c, shift P to align the last occurrence of c in P with T[i];
    else, shift P to align P[0] with T[i + 1].
Boyer-Moore Algorithm
Boyer-Moore's algorithm runs in time O(nm + s), where s is the size of the alphabet.
Example of worst case: T = aaa…a, P = baaa.
The worst case may occur in images and DNA sequences but is unlikely in English text, so Boyer-Moore is better suited to English text.
Example: text T = "a pattern matching algorithm", pattern P = "rithm".
Character   M  H  T  I  R
Value       0  1  2  3  4    (distance of the character from the last character of P)
[Figure: successive alignments of P against T, with character comparisons numbered 1 to 11.]
Boyer-Moore Algorithm
Algorithm BoyerMooreMatch(T, P, Σ)
    L ← lastOccurrenceFunction(P, Σ)
    i ← m - 1
    j ← m - 1
    repeat
        if T[i] = P[j]
            if j = 0
                return i    { match at i }
            else
                i ← i - 1
                j ← j - 1
        else    { character-jump }
            l ← L[T[i]]
            i ← i + m - min(j, 1 + l)
            j ← m - 1
    until i > n - 1
    return -1    { no match }
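The pseudocode above translates fairly directly into C. The sketch below assumes an 8-bit character alphabet (the constant `ALPHABET` and the name `boyer_moore_match` are hypothetical) and builds the last-occurrence table inline; it uses only the two heuristics described on these slides, not the good-suffix rule found in fuller Boyer-Moore implementations.

```c
#include <string.h>

#define ALPHABET 256   /* assumption: 8-bit characters */

/* Returns the index of the first occurrence of p in t, or -1. */
int boyer_moore_match(const char *t, const char *p)
{
    int n = (int)strlen(t);
    int m = (int)strlen(p);
    int last[ALPHABET];

    if (m == 0) return 0;
    if (m > n)  return -1;

    /* Last-occurrence function: last[c] is the largest index k with
       p[k] == c, or -1 if c does not occur in p. */
    for (int c = 0; c < ALPHABET; c++)
        last[c] = -1;
    for (int k = 0; k < m; k++)
        last[(unsigned char)p[k]] = k;

    int i = m - 1;   /* current position in the text */
    int j = m - 1;   /* current position in the pattern */
    do {
        if (t[i] == p[j]) {
            if (j == 0)
                return i;              /* match at i */
            i--;                       /* looking-glass: move backwards */
            j--;
        } else {                       /* character-jump */
            int l = last[(unsigned char)t[i]];
            int step = (j < 1 + l) ? j : 1 + l;   /* min(j, 1 + l) */
            i += m - step;
            j = m - 1;
        }
    } while (i <= n - 1);

    return -1;                         /* no match */
}
```

For T = "a pattern matching algorithm" and P = "rithm" this sketch finds the match at index 23 after 11 character comparisons, the same count as in the slide's example.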
Rabin-Karp Algorithm
It speeds up testing the pattern for equality against substrings of the text by using a hash function.
A hash function converts every string into a numeric value, called its hash value; e.g. hash("hello") = 5.
It exploits the fact that if two strings are equal, their hash values are also equal.
There are two problems: different strings can also produce the same hash value, and there is the extra cost of calculating the hash for each substring of the text.
Worst case is O(mn).
Rabin-Karp Example
Here we are matching 31415 in the given string. Many windows can produce a hash hit, but only a few of them may be valid matches; the others are spurious hits.
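A rolling-hash version of the idea can be sketched as below. The text string used in the usage note and the modulus q = 13 are assumptions taken from the common textbook version of this example, since the slide's figure is not reproduced here; the function name `rabin_karp` and its out-parameters are hypothetical. The sketch handles strings of decimal digits only.

```c
#include <string.h>

/* Rabin-Karp over strings of decimal digits, hashing modulo q.
   Returns the number of valid matches. *first receives the index of
   the first valid match (or -1), *spurious the number of windows whose
   hash equals the pattern hash although the characters differ. */
int rabin_karp(const char *t, const char *p, int q, int *first, int *spurious)
{
    int n = (int)strlen(t);
    int m = (int)strlen(p);
    const int d = 10;            /* radix: decimal digits */
    int h = 1, hp = 0, ht = 0, matches = 0;

    *first = -1;
    *spurious = 0;
    if (m == 0 || m > n) return 0;

    for (int k = 0; k < m - 1; k++)      /* h = d^(m-1) mod q */
        h = (h * d) % q;
    for (int k = 0; k < m; k++) {        /* hashes of p and first window */
        hp = (d * hp + (p[k] - '0')) % q;
        ht = (d * ht + (t[k] - '0')) % q;
    }

    for (int s = 0; s <= n - m; s++) {
        if (hp == ht) {                  /* hash hit: verify characters */
            if (memcmp(t + s, p, (size_t)m) == 0) {
                if (*first < 0) *first = s;
                matches++;
            } else {
                (*spurious)++;
            }
        }
        if (s < n - m) {                 /* roll the hash to window s+1 */
            ht = (d * (ht - (t[s] - '0') * h) + (t[s + m] - '0')) % q;
            if (ht < 0) ht += q;
        }
    }
    return matches;
}
```

Called as `rabin_karp("2359023141526739921", "31415", 13, &first, &spurious)`, the sketch reports one valid match at index 6 and one spurious hit (the window 67399, which also hashes to 7 modulo 13).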
KMP Algorithm
It never re-compares a text symbol that has matched a pattern symbol. As a result, the searching phase of KMP has complexity O(n).
The preprocessing phase has complexity O(m). Since m ≤ n, the overall complexity is O(n).
A border of x is a substring that is both a proper prefix and a proper suffix of x. We call its length b the width of the border.
Let x = abacab.
The proper prefixes of x are ε, a, ab, aba, abac, abaca.
The proper suffixes of x are ε, b, ab, cab, acab, bacab.
The borders of x are ε and ab.
Border Concept
If s is the widest border of x, the next-widest border r of x is obtained as the widest border of s.
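The chain "widest border, then widest border of that border" can be followed mechanically once the width of the widest border of every prefix is known. The sketch below is one way to do this under stated assumptions: the name `border_widths` and the bound `MAXLEN` are hypothetical, and the prefix-border array is computed with the same loop that the KMP pre-processing slide uses.

```c
#include <string.h>

#define MAXLEN 256   /* hypothetical upper bound on the string length */

/* Writes the widths of all borders of x into widths[], widest first,
   and returns how many there are (-1 if x is longer than MAXLEN).
   widths[] must have room for strlen(x) entries. */
int border_widths(const char *x, int *widths)
{
    int m = (int)strlen(x);
    int b[MAXLEN + 1];   /* b[i] = width of widest border of x[0..i-1] */
    int i = 0, j = -1, count = 0;

    if (m > MAXLEN) return -1;

    b[0] = -1;           /* the empty prefix has no border */
    while (i < m) {
        while (j >= 0 && x[i] != x[j])
            j = b[j];    /* fall back to the next-widest border */
        i++;
        j++;
        b[i] = j;
    }

    /* Follow the chain: widest border of x, then the widest border of
       that border, and so on, down to the empty border of width 0. */
    for (int w = b[m]; w >= 0; w = b[w])
        widths[count++] = w;

    return count;
}
```

For x = abacab the chain is 2, 0, which corresponds to the borders ab and ε.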
Border Extension
Let x be a string and a ∈ A a symbol. A border r of x can be extended by a if ra is a border of xa.
Border Calculation
In the pre-processing phase an array b of length m + 1 is computed. Each entry b[i] contains the width of the widest border of the prefix of length i of the pattern (i = 0, ..., m). Since the prefix ε of length i = 0 has no border, we set b[0] = -1.
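KMP-Preprocess Algorithm
/* p is the pattern, m its length, b the border array of length m + 1 */
void kmppreprocess()
{
    int i = 0, j = -1;
    b[i] = j;
    while (i < m)
    {
        /* fall back to shorter borders until one can be extended by p[i] */
        while (j >= 0 && p[i] != p[j]) j = b[j];
        i++; j++;
        b[i] = j;
    }
}
For pattern p = ababaa the widths of the borders in array b have the following values. For instance, b[5] = 3, since the prefix ababa of length 5 has a border of width 3.
j      0   1   2   3   4   5   6
p[j]   a   b   a   b   a   a
b[j]  -1   0   0   1   2   3   1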
Preprocessing
The pre-processing algorithm could be applied to the string pt instead of p. If only borders up to a width of m are computed, then a border of width m of some prefix x of pt corresponds to a match of the pattern in t (provided that the border is not self-overlapping).
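KMP Search Algorithm
/* t is the text of length n, p the pattern of length m, b the border array */
void kmpsearch()
{
    int i = 0, j = 0;
    while (i < n)
    {
        /* on a mismatch, resume at the width of the widest border */
        while (j >= 0 && t[i] != p[j]) j = b[j];
        i++; j++;
        if (j == m)
        {
            report(i - j);   /* match found at position i - j */
            j = b[j];
        }
    }
}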
KMP Search Algorithm
When a mismatch occurs at position j in the inner while loop, the widest border of the matching prefix of length j of the pattern is considered. Resuming comparisons at position b[j], the width of the border, yields a shift of the pattern such that the border matches. If a mismatch occurs again, the next-widest border is considered, and so on, until there is no border left or the next symbol matches. Then we have a new matching prefix of the pattern and continue with the outer while loop.
If all m symbols of the pattern have matched the corresponding text window (j = m), a function report is called to report the match at position i - j. Afterwards, the pattern is shifted as far as its widest border allows.
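Putting the two phases together, a self-contained version might look like the sketch below. It is an illustration under assumptions rather than the slides' exact code: the function name `kmp_search`, the bound `MAXM`, and the output array `pos` are hypothetical, and matches are stored in the array instead of being passed to `report`.

```c
#include <string.h>

#define MAXM 256   /* hypothetical upper bound on the pattern length */

/* Finds all occurrences of p in t. Stores up to maxpos starting
   positions in pos[] and returns the total number of matches
   (0 if the pattern is empty or longer than MAXM). */
int kmp_search(const char *t, const char *p, int *pos, int maxpos)
{
    int n = (int)strlen(t);
    int m = (int)strlen(p);
    int b[MAXM + 1];
    int i = 0, j = -1, found = 0;

    if (m == 0 || m > MAXM) return 0;

    /* Pre-processing: b[i] = width of widest border of p[0..i-1]. */
    b[0] = -1;
    while (i < m) {
        while (j >= 0 && p[i] != p[j])
            j = b[j];
        i++;
        j++;
        b[i] = j;
    }

    /* Search: i only moves forward, so no matched text symbol is
       compared again after a shift. */
    i = 0;
    j = 0;
    while (i < n) {
        while (j >= 0 && t[i] != p[j])
            j = b[j];          /* mismatch: try the next-widest border */
        i++;
        j++;
        if (j == m) {          /* all m pattern symbols matched */
            if (found < maxpos)
                pos[found] = i - j;
            found++;
            j = b[j];          /* shift as far as the widest border allows */
        }
    }
    return found;
}
```

For the text abacaabaccabacabaabb and the pattern abacab, this sketch finds a single match starting at index 10. Overlapping occurrences are also reported: searching for aa in aaaa yields positions 0, 1 and 2.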
Text T = abacaabaccabacabaabb, pattern P = abacab.
x      0  1  2  3  4  5
P[x]   a  b  a  c  a  b
f(x)   0  0  1  0  1  2
[Figure: successive alignments of P against T, with character comparisons numbered 1 to 19.]
Questions, Comments and Suggestions
Question 1
How many nonempty prefixes of the string P = aaabbaaa are also suffixes of P?
Question 2
What is the longest proper prefix of the string cgtacgttcgtacg that is also a suffix of this string?
Question 3
What is the complexity of the KMP algorithm if we have a main string of length s and we wish to find a pattern of length p?
A) O(s + p)
B) O(p)
C) O(sp)
D) O(s)