Applied Databases. Sebastian Maneth. Lecture 14 Indexed String Search, Suffix Trees. University of Edinburgh - March 9th, 2017

3 Recap: Morris-Pratt (1970)
Given a pattern P and a text T, find all occurrences of P in T.
Lopopre(P[1..j], P)
"2 will never be used": not true! After a successful match, update to 2.
[Figure: example over the letters a a b; after the update there is one more successful match!]
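The recap above can be made concrete with a small sketch. This is the textbook border-table formulation of Morris-Pratt, not necessarily the exact table from the slides: the names `border_table` and `mp_search` are hypothetical, indices are 0-based (the slides use 1-based positions), and the pattern is assumed to be non-empty.

```python
def border_table(p):
    """border[j] = length of the longest proper prefix of p[:j] that is also a suffix of p[:j]."""
    border = [0] * (len(p) + 1)
    k = 0
    for j in range(1, len(p)):
        # Fall back to shorter borders until p[j] can extend one of them.
        while k > 0 and p[j] != p[k]:
            k = border[k]
        if p[j] == p[k]:
            k += 1
        border[j + 1] = k
    return border


def mp_search(t, p):
    """Return the 0-based start positions of all occurrences of p in t (p non-empty)."""
    border = border_table(p)
    hits = []
    k = 0  # number of pattern letters currently matched
    for i, c in enumerate(t):
        while k > 0 and c != p[k]:
            k = border[k]
        if c == p[k]:
            k += 1
        if k == len(p):
            hits.append(i - len(p) + 1)
            # After a successful match the table is consulted too,
            # so overlapping occurrences are not missed.
            k = border[k]
    return hits


print(mp_search("abababa", "aba"))  # [0, 2, 4]
```

The last table entry is only ever read after a full match, which is the situation the slide's remark about updating after a successful match refers to.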

7 Knuth-Morris-Pratt (KMP, 1977)
[Figure: table positions 0 1 2 over the letters a b b a b a a b a.]
The table entry is the largest value satisfying (C1) and (C2). Why? Otherwise the next check will fail!!
Different letters, but P[1] = P[3]: the shift MUST fail, so this shift makes no sense!
KMP[2] = 0? (C1) is satisfied, but (C2) is not satisfied: P[0 + 1] = a = P[2 + 1]. Hence KMP[2] = -1.
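The transcript does not preserve the exact wording of (C1) and (C2), so the following is only a sketch of the standard textbook "strong" failure table. It assumes that (C1) asks for a prefix of P that is also a suffix of the matched part, and that (C2) asks for the letter after that prefix to differ from the letter that just mismatched. The function name `strong_failure_table` is hypothetical, indices are 0-based, and -1 means that no admissible value exists.

```python
def strong_failure_table(p):
    """nxt[i] = where to resume in p after a mismatch at 0-based position i.

    nxt[i] is the largest k < i such that p[:k] is a suffix of p[:i]
    and p[k] != p[i]; it is -1 if no such k exists.
    nxt[len(p)] is the resume position after a full match.
    """
    m = len(p)
    nxt = [0] * (m + 1)
    nxt[0] = -1
    i, j = 0, -1
    while i < m:
        while j > -1 and p[i] != p[j]:
            j = nxt[j]
        i += 1
        j += 1
        if i < m and p[i] == p[j]:
            # Same letter follows: resuming at j would fail again, so skip further.
            nxt[i] = nxt[j]
        else:
            nxt[i] = j
    return nxt


print(strong_failure_table("aba"))  # [-1, 0, -1, 1]
```

For a pattern beginning with `aba`, the entry at index 2 is -1: resuming at 0 is ruled out because the same letter `a` would be compared again.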

8 Horspool = Idea 1 of Boyer-Moore
Idea 1: match RIGHT-TO-LEFT in the window.
For each letter z, let R(z) = distance from the right-most occurrence of z in P[1..m-1] to the end of P (and |P| if there is no occurrence).
[Figure: text T over the letters a, b, c with the pattern aligned below it; R(a) = 2, R(b) = 1, R(c) = 5.]
Horspool (ONE RULE ONLY): if there is a mismatch and P[m] is aligned to z in T, shift the pattern to the RIGHT by R(z).

10 P = G C A G A G A G
R(z)-table:
z     A  C  G  T
R(z)  1  6  2  8
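The R(z)-table and the single shift rule can be sketched as follows. The names `shift_table` and `horspool` are hypothetical, positions are 0-based, and the sketch also shifts by R(z) after reporting an occurrence, which is the usual convention but is not stated on the slide.

```python
def shift_table(p):
    """R(z) for every letter z that occurs in p[:-1]; absent letters default to len(p)."""
    m = len(p)
    # Later positions overwrite earlier ones, so the right-most occurrence wins.
    return {z: m - 1 - i for i, z in enumerate(p[:-1])}


def horspool(t, p):
    """Return the 0-based start positions of all occurrences of p in t (p non-empty)."""
    m, n = len(p), len(t)
    r = shift_table(p)
    hits = []
    pos = 0
    while pos + m <= n:
        # Compare right-to-left inside the current window.
        j = m - 1
        while j >= 0 and t[pos + j] == p[j]:
            j -= 1
        if j < 0:
            hits.append(pos)
        # Shift by R(z), where z is the text letter aligned with the last pattern letter.
        pos += r.get(t[pos + m - 1], m)
    return hits


print(shift_table("GCAGAGAG"))  # {'G': 2, 'C': 6, 'A': 1}
```

For P = GCAGAGAG this reproduces the table above; T does not occur in P, so it falls back to the default |P| = 8.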

11 Boyer-Moore
Idea 2: T = ... z c u ...
if mismatch after cu on symbol z, then
    k = max( R(z), D(u) )        (maximum shift)
    k = min( k, |P| - L(u) )     (restrict by lospre)
else report occurrence of P
    k := |P| - L(u)
shift by k
D(u) = distance to the next occurrence of u to the left (|P| if it does not exist)
L(u) = lospre(u, P)

12 Lecture 14: Indexed String Search
1. Suffix Trie
2. Suffix Tree
3. Suffix Tree Construction
4. Applications of Suffix Trees

16 String Search
Search over DNA sequences: a huge sequence over C, T, G, A (ca. 3.2 billion), no spaces, no tokens...
Given
  a long string T (text) of length n
  a short string P (pattern) of length m
Problem 1: find all occurrences of P in T
Problem 2: count #occurrences of P in T
Online Search
  O(|T|) time with O(|P|) preprocessing, e.g., using an automaton or KMP
  sublinear time using Horspool / Boyer-Moore; average time limit: O(n (log m)/m)
For DNA, 40% of 3.2 billion is still huge.

18 Indexed String Search
Given
  a long string T (text)
  a short string P (pattern), m = |P|
Problem 1: find all occurrences of P in T
Problem 2: count #occurrences of P in T
Offline Search = Indexed Search = (linear time) preprocessing of T
Highlights
  O(m) time for counting (Problem 2)
  O(m + #occ) time for finding all occurrences (Problem 1)
  Independent of the size of the text T!!!

22 Indexed String Search
Count / find all occurrences of P in T. Preprocessing ("indexing") of T is permitted.
Naive Solution
1. List all substrings of T, together with their occurrence lists: (string1, [3,7,21]), (string2, [3,21]), ...
2. Lexicographically sort the substrings.
3. Record the beginnings of each distinct next letter (tree structure).
[Figure: pointers labelled a and c into the sorted list:]
   (a, [3,4,6,7,10,...])
   (ab, [7,10,..])
   (ad,..
   (b,..
   (c
Search occurrences of P: jump to the substrings starting with letter P[1]; from there, jump to the substrings with next letter P[2]; etc. After m jumps, reach (or not) the matching substring with its occurrence list.
Search time: O(m) [good!]
Indexing time: exceeds O(n^2) (sort n^2 substrings)
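Steps 1 and 2 of the naive solution can be sketched in a few lines. The function name `naive_substring_index` is hypothetical; a sorted dictionary stands in for the tree structure of step 3, so a lookup here is a hash lookup rather than m jumps. Start positions are 1-based, as in the occurrence lists above.

```python
from collections import defaultdict


def naive_substring_index(t):
    """Map every distinct substring of t to the sorted list of its 1-based start positions."""
    index = defaultdict(list)
    n = len(t)
    for i in range(n):
        for j in range(i + 1, n + 1):
            index[t[i:j]].append(i + 1)
    # Step 2: lexicographic order of the substrings.
    return dict(sorted(index.items()))


idx = naive_substring_index("abab")
for substring, occurrences in idx.items():
    print(substring, occurrences)
```

The double loop enumerates n(n+1)/2 substrings (10 for `abab`, 7 of them distinct), which is where the quadratic blow-up of the indexing step comes from.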

24 1. Suffix Trie
Idea: consider all suffixes of the text T, i.e.,
  the suffix starting at position 1 (= T)
  the suffix starting at position 2
  the suffix starting at position 3
  etc.
Arrange the suffixes in a prefix tree (trie), with longest common prefixes shared.
Trie data structure: 1959 by de la Briandais. The name "trie" (Fredkin, 1961) comes from RETRIEVAL and is pronounced /ˈtriː/ (as "tree"); to distinguish it from "tree", many authors say /ˈtraɪ/ (as "try"). Aka digital tree or radix tree or prefix tree.

26 1. Suffix Trie
[Figure: suffix trie of the example text.]
Black nodes represent suffixes and are labeled by the corresponding number of the suffix.
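A suffix trie with labeled black nodes can be built by inserting the suffixes one after another. This is a minimal sketch using nested dictionaries; the name `build_suffix_trie` and the node layout (`children`, `suffix`) are assumptions made for illustration, and the text `abab` is not the example from the slide's figure, which the transcript does not preserve.

```python
def build_suffix_trie(t):
    """Insert every suffix of t into a trie; mark the end node of suffix i with its number i (1-based)."""
    root = {"children": {}, "suffix": None}
    for start in range(len(t)):
        node = root
        for ch in t[start:]:
            # Shared prefixes reuse existing nodes; new letters create new nodes.
            node = node["children"].setdefault(ch, {"children": {}, "suffix": None})
        node["suffix"] = start + 1  # this is a "black" node
    return root


trie = build_suffix_trie("abab")
ab_node = trie["children"]["a"]["children"]["b"]
print(ab_node["suffix"])  # 3, because the suffix "ab" starts at position 3
```

Nodes whose `suffix` field stays `None` are inner positions that only exist as prefixes of longer suffixes.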

33 1. Suffix Trie
How to search for all occurrences of a pattern P?
Starting at the root node, follow letter by letter (w.r.t. P) the unique edges in the trie!
Example: P = aba has 3 matches.

34 1. Suffix Trie
O(m) count time, if we can count the #black nodes of a subtree in constant time.
O(m + #occ) retrieval time, if we can iterate through the leaves of a subtree with constant delay.
Example: 3 matches of P = aba.
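The two conditions can be illustrated with a sketch that stores the number of black nodes of each subtree in the node itself, so counting needs only the O(m) walk. The class `Node` and the functions `build`, `locate`, `count_occurrences` and `find_occurrences` are hypothetical names, and the text `abaababa` is an assumption chosen so that `aba` occurs three times; it is not the text from the slide's figure. Retrieval here simply walks the whole subtree, so this sketch does not achieve the constant-delay iteration that the O(m + #occ) bound requires.

```python
class Node:
    def __init__(self):
        self.children = {}
        self.suffix = None  # black node: 1-based start of the suffix ending here
        self.count = 0      # number of black nodes in the subtree rooted here


def build(t):
    root = Node()
    for start in range(len(t)):
        node = root
        node.count += 1
        for ch in t[start:]:
            node = node.children.setdefault(ch, Node())
            node.count += 1  # this suffix ends somewhere below (or at) this node
        node.suffix = start + 1
    return root


def locate(root, p):
    """Follow p letter by letter from the root; return the node reached, or None."""
    node = root
    for ch in p:
        node = node.children.get(ch)
        if node is None:
            return None
    return node


def count_occurrences(root, p):
    node = locate(root, p)
    return node.count if node is not None else 0


def find_occurrences(root, p):
    node = locate(root, p)
    if node is None:
        return []
    hits, stack = [], [node]
    while stack:  # plain subtree traversal, not constant delay
        cur = stack.pop()
        if cur.suffix is not None:
            hits.append(cur.suffix)
        stack.extend(cur.children.values())
    return sorted(hits)


root = build("abaababa")
print(count_occurrences(root, "aba"))  # 3
print(find_occurrences(root, "aba"))   # [1, 4, 6]
```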

36 1. Suffix Trie
Indexing time? No sorting, but still quadratic in n, i.e., O(n^2) :-(
The size (#nodes) of the trie is O(n^2).
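The quadratic size can be observed directly by counting nodes. In this sketch the function name `suffix_trie_size` and the family of texts a^k b^k are assumptions made for illustration. For such a text of length n = 2k, every string a^i b^j with 0 <= i, j <= k is a distinct substring, so the trie has (k + 1)^2 nodes including the root.

```python
def suffix_trie_size(t):
    """Number of nodes (including the root) of the suffix trie of t."""
    root = {}
    nodes = 1
    for start in range(len(t)):
        node = root
        for ch in t[start:]:
            if ch not in node:
                node[ch] = {}
                nodes += 1
            node = node[ch]
    return nodes


for k in (1, 2, 4, 8, 16):
    text = "a" * k + "b" * k
    print(len(text), suffix_trie_size(text))
```

Doubling n roughly quadruples the node count on these inputs, matching the O(n^2) bound stated above.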

37 END Lecture 14