Data structures for string pattern matching: Suffix trees

Similar documents
Lecture 5: Suffix Trees

Exact String Matching Part II. Suffix Trees See Gusfield, Chapter 5

Suffix Tree and Array

Lecture 6: Suffix Trees and Their Construction

Exact Matching Part III: Ukkonen s Algorithm. See Gusfield, Chapter 5 Visualizations from

An introduction to suffix trees and indexing

Suffix trees and applications. String Algorithms

Suffix Trees. Martin Farach-Colton Rutgers University & Tokutek, Inc

Longest Common Substring

Special course in Computer Science: Advanced Text Algorithms

Given a text file, or several text files, how do we search for a query string?

17 dicembre Luca Bortolussi SUFFIX TREES. From exact to approximate string matching.

Special course in Computer Science: Advanced Text Algorithms

Suffix Trees and its Construction

Applied Databases. Sebastian Maneth. Lecture 14 Indexed String Search, Suffix Trees. University of Edinburgh - March 9th, 2017

String matching algorithms

Introduction to Suffix Trees

An Efficient Algorithm for Identifying the Most Contributory Substring. Ben Stephenson Department of Computer Science University of Western Ontario

String Matching Algorithms

SORTING. Practical applications in computing require things to be in order. To consider: Runtime. Memory Space. Stability. In-place algorithms???

Exact String Matching. The Knuth-Morris-Pratt Algorithm

11/5/09 Comp 590/Comp Fall

String Matching. Pedro Ribeiro 2016/2017 DCC/FCUP. Pedro Ribeiro (DCC/FCUP) String Matching 2016/ / 42

11/5/13 Comp 555 Fall

A Suffix Tree Construction Algorithm for DNA Sequences

1 Introduciton. 2 Tries /651: Algorithms CMU, Spring Lecturer: Danny Sleator

Knuth-Morris-Pratt. Kranthi Kumar Mandumula Indiana State University Terre Haute IN, USA. December 16, 2011

4. Suffix Trees and Arrays

COMP4128 Programming Challenges

Combinatorial Pattern Matching. CS 466 Saurabh Sinha

Indexing and Searching

CMSC423: Bioinformatic Algorithms, Databases and Tools. Exact string matching: introduction

4. Suffix Trees and Arrays

Suffix links are stored for compact trie nodes only, but we can define and compute them for any position represented by a pair (u, d):

String matching algorithms تقديم الطالب: سليمان ضاهر اشراف المدرس: علي جنيدي

Inexact Matching, Alignment. See Gusfield, Chapter 9 Dasgupta et al. Chapter 6 (Dynamic Programming)

Lecture 7 February 26, 2010

String Matching Algorithms

CSC152 Algorithm and Complexity. Lecture 7: String Match

Suffix-based text indices, construction algorithms, and applications.

Distributed and Paged Suffix Trees for Large Genetic Databases p.1/18

Lecture L16 April 19, 2012

BUNDLED SUFFIX TREES

Project Proposal. ECE 526 Spring Modified Data Structure of Aho-Corasick. Benfano Soewito, Ed Flanigan and John Pangrazio

Chapter 7. Space and Time Tradeoffs. Copyright 2007 Pearson Addison-Wesley. All rights reserved.

Applications of Suffix Tree

Figure 1. The Suffix Trie Representing "BANANAS".

Data Structures and Algorithms Dr. Naveen Garg Department of Computer Science and Engineering Indian Institute of Technology, Delhi.

Ukkonen s suffix tree algorithm

Advanced Algorithms: Project

Shortest Unique Substring Query Revisited

Algorithms and Data Structures Lesson 3

CMSC423: Bioinformatic Algorithms, Databases and Tools. Exact string matching: Suffix trees Suffix arrays

Enhanced Component Retrieval Scheme Using Suffix Tree in Object Oriented Paradigm

CSED233: Data Structures (2017F) Lecture12: Strings and Dynamic Programming

Computing the Longest Common Substring with One Mismatch 1

String Patterns and Algorithms on Strings

A string is a sequence of characters. In the field of computer science, we use strings more often as we use numbers.

Solution to Problem 1 of HW 2. Finding the L1 and L2 edges of the graph used in the UD problem, using a suffix array instead of a suffix tree.

Indexing: Part IV. Announcements (February 17) Keyword search. CPS 216 Advanced Database Systems

splitmem: graphical pan-genome analysis with suffix skips Shoshana Marcus May 7, 2014

Text search. CSE 392, Computers Playing Jeopardy!, Fall

Parallel Distributed Memory String Indexes

Algorithms and Data Structures

Space Efficient Linear Time Construction of

Inexact Pattern Matching Algorithms via Automata 1

An empirical evaluation of a metric index for approximate string matching. Bilegsaikhan Naidan and Magnus Lie Hetland

1. Gusfield text for chapter 5&6 about suffix trees are scanned and uploaded on the web 2. List of Project ideas is uploaded

Project Proposal. ECE 526 Spring Modified Data Structure of Aho-Corasick. Benfano Soewito, Ed Flanigan and John Pangrazio

CERC 2018 Presentation of solutions. December 2, 2018

University of Waterloo CS240R Fall 2017 Review Problems

Suffix trees, suffix arrays, BWT

PAPER Constructing the Suffix Tree of a Tree with a Large Alphabet

CSCI 104 Tries. Mark Redekopp David Kempe

Information Retrieval and Organisation

CMPUT 403: Strings. Zachary Friggstad. March 11, 2016

University of Waterloo CS240R Winter 2018 Help Session Problems

Alignment of Long Sequences

Bálint Márk Vásárhelyi Suffix Trees and Their Applications

BWT Arrays and Mismatching Trees: A New Way for String Matching with k Mismatches

SUFFIX ARRAYS A Competitive Choice for Fast Lempel-Ziv Compressions

COMP4128 Programming Challenges

University of Waterloo CS240 Spring 2018 Help Session Problems

I519 Introduction to Bioinformatics. Indexing techniques. Yuzhen Ye School of Informatics & Computing, IUB

Syllabus. 5. String Problems. strings recap

Divya R. Singh. Faster Sequence Alignment using Suffix Tree and Data-Mining Techniques. February A Thesis Presented by

Two Dimensional Dictionary Matching

Application of the BWT Method to Solve the Exact String Matching Problem

A Performance Evaluation of the Preprocessing Phase of Multiple Keyword Matching Algorithms

LAB # 3 / Project # 1

Edit Distance with Single-Symbol Combinations and Splits

Boyer-Moore. Ben Langmead. Department of Computer Science

From Suffix Trie to Suffix Tree

USING AN EXTENDED SUFFIX TREE TO SPEED-UP SEQUENCE ALIGNMENT

Final Exam Solutions

Algorithms Theory. 15 Text Search (2)

Computing Patterns in Strings I. Specific, Generic, Intrinsic

Chapter. String Algorithms. Contents

58093 String Processing Algorithms. Lectures, Autumn 2013, period II

Study of Data Localities in Suffix-Tree Based Genetic Algorithms

Transcription:

Suffix trees

Data structures for string pattern matching: Suffix trees Linear algorithms for exact string matching KMP Z-value algorithm What is suffix tree? A tree-like data structure for solving problems involving strings. Related data structures: Trie (retrieval) & PATRICIA (radix tree) Allow the storage of all substrings of a given string in linear space Simple algorithm to solve string pattern matching problem in linear time

Better than hash tables? Hash tables are certainly easier to understand. And, one can produce a hash table of all length k strings in O(m) time and look up a k-length string x in O(k) time, finding all p places where string x is found in O(p) time. This is the same as the bound for suffix trees. What if you don t know how long the string x is going to be? And most other string matching tricks don t work for it either.

Suffix Tree: definition A suffix tree ST for an m-character string S is a rooted directed tree with exactly m leaves numbered 1 to m. Each internal node, other than the root, has at least two children and each edge is labeled with a nonempty substring of S.

Suffix tree: definition No two edges out of a node can have edge-labels beginning with the same character. The key feature of the suffix tree is that for any leaf i, the concatenation of the edgelabels on the path from the root to the leaf i exactly spells out the suffix of S that starts at position i.

Suffix Trees: Example Suffixes of papua papua apua pua ua a

Suffix Trees: Example Suffixes of papua papua apua pua ua a NOTE: Assume the string terminates with some character found nowhere else in the string. (eg. \0 )

Suffix Trees: Example Suffix tree for papua

Suffix Trees: Example Suffix tree for papua a apua papua ua pua

caracas

mississippi

Suffix Trees Exact matching in linear time Many others We know of no other single data structure that allows efficient solutions to such a wide range of complex string problems. - Dan Gusfield

Exact string matching problem Given a pattern P of length m, find all occurrences of P in text T O(n+m) algorithm Solution: Build a suffix tree ST for text T in O(m) time. Then, match the characters of P along the unique path in ST until either P is exhausted or no more matches are possible.

Exact string matching problem Find ssi in mississippi

Exact string matching problem Find ssi in mississippi

Exact string matching problem s si

Exact string matching problem s si Every leaf below this point in the tree marks the starting location of ssi in mississippi. (ie. ssissippi and ssippi ) ssippi

Exact string matching problem Find sissy in mississippi

Exact string matching problem

Exact string matching problem s i ss

Comparing to the other algorithms KMP and Boyer-Moore both achieve this worst case bound. O(m+n) when the text and pattern are presented together. Suffix trees are much faster when the text is fixed and known first while the patterns vary. O(m) for single time processing the text, then only O(n) for each new pattern. Based on suffix trees, is faster for searching a number of patterns at one time against a single text (exact set matching problem) Aho-Corasick algorithm: preprocessing P instead of T.

Building the Suffix Tree How do we build a suffix tree? while suffixes remain: add next shortest suffix to the tree

papua Building the Suffix Tree

Building the Suffix Tree papua papua

Building the Suffix Tree papua apua papua

Building the Suffix Tree papua p apua apua ua

Building the Suffix Tree papua p apua apua ua ua

Building the Suffix Tree papua a p pua apua ua ua

Building the Suffix Tree papua a p pua apua ua ua

Building the Suffix Tree How do we build a suffix tree? while suffices remain: add next shortest suffix to the tree Naïve method - O(m 2 ) (m = text size)

Building the Suffix Tree in O(m) time In the previous example, we assumed that the tree can be built in O(m) time. Weiner showed original O(m) algorithm (Knuth is claimed to have called it the algorithm of 1973 ) More space efficient algorithm by McCreight in 1976 Simpler on-line algorithm by Ukkonen in 1995

Ukkonen s Algorithm Build suffix tree T for string S[1..m] Build the tree in m phases, one for each character. At the end of phase i, we will have tree T i, which is the tree representing the prefix S[1..i] online construction In each phase i, we have i extensions, one for each character in the current prefix. At the end of extension j, we will have ensured that S[j..i] is in the tree T i.

Implicit suffix tree papua a p pua apua ua ua

Implicit suffix tree papua a p pua apua ua ua

Implicit suffix tree papua p apua apua ua ua Implicit suffix tree can be transformed from/into regular suffix tree in O(n) time.

Ukkonen s Algorithm

An Exmaple

Ukkonen s Algorithm This is an O(m 3 ) time, O(m 2 ) space algorithm. We need a few implementation speed-ups to achieve the O(m) time and O(m) space bounds.

Suffix extension rules 3 possible ways to extend S[j..i] with character i+1. 1. S[j..i] ends at a leaf. Add the character i+1 to the end of the leaf edge. 2. No path from the end of S[j..i] starts with the i+1 character. Split the edge and create a new node if necessary, then add a new leaf with character i+1. (This is the only extension that increases the number of leaves! The new leaf represents the suffix starting at position j.) 1. There is already a path from the end of S[j..i] starts with the i+1 character, or S[j..i+1] correspond to a path. Do nothing.

Ukkonen s Algorithmsbbb

Ukkonen s Algorithm: Speed-up 1 Suffix Links speed up navigation to the next extension point in the tree ississi ppi ssissi ppi

Ukkonen s Algorithm: Speed-up 2 Skip/Count Trick instead of stepping through each character, we know that we can just jump, as long as we re the right distance ississi ppi ssissi ppi

Ukkonen s Algorithm: Speed-up 2 Skip/Count Trick instead of stepping through each character, we know that we can just jump, as long as we re the right distance ississi 3 ppi ssissi 3 ppi

Ukkonen s Algorithm: Speed-up 3 Edge-Label Compression since we have a copy of the string, we don t need to store copies of the substrings for each edge iss issi ppi

Ukkonen s Algorithm: Speed-up 3 Edge-Label Compression since we have a copy of the string, we don t need to store copies of the substrings for each edge O(m 2 ) space becomes O(m) space *=s[1] len=3 mississippi *=s[4] len=4 *=s[7] len=3

Ukkonen s Algorithm: Speed-up 4 A match is a show stopper. If we find a match to our next character (rule 3 applies), we re done this phase.

Ukkonen s Algorithm: Speed-up 5 Once a leaf, always a leaf (implicitly implement rule 1). We don t need to update each leaf, since it will always be the end of the current string. We can get these updates for free. Either 1)maintain a global end-of-string index or 2) insert the whole string for every leaf

Ukkonen s Algorithm: Speed-ups Because of speed-ups 4 and 5, we can pick up the next phase right where we ended the last one!

Ukkonen s Algorithm mississippi with Speed-ups void SuffixTree::update(char* s, int len) { int i; int j; for (i = 0, j = 0; i < len; i++) { while (j <= i) { all the work } } }

Ukkonen s Algorithm The Punch Line By combining all of the speed-ups, we can now construct a suffix tree T m representing the string S[1..m] in O(m) time and in O(m) space!

Exact string matching Both P ( P =n) and T ( T =m) are known: Suffix tree method achieves same worst-case bound O(n+m) as KMP. T is fixed and build suffix tree, then P is input, k is the number of occurrences of P Using suffix tree: O(n+k) In contrast (KMP, preprocess P): O(n+m) for any single P P is fixed, then T is input Selecting KMP rather than suffix tree or Aho-Corasick algorithm (exact set matching problem)

Ukkonen s Algorithm Time Performance Time Performance of Suffix Tree Constuction On Swiss-Prot Protein Sequences (2GHz G5, 1.5G RAM) 140 120 Time User Time 100 Seconds 80 60 40 20 0 0 5000 10000 15000 20000 25000 Sequences

Ukkonen s Algorithm Memory Usage 12000000 Node Creation in Suffix Tree Construction On Swiss-Prot Protein Sequences (2GHz G5, 1.5G RAM) 10000000 8000000 Nodes 6000000 4000000 2000000 0 0 5000 10000 15000 20000 25000 Sequences

Applications Problems linear-time longest common substring constant-time least common ancestor maximally repetitive structures all-pairs suffix-prefix matching compression inexact matching conversion to suffix arrays

Bioinformatics applications Applications Sequence comparison motif discovery PST probabilistic suffix trees SVM string kernels chromosome-level similarities and rearrangements