Lecture 5: Suffix Trees

Longest Common Substring Problem

Given a text T = GGAGCTTAGAACT and a string P = ATTCGCTTAGCCTA, how do we find the longest common substring between them? Here the longest common substring is GCTTAG. However, none of the linear-time exact matching algorithms that we have learnt so far allows us to find it in linear time. We could try chopping characters off the beginning of P and matching the remaining pattern repeatedly, but that would be inefficient. Suffix trees allow us to do such searches efficiently. However, a naively constructed suffix tree would not by itself give us an efficient algorithm.

What is a suffix tree?

A suffix tree, as the name suggests, is a tree in which every suffix of a string S is represented. More formally, a suffix tree can be seen as an automaton that accepts every suffix of a string.

Definition of suffix trees, as given in section 5.2 of Gusfield: A suffix tree T for an m-character string S is a rooted directed tree with exactly m leaves numbered 1 to m. Each internal node, other than the root, has at least two children, and each edge is labeled with a non-empty substring of S. No two edges out of a node can have edge-labels beginning with the same character. The key feature of the suffix tree is that for any leaf i, the concatenation of the edge-labels on the path from the root to leaf i exactly spells out the suffix of S that starts at position i. That is, it spells out S[i..m].

It is evident from the above definition that the strings are stored as labels on the edges, and that we get a new node in the tree only when there is a split, that is, when the string spelled out on the path leading to the node is a prefix of more than one suffix of S. Thus a suffix tree differs from a suffix trie, which has one node per character and takes O(m^2) space.

Example of a suffix tree for the string ABC: the root has three outgoing edges, labeled ABC, BC and C, which lead to the leaves 1, 2 and 3 respectively. The leaves 1, 2 and 3 represent the ends of the suffixes starting at positions 1, 2 and 3.
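To make the contrast with a suffix trie concrete, here is a minimal Python sketch (Python and all function names are our own choices for illustration, not part of the lecture). It builds the uncompressed suffix trie, which has one node per character and so takes O(m^2) space, and uses it to find the longest common substring of the T and P from the example. This is the naive approach, not the linear-time suffix tree method.

```python
def build_suffix_trie(text):
    """Uncompressed suffix trie as nested dicts: one node per character."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root


def count_nodes(node):
    """Number of nodes below `node` (the root itself is not counted)."""
    return sum(1 + count_nodes(child) for child in node.values())


def longest_common_substring(text, pattern):
    """For every start position in `pattern`, walk down the trie of `text`
    as far as possible. Quadratic in the worst case, unlike a suffix tree."""
    trie = build_suffix_trie(text)
    best = ""
    for start in range(len(pattern)):
        node, length = trie, 0
        while start + length < len(pattern) and pattern[start + length] in node:
            node = node[pattern[start + length]]
            length += 1
        if length > len(best):
            best = pattern[start:start + length]
    return best


print(longest_common_substring("GGAGCTTAGAACT", "ATTCGCTTAGCCTA"))  # GCTTAG
print(count_nodes(build_suffix_trie("ABC")))  # 6 trie nodes, versus 3 leaves in the suffix tree
```

Every substring of the text is a path from the root of the trie, which is why one downward walk per start position of P suffices.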

Suffix links

One important feature of suffix trees is the suffix link. If there is a node v in the tree with path label xα, where x is a character and α is a non-empty string, then the suffix link of v points to s(v), which is the node with path label α. If α is empty, then the suffix link of v, i.e., s(v), is the root. Suffix links exist for every internal (non-leaf) node of a suffix tree other than the root. This can be easily proved. (Refer to Lemma 6.1.1, Corollary 6.1.1 and Corollary 6.1.2 of Gusfield.)

Suffix links are similar to the failure functions of the Aho-Corasick algorithm. By following these links, one can jump from one suffix to another, each suffix starting exactly one character after the first character of its preceding suffix. Thus, using suffix links, it is possible to get a linear time algorithm for the previously stated problem: finding the longest common substring. This is because suffix links keep track of how much of the string has been matched so far. When we see a mismatch, we can jump along the suffix link to see if there is any match later in the text and the string. This jumping trick also helps us to save computation time while constructing suffix trees. Thus suffix links are very important constructs of suffix trees.

Example of a suffix tree with suffix links: the suffix tree for the string CACAO. The root has edges labeled CA, A and O. The edge O leads to leaf 5. The node with path label CA has edges CAO (leaf 1) and O (leaf 3), and the node with path label A has edges CAO (leaf 2) and O (leaf 4). The suffix links (green arrows in the original figure) go from the node CA to the node A, and from the node A to the root.

Construction of Suffix trees: Ukkonen's Algorithm

We start by creating the suffix trees for all prefixes of the string S. We first build the implicit suffix trees I_i for all the prefixes S[1..i], increasing the length of the prefix by one character per stage, till our prefix is equal to the string S. At this stage, we convert the final implicit suffix tree to a true suffix tree. A true suffix tree has exactly m leaves, where m is the length of the string, each leaf representing one of the suffixes.
An implicit suffix tree, on the other hand, may not have a leaf for each suffix; however, it encodes all the suffixes of S[1..i]. Every suffix of S[1..i] is found on some path from the root of such an implicit suffix tree.

At a high level, the method of construction is as follows: if we have the implicit suffix tree I_i for S[1..i] in stage i, then in stage i+1 we simply add the character S[i+1] to every suffix in I_i to get I_{i+1}. Continuing in this manner for m stages, we get the implicit suffix tree for the entire string, which we then convert to the actual suffix tree. Construction of suffix trees by Ukkonen's Algorithm takes O(m) time.

Construction of the first stage I_1 is very easy. It is just a single edge from the root labeled by the first character of the string, and this of course ends in a leaf. However, what we have described so far for the rest of the construction is not linear time, but rather quite an expensive process. The naive running time of the described algorithm is O(m^3). There are m stages of construction of implicit suffix trees, and in stage i+1 there are i suffixes S[j..i] to extend. For each of these suffixes, we have to travel the entire length of the suffix down from the root to reach its end, where we append S[i+1]. Also, we need to add a new edge for S[i+1] from the root. The length of the path for the suffix S[j..i] is i-j+1. Summing over all j and all stages, the time required is O(m^3).

When we follow each suffix branch to add the new character S[i+1] in stage i+1, the following three cases are possible:

i) The current search path ends in a leaf. In this case, we simply add the extra character to the edge leading into the leaf node.

ii) The current search path ends in the middle of an edge and the following character is not equal to S[i+1]. In this case, create a new internal node at that point and add an edge leading out from the internal node into a new leaf node, labeled by S[i+1]. (If the path ends exactly at an existing internal node that has no outgoing edge starting with S[i+1], the new leaf edge is attached to that node and no new internal node is needed.)

iii) The current search path ends in the middle of an edge, and the following character is equal to S[i+1]. In this case, do nothing, since the suffix is already present in the implicit tree.

We had previously stated that suffix links exist for every internal node of a tree. So, what happens when case (ii) occurs and we create a new internal node while extending the suffix starting at position j of the string, at stage i+1? The node that its suffix link should point to either already exists, or is created in the next step, when we look at the suffix starting at position j+1, and the link is added in that step.
(Ref: Corollary 6.1.1 in Gusfield.) Also, as stated by Dr. Pop in the suffix trees summary he provided, the suffix links are simply updated by keeping track of the last node created by the algorithm and creating a link from it to the next node being created.

Example of construction of suffix trees for the string BANANA$

[Figure: the implicit suffix trees for successive prefixes of BANANA$, one per stage.]
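The naive stage-by-stage construction and its three cases can be sketched in Python as follows. This is only an illustration under our own assumptions: the class and function names are hypothetical, positions are 0-based (the notes use 1-based positions), edge labels are stored as index pairs, and every extension walks down from the root, so none of the speed-up tricks are used.

```python
class Node:
    def __init__(self, leaf_id=None):
        self.children = {}      # first character of an edge label -> Edge
        self.leaf_id = leaf_id  # 0-based start of the suffix; None for internal nodes


class Edge:
    def __init__(self, start, end, child):
        self.start, self.end, self.child = start, end, child  # label is s[start:end]


def extend(root, s, j, i):
    """Given that s[j:i] is in the tree, make sure s[j:i+1] is too.
    Returns the number of the case (1, 2 or 3) that applied."""
    node, pos, edge_in = root, j, None
    while pos < i:                          # walk s[j:i] down from the root
        edge = node.children[s[pos]]
        length = edge.end - edge.start
        if pos + length > i:                # the path ends inside this edge
            k = edge.start + (i - pos)      # index of the character following the path
            if s[k] == s[i]:
                return 3                    # case (iii): already present, do nothing
            mid = Node()                    # case (ii): split the edge at this point
            mid.children[s[k]] = Edge(k, edge.end, edge.child)
            mid.children[s[i]] = Edge(i, i + 1, Node(leaf_id=j))
            edge.end, edge.child = k, mid
            return 2
        node, pos, edge_in = edge.child, pos + length, edge
    if node.leaf_id is not None:            # case (i): the path ends at a leaf
        edge_in.end = i + 1
        return 1
    if s[i] in node.children:               # the path ends at a node that continues with s[i]
        return 3
    node.children[s[i]] = Edge(i, i + 1, Node(leaf_id=j))  # case (ii), no split needed
    return 2


def build_naive(s):
    root = Node()
    for i in range(len(s)):                 # stage i adds the character s[i]
        for j in range(i + 1):              # extension j handles the suffix starting at j
            extend(root, s, j, i)
    return root


def leaf_paths(node, s, prefix=""):
    """Map each leaf number to the string spelled out on its root-to-leaf path."""
    if node.leaf_id is not None:
        return {node.leaf_id: prefix}
    paths = {}
    for edge in node.children.values():
        paths.update(leaf_paths(edge.child, s, prefix + s[edge.start:edge.end]))
    return paths
```

For a string ending in a unique character such as BANANA$, `leaf_paths(build_naive(s), s)` maps every position j to the suffix `s[j:]`. For a string without a terminator, such as ABAB, only the leaves 0 and 1 exist, which is what makes the tree implicit.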

[Figure: the final suffix tree for BANANA$, with leaves numbered 1 to 7.]

There are a number of tricks that can be used to bring the running time down to linear.

Trick 1: Suffix Links

Jump to successive suffixes via suffix links, as already described. This is one of the prime tricks that help us achieve the linear time bound which Knuth had predicted to be impossible. The basic idea is to reduce the amount of work done in traversing each path from the root to locate the end of the next suffix. For the first suffix S[1..i] in every stage, we can keep a pointer to its leaf, so that we can quickly jump to the leaf and add the new character. For every other suffix S[j..i], we start at the end of the previous suffix S[j-1..i], walk up at most one node to a node v with a suffix link (or to the root), follow the suffix link to s(v), and continue the walk down from s(v). This saves the work that would otherwise be spent in moving from the root down to s(v). This method works because the label on the path leading to v (the node from which we jumped) is xα, where x is a character and α is a string (possibly empty), and the suffix link leads to the node s(v) whose path label is α. Hence any suffix starting with α must end in the subtree below s(v).

Trick 2: Skip / Count

This is known as the skip/count trick. It helps us to find the place for adding the new character S[i+1] in stage i+1 very quickly. Just removing the work of walking from the root down to s(v) is not enough to reduce the runtime. By using the skip/count trick, we reduce the work done in walking down from s(v) to the end of the suffix to be proportional to the number of nodes on the path, instead of the number of characters on the path. Let us assume a few simple implementation details: i) we know the number of characters on each edge, and ii) we can find the character at any given position of S in constant time.

Now, let γ be the string that remains to be walked down from s(v), and let its length be g. Since there cannot be more than one edge out of s(v) that starts with the same character, the first character of γ appears on only one edge out of s(v). If the number of characters on this edge is g', and g' < g, then we skip to the node at the end of the edge, set g = g - g', and the next position of γ to be matched is h = g' + 1. Now we again select the edge out of the current node which starts with character h of γ. If the label on this edge has g' < g characters, then again we set g = g - g' and h = h + g'. Continuing in this manner, if on some edge g ≤ g', then the suffix ends exactly g characters down that edge. Thus we have reduced the work to be proportional to the number of nodes on the path (instead of the length of the suffix on that path). This saves work because we can now reach the end of each edge in constant time, as we know the number of characters on each edge by our implementation assumption.

Why does this help us to reduce the work? Let the node depth of a node v be the number of nodes on the path from the root to v (Ref: Gusfield). The node depth of any node v is at most one greater than the node depth of the node s(v) pointed to by the suffix link from v (Ref: Lemma 6.1.2, Gusfield). When v has a suffix link to s(v), and the label on the path from the root to v is xα, where α is a nonempty string, then the label on the path from the root to s(v) is α (by the definition of suffix links). The same holds for every ancestor of v whose path label xβ has a nonempty β: its suffix link leads to the node with path label β, which is an ancestor of s(v) because β is a prefix of α. Therefore the ancestors of v have suffix links to ancestors of s(v). Since the path labels of any two ancestors of v are different, each such ancestor of v links to a distinct ancestor of s(v). The only extra ancestor that v can possibly have is the ancestor labeled by the single character x. Hence the node depth of v can be at most one more than the node depth of s(v).

The rest of the argument is similar to the potential function argument for KMP and the Aho-Corasick algorithm. We know the depth of a node can be at most m (the length of the string). Now, in order to find a suffix link, we might have to walk up at most one node. Also, we might lose one more in node depth when we follow the suffix link to the new node. Since there are at most m suffixes in each stage, we can decrement the depth at most 2m times. Since the maximum depth is m, the total number of increments to the node depth in a stage is bounded by 3m, which is O(m).
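The skip/count walk can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions: nodes are plain dicts mapping the first character of an edge to a tuple `(start, end, child)`, the edge label is `s[start:end]` with 0-based positions, and the function name `skip_count` is hypothetical. The small tree below is the BANANA$ suffix tree written out by hand.

```python
def skip_count(node, s, lo, hi):
    """Walk down from `node` along s[lo:hi], which is known to be in the tree.
    Only the first character of each edge is inspected.
    Returns (node, first_char_of_edge, chars_down_that_edge, nodes_skipped)."""
    g, h, hops = hi - lo, lo, 0          # g: characters left, h: next position to match
    while g > 0:
        start, end, child = node[s[h]]
        g_edge = end - start             # number of characters on this edge (g')
        if g < g_edge:
            return node, s[h], g, hops   # the path ends g characters down this edge
        node, g, h, hops = child, g - g_edge, h + g_edge, hops + 1
    return node, None, 0, hops           # the path ends exactly at a node


s = "BANANA$"
node_ana = {"N": (4, 7, {}), "$": (6, 7, {})}           # path label ANA
node_a = {"N": (2, 4, node_ana), "$": (6, 7, {})}       # path label A
node_na = {"N": (4, 7, {}), "$": (6, 7, {})}            # path label NA
root = {"B": (0, 7, {}), "A": (1, 2, node_a), "N": (2, 4, node_na), "$": (6, 7, {})}

node, first, offset, hops = skip_count(root, s, 1, 6)   # walk down ANANA
print(first, offset, hops)  # N 2 2
```

Walking the five characters of ANANA costs three edge lookups and two node skips; the walk stops two characters down the edge NA$ below the node ANA. When the remaining length equals the edge length, this sketch steps to the child node and stops there, which corresponds to the g ≤ g' case ending exactly at the end of the edge.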
The bound obtained from the skip/count trick only ensures that each of the stages runs in O(m) time, so the entire algorithm is still O(m^2). The following two tricks help us to reduce it to O(m).

Before we can describe the next trick, one point needs to be mentioned. The edge labels are represented as pairs of integers (p, q), standing for the substring S[p..q]. Since we can retrieve the character at each position of the string in constant time, we continue to describe the algorithm as if the edges were labeled with the entire strings. This representation also reduces the storage space from O(m^2) to O(m) (under the standard RAM model, where each integer can be represented in O(log m) bits).

Trick 3: Once a leaf, always a leaf

If at some point in the algorithm a leaf is created and labeled j (for the suffix starting at position j), then it will remain a leaf with label j in all the successive trees created by the algorithm. We never turn a leaf into an internal node. In stage i+1, when we extend the suffixes of stage i, we simply add the character S[i+1] to the leaves. So once a leaf is created, we extend it only by case (i) described previously, by simply appending the character to the end of its edge. Also, for every leaf node with the edge label (p, q) leading to it, we have q = i in stage i, since a leaf can only represent the end of a suffix S[j..i].

This suggests that we can reduce the work further. The integer end label of each leaf edge would anyway be updated to m (the length of the string) in the final stage of construction of the actual suffix tree. Hence there is no need to explicitly change the integer label of a leaf edge in every stage. This means that once a leaf is created, we can avoid going back to update it in every subsequent stage. In the intermediate stages we can just label every edge leading to a leaf as (p, e), where e is a symbol meaning "the current end of the string".

Example implicit suffix tree for ABC: the root has three edges, (1, e) spelling ABC and leading to leaf 1, (2, e) spelling BC and leading to leaf 2, and (3, e) spelling C and leading to leaf 3.

One more trick is necessary to make Trick 3 more effective and to actually reduce the run time to linear.

Trick 4

If at any stage i+1 the character S[i+1] is found in the tree (case (iii) described earlier) while extending the suffix starting at position j, then it will also be found for all the subsequent suffixes of that stage. This is because if S[j..i+1] is already in the tree, then it occurs as a substring of S[1..i]; the strings S[j+1..i+1], S[j+2..i+1] etc. are suffixes of S[j..i+1], so they also occur in S[1..i] and must also be present in the tree. Hence we can just stop the current stage of the algorithm at that point (since it is an implicit suffix tree, we do not explicitly label the number of each suffix).

We simply remember the starting position of the suffix where we stopped in a stage of the algorithm. Let this suffix number or position be j*. By Trick 3, we know that only the character extensions due to case (ii) need to be done explicitly. Also, any extension by case (ii) creates a new leaf. So if the extensions due to cases (i) and (ii) end at j*-1 in stage i, i.e., j* is the first suffix for which we see case (iii) (the following character is already present in the tree), then in the next stage we can start again from j* and ignore the values j < j*. The suffixes starting at positions 1 to j*-1 are extended implicitly. (Note: the first time we see an extension by case (iii) in a stage, we cannot see any more extensions by case (i) or (ii) after that in that stage.)
Also, when we start from j* in the next stage, we already know where the suffix starting at j* ends (because we traversed it in the previous stage). Hence we avoid any work due to up-walking, following suffix links etc. for j*.

Putting all the tricks together

The implicit extensions in every stage can be thought of as taking constant time, hence the time taken by them over all the stages is O(m). For the explicit extensions, if the algorithm stops at j* in the current stage, it starts again from j* in the next stage. Hence this index cannot decrease between successive stages; it either remains the same or increases. The index can increase to at most m. Also, there are m stages. Therefore the algorithm makes at most 2m explicit extensions.

We also know that each explicit extension takes time proportional to the number of node skips made in that extension by the skip/count trick. Now, let us consider how many node skips can take place during the course of the algorithm. As we go from one stage to the next, by Trick 4, our node depth does not change. Every time we make an explicit extension, our node depth can decrease by at most 2, by up-walking and following the suffix link. When we down-walk by the skip/count trick, our node depth increases by one at each node skip. The node depth can never exceed m (the length of the string). Also, we have shown that there are at most 2m explicit extensions, so the total decrease in node depth is at most 4m, and the total increase is therefore at most 4m + m = 5m. Hence the total amount of work in all down-walking is bounded by O(m). (Ref: Theorem 6.1.2 in Gusfield.)

When we finish constructing the implicit suffix trees for all the stages 1..m, we construct the true suffix tree from the last one. First, we append $ to our string, and build the suffix tree for S$ from our implicit tree for S. This ensures that every suffix is now explicitly represented by a leaf in the tree. Then we change the end labels on all the leaf edges from e to the final length of the string. Thus the entire suffix tree is constructed in O(m) time.
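For readers who want to see the tricks combined, here is a compact Python sketch of a linear-time construction. It is an illustration under stated assumptions, not the exact procedure of these notes: it uses the widely used "active point" bookkeeping (the variables `active_node`, `active_edge`, `active_len` and `remainder` are our own names) in place of explicit stage/extension loops, positions are 0-based, an `end` of `None` plays the role of the symbol e on leaf edges, and the input is assumed to end with a unique terminator such as $.

```python
class Node:
    def __init__(self, start, end=None):
        self.start = start    # the edge into this node is labeled s[start:end]
        self.end = end        # None means "up to the current end" (open leaf edge)
        self.children = {}    # first character -> child Node
        self.link = None      # suffix link, set for internal nodes


def build_suffix_tree(s):
    root = Node(0, 0)
    active_node, active_edge, active_len = root, 0, 0
    remainder = 0             # suffixes still waiting for an explicit extension
    pending = None            # last node that may still need a suffix link

    def add_link(node):
        nonlocal pending
        if pending is not None:
            pending.link = node
        pending = node

    for pos, c in enumerate(s):           # one stage per character
        pending = None
        remainder += 1
        while remainder > 0:
            if active_len == 0:
                active_edge = pos
            ch = s[active_edge]
            nxt = active_node.children.get(ch)
            if nxt is None:               # case (ii) at a node: hang a new leaf here
                active_node.children[ch] = Node(pos)
                add_link(active_node)
            else:
                end = pos + 1 if nxt.end is None else nxt.end
                edge_len = end - nxt.start
                if active_len >= edge_len:        # skip/count: hop over the whole edge
                    active_edge += edge_len
                    active_len -= edge_len
                    active_node = nxt
                    continue
                if s[nxt.start + active_len] == c:  # case (iii): stop this stage
                    active_len += 1
                    add_link(active_node)
                    break
                split = Node(nxt.start, nxt.start + active_len)  # case (ii): split edge
                active_node.children[ch] = split
                split.children[c] = Node(pos)
                nxt.start += active_len
                split.children[s[nxt.start]] = nxt
                add_link(split)
            remainder -= 1
            if active_node is root and active_len > 0:
                active_len -= 1
                active_edge = pos - remainder + 1
            elif active_node is not root:         # jump along the suffix link
                active_node = active_node.link or root
    return root


def leaf_labels(node, s, prefix=""):
    """Strings spelled out on all root-to-leaf paths below `node`."""
    if not node.children:
        return [prefix]
    labels = []
    for child in node.children.values():
        labels += leaf_labels(child, s, prefix + s[child.start:child.end])
    return labels
```

Case (i) never appears in the code: leaf edges keep `end = None`, so they grow automatically as the string grows, which is Trick 3. Breaking out of the loop on case (iii) is Trick 4. For `s = "BANANA$"`, `sorted(leaf_labels(build_suffix_tree(s), s))` equals the sorted list of all seven suffixes of `s`.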