Exact Matching Part III: Ukkonen s Algorithm. See Gusfield, Chapter 5 Visualizations from

Similar documents
Exact String Matching Part II. Suffix Trees See Gusfield, Chapter 5

Lecture 6: Suffix Trees and Their Construction

Special course in Computer Science: Advanced Text Algorithms

Lecture 5: Suffix Trees

Data structures for string pattern matching: Suffix trees

Special course in Computer Science: Advanced Text Algorithms

An introduction to suffix trees and indexing

Suffix Trees and its Construction

Computing the Longest Common Substring with One Mismatch 1

Solution to Problem 1 of HW 2. Finding the L1 and L2 edges of the graph used in the UD problem, using a suffix array instead of a suffix tree.

Suffix Tree and Array

Suffix trees and applications. String Algorithms

Scribe: Virginia Williams, Sam Kim (2016), Mary Wootters (2017) Date: May 22, 2017

Suffix Trees and Arrays

1 Computing alignments in only linear space

4. Suffix Trees and Arrays

CS525 Winter 2012 \ Class Assignment #2 Preparation

Data Structures and Algorithms Dr. Naveen Garg Department of Computer Science and Engineering Indian Institute of Technology, Delhi.

Algorithms Theory. 15 Text Search (2)

Glynda, the good witch of the North

DVA337 HT17 - LECTURE 4. Languages and regular expressions

17 dicembre Luca Bortolussi SUFFIX TREES. From exact to approximate string matching.

3. According to universal addressing, what is the address of vertex d? 4. According to universal addressing, what is the address of vertex f?

Complexity Theory. Compiled By : Hari Prasad Pokhrel Page 1 of 20. ioenotes.edu.np

4. Suffix Trees and Arrays

Non-context-Free Languages. CS215, Lecture 5 c

Discrete Mathematics for CS Spring 2008 David Wagner Note 13. An Introduction to Graphs

Binary Trees, Binary Search Trees

Inexact Matching, Alignment. See Gusfield, Chapter 9 Dasgupta et al. Chapter 6 (Dynamic Programming)

Greedy Algorithms CHAPTER 16

Lecture 20 : Trees DRAFT

Indexing Variable Length Substrings for Exact and Approximate Matching

CMPSCI 250: Introduction to Computation. Lecture #22: Graphs, Paths, and Trees David Mix Barrington 12 March 2014

Data Structure Lecture#10: Binary Trees (Chapter 5) U Kang Seoul National University

Adjacent: Two distinct vertices u, v are adjacent if there is an edge with ends u, v. In this case we let uv denote such an edge.

Exercises: Disjoint Sets/ Union-find [updated Feb. 3]

Uses for Trees About Trees Binary Trees. Trees. Seth Long. January 31, 2010

1 Introduciton. 2 Tries /651: Algorithms CMU, Spring Lecturer: Danny Sleator

Section Summary. Introduction to Trees Rooted Trees Trees as Models Properties of Trees

Disjoint-set data structure: Union-Find. Lecture 20

Exact String Matching. The Knuth-Morris-Pratt Algorithm

Finite Automata Theory and Formal Languages TMV027/DIT321 LP4 2018

Discrete Mathematics and Probability Theory Fall 2009 Satish Rao,David Tse Note 8

We have used both of the last two claims in previous algorithms and therefore their proof is omitted.

Verifying a Border Array in Linear Time

An Efficient Algorithm for Identifying the Most Contributory Substring. Ben Stephenson Department of Computer Science University of Western Ontario

Figure 4.1: The evolution of a rooted tree.

CSI33 Data Structures

11/5/09 Comp 590/Comp Fall

We will show that the height of a RB tree on n vertices is approximately 2*log n. In class I presented a simple structural proof of this claim:

EE 368. Weeks 5 (Notes)

Advanced Combinatorial Optimization September 17, Lecture 3. Sketch some results regarding ear-decompositions and factor-critical graphs.

Ukkonen s suffix tree algorithm

Given a text file, or several text files, how do we search for a query string?

Priority Queues. T. M. Murali. January 29, 2009

MATH20902: Discrete Maths, Solutions to Problem Set 1. These solutions, as well as the corresponding problems, are available at

1. Suppose you are given a magic black box that somehow answers the following decision problem in polynomial time:

Distributed and Paged Suffix Trees for Large Genetic Databases p.1/18

Chapter Summary. Introduction to Trees Applications of Trees Tree Traversal Spanning Trees Minimum Spanning Trees

Solution for Homework set 3

Text Compression through Huffman Coding. Terminology

Figure 1: A directed graph.

INF121: Functional Algorithmic and Programming

CMPSCI 250: Introduction to Computation. Lecture #14: Induction and Recursion (Still More Induction) David Mix Barrington 14 March 2013

Introduction to Suffix Trees

4 Basics of Trees. Petr Hliněný, FI MU Brno 1 FI: MA010: Trees and Forests

Problem Set 5 Solutions

2. Lecture notes on non-bipartite matching

Binary Trees

String Matching. Pedro Ribeiro 2016/2017 DCC/FCUP. Pedro Ribeiro (DCC/FCUP) String Matching 2016/ / 42

58093 String Processing Algorithms. Lectures, Autumn 2013, period II

Suffix links are stored for compact trie nodes only, but we can define and compute them for any position represented by a pair (u, d):

Longest Common Substring

Lecture notes on: Maximum matching in bipartite and non-bipartite graphs (Draft)

PAPER Constructing the Suffix Tree of a Tree with a Large Alphabet

Algorithms and Data Structures: Minimum Spanning Trees (Kruskal) ADS: lecture 16 slide 1

Broadcast: Befo re 1

Computer Science 236 Fall Nov. 11, 2010

Properties of red-black trees

CE 221 Data Structures and Algorithms

Languages and Finite Automata

Bálint Márk Vásárhelyi Suffix Trees and Their Applications

Homework. Context Free Languages. Before We Start. Announcements. Plan for today. Languages. Any questions? Recall. 1st half. 2nd half.

Applications of Suffix Tree

Characterizations of graph classes by forbidden configurations

These notes present some properties of chordal graphs, a set of undirected graphs that are important for undirected graphical models.

st-orientations September 29, 2005

11/5/13 Comp 555 Fall

Range Minimum Queries Part Two

THE EULER TOUR TECHNIQUE: EVALUATION OF TREE FUNCTIONS

tree follows. Game Trees

Randomness and Computation March 25, Lecture 5

Discrete Mathematics and Probability Theory Fall 2013 Vazirani Note 7

Here is a recursive algorithm that solves this problem, given a pointer to the root of T : MaxWtSubtree [r]

March 20/2003 Jayakanth Srinivasan,

An Algorithm for Enumerating All Spanning Trees of a Directed Graph 1. S. Kapoor 2 and H. Ramesh 3

Suffix Arrays Slides by Carl Kingsford

Search Trees. Undirected graph Directed graph Tree Binary search tree

Trees. 3. (Minimally Connected) G is connected and deleting any of its edges gives rise to a disconnected graph.

1 Minimum Cut Problem

Transcription:

Exact Matching Part III: Ukkonen s Algorithm See Gusfield, Chapter 5 Visualizations from http://brenden.github.io/ukkonen-animation/

Goals for Today Understand how suffix links are used in Ukkonen's algorithm Reason about properties of the algorithm, and correctness

Suffix Tree and Implicit Suffix Tree for abxa$

Ukkonen s Algorithm Review Build an Implicit Suffix Tree for S[1..m] in O(m) Time Notation: Let I i be the implicit suffix tree for prefix S[1,...,i] Construct I 1 // Phase 1 For i from 1 to m-1 // Phase i+1: Build I i+1 from I i For j from 1 to i+1 // Extension j of Phase i+1 Ensure that a path labeled S[j..i+1] is in the tree

Ukkonen s Algorithm: Extension j of Phase i+1 Ensure that a path labeled S[j,i+1] is in the tree 1. Find the end of path S[j..i] 2. Apply the appropriate suffix extension rule

Ukkonen s Algorithm: Extension j of Phase i+1 2. Apply the appropriate suffix extension rule Rule 1: Path S[j..i] ends at a leaf. Add S[i+1] to the end of the label on that leaf edge Rule 2: At least one path continues from the end of S[j..i], but no such path starts with S[i+1] If S[j..i] ends inside an edge, create a node at that endpoint. Let w be the node at the end of the path S[j..i] Create a new leaf node numbered j Create an edge labeled S[i+1] from w to leaf j Rule 3: Some path from the end of S[j..i] starts with character S[i+1]. Do nothing

Ukkonen s Algorithm Examples Exercise: Follow the steps of Ukkonen s algorithm to construct suffix trees for the following strings aab abababababab aababab

Ukkonen s Algorithm Examples Exercise: Follow the steps of Ukkonen s algorithm to construct suffix trees for the following strings 1 aab abababababab aababab 2 Implicit tree I 7 for aababab. Numbers inside nodes indicate order of creation. Leaf numbers are in boxes 3

Ukkonen s Algorithm: Correctness Claim: The tree constructed at the end of phase i+1 is the implicit tree I i+1. That is: Each internal node, other than the root, has at least two children Each edge is labeled with a nonempty substring of S[1..i+1] No two edges out of a node have edge labels beginning with the same character Every suffix of S[1..i+1] labels a path of the tree The concatenation of edge labels on the path from the root to leaf "j" is S[j..i+1]

Ukkonen s Algorithm: Suffix Links Since the naïve implementation takes Θ(m 3 ) time, we augment the algorithm to create and use suffix links A suffix link is a pointer from v to v, where v is an internal node with label xα v' is an internal node with label α where x is a character and α is a string

Ukkonen s Algorithm: Suffix Links Since the naïve implementation takes Θ(m 3 ) time, we augment the algorithm to create and use suffix links A suffix link is a pointer from v to v, where v is an internal node with label xα v' is an internal node with label α where x is a character and α is a string In addition to maintaining suffix links, the algorithm also maintains a link to leaf node 1

Ukkonen s Algorithm: Extension j of Phase i+1 Ensure that a path labeled S[j,i+1] is in the tree // Revised algorithm that creates and uses suffix links 1. Find the end of path S[j..i] (using the 1-link or suffix links, as well as the end of path S[j-1..i]) 2. Apply the appropriate suffix extension rule

Ukkonen s Algorithm: Extension j of Phase i+1 Ensure that a path labeled S[j,i+1] is in the tree // Revised algorithm that creates and uses suffix links 1. Find the end of path S[j..i] (using the link to leaf node 1, or suffix links, plus the end of path S[j-1..i]) 2. Apply the appropriate suffix extension rule

Ukkonen s Algorithm: Extension j of Phase i+1 Ensure that a path labeled S[j,i+1] is in the tree // Revised algorithm that creates and uses suffix links 1. Find the end of path S[j..i] (using the link to leaf node 1, or suffix links, plus the end of path S[j-1..i]) 2. Apply the appropriate suffix extension rule. If a new internal node (with label S[j..i]) is created in this step, keep a link to it

Ukkonen s Algorithm: Extension j of Phase i+1 Ensure that a path labeled S[j,i+1] is in the tree // Revised algorithm that creates and uses suffix links 1. Find the end of path S[j..i] (using the link to leaf node 1, or suffix links, plus the end of path S[j-1..i]) 2. Apply the appropriate suffix extension rule. If a new internal node (with label S[j..i]) is created in this step, keep a temporary link to it 3. If a new internal node w (with label S[j-1..i]) was added to the tree on extension j-1, add the suffix link from w (and delete the temporary link to w)

Ukkonen s Algorithm: Extension j of Phase i+1 1. Find the end of path S[j..i] If j=1: use the 1-link (path S[1..i] ends at leaf 1) If j > 1: // In extension j-1, we located the end of path S[j-1..i] Find the first node v at or above the end of path S[j-1..i]. Let γ denote the string from v to the end of S[j-1..i] If v is the root, follow the (unique) path γ If v is not the root, traverse the suffix link from v, say to v and follow the (unique) path γ from v

Ukkonen s Algorithm: Extension j of Phase i+1 2. Apply the appropriate suffix extension rule Rule 1: Path S[j..i] ends at a leaf. Add S[i+1] to the end of the label on that leaf edge Rule 2: At least one path continues from the end of S[j..i], but no such path starts with S[i+1] If S[j..i] ends inside an edge (u1,u2), create a node w at that endpoint (with edges (u1,w), (w,u2)) Create a new leaf node numbered j Create an edge labeled S[i+1] from w to leaf j Rule 3: Some path from the end of S[j..i] starts with character S[i+1]. Do nothing

Ukkonen s Algorithm: Extension j of Phase i+1 3. If a new internal node w was added to the tree on extension j-1, add the suffix link from w Let the label of w be xα (=S[j-1..i]) Let w be the node with label α (w is created in extension j of the phase if not already in the tree) Add the link (w,w )

Ukkonen s Algorithm Examples Let s see how suffix links are created and used, by doing some more examples ababb

Ukkonen s Algorithm Examples Let s see how suffix links are created and used, by doing some more examples abbba

Ukkonen s Algorithm Examples Let s see how suffix links are created and used, by doing some more examples abbba Note that the suffix link points to an ancestor (Here and in future illustrations, we do not include suffix links to the root)

Ukkonen s Algorithm Examples Let s see how suffix links are created and used, by doing some more examples 1 aabababb 2 Implicit tree I 7 for aababab. (In this and future trees, suffix links from internal nodes to the root are not shown.) Simulate Phase 8 of the algorithm to obtain the tree I 8 3

Ukkonen s Algorithm Examples aabababbababaa

Ukkonen s Algorithm Examples aabababbababaabababab (try this one yourself on the visualization website)

Correctness: Suffix Link Properties Later we ll use properties of suffix links to reason about the efficiency of Ukkonen s algorithm We ll describe and prove these properties now

Correctness: Suffix Link Properties Claim 1: At the end of phase i+1, every internal node of I i (except for the root) has a suffix link. We ll use the following claim to prove Claim 1. Claim 2: Let 1 j i m. If S[j..i]c is in the tree at the start of extension j of phase i+1, then so is S[j+1..i]c.

Correctness: Suffix Link Properties Claim 1: At the end of phase i+1, every internal node of I i (except for the root) has a suffix link. Proof: By induction on i. Base case i=0: In this case the statement is trivially true since I 1 has no suffix links.

Correctness: Suffix Link Properties Inductive step: i>0. Suppose a new internal node w with label S[j..i] is created at extension j of phase i+1, by extension Rule 2. Note that no new internal node is created in extension i+1, so j i. Let w break the edge (u1,u2). Let c be the first character of (w,u2); since Rule 2 is applied, c is not S[i+1]. By Claim 2, S[j+1..i]c is a path in the tree. Therefore, in extension j+1, if S[j+1..i] does not label an internal node or the root, Rule 2 will be applied again and another internal node w' will be added with label S[j+1..i]. And so the suffix link (w,w') can be added.

Correctness: Suffix Link Properties Claim 2: Let 1 j < i m. If S[j..i]c is in the tree at the start of extension j of phase i+1, then so is S[j+1..i]c. Proof: Case 1: S[j..i]c was added to the tree in phase i' for some i' < i+1. Then S[j+1..i]c was also added to the tree in phase i'. Case 2: S[j..i]c was added to the tree in phase i+1. Given its length, it must have been added in extension j-1. So it must be that S[j..i]c = S[j-1..i] In this case, S[j-1] = S[j] =... S[i] = c. And so S[j+1..i]c = S[j..i], which we know is already in the tree.

Correctness: Suffix Link Properties Illustration of the second case of Claim 2: S= abbba. Consider phase i+1 = 5. Let j = 3. Let c = b. S[j..i]c = S[3,4]c = bbb is in the tree, and equals S[j+1..i]c.

Correctness: Suffix Link Properties Claim 3: Let (w,w') be any suffix link traversed during Ukkonen s algorithm. At that moment, the node-depth of w is at most one greater than the node depth of w'. Proof: Consider when (w,w') is traversed. First, the suffix link from any ancestor u of w goes to an ancestor u' of w'; this follows since the label of u is a prefix of w, and also the label of u' is a prefix of w'. Second, links from distinct ancestors of w must go to distinct ancestors of w', because edge labels have length at least 1. Thus there is a 1-1 correspondence between the internal nodes on the path to w except for the root, and those on the path to w'. The claim follows.