An introduction to suffix trees and indexing

Tomáš Flouri, Solon P. Pissis
Heidelberg Institute for Theoretical Studies
December 3, 2012

Contents
1. Introduction
2. Basic Definitions: Graph theory; Alphabet and strings
3. Dictionaries: Trie; Patricia tree
4. Suffix tree: Suffix trie; Suffix tree; Ukkonen's algorithm
5. Example
6. Overview

Introduction
Two main problem areas in text retrieval:
1. String matching
2. Indexing and querying
Both have exact and approximate cases.

Introduction: Exact string matching
Many efficient algorithms exist:
- Knuth-Morris-Pratt
- Boyer-Moore, Boyer-Moore-Horspool, Turbo-Boyer-Moore, etc.
- Aho-Corasick
- ...

Introduction: Indexing - 1
Problem. Given a text T, we need to construct an efficient data structure D which will serve as an index of T, so that we can efficiently query T.
What do we expect from an efficient indexing data structure?

Introduction: Indexing - 2
Given a query pattern P, we want to find all occurrences of P in the preprocessed text T using the indexing data structure D.
The data structure D is efficient if
- it can be built in time linear in the size of T, i.e. O(|T|);
- it occupies space linear in the size of T, i.e. O(|T|);
- it can answer whether P occurs in T in time linear in the size of P, i.e. O(|P|);
- it can report all occurrences of P in T in time O(|P| + occ), where occ is the number of occurrences.

Introduction: Indexing - 2
Some efficient indexing data structures include
- suffix automata (DAWG) and variations such as CDAWG,
- suffix trees,
- position heaps,
- suffix arrays.
In this lecture we will concentrate only on suffix trees.

Graph theory: Graph, Cycle, Path
Graph. A graph is a pair G = (V, E) of sets such that E ⊆ V × V.
Path. A path of length n in a graph G = (V, E) is a sequence v_0, v_1, ..., v_n ∈ V such that (v_0, v_1), (v_1, v_2), ..., (v_{n-1}, v_n) ∈ E.
Cycle. A path v_0, v_1, ..., v_n, v_0, where n ≥ 2, is called a cycle.
[Figure: example graph on the vertices 1 to 6]

Graph theory: Rooted tree, subtree, tree height, node height
Tree. A rooted tree is a connected acyclic graph T = (V, E) with a special vertex v ∈ V called the root. Nodes with degree 1 are called leaves.

Alphabet and strings
Definition (Alphabet). An alphabet Σ is a finite non-empty set whose elements are called letters.
Definition (String). A string on an alphabet Σ is a finite, possibly empty, sequence of elements of Σ. The zero-letter sequence is called the empty string, and is denoted by ε. The set of all possible strings on the alphabet Σ is denoted by Σ*.
Definition (Length of string). The length of a string x is defined as the length of the sequence associated with the string x, and is denoted by |x|.

Alphabet and strings
We denote by x[i], for all 1 ≤ i ≤ |x|, the letter at index i of x. We also call index i, for all 1 ≤ i ≤ |x|, a position in x when x ≠ ε. It follows that the ith letter of x is the letter at position i in x, and that x = x[1..|x|].
Definition (Factor of string). A string x is a factor (substring) of a string y if there exist two strings u and v such that y = uxv. We denote the factor (substring) of x starting at position i and ending at position j by x[i..j].

Trie (the name comes from "retrieval")
Construct a dictionary for the set of words {amy, andy, ann, rob, roger, ben, betty}.
[Figure: trie for the seven words, with nodes named A to T and one letter per edge; in a second version, the end of each word is marked by an extra edge labeled $]
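As a rough illustration of the dictionary above, the following sketch builds the trie as nested Python dictionaries. The function names and the choice of nested dicts are assumptions made for this example; the slides do not prescribe a representation. The end marker $ plays the same role as in the second figure.

```python
def build_trie(words, end="$"):
    """Build a trie as nested dicts; `end` marks the end of each word."""
    root = {}
    for word in words:
        node = root
        for letter in word + end:
            # Reuse the child for this letter, or create it on first use.
            node = node.setdefault(letter, {})
    return root

def contains(trie, word, end="$"):
    """Follow the letters of `word` (plus the end marker) from the root."""
    node = trie
    for letter in word + end:
        if letter not in node:
            return False
        node = node[letter]
    return True

def count_nodes(node):
    """Number of nodes in the subtree rooted at `node`, including itself."""
    return 1 + sum(count_nodes(child) for child in node.values())

words = ["amy", "andy", "ann", "rob", "roger", "ben", "betty"]
trie = build_trie(words)
print(contains(trie, "andy"), contains(trie, "an"), count_nodes(trie))
```

A lookup costs one dictionary access per letter of the query, independent of how many words are stored. Without the end marker, the prefix "an" could not be told apart from a stored word.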

Patricia tree
1. Construct a trie.
2. Remove nodes with out-degree 1 and concatenate the labels of the corresponding edges into one edge.
[Figure: the trie for {amy, andy, ann, rob, roger, ben, betty} before and after compaction; the resulting Patricia tree has the edge labels a, my, n, dy, n, ro, b, ger, be, n, tty]
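The two steps can be sketched as follows, again with nested dictionaries as an assumed representation. The helper names are made up for this example, and the sketch assumes that no word is a prefix of another (true for the word set above), so no end marker is needed.

```python
def build_trie(words):
    """Step 1: a plain trie with one letter per edge."""
    root = {}
    for word in words:
        node = root
        for letter in word:
            node = node.setdefault(letter, {})
    return root

def compress(node):
    """Step 2: merge every chain of out-degree-1 nodes into a single edge."""
    result = {}
    for label, child in node.items():
        # Keep absorbing the next edge while the child has exactly one child.
        while len(child) == 1:
            (next_label, next_child), = child.items()
            label += next_label
            child = next_child
        result[label] = compress(child)
    return result

words = ["amy", "andy", "ann", "rob", "roger", "ben", "betty"]
patricia = compress(build_trie(words))
print(patricia)
```

The root is never merged away, and every remaining internal node has at least two children.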

Suffix trie
Given some text, e.g. t = banana, construct the suffix trie:
1. Generate the set Suff(t) of suffixes of t.
2. Construct a trie from Suff(t).
The resulting data structure is called a suffix trie.
Example. Given t = banana$, the set Suff(t) is
Suff(t) = {banana$, anana$, nana$, ana$, na$, a$}

Suffix trie - Example
Given the text t = banana$, construct the suffix trie.
[Figure: suffix trie of banana$ with one letter per edge; the leaves are numbered 1 to 6 by the starting position of their suffix]
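A small sketch of this construction, under the assumption that a node is a plain Python object with a child dictionary (the class and function names are chosen for this example only). Following the slide, the suffix consisting of $ alone is left out, so banana$ yields six leaves.

```python
class Node:
    def __init__(self):
        self.children = {}  # letter -> Node
        self.leaf = None    # 1-based start position of the suffix ending here

def suffix_trie(text):
    """Insert every suffix of `text` except the lone end marker."""
    root = Node()
    for start in range(len(text) - 1):
        node = root
        for letter in text[start:]:
            if letter not in node.children:
                node.children[letter] = Node()
            node = node.children[letter]
        node.leaf = start + 1
    return root

def count_nodes(node):
    return 1 + sum(count_nodes(child) for child in node.children.values())

def leaves(node):
    found = [] if node.leaf is None else [node.leaf]
    for child in node.children.values():
        found.extend(leaves(child))
    return found

trie = suffix_trie("banana$")
print(count_nodes(trie), sorted(leaves(trie)))
```

Inserting all suffixes letter by letter takes quadratic time, and the number of trie nodes can also grow quadratically with the text length, which is what motivates the compacted suffix tree.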

Suffix tree
Definition. A suffix tree is the Patricia tree of the suffix trie.
Construction:
1. Construct the suffix trie of text x.
2. Eliminate all nodes with out-degree 1 and concatenate the labels of the corresponding edges into one edge.

Suffix tree - Example
[Figure: the suffix trie of banana$ is compacted into the suffix tree; the resulting edge labels are a, na, $, na$ and banana$, and the leaves are numbered 1 to 6]
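The same tree can be obtained without materialising the trie. The sketch below is a shortcut chosen for this example, not the construction from the slides: it groups the suffixes by their first letter and uses the longest common prefix of each group as the edge label, which yields the compacted tree directly. It assumes the text ends with a unique marker, so that no suffix is a prefix of another. Leaves are stored as the 1-based start position of their suffix.

```python
import os

def suffix_tree(suffixes):
    """suffixes: dict mapping each (remaining) suffix to its start position."""
    groups = {}
    for suffix, position in suffixes.items():
        groups.setdefault(suffix[0], {})[suffix] = position
    tree = {}
    for group in groups.values():
        # The edge label is the longest common prefix of the whole group.
        label = os.path.commonprefix(list(group))
        if len(group) == 1:
            tree[label] = group[label]  # leaf: remember the start position
        else:
            rest = {s[len(label):]: p for s, p in group.items()}
            tree[label] = suffix_tree(rest)
    return tree

def count_nodes(tree):
    return 1 + sum(count_nodes(c) if isinstance(c, dict) else 1
                   for c in tree.values())

text = "banana$"
# As on the slide, the suffix consisting of $ alone is left out.
suffixes = {text[i:]: i + 1 for i in range(len(text) - 1)}
tree = suffix_tree(suffixes)
print(tree, count_nodes(tree))
```

For banana$ this gives 6 leaves and 4 internal nodes (including the root), compared with 22 nodes in the suffix trie of the same text.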

Suffix tree: Size of suffix tree
Theorem. A suffix tree of a string of length n consists of at most 2n - 1 nodes (or 2n if the empty suffix $ is taken into account).
Proof (by induction). The tree has n leaves and every internal node has at least two children, so the number of nodes is largest when the tree is binary. It therefore suffices to show that a binary tree with N leaves has exactly N - 1 internal nodes.
Base case: for 2 leaves we have 1 internal node.
Inductive step: assume that any binary tree with m < N leaves consists of exactly m - 1 internal nodes. A binary tree with N leaves is made up of:
- a root node,
- a left binary tree with k leaves,
- a right binary tree with N - k leaves.
According to the induction assumption, the left binary tree with k leaves consists of k - 1 internal nodes, and the right binary tree with N - k leaves consists of N - k - 1 internal nodes. Therefore, the total number of internal nodes in a binary tree with N leaves is (k - 1) + (N - k - 1) + 1 = N - 1, and thus the total number of nodes is 2N - 1.

Suffix tree: Suffix tree construction algorithms
Weiner's algorithm (1973)
- Introduced as the position tree
- Construction in linear time (for constant-size alphabets)
- Characterized as "algorithm of the year"
McCreight's algorithm (1976)
- Improved space requirements over Weiner's method
- Construction in linear time (for constant-size alphabets)
Ukkonen's algorithm (1995)
- Same time and space requirements as McCreight's
- Easier to understand
- On-line
Farach's algorithm (1997)
- Linear-time construction algorithm for any type of alphabet
- Hard to implement
- The basis for new algorithms, e.g. linear-time construction of position heaps and suffix arrays

Ukkonen's algorithm: Implicit suffix tree
Definition. An implicit suffix tree for string x is a tree obtained from the suffix tree of x$ by
1. removing $ from all edge labels,
2. removing any edge that has no label,
3. removing any node with only one child.
[Figure: the suffix tree of banana$ (left) is transformed step by step into the implicit suffix tree of banana (right), whose only edges are banana, anana and nana, leading to leaves 1, 2 and 3]

Ukkonen's algorithm: Implicit suffix tree
The implicit suffix tree of a string is what results from applying Ukkonen's algorithm to the string without an added end marker $. All suffixes are included, but not necessarily as labels of complete paths leading to leaves. If a unique character (in our case $) is appended to the end of the string, the implicit suffix tree of the extended string is essentially the same as the (true) suffix tree.

Ukkonen's algorithm: String paths of implicit suffix trees
Given a string y[1..n], the implicit suffix tree I_i contains each suffix y[1..i], y[2..i], ..., y[i] of the prefix y[1..i] as the label of some path (possibly ending in the middle of an edge).
That is, a string path is a string that can be matched along the edges, starting from the root, or equivalently a prefix of some node label.

Ukkonen's algorithm
1. Start with T = I_1.
2. Consecutively update T to I_2, I_3, ..., I_{n+1} in n phases, where I_i represents the implicit suffix tree of the prefix y[1..i].
Phase i + 1 updates T from I_i (with all suffixes of y[1..i]) to I_{i+1} (with all suffixes of y[1..i + 1]).
Each phase i + 1 consists of extensions j = 1, 2, ..., i + 1 (one for each suffix of y[1..i + 1]).
Extension j ensures that suffix y[j..i + 1] is in I_{i+1}.

Ukkonen's algorithm: Suffix extension rules
Rule 1. y[j..i] ends at a leaf: append y[i + 1] to the end of the edge label.
Rule 2. y[j..i] doesn't end at a leaf, and the following character is not y[i + 1]: connect the end of the path to a new leaf j by an edge labeled y[i + 1]. If the path ended in the middle of an edge, split that edge and insert a new node as the parent of leaf j.
Rule 3. The path y[j..i + 1] is already in the tree: no update.

Ukkonen's algorithm: Complexity
The algorithmic approach presented so far runs in O(n^3).
Proof. Consider a single phase i + 1.
- Each extension rule can be applied in O(1) time, so applying all i + 1 extensions takes Θ(i) time.
- Locating the ends of the string paths y[1..i], ..., y[i] by traversing the edge labels takes time Σ_{k=1}^{i} k = Θ(i^2).
Therefore, the total time for all phases i = 1, 2, ..., n is Σ_{i=1}^{n} i^2 = Θ(n^3).
This is even worse than the naive algorithm, which runs in O(n^2). We will see how this approach, with the use of some simple tricks, can achieve linear run-time.
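For reference, the two sums in the proof can be written out with their closed forms. The only assumption added here is that the path of length k is charged Θ(k) for its traversal.

```latex
\begin{align*}
\text{one phase } (i+1):\quad
  & \sum_{k=1}^{i} \Theta(k) = \Theta\!\left(\frac{i(i+1)}{2}\right) = \Theta(i^2), \\
\text{all phases}:\quad
  & \sum_{i=1}^{n} \Theta(i^2) = \Theta\!\left(\frac{n(n+1)(2n+1)}{6}\right) = \Theta(n^3).
\end{align*}
```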

Ukkonen's algorithm: Suffix links
The extensions of phase i + 1 need to locate the ends of all i + 1 suffixes of y[1..i], and apply Rules 1-3. How can this be done efficiently?
For each internal node v of I_i labeled xα, where x ∈ Σ and α ∈ Σ*, define s(v) to be the node labeled by α. (Do these nodes actually exist?)
Then a pointer from v to s(v) is called the suffix link of v.
Note: if node v is labeled by a single character, then α = ε and s(v) is the root node.

Ukkonen's algorithm: Example of suffix links
[Figure: suffix tree for x = xabxac with internal nodes xa and a, edge labels xa, a, bxac and c, leaves 1 to 6, and the suffix links drawn in]
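The question whether the target nodes exist can be checked by brute force on this example. The sketch below relies on a standard characterisation rather than on anything from the slides: when the last letter of the text occurs nowhere else, a substring labels an internal node (or the root) exactly when it is followed by at least two different letters in the text. The function name is made up for this example.

```python
from collections import defaultdict

def internal_node_labels(text):
    """Labels of all internal nodes (root included) of the suffix tree.

    Brute force over all substrings; assumes the last letter of `text`
    occurs nowhere else in it.
    """
    followers = defaultdict(set)
    n = len(text)
    for i in range(n):
        for j in range(i, n):
            # text[i:j] (possibly empty) is followed by the letter text[j].
            followers[text[i:j]].add(text[j])
    return {label for label, nxt in followers.items() if len(nxt) >= 2}

labels = internal_node_labels("xabxac")
# The suffix link of the node labeled x + alpha points to the node labeled alpha.
suffix_links = {label: label[1:] for label in labels if label}
print(sorted(labels), suffix_links)
assert all(target in labels for target in suffix_links.values())
```

For xabxac the internal nodes are the root, a and xa; the link of xa leads to a, and the link of a leads to the root (written here as the empty string).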

Ukkonen's algorithm: Why do we need suffix links?
Extension j (of phase i + 1) finds the end of the path y[j..i] in the tree (and extends it with character y[i + 1]).
Extension j + 1 similarly finds the end of the path y[j + 1..i].
Assume that v is an internal node whose string path y[j]α is (essentially) a prefix of y[j..i]. Then we can avoid traversing path α when locating the end of path y[j + 1..i], by starting from node s(v).
Do suffix links always exist?

Ukkonen's algorithm: Suffix links existence
Observation. If an internal node v is created during extension j (of phase i + 1), then extension j + 1 will determine the node s(v).
Let v be labeled xα.
- Node v can only be created by extension Rule 2. That is, v is inserted at the end of path y[j..i], which continued with some character c ≠ y[i + 1].
- Therefore, paths xαc and αc were entered before phase i + 1.
- In extension j + 1, node s(v) is either found or created at the end of path α = y[j + 1..i].

Ukkonen's algorithm: Speeding up path traversals
Consider the extensions of phase i + 1.
- Extension 1 extends path y[1..i] with character y[i + 1].
- Extension 1 is easy, as path y[1..i] always ends at leaf 1 and is thus extended by Rule 1.
- We can perform extension 1 in constant time if we maintain a pointer to the edge at the end of y[1..i].
What about subsequent extensions j + 1 (for j = 1, 2, ..., i)?

Ukkonen's algorithm: Locating subsequent paths
Extension j has located the end of path y[j..i]. Starting from there, walk up at most one node; let v be the node reached. It is either
1. the root, or
2. a node with a suffix link to s(v).
In case (1), traverse path y[j + 1..i] explicitly downwards from the root.

Ukkonen's algorithm: Locating subsequent paths
In case (2), let xα be the label of v, so that y[j..i] = xαβ for some β ∈ Σ*.
Then follow the suffix link of v, and continue by matching β downwards from node s(v) (whose string path is α).
Having found the end of path αβ = y[j + 1..i], apply the extension rules to ensure that it extends with y[i + 1].
Finally, if a new internal node w was created in extension j, set its suffix link to point to the end node of path y[j + 1..i].
[Figure: illustration of case (2) with β = abcd; node v (label xα) has a suffix link to s(v) (label α), and β is matched downwards from s(v)]

Ukkonen's algorithm: Speeding up explicit traversals
Skip/Count trick
- In phase i + 1, each path y[j..i] that is followed in extension j is known to exist in the tree.
- The path can be followed by choosing the correct edges, instead of examining every character.
- Let y[k] be the next character to be matched on path y[j..i].
- Now an edge labeled by y[p..q] can be traversed simply by checking that y[p] = y[k], and skipping the next q - p characters of y[j..i].
- The time to traverse a path is proportional to the number of nodes on the path (instead of its string length).
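A minimal sketch of the trick on the suffix tree of xabxac, built by hand for this example. Everything here is an assumption of the sketch: the class, the helper names, and the use of 0-based half-open index pairs (start, end) instead of the 1-based inclusive pairs y[p..q] of the slides.

```python
class Node:
    def __init__(self, start, end, name):
        self.start, self.end = start, end  # incoming edge is text[start:end]
        self.children = {}                 # first letter of edge -> Node
        self.name = name

def add(parent, text, start, end, name):
    child = Node(start, end, name)
    parent.children[text[start]] = child
    return child

def skip_count(node, text, lo, hi):
    """Walk down text[lo:hi], which is assumed to be a path below `node`.

    Only the first letter of each edge is looked at. Returns the last node
    reached, how many letters of the path lie on the edge below it, and
    the number of edges that were skipped in one step.
    """
    hops = 0
    while lo < hi:
        child = node.children[text[lo]]
        length = child.end - child.start
        if hi - lo < length:       # the path ends inside this edge
            return node, hi - lo, hops
        lo += length               # skip the whole edge in O(1)
        node = child
        hops += 1
    return node, 0, hops

text = "xabxac"
root = Node(0, 0, "root")
xa = add(root, text, 0, 2, "xa")
a = add(root, text, 1, 2, "a")
add(root, text, 2, 6, "leaf 3"); add(root, text, 5, 6, "leaf 6")
add(xa, text, 2, 6, "leaf 1");   add(xa, text, 5, 6, "leaf 4")
add(a, text, 2, 6, "leaf 2");    add(a, text, 5, 6, "leaf 5")

node, offset, hops = skip_count(root, text, 0, 4)   # path "xabx"
print(node.name, offset, hops)
```

The path xabx is located with one hop over the edge xa, ending two letters into the edge bxac. After following the suffix link from xa to a, the remaining part bx would be matched with `skip_count(a, text, 2, 4)`.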

Ukkonen's algorithm: Speeding up explicit traversals
Lemma. For any node v with a suffix link to s(v), it holds that depth(v) - 1 ≤ depth(s(v)).
Sketch of proof. The suffix links of the ancestors of v lead to distinct ancestors of s(v).

Ukkonen's algorithm: Linear bound for any single phase
Theorem. Using suffix links and the skip/count trick, a single phase takes time O(n).
Proof. There are i + 1 ≤ n + 1 extensions in phase i + 1.
In any extension, all work other than tree traversal (that is, applying the extension rules) takes only O(1) time.
How do we bound the work for traversing the tree?
To find the end of the next path, an extension first moves at most one level up. Then a suffix link may be followed, which is followed by a down-traversal to match the rest of the path.

Ukkonen's algorithm: Linear bound for any single phase
The up-walk in any extension decreases the current node depth by at most one (since it moves up at most one node), and each suffix link traversal decreases the node depth by at most another one (previous Lemma). Thus the current node depth is decremented at most 2n times during the entire phase.
On the other hand, the current node depth cannot exceed n, so it is incremented (by following downward edges) at most 3n times. The total run-time of a phase is thus O(n).
Improvement: since there are n phases, the total run-time is O(n^2).

Ukkonen's algorithm: Final improvements (1)
Some extensions need not be computed explicitly.
Observation 1 - Rule 3 terminates the current phase
If path y[j..i + 1] is already in the tree, so are paths y[j + 1..i + 1], ..., y[i + 1].
Phase i + 1 can therefore be finished at the first extension j that applies Rule 3.

Ukkonen's algorithm: Final improvements (2)
Observation 2 - Once a leaf, always a leaf
A node created as a leaf remains a leaf thereafter, because no extension rule adds children to a leaf. If extension j created a leaf (numbered j), extension j of any later phase i + 1 applies Rule 1 (appending the next character y[i + 1] to the label of the edge ending at leaf j).
Explicit applications of Rule 1 can be eliminated as follows: use the compressed edge representation (i.e. indices p and q instead of the substring y[p..q]), and represent the end position of each terminal edge by a global value e for the current end position (phase).

Ukkonen's algorithm: Eliminating extensions
Denote by j_i the last non-void extension of phase i (that is, an application of Rule 1 or 2).
- Obs 1: extensions 1, ..., j_i of phase i are non-void, so leaves 1, ..., j_i have been created by the end of phase i.
- Obs 2: extensions 1, ..., j_i of any subsequent phase all apply Rule 1, so j_{i+1} ≥ j_i.
Execute only extensions j_i + 1, j_i + 2, ... explicitly in phase i + 1.

Ukkonen's algorithm: Single phase algorithm
Algorithm for phase i + 1 with unnecessary extensions eliminated:
1. Set e = i + 1 (implements extensions 1, ..., j_i implicitly).
2. Compute extensions j_i + 1, ..., j until j > i + 1 or Rule 3 was applied in extension j.
3. Set j_{i+1} = j - 1 (for the next phase).
All these tricks together can be shown to lead to linear run-time.

Ukkonen's algorithm: Complexity of the tuned implementation (1)
Theorem. Ukkonen's algorithm builds the suffix tree for y[1..n] in time O(n), when implemented using the mentioned tricks.
Proof. The extensions computed explicitly in any two phases i and i + 1 are disjoint, except for extension j, which may be computed anew in phase i + 1.
The second computation of extension j can be done in O(1) by remembering the end of the path entered in the previous computation.

Ukkonen's algorithm: Complexity of the tuned implementation (2)
Let j = 1, ..., n + 1 denote the index of the current extension.
Over all phases 2, ..., n + 1, index j never decreases, but it can remain the same at the start of phases 3, ..., n + 1, so at most 2n extensions are computed explicitly.
Similarly to the previous proof (skip/count), the current node depth can be decremented at most 4n times, and thus the total length of all downward traversals is bounded by 5n.

Ukkonen's algorithm: Obtaining the true suffix tree
Finally, the implicit suffix tree I_{n+1} can be converted to the true suffix tree of y[1..n]$ in the following way: all occurrences of the current end position marker e on edge labels are replaced by n + 1 (with a simple tree traversal, in time O(n)).

Ukkonen's algorithm: Summary
- Reads a string y of size n from left to right.
- The algorithm is on-line, i.e. at step 1 ≤ i ≤ n it constructs an implicit suffix tree of the prefix y[1..i], which can then be easily converted to the (true) suffix tree by appending a unique symbol $ that has not appeared before.
- Runs in O(n) time for constant-size alphabets, or O(n log n) for general alphabets.
- Requires O(n) space.
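To make the pieces concrete, here is a compact sketch in the widely used "active point" formulation, which bundles the tricks above (suffix links, skip/count, early stop on Rule 3, global end e) but is organised differently from the phase/extension description in the slides. All names are assumptions of this sketch; indices are 0-based and half-open, and an edge end of `None` stands for the global end position e.

```python
class Node:
    def __init__(self, start, end):
        self.start = start    # incoming edge label is text[start:end]
        self.end = end        # None stands for the global end position e
        self.children = {}    # first letter of edge -> Node
        self.link = None      # suffix link (None is read as "root")

def ukkonen(text):
    root = Node(0, 0)
    node, edge, length = root, 0, 0    # active point
    remainder = 0                      # suffixes still to be inserted
    for pos, ch in enumerate(text):    # one phase per letter
        remainder += 1
        pending = None                 # node waiting for its suffix link
        while remainder > 0:
            if length == 0:
                edge = pos
            child = node.children.get(text[edge])
            if child is None:
                # Rule 2 at a node: hang a new leaf below the active node.
                node.children[text[edge]] = Node(pos, None)
                if pending is not None:
                    pending.link = node
                pending = node
            else:
                end = pos + 1 if child.end is None else child.end
                span = end - child.start
                if length >= span:
                    # Skip/count: hop over the whole edge.
                    edge += span
                    length -= span
                    node = child
                    continue
                if text[child.start + length] == ch:
                    # Rule 3: already in the tree, so the phase ends.
                    length += 1
                    if pending is not None:
                        pending.link = node
                    break
                # Rule 2 inside an edge: split the edge and add a leaf.
                split = Node(child.start, child.start + length)
                node.children[text[edge]] = split
                split.children[ch] = Node(pos, None)
                child.start += length
                split.children[text[child.start]] = child
                if pending is not None:
                    pending.link = split
                pending = split
            remainder -= 1
            if node is root and length > 0:
                length -= 1
                edge = pos - remainder + 1
            elif node is not root:
                node = node.link if node.link is not None else root
    return root

def suffixes_in_tree(root, text):
    """Spell out every root-to-leaf path, reading e as len(text)."""
    out, stack = [], [(root, "")]
    while stack:
        node, prefix = stack.pop()
        for child in node.children.values():
            end = len(text) if child.end is None else child.end
            label = prefix + text[child.start:end]
            if child.children:
                stack.append((child, label))
            else:
                out.append(label)
    return sorted(out)

text = "abcabxabc$"
tree = ukkonen(text)
assert suffixes_in_tree(tree, text) == sorted(text[i:] for i in range(len(text)))
```

The final assertion only checks that the leaves spell exactly the suffixes of the text, which holds when the last letter is unique. Child lookup uses a dictionary, so the linear bound is in the expected sense for hashing rather than the worst-case bound for constant-size alphabets.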

Suffix tree - Example
Positions: 1 2 3 4 5 6 7 8 9 10
y =        a b c a b x a b c $
Edges are written as index pairs (p, q) into y; e is the global end position shared by all leaf edges.
[Figure sequence: the tree after each step; the steps shown are summarised below]
Phase 1 (a): extension 1 is explicit, Rule 2: new leaf 1 on edge (1, e).
Phase 2 (b): extension 1 is implicit; extension 2 is explicit: new leaf 2 on edge (2, e).
Phase 3 (c): extensions 1 and 2 are implicit; extension 3 is explicit: new leaf 3.
Phase 4 (a): extensions 1 to 3 are implicit; extension 4 is explicit, Rule 3: the phase ends.
Phase 5 (b): extensions 1 to 3 are implicit; extension 4 is explicit, Rule 3: the phase ends.
Phase 6 (x): skip all implicit extensions, then apply Rule 2 three times:
- extension 4 splits edge (1, e) into (1, 2) and the rest, and adds leaf 4;
- extension 5 splits edge (2, e) into (2, 2) and the rest, and adds leaf 5; suffix links are created for the new internal nodes (the node for ab points to the node for b);
- extension 6 adds leaf 6 below the root.
Phases 7 (a), 8 (b) and 9 (c): skip all implicit extensions; the first explicit extension applies Rule 3 and the phase ends.
Phase 10 ($): skip all implicit extensions, then apply Rule 2 four times:
- extension 7 splits the edge below the node for ab into (3, 3) and (4, e), and adds leaf 7 on edge (10, e);
- follow the suffix link; extension 8 does the same below the node for b and adds leaf 8;
- follow the suffix link; extension 9 splits the edge starting with c below the root into (3, 3) and (4, e), and adds leaf 9; suffix links are created for the new internal nodes;
- extension 10 adds leaf 10 on edge (10, e) below the root.

Application - finding all occurrences of a query
Positions: 1 2 3 4 5 6 7 8 9 10
y =        a b c a b x a b c $
Query: the string a
1. Find the node to which the string path a leads.
2. Get the leaves below that node.
3. The leaves indicate the starting positions of a in y (here 1, 4 and 7).
[Figure: suffix tree of y with edge labels ab, b, c, $, xabc$ and abxabc$, and leaves 1 to 10]
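The three steps can be sketched as follows. To keep the example short, the tree is built by naive suffix insertion in O(n^2) time rather than by Ukkonen's algorithm; the query itself works the same on any suffix tree. The class and function names are assumptions of this sketch, and the text must end with a unique marker.

```python
class Node:
    def __init__(self, start=0, end=0, leaf=None):
        self.start, self.end = start, end  # incoming edge is text[start:end]
        self.children = {}                 # first letter of edge -> Node
        self.leaf = leaf                   # 1-based suffix start (leaves only)

def naive_suffix_tree(text):
    """Insert the suffixes one by one, splitting edges where needed."""
    root = Node()
    for i in range(len(text)):
        node, pos = root, i
        while True:
            child = node.children.get(text[pos])
            if child is None:              # no matching edge: new leaf
                node.children[text[pos]] = Node(pos, len(text), i + 1)
                break
            k = child.start
            while k < child.end and text[k] == text[pos]:
                k += 1
                pos += 1
            if k == child.end:             # whole edge matched: descend
                node = child
                continue
            mid = Node(child.start, k)     # mismatch inside the edge: split
            node.children[text[child.start]] = mid
            child.start = k
            mid.children[text[k]] = child
            mid.children[text[pos]] = Node(pos, len(text), i + 1)
            break
    return root

def occurrences(root, text, pattern):
    """Start positions (1-based) of `pattern` in `text`, in O(|P| + occ)."""
    node, i = root, 0
    while i < len(pattern):                # step 1: follow the string path
        child = node.children.get(pattern[i])
        if child is None:
            return []
        k = child.start
        while k < child.end and i < len(pattern):
            if text[k] != pattern[i]:
                return []
            k += 1
            i += 1
        node = child
    found, stack = [], [node]              # step 2: collect the leaves below
    while stack:
        current = stack.pop()
        if current.leaf is not None:
            found.append(current.leaf)
        stack.extend(current.children.values())
    return sorted(found)                   # step 3: leaves are the positions

text = "abcabxabc$"
tree = naive_suffix_tree(text)
print(occurrences(tree, text, "a"), occurrences(tree, text, "abc"))
```

If the query ends in the middle of an edge, as a does here on the edge ab, the leaves below the lower end of that edge are reported. The final sort is only for readable output and is not part of the O(|P| + occ) bound.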

Overview
We had a quick look at indexing:
- preprocessing a given text,
- efficient querying afterwards.
We've seen what suffix trees are and some of their properties:
- Patricia suffix tries for a string x[1..n],
- at most 2n - 1 nodes,
- exactly n leaves.
We've seen Ukkonen's algorithm:
- fairly simple to understand,
- linear-time construction for constant-size alphabets.

Reminder - Next week
Next week's lecture will take place at SR 148, Building 50.34.