Exact String Matching Part II. Suffix Trees See Gusfield, Chapter 5

Similar documents
Exact Matching Part III: Ukkonen s Algorithm. See Gusfield, Chapter 5 Visualizations from

Lecture 6: Suffix Trees and Their Construction

Special course in Computer Science: Advanced Text Algorithms

Lecture 5: Suffix Trees

Special course in Computer Science: Advanced Text Algorithms

Data structures for string pattern matching: Suffix trees

An introduction to suffix trees and indexing

Exact String Matching. The Knuth-Morris-Pratt Algorithm

Introduction to Suffix Trees

Inexact Matching, Alignment. See Gusfield, Chapter 9 Dasgupta et al. Chapter 6 (Dynamic Programming)

Text search. CSE 392, Computers Playing Jeopardy!, Fall

Suffix trees and applications. String Algorithms

An Efficient Algorithm for Identifying the Most Contributory Substring. Ben Stephenson Department of Computer Science University of Western Ontario

Suffix Trees and its Construction

1 Introduciton. 2 Tries /651: Algorithms CMU, Spring Lecturer: Danny Sleator

Uses for Trees About Trees Binary Trees. Trees. Seth Long. January 31, 2010

Given a text file, or several text files, how do we search for a query string?

BUNDLED SUFFIX TREES

4. Suffix Trees and Arrays

4. Suffix Trees and Arrays

Computing the Longest Common Substring with One Mismatch 1

17 dicembre Luca Bortolussi SUFFIX TREES. From exact to approximate string matching.

Binary Trees, Binary Search Trees

Solution to Problem 1 of HW 2. Finding the L1 and L2 edges of the graph used in the UD problem, using a suffix array instead of a suffix tree.

Data Structures and Algorithms Dr. Naveen Garg Department of Computer Science and Engineering Indian Institute of Technology, Delhi.

Distributed and Paged Suffix Trees for Large Genetic Databases p.1/18

Suffix Tree and Array

Applications of Suffix Tree

PAPER Constructing the Suffix Tree of a Tree with a Large Alphabet

Technical University of Denmark

CSI33 Data Structures

Section 5.5. Left subtree The left subtree of a vertex V on a binary tree is the graph formed by the left child L of V, the descendents

CSE 5095 Topics in Big Data Analytics Spring 2014; Homework 1 Solutions

String Matching. Pedro Ribeiro 2016/2017 DCC/FCUP. Pedro Ribeiro (DCC/FCUP) String Matching 2016/ / 42

Advanced Algorithms: Project

Binary Trees

1 Computing alignments in only linear space

Algorithms Theory. 15 Text Search (2)

Binary Search Tree (2A) Young Won Lim 5/17/18

Data Structure Lecture#10: Binary Trees (Chapter 5) U Kang Seoul National University

Quiz 1 Solutions. (a) f(n) = n g(n) = log n Circle all that apply: f = O(g) f = Θ(g) f = Ω(g)

6. Finding Efficient Compressions; Huffman and Hu-Tucker

Suffix links are stored for compact trie nodes only, but we can define and compute them for any position represented by a pair (u, d):

March 20/2003 Jayakanth Srinivasan,

Quiz 1 Solutions. Asymptotic growth [10 points] For each pair of functions f(n) and g(n) given below:

Reporting Consecutive Substring Occurrences Under Bounded Gap Constraints

Longest Common Substring

11/5/09 Comp 590/Comp Fall

Lecture 23: Priority Queues, Part 2 10:00 AM, Mar 19, 2018

Non-context-Free Languages. CS215, Lecture 5 c

Chapter 20: Binary Trees

Divide and Conquer Algorithms

Suffix Trees and Arrays

58093 String Processing Algorithms. Lectures, Autumn 2013, period II

Association Rule Mining: FP-Growth

CSI33 Data Structures

CSCI2100B Data Structures Trees

Garbage Collection: recycling unused memory

EE 368. Weeks 5 (Notes)

CS350: Data Structures B-Trees

MULTIMEDIA COLLEGE JALAN GURNEY KIRI KUALA LUMPUR

Trees. (Trees) Data Structures and Programming Spring / 28

Verifying a Border Array in Linear Time

SUFFIX ARRAYS A Competitive Choice for Fast Lempel-Ziv Compressions

CE 221 Data Structures and Algorithms

Tree: non-recursive definition. Trees, Binary Search Trees, and Heaps. Tree: recursive definition. Tree: example.

CSE 417 Dynamic Programming (pt 5) Multiple Inputs

Today s Outline. CS 561, Lecture 18. Potential Method. Pseudocode. Dynamic Tables. Jared Saia University of New Mexico

CMPSCI 250: Introduction to Computation. Lecture #14: Induction and Recursion (Still More Induction) David Mix Barrington 14 March 2013

Indexing Variable Length Substrings for Exact and Approximate Matching

Data Structures and Algorithms

splitmem: graphical pan-genome analysis with suffix skips Shoshana Marcus May 7, 2014

Lecture 7 February 26, 2010

Caveat lector: This is the first edition of this lecture note. Please send bug reports and suggestions to

Lower Bound on Comparison-based Sorting

Dynamic indexes vs. static hierarchies for substring search

Module 4: Index Structures Lecture 13: Index structure. The Lecture Contains: Index structure. Binary search tree (BST) B-tree. B+-tree.

CS 350 : Data Structures B-Trees

11/5/13 Comp 555 Fall

Multi-way Search Trees. (Multi-way Search Trees) Data Structures and Programming Spring / 25

Friday, March 30. Last time we were talking about traversal of a rooted ordered tree, having defined preorder traversal. We will continue from there.

On Wavelet Tree Construction

Lecture 18 April 12, 2005

(2,4) Trees. 2/22/2006 (2,4) Trees 1

Data Structures. Outline. Introduction Linked Lists Stacks Queues Trees Deitel & Associates, Inc. All rights reserved.

How fast can we sort? Sorting. Decision-tree example. Decision-tree example. CS 3343 Fall by Charles E. Leiserson; small changes by Carola

Tree traversals and binary trees

Trees. Introduction & Terminology. February 05, 2018 Cinda Heeren / Geoffrey Tien 1

Binary Trees and Binary Search Trees

Lecture L16 April 19, 2012

Binary Search Trees. Analysis of Algorithms

CS24 Week 8 Lecture 1

String Matching Algorithms

Suffix Vector: Space- and Time-Efficient Alternative To Suffix Trees

FS22k Transducer User Manual

SUFFIX TREES FOR DOCUMENT RETRIEVAL. A Thesis. Presented to. the Faculty of California Polytechnic State University.

Weighted Suffix Tree Document Model for Web Documents Clustering

A Suffix Tree Construction Algorithm for DNA Sequences

INF2220: algorithms and data structures Series 1

Combinatorial Pattern Matching. CS 466 Saurabh Sinha

Transcription:

Exact String Matching Part II Suffix Trees See Gusfield, Chapter 5

Outline for Today What are suffix trees Application to exact matching Building a suffix tree in linear time, part I: Ukkonen s algorithm

Suffix Tree for xabxac

Suffix Tree Definition A suffix tree for string S[1..m] is a rooted directed tree with the following properties: Each internal node, other than the root, has at least two children Each edge is labeled with a nonempty substring of S No two edges out of a node can have edge labels beginning with the same character The tree has exactly m leaves numbered 1 to m For each i, the concatenation of the edge labels on the path from the root to leaf i is S[i..m]

More examples Exercise: build suffix trees for ababx xabxa Try adding suffixes in decreasing order of length

More examples Exercise: build suffix trees for ababx xabxa Problem: xabxa does not have a suffix tree, since the tree would both need to have a leaf for xa AND extend the path labeled xa to xabxa, and there's no way to do both. More generally, when one suffix is a prefix of another suffix, we're in trouble. We can easily avoid this by adding a special $ symbol at the end of a string

Suffix Tree Notation Edge label: the string labeling an edge Path: starts at root and follows edges; may end in the middle of an edge label Path label: concatenation, in order, of strings from root to the end of the path Node label: label of the path from the root to the node Node depth: the number of nodes on the path to the node Node string-depth: the number of characters in the node s label

Using Suffix Trees for Exact Matching

Using Suffix Trees for Exact Matching Given a suffix tree for a text T[1,...,m], we can find all occurrences of a pattern P[1,...,n] in T in O(n+k) time, where k is the number of occurrences. How?

Using Suffix Trees for Exact Matching Given a suffix tree for a text T[1,...,m], we can find all occurrences of a pattern P[1,...,n] in T in O(n+k) time, where k is the number of occurrences. How? 1. Traverse a path from the root as long as it matches P Two cases (2 and 2 ): Not all characters of P can be matched, or all characters of P are matched

Using Suffix Trees for Exact Matching Given a suffix tree for a text T[1,...,m], we can find all occurrences of a pattern P[1,...,n] in T in O(n+k) time, where k is the number of occurrences. How? 1. Traverse a path from the root as long as it matches P 2. If not all characters of P can be matched, there is no occurrence of P in T

Using Suffix Trees for Exact Matching Given a suffix tree for a text T[1,...,m], we can find all occurrences of a pattern P[1,...,n] in T in O(n+k) time, where k is the number of occurrences. How? 1. Traverse a path from the root as long as it matches P 2 All characters of P are matched. Then traverse the subtree from the point where the last character was matched and return the leaf labels

Using Suffix Trees for Exact Matching Given a suffix tree for a text T[1,...,m], we can find all occurrences of a pattern P[1,...,n] in T in O(n+k) time, where k is the number of occurrences. How? 1. Traverse a path from the root as long as it matches P 2 All characters of P are matched. Then traverse the subtree from the point where the last character was matched and return the leaf labels Step 1 takes Θ(n) time and Step 2 takes Θ(k) time

Using Suffix Trees for Exact Matching Given a suffix tree for a text T[1,...,m], we can find all occurrences of a pattern P[1,...,n] in T in O(n+k) time, where k is the number of occurrences. How? 1. Traverse a path from the root as long as it matches P 2 All characters of P are matched. Then traverse the subtree from the point where the last character was matched and return the leaf labels Step 1 takes Θ(n) time and Step 2 takes Θ(k) time but how to build the suffix tree for T?

Ukkonen s Algorithm Build a Suffix Tree for S[1..m] in O(m) Time The algorithm first builts an implicit suffix tree, and then converts the implicit suffix tree to a true suffix tree

Implicit Suffix Tree Obtained from a suffix tree by Removing every copy of $ from the edge labels Removing any edge that has no label Removing any node that does not have at least two children

Suffix Tree and Implicit Suffix Tree for abxa$

Ukkonen s Algorithm Build an Implicit Suffix Tree for S[1..m] in O(m) Time Let I i be the implicit suffix tree for prefix S[1,...,i] Construct I 1 // Phase 1 For i from 1 to m-1 // Phase i+1: Build I i+1 from I i For j from 1 to i+1 // Extension j: Ensure that a path labeled S[j,i+1] is in the tree

Ukkonen s Algorithm: Phase i+1, Extension j Ensure that a path labeled S[j,i+1] is in the tree: 1. Find the end of path S[j..i] 2. Apply the appropriate suffix extension rule: Rule 1: Path S[j..i] ends at a leaf. Add S[i+1] to the end of the label on that leaf edge. Rule 2: At least one path continues from the end of S[j..i], but no such path starts with S[i+1]. Create a new leaf edge labeled S[i+1], starting from S[j,,i]. (Create a new node if S[j..i] ends inside an edge.) Number the new leaf "j". Rule 3: Some path from the end of S[j..i] starts with character S[i+1]. Do nothing

Ukkonen s Algorithm Example: awyawxawxz 4 7 1 5 8 2 awxz 6 3 9 10

Ukkonen s Algorithm Naïve implementation takes Θ(m 3 ) time: m phases, ith phase has i extensions, taking time proportional to 1, 2,... i A bottleneck in doing an extension j of phase i+1 is finding the end of path S[j..i] We need a faster implementation

Ukkonen s Algorithm: Suffix Links Let x be a character and α be a string Let internal node v have label xα Let internal node v' have label α A suffix link is a pointer from v to v'

Ukkonen s Algorithm: Suffix Links Let x be a character and α be a string Let internal node v have label xα Let internal node v' have label α A suffix link is a pointer from v to v' In addition to maintaining suffix links, the algorithm also maintains a link to leaf node 1, let s call this the 1-link

Ukkonen s Algorithm: Suffix Links Example: awyawxawxz 4 7 1 5 8 2 6 3 9 10 suffix links to the root not shown

Ukkonen s Algorithm: Phase i+1, Extension j Ensure that a path labeled S[j,i+1] is in the tree revised: 1. Find the end of path S[j..i] (using the 1-link or suffix links) 2. Apply the appropriate suffix extension rule (as before, to ensure that a path labeled S[j,i+1] is in the tree) 3. If a new internal node w was added to the tree on extension j-1, add the suffix link from w now Next time we ll see details of steps 1 and 3

Summary so far Suffix trees have many applications, including an exact matching algorithm that finds all k occurrence of a pattern P of length n in a given text T in time O(n+k), independent in the length m of T, assuming that the suffix tree for T has already been created Ukkonen s algorithm can create the suffix tree for T in O(m) time We ve seen the high level description of the algorithm, and introduced suffix links More details next time