What are suffix trees?

Suffix Trees

What are suffix trees?
They allow algorithm designers to store a very large amount of information about strings while still keeping within linear space. They allow users to search for new strings in the original string in time proportional to the length of the new string. They have many other applications.

Definition of Suffix Tree T
For a string S of length n, the suffix tree T is a rooted tree with n leaves numbered 1 to n, such that: each internal node, excluding the root, has at least two children; each edge is labeled with a non-empty substring of S; no two edges out of the same node have labels beginning with the same character; and suffix S[i, n] corresponds to the concatenation of the edge-labels on the path from the root to leaf i.

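Example: Suffix tree for xabxac
Suffixes: {xabxac, abxac, bxac, xac, ac, c}
[Figure: the root has four outgoing edges: xa to internal node v, a to internal node u, bxac to leaf 3, and c to leaf 6. Node v has edges bxac to leaf 1 and c to leaf 4. Node u has edges bxac to leaf 2 and c to leaf 5.]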

Appending $: Prefix Free
Problem: If a suffix S[j, n] of S matches a prefix of another suffix S[i, n] of S, then the path for S[j, n] would not end at a leaf in T. For example, let S = xabxa. Then S[4, 5] = xa matches a prefix of S[1, 5] = xabxa, so the path for xa ends in the middle of the edge leading to leaf 1.
Solution: Add a unique character $, which is not in the alphabet, to the end of S.
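The prefix problem can be checked by brute force. The sketch below is an illustration only: the function name hidden_suffixes is made up for this example, and the quadratic scan is not part of any suffix tree algorithm. It lists the suffixes that are a proper prefix of another suffix, with and without the terminator $.

```python
def hidden_suffixes(s):
    """Return the 1-based start positions j of the suffixes S[j, n] that are a
    prefix of some other suffix, so that their path cannot end at a leaf."""
    n = len(s)
    hidden = []
    for j in range(n):
        suffix = s[j:]
        # Two different suffixes have different lengths, so a match here
        # always means "proper prefix of a longer suffix".
        if any(s[i:].startswith(suffix) for i in range(n) if i != j):
            hidden.append(j + 1)
    return hidden


print(hidden_suffixes("xabxa"))    # [4, 5]: xa and a have no leaf of their own
print(hidden_suffixes("xabxa$"))   # []: with the terminator every suffix gets a leaf
```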

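Example: Suffix Tree for xabxac$
Suffixes: {xabxac$, abxac$, bxac$, xac$, ac$, c$, $}
[Figure: the root has five outgoing edges: xa to internal node v, a to internal node u, bxac$ to leaf 3, c$ to leaf 6, and $ to leaf 7. Node v has edges bxac$ to leaf 1 and c$ to leaf 4. Node u has edges bxac$ to leaf 2 and c$ to leaf 5.]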

Pattern Matching Problem
Input: text T of size n, pattern P of size m.
Output: all occurrences of P in T.

Pattern Matching Problem: Algorithm and Analysis
1. Build the suffix tree for T. O(n) time.
2. Match the characters of P along the path from the root until P is exhausted or no more matches are possible. O(m) time.
3. If no more matches are possible, then P does not occur in T.
4. If P is exhausted, then the numbers of the leaves in the subtree below the point where P was exhausted are the positions in T where the matches occur. O(k) time, where k is the number of matched positions.
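Steps 2 to 4 can be tried out on a much simpler structure. The sketch below is an assumption-laden stand-in, not the linear-time method: it builds an uncompressed suffix trie in O(n^2) time and space, so step 1 and the leaf collection do not meet the O(n) and O(k) bounds above. The names TrieNode, build_suffix_trie and find_all are invented for the example, and $ is assumed not to occur in the text.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # character -> TrieNode
        self.leaf = None      # 1-based suffix start if a suffix ends here


def build_suffix_trie(text):
    """Insert every suffix of text + '$' character by character (quadratic)."""
    text += "$"
    root = TrieNode()
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.children.setdefault(ch, TrieNode())
        node.leaf = start + 1
    return root


def find_all(root, pattern):
    """Return the sorted 1-based positions where pattern occurs."""
    node = root
    for ch in pattern:                    # step 2: match P from the root
        node = node.children.get(ch)
        if node is None:
            return []                     # step 3: P does not occur in T
    hits, stack = [], [node]
    while stack:                          # step 4: collect the leaves below
        node = stack.pop()
        if node.leaf is not None:
            hits.append(node.leaf)
        stack.extend(node.children.values())
    return sorted(hits)


trie = build_suffix_trie("awyawxawxz")
print(find_all(trie, "aw"))    # [1, 4, 7]
print(find_all(trie, "wz"))    # []
```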

Pattern Matching Problem: Example
[Figure: fragment of the suffix tree for awyawxawxz, showing the path labeled aw and the leaves 1, 4 and 7 below it.]
Pattern aw occurs in positions 1, 4 and 7.

Definitions
The label of a path from the root r to a node v is the concatenation of the substrings on the edges from r to v. The path-label of node v is the label of the path from the root r to v. The string-depth of node v is the number of characters in v's label.
Comment: In constructing suffix trees, we will need to be able to split edges in the middle.

A First Simple Algorithm
Let S be the input string. Put the longest suffix in the tree first, then put each remaining suffix in, one at a time, from longest to shortest.
[Figure sequence: the suffixes of S and the tree after each insertion.]
We will also label each leaf with the starting point of the corresponding suffix.

Obvious runtime
This algorithm has runtime O(m^2) for a string of length m, since it does O(m) work in each phase. But quadratic work on a genome, for example, would be unacceptable.
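A minimal sketch of this quadratic construction follows, assuming edge labels are stored as explicit strings and that $ does not occur in the input. The class and function names are invented for the example. Each phase walks one suffix down from the root and either hangs a new leaf off a node or splits an edge in the middle.

```python
class Node:
    def __init__(self, label="", leaf=None):
        self.label = label      # label of the edge entering this node
        self.children = {}      # first character of a child edge -> child Node
        self.leaf = leaf        # 1-based suffix start for leaves, else None


def naive_suffix_tree(s):
    """Insert the suffixes of s + '$' one by one, longest first: O(m^2) time."""
    s += "$"
    root = Node()
    for start in range(len(s)):
        node, rest = root, s[start:]
        while True:
            child = node.children.get(rest[0])
            if child is None:                 # no edge starts with rest[0]
                node.children[rest[0]] = Node(rest, start + 1)
                break
            k = 0                             # length of the match along this edge
            while k < len(child.label) and child.label[k] == rest[k]:
                k += 1
            if k == len(child.label):         # whole edge matched: keep walking
                node, rest = child, rest[k:]
                continue
            mid = Node(child.label[:k])       # mismatch inside the edge: split it
            node.children[rest[0]] = mid
            child.label = child.label[k:]
            mid.children[child.label[0]] = child
            mid.children[rest[k]] = Node(rest[k:], start + 1)
            break
    return root


tree = naive_suffix_tree("xabxac")
v = tree.children["x"]
print(v.label, sorted(v.children))    # xa ['b', 'c']
```

The unique terminator guarantees that the inner loop never runs off the end of `rest`, because no suffix is a prefix of a path already in the tree.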

Constructing Suffix Trees in O(n)
Weiner proposed the first linear-time algorithm in 1973 (the algorithm of the year according to Knuth); McCreight introduced a more space-efficient linear-time algorithm in 1976; Ukkonen developed a simpler-to-understand linear-time algorithm in 1995. Ukkonen's algorithm, based on a sequence of implicit suffix trees, is what we will focus on.

Implicit Suffix Tree
Definition: An implicit suffix tree I for string S is the tree obtained from the suffix tree for S$ by removing $ from each edge label; removing any edges that now have no label; and removing any node that does not still have at least two children.
Comment: some suffixes may no longer end at leaves. The implicit suffix tree for the prefix S[1, k] of S is denoted by I_k.

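Example: Implicit Suffix Tree
Implicit suffix tree for S = xabxa.
Suffixes of xabxa$: {xabxa$, abxa$, bxa$, xa$, a$, $}
True suffix tree for xabxa$:
[Figure: the root has four outgoing edges: xa to internal node v, a to internal node u, bxa$ to leaf 3, and $ to leaf 6. Node v has edges bxa$ to leaf 1 and $ to leaf 4. Node u has edges bxa$ to leaf 2 and $ to leaf 5.]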

Example: Implicit Suffix Tree (cont'd)
Remove $ from each edge label.
[Figure: the same tree with $ removed; the edges to leaves 4, 5 and 6 now have no label.]
Some edges have no labels.

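Example: Implicit Suffix Tree (cont'd)
Remove the edges with no label.
[Figure: the root with edges xa to v, a to u, and bxa to leaf 3; v keeps only the edge bxa to leaf 1, and u keeps only the edge bxa to leaf 2.]
Some internal nodes have only one child.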

Example: Implicit Suffix Tree (cont'd)
Remove the internal nodes with only one child. Finally, the implicit suffix tree for xabxa:
[Figure: the root with three leaf edges: xabxa to leaf 1, abxa to leaf 2, and bxa to leaf 3.]

Ukkonen's Algorithm: Key Ideas
Construct a sequence of implicit suffix trees I_1, I_2, ..., I_i, I_{i+1}, ..., I_n, where I_i is the implicit suffix tree for the prefix S[1, i] of S. Divide the work into n phases; each phase constructs one implicit suffix tree. In phase i+1, consider the prefix S[1, i+1] and construct I_{i+1} from I_i.

Ukkonen's Algorithm: Key Ideas (cont'd)
Further, divide each phase i+1 into i+1 extensions:
Ext. 1: adding suffix S[1, i+1] of S[1, i+1] into I_i
Ext. 2: adding suffix S[2, i+1] of S[1, i+1] into I_i
...
Ext. j: adding suffix S[j, i+1] of S[1, i+1] into I_i
...
Ext. i+1: adding suffix S[i+1, i+1] of S[1, i+1] into I_i
After i+1 extensions, we have I_{i+1}.

Ukkonen's Algorithm
Construct I_1;
For i = 1 To n-1 Do                /* phase loop: build I_{i+1} */
    For j = 1 To i+1 Do            /* extension loop */
        Find the end of the path labeled by S[j, i] in I_i;
        Add S[i+1] to the end by a suffix extension rule;
Convert I_n into a suffix tree of S.

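Ukkonen's Algorithm: Running Time
As stated so far, the running time is O(n^3): there are n phases, each phase has up to n extensions, and each extension locates the end of S[j, i] by walking from the root, which takes O(n) time.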

Suffix Extension Rules
In extension j of phase i+1, the goal is to extend S[j, i] into S[j, i+1].
Rule 1: If the path β = S[j, i] (a suffix of S[1, i]) ends at a leaf, then add the character S[i+1] to the end of the label on that leaf edge.

Suffix Extension Rules (cont'd)
Rule 2: If the path β does not end at a leaf and the character x that continues the path is not S[i+1], then a new leaf edge starting from the end of β must be created and labeled with S[i+1], and the new leaf is numbered j. If β ends in the middle of an edge, that edge is first split by a new internal node at the end of β.
[Figures: leaf j created at extension j, once from an existing node and once from a new node that splits an edge.]

Suffix Extension Rules (cont'd)
Rule 3: If some path from the end of string β starts with S[i+1], i.e. the substring βS[i+1] is already in the tree, then we do nothing.
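The phase and extension loops, together with the three rules, can be sketched on an uncompressed trie of nested dictionaries. This is a simplification chosen for illustration, not the compressed implicit suffix tree itself: in a trie, rule 2 never needs to split an edge, and the naive walk from the root gives the cubic worst case. The function name build_by_phases and the log format are assumptions of the sketch.

```python
def build_by_phases(s):
    """Build the trie of all suffixes of s phase by phase and record which
    extension rule fired, as (phase, extension, rule) triples (1-based)."""
    root = {s[0]: {}}                    # I_1: a single edge labeled S[1]
    log = []
    for i in range(1, len(s)):           # phase i+1 appends character s[i]
        for j in range(i + 1):           # extension j+1 handles the suffix s[j:i]
            node = root
            for ch in s[j:i]:            # naive walk from the root, O(n) per extension
                node = node[ch]
            if s[i] in node:
                rule = 3                 # already in the tree: do nothing
            else:
                rule = 1 if not node else 2   # rule 1 at a leaf, rule 2 elsewhere
                node[s[i]] = {}
            log.append((i + 1, j + 1, rule))
    return root, log


trie, log = build_by_phases("xabxa")
print([rule for phase, ext, rule in log if phase == 5])   # [1, 1, 1, 3, 3]
```

In every phase the recorded rules appear in the order 1, then 2, then 3, which is the pattern later exploited to skip most extensions.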

Suffix Trees: Ukkonen's Algorithm
How can we efficiently locate the ends of all the i+1 suffixes of S[1, i]? We need some tricks!

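Suffix Link
Definition: For an internal node v with path-label xα, where x is a single character and α is a (possibly empty) string, if there is another node s(v) with path-label α, then a pointer from v to s(v) is called a suffix link.
[Figure: in the suffix tree for xabxac$, the suffix link from node v with path-label xa to node s(v) with path-label a.]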

Suffix Links
Lemma 6.1.1: If a new internal node v with path-label xα is added to the current tree in extension j of some phase i+1, then either the path labeled α already ends at an internal node u of the tree, so that u = s(v); or the internal node labeled α will be created in extension j+1 of the same phase; or the string α is empty and s(v) is the root.

Suffix Links
Proof. Node v is created, so rule 2 was used, so xαc is a path for some character c ≠ S[i+1], and therefore αc is a path in the tree.
Case 1: α ends at a node. We are done, since this node is s(v).
Case 2: α does not end at a node. Extension j+1 will create the node s(v) at the end of α in the same phase.

Suffix Links
Corollary. Every newly created internal node will have a suffix link from it by the end of the next extension.
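The consequence that every internal node of the finished tree has a suffix link can be checked by brute force. The sketch below assumes a terminated string and relies on a standard characterization: a non-empty string is the path-label of an internal node exactly when it occurs in S followed by at least two distinct characters. The function names are invented, and the empty string stands for the root.

```python
def internal_labels(s):
    """Path-labels of the internal nodes (root excluded) of the suffix tree of
    s, where s ends with a unique terminator."""
    followers = {}
    for i in range(len(s)):
        for j in range(i + 1, len(s)):          # substring s[i:j] is followed by s[j]
            followers.setdefault(s[i:j], set()).add(s[j])
    return {sub for sub, nxt in followers.items() if len(nxt) >= 2}


def suffix_links(s):
    """Map each internal path-label x + alpha to alpha ('' means the root)."""
    labels = internal_labels(s)
    links = {}
    for label in labels:
        target = label[1:]                      # drop the first character x
        assert target == "" or target in labels # the target node always exists
        links[label] = target
    return links


links = suffix_links("xabxac$")
print(links["xa"], repr(links["a"]))    # a ''
```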

Locate S[j, i] Using Suffix Links
Naively, in extension j of phase i+1, we locate the suffix S[j, i] of S[1, i] by matching it along a path from the root.
Using suffix links to shortcut the location: starting at the end of S[j-1, i], walk up at most one node to v; traverse the suffix link to s(v); then walk down the tree to find the end of S[j, i].
[Figure: node v with path-label xα above the end of S[j-1, i], its suffix link to s(v), and the walk down from s(v) to the end of S[j, i].]

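Trick 1: Skip-Count
The string followed on the walk down from s(v) is already known to be in the tree, so its characters need not be compared one by one.
Solution: the skip-count technique. At each node, only check the first character on the outgoing edge, and use the number of characters on that edge to update the search in O(1). The cost of the walk is then proportional to the number of nodes on the path rather than the number of characters.
[Figure: the walk from s(v) down to the end of suffix S[j, i], skipping whole edges by their lengths.]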
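A minimal sketch of the skip-count walk follows. The node layout is an assumption made for the example: each node is a dictionary from the first character of an outgoing edge to a triple (start, end, child), where the edge label is text[start:end] with 0-based, end-exclusive indices. The small tree for xabxac$ is built by hand.

```python
def skip_count(text, root, lo, hi):
    """Walk down from root along text[lo:hi], which is assumed to be in the
    tree. Only the first character of each edge is inspected.

    Returns (node, edge, offset, hops): the walk ends offset characters down
    edge out of node (edge is None, offset 0 if it ends exactly at node).
    """
    node, hops = root, 0
    while lo < hi:
        start, end, child = node[text[lo]]     # choose the edge by its first character
        length = end - start
        if hi - lo < length:                   # the path ends inside this edge
            return node, (start, end), hi - lo, hops
        lo += length                           # skip the whole edge in O(1)
        node, hops = child, hops + 1
    return node, None, 0, hops


text = "xabxac$"
v = {"b": (2, 7, {}), "c": (5, 7, {})}         # internal node with path-label xa
u = {"b": (2, 7, {}), "c": (5, 7, {})}         # internal node with path-label a
root = {"x": (0, 2, v), "a": (1, 2, u),
        "b": (2, 7, {}), "c": (5, 7, {}), "$": (6, 7, {})}

node, edge, offset, hops = skip_count(text, root, 0, 4)   # walk down "xabx"
print(node is v, edge, offset, hops)                       # True (2, 7) 2 1
```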

Trick 1: Skip-Count (cont'd)
The node-depth of v is the number of nodes on the path from the root to node v, denoted by level(v).
Lemma: At the moment of traversing the suffix link (v, s(v)), level(v) ≤ level(s(v)) + 1.
[Figure: a node v with node-depth 4 whose suffix link leads to a node s(v) with node-depth 3.]

Trick 1: Skip-Count (cont'd)
Theorem: Using suffix links and the skip-count trick, any phase takes O(n) time.
Proof: Over a phase we go up at most n nodes and traverse at most n suffix links, at most one of each per extension. It remains to bound how far we go down. Let level(j) be the node-depth of the node reached by extension j. Extension j+1 starts from that node and loses at most one level walking up and at most one more level following the suffix link (by the lemma), so it goes down at most level(j+1) - level(j) + 2 nodes. Adding over all extensions of the phase, the differences telescope, and since no node-depth exceeds n the total cost is O(n).

Edge-label Compression
Problem: If edges are labeled with explicit substrings, the suffix tree may require Θ(n^2) space. For example, for S = abc...z (26 distinct characters) the edge to leaf i carries the whole suffix S[i, 26], which gives O(n^2) characters in total.
Solution: Label each edge with an index pair [i, j] denoting the substring S[i, j]. In the example the labels become [1, 26], [2, 26], ..., [26, 26], which is O(n) symbols. The suffix tree then requires only O(n) space, since it has O(n) edges.
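The saving in the alphabet example can be computed directly. The sketch below assumes the unterminated string abc...z, whose suffix tree is just a root with 26 leaf edges because all characters are distinct; the variable names are invented.

```python
import string

s = string.ascii_lowercase                  # "abc...z", n = 26 distinct characters
n = len(s)

# The edge to leaf i carries the whole suffix S[i, n].
explicit_labels = [s[i:] for i in range(n)]
index_labels = [(i + 1, n) for i in range(n)]     # 1-based pairs [i, n]

total_chars = sum(len(label) for label in explicit_labels)
print(total_chars)               # 351, i.e. n(n+1)/2 characters
print(2 * len(index_labels))     # 52 integers

# An index pair [i, j] decodes back to the substring S[i, j].
i, j = index_labels[23]
print(s[i - 1:j])                # xyz
```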

Trick 2: Stopper
In any phase i+1, if suffix extension rule 3 applies in extension j, it will also apply in all remaining extensions up to the end of phase i+1: if S[j, i+1] is a substring of S[1, i], then S[k, i+1] is a substring of S[1, i] for every k > j.
Recall that when applying rule 3, we do nothing. That implies that some extensions can be done implicitly. Hence, end any phase i+1 the first time that extension rule 3 applies. This reduces the work.

Trick 3: Global Index
Suppose that in phase i a leaf is created and labeled j. Then at extension j of every subsequent phase, j will still be a leaf: once a leaf, always a leaf.
Instead of labeling a leaf edge with (p, i), label it with (p, e), where e is a global index. Set e once in phase i.
In phase i, last(i) denotes the last extension in which rule 3 does not apply.
[Figure: the extensions of phase i in order: those handled implicitly by updating the index e, then the explicit extensions up to last(i), then the stopper.]
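One way to realize the global index is to let every leaf edge hold a reference to the same mutable object. The sketch below is an illustration under that assumption; the class names GlobalEnd and LeafEdge are invented, and the three leaves are the ones created in phases 1 to 3 for S = xabxa.

```python
class GlobalEnd:
    """The shared, mutable end index e of all leaf edges."""
    def __init__(self):
        self.value = 0


class LeafEdge:
    def __init__(self, start, end):
        self.start = start       # p: 1-based start of the label in S
        self.end = end           # the shared GlobalEnd object, not a copy

    def label(self, s):
        return s[self.start - 1:self.end.value]


s = "xabxa"
e = GlobalEnd()
leaves = []
for i in range(1, 4):            # phases 1 to 3 each create one new leaf
    e.value = i                  # one O(1) update extends every existing leaf edge
    leaves.append(LeafEdge(i, e))
e.value = 5                      # phases 4 and 5: no per-leaf work at all
print([leaf.label(s) for leaf in leaves])    # ['xabxa', 'abxa', 'bxa']
```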

Trick 3: Global Index (cont'd)
After phase i, the suffixes S[j, i] for 1 ≤ j ≤ last(i) end at a leaf. So in every phase after i, all extensions for 1 ≤ j ≤ last(i) apply rule 1, and we only need to update e. Keep last(i).
Note that last(i+1) ≥ last(i): it never shrinks.
In phase i+1, explicitly compute extensions for j ≥ last(i)+1 until the first rule 3 extension. Hence, phases i and i+1 share at most 1 explicit extension.

Time Complexity
Implicit extensions take constant time per phase, O(n) in total. There are at most 2n explicit extensions, since consecutive phases share at most one of them.
[Figure: explicit extensions 1 to 5 in phase i, 5 to 8 in phase i+1, and 8 to 10 in phase i+2.]
The maximum number of down-walk skips is O(n). Therefore, the total time complexity is O(n).

Suffix Trees: Ukkonen's Algorithm
From an implicit tree to a suffix tree:
Modification 1: Add the terminal symbol $ to the end of S and continue Ukkonen's algorithm with this character. Then no suffix is a prefix of any other suffix.
Modification 2: Replace each index e on every leaf edge with the number n. This can be done in O(n) time via DFS.
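The pieces above can be put together in a compact sketch. It is one common way to code the algorithm, assumed here rather than taken from the slides: instead of tracking last(i) explicitly it keeps an "active point" (node, edge, length) and a counter of suffixes still waiting for an explicit extension. Indices are 0-based and end-exclusive, a leaf edge stores None as its end to mean the global index e, and the input must already end with a unique terminator. All names are the sketch's own.

```python
class Node:
    def __init__(self, start, end):
        self.start = start       # the edge into this node is text[start:end]
        self.end = end           # None on leaf edges: "up to the global end e"
        self.children = {}       # first character of an edge -> child Node
        self.link = None         # suffix link (set on internal nodes)


def build_suffix_tree(text):
    root = Node(0, 0)
    node, edge, length = root, 0, 0      # active point
    remaining = 0                        # suffixes awaiting an explicit extension
    for pos, ch in enumerate(text):      # one phase per character; e is pos + 1
        remaining += 1
        last_new = None                  # internal node still lacking a suffix link
        while remaining > 0:
            if length == 0:
                edge = pos
            child = node.children.get(text[edge])
            if child is None:                          # rule 2 at a node: new leaf
                node.children[text[edge]] = Node(pos, None)
                if last_new is not None:
                    last_new.link = node
                    last_new = None
            else:
                end = pos + 1 if child.end is None else child.end
                edge_len = end - child.start
                if length >= edge_len:                 # skip-count over a whole edge
                    node, edge, length = child, edge + edge_len, length - edge_len
                    continue
                if text[child.start + length] == ch:   # rule 3: stopper, end the phase
                    if last_new is not None:
                        last_new.link = node
                    length += 1
                    break
                split = Node(child.start, child.start + length)   # rule 2 inside an edge
                node.children[text[edge]] = split
                split.children[ch] = Node(pos, None)
                child.start += length
                split.children[text[child.start]] = child
                if last_new is not None:
                    last_new.link = split
                last_new = split
            remaining -= 1
            if node is root and length > 0:            # next suffix, from the root
                length -= 1
                edge = pos - remaining + 1
            elif node is not root:                     # next suffix, via the suffix link
                node = node.link if node.link is not None else root
    return root


def leaf_suffixes(root, text):
    """Map each leaf's 1-based suffix number to its path-label."""
    found, stack = {}, [(root, "")]
    while stack:
        node, label = stack.pop()
        if not node.children and node is not root:
            found[len(text) - len(label) + 1] = label
        for child in node.children.values():
            end = len(text) if child.end is None else child.end   # modification 2
            stack.append((child, label + text[child.start:end]))
    return found


text = "xabxac$"
tree = build_suffix_tree(text)
print(sorted(leaf_suffixes(tree, text)))                 # [1, 2, 3, 4, 5, 6, 7]
print(tree.children["x"].link is tree.children["a"])     # True
```

The last line checks the suffix link from the node with path-label xa to the node with path-label a, as in the example tree for xabxac$.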

Example: T = cc

Practical Implementation Issues
There are several possibilities to represent and search the branches out of the nodes of the tree: store a vector of size O(|Σ|); keep a list at each node; maintain a balanced tree; maintain a hash table.
Some implementations combine different representations. Nodes at the top of the tree (in general those with the highest out-degree) make use of arrays. Nodes at lower levels employ lists.
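Three of these child representations are sketched below behind the same get/put interface. The class names and the lowercase a to z alphabet are assumptions of the example, and a sorted list with binary search stands in for a balanced tree, since it gives the same O(log k) lookup for k children but O(k) insertion.

```python
import bisect


class ArrayChildren:
    """Vector indexed by character code: O(1) lookup, O(|alphabet|) space per node."""
    def __init__(self, first="a", size=26):
        self.first, self.slots = ord(first), [None] * size

    def get(self, ch):
        return self.slots[ord(ch) - self.first]

    def put(self, ch, child):
        self.slots[ord(ch) - self.first] = child


class SortedListChildren:
    """Sorted keys with binary search: O(log k) lookup, O(k) space."""
    def __init__(self):
        self.keys, self.vals = [], []

    def get(self, ch):
        k = bisect.bisect_left(self.keys, ch)
        return self.vals[k] if k < len(self.keys) and self.keys[k] == ch else None

    def put(self, ch, child):
        k = bisect.bisect_left(self.keys, ch)
        if k < len(self.keys) and self.keys[k] == ch:
            self.vals[k] = child
        else:
            self.keys.insert(k, ch)
            self.vals.insert(k, child)


class HashChildren:
    """Hash table: expected O(1) lookup, O(k) space."""
    def __init__(self):
        self.table = {}

    def get(self, ch):
        return self.table.get(ch)

    def put(self, ch, child):
        self.table[ch] = child


for children in (ArrayChildren(), SortedListChildren(), HashChildren()):
    children.put("x", "node v")
    children.put("a", "node u")
    print(type(children).__name__, children.get("x"), children.get("b"))
```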

Practical Considerations
Traversing suffix links may cause several page faults. A lot of effort has gone into producing practical implementations. The linear time bound relies on the assumption that the alphabet is bounded; for large alphabets, see Optimal Suffix Tree Construction with Large Alphabets [Martin Farach, FOCS 1997].

References
A. Aho and M. Corasick. Efficient string matching: an aid to bibliographic search. Comm. ACM, 18:333-340, 1975.
P. Weiner. Linear pattern matching algorithms. Proceedings of the IEEE 14th Annual Symposium on Switching and Automata Theory, pages 1-11, 1973.
E. McCreight. A space-economical suffix tree construction algorithm. Journal of the Association for Computing Machinery, 23(2):262-272, April 1976.
E. Ukkonen. On-line construction of suffix trees. Algorithmica, 14(3):249-260, 1995.
R. Giegerich and S. Kurtz. From Ukkonen to McCreight and Weiner: A Unifying View of Linear-Time Suffix Tree Construction. Report Nr. 94-03, Technische Fakultät der Universität Bielefeld, 1994.
D. Gusfield. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, 1997.
