Suffix Trees
What are suffix trees?
- They allow algorithm designers to store a very large amount of information about a string while still keeping within linear space.
- They allow users to search for a new string in the original string in time proportional to the length of the new string.
- They have many other applications.
Definition of Suffix Tree T (for a string S of length n)
- T is a rooted tree with n leaves numbered 1 to n.
- Each internal node, excluding the root, has at least two children.
- Each edge is labeled with a non-empty substring of S.
- The labels of two edges out of the same node start with distinct characters.
- Suffix S[i, n] corresponds to the concatenation of the edge-labels on the path from the root to leaf i.
Example: Suffix tree for xabxac
Suffixes: {xabxac, abxac, bxac, xac, ac, c}
[Figure: suffix tree for xabxac with internal nodes u and v and leaves 1 to 6]
Appending $: Prefix-Free
Problem: If a suffix S[j, n] of S matches a prefix of another suffix S[i, n] of S, then the path for S[j, n] would not end at a leaf in T.
For example, S = xabxa: S[4, 5] = xa matches a prefix of S[1, 5] = xabxa.
Solution: Add a unique character $, which is not in the alphabet, to the end of S.
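The prefix-free problem can be checked directly on the strings. The following is a minimal sketch, not part of the original slides; the function name `suffix_prefix_pairs` is hypothetical. It lists every pair (j, i), in the slides' 1-based numbering, for which suffix S[j, n] is a prefix of a different suffix S[i, n].

```python
def suffix_prefix_pairs(s):
    """Return 1-based pairs (j, i) such that suffix S[j, n] is a prefix of suffix S[i, n]."""
    n = len(s)
    pairs = []
    for j in range(n):
        for i in range(n):
            # Two different suffixes have different lengths, so any match is a proper prefix.
            if i != j and s[i:].startswith(s[j:]):
                pairs.append((j + 1, i + 1))
    return pairs


print(suffix_prefix_pairs("xabxa"))   # [(4, 1), (5, 2)]
print(suffix_prefix_pairs("xabxa$"))  # []
```

Without the terminator, suffixes 4 and 5 of xabxa would end inside the tree rather than at a leaf; with $ appended, no suffix is a prefix of another.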
Example: Suffix tree for xabxac$
Suffixes: {xabxac$, abxac$, bxac$, xac$, ac$, c$, $}
[Figure: suffix tree for xabxac$ with internal nodes u and v and leaves 1 to 7]
Pattern Matching Problem
Input: text T of size n, pattern P of size m
Output: all occurrences of P in T
Pattern Matching Problem: Algorithm
1. Build the suffix tree for T.
2. Match the characters of P along the path from the root until P is exhausted or no more matches are possible.
3. If no more matches are possible, then P does not occur in T.
4. If P is exhausted, then the numbers of the leaves in the subtree below the point where P got exhausted are the positions in T where the matches occur.
Pattern Matching Problem: Analysis
1. Build the suffix tree for T: O(n) time.
2. Match the characters of P until P is exhausted or no more matches are possible: O(m) time.
3. If no more matches are possible, then P does not occur in T.
4. If P is exhausted, report the numbers of the leaves in the subtree below the point where P got exhausted: O(k) time, where k is the number of matched positions.
Pattern Matching Problem: Example
[Figure: fragment of the suffix tree for awyawxawxz; the subtree below the path labeled aw has leaves 1, 4 and 7]
Pattern aw occurs at positions 1, 4 and 7.
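The matching procedure can be sketched on an uncompressed suffix trie. This is a hedged illustration only: the trie is built naively and takes quadratic space, so it does not have the O(n) construction cost of a real suffix tree. The names `build_suffix_trie` and `find_all` are hypothetical, and a terminator $ is assumed to have been appended to the text.

```python
def build_suffix_trie(text):
    """Naive suffix trie: nested dicts keyed by character; key None stores the leaf number."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node[None] = start + 1  # 1-based start of this suffix
    return root


def find_all(root, pattern):
    """Match pattern from the root, then collect the leaf numbers in the subtree below."""
    node = root
    for ch in pattern:
        if ch not in node:
            return []  # no more matches possible: pattern does not occur
        node = node[ch]
    hits, stack = [], [node]
    while stack:
        current = stack.pop()
        for key, value in current.items():
            if key is None:
                hits.append(value)
            else:
                stack.append(value)
    return sorted(hits)


trie = build_suffix_trie("awyawxawxz$")
print(find_all(trie, "aw"))  # [1, 4, 7]
```

The walk down the pattern costs O(m) dictionary lookups, and the collection step visits only the subtree below the end of the match.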
Definitions
- The label of a path from the root r to a node v is the concatenation of the substrings on the edges from r to v.
- The path-label of a node v is the label of the path from the root r to v.
- The string-depth of a node v is the number of characters in v's label.
Comment: In constructing suffix trees, we will need to be able to split edges in the middle.
A First Simple Algorithm
- Start with an empty tree and put the longest suffix of S in as a single edge from the root.
- Put each remaining suffix in, one at a time, from longest to shortest: match it from the root as far as possible, split the edge where the match stops, and add a new leaf edge for the rest.
- We will also label each leaf with the starting point of the corresponding suffix.
[Figure sequence: the suffixes of the example string S are inserted one by one; the final tree has leaves labeled 5, 4, 3, 1, 2]
Obvious runtime
- For a string of length m, this algorithm has runtime O(m^2), since it does O(m) work in each of its m phases.
- But quadratic work on a genome, for example, would be unacceptable.
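The quadratic algorithm can be sketched as follows. This is an illustration under stated assumptions, not the slides' own code: the class `Node` and the functions `naive_suffix_tree` and `leaf_numbers` are hypothetical names, edge labels are stored as explicit substrings, and the input is assumed to end with a unique terminator so that every suffix ends at its own leaf.

```python
class Node:
    def __init__(self):
        self.children = {}  # first character -> (edge label, child Node)
        self.leaf = None    # 1-based suffix start for leaves, None otherwise


def naive_suffix_tree(s):
    """Insert the suffixes longest first, splitting an edge where a match stops."""
    assert s and s.count(s[-1]) == 1, "assumes a unique terminator as last character"
    root = Node()
    for start in range(len(s)):
        node, rest = root, s[start:]
        while True:
            entry = node.children.get(rest[0])
            if entry is None:
                # No edge starts with the next character: add a new leaf edge.
                leaf = Node()
                leaf.leaf = start + 1
                node.children[rest[0]] = (rest, leaf)
                break
            label, child = entry
            k = 0
            while k < len(label) and k < len(rest) and label[k] == rest[k]:
                k += 1
            if k == len(label):
                # Whole edge matched: continue from the child node.
                node, rest = child, rest[k:]
                continue
            # Mismatch inside the edge: split it at offset k.
            middle = Node()
            middle.children[label[k]] = (label[k:], child)
            node.children[rest[0]] = (label[:k], middle)
            node, rest = middle, rest[k:]


    return root


def leaf_numbers(node):
    """Collect the leaf numbers in the subtree below node."""
    if node.leaf is not None:
        return [node.leaf]
    found = []
    for _, child in node.children.values():
        found.extend(leaf_numbers(child))
    return sorted(found)


tree = naive_suffix_tree("xabxac$")
print(leaf_numbers(tree))         # [1, 2, 3, 4, 5, 6, 7]
print(tree.children["x"][0])      # xa
```

Each insertion may rescan up to O(m) characters from the root, which is where the O(m^2) bound comes from.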
Constructing Suffix Trees in O(n) Time
- Weiner proposed the first linear-time algorithm in 1973 ("algorithm of the year" according to Knuth).
- McCreight introduced a more space-efficient linear-time algorithm in 1976.
- Ukkonen developed a simpler-to-understand linear-time algorithm in 1995.
- Ukkonen's algorithm, based on a sequence of implicit suffix trees, is what we will focus on.
Implicit Suffix Tree
Definition: An implicit suffix tree I for a string S is a tree obtained from the suffix tree for S$ by
- removing $ from each edge label;
- removing any edges that now have no label;
- removing any node that does not still have at least two children.
Comment: some suffixes may no longer end at leaves.
The implicit suffix tree for the prefix S[1, k] of S is denoted by I_k.
Example: Implicit Suffix Tree
Implicit suffix tree for S = xabxa.
Suffixes of xabxa$: {xabxa$, abxa$, bxa$, xa$, a$, $}
1. Start from the true suffix tree for S$. [Figure: suffix tree for xabxa$ with internal nodes u and v and leaves 1 to 6]
2. Remove $ from each edge label. Some edges now have no label. [Figure: same tree with $ removed]
3. Remove the edges with no label (leaves 4, 5 and 6 disappear). Some internal nodes now have only one child.
4. Remove the internal nodes with only one child. Finally, the implicit suffix tree for xabxa has three leaves: 1, 2 and 3.
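Which suffixes keep a leaf in the implicit suffix tree can be predicted without building the tree: a suffix still ends at a leaf exactly when it is not a prefix of another suffix of S. The following minimal sketch relies on that observation; the function name `implicit_leaves` is hypothetical.

```python
def implicit_leaves(s):
    """1-based starts of the suffixes of s that still end at a leaf in the implicit suffix tree."""
    n = len(s)
    leaves = []
    for j in range(n):
        suffix = s[j:]
        # The leaf survives only if the suffix occurs nowhere else in s.
        occurs_elsewhere = any(s.startswith(suffix, k) for k in range(n) if k != j)
        if not occurs_elsewhere:
            leaves.append(j + 1)
    return leaves


print(implicit_leaves("xabxa"))   # [1, 2, 3]
print(implicit_leaves("xabxa$"))  # [1, 2, 3, 4, 5, 6]
```

For xabxa, the suffixes xa and a are prefixes of longer suffixes, so only leaves 1, 2 and 3 remain, in agreement with the example.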
Ukkonen's Algorithm: Key Ideas
- Construct a sequence of implicit suffix trees: I_1, I_2, ..., I_i, I_{i+1}, ..., I_n.
- Divide the construction into n phases. Each phase constructs one implicit suffix tree.
- In phase i+1, consider the prefix S[1, i+1] and construct I_{i+1} from I_i.
[Figure: chain I_1, I_2, ..., I_i, I_{i+1}, ..., I_n, where I_i is the implicit suffix tree for prefix S[1, i] of S and I_{i+1} is the implicit suffix tree for prefix S[1, i+1] of S]
Ukkonen's Algorithm: Key Ideas (cont'd)
Further, divide each phase i+1 into i+1 extensions:
- Ext. 1: adding suffix S[1, i+1] of S[1, i+1] into I_i
- Ext. 2: adding suffix S[2, i+1] of S[1, i+1] into I_i
- ...
- Ext. j: adding suffix S[j, i+1] of S[1, i+1] into I_i
- ...
- Ext. i+1: adding suffix S[i+1, i+1] of S[1, i+1] into I_i
After i+1 extensions, we have I_{i+1}.
Ukkonen's Algorithm
Construct I_1;
For i = 1 To n-1 Do                /* phase loop: build I_{i+1} */
    For j = 1 To i+1 Do            /* extension loop */
        Find the end of the path labeled by S[j, i] in I_i;
        Add S[i+1] to the end by the suffix extension rules;
Convert I_n into a suffix tree of S.
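Ukkonen's Algorithm: Running Time
O(n^3): there are n phases with up to n extensions each, and locating the end of a path naively from the root takes O(n) time.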
Suffix Extension Rules
In phase i+1, extension j, the goal is to extend S[j, i] into S[j, i+1].
Rule 1: If the path β = S[j, i] (a suffix of S[1, i]) ends at a leaf, then add character S[i+1] to the end of the label on that leaf edge.
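Suffix Extension Rules (cont'd)
Rule 2: If the path does not end at a leaf and the character x that continues it is not S[i+1], then a new leaf edge starting from the end of the path must be created and labeled with S[i+1]. The new leaf is numbered j. If the path ends inside an edge, that edge is split and a new internal node is created there.
[Figure: new leaf edge labeled S[i+1] next to the existing continuation x; leaf j is created at extension j]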
Suffix Extension Rules (cont'd)
Rule 3: If some path from the end of string S[j, i] starts with S[i+1], i.e. the string S[j, i+1] is already in the tree, then we do nothing.
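The phase and extension structure, together with the three rules, can be traced with a deliberately naive sketch. It is an assumption-laden illustration rather than Ukkonen's linear-time algorithm: an uncompressed trie of nested dicts stands in for the implicit suffix tree (so rule 2 never needs to split an edge), every path end is located from the root, and the function name `extension_rules` is hypothetical.

```python
def extension_rules(s):
    """Return, for phases 2..n, the list of rules (1, 2 or 3) used by extensions 1..i+1."""
    root = {s[0]: {}}  # I_1: a single edge labeled S[1]
    phases = []
    for i in range(1, len(s)):      # 0-based i: this is phase i+1, adding character s[i]
        rules = []
        for j in range(i + 1):      # 0-based j: this is extension j+1
            node = root
            for ch in s[j:i]:       # naively locate the end of S[j, i] from the root
                node = node[ch]
            if s[i] in node:
                rules.append(3)     # S[j, i+1] is already in the tree: do nothing
            elif not node:
                rules.append(1)     # path ends at a leaf: extend the leaf
                node[s[i]] = {}
            else:
                rules.append(2)     # path ends inside the tree: new leaf edge
                node[s[i]] = {}
        phases.append(rules)
    return phases


print(extension_rules("xabxa"))
# [[1, 2], [1, 1, 2], [1, 1, 1, 3], [1, 1, 1, 3, 3]]
```

In every phase the rules appear in the order 1, then 2, then 3, which is the pattern the later tricks exploit.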
Suffix Trees: Ukkonen's Algorithm
How can we efficiently locate the ends of all the i+1 suffixes of S[1..i]? We need some tricks!
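Suffix Link
Definition: For an internal node v with path-label xα, where x is a single character and α is a (possibly empty) string, if there is another node s(v) with path-label α, then a pointer from v to s(v) is called a suffix link.
[Figure: suffix tree for xabxac$ with a suffix link from node v to node s(v)]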
Suffix Links
Lemma 6.1.1: If a new internal node v with path-label xα is added to the current tree in extension j of some phase i+1, then
- either the path labeled α already ends at an internal node u of the tree, and u = s(v);
- or the internal node labeled α will be created in extension j+1 of the same phase;
- or the string α is empty and s(v) is the root.
Suffix Links
Proof. v is created => rule 2 was used => xαc, with c ≠ S[i+1], is a path in the tree => αc is also a path in the tree.
Case 1) α ends at a node: we are done, since this node is s(v).
Case 2) α does not end at a node: extension j+1 will create a node s(v) at the end of α in the same phase.
Suffix Links
Corollary. Every newly created internal node will have a suffix link from it by the end of the next extension.
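Suffix links can be illustrated at the level of path-labels, without building the tree. The sketch below assumes that the string ends with a unique terminator; under that assumption the internal nodes (other than the root) correspond to the substrings that are followed by at least two distinct characters. The names `internal_node_labels` and `suffix_links` are hypothetical, and the enumeration is quadratic, for illustration only.

```python
def internal_node_labels(s):
    """Path-labels of the non-root internal nodes of the suffix tree of s (s ends with a unique terminator)."""
    followers = {}
    for i in range(len(s)):
        for j in range(i + 1, len(s)):
            # Substring s[i:j] is followed by character s[j] at this occurrence.
            followers.setdefault(s[i:j], set()).add(s[j])
    return {label for label, nxt in followers.items() if len(nxt) >= 2}


def suffix_links(s):
    """Map each internal path-label x+alpha to alpha; the empty string stands for the root."""
    labels = internal_node_labels(s)
    links = {}
    for label in labels:
        target = label[1:]
        # The target is always the root or another internal node.
        assert target == "" or target in labels
        links[label] = target
    return links


print(suffix_links("xabxac$") == {"xa": "a", "a": ""})  # True
```

For xabxac$, the node with path-label xa links to the node with path-label a, which in turn links to the root.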
Locate S[j, i] Using Suffix Links
Naively, in extension j of phase i+1, locate suffix S[j, i] of S[1, i] by matching it along a path from the root.
Using suffix links to shortcut the location:
- Starting at the end of S[j-1, i], walk up at most one node to v.
- Traverse the suffix link to s(v).
- Then walk down the tree to find the end of S[j, i].
[Figure: node v with path-label xα below the root, its suffix link to s(v), the end of S[j-1, i] below v and the end of S[j, i] below s(v)]
Trick 1: Skip-Count
Solution: the skip-count technique.
- At each node, only check the first character on the outgoing edge.
- Use the number of characters on that edge to update the search in O(1) time.
- The walk down then takes time proportional to the number of nodes on the path rather than the number of characters.
[Figure: walking down from s(v) to the end of suffix S[j, i] by jumping over whole edges, compared with the path from v to the end of suffix S[j-1, i]]
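A minimal sketch of the skip-count walk follows. It assumes the string being located is already known to lie on a path below the starting node, which is what the suffix-link argument guarantees. The tree representation (dicts mapping a first character to an edge label and a child) and the small hand-built chain of edges are hypothetical, chosen only for illustration.

```python
def skip_count(node, gamma):
    """Walk down along gamma, comparing only the first character of each edge.

    Returns (node, edge_label, offset, hops): the walk ends `offset` characters
    into `edge_label` below `node`, or exactly at `node` when edge_label is None.
    """
    pos, hops = 0, 0
    while pos < len(gamma):
        label, child = node[gamma[pos]]      # one lookup on the first character
        remaining = len(gamma) - pos
        if remaining < len(label):
            return node, label, remaining, hops  # gamma ends inside this edge
        pos += len(label)                    # skip the whole edge in O(1)
        node = child
        hops += 1
    return node, None, 0, hops


# Hypothetical chain of edges: ab -> cde -> fg -> hy
n3 = {"h": ("hy", {})}
n2 = {"f": ("fg", n3)}
n1 = {"c": ("cde", n2)}
root = {"a": ("ab", n1)}

end_node, edge, offset, hops = skip_count(root, "abcdefgh")
print(edge, offset, hops)  # hy 1 3
```

Locating the 8-character string takes 4 dictionary lookups and 3 node hops instead of 8 character comparisons.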
Trick 1: Skip-Count (cont'd)
The node-depth of v is the number of nodes on the path from the root to node v, denoted by level(v).
Lemma: At the moment of traversing a suffix link (v, s(v)), level(v) ≤ level(s(v)) + 1.
[Figure: node v with node-depth 4 and its suffix link to s(v) with node-depth 3]
Trick 1: Skip-Count (cont'd)
Theorem: Using suffix links and the skip-count trick, any phase takes O(n) time.
Proof.
- We go up at most n nodes over a phase (at most one per extension).
- We traverse at most n suffix links (at most one per extension).
- We must check how much we go down. Let level(j) be the level of the node reached by extension j. Walking up and traversing the suffix link each lower the level by at most one, so at extension j+1 we go down at most level(j+1) - level(j) + 2 nodes.
- Adding over all extensions of a phase, the sum telescopes and the total cost is O(n).
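Edge-label Compression
Problem: If edges are labeled with explicit substrings, the suffix tree may require Θ(n^2) space.
Example: S = abc...z (26 distinct characters). The 26 leaf edges carry the suffixes themselves, O(n^2) characters in total. Labeled with index pairs [1,26], [2,26], ..., [26,26], they need only O(n) symbols.
Solution: Label each edge with an index pair [i, j], denoting the substring S[i, j]. The suffix tree then requires only O(n) space, since it has O(n) edges.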
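The space difference can be made concrete for a string with all characters distinct, where every suffix hangs off the root as its own leaf edge. This is a small sketch under that assumption; the function name `label_costs` is hypothetical.

```python
import string


def label_costs(s):
    """Compare explicit edge labels with index pairs when all characters of s are distinct."""
    assert len(set(s)) == len(s), "sketch assumes all characters are distinct"
    n = len(s)
    # Leaf i (1-based) carries the whole suffix S[i, n] as its edge label.
    explicit_chars = sum(n - i for i in range(n))
    index_pairs = [(i + 1, n) for i in range(n)]
    return explicit_chars, index_pairs


chars, pairs = label_costs(string.ascii_lowercase)
print(chars)                 # 351
print(pairs[0], pairs[-1])   # (1, 26) (26, 26)
```

Explicit labels need 26 + 25 + ... + 1 = 351 characters, while the 26 index pairs need only 52 integers.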
Trick 2: Stopper
- In any phase i+1, if suffix extension rule 3 applies in extension j, it will also apply in all remaining extensions up to the end of phase i+1.
- Reason: S[j, i+1] is a substring of S[1..i] => S[k..i+1] for k > j is a substring of S[1..i].
- Recall that when applying rule 3, we do nothing. That implies some extensions can be done implicitly.
- Hence, end any phase i+1 the first time that extension rule 3 applies. Reduce work!
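The stopper property can be checked on plain strings, using the fact stated above that rule 3 applies in extension j of phase i+1 exactly when S[j, i+1] already occurs in S[1..i]. A hedged sketch with hypothetical function names:

```python
def first_rule3_extension(s, i):
    """For 0-based i (phase i+1 adds s[i]), return the 1-based first extension using rule 3, or None."""
    for j in range(i + 1):
        if s[j:i + 1] in s[:i]:  # S[j+1, i+1] already occurs in S[1..i]
            return j + 1
    return None


def rule3_is_monotone(s):
    """Check that once rule 3 applies in a phase, it applies in all later extensions of that phase."""
    for i in range(1, len(s)):
        flags = [s[j:i + 1] in s[:i] for j in range(i + 1)]
        if flags != sorted(flags):  # False sorts before True
            return False
    return True


print(first_rule3_extension("xabxa", 3))  # 4
print(rule3_is_monotone("mississippi"))   # True
```

In the phase that adds the second x of xabxa, extensions 1 to 3 extend leaves and extension 4 is the stopper.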
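Trick 3: Global Index
- Suppose that in phase i a leaf is created and labeled j. Then, at every extension j of the subsequent phases, j will always be a leaf ("once a leaf, always a leaf").
- Instead of labeling the leaf edge with (p, i), label it with (p, e), where e is a global index. Set e once in each phase i.
- In phase i, last(i) denotes the last extension in which rule 3 does not apply.
[Figure: timeline of phase i: extensions 1 to last(i) are handled by updating the index e, followed by the explicit extensions, ending at the stopper]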
Trick 3: Global Index (cont'd)
- After phase i, the suffixes S[j, i] for 1 ≤ j ≤ last(i) end at a leaf.
- So, after phase i, all extensions for 1 ≤ j ≤ last(i) apply rule 1. We only need to update e! Keep last(i).
- Note that last(i+1) ≥ last(i). It never shrinks!
- In phase i+1, explicitly compute the extensions for j ≥ last(i)+1 until the first rule 3 extension.
- Hence, phases i and i+1 share at most 1 explicit extension.
Time Complexity
- The implicit extensions cost constant time per phase, O(n) in total.
- There are at most 2n explicit extensions.
[Figure: explicit extensions per phase: phase i performs 1 2 3 4 5, phase i+1 performs 5 6 7 8, phase i+2 performs 8 9 10]
- The maximum number of down-walk skips is O(n).
- Therefore, the total time complexity is O(n)!
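The bound on explicit extensions can be observed with a string-level simulation. This is a sketch under assumptions, not an implementation of the tree: it counts, for phases 2 to n, the extensions from last(i)+1 up to and including the first rule 3 extension, testing rule 3 by substring search. The function name `explicit_extensions` is hypothetical.

```python
def explicit_extensions(s):
    """Number of explicit extensions in each of the phases 2..n."""
    last = 1        # after phase 1, leaf 1 exists, so last(1) = 1
    counts = []
    for i in range(1, len(s)):   # 0-based i: phase i+1 adds character s[i]
        j = last                 # 0-based index of extension last(i)+1
        count = 0
        while j <= i:
            count += 1
            if s[j:i + 1] in s[:i]:
                break            # rule 3: the stopper ends the phase
            last += 1            # rule 2 created leaf j+1
            j += 1
        counts.append(count)
    return counts


counts = explicit_extensions("xabxac$")
print(counts)                              # [1, 1, 1, 1, 3, 1]
print(sum(counts) <= 2 * len("xabxac$"))   # True
```

Each explicit extension either creates a new leaf (at most n times overall) or stops the phase (at most once per phase), which gives the 2n bound.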
Suffix Trees: Ukkonen's Algorithm
From an implicit tree to a true suffix tree:
Modification 1
- Add a terminal symbol $ to the end of S.
- Continue Ukkonen's algorithm with this character.
- Now no suffix is a prefix of any other suffix.
Modification 2
- Replace each index e on every leaf edge with the number n.
- This can be done in O(n) time via DFS.
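Putting the pieces together, here is a compact sketch of a linear-time construction in the style of Ukkonen's algorithm. It is one common way to organize the bookkeeping and is not taken from the slides: the "active point" variables (`node`, `edge`, `length`) and the counter `remainder` are assumptions of this sketch, as are all names. Leaf edges use `end=None` to play the role of the global index e, and the input is assumed to end with a unique terminator.

```python
class Node:
    def __init__(self, start, end=None):
        self.start, self.end = start, end  # edge label is s[start:end]; end=None means global index e
        self.children = {}
        self.link = None                   # suffix link (None is treated as the root)


def build_suffix_tree(s):
    root = Node(-1, -1)
    node, edge, length = root, 0, 0        # active point
    remainder = 0                          # suffixes still to be inserted explicitly
    for i, c in enumerate(s):              # phase i+1; global index e = i+1
        remainder += 1
        pending = None                     # internal node awaiting its suffix link
        while remainder:
            if length == 0:
                edge = i
            child = node.children.get(s[edge])
            if child is None:              # rule 2 at a node: new leaf edge
                node.children[s[edge]] = Node(i)
                if pending is not None:
                    pending.link, pending = node, None
            else:
                span = (i + 1 if child.end is None else child.end) - child.start
                if length >= span:         # skip-count: jump over the whole edge
                    node, edge, length = child, edge + span, length - span
                    continue
                if s[child.start + length] == c:   # rule 3: stopper
                    length += 1
                    if pending is not None:
                        pending.link = node
                    break
                split = Node(child.start, child.start + length)  # rule 2 inside an edge
                node.children[s[edge]] = split
                split.children[c] = Node(i)
                child.start += length
                split.children[s[child.start]] = child
                if pending is not None:
                    pending.link = split
                pending = split
            remainder -= 1
            if node is root and length:
                length -= 1
                edge = i - remainder + 1
            elif node is not root:
                node = node.link or root   # follow the suffix link
    return root


def occurrences(root, s, p):
    """1-based positions of p in s, read off the leaves below the end of p."""
    n, node, depth, k = len(s), root, 0, 0
    while k < len(p):
        child = node.children.get(p[k])
        if child is None:
            return []
        label = s[child.start:n if child.end is None else child.end]
        m = min(len(label), len(p) - k)
        if label[:m] != p[k:k + m]:
            return []
        k, depth, node = k + m, depth + len(label), child
    found, stack = [], [(node, depth)]
    while stack:
        v, d = stack.pop()
        if not v.children:
            found.append(n - d + 1)        # leaf: suffix start from its string-depth
        for ch in v.children.values():
            stack.append((ch, d + (n if ch.end is None else ch.end) - ch.start))
    return sorted(found)


text = "xabxac$"
print(occurrences(build_suffix_tree(text), text, "xa"))  # [1, 4]
```

Reading leaf edges with `end=None` as ending at n corresponds to Modification 2; a separate DFS that writes n into every leaf edge would make this explicit.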
Example: T=cc
Practical Implementation Issues
There are several possibilities to represent and search the branches out of the nodes of the tree:
- Store a vector of size O(|Σ|) at each node.
- Keep a list at each node.
- Maintain a balanced tree.
- Maintain a hash table.
Some implementations combine different representations: nodes at the top of the tree (in general those with the highest out-degree) make use of arrays, while nodes at lower levels employ lists.
Practical Considerations
- Traversing suffix links may cause several page faults.
- A lot of effort has gone into producing practical implementations.
- The linear time bound relies on the assumption that the alphabet is bounded.
- Optimal Suffix Tree Construction with Large Alphabets [Martin Farach, FOCS 1997].
References
- A. Aho and M. Corasick. Efficient string matching: an aid to bibliographic search. Comm. ACM, 18:333-340, 1975.
- P. Weiner. Linear pattern matching algorithms. Proceedings of the IEEE 14th Annual Symposium on Switching and Automata Theory, pages 1-11, 1973.
- E. McCreight. A space-economical suffix tree construction algorithm. Journal of the Association for Computing Machinery, 23(2):262-272, April 1976.
- E. Ukkonen. On-line construction of suffix trees. Algorithmica, 14(3):249-260, 1995.
- R. Giegerich and S. Kurtz. From Ukkonen to McCreight and Weiner: A Unifying View of Linear-Time Suffix Tree Construction. Report Nr. 94-03, Technische Fakultät der Universität Bielefeld, 1994.
- D. Gusfield. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, 1997.