CPSC 335 Intermediate Information Structures
LECTURE 13: Suffix Trees
Jon Rokne, Computer Science, University of Calgary, Canada
Modified from CMSC 423 - Todd Treangen, UMD upd
Preprocessing Strings
We will look at:
- Suffix Tries
- Suffix Trees
- Suffix Arrays
- Burrows-Wheeler transform
Typical setting: a long, known, and fixed text string (like a genome) and many unknown, changing query strings. We are allowed to preprocess the text string once in anticipation of the future unknown queries. The data structures will be useful in other settings as well.
For example, a text T might be a set of genomic sequences, and the queries might be short words over the alphabet {a, c, g, t} describing transcription factor binding sites.
Suffix Tries
A trie, pronounced "try", is a tree that exploits some structure in the keys.
- E.g. if the keys are strings, a binary search tree would compare the entire strings, but a trie would look at their individual characters.
- A suffix trie is a data structure for storing a string that allows many kinds of queries to be answered quickly (its compressed form, the suffix tree, is also space-efficient).
- Suffix trees are hugely important for searching large sequences like genomes. They are the basis for a tool called MUMMer (developed by UMD faculty).
Suffix Tries
Let the string be y and the trie S(y).
- All leaves of S(y) are labeled by the suffixes of y.
- Edges of S(y) are labeled by letters of the alphabet used for S(y), with an added sentinel character, say $, which is not part of the alphabet.
- Internal nodes are branching nodes if they have at least two children.
- Edges outgoing from branching nodes are labeled by different letters.
Suffix Tries: the role of suffixes and the sentinel
Consider the text abca over the alphabet {a, b, c}. It has the following suffixes: abca, bca, ca, and a.
The sentinel $: in the following, we want to ensure that no suffix is a prefix of any other. To do so, we append a special character $, not in the alphabet, to the end of the text.
Now consider the text abca$. It has the following suffixes: abca$, bca$, ca$, a$, and $, and now a$ is not a prefix of abca$.
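The effect of the sentinel can be checked directly. The following is a minimal sketch; the helper names `suffixes` and `nested_suffixes` and the example text are assumptions made for illustration, not part of the lecture.

```python
def suffixes(text):
    """Return all non-empty suffixes of text, longest first."""
    return [text[i:] for i in range(len(text))]


def nested_suffixes(text):
    """Return pairs (short, long) where suffix `short` is a proper prefix of suffix `long`."""
    sufs = suffixes(text)
    return [(a, b) for a in sufs for b in sufs if a != b and b.startswith(a)]


plain = "abca"                 # assumed example text
with_sentinel = plain + "$"    # "$" is assumed not to occur in the text

print(suffixes(plain))                  # ['abca', 'bca', 'ca', 'a']
print(nested_suffixes(plain))           # [('a', 'abca')]
print(nested_suffixes(with_sentinel))   # []
```

Without the sentinel, the suffix a would end at an internal node of the trie; with it, every suffix ends at its own leaf.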
Suffix Tries
Queries are prefixes of suffixes: to determine whether a given query q is contained in the text, we could simply check whether q is the prefix of one of the suffixes.
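As a sanity check of this idea, the sketch below tests a query against every suffix explicitly. It is deliberately naive (no trie yet), and the function name `is_substring_via_suffixes` is a hypothetical one chosen for this example.

```python
def is_substring_via_suffixes(q, text):
    """Return True if q is a prefix of at least one suffix of text.

    The empty suffix is included so that the empty query is also found.
    """
    return any(text[i:].startswith(q) for i in range(len(text) + 1))


text = "abca$"   # assumed example text with sentinel
for q in ["bc", "ca", "cb", ""]:
    # The result agrees with Python's built-in substring test.
    print(repr(q), is_substring_via_suffixes(q, text), q in text)
```

A suffix trie performs the same test, but shares the work across all suffixes by storing their common prefixes only once.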
Founding Editor-in-Chief of the IEEE/ACM Transactions on Computational Biology and Bioinformatics
Suffixes of a character string
s is a string over the alphabet {a, b} with a termination symbol $, also known as the sentinel.
The suffixes of s range from s itself (the trivial suffix) down to $ alone (the sentinel suffix).
Suffix Tries
SufTrie(s) = suffix trie representing string s.
- Edges of the suffix trie are labeled with letters from the alphabet (say {a, b}).
- Every path from the root to a solid node represents a suffix of s.
- Every suffix of s is represented by some path from the root to a solid node.
Why are all the solid nodes leaves? How many leaves will there be?
Processing Strings Using Suffix Tries
Given a suffix trie T and a string q, how can we:
- determine whether q is a substring of T?
- check whether q is a suffix of T?
- count how many times q appears in T?
- find the longest repeat in T?
- find the longest common substring of T and q?
Main idea: every substring of s is a prefix of some suffix of s.
Searching Suffix Tries
Is the query a substring of s? Follow the path given by the query string.
After we've built the suffix trie, queries can be answered in time O(|query|), regardless of the text size.
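A minimal sketch of this search, assuming the trie is stored as nested dictionaries (one dictionary per node, keyed by edge letter). The quadratic construction, the function names, and the example text are assumptions for illustration only.

```python
def build_trie(text):
    """Naive suffix trie as nested dicts: insert every suffix character by character."""
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})
    return root


def contains(root, query):
    """Follow the path spelled by query; one dictionary lookup per query character."""
    node = root
    for ch in query:
        if ch not in node:
            return False      # dead end: query is not a substring
        node = node[ch]
    return True


trie = build_trie("abaaba$")      # assumed example text with sentinel
print(contains(trie, "baa"))      # True
print(contains(trie, "bb"))       # False
```

The loop in `contains` runs once per query character and never looks at the text itself, which is where the O(|query|) bound comes from.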
Suffix Links in a Suffix Trie
To understand suffix links, first recall that there are three kinds of nodes in a suffix tree:
- The root
- Internal nodes
- Leaf nodes
In the accompanying figure, which shows the suffix tree for ABABABC, the yellow circle is the root, the grey, blue and green ones are internal nodes, and the small black ones are leaves.
Suffix Links in a Suffix Trie
There are two important things to notice:
- Internal nodes have either one or more than one outgoing edge. Internal nodes with more than one outgoing edge mark those parts of the tree where branching occurs. Branching occurs wherever a repeated string is involved, and only there.
- For any internal node X, the string leading from the root to X must have occurred in the input string at least as many times as there are outgoing edges from X.
Example: the string leading to the blue node is ABAB. Indeed, this string appears twice in the input string: at position 0 and at position 2. And that is why the blue node exists.
Suffix Links in a Suffix Trie
If the string s leading up to some internal node X is longer than 1 character, the same string minus the first character (call this s-1) must be in the tree, too (it's a suffix tree, after all, so the suffix of any of its strings must be in the tree, too).
Example: let s = ABAB, the string leading to the blue node. Then after removing the first character, s-1 is BAB. And indeed that string is found in the tree, too. It leads to a green node (labelled "This node" in the figure).
Suffix Links in a Suffix Trie
If some string s leads to an internal node, its shortened version s-1 must lead to an internal node (call it X-1) as well. Why? Because s must appear at least twice in the input string, so s-1 must appear at least as many times (because it is part of s: wherever s appears, s-1 must appear, too). Moreover, every character that follows s somewhere in the input also follows s-1 there, so s-1 is followed by at least two different characters, and there must be an internal (branching) node for it.
In any such situation, the special link connecting X to X-1 is a suffix link.
Suffix Links in a Suffix Trie
Every internal node X with more than 1 outlink must have a suffix link to exactly one other internal node.
This is the same suffix tree as before; the dotted lines in the figure indicate the suffix links. If you start at the blue node and follow the suffix links from there (from blue, to green, to first pink, to second pink), and look at the strings leading from the root to each node, you will see this:
ABAB   -> BAB     -> AB      -> B
(blue)    (green)    (pink1)    (pink2)
This is why they are called suffix links (the entire sequence is called a suffix chain).
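The suffix chain for ABABABC can be checked without building any tree: each string in the chain should occur at least as often as the one before it. The sketch below does this by brute force; the helper names `occurrences` and `suffix_chain` are assumptions for illustration.

```python
def occurrences(text, sub):
    """Start positions of every (possibly overlapping) occurrence of sub in text."""
    return [i for i in range(len(text) - len(sub) + 1) if text.startswith(sub, i)]


def suffix_chain(s):
    """s, then s minus its first character, and so on down to one character."""
    return [s[k:] for k in range(len(s))]


text = "ABABABC"
for link in suffix_chain("ABAB"):
    print(link, occurrences(text, link))
# ABAB [0, 2]
# BAB [1, 3]
# AB [0, 2, 4]
# B [1, 3, 5]
```

Every string in the chain is repeated, which is consistent with each suffix link ending at a branching node.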
Applications of Suffix Tries (I)
- Check whether q is a substring of T: follow the path for q starting from the root. If you exhaust the query string, then q is in T.
- Check whether q is a suffix of T: follow the path for q starting from the root. If you end at a leaf at the end of q (with the sentinel $ appended to q), then q is a suffix of T.
- Count # of occurrences of q in T: follow the path for q starting from the root. The number of leaves under the node you end up in is the number of occurrences of q.
- Find the longest repeat in T: find the deepest node that has at least 2 leaves under it.
- Find the lexicographically (alphabetically) first suffix: start at the root, and at every node follow the edge labeled with the lexicographically (alphabetically) smallest letter until you reach a leaf.
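The five queries above can be sketched on a naive nested-dictionary trie. This is an illustration under assumptions, not the lecture's code: the representation (a leaf is an empty dict), all function names, and the example text are chosen here, and leaves are recounted on every call rather than cached.

```python
def build_trie(text):
    """Naive suffix trie as nested dicts; text is assumed to end in a unique sentinel."""
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})
    return root


def walk(root, q):
    """Follow q from the root; return the node reached, or None at a dead end."""
    node = root
    for ch in q:
        node = node.get(ch)
        if node is None:
            return None
    return node


def count_leaves(node):
    """Number of leaves at or below node (a leaf is an empty dict)."""
    if not node:
        return 1
    return sum(count_leaves(child) for child in node.values())


def is_substring(root, q):
    return walk(root, q) is not None


def is_suffix(root, q, sentinel="$"):
    """q is a suffix iff the path for q + sentinel exists (it must end at a leaf)."""
    return walk(root, q + sentinel) is not None


def count_occurrences(root, q):
    node = walk(root, q)
    return 0 if node is None else count_leaves(node)


def longest_repeat(root):
    """String spelled by the deepest node that has at least two leaves below it."""
    best = ""
    stack = [(root, "")]
    while stack:
        node, path = stack.pop()
        if count_leaves(node) >= 2 and len(path) > len(best):
            best = path
        for ch, child in node.items():
            stack.append((child, path + ch))
    return best


def first_suffix(root):
    """Follow the smallest edge label at every node until a leaf is reached."""
    node, path = root, ""
    while node:
        ch = min(node)
        path += ch
        node = node[ch]
    return path


trie = build_trie("abaaba$")          # assumed example text
print(is_substring(trie, "baa"))      # True
print(is_suffix(trie, "ba"))          # True
print(count_occurrences(trie, "a"))   # 4
print(longest_repeat(trie))           # aba
print(first_suffix(trie))             # $
```

Note that the sentinel sorts before the letters in ASCII, so the lexicographically first suffix of a sentinel-terminated text is the sentinel itself.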
Suffix Links
Suffix links connect the node representing xα to the node representing α.
The most important suffix links are the ones connecting suffixes of the full string (shown at right in the figure). But every node has a suffix link. Why?
How do we know a node representing α exists for every node representing xα?
Suffix Tries
A node represents a prefix xα of some suffix of the string.
The node's suffix link should link to the prefix α of the suffix s that is 1 character shorter.
Since the suffix trie contains all suffixes, it contains a path representing s, and therefore contains a node representing every prefix of s.
Applications of Suffix Tries (II)
Find the longest common substring of T and q:
- Walk down the tree following q.
- If you hit a dead end, save the current depth, and follow the suffix link from the current node.
- When you exhaust q, return the longest substring found.
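The walk described above can be sketched as follows. To keep the example short, the trie is built naively and the suffix links are filled in afterwards by a breadth-first pass (an assumption of this sketch; it is not the incremental construction). The class and function names are hypothetical.

```python
from collections import deque


class Node:
    def __init__(self):
        self.children = {}   # letter -> child node
        self.link = None     # suffix link: node for this string minus its first letter


def build_trie_with_links(text):
    """Naive suffix trie of text, then suffix links computed level by level."""
    root = Node()
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            if ch not in node.children:
                node.children[ch] = Node()
            node = node.children[ch]
    root.link = root
    queue = deque()
    for child in root.children.values():
        child.link = root                 # one-letter strings link to the root
        queue.append(child)
    while queue:
        node = queue.popleft()
        for ch, child in node.children.items():
            # node spells x+alpha, child spells x+alpha+ch, so link to alpha+ch
            child.link = node.link.children[ch]
            queue.append(child)
    return root


def longest_common_substring(text, q):
    root = build_trie_with_links(text)
    node, depth = root, 0
    best_len, best_end = 0, 0
    for pos, ch in enumerate(q):
        # Dead end: drop leading letters of the current match via suffix links.
        while node is not root and ch not in node.children:
            node = node.link
            depth -= 1
        if ch in node.children:
            node = node.children[ch]
            depth += 1
        if depth > best_len:
            best_len, best_end = depth, pos + 1
    return q[best_end - best_len:best_end]


print(longest_common_substring("banana", "ananas"))   # anana
```

Each character of q moves the walk down at most once, and every suffix-link step reduces the depth by one, so the walk itself takes O(|q|) steps once the trie exists.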
Constructing Suffix Tries
Suppose we want to build a suffix trie for a string s.
We will walk down the string from left to right, building suffix tries for s[0], s[0..1], s[0..2], ..., s[0..n].
To build the suffix trie for s[0..i], we will use the suffix trie for s[0..i-1] built in the previous step.
To convert SufTrie(s[0..i-1]) to SufTrie(s[0..i]), add character s[i] to all the suffixes.
Example with i=4 and s[4] = c: we need to add nodes for the suffixes s[0..4], s[1..4], s[2..4], s[3..4], and s[4..4], each of which ends in c.
The purple parts (everything before the final c) are suffixes that already exist in SufTrie(s[0..i-1]). Why?
How can we find these suffixes quickly?
[Figure: SufTrie(s[0..i-1]) and SufTrie(s[0..i]) side by side for i=4, with the new nodes for the suffixes ending in c.]
Where is the new deepest node (aka the longest suffix)?
How do we add the suffix links for the new nodes?
To build SufTrie(s[0..i]) from SufTrie(s[0..i-1]):
CurrentSuffix = longest (aka deepest suffix)
Repeat:
    Add child labeled s[i] to CurrentSuffix.
    Follow suffix link to set CurrentSuffix to next shortest suffix.
until you reach the root or the current node already has an edge labeled s[i] leaving it.
(You can stop there because if you already have a node for suffix αs[i], then you have a node for every smaller suffix.)
Add suffix links connecting the nodes you just added, in the order in which you added them.
In practice, you add these links as you go along, rather than at the end.
Python Code to Build Suffix Trie

    def build_suffix_trie(s):
        """Construct a suffix trie."""
        assert len(s) > 0

        class SuffixNode:
            def __init__(self, suffix_link=None):
                self.children = {}
                if suffix_link is not None:
                    self.suffix_link = suffix_link
                else:
                    # the root's suffix link points to itself
                    self.suffix_link = self

            def add_link(self, c, v):
                """Link this node to node v via character c."""
                self.children[c] = v

        # explicitly build the two-node suffix trie
        Root = SuffixNode()  # the root node
        Longest = SuffixNode(suffix_link=Root)
        Root.add_link(s[0], Longest)

        # for every character left in the string
        for c in s[1:]:
            Current = Longest
            Previous = None
            while c not in Current.children:
                # create new node r1 with transition Current -c-> r1
                r1 = SuffixNode()
                Current.add_link(c, r1)

                # if we came from some previous node, make that
                # node's suffix link point here
                if Previous is not None:
                    Previous.suffix_link = r1

                # walk down the suffix links
                Previous = r1
                Current = Current.suffix_link

            # make the last suffix link
            if Current.children[c] is Previous:
                # c was new at the root, so Previous is the one-character
                # node for c and its suffix is the empty string
                Previous.suffix_link = Root
            else:
                # the walk stopped at a node that already had a c edge
                Previous.suffix_link = Current.children[c]

            # move to the newly added child of the longest path
            # (which is the new longest path)
            Longest = Longest.children[c]
        return Root
[Figure: successive steps along the boundary path, with nodes labeled longest, current, Prev, and u, and new edges labeled s[i].]
Note: there's already a path for this suffix, so we don't change it (we just add a suffix link to it).
How many nodes can a suffix trie have?
s = a^n b^n will have:
- 1 root node
- n nodes in a path of b's
- n paths of n+1 nodes
Total = n(n+1) + n + 1 = O(n^2) nodes.
This is not very efficient. How could you make it smaller?
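The count can be confirmed by building the trie for small n. This is a minimal sketch that assumes a nested-dictionary trie without a sentinel, matching the count on the slide; the function name is hypothetical.

```python
def count_trie_nodes(text):
    """Count the nodes (including the root) of the naive suffix trie of text."""
    root = {}
    count = 1                       # the root
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            if ch not in node:
                node[ch] = {}
                count += 1          # one new node per new edge
            node = node[ch]
    return count


for n in range(1, 6):
    s = "a" * n + "b" * n
    # measured node count next to the formula n(n+1) + n + 1
    print(n, count_trie_nodes(s), n * (n + 1) + n + 1)
```

The two printed columns agree, and both equal (n+1)^2, so a text of length 2n needs a quadratic number of trie nodes.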
So... we have to "trie" again...
Space-Efficient Suffix Trees
A More Compact Representation
Compress paths where there are no choices.
Represent the sequence along each path using a range [i,j] that refers to the input string s (in the figure the positions of s are numbered 1 to 7, and the edges carry labels such as 4:7, 5:6, 6:6, and 7:7).
Space usage
In the compressed representation:
- # leaves = O(n) [one leaf for each position in the string]
- Every internal node is at least a binary split.
- Each edge uses O(1) space.
Therefore, the number of internal nodes is at most the number of leaves. The number of edges is then at most twice the number of leaves, and the space per edge is O(1). Hence, linear space.
Trivial algorithm to build a suffix tree
Put the largest suffix in.
Then put each shorter suffix in, one at a time, splitting an existing edge wherever the new suffix diverges from it.
We will also label each leaf with the starting point of the corresponding suffix (in the figure, the leaves are labeled 1, 4, 3, 2, 5).
Analysis
Takes O(n^2) time to build.
We will see how to do it in O(n) time.
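The trivial algorithm can be sketched as below: suffixes are inserted longest first, edges store index ranges into s instead of copies of the text, and an edge is split where a new suffix diverges. The class layout and names are assumptions for illustration; positions are 0-based here, unlike the 1-based leaf labels on the slides.

```python
class Node:
    def __init__(self, start=0, end=0):
        self.start, self.end = start, end   # incoming edge is labeled s[start:end]
        self.children = {}                  # first letter of an edge -> child node
        self.leaf_label = None              # start position of the suffix (leaves only)


def build_suffix_tree_naive(s):
    """Insert every suffix of s in turn. s must end in a unique sentinel. O(n^2) time."""
    root = Node()
    n = len(s)
    for i in range(n):
        node, pos = root, i                 # pos: next unmatched letter of suffix i
        while True:
            child = node.children.get(s[pos])
            if child is None:               # no edge starts with s[pos]: new leaf
                leaf = Node(pos, n)
                leaf.leaf_label = i
                node.children[s[pos]] = leaf
                break
            k = child.start                 # match along the edge into child
            while k < child.end and s[k] == s[pos]:
                k += 1
                pos += 1
            if k == child.end:              # whole edge matched: keep descending
                node = child
                continue
            mid = Node(child.start, k)      # mismatch inside the edge: split at k
            node.children[s[child.start]] = mid
            child.start = k
            mid.children[s[k]] = child
            leaf = Node(pos, n)
            leaf.leaf_label = i
            mid.children[s[pos]] = leaf
            break
    return root


def leaves_and_internal(root):
    """Sorted leaf labels and the number of internal nodes (root included)."""
    leaves, internal, stack = [], 0, [root]
    while stack:
        node = stack.pop()
        if node.children:
            internal += 1
            stack.extend(node.children.values())
        else:
            leaves.append(node.leaf_label)
    return sorted(leaves), internal


leaves, internal = leaves_and_internal(build_suffix_tree_naive("abaaba$"))
print(leaves, internal)    # [0, 1, 2, 3, 4, 5, 6] 4
```

For this 7-character text the tree has 7 leaves and only 4 internal nodes, in line with the linear space bound, while construction still rescans each suffix from the root.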
Constructing Suffix Trees - Ukkonen's Algorithm
The same idea as with the suffix trie algorithm.
Main difference: not every trie node is explicitly represented in the tree.
Solution: represent trie nodes as pairs (u, α), where u is a real node in the tree and α is some string leaving it. A suffix link can then point to such a pair: suffix_link[v] = (u, α).
Some additional tricks are needed to get to O(n) time.
Storing more than one string with Generalized Suffix Trees
Constructing Generalized Suffix Trees
Goal. Represent a set of strings P = {s_1, s_2, s_3, ..., s_m}.
Example. Three short strings over {a, g, t}, joined with distinct end markers #1, #2, #3.
Simple solution:
(1) Build a suffix tree for the concatenated string s_1 #1 s_2 #2 ... s_m #m.
(2) For every leaf node, remove any text after the first # symbol.
[Figure: the suffix tree of the concatenated string before and after trimming the leaf labels at the first # symbol.]
Applications of Generalized Suffix Trees
Longest common substring of S and T:
- Build a generalized suffix tree for {S, T}.
- Find the deepest node that has descendants from both strings (containing both #1 and #2).
Determine the strings in a database {S_1, S_2, S_3, ..., S_m} that contain query string q:
- Build a generalized suffix tree for {S_1, S_2, S_3, ..., S_m}.
- Follow the path for q in the suffix tree.
- Suppose you end at node u: traverse the tree below u, and output i if you find a string containing #i.
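Both applications can be sketched on a simplified structure. The sketch below swaps in an uncompressed generalized suffix trie, and instead of end markers #i it records at each node the set of string indices whose suffixes pass through it (0-based). These choices, the names, and the example strings are assumptions for illustration, not the construction from the slides.

```python
def new_node():
    return {"children": {}, "ids": set()}


def build_generalized_trie(strings):
    """Insert every suffix of every string; tag each node with the strings that reach it."""
    root = new_node()
    root["ids"] = set(range(len(strings)))
    for idx, s in enumerate(strings):
        for i in range(len(s)):
            node = root
            for ch in s[i:]:
                if ch not in node["children"]:
                    node["children"][ch] = new_node()
                node = node["children"][ch]
                node["ids"].add(idx)
    return root


def longest_common_substring(strings):
    """Deepest node reached by suffixes of all the strings."""
    root = build_generalized_trie(strings)
    everyone = set(range(len(strings)))
    best = ""
    stack = [(root, "")]
    while stack:
        node, path = stack.pop()
        for ch, child in node["children"].items():
            if child["ids"] == everyone:
                if len(path) + 1 > len(best):
                    best = path + ch
                stack.append((child, path + ch))
    return best


def strings_containing(root, q):
    """Indices of the strings that contain q: the tags on the node where q ends."""
    node = root
    for ch in q:
        node = node["children"].get(ch)
        if node is None:
            return set()
    return set(node["ids"])


print(longest_common_substring(["banana", "ananas"]))    # anana
db = build_generalized_trie(["att", "tag", "gat"])       # assumed example database
print(strings_containing(db, "at"))                      # {0, 2}
```

Storing the index sets on the nodes answers the database query without traversing the subtree below u, at the cost of extra space per node.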
Longest Common Extension
Longest common extension: we are given strings S and T. In the future, many pairs (i,j) will be provided as queries, and we want to quickly find the longest substring of S starting at i that matches a substring of T starting at j.
Preprocessing:
- Build a generalized suffix tree for S and T. O(|S| + |T|)
- Preprocess the tree so that lowest common ancestors (LCA) can be found in constant time. O(|S| + |T|)
- Create an array mapping suffix numbers to leaf nodes. O(|S| + |T|)
Given a query (i,j):
- Find the leaf nodes for i and j. O(1)
- Return the string of the LCA for i and j. O(1)
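For testing a constant-time implementation, it helps to have a reference answer. The sketch below computes LCE by scanning the two strings directly, so each query costs time proportional to the answer; it is a stand-in for checking results, not the LCA-based method from the slide. The function name, 0-based indexing, and the example strings are assumptions.

```python
def lce(S, T, i, j):
    """Length of the longest common prefix of S[i:] and T[j:], by direct scan."""
    k = 0
    while i + k < len(S) and j + k < len(T) and S[i + k] == T[j + k]:
        k += 1
    return k


S = "abcabd"      # assumed example strings
T = "xcabdz"
k = lce(S, T, 2, 1)
print(k, S[2:2 + k])    # 4 cabd
```

In the suffix tree version, the same number is the string depth of the lowest common ancestor of the leaf for suffix i of S and the leaf for suffix j of T.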
Using LCE to Find Palindromes
Maximal even palindrome at position i: the longest string to the left and right of i so that the left half is equal to the reverse of the right half.
Goal: find all maximal palindromes in S.
- Construct S^r, the reverse of S. O(|S|)
- Preprocess S and S^r so that LCE queries can be solved in constant time (previous slide). O(|S|)
- LCE(i, n-i) compares S from position i onward with S^r from position n-i onward, so it matches outward from the center i and gives the half-length of the longest even palindrome centered at i.
- For every position i: compute LCE(i, n-i). O(|S|) queries at O(1) each.
Total time = O(|S|)
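The reduction can be sketched as follows, using 0-based positions where center i lies between S[i-1] and S[i]. The LCE here is a direct scan rather than the constant-time suffix tree query, so this sketch is quadratic in the worst case; it only illustrates how LCE(i, n-i) on S and its reverse yields the palindromes. All names and the example string are assumptions.

```python
def lce(S, T, i, j):
    """Length of the longest common prefix of S[i:] and T[j:], by direct scan."""
    k = 0
    while i + k < len(S) and j + k < len(T) and S[i + k] == T[j + k]:
        k += 1
    return k


def maximal_even_palindromes(S):
    """Map each center i (between S[i-1] and S[i]) to its maximal even palindrome."""
    n = len(S)
    Sr = S[::-1]                      # Sr[n-i:] is the reverse of S[:i]
    result = {}
    for i in range(1, n):
        k = lce(S, Sr, i, n - i)      # half-length of the palindrome centered at i
        if k > 0:
            result[i] = S[i - k:i + k]
    return result


print(maximal_even_palindromes("abbaxyyx"))   # {2: 'abba', 6: 'xyyx'}
```

Replacing `lce` with a constant-time query on the generalized suffix tree of S and its reverse gives the linear total time stated on the slide.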
Recap
- Suffix tries are a natural way to store a string: search, count occurrences, and many other queries are answerable easily.
- But they are not space efficient: O(n^2) space.
- Suffix trees are space optimal: O(n), but require a little more subtle algorithm to construct.
- Suffix trees can be constructed in O(n) time using Ukkonen's algorithm.
- Similar ideas can be used to store sets of strings.