Information Retrieval and Organisation

Suffix Trees
Adapted from http://www.math.tau.ac.il/~haimk/seminar02/suffixtrees.ppt
Dell Zhang, Birkbeck, University of London

Trie
A tree representing a set of strings, for example { aeef, ad, bbfe, bbfg, c }.
[Figure: trie for this set of strings]

Trie
Assume no string is a prefix of another.
1) Each edge is labeled by a letter.
2) No two edges outgoing from the same node are labeled the same.
3) Each string corresponds to a leaf.
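The three properties above can be illustrated with a minimal sketch, assuming a nested-dict representation (the function names `build_trie` and `contains` are illustrative, not from the slides):

```python
def build_trie(words):
    """Build a trie as nested dicts; each dict key is the letter on one edge."""
    root = {}
    for word in words:
        node = root
        for letter in word:
            # setdefault reuses an existing edge, so no two edges leaving
            # the same node ever carry the same letter.
            node = node.setdefault(letter, {})
    return root


def contains(trie, word):
    """Return True if `word` is spelled by a root-to-leaf path."""
    node = trie
    for letter in word:
        if letter not in node:
            return False
        node = node[letter]
    return not node  # an empty dict has no outgoing edges, i.e. it is a leaf


trie = build_trie(["aeef", "ad", "bbfe", "bbfg", "c"])
print(contains(trie, "bbfg"))  # a stored string
print(contains(trie, "bb"))    # only a prefix of stored strings
```

Because no stored string is a prefix of another, checking for a leaf is enough to decide membership in this sketch.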

Compressed Trie
Compress unary nodes and label edges by strings.
[Figure: compressed trie for the same set of strings]

Suffix Tree
Given a string s, a suffix tree of s is a compressed trie of all suffixes of s. To make these suffixes prefix-free we add a special character, say $, at the end of s.

Suffix Tree
For example, let s = abab. A suffix tree of s is a compressed trie of all suffixes of abab$:
{ $, b$, ab$, bab$, abab$ }
Note that a suffix tree has O(n) nodes, where n = |s|. Why?
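One way to answer the question, sketched under the assumptions that n ≥ 1 and that the terminator makes each of the n + 1 suffixes of s$ end at its own leaf:

```latex
\begin{align*}
\#\text{leaves} &= n + 1
  && \text{(one leaf per suffix of } s\$\text{)} \\
\#\text{internal nodes} &\le \#\text{leaves} - 1 = n
  && \text{(after compression every internal node has at least two children)} \\
\#\text{nodes} &\le (n + 1) + n = 2n + 1 = O(n)
\end{align*}
```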

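Suffix Tree Construction
The trivial algorithm: put the longest suffix into the tree first, then put in each remaining suffix in turn.
We will also label each leaf with the starting point of the corresponding suffix.
[Figure: suffix tree whose leaves are labeled with the suffix starting points 0 to 4]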

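Suffix Tree Construction
The trivial algorithm takes O(n^2) time. It is possible to build a suffix tree in O(n) time using Ukkonen's algorithm. But how can that be? Does the tree even take O(n) space? Writing every edge label out as a string can need more than linear space, so to use only O(n) space, encode each edge label as a pair (beginning-position, end-position) of positions in s.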

Consider the string aaaaaabbbbbb$. Its suffix tree has the following edge labels, each encoded as a (beginning-position, end-position) pair:
(0,0) (6,12) (1,1) (6,12) (2,2) (6,12) (3,3) (6,12) (4,4) (6,12) (5,12) (12,12) (6,6) (12,12) (7,7) (12,12) (8,8) (12,12) (9,9) (12,12) (10,10) (12,12) (11,12)
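The trivial quadratic construction combined with (beginning-position, end-position) edge labels can be sketched as follows. This is a minimal illustration, not Ukkonen's algorithm; the class `Node` and the functions `build_suffix_tree` and `edge_labels` are hypothetical names, and the input is assumed to end with a `$` that occurs nowhere else.

```python
class Node:
    def __init__(self, start, end, leaf=None):
        self.start = start    # the edge into this node is s[start..end], inclusive
        self.end = end
        self.leaf = leaf      # starting point of the suffix, for leaves only
        self.children = {}    # first letter of a child edge -> child node


def build_suffix_tree(s):
    """Insert the suffixes of s one by one, longest first (O(n^2) time)."""
    assert s.endswith("$") and s.count("$") == 1
    n = len(s)
    root = Node(-1, -1)
    for i in range(n):
        node, j = root, i     # j is the next unmatched position of suffix i
        while True:
            child = node.children.get(s[j])
            if child is None:  # no edge starts with s[j]: hang a new leaf here
                node.children[s[j]] = Node(j, n - 1, leaf=i)
                break
            k = child.start    # walk along the edge while the letters agree
            while k <= child.end and s[k] == s[j]:
                k += 1
                j += 1
            if k > child.end:  # whole edge matched: continue below it
                node = child
                continue
            # mismatch inside the edge: split it at position k
            mid = Node(child.start, k - 1)
            node.children[s[mid.start]] = mid
            child.start = k
            mid.children[s[k]] = child
            mid.children[s[j]] = Node(j, n - 1, leaf=i)
            break
    return root


def edge_labels(node):
    """Collect the (start, end) pair of every edge below `node`."""
    labels = []
    for letter in sorted(node.children):
        child = node.children[letter]
        labels.append((child.start, child.end))
        labels.extend(edge_labels(child))
    return labels


tree = build_suffix_tree("aaaaaabbbbbb$")
print(sorted(set(edge_labels(tree))))
```

Each node stores two integers instead of a substring, so the space per edge is constant regardless of how long its label is.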

Suffix Tree Applications
What can we do with it?
  Exact String Matching
  Exact Set Matching
  The Substring Problem for a Database of Patterns
  Longest Common Substring of Two Strings
  Recognising DNA Contamination
  Common Substrings of More Than Two Strings

Exact String Matching
Given a text T (|T| = n), pre-process it such that when a pattern P (|P| = m) arrives you can quickly decide whether it occurs in T. We may also want to find all occurrences of P in T.

Exact String Matching
In pre-processing, we just build a suffix tree in O(n) time.
[Figure: suffix tree whose leaves are labeled with the suffix starting points 0 to 4]

Exact String Matching
Given a pattern P = ab, we traverse the tree according to the pattern. If we do not get stuck while traversing the pattern then the pattern occurs in the text; otherwise it does not. Each leaf in the subtree below the node we reach corresponds to an occurrence. By traversing this subtree we get all k occurrences in O(m+k) time.
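The traversal can be sketched as below. To keep the example short it assumes an uncompressed suffix trie built in quadratic time rather than a linear-time suffix tree, and it assumes the text contains no `$`; the names `build_suffix_trie` and `occurrences` are illustrative.

```python
def build_suffix_trie(text):
    """Uncompressed trie of all suffixes of text + '$' (quadratic time and space)."""
    s = text + "$"
    root = {}
    for start in range(len(s)):
        node = root
        for letter in s[start:]:
            node = node.setdefault(letter, {})
        node["leaf"] = start  # label the leaf with the suffix starting point
    return root


def occurrences(trie, pattern):
    """Follow the pattern from the root, then collect every leaf below."""
    node = trie
    for letter in pattern:
        if letter not in node:
            return []         # stuck: the pattern does not occur
        node = node[letter]
    found, stack = [], [node]
    while stack:
        current = stack.pop()
        for key, value in current.items():
            if key == "leaf":
                found.append(value)
            else:
                stack.append(value)
    return sorted(found)


trie = build_suffix_trie("abab")
print(occurrences(trie, "ab"))
print(occurrences(trie, "aa"))
```

The key `"leaf"` is longer than one character, so it cannot collide with an edge letter in this representation.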

Exact String Matching
How do we match a pattern (query) against a database of strings (documents)?

Generalized Suffix Tree
Given a set of strings S, the generalized suffix tree of S is a compressed trie of all suffixes of each s ∈ S. To make these suffixes prefix-free we add a special character, say $, at the end of s. To associate each suffix with a unique string in S, add a different special character to each s. Each leaf node needs to be labelled by the document id together with the suffix position.

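Generalized Suffix Tree
For example, let s1 = abab and s2 = aab. Here is a generalized suffix tree for s1 and s2, built from the suffixes of abab$ and aab#:
{ $, #, b$, b#, ab$, ab#, bab$, aab#, abab$ }
[Figure: generalized suffix tree for s1 and s2, with each leaf labeled by its suffix position]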

Longest Common Substring
Given two strings s1 and s2, we build their generalized suffix tree. Every node with a leaf descendant from string s1 and a leaf descendant from string s2 represents a maximal common substring, and vice versa. Find such a node with the largest string depth.
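The idea can be sketched with a simplified structure: an uncompressed trie of the suffixes of both strings, where every node records which strings pass through it. Recording document ids on the nodes stands in for the leaf descendants with distinct terminators, so this is an assumption-laden illustration (quadratic time and space), and `longest_common_substring` is a hypothetical name.

```python
def longest_common_substring(s1, s2):
    """Deepest trie node that lies on a suffix of s1 and on a suffix of s2."""
    root = {"docs": set(), "kids": {}}
    for doc_id, text in ((1, s1), (2, s2)):
        for start in range(len(text)):
            node = root
            for letter in text[start:]:
                node = node["kids"].setdefault(letter, {"docs": set(), "kids": {}})
                node["docs"].add(doc_id)  # this node has a descendant from doc_id

    best = ""
    stack = [(root, "")]
    while stack:
        node, path = stack.pop()
        for letter, child in node["kids"].items():
            if child["docs"] == {1, 2}:   # shared by both strings
                if len(path) + 1 > len(best):
                    best = path + letter
                stack.append((child, path + letter))
    return best


print(longest_common_substring("abab", "aab"))
```

When several common substrings share the maximal length, this sketch returns whichever one the traversal meets first.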

Lowest Common Ancestor
A lot more can be gained from the suffix tree if we pre-process it so that we can answer LCA queries on it in constant time.

Lowest Common Ancestor
Why? The LCA of two leaves represents the longest common prefix (LCP) of the two corresponding suffixes.
[Figure: generalized suffix tree with leaves labeled by suffix positions]

Finding Maximal Palindromes
A palindrome is a string that reads the same forwards and backwards. To find all maximal palindromes in a string s (of length m), we build a generalized suffix tree for the string s and the reversed string s^r. The palindrome with centre between positions i-1 and i extends to each side by the length of the LCP of the suffix at position i of s and the suffix at position m-i of s^r.

Finding Maximal Palindromes
For example, consider the string cbaaba. Prepare a generalized suffix tree for s = cbaaba$ and s^r = abaabc#. For every i, find the LCA of the leaves for suffix i of s and suffix m-i of s^r. With constant-time LCA queries, all palindromes can be identified in linear time.

Let s = cbaaba$; then s^r = abaabc#.
[Figure: generalized suffix tree for s and s^r, with leaves labeled by suffix positions 0 to 6 of each string]
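The index arithmetic (suffix i of s against suffix m-i of the reversed string) can be checked with a small sketch. It computes each LCP by direct character comparison instead of a constant-time LCA query, so it is quadratic in the worst case, and it only covers palindromes centred between two positions (even length). The names `lcp_length` and `even_palindromes` are illustrative.

```python
def lcp_length(x, y):
    """Length of the longest common prefix of x and y."""
    k = 0
    while k < len(x) and k < len(y) and x[k] == y[k]:
        k += 1
    return k


def even_palindromes(s):
    """For each centre between i-1 and i, report (i, maximal palindrome)."""
    m = len(s)
    s_rev = s[::-1]
    result = []
    for i in range(1, m):
        # suffix i of s reads rightwards from the centre; suffix m-i of the
        # reversed string reads leftwards from the centre
        half = lcp_length(s[i:], s_rev[m - i:])
        if half > 0:
            result.append((i, s[i - half:i + half]))
    return result


print(even_palindromes("cbaaba"))
```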

Suffix Tree Drawbacks
The suffix tree is O(n), but the constant is quite big: it consumes a lot of space. Notice that if we indeed want to traverse an edge in O(1) time then we need an array (of pointers) of size |Σ| in each node, where Σ is the alphabet.

Suffix Array
It is much simpler and easier to implement. Compared with suffix trees, we lose some functionality, but we save space.

Suffix Array
For example, let s = abab. Sort the suffixes lexicographically: ab, abab, b, bab. The suffix array gives the indices of the suffixes in sorted order: 2 0 3 1.

Suffix Array Construction
The trivial algorithm: sort the suffixes with a comparison sort such as Quicksort.
The linear-time algorithm: build a suffix tree in O(n) time first, then traverse the tree depth-first, picking the edges outgoing from each node in lexicographic order, and fill the suffix array as the leaves are reached. The suffix array can also be built in O(n) time directly.
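The trivial algorithm fits in a few lines. This sketch uses Python's built-in sort in place of Quicksort and compares whole suffixes, so it is far from linear time; `suffix_array` is an illustrative name.

```python
def suffix_array(s):
    """Indices of the suffixes of s, in lexicographic order of the suffixes."""
    return sorted(range(len(s)), key=lambda i: s[i:])


print(suffix_array("abab"))
print(suffix_array("mississippi"))
```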

Exact String Matching
How do we search for a pattern P in the text T, using the suffix array of T? If P occurs in T, then all its occurrences are consecutive in the suffix array. So we can do two binary searches on the suffix array: the first search locates the starting position of the interval, and the second one determines the end position. It takes O(m log n) time, as a single suffix comparison needs to compare up to m characters.

Exact String Matching
It is also possible to do it in O(m + log n) time with an additional array of LCP values (Manber & Myers, 1990).

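T = mississippi, P = iss
[Figure: binary search over the suffix array, with pointers L, M and R]
Suffix array of T, with the suffix that starts at each index:
10  i
 7  ippi
 4  issippi
 1  ississippi
 0  mississippi
 9  pi
 8  ppi
 6  sippi
 3  sissippi
 5  ssippi
 2  ssissippi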
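The two binary searches can be sketched as follows, assuming the suffix array is built by plain sorting and that each comparison looks only at the first m characters of a suffix. The name `find_occurrences` is illustrative; this is the O(m log n) variant, not the LCP-accelerated one.

```python
def find_occurrences(text, sa, pattern):
    """Return the sorted starting positions of pattern in text, given its suffix array."""
    m = len(pattern)

    # First search: leftmost suffix whose first m characters are >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    left = lo

    # Second search: leftmost suffix whose first m characters are > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # All occurrences sit in the consecutive interval sa[left:lo].
    return sorted(sa[left:lo])


text = "mississippi"
sa = sorted(range(len(text)), key=lambda i: text[i:])
print(find_occurrences(text, sa, "iss"))
```

The second search starts from the left boundary found by the first, since the end of the interval cannot lie before its start.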