COMBINATORIAL PATTERN MATCHING
Genomic Repeats
  Example of repeats: ATGGTCTAGGTCCTAGTGGTC
  Motivation to find them:
    Genomic rearrangements are often associated with repeats
    Trace evolutionary secrets
    Many tumors are characterized by an explosion of repeats
Genomic Repeats
  The problem is often more difficult, because the copies need not be exact: ATGGTCTAGGACCTAGTGTTC
l-mer Repeats
  Long repeats are difficult to find
  Short repeats are easy to find (e.g., by hashing)
  Simple approach to finding long repeats:
    Find exact repeats of short l-mers (l is usually 10 to 13)
    Use l-mer repeats to potentially extend into longer, maximal repeats
l-mer Repeats (cont'd)
  There are typically many locations where an l-mer is repeated:
    GCTTACAGATTCAGTCTTACAGATGGT
  The 4-mer TTAC starts at locations 3 and 17
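Extending l-mer Repeats
  GCTTACAGATTCAGTCTTACAGATGGT
  Extend these 4-mer matches in both directions for as long as the two copies agree
  Extending to the right gives TTACAGAT; both copies are also preceded by C
  Maximal repeat: CTTACAGAT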
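The extension step can be sketched in a few lines. This is only an illustration: the function name `extend_repeat` and the 0-based indexing are assumptions, not part of the slides. It extends a matching pair of l-mers to the left and to the right while the neighbouring characters agree. On the example string both copies of TTAC are preceded by C, so extending in both directions yields CTTACAGAT.

```python
def extend_repeat(text, i, j, l):
    """Extend an exact l-mer match at 0-based starts i < j to a maximal repeat.

    Returns (start_i, start_j, length) of the extended match.
    """
    # Extend to the left while the preceding characters agree.
    while i > 0 and text[i - 1] == text[j - 1]:
        i -= 1
        j -= 1
        l += 1
    # Extend to the right while the following characters agree.
    while j + l < len(text) and text[i + l] == text[j + l]:
        l += 1
    return i, j, l


text = "GCTTACAGATTCAGTCTTACAGATGGT"
# The 4-mer TTAC starts at 1-based locations 3 and 17, i.e. 0-based 2 and 16.
i, j, length = extend_repeat(text, 2, 16, 4)
print(text[i:i + length], i + 1, j + 1)  # CTTACAGAT 2 16
```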
Maximal Repeats
  To find maximal repeats in this way, we need ALL start locations of all l-mers in the genome
  Hashing lets us find repeats quickly in this manner
Hashing DNA sequences
  Each l-mer can be translated into a binary string (A, T, C, G can be represented as 00, 01, 10, 11)
  After assigning a unique integer per l-mer, it is easy to get all start locations of each l-mer in a genome
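As a small illustration of the two-bit encoding, the sketch below packs an l-mer into an integer using the codes given above. The names `CODE` and `lmer_to_int` are made up for this example.

```python
# Two-bit codes as listed on the slide: A=00, T=01, C=10, G=11.
CODE = {"A": 0b00, "T": 0b01, "C": 0b10, "G": 0b11}


def lmer_to_int(lmer):
    """Pack an l-mer into an integer, two bits per nucleotide."""
    value = 0
    for base in lmer:
        value = (value << 2) | CODE[base]
    return value


# TTAC -> 01 01 00 10
print(format(lmer_to_int("TTAC"), "08b"))  # 01010010
```

Because distinct l-mers of the same length get distinct integers, the integer can serve directly as an array index when 4^l is small enough to allocate.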
Hashing: Maximal Repeats
  To find repeats in a genome:
    For all l-mers in the genome, note the start position and the sequence
    Generate a hash table index for each unique l-mer sequence
    In each index of the hash table, store all genome start locations of the l-mer which generated that index
    Extend l-mer repeats to maximal repeats
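The indexing steps above might look as follows in Python, where the built-in dictionary plays the role of the hash table. The function names `index_lmers` and `repeated_lmers` are assumptions for this sketch, and positions are 0-based.

```python
from collections import defaultdict


def index_lmers(genome, l):
    """Map every l-mer to the list of its 0-based start positions."""
    table = defaultdict(list)
    for start in range(len(genome) - l + 1):
        table[genome[start:start + l]].append(start)
    return table


def repeated_lmers(genome, l):
    """Keep only the l-mers that occur at more than one position."""
    return {lmer: starts
            for lmer, starts in index_lmers(genome, l).items()
            if len(starts) > 1}


genome = "GCTTACAGATTCAGTCTTACAGATGGT"
repeats = repeated_lmers(genome, 4)
print(repeats["TTAC"])  # [2, 16], i.e. 1-based locations 3 and 17
```

Each pair of start positions of a repeated l-mer is then a seed that can be extended into a maximal repeat.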
Hashing: Collisions
  Dealing with collisions: chain all start locations of l-mers (linked list)
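A rough sketch of chaining, assuming a table with a fixed number of buckets and a simple hash (the base-4 value of the l-mer modulo the table size). The names `bucket_of`, `build_table` and `lookup` are hypothetical, and Python lists stand in for the linked lists. Each chain stores the l-mer next to its start location so that colliding l-mers can be told apart at lookup time.

```python
CODE = {"A": 0, "T": 1, "C": 2, "G": 3}


def bucket_of(lmer, num_buckets):
    """Hash an l-mer to a bucket: its base-4 value modulo the table size."""
    value = 0
    for base in lmer:
        value = value * 4 + CODE[base]
    return value % num_buckets


def build_table(genome, l, num_buckets):
    """Chain (l-mer, start) entries in the bucket the l-mer hashes to."""
    buckets = [[] for _ in range(num_buckets)]
    for start in range(len(genome) - l + 1):
        lmer = genome[start:start + l]
        buckets[bucket_of(lmer, num_buckets)].append((lmer, start))
    return buckets


def lookup(buckets, lmer):
    """Walk one chain and keep only the entries for this exact l-mer."""
    chain = buckets[bucket_of(lmer, len(buckets))]
    return [start for key, start in chain if key == lmer]


table = build_table("GCTTACAGATTCAGTCTTACAGATGGT", 4, 7)
print(lookup(table, "TTAC"))  # [2, 16] (0-based)
```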
Hashing: Summary
  When finding genomic repeats from l-mers:
    Generate a hash table index for each l-mer sequence
    In each index, store all genome start locations of the l-mer which generated that index
    Extend l-mer repeats to maximal repeats
Pattern Matching
  What if, instead of finding repeats in a genome, we want to find all sequences in a database that contain a given pattern?
  This leads us to a different problem, the Pattern Matching Problem
Pattern Matching Problem
  Goal: Find all occurrences of a pattern in a text
  Input: Pattern p = p_1 ... p_n and text t = t_1 ... t_m
  Output: All positions 1 ≤ i ≤ (m - n + 1) such that the n-letter substring of t starting at i matches p
  Motivation: Searching a database for a known pattern
Exact Pattern Matching: A Brute-Force Algorithm
  PatternMatching(p, t)
    n ← length of pattern p
    m ← length of text t
    for i ← 1 to (m - n + 1)
      if t_i ... t_(i+n-1) = p
        output i
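A direct Python rendering of the brute-force pseudocode could look like this. The function name `pattern_matching` is an assumption, and reported positions are 1-based to match the problem statement.

```python
def pattern_matching(pattern, text):
    """Return all 1-based positions where pattern occurs in text (brute force)."""
    pattern_len, text_len = len(pattern), len(text)
    positions = []
    # Try every start position at which the pattern still fits in the text.
    for i in range(text_len - pattern_len + 1):
        if text[i:i + pattern_len] == pattern:
            positions.append(i + 1)
    return positions


print(pattern_matching("GCAT", "CGCATC"))  # [2]
```

Each of the candidate start positions costs up to one full pattern comparison, which is where the product of the two lengths in the running time comes from.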
Exact Pattern Matching: An Example
  PatternMatching algorithm for:
    Pattern: GCAT
    Text: CGCATC
  The pattern is compared with the text at each start position in turn; GCAT matches at position 2
Exact Pattern Matching: Running Time
  PatternMatching runtime: O(nm)
  Better solution: suffix trees
    Can solve the problem in linear time, O(n + m)
    Conceptually related to keyword trees
Keyword Trees: Example
  Keyword tree (also known as a trie), built by adding one keyword at a time:
    Apple
    Apropos
    Banana
    Bandana
    Orange
Keyword Trees: Properties
  Stores a set of keywords in a rooted labeled tree
  Each edge is labeled with a letter from an alphabet
  Any two edges coming out of the same vertex have distinct labels
  Every keyword stored can be spelled on a path from the root to some leaf
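One possible way to realize these properties is a trie of nested dictionaries, sketched below. The names `END`, `build_keyword_tree` and `contains` are assumptions for illustration, as is the choice to mark the end of a keyword with a special key and to use lowercase keywords. Because dictionary keys are unique, two edges leaving the same vertex automatically carry distinct labels.

```python
END = None  # special key marking that a keyword ends at this vertex


def build_keyword_tree(keywords):
    """Build a trie: each dict is a vertex, each key an edge label."""
    root = {}
    for word in keywords:
        node = root
        for letter in word:
            # Reuse the existing edge for this letter or create a new one.
            node = node.setdefault(letter, {})
        node[END] = word
    return root


def contains(tree, word):
    """Spell the word from the root and check that a keyword ends there."""
    node = tree
    for letter in word:
        if letter not in node:
            return False
        node = node[letter]
    return END in node


tree = build_keyword_tree(["apple", "apropos", "banana", "bandana", "orange"])
print(sorted(tree))            # ['a', 'b', 'o']
print(contains(tree, "apple"), contains(tree, "appeal"))  # True False
```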
Keyword Trees: Threading (cont'd)
  Thread "appeal" through the keyword tree
Keyword Trees: Threading (cont'd)
  Thread "apple" through the keyword tree
Multiple Pattern Matching Problem
  Goal: Given a set of patterns and a text, find all occurrences of any of the patterns in the text
  Input: k patterns p_1, ..., p_k, and text t = t_1 ... t_m
  Output: Positions 1 ≤ i ≤ m where a substring of t starting at i matches p_j for some 1 ≤ j ≤ k
  Motivation: Searching a database for known multiple patterns
Multiple Pattern Matching: Straightforward Approach
  Can solve as k Pattern Matching Problems
  Runtime: O(kmn), using the PatternMatching algorithm k times
    m: length of the text
    n: average length of the patterns
Multiple Pattern Matching: Keyword Tree Approach
  Or, we could use keyword trees:
    Build the keyword tree in O(N) time; N is the total length of all patterns
    With naive threading: O(N + nm)
    Aho-Corasick algorithm: O(N + m)
Keyword Trees: Threading
  To match patterns in a text using a keyword tree:
    Build the keyword tree of the patterns
    Thread the text through the keyword tree
Keyword Trees: Threading (cont'd)
  Threading is complete when we reach a leaf in the keyword tree
  When threading is complete, we've found a pattern in the text
  Problem: High memory requirement when N is large
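Naive threading can be sketched as follows: from every start position of the text, walk down the keyword tree for as long as the next letter has an edge. This is the simple quadratic-style approach, not Aho-Corasick. The names `END`, `build_keyword_tree` and `thread_text` are assumptions. One deliberate deviation from the slide wording: a match is reported at every vertex where a pattern ends, not only at leaves, so that a pattern which is a prefix of another pattern is still found.

```python
END = None  # special key marking that a pattern ends at this vertex


def build_keyword_tree(patterns):
    root = {}
    for pattern in patterns:
        node = root
        for letter in pattern:
            node = node.setdefault(letter, {})
        node[END] = pattern
    return root


def thread_text(patterns, text):
    """Return (1-based start, pattern) for every occurrence of any pattern."""
    tree = build_keyword_tree(patterns)
    matches = []
    for start in range(len(text)):
        node = tree
        # Thread the text from this start position until no edge fits.
        for pos in range(start, len(text)):
            letter = text[pos]
            if letter not in node:
                break
            node = node[letter]
            if END in node:
                matches.append((start + 1, node[END]))
    return matches


print(thread_text(["GCAT", "CAT", "ATC"], "CGCATC"))
# [(2, 'GCAT'), (3, 'CAT'), (4, 'ATC')]
```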
Suffix Trees = Collapsed Keyword Trees
  Similar to keyword trees, except edges that form paths are collapsed
  Each edge is labeled with a substring of the text
  All internal nodes have at least two outgoing edges
  Leaves are labeled by the index of the pattern
Suffix Tree of a Text
  The suffix tree of a text is constructed from all its suffixes:
    ATCATG, TCATG, CATG, ATG, TG, G
  Build the keyword tree of the suffixes, then collapse it into the suffix tree
  How much time does it take?
    Time is linear in the total size of all suffixes, i.e., it is quadratic in the length of the text
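The quadratic bound can be checked directly. A text of length m has exactly one suffix of each length from 1 to m, so the total size of all suffixes is the sum below (m here denotes the text length, as in the problem statement):

```latex
\sum_{i=1}^{m} i \;=\; \frac{m(m+1)}{2} \;=\; O(m^2),
\qquad \text{e.g. for } \texttt{ATCATG},\ m = 6:\quad 6+5+4+3+2+1 = 21 .
```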
Suffix tree (Example)
  Let s = ...; a suffix tree of s is a compressed trie of all suffixes of s: { ... }
Trivial algorithm to build a suffix tree
  Put the largest suffix in
  Put each remaining suffix in, one at a time
  We will also label each leaf with the starting point of the corresponding suffix
  Trivial algorithm: O(n^2) time
Suffix Trees: Advantages
  The suffix tree of a text is constructed from all its suffixes:
    ATCATG, TCATG, CATG, ATG, TG, G
  Suffix trees build faster than keyword trees:
    Keyword tree: quadratic
    Suffix tree: linear (Weiner suffix tree algorithm)
Use of Suffix Trees
  Suffix trees hold all suffixes of a text
    e.g., ATCGC: ATCGC, TCGC, CGC, GC, C
    Builds in O(m) time for a text of length m
  To find any pattern of length n in a text:
    Build the suffix tree for the text
    Thread the pattern through the suffix tree
    Can find the pattern in the text in O(n) time!
  O(n + m) time for the Pattern Matching Problem
    Build the suffix tree and look up the pattern
Pattern Matching with Suffix Trees
  SuffixTreePatternMatching(p, t)
    Build suffix tree for text t
    Thread pattern p through suffix tree
    if threading is complete
      output positions of all p-matching leaves in the tree
    else
      output "Pattern does not appear in text"
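The pseudocode can be tried out with a simplified stand-in for the suffix tree. The sketch below builds an uncompressed suffix trie by inserting every suffix one at a time, which takes quadratic time and space; it is not the linear-time construction. All names (`LEAF`, `build_suffix_trie`, `leaves_below`, `suffix_tree_pattern_matching`) are assumptions, and the text and pattern are assumed not to contain the terminator `$`.

```python
LEAF = None  # special key holding the 1-based start of the suffix ending here


def build_suffix_trie(text):
    """Insert every suffix of text + '$' into a trie of nested dicts."""
    text += "$"
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node[LEAF] = start + 1
    return root


def leaves_below(node):
    """Collect the suffix start positions stored in the subtree of node."""
    found, stack = [], [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key is LEAF:
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)


def suffix_tree_pattern_matching(p, t):
    """Thread p from the root; the leaves below the end point are the matches."""
    node = build_suffix_trie(t)
    for ch in p:
        if ch not in node:
            return []  # pattern does not appear in text
        node = node[ch]
    return leaves_below(node)


print(suffix_tree_pattern_matching("AT", "ATCATG"))    # [1, 4]
print(suffix_tree_pattern_matching("GCAT", "CGCATC"))  # [2]
```

Once the structure is built, threading a pattern touches one edge per pattern letter, regardless of the text length; only the final collection of leaves depends on the number of matches.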
Suffix Trees: Example
Generalized suffix tree
  Given a set of strings S, a generalized suffix tree of S is a compressed trie of all suffixes of every s ∈ S
  To make these suffixes prefix-free, we add a special character (conventionally $) at the end of s
  To associate each suffix with a unique string in S, add a different special character to each s
Generalized suffix tree (Example)
  Let s_1 = ... and s_2 = ...; here is a generalized suffix tree for s_1 and s_2
  [Figure: generalized suffix tree for s_1 and s_2, with leaves labeled by suffix start positions]
  Matching a pattern against a database of strings
Longest common substring of two strings
  Every node with a leaf descendant from string s_1 and a leaf descendant from string s_2 represents a maximal common substring, and vice versa
  Find such a node with the largest string depth
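The idea can be illustrated with a naive, uncompressed generalized suffix trie, which takes quadratic time and space and is a stand-in for the compressed tree. The function name and the `owners` bookkeeping are assumptions. Every vertex records which of the two strings has a suffix passing through it, which is equivalent to having a leaf descendant from that string, so no terminator characters are needed in this simplified version. The deepest vertex owned by both strings spells a longest common substring.

```python
def longest_common_substring(s1, s2):
    """Return one longest common substring of s1 and s2 ('' if none)."""
    root = {"children": {}, "owners": set()}
    # Insert every suffix of both strings, tagging each vertex on the way.
    for owner, s in ((1, s1), (2, s2)):
        for start in range(len(s)):
            node = root
            for ch in s[start:]:
                node = node["children"].setdefault(
                    ch, {"children": {}, "owners": set()})
                node["owners"].add(owner)
    # Search only through vertices reached by suffixes of both strings.
    best = ""
    stack = [(root, "")]
    while stack:
        node, spelled = stack.pop()
        for ch, child in node["children"].items():
            if child["owners"] == {1, 2}:
                label = spelled + ch
                if len(label) > len(best):
                    best = label
                stack.append((child, label))
    return best


print(longest_common_substring("ATCGC", "TCGA"))  # TCG
```

When several common substrings share the maximum length, which one is returned depends on traversal order.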
Multiple Pattern Matching: Summary
  Keyword and suffix trees are used to find patterns in a text
  Keyword trees: Build the keyword tree of the patterns, and thread the text through it
  Suffix trees: Build the suffix tree of the text, and thread the patterns through it