17 December Luca Bortolussi SUFFIX TREES. From exact to approximate string matching.
1 17 December 2003 Luca Bortolussi SUFFIX TREES From exact to approximate string matching.
2 An introduction to string matching
String matching is an important branch of algorithmics, with applications in many fields, such as text searching, molecular biology, data compression, and so on.
3 Exact String matching: a brief history
Naive algorithm; Knuth-Morris-Pratt (1977); Boyer-Moore (1977); Suffix Trees: Weiner (1973), McCreight (1976), Ukkonen (1995)
4 Naive Algorithm
[Figure: the pattern cdda is slid one position at a time along the text bcadbcddacdbbba.]
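The naive algorithm of this slide can be sketched as follows. This is a minimal illustration, not taken from the slides: the function name `naive_search` and the choice of 0-based shifts are assumptions made for the example.

```python
def naive_search(text, pattern):
    """Return every 0-based shift at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    hits = []
    for shift in range(n - m + 1):
        # Compare left to right and stop at the first mismatch.
        k = 0
        while k < m and text[shift + k] == pattern[k]:
            k += 1
        if k == m:
            hits.append(shift)
    return hits


# Text and pattern from the slide.
print(naive_search("bcadbcddacdbbba", "cdda"))  # [5]
```

Every shift is tried, so the worst case costs a number of comparisons proportional to the product of the two lengths.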
5 Knuth-Morris-Pratt
[Figure: the pattern bcababcd is shifted along the text bcabbcaddbcababcdbbba.]
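A possible sketch of Knuth-Morris-Pratt on the slide's text and pattern is given below. The names `failure_function` and `kmp_search` are hypothetical, and the formulation through the border ("failure") table is one common presentation, not necessarily the one used in the talk.

```python
def failure_function(pattern):
    """fail[i] = length of the longest proper border of pattern[:i + 1]."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_search(text, pattern):
    """Return every 0-based position where pattern occurs in text."""
    if not pattern:
        return list(range(len(text) + 1))
    fail = failure_function(pattern)
    hits, k = [], 0              # k = number of pattern characters matched
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]      # shift the pattern without moving back in the text
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = fail[k - 1]
    return hits


print(kmp_search("bcabbcaddbcababcdbbba", "bcababcd"))  # [9]
```

The text index never moves backwards, which is what gives the linear running time.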
6 Boyer-Moore
[Figure: the pattern babdab is shifted along the text babcabaddbabdabcdbbba.]
The shift is the maximum between the one given by the bad character rule and the one given by the good suffix rule.
7 Suffix Trees
Definition: A suffix tree for a string T of length m is a rooted tree such that:
1. It has exactly m leaves, numbered from 1 to m;
2. Every edge has a label, which is a substring of T;
3. Every internal node has at least two children;
4. Labels of two edges starting at an internal node do not start with the same character;
5. The label of the path from the root to the leaf numbered i is the suffix of T starting at position i, i.e. T[i..m].
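8 Suffix Trees - II
[Figure: suffix tree for abbcbab#. Edges from the root: ab, b, cbab# (leaf 4). Below ab: bcbab# (leaf 1), # (leaf 6). Below b: bcbab# (leaf 2), cbab# (leaf 3), ab# (leaf 5), # (leaf 7).]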
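9 Suffix Trees searching a pattern
[Figure: the suffix tree for abbcbab#, used to search for the pattern bcb.]
Pattern: bcb. Matching follows the edge b and then the edge cbab#, which leads to leaf 3: the pattern occurs at position 3.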
10 Suffix Trees naive construction
[Figure: naive construction of the suffix tree for abbcbab#, inserting one suffix at a time.]
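The naive construction and the pattern search of the previous slides can be sketched as below. This is an illustrative quadratic-time version, not the talk's code: the `Node` class, `build_suffix_tree` and `find` are hypothetical names, edge labels are stored as explicit strings for readability, and the suffix made of the terminator alone is also stored (as leaf 8 for abbcbab#).

```python
class Node:
    def __init__(self):
        self.children = {}   # first character -> (edge label, child Node)
        self.leaf = None     # 1-based suffix number, set only on leaves


def build_suffix_tree(text):
    """Naive construction: insert the suffixes one at a time."""
    if not text or text.count(text[-1]) != 1:
        raise ValueError("text must end with a unique terminator such as '#'")
    root = Node()
    for i in range(len(text)):
        node, rest = root, text[i:]
        while True:
            first = rest[0]
            if first not in node.children:
                leaf = Node()
                leaf.leaf = i + 1
                node.children[first] = (rest, leaf)   # new leaf edge
                break
            label, child = node.children[first]
            k = 0                                      # common prefix length
            while k < len(label) and k < len(rest) and label[k] == rest[k]:
                k += 1
            if k < len(label):
                # The suffix leaves the edge in the middle: split the edge.
                mid = Node()
                mid.children[label[k]] = (label[k:], child)
                node.children[first] = (label[:k], mid)
                child = mid
            node, rest = child, rest[k:]


    return root


def find(root, pattern):
    """Return the sorted 1-based positions where pattern occurs."""
    node, rest = root, pattern
    while rest:
        if rest[0] not in node.children:
            return []
        label, child = node.children[rest[0]]
        n = min(len(label), len(rest))
        if label[:n] != rest[:n]:
            return []
        node, rest = child, rest[n:]
    positions, stack = [], [node]
    while stack:                      # collect the leaves below the match
        cur = stack.pop()
        if cur.leaf is not None:
            positions.append(cur.leaf)
        stack.extend(child for _, child in cur.children.values())
    return sorted(positions)


tree = build_suffix_tree("abbcbab#")
print(find(tree, "bcb"))  # [3]
```

Once the tree is built, a search costs time proportional to the pattern length plus the number of occurrences reported.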
11 Suffix Trees Ukkonen Algorithm
Ukkonen's algorithm was published in 1995; it builds a suffix tree in linear time and performs well in practice. The basic idea is to construct iteratively the implicit suffix trees for S[1..i], in the following way:
  Construct tree 1
  For i = 1 to m-1 // phase i+1
    For j = 1 to i+1 // extension j
      Find the end of the path from the root with label S[j..i] in the current tree. Extend the path by adding character S(i+1), so that S[j..i+1] is in the tree.
The extension follows one of the next three rules, where β = S[j..i]:
1. β ends at a leaf. Add S(i+1) at the end of the label of the path to the leaf.
2. At least one path continues from the end of β, but none starts with S(i+1). Add a node at the end of β (if there is none) and a new edge starting from it with label S(i+1), terminating in a leaf with number j.
3. There is a path from the end of β starting with S(i+1). In this case do nothing.
12 Suffix Trees Ukkonen Algorithm - II
The main idea to speed up the construction of the tree is the concept of suffix link. A suffix link is a pointer from a node v with path label xα to a node s(v) with path label α (α is a string and x a character). The interesting feature of suffix trees is that every internal node, except the root, has a suffix link to another node.
[Figure: suffix tree for abbcbab#, with the suffix link from the node v (path label ab) to the node s(v) (path label b).]
13 Suffix Trees Ukkonen Algorithm - III
With suffix links, we can speed up the construction of the suffix tree. In addition, every node can be crossed in constant time, just by keeping track of the length of the label of every edge. This is possible because no two edges leaving a node start with the same character, hence a single comparison is enough to decide which path must be taken. Even with suffix links, however, the complexity is still quadratic.
14 Suffix Trees Ukkonen Algorithm - IV
Storing the edge labels explicitly costs quadratic space. However, each edge needs only constant space, i.e. two pointers, one to the beginning and one to the end of the substring it has as label. To complete the speed-up of the algorithm, we need the following observations:
- Once a leaf is created, it remains a leaf forever.
- Once rule 3 is used in a phase, all successive extensions of that phase use it too, hence we can ignore them.
- If in phase i rules 1 and 2 are applied in the first j_i extensions, then in phase i+1 the first j_i extensions can be made in constant time, simply by adding the character S(i+2) at the end of the paths to the first j_i leaves (we use a global variable e to do this). Hence the extensions are computed explicitly from j_i + 1 on, which reduces their total number to 2m.
15 Generalized Suffix Trees
A generalized suffix tree is simply a suffix tree for a set of strings, each one ending with a different marker. The leaves have two numbers, one identifying the string and the other identifying the position inside the string.
[Figure: generalized suffix tree for S1 = abbc$ and S2 = babc#, with leaves labeled (string, position).]
16 Longest common substring
Let S1 and S2 be two strings over the same alphabet. The Longest Common Substring problem is to find the longest substring of S1 that is also a substring of S2. Knuth conjectured in 1970 that this problem needs quadratic time, i.e. that no linear-time algorithm exists. With a generalized suffix tree for S1 and S2, to solve the problem one has to identify the nodes which belong to both the suffix tree of S1 and the suffix tree of S2 (that is, nodes with leaves of both strings below them) and choose the one with the greatest string depth (the length of the path label from the root to the node). All these operations cost O(n).
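The idea of marking the nodes shared by both strings and taking the deepest one can be illustrated with the sketch below. To stay short it uses an uncompressed generalized suffix trie built naively, so it takes quadratic time and space and is not the linear-time construction the slide refers to. The function name and the dictionary layout are assumptions made for the example.

```python
def longest_common_substring(s1, s2):
    """Deepest trie node reached by suffixes of both strings."""
    root = {"kids": {}, "owners": set()}
    for owner, s in enumerate((s1, s2)):
        for i in range(len(s)):
            node = root
            for ch in s[i:]:
                node = node["kids"].setdefault(ch, {"kids": {}, "owners": set()})
                node["owners"].add(owner)      # this node lies on a suffix of s
    best = ""
    stack = [(root, "")]
    while stack:
        node, path = stack.pop()
        for ch, child in node["kids"].items():
            if len(child["owners"]) == 2:      # node belongs to both strings
                if len(path) + 1 > len(best):
                    best = path + ch
                stack.append((child, path + ch))
    return best


# Strings of the generalized suffix tree slide, without the end markers.
print(len(longest_common_substring("abbc", "babc")))  # 2
```

For abbc and babc there are two common substrings of maximal length, ab and bc; which one is returned depends on the traversal order.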
17 Longest Common Extension
A problem that can be solved in linear time using suffix trees is the Longest Common Extension problem, that is, for every pair of indices (i,j), finding the length of the longest substring of T starting at position i that matches a substring of P starting at position j. It can be solved in O(n+m) time, by building a generalized suffix tree for T and P and finding, for every leaf i of T and j of P, their lowest common ancestor in the tree (this can be done in constant time after preprocessing the tree).
18 Hamming and Edit Distances
Hamming Distance: two strings of the same length are aligned and the distance is the number of mismatches between them.
Example: abbcdbaabbc and abbdcbbbaac, H = 6.
Edit Distance: it is the minimum number of insertions, deletions and substitutions needed to transform one string into another.
Example: abbcdbaabbc and cbcdbaabc, E = 3.
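As a quick check of the Hamming example on this slide, a minimal sketch (the function name `hamming` is an assumption):

```python
def hamming(s, t):
    """Number of positions at which two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(s, t))


print(hamming("abbcdbaabbc", "abbdcbbbaac"))  # 6
```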
19 The k - mismatches problem
We have a text T and a pattern P, and we want to find the occurrences of P in T, allowing a maximum of k mismatches, i.e. we want to find all the substrings T' of T such that H(P,T') ≤ k. We can use suffix trees, but they do not perform well anymore: the algorithm scans all the paths to the leaves, keeping track of errors, and abandons a path if this number becomes greater than k. The algorithm is sped up using longest common extensions. For every suffix of T, the pieces of agreement between the suffix and P are matched together until P is exhausted or the errors exceed k. Every piece is found in constant time. The complexity of the resulting algorithm is O(k|T|).
Example: T = aaacaabaaaaa, P = aabaab. An occurrence is found at position 2 of T, with one error (T has c where P has b).
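The jump-over-matching-pieces scheme described above can be sketched as follows. Two assumptions are made: the pattern of the slide's example is read as aabaab, and the longest common extension is computed here by direct comparison (the hypothetical helper `lce`), whereas the O(k|T|) bound of the slide needs constant-time extensions from a preprocessed generalized suffix tree.

```python
def lce(t, i, p, j):
    """Longest common extension of t[i:] and p[j:], by direct comparison."""
    n = 0
    while i + n < len(t) and j + n < len(p) and t[i + n] == p[j + n]:
        n += 1
    return n


def k_mismatch(text, pattern, k):
    """1-based start positions where pattern matches with at most k mismatches."""
    hits = []
    for start in range(len(text) - len(pattern) + 1):
        j, errors = 0, 0          # j = number of pattern characters consumed
        while True:
            j += lce(text, start + j, pattern, j)   # jump over a run of matches
            if j == len(pattern):
                hits.append(start + 1)
                break
            errors += 1           # position j of the pattern is a mismatch
            if errors > k:
                break
            j += 1                # step over the mismatch
    return hits


print(k_mismatch("aaacaabaaaaa", "aabaab", 1))  # [2, 5]
```

Each alignment performs at most k+1 extension queries, which is where the factor k in the complexity comes from.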
20 Inexact Matching
In biology, inexact matching is very important:
- Similarity between DNA sequences often implies that they have the same biological function (the converse is not true);
- Mutations and transcription errors make exact comparison not very useful.
There are many algorithms that deal with inexact matching (with respect to edit distance), and they are mainly based on dynamic programming or on automata. Suffix trees are used as a secondary tool in some of them, because their structure is not suited to dealing with insertions and deletions, or even with substitutions. The main efforts are spent on speeding up the average-case behaviour of the algorithms; this is justified by the fact that random sequences often fall into these cases (and DNA sequences have a high degree of randomness).
21 Dynamic Programming
We aim to compute the edit distance (global alignment) between two strings S and T. The main idea is to compute the edit distance between every pair of prefixes of S and T. Let D(i,j) be the edit distance between S[1..i] and T[1..j]. Of course, the edit distance between S and T is D(n,m), where n = |S| and m = |T|. The following properties hold:
1. D(i,0) = i, D(0,j) = j;
2. D(i,j) = min { D(i,j-1) + 1, D(i-1,j) + 1, D(i-1,j-1) + t(i,j) }, where t(i,j) = 0 if S(i) = T(j) and 1 otherwise.
Hence in O(mn) time we can compute a matrix which encodes not only the edit distance, but also the way to transform one string into the other (just by keeping track, by means of pointers, of which elements realize the minimum).
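The recurrence for D(i,j) translates almost literally into code. The sketch below is a minimal version that returns only the distance (no traceback pointers); the function name is an assumption, and t(i,j) is taken to be 0 on a match and 1 on a mismatch.

```python
def edit_distance(s, t):
    """Edit distance between s and t by the D(i, j) recurrence."""
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i                       # delete the first i characters of s
    for j in range(m + 1):
        D[0][j] = j                       # insert the first j characters of t
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1   # t(i, j)
            D[i][j] = min(D[i][j - 1] + 1,            # insertion
                          D[i - 1][j] + 1,            # deletion
                          D[i - 1][j - 1] + cost)     # match or substitution
    return D[n][m]


print(edit_distance("abbcdbaabbc", "cbcdbaabc"))  # 3
```

The two strings are the ones of the edit distance example on the Hamming and Edit Distances slide.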
22 Dynamic Programming II
[Figure: dynamic programming table for the strings CASE and ARE.]
23 Non-Deterministic Automata
To recognize the approximate occurrences of a pattern P in a text T, we can build a non-deterministic automaton for P, and run it with T as input. This leads to faster search algorithms, but the problem is building the automaton.
[Figure: non-deterministic automaton for the pattern CASE.]
24 Longest Common Subsequence
The Longest Common Subsequence of two strings S1 and S2 is the greatest number of characters of S1 that can be aligned to S2. It is a global alignment problem, which is obviously connected with edit distance. However, it is often modelled with a scoring scheme, which gives a positive score to matches and a negative one to mismatches, insertions and deletions. So the best global alignment is the one which maximizes the total score. Clearly, given the best global alignment, the number of matches is the longest common subsequence solution.
Example alignment:
a b b c d a b b a
a b _ c b a b _ a
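A minimal sketch of the standard dynamic programming computation of the length of a longest common subsequence is given below. The function name is an assumption, and the strings abbcdabba and abcbaba are read off the slide's alignment figure, so they are an assumption too.

```python
def lcs_length(s1, s2):
    """Length of a longest common subsequence, one table row at a time."""
    prev = [0] * (len(s2) + 1)
    for a in s1:
        cur = [0]
        for j, b in enumerate(s2, start=1):
            if a == b:
                cur.append(prev[j - 1] + 1)         # extend a common subsequence
            else:
                cur.append(max(prev[j], cur[j - 1]))  # skip a character of s1 or s2
        prev = cur
    return prev[-1]


print(lcs_length("abbcdabba", "abcbaba"))  # 6
```

The value 6 agrees with the six matching columns of the alignment shown on the slide.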
25 The k differences problem
This problem is to find all the occurrences of a pattern P in a text T, allowing a maximum number of k insertions, deletions or substitutions. The Landau-Vishkin algorithm solves it in O(k|T|) time, and implements a hybrid dynamic programming technique, which uses suffix trees to solve a subproblem. The algorithm looks for paths in the dynamic programming matrix (which start in the upper row), in particular for d-paths, which are paths that specify exactly d mismatches and spaces. Some of these paths are computed, for d ≤ k, and the ones that reach the bottom row correspond to approximate occurrences of P in T, with exactly d mismatches or spaces.
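For comparison, the plain dynamic programming baseline for the k differences problem can be sketched as follows. It is not the Landau-Vishkin algorithm: it fills the whole matrix column by column in O(|P||T|) time, with the top row set to 0 so that a path may start at any column of the upper row. The function name and the choice of reporting 1-based end positions are assumptions.

```python
def k_differences(text, pattern, k):
    """1-based end positions in text of substrings within edit distance k of pattern."""
    n = len(pattern)
    col = list(range(n + 1))      # column 0: D(i, 0) = i
    ends = []
    for j, ch in enumerate(text, start=1):
        new = [0]                 # D(0, j) = 0: an occurrence may start anywhere
        for i in range(1, n + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            new.append(min(new[i - 1] + 1,      # vertical edge
                           col[i] + 1,          # horizontal edge
                           col[i - 1] + cost))  # diagonal edge
        col = new
        if col[n] <= k:           # a path reached the bottom row with <= k errors
            ends.append(j)
    return ends


print(k_differences("abd", "abc", 1))  # [2, 3]
```

The hybrid technique of the slide avoids filling the whole matrix by following only the farthest-reaching d-paths on each diagonal.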
26 Landau-Vishkin Algorithm
Each diagonal is numbered: the main diagonal is numbered 0, the upper diagonals with increasing positive integers, and the lower diagonals with decreasing negative integers. A d-path is farthest reaching in diagonal i if it ends in diagonal i and the index of its ending column is greater than or equal to that of every other d-path ending in diagonal i.
27 Landau-Vishkin Algorithm - II
[Figure: diagonal i and its two neighbouring diagonals.]
The farthest reaching d-path that ends in diagonal i is one of the following three:
1. the (d-1)-path of diagonal i + 1, plus a vertical edge, plus the maximum extension along diagonal i that corresponds to identical substrings in P and T;
2. the (d-1)-path of diagonal i - 1, plus a horizontal edge, plus the maximum extension along diagonal i that corresponds to identical substrings in P and T;
3. the (d-1)-path of diagonal i, plus a diagonal edge that corresponds to a mismatch, plus the maximum extension along diagonal i that corresponds to identical substrings in P and T.
The maximum extension between substrings of P and T can be computed in constant time by means of suffix trees.
28 Inexact Matching, a new approach
Suffix trees work very well for exact matching, but they fail when we admit errors in the matching process. This happens because the only way to find approximate occurrences of a pattern, when we search for it in a suffix tree, is to walk down every path, keeping track of errors and discarding the paths which exceed the tolerance level previously chosen. A different approach may be to define a different data structure, similar to suffix trees, which encodes in some way a concept of distance, in particular the Hamming distance. A possible way is to shift from the alphabet A to the alphabet A^k, encoding the distance in a relation between letters: two letters are said to be equivalent if and only if their Hamming distance is less than a threshold.
29 Equivalence between letters
Let us show an example of this idea of equivalence, with A = {0,1} and k = 3. So, we can build the following table for A^3:
[Table: the letters of A^3.]
If the distance between two letters is less than or equal to 1, we define them to be equivalent. For example a ~ b and b ~ d, but not a ~ d.
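The relation between letters can be sketched as below, for the binary alphabet and k = 3. The naming of the eight letters as a, ..., h in lexicographic order (a = 000, b = 001, c = 010, d = 011, ...) is an assumption, chosen because it agrees with the slide's examples; the function names are hypothetical as well.

```python
from itertools import product


def letters(alphabet, k):
    """All strings of length k over the alphabet, in lexicographic order."""
    return ["".join(p) for p in product(alphabet, repeat=k)]


def equivalent(x, y, threshold=1):
    """Two letters are equivalent if their Hamming distance is <= threshold."""
    return sum(a != b for a, b in zip(x, y)) <= threshold


# Hypothetical naming of the 8 letters of {0,1}^3.
name = dict(zip("abcdefgh", letters("01", 3)))

print(equivalent(name["a"], name["b"]))  # True
print(equivalent(name["b"], name["d"]))  # True
print(equivalent(name["a"], name["d"]))  # False
```

The three calls show that the relation is reflexive and symmetric but not transitive, which is the property the Bundled Suffix Trees slide points out.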
30 Bundled Suffix Trees
Given this equivalence relation (which is not transitive), we want to incorporate it in a tree structure. For simplicity, we assume that the tree for the sequence S is the smallest tree which contains, for every substring of S, all the exact paths and all the equivalent paths that can be found in S. For historical reasons, we will call it a Bundled Suffix Tree.
Definition: A bundled suffix tree for a string S of length m is a rooted tree such that:
- It has exactly m leaves, numbered from 1 to m;
- Every edge has a label, which is a substring of S;
- Every node has a set of labels, which is a subset of {1, 2, ..., m} together with one special marker symbol;
- The tree obtained by deleting all nodes which do not have the marker as a label is the suffix tree for S;
- For every substring P of S, the subtree rooted at the end of the path labeled with P has node labels whose union (discarding the marker) gives all the starting positions of substrings of S equivalent to P;
- In every path from the root to a leaf, no two nodes can be labelled with the same number.
31 Bundled Suffix Trees - II
[Figure: bundled suffix tree for the string abbcda#.]
32 Open Problems
1. Do BSTs work well for Hamming distance? (They seem to need a distributed distance.)
2. How can BSTs be used to manage approximate searching using edit distance? At what price?
3. What is the expected average number of red nodes? Is it linear or does it grow quadratically?
4. Is there a linear algorithm for building a BST?
5. Do BSTs manage to improve existing algorithms, or is the interest just theoretical?
BUNDLED SUFFIX TREES
Motivation BUNDLED SUFFIX TREES Luca Bortolussi 1 Francesco Fabris 2 Alberto Policriti 1 1 Department of Mathematics and Computer Science University of Udine 2 Department of Mathematics and Computer Science
More informationLecture 5: Suffix Trees
Longest Common Substring Problem Lecture 5: Suffix Trees Given a text T = GGAGCTTAGAACT and a string P = ATTCGCTTAGCCTA, how do we find the longest common substring between them? Here the longest common
More informationString Matching Algorithms
String Matching Algorithms 1. Naïve String Matching The naïve approach simply test all the possible placement of Pattern P[1.. m] relative to text T[1.. n]. Specifically, we try shift s = 0, 1,..., n -
More informationData structures for string pattern matching: Suffix trees
Suffix trees Data structures for string pattern matching: Suffix trees Linear algorithms for exact string matching KMP Z-value algorithm What is suffix tree? A tree-like data structure for solving problems
More informationGiven a text file, or several text files, how do we search for a query string?
CS 840 Fall 2016 Text Search and Succinct Data Structures: Unit 4 Given a text file, or several text files, how do we search for a query string? Note the query/pattern is not of fixed length, unlike key
More information4. Suffix Trees and Arrays
4. Suffix Trees and Arrays Let T = T [0..n) be the text. For i [0..n], let T i denote the suffix T [i..n). Furthermore, for any subset C [0..n], we write T C = {T i i C}. In particular, T [0..n] is the
More informationApplied Databases. Sebastian Maneth. Lecture 14 Indexed String Search, Suffix Trees. University of Edinburgh - March 9th, 2017
Applied Databases Lecture 14 Indexed String Search, Suffix Trees Sebastian Maneth University of Edinburgh - March 9th, 2017 2 Recap: Morris-Pratt (1970) Given Pattern P, Text T, find all occurrences of
More informationAn introduction to suffix trees and indexing
An introduction to suffix trees and indexing Tomáš Flouri Solon P. Pissis Heidelberg Institute for Theoretical Studies December 3, 2012 1 Introduction Introduction 2 Basic Definitions Graph theory Alphabet
More informationSpecial course in Computer Science: Advanced Text Algorithms
Special course in Computer Science: Advanced Text Algorithms Lecture 6: Alignments Elena Czeizler and Ion Petre Department of IT, Abo Akademi Computational Biomodelling Laboratory http://www.users.abo.fi/ipetre/textalg
More informationKnuth-Morris-Pratt. Kranthi Kumar Mandumula Indiana State University Terre Haute IN, USA. December 16, 2011
Kranthi Kumar Mandumula Indiana State University Terre Haute IN, USA December 16, 2011 Abstract KMP is a string searching algorithm. The problem is to find the occurrence of P in S, where S is the given
More information4. Suffix Trees and Arrays
4. Suffix Trees and Arrays Let T = T [0..n) be the text. For i [0..n], let T i denote the suffix T [i..n). Furthermore, for any subset C [0..n], we write T C = {T i i C}. In particular, T [0..n] is the
More informationString Matching. Pedro Ribeiro 2016/2017 DCC/FCUP. Pedro Ribeiro (DCC/FCUP) String Matching 2016/ / 42
String Matching Pedro Ribeiro DCC/FCUP 2016/2017 Pedro Ribeiro (DCC/FCUP) String Matching 2016/2017 1 / 42 On this lecture The String Matching Problem Naive Algorithm Deterministic Finite Automata Knuth-Morris-Pratt
More informationString Matching Algorithms
String Matching Algorithms Georgy Gimel farb (with basic contributions from M. J. Dinneen, Wikipedia, and web materials by Ch. Charras and Thierry Lecroq, Russ Cox, David Eppstein, etc.) COMPSCI 369 Computational
More informationInexact Matching, Alignment. See Gusfield, Chapter 9 Dasgupta et al. Chapter 6 (Dynamic Programming)
Inexact Matching, Alignment See Gusfield, Chapter 9 Dasgupta et al. Chapter 6 (Dynamic Programming) Outline Yet more applications of generalized suffix trees, when combined with a least common ancestor
More informationLecture 7 February 26, 2010
6.85: Advanced Data Structures Spring Prof. Andre Schulz Lecture 7 February 6, Scribe: Mark Chen Overview In this lecture, we consider the string matching problem - finding all places in a text where some
More informationSuffix trees and applications. String Algorithms
Suffix trees and applications String Algorithms Tries a trie is a data structure for storing and retrieval of strings. Tries a trie is a data structure for storing and retrieval of strings. x 1 = a b x
More informationSuffix links are stored for compact trie nodes only, but we can define and compute them for any position represented by a pair (u, d):
Suffix links are the same as Aho Corasick failure links but Lemma 4.4 ensures that depth(slink(u)) = depth(u) 1. This is not the case for an arbitrary trie or a compact trie. Suffix links are stored for
More informationIndexing and Searching
Indexing and Searching Introduction How to retrieval information? A simple alternative is to search the whole text sequentially Another option is to build data structures over the text (called indices)
More informationSolution to Problem 1 of HW 2. Finding the L1 and L2 edges of the graph used in the UD problem, using a suffix array instead of a suffix tree.
Solution to Problem 1 of HW 2. Finding the L1 and L2 edges of the graph used in the UD problem, using a suffix array instead of a suffix tree. The basic approach is the same as when using a suffix tree,
More informationApplications of Suffix Tree
Applications of Suffix Tree Let us have a glimpse of the numerous applications of suffix trees. Exact String Matching As already mentioned earlier, given the suffix tree of the text, all occ occurrences
More informationCSED233: Data Structures (2017F) Lecture12: Strings and Dynamic Programming
(2017F) Lecture12: Strings and Dynamic Programming Daijin Kim CSE, POSTECH dkim@postech.ac.kr Strings A string is a sequence of characters Examples of strings: Python program HTML document DNA sequence
More informationLecture L16 April 19, 2012
6.851: Advanced Data Structures Spring 2012 Prof. Erik Demaine Lecture L16 April 19, 2012 1 Overview In this lecture, we consider the string matching problem - finding some or all places in a text where
More informationInexact Pattern Matching Algorithms via Automata 1
Inexact Pattern Matching Algorithms via Automata 1 1. Introduction Chung W. Ng BioChem 218 March 19, 2007 Pattern matching occurs in various applications, ranging from simple text searching in word processors
More informationExact String Matching Part II. Suffix Trees See Gusfield, Chapter 5
Exact String Matching Part II Suffix Trees See Gusfield, Chapter 5 Outline for Today What are suffix trees Application to exact matching Building a suffix tree in linear time, part I: Ukkonen s algorithm
More informationLecture 6: Suffix Trees and Their Construction
Biosequence Algorithms, Spring 2007 Lecture 6: Suffix Trees and Their Construction Pekka Kilpeläinen University of Kuopio Department of Computer Science BSA Lecture 6: Intro to suffix trees p.1/46 II:
More informationSORTING. Practical applications in computing require things to be in order. To consider: Runtime. Memory Space. Stability. In-place algorithms???
SORTING + STRING COMP 321 McGill University These slides are mainly compiled from the following resources. - Professor Jaehyun Park slides CS 97SI - Top-coder tutorials. - Programming Challenges book.
More informationString Patterns and Algorithms on Strings
String Patterns and Algorithms on Strings Lecture delivered by: Venkatanatha Sarma Y Assistant Professor MSRSAS-Bangalore 11 Objectives To introduce the pattern matching problem and the important of algorithms
More informationData Structures and Algorithms Dr. Naveen Garg Department of Computer Science and Engineering Indian Institute of Technology, Delhi.
Data Structures and Algorithms Dr. Naveen Garg Department of Computer Science and Engineering Indian Institute of Technology, Delhi Lecture 18 Tries Today we are going to be talking about another data
More informationSpecial course in Computer Science: Advanced Text Algorithms
Special course in Computer Science: Advanced Text Algorithms Lecture 5: Suffix trees and their applications Elena Czeizler and Ion Petre Department of IT, Abo Akademi Computational Biomodelling Laboratory
More informationSuffix Vector: A Space-Efficient Suffix Tree Representation
Lecture Notes in Computer Science 1 Suffix Vector: A Space-Efficient Suffix Tree Representation Krisztián Monostori 1, Arkady Zaslavsky 1, and István Vajk 2 1 School of Computer Science and Software Engineering,
More informationEfficient Sequential Algorithms, Comp309. Motivation. Longest Common Subsequence. Part 3. String Algorithms
Efficient Sequential Algorithms, Comp39 Part 3. String Algorithms University of Liverpool References: T. H. Cormen, C. E. Leiserson, R. L. Rivest Introduction to Algorithms, Second Edition. MIT Press (21).
More informationString Algorithms. CITS3001 Algorithms, Agents and Artificial Intelligence. 2017, Semester 2. CLRS Chapter 32
String Algorithms CITS3001 Algorithms, Agents and Artificial Intelligence Tim French School of Computer Science and Software Engineering The University of Western Australia CLRS Chapter 32 2017, Semester
More informationPAPER Constructing the Suffix Tree of a Tree with a Large Alphabet
IEICE TRANS. FUNDAMENTALS, VOL.E8??, NO. JANUARY 999 PAPER Constructing the Suffix Tree of a Tree with a Large Alphabet Tetsuo SHIBUYA, SUMMARY The problem of constructing the suffix tree of a tree is
More informationProject Proposal. ECE 526 Spring Modified Data Structure of Aho-Corasick. Benfano Soewito, Ed Flanigan and John Pangrazio
Project Proposal ECE 526 Spring 2006 Modified Data Structure of Aho-Corasick Benfano Soewito, Ed Flanigan and John Pangrazio 1. Introduction The internet becomes the most important tool in this decade
More informationAdvanced Algorithms: Project
Advanced Algorithms: Project (deadline: May 13th, 2016, 17:00) Alexandre Francisco and Luís Russo Last modified: February 26, 2016 This project considers two different problems described in part I and
More informationExact Matching Part III: Ukkonen s Algorithm. See Gusfield, Chapter 5 Visualizations from
Exact Matching Part III: Ukkonen s Algorithm See Gusfield, Chapter 5 Visualizations from http://brenden.github.io/ukkonen-animation/ Goals for Today Understand how suffix links are used in Ukkonen's algorithm
More informationIntroduction to Suffix Trees
Algorithms on Strings, Trees, and Sequences Dan Gusfield University of California, Davis Cambridge University Press 1997 Introduction to Suffix Trees A suffix tree is a data structure that exposes the
More informationLecture 9: Core String Edits and Alignments
Biosequence Algorithms, Spring 2005 Lecture 9: Core String Edits and Alignments Pekka Kilpeläinen University of Kuopio Department of Computer Science BSA Lecture 9: String Edits and Alignments p.1/30 III:
More informationString matching algorithms تقديم الطالب: سليمان ضاهر اشراف المدرس: علي جنيدي
String matching algorithms تقديم الطالب: سليمان ضاهر اشراف المدرس: علي جنيدي للعام الدراسي: 2017/2016 The Introduction The introduction to information theory is quite simple. The invention of writing occurred
More informationData Structure and Algorithm Midterm Reference Solution TA
Data Structure and Algorithm Midterm Reference Solution TA email: dsa1@csie.ntu.edu.tw Problem 1. To prove log 2 n! = Θ(n log n), it suffices to show N N, c 1, c 2 > 0 such that c 1 n ln n ln n! c 2 n
More informationUkkonen s suffix tree algorithm
Ukkonen s suffix tree algorithm Recall McCreight s approach: For i = 1.. n+1, build compressed trie of {x[..n]$ i} Ukkonen s approach: For i = 1.. n+1, build compressed trie of {$ i} Compressed trie of
More informationFast Substring Matching
Fast Substring Matching Andreas Klein 1 2 3 4 5 6 7 8 9 10 Abstract The substring matching problem occurs in several applications. Two of the well-known solutions are the Knuth-Morris-Pratt algorithm (which
More informationSuffix Trees and its Construction
Chapter 5 Suffix Trees and its Construction 5.1 Introduction to Suffix Trees Sometimes fundamental techniques do not make it into the mainstream of computer science education in spite of its importance,
More informationNew Implementation for the Multi-sequence All-Against-All Substring Matching Problem
New Implementation for the Multi-sequence All-Against-All Substring Matching Problem Oana Sandu Supervised by Ulrike Stege In collaboration with Chris Upton, Alex Thomo, and Marina Barsky University of
More informationAlgorithms and Data Structures
Algorithms and Data Structures Charles A. Wuethrich Bauhaus-University Weimar - CogVis/MMC May 11, 2017 Algorithms and Data Structures String searching algorithm 1/29 String searching algorithm Introduction
More informationCSE 417 Dynamic Programming (pt 5) Multiple Inputs
CSE 417 Dynamic Programming (pt 5) Multiple Inputs Reminders > HW5 due Wednesday Dynamic Programming Review > Apply the steps... optimal substructure: (small) set of solutions, constructed from solutions
More informationComputational Genomics and Molecular Biology, Fall
Computational Genomics and Molecular Biology, Fall 2015 1 Sequence Alignment Dannie Durand Pairwise Sequence Alignment The goal of pairwise sequence alignment is to establish a correspondence between the
More informationExact Matching: Hash-tables and Automata
18.417 Introduction to Computational Molecular Biology Lecture 10: October 12, 2004 Scribe: Lele Yu Lecturer: Ross Lippert Editor: Mark Halsey Exact Matching: Hash-tables and Automata While edit distances
More informationCombinatorial Pattern Matching
Combinatorial Pattern Matching Outline Exact Pattern Matching Keyword Trees Suffix Trees Approximate String Matching Local alignment is to slow Quadratic local alignment is too slow while looking for similarities
More informationSpecial course in Computer Science: Advanced Text Algorithms
Special course in Computer Science: Advanced Text Algorithms Lecture 4: Suffix trees Elena Czeizler and Ion Petre Department of IT, Abo Akademi Computational Biomodelling Laboratory http://www.users.abo.fi/ipetre/textalg
More informationAn Index Based Sequential Multiple Pattern Matching Algorithm Using Least Count
2011 International Conference on Life Science and Technology IPCBEE vol.3 (2011) (2011) IACSIT Press, Singapore An Index Based Sequential Multiple Pattern Matching Algorithm Using Least Count Raju Bhukya
More informationCSCI S-Q Lecture #13 String Searching 8/3/98
CSCI S-Q Lecture #13 String Searching 8/3/98 Administrivia Final Exam - Wednesday 8/12, 6:15pm, SC102B Room for class next Monday Graduate Paper due Friday Tonight Precomputation Brute force string searching
More informationA Suffix Tree Construction Algorithm for DNA Sequences
A Suffix Tree Construction Algorithm for DNA Sequences Hongwei Huo School of Computer Science and Technol Xidian University Xi 'an 710071, China Vojislav Stojkovic Computer Science Department Morgan State
More informationChapter 7. Space and Time Tradeoffs. Copyright 2007 Pearson Addison-Wesley. All rights reserved.
Chapter 7 Space and Time Tradeoffs Copyright 2007 Pearson Addison-Wesley. All rights reserved. Space-for-time tradeoffs Two varieties of space-for-time algorithms: input enhancement preprocess the input
More informationComputational Biology Lecture 4: Overlap detection, Local Alignment, Space Efficient Needleman-Wunsch Saad Mneimneh
Computational Biology Lecture 4: Overlap detection, Local Alignment, Space Efficient Needleman-Wunsch Saad Mneimneh Overlap detection: Semi-Global Alignment An overlap of two sequences is considered an
More informationLecture 18 April 12, 2005
6.897: Advanced Data Structures Spring 5 Prof. Erik Demaine Lecture 8 April, 5 Scribe: Igor Ganichev Overview In this lecture we are starting a sequence of lectures about string data structures. Today
More informationDynamic programming II - The Recursion Strikes Back
Chapter 5 Dynamic programming II - The Recursion Strikes Back By Sariel Har-Peled, December 17, 2012 1 Version: 0.4 No, mademoiselle, I don t capture elephants. I content myself with living among them.
More informationWAVEFRONT LONGEST COMMON SUBSEQUENCE ALGORITHM ON MULTICORE AND GPGPU PLATFORM BILAL MAHMOUD ISSA SHEHABAT UNIVERSITI SAINS MALAYSIA