Lecture 9: Core String Edits and Alignments

Biosequence Algorithms, Spring 2005 Lecture 9: Core String Edits and Alignments Pekka Kilpeläinen University of Kuopio Department of Computer Science BSA Lecture 9: String Edits and Alignments p.1/30

Part III: Inexact Matching, Sequence Alignment, and Dynamic Programming
(Finnish: likimääräinen hahmonsovitus, sekvenssien rinnastus ja dynaaminen ohjelmointi)

Introduction to Part III

"Today, the most powerful method for inferring the biological function of a gene (or the protein that it encodes) is by sequence similarity searching on protein and DNA sequence databases." (W.R. Pearson, 1995)

Inexact or approximate matching and sequence comparison are central tools in computational molecular biology. Because of

- errors in molecular data, and
- the aim of understanding the mutational evolution of the sequences,

exact equality is too rigid a notion of similarity.

Intro to Part III

Inexact matching and comparison use alignments (Finnish: rinnastus) of strings to recognize their similarities.

Computing alignments involves

- considering subsequences (Finnish: alisekvenssi) instead of contiguous substrings, and
- solving optimization problems (to maximize sequence similarity), often by applying dynamic programming.

The Edit Distance Problem

The most classic inexact matching problem solved by dynamic programming is the edit distance problem.

A common way to formalize the difference, or distance, of two strings $S_1$ and $S_2$ is the number of single-character edit operations needed to transform $S_1$ into $S_2$.

An edit transcript is a sequence of operations I (insert the next character of $S_2$), D (delete the next character of $S_1$), R (replace the next character of $S_1$ by the next character of $S_2$), and M (match) that transforms $S_1$ into $S_2$.

Example: a transcript that transforms "Vintner" into "writers":

    R I M D M D M M I
    V   i n t n e r
    w r i   t   e r s
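To make the transcript notation concrete, here is a minimal Python sketch that applies a transcript to a string. The function names `apply_transcript` and `edit_cost` are illustrative and not part of the lecture; the sketch assumes that D and R act on the next unread character of the first string.

```python
def apply_transcript(s1: str, s2: str, transcript: str) -> str:
    """Apply an edit transcript over {I, D, R, M} to s1, taking new characters from s2."""
    out = []
    i = j = 0  # next unread positions in s1 and s2
    for op in transcript:
        if op == "M":
            # Match: the two next characters must agree; copy one of them.
            if s1[i] != s2[j]:
                raise ValueError(f"M used at a mismatch: {s1[i]!r} vs {s2[j]!r}")
            out.append(s1[i])
            i += 1
            j += 1
        elif op == "R":
            # Replace the next character of s1 by the next character of s2.
            out.append(s2[j])
            i += 1
            j += 1
        elif op == "I":
            # Insert the next character of s2.
            out.append(s2[j])
            j += 1
        elif op == "D":
            # Delete the next character of s1.
            i += 1
        else:
            raise ValueError(f"unknown operation {op!r}")
    if i != len(s1) or j != len(s2):
        raise ValueError("transcript does not consume both strings completely")
    return "".join(out)


def edit_cost(transcript: str) -> int:
    """Number of I, D and R operations in the transcript; matches are free."""
    return sum(op in "IDR" for op in transcript)
```

Applied to the example above, `apply_transcript("Vintner", "writers", "RIMDMDMMI")` rebuilds the second string, and `edit_cost("RIMDMDMMI")` counts the five non-match operations.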

Definition of the Edit Distance

The edit distance between strings $S_1$ and $S_2$ is the minimum number of operations I, D and R in any transcript that transforms $S_1$ into $S_2$. It is also known as the Levenshtein distance.

An edit transcript with the smallest number of operations I, D and R is an optimal transcript.

Edit distance problem: compute the edit distance and an optimal transcript for the given strings $S_1$ and $S_2$.

Observation: the roles of $S_1$ and $S_2$ are symmetric, since a deletion in one string corresponds to an insertion in the other.

String Alignment

An alignment is an alternative representation for the differences and similarities between strings.

A (global) alignment (Finnish: kokonaisrinnastus) of $S_1$ and $S_2$ is obtained by inserting spaces in the strings and then placing them one above the other such that each character or space is opposite a unique character or space of the other string.

Global means that the entire strings participate in the alignment (a local alignment, in contrast, covers regions of high similarity).

Example: a global alignment of "Vintner" and "writers":

    V _ i n t n e r _
    w r i _ t _ e r s
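Since a transcript and an alignment describe the same information, one can be converted into the other. The sketch below is one possible conversion, assuming the operation letters I, D, R, M of the edit transcript and `_` for a space; the function name `transcript_to_alignment` is hypothetical.

```python
def transcript_to_alignment(s1: str, s2: str, transcript: str) -> tuple[str, str]:
    """Turn an edit transcript into the two rows of the corresponding global alignment."""
    row1, row2 = [], []
    i = j = 0  # next unread positions in s1 and s2
    for op in transcript:
        if op in "MR":
            # Match or replace: the two characters are placed opposite each other.
            row1.append(s1[i])
            row2.append(s2[j])
            i += 1
            j += 1
        elif op == "I":
            # Insertion: a space in the first row opposite a character of s2.
            row1.append("_")
            row2.append(s2[j])
            j += 1
        elif op == "D":
            # Deletion: a character of s1 opposite a space in the second row.
            row1.append(s1[i])
            row2.append("_")
            i += 1
        else:
            raise ValueError(f"unknown operation {op!r}")
    return " ".join(row1), " ".join(row2)
```

For the transcript `RIMDMDMMI` of "Vintner" and "writers", the two returned rows coincide with the example alignment shown above.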

Computing the Edit Distance

The edit distance between strings $S_1[1 \dots n]$ and $S_2[1 \dots m]$ can be computed by applying dynamic programming.

Define $D(i, j)$ to be the edit distance of the prefixes $S_1[1 \dots i]$ and $S_2[1 \dots j]$. Then $D(n, m)$ is the edit distance of $S_1$ and $S_2$.

Dynamic programming computes $D(n, m)$ by computing $D(i, j)$ for all $i \le n$ and $j \le m$.

Components of dynamic programming

Dynamic programming (for the edit distance) has three essential components:

- Recurrence relation (Finnish: palautuskaavat): how is $D(i, j)$ determined from values $D(i', j')$ with smaller $i'$ and $j'$?
- Tabular computation (Finnish: taulukointi): how to store computed values, to avoid computing them over and over again?
- Traceback (Finnish: jäljitys): how to find an optimal edit transcript?

The recurrence relation

How to determine the $D(i, j)$ values?

Base case $D(i, 0)$: how to edit $S_1[1 \dots i]$ to $S_2[1 \dots 0] = \epsilon$? With $i$ deletions, so $D(i, 0) = i$.

Similarly, $D(0, j) = j$ (insert $S_2[1], S_2[2], \dots, S_2[j]$).

Inductive case for $i, j > 0$:

$$D(i, j) = \min \begin{cases} D(i-1, j) + 1 & \text{(D)} \\ D(i, j-1) + 1 & \text{(I)} \\ D(i-1, j-1) + t(i, j) & \text{(MR)} \end{cases}$$

where $t(i, j) = 0$ if $S_1[i] = S_2[j]$, and $t(i, j) = 1$ otherwise.

Edit distance recurrence

What do the cases (D), (I) and (MR) mean? A transcript that transforms $S_1[1 \dots i]$ to $S_2[1 \dots j]$ either

- (D) transforms $S_1[1 \dots i-1]$ to $S_2[1 \dots j]$ and deletes $S_1[i]$,
- (I) transforms $S_1[1 \dots i]$ to $S_2[1 \dots j-1]$ and inserts $S_2[j]$, or
- (MR) transforms $S_1[1 \dots i-1]$ to $S_2[1 \dots j-1]$ and then matches $S_1[i]$ with $S_2[j]$ or replaces $S_1[i]$ by $S_2[j]$.

An optimal transcript is one that minimizes over the three possibilities.

NB: More than one of the above cases may give the same minimum, so there may be many co-optimal transcripts.

Tabular computation of edit distance

It is easy to implement the recurrences for $D(i, j)$ as a recursive procedure, and some really try to use this!

The problem is that a recursive procedure for $D(i, j)$ computes the same values exponentially many times.

But there are only $(n + 1)(m + 1)$ different $D(i, j)$ values ($0 \le i \le n$, $0 \le j \le m$). Therefore, compute them in a suitable order (bottom-up) and store them in an array, so that each value is computed only once.
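The gap between the plain recursion and a version that stores its results can be observed directly by counting how often the recurrence body is evaluated. The following Python sketch does this; it uses memoization via `functools.lru_cache` instead of the bottom-up table of the lecture, and the names `edit_distance_naive` and `edit_distance_memo` are illustrative only.

```python
from functools import lru_cache


def edit_distance_naive(s1: str, s2: str) -> tuple[int, int]:
    """Plain recursion on the recurrence; returns (distance, number of evaluations)."""
    calls = 0

    def d(i: int, j: int) -> int:
        nonlocal calls
        calls += 1
        if i == 0:
            return j  # base case: j insertions
        if j == 0:
            return i  # base case: i deletions
        t = 0 if s1[i - 1] == s2[j - 1] else 1  # strings are 0-indexed in Python
        return min(d(i - 1, j) + 1, d(i, j - 1) + 1, d(i - 1, j - 1) + t)

    return d(len(s1), len(s2)), calls


def edit_distance_memo(s1: str, s2: str) -> tuple[int, int]:
    """Same recurrence, but each (i, j) is evaluated at most once thanks to the cache."""
    calls = 0

    @lru_cache(maxsize=None)
    def d(i: int, j: int) -> int:
        nonlocal calls
        calls += 1  # counted only on cache misses
        if i == 0:
            return j
        if j == 0:
            return i
        t = 0 if s1[i - 1] == s2[j - 1] else 1
        return min(d(i - 1, j) + 1, d(i, j - 1) + 1, d(i - 1, j - 1) + t)

    return d(len(s1), len(s2)), calls
```

Both functions return the same distance, but the memoized variant evaluates the recurrence at most $(n + 1)(m + 1)$ times, while the count of the naive variant grows exponentially with the string lengths.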

Bottom-up computation

Fill a table $D(i, j)$, where $i = 0, \dots, n$ and $j = 0, \dots, m$, in increasing order of the pairs $(i, j)$.

First initialize column 0 and row 0 according to the base cases of the recurrence:

    for i := 0 to n do D(i, 0) := i;
    for j := 0 to m do D(0, j) := j;

Table $D(i, j)$ after initializing row 0 and column 0:

Initialization of the dynamic programming table

    S2:          w  r  i  t  e  r  s
    D(i,j)    0  1  2  3  4  5  6  7
         0    0  1  2  3  4  5  6  7
    V    1    1
    i    2    2
    n    3    3
    t    4    4
    n    5    5
    e    6    6
    r    7    7

(Rows are labelled with the characters of $S_1$.) Then compute the remaining cells of the table.

Computing inner cells

Inner cells $D(i, j)$ ($i, j > 0$) can be computed in any order such that the three values required by the recurrence have already been computed:

    D(i-1, j-1)   D(i-1, j)
    D(i,   j-1)   D(i,   j)

For example, in row-first order:

    for i := 1 to n do          // compute row i
        for j := 1 to m do      // and column j of row i
            compute D(i, j) according to the recurrence;
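The initialization and the row-first loop can be put together as follows. This is a minimal Python sketch of the tabulation described above; the function name `edit_distance_table` is an assumption, and Python's 0-based string indexing means that `s1[i - 1]` plays the role of $S_1[i]$.

```python
def edit_distance_table(s1: str, s2: str) -> list[list[int]]:
    """Fill the (n+1) x (m+1) table D bottom-up in row-first order and return it."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]

    # Base cases: column 0 (i deletions) and row 0 (j insertions).
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j

    # Inner cells: the cells above, to the left and diagonally above-left
    # are already filled when D[i][j] is computed.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(
                D[i - 1][j] + 1,      # (D) delete s1[i-1]
                D[i][j - 1] + 1,      # (I) insert s2[j-1]
                D[i - 1][j - 1] + t,  # (MR) match or replace
            )
    return D
```

The edit distance is the entry in the lower right corner, `edit_distance_table(s1, s2)[len(s1)][len(s2)]`.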

Example of row-first computation

Table after computing rows 1 to 3, and columns 1 to 3 of row 4:

    S2:          w  r  i  t  e  r  s
    D(i,j)    0  1  2  3  4  5  6  7
         0    0  1  2  3  4  5  6  7
    V    1    1  1  2  3  4  5  6  7
    i    2    2  2  2  2  3  4  5  6
    n    3    3  3  3  3  3  4  5  6
    t    4    4  4  4  4
    n    5    5
    e    6    6
    r    7    7

Complexity of the tabulation

Each of the $\Theta(nm)$ cells is filled in constant time.

Theorem 11.3.2. The edit distance $D(n, m)$ of strings $S_1[1 \dots n]$ and $S_2[1 \dots m]$ can be computed in $O(nm)$ time.

An optimal edit transcript can be computed within the same time bound, too:

- Store at each cell $(i, j)$ up to three pointers, and set them to point to the cell(s) that gave the value of $D(i, j)$. (Pointers are not strictly necessary, but they are useful for explanation.)
- Pointers on row 0 point to the cell on the left, and pointers on column 0 to the cell above.

Example of traceback pointers

Cell values of the completed table (pointer arrows not shown):

    S2:          w  r  i  t  e  r  s
    D(i,j)    0  1  2  3  4  5  6  7
         0    0  1  2  3  4  5  6  7
    V    1    1  1  2  3  4  5  6  7
    i    2    2  2  2  2  3  4  5  6
    n    3    3  3  3  3  3  4  5  6
    t    4    4  4  4  4  3  4  5  6
    n    5    5  5  5  5  4  4  5  6
    e    6    6  6  6  6  5  4  5  6
    r    7    7  7  6  7  6  5  4  5

Finding optimal transcripts and alignments

An optimal transcript can be found (in reverse) by following the traceback pointers from cell $(n, m)$ to $(0, 0)$:

1. a pointer from $(i, j)$ to $(i, j-1)$ means an insertion of $S_2[j]$;
2. a pointer from $(i, j)$ to $(i-1, j)$ means a deletion of $S_1[i]$;
3. a pointer from $(i, j)$ to $(i-1, j-1)$ means a match or a replace, depending on whether $S_1[i]$ and $S_2[j]$ match or not.

Alternatively, the path gives an optimal alignment, where

1. a pointer of the first kind puts a space in $S_1$ opposite $S_2[j]$,
2. a pointer of the second kind puts a space in $S_2$ opposite $S_1[i]$, and
3. a pointer of the third kind aligns $S_1[i]$ with $S_2[j]$.
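The traceback rules above can be sketched in Python as follows. Instead of storing pointers, this version re-checks at each cell which case of the recurrence produced the value, which is equivalent. The name `optimal_transcript` and the tie-breaking order (diagonal first, then up, then left) are assumptions of this sketch; any other order also yields an optimal transcript.

```python
def optimal_transcript(s1: str, s2: str) -> str:
    """Compute the edit distance table and trace back one optimal edit transcript."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1, D[i][j - 1] + 1, D[i - 1][j - 1] + t)

    ops = []  # collected from the end of the transcript to its start
    i, j = n, m
    while i > 0 or j > 0:
        mismatch = 1 if (i > 0 and j > 0 and s1[i - 1] != s2[j - 1]) else 0
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + mismatch:
            ops.append("R" if mismatch else "M")  # diagonal pointer
            i -= 1
            j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            ops.append("D")  # pointer to the cell above
            i -= 1
        else:
            ops.append("I")  # pointer to the cell on the left
            j -= 1
    return "".join(reversed(ops))
```

Each loop iteration decreases `i`, `j` or both, so the traceback itself takes at most $n + m$ steps once the table is available.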

Finding optimal alignments

Example: the optimal alignments specified by the pointers of the preceding table:

    1.  S1:  _ V i n t n e r _
        S2:  w r i _ t _ e r s

    2.  S1:  V _ i n t n e r _
        S2:  w r i _ t _ e r s

    3.  S1:  V i n t n e r _
        S2:  w r i t _ e r s
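All co-optimal alignments can be listed by following every pointer rather than just one. The Python sketch below does this recursively; the function name `all_optimal_alignments` is hypothetical, and because the number of co-optimal alignments can grow exponentially with the string lengths, it is meant for small examples only.

```python
def all_optimal_alignments(s1: str, s2: str) -> list[tuple[str, str]]:
    """Enumerate every optimal global alignment as a pair of rows, using '_' for a space."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1, D[i][j - 1] + 1, D[i - 1][j - 1] + t)

    results = []

    def walk(i: int, j: int, top: list[str], bottom: list[str]) -> None:
        # top and bottom hold the alignment columns collected so far, last column first.
        if i == 0 and j == 0:
            results.append(("".join(reversed(top)), "".join(reversed(bottom))))
            return
        if j > 0 and D[i][j] == D[i][j - 1] + 1:
            # Pointer to the left: space in s1 opposite s2[j-1].
            walk(i, j - 1, top + ["_"], bottom + [s2[j - 1]])
        if i > 0 and D[i][j] == D[i - 1][j] + 1:
            # Pointer upwards: s1[i-1] opposite a space in s2.
            walk(i - 1, j, top + [s1[i - 1]], bottom + ["_"])
        if i > 0 and j > 0:
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            if D[i][j] == D[i - 1][j - 1] + t:
                # Diagonal pointer: s1[i-1] aligned with s2[j-1].
                walk(i - 1, j - 1, top + [s1[i - 1]], bottom + [s2[j - 1]])

    walk(n, m, [], [])
    return results
```

For "Vintner" and "writers" the enumeration yields three alignments, which correspond to the three listed above.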

Complexity of tracing an optimal transcript

Theorem 11.3.3. After computing the dynamic programming table with pointers, an optimal transcript can be found in time $O(n + m)$.

Proof. Starting from cell $(n, m)$, a path to $(0, 0)$ can be followed by choosing any of the pointers: each cell $(i, j) \ne (0, 0)$ has at least one pointer to one of $(i-1, j)$, $(i, j-1)$ and $(i-1, j-1)$. Every step decreases $i$ or $j$ (or both), so at most $n + m$ pointers need to be followed.

The $\Theta(nm)$ time and space requirements of the basic dynamic programming solution can be improved slightly with more advanced techniques.

Generalizations

We can also consider edit operations with weights (or costs, or scores): $d$ for a deletion or insertion, $r$ for a substitution, and $e$ for a match.

The operation-weight edit distance problem is to find a transcript transforming $S_1$ into $S_2$ of minimum total weight.

Edit distance is the special case with $d = r = 1$ and $e = 0$.

Example: an alignment with $r = 2$, $d = 4$ and $e = 1$:

    S1:  V i n t n e r _
    S2:  w r i t _ e r s

    weight = 2 + 2 + 2 + 1 + 4 + 1 + 1 + 4 = 17

Computing operation-weight edit distance

Operation weights cause straightforward modifications to the recurrences.

Base cases with operation weights:

$$D(i, 0) = i \cdot d, \qquad D(0, j) = j \cdot d$$

(the weight $d$ times the number of characters deleted or inserted).

Computing operation-weight edit distance (2)

Inductive case, for $i, j > 0$, with operation weights:

$$D(i, j) = \min \begin{cases} D(i-1, j) + d \\ D(i, j-1) + d \\ D(i-1, j-1) + t(i, j) \end{cases}$$

with $t(i, j) = e$ if $S_1[i] = S_2[j]$, and $t(i, j) = r$ otherwise.

With these modifications, the edit distance with operation weights and the corresponding transcript or alignment can be computed in time $O(nm)$, exactly as before.
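The weighted recurrences can be sketched in Python as below. The parameter names `d`, `r` and `e` follow the lecture; the function names `weighted_edit_distance` and `alignment_weight` are assumptions of this sketch, and `_` again stands for a space in an alignment row.

```python
def weighted_edit_distance(s1: str, s2: str, d: int = 1, r: int = 1, e: int = 0) -> int:
    """Operation-weight edit distance: d per insertion/deletion, r per replace, e per match."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i * d  # i deletions
    for j in range(m + 1):
        D[0][j] = j * d  # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t = e if s1[i - 1] == s2[j - 1] else r
            D[i][j] = min(D[i - 1][j] + d, D[i][j - 1] + d, D[i - 1][j - 1] + t)
    return D[n][m]


def alignment_weight(row1: str, row2: str, d: int, r: int, e: int) -> int:
    """Total weight of one given alignment, with '_' marking a space."""
    if len(row1) != len(row2):
        raise ValueError("alignment rows must have equal length")
    total = 0
    for x, y in zip(row1, row2):
        if x == "_" or y == "_":
            total += d  # insertion or deletion
        elif x == y:
            total += e  # match
        else:
            total += r  # replace
    return total
```

With the default weights the first function reduces to the ordinary edit distance. For $r = 2$, $d = 4$, $e = 1$, `alignment_weight("Vintner_", "writ_ers", 4, 2, 1)` reproduces the weight 17 of the example alignment, whereas the recurrence gives 13 for the same strings: 17 is the weight of that particular alignment, not the minimum.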

Another generalization

The cost of edit operations can also be allowed to depend on which characters of the alphabet are involved (alphabet-weight edit distance).

Which are used, alphabet weights or operation weights?

- Proteins are most often compared using alphabet weights over the amino-acid alphabet. No universal agreement on the scores exists; PAM and BLOSUM are the two dominant families of scoring matrices.
- DNA strings are more often compared using operation-weight (or unweighted) costs. The database search program BLAST uses $e = +5$ and $r = -4$.

NB: The above applications maximize a score for similarity.

String Similarity

An alternative formalization for the relatedness of strings is their similarity (rather than distance), which is applied in most biological applications.

Let $\Sigma$ be the alphabet extended with the space character _. Denote the score of aligning characters $x$ and $y$ of $\Sigma$ by $s(x, y)$.

Let $S_1'[1 \dots l]$ and $S_2'[1 \dots l]$ be the equal-length versions of strings $S_1$ and $S_2$, padded with spaces, for an alignment $A$.

The value (or total score) of alignment $A$ is then

$$\sum_{i=1}^{l} s(S_1'[i], S_2'[i])$$

Value of an alignment

Consider the following score matrix for $\Sigma = \{a, b, c, d, \_\}$ (the matrix is symmetric; only the upper triangle is shown):

    s    a    b    c    d    _
    a    1   -1   -2    0   -1
    b         3   -2   -1    0
    c              0   -4   -2
    d                   3   -1
    _                        0

Often $s(x, y) \ge 0$ iff $x = y$.

Value of an alignment between $S_1$ = cacdbd and $S_2$ = cabbdb:

    S1:  c a c _ d b d
    S2:  c a b b d b _

    0 + 1 - 2 + 0 + 3 + 3 - 1 = 4
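The value of a given alignment is a plain sum over its columns, which the following Python sketch spells out. The dictionary `UPPER` holds the upper triangle of the score matrix above; the names `UPPER`, `score` and `alignment_value` are illustrative, and the lookup assumes that the matrix is symmetric.

```python
# Upper triangle of the score matrix over {a, b, c, d, _}.
UPPER = {
    ("a", "a"): 1, ("a", "b"): -1, ("a", "c"): -2, ("a", "d"): 0, ("a", "_"): -1,
    ("b", "b"): 3, ("b", "c"): -2, ("b", "d"): -1, ("b", "_"): 0,
    ("c", "c"): 0, ("c", "d"): -4, ("c", "_"): -2,
    ("d", "d"): 3, ("d", "_"): -1,
    ("_", "_"): 0,
}


def score(x: str, y: str) -> int:
    """Look up s(x, y), using symmetry for pairs stored in the other order."""
    if (x, y) in UPPER:
        return UPPER[(x, y)]
    return UPPER[(y, x)]


def alignment_value(row1: str, row2: str) -> int:
    """Sum of the column scores of an alignment given as two equal-length rows."""
    if len(row1) != len(row2):
        raise ValueError("alignment rows must have equal length")
    return sum(score(x, y) for x, y in zip(row1, row2))
```

Calling `alignment_value("cac_dbd", "cabbdb_")` adds up the same seven column scores as the hand computation above.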

Defining string similarity

The similarity of strings $S_1$ and $S_2$ (determined by a scoring matrix) is the value of an alignment of $S_1$ and $S_2$ that maximizes the total score, that is, the optimal alignment value.

String similarity and alphabet-weight edit distance are related but not interchangeable. We'll see that similarity is more suitable for computing local alignments.

The similarity of strings $S_1$ and $S_2$ can be computed by a straightforward modification of the edit distance computation. For this, define $V(i, j)$ to be the optimal alignment value of the prefixes $S_1[1 \dots i]$ and $S_2[1 \dots j]$.

Recurrences for String Similarity

The base conditions are rather obvious (assuming $s(\_, \_) \le 0$):

$$V(0, j) = \sum_{k=1}^{j} s(\_, S_2[k]) \quad\text{and}\quad V(i, 0) = \sum_{k=1}^{i} s(S_1[k], \_)$$

(the total score of aligning the characters with spaces).

The inductive case for $i, j > 0$ similarly takes the character-specific scores into account:

Computing String Similarity

$$V(i, j) = \max \begin{cases} V(i-1, j) + s(S_1[i], \_) \\ V(i, j-1) + s(\_, S_2[j]) \\ V(i-1, j-1) + s(S_1[i], S_2[j]) \end{cases}$$

Applying these, similarity can be computed like edit distance, with $V(n, m)$ at the lower right corner of the table.

The similarity and an optimal alignment of $S_1[1 \dots n]$ and $S_2[1 \dots m]$ can be computed in time $O(nm)$.

A variation of the above recurrences supports locating approximate occurrences of a pattern $P$ in a text $T$ (considered next).
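The similarity recurrences translate into code in the same way as the edit distance ones, with `max` in place of `min`. The Python sketch below takes the scoring function as a parameter; the names `similarity` and `unit_score` are assumptions of this sketch, and `_` denotes the space character.

```python
from typing import Callable


def similarity(s1: str, s2: str, s: Callable[[str, str], int]) -> int:
    """Optimal alignment value V(n, m) under the scoring function s; '_' is the space."""
    n, m = len(s1), len(s2)
    V = [[0] * (m + 1) for _ in range(n + 1)]

    # Base conditions: prefixes aligned entirely against spaces.
    for i in range(1, n + 1):
        V[i][0] = V[i - 1][0] + s(s1[i - 1], "_")
    for j in range(1, m + 1):
        V[0][j] = V[0][j - 1] + s("_", s2[j - 1])

    # Inductive case: maximize over the three possible last columns.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(
                V[i - 1][j] + s(s1[i - 1], "_"),
                V[i][j - 1] + s("_", s2[j - 1]),
                V[i - 1][j - 1] + s(s1[i - 1], s2[j - 1]),
            )
    return V[n][m]


def unit_score(x: str, y: str) -> int:
    """Hypothetical scoring: 0 for identical characters, -1 for anything else."""
    return 0 if x == y else -1
```

With `unit_score`, every insertion, deletion and replacement costs one point, so `similarity(s1, s2, unit_score)` is the negated edit distance. This gives a quick consistency check between the two formulations.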