Computational Biology Lecture 6: Affine gap penalty function, multiple sequence alignment
Saad Mneimneh

We saw earlier how we can use a concave gap penalty function γ, i.e. one that satisfies γ(x+1) - γ(x) ≤ γ(x) - γ(x-1), as a more realistic model for penalizing gaps. This was at the expense of increasing the time complexity of our dynamic programming algorithm: with a concave gap penalty function, the new DP algorithm requires O(mn^2 + nm^2) time, compared to the O(mn) time bound of the previous algorithms. We will seek a solution that brings us back to our old time bound of O(mn) while still providing a good model for gap penalties.

Affine gap penalty function

The objective is to come up with a realistic gap penalty function that does not require the dynamic programming algorithm to spend more than O(mn) time updating the table. Let us look first at the origin of the problem and why the general gap penalty function fails to achieve this: the general penalty function can possibly have a different penalty for each additional gap, so the algorithm has to check all possible gap lengths. At a given position A(i, j) this means i + j possibilities, and this is what makes the algorithm spend cubic time updating all the entries.

We will approximate γ by a set of linear functions. For simplicity, assume that we approximate it using two functions as illustrated below (this could be generalized to more than two).

[Figure 1: Affine gap penalty function. γ(x) charges e at x = 1 and then grows linearly with slope d, shown against the previous concave model.]

The figure suggests that we penalize the first gap differently, possibly more than the rest, and that each subsequent gap is penalized linearly. The gap penalty function is

    γ(x) = e + (x - 1)d,    x ≥ 1

This is called the affine gap penalty model. If e >> d, which is usually the case, then this is still a concave function and therefore a good approximation of the realistic setting. However, for the rest of this document we do not impose any constraints on d and e (concavity will not be a requirement).

What have we gained so far? A gap is now either the opening of a larger gap or it is not, and we distinguish only between these two cases. Therefore, we have only two possibilities to check: if the gap is an opening gap, we penalize it by e; otherwise, we penalize it as before by d. But now we have to be able to detect the opening of a gap. To detect the opening of a gap, we will explicitly distinguish between three kinds of alignments: alignments that end with no gaps, alignments that end with a gap aligned with x, and alignments that end with a gap aligned with y. For this purpose, we define three types of blocks in an alignment:

A: a block containing one character of x aligned to one character of y
B: a block containing a maximal number of contiguous gaps aligned with x; thus a type B block cannot follow a type B block
C: a block containing a maximal number of contiguous gaps aligned with y; thus a type C block cannot follow a type C block

Therefore, an alignment can be divided into blocks (instead of columns as before) of three types A, B, and C. The following is an example, with block boundaries marked by spaces:

    x: AAC --- A ATTCCG ACT AC
    y: ACT ACC T ------ CGC --

The block types here are, in order: A, C, A, B, A, B.
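
As a quick sanity check on this decomposition, here is a minimal Python sketch (the function name and the grouping of consecutive type A columns are my own choices, not the lecture's) that splits a gapped pairwise alignment into maximal blocks and recovers the six blocks of the example above:

    GAP = "-"

    def blocks(x_row, y_row):
        """Split an alignment, given as two equal-length gapped strings, into
        maximal blocks: B = gaps in y aligned with x, C = gaps in x aligned
        with y, A = no gaps (consecutive A columns are grouped here for
        readability; the lecture treats each A column as its own block)."""
        out = []
        for cx, cy in zip(x_row, y_row):
            t = "B" if cy == GAP else ("C" if cx == GAP else "A")
            if out and out[-1][0] == t:
                out[-1][1] += cx    # extend the current block
                out[-1][2] += cy
            else:
                out.append([t, cx, cy])
        return [tuple(b) for b in out]

    print(blocks("AAC---AATTCCGACTAC", "ACTACCT------CGC--"))
    # [('A', 'AAC', 'ACT'), ('C', '---', 'ACC'), ('A', 'A', 'T'),
    #  ('B', 'ATTCCG', '------'), ('A', 'ACT', 'CGC'), ('B', 'AC', '--')]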

An important feature of the division above is that the score of the alignment becomes additive across block boundaries (this is true regardless of the gap penalty function). Therefore, our dynamic programming table can maintain three matrices for the three kinds of alignments mentioned above: alignments that end with a block of type A, alignments that end with a block of type B, and alignments that end with a block of type C. Each entry in one of these three matrices now depends on previous entries in the other matrices. For instance, the optimal score for an alignment of x_1...x_i and y_1...y_j that ends with a gap aligned with x (defined as B(i, j)) is obtained either by making that gap an opening gap, giving A(i-1, j) - e or C(i-1, j) - e, or by making it the continuation of a previously opened gap, giving B(i-1, j) - d. The maximum of these three is the optimal score B(i, j). More formally:

A(i, j): optimal score for an alignment of x_1...x_i and y_1...y_j that ends with no gaps
B(i, j): optimal score for an alignment of x_1...x_i and y_1...y_j that ends with a maximal gap aligned with x
C(i, j): optimal score for an alignment of x_1...x_i and y_1...y_j that ends with a maximal gap aligned with y

Then our update rules become:

    A(i, j) = max { A(i-1, j-1) + s(i, j),
                    B(i-1, j-1) + s(i, j),
                    C(i-1, j-1) + s(i, j) }

    B(i, j) = max { A(i-1, j) - e,     (opening gap)
                    B(i-1, j) - d,     (same block, continuing gap)
                    C(i-1, j) - e }    (opening gap)

    C(i, j) = max { A(i, j-1) - e,     (opening gap)
                    B(i, j-1) - e,     (opening gap)
                    C(i, j-1) - d }    (same block, continuing gap)

Special care has to be taken with the initial conditions of the three matrices. A(0, 0) = 0, A(i, 0) = -∞ for i = 1...m, and A(0, j) = -∞ for j = 1...n. The reason is that A(i, 0) and A(0, j) do not correspond to alignments that end with a type A block, and we have to make sure they do not affect the computation. Similarly, B(0, 0) = 0, B(i, 0) = -e - (i-1)d, and B(0, j) = -∞. Finally, C(0, 0) = 0, C(i, 0) = -∞, and C(0, j) = -e - (j-1)d. The trace-back pointers are kept as before, except that now they can jump from one matrix to another, and we start from max(A(m, n), B(m, n), C(m, n)).

We can simplify the above algorithm a little. Note that the following two endings cannot occur in an optimal alignment if s < d + e, where s denotes the mismatch penalty:

    ...-x...      ...x-...
    ...y-...      ...-y...

This is because if s < d + e, the penalty for a mismatch is less than the penalty of the two gaps, so replacing the two columns by a single mismatch column improves the score. In other words, a type C block cannot follow a type B block and vice versa. Therefore, if s < d + e, we can disregard updating B using C and C using B: the third line in the update for B(i, j) and the second line in the update for C(i, j) can be ignored.
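
These rules translate almost line by line into code. The following is a minimal Python sketch under assumed parameters (the match/mismatch values and e = 3, d = 1 are illustrative, not from the lecture); it computes the optimal score only, leaving out the trace-back pointers:

    NEG_INF = float("-inf")

    def affine_gap_score(x, y, match=1, mismatch=-1, e=3, d=1):
        """Global alignment score with affine gap penalty e + (len - 1) * d,
        using the three matrices A, B, C described above."""
        m, n = len(x), len(y)
        A = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
        B = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
        C = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
        A[0][0] = B[0][0] = C[0][0] = 0
        for i in range(1, m + 1):
            B[i][0] = -e - (i - 1) * d   # column 0 ends with a type B block
        for j in range(1, n + 1):
            C[0][j] = -e - (j - 1) * d   # row 0 ends with a type C block
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                s = match if x[i - 1] == y[j - 1] else mismatch
                A[i][j] = max(A[i-1][j-1], B[i-1][j-1], C[i-1][j-1]) + s
                B[i][j] = max(A[i-1][j] - e,   # opening gap
                              B[i-1][j] - d,   # same block, continuing gap
                              C[i-1][j] - e)   # opening gap
                C[i][j] = max(A[i][j-1] - e,   # opening gap
                              B[i][j-1] - e,   # opening gap
                              C[i][j-1] - d)   # same block, continuing gap
        return max(A[m][n], B[m][n], C[m][n])

    # the sequences from the block example above, with gaps removed
    print(affine_gap_score("AACAATTCCGACTAC", "ACTACCTCGC"))

With these parameters the mismatch penalty satisfies s = 1 < d + e = 4, so the C-from-B and B-from-C lines could also be dropped, as described above.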

Multiple sequence alignment

The objective now is to generalize what we have seen so far to multiple sequences: we need to align k sequences in an optimal way. But what is optimal now? Let us assume for the rest of this document that the score is additive, and hence that the score of an alignment is the sum of the scores of its columns. We now have k characters in each column, as opposed to just 2, so our scoring function would have to take k arguments, which is a complicated function to build. We would like to still use the two-character scoring function to score a whole column. One famous scoring scheme is the sum of pairs score (SP score), which consists of scoring every pair of sequences separately and adding up the scores.

Given two sequences s_i and s_j (note that the subscript now refers to the sequence, not to a character in the sequence), define the induced alignment of s_i and s_j as the alignment obtained by isolating both sequences from the multiple alignment and ignoring all columns made entirely of gaps. Let score_ij be the score of the induced alignment of s_i and s_j; then the total score of the multiple alignment is Σ_{i<j} score_ij.

Alternatively, one could look at each column of the multiple alignment and compute the score of the column separately (assuming additivity) as the sum of the scores of every pair of characters in that column. In this case, we need to define s(-, -), the score of a gap aligned with a gap. This scenario is now possible since we have k sequences and hence up to k - 1 gaps in a column. We define s(-, -) = 0 to be consistent with the score of induced alignments. In other words, if s(-, -) = 0, then

    Σ_l SP-score_l = Σ_{i<j} score_ij

where SP-score_l is the sum of pairs score of column l. Here is an example: if a column is [I, -, I, V], then its score is

    SP-score(I, -, I, V) = s(I, -) + s(I, I) + s(I, V) + s(-, I) + s(-, V) + s(I, V)

Note that the SP score is independent of the order of characters in a column (which is nice).

An important thing to realize is that the induced alignment between a pair of sequences is not necessarily an optimal one. The following example shows how to compute the SP score and illustrates a case where the induced alignment is not optimal. Let s1 = ATG, s2 = ATG, s3 = A, s4 = T, with the multiple alignment:

    ATG
    ATG
    A--
    -T-

The score is:

    SP-score = SP-score(A, A, A, -) + SP-score(T, T, -, T) + SP-score(G, G, -, -)

Alternatively, using induced alignments (the numbers below correspond to a match scoring 1 and a gap scoring -2):

    SP-score = score(ATG, ATG) + score(ATG, A--) + score(ATG, -T-)
             + score(ATG, A--) + score(ATG, -T-) + score(A-, -T)
             = 3 - 3 - 3 - 3 - 3 - 4 = -13

Note that the induced alignment of s3 and s4 is

    A-
    -T

which is not the optimal alignment for these two sequences.
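
Column-wise SP scoring is easy to implement. Here is a minimal Python sketch (function names are mine, and the mismatch value is an assumption since the example uses only matches and gaps) that scores a multiple alignment column by column with s(-, -) = 0 and reproduces the -13 above:

    from itertools import combinations

    GAP = "-"

    def pair_score(a, b, match=1, mismatch=-1, gap=-2):
        """Score one pair of symbols; s(-, -) = 0 keeps the column-wise
        score consistent with the induced-alignment score."""
        if a == GAP and b == GAP:
            return 0
        if a == GAP or b == GAP:
            return gap
        return match if a == b else mismatch

    def sp_score(rows):
        """Sum-of-pairs score of a multiple alignment given as equal-length rows."""
        return sum(pair_score(a, b)
                   for col in zip(*rows)           # iterate over columns
                   for a, b in combinations(col, 2))

    print(sp_score(["ATG", "ATG", "A--", "-T-"]))  # -13, as computed above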

Multiple sequence alignment can be solved using the same dynamic programming formulation as before. We now have k sequences of length n_i each, i = 1...k, so we need a k-dimensional array A with n_i + 1 entries in dimension i (recall that for two sequences of lengths m and n we needed an array of size (m + 1) x (n + 1)). Therefore, A now requires O(n^k) space, where n is the maximum sequence length. The entry A(i_1, ..., i_k) holds the score of the optimal alignment of the prefixes s_1,1...s_1,i_1, ..., s_k,1...s_k,i_k.

The problem now is that each entry depends on 2^k - 1 other entries (note that for k = 2, A(i, j) depends on 3 entries, namely A(i-1, j-1), A(i, j-1), and A(i-1, j)). To see this, consider the simple case of two sequences x and y: an alignment of x_1...x_i and y_1...y_j can end in 3 different ways, namely x_i aligned with y_j, x_i aligned with a gap, or y_j aligned with a gap. In other words, all possible ways of having gaps at the end, including no gaps at all, and excluding the possibility of having gaps only. Similarly, with k sequences we can have any combination of gaps at the end, excluding the all-gaps combination. These are 2^k - 1 possibilities. To think of it another way, consider a vector of length k of zeros and ones, where zero denotes a gap: we can have every possible vector except 000...0. Again, these are 2^k - 1 vectors.

[Figure 2: the 2^k - 1 dependencies of A(i_1, i_2, i_3) for k = 3, one for each of the 7 nonzero binary vectors 001, 010, 011, 100, 101, 110, 111.]

Since the SP score of a column requires O(k^2) time to compute (looking at all pairs), the running time of the dynamic programming algorithm is O(2^k k^2 n^k). The exponential time bound is not a surprise: the problem of multiple sequence alignment with SP score is known to be NP-complete, so no polynomial time algorithm is known for it. For this reason, we will look at heuristic algorithms. One famous heuristic is Star alignment, a special case of a more general kind of alignment known as tree alignment, which we describe next.

Tree alignment: Given a tree with k nodes representing k sequences s_1, ..., s_k, a multiple alignment of the k sequences is consistent with the tree if the induced alignment of s_i and s_j is optimal whenever (s_i, s_j) is an edge of the tree.

[Figure 3: Tree alignment. The sequences s_1 = AXZ, s_2 = AXZ, s_3 = AXXZ, s_4 = AYZ, s_5 = AYXXZ are connected by the tree edges (s_1, s_3), (s_2, s_3), (s_3, s_4), (s_4, s_5), and the multiple alignment consistent with this tree is:

    s_1: AX--Z
    s_2: A-X-Z
    s_3: AXX-Z
    s_4: AY--Z
    s_5: AYXXZ]

In the example above, the induced alignment of s_1 and s_3 is optimal. The induced alignment of s_2 and s_4, for instance, is not.

Given a tree, is it always possible to produce an alignment consistent with the tree? The answer is yes. Here is the algorithm to produce such an alignment (the merging step (3) is sketched in code after the proof below):

(1) Pick s_i and s_j such that (s_i, s_j) is an edge of the tree and align them optimally. Let S = {s_i, s_j}, the set of aligned sequences.
(2) Pick s_k ∉ S and s_l ∈ S such that (s_k, s_l) is an edge of the tree and align them optimally.
(3) Once a gap, always a gap: for each gap added to s_l in this alignment, add a corresponding gap to all sequences in S; for each gap already in s_l, add a corresponding gap to s_k (if needed).
(4) S = S ∪ {s_k}.
(5) Repeat from (2) until all sequences are in S.

The proof that the algorithm above works is by induction on the number of sequences in S. The base case is two sequences that are optimally aligned, corresponding to some edge (s_i, s_j). For the inductive step, a sequence s_k ∉ S is added and aligned optimally with s_l ∈ S for an edge (s_k, s_l) (for a given s_k there is only one such s_l, since the edges form a tree). Using the once-a-gap-always-a-gap strategy, the scores of the induced alignments of the sequences already in S are unchanged (because s(-, -) = 0). For the same reason, the score of the new optimal alignment of s_k and s_l is unchanged. Therefore, we obtain a new set S = S ∪ {s_k} with one additional sequence and a multiple alignment that is still consistent with the tree.
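
Here is a minimal Python sketch of the merging step (the names, and the choice to let a gap already in s_l's row absorb a matching pairwise gap, are mine; the sharing is harmless because s(-, -) = 0, and it keeps the alignment compact). It takes the current multiple alignment, the index l of the row spelling s_l, and an optimal pairwise alignment of s_l and s_k as two gapped strings:

    GAP = "-"

    def join(msa, l, aligned_l, aligned_k):
        """Merge s_k into the multiple alignment msa (a list of equal-length
        rows), where row l spells s_l plus gaps and (aligned_l, aligned_k)
        is an optimal pairwise alignment of s_l and s_k."""
        out = [[] for _ in range(len(msa) + 1)]   # old rows plus one new row
        i = j = 0                                 # column pointers: msa, pairwise
        while i < len(msa[l]) or j < len(aligned_l):
            old_gap = i < len(msa[l]) and msa[l][i] == GAP
            new_gap = j < len(aligned_l) and aligned_l[j] == GAP
            if i < len(msa[l]) and j < len(aligned_l) and old_gap == new_gap:
                # same character of s_l on both sides, or two gaps that can
                # share a column: keep the old column, append s_k's symbol
                for r in range(len(msa)):
                    out[r].append(msa[r][i])
                out[-1].append(aligned_k[j])
                i += 1; j += 1
            elif old_gap or j >= len(aligned_l):
                # gap already in s_l: keep the old column, pad s_k with a gap
                for r in range(len(msa)):
                    out[r].append(msa[r][i])
                out[-1].append(GAP)
                i += 1
            else:
                # gap newly added to s_l: once a gap, always a gap --
                # pad every old row with a gap
                for r in range(len(msa)):
                    out[r].append(GAP)
                out[-1].append(aligned_k[j])
                j += 1
        return ["".join(row) for row in out]

    msa = join(["AX-Z", "A-XZ", "AXXZ"], 2, "AXXZ", "AY-Z")
    print(msa)                             # ['AX-Z', 'A-XZ', 'AXXZ', 'AY-Z']
    print(join(msa, 3, "AY--Z", "AYXXZ"))  # ['AX--Z', 'A-X-Z', 'AXX-Z',
                                           #  'AY--Z', 'AYXXZ'], as in Figure 4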

What is the running time of the tree alignment algorithm? We perform k - 1 optimal pairwise alignments (the number of edges in the tree), each taking O(n^2) time. Moreover, every time we add a new sequence to the alignment, we have to update the gaps of all sequences already in the alignment (O(k) sequences) using the once-a-gap-always-a-gap strategy. This gap update takes O(kl) time for each added sequence, where l is the length of the multiple alignment. As a result, the total running time is O(kn^2 + k^2 l).

The following figure illustrates the steps of the algorithm on the example presented earlier.

[Figure 4: Performing the tree alignment algorithm.
    Align (s_1, s_3): AX-Z / AXXZ
    Align (s_2, s_3): A-XZ / AXXZ,   join: AX-Z, A-XZ, AXXZ
    Align (s_3, s_4): AXXZ / AY-Z,   join: AX-Z, A-XZ, AXXZ, AY-Z
    Align (s_4, s_5): AY--Z / AYXXZ, join: AX--Z, A-X-Z, AXX-Z, AY--Z, AYXXZ]

Now back to Star alignment. Star alignment is just the special case when the tree is a star: one sequence is the center of the star and every other sequence is connected to the center by an edge. The question, of course, is: given the sequences s_1, ..., s_k, which one do we choose as the center of the star? The only guarantee of Star alignment is to produce a multiple alignment in which the induced alignment of any sequence with the center sequence is optimal. Therefore, as a heuristic, Star alignment chooses the center of the star to be the sequence s_i that maximizes

    M_i = Σ_{j≠i} OPT(s_i, s_j)

where OPT denotes the optimal score of aligning two sequences. The running time of Star alignment is the same as for general tree alignment, except that we have to account for finding argmax_i M_i. This can be done by performing all pairwise alignments and choosing the sequence s_i that maximizes M_i. Therefore, the running time of Star alignment is O(k^2 n^2 + k^2 l).

Star alignment with the above heuristic has some nice properties in terms of approximating the optimal SP score: under some special pairwise scoring schemes, the score of the Star alignment is within a constant factor of the optimal SP score.

References

Setubal, J., Meidanis, J., Introduction to Computational Molecular Biology, Chapter 3.
Gusfield, D., Algorithms on Strings, Trees, and Sequences, Chapter 14.