Lecture 3: February 7. Local Alignment: The Smith-Waterman Algorithm


CSCI1820: Sequence Alignment, Spring 2017
Lecture 3: February 7
Lecturer: Sorin Istrail
Scribe: Pranavan Chanthrakumar

Note: LaTeX template courtesy of UC Berkeley EECS dept. Notes are also adapted from notes from previous years' offerings of CSCI1810 and CSCI1820.

Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications. They may be distributed outside this class only with the permission of the Instructor.

3.1 Local Alignment: The Smith-Waterman Algorithm

The Smith-Waterman algorithm is one of the most important and useful algorithms in computational biology. In the last lecture, we looked at both global and local alignment algorithms; here we will dive deeper into the intuition behind the local alignment algorithm, more formally known as the Smith-Waterman algorithm.

Prefixes and Suffixes

Before formalizing the above intuition and defining local alignment, we need a few definitions. A prefix a of a string b is a string that may be obtained by removing zero or more characters from the end of b. Similarly, a suffix c of a string b may be obtained by removing zero or more characters from the beginning of b. Then, a substring (or subsequence¹) a of a string b may be obtained by removing characters from either end of b (but not from the middle). See Figure 3.1 for examples of prefixes, suffixes, and substrings of the sequence ACGAAT.

Note that in Figure 3.1, we use ɛ to represent the empty string, which is also technically a prefix and a suffix of a string. Furthermore, a string is also a prefix and a suffix of itself. Excluding these two edge cases gives us the proper prefixes and proper suffixes of a string.

At this time, you should convince yourself that the set of all suffixes of prefixes of a string x is equal to the set of all prefixes of suffixes of x, and that both are equal to the set of all substrings of x.

Local Alignment Definition

In Lecture 2, we defined local alignment in a general sense. Here, we will describe it more formally.

The optimal local alignment of two strings $X$ and $Y$ is the best-scoring optimal global alignment over all pairs $X'$, $Y'$ such that $X'$ is a substring of $X$ and $Y'$ is a substring of $Y$.

This definition is deceptively simple. Note that it encapsulates our intuition, allowing us to identify regions of two strings that align well, even when the remainder of the strings aligns poorly. Furthermore, we shall soon see that an efficient algorithm (Smith-Waterman) exists to calculate optimal local alignments, with the same asymptotic running time as the Needleman-Wunsch algorithm, which is the formal name for the global alignment algorithm.

¹ Note that some authors define subsequence such that the characters can be removed from the middle of a sequence; this definition is not of interest to us in this context. The term substring is generally less ambiguous.
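The definition above can be checked directly on small inputs. The following is a minimal Python sketch, not part of the original notes: it enumerates substrings as prefixes of suffixes and takes the best global alignment score over every substring pair. The function names and the scoring values (match +1, mismatch -1, indel -1) are assumptions chosen only for illustration.

```python
def prefixes(s):
    """All prefixes of s, including the empty string and s itself."""
    return {s[:k] for k in range(len(s) + 1)}


def suffixes(s):
    """All suffixes of s, including the empty string and s itself."""
    return {s[k:] for k in range(len(s) + 1)}


def substrings(s):
    """All substrings of s, built as prefixes of suffixes."""
    return {p for suf in suffixes(s) for p in prefixes(suf)}


def global_score(x, y, match=1, mismatch=-1, indel=-1):
    """Optimal global alignment score (Needleman-Wunsch), keeping two rows.

    The scoring values are illustrative assumptions, not from the notes.
    """
    prev = [j * indel for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        curr = [i * indel]
        for j in range(1, len(y) + 1):
            delta = match if x[i - 1] == y[j - 1] else mismatch
            curr.append(max(prev[j - 1] + delta,   # align x_i with y_j
                            prev[j] + indel,       # align x_i with a gap
                            curr[j - 1] + indel))  # align y_j with a gap
        prev = curr
    return prev[-1]


def brute_force_local_score(x, y):
    """Local alignment score taken literally from the definition."""
    return max(global_score(xs, ys)
               for xs in substrings(x)
               for ys in substrings(y))


print(brute_force_local_score("ACGAAT", "GGCGAAG"))  # prints 4 (CGAA vs CGAA)
```

This brute-force version is far too slow for real sequences, since it runs a full global alignment for every pair of substrings; it is only meant to make the definition concrete.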

Prefixes:   ɛ  A  AC  ACG  ACGA  ACGAA  ACGAAT
Suffixes:   ɛ  T  AT  AAT  GAAT  CGAAT  ACGAAT
Substrings: ACGAAT  ACGAA  ACGA  ACG  AC  A
            CGAAT  CGAA  CGA  CG  C
            GAAT  GAA  GA  G
            AAT  AA
            AT
            T
            ɛ

Figure 3.1: Prefixes, Suffixes, and Substrings of ACGAAT

Problem Description

We present here, as in the last lecture, the problem of local alignment. Some of the notation is slightly different, to get you used to different ways of representing the same problem. The same ideas are all present, though.

Given:
- An alphabet, $\Sigma$.
- A similarity (or scoring) matrix, $\delta$.
- Two sequences, $X$ and $Y$, such that

$$X = x_1 x_2 x_3 \ldots x_n = x$$
$$Y = y_1 y_2 y_3 \ldots y_m = y$$

where $x_i, y_j \in \Sigma$ for all $i \in [1, n]$ and $j \in [1, m]$. In other words, $x \in \Sigma^n$ and $y \in \Sigma^m$.

Compute: The score of the optimal local alignment. Let $\alpha$ be a subsequence (substring) of $X$, and $\beta$ be a subsequence (substring) of $Y$. Let $r$ be the score of the global alignment of $\alpha$ with $\beta$. We want the maximum score of a local alignment between $X$ and $Y$, $r^*$, which is the global alignment score of $\alpha^*$ and $\beta^*$, where $\alpha^* \in \mathrm{Subseq}(X)$ and $\beta^* \in \mathrm{Subseq}(Y)$ are the substrings that maximize $r$. Note that $\mathrm{Subseq}(S)$ refers to the set of substrings of the string $S$.

A Description of the Smith-Waterman Algorithm

The Smith-Waterman algorithm can be thought of as carrying out the following steps (from a high level):

- Consider two sequences, $X$ and $Y$. Let $X_i$ be the $i$th prefix of $X$ (its first $i$ characters), and let $Y_j$ be the $j$th prefix of $Y$. As an example, $X_2$ of our previous sequence ACGAAT would be AC, while $Y_3$ of a different sequence ACTGAG would be ACT.
- Take a suffix of $X_i$, and take a suffix of $Y_j$. Let $V(i, j)$ be the best optimal global alignment score over all such pairs of suffixes. In other words, we consider the global alignment of every suffix of $X_i$ with every suffix of $Y_j$, find the pair of suffixes with the highest score, and store that score in $V(i, j)$.

Pseudocode and Key Concepts

Here, as we did in the last lecture, we present the pseudocode for the local alignment algorithm:

    function LocalAlignment(x ∈ Σ^n, y ∈ Σ^m)
        V[0, 0] ← 0
        for i ∈ {1, 2, ..., n} do
            V[i, 0] ← 0
        for j ∈ {1, 2, ..., m} do
            V[0, j] ← 0
        for i ∈ {1, 2, ..., n} do
            for j ∈ {1, 2, ..., m} do
                V[i, j] ← max { 0,
                                V[i−1, j−1] + δ(x_i, y_j),
                                V[i−1, j] + δ(x_i, −),
                                V[i, j−1] + δ(−, y_j) }
        return max { V[i, j] : i ∈ {0, 1, ..., n}, j ∈ {0, 1, ..., m} }

Note that $V(i, 0) = V(0, j) = 0$ for all $i$ and $j$. Also note that, just as in global alignment, our matrix here has $(N + 1)$ columns and $(M + 1)$ rows, where $N = n$ and $M = m$ are the lengths of the two sequences. You can switch these dimensions, but it would require a slight readjustment of the loops in our code in order to fill in the table row-wise rather than column-wise. These finer details do not affect the result.

For $i > 0$ and $j > 0$, we set $V(i, j)$ to the maximum score from four different scenarios: beginning a new local alignment (score 0), aligning $x_i$ with $y_j$, aligning $x_i$ with a gap, and aligning $y_j$ with a gap. See the last lecture for more on the dynamic programming aspect of this. Finally, the maximum value in the matrix $V$ gives the score of the optimal local alignment.

There is also the concept of backtracking (or traceback), which is necessary to construct the actual optimal local alignment (rather than just its score). In global alignment, we start constructing our alignment from the edit graph using the value in the bottom right corner of our matrix, while in local alignment you can start from any cell of the $V$ matrix that contains the matrix's maximal score, and you stop once you reach a cell whose value is 0.

Runtime of Local Alignment

The Smith-Waterman algorithm is an algorithm of order $NM$. Here, a unit of time is any one small, constant-time operation, such as an addition, a subtraction, or the assignment of a variable.
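The pseudocode and the traceback idea described above can be sketched in Python as follows. This is a minimal illustration, not the notes' reference implementation: the function name and the simple scoring scheme (match +1, mismatch -1, indel -1, standing in for the matrix $\delta$) are assumptions. The doubly nested loop over `i` and `j` is also where the order $NM$ running time comes from.

```python
def smith_waterman(x, y, match=1, mismatch=-1, indel=-1):
    """Return (score, aligned_x, aligned_y) for one optimal local alignment.

    The scoring values are illustrative assumptions standing in for delta.
    """
    n, m = len(x), len(y)
    # V has (n + 1) x (m + 1) cells; row 0 and column 0 stay at 0.
    V = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            delta = match if x[i - 1] == y[j - 1] else mismatch
            V[i][j] = max(0,                        # start a new local alignment
                          V[i - 1][j - 1] + delta,  # align x_i with y_j
                          V[i - 1][j] + indel,      # align x_i with a gap
                          V[i][j - 1] + indel)      # align y_j with a gap
            if V[i][j] > best:
                best, best_pos = V[i][j], (i, j)

    # Traceback: start at a maximal cell and stop at the first 0.
    i, j = best_pos
    top, bottom = [], []
    while V[i][j] > 0:
        delta = match if x[i - 1] == y[j - 1] else mismatch
        if V[i][j] == V[i - 1][j - 1] + delta:
            top.append(x[i - 1])
            bottom.append(y[j - 1])
            i, j = i - 1, j - 1
        elif V[i][j] == V[i - 1][j] + indel:
            top.append(x[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(y[j - 1])
            j -= 1

    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(smith_waterman("ACGAAT", "GGCGAAG"))  # (4, 'CGAA', 'CGAA')
```

When several cells share the maximal score, this sketch keeps the first one found; any of them would be a valid starting point for the traceback.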

Note that our $V$ matrix is $(N + 1)$ by $(M + 1)$, and we do the following calculations in the algorithm:

- The initialization step takes $(N + M + 1)$ units of time, to fill in each of the matrix values where $i = 0$ or $j = 0$.
- The body of the non-initialization loops runs $NM$ times.
- On each iteration, we find the max of four numbers. The minimum number of comparisons needed to find the max of four numbers is 3.
- Three of these four numbers need to be computed (the 0 doesn't involve any computation), which takes about 3 operations each (adding, and accessing the appropriate elements of $V$ and $\delta$).
- Thus, we have about $3 + 3 \cdot 3 = 12$ units of time for finding one $V(i, j)$ for $i > 0$ and $j > 0$.

We find that the approximate total work done is

$$(N + M + 1) + 12NM,$$

which is dominated by the $NM$ term, so the running time is proportional to $NM$. This is the approximate time to complete our dynamic programming matrix $V$. For those familiar with big-O notation, we would say the algorithm runs in $O(NM)$.

3.2 Affine Gap Alignment

Gap Theory

So far, we have used the word gap to represent aligning a character of one string with the - character in the other string. In general, though, gaps can be several dashes in a row. To distinguish between these gap clusters and a single - character, we'll refer to a single - as an indel.

In a biological context, it may be more useful to align two sequences such that clusters of gaps are preferred to single indels spread throughout the alignment. One example of this scenario is aligning a DNA sequence with its introns removed (perhaps obtained by reverse engineering the DNA sequence from a known protein sequence) against the entire length of DNA on a chromosome, to detect where in the chromosome the gene corresponding to the intron-removed sequence might be. In cases like these, you want the alignment to contain fewer gaps (clusters). This can be done with a simple change to local alignment: make the penalty for $k$ indels in a row (a gap of length $k$) less than $k$ times the penalty for a single indel. In general, there are different types of gapped alignment algorithms that treat indels and gap clusters differently, but we will mainly focus on the affine gap alignment, presented below.

Alternate Notation of Global Alignment

Before we explore the idea of preferring gap clusters to indels, we will introduce the notation for the global alignment problem used by Smith and Waterman. This notation doesn't change the algorithm, but it does make it easier to understand what happens in the new gap alignment algorithm presented in this class. First, we define several variables:

- Let $s(a, b)$ be a similarity score function, with $a$ and $b$ in the alphabet, that gives the similarity between characters $a$ and $b$.
- Let $a = a_1 a_2 a_3 \ldots a_n$ and $b = b_1 b_2 b_3 \ldots b_m$, with $|a| = n$ and $|b| = m$.
- Let $S$ represent a function applied to prefixes of $a$ and $b$, such that $S(a_1 a_2 \ldots a_i, b_1 b_2 \ldots b_j)$, abbreviated $S(i, j)$ or $S_{i,j}$, equals the similarity score of the global alignment of prefix $i$ of $a$ with prefix $j$ of $b$.

Then, the following initializations occur:

$$S(0, 0) = 0, \qquad S(i, 0) = \sum_{l=1}^{i} s(a_l, -), \qquad S(0, j) = \sum_{k=1}^{j} s(-, b_k)$$

And the main portion of the algorithm has the following notation:

$$S_{i,j} = \max \begin{cases} S_{i-1,j-1} + s(a_i, b_j) \\ S_{i-1,j} + s(a_i, -) \\ S_{i,j-1} + s(-, b_j) \end{cases}$$

Finally, $S(a, b) = S_{n,m}$ is the maximum score over all global alignments of the entire sequences $a$ and $b$.

Again, this is all just a different notation for global alignment. We present it because it is concise, and because it makes the gap alignment notation much easier to read. The above may be called the Smith-Waterman notation of alignment.

Affine Gap Overview, Notation, and Recurrence

The affine gap alignment algorithm prefers alignments that have a small number of large gaps, by introducing a penalty for opening a gap cluster as well as a penalty for each indel after the opening of a cluster. First, let us define a few variables to add to the Smith-Waterman notation above:

- Let

$$H_{i,j} = \max \left\{ 0,\ \max_{\substack{1 \le p \le i \le n \\ 1 \le q \le j \le m}} S(a_p a_{p+1} a_{p+2} \ldots a_{i-2} a_{i-1} a_i,\ b_q b_{q+1} b_{q+2} \ldots b_{j-2} b_{j-1} b_j) \right\}$$

- Let $\alpha$ represent the penalty for opening a gap cluster.
- Let $\beta$ represent the penalty for continuing a gap cluster.
- Consider a cluster of single indels of length $k$. Let the cost or score of the $k$-length cluster be

$$g(k) = \alpha + \beta (k - 1)$$

We will typically see the gap function $g(k)$ with a negative sign in front of it, $-g(k)$, to signify that it is a penalty.

In a similar vein, let

$$H(a, b) = \max_{\substack{1 \le i \le j \le n \\ 1 \le k \le l \le m}} S(a_i a_{i+1} a_{i+2} \ldots a_{j-2} a_{j-1} a_j,\ b_k b_{k+1} b_{k+2} \ldots b_{l-2} b_{l-1} b_l)$$

What this represents is finding the maximum similarity score over all substrings of the strings $a$ and $b$.
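To make the gap function $g(k) = \alpha + \beta(k - 1)$ concrete, here is a small Python sketch. The function name and the values $\alpha = 5$ and $\beta = 1$ are assumptions picked only for illustration; the notes do not prescribe specific penalties.

```python
def affine_gap_penalty(k, alpha, beta):
    """Cost g(k) = alpha + beta * (k - 1) of one gap cluster of length k >= 1."""
    if k < 1:
        raise ValueError("a gap cluster must contain at least one indel")
    return alpha + beta * (k - 1)


alpha, beta = 5, 1  # hypothetical penalties: expensive to open, cheap to extend

one_cluster = affine_gap_penalty(4, alpha, beta)          # one gap of length 4
four_singles = 4 * affine_gap_penalty(1, alpha, beta)     # four separate indels

print(one_cluster, four_singles)  # 8 20
```

With these values, a single cluster of four indels costs 8, while four isolated indels cost 20, so the scoring favors clustering the indels together.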

Now, we get into the major recurrences of the affine gap algorithm using the above notation. Consider three matrices, $E$, $F$, and $H$, defined as follows:

$$E_{i,j} = F_{i,j} = H_{i,j} = 0 \quad \text{for } i \cdot j = 0$$

$$E_{i,j} = \max \begin{cases} H_{i,j-1} - \alpha \\ E_{i,j-1} - \beta \end{cases}$$

$$F_{i,j} = \max \begin{cases} H_{i-1,j} - \alpha \\ F_{i-1,j} - \beta \end{cases}$$

$$H_{i,j} = \max \begin{cases} 0 \\ E_{i,j} \\ F_{i,j} \\ H_{i-1,j-1} + s(a_i, b_j) \end{cases}$$

Affine Gap Algorithm/Notation Explanation

With some intuition for the notation of the affine gap algorithm, we can now examine what the different parts of the algorithm mean. First, realize that in order for the algorithm to prefer a small number of large gaps, $\alpha$ should be big, to penalize creating a gap, while $\beta$ should be small, so that continuing a gap is not penalized as much.

The matrix $E$ represents the optimal score for the $i$th prefix of $a$ and the $j$th prefix of $b$ in the case that the alignment aligns the $j$th character of $b$ with a - character. $F$ represents the analogous optimal score in the case that the $i$th character of $a$ is aligned with a - character. In either case, by including an indel at your current location, you are either opening a new gap cluster (meaning you subtract $\alpha$) or continuing an existing one (meaning you subtract $\beta$). At the same time, the $H$ matrix stores the optimal score for the alignments, taking into account our modified gap penalties that favor gap clusters.

Traceback (or backtracking) will also be slightly different here compared to local and global alignment: when you construct your alignment, you are not just considering the $H$ matrix, but may find yourself moving back and forth between all three matrices in order to reconstruct the alignment.
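The three recurrences above translate almost line for line into code. The following Python sketch computes only the optimal score (the three-matrix traceback is omitted); the function name and the default values for `alpha`, `beta`, `match`, and `mismatch` are assumptions for illustration, with `match`/`mismatch` standing in for the similarity function $s$.

```python
def affine_local_score(a, b, alpha=3, beta=1, match=2, mismatch=-1):
    """Score of the best local alignment under the affine gap recurrences.

    alpha = penalty to open a gap cluster, beta = penalty to extend it.
    All numeric defaults are illustrative assumptions.
    """
    n, m = len(a), len(b)
    # E, F, H are all 0 wherever i * j == 0, as in the boundary condition.
    E = [[0] * (m + 1) for _ in range(n + 1)]
    F = [[0] * (m + 1) for _ in range(n + 1)]
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # b_j aligned with '-': open a new cluster or extend one.
            E[i][j] = max(H[i][j - 1] - alpha, E[i][j - 1] - beta)
            # a_i aligned with '-': open a new cluster or extend one.
            F[i][j] = max(H[i - 1][j] - alpha, F[i - 1][j] - beta)
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, E[i][j], F[i][j], H[i - 1][j - 1] + s)
            best = max(best, H[i][j])

    return best


# Eight matches (16) minus one gap cluster of length 3 (3 + 1 + 1 = 5).
print(affine_local_score("ACGTACGT", "ACGTGGGACGT"))  # 11
```

In the example, the inserted GGG is absorbed by a single gap cluster costing $\alpha + 2\beta = 5$, which is cheaper than stopping the alignment after the first ACGT.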
