Biologically significant sequence alignments using Boltzmann probabilities


P. Clote
Department of Biology, Boston College
Gasson Hall 16, Chestnut Hill MA 02467
clote@bc.edu

Abstract

In this paper, we give a dynamic programming algorithm with quadratic time and space complexity to compute the partition function for both global and local sequence alignments of two peptides, thus providing an efficient computation of the Boltzmann probability that a particular pair of amino acids is aligned. As proof of concept, our probabilistic refinement of both the Needleman-Wunsch [16] global and Smith-Waterman [19] local alignment algorithms is then compared with pairwise BLAST to determine an optimal local alignment of bovine trypsin and pig elastase, an example considered in Lipman et al. [14]. A web server for our prototype tool is currently available [5].

Key words: dynamic programming, sequence alignment, Smith-Waterman algorithm, Boltzmann probability, partition function.

1 Introduction

Sequence alignment is one of the most important initial steps taken in trying to understand the function, evolutionary relationship, and general biology (e.g. binding sites) of an amino acid or nucleotide sequence. Using dynamic programming, Needleman and Wunsch [16] designed a quadratic time/space algorithm to determine an optimal global sequence alignment of given sequences $a = a_1, \ldots, a_n$ and $b = b_1, \ldots, b_m$, provided that the cost of $k$ successive gaps is $g \cdot k$, for some fixed constant $g$ (footnote 1). Building on this algorithm, Smith and Waterman [19] later provided a quadratic time/space algorithm to determine an optimal local sequence alignment of (convex) subwords $a_{i_0}, a_{i_0+1}, \ldots, a_{i_1}$ from $a_1, \ldots, a_n$ with $b_{j_0}, b_{j_0+1}, \ldots, b_{j_1}$ from $b_1, \ldots, b_m$, again with the restriction to linear gap penalty. A year later, Gotoh [9] introduced a clever trick to compute global and local alignments with affine gap penalty $g_i + g_e \cdot (k-1)$ in quadratic time and space.

Footnote 1: Sequence alignment distance using a linear gap penalty is known in computer science as edit distance. Though the Needleman-Wunsch and Gotoh algorithms were originally formulated in terms of distance, rather than similarity, each can be trivially reformulated for a similarity measure.

When aligning a sequence with all sequences from a database, quadratic time is prohibitive, so the BLAST algorithm of Altschul et al. [2] was introduced as a heuristic to approximate the Smith-Waterman algorithm. The advantage of BLAST over Smith-Waterman is that the expected run time is linear (footnote 2) in sequence and database size and that statistical significance (p-value, E-value) can be computed by virtue of the Karlin-Altschul [12, 13] result that the distribution of BLAST hits is the Fisher-Tippett (also known as extreme-value or Gumbel) distribution.

Footnote 2: Note that BLAST has worst-case quadratic run time, though this is not generally encountered in practice.

Multiple sequence alignment is a difficult (NP-complete) problem, for which several different approaches have been developed: the Carillo-Lipman algorithm [4, 14], hidden Markov models [8], ClustalW [20], etc. More recently, in order to detect distantly related proteins, Altschul et al. developed PSI-BLAST [3], which iteratively builds a profile [10], then blasts databases with the profile. Despite its success, it should be noted that PSI-BLAST depends heavily on the quality of the multiple sequence alignment obtained from pairwise BLAST hits in order to build a correct profile. For additional background on computational biology, see the Clote-Backofen text [6], and for additional remarks on algorithmic complexity for both sequential and parallel algorithms, see the recent Clote-Kranakis monograph [7].

In this paper, we adapt an idea of McCaskill [15], who extended the Zuker-Sankoff [24] energy minimization algorithm for RNA secondary structure prediction to give an efficient computation of the partition function for the ensemble of RNA secondary structures. Our contribution in this paper is to extend the Needleman-Wunsch, Smith-Waterman and Gotoh algorithms, so as to compute the partition function of optimal global and local pairwise alignments using an affine gap penalty. This allows us to provide a mathematically rigorous notion of biological significance as to whether particular residue pairs $(a_i, b_j)$, or residues and gaps $(a_i, -)$, $(-, b_j)$, are likely to be reliably aligned. In future work, we plan to extend these notions to multiple sequence alignments, structural alignments and to a prototype version of PSI-BLAST with Boltzmann probabilities.

2 Global alignment partition function for linear gap penalty

Let $a = a_1, \ldots, a_n$ and $b = b_1, \ldots, b_m$ be two given amino acid sequences (footnote 3). Throughout, let $s(a_i, b_j)$ denote the similarity of residue $a_i$ with $b_j$; for instance, in Section 5, we use the PAM250 similarity matrix [17], though of course BLOSUM62 [11] or any other similarity matrix could have been used. For didactic reasons, in this section we present the gist of our quadratic time/space algorithm to compute the partition function for global alignments using a linear gap penalty $g(k) = g \cdot k$, where the constant $g \le 0$. In this case, the Boltzmann probability $\Pr[(a_i, b_j)]$ that $a_i$ is aligned with $b_j$, formally defined later, is

$$\Pr[(a_i, b_j)] = \frac{Z_1 \cdot \exp\bigl(s(a_i, b_j)/kT\bigr) \cdot Z_2}{Z}$$

where $Z_1 = \sum_{A_1} \exp(S(A_1)/kT)$, $Z_2 = \sum_{A_2} \exp(S(A_2)/kT)$ and $Z = \sum_{A} \exp(S(A)/kT)$. Here $S(A)$ denotes the similarity score of alignment $A$, $A_1$ ranges over all alignments of $a_1, \ldots, a_{i-1}$ with $b_1, \ldots, b_{j-1}$, $A_2$ ranges over all alignments of $a_{i+1}, \ldots, a_n$ with $b_{j+1}, \ldots, b_m$, and $A$ ranges over all possible alignments of $a_1, \ldots, a_n$ with $b_1, \ldots, b_m$.

Footnote 3: Our implementation actually handles any finite alphabet for which a similarity matrix is provided; in particular, our code applies to the alignment of nucleotide sequences.

An approximate, but incorrect, intuition for the probability $\Pr[(a_i, b_j)]$ would be to consider all exponentially many global alignments of $a$ with $b$, and to return the number of times that $a_i$ is aligned with $b_j$ divided by the number of alignments. This intuition would be essentially correct if we were to weight each count by a factor deriving from Boltzmann's criterion, so that the weight for an alignment with a low score would be close to 0. An explicit exponential time computation of the partition function can be avoided by noting that, since the similarity score for subwords is additive, the partition function is multiplicative. We now proceed to the details.

The Needleman-Wunsch algorithm computes the $(n+1) \times (m+1)$ path matrix $M$, where for $0 \le i \le n$ and $0 \le j \le m$, $M(i,j)$ is the maximum similarity score between $a_1, \ldots, a_i$ and $b_1, \ldots, b_j$. Let $g$ be the (negative) penalty for a gap and let $g_i$ be the cost for gap initiation and $g_e$ be the cost for gap extension. Typical values of $g_i$ and $g_e$ for BLAST with PAM250 are given in the original [values illegible in this copy]. A linear gap penalty is $g \cdot k$, while an affine gap penalty is $g_i + g_e \cdot (k-1)$, both for a gap of size $k$, where $g, g_i, g_e \le 0$.

Algorithm 1 (Needleman-Wunsch [16] global pairwise alignment with linear gap penalty; see footnote 4)
For $1 \le i \le n$ and $1 \le j \le m$, let $M(0,0) = 0$, $M(i,0) = i \cdot g$, $M(0,j) = j \cdot g$, and define

$$M(i,j) = \max \begin{cases} M(i-1, j-1) + s(a_i, b_j) \\ M(i-1, j) + g \\ M(i, j-1) + g \end{cases}$$

Footnote 4: The Needleman-Wunsch algorithm was originally formulated in terms of distance, rather than similarity. The use of similarity, along with minor changes in the base and inductive cases and the definition of traceback, yields the Smith-Waterman local alignment algorithm.

Since each entry in the array requires constant time to be computed, the Needleman-Wunsch algorithm runs in time and space $O(n^2)$, assuming that $m \le n$. By construction, $M(n,m)$ is the maximum similarity score of any alignment of $a_1, \ldots, a_n$ with $b_1, \ldots, b_m$. This optimal alignment can be obtained by the usual method of tracebacks (for details, see Clote-Backofen [6]). Note that we could have computed a reverse path matrix $M^R$, defined for $1 \le i \le n+1$ and $1 \le j \le m+1$ by setting $M^R(i,j)$ to be the maximum similarity score of any alignment of $a_i, \ldots, a_n$ with $b_j, \ldots, b_m$. This observation, lifted to the calculation of a forward and backward partition function, is crucial for our computation of the Boltzmann probabilities.

In the following algorithm, $Z^F$ is the forward partition function, defined for $0 \le i \le n$ and $0 \le j \le m$ by

$$Z^F(i,j) = \sum_{A} \exp\bigl(S(A)/kT\bigr)$$

where $A$ ranges over all possible alignments of $a_1, \ldots, a_i$ with $b_1, \ldots, b_j$, $k$ is Boltzmann's constant and $T$ is temperature (footnote 5).

Footnote 5: In our implementation, we experimented with several values of $kT$ [values illegible in this copy]; the last of these corresponds to replacing the Boltzmann factor $\exp(s(a_i, b_j)/kT)$ by a simpler expression [illegible in this copy].

Algorithm 2 (Forward partition function for linear gap penalty)
For $1 \le i \le n$ and $1 \le j \le m$, define $Z^F(0,0) = 1$, $Z^F(i,0) = \exp(i \cdot g/kT)$, $Z^F(0,j) = \exp(j \cdot g/kT)$, and define $Z^F(i,j)$ by

$$Z^F(i,j) = Z^F(i-1, j-1) \cdot \exp\bigl(s(a_i, b_j)/kT\bigr) + Z^F(i-1, j) \cdot \exp(g/kT) + Z^F(i, j-1) \cdot \exp(g/kT)$$

Analogously, we compute the backward partition function $Z^B$, defined for $1 \le i \le n+1$ and $1 \le j \le m+1$ by

$$Z^B(i,j) = \sum_{A} \exp\bigl(S(A)/kT\bigr)$$

where $A$ ranges over all possible alignments of $a_i, \ldots, a_n$ with $b_j, \ldots, b_m$.

Algorithm 3 (Backward partition function for linear gap penalty)
For $n \ge i \ge 1$ and $m \ge j \ge 1$, let $Z^B(n+1, m+1) = 1$, $Z^B(i, m+1) = \exp((n-i+1) \cdot g/kT)$, $Z^B(n+1, j) = \exp((m-j+1) \cdot g/kT)$, and define $Z^B(i,j)$ to be

$$Z^B(i,j) = Z^B(i+1, j+1) \cdot \exp\bigl(s(a_i, b_j)/kT\bigr) + Z^B(i+1, j) \cdot \exp(g/kT) + Z^B(i, j+1) \cdot \exp(g/kT)$$

One can easily check that $Z^F(n,m) = Z^B(1,1)$ and that this value is $Z = \sum_{A} \exp(S(A)/kT)$, where $A$ ranges over all alignments of $a_1, \ldots, a_n$ with $b_1, \ldots, b_m$ (footnote 6). The Boltzmann probability $\Pr[(a_i, b_j)]$ that $a_i$ will be aligned with $b_j$ is then

$$\Pr[(a_i, b_j)] = \frac{Z^F(i-1, j-1) \cdot \exp\bigl(s(a_i, b_j)/kT\bigr) \cdot Z^B(i+1, j+1)}{Z^F(n,m)}$$

Similarly, the Boltzmann probability that $a_i$ will be aligned above a gap $-$, while $a_1, \ldots, a_{i-1}$ is aligned with $b_1, \ldots, b_j$, is given by

$$\frac{Z^F(i-1, j) \cdot \exp(g/kT) \cdot Z^B(i+1, j+1)}{Z^F(n,m)}$$

Finally, the Boltzmann probability that $b_j$ will be aligned below a gap $-$, while $a_1, \ldots, a_i$ is aligned with $b_1, \ldots, b_{j-1}$, is given by

$$\frac{Z^F(i, j-1) \cdot \exp(g/kT) \cdot Z^B(i+1, j+1)}{Z^F(n,m)}$$

Footnote 6: It should be noted that in any implementation, the computed values of $Z^F(n,m)$ and $Z^B(1,1)$ will differ, because the sum of many (large) numbers from left to right is not the same as the sum from right to left, a well-known phenomenon due to limited machine precision and truncation error. For this reason, it is more useful when debugging to verify that the relative error $|Z^F(n,m) - Z^B(1,1)| / Z^F(n,m)$ is very close to 0.
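The forward and backward recurrences of Algorithms 2 and 3, together with the pair probability formula that follows them, can be sketched in a few lines of Python. This is a minimal illustration and not the author's prototype: the function names, the toy similarity function `toy_score` (standing in for PAM250), and the parameter values used below are all assumptions made for the example. Indices follow the 1-based convention of the text, so residue $a_i$ is `a[i - 1]`.

```python
import math


def toy_score(x, y):
    # Hypothetical similarity function (assumption): +2 for identical
    # residues, -1 otherwise. The paper uses PAM250 instead.
    return 2.0 if x == y else -1.0


def forward_partition(a, b, g, kT, score=toy_score):
    """ZF[i][j]: sum of exp(S(A)/kT) over alignments A of a_1..a_i with b_1..b_j."""
    n, m = len(a), len(b)
    w_gap = math.exp(g / kT)  # Boltzmann factor of a single gap column
    ZF = [[0.0] * (m + 1) for _ in range(n + 1)]
    ZF[0][0] = 1.0
    for i in range(1, n + 1):
        ZF[i][0] = ZF[i - 1][0] * w_gap
    for j in range(1, m + 1):
        ZF[0][j] = ZF[0][j - 1] * w_gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            w_pair = math.exp(score(a[i - 1], b[j - 1]) / kT)
            ZF[i][j] = (ZF[i - 1][j - 1] * w_pair
                        + ZF[i - 1][j] * w_gap
                        + ZF[i][j - 1] * w_gap)
    return ZF


def backward_partition(a, b, g, kT, score=toy_score):
    """ZB[i][j]: sum of exp(S(A)/kT) over alignments A of a_i..a_n with b_j..b_m.

    Row 0 and column 0 are unused so that indices match the text.
    """
    n, m = len(a), len(b)
    w_gap = math.exp(g / kT)
    ZB = [[0.0] * (m + 2) for _ in range(n + 2)]
    ZB[n + 1][m + 1] = 1.0
    for i in range(n, 0, -1):
        ZB[i][m + 1] = ZB[i + 1][m + 1] * w_gap
    for j in range(m, 0, -1):
        ZB[n + 1][j] = ZB[n + 1][j + 1] * w_gap
    for i in range(n, 0, -1):
        for j in range(m, 0, -1):
            w_pair = math.exp(score(a[i - 1], b[j - 1]) / kT)
            ZB[i][j] = (ZB[i + 1][j + 1] * w_pair
                        + ZB[i + 1][j] * w_gap
                        + ZB[i][j + 1] * w_gap)
    return ZB


def pair_probabilities(a, b, g, kT, score=toy_score):
    """P[i-1][j-1]: Boltzmann probability that a_i is aligned with b_j."""
    n, m = len(a), len(b)
    ZF = forward_partition(a, b, g, kT, score)
    ZB = backward_partition(a, b, g, kT, score)
    Z = ZF[n][m]
    return [[ZF[i - 1][j - 1]
             * math.exp(score(a[i - 1], b[j - 1]) / kT)
             * ZB[i + 1][j + 1] / Z
             for j in range(1, m + 1)]
            for i in range(1, n + 1)]
```

For example, `pair_probabilities("HEAGAWGHEE", "PAWHEAE", g=-2.0, kT=1.0)` returns a 10 by 7 table of probabilities. A useful sanity check, in the spirit of footnote 6, is that for each residue $a_i$ the pair probabilities over all $b_j$ plus the probabilities that $a_i$ sits above a gap sum to 1. For long sequences the raw sums can overflow or underflow double precision, so a practical implementation would likely rescale or work in log space; that refinement is omitted here.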

3 Local alignment partition function for linear gap penalty

At first thought, one could attempt to define a partition function with respect to all local alignments. After initial investigation, this is clearly not the most reasonable choice (note that it is possible that two optimal local alignments are disjoint). Instead, on input $a$ and $b$, we first obtain the optimal local alignment $A$ of subwords $a_{i_0}, \ldots, a_{i_1}$ and $b_{j_0}, \ldots, b_{j_1}$, then determine the forward and backward partition functions $Z^F$, $Z^B$ for these subwords, where $Z^F$, $Z^B$ are computed by the technique of the previous section in performing a global alignment on $a_{i_0}, \ldots, a_{i_1}$ and $b_{j_0}, \ldots, b_{j_1}$.

Algorithm 4 (Smith-Waterman algorithm for local alignment with linear gap function)
For $0 \le i \le n$ and $0 \le j \le m$, let $M(i,0) = 0$, $M(0,j) = 0$, and define $M(i,j)$ to be

$$M(i,j) = \max \begin{cases} 0 \\ M(i-1, j-1) + s(a_i, b_j) \\ M(i-1, j) + g \\ M(i, j-1) + g \end{cases}$$

Determine the indices where $M(i,j)$ achieves a maximum, and perform the traceback until reaching indices where $M(i,j) = 0$. This determines the local alignment of $a_{i_0}, \ldots, a_{i_1}$ with $b_{j_0}, \ldots, b_{j_1}$.

Algorithm 5 (Partition function for local alignments with linear gap penalty)
Given amino acid or nucleotide sequences $a_1, \ldots, a_n$ and $b_1, \ldots, b_m$:
1. Use Algorithm 4 to determine the optimal local alignment $A$ of subwords $a_{i_0}, \ldots, a_{i_1}$ with $b_{j_0}, \ldots, b_{j_1}$.
2. Use Algorithms 2 and 3 to compute the partition functions $Z^F$, $Z^B$ for alignment $A$.
3. Suppose that the resulting optimal local alignment $A$ of $a_{i_0}, \ldots, a_{i_1}$ with $b_{j_0}, \ldots, b_{j_1}$ is of the form $(x_1, y_1), \ldots, (x_r, y_r)$, where $x_k$, $y_k$ are either $-$ or single-letter residue codes, such that $a_{i_0}, \ldots, a_{i_1}$ [resp. $b_{j_0}, \ldots, b_{j_1}$] are obtained after removing $-$ from $x_1, \ldots, x_r$ [resp. $y_1, \ldots, y_r$]. For $1 \le k \le r$, compute the Boltzmann pair probabilities $\Pr[(x_k, y_k)]$ in the manner described after Algorithm 3.

4 Quadratic time algorithm for affine gap penalty

Let $g(k)$ denote the penalty for $k$ successive gaps. In the following sections, we assume that $g(k) = g_i + g_e \cdot (k-1)$ is an affine function, where $g_i \le g_e \le 0$ and $g_i$ [resp. $g_e$] denotes the gap initiation [resp. gap extension] cost. Let $P(i,j)$ be the maximum alignment score of any alignment of a suffix of $a_1, \ldots, a_i$ with a suffix of $b_1, \ldots, b_j$, where $a_i$ is aligned with $-$. Define $Q(i,j)$ [resp. $D(i,j)$] analogously, except that $-$ is aligned with $b_j$ [resp. $a_i$ is aligned with $b_j$]. Finally define [the last definition is missing from this copy].

Due to space constraints, we cannot give full details of our version of the Smith-Waterman-Gotoh algorithm, which differs in a small but fundamental aspect from Gotoh's original paper [9] as well as from the presentation in Clote-Backofen [6]. This difference is crucial, and allows one to avoid overcounting in the partition function. We are now in a position to give the pseudocode for the computation of the forward partition function.

Algorithm 6 (Forward partition functions for affine gap penalty)
[The pseudocode of Algorithm 6 is not recoverable from this copy. The surviving fragments indicate initialization loops over $i$ and $j$, a nested loop over $i$ and $j$ with conditional updates of the partition function tables, and a final return of those tables.]

Footnote 7: In the full version of the paper, [symbol illegible in this copy] is defined slightly differently, by incorporating traceback information.

In an analogous manner, the backward partition functions, corresponding respectively to the forward partition functions of Algorithm 6, can be defined. As before, the probability $\Pr[(a_i, b_j)]$ that $a_i$ is aligned with $b_j$ in the optimal global alignment is

$$\Pr[(a_i, b_j)] = \frac{Z^F(i-1, j-1) \cdot \exp\bigl(s(a_i, b_j)/kT\bigr) \cdot Z^B(i+1, j+1)}{Z^F(n,m)}$$

Assuming that $j > 1$ and $a_1, \ldots, a_i$ is aligned with $b_1, \ldots, b_{j-1}$, the probability $\Pr[(-, b_j)]$ that $b_j$ is aligned below a gap is given by a similar quotient, in which the gap initiation and gap extension cases contribute separate terms [formula not recoverable from this copy]. Other cases are similar. Using this method, we can determine the Boltzmann probability for particular aligned pairs in a local alignment. Algorithm 6, along with our explicit algorithms for the earlier treatment of linear gap penalty, should provide sufficient detail to give a general idea of our method. With this, we conclude that the partition functions, and hence the Boltzmann probabilities, can be computed in quadratic time and space.

5 Example

Let's compare the output of pairwise BLAST at the NCBI server [18] on two biologically related proteins: bovine trypsin (PDB identity 1TGB) and pig elastase (chain A with SwissProt accession 1C1MA). These sequences were chosen because they were used by Lipman et al. [14] to illustrate the improvement that Carrillo-Lipman multiple sequence alignment provides over dynamic programming local pairwise alignment. Both methods align the subsequence of bovine trypsin starting at position 29 through 238 with the subsequence of pig elastase starting at position 28 through 239. The BLAST output indicates which positions in the alignment involve identical or similar residues, with the first line as follows (the column spacing of the middle line is lost in this copy):

HFCGGSLINSQWVVSAAHCYKSGIQVRL--GEDNINVVEGNEQFISASKSIVHPSYNSNT
H CGG+LI +WV++AAHC + R+ GE+N+N +G EQ+V+ K VVHP N++
HTCGGTLIRQNWVMTAAHCVDRELTFRVVVGEHNLNQNDGTEQYVGVQKIVVHPYWNTDD

In contrast, in our alignment, $+$ designates a Boltzmann probability of 75%-100%, while [symbol illegible in this copy] corresponds to 50%-75%, $-$ to 25%-50%, and nothing to 0%-25% (the annotation line itself is lost in this copy):

HFCGGSLINSQWVVSAAHCYKSGIQVR--LGEDNINVVEGNEQFISASKSIVHPSYNSNT
HTCGGTLIRQNWVMTAAHCVDRELTFRVVVGEHNLNQNDGTEQYVGVQKIVVHPYWNTDD

The Boltzmann probabilities for the entire alignment are graphically depicted in Figure 1.

Figure 1: Local alignment Boltzmann probabilities of portions of bovine trypsin and pig elastase (see text).

6 Discussion

The significance, in terms of Boltzmann probability, of how well two residues (or a residue and a gap) are aligned in an optimal scoring alignment, as developed in this paper, is quite distinct from any Viterbi probability or sum-of-all-paths probability from a trained hidden Markov model. Using publicly available HMMs, it is easy to find a pair of sequences whose HMM alignment differs from Needleman-Wunsch or Smith-Waterman, hence HMMs have little to do with the concepts developed in this paper. As well, the algorithms of Waterman [22] and Waterman-Eggert [23] concern subsequent modifications of the path matrix after the optimal alignment is found, hence have nothing to do with our approach. Finally, the method of threading, discussed in Clote-Backofen [6], concerns sampling $k$-mer conformations from the PDB, assuming that the resulting distribution is Boltzmann distributed, and taking the negative logarithm of these frequencies as a suitable pseudo-energy. In threading, there is no computation of the partition function, and the alignment of certain $k$-mers (i.e. the threading of convex subwords of the peptide) does not admit gaps within the $k$-mers, nor does it consider the partition function over all such possible alignments of $k$-mers. Thus, to the best of our knowledge, our results are new and bear little in common with HMMs, suboptimal alignment algorithms, or threading.

7 Conclusions and future work

In this work, we have designed and implemented a new quadratic time and space algorithm to compute the partition function for global and local sequence alignments of two peptides, thus obtaining an efficient computation of the Boltzmann probability that a particular pair of amino acid residues, or a gap and a residue, are aligned. Additionally, we have created a web server to make the algorithm available for testing. Our prototype programs and CGI scripts are written in the platform-independent, object-oriented scripting language Python [21]. We are currently extending the Boltzmann probability computation to multiple sequence alignments (Feng-Doolittle and ClustalW algorithms), to dynamic time warping of cDNA microarray data as implemented by Aach-Church [1], to structural alignments, etc. To address efficiency issues, a collaborator is beginning the translation of our Python code into C/C++. We are currently investigating both the FSSP and 3dAli structural alignment databases, to calibrate our method of using Boltzmann probabilities to correlate the biological significance of certain portions of an alignment.

Acknowledgements

I'd like to thank Stephen H. Bryant for a brief suggestion that we contrast our method with that of profile hidden Markov models, E-values, threading and suboptimal alignments.

References

[1] J. Aach and G. Church. Aligning gene expression time series with time warping algorithms. Bioinformatics, 17(6):495-508, 2001.
[2] S.F. Altschul, W. Gish, W. Miller, E.W. Myers, and D.J. Lipman. Basic local alignment search tool. J. Mol. Biol., 215:403-410, 1990.
[3] S.F. Altschul, T.L. Madden, A.A. Schäffer, J. Zhang, W. Miller, and D.J. Lipman. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res., 25, 1997.
[4] H. Carillo and D. Lipman. The multiple sequence alignment problem in biology. SIAM J. Appl. Math., 48(5), 1988.
[5] P. Clote. Boltzmann alignment server. cslab.bc.edu:8080/compbio/boltzmannalignment.html is only a prototype implementation. An expanded web server (currently under construction) will be hosted elsewhere.
[6] P. Clote and R. Backofen. Computational Molecular Biology: An Introduction. John Wiley & Sons.
[7] P. Clote and E. Kranakis. Boolean Functions and Computation Models. Springer-Verlag.
[8] S.R. Eddy. Hidden Markov models and large-scale genome analysis. In C. Rawlings et al., editors, Proc. Third Int. Conf. Intelligent Systems for Molecular Biology. AAAI Press, Menlo Park, 1995.
[9] O. Gotoh. An improved algorithm for matching biological sequences. J. Mol. Biol., 162, 1982.
[10] M. Gribskov, A.D. McLachlan, and D. Eisenberg. Profile analysis: Detection of distantly related proteins. Proc. Natl. Acad. Sci. USA, 84.
[11] S. Henikoff and J.G. Henikoff. Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. USA, 89, 1992.
[12] S. Karlin and S.F. Altschul. Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc. Natl. Acad. Sci. USA, 87, 1990.
[13] S. Karlin and S.F. Altschul. Applications and statistics for multiple high-scoring segments in molecular sequences. Proc. Natl. Acad. Sci. USA, 90, 1993.
[14] D.J. Lipman, S.F. Altschul, and J.D. Kececioglu. A tool for multiple sequence alignment. Proc. Natl. Acad. Sci. USA, 86:4412-4415, 1989.
[15] J.S. McCaskill. The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers, 29, 1990.
[16] S.B. Needleman and C.D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol., 48:443-453, 1970.
[17] R.M. Schwartz and M.O. Dayhoff. Matrices for detecting distant relationships. In M.O. Dayhoff, editor, Atlas of Protein Sequence and Structure, volume 25. Natl. Biomed. Res. Found., Washington, DC, 1978. Vol. 5, Suppl. 3.
[18] BLAST server.
[19] T.F. Smith and M.S. Waterman. Identification of common molecular subsequences. J. Mol. Biol., 147, 1981.
[20] J. Thompson, D. Higgins, and T. Gibson. ClustalW: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research, 22, 1994.
[21] G. van Rossum. Python programming language. www.python.org
[22] M.S. Waterman. Sequence alignments in the neighborhood of the optimum with general application to dynamic programming. Proc. Natl. Acad. Sci. USA, 80, 1983.
[23] M.S. Waterman and M. Eggert. A new algorithm for best subsequence alignments with applications to tRNA-rRNA. J. Mol. Biol., 197, 1987.
[24] M. Zuker. RNA secondary structures and their prediction. Bulletin of Mathematical Biology, 46(4).
