Local weighting schemes for protein multiple sequence alignment

Size: px

Start display at page:

Download "Local weighting schemes for protein multiple sequence alignment"

Lesley Burke
5 years ago
Views:

Computers and Chemistry 26 (2002) 459 477 www.elsevier.

Ridgeway, Mill Hill, London NW7 1AA, UK Received 29 May 2001; received in revised form 9 November 2001; accepted 20 November 2001 Abstract This paper describes three weighting schemes for improving

1 Computers and Chemistry 26 (2002) Local weighting schemes for protein multiple sequence alignment Jaap Heringa 1 * Di ision of Mathematical Biology, MRC National Institute for Medical Research (NIMR), The Ridgeway, Mill Hill, London NW7 1AA, UK Received 29 May 2001; received in revised form 9 November 2001; accepted 20 November 2001 Abstract This paper describes three weighting schemes for improving the accuracy of progressive multiple sequence alignment methods: (1) global profile pre-processing, to capture for each sequence information about other sequences in a profile before the actual multiple alignment takes place; (2) local pre-processing; which incorporates a new protocol to only use non-overlapping local sequence regions to construct the pre-processed profiles; and (3) local global alignment, a weighting scheme based on the double dynamic programming (DDP) technique to softly bias global alignment to local sequence motifs. The first two schemes allow the compilation of residue-specific multiple alignment reliability indices, which can be used in an iterative fashion. The schemes have been implemented with associated iterative modes in the PRALINE multiple sequence alignment method, and have been evaluated using the BAliBASE benchmark alignment database. These tests indicate that PRALINE is a toolbox able to build alignments with very high quality. We found that local profile pre-processing raises the alignment quality by 5.5% compared to PRALINE alignments generated under default conditions. Iteration enhances the quality by a further percentage point. The implications of multiple alignment scoring functions and iteration in relation to alignment quality and benchmarking are discussed Elsevier Science Ltd. All rights reserved. Keywords: Weighting schemes; Multiple sequence alignment; Profile 1. Introduction The simultaneous alignment of three or more nucleotide or amino acid sequences is one of the most common tasks in bioinformatics. Multiple alignments are an essential pre-requisite to many further modes of analysis into protein families such as homology modelling, secondary structure prediction, phylogenetic reconstruction, or the delineation of conserved and variable sites within a family. Alignments may be further used to derive profiles (Gribskov et al., 1987) or Abbre iations: DP, dynamic programming; DDP, double dynamic programming; SP, sum-of-pairs. * Tel.: x2293; fax: address: jhering@nimr.mrc.ac.uk (J. Heringa). 1 www: hidden Markov models (Bucher et al., 1996; Eddy, 1998; Karplus et al., 1998) that can be used to scour databases for distantly related members of the family. With the recent completion of the first draft of the human genome and genomes of many other species becoming rapidly available, the accurate alignment of biological sequences becomes more important than ever. Although many initiatives are underway for largescale proteomics and structure elucidation of novel genomic proteins, an important drive in genomic approaches is essentially aimed at gathering the function of most if not all translated proteins from sequence data alone. To accomplish this, accurate alignment of the sequences is essential. Given the overwhelming amounts of sequence data, the alignment engines have to be extremely fast and fully automatic to be included in genomic pipelines /02/$ - see front matter 2002 Elsevier Science Ltd. All rights reserved. PII: S (02)

2 460 The automatic generation of an accurate multiple alignment is potentially a daunting task. Ideally, one would make use of an in-depth knowledge of the evolutionary and structural relationships within the family but this information is often lacking or difficult to use. General empirical models of protein evolution (Benner et al., 1992; Dayhoff, 1978; Henikoff and Henikoff, 1992) are widely used instead but these can be difficult to apply when the sequences are less than 30% identical (Sander and Schneider, 1991). Further, mathematically sound methods for carrying out alignments, using these models, can be extremely demanding in computer resources for more than a handful of sequences (Carillo and Lipman, 1988; Wang and Jiang, 1994). To be able to cope with practical dataset sizes, heuristics have been developed that are used for all but the smallest data sets. The most commonly used heuristic methods are based on the progressive alignment strategy (Hogeweg and Hesper, 1984; Feng and Doolittle, 1987; Taylor, 1988) with ClustalW (Thompson et al., 1994) being the most widely used implementation. The idea is to establish an initial order for joining the sequences, and to follow this order in gradually building up the alignment. Many implementations use an aproximation of a phylogenetic tree between the sequences as a guide tree that dictates the alignment order. Although, appropriate for many alignment problems, the progressive strategy suffers from its greediness. Errors made in the first alignments during the progressive protocol cannot be corrected later as the remaining sequences are added in. Attempts to minimize such alignment errors have generally been targeted at global sequence weighting (Altschul et al., 1989; Thompson et al., 1994), where the contribution of individual sequences are weighted during the alignment process. However, such global sequence weighting schemes carry the risk of propagating rather than reducing error when used in progressive multiple alignment strategies (Heringa, 1999), for reasons given later in this paper. The main alternative to progressive alignment is the simultaneous alignment of all the sequences. Two such implementations are available, MSA (Lipman et al., 1989) and DCA (Stoye et al., 1997). Both methods are based on the Carillo and Lipman algorithm (Carillo and Lipman, 1988) to limit computations to a small area in the multi-dimensional search matrix. They nonetheless remain an extremely CPU- and memory-intensive approach, applicable only to about nine sequences of average length for the fastest implementation (DCA). Iterative strategies (Hogeweg and Hesper, 1984; Gotoh, 1996; Notredame and Higgins, 1996; Heringa, 1999) are an alternative to optimise multiple alignments by reconsidering and correcting those made during preceding iterations. Although such iterative strategies do not provide any guarantees about finding optimal solutions, they are reasonably robust and much less sensitive to the number of sequences than their simultaneous counterparts. All of these techniques perform global alignment and match sequences over their full lengths. Problems with this approach can arise when highly dissimilar sequences are compared. In such cases global alignment techniques might fail to recognise highly similar internal regions because these may be overshadowed by dissimilar regions and high gap penalties normally required to achieve proper global matching. Moreover, many biological sequences are modular and show shuffled domains (Heringa and Taylor, 1997), which can render a global alignment of two complete sequences meaningless. The occurrence of varying numbers of internal sequence repeats (Heringa, 1998) can also severely limit the applicability of global methods. In general, when there is a large difference in the lengths of two sequences to be compared, global alignment routines become unwarranted. To address these problems, Smith and Waterman (1981) early on developed a so-called local alignment technique in which the most similar regions in two sequences are selected and aligned. The algorithm has been extended in various techniques to compute a list of top-scoring pair-wise local alignments (Waterman and Eggert, 1987; Huang et al., 1990; Huang and Miller, 1991). Alignments produced by the latter techniques are non-intersecting; i.e., they have no matched pair of amino acids in common. For multiple sequences, the main automatic methods include the Gibbs sampler (Lawrence et al., 1993), MEME (Bailey and Elkan, 1994) and Dialign2 (Morgenstern, 1999). These programs often perform well when there is a clear block of ungapped alignment shared by all of the sequences. They perform poorly, however, on general sets of test cases when compared with global methods (Thompson et al., 1999a; Notredame et al., 2000). Here, three strategies aimed at combining the best properties of global and local multiple alignment are presented: global pre-processing (Heringa, 1999) and two new strategies, local pre-processing and local global DDP. The latter is a technique to integrate local alignment patterns into global alignment using the double dynamic programming (DDP) protocol. The three modes are incorporated in the multiple alignment package PRALINE (Heringa, 1999) and are used in progressive multiple alignment. They each incorporate a weighting scheme that is more flexible and appropriate for alignment than the aforementioned global sequence weighting schemes. The global and local pre-processing strategies are aimed at minimising error at early stages during progressive multiple alignment and do this by using information from other reliable sequences for each pair-wise comparison at every stage during progressive alignment. An important consequence of the

3 461 pre-processing scenarios is that they allow the calculation of a reliability index for each amino acid in the alignment, because the consistency of each aligned residue can be estimated based on the information from other sequences. The schemes allow further optimisation in iterative steps, based on the residue-specific reliability indices as an objective function. It will be shown how these weighting schemes can be used to refine multiple alignment, and how the iterative mode can further enhance the alignment quality. Also a new motif-based weighting scheme will be introduced here, based on a double dynamic programming to reconcile all best local alignments in a single global alignment. Finally, an evaluation of the weighting schemes with various parameter settings will be presented. Measuring the alignment quality is performed using the BAliBASE multiple alignment benchmark set (Thompson et al., 1999b), a database comprising five distinct alignment categories developed especially for testing the quality of multiple alignment methods. 2. Overview of existing sequence weighting schemes There are two basic modes of sequence weighting that can be distinguished for multiple sequence alignment: global and local sequence weighting. The latter can be subdivided into position specific sequence weighting and motif weighting. This section will present an overview of previously developed techniques falling in the above three classes Global sequence weighting Early on Altschul et al. (1989) and Vingron and Argos (1989) proposed global sequence weighting as a means to deal with the fact that sets of sequences normally are unequally represented. To address this problem, global sequence weighting involves the assignment of a weight for each sequence that is used in deriving any average value from the multiple alignment of a set of sequences. The Altschul et al. weighting scheme is integrated in the multiple sequence alignment method MSA (Lipman et al., 1989), where the sequence weights are derived from a rooted phylogenetic tree. Sequences closer to the root in the tree are assigned a larger weight than those at the periphery. In contrast, Vingron and Argos (1989) do not use a tree but derive the sequence weights by calculating for each sequence the average distance from all others. More distant sequences (outliers) according to this criterion receive a larger weight than relatively similar sequences. Thompson et al. (1994) derived the sequence weights in a profile directly from the branch lengths of a phylogenetic tree constructed with the Neighbour-Joining technique (Saitou and Nei, 1987). They used these weights during progressive alignment in the construction of profiles representing pre-aligned sequence groups. Independently, Lüthy et al. (1994) modified the profile search technique by employing the Voronoibased weighting (Sibbald and Argos, 1990), a technique that exploits a notion of the neighbourhood around each sequence in sequence space. Both techniques make use of the BLOSUM62 matrix, constructed from ungapped alignment blocks (Henikoff and Henikoff, 1992). In line with this derivation of the BLOSUM62 matrix, Thompson et al. (1994) exclude from analysis all alignment positions with a percentage of gaps higher than a certain specified threshold. Such regions would be expected to constitute loop regions in the associated protein structures showing less consistent amino acid conservation patterns. Henikoff and Henikoff (1994) derived global sequence weights in a tree-less way by averaging the weights derived for each alignment column by a simple weighting scheme based on the principle of maximum entropy. For each alignment column, the contribution of each occurring amino acid is linearly down-weighted to make the overall contribution of each amino acid type equal. For example, if a column would have five valines and only one leucine, the contribution for each of the sequences having a valine at the considered alignment position would be 1/5 of that of the sequence comprising leucine. In principle it is a good idea to perform global weighting of multiple alignments aimed at increasing the contribution of more distant sequences as they carry more information at each alignment position. However, when sequence weighting is used in progressive multiple alignment, the increased chance of mistakes when aligning distant sequences can well lead to error progression (Heringa, 1999). Vogt et al. (1995) compared local and global alignments of pair-wise sequences with a data bank of structure-based alignments (Pascarella and Argos, 1992) and included a set of over 30 substitution matrices with optimised gap-penalties. The best global alignments were achieved with the Gonnet residue exchange matrix (Gonnet et al., 1992), resulting in 15% incorrect residue matching when sequences with 30% residue identity were aligned. The error rate quickly increased to 45% incorrect matches at 20% residue identity of the aligned sequences, and to 73% error at 15% sequence identity. Rost (1999) stressed the same point and reported even higher pairwise alignment error rates in the twilight zone. These statistics clearly demonstrate that increasing the global weight for distant sequences is likely to lead to the use of misalignment, which will hamper the recognition of true patterns. This is particularly significant if global weighting is used in progressive alignment, because incorrect alignments typically yield low scores, which make the involved sequences appear to be distant, such

4 462 that their incorrect contribution can be amplified by increased weighting in later progressive alignment steps Position-specific sequence weighting Position specific sequence weighting involves the calculation of a weight for each alignment position. Although, many global weighting schemes derive their weights from averaging over positional weights calculated by position-specific schemes, the schemes are principally designed to differentiate the contribution of local alignment regions and aim to make use of the most informative fragments. Various schemes have been developed to locally adjust the contribution from the various sequences in an alignment, e.g. pseudocounts (Henikoff and Henikoff, 1994) or Dirichlet mixtures (Sjölander et al., 1996). These convert the observed amino acid frequencies into weights using background amino acid probabilities and residue exchange weights matrices (Pietroskovski, 1996). Sunyaev et al. (1998, 1999) devised a strategy to construct profiles from given multiple alignments based on a weighting scenario reminiscent to phylogenetic parsimony methods. In their approach, amino acid propensities at each alignment position in the alignment profile are weighted according to the probability that identical amino acids occur in more than one sequence at the alignment position. If more alignment positions show identical conservation for a given subset of sequences (not necessarily the same conserved amino acid type over the alignment positions involved), the occurrence of the amino acids at those positions becomes more expected, which is corrected for by appropriately lowering the weight for the considered position. This approach leads to position-specific sequence weights, which are then implemented in the position-specific propensities for each of the amino acid types in a profile. The authors report increased sensitivity if profile searches were performed using profiles constructed with this technique. However, since their method is critically depending on which sequences are actually chosen to represent a given protein family in the multiple alignment, it is not suitable for progressive multiple alignment Motif-based weighting An extension of local sequence weighting can be implemented in the N M search matrix used in dynamic programming (DP), with N and M the length of the two sequences being compared. With DP, each cell in the matrix corresponds to a value representing the likelihood of the matched pair of amino acids from either sequence typically taken from an amino acid exchange table and can be weighted with any source of extraneous information. Early on Argos (1987) weighted local diagonals in the alignment search matrix with correlations of physical chemical amino acid characters such as hydrophobicity, bulkiness, size and the like. Each matched residue position in a local window received as a score a combination of the amino acid exchange value as given in the PAM-250 exchange matrix (Dayhoff, 1978), and the correlation of each of the five residue parameters calculated over a 5-residue window, each time with the considered matched pair at the middle window position. Although, Argos (1987) did not multiply align sequences but mainly used the scheme to generate visual dot plots for pair-wise protein sequence comparison, the approach basically weights local ungapped alignment scores with physical chemical features. More recently, Bucher and Hofmann (1996) developed a statistical local alignment technique for pairwise sequence comparison in which each cell [i, j ] in the DP search matrix holds the total probability that a local alignment would pass through it. This is achieved by summing the scores of all local alignments intersecting cell [i, j ]. The multiple alignment method T-Coffee (Notredame et al., 2000) combines information from global and local pair-wise alignments. For each sequence pair, a single global alignment and 10 top-scoring non-intersecting local alignments are generated, respectively by the programs ClustalW (Thompson et al., 1994) and Lalign (Huang and Miller, 1991). The global and local alignment scores are then combined to yield a synthetic weight W for each aligned pair of amino acids, which is achieved by taking the sum of the associated basic scores (sequence identities): W(A(x), B(y))= S(A(x), B(y)), where A(x) is residue x in sequence A, and summation is over the scores S of the global and local alignments containing the residue pair (A(x), B(y)), while for S each time the sequence identity percentage of the associated alignment is taken. This scenario results in a library of weights for each non-redundant residue pair. The information in the library is then further enhanced by a procedure called matrix extension (Notredame et al., 2000). Each library weight W(A(x), B(y)) is recalculated to reflect the degree to which residues A(x) and B(y) align consistently, as judged by all other library weights involving either A(x) or B(y). This is done using a triplet approach aimed at calculating the contribution of third sequences I onto the direct alignment of sequence A and B, based on the notion that a triplet alignment A I B effectively provides an alternative alignment of A and B. Each extended score W is then calculated as W (A(x), B(y))=W(A(x), B(y))+ I A,B Min(W(A(x), I(z)), W(I(z), B(y))), where x, y and z are sequence positions in sequences A, B and the intermediate sequence I, respectively, and summation is done over all third sequences I other than A or B. The

5 463 minimum of W(A(x), I(z)) and W(I(z), B(y) is taken to use information from third sequences conservatively. The more intermediate sequences support the alignment of the pair, the higher becomes its extended weight. The extended library weights W for each matched amino acid pair are then used to fill the DP search matrix and align the associated input sequences. Library extension is performed at each step during the progressive alignment, which is carried out following the ClustalW protocol (Thompson et al., 1994). The dramatic increase in sensitivity of the T- Coffee method is mainly a result of its matrix extension scenario, which combines local and global alignment, where an incorrect direct alignment of sequences A and B can effectively be overridden by consistent alignments of other sequences acting as intermediates in the above triplet alignments. Notredame et al. (2000) showed, using the BAliBASE set of reference alignments (Thompson et al., 1999a) as the standard of truth (see below), that T-Coffee generates much improved alignments as compared to ClustalW (Thompson et al., 1994), Prrp (Gotoh, 1996), and Dialign2 (Morgenstern, 1999): the overall relative improvements measured using the column score (see below) were 8.6, 8.6 and 17.2%, respectively. 3. Scoring alignments Before the PRALINE alignment strategies and benchmarking results are described, it is convenient to first give an overview of existing modes of calculating similarity scores of pair-wise alignments, individual multiple alignments, and pair-wise multiple alignments, where the latter is normally used to benchmark multiple alignment routines with reference alignments Calculating pair-wise alignment scores The alignment score of pair-wise sequence alignments is normally calculated, using a amino acid exchange matrix and gap penalties, as the sum of the exchange values minus appropriate gap penalties: S a,b = l s(a i, b j ) k N k gp(k) where the first summation is over the exchange values associated with l matched residues and the second over each group of gaps of length k, with N k and gp(k), respectively the number of gaps of length k and associated gap penalty. Using the common affine gap penalty scheme, a gap of length k is penalised with the value gp(k) =pi+k pe, where pi and pe are the penalties for gap initialisation and extension, respectively Sum-of-pairs score for single alignment Since pair-wise alignment algorithms optimise the above score constituted by residue exchange values and gap penalties, an obvious way of scoring multiple alignments is to extend the pair-wise sequence scores to get a single score for a multiple alignment. This is referred to as the sum-of-pairs (SP) score for alignment. In the early simultaneous multiple alignment method MSA (Lipman et al., 1989), each cell in the multi-dimensional search matrix (see above), which corresponds to a column in the associated multiple alignment, is scored with the SP score. This requires special gap handling for matrix cells associated with gaps. Here, the SP score of a multiple alignment is calculated without additional penalties for gaps, consistent with the fact that gaps are also ignored in the sum-of-pairs score for pair-wise alignments (see below). For each amino acid a i,x in sequence i and at position x in the multiple alignment, the SP score is SP(x)= k l s(a k,x, a l,x ), where s(a k,x, a l,x ) is the amino acid exchange value. Alignment positions with gaps will be ignored in the pair-wise summation, which will effectively lower the resulting score. The overall SP alignment score is then calculated by summing over the alignment positions: SP= l x N SP(x), where N is the number of aligned positions Sum-of-pairs and column scores for comparing two alignments In addition to the above SP score for single alignments, a pair-wise measure to determine the similarity between a query and reference alignment is also referred to as the sum-of-pairs score: SP= l x N SP(x), where N is the number of columns in the reference alignment and SP(x)= k l (r k,x, r l,x ), where r k,x is the amino acid in sequence k at position x in the reference alignment, and (r k,x, r l,x )=1 if the matched pair (r k,x, r l,x ) is also matched in the query alignment; otherwise (r k,x, r l,x )=0. The score is commonly normalised using the total number of aligned amino acid pairs in the reference alignment. The SP score can also be weighted with amino acid exchange values and then normalised: SP= l x N k l (r k,x, r l,x ) s(r k,x, r l,x )/ l x N k l s(r k,x, r l,x ), where terms and indices are as above. However, a more salient measure than the SP score for pair-wise alignment is the column score. In this measure, alignment columns of the query alignment are compared with those in the corresponding reference alignments and only taken as correctly reproduced if columns in query and target alignment are identical. The column score is given as the fraction of the reference alignment columns that is correct reproduced in the query alignment. Whereas, the SP score only gradually goes down with more misaligned sequences, a

464 single misaligned sequence can effectively zero the column score. Note that for the analysis presented below, the column score was used. 4. The PRALINE progressive alignment strategies 4.1.

6 464 single misaligned sequence can effectively zero the column score. Note that for the analysis presented below, the column score was used. 4. The PRALINE progressive alignment strategies 4.1. General outline The PRALINE method (Heringa, 1999) relies on the Dynamic Programming (DP) technique for pair-wise sequence alignment, introduced by Needleman and Wunsch (1970) to the biological community. Input parameters for the DP algorithm are an amino acid substitution weights matrix and a gap opening and extension penalty value. The latter are applied each time when a gap is inserted in one of the sequences. Based on these parameters, the DP procedure is guaranteed to produce the optimal alignment of two sequences. For carrying out progressive multiple alignment with PRALINE, the basic DP routine is adapted for sequence profile and profile profile alignments. The PRALINE method does not use a pre-calculated search tree as do many progressive alignment methods (e.g. Hogeweg and Hesper, 1984; Thompson et al., 1994; Gotoh, 1996; Notredame et al., 2000) but performs at each alignment step a full profile search and compiles the optimal alignment score of the sequence block aligned in the preceding step with all other blocks and hitherto unaligned sequences. For the current alignment step it then selects the highest scoring pair of sequences or blocks of sequences to be aligned. The alignment order is thus established during the progressive alignment, such that a tree associated with the alignment order becomes only available upon completion of the alignment. The PRALINE method offers a number of strategies based on dynamic programming to optimise multiple alignment. Here, three of the strategies are included: global and local profile pre-processing, and local global double dynamic programming. The first two strategies can be classified as position specific sequence weights, while the latter falls in the category of motif-based weighting schemes. Global and local profile pre-processing allow the calculation of a reliability index for each amino acid in the resulting multiple alignment, based on the consistency of pair-wise alignments. This, and iterative modes relying on the reliability indices, will also be described Global profile pre-processing Fig. 1. Profile pre-processing and subsequent pre-profile alignment. The top panel shows a schematic outline of the construction of pre-processed profiles (pre-profiles) for five sequences. Pair-wise alignments are used to construct the five master-slaves alignments (pre-alignments), which are identified by the order number of their key sequence. The pre-profiles are compiled from the pre-alignments. The bottom panel shows how the pre-profiles are used to construct the final multiple alignment of the five original sequences. Note that for clarity, the pre-profiles depicted here all contain the total number of five sequences. In practice, setting the alignment score threshold value can lead to deletion of sequences from pre-processed blocks and associated pre-profiles. The profile pre-processing strategy in the PRALINE method (Heringa, 1999) is a position-specific weighting scheme aimed at incorporating into each sequence, trusted information from other sequences. For each sequence, a multiple alignment is created by stacking other sequences (master-slaves alignment) that score beyond a user-specified threshold after pair-wise alignment with the sequence considered (Fig. 1). A low threshold would result in a pre-processed alignment for each sequence comprising all other sequences (where the chance for alignment error is large), while higher thresholds would allow the information from lesser sequences into the alignment (with fewer alignment errors). The use of a cut-off value for alignment scores, rather than alignment identity or length-normalised similarity values is in agreement with the analysis of Abagyan and Batalov (1997), who showed that alignment scores allow the best discrimination between alignments of structurally related and unrelated se-

7 465 Fig. 2. Global profile pre-processing. An example is included of two pre-processed alignments for the sequences with PDB codes 2fcr and 4fxn. The sequences are taken from a data set of 13 flavodoxin sequences combined with the chey sequence (PDB code 3chy). Pre-processing was effected with a score cut-off set at zero, thus allowing all remaining sequences in each pre-profile. Note that the key (or master) sequence in each of the two blocks (2fcr and 4fxn, respectively) contain no gaps, amino acids appearing opposite gaps in the key sequences are deleted from the added (slave) sequences in the pre-processed alignments.

8 466 Fig. 3. Outline of local profile pre-processing. Shown are the first two highest-scoring local alignments with excluded associated matrix regions designated in grey: descriptions of the dark and light grey regions are given in the text. White matrix regions without alignments can either be filled with lower scoring top alignments, or remain blank if a threshold score value is used and no further local alignments score beyond the threshold. quences. For each of the thus formed pre-processed alignments, a profile is constructed (Fig. 1). The PRA- LINE method then performs progressive multiple alignment using the pre-processed profiles, where each sequence is now represented by its pre-processed profile. To do this, the pair-wise alignment step is repeated over all pre-processed profiles, after which progressive alignment takes place. The pre-processed profiles for each of the sequences incorporate knowledge about other sequences (in particular similar sequences) and comprise position-specific gap penalties. This enables increased matching of distant sequences and likely placement of gaps outside the ungapped core regions in the pre-processed profiles during progressive alignment. While more details about global profile preprocessing can be found in Heringa (1999), Fig. 2 includes an example of two pre-processed sequence blocks. These are taken from a set of 13 distant flavodoxin sequences and the chey sequence (Heringa, 1999), the latter being a signal transduction protein, which displays extremely low sequence similarity but nevertheless adopts the flavodoxin fold. The alignment score cut-off value can be specified by the user as the direct alignment score. Alternatively, it can be specified as a factor related to the aligned sequence lengths: S xl, where x is the score threshold factor and L the length of the shortest matched sequence. This renders a score threshold that is linearly related to the alignment length, which is in agreement with the observation made for global similarity scores for random sequences (for a review, see Yona and Brenner, 2000) Local profile pre-processing The selection of sequences based on their overall pair-wise alignment scores can be extended by scrutinis- ing local fragments within the sequences, and in addition to rejecting the information from low scoring sequences as can be done in global pre-processing, now also information can be discarded from sequence regions that cannot be trusted to contribute reliable information. This is achieved by a simple protocol that selects for each pre-processed profile the best sequence regions from sequences that are candidates for inclusion in the profile. The protocol is based on the local alignment algorithm of Smith and Waterman (1981). For each cell in the N M search matrix, the following function is evaluated for two sequences A= (a 1, a 2,, a N )andb=(b 1, b 2,, b M ): H[i 1, j 1]+s[a i, b j ] Max{H[i x, j 1] P(x)} H[i, j ]=Max Max{H[i 1, j y] P(y)} 0 where H[i, j ] is guaranteed to hold the maximum similarity of two segments ending in a i and b j, s[a i, b j ]isthe value for substitutions between residue types a i and b j, and P(x) is the penalty for insertion of a gap of length x. Note that to convert a local alignment routine into a global algorithm, only the zero in (Eq. (2)) needs to be discarded. For local dynamic programming, the amino acids exchange matrix used must include negative values s[a i, b j ], otherwise global alignment will occur. The local alignment algorithm thus relies on dissimilar subsequences producing negative scores, which are subsequently discarded by placing zero values in the associated search matrix cells. After calculation of the search matrix by the above procedure, the local alignment corresponding to each non-zero cell in the search matrix can be obtained by a traceback procedure. Whereas, Smith and Waterman (1981) selected the optimal local alignment by performing the traceback only on the highest scoring matrix cell H[i, j ], Waterman and Eggert (1987) modified the traceback step to include a set of top-scoring non-intersecting alignments, i.e. having no matched amino acid pair in common. However, at each step in the traceback procedure for a local alignment that is being tested for inclusion in the top-scoring list, the currently matched residue pair must be checked against all matched pairs within the local alignments contained in the top-scoring list at that moment, which is computationally demanding. A result of this approach is that for a sequence A and B being compared, there can be more than one matched residue pair within the top-scoring alignments containing either residue a i (l i N) orb j (l j M). With our pre-processed profile approach with one line for each included sequence, this would lead to ambiguities in residue pair selection. Therefore, an alternative scenario was developed to avoid this selection problem and speed up computation at the same time. In this scenario, which is (2)

9 467 outlined in Fig. 3, after performing the forward local dynamic programming step, the local alignment scores are ordered from high to low. Then the highest scoring alignment is selected and is traced back to obtain the pair-wise residue matches. The regions of the sequences corresponding to this alignment are then considered occupied by the local alignment and are effectively locked so to not allow any more alignments in this region (dark grey matrix regions in Fig. 3). Then the procedure is repeated for lower scoring local alignments that do not enter the locked matrix region. It is possible in principle to allow local alignments that would cross already accepted top alignments. For example, a local alignment j could include one sequence region positioned N-terminally of a sequence fragment within an accepted local alignment i, while the other sequence of j covers a region C-terminal of alignment i. However, since this would lead to motif inversions, which complicate the global multiple sequence alignment technique, these regions are disallowed by default in the local pre-processing procedure (light grey matrix regions in Fig. 3). This means that for each considered local alignment, the matched amino acid pairs (x, y) should be compared with all earlier accepted alignments. However, the comparison is more straightforward than in the approach by Waterman and Eggert (1987). If an earlier alignment would span residue a i to a j in sequence A and residue b k to b l in sequence B, either the condition (x a i and y b k ) should be met, or (x a j and y b l ). As a result of this local scenario carried out for each of the input sequences, the pre-processed profiles contain added sequences from which low scoring regions are deleted. The user must specify a local alignment cut-off score, so that only sequence fragments that locally align to the query sequence with a score above the cut-off value are included in the locally pre-processed profile. Fig. 4 shows an example of two locally pre-processed alignments, corresponding to those in Fig. 2 for global pre-processing. Due to the global relationship of the sequences, no sequence stretches matching middle regions of the key (or master) sequence have been discarded, but the use of information from the distant chey sequence has been clearly reduced for the two sequences (Fig. 4) Local global DDP alignment The local alignment-driven global alignment strategy, which falls in the class of motif-based weighting schemes (see above) operates in two steps (Fig. 5): First, for each possible residue match between two sequences (or sequence blocks), the score of the optimal local alignment including that match is calculated. Then, the optimal global alignment is compiled based on these local alignment scores. This two-step alignment strategy is akin to the double dynamic programming (DDP) protocol first implemented in the protein structure superpositioning algorithm SAP (Taylor and Orengo, 1989). The strategy can be viewed as a shortcut of the T-Coffee scenario described above (Notredame et al., 2000), as it ensures that the global alignment is biased towards matching local motifs, though using the local signals as soft constraints only. The strategy can be useful when local sequence similarity is suspected (for example in cases of very different sequence lengths). For each pair of sequences or sequence blocks, it uses the classical Smith and Waterman (1981) local alignment algorithm based on the dynamic programming protocol (Needleman and Wunsch, 1970). The idea is to determine for each matched pair of amino acids the score of the best local alignment over all those that include the pair. The thus obtained scores are then assigned to a search matrix as weights and subsequently subjected to a global alignment round, which resolves the values in a final global alignment that is biased towards local alignment. In the Smith Waterman algorithm, which is explained above, the local alignment corresponding to the highest scoring matrix cell H[i, j ] is determined by a traceback step. However, to meet our objective of calculating the score of the best local alignment for each matrix cell, for all or most of the N M matrix cells, the traceback step would have to be performed and the corresponding score of the best local alignment substituted in the cell considered. This would be prohibitive in the context of multiple sequence alignment. Therefore, we devised the following shortcut to obtain the alignment score for each cell in the matrix without carrying out the traceback steps as shown in Fig. 5. Local dynamic programming is performed in forward and in backward direction of the sequences. Then, the values of the resulting two DP search matrices are added for each cell with subtraction of the local score s[a i, b j ] to avoid double counting of the local substitution value due to two times applying Eq. (2) for each cell [i, j ]. After this operation, each cell in the resulting matrix shows the score of its best corresponding local alignment. The thus obtained weights for each amino acid exchange of the two sequences are subjected to a second round of dynamic programming, this time to find the optimal global alignment (Gotoh, 1982) based on the local alignment scores (Fig. 5, step 2). A number of operations are available in the PRALINE method to scale the raw scores resulting from the local alignment step before the second global alignment step is effected. These include converting each score in the matrix to the z-score derived from the values in its corresponding row and column; converting the weights to logarithmic values; and adding the normalised weights to the residue exchange value corresponding to the search matrix cells.

10 468 Fig. 4. Local profile pre-processing. An example is included of two pre-processed master-slaves alignments for the sequences with PDB codes 2fcr and 4fxn, corresponding to those in Fig. 2. The local alignment cut-off score was set at zero, so that all sequences are represented in each of the two pre-processed blocks and short local fragments are selected for some of the sequences. Deleted sequence regions are indicated by dots to discern them from gap regions within accepted local top alignments.

11 4.5. Consistency-based alignment iteration 469 The globally or locally pre-processed profiles can be used to derive consistency scores for each amino acid in the final multiple alignment, reflecting the consistency among the pair-wise alignments included in all the pre-processed profiles. The consistency scores are based on the extent to which the residue appears at the same alignment position across its corresponding identical sequences in the other pre-processed alignments (Fig. 1). The number of sequences allowed into the profiles during global pre-processing and the number of fragments selected during local pre-processing (Section 5) influences the consistency scores: the higher the cut-off values, the less corresponding identical amino acids there are within other pre-processed profiles, which in turn will tend to increase the consistency scores as fewer but more similar sequences or sequence fragments are being used. The consistency scores can be used in an iterative strategy, designed to give preference to consistent alignment regions in subsequent alignment rounds (Fig. 6, top panel: consistency iteration). This is achieved by using the reliability scores as residue weights in the alignments performed in the next iteration round (Heringa, 1999). Another possible iterative strategy updates the aligned sequences in the pre-processed alignments based on their placement in the multiple alignment of a preceding alignment round (Fig. 6, bottom panel: pre-profile update iteration). Although likely, iteration is not guaranteed to lead to convergence. The three possible outcomes of iteration are: (i) convergence to a single multiple alignment; (ii) limit cycle behaviour, where iteration after a number of iterations reaches an alignment that has been generated before, and Fig. 6. Iteration of pre-profile multiple alignment. The top panel outlines schematically the consistency iteration protocol. The upward arrows at the right hand side indicate how the positional consistency scores for each multiply aligned sequence are copied into a vector for each pre-processed profile. The alignment can then be refined during subsequent iterations using the weights given in the profile vectors in each DP alignment during the progressive alignment protocol (downward arrow). After each alignment iteration, the consistency scores are recalculated and the profile weight vectors updated. The bottom panel depicts the pre-profile update iteration protocol. At each iteration, the pre-profiles are updated by using the multiple alignment resulting from the preceding iteration to position each sequence included in every pre-alignment under its key sequence. For each thus updated pre-alignment, a new pre-profile is constructed and a new round of progressive alignment is initiated. Fig. 5. Local global DDP. (1) Summation of local DP search matrices resulting from forward and reversed local DP. For this step, a residue exchange matrix, and a gap open and extension penalty are required (M+P o,e ). (2) Second step of global DP to find the highest-scoring global alignment based on all best local alignment scores. No residue exchange matrix is needed for this step, and gap opening and extension penalty are set to zero (nomorp o,e ). For details, see text. then goes on repeating the iterative steps in between the two identical alignments; and (iii) divergence. The latter is declared when no conversion or limit cycle behaviour is reached. With the PRALINE method, the maximum number of iterations must be specified, but within this number of iterations, both convergence and limit-cycle behaviour is traced, upon which iteration is terminated. An additional option is to select, from all alignments constructed during iteration, the one with the highest SP score (see above). This option, called SP selection, is designed to control those iterations where alignments would wander away in alignment space to lower SP scores.

12 470 Fig. 7. Flavodoxin-cheY normalised multiple alignment Sum-of-Pairs (SP) scores versus total number of sequences included in the 14 pre-processed profiles resulting from varying the pair-wise alignment cut-off scores. The solid line (with triangles) denotes the SP scores, while the dashed line (squares) designates the number of identical pairs encountered in the alignments. Both the SP scores and the number of identical pairs scores have been normalised to the maximum found, multiplied by Tuning and evaluating the alignment strategies 5.1. Reference database Evaluation tests were performed using the BAliBASE multiple alignment benchmark set (Thompson et al., 1999b) as the standard of truth. The BAliBASE alignments are placed in five different categories, which are supposed to cover most of the problems the alignment engines are faced with: (1) alignments containing equidistant sequences; (2) alignments with a single orphan sequence; (3) alignments comprising two distant groups; (4) alignments containing long deletions; and (5) alignments containing long insertions Tuning threshold alues for profile pre-processed alignment Can alignment score cut-off values be found that are generally optimal for global or local pre-processing? Fig. 7 shows, taking the flavodoxin chey data set as an example, the relation between the pair-wise alignment cut-off scores for global pre-processing and the SP scores (for single alignment) of each of the alignments produced. The alignments were made for a set of distant flavodoxin sequences and the extremely distant chey sequence (Heringa, 1999), a signal transduction protein which nonetheless adopts the flavodoxin fold. The figure shows that the SP scores of the alignments do not display a uniform behaviour: A similar pattern is found for the number of identical pairs found in each of the alignments (Fig. 7). The variation of the SP scores is slightly over 2% and that of the identities about 6%. The flavodoxin-chey alignment example illustrates that it is unlikely to derive an easy optimisation protocol for setting a fixed cut-off value for global pre-processing in individual alignments. However, Fig. 8 shows the pair-wise alignment scores of all possible sequence pairs in each of the BAliBASE alignments versus the length of the shortest sequence in each of the pairs, displaying a linear relationship. Consequently, global pre-processing was tested with alignment score threshold values specified linearly related to the sequence length: S xl, where S is the alignment score, x the specified cut-off value and L the length of the shortest sequence. This means that the higher the value of x, the higher the alignment score S needs to be in order for the target sequence to be included in the pre-processed profile. The same holds for the threshold value for local profile pre-processing, although to a lesser extent, because here a more gradual effect between the threshold value and inclusion of sequence fragments is observed (data not shown). Moreover, local similarity scores are known not to grow linearly

13 471 Fig. 8. BAliBASE pair-wise alignment scores versus the minimum sequence length of each sequence pair, aligned using the BLOSUM62 amino acid exchange matrix and penalties of 12 and 1 for gap opening and extension, respectively. The slope of the regression line is with the sequence length, but with the logarithm of the product of the sequence lengths (Karlin and Altschul, 1990) Measuring alignment accuracy Over a total of 144 BAliBASE reference alignments (version of July 2000), alignments generated by the PRALINE method were compared using column scoring (see above), i.e. alignment columns of the target alignments were compared with those in the corresponding reference alignments and were only taken as correct if columns were identical. Following Notredame et al. (2000), the column scores were evaluated only over alignment regions that were deemed to be reliable by the BAliBASE curators, who defined for each BALiBASE alignment the trusted regions. The PRALINE strategies were evaluated using a 500 MHz Pentium III cluster under the Linux operating system. We tested global and local profile pre-processing over the 144 BAliBASE alignments, and also ran the PRALINE consistency iteration protocol, with and without SP selection (see above) Accuracy of global pre-processing Table 1 shows the accuracy numbers generated under a number of score cut-off values for global pre-processing. It is clear that global profile pre-processing increases the quality of the alignments produced, compared to the PRALINE default setting without pre-processing, a maximum gain of 3.5% (weighted average) and 6.5% (unweighted average) for S L 9.5, where S is the alignment score and L the length of the shortest sequence in the pair-wise alignment. The greatest variation in accuracy percentages is observed for the BAliBASE categories 4 and 5 (long deletions and insertions, respectively). The weighted average accuracy is further increased by 2.5% when iteration is applied. The best result is obtained using iteration and SP selection for S L 9, resulting in 65.35% weighted and unweighted accuracy. SP selection enhances the accuracy in all categories except for S L 9.5, which shows a decline in accuracy of nearly 3% for BAliBASE category 4. Overall, the best scores for each of the categories was reached when iteration was performed, except for category 5, where both S 0 and S L 9.5 attain the highest accuracy of 76.12%. Also, iteration for category 5 results in lower accuracy values for all tested cases. Although, SP selection improves the iteration by 6.5% or more, it does not reach the accuracy values observed for the non-iterative runs Accuracy of local pre-processing Local pre-processing was tested by varying the alignment cut-off score of the local alignments, results of

PROTEIN MULTIPLE ALIGNMENT MOTIVATION: BACKGROUND: Marina Sirota

PROTEIN MULTIPLE ALIGNMENT MOTIVATION: BACKGROUND: Marina Sirota Marina Sirota MOTIVATION: PROTEIN MULTIPLE ALIGNMENT To study evolution on the genetic level across a wide range of organisms, biologists need accurate tools for multiple sequence alignment of protein