Local weighting schemes for protein multiple sequence alignment

Size: px
Start display at page:

Download "Local weighting schemes for protein multiple sequence alignment"

Transcription

1 Computers and Chemistry 26 (2002) Local weighting schemes for protein multiple sequence alignment Jaap Heringa 1 * Di ision of Mathematical Biology, MRC National Institute for Medical Research (NIMR), The Ridgeway, Mill Hill, London NW7 1AA, UK Received 29 May 2001; received in revised form 9 November 2001; accepted 20 November 2001 Abstract This paper describes three weighting schemes for improving the accuracy of progressive multiple sequence alignment methods: (1) global profile pre-processing, to capture for each sequence information about other sequences in a profile before the actual multiple alignment takes place; (2) local pre-processing; which incorporates a new protocol to only use non-overlapping local sequence regions to construct the pre-processed profiles; and (3) local global alignment, a weighting scheme based on the double dynamic programming (DDP) technique to softly bias global alignment to local sequence motifs. The first two schemes allow the compilation of residue-specific multiple alignment reliability indices, which can be used in an iterative fashion. The schemes have been implemented with associated iterative modes in the PRALINE multiple sequence alignment method, and have been evaluated using the BAliBASE benchmark alignment database. These tests indicate that PRALINE is a toolbox able to build alignments with very high quality. We found that local profile pre-processing raises the alignment quality by 5.5% compared to PRALINE alignments generated under default conditions. Iteration enhances the quality by a further percentage point. The implications of multiple alignment scoring functions and iteration in relation to alignment quality and benchmarking are discussed Elsevier Science Ltd. All rights reserved. Keywords: Weighting schemes; Multiple sequence alignment; Profile 1. Introduction The simultaneous alignment of three or more nucleotide or amino acid sequences is one of the most common tasks in bioinformatics. Multiple alignments are an essential pre-requisite to many further modes of analysis into protein families such as homology modelling, secondary structure prediction, phylogenetic reconstruction, or the delineation of conserved and variable sites within a family. Alignments may be further used to derive profiles (Gribskov et al., 1987) or Abbre iations: DP, dynamic programming; DDP, double dynamic programming; SP, sum-of-pairs. * Tel.: x2293; fax: address: jhering@nimr.mrc.ac.uk (J. Heringa). 1 www: hidden Markov models (Bucher et al., 1996; Eddy, 1998; Karplus et al., 1998) that can be used to scour databases for distantly related members of the family. With the recent completion of the first draft of the human genome and genomes of many other species becoming rapidly available, the accurate alignment of biological sequences becomes more important than ever. Although many initiatives are underway for largescale proteomics and structure elucidation of novel genomic proteins, an important drive in genomic approaches is essentially aimed at gathering the function of most if not all translated proteins from sequence data alone. To accomplish this, accurate alignment of the sequences is essential. Given the overwhelming amounts of sequence data, the alignment engines have to be extremely fast and fully automatic to be included in genomic pipelines /02/$ - see front matter 2002 Elsevier Science Ltd. All rights reserved. PII: S (02)

2 460 The automatic generation of an accurate multiple alignment is potentially a daunting task. Ideally, one would make use of an in-depth knowledge of the evolutionary and structural relationships within the family but this information is often lacking or difficult to use. General empirical models of protein evolution (Benner et al., 1992; Dayhoff, 1978; Henikoff and Henikoff, 1992) are widely used instead but these can be difficult to apply when the sequences are less than 30% identical (Sander and Schneider, 1991). Further, mathematically sound methods for carrying out alignments, using these models, can be extremely demanding in computer resources for more than a handful of sequences (Carillo and Lipman, 1988; Wang and Jiang, 1994). To be able to cope with practical dataset sizes, heuristics have been developed that are used for all but the smallest data sets. The most commonly used heuristic methods are based on the progressive alignment strategy (Hogeweg and Hesper, 1984; Feng and Doolittle, 1987; Taylor, 1988) with ClustalW (Thompson et al., 1994) being the most widely used implementation. The idea is to establish an initial order for joining the sequences, and to follow this order in gradually building up the alignment. Many implementations use an aproximation of a phylogenetic tree between the sequences as a guide tree that dictates the alignment order. Although, appropriate for many alignment problems, the progressive strategy suffers from its greediness. Errors made in the first alignments during the progressive protocol cannot be corrected later as the remaining sequences are added in. Attempts to minimize such alignment errors have generally been targeted at global sequence weighting (Altschul et al., 1989; Thompson et al., 1994), where the contribution of individual sequences are weighted during the alignment process. However, such global sequence weighting schemes carry the risk of propagating rather than reducing error when used in progressive multiple alignment strategies (Heringa, 1999), for reasons given later in this paper. The main alternative to progressive alignment is the simultaneous alignment of all the sequences. Two such implementations are available, MSA (Lipman et al., 1989) and DCA (Stoye et al., 1997). Both methods are based on the Carillo and Lipman algorithm (Carillo and Lipman, 1988) to limit computations to a small area in the multi-dimensional search matrix. They nonetheless remain an extremely CPU- and memory-intensive approach, applicable only to about nine sequences of average length for the fastest implementation (DCA). Iterative strategies (Hogeweg and Hesper, 1984; Gotoh, 1996; Notredame and Higgins, 1996; Heringa, 1999) are an alternative to optimise multiple alignments by reconsidering and correcting those made during preceding iterations. Although such iterative strategies do not provide any guarantees about finding optimal solutions, they are reasonably robust and much less sensitive to the number of sequences than their simultaneous counterparts. All of these techniques perform global alignment and match sequences over their full lengths. Problems with this approach can arise when highly dissimilar sequences are compared. In such cases global alignment techniques might fail to recognise highly similar internal regions because these may be overshadowed by dissimilar regions and high gap penalties normally required to achieve proper global matching. Moreover, many biological sequences are modular and show shuffled domains (Heringa and Taylor, 1997), which can render a global alignment of two complete sequences meaningless. The occurrence of varying numbers of internal sequence repeats (Heringa, 1998) can also severely limit the applicability of global methods. In general, when there is a large difference in the lengths of two sequences to be compared, global alignment routines become unwarranted. To address these problems, Smith and Waterman (1981) early on developed a so-called local alignment technique in which the most similar regions in two sequences are selected and aligned. The algorithm has been extended in various techniques to compute a list of top-scoring pair-wise local alignments (Waterman and Eggert, 1987; Huang et al., 1990; Huang and Miller, 1991). Alignments produced by the latter techniques are non-intersecting; i.e., they have no matched pair of amino acids in common. For multiple sequences, the main automatic methods include the Gibbs sampler (Lawrence et al., 1993), MEME (Bailey and Elkan, 1994) and Dialign2 (Morgenstern, 1999). These programs often perform well when there is a clear block of ungapped alignment shared by all of the sequences. They perform poorly, however, on general sets of test cases when compared with global methods (Thompson et al., 1999a; Notredame et al., 2000). Here, three strategies aimed at combining the best properties of global and local multiple alignment are presented: global pre-processing (Heringa, 1999) and two new strategies, local pre-processing and local global DDP. The latter is a technique to integrate local alignment patterns into global alignment using the double dynamic programming (DDP) protocol. The three modes are incorporated in the multiple alignment package PRALINE (Heringa, 1999) and are used in progressive multiple alignment. They each incorporate a weighting scheme that is more flexible and appropriate for alignment than the aforementioned global sequence weighting schemes. The global and local pre-processing strategies are aimed at minimising error at early stages during progressive multiple alignment and do this by using information from other reliable sequences for each pair-wise comparison at every stage during progressive alignment. An important consequence of the

3 461 pre-processing scenarios is that they allow the calculation of a reliability index for each amino acid in the alignment, because the consistency of each aligned residue can be estimated based on the information from other sequences. The schemes allow further optimisation in iterative steps, based on the residue-specific reliability indices as an objective function. It will be shown how these weighting schemes can be used to refine multiple alignment, and how the iterative mode can further enhance the alignment quality. Also a new motif-based weighting scheme will be introduced here, based on a double dynamic programming to reconcile all best local alignments in a single global alignment. Finally, an evaluation of the weighting schemes with various parameter settings will be presented. Measuring the alignment quality is performed using the BAliBASE multiple alignment benchmark set (Thompson et al., 1999b), a database comprising five distinct alignment categories developed especially for testing the quality of multiple alignment methods. 2. Overview of existing sequence weighting schemes There are two basic modes of sequence weighting that can be distinguished for multiple sequence alignment: global and local sequence weighting. The latter can be subdivided into position specific sequence weighting and motif weighting. This section will present an overview of previously developed techniques falling in the above three classes Global sequence weighting Early on Altschul et al. (1989) and Vingron and Argos (1989) proposed global sequence weighting as a means to deal with the fact that sets of sequences normally are unequally represented. To address this problem, global sequence weighting involves the assignment of a weight for each sequence that is used in deriving any average value from the multiple alignment of a set of sequences. The Altschul et al. weighting scheme is integrated in the multiple sequence alignment method MSA (Lipman et al., 1989), where the sequence weights are derived from a rooted phylogenetic tree. Sequences closer to the root in the tree are assigned a larger weight than those at the periphery. In contrast, Vingron and Argos (1989) do not use a tree but derive the sequence weights by calculating for each sequence the average distance from all others. More distant sequences (outliers) according to this criterion receive a larger weight than relatively similar sequences. Thompson et al. (1994) derived the sequence weights in a profile directly from the branch lengths of a phylogenetic tree constructed with the Neighbour-Joining technique (Saitou and Nei, 1987). They used these weights during progressive alignment in the construction of profiles representing pre-aligned sequence groups. Independently, Lüthy et al. (1994) modified the profile search technique by employing the Voronoibased weighting (Sibbald and Argos, 1990), a technique that exploits a notion of the neighbourhood around each sequence in sequence space. Both techniques make use of the BLOSUM62 matrix, constructed from ungapped alignment blocks (Henikoff and Henikoff, 1992). In line with this derivation of the BLOSUM62 matrix, Thompson et al. (1994) exclude from analysis all alignment positions with a percentage of gaps higher than a certain specified threshold. Such regions would be expected to constitute loop regions in the associated protein structures showing less consistent amino acid conservation patterns. Henikoff and Henikoff (1994) derived global sequence weights in a tree-less way by averaging the weights derived for each alignment column by a simple weighting scheme based on the principle of maximum entropy. For each alignment column, the contribution of each occurring amino acid is linearly down-weighted to make the overall contribution of each amino acid type equal. For example, if a column would have five valines and only one leucine, the contribution for each of the sequences having a valine at the considered alignment position would be 1/5 of that of the sequence comprising leucine. In principle it is a good idea to perform global weighting of multiple alignments aimed at increasing the contribution of more distant sequences as they carry more information at each alignment position. However, when sequence weighting is used in progressive multiple alignment, the increased chance of mistakes when aligning distant sequences can well lead to error progression (Heringa, 1999). Vogt et al. (1995) compared local and global alignments of pair-wise sequences with a data bank of structure-based alignments (Pascarella and Argos, 1992) and included a set of over 30 substitution matrices with optimised gap-penalties. The best global alignments were achieved with the Gonnet residue exchange matrix (Gonnet et al., 1992), resulting in 15% incorrect residue matching when sequences with 30% residue identity were aligned. The error rate quickly increased to 45% incorrect matches at 20% residue identity of the aligned sequences, and to 73% error at 15% sequence identity. Rost (1999) stressed the same point and reported even higher pairwise alignment error rates in the twilight zone. These statistics clearly demonstrate that increasing the global weight for distant sequences is likely to lead to the use of misalignment, which will hamper the recognition of true patterns. This is particularly significant if global weighting is used in progressive alignment, because incorrect alignments typically yield low scores, which make the involved sequences appear to be distant, such

4 462 that their incorrect contribution can be amplified by increased weighting in later progressive alignment steps Position-specific sequence weighting Position specific sequence weighting involves the calculation of a weight for each alignment position. Although, many global weighting schemes derive their weights from averaging over positional weights calculated by position-specific schemes, the schemes are principally designed to differentiate the contribution of local alignment regions and aim to make use of the most informative fragments. Various schemes have been developed to locally adjust the contribution from the various sequences in an alignment, e.g. pseudocounts (Henikoff and Henikoff, 1994) or Dirichlet mixtures (Sjölander et al., 1996). These convert the observed amino acid frequencies into weights using background amino acid probabilities and residue exchange weights matrices (Pietroskovski, 1996). Sunyaev et al. (1998, 1999) devised a strategy to construct profiles from given multiple alignments based on a weighting scenario reminiscent to phylogenetic parsimony methods. In their approach, amino acid propensities at each alignment position in the alignment profile are weighted according to the probability that identical amino acids occur in more than one sequence at the alignment position. If more alignment positions show identical conservation for a given subset of sequences (not necessarily the same conserved amino acid type over the alignment positions involved), the occurrence of the amino acids at those positions becomes more expected, which is corrected for by appropriately lowering the weight for the considered position. This approach leads to position-specific sequence weights, which are then implemented in the position-specific propensities for each of the amino acid types in a profile. The authors report increased sensitivity if profile searches were performed using profiles constructed with this technique. However, since their method is critically depending on which sequences are actually chosen to represent a given protein family in the multiple alignment, it is not suitable for progressive multiple alignment Motif-based weighting An extension of local sequence weighting can be implemented in the N M search matrix used in dynamic programming (DP), with N and M the length of the two sequences being compared. With DP, each cell in the matrix corresponds to a value representing the likelihood of the matched pair of amino acids from either sequence typically taken from an amino acid exchange table and can be weighted with any source of extraneous information. Early on Argos (1987) weighted local diagonals in the alignment search matrix with correlations of physical chemical amino acid characters such as hydrophobicity, bulkiness, size and the like. Each matched residue position in a local window received as a score a combination of the amino acid exchange value as given in the PAM-250 exchange matrix (Dayhoff, 1978), and the correlation of each of the five residue parameters calculated over a 5-residue window, each time with the considered matched pair at the middle window position. Although, Argos (1987) did not multiply align sequences but mainly used the scheme to generate visual dot plots for pair-wise protein sequence comparison, the approach basically weights local ungapped alignment scores with physical chemical features. More recently, Bucher and Hofmann (1996) developed a statistical local alignment technique for pairwise sequence comparison in which each cell [i, j ] in the DP search matrix holds the total probability that a local alignment would pass through it. This is achieved by summing the scores of all local alignments intersecting cell [i, j ]. The multiple alignment method T-Coffee (Notredame et al., 2000) combines information from global and local pair-wise alignments. For each sequence pair, a single global alignment and 10 top-scoring non-intersecting local alignments are generated, respectively by the programs ClustalW (Thompson et al., 1994) and Lalign (Huang and Miller, 1991). The global and local alignment scores are then combined to yield a synthetic weight W for each aligned pair of amino acids, which is achieved by taking the sum of the associated basic scores (sequence identities): W(A(x), B(y))= S(A(x), B(y)), where A(x) is residue x in sequence A, and summation is over the scores S of the global and local alignments containing the residue pair (A(x), B(y)), while for S each time the sequence identity percentage of the associated alignment is taken. This scenario results in a library of weights for each non-redundant residue pair. The information in the library is then further enhanced by a procedure called matrix extension (Notredame et al., 2000). Each library weight W(A(x), B(y)) is recalculated to reflect the degree to which residues A(x) and B(y) align consistently, as judged by all other library weights involving either A(x) or B(y). This is done using a triplet approach aimed at calculating the contribution of third sequences I onto the direct alignment of sequence A and B, based on the notion that a triplet alignment A I B effectively provides an alternative alignment of A and B. Each extended score W is then calculated as W (A(x), B(y))=W(A(x), B(y))+ I A,B Min(W(A(x), I(z)), W(I(z), B(y))), where x, y and z are sequence positions in sequences A, B and the intermediate sequence I, respectively, and summation is done over all third sequences I other than A or B. The

5 463 minimum of W(A(x), I(z)) and W(I(z), B(y) is taken to use information from third sequences conservatively. The more intermediate sequences support the alignment of the pair, the higher becomes its extended weight. The extended library weights W for each matched amino acid pair are then used to fill the DP search matrix and align the associated input sequences. Library extension is performed at each step during the progressive alignment, which is carried out following the ClustalW protocol (Thompson et al., 1994). The dramatic increase in sensitivity of the T- Coffee method is mainly a result of its matrix extension scenario, which combines local and global alignment, where an incorrect direct alignment of sequences A and B can effectively be overridden by consistent alignments of other sequences acting as intermediates in the above triplet alignments. Notredame et al. (2000) showed, using the BAliBASE set of reference alignments (Thompson et al., 1999a) as the standard of truth (see below), that T-Coffee generates much improved alignments as compared to ClustalW (Thompson et al., 1994), Prrp (Gotoh, 1996), and Dialign2 (Morgenstern, 1999): the overall relative improvements measured using the column score (see below) were 8.6, 8.6 and 17.2%, respectively. 3. Scoring alignments Before the PRALINE alignment strategies and benchmarking results are described, it is convenient to first give an overview of existing modes of calculating similarity scores of pair-wise alignments, individual multiple alignments, and pair-wise multiple alignments, where the latter is normally used to benchmark multiple alignment routines with reference alignments Calculating pair-wise alignment scores The alignment score of pair-wise sequence alignments is normally calculated, using a amino acid exchange matrix and gap penalties, as the sum of the exchange values minus appropriate gap penalties: S a,b = l s(a i, b j ) k N k gp(k) where the first summation is over the exchange values associated with l matched residues and the second over each group of gaps of length k, with N k and gp(k), respectively the number of gaps of length k and associated gap penalty. Using the common affine gap penalty scheme, a gap of length k is penalised with the value gp(k) =pi+k pe, where pi and pe are the penalties for gap initialisation and extension, respectively Sum-of-pairs score for single alignment Since pair-wise alignment algorithms optimise the above score constituted by residue exchange values and gap penalties, an obvious way of scoring multiple alignments is to extend the pair-wise sequence scores to get a single score for a multiple alignment. This is referred to as the sum-of-pairs (SP) score for alignment. In the early simultaneous multiple alignment method MSA (Lipman et al., 1989), each cell in the multi-dimensional search matrix (see above), which corresponds to a column in the associated multiple alignment, is scored with the SP score. This requires special gap handling for matrix cells associated with gaps. Here, the SP score of a multiple alignment is calculated without additional penalties for gaps, consistent with the fact that gaps are also ignored in the sum-of-pairs score for pair-wise alignments (see below). For each amino acid a i,x in sequence i and at position x in the multiple alignment, the SP score is SP(x)= k l s(a k,x, a l,x ), where s(a k,x, a l,x ) is the amino acid exchange value. Alignment positions with gaps will be ignored in the pair-wise summation, which will effectively lower the resulting score. The overall SP alignment score is then calculated by summing over the alignment positions: SP= l x N SP(x), where N is the number of aligned positions Sum-of-pairs and column scores for comparing two alignments In addition to the above SP score for single alignments, a pair-wise measure to determine the similarity between a query and reference alignment is also referred to as the sum-of-pairs score: SP= l x N SP(x), where N is the number of columns in the reference alignment and SP(x)= k l (r k,x, r l,x ), where r k,x is the amino acid in sequence k at position x in the reference alignment, and (r k,x, r l,x )=1 if the matched pair (r k,x, r l,x ) is also matched in the query alignment; otherwise (r k,x, r l,x )=0. The score is commonly normalised using the total number of aligned amino acid pairs in the reference alignment. The SP score can also be weighted with amino acid exchange values and then normalised: SP= l x N k l (r k,x, r l,x ) s(r k,x, r l,x )/ l x N k l s(r k,x, r l,x ), where terms and indices are as above. However, a more salient measure than the SP score for pair-wise alignment is the column score. In this measure, alignment columns of the query alignment are compared with those in the corresponding reference alignments and only taken as correctly reproduced if columns in query and target alignment are identical. The column score is given as the fraction of the reference alignment columns that is correct reproduced in the query alignment. Whereas, the SP score only gradually goes down with more misaligned sequences, a

6 464 single misaligned sequence can effectively zero the column score. Note that for the analysis presented below, the column score was used. 4. The PRALINE progressive alignment strategies 4.1. General outline The PRALINE method (Heringa, 1999) relies on the Dynamic Programming (DP) technique for pair-wise sequence alignment, introduced by Needleman and Wunsch (1970) to the biological community. Input parameters for the DP algorithm are an amino acid substitution weights matrix and a gap opening and extension penalty value. The latter are applied each time when a gap is inserted in one of the sequences. Based on these parameters, the DP procedure is guaranteed to produce the optimal alignment of two sequences. For carrying out progressive multiple alignment with PRALINE, the basic DP routine is adapted for sequence profile and profile profile alignments. The PRALINE method does not use a pre-calculated search tree as do many progressive alignment methods (e.g. Hogeweg and Hesper, 1984; Thompson et al., 1994; Gotoh, 1996; Notredame et al., 2000) but performs at each alignment step a full profile search and compiles the optimal alignment score of the sequence block aligned in the preceding step with all other blocks and hitherto unaligned sequences. For the current alignment step it then selects the highest scoring pair of sequences or blocks of sequences to be aligned. The alignment order is thus established during the progressive alignment, such that a tree associated with the alignment order becomes only available upon completion of the alignment. The PRALINE method offers a number of strategies based on dynamic programming to optimise multiple alignment. Here, three of the strategies are included: global and local profile pre-processing, and local global double dynamic programming. The first two strategies can be classified as position specific sequence weights, while the latter falls in the category of motif-based weighting schemes. Global and local profile pre-processing allow the calculation of a reliability index for each amino acid in the resulting multiple alignment, based on the consistency of pair-wise alignments. This, and iterative modes relying on the reliability indices, will also be described Global profile pre-processing Fig. 1. Profile pre-processing and subsequent pre-profile alignment. The top panel shows a schematic outline of the construction of pre-processed profiles (pre-profiles) for five sequences. Pair-wise alignments are used to construct the five master-slaves alignments (pre-alignments), which are identified by the order number of their key sequence. The pre-profiles are compiled from the pre-alignments. The bottom panel shows how the pre-profiles are used to construct the final multiple alignment of the five original sequences. Note that for clarity, the pre-profiles depicted here all contain the total number of five sequences. In practice, setting the alignment score threshold value can lead to deletion of sequences from pre-processed blocks and associated pre-profiles. The profile pre-processing strategy in the PRALINE method (Heringa, 1999) is a position-specific weighting scheme aimed at incorporating into each sequence, trusted information from other sequences. For each sequence, a multiple alignment is created by stacking other sequences (master-slaves alignment) that score beyond a user-specified threshold after pair-wise alignment with the sequence considered (Fig. 1). A low threshold would result in a pre-processed alignment for each sequence comprising all other sequences (where the chance for alignment error is large), while higher thresholds would allow the information from lesser sequences into the alignment (with fewer alignment errors). The use of a cut-off value for alignment scores, rather than alignment identity or length-normalised similarity values is in agreement with the analysis of Abagyan and Batalov (1997), who showed that alignment scores allow the best discrimination between alignments of structurally related and unrelated se-

7 465 Fig. 2. Global profile pre-processing. An example is included of two pre-processed alignments for the sequences with PDB codes 2fcr and 4fxn. The sequences are taken from a data set of 13 flavodoxin sequences combined with the chey sequence (PDB code 3chy). Pre-processing was effected with a score cut-off set at zero, thus allowing all remaining sequences in each pre-profile. Note that the key (or master) sequence in each of the two blocks (2fcr and 4fxn, respectively) contain no gaps, amino acids appearing opposite gaps in the key sequences are deleted from the added (slave) sequences in the pre-processed alignments.

8 466 Fig. 3. Outline of local profile pre-processing. Shown are the first two highest-scoring local alignments with excluded associated matrix regions designated in grey: descriptions of the dark and light grey regions are given in the text. White matrix regions without alignments can either be filled with lower scoring top alignments, or remain blank if a threshold score value is used and no further local alignments score beyond the threshold. quences. For each of the thus formed pre-processed alignments, a profile is constructed (Fig. 1). The PRA- LINE method then performs progressive multiple alignment using the pre-processed profiles, where each sequence is now represented by its pre-processed profile. To do this, the pair-wise alignment step is repeated over all pre-processed profiles, after which progressive alignment takes place. The pre-processed profiles for each of the sequences incorporate knowledge about other sequences (in particular similar sequences) and comprise position-specific gap penalties. This enables increased matching of distant sequences and likely placement of gaps outside the ungapped core regions in the pre-processed profiles during progressive alignment. While more details about global profile preprocessing can be found in Heringa (1999), Fig. 2 includes an example of two pre-processed sequence blocks. These are taken from a set of 13 distant flavodoxin sequences and the chey sequence (Heringa, 1999), the latter being a signal transduction protein, which displays extremely low sequence similarity but nevertheless adopts the flavodoxin fold. The alignment score cut-off value can be specified by the user as the direct alignment score. Alternatively, it can be specified as a factor related to the aligned sequence lengths: S xl, where x is the score threshold factor and L the length of the shortest matched sequence. This renders a score threshold that is linearly related to the alignment length, which is in agreement with the observation made for global similarity scores for random sequences (for a review, see Yona and Brenner, 2000) Local profile pre-processing The selection of sequences based on their overall pair-wise alignment scores can be extended by scrutinis- ing local fragments within the sequences, and in addition to rejecting the information from low scoring sequences as can be done in global pre-processing, now also information can be discarded from sequence regions that cannot be trusted to contribute reliable information. This is achieved by a simple protocol that selects for each pre-processed profile the best sequence regions from sequences that are candidates for inclusion in the profile. The protocol is based on the local alignment algorithm of Smith and Waterman (1981). For each cell in the N M search matrix, the following function is evaluated for two sequences A= (a 1, a 2,, a N )andb=(b 1, b 2,, b M ): H[i 1, j 1]+s[a i, b j ] Max{H[i x, j 1] P(x)} H[i, j ]=Max Max{H[i 1, j y] P(y)} 0 where H[i, j ] is guaranteed to hold the maximum similarity of two segments ending in a i and b j, s[a i, b j ]isthe value for substitutions between residue types a i and b j, and P(x) is the penalty for insertion of a gap of length x. Note that to convert a local alignment routine into a global algorithm, only the zero in (Eq. (2)) needs to be discarded. For local dynamic programming, the amino acids exchange matrix used must include negative values s[a i, b j ], otherwise global alignment will occur. The local alignment algorithm thus relies on dissimilar subsequences producing negative scores, which are subsequently discarded by placing zero values in the associated search matrix cells. After calculation of the search matrix by the above procedure, the local alignment corresponding to each non-zero cell in the search matrix can be obtained by a traceback procedure. Whereas, Smith and Waterman (1981) selected the optimal local alignment by performing the traceback only on the highest scoring matrix cell H[i, j ], Waterman and Eggert (1987) modified the traceback step to include a set of top-scoring non-intersecting alignments, i.e. having no matched amino acid pair in common. However, at each step in the traceback procedure for a local alignment that is being tested for inclusion in the top-scoring list, the currently matched residue pair must be checked against all matched pairs within the local alignments contained in the top-scoring list at that moment, which is computationally demanding. A result of this approach is that for a sequence A and B being compared, there can be more than one matched residue pair within the top-scoring alignments containing either residue a i (l i N) orb j (l j M). With our pre-processed profile approach with one line for each included sequence, this would lead to ambiguities in residue pair selection. Therefore, an alternative scenario was developed to avoid this selection problem and speed up computation at the same time. In this scenario, which is (2)

9 467 outlined in Fig. 3, after performing the forward local dynamic programming step, the local alignment scores are ordered from high to low. Then the highest scoring alignment is selected and is traced back to obtain the pair-wise residue matches. The regions of the sequences corresponding to this alignment are then considered occupied by the local alignment and are effectively locked so to not allow any more alignments in this region (dark grey matrix regions in Fig. 3). Then the procedure is repeated for lower scoring local alignments that do not enter the locked matrix region. It is possible in principle to allow local alignments that would cross already accepted top alignments. For example, a local alignment j could include one sequence region positioned N-terminally of a sequence fragment within an accepted local alignment i, while the other sequence of j covers a region C-terminal of alignment i. However, since this would lead to motif inversions, which complicate the global multiple sequence alignment technique, these regions are disallowed by default in the local pre-processing procedure (light grey matrix regions in Fig. 3). This means that for each considered local alignment, the matched amino acid pairs (x, y) should be compared with all earlier accepted alignments. However, the comparison is more straightforward than in the approach by Waterman and Eggert (1987). If an earlier alignment would span residue a i to a j in sequence A and residue b k to b l in sequence B, either the condition (x a i and y b k ) should be met, or (x a j and y b l ). As a result of this local scenario carried out for each of the input sequences, the pre-processed profiles contain added sequences from which low scoring regions are deleted. The user must specify a local alignment cut-off score, so that only sequence fragments that locally align to the query sequence with a score above the cut-off value are included in the locally pre-processed profile. Fig. 4 shows an example of two locally pre-processed alignments, corresponding to those in Fig. 2 for global pre-processing. Due to the global relationship of the sequences, no sequence stretches matching middle regions of the key (or master) sequence have been discarded, but the use of information from the distant chey sequence has been clearly reduced for the two sequences (Fig. 4) Local global DDP alignment The local alignment-driven global alignment strategy, which falls in the class of motif-based weighting schemes (see above) operates in two steps (Fig. 5): First, for each possible residue match between two sequences (or sequence blocks), the score of the optimal local alignment including that match is calculated. Then, the optimal global alignment is compiled based on these local alignment scores. This two-step alignment strategy is akin to the double dynamic programming (DDP) protocol first implemented in the protein structure superpositioning algorithm SAP (Taylor and Orengo, 1989). The strategy can be viewed as a shortcut of the T-Coffee scenario described above (Notredame et al., 2000), as it ensures that the global alignment is biased towards matching local motifs, though using the local signals as soft constraints only. The strategy can be useful when local sequence similarity is suspected (for example in cases of very different sequence lengths). For each pair of sequences or sequence blocks, it uses the classical Smith and Waterman (1981) local alignment algorithm based on the dynamic programming protocol (Needleman and Wunsch, 1970). The idea is to determine for each matched pair of amino acids the score of the best local alignment over all those that include the pair. The thus obtained scores are then assigned to a search matrix as weights and subsequently subjected to a global alignment round, which resolves the values in a final global alignment that is biased towards local alignment. In the Smith Waterman algorithm, which is explained above, the local alignment corresponding to the highest scoring matrix cell H[i, j ] is determined by a traceback step. However, to meet our objective of calculating the score of the best local alignment for each matrix cell, for all or most of the N M matrix cells, the traceback step would have to be performed and the corresponding score of the best local alignment substituted in the cell considered. This would be prohibitive in the context of multiple sequence alignment. Therefore, we devised the following shortcut to obtain the alignment score for each cell in the matrix without carrying out the traceback steps as shown in Fig. 5. Local dynamic programming is performed in forward and in backward direction of the sequences. Then, the values of the resulting two DP search matrices are added for each cell with subtraction of the local score s[a i, b j ] to avoid double counting of the local substitution value due to two times applying Eq. (2) for each cell [i, j ]. After this operation, each cell in the resulting matrix shows the score of its best corresponding local alignment. The thus obtained weights for each amino acid exchange of the two sequences are subjected to a second round of dynamic programming, this time to find the optimal global alignment (Gotoh, 1982) based on the local alignment scores (Fig. 5, step 2). A number of operations are available in the PRALINE method to scale the raw scores resulting from the local alignment step before the second global alignment step is effected. These include converting each score in the matrix to the z-score derived from the values in its corresponding row and column; converting the weights to logarithmic values; and adding the normalised weights to the residue exchange value corresponding to the search matrix cells.

10 468 Fig. 4. Local profile pre-processing. An example is included of two pre-processed master-slaves alignments for the sequences with PDB codes 2fcr and 4fxn, corresponding to those in Fig. 2. The local alignment cut-off score was set at zero, so that all sequences are represented in each of the two pre-processed blocks and short local fragments are selected for some of the sequences. Deleted sequence regions are indicated by dots to discern them from gap regions within accepted local top alignments.

11 4.5. Consistency-based alignment iteration 469 The globally or locally pre-processed profiles can be used to derive consistency scores for each amino acid in the final multiple alignment, reflecting the consistency among the pair-wise alignments included in all the pre-processed profiles. The consistency scores are based on the extent to which the residue appears at the same alignment position across its corresponding identical sequences in the other pre-processed alignments (Fig. 1). The number of sequences allowed into the profiles during global pre-processing and the number of fragments selected during local pre-processing (Section 5) influences the consistency scores: the higher the cut-off values, the less corresponding identical amino acids there are within other pre-processed profiles, which in turn will tend to increase the consistency scores as fewer but more similar sequences or sequence fragments are being used. The consistency scores can be used in an iterative strategy, designed to give preference to consistent alignment regions in subsequent alignment rounds (Fig. 6, top panel: consistency iteration). This is achieved by using the reliability scores as residue weights in the alignments performed in the next iteration round (Heringa, 1999). Another possible iterative strategy updates the aligned sequences in the pre-processed alignments based on their placement in the multiple alignment of a preceding alignment round (Fig. 6, bottom panel: pre-profile update iteration). Although likely, iteration is not guaranteed to lead to convergence. The three possible outcomes of iteration are: (i) convergence to a single multiple alignment; (ii) limit cycle behaviour, where iteration after a number of iterations reaches an alignment that has been generated before, and Fig. 6. Iteration of pre-profile multiple alignment. The top panel outlines schematically the consistency iteration protocol. The upward arrows at the right hand side indicate how the positional consistency scores for each multiply aligned sequence are copied into a vector for each pre-processed profile. The alignment can then be refined during subsequent iterations using the weights given in the profile vectors in each DP alignment during the progressive alignment protocol (downward arrow). After each alignment iteration, the consistency scores are recalculated and the profile weight vectors updated. The bottom panel depicts the pre-profile update iteration protocol. At each iteration, the pre-profiles are updated by using the multiple alignment resulting from the preceding iteration to position each sequence included in every pre-alignment under its key sequence. For each thus updated pre-alignment, a new pre-profile is constructed and a new round of progressive alignment is initiated. Fig. 5. Local global DDP. (1) Summation of local DP search matrices resulting from forward and reversed local DP. For this step, a residue exchange matrix, and a gap open and extension penalty are required (M+P o,e ). (2) Second step of global DP to find the highest-scoring global alignment based on all best local alignment scores. No residue exchange matrix is needed for this step, and gap opening and extension penalty are set to zero (nomorp o,e ). For details, see text. then goes on repeating the iterative steps in between the two identical alignments; and (iii) divergence. The latter is declared when no conversion or limit cycle behaviour is reached. With the PRALINE method, the maximum number of iterations must be specified, but within this number of iterations, both convergence and limit-cycle behaviour is traced, upon which iteration is terminated. An additional option is to select, from all alignments constructed during iteration, the one with the highest SP score (see above). This option, called SP selection, is designed to control those iterations where alignments would wander away in alignment space to lower SP scores.

12 470 Fig. 7. Flavodoxin-cheY normalised multiple alignment Sum-of-Pairs (SP) scores versus total number of sequences included in the 14 pre-processed profiles resulting from varying the pair-wise alignment cut-off scores. The solid line (with triangles) denotes the SP scores, while the dashed line (squares) designates the number of identical pairs encountered in the alignments. Both the SP scores and the number of identical pairs scores have been normalised to the maximum found, multiplied by Tuning and evaluating the alignment strategies 5.1. Reference database Evaluation tests were performed using the BAliBASE multiple alignment benchmark set (Thompson et al., 1999b) as the standard of truth. The BAliBASE alignments are placed in five different categories, which are supposed to cover most of the problems the alignment engines are faced with: (1) alignments containing equidistant sequences; (2) alignments with a single orphan sequence; (3) alignments comprising two distant groups; (4) alignments containing long deletions; and (5) alignments containing long insertions Tuning threshold alues for profile pre-processed alignment Can alignment score cut-off values be found that are generally optimal for global or local pre-processing? Fig. 7 shows, taking the flavodoxin chey data set as an example, the relation between the pair-wise alignment cut-off scores for global pre-processing and the SP scores (for single alignment) of each of the alignments produced. The alignments were made for a set of distant flavodoxin sequences and the extremely distant chey sequence (Heringa, 1999), a signal transduction protein which nonetheless adopts the flavodoxin fold. The figure shows that the SP scores of the alignments do not display a uniform behaviour: A similar pattern is found for the number of identical pairs found in each of the alignments (Fig. 7). The variation of the SP scores is slightly over 2% and that of the identities about 6%. The flavodoxin-chey alignment example illustrates that it is unlikely to derive an easy optimisation protocol for setting a fixed cut-off value for global pre-processing in individual alignments. However, Fig. 8 shows the pair-wise alignment scores of all possible sequence pairs in each of the BAliBASE alignments versus the length of the shortest sequence in each of the pairs, displaying a linear relationship. Consequently, global pre-processing was tested with alignment score threshold values specified linearly related to the sequence length: S xl, where S is the alignment score, x the specified cut-off value and L the length of the shortest sequence. This means that the higher the value of x, the higher the alignment score S needs to be in order for the target sequence to be included in the pre-processed profile. The same holds for the threshold value for local profile pre-processing, although to a lesser extent, because here a more gradual effect between the threshold value and inclusion of sequence fragments is observed (data not shown). Moreover, local similarity scores are known not to grow linearly

13 471 Fig. 8. BAliBASE pair-wise alignment scores versus the minimum sequence length of each sequence pair, aligned using the BLOSUM62 amino acid exchange matrix and penalties of 12 and 1 for gap opening and extension, respectively. The slope of the regression line is with the sequence length, but with the logarithm of the product of the sequence lengths (Karlin and Altschul, 1990) Measuring alignment accuracy Over a total of 144 BAliBASE reference alignments (version of July 2000), alignments generated by the PRALINE method were compared using column scoring (see above), i.e. alignment columns of the target alignments were compared with those in the corresponding reference alignments and were only taken as correct if columns were identical. Following Notredame et al. (2000), the column scores were evaluated only over alignment regions that were deemed to be reliable by the BAliBASE curators, who defined for each BALiBASE alignment the trusted regions. The PRALINE strategies were evaluated using a 500 MHz Pentium III cluster under the Linux operating system. We tested global and local profile pre-processing over the 144 BAliBASE alignments, and also ran the PRALINE consistency iteration protocol, with and without SP selection (see above) Accuracy of global pre-processing Table 1 shows the accuracy numbers generated under a number of score cut-off values for global pre-processing. It is clear that global profile pre-processing increases the quality of the alignments produced, compared to the PRALINE default setting without pre-processing, a maximum gain of 3.5% (weighted average) and 6.5% (unweighted average) for S L 9.5, where S is the alignment score and L the length of the shortest sequence in the pair-wise alignment. The greatest variation in accuracy percentages is observed for the BAliBASE categories 4 and 5 (long deletions and insertions, respectively). The weighted average accuracy is further increased by 2.5% when iteration is applied. The best result is obtained using iteration and SP selection for S L 9, resulting in 65.35% weighted and unweighted accuracy. SP selection enhances the accuracy in all categories except for S L 9.5, which shows a decline in accuracy of nearly 3% for BAliBASE category 4. Overall, the best scores for each of the categories was reached when iteration was performed, except for category 5, where both S 0 and S L 9.5 attain the highest accuracy of 76.12%. Also, iteration for category 5 results in lower accuracy values for all tested cases. Although, SP selection improves the iteration by 6.5% or more, it does not reach the accuracy values observed for the non-iterative runs Accuracy of local pre-processing Local pre-processing was tested by varying the alignment cut-off score of the local alignments, results of

PROTEIN MULTIPLE ALIGNMENT MOTIVATION: BACKGROUND: Marina Sirota

PROTEIN MULTIPLE ALIGNMENT MOTIVATION: BACKGROUND: Marina Sirota Marina Sirota MOTIVATION: PROTEIN MULTIPLE ALIGNMENT To study evolution on the genetic level across a wide range of organisms, biologists need accurate tools for multiple sequence alignment of protein

More information

CISC 889 Bioinformatics (Spring 2003) Multiple Sequence Alignment

CISC 889 Bioinformatics (Spring 2003) Multiple Sequence Alignment CISC 889 Bioinformatics (Spring 2003) Multiple Sequence Alignment Courtesy of jalview 1 Motivations Collective statistic Protein families Identification and representation of conserved sequence features

More information

Comparison of Phylogenetic Trees of Multiple Protein Sequence Alignment Methods

Comparison of Phylogenetic Trees of Multiple Protein Sequence Alignment Methods Comparison of Phylogenetic Trees of Multiple Protein Sequence Alignment Methods Khaddouja Boujenfa, Nadia Essoussi, and Mohamed Limam International Science Index, Computer and Information Engineering waset.org/publication/482

More information

Lecture 5: Multiple sequence alignment

Lecture 5: Multiple sequence alignment Lecture 5: Multiple sequence alignment Introduction to Computational Biology Teresa Przytycka, PhD (with some additions by Martin Vingron) Why do we need multiple sequence alignment Pairwise sequence alignment

More information

An Analysis of Pairwise Sequence Alignment Algorithm Complexities: Needleman-Wunsch, Smith-Waterman, FASTA, BLAST and Gapped BLAST

An Analysis of Pairwise Sequence Alignment Algorithm Complexities: Needleman-Wunsch, Smith-Waterman, FASTA, BLAST and Gapped BLAST An Analysis of Pairwise Sequence Alignment Algorithm Complexities: Needleman-Wunsch, Smith-Waterman, FASTA, BLAST and Gapped BLAST Alexander Chan 5075504 Biochemistry 218 Final Project An Analysis of Pairwise

More information

3.4 Multiple sequence alignment

3.4 Multiple sequence alignment 3.4 Multiple sequence alignment Why produce a multiple sequence alignment? Using more than two sequences results in a more convincing alignment by revealing conserved regions in ALL of the sequences Aligned

More information

Profiles and Multiple Alignments. COMP 571 Luay Nakhleh, Rice University

Profiles and Multiple Alignments. COMP 571 Luay Nakhleh, Rice University Profiles and Multiple Alignments COMP 571 Luay Nakhleh, Rice University Outline Profiles and sequence logos Profile hidden Markov models Aligning profiles Multiple sequence alignment by gradual sequence

More information

Dynamic Programming User Manual v1.0 Anton E. Weisstein, Truman State University Aug. 19, 2014

Dynamic Programming User Manual v1.0 Anton E. Weisstein, Truman State University Aug. 19, 2014 Dynamic Programming User Manual v1.0 Anton E. Weisstein, Truman State University Aug. 19, 2014 Dynamic programming is a group of mathematical methods used to sequentially split a complicated problem into

More information

C E N T R. Introduction to bioinformatics 2007 E B I O I N F O R M A T I C S V U F O R I N T. Lecture 13 G R A T I V. Iterative homology searching,

C E N T R. Introduction to bioinformatics 2007 E B I O I N F O R M A T I C S V U F O R I N T. Lecture 13 G R A T I V. Iterative homology searching, C E N T R E F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U Introduction to bioinformatics 2007 Lecture 13 Iterative homology searching, PSI (Position Specific Iterated) BLAST basic idea use

More information

Multiple sequence alignment. November 20, 2018

Multiple sequence alignment. November 20, 2018 Multiple sequence alignment November 20, 2018 Why do multiple alignment? Gain insight into evolutionary history Can assess time of divergence by looking at the number of mutations needed to change one

More information

Biology 644: Bioinformatics

Biology 644: Bioinformatics Find the best alignment between 2 sequences with lengths n and m, respectively Best alignment is very dependent upon the substitution matrix and gap penalties The Global Alignment Problem tries to find

More information

Chapter 6. Multiple sequence alignment (week 10)

Chapter 6. Multiple sequence alignment (week 10) Course organization Introduction ( Week 1,2) Part I: Algorithms for Sequence Analysis (Week 1-11) Chapter 1-3, Models and theories» Probability theory and Statistics (Week 3)» Algorithm complexity analysis

More information

Chapter 8 Multiple sequence alignment. Chaochun Wei Spring 2018

Chapter 8 Multiple sequence alignment. Chaochun Wei Spring 2018 1896 1920 1987 2006 Chapter 8 Multiple sequence alignment Chaochun Wei Spring 2018 Contents 1. Reading materials 2. Multiple sequence alignment basic algorithms and tools how to improve multiple alignment

More information

.. Fall 2011 CSC 570: Bioinformatics Alexander Dekhtyar..

.. Fall 2011 CSC 570: Bioinformatics Alexander Dekhtyar.. .. Fall 2011 CSC 570: Bioinformatics Alexander Dekhtyar.. PAM and BLOSUM Matrices Prepared by: Jason Banich and Chris Hoover Background As DNA sequences change and evolve, certain amino acids are more

More information

Salvador Capella-Gutiérrez, Jose M. Silla-Martínez and Toni Gabaldón

Salvador Capella-Gutiérrez, Jose M. Silla-Martínez and Toni Gabaldón trimal: a tool for automated alignment trimming in large-scale phylogenetics analyses Salvador Capella-Gutiérrez, Jose M. Silla-Martínez and Toni Gabaldón Version 1.2b Index of contents 1. General features

More information

Multiple Sequence Alignment. Mark Whitsitt - NCSA

Multiple Sequence Alignment. Mark Whitsitt - NCSA Multiple Sequence Alignment Mark Whitsitt - NCSA What is a Multiple Sequence Alignment (MA)? GMHGTVYANYAVDSSDLLLAFGVRFDDRVTGKLEAFASRAKIVHIDIDSAEIGKNKQPHV GMHGTVYANYAVEHSDLLLAFGVRFDDRVTGKLEAFASRAKIVHIDIDSAEIGKNKTPHV

More information

GLOBEX Bioinformatics (Summer 2015) Multiple Sequence Alignment

GLOBEX Bioinformatics (Summer 2015) Multiple Sequence Alignment GLOBEX Bioinformatics (Summer 2015) Multiple Sequence Alignment Scoring Dynamic Programming algorithms Heuristic algorithms CLUSTAL W Courtesy of jalview Motivations Collective (or aggregate) statistic

More information

Stephen Scott.

Stephen Scott. 1 / 33 sscott@cse.unl.edu 2 / 33 Start with a set of sequences In each column, residues are homolgous Residues occupy similar positions in 3D structure Residues diverge from a common ancestral residue

More information

Bioinformatics explained: Smith-Waterman

Bioinformatics explained: Smith-Waterman Bioinformatics Explained Bioinformatics explained: Smith-Waterman May 1, 2007 CLC bio Gustav Wieds Vej 10 8000 Aarhus C Denmark Telephone: +45 70 22 55 09 Fax: +45 70 22 55 19 www.clcbio.com info@clcbio.com

More information

Multiple sequence alignment. November 2, 2017

Multiple sequence alignment. November 2, 2017 Multiple sequence alignment November 2, 2017 Why do multiple alignment? Gain insight into evolutionary history Can assess time of divergence by looking at the number of mutations needed to change one sequence

More information

Lecture Overview. Sequence search & alignment. Searching sequence databases. Sequence Alignment & Search. Goals: Motivations:

Lecture Overview. Sequence search & alignment. Searching sequence databases. Sequence Alignment & Search. Goals: Motivations: Lecture Overview Sequence Alignment & Search Karin Verspoor, Ph.D. Faculty, Computational Bioscience Program University of Colorado School of Medicine With credit and thanks to Larry Hunter for creating

More information

BLAST, Profile, and PSI-BLAST

BLAST, Profile, and PSI-BLAST BLAST, Profile, and PSI-BLAST Jianlin Cheng, PhD School of Electrical Engineering and Computer Science University of Central Florida 26 Free for academic use Copyright @ Jianlin Cheng & original sources

More information

Sequence Alignment & Search

Sequence Alignment & Search Sequence Alignment & Search Karin Verspoor, Ph.D. Faculty, Computational Bioscience Program University of Colorado School of Medicine With credit and thanks to Larry Hunter for creating the first version

More information

Principles of Bioinformatics. BIO540/STA569/CSI660 Fall 2010

Principles of Bioinformatics. BIO540/STA569/CSI660 Fall 2010 Principles of Bioinformatics BIO540/STA569/CSI660 Fall 2010 Lecture 11 Multiple Sequence Alignment I Administrivia Administrivia The midterm examination will be Monday, October 18 th, in class. Closed

More information

Computational Molecular Biology

Computational Molecular Biology Computational Molecular Biology Erwin M. Bakker Lecture 3, mainly from material by R. Shamir [2] and H.J. Hoogeboom [4]. 1 Pairwise Sequence Alignment Biological Motivation Algorithmic Aspect Recursive

More information

In this section we describe how to extend the match refinement to the multiple case and then use T-Coffee to heuristically compute a multiple trace.

In this section we describe how to extend the match refinement to the multiple case and then use T-Coffee to heuristically compute a multiple trace. 5 Multiple Match Refinement and T-Coffee In this section we describe how to extend the match refinement to the multiple case and then use T-Coffee to heuristically compute a multiple trace. This exposition

More information

COS 551: Introduction to Computational Molecular Biology Lecture: Oct 17, 2000 Lecturer: Mona Singh Scribe: Jacob Brenner 1. Database Searching

COS 551: Introduction to Computational Molecular Biology Lecture: Oct 17, 2000 Lecturer: Mona Singh Scribe: Jacob Brenner 1. Database Searching COS 551: Introduction to Computational Molecular Biology Lecture: Oct 17, 2000 Lecturer: Mona Singh Scribe: Jacob Brenner 1 Database Searching In database search, we typically have a large sequence database

More information

Sequence analysis Pairwise sequence alignment

Sequence analysis Pairwise sequence alignment UMF11 Introduction to bioinformatics, 25 Sequence analysis Pairwise sequence alignment 1. Sequence alignment Lecturer: Marina lexandersson 12 September, 25 here are two types of sequence alignments, global

More information

A New Approach For Tree Alignment Based on Local Re-Optimization

A New Approach For Tree Alignment Based on Local Re-Optimization A New Approach For Tree Alignment Based on Local Re-Optimization Feng Yue and Jijun Tang Department of Computer Science and Engineering University of South Carolina Columbia, SC 29063, USA yuef, jtang

More information

Sequence alignment theory and applications Session 3: BLAST algorithm

Sequence alignment theory and applications Session 3: BLAST algorithm Sequence alignment theory and applications Session 3: BLAST algorithm Introduction to Bioinformatics online course : IBT Sonal Henson Learning Objectives Understand the principles of the BLAST algorithm

More information

Algorithmic Approaches for Biological Data, Lecture #20

Algorithmic Approaches for Biological Data, Lecture #20 Algorithmic Approaches for Biological Data, Lecture #20 Katherine St. John City University of New York American Museum of Natural History 20 April 2016 Outline Aligning with Gaps and Substitution Matrices

More information

JET 2 User Manual 1 INSTALLATION 2 EXECUTION AND FUNCTIONALITIES. 1.1 Download. 1.2 System requirements. 1.3 How to install JET 2

JET 2 User Manual 1 INSTALLATION 2 EXECUTION AND FUNCTIONALITIES. 1.1 Download. 1.2 System requirements. 1.3 How to install JET 2 JET 2 User Manual 1 INSTALLATION 1.1 Download The JET 2 package is available at www.lcqb.upmc.fr/jet2. 1.2 System requirements JET 2 runs on Linux or Mac OS X. The program requires some external tools

More information

Bioinformatics for Biologists

Bioinformatics for Biologists Bioinformatics for Biologists Sequence Analysis: Part I. Pairwise alignment and database searching Fran Lewitter, Ph.D. Director Bioinformatics & Research Computing Whitehead Institute Topics to Cover

More information

Biochemistry 324 Bioinformatics. Multiple Sequence Alignment (MSA)

Biochemistry 324 Bioinformatics. Multiple Sequence Alignment (MSA) Biochemistry 324 Bioinformatics Multiple Sequence Alignment (MSA) Big- Οh notation Greek omicron symbol Ο The Big-Oh notation indicates the complexity of an algorithm in terms of execution speed and storage

More information

Distributed Protein Sequence Alignment

Distributed Protein Sequence Alignment Distributed Protein Sequence Alignment ABSTRACT J. Michael Meehan meehan@wwu.edu James Hearne hearne@wwu.edu Given the explosive growth of biological sequence databases and the computational complexity

More information

Biologically significant sequence alignments using Boltzmann probabilities

Biologically significant sequence alignments using Boltzmann probabilities Biologically significant sequence alignments using Boltzmann probabilities P Clote Department of Biology, Boston College Gasson Hall 16, Chestnut Hill MA 0267 clote@bcedu Abstract In this paper, we give

More information

Two Phase Evolutionary Method for Multiple Sequence Alignments

Two Phase Evolutionary Method for Multiple Sequence Alignments The First International Symposium on Optimization and Systems Biology (OSB 07) Beijing, China, August 8 10, 2007 Copyright 2007 ORSC & APORC pp. 309 323 Two Phase Evolutionary Method for Multiple Sequence

More information

Computational Genomics and Molecular Biology, Fall

Computational Genomics and Molecular Biology, Fall Computational Genomics and Molecular Biology, Fall 2015 1 Sequence Alignment Dannie Durand Pairwise Sequence Alignment The goal of pairwise sequence alignment is to establish a correspondence between the

More information

AlignMe Manual. Version 1.1. Rene Staritzbichler, Marcus Stamm, Kamil Khafizov and Lucy R. Forrest

AlignMe Manual. Version 1.1. Rene Staritzbichler, Marcus Stamm, Kamil Khafizov and Lucy R. Forrest AlignMe Manual Version 1.1 Rene Staritzbichler, Marcus Stamm, Kamil Khafizov and Lucy R. Forrest Max Planck Institute of Biophysics Frankfurt am Main 60438 Germany 1) Introduction...3 2) Using AlignMe

More information

The Effect of Inverse Document Frequency Weights on Indexed Sequence Retrieval. Kevin C. O'Kane. Department of Computer Science

The Effect of Inverse Document Frequency Weights on Indexed Sequence Retrieval. Kevin C. O'Kane. Department of Computer Science The Effect of Inverse Document Frequency Weights on Indexed Sequence Retrieval Kevin C. O'Kane Department of Computer Science The University of Northern Iowa Cedar Falls, Iowa okane@cs.uni.edu http://www.cs.uni.edu/~okane

More information

Bioinformatics explained: BLAST. March 8, 2007

Bioinformatics explained: BLAST. March 8, 2007 Bioinformatics Explained Bioinformatics explained: BLAST March 8, 2007 CLC bio Gustav Wieds Vej 10 8000 Aarhus C Denmark Telephone: +45 70 22 55 09 Fax: +45 70 22 55 19 www.clcbio.com info@clcbio.com Bioinformatics

More information

FASTA. Besides that, FASTA package provides SSEARCH, an implementation of the optimal Smith- Waterman algorithm.

FASTA. Besides that, FASTA package provides SSEARCH, an implementation of the optimal Smith- Waterman algorithm. FASTA INTRODUCTION Definition (by David J. Lipman and William R. Pearson in 1985) - Compares a sequence of protein to another sequence or database of a protein, or a sequence of DNA to another sequence

More information

Basic Local Alignment Search Tool (BLAST)

Basic Local Alignment Search Tool (BLAST) BLAST 26.04.2018 Basic Local Alignment Search Tool (BLAST) BLAST (Altshul-1990) is an heuristic Pairwise Alignment composed by six-steps that search for local similarities. The most used access point to

More information

PoMSA: An Efficient and Precise Position-based Multiple Sequence Alignment Technique

PoMSA: An Efficient and Precise Position-based Multiple Sequence Alignment Technique PoMSA: An Efficient and Precise Position-based Multiple Sequence Alignment Technique Sara Shehab 1,a, Sameh Shohdy 1,b, Arabi E. Keshk 1,c 1 Department of Computer Science, Faculty of Computers and Information,

More information

Today s Lecture. Multiple sequence alignment. Improved scoring of pairwise alignments. Affine gap penalties Profiles

Today s Lecture. Multiple sequence alignment. Improved scoring of pairwise alignments. Affine gap penalties Profiles Today s Lecture Multiple sequence alignment Improved scoring of pairwise alignments Affine gap penalties Profiles 1 The Edit Graph for a Pair of Sequences G A C G T T G A A T G A C C C A C A T G A C G

More information

Jyoti Lakhani 1, Ajay Khunteta 2, Dharmesh Harwani *3 1 Poornima University, Jaipur & Maharaja Ganga Singh University, Bikaner, Rajasthan, India

Jyoti Lakhani 1, Ajay Khunteta 2, Dharmesh Harwani *3 1 Poornima University, Jaipur & Maharaja Ganga Singh University, Bikaner, Rajasthan, India International Journal of Scientific Research in Computer Science, Engineering and Information Technology 2017 IJSRCSEIT Volume 2 Issue 6 ISSN : 2456-3307 Improvisation of Global Pairwise Sequence Alignment

More information

As of August 15, 2008, GenBank contained bases from reported sequences. The search procedure should be

As of August 15, 2008, GenBank contained bases from reported sequences. The search procedure should be 48 Bioinformatics I, WS 09-10, S. Henz (script by D. Huson) November 26, 2009 4 BLAST and BLAT Outline of the chapter: 1. Heuristics for the pairwise local alignment of two sequences 2. BLAST: search and

More information

Sequence Alignment. part 2

Sequence Alignment. part 2 Sequence Alignment part 2 Dynamic programming with more realistic scoring scheme Using the same initial sequences, we ll look at a dynamic programming example with a scoring scheme that selects for matches

More information

BLAST MCDB 187. Friday, February 8, 13

BLAST MCDB 187. Friday, February 8, 13 BLAST MCDB 187 BLAST Basic Local Alignment Sequence Tool Uses shortcut to compute alignments of a sequence against a database very quickly Typically takes about a minute to align a sequence against a database

More information

Recent Research Results. Evolutionary Trees Distance Methods

Recent Research Results. Evolutionary Trees Distance Methods Recent Research Results Evolutionary Trees Distance Methods Indo-European Languages After Tandy Warnow What is the purpose? Understand evolutionary history (relationship between species). Uderstand how

More information

Multiple Sequence Alignment (MSA)

Multiple Sequence Alignment (MSA) I519 Introduction to Bioinformatics, Fall 2013 Multiple Sequence Alignment (MSA) Yuzhen Ye (yye@indiana.edu) School of Informatics & Computing, IUB Outline Multiple sequence alignment (MSA) Generalize

More information

Compares a sequence of protein to another sequence or database of a protein, or a sequence of DNA to another sequence or library of DNA.

Compares a sequence of protein to another sequence or database of a protein, or a sequence of DNA to another sequence or library of DNA. Compares a sequence of protein to another sequence or database of a protein, or a sequence of DNA to another sequence or library of DNA. Fasta is used to compare a protein or DNA sequence to all of the

More information

Comparison and Evaluation of Multiple Sequence Alignment Tools In Bininformatics

Comparison and Evaluation of Multiple Sequence Alignment Tools In Bininformatics IJCSNS International Journal of Computer Science and Network Security, VOL.9 No.7, July 2009 51 Comparison and Evaluation of Multiple Sequence Alignment Tools In Bininformatics Asieh Sedaghatinia, Dr Rodziah

More information

BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha

BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio. 1990. CS 466 Saurabh Sinha Motivation Sequence homology to a known protein suggest function of newly sequenced protein Bioinformatics

More information

Multiple Sequence Alignment Based on Profile Alignment of Intermediate Sequences

Multiple Sequence Alignment Based on Profile Alignment of Intermediate Sequences Multiple Sequence Alignment Based on Profile Alignment of Intermediate Sequences Yue Lu and Sing-Hoi Sze RECOMB 2007 Presented by: Wanxing Xu March 6, 2008 Content Biology Motivation Computation Problem

More information

1. R. Durbin, S. Eddy, A. Krogh und G. Mitchison: Biological sequence analysis, Cambridge, 1998

1. R. Durbin, S. Eddy, A. Krogh und G. Mitchison: Biological sequence analysis, Cambridge, 1998 7 Multiple Sequence Alignment The exposition was prepared by Clemens Gröpl, based on earlier versions by Daniel Huson, Knut Reinert, and Gunnar Klau. It is based on the following sources, which are all

More information

FastA & the chaining problem

FastA & the chaining problem FastA & the chaining problem We will discuss: Heuristics used by the FastA program for sequence alignment Chaining problem 1 Sources for this lecture: Lectures by Volker Heun, Daniel Huson and Knut Reinert,

More information

FastA and the chaining problem, Gunnar Klau, December 1, 2005, 10:

FastA and the chaining problem, Gunnar Klau, December 1, 2005, 10: FastA and the chaining problem, Gunnar Klau, December 1, 2005, 10:56 4001 4 FastA and the chaining problem We will discuss: Heuristics used by the FastA program for sequence alignment Chaining problem

More information

Comparison of Sequence Similarity Measures for Distant Evolutionary Relationships

Comparison of Sequence Similarity Measures for Distant Evolutionary Relationships Comparison of Sequence Similarity Measures for Distant Evolutionary Relationships Abhishek Majumdar, Peter Z. Revesz Department of Computer Science and Engineering, University of Nebraska-Lincoln, Lincoln,

More information

Gene regulation. DNA is merely the blueprint Shared spatially (among all tissues) and temporally But cells manage to differentiate

Gene regulation. DNA is merely the blueprint Shared spatially (among all tissues) and temporally But cells manage to differentiate Gene regulation DNA is merely the blueprint Shared spatially (among all tissues) and temporally But cells manage to differentiate Especially but not only during developmental stage And cells respond to

More information

Sequence Alignment (chapter 6) p The biological problem p Global alignment p Local alignment p Multiple alignment

Sequence Alignment (chapter 6) p The biological problem p Global alignment p Local alignment p Multiple alignment Sequence lignment (chapter 6) p The biological problem p lobal alignment p Local alignment p Multiple alignment Local alignment: rationale p Otherwise dissimilar proteins may have local regions of similarity

More information

Multiple Sequence Alignment: Multidimensional. Biological Motivation

Multiple Sequence Alignment: Multidimensional. Biological Motivation Multiple Sequence Alignment: Multidimensional Dynamic Programming Boston University Biological Motivation Compare a new sequence with the sequences in a protein family. Proteins can be categorized into

More information

Research Article Aligning Sequences by Minimum Description Length

Research Article Aligning Sequences by Minimum Description Length Hindawi Publishing Corporation EURASIP Journal on Bioinformatics and Systems Biology Volume 2007, Article ID 72936, 14 pages doi:10.1155/2007/72936 Research Article Aligning Sequences by Minimum Description

More information

HIDDEN MARKOV MODELS AND SEQUENCE ALIGNMENT

HIDDEN MARKOV MODELS AND SEQUENCE ALIGNMENT HIDDEN MARKOV MODELS AND SEQUENCE ALIGNMENT - Swarbhanu Chatterjee. Hidden Markov models are a sophisticated and flexible statistical tool for the study of protein models. Using HMMs to analyze proteins

More information

Representation is Everything: Towards Efficient and Adaptable Similarity Measures for Biological Data

Representation is Everything: Towards Efficient and Adaptable Similarity Measures for Biological Data Representation is Everything: Towards Efficient and Adaptable Similarity Measures for Biological Data Charu C. Aggarwal IBM T. J. Watson Research Center charu@us.ibm.com Abstract Distance function computation

More information

Similarity Searches on Sequence Databases

Similarity Searches on Sequence Databases Similarity Searches on Sequence Databases Lorenza Bordoli Swiss Institute of Bioinformatics EMBnet Course, Zürich, October 2004 Swiss Institute of Bioinformatics Swiss EMBnet node Outline Importance of

More information

Data Mining Technologies for Bioinformatics Sequences

Data Mining Technologies for Bioinformatics Sequences Data Mining Technologies for Bioinformatics Sequences Deepak Garg Computer Science and Engineering Department Thapar Institute of Engineering & Tecnology, Patiala Abstract Main tool used for sequence alignment

More information

BIOL591: Introduction to Bioinformatics Alignment of pairs of sequences

BIOL591: Introduction to Bioinformatics Alignment of pairs of sequences BIOL591: Introduction to Bioinformatics Alignment of pairs of sequences Reading in text (Mount Bioinformatics): I must confess that the treatment in Mount of sequence alignment does not seem to me a model

More information

Pairwise Sequence Alignment: Dynamic Programming Algorithms. COMP Spring 2015 Luay Nakhleh, Rice University

Pairwise Sequence Alignment: Dynamic Programming Algorithms. COMP Spring 2015 Luay Nakhleh, Rice University Pairwise Sequence Alignment: Dynamic Programming Algorithms COMP 571 - Spring 2015 Luay Nakhleh, Rice University DP Algorithms for Pairwise Alignment The number of all possible pairwise alignments (if

More information

Lecture 10. Sequence alignments

Lecture 10. Sequence alignments Lecture 10 Sequence alignments Alignment algorithms: Overview Given a scoring system, we need to have an algorithm for finding an optimal alignment for a pair of sequences. We want to maximize the score

More information

Pairwise Sequence Alignment: Dynamic Programming Algorithms COMP 571 Luay Nakhleh, Rice University

Pairwise Sequence Alignment: Dynamic Programming Algorithms COMP 571 Luay Nakhleh, Rice University 1 Pairwise Sequence Alignment: Dynamic Programming Algorithms COMP 571 Luay Nakhleh, Rice University DP Algorithms for Pairwise Alignment 2 The number of all possible pairwise alignments (if gaps are allowed)

More information

Lectures by Volker Heun, Daniel Huson and Knut Reinert, in particular last years lectures

Lectures by Volker Heun, Daniel Huson and Knut Reinert, in particular last years lectures 4 FastA and the chaining problem We will discuss: Heuristics used by the FastA program for sequence alignment Chaining problem 4.1 Sources for this lecture Lectures by Volker Heun, Daniel Huson and Knut

More information

1. R. Durbin, S. Eddy, A. Krogh und G. Mitchison: Biological sequence analysis, Cambridge, 1998

1. R. Durbin, S. Eddy, A. Krogh und G. Mitchison: Biological sequence analysis, Cambridge, 1998 7 Multiple Sequence Alignment The exposition was prepared by Clemens GrÃP pl, based on earlier versions by Daniel Huson, Knut Reinert, and Gunnar Klau. It is based on the following sources, which are all

More information

Sequence alignment algorithms

Sequence alignment algorithms Sequence alignment algorithms Bas E. Dutilh Systems Biology: Bioinformatic Data Analysis Utrecht University, February 23 rd 27 After this lecture, you can decide when to use local and global sequence alignments

More information

Chapter 2 Basic Structure of High-Dimensional Spaces

Chapter 2 Basic Structure of High-Dimensional Spaces Chapter 2 Basic Structure of High-Dimensional Spaces Data is naturally represented geometrically by associating each record with a point in the space spanned by the attributes. This idea, although simple,

More information

BLAST & Genome assembly

BLAST & Genome assembly BLAST & Genome assembly Solon P. Pissis Tomáš Flouri Heidelberg Institute for Theoretical Studies May 15, 2014 1 BLAST What is BLAST? The algorithm 2 Genome assembly De novo assembly Mapping assembly 3

More information

3. Cluster analysis Overview

3. Cluster analysis Overview Université Laval Multivariate analysis - February 2006 1 3.1. Overview 3. Cluster analysis Clustering requires the recognition of discontinuous subsets in an environment that is sometimes discrete (as

More information

A Coprocessor Architecture for Fast Protein Structure Prediction

A Coprocessor Architecture for Fast Protein Structure Prediction A Coprocessor Architecture for Fast Protein Structure Prediction M. Marolia, R. Khoja, T. Acharya, C. Chakrabarti Department of Electrical Engineering Arizona State University, Tempe, USA. Abstract Predicting

More information

Multiple Sequence Alignment II

Multiple Sequence Alignment II Multiple Sequence Alignment II Lectures 20 Dec 5, 2011 CSE 527 Computational Biology, Fall 2011 Instructor: Su-In Lee TA: Christopher Miles Monday & Wednesday 12:00-1:20 Johnson Hall (JHN) 022 1 Outline

More information

Unsupervised learning in Vision

Unsupervised learning in Vision Chapter 7 Unsupervised learning in Vision The fields of Computer Vision and Machine Learning complement each other in a very natural way: the aim of the former is to extract useful information from visual

More information

OPEN MP-BASED PARALLEL AND SCALABLE GENETIC SEQUENCE ALIGNMENT

OPEN MP-BASED PARALLEL AND SCALABLE GENETIC SEQUENCE ALIGNMENT OPEN MP-BASED PARALLEL AND SCALABLE GENETIC SEQUENCE ALIGNMENT Asif Ali Khan*, Laiq Hassan*, Salim Ullah* ABSTRACT: In bioinformatics, sequence alignment is a common and insistent task. Biologists align

More information

Small Libraries of Protein Fragments through Clustering

Small Libraries of Protein Fragments through Clustering Small Libraries of Protein Fragments through Clustering Varun Ganapathi Department of Computer Science Stanford University June 8, 2005 Abstract When trying to extract information from the information

More information

EECS730: Introduction to Bioinformatics

EECS730: Introduction to Bioinformatics EECS730: Introduction to Bioinformatics Lecture 06: Multiple Sequence Alignment https://upload.wikimedia.org/wikipedia/commons/thumb/7/79/rplp0_90_clustalw_aln.gif/575px-rplp0_90_clustalw_aln.gif Slides

More information

24 Grundlagen der Bioinformatik, SS 10, D. Huson, April 26, This lecture is based on the following papers, which are all recommended reading:

24 Grundlagen der Bioinformatik, SS 10, D. Huson, April 26, This lecture is based on the following papers, which are all recommended reading: 24 Grundlagen der Bioinformatik, SS 10, D. Huson, April 26, 2010 3 BLAST and FASTA This lecture is based on the following papers, which are all recommended reading: D.J. Lipman and W.R. Pearson, Rapid

More information

A Layer-Based Approach to Multiple Sequences Alignment

A Layer-Based Approach to Multiple Sequences Alignment A Layer-Based Approach to Multiple Sequences Alignment Tianwei JIANG and Weichuan YU Laboratory for Bioinformatics and Computational Biology, Department of Electronic and Computer Engineering, The Hong

More information

Gibbs Sampling. Stephen F. Altschul. National Center for Biotechnology Information National Library of Medicine National Institutes of Health

Gibbs Sampling. Stephen F. Altschul. National Center for Biotechnology Information National Library of Medicine National Institutes of Health Gibbs Sampling Stephen F. Altschul National Center for Biotechnology Information National Library of Medicine National Institutes of Health Optimization in High-Dimensional Space Smooth and simple landscapes

More information

CAP BLAST. BIOINFORMATICS Su-Shing Chen CISE. 8/20/2005 Su-Shing Chen, CISE 1

CAP BLAST. BIOINFORMATICS Su-Shing Chen CISE. 8/20/2005 Su-Shing Chen, CISE 1 CAP 5510-6 BLAST BIOINFORMATICS Su-Shing Chen CISE 8/20/2005 Su-Shing Chen, CISE 1 BLAST Basic Local Alignment Prof Search Su-Shing Chen Tool A Fast Pair-wise Alignment and Database Searching Tool 8/20/2005

More information

Today s Lecture. Edit graph & alignment algorithms. Local vs global Computational complexity of pairwise alignment Multiple sequence alignment

Today s Lecture. Edit graph & alignment algorithms. Local vs global Computational complexity of pairwise alignment Multiple sequence alignment Today s Lecture Edit graph & alignment algorithms Smith-Waterman algorithm Needleman-Wunsch algorithm Local vs global Computational complexity of pairwise alignment Multiple sequence alignment 1 Sequence

More information

Sequence clustering. Introduction. Clustering basics. Hierarchical clustering

Sequence clustering. Introduction. Clustering basics. Hierarchical clustering Sequence clustering Introduction Data clustering is one of the key tools used in various incarnations of data-mining - trying to make sense of large datasets. It is, thus, natural to ask whether clustering

More information

From Smith-Waterman to BLAST

From Smith-Waterman to BLAST From Smith-Waterman to BLAST Jeremy Buhler July 23, 2015 Smith-Waterman is the fundamental tool that we use to decide how similar two sequences are. Isn t that all that BLAST does? In principle, it is

More information

Scoring and heuristic methods for sequence alignment CG 17

Scoring and heuristic methods for sequence alignment CG 17 Scoring and heuristic methods for sequence alignment CG 17 Amino Acid Substitution Matrices Used to score alignments. Reflect evolution of sequences. Unitary Matrix: M ij = 1 i=j { 0 o/w Genetic Code Matrix:

More information

An Efficient Algorithm to Locate All Locally Optimal Alignments Between Two Sequences Allowing for Gaps

An Efficient Algorithm to Locate All Locally Optimal Alignments Between Two Sequences Allowing for Gaps An Efficient Algorithm to Locate All Locally Optimal Alignments Between Two Sequences Allowing for Gaps Geoffrey J. Barton Laboratory of Molecular Biophysics University of Oxford Rex Richards Building

More information

Heuristic methods for pairwise alignment:

Heuristic methods for pairwise alignment: Bi03c_1 Unit 03c: Heuristic methods for pairwise alignment: k-tuple-methods k-tuple-methods for alignment of pairs of sequences Bi03c_2 dynamic programming is too slow for large databases Use heuristic

More information

Sequence alignment is an essential concept for bioinformatics, as most of our data analysis and interpretation techniques make use of it.

Sequence alignment is an essential concept for bioinformatics, as most of our data analysis and interpretation techniques make use of it. Sequence Alignments Overview Sequence alignment is an essential concept for bioinformatics, as most of our data analysis and interpretation techniques make use of it. Sequence alignment means arranging

More information

BLAST - Basic Local Alignment Search Tool

BLAST - Basic Local Alignment Search Tool Lecture for ic Bioinformatics (DD2450) April 11, 2013 Searching 1. Input: Query Sequence 2. Database of sequences 3. Subject Sequence(s) 4. Output: High Segment Pairs (HSPs) Sequence Similarity Measures:

More information

INVESTIGATION STUDY: AN INTENSIVE ANALYSIS FOR MSA LEADING METHODS

INVESTIGATION STUDY: AN INTENSIVE ANALYSIS FOR MSA LEADING METHODS INVESTIGATION STUDY: AN INTENSIVE ANALYSIS FOR MSA LEADING METHODS MUHANNAD A. ABU-HASHEM, NUR'AINI ABDUL RASHID, ROSNI ABDULLAH, AWSAN A. HASAN AND ATHEER A. ABDULRAZZAQ School of Computer Sciences, Universiti

More information

Giri Narasimhan. CAP 5510: Introduction to Bioinformatics. ECS 254; Phone: x3748

Giri Narasimhan. CAP 5510: Introduction to Bioinformatics. ECS 254; Phone: x3748 CAP 5510: Introduction to Bioinformatics Giri Narasimhan ECS 254; Phone: x3748 giri@cis.fiu.edu www.cis.fiu.edu/~giri/teach/bioinfs07.html 1/30/07 CAP5510 1 BLAST & FASTA FASTA [Lipman, Pearson 85, 88]

More information

DI TRANSFORM. The regressive analyses. identify relationships

DI TRANSFORM. The regressive analyses. identify relationships July 2, 2015 DI TRANSFORM MVstats TM Algorithm Overview Summary The DI Transform Multivariate Statistics (MVstats TM ) package includes five algorithm options that operate on most types of geologic, geophysical,

More information

Multiple alignment. Multiple alignment. Computational complexity. Multiple sequence alignment. Consensus. 2 1 sec sec sec

Multiple alignment. Multiple alignment. Computational complexity. Multiple sequence alignment. Consensus. 2 1 sec sec sec Introduction to Bioinformatics Iosif Vaisman Email: ivaisman@gmu.edu VTISCTGSSSNIGAG-NHVKWYQQLPG VTISCTGTSSNIGS--ITVNWYQQLPG LRLSCSSSGFIFSS--YAMYWVRQAPG LSLTCTVSGTSFDD--YYSTWVRQPPG PEVTCVVVDVSHEDPQVKFNWYVDG--

More information

Basics of Multiple Sequence Alignment

Basics of Multiple Sequence Alignment Basics of Multiple Sequence Alignment Tandy Warnow February 10, 2018 Basics of Multiple Sequence Alignment Tandy Warnow Basic issues What is a multiple sequence alignment? Evolutionary processes operating

More information