MULTIPLE SEQUENCE ALIGNMENT SOLUTIONS AND APPLICATIONS


MULTIPLE SEQUENCE ALIGNMENT SOLUTIONS AND APPLICATIONS

By

XU ZHANG

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

© 2007 Xu Zhang

To my family, and to all who nurtured my intellectual curiosity, academic interests, and sense of scholarship throughout my lifetime, making this milestone possible

ACKNOWLEDGEMENTS

This dissertation would not have been possible without the support of many people. Many thanks to my adviser, Tamer Kahveci, who worked with me on our research and read my numerous revisions. Thanks also to my committee members, Alin Dobra, Arunava Banerjee, Christopher M. Jermaine, and Kevin M. Folta, who offered guidance and support. Thanks to Amit Dhingra for cooperating with me and offering much help on the MAPPIT project. Finally, thanks to my parents and numerous friends who endured this long process with me, always offering support and love.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
ABSTRACT

1 INTRODUCTION

2 BACKGROUND
   Measurements of Multiple Sequence Alignment
   Dynamic Programming Methods
   Heuristic Methods
   Optimizing Existing Alignments Methods
   Approximation Algorithms
      Our Methods vs. Approximation Methods
         What do approximatable and non-approximatable mean?
         Why do approximation algorithms not work for multiple sequence alignment applications?
         Why do our algorithms work?
      Overview of Approximation Algorithms for Multiple Sequence Alignment
      Hardness Results
         NP-completeness and MAX-SNP-hardness of multiple sequence alignment

3 OPTIMIZATION OF SP SCORE FOR MULTIPLE SEQUENCE ALIGNMENT IN GIVEN TIME
   Motivation and Problem Definition
   Current Results
   Constructing Initial Alignment
   Improving the SP Score via Local Optimizations
   QOMA and Optimality
   Improved Algorithm: Sparse Graph
   Experimental Evaluation

4 OPTIMIZING THE ALIGNMENT OF MANY SEQUENCES
   Motivation and Problem Definition
   Current Results
   Aligning a Window
      4.3.1 Constructing Initial Graph
      Clustering
      Refining Clusters Iteratively
      Aligning the Subsequences in Clusters
   Complexity of QOMA
   Experimental Evaluation

5 IMPROVING BIOLOGICAL RELEVANCE OF MULTIPLE SEQUENCE ALIGNMENT
   Motivation and Problem Definition
   Current Results
   Constructing Initial Graph
   Grouping Fragments
   Fragment Position Adjustment
   Alignment Gap Adjustment
   Experimental Results

6 MODULE FOR AMPLIFICATION OF PLASTOMES BY PRIMER IDENTIFICATION
   Motivation and Problem Definition
   Related Work
   Current Results
   Finding Primer Candidates
      Multiple sequence alignment-based primer identification
      Motif-based primer identification
   Finding Minimum Primer Pair Set
   Evaluating Primer Pairs
   Experimental Evaluation
      Quality Evaluation
      Performance Comparison
      Wet-lab Verification

7 CONCLUSION

REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

3-1 The average SP scores of QOMA using a complete K-partite graph
The average SP scores of QOMA and five other tools
The improvement of QOMA
The average (µ) and standard deviation (σ) of the error, S_SP, for a window using the sparse version of QOMA
The running time of QOMA (in seconds)
The list of variables used in this chapter
The average SW and SP scores of individual windows
The average SP scores of QOMA2 for individual windows
The average SP scores of the alignments of the entire benchmarks
The average SP scores of QOMA2 and other tools
The BAliBASE score of HSA and other tools (less than 25% identity)
The BAliBASE score of HSA and other tools (20%-40% identity)
The BAliBASE score of HSA and other tools (more than 35% identity)
The SP score of HSA and other tools
The running time of HSA and other tools (in milliseconds)
Comparison of Primer3 and using multiple sequence alignment in step
Comparison of using different sources of alignment
Comparison of multiple sequence alignment-based methods and motif-based methods in step
Effects of the number of reference sequences
Eight randomly selected primer pairs

LIST OF FIGURES

1-1 An example of multiple sequence alignment
An example showing the meaninglessness of alignments with approximation ratio less than
An example of different alignments with the same SP score
Constructing the initial alignment by strategy
QOMA finds optimal alignment inside window
Sparse K-partite graph
An example of using K-partite graph
The SP scores of QOMA alignments
Alignment strategies at a high level
Comparison of the SP score found by different strategies
The distribution of the number of benchmarks with different number of sequences (K)
The initial graph constructed
The fragments with similar features are grouped together
A gap vertex is inserted
Cliques found are the columns
Gaps are moved
Example of primer pairs on target sequence
An example of computing the SP score of multiple sequence alignment
An example of matching primers with translocations
Selection of next forward primer from current reverse primer
Polymerase chain reaction samples

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

MULTIPLE SEQUENCE ALIGNMENT SOLUTIONS AND APPLICATIONS

By Xu Zhang

December 2007

Chair: Tamer Kahveci
Major: Computer Engineering

Bioinformatics is a field in which computer science is used to assist biological research. In this area, multiple sequence alignment is one of the most fundamental problems. A multiple sequence alignment is an alignment of three or more sequences. Multiple sequence alignment is widely used in many applications such as protein structure prediction, phylogenetic analysis, identification of conserved motifs, protein classification, gene prediction, and genome primer identification. A central challenge in this research area is finding the multiple sequence alignment that maximizes the SP (Sum-of-Pairs) score, which is an NP-complete problem. Furthermore, finding an alignment that is biologically meaningful is not trivial, since the SP score may not reflect biological significance. This thesis addresses these problems. More specifically, we consider four problems. First, we develop an efficient algorithm to optimize the SP score of a multiple sequence alignment. Second, we extend this algorithm to handle a large number of sequences. Third, we apply secondary structure information of residues to build a biologically meaningful alignment. Finally, we describe a strategy that employs the alignment of multiple sequences to identify primers for a given target genome.

CHAPTER 1
INTRODUCTION

Bioinformatics is the intersection of molecular biology and computer science; it can be viewed as a branch of biology that uses computers to help answer biological questions. One of the fundamental research areas in bioinformatics is multiple sequence alignment. A multiple sequence alignment is an alignment of more than two sequences. An example of a multiple sequence alignment is shown in Figure 1-1. The alignment is part of a whole alignment selected from the BAliBASE benchmark database [1, 2]. Multiple sequence alignment is widely used in many applications such as protein structure prediction [3], phylogenetic analysis [4], identification of conserved motifs [5], protein classification [6], gene prediction [7-9], and genome primer identification [10]. The following are some examples of these applications.

Application 1. Identification of conserved motifs and domains

One important application of multiple sequence alignments is to identify conserved motifs and domains. Motifs are conserved regions or structures in protein or DNA families. They tend to be preserved during evolution [11]. For related proteins, their motifs exhibit similar structures and functions. Within a multiple alignment, motifs can be identified as columns with more conservation than their surroundings. Analyzed together with experimental data, motifs can be a very important characterization of sequences of unknown function. This principle leads to many important applications in bioinformatics. Some important databases, such as PROSITE [12] and PRINTS [13], are built on this principle. Another type of method uses a profile [14] or a hidden Markov model (HMM) [15] to identify motifs. These methods work well when a motif is too subtle to be defined via a standard pattern.
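As a small illustration of the profile idea, per-column residue frequencies can be derived from an alignment as follows. This is a minimal sketch; real profile methods such as PSI-BLAST use position-specific log-odds weights rather than raw frequencies, and the example alignment is invented for illustration.

```python
from collections import Counter

def column_profile(alignment):
    """Per-column residue frequencies of a multiple sequence alignment.

    `alignment` is a list of equal-length aligned strings; '-' marks a gap.
    Columns where one residue dominates hint at conserved motif positions.
    """
    length = len(alignment[0])
    profile = []
    for i in range(length):
        column = [seq[i] for seq in alignment]
        counts = Counter(c for c in column if c != '-')
        total = sum(counts.values())
        profile.append({res: n / total for res, n in counts.items()} if total else {})
    return profile

aln = ["ACQA", "ACQG", "A-QA"]
prof = column_profile(aln)
# Column 0 is fully conserved: prof[0] == {'A': 1.0}
```

A new sequence can then be scored against such a profile column by column, which is the intuition behind profile searches identifying distant family members.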
When searching a database, profiles and HMMs can identify distant members of a protein family, providing much higher sensitivity and specificity than a single sequence or a single pattern can. In practice, users

can create their own profiles from multiple sequence alignments by using tools such as PFTOOLS [16], by using pre-established collections like Pfam [17], or by computing profiles on the fly using PSI-BLAST [18], the position-specific version of BLAST.

[Figure 1-1. An example of multiple sequence alignment. Sequences are subsequences selected from the BAliBASE database; the figure shows aligned subsequences of the entries 1thx, thio_thife, thio_strcl, thio_rhoru, thio_myctu, thio_gripa, thio_rhosh, txla_synp7, 1kte, 1grx, and 2trcp.]

Application 2. Protein Family Classification

Given a family of homologous protein sequences, how can we know whether a new sequence S belongs to the family? One answer would be to align S to the multiple alignment of the sequences of the family, then find common motifs between them [19, 20]. Here, motifs are aligned ungapped segments of the most highly conserved protein regions in the multiple sequence alignment. By comparing the motifs in the multiple sequence alignment with the unknown sequence S, we can measure how similar the alignment and S are, and then assess the likelihood of the target sequence's classification.

Application 3. Sequence Assembly

Multiple sequence alignment can be used in DNA sequencing and primer identification [21-25]. In shotgun sequencing, multiple sequence alignment plays a very important role [26]. Assume we are given a set of genomic reads in a shotgun sequencing project; these read fragments are highly similar, and hence easy to align.
The multiple sequence alignment of the reads can construct the footprint of the main backbone of the original sequence, thus easing the work of recognizing the whole sequence from the reads. If high-quality reads are used, the target sequence can be rebuilt directly from the consensus sequence of the multiple sequence alignment of the reads.
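The consensus construction described here can be sketched as a simple majority vote over alignment columns. This is an illustrative simplification of what real assemblers do (they also weigh base qualities); the example reads are invented.

```python
from collections import Counter

def consensus(alignment):
    """Majority-vote consensus of aligned reads; '-' marks a gap.

    A column where the gap wins the vote is dropped from the consensus,
    mimicking how a backbone sequence is recovered from a read alignment.
    """
    result = []
    for column in zip(*alignment):
        base, _ = Counter(column).most_common(1)[0]
        if base != '-':
            result.append(base)
    return ''.join(result)

reads = ["ACGT-A", "ACGTTA", "AC-TTA"]
# consensus(reads) == "ACGTTA"
```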

Given two sequences P_i and P_j, we denote the score of their alignment as Score(P_i, P_j). It can be computed as

Score(P_i, P_j) = Σ_{1<=k<=N} c(P_{i,k}, P_{j,k}),

where N is the length of the alignment, P_{x,k} is the kth character of P_x, and c(x, y) is the score of matching x and y. Here x or y can be a gap, which represents an insertion or a deletion. Finding the multiple sequence alignment that maximizes the SP (Sum-of-Pairs) score is an NP-complete problem [27]. Here, the SP score of an alignment A of sequences P_1, P_2, ..., P_K is computed by adding the alignment scores of all induced pairwise alignments. It can be expressed as

SP(A) = Σ_{i<j} Score(P_i, P_j),

where K is the number of sequences, P_i is the sequence indexed by i, and Score(P_i, P_j) is the score of the alignment of P_i and P_j induced by A. The alignment of two sequences with maximum score can be found in O(N^2) time using dynamic programming [28], where N is the length of the sequences. This algorithm can be extended to align K sequences, but requires O(N^K) time [29, 30]. A variety of heuristic algorithms have been developed to overcome this difficulty [1]. Most of them are based on progressive application of pairwise alignment: they build up alignments of larger numbers of sequences by adding sequences one by one to the existing alignment [31]. These methods have the shortcoming that the order in which sequences are added to the existing alignment significantly affects the quality of the resulting alignment. This thesis focuses on the problems of SP score optimization and sequence order dependence. We provide solutions based on a divide-and-conquer strategy and also an application for the prediction of genome primers. The contributions of this thesis are as follows:

Contribution 1: Given a fixed time budget, we aim to maximize the SP score for a moderate number (3-10) of sequences within this time.
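As a concrete companion to the Score and SP definitions above, a minimal SP-score computation can be sketched as follows. The cost values are illustrative assumptions, not the scoring scheme used in this thesis.

```python
def pair_score(a, b, c):
    """Score of the pairwise alignment induced by aligned rows a and b.

    Columns where both rows carry a gap are skipped, since such columns
    vanish in the induced pairwise alignment.
    """
    return sum(c(x, y) for x, y in zip(a, b) if not (x == '-' and y == '-'))

def sp_score(alignment, c):
    """Sum-of-Pairs score: induced pairwise scores over all row pairs i < j."""
    k = len(alignment)
    return sum(pair_score(alignment[i], alignment[j], c)
               for i in range(k) for j in range(i + 1, k))

# Toy cost: +1 match, -1 mismatch, -2 against a gap (illustrative values).
def c(x, y):
    if x == '-' or y == '-':
        return -2
    return 1 if x == y else -1

aln = ["AC-T", "ACGT", "A-GT"]
# sp_score(aln, c) == 0 with this toy cost
```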
The optimization of the SP score for a multiple sequence alignment requires O(N^K) time, which makes exact optimization impractical. We consider this optimization problem and provide a solution for constructing an alignment. This solution

can result in an alignment that converges to the optimal alignment while keeping a practical running time. We develop an algorithm, called QOMA, to address this problem. QOMA takes an initial alignment, then optimizes the alignment within a window of limited size selected from the alignment. It finds the optimal alignment of the window with respect to the SP score and replaces the window contents with that optimal alignment. We develop theory to justify the claim that QOMA finds alignments that converge to globally SP-optimal alignments as the size of the sliding window increases. The experimental results also agree with this claim.

Contribution 2: Given a large number of protein sequences, we aim to maximize the SP (Sum-of-Pairs) score.

The QOMA (Quasi-Optimal Multiple Alignment) algorithm addresses this problem when the number of sequences is small. However, as the number of sequences increases, QOMA becomes impractical. This thesis develops a new algorithm, QOMA2, which optimizes the SP score of the alignment of an arbitrarily large number of sequences. Given an initial (potentially sub-optimal) alignment, QOMA2 selects short subsequences from this alignment by placing a window on it. It quickly estimates the amount of improvement that can be obtained by optimizing the alignment of the subsequences in short windows on this alignment. This estimate is called the SW (Sum of Weights) score. It employs a dynamic programming algorithm that selects the set of window positions with the largest total expected improvement. It partitions the subsequences within each window into clusters such that the number of subsequences in each cluster is small enough to be optimally aligned within a given time. Also, it aims to select these clusters so that the optimal alignment of the subsequences in these clusters produces the highest expected SP score.

Contribution 3: We aim to construct a biologically meaningful alignment from multiple sequences.
We consider this problem together with the sequence order dependence problem. Our solution is to apply secondary structure information of residues when aligning the protein sequences. In this method, we first group residues in the sequences based on their primary

types and secondary structures, and adjust their positions according to the groups. We then slide a window over the adjusted sequences, align the residues in the window, and replace the window with the resulting alignment. We construct the final alignment by concatenating the alignments obtained from the sliding window. This method achieved a higher SP score than any other tool we selected for comparison.

Contribution 4: We apply multiple sequences to assist genome sequencing.

This is a new problem motivated by new DNA sequencing techniques (see project ASAP [32]). In DNA sequencing, plastid sequencing throughput can be increased by amplifying the isolated plastid DNA using rolling circle amplification (RCA) [33]. However, obtaining sequence through RCA requires this intermediate step. Recently, the ASAP method showed that sequence information could be gathered by creating templates from plastid DNA based on conserved regions of plastid genes. To expand this technique to an entire chloroplast genome, an efficient method is required to facilitate primer selection. More importantly, such a method will allow the selected primer set to be updated as new plastid sequences become available. Our method is named MAPPIT. MAPPIT uses the genes of related species to assist in predicting unknown genes. MAPPIT takes as input existing gene sequences that are closely related to the gene to be predicted, extracts information from the given gene sequences, and constructs primer pairs. The goal is to find primer pairs that cover as much of the unknown gene as possible while keeping the number of pairs as small as possible. MAPPIT uses two different strategies for constructing primer candidates: a multiple sequence alignment-based method and a motif-based method. The experimental results showed that the primer pairs found by MAPPIT greatly assisted the prediction of unknown genomes. The rest of this thesis is organized as follows: Chapter 2 discusses related work on multiple sequence alignment.
Chapter 3 addresses an algorithm for optimizing the SP score of a multiple sequence alignment in a given time. Chapter 4 introduces an algorithm for aligning many sequences, with the goal of optimizing the SP score.

Chapter 5 presents an algorithm for improving the biological relevance of multiple sequence alignment by applying secondary structure information. Chapter 6 introduces an application: a module for amplification of plastomes by primer identification. Chapter 7 presents the conclusions of our work.

CHAPTER 2
BACKGROUND

Multiple sequence alignment [34, 35] of protein sequences is one of the most fundamental problems in computational biology. It is an alignment of three or more protein sequences. Multiple sequence alignment is widely used in many applications such as protein structure prediction [3], phylogenetic analysis [4], identification of conserved motifs and domains [5], gene prediction [7-9], and protein classification [6].

2.1 Measurements of Multiple Sequence Alignment

There are several different ways to assess a multiple sequence alignment [36]. One common method is to score a multiple alignment according to a mathematical model. We define the cost of the multiple sequence alignment A of K sequences as

Σ_{1<=i<=l} c(P_1(i), P_2(i), ..., P_K(i)),

where l is the length of the alignment, P_j(i) is the ith letter in the sequence P_j, j = 1, 2, ..., K, and c(P_1(i), P_2(i), ..., P_K(i)) is the cost of the ith column [37]. The column cost is

c(P_1(i), P_2(i), ..., P_K(i)) = Σ_{1<=p<q<=K} c(P_p(i), P_q(i)),

where c(P_p(i), P_q(i)) is the cost of the two letters P_p(i) and P_q(i) in the column. This column cost function is called the Sum-of-Pairs (or SP) cost. The SP alignment model is widely used in applications such as finding conserved regions, and has been researched extensively [38-44]. In SP alignment, we assume all sequences are equally related to each other, so all pairs of sequences are assigned the same weight. In our later discussion, we will focus on the SP model. There are also other optimization models in this group, such as consensus alignment and tree alignment [29, 40-42, 45-50]. The key difference among these models is how they formulate their column cost functions [37]. For all models in this type of measurement, the cost scheme used should reflect the probabilities of evolutionary events, including substitution, insertion, and deletion. So it is important to choose

appropriate cost schemes for pairs of letters. For protein sequences, the PAM and BLOSUM matrices are the most widely used [51, 52]. For DNA sequences, a simple match/mismatch cost scheme is often used. We can also use more sophisticated cost schemes such as transition/transversion costs [53] and DNA PAM matrices. Throughout this section, we use c() as the column cost function and c(x, y) as the pairwise cost function, which measures the dissimilarity between a pair of letters or spaces x and y. We use '-' to denote a space and Σ to denote the set of letters that form the input sequences. Another type of measurement compares an alignment with a reference alignment. The BAliBASE score [5, 54] is the most widely used of this type. Given a gold-standard alignment A', this measure evaluates how similar the alignments A and A' are. The BAliBASE score is commonly used in the literature as an alternative to the SP score; however, the BAliBASE score can only be computed for sets of sequences for which the gold standard is known. In contrast, the SP score can be computed for any set of sequences. Most of the existing methods aim to maximize a linear variation of the SP score by weighting the sequences (or subsequences) in order to converge to the BAliBASE score for known benchmarks [1, 2]. This chapter focuses on optimizing the SP score, which is computationally equivalent to the weighted versions in the literature. The problem of finding appropriate weights to make the SP score converge to the BAliBASE score is orthogonal to this chapter and should be considered separately.

2.2 Dynamic Programming Methods

Dynamic programming methods were first developed for the multiple string matching problem. The multiple sequence alignment problem can be viewed as a multiple string matching problem [55-58] and can likewise be solved optimally with dynamic programming.
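The two-sequence dynamic program at the core of these methods can be sketched as follows. This is a minimal, score-only version of the standard Needleman-Wunsch recurrence in O(N^2) time and O(N) space; the scoring values are illustrative assumptions, not a thesis-specific scheme.

```python
def nw_score(s, t, match=1, mismatch=-1, gap=-2):
    """Score of an optimal global alignment of s and t (Needleman-Wunsch).

    Keeps only one row of the dynamic-programming table, so it runs in
    O(len(s) * len(t)) time and O(len(t)) space.
    """
    m, n = len(s), len(t)
    # dp[j] holds the best score of aligning the current prefix of s with t[:j].
    dp = [j * gap for j in range(n + 1)]
    for i in range(1, m + 1):
        prev_diag, dp[0] = dp[0], i * gap
        for j in range(1, n + 1):
            sub = prev_diag + (match if s[i - 1] == t[j - 1] else mismatch)
            prev_diag = dp[j]
            dp[j] = max(sub,            # substitution/match
                        dp[j] + gap,    # gap in t
                        dp[j - 1] + gap)  # gap in s
    return dp[n]

# nw_score("ACGT", "AGT") == 1  (align ACGT with A-GT: 1 - 2 + 1 + 1)
```

Extending the same table to K sequences gives a K-dimensional table with N^K cells, which is the source of the O(N^K) bound discussed next.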
Given a table of scores for matches and mismatches between all amino acids, and penalties for insertions or deletions, the optimal alignment of two sequences can be determined using dynamic programming (DP). The time and space complexity of this method is O(N^2) [28, 59, 60], where N is the length of each sequence. This algorithm can be extended to align

K sequences, but requires O(N^K) time [29, 30]. Indeed, finding the multiple sequence alignment that maximizes the SP (Sum-of-Pairs) score is an NP-complete problem [27]. There are a few methods that aim to optimize the alignment by running dynamic programming on all sequences simultaneously. MSA is the representative of this class [61]. DCA extends MSA by utilizing a divide-and-conquer strategy [47]. Unlike progressive methods, DCA divides the sequences recursively until they are shorter than a given threshold. DCA then uses MSA to find the optimal solutions for the smaller problems. The performance of DCA depends on how it divides the sequences. DCA uses a cut strategy that minimizes additional costs [62] and uses the longest sequence in the input as the reference to select the cut positions. DCA does not guarantee an optimal solution. The selection of the longest sequence makes DCA order dependent, as there is no justification that this selection (or any other selection) optimizes the SP score of the alignment. In contrast, our methods in this thesis are order independent. However, MSA, DCA, and other algorithms that maximize the SP score suffer from high computational expense [1].

2.3 Heuristic Methods

A variety of heuristic algorithms have been developed to overcome the computational expense of dynamic programming methods [1]. These heuristic methods also provide solutions for aligning large sequences, which dynamic programming is unable to process due to memory limitations [63-69]. These heuristic methods can be classified into four groups [70]: progressive, iterative, anchor-based, and probabilistic. They all have the drawback that they do not provide a flexible quality/time trade-off. Progressive methods find a multiple alignment by iteratively picking two sequences or profiles from the set and replacing them with their alignment (i.e., consensus sequence) until all sequences are aligned into a single consensus sequence.
Thus, progressive methods guarantee that never more than two sequences or profiles are aligned simultaneously. The order of selecting sequences or profiles is determined by a pre-created guide tree or

a clustering algorithm [71]. This approach is sufficiently fast to allow alignments of almost any size. The common shortcoming of these methods is that the resulting alignment depends on the order of aligning the sequences. ClustalW [1], T-COFFEE [2], Treealign [72], POA [45, 73, 74], and MAFFT [75] can be grouped into this class [76]. ClustalW [1, 77] is currently the most commonly used multiple sequence alignment program. ClustalW includes the following features to produce biologically meaningful multiple sequence alignments. 1) According to a pre-computed guide tree, each input sequence is assigned a weight during the alignment process, so that more similar sequences get less weight and divergent sequences get more weight. 2) According to the divergence of the sequences to be aligned, different amino acid substitution matrices are used at different alignment stages. 3) Gap penalties prefer extending existing gaps to opening new ones. Therefore, gaps are encouraged to occur in loop regions instead of in highly structured regions such as alpha helices and beta sheets. The biological rationale is that divergence is less likely in highly structured regions, which are commonly very important to the fold and function of a protein. For similar reasons, to discourage the opening of new gaps near existing ones, existing gaps are assigned locally reduced gap penalties. T-COFFEE [2] is a progressive approach based on consistency. It is one of the most accurate programs available for multiple sequence alignment. T-COFFEE avoids the most serious drawback caused by the greedy nature of the progressive algorithm. T-COFFEE first aligns all sequences pairwise, and then uses the alignment information to guide the progressive alignment. T-COFFEE creates intermediate alignments based on the sequences to be aligned next and on how all of the sequences align to each other.
MAFFT [75] provides a set of multiple alignment methods and is used on Unix-like operating systems. MAFFT includes two new techniques: identifying motif regions quickly, and using a simplified scoring system. The first technique is based on the fast Fourier transform (FFT). It converts an amino acid sequence into a sequence of

volume and polarity values of each amino acid residue. The second technique reduces CPU time and increases the accuracy of alignments. It works well even when sequences have a large number of insertions or extensions, or when sequences of similar length are distantly related. MAFFT implements the iterative refinement method in addition to the progressive method. The POA [45] program does not use generalized profiles during the progressive alignment process. Instead, it introduces a partial-order multiple sequence alignment format to represent sequences. POA allows alignable regions to be extended, and allows longer alignments between closely related sequences and shorter alignments for the entire set of sequences. Iterative methods start with an initial alignment. They then repeatedly refine this alignment through a series of iterations until no more improvements can be made. Iterative methods do not provide a flexible quality/time trade-off, and they cannot fix the mismatches of the previous alignment during an iteration. MUSCLE [78] can be grouped into this class as well as into the progressive class, since it uses a progressive alignment at each iteration. MUSCLE [78] applies many techniques, such as fast distance estimation using k-mer counting, progressive alignment using a new profile function called the log-expectation score, and refinement using tree-dependent restricted partitioning. At the time it was proposed, it achieved the best accuracy, but since it was relatively slow, MUSCLE was not widely used. Anchor-based methods first identify local motifs (short common subsequences) as anchors. Then, the unaligned regions between consecutive anchors are aligned using other techniques. In general, anchor-based methods follow a divide-and-conquer strategy [79]. This group includes several methods designed to rapidly detect anchors [80-82]. DIALIGN [83, 84], Align-m [46], L-align [85], Mavid [86], and PRRP [87] belong to this class.
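The fast distance estimation by k-mer counting mentioned above for MUSCLE can be sketched as a simple shared-k-mer fraction. The exact formula here is an illustrative assumption; MUSCLE's own definition differs in detail.

```python
from collections import Counter

def kmer_distance(s, t, k=3):
    """Fraction of k-mers not shared between s and t.

    A crude stand-in for alignment-free distance estimation: 0.0 for
    identical sequences, approaching 1.0 for unrelated ones. Such cheap
    distances are used to build a guide tree without aligning anything.
    """
    ks = Counter(s[i:i + k] for i in range(len(s) - k + 1))
    kt = Counter(t[i:i + k] for i in range(len(t) - k + 1))
    shared = sum((ks & kt).values())
    smaller = min(sum(ks.values()), sum(kt.values()))
    return 1.0 - shared / smaller if smaller else 1.0

# kmer_distance("ACGTACGT", "ACGTACGT") == 0.0
```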

The DIALIGN program implements a local alignment approach to construct multiple alignments. It uses comparisons based on segments instead of individual residues. It then integrates the segments identified as anchors into a multiple alignment using an iterative procedure. DIALIGN treats a column as either alignable or non-alignable. The Align-m [46] program uses a non-progressive local approach to guide a global alignment. It constructs a set of pairwise alignments guided by consistency. It performs well on divergent sequences; the drawback is that it runs slowly. The PRRP program uses a randomized iterative strategy. It progressively optimizes a global alignment by iteratively dividing the sequences into two groups, and realigns the groups globally using a group-based alignment algorithm. Probabilistic methods first compute substitution probabilities from known multiple alignments. They then use these probabilities to maximize the substitution probabilities for a given set of sequences. Especially for divergent sequences, these consistency-based methods often have an advantage in accuracy. ProbCons [88] and HMMT [89] can be grouped into this class. ProbCons [88] introduces an approach based on consistency. It uses a probabilistic model and maximum expected accuracy scoring. According to evaluations of its performance on several standard alignment benchmark data sets, ProbCons is one of the most accurate alignment tools available today. HMMT first discovers patterns that are common to the multiple sequences and saves a description of the pattern in an HMM file. It then applies a simulated annealing method that tries to maximize the probability represented by the HMM file for the sequences to be aligned. HMMT works iteratively, computing a new multiple sequence alignment using the pattern, then a new pattern derived from that alignment.

2.4 Optimizing Existing Alignments Methods

There is also a set of alignment algorithms that aim to improve the quality of an initial alignment. Our methods, QOMA and QOMA2, can be classified in this group. Improving the quality of an initial alignment has traditionally been done manually (e.g., through programs like MaM and WebMaM [90]). Recently, RASCAL [91], REFINER [92], and ReAligner [93] have included more automatic features. QOMA and QOMA2 differ from RASCAL and REFINER in that they focus on optimizing the SP score of alignments and require only sequence information, while RASCAL is a knowledge-based approach and REFINER targets the score of core regions. ReAligner uses a round-robin algorithm and improves DNA alignments. Most existing tools have the shortcoming that they are unable to process a large number of sequences. It is appropriate to apply dynamic programming on subdivisions of alignments; jumping alignments [94] apply a similar idea. Our method QOMA2 [95] provides a solution for aligning a large number of protein sequences. In this thesis, we address the problems mentioned above: sequence order dependence, the quality/time trade-off, and handling a large number of input sequences.

2.5 Approximation Algorithms

The algorithms provided in this thesis are heuristic by nature. Heuristic algorithms can be defined as algorithms that search all possible solutions but abandon the goal of finding the optimal solution for the sake of improved running time. Heuristic algorithms usually run fast and produce good results; however, they do not guarantee an optimal solution, and there is no proof that the obtained solution is not arbitrarily bad. If we want to find the optimal solution, we can use exact algorithms. The most widely adopted exact method in multiple sequence alignment is dynamic programming.
However, dynamic programming requires a running time of O(N^K) for

aligning K sequences of length N. This running time is infeasible for large N or K. Thus, if we want to find solutions that are close to the optimal solution, want a guarantee that the result is not too bad, and also want to run in reasonable time, one alternative is to make use of approximation algorithms. Approximation algorithms are polynomial-time algorithms that guarantee, for every possible instance of a minimization problem, that the solution obtained is at most ρ times the optimal solution. Approximation algorithms for maximization problems are defined symmetrically. Approximation algorithms are often associated with NP-hard problems. Unlike heuristic algorithms, approximation algorithms have provable solution quality and provable running time bounds. Multiple sequence alignment with SP score is MAX-SNP-hard. Here a maximization problem is MAX-SNP-hard when, given a set of relations R_1, R_2, ..., R_k, a relation D, and a quantifier-free formula Φ(R_1, R_2, ..., R_k, D, v_1, v_2, ..., v_t), where each v_i is a variable, the following are satisfied [96]: 1) Given any instance I of the problem, there exists a polynomial-time algorithm that produces a set J of relations R^J_1, R^J_2, ..., R^J_k, where every R^J_i has the same arity as the relation R_i. 2) OPT(I) = max_{D^J} |{(v_1, v_2, ..., v_t) ∈ J^t : Φ(R^J_1, R^J_2, ..., R^J_k, D^J, v_1, v_2, ..., v_t) = TRUE}|, where OPT(I) is the optimal solution for instance I, D^J is a relation on J with the same arity as D, and J^t is the set of t-tuples of J. The original definition and a detailed discussion can be found in [96].

We define the performance ratio of an approximation algorithm H for a minimization problem [37, 96] as a number ρ such that for any instance I of the problem, H(I)/OPT(I) ≤ ρ, where H(I) is the cost of the solution produced by algorithm H and OPT(I) is the cost of an optimal solution for instance I. We define an approximation scheme for a minimization problem as an algorithm H that takes both an instance I and an error bound ɛ as input, and achieves the performance ratio R_H(I, ɛ) = H(I)/OPT(I) ≤ 1 + ɛ. We can view such an algorithm H as a family of algorithms {H_ɛ : ɛ > 0}, one for each error bound ɛ. We define a polynomial time approximation scheme (PTAS) as an approximation scheme {H_ɛ} in which the algorithm H_ɛ runs in time polynomial in the size of the instance I for any fixed ɛ. There are two types of problems: problems that have good approximation algorithms, and problems that are hard to approximate. PTASs belong to the first type, and a PTAS is the best we can hope for a problem to have. A MAX-SNP-hard problem, however, has little chance of having a PTAS. A more detailed discussion can be found in [37] Chapter 4. Since achieving an approximation ratio of 1 + ɛ for a MAX-SNP-hard problem is NP-hard, where ɛ > 0 is some fixed value, the approximability of a problem actually depends on the value of ɛ. For multiple sequence alignment, the best approximation algorithm has an approximation ratio of 2 − l/K for any constant l, where K is the number of sequences [39, 42, 97]. Later we will show that this approximation ratio is not appropriate for real applications of multiple sequence alignment, and we will give other reasons why approximation algorithms do not work well for multiple sequence alignment.

In this section we discuss the advantages of our algorithms over approximation algorithms. We answer two critical questions: How can we claim that our algorithms are superior to other algorithms that offer approximation guarantees? Why do we claim our algorithms are more appropriate for bioinformatics applications than approximation algorithms? In the rest of this section, we first answer the above questions and then present an overview of approximation algorithms.

Our Methods vs. Approximation Methods

In this section, we first present the concepts of approximatable and non-approximatable. We then show why approximation algorithms are not appropriate for the multiple sequence alignment problem in bioinformatics. We finally discuss why our algorithms are superior to approximation algorithms for the applications of multiple sequence alignment.

What do approximatable and non-approximatable mean?

Even when a problem is MAX-SNP-hard, it may still have good approximation algorithms that produce results with a guaranteed approximation ratio. In other words, a MAX-SNP-hard problem may still be approximatable. Recall that a MAX-SNP-hard problem is one for which achieving an approximation ratio of 1 + ɛ is NP-hard for some fixed ɛ > 0. An approximation result is guaranteed to be close to the optimal solution within an error factor. We consider a problem approximatable if it has approximation algorithms that produce solutions within a constant factor of the optimal solution, where the approximation ratio is acceptable for most applications. Otherwise, we consider it non-approximatable.

Why do approximation algorithms not work for multiple sequence alignment applications?

We will show later that the multiple sequence alignment problem is MAX-SNP-hard. This raises a question: Is the multiple sequence alignment problem approximatable or non-approximatable with respect to bioinformatics? There are already several

Figure 2-1. An example showing that alignments with an approximation ratio of less than 2 can be meaningless: (a) The optimal alignment. (b) An alignment with approximation ratio 1.5.

approximation algorithms for multiple sequence alignment [42], which can efficiently produce alignments. However, we provide three reasons why approximation algorithms are not applicable to multiple sequence alignment applications in bioinformatics. 1) The score schemes supported by approximation algorithms are metrics, while the most widely used score matrices today are not metrics. A metric cost matrix must satisfy the following conditions [98]: (C1) c(x, y) > 0 for all x ≠ y; (C2) c(x, x) = 0 for all x; (C3) c(x, y) = c(y, x); (C4) c(x, y) ≤ c(x, z) + c(z, y) for any z. Popular score matrices used today, such as BLOSUM62, are not metrics. When a general score matrix is used in an approximation algorithm, the approximation ratio is no longer guaranteed. Thus these approximation algorithms are of little use in reality. 2) An approximation ratio around 2 is too loose to be meaningful in bioinformatics, and such algorithms are almost useless in real applications. So far, the best known approximation ratio for SP alignment has been improved from 2 − 2/K to 2 − l/K for any constant l, where K is the number of sequences [39, 42, 97]. It seems impossible to reduce the ratio below 2 − o(1). This approximation ratio is not acceptable, and it renders the problem non-approximatable for biological science. Here we present a simple example. The score scheme is translated from a simple DNA match/mismatch score scheme:

c(x, x) = 3, and c(x, y) = 1 if x ≠ y. Then, given sequences A and A, two possible alignments are shown in Figure 2-1. We treat the alignment problem as a maximization problem; the first alignment is then the optimal solution, with SP score 3, and the second alignment has SP score 2. So the second alignment has an approximation ratio of 1.5. The second alignment, however, is a trivial alignment without any meaning in reality. In fact, in this example all alignments other than the optimal one have an approximation ratio of less than 2, which means that an approximation ratio of less than 2 cannot guarantee a good alignment at all. 3) These approximation algorithms do not consider the biological meaning of the resulting alignment, and they do not account for the impact of gaps. Here we provide a simple example showing that we need to consider the locations of the inserted gaps. In biological applications, it is widely accepted that a mismatch can be as bad as matching with a gap. We can design a simple score scheme as follows: c(x, y) = 1 if x ≠ y, c(x, −) = 1, c(x, x) = 2, and c(−, −) = 0. Then, given sequences A, A and A, two possible alignments are shown in Figure 2-2. From Figure 2-2, we see that both alignments have SP score 6; however, the first alignment does not actually make any sense. Thus, an approximation algorithm for multiple sequence alignment with a guaranteed approximation ratio that introduces many gaps into the resulting alignment, without considering the biological meaning of the resulting alignment, can be useless.

Why do our algorithms work?

Heuristic algorithms can adjust parameter settings, such as the weights of sequences and the score matrix, during processing, and thereby build more biologically meaningful alignments,

Figure 2-2. An example of different alignments with the same SP score: (a) An alignment with many gaps. (b) An alignment without gaps.

which is the main advantage over approximation algorithms. Other researchers have exploited this fact before. For example, ProbCons [88] obtains prior knowledge via training to guide the subsequent alignment process, and ClustalW [1, 77] adjusts the weights of profiles during the alignment process. Our programs, QOMA [99], QOMA2 [95] and HSA [100], are heuristic optimization algorithms by nature. They also provide adjustment during the alignment. Moreover, our methods are designed not only for fixed models such as the SP score, but can be extended to incorporate additional biological features.

Overview of Approximation Algorithms for Multiple Sequence Alignment

In this section, we first introduce several proven hardness results concerning approximation of multiple sequence alignment; we then present brief proofs of the NP-completeness and MAX-SNP-hardness of multiple sequence alignment with SP score.

Hardness Results

SP alignment was proved to be NP-hard [27] when a particular pairwise cost scheme is used. The cost scheme used in that proof is not a metric, since it does not satisfy the triangle inequality. Later, SP alignment was proved to be NP-hard even when the alphabet size is 2 and the pairwise cost scheme is a metric. Thus, the SP alignment problem is unlikely to be solvable in polynomial time [101].

Theorem 1 [101] SP alignment is NP-hard when the alphabet size is 2 and the cost scheme is a metric.

Theorem 2 [102] SP alignment is NP-hard even when spaces are only allowed to be inserted at the two ends of the sequences, using the pairwise cost scheme where a match costs 0 and a mismatch costs 1.

Theorem 3 [103] Tree alignment is NP-hard even when the given phylogeny tree is a binary tree.

Theorem 4 [104] Consensus alignment is NP-hard when the alphabet size is 4, using the cost scheme where a match costs 0 and a mismatch costs 1.

Theorem 5 [27, 103] Consensus alignment is MAX-SNP-hard when the pairwise cost scheme is arbitrary.

NP-completeness and MAX-SNP-hardness of multiple sequence alignment

In this section, we first show the NP-completeness of multiple sequence alignment with SP score. Then we show the MAX-SNP-hardness of multiple sequence alignment.

Theorem 6 [27] Multiple sequence alignment with SP score is NP-complete. Proof: The original proof was given in [27]. The basic idea is to show that the multiple sequence alignment problem is equivalent to the shortest common supersequence problem, which is known to be NP-complete even when the alphabet size is 2 [105].

Theorem 7 [106] There exists a score matrix B such that the multiple sequence alignment problem for B is MAX-SNP-hard, even when spaces are only allowed to be inserted at the two ends of the sequences. Proof: The original proof was given in [106] and used L-reductions. Here we can simplify the proof and use a gap-preserving reduction [96]. We prove the theorem by showing that there is a gap-preserving reduction from the SIMPLE MAX-CUT(Z) problem of size k to the maximization problem of gap-0-1 multiple sequence alignment with SP score. It has been proved that SIMPLE MAX-CUT(Z) is MAX-SNP-complete for some positive integer Z; in fact, Z = 3 works [107]. We then show that an optimal solution of the gap-0-1 multiple sequence alignment with SP score problem exactly defines the optimal

solution of the SIMPLE MAX-CUT(Z) problem of size k, and vice versa. We then conclude that gap-0-1 multiple sequence alignment with SP score is MAX-SNP-hard. Since this restricted gap-0-1 version of multiple sequence alignment is MAX-SNP-hard, the general case of multiple sequence alignment is also MAX-SNP-hard. This ends our proof.
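The toy score schemes used in the examples above are small enough to check mechanically. The sketch below (our own illustrative Python, not part of any tool discussed in this thesis) verifies the metric conditions (C1)-(C4) for a pairwise cost scheme and recomputes the SP scores of the two alignments of Figure 2-2 under the second scheme:

```python
def is_metric(cost, symbols):
    """Check the metric conditions (C1)-(C4) for a pairwise cost scheme,
    given as a dict mapping (x, y) pairs to costs."""
    for x in symbols:
        if cost[(x, x)] != 0:                      # (C2) identity costs zero
            return False
        for y in symbols:
            if x != y and cost[(x, y)] <= 0:       # (C1) positivity
                return False
            if cost[(x, y)] != cost[(y, x)]:       # (C3) symmetry
                return False
            for z in symbols:                      # (C4) triangle inequality
                if cost[(x, y)] > cost[(x, z)] + cost[(z, y)]:
                    return False
    return True

def sp_score(rows, score):
    """Sum-of-pairs score of a gap-padded alignment (equal-length strings),
    summing pairwise column scores over all sequence pairs."""
    total = 0
    for i in range(len(rows)):
        for j in range(i + 1, len(rows)):
            for x, y in zip(rows[i], rows[j]):
                total += score(x, y)
    return total

# Unit match/mismatch costs form a metric; distorting one entry breaks (C4).
syms = "ABC"
unit = {(x, y): (0 if x == y else 1) for x in syms for y in syms}
bad = dict(unit)
bad[("A", "B")] = bad[("B", "A")] = 5  # hypothetical distortion
print(is_metric(unit, syms), is_metric(bad, syms))   # True False

# The second example's scheme: match 2, mismatch 1, letter-gap 1, gap-gap 0.
def s(x, y):
    if x == "-" and y == "-":
        return 0
    if x == "-" or y == "-":
        return 1
    return 2 if x == y else 1

print(sp_score(["A--", "-A-", "--A"], s))   # 6, Figure 2-2(a): many gaps
print(sp_score(["A", "A", "A"], s))         # 6, Figure 2-2(b): no gaps
```

As the last two lines show, the SP score alone cannot distinguish the trivial, gap-filled alignment from the meaningful one.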

CHAPTER 3
OPTIMIZATION OF SP SCORE FOR MULTIPLE SEQUENCE ALIGNMENT IN GIVEN TIME

In this chapter, we consider the problem of multiple alignment of protein sequences with the goal of achieving a large SP (Sum-of-Pairs) score. We introduce a new graph-based method, which we name QOMA (Quasi-Optimal Multiple Alignment). QOMA starts with an initial alignment. It represents this alignment using a K-partite graph. It then improves the SP score of the initial alignment through local optimizations within a window that moves greedily on the alignment. QOMA uses two strategies to permit flexibility in the time/accuracy trade-off: (1) adjust the sliding window size; (2) tune from a complete K-partite graph to a sparse K-partite graph for the local optimization of a window. Unlike traditional tools, QOMA can be independent of the order of the sequences. The experimental results on BAliBASE benchmarks show that QOMA produces higher SP scores than existing tools including ClustalW, ProbCons, MUSCLE, T-Coffee and DCA. The difference is more significant for distant proteins.

3.1 Motivation and Problem Definition

We introduced background on multiple sequence alignment in Chapter 2. Progressive methods are the most popular methods for multiple sequence alignment; however, they have an important shortcoming: the order in which the profiles are chosen for alignment significantly affects the quality of the alignment. The optimal alignment may be different from all alignments obtained by considering all possible orderings of the sequences [100]. Chapter 2 discussed the major multiple sequence alignment strategies in detail. A method that can balance running time and alignment accuracy is in great demand. Fragment-based methods follow the strategy of assembling pairwise or multiple local alignments.
The divide-and-conquer alignment methods such as DCA [47] can be

considered in this group. However, DCA is still an order-dependent method, as explained in Chapter 2. In this chapter, we consider the problem of maximizing the SP score of the alignment of multiple protein sequences. We develop a graph-based method named QOMA (Quasi-Optimal Multiple Alignment). QOMA starts by constructing an initial multiple alignment. The initial alignment is independent of any sequence order. QOMA then builds a graph corresponding to the initial alignment. It iteratively places a window on this graph and improves the SP score of the initial alignment by optimizing the alignment inside the window. The location of the window is selected greedily as the one that has a chance of improving the SP score by the largest amount. QOMA uses two strategies to permit flexibility in the time/accuracy trade-off: (1) adjust the sliding window size; (2) tune from a complete K-partite graph to a sparse K-partite graph for the local optimization of a window. The experimental results show that QOMA finds alignments with better SP scores than existing tools including ClustalW, ProbCons, MUSCLE, T-Coffee and DCA. The improvement is more significant for distant proteins.

3.2 Current Results

In this section, we introduce the basic QOMA algorithm for aligning K protein sequences. QOMA works in two steps: (1) it constructs an initial alignment and the K-partite graph corresponding to this alignment; (2) it iteratively places a window on the sequences and replaces the window with its optimal alignment. We call this the complete K-partite graph algorithm, since a letter of a protein can be aligned with any letter of the other proteins within the same window. Next, we describe these two steps in detail.

3.2.1 Constructing Initial Alignment

The purpose of constructing an initial alignment is to roughly identify the position of each node in the final alignment. It is important to find this initial alignment quickly in order to minimize initialization overhead.

Figure 3-1. Constructing the initial alignment by strategy 2. Left: All pairs of sequences are aligned; edges are inserted between nodes that match in the alignments. Right: Columns are constructed by aligning the nodes; gaps are inserted wherever necessary.

There are many ways to construct the initial alignment. We group them into two classes: (1) Use an existing tool, such as ClustalW, to create an alignment. This strategy has the shortcoming that the initial alignment depends on other tools, which may be order-dependent. This makes QOMA partially order-dependent. (2) Construct the alignment from pairwise optimal alignments of the sequences. In this strategy, first, sequence pairs are optimally aligned using DP [60]. An edge is added between two nodes if the nodes are matched in this alignment. Each edge is assigned a weight equal to the substitution score of the two residues that constitute that edge. The substitution score is obtained from the underlying scoring matrix, such as BLOSUM62 [108]. The weight of each node is defined as the sum of the weights of the edges that have that node on one end. A node set is then defined by selecting one node from the head of each sequence. The node with the highest weight is selected from this set. This node is aligned with the nodes adjacent to it. Thus, the letters aligned at the end of this step constitute one column of the initial

multiple alignment. The node set is then updated to the nodes immediately after the nodes in the current set in each sequence. This process is repeated, and columns are found until all the sequences end. The alignment is obtained by concatenating all these columns. Gaps are inserted between nodes if necessary. Unlike progressive tools, this strategy is order-independent. An example of initial alignment construction is shown in Figure 3-1. In this example, three protein sequences p_1, p_2 and p_3 are first pairwise aligned. For simplicity, we show each pairwise alignment as a separate graph in this figure; in reality, one node per letter is sufficient. The nodes that match in these optimal alignments are then linked by edges. For example, a_1 and b_2 match in the optimal alignment of p_1 and p_2, so they have an edge <a_1, b_2> in the constructed graph. The weight of this edge is equal to the BLOSUM62 entry for the letters a_1 and b_2. We do not show the edge weights in Figure 3-1 in order to keep the figure simple. In this figure, the node for a_1 has edges to the nodes for b_2 and c_2. Therefore, the weight of the node for a_1 is computed as the sum of the weights of the edges <a_1, b_2> and <a_1, c_2>. Initially, {a_1, b_1, c_1} is chosen as the candidate node set. In this example, we assume that, among the three nodes for a_1, b_1 and c_1, the node for a_1 has the largest weight. Thus we select the node for a_1 as the central node and construct the column (a_1, b_2, c_2). Then we start to construct the next column. We update the candidate node set to {a_2, b_3, c_3}, the nodes that immediately follow the nodes for a_1, b_2 and c_2 in the sequences. Assuming that the node for a_2 has the largest weight among the nodes for a_2, b_3 and c_3, we select the node for a_2 as the central node and construct the column (a_2, b_4, c_4) correspondingly. When we concatenate the columns to make the final alignment, gap nodes are inserted if necessary.
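The greedy column construction described above can be sketched as follows. This is a simplified illustration, assuming the pairwise match partners and node weights have already been computed from the optimal pairwise alignments: the `partners` and `node_weight` input formats are hypothetical, and skipped nodes are flushed as individual gap columns rather than merged into combined columns as in Figure 3-1.

```python
def build_initial_alignment(seqs, partners, node_weight):
    """Greedy column construction of strategy 2 (simplified sketch).

    seqs:        list of K sequences
    partners:    partners[i][p] is a dict {j: q} saying position p of
                 sequence i matches position q of sequence j in their
                 optimal pairwise alignment (hypothetical input format)
    node_weight: node_weight[i][p] is the sum of incident edge weights
    """
    K = len(seqs)
    frontier = [0] * K                 # next unconsumed position per sequence
    columns = []

    def flush(j, upto):
        # emit skipped nodes of sequence j as gap columns
        while frontier[j] < upto:
            col = ["-"] * K
            col[j] = seqs[j][frontier[j]]
            columns.append(tuple(col))
            frontier[j] += 1

    while any(frontier[i] < len(seqs[i]) for i in range(K)):
        cands = [i for i in range(K) if frontier[i] < len(seqs[i])]
        # central node: the candidate frontier node with the largest weight
        center = max(cands, key=lambda i: node_weight[i][frontier[i]])
        matches = {j: q
                   for j, q in partners[center].get(frontier[center], {}).items()
                   if frontier[j] <= q < len(seqs[j])}
        for j, q in matches.items():
            flush(j, q)                # insert gap columns where necessary
        col = ["-"] * K
        col[center] = seqs[center][frontier[center]]
        frontier[center] += 1
        for j, q in matches.items():
            col[j] = seqs[j][q]
            frontier[j] += 1
        columns.append(tuple(col))
    return columns

# Toy run: "GMK" has an extra leading residue, which becomes a gap column.
cols = build_initial_alignment(
    ["MK", "GMK"],
    {0: {0: {1: 1}, 1: {1: 2}}, 1: {1: {0: 0}, 2: {0: 1}}},
    {0: [5, 5], 1: [1, 5, 5]})
print(cols)   # [('-', 'G'), ('M', 'M'), ('K', 'K')]
```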
In this example, when we concatenate the columns (a_1, b_2, c_2) and (a_2, b_4, c_4), two gap nodes are inserted in sequence p_1, one before the node for a_1 and one after it. Thus we construct the columns (−, b_1, c_1) and (−, b_3, c_3). The time complexity of both of these strategies is O(K^2 N^2), since the pairwise comparisons dominate the running time. However, the latter approach is faster. This is

because it runs dynamic programming only once for each sequence pair. The former one, on the other hand, performs two sets of pairwise alignments: one to find a guide tree, and another to align the sequences progressively according to the guide tree.

3.2.2 Improving the SP Score via Local Optimizations

After constructing the initial alignment, the nodes are placed roughly in their correct positions (or in nearby positions) in the alignment. Next, the alignment is improved iteratively. At each iteration, a short window is placed on the existing alignment. The subsequences contained in this window are then replaced by their optimal alignment (Figure 3-2). A generalized version of the DP algorithm [60] is used to find the optimal alignment. This is feasible since the cost of aligning a window is much less than that of aligning the entire sequences. This algorithm requires solving two problems. First, where should the windows be placed? Second, when should the iterations stop? One obvious solution is to slide a window from left to right (or right to left), shifting by some predefined amount at each iteration. In this case, the iterations end once the window reaches the right end (or the left end) of the alignment (see Figure 3-2). This solution, however, has two problems. First, it is not clear in which direction the window should be slid. Second, a window is optimized even if it is already a good alignment. We propose another solution. We compute an upper bound on the improvement of the SP score for every possible window position as follows. Let X_i denote the upper bound on the SP score for the window starting at position i in the alignment. This number can be computed as the sum of the scores of all the pairwise optimal alignments of the subsequences in this window. Let Y_i denote the current SP score of that window. The upper bound on the improvement is then X_i − Y_i. We propose to greedily select the window that has the largest upper bound at each iteration.
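A minimal sketch of this greedy window selection, using a toy match/mismatch scoring function rather than BLOSUM62 (the function names are ours): X_i is the sum of optimal pairwise Needleman-Wunsch scores of the degapped subsequences in the window, Y_i is the window's current SP score, and the window maximizing X_i − Y_i is returned.

```python
from itertools import combinations

def nw_score(a, b, sub, gap):
    """Needleman-Wunsch optimal global alignment score (score only)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            cur.append(max(prev[j - 1] + sub(a[i - 1], b[j - 1]),
                           prev[j] + gap,
                           cur[j - 1] + gap))
        prev = cur
    return prev[-1]

def column_sp(rows, sub, gap):
    """SP score of an aligned block (rows padded with '-')."""
    total = 0
    for x, y in combinations(rows, 2):
        for cx, cy in zip(x, y):
            if cx == '-' and cy == '-':
                continue
            total += gap if '-' in (cx, cy) else sub(cx, cy)
    return total

def best_window(alignment, W, sub, gap):
    """Return (start, X - Y) for the window with the largest upper
    bound X_i - Y_i on the SP-score improvement."""
    N = len(alignment[0])
    best = (0, float('-inf'))
    for i in range(N - W + 1):
        rows = [s[i:i + W] for s in alignment]
        Y = column_sp(rows, sub, gap)
        # X: sum of optimal pairwise scores of the degapped subsequences
        X = sum(nw_score(x.replace('-', ''), y.replace('-', ''), sub, gap)
                for x, y in combinations(rows, 2))
        if X - Y > best[1]:
            best = (i, X - Y)
    return best

sub = lambda x, y: 2 if x == y else -1   # toy substitution score
start, bound = best_window(["GA-T", "G-AT", "GAAT"], W=2, sub=sub, gap=-1)
print(start, bound)   # 1 4: the middle window has the most room to improve
```

Here the window at position 1 (the poorly aligned middle columns) is selected, while the already-optimal flanking windows get an upper bound of zero.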
In order to ensure that this solution does not optimize more windows than the first one (i.e., sliding windows), we do not select a window position that is within ∆/2 positions of a previously optimized window. The iterations stop when all the remaining windows

Figure 3-2. QOMA finds the optimal alignment inside the window, replaces the window with this optimal alignment, and then moves the window by ∆ positions.

have an upper bound of zero or are within ∆/2 positions of a previously optimized window. In our experiments, the two solutions produced roughly the same SP score; the second solution was slightly better. The second solution, however, converged to the final result much faster than the first one (results not shown). The time complexity of the algorithm is O(2^K W^K K^2 (N − W + 1)). This is because there are (N − W + 1) positions for the window, and a dynamic programming solution is computed for each such window. The cost of each dynamic programming solution is O(2^K W^K K^2). This algorithm is much faster than the optimal dynamic programming when W is much smaller than N. The space complexity is O(W^K + KN): the dynamic programming for a window requires O(W^K) space, and only one window is maintained at a time; O(KN) space is needed to store the sequences and the alignment. Note that the edges of the complete K-partite graph are not stored at this step, as we already know that the graph is complete.

3.2.3 QOMA and Optimality

In this section, we analyze the QOMA approach. Let P_1, P_2, ..., P_K be the protein sequences to be aligned. Let A* be an optimal alignment of P_1, P_2, ..., P_K, and let S* denote the SP score of A*. Let A be an alignment of P_1, P_2, ..., P_K, and let SP(A) denote the SP score of A. We define the error induced by A as error(A) = S* − SP(A). This expression,

however, is not computable, since finding S* is NP-complete. Instead, we compute the error of A as ɛ(A) = S̄ − SP(A), where S̄ is an upper bound on S*. Here, S̄ is computed as the sum of the scores of all optimal pairwise alignments of P_1, P_2, ..., P_K. We conclude that ɛ(A) ≥ error(A). Let QOMA(A, W) be the alignment obtained by QOMA starting from the initial alignment A and sliding a window of size W. We define the percentage of improvement provided by QOMA over A using a window size of W as

improve(A, W) = (1 − ɛ(QOMA(A, W)) / ɛ(A)) × 100 (3-1)

Our first lemma shows that QOMA always results in an alignment at least as good as the initial alignment.

Lemma 1. improve(A, W) ≥ 0 for all A and W.

Proof: For a given position of the window, let A_prefix, A_W and A_suffix denote the alignment to the left of the window, inside the window, and to the right of the window, respectively (see Figure 3-2). Let A*_W be the optimal alignment obtained by QOMA for the window, and let A′ be the alignment obtained from A by replacing A_W with A*_W. We have SP(A_W) ≤ SP(A*_W). Thus, SP(A) = SP(A_prefix) + SP(A_W) + SP(A_suffix) ≤ SP(A_prefix) + SP(A*_W) + SP(A_suffix) = SP(A′). Then we get ɛ(A) = S̄ − SP(A) ≥ S̄ − SP(A′) = ɛ(A′). Finally, we have ɛ(QOMA(A, W))/ɛ(A) ≤ 1. We conclude improve(A, W) ≥ 0.

Corollary 1 follows from Lemma 1.

Corollary 1. SP(A*) = SP(QOMA(A*, W)) for all W.

Corollary 1 implies that QOMA alters an initial alignment A only if A is not optimal. The next lemma discusses the impact of the window size on QOMA.

Lemma 2. SP(QOMA(A, W)) ≤ SP(QOMA(A, 2W)).

Proof: For a given position of a window of length 2W, let A*_2W denote the optimal alignment inside the window. Let A*_W1 and A*_W2 denote the optimal alignments of the first and second halves of the window. Then SP(A*_W1) + SP(A*_W2) ≤ SP(A*_2W). This is because SP(A*_2W) is the optimal SP score for the entire window. Therefore, SP(QOMA(A, W)) ≤ SP(QOMA(A, 2W)).

Figure 3-3. Sparse K-partite graph for two sequences, for d = 0 and d = 1.

Figure 3-4. An example of using the K-partite graph: (a) A sparse K-partite graph for three sequences from a window of size 4. (b) The induced subgraph of cell [3, 4, 4] for the K-partite graph in (a).

Lemma 2 indicates that as W increases, the SP score of the resulting alignment increases. When W becomes greater than the length of A, the sliding window contains the entire sequences. In this case, SP(QOMA(A, W)) = S*. The following corollary states this.

Corollary 2. As W increases, SP(QOMA(A, W)) converges to S*.

3.2.4 Improved Algorithm: Sparse Graph

QOMA converges to the optimal alignment as the window size W grows. However, this happens at the expense of exponential time complexity. In Section 3.2.2 we computed the time complexity of QOMA using the complete K-partite graph as O(2^K W^K (N − W + 1) K^2) for proteins P_1, P_2, ..., P_K. In this section, we reduce the time complexity of QOMA by

sacrificing accuracy through the use of a sparse K-partite graph. The goal is to enable QOMA to run within a given limited time budget when using a larger window size. The factor 2^K in the complexity is incurred because each cell of the dynamic programming (DP) matrix is computed by considering 2^K − 1 conditions (i.e., 2^K − 1 neighboring cells). This is because there are 2^K − 1 possible nonempty subsets of the K residues. Each subset here corresponds to a set of residues that align together, and thus to a neighboring cell. We propose to reduce this complexity by reducing the number of residues that can be aligned together. We do this by keeping only the edges between node pairs with a high possibility of matching. The strategy for choosing the promising edges is crucial for the quality of the resulting alignment. We use the optimal pairwise alignment method discussed in Section 3.2.1. This strategy produces at most K − 1 edges per node, since each node is aligned with at most one node from each of the other K − 1 sequences. We also introduce a deviation parameter d, where d is a non-negative integer. Let p[i] and q[j] be the nodes corresponding to protein sequences p and q at positions i and j in the initial graph, respectively. We draw an edge between p[i] and q[j] only if one of the following two conditions holds in the optimal pairwise alignment of p and q: (1) there exists δ, |δ| ≤ d, such that p[i] is aligned with q[j + δ]; (2) there exists δ, |δ| ≤ d, such that q[j] is aligned with p[i + δ]. In other words, we draw an edge between two nodes if their positions differ by at most d in the optimal alignment of p and q. For example, in Figure 3-3, p[2] aligns with q[2]. Therefore, we draw an edge from p[2] to q[1] and q[3] as well as q[2], since q[1] and q[3] are within the d-neighborhood (d = 1) of q[2]. The dynamic programming is modified for the sparse K-partite graph as follows: each cell [x_1, x_2, ..., x_K] in the K-dimensional DP matrix corresponds to the nodes P_1[x_1], P_2[x_2], ..., P_K[x_K].
Here P_i[j] stands for the node at position j in sequence i. Each cell's node set contains one node from each sequence, and each node can be either a residue or a gap. Thus, each cell defines a subgraph induced by its node set. For example, during the alignment of the sequences that

have the K-partite graph shown in Figure 3-4(a), the cell [3, 4, 4] corresponds to the nodes P_1[3], P_2[4] and P_3[4]. Figure 3-4(b) shows the induced subgraph of cell [3, 4, 4]. The induced subgraph of each cell yields a set of connected components. The sparse graph strategy exploits the concept of connected components to improve the running time of DP as follows: during the computation of the value of a DP matrix cell, we allow two nodes to align only if they belong to the same connected component of the induced subgraph of that cell. For example, for cell [3, 4, 4], P_2[4] and P_3[4] can be aligned together, but P_1[3] cannot be aligned with P_2[4] or P_3[4] (see Figure 3-4(b)). A connected component with n nodes produces 2^n − 1 non-empty subsets. Thus, for a given cell, if there are t connected components and the i-th component has n_i nodes, then the cost of that cell becomes Σ_{i=1..t} (2^{n_i} − 1). This is a significant improvement, as the cost of a single cell is 2^{n_1 + n_2 + ... + n_t} − 1 using the complete K-partite graph. For example, in Figure 3-4, the cost for cell [3, 4, 4] drops from 2^3 − 1 = 7 to (2^1 − 1) + (2^2 − 1) = 4. The connected components of an induced subgraph can be found in O(K^2) time (i.e., the size of the induced subgraph) by traversing the induced subgraph once. Thus, the total time complexity of the sparse K-partite graph approach is O((Σ_{i=1}^{W^K} Σ_j (2^{n_j} − 1)) (N − W + 1) K^2), where the inner sum is over the connected components of the i-th cell. The space complexity of using the sparse K-partite graph is O(W^K + KN + N K(K − 1)(2d + 1)/2). The first term denotes the space for the dynamic programming alignment within a window. The second term denotes the number of letters. The last term denotes the number of edges. The space needed for the last two terms can be reduced by storing only the subgraph inside the window.
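A sketch of the two sparse-graph ingredients just described, using hypothetical node labels of our own choosing: building the deviation-d edge set for one sequence pair, and computing a cell's DP cost as the sum of 2^n − 1 over the connected components of its induced subgraph.

```python
def sparse_edges(matches, d):
    """Sparse-graph edges for one sequence pair: for every position pair
    (i, j) matched in the optimal pairwise alignment, connect nodes whose
    positions differ by at most the deviation parameter d."""
    edges = set()
    for i, j in matches:
        for delta in range(-d, d + 1):
            edges.add((i, j + delta))   # p[i] vs. q[j + delta]
            edges.add((i + delta, j))   # p[i + delta] vs. q[j]
    return edges

def cell_cost(nodes, edges):
    """DP cost of one cell: sum of (2^n - 1) over the connected components
    of the cell's induced subgraph. Nodes are arbitrary hashable labels;
    edges are unordered pairs of labels."""
    adj = {v: set() for v in nodes}
    for u, v in edges:
        if u in adj and v in adj:       # keep only edges inside this cell
            adj[u].add(v)
            adj[v].add(u)
    seen, cost = set(), 0
    for v in nodes:
        if v in seen:
            continue
        stack, size = [v], 0            # DFS over one component
        while stack:
            u = stack.pop()
            if u in seen:
                continue
            seen.add(u)
            size += 1
            stack.extend(adj[u] - seen)
        cost += 2 ** size - 1
    return cost

# d = 1 around a single pairwise match at positions (2, 2), as in Figure 3-3:
print(sorted(sparse_edges([(2, 2)], 1)))

# Cell with components {P1[3]} and {P2[4], P3[4]}, cf. Figure 3-4(b):
nodes = [("P1", 3), ("P2", 4), ("P3", 4)]
edges = {(("P2", 4), ("P3", 4))}
print(cell_cost(nodes, edges))   # 4, versus 2**3 - 1 = 7 when complete
```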

Table 3-1. The average SP scores of QOMA using the complete K-partite graph with ∆ = W/2 on BAliBASE benchmarks, and the upper bound score S̄. (Initialization strategy 1, indicated by s1: initial alignments are obtained from ClustalW. Initialization strategy 2, indicated by s2: initial alignments are obtained from optimal pairwise alignments as discussed in Section 3.2.1.)

Dataset        S̄     Strategy  Initial  W=2  W=4  W=8  W=16
V1-R1-low      565    s1
                      s2
V1-R1-medium   2880   s1
                      s2
V1-R1-high     5324   s1
                      s2

3.3 Experimental Evaluation

Experimental setup: We used BAliBASE benchmarks [5], reference 1 from version 1 (www-igbmc.u-strasbg.fr/bioinfo/balibase/) and references 1 to 8 from version 3 (www-bio3d-igbmc.u-strasbg.fr/balibase/), for the evaluation of our method. We use V1 and V3 to denote BAliBASE versions 1 and 3, respectively, and R1 to R8 to denote references 1 to 8. For example, V3-R4 represents the reference 4 dataset from version 3. We split the V1-R1 dataset into three datasets (V1-R1-low, V1-R1-medium, and V1-R1-high) according to the similarity of the sequences in the benchmarks, as denoted in BAliBASE (low, medium and high similarities). Similarly, V3-R1 is split into two datasets, V3-R1-low and V3-R1-high, containing the low and high similarity benchmarks. The number of sequences in the version 3 benchmarks was usually too large for QOMA and DCA. Therefore, we created 1,000 benchmarks from each reference by randomly selecting five sequences from the existing benchmarks. Thus, each of the benchmarks from version 3 contains five sequences. We evaluated the SP score and the running time in our experiments. We do not report the BAliBASE scores, since the purpose of QOMA is to maximize the SP score. We implemented the complete and the sparse K-partite QOMA algorithms as discussed in this chapter, using standard C. We used BLOSUM62 as a measure of

similarity between amino acids. We used gap open = gap extend = -4 to penalize gaps. We used = W/2 in our experiments, since we achieved the best quality per unit time with this value. We also downloaded ClustalW, ProbCons, MUSCLE, T-coffee and DCA for comparison. We did not compare QOMA with our earlier work HSA [100], since HSA needs the secondary structure information of proteins for alignment. To ensure a fair comparison, we ran ClustalW, MUSCLE, T-coffee, DCA and QOMA using the same parameters (gap open = gap extend = -4, similarity matrix = BLOSUM62). This was not possible for ProbCons. We also ran all the competing methods using their default parameters. We present the results using the same parameters in our experiments unless otherwise stated. We ran all our experiments on an Intel Pentium 4 with 2.6 GHz speed and 512 MB memory. The operating system was Windows.

Quality evaluation: We first evaluate the quality of QOMA. Table 3-1 shows the average SP score of QOMA using two strategies for constructing the initial alignment and four values of W. Strategy 1 obtains the initial alignments from ClustalW. Strategy 2 obtains the initial alignments from the algorithm provided in Section 3.2.1. The table also shows the upper bound for the SP score, S̄, and the SP score of ClustalW for comparison. QOMA achieves a higher SP score than ClustalW on average for all window sizes and for all datasets. The SP score of QOMA consistently increases as W increases. These results are justified by Lemmas 1 and 2. The SP score of Strategy 2 is usually higher than that of Strategy 1 for almost all cases of low and medium similarity. The two strategies are almost identical for highly similar sequences. There is a loose correlation between the initial SP score and the final SP score of QOMA. Higher initial SP scores usually imply higher SP scores in the end result. There are, however, exceptions, especially for highly similar sequences.
In the rest of the experiments, we use Strategy 2 to construct the initial alignments by default. Table 3-2 shows the SP scores of five existing tools and QOMA on all the datasets, when the competing tools are run using the same parameters as QOMA and using their

default parameters. QOMA has higher SP scores than all the compared tools for all the datasets. DCA always has the second best scores, since it also targets maximizing the SP score of alignments. The difference between the SP scores of QOMA and the other tools is more significant for low and medium similarity sequences. This is an important achievement, because the alignment of such sequences is usually harder than that of highly similar sequences. Table 3-3 shows the average percentage of improvement of QOMA over the alignments of ClustalW on the V1-R1 dataset, using the improvement formula given in Section 3.2.3. As the window size increases, the increase in the improvement percentage reduces. This indicates that QOMA converges to the optimal score at reasonable window sizes. In other words, using a window size larger than 16 will not improve the SP score significantly. Table 3-4 shows the average and the standard deviation of the error incurred for each window due to using the sparse K-partite graph for QOMA. The error decreases as d increases. For W = 8, the error reduces considerably when d increases from 0 to 1, and much less when d increases from 1 to 2. This implies that the average improvement in the SP score degrades quickly for d > 1. Similar observations can be made for W = 16. Thus, we conclude that the SP score improves only slightly for d > 1. Figure 3-5 shows the average SP scores of the resulting alignments using the sparse K-partite graph for different values of d, and using the complete K-partite graph, on the V1-R1 dataset. The complete K-partite graph algorithm produces the best SP scores. However, the SP scores of the results from the sparse K-partite graph algorithm are very close to those of the complete K-partite graph algorithm. The quality of the sparse K-partite graph algorithm improves significantly when d increases from 0 to 1. The improvement is smaller when d increases from 1 to 2. This implies that as d becomes larger, it has less impact on the quality of the alignment.

Performance evaluation: Our second set of experiments evaluates the running time of QOMA. Table 3-5 lists the running time of QOMA for the complete and the sparse K-partite graph algorithms for varying values of W. The experimental results show that QOMA runs faster for small W. The sparse K-partite graph algorithm is faster than the complete K-partite graph algorithm for all values of d for large W. The running time of QOMA increases as d increases. The results in this table agree with the time complexities we computed earlier in this chapter. Referring to Tables 3-1, 3-2 and 3-3, we conclude that when the window size is small, QOMA runs fast and produces high quality results. As the window size increases, its performance drops but the alignment quality improves further. Another parameter for the quality/time trade-off is d. Figure 3-5 shows that the SP score difference between the complete and the sparse K-partite graph algorithms is small. Thus, it is better to increase the window size and use the sparse K-partite graph strategy to obtain high scoring results quickly. As we have observed in Tables 3-1 and 3-5 and Figure 3-5, the best balance between quality and running time appears at d = 1 using the sparse K-partite graph strategy.

Figure 3-5. The SP scores of QOMA alignments using the complete K-partite graph and sparse K-partite graphs for different values of d and W on the V1-R1 dataset. The initial alignments are obtained from Strategy 2. (Legend: sparse K-partite graph with d = 0, 1, 2, and complete K-partite graph; axes: SP score vs. window size.)

Table 3-2. The average SP scores of QOMA (using the complete K-partite graph with W = 16) and five other tools on BAliBASE benchmarks. The numbers show the SP scores when the tools are run with the same parameters as QOMA (indicated by S) and with their default parameters (indicated by D). Some of the tools, namely T-coffee and ClustalW, did not produce any alignment for some benchmarks for either parameter setting. The results of all the tools are ignored for such benchmarks. N/A indicates that the corresponding tool failed to produce an alignment for most of the benchmarks in a dataset for that parameter setting. We ignore such tools (i.e., T-coffee) for those datasets and parameter settings. Columns: ClustalW, ProbCons, T-coffee, MUSCLE, DCA and QOMA, with S and D for each tool. Rows: V1-R1-low, V1-R1-medium, V1-R1-high, V3-R1-low, V3-R1-high, and V3-R2 through V3-R8.

Table 3-3. The improvement (see Formula 3-1 in Section 3.2.3) of QOMA (using the complete K-partite graph) over ClustalW on the V1-R1 dataset. The dataset is split into three subsets (short, medium, and long) according to the length of the sequences. Columns: window size; rows: short, medium, long.

Table 3-4. The average (µ) and standard deviation (σ) of the error, S̄ − SP, for a window using the sparse version of QOMA on the V1-R1 dataset. Results are shown for window sizes W = 8 and 16, and deviation d = 0, 1, and 2. The ɛ value denotes the 95% confidence interval, i.e., 95% of the expected improvement values are in the [µ − ɛ, µ + ɛ] interval. Columns: µ, σ, ɛ for each of d = 0, 1, 2; rows: W = 8 and W = 16.

Table 3-5. The running time of QOMA (in seconds) using the complete K-partite graph and the sparse graph for different values of d and W on the V1-R1 dataset. (A: complete K-partite graph. B: sparse K-partite graph with d = 0. C: sparse K-partite graph with d = 1. D: sparse K-partite graph with d = 2.) The dataset is split into three subsets (short, medium, and long) according to the length of the sequences in the benchmarks. Columns: A through D for each subset; rows: the four window sizes W.

CHAPTER 4
OPTIMIZING THE ALIGNMENT OF MANY SEQUENCES

In this chapter, we consider the problem of aligning multiple protein sequences with the goal of maximizing the SP (Sum-of-Pairs) score when the number of sequences is large. The QOMA (Quasi-Optimal Multiple Alignment) algorithm addressed this problem when the number of sequences is small. However, as the number of sequences increases, QOMA becomes impractical. This chapter develops a new algorithm, QOMA2, which optimizes the SP score of the alignment of an arbitrarily large number of sequences. Given an initial (potentially sub-optimal) alignment, QOMA2 selects short subsequences from this alignment by placing a window on it. It quickly estimates the amount of improvement that can be obtained by optimizing the alignment of the subsequences in short windows on this alignment. This estimate is called the SW (Sum of Weights) score. QOMA2 employs a dynamic programming algorithm that selects the set of window positions with the largest total expected improvement. It partitions the subsequences within each window into clusters such that the number of subsequences in each cluster is small enough to be optimally aligned within a given time. It also aims to select these clusters so that the optimal alignment of the subsequences in these clusters produces the highest expected SP score. The experimental results show that QOMA2 produces high SP scores quickly, even for a large number of sequences. They also show that the SW score and the resulting SP score are highly correlated. This implies that it is promising to aim for optimizing the SW score, since computing it is much cheaper than aligning multiple sequences optimally.

4.1 Motivation and Problem Definition

Progressive methods progressively align pairs of profiles in a certain order, producing a new profile at each step, until a single profile is left. A profile is either a sequence or the alignment of a set of sequences. Figure 4-1(a) illustrates this.
Here, sequences a and b are optimally aligned. Then, c and d are optimally aligned. Their resulting alignments are aligned next. Progressive methods, however, have an important shortcoming. The

order in which the profiles are chosen for alignment affects the quality of the alignment significantly. The optimal alignment may differ from all of the alignments obtained by considering all possible orderings of the sequences [100]. Table 4-1 defines the variables frequently used in the rest of this chapter.

Table 4-1. The list of variables used in this chapter.
K: Total number of sequences to be aligned.
W: Window size.
T: Maximum number of sequences of length W that can be optimally aligned.
P_i: Sequence or profile.
f_i: Subsequence of P_i that lies in a given window.
v_i: Vertex corresponding to f_i.
e_i,j: Weight of the edge between v_i and v_j.
N: Length of a sequence or a profile.
M: Number of windows that are optimized.

In Chapter 3, we introduced QOMA [99], which eliminated the drawbacks of the progressive methods. QOMA partitioned an initial alignment into short subsequences by placing a window. It then optimally realigned the subsequences in each window. This is shown in Figure 4-1(b). Optimally aligning each window costs O(W^K 2^K), significantly less than O(N^K 2^K) for W ≪ N. However, when K is large, even O(W^K 2^K) becomes too costly. The value of W needs to be reduced significantly to make QOMA practical. For example, assume that QOMA works for W = 32 when K = 6. When K becomes 18, W should be reduced to two in order to run in roughly the same time. This, however, reduces the SP score of the alignments found by QOMA, since each window contains extremely short subsequences. This chapter addresses the problem of aligning multiple protein sequences with the goal of achieving a large SP score when the number of sequences is large. We develop an algorithm, QOMA2, which works well even when the number of sequences is large. Figure 4-1(c) illustrates the QOMA2 algorithm. It takes K sequences and an initial (potentially sub-optimal) alignment of them as input. QOMA2 selects short subsequences

from these sequences by placing a window on their initial alignment. Each window position defines K subsequences, and each subsequence has at most W letters. QOMA2 quickly estimates the amount of improvement that can be obtained by optimizing the alignment of the subsequences in each window. This estimate is called the SW (Sum of Weights) score. It uses a dynamic programming algorithm to select the set of window positions with the largest total expected improvement. It then recursively forms clusters of T (T ≤ K) subsequences and optimally aligns each cluster. The clusters are created by iteratively partitioning the subsequences into clusters and updating the SW score according to these clusters. Thus, different windows can result in different partitionings of the subsequences into clusters (see Figure 4-1(c)). This is desirable, since the optimal clustering of the subsequences may differ for different window positions. The value of T is determined by the time budget allowed for QOMA2, since the alignment of the subsequences in the clusters governs the overall running time. As T increases, both the alignment score and the running time increase. The experimental results show that QOMA2 achieves high SP scores quickly, even for a large number of sequences. They also show that the SW score and the resulting SP score are highly correlated. This implies that it is promising to aim for optimizing the SW score, since computing it is much cheaper than aligning multiple sequences optimally.

Graph Partitioning. METIS [109, 110] is a popular tool for partitioning unstructured graphs, partitioning meshes, and computing fill-reducing orderings of sparse matrices. The algorithms implemented in METIS are based on multilevel recursive-bisection, multilevel k-way, and multi-constraint partitioning schemes. It can provide high quality partitions quickly.

4.2 Current Results

Let A be an alignment of K sequences P_1, P_2, ..., P_K. Let W > 1 be an integer that denotes the window length.
Assume that we are allowed to place a window on A in M different locations and optimize the alignment of the subsequences in these M locations.

Figure 4-1. Alignment strategies at a high level: (a) progressive alignment, (b) the QOMA algorithm, (c) the QOMA2 algorithm. The solid lines denote sequences a, b, ..., f. Dashed polygons denote the (sub)sequences whose alignments are optimized. The trees next to the alignments show the guide tree used by the underlying algorithm to align the sequences. In (a), a and b are optimally aligned. Then, c and d are optimally aligned. Their resulting alignments are aligned next. In (b), small subsequences of a, b, c, and d in each window are aligned optimally. In (c), the window on the left indicates that the subsequences from a, b and c are optimally aligned, the subsequences from d, e and f are optimally aligned, and then their results are aligned. Similarly, the window on the right indicates that the subsequences from a, b and f are optimally aligned, the subsequences from c, d and e are optimally aligned, and then their results are aligned.

The first problem that needs to be addressed is the identification of the M locations that maximize the overall improvement. Figures 4-1(b) and 4-1(c) show two examples in which three and two positions are selected, respectively. It is important to mention that the number of windows, M, is governed by the total time allowed for improving the alignment.

A simple way to select the positions to place the window is to slide a window from left to right (or from right to left), shifting by some predefined amount at each iteration. Another simple solution is to select the window positions randomly. Clearly, neither of these solutions distinguishes promising window positions from unpromising ones. We suggested a greedy solution in our QOMA paper. That algorithm greedily selects the most promising window position from the unselected positions until M positions are selected. We discuss how we quantify how promising a window is later in this section. This greedy strategy, however, does not guarantee finding the best set of M window positions. Here, we develop a dynamic programming algorithm that is guaranteed to find the M optimal window positions. For each window position, we compute an upper bound to the improvement of the SP score that could be achieved by replacing that window with its optimal alignment, as follows. Let X_i denote the upper bound to the optimal SP score for the subsequences in the window starting at position i of the alignment. This number can be computed as the sum of the scores of all pairwise optimal alignments of the subsequences in this window. Let Y_i denote the current SP score of that window. The upper bound to the improvement of the SP score is computed as U_i = X_i − Y_i. We say that a window position i is promising if U_i is large. We propose to select the M window positions π_1, π_2, ..., π_M (∀i, π_i < π_{i+1}) whose sum of upper bounds (i.e., Σ_i U_{π_i}) is the largest. Note that if two windows overlap greatly, their combined improvement over the initial alignment can be much less than the sum of their individual improvements. This is because they improve almost the same regions, and thus they are highly dependent. The sum of their upper bounds includes the upper bound for their common region twice.
In order to prevent this, we also enforce a minimum distance between the positions of different windows: ∀i, π_{i+1} − π_i ≥ τ. Thus, if a window is positioned at π_i, no other window can be placed at a position in the [π_i − τ, π_i + τ] interval.
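The selection problem just stated, maximizing Σ_i U_{π_i} subject to the minimum spacing τ, can be solved by dynamic programming. Here is a minimal runnable sketch of such a spacing-constrained selection (the function name and 0-based indexing are my own):

```python
def select_windows(U, M, tau):
    """Pick M positions maximizing the sum of U[i], with any two chosen
    positions at least tau apart. Returns (best_sum, chosen positions)."""
    n = len(U)
    NEG = float("-inf")
    # SU[a][b]: best sum using b windows among the first a positions.
    SU = [[NEG] * (M + 1) for _ in range(n + 1)]
    for a in range(n + 1):
        SU[a][0] = 0.0                      # zero windows score zero
    for a in range(1, n + 1):
        for b in range(1, M + 1):
            skip = SU[a - 1][b]             # no window at position a-1
            # Window b at position a-1: earlier ones end tau positions back.
            take = SU[max(a - tau, 0)][b - 1] + U[a - 1]
            SU[a][b] = max(skip, take)
    # Trace back the chosen positions.
    picks, a, b = [], n, M
    while b > 0 and a > 0:
        if SU[a][b] == SU[a - 1][b]:
            a -= 1
        else:
            picks.append(a - 1)
            a = max(a - tau, 0)
            b -= 1
    return SU[n][M], sorted(picks)

print(select_windows([5, 1, 1, 9], 2, 2))  # (14.0, [0, 3])
```

The table has O(NM) entries and each is filled in constant time, so the selection itself is cheap compared to computing the upper bounds U_i.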

The value of τ determines how independent the windows are. As τ increases, the windows become more independent. For τ ≥ W, the windows are completely non-overlapping. On the other hand, large values of τ limit the number of possible window positions. We use τ = W/4, as it provided a good balance in our experiments. We develop a dynamic programming solution to determine the optimal window positions. Let SU(a, b) denote the largest possible sum of upper bounds of b window positions selected from the first a possible window positions. We would like to determine SU(N − W + 1, M) to solve our problem, where N is the length of the alignment. Clearly, SU(a, 1) = max_{1 ≤ i ≤ a} {U_i}. This is because if a single window is selected, it should be the one with the largest upper bound. For b > 1, there are two possibilities: 1) If a < bτ, SU(a, b) = 0. This is because, by the pigeonhole principle, it is impossible to select b window positions that are at least τ apart from each other in this case. 2) If a ≥ bτ, we compute SU(a, b) recursively as

SU(a, b) = max { SU(a − τ, b − 1) + U_a,  if a window is placed at position a;
                 SU(a − 1, b),            otherwise }

In this computation, the first case implies that the bth window starts at position a. Thus, the first b − 1 windows should be selected in the interval [1, a − τ] to ensure that they do not overlap with the bth window by more than τ. The second case implies that the window at position a is not a part of the solution. Therefore, the b window positions should be selected in the interval [1, a − 1]. The value of SU(N − W + 1, M) is the optimal sum of upper bounds. The window positions that lead to this optimal solution can be found by tracing back the values of SU after the dynamic programming computation completes. Figure 4-2 shows the average SP score of the improved alignment for the first eleven window positions when the windows are selected using our dynamic programming method, greedily, and by sliding a window. For the window sliding strategy, we shift the window by

W/2 at each iteration. The results are obtained by averaging the results of 82 BAliBASE benchmarks. We use W = 8 and K = T = 4 (i.e., each window of length eight is optimally aligned). The figure shows that the proposed selection strategy improves the SP score much faster than the sliding and the greedy strategies.

Figure 4-2. Comparison of the SP score found by different strategies for the selection of window positions: the proposed optimal selection, the greedy selection, and the sliding window. (Axes: SP score vs. number of window positions M.)

4.3 Aligning a Window

The goal of aligning a window is to maximize the SP score of the subsequences within the window. We propose a divide-and-conquer strategy, which clusters the set of K subsequences into smaller sets of T subsequences, so that the subsequences in each subset can be optimally aligned. This method has two major differences from the progressive

methods. First, progressive methods align two sequences (or profiles) at a time. Thus, T = 2 for the progressive methods, whereas QOMA2 can use larger T values, since it focuses on a short window. Second, once the clusters are determined, progressive methods align the entire sequences based on that clustering. QOMA2, however, can find different clusterings of the data for different window positions (see Figure 4-1(c) for an example). This is desirable, because different regions of the sequences may evolve at different conservation rates. For example, regions that serve important functions show much less variation than the remaining regions. Therefore, the best clustering for one region of the sequences may not be good for another region. QOMA2 addresses this by treating each region independently. We first construct an initial weighted complete graph by considering each subsequence in the window as a vertex. We then align the subsequences using two nested loops. The details of these two steps are discussed next.

4.3.1 Constructing Initial Graph

Given a window on the alignment, we first construct a weighted, undirected, complete graph G = (V, E). This graph models how much the SP score can be improved by carefully realigning the subsequences in this window. Let f_i denote the subsequence of the sequence P_i that remains in the window, ∀i, 1 ≤ i ≤ K. Each f_i maps to a vertex v_i ∈ V in this graph. We compute the weight of the edge e_i,j ∈ E between vertices v_i and v_j as

e_i,j = Score_optimal(f_i, f_j) − Score_induced(f_i, f_j)   (4-1)

Score_optimal(f_i, f_j) computes the score of the optimal alignment of f_i and f_j. Score_induced(f_i, f_j) denotes the score of the alignment of f_i and f_j induced from the current alignment. In other words, e_i,j is an upper bound to the improvement of the SP score due to f_i and f_j after realigning the window.
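Equation (4-1) only needs two pairwise scores per edge. The following is a toy sketch of that computation (it uses match +1, mismatch -1, indel -1 rather than the BLOSUM62 scheme used in the experiments; function names are mine):

```python
def score_optimal(fi, fj, match=1, mismatch=-1, gap=-1):
    """Global (Needleman-Wunsch) alignment score of two subsequences."""
    m, n = len(fi), len(fj)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = i * gap
    for j in range(1, n + 1):
        dp[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if fi[i - 1] == fj[j - 1] else mismatch
            dp[i][j] = max(dp[i - 1][j - 1] + sub,
                           dp[i - 1][j] + gap,
                           dp[i][j - 1] + gap)
    return dp[m][n]

def score_induced(ai, aj, match=1, mismatch=-1, gap=-1):
    """Score of the pairwise alignment induced by the current MSA rows
    ai, aj (equal-length strings, '-' denoting gaps)."""
    s = 0
    for x, y in zip(ai, aj):
        if x == "-" and y == "-":
            continue                      # gap-gap columns are ignored
        elif x == "-" or y == "-":
            s += gap                      # an indel
        else:
            s += match if x == y else mismatch
    return s

def edge_weight(ai, aj):
    """e_ij = Score_optimal - Score_induced, as in equation (4-1)."""
    fi, fj = ai.replace("-", ""), aj.replace("-", "")
    return score_optimal(fi, fj) - score_induced(ai, aj)

print(edge_weight("AC-T", "A-GT"))  # 1: optimal is 1, induced is 0
```

An edge weight of zero means the induced pairwise alignment is already optimal, so realigning cannot help that pair.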

Definition 1. Let G = (V, E) be the graph constructed for a set of subsequences in a window. We define the sum of the weights of all the edges in E as the SW (Sum of Weights) score of G.

When the edge weights are computed as given in equation (4-1), the SW score is an upper bound to how much the SP score of the subsequences in the underlying window can improve by aligning those subsequences optimally. The vertex-induced subgraph of any subset V′ ⊆ V defines a complete subgraph G′ = (V′, E′). The SW score of G′ is an upper bound to the amount of improvement that can be obtained by optimally aligning only the subsequences that map to the vertices in V′. In the following sections, we exploit the SW score to find a good clustering of the subsequences in a given window.

4.3.2 Clustering

The clustering algorithm partitions the set of subsequences {f_1, f_2, ..., f_K} into non-overlapping subsets of size at most T. The eventual goal is that optimally aligning each subset, followed by aligning the results of these alignments, improves the SP score as much as possible. Recall that each subset cannot have more than T subsequences, since we cannot optimally align more than T subsequences within the allowed time. We first need to determine how many clusters need to be created. The number of subsequences in each partition should be as large as possible. This is because more subsequences are optimally aligned with each other when the clusters are large. This indicates that there must be ⌈K/T⌉ clusters. Next, we need to determine the right criteria to partition the set of subsequences. A number of strategies can be developed to address this question. We discuss two solutions with the help of the complete weighted graph G constructed for the subsequences. Notice that partitioning the set of subsequences into clusters is equivalent to partitioning the graph G into the vertex-induced subgraphs of the vertices corresponding to the subsequences in each cluster.
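Computing the SW score of the whole graph, or of any vertex-induced subgraph, is a plain summation. A minimal sketch, assuming a dict-of-pairs edge representation of my own choosing:

```python
def sw_score(weights):
    """SW score (Definition 1): sum of all edge weights. `weights` maps
    vertex pairs (i, j) with i < j to edge weights."""
    return sum(weights.values())

def sw_score_subgraph(weights, vertices):
    """SW score of the subgraph induced by a vertex subset: an upper bound
    on the SP-score gain from optimally aligning just those subsequences."""
    vs = set(vertices)
    return sum(w for (i, j), w in weights.items() if i in vs and j in vs)

w = {(0, 1): 3.0, (0, 2): 1.5, (1, 2): 0.0}
print(sw_score(w))                    # 4.5
print(sw_score_subgraph(w, {0, 1}))   # 3.0
```

Because the subgraph score only counts edges with both ends inside the subset, partitioning the vertices always splits the SW score into an intra-cluster part and an inter-cluster (cut) part, which is what the two clustering strategies below trade off.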

Min-Cut clustering. The first strategy aims to optimize the intra-cluster SP score. That is, it maximizes the improvement in the SP score obtained by optimally aligning the subsequences within each cluster. At a high level, this is done by partitioning G into ⌈K/T⌉ subgraphs such that the sum of the SW scores of these subgraphs is as large as possible. This is equivalent to the Min ⌈K/T⌉-Cut problem with the additional restriction that each subgraph has at most T vertices. In other words, it translates into the problem of finding the set of edges in G such that their removal partitions G into ⌈K/T⌉ complete subgraphs of size at most T, and the sum of their weights is as small as possible. Finding the Min ⌈K/T⌉-Cut of a graph is an NP-complete problem. A number of heuristic algorithms have been developed to address this issue. One of the most commonly used tools for partitioning graphs is METIS [109, 110]. METIS partitions an input graph into a given number of subgraphs with the aim of minimizing or maximizing the total weight of the edges between different subgraphs. We use METIS to partition G into ⌈K/T⌉ subgraphs with a minimal ⌈K/T⌉-cut. Although METIS tries to partition the graph into roughly equal-sized subgraphs, it does not guarantee that they will be perfectly balanced in size. As a result, some of the clusters determined by METIS can have more than T vertices. This is undesirable, since the subsequences in each cluster are optimally aligned in the following step. Recall that the cost of optimally aligning a cluster is exponential in the size of that cluster. The maximum size of a cluster, T, is determined by the total amount of time allowed for optimizing the alignment. Thus, the METIS clusters need to be post-processed to guarantee that the sizes of the clusters do not exceed T. Next, we describe how we propose to adjust the sizes of the METIS clusters for the first strategy (i.e., optimizing the intra-cluster SP score).
It is trivial to adapt this algorithm to the second strategy.

Given the set of subgraphs (i.e., clusters) identified by METIS, we create three sets. The first one is the set of subgraphs with exactly T vertices, named EK (Equal to T). The second one is the set of subgraphs with more than T vertices, named GK (Greater than T). The last one is the set of clusters with fewer than T vertices, named LK (Less than T). We adjust the sizes of the clusters by moving vertices from clusters in GK to clusters in LK. Out of all such moves, the algorithm greedily picks the one which causes the smallest cut, since the goal is to minimize the total weight of the inter-cluster edges. After each move, the number of vertices in one of the clusters in GK decreases by one. Similarly, the number of vertices in one of the clusters in LK increases by one. Thus, clusters in GK and LK gradually move to EK. The iterations stop when GK is empty. This algorithm is guaranteed to converge to a solution in Σ_{G′∈GK} (|G′| − T) iterations of the while loop, where |G′| denotes the number of vertices in G′. This is because the number of vertices in a cluster of GK reduces by one at each iteration.

Max-Cut clustering. The second strategy aims to optimize the inter-cluster SP score. It achieves this by maximizing the total weight of the edges in the ⌈K/T⌉-cut of G. Similar to the first strategy, we use METIS to identify such a cut. The proposed algorithm for post-processing the clusters found by METIS can be adapted to the second strategy as follows: at each iteration of the while loop, the vertex move that maximizes the cut is chosen instead of the one that minimizes it. This can be done by modifying Steps 1 and 2.c of the algorithm. It is worth mentioning that the METIS algorithm for clustering the sequences is a module in QOMA2. It can be replaced by any clustering algorithm that finds a better Min ⌈K/T⌉-Cut or Max ⌈K/T⌉-Cut in the future.

4.3.3 Refining Clusters Iteratively

The Min-Cut and the Max-Cut clustering strategies aim to minimize or maximize the cut (see Section 4.3.2).
One drawback of these strategies is that each edge weight is computed by only considering the two subsequences corresponding to the two ends of that

edge (see Section 4.3.1). This is problematic, because the amount of improvement in the SP score obtained by optimally aligning a cluster of subsequences depends on all the subsequences in that cluster. Considering two subsequences at a time greatly overestimates the improvement. We propose to improve the clusters iteratively. Each iteration updates the edge weights by considering all the subsequences in each cluster. We discuss how the edge weights are updated later in this section. Once the edge weights are updated, the algorithm reclusters the subsequences using the new weights. The iterations stop when the SW score of the graph G does not increase between two consecutive iterations, or when a certain number of iterations have been performed. We would like to estimate how much two subsequences, f_i and f_j, contribute to the SP score under the restriction that each cluster is optimally aligned. The obvious solution is to optimally align each cluster and measure the new alignment score. This, however, is not practical for two reasons. First, optimally aligning a cluster of T subsequences is a costly operation. Performing this operation would make each iteration of the cluster refinement as costly as QOMA2 itself. Furthermore, this would only update the weights of the edges whose two ends belong to the same subgraph (i.e., intra-cluster edges). The weights of the edges between different subgraphs (i.e., inter-cluster edges) would still need to be computed. Thus, a good estimator should be efficient and work for both inter- and intra-cluster edges. We propose to estimate the edge weights by focusing on the gaps. At a high level, we assume the best scenario (i.e., the smallest possible number of gaps) for intra-cluster edges. This is because of the restriction that the subsequences in each cluster are optimally aligned. We then estimate the improvement in the SP score between every pair of subsequences by considering these gaps. We describe our estimator in detail next.
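The estimator described next reduces to a closed-form expression in the subsequence lengths L_i, L_j and the imposed gap counts g_i, g_j (it matches equations 4-2 to 4-4 of this section); a small sketch with my own function name:

```python
def expected_indels(Li, Lj, gi, gj):
    """Expected number of indels in the induced alignment of f_i and f_j,
    treating the arrangement of letters and gaps as uniformly random."""
    total_i, total_j = Li + gi, Lj + gj
    letter_letter = (Li * Lj) / (total_i * total_j)   # equation (4-2)
    gap_gap = (gi * gj) / (total_i * total_j)         # equation (4-3)
    indel_ratio = 1.0 - letter_letter - gap_gap
    return indel_ratio * max(total_i, total_j)        # equation (4-4)

# Same lengths and no imposed gaps: no indels are expected.
print(expected_indels(8, 8, 0, 0))  # 0.0
```

Evaluating this expression costs constant time per edge, which is what makes the iterative refinement affordable compared to realigning clusters.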
Let L_i be the length of subsequence f_i. After the complete weighted graph G is partitioned into ⌈K/T⌉ complete subgraphs, assume that v_i belongs to the subgraph G′. Recall that v_i is the vertex that denotes f_i. The optimal alignment of all the subsequences

in the same cluster as f_i requires the insertion of at least g_i = max_{v_j ∈ G′} {L_j} − L_i letters into f_i. This is because the alignment of all the subsequences in a cluster cannot be shorter than the longest subsequence in that cluster. Each such insertion corresponds to a gap in the alignment. Thus, g_i denotes the minimum number of gaps imposed on f_i by the clustering of the subsequences. Next, we compute the expected number of indels (insertions or deletions) in the alignment of subsequences f_i and f_j. An indel is an alignment of a letter with a gap. The alignment of two letters or of two gaps is not considered an indel. Considering all possible arrangements of the letters and gaps in f_i and f_j, the expected ratio of letter-letter alignments between f_i and f_j in their alignment is

L_i L_j / ((L_i + g_i)(L_j + g_j))   (4-2)

Similarly, the expected ratio of gap-gap alignments is

g_i g_j / ((L_i + g_i)(L_j + g_j))   (4-3)

Thus, the expected ratio of indels can be computed by subtracting equations (4-2) and (4-3) from one. The total length of the induced alignment of f_i and f_j is at most max{L_i + g_i, L_j + g_j}. Therefore, the expected number of indels in the induced alignment of f_i and f_j, denoted by Gap_expected(f_i, f_j), is at most

(1 − (L_i L_j + g_i g_j) / ((L_i + g_i)(L_j + g_j))) max{L_i + g_i, L_j + g_j}   (4-4)

Let Gap_induced(f_i, f_j) denote the number of indels in the induced alignment of f_i and f_j. Let gapcost denote the cost of a single indel. We compute the new weight of the edge

between vertices v_i and v_j as

    e_{i,j} = Score_optimal(f_i, f_j) - Score_induced(f_i, f_j) - gapcost * (Gap_expected(f_i, f_j) - Gap_induced(f_i, f_j)).

This computation differs from the initial edge-weight computation (Section 4.3.1) in that it accounts for the change in the gap cost imposed by the clusters that f_i and f_j belong to. Once the weights of the edges are updated, the current partitioning may not be a good one anymore. Therefore, we iteratively run the clustering algorithm again and update the edge weights similarly, until the SW score of the complete graph built for the current window does not increase any further or a given maximum number of iterations is reached.

Pseudo-code of the adjustment:

While GK is not empty
  1. min = infinity;
  2. For all G in GK and G' in LK
     - For all u in G
       (a) uG = sum of the weights of all the edges from u to all the vertices in G;
       (b) uG' = sum of the weights of all the edges from u to all the vertices in G';
       (c) If uG - uG' < min then
           - Record (u, G, G') as the current best move;
           - Update min as min = uG - uG';
  3. Move the vertex u from G to G' according to the best move;
     - If G contains T vertices then move G from GK to EK;
     - If G' contains T vertices then move G' from LK to EK;
End While
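The gap-based weight update can be captured in a few lines. The sketch below is an illustrative Python rendering (all function names are ours, not from the source); the score terms Score_optimal, Score_induced, and the induced indel count are assumed to be supplied by the caller.

```python
def min_gaps(lengths):
    """g_i = max_j L_j - L_i: minimum number of gaps imposed on each
    subsequence when all subsequences of one cluster are aligned."""
    longest = max(lengths)
    return [longest - L for L in lengths]

def expected_indels(L_i, g_i, L_j, g_j):
    """Upper bound of Eq. (4-4) on the expected number of indels in the
    induced alignment of f_i and f_j."""
    total_i, total_j = L_i + g_i, L_j + g_j
    letter_letter = (L_i * L_j) / (total_i * total_j)  # Eq. (4-2)
    gap_gap = (g_i * g_j) / (total_i * total_j)        # Eq. (4-3)
    return (1 - letter_letter - gap_gap) * max(total_i, total_j)

def updated_edge_weight(score_opt, score_ind, gap_ind,
                        L_i, g_i, L_j, g_j, gapcost=-4):
    """New edge weight e_ij; score_opt/score_ind are the optimal and
    induced pairwise scores, gap_ind the induced indel count."""
    gap_exp = expected_indels(L_i, g_i, L_j, g_j)
    return score_opt - score_ind - gapcost * (gap_exp - gap_ind)
```

Note that with no cluster-imposed gaps (g_i = g_j = 0) the expected indel count collapses to zero, so the update reduces to the plain score difference.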

4.3.4 Aligning the Subsequences in Clusters
The clustering algorithm guarantees that each cluster has at most T subsequences. However, the total number of clusters may be greater than T. This happens when K > T^2. In that case, finding the optimal alignment of the profiles of the clusters becomes infeasible. Although this brings us back to the same problem we are tackling in this paper, the new problem is easier since we have K/T profiles, which is significantly fewer than K. We recursively apply the QOMA2 algorithm (Sections 4.3.1 to 4.3.3) to these profiles until all the subsequences are aligned.

4.3.5 Complexity of QOMA2
The time complexity of QOMA2 is O(M log_T K (K W^T 2^T / T + cK^2)), where c is the upper bound on the number of inner-loop iterations. In practice, c is around 10. We derive the time complexity as follows. For each window, we apply the clustering algorithm and align the clusters using two nested loops. The outer loop iterates log_T K times. At each iteration of the inner loop, the set of subsequences inside the window is partitioned into clusters and the edge weights are updated. Thus, each iteration of the inner loop costs O(|E|) time; since G contains K vertices, O(|E|) = O(K^2). At the end of each iteration of the inner loop, all the clusters are optimally aligned. Optimally aligning T subsequences costs O(W^T 2^T) time. At the i-th iteration of the outer loop, O(K / T^i) such optimal alignments are done. Adding these steps, we find that the total cost of the i-th iteration of the outer loop is O((K / T^i) W^T 2^T + cK^2).

The number of outer-loop iterations is log_T K. Thus, the total cost of aligning a window is

    sum_{i=1}^{log_T K} O((K / T^i) W^T 2^T + cK^2)
        = O(W^T 2^T * sum_{i=1}^{log_T K} K / T^i + (log_T K) cK^2)
        = O(W^T 2^T * (K - 1)/(T - 1) + (log_T K) cK^2)
        = O((log_T K)(K W^T 2^T / T + cK^2))

Since there are M window positions in total, the total cost of QOMA2 is O(M log_T K (K W^T 2^T / T + cK^2)).

4.4 Experimental Evaluation
Experimental setup: We used BAliBASE benchmarks [5]: reference 1 from version 1 (www-igbmc.u-strasbg.fr/bioinfo/balibase/) and references 1 through 8 from version 3 (www-bio3d-igbmc.u-strasbg.fr/balibase/) to evaluate our method. We call this dataset D3 since it contains benchmarks with three or more sequences. We call the subset of D3 that contains all the benchmarks with at least 10 sequences D10. Similarly, we call the subset of D3 that contains all the benchmarks with at least 20 sequences D20. D3, D10, and D20 contain 440, 209, and 84 benchmarks respectively. We implemented the QOMA2 algorithm in standard C. We downloaded ProbCons [88], T-Coffee [2], MUSCLE [78], and ClustalW [1, 77] for comparison. We also downloaded DCA [47] since it aims to maximize the SP score as well. However, DCA did not run on the benchmarks in our datasets D10 and D20 since it cannot align a large number of sequences. We used BLOSUM62 as the measure of similarity between amino acids, since BLOSUM62 is commonly used; using other popular score matrices, such as BLOSUM90 or PAM250, produces similar results. We used gap cost = -4 to penalize each indel. In order to be fair, we used the same parameters (i.e., BLOSUM62 and gap

cost) for QOMA2, T-Coffee, MUSCLE, and ClustalW. We used the default parameters for ProbCons, since it was not possible to change them. Among the competing tools used in our experiments, MUSCLE aims to maximize the SP score, while ClustalW and T-Coffee aim to maximize a weighted version of the SP score. Therefore, one can argue that it is not fair to include ClustalW, T-Coffee, and ProbCons in our experiments. We include them nevertheless, since most of the existing tools that aim to maximize the SP score, such as DCA or MSA, do not work for a large number of sequences. We improve the fairness of our experiments by using the same parameters for all the tools. First, we compared different clustering algorithms and showed the relationship between the SP and the SW scores on each window. We then evaluated the impact of the window and the cluster size on the SP score of the QOMA2 alignment and on the running time of QOMA2. We also compared the SP scores of QOMA2 with those of four competing multiple sequence alignment tools. We ran our experiments on a system with dual 2.59 GHz AMD Opteron processors, 8 gigabytes of RAM, and a Linux operating system.

Dataset details: The distribution of the number of benchmarks with different numbers of sequences (K) is shown in Figure 4-3.

Correlation between the SP and the SW scores: The main hypothesis behind QOMA2 is that optimizing the SW score also optimizes the SP score; thus, QOMA2 aims to optimize the SW score by finding an appropriate clustering of the sequences. For a given window, the SW score is computed in O(K^2) time, as it requires estimating the gap cost for each pair of subsequences. The SP score, on the other hand, requires aligning the subsequences; computing it therefore costs O(M log_T K (K W^T 2^T / T + cK^2)) time. This makes QOMA2 desirable, since the SW score can be measured efficiently without actually finding the alignment of multiple sequences. In this experiment, we

Figure 4-3. The distribution of the number of benchmarks with different numbers of sequences (K).

evaluate the relationship between the SW and the SP scores. We also measure how each of the proposed clustering strategies performs. We place a window (W = 16) at all possible locations of an initial alignment. We find the clusters using the Min-Cut and the Max-Cut clustering algorithms (see Section 4.3.2). We also find clusters by applying the iterative refinement (see Section 4.3.3) to the results of Min-Cut and Max-Cut. We measure the average SP and SW scores obtained by these algorithms for T = 2, 3, and 4. We use the D20 dataset in this experiment. Table 4-2 presents the results. The results show that there is a strong correlation between the SP and the SW scores. For each value of T, the SP score gets larger as the SW score gets larger. This implies that optimizing the SW score can potentially optimize the SP score. This is an important observation, since the cost of computing the SW score is negligible compared to that of the SP score. Note that the SW scores obtained

Table 4-2. The average SW and SP scores of individual windows after applying different clustering algorithms for different values of T, with W = 16. The average SP score of the initial alignments in the windows is 351. The average upper bound to the SP score for the subsequences in the windows is […]. Benchmarks are selected from the D20 dataset.

    T | Min-Cut (SP, SW) | Min-Cut Iterative (SP, SW) | Max-Cut (SP, SW) | Max-Cut Iterative (SP, SW)

with different numbers of clusters are not comparable to each other, since they compute the gap cost under different cluster-size assumptions. The results also demonstrate that the iterative refinement helps improve both the SW and the SP scores of the Max-Cut and the Min-Cut algorithms. The Max-Cut algorithm with iterative refinement always has the best SP and SW scores. This implies that if the induced alignment of two subsequences has a high score compared to that of their optimal alignment, it is advantageous to keep them in the same cluster (i.e., force them to be almost optimally aligned). The SP scores of all the methods increase as the value of T increases. This is intuitive, since more subsequences are optimally aligned at once for large values of T. Another important observation that follows from these results is that optimally aligning clusters does not always improve the SP score of a window; it can actually reduce it. This happens especially for the Min-Cut clustering (with or without iterative refinement) for all values of T, as well as for the Max-Cut clustering with T = 2. This is because when the clusters of subsequences are aligned, they impose a certain alignment on the subsequences in each cluster. These restrictions limit the number of possibilities in which a set of clusters can be aligned together. This indicates that the clusters should be selected carefully.
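The move-based adjustment used by the iterative refinement (the pseudo-code shown earlier) can be sketched in Python. The representation of clusters and weights below is an illustrative assumption, not from the source; the loop repeatedly moves the vertex whose departure costs the least (minimizing uG - uG') from an oversized cluster to an undersized one, until no cluster exceeds T vertices.

```python
def balance_clusters(clusters, weight, T):
    """Size-balancing adjustment sketch. `clusters` is a list of vertex
    sets; `weight` maps frozenset({u, v}) to an edge weight. GK holds the
    oversized clusters (> T vertices) and LK the undersized ones (< T)."""
    def w(u, group):
        # sum of edge weights from u to every other vertex in `group`
        return sum(weight.get(frozenset((u, v)), 0) for v in group if v != u)

    while True:
        GK = [c for c in clusters if len(c) > T]
        LK = [c for c in clusters if len(c) < T]
        if not GK or not LK:
            break
        best, best_cost = None, float('inf')
        for G in GK:
            for Gp in LK:
                for u in G:
                    cost = w(u, G) - w(u, Gp)  # uG - uG' for this move
                    if cost < best_cost:
                        best, best_cost = (u, G, Gp), cost
        u, G, Gp = best
        G.remove(u)
        Gp.add(u)
    return clusters
```

Each move strictly shrinks one oversized cluster while never creating a new one, so the loop terminates once every cluster has at most T vertices.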

Table 4-3. The average SP scores of QOMA2 for individual windows. SP before and Upper bound denote the average initial SP scores and the average upper bounds to the SP scores for individual windows, respectively. Benchmarks are selected from the D10 dataset.

    W | SP before | Upper bound | T = 2 | T = 3 | T = 4 | T = 5

In the rest of the experiments, we select the Max-Cut clustering algorithm with iterative refinement as the default clustering algorithm of QOMA2.

Impact of W and T on the SP score: The QOMA2 algorithm hypothesizes that the SP score can be improved by increasing the values of W and T. In this experiment, we evaluate the impact of these parameters on the SP score of QOMA2. Table 4-3 shows the SP scores of individual windows aligned by QOMA2 for different values of W and T. The results show that the SP scores increase as T increases, for all values of W. Table 4-4 shows the SP scores of the alignments of the entire benchmarks in D10 using QOMA2 for varying values of W and T. As W and T increase, QOMA2 produces higher scores. The two extreme parameter choices, a very large value for one of these parameters combined with a very small value for the other (i.e., W = 16, T = 2 or W = 4, T = 5), produce lower SP scores than intermediate settings such as W = 12, T = 3. This is an important observation, since it validates that QOMA2 is superior to the two existing extreme solutions (see Figure 4-1).

Impact of W and T on the running time: Table 4-4 also shows the average running time of QOMA2 for optimizing a single window for varying values of W and T. The experimental results show that QOMA2 runs very efficiently even for a large number of sequences. As we have mentioned in Section 4.3.5, the time complexity of QOMA2 is O((log_T K)(K W^T 2^T / T + cK^2))

Table 4-4. The average SP scores of the alignments of the entire benchmarks in D10 using QOMA2. The average SP score of the initial alignments is […]. The average of the upper bounds to the SP scores of the benchmarks is […]. The average running times, in seconds, are shown in parentheses.

    W | T = 2     | T = 3         | T = 4         | T = 5
    … | … (1.173) | -6770 (0.653) | -6676 (0.403) | -6498 (0.465)
    … | … (1.213) | -5348 (0.673) | -4762 (1.053) | -4236 (5.050)
    … | … (1.116) | -4659 (0.808) | -3966 (3.619) | -3464 (13.485)
    … | … (1.097) | -4327 (1.102) | -3555 (8.856) | -2811 (40.132)

for a single window. The experimental results suggest that when W is large, the O((log_T K)(K W^T 2^T / T)) factor quickly dominates the running time. From Tables 4-3 and 4-4, we conclude that a good balance between time and quality is achieved at (W = 12, T = 4).

Comparison to existing tools: Table 4-5 presents the SP scores of the alignments of the benchmarks in D10 using four existing tools and QOMA2. Note that the compared tools do not aim to maximize the SP score; ClustalW, MUSCLE, and T-Coffee optimize a variation of the SP score by computing weights for sequences or subsequences. We still included this experiment because the existing tools that optimize the SP score, such as DCA [47], MSA [61] and COSA [111], do not work for a large number of sequences. For a small number of sequences, QOMA performs significantly better than DCA (see [99]). We divided the queries into four subsets according to the number of sequences they contain. The table shows that QOMA2 has a higher SP score than all the compared tools. ClustalW is always the second best. The remaining tools are not competitive in terms of the SP score.

Table 4-5. The average SP scores of QOMA2 (W = 12 and T = 4) and four other tools on the D10 dataset. The competing tools (except ProbCons) are run with the same parameters as QOMA2.

    K | ProbCons | T-Coffee | MUSCLE | ClustalW | QOMA2

CHAPTER 5
IMPROVING BIOLOGICAL RELEVANCE OF MULTIPLE SEQUENCE ALIGNMENT
In this chapter, we introduce a new graph-based multiple sequence alignment method for protein sequences. We name our method HSA (Horizontal Sequence Alignment) because it horizontally slides a window over the protein sequences simultaneously. HSA considers all the proteins at once. It obtains the final alignment by concatenating cliques of a graph. In order to find a biologically relevant alignment, HSA takes secondary structure information as well as amino acid sequences into account. The experimental results show that HSA achieves higher accuracy than existing tools on BAliBASE benchmarks. The improvement is more significant for proteins with low similarity.

5.1 Motivation and Problem Definition
Most heuristic multiple sequence alignment algorithms are based on the progressive application of pairwise alignment. They build up alignments of larger numbers of sequences by adding sequences one by one to an existing alignment [31]. We call this a vertical alignment, since it progressively adds a new sequence (i.e., a row) to a consensus alignment. These methods have the shortcoming that the order in which sequences are added to the existing alignment significantly affects the quality of the resulting alignment. This problem is more apparent when the percentage of identities among amino acids falls below 25%, the so-called twilight zone [88]. The accuracies of most progressive sequence alignment methods drop considerably for such proteins. We consider the problem of aligning multiple proteins and develop a graph-based solution to it. We name this algorithm HSA (Horizontal Sequence Alignment) as it horizontally aligns sequences. Here, horizontal alignment means that all proteins are aligned simultaneously, one column at a time. HSA first constructs a directed graph. In this graph, each amino acid of the input sequences maps to a vertex.
An edge is drawn between pairs of vertices that may be aligned together. The graph is then adjusted by

71 inserting gap vertices. Later, this graph is traversed to find high scoring cliques. Final alignment is obtained by concatenating these cliques. 5.2 Current Results We provide a heuristic solution for multiple sequence alignment for proteins. We name this algorithm HSA (Horizontal Sequence Alignment) as it horizontally aligns sequences. Here, horizontal alignment means that all proteins are aligned simultaneously, one column at a time. HSA first constructs a directed-graph. In this graph, each amino acid of the input sequences maps to a vertex. An edge is drawn between pairs of vertices that may be aligned together. The graph is then adjusted by inserting gap vertices. Later, this graph is traversed to find high scoring cliques. Final alignment is obtained by concatenating these cliques. The underlying assumption of HSA is that the residues that have same SSE types have more chance to be aligned compared to the residues that have different SSE types. This assumption is verified by a number of real experiments and observations [ ]. HSA works in five steps: (1) An initial directed graph is constructed by considering residue information such as amino acid and secondary structure type. (2) The vertices are grouped based on the types of residues. The residue vertices in each group are more likely to be aligned together in the following step. (3) Gap vertices are inserted to the graph in order to bring vertices in the same group close to each other in terms topological position in the graph. (4) A window is slid from beginning to end. The clique with highest score is found in each window and an initial alignment is constructed by concatenating these cliques. (5) The final alignment is constructed by adjusting gap vertices of the initial alignment. Next, we describe these five steps in detail Constructing Initial Graph This step constructs the initial graph which will guide the alignment later. Let s 1, s 2,, s k be the protein sequences to be aligned. 
Let s_i(j) denote the j-th amino acid of protein s_i. A vertex is built for each amino acid. The vertices corresponding to different

proteins are marked with different colors. Thus, the vertices of the graph span k different colors. If available, the Secondary Structure Element (SSE) type (α-helix, β-sheet) of each residue is also stored along with the vertex. For simplicity, the SSE types are α-helix, β-sheet, and no SSE information, as shown in Figure 5-1. Two types of edges are defined. First, a directed edge is included from the vertex corresponding to s_i(j) to s_i(j+1) for all consecutive amino acids. Second, an undirected edge is drawn between pairs of vertices of different colors if their substitution score is higher than a threshold. HSA takes the substitution score from the BLOSUM62 matrix. A weight is assigned to each undirected edge as the sum of the substitution score and the typescore of the amino acid pair that makes up that edge. The typescore is computed from the SSE types: if two residues belong to the same SSE type, their typescore is high; otherwise, it is low. We discuss this in more detail in Section 5.2.2. This policy of weight assignment gives residues with the same SSE type, or with similar amino acids, a higher chance of being aligned in the following steps. Figure 5-1 demonstrates this step on three proteins. The amino acid sequences and the SSEs are shown at the top of the figure. The dotted arrows represent the undirected edges between two vertices of different colors; the solid arrows only appear between the vertices corresponding to consecutive amino acids of the same protein, and they have one direction, from left to right.

5.2.2 Grouping Fragments
The graph constructed in the first step shows the similarity of pairs of residues. However, multiple alignment involves the alignment of groups of amino acids rather than pairs. In this step, we group the fragments that are more likely to be aligned together. Here, a fragment is defined by the following four properties: 1) It is composed of consecutive vertices. 2) All the vertices have the same color.
3) All the vertices have the same SSE type. 4) There is no other fragment that contains it. For example, in Figure 5-2, S_1 consists of four fragments: f_1 = LT, f_2 = GKTIV, f_3 = E, and f_4 = IAK. Thus, S_1 can be written as S_1 = f_1 f_2 f_3 f_4.
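A fragment, as defined above, is just a maximal run of residues of one sequence sharing one SSE label. A minimal sketch, assuming a per-residue SSE string with 'H' for α-helix, 'E' for β-sheet, and '-' for no SSE (these labels are illustrative, not from the source):

```python
def fragments(sequence, sse):
    """Split `sequence` into maximal runs of residues sharing one SSE
    label; `sse` gives one label per residue. Returns a list of
    (fragment, sse_label) pairs in order."""
    out, start = [], 0
    for i in range(1, len(sequence) + 1):
        # close the current fragment at the end or when the SSE type changes
        if i == len(sequence) or sse[i] != sse[start]:
            out.append((sequence[start:i], sse[start]))
            start = i
    return out
```

For the S_1 of the example, an SSE string that marks GKTIV as helix and IAK as sheet reproduces the four fragments f_1 through f_4.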

Figure 5-1. The initial graph constructed for sequences S_1, S_2 and S_3. Each residue maps to a vertex in this graph. The figure shows some edges between the first vertices of the sequences, indicated by dashed arrows. The vertices of different sequences are marked with different colors (colors not shown in the figure).

With the knowledge that fragments with the same SSE type are more likely to be aligned, all sequences are scanned to find fragments with known SSE types. The fragments are then clustered into groups, where each group consists of one fragment from each sequence. To group fragments, we align the fragments first. We use a simplified dynamic programming algorithm that treats each fragment as a residue in the basic algorithm [28]. The score of a fragment pair is computed from the following formula:

    totalscore = typescore - positionPenalty - lengthPenalty

The typescore is computed from the SSE types. Fragments with the same SSE type contribute a high score, whereas fragments of different SSE types incur a penalty. This is because of our assumption that residues with the same SSE type have a higher chance of being aligned. The typescore is calculated as follows: we check the types of the two fragments and return a number according to the following five conditions. 1) They are both of type α-helix: we return 4. 2) They are both of type β-sheet: we return 2. 3)

Figure 5-2. Fragments with similar features, such as SSE types, lengths and positions in the original sequences, are grouped together.

They both have no SSE type: we return 1. 4) They are an α-helix and a β-sheet: we return -4. 5) Otherwise, we return 0. The positionPenalty is computed as the difference between the positions of the two fragments; here, the position of a fragment is its topological position in the original sequence. If two fragments are far apart in their sequences, the pair gets a higher penalty, because the alignment of such fragments introduces many gaps. The lengthPenalty is computed as the difference between the lengths of the two fragments, where the length of a fragment is the number of residues it contains. Fragment pairs with similar lengths are given a smaller penalty, because the more the lengths of a fragment pair differ, the more gap vertices need to be inserted in the later alignment. Figure 5-2 demonstrates how HSA groups fragments. Continuing the example of Figure 5-1, fragments with the same SSE type and with similar positions and lengths are clustered into the same group. Two such groups, with α-helix and β-sheet, are circled in Figure 5-2.
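The three terms of the totalscore formula can be sketched as one function. The five typescore cases follow the text; the unit weights on the two penalty terms are an assumption for illustration, as are the 'H'/'E'/'-' labels.

```python
def fragment_pair_score(type_a, type_b, pos_a, pos_b, len_a, len_b,
                        w_pos=1.0, w_len=1.0):
    """totalscore = typescore - positionPenalty - lengthPenalty for a
    pair of fragments; w_pos and w_len are illustrative penalty weights."""
    if type_a == type_b == 'H':           # 1) both alpha-helix
        typescore = 4
    elif type_a == type_b == 'E':         # 2) both beta-sheet
        typescore = 2
    elif type_a == type_b:                # 3) both without SSE annotation
        typescore = 1
    elif {type_a, type_b} == {'H', 'E'}:  # 4) helix against sheet
        typescore = -4
    else:                                 # 5) SSE against no-SSE
        typescore = 0
    position_penalty = w_pos * abs(pos_a - pos_b)
    length_penalty = w_len * abs(len_a - len_b)
    return typescore - position_penalty - length_penalty
```

With this scoring, two well-placed helices of equal length get the maximum score of 4, while a helix paired with a sheet is penalized regardless of position.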

Figure 5-3. A gap vertex is inserted to bring the fragments in the same group vertically close to each other.

5.2.3 Fragment Position Adjustment
Once the groups of fragments are determined, we update the graph to bring the fragments in the same group close to each other in terms of vertical position. Here, vertical position refers to a position in the topological order of the vertices of the same color. For example, in Figure 5-3, vertex L in S_1, vertex P in S_2, and vertex P in S_3 are at the same vertical position 1; similarly, vertex T in S_1, vertex N in S_2, and vertex S in S_3 are at the same vertical position 2, and so on. As we will discuss later, this process increases the possibility that the vertices in these fragments are aligned. We update the graph by inserting gap vertices, as shown in Figure 5-3. First, we compute the number of gap vertices to be inserted based on two factors: 1) the number of residues in the fragments, and 2) the relative positions of the fragments in the same group. A good relative position of fragments is one that leads to a high-scoring alignment of the vertices in these fragments; we align the vertices in the fragments of the same group to compute those positions. Then, we randomly select a position between two consecutive fragment groups. Finally, for each sequence we insert gap vertices at these

positions to bring the fragments within the same group together. In Figure 5-3, a gap vertex is inserted before residue I in S_3 to bring the fragments in the group with β-sheet type close to each other.

5.2.4 Alignment
So far, we have prepared the graph for the actual alignment in two ways: (1) we determined the vertex pairs that can be part of the alignment, and (2) we brought the sequences to roughly the same size by inserting gap vertices, while keeping similar vertices vertically close. In this step, the sequences are actually aligned by scanning the updated graph in topological order. As demonstrated in Figure 5-4, we start by placing a window of width W at the beginning of each sequence. This window defines a subgraph of the graph. Typically, we use W = 4 or 6; the example in Figure 5-4 uses W = 3. Next, we greedily choose the clique with the best expectation score from this subgraph (we define the expectation score of a clique below). A clique here is defined as a complete subgraph that consists of one vertex of each color. In other words, if K sequences are to be aligned, a clique corresponds to the alignment of one letter from each of the K sequences; each clique thus produces one column of the multiple alignment. For each clique, we align the letters of that clique, and iteratively find the next best clique that 1) does not conflict with this clique, and 2) has at least one letter next to a letter in this clique. This iteration is repeated t times to find t columns (typically, t = 4). These t cliques define a local alignment of the input sequences, and the expectation score of the original clique is defined as the SP score of this local alignment. After finding the clique with the highest expectation score, we add it as a column to the existing alignment. We then slide the window to the location immediately after the clique found and repeat the same process until we reach the end of the sequences. Each clique defines a column in the multiple alignment.
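Because a clique contains exactly one vertex of each color, the search inside one window can be sketched as a brute-force scan over one candidate position per sequence. The sketch below scores a candidate column by the sum-of-pairs of its residues; the t-step expectation-score lookahead and the conflict checks are omitted for brevity, and all names are illustrative.

```python
from itertools import product

def best_column(window_rows, pair_score):
    """Pick one position from each sequence's window slice so that the
    sum-of-pairs score of the chosen residues is maximal. This is the
    O(W^K) step discussed in the performance evaluation."""
    K = len(window_rows)
    best, best_score = None, float('-inf')
    # one candidate position per sequence (one vertex of each color)
    for choice in product(*[range(len(r)) for r in window_rows]):
        s = sum(pair_score(window_rows[i][choice[i]],
                           window_rows[j][choice[j]])
                for i in range(K) for j in range(i + 1, K))
        if s > best_score:
            best, best_score = choice, s
    return best, best_score
```

For three 3-letter windows this enumerates 27 candidate columns; with an identity score it selects the column where all three sequences contribute the same letter.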
The columns are concatenated and gaps are inserted to align them. Figure 5-4 illustrates this step: in the window (circled by the dotted rectangle), the clique with the highest expectation score

Figure 5-4. Cliques found in the sliding window (window size = 3) become the columns of the resulting alignment. Gaps are inserted to concatenate these columns.

(the left column marked with a shaded background) consists of residues T, R, and I in S_1, S_2 and S_3, respectively. Then, the window slides to the next location toward the right of the graph (this window is not shown in Figure 5-4), and the clique with the highest expectation score in that window (the right column marked with a shaded background) consists of residues V, V, and C in S_1, S_2 and S_3, respectively. The two cliques found (marked by shaded backgrounds) become two columns of the resulting alignment, which is obtained by inserting a gap vertex into S_3. As mentioned in Section 5.2.1, due to the policy of edge weight assignment, cliques that contain vertices of the same SSE type or similar amino acids score higher than other possible cliques. Since a clique contains one vertex of each color, finding the best clique does not assume any order of traversal of the vertices of different colors. Thus, unlike existing tools, our method is order independent.

5.2.5 Gap Adjustment
After concatenating the cliques in the previous step, short gaps may be scattered in the sequences. In this step, the alignment obtained in the previous step is adjusted by moving

Figure 5-5. Gaps are moved to produce longer and fewer gaps. We favor gaps outside the fragments of type α-helix and β-sheet.

the gaps as follows. The sequences are scanned from left to right to find isolated gaps. If a gap is inside a fragment of type α-helix or β-sheet, it is moved outside of that fragment, either before or after it; we choose the direction that produces the higher alignment score. If a gap is inside a fragment with no SSE type, it is moved next to the neighboring gap only if the movement produces a higher score than the current alignment. Figure 5-5 shows the movement of the first gap vertex in S_3 (i.e., the gap vertex between residues I and C). This gap vertex is inside a fragment of type α-helix, so it is moved out and combined with the next gap vertex. The final alignment is obtained by mapping each vertex in the final graph back to its original residue.

5.2.6 Experimental Results
In order to demonstrate the feasibility of our method, we ran it on BAliBASE benchmarks [5]. We chose the

benchmarks that contain SSE information, since our algorithm needs the SSE information of the sequences. We downloaded ClustalW [1, 77], ProbCons [88], MUSCLE [78] and T-Coffee [2] for comparison, since they are the most commonly used and the most recent tools. We ran all experiments on a computer with a 3 GHz Intel Pentium 4 processor and 1 GB of main memory, running Windows XP.

Evaluation of alignment quality: Alignment of dissimilar proteins is usually harder than alignment of highly similar proteins. Tables 5-1, 5-2 and 5-3 show the BAliBASE scores of HSA, ClustalW, ProbCons, MUSCLE and T-Coffee on benchmarks with low, medium, and high similarity, respectively. From Table 5-1, we conclude that for low-similarity benchmarks our method outperforms all the other tools. On average, HSA achieves a score of 0.619, better than any other tool. HSA finds the best result for 14 out of 21 reference benchmarks, and is the second best in 5 of the remaining 7. Table 5-2 shows that for sequences with 20-40% identity, HSA is comparable to the other tools on average: its average score is not the best, but it is only slightly worse than the winner of this group (0.909 versus 0.901). HSA performs best in 2 cases out of 7, including a case for which HSA gets a full score. In Table 5-3, HSA scores higher than the other tools on average. HSA performs best in 2 cases out of 7, including a case for which HSA gets a full score. The high scores of the existing methods for sequences with a high percentage of identity (Tables 5-2 and 5-3) show that there is little room for improvement for such sequences. Proteins in the twilight zone (Table 5-1) pose a greater challenge, and these results show that our algorithm performs best for such sequences. For medium- and high-similarity benchmarks, our results are comparable to those of existing tools. Table 5-4 shows the SP scores of HSA, ClustalW, ProbCons, MUSCLE, T-Coffee and the original BAliBASE alignments.
On average, ClustalW, MUSCLE, and T-Coffee find the highest SP scores for low-, medium-, and high-similarity sequences, respectively. However, according to Tables 5-1 to 5-3, those methods have relatively low BAliBASE

Table 5-1. The BAliBASE score of HSA and other tools, less than 25% identity. Columns: ClustalW, ProbCons, MUSCLE, T-Coffee, HSA. Rows: Short (1aboA, idy, r, tvxA, ubi, wit, trx), Medium (1bbt, sbp, havA, uky, hsdA, pia, grs), Long (1ajsA, cpt, lvl, pamA, ped, myr, enl), each with a group average, plus an overall average.

scores. This means that the alignment with the highest SP score is not necessarily the most meaningful alignment. The SP score of HSA is comparable to those of the other tools on the

Table 5-2. The BAliBASE score of HSA and other tools, 20%-40% identity. Columns: ClustalW, ProbCons, MUSCLE, T-Coffee, HSA. Rows: 1fjlA, csy, tgxA, ldg, mrj, pgtA, ton; Avg.

average. For low-similarity benchmarks, the average SP score of HSA is higher than the average SP score of the reference alignments.

Table 5-3. The BAliBASE score of HSA and other tools, more than 35% identity. Columns: ClustalW, ProbCons, MUSCLE, T-Coffee, HSA. Rows: 1amk, ar5A, led, ppn, thm, zin, ptp; Avg.

Performance Evaluation
The time complexity of our algorithm is O(W^K N + K^2 M^2), where K is the number of sequences, W is the sliding-window size, N is the sequence length, and M is the number of fragments in a protein sequence. The complexity is computed as follows. The clique with the highest expectation score in a window is found in O(W^K) time, and there are N positions for the sliding window. O(K^2 M^2) time is required for aligning the fragments. Usually, M << N; thus, the total time complexity in practice is O(W^K N). Typically, W is a small number such as 4, so for reasonably small K, W^K N = O(N); therefore, for small K, the complexity is O(N). As K increases, the complexity increases quickly. However, this complexity is observed only if the subgraphs inside a window are highly connected. It is possible to get rid of the W^K term in the complexity by using longest-path methods rather than clique-finding methods. The experimental results in Table 5-5 are consistent with the above analysis. In general,

Table 5-4. The SP score of HSA and other tools. Columns: REF, ClustalW, ProbCons, MUSCLE, T-Coffee, HSA. Rows: Short <25%; Medium <25%; Long <25%; Short 20%-40%; Medium 20%-40%; Medium >35%; Avg overall.

Table 5-5. The running time of HSA and other tools, measured in milliseconds. (Columns: ClustalW, ProbCons, MUSCLE, T-Coffee, HSA; rows: Short <25%, Medium <25%, Long <25%, Short 20%-40%, Medium 20%-40%, Medium >35%, and an overall average.)

In general, ClustalW performs best. However, ClustalW achieves this at the expense of low accuracy (see Figures 5-1 to 5-3). HSA is slower than ClustalW and MUSCLE. It is, however, faster than ProbCons and T-Coffee.

CHAPTER 6
MODULE FOR AMPLIFICATION OF PLASTOMES BY PRIMER IDENTIFICATION

The chloroplast is the site of photosynthesis, and is therefore critical to plant growth, development and agricultural output. The chloroplast genome is also relatively small, yet despite its approachable size and importance, only a small number of chloroplast genomes have been sequenced. The dearth of information is due to the requisite preparation, frequently requiring isolation of plastids and generation of plasmid-based chloroplast DNA libraries. The method presented in this chapter tests the hypothesis that rapid, inexpensive, yet substantial sequence coverage of an unknown target chloroplast genome may be obtained through PCR-based means. A computational approach predicts a large number of overlapping primer pairs corresponding to conserved coding regions of known chloroplast genomes. These computer-selected primers are used to generate PCR-derived amplicons that may then be sequenced by conventional methods. This chapter considers the problem of finding a saturating number of overlapping primer pairs to bracket the maximum possible coverage of the unknown target DNA sequence. None of the currently available primer prediction tools considers gene and intergenic information, and most use only one reference sequence, which limits their power and accuracy. This chapter provides a heuristic solution, named MAPPIT, to the above problem, which is divided into the tasks of first identifying universal primers and then assessing spatial relationships between the primer pair candidates. Two strategies have been developed to solve the first problem. The first employs multiple alignment, and the second identifies motifs. The distance between primers, their alignment within gene coding regions, and most of all their presence in multiple reference genomes narrow the primer set. Primers generated by the MAPPIT module provide substantially more coverage than those generated via Primer3.
Motif-based strategies provide more coverage than multiple-alignment-based approaches. As predicted, primer selection improves when based on a larger reference set. The computational predictions were tested in the laboratory and

demonstrate that substantial coverage may be obtained from a set of eudicots, and at least partial sequence may be obtained from distant taxa.

6.1 Motivation and Problem Definition

DNA sequence information is the basis of many disciplines of biology, including molecular biology, phylogenetics and molecular evolution. The sequence information of a plant cell resides in three physically distinct compartments, namely the nucleus, the mitochondrion, and the plastid. Each encodes proteins required for cell form and function, and each is subject to different mechanisms of selection and inheritance. The green plastid, the chloroplast, is an important organelle. It is the site of photosynthesis and several other important metabolic processes, and is therefore critical to plant growth, development and agricultural output. The plastome, or chloroplast genome, holds a wealth of functional and phylogenetic information. By mining sequence information from many species, important taxonomic relationships may be resolved, complementing associations built from studies of variability in morphology, as well as biochemical and nuclear-genome-based molecular markers. Also, genetic engineering of the chloroplast requires a foundation of sequence information. The chloroplast genome maintains a great degree of conservation in gene content and organization. Thus, a relatively high level of synteny exists between plastid genomes derived from distantly related taxa [10]. The chloroplast genome is much smaller than the nuclear genome, yet only a small number of these extra-nuclear genomes have been sequenced. Traditionally, plastid genomes have been sequenced only after generating extensive plasmid-based libraries of the plastid DNA. Plastid DNA extraction relies on difficult, sometimes problematic and typically time-consuming preparative procedures.
Recently, several reports have increased plastid sequencing throughput by amplifying the isolated plastid DNA using rolling circle amplification (RCA) [33]. However, obtaining sequence through RCA still requires this intermediate isolation step. More recently, the ASAP method showed that sequence information could be gathered by creating templates from plastid

DNA based on conserved regions of plastid genes [32]. ASAP uses conserved primers (short, single-stranded DNA fragments that initiate enzyme-based DNA strand elongation) to flank unknown regions, and the regions are amplified using the polymerase chain reaction (PCR). PCR involves the exponential amplification of a finite length of DNA in a cell-free environment [116], and it is frequently used to generate a large quantity of specific DNA sequences for forensic applications. The procedure relies on a thermostable enzyme known as Taq DNA polymerase, which elongates specific DNA sequences bracketed by primer homology. A primer is classified as a forward or reverse primer depending on its orientation relative to the target sequence. For instance, a forward and a reverse primer that flank a given gene allow amplification of the bracketed sequence in the presence of DNA polymerase, nucleotides and appropriate cofactors. Use of PCR depends on many successive rounds of primer annealing and subsequent template elongation to amplify a sequence of interest. The ASAP method is fast and cost effective. However, in the initial report, the required primers were selected by visual inspection of target sequences. This restricted the ASAP study to a small region of the chloroplast genome. To expand this technique to an entire chloroplast genome, an efficient method is required to facilitate primer selection. More importantly, such a method will allow the selected primer set to be updated based upon the availability of new plastid sequences. This chapter presents the Module for Amplification of Plastomes by Primer Identification, or MAPPIT. The MAPPIT tool uses the information in database-resident reference plastid genomes to predict a set of conserved primers that will generate overlapping amplicons for sequencing. The power of MAPPIT is that it would theoretically gain accuracy and precision as the reference sequence set grows.
MAPPIT uses two approaches to identify the primers, namely a multiple alignment-based approach and a motif-based approach. The first approach develops a multiple alignment strategy. The proposed multiple alignment method is a variation of the traditional progressive multiple alignment strategy that weights the coding regions of the genomes, increasing the probability that the primers

identified reside in the coding regions of associated genes. Once a multiple sequence alignment of the reference genomes is obtained, a window is slid over the consensus sequence to identify the subsequences that satisfy the constraints that designate primer candidates. Individual primer candidates are then assessed for their relative association with other primer candidates to assign feasible primer pairs. The second approach is based on motif identification. This method recognizes potential primers from each reference genome separately. It then identifies a subset of these primers that occur frequently in a subset of the reference genomes. The presence in multiple genomes adds support to any primer being assigned to the final primer set. Two solutions have been developed to identify the final set of primer pairs from the candidates, namely order dependent and order independent, depending on whether or not they consider primer order when computing the support values. Finally, a computational method has been developed to measure the quality of the identified primer pairs. Experimental results show that the primer pairs designed cover up to 81% of an unknown target sequence. Randomly selected primer pairs devised by MAPPIT were used in laboratory experiments to validate the computational predictions.

We first define several terms: A DNA sequence is represented by a string over four letters, A, C, G and T, as the bases, and two extra symbols: N for unknown bases and - for gaps. A primer is defined as a sequence which satisfies certain constraints. The length of a primer p, denoted length(p), is the number of characters it contains. Let s[i : j] denote the subsequence of s from position i to position j. A primer p binds to a DNA sequence s at position i if p and s[i : i + length(p) − 1] are similar. Two sequences are considered similar if they have sufficient percent identity. In practice, 93% identity is required for primer similarity.
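The binding and similarity definitions above can be sketched in a few lines (a minimal illustration assuming an ungapped comparison and 0-based Python slicing; the function names are ours, not part of MAPPIT):

```python
def percent_identity(a: str, b: str) -> float:
    """Ungapped percent identity between two equal-length strings."""
    assert len(a) == len(b)
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)

def binds(primer: str, s: str, i: int, threshold: float = 93.0) -> bool:
    """True if `primer` binds to sequence `s` at position `i`, i.e. the
    primer and the window of s starting at i meet the identity threshold."""
    window = s[i:i + len(primer)]
    if len(window) < len(primer):
        return False  # primer would run off the end of s
    return percent_identity(primer, window) >= threshold
```

For example, a 20-letter primer tolerates one mismatch (19/20 = 95% identity) but not two (18/20 = 90%), which falls below the 93% threshold.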
A partial order on primers p and q with respect to a sequence s, denoted p <_s q, holds if the position of p is before the position of q in s. Let f and r denote a forward and a reverse primer, respectively. Assume that f and r bind to s[i : i + length(f) − 1]

Figure 6-1. Example of primer pairs on a target sequence: f and r stand for forward and reverse primers, respectively, and the directions of the primers are shown. The <f_1, r_1> pair covers a region a_1 and constructs a contig, Contig_1; pairs <f_2, r_2> and <f_3, r_3> cover regions a_2 and a_3, which construct a contig, Contig_2, since a_2 and a_3 overlap.

and s[j : j + length(r) − 1]. The distance between f and r with respect to s, d_s(f, r), is defined as

    d_s(f, r) = j + length(r) − i    if i < j
    d_s(f, r) = ∞                    otherwise

A primer pair <f, r> identifies the fragment s[i : i + d_s(f, r) − 1] from s if d_s(f, r) is less than a given cutoff. This cutoff is usually 1000 and is determined by the limitations of currently available automated sequencing methods. Two fragments of s, say s_1 and s_2, identified by two primer pairs can be combined to form a contig if s_1 and s_2 have sufficient overlap. In practice, an overlap of at least 100 letters denotes a contig with high confidence; shorter overlaps are not combined, as they may indicate random overlaps. Given a set of primer pairs P = {<f_1, r_1>, <f_2, r_2>, …, <f_k, r_k>}, we define the coverage of P on a sequence s as the total number of letters of s that can be identified using P. We define the primer pair finding problem as follows: Given a target sequence T and a set of reference sequences S = {S_1, S_2, …, S_K}, where each S_i is homologous to T, the goal is to find a set of primer pairs <f_i, r_i>, i

∈ {1, 2, …, k}, such that the set (1) has a large coverage on T and (2) produces a small number of contigs from T. An example is shown in Figure 6-1. In this example, a target DNA sequence and six primers are shown. Primers f_1 and r_1 construct a primer pair <f_1, r_1> since d_s(f_1, r_1) is within the distance limitation L. This pair constructs a contig (Contig_1) on the target. Primer pairs <f_2, r_2> and <f_3, r_3> have an overlap greater than the overlap threshold V; therefore these two primer pairs produce another contig (Contig_2).

6.2 Related Work

Rapid and cost-effective DNA sequence acquisition is one of the core problems in bioinformatics research. Sequencing methods mainly fall into two classes: whole-genome shotgun (WGS) assembly and PCR-based assembly. The whole-genome shotgun assembly technique has been remarkably successful in efforts to determine the sequence of bases that make up a genome [23]. CAP3 belongs to this category [117]. The accuracy of the sequences assembled using WGS methods suffers because of read errors and repeats [118]. WGS methods also incur very high computation cost due to the large number of pairwise sequence comparisons, and they need an additional finishing phase. On the other hand, PCR-based sequencing methods are more accurate. However, their processing time is usually much longer and the cost of processing is higher. Recently, Dhingra and Folta proposed a new sequencing method, called ASAP [32], to overcome the shortcomings of PCR-based methods. ASAP exploits the fact that chloroplast genomes are extremely well conserved in gene organization, at least within major taxonomic subgroups of the plant kingdom. It is a universal, high-throughput, rapid PCR-based technique to amplify, sequence and assemble plastid genome sequence from diverse species in a short time and at reasonable cost. The ASAP method finds the multiple alignment of a set of reference genomes that are homologous to the target genome using ClustalW [1].
Domain experts then identify conserved primer pairs from the multiple alignment through visual inspection. ASAP uses these primer pairs to generate

1-1.2 kbp overlapping amplicons from the inverted repeat region in 14 diverse genera, which can be sequenced directly without cloning [32]. The manual primer identification step is the bottleneck of ASAP. Efficient computational methods are needed to automate this process. Also, as we discuss later, ASAP can miss potential primers since it uses ClustalW for multiple alignment. This is because ClustalW maximizes the overall alignment score for the entire sequences. Primers, however, are short sequences scattered across the entire sequence. Thus, short conserved regions can be missed using ClustalW when the sequences have many indels. Similar to ASAP, PriFi [119] uses multiple sequence alignment to identify primers. It also uses ClustalW to obtain the multiple alignment. PriFi has the same shortcomings as ASAP. PriFi also has the shortcoming that it cannot automatically identify introns. Multiple sequence alignment has many applications in the biological sciences, such as gene prediction [7] and improving local alignment quality [20]. Multiple sequence alignment methods can be classified into two groups: optimal and heuristic methods. MSA [61] is a representative of the optimal solutions. Heuristic methods are much more popular because of their low time complexity. ClustalW [1, 77], ProbCons [88], T-Coffee [2] and MUSCLE [78] are some examples of heuristic strategies.

6.3 Current Results

Finding Primer Candidates

In this section, we discuss how we construct the set of candidate primers (forward and reverse) from reference sequences. Our final goal is to obtain a set of primers which should cover the unknown target sequence. Therefore, the primers found in this step should be selected according to their probability of being in the target sequence. Let T denote the target sequence. Let S = {S_1, S_2, …, S_K} denote the set of reference sequences homologous to T.
Similar to the ASAP method, we assume that a primer p appears in T with high probability if it appears in most of the reference sequences. We say that p appears in a given sequence if that sequence has a subsequence whose alignment with

p has a percent identity greater than a given threshold. This threshold is usually chosen as 93% for practical purposes (see Section 6.1). We define the support of a primer p on a sequence S_i as:

    support(p, S_i) = 1    if p appears in S_i
    support(p, S_i) = 0    otherwise

We define the support of a primer p on the sequence set S as:

    support(p, S) = (100 / K) × Σ_{S_i ∈ S} support(p, S_i)

A primer is considered a candidate primer only if it satisfies the following two criteria: Conservation Criteria: A primer has to have sufficient support on the set S. In practice % support is sufficient. CG-content Criteria: A forward primer has to satisfy the following two criteria in order to successfully amplify the target: (1) the last letter should be C or G, and (2) at least two of the last six letters should be C or G. Reverse primers have the symmetric restriction: the first letter should be C or G, and at least two of the first six letters should be C or G.

We develop two strategies to obtain a set of candidate primers. The first one is an extension of the ASAP method and uses multiple alignment. The second one finds primer candidates for each reference genome separately and then merges the candidates progressively. We describe them in the following sections.

Multiple sequence alignment-based primer identification

One way to find candidate primers is to align all the reference sequences using a multiple alignment method. A window is then slid over the resulting alignment. The length of the window is equal to the desired primer length. Each window position that satisfies the conservation and CG-rate criteria defines a forward or reverse primer candidate. In this approach, the multiple alignment brings similar subsequences of all the reference sequences together.
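The support and CG-content criteria above can be sketched as follows (a hedged sketch: `appears_in` uses a simple ungapped scan over all windows rather than a full alignment, and the helper names are ours, not part of MAPPIT):

```python
def appears_in(primer: str, seq: str, threshold: float = 93.0) -> bool:
    """True if seq has a window whose percent identity with primer
    reaches the threshold (ungapped scan over all window positions)."""
    m = len(primer)
    for i in range(len(seq) - m + 1):
        window = seq[i:i + m]
        identity = 100.0 * sum(a == b for a, b in zip(primer, window)) / m
        if identity >= threshold:
            return True
    return False

def support(primer: str, references: list) -> float:
    """support(p, S): percent of the K reference sequences in which p appears."""
    return 100.0 * sum(appears_in(primer, s) for s in references) / len(references)

def passes_cg(fragment: str, forward: bool = True) -> bool:
    """CG-content criteria: the terminal letter must be C or G, and at least
    two of the six terminal letters must be C or G (forward primers are
    checked at the end of the fragment, reverse primers at the start)."""
    end_six = fragment[-6:] if forward else fragment[:6]
    terminal = fragment[-1] if forward else fragment[0]
    return terminal in "CG" and sum(c in "CG" for c in end_six) >= 2
```

A primer present in every reference sequence thus gets support 100, and a fragment ending in ...ATGG passes the forward CG test while one ending in ...ATAG does not.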

Figure 6-2. An example of computing the SP score of a multiple sequence alignment. Regions A and C contain primers, so we include their SP scores when we compute the SP score of the alignment. Region B has no primer inside, so we treat its SP score as zero.

Alignment: A trivial approach here is to use an existing alignment strategy, such as ClustalW [1, 77]. The underlying problem, however, differs from traditional multiple alignment. This is because traditional multiple alignment methods aim to maximize the overall alignment score. However, in order to find primers we only need to identify short, highly conserved regions in the reference sequences. The non-conserved regions of less than 1000 bases between two primer candidates can be disregarded, as these regions will be identified during the PCR amplification process. Figure 6-2 illustrates this. In the figure, a forward primer region A and a reverse primer region C are shown; we only maximize the SP scores of A and C. The region B, which contains no primer, is not considered when computing the SP score of the whole alignment.

We propose a variation of the hierarchical clustering algorithm [71]. It follows from two observations: (1) The gene regions of a set of homologous sequences are usually highly conserved, while their intergenic regions can show high variation in length and letter content. (2) Primers need to have a sufficient CG rate. For each reference sequence, we read the locations and lengths of genes from data source files, which are previously downloaded from GenBank. We also scan the sequence and find regions which have a lower CG rate than the required cutoff for a primer. We tag these

regions as unpromising. We replace the letters in such regions with N; in other words, we mask these regions. During the alignment of the sequences, we compute a weighted score of the alignment: the scores of letters tagged as genes are scaled up using a predefined weight constant, and the scores of letters tagged as N are computed as 0. We apply an affine gap penalty strategy to reduce the number of gaps. We use an algorithm extended from the alignment method of Myers and Miller [65] to reduce the memory requirement, since the reference genomes are usually very long. We use the Sum-of-Pairs (SP) score to evaluate the score of an alignment. The alignment algorithm is described as follows. First, we compute the alignment score between each pair of sequences and construct an initial score table; the initial profiles to be aligned are the original sequences. Second, we select the pair of profiles with the highest score in the score table and obtain a new profile from the alignment of these two profiles. Third, we remove the two profiles and add the new profile to the profile set; we calculate the SP score when scoring two elements from two profiles. Fourth, we construct a new pairwise alignment score table. Fifth, we repeat steps two through four until only one profile is left. The final profile is the resulting alignment.

Primer selection: We first construct a consensus string from the multiple alignment. To do this, we scan the alignment from the beginning to the end. For each column of the alignment, we choose the most frequent character as its consensus character. We compute the conservation rate of the consensus character of each column as the percentage of the appearances of this character in that column. We then slide a window from the beginning to the end of the consensus string. The window has the same size as the primer. For each window, we check whether the fragment in the window satisfies the CG-rate and conservation-rate criteria.
The fragments which pass the test become primers. Depending on the CG positions, a fragment is inserted into either

the forward primer set, the reverse primer set, or both. For each primer, we keep its sequence and its position in the consensus sequence.

Motif-based primer identification

Multiple alignment of reference sequences provides primer candidates from conserved regions. However, there are two drawbacks to this approach. First, variations between intergenic regions can cause shifts in the alignment. As a result, some of the conserved regions may not be observed in the consensus sequence. Weighting the genes partially alleviates this problem. However, it is not sufficient, as the intergenic regions can also contain primers. Second, multiple alignment cannot find all conserved regions if there are translocations in the reference genomes. In this section, we propose a new strategy to address these problems. Our solution first finds possible primers from each sequence separately, without considering any conservation constraints. It then finds common primers with sufficient support by iteratively merging the primer sets. We discuss these steps in more detail next.

We start by constructing a set of possible forward primers F_i and a set of reverse primers R_i for each reference sequence S_i. To do this, we slide a window of primer length on each reference sequence. Each position of the window produces a fragment. The fragments that satisfy the CG criteria for primers are inserted into the corresponding primer set. Let F_i = {f_i,1, f_i,2, …, f_i,m_i} and R_i = {r_i,1, r_i,2, …, r_i,n_i} denote the primers found for S_i. For each primer f_i,j, two values are stored: support and location, denoted by support(f_i,j) and location(f_i,j). The support and location of f_i,j are initialized to one and to the position of f_i,j in S_i, respectively. The support and location of all reverse primers are computed in the same way. We propose two strategies to find candidate primers from these primers. We explain our strategies for candidate forward primers; candidate reverse primers are found exactly the same way.
The only difference is that we use R_i instead of F_i.
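The per-reference construction of F_i and R_i can be sketched as a sliding-window scan (a simplified sketch; the default primer length, the dictionary record layout, and the skipping of masked letters are our assumptions):

```python
def candidate_primers(seq: str, primer_len: int = 20):
    """Slide a primer-length window over one reference sequence and keep
    fragments satisfying the CG criteria. Returns (forward, reverse) lists
    of records; support starts at 1 and location is the window position."""
    forward, reverse = [], []
    for i in range(len(seq) - primer_len + 1):
        frag = seq[i:i + primer_len]
        if "N" in frag or "-" in frag:
            continue  # skip masked or gapped regions
        # forward primers: 3' end constraint; reverse primers: mirrored 5' end
        if frag[-1] in "CG" and sum(c in "CG" for c in frag[-6:]) >= 2:
            forward.append({"seq": frag, "location": i, "support": 1})
        if frag[0] in "CG" and sum(c in "CG" for c in frag[:6]) >= 2:
            reverse.append({"seq": frag, "location": i, "support": 1})
    return forward, reverse
```

Every window position is tested independently, so a single fragment can enter both sets when its two ends each satisfy the CG criteria.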

Order independent strategy: Let G denote the set of candidate forward primers. G is initialized to the empty set. We then carry out the following steps: We pick a random S_i from the reference sequence set that has not been considered so far. For each primer f_i,j ∈ F_i, we check if there exists a primer g ∈ G that is similar to f_i,j (i.e., g and f_i,j have at least 93% identity; see Section 6.1). If there is no such g ∈ G, then we insert f_i,j into G. If there exists such a g, then we update the support and location of g. The location is updated as

    location(g) ← (location(g) × support(g) + location(f_i,j)) / (support(g) + 1).    (1)

The support of g is then incremented by one. We repeat the same process for each of the remaining reference sequences in random order. Once all the references are processed, we remove the primers in G that do not satisfy the support criteria. Note that further optimizations can be made in the implementation by removing primers from G as soon as they are guaranteed to have insufficient support. We do not discuss them as they only affect the performance.

Order dependent strategy: The first strategy increases the support of a primer regardless of the positions of the primers in G and F_i. As a result, primers in conflicting positions can be considered similar simultaneously. Such conflicting primers can be desirable in the case of translocations. However, if the reference genomes do not have translocations, this strategy can produce false primers, as it increments support for all matches regardless of position. Figure 6-3 illustrates this. In the figure, we show only forward primers and their locations; the matched primers are connected by arrows. Primers f_1 and f_2 are crossed and would not be considered matched at the same time when using multiple sequence alignment; the order independent strategy allows this type of match. In the order dependent strategy, we consider the problem as finding the Longest Common Subsequence of a set of sequences, known as k-LCS.
Here, each primer set F_i denotes a sequence of primers, since the primers in F_i are ordered by their locations. The goal is to find a

Figure 6-3. An example of matching primers with translocations. Only forward primers are shown in the figure. Primers f_1 and f_2 have crossed positions due to a translocation. Matching both f_1 and f_2 at the same time is allowed by the motif-based strategy, but not by the multiple sequence alignment-based strategy.

subsequence of primers that is common to most of the reference sequences (i.e., % of the reference sequences contain it). k-LCS is an NP-complete problem [65] and has many heuristic solutions. We use a progressive solution which is similar in spirit to our first strategy. We pick a random S_i from the reference sequence set and initialize G to F_i. We then repeatedly pick a reference sequence from the remaining references and process it as follows: We find the LCS of F_i and G. Here, two primers are considered common if they are similar to each other (i.e., they have at least 93% identity). We update the support and location of each g ∈ G which is in the LCS. The location is updated as given in equation (1), and the support of g is then incremented by one. We then insert all the f_i,j ∈ F_i that are not in the LCS into G. Once all the references are processed, we remove the primers in G that do not satisfy the support criteria. The time complexity of this motif-based method is O(M^2), where M is the number of primers in a sequence. Usually M is much less than the length of the sequence.

Finding Minimum Primer Pair Set

So far, we have discussed how to find candidate primers from a given set of reference sequences. In this section, we discuss how to select a minimum set of primer pairs to obtain the largest coverage and the minimum number of contigs. Let F = {f_1, f_2, …, f_m} and R = {r_1, r_2, …, r_n} denote the sets of forward and reverse primers with sufficient support identified using any of the strategies discussed in the previous

section. Assume that location(f_i) < location(f_j) and location(r_i) < location(r_j) for i < j. Note that the locations of primers are computed as discussed in the previous section. The goal is to find a set of primer pairs P = {<f_π1, r_ρ1>, <f_π2, r_ρ2>, …, <f_πk, r_ρk>}, where for all i, f_πi ∈ F and r_ρi ∈ R, and for all i < j, π_i < π_j and ρ_i < ρ_j, with the objective that the primer pairs in P have maximum coverage on the reference sequences and produce the minimum number of contigs. We propose a greedy algorithm. It works in three steps:

Step 1: Initialize the current forward primer, f = f_1. Remove f from F.

Step 2: For the current forward primer f, check R. If there are reverse primers r ∈ R which satisfy the distance criteria with f, select the one with the largest location as the current reverse primer r. Recall from Section 6.1 that the distance criteria is 0 < location(r) − location(f) + length(r) < distance-cutoff. The distance-cutoff is set to 1,000 (see Section 6.1). Insert the <f, r> pair into P. If there is no r ∈ R which satisfies the distance criteria with f, then update f as the next forward primer, remove f from F, and repeat Step 2. If there is no forward primer left in F, the algorithm stops.

Step 3: For the current reverse primer r, check F. There are three cases. Case 1: If F = ∅, then the algorithm stops. Case 2: If there are forward primers in F which satisfy the overlap criteria, select the one with the largest location as the current forward primer f. Remove all the primers in F whose locations are less than or equal to the location of f. Case 3: If no forward primer satisfies the overlap criteria, select the first forward primer in F which has a larger location than r, and go to Step 2. Recall from Section 6.1 that the overlap criteria is 0 < location(r) − location(f) < overlap-cutoff. The overlap-cutoff is set to 100 (see Section 6.1).

Figure 6-4 illustrates our primer pair selection strategy. In this example, f_1 is chosen as the first forward primer (Step 1).
The reverse primers r_2 and r_3 satisfy the distance criteria for f_1. Therefore, r_2 and r_3 can be paired with f_1. The <f_1, r_3> pair is inserted into the solution

Figure 6-4. Selection of the next forward primer from the current reverse primer. The positions of the primers are shown in the figure. We select f_2 if both f_1 and f_2 are in Region A, and select f_3 if f_3, f_4, f_5 and f_6 are in Region B and no primer is in Region A.

set since location(r_2) < location(r_3) (Step 2). The search space is split into regions A and B. The cut position shows the boundary for the overlap criteria. All the forward primers in A satisfy this criteria, whereas the ones in B do not. The last forward primer in region A, f_3, is chosen as the next forward primer (Step 3). If region A had not contained any forward primers with a location greater than that of f_1, the primer f_4 would have been selected as the next forward primer, since f_4 is the forward primer with the smallest location in region B (Step 3).

Next, we prove that our greedy primer selection strategy is optimal among all possible solutions that can be found from the candidate primers. We define optimality according to two criteria: 1) The optimal set of primer pairs covers the largest number of

letters of the consensus of the reference sequences. 2) Among all the solutions with the same coverage, the optimal solution contains the minimum number of primers and produces the minimum number of contigs.

Optimality Proof: Let F = {f_1, f_2, …, f_m} and R = {r_1, r_2, …, r_n} denote the sets of candidate forward and reverse primers. Let P = {<f_π1, r_ρ1>, <f_π2, r_ρ2>, …, <f_πk, r_ρk>} be the set of primer pairs found using our primer selection strategy. Let C = {c_1, c_2, …, c_s} be the optimal set of contigs that can be determined using F and R, sorted in ascending order of their locations. Let left(c_i) and right(c_i) denote the leftmost and rightmost positions of c_i in the consensus sequence. We have right(c_i) < left(c_{i+1}) for all i, 1 ≤ i < s.

(A) We first show that location(f_π1) = left(c_1). Let f_i be the leftmost primer (i.e., the one with the smallest location) in F which has at least one matching reverse primer satisfying the distance criteria. f_i is selected by our algorithm (Steps 1 & 2), i.e., π_1 = i. (A.1) Assume that location(f_i) < left(c_1). This contradicts the assumption that C is optimal, because f_i can be paired with a reverse primer to cover some letters to the left of c_1, and these letters could be included in C to increase its coverage. (A.2) Assume that location(f_i) > left(c_1). This contradicts the assumption that f_i is the leftmost primer with a matching reverse primer. From (A.1) and (A.2), we conclude that location(f_π1) = left(c_1).

(B) Second, we prove that location(r_ρ1) ≤ right(c_1). We prove this by contradiction: location(r_ρ1) > right(c_1) contradicts the assumption that c_1 is an optimal contig, as <f_π1, r_ρ1> could be included to extend c_1.

(C) Third, we show that <f_π1, r_ρ1> is a part of the optimal solution (Steps 1 & 2 of the algorithm). (A) and (B) prove that f_π1 and r_ρ1 are contained in c_1. Thus, they identify a prefix of c_1.
Selection of <f_π1, r_ρ1> minimizes the number of primer pairs needed to cover c_1, because <f_π1, r_ρ1> defines the longest prefix of c_1 that can be identified using F and R.

Thus, the coverage of any other primer pair that covers a prefix of c_1 is a subsequence of that of <f_π1, r_ρ1>. Such a pair would require additional primer pairs to cover the same region.

(D) Finally, we prove that the selection strategy for the next forward primer minimizes the number of primer pairs (Step 3 of the algorithm). (B) implies that there are two possibilities for r_ρ1. (D.1) Assume that location(r_ρ1) = right(c_1). This implies that <f_π1, r_ρ1> is the optimal primer pair to identify c_1. Since c_1 is part of the optimal solution, there is no primer pair that satisfies the overlap criterion with <f_π1, r_ρ1> and has a reverse primer located beyond right(c_1). Thus, the next forward primer should be selected as the first forward primer of F in region B (see Figure 6-4) in order to detect the next contig in C (Step 3); the justification follows from (A). (D.2) Assume that location(r_ρ1) < right(c_1). This implies that there exists at least one primer pair that satisfies the overlap constraint with <f_π1, r_ρ1> and covers a subsequence of c_1; otherwise, c_1 would not be identified as part of the optimal solution. Step 3 chooses the rightmost forward primer in region A (see Figure 6-4) to maximize the coverage of this primer pair, and thus minimizes the number of primer pairs.

Evaluating Primer Pairs

So far, we have discussed how to find primer pairs from reference sequences to amplify the target sequence. Performing wet-lab experiments to evaluate the quality of the primers is costly. In this section, we develop a new method to evaluate the quality of a set of primer pairs computationally. This method can be used to predict primer quality quickly without any additional cost. We evaluate the primer pairs using two key parameters: (1) average coverage, and (2) average number of contigs produced over all the reference sequences. Here, the coverage is the total number of characters covered by the primer pairs.
The number of contigs is the number of fragments identified such that no two fragments have sufficient overlap.

Let P = {<f_1, r_1>, <f_2, r_2>, ..., <f_k, r_k>} denote the set of primer pairs identified from the reference sequences S = {S_1, S_2, ..., S_K}. For each S_i in S, the algorithm keeps an integer vector V_i whose size is equal to the length of S_i. All entries of V_i are initially set to zero. The algorithm works as follows.

1. Initialize contigid = 0.
2. For j = 1 to k:
   (a) Find the locations of f_j and r_j in S_i using dynamic programming [28-30]. A primer is found in S_i if S_i contains a subsequence whose alignment with that primer has at least 93 % identity (see Section 6.1).
   (b) If both f_j and r_j are found and their locations satisfy the distance criterion (i.e., their locations differ by at most 1,000), then check the values in V_i from the starting location of f_j to the ending location of r_j. If the first or the last 100 values are identical and greater than zero, then the fragment identified by <f_j, r_j> is an extension of an existing contig, because it satisfies the overlap criterion with that contig (see Section 6.1); set all the values of V_i corresponding to the new fragment to this value. Otherwise, <f_j, r_j> defines part of a new contig; increment contigid by one and set all the values of V_i corresponding to the new fragment to contigid.
3. Return the number of non-zero values in V_i as the coverage, and the number of distinct non-zero values in V_i as the number of contigs.

Experimental Evaluation

Experimental setup: We evaluate our proposed methods through both computational and wet-lab experiments. We evaluate the primer pairs based on several criteria, namely the coverage, the number of contigs, and the hit ratio on the target sequence, as well as the time it takes to find the primers. The former two are described in Section 6.1. Hit ratio denotes the fraction of primers that have a matching subsequence in the target genome.
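The labelling procedure above can be sketched as follows, in Python rather than the original C. The sketch assumes the fragment interval (start, end) for each primer pair has already been located by the alignment step; the function name, the interval representation, and the half-open indexing are illustrative assumptions.

```python
def evaluate_primer_pairs(seq_len, fragments, overlap=100):
    """Compute (coverage, number of contigs) on one reference sequence.

    fragments: list of (start, end) half-open intervals, one per primer
    pair whose forward/reverse primers were both located and satisfy the
    distance criterion (assumed precomputed here).
    V plays the role of the integer vector V_i in the algorithm.
    """
    V = [0] * seq_len
    contig_id = 0
    for start, end in fragments:
        head = V[start:start + overlap]
        tail = V[max(start, end - overlap):end]
        # Extension test: the first or last `overlap` labels all belong
        # to one existing contig (identical and greater than zero).
        ext = None
        if head and head[0] > 0 and all(v == head[0] for v in head):
            ext = head[0]
        elif tail and tail[0] > 0 and all(v == tail[0] for v in tail):
            ext = tail[0]
        if ext is None:              # new contig
            contig_id += 1
            ext = contig_id
        for p in range(start, end):  # label the fragment
            V[p] = ext
    coverage = sum(1 for v in V if v > 0)
    contigs = len({v for v in V if v > 0})
    return coverage, contigs
```

For example, two overlapping fragments are labelled as one contig, whereas two disjoint fragments yield two contigs with the same total coverage.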
For comparison, we downloaded Primer3 [120] as a representative of single-sequence-input primer design tools, as it is one of the best known such tools. For our multiple alignment-based strategy, we downloaded the source code of ClustalW [1, 77]. We also implemented the proposed weighted multiple alignment method in Section. We also implemented our motif-based primer method as described in Section. As a part of this method we implemented both the order independent and the order dependent strategies. We used the C language in all our implementations.

We used the five plastid genomes used in ASAP [32] and added two more, from Cucumis and Lactuca, to our dataset. We obtained the DNA sequences of these genomes from GenBank and selected their inverted repeat regions. We use the last four digits of the accession number of each DNA sequence in GenBank as its name. To test divergent sequences, we also created another set of sequences by randomly deleting non-gene characters according to a given probability. Unless otherwise stated, we report the results for the original plastid genomes in our experiments. In all our experiments we used a subset of these sequences as reference sequences and picked another sequence, which is not a reference sequence, as the target sequence. Unless otherwise stated, for a given target sequence all six remaining genomes are used as reference sequences. We ran all computational experiments on an Intel Pentium 4 at 3.2 GHz with 2 GB of memory, running Windows XP. In the tables that follow, CovT denotes the coverage on the target sequence, ConT the number of contigs on the target sequence, CovR the average coverage on the reference sequences, and ConR the average number of contigs on the reference sequences.

Quality Evaluation

Comparison to Primer3: Our first set of experiments compares the quality of the primer pairs of MAPPIT to that of Primer3 [120]. We use Primer3 with its default parameters on a single reference sequence to identify the top 50 primers. We then evaluate these primers on the target genome.
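The divergent data sets described in the setup above can be generated with a short sketch like the following. The function name and the boolean gene mask are assumptions made for illustration; the original preprocessing code is not shown in the dissertation.

```python
import random

def diverge(seq, gene_mask, p, seed=0):
    """Delete each non-gene character of `seq` with probability `p`.

    gene_mask[i] is True when position i lies in an annotated gene
    (hypothetical input); gene positions are never deleted, matching
    the 'randomly deleting non-gene characters' construction.
    """
    rng = random.Random(seed)  # fixed seed keeps the data set reproducible
    return "".join(c for c, g in zip(seq, gene_mask)
                   if g or rng.random() >= p)
```

With p = 0.04, 0.08, and 0.16 this yields the 4 %, 8 %, and 16 % divergent data sets used below.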
We limit the number of primers of Primer3 to 50 to make it comparable to our method. We repeat this for all possible reference-target combinations and present the average results for each target. For MAPPIT, we use all six remaining sequences as the reference sequences for each target sequence. We report results for both multiple alignment strategies. Table 6-1 shows the results.

The results show that the coverage of Primer3 is significantly lower than that of our method in all cases. This illustrates that existing tools which consider only one sequence for primer design are not suitable for sequencing plastid genomes. The coverage of MAPPIT is greater than 62 % on average. Furthermore, both alignment strategies achieve similar coverage, number of contigs, and number of primer pairs.

Evaluation of impact of reference similarity: In order to observe the impact of the degree of similarity of the reference sequences, we run MAPPIT on reference sequences of 4 %, 8 %, and 16 % divergence. Here, x % divergence means that letters in non-gene regions are randomly deleted with probability x %. Table 6-2 presents the results for the 16 % divergent dataset; due to space limitations, results for the other divergent datasets are not shown. The experiments show that the coverage and the number of primers decrease, whereas the number of contigs increases. The coverage is slightly more than 57 %. However, the quality drop is very small given that the sequences are altered by 16 %. We observe that the quality gradually drops as the divergence increases (results not shown). Another important observation is that MAPPIT achieves higher quality using our weighted multiple sequence alignment method than using ClustalW. This shows that ClustalW is more suitable for highly similar sequences, whereas our weighted multiple alignment is more suitable for genomes with variations in non-coding regions.

Comparison of proposed strategies: We compare the two methods for constructing the primer candidate set.
We show the evaluations in Table 6-3 for the multiple sequence alignment-based and motif-based primer identification strategies. For the motif-based strategy,

Table 6-1. Comparison of Primer3 and the multiple sequence alignment-based strategies in step 1. The table shows the results of using the alignment from ClustalW and from our weighted multiple sequence alignment algorithm, which uses a hierarchical clustering algorithm and the gap open extension score strategy. Columns: Data Set, Target, Length; CovT and ConT for Primer3; Pairs#, CovT, and ConT for each of ClustalW-MAPPIT and weighted-MAPPIT; the last row gives averages.

Table 6-2. Comparison of different sources of alignment: ClustalW versus our weighted multiple sequence alignment algorithm, on the 16 % divergent data set. The weighted multiple sequence alignment method uses a hierarchical clustering algorithm and the gap open extension score scheme. Columns: Data Set, Target, Length; Pairs#, CovT, ConT, CovR, and ConR for each of ClustalW-MAPPIT and weighted-MAPPIT; the last row gives averages.

we show the results using the order independent and order dependent approaches, indicated in the table by non-order-MAPPIT and order-MAPPIT, respectively. The motif-based strategies have better coverage than the multiple alignment-based strategy in all experiments. This is because multiple alignment takes all the letters of the references into consideration, including the non-coding regions. As a result, variations in less conserved regions reduce the support of the primers in conserved regions, as they cause shifts in the alignments. The order independent motif-based strategy has the highest coverage in all the experiments, because it produces more candidate primers as the order criterion is relaxed. The average coverage of this strategy is 81 %, a significant improvement over our multiple alignment-based strategy. Table 6-3 also shows the coverage and the number of contigs computed on the reference sequences as discussed in Section. The results show that the estimated quality values from the reference sequences are similar to the actual values computed from the target sequence. Thus, we conclude that the evaluation strategy proposed in Section is accurate.

Evaluation of impact of number of references: Here, we test the effects of the number of reference sequences. We use the hit ratio to evaluate the methods; this value shows the accuracy of the primers found. We carry out the following steps. First, we select a target sequence from our dataset. We then select k sequences randomly from the reference sequences such that all of them are different from the target sequence. We then run our program on these k sequences and find the primer pairs. We compute the coverage and the number of contigs that these primer pairs produce on the target sequence. We repeat this process 10 times for each possible target sequence, each time selecting a new set of references. Thus, we carry out 70 experiments (7 targets, 10 tests per target). We report the average values over all these experiments.
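The hit ratio used in these experiments can be computed with a sketch like the following. The matching here is simplified: the dissertation locates primers by dynamic-programming alignment with at least 93 % identity, while this illustration substitutes an ungapped sliding-window comparison with the same identity threshold.

```python
def hit_ratio(primers, target, min_identity=0.93):
    """Fraction of primers that have a matching site in the target.

    Simplified stand-in for the alignment-based search: a primer counts
    as found if some same-length window of the target matches it at
    `min_identity` or better, position by position (no gaps).
    """
    def found(p):
        k = len(p)
        for i in range(len(target) - k + 1):
            matches = sum(a == b for a, b in zip(p, target[i:i + k]))
            if matches / k >= min_identity:
                return True
        return False

    if not primers:
        return 0.0
    return sum(1 for p in primers if found(p)) / len(primers)
```

A gapped alignment would find slightly more hits than this ungapped sketch, so the sketch gives a conservative estimate.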
Table 6-4 shows the results. The hit ratio usually increases as k increases. This agrees with our assumption that more reference sequences achieve higher quality primers.

Table 6-3. Comparison of the multiple sequence alignment-based method and the motif-based methods in step 1. Non-order-MAPPIT and order-MAPPIT stand for the motif-based method with the order independent and order dependent strategies, respectively. The multiple sequence alignment-based method uses a hierarchical clustering algorithm and the gap open extension score scheme. Columns: Data Set, Target, Length; Pairs#, CovT, and ConT for each of weighted-MAPPIT, non-order-MAPPIT, and order-MAPPIT; the last row gives averages.

Table 6-4. Effects of the number of reference sequences. The multiple sequence alignment-based method uses a hierarchical clustering algorithm and the gap open extension score scheme. Non-order-MAPPIT and order-MAPPIT stand for the order independent and order dependent strategies of the motif-based method, respectively. Columns: Reference #; Coverage and Hit Ratio for each of weighted-MAPPIT, non-order-MAPPIT, and order-MAPPIT.

The coverage of the multiple alignment-based strategy increases as k decreases, because this strategy produces more primers for small k. The coverage of the motif-based strategy shows variations; however, it usually increases as k decreases.

Performance Comparison

In this section we evaluate the running time of our methods. Our results show that, on average, our multiple alignment-based method runs for about 270 minutes using our weighted alignment strategy; the same method runs in 195 minutes using ClustalW. Our motif-based method runs in 23 and 13 minutes for the order dependent and order independent strategies, respectively. These running times are significant improvements over the current ASAP strategy, which requires manual inspection of a multiple alignment of sequences that are 40K to 150K bases long.

Wet-lab Verification

The computational method was assessed in the laboratory for efficacy. Primer pairs identified using the computational method described above were tested by actual polymerase chain reaction in a wet-lab experiment. Eight primer pairs were selected at random; the corresponding DNA oligonucleotides were synthesized and used to attempt to amplify target regions from 12 different plant genera (Figure 6-5). Of these, 9 plants are somewhat related and 3 represent ancient or highly diverged species. Pea lacks the

inverted repeat region and thus is very different from the other plastid genomes sampled here. Ginkgo, an ancient Gymnosperm, and Equisetum, a Pteridophyte, are ancestors of modern-day flowering plants and exhibit a high degree of sequence dissimilarity. The primers devised by the computational method were mapped on the tobacco chloroplast genome (1879), and Table 6-5 summarizes the sequence locations, expected sizes, and annealing sites of the forward and reverse primers.

Table 6-5. Eight randomly selected primer pairs, their locations on sequence 1879, the length of the segment identified by the primers, and the genes that they land on. A negative value indicates that the primers landed in the incorrect order. Columns: primer pair, locations of the forward and reverse primers in 1879, size in base pairs, and the annealing sites, which include rps16, intergenic regions, rps2, rpoc, ycf9, psaa, ndhb, the rps12 introns, orf131, 16S, and ycf2.

From Table 6-5 the following features are evident:
1. Computationally identified primer pairs anneal mainly to the coding regions or to conserved introns between the genes. This was one of the prerequisites for efficient primer identification and demonstrates that the new method of multiple sequence alignment is promising for this specific purpose.
2. The size of the amplified regions ranges from 452 base pairs to 1,782 base pairs. The optimal primer set will amplify regions ranging from 800 to 1,200 base pairs, which makes the amplified products more amenable to sequencing.
3. Primer pair 5 represents divergent primers in 1879; thus no product is visible here or in the other species, but in maize there is an annealing site that produces an amplicon of the expected size. This illustrates the potential of the method as applicable to divergent plant species.

Figure 6-5. Polymerase chain reaction samples were analyzed on an agarose gel by electrophoresis. The lanes correspond to Tobacco, Arabidopsis, Amaranthus, Maize, Lettuce, Tomato, Strawberry, Peach, Citrus, Pea, Ginkgo, and Equisetum; column M represents a standard DNA size ladder. Columns labeled 5, 17, 36, 99, 100, and 150 represent the primer pairs chosen at random from the computational dataset. White bands in each column represent amplified DNA from each primer pair in a given plant sample. Note that primer pair 100 does not produce an amplified product in most plants except for maize (see Table 6-5). Ginkgo and Equisetum represent ancestral samples used to test the limits of this approach. Although they are highly divergent in sequence content and position, some coverage was obtained, indicating the method will be highly useful on contemporary crop species. (This figure was created by Amit Dhingra.)


More information

Gene regulation. DNA is merely the blueprint Shared spatially (among all tissues) and temporally But cells manage to differentiate

Gene regulation. DNA is merely the blueprint Shared spatially (among all tissues) and temporally But cells manage to differentiate Gene regulation DNA is merely the blueprint Shared spatially (among all tissues) and temporally But cells manage to differentiate Especially but not only during developmental stage And cells respond to

More information

Compares a sequence of protein to another sequence or database of a protein, or a sequence of DNA to another sequence or library of DNA.

Compares a sequence of protein to another sequence or database of a protein, or a sequence of DNA to another sequence or library of DNA. Compares a sequence of protein to another sequence or database of a protein, or a sequence of DNA to another sequence or library of DNA. Fasta is used to compare a protein or DNA sequence to all of the

More information

Accelerating the Prediction of Protein Interactions

Accelerating the Prediction of Protein Interactions Accelerating the Prediction of Protein Interactions Alex Rodionov, Jonathan Rose, Elisabeth R.M. Tillier, Alexandr Bezginov October 21 21 Motivation The human genome is sequenced, but we don't know what

More information

Recent Research Results. Evolutionary Trees Distance Methods

Recent Research Results. Evolutionary Trees Distance Methods Recent Research Results Evolutionary Trees Distance Methods Indo-European Languages After Tandy Warnow What is the purpose? Understand evolutionary history (relationship between species). Uderstand how

More information

CS273: Algorithms for Structure Handout # 4 and Motion in Biology Stanford University Thursday, 8 April 2004

CS273: Algorithms for Structure Handout # 4 and Motion in Biology Stanford University Thursday, 8 April 2004 CS273: Algorithms for Structure Handout # 4 and Motion in Biology Stanford University Thursday, 8 April 2004 Lecture #4: 8 April 2004 Topics: Sequence Similarity Scribe: Sonil Mukherjee 1 Introduction

More information

INVESTIGATION STUDY: AN INTENSIVE ANALYSIS FOR MSA LEADING METHODS

INVESTIGATION STUDY: AN INTENSIVE ANALYSIS FOR MSA LEADING METHODS INVESTIGATION STUDY: AN INTENSIVE ANALYSIS FOR MSA LEADING METHODS MUHANNAD A. ABU-HASHEM, NUR'AINI ABDUL RASHID, ROSNI ABDULLAH, AWSAN A. HASAN AND ATHEER A. ABDULRAZZAQ School of Computer Sciences, Universiti

More information

Machine Learning. Computational biology: Sequence alignment and profile HMMs

Machine Learning. Computational biology: Sequence alignment and profile HMMs 10-601 Machine Learning Computational biology: Sequence alignment and profile HMMs Central dogma DNA CCTGAGCCAACTATTGATGAA transcription mrna CCUGAGCCAACUAUUGAUGAA translation Protein PEPTIDE 2 Growth

More information

Algorithms for Bioinformatics

Algorithms for Bioinformatics Adapted from slides by Leena Salmena and Veli Mäkinen, which are partly from http: //bix.ucsd.edu/bioalgorithms/slides.php. 582670 Algorithms for Bioinformatics Lecture 6: Distance based clustering and

More information

NP-Hardness. We start by defining types of problem, and then move on to defining the polynomial-time reductions.

NP-Hardness. We start by defining types of problem, and then move on to defining the polynomial-time reductions. CS 787: Advanced Algorithms NP-Hardness Instructor: Dieter van Melkebeek We review the concept of polynomial-time reductions, define various classes of problems including NP-complete, and show that 3-SAT

More information

Polynomial-Time Approximation Algorithms

Polynomial-Time Approximation Algorithms 6.854 Advanced Algorithms Lecture 20: 10/27/2006 Lecturer: David Karger Scribes: Matt Doherty, John Nham, Sergiy Sidenko, David Schultz Polynomial-Time Approximation Algorithms NP-hard problems are a vast

More information

Cost Partitioning Techniques for Multiple Sequence Alignment. Mirko Riesterer,

Cost Partitioning Techniques for Multiple Sequence Alignment. Mirko Riesterer, Cost Partitioning Techniques for Multiple Sequence Alignment Mirko Riesterer, 10.09.18 Agenda. 1 Introduction 2 Formal Definition 3 Solving MSA 4 Combining Multiple Pattern Databases 5 Cost Partitioning

More information

TELCOM2125: Network Science and Analysis

TELCOM2125: Network Science and Analysis School of Information Sciences University of Pittsburgh TELCOM2125: Network Science and Analysis Konstantinos Pelechrinis Spring 2015 2 Part 4: Dividing Networks into Clusters The problem l Graph partitioning

More information

Multiple alignment. Multiple alignment. Computational complexity. Multiple sequence alignment. Consensus. 2 1 sec sec sec

Multiple alignment. Multiple alignment. Computational complexity. Multiple sequence alignment. Consensus. 2 1 sec sec sec Introduction to Bioinformatics Iosif Vaisman Email: ivaisman@gmu.edu VTISCTGSSSNIGAG-NHVKWYQQLPG VTISCTGTSSNIGS--ITVNWYQQLPG LRLSCSSSGFIFSS--YAMYWVRQAPG LSLTCTVSGTSFDD--YYSTWVRQPPG PEVTCVVVDVSHEDPQVKFNWYVDG--

More information

Multiple sequence alignment. November 2, 2017

Multiple sequence alignment. November 2, 2017 Multiple sequence alignment November 2, 2017 Why do multiple alignment? Gain insight into evolutionary history Can assess time of divergence by looking at the number of mutations needed to change one sequence

More information

Global Alignment Scoring Matrices Local Alignment Alignment with Affine Gap Penalties

Global Alignment Scoring Matrices Local Alignment Alignment with Affine Gap Penalties Global Alignment Scoring Matrices Local Alignment Alignment with Affine Gap Penalties From LCS to Alignment: Change the Scoring The Longest Common Subsequence (LCS) problem the simplest form of sequence

More information

Biochemistry 324 Bioinformatics. Multiple Sequence Alignment (MSA)

Biochemistry 324 Bioinformatics. Multiple Sequence Alignment (MSA) Biochemistry 324 Bioinformatics Multiple Sequence Alignment (MSA) Big- Οh notation Greek omicron symbol Ο The Big-Oh notation indicates the complexity of an algorithm in terms of execution speed and storage

More information

Multiple Sequence Alignment Sum-of-Pairs and ClustalW. Ulf Leser

Multiple Sequence Alignment Sum-of-Pairs and ClustalW. Ulf Leser Multiple Sequence Alignment Sum-of-Pairs and ClustalW Ulf Leser This Lecture Multiple Sequence Alignment The problem Theoretical approach: Sum-of-Pairs scores Practical approach: ClustalW Ulf Leser: Bioinformatics,

More information

Bioinformatics for Biologists

Bioinformatics for Biologists Bioinformatics for Biologists Sequence Analysis: Part I. Pairwise alignment and database searching Fran Lewitter, Ph.D. Director Bioinformatics & Research Computing Whitehead Institute Topics to Cover

More information

FastA and the chaining problem, Gunnar Klau, December 1, 2005, 10:

FastA and the chaining problem, Gunnar Klau, December 1, 2005, 10: FastA and the chaining problem, Gunnar Klau, December 1, 2005, 10:56 4001 4 FastA and the chaining problem We will discuss: Heuristics used by the FastA program for sequence alignment Chaining problem

More information

24 Grundlagen der Bioinformatik, SS 10, D. Huson, April 26, This lecture is based on the following papers, which are all recommended reading:

24 Grundlagen der Bioinformatik, SS 10, D. Huson, April 26, This lecture is based on the following papers, which are all recommended reading: 24 Grundlagen der Bioinformatik, SS 10, D. Huson, April 26, 2010 3 BLAST and FASTA This lecture is based on the following papers, which are all recommended reading: D.J. Lipman and W.R. Pearson, Rapid

More information

Biology 644: Bioinformatics

Biology 644: Bioinformatics Find the best alignment between 2 sequences with lengths n and m, respectively Best alignment is very dependent upon the substitution matrix and gap penalties The Global Alignment Problem tries to find

More information

Dynamic Programming Course: A structure based flexible search method for motifs in RNA. By: Veksler, I., Ziv-Ukelson, M., Barash, D.

Dynamic Programming Course: A structure based flexible search method for motifs in RNA. By: Veksler, I., Ziv-Ukelson, M., Barash, D. Dynamic Programming Course: A structure based flexible search method for motifs in RNA By: Veksler, I., Ziv-Ukelson, M., Barash, D., Kedem, K Outline Background Motivation RNA s structure representations

More information

The Encoding Complexity of Network Coding

The Encoding Complexity of Network Coding The Encoding Complexity of Network Coding Michael Langberg Alexander Sprintson Jehoshua Bruck California Institute of Technology Email: mikel,spalex,bruck @caltech.edu Abstract In the multicast network

More information

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Cluster Analysis Mu-Chun Su Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Introduction Cluster analysis is the formal study of algorithms and methods

More information

COS 551: Introduction to Computational Molecular Biology Lecture: Oct 17, 2000 Lecturer: Mona Singh Scribe: Jacob Brenner 1. Database Searching

COS 551: Introduction to Computational Molecular Biology Lecture: Oct 17, 2000 Lecturer: Mona Singh Scribe: Jacob Brenner 1. Database Searching COS 551: Introduction to Computational Molecular Biology Lecture: Oct 17, 2000 Lecturer: Mona Singh Scribe: Jacob Brenner 1 Database Searching In database search, we typically have a large sequence database

More information

Highly Scalable and Accurate Seeds for Subsequence Alignment

Highly Scalable and Accurate Seeds for Subsequence Alignment Highly Scalable and Accurate Seeds for Subsequence Alignment Abhijit Pol Tamer Kahveci Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL, USA, 32611

More information

Evolutionary tree reconstruction (Chapter 10)

Evolutionary tree reconstruction (Chapter 10) Evolutionary tree reconstruction (Chapter 10) Early Evolutionary Studies Anatomical features were the dominant criteria used to derive evolutionary relationships between species since Darwin till early

More information

Basics of Multiple Sequence Alignment

Basics of Multiple Sequence Alignment Basics of Multiple Sequence Alignment Tandy Warnow February 10, 2018 Basics of Multiple Sequence Alignment Tandy Warnow Basic issues What is a multiple sequence alignment? Evolutionary processes operating

More information

OPEN MP-BASED PARALLEL AND SCALABLE GENETIC SEQUENCE ALIGNMENT

OPEN MP-BASED PARALLEL AND SCALABLE GENETIC SEQUENCE ALIGNMENT OPEN MP-BASED PARALLEL AND SCALABLE GENETIC SEQUENCE ALIGNMENT Asif Ali Khan*, Laiq Hassan*, Salim Ullah* ABSTRACT: In bioinformatics, sequence alignment is a common and insistent task. Biologists align

More information

FastA & the chaining problem

FastA & the chaining problem FastA & the chaining problem We will discuss: Heuristics used by the FastA program for sequence alignment Chaining problem 1 Sources for this lecture: Lectures by Volker Heun, Daniel Huson and Knut Reinert,

More information

Multiple Sequence Alignment Gene Finding, Conserved Elements

Multiple Sequence Alignment Gene Finding, Conserved Elements Multiple Sequence Alignment Gene Finding, Conserved Elements Definition Given N sequences x 1, x 2,, x N : Insert gaps (-) in each sequence x i, such that All sequences have the same length L Score of

More information

In this lecture we discuss the complexity of approximation problems, and show how to prove they are NP-hard.

In this lecture we discuss the complexity of approximation problems, and show how to prove they are NP-hard. In this lecture we discuss the complexity of approximation problems, and show how to prove they are NP-hard. 1 We will show how one can prove such results and then apply this technique to some approximation

More information

PCP and Hardness of Approximation

PCP and Hardness of Approximation PCP and Hardness of Approximation January 30, 2009 Our goal herein is to define and prove basic concepts regarding hardness of approximation. We will state but obviously not prove a PCP theorem as a starting

More information

C E N T R. Introduction to bioinformatics 2007 E B I O I N F O R M A T I C S V U F O R I N T. Lecture 13 G R A T I V. Iterative homology searching,

C E N T R. Introduction to bioinformatics 2007 E B I O I N F O R M A T I C S V U F O R I N T. Lecture 13 G R A T I V. Iterative homology searching, C E N T R E F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U Introduction to bioinformatics 2007 Lecture 13 Iterative homology searching, PSI (Position Specific Iterated) BLAST basic idea use

More information

Bioinformatics explained: BLAST. March 8, 2007

Bioinformatics explained: BLAST. March 8, 2007 Bioinformatics Explained Bioinformatics explained: BLAST March 8, 2007 CLC bio Gustav Wieds Vej 10 8000 Aarhus C Denmark Telephone: +45 70 22 55 09 Fax: +45 70 22 55 19 www.clcbio.com info@clcbio.com Bioinformatics

More information

On the Optimality of the Neighbor Joining Algorithm

On the Optimality of the Neighbor Joining Algorithm On the Optimality of the Neighbor Joining Algorithm Ruriko Yoshida Dept. of Statistics University of Kentucky Joint work with K. Eickmeyer, P. Huggins, and L. Pachter www.ms.uky.edu/ ruriko Louisville

More information

Database Searching Using BLAST

Database Searching Using BLAST Mahidol University Objectives SCMI512 Molecular Sequence Analysis Database Searching Using BLAST Lecture 2B After class, students should be able to: explain the FASTA algorithm for database searching explain

More information

USING AN EXTENDED SUFFIX TREE TO SPEED-UP SEQUENCE ALIGNMENT

USING AN EXTENDED SUFFIX TREE TO SPEED-UP SEQUENCE ALIGNMENT IADIS International Conference Applied Computing 2006 USING AN EXTENDED SUFFIX TREE TO SPEED-UP SEQUENCE ALIGNMENT Divya R. Singh Software Engineer Microsoft Corporation, Redmond, WA 98052, USA Abdullah

More information

Alignment of Long Sequences

Alignment of Long Sequences Alignment of Long Sequences BMI/CS 776 www.biostat.wisc.edu/bmi776/ Spring 2009 Mark Craven craven@biostat.wisc.edu Pairwise Whole Genome Alignment: Task Definition Given a pair of genomes (or other large-scale

More information

Copyright 2000, Kevin Wayne 1

Copyright 2000, Kevin Wayne 1 Guessing Game: NP-Complete? 1. LONGEST-PATH: Given a graph G = (V, E), does there exists a simple path of length at least k edges? YES. SHORTEST-PATH: Given a graph G = (V, E), does there exists a simple

More information

Lecture 10: Local Alignments

Lecture 10: Local Alignments Lecture 10: Local Alignments Study Chapter 6.8-6.10 1 Outline Edit Distances Longest Common Subsequence Global Sequence Alignment Scoring Matrices Local Sequence Alignment Alignment with Affine Gap Penalties

More information

Multiple sequence alignment accuracy estimation and its role in creating an automated bioinformatician

Multiple sequence alignment accuracy estimation and its role in creating an automated bioinformatician Multiple sequence alignment accuracy estimation and its role in creating an automated bioinformatician Dan DeBlasio dandeblasio.com danfdeblasio StringBio 2018 Tunable parameters!2 Tunable parameters Quant

More information

Small Libraries of Protein Fragments through Clustering

Small Libraries of Protein Fragments through Clustering Small Libraries of Protein Fragments through Clustering Varun Ganapathi Department of Computer Science Stanford University June 8, 2005 Abstract When trying to extract information from the information

More information

Brief review from last class

Brief review from last class Sequence Alignment Brief review from last class DNA is has direction, we will use only one (5 -> 3 ) and generate the opposite strand as needed. DNA is a 3D object (see lecture 1) but we will model it

More information

EECS730: Introduction to Bioinformatics

EECS730: Introduction to Bioinformatics EECS730: Introduction to Bioinformatics Lecture 15: Microarray clustering http://compbio.pbworks.com/f/wood2.gif Some slides were adapted from Dr. Shaojie Zhang (University of Central Florida) Microarray

More information

Clustering. (Part 2)

Clustering. (Part 2) Clustering (Part 2) 1 k-means clustering 2 General Observations on k-means clustering In essence, k-means clustering aims at minimizing cluster variance. It is typically used in Euclidean spaces and works

More information

Giri Narasimhan. CAP 5510: Introduction to Bioinformatics. ECS 254; Phone: x3748

Giri Narasimhan. CAP 5510: Introduction to Bioinformatics. ECS 254; Phone: x3748 CAP 5510: Introduction to Bioinformatics Giri Narasimhan ECS 254; Phone: x3748 giri@cis.fiu.edu www.cis.fiu.edu/~giri/teach/bioinfs07.html 1/30/07 CAP5510 1 BLAST & FASTA FASTA [Lipman, Pearson 85, 88]

More information

The Ordered Covering Problem

The Ordered Covering Problem The Ordered Covering Problem Uriel Feige Yael Hitron November 8, 2016 Abstract We introduce the Ordered Covering (OC) problem. The input is a finite set of n elements X, a color function c : X {0, 1} and

More information

A Layer-Based Approach to Multiple Sequences Alignment

A Layer-Based Approach to Multiple Sequences Alignment A Layer-Based Approach to Multiple Sequences Alignment Tianwei JIANG and Weichuan YU Laboratory for Bioinformatics and Computational Biology, Department of Electronic and Computer Engineering, The Hong

More information