Bioinformatics I, WS 12/13, D. Huson, November 11, 2012


4 Multiple Sequence Alignment

Sources for this lecture:

- R. Durbin, S. Eddy, A. Krogh and G. Mitchison, Biological sequence analysis, Cambridge, 1998
- D. Gusfield, Algorithms on strings, trees and sequences
- D.W. Mount, Bioinformatics: Sequences and genome analysis
- J. Setubal & J. Meidanis, Introduction to computational molecular biology
- M. Waterman, Introduction to computational biology

A multiple sequence alignment (MSA) is simply an alignment of more than two sequences, like this:

MRP2_HUMAN    SNRWLIRLELVNLVFFSLMMVIY--RDLSDVFVLSNLNIQLNWLVRM
Q9UQ99_HUMAN  VNRWLVRLEVNIVLFLFVIS--RHSLSLVLSVSYSLQVYLNWLVRMS
B8_HUMAN      NRWLEVRMEYIVVLIVSISNSLHRELSLVLLYLMVSNYLNWMVRNL
Q96J65_HUMAN  LRWFLRMDVLMNILFVLLVLS--FSSISSSKLSLSYIIQLSLLQVVR
Q96J6_HUMAN   SSRWMLRLEIMNLVLVLFVF--ISSPYSFKVMVNIVLQLSSFQRI
MRP5_HUMAN    MRWLVRLDLISILILMIVLM--HQIPPYLISYVQLLFQFVRL
MRP4_HUMAN    SRWFVRLDIMFVIIVFSLIL--KLDQVLLSYLLMMFQWVRQS
O75555_HUMAN  SRWFVRLDIMFVIIVFSLIL--KLDQVLLSYLLMMFQWVRQS
CFTR_HUMAN    SLRWFQMRIEMIFVIFFIVFISIL---EERVIILLMNIMSLQWVNSS

(A small section of a multiple alignment of the human CFTR protein and eight homologous proteins.)

4.1 Why multiple sequence alignments?

Multiple sequence alignment is applied to a set of sequences that are assumed to be related. The goal is to detect homologous residues and to place them in the same column of the multiple alignment.

Multiple alignments (MSAs) are more suitable than pairwise alignments for addressing evolutionary questions, because the chance of random similarities occurring decreases as the number of aligned sequences grows.

Quote (Arthur Lesk): "One or two homologous sequences whisper... a full multiple sequence alignment shouts out loud."

Multiple alignments are used both for similarity studies, e.g. to classify members of protein families, and for dissimilarity studies, e.g. to infer phylogenetic relationships.

4.1.1 Characterization of protein families

Typical question: Suppose we have established a family $F = \{A_1, A_2, \dots, A_r\}$ of homologous protein sequences. Does a new sequence $A_0$ belong to the family?

One way to address this question would be to align $A_0$ to each of $A_1, \dots, A_r$ in turn. If one of these alignments produces a high score, then we may decide that $A_0$ belongs to the family $F$.

However, perhaps $A_0$ does not align particularly well to any one specific family member, but scores well in a multiple alignment, due to common motifs etc.

4.1.2 Conservation of structural elements

Here we show the alignment of N-acetylglucosamine-binding proteins to the tertiary structure of one of them.

The example exhibits 8 conserved cysteines that form 4 disulphide bridges and are an essential part of the structure of these proteins.

4.1.3 MSA and evolutionary trees

One main application of multiple sequence alignments is in phylogenetic analysis. Consider the following MSA:

A_1 = N - F L S
A_2 = N - F - S
A_3 = N K Y L S
A_4 = N - Y L S

We would like to reconstruct the evolutionary tree that gave rise to these sequences.

(Figure: an example tree with ancestral sequence NYLS and leaves NYLS, NKYLS, NFS and NFLS; the edges are labelled with the events +K, -L and Y to F.)

In practice, the sequences considered in phylogenetics are much longer. The computation of phylogenetic trees will be discussed in a later chapter.

4.2 Definition of an MSA

Suppose we are given $r$ sequences $A_1, \dots, A_r$ over an alphabet $\Sigma$:

$$A := \begin{cases} A_1 = (a_{11}, a_{12}, \dots, a_{1n_1}) \\ A_2 = (a_{21}, a_{22}, \dots, a_{2n_2}) \\ \quad\vdots \\ A_r = (a_{r1}, a_{r2}, \dots, a_{rn_r}) \end{cases}$$

Definition (MSA). A multiple sequence alignment (MSA) $A^*$ of $A$ is obtained by inserting gaps ('-') into the original sequences such that

- all resulting sequences $A_i^*$ have equal length $L \ge \max\{n_i \mid i = 1, \dots, r\}$,
- we can get back the sequence $A_i$ by removing all gaps from $A_i^*$, and
- no column consists of gaps only:

$$A^* := \begin{cases} A_1^* = (a^*_{11}, a^*_{12}, \dots, a^*_{1L}) \\ A_2^* = (a^*_{21}, a^*_{22}, \dots, a^*_{2L}) \\ \quad\vdots \\ A_r^* = (a^*_{r1}, a^*_{r2}, \dots, a^*_{rL}) \end{cases}$$

4.3 Scoring an MSA

In the case of a linear gap penalty, and assuming that the different columns of an MSA are independent, the score $\alpha(A^*)$ of an MSA can be defined as the sum of column scores:

$$\alpha(A^*) := \sum_{i=1}^{L} s(a^*_{1i}, a^*_{2i}, \dots, a^*_{ri}).$$

Here we assume that $s(a^*_{1i}, a^*_{2i}, \dots, a^*_{ri})$ is a function that returns a score for every combination of $r$ symbols (including the gap symbol).

4.3.1 The sum-of-pairs (SP) score

How to define $s$? For two protein sequences, $s$ is usually given by a BLOSUM or PAM matrix. For more than two sequences, providing such a matrix is not practical, as the number of possible combinations of different letters is too big.

Let $A^*$ be an MSA. Consider two sequences $A_p^*$ and $A_q^*$ in the alignment. For two aligned symbols $u$ and $v$ we define:

$$s(u, v) := \begin{cases} \text{match score for } u \text{ and } v, & \text{if } u \text{ and } v \text{ are residues,} \\ -d, & \text{if either } u \text{ or } v \text{ is a gap, or} \\ 0, & \text{if both } u \text{ and } v \text{ are gaps.} \end{cases}$$

(Note that $u = -$ and $v = -$ can occur simultaneously in a multiple alignment.)

Let $A_p^*$ and $A_q^*$ be two sequences that are part of an MSA $A^*$ of $r$ sequences. Then $A^*$ defines a pairwise alignment of $A_p$ and $A_q$. Define the score of this (not necessarily optimal) pairwise alignment as

$$s(A_p^*, A_q^*) = \sum_{i=1}^{L} s(a^*_{pi}, a^*_{qi}).$$

We obtain a score for the complete MSA by summing up the pairwise scores for all pairs of involved sequences:

$$S(A_1^*, \dots, A_r^*) = \sum_{1 \le p < q \le r} s(A_p^*, A_q^*).$$

Definition. The sum-of-pairs (SP) score of an alignment $A^*$ is defined as

$$\alpha_{SP}(A^*) := \sum_{1 \le p < q \le r} s(A_p^*, A_q^*) = \sum_{i=1}^{L} s_{SP}(a^*_{1i}, a^*_{2i}, \dots, a^*_{ri}),$$

with

$$s(A_p^*, A_q^*) := \sum_{i=1}^{L} s(a^*_{pi}, a^*_{qi}) \quad\text{and}\quad s_{SP}(a^*_{1i}, \dots, a^*_{ri}) := \sum_{1 \le p < q \le r} s(a^*_{pi}, a^*_{qi}).$$

Note that we thus obtain a score for a multiple alignment that is based on a pairwise-scoring matrix.

Example: three columns (1), (2), (3) of a multiple alignment of five sequences, scored with BLOSUM62 (N-N: 6, N-C: -3, C-C: 9):

                   (1)   (2)   (3)
Seq 1          ... N ... N ... N ...
Seq 2          ... N ... N ... N ...
Seq 3          ... N ... N ... N ...
Seq 4          ... N ... N ... C ...
Seq 5          ... N ... C ... C ...

# comparisons      10    10    10     ($\binom{5}{2} = 10$ per column)
N-N pairs          10     6     3
N-C pairs           0     4     6
C-C pairs           0     0     1
BLOSUM62 score     60    24     9

4.3.2 An undesirable property of the SP score

Consider a column $L$ in which all $r$ sequences carry the symbol $x$, and a column $R$ in which $r - 1$ sequences carry $x$ and one sequence carries $y$:

        L:            R:
A_1   ... x ...     ... x ...
A_2   ... x ...     ... x ...
 :
A_r-1 ... x ...     ... x ...
A_r   ... x ...     ... y ...

The SP-score of the column shown in $L$ is

$$s_{SP}(x^r) = \binom{r}{2} s(x, x).$$

The SP-score of the column shown in $R$ is

$$s_{SP}(x^{r-1}, y) = \binom{r-1}{2} s(x, x) + (r - 1)\, s(x, y).$$

The column in $L$ is completely conserved, whereas the column in $R$ shows one mismatch. Clearly, it would be desirable that the former scores much better than the latter, and increasingly so for longer and longer columns.

The difference between $s_{SP}(x^r)$ and $s_{SP}(x^{r-1}, y)$ is:

$$\binom{r}{2} s(x, x) - \binom{r-1}{2} s(x, x) - (r - 1)\, s(x, y) = (r - 1)\bigl(s(x, x) - s(x, y)\bigr),$$

because $\binom{r}{2} - \binom{r-1}{2} = r - 1$. Therefore, the relative difference is

$$\frac{s_{SP}(x^r) - s_{SP}(x^{r-1}, y)}{s_{SP}(x^r)} = \frac{(r - 1)\bigl(s(x, x) - s(x, y)\bigr)}{\bigl(r(r - 1)/2\bigr)\, s(x, x)} = \frac{2}{r} \left( \frac{s(x, x) - s(x, y)}{s(x, x)} \right),$$

which unfortunately decreases as the number of sequences $r$ increases!

4.4 The dynamic program for a global MSA

Dynamic programs developed for pairwise alignment can be extended to multiple alignments. We now discuss how to compute a global MSA for three sequences, in the case of a linear gap penalty. Suppose we are given:

$$A = \begin{cases} A_1 = (a_{11}, a_{12}, \dots, a_{1n_1}) \\ A_2 = (a_{21}, a_{22}, \dots, a_{2n_2}) \\ A_3 = (a_{31}, a_{32}, \dots, a_{3n_3}). \end{cases}$$

We proceed by computing the entries of an $(n_1 + 1) \times (n_2 + 1) \times (n_3 + 1)$-matrix $F(i, j, k)$ recursively. After the computation, $F(n_1, n_2, n_3)$ will contain the best score $\alpha$ for a global alignment $A^*$. As in the case of pairwise alignment, we can use traceback to recover an optimal alignment.

The main recursion is:

$$F(i, j, k) = \max \begin{cases} F(i-1, j-1, k-1) + s(a_{1i}, a_{2j}, a_{3k}), \\ F(i-1, j-1, k) + s(a_{1i}, a_{2j}, -), \\ F(i-1, j, k-1) + s(a_{1i}, -, a_{3k}), \\ F(i, j-1, k-1) + s(-, a_{2j}, a_{3k}), \\ F(i-1, j, k) + s(a_{1i}, -, -), \\ F(i, j-1, k) + s(-, a_{2j}, -), \\ F(i, j, k-1) + s(-, -, a_{3k}), \end{cases}$$

for $1 \le i \le n_1$, $1 \le j \le n_2$, $1 \le k \le n_3$, where $s(a, b, c)$ returns a score for a given column of symbols $a, b, c$; for example, $s = s_{SP}$, the sum-of-pairs score.

Example: 1 = = 2 = 3 = BDE BE DEE = 1 = B D E = 2 = B E 3 = D E E
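The recursion above can be sketched in Python as follows. This is only an illustration under stated assumptions: the scoring values (match 1, mismatch -1, linear gap penalty d = 2) and the function names `pair_score`, `sp_column` and `align3` are hypothetical choices, not part of the lecture. Boundary cells are filled with the same recursion, restricted to the moves that stay inside the matrix.

```python
from itertools import product

GAP = "-"

def pair_score(u, v, match=1, mismatch=-1, d=2):
    """Score of two aligned symbols (assumed toy values, linear gap penalty d)."""
    if u == GAP and v == GAP:
        return 0
    if u == GAP or v == GAP:
        return -d
    return match if u == v else mismatch

def sp_column(column):
    """Sum-of-pairs score of one alignment column."""
    return sum(pair_score(column[p], column[q])
               for p in range(len(column))
               for q in range(p + 1, len(column)))

def align3(a, b, c):
    """Optimal global SP score and alignment rows for three sequences."""
    n1, n2, n3 = len(a), len(b), len(c)
    # The seven moves of the recursion: each sequence contributes
    # either its next symbol (1) or a gap (0), but not all three gaps.
    moves = [m for m in product((0, 1), repeat=3) if any(m)]
    F = {(0, 0, 0): 0}
    back = {}
    # Lexicographic order guarantees that predecessors are already filled.
    for i, j, k in product(range(n1 + 1), range(n2 + 1), range(n3 + 1)):
        if (i, j, k) == (0, 0, 0):
            continue
        best = None
        for di, dj, dk in moves:
            pi, pj, pk = i - di, j - dj, k - dk
            if pi < 0 or pj < 0 or pk < 0:
                continue
            col = (a[pi] if di else GAP,
                   b[pj] if dj else GAP,
                   c[pk] if dk else GAP)
            cand = F[(pi, pj, pk)] + sp_column(col)
            if best is None or cand > best:
                best = cand
                back[(i, j, k)] = ((pi, pj, pk), col)
        F[(i, j, k)] = best
    # Traceback from F(n1, n2, n3) to F(0, 0, 0).
    cols = []
    cell = (n1, n2, n3)
    while cell != (0, 0, 0):
        cell, col = back[cell]
        cols.append(col)
    cols.reverse()
    rows = ["".join(col[r] for col in cols) for r in range(3)]
    return F[(n1, n2, n3)], rows

score, rows = align3("ABC", "AC", "ABC")
print(score)
print("\n".join(rows))
```

The table has $(n_1+1)(n_2+1)(n_3+1)$ cells with up to seven predecessors each, so the sketch is only practical for very short sequences.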

4.4.1 Complexity of the dynamic program for an MSA

What is the complexity of the dynamic programming approach for an MSA of $r$ sequences of length $n$ using the SP-score?

- Space complexity: $O(n^r)$
- Time complexity: $O(r^2 n^r 2^r)$

(The matrix has $O(n^r)$ cells, each cell has $2^r - 1$ predecessors, and each SP column score takes $O(r^2)$ time.)

Theorem. Computing an MSA with optimal SP-score is NP-hard. [1]

4.5 Progressive alignment

Because optimal multiple sequence alignments cannot be computed efficiently by dynamic programming, we turn to heuristics. One main approach is progressive alignment.

Progressive alignment has three steps:

1. Compute pairwise distances between all sequences.
2. Build a rooted binary guide tree based on the distances.
3. In a bottom-up traversal of the tree, repeatedly align the sequences or profiles associated with the two children of the current node and then assign the result to the current node.

The result is the alignment assigned to the root of the tree.

The main idea is to align sequences along a tree. In the example indicated below, we first align sequences 1 and 2 to obtain (1, 2), then 4 and 5 to obtain (4, 5), and then 3 with (4, 5) to obtain (3, (4, 5)). Finally, we align (1, 2) with (3, (4, 5)) to obtain an alignment of all five sequences.

(Figure: guide tree with root ((1,2),(3,(4,5))) and inner nodes (1,2), (3,(4,5)) and (4,5).)

Order of alignment matters

A guide tree specifies the order in which sequences are aligned. The following example shows that order matters:

A_1 = LVK, A_2 = PFK, A_3 = LFVK, A_4 = PFVK.

Aligning these sequences in two different orders gives two different results:

[1] L. Wang and T. Jiang, J Comp Biol 1994

(A_1, A_2), (A_3, A_4)    or    (A_1, A_3), (A_2, A_4)

LV-K                            L-VK
PF-K                            PF-K
LFVK                            LFVK
PFVK                            PFVK

Pseudocode for progressive alignment

The general algorithm for progressive alignments is as follows:

    Input: a set A = {A_1, ..., A_r} of sequences
    begin
        S = {}                      // this will hold the current set of alignments
        for i = 1, 2, ..., r do
            S := S + {{A_i}}
        do
            choose two sub-alignments A_p, A_q from S
            S := S - {A_p, A_q}
            A_s := align(A_p, A_q)
            S := S + {A_s}
        while |S| > 1
    end

The guide tree is not explicitly mentioned; it is used to decide which two sub-alignments to choose.

Existing progressive alignment methods differ in:

1. how pairwise distances are computed between sequences,
2. the order in which the sequences are aligned (or how the guide tree is constructed), and
3. which parameters are used (such as scoring function, gap penalties, weight of individual sequences).

Aligning two alignments

How do we align two alignments? Assume that we have two multiple sequence alignments $\mathcal{A}_1$ and $\mathcal{A}_2$. There are two ways to align these two alignments, namely:

- compute a pair-guided alignment, or
- compute a profile alignment.

Pair-guided alignment of two sub-alignments

To align two multiple alignments $\mathcal{A}_1$ and $\mathcal{A}_2$ using the pair-guided alignment approach, one chooses one sequence $x$ from $\mathcal{A}_1$ and one sequence $y$ from $\mathcal{A}_2$ (including all gaps that they contain). The two sequences $x$ and $y$ are then optimally aligned using dynamic programming. All columns of the original sub-alignments follow the corresponding letters in $x$ and $y$.

For example, let the two (sub-)alignments be

LEE -EE -LEE

Let us align the first sequence of the first (sub-)alignment with the last sequence of the second, and then add gaps to the other sequences in the sub-alignments. The final multiple alignment is then:

-ERE LER- LEE- LER- LEE- -EE- -LEE- -ERE LER

Profile alignment

Suppose we are given two MSAs (called profiles in this context) $\mathcal{A}_1 = \{A_1, \dots, A_r\}$ and $\mathcal{A}_2 = \{A_{r+1}, \dots, A_n\}$.

We now discuss profile alignment in the case of the SP-score and linear gap scores. We will assume $s(-, a) = s(a, -) = -g$ and $s(-, -) = 0$ for all symbols $a$ occurring in $\mathcal{A}_1$ or $\mathcal{A}_2$.

Definition. A profile alignment of $\mathcal{A}_1$ and $\mathcal{A}_2$ is an MSA

$$A^* = \begin{cases} A_1^* = (a^*_{11}, a^*_{12}, \dots, a^*_{1L}) \\ \quad\vdots \\ A_r^* = (a^*_{r1}, a^*_{r2}, \dots, a^*_{rL}) \\ A_{r+1}^* = (a^*_{r+1,1}, a^*_{r+1,2}, \dots, a^*_{r+1,L}) \\ \quad\vdots \\ A_n^* = (a^*_{n1}, a^*_{n2}, \dots, a^*_{nL}), \end{cases}$$

obtained by inserting gaps in whole columns of $\mathcal{A}_1$ or $\mathcal{A}_2$, without changing the alignment of either of the two profiles.

Gaps that exist in either input alignment are never removed: "Once a gap, always a gap."

The SP-score of the profile alignment is:

$$\alpha_{SP}(A^*) = \sum_{1 \le p < q \le n} \sum_{i=1}^{L} s(a^*_{pi}, a^*_{qi}) = \sum_{i=1}^{L} \sum_{1 \le p < q \le n} s(a^*_{pi}, a^*_{qi})$$

$$= \underbrace{\sum_{i=1}^{L} \sum_{1 \le p < q \le r} s(a^*_{pi}, a^*_{qi})}_{\text{alignment score of } \mathcal{A}_1} + \underbrace{\sum_{i=1}^{L} \sum_{r < p < q \le n} s(a^*_{pi}, a^*_{qi})}_{\text{alignment score of } \mathcal{A}_2} + \underbrace{\sum_{i=1}^{L} \sum_{1 \le p \le r < q \le n} s(a^*_{pi}, a^*_{qi})}_{\text{cross terms}}.$$

The first two sums are fixed by the input profiles. The third sum can be optimized using standard pairwise alignment, with the modification that columns are scored against columns by adding their pair scores.
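Optimizing the cross terms can be sketched as a Needleman-Wunsch style dynamic program over whole columns. The following is a minimal illustration, not the implementation used by any particular program; the scores (match 1, mismatch -1, gap score -g with g = 1) and the names `cross_score` and `align_profiles` are assumptions made for this example.

```python
GAP = "-"

def pair_score(u, v, match=1, mismatch=-1, g=1):
    """s(u, v) with s(-, a) = s(a, -) = -g and s(-, -) = 0 (assumed toy values)."""
    if u == GAP and v == GAP:
        return 0
    if u == GAP or v == GAP:
        return -g
    return match if u == v else mismatch

def cross_score(col1, col2):
    """Cross terms of one column pair: all pairs (u from profile 1, v from profile 2)."""
    return sum(pair_score(u, v) for u in col1 for v in col2)

def align_profiles(p1, p2):
    """Align two profiles, each given as a list of equal-length gapped strings.

    Returns the optimal cross-term score and the rows of the merged alignment.
    Existing columns are never changed; only whole gap columns are inserted.
    """
    cols1, cols2 = list(zip(*p1)), list(zip(*p2))
    gap1, gap2 = (GAP,) * len(p1), (GAP,) * len(p2)
    n, m = len(cols1), len(cols2)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = F[i - 1][0] + cross_score(cols1[i - 1], gap2)
    for j in range(1, m + 1):
        F[0][j] = F[0][j - 1] + cross_score(gap1, cols2[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + cross_score(cols1[i - 1], cols2[j - 1]),
                F[i - 1][j] + cross_score(cols1[i - 1], gap2),
                F[i][j - 1] + cross_score(gap1, cols2[j - 1]),
            )
    # Traceback, collecting merged columns from right to left.
    out, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                F[i][j] == F[i - 1][j - 1] + cross_score(cols1[i - 1], cols2[j - 1])):
            out.append(cols1[i - 1] + cols2[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and F[i][j] == F[i - 1][j] + cross_score(cols1[i - 1], gap2):
            out.append(cols1[i - 1] + gap2)
            i -= 1
        else:
            out.append(gap1 + cols2[j - 1])
            j -= 1
    out.reverse()
    rows = ["".join(col[r] for col in out) for r in range(len(p1) + len(p2))]
    return F[n][m], rows

score, rows = align_profiles(["AC-T", "ACGT"], ["AGT"])
print(score)
print("\n".join(rows))
```

In this example the second profile is a single sequence, so the call aligns one sequence to a profile; the gap that already exists in the first profile is kept.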

Clearly, either or both profiles may consist of a single sequence. If exactly one of them does, we are aligning a single sequence to a profile; if both do, we are simply aligning two sequences.

In the following example, use 0 for match, 1 for mismatch or gap:

lignment 1: lignment 2: 1 = 2 = 3 = - 4 = - 5 = 6 = -

What is the score for each alignment? What is the optimal score for a profile alignment of the two?

4.6 Feng-Doolittle

The first progressive alignment algorithm to be published was the Feng-Doolittle algorithm [2]:

Algorithm (Feng-Doolittle)

1. Calculate all $\binom{r}{2}$ pairwise alignment scores and convert them into distance scores.
2. Construct a guide tree from the distance matrix (using the Fitch and Margoliash clustering algorithm, 1967).
3. Traverse the tree bottom-up and perform a profile alignment on the two children of any internal node and then assign the result to the node.

The final result is given by the alignment assigned to the root node.

The distance calculation that Feng-Doolittle uses is:

$$D = -\log S_{\mathit{eff}} = -\log \frac{S_{obs} - S_{rand}}{S_{max} - S_{rand}},$$

where $S_{obs}$ is the observed score for a pair of sequences, $S_{max}$ is the maximum score, and $S_{rand}$ is the expected score of an alignment of two random sequences of the same length and composition as the pair in question.

Thus the score $S_{\mathit{eff}}$ can be viewed as a normalised percentage similarity: it is expected that with increasing evolutionary distance this score decays exponentially toward zero.

The sequence-sequence alignments are conducted using the profile alignment approach.

4.7 CLUSTALW

CLUSTALW [3] can be considered as an improvement of the Feng-Doolittle algorithm. For many years, this was possibly the most widely used program for computing an MSA.

[2] Feng, D-F. & Doolittle, R.F. Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J. Mol. Evol. 25, 1987.
[3] Thompson, J.D., Higgins, D.G. & Gibson, T.J. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, positions-specific gap penalties and weight matrix choice. Nucleic Acids Research, 22, 1994. Thompson, J.D., Gibson, T.J., Plewniak, F., Jeanmougin, F. & Higgins, D.G. The ClustalX windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Research, 24, 1997.

Algorithm (CLUSTALW progressive alignment)

1. Construct a distance matrix of all $\binom{r}{2}$ pairs by pairwise dynamic programming alignment, followed by approximate conversion of similarity scores to evolutionary distances.
2. Construct a guide tree from the distance matrix using the Neighbor Joining tree-building method.
3. Progressively align sequences at nodes of the tree in order of decreasing similarity, using sequence-sequence, sequence-profile and profile-profile alignment.

ClustalW provides a choice of two distance scores to use, both derived from an optimal pairwise alignment of sequences $A_i$ and $A_j$:

- One is the observed distance, defined as $D_{ij} = 1 - (s_{ij} / L)$, where $s_{ij}$ is the number of identities in the best alignment between $A_i$ and $A_j$ and $L$ is the number of positions considered (gap positions are excluded). This distance score equals the relative number of differences per site.
- The other is the corrected distance calculated using the Kimura correction (Kimura 1983). (We will discuss distance corrections later.)

There are no provable performance guarantees associated with the program. However, it works well in practice and the following features contribute to its accuracy:

- Sequences are weighted to compensate for the defects of the SP score.
- The substitution matrix used is chosen based on the similarity expected of the alignment, e.g. BLOSUM80 for closely related sequences and BLOSUM50 for less related ones.
- Position-specific gap-open profile penalties are multiplied by a modifier that is a function of the residues observed at the position (hydrophobic residues give higher gap penalties than hydrophilic or flexible ones).
- Gap-open penalties are also decreased if the position is spanned by a consecutive stretch of five or more hydrophilic residues.
- Gap-open and gap-extension penalties increase if there are no gaps in the column, but gaps nearby. (This tries to force gaps to occur in the same places.)
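The observed distance defined above is easy to compute from the two rows of a pairwise alignment. The following sketch only illustrates the formula $D_{ij} = 1 - s_{ij}/L$; the function name `observed_distance` and its input format (two equal-length gapped strings) are assumptions for this example and are not taken from the ClustalW source.

```python
def observed_distance(x, y, gap="-"):
    """Observed distance D = 1 - s/L for two rows of a pairwise alignment.

    s is the number of identical aligned residue pairs and L is the number
    of positions considered; positions with a gap in either row are excluded.
    """
    if len(x) != len(y):
        raise ValueError("aligned rows must have equal length")
    pairs = [(u, v) for u, v in zip(x, y) if u != gap and v != gap]
    if not pairs:
        raise ValueError("no ungapped positions to compare")
    identities = sum(u == v for u, v in pairs)
    return 1 - identities / len(pairs)

# Four ungapped positions, three identities: D = 1 - 3/4.
print(observed_distance("AC-GT", "ACTGA"))
```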
Example

As an example, assume that we would like to align the following 11 (Trypsin and Trypsin inhibitor) sequences, which are given in FASTA format:

    >EEI-II
    PRILMRKQDSDLVPNFSP
    >Ii Mutant
    PRLLMRKQDSDLVPNF
    >BDI-II
    RPRILMRKRDSDLVQKNY
    >MeI-B
    VPRILMKKDRDLKRNY
    >MI-IV
    HEERVPRILMKKKDSDLEVLEHY
    >SI-IIB
    MVPKILMKKHDSDLLDVLEDIYVS
    >MRI-I
    IPRILMEKRDSDLQVKRQY
    >rypsin
    RIPRIWMERDSDMKIVH
    >IR MOMH
    RSPRIWMERDSDMKIVH
    >MI-
    RIPRIWMEKRDSDMQIVDH
    >LI-III
    RIPRILMESSDSDLEILENF

First step: pairwise scores

    Start of Pairwise alignments
    Aligning...
    Sequences (1:2) Aligned. Score: 96
    Sequences (1:3) Aligned. Score: 82
    Sequences (1:4) Aligned. Score: 68
    Sequences (1:5) Aligned. Score: 66
    Sequences (1:6) Aligned. Score: 60
    Sequences (1:7) Aligned. Score: 68
    Sequences (1:8) Aligned. Score: 57
    Sequences (1:9) Aligned. Score: 57
    Sequences (1:10) Aligned. Score: 60
    Sequences (1:11) Aligned. Score:

Second step: the NJ guide tree

(Figure: NJ guide tree with leaves MRI-I, LI-III, MI-, rypsin, IR MOMH, MI-IV, SI-IIB, MeI-B, BDI-II, EEI-II and Ii Mutant; scale bar 0.1.)

Third step: progressive alignment along the guide tree

    Start of Multiple Alignment
    There are 10 groups
    Aligning...
    Group 1: Sequences: 2 Score:641
    Group 2: Sequences: 3 Score:600
    Group 3: Sequences: 4 Score:571
    Group 4: Sequences: 2 Score:601
    Group 5: Sequences: 6 Score:540
    Group 6: Sequences: 7 Score:561
    Group 7: Sequences: 2 Score:639
    Group 8: Sequences: 3 Score:619
    Group 9: Sequences: 4 Score:560
    Group 10: Sequences: 11 Score:515
    Alignment Score 7716
    CLUSTAL-Alignment file created

Result:

(Figure: the resulting multiple alignment.)

4.8 T-COFFEE

T-COFFEE [4] (short for Tree-based Consistency Objective Function for alignment Evaluation) is also a program that progressively aligns sequences in order to build an MSA.

T-COFFEE aims for consistency: an MSA is consistent if it agrees best with all optimal pairwise alignments.

T-Coffee uses an extended library of scores instead of a substitution matrix.

4.9 The alignment graph

In the following, we will consider an alternative approach to computing an MSA, based on Integer Linear Programming.

Suppose we are given two sequences $a^1$ and $a^2$. The complete alignment graph is the complete bipartite graph $G = (V, E)$ whose node set $V$ consists of the positions of $a^1$ and $a^2$ and whose edge set $E$ connects every position of $a^1$ with every position of $a^2$.

Each edge $e = (u, v)$ has a weight $\omega(e) = s(u, v)$, namely the score for placing $v$ under $u$.

An alignment graph is any subgraph of the complete alignment graph.

[4] C. Notredame, D. Higgins, J. Heringa: T-Coffee: a novel method for multiple sequence alignments. J Mol Biol 302 (2000)

The trace of an alignment

Consider an alignment of the sequences. We say that an edge in the alignment graph is realized if the corresponding positions are aligned.

The set of realized edges is called the trace of the alignment.

An arbitrary subset $T \subseteq E$ of edges is called a trace if there exists some alignment that realizes precisely the edges in $T$.

Similarly, we define the (complete) alignment graph and trace for multiple alignments. For $r$ sequences, the resulting graph will be $r$-partite.

Maximum-weight trace problem

Problem (Maximum-Weight Trace Problem). Given a set of sequences and a corresponding alignment graph $G = (V, E)$ with edge weights $\omega$, the maximum-weight trace problem is to find a trace $T \subseteq E$ of maximum weight.

For two sequences, this is the so-called maximum-weight bipartite matching problem, which is known to be solvable in polynomial time.

Characterization of traces

We have seen that an alignment can be described by a trace in the complete alignment graph $G = (V, E)$.

Question: Is every subset $T \subseteq E$ the trace of some alignment? The answer is clearly no.

Our goal is to characterize all legal traces. Here are two examples:

(Figure: two edge sets, each shown next to a candidate alignment. (a) is the trace of an alignment; (b) is not.)

The extended alignment graph

The alignment graph is extended by defining a set of directed edges $H$ on the cells of the matrix $A = \{a_{ij}\}$ that correspond to successive cells or letters, $(a_{ij}, a_{i,j+1})$.

Simple mixed cycles

A mixed cycle $Z$ is a cycle in the extended alignment graph $G = (V, E, H)$ that contains both undirected and directed edges, from $E$ and $H$ respectively, the latter all in the same direction.

A mixed cycle $Z$ is called simple if, for every sequence $a^p$, all nodes in $Z \cap a^p$ occur consecutively in $Z$. In other words, a simple mixed cycle enters and leaves any given sequence at most once.

The following result says that we can restrict our attention to those mixed cycles that are simple:

Lemma (Simple cycles suffice). For $T \subseteq E$, the graph $G' = (V, T \cup H)$ contains a simple mixed cycle if and only if it contains a mixed cycle.

We now obtain a nice result for determining whether a proposed trace is truly the trace of an alignment:

Theorem (Trace characterization). A subset $T \subseteq E$ is a trace if and only if $G' = (V, T \cup H)$ does not contain a simple mixed cycle.

Returning to the two examples shown above, the first contains no simple mixed cycle, whereas the second one does.

Block partition and the GMT problem

Suppose we are given a set of sequences $A = \{a^1, a^2, \dots, a^r\}$. The complete alignment graph is usually too big to be useful.

Often, we are given a set of block matches between pairs of the sequences $a^p$ and $a^q$, where a match relates a substring of $a^p$ and a substring of $a^q$ via a run of non-crossing edges (called a block).

(Figure: two blocks $D$ and $D'$ between two sequences.)

Suppose we are given such a partition $\mathcal{D}$ of the edges of $G = (V, E)$ obtained from a set of matches. For a trace $T$ we require that, for any given block $D \in \mathcal{D}$, either all edges in $D$ are realized, or none.

Each block $D$ is assigned a positive weight $\omega(D)$, reflecting the number and weight of the edges that it contains.

Problem (GMT problem). Suppose we are given an extended alignment graph $G = (V, E, H)$ and a partition $\mathcal{D}$ of $E$ into blocks with weights $\omega(D)$. The Generalized Maximum Trace (GMT) problem is to determine a set $M \subseteq \mathcal{D}$ of maximum total weight such that the edges in $\bigcup_{D \in M} D$ do not induce a mixed cycle on $G$.

Although blocks play an important role in practice, to simplify the following discussion, we will not use them explicitly. However, everything that follows is easily adjusted to the case that a set of blocks is given.
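The mixed-cycle criterion of the theorem can be tested mechanically. One way to do this (an assumption of this sketch, not a method taken from the lecture) is to contract the undirected edges of $T$ with a union-find structure and then check whether the directed edges $H$, lifted to the contracted nodes, form a directed cycle; a self-loop counts as a cycle. Nodes are written as pairs (sequence index, position), and the name `has_no_mixed_cycle` is hypothetical.

```python
def has_no_mixed_cycle(lengths, edges):
    """Check that (V, T + H) contains no mixed cycle.

    lengths[p] is the length of sequence p; edges is an iterable of
    undirected alignment edges ((p, i), (q, j)) between positions.
    """
    nodes = [(p, i) for p, n in enumerate(lengths) for i in range(n)]
    parent = {v: v for v in nodes}

    def find(v):
        # Union-find lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    # Contract every alignment edge: aligned positions share a component.
    for u, v in edges:
        parent[find(u)] = find(v)

    # Lift the directed edges H (consecutive positions) to the components.
    succ = {find(v): set() for v in nodes}
    for p, n in enumerate(lengths):
        for i in range(n - 1):
            cu, cv = find((p, i)), find((p, i + 1))
            if cu == cv:
                return False  # two neighbouring positions in one component
            succ[cu].add(cv)

    # Kahn's algorithm: the contracted graph is acyclic if and only if
    # every component can be removed in topological order.
    indegree = {c: 0 for c in succ}
    for c in succ:
        for d in succ[c]:
            indegree[d] += 1
    stack = [c for c in succ if indegree[c] == 0]
    removed = 0
    while stack:
        c = stack.pop()
        removed += 1
        for d in succ[c]:
            indegree[d] -= 1
            if indegree[d] == 0:
                stack.append(d)
    return removed == len(succ)

parallel = [((0, 0), (1, 0)), ((0, 1), (1, 1))]
crossing = [((0, 0), (1, 1)), ((0, 1), (1, 0))]
print(has_no_mixed_cycle([2, 2], parallel))  # two non-crossing edges
print(has_no_mixed_cycle([2, 2], crossing))  # two crossing edges
```

When the check succeeds, a topological order of the components gives the columns of an alignment. The sketch tests only the cycle condition; it does not handle blocks.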

Linear programming

A linear program (LP) consists of a set of linear inequalities,

$$\begin{aligned} a_{11} x_1 + a_{12} x_2 + \dots + a_{1n} x_n &\le b_1 \\ a_{21} x_1 + a_{22} x_2 + \dots + a_{2n} x_n &\le b_2 \\ &\;\;\vdots \\ a_{m1} x_1 + a_{m2} x_2 + \dots + a_{mn} x_n &\le b_m, \end{aligned}$$

together with an objective function

$$c_1 x_1 + c_2 x_2 + \dots + c_n x_n$$

to be optimized, i.e. minimized or maximized.

Linear programs can be efficiently solved using the simplex method, developed by George Dantzig in 1947.

There exist powerful computer programs for solving LPs, even when huge numbers of variables and inequalities are involved. CPLEX is a very powerful commercial LP solver. Moderate-size problems can be solved using lp_solve, which is free for academic purposes.

The inequalities describe a convex polyhedron, which is called a polytope if it is bounded.

For example, the inequalities

$$\begin{aligned} -1 x_1 - 1 x_2 &\le -5 \\ -2 x_1 + 1 x_2 &\le -1 \\ 1 x_1 + 3 x_2 &\le 18 \\ 1 x_1 + 0 x_2 &\le 6 \\ 1 x_1 - 2 x_2 &\le 2 \end{aligned}$$

describe five hyperplanes that bound a polytope.

(Figure: the hyperplanes and the polytope.)

For example, the objective function $2x_1 - 3x_2$ takes on a maximum of 6, for $x_1 = 6$ and $x_2 = 2$, and a minimum of $-9$, for $x_1 = 3$ and $x_2 = 5$.

Integer linear program

An integer linear program (ILP) is a linear program with the additional constraint that the variables $x_i$ are only allowed to take on integer values.

Solving ILPs has been shown to be NP-hard. (See the book by Garey and Johnson, 1979, for this and many other NP-completeness results.)

There exist a number of different strategies for approximating or solving such an ILP. These strategies usually first attempt to solve relaxations of the original problem, which are obtained by dropping some of the inequalities. They usually also rely on the LP-relaxation of the ILP, which is the LP obtained by dropping the integer condition.

ILP for the GMT problem

How to encode the GMT problem as an integer LP?

Assume we are given an extended alignment graph $G = (V, E, H)$, with $E = \{e_1, e_2, \dots, e_n\}$. Each edge $e_i \in E$ is represented by a variable $x_i$ that will take on the value 1 if $e_i$ belongs to the best scoring trace, and 0 if not.

Hence, our variables are $x_1, x_2, \dots, x_n$. To ensure that the variables are binary, we add the constraints $x_i \le 1$ and $x_i \ge 0$, and require all $x_i$ to be integers. Additional inequalities must be added to prevent mixed cycles.

(This and the following is from: Knut Reinert, A Polyhedral Approach to Sequence Alignment Problems, Dissertation, Saarbrücken 1999.)

For example, consider a graph with four edges $e_1, e_2, e_3, e_4$.

(Figure: two sequences joined by the edges $e_1, \dots, e_4$.)

There are three possible simple mixed cycles in the graph, one using $e_1$ and $e_3$, one using $e_2$ and $e_3$, and one using $e_2$ and $e_4$. We add the constraints

$$x_1 + x_3 \le 1, \quad x_2 + x_3 \le 1, \quad x_2 + x_4 \le 1$$

to ensure that none of the simple mixed cycles is realized.

As a second example, consider three sequences with three edges $e_1$, $e_2$ and $e_3$ that all participate in a simple mixed cycle.

(Figure: three sequences joined by the edges $e_1$, $e_2$ and $e_3$.)

The constraint

$$x_i + x_j + x_k \le 2$$

prevents them from being realized simultaneously.

In summary, suppose we are given an extended alignment graph $G = (V, E, H)$ with $E = \{e_1, e_2, \dots, e_n\}$, and a score $\omega_i$ defined for every edge $e_i \in E$. We can obtain a solution to the GMT problem by solving the following ILP:

Maximize

$$\sum_{e_i \in E} \omega_i x_i,$$

subject to

$$\sum_{e_i \in C \cap E} x_i \le |C \cap E| - 1, \quad \text{for all simple mixed cycles } C,$$

and $x_i \in \{0, 1\}$ for all variables $i = 1, \dots, n$.
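For very small instances, the ILP above can be solved by simply enumerating all 0/1 assignments. The following sketch does this for the four-edge example with the constraints $x_1 + x_3 \le 1$, $x_2 + x_3 \le 1$ and $x_2 + x_4 \le 1$. The edge weights are made up for illustration, the function name `solve_trace_ilp` is hypothetical, and each simple mixed cycle is represented only by the (0-based) indices of its undirected edges.

```python
from itertools import product

def solve_trace_ilp(weights, cycles):
    """Brute-force solution of the cycle-constrained 0/1 ILP.

    Maximizes sum(w_i * x_i) subject to, for every cycle C (given as a
    tuple of edge indices), sum(x_i for i in C) <= |C| - 1.
    """
    best_value, best_x = None, None
    for x in product((0, 1), repeat=len(weights)):
        if any(sum(x[i] for i in cycle) > len(cycle) - 1 for cycle in cycles):
            continue  # some mixed cycle would be fully realized
        value = sum(w * xi for w, xi in zip(weights, x))
        if best_value is None or value > best_value:
            best_value, best_x = value, x
    return best_value, best_x

# Hypothetical weights for e1..e4; cycles {e1,e3}, {e2,e3}, {e2,e4}.
weights = [3, 2, 4, 2]
cycles = [(0, 2), (1, 2), (1, 3)]
print(solve_trace_ilp(weights, cycles))
```

The enumeration visits $2^n$ assignments, so it only serves to make the formulation concrete.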

Solving the ILP using branch-and-cut

Solving LPs and ILPs is a main topic in combinatorial optimization. We will take a brief look at the branch-and-cut approach.

Branch-and-cut makes use of two techniques:

- Cutting: to solve an ILP, one considers the LP-relaxation of the problem and repeatedly cuts away parts of the polytope (by adding new constraints) in the hope of obtaining an integer solution.
- Branch-and-bound: an enumeration tree of all possible choices of parameters is partially traversed, computing local upper bounds and global lower bounds, which are used to avoid parts of the tree that cannot produce the optimal value.

First note that the number of mixed cycles grows exponentially with the size of the graph. So, initially, we select a polynomial number of constraints. That is, we consider a relaxation of the original problem.

We further relax the problem by solving the LP-relaxation. If the solution $\hat{x}$ is not integral, or is not feasible, then we add an unused constraint to the LP to cut away a part of the polyhedron that contains $\hat{x}$. (This is a non-trivial operation that we won't discuss.)

This is repeated until an integer solution is found that fulfills all constraints, or until we get stuck.

If no appropriate cut plane can be found, then we branch. That is, we choose a variable $x_i$ and solve two sub-cases, namely the case $x_i = 0$ and the case $x_i = 1$. Repeated application produces an enumeration tree of possible cases.

We call an upper bound for the original ILP local if it is obtained from considering such a subproblem in the enumeration tree.

If the solution found for a subproblem is feasible for the original problem and has a higher score than any solution found so far, then it is recorded and its value becomes the new global lower bound for the original objective function.

Subsequently, we only pursue subproblems whose local upper bound is greater than or equal to the global lower bound.
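The bounding part of this strategy can be illustrated with a small branch-and-bound sketch for the cycle-constrained 0/1 ILP. It is deliberately simplified and is not branch-and-cut: instead of solving an LP-relaxation and adding cutting planes, it checks all cycle constraints directly and uses a much weaker local upper bound, namely the current value plus all positive weights of the variables that are not yet fixed. The function name `branch_and_bound` and the weights in the example are assumptions.

```python
def branch_and_bound(weights, cycles):
    """Maximize sum(w_i * x_i) s.t. sum(x_i for i in C) <= |C| - 1 for each cycle C."""
    n = len(weights)
    # The all-zero assignment is always feasible: initial global lower bound 0.
    best_value, best_x = 0, (0,) * n

    def violates(x):
        # A partial assignment is hopeless once some cycle has all its edges set to 1.
        return any(all(i < len(x) and x[i] == 1 for i in cycle) for cycle in cycles)

    def recurse(x, value):
        nonlocal best_value, best_x
        if violates(x):
            return
        # Local upper bound: take every remaining positive weight.
        upper = value + sum(w for w in weights[len(x):] if w > 0)
        if upper < best_value:
            return  # this subtree cannot beat the global lower bound
        if len(x) == n:
            if value > best_value:
                best_value, best_x = value, x
            return
        i = len(x)
        recurse(x + (1,), value + weights[i])  # branch x_i = 1
        recurse(x + (0,), value)               # branch x_i = 0

    recurse((), 0)
    return best_value, best_x

# Hypothetical weights for four edges; cycles {e1,e3}, {e2,e3}, {e2,e4}.
print(branch_and_bound([3, 2, 4, 2], [(0, 2), (1, 2), (1, 3)]))
```

A real branch-and-cut solver obtains much tighter local upper bounds from the LP-relaxation, which is what makes larger instances tractable.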
General strategy:

(Figure: enumeration tree. The original problem is relaxed and each node repeatedly cuts and solves its LP. Branching on a variable, $x_i = 0$ or $x_i = 1$, then $x_j$, then $x_k$, leads to smaller problems. Each smaller problem gives a local upper bound (bound step), and feasible solutions maintain the global lower bound.)

As the details are quite involved, we will skip them.

An ILP for pairwise alignment

We discuss how to formulate the ILP for the problem of aligning two sequences. Suppose we are given two sequences $a = (a_1, \dots, a_n)$ and $b = (b_1, \dots, b_m)$. Let $s(f, g)$ denote the score for aligning symbols $f$ and $g$.

The objective function that we would like to maximize is:

$$\sum_{1 \le i \le n} \; \sum_{1 \le j \le m} s(a_i, b_j)\, x_{ij},$$

where $x_{ij}$ is a variable that indicates whether the edge from node $a_i$ to node $b_j$ belongs to the trace or not. To ensure that every variable $x_{ij}$ is binary, we use the inequalities

$$x_{ij} \le 1 \quad\text{and}\quad x_{ij} \ge 0,$$

for all $i, j$ with $1 \le i \le n$ and $1 \le j \le m$.

In the case of two sequences, every simple mixed cycle is given by an ordered pair of positions $(i, j)$ in sequence $a$ and an ordered pair of positions $(k, l)$ in sequence $b$.

(Figure: sequences $a_1 \dots a_i \dots a_j \dots a_n$ and $b_1 \dots b_k \dots b_l \dots b_m$ with the two crossing edges $x_{il}$ and $x_{jk}$.)

This gives rise to the following set of inequalities:

$$x_{il} + x_{jk} \le 1,$$

for all $i, j, k, l$ with $1 \le i \le j \le n$, $1 \le k \le l \le m$ and, additionally, $i \ne j$ or $k \ne l$.

For example, consider a sequence $a$ of length 3 and a sequence $b$ of length 2, and assume that the match and mismatch scores are 1 and $-1$, respectively.

In the format used by the program lp_solve, the ILP has the following formulation:

    max: +1*x1001-1*x1002-1*x2001-1*x2002-1*x3001+1*x3002;
    x1001<1;
    x1002<1;
    x2001<1;
    x2002<1;
    x3001<1;
    x3002<1;
    x1001+x3001<1;
    x1002+x3001<1;
    x1002+x3002<1;
    x2002+x2001<1;
    x1002+x1001<1;
    x2001+x3001<1;
    x1001+x2001<1;
    x2002+x3001<1;
    x1002+x2001<1;
    x2002+x3002<1;
    x1002+x2002<1;
    x3002+x3001<1;
    int x1001, x1002, x2001, x2002, x3001, x3002;

Here, we use the variable named x followed by the number $1000 i + j$ to represent the edge from $a_i$ to $b_j$ for all $i, j$. The program lp_solve interprets < as $\le$ and assigns only non-negative values to variables.

The gapped extended alignment graph

The extended alignment graph $G = (V, E, H)$ does not explicitly model gaps. To allow the scoring of gaps, we add a new set $B$ of edges to the graph, joining any two consecutive nodes in a sequence.

(Figure: for two sequences $a$ and $b$ we see (a) the gapped extended alignment graph, (b) an alignment and (c) the gapped trace that realizes the gapped alignment.)

Suppose we are given a gapped extended alignment graph $G = (V, E, H, B)$.

When modeling gaps in a trace we require, for any pair of sequences, that a node must either be incident to an alignment edge between the two sequences, or it must be incident to or enclosed by exactly one gap edge. Additionally, we require that a consecutive run of gap characters is regarded as one gap.

Without going into details, this gives rise to the definition of a gapped trace $(T, C)$, with $T \subseteq E$ and $C \subseteq B$.

Given weights $\omega$ for all edges in $E$ and $B$, a gapped trace can be scored as follows:

$$\alpha((T, C)) = \sum_{e \in T} \omega(e) - \sum_{g \in C} \omega(g).$$

In the gapped trace formulation, it is trivial to encode linear, affine, or any other reasonable gap cost function.

Conclusions

- The computation of multiple sequence alignments (MSA) is an important problem in bioinformatics.
- MSAs are often scored using the sum-of-pairs (SP) approach.
- Computing an MSA with optimal SP score is computationally hard.

- One heuristic approach is to perform progressive alignment along a guide tree; methods based on this idea include Feng-Doolittle, ClustalW and T-Coffee.
- We also saw that sequence alignment problems can be addressed using Integer Linear Programming.
- There are many other alignment methods, such as DIALIGN, MUSCLE, MAFFT or Clustal-Ω.
