34 Bioinformatics I, WS 12/13, D. Huson, November 11, 2012


4 Multiple Sequence Alignment

Sources for this lecture:

R. Durbin, S. Eddy, A. Krogh and G. Mitchison, Biological sequence analysis, Cambridge, 1998
D. Gusfield, Algorithms on strings, trees and sequences
D.W. Mount, Bioinformatics: Sequences and Genome analysis
J. Setubal & J. Meidanis, Introduction to computational molecular biology
M. Waterman, Introduction to computational biology

A multiple sequence alignment (MSA) is simply an alignment of more than two sequences, like this:

MRP2_HUMAN     SNRWLIRLELVNLVFFSLMMVIY--RDLSDVFVLSNLNIQLNWLVRM
Q9UQ99_HUMAN   VNRWLVRLEVNIVLFLFVIS--RHSLSLVLSVSYSLQVYLNWLVRMS
B8_HUMAN       NRWLEVRMEYIVVLIVSISNSLHRELSLVLLYLMVSNYLNWMVRNL
Q96J65_HUMAN   LRWFLRMDVLMNILFVLLVLS--FSSISSSKLSLSYIIQLSLLQVVR
Q96J6_HUMAN    SSRWMLRLEIMNLVLVLFVF--ISSPYSFKVMVNIVLQLSSFQRI
MRP5_HUMAN     MRWLVRLDLISILILMIVLM--HQIPPYLISYVQLLFQFVRL
MRP4_HUMAN     SRWFVRLDIMFVIIVFSLIL--KLDQVLLSYLLMMFQWVRQS
O75555_HUMAN   SRWFVRLDIMFVIIVFSLIL--KLDQVLLSYLLMMFQWVRQS
CFTR_HUMAN     SLRWFQMRIEMIFVIFFIVFISIL---EERVIILLMNIMSLQWVNSS

(A small section of a multiple alignment of the human CFTR protein and eight homologous proteins.)

4.1 Why multiple sequence alignments?

Multiple sequence alignment is applied to a set of sequences that are assumed to be related. The goal is to detect homologous residues and to place them in the same column of the multiple alignment.

Multiple sequence alignments (MSAs) are more suitable than pairwise alignments for addressing evolutionary questions, because the chance of random similarities occurring decreases as the number of aligned sequences grows.

Quote (Arthur Lesk): "One or two homologous sequences whisper... a full multiple sequence alignment shouts out loud."

Multiple alignments are used both for similarity studies, e.g. to classify members of protein families, and for dissimilarity studies, e.g. to infer phylogenetic relationships.

Characterization of protein families

Typical question: Suppose we have established a family $F = \{A_1, A_2, \dots, A_r\}$ of homologous protein sequences. Does a new sequence $A_0$ belong to the family?

One way to address this question would be to align $A_0$ to each of $A_1, \dots, A_r$ in turn. If one of these alignments produces a high score, then we may decide that $A_0$ belongs to the family $F$.

However, perhaps $A_0$ does not align particularly well to any one specific family member, but scores well in a multiple alignment, due to common motifs etc.

Conservation of structural elements

Here we show the alignment of N-acetylglucosamine-binding proteins to the tertiary structure of one of them. The example exhibits 8 conserved cysteines that form 4 disulphide bridges and are an essential part of the structure of these proteins.

MSA and evolutionary trees

One main application of multiple sequence alignments is in phylogenetic analysis. Consider the following MSA:

A_1 = N - F L S
A_2 = N - F - S
A_3 = N K Y L S
A_4 = N - Y L S

We would like to reconstruct the evolutionary tree that gave rise to these sequences, e.g.:

[Figure: evolutionary tree with root N Y L S and leaves N Y L S, N K Y L S, N F S and N F L S; the edges are labelled with the events +K, -L and "Y to F".]

In practice, the sequences considered in phylogenetics are much longer. The computation of phylogenetic trees will be discussed in a later chapter.

4.2 Definition of an MSA

Suppose we are given $r$ sequences $A_1, \dots, A_r$ over an alphabet $\Sigma$:

$$\mathcal{A} := \begin{cases} A_1 = a_{11}, a_{12}, \dots, a_{1n_1} \\ A_2 = a_{21}, a_{22}, \dots, a_{2n_2} \\ \quad\vdots \\ A_r = a_{r1}, a_{r2}, \dots, a_{rn_r} \end{cases}$$

Definition (MSA): A multiple sequence alignment (MSA) of $\mathcal{A}$ is obtained by inserting gaps ("-") into the original sequences such that

- all resulting sequences $A_i^*$ have equal length $L \ge \max\{n_i \mid i = 1, \dots, r\}$,
- we can get back the sequence $A_i$ by removing all gaps from $A_i^*$, and
- no column consists of gaps only:

$$\mathcal{A}^* := \begin{cases} A_1^* = a^*_{11}, a^*_{12}, \dots, a^*_{1L} \\ A_2^* = a^*_{21}, a^*_{22}, \dots, a^*_{2L} \\ \quad\vdots \\ A_r^* = a^*_{r1}, a^*_{r2}, \dots, a^*_{rL} \end{cases}$$

4.3 Scoring an MSA

In the case of a linear gap penalty, and assuming that the columns of an MSA are independent, the score $\alpha(\mathcal{A}^*)$ of an MSA can be defined as the sum of column scores:

$$\alpha(\mathcal{A}^*) := \sum_{i=1}^{L} s(a^*_{1i}, a^*_{2i}, \dots, a^*_{ri}).$$

Here we assume that $s(a^*_{1i}, a^*_{2i}, \dots, a^*_{ri})$ is a function that returns a score for every combination of $r$ symbols (including the gap symbol).

4.3.1 The sum-of-pairs (SP) score

How to define $s$? For two protein sequences, $s$ is usually given by a BLOSUM or PAM matrix. For more than two sequences, providing such a matrix is not practical, as the number of possible combinations of letters is too big.

Let $\mathcal{A}^*$ be an MSA. Consider two sequences $A_p^*$ and $A_q^*$ in the alignment. For two aligned symbols $u$ and $v$ we define:

$$s(u, v) := \begin{cases} \text{match score for } u \text{ and } v, & \text{if } u \text{ and } v \text{ are residues,} \\ -d, & \text{if either } u \text{ or } v \text{ is a gap, or} \\ 0, & \text{if both } u \text{ and } v \text{ are gaps.} \end{cases}$$

(Note that $u = -$ and $v = -$ can occur simultaneously in a multiple alignment.)

Let $A_p^*$ and $A_q^*$ be two sequences that are part of an MSA $\mathcal{A}^*$ of $r$ sequences. Then $\mathcal{A}^*$ defines a pairwise alignment of $A_p$ and $A_q$. Define the score of this (not necessarily optimal) pairwise alignment as

$$s(A_p^*, A_q^*) = \sum_{i=1}^{L} s(a^*_{pi}, a^*_{qi}).$$

We obtain a score for the complete MSA by summing up the pairwise scores for all pairs of sequences involved:

$$S(A_1^*, \dots, A_r^*) = \sum_{1 \le p < q \le r} s(A_p^*, A_q^*).$$

Definition: The sum-of-pairs (SP) score of an alignment $\mathcal{A}^*$ is defined as

$$\alpha_{SP}(\mathcal{A}^*) := \sum_{1 \le p < q \le r} s(A_p^*, A_q^*) = \sum_{i=1}^{L} s_{SP}(a^*_{1i}, a^*_{2i}, \dots, a^*_{ri}),$$

with

$$s(A_p^*, A_q^*) := \sum_{i=1}^{L} s(a^*_{pi}, a^*_{qi}) \quad\text{and}\quad s_{SP}(a^*_{1i}, \dots, a^*_{ri}) := \sum_{1 \le p < q \le r} s(a^*_{pi}, a^*_{qi}).$$

Note that we thus obtain a score for a multiple alignment that is based on a pairwise scoring matrix.

[Example: three columns (1), (2) and (3) of a multiple alignment of five sequences. Each column gives rise to $\binom{5}{2} = 10$ pairwise comparisons, which are scored with BLOSUM62 (N-N: 6, N-C: -3, C-C: 9).]

4.3.2 An undesirable property of the SP score

Consider a column $L$ in which all $r$ sequences contain the same residue $x$, and a column $R$ in which the first $r - 1$ sequences contain $x$ and the last sequence contains a different residue $y$.

The SP-score of the column $L$ is

$$s_{SP}(x^r) = \binom{r}{2} s(x, x).$$

The SP-score of the column $R$ is

$$s_{SP}(x^{r-1}, y) = \binom{r-1}{2} s(x, x) + (r - 1)\, s(x, y).$$

The column $L$ is completely conserved, whereas the column $R$ shows one mismatch. Clearly, it would be desirable that the former scores much better than the latter, and increasingly so for longer and longer columns.

The difference between $s_{SP}(x^r)$ and $s_{SP}(x^{r-1}, y)$ is:

$$\binom{r}{2} s(x, x) - \binom{r-1}{2} s(x, x) - (r - 1)\, s(x, y) = (r - 1)\,\bigl(s(x, x) - s(x, y)\bigr).$$

Therefore, the relative difference is

$$\frac{s_{SP}(x^r) - s_{SP}(x^{r-1}, y)}{s_{SP}(x^r)} = \frac{(r - 1)\,\bigl(s(x, x) - s(x, y)\bigr)}{\tfrac{r(r-1)}{2}\, s(x, x)} = \frac{2}{r} \left( \frac{s(x, x) - s(x, y)}{s(x, x)} \right),$$

which unfortunately decreases as the number of sequences $r$ increases!

4.4 The dynamic program for a global MSA

Dynamic programs developed for pairwise alignment can be extended to multiple alignments. We now discuss how to compute a global MSA for three sequences, in the case of a linear gap penalty. Suppose we are given:

$$\mathcal{A} = \begin{cases} A_1 = (a_{11}, a_{12}, \dots, a_{1n_1}) \\ A_2 = (a_{21}, a_{22}, \dots, a_{2n_2}) \\ A_3 = (a_{31}, a_{32}, \dots, a_{3n_3}). \end{cases}$$

We proceed by computing the entries of an $(n_1 + 1) \times (n_2 + 1) \times (n_3 + 1)$ matrix $F(i, j, k)$ recursively. After the computation, $F(n_1, n_2, n_3)$ will contain the best score $\alpha$ for a global alignment $\mathcal{A}^*$. As in the case of pairwise alignment, we can use traceback to recover an optimal alignment.

The main recursion is:

$$F(i, j, k) = \max \begin{cases} F(i-1, j-1, k-1) + s(a_{1i}, a_{2j}, a_{3k}), \\ F(i-1, j-1, k) + s(a_{1i}, a_{2j}, -), \\ F(i-1, j, k-1) + s(a_{1i}, -, a_{3k}), \\ F(i, j-1, k-1) + s(-, a_{2j}, a_{3k}), \\ F(i-1, j, k) + s(a_{1i}, -, -), \\ F(i, j-1, k) + s(-, a_{2j}, -), \\ F(i, j, k-1) + s(-, -, a_{3k}), \end{cases}$$

for $1 \le i \le n_1$, $1 \le j \le n_2$, $1 \le k \le n_3$, where $s(a, b, c)$ returns a score for a given column of symbols $a, b, c$; for example, $s = s_{SP}$, the sum-of-pairs score.

Example:

A_1 = BDE
A_2 = BE
A_3 = DEE

aligned as

A_1^* = B D E
A_2^* = B E
A_3^* = D E E
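The recursion above can be sketched in a few lines of Python. This is only a minimal illustration, not the lecture's implementation: the match, mismatch and gap values in `pair_score` are assumptions chosen for the example, the column score is the sum-of-pairs score, and the boundary cells are filled by the same seven-case recursion restricted to the moves that stay inside the matrix.

```python
from itertools import product

GAP = "-"

def pair_score(u, v, match=1, mismatch=-1, d=2):
    """Pairwise score s(u, v); the numeric values are illustrative assumptions."""
    if u == GAP and v == GAP:
        return 0
    if u == GAP or v == GAP:
        return -d
    return match if u == v else mismatch

def sp_column(column):
    """Sum-of-pairs score of a single alignment column."""
    return sum(pair_score(column[p], column[q])
               for p in range(len(column))
               for q in range(p + 1, len(column)))

def align3(a, b, c):
    """Global alignment of three sequences; returns (score, aligned rows)."""
    seqs = (a, b, c)
    n = tuple(len(s) for s in seqs)
    # The seven cases of the recursion: every non-empty choice of sequences
    # that contribute a residue (1) rather than a gap (0) to the new column.
    moves = [m for m in product((0, 1), repeat=3) if any(m)]
    F = {(0, 0, 0): 0}
    back = {}
    # product() enumerates cells in lexicographic order, so every predecessor
    # of a cell has already been computed when the cell is visited.
    for cell in product(*(range(k + 1) for k in n)):
        if cell == (0, 0, 0):
            continue
        best = None
        for m in moves:
            prev = tuple(x - dx for x, dx in zip(cell, m))
            if min(prev) < 0:
                continue
            col = tuple(seqs[t][cell[t] - 1] if m[t] else GAP for t in range(3))
            cand = F[prev] + sp_column(col)
            if best is None or cand > best:
                best, back[cell] = cand, (prev, col)
        F[cell] = best
    # Traceback from F(n1, n2, n3) to the origin.
    cols, cell = [], n
    while cell != (0, 0, 0):
        cell, col = back[cell]
        cols.append(col)
    cols.reverse()
    rows = ["".join(col[t] for col in cols) for t in range(3)]
    return F[n], rows
```

For instance, `align3("ACG", "AG", "ACG")` places a gap in the middle of the second sequence. Time and memory grow with the product of the three sequence lengths, which is why this approach does not scale to many sequences.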

Complexity of the dynamic program for an MSA

What is the complexity of the dynamic programming approach for an MSA of $r$ sequences of length $n$ using the SP-score?

Space complexity: $O(n^r)$
Time complexity: $O(r^2 n^r 2^r)$.

Theorem: Computing an MSA with optimal SP-score is NP-hard (L. Wang and T. Jiang, J Comp Biol 1994).

Progressive alignment

Because optimal multiple sequence alignments cannot be computed efficiently by dynamic programming, we turn to heuristics. One main approach is progressive alignment.

Progressive alignment has three steps:

1. Compute pairwise distances between all sequences.
2. Build a rooted binary guide tree based on the distances.
3. In a bottom-up traversal of the tree, repeatedly align the sequences or profiles associated with the two children of the current node and then assign the result to the current node.

The result is the alignment assigned to the root of the tree.

The main idea is to align sequences along a tree. In the example indicated below, we first align sequences 1 and 2 to obtain (1, 2), then 4 and 5 to obtain (4, 5), and then 3 with (4, 5) to obtain (3, (4, 5)). Finally, we align (1, 2) with (3, (4, 5)) to obtain an alignment of all five sequences.

[Figure: guide tree with root ((1,2),(3,(4,5))) and internal nodes (1,2), (3,(4,5)) and (4,5).]

Order of alignment matters

A guide tree specifies the order in which sequences are aligned. The following example shows that order matters: $A_1$ = LVK, $A_2$ = PFK, $A_3$ = LFVK, $A_4$ = PFVK. Performing the alignment of these sequences in two different orders gives two different results:

(A_1, A_2), (A_3, A_4)    or    (A_1, A_3), (A_2, A_4)

LV-K                            L-VK
PF-K                            PF-K
LFVK                            LFVK
PFVK                            PFVK

Pseudocode for progressive alignment

The general algorithm for progressive alignment is as follows:

Input: a set $\mathcal{A} = \{A_1, \dots, A_r\}$ of sequences
begin
    $\mathcal{C} := \emptyset$  // this will hold the current set of alignments
    for $i = 1, 2, \dots, r$ do $\mathcal{C} := \mathcal{C} \cup \{\{A_i\}\}$
    do
        choose two sub-alignments $A_p^*, A_q^*$ from $\mathcal{C}$; $\mathcal{C} := \mathcal{C} \setminus \{A_p^*, A_q^*\}$
        $A_s^* := \mathrm{align}(A_p^*, A_q^*)$; $\mathcal{C} := \mathcal{C} \cup \{A_s^*\}$
    while $|\mathcal{C}| > 1$
end

The guide tree is not explicitly mentioned; it is used to decide which two sub-alignments to choose.

Existing progressive alignment methods differ in:

1. how pairwise distances are computed between sequences,
2. the order in which the sequences are aligned (or how the guide tree is constructed), and
3. which parameters are used (such as scoring function, gap penalties, weight of individual sequences).

Aligning two alignments

How do we align two alignments? Assume that we have two multiple sequence alignments $\mathcal{A}_1$ and $\mathcal{A}_2$. There are two ways to align these two alignments, namely:

- compute a pair-guided alignment, or
- compute a profile alignment.

Pair-guided alignment of two sub-alignments

To align two multiple alignments $\mathcal{A}_1$ and $\mathcal{A}_2$ using the pair-guided alignment approach, one chooses one sequence $x$ from $\mathcal{A}_1$ and one sequence $y$ from $\mathcal{A}_2$ (including all gaps that they contain). The two sequences $x$ and $y$ are then optimally aligned using dynamic programming. All columns of the original sub-alignments follow the corresponding letters in $x$ and $y$.

[Example: two (sub-)alignments, with surviving rows LEE, -EE, -LEE and -ERE, LER-. The first sequence of the first (sub-)alignment is aligned with the last sequence of the second, gaps are added to the other sequences in the sub-alignments, and the result is the final multiple alignment.]

Profile alignment

Suppose we are given two MSAs (called profiles in this context) $\mathcal{A}_1 = \{A_1, \dots, A_r\}$ and $\mathcal{A}_2 = \{A_{r+1}, \dots, A_n\}$.

We now discuss profile alignment in the case of the SP-score and linear gap scores. We will assume $s(-, a) = s(a, -) = -g$ and $s(-, -) = 0$ for all $a$ in $\mathcal{A}_1$ or $\mathcal{A}_2$.

Definition: A profile alignment of $\mathcal{A}_1$ and $\mathcal{A}_2$ is an MSA

$$\mathcal{A}^* = \begin{cases} A_1^* = a^*_{11}, a^*_{12}, \dots, a^*_{1L} \\ \quad\vdots \\ A_r^* = a^*_{r1}, a^*_{r2}, \dots, a^*_{rL} \\ A_{r+1}^* = a^*_{r+1,1}, a^*_{r+1,2}, \dots, a^*_{r+1,L} \\ \quad\vdots \\ A_n^* = a^*_{n1}, a^*_{n2}, \dots, a^*_{nL}, \end{cases}$$

obtained by inserting gaps in whole columns of $\mathcal{A}_1$ or $\mathcal{A}_2$, without changing the alignment of either of the two profiles.

Gaps that exist in either input alignment are never removed: once a gap, always a gap.

The SP-score of the profile alignment is:

$$\alpha_{SP}(\mathcal{A}^*) = \sum_{1 \le p < q \le n} \sum_{i=1}^{L} s(a^*_{pi}, a^*_{qi}) = \sum_{i=1}^{L} \sum_{1 \le p < q \le n} s(a^*_{pi}, a^*_{qi})$$

$$= \underbrace{\sum_{i=1}^{L} \sum_{1 \le p < q \le r} s(a^*_{pi}, a^*_{qi})}_{\text{alignment score of } \mathcal{A}_1} + \underbrace{\sum_{i=1}^{L} \sum_{r < p < q \le n} s(a^*_{pi}, a^*_{qi})}_{\text{alignment score of } \mathcal{A}_2} + \underbrace{\sum_{i=1}^{L} \sum_{1 \le p \le r < q \le n} s(a^*_{pi}, a^*_{qi})}_{\text{cross terms}}.$$

The first two sums are fixed by the input profiles, because inserting whole gap columns only adds gap-gap pairs of score 0. The third sum can be optimized using standard pairwise alignment, with the modification that columns are scored against columns by adding their pair scores.
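The column-against-column alignment described above can be sketched as follows. This is a minimal illustration under stated assumptions: the match, mismatch and gap values are made up for the example, `align_profiles` is a hypothetical helper name, and the returned score covers only the cross terms, since the other two sums do not depend on how the profiles are aligned.

```python
GAP = "-"

def pair_score(u, v, match=1, mismatch=-1, g=2):
    """s(u, v) with s(-, a) = s(a, -) = -g and s(-, -) = 0 (values assumed)."""
    if u == GAP and v == GAP:
        return 0
    if u == GAP or v == GAP:
        return -g
    return match if u == v else mismatch

def cross_score(col1, col2):
    """Score one column of profile 1 against one column of profile 2."""
    return sum(pair_score(u, v) for u in col1 for v in col2)

def align_profiles(p1, p2):
    """Align two profiles (lists of equal-length rows); returns (cross score, rows)."""
    c1, c2 = list(zip(*p1)), list(zip(*p2))
    gap1, gap2 = (GAP,) * len(p1), (GAP,) * len(p2)   # all-gap columns
    n, m = len(c1), len(c2)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = F[i - 1][0] + cross_score(c1[i - 1], gap2)
    for j in range(1, m + 1):
        F[0][j] = F[0][j - 1] + cross_score(gap1, c2[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + cross_score(c1[i - 1], c2[j - 1]),
                F[i - 1][j] + cross_score(c1[i - 1], gap2),
                F[i][j - 1] + cross_score(gap1, c2[j - 1]),
            )
    # Traceback: whole columns are copied, so existing gaps are never removed.
    merged, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + cross_score(c1[i - 1], c2[j - 1]):
            merged.append(c1[i - 1] + c2[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and F[i][j] == F[i - 1][j] + cross_score(c1[i - 1], gap2):
            merged.append(c1[i - 1] + gap2)
            i -= 1
        else:
            merged.append(gap1 + c2[j - 1])
            j -= 1
    merged.reverse()
    rows = ["".join(col[t] for col in merged) for t in range(len(p1) + len(p2))]
    return F[n][m], rows
```

If both profiles contain a single sequence, the procedure reduces to ordinary global pairwise alignment with a linear gap penalty.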

Clearly, either or both profiles may consist of a single sequence. In the former case, we are aligning a single sequence to a profile, and in the latter case, we are simply aligning two sequences.

In the following example, use 0 for match and 1 for mismatch or gap:

[Example: Alignment 1 and Alignment 2, which together contain the sequences $A_1, \dots, A_6$.]

What is the score for each alignment? What is the optimal score for a profile alignment of the two?

4.6 Feng-Doolittle

The first progressive alignment algorithm to be published was the Feng-Doolittle algorithm (Feng, D-F & Doolittle, RF. Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J. Mol. Evol. 25).

Algorithm (Feng-Doolittle)

1. Calculate all $\binom{r}{2}$ pairwise alignment scores and convert them into distance scores.
2. Construct a guide tree (using the Fitch and Margoliash clustering algorithm, 1967) from the distance matrix.
3. Traverse the tree bottom-up and perform a profile alignment on the two children of any internal node and then assign the result to the node.

The final result is given by the alignment assigned to the root node.

The distance calculation that Feng-Doolittle uses is:

$$D = -\log S_{\mathrm{eff}} = -\log \frac{S_{\mathrm{obs}} - S_{\mathrm{rand}}}{S_{\mathrm{max}} - S_{\mathrm{rand}}},$$

where $S_{\mathrm{obs}}$ is the observed score for a pair of sequences, $S_{\mathrm{max}}$ is the maximum score and $S_{\mathrm{rand}}$ is the expected score of an alignment of two random sequences of the same length and composition as the pair in question.

Thus the score $S_{\mathrm{eff}}$ can be viewed as a normalised percentage similarity: it is expected that with increasing evolutionary distance this score decays exponentially towards zero.

The sequence-sequence alignments are conducted using the profile alignment approach.

4.7 CLUSTALW

CLUSTALW can be considered an improvement of the Feng-Doolittle algorithm. For many years, this was possibly the most widely used program for computing an MSA. (Thompson, J.D., Higgins, D.G. & Gibson, T.J. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, positions-specific gap penalties and weight matrix choice. Nucleic Acids Research, 22. Thompson, J.D., Gibson, T.J., Plewniak, F., Jeanmougin, F. & Higgins, D.G. The ClustalX windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Research, 24, 1997.)

Algorithm (CLUSTALW progressive alignment)

1. Construct a distance matrix of all $\binom{r}{2}$ pairs by pairwise dynamic programming alignment followed by approximate conversion of similarity scores to evolutionary distances.
2. Construct a guide tree from the distance matrix using the Neighbor Joining tree-building method.
3. Progressively align sequences at nodes of the tree in order of decreasing similarity, using sequence-sequence, sequence-profile and profile-profile alignment.

ClustalW provides a choice of two distance scores, both derived from an optimal pairwise alignment of sequences $A_i$ and $A_j$:

One is the observed distance, defined as $D_{ij} = 1 - (s_{ij} / L)$, where $s_{ij}$ is the number of identities in the best alignment between $A_i$ and $A_j$, and $L$ is the number of positions considered (gap positions are excluded). This distance score equals the relative number of differences per site.

The other is the corrected distance calculated using the Kimura correction (Kimura 1983). (We will discuss distance corrections later.)

There are no provable performance guarantees associated with the program. However, it works well in practice and the following features contribute to its accuracy:

- Sequences are weighted to compensate for the defects of the SP score.
- The substitution matrix used is chosen based on the similarity expected of the alignment, e.g. BLOSUM80 for closely related sequences and BLOSUM50 for less related ones.
- Position-specific gap-open profile penalties are multiplied by a modifier that is a function of the residues observed at the position (hydrophobic residues give higher gap penalties than hydrophilic or flexible ones).
- Gap-open penalties are also decreased if the position is spanned by a consecutive stretch of five or more hydrophilic residues.
- Gap-open and gap-extension penalties increase if there are no gaps in the column, but gaps nearby. (This tries to force gaps to occur in the same places.)
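The two score-to-distance conversions described for Feng-Doolittle and ClustalW can be sketched as below. The function names are hypothetical, the inputs are assumed to be already computed (alignment scores, or two rows of a pairwise alignment), and returning infinity for a non-positive effective score is an assumption of this sketch rather than something the text specifies.

```python
import math

def feng_doolittle_distance(s_obs, s_max, s_rand):
    """D = -log S_eff with S_eff = (S_obs - S_rand) / (S_max - S_rand)."""
    s_eff = (s_obs - s_rand) / (s_max - s_rand)
    if s_eff <= 0:
        # Assumption of this sketch: a pair that scores no better than
        # random sequences is treated as infinitely distant.
        return math.inf
    return -math.log(s_eff)

def observed_distance(row_i, row_j, gap="-"):
    """D_ij = 1 - s_ij / L, ignoring every column that contains a gap."""
    pairs = [(u, v) for u, v in zip(row_i, row_j) if u != gap and v != gap]
    if not pairs:
        raise ValueError("no gap-free columns to compare")
    identities = sum(u == v for u, v in pairs)
    return 1 - identities / len(pairs)
```

For example, two aligned rows that agree in three of four gap-free columns have an observed distance of 0.25, and a pair whose observed score equals the maximum score has Feng-Doolittle distance 0.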
Example

As an example, assume that we would like to align the following 11 (Trypsin and Trypsin inhibitor) sequences, which are given in FastA format:

>EEI-II
PRILMRKQDSDLVPNFSP
>Ii Mutant
PRLLMRKQDSDLVPNF
>BDI-II
RPRILMRKRDSDLVQKNY
>MeI-B
VPRILMKKDRDLKRNY
>MI-IV
HEERVPRILMKKKDSDLEVLEHY
>SI-IIB
MVPKILMKKHDSDLLDVLEDIYVS
>MRI-I
IPRILMEKRDSDLQVKRQY
>rypsin
RIPRIWMERDSDMKIVH
>IR MOMH
RSPRIWMERDSDMKIVH
>MI-
RIPRIWMEKRDSDMQIVDH
>LI-III
RIPRILMESSDSDLEILENF

First step: pairwise scores

Start of Pairwise alignments
Aligning...
Sequences (1:2) Aligned. Score: 96
Sequences (1:3) Aligned. Score: 82
Sequences (1:4) Aligned. Score: 68
Sequences (1:5) Aligned. Score: 66
Sequences (1:6) Aligned. Score: 60
Sequences (1:7) Aligned. Score: 68
Sequences (1:8) Aligned. Score: 57
Sequences (1:9) Aligned. Score: 57
Sequences (1:10) Aligned. Score: 60
Sequences (1:11) Aligned. Score:

Second step: the NJ guide tree

[Figure: NJ guide tree with leaves MRI-I, LI-III, MI-, rypsin, IR MOMH, MI-IV, SI-IIB, MeI-B, BDI-II, EEI-II and Ii Mutant; scale bar 0.1.]

Third step: progressive alignment along the guide tree

Start of Multiple Alignment
There are 10 groups
Aligning...
Group 1: Sequences: 2 Score:641
Group 2: Sequences: 3 Score:600
Group 3: Sequences: 4 Score:571
Group 4: Sequences: 2 Score:601
Group 5: Sequences: 6 Score:540
Group 6: Sequences: 7 Score:561
Group 7: Sequences: 2 Score:639
Group 8: Sequences: 3 Score:619
Group 9: Sequences: 4 Score:560
Group 10: Sequences: 11 Score:515
Alignment Score 7716
CLUSTAL-Alignment file created

Result:

[Figure: the resulting multiple alignment.]

4.8 T-COFFEE

T-COFFEE, short for Tree-based Consistency Objective Function for alignment Evaluation, is also a program that progressively aligns sequences in order to build an MSA (C. Notredame, D. Higgins, J. Heringa: T-Coffee: a novel method for multiple sequence alignments. J Mol Biol 302, 2000).

T-COFFEE aims for consistency: an MSA is consistent if it agrees as well as possible with all optimal pairwise alignments.

T-Coffee uses an extended library of scores instead of a substitution matrix.

4.9 The alignment graph

In the following, we will consider an alternative approach to computing an MSA, based on Integer Linear Programming.

Suppose we are given two sequences $a^1$ and $a^2$. The complete alignment graph is the following bipartite graph $G = (V, E)$, with node set $V$ and edge set $E$:

[Figure: complete alignment graph of the two sequences.]

Each edge $e = (u, v)$ has a weight $\omega(e) = s(u, v)$, namely the score for placing $v$ under $u$.

An alignment graph is any subgraph of the complete alignment graph.

The trace of an alignment

Consider an alignment such as:

[Figure: an alignment of two sequences and the corresponding edges of the alignment graph.]

We say that an edge in the alignment graph is realized if the corresponding positions are aligned.

The set of realized edges is called the trace of the alignment.

An arbitrary subset $T \subseteq E$ of edges is called a trace if there exists some alignment that realizes precisely the edges in $T$.

Similarly, we define the (complete) alignment graph and trace for multiple alignments. For $r$ sequences, the resulting graph will be $r$-partite.

Maximum-weight trace problem

Problem (Maximum-Weight Trace Problem): Given a set of sequences $\mathcal{A}$ and a corresponding alignment graph $G = (V, E)$ with edge weights $\omega$, the maximum-weight trace problem is to find a trace $T \subseteq E$ of maximum weight.

For two sequences, this is the so-called maximum-weight bipartite matching problem, which is known to be solvable in polynomial time.

Characterization of traces

We have seen that an alignment can be described by a trace in the complete alignment graph $G = (V, E)$.

Question: Is every subset $T \subseteq E$ the trace of some alignment? The answer is clearly no:

[Figure: a set of edges that cannot be realized by any alignment.]

Our goal is to characterize all legal traces. Here are two examples:

[Figure: two candidate traces and the corresponding alignments: (a) ok, (b) not ok.]

The extended alignment graph

The alignment graph is extended by defining a set of directed edges $H$ on the cells of the matrix $A = \{a_{ij}\}$ that correspond to successive cells or letters, $(a_{ij}, a_{i,j+1})$, as shown here:

[Figure: extended alignment graph with directed edges between successive letters of each sequence.]

Simple mixed cycles

A mixed cycle $Z$ is a cycle in the extended alignment graph $G = (V, E, H)$ that contains both undirected and directed edges, from $E$ and $H$, respectively, the latter all in the same direction:

[Figure: a mixed cycle $Z$.]

A mixed cycle $Z$ is called simple if all nodes in $Z \cap a^p$ occur consecutively in $Z$ for every sequence $a^p$. In other words, a simple mixed cycle enters and leaves any given sequence at most once.

The following result says that we can restrict our attention to those mixed cycles that are simple:

Lemma (Simple cycles suffice): The graph $G' = (V, T \cup H)$ contains a simple mixed cycle if and only if it contains a mixed cycle.

We now obtain a nice result for determining whether a proposed trace is truly the trace of an alignment:

Theorem (Trace characterization): A subset $T \subseteq E$ is a trace if and only if $G' = (V, T \cup H)$ does not contain a simple mixed cycle.

Returning to the two examples shown above, the first contains no simple mixed cycle, whereas the second one does:

[Figure: the two example traces, with the simple mixed cycle of the second one highlighted.]

Block partition and the GMT problem

Suppose we are given a set of sequences $\mathcal{A} = \{a^1, a^2, \dots, a^r\}$. The complete alignment graph is usually too big to be useful. Often, we are given a set of block matches between pairs of the sequences $a^p$ and $a^q$, where a match relates a substring of $a^p$ and a substring of $a^q$ via a run of non-crossing edges (called a block), as shown here for two blocks $D$ and $D'$:

[Figure: two blocks $D$ and $D'$ of parallel edges between two sequences.]

Suppose we are given such a partition $\mathcal{D}$ of the edges of $G = (V, E)$ obtained from a set of matches. For a trace we require that, for any given block $D \in \mathcal{D}$, either all edges in $D$ are realized, or none.

Each block $D$ is assigned a positive weight $\omega(D)$, reflecting the number and weight of the edges that it contains.

Problem (GMT problem): Suppose we are given an extended alignment graph $G = (V, E, H)$ and a partition $\mathcal{D}$ of $E$ into blocks with weights $\omega(D)$. The Generalized Maximum Trace (GMT) problem is to determine a set $M \subseteq \mathcal{D}$ of maximum total weight such that the edges in $\bigcup_{D \in M} D$ do not induce a mixed cycle on $G$.

Although blocks play an important role in practice, to simplify the following discussion, we will not use them explicitly. However, everything that follows is easily adjusted to the case that a set of blocks is given.
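The mixed-cycle condition of the trace characterization theorem can be checked mechanically. The following is a minimal sketch, assuming that positions are given as `(sequence index, position)` pairs and that `is_trace` is a hypothetical helper name. It contracts every group of positions joined by undirected edges into one node and then tests whether the directed edges between successive positions form a cycle on the contracted graph; this swaps the explicit search for mixed cycles for an equivalent acyclicity test.

```python
from collections import defaultdict, deque

def is_trace(lengths, edges):
    """lengths[p] is the length of sequence p; edges holds ((p, i), (q, j)) pairs."""
    nodes = [(p, i) for p, n in enumerate(lengths) for i in range(n)]
    parent = {v: v for v in nodes}

    def find(v):
        # Union-find with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    # Contract undirected edges: aligned positions share one column.
    for u, v in edges:
        parent[find(u)] = find(v)

    succ = defaultdict(set)
    indeg = defaultdict(int)
    comps = {find(v) for v in nodes}
    # Directed edges H between successive positions of each sequence.
    for p, n in enumerate(lengths):
        for i in range(n - 1):
            a, b = find((p, i)), find((p, i + 1))
            if a == b:
                return False          # two neighbours forced into one column
            if b not in succ[a]:
                succ[a].add(b)
                indeg[b] += 1

    # Kahn's algorithm: the contracted graph is acyclic exactly when
    # every component can be removed in topological order.
    queue = deque(c for c in comps if indeg[c] == 0)
    removed = 0
    while queue:
        c = queue.popleft()
        removed += 1
        for d in succ[c]:
            indeg[d] -= 1
            if indeg[d] == 0:
                queue.append(d)
    return removed == len(comps)
```

For two sequences of length two, the two parallel edges form a trace, whereas the two crossing edges do not, because they close a mixed cycle.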

Linear programming

A linear program (LP) consists of a set of linear inequalities,

$$\begin{aligned} a_{11} x_1 + a_{12} x_2 + \dots + a_{1n} x_n &\le b_1 \\ a_{21} x_1 + a_{22} x_2 + \dots + a_{2n} x_n &\le b_2 \\ &\;\;\vdots \\ a_{m1} x_1 + a_{m2} x_2 + \dots + a_{mn} x_n &\le b_m, \end{aligned}$$

together with an objective function to be optimized, i.e. minimized or maximized:

$$c_1 x_1 + c_2 x_2 + \dots + c_n x_n.$$

Linear programs can be efficiently solved using the simplex method, developed by George Dantzig. There exist powerful computer programs for solving LPs, even when huge numbers of variables and inequalities are involved.

CPLEX is a very powerful commercial LP solver. Moderate size problems can be solved using lp_solve, which is free for academic purposes.

The inequalities describe a convex polyhedron, which is called a polytope if it is bounded.

For example, the inequalities

$$\begin{aligned} -1x_1 - 1x_2 &\le -5 \\ -2x_1 + 1x_2 &\le -1 \\ 1x_1 + 3x_2 &\le 18 \\ 1x_1 + 0x_2 &\le 6 \\ 1x_1 - 2x_2 &\le 2 \end{aligned}$$

describe the following hyperplanes and polytope:

[Figure: the five hyperplanes and the polytope that they enclose.]

For example, the objective function $2x_1 - 3x_2$ takes on a maximum of 6, for $x_1 = 6$ and $x_2 = 2$, and a minimum of $-9$, for $x_1 = 3$ and $x_2 = 5$.

Integer linear program

An integer linear program (ILP) is a linear program with the additional constraint that the variables $x_i$ are only allowed to take on integer values.

Solving ILPs has been shown to be NP-hard. (See the book by Garey and Johnson 1979, for this and many other NP-completeness results.)

There exist a number of different strategies for approximating or solving such an ILP. These strategies usually first attempt to solve relaxations of the original problem, which are obtained by dropping some of the inequalities. They usually also rely on the LP-relaxation of the ILP, which is the LP obtained by dropping the integer condition.

ILP for the GMT problem

How to encode the GMT problem as an integer LP?

Assume we are given an extended alignment graph $G = (V, E, H)$, with $E = \{e_1, e_2, \dots, e_n\}$. Each edge $e_i \in E$ is represented by a variable $x_i$ that will take on value 1 if $e_i$ belongs to the best scoring trace, and 0 if not.

Hence, our variables are $x_1, x_2, \dots, x_n$. To ensure that the integer variables are binary, we add the constraints $x_i \le 1$ and $x_i \ge 0$.

Additional inequalities must be added to prevent mixed cycles. (This and the following is from: Knut Reinert, A Polyhedral Approach to Sequence Alignment Problems, Dissertation, Saarbrücken 1999.)

For example, consider:

[Figure: two sequences joined by four edges $e_1$, $e_2$, $e_3$ and $e_4$.]

There are three possible simple mixed cycles in the graph, one using $e_1$ and $e_3$, one using $e_2$ and $e_3$, and one using $e_2$ and $e_4$. We add the constraints

$$x_1 + x_3 \le 1, \quad x_2 + x_3 \le 1, \quad x_2 + x_4 \le 1$$

to ensure that none of the simple mixed cycles is realized.

For example, consider:

[Figure: three sequences with three edges $e_1$, $e_2$ and $e_3$.]

with three edges $e_1$, $e_2$ and $e_3$ that all participate in a simple mixed cycle. The constraint

$$x_i + x_j + x_k \le 2$$

prevents them from being realized simultaneously.

In summary, suppose we are given an extended alignment graph $G = (V, E, H)$ with $E = \{e_1, e_2, \dots, e_n\}$, and a score $\omega_i$ defined for every edge $e_i \in E$. We can obtain a solution to the GMT problem by solving the following ILP:

$$\text{Maximize} \quad \sum_{e_i \in E} \omega_i x_i,$$

$$\text{subject to} \quad \sum_{e_i \in C \cap E} x_i \le |C \cap E| - 1, \quad \text{for all simple mixed cycles } C,$$

$$\text{and} \quad x_i \in \{0, 1\} \quad \text{for all variables } i = 1, \dots, n.$$
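For a handful of edges, the ILP above can be solved by plain enumeration, which makes the role of the cycle constraints easy to see. The sketch below assumes that each simple mixed cycle is given as the set of (0-based) indices of the edges it uses; the weights in the usage example are hypothetical, and enumeration is of course no substitute for a real ILP solver.

```python
from itertools import product

def solve_trace_ilp(weights, cycles):
    """Maximize sum(w_i * x_i) over binary x such that, for every cycle C,
    the sum of x_i over the edges of C is at most |C| - 1."""
    best_value, best_x = None, None
    for x in product((0, 1), repeat=len(weights)):
        feasible = all(sum(x[i] for i in c) <= len(c) - 1 for c in cycles)
        if not feasible:
            continue
        value = sum(w * xi for w, xi in zip(weights, x))
        if best_value is None or value > best_value:
            best_value, best_x = value, x
    return best_value, best_x

# The four-edge example: cycles use (e1, e3), (e2, e3) and (e2, e4).
cycles = [{0, 2}, {1, 2}, {1, 3}]
weights = [3, 2, 4, 2]            # hypothetical edge scores
value, x = solve_trace_ilp(weights, cycles)
```

With these weights the best trace realizes $e_3$ and $e_4$, which is the only pair of edges containing $e_3$ that violates none of the three constraints.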

Solving the ILP using branch-and-cut

Solving IPs and ILPs is a main topic in combinatorial optimization. We will take a brief look at the branch-and-cut approach.

Branch-and-cut makes use of two techniques:

Cutting: to solve an ILP, one considers the LP-relaxation of the problem and repeatedly cuts away parts of the polytope (by adding new constraints) in the hope of obtaining an integer solution.

Branch-and-bound: an enumeration tree of all possible choices of parameters is partially traversed, computing local upper bounds and global lower bounds, which are used to avoid parts of the tree that cannot produce the optimal value.

First note that the number of mixed cycles grows exponentially with the size of the graph. So, initially, we select a polynomial number of constraints. That is, we consider a relaxation of the original problem.

We further relax the problem by solving the LP-relaxation. If the solution $\hat{x}$ is not integral, or is not feasible, then we add an unused constraint to the LP to cut away a part of the polyhedron that contains $\hat{x}$. (This is a non-trivial operation that we won't discuss.) This is repeated until an integer solution is found that fulfills all constraints, or until we get stuck.

If no appropriate cut plane can be found, then we branch. That is, we choose a variable $x_i$ and solve two sub-cases, namely the case $x_i = 0$ and the case $x_i = 1$. Repeated application produces an enumeration tree of possible cases.

We call an upper bound for the original ILP local if it is obtained from considering such a subproblem in the enumeration tree.

If the solution found for a subproblem is feasible for the original problem and has a higher score than any solution found so far, then it is recorded and its value becomes the new global lower bound for the original objective function.

Subsequently, we only pursue subproblems whose local upper bound is greater than or equal to the global lower bound.
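The bounding logic described above can be illustrated on the binary trace ILP. The sketch below covers only the branch-and-bound half: it adds no cutting planes, and instead of solving an LP-relaxation it uses a much cruder local upper bound (the current value plus all remaining positive weights). Both simplifications, and the function name, are assumptions of this sketch.

```python
def branch_and_bound(weights, cycles):
    """Maximize sum(w_i * x_i) over binary x; for each cycle (a set of edge
    indices), not all of its edges may be selected at once."""
    n = len(weights)
    best_value, best_x = 0, [0] * n   # x = 0 is always feasible: first lower bound
    x = [0] * n

    def violates(upto):
        # A cycle is violated once all of its edges are decided and set to 1.
        return any(max(c) < upto and all(x[i] for i in c) for c in cycles)

    def recurse(i, value):
        nonlocal best_value, best_x
        # Local upper bound: optimistically take every remaining positive weight.
        upper = value + sum(w for w in weights[i:] if w > 0)
        if upper < best_value:
            return                     # this subtree cannot beat the lower bound
        if i == n:
            if value > best_value:     # new global lower bound
                best_value, best_x = value, x[:]
            return
        for choice in (1, 0):          # branch on x_i = 1 and x_i = 0
            x[i] = choice
            if not violates(i + 1):
                recurse(i + 1, value + choice * weights[i])
        x[i] = 0

    recurse(0, 0)
    return best_value, tuple(best_x)
```

A tighter local upper bound, such as the value of the LP-relaxation of the subproblem, prunes far more of the enumeration tree than the simple bound used here.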
General strategy:

[Figure: enumeration tree of the branch-and-cut strategy. The original problem is relaxed and then repeatedly cut and solved. Branching on $x_i = 0$ / $x_i = 1$, then $x_j = 0$ / $x_j = 1$ and $x_k = 0$ / $x_k = 1$ produces smaller problems, each of which is again repeatedly cut and solved. A solution from a smaller problem gives a local upper bound; feasible solutions maintain the global lower bound, which is used to bound the search.]

As the details are quite involved, we will skip them.

An ILP for pairwise alignment

We discuss how to formulate the ILP for the problem of aligning two sequences. Suppose we are given two sequences $a = (a_1, \dots, a_n)$ and $b = (b_1, \dots, b_m)$. Let $s(f, g)$ denote the score for aligning symbols $f$ and $g$.

The objective function that we would like to maximize is:

$$\sum_{1 \le i \le n} \sum_{1 \le j \le m} s(a_i, b_j)\, x_{ij},$$

where $x_{ij}$ is a variable that indicates whether the edge from node $a_i$ to node $b_j$ belongs to the trace, or not.

To ensure that every variable $x_{ij}$ is binary, we use the inequalities

$$x_{ij} \le 1 \quad\text{and}\quad x_{ij} \ge 0,$$

for all $i, j$ with $1 \le i \le n$ and $1 \le j \le m$.

In the case of two sequences, every simple mixed cycle is given by an ordered pair of positions $(i, j)$ in sequence $a$ and an ordered pair of positions $(k, l)$ in sequence $b$:

[Figure: sequences $a_1 \dots a_i \dots a_j \dots a_n$ and $b_1 \dots b_k \dots b_l \dots b_m$ with the two crossing edges $x_{il}$ and $x_{jk}$.]

This gives rise to the following set of inequalities:

$$x_{il} + x_{jk} \le 1,$$

for all $i, j, k, l$ with $1 \le i \le j \le n$, $1 \le k \le l \le m$ and, additionally, $i \ne j$ or $k \ne l$.

For example, consider two sequences $a$ of length 3 and $b$ of length 2, and assume that the match and mismatch scores are 1 and $-1$, respectively. In the format used by the program lp_solve, the ILP has the following formulation:

    max: +1*x1001-1*x1002-1*x2001-1*x2002-1*x3001+1*x3002;
    x1001<1;
    x1002<1;
    x2001<1;
    x2002<1;
    x3001<1;
    x3002<1;
    x1002+x1001<1;
    x1001+x2001<1;
    x1002+x2001<1;
    x1002+x2002<1;
    x1001+x3001<1;
    x1002+x3001<1;
    x1002+x3002<1;
    x2002+x2001<1;
    x2001+x3001<1;
    x2002+x3001<1;
    x2002+x3002<1;
    x3002+x3001<1;
    int x1001, x1002, x2001, x2002, x3001, x3002;

Here, we use the variable $x_{1000i + j}$ to represent the edge from $a_i$ to $b_j$ for all $i, j$. The program lp_solve interprets < as $\le$ and assigns only non-negative values to variables.

The gapped extended alignment graph

The extended alignment graph $G = (V, E, H)$ does not explicitly model gaps. To allow the scoring of gaps, we add a new set $B$ of edges to the graph, joining any two consecutive nodes in a sequence, as indicated here:

[Figure: for two sequences $a$ and $b$, (a) the gapped extended alignment graph, (b) an alignment and (c) the gapped trace that realizes the gapped alignment.]

Suppose we are given a gapped extended alignment graph $G = (V, E, H, B)$. When modeling gaps in a trace we require, for any pair of sequences, that a node must either be incident to an alignment edge between the two sequences, or it must be incident to or enclosed by exactly one gap edge. Additionally, we require that a consecutive run of gap characters is regarded as one gap.

Without going into details, this gives rise to the definition of a gapped trace $(T, C)$, with $T \subseteq E$ and $C \subseteq B$.

Given weights $\omega$ for all edges in $E$ and $B$, a gapped trace can be scored as follows:

$$\alpha((T, C)) = \sum_{e \in T} \omega(e) - \sum_{g \in C} \omega(g).$$

In the gapped trace formulation, it is straightforward to encode linear, affine, or any other reasonable gap cost function.

Conclusions

- The computation of multiple sequence alignments (MSAs) is an important problem in bioinformatics.
- MSAs are often scored using the sum-of-pairs (SP) approach.
- Computing an MSA with optimal SP score is computationally hard.
- One heuristic approach is to perform progressive alignment along a guide tree; methods based on this idea include Feng-Doolittle, ClustalW and T-Coffee.
- We also saw that sequence alignment problems can be addressed using Integer Linear Programming.
- There are many other alignment methods, such as DIALIGN, MUSCLE, MAFFT or Clustal-Ω.
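Returning to the ILP for pairwise alignment: the lp_solve formulation can be generated mechanically from two sequences. The sketch below is only an illustration under stated assumptions. The function name is hypothetical, the sequences in the usage example (`ACG` and `AG`) are made up so that they reproduce the objective function shown in the text, and the order of the constraint lines differs from the listing.

```python
def pairwise_ilp(a, b, match=1, mismatch=-1):
    """Return the pairwise-alignment ILP as text in lp_solve's LP format."""
    n, m = len(a), len(b)

    def var(i, j):
        # Variable x_{1000 i + j} stands for the edge from a_i to b_j (1-based).
        return f"x{1000 * i + j}"

    cells = [(i, j) for i in range(1, n + 1) for j in range(1, m + 1)]
    terms = []
    for i, j in cells:
        s = match if a[i - 1] == b[j - 1] else mismatch
        terms.append(f"{s:+d}*{var(i, j)}")
    lines = ["max: " + "".join(terms) + ";"]
    # Upper bounds; lp_solve reads "<" as "<=" and keeps variables non-negative.
    lines += [f"{var(i, j)}<1;" for i, j in cells]
    # One constraint per simple mixed cycle: x_il + x_jk <= 1.
    for i in range(1, n + 1):
        for j in range(i, n + 1):
            for k in range(1, m + 1):
                for l in range(k, m + 1):
                    if i != j or k != l:
                        lines.append(f"{var(i, l)}+{var(j, k)}<1;")
    lines.append("int " + ", ".join(var(i, j) for i, j in cells) + ";")
    return "\n".join(lines)

lp_text = pairwise_ilp("ACG", "AG")   # hypothetical example sequences
```

The encoding $x_{1000i + j}$ only yields unique variable names while the second sequence has fewer than 1000 positions.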


More information

Notes on Dynamic-Programming Sequence Alignment

Notes on Dynamic-Programming Sequence Alignment Notes on Dynamic-Programming Sequence Alignment Introduction. Following its introduction by Needleman and Wunsch (1970), dynamic programming has become the method of choice for rigorous alignment of DNA

More information

LATIN SQUARES AND THEIR APPLICATION TO THE FEASIBLE SET FOR ASSIGNMENT PROBLEMS

LATIN SQUARES AND THEIR APPLICATION TO THE FEASIBLE SET FOR ASSIGNMENT PROBLEMS LATIN SQUARES AND THEIR APPLICATION TO THE FEASIBLE SET FOR ASSIGNMENT PROBLEMS TIMOTHY L. VIS Abstract. A significant problem in finite optimization is the assignment problem. In essence, the assignment

More information

Lecture 14: Linear Programming II

Lecture 14: Linear Programming II A Theorist s Toolkit (CMU 18-859T, Fall 013) Lecture 14: Linear Programming II October 3, 013 Lecturer: Ryan O Donnell Scribe: Stylianos Despotakis 1 Introduction At a big conference in Wisconsin in 1948

More information

Copyright 2000, Kevin Wayne 1

Copyright 2000, Kevin Wayne 1 Guessing Game: NP-Complete? 1. LONGEST-PATH: Given a graph G = (V, E), does there exists a simple path of length at least k edges? YES. SHORTEST-PATH: Given a graph G = (V, E), does there exists a simple

More information

by conservation of flow, hence the cancelation. Similarly, we have

by conservation of flow, hence the cancelation. Similarly, we have Chapter 13: Network Flows and Applications Network: directed graph with source S and target T. Non-negative edge weights represent capacities. Assume no edges into S or out of T. (If necessary, we can

More information