Multiple Sequence Alignment Sum-of-Pairs and ClustalW. Ulf Leser

Similar documents
Multiple Sequence Alignment Sum-of-Pairs and ClustalW. Ulf Leser

Sequence Alignment. Ulf Leser

GLOBEX Bioinformatics (Summer 2015) Multiple Sequence Alignment

CISC 889 Bioinformatics (Spring 2003) Multiple Sequence Alignment

Special course in Computer Science: Advanced Text Algorithms

Algorithms for Bioinformatics

EECS730: Introduction to Bioinformatics

PROTEIN MULTIPLE ALIGNMENT MOTIVATION: BACKGROUND: Marina Sirota

Lecture 5: Multiple sequence alignment

Principles of Bioinformatics. BIO540/STA569/CSI660 Fall 2010

3.4 Multiple sequence alignment

Gene expression & Clustering (Chapter 10)

Special course in Computer Science: Advanced Text Algorithms

Evolutionary tree reconstruction (Chapter 10)

Computational Genomics and Molecular Biology, Fall

Multiple Sequence Alignment (MSA)

Sequence Alignment (chapter 6) p The biological problem p Global alignment p Local alignment p Multiple alignment

MULTIPLE SEQUENCE ALIGNMENT

Sequence clustering. Introduction. Clustering basics. Hierarchical clustering

Lecture 20: Clustering and Evolution

Parsimony-Based Approaches to Inferring Phylogenetic Trees

Sequence alignment algorithms

Lecture 20: Clustering and Evolution

Multiple Sequence Alignment. Mark Whitsitt - NCSA

Today s Lecture. Edit graph & alignment algorithms. Local vs global Computational complexity of pairwise alignment Multiple sequence alignment

Algorithms and Experimental Study for the Traveling Salesman Problem of Second Order. Gerold Jäger

A multiple alignment tool in 3D

Chapter 8 Multiple sequence alignment. Chaochun Wei Spring 2018

Multiple Sequence Alignment: Multidimensional. Biological Motivation

4/4/16 Comp 555 Spring

In this section we describe how to extend the match refinement to the multiple case and then use T-Coffee to heuristically compute a multiple trace.

Clustering Techniques

11/17/2009 Comp 590/Comp Fall

Multiple alignment. Multiple alignment. Computational complexity. Multiple sequence alignment. Consensus. 2 1 sec sec sec

A New Approach For Tree Alignment Based on Local Re-Optimization

Chapter 6. Multiple sequence alignment (week 10)

Lecture 9: Core String Edits and Alignments

Alignment ABC. Most slides are modified from Serafim s lectures

Multiple sequence alignment. November 20, 2018

Genome 559: Introduction to Statistical and Computational Genomics. Lecture15a Multiple Sequence Alignment Larry Ruzzo

Computational Molecular Biology

Lecture 10: Local Alignments

Multiple Sequence Alignment

Comparison of Phylogenetic Trees of Multiple Protein Sequence Alignment Methods

Dynamic Programming in 3-D Progressive Alignment Profile Progressive Alignment (ClustalW) Scoring Multiple Alignments Entropy Sum of Pairs Alignment

Brief review from last class

Phylogenetics. Introduction to Bioinformatics Dortmund, Lectures: Sven Rahmann. Exercises: Udo Feldkamp, Michael Wurst

An Analysis of Pairwise Sequence Alignment Algorithm Complexities: Needleman-Wunsch, Smith-Waterman, FASTA, BLAST and Gapped BLAST

EECS730: Introduction to Bioinformatics

Phylogenetic Trees Lecture 12. Section 7.4, in Durbin et al., 6.5 in Setubal et al. Shlomo Moran, Ilan Gronau

Two-Level Logic Optimization ( Introduction to Computer-Aided Design) School of EECS Seoul National University

Lecture 3: February Local Alignment: The Smith-Waterman Algorithm

Algorithms and Data S tructures Structures Optimal S earc Sear h ch Tr T ees; r T ries Tries Ulf Leser

Specifying logic functions

17 dicembre Luca Bortolussi SUFFIX TREES. From exact to approximate string matching.

Lecture Overview. Sequence search & alignment. Searching sequence databases. Sequence Alignment & Search. Goals: Motivations:

Multiple Sequence Alignment II

Sequence Comparison: Dynamic Programming. Genome 373 Genomic Informatics Elhanan Borenstein

Sequence Alignment & Search

CLUSTERING IN BIOINFORMATICS

1. R. Durbin, S. Eddy, A. Krogh und G. Mitchison: Biological sequence analysis, Cambridge, 1998

Recursive-Fib(n) if n=1 or n=2 then return 1 else return Recursive-Fib(n-1)+Recursive-Fib(n-2)

Reconstructing long sequences from overlapping sequence fragment. Searching databases for related sequences and subsequences

9/29/13. Outline Data mining tasks. Clustering algorithms. Applications of clustering in biology

Dynamic Programming & Smith-Waterman algorithm

Algorithms and Data Structures

Stephen Scott.

Sequence alignment is an essential concept for bioinformatics, as most of our data analysis and interpretation techniques make use of it.

Mining Social Network Graphs

Distance-based Phylogenetic Methods Near a Polytomy

Sequence analysis Pairwise sequence alignment

CS273: Algorithms for Structure Handout # 4 and Motion in Biology Stanford University Thursday, 8 April 2004

Lecture 2 Pairwise sequence alignment. Principles Computational Biology Teresa Przytycka, PhD

Machine Learning (BSMC-GA 4439) Wenke Liu

Comparison of Sequence Similarity Measures for Distant Evolutionary Relationships

Terminology. A phylogeny is the evolutionary history of an organism

Multiple Sequence Alignment Using Reconfigurable Computing

Recent Research Results. Evolutionary Trees Distance Methods

Additional Alignments Plugin USER MANUAL

A simple noise model. Algorithm sketch. A simple noise model. Estimating the probabilities

BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha

PoMSA: An Efficient and Precise Position-based Multiple Sequence Alignment Technique

EECS730: Introduction to Bioinformatics

On the Optimality of the Neighbor Joining Algorithm

More Graph Algorithms: Topological Sort and Shortest Distance

University of Technology

EVOLUTIONARY DISTANCES INFERRING PHYLOGENIES

Multiple Sequence Alignment. With thanks to Eric Stone and Steffen Heber, North Carolina State University

Lecture 10. Sequence alignments

Multiple sequence alignment. November 2, 2017

15-780: Graduate Artificial Intelligence. Computational biology: Sequence alignment and profile HMMs

What is a phylogenetic tree? Algorithms for Computational Biology. Phylogenetics Summary. Di erent types of phylogenetic trees

Dynamic Programming User Manual v1.0 Anton E. Weisstein, Truman State University Aug. 19, 2014

Algorithms and Data Structures

CSE 549: Computational Biology

JET 2 User Manual 1 INSTALLATION 2 EXECUTION AND FUNCTIONALITIES. 1.1 Download. 1.2 System requirements. 1.3 How to install JET 2

Distance based tree reconstruction. Hierarchical clustering (UPGMA) Neighbor-Joining (NJ)

Clustering of Proteins

BMI/CS 576 Fall 2015 Midterm Exam

Algorithms and Data Structures

Transcription:

Multiple Sequence lignment Sum-of-Pairs and ClustalW Ulf Leser

This Lecture Multiple Sequence lignment The problem Theoretical approach: Sum-of-Pairs scores Practical approach: ClustalW Ulf Leser: Bioinformatics, Summer Semester 2012 2

Multiple Sequence lignment We now align multiple (k>2) sequences Note: lso BLST aligns only two sequences Why? Imagine k sequences of the promoter region of genes, all regulated by the same transcription factor f. Which subsequence within the k sequences is recognized by f? Imagine k sequences of proteins that bind to DN. Which subsequence of the k sequences code for the part of the proteins that performs the binding? General We want to know the common part(s) in k sequences common does not mean identical This part can be anywhere within the sequences Ulf Leser: Bioinformatics, Summer Semester 2012 3

Definition Definition multiple sequence alignment (MS) of k Strings s i, 1 i k, is a table of k rows and l columns (l max( s i )), such that Row i contains the sequence of s i, with an arbitrary number of blanks being inserted at arbitrary positions Every symbol of every s i stands in exactly one column No column contains only blanks CGTGTTGC TCGGTGCTTTCGT GCCGTGCTGTCG TTCGTGGCGTGGT GGTGCGCC CGTGTTG C TCGGT GCTTTCGT_ GCCGTGCTGT C_G TTC_GTG_GCGTG GT G GTGC_GC C Ulf Leser: Bioinformatics, Summer Semester 2012 4

Good MS We are searching for good (optimal) MSs Defining optimal here is not as simple as in the k=2 case Intuition ll sequences had a common ancestor and evolved by evolution We want to assume as few evolutionary events as possible Thus, we want few columns (~ few INSDELs) Thus, we want homogeneous columns (~ few replacements) CGTGTTG C TCGGT GCTTTCGT_ GCCGTGCTGT C_G TTC_GTG_GCGTG GT G GTGC_GC C C GTG_T_T_GC TCGGTGC_TTTC_GT GCCG TGC_T GTCG_ TTC_GTGGCGTG GT G GTGC TGCC Ulf Leser: Bioinformatics, Summer Semester 2012 5

This Lecture Multiple Sequence lignment The problem Theoretical approach: Sum-of-Pairs scores Practical approach: ClustalW Ulf Leser: Bioinformatics, Summer Semester 2012 6

What Should we Count? For two sequences We scored each column using a scoring matrix Find the alignment such that the total score is maximal But how do we score a column with 5*T, 3*, 1*_? We would need an exponentially big scoring matrix lternative: Sum-of-Pairs Score We score an entire MS We score the alignment of each pair of sequences in the usual way We aggregate over all pairs to score the MS We need a clever algorithm to find the MS with the best score Ulf Leser: Bioinformatics, Summer Semester 2012 7

Formally Definition Let M be a MS for the set S of k sequences S={s 1,...,s k } The alignment of s i with s j induced by M is generated as follows Remove from M all rows except i and j Remove all columns that contain only blanks The sum-of-pairs score (sop) of M is the sum of all pair-wise induced alignment scores The optimal MS for S wrt. to sop is the MS with the highest sopscore over all possible MS for S Ulf Leser: Bioinformatics, Summer Semester 2012 8

Example G_ T_TG CTG_G_G 4 5 5 } 14 d/i = 1 r = 1 m = 0 G TTG C_TGG_G 4 5 7 } 16 Given a MS over k sequences of length l how complex is it to compute its sop-score? How do we find the best MS? Ulf Leser: Bioinformatics, Summer Semester 2012 9

nalogy Think of the k=2 case Every alignment is a path through the matrix The three possible directions (down, right, down-right) conform to the three possible constellations in a column (XX, X_, _X) With growing paths, we align growing prefixes of both sequences Ulf Leser: Bioinformatics, Summer Semester 2012 10

nalogy ssume k=3 Think of a 3-dimensional cube with the three sequences giving the values in each dimension Now, we have paths aligning growing prefixes of three sequences Every column has seven possible constellations (XXX, XX_, X_X, _XX, X, _X_, X) Ulf Leser: Bioinformatics, Summer Semester 2012 11

ll Possible Steps d(i-1,j-1,k-1) d(i,j-1,k-1) 1 G 2 3 C 1-2 3 C d(i,j,k-1) d(i,j-1,k) d(i-1,j,k) d(i-1,j-1,k) d(i-1,j,k-1) 1-2 3-1 G 2-3 C 1 G 2-3 - 1-2 - 3 C 1 G 2 3 - Ulf Leser: Bioinformatics, Summer Semester 2012 12

Dynamic Programming in three Dimensions We compute the best possible alignment d(i,j,k) for every triple of prefixes (lengths i,j,k ) using the following formula d(i,j,k) = min d(i-1,j-1,k-1) + c ij + c ik + c jk d(i-1,j-1,k) + c ij + 2 d(i-1,j,k-1) + c ik + 2 d(i,j-1,k-1) + c jk + 2 d(i-1,j,k) + 2 d(i,j-1,k) + 2 d(i,j,k-1) + 2 Three (mis)matches One (mis)match, two ins Let c ij = 0, if S 1 (i) = S 2 (j), else 1 Let c ik = 0, if S 1 (i) = S 3 (k), else 1 Let c jk = 0, if S 2 (j) = S 3 (k), else 1 Ulf Leser: Bioinformatics, Summer Semester 2012 13

Concrete Examples d(i,j-1,k) 1-2 3 - d(i-1,j,k-1) 1 G 2-3 C Best sop-score for d(i,j-1,k) is known We want to compute d(i,j,k) This requires to align one symbol with two blanks (blank/blank does not count) d(i,j,k) = d(i,j-1,k) + 2 Best sop-score for d(i-1, j,k-1) is known We want to compute d(i,j,k) This requires aligning a blank with s 1 [i-1] and with s 3 [k-1] and to align s 1 [i-1] and s 3 [k-1] d(i,j,k) = d(i-1,j,k-1) + 2 + c ik Ulf Leser: Bioinformatics, Summer Semester 2012 14

Initialization Of course, we have d(0,0,0)=0 ligning in one dimension: d(i,0,0)=2*i Same for d(0,j,0), d(0,0,k) ligning in two dimensions: d(i,j,0)= Let d a,b (i,j) be the alignment score for S a [1..i] with S b [1..j] d(i, j, 0) = d 1,2 (i, j) + (i+j) d(i, 0, k) = d 1,3 (i, k) + (i+k) d(0, j, k) = d 2,3 (j, k) + (j+k) Ulf Leser: Bioinformatics, Summer Semester 2012 15

lgorithm initialize matrix d; for i := 1 to S 1 for j := 1 to S 2 for k := 1 to S 3 if (S 1 (i) = S 2 (j)) then c ij := 0; else c ij := 1; if (S 1 (i) = S 3 (k)) then c ik := 0; else c ik := 1; if (S 2 (j) = S 3 (k)) then c jk := 0; else c jk := 1; d 1 := d[i 1,j 1,k - 1] + c ij + c ik + c jk ; d 2 := d[i 1,j 1,k] + c ij + 2; d 3 := d[i 1,j,k - 1] + c ik + 2; d 4 := d[i,j 1,k - 1] + c jk + 2; d 5 := d[i 1,j,k) + 2; d 6 := d[i,j 1,k) + 2; d 7 := d[i,j,k - 1) + 2; d[i,j,k] := min(d 1, d 2, d 3, d 4, d 5, d 6, d 7 ); end for; end for; end for; Ulf Leser: Bioinformatics, Summer Semester 2012 16

Bad News Complexity For 3 sequences of length n, we have to perform There are n 3 cells in the cube For each cell (top-left-front corner), we need to look at 7 corners Together: O(7*n 3 ) computations For k sequences of length n There are n k cell corners in the cube For each corner, we need to look at 2 k -1 other corners Together: O(2 k * n k ) computations ctually, the problem is NP-complete Ulf Leser: Bioinformatics, Summer Semester 2012 17

Bad News: Biological Meaningfulness Let s take one step back What happened during evolution? GTTTC GTTGC GTTTTC CTTGC GTTGC GTTGTT Real number of events: 8 sop-score: 2+3+6+6+2+ Everything is counted multiple times GTTTTC GTTTTCT GTTTTG CT_TGC_ GT_TGC GT_TGTT GTTTTCT GTTTTG Ulf Leser: Bioinformatics, Summer Semester 2012 18

This Lecture Multiple Sequence lignment The problem Theoretical approach: Sum-of-Pairs scores Practical approach: ClustalW Ulf Leser: Bioinformatics, Summer Semester 2012 19

Different Scoring Function If we knew the phylogenetic tree of the k sequences lign every parent with all its children ggregate all alignment scores This gives the real number of evolutionary operations But: Finding the true phylogenetic tree requires a MS Not covered in this lecture Use a heuristic: ClustalW Thompson, J. D., Higgins, D. G. and Gibson, T. J. (1994). "CLUSTL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice." Nucleic cids Res 22(22): 4673-80. Ulf Leser: Bioinformatics, Summer Semester 2012 20

ClustalW Main idea Compute a good enough phylogeny the guide tree Use the guide tree to iteratively align small MS to larger MS Starting from single sequences To escape the curse of dimensionality, compute MS iteratively Progressive MS Does not necessarily find the best solution Needs a fast method to align two MSs Works quite well in practice Many other, newer (better) proposals Dlign, T-Coffee, HMMT, PRRT, MULTLIGN,... Ulf Leser: Bioinformatics, Summer Semester 2012 21

Step 1: Compute the Guide Tree Compute all pair-wise alignments and store in similarity matrix M M[i,j] = sim(s i, s j ) Compute the guide tree using hierarchical clustering Choose the smallest M[i,j] Let s i and s j form a new (next) branch of the tree Compute the distance from the ancestor of s i and s j to all other sequences as the average of the distances to s i and s j Set M = M Delete rows and columns i and j dd a new column and row (ij) For all k ij: M [ij,k] = (M[i,k] + M[j,k]) / 2 Iterate until M has only one column / row Ulf Leser: Bioinformatics, Summer Semester 2012 22

Ulf Leser: Bioinformatics, Summer Semester 2012 23 Sketch B C D E F G BCDEFG B. C.. D... E... F... G... (B,D) a B C D E F G CEFGa C. E.. F... G... a... (E,F) b B C D E F G CGab C. G.. a... b... (,b) c B C D E F G CGac C G. a.. c... (C,G) d acd a c. d.. (d,c) e B C D E F G (a,e) f ae a e. B C D E F G B C D E F G

Example B C D E 17 59 59 77 B 37 61 53 C 13 41 D 21 B C D E B E CD 17 77 59 B 53 49 E 31 B C D E E CD B E 31 65 CD 54 B C D E B C D E Ulf Leser: Bioinformatics, Summer Semester 2012 24

Example C PDKTNVKWGKVGHGEYG D DKTNVKWSKVGGHGEYG B C D E PEEKSVTLWGKVNVDEYGG B GEEKVLLWDKVNEEEYGG C PDKTNVKWG_KVGHGEYG D DKTNVKWS_KVGGHGEYG E TNVKTWSSKVGGHP PEEKSV_TLWG_KVN VDEYGG B GEEKV_LLWD_KVN EEEYGG C PDKTNVK_WG_KVGHGEYG D DKTNVK_WS_KVGGHGEYG E TNVKT_WSSKVGGHP Once a gap, always a gap Ulf Leser: Bioinformatics, Summer Semester 2012 25

Step 2: Progressive MS Pair-wise alignment in the order given by the guide tree ligning a MS M 1 with a MS M 2 Use the usual (global) alignment algorithm To score a column, compute the average score over all pairs of symbols in this columns Example P B G C P D E F Y Score of this column (2*s(P,)+s(P,Y)+ 2*s(G,)+s(G,Y)+ 2*s(P,)+s(P,Y) ) / 9 Ulf Leser: Bioinformatics, Summer Semester 2012 26

Issues There is a lot to say about whether hierarchical clustering actually computes the correct tree Clustal-W actually uses a different, more accurate algorithm called neighbor-joining Clustal-W is fast (O(k 2 *n 2 +k 2 *log(k)) For k sequences; plus cost for computing alignments, especially M Idea behind progressive alignment Find strong signals (highly conserved blocks) first Outliers are added last Increases the chances that conserved blocks survive Several improvements to this scheme are known Ulf Leser: Bioinformatics, Summer Semester 2012 27

Problems with progressive MS Source: Cedric Notredame, 2001 Ulf Leser: Bioinformatics, Summer Semester 2012 28

Further Reading Merkl & Waack, chapter 13 Böckenhauer & Bongartz, chapter 5.3 Ulf Leser: Bioinformatics, Summer Semester 2012 29