CS273: Algorithms for Structure and Motion in Biology
Handout # 4
Stanford University
Thursday, 8 April 2004

Lecture #4: 8 April 2004
Topics: Sequence Similarity
Scribe: Sonil Mukherjee

1 Introduction

When aligning multiple sequences, the general idea is that if the evolutionary tree is known, we just do pairwise alignment on the closest two sequences, or alignments of sequences, in the order specified by the tree. We will discuss three systems used for aligning protein sequences: ClustalW, T-COFFEE, and ProbCons.

2 ClustalW

ClustalW starts by computing the score of the pairwise alignment between each pair of sequences, using a scoring function appropriate for proteins. For example, since the core of a protein has fewer insertions and deletions, and hydrophobic regions are more likely than hydrophilic regions to be in the core, the scoring function uses a lower gap penalty in hydrophilic regions than in hydrophobic regions.

Now that it has all the scores, ClustalW can construct a tree by merging pairs of sequences with a higher alignment score before merging pairs with a lower score. However, this is not always correct: the two nodes with the highest alignment score, and thus the smallest distance in the correct tree, may still be merged with other nodes in the correct tree before they are merged with each other.

2.1 Neighbor joining

Definition 1. An additive distance measure is a distance measure in which the distance between any pair of leaves is the sum of the lengths of the edges connecting them.

If the distances are additive, neighbor joining is guaranteed to find the correct tree. It works as follows. Define

    D_ij = d_ij - (r_i + r_j),    where r_i = (1 / (|L| - 2)) * sum_k d_ik

and |L| is the number of leaves.

Theorem 2. The above definition ensures that D_ij is minimal if and only if i and j are neighbors.

The proof of this theorem is omitted, but it shows that ClustalW can construct the tree correctly.
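To make the neighbor-joining criterion concrete, here is a minimal sketch in Python (not part of the original handout; the function name and the example distances are illustrative). It computes D_ij from a distance matrix and returns the pair of leaves to join next.

    def neighbor_joining_pair(d):
        """Return the pair (i, j) minimizing D_ij = d_ij - (r_i + r_j),
        where r_i = (1 / (n - 2)) * sum_k d_ik, for a symmetric n x n
        distance matrix d given as a list of lists."""
        n = len(d)
        r = [sum(d[i]) / (n - 2) for i in range(n)]
        best, best_pair = float("inf"), None
        for i in range(n):
            for j in range(i + 1, n):
                D = d[i][j] - (r[i] + r[j])
                if D < best:
                    best, best_pair = D, (i, j)
        return best_pair

    # Additive distances from a small tree in which leaves 0,1 and leaves 2,3
    # are the two neighboring pairs.
    d = [[0, 3, 6, 7],
         [3, 0, 7, 8],
         [6, 7, 0, 5],
         [7, 8, 5, 0]]
    print(neighbor_joining_pair(d))   # (0, 1); the pair (2, 3) ties with the same D value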

2.2 Weighted progressive alignment

In weighted progressive alignment, each leaf x has a weight w_x proportional to the tree length attributable to x. That is, if there is an edge of length l, and x is one of n leaves that can be reached through that edge, the weight attributable to x from this edge is l/n.

2.3 Limitations

Consider the following four sequences, shown as they would be optimally aligned:

Seq A  GARFIELD THE LAST FA-T CAT
Seq B  GARFIELD THE ---- FAST CAT
Seq C  GARFIELD THE VERY FAST CAT
Seq D  -------- THE ---- FA-T CAT

Since proteins often have different lengths, terminal gaps are preferred to internal gaps. Therefore, ClustalW will prefer to align sequences A and B like this:

Seq A  GARFIELD THE LAST FAT CAT
Seq B  GARFIELD THE FAST CAT

instead of like this:

Seq A  GARFIELD THE LAST FA-T CAT
Seq B  GARFIELD THE ---- FAST CAT

Now the alignment of A to B is fixed and cannot be improved later. Therefore, it is impossible for ClustalW to find the optimal alignment of the four sequences. There are two methods that attempt to overcome this limitation: iterative refinement and consistency.

2.3.1 Iterative refinement

The main idea is, given a multiple alignment, to search the space around it for a better alignment.

Algorithm (Barton-Sternberg):

1. Align the most similar pair x(i), x(j).
2. Align the x(k) most similar to the alignment (x(i), x(j)).
3. Repeat step 2 until x(1), ..., x(N) are all aligned.
4. For j = 1 to N: remove x(j) and realign it to the alignment of x(1), ..., x(j-1), x(j+1), ..., x(N).
5. Repeat step 4 until convergence.

Each time step 4 is run, the alignment either improves or stays the same, so the algorithm is guaranteed to converge. For example, say we align (x,y), (z,w), and then (xy,zw) to get the following multiple alignment:

x: GAAGTTA
y: GAC-TTA
z: GAACTGA
w: GTACTGA

When we remove sequence y and realign it to the rest, we see that we gain 3 matches with this alignment:

x: GAAGTTA
y: G-ACTTA
z: GAACTGA
w: GTACTGA

However, iterative refinement does not always work this well. Consider the following multiple alignment:

x: GAAGTTA
y: GAC-TTA
v: GAC-TTA
u: GAC-TTA
z: GAACTGA
w: GTACTGA

We can see that we would gain more matches by moving the gap from the fourth column to the second column in y, v, and u, but since we only realign one sequence at a time, the algorithm will never make this change. When we remove y, for example, and realign it, we will keep its gap in the fourth column, because u and v still have their gaps in the fourth column.
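To make steps 4 and 5 concrete, here is a minimal sketch in Python (not from the handout; the match-count scoring and the restriction to existing columns are simplifying assumptions). It repeatedly removes one row from the alignment and re-places its residues among the existing columns wherever the per-column match count is highest, keeping a change only if it strictly improves that row's score.

    def refine(aln, max_rounds=10):
        """Barton-Sternberg-style iterative refinement on a multiple alignment
        given as a list of equal-length gapped strings. One sequence at a time
        is removed and re-placed among the existing columns; a change is kept
        only if it strictly increases that row's match count."""
        aln = list(aln)
        for _ in range(max_rounds):
            changed = False
            for r in range(len(aln)):
                others = aln[:r] + aln[r + 1:]
                candidate = realign_row(aln[r].replace("-", ""), others)
                if row_score(candidate, others) > row_score(aln[r], others):
                    aln[r] = candidate
                    changed = True
            if not changed:   # no single-sequence realignment improves the alignment
                break
        return aln

    def row_score(row, others):
        """Number of characters in `row` that match the same column of another row."""
        return sum(sum(1 for o in others if o[i] == row[i])
                   for i in range(len(row)) if row[i] != "-")

    def realign_row(residues, others):
        """Place `residues` into the fixed columns of `others`, maximizing the
        per-column match count, by dynamic programming (a simplified
        sequence-to-profile alignment: no new columns are created)."""
        cols, m = len(others[0]), len(residues)
        col_score = lambda i, c: sum(1 for o in others if o[i] == c)
        NEG = float("-inf")
        best = [[NEG] * (m + 1) for _ in range(cols + 1)]
        best[0][0] = 0
        for i in range(1, cols + 1):
            for j in range(m + 1):
                best[i][j] = best[i - 1][j]                    # put a gap in column i
                if j > 0 and best[i - 1][j - 1] != NEG:        # or place residue j here
                    best[i][j] = max(best[i][j],
                                     best[i - 1][j - 1] + col_score(i - 1, residues[j - 1]))
        out, i, j = [], cols, m
        while i > 0:                                           # traceback
            if j > 0 and best[i][j] == best[i - 1][j - 1] + col_score(i - 1, residues[j - 1]):
                out.append(residues[j - 1])
                j -= 1
            else:
                out.append("-")
            i -= 1
        return "".join(reversed(out))

    print(refine(["GAAGTTA", "GAC-TTA", "GAACTGA", "GTACTGA"]))

Run on the four-sequence example above, this sketch moves y's gap from the fourth column to the second, recovering the improved alignment; run on the six-sequence example, y, v, and u all keep their gaps in the fourth column, illustrating the local optimum described above.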

2.3.2 Consistency

The idea here is to prevent errors when performing the initial alignments, instead of fixing them later. This is done by consulting a third sequence to resolve any conflicts when aligning two sequences. In the following example of aligning x and y, suppose a portion of x, x_i, aligns well with two different portions of y, y_j and y_k:

x: ......x_i......
y: ...y_j......y_k...

It is unclear which part to align x_i to. But if we know that x_i aligns to z_l in a third sequence z, and that z_l aligns to y_j, then we can choose to align x_i to y_j, knowing that this choice will still be best when we align the alignment of x and y to z.

3 T-COFFEE

[Figure summarizing the T-COFFEE pipeline omitted.]

T-COFFEE starts with a user's library, which contains the set of N sequences to be aligned. Its first step is to create the primary library.

3.1 Primary library

The primary library is built by first building a ClustalW library and a Lalign library (Lalign produces local alignments). T-COFFEE calls ClustalW to construct all pairwise global alignments among the N sequences, and then calls Lalign to construct a library of local alignments. Sequence identity is then used to assign a primary weight to each alignment in either library. For example, for the alignment

Seq A  GARFIELD THE LAST FAT CAT
Seq B  GARFIELD THE FAST CAT

there are 18 letters in the shorter sequence, and only 2 of them do not match, so the sequence identity is 100 * (16/18) ≈ 88, and the primary weight for this alignment is 88.

If an alignment is duplicated across the two libraries, it is merged into one entry whose new weight is the sum of the two previous weights. For example, if the alignment above appears in both the ClustalW and Lalign libraries with a weight of 88 in each, it becomes a single entry with weight 88 + 88 = 176. In this way, the ClustalW and Lalign libraries are merged into one primary library.
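A minimal sketch (not from the handout; the data layout and names are assumptions) of how primary weights could be computed and how duplicate entries are merged:

    def percent_identity(aln_a, aln_b):
        """Percent identity of a pairwise alignment, measured relative to the
        shorter ungapped sequence (gap characters '-' and spaces are ignored)."""
        matches = sum(1 for a, b in zip(aln_a, aln_b) if a == b and a not in "- ")
        shorter = min(len(aln_a.replace("-", "").replace(" ", "")),
                      len(aln_b.replace("-", "").replace(" ", "")))
        return 100 * matches / shorter

    def merge_libraries(*libraries):
        """Merge alignment libraries; an entry that occurs in several libraries
        gets the sum of its weights."""
        primary = {}
        for lib in libraries:
            for entry, weight in lib.items():
                primary[entry] = primary.get(entry, 0) + weight
        return primary

    clustalw_lib = {("A", "B"): 88}
    lalign_lib   = {("A", "B"): 88}
    print(merge_libraries(clustalw_lib, lalign_lib))   # {('A', 'B'): 176}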

3.2 Library extension

Let's say A and C are aligned with a score of 77, and C and B are aligned with a score of 100, as shown below:

Seq A  GARFIELD THE LAST FA-T CAT
Seq C  GARFIELD THE VERY FAST CAT

Seq C  GARFIELD THE VERY FAST CAT
Seq B  GARFIELD THE ---- FAST CAT

Now we have an alignment of A and B through C. The weight of this alignment is W(A,B) = min(W(A,C), W(C,B)) = min(77, 100) = 77. To get the final W(A,B), we add this to the W(A,B) from the primary library. So for this example, the final W(A,B) = 77 + 88 = 165. This value is put in the extended library.

This approach requires examining all triplets of sequences, but not all of them provide information. Consider the sequence D, whose alignments to A and B provide no information:

Seq A  GARFIELD THE LAST FA-T CAT
Seq D  -------- THE ---- FA-T CAT

3.3 Running time

Letting L be the average sequence length and N be the number of sequences, the running time can be separated into four parts:

1. O(N^2 L^2) time for the pairwise library computation
2. O(N^3 L) time for library extension
3. O(N^3) time for computing the neighbor-joining tree
4. O(N L^2) time for the progressive alignment computation

Therefore, the total running time is O(N^2 L^2 + N^3 L).
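The triplet-based extension (the O(N^3 L) term above) can be sketched at the whole-sequence level as follows (a minimal illustration with an assumed data layout, not T-COFFEE's actual residue-level implementation):

    def extend_library(primary, sequences):
        """Triplet extension: the weight of pair (A, B) through an intermediate
        C is min(W(A, C), W(C, B)); it is added to the direct weight W(A, B)."""
        W = lambda x, y: primary.get((x, y), primary.get((y, x), 0))
        extended = {}
        for a in sequences:
            for b in sequences:
                if a >= b:          # consider each unordered pair once
                    continue
                total = W(a, b)
                for c in sequences:
                    if c != a and c != b:
                        total += min(W(a, c), W(c, b))
                extended[(a, b)] = total
        return extended

    primary = {("A", "B"): 88, ("A", "C"): 77, ("C", "B"): 100}
    print(extend_library(primary, ["A", "B", "C"]))
    # W(A, B) becomes 88 + min(77, 100) = 165, as in the example above.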

4 ProbCons

Unlike ClustalW and T-COFFEE, ProbCons is not discrete. It is similar to T-COFFEE, but it is probabilistic: it uses a probabilistic model for alignment, it uses probabilistic consistency, and it optimizes expected accuracy rather than the likelihood of the alignment.

4.1 Hidden Markov Models

Hidden Markov Models (HMMs) can be used to model alignments by having three states:

1. M corresponds to a match, so when the HMM is in state M, a letter is emitted from both sequences.
2. I corresponds to a gap in the second sequence, so when the HMM is in state I, a letter is emitted from the first sequence only.
3. J corresponds to a gap in the first sequence, so when the HMM is in state J, a letter is emitted from the second sequence only.

Therefore, an alignment of two sequences corresponds to exactly one sequence of states in the HMM. An example is shown below:

x:     - - A G G C
y:     T A G - - C
state: J J M I I M

Probabilities are assigned to every transition arrow, to determine which state to go to next, and to every emission, to determine which letter(s) to emit in each state. When doing calculations, the logs of these probabilities are used to prevent underflow. Once an HMM and its probabilities have been constructed, the Viterbi algorithm can be run on it to find the most likely alignment. We can also find the posterior probability that two letters align by using the Forward and Backward HMM algorithms.

4.2 ProbCons details

Since protein alignments have many small gaps and some large gaps, it makes sense to model small and large gaps separately. ProbCons does this by using an HMM with 5 states instead of 3. The new states, Ilong and Jlong, model long gaps, while I and J model short gaps. The transition probabilities for entering Ilong and Jlong are smaller than those for entering I and J, but the probabilities of staying in Ilong and Jlong are greater than those of staying in I or J. This way, we enter the long-gap states less often, but stay in them for a long time when we do.

When performing pairwise alignment on two sequences x and y, ProbCons first derives P(x_i ~ y_j | x, y), the probability that x_i and y_j are aligned, for all positions (i, j) in x and y. Then it maximizes expected accuracy, rather than the likelihood of the alignment, given by the following equation:

    E(x, y) = (1 / min(|x|, |y|)) * sum_{(i,j)} P(x_i ~ y_j | x, y)

where the sum runs over the position pairs (i, j) that the alignment matches up. Finally, it uses probabilistic consistency, which, in the earlier consistency example, uses the probabilities of z_l being aligned to y_j and to y_k to determine which one x_i should be aligned to.
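A minimal sketch (illustrative names and toy numbers, not ProbCons code) of computing the expected accuracy of a candidate alignment from a posterior matrix:

    def expected_accuracy(posterior, aligned_pairs, len_x, len_y):
        """Sum of posterior match probabilities P(x_i ~ y_j | x, y) over the
        residue pairs matched by the alignment, divided by the length of the
        shorter sequence. The posterior matrix itself would come from the
        Forward and Backward algorithms on the pair HMM."""
        return sum(posterior[i][j] for i, j in aligned_pairs) / min(len_x, len_y)

    # Toy posterior matrix for |x| = 2, |y| = 3 (rows indexed by i, columns by j).
    posterior = [[0.9, 0.1, 0.0],
                 [0.0, 0.2, 0.7]]
    # An alignment matching x_1 with y_1 and x_2 with y_3 (0-based indices here).
    print(expected_accuracy(posterior, [(0, 0), (1, 2)], 2, 3))   # (0.9 + 0.7) / 2 = 0.8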

The ProbCons algorithm is as follows:

1. Calculate all pairwise posterior matrices P(xy) = ( P(x_i ~ y_j | x, y) ).
2. Calculate all pairwise distances d_xy = E(x, y) = (1 / min(|x|, |y|)) * sum_{(i,j)} P(x_i ~ y_j | x, y).
3. Apply hierarchical clustering on (d_xy) to build the guide tree.
4. Convert each P(xy) to a sparse matrix by dropping very small entries.
5. Apply the consistency transformation P'(xy) = (1/N) * sum_z P(xz) P(zy) (a small sketch appears below).
6. Iterate step 5 for 1-3 rounds to capture indirect relationships x-z-w-v-y, etc.
7. Apply progressive alignment along the guide tree, at each step maximizing expected accuracy.
8. Apply 100 rounds of iterative refinement.

The three features, along with iterative refinement, each independently provide improvements, and tests have shown that ProbCons produces better results than ClustalW and T-COFFEE.
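As referenced in step 5, here is a minimal sketch (the data layout is an assumption, and this is not ProbCons code) of the probabilistic consistency transformation, which re-estimates each posterior matrix P(xy) by averaging the products P(xz) P(zy) over all sequences z:

    def consistency_transform(posteriors, sequences):
        """posteriors[(x, y)] holds the posterior matrix P(xy) as a list of
        lists, for every ordered pair of sequence names, with P(xx) taken to
        be the identity matrix. Returns P'(xy) = (1/N) * sum_z P(xz) P(zy)."""
        N = len(sequences)

        def matmul(A, B):
            return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
                     for j in range(len(B[0]))] for i in range(len(A))]

        def matadd(A, B):
            return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

        transformed = {}
        for x in sequences:
            for y in sequences:
                if x == y:
                    continue
                total = None
                for z in sequences:
                    term = matmul(posteriors[(x, z)], posteriors[(z, y)])
                    total = term if total is None else matadd(total, term)
                transformed[(x, y)] = [[v / N for v in row] for row in total]
        return transformed

Keeping the matrices sparse (step 4) makes these products cheap, since only entries above the cutoff contribute.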