Machine Learning. Computational biology: Sequence alignment and profile HMMs

Similar documents
15-780: Graduate Artificial Intelligence. Computational biology: Sequence alignment and profile HMMs

HIDDEN MARKOV MODELS AND SEQUENCE ALIGNMENT

Profiles and Multiple Alignments. COMP 571 Luay Nakhleh, Rice University

Biology 644: Bioinformatics

PROTEIN MULTIPLE ALIGNMENT MOTIVATION: BACKGROUND: Marina Sirota

Gribskov Profile. Hidden Markov Models. Building a Hidden Markov Model #$ %&

BLAST, Profile, and PSI-BLAST

Lecture 5: Markov models

Quiz Section Week 8 May 17, Machine learning and Support Vector Machines

Computational Molecular Biology

Chapter 6. Multiple sequence alignment (week 10)

!"#$ Gribskov Profile. Hidden Markov Models. Building an Hidden Markov Model. Proteins, DNA and other genomic features can be

8/19/13. Computational problems. Introduction to Algorithm

Chapter 8 Multiple sequence alignment. Chaochun Wei Spring 2018

CISC 889 Bioinformatics (Spring 2003) Multiple Sequence Alignment

Genome 559. Hidden Markov Models

Stephen Scott.

Multiple Sequence Alignment. Mark Whitsitt - NCSA

Multiple Sequence Alignment Based on Profile Alignment of Intermediate Sequences

Sequence alignment is an essential concept for bioinformatics, as most of our data analysis and interpretation techniques make use of it.

CISC 636 Computational Biology & Bioinformatics (Fall 2016)

BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha

HMMConverter A tool-box for hidden Markov models with two novel, memory efficient parameter training algorithms

Using Hidden Markov Models to Detect DNA Motifs

24 Grundlagen der Bioinformatik, SS 10, D. Huson, April 26, This lecture is based on the following papers, which are all recommended reading:

Computational Molecular Biology

Multiple Sequence Alignment II

CS313 Exercise 4 Cover Page Fall 2017

Special course in Computer Science: Advanced Text Algorithms

Sequence alignment theory and applications Session 3: BLAST algorithm

Data Mining Technologies for Bioinformatics Sequences

Bioinformatics explained: BLAST. March 8, 2007

Database Searching Using BLAST

Structured Learning. Jun Zhu

Compares a sequence of protein to another sequence or database of a protein, or a sequence of DNA to another sequence or library of DNA.

GLOBEX Bioinformatics (Summer 2015) Multiple Sequence Alignment

3.4 Multiple sequence alignment

INTRODUCTION TO BIOINFORMATICS

BLAST - Basic Local Alignment Search Tool

Using Hidden Markov Models for Multiple Sequence Alignments Lab #3 Chem 389 Kelly M. Thayer

ICB Fall G4120: Introduction to Computational Biology. Oliver Jovanovic, Ph.D. Columbia University Department of Microbiology

SVM Classification in -Arrays

Computational Genomics and Molecular Biology, Fall

Lecture Overview. Sequence search & alignment. Searching sequence databases. Sequence Alignment & Search. Goals: Motivations:

Gene regulation. DNA is merely the blueprint Shared spatially (among all tissues) and temporally But cells manage to differentiate

Lecture 10. Sequence alignments

ECE521: Week 11, Lecture March 2017: HMM learning/inference. With thanks to Russ Salakhutdinov

Multiple Sequence Alignment: Multidimensional. Biological Motivation

Sequence Alignment & Search

Special course in Computer Science: Advanced Text Algorithms

CAP BLAST. BIOINFORMATICS Su-Shing Chen CISE. 8/20/2005 Su-Shing Chen, CISE 1

Dynamic Programming User Manual v1.0 Anton E. Weisstein, Truman State University Aug. 19, 2014

Alignment of Pairs of Sequences

FASTA. Besides that, FASTA package provides SSEARCH, an implementation of the optimal Smith- Waterman algorithm.

Sequence analysis Pairwise sequence alignment

INTRODUCTION TO BIOINFORMATICS

As of August 15, 2008, GenBank contained bases from reported sequences. The search procedure should be

Project Report on. De novo Peptide Sequencing. Course: Math 574 Gaurav Kulkarni Washington State University

Basic Local Alignment Search Tool (BLAST)

Review of feature selection techniques in bioinformatics by Yvan Saeys, Iñaki Inza and Pedro Larrañaga.

Brief review from last class

Dynamic Programming: Sequence alignment. CS 466 Saurabh Sinha

A Study On Pair-Wise Local Alignment Of Protein Sequence For Identifying The Structural Similarity

Genome 559: Introduction to Statistical and Computational Genomics. Lecture15a Multiple Sequence Alignment Larry Ruzzo

Heuristic methods for pairwise alignment:

USING AN EXTENDED SUFFIX TREE TO SPEED-UP SEQUENCE ALIGNMENT

Finding data. HMMER Answer key

Mismatch String Kernels for SVM Protein Classification

MULTIPLE SEQUENCE ALIGNMENT SOLUTIONS AND APPLICATIONS

Wilson Leung 05/27/2008 A Simple Introduction to NCBI BLAST

Sequence alignment algorithms

Application of Support Vector Machine In Bioinformatics

Bioinformatics explained: Smith-Waterman

New String Kernels for Biosequence Data

HARDWARE ACCELERATION OF HIDDEN MARKOV MODELS FOR BIOINFORMATICS APPLICATIONS. by Shakha Gupta. A project. submitted in partial fulfillment

Wilson Leung 01/03/2018 An Introduction to NCBI BLAST. Prerequisites: Detecting and Interpreting Genetic Homology: Lecture Notes on Alignment

Biology 644: Bioinformatics

From [4] D. Gusfield. Algorithms on Strings, Trees and Sequences. Cambridge University Press, New York, 1997.

A Hidden Markov Model for Alphabet Soup Word Recognition

Pairwise Sequence Alignment. Zhongming Zhao, PhD

C E N T R. Introduction to bioinformatics 2007 E B I O I N F O R M A T I C S V U F O R I N T. Lecture 13 G R A T I V. Iterative homology searching,

Introduction to Bioinformatics Online Course: IBT

Hidden Markov Models in the context of genetic analysis

FastA & the chaining problem

Lecture 9: Core String Edits and Alignments

FastA and the chaining problem, Gunnar Klau, December 1, 2005, 10:

Multiple Sequence Alignment (MSA)

BLAST. Basic Local Alignment Search Tool. Used to quickly compare a protein or DNA sequence to a database.

EECS730: Introduction to Bioinformatics

9/29/13. Outline Data mining tasks. Clustering algorithms. Applications of clustering in biology

Multiple Sequence Alignment Gene Finding, Conserved Elements

Lectures by Volker Heun, Daniel Huson and Knut Reinert, in particular last years lectures

Introduction to Hidden Markov models

Biologically significant sequence alignments using Boltzmann probabilities

Principles of Bioinformatics. BIO540/STA569/CSI660 Fall 2010

An Analysis of Pairwise Sequence Alignment Algorithm Complexities: Needleman-Wunsch, Smith-Waterman, FASTA, BLAST and Gapped BLAST

Bioinformatics for Biologists

A NEW GENERATION OF HOMOLOGY SEARCH TOOLS BASED ON PROBABILISTIC INFERENCE

EECS730: Introduction to Bioinformatics

CS273: Algorithms for Structure Handout # 4 and Motion in Biology Stanford University Thursday, 8 April 2004

Transcription:

10-601 Machine Learning Computational biology: Sequence alignment and profile HMMs

Central dogma DNA CCTGAGCCAACTATTGATGAA transcription mrna CCUGAGCCAACUAUUGAUGAA translation Protein PEPTIDE 2

Growth in biological data Lu et al Bioinformatics 2009 3

Central dogma Can be measured using sequencing techniques DNA CCTGAGCCAACTATTGATGAA Can be measured using microarrays transcription mrna CCUGAGCCAACUAUUGAUGAA translation Protein PEPTIDE Can be measured using mass spectrometry 4

5

FDA Approves Gene-Based Breast Cancer Test* MammaPrint is a DNA microarray-based test that measures the activity of 70 genes in a sample of a woman's breast-cancer tumor and then uses a specific formula to determine whether the patient is deemed low risk or high risk for the spread of the cancer to another site. *Washington Post, 2/06/2007

Input Output HMM For Data Integration I H H g H H 0 O 1 O 2 O 3 O 0 1 2 3

Active Learning 8

Assigning function to proteins One of the main goals of molecular (and computational) biology. There are 25000 human genes and the vast majority of their functions is still unknown Several ways to determine function - Direct experiments (knockout, overexpression) - Interacting partners - 3D structures - Sequence homology Hard Easier 9

Function from sequence homology We have a query gene: ACTGGTGTACCGAT Given a database containing genes with known function, our goal is to find similar genes from this database (possibly in another organism) When we find such gene we predict the function of the query gene to be similar to the resulting database gene Problems - How do we determine similarity? 10

Sequence analysis techniques A major area of research within computational biology. Initially, based on deterministic or heuristic alignment methods More recently, based on probabilistic inference methods 11

Sequence analysis Traditional - Dynamic programming Probabilsitic - Profile HMMs 12

Alignment: Possible reasons for differences Substitutions Insertions Deletions 13

Pairwise sequence alignment ACATTG AACATT A C A T T G AGCCTT AGCATT A A C A T T A G C C T T A G C A T T 14

Pairwise sequence alignment AGCCTT ACCATT A G C C T T AGCCTT AGCATT A C C A T T We cannot expect the alignments to be perfect. But we need to determine what is the reason for the difference (insertion, deletion or substitution). A G C C T T A G C A T T 15

Scoring Alignments Alignments can be scored by comparing the resulting alignment to a background (random) model. Independent Related ( x, y I qxi qx P ( x, y M ) j P ) i j i p x i y i Score for alignment: S s( x i, y i ) i where: s( a, b) log( p q a a, b q b ) Can be computed for each pair of letters 16

Scoring Alignments Alignments can be scored by comparing the resulting alignment to a background (random) model. In other Independent words, we are trying to find Related an alignment that maximizes the likelihood ratio of the aligned pair compared to the background model p ( x, y I qxi qx P ( x, y M ) j P ) i j i x i y i Score for alignment: S s( x i, y i ) i where: s( a, b) log( p q a a, b q b ) 17

Computing optimal alignment: The Needham-Wuncsh algorithm F(i-1,j-1)+s(x i,x j ) F(i,j) = max F(i-1,j)+d F(i,j-1)+d d is a penalty for a gap A G C C T T A C C F(i-1,j-1) F(i-1,j) A T F(i,j-1) F(i,j) T 18

Example Assume a simple model where S(a,b) = 1 if a=b and -5 otherwise. Also, assume that d = -1 A G C C T T 0-1 -2-3 -4-5 -6 A -1 C -2 C -3 A -4 T -5 T -6 19

Example Assume a simple model where S(a,b) = 1 if a=b and -5 otherwise. Also, assume that d = -1 A G C C T T F(i,j) = max F(i-1,j-1)+s(x i,x j ) F(i-1,j)+d F(i,j-1)+d 0-1 -2-3 -4-5 -6 A -1 1 C -2 C -3 A -4 T -5 T -6 20

Example Assume a simple model where S(a,b) = 1 if a=b and -5 otherwise. Also, assume that d = -1 A G C C T T F(i,j) = max F(i-1,j-1)+s(x i,x j ) F(i-1,j)+d F(i,j-1)+d 0-1 -2-3 -4-5 -6 A -1 1 0 C -2 0 C -3 A -4 T -5 T -6 21

Example Assume a simple model where S(a,b) = 1 if a=b and -5 otherwise. Also, assume that d = -1 A G C C T T F(i,j) = max F(i-1,j-1)+s(x i,x j ) F(i-1,j)+d F(i,j-1)+d 0-1 -2-3 -4-5 -6 A -1 1 0-1 -2-3 -4 C -2 0-1 C -3-1 A -4-2 T -5-3 T -6-4 22

Example Assume a simple model where S(a,b) = 1 if a=b and -5 otherwise. Also, assume that d = -1 A G C C T T 0-1 -2-3 -4-5 -6 A -1 1 0-1 -2-3 -4 C -2 0-1 1 0-1 -2 C -3-1 -2 0 2 1 0 A -4-2 -3-1 1 0-1 T -5-3 -4-2 0 2 1 T -6-4 -5-3 -1 1 3 23

Example Assume a simple model where S(a,b) = 1 if a=b and -5 otherwise. Also, assume that d = -1 A G C C T T 0-1 -2-3 -4-5 -6 A -1 1 0-1 -2-3 -4 C -2 0-1 1 0-1 -2 C -3-1 -2 0 2 1 0 A -4-2 -3-1 1 0-1 T -5-3 -4-2 0 2 1 T -6-4 -5-3 -1 1 3 24

Example Assume a simple model where S(a,b) = 1 if a=b and -5 otherwise. Also, assume that d = -1 A G C C T T A G C C T T A C C A T T 0-1 -2-3 -4-5 -6 A -1 1 0-1 -2-3 -4 C -2 0-1 1 0-1 -2 C -3-1 -2 0 2 1 0 A -4-2 -3-1 1 0-1 T -5-3 -4-2 0 2 1 T -6-4 -5-3 -1 1 3 25

Running time The running time of an alignment algorithms if O(n 2 ) This doesn t sound too bad, or is it? The time requirement for doing global sequence alignment is too high in many cases. Consider a database with tens of thousands of sequences. Looking through all these sequences for the best alignment is too time consuming. In many cases, a much faster heuristic approach can achieve equally good results. 26

Sequence analysis Traditional - Dynamic programming Probabilsitic - Profile HMMs 27

Protein families Proteins can be classified into families (and further into sub families etc.) A specific family includes proteins with similar high level functions For example: - Transcription factors - Receptors - Etc. Family assignment is an important first step towards function prediction 28

Methods for Characterizing a Protein Family Objective: Given a number of related sequences, encapsulate what they have in common in such a way that we can recognize other members of the family. Some standard methods for characterization: Multiple Alignments Regular Expressions Consensus Sequences Hidden Markov Models 29

Multiple Alignment Process Process of aligning three or more sequences with each other We can determine such alignment by generalizing the algorithm to align two sequences Running time exponential in the number of sequences 30

Training a HMM from an existing alignment Start with a predetermined number of states accounting for matches, insertions and deletions. For each position in the model, assign a column in the multiple alignment that is relatively conserved. Emission probabilities are set according to amino acid counts in columns. Transition probabilities are set according to how many sequences make use of a given delete or insert state. MLE estimates 31

Remember the simple example Chose six positions in model. Highlighted area was selected to be modeled by an insert due to variability. Can also do neat tricks for picking length of model, such as model pruning. 32

So what do we do with a model? Given a query protein: - Design statistical tests to determine how likely it is to get this score from a random (gene) sequence - Use several protein family models for classifying new proteins, assign protein to most highly scoring family. 33

Choosing the best model: Aligning sequences to a models Compute the likelihood of the best set of states for this sequence We know how to do this: The Viterbi algortthm Time: O(N*M) 34

Scoring our simple HMM #1 - T G C T A G G vrs: #2 - A C A C A T C HMM: #1 = Score of -0.97 #2 Score of 6.7 (Log odds) 35

Training from unaligned Baum-Welch algorithm sequences Start with a model whose length matches the average length of the sequences and with random emission and transition probabilities. Align all the sequences to the model. Use the alignment to alter the emission and transition probabilities Repeat. Continue until the model stops changing By-product: It produces a multiple alignment 36

Multiple Alignment: Reasons for differences Substitutions Insertions Deletions 37

Designing HMMs: Consensus (match) states We first include states to output the consensus sequence A: 0.8 C: 0.8 A: 0.8 T: 0.8 T: 0.2 G: 0.2 C: 0.2 G: 0.2 38

Designing HMMs: Insertions We next add states to allow insertions start 1 0.4 A: 0.2 C: 0.4 : G:0.2 T: 0.2 0.6 0.6 A: 0.8 1 C: 0.8 1 A: 0.8 0.4 T: 0.8 T: 0.2 G: 0.2 C: 0.2 G: 0.2 39

Designing HMMs: Deletions Finally we add states with no output to allow for deletions start 1 O O O 0.4 A: 0.2 C: 0.4 : G:0.2 T: 0.2 0.6 0.6 A: 0.8 1 C: 0.8 1 A: 0.8 0.4 T: 0.8 T: 0.2 G: 0.2 C: 0.2 G: 0.2 40

Training from unaligned continued Advantages: You take full advantage of the expressiveness of your HMM. You might not have a multiple alignment on hand. Disadvantages: HMM training methods are local optimizers, you may not get the best alignment or the best model unless you re very careful. Can be alleviated by starting from a logical model instead of a random one. 41

Summary Initial methods for sequence alignment relied on combinatorial and dynamic programming methods. These methods do not generalize well for multiple sequence alignment and for searching large databases. State of the art methods rely on AI techniques, primarily variants of HMMs to overcome this problem. 42