Biology 644: Bioinformatics

A hidden Markov model (HMM) is a statistical Markov model in which the system being modeled is assumed to be a Markov process with unobserved (hidden) states. HMMs were first used in speech and handwriting recognition. In biology, they are frequently used to model biological sequences and structures, including:

- Gene tracks
- 5′ and 3′ splice sites
- Chromatin states
- CpG islands
- GC isochores
- Protein folding conformations
- DNA-binding sites
- RNA-binding sites
- Copy number variation (CNV)
- Differential gene expression
- Sequence homology: profile HMMs (pHMMs)

When at least some of the data labels are missing (hidden) in the training data, we must infer the missing hidden states. This requires correctly inferring the model topology of the system, and a lot of training data to train the additional parameters; successful training is also highly dependent on the initial conditions. A famous example is the occasionally dishonest casino: a known model is used to generate rolls, and the results of training with 300 rolls can be compared against training with 30,000 rolls.
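As a concrete sketch, the dishonest-casino HMM can be written down and used to generate rolls. The transition and emission values below are the commonly used illustrative ones (a fair die, and a loaded die that rolls 6 half the time), not values taken from this lecture.

```python
import random

random.seed(0)

# Occasionally dishonest casino: a fair die and a loaded die.
# The casino switches dice according to a hidden Markov chain.
trans = {"Fair": {"Fair": 0.95, "Loaded": 0.05},
         "Loaded": {"Fair": 0.10, "Loaded": 0.90}}
emit = {"Fair": {r: 1 / 6 for r in "123456"},
        "Loaded": {**{r: 0.1 for r in "12345"}, "6": 0.5}}

def generate(n, start="Fair"):
    """Sample a hidden state path and the observed roll sequence."""
    path, rolls, s = [], [], start
    for _ in range(n):
        path.append(s)
        faces, probs = zip(*emit[s].items())
        rolls.append(random.choices(faces, probs)[0])
        nxt, p = zip(*trans[s].items())
        s = random.choices(nxt, p)[0]
    return path, rolls

path, rolls = generate(300)   # a "300 rolls" training set
```

Training on the 300-roll output and on a 30,000-roll output of the same generator is what the slide's comparison refers to: the larger sample recovers the generating parameters far more reliably.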

[Figure: tumour copy number (log ratio, −0.4 to 0.8) plotted along chromosome position (0 to 1.5 × 10^8).]

The Viterbi Algorithm: an efficient dynamic programming algorithm that is guaranteed to return the alignment path with the highest log-odds score for a given sequence (also called the best supported path, the one most different from the background).

The Forward Algorithm: another dynamic programming algorithm that sums the scores over all possible paths, giving the full probability that sequence x_i aligns to the model better than the background. This is necessary to obtain P(x_i | θ), since many different state paths can give rise to the same sequence x_i.

The Backward Algorithm: similar to the Forward Algorithm, but it recurses in the backward direction.
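A minimal log-space Viterbi sketch for a two-state HMM such as the casino model (the parameter values are illustrative, not from this lecture):

```python
import math

# Illustrative dishonest-casino parameters.
trans = {"Fair": {"Fair": 0.95, "Loaded": 0.05},
         "Loaded": {"Fair": 0.10, "Loaded": 0.90}}
emit = {"Fair": {r: 1 / 6 for r in "123456"},
        "Loaded": {**{r: 0.1 for r in "12345"}, "6": 0.5}}
start = {"Fair": 0.5, "Loaded": 0.5}

def viterbi(obs):
    """Return the highest-probability state path and its log probability."""
    states = list(trans)
    # v[s]: best log prob of any path ending in state s at the current step
    v = {s: math.log(start[s]) + math.log(emit[s][obs[0]]) for s in states}
    ptrs = []                              # back-pointers for traceback
    for x in obs[1:]:
        v2, ptr = {}, {}
        for s in states:
            best = max(states, key=lambda p: v[p] + math.log(trans[p][s]))
            ptr[s] = best
            v2[s] = v[best] + math.log(trans[best][s]) + math.log(emit[s][x])
        v = v2
        ptrs.append(ptr)
    last = max(states, key=v.get)
    path = [last]
    for ptr in reversed(ptrs):             # trace the best path backwards
        path.append(ptr[path[-1]])
    return path[::-1], v[last]

path, logp = viterbi("316664162366")
```

The Forward algorithm uses the same recursion with a (log-)sum in place of the max, and the Backward algorithm runs the corresponding sum from the other end of the sequence.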

When the paths for the training sequences are not known, no closed-form solution exists for the parameter estimates. Any iterative algorithm for continuous function optimization can be used [Press et al. 1992], but the Baum-Welch algorithm is the standard choice. It is an EM method that uses the dynamic programming matrix together with the Forward and Backward algorithms; in HMMs the missing data are the unknown state paths (the hidden states). The overall log likelihood of the model increases with each iteration, so the algorithm is guaranteed to converge to a local maximum. It is never guaranteed that the local maximum is the global maximum (true of any such algorithm), and since we are converging in a continuous-valued space we never actually reach the local maximum: the convergence criterion is met when the change in log likelihood is sufficiently small.

The Viterbi training algorithm is often used instead if all we care about are the most probable paths π*. The log likelihood of the most probable paths for all the sequences increases with each iteration, and the algorithm is guaranteed to converge to a local maximum, but again never guaranteed to reach the global maximum.
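A compact sketch of one Baum-Welch (EM) iteration for a discrete two-state HMM, using per-step scaling to avoid underflow. All names and parameter values are illustrative; real implementations add multiple sequences, pseudocounts, and a convergence check on the log-likelihood change.

```python
import math

def forward_backward(obs, start, trans, emit):
    """Scaled forward/backward variables plus the sequence log-likelihood."""
    S, n = list(trans), len(obs)
    a0 = {s: start[s] * emit[s][obs[0]] for s in S}
    c0 = sum(a0.values())
    a, scale = [{s: a0[s] / c0 for s in S}], [c0]
    for t in range(1, n):
        row = {s: sum(a[t - 1][p] * trans[p][s] for p in S) * emit[s][obs[t]]
               for s in S}
        c = sum(row.values())
        scale.append(c)
        a.append({s: row[s] / c for s in S})
    b = [dict.fromkeys(S, 1.0) for _ in range(n)]
    for t in range(n - 2, -1, -1):
        for s in S:
            b[t][s] = sum(trans[s][q] * emit[q][obs[t + 1]] * b[t + 1][q]
                          for q in S) / scale[t + 1]
    return a, b, scale, sum(math.log(c) for c in scale)

def baum_welch_step(obs, start, trans, emit):
    """One EM update; returns new parameters and the pre-update log-likelihood."""
    S, n = list(trans), len(obs)
    a, b, scale, loglik = forward_backward(obs, start, trans, emit)
    gamma = [{s: a[t][s] * b[t][s] for s in S} for t in range(n)]
    xi = [{(p, q): a[t][p] * trans[p][q] * emit[q][obs[t + 1]]
                   * b[t + 1][q] / scale[t + 1]
           for p in S for q in S} for t in range(n - 1)]
    new_start = dict(gamma[0])
    new_trans = {p: {q: sum(x[p, q] for x in xi) /
                        sum(g[p] for g in gamma[:-1]) for q in S} for p in S}
    symbols = sorted(set(obs))
    new_emit = {s: {o: sum(g[s] for g, x in zip(gamma, obs) if x == o) /
                       sum(g[s] for g in gamma) for o in symbols} for s in S}
    return new_start, new_trans, new_emit, loglik

# Toy run: a hypothetical roll sequence and rough initial guesses.
obs = "31245366666163666266"
start = {"F": 0.6, "L": 0.4}
trans = {"F": {"F": 0.8, "L": 0.2}, "L": {"F": 0.3, "L": 0.7}}
emit = {"F": {c: 1 / 6 for c in "123456"},
        "L": {**{c: 0.08 for c in "12345"}, "6": 0.6}}
for _ in range(3):
    start, trans, emit, ll = baum_welch_step(obs, start, trans, emit)
```

Tracking `ll` across iterations shows the monotone increase described above; iteration stops when successive values differ by less than a chosen tolerance. Viterbi training replaces the expected counts (gamma and xi) with hard counts along each sequence's single best path.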

Position weight matrices (PWMs) cannot correctly model tolerated insertions or deletions: any indel throws off the static alignment of the sequence to the PWM. Binding site? TATAACGGTCA

[Figure: a PWM with columns of weight 1.0.]
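The indel problem can be shown numerically. With a toy TATA-like PWM (made-up probabilities, with a 0.01 floor standing in for pseudocounts), a single inserted base shifts every downstream column and wrecks an otherwise strong log-odds score:

```python
import math

# Hypothetical 6 bp TATA-like PWM; one dict of base probabilities per column.
pwm = [{"A": 0.01, "C": 0.01, "G": 0.01, "T": 0.97},
       {"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01},
       {"A": 0.01, "C": 0.01, "G": 0.01, "T": 0.97},
       {"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01},
       {"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01},
       {"A": 0.49, "C": 0.01, "G": 0.01, "T": 0.49}]

def pwm_logodds(seq, background=0.25):
    """Ungapped log-odds score of seq against the PWM (len(seq) == len(pwm))."""
    return sum(math.log2(col[base] / background) for col, base in zip(pwm, seq))

clean = pwm_logodds("TATAAT")     # matches every column
shifted = pwm_logodds("TAGTAA")   # one inserted G pushes later columns out of register
```

A profile HMM handles the same insertion gracefully because its insert and delete states let the alignment to the model stay in register around the extra base.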

[Figure: profile HMM architecture from Begin to End, with match states m1-m4, insert states i0-i4, and delete states d1-d4; states are annotated with emission probabilities (e.g. P(A) = .3, P(T) = .93).]

[Figure: match-state and insert-state emission distributions for a p53 profile HMM.]
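Match-state emission probabilities like those in the diagram come from the columns of a multiple alignment: residue-majority columns become match states, gap-majority columns are handled by insert states, and pseudocounts keep probabilities nonzero. The gap-majority rule and add-one pseudocounts below follow the minimal profile-HMM recipe in Durbin et al. (1998); the toy alignment is hypothetical.

```python
from collections import Counter

# Hypothetical toy alignment; '-' marks a gap.
alignment = ["ACGT",
             "AC-A",
             "AC-T",
             "TC-T"]

def match_emissions(alignment, alphabet="ACGT"):
    """Emission probabilities per residue-majority column (None = insert column)."""
    ncols = len(alignment[0])
    probs = []
    for j in range(ncols):
        col = [seq[j] for seq in alignment if seq[j] != "-"]
        if len(col) <= len(alignment) / 2:   # gap-majority: insert state, not match
            probs.append(None)
            continue
        counts = Counter(col)
        total = len(col) + len(alphabet)     # add-one pseudocounts
        probs.append({a: (counts[a] + 1) / total for a in alphabet})
    return probs

probs = match_emissions(alignment)
```

Insert-state emissions are typically set to background frequencies, and transition probabilities are estimated the same way from the observed match/insert/delete moves between columns.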

Pfam is a widely used database of protein families, containing more than 13,000 manually curated protein families as of release 26.0. Families are sets of protein regions that share a significant degree of sequence similarity, thereby suggesting homology. Similarity is detected using profile hidden Markov models; Pfam currently uses HMMER3 to build pHMMs and to align sequences to them. There is currently no R interface for performing Pfam alignments.