Gribskov Profile. Hidden Markov Models. Building a Hidden Markov Model. Proteins, DNA and other genomic features can be

Transcription:

Gribskov Profile
Hidden Markov Models
Building a Hidden Markov Model

- Proteins, DNA and other genomic features can be classified into families of related sequences and structures
- Related sequences can diverge beyond recognition with standard sequence comparison methods

How to detect these similarities?

What is a Gribskov Profile?

[Table: example Gribskov profile with one row per alignment position (POS), one column per residue type and a Gap column; the numeric scores are not recoverable]

Differences between Gribskov Profiles and common sequence comparison methods

What is needed to create a Gribskov Profile?

seq1pep ~GTL
seq2pep GGSL~
seq3pep ~GHSV
seq4pep ~GGTL
seq5pep GSS~

The profile is filled using the formula

$$M(p,a) = \sum_{b=1}^{20} W(p,b)\, Y(a,b)$$

$$W(p,b) = \frac{n(b,p)}{N_R}$$

where Y(a,b) is the entry of the amino acid comparison matrix for residues a and b, n(b,p) is the number of times residue b appears at position p of the alignment, and N_R is the number of sequences.

[Table: amino acid comparison matrix over the residue codes B through Z; the scores are not recoverable]

[Worked example: the sum is expanded for one cell M(p,a) of the profile; the operands are not recoverable]

[Table: the profile obtained from the five aligned sequences, with one row per position and a Gap column; the scores are not recoverable]

The probability of any sequence is calculated in the same way.
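The profile formula M(p,a) = sum over b of W(p,b) Y(a,b), with W(p,b) = n(b,p) / N_R, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the alignment, the alphabet and the `toy_score` comparison matrix are hypothetical stand-ins (not the slide's data), the alignment is gap-free, and gap penalties are ignored.

```python
from collections import Counter


def gribskov_profile(alignment, alphabet, score):
    """Return a list of rows; row p maps residue a to M(p, a)."""
    n_seqs = len(alignment)  # N_R, the number of aligned sequences
    profile = []
    for column in zip(*alignment):  # one column per alignment position p
        counts = Counter(column)  # n(b, p)
        weights = {b: counts[b] / n_seqs for b in alphabet}  # W(p, b)
        row = {
            a: sum(weights[b] * score(a, b) for b in alphabet)
            for a in alphabet
        }
        profile.append(row)
    return profile


def toy_score(a, b):
    # Hypothetical comparison matrix: +2 for identity, -1 otherwise.
    return 2.0 if a == b else -1.0


# Hypothetical gap-free toy alignment (not the slide's sequences).
alignment = ["GTL", "GSL", "GTV"]
profile = gribskov_profile(alignment, "GLSTV", toy_score)

for p, row in enumerate(profile, start=1):
    print(p, {a: round(v, 2) for a, v in row.items()})
```

A real profile would plug in a published comparison matrix and add position-specific gap columns; the loop structure stays the same.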

ProfileMake, ProfileGap, ProfileSearch, ProfileSegments, TProfileGap, TProfileSearch, TProfileSegments

Hidden Markov Models

- Markov Models are probabilistic models with a solid statistical foundation
- In contrast to patterns and profiles, HMMs allow consistent treatment of insertions and deletions

[Figure: sequences TGTGTGTG, TGTGGTGTG, TGTTGTG and TGTGTGTG annotated as Domain 1 (active binding site), Domain 2 (never found, inactive), Domain 3 (never found, inactive) and Domain 4 (active); the probabilities attached to each are not recoverable]

- In contrast to patterns and profiles, Markov Models take into account additional information about neighboring residues

[Figure: conditional probabilities of a residue given its neighbors; the expressions are not recoverable]

First order Markov Model
Fifth order Markov Model

Applications:
- Gene finding
- Protein secondary structure prediction
- Protein homology recognition
- Phylogenetic analysis
- Radiation hybrid mapping
- Profile HMM libraries
- Genetic linkage mapping
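As a small illustration of how a first order Markov Model uses neighboring residues, the sketch below scores a sequence as the probability of its first symbol times the probability of each symbol given the one before it. The two-letter alphabet and all probabilities are hypothetical values chosen for the example.

```python
import math


def markov_log_prob(seq, initial, transition):
    """Log-probability of seq under a first order Markov chain."""
    logp = math.log(initial[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        # Each symbol depends only on its immediate predecessor.
        logp += math.log(transition[prev][cur])
    return logp


# Hypothetical parameters that favor alternating T and G.
initial = {"T": 0.5, "G": 0.5}
transition = {
    "T": {"T": 0.2, "G": 0.8},
    "G": {"T": 0.8, "G": 0.2},
}

p_alternating = math.exp(markov_log_prob("TGTG", initial, transition))
p_repeated = math.exp(markov_log_prob("TTGG", initial, transition))
print(p_alternating, p_repeated)
```

Both sequences have the same composition, so a model that ignores neighbors could not tell them apart; the chain assigns the alternating one a much higher probability. A fifth order model would condition on the five preceding symbols instead of one.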

[Figure: example HMM with an emission probability table for each state and transition probabilities between states; the numeric values are not recoverable]

P(sequence) is the product of the emission and transition probabilities along a path.

Any sequence can be represented by a path through the model.

[Figure: two state paths through the example model with their probabilities; the numeric values are not recoverable]

Different state paths through the model can generate the same sequence.

Correct probability of a sequence
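The two statements above (a path's probability is a product of emissions and transitions, and several paths can emit the same sequence) can be checked on a tiny model. The sketch below uses a hypothetical two-state HMM with states named `M` and `I`; all parameter values are assumptions made for illustration.

```python
from itertools import product

# Hypothetical two-state HMM (not the model from the slides).
states = ["M", "I"]
start = {"M": 0.9, "I": 0.1}
trans = {"M": {"M": 0.7, "I": 0.3}, "I": {"M": 0.4, "I": 0.6}}
emit = {"M": {"T": 0.8, "G": 0.2}, "I": {"T": 0.5, "G": 0.5}}


def path_probability(seq, path):
    """Joint probability of seq and one state path."""
    prob = start[path[0]] * emit[path[0]][seq[0]]
    for i in range(1, len(seq)):
        prob *= trans[path[i - 1]][path[i]] * emit[path[i]][seq[i]]
    return prob


def total_probability(seq):
    """Sum over every state path; the cost grows as len(states) ** len(seq)."""
    return sum(
        path_probability(seq, path)
        for path in product(states, repeat=len(seq))
    )


p_one_path = path_probability("TG", "MM")
p_all_paths = total_probability("TG")
print(p_one_path, p_all_paths)
```

For the sequence `TG` there are only four paths, so brute force is fine here, but the number of paths grows exponentially with sequence length.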

Forward algorithm

Summing over every state path one by one is computationally unfeasible for long sequences.

Viterbi algorithm

[Figure: step-by-step dynamic programming tables for the example sequence; the values are not recoverable]

The score that a sequence obtains with an HMM measures the probability that the sequence belongs to a family, group or class.

Global scoring
Local scoring

The alignment type is part of the model: it must be specified when the HMM is created, not when it is used.

Building a Hidden Markov Model

- An HMM can be estimated from sequences
- The sequences used to estimate (train) the model are called training data

seq1pep ~GTL
seq2pep GGSL~
seq3pep ~GHSV
seq4pep ~GGTL
seq5pep GSS~
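A minimal sketch of the two algorithms named above, assuming a hypothetical two-state HMM (states `M` and `I`, parameter values invented for the example). The Forward algorithm returns the total probability of the sequence summed over all paths; the Viterbi algorithm returns the single most probable path and its probability. Both run in time proportional to the sequence length times the square of the number of states.

```python
# Hypothetical two-state HMM (not the model from the slides).
states = ["M", "I"]
start = {"M": 0.9, "I": 0.1}
trans = {"M": {"M": 0.7, "I": 0.3}, "I": {"M": 0.4, "I": 0.6}}
emit = {"M": {"T": 0.8, "G": 0.2}, "I": {"T": 0.5, "G": 0.5}}


def forward(seq):
    """Total probability of seq, summed over all state paths."""
    f = {k: start[k] * emit[k][seq[0]] for k in states}
    for x in seq[1:]:
        f = {
            l: emit[l][x] * sum(f[k] * trans[k][l] for k in states)
            for l in states
        }
    return sum(f.values())


def viterbi(seq):
    """Most probable state path for seq and that path's probability."""
    v = {k: start[k] * emit[k][seq[0]] for k in states}
    backpointers = []
    for x in seq[1:]:
        # Best predecessor of each state l at this position.
        ptr = {l: max(states, key=lambda k: v[k] * trans[k][l]) for l in states}
        v = {l: emit[l][x] * v[ptr[l]] * trans[ptr[l]][l] for l in states}
        backpointers.append(ptr)
    last = max(states, key=lambda k: v[k])
    path = [last]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    return "".join(reversed(path)), v[last]


p_total = forward("TG")
best_path, p_best = viterbi("TG")
print(p_total, best_path, p_best)
```

For long sequences the products underflow, so practical implementations work with log-probabilities or rescale at each step; that detail is omitted here to keep the recursion visible.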

To build an HMM it is necessary to estimate the emission and transition probabilities.

$$e_k(b) = \frac{E_k(b)}{\sum_{b'} E_k(b')}$$

where E_k(b) is the number of times state k emits residue b in the training data.

seq1pep ~GTL
seq2pep GGSL~
seq3pep ~GHSV
seq4pep ~GGTL
seq5pep GSS~

The expected transition probability is calculated the same way:

$$a_{kl} = \frac{A_{kl}}{\sum_{l'} A_{kl'}}$$

where A_{kl} is the number of transitions from state k to state l in the training data.

Model overspecialization

crd XFTNVSTTSKEWSVQRLHNTSGRGKMMK
bah XFTNVSXTTSKEWSVQRLHNTSGRGKMXMK
sxm TIINVKTSPKQSKPKELGSSGaKMNGK
lir XFTQESTSNQWSIRRLHNTNRGKMNSK
mbt XFTNVSSSSQWPVKKLFGTRGKINGK

Sequence weighting

- Sequence weighting based on tree structures
- Position-specific weighting method (Henikoff)
- Maximum discrimination weighting

[Figure: tree relating the five sequences and the weights derived from it, shown next to the same alignment; the values are not recoverable]
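To make the estimation and weighting steps concrete, here is a hedged sketch that combines the count-based emission estimate e_k(b) = E_k(b) / sum of E_k(b') with position-based weights in the style of Henikoff and Henikoff. It assumes one match state per alignment column and a gap-free toy alignment; the sequences and function names are hypothetical, not taken from the slides.

```python
from collections import Counter


def henikoff_weights(alignment):
    """Position-based sequence weights, normalized to sum to 1."""
    weights = [0.0] * len(alignment)
    for column in zip(*alignment):
        counts = Counter(column)
        n_types = len(counts)  # distinct residues in this column
        for i, residue in enumerate(column):
            # A residue shared by many sequences earns each of them less.
            weights[i] += 1.0 / (n_types * counts[residue])
    total = sum(weights)
    return [w / total for w in weights]


def emission_estimates(alignment, weights):
    """Per-column emission probabilities from weighted counts."""
    estimates = []
    for column in zip(*alignment):
        counts = Counter()
        for residue, weight in zip(column, weights):
            counts[residue] += weight  # weighted E_k(b)
        total = sum(counts.values())
        estimates.append({b: c / total for b, c in counts.items()})
    return estimates


# Hypothetical toy alignment: two identical sequences and one divergent one.
alignment = ["GTL", "GTL", "GSV"]
weights = henikoff_weights(alignment)
emissions = emission_estimates(alignment, weights)
print([round(w, 3) for w in weights])
print(emissions[1])
```

With equal weights the second column would give T a probability of 2/3; the weighting lowers it because the two identical sequences count for less, which is the effect sequence weighting is meant to have against overspecialization.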

Overfitting

- Caused by insufficient training data
- Regularization using prior information

[Equations: regularized estimates of the emission and transition probabilities; the expressions are not recoverable]

Baum-Welch algorithm

- Iterative algorithm which maximizes the probability of the training sequences in the model
- It maximizes the likelihood of the model, that is, the joint probability of all sequences in the training set given a particular set of parameters

[Figure: change of the model over successive iterations, from greater variation to little variation]

- Convergence
- Avoid a local maximum
- Use of heuristic methods
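One common way to regularize with prior information is to add pseudocounts to the observed counts before normalizing, so that residues never seen in a small training set still get a nonzero probability. The sketch below assumes that approach with a uniform pseudocount; the function name, the counts and the alphabet are hypothetical, and other priors (for example Dirichlet mixtures) are used in practice.

```python
def regularized_emissions(counts, alphabet, pseudocount=1.0):
    """Emission probabilities with a uniform pseudocount added to every residue."""
    total = sum(counts.get(b, 0.0) for b in alphabet) + pseudocount * len(alphabet)
    return {b: (counts.get(b, 0.0) + pseudocount) / total for b in alphabet}


# Hypothetical observed counts for one state: T seen twice, S once.
counts = {"T": 2, "S": 1}
probs = regularized_emissions(counts, "GLSTV")
print(probs)
```

Without the pseudocount, G, L and V would get probability zero and any sequence containing them at this state would be impossible under the model; with it, each unseen residue gets 1/8 here.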

HMMalign, HMMBuild, HMMconvert, HMMemit, TMhmm, Genescan, HMM scan, HMMsearch

For gene finding, several signals must be recognized and combined into a prediction of exons and introns.

An HMM for unspliced genes

[Figure: model of an unspliced gene, a coding region of codons flanked by non-coding sequence (x)]

An HMM for spliced genes

- Three different models of introns are needed, one for each reading frame
- The four models are combined using the Viterbi algorithm to find the most probable pathway

[Figure: three intron models, each drawn as GTxxxxxx, interior intron, xxxxxxg]

All models are combined using the Viterbi algorithm to find the most probable pathway.