Gribskov Profile
Hidden Markov Models
Building a Hidden Markov Model

Proteins, DNA and other genomic features can be classified into families of related sequences and structures.
Related sequences can diverge beyond recognition with standard sequence comparison methods.
How to detect these similarities:

What is a Gribskov Profile?
[Table: example profile, one row per alignment position (POS), with columns D, E, F, G, H, L, S, T and Gap]

Differences between Gribskov Profiles and common sequence comparison methods

What is needed to create a Gribskov Profile?
    seqpep   ~GTL
    seqpep   GGSL~
    seq3pep  ~GHSV
    seqpep   ~GGTL
    seqpep   GSS~
The profile is filled using the following formula:

    M(p,a) = \sum_{b=1}^{20} W(p,b) \, Y(a,b)

where W(p,b) = n(b,p) / N_R is the frequency of residue b at position p of the alignment (n(b,p) occurrences among the N_R aligned sequences) and Y(a,b) is the comparison matrix score for residues a and b.

[Table: amino acid comparison matrix, rows and columns B D E F G H I K L M N P Q R S T V W X Z]

[Worked example: the sum expanded term by term for a single profile cell]

[Table: resulting profile, one row per alignment position (POS), with columns D, E, F, G, H, L, S, T and Gap]

Probability of any sequence is calculated in the same way.
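The filling rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the alignment is a made-up, gap-free toy, and `toy_score` is a hypothetical +1/-1 stand-in for a real comparison matrix, not the matrix shown in the slides.

```python
from collections import Counter

def position_weights(column):
    """W(p, b) = n(b, p) / N_R for one alignment column."""
    n_rows = len(column)          # N_R: number of aligned sequences
    counts = Counter(column)      # n(b, p): occurrences of each residue b
    return {b: n / n_rows for b, n in counts.items()}

def build_profile(alignment, alphabet, score):
    """M(p, a) = sum over b of W(p, b) * score(a, b), for every column p."""
    profile = []
    for column in zip(*alignment):            # iterate over alignment columns
        w = position_weights(column)
        profile.append({a: sum(w_b * score(a, b) for b, w_b in w.items())
                        for a in alphabet})
    return profile

def toy_score(a, b):
    # Hypothetical comparison matrix: +1 for identical residues, -1 otherwise.
    return 1.0 if a == b else -1.0

alignment = ["GTL", "GSL", "GSV", "GTL"]      # made-up toy alignment
alphabet = "GSTLV"
profile = build_profile(alignment, alphabet, toy_score)
```

With a real comparison matrix, `score` would look up the tabulated value for the pair (a, b) instead of applying the match/mismatch rule.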
ProfileMake  ProfileGap  ProfileSearch  ProfileSegments
TProfileGap  TProfileSearch  TProfileSegments

Hidden Markov Models

Markov Models are probabilistic models with a solid statistical foundation.
In contrast to patterns and profiles, HMMs allow consistent treatment of insertions and deletions.

[Figure: four domains (active binding site; never found, inactive; never found, inactive; active) shown with the sequences TGTGTGTG, TGTGGTGTG, TGTTGTG and TGTGTGTG and the probabilities assigned by the model]

In contrast to patterns and profiles, Markov Models take into account additional information about neighboring residues.
First order Markov Model
Fifth order Markov Model

Applications:
    Gene finding
    Protein secondary structure prediction
    Protein homology recognition
    Phylogenetic analysis
    Radiation hybrid mapping
    Profile HMM libraries
    Genetic linkage mapping
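As a rough illustration of a first order Markov model, the sketch below estimates P(next residue | current residue) from adjacent pairs and scores a sequence by summing the logs of those probabilities. The function names, the `floor` value used for unseen transitions and the decision to ignore the probability of the first residue are assumptions made for this example. The training strings are reused from the domain example above purely as toy data.

```python
from collections import defaultdict
import math

def train_first_order(sequences):
    """Estimate P(next | current) from counts of adjacent residue pairs."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for cur, nxt in zip(seq, seq[1:]):
            counts[cur][nxt] += 1
    return {cur: {nxt: n / sum(row.values()) for nxt, n in row.items()}
            for cur, row in counts.items()}

def log_prob(seq, trans, floor=1e-6):
    """Sum of log transition probabilities; unseen transitions get `floor`."""
    return sum(math.log(trans.get(cur, {}).get(nxt, floor))
               for cur, nxt in zip(seq, seq[1:]))

training = ["TGTGTGTG", "TGTGGTGTG", "TGTTGTG", "TGTGTGTG"]  # toy data
trans = train_first_order(training)
```

A fifth order model would condition on the previous five residues instead of one, at the cost of many more parameters to estimate.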
[Figure: example HMM whose states emit the residues D, E, F, G, H, I, annotated with emission and transition probabilities]

P(sequence) is the product of the emission and transition probabilities along its path through the model.
Any sequence can be represented by a path through the model.

[Figure: two paths through the model with their emission and transition probabilities]

Different state paths through the model can generate the same sequence.
The correct probability of a sequence is the sum of the probabilities of all the state paths that can generate it.
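The two statements above can be made concrete with a small sketch. The two-state model below (`S1`, `S2`, with its start, transition and emission tables) is entirely hypothetical and has no end state; it only shows that the probability of one path is a product, and that the probability of the sequence is a sum over paths.

```python
from itertools import product

# Hypothetical two-state model over the residues G and T.
states = ("S1", "S2")
start = {"S1": 0.5, "S2": 0.5}
trans = {"S1": {"S1": 0.9, "S2": 0.1}, "S2": {"S1": 0.2, "S2": 0.8}}
emit = {"S1": {"G": 0.7, "T": 0.3}, "S2": {"G": 0.1, "T": 0.9}}

def path_probability(seq, path):
    """P(seq, path): product of start, emission and transition probabilities."""
    p = start[path[0]] * emit[path[0]][seq[0]]
    for i in range(1, len(seq)):
        p *= trans[path[i - 1]][path[i]] * emit[path[i]][seq[i]]
    return p

def total_probability(seq):
    """P(seq): sum of P(seq, path) over every possible state path."""
    return sum(path_probability(seq, path)
               for path in product(states, repeat=len(seq)))
```

Because `total_probability` enumerates len(states) ** len(seq) paths, it is only usable for very short sequences.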
Forward Algorithm
This solution is computationally unfeasible for long sequences.
Viterbi Algorithm

[Figure: step-by-step calculation over the states of the model]

The score that a sequence obtains with an HMM measures the probability that the sequence belongs to a family, group or class.

Global scoring
Local scoring
The alignment type is part of the model: it must be specified when the HMM is created, not when it is used.

Building a Hidden Markov Model

An HMM can be estimated from sequences.
Sequences used to estimate or train the model are called training data.
    seqpep   ~GTL
    seqpep   GGSL~
    seq3pep  ~GHSV
    seqpep   ~GGTL
    seqpep   GSS~
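Returning to the Forward and Viterbi algorithms named in this section, a minimal sketch of both is given below. The two-state model is hypothetical, there is no end state, and plain probabilities are used instead of logarithms, so this is suitable only for short toy sequences.

```python
# Hypothetical two-state model over the residues G and T.
states = ("S1", "S2")
start = {"S1": 0.5, "S2": 0.5}
trans = {"S1": {"S1": 0.9, "S2": 0.1}, "S2": {"S1": 0.2, "S2": 0.8}}
emit = {"S1": {"G": 0.7, "T": 0.3}, "S2": {"G": 0.1, "T": 0.9}}

def forward(seq):
    """P(seq) summed over all state paths, one column of values at a time."""
    f = {k: start[k] * emit[k][seq[0]] for k in states}
    for x in seq[1:]:
        f = {l: emit[l][x] * sum(f[k] * trans[k][l] for k in states)
             for l in states}
    return sum(f.values())

def viterbi(seq):
    """Most probable state path and its joint probability with seq."""
    v = {k: start[k] * emit[k][seq[0]] for k in states}
    back = []                                  # back-pointers, one dict per step
    for x in seq[1:]:
        ptr, nxt = {}, {}
        for l in states:
            best = max(states, key=lambda k: v[k] * trans[k][l])
            ptr[l] = best
            nxt[l] = v[best] * trans[best][l] * emit[l][x]
        back.append(ptr)
        v = nxt
    last = max(states, key=lambda k: v[k])
    path = [last]
    for ptr in reversed(back):                 # trace the best path backwards
        path.append(ptr[path[-1]])
    return path[::-1], v[last]
```

Both functions do work proportional to len(seq) * len(states) ** 2. The Forward value is the total probability of the sequence, while the Viterbi value is the probability of the single best path and is therefore never larger.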
To build an HMM it is necessary to estimate the emission probabilities and the transition probabilities.

The expected emission probability of residue b in state k is calculated from the observed counts E_k(b):

    e_k(b) = \frac{E_k(b)}{\sum_{b'} E_k(b')}

The expected transition probability is calculated the same way, from the observed counts A_{kl} of transitions from state k to state l:

    a_{kl} = \frac{A_{kl}}{\sum_{l'} A_{kl'}}

Model overspecialization
    crd  XFTNVSTTSKEWSVQRLHNTSGRGKMMK
    bah  XFTNVSXTTSKEWSVQRLHNTSGRGKMXMK
    sxm  TIINVKTSPKQSKPKELGSSGaKMNGK
    lir  XFTQESTSNQWSIRRLHNTNRGKMNSK
    mbt  XFTNVSSSSQWPVKKLFGTRGKINGK

Sequence weighting
    Sequence weighting based on tree structures
    Position-specific weighting method (Henikoff)
    Maximum discrimination weighting
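The two estimation formulas amount to counting and normalizing. The sketch below assumes the state path of every training sequence is already known (for a profile HMM it would typically be read off a multiple alignment); the state names `M` and `I` and the labelled sequences are invented for illustration.

```python
from collections import Counter

def normalize(counts):
    """Turn raw counts into probabilities: count / total of the row."""
    total = sum(counts.values())
    return {key: n / total for key, n in counts.items()}

def estimate(labelled):
    """Estimate e_k(b) and a_kl from (sequence, state_path) pairs."""
    E, A = {}, {}                              # emission and transition counts
    for seq, path in labelled:
        for k, b in zip(path, seq):            # state k emitted residue b
            E.setdefault(k, Counter())[b] += 1
        for k, l in zip(path, path[1:]):       # transition from k to l
            A.setdefault(k, Counter())[l] += 1
    emissions = {k: normalize(c) for k, c in E.items()}
    transitions = {k: normalize(c) for k, c in A.items()}
    return emissions, transitions

labelled = [("GGTT", "MMII"), ("GTTG", "MMII")]   # made-up training data
emissions, transitions = estimate(labelled)
```

Any residue or transition that never occurs in the training data receives probability zero under this scheme.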
Overfitting
    Caused by insufficient training data.
    Regularization using prior information.

Baum-Welch Algorithm
    Iterative algorithm which maximizes the probability of the training sequences in the model.
    Maximizes the likelihood of the model, that is, the joint probability of all sequences in the training set given a particular set of parameters.

[Figure: panels labelled "greater variation" and "little variation"]

Convergence
    Avoid a local maximum
    Use of heuristic methods
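One simple way to regularize with prior information is to add pseudocounts before normalizing. The slides do not say which prior was used, so the uniform pseudocount below is an assumption chosen for brevity; the function name, the counts and the alphabet are hypothetical.

```python
def regularized_probs(counts, alphabet, pseudocount=1.0):
    """e(b) = (count(b) + pseudocount) / (total + pseudocount * len(alphabet))."""
    total = sum(counts.get(b, 0) for b in alphabet) + pseudocount * len(alphabet)
    return {b: (counts.get(b, 0) + pseudocount) / total for b in alphabet}

observed = {"G": 3, "T": 1}          # made-up emission counts for one state
alphabet = "GSTLV"

# Without pseudocounts, unseen residues get probability zero (overfitting).
mle = regularized_probs(observed, alphabet, pseudocount=0.0)

# With one pseudocount per residue, every residue keeps some probability.
smoothed = regularized_probs(observed, alphabet, pseudocount=1.0)
```

With few training sequences the pseudocounts dominate; as the counts grow, the estimate approaches the plain count ratio.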
HMMalign  HMMBuild  HMMconvert  HMMemit  TMhmm  Genescan  HMM scan  HMMsearch

For gene finding, several signals must be recognized and combined into a prediction of exons and introns.

An HMM for unspliced genes
[Figure: xxxxxxxxtgccc ccc ccctxxxxxxxx]

An HMM for spliced genes
[Figure: GTxxxxxx interior intron xxxxxxg]
It is necessary to use three different intron models, one for each reading frame.
Four models are combined using the Viterbi algorithm to find the most probable pathway.