Biology 644: Bioinformatics

A hidden Markov model (HMM) is a statistical Markov model in which the system being modeled is assumed to be a Markov process with unobserved (hidden) states in the training data. HMMs were first used in speech and handwriting recognition. In biology, they are frequently used to model biological sequences and structure: gene tracks, 5′ and 3′ splice sites, chromatin states, CpG islands, GC isochores, protein folding conformations, DNA-binding sites, RNA-binding sites, copy number variation (CNV), differential gene expression, and sequence homology (profile HMMs, pHMMs).

When at least some of the data labels are missing (hidden) in the training data, we must infer (label) the missing hidden states. This requires correctly inferring the model topology of the system, and it requires a lot of training data to train the additional parameters. Successful training of the parameters is highly dependent on the initial conditions. A famous example is the Occasionally Dishonest Casino.
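As a rough illustration of the Occasionally Dishonest Casino, the sketch below simulates a two-state HMM in which a hidden state (fair or loaded die) emits observed rolls. The switching and emission probabilities, the names `TRANS`, `EMIT`, and `simulate_casino`, and the choice to start with the fair die are all illustrative assumptions, not values taken from these slides.

```python
import random

# Assumed, illustrative parameters: F = fair die, L = loaded die.
TRANS = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.10, "L": 0.90}}
EMIT = {
    "F": {face: 1 / 6 for face in range(1, 7)},
    "L": {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5},
}


def simulate_casino(n_rolls, seed=0):
    """Return (hidden_states, rolls) sampled from the two-state HMM."""
    rng = random.Random(seed)
    state = "F"  # assumption: the casino starts with the fair die
    states, rolls = [], []
    for _ in range(n_rolls):
        states.append(state)
        # Emit a roll from the current die.
        faces = list(EMIT[state])
        rolls.append(rng.choices(faces, weights=[EMIT[state][f] for f in faces])[0])
        # Decide which die is used for the next roll.
        nxt = list(TRANS[state])
        state = rng.choices(nxt, weights=[TRANS[state][s] for s in nxt])[0]
    return states, rolls


if __name__ == "__main__":
    hidden, observed = simulate_casino(30, seed=1)
    print("".join(hidden))
    print("".join(str(r) for r in observed))
```

An observer sees only the second printed line; recovering the first line from it is the labelling problem described above.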

[Figure: tumour copy number plotted against chromosome position.]

The Viterbi Algorithm: an efficient dynamic programming algorithm that is guaranteed to return the alignment path with the highest log-odds score for a given sequence (also called the best-supported path, the one most different from the background).

The Forward Algorithm: another dynamic programming algorithm, which sums over all possible paths to obtain the full probability that sequence x_i aligns with the model rather than the background. This is necessary to obtain P(x_i | θ), since many different state paths can give rise to the same sequence x_i.

The Backward Algorithm: similar to the Forward Algorithm, but it recurses in the backward direction.
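The three recursions can be sketched for a small two-state HMM as follows. This is a minimal sketch under stated assumptions: the parameters `START`, `TRANS`, and `EMIT` are made-up dice-model values, there is no explicit end state, and the functions work with plain log probabilities rather than log-odds against a background model.

```python
import math

# Assumed toy parameters (a fair die F and a loaded die L); not from the slides.
STATES = ("F", "L")
START = {"F": 0.5, "L": 0.5}
TRANS = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.10, "L": 0.90}}
EMIT = {
    "F": {face: 1 / 6 for face in range(1, 7)},
    "L": {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5},
}


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def viterbi(obs):
    """Return (best_path, log P(x, best_path))."""
    v = {k: math.log(START[k]) + math.log(EMIT[k][obs[0]]) for k in STATES}
    backpointers = []
    for x in obs[1:]:
        ptr, new_v = {}, {}
        for k in STATES:
            # Best predecessor state for arriving in state k.
            best = max(STATES, key=lambda j: v[j] + math.log(TRANS[j][k]))
            ptr[k] = best
            new_v[k] = v[best] + math.log(TRANS[best][k]) + math.log(EMIT[k][x])
        backpointers.append(ptr)
        v = new_v
    last = max(STATES, key=lambda k: v[k])
    path = [last]
    for ptr in reversed(backpointers):  # trace back from the final state
        path.append(ptr[path[-1]])
    path.reverse()
    return "".join(path), v[last]


def forward(obs):
    """Return log P(x), summing over all state paths left to right."""
    f = {k: math.log(START[k]) + math.log(EMIT[k][obs[0]]) for k in STATES}
    for x in obs[1:]:
        f = {
            k: math.log(EMIT[k][x])
            + logsumexp([f[j] + math.log(TRANS[j][k]) for j in STATES])
            for k in STATES
        }
    return logsumexp(list(f.values()))


def backward(obs):
    """Return log P(x), computed by recursing from the last symbol backward."""
    b = {k: 0.0 for k in STATES}  # log(1) at the final position
    for x in reversed(obs[1:]):
        b = {
            j: logsumexp(
                [math.log(TRANS[j][k]) + math.log(EMIT[k][x]) + b[k] for k in STATES]
            )
            for j in STATES
        }
    return logsumexp(
        [math.log(START[k]) + math.log(EMIT[k][obs[0]]) + b[k] for k in STATES]
    )


if __name__ == "__main__":
    rolls = [6, 6, 1, 3, 6, 6, 6, 2]
    print(viterbi(rolls))
    print(forward(rolls), backward(rolls))
```

The forward and backward totals should agree, and the Viterbi score can never exceed them because the best path is only one term of the sum over all paths.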

When the paths for the training sequences are not known, no closed-form solution is known for the parameter estimates. Any of the known iterative algorithms for continuous function optimization can be used [Press et al. 1992].

The Baum-Welch algorithm is the standard choice. It is an EM method that uses the DP matrix and the forward and backward algorithms. In HMMs the missing data are the unknown state paths (the hidden states). The overall log likelihood of the model increases with each iteration, and the algorithm is guaranteed to converge to a local maximum. It is never guaranteed that the local maximum is the global maximum (for any algorithm). Since we are converging in continuous-valued space, we never actually reach the local maximum; the convergence criterion is met when the change in log likelihood is sufficiently small.

The Viterbi training algorithm is often used if all we care about are the most probable paths π*(s_i). The log likelihood of the most probable paths for all the sequences increases with each iteration. It is guaranteed to converge to a local maximum but, again, never guaranteed to reach the global maximum.
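The Baum-Welch loop described above can be sketched with NumPy for a single training sequence. This is a minimal sketch under stated assumptions: observations are integer-coded symbols (0 to M-1), there is no end state, no pseudocounts are added, and the function names `forward_backward` and `baum_welch` and the tolerance `tol` are hypothetical choices for illustration.

```python
import numpy as np


def forward_backward(obs, start, trans, emit):
    """Scaled forward and backward passes; returns (alpha, beta, scale)."""
    n, k = len(obs), len(start)
    alpha, beta, scale = np.zeros((n, k)), np.ones((n, k)), np.zeros(n)
    alpha[0] = start * emit[:, obs[0]]
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for t in range(1, n):
        alpha[t] = (alpha[t - 1] @ trans) * emit[:, obs[t]]
        scale[t] = alpha[t].sum()
        alpha[t] /= scale[t]
    for t in range(n - 2, -1, -1):
        beta[t] = (trans @ (emit[:, obs[t + 1]] * beta[t + 1])) / scale[t + 1]
    return alpha, beta, scale


def baum_welch(obs, start, trans, emit, max_iter=100, tol=1e-6):
    """EM re-estimation; history[i] is the log likelihood before update i."""
    obs = np.asarray(obs)
    history = []
    for _ in range(max_iter):
        alpha, beta, scale = forward_backward(obs, start, trans, emit)
        history.append(np.log(scale).sum())
        # Stop when the change in log likelihood is sufficiently small.
        if len(history) > 1 and history[-1] - history[-2] < tol:
            break
        # E-step: posterior state (gamma) and transition (xi) probabilities.
        gamma = alpha * beta
        xi = (
            alpha[:-1, :, None]
            * trans[None, :, :]
            * (emit[:, obs[1:]].T * beta[1:])[:, None, :]
            / scale[1:, None, None]
        )
        # M-step: re-estimate parameters from expected counts.
        start = gamma[0]
        trans = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        emit = np.stack(
            [gamma[obs == s].sum(axis=0) for s in range(emit.shape[1])], axis=1
        )
        emit /= gamma.sum(axis=0)[:, None]
    return start, trans, emit, history


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    rolls = rng.integers(0, 6, size=300)  # placeholder data, not a real dataset
    start0 = np.array([0.5, 0.5])
    trans0 = np.array([[0.8, 0.2], [0.3, 0.7]])
    emit0 = np.array([[1 / 6] * 6, [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]])
    *_, history = baum_welch(rolls, start0, trans0, emit0)
    print(history[0], history[-1])
```

Running it from several different starting parameters and comparing the final log likelihoods is one simple way to see the dependence on initial conditions and the lack of a global-maximum guarantee.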

Position Weight Matrices (PWMs) cannot model tolerated insertions or deletions correctly. Any indel throws off the static alignment to the PWM.

Binding site? T A T A A C G G T C A
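A small sketch can make the indel problem concrete. It builds a toy log-odds PWM and scores the best ungapped window in a sequence, with and without a single inserted base. The consensus `TATAAC`, the 0.85/0.05 column probabilities, and the uniform background are hypothetical values chosen for illustration only.

```python
import math

CONSENSUS = "TATAAC"  # hypothetical motif, for illustration only
BACKGROUND = 0.25     # assumed uniform base composition

# One column per motif position: log2 odds of each base versus background.
PWM = [
    {base: math.log2((0.85 if base == c else 0.05) / BACKGROUND) for base in "ACGT"}
    for c in CONSENSUS
]


def window_score(window):
    """Log-odds score of one ungapped window the same length as the PWM."""
    return sum(column[base] for column, base in zip(PWM, window))


def best_score(seq):
    """Highest score over all ungapped windows in seq."""
    width = len(PWM)
    return max(window_score(seq[i:i + width]) for i in range(len(seq) - width + 1))


if __name__ == "__main__":
    print(best_score("GGTATAACGG"))   # site present exactly
    print(best_score("GGTATGAACGG"))  # same site with one inserted G
```

Because every column is tied to a fixed offset, the inserted base shifts the remaining positions out of register and no window recovers the full score, which is the limitation that insert and delete states in a profile HMM are meant to address.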

[Figure: PWM, annotated with the values 1.0, 1.0, 1.0, 1.0, 1.0.]

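[Figure: profile HMM with Begin and End states, match states m1 to m4, insert states i0 to i4, and delete states d1 to d4. Emission labels in the figure include P(A) = .3, P(T) = .016, P(T) = .96, P(T) = .18, P(T) = .41, P(C) = .37, P(T) = .76, P(C) = .45, P(G) = .29, and P(T) = .93.]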

[Figure panels: match state emissions; insert state emissions; p53 insertion state emissions.]

Pfam is a widely used database of protein families, containing more than 13,000 manually curated protein families as of release 26.0. Families are sets of protein regions that share a significant degree of sequence similarity, thereby suggesting homology. Similarity is detected using profile hidden Markov models (pHMMs). Pfam currently uses HMMER3 to build and align to pHMMs. There is currently no R interface to perform Pfam alignments.