MLiB - Mandatory Project 2. Gene finding using HMMs

Size: px

Start display at page:

Download "MLiB - Mandatory Project 2. Gene finding using HMMs"

Noreen Bennett
6 years ago
Views:

1 MLiB - Mandatory Project 2 Gene finding using HMMs

2 Viterbi decoding >NC_ Streptococcus pyogenes M1 GAS TTGTTGATATTCTGTTTTTTCTTTTTTAGTTTTCCACATGAAAAATAGTTGAAAACAATA GCGGTGTCCCCTTAAAATGGCTTTTCCACAGGTTGTGGAGAACCCAAATTAACAGTGTTA ATTTATTTTCCACAGGTTGTGGAAAAACTAACTATTATCCATCGTTCTGTGGAAAACTAG AATAGTTTATGGTAGAATAGTTCTAGAATTATCCACAAGAAGGAACCTAGTATGACTGAA AATGAACAAATTTTTTGGAACAGGGTCTTGGAATTAGCTCAGAGTCAATTAAAACAGGCA ACTTATGAATTTTTTGTTCATGATGCCCGTCTATTAAAGGTCGATAAGCATATTGCAACT ATTTACTTAGATCAAATGAAAGAGCTCTTTTGGGAAAAAAATCTTAAAGATGTTATTCTT ACTGCTGGTTTTGAAGTTTATAACGCTCAAATTTCTGTTGACTATGTTTTCGAAGAAGAC CTAATGATTGAGCAAAATCAGACCAAAATCAACCAAAAACCTAAGCAGCAAGCCTTAAAT TCTTTGCCTACTGTTACTTCAGATTTAAACTCGAAATATAGTTTTGAAAACTTTATTCAA GGAGATGAAAATCGTTGGGCTGTTGCTGCTTCAATAGCAGTAGCTAATACTCCTGGAACT ACCTATAATCCTTTGTTTATTTGGGGTGGCCCTGGGCTTGGAAAAACCCATTTATTAAAT GCTATTGGTAATTCTGTACTATTAGAAAATCCAAATGCTCGAATTAAATATATCACAGCT GAAAACTTTATTAATGAGTTTGTTATCCATATTCGCCTTGATACCATGGATGAATTGAAA GAAAAATTTCGTAATTTAGATTTACTCCTTATTGATGATATCCAATCTTTAGCTAAAAAA ACGCTCTCTGGAACACAAGAAGAGTTCTTTAATACTTTTAATGCACTTCATAATAATAAC AAACAAATTGTCCTAACAAGCGACCGTACACCAGATCATCTCAATGATTTAGAAGATCGA TTAGTTACTCGTTTTAAATGGGGATTAACAGTCAATATCACACCTCCTGATTTTGAAACA CGAGTGGCTATTTTGACAAATAAAATTCAAGAATATAACTTTATTTTTCCTCAAGATACC ATTGAGTATTTGGCTGGTCAATTTGATTCTAATGTCAGAGATTTAGAAGGTGCCTTAAAA GATATTAGTCTGGTTGCTAATTTCAAACAAATTGACACGATTACTGTTGACATTGCTGCC GAAGCTATTCGCGCCAGAAAGCAAGATGGACCTAAAATGACAGTTATTCCCATCGAAGAA ATTCAAGCGCAAGTTGGAAAATTTTACGGTGTTACCGTCAAAGAAATTAAAGCTACTAAA CGAACACAAAATATTGTTTTAGCAAGACAAGTAGCTATGTTTTTAGCACGTGAAATGACA GATAACAGTCTTCCTAAAATTGGAAAAGAATTTGGTGGCAGAGACCATTCAACAGTACTC CATGCCTATAATAAAATCAAAAACATGATCAGCCAGGACGAAAGCCTTAGGATCGAAATT GAAACCATAAAAAACAAAATTAAATAACATGTGGAAAAGAATATCTTTTATGAAATAGTT ATCCACAAGTTGTGAACATCCATTTAGTCTTGGATTCTCTCGTTTATTTAGAGTTATCCA CTATATACACAAGACCTACTACTACTACTTATTATTATACTTATTAAATAAAGGAGTTCT >NC_ gene annotation Streptococcus pyogenes M1 GAS NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCCCCCCCCC CCCCCCCCCCCCCCCCCCCCCCCCCCCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN Design a HMM that models the syntax of genes

3 Gene structure in procaryotes Biological facts The gene is a substring of the DNA sequence of A,C,G,T's The gene starts with a start-code atg The gene ends with a stop-codon taa, tag or tga The number of nucleotides in a gene is a multiplum of Z: NNNCCCCCCCCCNNNNNNNNCCCCCCCCCCCCCCCNNNNNNNNNNN X: acgatgcgctaatatgtccgatgacgtgagcataagcgacatgcag N: non-coding C: coding π N = 1 π C = 0

4 Gene structure in procaryotes Biological facts The gene is a substring of the DNA sequence of A,C,G,T's The gene starts with a start-code atg The gene ends with a stop-codon taa, tag or tga The number of nucleotides in a gene is a multiplum of Z: NNNCCCCCCCCCNNNNNNNNCCCCCCCCCCCCCCCNNNNNNNNNNN X: acgatgcgctaatatgtccgatgacgtgagcataagcgacatgcag N: non-coding C: coding π N = 1 π C = 0

5 Gene structure in procaryotes Biological facts The gene is a substring of the DNA sequence of A,C,G,T's The gene starts with a start-code atg The gene ends with a stop-codon taa, tag or tga The number of nucleotides in a gene is a multiplum of Z: NNNCCCCCCCCCNNNNNNNNCCCCCCCCCCCCCCCNNNNNNNNNNN X: acgatgcgctaatatgtccgatgacgtgagcataagcgacatgcag π N = 1 π C = 0 N: non-coding C: coding

6 Gene structure in procaryotes Biological facts The gene is a substring of the DNA sequence of A,C,G,T's The gene starts with a start-code atg The gene ends with a stop-codon taa, tag or tga The number of nucleotides in a gene is a multiplum of Z: NNNCCCCCCCCCNNNNNNNNCCCCCCCCCCCCCCCNNNNNNNNNNN X: acgatgcgctaatatgtccgatgacgtgagcataagcgacatgcag π N = 1 π C = 0 N: non-coding C: coding

7 The gene is a substring of the DNA sequence of A,C,G,T's The gene starts with a start-code atg Gene structure in procaryotes The gene ends with a stop-codon taa, tag or tga The number of nucleotides in a gene is a multiplum of 3 N: non-coding C: coding π N = 1 π C = 0

8 The gene is a substring of the DNA sequence of A,C,G,T's The gene starts with a start-code atg Gene structure in procaryotes The gene ends with a stop-codon taa, tag or tga The number of nucleotides in a gene is a multiplum of 3 N: non-coding C: coding π N = 1 π C = 0

9 The gene is a substring of the DNA sequence of A,C,G,T's The gene starts with a start-code atg Gene structure in procaryotes The gene ends with a stop-codon taa, tag or tga The number of nucleotides in a gene is a multiplum of 3 N: non-coding π N = 1 π C = 0 C: coding

10 Gene structure in procaryotes N: non-coding π N = 1 π C = 0 C: coding

11 Gene structure in procaryotes N: non-coding Gene G: finding 0 Select initial model structure (e.g. as done here) Select model parameters by training. Either by counting from examples of (X,Z)'s, i.e. genes with known structure, or by EM from examples of X, i.e. sequence which are known to contain a gene (as we will see on Monday) Given a new T: sequence 0 X, predict its gene structure using the Viterbi algorithm for finding the most likely sequence of underlying C: latent 0 states, i.e. C: its 0gene structure π N = 1 π C = 0 C: coding

12 Example Gene finding N: non-coding Gene G: finding 0 Select initial model structure (e.g. as done here) Select model parameters by training. Either by counting from examples of (X,Z)'s, i.e. genes with known structure, or by EM from examples of X, i.e. sequence which are known to contain a gene (as we will see on Monday) Given a new T: sequence 0 Even X, more predict biology its gene structure using π N = 1 π C = 0 There the can Viterbi be genes algorithm in both for directions finding the (and most over likely lapping) sequence of underlying C: latent 0 states, i.e. C: its 0gene structure There are more possible start-codons atg, gtg, and ttg Internal codons A: cannot 1 be start- A: or stop-codons 0 And a lot more G:... 0 C: coding

13 DNA e' 1 e' 2 e'... 3 s' 1 s' 2 s' 3 5' 5' s 1 s 2 s e 1 e 2 e 3

14 AGT GAT AAT GTA e' 1 e' 2 e'... 3 s' 1 s' 2 s' 3 DNA 5' 5' TTA CTA TCA CAT s 1 s 2 s e 1 e 2 e 3 ATG TAA TAG TGA

15 Analysis of the 11 genomes Length of genome1: ( ) Length of genome2: ( ) Length of genome3: ( ) Length of genome4: ( ) Length of genome5: ( ) Length of genome6: ( ) Length of genome7: ( ) Length of genome8: ( ) Length of genome9: ( ) Length of genome10: ( ) Length of genome11: ( ) Start-codon in normal genes: ATG [8423, 'NCCC'] ATC [3, 'NCCC'] ATA [1, 'RCCC'] GTG [713, 'NCCC'] ATT [3, 'NCCC'] CTG [2, 'NCCC'] GTT [1, 'NCCC'] CTC [1, 'NCCC'] TTA [1, 'NCCC'] TTG [1020, 'NCCC'] Stop-codon in normal genes: TAG [1949, 'CCCN'] TGA [1531, 'CCCN'] TAA [6686, 'CCCN'] Reversed stop-codon in reversed genes: TTA (reverse-complement: TAA) [6596, 'NRRR'] CTA (reverse-complement: TAG) [2014, 'NRRR'] TCA (reverse-complement: TGA) [1148, 'NRRR'] Reversed start-codon in reversed genes: TAT (reverse-complement: ATA) [2, 'RRRN'] ATG (reverse-complement: CAT) [1, 'RRRN'] GAT (reverse-complement: ATC) [1, 'RRRN'] CAT (reverse-complement: ATG) [8077, 'RRRN'] AAT (reverse-complement: ATT) [4, 'RRRN'] TAC (reverse-complement: GTA) [1, 'RRRN'] CAC (reverse-complement: GTG) [715, 'RRRN'] CAA (reverse-complement: TTG) [953, 'RRRN'] CAG (reverse-complement: CTG) [4, 'RRRN']

16 Analysis of the 11 genomes Length of genome1: ( ) Length of genome2: ( ) Length of genome3: ( ) Length of genome4: ( ) Length of genome5: ( ) Length of genome6: ( ) Length of genome7: ( ) Length of genome8: ( ) Length of genome9: ( ) Length of genome10: ( ) Length of genome11: ( ) Start-codon in normal genes: ATG [8423, 'NCCC'] ATC [3, 'NCCC'] ATA [1, 'RCCC'] GTG [713, 'NCCC'] ATT [3, 'NCCC'] CTG [2, 'NCCC'] GTT [1, 'NCCC'] CTC [1, 'NCCC'] TTA [1, 'NCCC'] TTG [1020, 'NCCC'] It might be an okay idea to ignore, i.e. not to model, the startand stop-codons which are very rare... This of course means that there might be genes that you cannot predict.. Stop-codon in normal genes: TAG [1949, 'CCCN'] TGA [1531, 'CCCN'] TAA [6686, 'CCCN'] You also have to take care when training by counting because the data (X,Z) then contains annotations which are not possible in your model (you can e.g. skip start- and stopcodons which you do not model)... Reversed stop-codon in reversed genes: TTA (reverse-complement: TAA) [6596, 'NRRR'] CTA (reverse-complement: TAG) [2014, 'NRRR'] TCA (reverse-complement: TGA) [1148, 'NRRR'] Reversed start-codon in reversed genes: TAT (reverse-complement: ATA) [2, 'RRRN'] ATG (reverse-complement: CAT) [1, 'RRRN'] GAT (reverse-complement: ATC) [1, 'RRRN'] CAT (reverse-complement: ATG) [8077, 'RRRN'] AAT (reverse-complement: ATT) [4, 'RRRN'] TAC (reverse-complement: GTA) [1, 'RRRN'] CAC (reverse-complement: GTG) [715, 'RRRN'] CAA (reverse-complement: TTG) [953, 'RRRN'] CAG (reverse-complement: CTG) [4, 'RRRN']

17 Analysis of the 11 genomes e' 1 e' 2 e'... 3 s' 1 s' 2 s' 3 5' 5' s 1 s 2 s e 1 e 2 e 3 The genes going from left-to-right (the C's): The first 3 symbols s 1 s 2 s 3 (the start-codon) is: ATG, ATC, ATA, GTG, ATT, CTG, GTT, CTC, TAA, TTG The last 3 symbols e 1 e 2 e 3 (the stop-codon) is: TAG, TGA, TAA The total length is a multiplum of 3.

18 Analysis of the 11 genomes 5' GTA e' 1 e' 2 e'... 3 s' 1 s' 2 s' 3 CAT TAC ATG s 1 s 2 s e 1 e 2 e 3 5' The genes going from left-to-right (the C's): The first 3 symbols s 1 s 2 s 3 (the start-codon) is: ATG, ATC, ATA, GTG, ATT, CTG, GTT, CTC, TAA, TTG The last 3 symbols e 1 e 2 e 3 (the stop-codon) is: TAG, TGA, TAA The total length is a multiplum of 3. The genes going from right-to-left (the R's): The first 3 symbols s' 1 s' 2 s' 3 (the reversed start-codon) is: TAT, ATG, GAT, CAT, AAT, TAC, CAC, CAA, CAG ATA, CAT, ATC, ATG, ATT, GTA, GTG, TTG, CTG The last 3 symbols e' 1 e' 2 e' 3 (the reversed stop-codon) is: TTA, CTA, TCA TAA, TAG, TGA The total length is a multiplum of 3.

19 Predicting the full gene structure e' 1 e' 2 e'... 3 s' 1 s' 2 s' 3 s 1 s 2 s e 1 e 2 e 3 Approach for predicting the full gene structure (Cs and Rs) Predict C's and R's using two separate models, or the same model used for the genome (to predict C's) and its reverse complement (to predict R's). Afterwards we resolve conflicts, i.e. nucleotides predicted as both Cs and Rs. Or make a model which captures everything in one go.

20 Even more biology C: coding left-to-right There can be genes in both directions N: Non-coding C: 1 π N = 1 π C = 0 R: coding right-to-left C: 1 C: 1

21 Problem: From annotation to Z Problem: The string Z=NNNCCC... is not a prober sequence of states in the illustrated HMM, but is can easily be converted into one Z: NNNCCCCCCCCCNNNNNNNNCCCCCCCCCCCCCCCNNNNNNNNNNN X: acgatgcgctaatatgtccgatgacgtgagcataagcgacatgcag π N = 1 π C = N: non-coding C: coding

Problem: From annotation to Z Problem: The string Z=NNNCCC... is not a prober sequence of states in the illustrated HMM, but is can easily be converted into one.

22 Problem: From annotation to Z Problem: The string Z=NNNCCC... is not a prober sequence of states in the illustrated HMM, but is can easily be converted into one Z: NNNCCCCCCCCCNNNNNNNNCCCCCCCCCCCCCCCNNNNNNNNNNN π X: acgatgcgctaatatgtccgatgacgtgagcataagcgacatgcag N = 1 π C = N: non-coding C: coding

23 Evaluating performance Evaluation of Gene Structure Prediction Programs (Burset and Guigo, 1996)

24 Evaluating performance def print_stats(tp, fp, tn, fn): sn = float(tp) / (tp + fn) sp = float(tp) / (tp + fp) cc = float((tp*tn - fp*fn)) / math.sqrt(float((tp+fn)*(tn+fp)*(tp+fp)*(tn+fn))) acp = 0.25 * (float(tp)/(tp+fn) + float(tp)/(tp+fp) + float(tn)/(tn+fp) + float(tn)/(tn+fn)) ac = (acp - 0.5) * 2 print("sn = %.4f, Sp = %.4f, CC = %.4f, AC = %.4f" % (sn, sp, cc, ac)) Evaluation of Gene Structure Prediction Programs (Burset and Guigo, 1996)

25 Counting C's def count_c(true, pred): total = tp = fp = tn = fn = 0 for i in range(len(true)): if pred[i] == 'C' or pred[i] == 'c': total = total + 1 if true[i] == 'C' or true[i] == 'c': tp = tp + 1 else: fp = fp + 1 if pred[i] == 'N' or pred[i] == 'n': if true[i] == 'N' or true[i] == 'n' or true[i] == 'R' or true[i] == 'r': tn = tn + 1 else: fn = fn + 1 return(total, tp, fp, tn, fn)

26 Counting R's def count_r(true, pred): total = tp = fp = tn = fn = 0 for i in range(len(true)): if pred[i] == 'R' or pred[i] == 'r': total = total + 1 if true[i] == 'R' or true[i] == 'r': tp = tp + 1 else: fp = fp + 1 if pred[i] == 'N' or pred[i] == 'n': if true[i] == 'N' or true[i] == 'n' or true[i] == 'C' or true[i] == 'c': tn = tn + 1 else: fn = fn + 1 return(total, tp, fp, tn, fn)

27 Counting C's and R's def count_cr(true, pred): total = tp = fp = tn = fn = 0 for i in range(len(true)): if pred[i] == 'C' or pred[i] == 'c' or pred[i] == 'R' or pred[i] == 'r': total = total + 1 if (pred[i] == 'C' or pred[i] == 'c') and (true[i] == 'C' or true[i] == 'c'): tp = tp + 1 elif (pred[i] == 'R' or pred[i] == 'r') and (true[i] == 'R' or true[i] == 'r'): tp = tp + 1 else: fp = fp + 1 if pred[i] == 'N' or pred[i] == 'n': if true[i] == 'N' or true[i] == 'n': tn = tn + 1 else: fn = fn + 1 return(total, tp, fp, tn, fn)

Genome Reconstruction: A Puzzle with a Billion Pieces Phillip E. C. Compeau and Pavel A. Pevzner

Genome Reconstruction: A Puzzle with a Billion Pieces Phillip E. C. Compeau and Pavel A. Pevzner Outline I. Problem II. Two Historical Detours III.Example IV.The Mathematics of DNA Sequencing V.Complications