MLiB - Mandatory Project 2. Gene finding using HMMs

Size: px
Start display at page:

Download "MLiB - Mandatory Project 2. Gene finding using HMMs"

Transcription

1 MLiB - Mandatory Project 2 Gene finding using HMMs

2 Viterbi decoding >NC_ Streptococcus pyogenes M1 GAS TTGTTGATATTCTGTTTTTTCTTTTTTAGTTTTCCACATGAAAAATAGTTGAAAACAATA GCGGTGTCCCCTTAAAATGGCTTTTCCACAGGTTGTGGAGAACCCAAATTAACAGTGTTA ATTTATTTTCCACAGGTTGTGGAAAAACTAACTATTATCCATCGTTCTGTGGAAAACTAG AATAGTTTATGGTAGAATAGTTCTAGAATTATCCACAAGAAGGAACCTAGTATGACTGAA AATGAACAAATTTTTTGGAACAGGGTCTTGGAATTAGCTCAGAGTCAATTAAAACAGGCA ACTTATGAATTTTTTGTTCATGATGCCCGTCTATTAAAGGTCGATAAGCATATTGCAACT ATTTACTTAGATCAAATGAAAGAGCTCTTTTGGGAAAAAAATCTTAAAGATGTTATTCTT ACTGCTGGTTTTGAAGTTTATAACGCTCAAATTTCTGTTGACTATGTTTTCGAAGAAGAC CTAATGATTGAGCAAAATCAGACCAAAATCAACCAAAAACCTAAGCAGCAAGCCTTAAAT TCTTTGCCTACTGTTACTTCAGATTTAAACTCGAAATATAGTTTTGAAAACTTTATTCAA GGAGATGAAAATCGTTGGGCTGTTGCTGCTTCAATAGCAGTAGCTAATACTCCTGGAACT ACCTATAATCCTTTGTTTATTTGGGGTGGCCCTGGGCTTGGAAAAACCCATTTATTAAAT GCTATTGGTAATTCTGTACTATTAGAAAATCCAAATGCTCGAATTAAATATATCACAGCT GAAAACTTTATTAATGAGTTTGTTATCCATATTCGCCTTGATACCATGGATGAATTGAAA GAAAAATTTCGTAATTTAGATTTACTCCTTATTGATGATATCCAATCTTTAGCTAAAAAA ACGCTCTCTGGAACACAAGAAGAGTTCTTTAATACTTTTAATGCACTTCATAATAATAAC AAACAAATTGTCCTAACAAGCGACCGTACACCAGATCATCTCAATGATTTAGAAGATCGA TTAGTTACTCGTTTTAAATGGGGATTAACAGTCAATATCACACCTCCTGATTTTGAAACA CGAGTGGCTATTTTGACAAATAAAATTCAAGAATATAACTTTATTTTTCCTCAAGATACC ATTGAGTATTTGGCTGGTCAATTTGATTCTAATGTCAGAGATTTAGAAGGTGCCTTAAAA GATATTAGTCTGGTTGCTAATTTCAAACAAATTGACACGATTACTGTTGACATTGCTGCC GAAGCTATTCGCGCCAGAAAGCAAGATGGACCTAAAATGACAGTTATTCCCATCGAAGAA ATTCAAGCGCAAGTTGGAAAATTTTACGGTGTTACCGTCAAAGAAATTAAAGCTACTAAA CGAACACAAAATATTGTTTTAGCAAGACAAGTAGCTATGTTTTTAGCACGTGAAATGACA GATAACAGTCTTCCTAAAATTGGAAAAGAATTTGGTGGCAGAGACCATTCAACAGTACTC CATGCCTATAATAAAATCAAAAACATGATCAGCCAGGACGAAAGCCTTAGGATCGAAATT GAAACCATAAAAAACAAAATTAAATAACATGTGGAAAAGAATATCTTTTATGAAATAGTT ATCCACAAGTTGTGAACATCCATTTAGTCTTGGATTCTCTCGTTTATTTAGAGTTATCCA CTATATACACAAGACCTACTACTACTACTTATTATTATACTTATTAAATAAAGGAGTTCT >NC_ gene annotation Streptococcus pyogenes M1 GAS NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCCCCCCCCC CCCCCCCCCCCCCCCCCCCCCCCCCCCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN Design a HMM that models the syntax of genes

3 Gene structure in procaryotes Biological facts The gene is a substring of the DNA sequence of A,C,G,T's The gene starts with a start-code atg The gene ends with a stop-codon taa, tag or tga The number of nucleotides in a gene is a multiplum of Z: NNNCCCCCCCCCNNNNNNNNCCCCCCCCCCCCCCCNNNNNNNNNNN X: acgatgcgctaatatgtccgatgacgtgagcataagcgacatgcag N: non-coding C: coding π N = 1 π C = 0

4 Gene structure in procaryotes Biological facts The gene is a substring of the DNA sequence of A,C,G,T's The gene starts with a start-code atg The gene ends with a stop-codon taa, tag or tga The number of nucleotides in a gene is a multiplum of Z: NNNCCCCCCCCCNNNNNNNNCCCCCCCCCCCCCCCNNNNNNNNNNN X: acgatgcgctaatatgtccgatgacgtgagcataagcgacatgcag N: non-coding C: coding π N = 1 π C = 0

5 Gene structure in procaryotes Biological facts The gene is a substring of the DNA sequence of A,C,G,T's The gene starts with a start-code atg The gene ends with a stop-codon taa, tag or tga The number of nucleotides in a gene is a multiplum of Z: NNNCCCCCCCCCNNNNNNNNCCCCCCCCCCCCCCCNNNNNNNNNNN X: acgatgcgctaatatgtccgatgacgtgagcataagcgacatgcag π N = 1 π C = 0 N: non-coding C: coding

6 Gene structure in procaryotes Biological facts The gene is a substring of the DNA sequence of A,C,G,T's The gene starts with a start-code atg The gene ends with a stop-codon taa, tag or tga The number of nucleotides in a gene is a multiplum of Z: NNNCCCCCCCCCNNNNNNNNCCCCCCCCCCCCCCCNNNNNNNNNNN X: acgatgcgctaatatgtccgatgacgtgagcataagcgacatgcag π N = 1 π C = 0 N: non-coding C: coding

7 The gene is a substring of the DNA sequence of A,C,G,T's The gene starts with a start-code atg Gene structure in procaryotes The gene ends with a stop-codon taa, tag or tga The number of nucleotides in a gene is a multiplum of 3 N: non-coding C: coding π N = 1 π C = 0

8 The gene is a substring of the DNA sequence of A,C,G,T's The gene starts with a start-code atg Gene structure in procaryotes The gene ends with a stop-codon taa, tag or tga The number of nucleotides in a gene is a multiplum of 3 N: non-coding C: coding π N = 1 π C = 0

9 The gene is a substring of the DNA sequence of A,C,G,T's The gene starts with a start-code atg Gene structure in procaryotes The gene ends with a stop-codon taa, tag or tga The number of nucleotides in a gene is a multiplum of 3 N: non-coding π N = 1 π C = 0 C: coding

10 Gene structure in procaryotes N: non-coding π N = 1 π C = 0 C: coding

11 Gene structure in procaryotes N: non-coding Gene G: finding 0 Select initial model structure (e.g. as done here) Select model parameters by training. Either by counting from examples of (X,Z)'s, i.e. genes with known structure, or by EM from examples of X, i.e. sequence which are known to contain a gene (as we will see on Monday) Given a new T: sequence 0 X, predict its gene structure using the Viterbi algorithm for finding the most likely sequence of underlying C: latent 0 states, i.e. C: its 0gene structure π N = 1 π C = 0 C: coding

12 Example Gene finding N: non-coding Gene G: finding 0 Select initial model structure (e.g. as done here) Select model parameters by training. Either by counting from examples of (X,Z)'s, i.e. genes with known structure, or by EM from examples of X, i.e. sequence which are known to contain a gene (as we will see on Monday) Given a new T: sequence 0 Even X, more predict biology its gene structure using π N = 1 π C = 0 There the can Viterbi be genes algorithm in both for directions finding the (and most over likely lapping) sequence of underlying C: latent 0 states, i.e. C: its 0gene structure There are more possible start-codons atg, gtg, and ttg Internal codons A: cannot 1 be start- A: or stop-codons 0 And a lot more G:... 0 C: coding

13 DNA e' 1 e' 2 e'... 3 s' 1 s' 2 s' 3 5' 5' s 1 s 2 s e 1 e 2 e 3

14 AGT GAT AAT GTA e' 1 e' 2 e'... 3 s' 1 s' 2 s' 3 DNA 5' 5' TTA CTA TCA CAT s 1 s 2 s e 1 e 2 e 3 ATG TAA TAG TGA

15 Analysis of the 11 genomes Length of genome1: ( ) Length of genome2: ( ) Length of genome3: ( ) Length of genome4: ( ) Length of genome5: ( ) Length of genome6: ( ) Length of genome7: ( ) Length of genome8: ( ) Length of genome9: ( ) Length of genome10: ( ) Length of genome11: ( ) Start-codon in normal genes: ATG [8423, 'NCCC'] ATC [3, 'NCCC'] ATA [1, 'RCCC'] GTG [713, 'NCCC'] ATT [3, 'NCCC'] CTG [2, 'NCCC'] GTT [1, 'NCCC'] CTC [1, 'NCCC'] TTA [1, 'NCCC'] TTG [1020, 'NCCC'] Stop-codon in normal genes: TAG [1949, 'CCCN'] TGA [1531, 'CCCN'] TAA [6686, 'CCCN'] Reversed stop-codon in reversed genes: TTA (reverse-complement: TAA) [6596, 'NRRR'] CTA (reverse-complement: TAG) [2014, 'NRRR'] TCA (reverse-complement: TGA) [1148, 'NRRR'] Reversed start-codon in reversed genes: TAT (reverse-complement: ATA) [2, 'RRRN'] ATG (reverse-complement: CAT) [1, 'RRRN'] GAT (reverse-complement: ATC) [1, 'RRRN'] CAT (reverse-complement: ATG) [8077, 'RRRN'] AAT (reverse-complement: ATT) [4, 'RRRN'] TAC (reverse-complement: GTA) [1, 'RRRN'] CAC (reverse-complement: GTG) [715, 'RRRN'] CAA (reverse-complement: TTG) [953, 'RRRN'] CAG (reverse-complement: CTG) [4, 'RRRN']

16 Analysis of the 11 genomes Length of genome1: ( ) Length of genome2: ( ) Length of genome3: ( ) Length of genome4: ( ) Length of genome5: ( ) Length of genome6: ( ) Length of genome7: ( ) Length of genome8: ( ) Length of genome9: ( ) Length of genome10: ( ) Length of genome11: ( ) Start-codon in normal genes: ATG [8423, 'NCCC'] ATC [3, 'NCCC'] ATA [1, 'RCCC'] GTG [713, 'NCCC'] ATT [3, 'NCCC'] CTG [2, 'NCCC'] GTT [1, 'NCCC'] CTC [1, 'NCCC'] TTA [1, 'NCCC'] TTG [1020, 'NCCC'] It might be an okay idea to ignore, i.e. not to model, the startand stop-codons which are very rare... This of course means that there might be genes that you cannot predict.. Stop-codon in normal genes: TAG [1949, 'CCCN'] TGA [1531, 'CCCN'] TAA [6686, 'CCCN'] You also have to take care when training by counting because the data (X,Z) then contains annotations which are not possible in your model (you can e.g. skip start- and stopcodons which you do not model)... Reversed stop-codon in reversed genes: TTA (reverse-complement: TAA) [6596, 'NRRR'] CTA (reverse-complement: TAG) [2014, 'NRRR'] TCA (reverse-complement: TGA) [1148, 'NRRR'] Reversed start-codon in reversed genes: TAT (reverse-complement: ATA) [2, 'RRRN'] ATG (reverse-complement: CAT) [1, 'RRRN'] GAT (reverse-complement: ATC) [1, 'RRRN'] CAT (reverse-complement: ATG) [8077, 'RRRN'] AAT (reverse-complement: ATT) [4, 'RRRN'] TAC (reverse-complement: GTA) [1, 'RRRN'] CAC (reverse-complement: GTG) [715, 'RRRN'] CAA (reverse-complement: TTG) [953, 'RRRN'] CAG (reverse-complement: CTG) [4, 'RRRN']

17 Analysis of the 11 genomes e' 1 e' 2 e'... 3 s' 1 s' 2 s' 3 5' 5' s 1 s 2 s e 1 e 2 e 3 The genes going from left-to-right (the C's): The first 3 symbols s 1 s 2 s 3 (the start-codon) is: ATG, ATC, ATA, GTG, ATT, CTG, GTT, CTC, TAA, TTG The last 3 symbols e 1 e 2 e 3 (the stop-codon) is: TAG, TGA, TAA The total length is a multiplum of 3.

18 Analysis of the 11 genomes 5' GTA e' 1 e' 2 e'... 3 s' 1 s' 2 s' 3 CAT TAC ATG s 1 s 2 s e 1 e 2 e 3 5' The genes going from left-to-right (the C's): The first 3 symbols s 1 s 2 s 3 (the start-codon) is: ATG, ATC, ATA, GTG, ATT, CTG, GTT, CTC, TAA, TTG The last 3 symbols e 1 e 2 e 3 (the stop-codon) is: TAG, TGA, TAA The total length is a multiplum of 3. The genes going from right-to-left (the R's): The first 3 symbols s' 1 s' 2 s' 3 (the reversed start-codon) is: TAT, ATG, GAT, CAT, AAT, TAC, CAC, CAA, CAG ATA, CAT, ATC, ATG, ATT, GTA, GTG, TTG, CTG The last 3 symbols e' 1 e' 2 e' 3 (the reversed stop-codon) is: TTA, CTA, TCA TAA, TAG, TGA The total length is a multiplum of 3.

19 Predicting the full gene structure e' 1 e' 2 e'... 3 s' 1 s' 2 s' 3 s 1 s 2 s e 1 e 2 e 3 Approach for predicting the full gene structure (Cs and Rs) Predict C's and R's using two separate models, or the same model used for the genome (to predict C's) and its reverse complement (to predict R's). Afterwards we resolve conflicts, i.e. nucleotides predicted as both Cs and Rs. Or make a model which captures everything in one go.

20 Even more biology C: coding left-to-right There can be genes in both directions N: Non-coding C: 1 π N = 1 π C = 0 R: coding right-to-left C: 1 C: 1

21 Problem: From annotation to Z Problem: The string Z=NNNCCC... is not a prober sequence of states in the illustrated HMM, but is can easily be converted into one Z: NNNCCCCCCCCCNNNNNNNNCCCCCCCCCCCCCCCNNNNNNNNNNN X: acgatgcgctaatatgtccgatgacgtgagcataagcgacatgcag π N = 1 π C = N: non-coding C: coding

22 Problem: From annotation to Z Problem: The string Z=NNNCCC... is not a prober sequence of states in the illustrated HMM, but is can easily be converted into one Z: NNNCCCCCCCCCNNNNNNNNCCCCCCCCCCCCCCCNNNNNNNNNNN π X: acgatgcgctaatatgtccgatgacgtgagcataagcgacatgcag N = 1 π C = N: non-coding C: coding

23 Evaluating performance Evaluation of Gene Structure Prediction Programs (Burset and Guigo, 1996)

24 Evaluating performance def print_stats(tp, fp, tn, fn): sn = float(tp) / (tp + fn) sp = float(tp) / (tp + fp) cc = float((tp*tn - fp*fn)) / math.sqrt(float((tp+fn)*(tn+fp)*(tp+fp)*(tn+fn))) acp = 0.25 * (float(tp)/(tp+fn) + float(tp)/(tp+fp) + float(tn)/(tn+fp) + float(tn)/(tn+fn)) ac = (acp - 0.5) * 2 print("sn = %.4f, Sp = %.4f, CC = %.4f, AC = %.4f" % (sn, sp, cc, ac)) Evaluation of Gene Structure Prediction Programs (Burset and Guigo, 1996)

25 Counting C's def count_c(true, pred): total = tp = fp = tn = fn = 0 for i in range(len(true)): if pred[i] == 'C' or pred[i] == 'c': total = total + 1 if true[i] == 'C' or true[i] == 'c': tp = tp + 1 else: fp = fp + 1 if pred[i] == 'N' or pred[i] == 'n': if true[i] == 'N' or true[i] == 'n' or true[i] == 'R' or true[i] == 'r': tn = tn + 1 else: fn = fn + 1 return(total, tp, fp, tn, fn)

26 Counting R's def count_r(true, pred): total = tp = fp = tn = fn = 0 for i in range(len(true)): if pred[i] == 'R' or pred[i] == 'r': total = total + 1 if true[i] == 'R' or true[i] == 'r': tp = tp + 1 else: fp = fp + 1 if pred[i] == 'N' or pred[i] == 'n': if true[i] == 'N' or true[i] == 'n' or true[i] == 'C' or true[i] == 'c': tn = tn + 1 else: fn = fn + 1 return(total, tp, fp, tn, fn)

27 Counting C's and R's def count_cr(true, pred): total = tp = fp = tn = fn = 0 for i in range(len(true)): if pred[i] == 'C' or pred[i] == 'c' or pred[i] == 'R' or pred[i] == 'r': total = total + 1 if (pred[i] == 'C' or pred[i] == 'c') and (true[i] == 'C' or true[i] == 'c'): tp = tp + 1 elif (pred[i] == 'R' or pred[i] == 'r') and (true[i] == 'R' or true[i] == 'r'): tp = tp + 1 else: fp = fp + 1 if pred[i] == 'N' or pred[i] == 'n': if true[i] == 'N' or true[i] == 'n': tn = tn + 1 else: fn = fn + 1 return(total, tp, fp, tn, fn)

Genome Reconstruction: A Puzzle with a Billion Pieces Phillip E. C. Compeau and Pavel A. Pevzner

Genome Reconstruction: A Puzzle with a Billion Pieces Phillip E. C. Compeau and Pavel A. Pevzner Genome Reconstruction: A Puzzle with a Billion Pieces Phillip E. C. Compeau and Pavel A. Pevzner Outline I. Problem II. Two Historical Detours III.Example IV.The Mathematics of DNA Sequencing V.Complications

More information

by the Genevestigator program (www.genevestigator.com). Darker blue color indicates higher gene expression.

by the Genevestigator program (www.genevestigator.com). Darker blue color indicates higher gene expression. Figure S1. Tissue-specific expression profile of the genes that were screened through the RHEPatmatch and root-specific microarray filters. The gene expression profile (heat map) was drawn by the Genevestigator

More information

warm-up exercise Representing Data Digitally goals for today proteins example from nature

warm-up exercise Representing Data Digitally goals for today proteins example from nature Representing Data Digitally Anne Condon September 6, 007 warm-up exercise pick two examples of in your everyday life* in what media are the is represented? is the converted from one representation to another,

More information

Pyramidal and Chiral Groupings of Gold Nanocrystals Assembled Using DNA Scaffolds

Pyramidal and Chiral Groupings of Gold Nanocrystals Assembled Using DNA Scaffolds Pyramidal and Chiral Groupings of Gold Nanocrystals Assembled Using DNA Scaffolds February 27, 2009 Alexander Mastroianni, Shelley Claridge, A. Paul Alivisatos Department of Chemistry, University of California,

More information

Supplementary Table 1. Data collection and refinement statistics

Supplementary Table 1. Data collection and refinement statistics Supplementary Table 1. Data collection and refinement statistics APY-EphA4 APY-βAla8.am-EphA4 Crystal Space group P2 1 P2 1 Cell dimensions a, b, c (Å) 36.27, 127.7, 84.57 37.22, 127.2, 84.6 α, β, γ (

More information

HP22.1 Roth Random Primer Kit A für die RAPD-PCR

HP22.1 Roth Random Primer Kit A für die RAPD-PCR HP22.1 Roth Random Kit A für die RAPD-PCR Kit besteht aus 20 Einzelprimern, jeweils aufgeteilt auf 2 Reaktionsgefäße zu je 1,0 OD Achtung: Angaben beziehen sich jeweils auf ein Reaktionsgefäß! Sequenz

More information

Machine Learning Classifiers

Machine Learning Classifiers Machine Learning Classifiers Outline Different types of learning problems Different types of learning algorithms Supervised learning Decision trees Naïve Bayes Perceptrons, Multi-layer Neural Networks

More information

Appendix A. Example code output. Chapter 1. Chapter 3

Appendix A. Example code output. Chapter 1. Chapter 3 Appendix A Example code output This is a compilation of output from selected examples. Some of these examples requires exernal input from e.g. STDIN, for such examples the interaction with the program

More information

TCGR: A Novel DNA/RNA Visualization Technique

TCGR: A Novel DNA/RNA Visualization Technique TCGR: A Novel DNA/RNA Visualization Technique Donya Quick and Margaret H. Dunham Department of Computer Science and Engineering Southern Methodist University Dallas, Texas 75275 dquick@mail.smu.edu, mhd@engr.smu.edu

More information

6 Anhang. 6.1 Transgene Su(var)3-9-Linien. P{GS.ry + hs(su(var)3-9)egfp} 1 I,II,III,IV 3 2I 3 3 I,II,III 3 4 I,II,III 2 5 I,II,III,IV 3

6 Anhang. 6.1 Transgene Su(var)3-9-Linien. P{GS.ry + hs(su(var)3-9)egfp} 1 I,II,III,IV 3 2I 3 3 I,II,III 3 4 I,II,III 2 5 I,II,III,IV 3 6.1 Transgene Su(var)3-9-n P{GS.ry + hs(su(var)3-9)egfp} 1 I,II,III,IV 3 2I 3 3 I,II,III 3 4 I,II,II 5 I,II,III,IV 3 6 7 I,II,II 8 I,II,II 10 I,II 3 P{GS.ry + UAS(Su(var)3-9)EGFP} A AII 3 B P{GS.ry + (10.5kbSu(var)3-9EGFP)}

More information

Genome Reconstruction: A Puzzle with a Billion Pieces. Phillip Compeau Carnegie Mellon University Computational Biology Department

Genome Reconstruction: A Puzzle with a Billion Pieces. Phillip Compeau Carnegie Mellon University Computational Biology Department http://cbd.cmu.edu Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau Carnegie Mellon University Computational Biology Department Eternity II: The Highest-Stakes Puzzle in History Courtesy:

More information

SUPPLEMENTARY INFORMATION. Systematic evaluation of CRISPR-Cas systems reveals design principles for genome editing in human cells

SUPPLEMENTARY INFORMATION. Systematic evaluation of CRISPR-Cas systems reveals design principles for genome editing in human cells SUPPLEMENTARY INFORMATION Systematic evaluation of CRISPR-Cas systems reveals design principles for genome editing in human cells Yuanming Wang 1,2,7, Kaiwen Ivy Liu 2,7, Norfala-Aliah Binte Sutrisnoh

More information

Digging into acceptor splice site prediction: an iterative feature selection approach

Digging into acceptor splice site prediction: an iterative feature selection approach Digging into acceptor splice site prediction: an iterative feature selection approach Yvan Saeys, Sven Degroeve, and Yves Van de Peer Department of Plant Systems Biology, Ghent University, Flanders Interuniversity

More information

2 41L Tag- AA GAA AAA ATA AAA GCA TTA RYA GAA ATT TGT RMW GAR C K65 Tag- A AAT CCA TAC AAT ACT CCA GTA TTT GCY ATA AAG AA

2 41L Tag- AA GAA AAA ATA AAA GCA TTA RYA GAA ATT TGT RMW GAR C K65 Tag- A AAT CCA TAC AAT ACT CCA GTA TTT GCY ATA AAG AA 176 SUPPLEMENTAL TABLES 177 Table S1. ASPE Primers for HIV-1 group M subtype B Primer no Type a Sequence (5'-3') Tag ID b Position c 1 M41 Tag- AA GAA AAA ATA AAA GCA TTA RYA GAA ATT TGT RMW GAR A d 45

More information

A relation between trinucleotide comma-free codes and trinucleotide circular codes

A relation between trinucleotide comma-free codes and trinucleotide circular codes Theoretical Computer Science 401 (2008) 17 26 www.elsevier.com/locate/tcs A relation between trinucleotide comma-free codes and trinucleotide circular codes Christian J. Michel a,, Giuseppe Pirillo b,c,

More information

Crick s Hypothesis Revisited: The Existence of a Universal Coding Frame

Crick s Hypothesis Revisited: The Existence of a Universal Coding Frame Crick s Hypothesis Revisited: The Existence of a Universal Coding Frame Jean-Louis Lassez*, Ryan A. Rossi Computer Science Department, Coastal Carolina University jlassez@coastal.edu, raross@coastal.edu

More information

de Bruijn graphs for sequencing data

de Bruijn graphs for sequencing data de Bruijn graphs for sequencing data Rayan Chikhi CNRS Bonsai team, CRIStAL/INRIA, Univ. Lille 1 SMPGD 2016 1 MOTIVATION - de Bruijn graphs are instrumental for reference-free sequencing data analysis:

More information

de novo assembly Rayan Chikhi Pennsylvania State University Workshop On Genomics - Cesky Krumlov - January /73

de novo assembly Rayan Chikhi Pennsylvania State University Workshop On Genomics - Cesky Krumlov - January /73 1/73 de novo assembly Rayan Chikhi Pennsylvania State University Workshop On Genomics - Cesky Krumlov - January 2014 2/73 YOUR INSTRUCTOR IS.. - Postdoc at Penn State, USA - PhD at INRIA / ENS Cachan,

More information

Programming Applications. What is Computer Programming?

Programming Applications. What is Computer Programming? Programming Applications What is Computer Programming? An algorithm is a series of steps for solving a problem A programming language is a way to express our algorithm to a computer Programming is the

More information

Problem statement. CS267 Assignment 3: Parallelize Graph Algorithms for de Novo Genome Assembly. Spring Example.

Problem statement. CS267 Assignment 3: Parallelize Graph Algorithms for de Novo Genome Assembly. Spring Example. CS267 Assignment 3: Problem statement 2 Parallelize Graph Algorithms for de Novo Genome Assembly k-mers are sequences of length k (alphabet is A/C/G/T). An extension is a simple symbol (A/C/G/T/F). The

More information

Detecting Superbubbles in Assembly Graphs. Taku Onodera (U. Tokyo)! Kunihiko Sadakane (NII)! Tetsuo Shibuya (U. Tokyo)!

Detecting Superbubbles in Assembly Graphs. Taku Onodera (U. Tokyo)! Kunihiko Sadakane (NII)! Tetsuo Shibuya (U. Tokyo)! Detecting Superbubbles in Assembly Graphs Taku Onodera (U. Tokyo)! Kunihiko Sadakane (NII)! Tetsuo Shibuya (U. Tokyo)! de Bruijn Graph-based Assembly Reads (substrings of original DNA sequence) de Bruijn

More information

Supplementary Materials:

Supplementary Materials: Supplementary Materials: Amino acid codo n Numb er Table S1. Codon usage in all the protein coding genes. RSC U Proportion (%) Amino acid codo n Numb er RSC U Proportion (%) Phe UUU 861 1.31 5.71 Ser UCU

More information

Gene Finding & Review

Gene Finding & Review Gene Finding & Review Michael Schatz Bioinformatics Lecture 4 Quantitative Biology 2010 Illumina DNA Sequencing http://www.youtube.com/watch?v=77r5p8ibwjk Pacific Biosciences http://www.pacificbiosciences.com/video_lg.html

More information

Graph Algorithms in Bioinformatics

Graph Algorithms in Bioinformatics Graph Algorithms in Bioinformatics Computational Biology IST Ana Teresa Freitas 2015/2016 Sequencing Clone-by-clone shotgun sequencing Human Genome Project Whole-genome shotgun sequencing Celera Genomics

More information

Glimmer Release Notes Version 3.01 (Beta) Arthur L. Delcher

Glimmer Release Notes Version 3.01 (Beta) Arthur L. Delcher Glimmer Release Notes Version 3.01 (Beta) Arthur L. Delcher 10 October 2005 1 Introduction This document describes Version 3 of the Glimmer gene-finding software. This version incorporates a nearly complete

More information

DNA Sequencing. Overview

DNA Sequencing. Overview BINF 3350, Genomics and Bioinformatics DNA Sequencing Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Backgrounds Eulerian Cycles Problem Hamiltonian Cycles

More information

Sequencing. Computational Biology IST Ana Teresa Freitas 2011/2012. (BACs) Whole-genome shotgun sequencing Celera Genomics

Sequencing. Computational Biology IST Ana Teresa Freitas 2011/2012. (BACs) Whole-genome shotgun sequencing Celera Genomics Computational Biology IST Ana Teresa Freitas 2011/2012 Sequencing Clone-by-clone shotgun sequencing Human Genome Project Whole-genome shotgun sequencing Celera Genomics (BACs) 1 Must take the fragments

More information

Degenerate Coding and Sequence Compacting

Degenerate Coding and Sequence Compacting ESI The Erwin Schrödinger International Boltzmanngasse 9 Institute for Mathematical Physics A-1090 Wien, Austria Degenerate Coding and Sequence Compacting Maya Gorel Kirzhner V.M. Vienna, Preprint ESI

More information

LABORATORY STANDARD OPERATING PROCEDURE FOR PULSENET CODE: PNL28 MLVA OF SHIGA TOXIN-PRODUCING ESCHERICHIA COLI

LABORATORY STANDARD OPERATING PROCEDURE FOR PULSENET CODE: PNL28 MLVA OF SHIGA TOXIN-PRODUCING ESCHERICHIA COLI 1. PURPOSE: to describe the standardized laboratory protocol for molecular subtyping of Shiga toxin-producing Escherichia coli O157 (STEC O157) and Salmonella enterica serotypes Typhimurium and Enteritidis.

More information

Strings. Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas

Strings. Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas Strings Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas Run a program by typing at a terminal prompt (which may be > or $ or something else depending on your computer;

More information

Computational Methods for de novo Assembly of Next-Generation Genome Sequencing Data

Computational Methods for de novo Assembly of Next-Generation Genome Sequencing Data 1/39 Computational Methods for de novo Assembly of Next-Generation Genome Sequencing Data Rayan Chikhi ENS Cachan Brittany / IRISA (Genscale team) Advisor : Dominique Lavenier 2/39 INTRODUCTION, YEAR 2000

More information

3. Open Vector NTI 9 (note 2) from desktop. A three pane window appears.

3. Open Vector NTI 9 (note 2) from desktop. A three pane window appears. SOP: SP043.. Recombinant Plasmid Map Design Vector NTI Materials and Reagents: 1. Dell Dimension XPS T450 Room C210 2. Vector NTI 9 application, on desktop 3. Tuberculist database open in Internet Explorer

More information

Supplementary Data. Image Processing Workflow Diagram A - Preprocessing. B - Hough Transform. C - Angle Histogram (Rose Plot)

Supplementary Data. Image Processing Workflow Diagram A - Preprocessing. B - Hough Transform. C - Angle Histogram (Rose Plot) Supplementary Data Image Processing Workflow Diagram A - Preprocessing B - Hough Transform C - Angle Histogram (Rose Plot) D - Determination of holes Description of Image Processing Workflow The key steps

More information

Supporting Information

Supporting Information Copyright WILEY VCH Verlag GmbH & Co. KGaA, 69469 Weinheim, Germany, 2015. Supporting Information for Small, DOI: 10.1002/smll.201501370 A Compact DNA Cube with Side Length 10 nm Max B. Scheible, Luvena

More information

CSCI2950-C Lecture 4 DNA Sequencing and Fragment Assembly

CSCI2950-C Lecture 4 DNA Sequencing and Fragment Assembly CSCI2950-C Lecture 4 DNA Sequencing and Fragment Assembly Ben Raphael Sept. 22, 2009 http://cs.brown.edu/courses/csci2950-c/ l-mer composition Def: Given string s, the Spectrum ( s, l ) is unordered multiset

More information

ASSIGNMENT 4. COMP-202, Summer 2017, All Sections. Due: Friday, June 16 th, 11:59pm

ASSIGNMENT 4. COMP-202, Summer 2017, All Sections. Due: Friday, June 16 th, 11:59pm ASSIGNMENT 4 COMP-202, Summer 2017, All Sections Due: Friday, June 16 th, 11:59pm Please read the entire PDF before starting. You must do this assignment individually. Question 1: 100 points 100 points

More information

10/15/2009 Comp 590/Comp Fall

10/15/2009 Comp 590/Comp Fall Lecture 13: Graph Algorithms Study Chapter 8.1 8.8 10/15/2009 Comp 590/Comp 790-90 Fall 2009 1 The Bridge Obsession Problem Find a tour crossing every bridge just once Leonhard Euler, 1735 Bridges of Königsberg

More information

Efficient Selection of Unique and Popular Oligos for Large EST Databases. Stefano Lonardi. University of California, Riverside

Efficient Selection of Unique and Popular Oligos for Large EST Databases. Stefano Lonardi. University of California, Riverside Efficient Selection of Unique and Popular Oligos for Large EST Databases Stefano Lonardi University of California, Riverside joint work with Jie Zheng, Timothy Close, Tao Jiang University of California,

More information

DNA Sequencing The Shortest Superstring & Traveling Salesman Problems Sequencing by Hybridization

DNA Sequencing The Shortest Superstring & Traveling Salesman Problems Sequencing by Hybridization Eulerian & Hamiltonian Cycle Problems DNA Sequencing The Shortest Superstring & Traveling Salesman Problems Sequencing by Hybridization The Bridge Obsession Problem Find a tour crossing every bridge just

More information

de novo assembly & k-mers

de novo assembly & k-mers de novo assembly & k-mers Rayan Chikhi CNRS Workshop on Genomics - Cesky Krumlov January 2018 1 YOUR INSTRUCTOR IS.. - Junior CNRS bioinformatics researcher in Lille, France - Postdoc: Penn State, USA;

More information

Lecture 5: Markov models

Lecture 5: Markov models Master s course Bioinformatics Data Analysis and Tools Lecture 5: Markov models Centre for Integrative Bioinformatics Problem in biology Data and patterns are often not clear cut When we want to make a

More information

DNA Inspired Bi-directional Lempel-Ziv-like Compression Algorithms

DNA Inspired Bi-directional Lempel-Ziv-like Compression Algorithms DNA Inspired Bi-directional Lempel-Ziv-like Compression Algorithms Attiya Mahmood, Nazia Islam, Dawit Nigatu, and Werner Henkel Jacobs University Bremen Electrical Engineering and Computer Science Bremen,

More information

de novo assembly Simon Rasmussen 36626: Next Generation Sequencing analysis DTU Bioinformatics Next Generation Sequencing Analysis

de novo assembly Simon Rasmussen 36626: Next Generation Sequencing analysis DTU Bioinformatics Next Generation Sequencing Analysis de novo assembly Simon Rasmussen 36626: Next Generation Sequencing analysis DTU Bioinformatics 27626 - Next Generation Sequencing Analysis Generalized NGS analysis Data size Application Assembly: Compare

More information

Eukaryotic Gene Finding: The GENSCAN System

Eukaryotic Gene Finding: The GENSCAN System Eukaryotic Gene Finding: The GENSCAN System BMI/CS 776 www.biostat.wisc.edu/bmi776/ Spring 2016 Anthony Gitter gitter@biostat.wisc.edu These slides, excluding third-party material, are licensed under CC

More information

DNA Fragment Assembly

DNA Fragment Assembly Algorithms in Bioinformatics Sami Khuri Department of Computer Science San José State University San José, California, USA khuri@cs.sjsu.edu www.cs.sjsu.edu/faculty/khuri DNA Fragment Assembly Overlap

More information

A CAM(Content Addressable Memory)-based architecture for molecular sequence matching

A CAM(Content Addressable Memory)-based architecture for molecular sequence matching A CAM(Content Addressable Memory)-based architecture for molecular sequence matching P.K. Lala 1 and J.P. Parkerson 2 1 Department Electrical Engineering, Texas A&M University, Texarkana, Texas, USA 2

More information

Index-assisted approximate matching

Index-assisted approximate matching Index-assisted approximate matching Ben Langmead Department of Computer Science You are free to use these slides. If you do, please sign the guestbook (www.langmead-lab.org/teaching-materials), or email

More information

Computational Theory MAT542 (Computational Methods in Genomics) - Part 2 & 3 -

Computational Theory MAT542 (Computational Methods in Genomics) - Part 2 & 3 - Computational Theory MAT542 (Computational Methods in Genomics) - Part 2 & 3 - Benjamin King Mount Desert Island Biological Laboratory bking@mdibl.org Overview of 4 Lectures Introduction to Computation

More information

10/8/13 Comp 555 Fall

10/8/13 Comp 555 Fall 10/8/13 Comp 555 Fall 2013 1 Find a tour crossing every bridge just once Leonhard Euler, 1735 Bridges of Königsberg 10/8/13 Comp 555 Fall 2013 2 Find a cycle that visits every edge exactly once Linear

More information

Files. Reading from a file

Files. Reading from a file Files We often need to read data from files and write data to files within a Python program. The most common type of files you'll encounter in computational biology, are text files. Text files contain

More information

Sequence Design Problems in Discovery of Regulatory Elements

Sequence Design Problems in Discovery of Regulatory Elements Sequence Design Problems in Discovery of Regulatory Elements Yaron Orenstein, Bonnie Berger and Ron Shamir Regulatory Genomics workshop Simons Institute March 10th, 2016, Berkeley, CA Differentially methylated

More information

E cient Index Maintenance Under Dynamic Genome Modification

E cient Index Maintenance Under Dynamic Genome Modification E cient Index Maintenance Under Dynamic Genome Modification Nitish Gupta úı,1 Komal Sanjeev ı,1, Tim Wall 2, Carl Kingsford 3, Rob Patro 1 1 Department of Computer Science, Stony Brook University, Stony

More information

Structural analysis and haplotype diversity in swine LEP and MC4R genes

Structural analysis and haplotype diversity in swine LEP and MC4R genes J. Anim. Breed. Genet. ISSN - OIGINAL ATICLE Structural analysis and haplotype diversity in swine LEP and MC genes M. D Andrea, F. Pilla, E. Giuffra, D. Waddington & A.L. Archibald University of Molise,

More information

HMMConverter A tool-box for hidden Markov models with two novel, memory efficient parameter training algorithms

HMMConverter A tool-box for hidden Markov models with two novel, memory efficient parameter training algorithms HMMConverter A tool-box for hidden Markov models with two novel, memory efficient parameter training algorithms by TIN YIN LAM B.Sc., The Chinese University of Hong Kong, 2006 A THESIS SUBMITTED IN PARTIAL

More information

SUPPLEMENTARY INFORMATION

SUPPLEMENTARY INFORMATION doi:10.1038/nature25174 Sequences of DNA primers used in this study. Primer Primer sequence code 498 GTCCAGATCTTGATTAAGAAAAATGAAGAAA F pegfp 499 GTCCAGATCTTGGTTAAGAAAAATGAAGAAA F pegfp 500 GTCCCTGCAGCCTAGAGGGTTAGG

More information

Algorithms for Bioinformatics

Algorithms for Bioinformatics 582670 Algorithms for Bioinformatics Lecture 1: Primer to algorithms and molecular biology 4.9.2012 Course format Thu 12-14 Thu 10-12 Tue 12-14 Grading Exam 48 points Exercises 12 points 30% = 1 85% =

More information

Michał Kierzynka et al. Poznan University of Technology. 17 March 2015, San Jose

Michał Kierzynka et al. Poznan University of Technology. 17 March 2015, San Jose Michał Kierzynka et al. Poznan University of Technology 17 March 2015, San Jose The research has been supported by grant No. 2012/05/B/ST6/03026 from the National Science Centre, Poland. DNA de novo assembly

More information

Scalable Solutions for DNA Sequence Analysis

Scalable Solutions for DNA Sequence Analysis Scalable Solutions for DNA Sequence Analysis Michael Schatz Dec 4, 2009 JHU/UMD Joint Sequencing Meeting The Evolution of DNA Sequencing Year Genome Technology Cost 2001 Venter et al. Sanger (ABI) $300,000,000

More information

Strings. Looping. dna = 'ATGTAGC' print(dna) In [1]: ATGTAGC. You can loop over the characters of a string using a for loop, as we saw on Tuesday:

Strings. Looping. dna = 'ATGTAGC' print(dna) In [1]: ATGTAGC. You can loop over the characters of a string using a for loop, as we saw on Tuesday: Strings A string is simply a sequence of characters. From a biological perspective, this is quite useful, as a DNA sequence is simply a string composed of only 4 letters, and thus easily manipulated in

More information

Genome 373: Intro to Python I. Doug Fowler

Genome 373: Intro to Python I. Doug Fowler Genome 373: Intro to Python I Doug Fowler Outline Intro to Python I What is a program? Dealing with data Strings in Python Numbers in Python What is a program? What is a program? A series of instruc2ons,

More information

DELAMANID SUSCEPTIBILITY TESTING IN AN AUTOMATED LIQUID CULTURE SYSTEM

DELAMANID SUSCEPTIBILITY TESTING IN AN AUTOMATED LIQUID CULTURE SYSTEM DELAMANID SUSCEPTIBILITY TESTING IN AN AUTOMATED LIQUID CULTURE SYSTEM Daniela Maria Cirillo San Raffaele Scientific Institute, Milan COI/CA OSR as signed MTA with Janssen and Otzuka as SRL and is involved

More information

BCH339N Systems Biology/Bioinformatics Spring 2018 Marcotte A Python programming primer

BCH339N Systems Biology/Bioinformatics Spring 2018 Marcotte A Python programming primer BCH339N Systems Biology/Bioinformatics Spring 2018 Marcotte A Python programming primer Python: named after Monty Python s Flying Circus (designed to be fun to use) Python documentation: http://www.python.org/doc/

More information

WSSP-10 Chapter 7 BLASTN: DNA vs DNA searches

WSSP-10 Chapter 7 BLASTN: DNA vs DNA searches WSSP-10 Chapter 7 BLASTN: DNA vs DNA searches 4-3 DSAP: BLASTn Page p. 7-1 NCBI BLAST Home Page p. 7-1 NCBI BLASTN search page p. 7-2 Copy sequence from DSAP or wave form program p. 7-2 Choose a database

More information

Solutions Exercise Set 3 Author: Charmi Panchal

Solutions Exercise Set 3 Author: Charmi Panchal Solutions Exercise Set 3 Author: Charmi Panchal Problem 1: Suppose we have following fragments: f1 = ATCCTTAACCCC f2 = TTAACTCA f3 = TTAATACTCCC f4 = ATCTTTC f5 = CACTCCCACACA f6 = CACAATCCTTAACCC f7 =

More information

Algorithms for Bioinformatics

Algorithms for Bioinformatics Adapted from slides by Alexandru Tomescu, Leena Salmela and Veli Mäkinen, which are partly from http://bix.ucsd.edu/bioalgorithms/slides.php 58670 Algorithms for Bioinformatics Lecture 5: Graph Algorithms

More information

A tuple can be created as a comma-separated list of values:

A tuple can be created as a comma-separated list of values: Tuples A tuple is a sequence of values much like a list. The values stored in a tuple can be any type, and they are indexed by integers. The important difference is that tuples are immutable, and hence

More information

Algorithms for Bioinformatics

Algorithms for Bioinformatics These slides are based on previous years slides of Alexandru Tomescu, Leena Salmela and Veli Mäkinen 582670 Algorithms for Bioinformatics Lecture 1: Primer to algorithms and molecular biology 2.9.2014

More information

A Python programming primer for biochemists. BCH364C/394P Systems Biology/Bioinformatics Edward Marcotte, Univ of Texas at Austin

A Python programming primer for biochemists. BCH364C/394P Systems Biology/Bioinformatics Edward Marcotte, Univ of Texas at Austin A Python programming primer for biochemists (Named after Monty Python s Flying Circus& designed to be fun to use) BCH364C/394P Systems Biology/Bioinformatics Edward Marcotte, Univ of Texas at Austin Science

More information

I519 Introduction to Bioinformatics, Genome assembly. Yuzhen Ye School of Informatics & Computing, IUB

I519 Introduction to Bioinformatics, Genome assembly. Yuzhen Ye School of Informatics & Computing, IUB I519 Introduction to Bioinformatics, 2014 Genome assembly Yuzhen Ye (yye@indiana.edu) School of Informatics & Computing, IUB Contents Genome assembly problem Approaches Comparative assembly The string

More information

Phylogenetics on CUDA (Parallel) Architectures Bradly Alicea

Phylogenetics on CUDA (Parallel) Architectures Bradly Alicea Descent w/modification Descent w/modification Descent w/modification Descent w/modification CPU Descent w/modification Descent w/modification Phylogenetics on CUDA (Parallel) Architectures Bradly Alicea

More information

Admin. CS 112 Introduction to Programming. Exercise: Gene Finding. Language Organizing Structure. Gene Finding. Foundational Programming Concepts

Admin. CS 112 Introduction to Programming. Exercise: Gene Finding. Language Organizing Structure. Gene Finding. Foundational Programming Concepts Admin CS 112 Introduction to Programming q PS6 (Sukoku) questions? q Planning of remaining of the semester User-Defined Data Types Yang (Richard) Yang Computer Science Department Yale University 308A Watson,

More information

Multiple Sequence Alignment Gene Finding, Conserved Elements

Multiple Sequence Alignment Gene Finding, Conserved Elements Multiple Sequence Alignment Gene Finding, Conserved Elements Definition Given N sequences x 1, x 2,, x N : Insert gaps (-) in each sequence x i, such that All sequences have the same length L Score of

More information

Strings. Genome 373 Genomic Informatics Elhanan Borenstein

Strings. Genome 373 Genomic Informatics Elhanan Borenstein Strings Genome 373 Genomic Informatics Elhanan Borenstein print hello, world pi = 3.14159 pi = -7.2 yet_another_var = pi + 10 print pi import math log10 = math.log(10) import sys arg1 = sys.argv[1] arg2

More information

Overview. Dataset: testpos DNA: CCCATGGTCGGGGGGGGGGAGTCCATAACCC Num exons: 2 strand: + RNA (from file): AUGGUCAGUCCAUAA peptide (from file): MVSP*

Overview. Dataset: testpos DNA: CCCATGGTCGGGGGGGGGGAGTCCATAACCC Num exons: 2 strand: + RNA (from file): AUGGUCAGUCCAUAA peptide (from file): MVSP* Overview In this homework, we will write a program that will print the peptide (a string of amino acids) from four pieces of information: A DNA sequence (a string). The strand the gene appears on (a string).

More information

Sequence Assembly. BMI/CS 576 Mark Craven Some sequencing successes

Sequence Assembly. BMI/CS 576  Mark Craven Some sequencing successes Sequence Assembly BMI/CS 576 www.biostat.wisc.edu/bmi576/ Mark Craven craven@biostat.wisc.edu Some sequencing successes Yersinia pestis Cannabis sativa The sequencing problem We want to determine the identity

More information

How to Run NCBI BLAST on zcluster at GACRC

How to Run NCBI BLAST on zcluster at GACRC How to Run NCBI BLAST on zcluster at GACRC BLAST: Basic Local Alignment Search Tool Georgia Advanced Computing Resource Center University of Georgia Suchitra Pakala pakala@uga.edu 1 OVERVIEW What is BLAST?

More information

GLIMMER. Dennis Flottmann

GLIMMER. Dennis Flottmann GLIMMER Dennis Flottmann 1 Agenda Who invented GLIMMER? What is GLIMMER? How GLIMMER works IMMs ICMs GLIMMER live demonstration GLIMMER today and in comparison to other tools 2 Who invented GLIMMER? Steven

More information

PERFORMANCE ANALYSIS OF DATAMINIG TECHNIQUE IN RBC, WBC and PLATELET CANCER DATASETS

PERFORMANCE ANALYSIS OF DATAMINIG TECHNIQUE IN RBC, WBC and PLATELET CANCER DATASETS PERFORMANCE ANALYSIS OF DATAMINIG TECHNIQUE IN RBC, WBC and PLATELET CANCER DATASETS Mayilvaganan M 1 and Hemalatha 2 1 Associate Professor, Department of Computer Science, PSG College of arts and science,

More information

TCCAGGTG-GAT TGCAAGTGCG-T. Local Sequence Alignment & Heuristic Local Aligners. Review: Probabilistic Interpretation. Chance or true homology?

TCCAGGTG-GAT TGCAAGTGCG-T. Local Sequence Alignment & Heuristic Local Aligners. Review: Probabilistic Interpretation. Chance or true homology? Local Sequence Alignment & Heuristic Local Aligners Lectures 18 Nov 28, 2011 CSE 527 Computational Biology, Fall 2011 Instructor: Su-In Lee TA: Christopher Miles Monday & Wednesday 12:00-1:20 Johnson Hall

More information

INTRODUCTION TO BIOINFORMATICS

INTRODUCTION TO BIOINFORMATICS Molecular Biology-2019 1 INTRODUCTION TO BIOINFORMATICS In this section, we want to provide a simple introduction to using the web site of the National Center for Biotechnology Information NCBI) to obtain

More information

Algorithms and Data Structures

Algorithms and Data Structures Algorithms and Data Structures Sorting beyond Value Comparisons Marius Kloft Content of this Lecture Radix Exchange Sort Sorting bitstrings in linear time (almost) Bucket Sort Marius Kloft: Alg&DS, Summer

More information

Algorithmic Approaches for Biological Data, Lecture #7

Algorithmic Approaches for Biological Data, Lecture #7 Algorithmic Approaches for Biological Data, Lecture #7 Katherine St. John City University of New York American Museum of Natural History 10 February 2016 Outline Patterns in Strings Recap: Files in and

More information

As of August 15, 2008, GenBank contained bases from reported sequences. The search procedure should be

As of August 15, 2008, GenBank contained bases from reported sequences. The search procedure should be 48 Bioinformatics I, WS 09-10, S. Henz (script by D. Huson) November 26, 2009 4 BLAST and BLAT Outline of the chapter: 1. Heuristics for the pairwise local alignment of two sequences 2. BLAST: search and

More information

STAN Manual by Anne-Sophie Valin, Patrick Durand, and Grégory Ranchy. Published May, 9th, 2005

STAN Manual by Anne-Sophie Valin, Patrick Durand, and Grégory Ranchy. Published May, 9th, 2005 STAN Manual STAN Manual by Anne-Sophie Valin, Patrick Durand, and Grégory Ranchy Published May, 9th, 2005 Revision History Revision 2.0 31/01/2007 Revised by: Laetitia Guillot Update of the screen printings,

More information

Sequence alignment. Genomes change over time

Sequence alignment. Genomes change over time Sequence alignment Genomes change over time 1 Goal of alignment: Infer edit operations What is sequence alignment? 2 Align biological sequences Statement of the problem Given 2 sequences Scoring system

More information

Quiz Section Week 8 May 17, Machine learning and Support Vector Machines

Quiz Section Week 8 May 17, Machine learning and Support Vector Machines Quiz Section Week 8 May 17, 2016 Machine learning and Support Vector Machines Another definition of supervised machine learning Given N training examples (objects) {(x 1,y 1 ), (x 2,y 2 ),, (x N,y N )}

More information

Weighted Finite-State Transducers in Computational Biology

Weighted Finite-State Transducers in Computational Biology Weighted Finite-State Transducers in Computational Biology Mehryar Mohri Courant Institute of Mathematical Sciences mohri@cims.nyu.edu Joint work with Corinna Cortes (Google Research). 1 This Tutorial

More information

3.1 Using Data Types. Data Type. Constructors and Methods. Objects. Data type. Set of values and operations on those values.

3.1 Using Data Types. Data Type. Constructors and Methods. Objects. Data type. Set of values and operations on those values. Data Types 3.1 Using Data Types Data type. Set of values and operations on those values. Primitive types. Ops directly translate to machine instructions. Data Type boolean int double Set of Values true,

More information

Genome 373: Genome Assembly. Doug Fowler

Genome 373: Genome Assembly. Doug Fowler Genome 373: Genome Assembly Doug Fowler What are some of the things we ve seen we can do with HTS data? We ve seen that HTS can enable a wide variety of analyses ranging from ID ing variants to genome-

More information

Sequencing Alignment I

Sequencing Alignment I Sequencing Alignment I Lectures 16 Nov 21, 2011 CSE 527 Computational Biology, Fall 2011 Instructor: Su-In Lee TA: Christopher Miles Monday & Wednesday 12:00-1:20 Johnson Hall (JHN) 022 1 Outline: Sequence

More information

Programming Languages and Uses in Bioinformatics

Programming Languages and Uses in Bioinformatics Programming in Perl Programming Languages and Uses in Bioinformatics Perl, Python Pros: reformatting data files reading, writing and parsing files building web pages and database access building work flow

More information

Computational Genomics and Molecular Biology, Fall

Computational Genomics and Molecular Biology, Fall Computational Genomics and Molecular Biology, Fall 2015 1 Sequence Alignment Dannie Durand Pairwise Sequence Alignment The goal of pairwise sequence alignment is to establish a correspondence between the

More information

Support Vector Machines (SVM)

Support Vector Machines (SVM) Support Vector Machines a new classifier Attractive because (SVM) Has sound mathematical foundations Performs very well in diverse and difficult applications See paper placed on the class website Review

More information

Lecture 2, Introduction to Python. Python Programming Language

Lecture 2, Introduction to Python. Python Programming Language BINF 3360, Introduction to Computational Biology Lecture 2, Introduction to Python Young-Rae Cho Associate Professor Department of Computer Science Baylor University Python Programming Language Script

More information

Algorithmic Approaches for Biological Data, Lecture #8

Algorithmic Approaches for Biological Data, Lecture #8 Algorithmic Approaches for Biological Data, Lecture #8 Katherine St. John City University of New York American Museum of Natural History 17 February 2016 Outline More on Pattern Finding: Regular Expressions

More information

Question 4: a. We want to store a binary encoding of the 150 original Pokemon. How many bits do we need to use?

Question 4: a. We want to store a binary encoding of the 150 original Pokemon. How many bits do we need to use? Question 4: a. We want to store a binary encoding of the 150 original Pokemon. How many bits do we need to use? b. What is the encoding for Pikachu (#25)? Question 2: Flippin Fo Fun (10 points, 14 minutes)

More information

Improving the local alignment of LocARNA through automated parameter optimization

Improving the local alignment of LocARNA through automated parameter optimization Improving the local alignment of LocARNA through automated parameter optimization Bled 18.02.2016 Teresa Müller Introduction Non-coding RNA High performing RNA alignment tool correct classification LocARNA:

More information

Graph Algorithms in Bioinformatics

Graph Algorithms in Bioinformatics Graph Algorithms in Bioinformatics Bioinformatics: Issues and Algorithms CSE 308-408 Fall 2007 Lecture 13 Lopresti Fall 2007 Lecture 13-1 - Outline Introduction to graph theory Eulerian & Hamiltonian Cycle

More information

Lecture 2 Pairwise sequence alignment. Principles Computational Biology Teresa Przytycka, PhD

Lecture 2 Pairwise sequence alignment. Principles Computational Biology Teresa Przytycka, PhD Lecture 2 Pairwise sequence alignment. Principles Computational Biology Teresa Przytycka, PhD Assumptions: Biological sequences evolved by evolution. Micro scale changes: For short sequences (e.g. one

More information

(DNA#): Molecular Biology Computation Language Proposal

(DNA#): Molecular Biology Computation Language Proposal (DNA#): Molecular Biology Computation Language Proposal Aalhad Patankar, Min Fan, Nan Yu, Oriana Fuentes, Stan Peceny {ap3536, mf3084, ny2263, oif2102, skp2140} @columbia.edu Motivation Inspired by the

More information