Preliminary Syllabus. Genomics: Introduction & Genome Assembly; Sequence Comparison; Gene Modeling; Gene Function Identification

Preliminary Syllabus
  Dates: Sep 30, Oct 2, Oct 7, Oct 9, Oct 14, Oct 16, Oct 21, Oct 25, Oct 28, Nov 4, Nov 8
  Topics, in order:
    Introduction & Genome Assembly
    Sequence Comparison
    Gene Modeling
    Gene Function Identification
    OCTOBER BREAK
    Comparative
    Protein-Protein Interactions
    Pathway Resources and Analysis
    Structural / Protein Structure Prediction
    Protein Modeling
    EXAM
  Contact: Gribskov@purdue.edu, Lilly G-233
  Gribskov 2.1

Genome Assembly: Populus trichocarpa
  Science (2006) 313:1596-1604 (September 15)
  485 Mb (cytogenetic estimate = 550 Mb)
  7.5X coverage
  2447 scaffolds, 410 Mb in scaffold assembly (84%)
  95% of genome
  45,500 "genes"
  19 linkage groups
  Evidence for two whole genome duplications

Genome Assembly: Populus clone and sequence statistics
  Insert size (kb)  Vector   Reads (millions)  Reads used (millions)  Bases, Qual > 20 (Gb)  Bases after trimming (Gb)  % bases used  % of total
  2.0-4.0           plasmid  4.45              2.75                   2.76                   1.73                       62.7          56.4
  4.5-7.5           plasmid  2.58              1.62                   1.78                   1.04                       58.4          33.4
  38-41             fosmid   0.65              0.43                   0.41                   0.30                       73.1          9.8
  Total                      7.69              4.80                   4.95                   3.07                       62.0

Genome Assembly: Populus
  Small contigs and singletons tend to be contaminants

Genome Assembly: Populus
  Inner tracks (inside to outside):
    (black) clone coverage; each circle shows 5X depth
    (red) coverage provided by clones not assigned to contigs (singletons)
    (alternating color) anchored contigs
    (alternating color) position of individual anchored clones in each contig
  Outer tracks: clones lacking contig assignment (singletons)
  Scale: 1 Mb

Genome Assembly: Populus
  How common are chimeras?
  Chimeric reads in the chloroplast genome: one end chloroplast, one end nuclear
  Average 410 reads/position
  ~5-6%

Genome Assembly: Populus Transposable Elements

Genome Assembly: Additional Assembly Protocols
  Comparative assembly
    Align to an existing, very similar genome
    Several times faster (3-4X)
    Problems: insertions/deletions, rearrangements
  Align by physical map

EST Assembly
  Assembling RNA
  Result often called unigenes
  Much less consistent than DNA
  Similar to DNA assembly, except:
    Contigs do not join into one sequence
    Special artifacts: post-transcriptional modification, alternative splicing, trans-splicing, SNPs/haplotypes

Genome Assembly: Assembly Validation (how good is it?)
  Mate-pair information (see Phillippy et al., 2008)
    Number of mate pairs whose distance violates length assumptions
    Number of mate pairs whose orientation is impossible
  Number of unused reads (singletons)
    Align singletons to contigs to check
  Correlated polymorphisms
    Overlapping reads should not have differences at the same position unless mis-assembled, allelic, or duplicated
  Experimental
    Physical map
    FISH
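
The two mate-pair checks listed above (distance and orientation) can be sketched as a small classifier. This is a minimal illustration, not the method of any particular assembler: the `MatePair` record, the fixed `read_len`, the 3-standard-deviation cutoff `k`, and the assumption that valid mates face each other ('+' on the left, '-' on the right) are all hypothetical choices made for the example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class MatePair:
    start1: int          # leftmost contig coordinate of read 1
    strand1: str         # '+' or '-'
    start2: int          # leftmost contig coordinate of read 2
    strand2: str         # '+' or '-'
    read_len: int = 100  # assumed fixed read length (hypothetical)


def classify(mp, mean, sd, k=3.0):
    """Label one mate pair as ok / misoriented / compressed / stretched."""
    # Order the two reads left to right along the contig.
    (lpos, lstrand), (rpos, rstrand) = sorted(
        [(mp.start1, mp.strand1), (mp.start2, mp.strand2)]
    )
    # Assumption: a valid pair points inward ('+' left, '-' right).
    if not (lstrand == "+" and rstrand == "-"):
        return "misoriented"
    span = rpos + mp.read_len - lpos  # implied insert size
    if span < mean - k * sd:
        return "compressed"
    if span > mean + k * sd:
        return "stretched"
    return "ok"


pairs = [
    MatePair(100, "+", 2000, "-"),  # insert ~2000: consistent
    MatePair(100, "+", 9000, "-"),  # far too long
    MatePair(100, "+", 300, "-"),   # far too short
    MatePair(100, "+", 2000, "+"),  # same strand
    MatePair(100, "-", 2000, "+"),  # pointing outward
]
summary = Counter(classify(p, mean=2000, sd=200) for p in pairs)
print(summary)
```

Counting the non-"ok" labels per contig gives the "number of mate pairs that violate assumptions" statistic from the slide.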

Genome Assembly: Populus
  Mapping of scaffolds to chromosomes using microsatellites

Genome Assembly: Populus
  Mapping BACs to chromosomes using FISH

Genome Assembly: Mate-Pair Violations
  Compressed tandem repeats make mate pairs appear "stretched"
  Incorrect arrangement of contigs leads to mis-oriented and inconsistent mate pairs

Genome Assembly: Mate-Pair Violations
  B. anthracis example
  4 unassembled regions partially match the assembly
  Partial matches all end at the same location

Genome Assembly: Mate-Pair
  Drosophila virilis
  Repeat compression
  Insert in assembly?

Genome Assembly: Mate-Pair Violations
  16 Phrap bacterial genome assemblies

Genome Assembly: Finding Overlaps
  Most time-consuming aspect of assembly
  Requires $n^2/2$ comparisons = $O(n^2)$
  All methods rely on looking for exact matches over some length
  Two concerns:
    How likely are incorrect matches?
    How to do it very quickly?
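
One common way to avoid all $n^2/2$ read-versus-read comparisons is to index exact words and compare only reads that share one. The sketch below assumes a hypothetical word length `k` and four toy reads; it illustrates the "exact matches over some length" idea rather than any specific assembler.

```python
from collections import defaultdict
from itertools import combinations


def candidate_overlaps(reads, k):
    """Return pairs of read indices that share at least one exact k-mer."""
    index = defaultdict(set)  # k-mer -> indices of reads containing it
    for rid, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            index[read[i:i + k]].add(rid)
    pairs = set()
    for rids in index.values():
        # Only reads listed under the same k-mer are ever paired up.
        pairs.update(combinations(sorted(rids), 2))
    return pairs


# Toy reads (made up for illustration).
reads = ["ACGTACGGTC", "CGGTCAATTG", "TTTTTTTTTT", "AATTGCCGTA"]
pairs = candidate_overlaps(reads, k=5)
print(sorted(pairs))
```

Only the candidate pairs need the slower overlap alignment. The choice of `k` trades off the two concerns on the slide: shorter words give more chance matches, longer words miss overlaps that contain sequencing errors.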

Sequence Database Searching
  Essentially the same problem as finding overlaps in assembly
  Main approach:
    Rapid scan of database for candidate matches
    Slow evaluation of similarity by dynamic programming alignment
    Statistical analysis
      BLAST: theory based
      FASTA: fit to observed data

FASTA
  Originally developed in the mid-1980s as FASTN and FASTP, for nucleic acid and protein respectively
  Fast approximation of dynamic programming alignment
  Relies on related sequences having "diagonals" with high similarity
  Step 1. Find best regions on diagonals
  Step 2. Rescan 10 best regions with PAM scoring table
  Step 3. Join initial regions
  Step 4. Calculate dynamic programming optimal alignment
  Step 5. Calculate significance of scores

Sequence database searching - FASTA
  [Figure: panels 1-4 illustrate Steps 1-4: find best regions on diagonals, rescan the 10 best regions with the scoring table, join initial regions, calculate the dynamic programming optimal alignment]

Sequence database searching - FASTA
  Step 1 - Find Initial Regions (fast part of search)
  Find best regions of diagonals using a lookup table
  Lookup table: lists all the words of length ktup and where they occur
  MYSEQVENCEN            HISSEQENCEQ
  Word  Position(s)      Word  Position(s)
  CE    9                CE    9
  EN    7,10             EN    7
  EQ    4                EQ    5,10
  MY    1                HI    1
  NC    8                IS    2
  QV    5                NC    8
  SE    3                QE    6
  VE    6                SE    4
  YS    2                SS    3
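
The lookup table for the two example sequences can be rebuilt with a few lines of code. This is a minimal sketch: the function name `lookup_table` is made up for the example, and positions are 1-based to match the slide.

```python
from collections import defaultdict


def lookup_table(seq, ktup):
    """Map every word of length ktup to the 1-based positions where it starts."""
    table = defaultdict(list)
    for i in range(len(seq) - ktup + 1):
        table[seq[i:i + ktup]].append(i + 1)
    return dict(table)


table_1 = lookup_table("MYSEQVENCEN", ktup=2)
table_2 = lookup_table("HISSEQENCEQ", ktup=2)

for word in sorted(table_1):
    print(word, table_1[word])
```

Building the table takes one pass over the sequence, and each later word lookup is a single dictionary access, which is why this is the fast part of the search.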

Sequence database searching - FASTA
  Step 1 - Find Initial Regions
  For each matching word (ktup), calculate on which diagonal the match lies (AKA histogramming)
    diagonal = offset in database - offset in query
  Word  MYSEQVENCEN (database)  HISSEQENCEQ (query)  Diagonal(s)
  CE    9                       9                    0
  EN    7,10                    7                    0, +3
  EQ    4                       5,10                 -1, -6
  NC    8                       8                    0
  SE    3                       4                    -1
  Does the diagonal already have a region?
    If no, start a region (score = pair score)
    If yes, try to combine them if score > distance to existing region (score = pair scores - distance)
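
The diagonal histogram can be sketched as follows. The example assumes MYSEQVENCEN plays the role of the database sequence and HISSEQENCEQ the query, because that assignment reproduces the diagonals 0, +3, -1 and -6 shown for EN and EQ; the function names are made up for the illustration, and the region scoring and joining rules are left out.

```python
from collections import Counter, defaultdict


def word_positions(seq, ktup):
    """Map each word of length ktup to its 1-based start positions."""
    table = defaultdict(list)
    for i in range(len(seq) - ktup + 1):
        table[seq[i:i + ktup]].append(i + 1)
    return table


def diagonal_histogram(database, query, ktup):
    """Count word matches per diagonal (diagonal = database offset - query offset)."""
    db_words = word_positions(database, ktup)
    hist = Counter()
    for i in range(len(query) - ktup + 1):
        query_pos = i + 1
        for db_pos in db_words.get(query[i:i + ktup], []):
            hist[db_pos - query_pos] += 1
    return hist


hist = diagonal_histogram("MYSEQVENCEN", "HISSEQENCEQ", ktup=2)
for diagonal, count in sorted(hist.items()):
    print(diagonal, count)
```

Diagonals with many word matches are the candidates for initial regions. Here diagonal 0 collects three matches (CE, EN, NC) and diagonal -1 collects two (SE, EQ).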

Statistics
  Sequence matching is not normal, it is extreme!
  Scores follow an extreme value or Gumbel distribution
  Z score can't be directly converted to probability
  Applies whenever you are looking at a distribution of maxima
    longest run of heads in coin toss
    maximum scores for each sequence in database
  Sequence matches are a lot like coin tosses!
    PTVQGLRLFE
    ::  :  :
    PTAAGQELLS
    ++--+--+--
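
The coin-toss analogy can be tried directly. The sketch below turns the example alignment into a string of matches (+) and mismatches (-), then simulates the longest run of heads over many sets of fair coin tosses. The number of flips, the number of trials, and the random seed are arbitrary choices for the illustration.

```python
import random
from collections import Counter


def longest_run(flips):
    """Length of the longest run of True values."""
    best = current = 0
    for f in flips:
        current = current + 1 if f else 0
        best = max(best, current)
    return best


def match_flips(a, b):
    """Treat each aligned position as a coin toss: True for a match."""
    return [x == y for x, y in zip(a, b)]


def simulate_longest_runs(n_flips, trials, p=0.5, seed=0):
    """Longest run of heads in each of `trials` sets of `n_flips` tosses."""
    rng = random.Random(seed)
    return [
        longest_run(rng.random() < p for _ in range(n_flips))
        for _ in range(trials)
    ]


flips = match_flips("PTVQGLRLFE", "PTAAGQELLS")
pattern = "".join("+" if f else "-" for f in flips)
print(pattern, "longest run of matches:", longest_run(flips))

runs = simulate_longest_runs(n_flips=200, trials=2000)
mean_run = sum(runs) / len(runs)
for length, count in sorted(Counter(runs).items()):
    print(f"{length:2d} {'#' * (count // 20)}")
```

The printed histogram is a distribution of maxima. It would be expected to show a longer tail toward large run lengths than toward small ones, which is the asymmetry a normal distribution cannot capture.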

Extreme Value Distributions
  Are appropriate whenever you are looking at a DISTRIBUTION OF MAXIMA
    longest run of heads in coin toss
    maximum scores for each sequence in database
  Z score can't be directly converted to probability because it is not a Normal (Gaussian) distribution
    e.g. Z = 3 has a normal P-value = 0.0013 but an extreme value distribution P-value ~ 0.012: about a 10-fold error (the error gets worse for smaller P-values)

Sequence Database Searching: Score Distribution
  [Figure: Extreme Value Distribution; probability and cumulative probability plotted against run length (0-16)]

BLAST
  Based on Maximal Segment Pairs (MSP)
    Highest scoring pair of identical-length segments from two sequences
    Local alignment without gaps, similar to FASTA local region
    Expected distribution is known!
  Maximal Segment Pair sample calculation (match = +1, mismatch = -1)
    T G C A A T C G A T C G T C G T C C G T A T A C A
      : :         : : : : : :   : :     :
    A G C T C G T G A T C G T G G T G G G A T C G G T
    0 1 2 1 0 0 0 1 2 3 4 5 6 5 6 7 6 5 6 5 4 3 2 1 0   (running sum)
  (two segments are marked "Potential MSP" in the original figure)
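
The running sum in the sample calculation, and the segment it identifies, can be reproduced with a short script. This is a sketch of the ungapped, same-offset case shown on the slide only; the function names are made up, and coordinates are 0-based with the end exclusive.

```python
def running_sum(a, b, match=1, mismatch=-1):
    """Cumulative score along the alignment, reset to zero when it would go negative."""
    sums, s = [], 0
    for x, y in zip(a, b):
        s = max(0, s + (match if x == y else mismatch))
        sums.append(s)
    return sums


def maximal_segment(a, b, match=1, mismatch=-1):
    """Return (score, start, end) of the highest-scoring ungapped segment."""
    best = (0, 0, 0)
    s, start = 0, 0
    for i, (x, y) in enumerate(zip(a, b)):
        s += match if x == y else mismatch
        if s <= 0:
            s, start = 0, i + 1          # restart after the score is used up
        elif s > best[0]:
            best = (s, start, i + 1)     # new best segment ends here
    return best


a = "TGCAATCGATCGTCGTCCGTATACA"
b = "AGCTCGTGATCGTGGTGGGATCGGT"

sums = running_sum(a, b)
score, start, end = maximal_segment(a, b)
print(" ".join(map(str, sums)))
print(score, a[start:end], b[start:end])
```

With these two sequences the best segment has score 7 and covers positions 8 to 16 (1-based), the peak of the running sum.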

BLAST is based on Significant MSPs
  Scoring system
    Must have at least one positive score
    Expected score must be less than zero: $E = \sum_i f_i s_i < 0$
  Probability of an MSP scoring higher than $S$:
    $P(\mathrm{MSP} > S) \approx K N e^{-\lambda S}$
    $N$ = size of data; $K$ and $\lambda$ are constants
  Karlin, S., and Altschul, S.F., Proc. Natl. Acad. Sci. 87, 2264-2268, 1990.
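
The two scoring-system conditions and the constant $\lambda$ can be checked numerically. In Karlin-Altschul theory $\lambda$ is the positive solution of $\sum_i f_i e^{\lambda s_i} = 1$; that equation is not on the slide and is added here as background. The sketch uses the +1/-1 DNA scoring from the MSP example with equal base frequencies (an assumption), and the values of `K` and `N` are placeholders, since computing $K$ is beyond this example.

```python
import math


def expected_score(freq_scores):
    """E = sum of f_i * s_i over (frequency, score) pairs."""
    return sum(f * s for f, s in freq_scores)


def solve_lambda(freq_scores, tol=1e-12):
    """Positive root of sum f_i * exp(lambda * s_i) = 1, found by bisection."""
    if max(s for _, s in freq_scores) <= 0:
        raise ValueError("need at least one positive score")
    if expected_score(freq_scores) >= 0:
        raise ValueError("expected score must be negative")

    def g(lam):
        return sum(f * math.exp(lam * s) for f, s in freq_scores) - 1.0

    lo, hi = 1e-6, 1.0        # g(lo) < 0 because the expected score is negative
    while g(hi) <= 0:         # grow the bracket until the sign changes
        hi *= 2
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if g(mid) > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2


def msp_tail(S, N, K, lam):
    """Approximate P(MSP > S) = K * N * exp(-lambda * S), valid when the result is small."""
    return K * N * math.exp(-lam * S)


# Random DNA with equal base frequencies: match 1/4 of the time, mismatch 3/4.
dna = [(0.25, +1), (0.75, -1)]
E = expected_score(dna)
lam = solve_lambda(dna)
print(E, lam)

# K = 0.1 and N = 1e6 are placeholder values, not fitted constants.
print(msp_tail(S=20, N=1e6, K=0.1, lam=lam))
```

For this scoring system the equation reduces to a quadratic in $e^{\lambda}$, giving $\lambda = \ln 3$, which the bisection recovers.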

Normal Distribution
  [Figure: probability and cumulative probability plotted over the range -4 to 4]

Extreme Value Distribution
  [Figure: probability and cumulative probability plotted against run length (0-16)]

BLAST Basic Idea
  Determine in advance the MSP score, S, that you need for significance
    for example, choose S so that you will see fewer than 10 unrelated sequences in the database that score as high
  Look for matching words of length w that score above a threshold, T, such that MSPs of score S are unlikely to be missed
  These are High-scoring Segment Pairs (HSPs)

BLAST procedure
  Step 1: Compile list of high-scoring words from query
  Step 2: Scan database for "hits"
  Step 3: Extend regions with 2 hits into MSPs
  Step 4: Dynamic programming alignment around MSPs

BLAST Step 1 - List of High Scoring Words
  Choose a significance level S
  Choose a word size, w, and cutoff, T, so that you are unlikely to miss MSPs with score S
  Make a table of all words in the "neighborhood" of the query (DNA sequences use all words)
  Typically 50 words for each residue
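
A brute-force version of the neighborhood table is easy to write down. The sketch below uses a toy four-letter alphabet and a toy scoring rule (match = 2, mismatch = -1), both made up for the example; a real protein search would use a substitution matrix such as PAM or BLOSUM and a faster enumeration.

```python
from itertools import product


def neighborhood(query, w, T, alphabet, score):
    """Map each word scoring >= T against some query word to the query offsets it matches."""
    table = {}
    for i in range(len(query) - w + 1):
        query_word = query[i:i + w]
        # Enumerate every possible word of length w (|alphabet|**w candidates).
        for candidate in product(alphabet, repeat=w):
            total = sum(score(q, c) for q, c in zip(query_word, candidate))
            if total >= T:
                table.setdefault("".join(candidate), []).append(i)
    return table


def toy_score(x, y):
    """Hypothetical scoring rule for the illustration only."""
    return 2 if x == y else -1


table = neighborhood("ACD", w=3, T=3, alphabet="ACDE", score=toy_score)
print(sorted(table))
```

With T = 3 the table holds the exact word (score 6) and the nine single-substitution words (score 3). Raising T shrinks the neighborhood and speeds up the scan, at the cost of missing more MSPs.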

BLAST Step 2 - Scan Database
  Scan only for words in neighborhood
  Use lookup tables (like FASTA) or finite automaton
  Keep data in memory to make it faster

BLAST Step 3 - Extend Words to MSPs
  In BLAST2, a diagonal must have two word hits before extension to MSP is attempted
  In principle, must examine diagonal until score drops to zero
  Shortcut: only check until score drops by X
    T G C A A T C G A T C G T C G T C C G T A T A C A
      : :         : : : : : :   : :     :
    A G C T C G T G A T C G T G G T G G G A T C G G T
    0 1 2 1 0 0 0 1 2 3 4 5 6 5 6 7 6 5 6 5 4 3 2 1 0
  (two segments are marked "Potential MSP" in the original figure)
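
The X-drop shortcut can be sketched on the same pair of sequences. The example assumes a single hypothetical 4-letter word hit ("GATC" at 0-based positions 7-10) on diagonal 0 and a drop-off of X = 3; it skips the two-hit requirement and only illustrates the stopping rule.

```python
def xdrop_extend(a, b, start, step, x_drop, match=1, mismatch=-1):
    """Extend along the diagonal from `start` in direction `step` (+1 or -1).

    Stops once the score falls x_drop below the best score seen.
    Returns (best score gained, number of positions kept).
    """
    score = best = 0
    best_len = n = 0
    i = start
    limit = min(len(a), len(b))
    while 0 <= i < limit:
        score += match if a[i] == b[i] else mismatch
        n += 1
        if score > best:
            best, best_len = score, n
        elif best - score >= x_drop:
            break
        i += step
    return best, best_len


a = "TGCAATCGATCGTCGTCCGTATACA"
b = "AGCTCGTGATCGTGGTGGGATCGGT"

seed_start, seed_end = 7, 11           # hypothetical word hit, end exclusive
seed_score = seed_end - seed_start     # four matches at +1 each

right_gain, right_len = xdrop_extend(a, b, seed_end, +1, x_drop=3)
left_gain, left_len = xdrop_extend(a, b, seed_start - 1, -1, x_drop=3)

total = seed_score + right_gain + left_gain
segment = (seed_start - left_len, seed_end + right_len)
print(total, segment, a[segment[0]:segment[1]])
```

Here the extension recovers the same score-7 segment as the full running sum while stopping a few positions after the peak instead of following the score all the way down to zero.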