Preliminary Syllabus. Genomics. Introduction & Genome Assembly Sequence Comparison Gene Modeling Gene Function Identification


Preliminary Syllabus
  Sep 30   Introduction & Genome Assembly
  Oct 2    Sequence Comparison
  Oct 7    Gene Modeling
  Oct 9    Gene Function Identification
  Oct 14   OCTOBER BREAK
  Oct 16   Comparative
  Oct 21   Protein-Protein Interactions
  Oct 25   Pathway Resources and Analysis
  Oct 28   Structural / Protein Structure Prediction
  Nov 4    Protein Modeling
  Nov 8    EXAM
Gribskov@purdue.edu, Lilly G-233
Gribskov 2.1

Genome Assembly: Populus trichocarpa
  Science (2006) 313:1596-1604 (September 15)
  485 Mb (cytogenetic estimate = 550 Mb)
  7.5X coverage
  2447 scaffolds, 410 Mb in scaffold assembly (84%)
  95% of genome
  45,500 "genes"
  19 linkage groups
  Evidence for two whole genome duplications
Gribskov 2.2

Genome Assembly: Populus
Clone and sequence statistics
  Insert size  Vector   Reads     Reads used  Bases Qual > 20  Bases after    % bases  % of
  (kb)                  (x10^-6)  (x10^-6)    (Gb)             trimming (Gb)  used     total
  2.0-4.0      plasmid  4.45      2.75        2.76             1.73           62.7     56.4
  4.5-7.5      plasmid  2.58      1.62        1.78             1.04           58.4     33.4
  38-41        fosmid   0.65      0.43        0.41             0.30           73.1     9.8
  Total                 7.69      4.80        4.95             3.07           62.0
Gribskov 2.3

Genome Assembly Populus Small contigs and singletons tend to be contaminants Gribskov 2.4

Genome Assembly: Populus
  Inner tracks (inside to outside):
    (black) clone coverage; each circle shows 5X depth
    (red) coverage provided by clones not assigned to contigs (singletons)
    (alternating color) anchored contigs
    (alternating color) position of individual anchored clones in each contig
  Outer tracks: clones lacking contig assignment (singletons)
  1 Mb
Gribskov 2.5

Genome Assembly: Populus
How common are chimeras?
  Chimeric reads in chloroplast genome: one end chloroplast, one end nuclear
  Average 410 reads/position
  ~5-6%
Gribskov 2.6

Genome Assembly Populus Transposable Elements Gribskov 2.7

Genome Assembly: Additional Assembly Protocols
  Comparative assembly: align to an existing, very similar genome
    Several times faster
    3-4X
    Problems: insertions/deletions, rearrangements
  Align by physical map
Gribskov 2.8

EST Assembly
  Assembling RNA
  Result often called unigenes
  Much less consistent than DNA
  Similar to DNA except:
    Contigs do not join into one sequence
    Special artifacts:
      Post-transcriptional modification
      Alternative splicing
      Trans-splicing
      SNPs/haplotypes
Gribskov 2.9

Genome Assembly: Assembly Validation (how good is it?)
  Mate-pair information (see Phillippy et al., 2008)
    number of mate pairs whose distance violates length assumptions
    number of mate pairs whose orientation is impossible
  Number of unused reads (singletons)
    align singletons to contigs to check
  Correlated polymorphisms
    overlapping reads should not have differences at the same position unless mis-assembled, allelic, or duplicated
  Experimental
    physical map
    FISH
Gribskov 2.10
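The two mate-pair counts listed above can be sketched in a few lines. This is only an illustration: the `MatePair` record, the expected "+/-" orientation, and the insert-size tolerance are assumptions made for the example, not part of the slides or of any particular assembler.

```python
from dataclasses import dataclass


@dataclass
class MatePair:
    # Hypothetical record: placement of both reads of a pair in the assembly.
    left_start: int
    left_strand: str   # '+' or '-'
    right_start: int
    right_strand: str  # '+' or '-'


def classify(pair, mean_insert, tolerance):
    """Return 'ok', 'orientation', or 'distance' for one placed mate pair."""
    # Assumed convention: mates face each other, left read on '+', right on '-'.
    if (pair.left_strand, pair.right_strand) != ('+', '-'):
        return 'orientation'
    span = pair.right_start - pair.left_start
    if abs(span - mean_insert) > tolerance:
        return 'distance'
    return 'ok'


def count_violations(pairs, mean_insert, tolerance):
    """Tally consistent pairs and the two kinds of violation."""
    counts = {'ok': 0, 'orientation': 0, 'distance': 0}
    for pair in pairs:
        counts[classify(pair, mean_insert, tolerance)] += 1
    return counts


pairs = [
    MatePair(100, '+', 3100, '-'),  # consistent with a 3 kb insert
    MatePair(500, '+', 9500, '-'),  # "stretched": mates too far apart
    MatePair(800, '+', 3700, '+'),  # both reads on the same strand
]
print(count_violations(pairs, mean_insert=3000, tolerance=500))
```

A cluster of 'distance' or 'orientation' calls at one location is more informative than the raw totals, since isolated violations can also come from chimeric clones.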

Genome Assembly Populus Mapping of scaffolds to chromosomes using microsatellites Gribskov 2.11

Genome Assembly Populus Mapping BACs to chromosomes using FISH Gribskov 2.12

Genome Assembly: Mate-Pair Violations
  Compressed tandem repeats make mate pairs appear "stretched"
  Incorrect arrangement of contigs leads to mis-oriented and inconsistent mate pairs
Gribskov 2.13

Genome Assembly: Mate-Pair Violations
B. anthracis example
  4 unassembled regions partially match assembly
  Partial matches all end at same location
Gribskov 2.14

Genome Assembly Mate-Pair Drosophila virilis repeat compression insert in assembly? Gribskov 2.15

Genome Assembly Mate-pair Violations 16 Phrap bacterial genome assemblies Gribskov 2.16

Genome Assembly: Finding Overlaps
  Most time-consuming aspect of assembly
  Requires n^2/2 comparisons = O(n^2)
  All methods rely on looking for exact matches over some length
  Two concerns:
    How likely are incorrect matches?
    How to do it very quickly?
Gribskov 2.17
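One common way to avoid comparing every read with every other read is to index exact words (k-mers) and only compare reads that share one. The sketch below illustrates that idea under simplifying assumptions: the reads are invented, reverse complements are ignored, and `candidate_pairs` is a name chosen for this example.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(reads, k):
    """Return pairs of read indices that share at least one exact k-mer."""
    index = defaultdict(set)  # k-mer -> indices of reads containing it
    for i, read in enumerate(reads):
        for pos in range(len(read) - k + 1):
            index[read[pos:pos + k]].add(i)
    pairs = set()
    for members in index.values():
        for a, b in combinations(sorted(members), 2):
            pairs.add((a, b))
    return pairs


reads = ["ATGGTCTAGG", "CTAGGTCCTA", "TTTTTTTTTT", "GTCCTAGTGG"]
n = len(reads)
print("all pairwise comparisons:", n * (n - 1) // 2)
print("candidate pairs:", sorted(candidate_pairs(reads, k=5)))
```

Only the candidate pairs need the slower alignment step. A shorter k finds more true overlaps but also more chance matches, which is the first of the two concerns listed above.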

Sequence Database Searching
  Essentially the same problem as finding overlaps in assembly
  Main approach:
    Rapid scan of database for candidate matches
    Slow evaluation of similarity by dynamic programming alignment
    Statistical analysis
      BLAST: theory based
      FASTA: fit to observed data
Gribskov 2.18

Sequence database searching Gribskov 2.19

FASTA
  Originally developed in the mid-1980s as FASTN and FASTP for nucleic acid and protein, respectively
  Fast approximation of dynamic programming alignment
  Relies on related sequences having "diagonals" with high similarity
  Step 1. Find best regions on diagonals
  Step 2. Rescan 10 best regions with PAM scoring table
  Step 3. Join initial regions
  Step 4. Calculate dynamic programming optimal alignment
  Step 5. Calculate significance of scores
Gribskov 2.20

Sequence database searching - FASTA
  Step 1. Find best regions on diagonals
  Step 2. Rescan 10 best regions with scoring table
  Step 3. Join initial regions
  Step 4. Calculate dynamic programming optimal alignment
  [Figure: panels 1-4]
Gribskov 2.21

Sequence database searching - FASTA
Step 1 - Find Initial Regions (fast part of search)
  Find best regions of diagonals using lookup table
  Lookup table: lists all the words of length ktup and where they occur
    MYSEQVENCEN    HISSEQENCEQ
    CE  9          CE  9
    EN  7,10       EN  7
    EQ  4          EQ  5,10
    MY  1          HI  1
    NC  8          IS  2
    QV  5          NC  8
    SE  3          QE  6
    VE  6          SE  4
    YS  2          SS  3
Gribskov 2.22
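The lookup tables for the two example sequences can be rebuilt with a short sketch. The function name `word_table` is chosen for this example; positions are 1-based to match the slide.

```python
from collections import defaultdict


def word_table(seq, ktup):
    """Map each word of length ktup to its 1-based start positions in seq."""
    table = defaultdict(list)
    for i in range(len(seq) - ktup + 1):
        table[seq[i:i + ktup]].append(i + 1)
    return dict(table)


left_table = word_table("MYSEQVENCEN", 2)
right_table = word_table("HISSEQENCEQ", 2)

for name, table in (("MYSEQVENCEN", left_table), ("HISSEQENCEQ", right_table)):
    print(name)
    for word in sorted(table):
        print("  ", word, ",".join(str(p) for p in table[word]))
```

Building a table takes one pass over the sequence, and looking up a word is then a dictionary access, which is why this is the fast part of the search.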

Sequence database searching - FASTA
Step 1 - Find Initial Regions
  For each matching word (ktup), calculate on which diagonal the match lies (AKA histogramming)
  diagonal = offset database - offset query
    Word   MYSEQVENCEN  HISSEQENCEQ  Diagonal
    CE     9            9            0
    EN     7,10         7            0, +3
    EQ     4            5,10         -1, -6
    MY/HI  1            1            (different words, no match)
  Does the diagonal already have a region?
    If no, start a region (score = pair score)
    If yes, try to combine them if score > distance to existing region (score = pair scores - distance)
Gribskov 2.23
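The diagonal calculation can be sketched as a histogram of offset differences. The values listed for EN and EQ imply that MYSEQVENCEN plays the role of the database in the slide's formula, so the sketch assumes that; swapping the two arguments only flips the signs. Region scoring and joining are not shown.

```python
from collections import Counter, defaultdict


def word_table(seq, ktup):
    """Map each word of length ktup to its 1-based start positions in seq."""
    table = defaultdict(list)
    for i in range(len(seq) - ktup + 1):
        table[seq[i:i + ktup]].append(i + 1)
    return table


def diagonal_histogram(database, query, ktup):
    """Count word hits per diagonal, with diagonal = database offset - query offset."""
    db_table = word_table(database, ktup)
    query_table = word_table(query, ktup)
    hist = Counter()
    for word, db_positions in db_table.items():
        for d in db_positions:
            for q in query_table.get(word, []):
                hist[d - q] += 1
    return hist


# Assumption: the first sequence is treated as the database.
hist = diagonal_histogram("MYSEQVENCEN", "HISSEQENCEQ", 2)
for diag, count in sorted(hist.items(), key=lambda item: -item[1]):
    print(f"diagonal {diag:+d}: {count} word hit(s)")
```

Diagonals with many hits are the candidates for initial regions; here NC and SE, which the slide's table does not list, add hits on diagonals 0 and -1.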

Sequence database searching Gribskov 2.24

Statistics
  Sequence matching is not normal, it is extreme!
  Scores follow an extreme value or Gumbel distribution
  Z score can't be directly converted to probability
  Applies whenever you are looking at a distribution of maxima:
    longest run of heads in coin toss
    maximum scores for each sequence in database
  Sequence matches are a lot like coin tosses!
    PTVQGLRLFE
    ::  :  :
    PTAAGQELLS
    ++--+--+--
Gribskov 2.25
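The coin-toss analogy is easy to simulate. The sketch below, with arbitrary choices for the number of tosses, the number of trials, and the random seed, tabulates the longest run of heads over many repeated experiments; the resulting histogram is a distribution of maxima.

```python
import random
from collections import Counter


def longest_run_of_heads(n_tosses, rng):
    """Length of the longest run of heads in n_tosses fair coin tosses."""
    best = current = 0
    for _ in range(n_tosses):
        if rng.random() < 0.5:   # heads
            current += 1
            best = max(best, current)
        else:                    # tails ends the run
            current = 0
    return best


def run_length_distribution(n_tosses, n_trials, seed=0):
    """Observed frequency of each longest-run length over n_trials experiments."""
    rng = random.Random(seed)
    counts = Counter(longest_run_of_heads(n_tosses, rng) for _ in range(n_trials))
    return {length: counts[length] / n_trials for length in sorted(counts)}


dist = run_length_distribution(n_tosses=1000, n_trials=2000)
for length, p in dist.items():
    print(f"{length:2d} {p:.3f} {'#' * round(p * 100)}")
```

The printed histogram should rise steeply and fall off slowly on the right, the asymmetric shape that a normal curve cannot reproduce.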

Extreme Value Distributions
  Are appropriate whenever you are looking at a DISTRIBUTION OF MAXIMA
    longest run of heads in coin toss
    maximum scores for each sequence in database
  Z score can't be directly converted to probability because it is not a Normal (Gaussian) distribution
    e.g. Z = 3 has a normal P-value = 0.0013, but for an extreme value distribution with the same mean and variance the P-value is ~ 0.012!!!
    about a 10-fold error (error gets worse for smaller P-values)!!!!!
Gribskov 2.26
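The Z-score comparison can be checked numerically. The sketch assumes the extreme value distribution is a Gumbel distribution rescaled to mean 0 and variance 1, so that Z means the same thing in both cases; the slides do not state a parameterization, so treat this as one reasonable reading.

```python
import math

EULER_GAMMA = 0.5772156649015329


def normal_tail(z):
    """P(Z > z) for a standard normal distribution."""
    return 0.5 * math.erfc(z / math.sqrt(2))


def gumbel_tail(z):
    """P(Z > z) for a Gumbel distribution rescaled to mean 0, variance 1 (assumed)."""
    beta = math.sqrt(6) / math.pi   # scale that gives unit variance
    mu = -EULER_GAMMA * beta        # location that gives zero mean
    return -math.expm1(-math.exp(-(z - mu) / beta))


for z in (2, 3, 4, 5):
    p_normal, p_gumbel = normal_tail(z), gumbel_tail(z)
    print(f"Z={z}: normal {p_normal:.2e}  extreme value {p_gumbel:.2e}  "
          f"ratio {p_gumbel / p_normal:.0f}")
```

The ratio column grows as Z increases, which is the sense in which the error gets worse for smaller P-values.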

Sequence Database Searching: Score Distribution
  [Figure: extreme value distribution; probability and cumulative probability plotted against run length (0 to 16)]
Gribskov 2.27

BLAST
  Based on Maximal Segment Pairs (MSP)
    Highest scoring pair of identical length segments from two sequences
    Local alignment without gaps, similar to FASTA local region
    Expected distribution is known!
  Maximal Segment Pair sample calculation (match = +1, mismatch = -1)
    T G C A A T C G A T C G T C G T C C G T A T A C A
      : :         : : : : : :   : :     :
    A G C T C G T G A T C G T G G T G G G A T C G G T
    0 1 2 1 0 0 0 1 2 3 4 5 6 5 6 7 6 5 6 5 4 3 2 1 0   (running sum)
  The running sum is reset to 0 whenever it would become negative; its two peaks (2 and 7) mark the potential MSPs.
Gribskov 2.28
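The sample calculation can be reproduced with a running sum that is clamped at zero. This is a sketch for a single diagonal with the slide's +1/-1 scoring; the function names are chosen for this example.

```python
def running_sum(a, b, match=1, mismatch=-1):
    """Running score along one diagonal, reset to 0 whenever it would go negative."""
    total, sums = 0, []
    for x, y in zip(a, b):
        total = max(0, total + (match if x == y else mismatch))
        sums.append(total)
    return sums


def best_segment(a, b, match=1, mismatch=-1):
    """Highest-scoring ungapped segment: (score, start, end), 0-based, end exclusive."""
    best = (0, 0, 0)
    total, start = 0, 0
    for i, (x, y) in enumerate(zip(a, b)):
        total += match if x == y else mismatch
        if total <= 0:
            total, start = 0, i + 1   # a new segment can start after this position
        elif total > best[0]:
            best = (total, start, i + 1)
    return best


top = "TGCAATCGATCGTCGTCCGTATACA"
bottom = "AGCTCGTGATCGTGGTGGGATCGGT"
print(running_sum(top, bottom))
score, start, end = best_segment(top, bottom)
print(score, top[start:end], bottom[start:end])
```

The best segment covers positions 8 to 16 of the slide's alignment (1-based), where the running sum reaches its maximum of 7.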

BLAST is based on Significant MSPs
  Scoring system
    Must have at least one positive score
    Expected score must be less than zero: $E = \sum_i f_i s_i < 0$
  Probability of an MSP scoring higher than S
    $P(\mathrm{MSP} > S) \approx K N e^{-\lambda S}$
    N = size of data; K and $\lambda$ are constants
  Karlin, S., and Altschul, S.F., Proc. Natl. Acad. Sci. 87, 2264-2268, 1990.
Gribskov 2.29
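The two conditions and the probability formula translate directly into code. In the sketch below the values of K and lambda are placeholders picked for illustration, not constants for any real scoring system, and the formula is the small-probability approximation, so it can exceed 1 for low scores.

```python
import math


def expected_score(freqs, scores):
    """E = sum_i f_i * s_i; must be negative for the theory to apply."""
    return sum(f * s for f, s in zip(freqs, scores))


def msp_probability(S, N, K, lam):
    """Approximate P(MSP > S) = K * N * exp(-lam * S)."""
    return K * N * math.exp(-lam * S)


def score_for_probability(p, N, K, lam):
    """Invert the formula: the score S at which P(MSP > S) equals p."""
    return math.log(K * N / p) / lam


# DNA with equal base frequencies: match (+1) has probability 0.25, mismatch (-1) 0.75.
E = expected_score([0.25, 0.75], [1, -1])
print("expected score per position:", E)

K, lam = 0.1, 1.0    # placeholder constants (assumption)
N = 1_000_000        # size of the data (assumption)
S = score_for_probability(0.01, N, K, lam)
print(f"score needed for P = 0.01: {S:.1f}")
print(f"check: P(MSP > S) = {msp_probability(S, N, K, lam):.4f}")
```

Because N appears inside the logarithm, the score needed for a fixed significance level grows only slowly as the database grows.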

Normal Distribution
  [Figure: probability and cumulative probability curves, x axis from -4 to 4]
Gribskov 2.30

Extreme Value Distribution
  [Figure: probability and cumulative probability plotted against run length (0 to 16)]
Gribskov 2.31

BLAST Basic Idea
  Determine in advance the MSP score, S, that you need to be significant
    for example, choose S so that you will see fewer than 10 unrelated sequences in the database that score as high
  Look for matching words of length w that score above a threshold, T, such that MSPs of score S are unlikely to be missed
  These are High-scoring Segment Pairs (HSPs)
Gribskov 2.32

BLAST procedure
  Step 1: Compile list of high scoring words from query
  Step 2: Scan database for "hits"
  Step 3: Extend regions with 2 hits into MSPs
  Step 4: Dynamic programming alignment around MSPs
Gribskov 2.33

BLAST Step 1 - List of High Scoring Words
  Choose a significance level S
  Choose a word size, w, and cutoff, T, so that you are unlikely to miss MSPs with score S
  Make a table of all words in the "neighborhood" of the query (DNA sequences use all words)
  Typically 50 words for each residue
Gribskov 2.34
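The neighborhood table can be sketched by enumerating every word of length w and keeping those that score at least T against a query word. The scoring function here is a toy placeholder (+5 for identity, -1 otherwise), not a real substitution matrix such as PAM or BLOSUM, and brute-force enumeration is used only because it is easy to read.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_score(a, b):
    """Placeholder scoring (assumption): +5 for identity, -1 otherwise."""
    return 5 if a == b else -1


def neighborhood(word, T, score=toy_score, alphabet=AMINO_ACIDS):
    """All words of the same length that score at least T against word."""
    hits = []
    for candidate in product(alphabet, repeat=len(word)):
        s = sum(score(a, b) for a, b in zip(word, candidate))
        if s >= T:
            hits.append(("".join(candidate), s))
    return hits


def neighborhood_table(query, w, T):
    """Map every neighborhood word to the 0-based query positions it came from."""
    table = {}
    for i in range(len(query) - w + 1):
        for candidate, _ in neighborhood(query[i:i + w], T):
            table.setdefault(candidate, []).append(i)
    return table


words = neighborhood("SEQ", T=9)
print(len(words), "neighborhood words for SEQ, e.g.", words[:3])
table = neighborhood_table("MYSEQVENCEN", w=3, T=9)
print(len(table), "distinct words in the table for the whole query")
```

With this toy scoring and T = 9, each query word contributes itself plus every single-residue substitution; raising T shrinks the table and speeds up the scan at the cost of sensitivity.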

BLAST Step 2 - Scan Database
  Scan only for words in neighborhood
  Use lookup tables (like FASTA) or finite automaton
  Keep data in memory to make it faster
Gribskov 2.35
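A lookup-table scan can be sketched as a single pass over the database sequence. To keep the example short it uses the DNA case, where the table holds every word of the query rather than a scored neighborhood; the word size and function names are choices made for this illustration.

```python
def build_lookup(query, w):
    """Map each word of length w in the query to its 0-based start positions."""
    lookup = {}
    for i in range(len(query) - w + 1):
        lookup.setdefault(query[i:i + w], []).append(i)
    return lookup


def scan(subject, lookup, w):
    """Return (query_pos, subject_pos) for every word hit, scanning the subject once."""
    hits = []
    for j in range(len(subject) - w + 1):
        for i in lookup.get(subject[j:j + w], ()):
            hits.append((i, j))
    return hits


query = "TGCAATCGATCGTCGTCCGTATACA"
subject = "AGCTCGTGATCGTGGTGGGATCGGT"
lookup = build_lookup(query, 4)
hits = scan(subject, lookup, 4)
for i, j in hits:
    print(f"query {i:2d}  subject {j:2d}  diagonal {j - i:+d}  word {query[i:i + 4]}")
```

Each hit is a seed for the extension step; hits that fall on the same diagonal are the ones a two-hit rule would look for.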

BLAST Step 3 - Extend Words to MSPs
  In BLAST2, a diagonal must have two word hits before extension to an MSP is attempted
  In principle, must examine the diagonal until the score drops to zero
  Shortcut: only check until the score drops by X
    T G C A A T C G A T C G T C G T C C G T A T A C A
      : :         : : : : : :   : :     :
    A G C T C G T G A T C G T G G T G G G A T C G G T
    0 1 2 1 0 0 0 1 2 3 4 5 6 5 6 7 6 5 6 5 4 3 2 1 0
  (two potential MSPs are marked on the slide)
Gribskov 2.36
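The X-drop shortcut can be sketched as an ungapped extension in both directions from a word hit, stopping once the running score has fallen X below the best score seen. The +1/-1 scoring, the seed position, and the value of X are assumptions for this example, and the two-hit requirement is not implemented.

```python
def _extend_one_way(query, subject, i, j, step, X, match, mismatch):
    """Walk along the diagonal from (i, j); return (best gain, residues kept)."""
    best = run = kept = walked = 0
    while 0 <= i < len(query) and 0 <= j < len(subject):
        run += match if query[i] == subject[j] else mismatch
        walked += 1
        if run > best:
            best, kept = run, walked
        elif best - run >= X:   # score has dropped by X: give up
            break
        i += step
        j += step
    return best, kept


def extend_hit(query, subject, q_pos, s_pos, w, X, match=1, mismatch=-1):
    """Ungapped X-drop extension of a word hit.

    Returns (score, q_begin, q_end) in 0-based, end-exclusive query coordinates.
    """
    seed = sum(match if query[q_pos + k] == subject[s_pos + k] else mismatch
               for k in range(w))
    right, r_kept = _extend_one_way(query, subject, q_pos + w, s_pos + w, +1,
                                    X, match, mismatch)
    left, l_kept = _extend_one_way(query, subject, q_pos - 1, s_pos - 1, -1,
                                   X, match, mismatch)
    return seed + left + right, q_pos - l_kept, q_pos + w + r_kept


query = "TGCAATCGATCGTCGTCCGTATACA"
subject = "AGCTCGTGATCGTGGTGGGATCGGT"
# Seed: the word GATC at 0-based position 7 in both sequences (diagonal 0).
score, begin, end = extend_hit(query, subject, q_pos=7, s_pos=7, w=4, X=3)
print(score, query[begin:end], subject[begin:end])
```

With X = 3 the extension recovers the score-7 segment from the example alignment without walking the diagonal to its end; a smaller X is faster but may stop before a later recovery in score.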