BLAST MCDB 187. Friday, February 8, 13

BLAST
Basic Local Alignment Search Tool
Uses a shortcut to compute alignments of a sequence against a database very quickly
Typically takes about a minute to align a sequence against a database of one million sequences
Accurately computes the statistical significance of the alignment

One of the most highly cited papers

Global vs Local Alignments
In 1981 Smith and Waterman introduced a modification to the Needleman-Wunsch algorithm to search for optimal alignment fragments
Positions in the alignment matrix with negative scores are set to zero and are considered new start positions
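The zero-reset rule can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the scoring values (match +1, mismatch -1, linear gap -1) and the function name are assumptions chosen for the example, and only the best score is returned, without a traceback.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the best local alignment score between strings a and b.

    Scoring values are illustrative assumptions, not taken from the lecture.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1] and b[j-1]
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = H[i - 1][j] + gap
            left = H[i][j - 1] + gap
            # Taking the max with 0 is the Smith-Waterman modification:
            # a cell that would go negative is reset to zero and can
            # serve as the start of a new local alignment.
            H[i][j] = max(0, diag, up, left)
            best = max(best, H[i][j])
    return best
```

Without the `max(0, ...)` term and with gap-penalised first row and column, the same recurrence would give the global (Needleman-Wunsch) score instead.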

Definitions
Maximal Segment Pair (MSP): highest-scoring pair of identical-length segments from two sequences
Word pair: segment pair of fixed length w
T: minimum word pair score

Algorithm
Pre-compute all words that score at least T against the query sequence
Search for these words in a database
For each match, extend the alignment to generate the maximal segment pair
Compute statistics of the resulting score

Step 1: break up sequence into words
ATCGTCTATTCCCGG
w = 4 letter words
ATCG TCGT CGTC etc.
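Step 1 amounts to sliding a window of length w along the query. A minimal sketch, where the function name `words` is a hypothetical helper introduced for illustration:

```python
def words(sequence, w=4):
    """Return every overlapping word of length w, in order of position."""
    return [sequence[i:i + w] for i in range(len(sequence) - w + 1)]


query = "ATCGTCTATTCCCGG"
query_words = words(query)
print(query_words[:3])   # ['ATCG', 'TCGT', 'CGTC']
print(len(query_words))  # 12
```

A sequence of length L yields L - w + 1 words, so the 15-letter query above gives 12 words of length 4.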

Step 2: find high scoring matches
Start with word 1: ATCG
Score identities as +1 and mismatches as 0
Find all words that have a score of at least T = 3
ATCA ATCT ATCC etc.
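The set of high scoring words for one query word can be enumerated by brute force when the alphabet and word length are small. The sketch below uses the scoring from the slide (+1 identity, 0 mismatch, T = 3); the function names are hypothetical, and real implementations avoid enumerating every possible word.

```python
from itertools import product


def word_score(a, b):
    """Score two equal-length words: +1 per identity, 0 per mismatch."""
    return sum(1 for x, y in zip(a, b) if x == y)


def neighborhood(word, threshold=3, alphabet="ACGT"):
    """Return all words of the same length scoring at least threshold against word."""
    candidates = ("".join(p) for p in product(alphabet, repeat=len(word)))
    return [c for c in candidates if word_score(word, c) >= threshold]


hits = neighborhood("ATCG")
print(len(hits))  # 13: the word itself plus 3 alternatives at each of 4 positions
print("ATCA" in hits, "ATCT" in hits, "ATCC" in hits)  # True True True
```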

Search for high scoring words in database
Find all sequences in the database that contain a high scoring word
For example: CCGGCTATCATCTATTCCCGGTTCG
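Scanning a database sequence for high scoring words can be sketched as a set lookup over each window. This is an illustration only: `find_word_hits` is a hypothetical name, and a production search would index the word list rather than rescan every sequence in Python.

```python
def find_word_hits(database_sequence, high_scoring_words):
    """Return (position, word) for each occurrence of a high scoring word."""
    lookup = set(high_scoring_words)
    if not lookup:
        return []
    w = len(next(iter(lookup)))  # assumes all words share one length
    hits = []
    for i in range(len(database_sequence) - w + 1):
        window = database_sequence[i:i + w]
        if window in lookup:
            hits.append((i, window))
    return hits


subject = "CCGGCTATCATCTATTCCCGGTTCG"
print(find_word_hits(subject, ["ATCA", "ATCT", "ATCC"]))
# [(6, 'ATCA'), (9, 'ATCT')]
```

Positions are 0-based offsets into the database sequence; each hit is a candidate seed for the extension step.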

Extend alignments around the matching word
CCGGCTATCATCTATTCCCGGTTCG
      ATCGTCTATTCCCGG
The yellow (highlighted) sequence alignment corresponds to the MSP with score S
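Ungapped extension from a word hit can be sketched as follows. Two things here are assumptions rather than content from the slides: mismatches score -1 (with the 0 used in Step 2 the running score would never fall, so extension would never stop early), and extension in each direction is abandoned once the running score drops more than a hypothetical `x_drop` below the best seen.

```python
def pair_score(x, y, match=1, mismatch=-1):
    """Assumed scoring for extension: +1 identity, -1 mismatch."""
    return match if x == y else mismatch


def extend_hit(query, subject, q_pos, s_pos, w, x_drop=2):
    """Extend a word hit without gaps in both directions.

    Returns (score, query_start, subject_start, length) of the best
    segment pair found around the seed.
    """
    seed = sum(pair_score(query[q_pos + k], subject[s_pos + k]) for k in range(w))

    def extend(q, s, step):
        """Walk outward from (q, s); return (best gain, positions kept)."""
        best = running = 0
        kept = steps = 0
        while 0 <= q < len(query) and 0 <= s < len(subject):
            running += pair_score(query[q], subject[s])
            steps += 1
            if running > best:
                best, kept = running, steps
            elif best - running > x_drop:
                break  # score has dropped too far below the best seen
            q += step
            s += step
        return best, kept

    right_gain, right_len = extend(q_pos + w, s_pos + w, +1)
    left_gain, left_len = extend(q_pos - 1, s_pos - 1, -1)
    score = seed + left_gain + right_gain
    return score, q_pos - left_len, s_pos - left_len, left_len + w + right_len


query = "ATCGTCTATTCCCGG"
subject = "CCGGCTATCATCTATTCCCGGTTCG"
# Query word ATCG (position 0) matched the high scoring word ATCA at subject position 6.
print(extend_hit(query, subject, 0, 6, 4))  # (13, 0, 6, 15)
```

Under these assumed scores the segment pair covers the whole query: 14 identities and one mismatch give 14 - 1 = 13.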

Lecture 3.1 11

Different Flavours of BLAST
BLASTP - protein query against protein DB
BLASTN - DNA/RNA query against GenBank (DNA)
BLASTX - six-frame translated DNA query against protein DB
TBLASTN - protein query against six-frame translated GenBank
TBLASTX - six-frame translated DNA query against six-frame translated GenBank
PSI-BLAST - protein profile query against protein DB
PHI-BLAST - protein pattern against protein DB
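The "six-frame translation" used by BLASTX, TBLASTN and TBLASTX means translating the DNA in three reading frames on each strand. A minimal sketch, assuming the standard genetic code and plain A/C/G/T input; the function names and the frame labels (+1 to +3, -1 to -3) are conventions chosen for this example.

```python
# Standard genetic code, listed in TCAG order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(dna):
    """Return the reverse complement of an A/C/G/T string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons; codons with unknown bases become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna):
    """Return a dict mapping frame label to protein string ('*' marks a stop)."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


print(six_frame_translations("ATGGCCTAA")["+1"])  # MA*
```

Each of the six protein strings is then compared as an ordinary protein sequence, which is why the translated searches do roughly six times the work of BLASTP for the translated side.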

Running NCBI BLAST
Click BLAST!

MT0895
MMKIQIYGTGCANCQMLEKNAREAVKELGIDAEFEKIKEMDQILEAGLTALPGLAVDGELKIMGRVASKEEIKKILS

Formatting Results

BLAST Format Options

BLAST Output

BLAST Parameters
Identities - number and % of exact residue matches
Positives - number and % of similar and identical matches
Gaps - number and % of gaps introduced
Score - summed HSP score (S)
Bit Score - a normalized score (S')
Expect (E) - expected number of chance HSP alignments
P - probability of getting a score > X
T - minimum word or k-tuple score (threshold)
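The relationship between the raw score S, the bit score S' and the Expect value can be sketched with the Karlin-Altschul formulas, S' = (lambda * S - ln K) / ln 2 and E = m * n * 2^(-S'). The values of lambda, K and the lengths below are illustrative assumptions, not figures from the lecture; in practice BLAST reports the parameters it used at the end of its output.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalize a raw score S into a bit score S'."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_value(bits, query_length, database_length):
    """Expected number of chance alignments scoring at least `bits`."""
    return query_length * database_length * 2.0 ** (-bits)


# Illustrative values only (assumed for this sketch):
lam, k = 0.318, 0.134
m, n = 250, 50_000_000  # query length, total database length

s_prime = bit_score(100, lam, k)
print(round(s_prime, 1), expect_value(s_prime, m, n))
```

Because the bit score already folds in lambda and K, bit scores from searches with different scoring systems can be compared directly, while raw scores cannot.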

BLAST - Rules of Thumb
Expect (E-value) is equal to the number of BLAST alignments with at least a given Score that are expected to be seen simply due to chance
Don't trust a BLAST alignment with an Expect score > 0.01 (the grey zone is between 0.01 and 1)
Expect and Score are related, but Expect contains more information
Note that % Identities is more useful than the bit Score
Recall Doolittle's Curve (%ID vs. Length, next slide): %ID > 30 - numres/50
If uncertain about a hit, perform a PSI-BLAST search

Doolittle's Curve
Evolutionary Distance vs. Percent Sequence Identity
[Figure: Sequence Identity (%), 0 to 100, plotted against Number of Residues, 0 to 400, with the Twilight Zone marked]

Statistics
Probability of observing a score greater than S
Probability of observing multiple MSPs with scores greater than S
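Both probabilities follow from treating the number of chance segment pairs scoring at least S as Poisson-distributed with mean E, the Expect value. The sketch below takes E as given; the function names are hypothetical.

```python
import math


def p_at_least_one(e_value):
    """P(one or more chance segment pairs score >= S) = 1 - exp(-E)."""
    return -math.expm1(-e_value)


def p_at_least(r, e_value):
    """P(r or more chance segment pairs score >= S), Poisson with mean E."""
    below = sum(
        math.exp(-e_value) * e_value ** i / math.factorial(i) for i in range(r)
    )
    return 1.0 - below


print(p_at_least_one(0.01))  # close to 0.01: for small E, P is approximately E
print(p_at_least(2, 0.01))   # two or more such pairs by chance is far less likely
```

For small E the P-value and E-value are nearly identical, which is why the two are often used interchangeably for strong hits.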

Extreme Value Distribution

Scoring Matrices
BLOSUM Matrices
Developed by Henikoff & Henikoff (1992)
BLOcks SUbstitution Matrix
Derived from the BLOCKS database
PAM Matrices
Developed by Schwartz and Dayhoff (1978)
Point Accepted Mutation
Derived from manual alignments of closely related proteins

Conclusions
BLAST is the most important program in bioinformatics (maybe all of biology)
BLAST is based on sound statistical principles (key to its speed and sensitivity)
A basic understanding of its principles is key for using/interpreting BLAST output
Use BLASTN or MEGABLAST for DNA
Use PSI-BLAST for protein searches