BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha

BLAST: Basic Local Alignment Search Tool. Altschul et al., J. Mol. Biol., 1990. CS 466, Saurabh Sinha

Motivation. Sequence homology to a known protein suggests the function of a newly sequenced protein. The bioinformatics task is to find sequences homologous to the query in a database of sequences. Sequence databases are growing fast.

Alignment. A natural way to check whether the query sequence is homologous to a sequence in the database is to compute the alignment score of the two sequences. The alignment score accounts for gaps (insertions, deletions) and replacements. Optimizing it corresponds to minimizing the evolutionary distance.

Alignment. Global alignment optimizes the overall similarity of the two sequences. Local alignment finds only relatively conserved subsequences. Local similarity measures are preferred for database searches, because distantly related proteins may share only isolated regions of similarity.

Alignment. Dynamic programming is the standard approach to sequence alignment. The algorithm is quadratic: its running time is proportional to the product of the lengths of the two sequences. This is not practical for searches against a very large database of sequences (e.g., a whole genome).
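
To make the quadratic cost concrete, here is a minimal sketch of a local-alignment dynamic program (in the style of Smith-Waterman). The match and mismatch values are the DNA scores used later in these slides; the linear gap penalty of -6 is an assumption chosen only for illustration.

```python
def local_alignment_score(a, b, match=5, mismatch=-4, gap=-6):
    """Best local alignment score of a and b (linear gap penalty).

    The nested loops visit every (i, j) cell once, so the running time
    is proportional to len(a) * len(b).
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(
                0,                  # start a new local alignment here
                prev[j - 1] + s,    # align a[i-1] with b[j-1]
                prev[j] + gap,      # gap in b
                cur[j - 1] + gap,   # gap in a
            )
            best = max(best, cur[j])
        prev = cur
    return best


score_identical = local_alignment_score("ACGT", "ACGT")
score_one_gap = local_alignment_score("ACGTACGT", "ACGTTACGT")
```

Running this against every sequence in a large database repeats the full table for each one, which is the cost BLAST is designed to avoid.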

Scoring alignments. Scoring matrix: a 4 x 4 matrix (DNA) or a 20 x 20 matrix (protein). Amino acid sequences: PAM matrix. Consider amino acid sequence alignments of very closely related proteins, extract replacement frequencies (probabilities), and extrapolate to greater evolutionary distances. DNA sequences: match = +5, mismatch = -4.

BLAST: the MSP. Given two sequences of the same length, the similarity score of their alignment (without gaps) is the sum of the similarity values for each pair of aligned residues. Maximal segment pair (MSP): the highest-scoring pair of identical-length segments from the two sequences. The similarity score of an MSP is called the MSP score. BLAST heuristically aims to find this.
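
As a reference point, the MSP score can be computed exactly by brute force. The sketch below is not what BLAST does; it scans every diagonal (every fixed offset between the two sequences) and keeps the best-scoring ungapped run, using the DNA scores match = +5, mismatch = -4. The function name is illustrative.

```python
def msp_score(a, b, match=5, mismatch=-4):
    """Exact MSP score of a and b by scanning every diagonal.

    A segment pair lies on one diagonal (constant offset i - j), so the
    best ungapped run on each diagonal is a maximum-sum contiguous run.
    """
    best = 0
    for offset in range(-(len(b) - 1), len(a)):
        i = max(offset, 0)
        j = i - offset
        running = 0
        while i < len(a) and j < len(b):
            s = match if a[i] == b[j] else mismatch
            running = max(0, running + s)  # drop a prefix that only hurts
            best = max(best, running)
            i += 1
            j += 1
    return best


msp_exact = msp_score("ACGT", "ACGT")
msp_bridged = msp_score("ACGTTACGT", "ACGTAACGT")
```

In the second call a single mismatch is bridged by the matches on both sides, so the best segment pair spans the whole diagonal.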

Locally maximal segment pair. A molecular biologist may be interested in all conserved regions shared by two proteins, not just their highest-scoring pair. A segment pair (segments of identical length) is locally maximal if its score cannot be improved by extending or shortening it in either direction. BLAST attempts to find all locally maximal segment pairs above some score cutoff.

Rapid approximation of MSP score. The goal is to report those database sequences that have an MSP score above some threshold S. Statistics tells us the highest threshold S at which chance similarities are still likely to appear. Tractability to statistical analysis is one of the attractive features of the MSP score.

Rapid approximation of MSP score. BLAST minimizes the time spent on database sequences whose similarity with the query has little chance of exceeding this cutoff S. Main strategy: seek only segment pairs (one segment from the database, one from the query) that contain a word pair with score >= T. Intuition: if a segment pair is to score above S, its best-matched word (of some predetermined small length) must score at least T. Lower T => fewer false negatives. Lower T => more pairs to analyze.

Implementation. 1. Compile a list of high-scoring words. 2. Scan the database for hits to this word list. 3. Extend hits.

Step 1: Compiling the list of words from the query sequence. For proteins: the list of all w-length words that score at least T when compared to some word in the query sequence. Question: does every word in the query sequence make it to the list? For DNA: the list of all w-length words in the query sequence, often with w = 12.
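
A minimal sketch of Step 1, under simplifying assumptions: it enumerates every possible w-length word and keeps those scoring at least T against some query word. The scoring function `toy_score` (+5 for identity, -4 otherwise) and the 4-letter alphabet are stand-ins chosen for illustration, not a real PAM matrix; all names here are hypothetical.

```python
from itertools import product


def neighborhood_words(query, w, T, alphabet, score):
    """Map each word scoring >= T against some query word to the
    query positions of the words it matched."""
    hits = {}
    for pos in range(len(query) - w + 1):
        qword = query[pos:pos + w]
        for cand in product(alphabet, repeat=w):
            total = sum(score(x, y) for x, y in zip(cand, qword))
            if total >= T:
                hits.setdefault("".join(cand), []).append(pos)
    return hits


def toy_score(x, y):
    # Assumed stand-in for a substitution matrix.
    return 5 if x == y else -4


# With w = 3 and T = 6, a word qualifies if it differs from a query
# word in at most one position (15 for identity, 6 with one mismatch).
words = neighborhood_words("ACGTAC", w=3, T=6, alphabet="ACGT", score=toy_score)
exact_only = neighborhood_words("ACGTAC", w=3, T=15, alphabet="ACGT", score=toy_score)
```

With a real substitution matrix the self-score of a word varies with its residues, so a query word made of low-scoring residues can fall below T and be left out of the list; the toy scores above cannot show that effect.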

Step 2: Scanning the database for hits. Find exact matches to the words in the list. This can be done in linear time, by either of two methods (next slides). Each word in the list points to the query positions it was matched to in the previous step.

Scanning the database for hits. Method 1: let w = 4, so there are $20^4$ possible words. Each integer in $0, \dots, 20^4 - 1$ is an index into an array. Each array element points to the list of all occurrences of that word in the query. Not all $20^4$ elements of the array are populated, only the ones in the word list from the previous step.
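
Method 1 can be sketched as a direct-address table: a word is read as a base-20 number and used as an array index. The residue ordering in `AMINO_ACIDS` and all function names are assumptions made for this illustration; the example uses w = 3 to keep the table small.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, assumed order
CODE = {aa: k for k, aa in enumerate(AMINO_ACIDS)}


def word_to_index(word):
    """Read a word as a base-20 integer in 0 .. 20**len(word) - 1."""
    idx = 0
    for aa in word:
        idx = idx * 20 + CODE[aa]
    return idx


def build_table(word_positions, w):
    """Array of size 20**w; only entries for listed words are populated."""
    table = [None] * (20 ** w)
    for word, positions in word_positions.items():
        table[word_to_index(word)] = positions
    return table


def scan_with_table(database_seq, table, w):
    """Return (query_pos, database_pos) for every hit."""
    hits = []
    for j in range(len(database_seq) - w + 1):
        positions = table[word_to_index(database_seq[j:j + w])]
        if positions:
            hits.extend((i, j) for i in positions)
    return hits


table = build_table({"ACD": [0], "CDE": [1]}, w=3)
table_hits = scan_with_table("WACDEW", table, w=3)
```

This sketch recomputes the index from scratch at each position and raises `KeyError` on non-standard residue codes; a production scanner would update the index incrementally and handle such characters.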

Scanning the database for hits. Method 2: use a deterministic finite automaton (finite state machine), similar to the keyword trees seen in the course. Build the finite state machine out of all the words in the word list from the previous step.
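
One way to realize Method 2 is a keyword tree with failure links (the Aho-Corasick construction), sketched below. This is an assumption about how the automaton could be built, not a description of BLAST's actual code; a fully tabulated DFA would precompute the failure transitions instead of following them at scan time.

```python
from collections import deque


def build_automaton(words):
    """Keyword tree plus failure links over the given words."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # longest proper suffix that is also a tree prefix
    out = [[]]    # words recognized on entering each state
    for word in words:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)
    # Breadth-first pass: a state's failure link only needs shallower states.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def scan_with_automaton(text, automaton):
    """Return (start_position, word) for every occurrence, in one pass."""
    goto, fail, out = automaton
    state = 0
    hits = []
    for pos, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            hits.append((pos - len(word) + 1, word))
    return hits


automaton = build_automaton(["ACG", "CGT", "GTA"])
automaton_hits = scan_with_automaton("TACGTAC", automaton)
```

The scan never moves backwards in the database text, which is what gives the linear-time behavior mentioned in Step 2.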

Step 3: Extending hits. Once a word pair with score >= T has been found, extend it in each direction, looking for a segment pair with score >= S. During extension, the score may go up, then down, and then up again. Terminate the extension if the score goes down too much (a certain distance below the best score found for shorter extensions). One implementation allows gaps during extension.
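
The extension rule can be sketched as follows for the ungapped case. The parameter name `x_drop` (the allowed fall below the best score seen so far) and the function names are assumptions for this illustration; the scores are the DNA values match = +5, mismatch = -4.

```python
def _extend(query, subject, i, j, step, score, x_drop):
    """Walk from (i, j) in direction step; return (best gain, its length)."""
    best = run = 0
    best_len = length = 0
    while 0 <= i < len(query) and 0 <= j < len(subject):
        run += score(query[i], subject[j])
        length += 1
        if run > best:
            best, best_len = run, length
        elif best - run > x_drop:
            break  # fell too far below the best shorter extension
        i += step
        j += step
    return best, best_len


def extend_hit(query, subject, qpos, spos, w, score, x_drop):
    """Extend a w-length hit at (qpos, spos) in both directions.

    Returns (score, query_start, query_end) with query_end exclusive.
    """
    seed = sum(score(query[qpos + k], subject[spos + k]) for k in range(w))
    right, rlen = _extend(query, subject, qpos + w, spos + w, 1, score, x_drop)
    left, llen = _extend(query, subject, qpos - 1, spos - 1, -1, score, x_drop)
    return seed + left + right, qpos - llen, qpos + w + rlen


def dna_score(x, y):
    return 5 if x == y else -4


query = "TTACGTACGTTT"
subject = "GGACGTACGTGG"
extended = extend_hit(query, subject, qpos=4, spos=4, w=3,
                      score=dna_score, x_drop=10)
```

The returned score would then be compared against the threshold S to decide whether the segment pair is reported.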

BLAST: approximating the MSP. BLAST may not find all segment pairs above the threshold S; it only tries to approximate the MSP. The bounds on the error are not hard bounds but statistical bounds: BLAST is highly likely to find the MSP.

Statistics. Suppose the MSP has been calculated by BLAST (and suppose this is the true MSP), and suppose this observed MSP scores S. What are the chances that the MSP score for two unrelated sequences would be >= S? If the chances are very low, then we can be confident that the two sequences are related.

Statistics. Given two random sequences of lengths m and n, what is the probability that they will produce an MSP score >= x?

Statistics. The number of separate SPs with score >= x is Poisson distributed with mean $y(x) = K m n \, e^{-\lambda x}$, where $\lambda$ is the positive solution of $\sum_{i,j} p_i p_j \, e^{\lambda s(i,j)} = 1$, K is a constant, $s(i,j)$ is the scoring matrix, and $p_i$ is the frequency of residue i in random sequences.
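
The equation for $\lambda$ has no closed form in general, but it can be solved numerically. The sketch below uses bisection and assumes uniform base frequencies with the DNA scores match = +5, mismatch = -4; the function names are illustrative. A positive solution requires a negative expected score and at least one positive score.

```python
import math


def solve_lambda(freqs, score, tol=1e-12):
    """Positive root of sum_ij p_i p_j exp(lambda * s(i, j)) = 1 by bisection."""
    letters = list(freqs)
    pairs = [(freqs[a] * freqs[b], score(a, b)) for a in letters for b in letters]
    expected = sum(p * s for p, s in pairs)
    if expected >= 0 or not any(s > 0 for _, s in pairs):
        raise ValueError("need a negative expected score and a positive score")

    def f(lam):
        return sum(p * math.exp(lam * s) for p, s in pairs) - 1.0

    # f(0) = 0 and f dips below zero just above 0, then grows without bound.
    hi = 1.0
    while f(hi) <= 0:
        hi *= 2
    lo = hi
    while f(lo) >= 0:
        lo /= 2
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2


def dna_score(x, y):
    return 5 if x == y else -4


uniform = {base: 0.25 for base in "ACGT"}
lam = solve_lambda(uniform, dna_score)
```

For these inputs the root lies between 0.19 and 0.20. The constant K is not computed here; obtaining it takes a separate calculation.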

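Statistics. Poisson distribution: $\Pr(x) = \dfrac{e^{-\lambda} \lambda^{x}}{x!}$ (here $\lambda$ denotes the mean of the Poisson distribution, which in our setting is $y$). Therefore $\Pr(\#\text{SPs} \ge \alpha) = 1 - \Pr(\#\text{SPs} \le \alpha - 1) = 1 - \sum_{i=0}^{\alpha-1} \dfrac{e^{-y} y^{i}}{i!} = 1 - e^{-y} \sum_{i=0}^{\alpha-1} \dfrac{y^{i}}{i!}$.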

Statistics. For $\alpha = 1$, $\Pr(\#\text{SPs} \ge 1) = 1 - e^{-y(x)}$. Choose S such that $1 - e^{-y(S)}$ is small. Suppose the probability of having at least 1 SP with score >= S is 0.001. This seems reasonably small. However, if you test 10000 random sequences, you expect 10 to cross the threshold. Therefore, require the E-value to be small. That is, the expected number of random sequence pairs with score >= S should be small.
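
The quantities on the last three slides can be evaluated directly. In the sketch below the values of K, $\lambda$, m and n are hypothetical placeholders chosen only to make the arithmetic visible; they are not taken from the BLAST paper. The function names are illustrative as well.

```python
import math


def expected_count(x, m, n, K, lam):
    """y(x) = K * m * n * exp(-lam * x): expected number of SPs scoring >= x."""
    return K * m * n * math.exp(-lam * x)


def prob_at_least(alpha, y):
    """Pr(#SPs >= alpha) for a Poisson count with mean y."""
    below = sum(math.exp(-y) * y ** i / math.factorial(i) for i in range(alpha))
    return 1.0 - below


def threshold_for(target, m, n, K, lam):
    """Score S at which the expected count y(S) equals target."""
    return math.log(K * m * n / target) / lam


K, lam = 0.1, 0.2          # hypothetical constants
m, n = 250, 1_000_000      # hypothetical sequence lengths

S = threshold_for(0.001, m, n, K, lam)
y = expected_count(S, m, n, K, lam)
p_one = prob_at_least(1, y)           # chance of >= 1 SP scoring >= S
expected_over_10000 = 10000 * p_one   # roughly 10, as on the slide
```

For small y the probability $1 - e^{-y}$ is almost equal to y itself, which is why a per-comparison probability of about 0.001 still yields about 10 chance hits over 10000 comparisons.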

More statistics. We just saw how to choose the threshold S. How do we choose T? BLAST is trying to find segment pairs (SPs) scoring above S. If an SP scores S, what is the probability that it will have a w-word match of score T or more? We want this probability to be high.

More statistics: Choosing T. Given a segment pair (from two random sequences) that scores S, what is the probability q that it will have no w-word match scoring at least T? We want q to be low. The value of q is obtained from simulations, and is found to decrease exponentially as S increases.

BLAST is one of the most widely used bioinformatics tools.

http://flybase.org/blast/