Heuristic methods for pairwise alignment: k-tuple methods


Unit 03c: Heuristic methods for pairwise alignment: k-tuple methods [Bi03c_1]

k-tuple methods for alignment of pairs of sequences [Bi03c_2]
- Dynamic programming is too slow for searching large databases.
- Instead, use heuristic methods based on shared subsequences, with only a small sacrifice of sensitivity:
  - FASTA
  - BLAST and Gapped BLAST

FASTA [Bi03c_3]
- Build a hash table of the short words of the query sequence (short = 2 to 6 characters).
- Go through the database and look up its words in the query hash table (computing time is linear in the size of the database).
- Score matching segments based on the content of these matches: first by the number of occurrences, second by whether the matches occur in the correct order.
[Figure: lookup table relating the query words (Word 0 ... Word N) to the database sequences (Seq0 ... SeqN); from Altman (1999)]

BLAST (Basic Local Alignment Search Tool) [Bi03c_4]
- Very heuristic, but the most successful method.
- Detailed description in: Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J. Mol. Biol. 1990; 215:403-410.
- Uses substitution matrices to compute scores, e.g. PAM120 for proteins; +5 for matching amino acids (aas), -5 for a mismatch.
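The FASTA word lookup described above can be illustrated with a small sketch. It is not the FASTA program itself: the function names, the word length k = 2 and the example sequences are assumptions chosen for illustration. The query words go into a hash table, the database sequence is scanned once, and hits are counted per diagonal (database position minus query position), so that several hits on one diagonal indicate shared words in the correct order.

```python
from collections import defaultdict


def build_word_table(query, k=2):
    """Map every k-letter word of the query to the positions where it starts."""
    table = defaultdict(list)
    for i in range(len(query) - k + 1):
        table[query[i:i + k]].append(i)
    return table


def diagonal_hits(table, subject, k=2):
    """Scan a database sequence once and count word matches per diagonal.

    The diagonal of a hit is (subject position - query position). Several
    hits on the same diagonal mean the shared words occur in the same order.
    """
    counts = defaultdict(int)
    for j in range(len(subject) - k + 1):
        # .get() avoids creating empty entries for words absent from the query
        for i in table.get(subject[j:j + k], ()):
            counts[j - i] += 1
    return dict(counts)


# Hypothetical example sequences (not from the lecture)
query = "HEAGAWGHEE"
subject = "PAWHEAE"

table = build_word_table(query, k=2)
hits = diagonal_hits(table, subject, k=2)
best = max(hits, key=hits.get)  # diagonal with the most word matches
print(hits, best)
```

In this example the words HE and EA both fall on diagonal 3, which therefore collects the most hits. The single pass over the database sequence is what makes the computing time linear in the size of the database.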

BLAST (Basic Local Alignment Search Tool), ctd. [Bi03c_5]
- Maximal segment pair (MSP): the maximum-scoring pair of identical-length segments chosen from two sequences.
- Locally maximal segment pair: a high-scoring pair (HSP) whose score cannot be improved by extending or shortening it.
- BLAST seeks all locally maximal HSPs with scores above some cutoff and ranks them.
- This yields a list of local alignments without gaps.

BLAST implementation [Bi03c_6]
The implementation is involved (see the article cited above). It has three steps:
1) Compile a list of high-scoring words (k-tuples scoring at least T when compared to the query sequence, e.g. using a PAM substitution matrix).
2) Scan the database for hits.
3) Extend the hits.
- Fast and most widely used.
- Open to several variations of algorithm, strategy and parameters (e.g. substitution matrices, threshold, word lengths).
- Setting options other than the defaults requires an understanding of the background concepts.
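The three steps can be sketched in a few lines. This is a toy illustration under stated assumptions, not the NCBI implementation: it uses the simple +5/-5 scoring mentioned earlier instead of a PAM matrix, the values k = 3 and T = 5 are arbitrary, and the `drop` cutoff that stops an extension is a hypothetical parameter introduced here.

```python
from itertools import product

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def score(a, b, match=5, mismatch=-5):
    """Toy scoring (assumption): +5 for identical residues, -5 otherwise."""
    return match if a == b else mismatch


def neighborhood(query, k=3, T=5):
    """Step 1: all k-letter words scoring at least T against some query word."""
    words = {}
    for i in range(len(query) - k + 1):
        q = query[i:i + k]
        for cand in product(ALPHABET, repeat=k):
            if sum(score(a, b) for a, b in zip(q, cand)) >= T:
                words.setdefault("".join(cand), []).append(i)
    return words


def scan(words, subject, k=3):
    """Step 2: list (query_pos, subject_pos) for every word hit."""
    return [(i, j)
            for j in range(len(subject) - k + 1)
            for i in words.get(subject[j:j + k], ())]


def _walk(query, subject, i, j, step, drop):
    """Extend from (i, j) in direction step; return (best gain, residues added)."""
    best = run = added = n = 0
    while 0 <= i < len(query) and 0 <= j < len(subject):
        run += score(query[i], subject[j])
        n += 1
        if run > best:
            best, added = run, n
        if best - run > drop:  # score fell too far below the best seen
            break
        i += step
        j += step
    return best, added


def extend(query, subject, i, j, k=3, drop=10):
    """Step 3: ungapped extension of a hit to the left and to the right.

    Returns (score, query_start, subject_start, length).
    """
    seed = sum(score(query[i + t], subject[j + t]) for t in range(k))
    left_gain, left = _walk(query, subject, i - 1, j - 1, -1, drop)
    right_gain, right = _walk(query, subject, i + k, j + k, +1, drop)
    return seed + left_gain + right_gain, i - left, j - left, left + k + right


# Hypothetical example sequences (not from the lecture)
query = "MKVLAAGIW"
subject = "TTKVLSAGIP"

hits = scan(neighborhood(query), subject)
best_hsp = max(extend(query, subject, i, j) for i, j in hits)
print(best_hsp)
```

With these inputs the best ungapped segment pair aligns KVLAAGI with KVLSAGI. Raising T shrinks the word list and speeds up the scan at the cost of sensitivity, which is the trade-off the parameter choices on the slide refer to.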

BLAST services [Bi03c_7]
BLAST is provided by NCBI (National Center for Biotechnology Information) as:
- WWW BLAST
- Stand-alone BLAST
- Network BLAST
- BLAST URL API (HTTP-encoded requests to the NCBI web server)
BLAST is also provided by many other institutions.

BLAST: types of programs & searches [Bi03c_8]
[Figure not shown (insert1.gif). For further information see http://www.ncbi.nlm.nih.gov/education/blastinfo/query_tutorial.html]

BLAST: databases to select [Bi03c_9]
[Figure not shown (insert2.gif). For further information see http://www.ncbi.nlm.nih.gov/education/blastinfo/query_tutorial.html]

BLAST: databases to select, ctd. [Bi03c_10]
[Figure not shown (insert3.gif). For further information see http://www.ncbi.nlm.nih.gov/education/blastinfo/query_tutorial.html]

BLAST: parameters to select [Bi03c_11]
Substitution matrix & gap penalties
- No single scoring scheme is best for all purposes; experience and an understanding of the background are necessary for an appropriate choice.
- Affine gap penalty for a gap of length g:
  $$\gamma(g) = -d - (g - 1)\,e$$
  where d = gap opening (existence) penalty and e = gap extension penalty.

Suggested combinations of substitution matrices and affine gap penalties:

Amino-acid substitution matrix   d (gap opening)   e (gap extension)
PAM30                             9                 1
PAM70                            10                 1
BLOSUM80                         10                 1
BLOSUM62                         11                 1
BLOSUM45                         15                 2

BLAST, options [Bi03c_12]
Substitution matrices, background: for the replacement of one amino acid by another,
- small effect: the replacement occurs often;
- large effect: the replacement occurs rarely.
[Figure: amino acids aa_i, aa_j, aa_k and the resulting substitution matrices]
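As a small worked example of the affine gap penalty, the sketch below evaluates gamma(g) = -d - (g - 1) e for the suggested (d, e) pairs. The dictionary and function names are illustrative assumptions. Tools differ in whether the first gap position is also charged the extension penalty; this sketch follows the formula as written in these notes.

```python
# (gap opening d, gap extension e) pairs from the table of suggested combinations
GAP_COSTS = {
    "PAM30": (9, 1),
    "PAM70": (10, 1),
    "BLOSUM80": (10, 1),
    "BLOSUM62": (11, 1),
    "BLOSUM45": (15, 2),
}


def gap_penalty(g, d, e):
    """Affine gap score gamma(g) = -d - (g - 1) * e for a gap of length g >= 1."""
    if g < 1:
        raise ValueError("gap length must be at least 1")
    return -d - (g - 1) * e


for matrix, (d, e) in GAP_COSTS.items():
    scores = [gap_penalty(g, d, e) for g in (1, 2, 5)]
    print(f"{matrix:9s} gaps of length 1, 2, 5 score {scores}")
```

Because opening a gap costs much more than extending it, one gap of length 5 is penalized far less than five separate gaps of length 1, which is the behavior an affine scheme is meant to produce.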

BLAST, options [Bi03c_13]
Substitution matrix & gap penalties
- No single scoring scheme is best for all purposes; experience and an understanding of the background are necessary for an appropriate choice.
- A given class of alignments is best distinguished from chance by the substitution matrix whose target frequencies characterize that class.
Should one BLAST proteins or rather nucleotides?
- Since more than one codon codes for a particular amino acid, BLASTing proteins is more reliable than BLASTing nucleotides for finding similarities between sequences.

BLAST, options [Bi03c_14]
(Some) more options
- Filtering low-complexity regions (yes/no):
  - SEG algorithm for protein BLAST
  - DUST algorithm for nucleotide BLAST
  - Beware of regions with highly biased amino-acid composition, e.g. ... A L M M M M M M L K M M M M M K M M M ...
  - Filtered regions appear as X's when a protein is aligned with itself, and as N's when a nucleotide sequence is aligned with itself.
- Selecting the WORD size:
  - default: 3 for protein BLAST
  - default: 11 for nucleotide BLAST
  - short words increase both sensitivity and computation time
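To make the idea of low-complexity filtering concrete, here is a deliberately simplified sketch. It is not the SEG or DUST algorithm: it masks every window whose Shannon entropy falls below a threshold, and the window length, the threshold and the function names are assumptions chosen for illustration.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Entropy in bits of the residue composition of a window."""
    counts = Counter(window)
    n = len(window)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def mask_low_complexity(seq, window=12, min_entropy=2.2, mask_char="X"):
    """Replace every residue inside a low-entropy window by mask_char.

    Simplified stand-in for SEG-style filtering (assumption), not SEG itself.
    """
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < min_entropy:
            for pos in range(start, start + window):
                masked[pos] = mask_char
    return "".join(masked)


biased = "ALMMMMMMLKMMMMMKMMM"    # the biased stretch quoted in the notes
diverse = "ACDEFGHIKLMNPQRSTVWY"  # every residue occurs exactly once

print(mask_low_complexity(biased))
print(mask_low_complexity(diverse))
```

The methionine-rich stretch is masked completely, while the compositionally diverse sequence passes through unchanged. Masked positions cannot seed word hits, so biased regions no longer produce spurious high scores.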

BLAST, interpretation of results [Bi03c_15]
Which score is high enough to be significant? "High" means high compared to the scores obtained by chance.

BLAST, interpretation of results, ctd. [Bi03c_16]
Number of hits by chance
Compute the expected number E of HSPs with a score of at least S if the query were compared to random sequences:
$$E = K\,m\,n\,e^{-\lambda S}$$
- m, n: sequence lengths of the query (m) and of the (whole) database (n)
- K, λ: factors (can be computed)
- S: raw score, obtained directly via the substitution matrix; it cannot be interpreted quantitatively on its own
The formula makes intuitive sense: E grows in proportion to the search space m·n and falls exponentially as the score S increases.
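The behavior of E = K m n e^(-λS) is easy to check numerically. In the sketch below the values of K and λ, the sequence lengths and the scores are placeholder assumptions for illustration only; real values depend on the scoring system in use.

```python
import math


def expected_hsps(S, m, n, K, lam):
    """Expected number of chance HSPs with raw score >= S: E = K*m*n*exp(-lam*S)."""
    return K * m * n * math.exp(-lam * S)


# Placeholder parameters (assumptions, not values from the lecture)
K, LAM = 0.13, 0.32
m, n = 250, 2_000_000  # query length, total database length

for S in (40, 60, 80):
    print(f"S = {S}: E = {expected_hsps(S, m, n, K, LAM):.3g}")

# Doubling the query length doubles E ...
ratio_length = expected_hsps(60, 2 * m, n, K, LAM) / expected_hsps(60, m, n, K, LAM)
# ... while raising S by ln(2)/lambda halves it.
ratio_score = (expected_hsps(60 + math.log(2) / LAM, m, n, K, LAM)
               / expected_hsps(60, m, n, K, LAM))
print(ratio_length, ratio_score)
```

The two ratios make the intuition explicit: E is linear in the size of the search space and exponential in the score.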

BLAST, interpretation of results, ctd. [Bi03c_17]
Compute the normalized score (believe it, or read Altschul et al. 1990):
$$S' = \frac{\lambda S - \ln K}{\ln 2} \quad \text{(bit score)}$$
$$E = \frac{m\,n}{2^{S'}}$$
E is the expected number of HSPs with bit score S'; it is listed in the BLAST output.

BLAST, interpretation of results, ctd. [Bi03c_18]
Probability of finding several (n_HSP) random HSPs:
$$p(n_{HSP}) = e^{-E}\,\frac{E^{n_{HSP}}}{n_{HSP}!}$$
p = probability of finding n_HSP HSPs if E HSPs are expected (i.e. with score S)
[Figure not shown: Distribution of HSPs.gif]
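A short numerical check shows that the bit score carries the same information as the raw score: E = m n / 2^S' agrees with E = K m n e^(-λS). The sketch also evaluates the Poisson probabilities for the number of chance HSPs. The values of K, λ, m, n and S are placeholder assumptions, and the function names are illustrative.

```python
import math


def bit_score(S, K, lam):
    """Normalized score S' = (lam*S - ln K) / ln 2."""
    return (lam * S - math.log(K)) / math.log(2)


def evalue_from_bits(bits, m, n):
    """E = m*n / 2**S' for bit score S'."""
    return m * n * 2.0 ** (-bits)


def prob_n_hsps(n_hsp, E):
    """Poisson probability of exactly n_hsp chance HSPs when E are expected."""
    return math.exp(-E) * E ** n_hsp / math.factorial(n_hsp)


# Placeholder parameters (assumptions, not values from the lecture)
K, LAM = 0.13, 0.32
m, n = 250, 2_000_000
S = 70

bits = bit_score(S, K, LAM)
E = evalue_from_bits(bits, m, n)
E_raw = K * m * n * math.exp(-LAM * S)  # same quantity from the raw score

print(f"bit score = {bits:.1f}, E = {E:.3g}, E from raw score = {E_raw:.3g}")
for k in range(3):
    print(f"p({k} HSPs) = {prob_n_hsps(k, E):.3g}")
```

Because K and λ are absorbed into S', bit scores from searches with different scoring systems can be compared directly, which raw scores do not allow.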

BLAST, interpretation of results, ctd. [Bi03c_19]
Compute the probability of finding at least one HSP:
$$p(n_{HSP} \ge 1) = 1 - p(n_{HSP} = 0) = 1 - e^{-E}\,\frac{E^{0}}{0!} = 1 - e^{-E}$$
The expected number of HSPs with score S is usually very small, so expanding the exponential gives
$$p(n_{HSP} \ge 1) = 1 - \left(1 + \frac{(-E)}{1!} + \frac{(-E)^{2}}{2!} + \dots\right) \approx E \quad \text{for } E \ll 1$$

BLAST, interpretation of results, ctd. [Bi03c_20]
Probability of finding at least one HSP: how accurate is the approximation?
[Figure: plot comparing the accurate formula p(n_HSP ≥ 1) = 1 - e^{-E} with the approximation p(n_HSP ≥ 1) ≈ E for E << 1]
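The accuracy of the approximation p ≈ E can be tabulated directly. The sketch below compares the exact expression 1 - e^(-E) with E for a few values chosen for illustration; the function names are assumptions.

```python
import math


def p_at_least_one(E):
    """Exact probability of at least one chance HSP: 1 - exp(-E)."""
    return -math.expm1(-E)  # expm1 keeps precision for very small E


def relative_error(E):
    """Relative error made by approximating the probability with E itself."""
    exact = p_at_least_one(E)
    return (E - exact) / exact


for E in (0.001, 0.01, 0.1, 1.0):
    print(f"E = {E:<6} exact p = {p_at_least_one(E):.6f}  "
          f"relative error of p ~ E: {relative_error(E):.2%}")
```

The leading error term is E/2, so the approximation is off by about half a percent at E = 0.01 and by about five percent at E = 0.1. For E near 1 the two quantities differ substantially and the exact formula should be used.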

BLAST, some more points to consider [Bi03c_21]
- The E-value of the above equation refers to a two-sequence alignment. For comparison with a whole database of sequences, E is adjusted:
  - Mode chosen in FASTA: E is multiplied by the number of sequences in the database.
  - Mode chosen in BLAST: the database is treated as a single long sequence, so E is multiplied by (total length of db)/(length of the database sequence).
- Strictly speaking, the E-value is valid only for ungapped alignments, but it proves adequate for gapped ones as well.
- Filter out low-complexity regions:
  - DUST algorithm for nucleotide sequences
  - SEG algorithm for protein sequences
  - (filtering is applied only to the query sequence, not to the database)

BLAST for the human HFE gene product (hemochromatosis protein) [Bi03c_22]

Human HFE gene, graphic display [Bi03c_23]

Human HFE gene, graphic display, detail for 1st splicing variant [Bi03c_24]

Protein for splicing variant 1 [Bi03c_25]

Get FASTA for protein [Bi03c_26]
data: Variant1FASTA.txt

Launch BLAST via Entrez [Bi03c_27]

Submit BLAST query [Bi03c_28]

Format BLAST result [Bi03c_29]

Add up: aligned conserved domains [Bi03c_30]

BLAST results, graphic [Bi03c_31]

BLAST results, graphic, ctd. [Bi03c_32]

Interpreting BLAST alignment [Bi03c_33]
data: blosum62.html

Correction items [Bi03c_34]
- Get article from Altschul.
- Page 18: figure import not OK.
- nhsp: set "HSP" as a subscript, also in the formula.
- Unit 3a, p. 22: insert links to ScoringMatrix2.html and scoringmatrices_tut.html.
- General: complete the contents (starting at database searches); insert screenshots for queries.