Similarity searches in biological sequence databases

Similarity searches in biological sequence databases Volker Flegel september 2004 Page 1

Outline
- Keyword search in databases
  - General concept
  - Examples
    - SRS: http://srs.ebi.ac.uk
    - Entrez: http://www.ncbi.nih.gov/entrez/index.html
    - Expasy: http://www.expasy.uniprot.org/search/textsearch.shtml
- Similarity searches in databases
  - Goal
  - Definitions
  - Alignment visualisation
  - Alignment algorithms
  - Examples
    - FASTA
    - BLAST and its gory details

Keyword search
Accessing database entries
- Each database uses its own specific access methods.
- Several kinds of search are possible, depending on the data stored: identification number (unique), authors, keywords, ...
Biological sequence databases
- Use a unique identification number to retrieve a specific sequence.
- This identification number must remain constant across database releases.
  - GenBank / EMBL / DDBJ: accession.version
  - Swiss-Prot: accession and id (note: the id may change)

GenBank entry example

LOCUS       AF455746_1        80 aa        PRI       08-JAN-2002
DEFINITION  ubiquitin-conjugating enzyme [Homo sapiens].
ACCESSION   AAL58874
PID         g18087414
VERSION     AAL58874.1  GI:18087414
DBSOURCE    locus AF455746 accession AF455746.1
KEYWORDS    .
SOURCE      human.
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (residues 1 to 80)
  AUTHORS   Poloumienko,A.
  TITLE     Exon-intron structure of the mammalian ubiquitin-conjugating
            enzyme (HR6A) genes
  JOURNAL   Unpublished
COMMENT     Method: conceptual translation supplied by author.
FEATURES             Location/Qualifiers
     source          1..80
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
                     /chromosome="X"
                     /cell_line="MCF-7"
     Protein         1..80
                     /product="ubiquitin-conjugating enzyme"
     CDS             1..80
                     /gene="HR6A"
                     /coded_by="join(AF455746.1:<1..64,AF455746.1:1057..1145,
                     AF455746.1:1594..>1680)"
ORIGIN
        1 teeypnkppt vrfvskmfhp nvyadgsicl dilqnrwspt ydvssiltsi qslldepnpn
       61 spansqaaql yqenkreyek
//

Swiss-Prot entry example

ID   UBCA_HUMAN     STANDARD;      PRT;   152 AA.
AC   P49459;
DT   01-FEB-1996 (Rel. 33, Created)
DT   01-FEB-1996 (Rel. 33, Last sequence update)
DT   16-OCT-2001 (Rel. 40, Last annotation update)
DE   Ubiquitin-conjugating enzyme E2-17 kDa (EC 6.3.2.19)
DE   (Ubiquitin-protein ligase) (Ubiquitin carrier protein) (HR6A).
GN   UBE2A.
OS   Homo sapiens (Human).
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC   Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
OX   NCBI_TaxID=9606;
RN   [1]
RP   SEQUENCE FROM N.A.
RX   MEDLINE=92020951; PubMed=1717990;
RA   Koken M.H.M., Reynolds P., Jaspers-Dekker I., Prakash L., Prakash S.,
RA   Bootsma D., Hoeijmakers J.H.J.;
RT   "Structural and functional conservation of two human homologs of the
RT   yeast DNA repair gene RAD6.";
RL   Proc. Natl. Acad. Sci. U.S.A. 88:8865-8869(1991).
(...)
DR   EMBL; M74524; AAA35981.1; -.
DR   HSSP; P25865; 2AAK.
DR   MIM; 312180; -.
DR   InterPro; IPR000608; UBQ_conjugat.
DR   Pfam; PF00179; UQ_con; 1.
DR   SMART; SM00212; UBCc; 1.
DR   PROSITE; PS00183; UBIQUITIN_CONJUGAT_1; 1.
DR   PROSITE; PS50127; UBIQUITIN_CONJUGAT_2; 1.
KW   Ubiquitin conjugation; Ligase; Multigene family.
FT   BINDING      88     88       UBIQUITIN (BY SIMILARITY).
SQ   SEQUENCE   152 AA;  17243 MW;  7A86173D5FAE6DE1 CRC64;
     MSTPARRRLM RDFKRLQEDP PAGVSGAPSE NNIMVWNAVI FGPEGTPFGD GTFKLTIEFT
     EEYPNKPPTV RFVSKMFHPN VYADGSICLD ILQNRWSPTY DVSSILTSIQ SLLDEPNPNS
     PANSQAAQLY QENKREYEKR VSAIVEQSWR DC
//

Similarity searches
Concept
- A similarity search is an (asymmetric) generalisation of a pairwise comparison.
- Pairwise alignment: query sequence vs. subject sequence
- Similarity search: query sequence vs. database
- Database vs. database: database vs. database

Theoretical considerations
Similar to those of pairwise comparison
- Sequence divergence is due to evolutionary mechanisms.
- Sequence similarity allows information to be extrapolated:
  - sequence history and origin
  - biological function
  - 3D structure
Alignment types
- Global: alignment between the complete sequence A and the complete sequence B
- Local: alignment between a sub-sequence of A and a sub-sequence of B
Computer implementation (algorithms)
- Dynamic programming
  - Global: Needleman-Wunsch
  - Local: Smith-Waterman

Problems to solve
Similarity search mechanism
- A pairwise comparison is done successively between the query and every sequence of the database.
Obstacles
- The complexity of the task is proportional to the size of the database.
- Extremely long running time of the search.
- Difficult biological interpretation of the results.
Solutions
- Reduce search time by using more powerful computers.
- Reduce search time by using newer and faster algorithms (heuristics).
- Sort and analyse the resulting alignments using statistical methods.

Definitions
- Query: the sequence that is being compared against the database.
- Subject: a sequence of the database that matches the query.
- Exact algorithm: guaranteed to find the best alignment, or at least one of the best in case of a tie.
- Heuristic algorithm: not guaranteed to find the best alignment, but good ones often do, and much more quickly than exact ones.

Some more definitions
- Identity: proportion of pairs of identical residues between two aligned sequences, generally expressed as a percentage. This value strongly depends on how the two sequences are aligned.
- Similarity: proportion of pairs of similar residues between two aligned sequences. Whether two residues are similar is determined by a substitution matrix. This value also depends strongly on how the two sequences are aligned, as well as on the substitution matrix used.
- Homology: two sequences are homologous if and only if they have a common ancestor. There is no such thing as a level of homology (it is either yes or no). Homologous sequences do not necessarily serve the same function, nor are they always highly similar: structure may be conserved while sequence is not.
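The dependence of identity on the alignment can be made concrete with a minimal sketch. The function name and the choice of denominator (all alignment columns, gap columns included) are assumptions made for illustration; other conventions divide by the shorter sequence or by the ungapped columns only.

```python
def percent_identity(aligned_a, aligned_b):
    """Percentage of alignment columns that hold two identical residues.

    Assumption: the denominator is the full alignment length, so columns
    containing a gap ('-') count as non-identical.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    identical = sum(
        1 for a, b in zip(aligned_a, aligned_b) if a == b and a != "-"
    )
    return 100.0 * identical / len(aligned_a)


# The same two sequences, aligned in two different ways,
# give two different identity values.
with_internal_gap = percent_identity("GA-TTA", "GAATTC")
with_trailing_gap = percent_identity("GATTA-", "GAATTC")
```

Here the first alignment has 4 identical columns out of 6 and the second only 3 out of 6, although the underlying sequences are unchanged.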

Alignment score
Amino acid substitution matrices
- Example: PAM250
- Most used: Blosum62
Raw score of an alignment

TPEA
APGA

Score = 1 + 6 + 0 + 2 = 9
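The raw score of the ungapped example can be reproduced with a short sketch. Only the four substitution values that appear in the slide's sum are included; the dictionary name and helper functions are hypothetical, and a real search would use a complete matrix.

```python
# Only the entries needed for the TPEA / APGA example, taken from the
# slide's sum 1 + 6 + 0 + 2. This is NOT a complete PAM250 matrix.
PAM250_SUBSET = {
    ("T", "A"): 1,
    ("P", "P"): 6,
    ("E", "G"): 0,
    ("A", "A"): 2,
}


def substitution_score(a, b, matrix):
    """Look up a symmetric substitution score for residues a and b."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]


def raw_score(seq_a, seq_b, matrix):
    """Sum of substitution scores over an ungapped alignment."""
    if len(seq_a) != len(seq_b):
        raise ValueError("ungapped alignment requires equal lengths")
    return sum(substitution_score(a, b, matrix) for a, b in zip(seq_a, seq_b))


score = raw_score("TPEA", "APGA", PAM250_SUBSET)  # 1 + 6 + 0 + 2
```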

Insertions and deletions
Gap penalties: gap opening and gap extension

Seq A  GARFIELDTHE----CAT
Seq B  GARFIELDTHELASTCAT

- Opening a gap penalizes the alignment score.
- Each extension of a gap penalizes the alignment score further.
- The gap opening penalty is in general higher than the gap extension penalty (simulating evolutionary behavior).
- The raw score of a gapped alignment is the sum of all amino acid substitution scores minus the gap opening and extension penalties.
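A minimal sketch of scoring a gapped alignment with separate opening and extension penalties follows. The numeric values (match 2, mismatch -1, opening 5, extension 1) are arbitrary assumptions, as is the convention that a gap of length L costs opening + L x extension; some tools charge opening + (L - 1) x extension instead.

```python
def gapped_raw_score(aligned_a, aligned_b,
                     match=2, mismatch=-1, gap_open=5, gap_extend=1):
    """Raw score of a gapped alignment under an affine gap penalty.

    Assumed convention: a gap of length L costs gap_open + L * gap_extend.
    A toy match/mismatch score stands in for a substitution matrix.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    previous_side = None  # which sequence the previous column's gap was in
    for a, b in zip(aligned_a, aligned_b):
        side = "a" if a == "-" else "b" if b == "-" else None
        if side is None:
            score += match if a == b else mismatch
        else:
            if side != previous_side:
                score -= gap_open      # a new gap starts here
            score -= gap_extend        # every gap column is charged
        previous_side = side
    return score


garfield = gapped_raw_score("GARFIELDTHE----CAT", "GARFIELDTHELASTCAT")
```

With these assumed values the Garfield alignment has 14 matching columns (14 x 2 = 28) and one gap of length 4 (5 + 4 x 1 = 9), for a raw score of 19.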

Alignment visualisation
Matrix - Text - Dotplot
- An alignment is a path through a graph.
- Dotplot: graphical view in 2 dimensions; a visual aid to identify regions of similarity.
- [Figure: dotplot of tissue-type plasminogen activator vs. urokinase-type plasminogen activator; matrix views of Seq A vs. Seq B showing two paths, with the corresponding text alignments A-CA-CA / ACCAAC- and ACA--CA / A-CCAAC]
- Address: www.isrec.isb-sib.ch/java/dotlet/dotlet.html

Optimal alignment extension
How to optimally extend an optimal alignment
- An optimal alignment up to positions i and j can be extended in 3 ways.
- Keeping the best of the 3 guarantees an extended optimal alignment.

Starting point: the optimal alignment of $a_1 a_2 a_3 \dots a_{i-1} a_i$ (Seq A) with $b_1 b_2 b_3 \dots b_{j-1} b_j$ (Seq B), with score $\text{Score}_{ij}$.

1. Align $a_{i+1}$ with $b_{j+1}$: $\text{Score} = \text{Score}_{ij} + \text{Subst}_{i+1,\,j+1}$
2. Align $a_{i+1}$ with a gap: $\text{Score} = \text{Score}_{ij} - \text{gap}$
3. Align a gap with $b_{j+1}$: $\text{Score} = \text{Score}_{ij} - \text{gap}$

We have the optimal alignment extended from i and j by one residue.

Exact algorithms (Needleman-Wunsch / Smith-Waterman)
Simple example (Needleman-Wunsch)
Scoring system: match score 2, mismatch score -1, gap penalty d = 2 (each gap position contributes -2)

           G    A    T    T    A
      0   -2   -4   -6   -8  -10
 G   -2    2    0   -2   -4   -6
 A   -4    0    4    2    0   -2
 A   -6   -2    2    3    1    2
 T   -8   -4    0    4    5    3
 T  -10   -6   -2    2    6    4
 C  -12   -8   -4    0    4    5

Each cell is computed from its three neighbours:

$$F(i,j) = \max \{\, F(i-1,j-1) + s(x_i,y_j),\; F(i-1,j) - d,\; F(i,j-1) - d \,\}$$

- F(i,j): score at position i, j
- s(x_i, y_j): match or mismatch score (or substitution matrix value) for residues x_i and y_j
- d: gap penalty (positive value)

Note: the cell for the first A/A pair, for example, gets its value from the diagonal: 2 + 2 = 4.

We have to keep track of the origin of the score for each element in the matrix. This allows the alignment to be built by traceback once the matrix has been completely filled out. Computation time is proportional to the product of the sequence lengths (n x m).

Resulting alignment:

GA-TTA
GAATTC
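The matrix fill and traceback can be sketched as follows, using the slide's scoring system (match 2, mismatch -1, gap penalty d = 2). The function name is hypothetical, and instead of storing the origin of each score the traceback re-checks which neighbour produced it, which is an implementation shortcut and not necessarily how the slide's version works.

```python
def needleman_wunsch(x, y, match=2, mismatch=-1, d=2):
    """Global alignment of x and y; returns (score, aligned_x, aligned_y)."""
    n, m = len(x), len(y)
    # F[i][j] = best score for aligning x[:i] with y[:j]
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = -d * i
    for j in range(1, m + 1):
        F[0][j] = -d * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # align x_i with y_j
                          F[i - 1][j] - d,       # x_i against a gap
                          F[i][j - 1] - d)       # y_j against a gap

    # Traceback from the bottom-right corner; ties prefer the diagonal.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if x[i - 1] == y[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay))


score, aligned_x, aligned_y = needleman_wunsch("GATTA", "GAATTC")
```

For GATTA and GAATTC the optimal score is 5, the bottom-right value of the matrix. Two alignments reach that score (the gap can sit before or after the first A of GATTA), so depending on tie-breaking a traceback may return either the slide's GA-TTA / GAATTC or the equivalent G-ATTA / GAATTC.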

Heuristic algorithms
Faster but less sensitive
- They use the dynamic programming approach, like exact algorithms.
- They try to limit its use to sequences which seem interesting.
- The heuristic part of the algorithm tries to make a clever guess at which sequences would produce an interesting alignment.
FASTA
- Developed by Lipman and Pearson in 1985.
- Tries to find sequences having identical words (k-tuples = k consecutive residues) in common on the same diagonal.
- Compares the query sequentially to all those sequences in the database.
Blast
- Developed by Altschul et al. in 1990.
- The most used and cited bioinformatics tool in biology.
- Online tutorial: www.ncbi.nlm.nih.gov/education/blastinfo/tut1.html
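The k-tuple idea behind FASTA can be illustrated with a small sketch that counts shared words per diagonal. It covers only this first step, not the later rescoring and joining stages of FASTA, and the function name and example sequences are made up for illustration.

```python
from collections import defaultdict


def ktuple_diagonal_hits(query, subject, k=3):
    """Count identical k-tuples shared by query and subject, per diagonal.

    A diagonal is identified by (subject position - query position);
    many hits on one diagonal suggest an ungapped region of similarity.
    """
    # Index every k-tuple of the query by its start position(s).
    positions = defaultdict(list)
    for i in range(len(query) - k + 1):
        positions[query[i:i + k]].append(i)

    # Scan the subject and record the diagonal of every shared k-tuple.
    diagonals = defaultdict(int)
    for j in range(len(subject) - k + 1):
        for i in positions.get(subject[j:j + k], []):
            diagonals[j - i] += 1
    return dict(diagonals)


hits = ktuple_diagonal_hits("GATTACA", "TTGATTACAGG", k=3)
```

In this toy case all five 3-tuples of the query occur in the subject with the same offset of 2, so a single diagonal collects every hit; a database sequence with hits scattered over many diagonals would look much less interesting.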

A Blast for each query
Different programs are available according to the type of query

Program   Query                                     Database
blastp    protein                              VS   protein
blastn    nucleotide                           VS   nucleotide
blastx    nucleotide (translated to protein)   VS   protein
tblastn   protein                              VS   nucleotide (translated to protein)
tblastx   nucleotide (translated to protein)   VS   nucleotide (translated to protein)
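The table can be read as a simple lookup, sketched below. The function and its arguments are hypothetical helpers for illustration and are not part of any Blast distribution; only the five program names come from the slide.

```python
# (query type, database type, level at which sequences are compared)
BLAST_PROGRAMS = {
    ("protein", "protein", "protein"): "blastp",
    ("nucleotide", "nucleotide", "nucleotide"): "blastn",
    ("nucleotide", "protein", "protein"): "blastx",
    ("protein", "nucleotide", "protein"): "tblastn",
    ("nucleotide", "nucleotide", "protein"): "tblastx",
}


def choose_blast_program(query_type, database_type, compare_as):
    """Return the Blast program name for a query/database combination.

    compare_as is 'protein' when nucleotide sequences should be translated
    before comparison, and 'nucleotide' for a direct DNA-level comparison.
    """
    key = (query_type, database_type, compare_as)
    if key not in BLAST_PROGRAMS:
        raise ValueError(f"no Blast program for combination {key}")
    return BLAST_PROGRAMS[key]


program = choose_blast_program("nucleotide", "protein", "protein")
```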

Access to Blast
Web access: numerous web sites offer access to Blast servers
NCBI (USA), where the Blast program was created
- Provides access to all Blast options and numerous databases.
- User interface not very intuitive.
- URL: www.ncbi.nlm.nih.gov/blast
EMBnet (the Swiss node is located in Lausanne at the SIB)
- Several servers across the world.
- Provides access to all Blast options.
- Provides a simplified and an advanced user interface.
- Wide choice of databases.
- URL: www.ch.embnet.org/software/bblast.html (simple user interface)
- URL: www.ch.embnet.org/software/ablast.html (advanced user interface)

Blast: the gory details
Blast algorithm: creating a list of similar words
- The query is cut into words (e.g. REL, RSL, LKP).
- Each query word is compared with the list of all possible words with 3 amino acid residues (AAA, AAC, AAD, ..., YYY).
- A substitution matrix is used to compute the word scores.
- Words with a score < T are discarded; words with a score > T are kept.
- Result: the list of words matching the query with a score > T (e.g. LKP, ACT, ..., RSL, TVF).
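The word-list step can be sketched as a brute-force enumeration of all 3-residue words. To keep the example self-contained, a toy scoring function (identical residues 4, different residues -1) stands in for a real substitution matrix, and the threshold T = 6 is an arbitrary assumption; the function names are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_score(a, b):
    """Stand-in for a substitution matrix lookup (assumption, not BLOSUM62)."""
    return 4 if a == b else -1


def neighborhood_words(query, word_size=3, threshold=6, score=toy_score):
    """Map every word scoring > threshold against some query word
    to the query positions of the words it matches."""
    neighbors = {}
    for pos in range(len(query) - word_size + 1):
        query_word = query[pos:pos + word_size]
        for letters in product(AMINO_ACIDS, repeat=word_size):
            word_score = sum(score(a, b) for a, b in zip(query_word, letters))
            if word_score > threshold:
                neighbors.setdefault("".join(letters), []).append(pos)
    return neighbors


words = neighborhood_words("RSL")
```

With the toy scores, a word passes the threshold only if it shares at least two positions with the query word (4 + 4 - 1 = 7 > 6), so the single query word RSL yields 1 + 3 x 19 = 58 neighbours. A real matrix would instead admit conservative substitutions at any position.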

Blast: the gory details
Blast algorithm: eliminating sequences without word hits
- Input: the list of words matching the query with a score > T (e.g. ACT, ..., RSL, TVF).
- The database sequences are searched for exact matches to these words.
- Result: the list of sequences containing words similar to the query (hits); sequences without any word hit are eliminated.
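A minimal sketch of the scanning step: each database sequence is checked for exact occurrences of the listed words, and sequences without any hit are dropped. The word list, the query positions attached to it, and the database contents are hypothetical.

```python
def find_word_hits(word_positions, database, word_size=3):
    """Return {sequence name: [(query_pos, subject_pos), ...]} for every
    database sequence containing at least one word from the list.

    word_positions maps each listed word to its position(s) in the query.
    """
    hits = {}
    for name, sequence in database.items():
        found = []
        for subject_pos in range(len(sequence) - word_size + 1):
            word = sequence[subject_pos:subject_pos + word_size]
            for query_pos in word_positions.get(word, ()):
                found.append((query_pos, subject_pos))
        if found:  # sequences without word hits are eliminated
            hits[name] = found
    return hits


# Hypothetical word list and database, for illustration only.
word_list = {"ACT": [0], "RSL": [3], "TVF": [3]}
database = {"seq1": "MKRSLAA", "seq2": "GGGGGG", "seq3": "TVFACT"}
hits = find_word_hits(word_list, database)
```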

Blast: the gory details (The End)
Blast algorithm: extension of hits
- Ungapped extension if 2 "hits" are on the same diagonal (query vs. database sequence) but at a distance less than A.
- Extension using dynamic programming, limited to a restricted region.
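The two-hit condition can be sketched as a grouping of hits by diagonal. The function name and example coordinates are hypothetical, and details of real implementations (such as requiring the two words not to overlap) are left out of this sketch.

```python
from collections import defaultdict


def two_hit_seeds(hits, max_distance):
    """Find pairs of hits on the same diagonal closer than max_distance.

    hits is a list of (query_pos, subject_pos) word hits in one database
    sequence; max_distance plays the role of A on the slide. Returns a list
    of (diagonal, first_query_pos, second_query_pos) seeds for extension.
    """
    by_diagonal = defaultdict(list)
    for query_pos, subject_pos in hits:
        by_diagonal[subject_pos - query_pos].append(query_pos)

    seeds = []
    for diagonal, query_positions in by_diagonal.items():
        query_positions.sort()
        # Compare each hit with the next one along the same diagonal.
        for first, second in zip(query_positions, query_positions[1:]):
            if 0 < second - first < max_distance:
                seeds.append((diagonal, first, second))
    return seeds


example_hits = [(0, 10), (8, 18), (3, 40), (50, 90)]
seeds = two_hit_seeds(example_hits, max_distance=40)
```

Only the first two hits share a diagonal (offset 10) and lie 8 positions apart, so they form the single seed; the isolated hits on the other diagonals would not trigger an extension.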

Statistical evaluation of results
Alignments are evaluated according to their score
Raw score
- It is the sum of the amino acid substitution scores and gap penalties (gap opening and gap extension).
- Depends on the scoring system (substitution matrix, etc.).
- Different alignments should not be compared based only on the raw score.
Normalised score
- Is independent of the scoring system.
- Allows the comparison of different alignments.
- Units: expressed in bits.
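The slides do not give the conversion itself; in the Karlin-Altschul framework used by Blast, the bit score is S' = (lambda x S - ln K) / ln 2, where lambda and K depend on the scoring system. The sketch below applies that formula; the numeric lambda and K values are illustrative assumptions and should be taken from the search output in practice.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalise a raw score S to bits: (lam * S - ln K) / ln 2.

    lam and k are the statistical parameters of the scoring system.
    """
    return (lam * raw_score - math.log(k)) / math.log(2)


# Illustrative parameter values only (assumed, not taken from the slides).
ASSUMED_LAMBDA = 0.267
ASSUMED_K = 0.041

bits = bit_score(100, ASSUMED_LAMBDA, ASSUMED_K)
```

Because lambda and K absorb the scale of the substitution matrix and gap penalties, two alignments scored with different systems become comparable once both are expressed in bits.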

Statistical evaluation of results
Statistics derived from the scores
p-value (scale: 0% to 100%)
- Probability that an alignment with this score occurs by chance in a database of this size.
- The closer the p-value is to 0, the better the alignment.
e-value (scale: 0 to N)
- Number of matches with this score one can expect to find by chance in a database of this size.
- The closer the e-value is to 0, the better the alignment.
Relationship between e-value and p-value
- In a database containing N sequences: e = p x N
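The slide's relation e = p x N can be applied directly, as in this small sketch. The hit names, p-values and database size are invented for illustration.

```python
def e_value(p_value, n_sequences):
    """Expected number of chance matches, using the relation e = p x N."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("a p-value must lie between 0 and 1")
    return p_value * n_sequences


# Hypothetical hits and database size (made up for this example).
p_values = {"hit_a": 1e-9, "hit_b": 2e-4, "hit_c": 5e-6}
N = 100_000

# Rank hits from most to least significant (smallest e-value first).
ranked = sorted((e_value(p, N), name) for name, p in p_values.items())
```

A p-value that looks small in isolation (2e-4 for hit_b) still corresponds to about 20 expected chance matches in a database of this size, which is why the e-value is the more practical number to read.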

Low complexity regions
- Regions with a high frequency of only a few types of residues (= low complexity regions) may produce high scoring but biologically uninteresting alignments, e.g. polyserine.
- Such regions are, by default, filtered out by Blast.
- They appear masked with 'X' in the alignment.
- They are not taken into account for score computation.
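The effect of masking can be imitated with a simple sliding-window entropy filter. This is a simplistic stand-in and not the filter Blast itself uses; the window size, entropy cutoff and example sequence are assumptions chosen for the illustration.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Shannon entropy (in bits) of the residue composition of a window."""
    total = len(window)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(window).values())


def mask_low_complexity(sequence, window=8, min_entropy=1.5):
    """Replace every residue lying in a low-entropy window with 'X'."""
    masked = list(sequence)
    for start in range(len(sequence) - window + 1):
        if shannon_entropy(sequence[start:start + window]) < min_entropy:
            for pos in range(start, start + window):
                masked[pos] = "X"
    return "".join(masked)


# A made-up protein with a polyserine stretch in the middle.
protein = "MKTAYIEQRL" + "S" * 14 + "DEGHLNPVWC"
masked_protein = mask_low_complexity(protein)
```

Because whole windows are masked, the 'X' run extends two residues past each end of the polyserine stretch here; dedicated filters trim such boundaries more carefully.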

Basic Blast on EMBnet
www.ch.embnet.org/software/bblast.html
- Select the type of query.
- Select the nucleotide database to search with either blastn, tblastn or tblastx.
- Select the protein database to search with either blastp or blastx.
- Select the substitution matrix to use.
- Select your input type: either a raw sequence, or an accession or id number together with the database from which Blast should retrieve your query.

Advanced Blast on EMBnet
www.ch.embnet.org/software/ablast.html
- Greater choice of databases to search.
- Advanced Blast parameter modification.

Search results
- Graphical visualisation and description of alignment scores.

Search results
Alignment example
- Normalised score, raw score and e-value.
- Percentage of identical aligned residues; percentage of aligned residues having a positive score in the substitution matrix.
- Alignment (local) between the query and the database sequence. The middle line shows whether a residue is conserved or not.
- A low complexity region is masked with a series of 'X'.

Search results
Search details (at the bottom of the results)
- Size of the database searched.
- Scoring system parameters.
- Details about the number of hits found.

Conclusions
Blast: the most used database search tool
- Fast and very reliable, even for a heuristic algorithm.
- Does not necessarily find the best alignment, but most of the time it finds the best matching sequences in the database.
- Easy to use with default parameters.
- Solid statistical framework for the evaluation of scores.
but...
- The biologist's expertise is still essential to the analysis of the results!
Tips and tricks
- For coding sequences, always search at the protein level.
- Mask low complexity regions.
- Use a substitution matrix adapted to the expected divergence of the searched sequences (nevertheless, most of the time BLOSUM62 works well).
- If there are only matches to a limited region of your query, cut out that region and rerun the search with the remaining part of your query.