Blast2GO Teaching Exercises SOLUTIONS

Ana Conesa and Stefan Götz
2012 BioBam Bioinformatics S.L. Valencia, Spain

Contents
1 Annotate 10 sequences with Blast2GO
2 Perform a complete annotation with Blast2GO
3 Creating GO-DAGs and Pies
4 Enrichment Analysis with Blast2GO - FatiGO
5 Functional Analysis/Data Mining

1 Annotate 10 sequences with Blast2GO

(Please note that results may vary slightly depending on the parameters used and on the database versions!)

1.1 Annotate 10 sequences with Blast2GO

BLAST against NCBI nr database: Check the progress of the BLAST run on the Application messages tab. How long does it take to complete? Are all sequences successfully blasted?
It should take only 2-3 minutes. All sequences are successfully blasted and are now orange.

Launch mapping: That's just a click.

Browsing Blast results: Place the cursor on one sequence and right-click. The single sequence menu appears. When you select the Show Blast Results option, the Blast results tab is filled with BLAST information. Double-click on the upper bar to enlarge the window and to enable the scroll bars. You can visualize the different BLAST hits and their percentage of similarity, number of HSPs, reading frame, etc.

GO graph: On the single sequence menu, click on Draw Graph of Mapping-Results with highlighted annotations. The graph appears with the annotation score of each GO term. Annotating terms are the most specific terms of each branch that surpass the annotation score threshold (default = 55).

Export top-blast data: Here, only information on the best BLAST hit is given.

How many GO terms have you fetched for each sequence?

Sequence        1   2   3   4   5   6   7   8   9   10
Number of GOs   13  6   3   10  2   2   11  7   2   2

Annotate the sequences. How many GO terms do you obtain for each sequence?
Here is the table with annotation results:

Sequence        1   2   3   4   5   6   7   8   9   10
Number of GOs   5   2   1   7   -   -   8   2   -   -

1.2 Let's check some annotations in more detail

Obtain the annotation DAG of Sequence 7 (single sequence menu). Interpret and save the molecular function graph (see Figure 1).
In this graph you can see the annotation score for all candidate GO terms (GO terms with description attached). The selected GO terms (octagonal boxes) are those that surpass the annotation threshold and are the most specific terms in the branch.
There are other terms that are above the threshold but do not appear in the annotation because there are more specific terms that are also above the cutoff value.

Re-annotate sequences 1 and 8 at an annotation threshold of 80. How does the annotation change?
The steps to re-annotate are:
- De-select all sequences at the sequences check box.
- Select sequences 1 and 8.
- Go to Annotation and select Reset Annotation.
- Run the annotation step again with only these 2 sequences selected.
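The selection rule described in section 1.2 (keep the terms at or above the annotation threshold, then drop any of them that has a more specific term also above the threshold) can be sketched in a few lines of Python. This is only an illustration of that rule, not Blast2GO's implementation: the term hierarchy and the scores below are invented toy values, and `select_annotations`, `PARENTS` and `SCORES` are hypothetical names.

```python
# Toy GO fragment: child term -> list of parent terms (illustrative, not real data).
PARENTS = {
    "catalytic activity": ["molecular_function"],
    "transferase activity": ["catalytic activity"],
    "binding": ["molecular_function"],
    "nucleotide binding": ["binding"],
    "ATP binding": ["nucleotide binding"],
}

# Invented annotation scores for the candidate terms of one sequence.
SCORES = {
    "molecular_function": 100,
    "catalytic activity": 100,
    "transferase activity": 100,
    "binding": 95,
    "nucleotide binding": 90,
    "ATP binding": 60,
}


def select_annotations(scores, parents, threshold=55):
    """Return the terms scoring >= threshold that have no descendant
    which also scores >= threshold (the most specific passing terms)."""
    passing = {term for term, score in scores.items() if score >= threshold}
    shadowed = set()  # ancestors of any passing term
    for term in passing:
        stack = list(parents.get(term, ()))
        while stack:
            parent = stack.pop()
            if parent not in shadowed:
                shadowed.add(parent)
                stack.extend(parents.get(parent, ()))
    return passing - shadowed


print(sorted(select_annotations(SCORES, PARENTS, threshold=55)))
print(sorted(select_annotations(SCORES, PARENTS, threshold=80)))
```

With these toy scores, raising the threshold from 55 to 80 removes "ATP binding" from the result and the more general "nucleotide binding" takes its place, which is the kind of change the re-annotation exercise asks you to look for.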

[Figure 1: GO mapping and annotation of seq 7. Single sequence graph of Seq7 showing molecular function terms with their annotation scores; GO IDs shown in the graph: GO:0016740, GO:0000166, GO:0004478, GO:0046872, GO:0005524.]

Sequence 1 now has only 2 GO terms (3 fewer) and sequence 8 now has only 1 GO term. Both sequences lost information: some terms disappeared, others changed to more general ones.

There are a number of sequences with mapping but without annotation. What happened? Try to annotate them manually. Tip: go to the Blast results of these sequences to learn about them and decide on the functions you would give to these sequences. Go to the Gene Ontology resource www.geneontology.org and look for appropriate GO terms. Add these manually to the sequences and mark them as manually annotated.
These sequences do not have annotation because the obtained terms are root GOs. By browsing the blast results some functions can be proposed:
Sequence 5: GO:0016021, integral to membrane
Sequence 10: GO:0016020, membrane

1.3 Let's augment/modify the annotations

Get InterPro annotation for these sequences. How long does it take?

It takes about 5 minutes for 10 sequences. 8 sequences obtain InterProScan results. Only 5 of them are linked to GO terms.

Merge InterPro results with the existing (blast-based) annotations (AnnotScore=55). How much does your annotation improve?
After merging, 2 GO terms are added and 0 GO terms are removed for being too general. Now 27 terms are assigned to our sequences. 38 (redundant) GO terms obtained through InterPro could be used to confirm existing ones. In this example, 0 InterPro-based GO terms were more general than the ones already assigned.

[Figure 2: InterProScan results. Bar chart "Merge InterPro Annotation Results"; y-axis: number of annotations; bars: before, after, confirmed, too general.]

Run Annex on these sequences. How does your annotation improve?
After Annex, several new GO terms have been obtained based on already existing molecular function terms, some terms are replaced by more specific ones and some others are confirmed.

[Figure 3: Annex results. Bar chart "Annex Results"; y-axis: number of annotations; bars: previous, actual, new, replaced, confirmed.]

Get KEGG maps for these sequences. For how many sequences do you obtain KEGG results?
There are 3 sequences for which an Enzyme Code has been obtained. These codes map to 10 KEGG metabolic pathways. The enzyme position in the KEGG pathway is highlighted, each enzyme with a different color.

Get the GOSlim of these sequences. How many GO terms do you have now?
Here is the table with annotation results:

Sequence        1   2   3   4   5   6   7   8   9   10
Before GOSlim   5   2   1   7   -   -   8   4   -   -
After GOSlim    6   5   2   9   -   -   8   4   2   2

GOSlim annotations are not necessarily fewer in number than the normal GO annotations at the single sequence level, but the diversity of GO terms is reduced (see Figure 4).
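The categories reported for the InterPro merge step in this section (terms added, terms that confirm existing ones, terms that are too general) can be illustrated with a small Python sketch. It shows one plausible reading of those categories and is not taken from Blast2GO: a term is "confirmed" if it is already assigned, "too general" if it is an ancestor of an assigned term, and "added" otherwise. The hierarchy and all names (`PARENTS`, `ancestors`, `classify_merge`) are hypothetical.

```python
# Toy GO fragment: child term -> list of parent terms (illustrative, not real data).
PARENTS = {
    "catalytic activity": ["molecular_function"],
    "transferase activity": ["catalytic activity"],
    "kinase activity": ["transferase activity"],
    "binding": ["molecular_function"],
    "nucleotide binding": ["binding"],
    "ATP binding": ["nucleotide binding"],
}


def ancestors(term, parents):
    """Return every term reachable by following parent links from `term`."""
    seen = set()
    stack = list(parents.get(term, ()))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(parents.get(parent, ()))
    return seen


def classify_merge(existing, incoming, parents):
    """Sort incoming (e.g. InterPro-derived) terms into three groups relative
    to the existing annotation of one sequence."""
    existing = set(existing)
    implied = set()  # terms already implied by a more specific existing term
    for term in existing:
        implied |= ancestors(term, parents)
    groups = {"confirmed": set(), "too_general": set(), "added": set()}
    for term in incoming:
        if term in existing:
            groups["confirmed"].add(term)
        elif term in implied:
            groups["too_general"].add(term)
        else:
            groups["added"].add(term)
    return groups


RESULT = classify_merge(
    existing={"ATP binding"},
    incoming=["ATP binding", "binding", "kinase activity"],
    parents=PARENTS,
)
for name, terms in RESULT.items():
    print(name, sorted(terms))
```

In this toy case "ATP binding" is confirmed, "binding" is too general because the sequence already carries the more specific "ATP binding", and "kinase activity" is new information that would be added.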

Export annotation results in different formats (.annot, GeneSpring, Sequence Table and BestHit). Open these files with OpenOffice SpreadSheet. Which format do you like the most?
Every format has a function. The GeneSpring format is good for understanding results, while the .annot format is appropriate for performing calculations and for importing results into other applications. The table formats are also useful for browsing annotations.

1.4 Extra exercise: Merge two .dat files

Save the B2G project in two separate files. Close the project and join the 2 .dat files again with B2G.
The steps are:
- Save project as result.dat
- De-select all sequences at the sequence check box
- Select the last 5 sequences
- Go to Select menu and delete selected sequences
- Save as result1.dat
- Close project
- Load result.dat
- De-select the last 5 sequences
- Go to Select menu and delete selected sequences
- Save as result2.dat
- Close project
- Load result1.dat
- Go to Tools and select Add .dat to existing project, and then select result2.dat
- The original annotation project is restored

[Figure 4: GOSlim graph. GOSlim combined graph of biological process terms, with the number of sequences shown per node.]

2 Perform a complete annotation with Blast2GO

(Please note that results may vary slightly depending on the parameters used and on the database versions!)

2.1 Annotation of 1100 sequences with Blast2GO

e-values and similarities: The e-value ranges from 1E-3 to 1E-130. Most sequences have an e-value between 1E-10 and 1E-70. The sequence similarity ranges from approx. 40% to approx. 95%, then it drops. We can also observe a peak at 100%, which could be self-hits or sequence patterns of 100% similarity.

[Figure 5: E-value distribution. x-axis: E-value (1e-X); y-axis: hits.]
[Figure 6: Similarity distribution. x-axis: #positives/alignment-length; y-axis: hits.]

Mapping: The majority of sequences have annotations inferred from electronic annotation; even so, approx. 350 sequences also have annotations inferred from direct assays. The next two evidence codes are also non-experimental: inferred from computational analysis and inferred from sequence similarity. Looking at the source databases, we can observe that the majority of annotations are obtained from the UniProt Knowledge Base. The mean GO level is 5.4 and approx. 2800 annotations could be assigned.

2.2 Augment annotation via InterPro and Annex

About 600 sequences have an InterProScan result and about 30% of them could be linked to a GO term. However, (for this dataset) only 8 additional sequences could be annotated through InterPro domains. Through Annex, the number of annotations increased from about 3500 to over 4100 by adding complementary terms derived from the existing molecular functions.

[Figure 7: Distribution of GO term levels for each GO category (P, F, C). x-axis: GO level; y-axis: # annotations. Total annotations = 2839, mean level = 5.391, std. deviation = 1.778.]

2.3 Try different annotation strategies

We observed a drastic decrease in the number of assigned annotations when several evidence codes were excluded by the more restrictive setting: only about 10% of the sequences could be successfully annotated. Conversely, with the permissive setting we obtained over 50% of annotated sequences. Compared to the annotation with default parameters, this gave annotations for only 70 more sequences (see Figure 8).

[Figure 8: Annotation results. Panels: (a) default parameters, (b) permissive parameters, (c) restrictive parameters.]

By generating the GO level distribution chart we can see that the number of annotated sequences is much lower for the restrictive mode. It can also be observed that the mean annotation level stayed more or less the same. On closer inspection, the Cellular Component category seems to be the least affected by the restrictive mode (see Figure 9).

[Figure 9: GO level distributions. Panels: (a) default parameters, (b) permissive parameters, (c) restrictive parameters.]

3 Creating GO-DAGs and Pies

3.1 Creating GO-DAGs and Pies

Create the complete graphs for all 3 GO branches. Can you draw any conclusion?
All graphs are really big. The only thing you can conclude is that the Biological Process branch bears much more information than the other two.

Use the seq and score filters to reduce the number of GO terms. Try one type of filter at a time. What does the resulting graph look like? Which filtering value gives you a good view of the data? Can you easily see important terms? How? Which ones?
By setting a seq filter the graph becomes smaller, starting from the lowest nodes. The score filter makes some intermediate nodes disappear, creating an odd graph with many links and few nodes. By setting the seq filter to 30 you can see some highlighted nodes such as response to stress, regulation of transcription, DNA-dependent, translation and transport. By setting the filter on the score value (also 30), we get an even more compact graph. Now all nodes are intensely colored and it is more difficult to find the relevant terms, but we also see functions such as response to stress, regulation of transcription, DNA-dependent and translation, which are among the dominant functions.

Perform a GOSlim on these data (use the plant-specific one). Create the DAG. How does it compare to the previous graphs?
The graph is much more compact. You can find important terms again, such as response to stimulus, but many other nodes are not represented.

Generate pie charts with normal and multilevel pies and bar charts. Try out different filters until you get a useful summary. Which functions are more abundant? Give a summary of the functions represented in this sub-array.
We can obtain a good summary with the bar chart (see Figure 10) and the multilevel pie (see Figure 11), but the pies by level always have too many sectors to be useful. From both analyses we can conclude that the main functions in this dataset are: response to stimulus, translation, metabolic processes, transport, ...

3.2 Extra exercise: Pie charts with Excel and custom-colored graphs

Export the graph data as .txt and open it in Excel. Try to reproduce some of the charts you obtained with Blast2GO.
The exported file simply contains the counts for the different GOs. It also contains the level, so it is possible to create a bar chart and a normal pie for one level. The multilevel pie is more difficult, since you need the relationships between nodes and branches.

Make a custom-colored graph with the top 100 GO terms ordered by the number of annotated sequences.
From the table above we order the sequences by the score. We take the first 100 sequences and number them from 1 to 0. The column with GO IDs is duplicated and the file with the columns GO-ID, GO-ID, value is saved with the extension .annot. Import the file into Blast2GO and create a combined graph without filters but using the option color bydesc. The result can be seen in Figure 12.

[Figure 10: Bar chart for the biological processes]
[Figure 11: Multi-level pie chart]

[Figure 12: Custom-colored graph]

4 Enrichment Analysis with Blast2GO - FatiGO

(NOTE: Results may vary depending on the parameters used and on the database versions!)

For example: The enriched term "response to chemical stimulus" has 114 contigs in the test set and 366 in the reference set. The term obtained an adjusted (FDR) p-value of 6.9E-3 and an unadjusted p-value of 1.6E-6. This value of 6.9E-3 is below 0.05, so the term is statistically overrepresented after multiple testing correction.

Below we can see the unfiltered enriched graph of molecular functions (see Figure 13). This graph was saved as pdf. A more compact graph of these results was generated as a thinned graph with an FDR filter of 0.05 (see Figure 14). Finally, the list of the most specific (tip-terms) AND enriched terms per GO branch was generated (see Figure 15).

[Figure 13: Enriched molecular functions (without filter). Enriched graph with the following nodes:]
molecular_function GO:0003674
structural molecule activity GO:0005198 (FDR: 7.0E-3, FWER: 0.0E0, p-value: 4.8E-6)
structural constituent of ribosome GO:0003735 (FDR: 7.0E-3, FWER: 0.0E0, p-value: 4.0E-6)
binding GO:0005488
ion binding GO:0043167
cation binding GO:0043169
metal ion binding GO:0046872
transition metal ion binding GO:0046914
iron ion binding GO:0005506 (FDR: 3.3E-2, FWER: 0.0E0, p-value: 6.8E-5)

[Figure 14: Enriched molecular functions (thinned out with a 0.05 FDR filter). Enriched graph with the following nodes:]
molecular_function GO:0003674 (5 terms collapsed)
structural molecule activity GO:0005198 (FDR: 7.0E-3, FWER: 0.0E0, p-value: 4.8E-6)
structural constituent of ribosome GO:0003735 (FDR: 7.0E-3, FWER: 0.0E0, p-value: 4.0E-6)
iron ion binding GO:0005506 (FDR: 3.3E-2, FWER: 0.0E0, p-value: 6.8E-5)

[Figure 15: List of the most specific (tip-terms) AND enriched terms per GO branch.]
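To make the relation between the unadjusted p-value and the FDR-adjusted value in this section more concrete, here is a minimal Python sketch of a one-tailed Fisher's exact test followed by a multiple testing correction. Several things are assumptions: the counts are invented toy numbers (the total sizes of the test and reference sets are not given in this document), the function names are hypothetical, and Benjamini-Hochberg is used only as a common FDR procedure, since the text does not say which correction Blast2GO applies.

```python
from math import comb


def fisher_one_tailed(test_with, test_without, ref_with, ref_without):
    """One-tailed Fisher's exact test for over-representation in the test set.

    Returns the probability of seeing at least `test_with` annotated
    sequences in the test set, given the margins of the 2x2 table.
    """
    n = test_with + test_without           # size of the test set
    annotated = test_with + ref_with       # sequences carrying the term
    total = n + ref_with + ref_without     # test set plus reference set
    upper = min(n, annotated)
    tail = sum(
        comb(annotated, i) * comb(total - annotated, n - i)
        for i in range(test_with, upper + 1)
    )
    return tail / comb(total, n)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):           # walk from the largest p-value down
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical counts per term:
# (test with term, test without, reference with term, reference without).
TOY_COUNTS = {
    "term A": (12, 38, 20, 430),
    "term B": (5, 45, 40, 410),
    "term C": (9, 41, 30, 420),
}

raw = [fisher_one_tailed(*counts) for counts in TOY_COUNTS.values()]
fdr = benjamini_hochberg(raw)
for term, p, q in zip(TOY_COUNTS, raw, fdr):
    status = "enriched" if q < 0.05 else "not significant"
    print(f"{term}: p={p:.3g} FDR={q:.3g} {status}")
```

A term is reported as enriched when its adjusted value falls below the chosen cutoff (0.05 here); the adjusted value is never smaller than the raw p-value, which is why fewer terms usually survive the correction.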

5 Functional Analysis/Data Mining

The analysis pipeline would be as follows:
- Upload the .dat in B2G
- Go to Tools and use the Add .annot function to include the manual .annot file into the .dat project
- Go to Enrichment Analysis
- Select the StressSelection.txt file, performing a ONE tail statistical test
- Once results are obtained, save them as default.gossip.txt
- Select all sequences
- Go to the annotation menu and reset annotation
- Change annotation parameters. Set all evidence codes lower than 1 to 0
- Re-annotate
- Repeat the Fisher Analysis and save as strict.gossip.txt
- Select all sequences
- Go to the annotation menu and reset annotation
- Change annotation parameters. Set all evidence codes to 1
- Re-annotate
- Repeat the Fisher Analysis and save as permissive.gossip.txt
- Outside B2G, in Excel, create a gossip-result annotation file: a 2-column file with the name of the strategy (default, strict or permissive) in the first column and the significant terms obtained in each analysis in the other
- Save this file as a text delimited file with the extension .annot
- Upload this file in Blast2GO (File menu, Load annotations)
- Make a Combined graph selecting Node Information = with Seqs. Here you can see common and different nodes
- To create a graph with only nodes that were enriched with the 3 strategies, make the graph giving a value of 2.5 to the Seq filter parameter
- Export results as .txt
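The last steps of this pipeline treat each strategy name as if it were a sequence, so a term enriched under all three strategies is "annotated" to 3 pseudo-sequences and passes a Seq filter of 2.5. The Python sketch below builds the 2-column rows and applies that filter outside Blast2GO. It is a simplified illustration only: the assignment of GO IDs to strategies is invented, the function names are hypothetical, and the sketch counts only directly listed terms, whereas a combined graph may also count sequences at parent nodes.

```python
import io
from collections import Counter

# Hypothetical significant terms per strategy (the GO IDs are real identifiers
# taken from the enrichment example, their assignment to strategies is invented).
RESULTS = {
    "default": ["GO:0005198", "GO:0003735", "GO:0005506"],
    "strict": ["GO:0005198", "GO:0003735"],
    "permissive": ["GO:0005198", "GO:0003735", "GO:0005506"],
}


def build_rows(results):
    """Turn {strategy: terms} into (strategy, term) rows, one per pair."""
    return [
        (strategy, term)
        for strategy, terms in results.items()
        for term in sorted(set(terms))
    ]


def write_rows(rows, handle):
    """Write the rows as a tab-delimited 2-column file."""
    for strategy, term in rows:
        handle.write(f"{strategy}\t{term}\n")


def terms_passing_seq_filter(rows, seq_filter=2.5):
    """Keep terms listed for more pseudo-sequences (strategies) than the filter."""
    counts = Counter(term for _, term in rows)
    return {term for term, count in counts.items() if count > seq_filter}


ROWS = build_rows(RESULTS)
buffer = io.StringIO()  # stands in for a file opened with open(path, "w")
write_rows(ROWS, buffer)
print(buffer.getvalue())
print(sorted(terms_passing_seq_filter(ROWS)))
```

Because the count per term is a whole number between 1 and 3, any filter value between 2 and 3 selects exactly the terms found by all three strategies; 2.5 is simply a convenient value in that range.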