Blast2GO Teaching Exercises SOLUTIONS

Ana Conesa and Stefan Götz, 2012
BioBam Bioinformatics S.L., Valencia, Spain

Contents
1 Annotate 10 sequences with Blast2GO
2 Perform a complete annotation with Blast2GO
3 Creating GO-DAGs and Pies
4 Enrichment Analysis with Blast2GO - FatiGO
5 Functional Analysis/Data Mining

1 Annotate 10 sequences with Blast2GO

(Please note that results may vary slightly depending on the parameters used and on database versions!)

1.1 Annotate 10 sequences with Blast2GO

BLAST against NCBI nr database: Check the progress of the BLAST run on the Application messages tab. How long does it take to complete? Are all sequences successfully blasted?
It should take only 2-3 minutes. All sequences are successfully blasted and are now orange.

Launch mapping: That's just a click.

Browsing Blast results: Place the cursor on one sequence and right-click. The single sequence menu appears. Selecting the Show Blast Results option fills the Blast results tab with BLAST information. Double-click on the upper bar to enlarge the window and to enable the scroll bars. You can visualize the different BLAST hits and their percentage of similarity, number of HSPs, reading frame, etc.

GO graph: On the single sequence menu, click on Draw Graph of Mapping-Results with highlighted annotations. The graph appears with the annotation score of each GO term. Annotating terms are the most specific terms of each branch that surpass the annotation score threshold (default = 55).

Export top-blast data: Here, only information on the best BLAST hit is given.

How many GO terms have you fetched for each sequence?

Sequence        1   2   3   4   5   6   7   8   9   10
Number of GOs   13  6   3   10  2   2   11  7   2   2

Annotate the sequences. How many GO terms do you obtain for each sequence?
Here is the table with annotation results:

Sequence        1   2   3   4   5   6   7   8   9   10
Number of GOs   5   2   1   7   -   -   8   2   -   -

1.2 Let's check some annotations in more detail

Obtain the annotation DAG of Sequence 7 (single sequence menu). Interpret and save the molecular function graph (see Figure 1).
In this graph you can see the annotation score for all candidate GO terms (GO terms with description attached). The selected GO terms (octagonal boxes) are those that surpass the annotation threshold and are the most specific terms in their branch.
Other terms are also above the threshold but do not appear in the annotation, because more specific terms in the same branch are above the cutoff value as well.

Re-annotate sequences 1 and 8 at an annotation threshold of 80. How does the annotation change?
The steps to re-annotate are:
- De-select all sequences at the sequences check box.
- Select sequences 1 and 8.
- Go to Annotation and select Reset Annotation.
- Run the annotation step again with only these 2 sequences selected.
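The selection rule described above (keep the most specific terms whose score reaches the threshold) can be sketched in a few lines of Python. This is only an illustration, not Blast2GO's implementation: the toy hierarchy, the scores and the function names `ancestors` and `select_annotations` are assumptions made for the example, and "surpass" is read here as "greater than or equal to".

```python
# Toy GO fragment: child -> parents (hypothetical subset for illustration).
PARENTS = {
    "catalytic activity": ["molecular_function"],
    "binding": ["molecular_function"],
    "transferase activity": ["catalytic activity"],
    "nucleotide binding": ["binding"],
    "ATP binding": ["nucleotide binding"],
}

# Assumed annotation scores, not values produced by the program.
SCORES = {
    "molecular_function": 100,
    "catalytic activity": 100,
    "transferase activity": 100,
    "binding": 95,
    "nucleotide binding": 90,
    "ATP binding": 60,
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links upward."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def select_annotations(scores, threshold=55, parents=PARENTS):
    """Keep terms at or above the threshold that have no passing descendant."""
    passing = {term for term, score in scores.items() if score >= threshold}
    covered = set()
    for term in passing:
        covered |= ancestors(term, parents)  # ancestors of a passing term are dropped
    return sorted(passing - covered)
```

With these toy scores, `select_annotations(SCORES, 55)` keeps ATP binding and transferase activity, while `select_annotations(SCORES, 80)` falls back to the more general nucleotide binding. This is the kind of change the re-annotation at threshold 80 is meant to show.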

Figure 1: GO mapping and annotation of seq 7 (single sequence graph of Seq7, molecular function branch, showing the AnnotScore of each candidate term)

Sequence 1 now has only 2 GO terms (3 fewer) and sequence 8 now has only 1 GO term. Both sequences lost information: some terms disappeared, others changed to more general ones.

There are a number of sequences with mapping but without annotation. What happened? Try to annotate them manually. Tip: go to the Blast results of these sequences to learn about them and decide on the functions you would give to these sequences. Go to the Gene Ontology resource (www.geneontology.org) and look for appropriate GO terms. Add these manually to the sequences and mark them as manually annotated.
These sequences do not have annotation because the obtained terms are root GOs. By browsing the Blast results, some functions can be proposed:
- Sequence 5: GO:0016021, integral to membrane
- Sequence 10: GO:0016020, membrane

1.3 Let's augment/modify the annotations

Get InterPro annotation for these sequences. How long does it take?

It takes about 5 minutes for 10 sequences. 8 sequences obtain InterProScan results. Only 5 of them are linked to GO terms.

Merge InterPro results with the existing (blast-based) annotations (AnnotScore=55). How much does your annotation improve?
After merging, 2 GO terms are added and 0 GO terms are removed for being too general. Now 27 terms are assigned to our sequences. 38 (redundant) GO terms obtained through InterPro could be used to confirm existing ones. In this example, 0 InterPro-based GO terms were more general than the ones already assigned.

Figure 2: InterProScan results (bar chart "Merge Interpro Annotation Results": number of annotations before, after, confirmed, too general)

Run Annex on these sequences. How does your annotation improve?
After Annex, several new GO terms have been obtained based on already existing molecular function terms; some terms are replaced by more specific ones and some others are confirmed.

Figure 3: Annex results (bar chart "Annex Results": number of annotations previous, actual, new, replaced, confirmed)

Get KEGG maps for these sequences. For how many sequences do you obtain KEGG results?
There are 3 sequences for which an Enzyme Code has been obtained. These codes map to 10 KEGG metabolic pathways. The enzyme position in the KEGG pathway is highlighted, each enzyme with a different color.

Get the GOSlim of these sequences. How many GO terms do you have now?
Here is the table with annotation results:

Sequence        1   2   3   4   5   6   7   8   9   10
Before GOSlim   5   2   1   7   -   -   8   4   -   -
After GOSlim    6   5   2   9   -   -   8   4   2   2

GOSlim annotations are not necessarily fewer than the normal GO annotations at the single-sequence level, but the diversity of GO terms is reduced (see Figure 4).
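The bookkeeping behind the InterPro merge in this section (terms added, confirmed, or too general) can be pictured with plain set operations. The sketch below is one reading of that bookkeeping, not the program's code: the function name `merge_interpro`, the `ancestors_of` lookup and the example terms are all assumptions.

```python
def merge_interpro(existing, interpro, ancestors_of):
    """Classify InterPro GO terms against one sequence's existing annotation.

    existing, interpro: sets of GO terms for a single sequence.
    ancestors_of: dict mapping a term to the set of all its ancestor terms
                  (hypothetical lookup supplied by the caller).
    """
    existing_ancestors = set()
    for term in existing:
        existing_ancestors |= ancestors_of.get(term, set())

    confirmed = interpro & existing                          # already annotated
    too_general = (interpro - existing) & existing_ancestors  # parent of an existing term
    added = interpro - existing - existing_ancestors          # genuinely new

    # A new term can be more specific than an existing one; the existing
    # parent then becomes redundant and is dropped from the merged set.
    replaced = set()
    for term in added:
        replaced |= ancestors_of.get(term, set()) & existing

    merged = (existing - replaced) | added
    return {
        "merged": merged,
        "added": added,
        "confirmed": confirmed,
        "too_general": too_general,
        "replaced": replaced,
    }
```

Counting the sizes of the returned sets over all sequences gives the kind of before/after/confirmed/too general summary shown in Figure 2.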

Export annotation results in different formats (.annot, GeneSpring, Sequence Table and BestHit). Open these files with OpenOffice Spreadsheet. Which format do you like the most?
Every format has a function. The GeneSpring format is good for understanding results, while the .annot format is appropriate for performing calculations and for importing results into other applications. The table formats are also useful for browsing annotations.

1.4 Extra exercise: Merge two .dat files

Save the B2G project in two separate files. Close the project and join the 2 .dat files again with B2G.
The steps are:
- Save project as result.dat
- De-select all sequences at the sequence check box
- Select the last 5 sequences
- Go to Select menu and delete selected sequences
- Save as result1.dat
- Close project
- Load result.dat
- De-select the last 5 sequences
- Go to Select menu and delete selected sequences
- Save as result2.dat
- Close project
- Load result1.dat
- Go to Tools and select Add .dat to existing project, and then select result2.dat
The original annotation project is restored.

Figure 4: GOSlim graph (GOSlim combined graph of the biological process branch; each node shows its number of annotated sequences, Seqs)
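As an example of the calculations the .annot export lends itself to, the per-sequence GO counts tabulated in section 1 could be recomputed outside Blast2GO. The sketch below assumes the file is tab-separated with the sequence name in the first column and the GO ID in the second; that layout, the function name and the sample rows are assumptions for illustration.

```python
import csv
import io
from collections import defaultdict


def go_terms_per_sequence(handle):
    """Count distinct GO IDs per sequence in a tab-separated annotation table."""
    terms = defaultdict(set)
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2 or not row[1].startswith("GO:"):
            continue  # skip blank lines and non-GO rows such as enzyme codes
        terms[row[0]].add(row[1])
    return {seq: len(ids) for seq, ids in terms.items()}


# Hypothetical file content; the GO IDs are the two proposed in section 1.2.
EXAMPLE = (
    "seq5\tGO:0016021\tintegral to membrane\n"
    "seq10\tGO:0016020\tmembrane\n"
    "seq10\tGO:0016020\tmembrane\n"
)
COUNTS = go_terms_per_sequence(io.StringIO(EXAMPLE))
```

For a real export, pass an open file instead of the `io.StringIO` object, for example `open("result.annot", newline="")`, where the file name is again only a placeholder.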

2 Perform a complete annotation with Blast2GO

(Please note that results may vary slightly depending on the parameters used and on database versions!)

2.1 Annotation of 1100 sequences with Blast2GO

e-values and similarities: The e-value ranges from 1E-3 to 1E-130. Most sequences have an e-value between 1E-10 and 1E-70. The sequence similarity ranges from approx. 40% to approx. 95%, then it drops. We can also observe a peak at 100%, which could be due to self-hits or sequence patterns of 100% similarity.

Figure 5: E-value distribution (number of hits against e-value, 1e-X)

Figure 6: Similarity distribution (number of hits against sequence similarity, #positives/alignment-length)

Mapping: The majority of sequences have annotations inferred from electronic annotation; even so, approx. 350 sequences also have annotations inferred from direct assay. The next two evidence codes are also non-experimental: inferred from computational analysis and inferred from sequence similarity. Looking at the source databases, we can observe that the majority of annotations are obtained from the UniProt Knowledge Base. The mean GO level is 5.4 and approx. 2800 annotations could be assigned.

2.2 Augment annotation via InterPro and Annex

About 600 sequences have an InterProScan result and about 30% of them could be linked to a GO term. However, for this dataset only 8 additional sequences could be annotated through InterPro domains. Through Annex, the number of annotations increased from about 3500 to over 4100 by adding complementary terms derived from the existing molecular functions.
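The two hit statistics plotted in section 2.1 (the e-value as an exponent X in 1e-X, and the similarity as #positives/alignment-length) are easy to recompute from a table of hits. The following is a minimal sketch; the function names, the bin widths and the cap used for e-values reported as 0 are assumptions.

```python
import math
from collections import Counter


def similarity_percent(positives, alignment_length):
    """Similarity as used on the Figure 6 axis: positives over alignment length."""
    return 100.0 * positives / alignment_length


def evalue_exponent(evalue, cap=180.0):
    """Return X such that the e-value is roughly 1e-X.

    E-values of 1 or more map to 0; an e-value reported as 0 maps to `cap`,
    an arbitrary upper limit chosen for this sketch.
    """
    if evalue <= 0:
        return cap
    return min(cap, max(0.0, -math.log10(evalue)))


def histogram(values, bin_width):
    """Count values per bin; keys are the lower bin edges."""
    return Counter(int(v // bin_width) * bin_width for v in values)
```

Calling `histogram` on the similarity values with a bin width of 10, or on the e-value exponents with a bin width of 25, gives counts comparable in shape to the two distribution charts.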

Figure 7: Distribution of GO term levels for each GO category (P, F, C): number of annotations against GO level (Total Annotations = 2839, Mean Level = 5.391, Std. Deviation = 1.778)

2.3 Try different annotation strategies

We observed a drastic decrease in the number of assigned annotations when several evidence codes were excluded with the more restrictive setting: only about 10% of the sequences could be successfully annotated. Conversely, with the permissive setting we obtained over 50% of annotated sequences. Compared to the annotation with default parameters, this gave annotations for only 70 more sequences (see Figure 8).

Figure 8: Annotation results: (a) default parameters, (b) permissive parameters, (c) restrictive parameters

The GO-level distribution chart shows that the number of annotated sequences is much lower in the restrictive mode. It can also be observed that the mean annotation level stayed more or less the same. On closer inspection, the Cellular Component category seems to be the one least affected by the restrictive mode (see Figure 9).
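To make the three strategies concrete, the sketch below treats each evidence code as having a weight between 0 and 1, with the restrictive mode setting every weight below 1 to 0 and the permissive mode setting every weight to 1, as in the pipeline of section 5. The weight values and all function names are assumptions for illustration; the real defaults should be checked in the annotation settings of the program.

```python
# Illustrative evidence-code weights (assumed values, not copied from the
# program's configuration dialog).
DEFAULT_WEIGHTS = {"IDA": 1.0, "IMP": 1.0, "TAS": 0.9, "ISS": 0.8, "IEA": 0.7}


def restrictive(weights):
    """Set every weight below 1 to 0, so only full-weight codes contribute."""
    return {code: (w if w >= 1.0 else 0.0) for code, w in weights.items()}


def permissive(weights):
    """Set every weight to 1, so all evidence codes count fully."""
    return {code: 1.0 for code in weights}


def usable_annotations(candidates, weights):
    """Keep (go_term, evidence_code) pairs whose code has a non-zero weight."""
    return [(go, ec) for go, ec in candidates if weights.get(ec, 0.0) > 0.0]
```

Because most mapped annotations in this dataset are electronic (IEA), zeroing that weight removes most candidates, which is consistent with the sharp drop seen in the restrictive mode.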

Figure 9: GO level distributions: (a) default parameters, (b) permissive parameters, (c) restrictive parameters

3 Creating GO-DAGs and Pies

3.1 Creating GO-DAGs and Pies

Create the complete graphs for all 3 GO branches. Can you draw any conclusion?
All graphs are really big. The only thing you can conclude is that the Biological Process branch bears much more information than the other two.

Use the seq and score filters to reduce the number of GO terms. Try one type of filter at a time. What does the resulting graph look like? Which filtering value gives you a good view of the data? Can you easily see important terms? How? Which ones?
With a seq filter the graph becomes smaller from the lowest nodes. The score filter makes some intermediate nodes disappear, creating an odd graph with many links and few nodes. By setting the seq filter to 30 you can see some highlighted nodes such as response to stress, regulation of transcription (DNA-dependent), translation and transport. By setting the filter on the score value (also 30), we get an even more compact graph. Now all nodes are intensely colored and it is more difficult to find the relevant terms, but we also see functions such as response to stress, regulation of transcription (DNA-dependent) and translation, which are among the dominant functions.

Perform a GOSlim on these data (use the plant-specific one). Create the DAG. How does it compare to the previous graphs?
The graph is much more compact. You can find important terms again, such as response to stimulus, but many other nodes are not represented.

Generate pie charts (normal and multilevel) and bar charts. Try out different filter settings until you get a useful summary. Which functions are more abundant? Give a summary of the functions represented in this sub-array.
We can obtain a good summary with the bar chart (see Figure 10) and the multilevel pie (see Figure 11), but the pies by level always have too many sectors to be useful. From both analyses we can conclude that the main functions in this dataset are: response to stimulus, translation, metabolic processes, transport, ...
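Why the seq filter shrinks the graph from the lowest nodes can be seen in a small sketch. It assumes that a node's sequence count includes the sequences of its descendants, so a parent never has fewer sequences than its child. The function name and the data layout are assumptions for illustration.

```python
def seq_filter(node_seqs, edges, min_seqs):
    """Keep nodes with at least min_seqs sequences, plus the edges between them.

    node_seqs: dict mapping a GO term to its number of annotated sequences
               (counts are assumed to include sequences of descendant terms).
    edges: list of (parent, child) pairs.
    """
    kept = {term for term, n in node_seqs.items() if n >= min_seqs}
    kept_edges = [(p, c) for p, c in edges if p in kept and c in kept]
    return kept, kept_edges
```

Since counts can only grow toward the root under that assumption, raising `min_seqs` removes leaves first and never leaves a kept node without its kept parents. A filter on a non-cumulative value, like the score, gives no such guarantee, which matches the "odd graph" with missing intermediate nodes described above.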
3.2 Extra exercise: Pie charts with Excel and custom-colored graphs

Export the graph data as .txt and open it in Excel. Try to reproduce some of the charts you obtained with Blast2GO.
Here I simply have the counts for the different GOs. I also have the level, so it is possible to create a bar chart and a normal pie for one level. The multilevel pie is more difficult, since you need the relationships between nodes and branches.

Make a custom-colored graph with the top 100 GO terms ordered by the number of annotated sequences.
From the table above we order the rows by the score. We take the first 100 rows and number them from 1 to 0. The column with GO IDs is duplicated and the file with columns GO-ID, GO-ID, value is saved with the extension .annot. Import the file into Blast2GO and create a combined graph without filters but using the option color bydesc. The result can be seen in Figure 12.
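The spreadsheet steps for the custom-colored graph could also be scripted. The sketch below sorts (GO ID, score) rows, keeps the top n and writes the three columns GO-ID, GO-ID, value. How the original numbering "from 1 to 0" was meant is not fully clear from the text, so the value used here (a rank score falling linearly from 1 for the first row to 0 for the last) is an assumption, as are the function name and the tab separator.

```python
def top_terms_annot(rows, n=100):
    """Build 'GO-ID<TAB>GO-ID<TAB>value' lines for the n highest-scoring rows.

    rows: iterable of (go_id, score) pairs taken from the exported graph table.
    """
    ranked = sorted(rows, key=lambda r: r[1], reverse=True)[:n]
    last = max(len(ranked) - 1, 1)  # avoid dividing by zero for a single row
    lines = []
    for rank, (go_id, _score) in enumerate(ranked):
        value = 1.0 - rank / last  # assumed scale: 1 for the top row, 0 for the last
        lines.append(f"{go_id}\t{go_id}\t{value:.3f}")
    return lines
```

Writing `"\n".join(top_terms_annot(rows))` to a file ending in .annot gives a file with the three columns described above.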

Figure 10: Bar-Chart for the biological processes

Figure 11: Multi-Level Pie Chart

Figure 12: Custom-colored graph

4 Enrichment Analysis with Blast2GO - FatiGO

(NOTE: Results may vary depending on the parameters used and on database versions!)

For example: The enriched term response to chemical stimulus has 114 contigs in the test set and 366 in the reference set. The term obtained an adjusted (FDR) p-value of 6.9E-3 and an unadjusted p-value of 1.6E-6. The adjusted value of 6.9E-3 is below 0.05, so the term is statistically overrepresented after multiple testing correction.

Below we can see the unfiltered enriched graph of molecular functions (see Figure 13). This graph was saved as pdf. A more compact graph of these results was generated as a thinned graph with an FDR filter of 0.05 (see Figure 14). Finally, the list of the most specific (tip-terms) AND enriched terms per GO branch was generated (see Figure 15).

Figure 13: Enriched molecular functions (without filter). Enriched nodes in the graph:

GO term                              GO ID        FDR      FWER    p-value
structural molecule activity         GO:0005198   7.0E-3   0.0E0   4.8E-6
structural constituent of ribosome   GO:0003735   7.0E-3   0.0E0   4.0E-6
iron ion binding                     GO:0005506   3.3E-2   0.0E0   6.8E-5

The other nodes of the graph (molecular_function GO:0003674, binding GO:0005488, ion binding GO:0043167, cation binding GO:0043169, metal ion binding GO:0046872, transition metal ion binding GO:0046914) are shown without enrichment values.
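The relation between the unadjusted p-value and the FDR-adjusted value discussed in this section can be illustrated with a small, self-contained sketch: a one-tailed Fisher's exact test for over-representation, followed by a Benjamini-Hochberg correction. Benjamini-Hochberg is used here as a common FDR procedure; whether the program applies exactly this variant is an assumption, and the function names are invented for the example. The set sizes needed to reproduce the 1.6E-6 quoted above are not given in the text, so no attempt is made to recompute it.

```python
from math import comb


def fisher_one_tailed(k, n, K, N):
    """Over-representation p-value P(X >= k) under the hypergeometric model.

    k: test-set sequences annotated with the term
    n: size of the test set
    K: sequences annotated with the term in test and reference set together
    N: total number of sequences (test + reference)
    """
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)


def benjamini_hochberg(pvalues):
    """Return FDR-adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

A term would be reported as enriched when its adjusted value falls below the chosen cutoff, for example 0.05 as in the filter used for Figure 14.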

Figure 14: Enriched molecular functions (thinned out with a 0.05 FDR filter). The graph shows molecular_function (GO:0003674), a node labelled "5 terms", and the three enriched terms: structural molecule activity (GO:0005198, FDR: 7.0E-3, FWER: 0.0E0, p-value: 4.8E-6), iron ion binding (GO:0005506, FDR: 3.3E-2, FWER: 0.0E0, p-value: 6.8E-5) and structural constituent of ribosome (GO:0003735, FDR: 7.0E-3, FWER: 0.0E0, p-value: 4.0E-6).

Figure 15: List of the most specific (tip-terms) AND enriched terms per GO branch.

5 Functional Analysis/Data Mining

The analysis pipeline would be as follows:
- Upload the .dat in B2G
- Go to Tools and use the Add .annot function to include the manual .annot file into the .dat project
- Go to Enrichment Analysis
- Select the StressSelection.txt file, performing a ONE-tailed statistical test. Once results are obtained, save them as default.gossip.txt
- Select all sequences
- Go to the annotation menu and reset annotation
- Change annotation parameters. Set all evidence code weights lower than 1 to 0
- Re-annotate
- Repeat the Fisher Analysis and save as strict.gossip.txt
- Select all sequences
- Go to the annotation menu and reset annotation
- Change annotation parameters. Set all evidence code weights to 1
- Re-annotate
- Repeat the Fisher Analysis and save as permissive.gossip.txt
- Outside B2G, in Excel, create a gossip-result annotation file: a 2-column file with the name of the strategy (default, strict or permissive) in the first column and the significant terms obtained in each analysis in the other. Save this file as a text-delimited file with the extension .annot
- Upload this file in Blast2GO (File menu, Load annotations)
- Make a Combined graph selecting Node Information = with Seqs. Here you can see common and different nodes
- To create a graph with only nodes that were enriched with the 3 strategies, make the graph giving a value of 2.5 to the Seq filter parameter
- Export results as .txt
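The Excel step and the Seq filter trick at the end of this pipeline can be mirrored in a short sketch. Each strategy plays the role of a "sequence" annotated with its significant terms, so a Seq filter of 2.5 keeps only terms supported by all 3 strategies. The function names and the data layout are assumptions; reading the significant terms out of the gossip result files is left out because their layout is not described here.

```python
from collections import defaultdict


def strategy_annot_lines(significant):
    """Two-column lines: strategy name, then one significant GO term per line."""
    return [
        f"{name}\t{term}"
        for name, terms in significant.items()
        for term in sorted(terms)
    ]


def terms_by_support(significant, min_strategies):
    """GO terms found significant by at least min_strategies strategies."""
    support = defaultdict(set)
    for name, terms in significant.items():
        for term in terms:
            support[term].add(name)
    return sorted(t for t, names in support.items() if len(names) >= min_strategies)
```

Calling `terms_by_support(significant, 2.5)` reproduces the effect of the Seq filter value of 2.5: with three strategies, only a count of 3 passes.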