Annotation and Gene Set Analysis with R and Bioconductor

Alex Sánchez
Statistics and Bioinformatics Research Group
Departament de Estadística. Universitat de Barcelona
April 22, 2013

Contents
1 Introduction
  1.1 The estrogen case study
2 Probe annotation information
  2.1 Probe annotation information using array specific annotation package
3 Creating Annotated Results Tables (or how to get pretty output)
4 Species specific annotation packages

1 Introduction

This lab discusses and exemplifies methods for annotating genes and for analysing the biological significance of lists of genes. These methods usually rely on one or more lists of genes obtained from a gene selection process. For the sake of completeness, the selection of differentially expressed genes is reproduced below, although it is not discussed because it has been treated elsewhere.

1.1 The estrogen case study

Data for the analyses are obtained from the estrogen dataset, available in the estrogen package.

> if (!(require(estrogen))){
+     source("...")
+     biocLite("estrogen")
+     library(estrogen)
+ }
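The installation idiom above relies on biocLite(), which more recent Bioconductor releases have replaced with the BiocManager package. The following is a minimal sketch of the same "load, and install if missing" pattern under that assumption. The helper name ensure_pkg is hypothetical and not part of the original lab.

```r
# Hypothetical helper (not part of the original lab): attach a package,
# installing it through BiocManager first if it is not available.
ensure_pkg <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    # BiocManager itself lives on CRAN, so it is installed the usual way.
    if (!requireNamespace("BiocManager", quietly = TRUE)) {
      install.packages("BiocManager")
    }
    BiocManager::install(pkg, update = FALSE, ask = FALSE)
  }
  library(pkg, character.only = TRUE)
}

# Usage (commented out so that nothing is downloaded by accident):
# ensure_pkg("estrogen")
```

Wrapping the check in a function avoids repeating the if (!require(...)) test for every package used later in the lab.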

> estrogendir <- system.file("extdata", package = "estrogen")
> # print(estrogendir)
> workingdir <- getwd()
> datadir <- file.path(workingdir, "datos")
> if (!file.exists("datos")) dir.create("datos")
> resultsdir <- file.path(workingdir, "results")
> if (!file.exists("results")) dir.create("results")

First, data are read from the package data directory.

> require(Biobase)
> require(affy)
> sampleinfo <- read.AnnotatedDataFrame(file.path(estrogendir,"targlimma.txt"),
+     header = TRUE, row.names = 1, sep="\t")
> filenames <- pData(sampleinfo)$filename
> rawdata <- read.affybatch(filenames=file.path(estrogendir,filenames),
+     phenoData=sampleinfo)

Exploration and quality control are omitted because they have been presented elsewhere. We go straight to normalization followed by non-specific filtering.

> stopifnot(require(affy))
> eset_rma <- rma(rawdata)
Background correcting
Normalizing
Calculating Expression
> save(eset_rma, file=file.path(datadir,"normalized.rda"))
> if(!(require(genefilter))) biocLite("genefilter")
> if(!(require("hgu95av2.db"))) biocLite("hgu95av2.db")
> filtrats <- nsFilter(eset_rma)

Gene selection is done using the linear model approach implemented in the limma package.

> cont.matrix <- makeContrasts (
+     Estro10=(est10h-neg10h),
+     Estro48=(est48h-neg48h),
+     Tiempo=(neg48h-neg10h),
+     levels=design)
> cont.matrix
        Contrasts
Levels   Estro10 Estro48 Tiempo
  neg10h      -1       0     -1
  est10h       1       0      0
  neg48h       0      -1      1
  est48h       0       1      0
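To make the role of the contrast matrix concrete, the sketch below rebuilds it by hand in base R and applies it to a vector of group means. The group means are invented for illustration and do not come from the estrogen data. The object names cont.manual and group.means are likewise hypothetical.

```r
# Contrast matrix written out by hand, following the three comparisons
# defined above (rows are the four experimental groups).
groups <- c("neg10h", "est10h", "neg48h", "est48h")
cont.manual <- cbind(
  Estro10 = c(-1, 1,  0, 0),   # est10h - neg10h
  Estro48 = c( 0, 0, -1, 1),   # est48h - neg48h
  Tiempo  = c(-1, 0,  1, 0)    # neg48h - neg10h
)
rownames(cont.manual) <- groups

# Hypothetical log2 group means for a single probe (invented numbers).
group.means <- c(neg10h = 7.0, est10h = 9.5, neg48h = 7.4, est48h = 10.4)

# Each contrast is a weighted sum of the group means.
contrast.values <- drop(t(cont.manual) %*% group.means)
print(contrast.values)
```

With these made-up means the estrogen effect is 2.5 at 10 hours and 3.0 at 48 hours, and the time effect in untreated samples is 0.4, all on the log2 scale. In the real analysis limma estimates the group means per probe and applies the same weights.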

> toptabestro10 <- topTable (fit.main, number=nrow(fit.main), coef="Estro10", adjust="fdr")
> toptabestro48 <- topTable (fit.main, number=nrow(fit.main), coef="Estro48", adjust="fdr")
> toptabtiempo <- topTable (fit.main, number=nrow(fit.main), coef="Tiempo", adjust="fdr")
> save(toptabestro10, toptabestro48, toptabtiempo,
+     file=file.path(resultsdir, "toptables.rda"))

To select genes that change in at least one of the comparisons we rely on the decideTests function.

> probenames<-rownames(res)
> probenames.selected<-probenames[sum.res.rows!=0]
> exprsselected <-exprs(eset_rma)[probenames.selected,]
> save(exprsselected, file=file.path(resultsdir, "exprsselected.rda"))

2 Probe annotation information

The Bioconductor project provides software for associating microarray and other genomic data in real time with biological metadata from web databases such as GenBank, LocusLink and PubMed (annotate package). Functions are also provided for incorporating the results of statistical analyses in HTML reports with links to annotation web resources.

Software tools are available for assembling and processing genomic annotation data from databases such as GenBank, the Gene Ontology Consortium, Entrez, UniGene or the UCSC Human Genome Project (AnnotationDbi package). Data packages are distributed to provide mappings between different probe identifiers (e.g. Affy IDs, Entrez, PubMed). Customized annotation libraries can also be assembled.

The use of Bioconductor annotation for Affymetrix arrays is illustrated below. We will use alternative approaches to obtain probe annotation.

2.1 Probe annotation information using array specific annotation package

The purpose of an annotation package, say hgu95av2.db, is to provide detailed information about the hgu95av2 platform.
To use it, it must first be loaded:

> library(hgu95av2.db)

We can try different options for displaying information about the content of the package:

> require(hgu95av2.db)
> ls("package:hgu95av2.db")
 [1] "hgu95av2"              "hgu95av2ACCNUM"        "hgu95av2ALIAS2PROBE"
 [4] "hgu95av2CHR"           "hgu95av2CHRLENGTHS"    "hgu95av2CHRLOC"
 [7] "hgu95av2CHRLOCEND"     "hgu95av2.db"           "hgu95av2_dbconn"
[10] "hgu95av2_dbfile"       "hgu95av2_dbInfo"       "hgu95av2_dbschema"
[13] "hgu95av2ENSEMBL"       "hgu95av2ENSEMBL2PROBE" "hgu95av2ENTREZID"
[16] "hgu95av2ENZYME"        "hgu95av2ENZYME2PROBE"  "hgu95av2GENENAME"
[19] "hgu95av2GO"            "hgu95av2GO2ALLPROBES"  "hgu95av2GO2PROBE"
[22] "hgu95av2MAP"           "hgu95av2MAPCOUNTS"     "hgu95av2OMIM"
[25] "hgu95av2ORGANISM"      "hgu95av2ORGPKG"        "hgu95av2PATH"
[28] "hgu95av2PATH2PROBE"    "hgu95av2PFAM"          "hgu95av2PMID"
[31] "hgu95av2PMID2PROBE"    "hgu95av2PROSITE"       "hgu95av2REFSEQ"
[34] "hgu95av2SYMBOL"        "hgu95av2UNIGENE"       "hgu95av2UNIPROT"
> head(ls("package:hgu95av2.db"), n = 10)
 [1] "hgu95av2"            "hgu95av2ACCNUM"      "hgu95av2ALIAS2PROBE"
 [4] "hgu95av2CHR"         "hgu95av2CHRLENGTHS"  "hgu95av2CHRLOC"
 [7] "hgu95av2CHRLOCEND"   "hgu95av2.db"         "hgu95av2_dbconn"
[10] "hgu95av2_dbfile"
> hgu95av2()
Quality control information for hgu95av2:

This package has the following mappings:

hgu95av2ACCNUM has mapped keys (of keys)
hgu95av2ALIAS2PROBE has mapped keys (of keys)
hgu95av2CHR has mapped keys (of keys)
hgu95av2CHRLENGTHS has 93 mapped keys (of 93 keys)
hgu95av2CHRLOC has mapped keys (of keys)
hgu95av2CHRLOCEND has mapped keys (of keys)
hgu95av2ENSEMBL has mapped keys (of keys)
hgu95av2ENSEMBL2PROBE has 9677 mapped keys (of keys)
hgu95av2ENTREZID has mapped keys (of keys)
hgu95av2ENZYME has 2154 mapped keys (of keys)
hgu95av2ENZYME2PROBE has 791 mapped keys (of 975 keys)
hgu95av2GENENAME has mapped keys (of keys)
hgu95av2GO has mapped keys (of keys)
hgu95av2GO2ALLPROBES has mapped keys (of keys)
hgu95av2GO2PROBE has mapped keys (of keys)
hgu95av2MAP has mapped keys (of keys)
hgu95av2OMIM has mapped keys (of keys)
hgu95av2PATH has 5504 mapped keys (of keys)
hgu95av2PATH2PROBE has 228 mapped keys (of 229 keys)
hgu95av2PFAM has mapped keys (of keys)
hgu95av2PMID has mapped keys (of keys)
hgu95av2PMID2PROBE has mapped keys (of keys)
hgu95av2PROSITE has mapped keys (of keys)
hgu95av2REFSEQ has mapped keys (of keys)
hgu95av2SYMBOL has mapped keys (of keys)
hgu95av2UNIGENE has mapped keys (of keys)

hgu95av2UNIPROT has mapped keys (of keys)

Additional Information about this package:

DB schema: HUMANCHIP_DB
DB schema version: 2.1
Organism: Homo sapiens
Date for NCBI data: 2012-Sep4
Date for GO data:
Date for KEGG data: 2011-Mar15
Date for Golden Path data: 2010-Mar22
Date for Ensembl data: 2012-Jul31

> ?hgu95av2UNIGENE
> head(toTable(hgu95av2UNIGENE))
  probe_id unigene_id
       _at Hs
       _at Hs
     _f_at Hs
     _s_at Hs
       _at Hs
       _at Hs

We will now use some of the functions provided by the annotate package. The basic purpose of this package is to supply interface routines for getting data out of specific metadata libraries (e.g. hgu95av2.db).

It is easy to get information about an individual probe or a list of probes using the get/mget functions:

> get("38187_at", hgu95av2GENENAME)
[1] "N-acetyltransferase 1 (arylamine N-acetyltransferase)"
> affyid <- c("38187_at", "38912_at", "33825_at", "36512_at", "38434_at")
> mget(affyid, hgu95av2GENENAME)
$`38187_at`
[1] "N-acetyltransferase 1 (arylamine N-acetyltransferase)"

$`38912_at`
[1] "N-acetyltransferase 2 (arylamine N-acetyltransferase)"

$`33825_at`
[1] "serpin peptidase inhibitor, clade A (alpha-1 antiproteinase, antitrypsin), member 3"

$`36512_at`
[1] "arylacetamide deacetylase"

$`38434_at`
[1] "angio-associated, migratory cell protein"
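The get/mget lookups above return one mapping at a time. As a complementary sketch, current versions of AnnotationDbi also offer select(), which retrieves several annotation columns in a single data frame. This interface is not used in the original lab. The block is guarded so that it does nothing harmful when hgu95av2.db is not installed, and the object name annot is hypothetical.

```r
# Probe IDs taken from the mget() example above.
affyid <- c("38187_at", "38912_at", "33825_at", "36512_at", "38434_at")

# Guarded so that the block still runs where the package is not installed.
if (requireNamespace("hgu95av2.db", quietly = TRUE)) {
  library(hgu95av2.db)
  # One row per probe/annotation combination, one column per requested field.
  annot <- AnnotationDbi::select(hgu95av2.db,
                                 keys    = affyid,
                                 keytype = "PROBEID",
                                 columns = c("SYMBOL", "ENTREZID", "GENENAME"))
  print(annot)
} else {
  message("hgu95av2.db is not installed; skipping the lookup.")
}
```

The call is written with an explicit AnnotationDbi:: prefix because other commonly loaded packages also export a function called select.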

Exercise: Try adding more annotation to the fit2 object generated in the linear model analysis described above. Add the gene symbol and the Entrez gene ID.

Exercise: How many probes do not have a gene symbol?

3 Creating Annotated Results Tables (or how to get pretty output)

It is possible to make reasonably nice-looking HTML tables for presenting the results of a microarray analysis. These tables are a convenient format because they can contain clickable links to public annotation databases, which facilitates the downstream analysis. In addition, the format is compact, can be posted on the web, and can be viewed in any web browser.

The Bioconductor project supplies annotation packages for many of the more popular Affymetrix chips, as well as for many commercial spotted cDNA chips. For chips that have annotation packages, the annaffy package is the preferred method for making HTML tables.

In this example we assume that we have analyzed an experiment using limma and that a topTable object with the most interesting genes has been saved to a file, so we begin by reloading it.
> if (!(exists("toptabestro48"))) load (file=file.path(resultsdir, "toptables.rda"))
> toptab <- toptabestro48
> stopifnot(require(annotate))
> ### We will use ENTREZID codes to link with databases
> gnames<-as.character(toptab$ID)
> # myenvirentrezid<-eval(parse(text = paste(anotpackage,"ENTREZID",sep="")))
> # gll<- mget(gnames, env = myenvirentrezid)
> ### Add also gene symbols
> # myenvirsymbol<-eval(parse(text = paste(anotpackage,"SYMBOL",sep="")))
> # gsym <- mget(gnames, env = myenvirsymbol)
> gll <- getEG(gnames, "hgu95av2.db")
> gsym <- getSYMBOL(gnames, "hgu95av2.db")
> linked <- list (misgenes=gll)
> ### Prepare a dataframe to organize the output
> othernames = data.frame(gsym, gnames, round(toptab$logFC,4), round(toptab$t,4),
+     round(toptab$P.Value, 6), round(toptab$adj.P.Val,6), round(toptab$B,4))
> names(othernames) = c("genesymbol", "AffyID", "M", "t-stat", "p-val", "Adj. p-val", "B-stat")
> htmlpage(linked,
+     filename =file.path(datadir, "Selected Genes.html"),
+     title = "Comparison of cell types after LPS treatment",
+     othernames = othernames,
+     table.head = c("locus ID", "Gene Symbol", "Affy ID",
+         "logfc", "t-stat", "p-val", "Adj. p-val","b-stat"),
+     table.center = TRUE,
+     repository=list("en"))

A different approach is provided by the annaffy package.

annaffy allows easy access to many types of annotations. Its use is more straightforward than that of htmlpage, but it is restricted to Affymetrix chips.

> source("...")
> if(!(require(annotate))) biocLite("annotate")
> if(!(require("hgu95av2.db", character.only=TRUE))) biocLite("hgu95av2.db")
> if(!(require("KEGG.db", character.only=TRUE))) biocLite("KEGG.db")
> if(!(require("GO.db"))) biocLite("GO.db")
> if(!(require("annaffy"))) biocLite("annaffy")
> atab <- aafTableAnn(toptab$ID, "hgu95av2.db", aaf.handler() )
> saveHTML(atab, file=file.path(datadir, "Annotations for Selected Genes.html"))

See the section "Building HTML pages" of the package vignette for how to build HTML pages that combine annotations and results.

4 Species specific annotation packages

After some time of relying on platform-specific annotation packages (centered on the chips), it was decided to move the focus to organism-centered packages, allowing for a more flexible annotation system that does not depend on a specific brand dominating the market.

> if(!(require(org.Hs.eg.db))) biocLite("org.Hs.eg.db")
> require(KEGG.db)
> caff <- get("Caffeine metabolism",
+     revmap(KEGGPATHID2NAME))
> get(caff, revmap(org.Hs.egPATH))
[1] "9"    "10"   "1544" "1548" "1549" "1553" "7498"

Exercise: Which gene symbols and gene names are associated with the following Entrez gene IDs: 1544, 1548 and 1549?
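As a hint for this kind of lookup, the sketch below queries the organism-level package for two other Entrez IDs from the output above ("9" and "10"), so that the exercise itself is left open. It assumes a current AnnotationDbi with the select() interface, which the original lab does not use, and it is guarded so that it is skipped when org.Hs.eg.db is not installed. The object name gene.info is hypothetical.

```r
# First two Entrez gene IDs returned for the caffeine metabolism pathway above.
entrez <- c("9", "10")

if (requireNamespace("org.Hs.eg.db", quietly = TRUE)) {
  library(org.Hs.eg.db)
  # Look up symbol and full gene name, keyed directly by Entrez gene ID.
  gene.info <- AnnotationDbi::select(org.Hs.eg.db,
                                     keys    = entrez,
                                     keytype = "ENTREZID",
                                     columns = c("SYMBOL", "GENENAME"))
  print(gene.info)
} else {
  message("org.Hs.eg.db is not installed; skipping the lookup.")
}
```

Because the organism package is keyed by gene rather than by probe, the same call works regardless of which array platform produced the gene list.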
