

BIT815 Notes on R analysis of RNA-seq data

The software and data for the RNA-Seq exercise are already available on the USB system. The notes below regarding installation of R packages and other software are provided for future reference.

Installation of R packages in a Linux environment (for future reference only)

If the packages installed in R are to be available to all users on a Linux system, then R must be run with root privileges so the packages can be installed in the system library. To run R with root privileges, start it using the command:

    sudo R

and then install the desired packages, either from the R repository using (for example):

    install.packages("RSQLite", dependencies=TRUE, lib="/usr/local/lib/R/site-library")

or from the Bioconductor repository, after loading the biocLite installation script with source(), using (for example):

    biocLite("GenomicFeatures", lib="/usr/local/lib/R/site-library")

If R is started without root privileges (i.e., no sudo command before R), then packages will be installed in the user's home directory (where the user has write permission), and those packages are not available to other users on the system.

We will use the database package SQLite and the R packages GenomicFeatures and RSQLite to save a database of transcript annotation; these packages are installed on the USB system.

Overview: Analysis of RNA-Seq data using R

Align to reference sequence -> Annotate aligned reads -> Count reads per feature -> Analyze counts for differences

1. The analysis of RNA-seq data begins with aligning the sequence reads to some reference sequence, either a reference genome sequence or a reference transcriptome previously assembled from sequence reads using the Trinity software package or another appropriate assembly method.

2. The next step is to merge the read alignment results with annotation of the reference sequence to identify which regions of the reference correspond to specific features: genes, transcripts, or exons.

3. The third step is to count the number of reads aligned to each feature in the reference sequence. The choice of which type of feature to use in counting reads is an important part of the experimental design, and depends on the biological question of interest.

   a. Counting reads that map to annotated genes means alternative splicing events will not be counted; all the reads that map to a specific region of genomic DNA will be assigned to the gene annotated at that site.

   b. Counting reads that map to annotated transcripts allows detection of previously known and annotated alternative splicing events, but does not allow detection of novel events.

   c. Counting reads that map to annotated exons allows detection of alternative splicing events, regardless of previous annotation, provided they involve the presence or absence of annotated exons.

   d. De novo assembly of RNA-Seq reads can give insight into novel transcripts that have not been previously described or annotated.

4. The final step is to use the read counts as input to a statistical analysis that can model the variation in read counts as a function of both technical and biological variation, and provide insight into which genes, transcripts, or exons are likely to be differentially represented in the read count data.

We will try to use the GenomicFeatures package of Bioconductor to create a database containing information on Arabidopsis transcripts (a "TxDb"), for use in analysis of the reads sampled from the experiment comparing bacteria-inoculated versus mock-inoculated Arabidopsis plants. See the link to "Gene-counter: a computational pipeline for the analysis of RNA-Seq data for gene expression differences" for a description of the experiment. More information on the GenomicFeatures package is available on its Bioconductor webpage and in documents linked from that webpage.

Exercises

1. Open a terminal window, create a directory called rnaseq in the home directory, and copy five sequence read files and the chromosome 5 reference sequence from /media/lubuntu/data/data to it. The ts3.fastq.gz file on the USB system is corrupt in some cases, so a new copy of that file can be downloaded from a link on the course webpage.

    mkdir ~/rnaseq
    cd /media/lubuntu/data/data
    cp Atchromo5.fasta.gz cn1.fastq.gz cn2.fastq.gz cn3.fastq.gz ts1.fastq.gz ts2.fastq.gz ~/rnaseq

Change to the rnaseq directory, and download the file t3.fq.gz from the course website (Week 5, transcriptome analysis page):

    cd ~/rnaseq
    wget <URL>

2. Start RStudio from the Applications > Programming menu (lower left corner of the screen), and load the packages RSQLite, biomaRt, ShortRead, GenomicFeatures, and DESeq2 for use.

    library(RSQLite); library(biomaRt); library(ShortRead); library(GenomicFeatures); library(DESeq2)

3. Use the listMarts() command to see a list of databases available through Biomart.

    listMarts()

Read the help file on the makeTxDbFromBiomart() command to learn which Biomart databases can be used to build a TxDb; this information is at the very bottom of the help file.

    ?makeTxDbFromBiomart

4. Use makeTxDbFromBiomart() to create a transcript database from the Arabidopsis TAIR10 gene annotation dataset athaliana_eg_gene in the plants_mart_25 database.

    transcriptdb <- makeTxDbFromBiomart(biomart="plants_mart_25", dataset="athaliana_eg_gene")

In my experience, this command fails with a cryptic error message about 0 or more than 1 file found on the FTP server at the Ensembl database. A web search using this error message as a query should identify a post on the Bioconductor support page asking why the error occurs, and the answers to that question are illuminating; it seems to be due to a bug introduced in a recent upgrade of the software. See the DEanalysis.R script file for a description of an alternative approach to creating the transcript database, and download the file At.TAIR10.sqlite using the following command:

    wget -O At.TAIR10.sqlite <URL>

The resulting file can be loaded into your R session with the loadDb() command.

    transcriptdb <- loadDb("At.TAIR10.sqlite")

5. Look at the properties of the transcript database you created. How many transcripts are represented? How many exons?

    metadata(transcriptdb)

6. Read the help file on the transcripts() command.

7. Use the transcripts() command to extract the set of all transcripts from chromosome 5 (as a GRanges object) out of the database.

    chr5txpts <- transcripts(transcriptdb, vals=list(tx_chrom="5"))

8. Set the working directory to ~/rnaseq, then save the database as an SQLite file for future use.

    setwd("~/rnaseq")
    saveDb(transcriptdb, file="atdb.sqlite")

To re-load the database from the saved file (in a new R session), use the command:

    mysaveddb <- loadDb("atdb.sqlite")

9. Look at the GRanges object chr5txpts to see how the chromosome name is represented. What name is shown in the seqnames column?

    head(chr5txpts)

10. This part of the exercise will use the terminal window rather than the RStudio window, because you will use other programs to align the six fastq.gz files to the reference chromosome 5 sequence to produce SAM/BAM files. In the Linux terminal, use the zcat and head commands to look at the first line of the Atchromo5.fasta.gz file in the rnaseq directory. What is the chromosome name in that file?

    zcat Atchromo5.fasta.gz | head -1
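Step 10 looks only at the first header line of the reference file. As a minimal sketch, assuming a reference file may contain more than one sequence, the function below lists every sequence name in a gzipped FASTA file (the text between ">" and the first whitespace on each header line). The function name list_fasta_names and the toy file are hypothetical and exist only for illustration; gzip -dc is used in place of zcat and behaves the same way on Linux.

```shell
#!/usr/bin/env bash
# List the sequence names in a gzipped FASTA file: keep header lines,
# drop the leading '>', and drop everything after the first whitespace.
list_fasta_names() {
    gzip -dc "$1" | grep '^>' | sed -e 's/^>//' -e 's/[[:space:]].*$//'
}

# Toy reference with two made-up sequences (hypothetical, illustration only).
demo_dir=$(mktemp -d)
printf '>seqA toy sequence one\nACGTACGT\n>seqB toy sequence two\nGGCCTTAA\n' \
    | gzip > "$demo_dir/toy.fa.gz"

# Print one sequence name per line.
list_fasta_names "$demo_dir/toy.fa.gz"
```

On the course data the equivalent call would be list_fasta_names Atchromo5.fasta.gz; the names it prints are the ones that have to agree with the seqnames column seen in step 9.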

11. KEY POINT: The reference sequence names in the BAM alignment files MUST MATCH the names in the R transcript database object to be used for analysis of those alignment results. The reason is obvious if you imagine aligning reads to a transcriptome assembly with 50,000 different sequences: each sequence must have the same unique name in both the alignment file and the transcript database so R can make the connection between the two datasets.

12. Use sed to change the name of the reference sequence in the Atchromo5.fasta.gz file to 5, to match the name in the chr5txpts GRanges object in R. The pattern given to sed must match the sequence name found in step 10. Then use bwa index to create an index of the modified file, followed by bwa mem to align the six files of RNA-seq reads to the reference sequence, piping the SAM output to samtools to filter out any unmapped reads, convert the alignments to BAM format, and sort them. The following code carries out these steps.

    zcat Atchromo5.fasta.gz | sed -e 's/gi ref NC_ /5/' | gzip > Atchr5.fa.gz
    bwa index -p Atchr5 Atchr5.fa.gz
    bwa mem -t 3 Atchr5 cn1ln7.fastq.gz | samtools view -SbuF4 - | samtools sort - ctrl1

The last series of commands (bwa mem | samtools view | samtools sort) should be repeated for the other five fastq.gz files: cn2ln8, cn3ln1, ts1ln4, ts2ln6, and ts3ln2. The output BAM files can be named ctrl2.bam, ctrl3.bam, test1.bam, test2.bam, and test3.bam. The F4 option removes reads that are not aligned to the reference sequence from the output, to reduce the file sizes. The samtools version we have (0.1.19) will give an error message that states "[bam_header_read] EOF marker is absent. The input is probably truncated." This is a bug in the samtools code, not a real problem. Using 3 processors (as specified by the -t 3 option), my laptop completes each alignment job in one to three minutes. To see how many processors are available, execute the following command at a terminal prompt:

    cat /proc/cpuinfo | grep processor

Note that processors are numbered using a 0-indexed system, so if the highest number is 3, there are a total of 4 processors.

NOTE: BWA is not designed to map RNA-seq reads that contain splice junctions between exons to genomic DNA, so reads that overlap such junctions will be lost in this analysis.

13. The modified files should now be ready for importing into R and analysis. Return to the RStudio window and look up the readAligned command.

    ?readAligned

Change the working directory to /home/lubuntu/rnaseq, then load the BAM files into R using readAligned().

    setwd("/home/lubuntu/rnaseq")
    c1 <- readAligned(".", "ctrl1.bam", type="BAM")
    c2 <- readAligned(".", "ctrl2.bam", type="BAM")
    c3 <- readAligned(".", "ctrl3.bam", type="BAM")
    t1 <- readAligned(".", "test1.bam", type="BAM")
    t2 <- readAligned(".", "test2.bam", type="BAM")
    t3 <- readAligned(".", "test3.bam", type="BAM")

14. Convert the AlignedRead objects produced by the readAligned() import process into GRanges objects using as(x, "GRanges"), and set the strand variable for each aligned read to "*". The library preparation method used for these reads was not strand-specific, so there is no useful biological information in that variable.

    c1gr <- as(c1, "GRanges")
    strand(c1gr) <- "*"
    c2gr <- as(c2, "GRanges")
    strand(c2gr) <- "*"
    c3gr <- as(c3, "GRanges")
    strand(c3gr) <- "*"
    t1gr <- as(t1, "GRanges")
    strand(t1gr) <- "*"
    t2gr <- as(t2, "GRanges")
    strand(t2gr) <- "*"
    t3gr <- as(t3, "GRanges")
    strand(t3gr) <- "*"

15. Look up the countOverlaps() function in R to see what it does.

    ?countOverlaps

16. Run countOverlaps() on each of the GRanges objects of aligned reads, using the chr5txpts object as a reference.

    c1.counts <- countOverlaps(chr5txpts, c1gr)
    c2.counts <- countOverlaps(chr5txpts, c2gr)
    c3.counts <- countOverlaps(chr5txpts, c3gr)
    t1.counts <- countOverlaps(chr5txpts, t1gr)
    t2.counts <- countOverlaps(chr5txpts, t2gr)
    t3.counts <- countOverlaps(chr5txpts, t3gr)

17. Combine the six vectors of read counts into a dataframe. Extract the transcript IDs for the 9288 transcripts on chromosome 5 from the transcriptdb database and use those to name the 9288 rows of the all.counts dataframe.

    all.counts <- data.frame(c1=c1.counts, c2=c2.counts, c3=c3.counts, t1=t1.counts, t2=t2.counts, t3=t3.counts)
    GRList <- transcriptsBy(transcriptdb, by="gene")
    tx_ids <- names(GRList)
    txpt.names <- select(transcriptdb, keys=tx_ids, keytype="GENEID", columns=c("TXID", "TXNAME", "GENEID"))
    chr5.rows <- which(substr(txpt.names$GENEID, 1, 3) == "AT5")
    rownames(all.counts) <- txpt.names[chr5.rows, "TXNAME"]

18. Create a data frame containing a factor that defines the experimental treatment of each column in the all.counts dataframe.

    trtmnts <- data.frame(condition=factor(c(rep("ctrl", 3), rep("test", 3))))

19. Look up the DESeqDataSetFromMatrix() command from the DESeq2 package.

    ?DESeqDataSetFromMatrix

20. Use DESeqDataSetFromMatrix() to convert the dataframe of counts and the treatment factor into a DESeqDataSet.

    Data <- DESeqDataSetFromMatrix(all.counts, trtmnts, formula(~condition))

21. Use estimateSizeFactors() to estimate the size of each of the six samples of reads.

    Data <- estimateSizeFactors(Data)

22. Use estimateDispersions() to estimate the variance among replicate samples.

    Data <- estimateDispersions(Data)

23. Use nbinomWaldTest() to test the significance of differential expression for all transcripts.

    Data <- nbinomWaldTest(Data)

24. Find the lines in the table of results with adjusted p-values (after correction for multiple testing) less than 0.05. Recover the gene IDs (the first 9 characters of the transcript IDs in the rownames of the signif.genes table) as a vector.

    result <- results(Data)
    signif.genes <- result[which(result$padj < 0.05),]
    gene.ids <- substr(rownames(signif.genes), 1, 9)

25. Retrieve functional annotation for the differentially expressed genes from Biomart.

    atdb <- useMart("plants_mart_21", dataset="athaliana_eg_gene")
    filters <- listFilters(atdb)
    attributes <- listAttributes(atdb)
    descriptions <- getBM(attributes=c("tair_locus", "description"), filters="tair_locus", values=gene.ids, mart=atdb)

Look in the Environment pane of the RStudio window for a summary of the descriptions and signif.genes objects; you will note that they differ by several rows in size. This is because there are several cases of two or more differentially expressed transcripts from single genes. You can find which gene IDs are duplicated in the vector gene.ids using the command which(duplicated(gene.ids)==TRUE); there are a total of 18. The same gene description from the descriptions object can be applied to all rows of the signif.genes object that share the same TAIR locus name. The following lines create a table of differentially expressed genes with the adjusted p-value and annotation, then write that table to a text file called DEgenes.txt in the current working directory.

    names.padj <- data.frame(names=gene.ids, padj=signif.genes$padj)
    merge.out <- merge(names.padj, descriptions, by.x=1, by.y=1, all.x=TRUE)
    write.table(merge.out, "DEgenes.txt", row.names=FALSE, col.names=TRUE, quote=FALSE, sep="\t")
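The six alignment runs described in step 12 can be scripted as a loop in the terminal rather than typed one at a time. The sketch below is one possible way to do that, assuming the read file prefixes and output names listed in step 12 and the samtools 0.1.19 command syntax used there. The function name align_all and the DRY_RUN variable are hypothetical names introduced here. By default the function only prints the commands so they can be checked before anything is run.

```shell
#!/usr/bin/env bash
# Pairs of "read file prefix:output BAM prefix", as listed in step 12.
samples="cn1ln7:ctrl1 cn2ln8:ctrl2 cn3ln1:ctrl3 ts1ln4:test1 ts2ln6:test2 ts3ln2:test3"

# Build the bwa mem | samtools view | samtools sort pipeline for each sample.
# With DRY_RUN=1 (the default in this sketch) the commands are printed only;
# with DRY_RUN=0 they are executed.
align_all() {
    local pair reads out cmd
    for pair in $samples; do
        reads=${pair%%:*}   # text before the colon
        out=${pair##*:}     # text after the colon
        cmd="bwa mem -t 3 Atchr5 ${reads}.fastq.gz | samtools view -SbuF4 - | samtools sort - ${out}"
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "$cmd"
        else
            eval "$cmd"
        fi
    done
}

# Dry run: print the six commands without executing them.
align_all
```

Once the printed commands look right, calling DRY_RUN=0 align_all in the rnaseq directory would execute them, provided bwa and samtools are installed and the Atchr5 index from step 12 exists.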
