Introduction to RNA-seq Analysis


Pete E. Pascuzzi
July 14, 2016

Contents

1 Intro
2 Setup
  2.1 Setup Directories
  2.2 Parallel Processing
  2.3 BAM Header Metadata
  2.4 Gene Annotation
  2.5 Genome Annotation
3 Sequence Alignment Map Data Format
  3.1 Converting BAM file to GAlignments Object
  3.2 Essential Alignment Data
  3.3 Additional Data
  3.4 Sequence Data
4 Determining Digital Gene Expression
  4.1 Counting Modes on Simulated Data
  4.2 Clean-Up Environment to Free Memory
  4.3 Method for Limited Memory and Single Core
  4.4 Parallel Processing of BAM Files on Multiple Cores
5 Differential Gene Expression Analysis with EdgeR
  5.1 Formatting the Data
  5.2 Normalizing the Libraries (Counts)
6 SessionInfo

1 Intro

The following vignette is a basic RNAseq analysis of data from St. John, et al., Mol. Endocrinol. The data was deposited at NCBI GEO under a Super Series (GSE accession), and the RNAseq data is one Series (GSE accession) of it. We will work with only a subset of these samples, the 2 X 2 design of mouse cells, untreated or treated with vitamin D at three days and 35 days.

At GEO, there is metadata and processed data that is not amenable to further statistical analysis. However, the raw data in the form of FASTQ files was deposited in the NCBI Short Read Archive (SRA). SRA is an entirely different system from GEO with a totally different series of accession numbers. The system can be confusing to navigate, but you should spend some time there familiarizing yourself with it. Importantly, the raw data in SRA can be downloaded and reanalyzed. However, each of these files can be very large, greater than 20 GB each. The raw data for this vignette is about 500 GB combined.

NCBI has developed a series of command line UNIX tools that are used to manipulate and download SRA data. This is not a simple matter of file transfer. Raw data at SRA is stored in SRA format to optimize storage. This data must be converted back to FASTQ format (sometimes SAM or BAM format) using specific utilities. For example, we used the tool fastq-dump to transfer files from SRA to Rice and convert them to FASTQ format. This took more than a day!

In fact, these FASTQ files are not entirely raw. Typically, NGS experiments are indexed/bar-coded/tagged so that several samples can be run together on a single lane. So, if you download an SRA file that corresponds to a single sample, these indices have already been processed so that the reads that correspond to a specific sample are sent to a specific file. Another step that may have been done is the removal of adaptors that are necessary for library construction and sequencing. The reads are then ready for additional QC and alignment to your reference genome.
We have already performed quality trimming of the reads with the FASTX toolkit, removing bases from either end of the read that have a PHRED quality score below 30. We have QC'd the reads with the tool FastQC. The reads were aligned to the mouse reference genome GRCm38 (Mus musculus.GRCm38.fa) with the tool TopHat. The resulting Sequence Alignment Map (SAM) files were saved in binary format as BAM files, which were sorted and indexed to facilitate analysis and visualization. The steps that this vignette will cover are an overview of the BAM file format, generation of a gene annotation object that can be used to make your gene count table, counting of reads over genes, and edgeR analysis for differential expression.

2 Setup

When working with such large files, it is important to think about your storage and directory structure. We have a huge amount of disk space available to us in our scratch directory on Rice. For many tasks, it would be typical to copy your data to scratch, do your work, and copy the results back to secure storage such as Data Depot. However, to avoid transferring large files that could become corrupted, we are going to leave our BAM files on Data Depot and access them from Rice while working in our scratch directory. Additionally, some of the tools that have been used to process the data have specific conventions. TopHat generates a large number of files and directories, so when we want to access these files, we typically have to use long file paths. If you choose to modify this script for your own analysis, you will absolutely need to modify the paths to your own files!

2.1 Setup Directories

We will access various data for this experiment from Data Depot, so it will be convenient to define at least part of this path early.

    # define directories for data
    #data.dir <- "/depot/nihomics/data/ngs/pike"  # path for Rice
    data.dir <- "/Volumes/nihomics/data/ngs/Pike"  # path for Pete's MacBook
    library(Rsamtools)
    bam.files <- list.files(file.path(data.dir, "rnaseq", "tophat"), pattern="accepted_hits_
    bam.files
    ## [Output: a character vector of BAM file paths, each of the form /Volumes/nihomics/data/ngs/Pike/rnaseq/tophat/SRR..._tophat_out/SRR..._...]
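The list.files call is cut off at the page margin, so its remaining arguments are not known. As a minimal sketch of the general pattern, assuming a recursive search that returns full paths, the following base R code builds a throw-away directory tree and searches it. The directory layout, sample names and file names are invented for the illustration and are not the ones used in this analysis.

```r
# Build a temporary directory tree with one output folder per sample.
# All names below are hypothetical and exist only for this illustration.
root <- file.path(tempdir(), "tophat_demo")
for (s in c("sampleA", "sampleB")) {
  out.dir <- file.path(root, paste0(s, "_tophat_out"))
  dir.create(out.dir, recursive = TRUE, showWarnings = FALSE)
  file.create(file.path(out.dir, paste0(s, "_hits.bam")))
  file.create(file.path(out.dir, paste0(s, "_hits.bam.bai")))
}

# recursive = TRUE descends into the per-sample folders;
# full.names = TRUE returns paths that can be handed to other functions.
# 'pattern' is a regular expression, so dots are escaped and "$" anchors the end.
demo.bam <- list.files(root, pattern = "\\.bam$", recursive = TRUE, full.names = TRUE)
demo.bai <- list.files(root, pattern = "\\.bam\\.bai$", recursive = TRUE, full.names = TRUE)

basename(demo.bam)
basename(demo.bai)

# Both vectors are sorted the same way, so element i of each belongs together.
stopifnot(identical(paste0(demo.bam, ".bai"), demo.bai))
```

Anchoring the pattern with "$" matters here: without it, the pattern for BAM files would also match the index files.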

The vector bam.files holds the paths of 12 BAM files in total.

2.2 Parallel Processing

Now that we have the location of the BAM files, we are ready to start processing them further. A huge advantage of using Rice for this type of analysis is that each node has 20 cores: twenty processors that share memory (64 GB) but that can work on independent tasks. One advantage of Bioconductor is that it provides a package, BiocParallel, that facilitates the use of multiple cores for parallel processing. However, you do need to download this package and configure the options for your specific setup. Below, we are telling Bioconductor how many cores it can have (18 on Rice, 6 on the MacBook used to compile this document) and letting it assign tasks as it sees fit. How you can use BiocParallel will depend on your computer. MulticoreParam works well for UNIX-like operating systems. It will not work on Windows machines.

    # set up for parallel processing
    library(BiocParallel)
    library(parallel)
    registered()
    ## $MulticoreParam
    ## class: MulticoreParam
    ##   bpjobname:BPJOB; bpworkers:6; bptasks:0; bptimeout:Inf; bpRNGseed:; bpisup:FALSE
    ##   bplog:FALSE; bpthreshold:INFO; bplogdir:NA
    ##   bpstoponerror:FALSE; bpprogressbar:FALSE
    ##   bpresultdir:NA
    ## cluster type: FORK

    ## $SnowParam
    ## class: SnowParam
    ##   bpjobname:BPJOB; bpworkers:6; bptasks:0; bptimeout:Inf; bpRNGseed:; bpisup:FALSE
    ##   bplog:FALSE; bpthreshold:INFO; bplogdir:NA
    ##   bpstoponerror:FALSE; bpprogressbar:FALSE
    ##   bpresultdir:NA
    ## cluster type: SOCK
    ##
    ## $SerialParam
    ## class: SerialParam
    ##   bplog:FALSE; bpthreshold:INFO
    ##   bpcatcherrors:FALSE

    detectCores()
    ## [1] 8
    #bpparam <- MulticoreParam(workers=18, tasks=0)  # parameters for Rice
    bpparam <- MulticoreParam(workers=6, tasks=0)  # parameters for Pete's MacBook
    register(bpparam)
    registered()
    ## $MulticoreParam
    ## class: MulticoreParam
    ##   bpjobname:BPJOB; bpworkers:6; bptasks:0; bptimeout:Inf; bpRNGseed:; bpisup:FALSE
    ##   bplog:FALSE; bpthreshold:INFO; bplogdir:NA
    ##   bpstoponerror:FALSE; bpprogressbar:FALSE
    ##   bpresultdir:NA
    ## cluster type: FORK
    ##
    ## $SnowParam
    ## class: SnowParam
    ##   bpjobname:BPJOB; bpworkers:6; bptasks:0; bptimeout:Inf; bpRNGseed:; bpisup:FALSE
    ##   bplog:FALSE; bpthreshold:INFO; bplogdir:NA
    ##   bpstoponerror:FALSE; bpprogressbar:FALSE
    ##   bpresultdir:NA
    ## cluster type: SOCK
    ##
    ## $SerialParam

    ## class: SerialParam
    ##   bplog:FALSE; bpthreshold:INFO
    ##   bpcatcherrors:FALSE

Below is an example of the bplapply function, a parallel version of lapply that can send tasks to multiple processors. DO NOT UNCOMMENT THIS LINE AND RUN IT. The BAM files are already indexed. This is only an example. However, you do need to get the location of the BAM index files.

    # index BAM files so that they can be processed more easily and read by IGV and other applications.
    # This has already been done
    #system.time(bplapply(bam.files, indexBam))
    # Get the full names for the bam index files
    bam.index <- list.files(file.path(data.dir, "rnaseq", "tophat"), pattern="bam.bai$", rec
    bam.index

2.3 BAM Header Metadata

Now, we can start to look at the data in a BAM file. The first thing that you should examine is the BAM header. It contains information on how the reads were aligned. In particular, it should give you the names of the chromosomes to which the reads were aligned, and information about the aligner.

    # get the seqnames for the references used in the alignments
    bam.header <- scanBamHeader(bam.files[1])
    length(bam.header)
    ## [1] 1
    bam.header <- bam.header[[1]]
    names(bam.header)
    ## [1] "targets" "text"
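To make the two header components concrete, here is a small base R sketch that works on a hand-made stand-in for one header element rather than on a real BAM file. The object mock.header and the helper parse_fields are hypothetical names introduced for this illustration, and the length given for chromosome "1" is a made-up round number; only the MT length is taken from the header shown in this vignette.

```r
# A hand-made stand-in for one element of the list returned by scanBamHeader().
# The length of chromosome "1" is invented; 16299 is the MT length from the header.
mock.header <- list(
  targets = c("1" = 1000000L, "MT" = 16299L),
  text = list(
    "@HD" = c("VN:1.0", "SO:coordinate"),
    "@SQ" = c("SN:1", "LN:1000000"),
    "@SQ" = c("SN:MT", "LN:16299")
  )
)

# Reference names are the names of the 'targets' vector; its values are lengths.
mock.chrs <- names(mock.header$targets)

# Each 'text' element is a vector of TAG:value fields.
# Split one record into a named vector of values.
parse_fields <- function(fields) {
  tags <- sub(":.*$", "", fields)
  values <- sub("^[^:]*:", "", fields)
  setNames(values, tags)
}

hd <- parse_fields(mock.header$text[["@HD"]])
hd[["SO"]]  # sort order of the alignments
```

A sort order of "coordinate" is what you would expect for a BAM file that has been sorted for indexing.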

    bam.header$targets
    ## [Output: a named vector of sequence lengths, one per reference sequence: the numbered chromosomes, the GL and JH contigs, MT, X and Y.]
    bam.chrs <- names(bam.header$targets)
    head(bam.header$text)
    ## $`@HD`
    ## [1] "VN:1.0"        "SO:coordinate"
    ##
    ## [The remaining elements are @SQ records, one per reference sequence (SN:1, SN:10, SN:11, ...), each giving the sequence name (SN) and its length (LN).]

    tail(bam.header$text)
    ## [Output: the last @SQ records (two JH contigs with LN:158099 and LN:114452, SN:MT with LN:16299, SN:X and SN:Y), followed by the @PG record:]
    ## $`@PG`
    ## [1] "ID:TopHat"
    ## [2] "VN:2.1.0"
    ## [3] "CL:/group/bioinfo/apps/apps/tophat-2.1.0/tophat -p 10 --library-type fr-secondst

2.4 Gene Annotation

Now that we have information about the reference sequence, we can generate a suitable object that represents the genes in the reference genome. There are many ways to do this, but we are going to use precompiled Bioconductor databases. It is not a simple matter of finding the transcription start site and transcription termination site of the genes and defining that interval. Why? Because RNAseq reads derive from spliced transcripts, what we need to do is retrieve the coordinates for the gene exons, grouping the exons together by gene. Further, we want to include all possible exonic sequences. Bioconductor has many functions that help you to achieve such tasks. Below, we will use exonsBy to get all exons for mouse genes.

    library(Mus.musculus, verbose=TRUE)
    ## Warning in library(Mus.musculus, verbose = TRUE): package Mus.musculus already present in search()
    mouse.genes0 <- exonsBy(Mus.musculus, by="gene")
    mouse.genes0
    ## [Output: a GRangesList object of length 24028. The first three elements are GRanges objects with 7 ranges on chr9, 6 ranges on chr7 and 1 range on chr10, each with the metadata columns exon_id and exon_name.]

    ## <24025 more elements>
    ## seqinfo: 66 sequences (1 circular) from mm10 genome

    # exons are for each transcript but grouped by gene, so some exons are represented multiple times
    mouse.genes0[218]
    ## [Output: a GRangesList object of length 1 whose GRanges object has 22 ranges and 2 metadata columns (exon_id, exon_name), all on chr12.]
    ## seqinfo: 66 sequences (1 circular) from mm10 genome

Note that this mouse gene has 22 exons, but many of these are overlapping or the same. That is because all exons for all transcripts are represented for this gene. If you count reads over this object, you could potentially double or triple your counts because you would get gene counts for every transcript. In fact, Bioconductor is pretty smart and this does not happen, but we are going to perform a reduce on these exons anyway. This function combines overlapping or redundant intervals into a single non-redundant set.

    mouse.genes <- reduce(mouse.genes0)
    mouse.genes[218]
    ## GRangesList object of length 1:
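The effect of reduce can be sketched in a few lines of base R. This is only an illustration of the idea on plain start and end vectors, not the Bioconductor implementation; the helper name reduce_intervals and the exon coordinates are invented. It assumes that directly adjacent intervals are merged as well as overlapping ones.

```r
# Merge overlapping, duplicated or directly adjacent intervals.
# 'start' and 'end' are numeric vectors of equal length (inclusive coordinates).
reduce_intervals <- function(start, end) {
  if (length(start) == 0) return(data.frame(start = numeric(0), end = numeric(0)))
  ord <- order(start, end)
  start <- start[ord]
  end <- end[ord]
  out.start <- start[1]
  out.end <- end[1]
  for (i in seq_along(start)[-1]) {
    n <- length(out.end)
    if (start[i] <= out.end[n] + 1) {
      # Interval i touches or overlaps the current merged interval: extend it.
      out.end[n] <- max(out.end[n], end[i])
    } else {
      # Gap found: open a new merged interval.
      out.start <- c(out.start, start[i])
      out.end <- c(out.end, end[i])
    }
  }
  data.frame(start = out.start, end = out.end)
}

# Five hypothetical exons from several transcripts of one gene.
exon.start <- c(100, 150, 150, 300, 401)
exon.end   <- c(200, 250, 250, 400, 450)
reduced <- reduce_intervals(exon.start, exon.end)
reduced
```

The five input exons collapse to two intervals, 100 to 250 and 300 to 450, so a read falling in the shared region is represented once rather than once per transcript.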

For gene 218, reduce leaves a GRanges object with 16 ranges and 0 metadata columns on chr12, in place of the original 22 ranges (seqinfo: 66 sequences (1 circular) from mm10 genome).

Now, let's check the chromosome names (seqnames) that are used in our mouse.genes object against the chromosome names that are used in the BAM files. These are often incompatible, and you will not get any counts for your genes if Bioconductor cannot match up the seqnames! We have to do some strange data conversions here because some of the data is stored as run-length encoded vectors.

    seqlevels(mouse.genes)
    ##  [1] "chr1"                 "chr2"                 "chr3"
    ##  [4] "chr4"                 "chr5"                 "chr6"
    ##  [7] "chr7"                 "chr8"                 "chr9"
    ## [10] "chr10"                "chr11"                "chr12"
    ## [13] "chr13"                "chr14"                "chr15"
    ## [16] "chr16"                "chr17"                "chr18"
    ## [19] "chr19"                "chrX"                 "chrY"
    ## [22] "chrM"                 "chr1_GL456210_random" "chr1_GL456211_random"
    ## [25] "chr1_GL456212_random" "chr1_GL456213_random" "chr1_GL456221_random"
    ## [28] "chr4_GL456216_random" "chr4_GL456350_random" "chr4_JH584292_random"
    ## [31] "chr4_JH584293_random" "chr4_JH584294_random" "chr4_JH584295_random"

    ## [34] "chr5_GL456354_random" "chr5_JH584296_random" "chr5_JH584297_random"
    ## [37] "chr5_JH584298_random" "chr5_JH584299_random" "chr7_GL456219_random"
    ## [40] "chrX_GL456233_random" "chrY_JH584300_random" "chrY_JH584301_random"
    ## [43] "chrY_JH584302_random" "chrY_JH584303_random" "chrUn_GL456239"
    ## [46] "chrUn_GL456359"       "chrUn_GL456360"       "chrUn_GL456366"
    ## [49] "chrUn_GL456367"       "chrUn_GL456368"       "chrUn_GL456370"
    ## [52] "chrUn_GL456372"       "chrUn_GL456378"       "chrUn_GL456379"
    ## [55] "chrUn_GL456381"       "chrUn_GL456382"       "chrUn_GL456383"
    ## [58] "chrUn_GL456385"       "chrUn_GL456387"       "chrUn_GL456389"
    ## [61] "chrUn_GL456390"       "chrUn_GL456392"       "chrUn_GL456393"
    ## [64] "chrUn_GL456394"       "chrUn_GL456396"       "chrUn_JH584304"

    # not the same style as BAM files
    table(seqlevels(mouse.genes) %in% bam.chrs)
    ## FALSE
    ##    66

    # change the style
    seqlevelsStyle(mouse.genes) <- "NCBI"
    table(seqlevels(mouse.genes) %in% bam.chrs)
    ## [Output: counts of FALSE and TRUE; only some of the seqlevels now match.]

    # still problems with some (Bioconductor error?)
    # determine how many annotated genes are on each sequence
    seqnames.genes <- unlist(runValue(seqnames(mouse.genes)))
    table(seqnames.genes)
    ## [Output: a table of the number of annotated genes on each sequence.]

In this table the main chromosomes now carry NCBI-style names (numbers, X, Y, MT), but the contigs still carry UCSC-style names such as chr1_GL456210_random and chrUn_GL456239.

    # a few genes are on unassembled contigs, so we will adjust the seqnames to match bam.chrs
    temp.seqlevels <- seqlevels(mouse.genes)

    summary(bam.chrs %in% temp.seqlevels)
    ## [Output: a logical summary with both FALSE and TRUE entries.]
    temp.seqlevels[23:44] <- substr(temp.seqlevels[23:44], 6, 13)
    temp.seqlevels[45:66] <- substr(temp.seqlevels[45:66], 7, 14)
    temp.seqlevels[23:66] <- paste(temp.seqlevels[23:66], ".1", sep="")
    summary(bam.chrs %in% temp.seqlevels)
    ##    Mode    TRUE    NA's
    ## logical      66       0
    seqlevels(mouse.genes) <- temp.seqlevels
    seqnames.genes <- unlist(runValue(seqnames(mouse.genes)))
    table(seqnames.genes)
    ## [Output: the number of annotated genes on each sequence, now labelled with the names used in the BAM files (numbers, X, Y, MT and the GL and JH contigs).]
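The substr calls rely on fixed character positions: elements 23 to 44 all have a five-character prefix such as "chr1_", and elements 45 to 66 have the six-character prefix "chrUn_". The sketch below replays that logic on four of the contig names and compares it with a pattern-based alternative that does not depend on the element order. The regular expression is an assumption about the naming scheme, checked here only against these four names.

```r
ucsc.names <- c("chr1_GL456210_random", "chrX_GL456233_random",
                "chrUn_GL456239", "chrUn_JH584304")

# Position-based version, as in the text: skip the 5- or 6-character prefix
# and keep the 8-character accession.
by.position <- c(substr(ucsc.names[1:2], 6, 13), substr(ucsc.names[3:4], 7, 14))
by.position <- paste(by.position, ".1", sep = "")

# Pattern-based version: capture the accession between the first "_" and an
# optional "_random" suffix, then append the version suffix ".1".
by.pattern <- sub("^chr[^_]+_([A-Z]+[0-9]+)(_random)?$", "\\1.1", ucsc.names)

by.pattern
# → "GL456210.1" "GL456233.1" "GL456239.1" "JH584304.1"

stopifnot(identical(by.position, by.pattern))
```

The position-based version would break for a name with a longer prefix, for example a contig assigned to a two-digit chromosome, whereas the pattern-based version would still extract the accession.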

Now, the seqnames of mouse.genes match the seqnames in our BAM files.

2.5 Genome Annotation

Next, we are going to make an object that represents the chromosomes in the mouse genome. This object is required if you want to access the reads in the BAM file one chromosome at a time. This is the best way to work with BAM files if you have limited memory on your computer, e.g. 16 GB or less on a laptop. Again, we need to make sure that the seqnames match!

    library(BSgenome.Mmusculus.UCSC.mm10)
    mouse.bs <- BSgenome.Mmusculus.UCSC.mm10
    mouse.bs
    ## Mouse genome:
    ## # organism: Mus musculus (Mouse)
    ## # provider: UCSC
    ## # provider version: mm10
    ## # release date: Dec
    ## # release name: Genome Reference Consortium GRCm38
    ## # 66 sequences:
    ## #   chr1 chr2 chr3
    ## #   chr4 chr5 chr6
    ## #   chr7 chr8 chr9
    ## #   chr10 chr11 chr12
    ## #   chr13 chr14 chr15
    ## #   ...
    ## #   [the listing ends with the chrUn_GL and chrUn_JH contigs]
    ## # (use 'seqnames()' to see all the sequence names, use the '$' or '[['
    ## # operator to access a given sequence)
    seqlevelsStyle(mouse.bs) <- "NCBI"
    seqlevels(mouse.bs)

    ##  [1] "1"                    "2"                    "3"
    ##  [4] "4"                    "5"                    "6"
    ##  [7] "7"                    "8"                    "9"
    ## [10] "10"                   "11"                   "12"
    ## [13] "13"                   "14"                   "15"
    ## [16] "16"                   "17"                   "18"
    ## [19] "19"                   "X"                    "Y"
    ## [22] "MT"                   "chr1_GL456210_random" "chr1_GL456211_random"
    ## [25] "chr1_GL456212_random" "chr1_GL456213_random" "chr1_GL456221_random"
    ## [28] "chr4_GL456216_random" "chr4_GL456350_random" "chr4_JH584292_random"
    ## [31] "chr4_JH584293_random" "chr4_JH584294_random" "chr4_JH584295_random"
    ## [34] "chr5_GL456354_random" "chr5_JH584296_random" "chr5_JH584297_random"
    ## [37] "chr5_JH584298_random" "chr5_JH584299_random" "chr7_GL456219_random"
    ## [40] "chrX_GL456233_random" "chrY_JH584300_random" "chrY_JH584301_random"
    ## [43] "chrY_JH584302_random" "chrY_JH584303_random" "chrUn_GL456239"
    ## [46] "chrUn_GL456359"       "chrUn_GL456360"       "chrUn_GL456366"
    ## [49] "chrUn_GL456367"       "chrUn_GL456368"       "chrUn_GL456370"
    ## [52] "chrUn_GL456372"       "chrUn_GL456378"       "chrUn_GL456379"
    ## [55] "chrUn_GL456381"       "chrUn_GL456382"       "chrUn_GL456383"
    ## [58] "chrUn_GL456385"       "chrUn_GL456387"       "chrUn_GL456389"
    ## [61] "chrUn_GL456390"       "chrUn_GL456392"       "chrUn_GL456393"
    ## [64] "chrUn_GL456394"       "chrUn_GL456396"       "chrUn_JH584304"

    # same problem with seqnames
    seqlevels(mouse.bs) <- temp.seqlevels
    # prepare GRanges object that can be used as a parameter to access specific chromosomes and regions from BAM files.
    mouse.gr <- GRanges(seqnames=Rle(as.character(seqnames(mouse.bs))), ranges=IRanges(start
    mouse.gr
    ## [Output: a GRanges object with 66 ranges and 0 metadata columns, one range per sequence, each starting at position 1 with strand *.]

Every range in mouse.gr spans a whole sequence; the small contigs in rows 63 to 65, for example, run from 1 to 55711, 24323 and 21240. The seqinfo again reports 66 sequences (1 circular) from the mm10 genome.

3 Sequence Alignment Map Data Format

3.1 Converting BAM file to GAlignments Object

There are several Bioconductor packages that allow you to access BAM files. Here we are going to use readGAlignments to make a GAlignments object for the reads on chr19. To do this, we need to specify certain parameters. What data do you want? Which chromosomes? Do you want to filter the data on specific flags?

    # needed to redefine the file paths so the PDF would compile
    data.dir <- "/Volumes/nihomics/data/ngs/Pike"  # path for Pete's MacBook
    bam.files <- list.files(file.path(data.dir, "rnaseq", "tophat"), pattern="accepted_hits_
    bam.index <- list.files(file.path(data.dir, "rnaseq", "tophat"), pattern="bam.bai$", rec
    library(GenomicAlignments)
    flag.param <- scanBamFlag()
    what.param <- scanBamWhat()
    which.param <- mouse.gr[19]
    my.param <- ScanBamParam(what=what.param, which=which.param, flag=flag.param)
    my.param
    ## class: ScanBamParam
    ## bamFlag (NA unless specified):
    ## bamSimpleCigar: FALSE
    ## bamReverseComplement: FALSE
    ## bamTag:
    ## bamTagFilter:
    ## bamWhich: 1 ranges
    ## bamWhat: qname, flag, rname, strand, pos, qwidth, mapq, cigar, mrnm, mpos, isize, seq, qual, groupid, mate_status
    ## bamMapqFilter: NA

    bam.ga <- readGAlignments(bam.files[1], index=bam.index[1], param=my.param)
    bam.ga
    ## [Output: a GAlignments object with 13 metadata columns. The core columns seqnames, strand, cigar, qwidth, start, end, width and njunc are followed by the metadata columns qname, flag, rname, strand, pos, qwidth, mapq, cigar, mrnm, mpos, isize, seq and qual.]

    ## [Output continues with the mrnm, mpos and isize columns (mostly <NA>, 0 and 0 in the rows shown), followed by the seq and qual columns, which hold the read sequences and their quality strings.]

    ## seqinfo: 66 sequences from an unspecified genome

You should see that some of the data is repeated. The reason is that some essential data is always imported, so we don't need to request it in our what.param.

    what.param <- c("qname", "flag", "mapq", "seq", "qual")
    my.param <- ScanBamParam(what=what.param, which=which.param, flag=flag.param)
    bam.ga <- readGAlignments(bam.files[1], index=bam.index[1], param=my.param)
    bam.ga
    ## [Output: a GAlignments object with 5 metadata columns. The core columns seqnames, strand, cigar, qwidth, start, end, width and njunc are followed by qname and flag.]

    ## [Output continues with the mapq column (values such as 0, 1 and 50 in the rows shown), followed by the seq and qual columns.]

    ## seqinfo: 66 sequences from an unspecified genome

3.2 Essential Alignment Data

There are certain fields in the BAM file that are always imported. Let's go through these one by one.

    seqnames(bam.ga)
    ## [Output: a factor-Rle with a single run of value 19 and 66 levels.]
    strand(bam.ga)
    ## [Output: a factor-Rle with the 3 levels +, - and *.]
    cigar.tbl <- sort(table(cigar(bam.ga)), decreasing=TRUE)
    head(cigar.tbl, 20)
    ## [Output: the 20 most frequent CIGAR strings. The most common is 101M. Most of the others contain one or more N operations, for example 26M623N75M, 2M1413N99M and 37M109N64M, and two contain a deletion or an insertion: 97M1D4M and 96M1I4M.]
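A CIGAR string is a sequence of length and operation pairs, and the qwidth, width and njunc columns can all be derived from it. The base R sketch below shows one way to do that for a single string. The helper cigar_summary is a hypothetical name, and the rules it applies (M, I, S, = and X consume the read; M, D, N, = and X consume the reference; each N is one junction) follow the SAM convention rather than any particular Bioconductor function.

```r
# Summarise one CIGAR string into read length, reference span and junction count.
cigar_summary <- function(cigar) {
  lens <- as.integer(regmatches(cigar, gregexpr("[0-9]+", cigar))[[1]])
  ops  <- regmatches(cigar, gregexpr("[MIDNSHP=X]", cigar))[[1]]
  stopifnot(length(lens) == length(ops))
  c(qwidth = sum(lens[ops %in% c("M", "I", "S", "=", "X")]),  # bases of the read
    width  = sum(lens[ops %in% c("M", "D", "N", "=", "X")]),  # span on the reference
    njunc  = sum(ops == "N"))                                 # skipped regions
}

# Four of the CIGAR strings that appear in the table above.
cigar.demo <- sapply(c("101M", "26M623N75M", "97M1D4M", "96M1I4M"), cigar_summary)
cigar.demo
```

All four reads are 101 bases long, but 26M623N75M spans 724 bases of the reference because the 623-base N operation skips an intron, while the deletion and the insertion change the reference span to 102 and 100.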

    range(qwidth(bam.ga))
    range(width(bam.ga))
    njunc.tbl <- table(njunc(bam.ga))
    njunc.tbl
    ## [Output: the range of read lengths, the range of alignment widths on the reference, and a table of the number of junctions per alignment.]

3.3 Additional Data

Some data is only optionally reported, but these fields contain important information about the reads. The FLAG field contains information on how the reads aligned, especially how the two reads of a pair in a paired-end run aligned with respect to each other. The MAPQ field tells you how good or bad the alignment is.

    head(mcols(bam.ga))
    ## [Output: a DataFrame with 6 rows and 5 columns: qname, flag, mapq, seq and qual.]

    flag.tbl <- table(mcols(bam.ga)$flag)
    flag.tbl
    flag.mat <- bamFlagAsBitMatrix(as.integer(names(flag.tbl)))
    flag.mat <- cbind(as.integer(names(flag.tbl)), flag.mat)
    flag.mat
    ## [Output: a matrix with one row for each of the four observed flag values and one 0/1 column per flag bit: isPaired, isProperPair, isUnmappedQuery, hasUnmappedMate, isMinusStrand, isMateMinusStrand, isFirstMateRead, isSecondMateRead, isSecondaryAlignment, isNotPassingQualityControls and isDuplicate.]
    # flag every alignment whose read name occurs more than once
    dups <- duplicated(mcols(bam.ga)$qname) | duplicated(mcols(bam.ga)$qname, fromLast=TRUE)

    dup.ga <- bam.ga[dups]
    dup.ga <- dup.ga[order(mcols(dup.ga)$qname)]
    dup.ga
    ## [Output: a GAlignments object with 383485 alignments and 5 metadata columns (qname, flag, mapq, seq and qual).]

Because dup.ga is ordered by qname, consecutive rows in this listing are the multiple alignments of one read: they share the same read sequence and the same quality string. The seqinfo reports 66 sequences from an unspecified genome.

    table(mcols(bam.ga)$flag, dups)
    ## [Output: a table of flag values against dups (FALSE, TRUE).]

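    mapq.tbl <- table(mcols(bam.ga)$mapq)
    mapq.tbl
    table(mcols(bam.ga)$mapq, dups)
    ## [Output: a table of MAPQ values against dups (FALSE, TRUE).]
    table(mcols(bam.ga)$mapq, mcols(bam.ga)$flag, dups)
    ## [Output: two tables of MAPQ against flag, one for dups = FALSE and one for dups = TRUE.]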
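The FLAG value is a sum of powers of two, one per property, which is why it has to be unpacked into bits before it can be read. The following base R sketch decodes a flag with bitwAnd. The bit values follow the SAM specification and the names mirror the columns of the flag matrix above; the helper decode_flag is a hypothetical name, and the three example flags are chosen for illustration rather than taken from the data.

```r
# Bit values of the SAM FLAG field, named like the columns of the flag matrix.
flag.bits <- c(isPaired = 1L, isProperPair = 2L, isUnmappedQuery = 4L,
               hasUnmappedMate = 8L, isMinusStrand = 16L, isMateMinusStrand = 32L,
               isFirstMateRead = 64L, isSecondMateRead = 128L,
               isSecondaryAlignment = 256L, isNotPassingQualityControls = 512L,
               isDuplicate = 1024L)

# Return the names of all bits that are set in a single flag value.
decode_flag <- function(flag) {
  flag <- rep(as.integer(flag), length(flag.bits))
  names(flag.bits)[bitwAnd(flag, flag.bits) > 0]
}

decode_flag(0)    # no bit set
decode_flag(16)   # "isMinusStrand"
decode_flag(272)  # "isMinusStrand" "isSecondaryAlignment"
```

A flag of 272 is 16 + 256, so it describes a secondary alignment on the minus strand.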

3.4 Sequence Data

The sequence data includes both the actual sequence of the read and the quality scores for each base in PHRED format.

    mcols(bam.ga)$seq
    ## [Output: a DNAStringSet instance; every entry has width 101. The first five are:]
    ## [1] CCTAGTATATCTGGAGAGTTAAGATGGG...CACAAATATTTCCACGCTTTTTCACTAC
    ## [2] CCTAGTATATCTGGAGAGTTAAGATGGG...CACAAATATTTCCACGCTTTTTCACTAC
    ## [3] AGGGGGAGATGTGAGGAGCCGCCCTTGC...AAACAAGTAGTCTGCGCATGTGCTGGGG
    ## [4] AGGGGGAGATGTGAGGAGCCGCCCTTGC...AAACAAGTAGTCTGCGCATGTGCTGGGG
    ## [5] CTATGGCCTTGGGCATCAAGATTTAAAA...AGGGTAGATTCCCCCTTTTTGTTTAATT
    ## ...
    ecor1 <- vcountPattern("GAATTC", mcols(bam.ga)$seq)
    table(ecor1)
    ## [Output: a table of the number of GAATTC sites per read.]
    mcols(bam.ga)$seq[ecor1 > 0]
    ## [Output: a DNAStringSet instance of 109820 reads of width 101 that contain at least one GAATTC site. The first five and last five are:]
    ##      [1] ATATAGTGGATTACTTTGATGGATTTCCA...ATGATTGTTTTGATGTGTTCTTGAATTC
    ##      [2] ATATAGTGGATTACTTTGATGGATTTCCA...ATGATTGTTTTGATGTGTTCTTGAATTC
    ##      [3] CAAAATCCAACACCTATTCATGATGAAAG...AAAGCAATATACAGCAAGCCAGTAGCCA
    ##      [4] TGGCTACTGGTTTGCTGTAGATTGCTTTT...TTTTATCATGAATGGGTGTTGGATCTTG
    ##      [5] CATCTTTGCCATGATATTTTTTGCTTTAG...CCCAAATGCTGCATAATATCCCTTCCCC
    ##      ...
    ## [109816] CACATGATCATCTCGTTAGATGCAGAAAA...AAAGATCAGGAATTCAAGGCCCATACCT
    ## [109817] TTAGATGCAGAAAAAGCATTTGACAAGAT...AAGGCCCATACCTAAACATGATAAAAGC
    ## [109818] GTACACCCCTGCATGATAGAATTCTGCAG...TAAGTGATTCTTGGATCCTAACATTCTA
    ## [109819] GTACACCCCTGCATGATAGAATTCTGCAG...TAAGTGATTCTTGGATCCTAACATTCTA
    ## [109820] GTACACCCCTGCATGATAGAATTCTGCAG...TAAGTGATTCTTGGATCCTAACATTCTA
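The PHRED encoding mentioned at the start of this section stores one quality score per base as a single character. A minimal sketch of the decoding, assuming the common offset of 33 (the same offset that this vignette subtracts), is shown below. The helper decode_phred is a hypothetical name, and the input is the first eight quality characters reported for the first read in this vignette.

```r
# Convert a quality string to integer PHRED scores.
# offset = 33 is an assumption (Sanger / Illumina 1.8+ encoding).
decode_phred <- function(qual, offset = 33L) {
  as.integer(charToRaw(qual)) - offset
}

q <- decode_phred("3DCCDCCD")
q
# → 18 35 34 34 35 34 34 35

# A PHRED score Q corresponds to an error probability of 10^(-Q/10).
p.error <- 10^(-q / 10)
round(p.error, 5)
```

On this scale the trimming threshold of 30 used for these reads corresponds to an error probability of 0.001, that is, one expected miscall per 1000 bases.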

    mcols(bam.ga)$qual
    ## [Output: a PhredQuality instance; every entry has width 101. The first five are:]
    ## [1] 3DCCDCCDCCCCDCCCEEDCEFFFCDA7...JIJJJJJIJJJJIHJHHHHHFFFFFC@C
    ## [2] <DCDEDCDDCCACDECCCDEEEDC@B@E...IIEGIIGIHGIIHGHDHHHGFDEDF@C@
    ## [3] CCCFFFFFHHHHHIIJJIIIIJJJIJJJ...CCCDDDDACDDDDDDDB>BBCCCDC@B9
    ## [4] CCCFFFFFHHHFHJIJIJJJJJJJJJJJ...DDACDDC>CCCDEDDDBDDDDDCDDCBB
    ## [5] CCCFFFFFHHDHHEIJIJJJIJJIHIJJ...FEDEDCD>CCEDCDDDDDDDDACDCAC:
    ## ...
    as.integer(charToRaw(as.character(mcols(bam.ga)$qual[1])))
    ## [Output: the 101 ASCII codes of the first quality string.]
    as.integer(charToRaw(as.character(mcols(bam.ga)$qual[1]))) - 33
    ## [Output: the 101 PHRED scores of the first read.]

4 Determining Digital Gene Expression

Now, we have everything we need to get our gene counts. It is important to think about how this is done. First, is your RNAseq library strand specific? If so, do you have paired-end reads? If so, which of the ends should you use for counting?

In addition, how stringent do you want to be when you determine whether or not a read actually overlaps with a gene? What if a read overlaps with two genes? In your Packages tab, click on GenomicAlignments, open the user guides, and select GenomicAlignments::summarizeOverlaps. Figure 1 shows the counting modes that are available.

4.1 Counting Modes on Simulated Data

The following bit of code demonstrates how your gene expression counts can vary based on your choice of counting mode.

    library(Gviz)
    my.gr <- GRanges(seqnames=Rle("chr1"), ranges=IRanges(start=1, end=1000), strand=Rle("*" …
    my.seqinfo <- Seqinfo(seqnames="chr1", seqlengths=1000, isCircular=FALSE, genome="simula …
    seqinfo(my.gr) <- my.seqinfo
    gene1 <- GRanges(seqnames=Rle("chr1"), ranges=IRanges(start=201, end=350), strand=Rle("+ …
    gene2 <- GRanges(seqnames=Rle(rep("chr1", 2)), ranges=IRanges(start=c(301, 601), end=c(3 …
    gene3 <- GRanges(seqnames=Rle("chr1"), ranges=IRanges(start=451, end=550), strand=Rle(" …
    gene4 <- GRanges(seqnames=Rle(rep("chr1", 2)), ranges=IRanges(start=c(101, 301), end=c(2 …
    my.genes <- GRangesList(gene1=gene1, gene2=gene2, gene3=gene3, gene4=gene4)
    my.genes.df <- as.data.frame(my.genes)
    colnames(my.genes.df)[3] <- "chromosome"
    gene.track <- GeneRegionTrack(my.genes.df, fill="green4", arrowHeadWidth=50, arrowHeadMax …
    gene1.reads <- GRanges(seqnames=Rle(rep("chr1", 20)), ranges=IRanges(start=sample(200:32 …
    gene2.reads <- GRanges(seqnames=Rle(rep("chr1", 60)), ranges=IRanges(start=sample(c(300: …
    gene3.reads <- GRanges(seqnames=Rle(rep("chr1", 10)), ranges=IRanges(start=sample(450:52 …
    gene4.reads <- GRanges(seqnames=Rle(rep("chr1", 20)), ranges=IRanges(start=sample(c(100: …
    all.reads <- c(gene1.reads, gene2.reads, gene3.reads, gene4.reads)
    read.track <- AnnotationTrack(all.reads, fill=c("red", "blue")[as.integer(strand(all.rea …
    ax.track <- GenomeAxisTrack(GRanges(seqnames="chr1", ranges=IRanges(1, 1000)), genome="s …
    plotTracks(list(gene.track, read.track, ax.track), from=1, to=700, grid=1, sizes=c(4, 4, …
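To make the difference between counting modes concrete, here is a minimal base-R sketch on toy intervals. It is not how GenomicAlignments implements summarizeOverlaps: it ignores strand, treats each gene as a single interval, and leaves out IntersectionNotEmpty. The gene names, coordinates, and the helper count_unique are hypothetical and chosen only for illustration.

```r
# Two overlapping toy genes and five toy reads (all coordinates are made up)
features <- data.frame(name  = c("geneA", "geneB"),
                       start = c(100, 180),
                       end   = c(200, 300),
                       stringsAsFactors = FALSE)
reads <- data.frame(start = c(110, 170, 190, 290, 150),
                    end   = c(150, 210, 199, 320, 195))

# Logical hit matrices: rows are reads, columns are genes
# any.hit:    the read shares at least one base with the gene
# within.hit: the read lies completely inside the gene
any.hit <- outer(reads$start, features$end, "<=") &
           outer(reads$end, features$start, ">=")
within.hit <- outer(reads$start, features$start, ">=") &
              outer(reads$end, features$end, "<=")
colnames(any.hit) <- colnames(within.hit) <- features$name

# Count only reads that are assigned to exactly one gene
count_unique <- function(hit) {
  keep <- rowSums(hit) == 1
  colSums(hit[keep, , drop = FALSE])
}

toy.counts <- cbind(countOverlaps      = colSums(any.hit),       # every overlap counts
                    Union              = count_unique(any.hit),   # ambiguous reads dropped
                    IntersectionStrict = count_unique(within.hit))
toy.counts
```

In this toy case the plain overlap count gives 4 for each gene because reads that touch both genes are counted twice, the Union-style count gives 1 and 1, and the IntersectionStrict-style count gives 2 and 0. The read at 150 to 195 is rescued by the strict mode because it lies entirely inside geneA but only partly inside geneB.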

[Figure: Gviz plot of the simulated gene models (GeneRegionTrack: gene1, gene2, gene3, gene4) above the simulated reads (Demo Reads).]

    count.mat <- matrix(0, nrow=4, ncol=5)
    colnames(count.mat) <- c("actual", "countOverlaps", "Union", "IntersectionStrict", "Inte …
    rownames(count.mat) <- c("gene1", "gene2", "gene3", "gene4")
    count.mat[, 1] <- c(20, 60, 10, 20)
    count.mat[, 2] <- countOverlaps(my.genes, all.reads)
    count.mat[, 3] <- assays(summarizeOverlaps(my.genes, all.reads, "Union", ignore.strand=F …
    count.mat[, 4] <- assays(summarizeOverlaps(my.genes, all.reads, "IntersectionStrict", ig …
    count.mat[, 5] <- assays(summarizeOverlaps(my.genes, all.reads, "IntersectionNotEmpty", …
    count.mat

    actual countOverlaps Union IntersectionStrict IntersectionNotEmpty

    [output rows gene1 to gene4; the counts were not preserved]

4.2 Clean-Up Environment to Free Memory

We've accumulated a bunch of objects in our workspace. Let's carefully remove what we don't need.

    ls()

     [1] "all.reads"      "ax.track"       "bam.chrs"       "bam.files"
     [5] "bam.ga"         "bam.header"     "bam.index"      "bpparam"
     [9] "cigar.tbl"      "count.mat"      "data.dir"       "dup.ga"
    [13] "dups"           "ecor1"          "flag.mat"       "flag.param"
    [17] "flag.tbl"       "gene.track"     "gene1"          "gene1.reads"
    [21] "gene2"          "gene2.reads"    "gene3"          "gene3.reads"
    [25] "gene4"          "gene4.reads"    "mapq.tbl"       "mouse.bs"
    [29] "mouse.genes"    "mouse.genes0"   "mouse.gr"       "my.genes"
    [33] "my.genes.df"    "my.gr"          "my.param"       "my.seqinfo"
    [37] "njunc.tbl"      "read.track"     "seqnames.genes" "temp.seqlevels"
    [41] "what.param"     "which.param"

    my.objs <- list(ls())
    my.objs[[1]] <- setdiff(my.objs[[1]], c("data.dir", "bam.files", "bam.index", "mouse.gen …
    rm(list=my.objs[[1]])
    rm(my.objs)
    ls()

    [1] "bam.files"   "bam.index"   "data.dir"    "mouse.genes" "mouse.gr"

    gc()

           used (Mb) gc trigger (Mb) max used (Mb)
    Ncells
    Vcells
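If you would like to try the keep-list pattern before running it on your real workspace, the following is a small sketch that does the same thing inside a throwaway environment. The environment name sandbox and the object values are hypothetical; only the pattern of setdiff followed by rm(list=...) mirrors the clean-up above.

```r
# Throwaway environment so the sketch never touches the global workspace
sandbox <- new.env()
assign("data.dir", "/path/to/data", envir = sandbox)        # hypothetical value
assign("bam.files", c("a.bam", "b.bam"), envir = sandbox)   # hypothetical values
assign("count.mat", matrix(0, 2, 2), envir = sandbox)
assign("gene1.reads", 1:10, envir = sandbox)

# Names we want to keep; everything else is removed
keep <- c("data.dir", "bam.files")
drop <- setdiff(ls(sandbox), keep)
rm(list = drop, envir = sandbox)

remaining <- ls(sandbox)   # what survived the clean-up
invisible(gc())            # ask R to release the freed memory
remaining
```

Checking the vector of names to drop before calling rm is a cheap safeguard, because rm cannot be undone.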

4.3 Method for Limited Memory and Single Core

It is very likely that you will have limited computer resources. The following chunk of code will work through a list of BAM files, one chromosome at a time, and generate your gene count matrix. I have used this routinely on my MacBook Pro (16 GB memory).

    #sample.names <- substring(bam.files, 45, 54)
    #count.mat <- matrix(0, nrow=length(mouse.genes), ncol=length(sample.names))
    #rownames(count.mat) <- names(mouse.genes)
    #colnames(count.mat) <- sample.names
    # one file at a time, one computer core at a time
    # mapq.cutoff <- 50
    # for(j in 1:length(bam.files)){
    #   for(i in 1:length(seqlevels(mouse.genes))){
    #     chr.param <- ScanBamParam(which=mouse.gr[i], mapqFilter=mapq.cutoff)
    #     bam.ga <- readGAlignments(bam.files[j], index=bam.index[j], param=chr.param)
    #     my.counts <- summarizeOverlaps(features=mouse.genes, reads=bam.ga, mode="Union", ign …
    #     count.mat[, j] <- count.mat[, j] + assays(my.counts)$counts
    #   }
    # }
    # write.table(count.mat, file=paste(sample.name, "counts.txt", sep="_"), row.names=TRUE, …

4.4 Parallel Processing of BAM Files on Multiple Cores

The next chunk will use the parameters we specified for BiocParallel to process all 12 BAM files (we have 18 workers). This should take less than 10 minutes.

    system.time(my.se <- summarizeOverlaps(mouse.genes, bam.files, mode="IntersectionNotEmpt …
    # summarizeOverlaps will check for parallel processing parameters.
    # We have 12 files and 18 workers, so all 12 files will be processed in parallel.
    registered()

       user  system elapsed
    [timing values not preserved]

    …/60
    # Nine minutes to process all files. It would have taken ~110 minutes with the for loop.
    class(my.se)

    dim(assays(my.se)$counts)
    colnames(assays(my.se)$counts)
    save(my.se, file="pikesummarizedexp.rdata")

Now, we are finished with the Big Data part of RNAseq analysis.

5 Differential Gene Expression Analysis with edgeR

There are several popular packages or tools for DGE analysis of RNAseq data. At one time CuffDiff was certainly the most popular, but it has fallen out of favor for various reasons. It will be interesting to see how CuffDiff2 performs. DESeq2 and edgeR are two of the best available packages for DGE, and both are native to Bioconductor. Here we will cover edgeR because it is essentially limma for RNAseq. edgeR and limma were developed by the same research group, so they have a similar philosophy and workflow. If you are interested in trying DESeq2, Bioconductor has an excellent workflow on their website that is quite easy to follow if you know a little R (and you do now!).

5.1 Formatting the Data

As with the microarray data, we need to format our raw data and our pdata. We will get the counts from the SummarizedExperiment object. If you weren't able to generate the object, there is a copy that you can load.

    load("pikesummarizedexp.rdata")
    my.counts <- assays(my.se)$counts
    head(my.counts)

    [output: the first rows of the count matrix; the column names are BAM file names of the form SRR…_accepted_hits_sorted.bam; values not preserved]


    colnames(my.counts) <- substr(colnames(my.counts), 1, 10)
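The substr call above relies on every run accession being exactly 10 characters long. As a hedged alternative, the sketch below trims the names by pattern instead of by position. The file names are hypothetical stand-ins (the real accessions are not reproduced here), and it assumes the accession is everything before the first underscore.

```r
# Hypothetical BAM-derived column names; the third accession is deliberately shorter
bam.names <- c("SRR0000001_accepted_hits_sorted.bam",
               "SRR0000002_accepted_hits_sorted.bam",
               "SRR000003_accepted_hits_sorted.bam")

# Fixed-width approach, as in the text: keep the first 10 characters
fixed <- substr(bam.names, 1, 10)

# Pattern-based alternative: drop everything from the first underscore onward
by.pattern <- sub("_.*$", "", bam.names)

fixed
by.pattern
```

Both approaches agree when all accessions have the same length. With the shorter third name, the fixed-width version keeps a trailing underscore, while the pattern-based version still returns a clean accession.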

    head(my.counts)

    [output: the first rows of the count matrix, now with 12 columns named by SRR run accession; values not preserved]

Next, we need to get the pdata for the samples. We will use a file that can be downloaded from SRA when you browse the data. Again, we are going to do a bit of data wrangling to make suitable names.

    # metadata for experiment
    pdata <- read.delim(file.path(data.dir, "metadata", "Pike_SraRunTable_RNAseq.txt"), comm …
    head(pdata)

      BioSample_s Experiment_s MBases_l MBytes_l Run_s SRA_Sample_s
    1 SAMN…       SRX…         …        …        SRR…  SRS…
    2 SAMN…       SRX…         …        …        SRR…  SRS…
    3 SAMN…       SRX…         …        …        SRR…  SRS…
    4 SAMN…       SRX…         …        …        SRR…  SRS…
    5 SAMN…       SRX…         …        …        SRR…  SRS…
    6 SAMN…       SRX…         …        …        SRR…  SRS…

      Sample_Name_s differentiation_day_s source_name_s   Assay_Type_s
    1 GSM…          d3                    IDGSW3_basal_d3 RNA-Seq
    2 GSM…          d3                    IDGSW3_basal_d3 RNA-Seq
    3 GSM…          d3                    IDGSW3_basal_d3 RNA-Seq
    4 GSM…          d7                    IDGSW3_basal_d7 RNA-Seq
    5 GSM…          d7                    IDGSW3_basal_d7 RNA-Seq
    6 GSM…          d7                    IDGSW3_basal_d7 RNA-Seq

      AssemblyName_s BioProject_s Center_Name_s Consent_s InsertSize_l
    1 <not provided> PRJNA…       GEO           public    0
    2 <not provided> PRJNA…       GEO           public    0
    3 <not provided> PRJNA…       GEO           public    0
    4 <not provided> PRJNA…       GEO           public    0
    5 <not provided> PRJNA…       GEO           public    0
    6 <not provided> PRJNA…       GEO           public    0

      LibraryLayout_s LibrarySelection_s LibrarySource_s Library_Name_s
    1 SINGLE          cDNA               TRANSCRIPTOMIC  <not provided>
    2 SINGLE          cDNA               TRANSCRIPTOMIC  <not provided>
    3 SINGLE          cDNA               TRANSCRIPTOMIC  <not provided>
    4 SINGLE          cDNA               TRANSCRIPTOMIC  <not provided>
    5 SINGLE          cDNA               TRANSCRIPTOMIC  <not provided>
    6 SINGLE          cDNA               TRANSCRIPTOMIC  <not provided>

      LoadDate_s Organism_s   Platform_s ReleaseDate_s SRA_Study_s cell_line_s
    1 2/7/2014   Mus musculus ILLUMINA   6/2/2014      SRP…        IDG-SW3
    2 2/7/2014   Mus musculus ILLUMINA   6/2/2014      SRP…        IDG-SW3
    3 2/7/2014   Mus musculus ILLUMINA   6/2/2014      SRP…        IDG-SW3
    4 2/7/2014   Mus musculus ILLUMINA   6/2/2014      SRP…        IDG-SW3
    5 2/7/2014   Mus musculus ILLUMINA   6/2/2014      SRP…        IDG-SW3
    6 2/7/2014   Mus musculus ILLUMINA   6/2/2014      SRP…        IDG-SW3

      cell_type_s      g1k_analysis_group_s g1k_pop_code_s source_s
    1 osteocytic cells <not provided>       <not provided> <not provided>
    2 osteocytic cells <not provided>       <not provided> <not provided>
    3 osteocytic cells <not provided>       <not provided> <not provided>
    4 osteocytic cells <not provided>       <not provided> <not provided>
    5 osteocytic cells <not provided>       <not provided> <not provided>
    6 osteocytic cells <not provided>       <not provided> <not provided>

    pdata <- pdata[pdata$Run_s %in% colnames(my.counts), c("Run_s", "differentiation_day_s", …
    pdata$source_name_s <- as.factor(pdata$source_name_s)
    summary(pdata$source_name_s)

    IDGSW3_125_d3 IDGSW3_125_d35 IDGSW3_vehicle_d3 …
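Subsetting with %in% keeps the right rows but leaves them in the order of the run table, which may differ from the column order of the count matrix. The sketch below shows one possible way to subset and reorder in a single step with match, followed by a sanity check. The objects pdata.demo and counts.demo and all accessions in them are hypothetical; only the column names Run_s and source_name_s come from the run table shown above.

```r
# Hypothetical run table; accessions and group labels are for illustration only
pdata.demo <- data.frame(
  Run_s = c("SRR0000003", "SRR0000001", "SRR0000002", "SRR0000009"),
  source_name_s = c("IDGSW3_125_d3", "IDGSW3_vehicle_d3",
                    "IDGSW3_vehicle_d3", "IDGSW3_basal_d7"),
  stringsAsFactors = FALSE
)

# Hypothetical count matrix with one column per run
counts.demo <- matrix(0, nrow = 2, ncol = 3,
                      dimnames = list(c("geneA", "geneB"),
                                      c("SRR0000001", "SRR0000002", "SRR0000003")))

# Keep only runs present in the count matrix, in the same order as its columns
pdata.demo <- pdata.demo[match(colnames(counts.demo), pdata.demo$Run_s), ]

# Create the factor after subsetting so that unused groups do not remain as levels
pdata.demo$source_name_s <- factor(pdata.demo$source_name_s)

# Sanity check: metadata rows and count columns must line up
stopifnot(identical(pdata.demo$Run_s, colnames(counts.demo)))
group.sizes <- table(pdata.demo$source_name_s)
group.sizes
```

Here the basal_d7 run is dropped because it has no column in the count matrix, and the remaining rows follow the column order, so group labels can be passed to downstream functions without further reordering.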
