Introduction to RNA-seq Analysis

Pete E. Pascuzzi
July 14, 2016

Contents
1 Intro
2 Setup
2.1 Setup Directories
2.2 Parallel Processing
2.3 BAM Header Metadata
2.4 Gene Annotation
2.5 Genome Annotation
3 Sequence Alignment Map Data Format
3.1 Converting BAM file to GAlignments Object
3.2 Essential Alignment Data
3.3 Additional Data
3.4 Sequence Data
4 Determining Digital Gene Expression
4.1 Counting Modes on Simulated Data
4.2 Clean-Up Environment to Free Memory
4.3 Method for Limited Memory and Single Core
4.4 Parallel Processing of BAM Files on Multiple Cores
5 Differential Gene Expression Analysis with EdgeR
5.1 Formatting the Data
5.2 Normalizing the Libraries (Counts)
6 SessionInfo

1 Intro

The following vignette is a basic RNA-seq analysis of data from St. John, et al., Mol. Endocrinol. 2014. The data were deposited at NCBI GEO under the Super Series GSE54784; the RNA-seq data are the Series GSE54783. We will work with only a subset of these samples, the 2 x 2 design of mouse cells, untreated or treated with vitamin D at three days and 35 days.

GEO holds metadata and processed data that are not amenable to further statistical analysis. However, the raw data, in the form of FASTQ files, were deposited in the NCBI Sequence Read Archive (SRA, formerly the Short Read Archive). SRA is an entirely different system from GEO, with its own series of accession numbers. It can be confusing to navigate, so you should spend some time familiarizing yourself with it. Importantly, the raw data in SRA can be downloaded and reanalyzed. However, each of these files can be very large, greater than 20 GB. The raw data for this vignette total about 500 GB.

NCBI has developed a series of UNIX command-line tools for downloading and manipulating SRA data. This is not a simple matter of file transfer. Raw data at SRA are stored in SRA format to optimize storage and must be converted back to FASTQ format (sometimes SAM or BAM format) using specific utilities. For example, we used the tool fastq-dump to transfer files from SRA to Rice and convert them to FASTQ format. This took more than a day!

In fact, these FASTQ files are not entirely raw. Typically, the samples in an NGS experiment are indexed (bar-coded, tagged) so that several can be run together on a single lane. If you download an SRA file that corresponds to a single sample, the indices have already been processed so that the reads belonging to that sample were written to their own file. Another step that may have been done is removal of the adaptors that are necessary for library construction and sequencing. The reads are then ready for additional QC and alignment to your reference genome.
We have already quality-trimmed the reads with the FASTX toolkit, removing bases from either end of a read that have a PHRED quality score below 30, and we have QC'd the reads with the tool FastQC. The reads were aligned to the mouse reference genome Mus musculus.grcm38.fa with the tool TopHat. The resulting Sequence Alignment Map (SAM) files were saved in binary format as BAM files, which were sorted and indexed to facilitate analysis and visualization. This vignette covers an overview of the BAM file

format, generation of a gene annotation object that can be used to make your gene count table, counting of reads over genes, and edgeR analysis for differential expression.

2 Setup

When working with such large files, it is important to think about your storage and directory structure. We have a huge amount of disk space available in our scratch directory on Rice. For many tasks, it would be typical to copy your data to scratch, do your work, and copy the results back to secure storage such as Data Depot. However, to avoid transferring large files that could become corrupted, we are going to leave our BAM files on Data Depot and access them from Rice while working in our scratch directory. Additionally, some of the tools used to process the data have specific conventions. TopHat generates a large number of files and directories, so we typically have to use long file paths to access them. If you choose to modify this script for your own analysis, you will absolutely need to change the paths to point to your own files!

2.1 Setup Directories

We will access various data for this experiment from Data Depot, so it is convenient to define at least part of this path early.

# define directories for data
#data.dir <- "/depot/nihomics/data/ngs/pike" #path for Rice
data.dir <- "/Volumes/nihomics/data/ngs/Pike" #path for Pete's MacBook
library(Rsamtools)
bam.files <- list.files(file.path(data.dir, "rnaseq", "tophat"), pattern="accepted_hits_
bam.files
 [1] "/Volumes/nihomics/data/ngs/Pike/rnaseq/tophat/SRR1164972_tophat_out/SRR1164972_
 [2] "/Volumes/nihomics/data/ngs/Pike/rnaseq/tophat/SRR1164973_tophat_out/SRR1164973_
 [3] "/Volumes/nihomics/data/ngs/Pike/rnaseq/tophat/SRR1164974_tophat_out/SRR1164974_
 [4] "/Volumes/nihomics/data/ngs/Pike/rnaseq/tophat/SRR1164975_tophat_out/SRR1164975_

 [5] "/Volumes/nihomics/data/ngs/Pike/rnaseq/tophat/SRR1164976_tophat_out/SRR1164976_
 [6] "/Volumes/nihomics/data/ngs/Pike/rnaseq/tophat/SRR1164977_tophat_out/SRR1164977_
 [7] "/Volumes/nihomics/data/ngs/Pike/rnaseq/tophat/SRR1164978_tophat_out/SRR1164978_
 [8] "/Volumes/nihomics/data/ngs/Pike/rnaseq/tophat/SRR1164979_tophat_out/SRR1164979_
 [9] "/Volumes/nihomics/data/ngs/Pike/rnaseq/tophat/SRR1164980_tophat_out/SRR1164980_
[10] "/Volumes/nihomics/data/ngs/Pike/rnaseq/tophat/SRR1164981_tophat_out/SRR1164981_
[11] "/Volumes/nihomics/data/ngs/Pike/rnaseq/tophat/SRR1164982_tophat_out/SRR1164982_
[12] "/Volumes/nihomics/data/ngs/Pike/rnaseq/tophat/SRR1164983_tophat_out/SRR1164983_

2.2 Parallel Processing

Now that we have the locations of the BAM files, we are ready to process them further. A huge advantage of using Rice for this type of analysis is that each node has 20 cores: twenty processors that share memory (64 GB) but can work on independent tasks. Bioconductor provides a package, BiocParallel, that facilitates the use of multiple cores for parallel processing. However, you do need to install this package and configure its options for your specific setup. Below, we tell BiocParallel how many cores it may use (18 on a Rice node, 6 on the laptop used to compile this document) and let it assign tasks as it sees fit. How you can use BiocParallel will depend on your computer. MulticoreParam works well on UNIX-like operating systems; it will not work on Windows machines.

# set up for parallel processing
library(BiocParallel)
library(parallel)
registered()
$MulticoreParam
class: MulticoreParam
bpjobname:BPJOB; bpworkers:6; bptasks:0; bptimeout:Inf; bpRNGseed:; bpisup:FALSE
bplog:FALSE; bpthreshold:INFO; bplogdir:NA
bpstopOnError:FALSE; bpprogressbar:FALSE
bpresultdir:NA
cluster type: FORK

$SnowParam
class: SnowParam
bpjobname:BPJOB; bpworkers:6; bptasks:0; bptimeout:Inf; bpRNGseed:; bpisup:FALSE
bplog:FALSE; bpthreshold:INFO; bplogdir:NA
bpstopOnError:FALSE; bpprogressbar:FALSE
bpresultdir:NA
cluster type: SOCK

$SerialParam
class: SerialParam
bplog:FALSE; bpthreshold:INFO
bpcatchErrors:FALSE

detectCores()
[1] 8
#bpparam <- MulticoreParam(workers=18, tasks=0) #parameters for Rice
bpparam <- MulticoreParam(workers=6, tasks=0) #parameters for Pete's MacBook
register(bpparam)
registered()
$MulticoreParam
class: MulticoreParam
bpjobname:BPJOB; bpworkers:6; bptasks:0; bptimeout:Inf; bpRNGseed:; bpisup:FALSE
bplog:FALSE; bpthreshold:INFO; bplogdir:NA
bpstopOnError:FALSE; bpprogressbar:FALSE
bpresultdir:NA
cluster type: FORK

$SnowParam
class: SnowParam
bpjobname:BPJOB; bpworkers:6; bptasks:0; bptimeout:Inf; bpRNGseed:; bpisup:FALSE
bplog:FALSE; bpthreshold:INFO; bplogdir:NA
bpstopOnError:FALSE; bpprogressbar:FALSE
bpresultdir:NA
cluster type: SOCK

$SerialParam

class: SerialParam
bplog:FALSE; bpthreshold:INFO
bpcatchErrors:FALSE

Below is an example of the bplapply function, a parallel version of lapply that can send tasks to multiple processors. DO NOT UNCOMMENT THIS LINE AND RUN IT. The BAM files are already indexed; this is only an example. However, you do need to get the locations of the BAM index files.

# index BAM files so that they can be processed more easily and read by IGV and other applications.
# This has already been done
#system.time(bplapply(bam.files, indexBam))
# Get the full names for the bam index files
bam.index <- list.files(file.path(data.dir, "rnaseq", "tophat"), pattern="bam.bai$", rec
bam.index

2.3 BAM Header Metadata

Now, we can start to look at the data in a BAM file. The first thing that you should examine is the BAM header. It contains information on how the reads were aligned. In particular, it should give you the names of the chromosomes to which the reads were aligned, and information about the aligner.

# get the seqnames for the references used in the alignments
bam.header <- scanBamHeader(bam.files[1])
length(bam.header)
[1] 1
bam.header <- bam.header[[1]]
names(bam.header)
[1] "targets" "text"
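The header of a SAM/BAM file is plain text in which each @SQ line carries a reference sequence name (SN) and its length (LN). As a rough illustration of how a named vector of sequence lengths relates to those lines, here is a minimal base-R sketch. The function parse_sq_lines and the small header_lines vector are hypothetical and written only for this example (the values are copied from the header shown in this vignette); in practice scanBamHeader does this work for you.

```r
# Hypothetical helper: turn the @SQ lines of a SAM header into a named
# vector of reference lengths. Not part of Rsamtools.
parse_sq_lines <- function(header_lines) {
  # keep only the reference sequence records
  sq <- header_lines[startsWith(header_lines, "@SQ")]
  # SAM header fields are tab-separated TAG:value pairs
  fields <- strsplit(sq, "\t", fixed = TRUE)
  get_tag <- function(x, tag) {
    hit <- x[startsWith(x, paste0(tag, ":"))]
    sub(paste0("^", tag, ":"), "", hit[1])
  }
  sn <- vapply(fields, get_tag, character(1), tag = "SN")
  ln <- as.numeric(vapply(fields, get_tag, character(1), tag = "LN"))
  setNames(ln, sn)
}

# Illustrative header lines (a small subset, typed in by hand)
header_lines <- c(
  "@HD\tVN:1.0\tSO:coordinate",
  "@SQ\tSN:1\tLN:195471971",
  "@SQ\tSN:10\tLN:130694993",
  "@SQ\tSN:MT\tLN:16299"
)
targets <- parse_sq_lines(header_lines)
print(targets)
```

The names of such a vector are what later need to agree with the seqnames of the gene annotation.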

bam.header$targets
         1         10         11         12         13         14
 195471971  130694993  122082543  120129022  120421639  124902244
        15         16         17         18         19          2
 104043685   98207768   94987271   90702639   61431566  182113224
         3          4          5          6          7          8
 160039680  156508116  151834684  149736546  145441459  129401213
         9 GL456210.1 GL456211.1 GL456212.1 GL456213.1 GL456216.1
 124595110     169725     241735     153618      39340      66673
GL456219.1 GL456221.1 GL456233.1 GL456239.1 GL456350.1 GL456354.1
    175968     206961     336933      40056     227966     195993
GL456359.1 GL456360.1 GL456366.1 GL456367.1 GL456368.1 GL456370.1
     22974      31704      47073      42057      20208      26764
GL456372.1 GL456378.1 GL456379.1 GL456381.1 GL456382.1 GL456383.1
     28664      31602      72385      25871      23158      38659
GL456385.1 GL456387.1 GL456389.1 GL456390.1 GL456392.1 GL456393.1
     35240      24685      28772      24668      23629      55711
GL456394.1 GL456396.1 JH584292.1 JH584293.1 JH584294.1 JH584295.1
     24323      21240      14945     207968     191905       1976
JH584296.1 JH584297.1 JH584298.1 JH584299.1 JH584300.1 JH584301.1
    199368     205776     184189     953012     182347     259875
JH584302.1 JH584303.1 JH584304.1         MT          X          Y
    155838     158099     114452      16299  171031299   91744698
bam.chrs <- names(bam.header$targets)
head(bam.header$text)
$`@HD`
[1] "VN:1.0" "SO:coordinate"

$`@SQ`
[1] "SN:1" "LN:195471971"

$`@SQ`
[1] "SN:10" "LN:130694993"

$`@SQ`
[1] "SN:11" "LN:122082543"

$`@SQ`
[1] "SN:12" "LN:120129022"

$`@SQ`
[1] "SN:13" "LN:120421639"

tail(bam.header$text)
$`@SQ`
[1] "SN:JH584303.1" "LN:158099"

$`@SQ`
[1] "SN:JH584304.1" "LN:114452"

$`@SQ`
[1] "SN:MT" "LN:16299"

$`@SQ`
[1] "SN:X" "LN:171031299"

$`@SQ`
[1] "SN:Y" "LN:91744698"

$`@PG`
[1] "ID:TopHat"
[2] "VN:2.1.0"
[3] "CL:/group/bioinfo/apps/apps/tophat-2.1.0/tophat -p 10 --library-type fr-secondst

2.4 Gene Annotation

Now that we have information about the reference sequence, we can generate a suitable object that represents the genes in the reference genome. There are many ways to do this, but we are going to use precompiled Bioconductor databases. It is not a simple matter of finding the transcription start site and transcription termination site of each gene and defining that interval. Why? Mature mRNA is spliced, so RNA-seq reads derive from exons rather than from the whole genomic interval of a gene. What we need to do is retrieve the coordinates of the gene exons, grouping the exons together by gene. Further, we want to include all possible

exonic sequences. Bioconductor has many functions that help you to achieve such tasks. Below, we will use exonsBy to get all exons for mouse genes.

library(Mus.musculus, verbose=TRUE)
Warning in library(Mus.musculus, verbose = TRUE): package Mus.musculus already present in search()
mouse.genes0 <- exonsBy(Mus.musculus, by="gene")
mouse.genes0
GRangesList object of length 24028:
$100009600
GRanges object with 7 ranges and 2 metadata columns:
      seqnames               ranges strand |   exon_id   exon_name
         <Rle>            <IRanges>  <Rle> | <integer> <character>
  [1]     chr9 [21062393, 21062717]      - |    134155        <NA>
  [2]     chr9 [21062894, 21062987]      - |    134156        <NA>
  [3]     chr9 [21063314, 21063396]      - |    134157        <NA>
  [4]     chr9 [21066024, 21066377]      - |    134158        <NA>
  [5]     chr9 [21066940, 21067925]      - |    134159        <NA>
  [6]     chr9 [21068030, 21068117]      - |    134160        <NA>
  [7]     chr9 [21073075, 21075496]      - |    134162        <NA>

$100009609
GRanges object with 6 ranges and 2 metadata columns:
      seqnames               ranges strand | exon_id exon_name
  [1]     chr7 [84940169, 84941088]      - |  109644      <NA>
  [2]     chr7 [84943141, 84943264]      - |  109645      <NA>
  [3]     chr7 [84943504, 84943722]      - |  109646      <NA>
  [4]     chr7 [84946200, 84947000]      - |  109647      <NA>
  [5]     chr7 [84947372, 84947651]      - |  109648      <NA>
  [6]     chr7 [84963816, 84964009]      - |  109649      <NA>

$100009614
GRanges object with 1 range and 2 metadata columns:
      seqnames               ranges strand | exon_id exon_name
  [1]    chr10 [77711446, 77712009]      + |  143589      <NA>

...

<24025 more elements>
-------
seqinfo: 66 sequences (1 circular) from mm10 genome
# exons are for each transcript but grouped by gene, so some exons are represented multiple times
mouse.genes0[218]
GRangesList object of length 1:
$100040724
GRanges object with 22 ranges and 2 metadata columns:
       seqnames                 ranges strand |   exon_id   exon_name
          <Rle>              <IRanges>  <Rle> | <integer> <character>
   [1]    chr12 [109729764, 109729863]      + |    176381        <NA>
   [2]    chr12 [109730113, 109730216]      + |    176382        <NA>
   [3]    chr12 [109730276, 109731128]      + |    176383        <NA>
   [4]    chr12 [109730296, 109730868]      + |    176384        <NA>
   [5]    chr12 [109730966, 109731103]      + |    176385        <NA>
   ...      ...                    ...    ... .       ...         ...
  [18]    chr12 [109742062, 109742290]      + |    176406        <NA>
  [19]    chr12 [109743357, 109743418]      + |    176410        <NA>
  [20]    chr12 [109743666, 109743847]      + |    176412        <NA>
  [21]    chr12 [109748125, 109748293]      + |    176415        <NA>
  [22]    chr12 [109749095, 109749457]      + |    176416        <NA>

-------
seqinfo: 66 sequences (1 circular) from mm10 genome

Note that this mouse gene has 22 exons, but many of them overlap or are identical. That is because all exons from all transcripts of this gene are represented. If you counted reads over this object, you could potentially double or triple your counts, because a read would be counted once for every transcript that it overlaps. In fact, Bioconductor is pretty smart and this does not happen, but we are still going to perform a reduce on these exons. This function combines overlapping or redundant intervals into a single non-redundant set.

mouse.genes <- reduce(mouse.genes0)
mouse.genes[218]
GRangesList object of length 1:

$100040724
GRanges object with 16 ranges and 0 metadata columns:
       seqnames                 ranges strand
          <Rle>              <IRanges>  <Rle>
   [1]    chr12 [109729764, 109729863]      +
   [2]    chr12 [109730113, 109730216]      +
   [3]    chr12 [109730276, 109731128]      +
   [4]    chr12 [109731242, 109731912]      +
   [5]    chr12 [109733281, 109733608]      +
   ...      ...                    ...    ...
  [12]    chr12 [109742062, 109742290]      +
  [13]    chr12 [109743357, 109743418]      +
  [14]    chr12 [109743666, 109743847]      +
  [15]    chr12 [109748125, 109748293]      +
  [16]    chr12 [109749095, 109749457]      +

-------
seqinfo: 66 sequences (1 circular) from mm10 genome

Now, let's compare the chromosome names (seqnames) used in our mouse.genes object with the chromosome names used in the BAM files. These are often incompatible, and you will not get any counts for your genes if Bioconductor cannot match up the seqnames! We have to do some strange data conversions here because some of the data is stored as run-length encoded vectors.

seqlevels(mouse.genes)
 [1] "chr1"                 "chr2"                 "chr3"
 [4] "chr4"                 "chr5"                 "chr6"
 [7] "chr7"                 "chr8"                 "chr9"
[10] "chr10"                "chr11"                "chr12"
[13] "chr13"                "chr14"                "chr15"
[16] "chr16"                "chr17"                "chr18"
[19] "chr19"                "chrX"                 "chrY"
[22] "chrM"                 "chr1_GL456210_random" "chr1_GL456211_random"
[25] "chr1_GL456212_random" "chr1_GL456213_random" "chr1_GL456221_random"
[28] "chr4_GL456216_random" "chr4_GL456350_random" "chr4_JH584292_random"
[31] "chr4_JH584293_random" "chr4_JH584294_random" "chr4_JH584295_random"

[34] "chr5_GL456354_random" "chr5_JH584296_random" "chr5_JH584297_random"
[37] "chr5_JH584298_random" "chr5_JH584299_random" "chr7_GL456219_random"
[40] "chrX_GL456233_random" "chrY_JH584300_random" "chrY_JH584301_random"
[43] "chrY_JH584302_random" "chrY_JH584303_random" "chrUn_GL456239"
[46] "chrUn_GL456359"       "chrUn_GL456360"       "chrUn_GL456366"
[49] "chrUn_GL456367"       "chrUn_GL456368"       "chrUn_GL456370"
[52] "chrUn_GL456372"       "chrUn_GL456378"       "chrUn_GL456379"
[55] "chrUn_GL456381"       "chrUn_GL456382"       "chrUn_GL456383"
[58] "chrUn_GL456385"       "chrUn_GL456387"       "chrUn_GL456389"
[61] "chrUn_GL456390"       "chrUn_GL456392"       "chrUn_GL456393"
[64] "chrUn_GL456394"       "chrUn_GL456396"       "chrUn_JH584304"
# not the same style as BAM files
table(seqlevels(mouse.genes) %in% bam.chrs)
FALSE
   66
# change the style
seqlevelsStyle(mouse.genes) <- "NCBI"
table(seqlevels(mouse.genes) %in% bam.chrs)
FALSE  TRUE
   44    22
# still problems with some (Bioconductor error?)
# determine how many annotated genes are on each sequence
seqnames.genes <- unlist(runValue(seqnames(mouse.genes)))
table(seqnames.genes)
seqnames.genes
                   1                    2                    3
                1332                 2016                 1132
                   4                    5                    6
                1423                 1374                 1288
                   7                    8                    9
                2145                 1161                 1345
                  10                   11                   12

                1132                 1818                  816
                  13                   14                   15
                 928                  856                  900
                  16                   17                   18
                 747                 1175                  599
                  19                    X                    Y
                 784                 1023                   39
                  MT chr1_GL456210_random chr1_GL456211_random
                   0                    0                    0
chr1_GL456212_random chr1_GL456213_random chr1_GL456221_random
                   0                    0                    1
chr4_GL456216_random chr4_GL456350_random chr4_JH584292_random
                   1                    5                    1
chr4_JH584293_random chr4_JH584294_random chr4_JH584295_random
                   6                    5                    0
chr5_GL456354_random chr5_JH584296_random chr5_JH584297_random
                   4                    1                    1
chr5_JH584298_random chr5_JH584299_random chr7_GL456219_random
                   2                    3                    3
chrX_GL456233_random chrY_JH584300_random chrY_JH584301_random
                   4                    0                    0
chrY_JH584302_random chrY_JH584303_random       chrUn_GL456239
                   0                    0                    0
      chrUn_GL456359       chrUn_GL456360       chrUn_GL456366
                   0                    0                    0
      chrUn_GL456367       chrUn_GL456368       chrUn_GL456370
                   0                    0                    0
      chrUn_GL456372       chrUn_GL456378       chrUn_GL456379
                   0                    0                    0
      chrUn_GL456381       chrUn_GL456382       chrUn_GL456383
                   0                    0                    0
      chrUn_GL456385       chrUn_GL456387       chrUn_GL456389
                   0                    0                    0
      chrUn_GL456390       chrUn_GL456392       chrUn_GL456393
                   0                    0                    0
      chrUn_GL456394       chrUn_GL456396       chrUn_JH584304
                   0                    0                    1
# a few genes are on unassembled contigs, so we will adjust the seqnames to match bam.chrs
temp.seqlevels <- seqlevels(mouse.genes)

summary(bam.chrs %in% temp.seqlevels)
   Mode   FALSE    TRUE    NA's
logical      44      22       0
temp.seqlevels[23:44] <- substr(temp.seqlevels[23:44], 6, 13)
temp.seqlevels[45:66] <- substr(temp.seqlevels[45:66], 7, 14)
temp.seqlevels[23:66] <- paste(temp.seqlevels[23:66], ".1", sep="")
summary(bam.chrs %in% temp.seqlevels)
   Mode    TRUE    NA's
logical      66       0
seqlevels(mouse.genes) <- temp.seqlevels
seqnames.genes <- unlist(runValue(seqnames(mouse.genes)))
table(seqnames.genes)
seqnames.genes
         1          2          3          4          5          6
      1332       2016       1132       1423       1374       1288
         7          8          9         10         11         12
      2145       1161       1345       1132       1818        816
        13         14         15         16         17         18
       928        856        900        747       1175        599
        19          X          Y         MT GL456210.1 GL456211.1
       784       1023         39          0          0          0
GL456212.1 GL456213.1 GL456221.1 GL456216.1 GL456350.1 JH584292.1
         0          0          1          1          5          1
JH584293.1 JH584294.1 JH584295.1 GL456354.1 JH584296.1 JH584297.1
         6          5          0          4          1          1
JH584298.1 JH584299.1 GL456219.1 GL456233.1 JH584300.1 JH584301.1
         2          3          3          4          0          0
JH584302.1 JH584303.1 GL456239.1 GL456359.1 GL456360.1 GL456366.1
         0          0          0          0          0          0
GL456367.1 GL456368.1 GL456370.1 GL456372.1 GL456378.1 GL456379.1
         0          0          0          0          0          0
GL456381.1 GL456382.1 GL456383.1 GL456385.1 GL456387.1 GL456389.1
         0          0          0          0          0          0
GL456390.1 GL456392.1 GL456393.1 GL456394.1 GL456396.1 JH584304.1
         0          0          0          0          0          1
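The renaming above relies on fixed character positions (substr with 6 to 13 and 7 to 14), which works because every contig accession in this annotation is eight characters long. As a hedged alternative, the same mapping can be sketched with regular expressions, which do not depend on position. The function ucsc_to_bam_style is hypothetical and written only for this illustration; it assumes, as in this BAM header, that contig names are the GL/JH accession followed by the version suffix ".1" and that the mitochondrion is called MT.

```r
# Hypothetical helper: convert UCSC-style mm10 names to the style used in
# this vignette's BAM header. The ".1" suffix is an assumption that holds
# for these files, not a general rule.
ucsc_to_bam_style <- function(x) {
  out <- sub("^chr", "", x)            # chr1 -> 1, chrX -> X
  out[out == "M"] <- "MT"              # mitochondrial genome
  pattern <- "(GL|JH)[0-9]+"           # contig accession inside the name
  is_contig <- grepl(pattern, x)
  acc <- regmatches(x, regexpr(pattern, x))  # one entry per matching name
  out[is_contig] <- paste0(acc, ".1")
  out
}

# A few names taken from the seqlevels listed in this section
ucsc <- c("chr1", "chrX", "chrM",
          "chr1_GL456210_random", "chrX_GL456233_random",
          "chrUn_GL456239", "chrUn_JH584304")
converted <- ucsc_to_bam_style(ucsc)
print(data.frame(ucsc = ucsc, bam = converted))
```

Whichever approach you use, confirm the result with a check such as summary(bam.chrs %in% temp.seqlevels) before assigning the new seqlevels.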

Now, the seqnames of mouse.genes match the seqnames in our BAM files.

2.5 Genome Annotation

Next, we are going to make an object that represents the chromosomes in the mouse genome. This object is required if you want to access the reads in the BAM file one chromosome at a time. This is the best way to work with BAM files if you have limited memory on your computer, e.g. 16 GB or less on a laptop. Again, we need to make sure that the seqnames match!

library(BSgenome.Mmusculus.UCSC.mm10)
mouse.bs <- BSgenome.Mmusculus.UCSC.mm10
mouse.bs
Mouse genome:
# organism: Mus musculus (Mouse)
# provider: UCSC
# provider version: mm10
# release date: Dec. 2011
# release name: Genome Reference Consortium GRCm38
# 66 sequences:
#   chr1           chr2           chr3
#   chr4           chr5           chr6
#   chr7           chr8           chr9
#   chr10          chr11          chr12
#   chr13          chr14          chr15
#   ...            ...            ...
#   chrUn_GL456372 chrUn_GL456378 chrUn_GL456379
#   chrUn_GL456381 chrUn_GL456382 chrUn_GL456383
#   chrUn_GL456385 chrUn_GL456387 chrUn_GL456389
#   chrUn_GL456390 chrUn_GL456392 chrUn_GL456393
#   chrUn_GL456394 chrUn_GL456396 chrUn_JH584304
# (use 'seqnames()' to see all the sequence names, use the '$' or '[['
# operator to access a given sequence)
seqlevelsStyle(mouse.bs) <- "NCBI"
seqlevels(mouse.bs)

 [1] "1"                    "2"                    "3"
 [4] "4"                    "5"                    "6"
 [7] "7"                    "8"                    "9"
[10] "10"                   "11"                   "12"
[13] "13"                   "14"                   "15"
[16] "16"                   "17"                   "18"
[19] "19"                   "X"                    "Y"
[22] "MT"                   "chr1_GL456210_random" "chr1_GL456211_random"
[25] "chr1_GL456212_random" "chr1_GL456213_random" "chr1_GL456221_random"
[28] "chr4_GL456216_random" "chr4_GL456350_random" "chr4_JH584292_random"
[31] "chr4_JH584293_random" "chr4_JH584294_random" "chr4_JH584295_random"
[34] "chr5_GL456354_random" "chr5_JH584296_random" "chr5_JH584297_random"
[37] "chr5_JH584298_random" "chr5_JH584299_random" "chr7_GL456219_random"
[40] "chrX_GL456233_random" "chrY_JH584300_random" "chrY_JH584301_random"
[43] "chrY_JH584302_random" "chrY_JH584303_random" "chrUn_GL456239"
[46] "chrUn_GL456359"       "chrUn_GL456360"       "chrUn_GL456366"
[49] "chrUn_GL456367"       "chrUn_GL456368"       "chrUn_GL456370"
[52] "chrUn_GL456372"       "chrUn_GL456378"       "chrUn_GL456379"
[55] "chrUn_GL456381"       "chrUn_GL456382"       "chrUn_GL456383"
[58] "chrUn_GL456385"       "chrUn_GL456387"       "chrUn_GL456389"
[61] "chrUn_GL456390"       "chrUn_GL456392"       "chrUn_GL456393"
[64] "chrUn_GL456394"       "chrUn_GL456396"       "chrUn_JH584304"
# same problem with seqnames
seqlevels(mouse.bs) <- temp.seqlevels
# prepare a GRanges object that can be used as a parameter to access specific chromosomes and regions from BAM files.
mouse.gr <- GRanges(seqnames=Rle(as.character(seqnames(mouse.bs))), ranges=IRanges(start
mouse.gr
GRanges object with 66 ranges and 0 metadata columns:
         seqnames         ranges strand
            <Rle>      <IRanges>  <Rle>
   [1]          1 [1, 195471971]      *
   [2]          2 [1, 182113224]      *
   [3]          3 [1, 160039680]      *
   [4]          4 [1, 156508116]      *
   [5]          5 [1, 151834684]      *
   ...        ...            ...    ...
  [62] GL456392.1     [1, 23629]      *

  [63] GL456393.1     [1, 55711]      *
  [64] GL456394.1     [1, 24323]      *
  [65] GL456396.1     [1, 21240]      *
  [66] JH584304.1    [1, 114452]      *
-------
seqinfo: 66 sequences (1 circular) from mm10 genome

3 Sequence Alignment Map Data Format

3.1 Converting BAM file to GAlignments Object

There are several Bioconductor packages that allow you to access BAM files. Here we are going to use readGAlignments to make a GAlignments object for the reads on chromosome 19. To do this, we need to specify certain parameters. What data do you want? Which chromosomes? Do you want to filter the data on specific flags?

# needed to redefine the file paths so the PDF would compile
data.dir <- "/Volumes/nihomics/data/ngs/Pike" #path for Pete's MacBook
bam.files <- list.files(file.path(data.dir, "rnaseq", "tophat"), pattern="accepted_hits_
bam.index <- list.files(file.path(data.dir, "rnaseq", "tophat"), pattern="bam.bai$", rec
library(GenomicAlignments)
flag.param <- scanBamFlag()
what.param <- scanBamWhat()
which.param <- mouse.gr[19]
my.param <- ScanBamParam(what=what.param, which=which.param, flag=flag.param)
my.param
class: ScanBamParam
bamFlag (NA unless specified):
bamSimpleCigar: FALSE
bamReverseComplement: FALSE
bamTag:
bamTagFilter:
bamWhich: 1 ranges
bamWhat: qname, flag, rname, strand, pos, qwidth, mapq, cigar, mrnm, mpos, isize, seq, qual, groupid, mate_status
bamMapqFilter: NA
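Filtering on flags makes more sense once you see that the SAM FLAG is a single integer whose bits each answer one yes/no question about the alignment. The following is a minimal base-R sketch of that decoding, not the Rsamtools implementation. The bit values come from the SAM specification, the names follow the column names that bamFlagAsBitMatrix reports, and decode_flag is a hypothetical helper written only for this example.

```r
# SAM FLAG bits (values from the SAM specification), named after the
# columns returned by Rsamtools::bamFlagAsBitMatrix
sam_flag_bits <- c(
  isPaired = 1, isProperPair = 2, isUnmappedQuery = 4, hasUnmappedMate = 8,
  isMinusStrand = 16, isMateMinusStrand = 32, isFirstMateRead = 64,
  isSecondMateRead = 128, isSecondaryAlignment = 256,
  isNotPassingQualityControls = 512, isDuplicate = 1024
)

# Hypothetical helper: return the names of the bits that are set in a flag.
# Integer division followed by modulo 2 tests one bit at a time.
decode_flag <- function(flag) {
  is_set <- (flag %/% sam_flag_bits) %% 2 == 1
  names(sam_flag_bits)[is_set]
}

# The four flag values that occur on chromosome 19 in this data set
for (f in c(0, 16, 256, 272)) {
  cat(f, ":", paste(decode_flag(f), collapse = ", "), "\n")
}
```

For example, 272 = 256 + 16 decodes to a secondary alignment on the minus strand, while a flag of 0 has no bits set (a single-end read aligned to the plus strand as its primary alignment).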

bam.ga <- readGAlignments(bam.files[1], index=bam.index[1], param=my.param)
bam.ga
GAlignments object with 5879740 alignments and 13 metadata columns:
            seqnames strand       cigar    qwidth     start       end
               <Rle>  <Rle> <character> <integer> <integer> <integer>
        [1]       19      -        101M       101   3000032   3000132
        [2]       19      -        101M       101   3000032   3000132
        [3]       19      +        101M       101   3002046   3002146
        [4]       19      +        101M       101   3002046   3002146
        [5]       19      +        101M       101   3004158   3004258
        ...      ...    ...         ...       ...       ...       ...
  [5879736]       19      +        101M       101  61331466  61331566
  [5879737]       19      +        101M       101  61331466  61331566
  [5879738]       19      +        101M       101  61331466  61331566
  [5879739]       19      +        101M       101  61331468  61331568
  [5879740]       19      +        101M       101  61331468  61331568
                width     njunc |                qname      flag
            <integer> <integer> |          <character> <integer>
        [1]       101         0 |  SRR1164972.75188336       272
        [2]       101         0 | SRR1164972.117388926       272
        [3]       101         0 |  SRR1164972.94887786       256
        [4]       101         0 |  SRR1164972.99763995         0
        [5]       101         0 |  SRR1164972.78216853       256
        ...       ...       ... .                  ...       ...
  [5879736]       101         0 |  SRR1164972.31703343       256
  [5879737]       101         0 |  SRR1164972.55791943       256
  [5879738]       101         0 |  SRR1164972.98839515       256
  [5879739]       101         0 |  SRR1164972.36321359         0
  [5879740]       101         0 |  SRR1164972.71867433         0
               rname   strand       pos    qwidth      mapq       cigar
            <factor> <factor> <integer> <integer> <integer> <character>
        [1]       19        -   3000032       101         0        101M
        [2]       19        -   3000032       101         0        101M
        [3]       19        +   3002046       101         1        101M
        [4]       19        +   3002046       101         1        101M
        [5]       19        +   3004158       101         1        101M
        ...      ...      ...       ...       ...       ...         ...
  [5879736]       19        +  61331466       101         0        101M
  [5879737]       19        +  61331466       101         0        101M

[5879738] 19 + 61331466 101 0 101M [5879739] 19 + 61331468 101 50 101M [5879740] 19 + 61331468 101 1 101M mrnm mpos isize <factor> <integer> <integer> [1] <NA> 0 0 [2] <NA> 0 0 [3] <NA> 0 0 [4] <NA> 0 0 [5] <NA> 0 0............ [5879736] <NA> 0 0 [5879737] <NA> 0 0 [5879738] <NA> 0 0 [5879739] <NA> 0 0 [5879740] <NA> 0 0 [1] CCTAGTATATCTGGAGAGTTAAGATGGGGAATTATGTCAACTTTCCCTCTTCCTATGCCAGTTATGCATAATG [2] CCTAGTATATCTGGAGAGTTAAGATGGGGAATTATGTCAACTTTCCCTCTTCCTATGCCAGTTATGCATAATG [3] AGGGGGAGATGTGAGGAGCCGCCCTTGCAATCGCCATTACAAAATGGTGCTGATATCCGGTGTTCTAACTAGT [4] AGGGGGAGATGTGAGGAGCCGCCCTTGCAATCGCCATTACAAAATGGTGCTGATATCCGGTGTTCTAACTAGT [5] CTATGGCCTTGGGCATCAAGATTTAAAAAATTAAGAGTGAAGAGTGCTATGGAAACAACTACTCTTGGTACTG... [5879736] GTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGG [5879737] GTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGG [5879738] GTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGG [5879739] TAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTT [5879740] TAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTT [1] 3DCCDCCDCCCCDCCCEEDCEFFFCDA7HHHHGFIIIGEFIGEHJIIJJHGIJJIJJJJJJJJJJJJJIJIJJ [2] <DCDEDCDDCCACDECCCDEEEDC@B@EHEAE@EHGCHGFGJIIIIHBIHFGIIGJJIJIHEHEIJJJIJIGF [3] CCCFFFFFHHHHHIIJJIIIIJJJIJJJIJJJIJJIJJIIIIJJJHHEHHFFFFFFFFDD;?BDDEECDCCCC [4] CCCFFFFFHHHFHJIJIJJJJJJJJJJJJJJJJJJJJJGIJIJJIHHEHHHFFFFFFFDC;@DDDEEDCDDCC [5] CCCFFFFFHHDHHEIJIJJJIJJIHIJJIJJJJIIIIGHIJIIJ?FHIIIHGIIEIJIJJEIIJJIJICEHHH... [5879736] @@@DDEFDDDFHHCEGGHJGGGFHHACFHEH:CDFHIDGABHE9@@FCF8CA@DH.=CHHF7??@>D;>6>;@ [5879737] @@@DFFFDHHHHHHIJIJJFHHIJJAEHIJJFIJJJJ:?GHJJ?DHIIJ8CFHII=CCHHG7=CDCF7>ACDE 19

  [5879738] @@CFFFFDFFHHGEGGIJJFHHIJJ<CGHJJCGHIIJ?BFHIJBFHIJJ=BGIJJ7==DFH=?EEEF7;=>@A
  [5879739] @@CFFEFFHHGFDHIIJACFGII3CCG>@CFHEHC?DFHDHHHJIGI8=BFH>=FHHHI7?CE9@7;?ACACA
  [5879740] @?@DBDBDFFF<CEFFE<AFGCF2AFGIF:DGFIF00BBBF?9?DFFFEEFFCFF=@CD)==CE>B;BBCC.6
-------
seqinfo: 66 sequences from an unspecified genome

You should see that some of the data is repeated. The reason is that some essential data is always imported, so we don't need to request it in our what.param.

what.param <- c("qname", "flag", "mapq", "seq", "qual")
my.param <- ScanBamParam(what=what.param, which=which.param, flag=flag.param)
bam.ga <- readGAlignments(bam.files[1], index=bam.index[1], param=my.param)
bam.ga
GAlignments object with 5879740 alignments and 5 metadata columns:
            seqnames strand       cigar    qwidth     start       end
               <Rle>  <Rle> <character> <integer> <integer> <integer>
        [1]       19      -        101M       101   3000032   3000132
        [2]       19      -        101M       101   3000032   3000132
        [3]       19      +        101M       101   3002046   3002146
        [4]       19      +        101M       101   3002046   3002146
        [5]       19      +        101M       101   3004158   3004258
        ...      ...    ...         ...       ...       ...       ...
  [5879736]       19      +        101M       101  61331466  61331566
  [5879737]       19      +        101M       101  61331466  61331566
  [5879738]       19      +        101M       101  61331466  61331566
  [5879739]       19      +        101M       101  61331468  61331568
  [5879740]       19      +        101M       101  61331468  61331568
                width     njunc |                qname      flag
            <integer> <integer> |          <character> <integer>
        [1]       101         0 |  SRR1164972.75188336       272
        [2]       101         0 | SRR1164972.117388926       272
        [3]       101         0 |  SRR1164972.94887786       256
        [4]       101         0 |  SRR1164972.99763995         0
        [5]       101         0 |  SRR1164972.78216853       256
        ...       ...       ... .                  ...       ...
  [5879736]       101         0 |  SRR1164972.31703343       256
  [5879737]       101         0 |  SRR1164972.55791943       256
  [5879738]       101         0 |  SRR1164972.98839515       256

[5879739] 101 0 SRR1164972.36321359 0 [5879740] 101 0 SRR1164972.71867433 0 mapq <integer> [1] 0 [2] 0 [3] 1 [4] 1 [5] 1...... [5879736] 0 [5879737] 0 [5879738] 0 [5879739] 50 [5879740] 1 [1] CCTAGTATATCTGGAGAGTTAAGATGGGGAATTATGTCAACTTTCCCTCTTCCTATGCCAGTTATGCATAATG [2] CCTAGTATATCTGGAGAGTTAAGATGGGGAATTATGTCAACTTTCCCTCTTCCTATGCCAGTTATGCATAATG [3] AGGGGGAGATGTGAGGAGCCGCCCTTGCAATCGCCATTACAAAATGGTGCTGATATCCGGTGTTCTAACTAGT [4] AGGGGGAGATGTGAGGAGCCGCCCTTGCAATCGCCATTACAAAATGGTGCTGATATCCGGTGTTCTAACTAGT [5] CTATGGCCTTGGGCATCAAGATTTAAAAAATTAAGAGTGAAGAGTGCTATGGAAACAACTACTCTTGGTACTG... [5879736] GTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGG [5879737] GTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGG [5879738] GTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGG [5879739] TAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTT [5879740] TAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTT [1] 3DCCDCCDCCCCDCCCEEDCEFFFCDA7HHHHGFIIIGEFIGEHJIIJJHGIJJIJJJJJJJJJJJJJIJIJJ [2] <DCDEDCDDCCACDECCCDEEEDC@B@EHEAE@EHGCHGFGJIIIIHBIHFGIIGJJIJIHEHEIJJJIJIGF [3] CCCFFFFFHHHHHIIJJIIIIJJJIJJJIJJJIJJIJJIIIIJJJHHEHHFFFFFFFFDD;?BDDEECDCCCC [4] CCCFFFFFHHHFHJIJIJJJJJJJJJJJJJJJJJJJJJGIJIJJIHHEHHHFFFFFFFDC;@DDDEEDCDDCC [5] CCCFFFFFHHDHHEIJIJJJIJJIHIJJIJJJJIIIIGHIJIIJ?FHIIIHGIIEIJIJJEIIJJIJICEHHH... [5879736] @@@DDEFDDDFHHCEGGHJGGGFHHACFHEH:CDFHIDGABHE9@@FCF8CA@DH.=CHHF7??@>D;>6>;@ [5879737] @@@DFFFDHHHHHHIJIJJFHHIJJAEHIJJFIJJJJ:?GHJJ?DHIIJ8CFHII=CCHHG7=CDCF7>ACDE [5879738] @@CFFFFDFFHHGEGGIJJFHHIJJ<CGHJJCGHIIJ?BFHIJBFHIJJ=BGIJJ7==DFH=?EEEF7;=>@A 21

  [5879739] @@CFFEFFHHGFDHIIJACFGII3CCG>@CFHEHC?DFHDHHHJIGI8=BFH>=FHHHI7?CE9@7;?ACACA
  [5879740] @?@DBDBDFFF<CEFFE<AFGCF2AFGIF:DGFIF00BBBF?9?DFFFEEFFCFF=@CD)==CE>B;BBCC.6
-------
seqinfo: 66 sequences from an unspecified genome

3.2 Essential Alignment Data

There are certain fields in the BAM file that are always imported. Let's go through these one by one.

seqnames(bam.ga)
factor-Rle of length 5879740 with 1 run
  Lengths: 5879740
  Values :      19
Levels(66): 1 10 11 12 13 14 ... JH584302.1 JH584303.1 JH584304.1 MT X Y
strand(bam.ga)
factor-Rle of length 5879740 with 95300 runs
  Lengths:  2  4  7  9 15 16  3  4  7  1 ...  1 85  1 17  2 70  1 47  2  5
  Values :  -  +  -  +  -  +  -  +  -  + ...  -  +  -  +  -  +  -  +  -  +
Levels(3): + - *
cigar.tbl <- sort(table(cigar(bam.ga)), decreasing=TRUE)
head(cigar.tbl, 20)
             101M        26M623N75M        2M1413N99M        37M109N64M
          4928429              3513              3302              2941
       34M623N67M       46M2212N55M        33M623N68M        84M781N17M
             2892              2837              2771              2600
       100M974N1M        37M634N64M       68M1413N33M        69M974N32M
             2538              2533              2405              2299
12M634N80M1630N9M           97M1D4M           96M1I4M        61M781N40M
             2031              1990              1932              1930
       58M170N43M       63M3195N38M         99M781N2M          5M92N96M
             1925              1785              1692              1672
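The CIGAR strings in this table explain the qwidth, width and njunc values reported by the GAlignments accessors. As an illustrative sketch only (GenomicAlignments has its own, far more efficient CIGAR utilities), the following base-R code tallies a single CIGAR string by hand. The helper cigar_summary is hypothetical; the rules it applies come from the SAM specification: M, I, S, = and X consume read bases, M, D, N, = and X consume reference bases, and each N is a skipped region, which for RNA-seq alignments usually means a splice junction.

```r
# Hypothetical helper: summarize one CIGAR string.
cigar_summary <- function(cigar) {
  # operation letters and their lengths, in order of appearance
  ops  <- regmatches(cigar, gregexpr("[MIDNSHP=X]", cigar))[[1]]
  lens <- as.integer(regmatches(cigar, gregexpr("[0-9]+", cigar))[[1]])
  c(
    qwidth = sum(lens[ops %in% c("M", "I", "S", "=", "X")]),  # read bases
    width  = sum(lens[ops %in% c("M", "D", "N", "=", "X")]),  # reference span
    njunc  = sum(ops == "N")                                  # skipped regions
  )
}

# CIGAR strings taken from the table above
for (cg in c("101M", "26M623N75M", "97M1D4M", "96M1I4M", "12M634N80M1630N9M")) {
  cat(sprintf("%-18s", cg), cigar_summary(cg), "\n")
}
```

Under these rules every read here has a qwidth of 101, while the reference width varies: a 623-base skipped region stretches 26M623N75M to 724 bases, and the single insertion in 96M1I4M shortens the reference span to 100.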

range(qwidth(bam.ga))
[1] 101 101
range(width(bam.ga))
[1]     99 371953
njunc.tbl <- table(njunc(bam.ga))
njunc.tbl
      0       1       2       3
4991711  815582   70631    1816

3.3 Additional Data

Some data is only optionally reported, but these fields contain important information about the reads. The FLAG field contains information on how the reads aligned, especially how the two reads of a pair in a paired-end run aligned with respect to each other. The MAPQ field tells you how good or bad the alignment is.

head(mcols(bam.ga))
DataFrame with 6 rows and 5 columns
                 qname      flag      mapq
           <character> <integer> <integer>
1  SRR1164972.75188336       272         0
2 SRR1164972.117388926       272         0
3  SRR1164972.94887786       256         1
4  SRR1164972.99763995         0         1
5  SRR1164972.78216853       256         1
6 SRR1164972.134938149       256         1
1 CCTAGTATATCTGGAGAGTTAAGATGGGGAATTATGTCAACTTTCCCTCTTCCTATGCCAGTTATGCATAATGCACAAATATT
2 CCTAGTATATCTGGAGAGTTAAGATGGGGAATTATGTCAACTTTCCCTCTTCCTATGCCAGTTATGCATAATGCACAAATATT
3 AGGGGGAGATGTGAGGAGCCGCCCTTGCAATCGCCATTACAAAATGGTGCTGATATCCGGTGTTCTAACTAGTAAACAAGTAG

4 AGGGGGAGATGTGAGGAGCCGCCCTTGCAATCGCCATTACAAAATGGTGCTGATATCCGGTGTTCTAACTAGTAAACAAGTAG 5 CTATGGCCTTGGGCATCAAGATTTAAAAAATTAAGAGTGAAGAGTGCTATGGAAACAACTACTCTTGGTACTGAGGGTAGATT 6 CTATGGCCTTGGGCATCAAGATTTAAAAAATTAAGAGTGAAGAGTGCTATGGAAACAACTACTCTTGGTACTGAGGGTAGATT 1 3DCCDCCDCCCCDCCCEEDCEFFFCDA7HHHHGFIIIGEFIGEHJIIJJHGIJJIJJJJJJJJJJJJJIJIJJJIJJJJJIJJ 2 <DCDEDCDDCCACDECCCDEEEDC@B@EHEAE@EHGCHGFGJIIIIHBIHFGIIGJJIJIHEHEIJJJIJIGFIIEGIIGIHG 3 CCCFFFFFHHHHHIIJJIIIIJJJIJJJIJJJIJJIJJIIIIJJJHHEHHFFFFFFFFDD;?BDDEECDCCCCCCCDDDDACD 4 CCCFFFFFHHHFHJIJIJJJJJJJJJJJJJJJJJJJJJGIJIJJIHHEHHHFFFFFFFDC;@DDDEEDCDDCCDDACDDC>CC 5 CCCFFFFFHHDHHEIJIJJJIJJIHIJJIJJJJIIIIGHIJIIJ?FHIIIHGIIEIJIJJEIIJJIJICEHHHFEDEDCD>CC 6 @@@DDDDFFHHHHIIIJIGIGIIJFICB@GDHCGHEGHHGIIJJBGHHH@FHIIGGIJJFGIGJGIII=DAECEHEDE9AEDE flag.tbl <- table(mcols(bam.ga)$flag) flag.tbl 0 16 256 272 2549760 2604698 421921 303361 flag.mat <- bamflagasbitmatrix(as.integer(names(flag.tbl))) flag.mat <- cbind(as.integer(names(flag.tbl)), flag.mat) flag.mat ispaired isproperpair isunmappedquery hasunmappedmate [1,] 0 0 0 0 0 [2,] 16 0 0 0 0 [3,] 256 0 0 0 0 [4,] 272 0 0 0 0 isminusstrand ismateminusstrand isfirstmateread issecondmateread [1,] 0 0 0 0 [2,] 1 0 0 0 [3,] 0 0 0 0 [4,] 1 0 0 0 issecondaryalignment isnotpassingqualitycontrols isduplicate [1,] 0 0 0 [2,] 0 0 0 [3,] 1 0 0 [4,] 1 0 0 dups <- duplicated(mcols(bam.ga)$qname) duplicated(mcols(bam.ga)$qname, fromlast=true) 24
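A read that aligns to several locations appears under the same qname more than once. The following base-R sketch, using made-up read names, shows how duplicated() behaves with and without fromLast=TRUE, and how combining the two calls marks every copy of a repeated name rather than only the later ones.

```r
# Made-up read names: read2 aligns twice, read4 aligns three times.
qname <- c("read1", "read2", "read2", "read3", "read4", "read4", "read4")

# duplicated() alone skips the first occurrence of each repeated name.
later.copies <- duplicated(qname)

# Scanning from the other end skips the last occurrence instead.
earlier.copies <- duplicated(qname, fromLast = TRUE)

# The union of the two marks every alignment of a multi-mapped read.
all.copies <- later.copies | earlier.copies

data.frame(qname, later.copies, earlier.copies, all.copies)
table(all.copies)
```

Using the union matters when you want to inspect or discard all alignments of a multi-mapped read, including the one reported as primary.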

dup.ga <- bam.ga[dups] dup.ga <- dup.ga[order(mcols(dup.ga)$qname)] dup.ga GAlignments object with 383485 alignments and 5 metadata columns: seqnames strand cigar qwidth start end <Rle> <Rle> <character> <integer> <integer> <integer> [1] 19 + 101M 101 9005267 9005367 [2] 19 + 101M 101 9009386 9009486 [3] 19 + 101M 101 9012272 9012372 [4] 19 + 101M 101 29993264 29993364 [5] 19 + 101M 101 60021275 60021375..................... [383481] 19 + 101M 101 9007409 9007509 [383482] 19 + 101M 101 9007811 9007911 [383483] 19 + 101M 101 9007317 9007417 [383484] 19 + 101M 101 9007518 9007618 [383485] 19 + 101M 101 9007719 9007819 width njunc qname flag <integer> <integer> <character> <integer> [1] 101 0 SRR1164972.100000987 256 [2] 101 0 SRR1164972.100000987 0 [3] 101 0 SRR1164972.100000987 256 [4] 101 0 SRR1164972.100001132 256 [5] 101 0 SRR1164972.100001132 256.................. [383481] 101 0 SRR1164972.99999348 256 [383482] 101 0 SRR1164972.99999348 0 [383483] 101 0 SRR1164972.99999461 256 [383484] 101 0 SRR1164972.99999461 256 [383485] 101 0 SRR1164972.99999461 0 mapq <integer> [1] 1 [2] 1 [3] 1 [4] 0 [5] 0...... [383481] 1 25

[383482] 1 [383483] 1 [383484] 1 [383485] 1 [1] GCCCCAAGTTCAAGATGCCTGAGATGAACATCAAGGCTCCAAAGATCTCCATGCCAGATTTACATCTTAAGGGT [2] GCCCCAAGTTCAAGATGCCTGAGATGAACATCAAGGCTCCAAAGATCTCCATGCCAGATTTACATCTTAAGGGT [3] GCCCCAAGTTCAAGATGCCTGAGATGAACATCAAGGCTCCAAAGATCTCCATGCCAGATTTACATCTTAAGGGT [4] TCCAGCTTCTCTCCATTTACTTTGATGTTGGCTACTGGTTTGCTGTAGATTGCTTTTATCATGTTTAGGTATGG [5] TCCAGCTTCTCTCCATTTACTTTGATGTTGGCTACTGGTTTGCTGTAGATTGCTTTTATCATGTTTAGGTATGG... [383481] GCCCCAAGTTCAAGATGCCTGACATGCACTTCAAGGCTCCTAAGATCTCCATGCCTGATGTGGACTTGCATCTG [383482] GCCCCAAGTTCAAGATGCCTGACATGCACTTCAAGGCTCCTAAGATCTCCATGCCTGATGTGGACTTGCATCTG [383483] AGTGCCAAAATTAGAGGGAGAATTAAAAGGCCCAAGTGTGGATGTGGAAGTACCTGGTGTTGATCTGGAATGTC [383484] AGTGCCAAAATTAGAGGGAGAATTAAAAGGCCCAAGTGTGGATGTGGAAGTACCTGGTGTTGATCTGGAATGTC [383485] AGTGCCAAAATTAGAGGGAGAATTAAAAGGCCCAAGTGTGGATGTGGAAGTACCTGGTGTTGATCTGGAATGTC [1] CCCFFFFFHHHHHJJJJJJJJJJJJJJJJJJIJJJJJJJJJJJJJJIJJJJFJJJJJJIJJJJJJJIIJEHIJG [2] CCCFFFFFHHHHHJJJJJJJJJJJJJJJJJJIJJJJJJJJJJJJJJIJJJJFJJJJJJIJJJJJJJIIJEHIJG [3] CCCFFFFFHHHHHJJJJJJJJJJJJJJJJJJIJJJJJJJJJJJJJJIJJJJFJJJJJJIJJJJJJJIIJEHIJG [4] CCCFFFFFHHHHHJJJJJJJJJJJJJJJJJJJJJJJJJGIIJJJJIJJJJJJJJJJJJJJIIJGHIGIJGIIIG [5] CCCFFFFFHHHHHJJJJJJJJJJJJJJJJJJJJJJJJJGIIJJJJIJJJJJJJJJJJJJJIIJGHIGIJGIIIG... [383481] CCCFFFFFHHHHHJJJJJJJJJJJJJJJJJJJJJJJJIJJJJIJJJJJJJJJJJJJJIJIGJJJIIJJIJJJJJ [383482] CCCFFFFFHHHHHJJJJJJJJJJJJJJJJJJJJJJJJIJJJJIJJJJJJJJJJJJJJIJIGJJJIIJJIJJJJJ [383483] CCBFFFFFHHHHHJJJJJIJIJJJJIJJJJIJJJJJIJHIJJIJJIJIJJHIJJJIJJHIJIDHJJHHHHHHED [383484] CCBFFFFFHHHHHJJJJJIJIJJJJIJJJJIJJJJJIJHIJJIJJIJIJJHIJJJIJJHIJIDHJJHHHHHHED [383485] CCBFFFFFHHHHHJJJJJIJIJJJJIJJJJIJJJJJIJHIJJIJJIJIJJHIJJJIJJHIJIDHJJHHHHHHED ------- seqinfo: 66 sequences from an unspecified genome table(mcols(bam.ga)$flag, dups) dups FALSE TRUE 0 2460251 89509 16 2565491 39207 256 258576 163345 26

  272  211937   91424

mapq.tbl <- table(mcols(bam.ga)$mapq)
mapq.tbl

      0       1       3      50
 397697  271988  352838 4857217

table(mcols(bam.ga)$mapq, dups)
    dups
       FALSE    TRUE
  0   309593   88104
  1   141847  130141
  3   187598  165240
  50 4857217       0

table(mcols(bam.ga)$mapq, mcols(bam.ga)$flag, dups)
, , dups = FALSE

           0      16     256     272
  0    12732   18752  125030  153079
  1    27843   15037   60919   38048
  3    73132   21029   72627   20810
  50 2346544 2510673       0       0

, , dups = TRUE

           0      16     256     272
  0     4033    3979   37025   43067
  1    29589    8495   70453   21604
  3    55887   26733   55867   26753
  50       0       0       0       0
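The FLAG values tabulated in this section are bit fields: each power of two switches one property on or off, and the reported number is the sum of the bits that are set. The following is a minimal base-R sketch of that decoding, assuming the bit values from the SAM format specification. The helper name decode_flag is hypothetical, and the property names simply mirror the column names of the flag matrix.

```r
# Bit values from the SAM specification, named after the flag matrix columns.
flag_bits <- c(
  isPaired = 1L, isProperPair = 2L, isUnmappedQuery = 4L,
  hasUnmappedMate = 8L, isMinusStrand = 16L, isMateMinusStrand = 32L,
  isFirstMateRead = 64L, isSecondMateRead = 128L,
  isSecondaryAlignment = 256L, isNotPassingQualityControls = 512L,
  isDuplicate = 1024L
)

# Hypothetical helper: return the names of the bits set in a single flag value.
decode_flag <- function(flag) {
  is.set <- bitwAnd(rep(as.integer(flag), length(flag_bits)), flag_bits) != 0L
  names(flag_bits)[is.set]
}

# The four flag values observed in this BAM file.
observed.flags <- c(0L, 16L, 256L, 272L)
decoded <- lapply(observed.flags, decode_flag)
names(decoded) <- observed.flags
decoded
```

For example, 272 is 256 + 16, a secondary alignment on the minus strand, while 0 is a primary alignment on the plus strand with no bits set.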

3.4 Sequence Data

The sequence data includes both the actual sequence of the read and the quality score for each base in Phred format.

mcols(bam.ga)$seq
  A DNAStringSet instance of length 5879740
          width seq
      [1]   101 CCTAGTATATCTGGAGAGTTAAGATGGG...CACAAATATTTCCACGCTTTTTCACTAC
      [2]   101 CCTAGTATATCTGGAGAGTTAAGATGGG...CACAAATATTTCCACGCTTTTTCACTAC
      [3]   101 AGGGGGAGATGTGAGGAGCCGCCCTTGC...AAACAAGTAGTCTGCGCATGTGCTGGGG
      [4]   101 AGGGGGAGATGTGAGGAGCCGCCCTTGC...AAACAAGTAGTCTGCGCATGTGCTGGGG
      [5]   101 CTATGGCCTTGGGCATCAAGATTTAAAA...AGGGTAGATTCCCCCTTTTTGTTTAATT
      ...   ... ...
[5879736]   101 GTTAGGGTTAGGGTTAGGGTTAGGGTTA...GTAGGGTTAGGGGTAGGGTTAGGGTTAG
[5879737]   101 GTTAGGGTTAGGGTTAGGGTTAGGGTTA...TTAGGGTTAGGGTTAGGGTTAGGGTTAG
[5879738]   101 GTTAGGGTTAGGGTTAGGGTTAGGGTTA...TTAGGGTTAGGGTTAGGGTTAGGGTTAG
[5879739]   101 TAGGGTTAGGGTTAGGGTTAGGGTTAGG...AGGGTTAGGGTTAGGGTTAGGGTTAGAT
[5879740]   101 TAGGGTTAGGGTTAGGGTTAGGGTTAGG...AGGGTTAGGGTTAGGGTTAGGGTTAGAT

As an example of working with the sequences, we can count how many reads contain the EcoRI recognition site, GAATTC.

ecor1 <- vcountPattern("GAATTC", mcols(bam.ga)$seq)
table(ecor1)
ecor1
      0       1       2
5769920  109369     451

mcols(bam.ga)$seq[ecor1 > 0]
  A DNAStringSet instance of length 109820
         width seq
     [1]   101 ATATAGTGGATTACTTTGATGGATTTCCA...ATGATTGTTTTGATGTGTTCTTGAATTC
     [2]   101 ATATAGTGGATTACTTTGATGGATTTCCA...ATGATTGTTTTGATGTGTTCTTGAATTC
     [3]   101 CAAAATCCAACACCTATTCATGATGAAAG...AAAGCAATATACAGCAAGCCAGTAGCCA
     [4]   101 TGGCTACTGGTTTGCTGTAGATTGCTTTT...TTTTATCATGAATGGGTGTTGGATCTTG
     [5]   101 CATCTTTGCCATGATATTTTTTGCTTTAG...CCCAAATGCTGCATAATATCCCTTCCCC
     ...   ... ...
[109816]   101 CACATGATCATCTCGTTAGATGCAGAAAA...AAAGATCAGGAATTCAAGGCCCATACCT
[109817]   101 TTAGATGCAGAAAAAGCATTTGACAAGAT...AAGGCCCATACCTAAACATGATAAAAGC
[109818]   101 GTACACCCCTGCATGATAGAATTCTGCAG...TAAGTGATTCTTGGATCCTAACATTCTA
[109819]   101 GTACACCCCTGCATGATAGAATTCTGCAG...TAAGTGATTCTTGGATCCTAACATTCTA
[109820]   101 GTACACCCCTGCATGATAGAATTCTGCAG...TAAGTGATTCTTGGATCCTAACATTCTA
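The call to vcountPattern counts, for each read, the occurrences of the EcoRI recognition site GAATTC. The same idea can be sketched in base R on plain character strings. This is an illustration only, not how Biostrings does it: the helper count_pattern and the toy reads are hypothetical, and gregexpr() finds non-overlapping exact matches, which is sufficient here because GAATTC cannot overlap itself.

```r
# Hypothetical helper: count exact, non-overlapping matches of a pattern
# in each element of a character vector.
count_pattern <- function(pattern, seqs) {
  hits <- gregexpr(pattern, seqs, fixed = TRUE)
  # gregexpr() returns -1 when there is no match, so count positive positions.
  vapply(hits, function(h) sum(h > 0), integer(1))
}

# Toy reads containing one, zero, and two EcoRI sites.
toy.reads <- c("ATGAATTCGG", "CCCCCCCCCC", "GAATTCAAGAATTC")
site.counts <- count_pattern("GAATTC", toy.reads)

table(site.counts)
toy.reads[site.counts > 0]
```

For millions of reads, the Biostrings function is the better choice because it works directly on the DNAStringSet.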

mcols(bam.ga)$qual
  A PhredQuality instance of length 5879740
          width seq
      [1]   101 3DCCDCCDCCCCDCCCEEDCEFFFCDA7...JIJJJJJIJJJJIHJHHHHHFFFFFC@C
      [2]   101 <DCDEDCDDCCACDECCCDEEEDC@B@E...IIEGIIGIHGIIHGHDHHHGFDEDF@C@
      [3]   101 CCCFFFFFHHHHHIIJJIIIIJJJIJJJ...CCCDDDDACDDDDDDDB>BBCCCDC@B9
      [4]   101 CCCFFFFFHHHFHJIJIJJJJJJJJJJJ...DDACDDC>CCCDEDDDBDDDDDCDDCBB
      [5]   101 CCCFFFFFHHDHHEIJIJJJIJJIHIJJ...FEDEDCD>CCEDCDDDDDDDDACDCAC:
      ...   ... ...
[5879736]   101 @@@DDEFDDDFHHCEGGHJGGGFHHACF...',91<<59ABC#
[5879737]   101 @@@DFFFDHHHHHHIJIJJFHHIJJAEH...5==ACD??AB?B(5(8<B38<A?B
[5879738]   101 @@CFFFFDFFHHGEGGIJJFHHIJJ<CG...-,;;?A559<CB,58?CB
[5879739]   101 @@CFFEFFHHGFDHIIJACFGII3CCG>...BD=B><AB9?59AA?<3<B<9?98<?C3
[5879740]   101 @?@DBDBDFFF<CEFFE<AFGCF2AFGI...2;;?-99?<@(99<B<(2<@

Each quality character is an ASCII code. Converting the first quality string to integers gives the raw codes.

as.integer(charToRaw(as.character(mcols(bam.ga)$qual[1])))
 [1] 51 68 67 67 68 67 67 68 67 67 67 67 68 67 67 67 69 69 68 67 69 70 70
[24] 70 67 68 65 55 72 72 72 72 71 70 73 73 73 71 69 70 73 71 69 72 74 73
[47] 73 74 74 72 71 73 74 74 73 74 74 74 74 74 74 74 74 74 74 74 74 74 73
[70] 74 73 74 74 74 73 74 74 74 74 74 73 74 74 74 74 73 72 74 72 72 72 72
[93] 72 70 70 70 70 70 67 64 67

Subtracting 33, the ASCII offset of the Phred+33 encoding, gives the Phred quality scores.

as.integer(charToRaw(as.character(mcols(bam.ga)$qual[1]))) - 33
 [1] 18 35 34 34 35 34 34 35 34 34 34 34 35 34 34 34 36 36 35 34 36 37 37
[24] 37 34 35 32 22 39 39 39 39 38 37 40 40 40 38 36 37 40 38 36 39 41 40
[47] 40 41 41 39 38 40 41 41 40 41 41 41 41 41 41 41 41 41 41 41 41 41 40
[70] 41 40 41 41 41 40 41 41 41 41 41 40 41 41 41 41 40 39 41 39 39 39 39
[93] 39 37 37 37 37 37 34 31 34

4 Determining Digital Gene Expression

Now we have everything we need to get our gene counts. It is important to think about how this is done. First, is your RNAseq library strand specific? Second, do you have paired-end reads? If so, which of the ends should you use for counting?

In addition, how stringent do you want to be when you determine whether or not a read actually overlaps with a gene? What if a read overlaps with two genes? In your package tab, go to GenomicAlignments and click on it. Open the user guides and select GenomicAlignments::summarizeOverlaps. Figure 1 shows the counting modes that are available.

4.1 Counting Modes on Simulated Data

The following bit of code demonstrates how your gene expression counts can vary based on your choice of counting mode.

library(Gviz)
my.gr <- GRanges(seqnames=Rle("chr1"), ranges=IRanges(start=1, end=1000), strand=Rle("*"
my.seqinfo <- Seqinfo(seqnames="chr1", seqlengths=1000, isCircular=FALSE, genome="simula
seqinfo(my.gr) <- my.seqinfo
gene1 <- GRanges(seqnames=Rle("chr1"), ranges=IRanges(start=201, end=350), strand=Rle("+
gene2 <- GRanges(seqnames=Rle(rep("chr1", 2)), ranges=IRanges(start=c(301, 601), end=c(3
gene3 <- GRanges(seqnames=Rle("chr1"), ranges=IRanges(start=451, end=550), strand=Rle("
gene4 <- GRanges(seqnames=Rle(rep("chr1", 2)), ranges=IRanges(start=c(101, 301), end=c(2
my.genes <- GRangesList(gene1=gene1, gene2=gene2, gene3=gene3, gene4=gene4)
my.genes.df <- as.data.frame(my.genes)
colnames(my.genes.df)[3] <- "chromosome"
gene.track <- GeneRegionTrack(my.genes.df, fill="green4", arrowHeadWidth=50, arrowHeadMax
gene1.reads <- GRanges(seqnames=Rle(rep("chr1", 20)), ranges=IRanges(start=sample(200:32
gene2.reads <- GRanges(seqnames=Rle(rep("chr1", 60)), ranges=IRanges(start=sample(c(300:
gene3.reads <- GRanges(seqnames=Rle(rep("chr1", 10)), ranges=IRanges(start=sample(450:52
gene4.reads <- GRanges(seqnames=Rle(rep("chr1", 20)), ranges=IRanges(start=sample(c(100:
all.reads <- c(gene1.reads, gene2.reads, gene3.reads, gene4.reads)
read.track <- AnnotationTrack(all.reads, fill=c("red", "blue")[as.integer(strand(all.rea
ax.track <- GenomeAxisTrack(GRanges(seqnames="chr1", ranges=IRanges(1, 1000)), genome="s
plotTracks(list(gene.track, read.track, ax.track), from=1, to=700, grid=1,
sizes=c(4, 4,

[Figure: Gviz plot showing a GeneRegionTrack (gene1, gene2, gene3, gene4), a "Demo Reads" track, and a genome axis labeled from 100 to 600.]

count.mat <- matrix(0, nrow=4, ncol=5)
colnames(count.mat) <- c("actual", "countOverlaps", "Union", "IntersectionStrict", "Inte
rownames(count.mat) <- c("gene1", "gene2", "gene3", "gene4")
count.mat[, 1] <- c(20, 60, 10, 20)
count.mat[, 2] <- countOverlaps(my.genes, all.reads)
count.mat[, 3] <- assays(summarizeOverlaps(my.genes, all.reads, "Union", ignore.strand=F
count.mat[, 4] <- assays(summarizeOverlaps(my.genes, all.reads, "IntersectionStrict", ig
count.mat[, 5] <- assays(summarizeOverlaps(my.genes, all.reads, "IntersectionNotEmpty",
count.mat
      actual countOverlaps Union IntersectionStrict IntersectionNotEmpty
gene1     20            35    11                 15                   15
gene2     60            69    45                 45                   45
gene3     10            10    10                 10                   10
gene4     20            20    20                 20                   20

The genes that overlap each other, gene1 and gene2, are the ones whose counts depend on the counting mode: countOverlaps counts a read once for every gene it touches and overcounts, while the summarizeOverlaps modes discard ambiguous reads and undercount.

4.2 Clean-Up Environment to Free Memory

We've accumulated a bunch of objects in our workspace. Let's carefully remove what we don't need.

ls()
 [1] "all.reads"      "ax.track"       "bam.chrs"       "bam.files"
 [5] "bam.ga"         "bam.header"     "bam.index"      "bpparam"
 [9] "cigar.tbl"      "count.mat"      "data.dir"       "dup.ga"
[13] "dups"           "ecor1"          "flag.mat"       "flag.param"
[17] "flag.tbl"       "gene.track"     "gene1"          "gene1.reads"
[21] "gene2"          "gene2.reads"    "gene3"          "gene3.reads"
[25] "gene4"          "gene4.reads"    "mapq.tbl"       "mouse.bs"
[29] "mouse.genes"    "mouse.genes0"   "mouse.gr"       "my.genes"
[33] "my.genes.df"    "my.gr"          "my.param"       "my.seqinfo"
[37] "njunc.tbl"      "read.track"     "seqnames.genes" "temp.seqlevels"
[41] "what.param"     "which.param"

my.objs <- list(ls())
my.objs[[1]] <- setdiff(my.objs[[1]], c("data.dir", "bam.files", "bam.index", "mouse.gen
rm(list=my.objs[[1]])
rm(my.objs)
ls()
[1] "bam.files"   "bam.index"   "data.dir"    "mouse.genes" "mouse.gr"

gc()
          used  (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells 5105866 272.7   13897102  742.2  17371378  927.8
Vcells 8228399  62.8  677272256 5167.2 846274191 6456.6

4.3 Method for Limited Memory and Single Core

It is very likely that you will have limited computer resources. The following chunk of code will work down a list of BAM files, one chromosome at a time, and generate your gene count matrix. I have used this routinely on my MacBook Pro (16 GB memory). With mapq.cutoff set to 50, only reads with the highest mapping quality are counted; in section 3.3 none of those reads had a duplicate alignment.

#sample.names <- substring(bam.files, 45, 54)
#count.mat <- matrix(0, nrow=length(mouse.genes), ncol=length(sample.names))
#rownames(count.mat) <- names(mouse.genes)
#colnames(count.mat) <- sample.names
#one file at a time, one computer core at a time
# mapq.cutoff <- 50
# for(j in 1:length(bam.files)){
#   for(i in 1:length(seqlevels(mouse.genes))){
#     chr.param <- ScanBamParam(which=mouse.gr[i], mapqFilter=mapq.cutoff)
#     bam.ga <- readGAlignments(bam.files[j], index=bam.index[j], param=chr.param)
#     my.counts <- summarizeOverlaps(features=mouse.genes, reads=bam.ga, mode="Union", ign
#     count.mat[, j] <- count.mat[, j] + assays(my.counts)$counts
#   }
# }
# write.table(count.mat, file=paste(sample.name, "counts.txt", sep="_"), row.names=TRUE,

4.4 Parallel Processing of BAM Files on Multiple Cores

The next chunk will use the parameters we specified to BiocParallel to process all 12 BAM files (we have 18 workers). This should take less than 10 minutes.

system.time(my.se <- summarizeOverlaps(mouse.genes, bam.files, mode="IntersectionNotEmpt

summarizeOverlaps will check for the parallel processing parameters reported by registered(). We have 12 files and 18 workers, so all 12 files will be processed in parallel.

   user  system elapsed
   0.84    1.35  516.97

517/60

It took about nine minutes to process all files. The for loop above would have taken ~110 minutes.

class(my.se)

dim(assays(my.se)$counts)
colnames(assays(my.se)$counts)
save(my.se, file="pikesummarizedexp.rdata")

Now, we are finished with the Big Data part of RNAseq analysis.

5 Differential Gene Expression Analysis with EdgeR

There are several popular packages or tools for DGE analysis of RNAseq data. At one time CuffDiff was certainly the most popular, but it has fallen out of favor for various reasons. It will be interesting to see how CuffDiff2 performs. DESeq2 and edgeR are two of the best available packages for DGE, and both are native to Bioconductor. Here we will cover edgeR because it is essentially limma for RNAseq. Both were developed by the same research group, so they have a similar philosophy and workflow. If you are interested in trying DESeq2, Bioconductor has an excellent workflow on its website that is quite easy to follow if you know a little R (and you do now!).

5.1 Formatting the Data

As with the microarray data, we need to format our raw data and our pdata. We will get the counts from the SummarizedExperiment object. If you weren't able to generate the object, there is a copy that you can load.

load("pikesummarizedexp.rdata")
my.counts <- assays(my.se)$counts
head(my.counts)
          SRR1164972_accepted_hits_sorted.bam
100009600                                   0
100009609                                   0
100009614                                   0
100009664                                   0
100012                                      0
100017                                    492
          SRR1164973_accepted_hits_sorted.bam
100009600                                   0
100009609                                   0
100009614                                   0
100009664                                   0
100012                                      0
100017                                    865
          SRR1164974_accepted_hits_sorted.bam
100009600                                   7
100009609                                   0
100009614                                   0
100009664                                   0
100012                                      0
100017                                    599
          SRR1164975_accepted_hits_sorted.bam
100009600                                   8
100009609                                  31
100009614                                   5
100009664                                  14
100012                                     18
100017                                    396
          SRR1164976_accepted_hits_sorted.bam
100009600                                   7
100009609                                  14
100009614                                   0
100009664                                  21
100012                                     34
100017                                    701
          SRR1164977_accepted_hits_sorted.bam
100009600                                  26
100009609                                  24
100009614                                   1
100009664                                  22
100012                                     41
100017                                    599
          SRR1164978_accepted_hits_sorted.bam
100009600                                   7
100009609                                   0
100009614                                   0
100009664                                  15
100012                                      0
100017                                    941
          SRR1164979_accepted_hits_sorted.bam
100009600                                   8
100009609                                   8
100009614                                   0
100009664                                  15
100012                                      5
100017                                    945
          SRR1164980_accepted_hits_sorted.bam
100009600                                   3
100009609                                   4
100009614                                   1
100009664                                   5
100012                                      0
100017                                    938
          SRR1164981_accepted_hits_sorted.bam
100009600                                   0
100009609                                   0
100009614                                   0
100009664                                  10
100012                                      0
100017                                   1271
          SRR1164982_accepted_hits_sorted.bam
100009600                                  10
100009609                                   5
100009614                                   0
100009664                                  32
100012                                     44
100017                                    844
          SRR1164983_accepted_hits_sorted.bam
100009600                                  14
100009609                                   0
100009614                                   0
100009664                                  37
100012                                      0
100017                                   1418

The column names are the BAM file names, which are too long to be useful. We keep only the first 10 characters, the SRA run accession.

colnames(my.counts) <- substr(colnames(my.counts), 1, 10)
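Trimming the column names with substr() relies on every run accession being exactly 10 characters long. The following base-R sketch compares that approach with a pattern-based alternative that keeps everything before the first underscore. The first two file names follow the pattern seen in the count matrix; the third, with a longer accession, is made up to show where the two approaches differ.

```r
# File names in the style of the count matrix columns; the last one is hypothetical.
bam.names <- c("SRR1164972_accepted_hits_sorted.bam",
               "SRR1164983_accepted_hits_sorted.bam",
               "SRR12345678_accepted_hits_sorted.bam")

# Fixed-width trimming, as used in the text.
by.position <- substr(bam.names, 1, 10)

# Pattern-based trimming: drop the first underscore and everything after it.
by.pattern <- sub("_.*$", "", bam.names)

data.frame(by.position, by.pattern, same = by.position == by.pattern)
```

For the accessions in this data set the two approaches agree, so the fixed-width version used here is safe.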

head(my.counts)
          SRR1164972 SRR1164973 SRR1164974 SRR1164975 SRR1164976
100009600          0          0          7          8          7
100009609          0          0          0         31         14
100009614          0          0          0          5          0
100009664          0          0          0         14         21
100012             0          0          0         18         34
100017           492        865        599        396        701
          SRR1164977 SRR1164978 SRR1164979 SRR1164980 SRR1164981
100009600         26          7          8          3          0
100009609         24          0          8          4          0
100009614          1          0          0          1          0
100009664         22         15         15          5         10
100012            41          0          5          0          0
100017           599        941        945        938       1271
          SRR1164982 SRR1164983
100009600         10         14
100009609          5          0
100009614          0          0
100009664         32         37
100012            44          0
100017           844       1418

Next, we need to get the pdata for the samples. We will use a file that can be downloaded from SRA when you browse the data. Again, we are going to do a bit of data wrangling to make suitable names.

#metadata for experiment
pdata <- read.delim(file.path(data.dir, "metadata", "Pike_SraRunTable_RNAseq.txt"), comm
head(pdata)
   BioSample_s Experiment_s MBases_l MBytes_l      Run_s SRA_Sample_s
1 SAMN02630838    SRX467388    16205    11795 SRR1164954    SRS555053
2 SAMN02630865    SRX467389    17766    13154 SRR1164955    SRS555054
3 SAMN02630861    SRX467390    17896    13283 SRR1164956    SRS555055
4 SAMN02630863    SRX467391    18763    13720 SRR1164957    SRS555056
5 SAMN02630867    SRX467392    17625    12835 SRR1164958    SRS555057
6 SAMN02630869    SRX467393    15939    11586 SRR1164959    SRS555059
  Sample_Name_s differentiation_day_s   source_name_s Assay_Type_s
1    GSM1323943                    d3 IDGSW3_basal_d3      RNA-Seq
2    GSM1323944                    d3 IDGSW3_basal_d3      RNA-Seq
3    GSM1323945                    d3 IDGSW3_basal_d3      RNA-Seq
4    GSM1323946                    d7 IDGSW3_basal_d7      RNA-Seq
5    GSM1323947                    d7 IDGSW3_basal_d7      RNA-Seq
6    GSM1323948                    d7 IDGSW3_basal_d7      RNA-Seq
  AssemblyName_s BioProject_s Center_Name_s Consent_s InsertSize_l
1 <not provided>  PRJNA237622           GEO    public            0
2 <not provided>  PRJNA237622           GEO    public            0
3 <not provided>  PRJNA237622           GEO    public            0
4 <not provided>  PRJNA237622           GEO    public            0
5 <not provided>  PRJNA237622           GEO    public            0
6 <not provided>  PRJNA237622           GEO    public            0
  LibraryLayout_s LibrarySelection_s LibrarySource_s Library_Name_s
1          SINGLE               cDNA  TRANSCRIPTOMIC <not provided>
2          SINGLE               cDNA  TRANSCRIPTOMIC <not provided>
3          SINGLE               cDNA  TRANSCRIPTOMIC <not provided>
4          SINGLE               cDNA  TRANSCRIPTOMIC <not provided>
5          SINGLE               cDNA  TRANSCRIPTOMIC <not provided>
6          SINGLE               cDNA  TRANSCRIPTOMIC <not provided>
  LoadDate_s   Organism_s Platform_s ReleaseDate_s SRA_Study_s cell_line_s
1   2/7/2014 Mus musculus   ILLUMINA      6/2/2014   SRP036858     IDG-SW3
2   2/7/2014 Mus musculus   ILLUMINA      6/2/2014   SRP036858     IDG-SW3
3   2/7/2014 Mus musculus   ILLUMINA      6/2/2014   SRP036858     IDG-SW3
4   2/7/2014 Mus musculus   ILLUMINA      6/2/2014   SRP036858     IDG-SW3
5   2/7/2014 Mus musculus   ILLUMINA      6/2/2014   SRP036858     IDG-SW3
6   2/7/2014 Mus musculus   ILLUMINA      6/2/2014   SRP036858     IDG-SW3
       cell_type_s g1k_analysis_group_s g1k_pop_code_s       source_s
1 osteocytic cells       <not provided> <not provided> <not provided>
2 osteocytic cells       <not provided> <not provided> <not provided>
3 osteocytic cells       <not provided> <not provided> <not provided>
4 osteocytic cells       <not provided> <not provided> <not provided>
5 osteocytic cells       <not provided> <not provided> <not provided>
6 osteocytic cells       <not provided> <not provided> <not provided>

We keep only the rows whose run accession matches a column of the count matrix, and only the columns we need.

pdata <- pdata[pdata$Run_s %in% colnames(my.counts), c("Run_s", "differentiation_day_s",
pdata$source_name_s <- as.factor(pdata$source_name_s)
summary(pdata$source_name_s)
IDGSW3_125_d3 IDGSW3_125_d35 IDGSW3_vehicle_d3