Introduction to RNA-seq Analysis


Pete E. Pascuzzi
July 14, 2016

Contents

Intro
Setup
  Setup Directories
  Parallel Processing
  BAM Header Metadata
  Gene Annotation
  Genome Annotation
Sequence Alignment Map Data Format
  Converting BAM file to GAlignments Object
  Essential Alignment Data
  Additional Data
  Sequence Data
Determining Digital Gene Expression
  Counting Modes on Simulated Data
  Clean-Up Environment to Free Memory
  Method for Limited Memory and Single Core
  Parallel Processing of BAM Files on Multiple Cores
Differential Gene Expression Analysis with EdgeR
  Formatting the Data
  Normalizing the Libraries (Counts)
SessionInfo

1 Intro

The following vignette is a basic RNAseq analysis of data from St. John, et al., Mol. Endocrinol. The data was deposited at NCBI GEO under the Super Series GSE. The RNAseq data is the Series GSE. We will work with only a subset of these samples: the 2 x 2 design of mouse cells, untreated or treated with vitamin D, at 3 days and 35 days.

At GEO, there is metadata and processed data that is not amenable to further statistical analysis. However, the raw data in the form of FASTQ files was deposited in the NCBI Short Read Archive (SRA). SRA is an entirely different system from GEO, with a totally different series of accession numbers. The system can be confusing to navigate, so you should spend some time there familiarizing yourself with it. Importantly, the raw data in SRA can be downloaded and reanalyzed. However, each of these files can be very large, greater than 20 GB each. The raw data for this vignette is about 500 GB combined.

NCBI has developed a series of command line UNIX tools that are used to manipulate and download SRA data. This is not a simple matter of file transfer. Raw data at SRA is stored in SRA format to optimize storage, and it must be converted back to FASTQ format (sometimes SAM or BAM format) using specific utilities. For example, we used the tool fastq-dump to transfer files from SRA to Rice and convert them to FASTQ format. This took more than a day!

In fact, these FASTQ files are not entirely raw. Typically, NGS experiments are indexed/bar-coded/tagged so that several samples can be run together on a single lane. So, if you download an SRA file that corresponds to a single sample, these indices have already been processed so that the reads that correspond to a specific sample were sent to a specific file. Another step that may have been done is the removal of the adaptors that are necessary for library construction and sequencing. The reads are then ready for additional QC and alignment to your reference genome.
We have already performed quality trimming of the reads with the FASTX toolkit, removing bases from either end of the read that have a PHRED quality score below 30. We have QC'd the reads with the tool FastQC. The reads were aligned to the mouse reference genome Mus musculus.grcm38.fa with the tool TopHat. The resulting Sequence Alignment Map files were saved in binary format as BAM files, which were sorted and indexed to facilitate analysis and visualization. The steps that this vignette will cover are an overview of the BAM file

format, generation of a gene annotation object that can be used to make your gene count table, counting of reads over genes, and edgeR analysis for differential expression.

2 Setup

When working with such large files, it is important to think about your storage and directory structure. We have a huge amount of disk space available to us in our scratch directory on Rice. For many tasks, it would be typical to copy your data to scratch, do your work, and copy the results back to secure storage such as Data Depot. However, to avoid the transfer of large files that could become corrupted, we are going to leave our BAM files on Data Depot and access them from Rice while working in our scratch directory.

Additionally, some of the tools that have been used to process the data have specific conventions. TopHat generates a large number of files and directories, so when we want to access these files, we typically have to use long file paths. If you choose to modify this script for your own analysis, you will absolutely need to modify the paths to your own files!

2.1 Setup Directories

We will access various data for this experiment from Data Depot, so it will be convenient to define at least part of this path early.

# define directories for data
#data.dir <- "/depot/nihomics/data/ngs/pike" #path for Rice
data.dir <- "/Volumes/nihomics/data/ngs/Pike" #path for Pete's MacBook
library(Rsamtools)
bam.files <- list.files(file.path(data.dir, "rnaseq", "tophat"), pattern="accepted_hits_
bam.files
[1] "/Volumes/nihomics/data/ngs/Pike/rnaseq/tophat/SRR _tophat_out/SRR _
[2] "/Volumes/nihomics/data/ngs/Pike/rnaseq/tophat/SRR _tophat_out/SRR _
[3] "/Volumes/nihomics/data/ngs/Pike/rnaseq/tophat/SRR _tophat_out/SRR _
[4] "/Volumes/nihomics/data/ngs/Pike/rnaseq/tophat/SRR _tophat_out/SRR _

[5] "/Volumes/nihomics/data/ngs/Pike/rnaseq/tophat/SRR _tophat_out/SRR _
[6] "/Volumes/nihomics/data/ngs/Pike/rnaseq/tophat/SRR _tophat_out/SRR _
[7] "/Volumes/nihomics/data/ngs/Pike/rnaseq/tophat/SRR _tophat_out/SRR _
[8] "/Volumes/nihomics/data/ngs/Pike/rnaseq/tophat/SRR _tophat_out/SRR _
[9] "/Volumes/nihomics/data/ngs/Pike/rnaseq/tophat/SRR _tophat_out/SRR _
[10] "/Volumes/nihomics/data/ngs/Pike/rnaseq/tophat/SRR _tophat_out/SRR _
[11] "/Volumes/nihomics/data/ngs/Pike/rnaseq/tophat/SRR _tophat_out/SRR _
[12] "/Volumes/nihomics/data/ngs/Pike/rnaseq/tophat/SRR _tophat_out/SRR _

2.2 Parallel Processing

Now that we have the location of the BAM files, we are ready to start processing them further. A huge advantage of using Rice for this type of analysis is that each node has 20 cores: twenty processors that share memory (64 GB) but can work on independent tasks. One advantage of Bioconductor is that it provides a package, BiocParallel, that facilitates the use of multiple cores for parallel processing. However, you do need to download this package and configure the options for your specific setup. Below, we are telling Bioconductor that it can have 18 cores and assign tasks as it sees fit. How you can use BiocParallel will depend on your computer. MulticoreParam works well on UNIX-like operating systems. It will not work on Windows machines.

# set up for parallel processing
library(BiocParallel)
library(parallel)
registered()
$MulticoreParam
class: MulticoreParam
bpjobname:BPJOB; bpworkers:6; bptasks:0; bptimeout:Inf; bpRNGseed:; bpisup:FALSE
bplog:FALSE; bpthreshold:INFO; bplogdir:NA
bpstopOnError:FALSE; bpprogressbar:FALSE
bpresultdir:NA
cluster type: FORK

$SnowParam
class: SnowParam
bpjobname:BPJOB; bpworkers:6; bptasks:0; bptimeout:Inf; bpRNGseed:; bpisup:FALSE
bplog:FALSE; bpthreshold:INFO; bplogdir:NA
bpstopOnError:FALSE; bpprogressbar:FALSE
bpresultdir:NA
cluster type: SOCK
$SerialParam
class: SerialParam
bplog:FALSE; bpthreshold:INFO
bpcatchErrors:FALSE
detectCores()
[1] 8
#bpparam <- MulticoreParam(workers=18, tasks=0) #parameters for Rice
bpparam <- MulticoreParam(workers=6, tasks=0) #parameters for Pete's MacBook
register(bpparam)
registered()
$MulticoreParam
class: MulticoreParam
bpjobname:BPJOB; bpworkers:6; bptasks:0; bptimeout:Inf; bpRNGseed:; bpisup:FALSE
bplog:FALSE; bpthreshold:INFO; bplogdir:NA
bpstopOnError:FALSE; bpprogressbar:FALSE
bpresultdir:NA
cluster type: FORK
$SnowParam
class: SnowParam
bpjobname:BPJOB; bpworkers:6; bptasks:0; bptimeout:Inf; bpRNGseed:; bpisup:FALSE
bplog:FALSE; bpthreshold:INFO; bplogdir:NA
bpstopOnError:FALSE; bpprogressbar:FALSE
bpresultdir:NA
cluster type: SOCK
$SerialParam

class: SerialParam
bplog:FALSE; bpthreshold:INFO
bpcatchErrors:FALSE

Below is an example of the bplapply function, a parallel version of lapply that can send tasks to multiple processors. DO NOT UNCOMMENT THIS LINE AND RUN IT. The BAM files are already indexed. This is only an example. However, you do need to get the location of the BAM index files.

# index BAM files so that they can be processed more easily and read by IGV and other applications. This has already been done
#system.time(bplapply(bam.files, indexBam))
# get the full names for the BAM index files
bam.index <- list.files(file.path(data.dir, "rnaseq", "tophat"), pattern="bam.bai$", rec
bam.index

2.3 BAM Header Metadata

Now, we can start to look at the data in a BAM file. The first thing that you should examine is the BAM header. It contains information on how the reads were aligned. In particular, it should give you the names of the chromosomes to which the reads were aligned, and information about the aligner.

# get the seqnames for the references used in the alignments
bam.header <- scanBamHeader(bam.files[1])
length(bam.header)
[1] 1
bam.header <- bam.header[[1]]
names(bam.header)
[1] "targets" "text"
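To make the two header elements less abstract, here is a minimal base R sketch of how reference names and lengths can be pulled out of SAM-style header text. The header lines, the sequence names (chrA, chrB) and the helper get_tag are made up for illustration; this is not how Rsamtools does it internally.

```r
# Hypothetical SAM-style header lines; names and lengths are invented for illustration.
header_lines <- c(
  "@HD\tVN:1.0\tSO:coordinate",
  "@SQ\tSN:chrA\tLN:1000",
  "@SQ\tSN:chrB\tLN:250",
  "@PG\tID:aligner\tVN:0.0"
)

# Keep only the @SQ (reference sequence) records and split them into tab-separated fields.
sq_fields <- strsplit(header_lines[grepl("^@SQ", header_lines)], "\t", fixed = TRUE)

# Hypothetical helper: return the value of one tag (e.g. "SN" or "LN") from a record.
get_tag <- function(fields, tag) {
  hit <- fields[startsWith(fields, paste0(tag, ":"))]
  sub(paste0("^", tag, ":"), "", hit)
}

# Named integer vector of sequence lengths, the same shape as bam.header$targets.
targets <- setNames(
  as.integer(vapply(sq_fields, get_tag, character(1), tag = "LN")),
  vapply(sq_fields, get_tag, character(1), tag = "SN")
)
targets
```

The names of this vector play the role of bam.chrs later on: they are the sequence names that every annotation object has to match.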

bam.header$targets
[named integer vector of sequence lengths for the 66 reference sequences: the chromosomes, X, Y, MT, and the unplaced GL and JH contigs]
bam.chrs <- names(bam.header$targets)
head(bam.header$text)
$`@HD`
[1] "VN:1.0" "SO:coordinate"
$`@SQ`
[1] "SN:1" "LN: "
$`@SQ`
[1] "SN:10" "LN: "
$`@SQ`
[1] "SN:11" "LN: "

$`@SQ`
[1] "SN:12" "LN: "
$`@SQ`
[1] "SN:13" "LN: "
tail(bam.header$text)
$`@SQ`
[1] "SN:JH " "LN:158099"
$`@SQ`
[1] "SN:JH " "LN:114452"
$`@SQ`
[1] "SN:MT" "LN:16299"
$`@SQ`
[1] "SN:X" "LN: "
$`@SQ`
[1] "SN:Y" "LN: "
$`@PG`
[1] "ID:TopHat"
[2] "VN:2.1.0"
[3] "CL:/group/bioinfo/apps/apps/tophat-2.1.0/tophat -p 10 --library-type fr-secondst

2.4 Gene Annotation

Now that we have information about the reference sequence, we can generate a suitable object that represents the genes in the reference genome. There are many ways to do this, but we are going to use precompiled Bioconductor databases. It is not a simple matter of finding the transcription start site and transcription termination site of each gene and defining that interval. Why? Because a gene is transcribed as exons separated by introns, and only exonic sequence ends up in mature mRNA. What we need to do is retrieve the coordinates for the gene exons, grouping the exons together by gene. Further, we want to include all possible

exonic sequences. Bioconductor has many functions that help you to achieve such tasks. Below, we will use exonsBy to get all exons for mouse genes.

library(Mus.musculus, verbose=TRUE)
Warning in library(Mus.musculus, verbose = TRUE): package Mus.musculus already present in search()
mouse.genes0 <- exonsBy(Mus.musculus, by="gene")
mouse.genes0
GRangesList object of length 24028:
$
GRanges object with 7 ranges and 2 metadata columns:
seqnames ranges strand exon_id exon_name
<Rle> <IRanges> <Rle> <integer> <character>
[1] chr9 [ , ] <NA>
[2] chr9 [ , ] <NA>
[3] chr9 [ , ] <NA>
[4] chr9 [ , ] <NA>
[5] chr9 [ , ] <NA>
[6] chr9 [ , ] <NA>
[7] chr9 [ , ] <NA>
$
GRanges object with 6 ranges and 2 metadata columns:
seqnames ranges strand exon_id exon_name
[1] chr7 [ , ] <NA>
[2] chr7 [ , ] <NA>
[3] chr7 [ , ] <NA>
[4] chr7 [ , ] <NA>
[5] chr7 [ , ] <NA>
[6] chr7 [ , ] <NA>
$
GRanges object with 1 range and 2 metadata columns:
seqnames ranges strand exon_id exon_name
[1] chr10 [ , ] <NA>
...

<24025 more elements>
seqinfo: 66 sequences (1 circular) from mm10 genome
# exons are for each transcript but grouped by gene so some exons represented multiple times
mouse.genes0[218]
GRangesList object of length 1:
$
GRanges object with 22 ranges and 2 metadata columns:
seqnames ranges strand exon_id exon_name
<Rle> <IRanges> <Rle> <integer> <character>
[1] chr12 [ , ] <NA>
[2] chr12 [ , ] <NA>
[3] chr12 [ , ] <NA>
[4] chr12 [ , ] <NA>
[5] chr12 [ , ] <NA>
...
[18] chr12 [ , ] <NA>
[19] chr12 [ , ] <NA>
[20] chr12 [ , ] <NA>
[21] chr12 [ , ] <NA>
[22] chr12 [ , ] <NA>
seqinfo: 66 sequences (1 circular) from mm10 genome

Note that this mouse gene has 22 exons, but many of these are overlapping or identical. That is because all exons for all transcripts are represented for this gene. If you count reads over this object, you could potentially double or triple your counts because you would get gene counts for every transcript. In fact, Bioconductor is pretty smart and this does not happen, but we are going to perform a reduce on these exons anyway. This function combines overlapping or redundant intervals into a single non-redundant set.

mouse.genes <- reduce(mouse.genes0)
mouse.genes[218]
GRangesList object of length 1:

$
GRanges object with 16 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr12 [ , ] +
[2] chr12 [ , ] +
[3] chr12 [ , ] +
[4] chr12 [ , ] +
[5] chr12 [ , ] +
...
[12] chr12 [ , ] +
[13] chr12 [ , ] +
[14] chr12 [ , ] +
[15] chr12 [ , ] +
[16] chr12 [ , ] +
seqinfo: 66 sequences (1 circular) from mm10 genome

Now, let's compare the chromosome names (seqnames) that are used in our mouse.genes object with the chromosome names that are used in the BAM files. These are often incompatible, and you will not get any counts for your genes if Bioconductor cannot match up the seqnames! We have to do some strange data conversions here because some of the data is stored as run-length encoded (Rle) vectors.

seqlevels(mouse.genes)
[1] "chr1" "chr2" "chr3"
[4] "chr4" "chr5" "chr6"
[7] "chr7" "chr8" "chr9"
[10] "chr10" "chr11" "chr12"
[13] "chr13" "chr14" "chr15"
[16] "chr16" "chr17" "chr18"
[19] "chr19" "chrX" "chrY"
[22] "chrM" "chr1_GL456210_random" "chr1_GL456211_random"
[25] "chr1_GL456212_random" "chr1_GL456213_random" "chr1_GL456221_random"
[28] "chr4_GL456216_random" "chr4_GL456350_random" "chr4_JH584292_random"
[31] "chr4_JH584293_random" "chr4_JH584294_random" "chr4_JH584295_random"

[34] "chr5_GL456354_random" "chr5_JH584296_random" "chr5_JH584297_random"
[37] "chr5_JH584298_random" "chr5_JH584299_random" "chr7_GL456219_random"
[40] "chrX_GL456233_random" "chrY_JH584300_random" "chrY_JH584301_random"
[43] "chrY_JH584302_random" "chrY_JH584303_random" "chrUn_GL456239"
[46] "chrUn_GL456359" "chrUn_GL456360" "chrUn_GL456366"
[49] "chrUn_GL456367" "chrUn_GL456368" "chrUn_GL456370"
[52] "chrUn_GL456372" "chrUn_GL456378" "chrUn_GL456379"
[55] "chrUn_GL456381" "chrUn_GL456382" "chrUn_GL456383"
[58] "chrUn_GL456385" "chrUn_GL456387" "chrUn_GL456389"
[61] "chrUn_GL456390" "chrUn_GL456392" "chrUn_GL456393"
[64] "chrUn_GL456394" "chrUn_GL456396" "chrUn_JH584304"
# not the same style as BAM files
table(seqlevels(mouse.genes) %in% bam.chrs)
FALSE
66
# change the style
seqlevelsStyle(mouse.genes) <- "NCBI"
table(seqlevels(mouse.genes) %in% bam.chrs)
FALSE TRUE
# still problems with some (Bioconductor error?)
# determine how many annotated genes are on each sequence
seqnames.genes <- unlist(runValue(seqnames(mouse.genes)))
table(seqnames.genes)
seqnames.genes

[table of gene counts per sequence: the chromosomes, X, Y and MT, followed by the unplaced contigs, which still carry UCSC-style names such as chr1_GL456210_random and chrUn_JH584304]
# a few genes on unassembled contigs so will adjust seqnames to match bam.chrs
temp.seqlevels <- seqlevels(mouse.genes)

summary(bam.chrs %in% temp.seqlevels)
Mode FALSE TRUE NA's
logical
temp.seqlevels[23:44] <- substr(temp.seqlevels[23:44], 6, 13)
temp.seqlevels[45:66] <- substr(temp.seqlevels[45:66], 7, 14)
temp.seqlevels[23:66] <- paste(temp.seqlevels[23:66], ".1", sep="")
summary(bam.chrs %in% temp.seqlevels)
Mode TRUE NA's
logical 66 0
seqlevels(mouse.genes) <- temp.seqlevels
seqnames.genes <- unlist(runValue(seqnames(mouse.genes)))
table(seqnames.genes)
seqnames.genes
[table of gene counts per sequence, now with the contigs under their GL and JH accession names]
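The substr calls above rely on the accession always sitting at the same character positions. A small base R sketch shows what they do on two UCSC-style contig names of the kind listed earlier, next to a pattern-based alternative. The pattern-based version is an assumption added here for comparison, not the method used in this vignette.

```r
# Two UCSC mm10 contig names of the kind returned by seqlevels(mouse.genes).
ucsc <- c("chr1_GL456210_random", "chrUn_JH584304")

# Position-based extraction, as in the vignette: characters 6-13 for
# "chrN_..._random" names and 7-14 for "chrUn_..." names.
by_position <- c(substr(ucsc[1], 6, 13), substr(ucsc[2], 7, 14))

# Pattern-based alternative (an assumption, not the vignette's method): grab the
# GL/JH accession wherever it sits in the name.
by_pattern <- regmatches(ucsc, regexpr("(GL|JH)[0-9]+", ucsc))

# Append the ".1" version suffix so the names match those in the BAM header.
ncbi <- paste0(by_pattern, ".1")
ncbi
```

Both approaches give the same result here. The pattern form does not depend on the length of the chromosome prefix, which may matter for genomes that have contigs assigned to two-digit chromosomes.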

Now, the seqnames of mouse.genes match the seqnames in our BAM files.

2.5 Genome Annotation

Next, we are going to make an object that represents the chromosomes in the mouse genome. This object is required if you want to access the reads in the BAM file one chromosome at a time. This is the best way to work with BAM files if you have limited memory on your computer, e.g. 16 GB or less on a laptop. Again, we need to make sure that the seqnames match!

library(BSgenome.Mmusculus.UCSC.mm10)
mouse.bs <- BSgenome.Mmusculus.UCSC.mm10
mouse.bs
Mouse genome:
# organism: Mus musculus (Mouse)
# provider: UCSC
# provider version: mm10
# release date: Dec
# release name: Genome Reference Consortium GRCm38
# 66 sequences:
# chr1 chr2 chr3
# chr4 chr5 chr6
# chr7 chr8 chr9
# chr10 chr11 chr12
# chr13 chr14 chr15
# ...
# (use 'seqnames()' to see all the sequence names, use the '$' or '[['
# operator to access a given sequence)
seqlevelsStyle(mouse.bs) <- "NCBI"
seqlevels(mouse.bs)

[1] "1" "2" "3"
[4] "4" "5" "6"
[7] "7" "8" "9"
[10] "10" "11" "12"
[13] "13" "14" "15"
[16] "16" "17" "18"
[19] "19" "X" "Y"
[22] "MT" "chr1_GL456210_random" "chr1_GL456211_random"
[25] "chr1_GL456212_random" "chr1_GL456213_random" "chr1_GL456221_random"
[28] "chr4_GL456216_random" "chr4_GL456350_random" "chr4_JH584292_random"
[31] "chr4_JH584293_random" "chr4_JH584294_random" "chr4_JH584295_random"
[34] "chr5_GL456354_random" "chr5_JH584296_random" "chr5_JH584297_random"
[37] "chr5_JH584298_random" "chr5_JH584299_random" "chr7_GL456219_random"
[40] "chrX_GL456233_random" "chrY_JH584300_random" "chrY_JH584301_random"
[43] "chrY_JH584302_random" "chrY_JH584303_random" "chrUn_GL456239"
[46] "chrUn_GL456359" "chrUn_GL456360" "chrUn_GL456366"
[49] "chrUn_GL456367" "chrUn_GL456368" "chrUn_GL456370"
[52] "chrUn_GL456372" "chrUn_GL456378" "chrUn_GL456379"
[55] "chrUn_GL456381" "chrUn_GL456382" "chrUn_GL456383"
[58] "chrUn_GL456385" "chrUn_GL456387" "chrUn_GL456389"
[61] "chrUn_GL456390" "chrUn_GL456392" "chrUn_GL456393"
[64] "chrUn_GL456394" "chrUn_GL456396" "chrUn_JH584304"
# same problem with seqnames
seqlevels(mouse.bs) <- temp.seqlevels
# prepare GRanges object that can be used as a parameter to access specific chromosomes and regions from BAM files.
mouse.gr <- GRanges(seqnames=Rle(as.character(seqnames(mouse.bs))), ranges=IRanges(start
mouse.gr
GRanges object with 66 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] 1 [1, ] *
[2] 2 [1, ] *
[3] 3 [1, ] *
[4] 4 [1, ] *
[5] 5 [1, ] *
...
[62] GL [1, 23629] *

[63] GL [1, 55711] *
[64] GL [1, 24323] *
[65] GL [1, 21240] *
[66] JH [1, ] *
seqinfo: 66 sequences (1 circular) from mm10 genome

3 Sequence Alignment Map Data Format

3.1 Converting BAM file to GAlignments Object

There are several Bioconductor packages that allow you to access BAM files. Here we are going to use readGAlignments to make a GAlignments object for the reads on chr19. To do this, we need to specify certain parameters. What data do you want? Which chromosomes? Do you want to filter the data on specific flags?

# needed to redefine the file paths so the PDF would compile
data.dir <- "/Volumes/nihomics/data/ngs/Pike" #path for Pete's MacBook
bam.files <- list.files(file.path(data.dir, "rnaseq", "tophat"), pattern="accepted_hits_
bam.index <- list.files(file.path(data.dir, "rnaseq", "tophat"), pattern="bam.bai$", rec
library(GenomicAlignments)
flag.param <- scanBamFlag()
what.param <- scanBamWhat()
which.param <- mouse.gr[19]
my.param <- ScanBamParam(what=what.param, which=which.param, flag=flag.param)
my.param
class: ScanBamParam
bamFlag (NA unless specified):
bamSimpleCigar: FALSE
bamReverseComplement: FALSE
bamTag:
bamTagFilter:
bamWhich: 1 ranges
bamWhat: qname, flag, rname, strand, pos, qwidth, mapq, cigar, mrnm, mpos, isize, seq, qual, groupid, mate_status
bamMapqFilter: NA

bam.ga <- readGAlignments(bam.files[1], index=bam.index[1], param=my.param)
bam.ga
GAlignments object with alignments and 13 metadata columns:
[printed alignments: the core columns seqnames, strand, cigar, qwidth, start, end, width and njunc, followed by the metadata columns qname, flag, rname, strand, pos, qwidth, mapq, cigar, mrnm, mpos, isize, seq and qual]
seqinfo: 66 sequences from an unspecified genome

You should see that some of the data is repeated. The reason is that some essential data is always imported, so we don't need to pass that in our what.param.

what.param <- c("qname", "flag", "mapq", "seq", "qual")
my.param <- ScanBamParam(what=what.param, which=which.param, flag=flag.param)
bam.ga <- readGAlignments(bam.files[1], index=bam.index[1], param=my.param)
bam.ga
GAlignments object with alignments and 5 metadata columns:
[printed alignments: the same core columns, followed by the metadata columns qname, flag, mapq, seq and qual; the mapq values shown are 0, 1 and 50]

seqinfo: 66 sequences from an unspecified genome

3.2 Essential Alignment Data

There are certain fields in the BAM file that are always imported. Let's go through these one by one.

seqnames(bam.ga)
factor-Rle of length with 1 run
Lengths:
Values : 19
Levels(66): JH JH JH MT X Y
strand(bam.ga)
factor-Rle of length with runs
Lengths:
Values :
Levels(3): + - *
cigar.tbl <- sort(table(cigar(bam.ga)), decreasing=TRUE)
head(cigar.tbl, 20)
101M 26M623N75M 2M1413N99M 37M109N64M M623N67M
46M2212N55M 33M623N68M 84M781N17M M974N1M 37M634N64M
68M1413N33M 69M974N32M M634N80M1630N9M 97M1D4M 96M1I4M
61M781N40M M170N43M 63M3195N38M 99M781N2M 5M92N96M
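The CIGAR strings in this table determine three of the other essential fields: qwidth, width and njunc. A minimal base R sketch makes the relationship explicit. The helpers parse_cigar and cigar_summary are hypothetical names written for this illustration, and the rules for which operations consume read or reference bases follow the SAM specification; Bioconductor has its own, far more efficient CIGAR utilities.

```r
# Hypothetical helper: split a CIGAR string into operation lengths and codes.
parse_cigar <- function(cigar) {
  lens <- as.integer(regmatches(cigar, gregexpr("[0-9]+", cigar))[[1]])
  ops  <- regmatches(cigar, gregexpr("[MIDNSHP=X]", cigar))[[1]]
  data.frame(op = ops, len = lens, stringsAsFactors = FALSE)
}

# Hypothetical helper: summarise one CIGAR string in terms of the GAlignments fields.
cigar_summary <- function(cigar) {
  x <- parse_cigar(cigar)
  c(qwidth = sum(x$len[x$op %in% c("M", "I", "S", "=", "X")]),  # bases in the read
    width  = sum(x$len[x$op %in% c("M", "D", "N", "=", "X")]),  # span on the reference
    njunc  = sum(x$op == "N"))                                   # skipped regions (introns)
}

cigar_summary("101M")        # qwidth 101, width 101, njunc 0
cigar_summary("26M623N75M")  # qwidth 101, width 724, njunc 1
cigar_summary("97M1D4M")     # qwidth 101, width 102, njunc 0
```

So a 101-base read that spans a 623-base intron covers 724 bases of the reference, which is why width can be much larger than qwidth.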

range(qwidth(bam.ga))
[1]
range(width(bam.ga))
[1]
njunc.tbl <- table(njunc(bam.ga))
njunc.tbl

3.3 Additional Data

Some data is only optionally reported, but these fields contain important information about the reads. The FLAG field contains information on how the reads aligned, especially how the two reads of a pair in a paired-end run aligned with respect to each other. The MAPQ field tells you how good or bad the alignment is.

head(mcols(bam.ga))
DataFrame with 6 rows and 5 columns
[six rows with the columns qname, flag, mapq, seq and qual]
flag.tbl <- table(mcols(bam.ga)$flag)
flag.tbl
flag.mat <- bamFlagAsBitMatrix(as.integer(names(flag.tbl)))
flag.mat <- cbind(as.integer(names(flag.tbl)), flag.mat)
flag.mat
[matrix with four rows, one per observed flag value, and the columns isPaired, isProperPair, isUnmappedQuery, hasUnmappedMate, isMinusStrand, isMateMinusStrand, isFirstMateRead, isSecondMateRead, isSecondaryAlignment, isNotPassingQualityControls and isDuplicate]
dups <- duplicated(mcols(bam.ga)$qname) | duplicated(mcols(bam.ga)$qname, fromLast=TRUE)
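The FLAG is a single integer in which each bit answers one yes/no question, and the bit matrix above simply unpacks it. The base R sketch below does the same unpacking by hand. The function decode_flags is a hypothetical helper, the column names are borrowed from the matrix above, and the four flag values are assumed for illustration (they are typical of single-end alignments) rather than read from our BAM file.

```r
# Names of the first 11 FLAG bits, lowest bit first. The bit meanings come from
# the SAM specification; the labels follow the bit matrix shown above.
flag_bits <- c("isPaired", "isProperPair", "isUnmappedQuery", "hasUnmappedMate",
               "isMinusStrand", "isMateMinusStrand", "isFirstMateRead",
               "isSecondMateRead", "isSecondaryAlignment",
               "isNotPassingQualityControls", "isDuplicate")

# Hypothetical helper: 0/1 matrix with one row per flag value, one column per bit.
decode_flags <- function(flags) {
  bits <- sapply(seq_along(flag_bits) - 1L,
                 function(k) as.integer((flags %/% 2^k) %% 2))
  matrix(bits, nrow = length(flags),
         dimnames = list(as.character(flags), flag_bits))
}

# Example flag values, assumed for illustration.
decode_flags(c(0L, 16L, 256L, 272L))
```

Under these assumptions, 16 marks a read aligned to the minus strand, 256 marks a secondary alignment, and 272 (256 + 16) marks a secondary alignment on the minus strand.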

dup.ga <- bam.ga[dups]
dup.ga <- dup.ga[order(mcols(dup.ga)$qname)]
dup.ga
GAlignments object with 383485 alignments and 5 metadata columns:
[printed alignments, rows 1 to 5 and 383481 to 383485, with the core columns and the metadata columns qname, flag, mapq, seq and qual; rows 1 to 3 carry identical seq and qual strings, as do rows 4 and 5, and the mapq values shown are 0 and 1]
seqinfo: 66 sequences from an unspecified genome
table(mcols(bam.ga)$flag, dups)
dups
FALSE TRUE

mapq.tbl <- table(mcols(bam.ga)$mapq)
mapq.tbl
table(mcols(bam.ga)$mapq, dups)
dups
FALSE TRUE
table(mcols(bam.ga)$mapq, mcols(bam.ga)$flag, dups)
, , dups = FALSE
, , dups = TRUE
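The tables above split the reads by whether their name occurs more than once, which presumably means the read was reported at several locations. One detail of base R is easy to miss here: duplicated() on its own does not flag the first copy of a repeated value. The sketch below uses made-up read names to show the difference.

```r
# Hypothetical read names: "r2" is reported three times, "r4" twice.
qname <- c("r1", "r2", "r2", "r3", "r2", "r4", "r4")

# duplicated() alone skips the first copy of each repeated name ...
later_copies <- duplicated(qname)

# ... so combine a forward and a backward pass to flag every copy.
all_copies <- duplicated(qname) | duplicated(qname, fromLast = TRUE)

sum(later_copies)  # 3
sum(all_copies)    # 5
```

If you want to separate uniquely aligned reads from all others, the second form is the one to use.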

3.4 Sequence Data

The sequence data includes both the actual sequence of the read and the quality score for each base in PHRED format.

mcols(bam.ga)$seq
A DNAStringSet instance of length
width seq
[1] 101 CCTAGTATATCTGGAGAGTTAAGATGGG...CACAAATATTTCCACGCTTTTTCACTAC
[2] 101 CCTAGTATATCTGGAGAGTTAAGATGGG...CACAAATATTTCCACGCTTTTTCACTAC
[3] 101 AGGGGGAGATGTGAGGAGCCGCCCTTGC...AAACAAGTAGTCTGCGCATGTGCTGGGG
[4] 101 AGGGGGAGATGTGAGGAGCCGCCCTTGC...AAACAAGTAGTCTGCGCATGTGCTGGGG
[5] 101 CTATGGCCTTGGGCATCAAGATTTAAAA...AGGGTAGATTCCCCCTTTTTGTTTAATT
...
[ ] 101 GTTAGGGTTAGGGTTAGGGTTAGGGTTA...GTAGGGTTAGGGGTAGGGTTAGGGTTAG
[ ] 101 GTTAGGGTTAGGGTTAGGGTTAGGGTTA...TTAGGGTTAGGGTTAGGGTTAGGGTTAG
[ ] 101 GTTAGGGTTAGGGTTAGGGTTAGGGTTA...TTAGGGTTAGGGTTAGGGTTAGGGTTAG
[ ] 101 TAGGGTTAGGGTTAGGGTTAGGGTTAGG...AGGGTTAGGGTTAGGGTTAGGGTTAGAT
[ ] 101 TAGGGTTAGGGTTAGGGTTAGGGTTAGG...AGGGTTAGGGTTAGGGTTAGGGTTAGAT
ecor1 <- vcountPattern("GAATTC", mcols(bam.ga)$seq)
table(ecor1)
ecor1
mcols(bam.ga)$seq[ecor1 > 0]
A DNAStringSet instance of length 109820
width seq
[1] 101 ATATAGTGGATTACTTTGATGGATTTCCA...ATGATTGTTTTGATGTGTTCTTGAATTC
[2] 101 ATATAGTGGATTACTTTGATGGATTTCCA...ATGATTGTTTTGATGTGTTCTTGAATTC
[3] 101 CAAAATCCAACACCTATTCATGATGAAAG...AAAGCAATATACAGCAAGCCAGTAGCCA
[4] 101 TGGCTACTGGTTTGCTGTAGATTGCTTTT...TTTTATCATGAATGGGTGTTGGATCTTG
[5] 101 CATCTTTGCCATGATATTTTTTGCTTTAG...CCCAAATGCTGCATAATATCCCTTCCCC
...
[109816] 101 CACATGATCATCTCGTTAGATGCAGAAAA...AAAGATCAGGAATTCAAGGCCCATACCT
[109817] 101 TTAGATGCAGAAAAAGCATTTGACAAGAT...AAGGCCCATACCTAAACATGATAAAAGC
[109818] 101 GTACACCCCTGCATGATAGAATTCTGCAG...TAAGTGATTCTTGGATCCTAACATTCTA
[109819] 101 GTACACCCCTGCATGATAGAATTCTGCAG...TAAGTGATTCTTGGATCCTAACATTCTA
[109820] 101 GTACACCCCTGCATGATAGAATTCTGCAG...TAAGTGATTCTTGGATCCTAACATTCTA
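The quality scores mentioned at the start of this section are stored as one ASCII character per base. As a minimal sketch, assuming the common offset of 33, the characters can be turned into PHRED scores and error probabilities in base R. The string below is the beginning of the quality string of the first read shown in this vignette.

```r
# First 28 quality characters of read 1, copied from the output in this vignette.
qual <- "3DCCDCCDCCCCDCCCEEDCEFFFCDA7"

# ASCII code of each character minus the assumed offset of 33 gives the PHRED score Q.
q <- utf8ToInt(qual) - 33L

# Q = -10 * log10(p), so the estimated error probability is p = 10^(-Q/10).
p <- 10^(-q / 10)

head(q, 4)             # 18 35 34 34
signif(head(p, 4), 2)  # error probabilities for the first four bases
```

A score of 30, the trimming cutoff used for these reads, corresponds to an estimated error probability of 1 in 1000.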

mcols(bam.ga)$qual
A PhredQuality instance of length
width seq
[1] 101 3DCCDCCDCCCCDCCCEEDCEFFFCDA7...JIJJJJJIJJJJIHJHHHHHFFFFFC@C
[2] 101 <DCDEDCDDCCACDECCCDEEEDC@B@E...IIEGIIGIHGIIHGHDHHHGFDEDF@C@
[3] 101 CCCFFFFFHHHHHIIJJIIIIJJJIJJJ...CCCDDDDACDDDDDDDB>BBCCCDC@B9
[4] 101 CCCFFFFFHHHFHJIJIJJJJJJJJJJJ...DDACDDC>CCCDEDDDBDDDDDCDDCBB
[5] 101 CCCFFFFFHHDHHEIJIJJJIJJIHIJJ...FEDEDCD>CCEDCDDDDDDDDACDCAC:
...
as.integer(charToRaw(as.character(mcols(bam.ga)$qual[1])))
[the 101 ASCII codes of the quality characters of read 1]
as.integer(charToRaw(as.character(mcols(bam.ga)$qual[1]))) - 33
[the same 101 values minus 33, i.e. the PHRED scores]

4 Determining Digital Gene Expression

Now, we have everything we need to get our gene counts. It is important to think about how this is done. First, is your RNAseq library strand specific? If so, do you have paired-end reads? If so, which of the ends should you use for counting?

In addition, how stringent do you want to be when you determine whether or not a read actually overlaps with a gene? What if a read overlaps with two genes? In your package tab, go to GenomicAlignments and click on it. Open the user guides and select GenomicAlignments::summarizeOverlaps. Figure 1 shows the counting modes that are available.

4.1 Counting Modes on Simulated Data

The following bit of code demonstrates how your gene expression counts can vary based on your choice of counting mode.

library(Gviz)
my.gr <- GRanges(seqnames=Rle("chr1"), ranges=IRanges(start=1, end=1000), strand=Rle("*"
my.seqinfo <- Seqinfo(seqnames="chr1", seqlengths=1000, isCircular=FALSE, genome="simula
seqinfo(my.gr) <- my.seqinfo
gene1 <- GRanges(seqnames=Rle("chr1"), ranges=IRanges(start=201, end=350), strand=Rle("+
gene2 <- GRanges(seqnames=Rle(rep("chr1", 2)), ranges=IRanges(start=c(301, 601), end=c(3
gene3 <- GRanges(seqnames=Rle("chr1"), ranges=IRanges(start=451, end=550), strand=Rle("
gene4 <- GRanges(seqnames=Rle(rep("chr1", 2)), ranges=IRanges(start=c(101, 301), end=c(2
my.genes <- GRangesList(gene1=gene1, gene2=gene2, gene3=gene3, gene4=gene4)
my.genes.df <- as.data.frame(my.genes)
colnames(my.genes.df)[3] <- "chromosome"
gene.track <- GeneRegionTrack(my.genes.df, fill="green4", arrowHeadWidth=50, arrowHeadMax
gene1.reads <- GRanges(seqnames=Rle(rep("chr1", 20)), ranges=IRanges(start=sample(200:32
gene2.reads <- GRanges(seqnames=Rle(rep("chr1", 60)), ranges=IRanges(start=sample(c(300:
gene3.reads <- GRanges(seqnames=Rle(rep("chr1", 10)), ranges=IRanges(start=sample(450:52
gene4.reads <- GRanges(seqnames=Rle(rep("chr1", 20)), ranges=IRanges(start=sample(c(100:
all.reads <- c(gene1.reads, gene2.reads, gene3.reads, gene4.reads)
read.track <- AnnotationTrack(all.reads, fill=c("red", "blue")[as.integer(strand(all.rea
ax.track <- GenomeAxisTrack(GRanges(seqnames="chr1", ranges=IRanges(1, 1000)), genome="s
plotTracks(list(gene.track, read.track, ax.track), from=1, to=700, grid=1, sizes=c(4, 4,
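The difference between counting modes can also be sketched in plain base R. The block below is a simplified illustration, not the GenomicAlignments implementation: it handles only single-range, unstranded features, covers only the "Union" and "IntersectionStrict" ideas, and the features, reads, and the function `count_by_mode` are hypothetical.

```r
# Hypothetical single-range features that overlap each other from 301 to 350.
genes <- data.frame(
  name  = c("geneA", "geneB"),
  start = c(201, 301),
  end   = c(350, 400),
  stringsAsFactors = FALSE
)

# Hypothetical reads, unstranded for simplicity.
reads <- data.frame(
  start = c(210, 290, 310, 360, 390),
  end   = c(260, 340, 345, 399, 440)
)

count_by_mode <- function(genes, reads, mode = c("Union", "IntersectionStrict")) {
  mode <- match.arg(mode)
  counts <- setNames(integer(nrow(genes)), genes$name)
  for (i in seq_len(nrow(reads))) {
    if (mode == "Union") {
      # any shared base counts as a hit
      hit <- reads$start[i] <= genes$end & reads$end[i] >= genes$start
    } else {
      # the read must lie entirely inside the feature
      hit <- reads$start[i] >= genes$start & reads$end[i] <= genes$end
    }
    # reads with no hit, or with more than one hit (ambiguous), are dropped
    if (sum(hit) == 1) {
      counts[hit] <- counts[hit] + 1L
    }
  }
  counts
}

union.counts  <- count_by_mode(genes, reads, "Union")
strict.counts <- count_by_mode(genes, reads, "IntersectionStrict")
```

With these toy reads, the read spanning 290 to 340 touches both genes and is dropped under "Union", but it lies completely inside geneA only and is therefore counted under "IntersectionStrict". The read spanning 390 to 440 shows the opposite behavior for geneB.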

[Figure: GeneRegionTrack showing gene1, gene2, gene3 and gene4 above the "Demo Reads" track]

count.mat <- matrix(0, nrow=4, ncol=5)
colnames(count.mat) <- c("actual", "countOverlaps", "Union", "IntersectionStrict", "Inte
rownames(count.mat) <- c("gene1", "gene2", "gene3", "gene4")
count.mat[, 1] <- c(20, 60, 10, 20)
count.mat[, 2] <- countOverlaps(my.genes, all.reads)
count.mat[, 3] <- assays(summarizeOverlaps(my.genes, all.reads, "Union", ignore.strand=F
count.mat[, 4] <- assays(summarizeOverlaps(my.genes, all.reads, "IntersectionStrict", ig
count.mat[, 5] <- assays(summarizeOverlaps(my.genes, all.reads, "IntersectionNotEmpty",
count.mat

      actual countOverlaps Union IntersectionStrict IntersectionNotEmpty

4.2 Clean-Up Environment to Free Memory

We've accumulated a bunch of objects in our workspace. Let's carefully remove what we don't need.

ls()

 [1] "all.reads"      "ax.track"       "bam.chrs"       "bam.files"
 [5] "bam.ga"         "bam.header"     "bam.index"      "bpparam"
 [9] "cigar.tbl"      "count.mat"      "data.dir"       "dup.ga"
[13] "dups"           "ecor1"          "flag.mat"       "flag.param"
[17] "flag.tbl"       "gene.track"     "gene1"          "gene1.reads"
[21] "gene2"          "gene2.reads"    "gene3"          "gene3.reads"
[25] "gene4"          "gene4.reads"    "mapq.tbl"       "mouse.bs"
[29] "mouse.genes"    "mouse.genes0"   "mouse.gr"       "my.genes"
[33] "my.genes.df"    "my.gr"          "my.param"       "my.seqinfo"
[37] "njunc.tbl"      "read.track"     "seqnames.genes" "temp.seqlevels"
[41] "what.param"     "which.param"

my.objs <- list(ls())
my.objs[[1]] <- setdiff(my.objs[[1]], c("data.dir", "bam.files", "bam.index", "mouse.gen
rm(list=my.objs[[1]])
rm(my.objs)
ls()

[1] "bam.files"   "bam.index"   "data.dir"    "mouse.genes" "mouse.gr"

gc()
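The keep-list pattern used for the clean-up can be tried safely in a scratch environment first. This is a minimal sketch, assuming made-up object values; `demo.env`, `keep`, and `drop` are hypothetical names, and a plain character vector is used instead of wrapping `ls()` in a list.

```r
# Work in a scratch environment so the demonstration leaves the real workspace alone.
demo.env <- new.env()
assign("data.dir", "/path/to/data", envir = demo.env)       # hypothetical value
assign("bam.files", c("a.bam", "b.bam"), envir = demo.env)  # hypothetical value
assign("count.mat", matrix(0, 2, 2), envir = demo.env)
assign("my.param", list(), envir = demo.env)

# Names we want to keep; everything else in the environment is removed.
keep <- c("data.dir", "bam.files")
drop <- setdiff(ls(demo.env), keep)
rm(list = drop, envir = demo.env)

remaining <- ls(demo.env)

# Ask R to return freed memory; the printed summary is suppressed here.
invisible(gc())
```

Checking `drop` before calling `rm()` is a cheap way to avoid deleting something you still need.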

4.3 Method for Limited Memory and Single Core

It is very likely that you will have limited computer resources. The following chunk of code works through a list of BAM files, one chromosome at a time, and generates your gene count matrix. I have used this routinely on my MacBook Pro (16 GB memory).

#sample.names <- substring(bam.files, 45, 54)
#count.mat <- matrix(0, nrow=length(mouse.genes), ncol=length(sample.names))
#rownames(count.mat) <- names(mouse.genes)
#colnames(count.mat) <- sample.names
# one file at a time, one computer core at a time
# mapq.cutoff <- 50
# for(j in 1:length(bam.files)){
#   for(i in 1:length(seqlevels(mouse.genes))){
#     chr.param <- ScanBamParam(which=mouse.gr[i], mapqFilter=mapq.cutoff)
#     bam.ga <- readGAlignments(bam.files[j], index=bam.index[j], param=chr.param)
#     my.counts <- summarizeOverlaps(features=mouse.genes, reads=bam.ga, mode="Union", ign
#     count.mat[, j] <- count.mat[, j] + assays(my.counts)$counts
#   }
# }
# write.table(count.mat, file=paste(sample.name, "counts.txt", sep="_"), row.names=TRUE,

4.4 Parallel Processing of BAM Files on Multiple Cores

The next chunk uses the parameters we specified to BiocParallel to process all 12 BAM files (we have 18 workers). This should take less than 10 minutes.

system.time(my.se <- summarizeOverlaps(mouse.genes, bam.files, mode="IntersectionNotEmpt

summarizeOverlaps will check for parallel processing parameters. We have 12 files and 18 workers, so all 12 files will be processed in parallel.

registered()

It took nine minutes to process all files. It would have taken ~110 minutes with the for loop.

class(my.se)
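Looking back at the loop in Section 4.3, the key idea is to hold only one chromosome of one file in memory at a time and to add each partial result into a running matrix. The base-R sketch below shows just that accumulation pattern. The function `count_one_chunk` is a hypothetical stand-in that returns fake, deterministic counts; in the real workflow that step would read alignments and count overlaps.

```r
# Hypothetical stand-in for "read one chromosome of one BAM file and count".
# Returns one (fake) count per gene as a named vector.
count_one_chunk <- function(file, chr, gene.names) {
  n <- nchar(file) + nchar(chr)  # placeholder arithmetic, not real counting
  setNames(rep(n, length(gene.names)), gene.names)
}

files      <- c("s1.bam", "s10.bam")     # hypothetical file names
chrs       <- c("chr1", "chr2", "chrX")
gene.names <- c("geneA", "geneB")

# One row per gene, one column per file, initialised to zero.
count.mat <- matrix(0, nrow = length(gene.names), ncol = length(files),
                    dimnames = list(gene.names, files))

for (j in seq_along(files)) {
  for (i in seq_along(chrs)) {
    chunk <- count_one_chunk(files[j], chrs[i], gene.names)
    # index by row name so the partial counts line up with the matrix rows
    count.mat[, j] <- count.mat[, j] + chunk[rownames(count.mat)]
  }
}
```

Indexing the partial result by row name guards against a chunk that returns genes in a different order than the matrix.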

dim(assays(my.se)$counts)
colnames(assays(my.se)$counts)
save(my.se, file="pikesummarizedexp.rdata")

Now we are finished with the Big Data part of RNAseq analysis.

5 Differential Gene Expression Analysis with edgeR

There are several popular packages or tools for DGE analysis of RNAseq data. At one time CuffDiff was certainly the most popular, but it has fallen out of favor for various reasons. It will be interesting to see how CuffDiff2 performs. DESeq2 and edgeR are two of the best available packages for DGE, and both are native to Bioconductor. Here we will cover edgeR because it is essentially limma for RNAseq. limma and edgeR were developed by the same research group, so they have a similar philosophy and workflow. If you are interested in trying DESeq2, Bioconductor has an excellent workflow on their website that is quite easy to follow if you know a little R (and you do now!).

5.1 Formatting the Data

As with the microarray data, we need to format our raw data and our pdata. We will get the counts from the SummarizedExperiment object. If you weren't able to generate the object, there is a copy that you can load.

load("pikesummarizedexp.rdata")
my.counts <- assays(my.se)$counts
head(my.counts)

[Output: the first rows of the count matrix; the 12 columns carry BAM file names ending in _accepted_hits_sorted.bam]

colnames(my.counts) <- substr(colnames(my.counts), 1, 10)
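The renaming step relies on every run accession being exactly 10 characters long. As a small sketch, assuming made-up file names (the accessions below are placeholders, not the real runs), the fixed-width approach can be compared with stripping the known suffix, which does not depend on the accession length:

```r
# Hypothetical column names; the accessions are placeholders.
bam.names <- c("SRR0000001_accepted_hits_sorted.bam",
               "SRR0000002_accepted_hits_sorted.bam")

# Fixed-width approach, as in the vignette: keep the first 10 characters.
names.fixed <- substr(bam.names, 1, 10)

# Alternative: remove the known suffix, whatever the accession length is.
names.suffix <- sub("_accepted_hits_sorted\\.bam$", "", bam.names)
```

Both give the same result here. The suffix version is only a suggestion for cases where accession lengths vary.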

head(my.counts)

[Output: the first rows of the count matrix, with the 12 columns now named by SRR run accession]

Next, we need to get the pdata for the samples. We will use a file that can be downloaded from SRA when you browse the data. Again, we are going to do a bit of data wrangling to make suitable names.

# metadata for experiment
pdata <- read.delim(file.path(data.dir, "metadata", "Pike_SraRunTable_RNAseq.txt"), comm
head(pdata)

  BioSample_s Experiment_s MBases_l MBytes_l Run_s SRA_Sample_s
1 SAMN        SRX                            SRR   SRS
2 SAMN        SRX                            SRR   SRS
3 SAMN        SRX                            SRR   SRS
4 SAMN        SRX                            SRR   SRS
5 SAMN        SRX                            SRR   SRS
6 SAMN        SRX                            SRR   SRS

  Sample_Name_s differentiation_day_s source_name_s   Assay_Type_s
1 GSM           d3                    IDGSW3_basal_d3 RNA-Seq
2 GSM           d3                    IDGSW3_basal_d3 RNA-Seq
3 GSM           d3                    IDGSW3_basal_d3 RNA-Seq
4 GSM           d7                    IDGSW3_basal_d7 RNA-Seq
5 GSM           d7                    IDGSW3_basal_d7 RNA-Seq
6 GSM           d7                    IDGSW3_basal_d7 RNA-Seq

  AssemblyName_s BioProject_s Center_Name_s Consent_s InsertSize_l
1 <not provided> PRJNA        GEO           public    0
2 <not provided> PRJNA        GEO           public    0
3 <not provided> PRJNA        GEO           public    0
4 <not provided> PRJNA        GEO           public    0
5 <not provided> PRJNA        GEO           public    0
6 <not provided> PRJNA        GEO           public    0

  LibraryLayout_s LibrarySelection_s LibrarySource_s Library_Name_s
1 SINGLE          cDNA               TRANSCRIPTOMIC  <not provided>
2 SINGLE          cDNA               TRANSCRIPTOMIC  <not provided>
3 SINGLE          cDNA               TRANSCRIPTOMIC  <not provided>
4 SINGLE          cDNA               TRANSCRIPTOMIC  <not provided>
5 SINGLE          cDNA               TRANSCRIPTOMIC  <not provided>
6 SINGLE          cDNA               TRANSCRIPTOMIC  <not provided>

  LoadDate_s Organism_s   Platform_s ReleaseDate_s SRA_Study_s cell_line_s
1 2/7/2014   Mus musculus ILLUMINA   6/2/2014      SRP         IDG-SW3
2 2/7/2014   Mus musculus ILLUMINA   6/2/2014      SRP         IDG-SW3
3 2/7/2014   Mus musculus ILLUMINA   6/2/2014      SRP         IDG-SW3
4 2/7/2014   Mus musculus ILLUMINA   6/2/2014      SRP         IDG-SW3
5 2/7/2014   Mus musculus ILLUMINA   6/2/2014      SRP         IDG-SW3
6 2/7/2014   Mus musculus ILLUMINA   6/2/2014      SRP         IDG-SW3

  cell_type_s      g1k_analysis_group_s g1k_pop_code_s source_s
1 osteocytic cells <not provided>       <not provided> <not provided>
2 osteocytic cells <not provided>       <not provided> <not provided>
3 osteocytic cells <not provided>       <not provided> <not provided>
4 osteocytic cells <not provided>       <not provided> <not provided>
5 osteocytic cells <not provided>       <not provided> <not provided>
6 osteocytic cells <not provided>       <not provided> <not provided>

pdata <- pdata[pdata$Run_s %in% colnames(my.counts), c("Run_s", "differentiation_day_s",
pdata$source_name_s <- as.factor(pdata$source_name_s)
summary(pdata$source_name_s)

IDGSW3_125_d3 IDGSW3_125_d35 IDGSW3_vehicle_d3
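The metadata wrangling can be tried on a toy table first. The sketch below is a minimal base-R illustration with made-up run accessions; only the column names and the group labels are taken from the table above. The reordering step with `match()` is an extra check that is not part of the original code: it makes the metadata rows follow the order of the count matrix columns.

```r
# Hypothetical metadata; accessions are placeholders.
pdata <- data.frame(
  Run_s = c("SRR0000001", "SRR0000002", "SRR0000003", "SRR0000004"),
  differentiation_day_s = c("d3", "d3", "d35", "d7"),
  source_name_s = c("IDGSW3_vehicle_d3", "IDGSW3_125_d3",
                    "IDGSW3_125_d35", "IDGSW3_basal_d7"),
  stringsAsFactors = FALSE
)

# Hypothetical column names of the count matrix (a subset, in a different order).
count.cols <- c("SRR0000002", "SRR0000001", "SRR0000003")

# Keep only the runs we counted, then put rows in the same order as the columns.
pdata <- pdata[pdata$Run_s %in% count.cols, ]
pdata <- pdata[match(count.cols, pdata$Run_s), ]

# Convert the group label to a factor and tabulate samples per group.
pdata$source_name_s <- factor(pdata$source_name_s)
group.sizes <- table(pdata$source_name_s)
```

Keeping rows and columns in the same order matters later, because group labels are usually matched to count columns by position.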


Exeter Sequencing Service Exeter Sequencing Service A guide to your denovo RNA-seq results An overview Once your results are ready, you will receive an email with a password-protected link to them. Click the link to access your

More information

ChIP-seq (NGS) Data Formats

ChIP-seq (NGS) Data Formats ChIP-seq (NGS) Data Formats Biological samples Sequence reads SRA/SRF, FASTQ Quality control SAM/BAM/Pileup?? Mapping Assembly... DE Analysis Variant Detection Peak Calling...? Counts, RPKM VCF BED/narrowPeak/

More information

Bioinformatics in next generation sequencing projects

Bioinformatics in next generation sequencing projects Bioinformatics in next generation sequencing projects Rickard Sandberg Assistant Professor Department of Cell and Molecular Biology Karolinska Institutet March 2011 Once sequenced the problem becomes computational

More information

The software and data for the RNA-Seq exercise are already available on the USB system

The software and data for the RNA-Seq exercise are already available on the USB system BIT815 Notes on R analysis of RNA-seq data The software and data for the RNA-Seq exercise are already available on the USB system The notes below regarding installation of R packages and other software

More information

Ranges (and Data Integration)

Ranges (and Data Integration) Ranges (and Data Integration) Martin Morgan 1 Fred Hutchinson Cancer Research Center Seattle, WA 20 November 2013 1 mtmorgan@fhcrc.org Introduction Importance of range concepts: conceptually... Genomic

More information

de.nbi and its Galaxy interface for RNA-Seq

de.nbi and its Galaxy interface for RNA-Seq de.nbi and its Galaxy interface for RNA-Seq Jörg Fallmann Thanks to Björn Grüning (RBC-Freiburg) and Sarah Diehl (MPI-Freiburg) Institute for Bioinformatics University of Leipzig http://www.bioinf.uni-leipzig.de/

More information

fastseg An R Package for fast segmentation Günter Klambauer and Andreas Mitterecker Institute of Bioinformatics, Johannes Kepler University Linz

fastseg An R Package for fast segmentation Günter Klambauer and Andreas Mitterecker Institute of Bioinformatics, Johannes Kepler University Linz Software Manual Institute of Bioinformatics, Johannes Kepler University Linz fastseg An R Package for fast segmentation Günter Klambauer and Andreas Mitterecker Institute of Bioinformatics, Johannes Kepler

More information

CLC Server. End User USER MANUAL

CLC Server. End User USER MANUAL CLC Server End User USER MANUAL Manual for CLC Server 10.0.1 Windows, macos and Linux March 8, 2018 This software is for research purposes only. QIAGEN Aarhus Silkeborgvej 2 Prismet DK-8000 Aarhus C Denmark

More information

ChIP-Seq Tutorial on Galaxy

ChIP-Seq Tutorial on Galaxy 1 Introduction ChIP-Seq Tutorial on Galaxy 2 December 2010 (modified April 6, 2017) Rory Stark The aim of this practical is to give you some experience handling ChIP-Seq data. We will be working with data

More information

ChIP-seq hands-on practical using Galaxy

ChIP-seq hands-on practical using Galaxy ChIP-seq hands-on practical using Galaxy In this exercise we will cover some of the basic NGS analysis steps for ChIP-seq using the Galaxy framework: Quality control Mapping of reads using Bowtie2 Peak-calling

More information

Relationship Between BED and WIG Formats

Relationship Between BED and WIG Formats Relationship Between BED and WIG Formats Pete E. Pascuzzi July 2, 2015 This example will illustrate the similarities and differences between the various ways to represent ranged data in R. In bioinformatics,

More information
