Introduction to RNA-seq Analysis
Pete E. Pascuzzi
July 14, 2016

Contents
1 Intro
2 Setup
    2.1 Setup Directories
    2.2 Parallel Processing
    2.3 BAM Header Metadata
    2.4 Gene Annotation
    2.5 Genome Annotation
3 Sequence Alignment Map Data Format
    3.1 Converting BAM file to GAlignments Object
    3.2 Essential Alignment Data
    3.3 Additional Data
    3.4 Sequence Data
4 Determining Digital Gene Expression
    4.1 Counting Modes on Simulated Data
    4.2 Clean-Up Environment to Free Memory
    4.3 Method for Limited Memory and Single Core
    4.4 Parallel Processing of BAM Files on Multiple Cores
5 Differential Gene Expression Analysis with edgeR
    Formatting the Data
    Normalizing the Libraries (Counts)
6 SessionInfo
1 Intro

The following vignette is a basic RNAseq analysis of data from St. John, et al., Mol. Endocrinol. The data was deposited at NCBI GEO under the Super Series GSE. The RNAseq data is the Series GSE. We will work with only a subset of these samples, the 2 x 2 design of mouse cells, untreated or treated with vitamin D, at three days and 35 days.

At GEO, there is metadata and processed data that is not amenable to further statistical analysis. However, the raw data in the form of FASTQ files was deposited in the NCBI Short Read Archive (SRA). SRA is an entirely different system from GEO with a totally different series of accession numbers. The system can be confusing to navigate, but you should spend some time there familiarizing yourself with it. Importantly, the raw data in SRA can be downloaded and reanalyzed. However, these files can be very large, greater than 20 GB each. The raw data for this vignette is about 500 GB combined.

NCBI has developed a series of command line UNIX tools that are used to manipulate and download SRA data. This is not a simple matter of file transfer. Raw data at SRA is stored in SRA format to optimize storage. This data must be converted back to FASTQ format (sometimes SAM or BAM format) using specific utilities. For example, we used the tool fastq-dump to transfer files from SRA to Rice and convert them to FASTQ format. This took more than a day!

In fact, these FASTQ files are not entirely raw. Typically, NGS experiments are indexed/bar-coded/tagged so that several samples can be run together on a single lane. So, if you download an SRA file that corresponds to a single sample, these indices have already been processed so that the reads that correspond to a specific sample were sent to a specific file. Another step that may have been done is the removal of the adaptors that are necessary for library construction and sequencing. The reads are then ready for additional QC and alignment to your reference genome.
We have already performed quality trimming of the reads with the FASTX toolkit, removing bases from either end of the read that have a PHRED quality score below 30. We have QC'd the reads with the tool FastQC. The reads were aligned to the mouse reference genome Mus_musculus.GRCm38.fa with the tool TopHat. The resulting Sequence Alignment Map files were saved in binary format as BAM files, which were sorted and indexed to facilitate analysis and visualization. This vignette covers an overview of the BAM file format, generation of a gene annotation object that can be used to make your gene count table, counting of reads over genes, and edgeR analysis for differential expression.

2 Setup

When working with such large files it is important to think about your storage and directory structure. We have a huge amount of disk space available to us in our scratch directory on Rice. For many tasks, it would be typical to copy your data to scratch, do your work, and copy the results back to secure storage such as Data Depot. However, to avoid transferring large files that could become corrupted, we are going to leave our BAM files on Data Depot and access them from Rice while working in our scratch directory. Additionally, some of the tools that have been used to process the data have specific conventions. TopHat generates a large number of files and directories, so when we want to access these files, we typically have to use long file paths. If you choose to modify this script for your own analysis, you will absolutely need to modify the paths to your own files!

2.1 Setup Directories

We will access various data for this experiment from Data Depot, so it will be convenient to define at least part of this path early.

    # define directories for data
    #data.dir <- "/depot/nihomics/data/ngs/pike" #path for Rice
    data.dir <- "/Volumes/nihomics/data/ngs/Pike" #path for Pete's MacBook
    library(Rsamtools)
    bam.files <- list.files(file.path(data.dir, "rnaseq", "tophat"), pattern="accepted_hits_
    bam.files

     [1] "/Volumes/nihomics/data/ngs/Pike/rnaseq/tophat/SRR _tophat_out/SRR _
     [2] "/Volumes/nihomics/data/ngs/Pike/rnaseq/tophat/SRR _tophat_out/SRR _
     [3] "/Volumes/nihomics/data/ngs/Pike/rnaseq/tophat/SRR _tophat_out/SRR _
     [4] "/Volumes/nihomics/data/ngs/Pike/rnaseq/tophat/SRR _tophat_out/SRR _
     [5] "/Volumes/nihomics/data/ngs/Pike/rnaseq/tophat/SRR _tophat_out/SRR _
     [6] "/Volumes/nihomics/data/ngs/Pike/rnaseq/tophat/SRR _tophat_out/SRR _
     [7] "/Volumes/nihomics/data/ngs/Pike/rnaseq/tophat/SRR _tophat_out/SRR _
     [8] "/Volumes/nihomics/data/ngs/Pike/rnaseq/tophat/SRR _tophat_out/SRR _
     [9] "/Volumes/nihomics/data/ngs/Pike/rnaseq/tophat/SRR _tophat_out/SRR _
    [10] "/Volumes/nihomics/data/ngs/Pike/rnaseq/tophat/SRR _tophat_out/SRR _
    [11] "/Volumes/nihomics/data/ngs/Pike/rnaseq/tophat/SRR _tophat_out/SRR _
    [12] "/Volumes/nihomics/data/ngs/Pike/rnaseq/tophat/SRR _tophat_out/SRR _

2.2 Parallel Processing

Now that we have the location of the BAM files, we are ready to start processing them further. A huge advantage to using Rice for this type of analysis is that each node has 20 cores: twenty processors that share memory (64 GB) but can work on independent tasks. One advantage of Bioconductor is that it provides a package, BiocParallel, that facilitates the use of multiple cores for parallel processing. However, you do need to download this package and configure the options for your specific setup. Below, we are telling Bioconductor that it can have 18 cores and assign tasks as it sees fit. How you can use BiocParallel will depend on your computer. MulticoreParam works well for UNIX-like operating systems. It will not work on Windows machines.

    # set up for parallel processing
    library(BiocParallel)
    library(parallel)
    registered()

    $MulticoreParam
    class: MulticoreParam
      bpjobname:BPJOB; bpworkers:6; bptasks:0; bptimeout:Inf; bpRNGseed:; bpisup:FALSE
      bplog:FALSE; bpthreshold:INFO; bplogdir:NA
      bpstopOnError:FALSE; bpprogressbar:FALSE
      bpresultdir:NA
    cluster type: FORK
    $SnowParam
    class: SnowParam
      bpjobname:BPJOB; bpworkers:6; bptasks:0; bptimeout:Inf; bpRNGseed:; bpisup:FALSE
      bplog:FALSE; bpthreshold:INFO; bplogdir:NA
      bpstopOnError:FALSE; bpprogressbar:FALSE
      bpresultdir:NA
    cluster type: SOCK

    $SerialParam
    class: SerialParam
      bplog:FALSE; bpthreshold:INFO
      bpcatchErrors:FALSE

    detectCores()

    [1] 8

    #bpparam <- MulticoreParam(workers=18, tasks=0) #parameters for Rice
    bpparam <- MulticoreParam(workers=6, tasks=0) #parameters for Pete's MacBook
    register(bpparam)
    registered()

    $MulticoreParam
    class: MulticoreParam
      bpjobname:BPJOB; bpworkers:6; bptasks:0; bptimeout:Inf; bpRNGseed:; bpisup:FALSE
      bplog:FALSE; bpthreshold:INFO; bplogdir:NA
      bpstopOnError:FALSE; bpprogressbar:FALSE
      bpresultdir:NA
    cluster type: FORK

    $SnowParam
    class: SnowParam
      bpjobname:BPJOB; bpworkers:6; bptasks:0; bptimeout:Inf; bpRNGseed:; bpisup:FALSE
      bplog:FALSE; bpthreshold:INFO; bplogdir:NA
      bpstopOnError:FALSE; bpprogressbar:FALSE
      bpresultdir:NA
    cluster type: SOCK

    $SerialParam
    class: SerialParam
      bplog:FALSE; bpthreshold:INFO
      bpcatchErrors:FALSE

Below is an example of the bplapply function, a parallel version of lapply that can send tasks to multiple processors. DO NOT UNCOMMENT THIS LINE AND RUN IT. The BAM files are already indexed. This is only an example. However, you do need to get the location of the BAM index files.

    # index BAM files so that they can be processed more easily and read by IGV and other applications.
    # This has already been done
    #system.time(bplapply(bam.files, indexBam))
    # Get the full names for the bam index files
    bam.index <- list.files(file.path(data.dir, "rnaseq", "tophat"), pattern="bam.bai$", rec
    bam.index

2.3 BAM Header Metadata

Now, we can start to look at the data in a BAM file. The first thing that you should examine is the BAM header. It contains information on how the reads were aligned. In particular, it should give you the names of the chromosomes to which the reads were aligned, and information about the aligner.

    # get the seqnames for the references used in the alignments
    bam.header <- scanBamHeader(bam.files[1])
    length(bam.header)

    [1] 1

    bam.header <- bam.header[[1]]
    names(bam.header)

    [1] "targets" "text"
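To make the two parts of the header concrete, here is a minimal base-R sketch that is not part of the original analysis. It parses SAM-style header lines into a named vector of sequence lengths, which is roughly the shape of the "targets" element. The header lines are made up for illustration (only the MT length of 16299 and the TopHat version 2.1.0 are taken from the header printed in this vignette), and `parse_sq` is a hypothetical helper, not an Rsamtools function.

```r
# Hypothetical SAM header lines (tab-separated fields, as in a real header).
# The toy1/toy2 names and lengths are invented; MT (16299 bp) is from the vignette.
header <- c(
  "@HD\tVN:1.0\tSO:coordinate",
  "@SQ\tSN:toy1\tLN:1000",
  "@SQ\tSN:toy2\tLN:2500",
  "@SQ\tSN:MT\tLN:16299",
  "@PG\tID:TopHat\tVN:2.1.0"
)

# parse_sq: hypothetical helper that returns sequence lengths named by
# sequence name, using only the @SQ (reference sequence) lines.
parse_sq <- function(lines) {
  sq <- lines[startsWith(lines, "@SQ")]
  fields <- strsplit(sq, "\t", fixed = TRUE)
  # pull the value of one TAG:value field out of a split header line
  get_tag <- function(f, tag) {
    hit <- f[startsWith(f, paste0(tag, ":"))]
    sub(paste0("^", tag, ":"), "", hit)
  }
  lens <- as.integer(vapply(fields, get_tag, character(1), tag = "LN"))
  names(lens) <- vapply(fields, get_tag, character(1), tag = "SN")
  lens
}

targets <- parse_sq(header)
print(targets)
print(names(targets))  # the analogue of the chromosome names used for matching
```

The names of this vector are what must agree with the sequence names of the gene annotation; the lengths are what you need to build one range per chromosome.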
    bam.header$targets

    [Output: a named vector of sequence lengths for the reference sequences, with names in the GL, JH, MT, X and Y families visible; the accession numbers and lengths were not preserved.]

    bam.chrs <- names(bam.header$targets)
    head(bam.header$text)

    $`@HD`
    [1] "VN:1.0" "SO:coordinate"

    $`@SQ`
    [1] "SN:1" "LN: "

    $`@SQ`
    [1] "SN:10" "LN: "

    $`@SQ`
    [1] "SN:11" "LN: "

    $`@SQ`
    [1] "SN:12" "LN: "

    $`@SQ`
    [1] "SN:13" "LN: "

    tail(bam.header$text)

    $`@SQ`
    [1] "SN:JH " "LN:158099"

    $`@SQ`
    [1] "SN:JH " "LN:114452"

    $`@SQ`
    [1] "SN:MT" "LN:16299"

    $`@SQ`
    [1] "SN:X" "LN: "

    $`@SQ`
    [1] "SN:Y" "LN: "

    $`@PG`
    [1] "ID:TopHat"
    [2] "VN:2.1.0"
    [3] "CL:/group/bioinfo/apps/apps/tophat-2.1.0/tophat -p 10 --library-type fr-secondst

2.4 Gene Annotation

Now that we have information about the reference sequence, we can generate a suitable object that represents the genes in the reference genome. There are many ways to do this, but we are going to use precompiled Bioconductor databases. It is not a simple matter of finding the transcription start site and transcription termination site of the genes and defining that interval. Why? What we need to do is retrieve the coordinates for the gene exons, grouping the exons together by gene. Further, we want to include all possible exonic sequences. Bioconductor has many functions that help you to achieve such tasks. Below, we will use exonsBy to get all exons for mouse genes.

    library(Mus.musculus, verbose=TRUE)

    Warning in library(Mus.musculus, verbose = TRUE): package Mus.musculus already present in search()

    mouse.genes0 <- exonsBy(Mus.musculus, by="gene")
    mouse.genes0

    [Output: a GRangesList object of length 24028. The first three elements are GRanges objects with 7 ranges on chr9, 6 ranges on chr7 and 1 range on chr10, each with 2 metadata columns (exon_id, exon_name; exon_name is <NA>). Element names, coordinates, strands and exon ids were not preserved.]
    <24025 more elements>
    seqinfo: 66 sequences (1 circular) from mm10 genome

    # exons are for each transcript but grouped by gene so some exons represented multiple times
    mouse.genes0[218]

    [Output: a GRangesList object of length 1 whose element is a GRanges object with 22 ranges and 2 metadata columns (exon_id, exon_name), all on chr12. Coordinates, strands and exon ids were not preserved. seqinfo: 66 sequences (1 circular) from mm10 genome.]

Note that this mouse gene has 22 exons, but many of these are overlapping or identical. That is because all exons for all transcripts are represented for this gene. If you counted reads over this object naively, you could double or triple your counts because a read could be counted once for every transcript. In fact, Bioconductor is pretty smart and this does not happen, but we are still going to perform a reduce on these exons. This function combines overlapping or redundant intervals into a single non-redundant set.

    mouse.genes <- reduce(mouse.genes0)
    mouse.genes[218]

    [Output: a GRangesList object of length 1 whose element is a GRanges object with 16 ranges and 0 metadata columns, on chr12, + strand. Coordinates were not preserved. seqinfo: 66 sequences (1 circular) from mm10 genome.]

Now, let's compare the chromosome names (seqnames) that are used in our mouse.genes object with the chromosome names that are used in the BAM files. These are often incompatible, and you will not get any counts for your genes if Bioconductor cannot match up the seqnames! We have to do some strange data conversions here because some of the data is stored as run-length encoded vectors.

    seqlevels(mouse.genes)

     [1] "chr1" "chr2" "chr3"
     [4] "chr4" "chr5" "chr6"
     [7] "chr7" "chr8" "chr9"
    [10] "chr10" "chr11" "chr12"
    [13] "chr13" "chr14" "chr15"
    [16] "chr16" "chr17" "chr18"
    [19] "chr19" "chrX" "chrY"
    [22] "chrM" "chr1_GL456210_random" "chr1_GL456211_random"
    [25] "chr1_GL456212_random" "chr1_GL456213_random" "chr1_GL456221_random"
    [28] "chr4_GL456216_random" "chr4_GL456350_random" "chr4_JH584292_random"
    [31] "chr4_JH584293_random" "chr4_JH584294_random" "chr4_JH584295_random"
    [34] "chr5_GL456354_random" "chr5_JH584296_random" "chr5_JH584297_random"
    [37] "chr5_JH584298_random" "chr5_JH584299_random" "chr7_GL456219_random"
    [40] "chrX_GL456233_random" "chrY_JH584300_random" "chrY_JH584301_random"
    [43] "chrY_JH584302_random" "chrY_JH584303_random" "chrUn_GL456239"
    [46] "chrUn_GL456359" "chrUn_GL456360" "chrUn_GL456366"
    [49] "chrUn_GL456367" "chrUn_GL456368" "chrUn_GL456370"
    [52] "chrUn_GL456372" "chrUn_GL456378" "chrUn_GL456379"
    [55] "chrUn_GL456381" "chrUn_GL456382" "chrUn_GL456383"
    [58] "chrUn_GL456385" "chrUn_GL456387" "chrUn_GL456389"
    [61] "chrUn_GL456390" "chrUn_GL456392" "chrUn_GL456393"
    [64] "chrUn_GL456394" "chrUn_GL456396" "chrUn_JH584304"

    # not the same style as BAM files
    table(seqlevels(mouse.genes) %in% bam.chrs)

    FALSE
       66

    # change the style
    seqlevelsStyle(mouse.genes) <- "NCBI"
    table(seqlevels(mouse.genes) %in% bam.chrs)

    [Output: counts of FALSE and TRUE; values not preserved.]

    # still problems with some (Bioconductor error?)
    # determine how many annotated genes are on each sequence
    seqnames.genes <- unlist(runValue(seqnames(mouse.genes)))
    table(seqnames.genes)

    [Output: number of genes on each sequence. X, Y and MT now appear in NCBI style, while the unplaced contigs keep UCSC-style names such as chr1_GL456210_random. Counts not preserved.]

    # a few genes on unassembled contigs so will adjust seqnames to match bam.chrs
    temp.seqlevels <- seqlevels(mouse.genes)
    summary(bam.chrs %in% temp.seqlevels)

    [Output: logical summary with both FALSE and TRUE entries; values not preserved.]

    temp.seqlevels[23:44] <- substr(temp.seqlevels[23:44], 6, 13)
    temp.seqlevels[45:66] <- substr(temp.seqlevels[45:66], 7, 14)
    temp.seqlevels[23:66] <- paste(temp.seqlevels[23:66], ".1", sep="")
    summary(bam.chrs %in% temp.seqlevels)

       Mode    TRUE    NA's
    logical      66       0

    seqlevels(mouse.genes) <- temp.seqlevels
    seqnames.genes <- unlist(runValue(seqnames(mouse.genes)))
    table(seqnames.genes)

    [Output: the same table of genes per sequence, now with the contigs named by their GL and JH accessions as in the BAM files. Counts not preserved.]
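The renaming step is easier to follow on a handful of names. The sketch below uses base R only and three sequence names taken from the seqlevels listing in this vignette. The position-based `substr` calls mirror the vignette's code; the pattern-based `sub` call is an alternative offered here as an assumption, not the author's method.

```r
# Three names in the UCSC style used by the annotation object
ucsc <- c("chr1_GL456210_random", "chr4_JH584292_random", "chrUn_GL456239")

# Position-based extraction, as in the vignette: characters 6-13 for the
# "_random" contigs and 7-14 for the "chrUn_" contigs.
by_position <- c(substr(ucsc[1:2], 6, 13), substr(ucsc[3], 7, 14))

# Pattern-based alternative (an assumption, not the author's code): keep the
# GL/JH accession wherever it sits in the name.
by_pattern <- sub("^chr[^_]+_((GL|JH)[0-9]+)(_random)?$", "\\1", ucsc)

# Both approaches agree on these names
stopifnot(identical(by_position, by_pattern))

# Append the version suffix so the names match those in the BAM header
renamed <- paste0(by_pattern, ".1")
print(renamed)
```

The position-based version relies on every "_random" contig sitting on a chromosome with a one-character name, which holds for this genome build; the pattern-based version does not depend on that.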
Now, the seqnames of mouse.genes match the seqnames in our BAM files.

2.5 Genome Annotation

Next, we are going to make an object that represents the chromosomes in the mouse genome. This object is required if you want to access the reads in the BAM file one chromosome at a time. This is the best way to work with BAM files if you have limited memory on your computer, e.g. 16 GB or less on a laptop. Again, we need to make sure that the seqnames match!

    library(BSgenome.Mmusculus.UCSC.mm10)
    mouse.bs <- BSgenome.Mmusculus.UCSC.mm10
    mouse.bs

    Mouse genome:
    # organism: Mus musculus (Mouse)
    # provider: UCSC
    # provider version: mm10
    # release date: Dec
    # release name: Genome Reference Consortium GRCm38
    # 66 sequences:
    #   chr1 chr2 chr3
    #   chr4 chr5 chr6
    #   chr7 chr8 chr9
    #   chr10 chr11 chr12
    #   chr13 chr14 chr15
    #   ...
    # (use 'seqnames()' to see all the sequence names, use the '$' or '[['
    # operator to access a given sequence)

    seqlevelsStyle(mouse.bs) <- "NCBI"
    seqlevels(mouse.bs)

     [1] "1" "2" "3"
     [4] "4" "5" "6"
     [7] "7" "8" "9"
    [10] "10" "11" "12"
    [13] "13" "14" "15"
    [16] "16" "17" "18"
    [19] "19" "X" "Y"
    [22] "MT" "chr1_GL456210_random" "chr1_GL456211_random"
    [25] "chr1_GL456212_random" "chr1_GL456213_random" "chr1_GL456221_random"
    [28] "chr4_GL456216_random" "chr4_GL456350_random" "chr4_JH584292_random"
    [31] "chr4_JH584293_random" "chr4_JH584294_random" "chr4_JH584295_random"
    [34] "chr5_GL456354_random" "chr5_JH584296_random" "chr5_JH584297_random"
    [37] "chr5_JH584298_random" "chr5_JH584299_random" "chr7_GL456219_random"
    [40] "chrX_GL456233_random" "chrY_JH584300_random" "chrY_JH584301_random"
    [43] "chrY_JH584302_random" "chrY_JH584303_random" "chrUn_GL456239"
    [46] "chrUn_GL456359" "chrUn_GL456360" "chrUn_GL456366"
    [49] "chrUn_GL456367" "chrUn_GL456368" "chrUn_GL456370"
    [52] "chrUn_GL456372" "chrUn_GL456378" "chrUn_GL456379"
    [55] "chrUn_GL456381" "chrUn_GL456382" "chrUn_GL456383"
    [58] "chrUn_GL456385" "chrUn_GL456387" "chrUn_GL456389"
    [61] "chrUn_GL456390" "chrUn_GL456392" "chrUn_GL456393"
    [64] "chrUn_GL456394" "chrUn_GL456396" "chrUn_JH584304"

    # same problem with seqnames
    seqlevels(mouse.bs) <- temp.seqlevels
    # prepare GRanges object that can be used as a parameter to access specific
    # chromosomes and regions from BAM files.
    mouse.gr <- GRanges(seqnames=Rle(as.character(seqnames(mouse.bs))), ranges=IRanges(start
    mouse.gr

    [Output: a GRanges object with 66 ranges and 0 metadata columns, one unstranded (*) range per sequence starting at position 1. Rows 62 to 65 are GL contigs ending at 23629, 55711, 24323 and 21240; the other end coordinates and the contig accession numbers were not preserved. seqinfo: 66 sequences (1 circular) from mm10 genome.]

3 Sequence Alignment Map Data Format

3.1 Converting BAM file to GAlignments Object

There are several Bioconductor packages that allow you to access BAM files. Here we are going to use readGAlignments to make a GAlignments object for the reads on chr19. To do this, we need to specify certain parameters. What data do you want? Which chromosomes? Do you want to filter the data on specific flags?

    # needed to redefine the file paths so the PDF would compile
    data.dir <- "/Volumes/nihomics/data/ngs/Pike" #path for Pete's MacBook
    bam.files <- list.files(file.path(data.dir, "rnaseq", "tophat"), pattern="accepted_hits_
    bam.index <- list.files(file.path(data.dir, "rnaseq", "tophat"), pattern="bam.bai$", rec
    library(GenomicAlignments)
    flag.param <- scanBamFlag()
    what.param <- scanBamWhat()
    which.param <- mouse.gr[19]
    my.param <- ScanBamParam(what=what.param, which=which.param, flag=flag.param)
    my.param

    class: ScanBamParam
    bamFlag (NA unless specified):
    bamSimpleCigar: FALSE
    bamReverseComplement: FALSE
    bamTag:
    bamTagFilter:
    bamWhich: 1 ranges
    bamWhat: qname, flag, rname, strand, pos, qwidth, mapq, cigar, mrnm, mpos, isize, seq, qual, groupid, mate_status
    bamMapqFilter: NA
    bam.ga <- readGAlignments(bam.files[1], index=bam.index[1], param=my.param)
    bam.ga

    [Output: a GAlignments object with 13 metadata columns. The core columns are seqnames, strand, cigar, qwidth, start, end, width and njunc; the metadata columns are qname, flag, rname, strand, pos, qwidth, mapq, cigar, mrnm, mpos, isize, seq and qual. Where values are preserved, mrnm is <NA> and mpos and isize are 0. The number of alignments and the per-read values were not preserved. seqinfo: 66 sequences from an unspecified genome.]

You should see that some of the data is repeated. The reason is that some essential data is always imported, so we don't need to request it in our what.param.

    what.param <- c("qname", "flag", "mapq", "seq", "qual")
    my.param <- ScanBamParam(what=what.param, which=which.param, flag=flag.param)
    bam.ga <- readGAlignments(bam.files[1], index=bam.index[1], param=my.param)
    bam.ga

    [Output: the same GAlignments object, now with 5 metadata columns (qname, flag, mapq, seq, qual). The mapq values that are preserved in the rows shown are 0, 1 and 50. The number of alignments and the other per-read values were not preserved. seqinfo: 66 sequences from an unspecified genome.]

3.2 Essential Alignment Data

There are certain fields in the BAM file that are always imported. Let's go through these one by one.

    seqnames(bam.ga)

    [Output: a factor-Rle with 1 run; Values: 19; Levels(66). Length not preserved.]

    strand(bam.ga)

    [Output: a factor-Rle; Levels(3): + - *. Length and run lengths not preserved.]

    cigar.tbl <- sort(table(cigar(bam.ga)), decreasing=TRUE)
    head(cigar.tbl, 20)

    [Output: the 20 most frequent CIGAR strings, in decreasing order of frequency; counts not preserved, and four strings lost their leading digits:]
    101M 26M623N75M 2M1413N99M 37M109N64M
    M623N67M 46M2212N55M 33M623N68M 84M781N17M
    M974N1M 37M634N64M 68M1413N33M 69M974N32M
    M634N80M1630N9M 97M1D4M 96M1I4M 61M781N40M
    M170N43M 63M3195N38M 99M781N2M 5M92N96M
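The CIGAR string determines three of the essential columns: qwidth (bases in the read), width (span on the reference) and njunc (number of skipped regions, i.e. splice junctions). The following base-R sketch illustrates that relationship on CIGAR strings taken from the table above. `cigar_widths` is a hypothetical helper written for this illustration; it is not how GenomicAlignments computes these values internally.

```r
# cigar_widths: hypothetical helper that splits a CIGAR string into
# (length, operation) pairs and derives qwidth, width and njunc.
cigar_widths <- function(cigar) {
  ops <- regmatches(cigar, gregexpr("[0-9]+[MIDNSHP=X]", cigar))[[1]]
  len <- as.integer(sub("[MIDNSHP=X]$", "", ops))  # numeric part of each op
  op  <- sub("^[0-9]+", "", ops)                   # operation letter
  c(qwidth = sum(len[op %in% c("M", "I", "S", "=", "X")]),  # consumes the read
    width  = sum(len[op %in% c("M", "D", "N", "=", "X")]),  # consumes the reference
    njunc  = sum(op == "N"))                                # skipped regions
}

# Columns are CIGAR strings, rows are the derived quantities
res <- sapply(c("101M", "26M623N75M", "97M1D4M", "96M1I4M"), cigar_widths)
print(res)
```

All four reads are 101 bases long, but the spliced read 26M623N75M spans 724 bases of the reference, which is why qwidth and width have to be inspected separately.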
    range(qwidth(bam.ga))

    [Output not preserved.]

    range(width(bam.ga))

    [Output not preserved.]

    njunc.tbl <- table(njunc(bam.ga))
    njunc.tbl

    [Output not preserved.]

3.3 Additional Data

Some data is only optionally reported, but these fields contain important information about the reads. The FLAG field contains information on how the reads aligned, especially how the two reads of a pair in a paired-end run aligned with respect to each other. The MAPQ field tells you how good or bad the alignment is.

    head(mcols(bam.ga))

    [Output: a DataFrame with 6 rows and 5 columns (qname, flag, mapq, seq, qual). The qname, flag and mapq values were not preserved; the seq and qual columns show the same reads that are printed in section 3.4.]

    flag.tbl <- table(mcols(bam.ga)$flag)
    flag.tbl

    [Output not preserved.]

    flag.mat <- bamFlagAsBitMatrix(as.integer(names(flag.tbl)))
    flag.mat <- cbind(as.integer(names(flag.tbl)), flag.mat)
    flag.mat

    [Output: a matrix with four rows, one per distinct flag value, and the columns isPaired, isProperPair, isUnmappedQuery, hasUnmappedMate, isMinusStrand, isMateMinusStrand, isFirstMateRead, isSecondMateRead, isSecondaryAlignment, isNotPassingQualityControls and isDuplicate. Values not preserved.]

    dups <- duplicated(mcols(bam.ga)$qname) | duplicated(mcols(bam.ga)$qname, fromLast=TRUE)
    dup.ga <- bam.ga[dups]
    dup.ga <- dup.ga[order(mcols(dup.ga)$qname)]
    dup.ga

    [Output: a GAlignments object with 5 metadata columns; the last row shown is [383485]. Because the object is ordered by qname, rows with identical seq and qual strings appear next to each other. Other values not preserved. seqinfo: 66 sequences from an unspecified genome.]

    table(mcols(bam.ga)$flag, dups)

    [Output: flag values cross-tabulated against dups (FALSE, TRUE); counts not preserved.]

    mapq.tbl <- table(mcols(bam.ga)$mapq)
    mapq.tbl

    [Output not preserved.]

    table(mcols(bam.ga)$mapq, dups)

    [Output: mapq values cross-tabulated against dups (FALSE, TRUE); counts not preserved.]

    table(mcols(bam.ga)$mapq, mcols(bam.ga)$flag, dups)

    [Output: a three-way table of mapq by flag, printed as two slices (, , dups = FALSE and , , dups = TRUE); counts not preserved.]
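Since the FLAG field is a sum of bit values, it can also be decoded by hand. The sketch below is a base-R illustration and not part of the original analysis. The bit values come from the SAM specification and are labelled with the column names printed by bamFlagAsBitMatrix above; `decode_flag` is a hypothetical helper, and the example flags are chosen for illustration rather than read from the data.

```r
# Bit values from the SAM specification, labelled with the column names
# used by bamFlagAsBitMatrix()
flag.bits <- c(isPaired = 1L, isProperPair = 2L, isUnmappedQuery = 4L,
               hasUnmappedMate = 8L, isMinusStrand = 16L,
               isMateMinusStrand = 32L, isFirstMateRead = 64L,
               isSecondMateRead = 128L, isSecondaryAlignment = 256L,
               isNotPassingQualityControls = 512L, isDuplicate = 1024L)

# decode_flag: hypothetical helper; returns the names of the bits set in a flag
decode_flag <- function(flag) {
  set <- bitwAnd(rep.int(as.integer(flag), length(flag.bits)), flag.bits) != 0L
  names(flag.bits)[set]
}

decode_flag(0L)    # no bits set
decode_flag(16L)   # read aligned to the minus strand
decode_flag(272L)  # 256 + 16: secondary alignment on the minus strand
decode_flag(99L)   # 1 + 2 + 32 + 64: a typical properly paired first mate
```

A flag of 0 therefore carries information too: an unpaired, mapped, primary alignment on the plus strand.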
3.4 Sequence Data

The sequence data includes both the actual sequence of the read and the quality scores for each base in PHRED format.

    mcols(bam.ga)$seq

      A DNAStringSet instance (length not preserved)
          width seq
      [1]   101 CCTAGTATATCTGGAGAGTTAAGATGGG...CACAAATATTTCCACGCTTTTTCACTAC
      [2]   101 CCTAGTATATCTGGAGAGTTAAGATGGG...CACAAATATTTCCACGCTTTTTCACTAC
      [3]   101 AGGGGGAGATGTGAGGAGCCGCCCTTGC...AAACAAGTAGTCTGCGCATGTGCTGGGG
      [4]   101 AGGGGGAGATGTGAGGAGCCGCCCTTGC...AAACAAGTAGTCTGCGCATGTGCTGGGG
      [5]   101 CTATGGCCTTGGGCATCAAGATTTAAAA...AGGGTAGATTCCCCCTTTTTGTTTAATT
      ...
      [ ]   101 GTTAGGGTTAGGGTTAGGGTTAGGGTTA...GTAGGGTTAGGGGTAGGGTTAGGGTTAG
      [ ]   101 GTTAGGGTTAGGGTTAGGGTTAGGGTTA...TTAGGGTTAGGGTTAGGGTTAGGGTTAG
      [ ]   101 GTTAGGGTTAGGGTTAGGGTTAGGGTTA...TTAGGGTTAGGGTTAGGGTTAGGGTTAG
      [ ]   101 TAGGGTTAGGGTTAGGGTTAGGGTTAGG...AGGGTTAGGGTTAGGGTTAGGGTTAGAT
      [ ]   101 TAGGGTTAGGGTTAGGGTTAGGGTTAGG...AGGGTTAGGGTTAGGGTTAGGGTTAGAT

    ecor1 <- vcountPattern("GAATTC", mcols(bam.ga)$seq)
    table(ecor1)

    [Output: number of reads by count of GAATTC sites; values not preserved.]

    mcols(bam.ga)$seq[ecor1 > 0]

      A DNAStringSet instance (length not preserved)
               width seq
           [1]   101 ATATAGTGGATTACTTTGATGGATTTCCA...ATGATTGTTTTGATGTGTTCTTGAATTC
           [2]   101 ATATAGTGGATTACTTTGATGGATTTCCA...ATGATTGTTTTGATGTGTTCTTGAATTC
           [3]   101 CAAAATCCAACACCTATTCATGATGAAAG...AAAGCAATATACAGCAAGCCAGTAGCCA
           [4]   101 TGGCTACTGGTTTGCTGTAGATTGCTTTT...TTTTATCATGAATGGGTGTTGGATCTTG
           [5]   101 CATCTTTGCCATGATATTTTTTGCTTTAG...CCCAAATGCTGCATAATATCCCTTCCCC
           ...
      [109816]   101 CACATGATCATCTCGTTAGATGCAGAAAA...AAAGATCAGGAATTCAAGGCCCATACCT
      [109817]   101 TTAGATGCAGAAAAAGCATTTGACAAGAT...AAGGCCCATACCTAAACATGATAAAAGC
      [109818]   101 GTACACCCCTGCATGATAGAATTCTGCAG...TAAGTGATTCTTGGATCCTAACATTCTA
      [109819]   101 GTACACCCCTGCATGATAGAATTCTGCAG...TAAGTGATTCTTGGATCCTAACATTCTA
      [109820]   101 GTACACCCCTGCATGATAGAATTCTGCAG...TAAGTGATTCTTGGATCCTAACATTCTA
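Both kinds of sequence data can be explored with base R alone, which helps when Biostrings is not at hand. The sketch below is an illustration, not part of the original analysis. It decodes a quality string under the assumption of the Phred+33 encoding (ASCII code minus 33) and converts each score Q to an error probability of 10^(-Q/10). It then counts the EcoRI site GAATTC in plain character strings. The quality string is the first 20 characters of the first read's quality shown in this vignette; the `reads` vector is made up.

```r
# First 20 quality characters of the first read in the vignette
qual <- "3DCCDCCDCCCCDCCCEEDC"

# Phred+33: subtract 33 from the ASCII code of each character
q <- as.integer(charToRaw(qual)) - 33L

# Probability that each base call is wrong
p.error <- 10^(-q / 10)
print(data.frame(char = strsplit(qual, "")[[1]], q = q,
                 p.error = signif(p.error, 2)))

# Counting a motif in plain strings; these sequences are invented
reads <- c("ATGAATTCGG", "GAATTCGAATTC", "CCCCCCCC")
n.sites <- lengths(regmatches(reads, gregexpr("GAATTC", reads, fixed = TRUE)))
print(n.sites)
```

The character "3" decodes to a score of 18 (about a 1.6% chance of error), while "D" decodes to 35, which is why the trimming threshold of 30 mentioned in the introduction removes only the weakest bases.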
    mcols(bam.ga)$qual

      A PhredQuality instance (length not preserved)
          width seq
      [1]   101 3DCCDCCDCCCCDCCCEEDCEFFFCDA7...JIJJJJJIJJJJIHJHHHHHFFFFFC@C
      [2]   101 <DCDEDCDDCCACDECCCDEEEDC@B@E...IIEGIIGIHGIIHGHDHHHGFDEDF@C@
      [3]   101 CCCFFFFFHHHHHIIJJIIIIJJJIJJJ...CCCDDDDACDDDDDDDB>BBCCCDC@B9
      [4]   101 CCCFFFFFHHHFHJIJIJJJJJJJJJJJ...DDACDDC>CCCDEDDDBDDDDDCDDCBB
      [5]   101 CCCFFFFFHHDHHEIJIJJJIJJIHIJJ...FEDEDCD>CCEDCDDDDDDDDACDCAC:
      ...

    as.integer(charToRaw(as.character(mcols(bam.ga)$qual[1])))

    [Output: the ASCII codes of the 101 quality characters of the first read; values not preserved.]

    as.integer(charToRaw(as.character(mcols(bam.ga)$qual[1]))) - 33

    [Output: the same codes minus 33, i.e. the PHRED scores; values not preserved.]

4 Determining Digital Gene Expression

Now, we have everything we need to get our gene counts. It is important to think about how this is done. First, is your RNAseq library strand specific? If so, do you have paired-end reads? If so, which of the ends should you use for counting?
In addition, how stringent do you want to be when you determine whether or not a read actually overlaps a gene? What if a read overlaps two genes? In your Packages tab, find GenomicAlignments and click on it. Open the user guides and select GenomicAlignments::summarizeOverlaps. Figure 1 shows the counting modes that are available.

4.1 Counting Modes on Simulated Data

The following bit of code demonstrates how your gene expression counts can vary based on your choice of counting mode.

library(Gviz)
my.gr <- GRanges(seqnames=Rle("chr1"), ranges=IRanges(start=1, end=1000), strand=Rle("*"
my.seqinfo <- Seqinfo(seqnames="chr1", seqlengths=1000, isCircular=FALSE, genome="simula
seqinfo(my.gr) <- my.seqinfo
gene1 <- GRanges(seqnames=Rle("chr1"), ranges=IRanges(start=201, end=350), strand=Rle("+
gene2 <- GRanges(seqnames=Rle(rep("chr1", 2)), ranges=IRanges(start=c(301, 601), end=c(3
gene3 <- GRanges(seqnames=Rle("chr1"), ranges=IRanges(start=451, end=550), strand=Rle("
gene4 <- GRanges(seqnames=Rle(rep("chr1", 2)), ranges=IRanges(start=c(101, 301), end=c(2
my.genes <- GRangesList(gene1=gene1, gene2=gene2, gene3=gene3, gene4=gene4)
my.genes.df <- as.data.frame(my.genes)
colnames(my.genes.df)[3] <- "chromosome"
gene.track <- GeneRegionTrack(my.genes.df, fill="green4", arrowHeadWidth=50, arrowHeadMax
gene1.reads <- GRanges(seqnames=Rle(rep("chr1", 20)), ranges=IRanges(start=sample(200:32
gene2.reads <- GRanges(seqnames=Rle(rep("chr1", 60)), ranges=IRanges(start=sample(c(300:
gene3.reads <- GRanges(seqnames=Rle(rep("chr1", 10)), ranges=IRanges(start=sample(450:52
gene4.reads <- GRanges(seqnames=Rle(rep("chr1", 20)), ranges=IRanges(start=sample(c(100:
all.reads <- c(gene1.reads, gene2.reads, gene3.reads, gene4.reads)
read.track <- AnnotationTrack(all.reads, fill=c("red", "blue")[as.integer(strand(all.rea
ax.track <- GenomeAxisTrack(GRanges(seqnames="chr1", ranges=IRanges(1, 1000)), genome="s
plotTracks(list(gene.track, read.track, ax.track), from=1, to=700, grid=1, sizes=c(4, 4,
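The way a counting mode changes the result can also be sketched in base R, without any Bioconductor objects. The block below is a simplified illustration, not the GenomicAlignments implementation: the intervals in `genes` and `reads` are hypothetical toy data, strand is ignored, each gene is a single range, and only simplified versions of the "Union" rule (count a read that overlaps exactly one gene) and the "IntersectionStrict" rule (count a read that lies completely within exactly one gene) are shown.

```r
# Toy features and reads as closed integer intervals (hypothetical data).
genes <- data.frame(start = c(201, 301), end = c(350, 400),
                    row.names = c("geneA", "geneB"))
reads <- data.frame(start = c(210, 320, 340, 360),
                    end   = c(259, 349, 389, 399))

# TRUE for every gene that the read touches at all.
overlaps_any <- function(r_start, r_end, g) r_start <= g$end & r_end >= g$start
# TRUE for every gene that fully contains the read.
within_gene <- function(r_start, r_end, g) r_start >= g$start & r_end <= g$end

# Count a read only when the hit rule selects exactly one gene;
# reads with zero or several hits are discarded.
count_mode <- function(reads, genes, hit_fun) {
  counts <- setNames(integer(nrow(genes)), rownames(genes))
  for (i in seq_len(nrow(reads))) {
    hits <- which(hit_fun(reads$start[i], reads$end[i], genes))
    if (length(hits) == 1) counts[hits] <- counts[hits] + 1L
  }
  counts
}

union_counts  <- count_mode(reads, genes, overlaps_any)  # simplified "Union"
strict_counts <- count_mode(reads, genes, within_gene)   # simplified "IntersectionStrict"
print(rbind(Union = union_counts, IntersectionStrict = strict_counts))
```

In this toy case the read spanning 340 to 389 touches both genes, so the simplified Union rule discards it, while the strict rule assigns it to geneB because it lies entirely within that gene only.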
[Figure: Gviz plot with a GeneRegionTrack (gene1, gene2, gene3, gene4) and a Demo Reads track]

count.mat <- matrix(0, nrow=4, ncol=5)
colnames(count.mat) <- c("actual", "countOverlaps", "Union", "IntersectionStrict", "Inte
rownames(count.mat) <- c("gene1", "gene2", "gene3", "gene4")
count.mat[, 1] <- c(20, 60, 10, 20)
count.mat[, 2] <- countOverlaps(my.genes, all.reads)
count.mat[, 3] <- assays(summarizeOverlaps(my.genes, all.reads, "Union", ignore.strand=f
count.mat[, 4] <- assays(summarizeOverlaps(my.genes, all.reads, "IntersectionStrict", ig
count.mat[, 5] <- assays(summarizeOverlaps(my.genes, all.reads, "IntersectionNotEmpty",
count.mat

4.2 Clean-Up Environment to Free Memory

We've accumulated a bunch of objects in our workspace. Let's carefully remove what we don't need.

ls()

 [1] "all.reads"      "ax.track"       "bam.chrs"       "bam.files"
 [5] "bam.ga"         "bam.header"     "bam.index"      "bpparam"
 [9] "cigar.tbl"      "count.mat"      "data.dir"       "dup.ga"
[13] "dups"           "ecor1"          "flag.mat"       "flag.param"
[17] "flag.tbl"       "gene.track"     "gene1"          "gene1.reads"
[21] "gene2"          "gene2.reads"    "gene3"          "gene3.reads"
[25] "gene4"          "gene4.reads"    "mapq.tbl"       "mouse.bs"
[29] "mouse.genes"    "mouse.genes0"   "mouse.gr"       "my.genes"
[33] "my.genes.df"    "my.gr"          "my.param"       "my.seqinfo"
[37] "njunc.tbl"      "read.track"     "seqnames.genes" "temp.seqlevels"
[41] "what.param"     "which.param"

my.objs <- list(ls())
my.objs[[1]] <- setdiff(my.objs[[1]], c("data.dir", "bam.files", "bam.index", "mouse.gen
rm(list=my.objs[[1]])
rm(my.objs)
ls()

[1] "bam.files"   "bam.index"   "data.dir"    "mouse.genes" "mouse.gr"

gc()
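The keep-list pattern used above (list everything, subtract the names to keep, remove the rest) can be tried safely in a scratch environment before running it on a real workspace. This is a minimal sketch: the environment `scratch` and the placeholder objects assigned into it are hypothetical and only stand in for the real workspace contents.

```r
# Build a throwaway environment with a few placeholder objects.
scratch <- new.env()
assign("data.dir",  "/path/to/data",       envir = scratch)
assign("bam.files", c("a.bam", "b.bam"),   envir = scratch)
assign("count.mat", matrix(0, 2, 2),       envir = scratch)
assign("gene1",     1:10,                  envir = scratch)

# Names to keep; everything else in the environment is removed.
keep <- c("data.dir", "bam.files")
drop <- setdiff(ls(scratch), keep)
rm(list = drop, envir = scratch)

remaining <- ls(scratch)
print(remaining)

# Ask R to release the freed memory.
invisible(gc())
```

Working against an explicit environment makes it easy to confirm that only the intended objects survive before applying the same `setdiff()` and `rm(list=...)` calls to the global workspace.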
4.3 Method for Limited Memory and Single Core

It is very likely that you will have limited computing resources. The following chunk of code works down a list of BAM files, one chromosome at a time, and generates your gene count matrix. I have used this routinely on my MacBook Pro (16 GB memory).

#sample.names <- substring(bam.files, 45, 54)
#count.mat <- matrix(0, nrow=length(mouse.genes), ncol=length(sample.names))
#rownames(count.mat) <- names(mouse.genes)
#colnames(count.mat) <- sample.names
# one file at a time, one computer core at a time
# mapq.cutoff <- 50
# for(j in 1:length(bam.files)){
#   for(i in 1:length(seqlevels(mouse.genes))){
#     chr.param <- ScanBamParam(which=mouse.gr[i], mapqFilter=mapq.cutoff)
#     bam.ga <- readGAlignments(bam.files[j], index=bam.index[j], param=chr.param)
#     my.counts <- summarizeOverlaps(features=mouse.genes, reads=bam.ga, mode="Union", ign
#     count.mat[, j] <- count.mat[, j] + assays(my.counts)$counts
#   }
# }
# write.table(count.mat, file=paste(sample.name, "counts.txt", sep="_"), row.names=TRUE,

4.4 Parallel Processing of BAM Files on Multiple Cores

The next chunk will use the parameters we specified to BiocParallel to process all 12 BAM files (we have 18 workers). This should take less than 10 minutes.

system.time(my.se <- summarizeOverlaps(mouse.genes, bam.files, mode="IntersectionNotEmpt
# summarizeOverlaps will check for parallel processing parameters.
# We have 12 files and 18 workers, so all 12 files will be processed in parallel.
registered()

# Nine minutes to process all files. It would have taken ~110 minutes with the for loop.

class(my.se)
dim(assays(my.se)$counts)
colnames(assays(my.se)$counts)
save(my.se, file="pikesummarizedexp.rdata")

Now we are finished with the Big Data part of RNAseq analysis.

5 Differential Gene Expression Analysis with EdgeR

There are several popular packages or tools for DGE analysis of RNAseq data. At one time CuffDiff was certainly the most popular, but it has fallen out of favor for various reasons. It will be interesting to see how CuffDiff2 performs. DESeq2 and edgeR are two of the best available packages for DGE, and both are native to Bioconductor. Here we will cover edgeR because it is essentially limma for RNAseq. Both were developed by the same research group, so they have a similar philosophy and workflow. If you are interested in trying DESeq2, Bioconductor has an excellent workflow on their website that is quite easy to follow if you know a little R (and you do now!).

5.1 Formatting the Data

As with the microarray data, we need to format our raw data and our pdata. We will get the counts from the SummarizedExperiment object. If you weren't able to generate the object, there is a copy that you can load.

load("pikesummarizedexp.rdata")
my.counts <- assays(my.se)$counts
head(my.counts)

[Output: one column per BAM file, with column names of the form SRR..._accepted_hits_sorted.bam]
colnames(my.counts) <- substr(colnames(my.counts), 1, 10)
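The renaming step keeps only the first 10 characters of each column name, which is the run accession. A small base R sketch of the same idea follows; the names in `bam.cols` are hypothetical stand-ins, and the `sub()` variant is an alternative that is assumed here only for illustration, for the case where accessions are not all 10 characters long.

```r
# Hypothetical column names in the style of the BAM-derived names.
bam.cols <- c("SRR0000001_accepted_hits_sorted.bam",
              "SRR0000002_accepted_hits_sorted.bam")

# Fixed-width approach, as in the vignette: keep characters 1 to 10.
run.ids <- substr(bam.cols, 1, 10)

# Pattern-based alternative: strip the known suffix instead of counting characters.
run.ids2 <- sub("_accepted_hits_sorted\\.bam$", "", bam.cols)

print(run.ids)
```

Both approaches give the same result when every accession has exactly 10 characters; the pattern-based form simply avoids relying on that width.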
head(my.counts)

Next, we need to get the pdata for the samples. We will use a file that can be downloaded from SRA when you browse the data. Again, we are going to do a bit of data wrangling to make suitable names.

# metadata for experiment
pdata <- read.delim(file.path(data.dir, "metadata", "Pike_SraRunTable_RNAseq.txt"), comm
head(pdata)

  BioSample_s Experiment_s MBases_l MBytes_l Run_s SRA_Sample_s
1 SAMN        SRX                            SRR   SRS
2 SAMN        SRX                            SRR   SRS
3 SAMN        SRX                            SRR   SRS
4 SAMN        SRX                            SRR   SRS
5 SAMN        SRX                            SRR   SRS
6 SAMN        SRX                            SRR   SRS

  Sample_Name_s differentiation_day_s source_name_s   Assay_Type_s
1 GSM           d3                    IDGSW3_basal_d3 RNA-Seq
2 GSM           d3                    IDGSW3_basal_d3 RNA-Seq
3 GSM           d3                    IDGSW3_basal_d3 RNA-Seq
4 GSM           d7                    IDGSW3_basal_d7 RNA-Seq
5 GSM           d7                    IDGSW3_basal_d7 RNA-Seq
6 GSM           d7                    IDGSW3_basal_d7 RNA-Seq

  AssemblyName_s BioProject_s Center_Name_s Consent_s InsertSize_l
1 <not provided> PRJNA        GEO           public    0
2 <not provided> PRJNA        GEO           public    0
3 <not provided> PRJNA        GEO           public    0
4 <not provided> PRJNA        GEO           public    0
5 <not provided> PRJNA        GEO           public    0
6 <not provided> PRJNA        GEO           public    0

  LibraryLayout_s LibrarySelection_s LibrarySource_s Library_Name_s
1 SINGLE          cDNA               TRANSCRIPTOMIC  <not provided>
2 SINGLE          cDNA               TRANSCRIPTOMIC  <not provided>
3 SINGLE          cDNA               TRANSCRIPTOMIC  <not provided>
4 SINGLE          cDNA               TRANSCRIPTOMIC  <not provided>
5 SINGLE          cDNA               TRANSCRIPTOMIC  <not provided>
6 SINGLE          cDNA               TRANSCRIPTOMIC  <not provided>

  LoadDate_s Organism_s   Platform_s ReleaseDate_s SRA_Study_s cell_line_s
1 2/7/2014   Mus musculus ILLUMINA   6/2/2014      SRP         IDG-SW3
2 2/7/2014   Mus musculus ILLUMINA   6/2/2014      SRP         IDG-SW3
3 2/7/2014   Mus musculus ILLUMINA   6/2/2014      SRP         IDG-SW3
4 2/7/2014   Mus musculus ILLUMINA   6/2/2014      SRP         IDG-SW3
5 2/7/2014   Mus musculus ILLUMINA   6/2/2014      SRP         IDG-SW3
6 2/7/2014   Mus musculus ILLUMINA   6/2/2014      SRP         IDG-SW3

  cell_type_s      g1k_analysis_group_s g1k_pop_code_s source_s
1 osteocytic cells <not provided>       <not provided> <not provided>
2 osteocytic cells <not provided>       <not provided> <not provided>
3 osteocytic cells <not provided>       <not provided> <not provided>
4 osteocytic cells <not provided>       <not provided> <not provided>
5 osteocytic cells <not provided>       <not provided> <not provided>
6 osteocytic cells <not provided>       <not provided> <not provided>

pdata <- pdata[pdata$Run_s %in% colnames(my.counts), c("Run_s", "differentiation_day_s",
pdata$source_name_s <- as.factor(pdata$source_name_s)
summary(pdata$source_name_s)

IDGSW3_125_d3 IDGSW3_125_d35 IDGSW3_vehicle_d3
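The subsetting step above keeps only the metadata rows whose run accession appears among the count matrix columns. A self-contained sketch of that step on toy data is shown below. All accessions and group labels are hypothetical, and the `match()` reordering is an extra safeguard assumed here for illustration (it is not in the original chunk) so that metadata rows line up with the count columns.

```r
# Toy count matrix whose columns are run accessions (hypothetical values).
counts <- matrix(1:6, nrow = 2,
                 dimnames = list(c("g1", "g2"),
                                 c("SRR0000002", "SRR0000001", "SRR0000003")))

# Toy run table with one extra run that is not in the count matrix.
pdata <- data.frame(
  Run_s = c("SRR0000001", "SRR0000002", "SRR0000003", "SRR0000004"),
  source_name_s = c("cells_vehicle_d3", "cells_125_d3",
                    "cells_vehicle_d3", "cells_125_d35"),
  stringsAsFactors = FALSE
)

# Keep only the runs that were counted.
pdata <- pdata[pdata$Run_s %in% colnames(counts), ]
# Put the metadata rows in the same order as the count columns.
pdata <- pdata[match(colnames(counts), pdata$Run_s), ]

# Treat the sample group as a factor and tabulate the group sizes.
pdata$source_name_s <- as.factor(pdata$source_name_s)
group_sizes <- table(pdata$source_name_s)
print(group_sizes)
```

Checking that `pdata$Run_s` and `colnames(counts)` are identical before building a design is a cheap way to catch sample mix-ups.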
Practical: Read Counting in RNA-seq
Practical: Read Counting in RNA-seq Hervé Pagès (hpages@fhcrc.org) 5 February 2014 Contents 1 Introduction 1 2 First look at some precomputed read counts 2 3 Aligned reads and BAM files 4 4 Choosing and
More informationRange-based containers in Bioconductor
Range-based containers in Bioconductor Hervé Pagès hpages@fhcrc.org Fred Hutchinson Cancer Research Center Seattle, WA, USA 21 January 2014 Introduction IRanges objects GRanges objects Splitting a GRanges
More informationWorking with aligned nucleotides (WORK- IN-PROGRESS!)
Working with aligned nucleotides (WORK- IN-PROGRESS!) Hervé Pagès Last modified: January 2014; Compiled: November 17, 2017 Contents 1 Introduction.............................. 1 2 Load the aligned reads
More informationRNA-Seq in Galaxy: Tuxedo protocol. Igor Makunin, UQ RCC, QCIF
RNA-Seq in Galaxy: Tuxedo protocol Igor Makunin, UQ RCC, QCIF Acknowledgments Genomics Virtual Lab: gvl.org.au Galaxy for tutorials: galaxy-tut.genome.edu.au Galaxy Australia: galaxy-aust.genome.edu.au
More informationHandling genomic data using Bioconductor II: GenomicRanges and GenomicFeatures
Handling genomic data using Bioconductor II: GenomicRanges and GenomicFeatures Motivating examples Genomic Features (e.g., genes, exons, CpG islands) on the genome are often represented as intervals, e.g.,
More informationCounting with summarizeoverlaps
Counting with summarizeoverlaps Valerie Obenchain Edited: August 2012; Compiled: August 23, 2013 Contents 1 Introduction 1 2 A First Example 1 3 Counting Modes 2 4 Counting Features 3 5 pasilla Data 6
More informationSequence Analysis Pipeline
Sequence Analysis Pipeline Transcript fragments 1. PREPROCESSING 2. ASSEMBLY (today) Removal of contaminants, vector, adaptors, etc Put overlapping sequence together and calculate bigger sequences 3. Analysis/Annotation
More informationNGS Data Analysis. Roberto Preste
NGS Data Analysis Roberto Preste 1 Useful info http://bit.ly/2r1y2dr Contacts: roberto.preste@gmail.com Slides: http://bit.ly/ngs-data 2 NGS data analysis Overview 3 NGS Data Analysis: the basic idea http://bit.ly/2r1y2dr
More informationA quick introduction to GRanges and GRangesList objects
A quick introduction to GRanges and GRangesList objects Hervé Pagès hpages@fredhutch.org Michael Lawrence lawrence.michael@gene.com July 2015 GRanges objects The GRanges() constructor GRanges accessors
More informationBuilding and Using Ensembl Based Annotation Packages with ensembldb
Building and Using Ensembl Based Annotation Packages with ensembldb Johannes Rainer 1 June 25, 2016 1 johannes.rainer@eurac.edu Introduction TxDb objects from GenomicFeatures provide gene model annotations:
More informationFile Formats: SAM, BAM, and CRAM. UCD Genome Center Bioinformatics Core Tuesday 15 September 2015
File Formats: SAM, BAM, and CRAM UCD Genome Center Bioinformatics Core Tuesday 15 September 2015 / BAM / CRAM NEW! http://samtools.sourceforge.net/ - deprecated! http://www.htslib.org/ - SAMtools 1.0 and
More informationTP RNA-seq : Differential expression analysis
TP RNA-seq : Differential expression analysis Overview of RNA-seq analysis Fusion transcripts detection Differential expresssion Gene level RNA-seq Transcript level Transcripts and isoforms detection 2
More informationLecture 12. Short read aligners
Lecture 12 Short read aligners Ebola reference genome We will align ebola sequencing data against the 1976 Mayinga reference genome. We will hold the reference gnome and all indices: mkdir -p ~/reference/ebola
More informationHigh-throughput sequencing: Alignment and related topic. Simon Anders EMBL Heidelberg
High-throughput sequencing: Alignment and related topic Simon Anders EMBL Heidelberg Established platforms HTS Platforms Illumina HiSeq, ABI SOLiD, Roche 454 Newcomers: Benchtop machines 454 GS Junior,
More informationHigh-throughout sequencing and using short-read aligners. Simon Anders
High-throughout sequencing and using short-read aligners Simon Anders High-throughput sequencing (HTS) Sequencing millions of short DNA fragments in parallel. a.k.a.: next-generation sequencing (NGS) massively-parallel
More informationLecture 8. Sequence alignments
Lecture 8 Sequence alignments DATA FORMATS bioawk bioawk is a program that extends awk s powerful processing of tabular data to processing tasks involving common bioinformatics formats like FASTA/FASTQ,
More informationAnalyzing ChIP- Seq Data in Galaxy
Analyzing ChIP- Seq Data in Galaxy Lauren Mills RISS ABSTRACT Step- by- step guide to basic ChIP- Seq analysis using the Galaxy platform. Table of Contents Introduction... 3 Links to helpful information...
More informationNGS Data Visualization and Exploration Using IGV
1 What is Galaxy Galaxy for Bioinformaticians Galaxy for Experimental Biologists Using Galaxy for NGS Analysis NGS Data Visualization and Exploration Using IGV 2 What is Galaxy Galaxy for Bioinformaticians
More informationNGS Analysis Using Galaxy
NGS Analysis Using Galaxy Sequences and Alignment Format Galaxy overview and Interface Get;ng Data in Galaxy Analyzing Data in Galaxy Quality Control Mapping Data History and workflow Galaxy Exercises
More information10 things (maybe) you didn t know about GenomicRanges, Biostrings, and Rsamtools
10 things (maybe) you didn t know about GenomicRanges, Biostrings, and Rsamtools Hervé Pagès hpages@fredhutch.org June 2016 1. Inner vs outer metadata columns > mcols(grl)$id
More informationPackage roar. August 31, 2018
Type Package Package roar August 31, 2018 Title Identify differential APA usage from RNA-seq alignments Version 1.16.0 Date 2016-03-21 Author Elena Grassi Maintainer Elena Grassi Identify
More informationProtocol: peak-calling for ChIP-seq data / segmentation analysis for histone modification data
Protocol: peak-calling for ChIP-seq data / segmentation analysis for histone modification data Table of Contents Protocol: peak-calling for ChIP-seq data / segmentation analysis for histone modification
More informationGenomic Files. University of Massachusetts Medical School. October, 2014
.. Genomic Files University of Massachusetts Medical School October, 2014 2 / 39. A Typical Deep-Sequencing Workflow Samples Fastq Files Fastq Files Sam / Bam Files Various files Deep Sequencing Further
More informationPackage GenomicAlignments
Package GenomicAlignments November 26, 2017 Title Representation and manipulation of short genomic alignments Description Provides efficient containers for storing and manipulating short genomic alignments
More informationResequencing Analysis. (Pseudomonas aeruginosa MAPO1 ) Sample to Insight
Resequencing Analysis (Pseudomonas aeruginosa MAPO1 ) 1 Workflow Import NGS raw data Trim reads Import Reference Sequence Reference Mapping QC on reads Variant detection Case Study Pseudomonas aeruginosa
More informationAn Introduction to VariantTools
An Introduction to VariantTools Michael Lawrence, Jeremiah Degenhardt January 25, 2018 Contents 1 Introduction 2 2 Calling single-sample variants 2 2.1 Basic usage..............................................
More informationHigh-throughput sequencing: Alignment and related topic. Simon Anders EMBL Heidelberg
High-throughput sequencing: Alignment and related topic Simon Anders EMBL Heidelberg Established platforms HTS Platforms Illumina HiSeq, ABI SOLiD, Roche 454 Newcomers: Benchtop machines: Illumina MiSeq,
More informationGalaxy Platform For NGS Data Analyses
Galaxy Platform For NGS Data Analyses Weihong Yan wyan@chem.ucla.edu Collaboratory Web Site http://qcb.ucla.edu/collaboratory Collaboratory Workshops Workshop Outline ü Day 1 UCLA galaxy and user account
More informationGalaxy workshop at the Winter School Igor Makunin
Galaxy workshop at the Winter School 2016 Igor Makunin i.makunin@uq.edu.au Winter school, UQ, July 6, 2016 Plan Overview of the Genomics Virtual Lab Introduce Galaxy, a web based platform for analysis
More informationMapping RNA sequence data (Part 1: using pathogen portal s RNAseq pipeline) Exercise 6
Mapping RNA sequence data (Part 1: using pathogen portal s RNAseq pipeline) Exercise 6 The goal of this exercise is to retrieve an RNA-seq dataset in FASTQ format and run it through an RNA-sequence analysis
More informationRsubread package: high-performance read alignment, quantification and mutation discovery
Rsubread package: high-performance read alignment, quantification and mutation discovery Wei Shi 14 September 2015 1 Introduction This vignette provides a brief description to the Rsubread package. For
More informationColorado State University Bioinformatics Algorithms Assignment 6: Analysis of High- Throughput Biological Data Hamidreza Chitsaz, Ali Sharifi- Zarchi
Colorado State University Bioinformatics Algorithms Assignment 6: Analysis of High- Throughput Biological Data Hamidreza Chitsaz, Ali Sharifi- Zarchi Although a little- bit long, this is an easy exercise
More informationEnsembl RNASeq Practical. Overview
Ensembl RNASeq Practical The aim of this practical session is to use BWA to align 2 lanes of Zebrafish paired end Illumina RNASeq reads to chromosome 12 of the zebrafish ZV9 assembly. We have restricted
More informationRsubread package: high-performance read alignment, quantification and mutation discovery
Rsubread package: high-performance read alignment, quantification and mutation discovery Wei Shi 14 September 2015 1 Introduction This vignette provides a brief description to the Rsubread package. For
More informationGoal: Learn how to use various tool to extract information from RNAseq reads.
ESSENTIALS OF NEXT GENERATION SEQUENCING WORKSHOP 2017 Class 4 RNAseq Goal: Learn how to use various tool to extract information from RNAseq reads. Input(s): Output(s): magnaporthe_oryzae_70-15_8_supercontigs.fasta
More informationGenomic Files. University of Massachusetts Medical School. October, 2015
.. Genomic Files University of Massachusetts Medical School October, 2015 2 / 55. A Typical Deep-Sequencing Workflow Samples Fastq Files Fastq Files Sam / Bam Files Various files Deep Sequencing Further
More informationssviz: A small RNA-seq visualizer and analysis toolkit
ssviz: A small RNA-seq visualizer and analysis toolkit Diana HP Low Institute of Molecular and Cell Biology Agency for Science, Technology and Research (A*STAR), Singapore dlow@imcb.a-star.edu.sg August
More informationsegmentseq: methods for detecting methylation loci and differential methylation
segmentseq: methods for detecting methylation loci and differential methylation Thomas J. Hardcastle October 30, 2018 1 Introduction This vignette introduces analysis methods for data from high-throughput
More informationIRanges and GenomicRanges An introduction
IRanges and GenomicRanges An introduction Kasper Daniel Hansen CSAMA, Brixen 2011 1 / 28 Why you should care IRanges and GRanges are data structures I use often to solve a variety of
More informationData: ftp://ftp.broad.mit.edu/pub/users/bhaas/rnaseq_workshop/rnaseq_workshop_dat a.tgz. Software:
A Tutorial: De novo RNA- Seq Assembly and Analysis Using Trinity and edger The following data and software resources are required for following the tutorial: Data: ftp://ftp.broad.mit.edu/pub/users/bhaas/rnaseq_workshop/rnaseq_workshop_dat
More informationSAM / BAM Tutorial. EMBL Heidelberg. Course Materials. Tobias Rausch September 2012
SAM / BAM Tutorial EMBL Heidelberg Course Materials Tobias Rausch September 2012 Contents 1 SAM / BAM 3 1.1 Introduction................................... 3 1.2 Tasks.......................................
More informationSingle/paired-end RNAseq analysis with Galaxy
October 016 Single/paired-end RNAseq analysis with Galaxy Contents: 1. Introduction. Quality control 3. Alignment 4. Normalization and read counts 5. Workflow overview 6. Sample data set to test the paired-end
More informationSAM : Sequence Alignment/Map format. A TAB-delimited text format storing the alignment information. A header section is optional.
Alignment of NGS reads, samtools and visualization Hands-on Software used in this practical BWA MEM : Burrows-Wheeler Aligner. A software package for mapping low-divergent sequences against a large reference
More informationPODKAT. An R Package for Association Testing Involving Rare and Private Variants. Ulrich Bodenhofer
Software Manual Institute of Bioinformatics, Johannes Kepler University Linz PODKAT An R Package for Association Testing Involving Rare and Private Variants Ulrich Bodenhofer Institute of Bioinformatics,
More informationRNA-seq. Manpreet S. Katari
RNA-seq Manpreet S. Katari Evolution of Sequence Technology Normalizing the Data RPKM (Reads per Kilobase of exons per million reads) Score = R NT R = # of unique reads for the gene N = Size of the gene
More informationITMO Ecole de Bioinformatique Hands-on session: smallrna-seq N. Servant 21 rd November 2013
ITMO Ecole de Bioinformatique Hands-on session: smallrna-seq N. Servant 21 rd November 2013 1. Data and objectives We will use the data from GEO (GSE35368, Toedling, Servant et al. 2011). Two samples were
More informationRepresenting sequencing data in Bioconductor
Representing sequencing data in Bioconductor Mark Dunning mark.dunning@cruk.cam.ac.uk Last modified: July 28, 2015 Contents 1 Accessing Genome Sequence 1 1.1 Alphabet Frequencies...................................
More informationRNA-Seq Analysis With the Tuxedo Suite
June 2016 RNA-Seq Analysis With the Tuxedo Suite Dena Leshkowitz Introduction In this exercise we will learn how to analyse RNA-Seq data using the Tuxedo Suite tools: Tophat, Cuffmerge, Cufflinks and Cuffdiff.
More informationsegmentseq: methods for detecting methylation loci and differential methylation
segmentseq: methods for detecting methylation loci and differential methylation Thomas J. Hardcastle October 13, 2015 1 Introduction This vignette introduces analysis methods for data from high-throughput
More informationPackage SCAN.UPC. October 9, Type Package. Title Single-channel array normalization (SCAN) and University Probability of expression Codes (UPC)
Package SCAN.UPC October 9, 2013 Type Package Title Single-channel array normalization (SCAN) and University Probability of expression Codes (UPC) Version 2.0.2 Author Stephen R. Piccolo and W. Evan Johnson
More informationNGS FASTQ file format
NGS FASTQ file format Line1: Begins with @ and followed by a sequence idenefier and opeonal descripeon Line2: Raw sequence leiers Line3: + Line4: Encodes the quality values for the sequence in Line2 (see
More informationHow to store and visualize RNA-seq data
How to store and visualize RNA-seq data Gabriella Rustici Functional Genomics Group gabry@ebi.ac.uk EBI is an Outstation of the European Molecular Biology Laboratory. Talk summary How do we archive RNA-seq
More informationGenerating and using Ensembl based annotation packages
Generating and using Ensembl based annotation packages Johannes Rainer Modified: 9 October, 2015. Compiled: January 19, 2016 Contents 1 Introduction 1 2 Using ensembldb annotation packages to retrieve
More informationDr. Gabriela Salinas Dr. Orr Shomroni Kaamini Rhaithata
Analysis of RNA sequencing data sets using the Galaxy environment Dr. Gabriela Salinas Dr. Orr Shomroni Kaamini Rhaithata Microarray and Deep-sequencing core facility 30.10.2017 RNA-seq workflow I Hypothesis
More informationChIP-seq hands-on practical using Galaxy
ChIP-seq hands-on practical using Galaxy In this exercise we will cover some of the basic NGS analysis steps for ChIP-seq using the Galaxy framework: Quality control Mapping of reads using Bowtie2 Peak-calling
More informationThe SAM Format Specification (v1.3 draft)
The SAM Format Specification (v1.3 draft) The SAM Format Specification Working Group July 15, 2010 1 The SAM Format Specification SAM stands for Sequence Alignment/Map format. It is a TAB-delimited text
More informationMaximizing Public Data Sources for Sequencing and GWAS
Maximizing Public Data Sources for Sequencing and GWAS February 4, 2014 G Bryce Christensen Director of Services Questions during the presentation Use the Questions pane in your GoToWebinar window Agenda
More informationCyverse tutorial 1 Logging in to Cyverse and data management. Open an Internet browser window and navigate to the Cyverse discovery environment:
Cyverse tutorial 1 Logging in to Cyverse and data management Open an Internet browser window and navigate to the Cyverse discovery environment: https://de.cyverse.org/de/ Click Log in with your CyVerse
More information11/8/2017 Trinity De novo Transcriptome Assembly Workshop trinityrnaseq/rnaseq_trinity_tuxedo_workshop Wiki GitHub
trinityrnaseq / RNASeq_Trinity_Tuxedo_Workshop Trinity De novo Transcriptome Assembly Workshop Brian Haas edited this page on Oct 17, 2015 14 revisions De novo RNA-Seq Assembly and Analysis Using Trinity
More informationThe SAM Format Specification (v1.3-r837)
The SAM Format Specification (v1.3-r837) The SAM Format Specification Working Group November 18, 2010 1 The SAM Format Specification SAM stands for Sequence Alignment/Map format. It is a TAB-delimited
More informationHigh-level S4 containers for HTS data
High-level S4 containers for HTS data Hervé Pagès hpages@fhcrc.org Fred Hutchinson Cancer Research Center Seattle, WA July 2013 Introduction Most frequently seen low-level containers Rle objects IRanges
More informationAn Introduction to the genoset Package
An Introduction to the genoset Package Peter M. Haverty April 4, 2013 Contents 1 Introduction 2 1.1 Creating Objects........................................... 2 1.2 Accessing Genome Information...................................
More informationNGS : reads quality control
NGS : reads quality control Data used in this tutorials are available on https:/urgi.versailles.inra.fr/download/tuto/ngs-readsquality-control. Select genome solexa.fasta, illumina.fastq, solexa.fastq
More informationIntroduction to GenomicFiles
Valerie Obenchain, Michael Love, Martin Morgan Last modified: October 2014; Compiled: October 30, 2018 Contents 1 Introduction.............................. 1 2 Quick Start..............................
More informationUseful software utilities for computational genomics. Shamith Samarajiwa CRUK Autumn School in Bioinformatics September 2017
Useful software utilities for computational genomics Shamith Samarajiwa CRUK Autumn School in Bioinformatics September 2017 Overview Search and download genomic datasets: GEOquery, GEOsearch and GEOmetadb,
More informationPackage SCAN.UPC. July 17, 2018
Type Package Package SCAN.UPC July 17, 2018 Title Single-channel array normalization (SCAN) and Universal expression Codes (UPC) Version 2.22.0 Author and Andrea H. Bild and W. Evan Johnson Maintainer
More informationBGGN-213: FOUNDATIONS OF BIOINFORMATICS (Lecture 14)
BGGN-213: FOUNDATIONS OF BIOINFORMATICS (Lecture 14) Genome Informatics (Part 1) https://bioboot.github.io/bggn213_f17/lectures/#14 Dr. Barry Grant Nov 2017 Overview: The purpose of this lab session is
More informationBasic4Cseq: an R/Bioconductor package for the analysis of 4C-seq data
Basic4Cseq: an R/Bioconductor package for the analysis of 4C-seq data Carolin Walter October 30, 2017 Contents 1 Introduction 1 1.1 Loading the package...................................... 2 1.2 Provided
More informationImport GEO Experiment into Partek Genomics Suite
Import GEO Experiment into Partek Genomics Suite This tutorial will illustrate how to: Import a gene expression experiment from GEO SOFT files Specify annotations Import RAW data from GEO for gene expression
More informationPreparation of alignments for variant calling with GATK: exercise instructions for BioHPC Lab computers
Preparation of alignments for variant calling with GATK: exercise instructions for BioHPC Lab computers Data used in the exercise We will use D. melanogaster WGS paired-end Illumina data with NCBI accessions
More informationTiling Assembly for Annotation-independent Novel Gene Discovery
Tiling Assembly for Annotation-independent Novel Gene Discovery By Jennifer Lopez and Kenneth Watanabe Last edited on September 7, 2015 by Kenneth Watanabe The following procedure explains how to run the
More informationMapping NGS reads for genomics studies
Mapping NGS reads for genomics studies Valencia, 28-30 Sep 2015 BIER Alejandro Alemán aaleman@cipf.es Genomics Data Analysis CIBERER Where are we? Fastq Sequence preprocessing Fastq Alignment BAM Visualization
More informationThe QoRTs Analysis Pipeline Example Walkthrough
The QoRTs Analysis Pipeline Example Walkthrough Stephen Hartley National Human Genome Research Institute National Institutes of Health October 31, 2017 QoRTs v1.0.1 JunctionSeq v1.9.0 Contents 1 Overview
More informationUsing the GenomicFeatures package
Using the GenomicFeatures package Marc Carlson Fred Hutchinson Cancer Research Center December 10th 2010 Bioconductor Annotation Packages: a bigger picture PLATFORM PKGS GENE ID HOMOLOGY PKGS GENE ID ORG
More informationExomeDepth. Vincent Plagnol. May 15, What is new? 1. 4 Load an example dataset 4. 6 CNV calling 5. 9 Visual display 9
ExomeDepth Vincent Plagnol May 15, 2016 Contents 1 What is new? 1 2 What ExomeDepth does and tips for QC 2 2.1 What ExomeDepth does and does not do................................. 2 2.2 Useful quality
More informationUsing the Galaxy Local Bioinformatics Cloud at CARC
Using the Galaxy Local Bioinformatics Cloud at CARC Lijing Bu Sr. Research Scientist Bioinformatics Specialist Center for Evolutionary and Theoretical Immunology (CETI) Department of Biology, University
More informationExeter Sequencing Service
Exeter Sequencing Service A guide to your denovo RNA-seq results An overview Once your results are ready, you will receive an email with a password-protected link to them. Click the link to access your
More informationChIP-seq (NGS) Data Formats
ChIP-seq (NGS) Data Formats Biological samples Sequence reads SRA/SRF, FASTQ Quality control SAM/BAM/Pileup?? Mapping Assembly... DE Analysis Variant Detection Peak Calling...? Counts, RPKM VCF BED/narrowPeak/
More informationBioinformatics in next generation sequencing projects
Bioinformatics in next generation sequencing projects Rickard Sandberg Assistant Professor Department of Cell and Molecular Biology Karolinska Institutet March 2011 Once sequenced the problem becomes computational
More informationThe software and data for the RNA-Seq exercise are already available on the USB system