riboseqr Introduction Getting Data Workflow Example Thomas J. Hardcastle, Betty Y.W. Chung October 30, 2017

Similar documents
segmentseq: methods for detecting methylation loci and differential methylation

Advanced analysis using bayseq; generic distribution definitions

segmentseq: methods for detecting methylation loci and differential methylation

Image Analysis with beadarray

bayseq: Empirical Bayesian analysis of patterns of differential expression in count data

Visualisation, transformations and arithmetic operations for grouped genomic intervals

Working with aligned nucleotides (WORK- IN-PROGRESS!)

Creating IGV HTML reports with tracktables

An Overview of the S4Vectors package

MMDiff2: Statistical Testing for ChIP-Seq Data Sets

An Introduction to ShortRead

INSPEcT - INference of Synthesis, Processing and degradation rates in Time-course analysis

Creating a New Annotation Package using SQLForge

INSPEcT - INference of Synthesis, Processing and degradation rates in Time-course analysis

PROPER: PROspective Power Evaluation for RNAseq

Overview of ensemblvep

The bumphunter user s guide

An Introduction to GenomeInfoDb

Introduction to GenomicFiles

A complete analysis of peptide microarray binding data using the pepstat framework

MIGSA: Getting pbcmc datasets

Overview of ensemblvep

Triform: peak finding in ChIP-Seq enrichment profiles for transcription factors

How to Use pkgdeptools

MultipleAlignment Objects

RMassBank for XCMS. Erik Müller. January 4, Introduction 2. 2 Input files LC/MS data Additional Workflow-Methods 2

DupChecker: a bioconductor package for checking high-throughput genomic data redundancy in meta-analysis

M 3 D: Statistical Testing for RRBS Data Sets

ssviz: A small RNA-seq visualizer and analysis toolkit

Introduction to QDNAseq

SCnorm: robust normalization of single-cell RNA-seq data

Introduction: microarray quality assessment with arrayqualitymetrics

Basic4Cseq: an R/Bioconductor package for the analysis of 4C-seq data

rhdf5 - HDF5 interface for R

Introduction to BatchtoolsParam

Introduction to the Codelink package

Using branchpointer for annotation of intronic human splicing branchpoints

NacoStringQCPro. Dorothee Nickles, Thomas Sandmann, Robert Ziman, Richard Bourgon. Modified: April 10, Compiled: April 24, 2017.

Shrinkage of logarithmic fold changes

Creating a New Annotation Package using SQLForge

The analysis of rtpcr data

The REDseq user s guide

The rtracklayer package

survsnp: Power and Sample Size Calculations for SNP Association Studies with Censored Time to Event Outcomes

Making and Utilizing TxDb Objects

Cardinal design and development

How To Use GOstats and Category to do Hypergeometric testing with unsupported model organisms

Basic4Cseq: an R/Bioconductor package for the analysis of 4C-seq data

mocluster: Integrative clustering using multiple

Count outlier detection using Cook s distance

GenoGAM: Genome-wide generalized additive

An Introduction to the methylumi package

CQN (Conditional Quantile Normalization)

Analysis of two-way cell-based assays

ExomeDepth. Vincent Plagnol. May 15, What is new? 1. 4 Load an example dataset 4. 6 CNV calling 5. 9 Visual display 9

Introduction to the TSSi package: Identification of Transcription Start Sites

How to use cghmcr. October 30, 2017

An Introduction to the methylumi package

HowTo: Querying online Data

Bioconductor: Annotation Package Overview

SimBindProfiles: Similar Binding Profiles, identifies common and unique regions in array genome tiling array data

Comparing methods for differential expression analysis of RNAseq data with the compcoder

RAPIDR. Kitty Lo. November 20, Intended use of RAPIDR 1. 2 Create binned counts file from BAMs Masking... 1

SCnorm: robust normalization of single-cell RNA-seq data

Analyse RT PCR data with ddct

SIBER User Manual. Pan Tong and Kevin R Coombes. May 27, Introduction 1

MethylAid: Visual and Interactive quality control of Illumina Human DNA Methylation array data

Getting started with flowstats

RNAseq Differential Expression Analysis Jana Schor and Jörg Hackermüller November, 2017

Fitting Baselines to Spectra

Analysis of Bead-level Data using beadarray

Bioconductor L A T E X Style 2.0

A wrapper to query DGIdb using R

Bioconductor L A T E X Style 2.0

An Introduction to the NarrowPeaks Package: Analysis of Transcription Factor Binding ChIPseq Data using Functional PCA

Analysis of Pairwise Interaction Screens

qplexanalyzer Matthew Eldridge, Kamal Kishore, Ashley Sawle January 4, Overview 1 2 Import quantitative dataset 2 3 Quality control 2

highscreen: High-Throughput Screening for Plate Based Assays

An Introduction to TransView

How to use CNTools. Overview. Algorithms. Jianhua Zhang. April 14, 2011

An introduction to gaucho

Analysis of screens with enhancer and suppressor controls

Seminar III: R/Bioconductor

GxE.scan. October 30, 2018

Comprehensive Pipeline for Analyzing and Visualizing Array-Based CGH Data

Using the qrqc package to gather information about sequence qualities

Wave correction for arrays

Classification of Breast Cancer Clinical Stage with Gene Expression Data

Introduction: microarray quality assessment with arrayqualitymetrics

Managing and analyzing multiple NGS samples with Bioconductor bamviews objects: application to RNA-seq

MLSeq package: Machine Learning Interface to RNA-Seq Data

Flexible Logo plots of symbols and alphanumeric strings using Logolas

Flexible Logo plots of symbols and alphanumeric strings using Logolas

The DMRcate package user s guide

Copy number variant detection in exome sequencing data using exomecopy

Clustering time series gene expression data with TMixClust

AllelicImbalance Vignette

RDAVIDWebService: a versatile R interface to DAVID

VariantFiltering: filtering of coding and noncoding genetic variants

Analyzing RNA-seq data for differential exon usage with the DEXSeq package

Transcription:

riboseqr Thomas J. Hardcastle, Betty Y.W. Chung October 30, 2017 Introduction Ribosome profiling extracts those parts of a coding sequence currently bound by a ribosome (and thus, are likely to be undergoing translation). Ribosomes typically cover between 20-30 bases of the mrna (dependant on conformational changes) and move along the mrna three bases at a time. Sequenced reads of a given length are thus likely to lie predominantly in a single frame relative to the start codon of the coding sequence. This package presents a set of methods for parsing ribosomal profiling data from multiple samples and aligned to coding sequences, inferring frameshifts, and plotting the average and transcript-specific behaviour of these data. Methods are also provided for extracting the data in a suitable form for differential translation analysis. For a fuller description of these methods and further examples of their use, see Chung & Hardcastle et al (2015) [1]. Getting Data riboseqr currently reads alignment data from BAM files, or from flat text files, although BAM format is recommended. Workflow Example Begin by loading the riboseqr library. > library(riboseqr) Identify the data directory for the example data. > datadir <- system.file("extdata", package = "riboseqr") The fastacds function can be used to guess at potential coding sequences from a (possibly compressed; see base::file) fasta file containing mrna transcripts (note; do not use this on a genome!). These can also be loaded into a GRanges object from an annotation file. > chlamyfasta <- paste(datadir, "/rsem_chlamy236_denovo.transcripts.fa", sep = "") > fastacds <- findcds(fastafile = chlamyfasta, + startcodon = c("atg"), + stopcodon = c("tag", "TAA", "TGA"))

The ribosomal and RNA (if available) alignment files are specified. > ribofiles <- paste(datadir, + "/chlamy236_plus_denovo_plusonly_index", c(17,3,5,7), sep = "") > rnafiles <- paste(datadir, + "/chlamy236_plus_denovo_plusonly_index", c(10,12,14,16), sep = "") The aligned ribosomal (and RNA) data can be read in using the readribodata function. The columns can be specified as a parameter of the readribodata function if the data in the alignment files are differently arranged. > ribodat <- readribodata(ribofiles, replicates = c("wt", "WT", "M", "M")) The alignments can be assigned to frames relative to the coding coordinates with the frame Counting function. > fcs <- framecounting(ribodat, fastacds) The predominant reading frame, relative to coding start, can be estimated from the frame calling (or from a set of coordinates and alignment data) for each n-mer. The weighting decribes the proportion of n-mers fitting with the most likely frameshift. The reading frame can also be readily visualised using the plotfs function. > fs <- readingframe(rc = fcs); fs [[1]] 0 712 10579 2228 1175 227 1 865 696 2095 531 358 2 316 3318 7987 1919 947 [[2]] 0 698 8485 597 160 73 1 715 537 1012 128 123 2 286 2663 3644 238 148 [[3]] 0 542 5066 266 188 91 1 435 268 447 98 99 2 215 1467 1429 171 184 frame.ml 0 0 2 0 2 [[4]] 0 1177 12129 622 201 74 1 1379 586 961 117 126 2 407 3690 3554 218 135 > plotfs(fs) 2

These can be filtered on the mean number of hits and unique hits within replicate groups to give plausible candidates for coding. Filtering can be limited to given lengths and frames, which may be inferred from the output of the readingframe function. > ffcs <- filterhits(fcs, lengths = c(27, 28), frames = list(1, 0), + hitmean = 50, unqhitmean = 10, fs = fs) We can plot the total alignment at the 5 and 3 ends of coding sequences using the plotcds function. The frames are colour coded; frame-0 is red, frame-1 is green, frame-2 is blue. > plotcds(coordinates = ffcs@cds, ribodat = ribodat, lengths = 27) Note the frameshift for 28-mers. > plotcds(coordinates = ffcs@cds, ribodat = ribodat, lengths = 28) We can plot the alignment over an individual transcript sequence using the plottranscript function. Observe that one CDS (on the right) contains the 27s in the same phase as the CDS (they are both red) while the putative CDSes to the left are not in phase with the aligned reads, suggesting either a sequence error in the transcript or a misalignment. The coverage of RNA sequenced reads is shown as a black curve (axis on the right). > plottranscript("cuff.37930.1", coordinates = ffcs@cds, + ribodata = ribodat, length = 27, cap = 200) NULL We can extract the counts from a ribocoding object using the slicecounts function > ribocounts <- slicecounts(ffcs, lengths = c(27, 28), frames = list(0, 2)) Counts for RNA-sequencing can be extracted using from the ribodata object and the coding coordinates using the rnacounts function. This is a relatively crude counting function, and alternatives have been widely described in the literature on mrna-seq. > rnacounts <- rnacounts(ribodat, ffcs@cds) These data may be used in an analysis of differential translation through comparison with the RNA-seq data. See the description of a beta-binomial analysis in the bayseq vignettes for further details. > library(bayseq) > pd <- new("countdata", replicates = ffcs@replicates, + data = list(ribocounts, rnacounts), + groups = list(ndt = c(1,1,1,1), DT = c("wt", "WT", "M", "M")), + annotation = as.data.frame(ffcs@cds), + densityfunction = bbdensity) > libsizes(pd) <- getlibsizes(pd) > pd <- getpriors(pd, cl = NULL) > pd <- getlikelihoods(pd, cl = NULL). > topcounts(pd, "DT", normalisedata = TRUE) seqnames start end width strand frame startcodon stopcodon context 3

1 Cre17.g723750.t1.3 271 471 201 * 0 ATG TGA GGAATGG 2 CUFF.9661.1 424 582 159 * 0 ATG TGA ACAATGG 3 CUFF.43721.1 6 161 156 * 2 ATG TAA GCAATGC 4 CUFF.9523.1 78 1040 963 * 2 ATG TAA AAAATGG minus3 plus1 WT.1 WT.2 M.1 M.2 Likelihood ordering FDR.DT 1 G G 3:3 1:1 2:2 5:5 0.05882353 M=WT 0.9411765 2 A G 0:0 1:1 1:1 0:0 0.05882352 M=WT 0.9411765 3 G C 1:1 3:3 2:2 0:0 0.05882350 M=WT 0.9411765 4 A G 246:246 491:491 247:247 3586:3586 0.05882193 M=WT 0.9411769 FWER.DT 1 0.9411765 2 0.9965398 3 0.9997965 4 0.9999880 4

[[1]] 0 712 10579 2228 1175 227 1 865 696 2095 531 358 2 316 3318 7987 1919 947 [[2]] 0 698 8485 597 160 73 1 715 537 1012 128 123 2 286 2663 3644 238 148 [[3]] 0 542 5066 266 188 91 1 435 268 447 98 99 2 215 1467 1429 171 184 frame.ml 0 0 2 0 2 [[4]] 0 1177 12129 622 201 74 1 1379 586 961 117 126 2 407 3690 3554 218 135 1 0 2000 4000 6000 8000 10000 Frame 0 Frame 1 Frame 2 5 Figure 1: Number of n-mers in each frame relative to coding start 27-mers are predominantly in frame-1, while 28-mers are chiefly in frame-0.

Mean number of reads 0 5 10 15 20 25 30 35 start 50 100 150 200 200 100 stop Base position relative to CDS Figure 2: Average alignment of 27-mers to 5 and 3 ends of coding sequences 6

Mean number of reads 0 10 20 30 40 start 50 100 150 200 200 100 stop Base position relative to CDS Figure 3: Average alignment of 28-mers to 5 and 3 ends of coding sequences 7

NULL chlamy236_plus_denovo_plusonly_index17 :: CUFF.37930.1 0 50 100 150 2 281 641 1040 1478 1916 2354 2792 3230 3668 Figure 4: Alignment to individual transcript 8

Session Info > sessioninfo() R Under development (unstable) (2017-10-20 r73567) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 16.04.3 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.7-bioc/r/lib/librblas.so LAPACK: /home/biocbuild/bbs-3.7-bioc/r/lib/librlapack.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats4 stats graphics grdevices utils datasets [8] methods base other attached packages: [1] bayseq_2.13.0 riboseqr_1.13.0 abind_1.4-5 [4] GenomicRanges_1.31.0 GenomeInfoDb_1.15.0 IRanges_2.13.0 [7] S4Vectors_0.17.0 BiocGenerics_0.25.0 loaded via a namespace (and not attached): [1] Rcpp_0.12.13 edger_3.21.0 knitr_1.17 [4] XVector_0.19.0 magrittr_1.5 zlibbioc_1.25.0 [7] BiocParallel_1.13.0 lattice_0.20-35 stringr_1.2.0 [10] tools_3.5.0 grid_3.5.0 seqlogo_1.45.0 [13] htmltools_0.3.6 yaml_2.1.14 rprojroot_1.2 [16] digest_0.6.12 GenomeInfoDbData_0.99.1 bitops_1.0-6 [19] RCurl_1.95-4.8 evaluate_0.10.1 rmarkdown_1.6 [22] limma_3.35.0 stringi_1.1.5 compiler_3.5.0 [25] Biostrings_2.47.0 backports_1.1.1 Rsamtools_1.31.0 [28] locfit_1.5-9.1 BiocStyle_2.7.0 References [1] BY Chung and TJ Hardcastle and JD Jones and N Irigoyen and AE Firth and DC Baulcombe and I Brierley The use of duplex-specific nuclease in ribosome profiling and a user-friendly software package for Ribo-seq data analysis. RNA (2015). 9