Count outlier detection using Cook s distance

Similar documents
Shrinkage of logarithmic fold changes

An introduction to gaucho

Clustering time series gene expression data with TMixClust

survsnp: Power and Sample Size Calculations for SNP Association Studies with Censored Time to Event Outcomes

How to Use pkgdeptools

Analysis of two-way cell-based assays

MLSeq package: Machine Learning Interface to RNA-Seq Data

Image Analysis with beadarray

An Introduction to the methylumi package

An Overview of the S4Vectors package

Visualisation, transformations and arithmetic operations for grouped genomic intervals

Advanced analysis using bayseq; generic distribution definitions

MIGSA: Getting pbcmc datasets

Expression Workshop Humberto Ortiz-Zuazaga November 14, 2014

CQN (Conditional Quantile Normalization)

How To Use GOstats and Category to do Hypergeometric testing with unsupported model organisms

segmentseq: methods for detecting methylation loci and differential methylation

How to use CNTools. Overview. Algorithms. Jianhua Zhang. April 14, 2011

The bumphunter user s guide

Introduction to the Codelink package

How to use cghmcr. October 30, 2017

Creating a New Annotation Package using SQLForge

Analysis of screens with enhancer and suppressor controls

Bioconductor: Annotation Package Overview

riboseqr Introduction Getting Data Workflow Example Thomas J. Hardcastle, Betty Y.W. Chung October 30, 2017

Working with aligned nucleotides (WORK- IN-PROGRESS!)

A complete analysis of peptide microarray binding data using the pepstat framework

segmentseq: methods for detecting methylation loci and differential methylation

MMDiff2: Statistical Testing for ChIP-Seq Data Sets

Simulation of molecular regulatory networks with graphical models

The analysis of rtpcr data

HowTo: Querying online Data

Creating a New Annotation Package using SQLForge

Analysis of Pairwise Interaction Screens

RGMQL: GenoMetric Query Language for R/Bioconductor

Visualizing RNA- Seq Differen5al Expression Results with CummeRbund

Introduction: microarray quality assessment with arrayqualitymetrics

Creating IGV HTML reports with tracktables

An Introduction to the methylumi package

An Introduction to ShortRead

Introduction to GenomicFiles

RNAseq Differential Expression Analysis Jana Schor and Jörg Hackermüller November, 2017

NacoStringQCPro. Dorothee Nickles, Thomas Sandmann, Robert Ziman, Richard Bourgon. Modified: April 10, Compiled: April 24, 2017.

Practical: exploring RNA-Seq counts Hugo Varet, Julie Aubert and Jacques van Helden

Package BiocGenerics

INSPEcT - INference of Synthesis, Processing and degradation rates in Time-course analysis

Introduction: microarray quality assessment with arrayqualitymetrics

mocluster: Integrative clustering using multiple

bayseq: Empirical Bayesian analysis of patterns of differential expression in count data

PROPER: PROspective Power Evaluation for RNAseq

Cardinal design and development

Differential Expression with DESeq2

SCnorm: robust normalization of single-cell RNA-seq data

MultipleAlignment Objects

RAPIDR. Kitty Lo. November 20, Intended use of RAPIDR 1. 2 Create binned counts file from BAMs Masking... 1

RMassBank for XCMS. Erik Müller. January 4, Introduction 2. 2 Input files LC/MS data Additional Workflow-Methods 2

Calibration of Quinine Fluorescence Emission Vignette for the Data Set flu of the R package hyperspec

Evaluation of VST algorithm in lumi package

The rtracklayer package

ssviz: A small RNA-seq visualizer and analysis toolkit

The Command Pattern in R

Overview of ensemblvep

GenoGAM: Genome-wide generalized additive

An Introduction to GenomeInfoDb

Raman Spectra of Chondrocytes in Cartilage: hyperspec s chondro data set

Overview of ensemblvep

highscreen: High-Throughput Screening for Plate Based Assays

Getting started with flowstats

Analyse RT PCR data with ddct

DupChecker: a bioconductor package for checking high-throughput genomic data redundancy in meta-analysis

INSPEcT - INference of Synthesis, Processing and degradation rates in Time-course analysis

Classification of Breast Cancer Clinical Stage with Gene Expression Data

Fitting Baselines to Spectra

M 3 D: Statistical Testing for RRBS Data Sets

crlmm to downstream data analysis

SimBindProfiles: Similar Binding Profiles, identifies common and unique regions in array genome tiling array data

GxE.scan. October 30, 2018

SIBER User Manual. Pan Tong and Kevin R Coombes. May 27, Introduction 1

ISA internals. October 18, Speeding up the ISA iteration 2

The REDseq user s guide

ExomeDepth. Vincent Plagnol. May 15, What is new? 1. 4 Load an example dataset 4. 6 CNV calling 5. 9 Visual display 9

Analyzing RNA-seq data for differential exon usage with the DEXSeq package

RDAVIDWebService: a versatile R interface to DAVID

qplexanalyzer Matthew Eldridge, Kamal Kishore, Ashley Sawle January 4, Overview 1 2 Import quantitative dataset 2 3 Quality control 2

Introduction to QDNAseq

Using the SRAdb Package to Query the Sequence Read Archive

Analysis of Bead-level Data using beadarray

Introduction to BatchtoolsParam

rhdf5 - HDF5 interface for R

An Introduction to Bioconductor s ExpressionSet Class

Wave correction for arrays

Triform: peak finding in ChIP-Seq enrichment profiles for transcription factors

Package snow. R topics documented: February 20, 2015

genefu: a package for breast cancer gene expression analysis

BioConductor Overviewr

Using the dorng package

Basic4Cseq: an R/Bioconductor package for the analysis of 4C-seq data

HMMcopy: A package for bias-free copy number estimation and robust CNA detection in tumour samples from WGS HTS data

MethylAid: Visual and Interactive quality control of Illumina Human DNA Methylation array data

Analyze Illumina Infinium methylation microarray data

Transcription:

Count outlier detection using Cook s distance Michael Love August 9, 2014 1 Run DE analysis with and without outlier removal The following vignette produces the Supplemental Figure of the effect of replacing outliers based on Cook s distance. First, we load the Bottomly et al. dataset, and subset the dataset to allow a 7 vs 7 sample comparison based on strain. Because we have 7 replicates per condition, the DESeq function automatically replaces outlier counts and refits the GLM for these genes. The argument controlling this behavior is minreplicatesforreplace which is set by default to 7. library("deseq2") Loading required package: GenomicRanges Loading required package: BiocGenerics Loading required package: parallel Attaching package: BiocGenerics The following objects are masked from package:parallel : clusterapply, clusterapplylb, clustercall, clusterevalq, clusterexport, clustermap, parapply, parcapply, parlapply, parlapplylb, parrapply, parsapply, parsapplylb The following object is masked from package:stats : xtabs The following objects are masked from package:base : Filter, Find, Map, Position, Reduce, anyduplicated, append, as.data.frame, as.vector, cbind, colnames, do.call, duplicated, eval, evalq, get, intersect, is.unsorted, lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, rank, rbind, rep.int, rownames, sapply, setdiff, sort, table, tapply, union, unique, unlist Loading required package: IRanges Loading required package: GenomeInfoDb Loading required package: Rcpp Loading required package: RcppArmadillo library("deseq2paper") data("bottomly_sumexp") dds <- DESeqDataSetFromMatrix(assay(bottomly), DataFrame(colData(bottomly)), ~strain) 1

dds <- dds[, c(8:11, 15:17, 12:14, 18:21)] as.data.frame(coldata(dds)) experiment sample.id num.tech.reps strain experiment.number SRR099230 SRX033480 SRX033480 1 C57BL/6J 6 SRR099231 SRX033481 SRX033481 1 C57BL/6J 6 SRR099232 SRX033482 SRX033482 1 C57BL/6J 6 SRR099233 SRX033483 SRX033483 1 C57BL/6J 6 SRR099237 SRX033488 SRX033488 1 C57BL/6J 7 SRR099238 SRX033489 SRX033489 1 C57BL/6J 7 SRR099239 SRX033490 SRX033490 1 C57BL/6J 7 SRR099234 SRX033484 SRX033484 1 DBA/2J 6 SRR099235 SRX033485 SRX033485 1 DBA/2J 6 SRR099236 SRX033486 SRX033486 1 DBA/2J 6 SRR099240 SRX033491 SRX033491 1 DBA/2J 7 SRR099241 SRX033492 SRX033492 1 DBA/2J 7 SRR099242 SRX033493 SRX033493 1 DBA/2J 7 SRR099243 SRX033494 SRX033494 1 DBA/2J 7 lane.number submission study sample run SRR099230 1 SRA026846 SRP004777 SRS140391 SRR099230 SRR099231 2 SRA026846 SRP004777 SRS140392 SRR099231 SRR099232 3 SRA026846 SRP004777 SRS140393 SRR099232 SRR099233 5 SRA026846 SRP004777 SRS140394 SRR099233 SRR099237 1 SRA026846 SRP004777 SRS140398 SRR099237 SRR099238 2 SRA026846 SRP004777 SRS140399 SRR099238 SRR099239 3 SRA026846 SRP004777 SRS140400 SRR099239 SRR099234 6 SRA026846 SRP004777 SRS140395 SRR099234 SRR099235 7 SRA026846 SRP004777 SRS140396 SRR099235 SRR099236 8 SRA026846 SRP004777 SRS140397 SRR099236 SRR099240 5 SRA026846 SRP004777 SRS140401 SRR099240 SRR099241 6 SRA026846 SRP004777 SRS140402 SRR099241 SRR099242 7 SRA026846 SRP004777 SRS140403 SRR099242 SRR099243 8 SRA026846 SRP004777 SRS140404 SRR099243 dds <- DESeq(dds) estimating size factors estimating dispersions gene-wise dispersion estimates mean-dispersion relationship final dispersion estimates fitting model and testing -- replacing outliers and refitting for 33 genes -- DESeq argument minreplicatesforreplace = 7 -- original counts are preserved in counts(dds)) estimating dispersions fitting model and testing Here we run again without outlier replacement, in order to obtain the uncorrected fitted coefficients. ddsnoreplace <- DESeq(dds, minreplicatesforreplace = Inf) using pre-existing size factors estimating dispersions you had estimated dispersions, replacing these gene-wise dispersion estimates mean-dispersion relationship 2

final dispersion estimates fitting model and testing 2 Select the gene with highest Cook s distance Now we pick the gene which had one of the highest Cook s distance in the initial fit. The inital Cook s distances are available as a matrix in the assays slot of the DESeqDataSet. names(assays(dds)) [1] "counts" "mu" "cooks" "replacecounts" [5] "replacecooks" maxcooks <- apply(assays(dds)[["cooks"]], 1, max) idx <- which(rownames(dds) == "ENSMUSG00000076609") unname(counts(dds)[idx, ]) [1] 0 1 3 0 4 0 1 50 0 0 1 0 3 4 3 Plot The following code produces the plot. Note that the original counts are accessible via the counts function, and the replacement counts are accessible via the replacecounts slot of the assays of dds. We find the expected normalized values q ij by accessing the model coefficients in the metadata columns of the object. makecolors <- function(y = c(-1e+06, 1e+06)) { polygon(c(-1, -1, 7.5, 7.5), c(y, y[2:1]), col = rgb(0, 1, 0, 0.1), border = NA) polygon(c(7.5, 7.5, 50, 50), c(y, y[2:1]), col = rgb(0, 0, 1, 0.1), border = NA) } line <- 0.6 adj <- -0.5 cex <- 1 par(mfrow = c(1, 3), mar = c(4.3, 4.3, 3, 1)) out <- assays(dds)[["cooks"]][idx, ] > qf(0.99, 2, ncol(dds) - 2) plot(counts(dds, normalized = TRUE)[idx, ], main = "With outlier", ylab = "normalized counts", xlab = "", pch = as.integer(coldata(dds)$strain) + 1, ylim = c(0, max(counts(dds, normalized = TRUE)[idx, ])), col = ifelse(out, "red", "black")) makecolors() q0 <- 2^(mcols(ddsNoReplace)$Intercept[idx] + mcols(ddsnoreplace)$strainc57bl.6j[idx]) q1 <- 2^(mcols(ddsNoReplace)$Intercept[idx] + mcols(ddsnoreplace)$straindba.2j[idx]) segments(1, q0, 7, q0, lty = 3) segments(8, q1, 14, q1, lty = 3) mtext("a", side = 3, line = line, adj = adj, cex = cex) plot(assays(dds)[["cooks"]][idx, ], main = "Cook's distances", ylab = "", xlab = "", log = "y", pch = as.integer(coldata(dds)$strain) + 1, col = ifelse(out, "red", "black")) makecolors(y = c(1e-05, 1e+05)) abline(h = qf(0.99, 2, ncol(dds) - 2)) mtext("b", side = 3, line = line, adj = adj, cex = cex) plot(assays(dds)[["replacecounts"]][idx, ]/sizefactors(dds), main = "Outlier replaced", 3

A With outlier B Cook's distances C Outlier replaced normalized counts 0 10 30 50 1e 03 1e+00 normalized counts 0 1 2 3 4 Figure 1: Demonstration using the Bottomly et al dataset, of detection of outlier counts (A) using Cook s distances (B), and refitting after outliers have been replaced by the trimmed median over all (C). The dotted line indicates the fitted mean on the common scale. ylab = "normalized counts", xlab = "", ylim = c(0, max(assays(dds)[["replacecounts"]][idx, ]/sizefactors(dds))), pch = as.integer(coldata(dds)$strain) + 1, col = ifelse(out, "red", "black")) makecolors() q0 <- 2^(mcols(dds)$Intercept[idx] + mcols(dds)$strainc57bl.6j[idx]) q1 <- 2^(mcols(dds)$Intercept[idx] + mcols(dds)$straindba.2j[idx]) segments(1, q0, 7, q0, lty = 3) segments(8, q1, 14, q1, lty = 3) mtext("c", side = 3, line = line, adj = adj, cex = cex) 4

4 Session information R version 3.1.0 (2014-04-10), x86_64-unknown-linux-gnu Locale: LC_CTYPE=en_US.UTF-8, LC_NUMERIC=C, LC_TIME=en_US.UTF-8, LC_COLLATE=C, LC_MONETARY=en_US.UTF-8, LC_MESSAGES=en_US.UTF-8, LC_PAPER=en_US.UTF-8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_US.UTF-8, LC_IDENTIFICATION=C Base packages: base, datasets, grdevices, graphics, methods, parallel, stats, utils Other packages: BiocGenerics 0.10.0, DESeq2 1.4.0, DESeq2paper 1.2, GenomeInfoDb 1.0.0, GenomicRanges 1.16.0, IRanges 1.21.45, Rcpp 0.11.1, RcppArmadillo 0.4.200.0 Loaded via a namespace (and not attached): AnnotationDbi 1.26.0, Biobase 2.24.0, DBI 0.2-7, RColorBrewer 1.0-5, RSQLite 0.11.4, XML 3.98-1.1, XVector 0.4.0, annotate 1.42.0, codetools 0.2-8, digest 0.6.4, evaluate 0.5.5, formatr 0.10, genefilter 1.46.0, geneplotter 1.42.0, grid 3.1.0, highr 0.3, knitr 1.5, lattice 0.20-29, locfit 1.5-9.1, splines 3.1.0, stats4 3.1.0, stringr 0.6.2, survival 2.37-7, tools 3.1.0, xtable 1.7-3 5