Getting started with flowStats

F. Hahne, N. Gopalakrishnan
October 2017

Abstract

flowStats is a collection of algorithms for the statistical analysis of flow cytometry data. So far, the focus is on automated gating and normalization.

Introduction

Because flowStats is a collection of algorithms rather than a single pipeline, writing a coherent vignette is somewhat difficult. Instead, we present a hypothetical data analysis that also makes heavy use of the functionality provided by flowCore, mainly the workflow infrastructure.

We start by loading the ITN data set that ships with the flowStats package.

> library(flowStats)
> data(ITN)

The data were acquired from blood samples of 3 groups of patients, each group containing 5 samples. Each flowFrame includes, in addition to FSC and SSC, 5 fluorescence parameters: CD3, CD4, CD8, CD69 and HLADR.

First we need to transform all the fluorescence channels. This is a good point to start using a workFlow object to keep track of our progress.

> wf <- workFlow(ITN)
> tl <- transformList(colnames(ITN)[3:7], asinh, transformationId="asinh")
> add(wf, tl)

In an initial analysis step we want to identify and subset all T-cells. This can be achieved by gating in the CD3 and SSC dimensions. However, there are several other sub-populations, so we need to either specify our selection further or segment the individual sub-populations. One solution for the latter approach is to use the mixture modelling infrastructure provided by the flowClust package. Since we are only interested in a single sub-population, the T-cells, it is much faster and easier to use the lymphGate function in the flowStats package. The idea is to first make a rough preselection in the two-dimensional projection of the data, based on expert knowledge or prior experience, and subsequently to fit a norm2Filter to this subset.
The function can also derive the preselection through back-gating: we know that CD4-positive cells are a subset of T-cells, so by estimating the CD4-positive cells first we get a rough idea of where to find the T-cells in the CD3/SSC projection.
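The two-stage idea described above (rough preselection, then a normal fit) can be illustrated on simulated data with base R alone. This is only a sketch under stated assumptions, not the lymphGate implementation: the simulated events, the rectangle limits and the `radius` value are made up for the example, and the fit uses the plain sample mean and covariance.

```r
# Toy illustration of "preselect, then fit a bivariate normal".
# Not the flowStats code; all numbers below are hypothetical.
set.seed(1)

# Simulated events: a tight target cluster plus diffuse background.
target <- cbind(x = rnorm(500, mean = 5, sd = 0.4),
                y = rnorm(500, mean = 2, sd = 0.3))
background <- cbind(x = runif(500, 0, 8), y = runif(500, 0, 8))
events <- rbind(target, background)

# Stage 1: rough rectangular preselection based on prior knowledge.
pre <- events[, "x"] > 3.5 & events[, "x"] < 6.5 &
       events[, "y"] > 0.5 & events[, "y"] < 3.5

# Stage 2: fit a bivariate normal to the preselected events and keep
# every event within a chosen Mahalanobis radius of its centre.
centre <- colMeans(events[pre, ])
covmat <- cov(events[pre, ])
radius <- 2.5  # hypothetical scale, in standard deviations
d2 <- mahalanobis(events, center = centre, cov = covmat)
in_gate <- d2 <= radius^2

# Share of target events captured and of background events admitted.
captured <- c(target = mean(in_gate[1:500]),
              background = mean(in_gate[501:1000]))
print(captured)
```

The rectangle only needs to be roughly right; the elliptical gate from the second stage adapts to the location and spread of the events that the rectangle selected.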

> lg <- lymphGate(Data(wf[["asinh"]]), channels=c("SSC", "CD3"),
+                 preselection="CD4", filterId="TCells", eval=FALSE,
+                 scale=2.5)
> add(wf, lg$n2gate, parent="asinh")
> library(flowViz)
> print(xyplot(SSC ~ CD3 | PatientID, wf[["TCells+"]],
+              par.settings=list(gate=list(col="red",
+                                          fill="red", alpha=0.3))))

[Figure: SSC versus CD3 scatter plots, one panel per patient, with the T-cell gate drawn in red.]

In the next step we want to separate T-helper and NK cells using the CD4 and CD8 stains. A convenient way of doing this is to apply a quadGate, assuming that both CD4 and CD8 are binary markers (cells are either positive or negative for CD4 and CD8). Investigators often use negative samples to derive a split point between the positive and negative populations, and apply this constant gate to all their samples. This only works if there are no unforeseen shifts in the fluorescence intensities between samples that are caused purely by technical variation rather than by biological phenotype. Let's take a look at this variation for the T-cell subset and all remaining fluorescence channels:

> pars <- colnames(Data(wf[["base view"]]))[c(3,4,5,7)]

> print(densityplot(PatientID~., Data(wf[["TCells+"]]), channels=pars, groups=GroupID,
+                   scales=list(y=list(draw=FALSE)), filter=lapply(pars, curv1Filter),
+                   layout=c(4,1)))

[Figure: per-patient density plots of CD8, CD69, CD4 and HLADr for the T-cell subset.]

Indeed the data, especially for CD4 and CD8, don't align well. At this point we could decide to compute the quadGates for each sample separately. Alternatively, we can try to normalize the data and then compute a common gate. The warpSet function can be used to normalize data according to a set of landmarks, which essentially are the peaks or high-density areas in the density estimates shown before. The ideas here are simple:

- High-density areas represent particular sub-types of cells.
- Markers are binary: cells are either positive or negative for a particular marker.
- Peaks should align if the above statements are true.

The algorithm in warpSet performs the following steps:

1. Identify landmarks for each parameter using a curv1Filter.
2. Estimate the most likely total number (k) of landmarks.
3. Perform k-means clustering to classify landmarks.
4. Estimate warping functions for each sample and parameter that best align the landmarks, given the underlying data. This step uses functionality from the fda package.
5. Transform the data using the warping functions.

The algorithm should be robust to missing peaks in some of the samples. However, the classification in step 3 then becomes harder, since it is not clear which cell population a remaining peak represents.

> norm <- normalization(normFun=function(x, parameters, ...)
+                           warpSet(x, parameters, ...),
+                       parameters=pars,
+                       arguments=list(grouping="GroupID", monwrd=TRUE),
+                       normalizationId="Warping")
> add(wf, norm, parent="TCells+")
Estimating landmarks for channel CD8 ...
Estimating landmarks for channel CD69 ...
Estimating landmarks for channel CD4 ...
Estimating landmarks for channel HLADr ...
Registering curves for parameter CD8 ...
Registering curves for parameter CD69 ...
Registering curves for parameter CD4 ...
Registering curves for parameter HLADr ...
> print(densityplot(PatientID~., Data(wf[["Warping"]]), channels=pars, groups=GroupID,
+                   scales=list(y=list(draw=FALSE)), filter=lapply(pars, curv1Filter),
+                   layout=c(4,1)))
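As a rough illustration of the landmark idea behind the warpSet steps listed earlier in this section, the following base R sketch aligns the peaks of three simulated samples. It rests on several simplifying assumptions that are not part of flowStats: the data and shifts are invented, peaks are found as local maxima of a kernel density estimate rather than with a curv1Filter, k is taken as known, and a piecewise-linear map stands in for the smooth warping functions that warpSet obtains from fda.

```r
# Toy landmark registration; a simplified stand-in, not warpSet itself.
set.seed(2)

# Three samples of one bimodal marker, each with a technical shift.
shifts <- c(0, 0.6, -0.4)  # hypothetical shifts
samples <- lapply(shifts, function(s)
  c(rnorm(1000, 1 + s, 0.3), rnorm(1000, 4 + s, 0.3)))

# Landmarks: local maxima of the density estimate, minor bumps dropped.
find_peaks <- function(x, min_rel_height = 0.2) {
  d <- density(x)
  i <- which(diff(sign(diff(d$y))) == -2) + 1
  i <- i[d$y[i] >= min_rel_height * max(d$y)]
  d$x[i]
}
landmarks <- lapply(samples, find_peaks)

# Classify all landmarks into k groups; the cluster centres serve as
# the common reference positions.
k <- 2  # assumed known in this sketch
centres <- kmeans(unlist(landmarks), centers = k, nstart = 10)$centers
reference <- sort(as.vector(centres))

# Piecewise-linear warp that moves a sample's landmarks onto the
# reference and shifts the tails by the nearest landmark's offset.
warp <- function(x, from, to) {
  lo <- min(x) - 1
  hi <- max(x) + 1
  n <- length(from)
  approx(x = c(lo, from, hi),
         y = c(lo + to[1] - from[1], to, hi + to[n] - from[n]),
         xout = x)$y
}
warped <- lapply(seq_along(samples), function(i)
  warp(samples[[i]], landmarks[[i]], reference))

# Spread of each landmark across samples, before and after warping.
spread <- function(lms)
  apply(do.call(cbind, lms), 1, function(p) diff(range(p)))
spread_before <- spread(landmarks)
spread_after <- spread(lapply(warped, find_peaks))
print(rbind(before = spread_before, after = spread_after))
```

A piecewise-linear map is the simplest monotone function that hits all landmarks exactly; it has kinks at the landmarks, which is one reason a smooth warping function is preferable on real data.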

[Figure: per-patient density plots of CD8, CD69, CD4 and HLADr after normalization.]

After normalization the data look much cleaner, and we should be able to use a single static gate for all flowFrames in order to separate CD4- and CD8-positive cells. Typically one would use a quadGate, and the quadrantGate function in flowStats can be used to estimate such a gate automatically.

> qgate <- quadrantGate(Data(wf[["Warping"]]), stains=c("CD4", "CD8"),
+                       filterId="CD4CD8", sd=3)
> add(wf, qgate, parent="Warping")
> print(xyplot(CD8 ~ CD4 | PatientID, wf[["CD4+CD8+"]],
+              par.settings=list(gate=list(fill="transparent",
+                                          col="red"))))

[Figure: CD8 versus CD4 scatter plots, one panel per patient, with the quadrant gate drawn in red.]

In a final step we might be interested in finding the proportion of activated T-helper cells by means of the CD69 stain. The rangeGate function is helpful for separating positive and negative peaks in 1D.

> CD69rg <- rangeGate(Data(wf[["Warping"]]), stain="CD69",
+                     alpha=0.75, filterId="CD4+CD8-CD69", sd=2.5)
> add(wf, CD69rg, parent="CD4+CD8-")
> print(densityplot(PatientID ~ CD69, Data(wf[["CD4+CD8-"]]), main="CD4+",
+                   groups=GroupID, refline=CD69rg@min))
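The notion of separating a positive from a negative peak in one dimension can be sketched in base R by placing the cut at the density minimum between the two main peaks. This is a simplified illustration on simulated data, not the rangeGate algorithm, and it ignores the `alpha` and `sd` arguments used in the call above.

```r
# Toy one-dimensional split at the valley between two density peaks.
# Simulated data; not the rangeGate implementation.
set.seed(3)
x <- c(rnorm(1500, mean = 1, sd = 0.4),   # negative population
       rnorm(500,  mean = 4, sd = 0.5))   # positive population

d <- density(x)

# Local maxima of the density estimate, ignoring negligible bumps.
peaks <- which(diff(sign(diff(d$y))) == -2) + 1
peaks <- peaks[d$y[peaks] >= 0.05 * max(d$y)]

# Keep the two highest peaks, ordered along the axis.
two <- sort(peaks[order(d$y[peaks], decreasing = TRUE)][1:2])

# Cut point: position of the lowest density between those peaks.
between <- two[1]:two[2]
cutpoint <- d$x[between[which.min(d$y[between])]]

# Proportion of events on the positive side of the cut.
prop_positive <- mean(x > cutpoint)
print(c(cutpoint = cutpoint, prop_positive = prop_positive))
```

When a sample has only one visible peak there is no valley to find, so any real implementation needs a fallback rule for that case.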

[Figure: "CD4+", density plots of CD69 for the CD4+CD8- subset, one row per patient, with the rangeGate cut-off as reference line.]

Probability Binning

A probability binning algorithm for quantitating multivariate distribution differences was described by Roederer et al. The algorithm identifies the flow parameter in a flowFrame with the largest variance and divides the events in the flowFrame into two subgroups based on the median of that parameter. This process is repeated on each subgroup until the number of events in each subgroup is less than a user-specified threshold.

For comparison across multiple samples, the probability binning algorithm can be applied to a control data set to obtain the positions of the bins, and the same bins can then be applied to the experimental data set. The number of events in the control and sample bins can then be compared using Pearson's chi-square test or the probability binning metric defined by Roederer et al.

Although probability binning can be applied simultaneously to all parameters in a flowFrame, with bins in n-dimensional hyperspace, we proceed with a two-dimensional example from our previous discussion involving the CD4 and CD8 populations. This simplifies the demonstration of the method and the interpretation of results. From the workflow object containing the warped data, we extract our data frame of interest.
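The recursive splitting rule described above can be written in a few lines of base R. The function `prob_bin` and its `threshold` argument are hypothetical names for this sketch, which works on a plain numeric matrix and returns the events of each bin rather than bin boundaries; it is not the proBin function.

```r
# Toy probability binning on a numeric matrix (events in rows).
# prob_bin() is a hypothetical helper, not part of flowStats.
prob_bin <- function(m, threshold) {
  # Stop once the subgroup holds fewer events than the threshold.
  if (nrow(m) < threshold) return(list(m))
  # Split on the parameter with the largest variance, at its median.
  j <- which.max(apply(m, 2, var))
  cutoff <- median(m[, j])
  lower <- m[m[, j] <= cutoff, , drop = FALSE]
  upper <- m[m[, j] > cutoff, , drop = FALSE]
  # Guard against a degenerate split (for example, all values equal).
  if (nrow(lower) == 0 || nrow(upper) == 0) return(list(m))
  c(prob_bin(lower, threshold), prob_bin(upper, threshold))
}

set.seed(4)
ctrl <- cbind(A = rnorm(1024, 0, 2), B = rnorm(1024, 0, 1))
bins <- prob_bin(ctrl, threshold = 100)
sizes <- vapply(bins, nrow, integer(1))
print(table(sizes))
```

Because every split is at a median, all bins end up with nearly the same number of events, so dense regions of the scatter plot are covered by many small bins and sparse regions by a few large ones.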

We now use probability binning to identify patients whose CD4/CD8 populations differ from a control flowFrame that we create from the data of all patients.

> dat <- Data(wf[["Warping"]])

The data are visualized below.

> print(xyplot(CD8 ~ CD4, dat, main="Experimental data set"))

[Figure: "Experimental data set", CD8 versus CD4 scatter plots, one panel per sample.]

The control data set is created by combining all the flowFrames in the flowSet. The combined flowFrame is then subsetted with a sampleFilter so that the control flowFrame has approximately the same number of events as each of the other flowFrames in our example.

> datComb <- as(dat, "flowFrame")
> subCount <- nrow(exprs(datComb))/length(dat)
> sf <- sampleFilter(filterId="mySampleFilter", size=subCount)
> fres <- filter(datComb, sf)
> ctrlData <- Subset(datComb, fres)
> ctrlData <- ctrlData[,-ncol(ctrlData)] ## remove the column named "Original"

The probability binning algorithm can then be applied to the control data. The terminating condition for the algorithm is set so that the number of events in each bin is approximately 5 percent of the total number of events in the control data.

> minRow <- subCount*0.05
> refBins <- proBin(ctrlData, minRow, channels=c("CD4","CD8"))

The binned control data can be visualized using the plotBins function. Areas in the scatter plot with a large number of data points have a higher density of bins. Each bin also contains approximately the same number of events.

> plotBins(refBins, ctrlData, channels=c("CD4","CD8"), title="Control Data")

[Figure: "Control Data", binned CD4 versus CD8 scatter plot.]

The same bin positions from the control data set are then applied to each flowFrame in our sample data set.

> sampBins <- fsApply(dat, function(x){
+     binByRef(refBins, x)
+ })

For each patient, the number of events in the control and sample bins can be compared using the calcPearsonChi function or using Roederer's probability binning metric.

> pearsonStat <- lapply(sampBins, function(x){
+     calcPearsonChi(refBins, x)
+ })
> sCount <- fsApply(dat, nrow)
> pbStat <- lapply(seq_along(sampBins), function(x){
+     calcPBChiSquare(refBins, sampBins[[x]], subCount, sCount[x])
+ })

For each sample, the results can be visualized using the plotBins function. The residuals from Roederer's probability binning metric or from Pearson's chi-square test can be used to shade the bins, highlighting the bins in each sample that differ most from the control sample.

> par(mfrow=c(4,4), mar=c(1.5,1.5,1.5,1.5))
> plotBins(refBins, ctrlData, channels=c("CD4","CD8"), title="Control Data")
> patNames <- sampleNames(dat)
> tm <- lapply(seq_len(length(dat)), function(x){
+     plotBins(refBins, dat[[x]], channels=c("CD4","CD8"),
+              title=patNames[x],
+              residuals=pearsonStat[[x]]$residuals[2,],
+              shadeFactor=0.7)
+ })

[Figure: binned CD4 versus CD8 scatter plots for the control data and each of the 15 samples, with bins shaded by their residuals.]

The patient whose CD4/CD8 populations differ most from those of the control group can be identified from the magnitude of the Pearson chi-square statistic (or the probability binning statistic).

[Table: chi-square statistic and probability binning (pbin) statistic for each of the 15 samples.]
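To make the comparison of bin counts concrete, the sketch below computes a standard Pearson chi-square statistic for control versus sample counts in base R. It is an illustration under simplifying assumptions: the data are simulated and one-dimensional, the bins are control quantile intervals holding about 5 percent of the control events each, and the helper names `count_bins` and `pearson_stat` are hypothetical. The exact formulas used by calcPearsonChi and calcPBChiSquare may differ.

```r
# Toy comparison of binned counts between a control and two samples.
# Simulated data and hypothetical helpers; not the flowStats functions.
set.seed(5)
control <- rnorm(2000)

# Equal-probability bins taken from the control distribution.
edges <- quantile(control, probs = seq(0, 1, by = 0.05))
edges[1] <- -Inf
edges[length(edges)] <- Inf
count_bins <- function(x) as.vector(table(cut(x, breaks = edges)))
ctrl_counts <- count_bins(control)

# Pearson chi-square statistic for the 2 x B table of control and
# sample counts (test of homogeneity).
pearson_stat <- function(x) {
  tab <- rbind(ctrl_counts, count_bins(x))
  expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
  sum((tab - expected)^2 / expected)
}

samples <- list(similar = rnorm(2000),
                shifted = rnorm(2000, mean = 0.5))
stats <- vapply(samples, pearson_stat, numeric(1))
most_different <- names(which.max(stats))
print(stats)
```

The sample with the largest statistic is the one whose bin counts deviate most from the control, which mirrors how the table above is read.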

- R Under development (unstable), x86_64-pc-linux-gnu
- Locale: LC_CTYPE=en_US.UTF-8, LC_NUMERIC=C, LC_TIME=en_US.UTF-8, LC_COLLATE=C, LC_MONETARY=en_US.UTF-8, LC_MESSAGES=en_US.UTF-8, LC_PAPER=en_US.UTF-8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_US.UTF-8, LC_IDENTIFICATION=C
- Running under: Ubuntu LTS
- Matrix products: default
- BLAS: /home/biocbuild/bbs-.7-bioc/r/lib/librblas.so
- LAPACK: /home/biocbuild/bbs-.7-bioc/r/lib/librlapack.so
- Base packages: base, datasets, grDevices, graphics, methods, splines, stats, utils
- Other packages: BH, Matrix, RcppArmadillo, cluster, fda, flowCore, flowStats, flowViz, flowWorkspace, lattice, ncdfFlow, xtable
- Loaded via a namespace (and not attached): Biobase, BiocGenerics, DEoptimR, FNN, IDPmisc, KernSmooth, MASS, R6, RColorBrewer, Rcpp, Rgraphviz, XML, assertthat, bindr, bindrcpp, codetools, colorspace, compiler, corpcor, data.table, digest, dplyr, glue, graph, grid, gridExtra, gtable, hexbin, htmltools, htmlwidgets, httpuv, jsonlite, knitr, ks, latticeExtra, magrittr, matrixStats, mime, misc3d, multicool, munsell, mvtnorm, parallel, pcaPP, pkgconfig, plyr, rgl, rlang, robustbase, rrcov, scales, shiny, stats4, stringi, stringr, tibble, tools, zlibbioc