Alakazam: Analysis of clonal abundance and diversity

Similar documents
Alakazam: Reconstruction of Ig lineage trees

Package alakazam. March 30, 2018

Introduction to entropart

Package parfossil. February 20, 2015

Package inext. November 12, 2016

1. Estimation equations for strip transect sampling, using notation consistent with that used to

Evaluating generalization (validation) Harvard-MIT Division of Health Sciences and Technology HST.951J: Medical Decision Support

Multiple runs of the same parameter combination with R package pse

Shazam: Mutation analysis

An Introduction to the Bootstrap

Genome Assembly Using de Bruijn Graphs. Biostatistics 666

1. Determine the population mean of x denoted m x. Ans. 10 from bottom bell curve.

IQC monitoring in laboratory networks

Workshop 8: Model selection

Package shazam. January 28, 2019

Exam 4. In the above, label each of the following with the problem number. 1. The population Least Squares line. 2. The population distribution of x.

Appendix S1: A Quick Introduction to inext via Examples

Confidence Intervals: Estimators

Shazam: Tuning clonal assignment thresholds with nearest neighbor distances

Cpk: What is its Capability? By: Rick Haynes, Master Black Belt Smarter Solutions, Inc.

Simulation Supported POD Methodology and Validation for Automated Eddy Current Procedures

IMP:DisparityBox6. H. David Sheets, Dept. of Physics, Canisius College, Buffalo, NY 14208,

Bootstrap Confidence Interval of the Difference Between Two Process Capability Indices

Supplementary Material VDJServer: A Cloud-Based Analysis Portal and Data Commons for Immune Repertoire Sequences and Rearrangements

Week 7: The normal distribution and sample means

McGraw-Hill Ryerson. Data Management 12. Section 5.1 Continuous Random Variables. Continuous Random. Variables

Parthy. A test of neutrality using species abundance evenness, and parameter inference by Approximate Bayesian Computation

STA258H5. Al Nosedal and Alison Weir. Winter Al Nosedal and Alison Weir STA258H5 Winter / 27

R-package simctest A Short Introduction

Stochastic Simulation: Algorithms and Analysis

10.4 Measures of Central Tendency and Variation

10.4 Measures of Central Tendency and Variation

Using SAS to Manage Biological Species Data and Calculate Diversity Indices

More Summer Program t-shirts

Section 4 Matching Estimator

idem: Inference in Randomized Controlled Trials with Death a... Missingness

Bayesian Background Estimation

Analysis of real-time qpcr data Mahmoud Ahmed

Simulation and resampling analysis in R

Using the tronco package

6. More Loops, Control Structures, and Bootstrapping

Confidence Intervals. Dennis Sun Data 301

K-fold cross validation in the Tidyverse Stephanie J. Spielman 11/7/2017

Lecture 25: Review I

Package fitplc. March 2, 2017

(R) / / / / / / / / / / / / Statistics/Data Analysis

GenStat for Schools. Disappearing Rock Wren in Fiordland

Package capwire. February 19, 2015

And the benefits are immediate minimal changes to the interface allow you and your teams to access these

The Bootstrap and Jackknife

6. More Loops, Control Structures, and Bootstrapping

Updated Sections 3.5 and 3.6

Notes on Simulations in SAS Studio

STATISTICAL THINKING IN PYTHON II. Generating bootstrap replicates

Contents III. 1 Introduction 1

The preseq Manual. Timothy Daley Victoria Helus Andrew Smith. January 17, 2014

Screening Design Selection

Bootstrapped and Means Trimmed One Way ANOVA and Multiple Comparisons in R

Package PhyloMeasures

= = P. IE 434 Homework 2 Process Capability. Kate Gilland 10/2/13. Figure 1: Capability Analysis

PEER Report Addendum.

While you are writing your code, make sure to include comments so that you will remember why you are doing what you are doing!

MIPE: Model Informing Probability of Eradication of non-indigenous aquatic species. User Manual. Version 2.4

Recent advances in Metamodel of Optimal Prognosis. Lectures. Thomas Most & Johannes Will

Chapter 2 Describing, Exploring, and Comparing Data

Package spcadjust. September 29, 2016

Keywords hierarchic clustering, distance-determination, adaptation of quality threshold algorithm, depth-search, the best first search.

QQ normality plots Harvey Motulsky, GraphPad Software Inc. July 2013

Robust Methods of Performing One Way RM ANOVA in R

Introduction to the Practice of Statistics using R: Chapter 6

Modeling in the Tidyverse. Max Kuhn (RStudio)

E-Campus Inferential Statistics - Part 2

Package infer. July 11, Type Package Title Tidy Statistical Inference Version 0.3.0

Version 1.0. General Certificate of Education (A-level) June Unit 3: Investigative and practical skills in AS. Physics. Final.

About This Version ParkSEIS 3.0

Stat 427/527: Advanced Data Analysis I

Statistical Graphs & Charts

Analytical Modeling of Parallel Systems. To accompany the text ``Introduction to Parallel Computing'', Addison Wesley, 2003.

Nonparametric Approaches to Regression

COPYRIGHTED MATERIAL CONTENTS

Package rptr. January 4, 2018

Notes on Fuzzy Set Ordination

Package metafunction

8.11 Multivariate regression trees (MRT)

Alakazam: Amino acid physicochemical property analysis

STATGRAPHICS PLUS for WINDOWS

Critical Numbers, Maximums, & Minimum

Optimizing Pharmaceutical Production Processes Using Quality by Design Methods

A technique for constructing monotonic regression splines to enable non-linear transformation of GIS rasters

Predict Outcomes and Reveal Relationships in Categorical Data

Nature Methods: doi: /nmeth Supplementary Figure 1

Package fbroc. August 29, 2016

Key areas of updates in GeoTeric Volumetrics

Data processing. Filters and normalisation. Mélanie Pétéra W4M Core Team 31/05/2017 v 1.0.0

Chapter 2 Modeling Distributions of Data

IPS9 in R: Bootstrap Methods and Permutation Tests (Chapter 16)

Chapter 5. Fibonacci Graceful Labeling of Some Graphs

Laboratory #11. Bootstrap Estimates

Outline. Topic 16 - Other Remedies. Ridge Regression. Ridge Regression. Ridge Regression. Robust Regression. Regression Trees. Piecewise Linear Model

Anaquin - Vignette Ted Wong January 05, 2019

Transcription:

Alakazam: Analysis of clonal abundance and diversity Jason Anthony Vander Heiden 2018-03-30 Contents Example data........................................... 1 Generate a clonal abundance curve............................... 2 Generate a diversity curve.................................... 3 Test diversity at a fixed diversity order............................. 5 The clonal diversity of the repertoire can be analyzed using the general form of the diversity index, as proposed by Hill in: Hill, M. Diversity and evenness: a unifying notation and its consequences. Ecology 54, 427-432 (1973). Coupled with resampling strategies to correct for variations in sequencing depth, as well as inferrence of complete clonal abundance distributions as described in: Chao A, et al. Rarefaction and extrapolation with Hill numbers: A framework for sampling and estimation in species diversity studies. Ecol Monogr. 2014 84:45-67. Chao A, et al. Unveiling the species-rank abundance distribution by generalizing the Good-Turing sample coverage theory. Ecology. 2015 96, 11891201. This package provides methods for the inferrence of a complete clonal abundance distribution, using the estimateabundance function, along with two approaches to assess diversity of these distributions: 1. Generation of a smooth diversity (D) curve over a range of diversity orders (q) using rarefydiversity. 2. A significance test of the diversity (D) at a fixed diversity order (q) using testdiversity. Example data A small example Change-O database, ExampleDb, is included in the alakazam package. Diversity calculation requires the CLONE field (column) to be present in the Change-O file, as well as an additional grouping column. In this example we will use the grouping columns SAMPLE and ISOTYPE. # Load required packages library(alakazam) # Load example data data(exampledb) 1

Generate a clonal abundance curve A simple table of the observed clonal abundance counts and frequencies may be generated using the countclones function either without copy numbers, where the size of each clone is determined by the number of sequence members: # Partitions the data based on the SAMPLE column clones <- countclones(exampledb, groups="sample") head(clones, 5) ## # A tibble: 5 x 4 ## # Groups: SAMPLE [1] ## SAMPLE CLONE SEQ_COUNT SEQ_FREQ ## <chr> <chr> <int> <dbl> ## 1 +7d 3128 100 0.100 ## 2 +7d 3100 50 0.0500 ## 3 +7d 3141 44 0.0440 ## 4 +7d 3177 30 0.0300 ## 5 +7d 3170 28 0.0280 You may also specify a column containing the abundance count of each sequence (usually copy numbers), that will including weighting of each clone size by the corresponding abundance count. Furthermore, multiple grouping columns may be specified such that SEQ_FREQ (unwieghted clone size as a fraction of total sequences in the group) and COPY_FREQ (weighted faction) are normalized to within multiple group data partitions. # Partitions the data based on both the SAMPLE and ISOTYPE columns # Weights the clone sizes by the DUPCOUNT column clones <- countclones(exampledb, groups=c("sample", "ISOTYPE"), copy="dupcount") head(clones, 5) ## # A tibble: 5 x 7 ## # Groups: SAMPLE, ISOTYPE [2] ## SAMPLE ISOTYPE CLONE SEQ_COUNT COPY_COUNT SEQ_FREQ COPY_FREQ ## <chr> <chr> <chr> <int> <int> <dbl> <dbl> ## 1 +7d IgA 3128 88 651 0.331 0.497 ## 2 +7d IgG 3100 49 279 0.0928 0.173 ## 3 +7d IgA 3141 44 240 0.165 0.183 ## 4 +7d IgG 3192 19 141 0.0360 0.0874 ## 5 +7d IgG 3177 29 130 0.0549 0.0806 While countclones will report observed abundances, it will not correct the distribution nor provide confidence intervals. A complete clonal abundance distribution may be inferred using the estimateabundance function with confidence intervals derived via bootstrapping. This output may be visualized using the plotabundancecurve function. # Partitions the data on the SAMPLE column # Calculates a 95% confidence interval via 200 bootstrap realizations clones <- estimateabundance(exampledb, group="sample", ci=0.95, nboot=200) # Plots a rank abundance curve of the relative clonal abundances 2

sample_colors <- c("-1h"="seagreen", "+7d"="steelblue") plot(clones, colors=sample_colors, legend_title="sample") 12.5% Rank Abundance 10.0% Abundance 7.5% 5.0% Sample +7d 1h 2.5% 0.0% 10 0 10 1 10 2 10 3 Rank Generate a diversity curve The function rarefydiversity performs uniform resampling of the input sequences and recalculates the clone size distribution, and diversity, with each resampling realization. Diversity (D) is calculated over a range of diversity orders (q) to generate a smooth curve. # Compare diversity curve across values in the "SAMPLE" column # q ranges from 0 (min_q=0) to 4 (max_q=4) in 0.05 incriments (step_q=0.05) # A 95% confidence interval will be calculated (ci=0.95) # 200 resampling realizations are performed (nboot=200) sample_div <- rarefydiversity(exampledb, "SAMPLE", min_q=0, max_q=4, step_q=0.05, ci=0.95, nboot=200) # Compare diversity curve across values in the "ISOTYPE" column # Analyse is restricted to ISOTYPE values with at least 30 sequences by min_n=30 # Excluded groups are indicated by a warning message isotype_div <- rarefydiversity(exampledb, "ISOTYPE", min_n=30, min_q=0, max_q=4, step_q=0.05, ci=0.95, nboot=200) # Plot a log-log (log_q=true, log_d=true) plot of sample diversity # Indicate number of sequences resampled from each group in the title sample_main <- paste0("sample diversity (n=", sample_div@n, ")") sample_colors <- c("-1h"="seagreen", "+7d"="steelblue") 3

plot(sample_div, colors=sample_colors, main_title=sample_main, legend_title="sample") Sample diversity (n=1000) 800 600 q D 400 Sample +7d 1h 200 0 0 1 2 3 4 q # Plot isotype diversity using default set of Ig isotype colors isotype_main <- paste0("isotype diversity (n=", isotype_div@n, ")") plot(isotype_div, colors=ig_colors, main_title=isotype_main, legend_title="isotype") 4

Isotype diversity (n=259) 200 q D 100 Isotype IgA IgD IgG IgM 0 0 1 2 3 4 q Test diversity at a fixed diversity order The function testdiversity performs resampling and diversity calculation in the same manner as rarefydiversity, but only for a single diversity order. Significance testing across groups is performed using the delta of the bootstrap distributions between groups. # Test diversity at q=2 (equivalent to Simpson s index) across values in the "SAMPLE" column # 200 bootstrap realizations are performed (nboot=200) sample_test <- testdiversity(exampledb, 2, "SAMPLE", nboot=200) # Print p-value table print(sample_test) ## test DELTA_MEAN DELTA_SD PVALUE ## 1 +7d!= -1h 518.6708 37.57775 0 # Test diversity across values in the "ISOTYPE" column # Analyse is restricted to ISOTYPE values with at least 30 sequences by min_n=30 # Excluded groups are indicated by a warning message isotype_test <- testdiversity(exampledb, 2, "ISOTYPE", min_n=30, nboot=200) # Print p-value table print(isotype_test) ## test DELTA_MEAN DELTA_SD PVALUE ## 1 IgA!= IgD 189.63466 11.696662 0 5

## 2 IgA!= IgG 26.69881 4.689725 0 ## 3 IgA!= IgM 229.35091 6.780907 0 ## 4 IgD!= IgG 162.93585 12.249270 0 ## 5 IgD!= IgM 39.71624 13.647642 0 ## 6 IgG!= IgM 202.65210 7.423980 0 # Plot the mean and standard deviations from the test plot(isotype_test, colors=ig_colors, main_title=isotype_main, legend_title="isotype") 250 Isotype diversity (n=259) 200 Mean 2 D ± SD 150 100 Isotype IgA IgD IgG IgM 50 0 IgA IgD IgG IgM 6