CNVPanelizer: Reliable CNV detection in target sequencing applications

Similar documents
The software comes with 2 installers: (1) SureCall installer (2) GenAligners (contains BWA, BWA- MEM).

Package ChIC. R topics documented: May 4, Title Quality Control Pipeline for ChIP-Seq Data Version Author Carmen Maria Livi

Exome sequencing. Jong Kyoung Kim

Package PlasmaMutationDetector

Practical 4: ChIP-seq Peak calling

Package icnv. R topics documented: March 8, Title Integrated Copy Number Variation detection Version Author Zilu Zhou, Nancy Zhang

The software comes with 2 installers: (1) SureCall installer (2) GenAligners (contains BWA, BWA-MEM).

Rsubread package: high-performance read alignment, quantification and mutation discovery

Introduction to GE Microarray data analysis Practical Course MolBio 2012

Rsubread package: high-performance read alignment, quantification and mutation discovery

Tiling Assembly for Annotation-independent Novel Gene Discovery

Response to API 1163 and Its Impact on Pipeline Integrity Management

Robust Linear Regression (Passing- Bablok Median-Slope)

Transducers and Transducer Calibration GENERAL MEASUREMENT SYSTEM

Package saascnv. May 18, 2016

Analyzing Genomic Data with NOJAH

m6aviewer Version Documentation

NGS Data Analysis. Roberto Preste

How to use the DEGseq Package

ROTS: Reproducibility Optimized Test Statistic

Analyse RT PCR data with ddct

seqcna: A Package for Copy Number Analysis of High-Throughput Sequencing Cancer DNA

Package ic10. September 23, 2015

Evaluation of outlier detection methods for cut-point determination of immunogenicity screening and confirmatory assays

Package cgh. R topics documented: February 19, 2015

R topics documented: 2 R topics documented:

MS&E 226: Small Data

methylmnm Tutorial Yan Zhou, Bo Zhang, Nan Lin, BaoXue Zhang and Ting Wang January 14, 2013

Package polyphemus. February 15, 2013

Package aop. December 5, 2016

DRAGEN Bio-IT Platform Enabling the Global Genomic Infrastructure

BaseSpace - MiSeq Reporter Software v2.4 Release Notes

The preseq Manual. Timothy Daley Victoria Helus Andrew Smith. January 17, 2014

Tutorial. OTU Clustering Step by Step. Sample to Insight. March 2, 2017

Tutorial. OTU Clustering Step by Step. Sample to Insight. June 28, 2018

Cross-validation and the Bootstrap

The Bootstrap and Jackknife

Tutorial. Identification of Variants in a Tumor Sample. Sample to Insight. November 21, 2017

Multivariate Analysis Multivariate Calibration part 2

BadRegionFinder an R/Bioconductor package for identifying regions with bad coverage

Protocol: peak-calling for ChIP-seq data / segmentation analysis for histone modification data

Introduction. Library quantification and rebalancing. Prepare Library Sequence Analyze Data. Library rebalancing with the iseq 100 System

How to use CNTools. Overview. Algorithms. Jianhua Zhang. April 14, 2011

Helpful Galaxy screencasts are available at:

RandomForest Documentation

FVGWAS- 3.0 Manual. 1. Schematic overview of FVGWAS

Cross-validation and the Bootstrap

Package CopywriteR. April 20, 2018

Sequence Analysis Pipeline

Package lol. R topics documented: December 13, Type Package Title Lots Of Lasso Version Date Author Yinyin Yuan

Applying Data-Driven Normalization Strategies for qpcr Data Using Bioconductor

Copy Number Variations Detection - TD. Using Sequenza under Galaxy

ChIP-Seq Tutorial on Galaxy

Gene Survey: FAQ. Gene Survey: FAQ Tod Casasent DRAFT

Anaquin - Vignette Ted Wong January 05, 2019

SOLOMON: Parentage Analysis 1. Corresponding author: Mark Christie

How do we obtain reliable estimates of performance measures?

QIAseq Targeted RNAscan Panel Analysis Plugin USER MANUAL

HIPPIE User Manual. (v0.0.2-beta, 2015/4/26, Yih-Chii Hwang, yihhwang [at] mail.med.upenn.edu)

Package RTCGAToolbox

Design of Experiments

Package EMDomics. November 11, 2018

Package HTSeqGenie. April 16, 2019

freebayes in depth: model, filtering, and walkthrough Erik Garrison Wellcome Trust Sanger of Iowa May 19, 2015

ExomeDepth. Vincent Plagnol. May 15, What is new? 1. 4 Load an example dataset 4. 6 CNV calling 5. 9 Visual display 9

Supplementary Information. Detecting and annotating genetic variations using the HugeSeq pipeline

QIAseq DNA V3 Panel Analysis Plugin USER MANUAL

Analyzing ChIP- Seq Data in Galaxy

/ Computational Genomics. Normalization

Developing Effect Sizes for Non-Normal Data in Two-Sample Comparison Studies

The software and data for the RNA-Seq exercise are already available on the USB system

Maximizing Public Data Sources for Sequencing and GWAS

Ion AmpliSeq Designer: Getting Started

Package conumee. February 17, 2018

Reference & Track Manager

Intro to NGS Tutorial

Package twilight. August 3, 2013

Resampling Methods. Levi Waldron, CUNY School of Public Health. July 13, 2016

Next-Generation Sequencing applied to adna

Package GOTHiC. November 22, 2017

Release Notes. JMP Genomics. Version 4.0

The simpleboot Package

Database Repository and Tools

User s Guide. Using the R-Peridot Graphical User Interface (GUI) on Windows and GNU/Linux Systems

The ChIP-seq quality Control package ChIC: A short introduction

Intrusion Detection using NASA HTTP Logs AHMAD ARIDA DA CHEN

LightCycler 480. Instrument and Software Training. Relative Quantification Module. (including Melting Curve)

Package pcr. November 20, 2017

Package ABSSeq. September 6, 2018

Metrics for Performance Evaluation How to evaluate the performance of a model? Methods for Performance Evaluation How to obtain reliable estimates?

Data Analysis and Solver Plugins for KSpread USER S MANUAL. Tomasz Maliszewski

Microarray Data Analysis (VI) Preprocessing (ii): High-density Oligonucleotide Arrays

ChromHMM: automating chromatin-state discovery and characterization

An Introduction to VariantTools

Package ezsummary. August 29, 2016

A manual for the use of mirvas

Package corclass. R topics documented: January 20, 2016

User's Guide to DNASTAR SeqMan NGen For Windows, Macintosh and Linux

IQC monitoring in laboratory networks

RNA-Seq. Joshua Ainsley, PhD Postdoctoral Researcher Lab of Leon Reijmers Neuroscience Department Tufts University

Transcription:

CNVPanelizer: Reliable CNV detection in target sequencing applications Oliveira, Cristiano cristiano.oliveira@med.uni-heidelberg.de Wolf, Thomas thomas_wolf71@gmx.de April 30, 2018 Amplicon based targeted sequencing, over the last few years, has become a mainstay in the clinical use of next generation sequencing technologies. For the detection of somatic and germline SNPs this has been proven to be a highly robust methodology. One area of genomic analysis which is usually not covered by targeted sequencing, is the detection of copy number variations (CNVs). While a large number of available algorithms and software address the problem of CNV detection in whole genome or whole exome sequencing, there are no such established tools for amplicon based targeted sequencing. We introduced a novel algorithm for the reliable detection of CNVs from targeted sequencing. 1 Introduction To assess if a region specific change in read counts correlates with the presence of CNVs, we implemented an algorithm that uses a subsampling strategy similar to Random Forest to reliable predict the presence of CNVs. We also introduce a novel method to correct for the background noise introduced by sequencing genes with a low number of amplicons. To make it available to the community we implemented the algorithm as an R package. 2 Using This section provides an overview of the package functions. 2.1 Installing and Loading the package The package is available through the Bioconductor repository and can be installed and loaded using the following R commands: # To install from Bioconductor source("http://bioconductor.org/bioclite.r") bioclite("cnvpanelizer") # To load the package library(cnvpanelizer) 1

2.2 Reading data 2.2.1 BED file The BED file is required to obtain amplicon and gene name information associated with the panel. # Bed file defining the amplicons bedfilepath <- "/somepath/somefile.bed" # The column number of the gene Names in the BED file. amplcolumnnumber <- 4 # Extract the information from a bed file genomicrangesfrombed <- BedToGenomicRanges(bedFilepath, ampliconcolumn = amplcolumnnumber, split = "_") metadatafromgenomicranges <- elementmetadata(genomicrangesfrombed) genenames = metadatafromgenomicranges["genenames"][, 1] ampliconnames = metadatafromgenomicranges["ampliconnames"][, 1] 2.2.2 Selecting files Two sets of data are required. The samples of interest and the set of reference bam files to compare against. # Directory with the test data sampledirectory <- "/somepathtotestdata" # Directory with the reference data referencedirectory <- "/somepathtoreferencedata" # Vector with test filenames samplefilenames <- list.files(path = sampledirectory, pattern = ".bam$", full.names = TRUE) # Vector with reference filenames referencefilenames <- list.files(path = referencedirectory, pattern = ".bam$", full.names = TRUE) 2.2.3 Counting reads The reads were counted using a wrapper function around the ExomeCopy package from the Bioconductor project. All reads overlapping with the region of an amplicon were counted for this amplicon. Only reads with a mapping quality 20 were counted and the function allows to remove PCR Duplicates. For reads with the same start site, end site and chromosomal orientation only one is kept. PCR duplicates might cause a bias for the ratio between reference and the samples of interest. Thus this serves as an additional quality control step in the CNV detection pipeline. It is only recommended for Ion Torrent generated data. For Illumina data this step is not recommended. 2

# Should duplicated reads (same start, end site and strand) be removed removepcrduplicates <- FALSE # TRUE is only recommended for Ion Torrent data # Read the Reference data set referencereadcounts <- ReadCountsFromBam(referenceFilenames, genomicrangesfrombed, samplenames = referencefilenames, ampliconnames = ampliconnames, removedup = removepcrduplicates) # Read the sample of interest data set samplereadcounts <- ReadCountsFromBam(sampleFilenames, genomicrangesfrombed, samplenames = samplefilenames, ampliconnames = ampliconnames, removedup = removepcrduplicates) 2.2.4 Using Synthetic Data We also make available synthetic data to test the functions. The following examples make use of two generated data sets, one for the reference and the other as a testing set data(samplereadcounts) data(referencereadcounts) # Gene names should be same size as row columns # For real data this is a vector of the genes associated # with each row/amplicon. For example Gene1, Gene1, Gene2, Gene2, Gene3,... genenames <- row.names(referencereadcounts) # Not defined for synthetic data # For real data this gives a unique name to each amplicon. # For example Gene1:Amplicon1, Gene1:Amplicon2, Gene2:Amplicon1, # Gene2:Amplicon2, Gene3:Amplicon1... ampliconnames <- NULL 2.3 Normalization To account for sample and sequencing run specific variations the counts obtained for each sample were normalized using a wrapper function around the tmm normalization function from the NOISeq package. normalizedreadcounts <- CombinedNormalizedCounts(sampleReadCounts, referencereadcounts, ampliconnames = ampliconnames) # After normalization data sets need to be splitted again to perform bootstrap samplesnormalizedreadcounts = normalizedreadcounts["samples"][[1]] referencenormalizedreadcounts = normalizedreadcounts["reference"][[1]] 2.4 Bootstrap based CNV This aproach is similar to the Random Forest method, which bootstraps the samples and subsamples the features. In our case features would be equivalent to amplicons. The subsampling procedure is repeated n times to generate a large set of randomized synthetic references B = b 1,..., b n by selecting 3

with replacement (bootstrapping) from the set of reference samples. The ratio between the sample of interest and each randomized reference is calculated for each gene, using only a randomly selected subset of amplicons. To get an estimate of significance, the 95 percent confidence interval of the bootstrapping/subsampling ratio distribution was calculated. A significant change is considered as an amplification for a lower bound ratio > 1 and a deletion for an upper bound ratio < 1. The confidence interval was calculated using the 0.025 and 0.975 percent quantiles of the bootstrapping/subsampling ratio distribution. # Number of bootstrap replicates to be used replicates <- 10000 # Perform the bootstrap based analysis bootlist <- BootList(geneNames, samplesnormalizedreadcounts, referencenormalizedreadcounts, replicates = replicates) 2.5 Background Estimation Not all genes have the same number of amplicons A G and it has been shown that sequencing genes with a higher number of amplicons yields better sensitivity and specifity when detecting putative copy number variations. Still genes sequenced with a low number of amplicons might still show significant changes in observed read counts. While normalization makes the comparison of read counts comparable between samples, genes with a small number of amplicons might still show a bias. To quantify the effect of a low number of amplicons on the calling of CNVs we introduced a background noise estimation procedure. Using the ratio between the mean reference and the sample used for calling we subsample for each unique number of amplicons. In the case of two amplicons we repeatedly sample two random amplicons from the set of all amplicons, and average the ratios. Amplicons that belong to genes showing significant copy number variations G sig are not included in the subsampling pool. Each amplicon is weighted according to the number of amplicons the respective gene has w A = 1. Thus the probablity of sampling from a gene is the same regardless of the number A g amplicons. For each number of amplicons a background noise distribution is estimated. The reported background is defined by the lower noise meannoise + qnorm(0.025) standarddeviationnoise and the upper noise meannoise + qnorm(0.975) standarddeviationnoise of the respective distribution. This defines the 95 percent confidence interval. The significance level can also be passed as parameter. Less conservative detection can be achieved by setting the significancelevel = 0.1. To calculate the mean and standard deviation (sd) of ratios the log ratios were used. If setting robust = TRUE, median is used instead of mean and mad (median absolute deviation) replaces sd. The upper bound ratio has to be below the lower noise (deletion) or the lower bound ratio above the upper noise (amplification) to be considered reliable. # Estimate the background noise left after normalization backgroundnoise <- Background(geneNames, samplesnormalizedreadcounts, referencenormalizedreadcounts, bootlist, replicates = replicates, significancelevel = 0.1, robust = TRUE) 2.6 Results To analyse the results we provide two outputs. A plot which shows the detected variations, and a report table with more detailed information about those variations. 4

2.6.1 Report The report describes the bootstrap distribution for each gene. An example can be found in figure 1. The final report is genewise, and is based on the bootstrapping. # Build report tables reporttables <- ReportTables(geneNames, samplesnormalizedreadcounts, referencenormalizedreadcounts, bootlist, backgroundnoise) At the figure 1 we can see an example of the report table for a single sample. 5

6 ## LowerBoundBoot MeanBoot UpperBoundBoot LowerNoise MeanNoise UpperNoise Signif. AboveNoise Amplicons PutativeStatus ReliableStatus ## Gene_01 0.18 0.7 2.68 0.53 0.85 1.37 FALSE FALSE 4 Normal Normal ## Gene_02 0.46 0.86 1.69 0.56 0.82 1.19 FALSE FALSE 7 Normal Normal ## Gene_03 0.6 0.98 1.59 0.53 0.85 1.37 FALSE FALSE 4 Normal Normal ## Gene_04 0.22 0.69 1.97 0.58 0.82 1.16 FALSE FALSE 8 Normal Normal ## Gene_05 0.21 0.49 1.09 0.59 0.82 1.14 FALSE FALSE 9 Normal Normal ## Gene_06 0.86 1.32 2.05 0.48 0.91 1.70 FALSE FALSE 2 Normal Normal ## Gene_07 0.37 0.6 1.04 0.55 0.84 1.29 FALSE FALSE 5 Normal Normal ## Gene_08 0.11 0.5 1.46 0.58 0.82 1.16 FALSE FALSE 8 Normal Normal ## Gene_09 0.6 1.03 1.69 0.55 0.84 1.29 FALSE FALSE 5 Normal Normal ## Gene_10 0.27 0.65 1.34 0.55 0.84 1.29 FALSE FALSE 5 Normal Normal ## Gene_11 0.35 0.79 1.72 0.59 0.82 1.14 FALSE FALSE 9 Normal Normal ## Gene_12 0.99 1.4 2 0.55 0.84 1.29 FALSE FALSE 5 Normal Normal ## Gene_13 0.1 0.58 1.62 0.56 0.83 1.24 FALSE FALSE 6 Normal Normal ## Gene_14 0.06 0.28 0.99 0.55 0.84 1.29 TRUE FALSE 5 Deletion Normal ## Gene_15 0.26 0.47 0.86 0.48 0.91 1.70 TRUE FALSE 2 Deletion Normal ## Gene_16 0.61 1.23 2.21 0.58 0.82 1.16 FALSE FALSE 8 Normal Normal ## Gene_17 1.99 2.35 2.87 0.51 0.87 1.48 TRUE TRUE 3 Amplification Amplification ## Gene_18 0.32 0.53 0.89 0.48 0.91 1.70 TRUE FALSE 2 Deletion Normal ## Gene_19 0.29 0.55 1.05 0.53 0.85 1.37 FALSE FALSE 4 Normal Normal ## Gene_20 0.63 1.15 1.89 0.59 0.82 1.14 FALSE FALSE 9 Normal Normal Figure 1: Sample report table. LowerBoundBoot MeanBoot UpperBoundBoot Lower Mean Upper Noise Signif. AboveNoise Amplicons PutativeStatus Passed The confidence interval of the Bootstrap distribution The Lower/Mean/Upper background noise bounds Boolean value representing if read counts differences are Significant Boolean value representing if LowerBoundBoot is above UpperNoise (UpperBoundBoot below LowerNoise) Number of Amplicons associated with the gene The detected level of change if both Signif. and AboveNoise are TRUE then 2, if only one is TRUE then 1 and if both are FALSE then 0 Table 1: Report table Column Description

2.6.2 Plots The generated plots (one per test sample) show the bootstrap distribution for each gene. The function PlotBootstrapDistributions generates a list of plots for all test samples (A plot per sample). At the figure 2 we can see an example of the plot for a single sample. PlotBootstrapDistributions(bootList, reporttables) 2.7 Shiny App To allow a more user friendly interaction and quick testing of CNVPanelizer, a shiny app is made available through the following command: # default port is 8100 RunCNVPanelizerShiny(port = 8080) 7

Sample_2 10 log2(ratios) 5 0 5 CNV Reliability NoChange NonReliableChange ReliableChange 10 8 Gene_01 Gene_02 Gene_03 Gene_04 Gene_05 Gene_06 Gene_07 Gene_08 Gene_09 Gene_10 Gene_11 Gene_12 Gene Names Gene_13 Gene_14 Gene_15 Gene_16 Gene_17 Gene_18 Gene_19 Gene_20 Figure 2: Plot for a single test sample No Change Non Reliable Change Reliable Change There is no change in the read count The change is significant but below the noise level for the number of amplicons associated with this gene The change is above the noise level and significant Table 2: Description of the CNV detection levels