seqcna: A Package for Copy Number Analysis of High-Throughput Sequencing Cancer DNA

Similar documents
SAM / BAM Tutorial. EMBL Heidelberg. Course Materials. Tobias Rausch September 2012

Copy Number Variations Detection - TD. Using Sequenza under Galaxy

Bioinformatics in next generation sequencing projects

SAM and VCF formats. UCD Genome Center Bioinformatics Core Tuesday 14 June 2016

CNV-seq Manual. Xie Chao. May 26, 2011

Accurate Long-Read Alignment using Similarity Based Multiple Pattern Alignment and Prefix Tree Indexing

File Formats: SAM, BAM, and CRAM. UCD Genome Center Bioinformatics Core Tuesday 15 September 2015

From the Schnable Lab:

Dindel User Guide, version 1.0

Peak Finder Meta Server (PFMS) User Manual. Computational & Systems Biology, ICM Uppsala University

Finding Structural Variants in Short Read, Paired-end Sequence Data with R and Bioconductor

Analysis of Chromosome 20 - A Study

HIPPIE User Manual. (v0.0.2-beta, 2015/4/26, Yih-Chii Hwang, yihhwang [at] mail.med.upenn.edu)

The analysis of acgh data: Overview

Ensembl RNASeq Practical. Overview

Identiyfing splice junctions from RNA-Seq data

BadRegionFinder an R/Bioconductor package for identifying regions with bad coverage

Supplementary Information. Detecting and annotating genetic variations using the HugeSeq pipeline

Welcome to MAPHiTS (Mapping Analysis Pipeline for High-Throughput Sequences) tutorial page.

Easy visualization of the read coverage using the CoverageView package

REAPR version Martin Hunt. Feb 23 rd 2015

RNA- SeQC Documentation

Tn-seq Explorer 1.2. User guide

Package saascnv. May 18, 2016

Introduction to Read Alignment. UCD Genome Center Bioinformatics Core Tuesday 15 September 2015

Package CNAnorm. April 11, 2018

ChIP-seq (NGS) Data Formats

Rsubread package: high-performance read alignment, quantification and mutation discovery

Rsubread package: high-performance read alignment, quantification and mutation discovery

Read Naming Format Specification

Analysis of ChIP-seq data

Step-by-Step Guide to Basic Genetic Analysis

Part 1: How to use IGV to visualize variants

Basic4Cseq: an R/Bioconductor package for the analysis of 4C-seq data

Genetic Analysis. Page 1

Basic4Cseq: an R/Bioconductor package for the analysis of 4C-seq data

Agilent Genomic Workbench 7.0

PICS: Probabilistic Inference for ChIP-Seq

CNVPanelizer: Reliable CNV detection in target sequencing applications

INTRODUCTION AUX FORMATS DE FICHIERS

Package RAPIDR. R topics documented: February 19, 2015

The software comes with 2 installers: (1) SureCall installer (2) GenAligners (contains BWA, BWA- MEM).

BGGN-213: FOUNDATIONS OF BIOINFORMATICS (Lecture 14)

QIAseq Targeted RNAscan Panel Analysis Plugin USER MANUAL

TECH NOTE Improving the Sensitivity of Ultra Low Input mrna Seq

Practical Bioinformatics for Life Scientists. Week 4, Lecture 8. István Albert Bioinformatics Consulting Center Penn State

Predictive Coding of Aligned Next-Generation Sequencing Data

v0.2.0 XX:Z:UA - Unassigned XX:Z:G1 - Genome 1-specific XX:Z:G2 - Genome 2-specific XX:Z:CF - Conflicting

Prepare input data for CINdex

Exome sequencing. Jong Kyoung Kim

ChIP-seq hands-on practical using Galaxy

ITMO Ecole de Bioinformatique Hands-on session: smallrna-seq N. Servant 21 rd November 2013

m6aviewer Version Documentation

The ChIP-seq quality Control package ChIC: A short introduction

fastseg An R Package for fast segmentation Günter Klambauer and Andreas Mitterecker Institute of Bioinformatics, Johannes Kepler University Linz

The software comes with 2 installers: (1) SureCall installer (2) GenAligners (contains BWA, BWA-MEM).

Dr. Gabriela Salinas Dr. Orr Shomroni Kaamini Rhaithata

Identify differential APA usage from RNA-seq alignments

ChIP-Seq Tutorial on Galaxy

Package CopywriteR. April 20, 2018

CS313 Exercise 4 Cover Page Fall 2017

RJaCGH, a package for analysis of

Package polyphemus. February 15, 2013

Protocol: peak-calling for ChIP-seq data / segmentation analysis for histone modification data

KC-SMART Vignette. Jorma de Ronde, Christiaan Klijn April 30, 2018

Sentieon DNA Pipeline for Variant Detection Software-only solution, over 20 faster than GATK 3.3 with identical results

Package quantsmooth. R topics documented: May 4, Type Package

DRAGEN Bio-IT Platform Enabling the Global Genomic Infrastructure

Sentieon DNA pipeline for variant detection - Software-only solution, over 20 faster than GATK 3.3 with identical results

Analyzing ChIP- Seq Data in Galaxy

NMCS2 MIDAS Outstation Algorithm Specification

User Manual of ChIP-BIT 2.0

SVMerge (v1.2) Pipeline Documentation. August 8, 2012

A short Introduction to UCSC Genome Browser

Under the Hood of Alignment Algorithms for NGS Researchers

Package icnv. R topics documented: March 8, Title Integrated Copy Number Variation detection Version Author Zilu Zhou, Nancy Zhang

Colorado State University Bioinformatics Algorithms Assignment 6: Analysis of High- Throughput Biological Data Hamidreza Chitsaz, Ali Sharifi- Zarchi

Advanced UCSC Browser Functions

Sanger Data Assembly in SeqMan Pro

CBSU/3CPG/CVG Joint Workshop Series Reference genome based sequence variation detection

CONSERTING User Documentation

Fusion Detection Using QIAseq RNAscan Panels

User Guide for Tn-seq analysis software (TSAS) by

Preparation of alignments for variant calling with GATK: exercise instructions for BioHPC Lab computers

User Guide. SLAMseq Data Analysis Pipeline SLAMdunk on Bluebee Platform

How to use CNTools. Overview. Algorithms. Jianhua Zhang. April 14, 2011

Supplementary Figure 1. Fast read-mapping algorithm of BrowserGenome.

MPG NGS workshop I: Quality assessment of SNP calls

Package cgh. R topics documented: February 19, 2015

R topics documented: 2 R topics documented:

Tutorial. Identification of Variants in a Tumor Sample. Sample to Insight. November 21, 2017

Scalable Genomics with R and Bioconductor

Package conumee. February 17, 2018

Running SNAP. The SNAP Team October 2012

ExomeDepth. Vincent Plagnol. May 15, What is new? 1. 4 Load an example dataset 4. 6 CNV calling 5. 9 Visual display 9

Package ChIC. R topics documented: May 4, Title Quality Control Pipeline for ChIP-Seq Data Version Author Carmen Maria Livi

Introduction to Galaxy

Compression of structured high-throughput sequencing data

Genome Assembly Using de Bruijn Graphs. Biostatistics 666

Scalable Genomics with R and Bioconductor

Transcription:

seqcna: A Package for Copy Number Analysis of High-Throughput Sequencing Cancer DNA David Mosen-Ansorena 1 October 30, 2017 Contents 1 Genome Analysis Platform CIC biogune and CIBERehd dmosen.gn@cicbiogune.es 1 Overview 1 1.1 Example data................................... 1 1.2 Before you start.................................. 2 2 Input 2 3 Filtering 2 4 Normalization 4 5 Segmentation and calling 4 1 Overview This document guides the reader through the use of the seqcna package. The aim of the package is to process high-throughput sequencing copy number data coming from tumoural samples, departing from SAM or BAM aligned reads up to the final stage of calling the copy numbers. It includes, among other features, an integrated summarization method, several filters based on a range of genomic and read-based characteristics and a normalization approach developed specifically for tumoural data. Overall, the package was conceived to be user-friendly, provide visual assessment at each step and keep a simple and limited parameterization. 1.1 Example data The package includes example data that represents the summarized read counts at 200Kbps for chromosomes 1 to 5 of the cancer cell-line HCC1143 (Chiang et al., 2008). This output is a subset of what can be achieved by running the summarization function runseqsumm 1

in the package (see?runseqsumm) over the corresponding SAM file, which is too large to be included in the package. Although the example data does not include the sexual chromosomes, the package is able to process them. 1.2 Before you start Analyses are organized by folder. We recommend placing each SAM or BAM file in a separate folder, where the corresponding analysis files will be generated. 2 Input An example analysis on the previously described example data follows, beginning by loading the package and the sample data. Note that, with your data, you will need to call runseqsumm(file=yourfile.sam) on your SAM or BAM file instead of just loading it with the function data. Also, be sure to have samtools in your path or indicate its location if your file is BAM. > library(seqcna) > data(seqsumm_hcc1143) The first step is to read the data into an R object, which will be updated with new information as we process it. If we wanted to perform the analysis with a matched normal sample, to indicate a larger summarization window or to use the annotation of a specific genome build, we would indicate it here too. Here we load the data straight from a dataframe, but the typical use is to indicate the folder where the output generated by runseqsumm is located (seqsumm out.txt). Notice that no further software is necessary for the summarization except if the input is in BAM format. Then, runseqsumm needs to be told the path to the samtools program (Li et al., 2009). > rco = readseqsumm(tumour.data=seqsumm_hcc1143) 3 Filtering Filtering windows is recommended to reduce the amount of false positives. Once GC content is accounted for, the basic trimming filter discards those windows with extremely low or high read counts. The amount of windows to filter out is given in quantiles (from 0 to 1). For instance, a value of 0.01 discards 1 per cent of the windows with at least one read on both the upper and lower extremes. Two values can be given for the upper and lower quantiles, respectively. Withouth a matched normal, a value of 1 for the upper quantile automatically selects its threshold. The mapping quality and mappability filters try to discard windows in which the amount of read segments might be lower than expected due to reasons such as the number of times the segments appear in the genome. The mappability filter requires the corresponding genome 2

annotation (hg18 and hg19 annotations are provided in the package). While mappability ranges from 0 to 1, the mapping quality of reads ranges from 0 to 255. A mean mapping quality of 5 would indicate that the average probability of the reads being correctly mapped is 10e-5. On samples sequenced with paired-end reads, the PEM-based filter discards windows in which the ratio of improper to proper reads is highest, where proper reads are those with correct orientation and whose insert size is within a reasonable limit. A correct orientation means that the reads in a pair are facing each other, whereas a sensible insert size depends on the library and is usually estimated by the aligner software. The CNV-based filter, which requires the corresponding genome build, discards windows where a common CNV spans most of their length. Annotation of common CNVs mapped to the hg18 and hg19 builds are provided in the package, where the CNVs are the ones detected by Altshuler et al. (Altshuler et al., 2010). Under most circumstances, this filter can be disregarded. In general, we suggest reviewing the function s output plots in order to adjust the filters to adequate values. In this particular example, we apply the trimming and mapping quality filters, which do not require neither genome build annotation nor paired-end reads. > rco = applyfilters(rco, trim.filter=1, mapq.filter=2) Pre normalized tumoural profile 0.5 0.5 1.0 1.5 0 1000 2000 3000 4000 5000 Index Trimming filter Mapping quality filter quality 26 2 20 trimmed Density 0.0 0.5 1.0 1.5 0.63 1.45 1.0 0.0 1.0 2.0 Pre normalized tumoural profile Density 0.00 0.05 0.10 0.15 2 0 10 20 30 40 Mean mapping quality 3

4 Normalization GC content is the factor that most affects spurious read count variability. The method called seqnorm effectively corrects for this bias taking into account a possible linear relationship between copy number and GC content. If we had previously loaded a matched normal sample, a correction not against GC content but against the normal read count profile would have been made. seqnorm uses GLAD (Hupé et al., 2004) to segment a preliminar normalization and then performs regressions on each segment, which provides a robust estimate of the actual regression line. Little parameterization is required internally by seqnorm. Only for the segmentation and for filtering segments. However, the same values can be used throughout samples, so the function takes no parameters. The process can be sped up through parallelization, using as many cores as are available in the machine. > rco = runseqnorm(rco) Density 0.0 0.2 0.4 0.6 Window counts (tumour) 1000 1500 2000 2500 Standard regression seqnorm regression 1 0 1 2 Normalized ratio 0.45 0.50 0.55 0.60 0.65 0.70 0.75 GC content 5 Segmentation and calling The same segmentation method, GLAD (Hupé et al., 2004), is used for the final segmentation. This step smooths the filtered and normalized profile, trying to keep only the significant changes. The level of smoothing can be controlled through a parameter if necessary (see?runglad). > rco = runglad(rco) 4

Finally, the copy number calling is made. We found that visual assessment is necessary for copy number calling even after complex computational modelling. Hence, we offer the means to manually make the calling. Required input is: (i) the thresholds that separate the copy numbers in the profile and (ii) the copy number of those segmented values below the lowest indicated threshold. We first plot the profile so that we can perform the visual assessment. Typically, the most common copy numbers are 2, 3 or 4 (i.e. mainly diploid, triploid or tetraploid profile). In this example, we can observe that what seems to be the CN2 is below zero and CN1 a bit below -1, so we set the lowest threshold at -0.8 and set subsequent thresholds at intervals of 0.8. We also need to indicate which the lowest copy number is. In this case, CN1. > plotcnprofile(rco) Normalized ratio 1 0 1 2 3 0 1000 2000 3000 4000 5000 Genomic window index > rco = applythresholds(rco, seq(-0.8,4,by=0.8), 1) Now that we have applied the thresholds, plotting the profile includes them in the plot of the finished analysis. > plotcnprofile(rco) 5

Normalized ratio 1 0 1 2 3 CN6 CN5 CN4 CN3 CN2 CN1 0 1000 2000 3000 4000 5000 Genomic window index At any step of the analysis, we can check a summary of the data and its processing step through the summary function. > summary(rco) Basic information: SeqCNAInfo object with 5314 200Kbp-long windows. PEM information is not available. Paired normal is not available. Genome and build unknown (chromosomes chr1 to chr5). Total filtered windows: 212. The profile is normalized and segmented. Copy numbers called from 1 to 8. CN1:872 windows. CN2:2089 windows. CN3:1327 windows. CN4:579 windows. CN5:205 windows. CN6:17 windows. CN7:9 windows. CN8:4 windows. 6

A quick peek at the output reveals the profiles at the different analysis steps, where those windows that were initially filtered present NA values. > head(rco@output) chrom win.start normalized segmented CN 1 chr1 0 NA NA <NA> 2 chr1 200 NA NA <NA> 3 chr1 400 NA NA <NA> 4 chr1 600 0.2170 0.0377 3 5 chr1 800 0.0122 0.0377 3 6 chr1 1000-0.0623 0.0377 3 If we wanted to output the results to a file, we would use the writecnprofile function. 7

References Altshuler, D., Gibbs, R., and Peltonen, L. (2010). Integrating common and rare genetic variation in diverse human populations. Nature, 467(7311):52 58. Chiang, D., Getz, G., Jaffe, D., O Kelly, M., Zhao, X., Carter, S., Russ, C., Nusbaum, C., Meyerson, M., and Lander, E. (2008). High-resolution mapping of copy-number alterations with massively parallel sequencing. Nature methods, 6(1):99 103. Hupé, P., Stransky, N., Thiery, J.-P., Radvanyi, F., and Barillot, E. (2004). Analysis of array CGH data: from signal ratio to gain and loss of DNA regions. Bioinformatics (Oxford, England), 20(18):3413 22. Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., and Durbin, R. (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics (Oxford, England), 25(16):2078 9. 8