Anaquin - Vignette Ted Wong January 05, 2019

Similar documents
Automated Bioinformatics Analysis System on Chip ABASOC. version 1.1

ROTS: Reproducibility Optimized Test Statistic

EBSeqHMM: An R package for identifying gene-expression changes in ordered RNA-seq experiments

RNA-Seq. Joshua Ainsley, PhD Postdoctoral Researcher Lab of Leon Reijmers Neuroscience Department Tufts University

Differential gene expression analysis using RNA-seq

TECH NOTE Improving the Sensitivity of Ultra Low Input mrna Seq

Package EDDA. April 14, 2019

Expression Analysis with the Advanced RNA-Seq Plugin

Transcript quantification using Salmon and differential expression analysis using bayseq

Methodology for spot quality evaluation

Multiple sequence alignment accuracy estimation and its role in creating an automated bioinformatician

Services Performed. The following checklist confirms the steps of the RNA-Seq Service that were performed on your samples.

Evaluating Classifiers

STREAMING FRAGMENT ASSIGNMENT FOR REAL-TIME ANALYSIS OF SEQUENCING EXPERIMENTS. Supplementary Figure 1

Data Processing and Analysis in Systems Medicine. Milena Kraus Data Management for Digital Health Summer 2017

Package ERSSA. November 4, 2018

Gene Expression Data Analysis. Qin Ma, Ph.D. December 10, 2017

Evaluating Classifiers

Goal: Learn how to use various tool to extract information from RNAseq reads. 4.1 Mapping RNAseq Reads to a Genome Assembly

Louis Fourrier Fabien Gaie Thomas Rolf

Galaxy workshop at the Winter School Igor Makunin

How to use the DEGseq Package

Ballgown. flexible RNA-seq differential expression analysis. Alyssa Frazee Johns Hopkins

Supplementary information: Detection of differentially expressed segments in tiling array data

EBSeq: An R package for differential expression analysis using RNA-seq data

Applying Data-Driven Normalization Strategies for qpcr Data Using Bioconductor

Windows. RNA-Seq Tutorial

Package SC3. September 29, 2018

CS145: INTRODUCTION TO DATA MINING

PROPER: PROspective Power Evaluation for RNAseq

Centre (CNIO). 3rd Melchor Fernández Almagro St , Madrid, Spain. s/n, Universidad de Vigo, Ourense, Spain.

srap: Simplified RNA-Seq Analysis Pipeline

A review of RNA-Seq normalization methods

Package SC3. November 27, 2017

Shotgun sequencing. Coverage is simply the average number of reads that overlap each true base in genome.

Colorado State University Bioinformatics Algorithms Assignment 6: Analysis of High- Throughput Biological Data Hamidreza Chitsaz, Ali Sharifi- Zarchi

INSPEcT - INference of Synthesis, Processing and degradation rates in Time-course analysis

Introduction to GE Microarray data analysis Practical Course MolBio 2012

Data Mining and Knowledge Discovery Practice notes 2

Introduction to Cancer Genomics

Lecture 25: Review I

Evaluation Metrics. (Classifiers) CS229 Section Anand Avati

Using Real-valued Meta Classifiers to Integrate and Contextualize Binding Site Predictions

TP RNA-seq : Differential expression analysis

Data Mining and Knowledge Discovery: Practice Notes

jackstraw: Statistical Inference using Latent Variables

Sequence Analysis Pipeline

Single/paired-end RNAseq analysis with Galaxy

Statistical Analysis of Metabolomics Data. Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte

Resampling Methods. Levi Waldron, CUNY School of Public Health. July 13, 2016

INSPEcT - INference of Synthesis, Processing and degradation rates in Time-course analysis

Classification Part 4

All About PlexSet Technology Data Analysis in nsolver Software

Package BEARscc. May 2, 2018

The Detection Flags and Primary Flags Fields in QuanLynx and TargetLynx Version 4.1

Data Mining and Knowledge Discovery: Practice Notes

Long Read RNA-seq Mapper

CQN (Conditional Quantile Normalization)

Comparing methods for differential expression analysis of RNAseq data with the compcoder

CLC Server. End User USER MANUAL

BGGN-213: FOUNDATIONS OF BIOINFORMATICS (Lecture 14)

How to store and visualize RNA-seq data

ITMO Ecole de Bioinformatique Hands-on session: smallrna-seq N. Servant 21 rd November 2013

Introduction. Library quantification and rebalancing. Prepare Library Sequence Analyze Data. Library rebalancing with the iseq 100 System

Evaluation Measures. Sebastian Pölsterl. April 28, Computer Aided Medical Procedures Technische Universität München

CS6375: Machine Learning Gautam Kunapuli. Mid-Term Review

Centroid based clustering of high throughput sequencing reads based on n-mer counts

Package rain. March 4, 2018

KisSplice. Identifying and Quantifying SNPs, indels and Alternative Splicing Events from RNA-seq data. 29th may 2013

mrna-seq Basic processing Read mapping (shown here, but optional. May due if time allows) Gene expression estimation

Course on Microarray Gene Expression Analysis

11/8/2017 Trinity De novo Transcriptome Assembly Workshop trinityrnaseq/rnaseq_trinity_tuxedo_workshop Wiki GitHub

Getting Started. April Strand Life Sciences, Inc All rights reserved.

subseq Package Vignette (Version )

Package macorrplot. R topics documented: October 2, Title Visualize artificial correlation in microarray data. Version 1.50.

Exercise 1 Review. --outfiltermismatchnmax : max number of mismatch (Default 10) --outreadsunmapped fastx: output unmapped reads

Identify differential APA usage from RNA-seq alignments

Quantification. Part I, using Excel

Dr. Gabriela Salinas Dr. Orr Shomroni Kaamini Rhaithata

DATA MINING AND MACHINE LEARNING. Lecture 6: Data preprocessing and model selection Lecturer: Simone Scardapane

International Journal of Research in Advent Technology, Vol.7, No.3, March 2019 E-ISSN: Available online at

Identiyfing splice junctions from RNA-Seq data

Package anota2seq. January 30, 2018

LFCseq: a nonparametric approach for differential expression analysis of RNA-seq data - supplementary materials

CS4491/CS 7265 BIG DATA ANALYTICS

Evaluating Machine-Learning Methods. Goals for the lecture

Visualization using CummeRbund 2014 Overview

Introduction to Galaxy

Evaluation of different biological data and computational classification methods for use in protein interaction prediction.

[Programming Assignment] (1)

Probabilistic Classifiers DWML, /27

SHARPR (Systematic High-resolution Activation and Repression Profiling with Reporter-tiling) User Manual (v1.0.2)

RNA-Seq data analysis software. User Guide 023UG050V0200

RNA-seq. Manpreet S. Katari

List of Exercises: Data Mining 1 December 12th, 2015

Multicollinearity and Validation CIVL 7012/8012

ISyE 6416 Basic Statistical Methods - Spring 2016 Bonus Project: Big Data Analytics Final Report. Team Member Names: Xi Yang, Yi Wen, Xue Zhang

RNA-Seq Analysis With the Tuxedo Suite

Mapping RNA sequence data (Part 1: using pathogen portal s RNAseq pipeline) Exercise 6

Evaluating Machine Learning Methods: Part 1

Transcription:

Anaquin - Vignette Ted Wong (t.wong@garvan.org.au) January 5, 219 Citation [1] Representing genetic variation with synthetic DNA standards. Nature Methods, 217 [2] Spliced synthetic genes as internal controls in RNA sequencing experiments. Nature Methods, 216. [3] Reference standards for next-generation sequencing. Nature Reviews, 217. [4] Anaquin: a software toolkit for the analysis of spike-in controls for next generation sequencing. Bioinformatics, 217. Website Visit our website to learn more about sequins: www.sequin.xyz. Overview In this document, we show how to conduct statistical analysis that models the performance of sequin controls in next-generation-sequencing (NGS) experiment. We call the sequins RnaQuin for RNA-Seq sequins, MetaQuin for metagenomic sequins, VarQuin for genomics variant sequins, and the statistical framework Anaquin. This vignette is written for R-usage. However, Anaquin is a framework covering the entire NGS workflow. Consequently, the R-package (and it s documentation) is a subset of the overall Anaquin framework. We also distribute a detailed workflow guide on our website. It is important to note Anaquin is both command-line tool and R-package. Our workflow guide has the details on how the command-line tool can be used with the R-package. Sequins Next-generation sequencing (NGS) enables rapid, cheap and high-throughput determination of sequences within a user s sample. NGS methods have been applied widely, and have fuelled major advances in the life sciences and clinical health care over the past decade. However, NGS typically generates a large amount of sequencing data that must be first analyzed and interpreted with bioinformatics tools. There is no standard way to perform an analysis of NGS data; different tools provide different advantages in different situations. The complexity and variation of sequences further compound this problem, and there is little reference by which compare next-generation sequencing and analysis. To address this problem, we have developed a suite of synthetic nucleic-acid sequins (sequencing spike-ins). Sequins are fractionally added to the extracted nucleic-acid sample prior to library preparation, so they are sequenced along with your sample of interest. We can use the sequins as an internal quantitative and qualitative control to assess any stage of the next-generation sequencing workflow. 1

Figure 1: NGS Workflow for sequins Mixture Sequins are combined together across a range of concentrations to formulate a mixture. Mixture file (CSV) is a text file that specifies the concentration of each sequin within a mixture. Mixture files are often required as input to enable Anaquin to perform quantitative analysis. Mixture file can be downloaded from our website. Let s demonstrate RnaQuin mixture A with a simple example. Load the mixture file (you can also download the file directly from our website): library('anaquin') ## Loading required package: ggplot2 ## Registered S3 methods overwritten by 'ggplot2': ## method from ## [.quosures rlang ## c.quosures rlang ## print.quosures rlang data("rnaquinisoformmixture") head(rnaquinisoformmixture) ## Name Length MixA MixB ## 1 R1_11_1 719 11.32965.47275 ## 2 R1_11_2 43 3.77655 1.416225 ## 3 R1_12_1 149 13.217925 7.5531 ## 4 R1_12_2 1362 1.888275 52.8717 ## 5 R1_13_1 1754 6.42486 453.186 ## 6 R1_13_2 1856 96.37294 3.2124 Each row represents a sequin. Name gives the sequin names, Length is the length of the sequins in nucleotide bases, MixA gives the concentration level in attoml/ul for Mixture A. Imagine we have two RNA-Seq experiments; a well-designed experiment and a poorly-designed experiment. We would like to quantify their isoform expression. Let s simulate the experiments: set.seed(1234) sim1 <- 1. + 1.2*log2(RnaQuinIsoformMixture$MixA) + rnorm(nrow(rnaquinisoformmixture),,1) sim2 <- c(1. + rnorm(1,1,3), 1. + 1.2*log2(tail(RnaQuinIsoformMixture,64)$MixA) + rnorm(64,,1)) In the first experiment, sequins are expected to correlate linearly with the measured FPKM. Indeed, the variables are strongly correlated: 2

names <- row.names(rnaquinisoformmixture) input <- log2(rnaquinisoformmixture$mixa) title <- 'Isoform expression (Good)' xlab <- 'Input concentration (log2)' ylab <- 'Measured FPKM (log2)' plotlinear(names, input, sim1, title=title, xlab=xlab, ylab=ylab) 15 Isoform expression (Good) y = c(.88) + c(1.2)x, r 2 =.971 1 Measured FPKM (log2) 5 5 1 5 5 1 Input concentration (log2) In our second experiment, the weakly expressed isoforms exhibit stochastic behavior and are clearly not linear with the input concentration. Furthermore, there is a limit of quantification (LOQ); below which accuracy of the experiment becomes questionable. names <- row.names(rnaquinisoformmixture) input <- log2(rnaquinisoformmixture$mixa) title <- 'Isoform expression (Bad)' xlab <- 'Input concentration (log2)' ylab <- 'Measured FPKM (log2)' plotlinear(names, input, sim2, title=title, xlab=xlab, ylab=ylab) 3

Isoform expression (Bad) 1 Overall : y = c(2.3) + c(.45)x, r 2 =.218 Above LOQ : y = c(1.8) + c(.59)x, r 2 =.32 Measured FPKM (log2) 5 5 LOQ:.236 attomol/ul 5 5 1 Input concentration (log2) The primary observation is that the artificial scale imposed by sequins allow us to quantify our experiments. Quantifying transcriptome assembly To quantify RNA-Seq transcriptome assembly, we need to run a transcriptome assember; a software that can assemble transcripts and estimates their abundances. Our workflow guide has the details. Here, we use a data set generated by Cufflinks, described in Section 5.4.5.1 in the user guide: data(userguidedata_5.4.5.1) head(userguidedata_5.4.5.1) ## Input Sn ## R1_11_1 1.78.99264 ## R1_11_2 5.354.39323 ## R1_12_1.8886.519463 ## R1_12_2 14.2176.92349 ## R1_13_1 17.422.995439 ## R1_13_2 859.375.9495 The first column gives the input concentration for each sequin in attomol/ul. The second column is the measured sensitivity. Run the following R-code to generate a sensitivity plot. title <- 'Assembly Plot' xlab <- 'Input Concentration (log2)' ylab <- 'Sensitivity' # Sequin names 4

names <- row.names(userguidedata_5.4.5.1) # Input concentration x <- log2(userguidedata_5.4.5.1$input) # Measured sensitivity y <- UserGuideData_5.4.5.1$Sn plotlogistic(names, x, y, title=title, xlab=xlab, ylab=ylab, showloa=true) 1. Assembly Plot LOA: 1.1 attomol/ul.75 Sensitivity.5.25. 5 5 1 15 Input Concentration (log2) The fitted logistic curve reveals clear relationship between input concentration and sensitivity. Unsurprisingly, the assembler has higher sensitivity with highly expressed isoforms. The limit-of-assembly (LOA) is defined as the intersection of the curve to sensitivity of.7. Quantifying gene expression Quantifying gene/isoform expression involves building a linear model between input concentration and measured FPKM. In this section, we consider a dataset generated by Cufflinks, described in Section 5.4.5.1 of the user guide. Load the data set: data(userguidedata_5.4.6.3) head(userguidedata_5.4.6.3) ## Input Observed1 Observed2 Observed3 5

## R1_11 15.162.958838 1.45665.9619 ## R1_12 15.162.86596.64539.652783 ## R1_13 966.797 2.6547 2.8957 3.2119 ## R1_11 241.699 3.8761 3.91995 4.24639 ## R1_12 3.2124.779118.898644.733175 ## R1_13 7734.38 135.71 1328.95 1358.97 The first column gives input concentration for each sequin in attomol/ul. The other columns are the FPKM values for each replicate (three replicates in total). The following code will quantify the first replicate: title <- 'Gene Expression' xlab <- 'Input Concentration (log2)' ylab <- 'FPKM (log2)' # Sequin names names <- row.names(userguidedata_5.4.6.3) # Input concentration x <- log2(userguidedata_5.4.6.3$input) # Measured FPKM y <- log2(userguidedata_5.4.6.3$observed1) plotlinear(names, x, y, title=title, xlab=xlab, ylab=ylab, showloq=true) Gene Expression 1 Overall : y = c( 3.3) + c(.96)x, r 2 =.931 Above LOQ : y = c( 3.6) + c(1)x, r 2 =.954 FPKM (log2) 5 5 LOQ:.236 attomol/ul 5 1 15 Input Concentration (log2) Coefficient of determination is over.9; over 9% of the variation (e.g. technical bias) can be explained by the model. LOQ is 3.78 attomol/ul, this is the estimated emphirical detection limit. We can also quantify multiple replicates: 6

title <- 'Gene Expression' xlab <- 'Input Concentration (log2)' ylab <- 'FPKM (log2)' # Sequin names names <- row.names(userguidedata_5.4.6.3) # Input concentration x <- log2(userguidedata_5.4.6.3$input) # Measured FPKM y <- log2(userguidedata_5.4.6.3[,2:4]) plotlinear(names, x, y, title=title, xlab=xlab, ylab=ylab, showloq=true) Gene Expression 1 y = c( 3.4) + c(.97)x, r 2 =.934 FPKM (log2) 5 5 5 1 15 Input Concentration (log2) Differential analysis In this section, we show how to quantify differential expression analysis between expected fold-change and measured fold-change. We apply our method to a data set described in Section 5.6.3 of the user guide. data(userguidedata_5.6.3) head(userguidedata_5.6.3) ## ExpLFC ObsLFC SD Pval Qval Mean ## R1_11-3 -1.89122.71723 7.69675e-3 2.56337e-2 9.953556 7

## R1_12-4 -2.51777.546374 1.731616e-4 7.646243e-4 17.285262 ## R1_13-1 3.837784.37762 2.883289e-24 6.53428e-23 1221.31532 ## R1_11-4 -2.431582.591352 3.924117e-5 1.974336e-4 47.17425 ## R1_12 1 1.542757.425562 2.88714e-4 1.214989e-3 73.872 ## R1_13.71771.242493 3.79564e-3 1.416e-2 4453.259914 ## Label ## R1_11 TP ## R1_12 TP ## R1_13 TP ## R1_11 TP ## R1_12 TP ## R1_13 FP For each of the sequin gene, we have expected log-fold change, measured log-fold change, standard deviation, p-value, q-value and mean. The estimation was done by DESeq2. Run the following code to construct a folding plot: title <- 'Gene Fold Change' xlab <- 'Expected fold change (log2)' ylab <- 'Measured fold change (log2)' # Sequin names names <- row.names(userguidedata_5.6.3) # Expected log-fold x <- UserGuideData_5.6.3$ExpLFC # Measured log-fold y <- UserGuideData_5.6.3$ObsLFC plotlinear(names, x, y, title=title, xlab=xlab, ylab=ylab, showaxis=true, showloq=false) 8

Gene Fold Change y = c(.5) + c(.95)x, r 2 =.781 Measured fold change (log2) 4 4 4 2 2 4 Expected fold change (log2) Outliers are obvious throughout the reference scale. Overall, DESeq2 is able to account for 78% of the variation. We can also construct a ROC plot. [1] has details on how the true-positives and false-positives are defined. title <- 'ROC Plot' # Sequin names seqs <- row.names(userguidedata_5.6.3) # Expected ratio ratio <- UserGuideData_5.6.3$ExpLFC # How the ROC points are ranked (scoring function) score <- 1-UserGuideData_5.6.3$Pval # Classified labels (TP/FP) label <- UserGuideData_5.6.3$Label plotroc(seqs, score, ratio, label, title=title, refgroup=) 9

ROC Plot 1. TPR.75.5.25 4 3 2 1 1 2 3 4...25.5.75 1. FPR AUC statistics for LFC 3 and 4 are higher than LFC 1 and 2. Overall, all LFC ratios can be correctly classified relative to LFC. Furthermore, we can construct limit of detection ratio (LOD) curves: xlab <- 'Average Counts' ylab <- 'P-value' title <- 'LOD Curves' # Measured mean mean <- UserGuideData_5.6.3$Mean # Expected log-fold ratio <- UserGuideData_5.6.3$ExpLFC # P-value pval <- UserGuideData_5.6.3$Pval qval <- UserGuideData_5.6.3$Qval plotlod(mean, pval, abs(ratio), qval=qval, xlab=xlab, ylab=ylab, title=title, FDR=.5) 1

LOD Curves 1e 15 P value 1e 4 1e 65 Ratio 1 2 3 4 1e 9 1e+1 1e+3 1e+5 Average Counts Unsurprisingly, p-value is inverse quadratically related with average counts. All the LFC ratios systematically outperform LFC. The function also estimates the empirical detection limits, [1] has the details. 11