The ChIP-seq quality Control package ChIC: A short introduction

April 30, 2018

Abstract

The ChIP-seq quality Control package (ChIC) provides functions and data structures to assess the quality of ChIP-seq data. The tool computes three categories of QC-metrics: general metrics together with metrics designed for narrow-peak profiles, metrics based on the global read distribution, and metrics from local signal enrichment around annotated genes. User-friendly wrapper functions perform the analysis with a single command, while step-by-step functions are available for more experienced users. The package comes with a large reference compendium of precomputed QC-metrics from public ChIP-seq samples. Key features of the package are functions for calculating and visualizing QC-metrics and for creating summary plots, tools for comparing metagene profiles against reference profiles, tools for comparing single QC-metrics with the compendium values, and a random forest model that computes a single-value quality score.

Contents

1 ENCODE Metrics (EM) based on sharp-peak profiles and cross-correlation analysis
    1.1 Reading BAM files
    1.2 Calculate QC-metrics from CrossCorrelation analysis
    1.3 Remove anomalies in the read distribution
    1.4 Calculate QC-metrics from peak calls
    1.5 Profile smoothing
2 Global enrichment profile Metrics (GM) and Fingerprint-plot
3 Metagene profiles and local enrichment profile metrics (LM)
4 Quality assessment using the compendium of QC-metrics as reference
    4.1 Comparing local enrichment profiles
    4.2 Comparing QC-metrics with reference values of the compendium
    4.3 Assessing data quality with machine learning
References

To run the example code the user must provide two BAM files: one for the ChIP and one for the input. Here we use ChIP-seq data from ENCODE. Two example files can be downloaded from the following links:

https://www.encodeproject.org/files/encff000bfx/
https://www.encodeproject.org/files/encff000bdq/

The tutorial illustrates the input and output of the ChIC functions using an ENCODE ChIP-seq dataset for H3K36me3 (ID: ENCFF000BFX) and its input (ID: ENCFF000BDQ). To save time, this tutorial starts from reads that have already been loaded from the BAM files for a subset of chromosomes. We therefore load the ChIP and input data from our data package ChIC.data:

> ## load ChIC
> library(ChIC)
> ## set path for working directory
> #filepath=tempdir()
> #setwd(filepath)
> filepath=getwd()
> ## load tag-list with reads aligned to a subset of chromosomes
> data("chipSubset", package = "ChIC.data", envir = environment())
> chipbam=chipSubset
> data("inputSubset", package = "ChIC.data", envir = environment())
> inputbam=inputSubset

1 ENCODE Metrics (EM) based on sharp-peak profiles and cross-correlation analysis

qualityScores_EM is a wrapper function that reads the BAM files and calculates a number of QC-metrics from cross-correlation analysis and from peak-calling. We will refer to these measures as ENCODE Metrics (EM).

> ## calculate first set of QC-metrics: EM
> mc=3
> filepath=tempdir()
> setwd(filepath)
> system("wget https://www.encodeproject.org/files/encff000bfx/@@download/encff000bfx.bam")
> system("wget https://www.encodeproject.org/files/encff000bdq/@@download/encff000bdq.bam")
> chipname=file.path(filepath,"encff000bfx")
> inputname=file.path(filepath,"encff000bdq")
> CC_Result=qualityScores_EM(chipName=chipname, inputName=inputname,
+     read_length=36, mc=mc)
> finaltagshift=CC_Result$QCscores_ChIP$tag.shift

The function expects two BAM files: one for the immunoprecipitation (ChIP) and one for the control (input). The read length (read_length parameter) can be taken from the BAM file itself. An optional parameter is mc, set to 1 by default; larger values run some steps in parallel and speed up the calculations. The function returns a number of QC-metrics, among them the tag.shift value, which is an input parameter for later steps (i.e. peak-calling and metagene calculation). The wrapper executes the following functions:

1.1 Reading BAM files

The first step in qualityScores_EM reads ChIP-seq data in BAM format. The function expects only the filename, which may include the path.

PLEASE NOTE: the following code chunk is only needed when starting from the BAM files. As we have already loaded the reads, we skip this part.

> chipname=file.path(filepath,"encff000bfx")
> inputname=file.path(filepath,"encff000bdq")
> chipbam=readBamFile(chipname)
> inputbam=readBamFile(inputname)

1.2 Calculate QC-metrics from CrossCorrelation analysis

The next function called is getCrossCorrelationScores, which calculates QC-metrics from the cross-correlation analysis and other general metrics, e.g. the non-redundant fraction of mapped reads. An important parameter required by getCrossCorrelationScores is the binding-characteristics structure, calculated using the spp::get.binding.characteristics function. This structure provides information about the peak separation distance and the cross-correlation profile (for more details see [1]).

> cluster <- parallel::makeCluster( mc )
> ## calculate binding characteristics
> chip_binding.characteristics<-spp::get.binding.characteristics(
+     chipbam, srange=c(0,500), bin = 5, accept.all.tags = TRUE,
+     cluster = cluster)
> input_binding.characteristics<-spp::get.binding.characteristics(
+     inputbam, srange=c(0,500), bin = 5, accept.all.tags = TRUE,
+     cluster = cluster)
> parallel::stopCluster( cluster )
> ## calculate cross correlation QC-metrics
> crossvalues_chip<-getCrossCorrelationScores( chipbam,
+     chip_binding.characteristics, read_length = 36,
+     savePlotPath = filepath, mc = mc)

The output of the function is a list with the calculated QC-metrics. Additionally, we have to save the calculated tag.shift value for further steps:

> str(crossvalues_chip)
List of 20
 $ CC_StrandShift                : num 195
 $ tag.shift                     : num 98
 $ N1                            : num 943998
 $ Nd                            : num 982966
 $ CC_PBC                        : num 0.96
 $ CC_readLength                 : num 36
 $ CC_UNIQUE_TAGS_LibSizeadjusted: num 982912
 $ CC_NSC                        : num 1.44
 $ CC_RSC                        : num 1.13
 $ CC_QualityFlag                : num 1
 $ CC_shift.                     : num 200
 $ CC_A.                         : num 0.256
 $ CC_B.                         : num 0.247
 $ CC_C.                         : num 0.178
 $ CC_ALL_TAGS                   : int 1025541
 $ CC_UNIQUE_TAGS                : int 982966
 $ CC_UNIQUE_TAGS_nostrand       : int 967887
 $ CC_NRF                        : num 0.958
 $ CC_NRF_nostrand               : num 0.944
 $ CC_NRF_LibSizeadjusted        : num 0.0983
> finaltagshift <- crossvalues_chip$tag.shift

The user can also calculate the cross-correlation metrics for the input with the following command, passing the binding characteristics of the input:

> ## calculate cross correlation QC-metrics for input
> crossvalues_input <- getCrossCorrelationScores(inputbam,
+     input_binding.characteristics, read_length = 36,
+     savePlotPath = filepath, mc = mc)

savePlotPath sets the directory in which the cross-correlation plot (as PDF) is saved. If it is not provided, the plot is sent to the default display device. An example of a cross-correlation profile is shown in Figure 1. getCrossCorrelationScores must be run on both ChIP and input.

1.3 Remove anomalies in the read distribution

The data has to be processed further using removeLocalTagAnomalies, which removes local read anomalies such as regions with extremely high read counts compared to their neighbourhood (for more details see [1]).

> ## get chromosome information and order chip and input by it
> chrl_final <- intersect(names(chipbam$tags), names(inputbam$tags))
> chipbam$tags <- chipbam$tags[chrl_final]
> chipbam$quality <- chipbam$quality[chrl_final]
> inputbam$tags <- inputbam$tags[chrl_final]
> inputbam$quality <- inputbam$quality[chrl_final]

Figure 1: Cross-correlation plot for chromosome 1 of the ChIP (x-axis: strand shift; y-axis: cross-correlation). Values annotated in the plot: A = 0.256, B = 0.247, C = 0.178, NSC = A/C = 1.44, RSC = (A - C)/(B - C) = 1.13, quality flag = 1, shift = 200, read length = 36.
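The two ratios annotated in Figure 1 can be reproduced by hand from the three cross-correlation values A, B and C. The following is a minimal base-R sketch, not ChIC code: the helper cc_ratios is hypothetical, and the reading of A, B and C in the comments (peak at the estimated strand shift, peak at the read length, and background minimum) is an assumption based on the usual ENCODE definitions of NSC and RSC.

```r
# Values read from the annotations of Figure 1 (illustration only).
A <- 0.256  # assumed: cross-correlation at the estimated strand shift
B <- 0.247  # assumed: cross-correlation at the read-length shift
C <- 0.178  # assumed: minimum (background) cross-correlation

# Hypothetical helper, not part of ChIC: computes the two ratios
# exactly as they are written in the figure annotation.
cc_ratios <- function(A, B, C) {
  stopifnot(C > 0, B > C)
  c(NSC = A / C, RSC = (A - C) / (B - C))
}

res <- cc_ratios(A, B, C)
print(round(res, 2))
```

Rounded to two digits, the result agrees with the CC_NSC and CC_RSC entries of the list returned in Section 1.2.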

> ## remove singular positions with extremely high read counts with
> ## respect to the neighbourhood
> selectedtags <- removeLocalTagAnomalies(chipbam, inputbam,
+     chip_binding.characteristics, input_binding.characteristics)
> inputbamselected <- selectedtags$input.dataSelected
> chipbamselected <- selectedtags$chip.dataSelected

1.4 Calculate QC-metrics from peak calls

The last set of QC-metrics is calculated from the called peaks using getPeakCallingScores. The function reports its progress on the console:

finding background exclusion regions... done
determining peaks on provided 1 control datasets:
using reversed signal for FDR calculations
bg.weight= 1.440726
excluding systematic background anomalies... done
determining peaks on real data:
bg.weight= 0.6940943
excluding systematic background anomalies... done
calculating statistical thresholds
FDR 0.01 threshold= 3.165151
finding background exclusion regions... done
determining peaks on provided 1 control datasets:
using reversed signal for FDR calculations
bg.weight= 1.440726
excluding systematic background anomalies... done
determining peaks on real data:
bg.weight= 0.6940943
excluding systematic background anomalies... done
calculating statistical thresholds
E-value 10 threshold= 5.545838

1.5 Profile smoothing

The last step is the smoothing of the read profile with a Gaussian kernel to obtain the tag density profile (for more details see [1]). The tag density profile is needed to calculate the next two categories of QC-metrics: the Global enrichment profile Metrics (GM) and the local enrichment profile metrics (LM).

> smoothedchip <- tagDensity(chipbamselected,
+     tag.shift = finaltagshift, mc = mc)
> smoothedinput <- tagDensity(inputbamselected,
+     tag.shift = finaltagshift, mc = mc)

2 Global enrichment profile Metrics (GM) and Fingerprint-plot

This set of QC-metrics is based on the global read distribution along the genome for ChIP and input data [2]. The function qualityScores_GM reproduces the so-called fingerprint plot (Figure 2) and returns a list of 9 QC-metrics taken from the cumulative distributions in the plot. Examples of these metrics are (a) the fraction of bins without reads for ChIP and input, (b) the point of maximum distance between ChIP and input (x-coordinate, y-coordinates for ChIP and input, the distance calculated as the absolute difference between the two y-coordinates, and the sign of the difference), and (c) the fraction of reads in the top 1 percent of bins with the highest coverage for ChIP and input.

> Ch_Results <- qualityScores_GM(densityChip = smoothedchip,
+     densityInput = smoothedinput, savePlotPath = filepath)

Figure 2: Fingerprint plot (global read distribution) of sample ENCFF000BFX and its input ENCFF000BDQ (x-axis: percentage of bins; y-axis: percentage of tags).

3 Metagene profiles and local enrichment profile metrics (LM)

Metagene plots show the signal enrichment around a region of interest, such as the transcription start site (TSS), or over the gene body. ChIC creates two types of metagene profiles: a non-scaled profile for the TSS and the transcription end site (TES), and a scaled profile for the entire gene, including the gene body (Figure 3). For the metagene profile, the tag density of the immunoprecipitation is taken over all RefSeq-annotated human genes, averaged and log2-transformed. The same is done for the input (Figure 3). The normalized profile (Figure 4) is calculated as the signal enrichment (immunoprecipitation over input) and plotted on the y-axis, whereas the genomic coordinates of the genes, such as the TSS or the regions up- and downstream, are shown on the x-axis.

createMetageneProfile creates the scaled and non-scaled metagene profiles and returns a list with three items: geneBody, TSS and TES. Each item is again a list with the metagene profiles for ChIP and input.

> Meta_Result <- createMetageneProfile(
+     smoothed.densityChip = smoothedchip,
+     smoothed.densityInput = smoothedinput,
+     tag.shift = finaltagshift, mc = mc)

The objects in Meta_Result can be used to create the final metagene plots and to get the respective QC-values for the non-scaled profiles around the TSS and TES.

> TSS_Scores <- qualityScores_LM(data = Meta_Result$TSS, tag = "TSS",
+     savePlotPath = filepath)
> TES_Scores <- qualityScores_LM(data = Meta_Result$TES, tag = "TES",
+     savePlotPath = filepath)

The geneBody object can be used to plot the scaled metagene profile and to get its respective QC-values:

> ## create scaled metagene profile
> genebody_scores <- qualityScores_LMgenebody(Meta_Result$geneBody,
+     savePlotPath = filepath)

4 Quality assessment using the compendium of QC-metrics as reference

The comprehensive set of QC-metrics, computed over a large set of ChIP-seq samples, constitutes in itself a valuable compendium that can be used as a reference for new samples. ChIC provides three functions for this purpose:

- metagenePlotsForComparison to compare the metagene plots with the compendium
- plotReferenceDistribution to compare a QC-metric with the compendium values
- predictionScore to obtain a single quality score from the previously computed QC-metrics

Figure 3: Metagene profiles: signal enrichment for ChIP and Input along the gene body (x-axis: metagene coordinates; y-axis: mean of log2 tag density).

Figure 4: Normalized metagene profile: signal enrichment for ChIP over Input along the gene body (x-axis: metagene coordinates; y-axis: mean log2 enrichment, signal/input).
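The averaging behind Figures 3 and 4 can be illustrated with simulated data. This is a rough base-R sketch and not the ChIC implementation: the matrices, the helper metagene and the enrichment shape bump are hypothetical, and taking the mean of log2 densities (rather than the log2 of the mean) is an assumption suggested by the axis label "mean of log2 tag density".

```r
set.seed(1)
n_genes <- 200
n_bins <- 50

# Hypothetical tag densities: rows = genes, columns = positions around a
# region of interest. The +1 keeps every value positive for log2().
input <- matrix(rexp(n_genes * n_bins, rate = 1) + 1, n_genes, n_bins)

# Simulated enrichment that peaks in the centre of the window.
bump <- 1 + 2 * dnorm(seq(-3, 3, length.out = n_bins))
chip <- sweep(input, 2, bump, `*`)  # multiply column j by bump[j]

# Metagene profile: log2-transform each gene, then average over genes.
metagene <- function(d) colMeans(log2(d))

chip_profile <- metagene(chip)
input_profile <- metagene(input)

# Normalized profile: mean log2 enrichment of ChIP over input.
norm_profile <- chip_profile - input_profile
print(round(range(norm_profile), 3))
```

Because the simulated ChIP signal is the input multiplied by a fixed shape, the normalized profile recovers log2(bump), which makes the sketch easy to check.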

4.1 Comparing local enrichment profiles

The metagenePlotsForComparison function compares the local enrichment profile to the reference compendium by plotting the metagene profile against the expected metagene for the same type of chromatin mark. The expected metagene is given by the compendium mean (black line) and twice its standard error (blue shadow), as shown in Figure 5.

> metagenePlotsForComparison(data = Meta_Result$geneBody,
+     chrommark = "H3K4me3",
+     tag = "geneBody",
+     savePlotPath = filepath)
> metagenePlotsForComparison(data = Meta_Result$TSS,
+     chrommark = "H3K4me3",
+     tag = "TSS",
+     savePlotPath = filepath)

Figure 5: Enrichment profile and QC-metrics can be plotted against the pre-computed profiles of the compendium. The metagene profile (panel H3K4me3_geneBody_Input; x-axis: metagene coordinates; y-axis: mean of log2 tag density) shows the sample signal (red line) for the input compared to the compendium mean signal (black line) and its 2x standard error (blue shadow).
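The band drawn in Figure 5 (mean plus or minus two standard errors) can be sketched in a few lines of base R. The reference matrix below is simulated and the helper inside is hypothetical; how ChIC stores and summarizes the compendium profiles internally is not shown in this document, so this is only an illustration of the comparison idea.

```r
set.seed(42)
n_ref <- 100  # number of hypothetical reference samples
n_pos <- 40   # number of metagene positions

# Simulated compendium: one noisy profile per reference sample (rows).
shape <- dnorm(seq(-2, 2, length.out = n_pos))
ref <- t(replicate(n_ref, shape + rnorm(n_pos, sd = 0.05)))

# Position-wise mean and standard error of the reference profiles.
ref_mean <- colMeans(ref)
ref_se <- apply(ref, 2, sd) / sqrt(n_ref)
lower <- ref_mean - 2 * ref_se
upper <- ref_mean + 2 * ref_se

# Hypothetical helper: fraction of positions of a sample profile that
# fall inside the mean +/- 2 * standard error band.
inside <- function(profile) mean(profile >= lower & profile <= upper)

print(inside(ref_mean))        # a profile equal to the reference mean
print(inside(ref_mean + 0.5))  # a profile shifted far above the band
```

A profile that follows the compendium mean lies entirely inside the band, while a strongly shifted profile lies entirely outside; real samples usually fall somewhere in between.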

Figure 6: Enrichment profile plotted against the pre-computed profiles of the compendium. The metagene profiles show the sample signal (red line) for the ChIP (A, panel H3K4me3_geneBody Chip; y-axis: mean of log2 tag density) and the enrichment (chip over input) (B, panel H3K4me3_geneBody norm; y-axis: mean log2 enrichment, signal/input) compared to the compendium mean signal (black line) and its 2x standard error (blue shadow). The x-axis of both panels shows the metagene coordinates.

4.2 Comparing QC-metrics with reference values of the compendium

Plotting a metric against the reference compendium adds an extra level of information that is easy to interpret, even for less experienced users. The function plotReferenceDistribution helps to visually compare the characteristics of an analysed sample with a large number of already published datasets (Figure 7).

> plotReferenceDistribution(chrommark = "H3K4me3",
+     metricToBePlotted = "RSC",
+     currentValue = crossvalues_chip$CC_RSC,
+     savePlotPath = filepath)

Figure 7: The QC-metric of a newly analysed ChIP-seq sample can be compared to the reference values of the compendium. The density plot (RSC, H3K4me3, Sharp class; N = 976, bandwidth = 0.05086) shows the QC-metric RSC of the current sample as a red dashed line.

4.3 Assessing data quality with machine learning

Moreover, we used the compendium of metrics to train a machine learning model that summarizes the sample quality in a single score.

> te <- predictionScore(chrommark = "H3K4me3",
+     features_cc = CC_Result,
+     features_global = Ch_Results,
+     features_TSS = TSS_Scores,
+     features_TES = TES_Scores,
+     features_scaled = genebody_scores)
> print(te)
[1] 0.498
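The comparison of a single QC-metric with reference values (Section 4.2) can also be expressed numerically. The sketch below uses base R only; the reference vector is simulated and merely stands in for the compendium values, which are not reproduced here, so the resulting percentile is illustrative and says nothing about the real compendium.

```r
set.seed(7)

# Simulated stand-in for the 976 compendium RSC values (assumption).
reference <- rlnorm(976, meanlog = 0, sdlog = 0.4)

# RSC of the example sample as listed in Section 1.2.
current <- 1.13

# Kernel density of the reference values, as in a density plot.
dens <- density(reference)

# Fraction of reference samples with an RSC below the current sample.
pct <- ecdf(reference)(current)
print(round(pct, 2))

# Draw the density with the current value as a red dashed line
# (only when run interactively).
if (interactive()) {
  plot(dens, main = "RSC (simulated reference)")
  abline(v = current, col = "red", lty = 2)
}
```

Reporting the empirical percentile next to the plot gives a number that can be tracked across samples, which the visual comparison alone does not provide.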

References

[1] Peter V Kharchenko, Michael Y Tolstorukov, and Peter J Park. Design and analysis of ChIP-seq experiments for DNA-binding proteins. Nat. Biotechnol., 26(12):1351-1359, 2008.

[2] Aaron Diaz, Abhinav Nellore, and Jun S Song. CHANCE: comprehensive software for quality control and validation of ChIP-seq data. Genome Biol., 13(10):R98, 2012.