PICS: Probabilistic Inference for ChIP-Seq

Xuekui Zhang (ubcxzhang@gmail.com), Raphael Gottardo (rgottard@fhcrc.org), Arnaud Droit (arnaud.droit@crchuq.ulaval.ca) and Renan Sauteraud (rsautera@fhcrc.org)

April 30, 2018

A step-by-step guide to the analysis of ChIP-Seq data using the PICS package in R

Contents
1 Licensing
2 Introduction
3 PICS pipeline
4 Data Input and Formatting
5 PICS analysis
5.1 Genome segmentation
5.2 Parameter estimation
5.3 FDR estimation
6 Result output
6.1 Writing enriched regions to BED files
6.2 Writing density scores to WIG files

1 Licensing

Under the Artistic License 2.0, you are free to use and redistribute this software. However, if you use this package for your own publication, we ask that you cite the following paper:

X. Zhang, G. Robertson, M. Krzywinski, K. Ning, A. Droit, S. Jones and R. Gottardo. (2009). PICS: Probabilistic Inference for ChIP-seq. Biometrics 67, 151-163.

2 Introduction

ChIP-Seq, which combines chromatin immunoprecipitation with massively parallel short-read sequencing, can profile in vivo genome-wide transcription factor-DNA association with higher sensitivity, specificity and spatial resolution than ChIP-chip. While it presents new opportunities for research, ChIP-Seq poses new challenges for statistical analysis that derive from the complexity of the biological systems characterized and from the variability and biases in its digital sequence data.

We propose a method called PICS (Probabilistic Inference for ChIP-Seq) for extracting information from ChIP-Seq aligned-read data in order to identify regions bound by transcription factors. PICS identifies enriched regions by modeling local concentrations of directional reads, and uses prior information on DNA fragment length to discriminate closely adjacent binding events via a Bayesian hierarchical t-mixture model. Its per-event fragment length estimates also allow it to remove from the analysis regions that have atypical fragment lengths. PICS uses a pre-calculated, whole-genome read mappability profile and a truncated t-distribution to adjust binding event models to compensate for reads that are missing due to local genome repetitiveness. PICS estimates uncertainties in model parameters, and these can be used to define confidence regions on binding event locations and to filter estimates.

3 PICS pipeline

A typical PICS analysis consists of the following steps:

1. Convert the data to a GRanges object for efficient processing
2. Segment the genome via segmentPICS
3. Estimate binding site positions and other PICS parameters via PICS
4. Repeat steps 2-3 after swapping the IP and control samples
5. Use the results of steps 3 and 4 to estimate the FDR via picsFDR
6. Output enriched regions and PICS profiles as BED and WIG files

As with any package, you first need to load it with the following command:

> library(PICS)

4 Data Input and Formatting

The first step of the PICS pipeline converts the aligned reads (from both IP and control samples) into a format that allows efficient segmentation of the genome into a set of candidate regions that have enough forward and reverse reads. The input can be a BED-type data frame or an AlignedRead object as returned by the function readAligned, which can read various file formats including Eland, MAQ, Bowtie, etc. Please refer to the ShortRead vignette for more details on how to read data into R from sources other than a BED file. In the example below, we use BED-type files, which are tab-delimited files with the following columns: space, start, end, strand, where each line represents a single read.

In addition to the IP and control datasets, we use another argument consisting of a mappability profile for the genome interrogated. For each chromosome, a mappability profile for a specific read length consists of a vector of zeros and ones that gives an estimated read mappability score for each base pair in the chromosome. A score of one at a position means that we should be able to align a read of that length uniquely at that position, while a score of zero indicates that no read of that length should be uniquely alignable at that position. As noted, reads that cannot be mapped to unique genomic locations are typically discarded. For convenience, and because transitions between mappable and non-mappable regions are typically much shorter than these regions, we compactly summarize each chromosome's mappability profile as a disjoint union of non-mappable intervals, that is, we record only the zero-valued stretches of the profile. We store this information as a BED file where each line corresponds to one non-mappable interval. PICS makes use of such mappability profiles to estimate the reads that are missing because they could not be aligned.
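To make the interval summary concrete, here is a minimal base R sketch that turns a per-base 0/1 mappability vector into a table of non-mappable intervals. The toy vector, the chromosome name chrToy and the 1-based inclusive coordinates are assumptions made for illustration; this is not the script used to build the published profiles, and real profile files may follow a different coordinate convention.

```r
# Hypothetical per-base mappability vector for a tiny 20 bp "chromosome":
# 1 = a read starting here can be aligned uniquely, 0 = it cannot.
mappability <- c(1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1)

# Run-length encode the vector to find stretches of identical values.
runs   <- rle(mappability)
ends   <- cumsum(runs$lengths)        # 1-based end of each run
starts <- ends - runs$lengths + 1     # 1-based start of each run
isZero <- runs$values == 0            # keep only the non-mappable runs

# One row per non-mappable interval (assumed 1-based, inclusive coordinates).
nonMappable <- data.frame(
  space = "chrToy",                   # hypothetical chromosome name
  start = starts[isZero],
  end   = ends[isZero]
)
print(nonMappable)
```

Storing only the zero-valued runs keeps the profile small, because long mappable stretches never need to be written out.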
Mappability profiles for a range of read lengths and species are available from: http://wiki.rglab.org/index.php?title=Public:Mappability_Profile

To make your own mappability profiles, use the script available at: https://github.com/srenan/promap/

Once the data have been read, we can create the GRanges objects, which store the genomic locations and associated annotations. Each element of the list contains the IP and control reads as well as the mappability profiles for the corresponding chromosome. When reading the data, we can also sort the reads and remove duplicates, which is recommended for a PICS analysis (sorting is actually required). In this documentation, the data are located in the .../pics/inst/extdata folder.

> path<-system.file("extdata",package="PICS")
> dataip<-read.table(file.path(path,"Treatment_tags_chr21_sort.bed"),header=TRUE,
+ colClasses=c("factor","integer","integer","factor"))
> dataip<-as(dataip,"GRanges")
> datacont<-read.table(file.path(path,"Input_tags_chr21_sort.bed"),header=TRUE,
+ colClasses=c("factor","integer","integer","factor"))
> datacont<-as(datacont,"GRanges")
> map<-read.table(file.path(path,"mapProfileShort"),header=TRUE,
+ colClasses=c("factor","integer","integer","NULL"))
> map<-as(map,"GRanges")
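The sorting and duplicate removal recommended above can be sketched in base R on a BED-type data frame before it is converted to a GRanges object. The reads table is made up, and treating a duplicate as a row with identical space, start, end and strand is an assumption for illustration, not a statement of how PICS itself handles duplicates.

```r
# Hypothetical BED-type reads table with the four columns described above.
reads <- data.frame(
  space  = factor(c("chr21", "chr21", "chr21", "chr21", "chr21")),
  start  = c(500L, 100L, 100L, 300L, 100L),
  end    = c(535L, 135L, 135L, 335L, 135L),
  strand = factor(c("+", "-", "+", "+", "+"), levels = c("+", "-"))
)

# Drop exact duplicates (same chromosome, start, end and strand).
reads <- reads[!duplicated(reads), ]

# Sort by chromosome, then start, then end.
reads <- reads[order(reads$space, reads$start, reads$end), ]
rownames(reads) <- NULL
print(reads)
```

Note that the forward and reverse reads starting at position 100 are both kept, because they differ in strand.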

5 PICS analysis

5.1 Genome segmentation

We segment the genome into candidate regions by pre-processing the bidirectional aligned-read data from a single ChIP-Seq experiment to detect regions with a minimum number of forward and reverse reads. These candidate regions are then processed by PICS. The minReads parameter heavily influences the number of candidate regions returned. In principle, it is better to use a small value for minReads in order not to miss any true binding sites. However, too small a value will likely return many false positive regions, which can result in a longer computing time for the PICS function. If a control sample is available, you may also choose a value of minReads that gives a ratio of the number of candidate regions in the control to the number of candidate regions in the IP of about 25%. A ratio larger than 25% would most likely result in a large FDR when using picsFDR, so there is no need to set the threshold too low. The following command performs the segmentation:

> seg<-segmentPICS(dataip, dataC=datacont, map=map, minReads=1)

5.2 Parameter estimation

After segmenting the genome into candidate regions, we use the PICS function to detect binding regions in a probabilistic way; it returns binding site estimates, confidence intervals, etc. In the code below, we assume that you are processing transcription factor data, as specified by setting the dataType option to "TF" for transcription factor. We also plan to support histone modification data in a future release of PICS, which will be turned on by specifying dataType="H".

To improve the computational efficiency of the PICS package, we recommend using the parallel package, which allows for easy parallel computations. In what follows, we assume that parallel is installed on your machine and that you have at least two cores. If not, you can omit the first line and the nCores argument, and calculations will occur on a single CPU.
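If you are unsure how many cores are available, a small base R sketch such as the following can pick a safe value before calling PICS. The variable names available and nCoresToUse are hypothetical; the cap of two cores simply mirrors the example in this guide.

```r
library(parallel)  # ships with R, no separate installation needed

# detectCores() can return NA on some platforms, so fall back to one core.
available <- detectCores()
nCoresToUse <- if (is.na(available)) 1L else min(2L, available)
print(nCoresToUse)
```

The resulting value could then be supplied as the nCores argument instead of a hard-coded 2.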
By default, the function uses only one core.

> library(parallel)

In our case, assuming that we have already segmented the genome using segmentPICS, we can proceed with the following command:

> pics<-PICS(seg, nCores=2)

5.3 FDR estimation

To estimate the FDR, we need to rerun the analysis after swapping the IP and control samples. Note that this requires a control sample. We proceed with the following commands to compute the FDR after removing binding site estimates with noisy parameters, as specified by the filter.

> segc<-segmentPICS(datacont,dataC=dataip,map=map,minReads=1)
> picsc<-PICS(segc)
> fdr<-picsFDR(pics,picsc,filter=list(delta=c(50,Inf),se=c(0,50),
+ sigmaSqF=c(0,22500),sigmaSqR=c(0,22500)))

Once the FDR has been estimated, one can visualize it as a function of the score to pick a threshold that leads to the desired FDR.

> plot(pics,picsc,xlim=c(2,8),ylim=c(0,.2),filter=list(delta=c(50,300),
+ se=c(0,50),sigmaSqF=c(0,22500),sigmaSqR=c(0,22500)),type="l")

[Figure: FDR (y-axis, 0 to 0.20) as a function of the score (x-axis, 2 to 8); the top axis gives the number of regions (369, 249, 146, 124, 90, 84).]

You can also visualize the FDR as a function of the number of regions.

> plot(fdr[,c(3,1)])
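The idea behind a swap-based FDR estimate can be sketched in base R. This is an illustration under stated assumptions, not the internals of picsFDR: it assumes that, at a given score cutoff, the FDR is approximated by the number of regions called in the swapped (control vs. IP) run divided by the number called in the original (IP vs. control) run. The scores and the helper swapFDR are hypothetical.

```r
# Hypothetical region scores; real values would come from the two PICS runs.
scoreIP   <- c(9.1, 7.4, 6.8, 5.2, 4.9, 3.7, 3.1, 2.6, 2.2, 2.0)  # IP vs. control
scoreCont <- c(3.3, 2.4, 2.1)                                      # control vs. IP

# For each cutoff, count regions at or above it in both runs and take the ratio.
swapFDR <- function(scoreIP, scoreCont, cutoffs) {
  nIP   <- sapply(cutoffs, function(s) sum(scoreIP >= s))
  nCont <- sapply(cutoffs, function(s) sum(scoreCont >= s))
  fdr   <- ifelse(nIP > 0, pmin(nCont / nIP, 1), NA_real_)
  data.frame(score = cutoffs, N = nIP, FDR = fdr)
}

fdrTable <- swapFDR(scoreIP, scoreCont, cutoffs = 2:8)
print(fdrTable)

# Smallest cutoff whose estimated FDR is at most 10%.
ok <- !is.na(fdrTable$FDR) & fdrTable$FDR <= 0.10
bestCutoff <- min(fdrTable$score[ok])
print(bestCutoff)
```

Raising the cutoff trades the number of reported regions N against the estimated FDR, which is the trade-off the two plots above are meant to expose.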

[Figure: output of plot(fdr[,c(3,1)]), showing the FDR (y-axis, 0 to 1) against the number of regions N (x-axis, 0 to 600).]

6 Result output

To facilitate data processing and data output, we make use of the IRanges package to summarize our results as RangedData objects. The resulting RangedData objects can then be analyzed with the IRanges package and/or exported to BED/WIG files with the rtracklayer package. In each case, we can use filters to exclude noisy regions and/or regions with too low a score. In particular, the picsFDR function can be used to set the score threshold that leads to the desired FDR.

6.1 Writing enriched regions to BED files

> myfilter=list(delta=c(50,300),se=c(0,50),sigmaSqF=c(0,22500),sigmaSqR=c(0,22500))
> rdbed<-makeRangedDataOutput(pics,type="bed",filter=c(myfilter,list(score=c(1,Inf))))

The following command creates the corresponding BED file. Because it requires the rtracklayer package, we do not run it here but simply display it for reference.

> library(rtracklayer)
> export(rdbed,"myfile.bed")

6.2 Writing density scores to WIG files

The following commands create the appropriate RangedData object and the corresponding WIG file. Because they require the rtracklayer package, we do not run them here but simply display them for reference.

> rdbed<-makeRangedDataOutput(pics,type="wig",filter=c(myfilter,list(score=c(1,Inf))))
> export(rdbed,"myfile.wig")
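The filter lists used throughout this guide pair a parameter name with a lower and an upper bound. A minimal base R sketch of that idea follows. It assumes the bounds are inclusive and that each region carries one estimate per named parameter; the regions table and the helper applyRangeFilter are hypothetical and are not part of the PICS API.

```r
# Hypothetical per-region estimates; the column names mirror the filter names
# used in this guide, but the values are made up for illustration.
regions <- data.frame(
  score    = c(5.2, 0.8, 3.1, 7.9),
  delta    = c(150, 120, 420, 200),
  se       = c(12, 8, 20, 75),
  sigmaSqF = c(4000, 3000, 5000, 6000),
  sigmaSqR = c(3500, 2500, 30000, 5000)
)

# Keep rows whose value lies within [lower, upper] for every named parameter.
applyRangeFilter <- function(df, filter) {
  keep <- rep(TRUE, nrow(df))
  for (name in names(filter)) {
    bounds <- filter[[name]]
    keep <- keep & df[[name]] >= bounds[1] & df[[name]] <= bounds[2]
  }
  df[keep, , drop = FALSE]
}

myfilter <- list(delta = c(50, 300), se = c(0, 50),
                 sigmaSqF = c(0, 22500), sigmaSqR = c(0, 22500))
kept <- applyRangeFilter(regions, c(myfilter, list(score = c(1, Inf))))
print(kept)
```

In this toy table only the first region survives: the second fails the score bound, the third fails the delta and sigmaSqR bounds, and the fourth fails the se bound.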