Analysis of ChIP-seq data

Similar documents
ChIP- seq Analysis. BaRC Hot Topics - Feb 24 th 2015 BioinformaBcs and Research CompuBng Whitehead InsBtute. hgp://barc.wi.mit.

ChIP-seq Analysis. BaRC Hot Topics - March 21 st 2017 Bioinformatics and Research Computing Whitehead Institute.

ChIP-seq Analysis. BaRC Hot Topics - Feb 23 th 2016 Bioinformatics and Research Computing Whitehead Institute.

Protocol: peak-calling for ChIP-seq data / segmentation analysis for histone modification data

Analyzing ChIP- Seq Data in Galaxy

NGS Data Visualization and Exploration Using IGV

High-throughput sequencing: Alignment and related topic. Simon Anders EMBL Heidelberg

Bioinformatics in next generation sequencing projects

Practical 4: ChIP-seq Peak calling

High-throughput sequencing: Alignment and related topic. Simon Anders EMBL Heidelberg

ITMO Ecole de Bioinformatique Hands-on session: smallrna-seq N. Servant 21 rd November 2013

Galaxy Platform For NGS Data Analyses

From genomic regions to biology

ChIP-seq practical: peak detection and peak annotation. Mali Salmon-Divon Remco Loos Myrto Kostadima

High-throughout sequencing and using short-read aligners. Simon Anders

Identifying ChIP-seq enrichment using MACS

ChIP-seq hands-on practical using Galaxy

ChIP-seq hands-on practical using Galaxy

!"#$%&$'()#$*)+,-./).01"0#,23+3,303456"6,&((46,7$+-./&((468,

User's guide to ChIP-Seq applications: command-line usage and option summary

ChIP-Seq Tutorial on Galaxy

Useful software utilities for computational genomics. Shamith Samarajiwa CRUK Autumn School in Bioinformatics September 2017

Genomic Analysis with Genome Browsers.

Today's outline. Resources. Genome browser components. Genome browsers: Discovering biology through genomics. Genome browser tutorial materials

RNA-seq. Manpreet S. Katari

Easy visualization of the read coverage using the CoverageView package

ChIP-seq (NGS) Data Formats

ChIP-seq Analysis Practical

ASAP - Allele-specific alignment pipeline

Agilent Genomic Workbench Lite Edition 6.5

Welcome to MAPHiTS (Mapping Analysis Pipeline for High-Throughput Sequences) tutorial page.

Dr. Gabriela Salinas Dr. Orr Shomroni Kaamini Rhaithata

A short Introduction to UCSC Genome Browser

de.nbi and its Galaxy interface for RNA-Seq

NGS Data Analysis. Roberto Preste

Colorado State University Bioinformatics Algorithms Assignment 6: Analysis of High- Throughput Biological Data Hamidreza Chitsaz, Ali Sharifi- Zarchi

TECH NOTE Improving the Sensitivity of Ultra Low Input mrna Seq

Our typical RNA quantification pipeline

Introduction to Read Alignment. UCD Genome Center Bioinformatics Core Tuesday 15 September 2015

BGGN-213: FOUNDATIONS OF BIOINFORMATICS (Lecture 14)

Bioinformatics I. Teaching assistant(s): Eudes Barbosa Markus List

Genomic Files. University of Massachusetts Medical School. October, 2015

Maize genome sequence in FASTA format. Gene annotation file in gff format

RNA-Seq in Galaxy: Tuxedo protocol. Igor Makunin, UQ RCC, QCIF

Ensembl RNASeq Practical. Overview

EpiGnome Methyl Seq Bioinformatics User Guide Rev. 0.1

ChIP-Seq data analysis workshop

Sequencing. Short Read Alignment. Sequencing. Paired-End Sequencing 6/10/2010. Tobias Rausch 7 th June 2010 WGS. ChIP-Seq. Applied Biosystems.

Analysis of ChIP-seq Data with mosaics Package: MOSAiCS and MOSAiCS-HMM

Identiyfing splice junctions from RNA-Seq data

Introduction to Galaxy

modencode Galaxy: Uniform ChIP-Seq Processing Tools for modencode and ENCODE Data

CBSU/3CPG/CVG Joint Workshop Series Reference genome based sequence variation detection

Genomic Files. University of Massachusetts Medical School. October, 2014

Advanced UCSC Browser Functions

Automated Bioinformatics Analysis System on Chip ABASOC. version 1.1

Using Galaxy for NGS Analyses Luce Skrabanek

Triform: peak finding in ChIP-Seq enrichment profiles for transcription factors

TP RNA-seq : Differential expression analysis

Copyright 2014 Regents of the University of Minnesota

PICS: Probabilistic Inference for ChIP-Seq

Under the Hood of Alignment Algorithms for NGS Researchers

NGS Analysis Using Galaxy

Exercise 1 Review. --outfiltermismatchnmax : max number of mismatch (Default 10) --outreadsunmapped fastx: output unmapped reads

NGS Data and Sequence Alignment

Pre-processing and quality control of sequence data. Barbera van Schaik KEBB - Bioinformatics Laboratory

Copyright 2014 Regents of the University of Minnesota

RNA- SeQC Documentation

Exome sequencing. Jong Kyoung Kim

m6aviewer Version Documentation

Mapping NGS reads for genomics studies

Gene Expression Data Analysis. Qin Ma, Ph.D. December 10, 2017

Galaxy workshop at the Winter School Igor Makunin

SICER 1.1. If you use SICER to analyze your data in a published work, please cite the above paper in the main text of your publication.

RNA-seq Data Analysis

Illumina Next Generation Sequencing Data analysis

Sequence Preprocessing: A perspective

Nature Biotechnology: doi: /nbt Supplementary Figure 1

David Crossman, Ph.D. UAB Heflin Center for Genomic Science. GCC2012 Wednesday, July 25, 2012

Run Setup and Bioinformatic Analysis. Accel-NGS 2S MID Indexing Kits

The ChIP-seq quality Control package ChIC: A short introduction

Integrative Genomics Viewer. Prat Thiru

Sequence Analysis Pipeline

CLC Server. End User USER MANUAL

RNA-Seq. Joshua Ainsley, PhD Postdoctoral Researcher Lab of Leon Reijmers Neuroscience Department Tufts University

How To: Run the ENCODE histone ChIP- seq analysis pipeline on DNAnexus

Accessible, Transparent and Reproducible Analysis with Galaxy

Rsubread package: high-performance read alignment, quantification and mutation discovery

GSNAP: Fast and SNP-tolerant detection of complex variants and splicing in short reads by Thomas D. Wu and Serban Nacu

GPUBwa -Parallelization of Burrows Wheeler Aligner using Graphical Processing Units

Introduction to GE Microarray data analysis Practical Course MolBio 2012

Welcome to GenomeView 101!

Analyzing ChIP-Seq Data at the Command Line

DNase I Seq data Analysis Strategy. Dragon Star 2013 QianQin 同济大学

Next Generation Sequence Alignment on the BRC Cluster. Steve Newhouse 22 July 2010

Short Read Alignment. Mapping Reads to a Reference

NGS Analyses with Galaxy

Using Galaxy to provide a NGS Analysis Platform GTC s NGS & Bioinformatics Summit Europe October 7-8, 2013 in Berlin, Germany.

Aligners. J Fass 23 August 2017

Next-Generation Sequencing applied to adna

Transcription:

Before we start: 1. Log into tak (step 0 on the exercises) 2. Go to your lab space and create a folder for the class (see separate hand out) 3. Connect to your lab space through the wihtdata network and open the folder you created 1

Analysis of ChIP-seq data BaRC Hot Topics March 2014 http://jura.wi.mit.edu/bio/education/hot_topics 2

Outline ChIP-seq overview Quality control Experimental design Mapping Map reads Remove unmapped reads and convert to bam files Check the profile of the mapped reads Peak calling 3

ChIP-Seq overview Park, P. J., ChIP-seq: advantages and challenges of a maturing technology, Nat Rev Genet. Oct;10(10):669-80 (2009) 4

Steps in data analysis 1. Quality control 2. Effective mapping Treat IP and control the same way (preprocessing and mapping) 3. Peak calling i) Read extension and signal profile generation ii) Peak assignment Pepke, S. et al. Computation for ChIP-seq and RNA-seq studies, Nat Methods. Nov (2009) 5

Experimental design Include a control sample. If the protein of interest binds repetitive regions using paired end sequencing may reduce the mapping ambiguity. Otherwise single reads should be fine. Include at least two biological replicates. If you have replicas you may want to use the parameter IDR irreproducible discovery rate. Talk to us if you want to learn more about it. If only a small percentage of the reads maps to the genome, you may have to troubleshot your ChIP protocol. 6

Data that we will use on the hands on Data is from ENCODE the exercises EncodeBroadHistoneHepg2H3k4me3 EncodeBroadHistoneHepg2Control For Quality control and mapping we will use a subset of the data (10%) For running MACS we will use the full set but taking only reads mapped to chr1 7

Quality control (before read mapping) http://jura.wi.mit.edu/bio/education/hot_topics/ngs_qc_mapping_feb2014/ngs2014.pdf Quality Control and Mapping Reads Run fastqc Look at output and determine 1. What type of quality encoding (Illumina versions) your reads have Input qualities --solexa-quals <= 1.2 --phred64 1.3-1.7 --phred33 >= 1.8 Illumina versions 2. If you need to remove adapter, trim reads, filter reads based on quality etc. 8

Hands on Step 1: Quality control Run step 1 commands (fastqc) bsub fastqc Hepg2Control_subset.fastq bsub fastqc Hepg2H3k4me3_subset.fastq 9

Fastqc Output 10

Mapping We have to map reads to the genome. The mapping software doesn t have to know about splicing. Bowtie, bowtie2 (allows gaps) are good choices Bwa (allows gaps) is fine too Treat IP and control samples the same way during preprocessing and mapping. 11

Hands on Step 2: Mapping bsub bowtie -k 1 -n 2 -l 36 --best --sam --phred33-quals /nfs/genomes/human_gp_feb_09_no_random/bowtie/hg19 Hepg2Control_subset.fastq Hepg2Control_subset_hg19.k1.n2.l36.best.sam bsub bowtie -k 1 -n 2 -l 36 --best --sam --phred33-quals /nfs/genomes/human_gp_feb_09_no_random/bowtie/hg19 Hepg2H3k4me3_subset.fastq Hepg2H3k4me3_subset_hg19.k1.n2.l36.best.sam --phred33-quals Because we have Illumina 1.9 -k 1 report up to 1 good alignments per read (this is the default setting) -n 2 max mismatches in seed (this is the default setting) -l 36 seed length (make this the same or close to the read length) --best report the best alignment 12

Hands on Steps 3: Remove unmapped reads and convert to BAM This step makes the final files smaller. bsub "samtools view -hs -F 4 Hepg2Control_subset_hg19.k1.n2.l36.best.sam samtools view -b -S - > Hepg2Control_subset_hg19.k1.n2.l36.best.mapped_only.bam bsub "samtools view -hs -F 4 Hepg2H3k4me3_subset_hg19.k1.n2.l36.best.sam samtools view -b -S - > Hepg2H3k4me3_subset_hg19.k1.n2.l36.best.mapped_only.bam" 13

Hands on Steps 4: Quality control after read mapping Check the strand cross-correlation peak and predominant fragment length. bsub Rscript /nfs/barc_public/phantompeakqualtools/run_spp.r -c=control_chr1.bam -savp -out=control_chr1.run_spp.out bsub Rscript /nfs/barc_public/phantompeakqualtools/run_spp.r -c=hepg2h3k4me3_chr1.bam -savp -out=hepg2h3k4me3_chr1.run_spp.out -savp, save cross-correlation plot 14

Strand cross-correlation analysis Bailey, et al. Practical Guidelines for the Comprehensive Analysis of ChIP-seq Data, PLOS Computational Biology Nov 2013 Strand cross-correlation is computed as the Pearson correlation between the positive and the negative strand profiles at different strand shift distances, k 15

Strand cross-correlation analysis outputs 16

Peak calling Peak calling i) Read extension and signal profile generation strand cross-correlation can be used to calculate fragment length ii) Peak assignment Look for fold enrichment of the sample over input or expected background Estimate the significance of the fold enrichment using: Poisson distribution negative binomial distribution background distribution from input DNA model background data to adjust for local variation (MACS) Pepke, S. et al. Computation for ChIP-seq and RNA-seq studies, Nat Methods. Nov. 2009 17

Peak calling: MACS MACS uses a two-step strategy: 1. Builds a model using high quality peaks and infers shift size from it. 2. Shifts the reads and does a second round to call the final peaks. It uses a Poisson distribution to assign p-values to peaks. But the distribution has a dynamic parameter, local lambda, to capture the influence of local biases. MACS default is to filter out redundant tags at the same location and with the same strand by allowing at most 1 tag. This works well. -g: You need to set up this parameter accordingly: Effective genome size. It can be 1.0e+9 or 1000000000, or shortcuts: 'hs' for human (2.7e9), 'mm' for mouse (1.87e9), 'ce' for C. elegans (9e7) and 'dm' for fruit fly (1.2e8), Default:hs For broad peaks like some histone modifications it is recommended to use --nomodel and if there is not input sample to use --nolambda. 18

Hands on Steps 5: MACS Input files: Hepg2H3k4me3_chr1.bam Control_chr1.bam MACS command bsub "macs -t Hepg2H3k4me3_chr1.bam -c Control_chr1.bam --name=hepg2h3k4me3_chr1 --format=bam --wig --space=25 " PARAMETERS -t TFILE Treatment file -c CFILE Control file --name=name Experiment name, which will be used to generate output file names. DEFAULT: NA --format=format Format of tag file, BED or ELAND or ELANDMULTI or ELANDMULTIPET or SAM or BAM or BOWTIE. DEFAULT: BED --wig: Whether or not to save shifted raw tag count at every bp into a wiggle file --space=space The resolution for saving wiggle files, by default, MACS will save the raw tag count every 10 bps. Usable only with '--wig' option 19

MACS output Output files: 1. Folder with wig files for control and sample. 2. Excel file for peaks found containing the following columns : chr->start ->end->length->summit->tags-> -10*LOG10(pvalue) ->fold_enrichment->fdr(%) 3. Bed file with peaks found: chr->start->end-> -10*LOG10(pvalue) 4. Excel file for negative peaks Looking at the output on a genome browser: To visualize the peaks you can upload the bed file into a genome browser. You can also make a bedgraph file with columns (step 6 of hands on): chr->start ->end->fold_enrichment 20

Visualize your peaks in IGV Treatment wig Control wig Peaks bedgraph Peaks bed RefSeq Genes 21

Other recommendations Look at your raw data as well as the peak calls in a genome browser to find out if you agree with the peaks calling software Optional: remove reads mapping to the ENCODE and 1000 Genomes blacklisted regions https://sites.google.com/site/anshulkundaje/projects/blacklists 22

References Previous Hot Topics Quality Control and Mapping Reads http://jura.wi.mit.edu/bio/education/hot_topics/ngs_qc_mapping_feb2014/ngs2014.pdf SOPs http://barcwiki.wi.mit.edu/wiki/sops/chip_seq_peaks Reviews and benchmark papers: ChIP-seq: advantages and challenges of a maturing technology (Oct 09) (http://www.nature.com/nrg/journal/v10/n10/full/nrg2641.html) Computation for ChIP-seq and RNA-seq studies (Nov 09) (http://www.nature.com/nmeth/journal/v6/n11s/full/nmeth.1371.html) Practical Guidelines for the Comprehensive Analysis of ChIP-seq Data. PLoS Comput. Biol. 2013 A computational pipeline for comparative ChIP-seq analyses. Nat. Protoc. 2011 ChIP-seq guidelines and practices of the ENCODE and modencode consortia. Genome Res. 2012. Quality control and strand cross-correlation http://code.google.com/p/phantompeakqualtools/ MACS: Model-based Analysis of ChIP-Seq (MACS). Genome Biol 2008 http://liulab.dfci.harvard.edu/macs/index.html Using MACS to identify peaks from ChIP-Seq data. Curr Protoc Bioinformatics. 2011 http://onlinelibrary.wiley.com/doi/10.1002/0471250953.bi0214s34/pdf 23