Before we start: 1. Log into tak (step 0 on the exercises) 2. Go to your lab space and create a folder for the class (see separate hand out) 3. Connect to your lab space through the wihtdata network and open the folder you created 1
Analysis of ChIP-seq data BaRC Hot Topics March 2014 http://jura.wi.mit.edu/bio/education/hot_topics 2
Outline ChIP-seq overview Quality control Experimental design Mapping Map reads Remove unmapped reads and convert to bam files Check the profile of the mapped reads Peak calling 3
ChIP-Seq overview Park, P. J., ChIP-seq: advantages and challenges of a maturing technology, Nat Rev Genet. Oct;10(10):669-80 (2009) 4
Steps in data analysis 1. Quality control 2. Effective mapping Treat IP and control the same way (preprocessing and mapping) 3. Peak calling i) Read extension and signal profile generation ii) Peak assignment Pepke, S. et al. Computation for ChIP-seq and RNA-seq studies, Nat Methods. Nov (2009) 5
Experimental design Include a control sample. If the protein of interest binds repetitive regions using paired end sequencing may reduce the mapping ambiguity. Otherwise single reads should be fine. Include at least two biological replicates. If you have replicas you may want to use the parameter IDR irreproducible discovery rate. Talk to us if you want to learn more about it. If only a small percentage of the reads maps to the genome, you may have to troubleshot your ChIP protocol. 6
Data that we will use on the hands on Data is from ENCODE the exercises EncodeBroadHistoneHepg2H3k4me3 EncodeBroadHistoneHepg2Control For Quality control and mapping we will use a subset of the data (10%) For running MACS we will use the full set but taking only reads mapped to chr1 7
Quality control (before read mapping) http://jura.wi.mit.edu/bio/education/hot_topics/ngs_qc_mapping_feb2014/ngs2014.pdf Quality Control and Mapping Reads Run fastqc Look at output and determine 1. What type of quality encoding (Illumina versions) your reads have Input qualities --solexa-quals <= 1.2 --phred64 1.3-1.7 --phred33 >= 1.8 Illumina versions 2. If you need to remove adapter, trim reads, filter reads based on quality etc. 8
Hands on Step 1: Quality control Run step 1 commands (fastqc) bsub fastqc Hepg2Control_subset.fastq bsub fastqc Hepg2H3k4me3_subset.fastq 9
Fastqc Output 10
Mapping We have to map reads to the genome. The mapping software doesn t have to know about splicing. Bowtie, bowtie2 (allows gaps) are good choices Bwa (allows gaps) is fine too Treat IP and control samples the same way during preprocessing and mapping. 11
Hands on Step 2: Mapping bsub bowtie -k 1 -n 2 -l 36 --best --sam --phred33-quals /nfs/genomes/human_gp_feb_09_no_random/bowtie/hg19 Hepg2Control_subset.fastq Hepg2Control_subset_hg19.k1.n2.l36.best.sam bsub bowtie -k 1 -n 2 -l 36 --best --sam --phred33-quals /nfs/genomes/human_gp_feb_09_no_random/bowtie/hg19 Hepg2H3k4me3_subset.fastq Hepg2H3k4me3_subset_hg19.k1.n2.l36.best.sam --phred33-quals Because we have Illumina 1.9 -k 1 report up to 1 good alignments per read (this is the default setting) -n 2 max mismatches in seed (this is the default setting) -l 36 seed length (make this the same or close to the read length) --best report the best alignment 12
Hands on Steps 3: Remove unmapped reads and convert to BAM This step makes the final files smaller. bsub "samtools view -hs -F 4 Hepg2Control_subset_hg19.k1.n2.l36.best.sam samtools view -b -S - > Hepg2Control_subset_hg19.k1.n2.l36.best.mapped_only.bam bsub "samtools view -hs -F 4 Hepg2H3k4me3_subset_hg19.k1.n2.l36.best.sam samtools view -b -S - > Hepg2H3k4me3_subset_hg19.k1.n2.l36.best.mapped_only.bam" 13
Hands on Steps 4: Quality control after read mapping Check the strand cross-correlation peak and predominant fragment length. bsub Rscript /nfs/barc_public/phantompeakqualtools/run_spp.r -c=control_chr1.bam -savp -out=control_chr1.run_spp.out bsub Rscript /nfs/barc_public/phantompeakqualtools/run_spp.r -c=hepg2h3k4me3_chr1.bam -savp -out=hepg2h3k4me3_chr1.run_spp.out -savp, save cross-correlation plot 14
Strand cross-correlation analysis Bailey, et al. Practical Guidelines for the Comprehensive Analysis of ChIP-seq Data, PLOS Computational Biology Nov 2013 Strand cross-correlation is computed as the Pearson correlation between the positive and the negative strand profiles at different strand shift distances, k 15
Strand cross-correlation analysis outputs 16
Peak calling Peak calling i) Read extension and signal profile generation strand cross-correlation can be used to calculate fragment length ii) Peak assignment Look for fold enrichment of the sample over input or expected background Estimate the significance of the fold enrichment using: Poisson distribution negative binomial distribution background distribution from input DNA model background data to adjust for local variation (MACS) Pepke, S. et al. Computation for ChIP-seq and RNA-seq studies, Nat Methods. Nov. 2009 17
Peak calling: MACS MACS uses a two-step strategy: 1. Builds a model using high quality peaks and infers shift size from it. 2. Shifts the reads and does a second round to call the final peaks. It uses a Poisson distribution to assign p-values to peaks. But the distribution has a dynamic parameter, local lambda, to capture the influence of local biases. MACS default is to filter out redundant tags at the same location and with the same strand by allowing at most 1 tag. This works well. -g: You need to set up this parameter accordingly: Effective genome size. It can be 1.0e+9 or 1000000000, or shortcuts: 'hs' for human (2.7e9), 'mm' for mouse (1.87e9), 'ce' for C. elegans (9e7) and 'dm' for fruit fly (1.2e8), Default:hs For broad peaks like some histone modifications it is recommended to use --nomodel and if there is not input sample to use --nolambda. 18
Hands on Steps 5: MACS Input files: Hepg2H3k4me3_chr1.bam Control_chr1.bam MACS command bsub "macs -t Hepg2H3k4me3_chr1.bam -c Control_chr1.bam --name=hepg2h3k4me3_chr1 --format=bam --wig --space=25 " PARAMETERS -t TFILE Treatment file -c CFILE Control file --name=name Experiment name, which will be used to generate output file names. DEFAULT: NA --format=format Format of tag file, BED or ELAND or ELANDMULTI or ELANDMULTIPET or SAM or BAM or BOWTIE. DEFAULT: BED --wig: Whether or not to save shifted raw tag count at every bp into a wiggle file --space=space The resolution for saving wiggle files, by default, MACS will save the raw tag count every 10 bps. Usable only with '--wig' option 19
MACS output Output files: 1. Folder with wig files for control and sample. 2. Excel file for peaks found containing the following columns : chr->start ->end->length->summit->tags-> -10*LOG10(pvalue) ->fold_enrichment->fdr(%) 3. Bed file with peaks found: chr->start->end-> -10*LOG10(pvalue) 4. Excel file for negative peaks Looking at the output on a genome browser: To visualize the peaks you can upload the bed file into a genome browser. You can also make a bedgraph file with columns (step 6 of hands on): chr->start ->end->fold_enrichment 20
Visualize your peaks in IGV Treatment wig Control wig Peaks bedgraph Peaks bed RefSeq Genes 21
Other recommendations Look at your raw data as well as the peak calls in a genome browser to find out if you agree with the peaks calling software Optional: remove reads mapping to the ENCODE and 1000 Genomes blacklisted regions https://sites.google.com/site/anshulkundaje/projects/blacklists 22
References Previous Hot Topics Quality Control and Mapping Reads http://jura.wi.mit.edu/bio/education/hot_topics/ngs_qc_mapping_feb2014/ngs2014.pdf SOPs http://barcwiki.wi.mit.edu/wiki/sops/chip_seq_peaks Reviews and benchmark papers: ChIP-seq: advantages and challenges of a maturing technology (Oct 09) (http://www.nature.com/nrg/journal/v10/n10/full/nrg2641.html) Computation for ChIP-seq and RNA-seq studies (Nov 09) (http://www.nature.com/nmeth/journal/v6/n11s/full/nmeth.1371.html) Practical Guidelines for the Comprehensive Analysis of ChIP-seq Data. PLoS Comput. Biol. 2013 A computational pipeline for comparative ChIP-seq analyses. Nat. Protoc. 2011 ChIP-seq guidelines and practices of the ENCODE and modencode consortia. Genome Res. 2012. Quality control and strand cross-correlation http://code.google.com/p/phantompeakqualtools/ MACS: Model-based Analysis of ChIP-Seq (MACS). Genome Biol 2008 http://liulab.dfci.harvard.edu/macs/index.html Using MACS to identify peaks from ChIP-Seq data. Curr Protoc Bioinformatics. 2011 http://onlinelibrary.wiley.com/doi/10.1002/0471250953.bi0214s34/pdf 23