Evaluate NimbleGen SeqCap Epi Target Enrichment Data

Size: px

Start display at page:

Download "Evaluate NimbleGen SeqCap Epi Target Enrichment Data"

Grant Lawson
5 years ago
Views:

1 Sequencing Solutions Technical Note April 2014 How To Evaluate NimbleGen SeqCap Epi Target Enrichment Data 1. OVERVIEW Analysis of NimbleGen SeqCap Epi target enrichment data generated using an Illumina sequencing system is most frequently performed using a variety of publicly available, open-source analysis tools. The typical methylation and variant calling analysis workflow consists of sequencing read quality assessment, read filtering, mapping against a reference genome, removal of PCR duplicates, coverage statistic assessment, methylation analysis, variant calling and variant filtering; at most of these steps, a variety of tools are available. This document shows how to use a selection of the available tools to perform SeqCap Epi data analysis, but other analysis workflows may be used. Applications Targeted Bisulfite Sequencing Products SeqCap Epi CpGiant Enrichment Kit SeqCap Epi Developer Enrichment Kit SeqCap Epi Choice Enrichment Kit In addition to showing how to use raw FASTQ files to generate a BAM file and perform methylation analysis (see Figure 1), additional supporting workflows include how to prepare and work with Roche NimbleGen BED files and how to create a Picard interval list file. This document should enable readers with bioinformatics experience to understand the basic analysis workflow in use at Roche NimbleGen. The reader should carefully consider the available options to determine the appropriate workflow for their research. The usage examples described here have been applied effectively in our hands. Please note that publicly available, open-source software tools may change and that such change is not under the control of Roche. Therefore Roche does not warrant and cannot be held liable for the results obtained when using the third party tools described herein. Please refer to the authors of each tool for support and documentation. For life science research only. Not for use in diagnostic procedures. 1

2 Figure 1. Schematic of basic methylation analysis workflow Evaluate NimbleGen SeqCap Epi Target Enrichment Data 2

3 2. SOLUTION Freely available third-party tools are available for converting raw sequencing data into appropriate file formats, mapping bisulfite converted reads to a reference sequence, evaluating sequencing quality, generating methylation calls and analyzing variant calls. This technical note describes a number of steps and mini-workflows that use such third-party tools, which can be combined together into a variety of data analysis workflows. Ideally, you should develop a workflow appropriate for your experimental data using benchmark/control samples, such as HapMap samples obtained from the Coriell Institute Cell Repositories. Known variants for HapMap sample genomes can be downloaded from the HapMap project ( the 1000 Genomes Project ( or in special collections such as GATK s resource bundle ( Note that where the text SAMPLE appears throughout these examples, you should replace it with a unique sample name. Similarly, replace /path/to/ with a valid path on your system. The current directory is assumed to be the location of all SAMPLE input files, and will also be the location of SAMPLE output files and report files. Type the entire command shown for each step on a single line, despite the way it appears on the printed page. There should be no spaces within a file path, but there must be spaces before and after each option. Tip: use the Tab key to auto-complete paths and file names while typing. Tools Overview Package (version) Tool Function, as used in this document bamutil (1.0.10) clipoverlap Soft clips overlapping read pairs in BAM files intersect Calculate on-target read rate by intersecting a list of regions in BED format against mapped reads in BAM format. BEDtools (2.17.0) sort Sort BED formatted regions. merge Merge overlapping regions within a BED file. genomecov Calculate sum total size of regions in a BED file. slop Pad length of target regions. BisulfiteCountCovariates Generate data file for recalibration BisulfiteTableRecalibration Recalibrate base qualities before SNP calling BisSNP (0.82.2) BSMAP (2.74) BisulfiteGenotyper sortbyrefandcor.pl VCFpostprocess bsmap methratio.py Determine methylated/unmethylated counts at each position and make SNP calls Sort VCF files by name and position Filter CpGs and SNPs Map sequencing reads to an indexed genome. Calculate percent methylation at individual bases FastQC (0.10.1) fastqc Assess sequencing read quality (per-base quality plot). GATK Framework (2.7-2) IGV DepthOfCoverage igv Calculate mean, median, and specific sequencing coverage depths. Genomic viewer for BAM and BED files (not explicitly used in the examples shown in this document). Evaluate NimbleGen SeqCap Epi Target Enrichment Data 3

4 Package (version) Tool Function, as used in this document methylkit (0.9.2) Picard (1.98) SAMtools (0.1.18) read filterbycoverage unite calculatediffmeth get.methyldiff tilemethylcounts AddOrReplaceReadGroups CollectInsertSizeMetrics CalculateHsMetrics MarkDuplicates CollectAlignmentSummaryMetrics SamToFastq sort index view mpileup BCFtools view vcfutils varfilter Import BSMAP methylation results Filter methylation data by coverage Consolidate data to position covered by all samples Calculate and score differential methylation at each position Filter differential methylation results by absolute value of methylation change, qvalue, and type of region (hypo, hyper, all) Summarizes methylated/unmethylated base counts over tiling or sliding windows in the genome Add read group information to a SAM or BAM file Estimate insert size mean and standard deviation and plot an insert size distribution. Assess performance of targeted enrichment reads. Remove or mark duplicate reads, and detail the number of paired, unpaired, and optical duplicates. Report mapping metrics for a BAM file. Create FASTQ file(s) from a SAM or BAM file. Sort a BAM file. Index a sorted BAM file. View or extract header or read data. Call variants on a BAM file. Convert between VCF and BCF formats. Filter raw variants. seqtk (1.0-r31) sample Randomly subsample FASTQ file(s). Trimmomatic (0.30) illuminaclip Trim raw reads for quality. Table 1: Shown are third-party data analysis and formatting tools used in this technical note. The examples described in this document were tested using the software versions listed in parentheses. See links in References for installation instructions and explanations of command options. These tools were tested on a Linux system, but should also work on MacOS. Evaluate NimbleGen SeqCap Epi Target Enrichment Data 4

5 Decompress a FASTQ File If the FASTQ files have been compressed (with a.gz extension), some tools require them to be decompressed before use. gunzip SAMPLE_R1.fastq.gz SAMPLE_R2.fastq.gz SAMPLE_R1.fastq SAMPLE_R2.fastq gunzip c SAMPLE_R1.fastq.gz > SAMPLE_R1.fastq gunzip c SAMPLE_R2.fastq.gz > SAMPLE_R2.fastq Generate FASTQ Files from a BAM File If you do not have access to the original FASTQ files, it is possible to generate FASTQ files from a BAM file in order to remap the reads using a different pipeline. Picard SamToFastq SAMPLE_file.bam SAMPLE_R1.fastq SAMPLE_R2.fastq java -Xmx4g Xms4g -jar /path/to/picard/samtofastq.jar VALIDATION_STRINGENCY=LENIENT INPUT=SAMPLE_file.bam FASTQ=SAMPLE_R1.fastq SECOND_END_FASTQ=SAMPLE_R2.fastq The generated FASTQ files will only be identical to the originals if the creator of the BAM file did not remove reads such as duplicates, low quality, etc., or perform base quality recalibration. Select a Subsample of Reads from a FASTQ File Random subsampling is useful for normalizing the number of reads per set when doing comparisons. Depending on the goal of the experiment, reads can be sampled before adapter trimming and quality filtering (see below) or after, sampling only from high quality reads of a minimum length. With paired end reads, it is important that the two files use the same values for the seed (-s) and number of reads. The seqtk application will sample from gzipped FASTQ files, but will write the sampled reads to uncompressed FASTQ files. seqtk sample SAMPLE_R1.fastq SAMPLE_R2.fastq SAMPLE_subset_R1.fastq SAMPLE_subset_R2.fastq /path/to/seqtk sample -s SAMPLE_R1.fastq > SAMPLE_subset_R1.fastq /path/to/seqtk sample -s SAMPLE_R2.fastq > SAMPLE_subset_R2.fastq The commands above will randomly subsample 10 million read pairs from the paired end FASTQ files. Supplying the same random seed value (-s) ensures that the FASTQ records will remain in synchronized sort order and can be used for mapping, etc. Note that seqtk requires an amount of RAM proportional to the number of reads being subsampled. As you increase the size of the subsampled read set, more RAM is needed. Evaluate NimbleGen SeqCap Epi Target Enrichment Data 5

6 Examine Sequence Read Quality Control Before spending time evaluating mapping statistics, use fastqc to run a series of tests on raw reads and generate a per-base sequence quality plot and report. The fastqc tool can work on both compressed and uncompressed FASTQ files. FastQC fastqc SAMPLE_R1.fastq SAMPLE_R2.fastq SAMPLE_R1_fastqc.zip SAMPLE_R2_fastqc.zip /path/to/fastqc --nogroup SAMPLE_R1.fastq SAMPLE_R2.fastq A.zip file and decompressed directory are created for each SAMPLE input file. They contain an HTML report named fastqc_report.html that is viewable in any internet browser. The authors of FastQC have posted the following examples of the QC report for a good and a bad sequencing run. Adapter trimming and Quality Filtering Before mapping the bisulfite converted reads, it is advisable to trim off any sequencing adapters found on the reads, and to filter or trim the reads for sequence quality. The Trimmomatic application can do both of those steps. Trimmomatic illuminaclip SAMPLE_R1.fastq SAMPLE_R2.fastq SAMPLE_R1_trimmed.fq SAMPLE_R1_unpaired.fq SAMPLE_R2_trimmed.fq SAMPLE_R2_unpaired.fq java Xms4g Xmx4g -jar /path/to/trimmomatic.jar PE -threads NumProcessors -phred33 SAMPLE_R1.fastq SAMPLE_R2.fastq SAMPLE_R1_trimmed.fq SAMPLE_R1_unpaired.fq SAMPLE_R2_trimmed.fq SAMPLE_R2_unpaired.fq ILLUMINACLIP:/path/to/adapters.fa:2:30:10 LEADING:20 TRAILING:20 SLIDINGWINDOW:5:20 MINLEN:75 The Trimmomatic application will produce four files. The SAMPLE_R1_trimmed.fq and SAMPLE_R2_trimmed.fq contain the reads that are still paired after adapter trimming and quality filtering. The SAMPLE_R1_unpaired.fq and SAMPLE_R2_unpaired.fq contain singleton reads, where the other mate of the pair was discarded because of poor quality or because the remaining read was shorter than the minimum length of 75bp. The unpaired reads do not have to be discarded but for most alignment applications they will have to be mapped separately from the paired reads, which can complicate downstream steps in a workflow. If the workflow includes a filtering step to include only mapped reads which are paired, then there is no point in mapping the singleton reads from Trimmomatic. The parameters shown above are best used with 2x100bp reads, where you should expect to see about 90% of reads pass these quality filters. If you want to increase the percentage of passing reads, adjust the Trimmomatic parameters especially MINLEN, which is the required minimum length after trimming. Evaluate NimbleGen SeqCap Epi Target Enrichment Data 6

7 Mapping Reads There are a number of alignment software applications for mapping bisulfite converted reads to a reference genome. The choice of alignment software can depend on the length and type of reads being mapped (100bp paired end reads versus 76bp single end reads), the size and GC content of the organism, and the computational resources available. For 100bp paired end reads, BSMAP demonstrates a good compromise between alignment time and overall alignment percentage. The index for the reference genome is built at run time for BSMAP so there is no need to preindex the genome as is necessary for some other alignment software applications. BSMAP produces a SAM file which should be converted to a BAM file before proceeding with the workflow bsmap Picard AddOrReplaceReadGroups ref.fa SAMPLE_R1.fastq SAMPLE_R2.fastq SAMPLE.bam Map Reads /path/to/bsmap -r 0 -s 16 -n 1 -a SAMPLE_R1.fastq -b SAMPLE_R2.fastq -d /path/to/ref.fa -p NumProcessors -o SAMPLE.sam Add Read Group and Convert to BAM Xmx4g Xms4g -jar /path/to/picard/ AddOrReplaceReadGroups.jar VALIDATION_STRINGENCY=LENIENT INPUT=SAMPLE.sam OUTPUT=SAMPLE.bam INDEX=TRUE RGID=SAMPLE RGLB=SAMPLE RGPL=illumina RGSM=SAMPLE RGPU=platform_unit The parameters above will uniquely map reads to both strands. By default, BSMAP does not report unmapped reads. If you want unmapped reads to carry through the workflow, use the -u switch to report unmapped reads. Note that BSMAP does not assign a read group to the reads in the output SAM file, so that information is added at the same time the SAM file is converted to a BAM file. For single captures from one library, the sample ID, the read group library, and the read group sample can all be the same. If combining captures from multiple libraries or multiple captures, it is a good idea to distinguish those via the read group library ID or read group sample ID. The platform unit is a unique identifier for the flow cell and lane, which can generally be grabbed from the first four colon delimited fields from the FASTQ ID (in bold 1:N:0:CGATGT For the reference genome file, there are several points to consider. The FASTA formatted genome sequence should have chromosomes in karyotypic sort order, i.e. chr1, chr2,..., chr10, chr11, chrx, chry, chrm, chr1_random, etc. In the examples in this document, reference genome files are referred to as ref.fa, which should be replaced by the actual file name (e.g. hg19.fa ). Note: The SeqCap Epi products include a lambda spike-in control (GenBank accession NC_001416). This sequence should be added to the reference genome file so that captured controls can be mapped to the lambda genome, and be used to estimate bisulfite conversion efficiency as a QC step. If working with human reference genome sequence and mapping reads uniquely, it is advisable to mask the pseudo-autosomal regions on chry to N s so that reads may be mapped to the equivalent regions on chrx. For human experiments with 100bp paired end reads, you can generally expect to see >90% of the reads mapped to the human reference genome. For other, non-human reference genomes, the percent of reads mapping may vary based on a number of factors, including genome size, percent GC composition, repeat content and reference genome quality. Evaluate NimbleGen SeqCap Epi Target Enrichment Data 7

The treatment of the sequence with bisulfite converts unmethylated C s to T s resulting in top and bottom

8 Sorting and Removing Duplicates Working with bisulfite converted sequences poses some additional challenges over standard capture sequences. The treatment of the sequence with bisulfite converts unmethylated C s to T s resulting in top and bottom strands that are no longer complementary (Figure 2). Figure 2. Illustration of non-complementary strands generated by bisulfite-treatment. Evaluate NimbleGen SeqCap Epi Target Enrichment Data 8

9 The standard tool for duplicate remove (Picard MarkDuplicates) does not have an option for dealing with bisulfite sequence and non-complementary top and bottom strands. So to remove duplicates, the mapped reads must be split by top/bottom strand, duplicates removed, and then the strands re-merged. The BSMAP aligner includes a tag in the BAM file (ZS), which indicates top/bottom strand, and forward/reverse read. This tag is used to split the BAM file containing the mapped reads, which are then sorted, duplicates are removed, and the split files are rejoined. bamtools split bamtools merge SAMtools sort Picard MarkDuplicates SAMPLE.bam SAMPLE.TAG_ZS_++.bam SAMPLE.TAG_ZS_+-.bam SAMPLE.top.bam SAMPLE.top.sorted.bam SAMPLE.top.rmdups.bam SAMPLE.top.rmdups_metrics.txt SAMPLE.TAG_ZS_-+.bam SAMPLE.TAG_ZS_--.bam SAMPLE.bottom.bam SAMPLE.bottom.sorted.bam SAMPLE.bottom.rmdups.bam SAMPLE.bottom.rmdups_metrics.txt SAMPLE.rmdups.bam Split BAM /path/to/bamtools split -tag ZS -in SAMPLE.bam Merge strand BAM files /path/to/bamtools merge -in SAMPLE.TAG_ZS_++.bam -in SAMPLE.TAG_ZS_+-.bam -out SAMPLE.top.bam /path/to/bamtools merge -in SAMPLE.TAG_ZS_-+.bam -in SAMPLE.TAG_ZS_--.bam -out SAMPLE.bottom.bam Sort BAM Files /path/to/samtools sort SAMPLE.top.bam SAMPLE.top.sorted /path/to/samtools sort SAMPLE.bottom.bam SAMPLE.bottom.sorted Remove Duplicates java Xmx4g Xms4g -jar /path/to/picard/markduplicates.jar VALIDATION_STRINGENCY=LENIENT INPUT=SAMPLE.top.sorted.bam OUTPUT=SAMPLE.top.rmdups.bam METRICS_FILE=SAMPLE.top.rmdups_metrics.txt REMOVE_DUPLICATES=true ASSUME_SORTED=true INDEX=true java Xmx4g Xms4g -jar /path/to/picard/markduplicates.jar VALIDATION_STRINGENCY=LENIENT INPUT=SAMPLE.bottom.sorted.bam OUTPUT=SAMPLE.bottom.rmdups.bam METRICS_FILE=SAMPLE.bottom.rmdups_metrics.txt REMOVE_DUPLICATES=true ASSUME_SORTED=true INDEX=true Merge duplicate removed BAM files /path/to/bamtools merge -in SAMPLE.top.rmdups.bam -in SAMPLE.bottom.rmdups.bam -out SAMPLE.rmdups.bam Evaluate NimbleGen SeqCap Epi Target Enrichment Data 9

10 In the Mark Duplicates step, REMOVE_DUPLICATES=false may also be used. This will retain reads marked as duplicates and just mark them as duplicates. It will result in larger files downstream, but may be beneficial if you want to examine downstream results with and without duplicates removed. View the file SAMPLE.strand.rmdups_metrics.txt for counts of paired, unpaired, and duplicate reads. See for a description of the output metrics. Note that optical duplicates are also reported, based on sequence similarity and sequencing cluster distance. Optical duplicates are not separately included in the total duplicate rate but are counted within the paired and unpaired duplicates. Filter BAM file to keep only mapped and properly paired reads Due to the difficulty in mapping bisulfite reads, overall data quality can be improved by restricting analysis to only those reads pairs where both mates in the pair can be mapped, in the correct orientation and at a distance consistent with the library insert size. Filtering the BAM file to include only mapped and properly paired reads will remove the ability to find structural variants. If that is a goal, then a separate analysis workflow (not described here) would have to be set up to find and characterize those variants. bamtools filter SAMPLE.rmdups.bam SAMPLE.filtered.bam Filter BAM /path/to/bamtools filter -ismapped true -ispaired true -isproperpair true - forcecompression -in SAMPLE.rmdups.bam -out SAMPLE.filtered.bam For human SeqCap Epi experiments, this step generally filters out about 5% of the reads mapped with BSMAP. Clip Overlapping Reads When using 100bp paired end reads, and typical NGS library insert sizes of bp, a number of the paired end reads will overlap. When evaluating percent methylation, this can cause bias, as the converted/non-converted C s in the overlap between the two regions may be double-counted. The bias can be especially large at positions with low read coverage. To prevent this, one or both reads should be soft-clipped such that the reads no longer overlap. The bamutil clipoverlap application can be used to do this. bamutil clipoverlap SAMPLE.filtered.bam SAMPLE.clipped.bam Filter BAM /path/to/bamutil clipoverlap --stats --in SAMPLE.filtered.bam --out SAMPLE.clipped.bam The clipoverlap tools will look at both reads in the pair, and when an overlap is found, the tool will clip the read with the lowest average quality in the overlap region, and the alignment positions are updated if necessary. See the following page for complete details: The number of reads that are clipped will depend on the average insert size of the library and the distribution of sizes in the library. See Estimate Insert Size Distribution below. Evaluate NimbleGen SeqCap Epi Target Enrichment Data 10

11 Indexing BAM files A number of tools that operate on BAM files require that the BAM file be indexed for faster or more efficient calculations. The Picard tools will generate an index automatically if the parameter INDEX=true is included on the command line. For other tools which create BAM files, the BAM index will have to be created manually. To do so, use the SAMtools index command. SAMtools index SAMPLE.bam SAMPLE.bam.bai /path/to/samtools index SAMPLE.bam The command will create an additional file with the suffix.bai added to the original file name. Basic Mapping Metrics (Picard) Basic mapping metrics can be calculated using Picard, from SAMPLE.bam and ref.fa. Picard CollectAlignmentSummaryMetrics ref.fa SAMPLE.bam SAMPLE_picard_alignment_metrics.txt java -Xmx4g -Xms4g -jar /path/to/picard/collectalignmentsummarymetrics.jar METRIC_ACCUMULATION_LEVEL=ALL_READS INPUT=SAMPLE.bam OUTPUT=SAMPLE_picard_alignment_metrics.txt REFERENCE_SEQUENCE=/path/to/ref.fa VALIDATION_STRINGENCY=LENIENT Note: It is important to use VALIDATION_STRINGENCY=LENIENT. Not all tools that operate on BAM files produce output files are completely compliant with the BAM specification, and unless this parameter is used, the various Picard tools will halt as soon as they find a single non-compliant record. See for a description of the output metrics. Create Picard Interval Lists A Picard interval list is a genomic interval description file that contains a SAM-like header describing the reference genome and a set of coordinates with strand and name for each interval. The target intervals are equivalent to the NimbleGen primary target BED file, and the bait intervals are equivalent to the NimbleGen capture target BED file. The Picard interval files are required by the Picard CalculateHsMetrics utility. Note: Some NimbleGen catalog designs only contain a single bed file of coverage targets. For those designs, the same BED file can be used for both primary targets and capture targets below. Evaluate NimbleGen SeqCap Epi Target Enrichment Data 11

12 Create a Picard Target Interval List Use the SAMtools view command to extract the header from a SAMPLE.bam file, use the Linux gawk command to extract and format the interval list body from a DESIGN_primary_target.bed file, and finally use the cat command to concatenate the two pieces together into a properly-formatted interval list. SAMtools view cat gawk SAMPLE.bam DESIGN_primary_target.bed DESIGN_target_intervals.txt Create a Picard Interval List Header /path/to/samtools view H SAMPLE.bam > SAMPLE_bam_header.txt Create a Picard Target Interval List Body cat DESIGN_primary_target.bed gawk '{print $1 "\t" $2+1 "\t" $3 "\t+\tinterval_" NR}' > DESIGN_target_body.txt Concatenate to Create a Picard Target Interval List cat SAMPLE_bam_header.txt DESIGN_target_body.txt > DESIGN_target_intervals.txt The DESIGN_target_intervals.txt file is used by the Picard CalculateHsMetrics command. Create a Picard Bait Interval List Use the SAMtools view command to extract the header from a SAMPLE.bam file, use the Linux gawk command to extract and format the interval list body from a DESIGN_capture_target.bed file, and finally use the cat command to concatenate the two pieces together into a properly-formatted interval list. SAMtools view cat gawk SAMPLE.bam DESIGN_capture_target.bed DESIGN_bait_intervals.txt Create a Picard Interval List Header /path/to/samtools view H SAMPLE.bam > SAMPLE_bam_header.txt Create a Picard Bait Interval List Body cat DESIGN_capture_target.bed gawk '{print $1 "\t" $2+1 "\t" $3 "\t+\tinterval_" NR}' > DESIGN_bait_body.txt Concatenate to Create a Picard Bait Interval List cat SAMPLE_bam_header.txt DESIGN_bait_body.txt > DESIGN_bait_intervals.txt The DESIGN_bait_intervals.txt file is used by the Picard CalculateHsMetrics command. Evaluate NimbleGen SeqCap Epi Target Enrichment Data 12

13 Hybrid Selection (HS) Analysis Metrics The CalculateHsMetrics command calculates a number of metrics assessing the quality of targeted enrichment reads. If necessary, follow the steps described in Combined SNP/methylation calling using BisSNP before using the step described below. Picard CalculateHsMetrics ref.fa {indexed} SAMPLE.clipped.bam {indexed} DESIGN_target_intervals.txt DESIGN_bait_intervals.txt SAMPLE_picard_hs_metrics.txt java -Xmx4g -Xms4g -jar /path/to/picard/calculatehsmetrics.jar BAIT_INTERVALS=DESIGN_bait_intervals.txt TARGET_INTERVALS=DESIGN_target_intervals.txt INPUT=SAMPLE.clipped.bam OUTPUT=SAMPLE_picard_hs_metrics.txt METRIC_ACCUMULATION_LEVEL=ALL_READS REFERENCE_SEQUENCE=/path/to/ref.fa VALIDATION_STRINGENCY=LENIENT TMP_DIR=. Supplying only one of the Picard interval files to Picard CalculateHsMetrics can change the outcome, depending on how different the primary and capture target (bait) regions are from each other. Some Roche NimbleGen catalog designs have only one BED file supplied. In that case, generate a single Picard Interval List from the supplied BED file and use as both bait and target intervals. See for a description of output metrics. Estimate Insert Size Distribution The DNA that goes into sequence capture is generated by random fragmentation, and later size selected. It is normal to observe a range of fragment sizes, but if skewed too large or too small the on-target rate and/or percent of bases covered with at least one read can be adversely affected. It is best to generate this report on the SAMPLE.filtered.bam file created above Picard CollectInsertSizeMetrics SAMPLE.filtered.bam SAMPLE_picard_insert_size_metrics.txt SAMPLE_picard_insert_size_plot.pdf java -Xmx4g -jar /path/to/picard/collectinsertsizemetrics.jar VALIDATION_STRINGENCY=LENIENT HISTOGRAM_FILE=SAMPLE_picard_insert_size_plot.pdf INPUT=SAMPLE.filtered.bam OUTPUT=SAMPLE_picard_insert_size_metrics.txt See for a description of output metrics included in SAMPLE_picard_insert_size_metrics.txt, which can also be used to plot the insert size distributions across samples. A PDF plot is also created and placed in SAMPLE_picard_insert_size_plot.pdf. Evaluate NimbleGen SeqCap Epi Target Enrichment Data 13

14 Add Padding to Targets Target-adjacent coverage is typical for hybridization-based target enrichment due to the capture of partially on-target DNA library fragments. To optionally assess the amount of off-target reads which are target adjacent, padding the targets before assessing the on-target rate may be performed. The targets should be padded, sorted, and then merged to product the final file. These three operations can be performed simultaneously by piping the contents of one application to the BEDtools slop BEDtools sort BEDtools merge DESIGN_capture_target.bed chromosome_sizes.txt DESIGN_padded_capture_target.bed Pad, Sort and Merge Overlapping and Book-Ended Regions /path/to/bedtools slop -i DESIGN_capture_target.bed -b 100 -g chromosome_sizes.txt /path/to/bedtools sort -i - /path/to/bedtools merge -i - > DESIGN_padded_capture_target.bed In the first step, the value supplied to the -b option indicates the number of target-adjacent bases to add on both sides of the target. In this example, 100 bases would be added to both sides of all targets. The chromosome_sizes.txt file should be in the format: ChrName<tab>ChrSize, e.g. chr and have an entry for each chromosome present in the input BED file. The -i - parameters directs the application to take the STDOUT from the previous command as input. Determine Sum Total Size of Regions in a BED File The chromosome_sizes.txt file should be as described in Add Padding to Targets. BEDtools genomecov grep cut chromosome_sizes.txt DESIGN.bed {size} /path/to/bedtools genomecov -i DESIGN.bed -g chromosome_sizes.txt -max 1 grep -P "genome\t1" cut -f 3 Alternatively, a single gawk command can be used: cat DESIGN.bed gawk -F'\t' 'BEGIN{SUM=0}{SUM+=$3-$2}END{print SUM}' Evaluate NimbleGen SeqCap Epi Target Enrichment Data 14

15 Count On-Target Reads The number of on-target reads (i.e. the number of mapped reads overlapping a set of target regions by at least one base) can be determined using BEDtools intersect on any of the BAM files created above. BEDtools intersect wc SAMPLE.bam DESIGN_primary_target.bed or DESIGN_capture_target.bed {count} /path/to/bedtools intersect -bed -abam SAMPLE.bam -b DESIGN_primary_target.bed or /path/to/bedtools intersect -bed -abam SAMPLE.bam -b DESIGN_capture_target.bed The command will output the number of on-target reads, counting as on-target any read that overlaps the target by at least one base. The output may be redirected to a file to save for later calculations. Divide the number of on-target reads by the total number of mapped, non-duplicate reads to get the percentage of on-target reads ( on-target rate ). See Basic Mapping Metrics (Picard) for reporting of the total number of mapped, non-duplicate reads. Calculate Depth of Coverage The depth of coverage over the primary or capture target regions can be determined using the DepthofCoverage command from GATK. GATK DepthOfCoverage ref.fa {indexed} SAMPLE.clipped.bam {indexed} DESIGN_primary_target.bed or DESIGN_capture_target.bed SAMPLE_gatk_primary_target_coverage.sample_summary or SAMPLE_gatk_capture_target_coverage.sample_summary {there are other output files not listed here} java -Xmx4g -Xms4g -jar /path/to/gatkframework/genomeanalysistk.jar -T DepthOfCoverage R /path/to/ref.fa -I SAMPLE.clipped.bam -o SAMPLE_gatk_primary_target _coverage -L DESIGN_primary_target.bed -ct 1 -ct 10 -ct 20 or java -Xmx4g -Xms4g -jar /path/to/gatkframework/genomeanalysistk.jar -T DepthOfCoverage R /path/to/ref.fa -I SAMPLE.clipped.bam -o SAMPLE_gatk_capture_target_coverage -L DESIGN_capture_target.bed -ct 1 -ct 10 -ct 20 The sample summary output file reports the mean and median depth of coverage over the given regions of interest, as well as the percent of bases in the given regions covered at or above a given depth. Multiple depth thresholds can be evaluated by specifying multiple values using the -ct switch. For more information, consult the documentation at: Evaluate NimbleGen SeqCap Epi Target Enrichment Data 15

16 Determine methylation percentage using BSMAP The percent methylation at each C base in the sample can be determined using the methratio.py script supplied by BSMAP and the BAM file produced in the Clip Overlapping Reads step. BSMAP methratio.py ref.fa SAMPLE.clipped.bam SAMPLE.methylation_results.txt Determine methylation percentage python /path/to/methratio.py -d hg19.fa -s /path/to/samtools -m 1 -z -i skip -o SAMPLE.methylation_results.txt SAMPLE.clipped.bam The results can be filtered by depth of coverage by using the -m switch. Raising this to a higher value (e.g. -m 5) will require at least that many reads to cover the position before a result is reported. The -z switch is required to report positions with zero methylation. The -i switch directs the script to ignore positions were there is a possible C->T SNP. Use the BisSNP analysis to deal with those positions. The result files from this step can be rather large, several GB in size, since percent methylation is reported for all C positions. An additional switch -c can be used to process one chromosome at a time (e.g. -c chr1). In combination with the -m switch, this can reduce result files to more reasonable sizes. Determine bisulfite conversion efficiency using BSMAP Reliable estimates of percent methylation require high efficiency of the bilsulfite conversion process, otherwise the level of methylation will be over-estimated. The best way to determine bisulfite conversion efficiency is to measure the number of unconverted C s on DNA molecules which have no methylation. For humans and animals, the mitochondrial genome is an appropriate target. For plants, both the mitochondrial and chloroplast genome may be used. Unfortunately, the genomes of these organelles is not always known or well-characterized, so Roche NimbleGen supplies phage lambda DNA (GenBank Accession NC_001416) which may be used as a spike-in control. Each SeqCap Epi capture pool contains probes to capture the lambda genomic region from base 4500 to The lambda genome sequence must be appended to your reference genome sequence files so that the captured reads may be mapped to that sequence. The residual C s in the reads mapped to the captured region of the lambda genome can be used to estimate the overall conversion efficiency. BSMAP methratio.py ref.fa SAMPLE.clipped.bam SAMPLE.NC_ methylation_results.txt Determine methylation percentage python /path/to/methratio.py -d hg19.fa -s /path/to/samtools -m 1 -z -i skip -o -c NC_ SAMPLE.NC_ methylation_results.txt SAMPLE.clipped.bam The -c NC_ switch restricts the methylation analysis to just the lambda genome. If the lambda spike-in was not used, you may substitute the name of the mitochondrial genome (e.g. chrm) or the chloroplast genome (e.g. chrc). Once the analysis is complete, open the resulting methylation results file in a spreadsheet application, and if necessary, delete the rows that fall outside the capture region. Then calculate the conversion rate as follows: conversion rate = 1 - (sum(c_count)/sum(ct_count)) Bisulfite-conversion rates should generally be above 99.5% to be considered successful. Evaluate NimbleGen SeqCap Epi Target Enrichment Data 16

17 Combined SNP/methylation calling using BisSNP The percent methylation and possible SNPs at each C base in the sample can be determined using the BisSNP package and the BAM file produced in the Clip Overlapping Reads step. BisSNP uses a multistep process to make methylation calls and call SNPs. The first step normally calls for Indel Realignment (see BisSNP user guide). However, the BSMAP software does not support gapped alignments, so this step may be skipped for reads mapped with BSMAP software. The next step is Base Quality Recalibration, followed by Combined SNP/methylation calling. BisSNP BisulfiteCountCovariates BisSNP BisulfiteTableRecalibration BisSNP BisulfiteGenotyper BisSNP sortbyrefandcor.pl BisSNP VCFpostprocess BisSNP vcf2bed6plus2.strand.pl ref.fa {indexed} genome.snps.vcf DESIGNS_capture_target.bed SAMPLE.clipped.bam SAMPLE.cpg.raw.vcf SAMPLE.cpg.raw.sorted.vcf SAMPLE.cpg.filtered.vcf SAMPLE.cpg.filter.summary.txt SAMPLE. cpg.filtered.strand.6plus2.bed SAMPLE.snp.raw.vcf SAMPLE.snp.raw.sorted.vcf SAMPLE.snp.filtered.vcf SAMPLE.snp.filter.summary.txt SAMPLE. cpg.filtered.strand.6plus2.bed Base Quality Recalibration java -Xmx10g -jar /path/to/bissnp jar -R ref.fa -I SAMPLE.clipped.bam -T BisulfiteCountCovariates -cov ReadGroupCovariate -cov QualityScoreCovariate -cov CycleCovariate -recalfile SAMPLE.recalFile_before.csv -nt NumProcessors -knownsites genome.snps.vcf java -Xmx10g -jar /path/to/bissnp jar -R ref.fa -I SAMPLE.clipped.bam -o SAMPLE.recal.bam -T BisulfiteTableRecalibration -recalfile SAMPLE.recalFile_before.csv -maxq 40 Combined SNP/methylation calling java -Xmx10g -jar /path/to/bissnp jar -R ref.fa -I SAMPLE.recal.bam -T BisulfiteGenotyper -D genome.snps.vcf -vfn1 SAMPLE.cpg.raw.vcf -vfn2 SAMPLE.snp.raw.vcf -L DESIGNS_capture_target.bed -stand_call_conf 20 - stand_emit_conf 0 -mmq 30 -mbq 0 -nt NumProcessors Sort VCF files perl /path/to/sortbyrefandcor.pl --k 1 --c 2 SAMPLE.snp.raw.vcf ref.fa.fai > SAMPLE.snp.raw.sorted.vcf perl /path/to/sortbyrefandcor.pl --k 1 --c 2 SAMPLE.cpg.raw.vcf ref.fa.fai > SAMPLE.cpg.raw.sorted.vcf Evaluate NimbleGen SeqCap Epi Target Enrichment Data 17

18 Filter SNP/methylation calls java -Xmx10g -jar /path/to/bissnp jar -R ref.fa -T VCFpostprocess -oldvcf SAMPLE.snp.raw.sorted.vcf -newvcf SAMPLE.snp.filtered.vcf -snpvcf SAMPLE.snp.raw.sorted.vcf -o SAMPLE.snp.filter.summary.txt java -Xmx10g -jar /path/to/bissnp jar -R ref.fa -T VCFpostprocess -oldvcf SAMPLE.cpg.raw.sorted.vcf -newvcf SAMPLE.cpg.filtered.vcf -snpvcf SAMPLE.snp.raw.sorted.vcf -o SAMPLE.cpg.filter.summary.txt Convert VCF to BED file perl /path/to/vcf2bed6plus2.strand.pl SAMPLE.snp.filtered.vcf perl /path/to/vcf2bed6plus2.strand.pl SAMPLE.cpg.filtered.vcf BisSNP utilizes a set of known SNPs for the genome as part of the SNP calling process. BisSNP can work without this set of SNPs, but it will improve SNP detection accuracy, especially in areas of low coverage (e.g. less than 10X). The order of the SNP VCF file should be the same as the reference genome file and the input BAM file. The BisSNP website has SNP files in VCF format for human builds hg18 and hg19 and one mouse build (mm9) at: For the filtering step, by default BisSNP will filter out SNPs with: quality score less than 20 coverage more than 120X strand bias more than quality score by depth less than 1.0 mapping_quality_zero filter for heterozygous SNPs more than SNPs within the same 20bp window. The BED files that are created have the percent methylation in the 7 th column, and the C/T coverage in the 8 th column. If viewed in the SignalMap software, the percent methylation will be viewed as a score from 0 to 1000, where 0 equals zero percent methylation and 1000 equal one hundred percent methylation. The strand column is ignored, so both methylation percentages will be displayed as positive values. The C/T coverage will not be displayed. Evaluate NimbleGen SeqCap Epi Target Enrichment Data 18

19 Determination of differentially methylated regions There are a number of different packages for discovering differentially methylated regions when comparing two samples, or two sets of samples. One of these is methylkit, an R package for DNA methylation analysis. Installation of R, methylkit, or the various dependencies is beyond the scope of this document. But instructions for installing R may be found at and instructions for installing methylkit and its dependencies may be found at R/methylKit read R/methylKit filterbycoverage R/methylKit unite R/methylKit calculatediffmeth R/methylKit get.methyldiff R/methylKit tilemethylcounts BSMAP methylation results files varies # R code for determining differentially methylated regions # Load files library(methylkit) file.list = list("sample.methylation_results.txt","control.methylation_results.txt") sample.list = list("test","control") methdata <-read( file.list, sample.id=sample.list, treatment=c(0,1), assembly="hg19", header=true, context="cpg", resolution="base", pipeline=list( fraction=true, chr.col=1, start.col=2, end.col=2, coverage.col=6, strand.col=3, freqc.col=5 ) ) #Generate basic stats getmethylationstats(methdata,plot=t,both.strands=t) getcoveragestats(methdata,plot=t,both.strands=t) #Filter by coverage count (>10X) or percentage (<99.9 percentile) methdata.filtered = filterbycoverage(methdata, lo.count = 10, lo.perc=null, hi.count=null,hi.per = 99.9) #Merge samples to locations covered by all samples meth = unite(methdata.filtered,destrand=false) #Calculate per-base differential methylation p-values and q-values methdiff = calculatediffmeth(meth) #Find differentially methylated bases with 25% difference and qvalue<0.01 Diff25p=get.methylDiff(methDiff,difference=25,qvalue=0.01) Evaluate NimbleGen SeqCap Epi Target Enrichment Data 19

20 #Find differentially hypo methylated bases with 25% difference and qvalue<0.01 Diff25pHypo =get.methyldiff(methdiff,difference=25,qvalue=0.01,type="hypo") #Find differentially hyper methylated bases with 25% difference and qvalue<0.01 Diff25pHyper=get.methylDiff(methDiff,difference=25,qvalue=0.01,type="hyper") #Find differentially methylated regions with 25% difference and qvalue<0.01 tiles = tilemethylcounts(methdata.filtered,win.size=500,step.size=100) tilemeth = unite(tiles,destrand=false) tilediff = calculatediffmeth(tilemeth,num.cores=8) tile25pct <- get.methyldiff(tilediff,difference=25,qvalue=0.01) #Write data to file write.table(getdata(diff25p),file="diff25pct.txt",row.names=f,col.names=t,sep="\t",q uote=f) write.table(getdata(tile25pct),file="tile25pct.txt",row.names=f,col.names=t,sep="\t",quote=f) Replicate data sets will be automatically combined by the treatment values specified at import. Q-values and percent differences can be adjusted to provide higher or lower stringency. The methylkit site ( has examples and tutorials on additional analysis that can be performed, including correlations, clustering, PCA, and adding annotation. 3. REFERENCES Roche NimbleGen is not responsible for the maintenance or content of the following third-party websites. BamUtil: BEDtools: BisSNP: BSMAP: FastQC: GATK: IGV: methylkit: Picard: R: SAMtools (including BCFtools and VCFutils): seqtk: Trimmomatic: Evaluate NimbleGen SeqCap Epi Target Enrichment Data 20

21 4. GLOSSARY BAI file BAM index file. For tools that require an indexed BAM file, the BAI file must be present in the same location as the BAM file. Bait See Capture target. BAM file Compressed form of the SAM file format. BCF file Compressed form of the VCF file format. BED file File format for describing genomic regions/intervals. BED file start coordinates are 0-based. Capture target as defined by Roche NimbleGen, these are the regions covered directly by one or more probes (the probe footprint). NimbleGen BED files refer to these regions as the tiled regions. These are equivalent to the bait intervals referred to by Picard. FASTA file A standard file format for describing nucleic acid sequences. FASTQ file A standard file format for describing sequencing reads that also includes base quality information. Genomic index A form of the reference genome sequence which enables faster comparisons during alignment. Picard interval files File format for describing genomic regions/intervals which also contains a header describing the reference genome. Picard interval file start coordinates are 1-based. Primary target as defined by Roche NimbleGen, these are the regions against which probes were designed. NimbleGen BED files refer to these regions as simply Target Regions. Typically these are identical to the original regions of interest with regions less than 100bp expanded to 100bp total to facilitate probe selection. These are equivalent to the target intervals referred to by Picard. SAM file Sequence Alignment / Map file; a community standard format for specifying sequencing read alignment to a reference genome. Target regions see Primary target. Tiled regions see Capture target. VCF file Variant call format; a community standard format for specifying variant calls for one or more samples or populations against a reference genome. Evaluate NimbleGen SeqCap Epi Target Enrichment Data 21

22 Published by: Roche Diagnostics GmbH Sandhofer Straße Mannheim Germany 2014 Roche Diagnostics All rights reserved. Notice to Purchaser For patent license limitations for individual products please refer to: For life science research only. Not for use in diagnostic procedures. Trademarks NIMBLEGEN and SEQCAP are trademarks of Roche. All other product names and trademarks are the property of their respective owners. Technical Note: How To (1) 04/14 Evaluate NimbleGen SeqCap Epi Target Enrichment Data 22

Evaluate NimbleGen SeqCap RNA Target Enrichment Data

Roche Sequencing Technical Note November 2014 How To Evaluate NimbleGen SeqCap RNA Target Enrichment Data 1. OVERVIEW Analysis of NimbleGen SeqCap RNA target enrichment data generated using an Illumina