Evaluate NimbleGen SeqCap Epi Target Enrichment Data
|
|
- Grant Lawson
- 5 years ago
- Views:
Transcription
1 Sequencing Solutions Technical Note April 2014 How To Evaluate NimbleGen SeqCap Epi Target Enrichment Data 1. OVERVIEW Analysis of NimbleGen SeqCap Epi target enrichment data generated using an Illumina sequencing system is most frequently performed using a variety of publicly available, open-source analysis tools. The typical methylation and variant calling analysis workflow consists of sequencing read quality assessment, read filtering, mapping against a reference genome, removal of PCR duplicates, coverage statistic assessment, methylation analysis, variant calling and variant filtering; at most of these steps, a variety of tools are available. This document shows how to use a selection of the available tools to perform SeqCap Epi data analysis, but other analysis workflows may be used. Applications Targeted Bisulfite Sequencing Products SeqCap Epi CpGiant Enrichment Kit SeqCap Epi Developer Enrichment Kit SeqCap Epi Choice Enrichment Kit In addition to showing how to use raw FASTQ files to generate a BAM file and perform methylation analysis (see Figure 1), additional supporting workflows include how to prepare and work with Roche NimbleGen BED files and how to create a Picard interval list file. This document should enable readers with bioinformatics experience to understand the basic analysis workflow in use at Roche NimbleGen. The reader should carefully consider the available options to determine the appropriate workflow for their research. The usage examples described here have been applied effectively in our hands. Please note that publicly available, open-source software tools may change and that such change is not under the control of Roche. Therefore Roche does not warrant and cannot be held liable for the results obtained when using the third party tools described herein. Please refer to the authors of each tool for support and documentation. For life science research only. Not for use in diagnostic procedures. 1
2 Figure 1. Schematic of basic methylation analysis workflow Evaluate NimbleGen SeqCap Epi Target Enrichment Data 2
3 2. SOLUTION Freely available third-party tools are available for converting raw sequencing data into appropriate file formats, mapping bisulfite converted reads to a reference sequence, evaluating sequencing quality, generating methylation calls and analyzing variant calls. This technical note describes a number of steps and mini-workflows that use such third-party tools, which can be combined together into a variety of data analysis workflows. Ideally, you should develop a workflow appropriate for your experimental data using benchmark/control samples, such as HapMap samples obtained from the Coriell Institute Cell Repositories. Known variants for HapMap sample genomes can be downloaded from the HapMap project ( the 1000 Genomes Project ( or in special collections such as GATK s resource bundle ( Note that where the text SAMPLE appears throughout these examples, you should replace it with a unique sample name. Similarly, replace /path/to/ with a valid path on your system. The current directory is assumed to be the location of all SAMPLE input files, and will also be the location of SAMPLE output files and report files. Type the entire command shown for each step on a single line, despite the way it appears on the printed page. There should be no spaces within a file path, but there must be spaces before and after each option. Tip: use the Tab key to auto-complete paths and file names while typing. Tools Overview Package (version) Tool Function, as used in this document bamutil (1.0.10) clipoverlap Soft clips overlapping read pairs in BAM files intersect Calculate on-target read rate by intersecting a list of regions in BED format against mapped reads in BAM format. BEDtools (2.17.0) sort Sort BED formatted regions. merge Merge overlapping regions within a BED file. genomecov Calculate sum total size of regions in a BED file. slop Pad length of target regions. BisulfiteCountCovariates Generate data file for recalibration BisulfiteTableRecalibration Recalibrate base qualities before SNP calling BisSNP (0.82.2) BSMAP (2.74) BisulfiteGenotyper sortbyrefandcor.pl VCFpostprocess bsmap methratio.py Determine methylated/unmethylated counts at each position and make SNP calls Sort VCF files by name and position Filter CpGs and SNPs Map sequencing reads to an indexed genome. Calculate percent methylation at individual bases FastQC (0.10.1) fastqc Assess sequencing read quality (per-base quality plot). GATK Framework (2.7-2) IGV DepthOfCoverage igv Calculate mean, median, and specific sequencing coverage depths. Genomic viewer for BAM and BED files (not explicitly used in the examples shown in this document). Evaluate NimbleGen SeqCap Epi Target Enrichment Data 3
4 Package (version) Tool Function, as used in this document methylkit (0.9.2) Picard (1.98) SAMtools (0.1.18) read filterbycoverage unite calculatediffmeth get.methyldiff tilemethylcounts AddOrReplaceReadGroups CollectInsertSizeMetrics CalculateHsMetrics MarkDuplicates CollectAlignmentSummaryMetrics SamToFastq sort index view mpileup BCFtools view vcfutils varfilter Import BSMAP methylation results Filter methylation data by coverage Consolidate data to position covered by all samples Calculate and score differential methylation at each position Filter differential methylation results by absolute value of methylation change, qvalue, and type of region (hypo, hyper, all) Summarizes methylated/unmethylated base counts over tiling or sliding windows in the genome Add read group information to a SAM or BAM file Estimate insert size mean and standard deviation and plot an insert size distribution. Assess performance of targeted enrichment reads. Remove or mark duplicate reads, and detail the number of paired, unpaired, and optical duplicates. Report mapping metrics for a BAM file. Create FASTQ file(s) from a SAM or BAM file. Sort a BAM file. Index a sorted BAM file. View or extract header or read data. Call variants on a BAM file. Convert between VCF and BCF formats. Filter raw variants. seqtk (1.0-r31) sample Randomly subsample FASTQ file(s). Trimmomatic (0.30) illuminaclip Trim raw reads for quality. Table 1: Shown are third-party data analysis and formatting tools used in this technical note. The examples described in this document were tested using the software versions listed in parentheses. See links in References for installation instructions and explanations of command options. These tools were tested on a Linux system, but should also work on MacOS. Evaluate NimbleGen SeqCap Epi Target Enrichment Data 4
5 Decompress a FASTQ File If the FASTQ files have been compressed (with a.gz extension), some tools require them to be decompressed before use. gunzip SAMPLE_R1.fastq.gz SAMPLE_R2.fastq.gz SAMPLE_R1.fastq SAMPLE_R2.fastq gunzip c SAMPLE_R1.fastq.gz > SAMPLE_R1.fastq gunzip c SAMPLE_R2.fastq.gz > SAMPLE_R2.fastq Generate FASTQ Files from a BAM File If you do not have access to the original FASTQ files, it is possible to generate FASTQ files from a BAM file in order to remap the reads using a different pipeline. Picard SamToFastq SAMPLE_file.bam SAMPLE_R1.fastq SAMPLE_R2.fastq java -Xmx4g Xms4g -jar /path/to/picard/samtofastq.jar VALIDATION_STRINGENCY=LENIENT INPUT=SAMPLE_file.bam FASTQ=SAMPLE_R1.fastq SECOND_END_FASTQ=SAMPLE_R2.fastq The generated FASTQ files will only be identical to the originals if the creator of the BAM file did not remove reads such as duplicates, low quality, etc., or perform base quality recalibration. Select a Subsample of Reads from a FASTQ File Random subsampling is useful for normalizing the number of reads per set when doing comparisons. Depending on the goal of the experiment, reads can be sampled before adapter trimming and quality filtering (see below) or after, sampling only from high quality reads of a minimum length. With paired end reads, it is important that the two files use the same values for the seed (-s) and number of reads. The seqtk application will sample from gzipped FASTQ files, but will write the sampled reads to uncompressed FASTQ files. seqtk sample SAMPLE_R1.fastq SAMPLE_R2.fastq SAMPLE_subset_R1.fastq SAMPLE_subset_R2.fastq /path/to/seqtk sample -s SAMPLE_R1.fastq > SAMPLE_subset_R1.fastq /path/to/seqtk sample -s SAMPLE_R2.fastq > SAMPLE_subset_R2.fastq The commands above will randomly subsample 10 million read pairs from the paired end FASTQ files. Supplying the same random seed value (-s) ensures that the FASTQ records will remain in synchronized sort order and can be used for mapping, etc. Note that seqtk requires an amount of RAM proportional to the number of reads being subsampled. As you increase the size of the subsampled read set, more RAM is needed. Evaluate NimbleGen SeqCap Epi Target Enrichment Data 5
6 Examine Sequence Read Quality Control Before spending time evaluating mapping statistics, use fastqc to run a series of tests on raw reads and generate a per-base sequence quality plot and report. The fastqc tool can work on both compressed and uncompressed FASTQ files. FastQC fastqc SAMPLE_R1.fastq SAMPLE_R2.fastq SAMPLE_R1_fastqc.zip SAMPLE_R2_fastqc.zip /path/to/fastqc --nogroup SAMPLE_R1.fastq SAMPLE_R2.fastq A.zip file and decompressed directory are created for each SAMPLE input file. They contain an HTML report named fastqc_report.html that is viewable in any internet browser. The authors of FastQC have posted the following examples of the QC report for a good and a bad sequencing run. Adapter trimming and Quality Filtering Before mapping the bisulfite converted reads, it is advisable to trim off any sequencing adapters found on the reads, and to filter or trim the reads for sequence quality. The Trimmomatic application can do both of those steps. Trimmomatic illuminaclip SAMPLE_R1.fastq SAMPLE_R2.fastq SAMPLE_R1_trimmed.fq SAMPLE_R1_unpaired.fq SAMPLE_R2_trimmed.fq SAMPLE_R2_unpaired.fq java Xms4g Xmx4g -jar /path/to/trimmomatic.jar PE -threads NumProcessors -phred33 SAMPLE_R1.fastq SAMPLE_R2.fastq SAMPLE_R1_trimmed.fq SAMPLE_R1_unpaired.fq SAMPLE_R2_trimmed.fq SAMPLE_R2_unpaired.fq ILLUMINACLIP:/path/to/adapters.fa:2:30:10 LEADING:20 TRAILING:20 SLIDINGWINDOW:5:20 MINLEN:75 The Trimmomatic application will produce four files. The SAMPLE_R1_trimmed.fq and SAMPLE_R2_trimmed.fq contain the reads that are still paired after adapter trimming and quality filtering. The SAMPLE_R1_unpaired.fq and SAMPLE_R2_unpaired.fq contain singleton reads, where the other mate of the pair was discarded because of poor quality or because the remaining read was shorter than the minimum length of 75bp. The unpaired reads do not have to be discarded but for most alignment applications they will have to be mapped separately from the paired reads, which can complicate downstream steps in a workflow. If the workflow includes a filtering step to include only mapped reads which are paired, then there is no point in mapping the singleton reads from Trimmomatic. The parameters shown above are best used with 2x100bp reads, where you should expect to see about 90% of reads pass these quality filters. If you want to increase the percentage of passing reads, adjust the Trimmomatic parameters especially MINLEN, which is the required minimum length after trimming. Evaluate NimbleGen SeqCap Epi Target Enrichment Data 6
7 Mapping Reads There are a number of alignment software applications for mapping bisulfite converted reads to a reference genome. The choice of alignment software can depend on the length and type of reads being mapped (100bp paired end reads versus 76bp single end reads), the size and GC content of the organism, and the computational resources available. For 100bp paired end reads, BSMAP demonstrates a good compromise between alignment time and overall alignment percentage. The index for the reference genome is built at run time for BSMAP so there is no need to preindex the genome as is necessary for some other alignment software applications. BSMAP produces a SAM file which should be converted to a BAM file before proceeding with the workflow bsmap Picard AddOrReplaceReadGroups ref.fa SAMPLE_R1.fastq SAMPLE_R2.fastq SAMPLE.bam Map Reads /path/to/bsmap -r 0 -s 16 -n 1 -a SAMPLE_R1.fastq -b SAMPLE_R2.fastq -d /path/to/ref.fa -p NumProcessors -o SAMPLE.sam Add Read Group and Convert to BAM Xmx4g Xms4g -jar /path/to/picard/ AddOrReplaceReadGroups.jar VALIDATION_STRINGENCY=LENIENT INPUT=SAMPLE.sam OUTPUT=SAMPLE.bam INDEX=TRUE RGID=SAMPLE RGLB=SAMPLE RGPL=illumina RGSM=SAMPLE RGPU=platform_unit The parameters above will uniquely map reads to both strands. By default, BSMAP does not report unmapped reads. If you want unmapped reads to carry through the workflow, use the -u switch to report unmapped reads. Note that BSMAP does not assign a read group to the reads in the output SAM file, so that information is added at the same time the SAM file is converted to a BAM file. For single captures from one library, the sample ID, the read group library, and the read group sample can all be the same. If combining captures from multiple libraries or multiple captures, it is a good idea to distinguish those via the read group library ID or read group sample ID. The platform unit is a unique identifier for the flow cell and lane, which can generally be grabbed from the first four colon delimited fields from the FASTQ ID (in bold 1:N:0:CGATGT For the reference genome file, there are several points to consider. The FASTA formatted genome sequence should have chromosomes in karyotypic sort order, i.e. chr1, chr2,..., chr10, chr11, chrx, chry, chrm, chr1_random, etc. In the examples in this document, reference genome files are referred to as ref.fa, which should be replaced by the actual file name (e.g. hg19.fa ). Note: The SeqCap Epi products include a lambda spike-in control (GenBank accession NC_001416). This sequence should be added to the reference genome file so that captured controls can be mapped to the lambda genome, and be used to estimate bisulfite conversion efficiency as a QC step. If working with human reference genome sequence and mapping reads uniquely, it is advisable to mask the pseudo-autosomal regions on chry to N s so that reads may be mapped to the equivalent regions on chrx. For human experiments with 100bp paired end reads, you can generally expect to see >90% of the reads mapped to the human reference genome. For other, non-human reference genomes, the percent of reads mapping may vary based on a number of factors, including genome size, percent GC composition, repeat content and reference genome quality. Evaluate NimbleGen SeqCap Epi Target Enrichment Data 7
8 Sorting and Removing Duplicates Working with bisulfite converted sequences poses some additional challenges over standard capture sequences. The treatment of the sequence with bisulfite converts unmethylated C s to T s resulting in top and bottom strands that are no longer complementary (Figure 2). Figure 2. Illustration of non-complementary strands generated by bisulfite-treatment. Evaluate NimbleGen SeqCap Epi Target Enrichment Data 8
9 The standard tool for duplicate remove (Picard MarkDuplicates) does not have an option for dealing with bisulfite sequence and non-complementary top and bottom strands. So to remove duplicates, the mapped reads must be split by top/bottom strand, duplicates removed, and then the strands re-merged. The BSMAP aligner includes a tag in the BAM file (ZS), which indicates top/bottom strand, and forward/reverse read. This tag is used to split the BAM file containing the mapped reads, which are then sorted, duplicates are removed, and the split files are rejoined. bamtools split bamtools merge SAMtools sort Picard MarkDuplicates SAMPLE.bam SAMPLE.TAG_ZS_++.bam SAMPLE.TAG_ZS_+-.bam SAMPLE.top.bam SAMPLE.top.sorted.bam SAMPLE.top.rmdups.bam SAMPLE.top.rmdups_metrics.txt SAMPLE.TAG_ZS_-+.bam SAMPLE.TAG_ZS_--.bam SAMPLE.bottom.bam SAMPLE.bottom.sorted.bam SAMPLE.bottom.rmdups.bam SAMPLE.bottom.rmdups_metrics.txt SAMPLE.rmdups.bam Split BAM /path/to/bamtools split -tag ZS -in SAMPLE.bam Merge strand BAM files /path/to/bamtools merge -in SAMPLE.TAG_ZS_++.bam -in SAMPLE.TAG_ZS_+-.bam -out SAMPLE.top.bam /path/to/bamtools merge -in SAMPLE.TAG_ZS_-+.bam -in SAMPLE.TAG_ZS_--.bam -out SAMPLE.bottom.bam Sort BAM Files /path/to/samtools sort SAMPLE.top.bam SAMPLE.top.sorted /path/to/samtools sort SAMPLE.bottom.bam SAMPLE.bottom.sorted Remove Duplicates java Xmx4g Xms4g -jar /path/to/picard/markduplicates.jar VALIDATION_STRINGENCY=LENIENT INPUT=SAMPLE.top.sorted.bam OUTPUT=SAMPLE.top.rmdups.bam METRICS_FILE=SAMPLE.top.rmdups_metrics.txt REMOVE_DUPLICATES=true ASSUME_SORTED=true INDEX=true java Xmx4g Xms4g -jar /path/to/picard/markduplicates.jar VALIDATION_STRINGENCY=LENIENT INPUT=SAMPLE.bottom.sorted.bam OUTPUT=SAMPLE.bottom.rmdups.bam METRICS_FILE=SAMPLE.bottom.rmdups_metrics.txt REMOVE_DUPLICATES=true ASSUME_SORTED=true INDEX=true Merge duplicate removed BAM files /path/to/bamtools merge -in SAMPLE.top.rmdups.bam -in SAMPLE.bottom.rmdups.bam -out SAMPLE.rmdups.bam Evaluate NimbleGen SeqCap Epi Target Enrichment Data 9
10 In the Mark Duplicates step, REMOVE_DUPLICATES=false may also be used. This will retain reads marked as duplicates and just mark them as duplicates. It will result in larger files downstream, but may be beneficial if you want to examine downstream results with and without duplicates removed. View the file SAMPLE.strand.rmdups_metrics.txt for counts of paired, unpaired, and duplicate reads. See for a description of the output metrics. Note that optical duplicates are also reported, based on sequence similarity and sequencing cluster distance. Optical duplicates are not separately included in the total duplicate rate but are counted within the paired and unpaired duplicates. Filter BAM file to keep only mapped and properly paired reads Due to the difficulty in mapping bisulfite reads, overall data quality can be improved by restricting analysis to only those reads pairs where both mates in the pair can be mapped, in the correct orientation and at a distance consistent with the library insert size. Filtering the BAM file to include only mapped and properly paired reads will remove the ability to find structural variants. If that is a goal, then a separate analysis workflow (not described here) would have to be set up to find and characterize those variants. bamtools filter SAMPLE.rmdups.bam SAMPLE.filtered.bam Filter BAM /path/to/bamtools filter -ismapped true -ispaired true -isproperpair true - forcecompression -in SAMPLE.rmdups.bam -out SAMPLE.filtered.bam For human SeqCap Epi experiments, this step generally filters out about 5% of the reads mapped with BSMAP. Clip Overlapping Reads When using 100bp paired end reads, and typical NGS library insert sizes of bp, a number of the paired end reads will overlap. When evaluating percent methylation, this can cause bias, as the converted/non-converted C s in the overlap between the two regions may be double-counted. The bias can be especially large at positions with low read coverage. To prevent this, one or both reads should be soft-clipped such that the reads no longer overlap. The bamutil clipoverlap application can be used to do this. bamutil clipoverlap SAMPLE.filtered.bam SAMPLE.clipped.bam Filter BAM /path/to/bamutil clipoverlap --stats --in SAMPLE.filtered.bam --out SAMPLE.clipped.bam The clipoverlap tools will look at both reads in the pair, and when an overlap is found, the tool will clip the read with the lowest average quality in the overlap region, and the alignment positions are updated if necessary. See the following page for complete details: The number of reads that are clipped will depend on the average insert size of the library and the distribution of sizes in the library. See Estimate Insert Size Distribution below. Evaluate NimbleGen SeqCap Epi Target Enrichment Data 10
11 Indexing BAM files A number of tools that operate on BAM files require that the BAM file be indexed for faster or more efficient calculations. The Picard tools will generate an index automatically if the parameter INDEX=true is included on the command line. For other tools which create BAM files, the BAM index will have to be created manually. To do so, use the SAMtools index command. SAMtools index SAMPLE.bam SAMPLE.bam.bai /path/to/samtools index SAMPLE.bam The command will create an additional file with the suffix.bai added to the original file name. Basic Mapping Metrics (Picard) Basic mapping metrics can be calculated using Picard, from SAMPLE.bam and ref.fa. Picard CollectAlignmentSummaryMetrics ref.fa SAMPLE.bam SAMPLE_picard_alignment_metrics.txt java -Xmx4g -Xms4g -jar /path/to/picard/collectalignmentsummarymetrics.jar METRIC_ACCUMULATION_LEVEL=ALL_READS INPUT=SAMPLE.bam OUTPUT=SAMPLE_picard_alignment_metrics.txt REFERENCE_SEQUENCE=/path/to/ref.fa VALIDATION_STRINGENCY=LENIENT Note: It is important to use VALIDATION_STRINGENCY=LENIENT. Not all tools that operate on BAM files produce output files are completely compliant with the BAM specification, and unless this parameter is used, the various Picard tools will halt as soon as they find a single non-compliant record. See for a description of the output metrics. Create Picard Interval Lists A Picard interval list is a genomic interval description file that contains a SAM-like header describing the reference genome and a set of coordinates with strand and name for each interval. The target intervals are equivalent to the NimbleGen primary target BED file, and the bait intervals are equivalent to the NimbleGen capture target BED file. The Picard interval files are required by the Picard CalculateHsMetrics utility. Note: Some NimbleGen catalog designs only contain a single bed file of coverage targets. For those designs, the same BED file can be used for both primary targets and capture targets below. Evaluate NimbleGen SeqCap Epi Target Enrichment Data 11
12 Create a Picard Target Interval List Use the SAMtools view command to extract the header from a SAMPLE.bam file, use the Linux gawk command to extract and format the interval list body from a DESIGN_primary_target.bed file, and finally use the cat command to concatenate the two pieces together into a properly-formatted interval list. SAMtools view cat gawk SAMPLE.bam DESIGN_primary_target.bed DESIGN_target_intervals.txt Create a Picard Interval List Header /path/to/samtools view H SAMPLE.bam > SAMPLE_bam_header.txt Create a Picard Target Interval List Body cat DESIGN_primary_target.bed gawk '{print $1 "\t" $2+1 "\t" $3 "\t+\tinterval_" NR}' > DESIGN_target_body.txt Concatenate to Create a Picard Target Interval List cat SAMPLE_bam_header.txt DESIGN_target_body.txt > DESIGN_target_intervals.txt The DESIGN_target_intervals.txt file is used by the Picard CalculateHsMetrics command. Create a Picard Bait Interval List Use the SAMtools view command to extract the header from a SAMPLE.bam file, use the Linux gawk command to extract and format the interval list body from a DESIGN_capture_target.bed file, and finally use the cat command to concatenate the two pieces together into a properly-formatted interval list. SAMtools view cat gawk SAMPLE.bam DESIGN_capture_target.bed DESIGN_bait_intervals.txt Create a Picard Interval List Header /path/to/samtools view H SAMPLE.bam > SAMPLE_bam_header.txt Create a Picard Bait Interval List Body cat DESIGN_capture_target.bed gawk '{print $1 "\t" $2+1 "\t" $3 "\t+\tinterval_" NR}' > DESIGN_bait_body.txt Concatenate to Create a Picard Bait Interval List cat SAMPLE_bam_header.txt DESIGN_bait_body.txt > DESIGN_bait_intervals.txt The DESIGN_bait_intervals.txt file is used by the Picard CalculateHsMetrics command. Evaluate NimbleGen SeqCap Epi Target Enrichment Data 12
13 Hybrid Selection (HS) Analysis Metrics The CalculateHsMetrics command calculates a number of metrics assessing the quality of targeted enrichment reads. If necessary, follow the steps described in Combined SNP/methylation calling using BisSNP before using the step described below. Picard CalculateHsMetrics ref.fa {indexed} SAMPLE.clipped.bam {indexed} DESIGN_target_intervals.txt DESIGN_bait_intervals.txt SAMPLE_picard_hs_metrics.txt java -Xmx4g -Xms4g -jar /path/to/picard/calculatehsmetrics.jar BAIT_INTERVALS=DESIGN_bait_intervals.txt TARGET_INTERVALS=DESIGN_target_intervals.txt INPUT=SAMPLE.clipped.bam OUTPUT=SAMPLE_picard_hs_metrics.txt METRIC_ACCUMULATION_LEVEL=ALL_READS REFERENCE_SEQUENCE=/path/to/ref.fa VALIDATION_STRINGENCY=LENIENT TMP_DIR=. Supplying only one of the Picard interval files to Picard CalculateHsMetrics can change the outcome, depending on how different the primary and capture target (bait) regions are from each other. Some Roche NimbleGen catalog designs have only one BED file supplied. In that case, generate a single Picard Interval List from the supplied BED file and use as both bait and target intervals. See for a description of output metrics. Estimate Insert Size Distribution The DNA that goes into sequence capture is generated by random fragmentation, and later size selected. It is normal to observe a range of fragment sizes, but if skewed too large or too small the on-target rate and/or percent of bases covered with at least one read can be adversely affected. It is best to generate this report on the SAMPLE.filtered.bam file created above Picard CollectInsertSizeMetrics SAMPLE.filtered.bam SAMPLE_picard_insert_size_metrics.txt SAMPLE_picard_insert_size_plot.pdf java -Xmx4g -jar /path/to/picard/collectinsertsizemetrics.jar VALIDATION_STRINGENCY=LENIENT HISTOGRAM_FILE=SAMPLE_picard_insert_size_plot.pdf INPUT=SAMPLE.filtered.bam OUTPUT=SAMPLE_picard_insert_size_metrics.txt See for a description of output metrics included in SAMPLE_picard_insert_size_metrics.txt, which can also be used to plot the insert size distributions across samples. A PDF plot is also created and placed in SAMPLE_picard_insert_size_plot.pdf. Evaluate NimbleGen SeqCap Epi Target Enrichment Data 13
14 Add Padding to Targets Target-adjacent coverage is typical for hybridization-based target enrichment due to the capture of partially on-target DNA library fragments. To optionally assess the amount of off-target reads which are target adjacent, padding the targets before assessing the on-target rate may be performed. The targets should be padded, sorted, and then merged to product the final file. These three operations can be performed simultaneously by piping the contents of one application to the BEDtools slop BEDtools sort BEDtools merge DESIGN_capture_target.bed chromosome_sizes.txt DESIGN_padded_capture_target.bed Pad, Sort and Merge Overlapping and Book-Ended Regions /path/to/bedtools slop -i DESIGN_capture_target.bed -b 100 -g chromosome_sizes.txt /path/to/bedtools sort -i - /path/to/bedtools merge -i - > DESIGN_padded_capture_target.bed In the first step, the value supplied to the -b option indicates the number of target-adjacent bases to add on both sides of the target. In this example, 100 bases would be added to both sides of all targets. The chromosome_sizes.txt file should be in the format: ChrName<tab>ChrSize, e.g. chr and have an entry for each chromosome present in the input BED file. The -i - parameters directs the application to take the STDOUT from the previous command as input. Determine Sum Total Size of Regions in a BED File The chromosome_sizes.txt file should be as described in Add Padding to Targets. BEDtools genomecov grep cut chromosome_sizes.txt DESIGN.bed {size} /path/to/bedtools genomecov -i DESIGN.bed -g chromosome_sizes.txt -max 1 grep -P "genome\t1" cut -f 3 Alternatively, a single gawk command can be used: cat DESIGN.bed gawk -F'\t' 'BEGIN{SUM=0}{SUM+=$3-$2}END{print SUM}' Evaluate NimbleGen SeqCap Epi Target Enrichment Data 14
15 Count On-Target Reads The number of on-target reads (i.e. the number of mapped reads overlapping a set of target regions by at least one base) can be determined using BEDtools intersect on any of the BAM files created above. BEDtools intersect wc SAMPLE.bam DESIGN_primary_target.bed or DESIGN_capture_target.bed {count} /path/to/bedtools intersect -bed -abam SAMPLE.bam -b DESIGN_primary_target.bed or /path/to/bedtools intersect -bed -abam SAMPLE.bam -b DESIGN_capture_target.bed The command will output the number of on-target reads, counting as on-target any read that overlaps the target by at least one base. The output may be redirected to a file to save for later calculations. Divide the number of on-target reads by the total number of mapped, non-duplicate reads to get the percentage of on-target reads ( on-target rate ). See Basic Mapping Metrics (Picard) for reporting of the total number of mapped, non-duplicate reads. Calculate Depth of Coverage The depth of coverage over the primary or capture target regions can be determined using the DepthofCoverage command from GATK. GATK DepthOfCoverage ref.fa {indexed} SAMPLE.clipped.bam {indexed} DESIGN_primary_target.bed or DESIGN_capture_target.bed SAMPLE_gatk_primary_target_coverage.sample_summary or SAMPLE_gatk_capture_target_coverage.sample_summary {there are other output files not listed here} java -Xmx4g -Xms4g -jar /path/to/gatkframework/genomeanalysistk.jar -T DepthOfCoverage R /path/to/ref.fa -I SAMPLE.clipped.bam -o SAMPLE_gatk_primary_target _coverage -L DESIGN_primary_target.bed -ct 1 -ct 10 -ct 20 or java -Xmx4g -Xms4g -jar /path/to/gatkframework/genomeanalysistk.jar -T DepthOfCoverage R /path/to/ref.fa -I SAMPLE.clipped.bam -o SAMPLE_gatk_capture_target_coverage -L DESIGN_capture_target.bed -ct 1 -ct 10 -ct 20 The sample summary output file reports the mean and median depth of coverage over the given regions of interest, as well as the percent of bases in the given regions covered at or above a given depth. Multiple depth thresholds can be evaluated by specifying multiple values using the -ct switch. For more information, consult the documentation at: Evaluate NimbleGen SeqCap Epi Target Enrichment Data 15
16 Determine methylation percentage using BSMAP The percent methylation at each C base in the sample can be determined using the methratio.py script supplied by BSMAP and the BAM file produced in the Clip Overlapping Reads step. BSMAP methratio.py ref.fa SAMPLE.clipped.bam SAMPLE.methylation_results.txt Determine methylation percentage python /path/to/methratio.py -d hg19.fa -s /path/to/samtools -m 1 -z -i skip -o SAMPLE.methylation_results.txt SAMPLE.clipped.bam The results can be filtered by depth of coverage by using the -m switch. Raising this to a higher value (e.g. -m 5) will require at least that many reads to cover the position before a result is reported. The -z switch is required to report positions with zero methylation. The -i switch directs the script to ignore positions were there is a possible C->T SNP. Use the BisSNP analysis to deal with those positions. The result files from this step can be rather large, several GB in size, since percent methylation is reported for all C positions. An additional switch -c can be used to process one chromosome at a time (e.g. -c chr1). In combination with the -m switch, this can reduce result files to more reasonable sizes. Determine bisulfite conversion efficiency using BSMAP Reliable estimates of percent methylation require high efficiency of the bilsulfite conversion process, otherwise the level of methylation will be over-estimated. The best way to determine bisulfite conversion efficiency is to measure the number of unconverted C s on DNA molecules which have no methylation. For humans and animals, the mitochondrial genome is an appropriate target. For plants, both the mitochondrial and chloroplast genome may be used. Unfortunately, the genomes of these organelles is not always known or well-characterized, so Roche NimbleGen supplies phage lambda DNA (GenBank Accession NC_001416) which may be used as a spike-in control. Each SeqCap Epi capture pool contains probes to capture the lambda genomic region from base 4500 to The lambda genome sequence must be appended to your reference genome sequence files so that the captured reads may be mapped to that sequence. The residual C s in the reads mapped to the captured region of the lambda genome can be used to estimate the overall conversion efficiency. BSMAP methratio.py ref.fa SAMPLE.clipped.bam SAMPLE.NC_ methylation_results.txt Determine methylation percentage python /path/to/methratio.py -d hg19.fa -s /path/to/samtools -m 1 -z -i skip -o -c NC_ SAMPLE.NC_ methylation_results.txt SAMPLE.clipped.bam The -c NC_ switch restricts the methylation analysis to just the lambda genome. If the lambda spike-in was not used, you may substitute the name of the mitochondrial genome (e.g. chrm) or the chloroplast genome (e.g. chrc). Once the analysis is complete, open the resulting methylation results file in a spreadsheet application, and if necessary, delete the rows that fall outside the capture region. Then calculate the conversion rate as follows: conversion rate = 1 - (sum(c_count)/sum(ct_count)) Bisulfite-conversion rates should generally be above 99.5% to be considered successful. Evaluate NimbleGen SeqCap Epi Target Enrichment Data 16
17 Combined SNP/methylation calling using BisSNP The percent methylation and possible SNPs at each C base in the sample can be determined using the BisSNP package and the BAM file produced in the Clip Overlapping Reads step. BisSNP uses a multistep process to make methylation calls and call SNPs. The first step normally calls for Indel Realignment (see BisSNP user guide). However, the BSMAP software does not support gapped alignments, so this step may be skipped for reads mapped with BSMAP software. The next step is Base Quality Recalibration, followed by Combined SNP/methylation calling. BisSNP BisulfiteCountCovariates BisSNP BisulfiteTableRecalibration BisSNP BisulfiteGenotyper BisSNP sortbyrefandcor.pl BisSNP VCFpostprocess BisSNP vcf2bed6plus2.strand.pl ref.fa {indexed} genome.snps.vcf DESIGNS_capture_target.bed SAMPLE.clipped.bam SAMPLE.cpg.raw.vcf SAMPLE.cpg.raw.sorted.vcf SAMPLE.cpg.filtered.vcf SAMPLE.cpg.filter.summary.txt SAMPLE. cpg.filtered.strand.6plus2.bed SAMPLE.snp.raw.vcf SAMPLE.snp.raw.sorted.vcf SAMPLE.snp.filtered.vcf SAMPLE.snp.filter.summary.txt SAMPLE. cpg.filtered.strand.6plus2.bed Base Quality Recalibration java -Xmx10g -jar /path/to/bissnp jar -R ref.fa -I SAMPLE.clipped.bam -T BisulfiteCountCovariates -cov ReadGroupCovariate -cov QualityScoreCovariate -cov CycleCovariate -recalfile SAMPLE.recalFile_before.csv -nt NumProcessors -knownsites genome.snps.vcf java -Xmx10g -jar /path/to/bissnp jar -R ref.fa -I SAMPLE.clipped.bam -o SAMPLE.recal.bam -T BisulfiteTableRecalibration -recalfile SAMPLE.recalFile_before.csv -maxq 40 Combined SNP/methylation calling java -Xmx10g -jar /path/to/bissnp jar -R ref.fa -I SAMPLE.recal.bam -T BisulfiteGenotyper -D genome.snps.vcf -vfn1 SAMPLE.cpg.raw.vcf -vfn2 SAMPLE.snp.raw.vcf -L DESIGNS_capture_target.bed -stand_call_conf 20 - stand_emit_conf 0 -mmq 30 -mbq 0 -nt NumProcessors Sort VCF files perl /path/to/sortbyrefandcor.pl --k 1 --c 2 SAMPLE.snp.raw.vcf ref.fa.fai > SAMPLE.snp.raw.sorted.vcf perl /path/to/sortbyrefandcor.pl --k 1 --c 2 SAMPLE.cpg.raw.vcf ref.fa.fai > SAMPLE.cpg.raw.sorted.vcf Evaluate NimbleGen SeqCap Epi Target Enrichment Data 17
18 Filter SNP/methylation calls java -Xmx10g -jar /path/to/bissnp jar -R ref.fa -T VCFpostprocess -oldvcf SAMPLE.snp.raw.sorted.vcf -newvcf SAMPLE.snp.filtered.vcf -snpvcf SAMPLE.snp.raw.sorted.vcf -o SAMPLE.snp.filter.summary.txt java -Xmx10g -jar /path/to/bissnp jar -R ref.fa -T VCFpostprocess -oldvcf SAMPLE.cpg.raw.sorted.vcf -newvcf SAMPLE.cpg.filtered.vcf -snpvcf SAMPLE.snp.raw.sorted.vcf -o SAMPLE.cpg.filter.summary.txt Convert VCF to BED file perl /path/to/vcf2bed6plus2.strand.pl SAMPLE.snp.filtered.vcf perl /path/to/vcf2bed6plus2.strand.pl SAMPLE.cpg.filtered.vcf BisSNP utilizes a set of known SNPs for the genome as part of the SNP calling process. BisSNP can work without this set of SNPs, but it will improve SNP detection accuracy, especially in areas of low coverage (e.g. less than 10X). The order of the SNP VCF file should be the same as the reference genome file and the input BAM file. The BisSNP website has SNP files in VCF format for human builds hg18 and hg19 and one mouse build (mm9) at: For the filtering step, by default BisSNP will filter out SNPs with: quality score less than 20 coverage more than 120X strand bias more than quality score by depth less than 1.0 mapping_quality_zero filter for heterozygous SNPs more than SNPs within the same 20bp window. The BED files that are created have the percent methylation in the 7 th column, and the C/T coverage in the 8 th column. If viewed in the SignalMap software, the percent methylation will be viewed as a score from 0 to 1000, where 0 equals zero percent methylation and 1000 equal one hundred percent methylation. The strand column is ignored, so both methylation percentages will be displayed as positive values. The C/T coverage will not be displayed. Evaluate NimbleGen SeqCap Epi Target Enrichment Data 18
19 Determination of differentially methylated regions There are a number of different packages for discovering differentially methylated regions when comparing two samples, or two sets of samples. One of these is methylkit, an R package for DNA methylation analysis. Installation of R, methylkit, or the various dependencies is beyond the scope of this document. But instructions for installing R may be found at and instructions for installing methylkit and its dependencies may be found at R/methylKit read R/methylKit filterbycoverage R/methylKit unite R/methylKit calculatediffmeth R/methylKit get.methyldiff R/methylKit tilemethylcounts BSMAP methylation results files varies # R code for determining differentially methylated regions # Load files library(methylkit) file.list = list("sample.methylation_results.txt","control.methylation_results.txt") sample.list = list("test","control") methdata <-read( file.list, sample.id=sample.list, treatment=c(0,1), assembly="hg19", header=true, context="cpg", resolution="base", pipeline=list( fraction=true, chr.col=1, start.col=2, end.col=2, coverage.col=6, strand.col=3, freqc.col=5 ) ) #Generate basic stats getmethylationstats(methdata,plot=t,both.strands=t) getcoveragestats(methdata,plot=t,both.strands=t) #Filter by coverage count (>10X) or percentage (<99.9 percentile) methdata.filtered = filterbycoverage(methdata, lo.count = 10, lo.perc=null, hi.count=null,hi.per = 99.9) #Merge samples to locations covered by all samples meth = unite(methdata.filtered,destrand=false) #Calculate per-base differential methylation p-values and q-values methdiff = calculatediffmeth(meth) #Find differentially methylated bases with 25% difference and qvalue<0.01 Diff25p=get.methylDiff(methDiff,difference=25,qvalue=0.01) Evaluate NimbleGen SeqCap Epi Target Enrichment Data 19
20 #Find differentially hypo methylated bases with 25% difference and qvalue<0.01 Diff25pHypo =get.methyldiff(methdiff,difference=25,qvalue=0.01,type="hypo") #Find differentially hyper methylated bases with 25% difference and qvalue<0.01 Diff25pHyper=get.methylDiff(methDiff,difference=25,qvalue=0.01,type="hyper") #Find differentially methylated regions with 25% difference and qvalue<0.01 tiles = tilemethylcounts(methdata.filtered,win.size=500,step.size=100) tilemeth = unite(tiles,destrand=false) tilediff = calculatediffmeth(tilemeth,num.cores=8) tile25pct <- get.methyldiff(tilediff,difference=25,qvalue=0.01) #Write data to file write.table(getdata(diff25p),file="diff25pct.txt",row.names=f,col.names=t,sep="\t",q uote=f) write.table(getdata(tile25pct),file="tile25pct.txt",row.names=f,col.names=t,sep="\t",quote=f) Replicate data sets will be automatically combined by the treatment values specified at import. Q-values and percent differences can be adjusted to provide higher or lower stringency. The methylkit site ( has examples and tutorials on additional analysis that can be performed, including correlations, clustering, PCA, and adding annotation. 3. REFERENCES Roche NimbleGen is not responsible for the maintenance or content of the following third-party websites. BamUtil: BEDtools: BisSNP: BSMAP: FastQC: GATK: IGV: methylkit: Picard: R: SAMtools (including BCFtools and VCFutils): seqtk: Trimmomatic: Evaluate NimbleGen SeqCap Epi Target Enrichment Data 20
21 4. GLOSSARY BAI file BAM index file. For tools that require an indexed BAM file, the BAI file must be present in the same location as the BAM file. Bait See Capture target. BAM file Compressed form of the SAM file format. BCF file Compressed form of the VCF file format. BED file File format for describing genomic regions/intervals. BED file start coordinates are 0-based. Capture target as defined by Roche NimbleGen, these are the regions covered directly by one or more probes (the probe footprint). NimbleGen BED files refer to these regions as the tiled regions. These are equivalent to the bait intervals referred to by Picard. FASTA file A standard file format for describing nucleic acid sequences. FASTQ file A standard file format for describing sequencing reads that also includes base quality information. Genomic index A form of the reference genome sequence which enables faster comparisons during alignment. Picard interval files File format for describing genomic regions/intervals which also contains a header describing the reference genome. Picard interval file start coordinates are 1-based. Primary target as defined by Roche NimbleGen, these are the regions against which probes were designed. NimbleGen BED files refer to these regions as simply Target Regions. Typically these are identical to the original regions of interest with regions less than 100bp expanded to 100bp total to facilitate probe selection. These are equivalent to the target intervals referred to by Picard. SAM file Sequence Alignment / Map file; a community standard format for specifying sequencing read alignment to a reference genome. Target regions see Primary target. Tiled regions see Capture target. VCF file Variant call format; a community standard format for specifying variant calls for one or more samples or populations against a reference genome. Evaluate NimbleGen SeqCap Epi Target Enrichment Data 21
22 Published by: Roche Diagnostics GmbH Sandhofer Straße Mannheim Germany 2014 Roche Diagnostics All rights reserved. Notice to Purchaser For patent license limitations for individual products please refer to: For life science research only. Not for use in diagnostic procedures. Trademarks NIMBLEGEN and SEQCAP are trademarks of Roche. All other product names and trademarks are the property of their respective owners. Technical Note: How To (1) 04/14 Evaluate NimbleGen SeqCap Epi Target Enrichment Data 22
Evaluate NimbleGen SeqCap RNA Target Enrichment Data
Roche Sequencing Technical Note November 2014 How To Evaluate NimbleGen SeqCap RNA Target Enrichment Data 1. OVERVIEW Analysis of NimbleGen SeqCap RNA target enrichment data generated using an Illumina
More informationEpiGnome Methyl Seq Bioinformatics User Guide Rev. 0.1
EpiGnome Methyl Seq Bioinformatics User Guide Rev. 0.1 Introduction This guide contains data analysis recommendations for libraries prepared using Epicentre s EpiGnome Methyl Seq Kit, and sequenced on
More informationNext Generation Sequence Alignment on the BRC Cluster. Steve Newhouse 22 July 2010
Next Generation Sequence Alignment on the BRC Cluster Steve Newhouse 22 July 2010 Overview Practical guide to processing next generation sequencing data on the cluster No details on the inner workings
More informationWelcome to MAPHiTS (Mapping Analysis Pipeline for High-Throughput Sequences) tutorial page.
Welcome to MAPHiTS (Mapping Analysis Pipeline for High-Throughput Sequences) tutorial page. In this page you will learn to use the tools of the MAPHiTS suite. A little advice before starting : rename your
More informationSequence Data Quality Assessment Exercises and Solutions.
Sequence Data Quality Assessment Exercises and Solutions. Starting Note: Please do not copy and paste the commands. Characters in this document may not be copied correctly. Please type the commands and
More informationSAMtools. SAM BAM. mapping. BAM sort & indexing (ex: IGV) SNP call
SAMtools http://samtools.sourceforge.net/ SAM/BAM mapping BAM SAM BAM BAM sort & indexing (ex: IGV) mapping SNP call SAMtools NGS Program: samtools (Tools for alignments in the SAM format) Version: 0.1.19
More informationGalaxy Platform For NGS Data Analyses
Galaxy Platform For NGS Data Analyses Weihong Yan wyan@chem.ucla.edu Collaboratory Web Site http://qcb.ucla.edu/collaboratory Collaboratory Workshops Workshop Outline ü Day 1 UCLA galaxy and user account
More informationPreparation of alignments for variant calling with GATK: exercise instructions for BioHPC Lab computers
Preparation of alignments for variant calling with GATK: exercise instructions for BioHPC Lab computers Data used in the exercise We will use D. melanogaster WGS paired-end Illumina data with NCBI accessions
More informationUnderstanding and Pre-processing Raw Illumina Data
Understanding and Pre-processing Raw Illumina Data Matt Johnson October 4, 2013 1 Understanding FASTQ files After an Illumina sequencing run, the data is stored in very large text files in a standard format
More informationSNP Calling. Tuesday 4/21/15
SNP Calling Tuesday 4/21/15 Why Call SNPs? map mutations, ex: EMS, natural variation, introgressions associate with changes in expression develop markers for whole genome QTL analysis/ GWAS access diversity
More informationCopyright 2014 Regents of the University of Minnesota
Quality Control of Illumina Data using Galaxy Contents September 16, 2014 1 Introduction 2 1.1 What is Galaxy?..................................... 2 1.2 Galaxy at MSI......................................
More informationCalling variants in diploid or multiploid genomes
Calling variants in diploid or multiploid genomes Diploid genomes The initial steps in calling variants for diploid or multi-ploid organisms with NGS data are the same as what we've already seen: 1. 2.
More informationNA12878 Platinum Genome GENALICE MAP Analysis Report
NA12878 Platinum Genome GENALICE MAP Analysis Report Bas Tolhuis, PhD Jan-Jaap Wesselink, PhD GENALICE B.V. INDEX EXECUTIVE SUMMARY...4 1. MATERIALS & METHODS...5 1.1 SEQUENCE DATA...5 1.2 WORKFLOWS......5
More informationVariant calling using SAMtools
Variant calling using SAMtools Calling variants - a trivial use of an Interactive Session We are going to conduct the variant calling exercises in an interactive idev session just so you can get a feel
More informationHelpful Galaxy screencasts are available at:
This user guide serves as a simplified, graphic version of the CloudMap paper for applicationoriented end-users. For more details, please see the CloudMap paper. Video versions of these user guides and
More informationm6aviewer Version Documentation
m6aviewer Version 1.6.0 Documentation Contents 1. About 2. Requirements 3. Launching m6aviewer 4. Running Time Estimates 5. Basic Peak Calling 6. Running Modes 7. Multiple Samples/Sample Replicates 8.
More informationWM2 Bioinformatics. ExomeSeq data analysis part 1. Dietmar Rieder
WM2 Bioinformatics ExomeSeq data analysis part 1 Dietmar Rieder RAW data Use putty to logon to cluster.i med.ac.at In your home directory make directory to store raw data $ mkdir 00_RAW Copy raw fastq
More informationREPORT. NA12878 Platinum Genome. GENALICE MAP Analysis Report. Bas Tolhuis, PhD GENALICE B.V.
REPORT NA12878 Platinum Genome GENALICE MAP Analysis Report Bas Tolhuis, PhD GENALICE B.V. INDEX EXECUTIVE SUMMARY...4 1. MATERIALS & METHODS...5 1.1 SEQUENCE DATA...5 1.2 WORKFLOWS......5 1.3 ACCURACY
More informationReads Alignment and Variant Calling
Reads Alignment and Variant Calling CB2-201 Computational Biology and Bioinformatics February 22, 2016 Emidio Capriotti http://biofold.org/ Institute for Mathematical Modeling of Biological Systems Department
More informationVariation among genomes
Variation among genomes Comparing genomes The reference genome http://www.ncbi.nlm.nih.gov/nuccore/26556996 Arabidopsis thaliana, a model plant Col-0 variety is from Landsberg, Germany Ler is a mutant
More informationDecrypting your genome data privately in the cloud
Decrypting your genome data privately in the cloud Marc Sitges Data Manager@Made of Genes @madeofgenes The Human Genome 3.200 M (x2) Base pairs (bp) ~20.000 genes (~30%) (Exons ~1%) The Human Genome Project
More informationSequence Mapping and Assembly
Practical Introduction Sequence Mapping and Assembly December 8, 2014 Mary Kate Wing University of Michigan Center for Statistical Genetics Goals of This Session Learn basics of sequence data file formats
More informationChIP-Seq Tutorial on Galaxy
1 Introduction ChIP-Seq Tutorial on Galaxy 2 December 2010 (modified April 6, 2017) Rory Stark The aim of this practical is to give you some experience handling ChIP-Seq data. We will be working with data
More informationCORE Year 1 Whole Genome Sequencing Final Data Format Requirements
CORE Year 1 Whole Genome Sequencing Final Data Format Requirements To all incumbent contractors of CORE year 1 WGS contracts, the following acts as the agreed to sample parameters issued by NHLBI for data
More informationQIAseq DNA V3 Panel Analysis Plugin USER MANUAL
QIAseq DNA V3 Panel Analysis Plugin USER MANUAL User manual for QIAseq DNA V3 Panel Analysis 1.0.1 Windows, Mac OS X and Linux January 25, 2018 This software is for research purposes only. QIAGEN Aarhus
More informationThe software comes with 2 installers: (1) SureCall installer (2) GenAligners (contains BWA, BWA- MEM).
Release Notes Agilent SureCall 4.0 Product Number G4980AA SureCall Client 6-month named license supports installation of one client and server (to host the SureCall database) on one machine. For additional
More informationHIPPIE User Manual. (v0.0.2-beta, 2015/4/26, Yih-Chii Hwang, yihhwang [at] mail.med.upenn.edu)
HIPPIE User Manual (v0.0.2-beta, 2015/4/26, Yih-Chii Hwang, yihhwang [at] mail.med.upenn.edu) OVERVIEW OF HIPPIE o Flowchart of HIPPIE o Requirements PREPARE DIRECTORY STRUCTURE FOR HIPPIE EXECUTION o
More informationUSING BRAT-BW Table 1. Feature comparison of BRAT-bw, BRAT-large, Bismark and BS Seeker (as of on March, 2012)
USING BRAT-BW-2.0.1 BRAT-bw is a tool for BS-seq reads mapping, i.e. mapping of bisulfite-treated sequenced reads. BRAT-bw is a part of BRAT s suit. Therefore, input and output formats for BRAT-bw are
More informationITMO Ecole de Bioinformatique Hands-on session: smallrna-seq N. Servant 21 rd November 2013
ITMO Ecole de Bioinformatique Hands-on session: smallrna-seq N. Servant 21 rd November 2013 1. Data and objectives We will use the data from GEO (GSE35368, Toedling, Servant et al. 2011). Two samples were
More informationDesign and Annotation Files
Design and Annotation Files Release Notes SeqCap EZ Exome Target Enrichment System The design and annotation files provide information about genomic regions covered by the capture probes and the genes
More informationSupplementary Information. Detecting and annotating genetic variations using the HugeSeq pipeline
Supplementary Information Detecting and annotating genetic variations using the HugeSeq pipeline Hugo Y. K. Lam 1,#, Cuiping Pan 1, Michael J. Clark 1, Phil Lacroute 1, Rui Chen 1, Rajini Haraksingh 1,
More informationGuide to Reviewing and Approving Custom Designs
Guide to Reviewing and Approving Custom Designs SeqCap EZ Designs, v4.1 Overview This document describes how to review and approve the proposed custom SeqCap EZ and SeqCap EZ Prime designs based on the
More informationINTRODUCTION AUX FORMATS DE FICHIERS
INTRODUCTION AUX FORMATS DE FICHIERS Plan. Formats de séquences brutes.. Format fasta.2. Format fastq 2. Formats d alignements 2.. Format SAM 2.2. Format BAM 4. Format «Variant Calling» 4.. Format Varscan
More informationPRACTICAL SESSION 5 GOTCLOUD ALIGNMENT WITH BWA JAN 7 TH, 2014 STOM 2014 WORKSHOP HYUN MIN KANG UNIVERSITY OF MICHIGAN, ANN ARBOR
PRACTICAL SESSION 5 GOTCLOUD ALIGNMENT WITH BWA JAN 7 TH, 2014 STOM 2014 WORKSHOP HYUN MIN KANG UNIVERSITY OF MICHIGAN, ANN ARBOR GOAL OF THIS SESSION Assuming that The audiences know how to perform GWAS
More informationAnalyzing ChIP- Seq Data in Galaxy
Analyzing ChIP- Seq Data in Galaxy Lauren Mills RISS ABSTRACT Step- by- step guide to basic ChIP- Seq analysis using the Galaxy platform. Table of Contents Introduction... 3 Links to helpful information...
More informationRPGC Manual. You will also need python 2.7 or above to run our home-brew python scripts.
Introduction Here we present a new approach for producing de novo whole genome sequences--recombinant population genome construction (RPGC)--that solves many of the problems encountered in standard genome
More informationPre-processing and quality control of sequence data. Barbera van Schaik KEBB - Bioinformatics Laboratory
Pre-processing and quality control of sequence data Barbera van Schaik KEBB - Bioinformatics Laboratory b.d.vanschaik@amc.uva.nl Topic: quality control and prepare data for the interesting stuf Keep Throw
More informationPractical Linux Examples
Practical Linux Examples Processing large text file Parallelization of independent tasks Qi Sun & Robert Bukowski Bioinformatics Facility Cornell University http://cbsu.tc.cornell.edu/lab/doc/linux_examples_slides.pdf
More informationRNA-Seq in Galaxy: Tuxedo protocol. Igor Makunin, UQ RCC, QCIF
RNA-Seq in Galaxy: Tuxedo protocol Igor Makunin, UQ RCC, QCIF Acknowledgments Genomics Virtual Lab: gvl.org.au Galaxy for tutorials: galaxy-tut.genome.edu.au Galaxy Australia: galaxy-aust.genome.edu.au
More informationHigh-throughput sequencing: Alignment and related topic. Simon Anders EMBL Heidelberg
High-throughput sequencing: Alignment and related topic Simon Anders EMBL Heidelberg Established platforms HTS Platforms Illumina HiSeq, ABI SOLiD, Roche 454 Newcomers: Benchtop machines 454 GS Junior,
More informationProtocol: peak-calling for ChIP-seq data / segmentation analysis for histone modification data
Protocol: peak-calling for ChIP-seq data / segmentation analysis for histone modification data Table of Contents Protocol: peak-calling for ChIP-seq data / segmentation analysis for histone modification
More informationASAP - Allele-specific alignment pipeline
ASAP - Allele-specific alignment pipeline Jan 09, 2012 (1) ASAP - Quick Reference ASAP needs a working version of Perl and is run from the command line. Furthermore, Bowtie needs to be installed on your
More information3. Installation Download Cpipe and Run Install Script Create an Analysis Profile Create a Batch... 7
Cpipe User Guide 1. Introduction - What is Cpipe?... 3 2. Design Background... 3 2.1. Analysis Pipeline Implementation (Cpipe)... 4 2.2. Use of a Bioinformatics Pipeline Toolkit (Bpipe)... 4 2.3. Individual
More informationMapping NGS reads for genomics studies
Mapping NGS reads for genomics studies Valencia, 28-30 Sep 2015 BIER Alejandro Alemán aaleman@cipf.es Genomics Data Analysis CIBERER Where are we? Fastq Sequence preprocessing Fastq Alignment BAM Visualization
More informationWorkshop 6: DNA Methylation Analysis using Bisulfite Sequencing. Fides D Lay UCLA QCB Fellow
Workshop 6: DNA Methylation Analysis using Bisulfite Sequencing Fides D Lay UCLA QCB Fellow lay.fides@gmail.com Workshop 6 Outline Day 1: Introduction to DNA methylation & WGBS Quick review of linux, Hoffman2
More informationHandling sam and vcf data, quality control
Handling sam and vcf data, quality control We continue with the earlier analyses and get some new data: cd ~/session_3 wget http://wasabiapp.org/vbox/data/session_4/file3.tgz tar xzf file3.tgz wget http://wasabiapp.org/vbox/data/session_4/file4.tgz
More informationDemultiplexing Illumina sequencing data containing unique molecular indexes (UMIs)
next generation sequencing analysis guidelines Demultiplexing Illumina sequencing data containing unique molecular indexes (UMIs) See what more we can do for you at www.idtdna.com. For Research Use Only
More informationExome sequencing. Jong Kyoung Kim
Exome sequencing Jong Kyoung Kim Genome Analysis Toolkit The GATK is the industry standard for identifying SNPs and indels in germline DNA and RNAseq data. Its scope is now expanding to include somatic
More informationSep. Guide. Edico Genome Corp North Torrey Pines Court, Plaza Level, La Jolla, CA 92037
Sep 2017 DRAGEN TM Quick Start Guide www.edicogenome.com info@edicogenome.com Edico Genome Corp. 3344 North Torrey Pines Court, Plaza Level, La Jolla, CA 92037 Notice Contents of this document and associated
More informationNGS Analysis Using Galaxy
NGS Analysis Using Galaxy Sequences and Alignment Format Galaxy overview and Interface Get;ng Data in Galaxy Analyzing Data in Galaxy Quality Control Mapping Data History and workflow Galaxy Exercises
More informationThe software comes with 2 installers: (1) SureCall installer (2) GenAligners (contains BWA, BWA-MEM).
Release Notes Agilent SureCall 3.5 Product Number G4980AA SureCall Client 6-month named license supports installation of one client and server (to host the SureCall database) on one machine. For additional
More informationTumor-Specific NeoAntigen Detector (TSNAD) v2.0 User s Manual
Tumor-Specific NeoAntigen Detector (TSNAD) v2.0 User s Manual Zhan Zhou, Xingzheng Lyu and Jingcheng Wu Zhejiang University, CHINA March, 2016 USER'S MANUAL TABLE OF CONTENTS 1 GETTING STARTED... 1 1.1
More informationSentieon Documentation
Sentieon Documentation Release 201808.03 Sentieon, Inc Dec 21, 2018 Sentieon Manual 1 Introduction 1 1.1 Description.............................................. 1 1.2 Benefits and Value..........................................
More informationNGS Data Analysis. Roberto Preste
NGS Data Analysis Roberto Preste 1 Useful info http://bit.ly/2r1y2dr Contacts: roberto.preste@gmail.com Slides: http://bit.ly/ngs-data 2 NGS data analysis Overview 3 NGS Data Analysis: the basic idea http://bit.ly/2r1y2dr
More informationarxiv: v2 [q-bio.gn] 13 May 2014
BIOINFORMATICS Vol. 00 no. 00 2005 Pages 1 2 Fast and accurate alignment of long bisulfite-seq reads Brent S. Pedersen 1,, Kenneth Eyring 1, Subhajyoti De 1,2, Ivana V. Yang 1 and David A. Schwartz 1 1
More informationAgilent Genomic Workbench Lite Edition 6.5
Agilent Genomic Workbench Lite Edition 6.5 SureSelect Quality Analyzer User Guide For Research Use Only. Not for use in diagnostic procedures. Agilent Technologies Notices Agilent Technologies, Inc. 2010
More informationAgroMarker Finder manual (1.1)
AgroMarker Finder manual (1.1) 1. Introduction 2. Installation 3. How to run? 4. How to use? 5. Java program for calculating of restriction enzyme sites (TaqαI). 1. Introduction AgroMarker Finder (AMF)is
More informationRun Setup and Bioinformatic Analysis. Accel-NGS 2S MID Indexing Kits
Run Setup and Bioinformatic Analysis Accel-NGS 2S MID Indexing Kits Sequencing MID Libraries For MiSeq, HiSeq, and NextSeq instruments: Modify the config file to create a fastq for index reads Using the
More informationUser Guide. SLAMseq Data Analysis Pipeline SLAMdunk on Bluebee Platform
SLAMseq Data Analysis Pipeline SLAMdunk on Bluebee Platform User Guide Catalog Numbers: 061, 062 (SLAMseq Kinetics Kits) 015 (QuantSeq 3 mrna-seq Library Prep Kits) 063UG147V0100 FOR RESEARCH USE ONLY.
More informationv0.3.0 May 18, 2016 SNPsplit operates in two stages:
May 18, 2016 v0.3.0 SNPsplit is an allele-specific alignment sorter which is designed to read alignment files in SAM/ BAM format and determine the allelic origin of reads that cover known SNP positions.
More informationFalcon Accelerated Genomics Data Analysis Solutions. User Guide
Falcon Accelerated Genomics Data Analysis Solutions User Guide Falcon Computing Solutions, Inc. Version 1.0 3/30/2018 Table of Contents Introduction... 3 System Requirements and Installation... 4 Software
More informationGenomic Files. University of Massachusetts Medical School. October, 2015
.. Genomic Files University of Massachusetts Medical School October, 2015 2 / 55. A Typical Deep-Sequencing Workflow Samples Fastq Files Fastq Files Sam / Bam Files Various files Deep Sequencing Further
More informationHigh-throughout sequencing and using short-read aligners. Simon Anders
High-throughout sequencing and using short-read aligners Simon Anders High-throughput sequencing (HTS) Sequencing millions of short DNA fragments in parallel. a.k.a.: next-generation sequencing (NGS) massively-parallel
More informationRNAseq analysis: SNP calling. BTI bioinformatics course, spring 2013
RNAseq analysis: SNP calling BTI bioinformatics course, spring 2013 RNAseq overview RNAseq overview Choose technology 454 Illumina SOLiD 3 rd generation (Ion Torrent, PacBio) Library types Single reads
More informationRNA-seq. Manpreet S. Katari
RNA-seq Manpreet S. Katari Evolution of Sequence Technology Normalizing the Data RPKM (Reads per Kilobase of exons per million reads) Score = R NT R = # of unique reads for the gene N = Size of the gene
More informationUser's guide to ChIP-Seq applications: command-line usage and option summary
User's guide to ChIP-Seq applications: command-line usage and option summary 1. Basics about the ChIP-Seq Tools The ChIP-Seq software provides a set of tools performing common genome-wide ChIPseq analysis
More informationMerge Conflicts p. 92 More GitHub Workflows: Forking and Pull Requests p. 97 Using Git to Make Life Easier: Working with Past Commits p.
Preface p. xiii Ideology: Data Skills for Robust and Reproducible Bioinformatics How to Learn Bioinformatics p. 1 Why Bioinformatics? Biology's Growing Data p. 1 Learning Data Skills to Learn Bioinformatics
More informationGenomic Files. University of Massachusetts Medical School. October, 2014
.. Genomic Files University of Massachusetts Medical School October, 2014 2 / 39. A Typical Deep-Sequencing Workflow Samples Fastq Files Fastq Files Sam / Bam Files Various files Deep Sequencing Further
More informationHigh-throughput sequencing: Alignment and related topic. Simon Anders EMBL Heidelberg
High-throughput sequencing: Alignment and related topic Simon Anders EMBL Heidelberg Established platforms HTS Platforms Illumina HiSeq, ABI SOLiD, Roche 454 Newcomers: Benchtop machines: Illumina MiSeq,
More informationPractical Linux examples: Exercises
Practical Linux examples: Exercises 1. Login (ssh) to the machine that you are assigned for this workshop (assigned machines: https://cbsu.tc.cornell.edu/ww/machines.aspx?i=87 ). Prepare working directory,
More informationMar. Guide. Edico Genome Inc North Torrey Pines Court, Plaza Level, La Jolla, CA 92037
Mar 2017 DRAGEN TM Quick Start Guide www.edicogenome.com info@edicogenome.com Edico Genome Inc. 3344 North Torrey Pines Court, Plaza Level, La Jolla, CA 92037 Notice Contents of this document and associated
More informationv0.2.0 XX:Z:UA - Unassigned XX:Z:G1 - Genome 1-specific XX:Z:G2 - Genome 2-specific XX:Z:CF - Conflicting
October 08, 2015 v0.2.0 SNPsplit is an allele-specific alignment sorter which is designed to read alignment files in SAM/ BAM format and determine the allelic origin of reads that cover known SNP positions.
More informationLecture 3. Essential skills for bioinformatics: Unix/Linux
Lecture 3 Essential skills for bioinformatics: Unix/Linux RETRIEVING DATA Overview Whether downloading large sequencing datasets or accessing a web application hundreds of times to download specific files,
More informationAnalyzing Variant Call results using EuPathDB Galaxy, Part II
Analyzing Variant Call results using EuPathDB Galaxy, Part II In this exercise, we will work in groups to examine the results from the SNP analysis workflow that we started yesterday. The first step is
More informationNGS Data Visualization and Exploration Using IGV
1 What is Galaxy Galaxy for Bioinformaticians Galaxy for Experimental Biologists Using Galaxy for NGS Analysis NGS Data Visualization and Exploration Using IGV 2 What is Galaxy Galaxy for Bioinformaticians
More informationChIP-seq hands-on practical using Galaxy
ChIP-seq hands-on practical using Galaxy In this exercise we will cover some of the basic NGS analysis steps for ChIP-seq using the Galaxy framework: Quality control Mapping of reads using Bowtie2 Peak-calling
More informationRsubread package: high-performance read alignment, quantification and mutation discovery
Rsubread package: high-performance read alignment, quantification and mutation discovery Wei Shi 14 September 2015 1 Introduction This vignette provides a brief description to the Rsubread package. For
More informationCopyright 2014 Regents of the University of Minnesota
Quality Control of Illumina Data using Galaxy August 18, 2014 Contents 1 Introduction 2 1.1 What is Galaxy?..................................... 2 1.2 Galaxy at MSI......................................
More informationSAM / BAM Tutorial. EMBL Heidelberg. Course Materials. Tobias Rausch September 2012
SAM / BAM Tutorial EMBL Heidelberg Course Materials Tobias Rausch September 2012 Contents 1 SAM / BAM 3 1.1 Introduction................................... 3 1.2 Tasks.......................................
More informationNext-Generation Sequencing applied to adna
Next-Generation Sequencing applied to adna Hands-on session June 13, 2014 Ludovic Orlando - Lorlando@snm.ku.dk Mikkel Schubert - MSchubert@snm.ku.dk Aurélien Ginolhac - AGinolhac@snm.ku.dk Hákon Jónsson
More informationHow to use earray to create custom content for the SureSelect Target Enrichment platform. Page 1
How to use earray to create custom content for the SureSelect Target Enrichment platform Page 1 Getting Started Access earray Access earray at: https://earray.chem.agilent.com/earray/ Log in to earray,
More informationCBSU/3CPG/CVG Joint Workshop Series Reference genome based sequence variation detection
CBSU/3CPG/CVG Joint Workshop Series Reference genome based sequence variation detection Computational Biology Service Unit (CBSU) Cornell Center for Comparative and Population Genomics (3CPG) Center for
More informationThese will serve as a basic guideline for read prep. This assumes you have demultiplexed Illumina data.
These will serve as a basic guideline for read prep. This assumes you have demultiplexed Illumina data. We have a few different choices for running jobs on DT2 we will explore both here. We need to alter
More informationRsubread package: high-performance read alignment, quantification and mutation discovery
Rsubread package: high-performance read alignment, quantification and mutation discovery Wei Shi 14 September 2015 1 Introduction This vignette provides a brief description to the Rsubread package. For
More informationChIP-seq Analysis. BaRC Hot Topics - March 21 st 2017 Bioinformatics and Research Computing Whitehead Institute.
ChIP-seq Analysis BaRC Hot Topics - March 21 st 2017 Bioinformatics and Research Computing Whitehead Institute http://barc.wi.mit.edu/hot_topics/ Outline ChIP-seq overview Experimental design Quality control/preprocessing
More informationGenomeStudio Software Release Notes
GenomeStudio Software 2009.2 Release Notes 1. GenomeStudio Software 2009.2 Framework... 1 2. Illumina Genome Viewer v1.5...2 3. Genotyping Module v1.5... 4 4. Gene Expression Module v1.5... 6 5. Methylation
More informationPractical exercises Day 2. Variant Calling
Practical exercises Day 2 Variant Calling Samtools mpileup Variant calling with samtools mpileup + bcftools Variant calling with HaplotypeCaller (GATK Best Practices) Genotype GVCFs Hard Filtering Variant
More informationDindel User Guide, version 1.0
Dindel User Guide, version 1.0 Kees Albers University of Cambridge, Wellcome Trust Sanger Institute caa@sanger.ac.uk October 26, 2010 Contents 1 Introduction 2 2 Requirements 2 3 Optional input 3 4 Dindel
More informationMolecular Identifier (MID) Analysis for TAM-ChIP Paired-End Sequencing
Molecular Identifier (MID) Analysis for TAM-ChIP Paired-End Sequencing Catalog Nos.: 53126 & 53127 Name: TAM-ChIP antibody conjugate Description Active Motif s TAM-ChIP technology combines antibody directed
More informationChIP-seq Analysis. BaRC Hot Topics - Feb 23 th 2016 Bioinformatics and Research Computing Whitehead Institute.
ChIP-seq Analysis BaRC Hot Topics - Feb 23 th 2016 Bioinformatics and Research Computing Whitehead Institute http://barc.wi.mit.edu/hot_topics/ Outline ChIP-seq overview Experimental design Quality control/preprocessing
More informationEssential Skills for Bioinformatics: Unix/Linux
Essential Skills for Bioinformatics: Unix/Linux WORKING WITH COMPRESSED DATA Overview Data compression, the process of condensing data so that it takes up less space (on disk drives, in memory, or across
More informationELPREP PERFORMANCE ACROSS PROGRAMMING LANGUAGES PASCAL COSTANZA CHARLOTTE HERZEEL FOSDEM, BRUSSELS, BELGIUM, FEBRUARY 3, 2018
ELPREP PERFORMANCE ACROSS PROGRAMMING LANGUAGES PASCAL COSTANZA CHARLOTTE HERZEEL FOSDEM, BRUSSELS, BELGIUM, FEBRUARY 3, 2018 USA SAN FRANCISCO USA ORLANDO BELGIUM - HQ LEUVEN THE NETHERLANDS EINDHOVEN
More informationSAM : Sequence Alignment/Map format. A TAB-delimited text format storing the alignment information. A header section is optional.
Alignment of NGS reads, samtools and visualization Hands-on Software used in this practical BWA MEM : Burrows-Wheeler Aligner. A software package for mapping low-divergent sequences against a large reference
More informationSequence Analysis Pipeline
Sequence Analysis Pipeline Transcript fragments 1. PREPROCESSING 2. ASSEMBLY (today) Removal of contaminants, vector, adaptors, etc Put overlapping sequence together and calculate bigger sequences 3. Analysis/Annotation
More informationGenomes On The Cloud GotCloud. University of Michigan Center for Statistical Genetics Mary Kate Wing Goo Jun
Genomes On The Cloud GotCloud University of Michigan Center for Statistical Genetics Mary Kate Wing Goo Jun Friday, March 8, 2013 Why GotCloud? Connects sequence analysis tools together Alignment, quality
More informationAnalysis of ChIP-seq data
Before we start: 1. Log into tak (step 0 on the exercises) 2. Go to your lab space and create a folder for the class (see separate hand out) 3. Connect to your lab space through the wihtdata network and
More informationBIOINFORMATICS APPLICATIONS NOTE
BIOINFORMATICS APPLICATIONS NOTE Sequence analysis BRAT: Bisulfite-treated Reads Analysis Tool (Supplementary Methods) Elena Y. Harris 1,*, Nadia Ponts 2, Aleksandr Levchuk 3, Karine Le Roch 2 and Stefano
More informationIntroduction to UNIX command-line II
Introduction to UNIX command-line II Boyce Thompson Institute 2017 Prashant Hosmani Class Content Terminal file system navigation Wildcards, shortcuts and special characters File permissions Compression
More informationTutorial. Find Very Low Frequency Variants With QIAGEN GeneRead Panels. Sample to Insight. November 21, 2017
Find Very Low Frequency Variants With QIAGEN GeneRead Panels November 21, 2017 Sample to Insight QIAGEN Aarhus Silkeborgvej 2 Prismet 8000 Aarhus C Denmark Telephone: +45 70 22 32 44 www.qiagenbioinformatics.com
More informationls /data/atrnaseq/ egrep "(fastq fasta fq fa)\.gz" ls /data/atrnaseq/ egrep "(cn ts)[1-3]ln[^3a-za-z]\."
Command line tools - bash, awk and sed We can only explore a small fraction of the capabilities of the bash shell and command-line utilities in Linux during this course. An entire course could be taught
More information