Evaluate NimbleGen SeqCap Epi Target Enrichment Data

Size: px
Start display at page:

Download "Evaluate NimbleGen SeqCap Epi Target Enrichment Data"

Transcription

1 Sequencing Solutions Technical Note April 2014 How To Evaluate NimbleGen SeqCap Epi Target Enrichment Data 1. OVERVIEW Analysis of NimbleGen SeqCap Epi target enrichment data generated using an Illumina sequencing system is most frequently performed using a variety of publicly available, open-source analysis tools. The typical methylation and variant calling analysis workflow consists of sequencing read quality assessment, read filtering, mapping against a reference genome, removal of PCR duplicates, coverage statistic assessment, methylation analysis, variant calling and variant filtering; at most of these steps, a variety of tools are available. This document shows how to use a selection of the available tools to perform SeqCap Epi data analysis, but other analysis workflows may be used. Applications Targeted Bisulfite Sequencing Products SeqCap Epi CpGiant Enrichment Kit SeqCap Epi Developer Enrichment Kit SeqCap Epi Choice Enrichment Kit In addition to showing how to use raw FASTQ files to generate a BAM file and perform methylation analysis (see Figure 1), additional supporting workflows include how to prepare and work with Roche NimbleGen BED files and how to create a Picard interval list file. This document should enable readers with bioinformatics experience to understand the basic analysis workflow in use at Roche NimbleGen. The reader should carefully consider the available options to determine the appropriate workflow for their research. The usage examples described here have been applied effectively in our hands. Please note that publicly available, open-source software tools may change and that such change is not under the control of Roche. Therefore Roche does not warrant and cannot be held liable for the results obtained when using the third party tools described herein. Please refer to the authors of each tool for support and documentation. For life science research only. Not for use in diagnostic procedures. 1

2 Figure 1. Schematic of basic methylation analysis workflow Evaluate NimbleGen SeqCap Epi Target Enrichment Data 2

3 2. SOLUTION Freely available third-party tools are available for converting raw sequencing data into appropriate file formats, mapping bisulfite converted reads to a reference sequence, evaluating sequencing quality, generating methylation calls and analyzing variant calls. This technical note describes a number of steps and mini-workflows that use such third-party tools, which can be combined together into a variety of data analysis workflows. Ideally, you should develop a workflow appropriate for your experimental data using benchmark/control samples, such as HapMap samples obtained from the Coriell Institute Cell Repositories. Known variants for HapMap sample genomes can be downloaded from the HapMap project ( the 1000 Genomes Project ( or in special collections such as GATK s resource bundle ( Note that where the text SAMPLE appears throughout these examples, you should replace it with a unique sample name. Similarly, replace /path/to/ with a valid path on your system. The current directory is assumed to be the location of all SAMPLE input files, and will also be the location of SAMPLE output files and report files. Type the entire command shown for each step on a single line, despite the way it appears on the printed page. There should be no spaces within a file path, but there must be spaces before and after each option. Tip: use the Tab key to auto-complete paths and file names while typing. Tools Overview Package (version) Tool Function, as used in this document bamutil (1.0.10) clipoverlap Soft clips overlapping read pairs in BAM files intersect Calculate on-target read rate by intersecting a list of regions in BED format against mapped reads in BAM format. BEDtools (2.17.0) sort Sort BED formatted regions. merge Merge overlapping regions within a BED file. genomecov Calculate sum total size of regions in a BED file. slop Pad length of target regions. BisulfiteCountCovariates Generate data file for recalibration BisulfiteTableRecalibration Recalibrate base qualities before SNP calling BisSNP (0.82.2) BSMAP (2.74) BisulfiteGenotyper sortbyrefandcor.pl VCFpostprocess bsmap methratio.py Determine methylated/unmethylated counts at each position and make SNP calls Sort VCF files by name and position Filter CpGs and SNPs Map sequencing reads to an indexed genome. Calculate percent methylation at individual bases FastQC (0.10.1) fastqc Assess sequencing read quality (per-base quality plot). GATK Framework (2.7-2) IGV DepthOfCoverage igv Calculate mean, median, and specific sequencing coverage depths. Genomic viewer for BAM and BED files (not explicitly used in the examples shown in this document). Evaluate NimbleGen SeqCap Epi Target Enrichment Data 3

4 Package (version) Tool Function, as used in this document methylkit (0.9.2) Picard (1.98) SAMtools (0.1.18) read filterbycoverage unite calculatediffmeth get.methyldiff tilemethylcounts AddOrReplaceReadGroups CollectInsertSizeMetrics CalculateHsMetrics MarkDuplicates CollectAlignmentSummaryMetrics SamToFastq sort index view mpileup BCFtools view vcfutils varfilter Import BSMAP methylation results Filter methylation data by coverage Consolidate data to position covered by all samples Calculate and score differential methylation at each position Filter differential methylation results by absolute value of methylation change, qvalue, and type of region (hypo, hyper, all) Summarizes methylated/unmethylated base counts over tiling or sliding windows in the genome Add read group information to a SAM or BAM file Estimate insert size mean and standard deviation and plot an insert size distribution. Assess performance of targeted enrichment reads. Remove or mark duplicate reads, and detail the number of paired, unpaired, and optical duplicates. Report mapping metrics for a BAM file. Create FASTQ file(s) from a SAM or BAM file. Sort a BAM file. Index a sorted BAM file. View or extract header or read data. Call variants on a BAM file. Convert between VCF and BCF formats. Filter raw variants. seqtk (1.0-r31) sample Randomly subsample FASTQ file(s). Trimmomatic (0.30) illuminaclip Trim raw reads for quality. Table 1: Shown are third-party data analysis and formatting tools used in this technical note. The examples described in this document were tested using the software versions listed in parentheses. See links in References for installation instructions and explanations of command options. These tools were tested on a Linux system, but should also work on MacOS. Evaluate NimbleGen SeqCap Epi Target Enrichment Data 4

5 Decompress a FASTQ File If the FASTQ files have been compressed (with a.gz extension), some tools require them to be decompressed before use. gunzip SAMPLE_R1.fastq.gz SAMPLE_R2.fastq.gz SAMPLE_R1.fastq SAMPLE_R2.fastq gunzip c SAMPLE_R1.fastq.gz > SAMPLE_R1.fastq gunzip c SAMPLE_R2.fastq.gz > SAMPLE_R2.fastq Generate FASTQ Files from a BAM File If you do not have access to the original FASTQ files, it is possible to generate FASTQ files from a BAM file in order to remap the reads using a different pipeline. Picard SamToFastq SAMPLE_file.bam SAMPLE_R1.fastq SAMPLE_R2.fastq java -Xmx4g Xms4g -jar /path/to/picard/samtofastq.jar VALIDATION_STRINGENCY=LENIENT INPUT=SAMPLE_file.bam FASTQ=SAMPLE_R1.fastq SECOND_END_FASTQ=SAMPLE_R2.fastq The generated FASTQ files will only be identical to the originals if the creator of the BAM file did not remove reads such as duplicates, low quality, etc., or perform base quality recalibration. Select a Subsample of Reads from a FASTQ File Random subsampling is useful for normalizing the number of reads per set when doing comparisons. Depending on the goal of the experiment, reads can be sampled before adapter trimming and quality filtering (see below) or after, sampling only from high quality reads of a minimum length. With paired end reads, it is important that the two files use the same values for the seed (-s) and number of reads. The seqtk application will sample from gzipped FASTQ files, but will write the sampled reads to uncompressed FASTQ files. seqtk sample SAMPLE_R1.fastq SAMPLE_R2.fastq SAMPLE_subset_R1.fastq SAMPLE_subset_R2.fastq /path/to/seqtk sample -s SAMPLE_R1.fastq > SAMPLE_subset_R1.fastq /path/to/seqtk sample -s SAMPLE_R2.fastq > SAMPLE_subset_R2.fastq The commands above will randomly subsample 10 million read pairs from the paired end FASTQ files. Supplying the same random seed value (-s) ensures that the FASTQ records will remain in synchronized sort order and can be used for mapping, etc. Note that seqtk requires an amount of RAM proportional to the number of reads being subsampled. As you increase the size of the subsampled read set, more RAM is needed. Evaluate NimbleGen SeqCap Epi Target Enrichment Data 5

6 Examine Sequence Read Quality Control Before spending time evaluating mapping statistics, use fastqc to run a series of tests on raw reads and generate a per-base sequence quality plot and report. The fastqc tool can work on both compressed and uncompressed FASTQ files. FastQC fastqc SAMPLE_R1.fastq SAMPLE_R2.fastq SAMPLE_R1_fastqc.zip SAMPLE_R2_fastqc.zip /path/to/fastqc --nogroup SAMPLE_R1.fastq SAMPLE_R2.fastq A.zip file and decompressed directory are created for each SAMPLE input file. They contain an HTML report named fastqc_report.html that is viewable in any internet browser. The authors of FastQC have posted the following examples of the QC report for a good and a bad sequencing run. Adapter trimming and Quality Filtering Before mapping the bisulfite converted reads, it is advisable to trim off any sequencing adapters found on the reads, and to filter or trim the reads for sequence quality. The Trimmomatic application can do both of those steps. Trimmomatic illuminaclip SAMPLE_R1.fastq SAMPLE_R2.fastq SAMPLE_R1_trimmed.fq SAMPLE_R1_unpaired.fq SAMPLE_R2_trimmed.fq SAMPLE_R2_unpaired.fq java Xms4g Xmx4g -jar /path/to/trimmomatic.jar PE -threads NumProcessors -phred33 SAMPLE_R1.fastq SAMPLE_R2.fastq SAMPLE_R1_trimmed.fq SAMPLE_R1_unpaired.fq SAMPLE_R2_trimmed.fq SAMPLE_R2_unpaired.fq ILLUMINACLIP:/path/to/adapters.fa:2:30:10 LEADING:20 TRAILING:20 SLIDINGWINDOW:5:20 MINLEN:75 The Trimmomatic application will produce four files. The SAMPLE_R1_trimmed.fq and SAMPLE_R2_trimmed.fq contain the reads that are still paired after adapter trimming and quality filtering. The SAMPLE_R1_unpaired.fq and SAMPLE_R2_unpaired.fq contain singleton reads, where the other mate of the pair was discarded because of poor quality or because the remaining read was shorter than the minimum length of 75bp. The unpaired reads do not have to be discarded but for most alignment applications they will have to be mapped separately from the paired reads, which can complicate downstream steps in a workflow. If the workflow includes a filtering step to include only mapped reads which are paired, then there is no point in mapping the singleton reads from Trimmomatic. The parameters shown above are best used with 2x100bp reads, where you should expect to see about 90% of reads pass these quality filters. If you want to increase the percentage of passing reads, adjust the Trimmomatic parameters especially MINLEN, which is the required minimum length after trimming. Evaluate NimbleGen SeqCap Epi Target Enrichment Data 6

7 Mapping Reads There are a number of alignment software applications for mapping bisulfite converted reads to a reference genome. The choice of alignment software can depend on the length and type of reads being mapped (100bp paired end reads versus 76bp single end reads), the size and GC content of the organism, and the computational resources available. For 100bp paired end reads, BSMAP demonstrates a good compromise between alignment time and overall alignment percentage. The index for the reference genome is built at run time for BSMAP so there is no need to preindex the genome as is necessary for some other alignment software applications. BSMAP produces a SAM file which should be converted to a BAM file before proceeding with the workflow bsmap Picard AddOrReplaceReadGroups ref.fa SAMPLE_R1.fastq SAMPLE_R2.fastq SAMPLE.bam Map Reads /path/to/bsmap -r 0 -s 16 -n 1 -a SAMPLE_R1.fastq -b SAMPLE_R2.fastq -d /path/to/ref.fa -p NumProcessors -o SAMPLE.sam Add Read Group and Convert to BAM Xmx4g Xms4g -jar /path/to/picard/ AddOrReplaceReadGroups.jar VALIDATION_STRINGENCY=LENIENT INPUT=SAMPLE.sam OUTPUT=SAMPLE.bam INDEX=TRUE RGID=SAMPLE RGLB=SAMPLE RGPL=illumina RGSM=SAMPLE RGPU=platform_unit The parameters above will uniquely map reads to both strands. By default, BSMAP does not report unmapped reads. If you want unmapped reads to carry through the workflow, use the -u switch to report unmapped reads. Note that BSMAP does not assign a read group to the reads in the output SAM file, so that information is added at the same time the SAM file is converted to a BAM file. For single captures from one library, the sample ID, the read group library, and the read group sample can all be the same. If combining captures from multiple libraries or multiple captures, it is a good idea to distinguish those via the read group library ID or read group sample ID. The platform unit is a unique identifier for the flow cell and lane, which can generally be grabbed from the first four colon delimited fields from the FASTQ ID (in bold 1:N:0:CGATGT For the reference genome file, there are several points to consider. The FASTA formatted genome sequence should have chromosomes in karyotypic sort order, i.e. chr1, chr2,..., chr10, chr11, chrx, chry, chrm, chr1_random, etc. In the examples in this document, reference genome files are referred to as ref.fa, which should be replaced by the actual file name (e.g. hg19.fa ). Note: The SeqCap Epi products include a lambda spike-in control (GenBank accession NC_001416). This sequence should be added to the reference genome file so that captured controls can be mapped to the lambda genome, and be used to estimate bisulfite conversion efficiency as a QC step. If working with human reference genome sequence and mapping reads uniquely, it is advisable to mask the pseudo-autosomal regions on chry to N s so that reads may be mapped to the equivalent regions on chrx. For human experiments with 100bp paired end reads, you can generally expect to see >90% of the reads mapped to the human reference genome. For other, non-human reference genomes, the percent of reads mapping may vary based on a number of factors, including genome size, percent GC composition, repeat content and reference genome quality. Evaluate NimbleGen SeqCap Epi Target Enrichment Data 7

8 Sorting and Removing Duplicates Working with bisulfite converted sequences poses some additional challenges over standard capture sequences. The treatment of the sequence with bisulfite converts unmethylated C s to T s resulting in top and bottom strands that are no longer complementary (Figure 2). Figure 2. Illustration of non-complementary strands generated by bisulfite-treatment. Evaluate NimbleGen SeqCap Epi Target Enrichment Data 8

9 The standard tool for duplicate remove (Picard MarkDuplicates) does not have an option for dealing with bisulfite sequence and non-complementary top and bottom strands. So to remove duplicates, the mapped reads must be split by top/bottom strand, duplicates removed, and then the strands re-merged. The BSMAP aligner includes a tag in the BAM file (ZS), which indicates top/bottom strand, and forward/reverse read. This tag is used to split the BAM file containing the mapped reads, which are then sorted, duplicates are removed, and the split files are rejoined. bamtools split bamtools merge SAMtools sort Picard MarkDuplicates SAMPLE.bam SAMPLE.TAG_ZS_++.bam SAMPLE.TAG_ZS_+-.bam SAMPLE.top.bam SAMPLE.top.sorted.bam SAMPLE.top.rmdups.bam SAMPLE.top.rmdups_metrics.txt SAMPLE.TAG_ZS_-+.bam SAMPLE.TAG_ZS_--.bam SAMPLE.bottom.bam SAMPLE.bottom.sorted.bam SAMPLE.bottom.rmdups.bam SAMPLE.bottom.rmdups_metrics.txt SAMPLE.rmdups.bam Split BAM /path/to/bamtools split -tag ZS -in SAMPLE.bam Merge strand BAM files /path/to/bamtools merge -in SAMPLE.TAG_ZS_++.bam -in SAMPLE.TAG_ZS_+-.bam -out SAMPLE.top.bam /path/to/bamtools merge -in SAMPLE.TAG_ZS_-+.bam -in SAMPLE.TAG_ZS_--.bam -out SAMPLE.bottom.bam Sort BAM Files /path/to/samtools sort SAMPLE.top.bam SAMPLE.top.sorted /path/to/samtools sort SAMPLE.bottom.bam SAMPLE.bottom.sorted Remove Duplicates java Xmx4g Xms4g -jar /path/to/picard/markduplicates.jar VALIDATION_STRINGENCY=LENIENT INPUT=SAMPLE.top.sorted.bam OUTPUT=SAMPLE.top.rmdups.bam METRICS_FILE=SAMPLE.top.rmdups_metrics.txt REMOVE_DUPLICATES=true ASSUME_SORTED=true INDEX=true java Xmx4g Xms4g -jar /path/to/picard/markduplicates.jar VALIDATION_STRINGENCY=LENIENT INPUT=SAMPLE.bottom.sorted.bam OUTPUT=SAMPLE.bottom.rmdups.bam METRICS_FILE=SAMPLE.bottom.rmdups_metrics.txt REMOVE_DUPLICATES=true ASSUME_SORTED=true INDEX=true Merge duplicate removed BAM files /path/to/bamtools merge -in SAMPLE.top.rmdups.bam -in SAMPLE.bottom.rmdups.bam -out SAMPLE.rmdups.bam Evaluate NimbleGen SeqCap Epi Target Enrichment Data 9

10 In the Mark Duplicates step, REMOVE_DUPLICATES=false may also be used. This will retain reads marked as duplicates and just mark them as duplicates. It will result in larger files downstream, but may be beneficial if you want to examine downstream results with and without duplicates removed. View the file SAMPLE.strand.rmdups_metrics.txt for counts of paired, unpaired, and duplicate reads. See for a description of the output metrics. Note that optical duplicates are also reported, based on sequence similarity and sequencing cluster distance. Optical duplicates are not separately included in the total duplicate rate but are counted within the paired and unpaired duplicates. Filter BAM file to keep only mapped and properly paired reads Due to the difficulty in mapping bisulfite reads, overall data quality can be improved by restricting analysis to only those reads pairs where both mates in the pair can be mapped, in the correct orientation and at a distance consistent with the library insert size. Filtering the BAM file to include only mapped and properly paired reads will remove the ability to find structural variants. If that is a goal, then a separate analysis workflow (not described here) would have to be set up to find and characterize those variants. bamtools filter SAMPLE.rmdups.bam SAMPLE.filtered.bam Filter BAM /path/to/bamtools filter -ismapped true -ispaired true -isproperpair true - forcecompression -in SAMPLE.rmdups.bam -out SAMPLE.filtered.bam For human SeqCap Epi experiments, this step generally filters out about 5% of the reads mapped with BSMAP. Clip Overlapping Reads When using 100bp paired end reads, and typical NGS library insert sizes of bp, a number of the paired end reads will overlap. When evaluating percent methylation, this can cause bias, as the converted/non-converted C s in the overlap between the two regions may be double-counted. The bias can be especially large at positions with low read coverage. To prevent this, one or both reads should be soft-clipped such that the reads no longer overlap. The bamutil clipoverlap application can be used to do this. bamutil clipoverlap SAMPLE.filtered.bam SAMPLE.clipped.bam Filter BAM /path/to/bamutil clipoverlap --stats --in SAMPLE.filtered.bam --out SAMPLE.clipped.bam The clipoverlap tools will look at both reads in the pair, and when an overlap is found, the tool will clip the read with the lowest average quality in the overlap region, and the alignment positions are updated if necessary. See the following page for complete details: The number of reads that are clipped will depend on the average insert size of the library and the distribution of sizes in the library. See Estimate Insert Size Distribution below. Evaluate NimbleGen SeqCap Epi Target Enrichment Data 10

11 Indexing BAM files A number of tools that operate on BAM files require that the BAM file be indexed for faster or more efficient calculations. The Picard tools will generate an index automatically if the parameter INDEX=true is included on the command line. For other tools which create BAM files, the BAM index will have to be created manually. To do so, use the SAMtools index command. SAMtools index SAMPLE.bam SAMPLE.bam.bai /path/to/samtools index SAMPLE.bam The command will create an additional file with the suffix.bai added to the original file name. Basic Mapping Metrics (Picard) Basic mapping metrics can be calculated using Picard, from SAMPLE.bam and ref.fa. Picard CollectAlignmentSummaryMetrics ref.fa SAMPLE.bam SAMPLE_picard_alignment_metrics.txt java -Xmx4g -Xms4g -jar /path/to/picard/collectalignmentsummarymetrics.jar METRIC_ACCUMULATION_LEVEL=ALL_READS INPUT=SAMPLE.bam OUTPUT=SAMPLE_picard_alignment_metrics.txt REFERENCE_SEQUENCE=/path/to/ref.fa VALIDATION_STRINGENCY=LENIENT Note: It is important to use VALIDATION_STRINGENCY=LENIENT. Not all tools that operate on BAM files produce output files are completely compliant with the BAM specification, and unless this parameter is used, the various Picard tools will halt as soon as they find a single non-compliant record. See for a description of the output metrics. Create Picard Interval Lists A Picard interval list is a genomic interval description file that contains a SAM-like header describing the reference genome and a set of coordinates with strand and name for each interval. The target intervals are equivalent to the NimbleGen primary target BED file, and the bait intervals are equivalent to the NimbleGen capture target BED file. The Picard interval files are required by the Picard CalculateHsMetrics utility. Note: Some NimbleGen catalog designs only contain a single bed file of coverage targets. For those designs, the same BED file can be used for both primary targets and capture targets below. Evaluate NimbleGen SeqCap Epi Target Enrichment Data 11

12 Create a Picard Target Interval List Use the SAMtools view command to extract the header from a SAMPLE.bam file, use the Linux gawk command to extract and format the interval list body from a DESIGN_primary_target.bed file, and finally use the cat command to concatenate the two pieces together into a properly-formatted interval list. SAMtools view cat gawk SAMPLE.bam DESIGN_primary_target.bed DESIGN_target_intervals.txt Create a Picard Interval List Header /path/to/samtools view H SAMPLE.bam > SAMPLE_bam_header.txt Create a Picard Target Interval List Body cat DESIGN_primary_target.bed gawk '{print $1 "\t" $2+1 "\t" $3 "\t+\tinterval_" NR}' > DESIGN_target_body.txt Concatenate to Create a Picard Target Interval List cat SAMPLE_bam_header.txt DESIGN_target_body.txt > DESIGN_target_intervals.txt The DESIGN_target_intervals.txt file is used by the Picard CalculateHsMetrics command. Create a Picard Bait Interval List Use the SAMtools view command to extract the header from a SAMPLE.bam file, use the Linux gawk command to extract and format the interval list body from a DESIGN_capture_target.bed file, and finally use the cat command to concatenate the two pieces together into a properly-formatted interval list. SAMtools view cat gawk SAMPLE.bam DESIGN_capture_target.bed DESIGN_bait_intervals.txt Create a Picard Interval List Header /path/to/samtools view H SAMPLE.bam > SAMPLE_bam_header.txt Create a Picard Bait Interval List Body cat DESIGN_capture_target.bed gawk '{print $1 "\t" $2+1 "\t" $3 "\t+\tinterval_" NR}' > DESIGN_bait_body.txt Concatenate to Create a Picard Bait Interval List cat SAMPLE_bam_header.txt DESIGN_bait_body.txt > DESIGN_bait_intervals.txt The DESIGN_bait_intervals.txt file is used by the Picard CalculateHsMetrics command. Evaluate NimbleGen SeqCap Epi Target Enrichment Data 12

13 Hybrid Selection (HS) Analysis Metrics The CalculateHsMetrics command calculates a number of metrics assessing the quality of targeted enrichment reads. If necessary, follow the steps described in Combined SNP/methylation calling using BisSNP before using the step described below. Picard CalculateHsMetrics ref.fa {indexed} SAMPLE.clipped.bam {indexed} DESIGN_target_intervals.txt DESIGN_bait_intervals.txt SAMPLE_picard_hs_metrics.txt java -Xmx4g -Xms4g -jar /path/to/picard/calculatehsmetrics.jar BAIT_INTERVALS=DESIGN_bait_intervals.txt TARGET_INTERVALS=DESIGN_target_intervals.txt INPUT=SAMPLE.clipped.bam OUTPUT=SAMPLE_picard_hs_metrics.txt METRIC_ACCUMULATION_LEVEL=ALL_READS REFERENCE_SEQUENCE=/path/to/ref.fa VALIDATION_STRINGENCY=LENIENT TMP_DIR=. Supplying only one of the Picard interval files to Picard CalculateHsMetrics can change the outcome, depending on how different the primary and capture target (bait) regions are from each other. Some Roche NimbleGen catalog designs have only one BED file supplied. In that case, generate a single Picard Interval List from the supplied BED file and use as both bait and target intervals. See for a description of output metrics. Estimate Insert Size Distribution The DNA that goes into sequence capture is generated by random fragmentation, and later size selected. It is normal to observe a range of fragment sizes, but if skewed too large or too small the on-target rate and/or percent of bases covered with at least one read can be adversely affected. It is best to generate this report on the SAMPLE.filtered.bam file created above Picard CollectInsertSizeMetrics SAMPLE.filtered.bam SAMPLE_picard_insert_size_metrics.txt SAMPLE_picard_insert_size_plot.pdf java -Xmx4g -jar /path/to/picard/collectinsertsizemetrics.jar VALIDATION_STRINGENCY=LENIENT HISTOGRAM_FILE=SAMPLE_picard_insert_size_plot.pdf INPUT=SAMPLE.filtered.bam OUTPUT=SAMPLE_picard_insert_size_metrics.txt See for a description of output metrics included in SAMPLE_picard_insert_size_metrics.txt, which can also be used to plot the insert size distributions across samples. A PDF plot is also created and placed in SAMPLE_picard_insert_size_plot.pdf. Evaluate NimbleGen SeqCap Epi Target Enrichment Data 13

14 Add Padding to Targets Target-adjacent coverage is typical for hybridization-based target enrichment due to the capture of partially on-target DNA library fragments. To optionally assess the amount of off-target reads which are target adjacent, padding the targets before assessing the on-target rate may be performed. The targets should be padded, sorted, and then merged to product the final file. These three operations can be performed simultaneously by piping the contents of one application to the BEDtools slop BEDtools sort BEDtools merge DESIGN_capture_target.bed chromosome_sizes.txt DESIGN_padded_capture_target.bed Pad, Sort and Merge Overlapping and Book-Ended Regions /path/to/bedtools slop -i DESIGN_capture_target.bed -b 100 -g chromosome_sizes.txt /path/to/bedtools sort -i - /path/to/bedtools merge -i - > DESIGN_padded_capture_target.bed In the first step, the value supplied to the -b option indicates the number of target-adjacent bases to add on both sides of the target. In this example, 100 bases would be added to both sides of all targets. The chromosome_sizes.txt file should be in the format: ChrName<tab>ChrSize, e.g. chr and have an entry for each chromosome present in the input BED file. The -i - parameters directs the application to take the STDOUT from the previous command as input. Determine Sum Total Size of Regions in a BED File The chromosome_sizes.txt file should be as described in Add Padding to Targets. BEDtools genomecov grep cut chromosome_sizes.txt DESIGN.bed {size} /path/to/bedtools genomecov -i DESIGN.bed -g chromosome_sizes.txt -max 1 grep -P "genome\t1" cut -f 3 Alternatively, a single gawk command can be used: cat DESIGN.bed gawk -F'\t' 'BEGIN{SUM=0}{SUM+=$3-$2}END{print SUM}' Evaluate NimbleGen SeqCap Epi Target Enrichment Data 14

15 Count On-Target Reads The number of on-target reads (i.e. the number of mapped reads overlapping a set of target regions by at least one base) can be determined using BEDtools intersect on any of the BAM files created above. BEDtools intersect wc SAMPLE.bam DESIGN_primary_target.bed or DESIGN_capture_target.bed {count} /path/to/bedtools intersect -bed -abam SAMPLE.bam -b DESIGN_primary_target.bed or /path/to/bedtools intersect -bed -abam SAMPLE.bam -b DESIGN_capture_target.bed The command will output the number of on-target reads, counting as on-target any read that overlaps the target by at least one base. The output may be redirected to a file to save for later calculations. Divide the number of on-target reads by the total number of mapped, non-duplicate reads to get the percentage of on-target reads ( on-target rate ). See Basic Mapping Metrics (Picard) for reporting of the total number of mapped, non-duplicate reads. Calculate Depth of Coverage The depth of coverage over the primary or capture target regions can be determined using the DepthofCoverage command from GATK. GATK DepthOfCoverage ref.fa {indexed} SAMPLE.clipped.bam {indexed} DESIGN_primary_target.bed or DESIGN_capture_target.bed SAMPLE_gatk_primary_target_coverage.sample_summary or SAMPLE_gatk_capture_target_coverage.sample_summary {there are other output files not listed here} java -Xmx4g -Xms4g -jar /path/to/gatkframework/genomeanalysistk.jar -T DepthOfCoverage R /path/to/ref.fa -I SAMPLE.clipped.bam -o SAMPLE_gatk_primary_target _coverage -L DESIGN_primary_target.bed -ct 1 -ct 10 -ct 20 or java -Xmx4g -Xms4g -jar /path/to/gatkframework/genomeanalysistk.jar -T DepthOfCoverage R /path/to/ref.fa -I SAMPLE.clipped.bam -o SAMPLE_gatk_capture_target_coverage -L DESIGN_capture_target.bed -ct 1 -ct 10 -ct 20 The sample summary output file reports the mean and median depth of coverage over the given regions of interest, as well as the percent of bases in the given regions covered at or above a given depth. Multiple depth thresholds can be evaluated by specifying multiple values using the -ct switch. For more information, consult the documentation at: Evaluate NimbleGen SeqCap Epi Target Enrichment Data 15

16 Determine methylation percentage using BSMAP The percent methylation at each C base in the sample can be determined using the methratio.py script supplied by BSMAP and the BAM file produced in the Clip Overlapping Reads step. BSMAP methratio.py ref.fa SAMPLE.clipped.bam SAMPLE.methylation_results.txt Determine methylation percentage python /path/to/methratio.py -d hg19.fa -s /path/to/samtools -m 1 -z -i skip -o SAMPLE.methylation_results.txt SAMPLE.clipped.bam The results can be filtered by depth of coverage by using the -m switch. Raising this to a higher value (e.g. -m 5) will require at least that many reads to cover the position before a result is reported. The -z switch is required to report positions with zero methylation. The -i switch directs the script to ignore positions were there is a possible C->T SNP. Use the BisSNP analysis to deal with those positions. The result files from this step can be rather large, several GB in size, since percent methylation is reported for all C positions. An additional switch -c can be used to process one chromosome at a time (e.g. -c chr1). In combination with the -m switch, this can reduce result files to more reasonable sizes. Determine bisulfite conversion efficiency using BSMAP Reliable estimates of percent methylation require high efficiency of the bilsulfite conversion process, otherwise the level of methylation will be over-estimated. The best way to determine bisulfite conversion efficiency is to measure the number of unconverted C s on DNA molecules which have no methylation. For humans and animals, the mitochondrial genome is an appropriate target. For plants, both the mitochondrial and chloroplast genome may be used. Unfortunately, the genomes of these organelles is not always known or well-characterized, so Roche NimbleGen supplies phage lambda DNA (GenBank Accession NC_001416) which may be used as a spike-in control. Each SeqCap Epi capture pool contains probes to capture the lambda genomic region from base 4500 to The lambda genome sequence must be appended to your reference genome sequence files so that the captured reads may be mapped to that sequence. The residual C s in the reads mapped to the captured region of the lambda genome can be used to estimate the overall conversion efficiency. BSMAP methratio.py ref.fa SAMPLE.clipped.bam SAMPLE.NC_ methylation_results.txt Determine methylation percentage python /path/to/methratio.py -d hg19.fa -s /path/to/samtools -m 1 -z -i skip -o -c NC_ SAMPLE.NC_ methylation_results.txt SAMPLE.clipped.bam The -c NC_ switch restricts the methylation analysis to just the lambda genome. If the lambda spike-in was not used, you may substitute the name of the mitochondrial genome (e.g. chrm) or the chloroplast genome (e.g. chrc). Once the analysis is complete, open the resulting methylation results file in a spreadsheet application, and if necessary, delete the rows that fall outside the capture region. Then calculate the conversion rate as follows: conversion rate = 1 - (sum(c_count)/sum(ct_count)) Bisulfite-conversion rates should generally be above 99.5% to be considered successful. Evaluate NimbleGen SeqCap Epi Target Enrichment Data 16

17 Combined SNP/methylation calling using BisSNP The percent methylation and possible SNPs at each C base in the sample can be determined using the BisSNP package and the BAM file produced in the Clip Overlapping Reads step. BisSNP uses a multistep process to make methylation calls and call SNPs. The first step normally calls for Indel Realignment (see BisSNP user guide). However, the BSMAP software does not support gapped alignments, so this step may be skipped for reads mapped with BSMAP software. The next step is Base Quality Recalibration, followed by Combined SNP/methylation calling. BisSNP BisulfiteCountCovariates BisSNP BisulfiteTableRecalibration BisSNP BisulfiteGenotyper BisSNP sortbyrefandcor.pl BisSNP VCFpostprocess BisSNP vcf2bed6plus2.strand.pl ref.fa {indexed} genome.snps.vcf DESIGNS_capture_target.bed SAMPLE.clipped.bam SAMPLE.cpg.raw.vcf SAMPLE.cpg.raw.sorted.vcf SAMPLE.cpg.filtered.vcf SAMPLE.cpg.filter.summary.txt SAMPLE. cpg.filtered.strand.6plus2.bed SAMPLE.snp.raw.vcf SAMPLE.snp.raw.sorted.vcf SAMPLE.snp.filtered.vcf SAMPLE.snp.filter.summary.txt SAMPLE. cpg.filtered.strand.6plus2.bed Base Quality Recalibration java -Xmx10g -jar /path/to/bissnp jar -R ref.fa -I SAMPLE.clipped.bam -T BisulfiteCountCovariates -cov ReadGroupCovariate -cov QualityScoreCovariate -cov CycleCovariate -recalfile SAMPLE.recalFile_before.csv -nt NumProcessors -knownsites genome.snps.vcf java -Xmx10g -jar /path/to/bissnp jar -R ref.fa -I SAMPLE.clipped.bam -o SAMPLE.recal.bam -T BisulfiteTableRecalibration -recalfile SAMPLE.recalFile_before.csv -maxq 40 Combined SNP/methylation calling java -Xmx10g -jar /path/to/bissnp jar -R ref.fa -I SAMPLE.recal.bam -T BisulfiteGenotyper -D genome.snps.vcf -vfn1 SAMPLE.cpg.raw.vcf -vfn2 SAMPLE.snp.raw.vcf -L DESIGNS_capture_target.bed -stand_call_conf 20 - stand_emit_conf 0 -mmq 30 -mbq 0 -nt NumProcessors Sort VCF files perl /path/to/sortbyrefandcor.pl --k 1 --c 2 SAMPLE.snp.raw.vcf ref.fa.fai > SAMPLE.snp.raw.sorted.vcf perl /path/to/sortbyrefandcor.pl --k 1 --c 2 SAMPLE.cpg.raw.vcf ref.fa.fai > SAMPLE.cpg.raw.sorted.vcf Evaluate NimbleGen SeqCap Epi Target Enrichment Data 17

18 Filter SNP/methylation calls java -Xmx10g -jar /path/to/bissnp jar -R ref.fa -T VCFpostprocess -oldvcf SAMPLE.snp.raw.sorted.vcf -newvcf SAMPLE.snp.filtered.vcf -snpvcf SAMPLE.snp.raw.sorted.vcf -o SAMPLE.snp.filter.summary.txt java -Xmx10g -jar /path/to/bissnp jar -R ref.fa -T VCFpostprocess -oldvcf SAMPLE.cpg.raw.sorted.vcf -newvcf SAMPLE.cpg.filtered.vcf -snpvcf SAMPLE.snp.raw.sorted.vcf -o SAMPLE.cpg.filter.summary.txt Convert VCF to BED file perl /path/to/vcf2bed6plus2.strand.pl SAMPLE.snp.filtered.vcf perl /path/to/vcf2bed6plus2.strand.pl SAMPLE.cpg.filtered.vcf BisSNP utilizes a set of known SNPs for the genome as part of the SNP calling process. BisSNP can work without this set of SNPs, but it will improve SNP detection accuracy, especially in areas of low coverage (e.g. less than 10X). The order of the SNP VCF file should be the same as the reference genome file and the input BAM file. The BisSNP website has SNP files in VCF format for human builds hg18 and hg19 and one mouse build (mm9) at: For the filtering step, by default BisSNP will filter out SNPs with: quality score less than 20 coverage more than 120X strand bias more than quality score by depth less than 1.0 mapping_quality_zero filter for heterozygous SNPs more than SNPs within the same 20bp window. The BED files that are created have the percent methylation in the 7 th column, and the C/T coverage in the 8 th column. If viewed in the SignalMap software, the percent methylation will be viewed as a score from 0 to 1000, where 0 equals zero percent methylation and 1000 equal one hundred percent methylation. The strand column is ignored, so both methylation percentages will be displayed as positive values. The C/T coverage will not be displayed. Evaluate NimbleGen SeqCap Epi Target Enrichment Data 18

19 Determination of differentially methylated regions There are a number of different packages for discovering differentially methylated regions when comparing two samples, or two sets of samples. One of these is methylkit, an R package for DNA methylation analysis. Installation of R, methylkit, or the various dependencies is beyond the scope of this document. But instructions for installing R may be found at and instructions for installing methylkit and its dependencies may be found at R/methylKit read R/methylKit filterbycoverage R/methylKit unite R/methylKit calculatediffmeth R/methylKit get.methyldiff R/methylKit tilemethylcounts BSMAP methylation results files varies # R code for determining differentially methylated regions # Load files library(methylkit) file.list = list("sample.methylation_results.txt","control.methylation_results.txt") sample.list = list("test","control") methdata <-read( file.list, sample.id=sample.list, treatment=c(0,1), assembly="hg19", header=true, context="cpg", resolution="base", pipeline=list( fraction=true, chr.col=1, start.col=2, end.col=2, coverage.col=6, strand.col=3, freqc.col=5 ) ) #Generate basic stats getmethylationstats(methdata,plot=t,both.strands=t) getcoveragestats(methdata,plot=t,both.strands=t) #Filter by coverage count (>10X) or percentage (<99.9 percentile) methdata.filtered = filterbycoverage(methdata, lo.count = 10, lo.perc=null, hi.count=null,hi.per = 99.9) #Merge samples to locations covered by all samples meth = unite(methdata.filtered,destrand=false) #Calculate per-base differential methylation p-values and q-values methdiff = calculatediffmeth(meth) #Find differentially methylated bases with 25% difference and qvalue<0.01 Diff25p=get.methylDiff(methDiff,difference=25,qvalue=0.01) Evaluate NimbleGen SeqCap Epi Target Enrichment Data 19

20 #Find differentially hypo methylated bases with 25% difference and qvalue<0.01 Diff25pHypo =get.methyldiff(methdiff,difference=25,qvalue=0.01,type="hypo") #Find differentially hyper methylated bases with 25% difference and qvalue<0.01 Diff25pHyper=get.methylDiff(methDiff,difference=25,qvalue=0.01,type="hyper") #Find differentially methylated regions with 25% difference and qvalue<0.01 tiles = tilemethylcounts(methdata.filtered,win.size=500,step.size=100) tilemeth = unite(tiles,destrand=false) tilediff = calculatediffmeth(tilemeth,num.cores=8) tile25pct <- get.methyldiff(tilediff,difference=25,qvalue=0.01) #Write data to file write.table(getdata(diff25p),file="diff25pct.txt",row.names=f,col.names=t,sep="\t",q uote=f) write.table(getdata(tile25pct),file="tile25pct.txt",row.names=f,col.names=t,sep="\t",quote=f) Replicate data sets will be automatically combined by the treatment values specified at import. Q-values and percent differences can be adjusted to provide higher or lower stringency. The methylkit site ( has examples and tutorials on additional analysis that can be performed, including correlations, clustering, PCA, and adding annotation. 3. REFERENCES Roche NimbleGen is not responsible for the maintenance or content of the following third-party websites. BamUtil: BEDtools: BisSNP: BSMAP: FastQC: GATK: IGV: methylkit: Picard: R: SAMtools (including BCFtools and VCFutils): seqtk: Trimmomatic: Evaluate NimbleGen SeqCap Epi Target Enrichment Data 20

21 4. GLOSSARY BAI file BAM index file. For tools that require an indexed BAM file, the BAI file must be present in the same location as the BAM file. Bait See Capture target. BAM file Compressed form of the SAM file format. BCF file Compressed form of the VCF file format. BED file File format for describing genomic regions/intervals. BED file start coordinates are 0-based. Capture target as defined by Roche NimbleGen, these are the regions covered directly by one or more probes (the probe footprint). NimbleGen BED files refer to these regions as the tiled regions. These are equivalent to the bait intervals referred to by Picard. FASTA file A standard file format for describing nucleic acid sequences. FASTQ file A standard file format for describing sequencing reads that also includes base quality information. Genomic index A form of the reference genome sequence which enables faster comparisons during alignment. Picard interval files File format for describing genomic regions/intervals which also contains a header describing the reference genome. Picard interval file start coordinates are 1-based. Primary target as defined by Roche NimbleGen, these are the regions against which probes were designed. NimbleGen BED files refer to these regions as simply Target Regions. Typically these are identical to the original regions of interest with regions less than 100bp expanded to 100bp total to facilitate probe selection. These are equivalent to the target intervals referred to by Picard. SAM file Sequence Alignment / Map file; a community standard format for specifying sequencing read alignment to a reference genome. Target regions see Primary target. Tiled regions see Capture target. VCF file Variant call format; a community standard format for specifying variant calls for one or more samples or populations against a reference genome. Evaluate NimbleGen SeqCap Epi Target Enrichment Data 21

22 Published by: Roche Diagnostics GmbH Sandhofer Straße Mannheim Germany 2014 Roche Diagnostics All rights reserved. Notice to Purchaser For patent license limitations for individual products please refer to: For life science research only. Not for use in diagnostic procedures. Trademarks NIMBLEGEN and SEQCAP are trademarks of Roche. All other product names and trademarks are the property of their respective owners. Technical Note: How To (1) 04/14 Evaluate NimbleGen SeqCap Epi Target Enrichment Data 22

Evaluate NimbleGen SeqCap RNA Target Enrichment Data

Evaluate NimbleGen SeqCap RNA Target Enrichment Data Roche Sequencing Technical Note November 2014 How To Evaluate NimbleGen SeqCap RNA Target Enrichment Data 1. OVERVIEW Analysis of NimbleGen SeqCap RNA target enrichment data generated using an Illumina

More information

EpiGnome Methyl Seq Bioinformatics User Guide Rev. 0.1

EpiGnome Methyl Seq Bioinformatics User Guide Rev. 0.1 EpiGnome Methyl Seq Bioinformatics User Guide Rev. 0.1 Introduction This guide contains data analysis recommendations for libraries prepared using Epicentre s EpiGnome Methyl Seq Kit, and sequenced on

More information

Next Generation Sequence Alignment on the BRC Cluster. Steve Newhouse 22 July 2010

Next Generation Sequence Alignment on the BRC Cluster. Steve Newhouse 22 July 2010 Next Generation Sequence Alignment on the BRC Cluster Steve Newhouse 22 July 2010 Overview Practical guide to processing next generation sequencing data on the cluster No details on the inner workings

More information

Welcome to MAPHiTS (Mapping Analysis Pipeline for High-Throughput Sequences) tutorial page.

Welcome to MAPHiTS (Mapping Analysis Pipeline for High-Throughput Sequences) tutorial page. Welcome to MAPHiTS (Mapping Analysis Pipeline for High-Throughput Sequences) tutorial page. In this page you will learn to use the tools of the MAPHiTS suite. A little advice before starting : rename your

More information

Sequence Data Quality Assessment Exercises and Solutions.

Sequence Data Quality Assessment Exercises and Solutions. Sequence Data Quality Assessment Exercises and Solutions. Starting Note: Please do not copy and paste the commands. Characters in this document may not be copied correctly. Please type the commands and

More information

SAMtools. SAM BAM. mapping. BAM sort & indexing (ex: IGV) SNP call

SAMtools.   SAM BAM. mapping. BAM sort & indexing (ex: IGV) SNP call SAMtools http://samtools.sourceforge.net/ SAM/BAM mapping BAM SAM BAM BAM sort & indexing (ex: IGV) mapping SNP call SAMtools NGS Program: samtools (Tools for alignments in the SAM format) Version: 0.1.19

More information

Galaxy Platform For NGS Data Analyses

Galaxy Platform For NGS Data Analyses Galaxy Platform For NGS Data Analyses Weihong Yan wyan@chem.ucla.edu Collaboratory Web Site http://qcb.ucla.edu/collaboratory Collaboratory Workshops Workshop Outline ü Day 1 UCLA galaxy and user account

More information

Preparation of alignments for variant calling with GATK: exercise instructions for BioHPC Lab computers

Preparation of alignments for variant calling with GATK: exercise instructions for BioHPC Lab computers Preparation of alignments for variant calling with GATK: exercise instructions for BioHPC Lab computers Data used in the exercise We will use D. melanogaster WGS paired-end Illumina data with NCBI accessions

More information

Understanding and Pre-processing Raw Illumina Data

Understanding and Pre-processing Raw Illumina Data Understanding and Pre-processing Raw Illumina Data Matt Johnson October 4, 2013 1 Understanding FASTQ files After an Illumina sequencing run, the data is stored in very large text files in a standard format

More information

SNP Calling. Tuesday 4/21/15

SNP Calling. Tuesday 4/21/15 SNP Calling Tuesday 4/21/15 Why Call SNPs? map mutations, ex: EMS, natural variation, introgressions associate with changes in expression develop markers for whole genome QTL analysis/ GWAS access diversity

More information

Copyright 2014 Regents of the University of Minnesota

Copyright 2014 Regents of the University of Minnesota Quality Control of Illumina Data using Galaxy Contents September 16, 2014 1 Introduction 2 1.1 What is Galaxy?..................................... 2 1.2 Galaxy at MSI......................................

More information

Calling variants in diploid or multiploid genomes

Calling variants in diploid or multiploid genomes Calling variants in diploid or multiploid genomes Diploid genomes The initial steps in calling variants for diploid or multi-ploid organisms with NGS data are the same as what we've already seen: 1. 2.

More information

NA12878 Platinum Genome GENALICE MAP Analysis Report

NA12878 Platinum Genome GENALICE MAP Analysis Report NA12878 Platinum Genome GENALICE MAP Analysis Report Bas Tolhuis, PhD Jan-Jaap Wesselink, PhD GENALICE B.V. INDEX EXECUTIVE SUMMARY...4 1. MATERIALS & METHODS...5 1.1 SEQUENCE DATA...5 1.2 WORKFLOWS......5

More information

Variant calling using SAMtools

Variant calling using SAMtools Variant calling using SAMtools Calling variants - a trivial use of an Interactive Session We are going to conduct the variant calling exercises in an interactive idev session just so you can get a feel

More information

Helpful Galaxy screencasts are available at:

Helpful Galaxy screencasts are available at: This user guide serves as a simplified, graphic version of the CloudMap paper for applicationoriented end-users. For more details, please see the CloudMap paper. Video versions of these user guides and

More information

m6aviewer Version Documentation

m6aviewer Version Documentation m6aviewer Version 1.6.0 Documentation Contents 1. About 2. Requirements 3. Launching m6aviewer 4. Running Time Estimates 5. Basic Peak Calling 6. Running Modes 7. Multiple Samples/Sample Replicates 8.

More information

WM2 Bioinformatics. ExomeSeq data analysis part 1. Dietmar Rieder

WM2 Bioinformatics. ExomeSeq data analysis part 1. Dietmar Rieder WM2 Bioinformatics ExomeSeq data analysis part 1 Dietmar Rieder RAW data Use putty to logon to cluster.i med.ac.at In your home directory make directory to store raw data $ mkdir 00_RAW Copy raw fastq

More information

REPORT. NA12878 Platinum Genome. GENALICE MAP Analysis Report. Bas Tolhuis, PhD GENALICE B.V.

REPORT. NA12878 Platinum Genome. GENALICE MAP Analysis Report. Bas Tolhuis, PhD GENALICE B.V. REPORT NA12878 Platinum Genome GENALICE MAP Analysis Report Bas Tolhuis, PhD GENALICE B.V. INDEX EXECUTIVE SUMMARY...4 1. MATERIALS & METHODS...5 1.1 SEQUENCE DATA...5 1.2 WORKFLOWS......5 1.3 ACCURACY

More information

Reads Alignment and Variant Calling

Reads Alignment and Variant Calling Reads Alignment and Variant Calling CB2-201 Computational Biology and Bioinformatics February 22, 2016 Emidio Capriotti http://biofold.org/ Institute for Mathematical Modeling of Biological Systems Department

More information

Variation among genomes

Variation among genomes Variation among genomes Comparing genomes The reference genome http://www.ncbi.nlm.nih.gov/nuccore/26556996 Arabidopsis thaliana, a model plant Col-0 variety is from Landsberg, Germany Ler is a mutant

More information

Decrypting your genome data privately in the cloud

Decrypting your genome data privately in the cloud Decrypting your genome data privately in the cloud Marc Sitges Data Manager@Made of Genes @madeofgenes The Human Genome 3.200 M (x2) Base pairs (bp) ~20.000 genes (~30%) (Exons ~1%) The Human Genome Project

More information

Sequence Mapping and Assembly

Sequence Mapping and Assembly Practical Introduction Sequence Mapping and Assembly December 8, 2014 Mary Kate Wing University of Michigan Center for Statistical Genetics Goals of This Session Learn basics of sequence data file formats

More information

ChIP-Seq Tutorial on Galaxy

ChIP-Seq Tutorial on Galaxy 1 Introduction ChIP-Seq Tutorial on Galaxy 2 December 2010 (modified April 6, 2017) Rory Stark The aim of this practical is to give you some experience handling ChIP-Seq data. We will be working with data

More information

CORE Year 1 Whole Genome Sequencing Final Data Format Requirements

CORE Year 1 Whole Genome Sequencing Final Data Format Requirements CORE Year 1 Whole Genome Sequencing Final Data Format Requirements To all incumbent contractors of CORE year 1 WGS contracts, the following acts as the agreed to sample parameters issued by NHLBI for data

More information

QIAseq DNA V3 Panel Analysis Plugin USER MANUAL

QIAseq DNA V3 Panel Analysis Plugin USER MANUAL QIAseq DNA V3 Panel Analysis Plugin USER MANUAL User manual for QIAseq DNA V3 Panel Analysis 1.0.1 Windows, Mac OS X and Linux January 25, 2018 This software is for research purposes only. QIAGEN Aarhus

More information

The software comes with 2 installers: (1) SureCall installer (2) GenAligners (contains BWA, BWA- MEM).

The software comes with 2 installers: (1) SureCall installer (2) GenAligners (contains BWA, BWA- MEM). Release Notes Agilent SureCall 4.0 Product Number G4980AA SureCall Client 6-month named license supports installation of one client and server (to host the SureCall database) on one machine. For additional

More information

HIPPIE User Manual. (v0.0.2-beta, 2015/4/26, Yih-Chii Hwang, yihhwang [at] mail.med.upenn.edu)

HIPPIE User Manual. (v0.0.2-beta, 2015/4/26, Yih-Chii Hwang, yihhwang [at] mail.med.upenn.edu) HIPPIE User Manual (v0.0.2-beta, 2015/4/26, Yih-Chii Hwang, yihhwang [at] mail.med.upenn.edu) OVERVIEW OF HIPPIE o Flowchart of HIPPIE o Requirements PREPARE DIRECTORY STRUCTURE FOR HIPPIE EXECUTION o

More information

USING BRAT-BW Table 1. Feature comparison of BRAT-bw, BRAT-large, Bismark and BS Seeker (as of on March, 2012)

USING BRAT-BW Table 1. Feature comparison of BRAT-bw, BRAT-large, Bismark and BS Seeker (as of on March, 2012) USING BRAT-BW-2.0.1 BRAT-bw is a tool for BS-seq reads mapping, i.e. mapping of bisulfite-treated sequenced reads. BRAT-bw is a part of BRAT s suit. Therefore, input and output formats for BRAT-bw are

More information

ITMO Ecole de Bioinformatique Hands-on session: smallrna-seq N. Servant 21 rd November 2013

ITMO Ecole de Bioinformatique Hands-on session: smallrna-seq N. Servant 21 rd November 2013 ITMO Ecole de Bioinformatique Hands-on session: smallrna-seq N. Servant 21 rd November 2013 1. Data and objectives We will use the data from GEO (GSE35368, Toedling, Servant et al. 2011). Two samples were

More information

Design and Annotation Files

Design and Annotation Files Design and Annotation Files Release Notes SeqCap EZ Exome Target Enrichment System The design and annotation files provide information about genomic regions covered by the capture probes and the genes

More information

Supplementary Information. Detecting and annotating genetic variations using the HugeSeq pipeline

Supplementary Information. Detecting and annotating genetic variations using the HugeSeq pipeline Supplementary Information Detecting and annotating genetic variations using the HugeSeq pipeline Hugo Y. K. Lam 1,#, Cuiping Pan 1, Michael J. Clark 1, Phil Lacroute 1, Rui Chen 1, Rajini Haraksingh 1,

More information

Guide to Reviewing and Approving Custom Designs

Guide to Reviewing and Approving Custom Designs Guide to Reviewing and Approving Custom Designs SeqCap EZ Designs, v4.1 Overview This document describes how to review and approve the proposed custom SeqCap EZ and SeqCap EZ Prime designs based on the

More information

INTRODUCTION AUX FORMATS DE FICHIERS

INTRODUCTION AUX FORMATS DE FICHIERS INTRODUCTION AUX FORMATS DE FICHIERS Plan. Formats de séquences brutes.. Format fasta.2. Format fastq 2. Formats d alignements 2.. Format SAM 2.2. Format BAM 4. Format «Variant Calling» 4.. Format Varscan

More information

PRACTICAL SESSION 5 GOTCLOUD ALIGNMENT WITH BWA JAN 7 TH, 2014 STOM 2014 WORKSHOP HYUN MIN KANG UNIVERSITY OF MICHIGAN, ANN ARBOR

PRACTICAL SESSION 5 GOTCLOUD ALIGNMENT WITH BWA JAN 7 TH, 2014 STOM 2014 WORKSHOP HYUN MIN KANG UNIVERSITY OF MICHIGAN, ANN ARBOR PRACTICAL SESSION 5 GOTCLOUD ALIGNMENT WITH BWA JAN 7 TH, 2014 STOM 2014 WORKSHOP HYUN MIN KANG UNIVERSITY OF MICHIGAN, ANN ARBOR GOAL OF THIS SESSION Assuming that The audiences know how to perform GWAS

More information

Analyzing ChIP- Seq Data in Galaxy

Analyzing ChIP- Seq Data in Galaxy Analyzing ChIP- Seq Data in Galaxy Lauren Mills RISS ABSTRACT Step- by- step guide to basic ChIP- Seq analysis using the Galaxy platform. Table of Contents Introduction... 3 Links to helpful information...

More information

RPGC Manual. You will also need python 2.7 or above to run our home-brew python scripts.

RPGC Manual. You will also need python 2.7 or above to run our home-brew python scripts. Introduction Here we present a new approach for producing de novo whole genome sequences--recombinant population genome construction (RPGC)--that solves many of the problems encountered in standard genome

More information

Pre-processing and quality control of sequence data. Barbera van Schaik KEBB - Bioinformatics Laboratory

Pre-processing and quality control of sequence data. Barbera van Schaik KEBB - Bioinformatics Laboratory Pre-processing and quality control of sequence data Barbera van Schaik KEBB - Bioinformatics Laboratory b.d.vanschaik@amc.uva.nl Topic: quality control and prepare data for the interesting stuf Keep Throw

More information

Practical Linux Examples

Practical Linux Examples Practical Linux Examples Processing large text file Parallelization of independent tasks Qi Sun & Robert Bukowski Bioinformatics Facility Cornell University http://cbsu.tc.cornell.edu/lab/doc/linux_examples_slides.pdf

More information

RNA-Seq in Galaxy: Tuxedo protocol. Igor Makunin, UQ RCC, QCIF

RNA-Seq in Galaxy: Tuxedo protocol. Igor Makunin, UQ RCC, QCIF RNA-Seq in Galaxy: Tuxedo protocol Igor Makunin, UQ RCC, QCIF Acknowledgments Genomics Virtual Lab: gvl.org.au Galaxy for tutorials: galaxy-tut.genome.edu.au Galaxy Australia: galaxy-aust.genome.edu.au

More information

High-throughput sequencing: Alignment and related topic. Simon Anders EMBL Heidelberg

High-throughput sequencing: Alignment and related topic. Simon Anders EMBL Heidelberg High-throughput sequencing: Alignment and related topic Simon Anders EMBL Heidelberg Established platforms HTS Platforms Illumina HiSeq, ABI SOLiD, Roche 454 Newcomers: Benchtop machines 454 GS Junior,

More information

Protocol: peak-calling for ChIP-seq data / segmentation analysis for histone modification data

Protocol: peak-calling for ChIP-seq data / segmentation analysis for histone modification data Protocol: peak-calling for ChIP-seq data / segmentation analysis for histone modification data Table of Contents Protocol: peak-calling for ChIP-seq data / segmentation analysis for histone modification

More information

ASAP - Allele-specific alignment pipeline

ASAP - Allele-specific alignment pipeline ASAP - Allele-specific alignment pipeline Jan 09, 2012 (1) ASAP - Quick Reference ASAP needs a working version of Perl and is run from the command line. Furthermore, Bowtie needs to be installed on your

More information

3. Installation Download Cpipe and Run Install Script Create an Analysis Profile Create a Batch... 7

3. Installation Download Cpipe and Run Install Script Create an Analysis Profile Create a Batch... 7 Cpipe User Guide 1. Introduction - What is Cpipe?... 3 2. Design Background... 3 2.1. Analysis Pipeline Implementation (Cpipe)... 4 2.2. Use of a Bioinformatics Pipeline Toolkit (Bpipe)... 4 2.3. Individual

More information

Mapping NGS reads for genomics studies

Mapping NGS reads for genomics studies Mapping NGS reads for genomics studies Valencia, 28-30 Sep 2015 BIER Alejandro Alemán aaleman@cipf.es Genomics Data Analysis CIBERER Where are we? Fastq Sequence preprocessing Fastq Alignment BAM Visualization

More information

Workshop 6: DNA Methylation Analysis using Bisulfite Sequencing. Fides D Lay UCLA QCB Fellow

Workshop 6: DNA Methylation Analysis using Bisulfite Sequencing. Fides D Lay UCLA QCB Fellow Workshop 6: DNA Methylation Analysis using Bisulfite Sequencing Fides D Lay UCLA QCB Fellow lay.fides@gmail.com Workshop 6 Outline Day 1: Introduction to DNA methylation & WGBS Quick review of linux, Hoffman2

More information

Handling sam and vcf data, quality control

Handling sam and vcf data, quality control Handling sam and vcf data, quality control We continue with the earlier analyses and get some new data: cd ~/session_3 wget http://wasabiapp.org/vbox/data/session_4/file3.tgz tar xzf file3.tgz wget http://wasabiapp.org/vbox/data/session_4/file4.tgz

More information

Demultiplexing Illumina sequencing data containing unique molecular indexes (UMIs)

Demultiplexing Illumina sequencing data containing unique molecular indexes (UMIs) next generation sequencing analysis guidelines Demultiplexing Illumina sequencing data containing unique molecular indexes (UMIs) See what more we can do for you at www.idtdna.com. For Research Use Only

More information

Exome sequencing. Jong Kyoung Kim

Exome sequencing. Jong Kyoung Kim Exome sequencing Jong Kyoung Kim Genome Analysis Toolkit The GATK is the industry standard for identifying SNPs and indels in germline DNA and RNAseq data. Its scope is now expanding to include somatic

More information

Sep. Guide. Edico Genome Corp North Torrey Pines Court, Plaza Level, La Jolla, CA 92037

Sep. Guide.  Edico Genome Corp North Torrey Pines Court, Plaza Level, La Jolla, CA 92037 Sep 2017 DRAGEN TM Quick Start Guide www.edicogenome.com info@edicogenome.com Edico Genome Corp. 3344 North Torrey Pines Court, Plaza Level, La Jolla, CA 92037 Notice Contents of this document and associated

More information

NGS Analysis Using Galaxy

NGS Analysis Using Galaxy NGS Analysis Using Galaxy Sequences and Alignment Format Galaxy overview and Interface Get;ng Data in Galaxy Analyzing Data in Galaxy Quality Control Mapping Data History and workflow Galaxy Exercises

More information

The software comes with 2 installers: (1) SureCall installer (2) GenAligners (contains BWA, BWA-MEM).

The software comes with 2 installers: (1) SureCall installer (2) GenAligners (contains BWA, BWA-MEM). Release Notes Agilent SureCall 3.5 Product Number G4980AA SureCall Client 6-month named license supports installation of one client and server (to host the SureCall database) on one machine. For additional

More information

Tumor-Specific NeoAntigen Detector (TSNAD) v2.0 User s Manual

Tumor-Specific NeoAntigen Detector (TSNAD) v2.0 User s Manual Tumor-Specific NeoAntigen Detector (TSNAD) v2.0 User s Manual Zhan Zhou, Xingzheng Lyu and Jingcheng Wu Zhejiang University, CHINA March, 2016 USER'S MANUAL TABLE OF CONTENTS 1 GETTING STARTED... 1 1.1

More information

Sentieon Documentation

Sentieon Documentation Sentieon Documentation Release 201808.03 Sentieon, Inc Dec 21, 2018 Sentieon Manual 1 Introduction 1 1.1 Description.............................................. 1 1.2 Benefits and Value..........................................

More information

NGS Data Analysis. Roberto Preste

NGS Data Analysis. Roberto Preste NGS Data Analysis Roberto Preste 1 Useful info http://bit.ly/2r1y2dr Contacts: roberto.preste@gmail.com Slides: http://bit.ly/ngs-data 2 NGS data analysis Overview 3 NGS Data Analysis: the basic idea http://bit.ly/2r1y2dr

More information

arxiv: v2 [q-bio.gn] 13 May 2014

arxiv: v2 [q-bio.gn] 13 May 2014 BIOINFORMATICS Vol. 00 no. 00 2005 Pages 1 2 Fast and accurate alignment of long bisulfite-seq reads Brent S. Pedersen 1,, Kenneth Eyring 1, Subhajyoti De 1,2, Ivana V. Yang 1 and David A. Schwartz 1 1

More information

Agilent Genomic Workbench Lite Edition 6.5

Agilent Genomic Workbench Lite Edition 6.5 Agilent Genomic Workbench Lite Edition 6.5 SureSelect Quality Analyzer User Guide For Research Use Only. Not for use in diagnostic procedures. Agilent Technologies Notices Agilent Technologies, Inc. 2010

More information

AgroMarker Finder manual (1.1)

AgroMarker Finder manual (1.1) AgroMarker Finder manual (1.1) 1. Introduction 2. Installation 3. How to run? 4. How to use? 5. Java program for calculating of restriction enzyme sites (TaqαI). 1. Introduction AgroMarker Finder (AMF)is

More information

Run Setup and Bioinformatic Analysis. Accel-NGS 2S MID Indexing Kits

Run Setup and Bioinformatic Analysis. Accel-NGS 2S MID Indexing Kits Run Setup and Bioinformatic Analysis Accel-NGS 2S MID Indexing Kits Sequencing MID Libraries For MiSeq, HiSeq, and NextSeq instruments: Modify the config file to create a fastq for index reads Using the

More information

User Guide. SLAMseq Data Analysis Pipeline SLAMdunk on Bluebee Platform

User Guide. SLAMseq Data Analysis Pipeline SLAMdunk on Bluebee Platform SLAMseq Data Analysis Pipeline SLAMdunk on Bluebee Platform User Guide Catalog Numbers: 061, 062 (SLAMseq Kinetics Kits) 015 (QuantSeq 3 mrna-seq Library Prep Kits) 063UG147V0100 FOR RESEARCH USE ONLY.

More information

v0.3.0 May 18, 2016 SNPsplit operates in two stages:

v0.3.0 May 18, 2016 SNPsplit operates in two stages: May 18, 2016 v0.3.0 SNPsplit is an allele-specific alignment sorter which is designed to read alignment files in SAM/ BAM format and determine the allelic origin of reads that cover known SNP positions.

More information

Falcon Accelerated Genomics Data Analysis Solutions. User Guide

Falcon Accelerated Genomics Data Analysis Solutions. User Guide Falcon Accelerated Genomics Data Analysis Solutions User Guide Falcon Computing Solutions, Inc. Version 1.0 3/30/2018 Table of Contents Introduction... 3 System Requirements and Installation... 4 Software

More information

Genomic Files. University of Massachusetts Medical School. October, 2015

Genomic Files. University of Massachusetts Medical School. October, 2015 .. Genomic Files University of Massachusetts Medical School October, 2015 2 / 55. A Typical Deep-Sequencing Workflow Samples Fastq Files Fastq Files Sam / Bam Files Various files Deep Sequencing Further

More information

High-throughout sequencing and using short-read aligners. Simon Anders

High-throughout sequencing and using short-read aligners. Simon Anders High-throughout sequencing and using short-read aligners Simon Anders High-throughput sequencing (HTS) Sequencing millions of short DNA fragments in parallel. a.k.a.: next-generation sequencing (NGS) massively-parallel

More information

RNAseq analysis: SNP calling. BTI bioinformatics course, spring 2013

RNAseq analysis: SNP calling. BTI bioinformatics course, spring 2013 RNAseq analysis: SNP calling BTI bioinformatics course, spring 2013 RNAseq overview RNAseq overview Choose technology 454 Illumina SOLiD 3 rd generation (Ion Torrent, PacBio) Library types Single reads

More information

RNA-seq. Manpreet S. Katari

RNA-seq. Manpreet S. Katari RNA-seq Manpreet S. Katari Evolution of Sequence Technology Normalizing the Data RPKM (Reads per Kilobase of exons per million reads) Score = R NT R = # of unique reads for the gene N = Size of the gene

More information

User's guide to ChIP-Seq applications: command-line usage and option summary

User's guide to ChIP-Seq applications: command-line usage and option summary User's guide to ChIP-Seq applications: command-line usage and option summary 1. Basics about the ChIP-Seq Tools The ChIP-Seq software provides a set of tools performing common genome-wide ChIPseq analysis

More information

Merge Conflicts p. 92 More GitHub Workflows: Forking and Pull Requests p. 97 Using Git to Make Life Easier: Working with Past Commits p.

Merge Conflicts p. 92 More GitHub Workflows: Forking and Pull Requests p. 97 Using Git to Make Life Easier: Working with Past Commits p. Preface p. xiii Ideology: Data Skills for Robust and Reproducible Bioinformatics How to Learn Bioinformatics p. 1 Why Bioinformatics? Biology's Growing Data p. 1 Learning Data Skills to Learn Bioinformatics

More information

Genomic Files. University of Massachusetts Medical School. October, 2014

Genomic Files. University of Massachusetts Medical School. October, 2014 .. Genomic Files University of Massachusetts Medical School October, 2014 2 / 39. A Typical Deep-Sequencing Workflow Samples Fastq Files Fastq Files Sam / Bam Files Various files Deep Sequencing Further

More information

High-throughput sequencing: Alignment and related topic. Simon Anders EMBL Heidelberg

High-throughput sequencing: Alignment and related topic. Simon Anders EMBL Heidelberg High-throughput sequencing: Alignment and related topic Simon Anders EMBL Heidelberg Established platforms HTS Platforms Illumina HiSeq, ABI SOLiD, Roche 454 Newcomers: Benchtop machines: Illumina MiSeq,

More information

Practical Linux examples: Exercises

Practical Linux examples: Exercises Practical Linux examples: Exercises 1. Login (ssh) to the machine that you are assigned for this workshop (assigned machines: https://cbsu.tc.cornell.edu/ww/machines.aspx?i=87 ). Prepare working directory,

More information

Mar. Guide. Edico Genome Inc North Torrey Pines Court, Plaza Level, La Jolla, CA 92037

Mar. Guide.  Edico Genome Inc North Torrey Pines Court, Plaza Level, La Jolla, CA 92037 Mar 2017 DRAGEN TM Quick Start Guide www.edicogenome.com info@edicogenome.com Edico Genome Inc. 3344 North Torrey Pines Court, Plaza Level, La Jolla, CA 92037 Notice Contents of this document and associated

More information

v0.2.0 XX:Z:UA - Unassigned XX:Z:G1 - Genome 1-specific XX:Z:G2 - Genome 2-specific XX:Z:CF - Conflicting

v0.2.0 XX:Z:UA - Unassigned XX:Z:G1 - Genome 1-specific XX:Z:G2 - Genome 2-specific XX:Z:CF - Conflicting October 08, 2015 v0.2.0 SNPsplit is an allele-specific alignment sorter which is designed to read alignment files in SAM/ BAM format and determine the allelic origin of reads that cover known SNP positions.

More information

Lecture 3. Essential skills for bioinformatics: Unix/Linux

Lecture 3. Essential skills for bioinformatics: Unix/Linux Lecture 3 Essential skills for bioinformatics: Unix/Linux RETRIEVING DATA Overview Whether downloading large sequencing datasets or accessing a web application hundreds of times to download specific files,

More information

Analyzing Variant Call results using EuPathDB Galaxy, Part II

Analyzing Variant Call results using EuPathDB Galaxy, Part II Analyzing Variant Call results using EuPathDB Galaxy, Part II In this exercise, we will work in groups to examine the results from the SNP analysis workflow that we started yesterday. The first step is

More information

NGS Data Visualization and Exploration Using IGV

NGS Data Visualization and Exploration Using IGV 1 What is Galaxy Galaxy for Bioinformaticians Galaxy for Experimental Biologists Using Galaxy for NGS Analysis NGS Data Visualization and Exploration Using IGV 2 What is Galaxy Galaxy for Bioinformaticians

More information

ChIP-seq hands-on practical using Galaxy

ChIP-seq hands-on practical using Galaxy ChIP-seq hands-on practical using Galaxy In this exercise we will cover some of the basic NGS analysis steps for ChIP-seq using the Galaxy framework: Quality control Mapping of reads using Bowtie2 Peak-calling

More information

Rsubread package: high-performance read alignment, quantification and mutation discovery

Rsubread package: high-performance read alignment, quantification and mutation discovery Rsubread package: high-performance read alignment, quantification and mutation discovery Wei Shi 14 September 2015 1 Introduction This vignette provides a brief description to the Rsubread package. For

More information

Copyright 2014 Regents of the University of Minnesota

Copyright 2014 Regents of the University of Minnesota Quality Control of Illumina Data using Galaxy August 18, 2014 Contents 1 Introduction 2 1.1 What is Galaxy?..................................... 2 1.2 Galaxy at MSI......................................

More information

SAM / BAM Tutorial. EMBL Heidelberg. Course Materials. Tobias Rausch September 2012

SAM / BAM Tutorial. EMBL Heidelberg. Course Materials. Tobias Rausch September 2012 SAM / BAM Tutorial EMBL Heidelberg Course Materials Tobias Rausch September 2012 Contents 1 SAM / BAM 3 1.1 Introduction................................... 3 1.2 Tasks.......................................

More information

Next-Generation Sequencing applied to adna

Next-Generation Sequencing applied to adna Next-Generation Sequencing applied to adna Hands-on session June 13, 2014 Ludovic Orlando - Lorlando@snm.ku.dk Mikkel Schubert - MSchubert@snm.ku.dk Aurélien Ginolhac - AGinolhac@snm.ku.dk Hákon Jónsson

More information

How to use earray to create custom content for the SureSelect Target Enrichment platform. Page 1

How to use earray to create custom content for the SureSelect Target Enrichment platform. Page 1 How to use earray to create custom content for the SureSelect Target Enrichment platform Page 1 Getting Started Access earray Access earray at: https://earray.chem.agilent.com/earray/ Log in to earray,

More information

CBSU/3CPG/CVG Joint Workshop Series Reference genome based sequence variation detection

CBSU/3CPG/CVG Joint Workshop Series Reference genome based sequence variation detection CBSU/3CPG/CVG Joint Workshop Series Reference genome based sequence variation detection Computational Biology Service Unit (CBSU) Cornell Center for Comparative and Population Genomics (3CPG) Center for

More information

These will serve as a basic guideline for read prep. This assumes you have demultiplexed Illumina data.

These will serve as a basic guideline for read prep. This assumes you have demultiplexed Illumina data. These will serve as a basic guideline for read prep. This assumes you have demultiplexed Illumina data. We have a few different choices for running jobs on DT2 we will explore both here. We need to alter

More information

Rsubread package: high-performance read alignment, quantification and mutation discovery

Rsubread package: high-performance read alignment, quantification and mutation discovery Rsubread package: high-performance read alignment, quantification and mutation discovery Wei Shi 14 September 2015 1 Introduction This vignette provides a brief description to the Rsubread package. For

More information

ChIP-seq Analysis. BaRC Hot Topics - March 21 st 2017 Bioinformatics and Research Computing Whitehead Institute.

ChIP-seq Analysis. BaRC Hot Topics - March 21 st 2017 Bioinformatics and Research Computing Whitehead Institute. ChIP-seq Analysis BaRC Hot Topics - March 21 st 2017 Bioinformatics and Research Computing Whitehead Institute http://barc.wi.mit.edu/hot_topics/ Outline ChIP-seq overview Experimental design Quality control/preprocessing

More information

GenomeStudio Software Release Notes

GenomeStudio Software Release Notes GenomeStudio Software 2009.2 Release Notes 1. GenomeStudio Software 2009.2 Framework... 1 2. Illumina Genome Viewer v1.5...2 3. Genotyping Module v1.5... 4 4. Gene Expression Module v1.5... 6 5. Methylation

More information

Practical exercises Day 2. Variant Calling

Practical exercises Day 2. Variant Calling Practical exercises Day 2 Variant Calling Samtools mpileup Variant calling with samtools mpileup + bcftools Variant calling with HaplotypeCaller (GATK Best Practices) Genotype GVCFs Hard Filtering Variant

More information

Dindel User Guide, version 1.0

Dindel User Guide, version 1.0 Dindel User Guide, version 1.0 Kees Albers University of Cambridge, Wellcome Trust Sanger Institute caa@sanger.ac.uk October 26, 2010 Contents 1 Introduction 2 2 Requirements 2 3 Optional input 3 4 Dindel

More information

Molecular Identifier (MID) Analysis for TAM-ChIP Paired-End Sequencing

Molecular Identifier (MID) Analysis for TAM-ChIP Paired-End Sequencing Molecular Identifier (MID) Analysis for TAM-ChIP Paired-End Sequencing Catalog Nos.: 53126 & 53127 Name: TAM-ChIP antibody conjugate Description Active Motif s TAM-ChIP technology combines antibody directed

More information

ChIP-seq Analysis. BaRC Hot Topics - Feb 23 th 2016 Bioinformatics and Research Computing Whitehead Institute.

ChIP-seq Analysis. BaRC Hot Topics - Feb 23 th 2016 Bioinformatics and Research Computing Whitehead Institute. ChIP-seq Analysis BaRC Hot Topics - Feb 23 th 2016 Bioinformatics and Research Computing Whitehead Institute http://barc.wi.mit.edu/hot_topics/ Outline ChIP-seq overview Experimental design Quality control/preprocessing

More information

Essential Skills for Bioinformatics: Unix/Linux

Essential Skills for Bioinformatics: Unix/Linux Essential Skills for Bioinformatics: Unix/Linux WORKING WITH COMPRESSED DATA Overview Data compression, the process of condensing data so that it takes up less space (on disk drives, in memory, or across

More information

ELPREP PERFORMANCE ACROSS PROGRAMMING LANGUAGES PASCAL COSTANZA CHARLOTTE HERZEEL FOSDEM, BRUSSELS, BELGIUM, FEBRUARY 3, 2018

ELPREP PERFORMANCE ACROSS PROGRAMMING LANGUAGES PASCAL COSTANZA CHARLOTTE HERZEEL FOSDEM, BRUSSELS, BELGIUM, FEBRUARY 3, 2018 ELPREP PERFORMANCE ACROSS PROGRAMMING LANGUAGES PASCAL COSTANZA CHARLOTTE HERZEEL FOSDEM, BRUSSELS, BELGIUM, FEBRUARY 3, 2018 USA SAN FRANCISCO USA ORLANDO BELGIUM - HQ LEUVEN THE NETHERLANDS EINDHOVEN

More information

SAM : Sequence Alignment/Map format. A TAB-delimited text format storing the alignment information. A header section is optional.

SAM : Sequence Alignment/Map format. A TAB-delimited text format storing the alignment information. A header section is optional. Alignment of NGS reads, samtools and visualization Hands-on Software used in this practical BWA MEM : Burrows-Wheeler Aligner. A software package for mapping low-divergent sequences against a large reference

More information

Sequence Analysis Pipeline

Sequence Analysis Pipeline Sequence Analysis Pipeline Transcript fragments 1. PREPROCESSING 2. ASSEMBLY (today) Removal of contaminants, vector, adaptors, etc Put overlapping sequence together and calculate bigger sequences 3. Analysis/Annotation

More information

Genomes On The Cloud GotCloud. University of Michigan Center for Statistical Genetics Mary Kate Wing Goo Jun

Genomes On The Cloud GotCloud. University of Michigan Center for Statistical Genetics Mary Kate Wing Goo Jun Genomes On The Cloud GotCloud University of Michigan Center for Statistical Genetics Mary Kate Wing Goo Jun Friday, March 8, 2013 Why GotCloud? Connects sequence analysis tools together Alignment, quality

More information

Analysis of ChIP-seq data

Analysis of ChIP-seq data Before we start: 1. Log into tak (step 0 on the exercises) 2. Go to your lab space and create a folder for the class (see separate hand out) 3. Connect to your lab space through the wihtdata network and

More information

BIOINFORMATICS APPLICATIONS NOTE

BIOINFORMATICS APPLICATIONS NOTE BIOINFORMATICS APPLICATIONS NOTE Sequence analysis BRAT: Bisulfite-treated Reads Analysis Tool (Supplementary Methods) Elena Y. Harris 1,*, Nadia Ponts 2, Aleksandr Levchuk 3, Karine Le Roch 2 and Stefano

More information

Introduction to UNIX command-line II

Introduction to UNIX command-line II Introduction to UNIX command-line II Boyce Thompson Institute 2017 Prashant Hosmani Class Content Terminal file system navigation Wildcards, shortcuts and special characters File permissions Compression

More information

Tutorial. Find Very Low Frequency Variants With QIAGEN GeneRead Panels. Sample to Insight. November 21, 2017

Tutorial. Find Very Low Frequency Variants With QIAGEN GeneRead Panels. Sample to Insight. November 21, 2017 Find Very Low Frequency Variants With QIAGEN GeneRead Panels November 21, 2017 Sample to Insight QIAGEN Aarhus Silkeborgvej 2 Prismet 8000 Aarhus C Denmark Telephone: +45 70 22 32 44 www.qiagenbioinformatics.com

More information

ls /data/atrnaseq/ egrep "(fastq fasta fq fa)\.gz" ls /data/atrnaseq/ egrep "(cn ts)[1-3]ln[^3a-za-z]\."

ls /data/atrnaseq/ egrep (fastq fasta fq fa)\.gz ls /data/atrnaseq/ egrep (cn ts)[1-3]ln[^3a-za-z]\. Command line tools - bash, awk and sed We can only explore a small fraction of the capabilities of the bash shell and command-line utilities in Linux during this course. An entire course could be taught

More information