NA12878 Platinum Genome GENALICE MAP Analysis Report


Bas Tolhuis, PhD
Jan-Jaap Wesselink, PhD
GENALICE B.V.


INDEX

EXECUTIVE SUMMARY
1. MATERIALS & METHODS
1.1 SEQUENCE DATA
1.2 WORKFLOWS
1.3 ACCURACY AND CONCORDANCE
1.4 HARDWARE CONFIGURATION
2. RESULTS
2.1 PROCESSING SPEED
2.2 STORAGE FOOTPRINTS
2.3 ACCURACY
3. DISCUSSION
3.1 CONCLUSIONS

EXECUTIVE SUMMARY

The aim of this benchmark study was to perform alignment and variant calling with GENALICE MAP software on NA12878 data from the Illumina Platinum Genomes project. The resulting BAM and VCF files, including SNPs and INDELs, are being made available to prospective customers for evaluation. This document describes the data analysis approach. It further discusses processing speed, storage footprint reduction and the accuracy of the variant call sets, which we tested in-house before sharing the data. Moreover, GENALICE MAP's performance is compared directly to three widely used third-party NGS workflows.

GENALICE MAP is an extremely fast NGS data analysis solution that significantly reduces data storage requirements and produces high quality analysis results, including aligned reads and sequence variants.

1. MATERIALS & METHODS

The NA12878 Platinum Genome data set was subjected to alignment and variant calling by GENALICE MAP and three third-party NGS workflows. The resulting variants were compared to dbSNP (v137) and benchmark variants (Genome In A Bottle). Throughout this pilot study, build 37 of the human genome (GRCh37) was used as the reference.

1.1 SEQUENCE DATA

1.1.1 NA12878 Platinum Genome (Illumina)

Illumina's NA12878 Platinum Genome data set was downloaded as two compressed FASTQ files from the European Nucleotide Archive (ERR194147).

1.1.2 Genome In A Bottle High Confidence Variants

Variants discovered by GENALICE MAP and BWA-MEM/GATK workflows were compared to the Genome In A Bottle (GIAB) NA12878 benchmark variant call set (NIST v2.18). These so-called high confidence variants were downloaded from: ftp://ftp-trace.ncbi.nih.gov/giab/ftp/data/na12878.variant_calls/nist.

1.2 WORKFLOWS

[Figure 1: workflow diagram with four columns (A. GENALICE MAP, B. BWA-MEM/GATK HC, C. BWA-MEM/Platypus, D. BWA-MEM/VarScan) and one row per key process: read alignment, read sorting, mark duplicates, INDEL realign, genotyping, filtering, and resulting variants (SNPs and INDELs for every workflow).]

Figure 1. Workflows to Process NGS data. Key processes for NGS data analysis are shown on the left, including read alignment, read sorting, PCR duplicate handling, INDEL realignment, genotyping and filtering. Each colored rectangle represents a single command line processing step.

1.2.1 GENALICE MAP

GENALICE MAP version was used throughout this benchmark study. GENALICE MAP uses a two-stage workflow to process read sequences in FASTQ format into high quality variants (SNPs and INDELs) in VCF format (Figure 1A). Sequence reads were streamed via the network into the aligner node. Notably, in this study GENALICE MAP aligned directly from compressed (.gzip) FASTQ files. The aligner maps sequence reads onto the indexed reference genome and writes genome-sorted aligned reads in GENALICE Aligned Reads (GAR) format. In addition, potential PCR duplicates are marked. Next, the variant caller reads the GAR file and performs INDEL realignment, genotyping and filtering to produce high quality sequence variants.

1.2.2 BWA-MEM

For all three third-party NGS workflows, BWA-MEM (v0.7.8) was used to align reads. Alignment was done with default parameter settings, with one exception: shorter split hits were marked as secondary for Picard compatibility. Aligned reads were written to disk in BAM format with SAMtools (v1.0). Subsequently, the mapped reads were sorted by genome position using SAMtools (v1.0).

1.2.3 GATK

The GATK (v3.2) workflow requires three additional processing steps (Figure 1B). Prior to genotyping, potential PCR duplicates were marked using Picard's MarkDuplicates functionality (v1.129). The HaplotypeCaller was used to genotype the aligned reads. Finally, the raw variants identified by the genotyping tool were refined using Variant Quality Score Recalibration. The GATK workflow was applied using the best practices described by the GATK developers.

1.2.4 Platypus

Platypus uses a single-step procedure to produce variants from a set of aligned reads (Figure 1C). Platypus (v ) was used with default parameter settings. This means that the software applies potential PCR duplicate marking, INDEL realignment, genotyping and filtering on the fly.

The default filtering procedure is highly stringent, resulting in relatively low sensitivity (96.1%) and high precision (99.9%) when detecting NIST/GIAB benchmark variants. This deviates strongly from the other workflows. Therefore, we decided to display results obtained with raw (unfiltered) variants. This increases sensitivity at the expense of precision (Figure 3). Of note, Platypus can also detect variants using local assembly, which increases processing time roughly twofold. Using local assembly had little impact on variant detection accuracy.

1.2.5 VarScan

Prior to genotyping with VarScan (Figure 1D), potential PCR duplicates were marked (Picard v1.129) and reads around INDELs were subjected to local realignment (GATK v3.2). Next, the SAMtools (v1.0) mpileup functionality was run with a filter on mapping quality: only reads with a mapping quality of 20 or higher were included in the pileup information. SNPs and INDELs were called by two separate functions of VarScan (v2.3.7), namely pileup2snp and pileup2indel. The minimum base quality parameter was set to 17. Maximum coverage depth was limited to a read depth of 250. For all other parameters, default settings were applied.
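The three numeric thresholds of the VarScan workflow can be illustrated on a toy pileup column. The following Python sketch is only an illustration and is not the code of SAMtools or VarScan: the `Read` record, the function name `count_alleles` and the order in which the filters are applied are assumptions made for this example. Only the threshold values (mapping quality 20, base quality 17, depth 250) come from the workflow description.

```python
from collections import Counter
from typing import NamedTuple

# Threshold values taken from the workflow description; everything else
# in this sketch (names, data layout, filter order) is hypothetical.
MIN_MAPPING_QUALITY = 20
MIN_BASE_QUALITY = 17
MAX_DEPTH = 250


class Read(NamedTuple):
    base: str             # base observed at the pileup position
    base_quality: int     # Phred-scaled base quality
    mapping_quality: int  # Phred-scaled mapping quality of the read


def count_alleles(column):
    """Count bases at one position after applying the three thresholds."""
    # 1. Keep only reads with mapping quality 20 or higher.
    mapped = [r for r in column if r.mapping_quality >= MIN_MAPPING_QUALITY]
    # 2. Cap the coverage depth at 250 reads.
    capped = mapped[:MAX_DEPTH]
    # 3. Count only bases that reach the minimum base quality.
    return Counter(r.base for r in capped if r.base_quality >= MIN_BASE_QUALITY)


column = [
    Read("A", 35, 60),
    Read("G", 30, 60),
    Read("G", 12, 60),  # dropped: base quality below 17
    Read("G", 38, 5),   # dropped: mapping quality below 20
    Read("A", 17, 20),  # kept: both values sit exactly on the threshold
]
counts = count_alleles(column)
print(counts)
```

In this toy column, two of the five reads are discarded before any allele is counted, which shows how such thresholds trade sensitivity for precision.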

1.3 ACCURACY AND CONCORDANCE

The SMASH (v1.0.1) benchmarking software was used to determine the accuracy of detecting high confidence (NIST-GIAB) variants and to do pairwise comparisons between GENALICE and BWA-MEM/GATK variant call sets. A challenge when comparing VCF files is that the same underlying sequence change can be represented in VCF in several ways. SMASH normalizes VCF files to resolve these ambiguities between variant representations. SnpSift (v3.5) annotate was used to determine concordance with dbSNP (v137) variants. Unlike SMASH, SnpSift compares variants strictly and does not normalize to resolve ambiguities between variant representations.

1.4 HARDWARE CONFIGURATION

1.4.1 Benchmarking Hardware

All data analyses were performed on a server with the following hardware configuration:
- Dual Intel Xeon E V2 ( MHz) CPU with a total of 12 cores and 24 threads
- 128 GB memory (8 x 16 GB, 1333 MHz)
- 2 x 256 GB solid state disks (read/write: 500 MB/second)

The server was connected to storage servers via an Infiniband adapter. NGS reads were streamed via the network from FASTQ files (located on one of the storage servers) into GENALICE MAP. The Infiniband and storage server configurations are:
- Infiniband adapter of 40 Gigabit/second
- 36 x 2 TB HDD disk space
- RAID 50 with 1 GB/second I/O speed (read/write)

1.4.2 Operating System

GENALICE MAP runs on SUSE Linux Enterprise Server 12 Service Pack 1 on a standard server board.
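The representation ambiguity described in Section 1.3 can be made concrete with a small example. The Python sketch below is illustrative only and does not reflect how SMASH is implemented: the helper `apply_variant` and the toy reference sequence are assumptions. It shows two different VCF-style records that describe the same deletion in a CA repeat, which is why a strict record-by-record comparison reports a mismatch while a comparison of the resulting sequences does not.

```python
def apply_variant(reference, pos, ref_allele, alt_allele):
    """Return the sequence obtained by applying one variant.

    `pos` is 1-based, as in VCF. The function checks that the stated
    reference allele really occurs at that position.
    """
    start = pos - 1
    end = start + len(ref_allele)
    if reference[start:end] != ref_allele:
        raise ValueError("reference allele does not match the reference")
    return reference[:start] + alt_allele + reference[end:]


reference = "GGCACACATT"  # toy sequence containing a CA repeat

# Two records for the same event: deletion of one CA unit.
left_aligned = (2, "GCA", "G")   # anchored at the left end of the repeat
right_shifted = (6, "ACA", "A")  # anchored further to the right

seq_left = apply_variant(reference, *left_aligned)
seq_right = apply_variant(reference, *right_shifted)

print(left_aligned == right_shifted)  # strict comparison of the records
print(seq_left == seq_right)          # comparison of the resulting sequences
```

A strict comparison treats the two records as different variants, whereas a normalizing comparison recognizes them as one event. This is one reason why concordance figures from strict and normalizing tools are not directly interchangeable.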

2. RESULTS

This report focuses on three important aspects of NGS data processing: speed, data storage footprint and accuracy.

2.1 PROCESSING SPEED

2.1.1 From FASTQ to VCF

The GENALICE MAP workflow (Figure 1A) processes sequence reads in compressed FASTQ format to high quality variant calls in VCF format. Table 1 shows the total processing time required to map reads and call variants for the NA12878 Platinum Genome data set. Total runtime is just under 38 minutes. Table 1 further shows runtimes of three third-party workflows (Figure 1B-D) using the same hardware configuration and the same input data. Based on the total times in Table 1, GENALICE MAP is more than 160-fold faster than the BWA-MEM/VarScan workflow and approximately 50-fold faster than the BWA-MEM/Platypus workflow.

Table 1. Performance Statistics of GENALICE MAP on NA12878 Platinum Genome data

Workflow            Alignment Time 1   Variant Calling Time 2   Total Time
                    (hh:mm:ss)         (hh:mm:ss)               (hh:mm:ss)
GENALICE MAP        00:31:12           00:06:43                 00:37:55
BWA-MEM/GATK        44:12:46           34:48:02                 79:00:48
BWA-MEM/Platypus    31:57:38           00:27:02                 32:24:40
BWA-MEM/VarScan     70:07:22           32:23:20                 102:30:42

Notes:
1 Alignment time includes all read preparation steps prior to variant calling. Those differ between workflows and are described in the Materials and Methods section.
2 Variant calling time includes all processing steps that lead to high quality variants. Details for each workflow are described in the Materials and Methods section.

Table 2. Runtime Statistics of GENALICE MAP Read Alignment on NA12878 Platinum Genome Data

Metric                                   Value     Description
Average CPU usage ± standard deviation   67 ± 13   Average proportion (in %) of Dual Intel Xeon E V2 ( MHz) CPU capacity that is used during alignment
Megabytes per second                               Average data processing speed during alignment
Megabases per second                     75.9      Average alignment speed in million bases per second
Maximum memory usage (GB)                67.8      Total memory requirement for NA12878 Pt alignment in Gigabytes

Table 2 shows the hardware requirements of the read alignment step. CPU usage averages 67% of the total capacity. GENALICE MAP dynamically scales CPU usage depending on the workload. The high quality NA12878 Platinum Genome data set contains many reads that match the reference genome perfectly (Figure 2), resulting in a relatively low workload. Consequently, GENALICE MAP uses only a fraction of the total CPU capacity.
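The relative speed of the workflows follows directly from the total times in Table 1. The short Python sketch below recomputes the ratios; the function name `to_seconds` and the dictionary layout are choices made for this example, and the times are copied from Table 1.

```python
def to_seconds(hhmmss):
    """Convert an hh:mm:ss string (hours may exceed 24) to seconds."""
    hours, minutes, seconds = (int(part) for part in hhmmss.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Total runtimes copied from Table 1.
total_time = {
    "GENALICE MAP": "00:37:55",
    "BWA-MEM/GATK": "79:00:48",
    "BWA-MEM/Platypus": "32:24:40",
    "BWA-MEM/VarScan": "102:30:42",
}

baseline = to_seconds(total_time["GENALICE MAP"])
speedup = {
    workflow: to_seconds(t) / baseline
    for workflow, t in total_time.items()
}

for workflow, factor in speedup.items():
    print(f"{workflow:18s} {factor:6.1f}x")
```

With the Table 1 values, the ratios come out at roughly 125x for BWA-MEM/GATK, 51x for BWA-MEM/Platypus and 162x for BWA-MEM/VarScan.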

Data processing and alignment speeds are shown in megabytes per second and megabases per second, respectively (Table 2). In biological terms, GENALICE MAP aligns more than 75 million bases each second. This means that more than 750 thousand reads are aligned every second for the NA12878 Platinum Genome data. Read pair information is maintained in GENALICE MAP alignments.

Read alignment occurs in memory. GENALICE MAP dynamically allocates memory until the maximum capacity of the system is reached. A total of 67.8 GB of memory is required to align the NA12878 Platinum Genome data set (Table 2), which is well below the system's maximum of 128 GB.

2.2 STORAGE FOOTPRINTS

Typical NGS data has a high storage footprint, making file handling and sharing extremely challenging. GENALICE MAP stores aligned sequence reads in a compact format called GENALICE Aligned Reads (GAR), which eases file handling and sharing.

Table 3. Storage Footprints of NA12878 Platinum Genome data

Metric     Value     Description
FASTQ.GZ   97 GB     Compressed sequence read files (2 files of read pairs combined)
BAM        106 GB    Archival BAM file from European Nucleotide Archive
GAR        14 GB     Fully realignable GAR file (including perfect, partial, repeat, low confidence and unmapped reads)
GAR.GZ     8.3 GB    Compressed fully realignable GAR file
GAR        5.0 GB    Minimal GAR file containing only perfect and partial read mappings

Table 3 shows the reductions in storage footprint achieved by the GAR format. The smallest GAR footprint is 5.0 GB, which is achieved by saving only Perfect and Partial read mappings (see Read mapping distributions in Section 2.3 for details). This is ideal for temporary storage when the data is processed further with GENALICE MAP variant calling. A fully realignable (i.e. archival) GAR file also requires storing Repeats, Low Confidence and Unmappable reads. This increases the storage footprint (Table 3), but it remains far smaller than that of the BAM or compressed FASTQ format. Furthermore, compression of the fully realignable GAR file reduces the footprint further (Table 3), saving costs of long-term storage and easing file sharing.
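The size of the reductions in Table 3 can be expressed as fold reductions and percentages saved. The Python sketch below does this arithmetic; the labels and the helper `reduction` are chosen for this example, and the sizes are copied from Table 3.

```python
# File sizes in GB, copied from Table 3.
size_gb = {
    "FASTQ.GZ": 97.0,
    "BAM": 106.0,
    "GAR (fully realignable)": 14.0,
    "GAR.GZ (fully realignable)": 8.3,
    "GAR (minimal)": 5.0,
}


def reduction(original, reduced):
    """Return (fold reduction, percentage of space saved)."""
    fold = original / reduced
    saved = 100.0 * (1.0 - reduced / original)
    return fold, saved


gar_variants = (
    "GAR (fully realignable)",
    "GAR.GZ (fully realignable)",
    "GAR (minimal)",
)
for gar in gar_variants:
    for source in ("FASTQ.GZ", "BAM"):
        fold, saved = reduction(size_gb[source], size_gb[gar])
        print(f"{gar} vs {source}: {fold:.1f}-fold smaller, {saved:.1f}% saved")
```

For example, the minimal GAR file is about 21-fold smaller than the archival BAM file, and the uncompressed fully realignable GAR file is about 7-fold smaller than the compressed FASTQ files.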

2.3 ACCURACY

Processing speed and reduced storage footprints are only valuable when the accuracy of alignment and variant calling is sufficient.

2.3.1 Read mapping distributions

The GENALICE MAP web application allows near real-time and post-alignment monitoring of the alignment process. One of the reports is the read mapping distribution (Figure 2). GENALICE MAP produces six alignment results:
1. Perfect read mappings align 100% with a unique position of the reference sequence
2. Partial read mappings do not have a 100% matching position due to sequence variations or read errors
3. Repeats are mappings with multiple positions in the reference sequence
4. Low Confidence read mappings have too many bases that do not match the reference sequence
5. Bad Reads failed quality control prior to alignment
6. Unmappable means that no suitable alignment solution was found

With the exception of Bad Reads, which are filtered out before alignment, all mappings can be stored in the GENALICE Aligned Reads format, making it a fully realignable format. By definition, reads are aligned when they are assigned as Perfect, Partial, Repeats or Low Confidence mappings. Variant calling, however, only takes Perfect and Partial mappings into account. The variant caller ignores Repeats, because their alignment is ambiguous and potential variants cannot be assigned to a specific genome location. In-house validation studies showed that ignoring Low Confidence mappings results in better accuracy of the called variants. Unmappable reads lack an alignment solution given the parameter settings applied to the data.

Figure 2 shows that the GENALICE MAP variant caller uses approximately 90% of the sequence reads, because they are either Perfect (73.44%) or Partial (16.29%) mappings. The caller ignores the remaining 10%. The proportion of Repeats, Low Confidence, Bad Reads and Unmappable reads is fully configurable within GENALICE MAP. The settings used here were chosen to give optimal results for the detection of high quality, accurate variants.

[Figure 2: pie charts of the read mapping distribution with the classes Perfect, Partial, Repeats, Low confidence, Bad reads and Unmappable.]

Figure 2. Read mapping distributions. The pie charts show read mapping distributions for NA12878 Platinum data. GENALICE MAP reports six read mapping results: perfect (reads that align fully to unique genome positions); partial (reads that align partially to unique genome positions due to variants or sequencing errors); repeats (reads that map to multiple genome positions); low confidence (reads with too many bases that do not match the reference; in these alignments more than 25% of all bases in a read do not match); bad reads (reads that fail quality control prior to alignment); and unmappable (reads without an alignment solution).
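The six mapping classes and their use by the variant caller can be summarized in a few lines of Python. This is a simplified sketch and not the GENALICE MAP implementation: the function `classify`, its inputs and the order of the checks are assumptions, and a single mismatch fraction stands in for whatever criteria the aligner really applies. Only the class names, the 25% low confidence threshold from the Figure 2 caption and the reported percentages come from the report.

```python
# Threshold from the Figure 2 caption; configurable in the real software.
LOW_CONFIDENCE_MISMATCH_FRACTION = 0.25

# Classes that the variant caller takes into account, per the report.
USED_BY_VARIANT_CALLER = {"Perfect", "Partial"}


def classify(passed_qc, n_positions, mismatch_fraction):
    """Assign one of the six mapping classes to a read (hypothetical rules).

    passed_qc         -- False if the read failed quality control
    n_positions       -- number of reference positions the read maps to
    mismatch_fraction -- fraction of read bases not matching the reference
    """
    if not passed_qc:
        return "Bad Read"
    if n_positions == 0:
        return "Unmappable"
    if n_positions > 1:
        return "Repeat"
    if mismatch_fraction > LOW_CONFIDENCE_MISMATCH_FRACTION:
        return "Low Confidence"
    if mismatch_fraction == 0:
        return "Perfect"
    return "Partial"


# Percentages reported for the NA12878 Platinum Genome data.
reported_percent = {"Perfect": 73.44, "Partial": 16.29}
used = sum(reported_percent[c] for c in USED_BY_VARIANT_CALLER)
print(f"Reads used for variant calling: {used:.2f}%")
```

The sum of the Perfect and Partial percentages is 89.73%, which matches the "approximately 90%" stated in the text.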

2.3.2 Concordance with dbSNP (v137) Variants

GENALICE MAP discovers over 3.7 million SNPs in the NA12878 Platinum Genome data set (Table 4). The vast majority of those variants are also listed in the dbSNP (v137) database. For SNPs, the transition-transversion (Ts/Tv) ratios were calculated. SNPs listed in dbSNP have a Ts/Tv ratio of 2.04, which is close to the expected ratio for whole genome SNP detection. The novel SNP discoveries have a lower Ts/Tv ratio, suggesting that they contain a higher proportion of false positive calls. The BWA-MEM/GATK workflow detects SNPs with known/novel proportions and Ts/Tv ratios that are similar to those of GENALICE MAP. The BWA-MEM/Platypus workflow detects more novel calls, but with a reduced Ts/Tv ratio. In contrast, BWA-MEM/VarScan detects fewer novel calls, with potentially fewer false positive calls.

Table 4. Detection of dbSNP (v137) SNPs

Workflow            dbSNP   % 1   Ts/Tv   Novel   % 2   Ts/Tv
GENALICE MAP        3,735, ,
BWA-MEM/GATK        3,691, ,
BWA-MEM/Platypus    3,576, ,
BWA-MEM/VarScan     3,683, ,

Notes:
1 Percentage of total SNPs: dbSNP concordant SNPs divided by the sum of dbSNP concordant and novel SNPs.
2 Percentage of total SNPs: novel SNPs divided by the sum of dbSNP concordant and novel SNPs.

GENALICE MAP detects roughly 780 thousand INDELs (Table 5), of which 80% are previously described in the dbSNP database and 20% are novel discoveries. The BWA-MEM/GATK workflow detects a similar number of INDELs, but with relatively more novel calls. With more than 820 thousand discoveries, the BWA-MEM/Platypus workflow detects the most INDELs. As with GENALICE MAP, 80% of the INDELs detected by BWA-MEM/Platypus are present in the dbSNP database. The number of INDEL discoveries is lowest for BWA-MEM/VarScan.

Table 5. Detection of dbSNP (v137) INDELs

Workflow            Type   Total   dbSNP   % 1   Novel   % 2
GENALICE MAP        DEL    422, , ,
GENALICE MAP        INS    362, , ,
BWA-MEM/GATK        DEL    403, , ,
BWA-MEM/GATK        INS    365, , ,
BWA-MEM/Platypus    DEL    427, , ,
BWA-MEM/Platypus    INS    394, , ,
BWA-MEM/VarScan     DEL    333, , ,
BWA-MEM/VarScan     INS    296, , ,

Notes:
1 Percentage of total INDELs: dbSNP concordant INDELs divided by the sum of dbSNP concordant and novel INDELs.
2 Percentage of total INDELs: novel INDELs divided by the sum of dbSNP concordant and novel INDELs.
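The Ts/Tv ratio and the known/novel percentages defined in the table notes can be computed as in the following Python sketch. The function names and the toy input are invented for illustration; the counts are not taken from Tables 4 or 5. A transition is a purine-to-purine (A, G) or pyrimidine-to-pyrimidine (C, T) substitution, and every other substitution is a transversion.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """True for A<->G and C<->T substitutions (ref and alt must differ)."""
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES


def ts_tv_ratio(snps):
    """Ratio of transitions to transversions for (ref, alt) pairs."""
    transitions = sum(1 for ref, alt in snps if is_transition(ref, alt))
    transversions = len(snps) - transitions
    if transversions == 0:
        raise ValueError("no transversions: ratio is undefined")
    return transitions / transversions


def percent_known_novel(n_known, n_novel):
    """Percentages as defined in the table notes (known, novel)."""
    total = n_known + n_novel
    return 100.0 * n_known / total, 100.0 * n_novel / total


# Toy data, invented for this example.
snps = [("A", "G"), ("C", "T"), ("G", "A"), ("T", "C"), ("A", "C"), ("G", "T")]
ratio = ts_tv_ratio(snps)  # 4 transitions, 2 transversions
known_pct, novel_pct = percent_known_novel(80, 20)
print(ratio, known_pct, novel_pct)
```

Random substitutions would give a ratio near 0.5, because each base has one possible transition and two possible transversions. A lower Ts/Tv ratio among novel calls is therefore commonly read as a sign of more false positives.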

2.3.3 Comparison with High Confidence Variants from Genome In A Bottle

The Genome In A Bottle (GIAB) consortium released a set of highly reliable sequence variants for sample NA12878. These high confidence variants are suitable for benchmarking NGS workflows. Therefore, NA12878 Platinum Genome variants discovered by GENALICE MAP were compared to these GIAB variants. GENALICE MAP performs similarly to the other workflows. Overall accuracy was measured using the F1 score, which is the harmonic mean of sensitivity and precision. For the F1 score, all variants (SNPs and INDELs) were included. In this comparison, GENALICE MAP is only outperformed by the BWA-MEM/GATK workflow (Table 6).

Table 6. Overall Accuracy of NGS Workflows

Workflow            Sensitivity (%)   Precision (%)   F1 Score
GENALICE MAP
BWA-MEM/GATK
BWA-MEM/Platypus
BWA-MEM/VarScan

2.3.4 INDEL Lengths Compared

GENALICE MAP can align reads with gaps of unrestricted size at high speed due to its architecture, algorithms and comprehensive reference index. This enables it to detect longer INDELs more efficiently than other NGS workflows (Figure 4).

GENALICE MAP detects more long deletions than any of the other three NGS workflows. Deletions up to 250 bases are easily detected. The BWA-MEM/GATK workflow detects far fewer deletions longer than 50 bases. BWA-MEM/Platypus in assembly mode is also capable of detecting INDELs longer than 50 bases, but less efficiently than BWA-MEM/GATK. BWA-MEM/VarScan and BWA-MEM/Platypus (default) detect hardly any deletions longer than 50 bases.

GENALICE MAP and BWA-MEM/GATK detect insertions longer than 25 bases more efficiently than the other workflows. For GENALICE MAP, the maximum insertion length depends on the length of the sequence reads. Insertions up to 50 bases are detected with a read length of 100 bases. Longer reads (150 bases) increase the maximum insertion length to approximately 100 bases. The maximum insertion length in the BWA-MEM/GATK workflow does not depend on the read length, but on the size of the active region, which can be at most 300 bases.
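The F1 score used for the overall accuracy comparison in Table 6 is commonly defined as the harmonic mean of sensitivity and precision. The Python sketch below shows this calculation; the function names are chosen for this example. As input it uses the sensitivity (96.1%) and precision (99.9%) quoted in Materials & Methods for Platypus with default filtering, since the values of Table 6 are not reproduced here.

```python
def sensitivity_precision(tp, fp, fn):
    """Sensitivity and precision from true/false positive and false negative counts."""
    sensitivity = tp / (tp + fn)
    precision = tp / (tp + fp)
    return sensitivity, precision


def f1_score(sensitivity, precision):
    """Harmonic mean of sensitivity and precision, both given in [0, 1]."""
    if sensitivity + precision == 0:
        return 0.0
    return 2 * sensitivity * precision / (sensitivity + precision)


# Platypus with default filtering, as quoted in Materials & Methods.
f1 = f1_score(0.961, 0.999)
print(f"F1 = {f1:.4f}")
```

Because the harmonic mean is pulled towards the smaller of the two values, a workflow cannot reach a high F1 score through high precision alone.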

[Figure 4: plot of the number of INDELs (y-axis) against INDEL size in bp (x-axis). Legend: GATK HaplotypeCaller v3.2; Platypus V assembly; Platypus V default; VarScan V2.3.7; GENALICE MAP V2.3.0 [2x100bp]; GENALICE MAP V2.3.0 [2x150bp].]

Figure 4. INDEL Length Distributions. For each workflow, the number of INDELs (y-axis) is plotted as a function of their length (x-axis). Negative lengths represent deletions and positive lengths represent insertions. For the BWA-MEM/Platypus workflow, distributions are shown with assembly (blue) and without assembly (green). For GENALICE MAP, INDELs detected in the NA12878 Platinum data (2x100bp reads) are shown in orange. INDELs detected in NA12878 HiSeq X Ten data (2x150bp reads) are shown in red.

3. DISCUSSION

This report focuses on three important aspects of NGS data processing: speed, data storage footprint and accuracy.

3.1 CONCLUSIONS

GENALICE MAP processes sequence reads in compressed FASTQ format to high quality variant calls in VCF format at extremely high speed. Aligned reads are stored in the GENALICE Aligned Reads (GAR) format, which gives a large storage footprint reduction compared to the compressed FASTQ and BAM formats.

GENALICE MAP detects sequence variants with high accuracy, comparable to that of other NGS workflows. There are, however, subtle differences between the workflows. GENALICE MAP and BWA-MEM/GATK detect a similar number of sequence variants. Moreover, their concordance with known (dbSNP) variants is equally high. BWA-MEM/Platypus detects the highest number of INDELs. The BWA-MEM/VarScan workflow is the most conservative and discovers the fewest variants in the NA12878 Platinum Genome data.

This conservative detection strategy of BWA-MEM/VarScan ensures high precision (i.e. low false positive rates) when compared to the NIST/GIAB truth variants. Sensitivity of INDEL detection, however, is strongly impaired for this workflow. The efficient detection of INDELs by the BWA-MEM/Platypus workflow results in good sensitivity and precision for insertions, but the detection of truth deletions does not seem to benefit and has fairly low sensitivity and precision. In addition, this workflow has the weakest performance on accurate detection of SNPs.

GENALICE MAP has a more all-round accuracy, because it has the second-highest F1 score measured for SNPs and INDELs. It is only outperformed by the BWA-MEM/GATK workflow. Notably, GATK outperforms all other tested workflows. However, we believe this is because the benchmark variant call set for the NA12878 genome is biased in favour of the GATK workflow. This call set was generated using both GATK's UnifiedGenotyper and HaplotypeCaller, as well as Cortex. As GATK was the main variant detection method used to generate this benchmark truth set, we suggest that this largely explains the slightly higher performance of GATK in this comparison.

GENALICE MAP detects more long INDELs than any of the other tested workflows. Insertions longer than 25 bases and up to 100 bases, depending on sequence read length, are detected. The maximum deletion length is limited to 250 bases. A sequence read can be mapped with gaps longer than 250 bases, but those events are registered as break points. In the future, GENALICE MAP will use these break points for structural variant detection.
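The rule for long gaps described in the conclusions can be written down in a few lines. The Python sketch below is a hypothetical illustration, not GENALICE MAP code: the function `classify_gap` and its labels are assumptions, and only the 250-base limit comes from the report.

```python
MAX_DELETION_LENGTH = 250  # bases; limit stated in the report


def classify_gap(gap_length):
    """Label a gap in a read alignment as a deletion or a break point."""
    if gap_length <= 0:
        raise ValueError("gap length must be positive")
    if gap_length <= MAX_DELETION_LENGTH:
        return "deletion"
    # Longer gaps are kept, but registered as break points rather than INDELs.
    return "break point"


for length in (12, 250, 251, 4000):
    print(length, classify_gap(length))
```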


ADDRESS: DEVENTERWEG 9D, 3843 GA HARDERWIJK, THE NETHERLANDS


Variation among genomes

Variation among genomes Variation among genomes Comparing genomes The reference genome http://www.ncbi.nlm.nih.gov/nuccore/26556996 Arabidopsis thaliana, a model plant Col-0 variety is from Landsberg, Germany Ler is a mutant

More information

BaseSpace - MiSeq Reporter Software v2.4 Release Notes

BaseSpace - MiSeq Reporter Software v2.4 Release Notes Page 1 of 5 BaseSpace - MiSeq Reporter Software v2.4 Release Notes For MiSeq Systems Connected to BaseSpace June 2, 2014 Revision Date Description of Change A May 22, 2014 Initial Version Revision History

More information

Intro to NGS Tutorial

Intro to NGS Tutorial Intro to NGS Tutorial Release 8.6.0 Golden Helix, Inc. October 31, 2016 Contents 1. Overview 2 2. Import Variants and Quality Fields 3 3. Quality Filters 10 Generate Alternate Read Ratio.........................................

More information

Pre-processing and quality control of sequence data. Barbera van Schaik KEBB - Bioinformatics Laboratory

Pre-processing and quality control of sequence data. Barbera van Schaik KEBB - Bioinformatics Laboratory Pre-processing and quality control of sequence data Barbera van Schaik KEBB - Bioinformatics Laboratory b.d.vanschaik@amc.uva.nl Topic: quality control and prepare data for the interesting stuf Keep Throw

More information

Bioinformatics Framework

Bioinformatics Framework Persona: A High-Performance Bioinformatics Framework Stuart Byma 1, Sam Whitlock 1, Laura Flueratoru 2, Ethan Tseng 3, Christos Kozyrakis 4, Edouard Bugnion 1, James Larus 1 EPFL 1, U. Polytehnica of Bucharest

More information

RPGC Manual. You will also need python 2.7 or above to run our home-brew python scripts.

RPGC Manual. You will also need python 2.7 or above to run our home-brew python scripts. Introduction Here we present a new approach for producing de novo whole genome sequences--recombinant population genome construction (RPGC)--that solves many of the problems encountered in standard genome

More information

Calling variants in diploid or multiploid genomes

Calling variants in diploid or multiploid genomes Calling variants in diploid or multiploid genomes Diploid genomes The initial steps in calling variants for diploid or multi-ploid organisms with NGS data are the same as what we've already seen: 1. 2.

More information

Mar. Guide. Edico Genome Inc North Torrey Pines Court, Plaza Level, La Jolla, CA 92037

Mar. Guide.  Edico Genome Inc North Torrey Pines Court, Plaza Level, La Jolla, CA 92037 Mar 2017 DRAGEN TM Quick Start Guide www.edicogenome.com info@edicogenome.com Edico Genome Inc. 3344 North Torrey Pines Court, Plaza Level, La Jolla, CA 92037 Notice Contents of this document and associated

More information
