NA12878 Platinum Genome GENALICE MAP Analysis Report


Bas Tolhuis, PhD, and Jan-Jaap Wesselink, PhD, GENALICE B.V.

INDEX

EXECUTIVE SUMMARY
1. MATERIALS & METHODS
1.1 SEQUENCE DATA
1.2 WORKFLOWS
1.3 ACCURACY AND CONCORDANCE
1.4 HARDWARE CONFIGURATION
2. RESULTS
2.1 PROCESSING SPEED
2.2 STORAGE FOOTPRINTS
2.3 ACCURACY
3. DISCUSSION
3.1 CONCLUSIONS

EXECUTIVE SUMMARY

The aim of this benchmark study was to perform alignment and variant calling with GENALICE MAP software on NA12878 data from the Illumina Platinum Genomes project. The resulting BAM and VCF files, including SNPs and INDELs, are being made available to prospective customers for evaluation. This document describes the data analysis approach. It also discusses the processing speed, storage footprint reduction and accuracy of the variant call sets, which we tested in house before sharing the data. Moreover, GENALICE MAP's performance is compared directly to three widely used third-party NGS workflows.

GENALICE MAP is an extremely fast NGS data analysis solution, which significantly reduces data storage requirements and produces high quality analysis results, including aligned reads and sequence variants.

1. MATERIALS & METHODS

The NA12878 Platinum Genome data set was subjected to alignment and variant calling by GENALICE MAP and three third-party NGS workflows. The resulting variants were compared to dbSNP (v137) and to benchmark variants (Genome In A Bottle). Throughout this pilot study, build 37 of the human genome (GRCh37) was used as the reference.

1.1 SEQUENCE DATA

1.1.1 NA12878 Platinum Genome (Illumina)

Illumina's NA12878 Platinum Genome data set was downloaded as two compressed FASTQ files from the European Nucleotide Archive (ERR194147).

1.1.2 Genome In A Bottle High Confidence Variants

Variants discovered by the GENALICE MAP and BWA-MEM/GATK workflows were compared to the Genome In A Bottle (GIAB) NA12878 benchmark variant call set (NIST v2.18). These so-called high confidence variants were downloaded from: ftp://ftp-trace.ncbi.nih.gov/giab/ftp/data/na12878.variant_calls/nist.

1.2 WORKFLOWS

Figure 1. Workflows to Process NGS data

| Process | A. GENALICE MAP | B. BWA-MEM/GATK HC | C. BWA-MEM/Platypus | D. BWA-MEM/VarScan |
|---|---|---|---|---|
| Read alignment | GENALICE MAP | BWA-MEM, SAMtools | BWA-MEM, SAMtools | BWA-MEM, SAMtools |
| Read sorting | GENALICE MAP | SAMtools | SAMtools | SAMtools |
| Mark duplicates | GENALICE MAP | Picard | Platypus | Picard |
| INDEL realign | GENALICE MAP | | Platypus | GATK |
| Genotyping | GENALICE MAP | GATK | Platypus | SAMtools |
| Filtering | GENALICE MAP | GATK | Platypus | VarScan |
| Variants | SNPs, INDELs | SNPs, INDELs | SNPs, INDELs | SNPs, INDELs |

Key processes for NGS data analysis are shown in the first column, including read alignment, read sorting, PCR duplicate handling, INDEL realignment, genotyping and filtering. In the original figure, each colored rectangle represents a single command line processing step.

1.2.1 GENALICE MAP

The same GENALICE MAP version was used throughout this benchmark study. GENALICE MAP uses a two-stage workflow to process read sequences in FASTQ format into high quality variants (SNPs and INDELs) in VCF format (Figure 1A). Sequence reads were streamed via the network into the aligner node. Notably, in this study GENALICE MAP aligned directly from compressed (.gzip) FASTQ files. The aligner maps sequence reads onto the indexed reference genome and writes genome-sorted aligned reads in GENALICE Aligned Reads (GAR) format. In addition, potential PCR duplicates are marked. Next, the variant caller reads the GAR file and performs INDEL realignment, genotyping and filtering to produce high quality sequence variants.

1.2.2 BWA-MEM

For all three third-party NGS workflows, BWA-MEM (v0.7.8) was used to align reads. Alignment was done with default parameter settings, with one exception: shorter split hits were marked as secondary for Picard compatibility. Aligned reads were written to disk in BAM format with SAMtools (v1.0). Subsequently, the mapped reads were sorted by genome position using SAMtools (v1.0).

1.2.3 GATK

The GATK (v3.2) workflow requires three additional processing steps (Figure 1B). Prior to genotyping, potential PCR duplicates were marked using Picard's MarkDuplicates functionality (v1.129). The HaplotypeCaller was used to genotype the aligned reads. Finally, the raw variants identified by the genotyping tool were refined using Variant Quality Score Recalibration. The GATK workflow was applied using the best practices described by the GATK developers.

1.2.4 Platypus

Platypus uses a single-step procedure to produce variants from a set of aligned reads (Figure 1C). Platypus was used with default parameter settings. This means that the software applies potential PCR duplicate marking, INDEL realignment, genotyping and filtering on the fly.

The default filtering procedure is highly stringent, resulting in comparatively low sensitivity (96.1%) and high precision (99.9%) when detecting NIST/GIAB benchmark variants. This deviates strongly from the other workflows. We therefore decided to display results obtained with raw (unfiltered) variants, which increases sensitivity at the expense of precision (Figure 3). Of note, Platypus can also detect variants using local assembly, which roughly doubles the processing time. Using local assembly had little impact on variant detection accuracy.

1.2.5 VarScan

Prior to genotyping with VarScan (Figure 1D), potential PCR duplicates were marked (Picard v1.129) and reads around INDELs were subjected to local realignment (GATK v3.2). Next, the SAMtools (v1.0) mpileup functionality was run with a filter on mapping quality: only reads with mapping quality 20 or higher were included in the pileup information. SNPs and INDELs were called by two separate functions of VarScan (v2.3.7), namely pileup2snp and pileup2indel. The minimum base quality parameter was set to 17. Maximum coverage depth was limited to a read depth of 250. Default settings were applied for all other parameters.
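The three thresholds quoted for the VarScan workflow (mapping quality 20, base quality 17, maximum depth 250) can be illustrated with a small Python sketch. The `ReadBase` record and both function names are hypothetical and exist only for this illustration; they are not part of SAMtools or VarScan, and the order in which the real tools apply these filters is an assumption here.

```python
from dataclasses import dataclass

# Thresholds quoted in the VarScan workflow description.
MIN_MAPPING_QUALITY = 20  # reads below this are excluded from the pileup
MIN_BASE_QUALITY = 17     # bases below this are not counted by the caller
MAX_DEPTH = 250           # coverage per position is capped at this depth


@dataclass(frozen=True)
class ReadBase:
    """One read base overlapping a genome position (hypothetical record)."""
    base: str
    base_quality: int
    mapping_quality: int


def build_pileup_column(read_bases):
    """Pileup stage (assumed order): drop low mapping quality, then cap depth."""
    kept = [rb for rb in read_bases if rb.mapping_quality >= MIN_MAPPING_QUALITY]
    return kept[:MAX_DEPTH]


def countable_bases(column):
    """Calling stage (assumed order): count only bases of sufficient quality."""
    return [rb for rb in column if rb.base_quality >= MIN_BASE_QUALITY]
```

With these settings a position covered by 300 well-mapped reads contributes at most 250 bases, and a base with quality 10 is ignored even when its read maps confidently.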

1.3 ACCURACY AND CONCORDANCE

The SMASH (v1.0.1) benchmarking software was used to determine the accuracy of detecting high confidence (NIST-GIAB) variants and to do pairwise comparisons between GENALICE and BWA-MEM/GATK variant call sets. A challenge when comparing VCF files is that the same underlying sequence change can be represented as a VCF variant in several ways. SMASH normalizes VCF files to resolve these ambiguities between variant representations.

SnpSift (v3.5) annotate was used to determine concordance with dbSNP (v137) variants. Unlike SMASH, SnpSift compares variants strictly and does not normalize ambiguous variant representations.

1.4 HARDWARE CONFIGURATION

1.4.1 Benchmarking Hardware

All data analyses were performed on a server with the following hardware configuration:

- Dual Intel Xeon E V2 CPU with a total of 12 cores and 24 threads
- 128 GB memory (8x16 GB, 1333 MHz)
- 2x256 GB solid state disks (read/write: 500 MB/second)

The server was connected to storage servers via an Infiniband adapter. NGS reads were streamed via the network from FASTQ files (located on one of the storage servers) into GENALICE MAP. The Infiniband and storage server configurations are:

- Infiniband adapter of 40 Gigabit/second
- 36x2 TB HDD disk space
- RAID 50 with 1 GB/second IO speed (read/write)

1.4.2 Operating System

GENALICE MAP runs on SUSE Linux Enterprise Server 12 Service Pack 1 on a standard server board.
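To make the representation ambiguity from section 1.3 concrete, the Python sketch below trims bases shared by the REF and ALT alleles so that two spellings of the same deletion compare equal. It is an illustration only, not the SMASH algorithm: the function name `trim_variant` is hypothetical, and complete normalization would also left-align variants against the reference sequence, which is omitted here.

```python
def trim_variant(pos, ref, alt):
    """Trim bases shared by REF and ALT, keeping at least one base in each.

    pos is the 1-based position of the first REF base, as in VCF.
    """
    # Drop shared trailing bases; the start position does not move.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Drop shared leading bases; the start position moves right.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# Two spellings of the same one-base deletion.
padded = (100, "CAG", "CG")
minimal = (100, "CA", "C")
same_after_trimming = trim_variant(*padded) == trim_variant(*minimal)
same_with_strict_comparison = padded == minimal
```

Here `same_after_trimming` is true while `same_with_strict_comparison` is false, which is the kind of difference that can separate a normalizing comparison from a strict one.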

2. RESULTS

This report focuses on three important aspects of NGS data processing: speed, data storage footprint and accuracy.

2.1 PROCESSING SPEED

2.1.1 From FASTQ to VCF

The GENALICE MAP workflow (Figure 1A) processes sequence reads in compressed FASTQ format into high quality variant calls in VCF format. Table 1 shows the total processing time required to map reads and call variants for the NA12878 Platinum Genome data set. Total runtime is just under 38 minutes. Table 1 also shows the runtimes of three third-party workflows (Figure 1B-D) using the same hardware configuration and the same input data. Based on the total times in Table 1, GENALICE MAP is roughly 160 fold faster than the BWA-MEM/VarScan workflow and approximately 50 fold faster than the BWA-MEM/Platypus workflow.

Table 1. Performance Statistics of GENALICE MAP on NA12878 Platinum Genome data

| Workflow | Alignment Time (1) (hh:mm:ss) | Variant Calling Time (2) (hh:mm:ss) | Total Time (hh:mm:ss) |
|---|---|---|---|
| GENALICE MAP | 00:31:12 | 00:06:43 | 00:37:55 |
| BWA-MEM/GATK | 44:12:46 | 34:48:02 | 79:00:48 |
| BWA-MEM/Platypus | 31:57:38 | 00:27:02 | 32:24:40 |
| BWA-MEM/VarScan | 70:07:22 | 32:23:20 | 102:30:42 |

Notes:
(1) Alignment time includes all read preparation steps prior to variant calling. Those differ between workflows and are described in the Materials and Methods section.
(2) Variant calling time includes all processing steps that lead to high quality variants. Details for each workflow are described in the Materials and Methods section.

Table 2. Runtime Statistics of GENALICE MAP Read Alignment on NA12878 Platinum Genome Data

| Metric | Value | Description |
|---|---|---|
| Average CPU usage ± standard deviation | 67 ± 13 | Average proportion (in %) of Dual Intel Xeon E V2 CPU capacity that is used during alignment |
| Megabytes per second | | Average data processing speed during alignment |
| Megabases per second | 75.9 | Average alignment speed in million bases per second |
| Maximum memory usage (GB) | 67.8 | Total memory requirement for NA12878 Platinum Genome alignment in gigabytes |

Table 2 shows the hardware requirements of the read alignment step. CPU usage averages 67% of the total capacity. GENALICE MAP dynamically scales CPU usage depending on the workload. The high quality NA12878 Platinum Genome data set contains many reads that match the reference genome perfectly (Figure 2), resulting in a relatively low workload. Consequently, GENALICE MAP uses only a fraction of the total CPU capacity.
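The fold differences between workflows follow directly from the total times in Table 1. The short Python sketch below recomputes them; the helper names `to_seconds` and `fold_speedup` are introduced here for illustration only.

```python
# Total runtimes (hh:mm:ss) copied from Table 1.
TOTAL_TIMES = {
    "GENALICE MAP": "00:37:55",
    "BWA-MEM/GATK": "79:00:48",
    "BWA-MEM/Platypus": "32:24:40",
    "BWA-MEM/VarScan": "102:30:42",
}


def to_seconds(hms):
    """Convert an hh:mm:ss string (hours may exceed 24) to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def fold_speedup(workflow, baseline="GENALICE MAP"):
    """Return how many times longer `workflow` runs than `baseline`."""
    return to_seconds(TOTAL_TIMES[workflow]) / to_seconds(TOTAL_TIMES[baseline])


if __name__ == "__main__":
    for name in TOTAL_TIMES:
        print(f"{name}: {fold_speedup(name):.1f}x the GENALICE MAP runtime")
```

The same helper can be pointed at the alignment or variant calling columns of Table 1 to compare the individual stages rather than the totals.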

Data processing and alignment speeds are shown in megabytes per second and megabases per second, respectively (Table 2). In biological terms, GENALICE MAP aligns more than 75 million bases each second. For the NA12878 Platinum Genome data, this means that more than 750 thousand reads are aligned every second. Read pair information is maintained in GENALICE MAP alignments.

Read alignment occurs in memory. GENALICE MAP allocates memory dynamically, up to the maximum capacity of the system. A total of 67.8 GB of memory is required to align the NA12878 Platinum Genome data set (Table 2), which is well below the system's maximum of 128 GB.

2.2 STORAGE FOOTPRINTS

Typical NGS data has a high storage footprint, which makes file handling and sharing extremely challenging. GENALICE MAP stores aligned sequence reads in a small-footprint format called GENALICE Aligned Reads (GAR), which eases file handling and sharing.

Table 3. Storage Footprints of NA12878 Platinum Genome data

| Metric | Value | Description |
|---|---|---|
| FASTQ.GZ | 97 GB | Compressed sequence read files (2 files of read pairs combined) |
| BAM | 106 GB | Archival BAM file from European Nucleotide Archive |
| GAR | 14 GB | Fully realignable GAR file (including perfect, partial, repeat, low confidence and unmapped reads) |
| GAR.GZ | 8.3 GB | Compressed fully realignable GAR file |
| GAR | 5.0 GB | Minimal GAR file containing only perfect and partial read mappings |

Table 3 shows the reductions in storage footprint achieved by the GAR format. The smallest GAR footprint is 5.0 GB, which is achieved by saving only Perfect and Partial read mappings (see the description of read mapping distributions under section 2.3 for details). This is ideal for temporary storage when the data is processed further with GENALICE MAP variant calling. A fully realignable (i.e. archival) GAR file also requires storing Repeats, Low Confidence and Unmappable reads. This increases the storage footprint (Table 3), but the result is still far smaller than the BAM or compressed FASTQ format.

Furthermore, compressing the fully realignable GAR file reduces the footprint further (Table 3), which saves long-term storage costs and eases file sharing.
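The size reductions in Table 3 can be expressed as simple ratios. The Python sketch below computes them from the table values; the dictionary keys and the function name `reduction_factor` are labels chosen for this illustration.

```python
# File sizes in GB, copied from Table 3.
FOOTPRINTS_GB = {
    "FASTQ.GZ": 97.0,
    "BAM": 106.0,
    "GAR (fully realignable)": 14.0,
    "GAR.GZ (fully realignable)": 8.3,
    "GAR (minimal)": 5.0,
}


def reduction_factor(fmt, baseline):
    """Return how many times smaller `fmt` is than `baseline`."""
    return FOOTPRINTS_GB[baseline] / FOOTPRINTS_GB[fmt]


if __name__ == "__main__":
    for baseline in ("FASTQ.GZ", "BAM"):
        for fmt in FOOTPRINTS_GB:
            if fmt.startswith("GAR"):
                factor = reduction_factor(fmt, baseline)
                print(f"{fmt} is {factor:.1f}x smaller than {baseline}")
```

For example, the minimal GAR file is about 21 times smaller than the archival BAM file, and the fully realignable GAR file about 7.6 times smaller.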

2.3 ACCURACY

Processing speed and reduced storage footprints are only valuable when the accuracy of alignment and variant calling is sufficient.

2.3.1 Read mapping distributions

The GENALICE MAP web application allows near real-time and post-alignment monitoring of the alignment process. One of the reports is the read mapping distribution (Figure 2). GENALICE MAP produces six alignment results:

1. Perfect read mappings align 100% with a unique position of the reference sequence
2. Partial read mappings do not have a 100% matching position due to sequence variations or read errors
3. Repeats are mappings with multiple positions in the reference sequence
4. Low Confidence read mappings have too many bases that do not match the reference sequence
5. Bad Reads failed quality control prior to alignment
6. Unmappable means that no suitable alignment solution was found

With the exception of Bad Reads, which are filtered out before alignment, all mappings can be stored in the GENALICE Aligned Reads format, making it a fully realignable format. By definition, reads are aligned when they are assigned as Perfect, Partial, Repeats or Low Confidence mappings. Variant calling, however, only takes Perfect and Partial mappings into account. The variant caller ignores Repeats, because their alignment is ambiguous and potential variants cannot be assigned to a specific genome location. In-house validation studies showed that ignoring Low Confidence mappings results in better accuracy of the called variants. Unmappable reads lack an alignment solution given the parameter settings applied to the data.

Figure 2 shows that the GENALICE MAP variant caller uses approximately 90% of the sequence reads, because they are either Perfect (73.44%) or Partial (16.29%) mappings. The caller ignores the remaining 10%. The proportion of Repeats, Low Confidence, Bad Reads and Unmappable reads is fully configurable within GENALICE MAP. The settings used here were chosen to give optimal results with respect to the detection of high quality and accurate variants.

Figure 2. Read mapping distributions

The pie charts show read mapping distributions for NA12878 Platinum data, with one segment each for Perfect, Partial, Repeats, Low Confidence, Bad Reads and Unmappable. GENALICE MAP reports six read mapping results: perfect (reads that align fully to unique genome positions); partial (reads that align partially to unique genome positions due to variants or sequencing errors); repeats (reads that map to multiple genome positions); low confidence (reads with too many bases that do not match the reference; in these alignments more than 25% of all bases in a read); bad reads (reads that fail quality control prior to alignment); and unmappable (reads without an alignment solution).
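The six mapping categories and the caller's use of only Perfect and Partial mappings can be summarized as a small decision procedure. The Python sketch below is an illustration under stated assumptions, not GENALICE MAP's implementation: the function signature, the order of the checks and the treatment of the 25% mismatch limit as a fixed constant are all assumptions made here.

```python
from enum import Enum


class Mapping(Enum):
    PERFECT = "Perfect"
    PARTIAL = "Partial"
    REPEAT = "Repeats"
    LOW_CONFIDENCE = "Low Confidence"
    BAD_READ = "Bad Reads"
    UNMAPPABLE = "Unmappable"


# Limit taken from the Figure 2 caption: more than 25% mismatching bases.
LOW_CONFIDENCE_MISMATCH_FRACTION = 0.25

# Only these categories are passed on to variant calling.
USED_BY_VARIANT_CALLER = {Mapping.PERFECT, Mapping.PARTIAL}


def classify(passed_qc, n_locations, n_mismatches, read_length):
    """Assign one read to a category (assumed order of checks)."""
    if not passed_qc:
        return Mapping.BAD_READ
    if n_locations == 0:
        return Mapping.UNMAPPABLE
    if n_locations > 1:
        return Mapping.REPEAT
    if n_mismatches / read_length > LOW_CONFIDENCE_MISMATCH_FRACTION:
        return Mapping.LOW_CONFIDENCE
    if n_mismatches == 0:
        return Mapping.PERFECT
    return Mapping.PARTIAL
```

A 100-base read with a unique location and three mismatches would be classified as Partial and used for variant calling, whereas the same read with 30 mismatches would be Low Confidence and ignored.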

2.3.2 Concordance with dbSNP (v137) Variants

GENALICE MAP discovers over 3.7 million SNPs in the NA12878 Platinum Genome data set (Table 4). The vast majority of those variants are also listed in the dbSNP (v137) database. For SNPs, the transition-transversion (Ts/Tv) ratios were calculated. SNPs listed in dbSNP have a Ts/Tv ratio of 2.04, which is close to the expected ratio for whole genome SNP detection. The novel SNP discoveries have a lower Ts/Tv ratio, suggesting that this set contains a higher proportion of false positive calls. The BWA-MEM/GATK workflow detects SNPs with known/novel proportions and Ts/Tv ratios that are similar to those of GENALICE MAP. The BWA-MEM/Platypus workflow detects more novel calls, but with a reduced Ts/Tv ratio. In contrast, BWA-MEM/VarScan detects fewer novel calls, with potentially fewer false positive calls.

Table 4. Detection of dbSNP (v137) SNPs

| Workflow | dbSNP | % (1) | Ts/Tv | Novel | % (2) | Ts/Tv |
|---|---|---|---|---|---|---|
| GENALICE MAP | 3,735,... | | | | | |
| BWA-MEM GATK | 3,691,... | | | | | |
| BWA-MEM Platypus | 3,576,... | | | | | |
| BWA-MEM VarScan | 3,683,... | | | | | |

Notes:
(1) Percentage of total SNPs: dbSNP concordant SNPs divided by the sum of dbSNP concordant and novel SNPs.
(2) Percentage of total SNPs: novel SNPs divided by the sum of dbSNP concordant and novel SNPs.

GENALICE MAP detects roughly 780 thousand INDELs (Table 5), of which 80% are previously described in the dbSNP database and 20% are novel discoveries. The BWA-MEM/GATK workflow detects a similar number of INDELs, but with relatively more novel calls. With more than 820 thousand discoveries, the BWA-MEM/Platypus workflow detects the most INDELs. As with GENALICE MAP, 80% of the INDELs detected by BWA-MEM/Platypus are present in the dbSNP database. The number of INDEL discoveries is lowest for BWA-MEM/VarScan.

Table 5. Detection of dbSNP (v137) INDELs

| Workflow | Type | Total | dbSNP | % (1) | Novel | % (2) |
|---|---|---|---|---|---|---|
| GENALICE MAP | DEL | 422,... | | | | |
| GENALICE MAP | INS | 362,... | | | | |
| BWA-MEM GATK | DEL | 403,... | | | | |
| BWA-MEM GATK | INS | 365,... | | | | |
| BWA-MEM Platypus | DEL | 427,... | | | | |
| BWA-MEM Platypus | INS | 394,... | | | | |
| BWA-MEM VarScan | DEL | 333,... | | | | |
| BWA-MEM VarScan | INS | 296,... | | | | |

Notes:
(1) Percentage of total INDELs: dbSNP concordant INDELs divided by the sum of dbSNP concordant and novel INDELs.
(2) Percentage of total INDELs: novel INDELs divided by the sum of dbSNP concordant and novel INDELs.
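The Ts/Tv ratio and the known/novel percentages reported in Tables 4 and 5 are straightforward to compute from a list of calls. The Python sketch below shows one way to do so; the function names and the toy input are illustrative and do not reproduce any values from the tables.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """A transition swaps purine for purine or pyrimidine for pyrimidine."""
    return ({ref, alt} <= PURINES) or ({ref, alt} <= PYRIMIDINES)


def ts_tv_ratio(snps):
    """Return transitions divided by transversions for (ref, alt) pairs."""
    transitions = sum(1 for ref, alt in snps if is_transition(ref, alt))
    transversions = len(snps) - transitions
    if transversions == 0:
        raise ValueError("Ts/Tv is undefined without transversions")
    return transitions / transversions


def percent_of_total(count, known, novel):
    """Percentage as defined in the table notes: count / (known + novel)."""
    return 100.0 * count / (known + novel)


# Toy input: four transitions and two transversions.
example_snps = [("A", "G"), ("C", "T"), ("G", "A"), ("T", "C"), ("A", "C"), ("G", "T")]
example_ratio = ts_tv_ratio(example_snps)
```

The toy list gives a ratio of 2.0, close to the value of about 2 that the text describes as expected for whole genome SNP detection.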

2.3.3 Comparison with High Confidence Variants from Genome In A Bottle

The Genome In A Bottle (GIAB) consortium released a set of highly reliable sequence variants for sample NA12878. These high confidence variants are suitable for benchmarking NGS workflows. Therefore, NA12878 Platinum Genome variants discovered by GENALICE MAP were compared to these GIAB variants. GENALICE MAP performs similarly to the other workflows. Overall accuracy was measured using the F1 score, which is the harmonic mean of sensitivity and precision. For the F1 score, all variants (SNPs and INDELs) were included. In this comparison, GENALICE MAP is only outperformed by the BWA-MEM/GATK workflow (Table 6).

Table 6. Overall Accuracy of NGS Workflows

| Workflow | Sensitivity (%) | Precision (%) | F1 Score |
|---|---|---|---|
| GENALICE MAP | | | |
| BWA-MEM/GATK | | | |
| BWA-MEM/Platypus | | | |
| BWA-MEM/VarScan | | | |

2.3.4 INDEL Lengths Compared

GENALICE MAP can align reads with unrestricted gap sizes at high speed due to its architecture, algorithms and comprehensive reference index. This enables it to detect longer INDELs with higher efficiency than other NGS workflows (Figure 4).

GENALICE MAP detects more long deletions than any of the other three NGS workflows. Deletions of up to 250 bases are easily detected. The BWA-MEM/GATK workflow detects far fewer deletions longer than 50 bases. BWA-MEM/Platypus in assembly mode is also capable of detecting INDELs longer than 50 bases, but less efficiently than BWA-MEM/GATK. BWA-MEM/VarScan and BWA-MEM/Platypus (default) hardly detect any deletions longer than 50 bases.

GENALICE MAP and BWA-MEM/GATK are capable of detecting insertions longer than 25 bases with higher efficiency than the other workflows. For GENALICE MAP, the maximum insertion length depends on the length of the sequence reads. Insertions of up to 50 bases are detected with a read length of 100 bases. Longer reads (150 bases) increase the maximum insertion length to approximately 100 bases. The maximum insertion length in the BWA-MEM/GATK workflow does not depend on the read length, but on the size of the active region, which can be at most 300 bases.
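The F1 score used for the overall accuracy comparison in section 2.3.3 is conventionally defined as the harmonic mean of sensitivity and precision. The Python sketch below applies that definition to the only sensitivity and precision pair stated in the text, the default-filtered Platypus call set from section 1.2.4 (96.1% and 99.9%); the function name is chosen for this illustration.

```python
def f1_score(sensitivity, precision):
    """Harmonic mean of sensitivity and precision, both given as fractions."""
    if sensitivity + precision == 0:
        return 0.0
    return 2 * sensitivity * precision / (sensitivity + precision)


# Default-filtered Platypus values quoted in the Platypus workflow description.
platypus_filtered_f1 = f1_score(0.961, 0.999)
```

Because the harmonic mean is pulled towards the smaller of the two values, this call set scores about 0.980, slightly below the arithmetic mean of 0.980 exactly (0.9796 versus 0.9800).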

Figure 4. INDEL Length Distributions

For each workflow, the number of INDELs (y-axis) is plotted as a function of INDEL size in bp (x-axis). Negative lengths represent deletions and positive lengths represent insertions. The plotted series are GATK HaplotypeCaller v3.2, Platypus (assembly), Platypus (default), VarScan v2.3.7, GENALICE MAP v2.3.0 [2x100bp] and GENALICE MAP v2.3.0 [2x150bp]. For the BWA-MEM/Platypus workflow, distributions are shown with assembly (blue) and without assembly (green). For GENALICE MAP, INDELs detected in the NA12878 Platinum data (2x100bp reads) are shown in orange, and INDELs detected in NA12878 HiSeq X Ten data (2x150bp reads) are shown in red.
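A length distribution like the one in Figure 4 can be derived from the REF and ALT alleles of each call, using the sign convention from the caption (negative for deletions, positive for insertions). The Python sketch below is a minimal illustration; the function names and the toy variants are assumptions made here, not data from the report.

```python
from collections import Counter


def signed_indel_length(ref, alt):
    """Return ALT length minus REF length: negative for deletions."""
    return len(alt) - len(ref)


def indel_length_histogram(variants):
    """Count INDELs per signed length, skipping same-length substitutions."""
    lengths = (signed_indel_length(ref, alt) for ref, alt in variants)
    return Counter(length for length in lengths if length != 0)


# Toy calls: a 1-base deletion, a 2-base insertion, a SNP and a 2-base deletion.
example_variants = [("CA", "C"), ("C", "CAT"), ("A", "G"), ("CAT", "C")]
example_histogram = indel_length_histogram(example_variants)
```

Plotting the counter's keys on the x-axis and its values on the y-axis gives the shape of one curve in the figure.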

3. DISCUSSION

This report focuses on three important aspects of NGS data processing: speed, data storage footprint and accuracy.

3.1 CONCLUSIONS

GENALICE MAP processes sequence reads in compressed FASTQ format into high quality variant calls in VCF format at extremely high speed. Aligned reads are stored in the GENALICE Aligned Reads (GAR) format, which gives a large storage footprint reduction compared to the compressed FASTQ and BAM formats (Table 3).

GENALICE MAP detects sequence variants with high accuracy, comparable to that of other NGS workflows. There are, however, subtle differences between the workflows. GENALICE MAP and BWA-MEM/GATK detect a similar number of sequence variants. Moreover, their concordance with known (dbSNP) variants is equally high. BWA-MEM/Platypus detects the highest number of INDELs. The BWA-MEM/VarScan workflow is the most conservative and discovers the fewest variants in the NA12878 Platinum Genome data.

This conservative detection strategy of BWA-MEM/VarScan ensures high precision (i.e. low false positive rates) when compared to the NIST/GIAB truth variants. Sensitivity of INDEL detection, however, is strongly impaired for this workflow. The efficient detection of INDELs by the BWA-MEM/Platypus workflow results in good sensitivity and precision for insertions, but the detection of truth deletions does not seem to benefit and has fairly low sensitivity and precision. In addition, this workflow has the weakest performance on accurate detection of SNPs. GENALICE MAP has a more all-round accuracy, because it has the second-highest F1 score measured for SNPs and INDELs. It is only outperformed by the BWA-MEM/GATK workflow.

Notably, GATK outperforms all other tested workflows. However, we believe this is because the benchmark variant call set for the NA12878 genome is biased in favour of the GATK workflow. This call set was generated using both GATK's UnifiedGenotyper and HaplotypeCaller, as well as Cortex. As GATK was the main variant detection method used to generate this benchmark truth set, we suggest that this largely explains the slightly higher performance of GATK in this comparison.

GENALICE MAP detects more long INDELs than any of the other tested workflows. Insertions longer than 25 bases and up to 100 bases, depending on sequence read length, are detected. The maximum deletion length is limited to 250 bases. A sequence read can be mapped with gaps longer than 250 bases, but those events are registered as break points. In the future, GENALICE MAP will use these break points for structural variant detection.


ADDRESS: DEVENTERWEG 9D, 3843 GA HARDERWIJK, THE NETHERLANDS
