Super-Fast Genome BWA-Bam-Sort on GLAD

Similar documents
Decrypting your genome data privately in the cloud

Galaxy workshop at the Winter School Igor Makunin

Mapping NGS reads for genomics studies

Supplementary Information. Detecting and annotating genetic variations using the HugeSeq pipeline

Aeromancer: A Workflow Manager for Large- Scale MapReduce-Based Scientific Workflows

Bioinformatics Framework

WM2 Bioinformatics. ExomeSeq data analysis part 1. Dietmar Rieder

A Comprehensive Study on the Performance of Implicit LS-DYNA

Ensembl RNASeq Practical. Overview

Accelrys Pipeline Pilot and HP ProLiant servers

ELPREP PERFORMANCE ACROSS PROGRAMMING LANGUAGES PASCAL COSTANZA CHARLOTTE HERZEEL FOSDEM, BRUSSELS, BELGIUM, FEBRUARY 3, 2018

Processing Genomics Data: High Performance Computing meets Big Data. Jan Fostier

Halvade: scalable sequence analysis with MapReduce

Copy Number Variations Detection - TD. Using Sequenza under Galaxy

Sequence Mapping and Assembly

Helpful Galaxy screencasts are available at:

Running SNAP. The SNAP Team October 2012

Variation among genomes

Running SNAP. The SNAP Team February 2012

Falcon Accelerated Genomics Data Analysis Solutions. User Guide

Lecture 12. Short read aligners

Maize genome sequence in FASTA format. Gene annotation file in gff format

Users and utilization of CERIT-SC infrastructure

Galaxy. Data intensive biology for everyone. / #usegalaxy

NA12878 Platinum Genome GENALICE MAP Analysis Report

REPORT. NA12878 Platinum Genome. GENALICE MAP Analysis Report. Bas Tolhuis, PhD GENALICE B.V.

Best Practices for Illumina Genome Analysis Based on Huawei OceanStor 9000 Big Data Storage System. Huawei Technologies Co., Ltd.

Kelly et al. Genome Biology (2015) 16:6 DOI /s x. * Correspondence:

Scalable RNA Sequencing on Clusters of Multicore Processors

Using MPI One-sided Communication to Accelerate Bioinformatics Applications

Performance analysis of parallel de novo genome assembly in shared memory system

From fastq to vcf. NGG 2016 / Evolutionary Genomics Ari Löytynoja /

Accelerating String Matching Algorithms on Multicore Processors Cheng-Hung Lin

Gearing Hadoop towards HPC sytems

SAP HANA. Jake Klein/ SVP SAP HANA June, 2013

SEASHORE / SARUMAN. Short Read Matching using GPU Programming. Tobias Jakobi

PRACTICAL SESSION 5 GOTCLOUD ALIGNMENT WITH BWA JAN 7 TH, 2014 STOM 2014 WORKSHOP HYUN MIN KANG UNIVERSITY OF MICHIGAN, ANN ARBOR

Genomic Data Analysis Services Available for PL-Grid Users

DRAGEN Bio-IT Platform Enabling the Global Genomic Infrastructure

Preparation of alignments for variant calling with GATK: exercise instructions for BioHPC Lab computers

NGS Analysis Using Galaxy

Massively Parallel Processing of Whole Genome Sequence Data: An In-Depth Performance Study

Financed by the European Commission 7 th Framework Programme. biobankcloud.eu. Jim Dowling, PhD Assoc. Prof, KTH Project Coordinator

Analyzing massive genomics datasets using Databricks Frank Austin Nothaft,

GPUBwa -Parallelization of Burrows Wheeler Aligner using Graphical Processing Units

Calling variants in diploid or multiploid genomes

Using the Galaxy Local Bioinformatics Cloud at CARC

SurFS Product Description

New Approach to Unstructured Data

Reads Alignment and Variant Calling

Genome 373: Mapping Short Sequence Reads III. Doug Fowler

Hinri Kerstens. NGS pipeline using Broad's Cromwell

Sep. Guide. Edico Genome Corp North Torrey Pines Court, Plaza Level, La Jolla, CA 92037

ls /data/atrnaseq/ egrep "(fastq fasta fq fa)\.gz" ls /data/atrnaseq/ egrep "(cn ts)[1-3]ln[^3a-za-z]\."

Galaxy Platform For NGS Data Analyses

Genomes On The Cloud GotCloud. University of Michigan Center for Statistical Genetics Mary Kate Wing Goo Jun

MAPREDUCE FOR BIG DATA PROCESSING BASED ON NETWORK TRAFFIC PERFORMANCE Rajeshwari Adrakatti

Exome sequencing. Jong Kyoung Kim

CBSU/3CPG/CVG Joint Workshop Series Reference genome based sequence variation detection

Machine Learning In A Snap. Thomas Parnell Research Staff Member IBM Research - Zurich

RNA-seq. Manpreet S. Katari

Using Galaxy to provide a NGS Analysis Platform

Welcome to MAPHiTS (Mapping Analysis Pipeline for High-Throughput Sequences) tutorial page.

Variant Calling and Filtering for SNPs

JULIA ENABLED COMPUTATION OF MOLECULAR LIBRARY COMPLEXITY IN DNA SEQUENCING

NGS Data Visualization and Exploration Using IGV

Bioinformatics in next generation sequencing projects

ZFS for NGS data analysis

Practical exercises Day 2. Variant Calling

<Insert Picture Here> Introducing Oracle WebLogic Server on Oracle Database Appliance

Amazon Web Services. Foundational Services for Research Computing. April Mike Kuentz, WWPS Solutions Architect

CORE Year 1 Whole Genome Sequencing Final Data Format Requirements

SlopMap: a software application tool for quick and flexible identification of similar sequences using exact k-mer matching

Handling sam and vcf data, quality control

How to store and visualize RNA-seq data

IBM Data Engine for Genomics Power Systems Edition

elprep: a high- performance tool for preparing SAM/BAM files for variant calling Charlo<e Herzeel (Imec) Pascal Costanza (Intel) July 2014

Memory Footprint of Locality Information On Many-Core Platforms Brice Goglin Inria Bordeaux Sud-Ouest France 2018/05/25

IT Infrastructure: Poised for Change

DNA Sequencing analysis on Artemis

ITMO Ecole de Bioinformatique Hands-on session: smallrna-seq N. Servant 21 rd November 2013

Data Protection for Cisco HyperFlex with Veeam Availability Suite. Solution Overview Cisco Public

Data Protection at Cloud Scale. A reference architecture for VMware Data Protection using CommVault Simpana IntelliSnap and Dell Compellent Storage

Veeam Availability Solution for Cisco UCS: Designed for Virtualized Environments. Solution Overview Cisco Public

A U G U S T 8, S A N T A C L A R A, C A

Tutorial. Identification of Variants Using GATK. Sample to Insight. November 21, 2017

(software agnostic) Computational Considerations

DataON and Intel Select Hyper-Converged Infrastructure (HCI) Maximizes IOPS Performance for Windows Server Software-Defined Storage

Emerging Technologies for HPC Storage

An Oracle White Paper June Enterprise Database Cloud Deployment with Oracle SuperCluster T5-8

Accelerating Data Center Workloads with FPGAs

Introduction to NGS analysis on a Raspberry Pi. Beta version 1.1 (04 June 2013)

Storage on the Lunatic Fringe. Thomas M. Ruwart University of Minnesota Digital Technology Center Intelligent Storage Consortium

StreamBox: Modern Stream Processing on a Multicore Machine

ChIP-seq (NGS) Data Formats

Accelerating Enterprise Search with Fusion iomemory PCIe Application Accelerators

de.nbi and its Galaxy interface for RNA-Seq

Burrows-Wheeler Short Read Aligner on AWS EC2 F1 Instances

Hierarchical PLABs, CLABs, TLABs in Hotspot

NBIC Cloud. Mattias de Hollander David van Enckevort Leon Mei Rob Hooft

Transcription:

1 Hututa Technologies Limited Super-Fast Genome BWA-Bam-Sort on GLAD Zhiqiang Ma, Wangjun Lv and Lin Gu May 2016 1

2 Executive Summary Aligning the sequenced reads in FASTQ files and converting the resulted SAM file into a sorted BAM file are a common practice and often the first stage of many genome analysis pipelines. Software such as BWA and Samtools has been developed by the bioinformatics community to achieve this on single computers. Although those tools usually can make use of multi-threading technologies to use multiple CPU cores in modern computers, the time used is still so long as hours, taking a large part of the whole analysis pipeline and limiting the overall speed of pipeline executions. As the application area of genome analysis is larger and larger, the analysis should finish in minutes instead of hours. GLAD is a genome analysis system that makes use of the resources of many compute nodes in a cluster from private or public clouds to significantly accelerate the gnome analysis such as FASTQ to sorted BAM generating which we call BWA-BAM-Sort or BBS in this white paper. With a flexible programming interface, GLAD does not require modifications to software such as BWA and Samtools to run on GLAD. This white paper describes performance results of BBS computation on GLAD in clusters of compute nodes in private or public clouds. The BBS step s time can be shortened to minutes from hours even with a cluster of a moderate size. 2

3 Background Our products for fast large-scale genome storage and analysis consist of the GLAD genome analysis software and the Genes Mind ready-to-use genome storage and analysis appliance. GLAD: Super-fast software for parallel genome data analysis The development of cloud computing makes it easy to own or use a private cloud or resources consisting of many compute nodes in public clouds at affordable prices. However, the fact that the traditional software such as BWA and Samtools was designed for single computers limits the ability of using resources such as CPU cores from a cluster of nodes. On the other hand, big data technologies, such as Data Thinker, have been developed in recent years to integrate many resources from many nodes to a single system for fast processing of large data. Powered by D-thinker parallelization, GLAD speeds up processing by 10-100 times on normal PCs. When running GATK pipelines, GLAD reduces data processing time from days to hours. FIGURE 1: BWA MEM ON GLAD GLAD helps biomedical scientists and practitioners store PBs of genome data safely and process them 100 times faster than before. Supporting standard software such as BWA and GATK, the GLAD technology is used in global projects including ICGC-PanCancer. 3

4 FIGURE 2: GATK-BASED VARIANTS CALLING ON GLAD Genes Mind: Inexpensive hardware for storing and processing genome data Genes' Mind packages multiple compute servers, 10s-100s of processor cores, 100s of TBs of distributed storage and genome data processing software as a ready-to-use genome data storage and processing system, all at a very affordable price. FIGURE 3: GENES' MIND GENOME STORAGE AND PROCESSING APPLIANCE 4

5 Genes Mind has pre-installed big data and genome analysis systems such as Data Thinker and GLAD. It can also double-duty as an internal private cloud system for an organization, providing storage and virtual machines for R&D and office work. 5

6 BBS on GLAD in Private and Public Cloud We use GLAD to speed up BBS computation on a dataset consisting of a pair of FASTQ files on 3 different clusters and compare the performance of BBS on GLAD and a single compute node. Dataset The input dataset, named S81, for the performance measurement is a human whole exome data. The S81 dataset comprises two FAQSTQ files of 4.8GB each (9.6GB in total). /thinker/dstore/world/gene/sapiens/study/exome/s81/s81_1.fastq /thinker/dstore/world/gene/sapiens/study/exome/s81/s81_2.fastq GLAD uses D-store, a distributed file system scaling to tens of petabytes, to store genome data and organize them as a file system hierarchy visible to authorized people around the world. Clusters We use 3 clusters for performance measurement of BBS on the S81 dataset. All these 3 clusters are installed with the GLAD/D-store/D-thinker systems. tbg6: a small cluster with 8 inexpensive nodes inside of Hututa s private cloud. Each node has 4 Intel Core i3 CPU cores and 32GB memory. gm15: a Genes Mind appliance with 5 inexpensive nodes. Each compute node has 8 Intel Xeon E3 CPU cores and 32GB memory. It is a shared environment for development and testing from universities and cooperators. ksc: a cluster in the Kingsoft Cloud, a public cloud for business and developers. The ksc cluster consists of 4 VMs allocated in the public cloud. Each VM has 8 Intel Xeon E5 CPU cores and 16GB memory. Performance measurement In each cluster, we run BBS on GLAD and on a single compute node and measure the time used to finish the analysis. Multithreaded BBS computation: run native BWA and Samtools tools to do BBS computation on a single compute node. The computation uses multiple threads to utilize the multithreading capability of native BWA and Samtools tools to make full use of the CPU cores on a single node. BBS computation on GLAD: run a BBS computation on GLAD. The computation is run in many containers on the compute nodes in the cluster used. GLAD runs unmodified BWA and Samtools in parallel on the compute nodes in the cluster. Performance results We measure and draw the time used for the BBS computation on the S81 datasets on these 3 clusters using GLAD and multi-threads on a compute node. Figure 4 shows the times used on all 3 clusters. 6

7 FIGURE 4: BBS EXECUTION TIME ON DIFFERENT CLUSTERS On all 3 clusters, GLAD can finish the BBS computation of the S81 data in 8 to 16 minutes, while the multithreaded version takes 26 minutes to one and half hours. GLAD makes use of the CPU cores in the cluster to make BBS analysis super-fast, speeding up the BBS by 3 6 times in these clusters of 4-8 nodes. In the ksc cluster in the public cloud, GLAD even shows super-linear speedup. Conclusions GLAD, built on the super-fast Data Thinker technology, makes use of the compute power from many compute nodes in a cluster. In the performance comparison of multithreaded and GLAD-based BBS computation on all 3 different clusters with different configurations, GLAD delivers great performance and greatly accelerates the BBS computation. GLAD can also scale to larges clusters consisting of tens to hundreds of compute nodes for even higher performance. 7

8 Read more Readers interested in more details can visit the following online info about GLAD and Data Thinker: GLAD homepage: http://www.hututa.com/en/products/glad.html Data Thinker website: http://d-thinker.org/ GLAD and Data Thinker forum: http://tab.d-thinker.org/forumdisplay.php?fid=515 8