BWT Indexing: Big Data from Next Generation Sequencing and GPU


GPU Technology Conference 2014 BWT Indexing: Big Data from Next Generation Sequencing and GPU Jeanno Cheung HKU-BGI Bioinformatics Algorithms and Core Technology Research Laboratory University of Hong Kong Core team members: Tak-Wah Lam, Wai-Chun Law, Chi-Man Liu & Ruibang Luo

About BAL
HKU-BGI Bioinformatics Algorithms and Core Technology Research Laboratory
- A research lab established at the University of Hong Kong in collaboration with BGI (then Beijing Genomics Institute).
- Focus on the algorithmics, analytics, and engineering aspects of computing technologies for the enhancement of the throughput and quality of the analysis of next-generation sequencing data.
- Exchange students from BGI.
- Funding: Hong Kong Government, Innovation & Technology Fund.
- Four years of working experience on CUDA.
Selected Software:
- Aligner: SOAP2 (2008), SOAP3 (2011), SOAP3-dp (2013)
- Assembler: SOAPdenovo2 (2012)
- RNA: SOAPsplice (2010), SOAPfusion (2011)

Content
CX1: BWT construction with billions of reads
- In collaboration with NVIDIA
- Preprint in arXiv: GPU-Accelerated BWT Construction for Large Collection of Short Reads
BALSA - Fast and accurate integrated NGS secondary analysis
- WGS from raw reads to variants within hours
- Sensitivity and accuracy in production standard
- Paper in preparation

Indexing genomes
A genome is a long string of characters
- The human genome has 3 billion A/C/G/T's
Indexing a genome allows fast searching of patterns (short strings) within the genome
Application: short-read alignment
Different kinds of indices:
- Suffix trees
- Suffix arrays
- Burrows-Wheeler transform (BWT)

Indexing sequencing data
Billions of short reads
- Typically hundreds of characters in length
Goal: construct the BWT of all the reads (as concatenated into a very long string)
Application: de novo assembly (via string graph)

Data volume
Indexing a human genome: 3 billion chars
Indexing 30-fold human sequencing data: almost 100 billion chars
Existing tools take >12 hours to construct the BWT for 30-fold

Our contribution
A software tool for constructing the BWT of 30-fold human sequencing data (~100 billion chars) in 2 hours
- using a CPU with 4 cores, 64 GB of RAM, and a 4-GB GPU card
Highly scalable
- E.g., using 4 identical machines, the construction time can be shortened to 45 minutes

Burrows-Wheeler transform (BWT)
Short reads (input): ACGA, ATAG, GGTC

Burrows-Wheeler transform (BWT)
Append sentinel character: ACGA$, ATAG$, GGTC$

Burrows-Wheeler transform (BWT)
List all suffixes
- ACGA$: ACGA$, CGA$, GA$, A$, $
- ATAG$: ATAG$, TAG$, AG$, G$, $
- GGTC$: GGTC$, GTC$, TC$, C$, $

Burrows-Wheeler transform (BWT)
Sort all suffixes lexicographically:
$, $, $, A$, ACGA$, AG$, ATAG$, C$, CGA$, G$, GA$, GGTC$, GTC$, TAG$, TC$

Burrows-Wheeler transform (BWT)
Find the character preceding each suffix (shown here prepended to the suffix; a whole read is preceded by $):
A$, G$, C$, GA$, $ACGA$, TAG$, $ATAG$, TC$, ACGA$, AG$, CGA$, $GGTC$, GGTC$, ATAG$, GTC$
Resulting BWT: AGCG$T$TAAC$GAG
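
The toy construction on these slides can be reproduced with a few lines of Python. This is only a minimal in-memory sketch of the definition, not the GPU tool described in the talk. The function name `bwt_of_reads` is hypothetical, and equal suffixes (such as the three `$` suffixes) are assumed to be ordered by read order, which Python's stable sort provides.

```python
def bwt_of_reads(reads):
    """Naive BWT of a read collection: sort all suffixes, emit preceding chars."""
    entries = []
    for read in reads:
        text = read + "$"  # append the sentinel character
        for i in range(len(text)):
            # The whole read has no preceding character; use '$' as on the slide.
            preceding = text[i - 1] if i > 0 else "$"
            entries.append((text[i:], preceding))
    # '$' sorts before A/C/G/T in ASCII, and the sort is stable for equal suffixes.
    entries.sort(key=lambda entry: entry[0])
    return "".join(preceding for _, preceding in entries)


print(bwt_of_reads(["ACGA", "ATAG", "GGTC"]))  # prints AGCG$T$TAAC$GAG
```

Storing every suffix explicitly, as this sketch does, is exactly what becomes infeasible at the scale of billions of reads.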


Our approach
The construction looks simple: sort all suffixes, then output the preceding chars in order
Technical difficulty:
- MANY suffixes: on the order of a hundred billion suffixes, each up to 100 chars long
- Storing all suffixes explicitly requires a prohibitive amount of memory, but sorting without the GPU is slow, and disk-based sorting is SLOWER

Partitioning by prefix
Idea: At any time, only store a subset of suffixes in the memory for sorting. The suffixes are partitioned by the length-L prefix.
For example, if L=1
- list all suffixes starting with $, sort them, and output
- list all suffixes starting with A, sort them, and output
- list all suffixes starting with C, sort them, and output
- list all suffixes starting with G, sort them, and output
- list all suffixes starting with T, sort them, and output

Partitioning by prefix
When L=1, there are 5 prefix partitions
If the largest partition can fit into main memory, we are good
Larger L gives more partitions, so the expected size of the largest partition is smaller; but having more partitions also increases the preprocessing overhead (as in bucket sort)
In practice, we set L=8
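
The partitioning idea can be sketched in Python as follows. This is a toy illustration, not the tool's implementation: the names `bwt_by_prefix_partition` and `count_partitions` are hypothetical, the partitions are discovered by scanning the reads rather than enumerated up front, and ties between equal suffixes are assumed to follow read order. The partition count is my own tally, assuming `$` can only appear as the last character of a prefix.

```python
def count_partitions(L):
    """Possible length-L prefixes: L bases, or fewer than L bases followed by '$'."""
    return 4**L + (4**L - 1) // 3


def bwt_by_prefix_partition(reads, L=1):
    """Build the BWT one prefix partition at a time, holding one partition in memory."""
    texts = [read + "$" for read in reads]
    # Toy shortcut: collect the prefixes that actually occur, in sorted order.
    prefixes = sorted({t[i:i + L] for t in texts for i in range(len(t))})
    out = []
    for prefix in prefixes:
        # List only the suffixes belonging to this partition, then sort them.
        bucket = [
            (t[i:], t[i - 1] if i > 0 else "$")
            for t in texts
            for i in range(len(t))
            if t[i:i + L] == prefix
        ]
        bucket.sort(key=lambda entry: entry[0])
        out.extend(preceding for _, preceding in bucket)
    return "".join(out)


reads = ["ACGA", "ATAG", "GGTC"]
print(bwt_by_prefix_partition(reads, L=1))  # prints AGCG$T$TAAC$GAG
print(count_partitions(1), count_partitions(8))  # prints 5 87381
```

Because partitions are processed in lexicographic order of their prefixes, concatenating the per-partition outputs gives the same BWT as one global sort, while peak memory is bounded by the largest partition.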

GPU radix sort
For sorting suffixes, we use the CUDA radix sort library by Back40Computing
The library sorts 32- and 64-bit integers very fast
How to sort suffixes of 100 chars long?
- Encode suffix compactly into binary form
- A length-100 suffix can be encoded into seven 32-bit words
- Sort the suffixes word-by-word, from least to most significant word (a.k.a. LSB radix sort)
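
One way to picture the encoding and the word-by-word sort is the CPU-only Python sketch below. The slides do not give the bit layout, so everything here is an assumption for illustration: 2 bits per base (A=0, C=1, G=2, T=3), zero padding up to 100 bases, and the suffix length stored in 8 low-order bits so that a shorter suffix sorts before a longer one it prefixes. That layout happens to need 208 bits, which fits in seven 32-bit words. Python's stable `sort` stands in for each GPU radix sort pass; `encode_suffix` and `lsd_sort` are hypothetical names.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit code; '$' is not stored
LEN_BITS = 8  # assumed: length field, enough for suffixes up to 255 bases


def encode_suffix(suffix, max_len=100):
    """Pack one '$'-terminated suffix into 32-bit words, most significant first."""
    body = suffix.rstrip("$")
    assert len(body) <= max_len
    value = 0
    for k in range(max_len):
        # Pad with 0 past the end; the length field below resolves the ambiguity.
        value = (value << 2) | (CODE[body[k]] if k < len(body) else 0)
    value = (value << LEN_BITS) | len(body)
    n_words = (2 * max_len + LEN_BITS + 31) // 32
    return [(value >> (32 * (n_words - 1 - w))) & 0xFFFFFFFF for w in range(n_words)]


def lsd_sort(keys):
    """Return indices sorted by multi-word key, one stable pass per word."""
    order = list(range(len(keys)))
    n_words = len(keys[0]) if keys else 0
    for w in reversed(range(n_words)):  # least significant word first
        order.sort(key=lambda i: keys[i][w])  # stable, like a radix sort pass
    return order


reads = ["ACGA", "ATAG", "GGTC"]
suffixes = [(r + "$")[i:] for r in reads for i in range(len(r) + 1)]
keys = [encode_suffix(s) for s in suffixes]
print(len(keys[0]))  # prints 7
print([suffixes[i] for i in lsd_sort(keys)])
```

Each pass must be stable for the least-to-most-significant order to work: the final pass on the most significant word decides the order, and earlier passes survive only as tie-breakers.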

Multi-core parallelization
Suppose we have N cores
Divide the input short reads into N roughly equal sets, with each core assigned to one of the sets
Given a partition prefix, each core is responsible for listing all qualified suffixes in its assigned set
Does not scale linearly due to memory contention
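
The division of work among cores might look like the sketch below. It is an illustration of the work split only: the function names are hypothetical, and Python threads are used for brevity even though CPython's global interpreter lock means they will not show a real multi-core speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def list_suffixes(reads, prefix):
    """List (suffix, preceding char) pairs in `reads` whose suffix starts with `prefix`."""
    found = []
    for read in reads:
        text = read + "$"
        for i in range(len(text)):
            if text.startswith(prefix, i):
                found.append((text[i:], text[i - 1] if i > 0 else "$"))
    return found


def list_suffixes_parallel(reads, prefix, n_workers=4):
    """Split the reads into roughly equal sets and scan each set in its own worker."""
    chunk = max(1, -(-len(reads) // n_workers))  # ceiling division
    subsets = [reads[k:k + chunk] for k in range(0, len(reads), chunk)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        parts = list(pool.map(lambda subset: list_suffixes(subset, prefix), subsets))
    # pool.map preserves subset order, so the result matches a sequential scan.
    return [entry for part in parts for entry in part]


print(list_suffixes_parallel(["ACGA", "ATAG", "GGTC"], "A", n_workers=2))
```

The combined list is what would then be handed to the sorting step for that partition.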

Experiments
Intel Core i7 (4 cores used), 64 GB RAM, Nvidia GTX680 with 4 GB video memory
               100M reads   500M reads   1000M reads
BCR [*]             6,141       23,094        46,899
Our software          565        3,108         6,886
Table 1: BWT construction (wall clock) time in seconds. All reads have length 100.
[*] M.J. Bauer, A.J. Cox, G. Rosone. Lightweight algorithms for constructing and inverting the BWT of string collections. Theoretical Computer Science, 483: 134-148, 2013.

Multi-machine parallelization
Observation: Different prefix partitions can be processed simultaneously since they do not depend on each other
Excluding I/O, the computation scales linearly with the number of machines
Experimental result: measured in seconds (time excluding I/O shown in brackets)
# machines   100M reads   500M reads      1000M reads
1            565 (468)    3,108 (2,624)   6,886 (5,882)
2            338 (241)    1,797 (1,317)   3,998 (2,994)
3            266 (169)    1,426 (944)     3,071 (2,067)
4            230 (133)    1,192 (712)     2,584 (1,580)
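
To see how close to linear the scaling is, the speedups can be computed directly from the 1000M-read column of the table above. The short script below only does that arithmetic; the helper name `speedups` is hypothetical.

```python
# Times in seconds for 1000M reads, copied from the table above.
wall_clock = {1: 6886, 2: 3998, 3: 3071, 4: 2584}
compute_only = {1: 5882, 2: 2994, 3: 2067, 4: 1580}  # time excluding I/O


def speedups(times):
    """Speedup of each configuration relative to the single-machine time."""
    base = times[1]
    return {n: base / t for n, t in times.items()}


wall = speedups(wall_clock)
comp = speedups(compute_only)
for n in sorted(wall_clock):
    print(f"{n} machine(s): wall-clock x{wall[n]:.2f}, "
          f"excluding I/O x{comp[n]:.2f}, efficiency {comp[n] / n:.0%}")
```

With 4 machines the speedup is about 3.7x when I/O is excluded and about 2.7x in wall-clock time, so the I/O portion is what keeps the end-to-end figure below linear.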

Multi-machine parallelization
Observation: Different prefix partitions can be processed simultaneously since they do not depend on each other
For example, if prefix length L=1 and we have 2 machines, we can let the first machine process partitions $, A, C and the second machine process partitions G, T
Scales linearly with the number of machines, except for a little overhead
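
A minimal sketch of such an assignment, assuming partitions are simply split into contiguous blocks of near-equal count. The slides do not say how the tool balances the load, and a real scheduler would more likely balance by partition size; `assign_partitions` is a hypothetical name.

```python
def assign_partitions(partitions, n_machines):
    """Give each machine a contiguous block of partitions (balanced by count only)."""
    per_machine = -(-len(partitions) // n_machines)  # ceiling division
    return [
        partitions[k * per_machine:(k + 1) * per_machine]
        for k in range(n_machines)
    ]


print(assign_partitions(["$", "A", "C", "G", "T"], 2))
# prints [['$', 'A', 'C'], ['G', 'T']]
```

Keeping each machine's partitions contiguous means the per-machine outputs can be concatenated in machine order to obtain the final BWT.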

Experiments: Multi-machine
Each machine: Intel Core i7 (4 cores used), 64 GB RAM, Nvidia GTX680 with 4 GB video memory
# machines   100M reads   500M reads   1000M reads
1                   565        3,108         6,886
2                   338        1,797         3,998
3                   266        1,426         3,071
4                   230        1,192         2,584
Table 2: BWT construction (wall clock) time in seconds by our software using multiple machines. All short reads have length 100.

BALSA
Based on SOAP3-dp.
Whole secondary analysis (input: raw reads; output: variants) in memory, with most of the modules accelerated with GPU.
Spec. of a node: 1 x E5-2620 v2, 1 x Tesla K40c, 64G Host Memory, 1 x 600G SAS HDD
Performance per node so far: 6 hours/WGS (1.5k WGS/yr.), 30 mins/WES (18k WES/yr.)
Space-efficient lossless Snapshot and Database storage schema to displace the file-based BAM and CRAM formats, for instant query and column-based visit.

Performance, raw reads to variants, WGS 50-fold 100bp (150 Gigabases)
[Bar chart; y-axis: Hours; x-axis: Configurations]
Labels, in the order transcribed: BWAaln+GATK, BWAmem+GATK, SOAP3-dp+GATK, isaac, BALSA, BALSA (Expect 2015), Magnetic Resonance Imaging
Bar values (hours), in the order transcribed: 88.00, 48.68, 46.27, 11.92, 5.49, 2.00, 0.75
Note: BWA v0.7.5a (Li et al., 2012); GATK v3.0-1 (DePristo et al., 2011); SOAP3-dp r176 (Luo et al., 2013); isaac-01.13.09.17+iSAAC variant caller v1.06 (Raczy et al., 2013); BALSA r128

30-fold 100bp Simulated Data with known SNPs and Indels
[Two Venn diagrams, SNP Calling and Indel Calling, each comparing Truth, BALSA and isaac]
Region counts, in the order transcribed: 39,438 18,260 48,399 2,770,408 896 3,793 276,502 749 BALSA 2,881 276 38 Isaac BALSA 523 432 159 Isaac
Note: isaac-01.13.09.17+iSAAC variant caller v1.06 (Raczy et al., 2013); BALSA r128

SNP Calling: Other Individual Callers
[Six Venn diagrams, each comparing Truth, BALSA and one other caller: Atlas, Freebayes, HC, Samtools, UG, Mutect]
Region counts, first row, in the order transcribed: 39,258 38,119 39,382 4,622 1,076 2,814,185 2,215 25,202 2,793,605 8,432 2,810,375 952 1,481 BALSA 1,676 9,375 Atlas 2,399 BALSA 758 1,676 Freebayes 2,450 BALSA 707 3,314 HC
Region counts, second row, in the order transcribed: 39,439 38,705 37,086 12,981 2,805,826 895 4,590 2,814,217 1,629 214 2,818,593 3,248 2,439 BALSA 718 62 Samtools 1,545 BALSA 1,612 311 UG 649 BALSA 2,508 13,737 Mutect
Note: All tools processed and filtered with the guide of best practice. Mutect was designed for SNV calling, but can be used for SNP calling with normal sample absent. Atlas2 v1.4.3.r158 (Shen et al., 2010, Challis et al., 2012), GATK HaplotypeCaller and UnifiedGenotyper v2.8-1-g932cd3a (DePristo et al., 2011), Freebayes v0.9.9 (Garrison et al., 2012), Mutect v1.1.4 (Cibulskis et al., 2013), Samtools v0.1.19 (Li et al., 2013) and VarScan v2.3.5 (Koboldt et al., 2012).

Combining the best results from 6 individual callers
[Two Venn diagrams, SNP Calling and Indel Calling, each comparing Truth, BALSA and Ensemble, with annotated percentages]
Values, in the order transcribed: BALSA 928 1,284 71.03% Qual.<20 38,178 2,817,879 1,873 2,156 4,082 Ensemble BALSA 118 51 5,197 96.89% DP<10 51.88% DP<10 74.77% DP<10 34.90% Length 100 71.62% DP<10 94.12% MAF<0.3, DP 10 87.77% MAF<0.3, DP 10 0.39% Qual.<20 34.76% Qual.<20 280,244 0.86% Qual.<20 837 8.60% Qual.<20 61.01% Qual.<20 2,241 2,120 Ensemble

Acknowledgement
Prof. Tak-Wah Lam, Mr. Ruibang Luo
BWT Team: Mr. Chi-Man Liu
BALSA Team: Mr. Victor Wong, Mr. Wai-Chun Law
BAL Members: Dr. Sze-Hang Chan, Dr. Ricky Ma, Mr. Dinghua Li, Ms. Min Ou

Thanks