BWT Indexing: Big Data from Next Generation Sequencing and GPU

GPU Technology Conference 2014
Jeanno Cheung
HKU-BGI Bioinformatics Algorithms and Core Technology Research Laboratory, University of Hong Kong
Core team members: Tak-Wah Lam, Wai-Chun Law, Chi-Man Liu & Ruibang Luo

About BAL
HKU-BGI Bioinformatics Algorithms and Core Technology Research Laboratory
- A research lab established at the University of Hong Kong in collaboration with BGI (then Beijing Genomics Institute).
- Focus on the algorithmics, analytics, and engineering aspects of computing technologies for enhancing the throughput and quality of next-generation sequencing data analysis.
- Exchange students from BGI.
- Funding: Hong Kong Government, Innovation & Technology Fund.
- Four years of working experience with CUDA.
Selected software:
- Aligner: SOAP2 (2008), SOAP3 (2011), SOAP3-dp (2013)
- Assembler: SOAPdenovo2 (2012)
- RNA: SOAPsplice (2010), SOAPfusion (2011)

Content
CX1: BWT construction with billions of reads
- In collaboration with NVIDIA
- Preprint on arXiv: GPU-Accelerated BWT Construction for Large Collection of Short Reads
BALSA: fast and accurate integrated NGS secondary analysis
- WGS from raw reads to variants within hours
- Production-standard sensitivity and accuracy
- Paper in preparation

Indexing genomes
A genome is a long string of characters
- The human genome has 3 billion A/C/G/T's
Indexing a genome allows fast searching of patterns (short strings) within the genome
Application: short-read alignment
Different kinds of indices:
- Suffix trees
- Suffix arrays
- Burrows-Wheeler transform (BWT)
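
To make "fast searching of patterns" concrete, here is a minimal sketch of the simplest index listed above, a suffix array, queried by binary search. The function names and the toy text are illustrative assumptions, not taken from the talk, and the naive construction is only suitable for tiny inputs.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text in sorted order (naive)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return all start positions of pattern in text using binary search on sa."""
    m = len(pattern)
    # Lower bound: first suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # Upper bound: first suffix whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


toy_genome = "ACGATAGGGTC"  # hypothetical toy text
sa = build_suffix_array(toy_genome)
print(find_occurrences(toy_genome, sa, "G"))
```

Each query inspects only O(log n) suffixes instead of scanning the whole text, which is the property that makes indexing worthwhile for short-read alignment.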

Indexing sequencing data
Billions of short reads
- Typically hundreds of characters in length
Goal: construct the BWT of all the reads (concatenated into one very long string)
Application: de novo assembly (via string graph)

Data volume
Indexing a human genome: 3 billion chars
Indexing 30-fold human sequencing data: almost 100 billion chars
Existing tools take >12 hours to construct the BWT for 30-fold data

Our contribution
A software tool for constructing the BWT of 30-fold human sequencing data (~100 billion chars) in 2 hours
- using a CPU with 4 cores, 64 GB of RAM, and a 4-GB GPU card
Highly scalable
- E.g., using 4 identical machines, the construction time can be shortened to 45 minutes

Burrows-Wheeler transform (BWT)
Short reads (input)
ACGA
ATAG
GGTC

Burrows-Wheeler transform (BWT)
Append sentinel character
ACGA$
ATAG$
GGTC$

Burrows-Wheeler transform (BWT)
List all suffixes
- ACGA$: ACGA$, CGA$, GA$, A$, $
- ATAG$: ATAG$, TAG$, AG$, G$, $
- GGTC$: GGTC$, GTC$, TC$, C$, $

Burrows-Wheeler transform (BWT)
Sort all suffixes lexicographically
$, $, $, A$, ACGA$, AG$, ATAG$, C$, CGA$, G$, GA$, GGTC$, GTC$, TAG$, TC$

Burrows-Wheeler transform (BWT)
Find the character preceding each suffix (shown in front of the suffix; a suffix that starts a read is preceded by $)
A$, G$, C$, GA$, $ACGA$, TAG$, $ATAG$, TC$, ACGA$, AG$, CGA$, $GGTC$, GGTC$, ATAG$, GTC$
Resulting BWT: AGCG$T$TAAC$GAG
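
The four steps on the preceding slides (append sentinel, list suffixes, sort, take preceding characters) can be sketched directly in Python. This is a naive illustration for small inputs, not the tool described in the talk. The function name is hypothetical, and breaking ties between identical suffixes by read order is an assumption that matches the slide's example.

```python
def bwt_of_reads(reads):
    """Naive BWT of a read collection: sort all suffixes, output preceding chars."""
    entries = []
    for read_id, read in enumerate(reads):
        text = read + "$"  # append the sentinel character
        for start in range(len(text)):
            # A suffix that begins a read is treated as preceded by '$'.
            preceding = text[start - 1] if start > 0 else "$"
            entries.append((text[start:], read_id, preceding))
    # '$' sorts before A/C/G/T in ASCII; ties are broken by read order (assumption).
    entries.sort(key=lambda e: (e[0], e[1]))
    return "".join(e[2] for e in entries)


print(bwt_of_reads(["ACGA", "ATAG", "GGTC"]))  # prints AGCG$T$TAAC$GAG
```

This version stores every suffix explicitly, which is exactly what becomes infeasible at the scale of billions of reads.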


Our approach
The construction looks simple: sort all suffixes, then output the preceding chars in order
Technical difficulty:
- MANY suffixes: on the order of 100 billion suffixes, each up to 100 chars long
- Storing all suffixes explicitly requires a prohibitive amount of memory
- Sorting without the GPU is slow, and disk-based sorting is even slower

Partitioning by prefix
Idea: At any time, only store a subset of the suffixes in memory for sorting. The suffixes are partitioned by their length-L prefix.
For example, if L=1
- list all suffixes starting with $, sort them, and output
- list all suffixes starting with A, sort them, and output
- list all suffixes starting with C, sort them, and output
- list all suffixes starting with G, sort them, and output
- list all suffixes starting with T, sort them, and output

Partitioning by prefix
When L=1, there are 5 prefix partitions
If the largest partition fits into main memory, we are good
A larger L gives more partitions, so the expected size of the largest partition is smaller; but having more partitions also increases the preprocessing overhead (as in bucket sort)
In practice, we set L=8
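
The partitioning idea can be sketched as follows: for each length-L prefix in sorted order, rescan the reads, keep only the suffixes with that prefix, sort that bucket, and emit its preceding characters. This is a simplified single-threaded illustration with hypothetical names. Collecting the set of prefixes in an extra pass is an assumption made for brevity.

```python
def bwt_by_prefix_partition(reads, L=1):
    """Build the BWT one prefix partition at a time, so only one bucket is in memory."""
    texts = [read + "$" for read in reads]
    # All distinct length-L prefixes (suffixes shorter than L form their own keys).
    prefixes = sorted({t[i:i + L] for t in texts for i in range(len(t))})
    output = []
    for prefix in prefixes:
        bucket = []
        for read_id, t in enumerate(texts):
            for i in range(len(t)):
                if t[i:i + L] == prefix:
                    preceding = t[i - 1] if i > 0 else "$"
                    bucket.append((t[i:], read_id, preceding))
        bucket.sort(key=lambda e: (e[0], e[1]))  # sort just this partition
        output.extend(e[2] for e in bucket)
    return "".join(output)


print(bwt_by_prefix_partition(["ACGA", "ATAG", "GGTC"], L=1))
```

Because truncating to the first L characters preserves lexicographic order, concatenating the sorted buckets in prefix order gives the same result as one global sort, for any L.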

GPU radix sort
For sorting suffixes, we use the CUDA radix sort library by back40computing
The library sorts 32- and 64-bit integers very fast
How to sort suffixes that are 100 chars long?
- Encode each suffix compactly into binary form
- A length-100 suffix can be encoded into seven 32-bit words
- Sort the suffixes word by word, from the least to the most significant word (a.k.a. LSD radix sort)
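
A CPU-side sketch of the word-by-word scheme is given below. Seven words for 100 characters is consistent with 2 bits per base, so that packing is assumed here. How the real tool represents the sentinel is not stated on the slide; this sketch pads with zeros and uses the suffix length as an extra least significant key, which is an assumption. Python's stable `list.sort` stands in for one GPU radix sort pass over 32-bit keys.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES_PER_WORD = 16  # 2 bits per base in a 32-bit word (assumed packing)


def encode(suffix, max_len):
    """Pack a suffix (without '$') into 32-bit words, most significant word first."""
    n_words = -(-max_len // BASES_PER_WORD)  # ceiling division
    words = []
    for w in range(n_words):
        chunk = suffix[w * BASES_PER_WORD:(w + 1) * BASES_PER_WORD]
        value = 0
        for ch in chunk:
            value = (value << 2) | CODE[ch]
        value <<= 2 * (BASES_PER_WORD - len(chunk))  # left-align, pad with zero bits
        words.append(value)
    return words


def lsd_sort_suffixes(suffixes, max_len):
    """Sort suffixes as if each ended in '$', using stable passes from least to most significant key."""
    keyed = [(encode(s, max_len), len(s), s) for s in suffixes]
    # Least significant key: length. Zero padding makes "A" and "AA" encode alike,
    # and the shorter one must come first because '$' sorts before 'A'.
    keyed.sort(key=lambda k: k[1])
    n_words = -(-max_len // BASES_PER_WORD)
    for w in reversed(range(n_words)):  # last word first, first word last
        keyed.sort(key=lambda k: k[0][w])
    return [k[2] for k in keyed]


print(len(encode("A" * 100, 100)))  # prints 7
print(lsd_sort_suffixes(["GA", "A", "ACGA", "", "TAG", "AA", "C"], 4))
```

The approach relies on every pass being stable: each pass must preserve the order established by the less significant keys, which is a property radix sort provides.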

Multi-core parallelization
Suppose we have N cores
Divide the input short reads into N roughly equal sets, with each core assigned to one of the sets
Given a partition prefix, each core is responsible for listing all qualifying suffixes in its assigned set
Does not scale linearly due to memory contention
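
The division of work can be sketched with a thread pool: the reads are split into N sets, and each worker lists the suffixes in its own set that start with the current partition prefix. The names are hypothetical and the round-robin split is an assumption. CPython threads do not run this CPU-bound loop truly in parallel, so the sketch shows only the structure, not the speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def list_suffixes(chunk, prefix):
    """List (suffix, read_id, preceding_char) for suffixes in chunk that start with prefix."""
    found = []
    for read_id, read in chunk:
        text = read + "$"
        for i in range(len(text)):
            if text.startswith(prefix, i):
                found.append((text[i:], read_id, text[i - 1] if i > 0 else "$"))
    return found


def collect_partition(reads, prefix, n_workers=4):
    """Gather and sort one prefix partition, with the reads split across workers."""
    indexed = list(enumerate(reads))
    # Round-robin split into n_workers roughly equal sets.
    chunks = [indexed[w::n_workers] for w in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        parts = list(pool.map(list_suffixes, chunks, [prefix] * n_workers))
    merged = [entry for part in parts for entry in part]
    merged.sort(key=lambda e: (e[0], e[1]))
    return merged


partition_a = collect_partition(["ACGA", "ATAG", "GGTC"], "A", n_workers=2)
print("".join(entry[2] for entry in partition_a))
```

For the example reads and prefix A, the workers together find A$, ACGA$, AG$ and ATAG$, whose preceding characters are the corresponding slice of the full BWT.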

Experiments
Intel Core i7 (4 cores used), 64 GB RAM, Nvidia GTX680 with 4 GB video memory
               100M reads   500M reads   1000M reads
BCR [*]             6,141       23,094        46,899
Our software          565        3,108         6,886
Table 1: BWT construction (wall clock) time in seconds. All reads have length 100.
[*] M.J. Bauer, A.J. Cox, G. Rosone. Lightweight algorithms for constructing and inverting the BWT of string collections. Theoretical Computer Science, 483: 134-148, 2013.

Multi-machine parallelization
Observation: Different prefix partitions can be processed simultaneously since they do not depend on each other
Excluding I/O, the computation scales nearly linearly with the number of machines
Experimental result: measured in seconds (time excluding I/O shown in brackets)
# machines   100M reads   500M reads      1000M reads
1            565 (468)    3,108 (2,624)   6,886 (5,882)
2            338 (241)    1,797 (1,317)   3,998 (2,994)
3            266 (169)    1,426 (944)     3,071 (2,067)
4            230 (133)    1,192 (712)     2,584 (1,580)
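
The scaling claim can be checked with simple arithmetic on the 1000M-read column. The snippet below copies the times from the slide and computes the speedup relative to one machine, with and without I/O. The variable and function names are illustrative.

```python
# Wall-clock seconds for 1000M reads, from the slide: machines -> (total, excluding I/O)
TIMES_1000M = {1: (6886, 5882), 2: (3998, 2994), 3: (3071, 2067), 4: (2584, 1580)}


def speedups(times):
    """Return {machines: (overall speedup, speedup excluding I/O)} relative to 1 machine."""
    base_total, base_compute = times[1]
    return {
        n: (base_total / total, base_compute / compute)
        for n, (total, compute) in times.items()
    }


for n, (s_total, s_compute) in speedups(TIMES_1000M).items():
    io_seconds = TIMES_1000M[n][0] - TIMES_1000M[n][1]
    print(f"{n} machine(s): {s_total:.2f}x overall, "
          f"{s_compute:.2f}x excluding I/O, I/O = {io_seconds} s")
```

On four machines this gives about 2.66x overall and about 3.72x excluding I/O. The I/O time is 1,004 seconds in every row, so it is the fixed I/O cost that keeps the overall speedup below the compute-only speedup.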

Multi-machine parallelization
Observation: Different prefix partitions can be processed simultaneously since they do not depend on each other
For example, if prefix length L=1 and we have 2 machines, we can let the first machine process partitions $, A, C and the second machine process partitions G, T
Scales linearly with the number of machines, except for a little overhead
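
A minimal sketch of such an assignment splits the sorted prefix list into contiguous groups, one per machine. The function name is hypothetical, and splitting by partition count rather than by partition size is an assumption; the talk does not say how the real tool balances load.

```python
def assign_partitions(prefixes, n_machines):
    """Split the sorted prefixes into n_machines contiguous groups of near-equal count."""
    prefixes = sorted(prefixes)
    base, extra = divmod(len(prefixes), n_machines)
    groups, start = [], 0
    for machine in range(n_machines):
        size = base + (1 if machine < extra else 0)  # first 'extra' machines get one more
        groups.append(prefixes[start:start + size])
        start += size
    return groups


print(assign_partitions(["$", "A", "C", "G", "T"], 2))  # prints [['$', 'A', 'C'], ['G', 'T']]
```

Keeping each machine's partitions contiguous means the per-machine outputs can simply be concatenated in machine order to form the final BWT.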

Experiments: Multi-machine
Each machine: Intel Core i7 (4 cores used), 64 GB RAM, Nvidia GTX680 with 4 GB video memory
# machines   100M reads   500M reads   1000M reads
1                   565        3,108         6,886
2                   338        1,797         3,998
3                   266        1,426         3,071
4                   230        1,192         2,584
Table 2: BWT construction (wall clock) time in seconds by our software using multiple machines. All short reads have length 100.

BALSA
Based on SOAP3-dp.
Whole secondary analysis (input: raw reads; output: variants) in memory, with most of the modules accelerated by GPU.
Spec. of a node:
- 1 x E5-2620 v2
- 1 x Tesla K40c
- 64 GB host memory
- 1 x 600 GB SAS HDD
Performance per node so far:
- 6 hours/WGS (1.5k WGS/yr.)
- 30 mins/WES (18k WES/yr.)
Space-efficient lossless Snapshot and Database storage schema to replace the file-based BAM and CRAM formats, for instant query and column-based access.

Performance, raw reads to variants, WGS 50-fold 100bp (150 Gigabases)
[Bar chart. Y-axis: Hours. X-axis: Configurations. Labels as extracted: BWAaln+GATK, BWAmem+GATK, SOAP3-dp+GATK, isaac, BALSA, BALSA (expected 2015), Magnetic Resonance Imaging. Values as extracted, in order: 88.00, 48.68, 46.27, 11.92, 5.49, 2.00, 0.75.]
Note: BWA v0.7.5a (Li et al., 2012); GATK v3.0-1 (DePristo et al., 2011); SOAP3-dp r176 (Luo et al., 2013); isaac-01.13.09.17 + iSAAC variant caller v1.06 (Raczy et al., 2013); BALSA r128

30-fold 100bp simulated data with known SNPs and indels
[Figure: two overlap diagrams, "SNP Calling" and "Indel Calling", each comparing Truth, BALSA and isaac. Counts as extracted, not assigned to regions: 39,438; 18,260; 48,399; 2,770,408; 896; 3,793; 276,502; 749; 2,881; 276; 38; 523; 432; 159.]
Note: isaac-01.13.09.17 + iSAAC variant caller v1.06 (Raczy et al., 2013); BALSA r128

SNP Calling: Other Individual Callers
[Figure: six overlap diagrams, each comparing Truth, BALSA and one other caller: Atlas, Freebayes, HC, Samtools, UG, Mutect. Counts as extracted, not assigned to regions. First row of panels (Atlas, Freebayes, HC): 39,258; 38,119; 39,382; 4,622; 1,076; 2,814,185; 2,215; 25,202; 2,793,605; 8,432; 2,810,375; 952; 1,481; 1,676; 9,375; 2,399; 758; 1,676; 2,450; 707; 3,314. Second row of panels (Samtools, UG, Mutect): 39,439; 38,705; 37,086; 12,981; 2,805,826; 895; 4,590; 2,814,217; 1,629; 214; 2,818,593; 3,248; 2,439; 718; 62; 1,545; 1,612; 311; 649; 2,508; 13,737.]
Note: All tools were run and their output filtered following best-practice guidelines. Mutect was designed for SNV calling, but can be used for SNP calling when no normal sample is available. Atlas2 v1.4.3.r158 (Shen et al., 2010; Challis et al., 2012), GATK HaplotypeCaller and UnifiedGenotyper v2.8-1-g932cd3a (DePristo et al., 2011), Freebayes v0.9.9 (Garrison et al., 2012), Mutect v1.1.4 (Cibulskis et al., 2013), Samtools v0.1.19 (Li et al., 2013) and VarScan v2.3.5 (Koboldt et al., 2012).

Combining the best results from 6 individual callers
[Figure: two overlap diagrams, "SNP Calling" and "Indel Calling", each comparing Truth, BALSA and Ensemble, with percentage annotations labelled by filter. Counts as extracted, not assigned to regions: 928; 1,284; 38,178; 2,817,879; 1,873; 2,156; 4,082; 118; 51; 5,197; 280,244; 837; 2,241; 2,120. Annotations as extracted: 71.03% Qual.<20; 96.89% DP<10; 51.88% DP<10; 74.77% DP<10; 34.90% Length 100; 71.62% DP<10; 94.12% MAF<0.3, DP 10; 87.77% MAF<0.3, DP 10; 0.39% Qual.<20; 34.76% Qual.<20; 0.86% Qual.<20; 8.60% Qual.<20; 61.01% Qual.<20.]

Acknowledgement
Prof. Tak-Wah Lam Mr. Ruibang Luo BWT Team Mr. Chi-Man Liu BALSA Team Mr. Victor Wong Mr. Wai-Chun Law BAL Members Dr. Sze-Hang Chan Dr. Ricky Ma Mr. Dinghua Li Ms. Min Ou

Thanks