BWT Indexing: Big Data from Next Generation Sequencing and GPU


1 GPU Technology Conference 2014
BWT Indexing: Big Data from Next Generation Sequencing and GPU
Jeanno Cheung
HKU-BGI Bioinformatics Algorithms and Core Technology Research Laboratory, University of Hong Kong
Core team members: Tak-Wah Lam, Wai-Chun Law, Chi-Man Liu & Ruibang Luo

2 About BAL
HKU-BGI Bioinformatics Algorithms and Core Technology Research Laboratory
- A research lab established at the University of Hong Kong in collaboration with BGI (then Beijing Genomics Institute).
- Focus on the algorithmics, analytics, and engineering aspects of computing technologies for enhancing the throughput and quality of next-generation sequencing data analysis.
- Exchange students from BGI.
- Funding: Hong Kong Government, Innovation & Technology Fund.
- Four years of working experience with CUDA.
- Selected software:
  - Aligner: SOAP2 (2008), SOAP3 (2011), SOAP3-dp (2013)
  - Assembler: SOAPdenovo2 (2012)
  - RNA: SOAPsplice (2010), SOAPfusion (2011)

3 Content
- CX1: BWT construction with billions of reads
  - In collaboration with NVIDIA
  - Preprint on arXiv: "GPU-Accelerated BWT Construction for Large Collection of Short Reads"
- BALSA: fast and accurate integrated NGS secondary analysis
  - WGS from raw reads to variants within hours
  - Sensitivity and accuracy at production standard
  - Paper in preparation

4 Indexing genomes
- A genome is a long string of characters
  - The human genome has 3 billion A/C/G/T's
- Indexing a genome allows fast searching of patterns (short strings) within the genome
- Application: short-read alignment
- Different kinds of indices:
  - Suffix trees
  - Suffix arrays
  - Burrows-Wheeler transform (BWT)

5 Indexing sequencing data
- Billions of short reads
  - Typically hundreds of characters in length
- Goal: construct the BWT of all the reads (concatenated into one very long string)
- Application: de novo assembly (via string graph)

6 Data volume
- Indexing a human genome: 3 billion chars
- Indexing 30-fold human sequencing data: almost 100 billion chars
- Existing tools take >12 hours to construct the BWT for 30-fold data

7 Our contribution
- A software tool for constructing the BWT of 30-fold human sequencing data (~100 billion chars) in 2 hours
  - using a CPU with 4 cores, 64 GB of RAM, and a 4-GB GPU card
- Highly scalable
  - E.g., using 4 identical machines, the construction time can be shortened to 45 minutes

8 Burrows-Wheeler transform (BWT)
Short reads (input): ACGA, ATAG, GGTC

9 Burrows-Wheeler transform (BWT)
Append sentinel character: ACGA$, ATAG$, GGTC$

10 Burrows-Wheeler transform (BWT)
List all suffixes
- ACGA$: ACGA$, CGA$, GA$, A$, $
- ATAG$: ATAG$, TAG$, AG$, G$, $
- GGTC$: GGTC$, GTC$, TC$, C$, $

11 Burrows-Wheeler transform (BWT)
Sort all suffixes lexicographically:
$, $, $, A$, ACGA$, AG$, ATAG$, C$, CGA$, G$, GA$, GGTC$, GTC$, TAG$, TC$

12 Burrows-Wheeler transform (BWT)
Find the character preceding each suffix (shown as preceding character | suffix; a suffix that is a whole read is preceded by $):
A|$, G|$, C|$, G|A$, $|ACGA$, T|AG$, $|ATAG$, T|C$, A|CGA$, A|G$, C|GA$, $|GGTC$, G|GTC$, A|TAG$, G|TC$
Resulting BWT: AGCG$T$TAAC$GAG

13 Burrows-Wheeler transform (BWT)
Resulting BWT: AGCG$T$TAAC$GAG
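The four steps on slides 8 to 13 can be sketched naively in Python. This is only an illustration of the definition, not the tool presented in the talk: the function name `bwt_of_reads` is hypothetical, suffixes are stored explicitly (which is exactly what does not scale), and ties between identical suffixes such as the three `$` entries are broken by read order, which is an assumption that happens to reproduce the slide's output.

```python
def bwt_of_reads(reads):
    """Naive BWT of a read collection: list suffixes, sort them, emit preceding chars."""
    entries = []
    for read_idx, read in enumerate(reads):
        text = read + "$"  # append the sentinel
        for pos in range(len(text)):
            # A suffix that is the whole read has no preceding char; use "$" as on the slide.
            prev = text[pos - 1] if pos > 0 else "$"
            entries.append((text[pos:], read_idx, prev))
    # "$" sorts before A/C/G/T in ASCII, so plain string comparison gives the slide's order.
    # Assumption: equal suffixes are ordered by read index.
    entries.sort(key=lambda e: (e[0], e[1]))
    return "".join(prev for _, _, prev in entries)


print(bwt_of_reads(["ACGA", "ATAG", "GGTC"]))  # AGCG$T$TAAC$GAG
```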

14 Our approach
- The construction looks simple: sort all suffixes, then output the preceding chars in order
- Technical difficulty:
  - MANY suffixes: on the order of a hundred billion suffixes, each up to about 100 chars long
  - Storing all suffixes explicitly requires a prohibitive amount of memory, but sorting without the GPU is slow, and disk-based sorting is SLOWER
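To get a feel for why explicit storage is prohibitive, here is a back-of-the-envelope estimate. It assumes one byte per character and a hypothetical data set of 1 billion reads of length 100 (roughly the ~100 billion chars quoted earlier); the helper name is made up for this sketch. A read of length m plus its sentinel has suffixes of lengths 1 to m+1, so it contributes (m+1)(m+2)/2 characters.

```python
def explicit_suffix_bytes(num_reads, read_len, bytes_per_char=1):
    """Bytes needed to store every suffix of every read (with sentinel) explicitly."""
    # Suffix lengths are 1, 2, ..., read_len + 1; their sum is a triangular number.
    chars_per_read = (read_len + 1) * (read_len + 2) // 2
    return num_reads * chars_per_read * bytes_per_char


total = explicit_suffix_bytes(10**9, 100)
print(f"{total / 1e12:.2f} TB")  # about 5.15 TB under these assumptions
```

Even with a more compact encoding, this is far beyond the 64 GB of RAM mentioned in the talk.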

15 Partitioning by prefix
- Idea: at any time, only store a subset of the suffixes in memory for sorting. The suffixes are partitioned by their length-L prefix.
- For example, if L=1:
  - list all suffixes starting with $, sort them, and output
  - list all suffixes starting with A, sort them, and output
  - list all suffixes starting with C, sort them, and output
  - list all suffixes starting with G, sort them, and output
  - list all suffixes starting with T, sort them, and output

16 Partitioning by prefix
- When L=1, there are 5 prefix partitions
- If the largest partition fits into main memory, we are good
- A larger L gives more partitions, so the expected size of the largest partition is smaller; but having more partitions also increases the preprocessing overhead (as in bucket sort)
- In practice, we set L=8
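The partitioning idea can be sketched as follows. This is a toy version under stated assumptions, not the presented implementation: the function name and the way partitions are enumerated are hypothetical, each partition is rebuilt by rescanning the reads, and a suffix shorter than L is keyed by the whole suffix. Only one partition is held in `bucket` at a time, and visiting the keys in sorted order makes the concatenated outputs equal to the full BWT.

```python
def bwt_by_prefix_partition(reads, L=1):
    """BWT of a read collection, built one length-L prefix partition at a time."""
    texts = [read + "$" for read in reads]
    # Partition key of a suffix: its first L chars (the whole suffix if it is shorter).
    keys = sorted({t[p:p + L] for t in texts for p in range(len(t))})
    out = []
    for key in keys:
        bucket = []  # only this partition's suffixes are held in memory
        for idx, t in enumerate(texts):
            for p in range(len(t)):
                if t[p:p + L] == key:
                    prev = t[p - 1] if p > 0 else "$"
                    bucket.append((t[p:], idx, prev))
        # Assumption: identical suffixes are ordered by read index.
        bucket.sort(key=lambda e: (e[0], e[1]))
        out.extend(prev for _, _, prev in bucket)
    return "".join(out)


print(bwt_by_prefix_partition(["ACGA", "ATAG", "GGTC"], L=1))  # AGCG$T$TAAC$GAG
```

With L=2 the same call yields the same string from 13 smaller partitions, which illustrates the trade-off between partition count and partition size.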

17 GPU radix sort
- For sorting suffixes, we use the CUDA radix sort library by Back40Computing
- The library sorts 32-bit and 64-bit integers very fast
- How to sort suffixes that are 100 chars long?
  - Encode each suffix compactly into binary form
  - A length-100 suffix can be encoded into seven 32-bit words
  - Sort the suffixes word by word, from the least to the most significant word (a.k.a. LSB, i.e. least-significant-first, radix sort)
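A CPU-only sketch of the word-by-word sort is shown below. The encoding is an assumption made for readability (3 bits per symbol, 10 symbols per 32-bit word), so it needs more words than the seven quoted on the slide, whose actual encoding is not shown. Python's stable `list.sort` stands in for one GPU radix-sort pass over 32-bit keys; the names `encode` and `lsd_sort` are hypothetical.

```python
CODE = {"$": 0, "A": 1, "C": 2, "G": 3, "T": 4}  # assumed order-preserving codes
CHARS_PER_WORD = 10  # 10 symbols x 3 bits = 30 bits, fits in a 32-bit word


def encode(suffix, num_words):
    """Pack a suffix into num_words integers; word 0 is the most significant."""
    words = []
    for w in range(num_words):
        chunk = suffix[w * CHARS_PER_WORD:(w + 1) * CHARS_PER_WORD]
        value = 0
        for i in range(CHARS_PER_WORD):
            code = CODE[chunk[i]] if i < len(chunk) else 0  # zero padding
            value = (value << 3) | code
        words.append(value)
    return words


def lsd_sort(suffixes, num_words):
    """Sort suffixes by repeated stable sorts, least significant word first."""
    keyed = [(encode(s, num_words), s) for s in suffixes]
    for w in reversed(range(num_words)):
        keyed.sort(key=lambda item: item[0][w])  # stable pass on one word
    return [s for _, s in keyed]


sample = ["ACGTACGTACGTA$", "T$", "ACGTACGTACGG$", "$", "ACGTACGTACGTC$"]
print(lsd_sort(sample, num_words=2) == sorted(sample))  # True
```

Because every suffix ends in `$` and `$` occurs nowhere else, two distinct suffixes always differ before either one ends, so the zero padding never affects the order.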

18 Multi-core parallelization
- Suppose we have N cores
- Divide the input short reads into N roughly equal sets, with each core assigned to one of the sets
- Given a partition prefix, each core is responsible for listing all qualifying suffixes in its assigned set
- Does not scale linearly due to memory contention
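The decomposition can be sketched with Python's standard `concurrent.futures`. The function names are hypothetical, and threads are used only to show how the work is split: because of the interpreter lock this sketch will not show a real speedup, unlike a native multi-core implementation.

```python
from concurrent.futures import ThreadPoolExecutor


def list_suffixes(chunk, prefix):
    """One worker: list (suffix, read index, preceding char) for suffixes starting with prefix."""
    found = []
    for read_idx, read in chunk:
        text = read + "$"
        for pos in range(len(text)):
            if text.startswith(prefix, pos):
                prev = text[pos - 1] if pos > 0 else "$"
                found.append((text[pos:], read_idx, prev))
    return found


def partition_suffixes(reads, prefix, num_workers):
    """Split the reads into roughly equal sets, scan them in parallel, then sort the partition."""
    indexed = list(enumerate(reads))
    chunks = [indexed[i::num_workers] for i in range(num_workers)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        parts = list(pool.map(list_suffixes, chunks, [prefix] * num_workers))
    merged = [entry for part in parts for entry in part]
    merged.sort(key=lambda e: (e[0], e[1]))  # stand-in for the GPU sort
    return merged


for suffix, _, prev in partition_suffixes(["ACGA", "ATAG", "GGTC"], "A", num_workers=2):
    print(prev, suffix)
```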

19 Experiments
Intel Core i7 (4 cores used), 64 GB RAM, Nvidia GTX680 with 4 GB video memory

                100M reads   500M reads   1000M reads
BCR [*]         6,141        23,094       46,899
Our software    565          3,108        6,886

Table 1: BWT construction (wall clock) time in seconds. All reads have length 100.
[*] M.J. Bauer, A.J. Cox, G. Rosone. Lightweight algorithms for constructing and inverting the BWT of string collections. Theoretical Computer Science, 483, 2013.
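The speedup implied by Table 1 is simple arithmetic on the reported times; a short sketch (the variable names are illustrative only, the numbers are copied from the table):

```python
# Wall-clock seconds from Table 1, keyed by number of reads (in millions).
bcr = {100: 6141, 500: 23094, 1000: 46899}
ours = {100: 565, 500: 3108, 1000: 6886}

# Speedup = BCR time / our time for each data size.
speedup = {n: bcr[n] / ours[n] for n in bcr}
for n, s in speedup.items():
    print(f"{n}M reads: {s:.1f}x")
```

On these numbers the ratio falls from roughly 11x at 100M reads to roughly 7x at 1000M reads.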

20 Multi-machine parallelization
- Observation: different prefix partitions can be processed simultaneously, since they do not depend on each other
- Excluding I/O, the computation scales linearly with the number of machines
- Experimental result: time in seconds (time excluding I/O shown in brackets)

# machines   100M reads   500M reads      1000M reads
             (468)        3,108 (2,624)   6,886 (5,882)
             (241)        1,797 (1,317)   3,998 (2,994)
             (169)        1,426 (944)     3,071 (2,067)
             (133)        1,192 (712)     2,584 (1,580)

21 Multi-machine parallelization
- Observation: different prefix partitions can be processed simultaneously, since they do not depend on each other
- For example, if the prefix length is L=1 and we have 2 machines, we can let the first machine process partitions $, A, C and the second machine process partitions G, T
- Scales linearly with the number of machines, except for a little overhead
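One simple way to hand out partitions is to give each machine a contiguous run of prefixes in sorted order, so that the per-machine outputs can be concatenated directly. The sketch below assumes this even split by partition count; the talk does not say how partitions are actually assigned, and a real scheduler would more likely balance by partition size. The function name is hypothetical.

```python
def assign_partitions(prefixes, num_machines):
    """Split sorted prefixes into contiguous, near-equal groups, one per machine."""
    ordered = sorted(prefixes)
    base, extra = divmod(len(ordered), num_machines)
    plan, start = [], 0
    for machine in range(num_machines):
        size = base + (1 if machine < extra else 0)  # first `extra` machines get one more
        plan.append(ordered[start:start + size])
        start += size
    return plan


print(assign_partitions(["$", "A", "C", "G", "T"], 2))  # [['$', 'A', 'C'], ['G', 'T']]
```

For L=1 and 2 machines this reproduces the split described on the slide.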

22 Experiments: Multi-machine
Each machine: Intel Core i7 (4 cores used), 64 GB RAM, Nvidia GTX680 with 4 GB video memory

# machines   100M reads   500M reads   1000M reads
                          3,108        6,886
                          1,797        3,998
                          1,426        3,071
                          1,192        2,584

Table 2: BWT construction (wall clock) time in seconds by our software using multiple machines. All short reads have length 100.

23 BALSA
- Based on SOAP3-dp.
- Whole secondary analysis (input: raw reads; output: variants) done in memory, with most modules GPU-accelerated.
- Spec. of a node: 1 x E v2, 1 x Tesla K40c, 64G host memory, 1 x 600G SAS HDD
- Performance per node so far: 6 hours/WGS (1.5k WGS/yr.), 30 mins/WES (18k WES/yr.)
- Space-efficient, lossless snapshot and database storage schema to replace the file-based BAM and CRAM formats, for instant query and column-based access.

24 Performance, raw reads to variants, WGS 50-fold 100bp (150 Gigabases)
[Bar chart: running time in hours per configuration. Configurations: BWAaln+GATK, BWAmem+GATK, SOAP3-dp+GATK, iSAAC, BALSA, BALSA (expected 2015)]
Note: BWA v0.7.5a (Li et al., 2012); GATK v3.0-1 (DePristo et al., 2011); SOAP3-dp r176 (Luo et al., 2013); iSAAC variant caller v1.06 (Raczy et al., 2013); BALSA r128

25 30-fold 100bp Simulated Data with known SNPs and Indels
[Venn diagrams: SNP Calling and Indel Calling, each comparing Truth, BALSA and iSAAC]
Note: iSAAC variant caller v1.06 (Raczy et al., 2013); BALSA r128

26 SNP Calling: Other Individual Callers
[Venn diagrams: six panels, each comparing Truth, BALSA and one other caller: Atlas, Freebayes, HC, Samtools, UG, Mutect]
Note: All tools were run and filtered following best-practice guidelines. Mutect was designed for SNV calling, but can be used for SNP calling when the normal sample is absent. Atlas2 v1.4.3.r158 (Shen et al., 2010, Challis et al., 2012), GATK HaplotypeCaller and UnifiedGenotyper v g932cd3a (DePristo et al., 2011), Freebayes v0.9.9 (Garrison et al., 2012), Mutect v1.1.4 (Cibulskis et al., 2013), Samtools v (Li et al., 2013) and VarScan v2.3.5 (Koboldt et al., 2012).

27 Combining the best results from 6 individual callers
[Venn diagrams: SNP Calling and Indel Calling, each comparing Truth, BALSA and the Ensemble, with breakdowns of discordant calls by filter criterion (Qual.<20, DP thresholds, Length, MAF<0.3)]

28 Acknowledgement
Prof. Tak-Wah Lam Mr. Ruibang Luo BWT Team Mr. Chi-Man Liu BALSA Team Mr. Victor Wong Mr. Wai-Chun Law BAL Members Dr. Sze-Hang Chan Dr. Ricky Ma Mr. Dinghua Li Ms. Min Ou

29 Thanks
