Scalable RNA Sequencing on Clusters of Multicore Processors

Joaquín Dopazo, Joaquín Tárraga, Sergio Barrachina, María Isabel Castillo, Héctor Martínez, Enrique S. Quintana-Ortí, Ignacio Medina

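INTRODUCTION
[Figure: DNA sequencing versus RNA sequencing. The DNA of a gene contains Exon 0, Intron 0, Exon 1, Intron 1 and Exon 2, and reads are sequenced from all of it; in RNA the introns are spliced out, so reads are sequenced from the joined Exons 0, 1 and 2.]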

INTRODUCTION
DNA mappers: Bowtie, CUSHAW, BWA, HPG Aligner.
RNA mappers: TopHat, STAR, MapSplice, HPG Aligner BW.
[Figure: a DNA read (GCCATCCCAGGGCTAAAAGGTAAA) maps contiguously to the reference genome, while an RNA read spans Exon 0 and Exon 1 and has to be aligned across Intron 0 of the reference.]
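The difference between the two kinds of mapper can be made concrete with a toy example. The sketch below assumes the three fragments in the figure are exon 0 (TTTCACCAGCCCAT), intron 0 (GTCCAAG) and exon 1 (CCGGTACCTTAG) in genomic order. The helpers `exact_map` and `split_map` are hypothetical: the brute-force split search only illustrates why RNA mappers must look for split alignments and is not how HPG Aligner BW finds splice junctions.

```python
# Toy gene model (assumed from the figure): exon 0, intron 0, exon 1.
EXON0 = "TTTCACCAGCCCAT"
INTRON0 = "GTCCAAG"
EXON1 = "CCGGTACCTTAG"
GENOME = EXON0 + INTRON0 + EXON1


def exact_map(read, genome):
    """DNA-style mapping: 0-based start of the first exact match, or -1."""
    return genome.find(read)


def split_map(read, genome, min_anchor=5):
    """RNA-style mapping by brute force (hypothetical helper).

    Tries every split point k and returns (prefix_start, suffix_start, gap)
    for the first k where the prefix matches exactly (first occurrence only)
    and the suffix matches exactly somewhere downstream of it.
    Returns None if no such split exists.
    """
    for k in range(min_anchor, len(read) - min_anchor + 1):
        prefix, suffix = read[:k], read[k:]
        p = genome.find(prefix)
        if p == -1:
            continue
        s = genome.find(suffix, p + k)
        if s != -1:
            return p, s, s - (p + k)
    return None


# A read taken from the spliced transcript: end of exon 0 joined to start of exon 1.
read = EXON0[-7:] + EXON1[:7]
print(exact_map(read, GENOME))  # -1: the read is not contiguous in the genome
print(split_map(read, GENOME))  # (7, 21, 7): the gap equals the intron length
```

The gap reported by `split_map` is exactly the length of intron 0, which is the information a spliced aligner has to recover.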

OUTLINE
HPG ALIGNER BW
WORK FLOWS
METAEXON STRUCTURE
MULTI-THREADED
MPI
EXPERIMENTAL RESULTS
CONCLUSIONS AND FUTURE WORK

WORK FLOWS
Multi-threaded HPG Aligner BW
Speed: work flows with several stages to exploit parallelism; SSE instructions.
Sensitivity: metaexon structure.
Outperforms TopHat 2 + Bowtie 2, MapSplice and STAR in both throughput and sensitivity (H. Martínez, J. Tárraga, I. Medina, S. Barrachina, M. Castillo, J. Dopazo, E. S. Quintana-Ortí, "Concurrent and accurate short read mapping on multicore platforms", IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2015).
MPI HPG Aligner BW
Speed: improve speed by running on several nodes.
Sensitivity: maintain the sensitivity with merged structures.

WORK FLOWS
Three work flows:
Work flow 1: map reads.
Work flow 2: map single anchors.
Work flow 3: map incompletely mapped reads.

WORK FLOWS
Work flow 1 has 5 stages, each operating on a collection of read batches:
A) Read sequences from the FastQ file and generate batches of reads to process.
B) Map reads using the Burrows-Wheeler transform (BWT).
C) Seed the unmapped reads and search for candidate alignment locations (CALs).
D) Apply Smith-Waterman to complete the mappings, search for splice junctions, and write to the single-anchor and hard-clipping files.
E) Write alignments to the BAM file.
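Stage B maps reads with the Burrows-Wheeler transform. The textbook way to find exact occurrences of a read with a BWT is FM-index backward search, sketched below on a toy reference. Everything here (naive suffix sorting, dictionary-based C and occurrence tables, the function names) is an illustrative assumption: a production aligner such as HPG Aligner BW indexes a whole genome in compressed form and must also tolerate mismatches, which this exact-match sketch does not.

```python
def build_fm_index(text):
    """Build a toy FM-index: suffix array, C table and occurrence table.

    Suffixes are sorted naively, which is fine only for short strings.
    """
    text += "$"  # unique terminator, smaller than A, C, G, T in ASCII
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)  # text[-1] is "$" when i == 0
    alphabet = sorted(set(text))
    # C[c] = number of characters in text strictly smaller than c
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += text.count(c)
    # occ[c][i] = number of occurrences of c in bwt[:i]
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return sa, C, occ


def backward_search(pattern, sa, C, occ):
    """Return the sorted 0-based positions where pattern occurs exactly."""
    lo, hi = 0, len(sa)  # half-open suffix-array interval [lo, hi)
    for c in reversed(pattern):  # extend the match one character to the left
        if c not in C:
            return []
        lo = C[c] + occ[c][lo]
        hi = C[c] + occ[c][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


print(backward_search("ATTA", *build_fm_index("GATTACAGATTACA")))  # [1, 8]
```

Each pattern character narrows a suffix-array interval in constant time, so the cost of an exact lookup depends on the read length rather than the genome length.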

WORK FLOWS
Work flow 2 has 3 stages:
A) Read sequences from the single-anchor file and generate batches of reads to process.
B) Complete the mappings with the metaexon structure. If no mapping is possible, apply seeding, CALs and Smith-Waterman, and write to the hard-clipping file.
C) Write alignments to the BAM file.

Work flow 3 has 3 stages:
A) Read sequences from the hard-clipping file and generate batches of reads to process.
B) Complete the mappings with the metaexon structure.
C) Write alignments to the BAM file.
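Smith-Waterman local alignment appears in work flow 1 (stage D) and work flow 2 (stage B). The slides mention SSE instructions for speed; the sketch below is instead a plain scalar version that only computes the best local score and the cell where it ends, using a linear gap penalty. The match, mismatch and gap scores are illustrative assumptions, not HPG Aligner BW's parameters, and no traceback is performed.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b with a linear gap penalty.

    Returns (best_score, end_i, end_j), where end_i and end_j are 1-based
    end positions in a and b of the best-scoring local alignment.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]  # row 0 and column 0 stay 0
    best, end_i, end_j = 0, 0, 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Clamping at 0 lets a local alignment start anywhere.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, end_i, end_j = H[i][j], i, j
    return best, end_i, end_j


print(smith_waterman("TTACGTTT", "ACGT"))  # (8, 6, 4)
```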

METAEXON STRUCTURE
The structure is updated with the reads mapped in each work flow.
[Figure: exons (Exon0, Exon1, Exon2) along a chromosome, and a chromosome vector V whose entries hold exons together with potential starts and potential ends of splice junctions, stored in an AVL tree.]
AVL tree: search O(log n), insertion O(log n).
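The slide keeps splice-junction information in AVL trees, which gives O(log n) search and insertion. Below is a minimal AVL sketch, assuming nodes keyed by genomic position that store a set of tags such as "start" or "end", and a plain dictionary standing in for the chromosome vector V. These layout choices are hypothetical and are not the structure used inside HPG Aligner BW.

```python
class Node:
    """AVL node: key is a genomic position, value the set of tags seen there."""
    __slots__ = ("key", "value", "left", "right", "height")

    def __init__(self, key, value):
        self.key, self.value = key, {value}
        self.left = self.right = None
        self.height = 1


def _height(n):
    return n.height if n else 0


def _update(n):
    n.height = 1 + max(_height(n.left), _height(n.right))


def _rotate_right(y):
    x = y.left
    y.left, x.right = x.right, y
    _update(y)
    _update(x)
    return x


def _rotate_left(x):
    y = x.right
    x.right, y.left = y.left, x
    _update(x)
    _update(y)
    return y


def insert(node, key, value):
    """Insert (key, value) and return the new subtree root. O(log n)."""
    if node is None:
        return Node(key, value)
    if key == node.key:
        node.value.add(value)  # position already known: just record the tag
        return node
    if key < node.key:
        node.left = insert(node.left, key, value)
    else:
        node.right = insert(node.right, key, value)
    _update(node)
    balance = _height(node.left) - _height(node.right)
    if balance > 1:  # left-heavy
        if key > node.left.key:  # left-right case
            node.left = _rotate_left(node.left)
        return _rotate_right(node)
    if balance < -1:  # right-heavy
        if key < node.right.key:  # right-left case
            node.right = _rotate_right(node.right)
        return _rotate_left(node)
    return node


def search(node, key):
    """Return the tag set stored at key, or None if absent. O(log n)."""
    while node is not None:
        if key == node.key:
            return node.value
        node = node.left if key < node.key else node.right
    return None


# Hypothetical chromosome vector: one AVL root per chromosome name.
chromosomes = {}
for chrom, pos, kind in [("1", 1200, "start"), ("1", 3400, "end"), ("2", 500, "start")]:
    chromosomes[chrom] = insert(chromosomes.get(chrom), pos, kind)
print(search(chromosomes["1"], 3400))  # {'end'}
```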

MULTI-THREADED
[Figure sequence: batches of reads (B1, B2, …) flow from the reader through the read queue, the BWT queue, the CAL queue and the SWA queue to the writer.]
When the read queue reaches its limit, the reader becomes a worker.
When the SWA queue is empty, the writer becomes a worker.
When the read file is finished, the reader becomes a worker.
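The queue-based design can be sketched as a linear pipeline of threads connected by bounded queues. This is a simplification under stated assumptions: one thread per stage, fixed reader and writer roles (the switching of the reader and writer to worker roles shown in the slides is not modelled), and placeholder stage functions standing in for the BWT, CAL and Smith-Waterman steps. `queue.Queue(maxsize=...)` blocks `put` while the queue is full, which plays the role of the read-queue limit. In CPython the threads illustrate the structure only, not parallel speed-up.

```python
import queue
import threading

SENTINEL = object()  # end-of-stream marker passed from stage to stage


def stage_worker(fn, inbox, outbox):
    """Apply fn to each batch from inbox and forward the result to outbox."""
    while True:
        batch = inbox.get()
        if batch is SENTINEL:
            outbox.put(SENTINEL)  # propagate shutdown downstream
            return
        outbox.put(fn(batch))


def run_pipeline(batches, stages, max_queued=4):
    """Reader -> stage 1 -> ... -> stage k -> writer, one thread per stage.

    Every queue holds at most max_queued batches; put() blocks when full.
    """
    queues = [queue.Queue(maxsize=max_queued) for _ in range(len(stages) + 1)]
    workers = [
        threading.Thread(target=stage_worker, args=(fn, queues[i], queues[i + 1]))
        for i, fn in enumerate(stages)
    ]

    def reader():
        for batch in batches:
            queues[0].put(batch)  # blocks while the read queue is at its limit
        queues[0].put(SENTINEL)

    threads = workers + [threading.Thread(target=reader)]
    for t in threads:
        t.start()

    results = []  # the writer runs in the calling thread
    while True:
        item = queues[-1].get()
        if item is SENTINEL:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results


# Placeholder stages standing in for BWT mapping and a later step.
print(run_pipeline(range(6), [lambda b: b + 1, lambda b: b * 10]))
```

Because each stage has a single thread and the queues are FIFO, batches leave the writer in input order; a design with several threads per stage would need explicit reordering before writing.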

MPI
[Figure: the FastQ file of reads is divided into slices 0 to N-1, one slice for each node, assigned to MPI ranks 0 to N-1.]

MPI
Node 0 splits the input file into slices and sends each node the start and end positions of its slice. Only one thread per MPI rank reads from the input file. Each MPI worker rank applies HPG Aligner BW to its own slice and builds its local metaexon (AVL) structure and a buffer of alignments to hold the output data. Within MPI rank r, each work flow executes on C threads.
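Node 0 only sends each rank a start and an end position, so those positions have to fall on record boundaries. One way to do that is to cut the file at equal byte offsets and then advance each cut to the next FASTQ header. The sketch assumes plain four-line FASTQ records; the function names and the header test are illustrative assumptions rather than HPG Aligner BW's code, and the actual distribution of the (start, end) pairs between ranks is not shown.

```python
def _next_line(data, i):
    """Offset of the line that follows the line containing offset i."""
    j = data.find(b"\n", i)
    return len(data) if j == -1 else j + 1


def next_record_start(data, pos):
    """First FASTQ record start at or after byte offset pos."""
    if pos <= 0:
        return 0
    # Move to the first line start at or after pos.
    i = pos if data[pos - 1:pos] == b"\n" else _next_line(data, pos)
    while i < len(data):
        # A header starts with '@' and the line two below it starts with '+'.
        # A quality line may also start with '@', but two lines below it
        # is a sequence line, so this check rejects it.
        third = _next_line(data, _next_line(data, i))
        if data.startswith(b"@", i) and data.startswith(b"+", third):
            return i
        i = _next_line(data, i)
    return len(data)


def fastq_slices(data, n):
    """Split data into n (start, end) byte ranges aligned to record starts."""
    size = len(data)
    cuts = [next_record_start(data, size * r // n) for r in range(n)] + [size]
    return [(cuts[r], cuts[r + 1]) for r in range(n)]


# Eight 16-byte records whose quality lines start with '@' on purpose.
data = b"".join(b"@r%d\nACGT\n+\n@III\n" % i for i in range(8))
print(fastq_slices(data, 3))  # [(0, 48), (48, 96), (96, 128)]
```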

MPI
[Figure: MPI ranks 0 to N-1 send their alignments to the MPI writer rank (rank N), which writes the final file.]
Alignments are sent to the MPI writer rank when the buffer is full. Within a node, only one thread calls MPI_Send.
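The buffering rule (ship alignments to the writer rank when the local buffer is full) can be sketched without MPI. Here `send` is a hypothetical callback standing in for the MPI_Send to the writer rank, and a lock ensures that worker threads never issue sends concurrently. That is a weaker arrangement than having a single thread call MPI_Send, as the slide describes; the sketch is only meant to show the flush logic.

```python
import threading


class AlignmentBuffer:
    """Per-rank output buffer shared by the worker threads of one rank.

    Alignments accumulate locally and are handed to `send` in one batch
    when `capacity` is reached; flush() ships any remainder at the end.
    """

    def __init__(self, capacity, send):
        self.capacity = capacity
        self.send = send  # stand-in for the MPI_Send to the writer rank
        self._items = []
        self._lock = threading.Lock()  # at most one thread sends at a time

    def add(self, alignment):
        with self._lock:
            self._items.append(alignment)
            if len(self._items) >= self.capacity:
                self._flush_locked()

    def flush(self):
        with self._lock:
            self._flush_locked()

    def _flush_locked(self):
        if self._items:
            self.send(self._items)
            self._items = []  # start a fresh buffer; the sent list is untouched


sent = []
buf = AlignmentBuffer(3, sent.append)
for aln in range(7):
    buf.add(aln)
buf.flush()
print(sent)  # [[0, 1, 2], [3, 4, 5], [6]]
```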

MPI
When all MPI worker ranks have completed work flow 1 or work flow 2, the local metaexon structures are merged in parallel into node 0. Once the parallel merge is done, the result is broadcast from node 0 to all ranks.

MPI
The merge follows a minimum spanning tree. [Figure sequence for MPI ranks 0 to 7: merge steps 0, 1 and 2 combine the local structures toward rank 0, followed by a broadcast from rank 0 to the other ranks.]
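The merge figures show three rounds for eight ranks, which matches a tree reduction with ceil(log2 P) rounds. The pairing rule below (in round k, rank i with i mod 2^(k+1) = 2^k sends to rank i - 2^k) is an assumption chosen to reproduce that pattern; sets and set union stand in for the metaexon structures and their merge, and the final step corresponds to what MPI_Bcast from rank 0 would do.

```python
def tree_merge_schedule(p):
    """Rounds of (src, dst) pairs reducing p ranks into rank 0.

    In round k, every rank i with i % 2**(k + 1) == 2**k sends to i - 2**k.
    Pairs within one round are independent and can run in parallel.
    """
    rounds, step = [], 1
    while step < p:
        rounds.append([(i, i - step) for i in range(step, p, 2 * step)])
        step *= 2
    return rounds


def merge_and_broadcast(local_structures):
    """Merge per-rank structures (sets here) along the tree, then broadcast."""
    data = [set(s) for s in local_structures]
    for round_pairs in tree_merge_schedule(len(data)):
        for src, dst in round_pairs:
            data[dst] |= data[src]  # stand-in for merging metaexon structures
    merged = data[0]
    return [set(merged) for _ in data]  # every rank receives rank 0's result


for k, pairs in enumerate(tree_merge_schedule(8)):
    print(f"Merge {k}: {pairs}")
```

For eight ranks this prints Merge 0 with four pairs, Merge 1 with two and Merge 2 with one, ending at rank 0.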

EXPERIMENTAL RESULTS
Hardware: 12 nodes, each with two Intel Xeon hexa-core CPUs at 2.40 GHz and 24 GB of DDR3 RAM, connected via InfiniBand QDR (Mellanox MTS3600 switch).
Data: datasets simulated with BEERS; two single-end datasets of 80 million and 5 million reads of 100 nts; mutation rate 0.1%; indel frequency 0.05%.

EXPERIMENTAL RESULTS
[Figure: speed-up of the MPI HPG Aligner BW versus number of nodes with 80 million reads of 100 nts, plotted against the ideal speed-up, with time (sec.) also shown. Legible data labels: 3.63, 5.24, 6.17, 7.08.]

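EXPERIMENTAL RESULTS
Detailed execution time (sec.) of the MPI HPG Aligner BW with 80 million reads of 100 nts. Columns N1 to N7 are ordered by increasing number of nodes; no merge times are reported for N1. Values in parentheses accompany the times in the source (presumably standard deviations).

Stage        N1      N2             N3             N4             N5            N6            N7
Work flow 1  688.05  351.90 (3.35)  173.98 (2.89)  117.12 (0.76)  94.23 (3.76)  79.81 (4.11)  66.65 (4.28)
Merge 1      -       1.52           2.32           3.06           2.94          3.68          3.54
Work flow 2  41.38   20.98 (0.24)   10.67 (0.11)   7.28 (0.04)    5.59 (0.10)   4.65 (0.20)   3.83 (0.05)
Merge 2      -       1.12           1.64           2.15           2.10          2.58          2.54
Work flow 3  3.57    2.01 (0.09)    1.34 (0.21)    1.05 (0.21)    0.83 (0.15)   0.72 (0.15)   0.65 (0.14)
Merge 3      -       0.57           1.11           1.62           1.59          2.09          2.06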
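The stage times in the table can be combined into a per-column total, a speed-up relative to the first column (S = T_first / T_column) and the share of time spent merging. This is a sketch over the transcribed values only: merges missing from the first column are treated as 0, the parenthesized values are ignored, and any time not broken out in the table is excluded, so these figures need not match the plotted speed-ups.

```python
# Times in seconds, one entry per column of the table, left to right.
# Merge steps have no entry in the first column; they are recorded as 0.0.
WF1 = [688.05, 351.90, 173.98, 117.12, 94.23, 79.81, 66.65]
MERGE1 = [0.0, 1.52, 2.32, 3.06, 2.94, 3.68, 3.54]
WF2 = [41.38, 20.98, 10.67, 7.28, 5.59, 4.65, 3.83]
MERGE2 = [0.0, 1.12, 1.64, 2.15, 2.10, 2.58, 2.54]
WF3 = [3.57, 2.01, 1.34, 1.05, 0.83, 0.72, 0.65]
MERGE3 = [0.0, 0.57, 1.11, 1.62, 1.59, 2.09, 2.06]

totals = [sum(col) for col in zip(WF1, MERGE1, WF2, MERGE2, WF3, MERGE3)]
merges = [sum(col) for col in zip(MERGE1, MERGE2, MERGE3)]
speedups = [totals[0] / t for t in totals]  # relative to the first column
merge_share = [m / t for m, t in zip(merges, totals)]

for t, s, f in zip(totals, speedups, merge_share):
    print(f"total {t:7.2f} s   speed-up {s:5.2f}   merge share {f:5.1%}")
```

For the rightmost column the stages add up to 79.27 s, a speed-up of about 9.25 over the first column, with the three merges taking roughly 10% of that time.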

EXPERIMENTAL RESULTS
Sensitivity (%) of the MPI HPG Aligner BW with 5 million reads of 100 nts, by number of exons per read. Columns N1 to N7 are ordered by increasing number of nodes; no "Without merge" values are reported for N1.

Exons  Reads (%)  Configuration  N1     N2     N3     N4     N5     N6     N7
1      ….44       Merge          98.82  98.82  98.82  98.82  98.82  98.82  98.82
                  Without merge  -      98.82  98.82  98.82  98.82  98.82  98.…
2      ….34       Merge          87.62  88.03  87.67  87.28  87.03  86.75  86.54
                  Without merge  -      87.46  86.78  85.97  85.35  84.71  84.12
3      2.16       Merge          68.56  69.78  67.97  66.34  64.91  63.97  63.04
                  Without merge  -      68.03  65.79  63.48  61.73  59.92  58.24

EXPERIMENTAL RESULTS
Sensitivity (%) of the MPI HPG Aligner BW with 5 million reads of 100 nts (continued). Columns N1 to N7 are ordered by increasing number of nodes.

Exons  Reads (%)  Configuration  N1     N2     N3     N4     N5     N6     N7
>3     0.06       Merge          45.48  48.12  44.62  40.98  39.19  36.76  35.58
                  Without merge  -      43.23  40.41  37.98  35.16  33.01  32.08
All    100.00     Merge          95.64  95.76  95.63  95.51  95.42  95.34  95.27
                  Without merge  -      95.59  95.38  95.15  94.97  94.79  94.62

CONCLUSIONS AND FUTURE WORK
Conclusions: high sensitivity and specificity; high-speed processing.
Future work: a multi-mapper MPI framework.
HPC4Genomics

QUESTIONS?
