Using MPI One-sided Communication to Accelerate Bioinformatics Applications


1 Using MPI One-sided Communication to Accelerate Bioinformatics Applications
Hao Wang, Department of Computer Science, Virginia Tech

2 Next-Generation Sequencing (NGS) Data Analysis
- DNA is isolated from normal tissue and blood
- DNA is fragmented, and the captured DNA is washed and amplified
- DNA is sequenced and analyzed
- DNA is used for clinical trials, e.g., disease detection, personalized medicine, etc.

DNA Sequencing Costs: Data, https://www.genome.

3 NGS Data Analysis
- Next-generation sequencing (NGS) has significantly reduced the cost per genome, and data analysis (rather than sequencing) is becoming the bottleneck
- The NGS data analysis market is growing and is predicted to exceed 1 billion in 2024
Sources: NIH, DNA Sequencing Costs: Data; Grand View Research, NGS Data Analysis Market Analysis 2024

4 Irregular NGS Data Analysis Applications
NGS applications can be characterized by
- Irregular memory accesses
- Irregular control flows
- Irregular communication patterns
Many such applications exhibit irregularities
- Basic Local Alignment Search Tool (BLAST) for sequence search: heuristic algorithms
- BWA, Bowtie1/2, and SOAPaligner for short read mapping: compressed data structures
These applications have irregular communication patterns!

5 Outline
- Background
- Sequence Search
- Using one-sided communications for sequence search
- Evaluation (early stage)
- Summary and Future Work

6 Sequence Search
Search for similarities between a query sequence and database sequences (i.e., subject sequences).
Query sequence:
ADGIFAIDQFTKVLLNYTGHITWNPPAIF
KSYCEIIVTYFPFDEQNCSMKLG..
Database (subject sequences):
>gi sp P ACHA_NATTE
ADGIFAIDQFTKVLLNYTGHITWNPPAIFKSYCEIIVTYFPFDEQ
NCSMKLGTRTYDGTV...
>gi sp P ACHA_NATTE
MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFN
PERSAIYLFQVYDKDNDGKITIKELAGDIDFD.
>gi sp P ACHA_RAT
MARVTVQDAVEKIGNRFDLVLVAARRARQIQSGGKDALVPEE
NDKVTVIALREIEEGLITNQILDVRERQEQQEQ.
>gi sp P ACHA_MOUSE
ADGDFAIVKFTKVLLQYTGHITWTPPAIFKSYCEIIVTHFPFDEQ
NCSMKLGTWTYDGSV
Output: a list of matching subject sequences, each identified by its gi sp P entry and reported with a Score.

7 MPI Implementation
Inter-node parallel implementation using MPI
- Partition the database D into subsets D0, D1, ..., Dj
- For a query sequence qi, search each database subset Dj in parallel to get the local search result Rij
- Merge and sort all local search results Ri0 to Rij to get the final result Ri
Figure: the query sequence is sent to every rank. MPI rank 0 searches D0 and gets Ri0, MPI rank 1 searches D1 and gets Ri1, ..., MPI rank j searches Dj and gets Rij. The results {Ri0, Ri1, ..., Rij} are then merged and sorted into Ri.
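The partition, search, and merge steps of this slide can be sketched in plain Python, with each "rank" simulated as a list entry instead of a real MPI process. This is a minimal illustration only: `toy_score` is a hypothetical stand-in for a real alignment score (it is not BLAST or DIAMOND), the round-robin partitioning is an assumption, and the sequence names and strings are made up for the example.

```python
def partition(database, n_parts):
    """Split the database round-robin into n_parts subsets D0..D(n_parts-1)."""
    return [database[r::n_parts] for r in range(n_parts)]

def toy_score(query, subject):
    """Hypothetical stand-in for an alignment score: count matching positions."""
    return sum(a == b for a, b in zip(query, subject))

def local_search(query, subset):
    """What one rank would do: score the query against its own subset only."""
    return [(toy_score(query, seq), name) for name, seq in subset]

def merge_and_sort(local_results, top_k):
    """Merge the per-rank results and keep the top_k best-scoring hits."""
    merged = [hit for result in local_results for hit in result]
    # Highest score first; the name breaks ties so the order is deterministic.
    return sorted(merged, key=lambda hit: (-hit[0], hit[1]))[:top_k]

# Illustrative data, not taken from any real database.
database = [
    ("seqA", "ADGIFAIDQF"),
    ("seqB", "MASTQNIVEE"),
    ("seqC", "MARVTVQDAV"),
    ("seqD", "ADGDFAIVKF"),
]
query = "ADGIFAIDQF"

subsets = partition(database, 2)                                    # D0, D1
local_results = [local_search(query, subset) for subset in subsets]  # Ri0, Ri1
final = merge_and_sort(local_results, top_k=2)                       # Ri
print(final)
```

In a real MPI program each subset would live on a different rank and the merge would require communication, which is the step the rest of the talk focuses on.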

8 MPI Implementation
Inter-node parallel implementation using MPI, applied to batches of query sequences
- Query sequences arrive in batches: {q0, q1, ..., qi-1}, {qi, qi+1, ..., q2i-1}, {q2i, q2i+1, ..., q3i-1}
- Search batch {q0, q1, ..., qi-1} on D0 and get {R0,0, R1,0, ..., Ri-1,0}
- Search batch {q0, q1, ..., qi-1} on D1 and get {R0,1, R1,1, ..., Ri-1,1}
- Search batch {q0, q1, ..., qi-1} on Dj and get {R0,j, R1,j, ..., Ri-1,j}
- Merge and sort to get the final results of batch 0, then of batch 1, then of batch 2
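The batching of query sequences described on this slide amounts to slicing the query list into consecutive groups. A minimal sketch, assuming a fixed batch size and a short final batch when the query count is not a multiple of it (the helper name `make_batches` is hypothetical):

```python
def make_batches(queries, batch_size):
    """Split the query list into consecutive batches of at most batch_size."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [queries[start:start + batch_size]
            for start in range(0, len(queries), batch_size)]

queries = [f"q{k}" for k in range(7)]
batches = make_batches(queries, 3)
print(batches)
```

Every rank would search the same batch against its own database partition, so the batch boundaries have to be identical on all ranks.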

9 mpiBLAST Implementations
Characteristics
- Both the computation time and the data size of compute nodes are highly diverse
- A dedicated MPI process acts as the master
1. All workers send metadata, i.e., query id, search score, and data size, to the master
2. The master merges and sorts the metadata, selects a worker for I/O, and notifies all workers
3. All workers send their locally selected results to the I/O worker
4. The I/O worker finally writes the data to disk
H. Lin, et al. "Coordinating computation and I/O in massively parallel sequence search." IEEE Transactions on Parallel and Distributed Systems 22.4 (2011).

10 Why Redesign Sequence Search
Bottlenecks in previous sequence search tools
- Local search
- Disk I/O
New trends in sequence search
- Local search is much faster now, e.g., DIAMOND [1]
- Sequence search has become a stage of the NGS workflow, and search results reside in memory for reuse [2]
Data communication is becoming a new performance bottleneck!
[1] B. Buchfink, C. Xie, D. Huson, "Fast and sensitive protein alignment using DIAMOND", Nature Methods 12 (2015).
[2] Genome Analysis Toolkit


12 Using MPI One-sided for Sequence Search
Benefits of using MPI one-sided communication
- Expresses irregular communication patterns more economically
- Overlaps communication and computation more efficiently
- Bypasses the tag matching of two-sided communication
Basic ideas
- Use MPI one-sided communications (put and get) to overlap communication and computation
- No dedicated MPI process is needed as the master to coordinate disk I/O

13 Challenges
Three one-sided synchronization modes:
- Fence (active target)
- Post-Start-Complete-Wait (active target)
- Lock/Unlock (passive target)
Which is better?

14 MPI Windows for Meta Data and Local Results
- Each MPI process registers two types of cyclic buffers to an MPI window, for metadata and local search results respectively
Figure: a cyclic buffer pool on each MPI process (rank 0, rank 1, ..., rank n-1), e.g., Buf 0 on Worker0 and Buf 1 on Worker1, with each buffer registered to the MPI window as <addr, size>.

15 Use MPI Put to Write Meta Data
- After the local search for a batch of query sequences, an MPI process writes its metadata to the other processes with MPI_Put()
Figure: ranks 0 to n-1 put metadata (sequence identifiers and scores) into the cyclic buffer pools (Buf 0 on Worker0, Buf 1 on Worker1), which are registered as <addr, size>.

16 Wait for the Finish of Put on the Previous Batch
After MPI_Put() for the current batch, MPI processes will
- Wait for the finish of the MPI_Put() of the previous batch, e.g., with MPI_Win_fence()
- Merge and sort the metadata of the previous batch and select the final results (metadata)
Figure: ranks 0 to n-1 merge and sort the metadata (sequence identifiers and scores) held in their cyclic buffer pools (Buf 0 on Worker0, Buf 1 on Worker1).

17 A Process will Gather Data by MPI Get
- One MPI process is selected to merge the final results, e.g., the one that holds the most final results, and it gathers the needed data from the others with MPI_Get()
- The other processes continue the computation for the next batch
Figure: the selected process gets the needed local results from the cyclic buffer pools of ranks 0 to n-1 (Buf 0 on Worker0, Buf 1 on Worker1).
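The selection rule on this slide can be sketched without MPI, since every process holds the same merged metadata and can therefore reach the same decision locally. The sketch below is an illustration under assumptions: the metadata record layout `(query_id, score, size)` follows the fields listed for mpiBLAST, the `top_k` cutoff and the lowest-rank tie-break are hypothetical, and all numbers are made up.

```python
from collections import Counter, defaultdict

def merge_metadata(per_rank_metadata, top_k):
    """Merge all ranks' metadata and keep the top_k hits for each query."""
    by_query = defaultdict(list)
    for rank, records in enumerate(per_rank_metadata):
        for query_id, score, size in records:
            by_query[query_id].append((score, rank, size))
    final = {}
    for query_id, hits in by_query.items():
        # Highest score first; the lower rank breaks ties deterministically.
        final[query_id] = sorted(hits, key=lambda h: (-h[0], h[1]))[:top_k]
    return final

def select_gatherer(final):
    """Pick the rank that already owns the most final results."""
    owned = Counter(rank for hits in final.values() for _, rank, _ in hits)
    return min(owned, key=lambda rank: (-owned[rank], rank))

def plan_gets(final, gatherer):
    """Bytes the gatherer would have to fetch from each other rank."""
    to_fetch = Counter()
    for hits in final.values():
        for _, rank, size in hits:
            if rank != gatherer:
                to_fetch[rank] += size
    return dict(to_fetch)

# Metadata per rank: (query_id, score, result_size_in_bytes); made-up values.
per_rank_metadata = [
    [("q0", 95, 400), ("q1", 60, 300)],   # rank 0
    [("q0", 90, 500), ("q1", 88, 200)],   # rank 1
    [("q0", 40, 100), ("q1", 85, 250)],   # rank 2
]
final = merge_metadata(per_rank_metadata, top_k=2)
gatherer = select_gatherer(final)
print(gatherer, plan_gets(final, gatherer))
```

Choosing the rank that already owns most of the final results reduces the amount of data that has to move; in the real implementation each entry of the fetch plan would correspond to MPI_Get() calls on the local-results window.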

18 Summary of Our Method
create MPI windows for metadata and local results
for batch 0 : n-1
    Local search: do the sequence search on the local partition of the database
    Write metadata: MPI_Put() metadata of the current batch to others
    Wait: wait for the finish of MPI_Put() on the previous batch
    Merge & sort: merge and sort metadata
    if (I'm the selected process)
        Get local results: MPI_Get() local results from all processes
        Generate output: sort and write final results for the previous batch
    endif
endfor
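The ordering of this loop, where metadata for the current batch is written before the previous batch is merged, can be illustrated with a single-process simulation. This is only a sketch of the ordering under assumptions: the nested lists stand in for MPI windows, plain assignment stands in for MPI_Put(), nothing actually runs concurrently, and the final drain step after the loop is an assumption that the slide does not show. The callback `search` and the function `run_pipeline` are hypothetical names.

```python
def run_pipeline(n_ranks, n_batches, search):
    """Simulate the batch loop; search(rank, batch) returns that rank's metadata."""
    # windows[target][slot][source] holds metadata "put" by source into target.
    # Two slots per rank play the role of the double metadata windows.
    windows = [[{}, {}] for _ in range(n_ranks)]
    log = []

    def merge_previous(batch):
        slot = batch % 2
        for rank in range(n_ranks):
            merged = sorted(
                (hit for hits in windows[rank][slot].values() for hit in hits),
                reverse=True,
            )
            if rank == 0:  # every rank sees the same data, so record it once
                log.append((batch, merged))
            windows[rank][slot] = {}  # the slot is free again for batch + 2

    for batch in range(n_batches):
        slot = batch % 2
        for source in range(n_ranks):
            metadata = search(source, batch)       # local search
            for target in range(n_ranks):          # stands in for MPI_Put()
                windows[target][slot][source] = metadata
        if batch > 0:
            merge_previous(batch - 1)              # after waiting on batch - 1
    merge_previous(n_batches - 1)                  # assumed drain of the last batch
    return log

# Made-up metadata: one (score, query_id) record per rank and batch.
log = run_pipeline(2, 3, lambda rank, batch: [(10 * batch + rank, f"q{batch}")])
for batch, merged in log:
    print(batch, merged)
```

Because batch b is written to slot b % 2 while slot (b - 1) % 2 is being merged, the two batches never share a buffer, which is what lets the wait for the previous batch be issued after the put for the current one.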

19 Implementations and Optimizations
Double buffering
- Use double windows for metadata, so that a process can wait for the finish of the previous batch after issuing MPI_Put() for the current batch
Different types of synchronization methods
- Fence mechanism: MPI_Win_fence()
- Lock/unlock mechanism: MPI_Win_flush()
- PSCW mechanism: MPI_Win_wait()

20 Implementations and Optimizations
One window vs. one window per rank
- One window: create one MPI window and register the buffers of all ranks (rank 0 to rank n-1) to that window
- Create one window per process to avoid the unnecessary wait

21 Implementations and Optimizations
One window vs. one window per rank
- One window per rank: create n MPI windows and register one buffer per rank per window
- Create one window per process to avoid the unnecessary wait


23 Experimental Setups
Hardware
- Up to 16 compute nodes, each with 2 Intel Xeon CPU E (Sandy Bridge EP, 16 cores in total)
- 64 GB main memory
- Mellanox ConnectX-3 MT27500
Datasets
- env_nr and nr databases from NCBI GenBank
- Query sequences are randomly selected from the target database
Data partitions
- Databases are partitioned evenly across the compute nodes
Software
- DIAMOND (C++ Thread) + MPI
- MVAPICH2 (version 2.2)

24 Breakdown
Figure: data size (MB) and computation time (sec) for rank0 to rank7 across batch1 to batch10.
- Different MPI processes contribute different sizes of data to the final results
- Different MPI processes have different computation times in each batch
Setup: running on 8 nodes, 10000 query sequences in 10 batches

25 Overall Performance on 8 nodes
Figure: normalized execution time vs. batch size for Fence_mwins, Fence_1win, PSCW_mwins, PSCW_1win, LockFlush_mwins, LockFlush_1win, SendRecv_w/_master, and SendRecv_w/o_master. Lower is better.
- MPI_Win_fence() with multiple windows is best
- 1.4x and 1.32x speedup over two-sided communication with and without a master, respectively

26 Overall Performance on 16 nodes
Figure: normalized execution time vs. batch size for Fence_mwins, Fence_1win, PSCW_mwins, PSCW_1win, LockFlush_mwins, LockFlush_1win, SendRecv_w/_master, and SendRecv_w/o_master. Lower is better.
- MPI_Win_fence() with multiple windows is best
- 1.42x and 1.19x speedup over two-sided communication with and without a master, respectively

27 Observations
- MPI fence exhibits better performance than MPI flush
- The metadata communication follows an all-to-all communication pattern

28 Summary and Future Work
- We use MPI one-sided communication to accelerate sequence search on InfiniBand clusters
- The experimental results show up to 1.42x speedup over two-sided communication
- We are analyzing the performance numbers of the different one-sided synchronization mechanisms
- We are collecting more application performance numbers, for mpiBLAST, DIAMOND, and pBWA
- We would like to check application performance with MVAPICH2-2.3b
