High-Performance and Scalable Non-Blocking All-to-All with Collective Offload on InfiniBand Clusters: A Study with Parallel 3DFFT
Krishna Kandalla (1), Hari Subramoni (1), Karen Tomko (2), Dmitry Pekurovsky (3), Sayantan Sur (1) and Dhabaleswar K. Panda (1)
(1) Computer Science & Engineering Department, The Ohio State University
(2) The Ohio Supercomputer Center
(3) San Diego Supercomputer Center
Outline
- Introduction
- Problem Statement
- Designing MPI_Ialltoall with Collective Offload
- Re-designing P3DFFT for Overlap
- Experimental Evaluation
- Conclusions and Future Work
Introduction
- Parallel applications can scale beyond 100,000 cores
- InfiniBand is commonly used across commodity clusters
- The Message Passing Interface (MPI) is the de facto programming model
[Image: Tsubame supercomputer, 73,278 cores]
Collective Communication in MPI
- MPI-2.2 defines only blocking collective operations, which limits the performance and scalability of dense operations such as Alltoall
- MPI-3 may support non-blocking collectives; Hoefler et al. proposed host-based approaches
- The latest ConnectX-2 adapters from Mellanox support network offload features
- This work studies the benefits with a real scientific library: P3DFFT
Overview of InfiniBand Collective Offload
- Applications can offload task-lists (e.g., Send, Send, Wait) to the NIC
- A CQE gets created on the MCQ after the task-list executes
- Problems:
  - Alltoall is extremely communication intensive
  - The size of a task-list is limited, which directly affects Alltoall scalability
[Figure: application posts a task-list (Send, Send, Wait) to the InfiniBand HCA; tasks flow through the MQ/MCQ, Send Queue, Receive Queue, and Send/Recv CQs over the physical link]
(Subramoni et al., Hot Interconnects (HotI '10), 2010)
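To make the task-list idea concrete, the sketch below models how a collective schedule might be expressed as a list of work elements handed to the HCA. The task_t type and the post_task_list() call are hypothetical illustrations of the CORE-Direct-style interface described on this slide, not the actual Mellanox/OFED API.

    /* Hypothetical sketch of an offloaded task-list (CORE-Direct style).
     * The enum, struct, and post_task_list() are illustrative only; the
     * real interface is vendor-specific and not shown on these slides. */
    #include <stddef.h>

    typedef enum { TASK_SEND, TASK_RECV, TASK_WAIT } task_type_t;

    typedef struct {
        task_type_t type;       /* what the HCA should do                */
        int         peer;       /* destination/source rank for send/recv */
        void       *buf;        /* registered buffer for the transfer    */
        size_t      len;        /* message length in bytes               */
        int         wait_count; /* for TASK_WAIT: completions to wait on */
    } task_t;

    void post_task_list(task_t *list, int n); /* hypothetical HCA call */

    /* Build a tiny schedule: two sends, then wait for both completions.
     * Once posted, the HCA executes the list without host CPU
     * involvement and raises a CQE on the MCQ when the list finishes. */
    void offload_example(void *sbuf, size_t len)
    {
        task_t list[] = {
            { TASK_SEND, /*peer=*/1, sbuf, len, 0 },
            { TASK_SEND, /*peer=*/2, sbuf, len, 0 },
            { TASK_WAIT, 0, NULL, 0, /*wait_count=*/2 },
        };
        post_task_list(list, 3);
    }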
Motivation
[Figure: timeline for Rank 0 through Rank n; each rank issues MPI_Ialltoall, performs computation (overlap?), then calls MPI_Wait]
Challenge: progress the collective schedules in an asynchronous manner, with:
- performance portability
- minimal host processor intervention
- acceptable communication latency
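For reference, this is the overlap pattern the slide illustrates, written against the standard MPI-3 MPI_Ialltoall interface (these slides pre-date the MPI-3 ratification, so the prototype interface may have differed slightly); do_independent_compute() is a placeholder for application work.

    #include <mpi.h>

    void do_independent_compute(void); /* placeholder application work */

    /* Overlap pattern from the slide: start the all-to-all, compute
     * while it (ideally) progresses in the background, then wait. */
    void overlapped_alltoall(const double *sbuf, double *rbuf, int count,
                             MPI_Comm comm)
    {
        MPI_Request req;

        /* Returns immediately; communication proceeds asynchronously */
        MPI_Ialltoall(sbuf, count, MPI_DOUBLE,
                      rbuf, count, MPI_DOUBLE, comm, &req);

        do_independent_compute();  /* work that does not touch sbuf/rbuf */

        /* Block until the collective completes */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }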
Design Space of Collective Algorithms
[Figure: Blocking Coll, Non-Blocking Coll (Host), and Non-Blocking Coll (Offload) compared along three axes: Overlap (higher is better), Latency (lower is better), and Portability (higher is better)]
Problem Statement
- Can we leverage network offload to design MPI_Ialltoall?
- Will network offload help achieve overlap with collectives?
- Can network offload improve application throughput?
- Can we re-design scientific libraries (such as P3DFFT) to leverage our proposed MPI_Ialltoall?
Creating Task-Lists with the Trigger Operation
- A task-list has multiple phases; each phase has send, wait & trigger tasks
- The progress thread calls ibv_get_cq_event() and blocks
- A trigger task generates an interrupt, which signals the progress thread
[Figure: application posts a task-list (Send, Send, Wait, Trigger, ...) to the InfiniBand HCA; sends and triggers flow through the Send/Recv Queues and CQs over the physical link]
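A plausible skeleton of the progress thread's event loop, using the standard verbs completion-channel calls named on the slide (ibv_get_cq_event(), ibv_ack_cq_events(), and ibv_req_notify_cq() are real verbs; handle_trigger_event() is a hypothetical placeholder for posting the next phase of the task-list):

    #include <infiniband/verbs.h>

    void handle_trigger_event(struct ibv_cq *cq); /* hypothetical: posts
                                 the next phase of the task-list */

    /* Progress thread: sleeps in the kernel until a trigger task on the
     * HCA raises a completion event, then advances the schedule. */
    void *progress_thread(void *arg)
    {
        struct ibv_comp_channel *channel = (struct ibv_comp_channel *)arg;
        struct ibv_cq *cq;
        void *cq_ctx;

        for (;;) {
            /* Blocks with no host CPU usage until an event arrives */
            if (ibv_get_cq_event(channel, &cq, &cq_ctx))
                break;
            ibv_ack_cq_events(cq, 1);  /* acknowledge the event      */
            ibv_req_notify_cq(cq, 0);  /* re-arm for the next event  */
            handle_trigger_event(cq);  /* post the next phase, etc.  */
        }
        return NULL;
    }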
Designing Scalable Offload Alltoall
[Figure: timeline of the two threads]
- Application thread: MPI_Init → MPI_Ialltoall creates the task-list and returns immediately → Compute → MPI_Wait
- Offload progress thread: posts the task-list and blocks in ibv_get_cq_event(); each trigger wakes it to post the next list; at the end of the last list, the Alltoall is complete
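Since the task-list length is limited, the full Alltoall schedule must be split into phases; a minimal sketch of that chunking, reusing the hypothetical task_t and post_task_list() from the earlier sketch (the phase size is an assumed hardware limit, not a documented value):

    /* Hypothetical: split a long Alltoall schedule into phases that fit
     * within the HCA's task-list limit. In the real design, each phase
     * ends with a trigger task and the next phase is posted from the
     * progress thread after the trigger interrupt fires; only the
     * chunking is shown here. */
    #define MAX_TASKS_PER_LIST 64   /* assumed hardware limit */

    void post_schedule_in_phases(task_t *schedule, int ntasks)
    {
        for (int posted = 0; posted < ntasks; posted += MAX_TASKS_PER_LIST) {
            int chunk = ntasks - posted;
            if (chunk > MAX_TASKS_PER_LIST)
                chunk = MAX_TASKS_PER_LIST;
            post_task_list(&schedule[posted], chunk);
        }
    }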
Parallel 3DFFT Library
- Applications in areas such as turbulence simulation and astrophysics rely heavily on 3D FFTs
- P3DFFT from the San Diego Supercomputer Center (SDSC) is a portable, high-performance implementation of 3D FFT (http://code.google.com/p/p3dfft/)
- P3DFFT uses a 2D pencil decomposition to maximize parallelism
- P3DFFT relies on expensive, large-message Alltoall operations to implement the transpose operations, as sketched below
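A simplified picture of why the transposes are expensive: each transpose redistributes the grid among all processes of a row or column communicator, which maps naturally onto MPI_Alltoall. This is schematic, not P3DFFT's actual code; pack_pencils() and unpack_pencils() are hypothetical placeholders for the local reordering.

    #include <mpi.h>

    void pack_pencils(const double *grid, double *sbuf);   /* hypothetical */
    void unpack_pencils(const double *rbuf, double *grid); /* hypothetical */

    /* Schematic transpose step: every process exchanges one block with
     * every other process in the communicator. For P processes and N
     * bytes per pair this moves O(P*N) data per process, which is the
     * dominant cost in P3DFFT. */
    void transpose(double *grid, double *sbuf, double *rbuf,
                   int block, MPI_Comm comm)
    {
        pack_pencils(grid, sbuf);              /* reorder into send blocks */
        MPI_Alltoall(sbuf, block, MPI_DOUBLE,  /* blocking, large-message  */
                     rbuf, block, MPI_DOUBLE, comm);
        unpack_pencils(rbuf, grid);            /* reorder into new pencils */
    }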
Re-designing P3DFFT for Overlap
[Figure: the grid is split into sub-volumes (V1, V2, V3, ...); for each sub-volume, a 1D FFT along one dimension is overlapped with the A-B transpose of another sub-volume; intra-node and inter-node exchanges proceed as two parallel transpose operations]
A sketch of this pipelined structure follows.
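A minimal sketch of the pipelining idea, under the assumption that the redesigned transpose splits the data into chunks and overlaps the 1D FFTs on chunk i with the Ialltoall for chunk i+1 (fft_1d_chunk() and the chunk layout are hypothetical placeholders, not P3DFFT internals):

    #include <mpi.h>

    #define NCHUNKS 4
    void fft_1d_chunk(double *buf, int chunk);  /* hypothetical 1D FFTs */

    /* Pipelined transpose: start the exchange for each chunk, then
     * compute FFTs on the previous chunk while the network works. */
    void pipelined_transpose(double *sbuf[], double *rbuf[], int block,
                             MPI_Comm comm)
    {
        MPI_Request req[NCHUNKS];

        for (int c = 0; c < NCHUNKS; c++) {
            MPI_Ialltoall(sbuf[c], block, MPI_DOUBLE,
                          rbuf[c], block, MPI_DOUBLE, comm, &req[c]);
            if (c > 0) {
                MPI_Wait(&req[c - 1], MPI_STATUS_IGNORE);
                fft_1d_chunk(rbuf[c - 1], c - 1); /* overlaps chunk c's comm */
            }
        }
        MPI_Wait(&req[NCHUNKS - 1], MPI_STATUS_IGNORE);
        fft_1d_chunk(rbuf[NCHUNKS - 1], NCHUNKS - 1);
    }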
Experimental Setup
- Intel Xeon E5640 (2.53 GHz), 12 GB memory per node
- MT26428 QDR ConnectX-2 HCAs with PCI-Express interfaces, 171-port Mellanox QDR switch, OFED 1.5.1
- RHEL 5.4, kernel version 2.6.18-164.el5
- MVAPICH2 (v1.6): a high-performance MPI implementation over InfiniBand and other RDMA networks (http://mvapich.cse.ohio-state.edu/), used by more than 1,580 organizations world-wide
Micro-Benchmark Evaluations
Overlap benchmark (first measure the average MPI_Ialltoall latency, then compute the overlap percentage):
    start_overlap_timer()
    MPI_Ialltoall(..)
    while (time < alltoall_latency)
        compute; update timer
    MPI_Wait(..)
    end_overlap_timer()
Throughput benchmark:
    start_throughput_timer()
    MPI_Ialltoall(..)
    CBLAS_DGEMM()
    MPI_Wait(..)
    end_throughput_timer()
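A self-contained sketch of how such an overlap percentage can be computed. The slide does not give the exact formula, so the metric below (exposed communication time as a fraction of the pure collective latency) is one common formulation, assumed here:

    #include <mpi.h>

    /* Overlap benchmark sketch: compare the total time of (Ialltoall +
     * equal-length compute + Wait) against the pure Ialltoall latency.
     * 100% means the communication was fully hidden behind compute. */
    double measure_overlap(double *sbuf, double *rbuf, int count,
                           MPI_Comm comm, double alltoall_latency)
    {
        MPI_Request req;
        volatile double dummy = 0.0;

        double t0 = MPI_Wtime();
        MPI_Ialltoall(sbuf, count, MPI_DOUBLE,
                      rbuf, count, MPI_DOUBLE, comm, &req);
        while (MPI_Wtime() - t0 < alltoall_latency)
            dummy += 1.0;                 /* synthetic compute slice */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        double total = MPI_Wtime() - t0;

        /* Communication time not hidden behind the compute loop */
        double exposed = total - alltoall_latency;
        double overlap = 100.0 * (1.0 - exposed / alltoall_latency);
        return overlap < 0.0 ? 0.0 : overlap;
    }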
Communication/Computation Overlap
[Graph: Overlap Percentage (%) (0 to 100) vs. Message Length (Bytes) from 16K to 1M, for Alltoall-Offload, Alltoall-Host-Based-Test-10, Alltoall-Host-Based-Test-1000, and Alltoall-Host-Based-Test-5000; Alltoall overlap comparison with 256 processes]
Alltoall-Offload delivers near-perfect communication/computation overlap for all message sizes, in a portable manner
DGEMM Throughput Comparison
[Graph: Throughput (GFLOPS) (0 to 6000) vs. CBLAS-DGEMM problem size N (500 to 3500), for Serial, Alltoall-Offload, Host-Based, and Theoretical Peak]
CBLAS-DGEMM overlapped with Offload-Ialltoall delivers up to 110% better throughput than Host-Based Ialltoall with 512 processes
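For context, the GFLOPS figure for an N x N DGEMM is conventionally computed as 2N^3 floating-point operations divided by wall time. A sketch of the measured kernel, using the standard CBLAS call (buffer allocation and initialization omitted):

    #include <mpi.h>
    #include <cblas.h>

    /* Throughput benchmark body: overlap one Ialltoall with one DGEMM
     * and report the effective DGEMM rate. 2*N^3 flops is the standard
     * operation count for an N x N matrix multiply. */
    double dgemm_gflops(int N, double *A, double *B, double *C,
                        double *sbuf, double *rbuf, int count,
                        MPI_Comm comm)
    {
        MPI_Request req;

        double t0 = MPI_Wtime();
        MPI_Ialltoall(sbuf, count, MPI_DOUBLE,
                      rbuf, count, MPI_DOUBLE, comm, &req);
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    N, N, N, 1.0, A, N, B, N, 0.0, C, N);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        double t = MPI_Wtime() - t0;

        return (2.0 * N * N * N) / t / 1e9;  /* GFLOPS */
    }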
Latency Comparison
[Graph: Latency (msec) (0 to 2500) vs. Message Length (Bytes) from 16K to 1M, for Alltoall-Default-Host, Alltoall-Offload, Alltoall-Host-Based, and Alltoall-Host-Based-Thread; Alltoall latency comparison with 256 processes]
Alltoall-Offload delivers good overlap without sacrificing communication latency!
Parallel 3D FFT Performance (Lower is Better)
[Graph: Run-Time (s) (0 to 5) vs. Problem Size (512, 600, 720, 800) for Blocking, H-Test, and Offload, with annotations of 23% and 13% improvements; P3DFFT application-kernel run-time comparison with 128 processes]
P3DFFT with Offload-Ialltoall performs about 13.5% better than default P3DFFT and about 12% better than P3DFFT with Host-based-Test
Conclusions and Future Work
- The proposed MPI_Ialltoall shows near-perfect (99%) overlap
- Application throughput improves significantly with offload-based non-blocking collectives
- P3DFFT's run-time improved by up to 23%
- Future work: extend offload-based techniques to other MPI collectives and study their benefits with real applications
- Support for offload-based collectives will be available in future MVAPICH2 releases
Thank You!
http://mvapich.cse.ohio-state.edu
(1) {kandalla, subramon, surs, panda}@cse.ohio-state.edu, Network-Based Computing Laboratory, The Ohio State University
(2) ktomko@osc.edu, The Ohio Supercomputer Center
(3) dmitry@sdsc.edu, San Diego Supercomputer Center