Exploiting Computation and Communication Overlap in MVAPICH2 and MVAPICH2-GDR MPI Libraries


1 Exploiting Computation and Communication Overlap in MVAPICH2 and MVAPICH2-GDR MPI Libraries Talk at Overlapping Communication with Computation Symposium (April 18) by Dhabaleswar K. (DK) Panda The Ohio State University

2 Parallel Programming Models Overview
[Diagram: three parallel programming models - Shared Memory Model (SHMEM, DSM), Distributed Memory Model (MPI - Message Passing Interface), and Partitioned Global Address Space (PGAS) (Global Arrays, UPC, Chapel, X10, CAF, ...)]
Programming models provide abstract machine models
Models can be mapped on different types of systems, e.g. Distributed Shared Memory (DSM), MPI within a node, etc.
PGAS models and Hybrid MPI+PGAS models are gradually receiving importance
Overlap Symposium (April 18) 2

3 Supporting Programming Models for Multi-Petaflop and Exaflop Systems: Challenges
Application Kernels/Applications
Programming Models: MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenMP, OpenACC, Cilk, Hadoop (MapReduce), Spark (RDD, DAG), etc.
Middleware: Communication Library or Runtime for Programming Models - Point-to-point Communication, Collective Communication, Energy-Awareness, Synchronization and Locks, I/O and File Systems, Fault Tolerance
Networking Technologies (InfiniBand, 40/100GigE, Aries, and Omni-Path), Multi-/Many-core Architectures, Accelerators (GPU and FPGA)
Co-Design Opportunities and Challenges across Various Layers: Performance, Scalability, Resilience
Overlap Symposium (April 18) 3

4 Basic Concept of Overlapping Communication with Computation
[Diagram: Application / MPI Runtime / Networking Technology layers]
Take Advantage of Overlap - Transparently - Co-design
Design MPI Primitives Exploiting Overlap Capabilities of Network Mechanisms
Overlap Symposium (April 18) 4

5 Overlap Symposium (April 18) 5 Overview of the MVAPICH2 Project
High Performance open-source MPI Library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.1), Started in 2001, First version available in 2002
MVAPICH2-X (MPI + PGAS), Available since 2011
Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), Available since 2014
Support for Virtualization (MVAPICH2-Virt), Available since 2015
Support for Energy-Awareness (MVAPICH2-EA), Available since 2015
Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015
Used by more than 2,875 organizations in 86 countries
More than 460,000 (> 0.46 million) downloads from the OSU site directly
Empowering many TOP500 clusters (Nov '17 ranking)
1st, 10,649,600-core (Sunway TaihuLight) at National Supercomputing Center in Wuxi, China
9th, 556,104-core (Oakforest-PACS) in Japan
12th, 368,928-core (Stampede2) at TACC
17th, 241,108-core (Pleiades) at NASA
48th, 76,032-core (Tsubame 2.5) at Tokyo Institute of Technology
Available with software stacks of many vendors and Linux Distros (RedHat and SuSE)
Empowering Top500 systems for over a decade

6 MVAPICH2 Release Timeline and Downloads
[Chart: cumulative number of downloads from Sep-04 through Jan-18, annotated with release milestones from MV 0.9.4, MV2 0.9.0, and MV2 0.9.8 through MV2 1.0-1.9, MV2-GDR 2.0b, MV2-MIC 2.0, MV2-Virt 2.2, MV2-X 2.3b, MV2-GDR 2.3a, MV2 2.3rc1, and OSU INAM]
Overlap Symposium (April 18) 6

7 Architecture of MVAPICH2 Software Family
High Performance Parallel Programming Models: Message Passing Interface (MPI), PGAS (UPC, OpenSHMEM, CAF, UPC++), Hybrid --- MPI + X (MPI + PGAS + OpenMP/Cilk)
High Performance and Scalable Communication Runtime with Diverse APIs and Mechanisms: Point-to-point Primitives, Collectives Algorithms, Job Startup, Energy-Awareness, Remote Memory Access, I/O and File Systems, Fault Tolerance, Virtualization, Active Messages, Introspection & Analysis
Support for Modern Networking Technology (InfiniBand, iWARP, RoCE, Omni-Path): Transport Protocols (RC, XRC, UD, DC, UMR, ODP), Modern Features (SR-IOV, Multi-Rail)
Support for Modern Multi-/Many-core Architectures (Intel-Xeon, OpenPower, Xeon-Phi, ARM, NVIDIA GPGPU): Transport Mechanisms (Shared Memory, CMA, IVSHMEM, XPMEM*), Modern Features (MCDRAM*, NVLink*, CAPI*)
* Upcoming
Overlap Symposium (April 18) 7

8 Overlap Symposium (April 18) 8 MVAPICH2 Software Family
High-Performance Parallel Programming Libraries:
MVAPICH2 - Support for InfiniBand, Omni-Path, Ethernet/iWARP, and RoCE
MVAPICH2-X - Advanced MPI features, OSU INAM, PGAS (OpenSHMEM, UPC, UPC++, and CAF), and MPI+PGAS programming models with unified communication runtime
MVAPICH2-GDR - Optimized MPI for clusters with NVIDIA GPUs
MVAPICH2-Virt - High-performance and scalable MPI for hypervisor and container based HPC cloud
MVAPICH2-EA - Energy aware and High-performance MPI
MVAPICH2-MIC - Optimized MPI for clusters with Intel KNC
Microbenchmarks:
OMB - Microbenchmarks suite to evaluate MPI and PGAS (OpenSHMEM, UPC, and UPC++) libraries for CPUs and GPUs
Tools:
OSU INAM - Network monitoring, profiling, and analysis for clusters with MPI and scheduler integration
OEMT - Utility to measure the energy consumption of MPI applications

9 Overlap Symposium (April 18) 9 Presentation Outline MVAPICH2/MVAPICH2-X Job Startup Point-to-point Communication Remote Memory Access (RMA) Collective Communication MVAPICH2-GDR Support for InfiniBand Core-Direct GPU-kernel based Reduction Datatype Processing Deep Learning Application: OSU Caffe

10 Overlapping Application Compute with MPI Startup
[Diagram: two timelines for processes P0-P3. Left: MPI_Init (Initialize HCA, Obtain Endpoint Address, Exchange Addresses) completes before the application reads input files, sets up the problem, and computes/communicates - no overlap between MPI_Init and application computation. Right: address exchange and connection setup proceed as communication-independent tasks while the application reads input files and sets up the problem.]
MPI can continue to initialize in the background while the application starts
Overlap Symposium (April 18) 1
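Not part of the original slides: a minimal sketch of the usage pattern this slide describes, assuming a startup design where MPI_Init returns quickly and finishes address exchange / connection setup lazily. The routines read_input_files() and setup_problem() are hypothetical placeholders for the application's communication-independent work.

#include <mpi.h>

/* Placeholder application phases (hypothetical, for illustration only) */
static void read_input_files(void) { /* parse inputs */ }
static void setup_problem(void)    { /* build local data structures */ }

int main(int argc, char **argv) {
    /* With on-demand / non-blocking startup designs, MPI_Init can return
     * quickly and complete address exchange and connection setup lazily. */
    MPI_Init(&argc, &argv);

    /* Communication-independent work overlaps with the remaining MPI startup */
    read_input_files();
    setup_problem();

    /* First real communication; any remaining connection setup happens on demand */
    int rank, sum;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Allreduce(&rank, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}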

11 Towards High Performance and Scalable Startup at Exascale
Techniques: (a) On-demand Connection, (b) PMIX_Ring, (c) PMIX_Ibarrier, (d) PMIX_Iallgather, (e) Shmem based PMI - applied to both the memory required to store endpoint information and job startup performance (state-of-the-art PGAS/MPI vs. optimized)
Near-constant MPI and OpenSHMEM initialization time at any process count
10x and 30x improvement in startup time of MPI and OpenSHMEM respectively at 16,384 processes
Memory consumption for remote endpoint information reduced by O(processes per node)
1GB memory saved per node with 1M processes and 16 processes per node
(a) On-demand Connection Management for OpenSHMEM and OpenSHMEM+MPI. S. Chakraborty, H. Subramoni, J. Perkins, A. A. Awan, and D K Panda, 20th International Workshop on High-level Parallel Programming Models and Supportive Environments (HIPS 15)
(b) PMI Extensions for Scalable MPI Startup. S. Chakraborty, H. Subramoni, A. Moody, J. Perkins, M. Arnold, and D K Panda, Proceedings of the 21st European MPI Users' Group Meeting (EuroMPI/Asia 14)
(c) (d) Non-blocking PMI Extensions for Fast MPI Startup. S. Chakraborty, H. Subramoni, A. Moody, A. Venkatesh, J. Perkins, and D K Panda, 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 15)
(e) SHMEMPMI Shared Memory based PMI for Improved Performance and Scalability. S. Chakraborty, H. Subramoni, J. Perkins, and D K Panda, 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 16)
Overlap Symposium (April 18) 11

12 Startup Performance on KNL + Omni-Path
[Charts: MPI_Init time on TACC Stampede-KNL (Intel MPI 2018 beta vs MVAPICH2 2.3a, 64 to 232K processes) and MPI_Init & Hello World time on Oakforest-PACS (MVAPICH2-2.3a, 64 to 64K processes)]
MPI_Init takes 51 seconds on 231,956 processes on 3,624 KNL nodes (Stampede - full scale)
8.8 times faster than Intel MPI at 128K processes (Courtesy: TACC)
At 64K processes, MPI_Init and Hello World take 5.8s and 21s respectively (Oakforest-PACS)
All numbers reported with 64 processes per node
New designs available in MVAPICH2-2.3a and as patches for SLURM
Overlap Symposium (April 18) 12

13 Overlap Symposium (April 18) 13 Process Management Interface (PMI) over Shared Memory (SHMEMPMI)
SHMEMPMI allows MPI processes to directly read remote endpoint (EP) information from the process manager through shared memory segments
Only a single copy per node - O(processes per node) reduction in memory usage
Estimated savings of 1GB per node with 1 million processes and 16 processes per node
Up to 1,000 times faster PMI Gets compared to default design
Available since MVAPICH2 2.2rc1 and SLURM
[Charts: time taken by one PMI_Get (Default vs SHMEMPMI, ~1,000x) vs number of processes per node; memory usage per node for remote EP information (Fence/Allgather, Default vs Shmem, actual and estimated, up to 16x) for 1K-1M processes per job]

14 On-demand Connection Management for OpenSHMEM+MPI
[Charts: breakdown of OpenSHMEM startup time (Connection Setup, PMI Exchange, Memory Registration, Shared Memory Setup, Other) and performance of OpenSHMEM initialization and Hello World (Static vs On-demand) for up to 8K processes]
Static connection establishment wastes memory and takes a lot of time
On-demand connection management improves OpenSHMEM initialization time by 29.6 times
Time taken for Hello World reduced by 8.31 times at 8,192 processes
Available since MVAPICH2-X 2.1rc1
Overlap Symposium (April 18) 14

15 How to Get the Best Startup Performance with MVAPICH2?
MV2_HOMOGENEOUS_CLUSTER=1 // Set for homogeneous clusters
MV2_ON_DEMAND_UD_INFO_EXCHANGE=1 // Enable UD based address exchange
Using SLURM as launcher:
Use PMI2: ./configure --with-pm=slurm --with-pmi=pmi2 and launch with srun --mpi=pmi2 ./a.out
Use PMI Extensions: patch for SLURM available (patches for SLURM 15, 16, and 17); PMI Extensions are automatically detected by MVAPICH2
Using mpirun_rsh as launcher:
MV2_MT_DEGREE - degree of the hierarchical tree used by mpirun_rsh
MV2_FASTSSH_THRESHOLD - #nodes beyond which hierarchical-ssh scheme is used
MV2_NPROCS_THRESHOLD - #nodes beyond which file-based communication is used for hierarchical-ssh
Overlap Symposium (April 18) 15

16 Overlap Symposium (April 18) 16 Presentation Outline MVAPICH2/MVAPICH2-X Job Startup Point-to-point Communication Remote Memory Access (RMA) Collective Communication MVAPICH2-GDR Support for InfiniBand Core-Direct GPU-kernel based Reduction Datatype Processing Deep Learning Application: OSU Caffe

17 Overlap Symposium (April 18) 17 Communication Costs of Point-to-point Protocols - Eager
Good communication performance for smaller messages
No synchronization required between sender and receiver
Cost of extra copies is high for large messages
[Diagram: the sender copies application data into pre-registered communication buffers (cost: memcpy), the data crosses the high-performance network (cost: network transfer), and the receiver's MPI library copies it from its pre-registered buffers into the application buffer (cost: memcpy)]

18 Communication Costs of Point-to-point Protocols - Rendezvous
Avoid extra copies for larger messages
Synchronization required between sender and receiver
Can be based on RDMA Read or RDMA Write (shown here)
[Diagram: RTS/CTS handshake (cost: half RTT each), RDMA data transfer (cost: network transfer), and final completion message (cost: half RTT)]
Overlap Symposium (April 18) 18

19 Analyzing Overlap Potential of Eager Protocol
[Diagram: application processes schedule send/receive operations on the network interface card, then periodically check for completion while computing; communication progresses in the background]
Application processes schedule communication operation
Network adapter progresses communication in the background
Application process free to perform useful compute in the foreground
Overlap of computation and communication => Better Overall Application Performance
Increased buffer requirement
Poor communication performance if used for all types of communication operations
[Chart: impact of changing Eager Threshold on performance of multi-pair message-rate benchmark with 32 processes on Stampede]
Overlap Symposium (April 18) 19

20 Overlap Symposium (April 18) 20 Analyzing Overlap Potential of Rendezvous Protocol
[Diagram: RTS/CTS handshake between application processes and network interface cards; repeated completion checks return "not complete" until the final synchronization]
Application processes schedule communication operation
Application process free to perform useful compute in the foreground
Little communication progress in the background
All communication takes place at final synchronization
Reduced buffer requirement
Good communication performance if used for large message sizes and operations where communication library is progressed frequently
Poor overlap of computation and communication => Poor Overall Application Performance
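Not part of the original slides: a minimal non-blocking point-to-point sketch of the overlap pattern discussed on the last two slides. Whether the message actually moves during the compute loop depends on the protocol chosen (eager vs. rendezvous) and on how often the library gets a chance to progress, here via MPI_Testall. Buffer sizes and iteration counts are arbitrary.

#include <mpi.h>

#define COUNT (1 << 16)   /* 64K doubles = 512 KB: rendezvous range with default thresholds */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size, flag = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    static double sbuf[COUNT], rbuf[COUNT], work[COUNT];
    int peer = rank ^ 1;                      /* simple pairwise exchange */
    MPI_Request reqs[2];

    if (peer < size) {
        /* Post communication early ... */
        MPI_Irecv(rbuf, COUNT, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(sbuf, COUNT, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[1]);

        /* ... and overlap independent computation, calling MPI_Testall now and
         * then so the library has a chance to progress a rendezvous transfer. */
        for (int iter = 0; iter < 100; iter++) {
            for (int i = 0; i < COUNT; i++) work[i] += 1.0;  /* stand-in compute */
            MPI_Testall(2, reqs, &flag, MPI_STATUSES_IGNORE);
        }
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    }
    MPI_Finalize();
    return 0;
}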

21 Impact of Tuning Rendezvous Threshold on 3D-Stencil
[Charts: communication time, overlap potential, and overall performance (Default vs Tuned) for 1K-256K byte messages]
Increased eager threshold from 16KB to 512KB
Very small degradation in raw communication performance
Significant improvement in overlap of computation and communication
~18% improvement in overall performance
MV2_IBA_EAGER_THRESHOLD=512K MV2_SMP_EAGERSIZE=512K (applicable to both InfiniBand and Omni-Path)
8192 Processes, SandyBridge + FDR
Overlap Symposium (April 18) 21

22 Impact of Tuning Rendezvous Protocol on 3D-Stencil
[Charts: communication time, overlap potential, and overall performance (Default vs Tuned) across message sizes]
RDMA Read based protocol (RGET) used instead of RDMA Write
Very minor penalty in raw performance
Offers more overlap due to less synchronization overhead
Up to 15% improvement in overall execution time
MV2_RNDV_PROTOCOL=RGET (applicable to InfiniBand)
64 Processes, Broadwell + EDR
Overlap Symposium (April 18) 22

23 Overlap Symposium (April 18) 23 Dynamic and Adaptive MPI Point-to-point Communication Protocols
Different communication protocols have different trade-offs - need to consider performance, overlap, and memory requirement
Manual tuning is difficult and time-consuming
Can the MPI library select the best protocol at runtime?
Use different protocols and thresholds between different pairs of processes
Deliver good performance and minimize resource consumption
Dynamically adapt to the application's communication requirements at runtime
Default: poor overlap, low memory requirement - Low Performance, High Productivity
Manually Tuned: good overlap, high memory requirement - High Performance, Low Productivity
Dynamic + Adaptive: good overlap, optimal memory requirement - High Performance, High Productivity

24 Dynamic and Adaptive MPI Point-to-point Communication Protocols (cont.)
[Table: desired eager threshold for an example communication pattern between process pairs on two nodes - the Default design uses 16 KB for every pair, a Manually Tuned design uses 128 KB for every pair, while the Dynamic + Adaptive design matches the desired per-pair thresholds (32 KB, 64 KB, 128 KB, 32 KB)]
[Charts: execution time and relative memory consumption of Amber vs number of processes for Default, Threshold=17K, Threshold=64K, Threshold=128K, and Dynamic Threshold]
H. Subramoni, S. Chakraborty, D. K. Panda, Designing Dynamic & Adaptive MPI Point-to-Point Communication Protocols for Efficient Overlap of Computation & Communication, ISC'17 - Best Paper
Overlap Symposium (April 18) 24

25 Overlap Symposium (April 18) 25 Presentation Outline MVAPICH2/MVAPICH2-X Job Startup Point-to-point Communication Remote Memory Access (RMA) Collective Communication MVAPICH2-GDR Support for InfiniBand Core-Direct GPU-kernel based Reduction Datatype Processing Deep Learning Application: OSU Caffe

26 MPI-3 RMA: Communication and Synchronization Primitives
Non-blocking one-sided communication routines: Put, Get (Rput, Rget), Accumulate, Get_accumulate, Atomics
MVAPICH2 supports all RMA communication with best performance and overlap
Flexible synchronization operations to control initiation and completion
MPI one-sided Synchronization/Completion primitives:
Synchronization - Fence, Post-Wait/Start-Complete, Lock/Unlock, Lock_all/Unlock_all
Completion - Win_sync, Flush_local, Flush, Flush_local_all, Flush_all
Overlap Symposium (April 18) 26
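Not part of the original slides: a minimal passive-target RMA sketch of the pattern whose overlap the next slide measures - initiate an MPI_Put under a shared lock, do independent computation, then force local completion with MPI_Win_flush. Window size and target choice are arbitrary, and this is not the OSU benchmark itself.

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n = 1 << 20;                        /* 1M doubles per window */
    double *win_buf, *src = malloc(n * sizeof(double));
    for (int i = 0; i < n; i++) src[i] = (double)i;

    MPI_Win win;
    MPI_Win_allocate(n * sizeof(double), sizeof(double),
                     MPI_INFO_NULL, MPI_COMM_WORLD, &win_buf, &win);

    int target = (rank + 1) % size;
    MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);

    /* Initiate the one-sided transfer ... */
    MPI_Put(src, n, MPI_DOUBLE, target, 0, n, MPI_DOUBLE, win);

    /* ... and overlap independent computation before forcing completion */
    double acc = 0.0;
    for (int i = 0; i < n; i++) acc += (double)i * 0.5;

    MPI_Win_flush(target, win);                   /* complete the Put at the origin */
    MPI_Win_unlock(target, win);

    MPI_Win_free(&win);
    free(src);
    MPI_Finalize();
    return (acc < 0.0);                           /* keep the compute from being optimized away */
}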

27 Overlap Symposium (April 18) 27 Overlap between Computation and RMA Operations
[Charts: overlap (%) achieved by MPI_Put and MPI_Get with Lock/Unlock and MPI_Win_flush for 1K-4M byte messages with MVAPICH2-2.3rc1]
High overlap between MPI_Put and computation
75-99% overlap between MPI_Get and computation
Platform: MVAPICH2-2.3rc1, Intel Haswell (3.1 GHz) node with 20 cores, Mellanox ConnectX-4 EDR HCA, Mellanox OFED 4.3

28 Graph Processing Framework with Optimized MPI RMA
[Charts: execution time of PageRank and Weakly Connected Components (WCC) with the LiveJournal dataset vs number of cores - Mizan-Default vs Mizan-RMA-Opt, up to 2X and 3X better respectively]
Proposed design performs better than default implementation
For Weakly Connected Components (WCC) on 256 cores, proposed design could reduce the total execution time by 2X compared with the default scheme
M. Li, X. Lu, K. Hamidouche, J. Zhang and D. K. Panda, "Mizan-RMA: Accelerating Mizan Graph Processing Framework with MPI RMA," IEEE HiPC, 2016
Overlap Symposium (April 18) 28

29 Graph Processing Framework with Optimized MPI RMA
[Chart: total execution time of PageRank with the Arabic dataset at 128 processes - Mizan-Default vs Mizan-RMA-Opt, up to 2.5X better]
Proposed design shows good strong scaling
Proposed design scales better than default implementation
M. Li, X. Lu, K. Hamidouche, J. Zhang and D. K. Panda, "Mizan-RMA: Accelerating Mizan Graph Processing Framework with MPI RMA," IEEE HiPC, 2016
Overlap Symposium (April 18) 29

30 Overlap Symposium (April 18) 3 Presentation Outline MVAPICH2/MVAPICH2-X Job Startup Point-to-point Communication Remote Memory Access (RMA) Collective Communication MVAPICH2-GDR Support for InfiniBand Core-Direct GPU-kernel based Reduction Datatype Processing Deep Learning Application: OSU Caffe

31 Collective Communication in MVAPICH2
Blocking and Non-Blocking Collective Algorithms in MV2: Conventional (Flat) and Multi-/Many-Core Aware Designs
Inter-Node Communication: Point to Point (RDMA), Hardware Multicast, SHARP (designed for performance & overlap)
Intra-Node Communication: Point to Point (SHMEM, LiMIC, CMA, XPMEM), Direct Shared Memory, Direct Kernel Assisted (CMA, XPMEM, LiMIC)
Run-time flags:
All shared-memory based collectives: MV2_USE_SHMEM_COLL (Default: ON)
Hardware Mcast-based collectives: MV2_USE_MCAST (Default: OFF)
CMA-based collectives: MV2_USE_CMA_COLL (Default: ON)
Overlap Symposium (April 18) 31

32 Hardware Multicast-aware MPI_Bcast on TACC Stampede
[Charts: MPI_Bcast latency for small and large messages at 102,400 cores (Default vs Multicast, up to 85% lower), and per-node scaling for a small byte-size message and a KByte-size message]
MCAST-based designs improve latency of MPI_Bcast by up to 85%
Use MV2_USE_MCAST=1 to enable MCAST-based designs
Overlap Symposium (April 18) 32

33 Optimized CMA-based Collectives for Large Messages
[Charts: MPI_Gather latency on 2, 4, and 8 KNL nodes (64 PPN) for 1K-4M byte messages - Tuned CMA vs MVAPICH2-2.3a, Intel MPI 2017, and OpenMPI 2.1.0; roughly 2.5x, 3.2x, and 4x better, up to 17x better]
Significant improvement over existing implementation for Scatter/Gather with 1MB messages (up to 4x on KNL, 2x on Broadwell, 14x on OpenPower)
New two-level algorithms for better scalability
Improved performance for other collectives (Bcast, Allgather, and Alltoall)
S. Chakraborty, H. Subramoni, and D. K. Panda, Contention Aware Kernel-Assisted MPI Collectives for Multi/Many-core Systems, IEEE Cluster 17, Best Paper Finalist
Available in MVAPICH2-X 2.3b
Overlap Symposium (April 18) 33

34 Shared Address Space (XPMEM)-based Collectives Design
[Charts: OSU_Allreduce and OSU_Reduce latency on Broadwell with 256 processes for 16K-4M byte messages - MVAPICH2-2.3b vs IMPI-2017v1.132 vs the proposed MVAPICH2-Opt, up to 1.8X and 4X better respectively]
Shared Address Space-based true zero-copy Reduction collective designs in MVAPICH2
Offloaded computation/communication to peer ranks in reduction collective operation
Up to 4X improvement for 4MB Reduce and up to 1.8X improvement for 4MB AllReduce
J. Hashmi, S. Chakraborty, M. Bayatpour, H. Subramoni, and D. Panda, Designing Efficient Shared Address Space Reduction Collectives for Multi-/Many-cores, International Parallel & Distributed Processing Symposium (IPDPS '18), May 2018
Will be available in future releases
Overlap Symposium (April 18) 34

35 Application-Level Benefits of XPMEM-Based Collectives
[Charts: execution time of CNTK AlexNet training (Broadwell, B.S=default, iteration=5, ppn=28) and MiniAMR (Broadwell, ppn=16) vs number of processes - IMPI-2017v1.132, MVAPICH2-2.3b, and MVAPICH2-Opt]
Up to 20% benefits over IMPI for CNTK DNN training using AllReduce
Up to 27% benefits over IMPI and up to 15% improvement over MVAPICH2 for MiniAMR application kernel
Overlap Symposium (April 18) 35

36 Problems with Blocking Collective Operations
[Diagram: all application processes block inside the collective; communication time cannot be used for compute]
No overlap of computation and communication - inefficient
Overlap Symposium (April 18) 36

37 Concept of Non-blocking Collectives
[Diagram: each application process hands the collective to a communication support entity (Schedule Operation), then periodically checks if the operation is complete while computing]
Application processes schedule collective operation
Check periodically if operation is complete
Overlap of computation and communication => Better Performance
Catch: who will progress communication?
Overlap Symposium (April 18) 37

38 Overlap Symposium (April 18) 38 Non-blocking Collective (NBC) Operations
Enables overlap of computation with communication
Non-blocking calls do not match blocking collective calls
MPI may use different algorithms for blocking and non-blocking collectives: blocking collectives are optimized for latency, non-blocking collectives are optimized for overlap
A process calling a NBC operation schedules the collective operation and immediately returns, executes application computation code, and then waits for the end of the collective
The communication is progressed by: application code through MPI_Test, the network adapter (HCA) with hardware support, or dedicated processes/threads in the MPI library
There is a non-blocking equivalent for each blocking operation - it has an "I" in the name: MPI_Bcast -> MPI_Ibcast; MPI_Reduce -> MPI_Ireduce

39 Overlap Symposium (April 18) 39 How do I write applications with NBC?

#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int size, flag = 0;
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int sbuf[size], rbuf[size];
    for (int i = 0; i < size; i++) sbuf[i] = i;
    MPI_Request req;

    MPI_Ialltoall(sbuf, 1, MPI_INT, rbuf, 1, MPI_INT, MPI_COMM_WORLD, &req);
    /* Computation that does not depend on the result of the Alltoall */
    MPI_Test(&req, &flag, MPI_STATUS_IGNORE);  /* Check if complete (non-blocking) */
    /* More computation that does not depend on the result of the Alltoall */
    MPI_Wait(&req, MPI_STATUS_IGNORE);         /* Wait till complete (blocking) */

    MPI_Finalize();
    return 0;
}

40 P3DFFT Performance with Non-Blocking Alltoall using RDMA Primitives
[Charts: CPU time per loop for small scale and large scale runs (up to 8K processes) - Default, RDMA-Aware, and Default-Thread]
Weak scaling experiments; problem size increases with job size
RDMA-Aware delivers 19% improvement over Default at 8,192 processes
Default-Thread exhibits worst performance, possibly because threads steal CPU cycles from P3DFFT; not considered for large scale experiments
Designing Non-Blocking Personalized Collectives with Near Perfect Overlap for RDMA-Enabled Clusters, H. Subramoni, A. Awan, K. Hamidouche, D. Pekurovsky, A. Venkatesh, S. Chakraborty, K. Tomko, and D. K. Panda, ISC '15, Jul 2015
Will be available in future releases
Overlap Symposium (April 18) 40

41 Offloading with Scalable Hierarchical Aggregation Protocol (SHArP)
Management and execution of MPI operations in the network by using SHArP
Manipulation of data while it is being transferred in the switch network
SHArP provides an abstraction to realize the reduction operation: defines Aggregation Nodes (AN), Aggregation Tree, and Aggregation Groups
AN logic is implemented as an InfiniBand Target Channel Adapter (TCA) integrated into the switch ASIC*
Uses RC for communication between ANs and between AN and hosts in the Aggregation Tree*
[Figures: physical network topology and logical SHArP tree*]
* Bloch et al., Scalable Hierarchical Aggregation Protocol (SHArP): A Hardware Architecture for Efficient Data Reduction
Overlap Symposium (April 18) 41

42 Evaluation of SHArP based Non-Blocking Allreduce
MPI_Iallreduce Benchmark (1 PPN*, 8 Nodes)
[Charts: pure communication latency (lower is better; MVAPICH2-SHArP up to 2.3x lower than MVAPICH2) and communication-computation overlap (%) (higher is better) vs message size]
Complete offload of the Allreduce collective operation to the switch helps achieve much higher overlap of communication and computation
*PPN: Processes Per Node
Available since MVAPICH2 2.3a
Overlap Symposium (April 18) 42
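Not part of the original slides: a sketch of the overlap pattern an MPI_Iallreduce benchmark like the one above exercises. The point is that the application code does not change to benefit from SHArP - it simply posts the non-blocking reduction, computes, and waits; with the offload enabled the switch progresses the reduction. Sizes and the dummy compute loop are arbitrary.

#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    enum { N = 4096 };
    double in[N], out[N], dummy = 0.0;
    for (int i = 0; i < N; i++) in[i] = (double)i;

    MPI_Request req;
    MPI_Iallreduce(in, out, N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD, &req);

    /* Computation that is independent of the reduction result */
    for (int i = 0; i < N; i++) dummy += (double)i * 0.5;

    MPI_Wait(&req, MPI_STATUS_IGNORE);
    MPI_Finalize();
    return (dummy < 0.0);
}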

43 Collective Offload in ConnectX-2, ConnectX-3, Connect-IB, ConnectX-4, and ConnectX-5
Mellanox's ConnectX-2, ConnectX-3, Connect-IB, ConnectX-4, and ConnectX-5 adapters feature a task-list offload interface - an extension to existing InfiniBand APIs
Collective communication with 'blocking' feature is usually a scaling bottleneck
Matches with the need for non-blocking collectives in MPI
Accordingly, MPI software stacks need to be re-designed to leverage offload in a comprehensive manner
Can applications be modified to take advantage of non-blocking collectives, and what will be the benefits?
Overlap Symposium (April 18) 43

44 Overlap Symposium (April 18) 44 Collective Offload Support in ConnectX InfiniBand Adapter (Recv followed by Multi-Send)
[Diagram: the application builds a task list (MQ/MCQ) of send and wait WQEs that is posted to the HCA's send and receive queues]
Sender creates a task-list consisting of only send and wait WQEs
One send WQE is created for each registered receiver and is appended to the rear of a singly linked task-list
A wait WQE is added to make the ConnectX-2 HCA wait for the ACK packet from the receiver

45 Co-designing HPL with Core-Direct and Performance Benefits
[Charts: normalized HPL performance vs problem size (as % of total memory), memory consumption vs system size, and HPL throughput (GFlops) with 512 processes - HPL-Offload, HPL-1ring, HPL-Host]
HPL-Offload consistently offers higher throughput than HPL-1ring and HPL-Host; improves peak throughput by up to 4.5% for large problem sizes
HPL-Offload surpasses the peak throughput of HPL-1ring with significantly smaller problem sizes and run-times!
K. Kandalla, H. Subramoni, J. Vienne, S. Pai Raikar, K. Tomko, S. Sur, and D K Panda, Designing Non-blocking Broadcast with Collective Offload on InfiniBand Clusters: A Case Study with HPL, HOTI 2011
Overlap Symposium (April 18) 45

46 Pre-conditioned Conjugate Gradient (PCG) Solver Performance with Non-Blocking Allreduce based on CX-2 Collective Offload
[Chart: run-time (s) of PCG-Default vs Modified-PCG-Offload vs number of processes, 21% better]
64,000 unknowns per process
Modified PCG with Offload-Allreduce performs 21% better than default PCG
K. Kandalla, U. Yang, J. Keasler, T. Kolev, A. Moody, H. Subramoni, K. Tomko, J. Vienne and D. K. Panda, Designing Non-blocking Allreduce with Collective Offload on InfiniBand Clusters: A Case Study with Conjugate Gradient Solvers, IPDPS 12, May 2012
Overlap Symposium (April 18) 46

47 Overlap Symposium (April 18) 47 Presentation Outline MVAPICH2/MVAPICH2-X Job Startup Point-to-point Communication Remote Memory Access (RMA) Collective Communication MVAPICH2-GDR Support for InfiniBand Core-Direct GPU-kernel based Reduction Datatype Processing Deep Learning Application: OSU Caffe

48 GPU-Aware (CUDA-Aware) MPI Library: MVAPICH2-GPU
Standard MPI interfaces used for unified data movement
Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)
Overlaps data movement from GPU with RDMA transfers
At Sender: MPI_Send(s_devbuf, size, ...);
At Receiver: MPI_Recv(r_devbuf, size, ...);
The staging and pipelining happen inside MVAPICH2
High Performance and High Productivity
Overlap Symposium (April 18) 48
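Not part of the original slides: a minimal sketch of the calling pattern above, assuming a CUDA-aware MPI build, at least two ranks, and one GPU per process (error checking omitted). Device pointers are passed directly to MPI_Send/MPI_Recv and the library handles the GPU data movement internally.

#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;                        /* 1M floats */
    float *devbuf;
    cudaMalloc((void **)&devbuf, n * sizeof(float));
    cudaMemset(devbuf, 0, n * sizeof(float));

    /* Device pointers go straight into MPI calls; no explicit cudaMemcpy
     * to a host staging buffer is needed with a CUDA-aware MPI. */
    if (rank == 0)
        MPI_Send(devbuf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(devbuf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaFree(devbuf);
    MPI_Finalize();
    return 0;
}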

49 CUDA-Aware MPI: MVAPICH2-GDR Releases Support for MPI communication from NVIDIA GPU device memory High performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host and Host-GPU) High performance intra-node point-to-point communication for multi-gpu adapters/node (GPU-GPU, GPU-Host and Host-GPU) Taking advantage of CUDA IPC (available since CUDA 4.1) in intra-node communication for multiple GPU adapters/node Optimized and tuned collectives for GPU device buffers MPI datatype support for point-to-point and collective communication from GPU device buffers Unified memory Overlap Symposium (April 18) 49

50 GPU-Direct RDMA (GDR) with CUDA
OFED with support for GPUDirect RDMA is developed by NVIDIA and Mellanox
OSU has a design of MVAPICH2 using GPUDirect RDMA
Hybrid design using GPUDirect RDMA and host-based pipelining
Alleviates P2P bandwidth bottlenecks on SandyBridge and IvyBridge; similar bottlenecks on Haswell
Support for communication using multi-rail
Support for Mellanox Connect-IB and ConnectX VPI adapters
Support for RoCE with Mellanox ConnectX VPI adapters
[Table: peak P2P bandwidth between IB adapter and GPU memory - SNB E5-2670: P2P write 5.2 GB/s intra-socket, <300 MB/s inter-sockets; P2P read <1.0 GB/s intra-socket, <300 MB/s inter-sockets. IVB E5-2680V2: P2P write 6.4 GB/s intra-socket, <300 MB/s inter-sockets; P2P read 3.5 GB/s intra-socket, <300 MB/s inter-sockets]
Overlap Symposium (April 18) 50

51 Overlap Symposium (April 18) 51 Optimized MVAPICH2-GDR Design
[Charts: GPU-GPU inter-node latency (1.88 us, 10x better), bandwidth (9x better), and bi-bandwidth (11X better) - MV2-(NO-GDR) vs MV2-GDR-2.3a]
Platform: MVAPICH2-GDR-2.3a, Intel Haswell (3.1 GHz) node with 20 cores, NVIDIA Volta V100 GPU, Mellanox ConnectX-4 EDR HCA, CUDA 9.0, Mellanox OFED 4.0 with GPU-Direct-RDMA

52 Overlap with Optimized MVAPICH2-GDR Design
[Charts: overlap (%) for GPU-GPU intra-node and inter-node transfers (MVAPICH2-(NO-GDR) vs MVAPICH2-GDR-2.3a) for 1K-1M byte messages]
Up to 69% overlap* for intra-node GPU-GPU communication
With GDR, up to 78% overlap* for inter-node small and medium message transfers
With intelligent pipelining, up to 88% overlap* for inter-node large message transfers
*Overlap between GPU-to-GPU communication and CPU computation
Platform: MVAPICH2-GDR-2.3a, Intel Haswell (3.1 GHz) node with 20 cores, NVIDIA Volta V100 GPU, Mellanox ConnectX-4 EDR HCA, CUDA 9.0, Mellanox OFED 4.0 with GPU-Direct-RDMA
Overlap Symposium (April 18) 52

53 Application-Level Evaluation (HOOMD-blue)
[Charts: average time steps per second (TPS) for 64K and 256K particle runs vs number of processes - MV2 vs MV2+GDR, up to 2X better]
Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB), HOOMD-blue Version 1.0.5
GDRCOPY enabled: MV2_USE_CUDA=1 MV2_IBA_HCA=mlx5_0 MV2_IBA_EAGER_THRESHOLD=32768 MV2_VBUF_TOTAL_SIZE=32768 MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768 MV2_USE_GPUDIRECT_GDRCOPY=1 MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=
Overlap Symposium (April 18) 53

54 Application-Level Evaluation (Cosmo) and Weather Forecasting in Switzerland
[Charts: normalized execution time on the Wilkes GPU cluster and the CSCS GPU cluster vs number of GPUs - Default vs Callback-based vs Event-based designs]
2X improvement on 32 GPU nodes; 30% improvement on 96 GPU nodes (8 GPUs/node)
Cosmo model: /tasks/operational/meteoswiss/
On-going collaboration with CSCS and MeteoSwiss (Switzerland) in co-designing MV2-GDR and the Cosmo application
C. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee, H. Subramoni, and D. K. Panda, Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems, IPDPS 16
Overlap Symposium (April 18) 54

55 Overlap Symposium (April 18) 55 Presentation Outline MVAPICH2/MVAPICH2-X Job Startup Point-to-point Communication Remote Memory Access Collective Communication MVAPICH2-GDR Support for InfiniBand CORE-Direct GPU-kernel based Reduction Datatype Processing OSU Caffe

56 Motivation: Exploiting CORE-Direct and GPUDirect RDMA
Applications use GPU/CPU resources for computation and MPI for communication directly from GPU buffers
MPI collectives are common in GPU applications, e.g. Alltoall for FFTs
Collectives are time consuming at scale, so MPI-3.0 introduced NBCs
Non-blocking communication operations from GPU buffers can:
Allow CPU to overlap GPU-based communication with CPU compute
Ease GPU kernels' redundancy in waiting for non-dependent communication
Allow power efficient execution from the CPU perspective
Rich set of GPU and network primitives available for NBC designs, but architectural limitations must be addressed
A. Venkatesh, K. Hamidouche, H. Subramoni, and D. K. Panda, Offloaded GPU Collectives using CORE-Direct and CUDA Capabilities on IB Clusters, HiPC 15
Overlap Symposium (April 18) 56

57 Overview of Core-Direct + GPUDirect Designs
Realized through mapping of the MPICH schedule abstraction: a schedule is composed of sched_send, sched_barrier, sched_recv, sched_start, etc.
Mapped to Core-Direct primitives, with collective-specific GPU/Host handling done additionally
Multiple designs explored:
Naïve Design: Host-assisted GPU NBC (Scatter)
Offload-Staged: Host-assisted GPU NBC + Core-Direct
Offload-GDR: (GDR + Core-Direct)-based NBC
Offload-Callback: (Core-Direct, GDR, CUDA)-based NBC
Overlap Symposium (April 18) 57

58 Latency Comparison with Blocking Collectives
[Charts: 64-node Ialltoall and Iallgather latency (us) vs message size (4K-1MB) - gdr-offload-cb, staged-offload, offload-gdr, and Blocking]
Use of GDR and CUDA callback mechanisms improves latency (comparable for alltoall)
Latency remains high for alltoall even though the callback design is meant to avoid staging latency
Overlap Symposium (April 18) 58

59 Effect of Compute Location on Overlap/Latency
[Charts: Iallgather overlap (%) and latency (us) for 4K-256K byte messages - gdr-offload-cb and offload-gdr variants with compute placed on GPU vs CPU]
New schemes are able to exploit overlap well
Available in MVAPICH2-GDR 2.3a
Overlap Symposium (April 18) 59

60 Overlap Symposium (April 18) 6 Presentation Outline MVAPICH2/MVAPICH2-X Job Startup Point-to-point Communication Remote Memory Access (RMA) Collective Communication MVAPICH2-GDR Support for InfiniBand CORE-Direct GPU-kernel based Reduction Datatype Processing Deep Learning Application: OSU Caffe

61 GPU-kernel based Reduction
Scientific parallel applications spend a considerable amount of time in GPU-based collective communication operations, e.g. deep learning frameworks such as TensorFlow and Caffe
Optimized computation-intensive collectives in MVAPICH2-GDR: MPI_Reduce and MPI_Allreduce
Exploring the best combinations: computation on CPU or GPU, communication through host or GPU memory
[Diagram: Node A and Node B, each with CPU, host memory, PCIe, GPU, and IB adapter, showing the possible compute/communication paths]
Overlap Symposium (April 18) 61

62 Evaluation - CSCS (96 GPUs)
[Charts: latency (us) vs message size for Default, BD-DD, GR-DD, GR-HD, GR-HH, and GR-H-HH designs]
Gather-first approaches* win for small messages
K-nomial GPU-based approaches* win for large messages
*Ching-Hsiang Chu, Khaled Hamidouche, Akshay Venkatesh, Ammar Ahmad Awan, and Dhabaleswar K. Panda, "CUDA Kernel based Collective Reduction Operations on Large-scale GPU Clusters", IEEE/ACM CCGrid 16
Overlap Symposium (April 18) 62

63 Evaluation - MPI_Allreduce
Available in MVAPICH2-GDR 2.3a
[Charts: latency (ms) of Default vs RD-DD vs BRB-DD across message sizes on 96 GPUs (CSCS), and good scalability on the Wilkes cluster up to 32 GPUs]
Ching-Hsiang Chu, Khaled Hamidouche, Akshay Venkatesh, Ammar Ahmad Awan, and Dhabaleswar K. Panda, "CUDA Kernel based Collective Reduction Operations on Large-scale GPU Clusters", IEEE/ACM CCGrid 16
Overlap Symposium (April 18) 63

64 Overlap Symposium (April 18) 64 Presentation Outline MVAPICH2/MVAPICH2-X Job Startup Point-to-point Communication Remote Memory Access (RMA) Collective Communication MVAPICH2-GDR Support for InfiniBand CORE-Direct GPU-kernel based Reduction Datatype Processing Deep Learning Application: OSU Caffe

65 Non-contiguous Data Exchange Halo data exchange Multi-dimensional data Row based organization Contiguous on one dimension Non-contiguous on other dimensions Halo data exchange Duplicate the boundary Exchange the boundary in each iteration Overlap Symposium (April 18) 65

66 MPI Datatype support in MVAPICH2
Datatypes support in MPI: operate on customized datatypes to improve productivity and enable the MPI library to optimize non-contiguous data

At Sender:
MPI_Type_vector(n_blocks, n_elements, stride, old_type, &new_type);
MPI_Type_commit(&new_type);
MPI_Send(s_buf, size, new_type, dest, tag, MPI_COMM_WORLD);

Inside MVAPICH2:
- Use datatype-specific CUDA kernels to pack data in chunks
- Efficiently move data between nodes using RDMA
- In progress - currently optimizes vector and hindexed datatypes
- Transparent to the user
H. Wang, S. Potluri, D. Bureddy, C. Rosales and D. K. Panda, GPU-aware MPI on RDMA-Enabled Clusters: Design, Implementation and Evaluation, IEEE Transactions on Parallel and Distributed Systems, Vol. 25, No. 10, Oct 2014
Overlap Symposium (April 18) 66
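Not part of the original slides: a complete, hedged version of the vector-datatype pattern shown above, exchanging one non-contiguous column of a row-major grid with a neighbor. Grid dimensions and the ring-style neighbor choice are arbitrary; with a CUDA-aware build the same calls would also accept device buffers.

#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    enum { NX = 512, NY = 512 };
    static double grid[NX][NY];

    /* NX blocks of 1 element, separated by a stride of NY elements:
     * this describes one column of the row-major grid. */
    MPI_Datatype column;
    MPI_Type_vector(NX, 1, NY, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    int right = (rank + 1) % size;
    int left  = (rank - 1 + size) % size;

    /* Send our last column to the right neighbor; receive the left
     * neighbor's column into our first (halo) column. */
    MPI_Sendrecv(&grid[0][NY - 1], 1, column, right, 0,
                 &grid[0][0],      1, column, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Type_free(&column);
    MPI_Finalize();
    return 0;
}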

67 Overlap Symposium (April 18) 67 MPI Datatype Processing (Computation Optimization ) Comprehensive support Targeted kernels for regular datatypes - vector, subarray, indexed_block Generic kernels for all other irregular datatypes Separate non-blocking stream for kernels launched by MPI library Avoids stream conflicts with application kernels Flexible set of parameters for users to tune kernels Vector MV2_CUDA_KERNEL_VECTOR_TIDBLK_SIZE MV2_CUDA_KERNEL_VECTOR_YSIZE Subarray MV2_CUDA_KERNEL_SUBARR_TIDBLK_SIZE MV2_CUDA_KERNEL_SUBARR_XDIM MV2_CUDA_KERNEL_SUBARR_YDIM MV2_CUDA_KERNEL_SUBARR_ZDIM Indexed_block MV2_CUDA_KERNEL_IDXBLK_XDIM

68 Overlap Symposium (April 18) 68 MPI Datatype Processing (Communication Optimization)
Common Scenario:
MPI_Isend(Buf1, ..., req1);
MPI_Isend(Buf2, ..., req2);
...
Application work on the CPU/GPU
...
MPI_Waitall(requests, ...);
*Buf1, Buf2 contain non-contiguous MPI Datatypes
Waste of computing resources on CPU and GPU

69 MPI Datatype Processing (Communication Optimization)
Available in MVAPICH2-GDR 2.3a
Modified CUDA-Aware DDTBench for NAS_MG_y
Up to 90% overlap between datatype processing and other computation
[Chart: overlap (%) for Default, Event-based, and Callback-based designs at input sizes [32x16x16], [128x64x64], [256x128x128], and [512x256x256]]
C. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee, H. Subramoni, and D. K. Panda, Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems, IPDPS 16
Overlap Symposium (April 18) 69

70 Overlap Symposium (April 18) 7 Presentation Outline MVAPICH2/MVAPICH2-X Job Startup Point-to-point Communication Remote Memory Access (RMA) Collective Communication MVAPICH2-GDR Support for InfiniBand CORE-Direct GPU-kernel based Reduction Datatype Processing Deep Learning Application: OSU Caffe

71 Deep Learning: New Challenges for MPI Runtimes
Deep Learning frameworks are a different game altogether: unusually large message sizes (order of megabytes), most communication based on GPU buffers
Existing State-of-the-art: cuDNN, cuBLAS, NCCL --> scale-up performance; NCCL2, CUDA-Aware MPI --> scale-out performance, for small and medium message sizes only!
Proposed: Can we co-design the MPI runtime (MVAPICH2-GDR) and the DL framework (Caffe) to achieve both?
Efficient overlap of computation and communication
Efficient large-message communication (reductions)
What application co-designs are needed to exploit communication-runtime co-designs?
[Diagram: scale-up vs scale-out performance of NCCL, cuDNN, cuBLAS, NCCL2, gRPC, Hadoop, MPI, and the proposed co-designs]
A. A. Awan, K. Hamidouche, J. M. Hashmi, and D. K. Panda, S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters. In Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '17)
Overlap Symposium (April 18) 71

72 OSU-Caffe: Proposed Co-Design Overview To address the limitations of Caffe and existing MPI runtimes, we propose the OSU-Caffe (S-Caffe) framework At the application (DL framework) level Develop a fine-grain workflow i.e. layer-wise communication instead of communicating the entire model At the runtime (MPI) level Develop support to perform reduction of very-large GPU buffers Perform reduction using GPU kernels OSU-Caffe is available from the HiDL project page ( Overlap Symposium (April 18) 72

73 Optimized Data Propagation and Gradient Aggregation using NBC Designs
Exploit Non-Blocking Collective (NBC) operations in MPI-3
Divide communication into fine-grain steps: overlap computation of layer i with communication of layer i+1
MPI_Ibcast to post all communication in advance; wait in an on-demand fashion
Allow for runtime selection of data propagation design based on message (DL model) size, number of GPUs, and number of nodes
Co-design gradient aggregation at application level: helper thread based approach to realize a non-blocking MPI_Reduce
A. A. Awan, K. Hamidouche, J. M. Hashmi, and D. K. Panda, S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters, PPoPP 17
Overlap Symposium (April 18) 73
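Not OSU-Caffe code: a heavily simplified sketch of the layer-wise pattern described above, under the assumption that all per-layer broadcasts can be posted up front and each one is waited on only right before its layer is computed. NUM_LAYERS, LAYER_SIZE, and compute_layer() are placeholders.

#include <mpi.h>

#define NUM_LAYERS 8
#define LAYER_SIZE (1 << 20)            /* elements per layer (placeholder) */

static float params[NUM_LAYERS][LAYER_SIZE];

/* Placeholder for the forward/backward work on one layer */
static void compute_layer(int layer) { (void)layer; }

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    MPI_Request reqs[NUM_LAYERS];

    /* Post all layer-wise broadcasts in advance (root = rank 0) */
    for (int l = 0; l < NUM_LAYERS; l++)
        MPI_Ibcast(params[l], LAYER_SIZE, MPI_FLOAT, 0, MPI_COMM_WORLD, &reqs[l]);

    /* Wait on demand: layer l's broadcast must finish only right before
     * layer l is computed, so later broadcasts overlap with earlier compute. */
    for (int l = 0; l < NUM_LAYERS; l++) {
        MPI_Wait(&reqs[l], MPI_STATUS_IGNORE);
        compute_layer(l);
    }

    MPI_Finalize();
    return 0;
}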

74 S-Caffe vs. Inspur-Caffe and Microsoft CNTK
AlexNet: notoriously hard to scale out on multiple nodes due to communication overhead! Large number of parameters, ~64 million (comm. buffer size = 256 MB)
GoogLeNet is a popular DNN: 13 million parameters (comm. buffer size = ~50 MB)
Up to 14% improvement (Scale-up)
S-Caffe delivers better or comparable performance with other multi-node capable DL frameworks
[Charts: impact of helper-thread based reduction (HR)]
Overlap Symposium (April 18) 74

75 Overlap Symposium (April 18) 75 Concluding Remarks Exploiting overlap between computation and communication is significant in HPC Presented some of the approaches and results along these directions taken by the MVAPICH2 and MVAPICH2-GDR Libraries Allows applications to take advantage of the overlap capabilities As exascale systems are getting more complicated in their architectures, solutions exploiting overlap capabilities will be important

76 Additional Presentation and Tutorials
Tutorial: How to Boost the Performance of Your MPI and PGAS Applications with MVAPICH2 Libraries - 4/5/18, 9:00 am-12:00 noon
Tutorial: High Performance Distributed Deep Learning: A Beginner's Guide - 4/5/18, 1:00 pm-4:00 pm
Overlap Symposium (April 18) 76

77 Overlap Symposium (April 18) 77 Funding Acknowledgments Funding Support by Equipment Support by

78 Personnel Acknowledgments Current Students (Graduate) A. Awan (Ph.D.) R. Biswas (M.S.) M. Bayatpour (Ph.D.) S. Chakraborthy (Ph.D.) C.-H. Chu (Ph.D.) S. Guganani (Ph.D.) Past Students A. Augustine (M.S.) P. Balaji (Ph.D.) S. Bhagvat (M.S.) A. Bhat (M.S.) D. Buntinas (Ph.D.) L. Chai (Ph.D.) B. Chandrasekharan (M.S.) N. Dandapanthula (M.S.) V. Dhanraj (M.S.) T. Gangadharappa (M.S.) K. Gopalakrishnan (M.S.) Current Students (Undergraduate) J. Hashmi (Ph.D.) N. Sarkauskas (B.S.) H. Javed (Ph.D.) P. Kousha (Ph.D.) D. Shankar (Ph.D.) H. Shi (Ph.D.) J. Zhang (Ph.D.) W. Huang (Ph.D.) J. Liu (Ph.D.) W. Jiang (M.S.) M. Luo (Ph.D.) J. Jose (Ph.D.) A. Mamidala (Ph.D.) S. Kini (M.S.) G. Marsh (M.S.) M. Koop (Ph.D.) V. Meshram (M.S.) K. Kulkarni (M.S.) A. Moody (M.S.) R. Kumar (M.S.) S. Naravula (Ph.D.) S. Krishnamoorthy (M.S.) R. Noronha (Ph.D.) K. Kandalla (Ph.D.) X. Ouyang (Ph.D.) M. Li (Ph.D.) S. Pai (M.S.) P. Lai (M.S.) S. Potluri (Ph.D.) Current Research Scientists X. Lu H. Subramoni Current Post-doc A. Ruhela K. Manian R. Rajachandrasekar (Ph.D.) G. Santhanaraman (Ph.D.) A. Singh (Ph.D.) J. Sridhar (M.S.) S. Sur (Ph.D.) H. Subramoni (Ph.D.) K. Vaidyanathan (Ph.D.) A. Vishnu (Ph.D.) J. Wu (Ph.D.) W. Yu (Ph.D.) Current Research Specialist J. Smith M. Arnold Past Research Scientist K. Hamidouche S. Sur Past Programmers D. Bureddy J. Perkins Past Post-Docs D. Banerjee X. Besseron H.-W. Jin J. Lin M. Luo E. Mancini S. Marcarelli J. Vienne H. Wang Overlap Symposium (April 18) 78

79 Overlap Symposium (April 18) 79 Thank You! Network-Based Computing Laboratory The High-Performance MPI/PGAS Project The High-Performance Big Data Project The High-Performance Deep Learning Project


More information

Support for GPUs with GPUDirect RDMA in MVAPICH2 SC 13 NVIDIA Booth

Support for GPUs with GPUDirect RDMA in MVAPICH2 SC 13 NVIDIA Booth Support for GPUs with GPUDirect RDMA in MVAPICH2 SC 13 NVIDIA Booth by D.K. Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda Outline Overview of MVAPICH2-GPU

More information

SR-IOV Support for Virtualization on InfiniBand Clusters: Early Experience

SR-IOV Support for Virtualization on InfiniBand Clusters: Early Experience SR-IOV Support for Virtualization on InfiniBand Clusters: Early Experience Jithin Jose, Mingzhe Li, Xiaoyi Lu, Krishna Kandalla, Mark Arnold and Dhabaleswar K. (DK) Panda Network-Based Computing Laboratory

More information

Designing Optimized MPI Broadcast and Allreduce for Many Integrated Core (MIC) InfiniBand Clusters

Designing Optimized MPI Broadcast and Allreduce for Many Integrated Core (MIC) InfiniBand Clusters Designing Optimized MPI Broadcast and Allreduce for Many Integrated Core (MIC) InfiniBand Clusters K. Kandalla, A. Venkatesh, K. Hamidouche, S. Potluri, D. Bureddy and D. K. Panda Presented by Dr. Xiaoyi

More information

Designing Convergent HPC, Deep Learning and Big Data Analytics Software Stacks for Exascale Systems

Designing Convergent HPC, Deep Learning and Big Data Analytics Software Stacks for Exascale Systems Designing Convergent HPC, Deep Learning and Big Data Analytics Software Stacks for Exascale Systems Talk at HPCAC-AI Switzerland Conference (April 19) by Dhabaleswar K. (DK) Panda The Ohio State University

More information

Optimized Non-contiguous MPI Datatype Communication for GPU Clusters: Design, Implementation and Evaluation with MVAPICH2

Optimized Non-contiguous MPI Datatype Communication for GPU Clusters: Design, Implementation and Evaluation with MVAPICH2 Optimized Non-contiguous MPI Datatype Communication for GPU Clusters: Design, Implementation and Evaluation with MVAPICH2 H. Wang, S. Potluri, M. Luo, A. K. Singh, X. Ouyang, S. Sur, D. K. Panda Network-Based

More information

How to Boost the Performance of Your MPI and PGAS Applications with MVAPICH2 Libraries

How to Boost the Performance of Your MPI and PGAS Applications with MVAPICH2 Libraries How to Boost the Performance of Your MPI and PGAS Applications with MVAPICH2 Libraries A Tutorial at MVAPICH User Group 217 by The MVAPICH Team The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

Scaling with PGAS Languages

Scaling with PGAS Languages Scaling with PGAS Languages Panel Presentation at OFA Developers Workshop (2013) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

MPI Alltoall Personalized Exchange on GPGPU Clusters: Design Alternatives and Benefits

MPI Alltoall Personalized Exchange on GPGPU Clusters: Design Alternatives and Benefits MPI Alltoall Personalized Exchange on GPGPU Clusters: Design Alternatives and Benefits Ashish Kumar Singh, Sreeram Potluri, Hao Wang, Krishna Kandalla, Sayantan Sur, and Dhabaleswar K. Panda Network-Based

More information

Designing Efficient HPC, Big Data, Deep Learning, and Cloud Computing Middleware for Exascale Systems

Designing Efficient HPC, Big Data, Deep Learning, and Cloud Computing Middleware for Exascale Systems Designing Efficient HPC, Big Data, Deep Learning, and Cloud Computing Middleware for Exascale Systems Keynote Talk at HPCAC China Conference by Dhabaleswar K. (DK) Panda The Ohio State University E mail:

More information

Accelerating MPI Message Matching and Reduction Collectives For Multi-/Many-core Architectures

Accelerating MPI Message Matching and Reduction Collectives For Multi-/Many-core Architectures Accelerating MPI Message Matching and Reduction Collectives For Multi-/Many-core Architectures M. Bayatpour, S. Chakraborty, H. Subramoni, X. Lu, and D. K. Panda Department of Computer Science and Engineering

More information

Enabling Efficient Use of UPC and OpenSHMEM PGAS models on GPU Clusters

Enabling Efficient Use of UPC and OpenSHMEM PGAS models on GPU Clusters Enabling Efficient Use of UPC and OpenSHMEM PGAS models on GPU Clusters Presentation at GTC 2014 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

Addressing Emerging Challenges in Designing HPC Runtimes: Energy-Awareness, Accelerators and Virtualization

Addressing Emerging Challenges in Designing HPC Runtimes: Energy-Awareness, Accelerators and Virtualization Addressing Emerging Challenges in Designing HPC Runtimes: Energy-Awareness, Accelerators and Virtualization Talk at HPCAC-Switzerland (Mar 16) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail:

More information

Designing HPC, Big Data, and Deep Learning Middleware for Exascale Systems: Challenges and Opportunities

Designing HPC, Big Data, and Deep Learning Middleware for Exascale Systems: Challenges and Opportunities Designing HPC, Big Data, and Deep Learning Middleware for Exascale Systems: Challenges and Opportunities DHABALESWAR K PANDA DK Panda is a Professor and University Distinguished Scholar of Computer Science

More information

The MVAPICH2 Project: Pushing the Frontier of InfiniBand and RDMA Networking Technologies Talk at OSC/OH-TECH Booth (SC 15)

The MVAPICH2 Project: Pushing the Frontier of InfiniBand and RDMA Networking Technologies Talk at OSC/OH-TECH Booth (SC 15) The MVAPICH2 Project: Pushing the Frontier of InfiniBand and RDMA Networking Technologies Talk at OSC/OH-TECH Booth (SC 15) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

The MVAPICH2 Project: Latest Developments and Plans Towards Exascale Computing

The MVAPICH2 Project: Latest Developments and Plans Towards Exascale Computing The MVAPICH2 Project: Latest Developments and Plans Towards Exascale Computing Presentation at Mellanox Theatre (SC 16) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

MVAPICH2: A High Performance MPI Library for NVIDIA GPU Clusters with InfiniBand

MVAPICH2: A High Performance MPI Library for NVIDIA GPU Clusters with InfiniBand MVAPICH2: A High Performance MPI Library for NVIDIA GPU Clusters with InfiniBand Presentation at GTC 213 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

Designing HPC, Deep Learning, and Cloud Middleware for Exascale Systems

Designing HPC, Deep Learning, and Cloud Middleware for Exascale Systems Designing HPC, Deep Learning, and Cloud Middleware for Exascale Systems Keynote Talk at HPCAC Stanford Conference (Feb 18) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

Overview of the MVAPICH Project: Latest Status and Future Roadmap

Overview of the MVAPICH Project: Latest Status and Future Roadmap Overview of the MVAPICH Project: Latest Status and Future Roadmap MVAPICH2 User Group (MUG) Meeting by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

Coupling GPUDirect RDMA and InfiniBand Hardware Multicast Technologies for Streaming Applications

Coupling GPUDirect RDMA and InfiniBand Hardware Multicast Technologies for Streaming Applications Coupling GPUDirect RDMA and InfiniBand Hardware Multicast Technologies for Streaming Applications GPU Technology Conference GTC 2016 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

MVAPICH2 Project Update and Big Data Acceleration

MVAPICH2 Project Update and Big Data Acceleration MVAPICH2 Project Update and Big Data Acceleration Presentation at HPC Advisory Council European Conference 212 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

Designing HPC, Big Data, Deep Learning, and Cloud Middleware for Exascale Systems: Challenges and Opportunities

Designing HPC, Big Data, Deep Learning, and Cloud Middleware for Exascale Systems: Challenges and Opportunities Designing HPC, Big Data, Deep Learning, and Cloud Middleware for Exascale Systems: Challenges and Opportunities Keynote Talk at SEA Symposium (April 18) by Dhabaleswar K. (DK) Panda The Ohio State University

More information

Job Startup at Exascale: Challenges and Solutions

Job Startup at Exascale: Challenges and Solutions Job Startup at Exascale: Challenges and Solutions Sourav Chakraborty Advisor: Dhabaleswar K (DK) Panda The Ohio State University http://nowlab.cse.ohio-state.edu/ Current Trends in HPC Tremendous increase

More information

MVAPICH2 and MVAPICH2-MIC: Latest Status

MVAPICH2 and MVAPICH2-MIC: Latest Status MVAPICH2 and MVAPICH2-MIC: Latest Status Presentation at IXPUG Meeting, July 214 by Dhabaleswar K. (DK) Panda and Khaled Hamidouche The Ohio State University E-mail: {panda, hamidouc}@cse.ohio-state.edu

More information

High-Performance and Scalable Non-Blocking All-to-All with Collective Offload on InfiniBand Clusters: A study with Parallel 3DFFT

High-Performance and Scalable Non-Blocking All-to-All with Collective Offload on InfiniBand Clusters: A study with Parallel 3DFFT High-Performance and Scalable Non-Blocking All-to-All with Collective Offload on InfiniBand Clusters: A study with Parallel 3DFFT Krishna Kandalla (1), Hari Subramoni (1), Karen Tomko (2), Dmitry Pekurovsky

More information

Intra-MIC MPI Communication using MVAPICH2: Early Experience

Intra-MIC MPI Communication using MVAPICH2: Early Experience Intra-MIC MPI Communication using MVAPICH: Early Experience Sreeram Potluri, Karen Tomko, Devendar Bureddy, and Dhabaleswar K. Panda Department of Computer Science and Engineering Ohio State University

More information

The MVAPICH2 Project: Latest Developments and Plans Towards Exascale Computing

The MVAPICH2 Project: Latest Developments and Plans Towards Exascale Computing The MVAPICH2 Project: Latest Developments and Plans Towards Exascale Computing Presentation at OSU Booth (SC 18) by Hari Subramoni The Ohio State University E-mail: subramon@cse.ohio-state.edu http://www.cse.ohio-state.edu/~subramon

More information

Overview of the MVAPICH Project: Latest Status and Future Roadmap

Overview of the MVAPICH Project: Latest Status and Future Roadmap Overview of the MVAPICH Project: Latest Status and Future Roadmap MVAPICH2 User Group (MUG) Meeting by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

Overview of MVAPICH2 and MVAPICH2- X: Latest Status and Future Roadmap

Overview of MVAPICH2 and MVAPICH2- X: Latest Status and Future Roadmap Overview of MVAPICH2 and MVAPICH2- X: Latest Status and Future Roadmap MVAPICH2 User Group (MUG) MeeKng by Dhabaleswar K. (DK) Panda The Ohio State University E- mail: panda@cse.ohio- state.edu h

More information

Interconnect Your Future

Interconnect Your Future Interconnect Your Future Smart Interconnect for Next Generation HPC Platforms Gilad Shainer, August 2016, 4th Annual MVAPICH User Group (MUG) Meeting Mellanox Connects the World s Fastest Supercomputer

More information

Designing and Building Efficient HPC Cloud with Modern Networking Technologies on Heterogeneous HPC Clusters

Designing and Building Efficient HPC Cloud with Modern Networking Technologies on Heterogeneous HPC Clusters Designing and Building Efficient HPC Cloud with Modern Networking Technologies on Heterogeneous HPC Clusters Jie Zhang Dr. Dhabaleswar K. Panda (Advisor) Department of Computer Science & Engineering The

More information

An Assessment of MPI 3.x (Part I)

An Assessment of MPI 3.x (Part I) An Assessment of MPI 3.x (Part I) High Performance MPI Support for Clouds with IB and SR-IOV (Part II) Talk at HP-CAST 25 (Nov 2015) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

Optimizing MPI Communication on Multi-GPU Systems using CUDA Inter-Process Communication

Optimizing MPI Communication on Multi-GPU Systems using CUDA Inter-Process Communication Optimizing MPI Communication on Multi-GPU Systems using CUDA Inter-Process Communication Sreeram Potluri* Hao Wang* Devendar Bureddy* Ashish Kumar Singh* Carlos Rosales + Dhabaleswar K. Panda* *Network-Based

More information

Designing Power-Aware Collective Communication Algorithms for InfiniBand Clusters

Designing Power-Aware Collective Communication Algorithms for InfiniBand Clusters Designing Power-Aware Collective Communication Algorithms for InfiniBand Clusters Krishna Kandalla, Emilio P. Mancini, Sayantan Sur, and Dhabaleswar. K. Panda Department of Computer Science & Engineering,

More information

The Future of Supercomputer Software Libraries

The Future of Supercomputer Software Libraries The Future of Supercomputer Software Libraries Talk at HPC Advisory Council Israel Supercomputing Conference by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

Accelerating HPL on Heterogeneous GPU Clusters

Accelerating HPL on Heterogeneous GPU Clusters Accelerating HPL on Heterogeneous GPU Clusters Presentation at GTC 2014 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda Outline

More information

Big Data Meets HPC: Exploiting HPC Technologies for Accelerating Big Data Processing and Management

Big Data Meets HPC: Exploiting HPC Technologies for Accelerating Big Data Processing and Management Big Data Meets HPC: Exploiting HPC Technologies for Accelerating Big Data Processing and Management SigHPC BigData BoF (SC 17) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

In the multi-core age, How do larger, faster and cheaper and more responsive memory sub-systems affect data management? Dhabaleswar K.

In the multi-core age, How do larger, faster and cheaper and more responsive memory sub-systems affect data management? Dhabaleswar K. In the multi-core age, How do larger, faster and cheaper and more responsive sub-systems affect data management? Panel at ADMS 211 Dhabaleswar K. (DK) Panda Network-Based Computing Laboratory Department

More information

High-Performance and Scalable Designs of Programming Models for Exascale Systems Talk at HPC Advisory Council Stanford Conference (2015)

High-Performance and Scalable Designs of Programming Models for Exascale Systems Talk at HPC Advisory Council Stanford Conference (2015) High-Performance and Scalable Designs of Programming Models for Exascale Systems Talk at HPC Advisory Council Stanford Conference (215) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

Optimizing & Tuning Techniques for Running MVAPICH2 over IB

Optimizing & Tuning Techniques for Running MVAPICH2 over IB Optimizing & Tuning Techniques for Running MVAPICH2 over IB Talk at 2nd Annual IBUG (InfiniBand User's Group) Workshop (214) by Hari Subramoni The Ohio State University E-mail: subramon@cse.ohio-state.edu

More information

Unified Runtime for PGAS and MPI over OFED

Unified Runtime for PGAS and MPI over OFED Unified Runtime for PGAS and MPI over OFED D. K. Panda and Sayantan Sur Network-Based Computing Laboratory Department of Computer Science and Engineering The Ohio State University, USA Outline Introduction

More information

Designing High Performance Communication Middleware with Emerging Multi-core Architectures

Designing High Performance Communication Middleware with Emerging Multi-core Architectures Designing High Performance Communication Middleware with Emerging Multi-core Architectures Dhabaleswar K. (DK) Panda Department of Computer Science and Engg. The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

Accelerating Big Data Processing with RDMA- Enhanced Apache Hadoop

Accelerating Big Data Processing with RDMA- Enhanced Apache Hadoop Accelerating Big Data Processing with RDMA- Enhanced Apache Hadoop Keynote Talk at BPOE-4, in conjunction with ASPLOS 14 (March 2014) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

MELLANOX EDR UPDATE & GPUDIRECT MELLANOX SR. SE 정연구

MELLANOX EDR UPDATE & GPUDIRECT MELLANOX SR. SE 정연구 MELLANOX EDR UPDATE & GPUDIRECT MELLANOX SR. SE 정연구 Leading Supplier of End-to-End Interconnect Solutions Analyze Enabling the Use of Data Store ICs Comprehensive End-to-End InfiniBand and Ethernet Portfolio

More information

Exploiting GPUDirect RDMA in Designing High Performance OpenSHMEM for NVIDIA GPU Clusters

Exploiting GPUDirect RDMA in Designing High Performance OpenSHMEM for NVIDIA GPU Clusters 2015 IEEE International Conference on Cluster Computing Exploiting GPUDirect RDMA in Designing High Performance OpenSHMEM for NVIDIA GPU Clusters Khaled Hamidouche, Akshay Venkatesh, Ammar Ahmad Awan,

More information

Design Alternatives for Implementing Fence Synchronization in MPI-2 One-Sided Communication for InfiniBand Clusters

Design Alternatives for Implementing Fence Synchronization in MPI-2 One-Sided Communication for InfiniBand Clusters Design Alternatives for Implementing Fence Synchronization in MPI-2 One-Sided Communication for InfiniBand Clusters G.Santhanaraman, T. Gangadharappa, S.Narravula, A.Mamidala and D.K.Panda Presented by:

More information

Unifying UPC and MPI Runtimes: Experience with MVAPICH

Unifying UPC and MPI Runtimes: Experience with MVAPICH Unifying UPC and MPI Runtimes: Experience with MVAPICH Jithin Jose Miao Luo Sayantan Sur D. K. Panda Network-Based Computing Laboratory Department of Computer Science and Engineering The Ohio State University,

More information

Communication Frameworks for HPC and Big Data

Communication Frameworks for HPC and Big Data Communication Frameworks for HPC and Big Data Talk at HPC Advisory Council Spain Conference (215) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

Building the Most Efficient Machine Learning System

Building the Most Efficient Machine Learning System Building the Most Efficient Machine Learning System Mellanox The Artificial Intelligence Interconnect Company June 2017 Mellanox Overview Company Headquarters Yokneam, Israel Sunnyvale, California Worldwide

More information

Improving Application Performance and Predictability using Multiple Virtual Lanes in Modern Multi-Core InfiniBand Clusters

Improving Application Performance and Predictability using Multiple Virtual Lanes in Modern Multi-Core InfiniBand Clusters Improving Application Performance and Predictability using Multiple Virtual Lanes in Modern Multi-Core InfiniBand Clusters Hari Subramoni, Ping Lai, Sayantan Sur and Dhabhaleswar. K. Panda Department of

More information

Solutions for Scalable HPC

Solutions for Scalable HPC Solutions for Scalable HPC Scot Schultz, Director HPC/Technical Computing HPC Advisory Council Stanford Conference Feb 2014 Leading Supplier of End-to-End Interconnect Solutions Comprehensive End-to-End

More information

Designing Software Libraries and Middleware for Exascale Systems: Opportunities and Challenges

Designing Software Libraries and Middleware for Exascale Systems: Opportunities and Challenges Designing Software Libraries and Middleware for Exascale Systems: Opportunities and Challenges Talk at Brookhaven National Laboratory (Oct 214) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail:

More information

Designing OpenSHMEM and Hybrid MPI+OpenSHMEM Libraries for Exascale Systems: MVAPICH2-X Experience

Designing OpenSHMEM and Hybrid MPI+OpenSHMEM Libraries for Exascale Systems: MVAPICH2-X Experience Designing OpenSHMEM and Hybrid MPI+OpenSHMEM Libraries for Exascale Systems: MVAPICH2-X Experience Talk at OpenSHMEM Workshop (August 216) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail:

More information

Efficient and Truly Passive MPI-3 RMA Synchronization Using InfiniBand Atomics

Efficient and Truly Passive MPI-3 RMA Synchronization Using InfiniBand Atomics 1 Efficient and Truly Passive MPI-3 RMA Synchronization Using InfiniBand Atomics Mingzhe Li Sreeram Potluri Khaled Hamidouche Jithin Jose Dhabaleswar K. Panda Network-Based Computing Laboratory Department

More information

Latest Advances in MVAPICH2 MPI Library for NVIDIA GPU Clusters with InfiniBand. Presented at GTC 15

Latest Advances in MVAPICH2 MPI Library for NVIDIA GPU Clusters with InfiniBand. Presented at GTC 15 Latest Advances in MVAPICH2 MPI Library for NVIDIA GPU Clusters with InfiniBand Presented at Presented by Dhabaleswar K. (DK) Panda The Ohio State University E- mail: panda@cse.ohio- state.edu hcp://www.cse.ohio-

More information

The LiMIC Strikes Back. Hyun-Wook Jin System Software Laboratory Dept. of Computer Science and Engineering Konkuk University

The LiMIC Strikes Back. Hyun-Wook Jin System Software Laboratory Dept. of Computer Science and Engineering Konkuk University The LiMIC Strikes Back Hyun-Wook Jin System Software Laboratory Dept. of Computer Science and Engineering Konkuk University jinh@konkuk.ac.kr Contents MPI Intra-node Communication Introduction of LiMIC

More information

High Performance Migration Framework for MPI Applications on HPC Cloud

High Performance Migration Framework for MPI Applications on HPC Cloud High Performance Migration Framework for MPI Applications on HPC Cloud Jie Zhang, Xiaoyi Lu and Dhabaleswar K. Panda {zhanjie, luxi, panda}@cse.ohio-state.edu Computer Science & Engineering Department,

More information

OPEN MPI WITH RDMA SUPPORT AND CUDA. Rolf vandevaart, NVIDIA

OPEN MPI WITH RDMA SUPPORT AND CUDA. Rolf vandevaart, NVIDIA OPEN MPI WITH RDMA SUPPORT AND CUDA Rolf vandevaart, NVIDIA OVERVIEW What is CUDA-aware History of CUDA-aware support in Open MPI GPU Direct RDMA support Tuning parameters Application example Future work

More information

Programming Models for Exascale Systems

Programming Models for Exascale Systems Programming Models for Exascale Systems Talk at HPC Advisory Council Stanford Conference and Exascale Workshop (214) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

Interconnect Your Future

Interconnect Your Future Interconnect Your Future Gilad Shainer 2nd Annual MVAPICH User Group (MUG) Meeting, August 2014 Complete High-Performance Scalable Interconnect Infrastructure Comprehensive End-to-End Software Accelerators

More information

Can Memory-Less Network Adapters Benefit Next-Generation InfiniBand Systems?

Can Memory-Less Network Adapters Benefit Next-Generation InfiniBand Systems? Can Memory-Less Network Adapters Benefit Next-Generation InfiniBand Systems? Sayantan Sur, Abhinav Vishnu, Hyun-Wook Jin, Wei Huang and D. K. Panda {surs, vishnu, jinhy, huanwei, panda}@cse.ohio-state.edu

More information

CRFS: A Lightweight User-Level Filesystem for Generic Checkpoint/Restart

CRFS: A Lightweight User-Level Filesystem for Generic Checkpoint/Restart CRFS: A Lightweight User-Level Filesystem for Generic Checkpoint/Restart Xiangyong Ouyang, Raghunath Rajachandrasekar, Xavier Besseron, Hao Wang, Jian Huang, Dhabaleswar K. Panda Department of Computer

More information

Enabling Efficient Use of UPC and OpenSHMEM PGAS Models on GPU Clusters. Presented at GTC 15

Enabling Efficient Use of UPC and OpenSHMEM PGAS Models on GPU Clusters. Presented at GTC 15 Enabling Efficient Use of UPC and OpenSHMEM PGAS Models on GPU Clusters Presented at Presented by Dhabaleswar K. (DK) Panda The Ohio State University E- mail: panda@cse.ohio- state.edu hcp://www.cse.ohio-

More information

High-Performance Heterogeneity/ Energy-Aware Communication for Multi-Petaflop HPC Systems

High-Performance Heterogeneity/ Energy-Aware Communication for Multi-Petaflop HPC Systems High-Performance Heterogeneity/ Energy-Aware Communication for Multi-Petaflop HPC Systems Dissertation Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate

More information