Overview of the MVAPICH Project: Latest Status and Future Roadmap

Overview of the MVAPICH Project: Latest Status and Future Roadmap
MVAPICH User Group (MUG) Meeting
Dhabaleswar K. (DK) Panda, The Ohio State University
E-mail: panda@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~panda

Drivers of Modern HPC Cluster Architectures
- Multi-core processors: ubiquitous in today's systems
- High-performance interconnects: InfiniBand (<1 usec latency, >100 Gbps bandwidth) is very popular in HPC clusters
- Accelerators/coprocessors: high compute density and high performance per watt (>1 TFlop DP on a chip); becoming common in high-end systems
Together these trends are pushing the envelope for Exascale computing.
Example systems: Tianhe-2 (#1), Titan (#2), Stampede (#6), Tianhe-1A (#10).

Parallel Programming Models Overview
- Shared Memory Model (SHMEM, DSM): all processes see a common shared memory
- Distributed Memory Model (MPI, the Message Passing Interface): each process has its own private memory
- Partitioned Global Address Space (PGAS) Model (Global Arrays, UPC, Chapel, X10, CAF, ...): a logical shared memory layered over distributed memories
Programming models provide abstract machine models, and a model can be mapped onto different types of systems (e.g. Distributed Shared Memory (DSM), MPI within a node, etc.). PGAS models and hybrid MPI+PGAS models are gradually gaining importance.
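To make the message-passing model concrete, here is a minimal two-process MPI example in C. This is an illustrative sketch, not code from the talk.

    /* Minimal two-process MPI example illustrating the distributed-memory
     * (message-passing) model: each rank owns its private memory and data
     * moves only through explicit send/receive calls. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, value = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("Rank 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }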

Supporting Programming Models for Multi-Petaflop and Exaflop Systems: Challenges
- Application kernels/applications
- Programming models: MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenACC, Cilk, Hadoop, MapReduce, etc.
- Middleware: a communication library or runtime for the programming models, providing point-to-point communication (two-sided and one-sided), collective communication, synchronization and locks, I/O and file systems, and fault tolerance
- Networking technologies (InfiniBand, 10/40GigE, Aries, BlueGene), multi-/many-core architectures, and accelerators (NVIDIA and MIC)
Co-design opportunities and challenges exist across all of these layers.

Designing (MPI+X) at Exascale
- Scalability to million-to-billion processor counts
- Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
- Extremely small memory footprint
- Hybrid programming (MPI + OpenMP, MPI + UPC, MPI + OpenSHMEM, ...); see the sketch below
- Balancing intra-node and inter-node communication for next-generation multi-core nodes (128-1024 cores/node): multiple end-points per node and support for efficient multi-threading
- Scalable collective communication: offload, non-blocking, topology-aware, power-aware
- Support for the MPI-3 RMA model
- Support for GPGPUs and accelerators
- Fault-tolerance/resiliency
- QoS support for communication and I/O
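As a concrete illustration of the MPI + OpenMP hybrid style listed above, a minimal sketch in C follows. It is illustrative only and assumes an MPI library built with thread support at the MPI_THREAD_FUNNELED level (which MVAPICH2 provides); the loop body is a placeholder computation.

    /* Hybrid MPI + OpenMP sketch: one MPI process per node, OpenMP threads
     * for intra-node parallelism. Requests MPI_THREAD_FUNNELED so only the
     * master thread makes MPI calls. */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, rank;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Intra-node parallel work handled by OpenMP threads. */
        double local_sum = 0.0, global_sum = 0.0;
        #pragma omp parallel for reduction(+:local_sum)
        for (int i = 0; i < 1000000; i++)
            local_sum += 1.0 / (i + 1.0 + rank);

        /* Inter-node communication handled by MPI (master thread only). */
        MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("global sum = %f\n", global_sum);

        MPI_Finalize();
        return 0;
    }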

MVAPICH2/MVAPICH2-X Software (http://mvapich.cse.ohio-state.edu)
- High-performance open-source MPI library for InfiniBand, 10Gig/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE)
- MVAPICH (MPI-1) and MVAPICH2 (MPI-2.2 and MPI-3.0), available since 2002
- MVAPICH2-X (MPI + PGAS), available since 2012
- Used by more than 2,200 organizations (HPC centers, industry, and universities) in 73 countries
D. K. Panda, K. Tomko, K. Schulz and A. Majumdar, "The MVAPICH Project: Evolution and Sustainability of an Open Source Production Quality MPI Library for HPC," Int'l Workshop on Sustainable Software for Science: Practice and Experiences, held in conjunction with Int'l Conference on Supercomputing (SC '13), November 2013.

MVAPICH Team Projects
[Timeline figure: project lifetimes for MVAPICH (EOL), MVAPICH2, OMB, MVAPICH2-X, and MVAPICH2-GDR, from the early 2000s through Aug 2014.]

MVAPICH/MVAPICH2 Release Timeline and Downloads
[Chart: cumulative download counts from the MVAPICH2 website plotted against release milestones from MV 0.9.4 (Sep 2004) through MV2 2.0 (2014).]

MVAPICH2/MVAPICH2-X Software (Cont'd)
- Available with the software stacks of many IB, HSE, and server vendors, including Linux distros (RedHat and SuSE)
- Empowering many TOP500 clusters:
  - 7th-ranked 519,640-core cluster (Stampede) at TACC
  - 13th-ranked 74,358-core cluster (Tsubame 2.5) at Tokyo Institute of Technology
  - 23rd-ranked 96,192-core cluster (Pleiades) at NASA
  - and many others
- Partner in the U.S. NSF-TACC Stampede system

Strong Procedure for Design, Development and Release
- Research is done to explore new designs
- Designs are first presented in conference/journal publications
- The best-performing designs are incorporated into the codebase
- Rigorous Q&A procedure before making a release:
  - Exhaustive unit testing
  - Various test procedures on a diverse range of platforms and interconnects
  - Performance tuning
  - Application-based evaluation
  - Evaluation on large-scale systems
- Even alpha and beta versions go through the above testing

MVAPICH2 Architecture (Latest Release 2.0)
[Architecture diagram: support for all major computing platforms (IA-32, EM64T, Westmere, Sandybridge, Ivybridge, Opteron, Magny-Cours, ...) and different PCI interfaces.]

MVAPICH2 2.0 and MVAPICH2-X 2.0
Released on 6/20/2014. Major features and enhancements:
- Based on MPICH-3.1
- Extended support for MPI-3 RMA in the OFA-IB-CH3, OFA-IWARP-CH3, and OFA-RoCE-CH3 interfaces
- Multiple optimizations and performance enhancements for RMA in the OFA-IB-CH3 channel
- MPI-3 RMA support for the CH3-PSM channel
- CMA support is now enabled by default
- Enhanced intra-node SMP performance
- Dynamic CUDA initialization; support for running on heterogeneous clusters with GPU and non-GPU nodes
- Multi-rail support for GPU communication and the UD-Hybrid channel
- Improved job-startup performance for large-scale mpirun_rsh jobs
- Reduced memory footprint
- Updated compiler wrappers to remove application dependency on network and other extra libraries
- Updated hwloc to version 1.9
MVAPICH2-X 2.0 GA supports hybrid MPI + PGAS (UPC and OpenSHMEM) programming models:
- Based on MVAPICH2 2.0 GA, including MPI-3 features; compliant with UPC 2.18.0 and OpenSHMEM v1.0f
- Enhanced optimization of OpenSHMEM collectives and an optimized shmalloc routine
- Optimized UPC collectives and support for the GUPC translator

One-way Latency: MPI over IB with MVAPICH2
[Charts: small- and large-message one-way latency for Qlogic-DDR, Qlogic-QDR, ConnectX-DDR, ConnectX2-PCIe2-QDR, ConnectX3-PCIe3-FDR, Sandy-ConnectIB-DualFDR, and Ivy-ConnectIB-DualFDR; small-message latencies fall between about 1.0 and 1.9 us.]
Test systems: DDR, QDR - 2.4 GHz quad-core (Westmere) Intel, PCI Gen2, with IB switch; FDR - 2.6 GHz octa-core (SandyBridge) Intel, PCI Gen3, with IB switch; ConnectIB-Dual FDR - 2.6 GHz octa-core (SandyBridge) Intel, PCI Gen3, with IB switch; ConnectIB-Dual FDR - 2.8 GHz deca-core (IvyBridge) Intel, PCI Gen3, with IB switch.

Bandwidth: MPI over IB with MVAPICH2
[Charts: unidirectional and bidirectional bandwidth for the same adapters; unidirectional peaks reach roughly 12,485 MB/s and bidirectional peaks roughly 24,727 MB/s with dual-FDR ConnectIB.]
Test systems as on the previous slide.

MVAPICH2 Two-Sided Intra-Node Performance (Shared Memory and Kernel-based Zero-copy Support: LiMIC and CMA)
[Charts, latest MVAPICH2 2.0 on Intel Ivy Bridge: intra-socket small-message latency of 0.18 us and inter-socket latency of 0.45 us; intra-socket bandwidth above 14,000 MB/s and inter-socket bandwidth of 13,749 MB/s, comparing the CMA, shared-memory, and LiMIC channels.]

MPI-3 RMA Get/Put with Flush Performance
[Charts, latest MVAPICH2 2.0 on Intel Sandy Bridge with Connect-IB (single port): inter-node small-message Get/Put latencies of about 2.4 us and 1.56 us, with Get/Put bandwidth reaching roughly 6,881/6,876 MB/s; intra-socket small-message Get/Put latency around 0.8 us, with bandwidth reaching roughly 15,364/14,926 MB/s.]
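The Get/Put-with-flush numbers above correspond to passive-target MPI-3 RMA. The following minimal C sketch shows that communication pattern (window allocation, MPI_Put, MPI_Win_flush); it is an illustrative sketch, not the benchmark code used for these measurements.

    /* Passive-target MPI-3 RMA sketch: rank 0 puts data into rank 1's window
     * and uses MPI_Win_flush to complete the operation at the target. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank;
        double buf = 3.14, *win_mem;
        MPI_Win win;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Win_allocate(sizeof(double), sizeof(double), MPI_INFO_NULL,
                         MPI_COMM_WORLD, &win_mem, &win);
        *win_mem = 0.0;
        MPI_Win_lock_all(0, win);

        if (rank == 0) {
            MPI_Put(&buf, 1, MPI_DOUBLE, 1 /* target rank */, 0, 1, MPI_DOUBLE, win);
            MPI_Win_flush(1, win);   /* remotely complete the Put */
        }

        MPI_Win_unlock_all(win);
        MPI_Barrier(MPI_COMM_WORLD);
        if (rank == 1)
            printf("Rank 1 window value: %f\n", *win_mem);

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }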

MVAPICH2-X for Hybrid MPI + PGAS Applications
- MPI applications, OpenSHMEM applications, UPC applications, and hybrid (MPI + PGAS) applications all run over the unified MVAPICH2-X runtime: OpenSHMEM calls, UPC calls, and MPI calls share one runtime on InfiniBand, RoCE, and iWARP
- The unified communication runtime for MPI, UPC, and OpenSHMEM is available from MVAPICH2-X 1.9 onwards (http://mvapich.cse.ohio-state.edu)
Feature highlights:
- Supports MPI(+OpenMP), OpenSHMEM, UPC, MPI(+OpenMP) + OpenSHMEM, and MPI(+OpenMP) + UPC
- MPI-3 compliant, OpenSHMEM v1.0 standard compliant, UPC v1.2 standard compliant (with initial support for UPC 1.3)
- Scalable inter-node and intra-node communication, point-to-point and collectives

Hybrid (MPI+PGAS) Programming
- Application sub-kernels can be re-written in MPI or PGAS based on their communication characteristics (e.g. within one HPC application, Kernel 1 in MPI, Kernel 2 in PGAS, ..., Kernel N in PGAS)
- Benefits: the best of the distributed-memory computing model combined with the best of the shared-memory computing model
- Exascale Roadmap*: hybrid programming is a practical way to program exascale systems
* The International Exascale Software Roadmap, Dongarra, J., Beckman, P., et al., International Journal of High Performance Computing Applications, Volume 25, Number 1, 2011, ISSN 1094-3420.
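A minimal sketch of such a hybrid program, using OpenSHMEM 1.0-style calls alongside MPI over a unified runtime such as MVAPICH2-X, is shown below. The kernel bodies and the interleaving of the two models are illustrative assumptions, not code from the cited work.

    /* Hybrid MPI + OpenSHMEM sketch: an MPI collective for one kernel and a
     * one-sided OpenSHMEM update for another, over a single unified runtime. */
    #include <mpi.h>
    #include <shmem.h>
    #include <stdio.h>

    long counter = 0;                /* global => symmetric, remotely accessible */

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        start_pes(0);                /* OpenSHMEM 1.0-style initialization */

        int rank = _my_pe();
        int npes = _num_pes();

        /* Kernel 1: regular, bulk-synchronous exchange -> MPI collective */
        long local = rank, sum = 0;
        MPI_Allreduce(&local, &sum, 1, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);

        /* Kernel 2: irregular, one-sided update -> OpenSHMEM atomic */
        shmem_long_inc(&counter, (rank + 1) % npes);
        shmem_barrier_all();

        if (rank == 0)
            printf("sum = %ld, counter on PE 0 = %ld\n", sum, counter);

        MPI_Finalize();
        return 0;
    }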

Hybrid MPI+OpenSHMEM Graph500 Design
[Charts: execution time, strong scalability, and weak scalability (in billions of traversed edges per second, TEPS) for the MPI-Simple, MPI-CSC, MPI-CSR, and Hybrid (MPI+OpenSHMEM) designs at 1,024-16,384 processes and Graph500 scales 26-29.]
Performance of the hybrid (MPI+OpenSHMEM) Graph500 design:
- At 8,192 processes: 2.4X improvement over MPI-CSR and 7.6X over MPI-Simple
- At 16,384 processes: 1.5X improvement over MPI-CSR and 13X over MPI-Simple
J. Jose, S. Potluri, K. Tomko and D. K. Panda, "Designing Scalable Graph500 Benchmark with Hybrid MPI+OpenSHMEM Programming Models," International Supercomputing Conference (ISC '13), June 2013.
J. Jose, K. Kandalla, M. Luo and D. K. Panda, "Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and Performance Evaluation," Int'l Conference on Parallel Processing (ICPP '12), September 2012.

MVAPICH2 2.1a and MVAPICH2-X 2.1a
- Will be available later this week or next week
- Features: based on MPICH 3.1.2, PMI-2-based startup with SLURM, improved job startup, etc.

MVAPICH2-GDR 2.0
Released on 8/23/2014. Major features and enhancements:
- Based on MVAPICH2 2.0 (OFA-IB-CH3 interface)
- High-performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host, and Host-GPU) using GPUDirect RDMA and pipelining
- Support for MPI-3 RMA (one-sided) communication and atomics using GPU buffers with GPUDirect RDMA and pipelining
- Efficient small-message inter-node communication using the new NVIDIA GDRCOPY module
- Efficient small-message inter-node communication using a loopback design
- Multi-rail support for inter-node point-to-point GPU communication
- High-performance intra-node point-to-point communication for nodes with multiple GPU adapters (GPU-GPU, GPU-Host, and Host-GPU) using CUDA IPC and pipelining
- Automatic communication-channel selection for different GPU communication modes (DD, HH, and HD) in different configurations (intra-IOH and inter-IOH)
- Automatic selection and binding of the best GPU/HCA pair in multi-GPU/HCA system configurations
- Optimized and tuned support for collective communication from GPU buffers
- Enhanced and efficient support for datatype processing on GPU buffers, including vector/h-vector, index/h-index, array, and subarray types
- Dynamic CUDA initialization; GPU device selection is supported after MPI_Init
- Support for non-blocking streams in asynchronous CUDA transfers for better overlap
- Efficient synchronization using CUDA events for pipelined device data transfers
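With a CUDA-aware library such as MVAPICH2-GDR, device pointers can be passed directly to MPI calls; the following minimal C sketch (using the CUDA runtime API) illustrates the idea. It is an illustrative sketch: the buffer size is arbitrary, and it assumes a two-process job launched on GPU nodes with MV2_USE_CUDA=1 set.

    /* CUDA-aware MPI sketch: pass device buffers directly to MPI_Send/Recv;
     * the library moves the data via GPUDirect RDMA or pipelined copies. */
    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank;
        const int n = 1 << 20;
        float *d_buf;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        cudaMalloc((void **)&d_buf, n * sizeof(float));
        cudaMemset(d_buf, 0, n * sizeof(float));

        if (rank == 0)
            MPI_Send(d_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);   /* GPU -> GPU */
        else if (rank == 1) {
            MPI_Recv(d_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("received %d floats into a device buffer\n", n);
        }

        cudaFree(d_buf);
        MPI_Finalize();
        return 0;
    }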

Performance of MVAPICH2 with GPU-Direct-RDMA: Latency
[Chart: GPU-GPU inter-node MPI small-message latency comparing MV2-GDR 2.0, MV2-GDR 2.0b, and MV2 without GDR; MV2-GDR 2.0 reaches 2.18 usec, including a 71% improvement shown in the chart.]
Setup: MVAPICH2-GDR 2.0, Intel Ivy Bridge (E5-2680 v2) node with 20 cores, NVIDIA Tesla K40c GPU, Mellanox Connect-IB dual-FDR HCA, CUDA 6.5, Mellanox OFED 2.1 with GPU-Direct-RDMA.

Performance of MVAPICH2 with GPU-Direct-RDMA: Bandwidth
[Chart: GPU-GPU inter-node MPI unidirectional bandwidth for small messages; MV2-GDR 2.0 delivers about 2.2x the bandwidth of MV2-GDR 2.0b and about 10x that of MV2 without GDR.]
Same setup as the previous slide.

Performance of MVAPICH2 with GPU-Direct-RDMA: Bi-Bandwidth
[Chart: GPU-GPU inter-node MPI bidirectional bandwidth for small messages; MV2-GDR 2.0 delivers about 2x the bi-bandwidth of MV2-GDR 2.0b and about 10x that of MV2 without GDR.]
Same setup as the previous slides.

Performance of MVAPICH2 with GPU-Direct-RDMA: MPI-3 RMA
[Chart: GPU-GPU inter-node MPI Put latency (RMA put, device to device); MV2-GDR 2.0 reaches 2.88 us for small messages, an 83% improvement over MV2-GDR 2.0b.]
MPI-3 RMA provides flexible synchronization and completion primitives.
Same setup as the previous slides.

Application-Level Evaluation (HOOMD-blue)
[Charts: HOOMD-blue strong-scaling and weak-scaling results.]
- Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
- MV2-GDR 2.0 (released on 8/23/14): try it out!
- GDRCOPY-enabled settings: MV2_USE_CUDA=1 MV2_IBA_HCA=mlx5_0 MV2_IBA_EAGER_THRESHOLD=32768 MV2_VBUF_TOTAL_SIZE=32768 MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768 MV2_USE_GPUDIRECT_GDRCOPY=1 MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=16384

MPI Applications on MIC Clusters
- MPI (+X) continues to be the predominant programming model in HPC
- There is flexibility in how MPI jobs are launched on clusters with Xeon Phi:
  - Host-only (multi-core centric): the MPI program runs on the Xeon host only
  - Offload (or reverse offload): the MPI program runs on the host with computation offloaded to the Xeon Phi
  - Symmetric: MPI programs run on both the host and the Xeon Phi
  - Coprocessor-only (many-core centric): the MPI program runs on the Xeon Phi only

MVAPICH2-MIC Design for Clusters with IB and MIC
- Offload mode
- Intra-node communication: coprocessor-only mode and symmetric mode
- Inter-node communication: coprocessor-only mode and symmetric mode
- Multi-MIC node configurations

MIC-Remote-MIC P2P Communication
[Charts: large-message latency and bandwidth for intra-socket and inter-socket MIC-to-remote-MIC point-to-point communication; peak bandwidth of 5,236 MB/s (intra-socket) and 5,594 MB/s (inter-socket).]

Latest Status on MVAPICH2-MIC
- Running on three major systems:
  - Stampede: module swap mvapich2 mvapich2-mic/20130911
  - Blueridge (Virginia Tech): module swap mvapich2 mvapich2-mic/1.9
  - Beacon (UTK): module unload intel-mpi; module load mvapich2-mic/1.9
- A new version based on MVAPICH2 2.0 is being worked on and will be available in a few weeks

Optimized MPI Collectives for MIC Clusters (Allgather & Alltoall)
[Charts, 32-node experiments with mixed host (H) and MIC (M) processes: the optimized design (MV2-MIC-Opt) improves 32-node Allgather small-message latency by 76% (16H + 16M), Allgather large-message latency by 58% (8H + 8M), and Alltoall large-message latency by 55% (8H + 8M) over MV2-MIC; P3DFFT communication time is also reduced on 32 nodes (8H + 8M, size 2K x 2K x 1K).]
A. Venkatesh, S. Potluri, R. Rajachandrasekar, M. Luo, K. Hamidouche and D. K. Panda, "High Performance Alltoall and Allgather Designs for InfiniBand MIC Clusters," IPDPS '14, May 2014.

OSU Micro-Benchmarks (OMB)
- Started in 2004 and continuing steadily
- Allows MPI developers and users to test and evaluate MPI libraries
- Has a wide range of benchmarks:
  - Two-sided (MPI-1, MPI-2 and MPI-3)
  - One-sided (MPI-2 and MPI-3)
  - RMA (MPI-3)
  - Collectives (MPI-1, MPI-2 and MPI-3)
  - Extensions for GPU-aware communication (CUDA and OpenACC)
  - UPC (point-to-point)
  - OpenSHMEM (point-to-point and collectives)
- Widely used in the MPI community
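The OMB point-to-point latency test follows the classic ping-pong pattern; the simplified C sketch below shows that measurement loop. It is not the actual OMB source, and the iteration count and message size are arbitrary choices.

    /* Simplified ping-pong latency loop in the style of osu_latency:
     * ranks 0 and 1 exchange a fixed-size message many times and report
     * the average one-way latency. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        const int iters = 10000, size = 8;   /* message size in bytes */
        char *buf = malloc(size);
        int rank;
        double t_start, t_end;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);
        t_start = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        t_end = MPI_Wtime();

        if (rank == 0)   /* round trip divided by 2 gives one-way latency */
            printf("%d bytes: %.2f us one-way\n", size,
                   (t_end - t_start) * 1e6 / (2.0 * iters));

        free(buf);
        MPI_Finalize();
        return 0;
    }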

OMB (OSU Micro-Benchmarks) 4.4
Released on 8/23/2014. Includes benchmarks for the MPI, OpenSHMEM, and UPC programming models.
MPI (supports the MPI-1, MPI-2, and MPI-3 standards):
- Point-to-point benchmarks: latency, multi-threaded latency, multi-pair latency, multiple bandwidth / message rate, bandwidth, and bidirectional bandwidth
- Collective benchmarks: collective latency tests for MPI_Allgather, MPI_Alltoall, MPI_Allreduce, MPI_Barrier, MPI_Bcast, MPI_Gather, MPI_Reduce, MPI_Reduce_scatter, MPI_Scatter, and vector collectives
- One-sided benchmarks: Put latency, Put bandwidth, Put bidirectional bandwidth, Get latency, Get bandwidth, Accumulate latency, Atomics latency, and Get_accumulate latency
- CUDA and OpenACC extensions to evaluate MPI communication performance from/to buffers on NVIDIA GPU devices
- Support for MPI-3 RMA operations using GPU buffers
OpenSHMEM:
- Point-to-point benchmarks: Put latency, Get latency, Put message rate, and Atomics latency
- Collective benchmarks: Collect latency, FCollect latency, Broadcast latency, Reduce latency, and Barrier latency
Unified Parallel C (UPC):
- Point-to-point benchmarks: Put latency and Get latency
- Collective benchmarks: Barrier latency, Exchange latency, Gatherall latency, Gather latency, Reduce latency, and Scatter latency

Virtualization Support with MVAPICH2 using SR-IOV
[Charts: intra-node inter-VM point-to-point latency and bandwidth for MVAPICH2-SR-IOV-IB vs. MVAPICH2-Native-IB, with 1 VM per core.]
MVAPICH2-SR-IOV-IB adds only 3-7% (latency) and 3-8% (bandwidth) overhead compared to MVAPICH2 over native InfiniBand verbs (MVAPICH2-Native-IB).

Performance Evaluations with NAS and Graph500
[Charts: execution times for the NAS benchmarks (MG, CG, EP, LU, BT; class B, 64 processes) and Graph500 with MVAPICH2-SR-IOV-IB vs. MVAPICH2-Native-IB.]
- 8 VMs across 4 nodes, 1 VM per socket, 64 cores in total
- MVAPICH2-SR-IOV-IB adds 3-7% and 3-9% overhead for the NAS benchmarks and Graph500, respectively, compared to MVAPICH2-Native-IB

Performance Evaluation with LAMMPS
[Chart: LAMMPS execution times for the LJ-64-scaled and CHAIN-64-scaled workloads with MVAPICH2-SR-IOV-IB vs. MVAPICH2-Native-IB.]
- 8 VMs across 4 nodes, 1 VM per socket, 64 cores in total
- MVAPICH2-SR-IOV-IB adds 7% and 9% overhead for LJ and CHAIN in LAMMPS, respectively, compared to MVAPICH2-Native-IB

MVAPICH2 Plans for Exascale
- Performance and memory scalability toward 500K-1M cores, including the Dynamically Connected Transport (DCT) service with Connect-IB
- Hybrid programming (MPI + OpenSHMEM, MPI + UPC, MPI + CAF, ...)
- Enhanced optimization for GPU support and accelerators
- Taking advantage of the collective offload framework, including support for non-blocking collectives (MPI 3.0) and Core-Direct support
- Extended RMA support (as in MPI 3.0)
- Extended topology-aware collectives
- Power-aware collectives
- Extended support for the MPI Tools Interface (as in MPI 3.0)
- Extended checkpoint-restart and migration support with SCR
- Low-overhead and high-performance virtualization support

Web Pointers
- NOWLAB web page: http://nowlab.cse.ohio-state.edu
- MVAPICH web page: http://mvapich.cse.ohio-state.edu