Overview of the MVAPICH Project: Latest Status and Future Roadmap

Overview of the MVAPICH Project: Latest Status and Future Roadmap
MVAPICH User Group (MUG) Meeting
Dhabaleswar K. (DK) Panda, The Ohio State University
E-mail: panda@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~panda

Drivers of Modern HPC Cluster Architectures
- Multi-core processors: ubiquitous in today's systems
- High-performance interconnects: InfiniBand (<1 usec latency, >100 Gbps bandwidth) is very popular in HPC clusters
- Accelerators/coprocessors: high compute density and high performance per watt (>1 TFlop DP on a chip); becoming common in high-end systems
Together these trends are pushing the envelope for Exascale computing.
Example systems: Tianhe-2 (#1), Titan (#2), Stampede (#6), Tianhe-1A (#10).

Parallel Programming Models Overview
- Shared Memory Model (SHMEM, DSM): all processes see a common shared memory
- Distributed Memory Model (MPI, the Message Passing Interface): each process has its own private memory
- Partitioned Global Address Space (PGAS) Model (Global Arrays, UPC, Chapel, X10, CAF, ...): a logical shared memory layered over distributed memories
Programming models provide abstract machine models, and a model can be mapped onto different types of systems (e.g. Distributed Shared Memory (DSM), MPI within a node, etc.). PGAS models and hybrid MPI+PGAS models are gradually gaining importance.
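To make the message-passing model concrete, here is a minimal two-process MPI example in C. This is an illustrative sketch, not code from the talk.

    /* Minimal two-process MPI example illustrating the distributed-memory
     * (message-passing) model: each rank owns its private memory and data
     * moves only through explicit send/receive calls. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, value = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("Rank 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }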

Supporting Programming Models for Multi-Petaflop and Exaflop Systems: Challenges
- Application kernels/applications
- Programming models: MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenACC, Cilk, Hadoop, MapReduce, etc.
- Middleware: a communication library or runtime for the programming models, providing point-to-point communication (two-sided and one-sided), collective communication, synchronization and locks, I/O and file systems, and fault tolerance
- Networking technologies (InfiniBand, 10/40GigE, Aries, BlueGene), multi-/many-core architectures, and accelerators (NVIDIA and MIC)
Co-design opportunities and challenges exist across all of these layers.

Designing (MPI+X) at Exascale
- Scalability to million-to-billion processor counts
- Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
- Extremely small memory footprint
- Hybrid programming (MPI + OpenMP, MPI + UPC, MPI + OpenSHMEM, ...); see the sketch below
- Balancing intra-node and inter-node communication for next-generation multi-core nodes (128-1024 cores/node): multiple end-points per node and support for efficient multi-threading
- Scalable collective communication: offload, non-blocking, topology-aware, power-aware
- Support for the MPI-3 RMA model
- Support for GPGPUs and accelerators
- Fault-tolerance/resiliency
- QoS support for communication and I/O
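As a concrete illustration of the MPI + OpenMP hybrid style listed above, a minimal sketch in C follows. It is illustrative only and assumes an MPI library built with thread support at the MPI_THREAD_FUNNELED level (which MVAPICH2 provides); the loop body is a placeholder computation.

    /* Hybrid MPI + OpenMP sketch: one MPI process per node, OpenMP threads
     * for intra-node parallelism. Requests MPI_THREAD_FUNNELED so only the
     * master thread makes MPI calls. */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, rank;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Intra-node parallel work handled by OpenMP threads. */
        double local_sum = 0.0, global_sum = 0.0;
        #pragma omp parallel for reduction(+:local_sum)
        for (int i = 0; i < 1000000; i++)
            local_sum += 1.0 / (i + 1.0 + rank);

        /* Inter-node communication handled by MPI (master thread only). */
        MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("global sum = %f\n", global_sum);

        MPI_Finalize();
        return 0;
    }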

MVAPICH2/MVAPICH2-X Software (http://mvapich.cse.ohio-state.edu)
- High-performance open-source MPI library for InfiniBand, 10Gig/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE)
- MVAPICH (MPI-1) and MVAPICH2 (MPI-2.2 and MPI-3.0), available since 2002
- MVAPICH2-X (MPI + PGAS), available since 2012
- Used by more than 2,200 organizations (HPC centers, industry, and universities) in 73 countries
D. K. Panda, K. Tomko, K. Schulz and A. Majumdar, "The MVAPICH Project: Evolution and Sustainability of an Open Source Production Quality MPI Library for HPC," Int'l Workshop on Sustainable Software for Science: Practice and Experiences, held in conjunction with Int'l Conference on Supercomputing (SC '13), November 2013.

MVAPICH Team Projects
[Timeline figure: project lifetimes for MVAPICH (EOL), MVAPICH2, OMB, MVAPICH2-X, and MVAPICH2-GDR, from the early 2000s through Aug 2014.]

MVAPICH/MVAPICH2 Release Timeline and Downloads
[Chart: cumulative download counts from the MVAPICH2 website plotted against release milestones from MV 0.9.4 (Sep 2004) through MV2 2.0 (2014).]

MVAPICH2/MVAPICH2-X Software (Cont'd)
- Available with the software stacks of many IB, HSE, and server vendors, including Linux distros (RedHat and SuSE)
- Empowering many TOP500 clusters:
  - 7th-ranked 519,640-core cluster (Stampede) at TACC
  - 13th-ranked 74,358-core cluster (Tsubame 2.5) at Tokyo Institute of Technology
  - 23rd-ranked 96,192-core cluster (Pleiades) at NASA
  - and many others
- Partner in the U.S. NSF-TACC Stampede system

Strong Procedure for Design, Development and Release
- Research is done to explore new designs
- Designs are first presented in conference/journal publications
- The best-performing designs are incorporated into the codebase
- Rigorous Q&A procedure before making a release:
  - Exhaustive unit testing
  - Various test procedures on a diverse range of platforms and interconnects
  - Performance tuning
  - Application-based evaluation
  - Evaluation on large-scale systems
- Even alpha and beta versions go through the above testing

MVAPICH2 Architecture (Latest Release 2.0)
[Architecture diagram: support for all major computing platforms (IA-32, EM64T, Westmere, Sandybridge, Ivybridge, Opteron, Magny-Cours, ...) and different PCI interfaces.]

MVAPICH2 2.0 and MVAPICH2-X 2.0
Released on 6/20/2014. Major features and enhancements:
- Based on MPICH-3.1
- Extended support for MPI-3 RMA in the OFA-IB-CH3, OFA-IWARP-CH3, and OFA-RoCE-CH3 interfaces
- Multiple optimizations and performance enhancements for RMA in the OFA-IB-CH3 channel
- MPI-3 RMA support for the CH3-PSM channel
- CMA support is now enabled by default
- Enhanced intra-node SMP performance
- Dynamic CUDA initialization; support for running on heterogeneous clusters with GPU and non-GPU nodes
- Multi-rail support for GPU communication and the UD-Hybrid channel
- Improved job-startup performance for large-scale mpirun_rsh jobs
- Reduced memory footprint
- Updated compiler wrappers to remove application dependency on network and other extra libraries
- Updated hwloc to version 1.9
MVAPICH2-X 2.0 GA supports hybrid MPI + PGAS (UPC and OpenSHMEM) programming models:
- Based on MVAPICH2 2.0 GA, including MPI-3 features; compliant with UPC 2.18.0 and OpenSHMEM v1.0f
- Enhanced optimization of OpenSHMEM collectives and an optimized shmalloc routine
- Optimized UPC collectives and support for the GUPC translator

One-way Latency: MPI over IB with MVAPICH2
[Charts: small- and large-message one-way latency for Qlogic-DDR, Qlogic-QDR, ConnectX-DDR, ConnectX2-PCIe2-QDR, ConnectX3-PCIe3-FDR, Sandy-ConnectIB-DualFDR, and Ivy-ConnectIB-DualFDR; small-message latencies fall between about 1.0 and 1.9 us.]
Test systems: DDR, QDR - 2.4 GHz quad-core (Westmere) Intel, PCI Gen2, with IB switch; FDR - 2.6 GHz octa-core (SandyBridge) Intel, PCI Gen3, with IB switch; ConnectIB-Dual FDR - 2.6 GHz octa-core (SandyBridge) Intel, PCI Gen3, with IB switch; ConnectIB-Dual FDR - 2.8 GHz deca-core (IvyBridge) Intel, PCI Gen3, with IB switch.

Bandwidth: MPI over IB with MVAPICH2
[Charts: unidirectional and bidirectional bandwidth for the same adapters; unidirectional peaks reach roughly 12,485 MB/s and bidirectional peaks roughly 24,727 MB/s with dual-FDR ConnectIB.]
Test systems as on the previous slide.

MVAPICH2 Two-Sided Intra-Node Performance (Shared Memory and Kernel-based Zero-copy Support: LiMIC and CMA)
[Charts, latest MVAPICH2 2.0 on Intel Ivy Bridge: intra-socket small-message latency of 0.18 us and inter-socket latency of 0.45 us; intra-socket bandwidth above 14,000 MB/s and inter-socket bandwidth of 13,749 MB/s, comparing the CMA, shared-memory, and LiMIC channels.]

MPI-3 RMA Get/Put with Flush Performance
[Charts, latest MVAPICH2 2.0 on Intel Sandy Bridge with Connect-IB (single port): inter-node small-message Get/Put latencies of about 2.4 us and 1.56 us, with Get/Put bandwidth reaching roughly 6,881/6,876 MB/s; intra-socket small-message Get/Put latency around 0.8 us, with bandwidth reaching roughly 15,364/14,926 MB/s.]
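The Get/Put-with-flush numbers above correspond to passive-target MPI-3 RMA. The following minimal C sketch shows that communication pattern (window allocation, MPI_Put, MPI_Win_flush); it is an illustrative sketch, not the benchmark code used for these measurements.

    /* Passive-target MPI-3 RMA sketch: rank 0 puts data into rank 1's window
     * and uses MPI_Win_flush to complete the operation at the target. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank;
        double buf = 3.14, *win_mem;
        MPI_Win win;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Win_allocate(sizeof(double), sizeof(double), MPI_INFO_NULL,
                         MPI_COMM_WORLD, &win_mem, &win);
        *win_mem = 0.0;
        MPI_Win_lock_all(0, win);

        if (rank == 0) {
            MPI_Put(&buf, 1, MPI_DOUBLE, 1 /* target rank */, 0, 1, MPI_DOUBLE, win);
            MPI_Win_flush(1, win);   /* remotely complete the Put */
        }

        MPI_Win_unlock_all(win);
        MPI_Barrier(MPI_COMM_WORLD);
        if (rank == 1)
            printf("Rank 1 window value: %f\n", *win_mem);

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }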

MVAPICH2-X for Hybrid MPI + PGAS Applications
- MPI applications, OpenSHMEM applications, UPC applications, and hybrid (MPI + PGAS) applications all run over the unified MVAPICH2-X runtime: OpenSHMEM calls, UPC calls, and MPI calls share one runtime on InfiniBand, RoCE, and iWARP
- The unified communication runtime for MPI, UPC, and OpenSHMEM is available from MVAPICH2-X 1.9 onwards (http://mvapich.cse.ohio-state.edu)
Feature highlights:
- Supports MPI(+OpenMP), OpenSHMEM, UPC, MPI(+OpenMP) + OpenSHMEM, and MPI(+OpenMP) + UPC
- MPI-3 compliant, OpenSHMEM v1.0 standard compliant, UPC v1.2 standard compliant (with initial support for UPC 1.3)
- Scalable inter-node and intra-node communication, point-to-point and collectives

Hybrid (MPI+PGAS) Programming
- Application sub-kernels can be re-written in MPI or PGAS based on their communication characteristics (e.g. within one HPC application, Kernel 1 in MPI, Kernel 2 in PGAS, ..., Kernel N in PGAS)
- Benefits: the best of the distributed-memory computing model combined with the best of the shared-memory computing model
- Exascale Roadmap*: hybrid programming is a practical way to program exascale systems
* The International Exascale Software Roadmap, Dongarra, J., Beckman, P., et al., International Journal of High Performance Computing Applications, Volume 25, Number 1, 2011, ISSN 1094-3420.
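A minimal sketch of such a hybrid program, using OpenSHMEM 1.0-style calls alongside MPI over a unified runtime such as MVAPICH2-X, is shown below. The kernel bodies and the interleaving of the two models are illustrative assumptions, not code from the cited work.

    /* Hybrid MPI + OpenSHMEM sketch: an MPI collective for one kernel and a
     * one-sided OpenSHMEM update for another, over a single unified runtime. */
    #include <mpi.h>
    #include <shmem.h>
    #include <stdio.h>

    long counter = 0;                /* global => symmetric, remotely accessible */

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        start_pes(0);                /* OpenSHMEM 1.0-style initialization */

        int rank = _my_pe();
        int npes = _num_pes();

        /* Kernel 1: regular, bulk-synchronous exchange -> MPI collective */
        long local = rank, sum = 0;
        MPI_Allreduce(&local, &sum, 1, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);

        /* Kernel 2: irregular, one-sided update -> OpenSHMEM atomic */
        shmem_long_inc(&counter, (rank + 1) % npes);
        shmem_barrier_all();

        if (rank == 0)
            printf("sum = %ld, counter on PE 0 = %ld\n", sum, counter);

        MPI_Finalize();
        return 0;
    }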

Hybrid MPI+OpenSHMEM Graph500 Design
[Charts: execution time, strong scalability, and weak scalability (in billions of traversed edges per second, TEPS) for the MPI-Simple, MPI-CSC, MPI-CSR, and Hybrid (MPI+OpenSHMEM) designs at 1,024-16,384 processes and Graph500 scales 26-29.]
Performance of the hybrid (MPI+OpenSHMEM) Graph500 design:
- At 8,192 processes: 2.4X improvement over MPI-CSR and 7.6X over MPI-Simple
- At 16,384 processes: 1.5X improvement over MPI-CSR and 13X over MPI-Simple
J. Jose, S. Potluri, K. Tomko and D. K. Panda, "Designing Scalable Graph500 Benchmark with Hybrid MPI+OpenSHMEM Programming Models," International Supercomputing Conference (ISC '13), June 2013.
J. Jose, K. Kandalla, M. Luo and D. K. Panda, "Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and Performance Evaluation," Int'l Conference on Parallel Processing (ICPP '12), September 2012.

MVAPICH2 2.1a and MVAPICH2-X 2.1a
- Will be available later this week or next week
- Features: based on MPICH 3.1.2, PMI-2-based startup with SLURM, improved job startup, etc.

MVAPICH2-GDR 2.0
Released on 8/23/2014. Major features and enhancements:
- Based on MVAPICH2 2.0 (OFA-IB-CH3 interface)
- High-performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host, and Host-GPU) using GPUDirect RDMA and pipelining
- Support for MPI-3 RMA (one-sided) communication and atomics using GPU buffers with GPUDirect RDMA and pipelining
- Efficient small-message inter-node communication using the new NVIDIA GDRCOPY module
- Efficient small-message inter-node communication using a loopback design
- Multi-rail support for inter-node point-to-point GPU communication
- High-performance intra-node point-to-point communication for nodes with multiple GPU adapters (GPU-GPU, GPU-Host, and Host-GPU) using CUDA IPC and pipelining
- Automatic communication-channel selection for different GPU communication modes (DD, HH, and HD) in different configurations (intra-IOH and inter-IOH)
- Automatic selection and binding of the best GPU/HCA pair in multi-GPU/HCA system configurations
- Optimized and tuned support for collective communication from GPU buffers
- Enhanced and efficient support for datatype processing on GPU buffers, including vector/h-vector, index/h-index, array, and subarray types
- Dynamic CUDA initialization; GPU device selection is supported after MPI_Init
- Support for non-blocking streams in asynchronous CUDA transfers for better overlap
- Efficient synchronization using CUDA events for pipelined device data transfers
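With a CUDA-aware library such as MVAPICH2-GDR, device pointers can be passed directly to MPI calls; the following minimal C sketch (using the CUDA runtime API) illustrates the idea. It is an illustrative sketch: the buffer size is arbitrary, and it assumes a two-process job launched on GPU nodes with MV2_USE_CUDA=1 set.

    /* CUDA-aware MPI sketch: pass device buffers directly to MPI_Send/Recv;
     * the library moves the data via GPUDirect RDMA or pipelined copies. */
    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank;
        const int n = 1 << 20;
        float *d_buf;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        cudaMalloc((void **)&d_buf, n * sizeof(float));
        cudaMemset(d_buf, 0, n * sizeof(float));

        if (rank == 0)
            MPI_Send(d_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);   /* GPU -> GPU */
        else if (rank == 1) {
            MPI_Recv(d_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("received %d floats into a device buffer\n", n);
        }

        cudaFree(d_buf);
        MPI_Finalize();
        return 0;
    }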

Performance of MVAPICH2 with GPU-Direct-RDMA: Latency
[Chart: GPU-GPU inter-node MPI small-message latency comparing MV2-GDR 2.0, MV2-GDR 2.0b, and MV2 without GDR; MV2-GDR 2.0 reaches 2.18 usec, including a 71% improvement shown in the chart.]
Setup: MVAPICH2-GDR 2.0, Intel Ivy Bridge (E5-2680 v2) node with 20 cores, NVIDIA Tesla K40c GPU, Mellanox Connect-IB dual-FDR HCA, CUDA 6.5, Mellanox OFED 2.1 with GPU-Direct-RDMA.

Performance of MVAPICH2 with GPU-Direct-RDMA: Bandwidth
[Chart: GPU-GPU inter-node MPI unidirectional bandwidth for small messages; MV2-GDR 2.0 delivers about 2.2x the bandwidth of MV2-GDR 2.0b and about 10x that of MV2 without GDR.]
Same setup as the previous slide.

Performance of MVAPICH2 with GPU-Direct-RDMA: Bi-Bandwidth
[Chart: GPU-GPU inter-node MPI bidirectional bandwidth for small messages; MV2-GDR 2.0 delivers about 2x the bi-bandwidth of MV2-GDR 2.0b and about 10x that of MV2 without GDR.]
Same setup as the previous slides.

Performance of MVAPICH2 with GPU-Direct-RDMA: MPI-3 RMA
[Chart: GPU-GPU inter-node MPI Put latency (RMA put, device to device); MV2-GDR 2.0 reaches 2.88 us for small messages, an 83% improvement over MV2-GDR 2.0b.]
MPI-3 RMA provides flexible synchronization and completion primitives.
Same setup as the previous slides.

Application-Level Evaluation (HOOMD-blue)
[Charts: HOOMD-blue strong-scaling and weak-scaling results.]
- Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
- MV2-GDR 2.0 (released on 8/23/14): try it out!
- GDRCOPY-enabled settings: MV2_USE_CUDA=1 MV2_IBA_HCA=mlx5_0 MV2_IBA_EAGER_THRESHOLD=32768 MV2_VBUF_TOTAL_SIZE=32768 MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768 MV2_USE_GPUDIRECT_GDRCOPY=1 MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=16384

MPI Applications on MIC Clusters
- MPI (+X) continues to be the predominant programming model in HPC
- There is flexibility in how MPI jobs are launched on clusters with Xeon Phi:
  - Host-only (multi-core centric): the MPI program runs on the Xeon host only
  - Offload (or reverse offload): the MPI program runs on the host with computation offloaded to the Xeon Phi
  - Symmetric: MPI programs run on both the host and the Xeon Phi
  - Coprocessor-only (many-core centric): the MPI program runs on the Xeon Phi only

MVAPICH2-MIC Design for Clusters with IB and MIC
- Offload mode
- Intra-node communication: coprocessor-only mode and symmetric mode
- Inter-node communication: coprocessor-only mode and symmetric mode
- Multi-MIC node configurations

MIC-Remote-MIC P2P Communication
[Charts: large-message latency and bandwidth for intra-socket and inter-socket MIC-to-remote-MIC point-to-point communication; peak bandwidth of 5,236 MB/s (intra-socket) and 5,594 MB/s (inter-socket).]

Latest Status on MVAPICH2-MIC
- Running on three major systems:
  - Stampede: module swap mvapich2 mvapich2-mic/20130911
  - Blueridge (Virginia Tech): module swap mvapich2 mvapich2-mic/1.9
  - Beacon (UTK): module unload intel-mpi; module load mvapich2-mic/1.9
- A new version based on MVAPICH2 2.0 is being worked on and will be available in a few weeks

Optimized MPI Collectives for MIC Clusters (Allgather & Alltoall)
[Charts, 32-node experiments with mixed host (H) and MIC (M) processes: the optimized design (MV2-MIC-Opt) improves 32-node Allgather small-message latency by 76% (16H + 16M), Allgather large-message latency by 58% (8H + 8M), and Alltoall large-message latency by 55% (8H + 8M) over MV2-MIC; P3DFFT communication time is also reduced on 32 nodes (8H + 8M, size 2K x 2K x 1K).]
A. Venkatesh, S. Potluri, R. Rajachandrasekar, M. Luo, K. Hamidouche and D. K. Panda, "High Performance Alltoall and Allgather Designs for InfiniBand MIC Clusters," IPDPS '14, May 2014.

OSU Micro-Benchmarks (OMB)
- Started in 2004 and continuing steadily
- Allows MPI developers and users to test and evaluate MPI libraries
- Has a wide range of benchmarks:
  - Two-sided (MPI-1, MPI-2 and MPI-3)
  - One-sided (MPI-2 and MPI-3)
  - RMA (MPI-3)
  - Collectives (MPI-1, MPI-2 and MPI-3)
  - Extensions for GPU-aware communication (CUDA and OpenACC)
  - UPC (point-to-point)
  - OpenSHMEM (point-to-point and collectives)
- Widely used in the MPI community
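The OMB point-to-point latency test follows the classic ping-pong pattern; the simplified C sketch below shows that measurement loop. It is not the actual OMB source, and the iteration count and message size are arbitrary choices.

    /* Simplified ping-pong latency loop in the style of osu_latency:
     * ranks 0 and 1 exchange a fixed-size message many times and report
     * the average one-way latency. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        const int iters = 10000, size = 8;   /* message size in bytes */
        char *buf = malloc(size);
        int rank;
        double t_start, t_end;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);
        t_start = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        t_end = MPI_Wtime();

        if (rank == 0)   /* round trip divided by 2 gives one-way latency */
            printf("%d bytes: %.2f us one-way\n", size,
                   (t_end - t_start) * 1e6 / (2.0 * iters));

        free(buf);
        MPI_Finalize();
        return 0;
    }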

OMB (OSU Micro-Benchmarks) 4.4
Released on 8/23/2014. Includes benchmarks for the MPI, OpenSHMEM, and UPC programming models.
MPI (supports the MPI-1, MPI-2, and MPI-3 standards):
- Point-to-point benchmarks: latency, multi-threaded latency, multi-pair latency, multiple bandwidth / message rate, bandwidth, and bidirectional bandwidth
- Collective benchmarks: collective latency tests for MPI_Allgather, MPI_Alltoall, MPI_Allreduce, MPI_Barrier, MPI_Bcast, MPI_Gather, MPI_Reduce, MPI_Reduce_scatter, MPI_Scatter, and vector collectives
- One-sided benchmarks: Put latency, Put bandwidth, Put bidirectional bandwidth, Get latency, Get bandwidth, Accumulate latency, Atomics latency, and Get_accumulate latency
- CUDA and OpenACC extensions to evaluate MPI communication performance from/to buffers on NVIDIA GPU devices
- Support for MPI-3 RMA operations using GPU buffers
OpenSHMEM:
- Point-to-point benchmarks: Put latency, Get latency, Put message rate, and Atomics latency
- Collective benchmarks: Collect latency, FCollect latency, Broadcast latency, Reduce latency, and Barrier latency
Unified Parallel C (UPC):
- Point-to-point benchmarks: Put latency and Get latency
- Collective benchmarks: Barrier latency, Exchange latency, Gatherall latency, Gather latency, Reduce latency, and Scatter latency

Virtualization Support with MVAPICH2 using SR-IOV
[Charts: intra-node inter-VM point-to-point latency and bandwidth for MVAPICH2-SR-IOV-IB vs. MVAPICH2-Native-IB, with 1 VM per core.]
MVAPICH2-SR-IOV-IB adds only 3-7% (latency) and 3-8% (bandwidth) overhead compared to MVAPICH2 over native InfiniBand verbs (MVAPICH2-Native-IB).

Performance Evaluations with NAS and Graph500
[Charts: execution times for the NAS benchmarks (MG, CG, EP, LU, BT; class B, 64 processes) and Graph500 with MVAPICH2-SR-IOV-IB vs. MVAPICH2-Native-IB.]
- 8 VMs across 4 nodes, 1 VM per socket, 64 cores in total
- MVAPICH2-SR-IOV-IB adds 3-7% and 3-9% overhead for the NAS benchmarks and Graph500, respectively, compared to MVAPICH2-Native-IB

Performance Evaluation with LAMMPS
[Chart: LAMMPS execution times for the LJ-64-scaled and CHAIN-64-scaled workloads with MVAPICH2-SR-IOV-IB vs. MVAPICH2-Native-IB.]
- 8 VMs across 4 nodes, 1 VM per socket, 64 cores in total
- MVAPICH2-SR-IOV-IB adds 7% and 9% overhead for LJ and CHAIN in LAMMPS, respectively, compared to MVAPICH2-Native-IB

MVAPICH2 Plans for Exascale
- Performance and memory scalability toward 500K-1M cores, including the Dynamically Connected Transport (DCT) service with Connect-IB
- Hybrid programming (MPI + OpenSHMEM, MPI + UPC, MPI + CAF, ...)
- Enhanced optimization for GPU support and accelerators
- Taking advantage of the collective offload framework, including support for non-blocking collectives (MPI 3.0) and Core-Direct support
- Extended RMA support (as in MPI 3.0)
- Extended topology-aware collectives
- Power-aware collectives
- Extended support for the MPI Tools Interface (as in MPI 3.0)
- Extended checkpoint-restart and migration support with SCR
- Low-overhead and high-performance virtualization support

Web Pointers
- NOWLAB web page: http://nowlab.cse.ohio-state.edu
- MVAPICH web page: http://mvapich.cse.ohio-state.edu