Exploiting Full Potential of GPU Clusters with InfiniBand using MVAPICH2-GDR


Presentation at the Mellanox Theater
Dhabaleswar K. (DK) Panda, The Ohio State University
panda@cse.ohio-state.edu

Outline
- Communication on InfiniBand Clusters with GPUs
- MVAPICH2-GPU with GPUDirect RDMA (GDR)
- New Enhancements
  - Non-Blocking Collectives (MPI-3 NBC) Support
  - Multi-Stream for Efficient MPI Datatype Processing
- On-Going Work (GDR-Async and Managed Memory)
- OpenACC-Aware Support
- Conclusions

MVAPICH2-GPU: CUDA-Aware MPI
- Before CUDA 4: additional copies; low performance and low productivity
- After CUDA 4: host-based pipelining with Unified Virtual Addressing; CUDA copies are pipelined with IB transfers; high performance and high productivity
- After CUDA 5.5: GPUDirect RDMA support; direct GPU-to-GPU transfers that bypass host memory and the CPU; hybrid design to avoid PCIe bottlenecks
[Figure: data path between GPU memory, GPU, chipset, and the InfiniBand adapter]
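As a minimal sketch of the CUDA-aware interface described above (not taken from the slides; it assumes a CUDA-aware MPI such as MVAPICH2-GDR), an application can pass device pointers directly to MPI calls and let the library pick the transfer path:

    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int n = 1 << 20;
        double *d_buf;
        cudaMalloc((void **)&d_buf, n * sizeof(double));   /* GPU buffer */

        /* The device pointer is passed directly; a CUDA-aware MPI selects
         * host-staged pipelining or GPUDirect RDMA internally. */
        if (rank == 0)
            MPI_Send(d_buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(d_buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        cudaFree(d_buf);
        MPI_Finalize();
        return 0;
    }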

Overview of the MVAPICH2 Software
- High-performance MPI and PGAS library for InfiniBand, 10/40 GigE/iWARP, RDMA over Converged Enhanced Ethernet (RoCE), and accelerators
- MVAPICH (MPI-1) and MVAPICH2 (MPI-2.2 and MPI-3.0), available since 2002
- MVAPICH2-X (MPI + PGAS), available since 2011
- Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
- Support for virtualization (MVAPICH2-Virt), available since 2015
- Support for energy-awareness (MVAPICH2-EA), available since 2015
- Used by more than 2,475 organizations in 76 countries
- More than 308,000 downloads directly from the OSU site
- Empowering many TOP500 clusters (Nov '15 ranking):
  - 10th-ranked 519,640-core cluster (Stampede) at TACC
  - 13th-ranked 185,344-core cluster (Pleiades) at NASA
  - 25th-ranked 76,032-core cluster (Tsubame 2.5) at Tokyo Institute of Technology, and many others
- Available with the software stacks of many vendors and Linux distros (RedHat and SuSE)
- http://mvapich.cse.ohio-state.edu

Optimizing MPI Data Movement on GPU Clusters
- GPUs are connected as PCIe devices: flexibility, but complexity
[Figure: two nodes, each with two CPU sockets connected by QPI, GPUs on PCIe, an IB adapter, and host memory buffers]
- Communication paths to optimize:
  1. Intra-GPU
  2. Intra-socket GPU-GPU
  3. Inter-socket GPU-GPU
  4. Inter-node GPU-GPU
  5. Intra-socket GPU-Host
  6. Inter-socket GPU-Host
  7. Inter-node GPU-Host
  8. Inter-node GPU-GPU with the IB adapter on the remote socket, and more
- For each path there are different schemes: shared memory, CUDA IPC, GPUDirect RDMA, pipelining
- It is critical for runtimes to optimize data movement while hiding this complexity

Outline
- Communication on InfiniBand Clusters with GPUs
- MVAPICH2-GPU with GPUDirect RDMA (GDR)
- New Enhancements
  - Non-Blocking Collectives (MPI-3 NBC) Support
  - Multi-Stream for Efficient MPI Datatype Processing
- On-Going Work (GDR-Async and Managed Memory)
- OpenACC-Aware Support
- Conclusions

GPUDirect RDMA (GDR) with CUDA
- Hybrid design that combines GPUDirect RDMA with host-based pipelining
- Alleviates P2P bandwidth bottlenecks on SandyBridge and IvyBridge chipsets:
  - SNB E5-2670: P2P write 5.2 GB/s, P2P read < 1.0 GB/s
  - IVB E5-2680 v2: P2P write 6.4 GB/s, P2P read 3.5 GB/s
- Support for communication using multi-rail
- Support for Mellanox Connect-IB and ConnectX VPI adapters
- Support for RoCE with Mellanox ConnectX VPI adapters
[Figure: data paths between the CPU, chipset, system memory, GPU, GPU memory, and IB adapter]
S. Potluri, K. Hamidouche, A. Venkatesh, D. Bureddy and D. K. Panda, "Efficient Inter-node MPI Communication using GPUDirect RDMA for InfiniBand Clusters with NVIDIA GPUs," Int'l Conference on Parallel Processing (ICPP '13)

Using the MVAPICH2-GPUDirect Version
- MVAPICH2-2.2a with GDR support can be downloaded from https://mvapich.cse.ohio-state.edu/download/mvapich2gdr/
- System software requirements:
  - Mellanox OFED 2.1 or later
  - NVIDIA Driver 331.20 or later
  - NVIDIA CUDA Toolkit 6.5 or later
  - Plugin for GPUDirect RDMA: http://www.mellanox.com/page/products_dyn?product_family=116
  - Strongly recommended: use the new GDRCOPY module from NVIDIA
- Has optimized designs for point-to-point and collective communication using GDR
- Contact the MVAPICH help list with any questions related to the package: mvapich-help@cse.ohio-state.edu

Performance of MVAPICH2-GDR with GPUDirect RDMA
[Charts: GPU-GPU inter-node MPI latency, uni-directional bandwidth, and bi-directional bandwidth vs. message size, comparing MV2-GDR 2.1, MV2-GDR 2.0b, and MV2 without GDR]
- Small-message latency down to 2.18 us: up to 9.3X better than MV2 without GDR and 3.5X better than MV2-GDR 2.0b
- Uni-directional and bi-directional bandwidth improved by up to 11X over MV2 without GDR and about 2X over MV2-GDR 2.0b
Testbed: MVAPICH2-GDR 2.1; Intel Ivy Bridge (E5-2680 v2) node with 20 cores; NVIDIA Tesla K40c GPU; Mellanox Connect-IB dual-FDR HCA; CUDA 7; Mellanox OFED 2.4 with GPUDirect RDMA

Application-Level Evaluation (HOOMD-blue)
- Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
- HOOMD-blue version 1.0.5
- GDRCOPY enabled with the following runtime settings:
  MV2_USE_CUDA=1 MV2_IBA_HCA=mlx5_0 MV2_IBA_EAGER_THRESHOLD=32768 MV2_VBUF_TOTAL_SIZE=32768 MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768 MV2_USE_GPUDIRECT_GDRCOPY=1 MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=16384

Performance of MVAPICH2-GDR with GPUDirect RDMA: MPI-3 RMA
- GPU-GPU inter-node MPI_Put latency (RMA put, device to device)
- MPI-3 RMA provides flexible synchronization and completion primitives
[Chart: small-message MPI_Put latency vs. message size (0 bytes to 8K bytes) for MV2-GDR 2.1 and MV2-GDR 2.0b]
- Latency as low as 2.88 us, up to a 7X improvement
Testbed: MVAPICH2-GDR 2.1; Intel Ivy Bridge (E5-2680 v2) node with 20 cores; NVIDIA Tesla K40c GPU; Mellanox Connect-IB dual-FDR HCA; CUDA 7; Mellanox OFED 2.4 with GPUDirect RDMA
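A minimal sketch (not from the slides) of the one-sided pattern measured above, assuming a CUDA-aware MPI that accepts device buffers for RMA windows:

    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int n = 4096;
        double *d_buf;
        cudaMalloc((void **)&d_buf, n * sizeof(double));

        /* Expose the device buffer as an MPI-3 RMA window. */
        MPI_Win win;
        MPI_Win_create(d_buf, n * sizeof(double), sizeof(double),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        MPI_Win_lock_all(0, win);
        if (rank == 0) {
            /* Device-to-device put into rank 1's GPU window. */
            MPI_Put(d_buf, n, MPI_DOUBLE, 1, 0, n, MPI_DOUBLE, win);
            MPI_Win_flush(1, win);   /* complete the put at the target */
        }
        MPI_Win_unlock_all(win);

        MPI_Win_free(&win);
        cudaFree(d_buf);
        MPI_Finalize();
        return 0;
    }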

Performance of MVAPICH2-GDR with GPUDirect RDMA and Multi-Rail Support
[Charts: GPU-GPU inter-node uni-directional and bi-directional bandwidth vs. message size for MV2-GDR 2.1 and MV2-GDR 2.1 RC2]
- Uni-directional bandwidth reaches 8,759 MB/s (40% improvement)
- Bi-directional bandwidth reaches 15,111 MB/s (20% improvement)
Testbed: MVAPICH2-GDR 2.1; Intel Ivy Bridge (E5-2680 v2) node with 20 cores; NVIDIA Tesla K40c GPU; Mellanox Connect-IB dual-FDR HCA; CUDA 7; Mellanox OFED 2.4 with GPUDirect RDMA

Outline
- Communication on InfiniBand Clusters with GPUs
- MVAPICH2-GPU with GPUDirect RDMA (GDR)
- New Enhancements
  - Non-Blocking Collectives (MPI-3 NBC) Support
  - Multi-Stream for Efficient MPI Datatype Processing
- On-Going Work (GDR-Async and Managed Memory)
- OpenACC-Aware Support
- Conclusions

Non-Blocking Collectives (NBC)
- Non-blocking collectives using the network task-offload mechanism (CORE-Direct)
- MPI NBC decouples the initiation (e.g., MPI_Ialltoall) and completion (MPI_Wait) phases and offers overlap potential (Ialltoall + compute + Wait), but the CPU largely drives progress inside Wait, yielding close to zero overlap
- The CORE-Direct feature provides true overlap by letting the library specify a list of network tasks a priori that the NIC, instead of the CPU, progresses, freeing the CPU
- We propose a design that combines GPUDirect RDMA and CORE-Direct to provide efficient CUDA-aware NBC collectives on GPU buffers:
  - Overlap communication with CPU computation
  - Overlap communication with GPU computation
- OMB is extended with CUDA-aware NBC benchmarks to evaluate latency and overlap on both CPU and GPU
A. Venkatesh, K. Hamidouche, H. Subramoni, and D. K. Panda, "Offloaded GPU Collectives using CORE-Direct and CUDA Capabilities on IB Clusters," HiPC 2015
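The overlap pattern this design targets can be sketched as follows (an illustration only; my_kernel_launch stands in for a hypothetical application kernel):

    #include <mpi.h>
    #include <cuda_runtime.h>

    /* Hypothetical application kernel launcher (assumed, for illustration). */
    void my_kernel_launch(double *d_work, int n, cudaStream_t stream);

    void exchange_and_compute(double *d_send, double *d_recv, double *d_work,
                              int count, MPI_Comm comm, cudaStream_t stream)
    {
        MPI_Request req;

        /* Initiate the collective directly on device buffers (CUDA-aware MPI). */
        MPI_Ialltoall(d_send, count, MPI_DOUBLE,
                      d_recv, count, MPI_DOUBLE, comm, &req);

        /* Independent GPU work proceeds while the collective progresses; with
         * CORE-Direct offload the NIC, rather than the CPU, drives the collective. */
        my_kernel_launch(d_work, count, stream);

        MPI_Wait(&req, MPI_STATUS_IGNORE);   /* complete the collective */
        cudaStreamSynchronize(stream);       /* complete the GPU compute */
    }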

CUDA-Aware Non-Blocking Collectives
[Charts: medium/large-message overlap (%) on 64 GPU nodes for MPI_Ialltoall and MPI_Igather, with 1 process per node and with 2 processes per node (1 process per GPU), message sizes 4K to 1M bytes]
Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB), MVAPICH2-GDR 2.2a

Outline
- Communication on InfiniBand Clusters with GPUs
- MVAPICH2-GPU with GPUDirect RDMA (GDR)
- New Enhancements
  - Non-Blocking Collectives (MPI-3 NBC) Support
  - Multi-Stream for Efficient MPI Datatype Processing
- On-Going Work (GDR-Async and Managed Memory)
- OpenACC-Aware Support
- Conclusions

Non-Contiguous Data Exchange: Halo Data Exchange
- Multi-dimensional data with a row-based organization: contiguous in one dimension, non-contiguous in the others
- Halo data exchange: duplicate the boundary and exchange it in each iteration

MPI Datatype Processing (Computation Optimization)
- Comprehensive support:
  - Targeted kernels for regular datatypes: vector, subarray, indexed_block
  - Generic kernels for all other, irregular datatypes
- Separate non-blocking stream for kernels launched by the MPI library avoids stream conflicts with application kernels
- Flexible set of parameters for users to tune the kernels:
  - Vector: MV2_CUDA_KERNEL_VECTOR_TIDBLK_SIZE, MV2_CUDA_KERNEL_VECTOR_YSIZE
  - Subarray: MV2_CUDA_KERNEL_SUBARR_TIDBLK_SIZE, MV2_CUDA_KERNEL_SUBARR_XDIM, MV2_CUDA_KERNEL_SUBARR_YDIM, MV2_CUDA_KERNEL_SUBARR_ZDIM
  - Indexed_block: MV2_CUDA_KERNEL_IDXBLK_XDIM
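As an illustration of the kind of non-contiguous transfer these kernels accelerate (a sketch, not from the slides), consider sending one column of an n x n matrix that lives in GPU memory via MPI_Type_vector; the library packs the column with its own CUDA kernel on a separate stream:

    #include <mpi.h>

    /* d_matrix is a device pointer to an n x n row-major matrix (assumed). */
    void send_column(double *d_matrix, int n, int col, int dest, MPI_Comm comm)
    {
        MPI_Datatype column;

        /* n blocks of 1 element with a stride of n elements: one column. */
        MPI_Type_vector(n, 1, n, MPI_DOUBLE, &column);
        MPI_Type_commit(&column);

        /* With a CUDA-aware MPI the device pointer is passed directly; the
         * non-contiguous column is packed by an MPI-internal CUDA kernel. */
        MPI_Send(d_matrix + col, 1, column, dest, 0, comm);

        MPI_Type_free(&column);
    }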

MPI Datatype Processing (Communication Optimization)
- Common scenario: a series of non-blocking sends on derived datatypes, e.g. MPI_Isend(A, ..., datatype, ...), MPI_Isend(B, ..., datatype, ...), MPI_Isend(C, ..., datatype, ...), MPI_Isend(D, ..., datatype, ...), followed by MPI_Waitall
- Existing design: each Isend initiates its packing kernel, waits for the kernel (WFK), and only then starts the send, serializing kernel launches, waits, and sends and wasting computing resources on both the CPU and the GPU
- Proposed design: the packing kernels for all pending sends are launched on separate streams up front, and the waits for kernels and the sends are overlapped and progressed together, so the exchange finishes earlier than in the existing design
[Figure: timelines of the existing and proposed designs showing CPU progress, per-send kernels on streams, and the expected earlier finish time]
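The common scenario above, written out as a sketch (the neighbor ranks, the derived datatype halo_type, and the d_* device buffers are all assumed to be set up by the application):

    #include <mpi.h>

    void post_halo_sends(double *d_A, double *d_B, double *d_C, double *d_D,
                         MPI_Datatype halo_type,
                         int north, int south, int east, int west)
    {
        MPI_Request reqs[4];

        MPI_Isend(d_A, 1, halo_type, north, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(d_B, 1, halo_type, south, 1, MPI_COMM_WORLD, &reqs[1]);
        MPI_Isend(d_C, 1, halo_type, east,  2, MPI_COMM_WORLD, &reqs[2]);
        MPI_Isend(d_D, 1, halo_type, west,  3, MPI_COMM_WORLD, &reqs[3]);

        /* With the proposed multi-stream design, the packing kernels for the four
         * sends run on separate MPI-internal streams and their completions are
         * overlapped here, instead of each Isend serializing on its own kernel. */
        MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);
    }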

Application-Level Evaluation (HaloExchange, COSMO)
[Charts: normalized execution time of the default, callback-based, and event-based designs on the Wilkes GPU cluster (4-32 GPUs) and the CSCS GPU cluster (16-96 GPUs)]
- 2X improvement on 32 GPU nodes (Wilkes)
- 30% improvement on 96 GPUs (CSCS, 8 GPUs per node)
C. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee, H. Subramoni, and D. K. Panda, "Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems," under review

Outline
- Communication on InfiniBand Clusters with GPUs
- MVAPICH2-GPU with GPUDirect RDMA (GDR)
- New Enhancements
  - Non-Blocking Collectives (MPI-3 NBC) Support
  - Multi-Stream for Efficient MPI Datatype Processing
- On-Going Work (GDR-Async and Managed Memory)
- OpenACC-Aware Support
- Conclusions

On-Going Work
- Support for the GDR_Async feature (GPUDirect RDMA Family 4):
  - Offload control flow to the GPU: issue communication operations from/to the GPU
  - Frees the CPU and removes it from the critical path
  - Hides the overhead of launching CUDA kernels and keeps the GPU busy
  - OMB extended with GDR_Async semantics
- Initial support for managed memory:
  - MPI-aware managed memory: transparently handle data movement for managed memory at the MPI level
  - High productivity: a single pointer for both CPU and GPU work
  - OMB extended with managed-memory semantics
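A sketch (not from the slides) of the managed-memory usage this support targets, where a single cudaMallocManaged pointer is used for both GPU compute and MPI communication (scale_kernel is a hypothetical compute kernel):

    #include <mpi.h>
    #include <cuda_runtime.h>

    /* Hypothetical compute kernel, for illustration only. */
    __global__ void scale_kernel(double *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0;
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int n = 1 << 20;
        double *buf;
        cudaMallocManaged(&buf, n * sizeof(double));  /* one pointer for CPU and GPU */

        if (rank == 0) {
            scale_kernel<<<(n + 255) / 256, 256>>>(buf, n);
            cudaDeviceSynchronize();
            /* MPI-aware managed-memory support handles the data movement. */
            MPI_Send(buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }

        cudaFree(buf);
        MPI_Finalize();
        return 0;
    }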

Outline
- Communication on InfiniBand Clusters with GPUs
- MVAPICH2-GPU with GPUDirect RDMA (GDR)
- New Enhancements
  - Non-Blocking Collectives (MPI-3 NBC) Support
  - Multi-Stream for Efficient MPI Datatype Processing
- On-Going Work (GDR-Async and Managed Memory)
- OpenACC-Aware Support
- Conclusions

OpenACC-Aware MPI
- acc_malloc to allocate device memory:
  - No changes to MPI calls; MVAPICH2 detects the device pointer and optimizes data movement

    A = acc_malloc(sizeof(int) * N);
    #pragma acc parallel loop deviceptr(A)
    ...  /* compute loop */
    MPI_Send(A, N, MPI_INT, 0, 1, MPI_COMM_WORLD);
    acc_free(A);

- acc_deviceptr to get the device pointer (in OpenACC 2.0):
  - Enables MPI communication from memory allocated by the compiler, when available in OpenACC 2.0 implementations
  - MVAPICH2 detects the device pointer and optimizes communication

    A = malloc(sizeof(int) * N);
    #pragma acc data copyin(A[0:N])
    {
        #pragma acc parallel loop
        ...  /* compute loop */
        MPI_Send(acc_deviceptr(A), N, MPI_INT, 0, 1, MPI_COMM_WORLD);
    }
    free(A);

- Delivers the same performance as with CUDA

Outline
- Communication on InfiniBand Clusters with GPUs
- MVAPICH2-GPU with GPUDirect RDMA (GDR)
- New Enhancements
  - Non-Blocking Collectives (MPI-3 NBC) Support
  - Multi-Stream for Efficient MPI Datatype Processing
- On-Going Work (GDR-Async and Managed Memory)
- OpenACC-Aware Support
- Conclusions

Conclusions
- MVAPICH2 optimizes MPI communication on InfiniBand clusters with GPUs
- Provides optimized designs for point-to-point two-sided and one-sided communication, datatype processing, and collective operations
- Efficient and maximal overlap for MPI-3 NBC collectives
- Takes advantage of CUDA features such as IPC and the GPUDirect RDMA family
- Delivers high performance and high productivity, with support for the latest NVIDIA GPUs and InfiniBand adapters
- Users are strongly encouraged to use MVAPICH2-GDR 2.2a

Additional Pointers
- Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
- MVAPICH Web Page: http://mvapich.cse.ohio-state.edu/