Optimized Non-contiguous MPI Datatype Communication for GPU Clusters: Design, Implementation and Evaluation with MVAPICH2


H. Wang, S. Potluri, M. Luo, A. K. Singh, X. Ouyang, S. Sur, D. K. Panda
Network-Based Computing Laboratory, The Ohio State University

Outline
Introduction
Problem Statement
Our Solution: MVAPICH2-GPU-NC
Design Considerations
Performance Evaluation
Conclusion & Future Work

InfiniBand Clusters in the TOP500
The percentage share of InfiniBand is steadily increasing:
41% of the systems in the TOP500 use InfiniBand (June 2011)
61% of the systems in the TOP100 use InfiniBand (June 2011)

Growth in GPGPUs
GPGPUs are gaining significance on clusters for data-centric applications:
Word occurrence, sparse integer occurrence, k-means clustering, linear regression
GPGPUs + InfiniBand are gaining momentum for large clusters:
#2 (Tianhe-1A), #4 (Nebulae) and #5 (Tsubame) petascale systems
GPGPU programming: CUDA or OpenCL + MPI
Big issue: the performance of data movement (latency, bandwidth, overlap)

Data Movement in GPU Clusters
[Figure: GPU - PCI-E - main memory - PCI-E hub - IB adapter - IB network - IB adapter - PCI-E hub - main memory - PCI-E - GPU]
Data movement in InfiniBand clusters with GPUs:
CUDA: device memory to main memory [at the source process]
MPI: source rank to destination process
CUDA: main memory to device memory [at the destination process]
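
A minimal sketch of this staged path for a contiguous GPU buffer (illustrative helper functions, not library code; h_buf is an assumed host staging buffer of the same size):

#include <mpi.h>
#include <cuda_runtime.h>

void staged_send(const void *d_buf, void *h_buf, size_t nbytes,
                 int dest, int tag, MPI_Comm comm)
{
    /* CUDA: device memory -> main memory at the source process */
    cudaMemcpy(h_buf, d_buf, nbytes, cudaMemcpyDeviceToHost);
    /* MPI: source rank -> destination process */
    MPI_Send(h_buf, (int)nbytes, MPI_BYTE, dest, tag, comm);
}

void staged_recv(void *d_buf, void *h_buf, size_t nbytes,
                 int src, int tag, MPI_Comm comm)
{
    MPI_Recv(h_buf, (int)nbytes, MPI_BYTE, src, tag, comm, MPI_STATUS_IGNORE);
    /* CUDA: main memory -> device memory at the destination process */
    cudaMemcpy(d_buf, h_buf, nbytes, cudaMemcpyHostToDevice);
}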

MVAPICH/MVAPICH2 Software
High-performance MPI library for IB and HSE: MVAPICH (MPI-1) and MVAPICH2 (MPI-2.2)
Used by more than 1,710 organizations in 63 countries
More than 79,000 downloads directly from the OSU site
Empowering many TOP500 clusters:
5th-ranked 73,278-core cluster (Tsubame 2.0) at Tokyo Institute of Technology
7th-ranked 111,104-core cluster (Pleiades) at NASA
17th-ranked 62,976-core cluster (Ranger) at TACC
Available with the software stacks of many IB, HSE and server vendors, including the OpenFabrics Enterprise Distribution (OFED) and Linux distros (RedHat and SuSE)
http://mvapich.cse.ohio-state.edu

MVAPICH2-GPU: GPU-GPU Communication Using MPI
Is it possible to optimize GPU-GPU communication with MPI?
H. Wang, S. Potluri, M. Luo, A. K. Singh, S. Sur, D. K. Panda, MVAPICH2-GPU: Optimized GPU to GPU Communication for InfiniBand Clusters, ISC 11, June 2011
Supports GPU to remote GPU communication using MPI; point-to-point and one-sided communication were improved, and collectives directly benefit from the point-to-point improvements
How to optimize GPU-GPU collectives with different algorithms?
A. K. Singh, S. Potluri, H. Wang, K. Kandalla, S. Sur, D. K. Panda, MPI Alltoall Personalized Exchange on GPGPU Clusters: Design Alternatives and Benefits, PPAC 11 (with Cluster 11), September 2011
Supports GPU to GPU Alltoall communication with a dynamic staging mechanism; GPU-GPU Alltoall performance was improved
How to handle non-contiguous data in GPU device memory? This paper!
Supports GPU-GPU non-contiguous data communication (point-to-point) using MPI; the vector datatype and the SHOC benchmark are optimized

Non-contiguous Data Exchange
Multi-dimensional data with a row-based organization: contiguous in one dimension, non-contiguous in the other dimensions
Halo data exchange: duplicate the boundary and exchange it in each iteration

Datatype Support in MPI
Native datatype support in MPI improves programming productivity and enables the MPI library to optimize non-contiguous data transfers.
At the sender:
MPI_Type_vector(n_blocks, n_elements, stride, old_type, &new_type);
MPI_Type_commit(&new_type);
MPI_Send(s_buf, size, new_type, dest, tag, MPI_COMM_WORLD);
What happens if the non-contiguous data is inside GPU device memory?
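
For illustration, a complete host-memory version of this pattern (not taken from the slides; grid dimensions and buffer names are assumed) exchanges one column of a row-major grid between two ranks:

#include <mpi.h>
#include <stdlib.h>

#define NX 1024   /* rows (assumed size) */
#define NY 1024   /* columns (assumed size) */

int main(int argc, char **argv)   /* run with two ranks */
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *grid = malloc((size_t)NX * NY * sizeof(double));
    MPI_Datatype column;

    /* NX blocks of 1 element each, separated by a stride of NY elements */
    MPI_Type_vector(NX, 1, NY, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    if (rank == 0)        /* send the east column */
        MPI_Send(&grid[NY - 1], 1, column, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)   /* receive into the west column */
        MPI_Recv(&grid[0], 1, column, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Type_free(&column);
    free(grid);
    MPI_Finalize();
    return 0;
}

The library handles the strided layout on the host; the question above is what happens when the grid lives in GPU device memory instead.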

Problem Statement
Non-contiguous data movement from/to GPGPUs is a performance bottleneck and reduces programmer productivity.
It is hard to optimize GPU-GPU non-contiguous data communication at the user level:
CUDA and MPI expertise is required for an efficient implementation
Hardware-dependent characteristics, such as latency, must be taken into account
There are different choices for packing/unpacking non-contiguous data; which is better?

Problem Statement (Cont.)
[Figure: moving non-contiguous data from GPU device memory to host main memory]
(a) No pack
(b) Pack by the GPU into host memory
(c) Pack by the GPU inside device memory
Which is better?

Performance for Vector Pack
[Figure: pack latency (us) vs. message size, two panels covering 8 bytes to 4 KB and 4 KB to 4 MB; unpack behaves similarly]
(a) D2H_nc2nc: D2H copy, non-contiguous to non-contiguous; packed by the CPU later
(b) D2H_nc2c: D2H copy, non-contiguous to contiguous; packed by the GPU directly into host memory
(c) D2D2H_nc2c2c: D2D then D2H copy, non-contiguous to contiguous inside the GPU
(c) improves on (a) by up to a factor of 8!
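
The three strategies can be sketched with plain CUDA runtime calls (illustrative code, not the library's internals; the layout is n_rows blocks of width bytes with a row pitch of pitch bytes, starting at d_src in device memory):

#include <cuda_runtime.h>

/* (a) D2H_nc2nc: move the strided region as-is, keeping the gaps in the
 * host layout; the CPU (or the MPI library) still has to pack it later. */
void pack_a_nc2nc(char *h_dst, const char *d_src,
                  size_t pitch, size_t width, size_t n_rows)
{
    cudaMemcpy2D(h_dst, pitch, d_src, pitch, width, n_rows,
                 cudaMemcpyDeviceToHost);
}

/* (b) D2H_nc2c: let the GPU pack directly into a contiguous host
 * buffer by using a destination pitch equal to the block width. */
void pack_b_nc2c(char *h_dst, const char *d_src,
                 size_t pitch, size_t width, size_t n_rows)
{
    cudaMemcpy2D(h_dst, width, d_src, pitch, width, n_rows,
                 cudaMemcpyDeviceToHost);
}

/* (c) D2D2H_nc2c2c: first pack into a contiguous temporary buffer inside
 * device memory, then move one contiguous block to the host. */
void pack_c_nc2c2c(char *h_dst, char *d_tmp, const char *d_src,
                   size_t pitch, size_t width, size_t n_rows)
{
    cudaMemcpy2D(d_tmp, width, d_src, pitch, width, n_rows,
                 cudaMemcpyDeviceToDevice);
    cudaMemcpy(h_dst, d_tmp, width * n_rows, cudaMemcpyDeviceToHost);
}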

MVAPICH2-GPU-NC: Design Goals
Support GPU-GPU non-contiguous data communication through standard MPI interfaces:
e.g., MPI_Send / MPI_Recv can operate on GPU memory addresses for non-contiguous datatypes, like MPI_Type_vector
Provide high performance without exposing low-level details to the programmer:
Offload datatype pack and unpack to the GPU
Pack: pack non-contiguous data into a contiguous buffer inside the GPU, then move it out
Unpack: move contiguous data into GPU memory, then unpack it to non-contiguous addresses
Pipeline data pack/unpack, data movement between device and host, and data transfer on the network
Automatically provide these optimizations inside the MPI library without user tuning

Sample Code - Without MPI Integration
Simple implementation for the vector type with MPI and CUDA; data pack and unpack are done by the CPU.
MPI_Type_vector(n_rows, width, n_cols, old_datatype, &new_type);
MPI_Type_commit(&new_type);
At the sender:
cudaMemcpy2D(s_buf, n_cols * datasize, s_device, n_cols * datasize, width * datasize, n_rows, cudaMemcpyDeviceToHost);
MPI_Send(s_buf, 1, new_type, dest, tag, MPI_COMM_WORLD);
At the receiver:
MPI_Recv(r_buf, 1, new_type, src, tag, MPI_COMM_WORLD, &status);
cudaMemcpy2D(r_device, n_cols * datasize, r_buf, n_cols * datasize, width * datasize, n_rows, cudaMemcpyHostToDevice);
High productivity, but poor performance.

Sample Code - User Optimized
Data pack/unpack is done by the GPU without MPI datatype support; pipelining is implemented at the user level with non-blocking MPI and CUDA interfaces.
At the sender:
for (j = 0; j < pipeline_len; j++)
    // pack: from non-contiguous to a contiguous buffer in GPU device memory
    cudaMemcpy2DAsync(...);
while (active_pack_stream || active_d2h_stream) {
    if (active_pack_stream > 0 && cudaStreamQuery(...) == cudaSuccess) {
        // move contiguous data from device to host
        cudaMemcpyAsync(...);
    }
    if (active_d2h_stream > 0 && cudaStreamQuery(...) == cudaSuccess)
        MPI_Isend(...);
}
MPI_Waitall(...);
Good performance, but poor productivity.

Sample Code - MVAPICH2-GPU-NC
MVAPICH2-GPU-NC supports GPU-GPU non-contiguous data communication through the standard MPI interface: data pack and unpack are offloaded to the GPU, and the pipeline is implemented inside the MPI library.
MPI_Type_vector(n_rows, width, n_cols, old_datatype, &new_type);
MPI_Type_commit(&new_type);
At the sender:
// s_device is a data buffer in GPU memory
MPI_Send(s_device, 1, new_type, dest, tag, MPI_COMM_WORLD);
At the receiver:
// r_device is a data buffer in GPU memory
MPI_Recv(r_device, 1, new_type, src, tag, MPI_COMM_WORLD, &status);
High productivity and high performance!

Design Considerations
Memory detection:
Uses the CUDA 4.0 Unified Virtual Addressing (UVA) feature
The MPI library can differentiate between device memory and host memory without any hints from the user
Overlap of data pack/unpack with CUDA copies and RDMA transfers:
Data pack and unpack are done by the GPU inside device memory
Pipeline data pack/unpack, data movement between device and host, and InfiniBand RDMA
Allow the DMAs of individual data chunks to progress independently
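
A minimal sketch of UVA-based memory detection (an assumed helper, not MVAPICH2 internals; written against the CUDA 4.x runtime API, where the attribute field is named memoryType, while much later CUDA releases expose a type field instead):

#include <cuda_runtime.h>

static int is_device_pointer(const void *buf)
{
    struct cudaPointerAttributes attr;
    if (cudaPointerGetAttributes(&attr, buf) != cudaSuccess) {
        cudaGetLastError();   /* clear the error: an ordinary malloc()'d host pointer */
        return 0;
    }
    return attr.memoryType == cudaMemoryTypeDevice;
}

With a check like this at the top of MPI_Send / MPI_Recv, the library can transparently take the GPU path for device buffers and fall back to the normal host path otherwise.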

Pipeline Design
The chunk size depends on the CUDA copy cost and the RDMA latency over the network.
Automatic tuning of the chunk size:
CUDA copy and RDMA latencies are detected during installation
The chunk size can be stored in a configuration file (mvapich.conf)
User-transparent, to deliver the best performance
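
To make the pipeline idea concrete, here is a hedged sketch of the send side for a buffer that has already been packed contiguously inside the GPU (chunk size, stream, event and buffer names are illustrative, not the library's internals); each chunk is copied to a pinned host staging buffer on a CUDA stream and handed to MPI_Isend as soon as its copy completes, so the PCIe copy of one chunk overlaps the network transfer of the previous one:

#include <mpi.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define CHUNK (256 * 1024)   /* assumed chunk size; the real library would read it from mvapich.conf */

void pipelined_send(const char *d_packed, size_t total, int dest, MPI_Comm comm)
{
    int nchunks = (int)((total + CHUNK - 1) / CHUNK);
    char *h_stage;
    cudaMallocHost((void **)&h_stage, total);   /* pinned staging buffer */

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaEvent_t *done = malloc(nchunks * sizeof(cudaEvent_t));
    MPI_Request *req  = malloc(nchunks * sizeof(MPI_Request));

    /* enqueue all device-to-host copies, one completion event per chunk */
    for (int i = 0; i < nchunks; i++) {
        size_t off = (size_t)i * CHUNK;
        size_t len = (off + CHUNK <= total) ? CHUNK : total - off;
        cudaEventCreate(&done[i]);
        cudaMemcpyAsync(h_stage + off, d_packed + off, len,
                        cudaMemcpyDeviceToHost, stream);
        cudaEventRecord(done[i], stream);
    }

    /* post each send as soon as its chunk has landed in host memory */
    for (int i = 0; i < nchunks; i++) {
        size_t off = (size_t)i * CHUNK;
        size_t len = (off + CHUNK <= total) ? CHUNK : total - off;
        cudaEventSynchronize(done[i]);
        MPI_Isend(h_stage + off, (int)len, MPI_BYTE, dest, i, comm, &req[i]);
    }
    MPI_Waitall(nchunks, req, MPI_STATUSES_IGNORE);

    for (int i = 0; i < nchunks; i++)
        cudaEventDestroy(done[i]);
    cudaStreamDestroy(stream);
    cudaFreeHost(h_stage);
    free(done);
    free(req);
}

A matching receiver would post the reverse sequence (MPI_Irecv per chunk, then cudaMemcpyAsync host-to-device as each receive completes); the actual design additionally overlaps the pack/unpack step itself and picks the chunk size from the measured copy and RDMA latencies.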

Performance Evaluation
Experimental environment:
NVIDIA Tesla C2050 GPU
Mellanox QDR InfiniBand HCA MT26428
Intel Westmere processors with 12 GB main memory
MVAPICH2 1.6, CUDA Toolkit 4.0
Modified OSU Micro-Benchmarks: the source and destination addresses are in GPU device memory
SHOC benchmark suite (1.0.1) from ORNL
Stencil2D: a two-dimensional nine-point stencil calculation, including the halo data exchange
One process per node, with one GPU card per node (8-node cluster)

Ping-Pong Latency for Vector
[Figure: latency (us) vs. message size for Cpy2D+Send, Cpy2DAsync+CpyAsync+Isend and MV2-GPU-NC, two panels covering 8 bytes to 4 KB and larger messages up to 4 MB]
MV2-GPU-NC gives an 88% improvement over Cpy2D+Send at 4 MB
MV2-GPU-NC has latency similar to Cpy2DAsync+CpyAsync+Isend, but is much easier to program

Stencil2D Time Breakdown
[Figure: 2x4 process grid (Ranks 0-7) and the time breakdown (us) for Rank 1 into south/west/east MPI and CUDA components at 32 KB halo size]
2x4 process grid; 8K x 8K matrix; 32 KB of halo data per dimension
Rank 1 has a contiguous data exchange with Rank 5 and non-contiguous data exchanges with Rank 0 and Rank 2
Non-contiguous data dominates the data transfer time

Code Complexity Comparison
MV2-GPU-NC takes GPU non-contiguous data addresses directly as parameters.

                    Stencil2D-Default    Stencil2D-MV2-GPU-NC
Function calls:
  MPI_Irecv                 4                      4
  MPI_Send                  4                      4
  MPI_Waitall               2                      2
  cudaMemcpy                4                      0
  cudaMemcpy2D              4                      0
Lines of code             245                    158

36% code reduction in the Stencil2D communication kernel!

Comparing Median Execution Time (Single Precision)
1 x 8: only non-contiguous data exchange
8 x 1: only contiguous data exchange
2 x 4: 60% non-contiguous, 40% contiguous data exchange
4 x 2: 40% non-contiguous, 60% contiguous data exchange

Process Grid (Matrix Size / Process)   Stencil2D-Default (sec)   Stencil2D-MV2-GPU-NC (sec)   Improvement
1 x 8 (64K x 1K)                       0.547788                  0.314085                     42%  (non-contiguous optimization in this paper)
8 x 1 (1K x 64K)                       0.33474                   0.272082                     19%  (contiguous optimization in the ISC 11 paper)
2 x 4 (8K x 8K)                        0.36016                   0.261888                     27%  (both contiguous and non-contiguous optimizations)
4 x 2 (8K x 8K)                        0.33183                   0.258249                     22%  (both contiguous and non-contiguous optimizations)

Comparing Median Execution Time (Double Precision)

Process Grid (Matrix Size / Process)   Stencil2D-Default (sec)   Stencil2D-MV2-GPU-NC (sec)   Improvement
1 x 8 (64K x 1K)                       0.780297                  0.474613                     39%
8 x 1 (1K x 64K)                       0.563038                  0.438698                     22%
2 x 4 (8K x 8K)                        0.57544                   0.424826                     26%
4 x 2 (8K x 8K)                        0.546968                  0.431908                     21%

MV2-GPU-NC delivers up to 42% improvement for Stencil2D with the single-precision data set and up to 39% with the double-precision data set.

Conclusion & Future Work
GPU-GPU non-contiguous point-to-point communication is optimized by MVAPICH2-GPU-NC:
Supports GPU-GPU non-contiguous point-to-point communication using standard MPI functions, improving programming productivity
Offloads non-contiguous data pack/unpack to the GPU
Overlaps pack/unpack, data movement between device and host, and data transfer on the network
Achieves up to 88% latency improvement for the vector type compared with the version without MPI-level optimization
Achieves up to 42% (single precision) and 39% (double precision) improvement over the default Stencil2D implementation in the SHOC benchmarks

Future work:
Evaluate non-contiguous datatype performance with our design on larger-scale GPGPU clusters
Improve more benchmarks and applications with MVAPICH2-GPU-NC
Integrate this design into future MVAPICH2 releases

Thank You!
{wangh, potluri, luom, singhas, ouyangx, surs, panda}@cse.ohio-state.edu
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
MVAPICH Web Page: http://mvapich.cse.ohio-state.edu/