TECHNOLOGIES FOR IMPROVED SCALING ON GPU CLUSTERS. Jiri Kraus, Davide Rossetti, Sreeram Potluri, June 23rd, 2016


MULTI GPU PROGRAMMING
[Diagram: Node 0 through Node N-1, each with CPU and GPU memory, a GPU and a CPU attached to a PCIe switch, and an InfiniBand (IB) adapter linking the nodes.]

MULTI GPU PROGRAMMING
Minimizing communication overhead
Application: expose enough parallelism, balance load, minimize communication, hide communication time.
MPI / system SW / HW: maximize bandwidth, minimize latencies.

1D RING EXCHANGE
Halo updates for a 1D domain decomposition with periodic boundary conditions.
Unidirectional rings are an important building block for collective algorithms.
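
As a minimal sketch of such a ring exchange (not taken from the slides; the buffer names, element count, and datatype are illustrative assumptions), each rank sends its boundary to the right neighbor and receives its halo from the left, with modular arithmetic providing the periodic wrap-around:

#include <mpi.h>

/* Sketch: one step of a unidirectional ring halo exchange with periodic
   boundaries. Buffer names (send_halo, recv_halo) and count n are
   illustrative, not from the presentation. */
void ring_exchange(const double *send_halo, double *recv_halo, int n, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int right = (rank + 1) % size;          /* destination in the ring        */
    int left  = (rank - 1 + size) % size;   /* source, wraps around at rank 0 */

    /* Send own boundary to the right, receive the halo from the left. */
    MPI_Sendrecv(send_halo, n, MPI_DOUBLE, right, 0,
                 recv_halo, n, MPI_DOUBLE, left,  0,
                 comm, MPI_STATUS_IGNORE);
}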

MAXIMIZING BANDWIDTH
Bidirectional intra-node GPU-to-GPU: a CPU staging pipeline can achieve good GPU-to-GPU bandwidth:
GPU A to CPU, CPU to GPU B
GPU B to CPU, CPU to GPU A
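
As a minimal sketch of one direction of such a staging pipeline (not from the slides; the chunk size, device numbers, and buffer names are assumptions for illustration), the transfer is split into chunks that alternate between the two halves of a pinned host buffer, so the device-to-host copy of one chunk overlaps the host-to-device copy of the previous one:

#include <cuda_runtime.h>

/* Sketch: copy 'bytes' from a buffer on GPU devA to a buffer on GPU devB
   through a pinned host buffer, in chunks, so the D2H copy of chunk i
   overlaps the H2D copy of chunk i-1. Error checking and resource cleanup
   for streams/events are omitted for brevity. */
void staged_copy(char *dst_on_B, const char *src_on_A, size_t bytes,
                 int devA, int devB, size_t chunk)
{
    char *host;
    cudaMallocHost(&host, 2 * chunk);             /* pinned double buffer */

    cudaStream_t sA, sB;
    cudaEvent_t d2h_done[2], h2d_done[2];
    cudaSetDevice(devA);
    cudaStreamCreate(&sA);
    cudaEventCreateWithFlags(&d2h_done[0], cudaEventDisableTiming);
    cudaEventCreateWithFlags(&d2h_done[1], cudaEventDisableTiming);
    cudaSetDevice(devB);
    cudaStreamCreate(&sB);
    cudaEventCreateWithFlags(&h2d_done[0], cudaEventDisableTiming);
    cudaEventCreateWithFlags(&h2d_done[1], cudaEventDisableTiming);

    for (size_t off = 0, i = 0; off < bytes; off += chunk, ++i) {
        size_t n   = (bytes - off < chunk) ? (bytes - off) : chunk;
        char  *buf = host + (i % 2) * chunk;      /* alternate buffer halves */

        cudaSetDevice(devA);
        if (i >= 2)                               /* a half is free only after its previous H2D finished */
            cudaStreamWaitEvent(sA, h2d_done[i % 2], 0);
        cudaMemcpyAsync(buf, src_on_A + off, n, cudaMemcpyDeviceToHost, sA);
        cudaEventRecord(d2h_done[i % 2], sA);

        cudaSetDevice(devB);                      /* start the H2D as soon as the chunk has landed */
        cudaStreamWaitEvent(sB, d2h_done[i % 2], 0);
        cudaMemcpyAsync(dst_on_B + off, buf, n, cudaMemcpyHostToDevice, sB);
        cudaEventRecord(h2d_done[i % 2], sB);
    }

    cudaSetDevice(devB);
    cudaStreamSynchronize(sB);
    cudaFreeHost(host);
}

CUDA-aware MPI libraries such as MVAPICH2-GDR apply a similar chunked host-staging pipeline internally when no direct GPU-to-GPU path is available.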

MAXIMIZING BANDWIDTH
[Chart: MVAPICH2-GDR 2.2b intra-node GPU-to-GPU pt2pt bidirectional bandwidth with CPU staging, in MB/s.]

TREND TO DENSER GPU NODES
Dual socket + 8 Tesla K80 = 16 GPUs
[Diagram: CPU0 with an IB adapter and a PCIe switch tree attaching the GPUs of one socket.]

TREND TO DENSER GPU NODES
Dual socket + 8 Tesla K80 = 16 GPUs, i.e. 8 GPUs per CPU socket.
Bidirectional-bandwidth staging pipeline: 2 writes and 2 reads in CPU memory per communicating pair, with 9 communicating pairs (including IB and intra-socket).
CPU memory BW: 70 GB/s
70 GB/s / (9 * (2 + 2)) = 70 GB/s / 36 ≈ 1.94 GB/s per pair (12% of PCIe x16 Gen3)
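
Restating the slide's arithmetic as a formula (all numbers come from the slide; per pair the staging pipeline issues the two writes and two reads to CPU memory listed above):

\[
\mathrm{BW}_{\text{pair}} = \frac{\mathrm{BW}_{\text{CPU mem}}}{N_{\text{pairs}} \cdot (N_{\text{writes}} + N_{\text{reads}})}
                          = \frac{70~\mathrm{GB/s}}{9 \cdot (2 + 2)} \approx 1.94~\mathrm{GB/s}
\]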

GPUDIRECT P2P
[Diagram: GPU0 and GPU1 with their memories behind one PCIe switch under the CPU.]
Maximizes intra-node inter-GPU bandwidth.
Avoids host memory and system topology bottlenecks.
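
A minimal sketch of using GPUDirect P2P through the CUDA runtime (device numbers and buffer names are illustrative, not from the slides): after peer access is checked and enabled, a peer copy moves data directly between the two GPUs without a host-memory bounce:

#include <cuda_runtime.h>

/* Sketch: direct GPU0 -> GPU1 copy with GPUDirect P2P.
   Device numbers and buffer names are illustrative. */
void p2p_copy(void *dst_on_gpu1, const void *src_on_gpu0, size_t bytes)
{
    int can_access = 0;
    cudaDeviceCanAccessPeer(&can_access, 0, 1);   /* can GPU0 reach GPU1 directly? */

    if (can_access) {
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);         /* map GPU1's memory into GPU0's address space */
    }

    /* Peer copy: goes over PCIe/NVLink when P2P is enabled,
       otherwise CUDA falls back to staging through the host. */
    cudaMemcpyPeerAsync(dst_on_gpu1, 1, src_on_gpu0, 0, bytes, 0);
    cudaDeviceSynchronize();
}

Once peer access is enabled, kernels on GPU0 can also load from and store to GPU1's memory directly, and on NVLink-connected GPUs perform atomics on it, as the NVLink slides below describe.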

MAXIMIZING BANDWIDTH
[Chart: MVAPICH2-GDR 2.2b intra-node GPU-to-GPU pt2pt bidirectional bandwidth, staging vs. P2P; P2P reaches 21.3 GB/s.]

NVLINK
P100 supports 4 NVLinks with up to 94% bandwidth efficiency.
Supports reads/writes/atomics to peer GPUs.
Supports read/write access to an NVLink-enabled CPU.
Links can be ganged for higher bandwidth.
[Diagram: NVLink on Tesla P100, four links at 40 GB/s each.]

NVLINK - GPU CLUSTER
Two fully connected quads, connected at corners.
160 GB/s per GPU bidirectional to peers.
Load/store access to peer memory.
Full atomics to peer GPUs.
High-speed copy engines for bulk data copy.
PCIe to/from CPU.

NVLINK TO CPU
Fully connected quad.
120 GB/s per GPU bidirectional for peer traffic.
40 GB/s per GPU bidirectional to the CPU.
Direct load/store access to CPU memory.
High-speed copy engines for bulk data movement.

GPUDIRECT P2P ON PASCAL
Early results, P2P through NVLink.
[Chart: OpenMPI intra-node GPU-to-GPU pt2pt bidirectional bandwidth, P100 with NVLink vs. K80@875 over PCIe; NVLink reaches 34.2 GB/s.]

LATENCY - GPUDIRECT RDMA
The host staging pipeline increases latency because of the extra staging step.
GPUDirect RDMA allows 3rd-party devices to directly read/write GPU memory.
No host staging is necessary for inter-node GPU-to-GPU messages.
The IB stack is optimized for low latency.
Using IB loopback for GPU-to-GPU messages also helps with intra-node GPU-to-GPU latency.
[Diagram: GPU and IB adapter sharing a PCIe switch; the adapter accesses GPU memory directly.]
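
From the application side, GPUDirect RDMA is typically reached through a CUDA-aware MPI (e.g. MVAPICH2-GDR or Open MPI), which accepts GPU pointers directly; a minimal sketch, with buffer names, count, and the peer rank chosen for illustration:

#include <mpi.h>

/* Sketch: exchange a halo that lives in GPU memory via a CUDA-aware MPI.
   With GPUDirect RDMA the HCA reads/writes d_send/d_recv directly, so no
   host staging copy is needed. Names and sizes are illustrative. */
void exchange_halo(double *d_send, double *d_recv, int n, int peer, MPI_Comm comm)
{
    MPI_Request req[2];
    MPI_Irecv(d_recv, n, MPI_DOUBLE, peer, 0, comm, &req[0]);   /* device pointers passed to MPI */
    MPI_Isend(d_send, n, MPI_DOUBLE, peer, 0, comm, &req[1]);
    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
}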

MINIMIZING LATENCY
[Chart: MVAPICH2-GDR 2.2b intra-node GPU-to-GPU pt2pt latency (us) vs. message size (16 B to 16384 B), staging vs. GPUDirect RDMA; a 4x difference is annotated.]

SCALING LIMITER: CPU
[Profiler timeline: time marked for one step; domain size per GPU 1024, boundary 16, ghost width 1.]

SCALING LIMITER: CPU
CPU bound.
[Profiler timeline: time marked for one step; domain size per GPU 128, boundary 16, ghost width 1.]

USING NORMAL MPI
[Timeline: stream, CPU, and IB activity per iteration.]

while (t < T) {
    //...
    for (int i = 0; i < num_neighbors; ++i) {
        MPI_Irecv(..., req);
        compute_boundary<<<grid, block, 0, stream>>>(...);
        cudaStreamSynchronize(stream);   // CPU must wait for the boundary kernel before sending
        MPI_Isend(..., req + 1);
        MPI_Waitall(..., req, ...);
    }
    //...
}

GPUDIRECT ASYNC
Exposes the GPU front-end unit.
CPU prepares the work plan: hardly parallelizable, branch intensive.
GPU orchestrates the flow: it runs on the optimized front-end unit, the same one scheduling GPU work, now also scheduling network communications.
[Diagram: front-end unit feeding the compute engines.]

GPUDIRECT ASYNC
Exposes the GPU front-end unit.

// All of this runs on the CPU; the cuStream* calls and the kernel launch only
// enqueue work that the GPU front-end unit executes in stream order.
// d_flag is the device-visible address of the same flag as h_flag
// (e.g. obtained from mapped pinned memory).
*(volatile uint32_t*)h_flag = 0;                                   // clear the flag
//...
cuStreamWaitValue32(stream, d_flag, 1, CU_STREAM_WAIT_VALUE_EQ);   // stream blocks until the flag becomes 1
calc_kernel<<<gsz, bsz, 0, stream>>>();
cuStreamWriteValue32(stream, d_flag, 2, 0);                        // stream writes 2 after the kernel
//...
*(volatile uint32_t*)h_flag = 1;                                   // CPU releases the waiting stream
//...
cudaStreamSynchronize(stream);
assert(*(volatile uint32_t*)h_flag == 2);                          // the stream's write is visible on the host

[Diagram: CPU, host memory flag, and GPU compute engines exchanging the values 0, 1, 2.]

USING STREAM-ORDERED API + ASYNC
[Timeline: stream, CPU, and IB activity per iteration.]

while (t < T) {
    //...
    for (int i = 0; i < num_neighbors; ++i) {
        mp_irecv(..., req);
        compute_boundary<<<grid, block, 0, stream>>>(...);
        mp_isend_on_stream(..., req + 1, stream);   // send is enqueued on the stream: no CPU synchronization
        mp_wait_all_on_stream(..., req, stream);
    }
    //...
}

2D STENCIL PERFORMANCE
Weak scaling, RDMA vs. RDMA+Async.
[Chart: percentage improvement (0% to 35%) over local lattice sizes from 8 to 8192.]
Four nodes, IVB Xeon CPUs, K40m GPUs, Mellanox Connect-IB FDR, Mellanox FDR switch.

GPUDIRECT ASYNC + INFINIBAND
Preview release of components:
CUDA Async extensions, with CUDA 8.
Peer-direct async extension, in MLNX OFED 3.x, soon.