TECHNOLOGIES FOR IMPROVED SCALING ON GPU CLUSTERS
Jiri Kraus, Davide Rossetti, Sreeram Potluri, June 23rd, 2016
MULTI GPU PROGRAMMING
[Diagram: Node 0 through Node N-1, each node with CPU, GPU, their memories, a PCIe switch, and an IB adapter]
MULTI GPU PROGRAMMING
Minimizing Communication Overhead
Application:
- Expose enough parallelism
- Balance load
- Minimize communication
- Hide communication time
MPI/System SW/HW:
- Maximize bandwidth
- Minimize latencies
1D RING EXCHANGE
Halo updates for a 1D domain decomposition with periodic boundary conditions
Unidirectional rings are an important building block for collective algorithms
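As a concrete illustration, a minimal sketch of one step of the 1D periodic ring exchange with plain MPI; the buffer names (sendbuf, recvbuf) and count are illustrative:

    // let MPI compute the ring neighbors: 1D Cartesian topology, periodic
    MPI_Comm ring_comm;
    int dims[1]    = { size };   // number of ranks in the ring
    int periods[1] = { 1 };      // periodic boundary conditions
    MPI_Cart_create(MPI_COMM_WORLD, 1, dims, periods, 0, &ring_comm);

    int left, right;
    MPI_Cart_shift(ring_comm, 0, 1, &left, &right);   // neighbors of a +1 shift

    // unidirectional ring step: send halo to the right, receive from the left
    MPI_Sendrecv(sendbuf, count, MPI_DOUBLE, right, 0,
                 recvbuf, count, MPI_DOUBLE, left,  0,
                 ring_comm, MPI_STATUS_IGNORE);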
MAXIMIZING BANDWIDTH
Bidirectional intra-node GPU to GPU
A CPU staging pipeline can achieve good GPU-to-GPU bandwidth:
- GPU A to CPU
- CPU to GPU B
- GPU B to CPU
- CPU to GPU A
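A minimal sketch of such a staging pipeline, assuming no peer access: chunks are copied GPU A -> pinned host buffer -> GPU B, double-buffered so the two copy directions overlap. All names are illustrative: devA/devB are the two GPUs, sA/sB streams created on them, h_stage[2] pinned host buffers (cudaMallocHost), d2h_done[] events created on devA and h2d_done[] on devB:

    for (size_t i = 0; i < nchunks; ++i) {
        int    b   = i % 2;          // double buffering: two pinned host buffers
        size_t off = i * chunk;

        cudaSetDevice(devA);
        if (i >= 2)                  // buffer b may still be read by GPU B
            cudaStreamWaitEvent(sA, h2d_done[b], 0);
        cudaMemcpyAsync(h_stage[b], (char*)d_srcA + off, chunk,
                        cudaMemcpyDeviceToHost, sA);
        cudaEventRecord(d2h_done[b], sA);

        cudaSetDevice(devB);
        cudaStreamWaitEvent(sB, d2h_done[b], 0);   // H2D waits for the D2H
        cudaMemcpyAsync((char*)d_dstB + off, h_stage[b], chunk,
                        cudaMemcpyHostToDevice, sB);
        cudaEventRecord(h2d_done[b], sB);
    }
    cudaStreamSynchronize(sB);

A CUDA-aware MPI such as MVAPICH2-GDR implements a comparable pipeline internally for the staging path measured on the next slide.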
MAXIMIZING BANDWIDTH
[Chart: MVAPICH2-GDR 2.2b intra-node GPU-to-GPU pt2pt bidirectional bandwidth, staging bandwidth in MB/s]
TREND TO DENSER GPU NODES
Dual socket + 8 Tesla K80 = 16 GPUs
[Diagram: CPU0 with an IB adapter and a tree of PCIe switches fanning out to the GPUs]
TREND TO DENSER GPU NODES
Dual socket + 8 Tesla K80 = 16 GPUs, i.e. 8 GPUs per CPU socket
Bidirectional-bandwidth staging pipeline: 2 writes and 2 reads in CPU memory per communicating pair
9 communicating pairs (including IB and intra-socket)
CPU memory bandwidth: 70 GB/s
70 GB/s / (9 * (2 + 2)) = 70 GB/s / 36 ≈ 1.94 GB/s per pair (about 12% of PCIe x16 gen3)
GPUDIRECT P2P
[Diagram: GPU0 and GPU1 on the same PCIe switch accessing each other's memory directly, bypassing the CPU]
- Maximizes intra-node inter-GPU bandwidth
- Avoids host memory and system topology bottlenecks
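A minimal sketch of enabling GPUDirect P2P between two GPUs and copying directly between them (dev0, dev1, d_buf0, d_buf1, bytes, and stream are illustrative):

    int can_access = 0;
    cudaDeviceCanAccessPeer(&can_access, dev0, dev1);
    if (can_access) {
        cudaSetDevice(dev0);
        cudaDeviceEnablePeerAccess(dev1, 0);   // dev0 may now access dev1's memory
        cudaSetDevice(dev1);
        cudaDeviceEnablePeerAccess(dev0, 0);

        // direct GPU-to-GPU copy over PCIe/NVLink, no host staging
        cudaMemcpyPeerAsync(d_buf1, dev1, d_buf0, dev0, bytes, stream);
    }

A CUDA-aware MPI uses the same mechanism under the hood when the GPUs of both ranks are P2P-capable.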
MAXIMIZING BANDWIDTH
[Chart: MVAPICH2-GDR 2.2b intra-node GPU-to-GPU pt2pt bidirectional bandwidth in MB/s; P2P reaches 21.3 GB/s vs. staging]
NVLINK
- P100 supports 4 NVLinks
- Up to 94% bandwidth efficiency
- Supports reads/writes/atomics to peer GPU
- Supports read/write access to an NVLink-enabled CPU
- Links can be ganged for higher bandwidth
[Diagram: NVLink on Tesla P100, four links at 40 GB/s each]
NVLINK - GPU CLUSTER
- Two fully connected quads, connected at corners
- 160 GB/s per GPU bidirectional to peers
- Load/store access to peer memory
- Full atomics to peer GPUs
- High-speed copy engines for bulk data copy
- PCIe to/from CPU
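Besides bulk copies, peer memory can be accessed directly from kernels; a sketch, assuming NVLink-connected GPUs with peer access enabled as above and peer_data pointing to an allocation on the other GPU (all names illustrative):

    // runs on one GPU; peer_data resides in the memory of a peer GPU
    __global__ void accumulate_into_peer(const float *local, float *peer_data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            atomicAdd(&peer_data[i], local[i]);   // atomic update of peer GPU memory
    }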
NVLINK TO CPU
- Fully connected quad
- 120 GB/s per GPU bidirectional for peer traffic
- 40 GB/s per GPU bidirectional to CPU
- Direct load/store access to CPU memory
- High-speed copy engines for bulk data movement
GPUDIRECT P2P ON PASCAL
early results, P2P through NVLink
[Chart: OpenMPI intra-node GPU-to-GPU pt2pt bidirectional bandwidth in MB/s; P100 with NVLink reaches 34.2 GB/s vs. K80@875 MHz over PCIe]
LATENCY
GPUDirect RDMA
- The host staging pipeline increases latency because of the extra staging step
- GPUDirect RDMA allows 3rd-party devices to directly read/write GPU memory
- No host staging necessary for inter-node GPU-to-GPU messages
- The IB stack is optimized for low latency
- Using IB loopback for GPU-to-GPU messages also helps with intra-node GPU-to-GPU latency
[Diagram: IB adapter on the PCIe switch reading/writing GPU memory directly, bypassing CPU memory]
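With a CUDA-aware MPI built with GPUDirect RDMA support (e.g. MVAPICH2-GDR or OpenMPI), the device pointer is passed to MPI directly and the host copy disappears; a sketch contrasting the two paths (h_buf, d_buf, count, and peer are illustrative):

    // without GPUDirect RDMA: explicit host staging before the send
    cudaMemcpy(h_buf, d_buf, count * sizeof(float), cudaMemcpyDeviceToHost);
    MPI_Send(h_buf, count, MPI_FLOAT, peer, 0, MPI_COMM_WORLD);

    // with a CUDA-aware MPI + GPUDirect RDMA: the HCA reads GPU memory directly
    MPI_Send(d_buf, count, MPI_FLOAT, peer, 0, MPI_COMM_WORLD);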
MINIMIZING LATENCY
[Chart: MVAPICH2-GDR 2.2b intra-node GPU-to-GPU pt2pt latency in µs for message sizes 16 to 16384 bytes; GPUDirect RDMA vs. staging, annotated 4x improvement for small messages]
SCALING LIMITER CPU
[Profiler timeline for one step; domain size per GPU 1024, boundary 16, ghost width 1]
SCALING LIMITER CPU
CPU bound
[Profiler timeline for one step; domain size per GPU 128, boundary 16, ghost width 1]
USING NORMAL MPI
while ( t < T ) {
    //...
    for (int i=0; i<num_neighbors; ++i) {
        MPI_Irecv(..., req);
        compute_boundary<<<grid,block,0,stream>>>(...);
        cudaStreamSynchronize(stream);   // CPU blocks until the boundary kernel finishes
        MPI_Isend(..., req + 1);
        MPI_Waitall(..., req, ...);      // CPU blocks until the exchange completes
    }
    //...
}
[Timeline: stream, CPU, and IB activity serialize around the synchronization points]
GPUDIRECT ASYNC
expose GPU front-end unit
- CPU prepares work plan
  - hardly parallelizable, branch intensive
- GPU orchestrates flow
  - runs on optimized front-end unit
  - same unit scheduling GPU work
  - now also scheduling network communications
[Diagram: GPU front-end unit and compute engines]
GPUDIRECT ASYNC
expose GPU front-end unit
*(volatile uint32_t*)h_flag = 0;
//...
// the stream waits until the flag (written by the CPU) becomes 1
cuStreamWaitValue32(stream, d_flag, 1, CU_STREAM_WAIT_VALUE_EQ);
calc_kernel<<<gsz,bsz,0,stream>>>();
// the stream signals completion by writing 2 to the flag
cuStreamWriteValue32(stream, d_flag, 2, 0);
//...
*(volatile uint32_t*)h_flag = 1;   // CPU releases the stream
//...
cudaStreamSynchronize(stream);
assert(*(volatile uint32_t*)h_flag == 2);
[Diagram: CPU, GPU front-end unit, compute engines, and the flag in CPU memory]
USING STREAM-ORDERED API + ASYNC
while ( t < T ) {
    //...
    for (int i=0; i<num_neighbors; ++i) {
        mp_irecv(..., req);
        compute_boundary<<<grid,block,0,stream>>>(...);
        // send and wait are queued on the stream: no CPU synchronization needed
        mp_isend_on_stream(..., req + 1, stream);
        mp_wait_all_on_stream(..., req, stream);
    }
    //...
}
[Timeline: stream, CPU, and IB activity overlap; the CPU only enqueues work]
2D STENCIL PERFORMANCE
weak scaling, RDMA vs. RDMA+Async
[Chart: percentage improvement of RDMA+Async over RDMA for the 2D stencil (y-axis 0-35%), local lattice sizes 8 to 8192]
four nodes, IVB Xeon CPUs, K40m GPUs, Mellanox Connect-IB FDR, Mellanox FDR switch
GPUDIRECT ASYNC + INFINIBAND
preview release of components
- CUDA Async extensions, with CUDA 8
- Peer-direct async extension, in MLNX OFED 3.x, soon