TECHNOLOGIES FOR IMPROVED SCALING ON GPU CLUSTERS. Jiri Kraus, Davide Rossetti, Sreeram Potluri, June 23 rd 2016

Size: px

Start display at page:

Download "TECHNOLOGIES FOR IMPROVED SCALING ON GPU CLUSTERS. Jiri Kraus, Davide Rossetti, Sreeram Potluri, June 23 rd 2016"

Joshua Godfrey Boyd
6 years ago
Views:

1 TECHNOLOGIES FOR IMPROVED SCALING ON GPU CLUSTERS Jiri Kraus, Davide Rossetti, Sreeram Potluri, June 23 rd 2016

2 MULTI GPU PROGRAMMING Node 0 Node 1 Node N-1 MEM MEM MEM MEM MEM MEM MEM MEM MEM MEM MEM MEM GPU CPU GPU CPU GPU CPU PCIe Switch PCIe Switch PCIe Switch IB IB IB 3

Balance Load Minimize communication Hide

3 MULTI GPU PROGRAMMING Minimizing Communication Overhead Application: Expose enough Parallelism Balance Load Minimize communication Hide communication time MPI/System SW/HW: Maximize BW Minimize Latencies 4

4 1D RING EXCHANGE Halo updates for 1D domain decomposition with periodic boundary conditions Unidirectional rings are important building block for collective algorithms 5

5 MAXIMIZING BANDWIDTH Bidirectional intra node GPU to GPU CPU staging pipeline can achieve good GPU to GPU Bandwidth: GPUA to CPU CPU to GPUB GPUB to CPU: CPU to GPUA: 6

6 Bandwidth (MB/s) MAXIMIZING BANDWIDTH MVAPICH2-GDR 2.2b intra-node GPU-to-GPU pt2pt BiBW Staging BW (MB/s) 7

7 TREND TO DENSER GPU NODES Dual Socket + 8 Tesla K80 = 16 GPUs CPU0 IB PCIe Switch PCIe Switch PCIe Switch PCIe Switch PCIe Switch PCIe Switch PCIe Switch GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 GPU8 GPU 8

8 TREND TO DENSER GPU NODES Dual Socket + 8 Tesla K80 = 16 GPUs 8 GPUs per CPU Socket BiBW Staging Pipeline: 2 writes and 2 reads in CPU mem per Com. pair 9 Com. pairs (including IB and intra Socket) CPU Memory BW: 70 GB/s 70 GB/s / (9*(2+2)) = 70 GB/s / GB/s (12% of PCI-E x16 gen3) 9

9 GPUDIRECT P2P MEM MEM MEM MEM MEM MEM GPU0 GPU1 CPU PCIe Switch Maximizes intra node inter GPU Bandwidth Avoids Host memory and system topology bottlenecks 10

10 Bandwidth (MB/s) MAXIMIZING BANDWIDTH MVPAICH2-GDR 2.2b intra-node GPU-to-GPU pt2tp BiBW GB/s Staging BW (MB/s) P2P BW (MB/s) 11

11 NVLINK P100 supports 4 NVLinks Up to 94% bandwidth efficiency Supports read/writes/atomics to peer GPU Supports read/write access to NVLink-enabled CPU Links can be ganged for higher bandwidth 40 GB/s 40 GB/s 40 GB/s 40 GB/s NVLink on Tesla P100 12

12 NVLINK - GPU CLUSTER Two fully connected quads, connected at corners 160GB/s per GPU bidirectional to Peers Load/store access to Peer Memory Full atomics to Peer GPUs High speed copy engines for bulk data copy PCIe to/from CPU 13

bidirectional to CPU Direct Load/store access to

13 NVLINK TO CPU Fully connected quad 120 GB/s per GPU bidirectional for peer traffic 40 GB/s per GPU bidirectional to CPU Direct Load/store access to CPU Memory High Speed Copy Engines for bulk data movement 14

14 Bandwidth (MB/s) GPUDIRECT P2P ON PASCAL early results, P2P thru NVLink OpenMPI intra-node GPU-to-GPU pt2pt BiBW 34.2 GB/s P100 NVLink K80@875 PCI-E 15

15 LATENCY GPUDirect RDMA Host staging pipeline increases latency because of staging step GPUDirect RDMA allows 3rd party devices to directly read/write GPU memory No host staging necessary for inter node GPU to GPU messages IB stack is optimized for latencies Using Loop back for GPU to GPU messages also helps with intra node GPU to GPU latency MEM MEM GPU IB MEM MEM CPU PCIe Switch 16

16 Latency (us) MINIMIZING LATENCY MVAPICH2-GDR 2.2b intra-node GPU-to-GPU pt2pt latency x Size (byte) Staging GPUDirect RDMA 17

17 SCALING LIMITER CPU (Time marked for one step, Domain size/gpu 1024, Boundary 16, Ghost Width 1) 18

18 SCALING LIMITER CPU CPU bounded (Time marked for one step, Domain size/gpu 128, Boundary 16, Ghost Width 1) 19

19 Stream CPU IB USING NORMAL MPI while ( t < T ) { //... for (int i=0; i<num_neighbors;++i) { MPI_Irecv(...,req); compute_boundary<<<grid,block,0,stream>>>(...); cudastreamsynchronize(stream); MPI_Isend(...,req + 1 ); MPI_Waitall(...,req,...); } //... } 20

20 GPUDIRECT ASYNC expose GPU front-end unit Front-end unit CPU prepares work plan hardly parallelizable, branch intensive GPU orchestrates flow Runs on optimized front-end unit Same one scheduling GPU work Now also scheduling network communications Compute Engines 21

21 GPUDIRECT ASYNC expose GPU front-end unit Front-end unit *(volatile uint32_t*)h_flag = 0; //... custreamwaitvalue32(stream, d_flag, 1, CU_STREAM_WAIT_VALUE_EQ); calc_kernel<<<gsz,bsz,0,stream>>>(); custreamwritevalue32(stream, d_flag, 2, 0); //... *(volatile uint32_t*)h_flag = 1; //... cudastreamsynchronize(stream); assert(*(volatile uint32_t*)h_flag == 2); CPU MEM Compute Engines 22

22 stream CPU IB USING STREAM-ORDERED API + ASYNC while ( t < T ) { //... for (int i=0; i<num_neighbors;++i) { mp_irecv(...,req); compute_boundary<<<grid,block,0,stream>>>(...); mp_isend_on_stream(...,req + 1, stream ); mp_wait_all_on_stream(...,req,stream ); } //... } 23

23 Percentage Improvement 2D STENCIL PERFORMANCE weak scaling, RDMA vs. RDMA+Async 2D Stencil 35,00% 30,00% 25,00% 20,00% 15,00% 10,00% 5,00% 0,00% local lattice size four nodes, IVB Xeon CPUs, K40m GPUs, Mellanox Connect-IB FDR, Mellanox FDR switch 24

24 GPUDIRECT ASYNC + INFINIBAND preview release of components CUDA Async extensions, with CUDA 8 Peer-direct async extension, in MLNX OFED 3.x, soon 25

Latest Advances in MVAPICH2 MPI Library for NVIDIA GPU Clusters with InfiniBand

Latest Advances in MVAPICH2 MPI Library for NVIDIA GPU Clusters with InfiniBand Presentation at GTC 2014 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda