EFFICIENT BREADTH FIRST SEARCH ON MULTI-GPU SYSTEMS USING GPU-CENTRIC OPENSHMEM

Size: px

Start display at page:

Download "EFFICIENT BREADTH FIRST SEARCH ON MULTI-GPU SYSTEMS USING GPU-CENTRIC OPENSHMEM"

Walter McDonald
5 years ago
Views:

1 EFFICIENT BREADTH FIRST SEARCH ON MULTI-GPU SYSTEMS USING GPU-CENTRIC OPENSHMEM Sreeram Potluri, Anshuman Goswami NVIDIA Manjunath Gorentla Venkata, Neena Imam - ORNL

2 SCOPE OF THE WORK Reliance on CPU for communication limits scaling on GPU clusters NVSHMEM provides GPU kernel-side OpenSHMEM API NVLink is a high-bandwidth memory-like fabric between GPUs Effectiveness of NVSHMEM over NVLink with Breadth First Search CPU-initiated MPI vs GPU-initiated SHMEM in a particular implementation of BFS 2

3 GPU Programming Models: MPI and SHMEM NVSHMEM Overview AGENDA DGX and NVLink MPI+CUDA implementation of BFS SHMEM implementation of BFS Performance Evaluation Conclusion and Future Work 3

4 GPU CLUSTER PROGRAMMING Offload model Compute on GPU Communication from CPU Synchronization at boundaries Overheads on GPU Clusters Offload latencies Synchronization overheads Limits strong scaling More CPU means more power void 2dstencil (u, v, ) { for (timestep = 0; ) { interior_compute_kernel <<< >>> ( ) pack_kernel <<< >>> ( ) cudastreamsynchronize( ) MPI_Irecv( ) MPI_Isend( ) MPI_Waitall( ) unpack_kernel <<< >>> ( ) boundary_compute_kernel <<< >>> ( ) } } 4

5 GPU-CENTRIC COMMUNICATION GPU capabilities Warp scheduling to hide latencies to global memory Implicit coalescing of loads/stores to achieve efficiency CUDA helps to program to these Should also benefit when accessing data over the network Direct accesses to remote memory simplifies programming Achieving efficiency while making it easier to program Continuous fine-grained accesses smooths traffic over the network 5

6 Symmetric Heap GPU GLOBAL ADDRESS SPACE 6

7 COMMUNICATION FROM CUDA KERNELS Long running CUDA kernels Communication within parallelism void 2dstencil (u, v, ) { stencil_kernel <<< >>> ( ) } global void 2dstencil (u, v, sync, ) { for(timestep = 0; ) { if (i+1 > nx) { v[i+1] = shmem_float_g (v[1], rightpe); } if (i-1 < 1) { v[i-1] = shmem_float_g (v[nx], leftpe); } } } u[i] = (u[i] + (v[i+1] + v[i-1]... if (i < 2) { shmem_int_p (sync + i, 1, peers[i]); shmem_quiet(); shmem_wait_until (sync + i, EQ, 1); } //intra-kernel sync 7

8 NVSHMEM Implementation of OpenSHMEM Specification (1.3) Symmetric heap on GPU memory Implements multi-threading support as being discussed in PR #43 HOST ONLY Library setup, exit and query Memory management HOST/GPU shmem_ptr Remote memory access Atomic memory operations Collectives Point-to-point synchronization Memory ordering Locking routines Extensions for GPUs CUDA stream-based SHMEM communication Thread cooperative SHMEM communication Expected availability: Q (single OS Multi-GPU nodes: NVLink, PCIe, QPI) 8

9 DGX-1 ARCHITECTURE CPU CPU IB 100 GB/sec PLX switch PLX switch PLX switch PLX switch IB 100 GB/sec IB 100 GB/sec IB 100 GB/sec P100 P100 P100 P100 P100 P100 P100 P100 9

10 CPU Inter-GPU Copies Single Process Unified Virtual Addressing available since CUDA 4.0 (2011) cudasetdevice(0); cudamalloc(&buf0, size) cudasetdevice(1); cudamalloc(&buf1, size) cudamemcpy (buf0, buf1, size, cudamemcpydefault ) GPU 0 GPU 1 10

11 CPU CPU GPUDirect P2P: by-passing CPU memory cudadevicecanaccesspeer(int *access, int device, int peerdevice); cudaenablepeeraccess(int peerdevice, unsigned int flags) cudasetdevice(0); cudamalloc(&buf0, size); cudacanaccesspeer (&access, 0, 1); assert(access == 1); cudaenablepeeraccess (1, 0); GPU 0 GPU 1 PCIe P2P (copy/ld/st) cudasetdevice(1); cudamalloc(&buf1, size); cudasetdevice (0); cudamemcpy (buf0, buf1, size, cudamemcpydefault); GPU 0 GPU 1 NVLink (copy/ld/st /atomics) Can also access (LD/ST) the peer GPU memory from inside a CUDA kernel 11

12 cudaipcgetmemhandle(), cudaipcopenmemhandle() & cudadeviceenablepeeraccess() Using GPU memory from another process GPU-0 VAs for Process 1 Physical GPU Memory Existing allocation Get handle and send to other process GPU-0 H Opening handle creates mapping in receiving process GPU-1 GPU-1 VAs for Process 2 Available for access by another process AND from a second GPU use peering 12

13 GB/sec GB/sec GB/sec Copy using DMA NVLINK1 W/ PASCAL PCIe 1 NVLink1 4x NVLink x 5.89x 0 PCIe 1 NVLink1 4x NVLink1 PCIe 1 NVLink1 4x NVLink LD/ST - Linear PCIe 1 NVLink1 4x NVLink1 5.43x 1.35x LD/ST - Random Access PCIe 1 NVLink1 4x NVLink1 2.64x PCIe 1 NVLink1 4x NVLink x GPU memory model is relaxed and scoped Same memory model works within/across GPUs 13

14 BREADTH FIRST SEARCH (BFS) Baseline: an MPI-based multi-gpu implementation of BFS by M.Bisson et. al. Has been shown to scale to thousands of GPUs on the Tsubame supercomputer Loosely follows the Graph 500 specification R-MAT generator, mean performance over 64 BFS operations Uses 32-bit vertices instead of required 48-bit representation Uses a parallel level-synchronous algorithm Frontier: list of vertices from which traversal starts in a give level Expand: traverse graph to find un-visited neighbors of frontier vertices Exchange: exchange list of newly found vertices Append: remove redundancy and build the frontier for next level 14

15 DATA PARTITIONING 0 2 N/C Process grid: R rows x C columns 0 1 C-1 P0,0 P0,1 P0,C-1 P1,0 P1,1 P1,C-1 R-1 PR-1,0 PR-1,1 PR-1,C-1 0R + i 1R + i R R+1 P0,0 P1,0 P0,1 P1,1 P0,C-1 P1,C-1 jr + i (C-1)R + i Partition at Process Pi,j RC 2R-1 PR-1,0 PR-1,1 PR-1,C-1 Destination vertex of each edge is in the same process row Adjacency list of a vertex is distributed along process column R(C-1) P0,0 P0,1 P0,C-1 R(C-1)+1 P1,0 P1,1 P1,C-1 RC-1 PR-1,0 PR-1,1 PR-1,C-1 15

16 BFS USING MPI frontier list at one processor column is populated with the seed vertex while (frontier not empty) { Expand: local expansion Horizontal Exchange: Alltoall exchange newly discovered vertices (happens only along grid rows) Append: build new frontier vertex list Vertical exchange: Allgather frontier list along grid columns (happens only along columns) Convergence check: Allreduce frontier list size among all processors } 16

17 EXCHANGE USING MPI Vertex lists are exchanged in two forms: Using a vertex list (one integer per vertex) Using a bitmap (one bit per vertex) Format is statically picked based on the level of traversal bitmap when number of discovered vertices is expected to be large vertex list when number of discovered vertices is expected to be large Communication implemented using non-blocking MPI sends and receives Enhanced the base version to using CUDA-aware MPI internally uses CUDA IPC/P2P to take advantage of direct PCIe or NVLink connections 17

18 GTEPS GTEPS PERFORMANCE WITH CUDA-AWARE MPI baseline cumpi baseline cumpi % % Graph Size (scale) Graph Size (scale) 4 K40 GPUs + PCIe 4 P100 GPUs + NVLink Supermicro server w/ IvyBridge CPU and Broadcom PCIe switch, CUDA 8.0, MVAPICH2 2.2 DGX-1 w/ Broadwell CPU, CUDA 8.0, MVAPICH

19 BFS USING NVSHMEM Fuse communication into CUDA kernels expand <- exchange of newly discovered vertices append <- exchange frontier vertices while (frontier not empty) { Expand on GPU: local expansion Append on GPU: build new frontier vertex list Convergence check on CPU: Allreduce frontier list size among all processors } atomicor(sbuf + r[k]/bits(msk), m[k]); int peer = r[k] / drow_bl; uint32_t off = disp + r[k]%drow_bl; q[k] = shmem_int_or ((int *)(msk + off/bits(msk)), m[k], peer); gets rid of the MPI exchange code on the host 19

20 WORKAROUND FOR PCIE GPU-GPU atomics are not supported over PCIE We take advantage of fact that 32-bit writes are atomic Vertex list is represented as integers, marked with shmem_int_p Higher processing overhead as the vertex map is 32x larger Provides a common version to compare behavior over PCIe and NVLink 20

21 GTEPS GTEPS PERFORMANCE WITH PUTS cumpi SHMEM-Put cumpi SHMEM-Put Graph Size (Scale) Graph Size (Scale) 4 K40 GPUs + PCIe 4 P100 GPUs + NVLink 21

22 GTEPS GTEPS PERFORMANCE WITH ATOMICS cumpi SHMEM-Atomics cumpi SHMEM-Atomics 50 Overall: 12% % 52% 22% Graph Size (scale) Graph Size (scale) 8 P100 GPUs + NVLink (2x4) 8 P100 GPUs + NVLink (4x2) 22

23 FUTURE WORK Fairer comparison by removing tracking of predecessor information Hybrid vertex list/bitmap design like in the case of MPI Can offload multiple levels of BFS using stream-based collectives in NVSHMEM Possibility of kernel fusion, using kernel-side collectives in NVSHMEM Evaluate the use of IB to extend NVSHEM across multiple nodes 23

24 CONCLUSION NVLink provides a high-bandwidth memory-like fabric between GPUS NVSHMEM provides GPU-side communication API based on OpenSHMEM We presented a study of using GPU-initiated communication in BFS Benefits from reduced overheads in strong scaling cases, compared to using MPI Tradeoff with efficient network utilization for weak scaling 24

25 THANKS We thank authors of the baseline BFS code Mauro Bisson, Massimo Bernaschi, Enrico Mastrostefano: Parallel Distributed Breadth First Search on the Kepler Architecture This has been supported in part by Oak Ridge National Lab under subcontract #

NVSHMEM: A PARTITIONED GLOBAL ADDRESS SPACE LIBRARY FOR NVIDIA GPU CLUSTERS. Sreeram Potluri, Anshuman Goswami - NVIDIA 3/28/2018

NVSHMEM: A PARTITIONED GLOBAL ADDRESS SPACE LIBRARY FOR NVIDIA GPU CLUSTERS Sreeram Potluri, Anshuman Goswami - NVIDIA 3/28/2018 GPU Programming Models AGENDA Overview of NVSHMEM Porting to NVSHMEM Performance