NCCL 2.0. Sylvain Jeaugey

Size: px

Start display at page:

Download "NCCL 2.0. Sylvain Jeaugey"

Josephine Davis
6 years ago
Views:

1 NCCL 2.0 Sylvain Jeaugey

2 DEE LEARNING ON GUS Making DL training times shorter Deeper neural networks, larger data sets training is a very, very long operation! CUDA NCCL 1 NCCL 2 Multi-core CU GU Multi-GU Multi-GU Multi-node 2

3 NCCL A multi-gu communication library To other systems Sockets (Ethernet) Infiniband, with GU Direct RDMA Within a system NVLink CIe GU Direct 2 3

4 NCCL Architecture Caffe Caffe2 Torch TF MXNET CNTK Deep Learning Frameworks NCCL CUDNN CUBLAS CUDA NVIDIA GUs 4

5 NCCL History Design AGENDA NCCL 2.0 New features AI Changes erformance Future 5

6 HISTORY Q4 2015: NCCL 1.x Open-source research project on github, helping Deep Learning frameworks compute on multiple GUs with efficient collective operations. Limited to intra-node. Q2 2017: NCCL 2.x and beyond NVIDIA Library, multi-node support and improved AI. 6

7 DESIGN What is NCCL? Optimized collective communication library between CUDA devices. Easy to integrate into any DL framework, as well as traditional HC apps using MI. Runs on the GU using asynchronous CUDA kernels, for faster access to GU memory, parallel reductions, NVLink usage. Operates on CUDA pointers. Operations are tied to a CUDA stream. Uses as little threads as possible to permit other computation to progress simultaneously. 7

8 DESIGN Rings NCCL uses rings to move data across all GUs and perform reductions. 8

9 DESIGN Rings NCCL uses rings to move data across all GUs and perform reductions. CIe / QI : 1 unidirectional ring 9

10 DESIGN Rings NCCL uses rings to move data across all GUs and perform reductions. CIe / QI : 1 unidirectional ring DGX-1 : 4 unidirectional rings 10

11 DESIGN Kernels sendbuff recvbuff FIFO Reduction revious GU in the ring Next GU in the ring 11

12 NCCL

13 NCCL 2.0 Inter-node communication Inter-node communication using Sockets or Infiniband verbs, with multi-rail support, topology detection and automatic use of GU Direct RDMA. Optimal combination of NVLink, CI and network interfaces to maximize bandwidth and create rings across nodes. CIe, Infiniband DGX-1 : NVLink, 4x Infiniband 13

verbs, with multi-rail support, topology detection and automatic use of GU Direct

14 NCCL 2.0 Inter-node communication Inter-node communication using Sockets or Infiniband verbs, with multi-rail support, topology detection and automatic use of GU Direct RDMA. Optimal combination of NVLink, CI and network interfaces to maximize bandwidth and create rings across nodes. CIe, Infiniband DGX-1 : NVLink, 4x Infiniband 14

15 GU0 GU1 GU2 GU3 GU4 GU5 GU6 GU7 GU0 GU1 GU2 GU3 GU4 GU5 GU6 GU7 GU0 GU1 GU2 GU3 GU4 GU5 GU6 GU7 NCCL 2.0 rocesses, threads and GUs Supports a combination of processes (potentially across nodes), threads per process and GUs per thread. n nodes Node 0 Node 1 Node N-1 2 sockets per node CU0 CU1 CU0 CU1 CU0 CU1 4 GUs per socket 15

16 GU0 GU1 GU2 GU3 GU4 GU5 GU6 GU7 GU0 GU1 GU2 GU3 GU4 GU5 GU6 GU7 GU0 GU1 GU2 GU3 GU4 GU5 GU6 GU7 NCCL 2.0 rocesses, threads and GUs Supports a combination of processes (potentially across nodes), threads per process and GUs per thread. Node 0 Node 1 Node n-1 1 process per GU CU CU CU CU n -8 CU0 8n 8n n -5 8n -4 CU1 8n 8n n -1 16

17 GU0 GU1 GU2 GU3 GU4 GU5 GU6 GU7 GU0 GU1 GU2 GU3 GU4 GU5 GU6 GU7 GU0 GU1 GU2 GU3 GU4 GU5 GU6 GU7 NCCL 2.0 rocesses, threads and GUs Supports a combination of processes (potentially across nodes), threads per process and GUs per thread. Node 0 Node 1 Node n-1 1 process per socket rocess 0 CU0 t0 t1 t2 t3 rocess 1 CU1 t0 t1 t2 t3 rocess 2 CU0 t0 t1 t2 t3 rocess 3 CU1 t0 t1 t2 t3 rocess 2n-2 CU0 t0 t1 t2 t3 rocess 2n-1 CU1 t0 t1 t2 t3 1 thread per GU 17

18 GU0 GU1 GU2 GU3 GU4 GU5 GU6 GU7 GU0 GU1 GU2 GU3 GU4 GU5 GU6 GU7 GU0 GU1 GU2 GU3 GU4 GU5 GU6 GU7 NCCL 2.0 rocesses, threads and GUs Supports a combination of processes (potentially across nodes), threads per process and GUs per thread. Node 0 Node 1 Node n-1 1 process CU0 CU1 CU0 CU1 CU0 per node rocess 0 rocess 1 rocess n-1 CU1 8 GUs per process 18

19 GU0 GU1 GU2 GU3 GU4 GU5 GU6 GU7 NCCL 2.0 AI Group calls NCCL 2.0 is introducing mandatory new verbs ncclgroupstart/ncclgroupend when managing multiple devices from a single thread NCCL 1.x : for (int i=0; i<ngpus; i++) { cudasetdevice(devices[i]); ncclallreduce(, comms[i], streams[i]); } CU0 rocess 0 CU1 NCCL 2.0 : ncclgroupstart(); for (int i=0; i<ngpus; i++) { ncclallreduce(, comms[i], streams[i]); } ncclgroupend(); 19

20 GU0 GU1 GU2 GU3 GU4 GU5 GU6 GU7 GU0 GU1 GU2 GU3 GU4 GU5 GU6 GU7 NCCL 2.0 AI Integration with parallel environments Inter-node communicator creation still uses the NCCL 1.x verbs : ncclgetuniqueid/ncclcomminitrank if (rank == 0) ncclgetuniqueid(&id) My_Bcast(&id); ncclcomminitrank(&comm, nranks, id, rank); CU CU1 7 Multi-process + multi-gu per process (from a single thread) : combine ncclcomminitrank with ncclgroupstart/ncclgroupend if (rank == 0) ncclgetuniqueid(&id) My_Bcast(&id); ncclgroupstart(); for (int i=0; i<ndev; i++) { cudasetdevice(devices[i]); ncclcomminitrank(&comm, ndev*nranks, id, ndev*rank+i); } ncclgroupend(); CU0 rocess 0 CU1 20

21 NCCL 2.0 AI Others Other small AI adjustments over the NCCL 1.x AI : Counts are now of type size_t instead of int allgather arguments order has been fixed to be similar to other operations Additions/clarification on datatypes : integral : int8 = char, uint8, int32 = int, uint32, int64, uint64 floating point : float16 = half, float32 = float, float64 = double Clarifications and fixes for allgather and reduce_scatter send/receive counts and in-place operations 21

22 ERFORMANCE 22

23 ERFORMANCE Intra-node performance AllReduce bandwidth (OMB, size=128mb, in GB/s) QI 4 CU 4 CI DGX-1 23

24 ERFORMANCE Inter-node performance 45 AllReduce bandwidth (OMB, size=128mb, in GB/s) MI Baidu Allreduce NCCL nodes x 4 GUs (IB EDR, CI Switch) 4 nodes x 8 GUs (DGX-1 : 4x IB EDR, 4x NVLink) 24

25 ERFORMANCE Deep Learning - CNTK 8000 CNTK scaling ResNet50, images/s Ideal MI NCCL 25

26 FUTURE 26

27 FUTURE Top asked features Additional communication primitives : point-to-point communication scatter (1 to N), gather (N to 1), alltoall (N to N) neighbor collectives (send/receive in multiple dimensions) User-defined reduction operations also, trying to merge computation and communication better Windows support lease let us know your needs! Connect with experts / NCCL session : Wed Apr 10, 4pm 27

NVIDIA COLLECTIVE COMMUNICATION LIBRARY (NCCL)

NVIDIA COLLECTIVE COMMUNICATION LIBRARY (NCCL) DU-08730-210_v01 March 2018 Installation Guide TABLE OF CONTENTS Chapter 1. Overview... 1 Chapter 2. Prerequisites...3 2.1. Software Requirements... 3 2.2.