Parallel Stochastic Gradient Descent: The case for native GPU-side GPI


1 Parallel Stochastic Gradient Descent: The case for native GPU-side GPI. J. Keuper, Competence Center High Performance Computing, Fraunhofer ITWM, Kaiserslautern, Germany; Mark Silberstein, Accelerated Computer Systems Lab, Technion Israel Institute of Technology

2 Accelerated Systems Lab: Operating system support for accelerators; GPU file system layer; GPU networking API; GPU virtual memory and huge data sets; OS support for optimized GPU-SSD transfers; GPU RDMA; FPGA-CPU SoCs, near-data I/O accelerators; SGX, accelerator security. October 2016 Mark Technion 2

3 Outline: Stochastic Gradient Descent in a nutshell; Data-parallel distributed algorithm; GPI implementation; Communication bottlenecks and sparsity; GPU-native GPI-2 communications

4 SGD in a nutshell: Common technique for training ML models; optimizes a loss function by modifying the model's parameters. [Figure: Computations]
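As a minimal sketch of the idea on this slide, the following fits a single parameter w so that y ≈ w * x by stepping against the gradient of a squared-error loss, one sample at a time. The model, the loss, the function name `sgd`, and all hyperparameters are assumptions chosen for illustration; they do not come from the talk.

```python
import random

def sgd(samples, lr=0.05, epochs=200, seed=0):
    """Fit y ~ w * x by per-sample stochastic gradient descent (illustrative only)."""
    rng = random.Random(seed)
    data = list(samples)
    w = 0.0                                  # the single model parameter
    for _ in range(epochs):
        rng.shuffle(data)                    # visit samples in random order
        for x, y in data:
            grad = 2.0 * (w * x - y) * x     # d/dw of the loss (w*x - y)^2
            w -= lr * grad                   # step against the gradient
    return w

# Noise-free toy data generated with w = 3.
samples = [(x, 3.0 * x) for x in (0.5, 1.0, 1.5, 2.0)]
w = sgd(samples)
```

Each update depends on the parameter value produced by the previous one, which is why the plain algorithm is sequential.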

5 Parallelization: Intrinsically sequential. [Figure: Computations] Data parallel

6 Parallelization via Master-Worker. [Figure: parameter server connected to two workers, one per sub-batch (Sub-batch 0, Sub-batch 1); edge labels: (w, s_0), (w, s_1), w, w]
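One way to read the master-worker picture is as a synchronous round: the server hands the current parameter w to every worker, each worker computes a gradient on its own sub-batch, and the server averages the results and updates w. The sketch below assumes exactly that scheme on a one-parameter toy model; the function names are hypothetical, the workers run one after another rather than in parallel, and the quantities s_0 and s_1 from the figure are not modeled.

```python
def worker_gradient(w, sub_batch):
    """Mean gradient of the squared error (w*x - y)^2 over one sub-batch."""
    return sum(2.0 * (w * x - y) * x for x, y in sub_batch) / len(sub_batch)

def parameter_server_step(w, sub_batches, lr):
    """One synchronous round: hand out w, collect gradients, average, update."""
    # In a real deployment each call below would run on a different worker.
    grads = [worker_gradient(w, sb) for sb in sub_batches]
    return w - lr * sum(grads) / len(grads)

batch = [(x, 3.0 * x) for x in (0.5, 1.0, 1.5, 2.0)]
sub_batches = [batch[:2], batch[2:]]         # Sub-batch 0, Sub-batch 1
w = 0.0
for _ in range(100):
    w = parameter_server_step(w, sub_batches, lr=0.1)
```

Every round moves one parameter copy to each worker and one gradient back, so traffic at the server grows with the number of workers.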

7 SGD with Deep Neural Networks. [Figure: Layer 1 (w_1), Layer 2 (w_2), ..., Layer n (w_n), network output; forward pass and backward pass]

8 GPUs are used for computations. [Figure: Layer 1 (w_1), Layer 2 (w_2), ..., Layer n (w_n), network output; forward pass and backward pass]

9 Data parallel SGD with DNNs. [Figure: parameter server connected to three replicas of the network, each with Layer 1 (w_1), Layer 2 (w_2), ..., Layer n (w_n)]

10 Problem: scalability limit. [Plot: time until convergence vs. machines]

11 Communication bottleneck: Communications grow linearly with the number of nodes! [Plot: communication time per node] Source: Keuper, Pfreundt, "Distributed Training of Deep Neural Networks: Theoretical and Practical Limits of Parallel Scalability", to appear at MLHPC16

12 Steps toward improved scalability: Asynchronous, zero-copy I/O; Direct transfer from/to GPU memory; GPI-2; Sparsity-aware compressed communications; GPU-side networking; Ongoing research

13 Background: GPI

14 Special Requirement 1: 1-sided communications. [Figure: Layer 3 (w_3) labeled "Previous computations", with (w_3, s_0) and the parameter server; Layer 2 (w_2) labeled "Current computations"; Layer 1 (w_1)]

15 Special Requirement 1: 1-sided communications. [Figure: Layer 3 (w_3) labeled "Previous computations", with (w_3, s_0) and the parameter server; Layer 2 (w_2) labeled "Current computations"; Layer 1 (w_1)] The GPI-2 PGAS model makes this easy to implement. Cons: high memory requirements due to zero-copy

16 Special requirement 2: Direct data transfer from GPU. [Figure: two nodes, each with CPU, GPU, and NIC] Extra hop in CPU memory: extra latency; lower bandwidth; more complex pipelining

17 Special requirement 2: Direct data transfer from GPU. GPI-2 leverages GPUDirect RDMA, which allows the CPU to move data from the GPU to the NIC. [Figure: two nodes, each with CPU, GPU, and NIC]

18 Reducing network traffic via smart compression. Observation 1: many zero or close-to-zero values are sent in updates. [Plot: fully connected layer, AlexNet, iteration #10; annotation: 40% values]

19 Reducing network traffic via smart compression. Observation 2: updates become sparser toward convergence. [Plot: fully connected layer, AlexNet, iteration #100; annotation: 95% values]
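A measurement of the kind behind these two observations could be sketched as follows: count the share of entries in an update vector whose magnitude falls at or below a small threshold. The function name `near_zero_fraction`, the threshold `eps`, and the sample vector are assumptions for illustration; the slides do not say how the plotted percentages were computed.

```python
def near_zero_fraction(update, eps=1e-3):
    """Fraction of entries whose absolute value is at most eps (illustrative)."""
    if not update:
        return 0.0
    return sum(1 for v in update if abs(v) <= eps) / len(update)

# Hypothetical update vector: three of its five entries are zero or nearly zero.
update = [0.0, 0.5, 1e-4, -2.0, 0.0]
fraction = near_zero_fraction(update)
```

Entries counted here are the ones a sparsity-aware scheme could skip instead of transmitting.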

20 Special requirement 3: Predicated send. send(vector, len, F()): foreach i < len: if (F(vector[i])) send(vector[i])

21 Problem with predicated send on GPUs. send(vector, len, F()): foreach i < len: if (F(vector[i])) send(vector[i]). Must be done on the GPU, close to the data; must be done on the CPU because there is no GPU send

22 Problem with predicated send on GPUs. send(vector, len, F()): foreach i < len: if (F(vector[i])) send(vector[i]). Must be done on the GPU, close to the data. Goal: enable GPU networking
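The predicated-send pseudocode on the slides above can be made concrete with a small CPU-side sketch. It is not the GPI-2 or GPUrdma API: `predicated_send`, `apply_update`, and the callback-style `send` are hypothetical names, and transmitting the index together with the value is an added assumption so that the receiver knows where each value belongs.

```python
def predicated_send(vector, predicate, send):
    """Call send((i, value)) only for elements that satisfy the predicate."""
    sent = 0
    for i, value in enumerate(vector):
        if predicate(value):
            send((i, value))                 # index travels with the value (assumption)
            sent += 1
    return sent

def apply_update(dense, message):
    """Receiver side: write one (index, value) message into a dense buffer."""
    i, value = message
    dense[i] = value

update = [0.0, 0.5, 1e-4, -2.0, 0.0]
received = [0.0] * len(update)               # stands in for the remote buffer
sent = predicated_send(update,
                       lambda v: abs(v) > 1e-3,            # F(): keep significant values
                       lambda msg: apply_update(received, msg))
```

Only two of the five entries are transmitted here. On a GPU, the predicate would run next to the data, which is the motivation the slides give for issuing the send from the GPU as well.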

23 GPUrdma and GPU-side networking. Daoud, Wated, Silberstein, "GPUrdma: GPU-side library for high performance networking from GPU kernels", ROSS16

24 GPUrdma enables networking from GPU without CPU

25 GPUrdma is faster for small messages. [Plot: GPU vs. CPU; annotation: x3]

26 GPU-to-GPU GPU-side GPI-2. Preliminary results: 52Gbit/s max throughput vs. 38Gbit/s GPI usec one-way latency; 4x performance on toy applications; ongoing collaboration with NVIDIA and Mellanox


27 Conclusions: Improved scalability of distributed SGD requires: careful communication-computation overlap via one-sided communications; optimized GPU-NIC data path; smart sparsity-aware data compression; GPU-side networking. Thank you!
