Parallel Stochastic Gradient Descent: The case for native GPU-side GPI

J. Keuper, Competence Center High Performance Computing, Fraunhofer ITWM, Kaiserslautern, Germany
Mark Silberstein, Accelerated Computer Systems Lab, Technion, Israel Institute of Technology

Accelerated Systems Lab: operating system support for accelerators
- GPU file system layer
- GPU networking API
- GPU virtual memory and huge data sets
- OS support for optimized GPU-SSD transfers
- GPU RDMA
- FPGA-CPU SoCs, near-data I/O accelerators
- SGX, accelerator security
https://sites.google.com/site/silbersteinmark/

Outline
- Stochastic Gradient Descent in a nutshell
- Data-parallel distributed algorithm
- GPI implementation
- Communication bottlenecks and sparsity
- GPU-native GPI-2 communications

SGD in a nutshell
- Common technique for training ML models
- Optimizes a loss function by iteratively modifying the model's parameters
[Figure: computations of one SGD iteration]
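
In symbols (notation mine, not from the slide): one SGD step with learning rate \eta on a mini-batch B_t moves the parameters w against the averaged gradient of the loss L,

    w_{t+1} = w_t - \eta \, \frac{1}{|B_t|} \sum_{x \in B_t} \nabla_w L(w_t; x).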

Parallelization
- Across iterations, SGD is intrinsically sequential
- Within an iteration, the computations are data parallel
[Figure: sequential chain of iterations vs. data-parallel computations within an iteration]

Parallelization via Master/Worker
[Figure: a parameter server sends the current parameters w to two workers; each worker computes an update on its sub-batch, (w, s_0) and (w, s_1), and sends it back to the server]
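
The aggregation in the figure can be written as follows (my notation, assuming K workers and a synchronous step): each worker k computes the gradient of the loss on its sub-batch s_k against the same parameters w it received from the server, and the server averages and applies these gradients,

    w \leftarrow w - \frac{\eta}{K} \sum_{k=0}^{K-1} \nabla_w L(w; s_k).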

SGD with Deep Neural Networks
[Figure: an n-layer network (layer 1 with weights w_1 through layer n with weights w_n); the forward pass runs up through the layers to the network output, the backward pass runs back down]

GPUs are used for the computations
[Figure: the same forward/backward pass diagram, with the per-layer computations running on the GPU]

Data-parallel SGD with DNNs
[Figure: three model replicas, each with layers 1..n and weights w_1..w_n, train in parallel and exchange parameters with a central parameter server]

Problem: scalability limit
[Plot: time until convergence vs. number of machines]

Communication bottleneck
Communication grows linearly with the number of nodes!
[Plot: communication time per node vs. number of nodes]
Keuper, Pfreundt, "Distributed Training of Deep Neural Networks: Theoretical and Practical Limits of Parallel Scalability", to appear at MLHPC'16
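
A rough, illustrative model of why the curve bends (my assumptions, not a result from the cited paper): with N nodes, a model of |w| bytes, a per-iteration compute time T_comp that shrinks as the batch is split, and an aggregation point of effective bandwidth B that must receive and redistribute the full model every iteration,

    T_{iter} \approx \frac{T_{comp}}{N} + \frac{2 N |w|}{B},

so beyond some N the linearly growing communication term dominates and adding machines no longer reduces the time until convergence.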

Steps toward improved scalability
- Asynchronous, zero-copy I/O
- Direct transfer from/to GPU memory (GPI-2)
- Sparsity-aware compressed communications
- GPU-side networking (ongoing research)

Background: GPI

Special requirement 1: one-sided communications
[Figure: while the current computations run on layer 2 (w_2), the update (w_3, s_0) from the previous computations on layer 3 (w_3) is already on its way to the parameter server]

Special requirement 1: one-sided communications (continued)
[Figure: the same layer-wise overlap of updates with ongoing computations]
The GPI-2 PGAS model makes this easy to implement.
Cons: high memory requirements due to zero-copy.
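
As a concrete, simplified illustration of the one-sided pattern with GPI-2 (an assumption-laden sketch, not the implementation from the talk; push_update and wait_for_update are made-up names, and segment ids and offsets are placeholders):

    #include <GASPI.h>

    /* Sketch of the one-sided pattern with GPI-2. Assumes gaspi_proc_init()
       has been called and segment 0 (the zero-copy parameter/update buffer)
       has been created on all ranks with gaspi_segment_create(). */

    /* Worker side: push a gradient block into the parameter server's segment
       with a notified write. No receive is posted on the target, which is
       what lets communication overlap the next layer's computation. */
    static void push_update(gaspi_rank_t server, gaspi_offset_t local_off,
                            gaspi_offset_t remote_off, gaspi_size_t bytes,
                            gaspi_notification_id_t notif_id)
    {
        gaspi_write_notify(0 /* local segment  */, local_off, server,
                           0 /* remote segment */, remote_off, bytes,
                           notif_id, 1 /* notification value */,
                           0 /* queue */, GASPI_BLOCK);
        gaspi_wait(0, GASPI_BLOCK);           /* local completion only */
    }

    /* Parameter server side: discover incoming updates via notifications
       instead of matching receives. */
    static void wait_for_update(gaspi_notification_id_t first, gaspi_number_t num)
    {
        gaspi_notification_id_t got;
        gaspi_notification_t    val;
        gaspi_notify_waitsome(0, first, num, &got, GASPI_BLOCK);
        gaspi_notify_reset(0, got, &val);
        /* ...apply the update written at the offset associated with 'got'... */
    }

The point of the pattern is that only the initiator is involved in each transfer, so a worker can keep computing on the next layer while earlier layers' updates are still in flight.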

Special requirement 2: direct data transfer from GPU
[Figure: data path GPU -> CPU -> NIC -> NIC -> CPU -> GPU]
The extra hop through CPU memory adds latency, lowers bandwidth, and requires more complex pipelining.

Special requirement 2: direct data transfer from GPU (continued)
GPI-2 leverages GPUDirect RDMA, which allows the CPU to move data directly from GPU memory to the NIC.
[Figure: data path GPU -> NIC -> NIC -> GPU, bypassing CPU memory]

Reducing network traffic via smart compression
Observation 1: many zero or near-zero values are sent in the updates (about 40% of the values).
[Histogram: fully connected layer of AlexNet, iteration #10]

Reducing network traffic via smart compression (continued)
Observation 2: updates become sparser toward convergence (about 95% of the values).
[Histogram: fully connected layer of AlexNet, iteration #100]

Special requirement 3: predicated send

    send(vector, len, F()):
        foreach i < len:
            if F(vector[i]):
                send(vector[i])

[Example: from the update vector (0.5, 0.01, 0.0, 1e-6, 0.1, 0.9) only the entries selected by the predicate F are sent]
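
The GPU-side half of this operation, selecting which entries to send, is easy to express as a CUDA kernel. The sketch below (sparsify_update and its arguments are illustrative names, not from the talk, and the predicate is a simple magnitude threshold) compacts the significant entries into (index, value) pairs that a send path could ship instead of the dense vector:

    #include <cuda_runtime.h>
    #include <math.h>
    #include <stdio.h>

    /* Predicate + compaction: keep only entries whose magnitude exceeds a
       threshold, writing them out as (index, value) pairs. Output order is
       not preserved, which is acceptable for a sparse parameter update. */
    __global__ void sparsify_update(const float *grad, int len, float threshold,
                                    int *out_idx, float *out_val, int *out_count)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < len && fabsf(grad[i]) > threshold) {
            int slot = atomicAdd(out_count, 1);   /* reserve an output slot */
            out_idx[slot] = i;
            out_val[slot] = grad[i];
        }
    }

    int main(void)
    {
        const int len = 1 << 20;
        float *grad, *val; int *idx, *count;
        cudaMallocManaged(&grad,  len * sizeof(float));
        cudaMallocManaged(&val,   len * sizeof(float));
        cudaMallocManaged(&idx,   len * sizeof(int));
        cudaMallocManaged(&count, sizeof(int));
        for (int i = 0; i < len; ++i) grad[i] = (i % 10) ? 0.0f : 0.5f;  /* toy data */
        *count = 0;

        sparsify_update<<<(len + 255) / 256, 256>>>(grad, len, 1e-3f, idx, val, count);
        cudaDeviceSynchronize();

        /* Only *count (index, value) pairs would need to be transmitted. */
        printf("kept %d of %d values\n", *count, len);
        return 0;
    }

With today's stack the compacted buffer would still have to be handed back to the CPU for the actual send, which is exactly the problem the next slide raises.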

Problem with predicated send on GPUs
The predicate F must be evaluated on the GPU, close to the data, but the send must be issued by the CPU because there is no GPU-side send.

Problem with predicated send on GPUs (continued)
Goal: enable networking directly from the GPU.

GPUrdma and GPU-side networking
Daoud, Wated, Silberstein, "GPUrdma: GPU-side library for high performance networking from GPU kernels", ROSS'16

GPUrdma enables networking from the GPU without involving the CPU.

GPUrdma is faster for small messages
[Plot: GPU-side vs. CPU-side networking performance; the GPU path is roughly 3x faster for small messages]

GPU-side GPI-2: preliminary results
- 52 Gbit/s maximum GPU-to-GPU throughput vs. 38 Gbit/s with standard GPI-2
- 4.5 usec one-way latency
- 4x performance on toy applications
- Ongoing collaboration with NVIDIA and Mellanox

Conclusions
Improved scalability of distributed SGD requires:
- Careful communication-computation overlap via one-sided communications
- An optimized GPU-NIC data path
- Smart, sparsity-aware data compression
- GPU-side networking

Thank you!
https://sites.google.com/site/silbersteinmark/
mark@ee.technion.ac.il