CaffeGPI: Single-Sided Communication for Scalable Deep Learning


1 CaffeGPI: Single-Sided Communication for Scalable Deep Learning
Janis Keuper, itwm.fraunhofer.de/ml
Competence Center High Performance Computing, Fraunhofer ITWM, Kaiserslautern, Germany

2 Deep Neural Networks in a Nutshell
At an abstract level, DNNs are:
- directed (acyclic) graphs
- nodes are compute entities (= layers)
- edges define the data flow through the graph
Inference / training: forward feed of data through the network.
[Diagram: layer stack, Layer 1 through Layer 6]

3 Deep Neural Networks in a Nutshell
At an abstract level, DNNs are:
- directed (acyclic) graphs
- nodes are compute entities (= layers)
- edges define the data flow through the graph
Inference / training: forward feed of data through the network.
[Diagram, "common intuition": layer stack, Layer 1, Layer 2, ..., Layer n-1, Layer n]

4 Training Deep Neural Networks: The Underlying Optimization Problem
Minimize the loss function via gradient descent (high-dimensional and NON-CONVEX!).
The gradient is computed via the back-propagation algorithm:
1. feed forward and compute the activations
2. compute the error by layer
3. compute the derivative by layer
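For reference, the optimization problem and the plain gradient-descent update can be written compactly as follows (a minimal formulation with generic symbols W, X, and learning rate η, not taken verbatim from the slides):

```latex
% Empirical loss over the weights W for a batch X of labelled samples (x, y)
\min_{W} \; L(W) \;=\; \frac{1}{|X|} \sum_{(x,\,y) \in X} \ell\bigl(f(x; W),\, y\bigr)

% Gradient-descent update with learning rate \eta;
% \nabla_W L is obtained layer by layer via back propagation
W^{(t+1)} \;=\; W^{(t)} - \eta \, \nabla_{W} L\bigl(W^{(t)}\bigr)
```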

5 Optimization by Stochastic Gradient Descent (SGD)
1. Initialize the weights W at random.
2. Take a small random subset X (= batch) of the training data.
3. Run X through the network (forward feed).
4. Compute the loss.
5. Compute the gradient.
6. Propagate it backwards through the network.
7. Update W.
8. Repeat steps 2-7 until convergence.
[Diagram: forward and backward pass through Layer 1, Layer 2, ..., Layer n-1, Layer n]
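A minimal Python sketch of this loop, assuming hypothetical helpers forward, loss_fn, and backward plus a numpy array of training samples (illustrative pseudocode, not the CaffeGPI implementation):

```python
import numpy as np

def sgd(train_data, W, forward, loss_fn, backward, lr=0.01, batch_size=256, steps=10000):
    """Plain mini-batch SGD, mirroring steps 1-8 on the slide."""
    rng = np.random.default_rng(0)
    for _ in range(steps):                                          # 8. repeat until convergence
        X = rng.choice(train_data, size=batch_size, replace=False)  # 2. small random batch
        activations = forward(W, X)                                 # 3. forward feed
        loss = loss_fn(activations, X)                              # 4. loss (e.g. for monitoring)
        grads = backward(W, activations, X)                         # 5.-6. gradient via backward pass
        W = [w - lr * g for w, g in zip(W, grads)]                  # 7. update W
    return W
```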

6 Parallelization: Common Approaches to Parallelize SGD for DL
Parallelization of SGD is very hard: it is an inherently sequential algorithm.
1. Start at some state t (a point in a billion-dimensional space).
2. Introduce state t to data batch d1.
3. Compute an update (based on the objective function).
4. Apply the update to reach state t+1.
How to gain speedup? Make faster updates, or make larger updates.
[Diagram: states t, t+1, t+2 produced from batches d1, d2, d3]

7 Parallelization: Common Approaches to Parallelize SGD for DL
Internal parallelization = parallel execution of the layer operation.
- Mostly dense matrix multiplication: standard BLAS sgemm (MKL, OpenBLAS; cuBLAS on GPUs)
- Task parallelization for special layers (e.g. Cuda-CNN for fast convolutions)
[Diagram: forward and backward pass through Layer 1, ..., Layer n]
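For illustration, the forward pass of a fully connected layer is exactly such a dense matrix multiplication; in the numpy sketch below the matmul is dispatched to whatever BLAS numpy is linked against (e.g. MKL or OpenBLAS). Sizes are arbitrary example values:

```python
import numpy as np

batch, n_in, n_out = 256, 4096, 4096
X = np.random.rand(batch, n_in).astype(np.float32)  # activations from the previous layer
W = np.random.rand(n_in, n_out).astype(np.float32)  # layer weights
b = np.zeros(n_out, dtype=np.float32)                # bias

# One fully connected forward step: a single sgemm-sized matmul, plus bias and ReLU.
Y = np.maximum(X @ W + b, 0.0)
```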

8 Parallelization: Common Approaches to Parallelize SGD for DL
External: data parallelization over the data batch.
Master: 1. Split the batch and send the parts to the workers.
[Diagram: three workers, each running the forward pass through Layer 1, ..., Layer n]

9 Parallelization: Common Approaches to Parallelize SGD for DL
External: data parallelization over the data batch.
Master: 1. Split the batch and send the parts to the workers.
2. Workers compute the forward + backward passes and send their gradients to the master.
[Diagram: three workers, each running forward and backward passes]

10 Parallelization: Common Approaches to Parallelize SGD for DL
External: data parallelization over the data batch.
Master: 1. Split the batch and send the parts to the workers.
2. Workers compute the forward + backward passes and send their gradients to the master.
3. The master combines the gradients, computes the update of W, and sends the new W' to the workers.
[Diagram: three workers, each running forward and backward passes]

11 Parallelization: Common Approaches to Parallelize SGD for DL
External: data parallelization over the data batch.
Master: 1. Split the batch and send the parts to the workers.
2. Workers compute the forward + backward passes and send their gradients to the master.
3. The master combines the gradients, computes the update of W, and sends the new W' to the workers.
→ Speedup. (A schematic sketch of one such step follows below.)
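A schematic single-process sketch of one synchronous data-parallel step (the batch is split with numpy; forward_backward and apply_update are hypothetical helpers). This mirrors the master/worker scheme of slides 8-11, not CaffeGPI's actual code:

```python
import numpy as np

def sync_data_parallel_step(W, batch, n_workers, forward_backward, apply_update, lr=0.01):
    """One synchronous data-parallel SGD step: split, compute, combine, update."""
    shards = np.array_split(batch, n_workers)                      # 1. split the batch
    grads = [forward_backward(W, shard) for shard in shards]       # 2. forward + backward per worker
    mean_grad = [sum(layer) / n_workers for layer in zip(*grads)]  # 3a. master combines gradients
    return apply_update(W, mean_grad, lr)                          # 3b. compute the update, new W'
```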

12 Outline
I Overview: distributed parallel training of DNNs
II Limits of Scalability
- Limitation I: communication bounds
- Limitation II: skinny matrix multiplication (a.k.a. the large-batch problem)
- Limitation III: data I/O
Based on our paper from SC '16.


14 Limitation I
Distributed SGD is heavily communication bound:
- gradients have the same size as the model
- the model size can be hundreds of MB
- iteration time (GPU) < 1 s
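A rough back-of-the-envelope calculation shows why this becomes a bottleneck (model size, iteration time, and worker count below are example values in the spirit of the slide, not measurements):

```python
model_size_mb = 250      # e.g. an AlexNet-sized model, "hundreds of MB"
iteration_time_s = 0.5   # per-iteration compute time on a GPU, < 1 s
n_workers = 4

# Per iteration, every worker sends model-sized gradients and receives the new model.
traffic_mb = 2 * model_size_mb * n_workers
bandwidth_gbit_s = traffic_mb * 8 / 1000 / iteration_time_s
print(bandwidth_gbit_s)  # -> 32 Gbit/s at the master, just to keep 4 GPUs busy
```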

15 Communication Bound: Experimental Evaluation

16 Solving the Communication Bottleneck
Solutions in hardware, example NVLink: NVIDIA DGX-1 with 8x P100.
- NVLink between all GPUs
- NVLink spec: ~40 GB/s (single direction)
- PCIe v3 x16: ~15 GB/s
[Diagram: DGX-1, by NVIDIA]

17 Solving the Communication Bottleneck
Solutions in hardware, NVLink benchmark: IBM Minsky.
- 4x P100, 2x 10-core Power8 (160 hardware threads)
- NVLink between all components
- NVLink spec: ~40 GB/s (single direction)

18 Solving the Communication Bottleneck
NVLink benchmark on the Minsky: deep learning with the AlexNet topology, NVIDIA-Caffe on ImageNet data (batch_size 1024).
[Plot: speedup vs. number of GPUs used; linear scaling and theoretical speedup limit vs. actual speedup, IBM Minsky (NVLink) vs. PCIe-attached GPUs]
Hardware: 4x P100, 2x 10-core Power8 (160 hardware threads), NVLink between all components, NVLink spec ~40 GB/s (single direction).

19 Solving the Communication Bottleneck: Possible Algorithmic Solutions
- Network design (more on this later): avoid fully connected layers for smaller models (see GoogLeNet vs. AlexNet)
- Reduce the model size: reduce floating-point precision (8 bit is enough)
- Reduce / avoid communication: sparse updates, compression, asynchronous updates (see the quantization sketch below)
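As a toy example of the compression idea, an 8-bit linear quantization of a gradient tensor could look as follows (an illustrative sketch only; CaffeGPI's actual quantization scheme is not specified here):

```python
import numpy as np

def quantize_8bit(grad):
    """Linearly quantize a float32 gradient tensor to int8 plus a per-tensor scale."""
    scale = float(np.abs(grad).max()) / 127.0 or 1.0               # avoid a zero scale
    q = np.clip(np.round(grad / scale), -127, 127).astype(np.int8)
    return q, scale                                                # 4x less data on the wire

def dequantize_8bit(q, scale):
    return q.astype(np.float32) * scale
```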

20 CaffeGPI: Distributed Synchronous SGD Parallelization with an Asynchronous Communication Overlay
- Better scalability through asynchronous PGAS programming of the optimization algorithms with GPI-2
- Direct RDMA access to main and GPU memory instead of message passing
- Optimized data layers for distributed file systems
(A pseudocode sketch of the communication overlay follows below.)
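The communication-overlay idea can be sketched in Python-flavoured pseudocode as follows. put_notify and wait_notify are hypothetical stand-ins for GPI-2-style one-sided primitives (the real GPI-2 API is a C interface); model, layer, and comm are likewise placeholders:

```python
def backward_with_overlay(model, activations, comm, master_rank=0):
    """Backward pass with per-layer one-sided gradient writes overlapped with compute."""
    handles = []
    # Walk the layers from the top: as soon as a layer's gradient is ready,
    # write it directly into the master's memory segment and keep computing.
    for layer in reversed(model.layers):
        grad = layer.backward(activations)
        handles.append(comm.put_notify(grad, target_rank=master_rank, segment=layer.name))
    for h in handles:
        comm.wait_notify(h)  # ensure all one-sided writes have completed before the update
```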

21 CC-HPC Current Projects. CaffeGPI: Approaching DNN Communication
- Communication reduction tree
- Communication quantization
- Communication overlay
- Direct GPU-to-GPU RDMA, based on GPI
- Optimized distributed data layer
(A reduction-tree sketch follows below.)
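The reduction-tree idea, sketched with a generic send/recv-style interface (ranks and comm are assumptions for illustration, not the CaffeGPI implementation):

```python
def tree_reduce(rank, size, grad, comm):
    """Binary-tree reduction of gradients towards rank 0 in log2(size) communication steps."""
    step = 1
    while step < size:
        if rank % (2 * step) == 0:
            partner = rank + step
            if partner < size:
                grad = grad + comm.recv(source=partner)  # combine the partner's gradients
        else:
            comm.send(grad, dest=rank - step)            # hand the gradients up the tree
            break
        step *= 2
    return grad  # rank 0 ends up holding the fully reduced gradient
```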

22 CaffeGPI: Open Source

23 Projects: Low-Cost Deep Learning Setup, Built on CaffeGPI
Price: < 8000 EUR, standard components.
Two workstations running CaffeGPI, each with two GTX 1080 GPUs (GPU interconnect: PCIe), two FDR InfiniBand ports (IB:0, IB:1), and an SSD.
Combined specs: 32 GB GPU memory, 64 GB PGAS memory, 2 TB BeeGFS for training data.

24 CaffeGPI Benchmarks: Single Node
Specs: 4x K80 GPU (PCIe interconnect), CUDA 8, cuDNN 5.1.
Topology: AlexNet. Batch size (global): 1024.

25 Low-Cost Deep Learning Setup, Built on CaffeGPI
[Plots: time to convergence and scaling; training time in hours and speedup vs. number of GPUs (up to 8), DGX-1 with Caffe-NV vs. the ITWM setup with CaffeGPI]
Specs: topology GoogLeNet, CUDA 8, cuDNN 5.1, Caffe-NV 16.4; batch size per node scales with the number of GPUs.

26 CaffeGPI: Distributed Benchmarks
[Plot: distributed scaling of CaffeGPI, speedup vs. number of nodes (up to 16)]
Specs: topology GoogLeNet, batch size 64 per node, 4x K80 GPU per node, CUDA 8, cuDNN.

27 HP-DLF: High Performance Deep Learning Framework
Goals: scalable, transparent, generic, auto-parallel, elastic; automatic data flow; automatic hardware selection; portable; monitoring; simulation; new optimization methods.
[Architecture diagram: DNN graph (imported from Caffe, TensorFlow, ...), data, model, hardware topology, and supported hardware feed a compiler with a performance model; this produces a Petri net executed by a runtime system with workflow engine, scheduler, data flow, global address space, ...]

28 Optimization for HP-DLF
Based on CaffeASGD, our asynchronous Caffe implementation:
- sparse communication for multi-model optimization
- lower demands on the communication bandwidth
- superior convergence
(A minimal ASGD worker sketch follows below.)
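A minimal sketch of the asynchronous update pattern behind CaffeASGD-style training (a generic ASGD worker loop; parameter_server with pull/push methods is a hypothetical placeholder, not the CaffeASGD interface):

```python
def asgd_worker(parameter_server, data_stream, forward_backward, lr=0.01):
    """Each worker trains against a possibly stale copy of W; no global synchronization."""
    for batch in data_stream:
        W = parameter_server.pull()          # fetch the (possibly stale) current weights
        grads = forward_backward(W, batch)   # local forward + backward pass
        parameter_server.push([lr * g for g in grads])  # server applies updates as they arrive
```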

29 Discussion: What is next?
- Need to exploit sparsity!
- Support very heterogeneous HPC hardware (compute and memory).
[Cartoon by xkcd.com]
