Asynchronous Stochastic Gradient Descent on GPU: Is It Really Better than CPU?

1 Asynchronous Stochastic Gradient Descent on GPU: Is It Really Better than CPU?
Florin Rusu; Yujing Ma, Martin Torres (Ph.D. students)
University of California Merced

2 Machine Learning (ML) Boom
Two SIGMOD 2017 tutorials

3 ML Systems
General purpose (databases): BIDMach, Bismarck, Cumulon, DeepDive, DimmWitted, GLADE, GraphLab, MADlib, Mahout, MLlib (MLbase), SimSQL (BUDS), SystemML, Vowpal Wabbit
Deep learning: Caffe (con Troll), CNTK, DL4J, Keras, MXNet, SINGA, TensorFlow, Theano, Torch

4 ML Hardware Accelerators

5 ML Systems with GPU Acceleration
General purpose: BIDMach, Bismarck, Cumulon, DeepDive, DimmWitted, GLADE, GraphLab, MADlib, Mahout, MLlib (MLbase), SimSQL (BUDS), SystemML, Vowpal Wabbit
Deep learning: Caffe, CNTK, DL4J, Keras, MXNet, SINGA, TensorFlow, Theano, Torch

6 ML in Databases
It is not so much about deep learning
  Regression (linear, logistic)
  Classification (SVM)
  Recommendation (LMF)
Mostly about training
  Inside DB, close to data
  Over joins or factorized databases
  Compressed data, (compressed) large models
Selection of optimization algorithm and hyperparameters
  BGD vs. SGD vs. SCD

7 Classification Tasks
Logistic regression (LR)
Support Vector Machines (SVM)
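
As a rough illustration of what the two tasks compute per training example, the sketch below gives the gradient of the logistic loss and a subgradient of the hinge loss. It assumes labels in {-1, +1}, dense feature lists, and no regularization term; the function names are hypothetical and are not taken from the slides.

```python
import math


def dot(w, x):
    """Inner product of two equal-length dense vectors."""
    return sum(wi * xi for wi, xi in zip(w, x))


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def lr_gradient(w, x, y):
    """Gradient of log(1 + exp(-y * w.x)) with respect to w, for y in {-1, +1}."""
    margin = y * dot(w, x)
    scale = -y * sigmoid(-margin)
    return [scale * xi for xi in x]


def svm_gradient(w, x, y):
    """Subgradient of the hinge loss max(0, 1 - y * w.x) with respect to w."""
    if y * dot(w, x) < 1.0:
        return [-y * xi for xi in x]
    return [0.0] * len(x)
```

Both functions touch only the features of one example, which is why sparse data turns the computation into sparse vector products.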

8 Datasets and Platforms
S. Sallinen et al., "High Performance Parallel Stochastic Gradient Descent in Shared Memory," in IPDPS
CPU: Intel Xeon E (14 cores, 28 threads)
GPU: Tesla K80 (use only one multiprocessor)

9 Experiments
Stochastic gradient descent (SGD) optimizer: mini-batch with 4096 batch size
Average time per iteration over 100 iterations (measure only the iteration time)
TensorFlow and MXNet support only dense data: covtype and w8a are densified; others do not fit in GPU memory
[Figure panels: LR, SVM]
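
The measurement described on this slide can be sketched as follows: a mini-batch SGD step for logistic regression, timed so that only the iteration itself is measured and then averaged. This is a minimal pure-Python sketch, not the TensorFlow or MXNet setup used in the experiments. It assumes that one iteration is one pass over the data, that labels are in {-1, +1}, and that the step size is fixed; all names are hypothetical.

```python
import math
import time


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def minibatch_step(w, batch, step_size):
    """One mini-batch update for logistic regression; batch is a list of (x, y)."""
    grad = [0.0] * len(w)
    for x, y in batch:
        margin = y * sum(wi * xi for wi, xi in zip(w, x))
        scale = -y * sigmoid(-margin)
        for j, xj in enumerate(x):
            grad[j] += scale * xj
    # Average the gradient over the batch before applying it.
    return [wj - step_size * gj / len(batch) for wj, gj in zip(w, grad)]


def average_iteration_time(data, dim, batch_size=4096, iterations=100, step_size=0.1):
    """Run `iterations` passes over `data`; return the model and mean seconds per pass."""
    w = [0.0] * dim
    elapsed = 0.0
    for _ in range(iterations):
        start = time.perf_counter()  # time only the iteration itself
        for i in range(0, len(data), batch_size):
            w = minibatch_step(w, data[i:i + batch_size], step_size)
        elapsed += time.perf_counter() - start
    return w, elapsed / iterations
```

Data loading and any loss evaluation stay outside the timed region, which matches "measure only the iteration time".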

10 Research Questions
Why is GPU not significantly better than CPU on LR and SVM models?
  The gain in deep nets seems to come mostly from convolutions, not gradient computations
  Sparse Matrix-Vector (SpMV) and Sparse Matrix-Matrix (SpMM) are harder to optimize
Can we improve the GPU performance?

11 Gradient Descent

12 (Mini-)Batch Gradient Descent (BGD)

13 Parallel BGD
Parallel execution on CPU and GPU
Synchronous execution on CPU

14 Stochastic Gradient Descent (SGD)

15 Parallel SGD (Hogwild)
No synchronization or locks

16 BGD vs. SGD

17 GPU Architecture
Tesla K80 (GK210)
# MP = 13
# cores/MP = 192
# warps/MP = 64
# blocks/MP = 16
# threads/MP = 2048
# threads/warp (SIMD) = 32
# threads/block = 1024
# registers/MP = 2^17
# registers/block = 2^16
# registers/thread = 255
Shared mem/MP = 112KB
Shared mem/block = 48KB
L1 cache = 48KB
Read-only texture = 48KB
L2 cache = 1.5MB
Global mem = 12GB
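
The limits on this slide interact: the number of blocks that can be resident on one multiprocessor is bounded by whichever resource runs out first. The sketch below works through that arithmetic using only the numbers listed above. It is a simplification that ignores allocation granularity for registers and shared memory, and the function names are hypothetical.

```python
# Limits copied from the GPU Architecture slide (Tesla K80, GK210).
THREADS_PER_WARP = 32
MAX_THREADS_PER_BLOCK = 1024
MAX_WARPS_PER_MP = 64
MAX_BLOCKS_PER_MP = 16
REGISTERS_PER_MP = 2 ** 17
REGISTERS_PER_BLOCK = 2 ** 16
MAX_REGISTERS_PER_THREAD = 255
SHARED_BYTES_PER_MP = 112 * 1024
SHARED_BYTES_PER_BLOCK = 48 * 1024


def warps_in_block(threads_per_block):
    """Number of warps needed for a block (rounded up)."""
    return -(-threads_per_block // THREADS_PER_WARP)


def resident_blocks_per_mp(threads_per_block, registers_per_thread, shared_bytes_per_block=0):
    """Blocks that fit on one multiprocessor under the slide's limits (simplified)."""
    if not 0 < threads_per_block <= MAX_THREADS_PER_BLOCK:
        raise ValueError("threads per block out of range")
    if not 0 < registers_per_thread <= MAX_REGISTERS_PER_THREAD:
        raise ValueError("registers per thread out of range")
    warps = warps_in_block(threads_per_block)
    registers_per_block = registers_per_thread * warps * THREADS_PER_WARP
    if registers_per_block > REGISTERS_PER_BLOCK:
        raise ValueError("block needs more registers than one block may use")
    if shared_bytes_per_block > SHARED_BYTES_PER_BLOCK:
        raise ValueError("block needs more shared memory than one block may use")
    limits = [
        MAX_BLOCKS_PER_MP,
        MAX_WARPS_PER_MP // warps,
        REGISTERS_PER_MP // registers_per_block,
    ]
    if shared_bytes_per_block > 0:
        limits.append(SHARED_BYTES_PER_MP // shared_bytes_per_block)
    return min(limits)


def occupancy(threads_per_block, registers_per_thread, shared_bytes_per_block=0):
    """Fraction of the multiprocessor's warp slots that are occupied."""
    blocks = resident_blocks_per_mp(threads_per_block, registers_per_thread, shared_bytes_per_block)
    return blocks * warps_in_block(threads_per_block) / MAX_WARPS_PER_MP
```

For example, blocks of 256 threads that each use the full 48KB of shared memory are limited to two resident blocks, so only a quarter of the warp slots are used. This is the kind of trade-off behind storing the model per block.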

18 Map Hogwild to GPU
Algorithm
1. Copy data and model to GPU
2. While not converged do
   2.1. Execute kernel update_model that implements Hogwild
3. End while
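
The control flow of this algorithm can be sketched on the host side as below. This is not GPU code: Python threads stand in for GPU threads (and CPython's interpreter lock means they do not truly run in parallel), the copy to the GPU is replaced by plain in-memory lists, and the convergence test on the relative loss change is an assumption, since the slide does not define one. Only the name update_model comes from the slide; everything else is hypothetical. Examples are sparse lists of (index, value) pairs with labels in {-1, +1}.

```python
import math
import threading


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def loss(w, data):
    """Mean logistic loss over sparse examples [(index, value), ...] with labels y."""
    total = 0.0
    for x, y in data:
        margin = y * sum(w[j] * xj for j, xj in x)
        if margin > 0:
            total += math.log1p(math.exp(-margin))
        else:
            total += -margin + math.log1p(math.exp(margin))
    return total / len(data)


def update_model(w, partitions, step_size):
    """One pass of Hogwild: workers update the shared model w with no locks."""
    def worker(examples):
        for x, y in examples:
            margin = y * sum(w[j] * xj for j, xj in x)
            scale = -y * sigmoid(-margin)
            for j, xj in x:
                w[j] -= step_size * scale * xj  # unsynchronized write to shared model

    threads = [threading.Thread(target=worker, args=(p,)) for p in partitions]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


def hogwild(data, dim, num_workers=4, step_size=0.1, tolerance=1e-4, max_iterations=100):
    """Host loop: set up data and model, then repeat update_model until converged."""
    w = [0.0] * dim  # stand-in for step 1 (copy data and model to GPU)
    partitions = [data[k::num_workers] for k in range(num_workers)]
    previous = loss(w, data)
    iteration = 0
    for iteration in range(1, max_iterations + 1):
        update_model(w, partitions, step_size)
        current = loss(w, data)
        # Assumed stopping rule: relative loss change below tolerance.
        if abs(previous - current) <= tolerance * max(previous, 1e-12):
            break
        previous = current
    return w, iteration
```

Because workers read and write the model without coordination, some updates may overwrite others. Hogwild accepts this in exchange for removing synchronization cost.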

19 Design Space
Data access
  Storage scheme: row-store, column-store
  Partitioning: round-robin, chunking
Data replication
  Number of threads accessing an example: 1-way, K-way
Model replication
  Where the model is stored in the GPU memory hierarchy
    Per thread (registers)
    Per block (shared memory)
    Per kernel (global memory)
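
The two partitioning options in the design space can be illustrated by the example indices each thread would process. The sketch below is one plausible reading of "round-robin" and "chunking", not the authors' implementation, and the function names are hypothetical.

```python
def round_robin(num_examples, num_threads):
    """Thread t gets examples t, t + num_threads, t + 2 * num_threads, ..."""
    return [list(range(t, num_examples, num_threads)) for t in range(num_threads)]


def chunking(num_examples, num_threads):
    """Thread t gets one contiguous chunk of examples."""
    chunk = -(-num_examples // num_threads)  # ceiling division
    return [
        list(range(t * chunk, min((t + 1) * chunk, num_examples)))
        for t in range(num_threads)
    ]
```

With 8 examples and 4 threads, round-robin assigns [0, 4], [1, 5], [2, 6], [3, 7], so neighboring threads read neighboring examples at each step, while chunking assigns [0, 1], [2, 3], [4, 5], [6, 7], so each thread walks its own contiguous range.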

20 Evaluation Metrics
DimmWitted by Zhang and Re in PVLDB 2014
Hardware efficiency: time to convergence
Statistical efficiency: number of iterations to convergence

21 Data Access Storage Scheme

22 Data Access Partitioning

23 Evaluation: Dense Data (covtype)
[Figure panels: hardware efficiency, statistical efficiency]

24 Evaluation: Sparse Data (news)
[Figure panels: hardware efficiency, statistical efficiency]

25 Data Replication

26 Evaluation: Dense Data (covtype)
[Figure panels: hardware efficiency, statistical efficiency]

27 Evaluation: Sparse Data (news)
[Figure panels: hardware efficiency, statistical efficiency]

28 Model Replication PerThread

29 Model Replication PerBlock

30 Model Replication PerKernel

31 Evaluation: Dense Data (covtype)
[Figure panels: hardware efficiency, statistical efficiency]

32 Evaluation: Sparse Data (news)
[Figure panels: hardware efficiency, statistical efficiency]

33 Comparison with Synchronous SGD

34 CPU vs. GPU

35 Conclusions
Synchronous mini-batch in deep learning systems is rarely faster in convergence on GPU than on CPU
Asynchronous SGD on GPU is always faster in time per iteration than synchronous mini-batch on GPU
Asynchronous SGD on GPU is sometimes faster in convergence than asynchronous SGD on CPU

36 Thank you. Questions???

Stochastic Gradient Descent on Highly-Parallel Architectures

Yujing Ma, Florin Rusu, Martin Torres
University of California Merced
{yma33, frusu, mtorres58}@ucmerced.edu
February 28 arxiv:82.88v [cs.db]