Toward Scalable Deep Learning
1 KIISE Artificial Intelligence Society, Machine Learning Research Group: Second Deep Learning Workshop. Toward Scalable Deep Learning. Sungroh Yoon, Electrical and Computer Engineering, Seoul National University
2 Breakthrough: Big Data + Machine Learning. (Pictured: Daphne Koller, Andrew Ng.)
3 Shadow beyond the Revolution: training challenges. (Figure: MNIST test results.) Ciresan, Dan Claudiu, et al., "Deep, big, simple neural nets for handwritten digit recognition," Neural Computation (2010). T. Chilimbi et al. (OSDI 2014)
4 Machine Learning = Representation + Training. Representations: sparse structured input/output regression, nonparametric Bayesian models, graphical models, deep learning. Training.
5 Parallelism in Machine Learning. Basic form of ML: $F(D, \theta) = L(D, \theta) + r(\theta)$. Iterative-convergent update: $\theta^{(t+1)} = \theta^{(t)} + \Delta\theta(D)$. Data parallel: each worker computes $\Delta\theta(D_1), \Delta\theta(D_2), \ldots, \Delta\theta(D_n)$ on its own data shard. Model parallel: each worker computes $\Delta\theta_1(D), \Delta\theta_2(D), \ldots, \Delta\theta_m(D)$ for its own block of parameters. E. Xing & Q. Ho, 2015 KDD Tutorial
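To make the decomposition concrete, here is a minimal NumPy sketch (mine, not from the talk; the least-squares step is an illustrative stand-in for the update function Δθ) contrasting the two forms of parallelism:

```python
import numpy as np

def delta(D, theta, lr=0.01):
    # Illustrative update function: a gradient step for least squares.
    X, y = D
    return -lr * X.T @ (X @ theta - y) / len(y)

def data_parallel_step(shards, theta):
    # Each worker computes delta on its own shard D_i; updates are summed.
    return theta + sum(delta(D_i, theta) for D_i in shards)

def model_parallel_step(D, theta, blocks):
    # Each worker owns one block of parameters and updates only that block.
    full = delta(D, theta)              # conceptually computed per worker
    for idx in blocks:                  # in practice, blocks run concurrently
        theta[idx] += full[idx]
    return theta

# Usage with synthetic data:
rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 4)), rng.normal(size=100)
theta = np.zeros(4)
shards = [(X[:50], y[:50]), (X[50:], y[50:])]
theta = data_parallel_step(shards, theta)
theta = model_parallel_step((X, y), theta, blocks=[[0, 1], [2, 3]])
```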
6 Deep Learning: A New Learning Paradigm. Far more complex and larger than conventional ML models: a large number of model parameters to learn, and many (mostly simple) computations with latent variables. Requires scaling up/out of both computation and numerical optimization.
7 Dealing with the Challenges (1). Minimize computation [Bengio, 2014]: improve (reduce) the ratio # of computations / # of parameters. Extreme success story (but poor generalization): decision trees, with O(n) computations for O(2^n) parameters. Extreme unlucky story: deep neural nets, with O(n) computations for O(n) parameters. Example: conditional computation (Bengio, 2014); see the toy sketch below.
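As a concrete illustration of the conditional-computation idea (this toy sketch is mine, not Bengio's algorithm): a cheap gating network picks a few blocks of units, and only those blocks are actually computed, so computations grow much more slowly than parameters.

```python
import numpy as np

def conditional_layer(x, W, W_gate, k=2):
    # A cheap gater scores parameter blocks; only the top-k blocks are
    # computed, so far fewer computations than parameters are touched.
    n_blocks = W_gate.shape[0]
    block = W.shape[0] // n_blocks
    scores = W_gate @ x                        # gating network (cheap)
    active = np.argsort(scores)[-k:]           # indices of active blocks
    out = np.zeros(W.shape[0])
    for b in active:
        rows = slice(b * block, (b + 1) * block)
        out[rows] = np.maximum(W[rows] @ x, 0.0)   # ReLU on active blocks only
    return out

# Usage: 8 blocks of 16 units each, but only 2 blocks computed per input.
rng = np.random.default_rng(0)
x = rng.normal(size=32)
W, W_gate = rng.normal(size=(128, 32)), rng.normal(size=(8, 32))
h = conditional_layer(x, W, W_gate)
```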
8 Dealing with the Challenges (2). Scale-up approaches: run the learning workload on a co-processor/accelerator (SIMD, GPGPU, ...) for enhanced single-machine performance. Work organized in SIMD blocks yields 10-fold to 100-fold speed-ups. But stuck with memory constraints!
9 Dealing with the Challenges (3). Scale-out approaches: run the learning workload on a distributed system (GraphLab, Hadoop, Spark, ...). Can handle enormous data or model sizes by splitting the entire workload via data parallelism and model parallelism. But parameter communication issues!
10 Notable ML Platforms. Spark: RDD-based programming model; ML library (includes deep learning). GraphLab: Gather-Apply-Scatter programming model; large-scale graph mining. Petuum: key-value store + scheduler; general-purpose large-scale ML.
11 Notable ML Platforms. GPU-based (scale-up): Keras. Distributed (scale-out): DistBelief [J. Dean et al., "Large scale distributed deep networks," NIPS 2012], Project Adam [T. Chilimbi et al., "Project Adam: Building an efficient and scalable deep learning training system," OSDI 2014]
12 Recent Technological Trends DistBelief: supports both data and model parallelism J. Dean et al. "Large scale distributed deep networks," NIPS 2012
13 Recent Technological Trends. cuDNN: a GPU-accelerated library of primitives for DNNs, used by frameworks such as Caffe, Theano, etc. Ex) cuDNN v3 vs. cuDNN v2 on Caffe
14 Recent Technological Trends. DeepLearning4j: an open-source, distributed, commercial-grade DL framework. ND4J (scientific computing library for the JVM). Scalable backends: Apache Hadoop and Spark; GPUs.
15 Recent Technological Trends. Petuum: large-scale distributed machine learning. Considers both data and model parallelism. Key-value store + dynamic scheduler.
16 Recent Technological Trends. REEF (Retainable Evaluator Execution Framework): an Apache incubator project that packages a variety of data-processing libraries in a reusable form, spanning MapReduce, query, graph processing, and stream data processing.
17 Scalable Deep Learning Techniques. Examples of distributed schemes: 1) data parallelism: Hogwild! (B. Recht et al., NIPS 2011), Downpour SGD (J. Dean et al., NIPS 2012), Dogwild (C. Noel et al., 2014); 2) parameter server (M. Li et al., NIPS 2013); 3) model parallelism: STRADS (S. Lee et al., NIPS 2014); 4) acceleration with GPUs: cuda-convnet.
18 Data Parallelism. Based on the independence between data samples; leads to concurrent execution on each data shard and hence a speed-up. (Figure: a DATA matrix of samples x attributes split across Worker 1-3, followed by model aggregation.)
19 Data Parallelism: Hogwild! Asynchronous execution: don't lock, don't communicate! Each processor calculates gradients independently, and processors can overwrite each other's work. Y. Nishioka et al., "Scalable Task-Parallel SGD on Matrix Factorization in Multicore Architectures," IPDPS 2015
20 Data Parallelism: Hogwild! Guarantees a reasonable convergence rate and exploits sparsity. Achieves better performance than traditional synchronized techniques even on non-sparse problems (e.g., SVM training). B. Recht et al., "Hogwild!: A lock-free approach to parallelizing stochastic gradient descent," NIPS 2011
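A minimal single-machine sketch of the Hogwild! idea (toy code, not the authors' implementation): threads share one weight vector and apply sparse SGD updates with no locks, accepting occasional overwrites.

```python
import numpy as np
from threading import Thread

rng = np.random.default_rng(0)
w = np.zeros(1000)                       # shared parameters: no lock, ever

def make_shard(n=500, nnz=5):
    # Synthetic sparse samples: (nonzero indices, values, target).
    return [(rng.integers(0, 1000, nnz), rng.normal(size=nnz), rng.normal())
            for _ in range(n)]

def worker(shard, lr=0.01):
    for idx, x_s, y in shard:
        pred = w[idx] @ x_s              # may read other threads' fresh writes
        w[idx] -= lr * (pred - y) * x_s  # racy in-place update, by design

threads = [Thread(target=worker, args=(make_shard(),)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```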
21 Data Parallelism: Downpour SGD, Dogwild! Hogwild! was designed for shared-memory machines, so its scalability is limited. These methods extend the Hogwild! concept to distributed systems: gradients are asynchronously pushed to a master or parameter server. Ex) Downpour SGD and Dogwild! (= distributed Hogwild!). J. Dean et al., "Large scale distributed deep networks," NIPS 2012
22 Parameter Server. A widely used concept for distributed machine learning: separate servers hold the model parameters. Key features (Li et al., 2013): efficient communication, flexible consistency models, elastic scalability, fault tolerance and durability, and ease of use. M. Li et al., "Parameter server for distributed machine learning," Big Learning NIPS Workshop, 2013
23 Parameter Server: Key-Value Vector. The model is usually expressed as a vector or an array. With sparse data and a linear model, not all parameters are used to calculate the gradients. Key-value vector: $(w_1, w_2, \ldots, w_n) \rightarrow \{(i, w_i) \mid i \in \text{features}, w_i \in \text{weights}\}$, used to transmit only the parameters that workers actually need. Example: $(w_1, w_2, w_3, w_4) \rightarrow (1, w_1), (2, w_2), (3, w_3), (4, w_4)$
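A tiny sketch of this representation (values are illustrative): the dense weight vector becomes (index, weight) pairs, and only the keys a worker's minibatch touches are transmitted.

```python
w_dense = [0.5, -1.2, 0.0, 3.4]                 # (w1, w2, w3, w4)
kv = {i + 1: w for i, w in enumerate(w_dense)}  # {1: w1, 2: w2, 3: w3, 4: w4}

needed = {2, 4}                                 # features used by this worker
payload = {i: kv[i] for i in needed}            # only these pairs are sent
```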
24 Parameter Server: Interface. Server node — data: a partition of the globally shared parameters. Worker node — data: a portion of the training data; task: local computation. Push — direction: worker to server; data: calculated update values. Pull — direction: server to worker; data: updated parameters. M. Li et al., "Parameter server for distributed machine learning," Big Learning NIPS Workshop, 2013
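A toy, single-process stand-in for this interface (the push/pull names follow the slide; everything else — update rule, placeholder gradient — is an assumption):

```python
from collections import defaultdict

class ParameterServer:
    def __init__(self):
        self.w = defaultdict(float)       # this server's parameter partition

    def push(self, updates, lr=0.01):     # worker -> server: update values
        for k, g in updates.items():
            self.w[k] -= lr * g

    def pull(self, keys):                 # server -> worker: fresh parameters
        return {k: self.w[k] for k in keys}

# One worker step: pull the keys its minibatch touches, compute, push back.
ps = ParameterServer()
w_local = ps.pull({1, 3})
grads = {k: w_local[k] - 0.5 for k in w_local}   # placeholder gradient
ps.push(grads)
```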
25 Parameter Server: Data & Model Partition. (Figure: the model is partitioned across server nodes along the parameter dimension and the data across worker nodes; workers push updates and pull parameters.)
26 STRADS. STRADS (Lee et al., 2014): STRucture-Aware Dynamic Scheduler. A parameter server with a dynamic scheduler that chooses sets of parameters which can be updated in parallel. Parameters are not transmitted between masters and workers. S. Lee et al., "On model parallelization and scheduling strategies for distributed machine learning," NIPS 2014
27 STRADS: Execution. Basic execution order: schedule, push, pull. Schedule — subject: master; task: pick sets of model parameters that can be safely updated in parallel. Push — subject: master; tasks: dispatch computation jobs via the coordinator to the workers, which execute push to compute partial updates for each parameter. Pull — subject: key-value store; tasks: aggregate the partial updates and keep the newly updated parameters.
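A toy sketch of one schedule-push-pull round (my simplification, not the STRADS code): the scheduler picks parameters with no pairwise dependency, workers compute partial updates, and the key-value store commits the aggregate.

```python
import numpy as np

def schedule(theta, dep, k=2):
    # Pick up to k mutually independent parameter indices (toy priority:
    # larger coordinates first; `dep` is a boolean dependency matrix).
    chosen = []
    for j in np.argsort(-np.abs(theta)):
        if all(not dep[j, c] for c in chosen):
            chosen.append(int(j))
        if len(chosen) == k:
            break
    return chosen

def strads_round(theta, dep, grad_fn, lr=0.1):
    block = schedule(theta, dep)                        # 1. schedule
    partials = [(j, grad_fn(theta, j)) for j in block]  # 2. push (parallel jobs)
    for j, g in partials:                               # 3. pull: aggregate/commit
        theta[j] -= lr * g
    return theta

# Usage with the quadratic objective f(theta) = ||theta - 1||^2 / 2:
theta = np.array([3.0, -2.0, 0.5, 1.5])
dep = np.zeros((4, 4), dtype=bool)        # no dependencies in this toy case
theta = strads_round(theta, dep, lambda t, j: t[j] - 1.0)
```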
28 STRADS: Performance. Performance advantages of STRADS: faster convergence and support for larger model sizes, demonstrated on Latent Dirichlet Allocation and Matrix Factorization. S. Lee et al., "On model parallelization and scheduling strategies for distributed machine learning," NIPS 2014
29 CUDA-convnet. A fast C++/CUDA implementation of convolutional neural networks that supports multi-GPU training. (Figure: convolutional layers and fully-connected layers partitioned across GPU1 and GPU2; A. Krizhevsky, 2012.)
30 CUDA-convnet2. New features (w.r.t. cuda-convnet): improved training time; enhanced data parallelism, model parallelism, and hybrids of the two. Possible parallelizing schemes: (a) computing fully-connected activities after assembling a big batch from last-stage convlayer activities; (b) each worker sending its last-stage convlayer activities to all the other workers in turn, with the next worker's activities transferred in parallel with feedforward and backprop computation; (c) all of the workers sending #examples/K of their convlayer activities to all other workers, then proceeding as in (b). A. Krizhevsky, "One weird trick for parallelizing convolutional neural networks," 2014
31 CUDA-convnet2: Model Parallelism (Fully-Connected Layers). A. Krizhevsky, "One weird trick for parallelizing convolutional neural networks," 2014
32 Caffe. An open framework, models, and worked examples for deep learning. Pure C++/CUDA architecture (with Python and Matlab interfaces). Fast, well-tested code; tools, reference models, demos, and recipes; seamless switching between CPU and GPU. Applications: object classification, learning semantic features, object detection, sequences, reinforcement learning, speech + text.
33 Caffe: Example (LeNet). A network is a set of layers and their connections; Caffe creates and checks the net from its definition. Layers are declared in a plain-text scheme, not in code, as sketched below.
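For instance, one LeNet-style convolution layer in Caffe's plain-text (protobuf text) scheme looks roughly like this (field values here are illustrative, not the exact LeNet reference model):

```protobuf
layer {
  name: "conv1"
  type: "Convolution"
  bottom: "data"          # input blob
  top: "conv1"            # output blob
  convolution_param {
    num_output: 20        # number of filters
    kernel_size: 5
    stride: 1
  }
}
```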
34 Caffe: Pros and Cons. Performance: 2 ms/image on a K40 GPU; <1 ms inference with Caffe + cuDNN v2 on a Titan X; 72 million images per day with batched IO. Pros: a fast way to apply deep neural networks; GPU support; many common and new functions; Python and Matlab bindings. Cons: only a few input formats and only one output format (HDF5).
35 DistBelief. Introduced by the Google Brain research team (J. Dean et al., "Large scale distributed deep networks," NIPS 2012). Uses a large-scale cluster to distribute training and inference, exploiting both data and model parallelism. Distributed optimization algorithms using a parameter server: Downpour SGD and Sandblaster L-BFGS. Trains a deep network with billions of parameters using tens of thousands of CPU cores; capable of training a deep network 30x larger than previously reported; state-of-the-art performance on ImageNet (as of 2012) 1); faster than a GPU on modestly sized deep networks. 1) An image database with 16M images and 20K categories; the model had 1B parameters.
36 DistBelief : Partition Model Across Machines J. Dean et al., "Large scale distributed deep networks," NIPS 2012
37 DistBelief: Asynchronous Distributed SGD. Each worker computes its gradient on a partial slice of the data; e.g., machine 3 holds examples $(x_{201}, y_{201}), \ldots, (x_{300}, y_{300})$ and computes $\mathrm{temp}_j = w_j - \alpha \frac{\partial}{\partial w_j} \sum_{i=201}^{300} \left( h_w(x_i) - y_i \right)^2$. Asynchronous communication on partitioned data; utilization of a parameter server. J. Dean et al., "Large scale distributed deep networks," NIPS 2012
38 DistBelief: Downpour SGD. Asynchronous distributed SGD: robust to machine failures, though it introduces additional stochasticity. Adagrad's adaptive learning rate improves robustness and scalability. Procedure: 1. asynchronously fetch parameters from the parameter server into multiple model replicas; 2. run the SGD process inside each model replica; 3. asynchronously push gradients back to the parameter server. J. Dean et al., "Large scale distributed deep networks," NIPS 2012
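A toy single-process sketch of one Downpour replica's loop (assumes a `ps` object exposing pull()/push() over the full dense parameter vector, in the spirit of the interface sketched earlier; the fetch/push periods and the one-example gradient are illustrative):

```python
import numpy as np

def downpour_replica(ps, data, n_fetch=5, n_push=5, lr=0.1, eps=1e-8):
    w = np.asarray(ps.pull())                 # 1. fetch parameters (async)
    acc = np.zeros_like(w)                    # gradients since the last push
    hist = np.zeros_like(w)                   # Adagrad per-coordinate sums
    for t, (x, y) in enumerate(data, 1):
        g = (w @ x - y) * x                   # 2. local SGD on one example
        hist += g * g
        w -= lr * g / (np.sqrt(hist) + eps)   # Adagrad-scaled local step
        acc += g
        if t % n_push == 0:                   # 3. push gradients (async)
            ps.push(acc)
            acc[:] = 0
        if t % n_fetch == 0:                  # refresh the possibly-stale replica
            w = np.asarray(ps.pull())
```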
39 Adam. Optimizes and balances computation and communication. Exploits model parallelism with minimized memory bandwidth and communication overhead. Achieves high performance and scalability, along with accuracy improvements. Multi-threaded model-parameter updates without locks; asynchronous batched parameter updates. Supports training any combination of stacked convolutional and fully-connected network layers.
40 Adam: Architecture. On a single machine: multi-threaded training with fast, lock-free weight updates (similar to Hogwild!). Across multiple machines: model partitioning; reduced memory copies (= data transfers) using its own network library; memory-system optimizations (L3 cache, cache locality); vector processing units for matrix multiplication; asynchronous updates with a global parameter server, mitigating speed variance across machines. T. Chilimbi et al., "Project Adam: Building an efficient and scalable deep learning training system," OSDI 2014
41 Adam: Results. Applications: MNIST / ImageNet. 120 machines: 90 (training) + 20 (parameter server) + 10 (image server). (Figures: performance of training nodes; scaling model size with more workers; accuracy of the two applications.) 30x fewer machines with 2x accuracy improvements.
42 Petuum. A data- and model-parallel approach that considers three properties of general ML: error tolerance (robustness against limited errors in the middle of calculation), dynamic structural dependency (changes in the correlations between parameters), and non-uniform convergence (differences in convergence speed between parameters).
43 Petuum: Architecture. Scheduler: the core of the model-parallelism support; the user specifies which parameters are updated via schedule(), and partial updates are aggregated by pull(). Worker: receives the parameters selected by schedule() and computes updates via push(); any data storage system can be used. Parameter server: uses the Stale Synchronous Parallel (SSP) consistency model; table-based or key-value stores.
44 Petuum: Stale Synchronous Parallel (SSP). A parallel consistency model that bounds the difference in the number of iterations progressed between workers. Reduces network synchronization and communication costs, relying on error-tolerant convergence. Q. Ho et al., "More effective distributed ML via a stale synchronous parallel parameter server," NIPS 2013
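A minimal sketch of the staleness bound (my toy construction, not Petuum's implementation): a worker may advance its iteration clock only while it stays within `staleness` iterations of the slowest worker.

```python
import threading

class SSPClock:
    def __init__(self, n_workers, staleness):
        self.clocks = [0] * n_workers
        self.staleness = staleness
        self.cond = threading.Condition()

    def tick(self, worker_id):
        # Advance this worker's clock, blocking if it would run more than
        # `staleness` iterations ahead of the slowest worker.
        with self.cond:
            while self.clocks[worker_id] - min(self.clocks) >= self.staleness:
                self.cond.wait()
            self.clocks[worker_id] += 1
            self.cond.notify_all()

# Each worker calls clock.tick(i) once per iteration; staleness=0 degenerates
# to bulk-synchronous execution, larger values reduce waiting.
clock = SSPClock(n_workers=4, staleness=3)
```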
45 Petuum: Performance. High relative speed-up compared to other implementations; near-linear speed-up as machines are added.
46 SINGA. A distributed deep learning platform for big data analytics. Supports CNNs, RBMs, RNNs, and other models; flexible enough to run synchronous, asynchronous, and hybrid training frameworks; supports various neural-net partitioning schemes. Design goals: generality (different categories of models, different training frameworks); scalability (scales to large models and training datasets, e.g., trained with 1 billion parameters and 10M images); ease of use (a simple programming model, built-in models, a Python binding, and a web interface, usable without much awareness of the underlying distributed platform).
47 SINGA: Distributed Training. Worker group: loads a subset of the training data and computes gradients for a model replica; workers within a group run synchronously, while different worker groups run asynchronously (see the sketch below). Server group: maintains one ParamShard; handles parameter-update requests from multiple worker groups; synchronizes with neighboring groups.
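A toy sketch of this hybrid pattern (mine, not SINGA's API): members of a worker group compute gradients synchronously and push one averaged update, while different groups run asynchronously as separate threads with no cross-group barrier.

```python
import numpy as np
from threading import Thread

rng = np.random.default_rng(0)
params = np.zeros(4)                       # the ParamShard (shared, racy)
X, y = rng.normal(size=(200, 4)), rng.normal(size=200)

def worker_group(shard_slices, steps=50, lr=0.05):
    global params
    for _ in range(steps):
        w = params.copy()                  # each replica pulls the same params
        grads = [X[s].T @ (X[s] @ w - y[s]) / len(y[s])  # sync: all members
                 for s in shard_slices]
        params -= lr * np.mean(grads, axis=0)   # one averaged update per group;
                                                # groups interleave asynchronously

groups = [Thread(target=worker_group, args=([slice(0, 50), slice(50, 100)],)),
          Thread(target=worker_group, args=([slice(100, 150), slice(150, 200)],))]
for t in groups:
    t.start()
for t in groups:
    t.join()
```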
48 SINGA: Configurations. Synchronous frameworks (1 server group & 1 worker group): AllReduce (Baidu's DeepImage) with co-located workers and servers; Sandblaster with separate worker and server groups. Asynchronous frameworks (1 server group & multiple worker groups): Dogwild (distributed Hogwild!) with co-located workers and servers; Downpour with separate worker and server groups.
49 SINGA: Pros and Cons. Pros: easy to use, supporting programming without much awareness of the underlying distributed platform; a distributed architecture offering synchronous, asynchronous, and hybrid updates. Cons: limited scale-up support (e.g., no support for GPUs).
50 Summary. In the era of big data, deep learning techniques show higher accuracy than traditional machine learning algorithms. However, deep learning often requires a huge amount of resources to reach state-of-the-art performance on large-scale data. This talk surveyed recent proposals for alleviating the computational challenges involved in training large-scale deep neural networks, with emphasis on examples of scale-up and scale-out techniques.