Training Deep Neural Networks (in parallel)


Lecture 9: Training Deep Neural Networks (in parallel) Visual Computing Systems

How would you describe this professor? Easy? Mean? Boring? Nerdy?

Professor classification task: classifies professors as easy, mean, boring, or nerdy based on their appearance. Input: image of a professor. Output: probability of each of four possible labels. f(image) -> professor classifier -> Easy: ?, Mean: ?, Boring: ?, Nerdy: ?

Professor classification network: classifies professors as easy, mean, boring, or nerdy based on their appearance. Input: image of a professor. Output: probability of each label. The network is a sequence of conv layers feeding a 4-output classification layer. Recall: 10s-100s of millions of parameters. Easy: ?, Mean: ?, Boring: ?, Nerdy: ?

Professor classification network: conv layers feeding a 4-output classification layer. Example output: Easy: 0.26, Mean: 0.08, Boring: 0.14, Nerdy: 0.52

Training data (ground truth answers). [Figure: a grid of example professor images, each paired with its ground-truth label; several are labeled Nerdy.]

Professor classification network, evaluated on a new image of Kayvon (not in the training set). Ground truth (what the answer should be): Easy: 0.0, Mean: 0.0, Boring: 0.0, Nerdy: 1.0. Network output: Easy: 0.26, Mean: 0.08, Boring: 0.14, Nerdy: 0.52.

Error (loss): compare the network output against the ground truth. Ground truth (what the answer should be): Easy: 0.0, Mean: 0.0, Boring: 0.0, Nerdy: 1.0. Network output*: Easy: 0.26, Mean: 0.08, Boring: 0.14, Nerdy: 0.52. Common example: softmax loss, L = -log( e^{f_c} / Σ_j e^{f_j} ), where f_c is the output of the network for the correct category and the f_j are the outputs of the network for all categories. * In practice a network using a softmax classifier outputs unnormalized log probabilities (the f_j), but I'm showing a probability distribution above for clarity.
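To make the loss concrete, here is a minimal sketch in Python/NumPy that computes the softmax loss from a vector of unnormalized scores (the function name and example numbers are illustrative, not from the lecture):

import numpy as np

def softmax_loss(scores, correct_class):
    # scores: unnormalized log probabilities (the f_j), one per category
    shifted = scores - np.max(scores)            # subtract max for numerical stability
    probs = np.exp(shifted) / np.sum(np.exp(shifted))
    # L = -log( e^{f_c} / sum_j e^{f_j} ) = -log(probs[correct_class])
    return -np.log(probs[correct_class])

scores = np.array([1.2, 0.1, -0.5, 2.0])          # hypothetical logits for Easy/Mean/Boring/Nerdy
print(softmax_loss(scores, correct_class=3))      # small loss, since class 3 has the largest score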

Training. Goal of training: learn good values of the network's parameters so that the network outputs the correct classification result for any input image. Idea: minimize the loss over all the training examples (for which the correct answer is known): L = Σ_i L_i (the total loss for the entire training set is the sum of the losses L_i for each training example x_i). Intuition: if the network gets the answer correct for a wide range of training examples, then hopefully it has learned parameter values that yield the correct answer for future images as well.

Intuition: gradient descent. Say you had a function f(x_i, p_1, p_2) that contained hidden parameters p_1 and p_2. And for some input x_i, your training data says the function should output 0, but for the current values of p_1 and p_2 it currently outputs 10: f(x_i, p_1, p_2) = 10. [Figure: contour plot of f over (p_1, p_2); red = high values of f, blue = low values.] And say I also gave you expressions for the derivative of f with respect to p_1 and p_2 so you could compute their values at x_i: df/dp_1 = 2, df/dp_2 = 5, i.e., ∇f = [2, 5]. How might you adjust the values of p_1 and p_2 to reduce the error for this training example?

Basic gradient descent:

while (loss too high):
  for each item x_i in training set:
    grad += evaluate_loss_gradient(f, params, loss_func, x_i)
  params += -grad * step_size

Mini-batch stochastic gradient descent (mini-batch SGD): choose a random (small) subset of the training examples to compute the gradient in each iteration of the while loop. How do we compute dloss/dp for a deep neural network with millions of parameters?
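As a concrete illustration of the loop above, here is a minimal runnable sketch of mini-batch SGD on a toy problem (ordinary linear regression stands in for the network; all names and constants are mine, not the lecture's):

import numpy as np

def evaluate_loss_gradient(params, x, y):
    # gradient of the squared error 0.5*(x.params - y)^2 with respect to params
    return (x @ params - y) * x

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))              # training set
true_w = rng.normal(size=10)
Y = X @ true_w
params = np.zeros(10)
step_size, batch_size = 0.01, 32

for step in range(500):                      # "while (loss too high)"
    batch = rng.choice(len(X), size=batch_size, replace=False)   # random mini-batch
    grad = np.zeros_like(params)
    for i in batch:                          # "for each item x_i in mini-batch"
        grad += evaluate_loss_gradient(params, X[i], Y[i])
    params += -grad * step_size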

Derivatives using the chain rule. f(x, y, z) = (x + y)z = az, where a = x + y. Then df/da = z, da/dx = 1, da/dy = 1. So, by the derivative chain rule: df/dx = (df/da)(da/dx) = z. [Figure: computation graph with x = 3, y = 4, z = 5; the add node outputs a = 7 and the multiply node outputs f = 35. Red = output of node, blue = df/dnode: df/df = 1, df/da = 5, df/dz = 7, df/dx = 5, df/dy = 5.]
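A quick sanity check of these numbers in code (a minimal sketch; the variable names are mine):

# f(x, y, z) = (x + y) * z, evaluated at x=3, y=4, z=5
x, y, z = 3.0, 4.0, 5.0
a = x + y                 # forward: a = 7
f = a * z                 # forward: f = 35

df_df = 1.0               # seed the backward pass
df_da = df_df * z         # df/da = z = 5
df_dz = df_df * a         # df/dz = a = 7
df_dx = df_da * 1.0       # chain rule: df/dx = (df/da)(da/dx) = 5
df_dy = df_da * 1.0       # df/dy = 5
print(f, df_dx, df_dy, df_dz)   # 35.0 5.0 5.0 7.0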

Backpropagation. Red = output of node, blue = df/dnode. Recall: df/dx = (df/dg)(dg/dx).
- Add gate: g(x, y) = x + y, with dg/dx = 1 and dg/dy = 1, so the upstream gradient is passed through unchanged to both inputs (in the figure, an upstream gradient of 10 flows to both x and y).
- Max gate: g(x, y) = max(x, y), with dg/dx = 1 if x > y and 0 otherwise, so the upstream gradient is routed only to the larger input (in the figure, x = 15 and y = 10: x receives the full upstream gradient, y receives 0).
- Multiply gate: g(x, y) = xy, with dg/dx = y and dg/dy = x, so each input receives the upstream gradient scaled by the value of the other input (in the figure, the inputs receive 10*12 and 10*15).
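These three local rules are all a node needs for backprop. A minimal sketch of the per-gate backward functions (the names are mine):

def add_backward(x, y, upstream):
    # dg/dx = dg/dy = 1: pass the upstream gradient to both inputs
    return upstream, upstream

def max_backward(x, y, upstream):
    # gradient is routed only to whichever input was the max
    return (upstream, 0.0) if x > y else (0.0, upstream)

def mul_backward(x, y, upstream):
    # dg/dx = y, dg/dy = x: "swap the inputs" and scale by the upstream gradient
    return upstream * y, upstream * x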

Backpropagating through a single unit. Recall the behavior of the unit: f(x_0, x_1, x_2, x_3) = max(0, Σ_i x_i w_i + b). During backprop, let y = dloss/dunit (= 10 in the figure) if the upper input to the max (the pre-activation Σ_i x_i w_i + b) is > 0, and y = 0 otherwise. Then each multiply gate produces a gradient of y*x_i for weight w_i and y*w_i for input x_i, and the bias b receives gradient y. Observe: the output of the prior layer (the x_i) must be retained in order to compute the weight gradients for this unit during backprop.
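A minimal sketch of this unit's forward and backward passes, making the retained inputs explicit (NumPy; names are mine):

import numpy as np

def unit_forward(x, w, b):
    # f(x) = max(0, sum_i x_i * w_i + b); keep x and the pre-activation for backprop
    pre = x @ w + b
    out = max(0.0, pre)
    cache = (x, w, pre)
    return out, cache

def unit_backward(dloss_dunit, cache):
    x, w, pre = cache
    y = dloss_dunit if pre > 0 else 0.0   # the max (ReLU) gate
    dw = y * x                            # requires the retained prior-layer outputs x
    dx = y * w
    db = y
    return dx, dw, db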

Multiple uses of an input variable: sum the gradients from each use of the variable. Example: g(x, y) = (x + y) + x*x. Let a = x + y and b = x*x, so g = a + b, with da/dx = 1 and db/dx = 2x. Then dg/dx = (dg/da)(da/dx) + (dg/db)(db/dx) = 2x + 1. With an upstream gradient df/dg = 10 and x = 5: df/dx = (df/dg)(dg/dx) = 10(2x + 1) = 10(10 + 1) = 110. [Figure: x = 5, y = 5; the multiply node contributes 10*5 from each of its two uses of x and the add node contributes 10, so x's gradient is 50 + 50 + 10 = 110.] Implication: backpropagation through all units in a convolutional layer adds the gradients computed from each unit to the overall gradient for the shared weights.
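A tiny code check of the "sum the gradients from each use" rule for this example (names are mine):

# g(x, y) = (x + y) + x*x, evaluated at x=5, y=5, with upstream gradient df/dg = 10
x, y, upstream = 5.0, 5.0, 10.0

df_dx = 0.0
df_dx += upstream * 1.0        # use of x in the add node (x + y)
df_dx += upstream * x          # first input of the multiply node (x * x)
df_dx += upstream * x          # second input of the multiply node
print(df_dx)                   # 110.0 = 10 * (2x + 1)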

Backpropagation: matrix form. Forward evaluation of a convolutional layer can be written as y = Xw, where w is the 9-element filter vector (for a 3x3 filter), y is the (WxH)-element vector of layer outputs, and X is the (WxH) x 9 matrix whose row j holds the 3x3 input window used to compute output element j (rows of the form [x00 x01 x02 x10 x11 x12 x20 x21 x22], shifted by one pixel per row). Since dy_j/dw_i = X_ji: dl/dw_i = Σ_j (dl/dy_j)(dy_j/dw_i) = Σ_j (dl/dy_j) X_ji. Therefore: dl/dw = X^T (dl/dy).
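A minimal NumPy sketch of this identity for a 3x3 filter, using an im2col-style construction of X (the helper name is mine):

import numpy as np

def im2col_3x3(img):
    # Rows of X are the 3x3 windows of img (valid positions only).
    H, W = img.shape
    rows = []
    for r in range(H - 2):
        for c in range(W - 2):
            rows.append(img[r:r+3, c:c+3].ravel())
    return np.array(rows)            # shape: ((H-2)*(W-2), 9)

rng = np.random.default_rng(0)
img = rng.normal(size=(8, 8))
w = rng.normal(size=9)

X = im2col_3x3(img)
y = X @ w                            # forward: y = Xw
dl_dy = rng.normal(size=y.shape)     # stand-in for the upstream gradient from the loss
dl_dw = X.T @ dl_dy                  # backprop: dl/dw = X^T (dl/dy)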

Backpropagation through the entire professor classification network. For each training example x_i in the mini-batch:
- Perform forward evaluation to compute the loss for x_i
- Compute the gradient of the loss w.r.t. the final layer's outputs
- Backpropagate the gradient to compute the gradient of the loss w.r.t. all network parameters
- Accumulate gradients (over all images in the batch)
Then update all parameter values: w_new = w_old - step_size * grad

Recall from last class: VGG memory footprint. Calculations assume 32-bit values (image batch size = 1). Note: multiply the output sizes by N for a batch of N images. Storing convolution layer outputs ("unit activations") can get big in early layers with large input size and many filters; most of the weights are in the fully-connected layers.

layer                       weights mem   output size (per image)   output mem
input: 224 x 224 RGB image  -             224x224x3                 150K
conv: (3x3x3) x 64          6.5 KB        224x224x64                12.3 MB
conv: (3x3x64) x 64         144 KB        224x224x64                12.3 MB
maxpool                     -             112x112x64                3.1 MB
conv: (3x3x64) x 128        228 KB        112x112x128               6.2 MB
conv: (3x3x128) x 128       576 KB        112x112x128               6.2 MB
maxpool                     -             56x56x128                 1.5 MB
conv: (3x3x128) x 256       1.1 MB        56x56x256                 3.1 MB
conv: (3x3x256) x 256       2.3 MB        56x56x256                 3.1 MB
conv: (3x3x256) x 256       2.3 MB        56x56x256                 3.1 MB
maxpool                     -             28x28x256                 766 KB
conv: (3x3x256) x 512       4.5 MB        28x28x512                 1.5 MB
conv: (3x3x512) x 512       9 MB          28x28x512                 1.5 MB
conv: (3x3x512) x 512       9 MB          28x28x512                 1.5 MB
maxpool                     -             14x14x512                 383 KB
conv: (3x3x512) x 512       9 MB          14x14x512                 383 KB
conv: (3x3x512) x 512       9 MB          14x14x512                 383 KB
conv: (3x3x512) x 512       9 MB          14x14x512                 383 KB
maxpool                     -             7x7x512                   98 KB
fully-connected 4096        392 MB        4096                      16 KB
fully-connected 4096        64 MB         4096                      16 KB
fully-connected 1000        15.6 MB       1000                      4 KB
soft-max                    -             1000                      4 KB
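The per-layer numbers above can be reproduced (approximately, given rounding) with a few lines of arithmetic; a minimal sketch (function names are mine):

def conv_weight_bytes(k, in_ch, out_ch, bytes_per_value=4):
    # one (k x k x in_ch) filter per output channel, 32-bit values
    return k * k * in_ch * out_ch * bytes_per_value

def activation_bytes(h, w, ch, bytes_per_value=4):
    return h * w * ch * bytes_per_value

# First two VGG conv layers, compared with the table above:
print(conv_weight_bytes(3, 3, 64) / 1024)        # ~6.75 KB  (table: 6.5 KB)
print(conv_weight_bytes(3, 64, 64) / 1024)       # 144 KB
print(activation_bytes(224, 224, 64) / 2**20)    # ~12.25 MB (table: 12.3 MB)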

Data lifetimes during network evaluation (forward pass only): weights (read-only) reside in memory; after evaluating layer i, the outputs from layer i-1 can be freed.

Data lifetimes during training. [Figure: an AlexNet-like network (conv1 11x11x96, conv2 5x5x256, conv3 3x3x384, conv4 3x3x384, conv5 3x3x256, fc6 4k x 4k, fc7 4k x 4k) with, for every layer, buffers for both the per-weight gradient velocity ("vel") and the per-weight gradients ("grad").]
- Must retain outputs for all layers because they are needed to compute gradients during back-prop
- Parallel back-prop will require storage for per-weight gradients (more about this in a second)
- In practice: may also store per-weight gradient velocity (if using SGD with momentum) or a step cache (if using Adagrad):
vel_new = mu * vel_old - step_size * grad
w_new = w_old + vel_new
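A minimal sketch of the momentum update above, making explicit the extra per-weight state it forces training to keep (names are mine):

import numpy as np

def sgd_momentum_step(w, vel, grad, step_size=0.01, mu=0.9):
    # Per-weight velocity is persistent state: it must live in memory alongside
    # the weights and the gradients for the entire duration of training.
    vel_new = mu * vel - step_size * grad
    w_new = w + vel_new
    return w_new, vel_new

w = np.zeros(10)
vel = np.zeros_like(w)          # same size as the weights: extra memory footprint
grad = np.ones_like(w)          # stand-in for a backprop result
w, vel = sgd_momentum_step(w, vel, grad)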

VGG memory footprint during training (same layer-by-layer table as above; 32-bit values, batch size = 1). Differences from forward-only evaluation:
- Must also store per-weight gradients
- Many implementations also store gradient momentum (multiply the weight storage by 3)
- Layer inputs/outputs get multiplied by the mini-batch size
- Unlike forward evaluation, outputs cannot be immediately freed once consumed by the next level of the network

SGD workload.

while (loss too high):                      // at first glance, this loop is sequential (each step of walking downhill depends on the previous one)
  for each item x_i in mini-batch:          // parallel across images
    grad += evaluate_loss_gradient(f, loss_func, params, x_i)   // a large computation with its own parallelism (but the working set may not fit on a single machine); the += is a sum reduction
  params += -grad * step_size               // trivially data-parallel over the parameters

DNN training workload.
Huge computational expense:
- Must evaluate the network (forward and backward) for millions of training images
- Must iterate for many iterations of gradient descent (100's of thousands)
- Training modern networks takes days
Large memory footprint:
- Must maintain network layer outputs from the forward pass
- Additional memory to store gradients/gradient velocity for each parameter
- Recall: the parameters of the popular VGG-16 network require ~500 MB of memory (training requires GBs of memory for academic networks)
- Scaling to larger networks requires partitioning the DNN across nodes to keep the DNN + intermediates in memory
Dependencies/synchronization (not embarrassingly parallel):
- Each parameter update step depends on the previous one
- Many units contribute to the same gradients (fine-scale reduction)
- Different images in a mini-batch contribute to the same gradients

Data-parallel training (across images).

for each item x_i in mini-batch:
  grad += evaluate_loss_gradient(f, loss_func, params, x_i)
params += -grad * step_size

Consider parallelization of the outer for loop across machines in a cluster: partition the mini-batch across nodes, each holding its own copy of the parameter values (Node 0 processes image x_0 and computes the gradients due to x_0, Node 1 processes image x_1 and computes the gradients due to x_1, and so on).

for each item x_i in mini-batch assigned to node:
  // just like single-node training
  grad += evaluate_loss_gradient(f, loss_func, params, x_i)
barrier()
sum-reduce gradients, communicate results to all nodes
barrier()
update local copy of parameter values
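A minimal sketch of this pattern, assuming an MPI environment with mpi4py available (the gradient computation is a stand-in; only the mini-batch partitioning and the sum-reduction are the point):

# run with, e.g.: mpiexec -n 4 python data_parallel_sgd.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, nprocs = comm.Get_rank(), comm.Get_size()

params = np.zeros(1000)                       # every node holds a full copy of the parameters
minibatch = np.arange(256)                    # global mini-batch (item indices)
my_items = minibatch[rank::nprocs]            # partition the mini-batch across nodes

def evaluate_loss_gradient(params, item):
    # stand-in for forward + backward evaluation of the network on one image
    return np.random.default_rng(int(item)).normal(size=params.shape)

for step in range(10):
    local_grad = np.zeros_like(params)
    for i in my_items:                        # just like single-node training
        local_grad += evaluate_loss_gradient(params, i)
    global_grad = np.zeros_like(params)
    comm.Allreduce(local_grad, global_grad, op=MPI.SUM)   # sum-reduce + broadcast (barrier implied)
    params += -global_grad * 0.001            # every node applies the same update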

Challenges of computing at cluster scale.
Slow communication between nodes:
- Commodity clusters do not feature the high-performance interconnects (e.g., InfiniBand) typical of supercomputers
Nodes with different performance (even if the machines are the same):
- Workload imbalance at barriers (sync points between nodes)
Modern solution: exploit characteristics of SGD using asynchronous execution!

Parameter server design [Li OSDI14]; see also Google's DistBelief [Dean NIPS12] and Microsoft's Project Adam [Chilimbi OSDI14]. [Figure: a pool of worker nodes (Node 0 - Node 3) and a parameter server holding the parameter values.]

Training data partitioned among workers. [Figure: Node 0 holds x_0 - x_1000, Node 1 holds x_1000 - x_2000, Node 2 holds x_2000 - x_3000, Node 3 holds x_3000 - x_4000; the parameter server holds parameter values (v0).]

Copy of parameters sent to workers. [Figure: the parameter server sends parameter values (v0) to Node 0 - Node 3; each worker now holds params v0.]

Workers independently compute subgradients. [Figure: Node 0 - Node 3 each evaluate a subgradient on their local data; the parameter server still holds parameter values (v0).]

Worker sends its subgradient to the server. [Figure: Node 1 sends a subgradient to the parameter server, which holds parameter values (v0).]

Server updates the global parameter values based on the subgradient: params += -subgrad * step_size. [Figure: the parameter server now holds parameter values (v1); Node 0 - Node 3 continue working.]

Updated parameters sent to the worker, which proceeds with another gradient computation step. [Figure: the parameter server sends params v1 to Node 1; the server holds parameter values (v1).] Note: Node 1 is now operating on a different set of parameter values than the other nodes, and those values were computed without gradient information from the other nodes.

Updated parameters sent to a worker (again). [Figure: Node 1 holds params (v1); another subgradient is sent to the parameter server, which holds parameter values (v1).]

Worker continues with updated parameters. [Figure: Node 1 holds params (v1); the parameter server, now at parameter values (v2), sends params v2 to another worker, which proceeds with params (v2).]

Summary: asynchronous update. Idea: avoid global synchronization on all parameter updates between each SGD iteration.
- The design reflects the realities of cluster computing: slow interconnects and unpredictable machine performance
- Solution: asynchronous (and partial) subgradient updates
- This will impact the convergence of SGD: node N working on iteration i may not have parameter values that reflect the results of the i-1 prior SGD iterations
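A minimal single-process sketch of the asynchronous worker/server protocol described above (Python threads stand in for the server and worker nodes; the class, function names, and gradient computation are mine, not from any of the cited systems):

import threading
import numpy as np

class ParameterServer:
    def __init__(self, n_params, step_size=0.001):
        self.params = np.zeros(n_params)   # global parameter values
        self.step_size = step_size
        self.lock = threading.Lock()       # serializes updates on the server only

    def pull(self):
        with self.lock:
            return self.params.copy()      # worker gets a (possibly stale) copy

    def push(self, subgrad):
        with self.lock:
            self.params += -subgrad * self.step_size   # apply one subgradient

def worker(server, my_items, n_steps=100):
    rng = np.random.default_rng()
    for _ in range(n_steps):
        params = server.pull()                         # no global barrier across workers
        subgrad = np.zeros_like(params)
        for i in my_items:                             # this worker's portion of the data
            subgrad += rng.normal(size=params.shape)   # stand-in for forward + backprop
        server.push(subgrad)                           # asynchronous, partial update

server = ParameterServer(n_params=1000)
threads = [threading.Thread(target=worker, args=(server, range(k * 8, (k + 1) * 8)))
           for k in range(4)]
for t in threads: t.start()
for t in threads: t.join()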

Bottleneck? What if there is heavy contention for the parameter server? [Figure: Node 0 - Node 3, holding params (v1) and (v2), all communicating with a single parameter server holding parameter values (v2).]

Shard the server: partition the parameters across multiple servers; a worker sends each chunk of its subgradient to the server that owns that chunk. [Figure: Node 1 sends subgradient (chunk 0) to Parameter Server 0, which holds parameter values (chunk 0), and subgradient (chunk 1) to Parameter Server 1, which holds parameter values (chunk 1).] This reduces the data transmission load on individual servers (less important: it also reduces the cost of the parameter update on each server).

What if the model parameters do not fit on one worker? Recall the high memory footprint of training large networks (particularly with large mini-batch sizes). [Figure: Node 0 - Node 3 holding params (v1)/(v2); the parameter values are sharded across Parameter Server 0 (chunk 0) and Parameter Server 1 (chunk 1).]

Model parallelism: partition the network parameters across nodes (spatial partitioning to reduce communication). Reduce internode communication through network design:
- Use small spatial convolutions (1x1 convolutions)
- Reduce/shrink fully-connected layers
[Figure: the network is split spatially between Node 0 and Node 1.] For convolutional layers, nodes only need to communicate the outputs near the spatial partition boundary; for fully-connected layers, all data owned by a node must be communicated to the other nodes.
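A minimal sketch of the convolutional-layer case: two nodes each hold half of an image's rows and, before applying a 3x3 filter, exchange only the one-row "halo" at the partition boundary (plain NumPy, simulated in one process; names are mine):

import numpy as np

def conv3x3_valid(block, filt):
    # straightforward "valid" 3x3 convolution (correlation) for illustration
    H, W = block.shape
    out = np.zeros((H - 2, W - 2))
    for r in range(H - 2):
        for c in range(W - 2):
            out[r, c] = np.sum(block[r:r+3, c:c+3] * filt)
    return out

rng = np.random.default_rng(0)
image = rng.normal(size=(16, 16))
filt = rng.normal(size=(3, 3))

top, bottom = image[:8], image[8:]                # spatial partition across two "nodes"

# Each node only needs its neighbor's boundary row (the halo), not the whole half.
top_with_halo = np.vstack([top, bottom[:1]])      # node 0 receives 1 row from node 1
bottom_with_halo = np.vstack([top[-1:], bottom])  # node 1 receives 1 row from node 0

out = np.vstack([conv3x3_valid(top_with_halo, filt),
                 conv3x3_valid(bottom_with_halo, filt)])
assert np.allclose(out, conv3x3_valid(image, filt))   # matches the single-node result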

Training with both data-parallel and model-parallel execution. [Figure: Node 0 and Node 1 together hold one copy of the model (parameters (v1): chunk 0 and chunk 1) and work on the subgradient computation for that copy, with fine-grained communication of layer outputs, gradients, etc. between them; Node 2 and Node 3 do the same for another copy of the model (chunk 0 and chunk 1). Parameter Server 0 holds parameter values (chunk 0) and Parameter Server 1 holds parameter values (chunk 1).]

Using supercomputers for training? Fast interconnects are critical for model-parallel training (fine-grained communication of layer outputs and gradients). A fast interconnect also diminishes the need for asynchronous training algorithms, avoiding randomness in training due to the computation schedule (there remains randomness due to the SGD algorithm itself). [Images: the Oak Ridge Titan supercomputer; NVIDIA DGX-1, 8 Pascal GPUs connected via the high-speed NVLink interconnect.]

Accelerating data-parallel training: FireCaffe [Iandola 16].
- Use a high-performance Cray Gemini interconnect (Titan supercomputer)
- Use a combining tree for accumulating gradients (rather than a single parameter server)
- Use a larger batch size (to reduce the frequency of communication), offset by increasing the learning rate

Table 3. Accelerating the training of ultra-deep, computationally intensive models on ImageNet-1k.
System            Hardware                               Net        Epochs  Batch size  Initial LR  Train time  Speedup  Top-1 Accuracy  Top-5 Accuracy
Caffe [41]        1 NVIDIA K20                           GoogLeNet  64      32          0.01        21 days     1x       68.3%           88.7%
FireCaffe (ours)  32 NVIDIA K20s (Titan supercomputer)   GoogLeNet  72      1024        0.08        23.4 hours  20x      68.3%           88.7%
FireCaffe (ours)  128 NVIDIA K20s (Titan supercomputer)  GoogLeNet  72      1024        0.08        10.5 hours  47x      68.3%           88.7%

Dataset: ImageNet 1K. Result: reasonable scalability without asynchronous update for modern DNNs with fewer weights (due to no fully-connected layers), such as GoogLeNet. [An accompanying analysis on the slide measures communication only, i.e., as if computation were free.]
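A minimal sketch of the combining-tree idea for summing per-worker gradients: combine pairs level by level (log2(N) levels) instead of having every worker send to one server (plain Python; names are mine, not FireCaffe's implementation):

import numpy as np

def tree_reduce(gradients):
    # Pairwise combining tree over the list of per-worker gradient arrays.
    level = list(gradients)
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level) - 1, 2):
            next_level.append(level[i] + level[i + 1])   # combine a pair
        if len(level) % 2 == 1:
            next_level.append(level[-1])                 # odd worker passes through
        level = next_level
    return level[0]

rng = np.random.default_rng(0)
per_worker_grads = [rng.normal(size=8) for _ in range(32)]
assert np.allclose(tree_reduce(per_worker_grads), np.sum(per_worker_grads, axis=0))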

Parallelizing the mini-batch on one machine.

for each item x_i in mini-batch:
  grad += evaluate_loss_gradient(f, loss_func, params, x_i)
params += -grad * step_size

Consider parallelization of the outer for loop across cores: Core 0 processes image x_0 and accumulates the gradients due to x_0, Core 1 processes image x_1 and accumulates the gradients due to x_1, and the per-core gradients are then combined into the final gradients used to update the parameter values.
Good: completely independent computations (until the gradient reduction)
Bad: complete duplication of gradient state (100's of MB per core)
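A minimal sketch of this per-core replication (Python threads stand in for cores; the gradient computation is a stand-in; names are mine):

import numpy as np
from concurrent.futures import ThreadPoolExecutor

params = np.zeros(1_000_000)                 # 1M parameters; real networks have 10s-100s of millions
minibatch = list(range(64))

def evaluate_loss_gradient(params, item):
    return np.random.default_rng(item).normal(size=params.shape)  # stand-in for backprop

def core_worker(items):
    local_grad = np.zeros_like(params)       # private gradient buffer per core (duplicated state)
    for i in items:
        local_grad += evaluate_loss_gradient(params, i)
    return local_grad

n_cores = 4
chunks = [minibatch[c::n_cores] for c in range(n_cores)]
with ThreadPoolExecutor(max_workers=n_cores) as pool:
    per_core = list(pool.map(core_worker, chunks))

grad = np.sum(per_core, axis=0)              # reduction across cores
params += -grad * 0.001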

Asynchronous update on one node.

for each item x_i in mini-batch:
  grad += evaluate_loss_gradient(f, loss_func, params, x_i)
params += -grad * step_size

Alternative: cores update a shared set of gradients, skipping locks / synchronization across cores and performing an approximate reduction. [Figure: Core 0 (image x_0) and Core 1 (image x_1) both accumulate into one shared gradient buffer, which is used to update the parameter values.] Project Adam [Chilimbi OSDI14]
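In contrast to the per-core buffers above, here is a schematic sketch of the shared-buffer variant: all threads accumulate into one gradient array without taking a lock, so concurrent updates may occasionally be lost, which is accepted as an approximate reduction. This is illustrative only (names are mine), not the Project Adam implementation:

import numpy as np
from concurrent.futures import ThreadPoolExecutor

params = np.zeros(1_000_000)
shared_grad = np.zeros_like(params)          # ONE gradient buffer shared by all cores

def evaluate_loss_gradient(params, item):
    return np.random.default_rng(item).normal(size=params.shape)  # stand-in for backprop

def core_worker(items):
    for i in items:
        g = evaluate_loss_gradient(params, i)
        # No lock around this in-place accumulation: concurrent read-modify-write
        # races can drop contributions (the approximate reduction, tolerated by design).
        np.add(shared_grad, g, out=shared_grad)

minibatch = list(range(64))
chunks = [minibatch[c::4] for c in range(4)]
with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(core_worker, chunks))

params += -shared_grad * 0.001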

Summary: training large networks in parallel. Most systems rely on asynchronous parameter update to efficiently use clusters of commodity machines:
- A modification of the SGD algorithm to meet the constraints of modern parallel systems
- Open question: the effects on convergence are problem-dependent and not particularly well understood
- Tighter integration / faster interconnects may provide an alternative to these methods (facilitating tightly orchestrated solutions, much like supercomputing applications)
That said, modern DNN designs (with fewer weights) together with efficient use of high-performance interconnects (much like any other parallel computing problem) enable scalability without asynchronous execution.