A More Robust Asynchronous SGD

A More Robust Asynchronous SGD

Said Al Faraby
Artificial Intelligence
University of Amsterdam

This thesis is submitted for the degree of Master of Science

February 2015

Abstract

With the massive amount of data available for learning good models, the time required to process the data becomes a big concern. In this thesis we compare distributed synchronous and asynchronous SGD algorithms, tested on MLitB, a web-browser-based distributed framework. We also introduce new strategies to improve the performance of asynchronous SGD, namely asynchronous-parameter and partial asynchronous-parameter updates. Finally, we report experimental results for the distributed SGD algorithms. The results show that the asynchronous-parameter and partial asynchronous-parameter methods are more robust to stale parameter updates than traditional asynchronous SGD. The partial asynchronous-parameter method obtained similar or better error rates than the full update while transferring as little as 10% of the update elements.

Table of contents

List of figures
List of tables
Nomenclature

1 Introduction
  1.1 Motivation
  1.2 Research Goals
  1.3 Thesis Contributions
2 Background
  2.1 Convolutional Neural Networks
    2.1.1 CNN Architecture and Configurations
    2.1.2 Training CNN
  2.2 MLitB Framework
    2.2.1 Architecture
    2.2.2 Synchronous Events Loop
    2.2.3 Asynchronous Events Loop
3 Distributed SGD Algorithms
  3.1 Synchronous SGD
  3.2 Asynchronous-gradient SGD
    3.2.1 Asynchronous SGD
    3.2.2 Asynchronous SGD on MLitB
  3.3 Asynchronous-Parameters SGD
    3.3.1 Motivation
    3.3.2 Previous Work
    3.3.3 Our Approach
    3.3.4 Adaptive α
  3.4 Partial Asynchronous-Parameters
    3.4.1 Motivation
    3.4.2 Partial Update
4 Experiments and Analysis
  4.1 Uniform Processing Speed
  4.2 Different Values of α
  4.3 Slow Updates
  4.4 Partial Update
    4.4.1 Selection Methods
    4.4.2 Different Values of ρ
  4.5 Performance on Cifar10
5 Conclusion and Discussion
References

List of figures

2.1 Convolutional Neural Networks Architecture
2.2 MLitB architecture [23]
2.3 Synchronous event loop
2.4 Asynchronous event loop
3.1 Illustration of updating parameters in asynchronous-gradient SGD
3.2 Illustration of updating parameters in asynchronous-parameter SGD
4.1 Training classification error for different distributed SGD algorithms
4.2 Training classification errors for different values of α
4.3 Training classification errors after adding random delays
4.4 Training classification errors for different indexing-selection methods
4.5 Training classification errors for different values of ρ
4.6 Mini-batch error on Cifar10 dataset

List of tables

2.1 Activation functions

Chapter 1 Introduction

1.1 Motivation

Years ago, the limited quantity of available data prevented researchers from creating accurate models. These days, the capacity of data storage has increased enormously and collecting data has become much easier than before. As a result, gigantic volumes of data can be collected every day. The challenge now is to process the data in a sensible amount of time. Many researchers have worked on speeding up learning for very large datasets [4, 7, 10, 21, 26]. Most of this work used a multicore CPU setting to distribute data and models. Some recent works have started to use GPUs for general high-performance computation, especially for distributed computation [25, 33]. Besides using multicore machines or computer clusters for distributed computing, there is an opportunity to build cheap distributed computing on top of web browsers using the JavaScript virtual machine [23, 28]. This platform also offers flexibility for research collaboration, with an internet connection as the only requirement. The portability of web technologies enables everyone to build their own system from shared code.

Stochastic gradient descent (SGD) has become a standard algorithm for solving complex optimization problems. With the increase in the volume of data, there is a need to distribute the data and the SGD computations over many machines. SGD is an iterative optimization algorithm, so distributed SGD naturally requires synchronization. However, synchronization can reduce the efficiency of resource utilization, so many have tried to distribute SGD asynchronously [5, 17, 32]. In this thesis we are interested in investigating distributed computation of SGD via web browsers. There are some inherent issues that differentiate web browsers from the multicore setting as a medium for distributed computation. In the web browser setting, the

processing speeds of users' machines may vary significantly, compared to the uniform speeds of a multicore setting. Another factor that may influence the performance of SGD is the high network latency of transferring data over an internet connection. These two factors make distributed SGD with synchronization much less optimal in utilizing computing resources. Although traditional asynchronous SGD could be a straightforward solution, it appears to be less stable and more sensitive to the learning parameter settings and the number of machines [5, 11, 25].

1.2 Research Goals

In this thesis, we attempt to answer the following questions:

1. How do the existing distributed SGD algorithms perform relative to each other when implemented in the web browser setting?
2. How can we improve asynchronous SGD?
3. Bigger models cause higher network latency. Can we reduce the transferred data without detrimental effects on learning?

This thesis focuses on comparing the performance of different distributed SGD algorithms that run the same architecture and hyperparameters. Finding the best possible combination of hyperparameters to achieve the best performance (e.g. error rate) is outside the scope of this thesis.

1.3 Thesis Contributions

We provide comparisons of synchronous and asynchronous distributed SGD algorithms. We also demonstrate the effectiveness of a new asynchronous SGD algorithm that is more robust to stale updates. Moreover, we present empirical results for the partial update method, which reduces the size of the transferred data while retaining performance comparable to the full update.

Chapter 2 Background

2.1 Convolutional Neural Networks

As the case study for this thesis, we use a convolutional neural network as the model for an image classification task. Convolutional neural networks (CNN) were first introduced by Fukushima [9], whose model was inspired by the human visual nervous system proposed by Hubel and Wiesel. An important aspect of this model is that it is invariant to the position of the input pattern and depends only on its shape. The design was later improved by LeCun et al. [19], who also popularized the name Convolutional Neural Networks. Since then, CNNs have shown success in many different applications, including face recognition [18], ImageNet classification [16], and speech recognition [1].

2.1.1 CNN Architecture and Configurations

The main components that distinguish a CNN from a common ANN are the convolution and sub-sampling layers. A convolution layer and its previous layer are connected by sets of weights, which are usually called filters. The filters here serve the same function as the filters used in convolution operations in image processing or signal processing in general. For instance, in image processing there are edge filters to detect edges in input images. In a CNN, however, we do not define in advance what patterns to detect, so we do not know the right values for the filters. In fact, we initialize the filters with random values and learn the correct ones during training. The convolution operations are usually followed by applying an activation function to the resulting values, and the final output is called a feature map. A convolutional layer usually consists of many filters, and each filter produces one feature map.
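To make the filter and feature-map terminology concrete, the following minimal NumPy sketch (our illustration, not part of the thesis) slides one filter over an image and produces a single feature map; the input and filter sizes are chosen to match layer C1 of the architecture described below, and the function name is ours.

import numpy as np

def conv2d_valid(image, kernel):
    """Slide one filter over an image (stride 1, no padding) and return
    the resulting feature map, as described for a convolution layer."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            # dot product between the filter and the patch it currently covers
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

# Illustrative usage: a 28 x 28 input and a random 5 x 5 filter; filters
# start from random values and are learned during training.
rng = np.random.default_rng(0)
image = rng.standard_normal((28, 28))
kernel = rng.standard_normal((5, 5))
feature_map = conv2d_valid(image, kernel)
print(feature_map.shape)  # (24, 24), as for layer C1 below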

Sub-sampling or down-sampling usually refers to reducing the size of the input signal. Besides size reduction, sub-sampling can also be thought of as adding translation invariance to the CNN. A commonly used sub-sampling method in CNNs is max-pooling. Originally, max-pooling operates by splitting the input matrix into non-overlapping smaller grids and taking the maximum value of each grid as the output. In practice, people sometimes use small overlaps between the grids. There are no connection weights in sub-sampling layers, or the weights can be considered constant values of 1. The outputs of sub-sampling operations are also called feature maps.

Fig. 2.1 Convolutional Neural Networks Architecture

A typical CNN architecture consists of an input layer, followed by some pairs of convolution and sub-sampling layers, and ends with one or more fully connected layers. A graphical illustration of a CNN architecture is shown in figure 2.1. The input layer has size 28 x 28 and is followed by the first convolution layer (C1) with 4 filters, each of dimension 5 x 5. The output of the C1 layer consists of 4 feature maps of size 24 x 24, which are then sub-sampled by the first sub-sampling layer (S1) with a grid size of 2 x 2, producing 4 feature maps of 12 x 12. The same procedure is continued by the second pair of convolution (C2: filter size 3 x 3) and sub-sampling (S2: grid size 2 x 2) layers. The S2 layer is fully connected with the last layer, which consists of 10 neurons, each the result of a full dot product between all output values in the S2 layer and the connection weights coming into that neuron.

There are some additional configurations to complete the architecture above. Besides the filter and grid sizes, people sometimes use an option called stride to define the distance between the centers of two consecutive filter positions in the input signal, or the distance between two neighboring grids. The last configuration to be specified is the activation functions for the convolution layers and the output layer. There are several well-known activation functions for ANNs, but throughout this thesis we will use the rectifier (relu) function for every convolution

layer, and the softmax function for the output layer, as these are the most popular choices for CNN classifiers (see table 2.1 for definitions).

Table 2.1 Activation functions

rectifier: $f(x) = \max(0, x)$
softmax: $f(z)_j = \frac{\exp(z_j)}{\sum_{k=1}^{K} \exp(z_k)}$ for $j = 1, \ldots, K$

Finally, the complete configuration of the illustrated CNN can be written in the following textual format:

Input : size=(28,28)
Conv  : filters=4, size=(5,5), stride=1, actfunc=relu
Pool  : size=(2,2), stride=2
Conv  : filters=8, size=(3,3), stride=1, actfunc=relu
Pool  : size=(2,2), stride=2
FC    : neurons=10, actfunc=softmax

2.1.2 Training CNN

A CNN is trained using the back-propagation procedure and gradient descent optimization [19]. Back-propagation was introduced by Rumelhart et al. [29]. The procedure has two passes. The first is the forward pass, in which, given an input x, each neuron in each layer produces an output value. The objective of training is to adjust the connection weights w such that the outputs of the network y(x), the values of the neurons in the output layer, become as close as possible to the target values t for the input x. The difference between the network output and the target defines the error for the input x. The total error function is defined as

$E = \frac{1}{2} \sum_{x,t} \sum_j \left( y(x)_j - t_j \right)^2 \qquad (2.1)$

where (x, t) is a pair of input data and target values, y(x) is the vector of output neuron values, and j indexes the output neurons. The gradient descent optimization method is then used to minimize the error function by taking partial derivatives of E with respect to each weight in the network [29]. Computing the partial derivatives is done sequentially, propagating the derivatives of the neurons in the last layer backward toward the input layer, which is why this pass is called the backward pass.
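As a concrete reference, the short NumPy sketch below (ours, not the thesis implementation) writes out the two activation functions of table 2.1 and the per-example squared error of equation (2.1); the max-subtraction inside softmax is a standard numerical-stability trick not mentioned in the text.

import numpy as np

def relu(x):
    """Rectifier from Table 2.1: f(x) = max(0, x), applied element-wise."""
    return np.maximum(0.0, x)

def softmax(z):
    """Softmax from Table 2.1: f(z)_j = exp(z_j) / sum_k exp(z_k)."""
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / np.sum(e)

def total_error(y, t):
    """Squared error of equation (2.1) for one (input, target) pair;
    summing this quantity over the dataset gives E."""
    return 0.5 * np.sum((y - t) ** 2)

# Illustrative usage for a single 10-class example.
z = np.array([1.0, 2.0, -0.5, 0.0, 0.3, 0.1, -1.0, 0.7, 0.2, 0.05])
y = softmax(z)                 # output-layer activations
t = np.zeros(10)
t[1] = 1.0                     # one-hot target
print(relu(np.array([-2.0, 3.0])), total_error(y, t))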

In the original gradient descent method, also called total gradient descent, the gradient of each weight is accumulated over all training data, and the update to the current weights follows

$w_{t+1,i} = w_{t,i} - \eta \frac{1}{N} g_i \qquad (2.2)$

where η is a positive learning rate, g is the accumulated gradient, and i indexes the weight components when all weights are stored in a single vector. This method is guaranteed to converge to a local minimum of the error function. When the volume of training data is huge, total gradient descent becomes impractical, because producing a single update requires averaging the gradients over the entire massive training set. In the stochastic version, each update is made after processing one randomly picked training example [2, 35]. As a trade-off for the faster update iterations, stochastic gradient descent (SGD) does not retain the general convergence guarantee of total gradient descent, but Zhang [35] and Bottou [2] show that with a learning rate η → 0, SGD still converges. Moreover, in practice the stochastic version offers some advantages over the total gradient algorithm. As discussed by Bottou [2], stochastic gradient descent often converges much faster when the data are redundant. Furthermore, even though the guaranteed convergence of total gradient descent to a local minimum is good, the fact that it cannot escape that local minimum can be a drawback, for instance when the local minimum is very poor while there are many other, much better local minima. SGD, with its random behavior, normally will not be trapped in that situation.

Besides using one example for each update, Bottou [2] also mentioned another common practice in implementing SGD, known as mini-batch SGD. Instead of one example, mini-batch SGD uses a small batch of training examples at each update iteration. Mini-batch SGD is usually preferable because it behaves less randomly, as the gradients are averaged over more training examples than in plain SGD. There is also a newer, popular update method called Adagrad, introduced by Duchi et al. [6]. This method adapts the learning rate for each individual weight component. Adagrad is known to help training converge faster and more stably [5, 34]. The Adagrad weight update is defined as

$w_{t+1,i} = w_{t,i} - \frac{\eta}{\sqrt{\sum_{t'=1}^{t} g_{t',i}^2}} \, g_{t,i} \qquad (2.3)$
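The following NumPy sketch (an illustration we add, not the thesis code) contrasts the update rules just described: the total gradient step of (2.2), a mini-batch SGD step, and an Adagrad step per (2.3); the eps term is a small stabilizer commonly added in practice, not part of equation (2.3).

import numpy as np

class Adagrad:
    """Per-component adaptive step as in equation (2.3)."""
    def __init__(self, dim, eta=0.01, eps=1e-8):
        self.eta, self.eps = eta, eps
        self.sum_sq = np.zeros(dim)        # running sum of squared gradients

    def delta(self, g):
        self.sum_sq += g ** 2
        return self.eta * g / (np.sqrt(self.sum_sq) + self.eps)

def total_gd_step(w, all_gradients, eta):
    """Total gradient descent, equation (2.2): average over all N examples."""
    return w - eta * np.mean(all_gradients, axis=0)

def minibatch_sgd_step(w, batch_gradients, adagrad):
    """Mini-batch SGD: one Adagrad update from a small batch of examples."""
    g = np.mean(batch_gradients, axis=0)
    return w - adagrad.delta(g)

# Illustrative usage on a 3-dimensional weight vector.
w = np.zeros(3)
opt = Adagrad(dim=3, eta=0.01)
batch = [np.array([0.1, -0.2, 0.3]), np.array([0.0, -0.1, 0.2])]
w = minibatch_sgd_step(w, batch, opt)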

2.2 MLitB Framework

In this thesis we use the MLitB framework to run and test several distributed algorithms. Machine Learning in the Browser, or MLitB, is a software framework for doing distributed machine learning computation in web browsers. The use of browsers is the heart of the framework, providing cheap, ubiquitous, and collaborative distributed learning [12, 23]. The common use of big CPU or GPU clusters indeed serves computation speed well, but not everyone can afford access to such facilities. MLitB, on the other hand, effortlessly transforms any device that supports a recent browser into a computing resource, making it affordable to everyone. Furthermore, every client that joins the framework can not only contribute resources, but also collaboratively improve a model by adding new training data, or download a model and modify it. In addition, in collaborative distributed computing there is a concern about private learning, where people have data and want to contribute to a learning model, but do not want to share the data with other clients [24]. In this case, bringing the models and computations onto their devices, as this framework does, might be the only way to make it possible.

2.2.1 Architecture

Fig. 2.2 MLitB architecture [23]

MLitB uses a client-server architecture and a message-passing communication method. MLitB is implemented mainly in JavaScript and employs recent web technologies such as Web Workers for multithreading and WebSocket for communication. Browsers act as the clients,

and there is a server that controls the system and aggregates the learning results from the clients in the form of model parameters. Figure 2.2 describes the framework in more detail by showing the data flow and communication between the components. The Master Server (1) initiates the framework and stores the configurations and parameters of all models that are currently running. The Boss (3) is an interface for creating a new model, uploading data, and managing workers. A new model is sent to the Master Server, making it visible to any Boss in the framework. Data uploaded by a Boss is transferred to the Data Server and can be downloaded by workers, which are usually assigned to work with a part of the data. XHR (4) is used for data communication, while for message passing, and for transmitting model configurations and parameters (2), the framework relies on WebSocket as well as XHR.

2.2.2 Synchronous Events Loop

Fig. 2.3 Synchronous event loop

MLitB originally implements synchronous distributed computing, where each step starts with the server dispatching a job by sending the current parameters to all clients. The clients then work with the parameters and their local data to produce updates. The server waits for and pools the updates from all clients before aggregating them into a single combined update. Finally, the combined update is applied to the current parameters to produce new parameters (see figure 2.3). These steps are repeated until the stopping condition is satisfied or the running model is stopped by the user. The synchronization guarantees that, at any given time, all clients have sent the same number of updates.

2.2.3 Asynchronous Events Loop

Fig. 2.4 Asynchronous event loop

Unlike the synchronous distributed computation, the event loop in the asynchronous version is more concise. As can be seen in figure 2.4, there is no pooling and aggregation of client updates. Each client sends its updates directly and independently to the updating process, and the resulting parameters are transmitted back immediately to that client. In this asynchronous fashion, each client can produce a different number of updates during the learning process.

Chapter 3 Distributed SGD Algorithms

Since the era of big data began, centralized machine learning computation has seemed inadequate for processing the data in a reasonable amount of time. Many machine learning researchers have tried to accelerate computing time by distributing the data and computations over many machines [3, 8, 10, 21, 27, 31]. Special frameworks have also emerged for handling large-scale computation by distributing data and computation over many machines; Map-Reduce [4] and GraphLab [20] are prominent examples. Beyond data parallelization, for extremely big models that do not fit on one machine (in computer memory or GPU memory), frameworks for model parallelization have been developed, in which the model parameters and architecture are split over many machines [5] or GPUs [25, 33].

Notice that most of the aforementioned works are built on a multicore setting, and some use very high-speed connections between machines in the style of supercomputers or computer clusters. In this work, however, we focus on developing distributed computing algorithms, especially SGD, over users' browsers. In this setting there are two main differences from the previous works: 1) the processing speeds of the clients' computers vary, and 2) the connections between clients and the server are usually much slower. As a consequence, we are not competing in terms of speed and scale; instead, we are interested in efficient distributed SGD algorithms that are suitable for our setting. In this chapter we present three different distributed SGD algorithms, namely synchronous SGD, asynchronous-gradient SGD, and asynchronous-parameter SGD. In addition, we present an extension of asynchronous-parameter SGD that reduces data transmission from clients to the server, which we call partial asynchronous-parameter SGD.

3.1 Synchronous SGD

The most straightforward extension of centralized SGD to distributed SGD is synchronous SGD. In the Map-Reduce framework, Chu et al. [4] implemented synchronous SGD, which partitions and distributes data over several nodes. Each machine computes partial gradients from its local data and sends the resulting gradients to a central node. The central node sums all of the partial gradients, performs the total gradient descent update to the current parameters, and broadcasts the new parameters to the computing nodes. In similar work, McDonald et al. [21] presented synchronous SGD for multinomial logistic regression, and also compared the method with two other methods of distributed training: combining predictions and combining parameters.

Originally, MLitB implements synchronous SGD similar to [4, 21], but with some adaptations for the client-server architecture. The high-level concept of synchronous SGD for a client-server architecture is described in section 2.2.2, and the technical implementations for the client and the server side are presented in algorithms 1 and 2, respectively. The server starts the training by sending the current parameters w_step to all clients via the DispatchJob procedure. Each client works with the given parameters to process N_c training examples from its own local dataset D. The training examples are drawn uniformly at random without replacement by popping one example at a time from the working set, a copied and shuffled version of D. (In practice, sampling with replacement performs better than sampling without replacement [27, 36].) Finally, the accumulated gradients computed from these examples are sent to the server, and the client step, step_c, is incremented. The step_c counter tracks how many updates each client has made.

Algorithm 1 SyncClient
1: D = partition of the dataset on this client
2: step_c ← 0
3: procedure WORK(Parameters w, Mini-batch size N_c)
4:   Initialize total gradient g = 0
5:   for all i ∈ 1 ... N_c do
6:     if Workingset is empty then
7:       Workingset ← COPYANDSHUFFLE(D)
8:     end if
9:     data ← POP(Workingset)
10:    g ← g + COMPUTEGRADIENT(data, w)
11:  end for
12:  SERVER.POOLGRADIENTS(g)
13:  step_c ← step_c + 1
14: end procedure

Algorithm 2 SyncServer
1: procedure DISPATCHJOB(w)
2:   for all c ∈ C do
3:     Call procedure WORK(w, N_c) on client c
4:   end for
5: end procedure
6: procedure POOLGRADIENTS(Gradients g)
7:   GradientsPool ← GradientsPool + {g}
8:   ngrad ← ngrad + 1
9:   if ngrad = m then
10:    g ← AGGREGATE(GradientsPool)
11:    UPDATE(g)
12:    GradientsPool ← {}
13:    ngrad ← 0
14:  end if
15: end procedure
16: procedure UPDATE(Gradients g)
17:   w_{step+1} ← w_step − ADAGRAD(g)
18:   step ← step + 1
19:   if running then
20:     DISPATCHJOB(w_step)
21:   end if
22: end procedure
▷ Main algorithm starts from here
23: step ← 0
24: Initialize mini-batch size N_c = 100
25: Initialize w_step with random values
26: Initialize SGD and Adagrad components
27: C ← {Client_1, ..., Client_m}
28: GradientsPool ← {}
29: ngrad ← 0
30: running ← true
31: DISPATCHJOB(w_step)

In the PoolGradients procedure, the server keeps pooling the gradients from every client until it receives the gradients from the last working client. After the last gradients arrive, all gradients are aggregated by a simple average. The resulting gradient is then passed to the Update procedure to produce the new parameters. If the running condition is true, the cycle starts again.
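To show how the pieces of algorithms 1 and 2 fit together, here is a compact single-process NumPy sketch of a synchronous cycle (our illustration, not the MLitB implementation); the gradient function, toy data, and the inlined Adagrad step are stand-ins.

import numpy as np

def client_work(w, local_data, batch_size, compute_gradient):
    """Algorithm 1 in miniature: accumulate gradients over batch_size
    examples taken from this client's local partition."""
    g = np.zeros_like(w)
    for example in local_data[:batch_size]:
        g += compute_gradient(example, w)
    return g

def synchronous_step(w, partitions, batch_size, compute_gradient,
                     sum_sq, eta=0.01, eps=1e-8):
    """Algorithm 2 in miniature: pool one gradient per client, aggregate
    them with a simple average, then apply an Adagrad-style update."""
    pool = [client_work(w, d, batch_size, compute_gradient) for d in partitions]
    g = np.mean(pool, axis=0)
    sum_sq += g ** 2                      # Adagrad accumulator (updated in place)
    return w - eta * g / (np.sqrt(sum_sq) + eps)

# Toy usage: 3 clients minimizing a quadratic, a few synchronous cycles.
rng = np.random.default_rng(0)
data = [rng.standard_normal(4) for _ in range(30)]
partitions = [data[i::3] for i in range(3)]          # split data over 3 clients
grad = lambda x, w: (w - x)                          # gradient of 0.5 * ||w - x||^2
w, sum_sq = np.zeros(4), np.zeros(4)
for _ in range(5):
    w = synchronous_step(w, partitions, batch_size=10,
                         compute_gradient=grad, sum_sq=sum_sq)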

3.2 Asynchronous-gradient SGD

When working with distributed synchronized SGD involving network latency and diverse processing speeds, it is often the case that some clients are idle while waiting for the others to finish. As a result, resource utilization is not optimal, and the waiting time usually grows with the number of clients. In practice, in order to decrease the waiting time, the traditional synchronization procedure is removed [5, 13, 17, 25, 32, 33].

3.2.1 Asynchronous SGD

Ho et al. [13] proposed a stale synchronization method, in which any client can continue working with stale parameters as long as the difference between its clock and the clock of the slowest client is less than some threshold. The synchronization for a clock happens once all clients have passed that clock, and the resulting parameters are made visible to all clients. The clock counts the number of updates produced by a client. Besides clock-based stale synchronization, one variation is based on the accumulated sum of unsynchronized local parameters [32]. These methods aim at giving the slowest client time to catch up without making the others wait, but if some clients are consistently slow, for instance when client processing speeds differ, the fastest client will always end up waiting for those slow clients.

Langford et al. [17] investigated an asynchronous SGD method for convex problems where each client can update the parameters independently. The authors proved that the method can converge well even though the parameters are mostly updated using stale gradients. Extending to non-convex problems, Dean et al. [5], Paine et al. [25], and Wu et al. [33] also show that asynchronous SGD, with the help of warmstarting, is able to converge faster than training on a single machine. In general, the advantage of asynchronous updates is that the client machines or processors can produce more updates per unit of time than with synchronous updates. By producing more updates, asynchronous SGD could potentially increase the convergence rate over time. However, the use of stale gradients removes the general theoretical guarantee of the synchronous version. Moreover, carelessly applying asynchronous SGD to a non-convex problem can result in divergence [11].

3.2.2 Asynchronous SGD on MLitB

To test our asynchronous SGD methods, we modify the MLitB platform. From an implementation point of view, changing synchronous SGD to asynchronous requires

a one-line modification on the client side, while major changes are made on the server side. The only change on the client side is that the gradients are now sent directly to the Update procedure instead of to PoolGradients. The step_c information is sent along with the gradient, as it is needed by the server to update its step.

In the server algorithm, the PoolGradients procedure is removed, and DispatchJob is modified so that it sends a job only to a specific client c instead of to all clients. The resulting server algorithm is shown as algorithm 3. Note that step no longer indicates the total number of updates made to the parameters. Instead, it signifies the number of cycles that have passed, as in synchronous SGD: one cycle means that all clients have sent their update to the server. To do this, we keep track of all step_c values sent by the clients along with their updates, and increase the server step if the current gradients come from the last client whose step_c equals the server's step. To track the number of times the parameters have been updated, a new variable t is added, which we call the parameter timestamp.

Algorithm 3 Async-gradServer
1: procedure DISPATCHJOB(Client c, Parameters w_t)
2:   Call procedure WORK(w_t, N_c) on client c
3: end procedure
4: procedure UPDATE(Client c, Gradient g)
5:   w_{t+1} ← w_t − ADAGRAD(g)
6:   t ← t + 1
7:   if c is the last client whose step_c = the server step then
8:     step ← step + 1
9:   end if
10:  if running then
11:    DISPATCHJOB(c, w_t)
12:  end if
13: end procedure
14: step ← 0
15: t ← 0    ▷ parameter timestamp
16: Initialize mini-batch size N_c = 100
17: Initialize w_t with random values
18: Initialize SGD and Adagrad components
19: C ← {Client_1, ..., Client_m}
20: running ← true
21: for all c ∈ C do
22:   DISPATCHJOB(c, w_t)
23: end for
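A minimal single-process sketch of the server-side Update procedure of algorithm 3 follows (our illustration; the cycle bookkeeping is simplified to a minimum over per-client counters, and the class and method names are ours).

import numpy as np

class AsyncGradServer:
    """Sketch of algorithm 3: each client's gradient is applied to the
    shared parameters as soon as it arrives, even though it may have been
    computed from stale parameters."""
    def __init__(self, dim, n_clients, eta=0.01, eps=1e-8):
        self.w = np.zeros(dim)              # current parameters w_t
        self.t = 0                          # parameter timestamp
        self.step = 0                       # cycle counter
        self.client_step = [0] * n_clients  # updates sent by each client
        self.sum_sq = np.zeros(dim)         # Adagrad accumulator
        self.eta, self.eps = eta, eps

    def update(self, client_id, g):
        # Adagrad-style update with a (possibly stale) gradient.
        self.sum_sq += g ** 2
        self.w = self.w - self.eta * g / (np.sqrt(self.sum_sq) + self.eps)
        self.t += 1
        # Advance the cycle counter once every client has reported for it.
        self.client_step[client_id] += 1
        if min(self.client_step) > self.step:
            self.step += 1
        return self.w                       # sent back to this client only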

3.3 Asynchronous-Parameters SGD

This section introduces a new distributed asynchronous SGD algorithm intended to improve on the traditional asynchronous SGD, which we now call asynchronous-gradient SGD, or async-grad.

3.3.1 Motivation

In contrast to synchronous SGD, which ensures that the gradients from all clients are computed using the latest parameters, asynchronous SGD in general has to allow the clients to update the parameters on the server using stale updates, i.e. updates calculated from slightly older parameters than the latest ones. In the case of async-grad, the update sent to the server is the gradient. Unfortunately, the new parameters resulting from an update with stale gradients may be completely different from the parameters that would result if the gradients were applied to the parameters from which they were computed. Updating parameters using stale gradients from several machines can lead to random changes to the parameters, and the overall result may get even worse as the number of machines increases [25]. Figure 3.1 illustrates this problem in a simple and intuitive way.

Fig. 3.1 Illustration of updating parameters in asynchronous-gradient SGD: (a) first step, (b) second step

Assuming that the clients always update the parameters in the same order, at the first step all clients compute their gradients from the same parameters. The solid arrows are the update vectors from each client (i.e. the output of the Adagrad method). The updates are applied

sequentially. After the last update is applied, the position of the resulting parameters (denoted by the flag) is far away from the initial position. The first step is ideally the best case, because every client uses the same initial parameters, so the earlier gradients that update the parameters have low staleness. From the second step on, the gradients from the clients are computed from different parameter locations, so the directions of the gradients may vary significantly, making the parameter movements look like a random walk (see figure 3.1b). There might be conditions that favor the asynchronous-gradient update, for example when all the clients' gradients point more or less in the same direction toward a local minimum. With a reasonably small learning rate, summing up these update vectors might move the parameters toward the local minimum faster than averaging the gradients as in the synchronous version.

Essentially, in order to avoid the random effect of integrating many stale gradients, we need a different form of update, one that produces new parameters that still make sense even when it is applied to parameters slightly different from the ones it was computed from. This section therefore discusses one possible instance of asynchronous SGD that uses such a robust form of update.

3.3.2 Previous Work

An update form that better fits this criterion is to integrate the parameters instead of the gradients. There are previous works that investigated methods of integrating parameters in distributed SGD [22, 30, 37]. Zinkevich et al. [37] aimed at a communication-efficient method of distributed SGD. In their work, each machine performs independent SGD training on its local data until convergence, and the final parameters from all machines are then integrated to obtain the combined result. Furthermore, Shamir et al. [30] observed that even though this one-time averaging reduces the variance, the bias can still be bad; the authors then introduced a method of averaging the parameters several times during learning. Even though convergence bounds are proven, both methods assume a strongly convex loss function, which is not the case for a CNN. In the non-convex direction, McDonald et al. [22] employ an iterative parameter mix to train structured perceptrons for NLP problems. They claimed that a one-time parameter mix was not suitable for non-convex problems, and proposed to mix the parameters at every epoch. In a structured perceptron, the parameters are updated online and the optimization does not use gradient descent; for gradient descent in particular, instead of mixing the parameters synchronously at each epoch, averaging the gradients is theoretically superior.

3.3.3 Our Approach

To extend the idea of parameter mixing to a gradient descent strategy for non-convex problems, we present a modification of the asynchronous SGD described in section 3.2. The idea of the algorithm is rather simple: instead of sending gradients to update the parameters on the server, the clients use the gradients to update their own parameters on their local machines. The resulting parameters are then sent to the server, which in turn uses them to update the current parameters with a procedure explained shortly. We call this method asynchronous-parameter SGD, or async-param. The difference between updating parameters using gradients and using parameters is analogous to moving something from one location to another using a direction versus a coordinate: a direction applied to different starting points ends up at different destinations, while with a coordinate the destination is always the same regardless of the starting position.

Fig. 3.2 Illustration of updating parameters in asynchronous-parameter SGD: (a) first step, (b) second step

Figure 3.2a illustrates the first update done by each client. The clients compute the gradients using the same initial parameters (denoted by the black circle), update their local parameters, and send the resulting parameters to the server. The solid arrows denote the movements from the initial parameters to the parameters coming from the clients. If the parameters are integrated sequentially by linear interpolation (α = 0.5), then the colored dots in the figure represent the resulting parameters after integrating the red, blue, and pink parameters one by one. Unlike updating with gradients, the resulting parameters after integrating all updates are still not too far from the initial position, which is how SGD should

actually work. After the first step, the clients compute the gradients from different initial parameters, as shown in figure 3.2b.

The implementation details are given by algorithms 4 and 5. The client now not only accumulates the gradients, but also applies the update to its parameters; each client has its own Adagrad and learning rate state. On the server side, the only change is the parameter update line inside the Update procedure (see algorithm 5, line 2): the Update procedure no longer integrates gradients, but parameters.

Algorithm 4 Async-paramClient
1: D = partition of the dataset on this client
2: step ← 0
3: Initialize SGD components (learning rate, sum of squared gradients for Adagrad)
4: procedure WORK(Parameters w_ct, Mini-batch size N_c)
5:   Initialize total gradient g = 0
6:   for all i ∈ 1 ... N_c do
7:     if Workingset is empty then
8:       Workingset ← COPYANDSHUFFLE(D)
9:     end if
10:    data ← POP(Workingset)
11:    g ← g + COMPUTEGRADIENT(data, w_ct)
12:  end for
13:  w_ct ← w_ct − ADAGRAD(g)
14:  step ← step + 1
15:  SERVER.UPDATE(w_ct)
16: end procedure

Algorithm 5 Server Async-param
1: procedure UPDATE(Client c, Parameters w_ct)
2:   w_{t+1} ← (1 − α) w_t + α w_ct
3:   t ← t + 1
4:   if c is the last client whose step_c = the server step then
5:     step ← step + 1
6:   end if
7:   if running then
8:     DISPATCHJOB(c, w_t)
9:   end if
10: end procedure

Clearly, when there is only one client, the two methods are equivalent to centralized SGD, as long as async-param completely replaces the latest parameters with the ones coming from the

client. However, when there are many clients, a client should not completely replace the parameters on the server at each update, because that would mean that at any given time the parameters represent only one client, namely the latest one to send an update. Instead, all the updates should be integrated. To do this, the update formula should involve both the latest parameters on the server and the new parameters coming from the client. There are many ways to do this, and one of them is linear interpolation, as shown by (3.1):

$w_{t+1} = (1 - \alpha) w_t + \alpha w_{ct} \qquad (3.1)$

where 0 ≤ α ≤ 1. The value of α defines the time scale of the updates. Most of the time we want α to be somewhere between 0 and 1, but in some extreme cases we may want to set α to 0 or 1. For instance, if there is a long delay in the connection between a client and the server, so that the parameters from the client arrive much later than normal, then we might want to discard those parameters by setting α to 0. On the other hand, if a client picks up the latest parameters on the server and is the first to update those parameters, then we may want to completely replace the server's parameters by setting α to 1.

3.3.4 Adaptive α

Under ideal conditions, where the speeds of the clients are more or less the same and there are no network problems, using a reasonable fixed value of α (e.g. 0.5) should be fine. A more flexible α, however, is desirable to handle the more realistic behaviour of a network of devices. Intuitively, we want α to be high when the parameters used by the client to produce the current update are fresh, and low otherwise. This idea can be represented by the exponential function (3.2):

$\alpha = \exp\!\left( -c \, \frac{\Delta t}{m} \right) \qquad (3.2)$

where Δt = t − t_ct is the timestamp difference between the current server parameters and the parameters used by client c to produce the current update, and m is the number of client machines running. The value of m can be seen as a normalization constant, because Δt grows linearly with m; hence the range of the resulting α is more or less the same even for different numbers of clients. In an ideal situation, the updates from the clients arrive at the server in the same order, so Δt is known, and we can tune the constant c to produce a reasonable α.
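The following NumPy sketch (our illustration) puts equations (3.1) and (3.2) side by side; note that the exact placement of the tuning constant c in (3.2) is our reading of the text, and the value c = 1.0 is only a placeholder.

import numpy as np

def async_param_update(w_server, w_client, alpha):
    """Equation (3.1): integrate a client's locally updated parameters
    into the server parameters by linear interpolation."""
    return (1.0 - alpha) * w_server + alpha * w_client

def adaptive_alpha(t_server, t_client, m, c=1.0):
    """Equation (3.2): alpha shrinks with the timestamp gap between the
    server parameters and the parameters the client worked from,
    normalized by the number of clients m. c = 1.0 is illustrative."""
    dt = t_server - t_client
    return float(np.exp(-c * dt / m))

# Illustrative usage: a client that started from 3-updates-old parameters
# on a 4-client system gets a moderate interpolation weight.
w_t = np.zeros(5)
w_ct = np.full(5, 0.1)                 # parameters sent by the client
alpha = adaptive_alpha(t_server=10, t_client=7, m=4)
w_next = async_param_update(w_t, w_ct, alpha)
print(round(alpha, 3), w_next[:2])     # alpha ≈ 0.472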

3.4 Partial Asynchronous-Parameters

3.4.1 Motivation

In distributed computing in general, minimizing transfer time could be another step, after asynchronization, toward improving resource utilization. Especially as models get bigger, the performance of a distributed framework can suffer from high network latency. In this situation there are several ways to keep the waiting time as short as possible, for instance by continuing to work with cached parameters on the local machine while sending an update or receiving recent parameters. However, if we want the client to keep working with the latest parameters, then it is necessary to wait for the server to update the parameters and to work with the new result. In that case there is not much we can do in terms of the algorithm, other than making the transferred data as small as possible.

For sparse problems, where each example is known to influence only a small part of the parameters, the Hogwild method shows that updating only the relevant parameters without any synchronization can reduce the bottleneck effect in a distributed multicore setting [27]. Unfortunately, the exact notion of a sparse problem does not hold for our case: it is almost impossible to know in advance which elements of the parameters of a CNN model are influenced by each example. Nevertheless, for models with many parameters there is a possibility that only a small part of the parameters changes significantly at each update.

3.4.2 Partial Update

Our approach to reducing the network latency follows the idea of updating only part of the parameters. We apply the idea to the async-param method, but it could also be used with async-grad and synchronous SGD. At each step, the clients select elements of the parameters (or of the gradients, for async-grad and synchronous SGD) based on a criterion or selection method, and then send only the selected part to the server. We denote the fraction of elements selected at each update by ρ, specified by the user as a percentage.

Random and Sorted Selection Methods

A sufficient selection method should eventually select the important elements of the update vectors, those that would move the parameters as close as possible to the result of the full parameter update. Random selection is one possible candidate: in the long run, all elements will be chosen equally often. The convergence speed, however, might be slower than with the full update, because the method may select unimportant elements before finally selecting the important ones at a later step. A more desirable selection method should use some information to pick the more important elements over the others. In the case of gradient descent, we can use gradient-based information to rank the importance of the corresponding elements of the parameters. Specifically, after the clients update their local parameters using a gradient-based update (e.g. the result of the Adagrad update), the elements of the resulting parameters are sorted according to the absolute values of their corresponding gradient-based information. The top-ranked elements of the parameters are then sent to the server.

Local and Global Indexing

Since the clients send only a partial update, it becomes necessary to also attach the index of each transmitted value. Especially for neural network models, including CNNs, where the parameters are structured into filters and layers, different indexing methods can affect the performance of the selection methods above. In this work we present two indexing methods, namely global indexing and local indexing. In global indexing, the selection is run over the whole parameter vector, while in local indexing the selection is done locally, per filter for a convolutional layer and per layer for a fully connected layer. The latter approach ensures that some of the parameters in each filter and layer are updated in every step.
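As an illustration of the selection step (not the thesis implementation), the NumPy sketch below picks ρ percent of the elements either at random or by sorting on the absolute gradient-based update, and applies them on the server with the interpolation of equation (3.1); it uses global indexing, whereas local indexing would run the same selection per filter or per layer.

import numpy as np

def select_partial_update(w_local, update_magnitude, rho, method="sorted", rng=None):
    """Pick rho percent of the parameter elements a client will transmit.
    'sorted' ranks elements by the absolute value of their gradient-based
    update (e.g. the Adagrad delta); 'random' picks uniformly at random.
    Returns the (indices, values) pair that would be sent to the server."""
    n = w_local.size
    k = max(1, int(round(n * rho / 100.0)))
    if method == "sorted":
        idx = np.argsort(-np.abs(update_magnitude))[:k]   # largest updates first
    else:
        rng = rng or np.random.default_rng()
        idx = rng.choice(n, size=k, replace=False)
    return idx, w_local[idx]

def apply_partial_update(w_server, idx, values, alpha):
    """Server side: interpolate only the transmitted elements."""
    w_new = w_server.copy()
    w_new[idx] = (1.0 - alpha) * w_server[idx] + alpha * values
    return w_new

# Illustrative usage with rho = 10% of a 100-element parameter vector.
rng = np.random.default_rng(0)
w_local, delta = rng.standard_normal(100), rng.standard_normal(100)
idx, vals = select_partial_update(w_local, delta, rho=10)
w_server = apply_partial_update(np.zeros(100), idx, vals, alpha=0.5)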

Chapter 4 Experiments and Analysis

To test the distributed algorithms, we use the SURFsara HPC Cloud infrastructure. We allocate a VM with many cores and open many browser clients in one VM. This environment may differ from the real one in which MLitB is supposed to be deployed, but we can simulate the real situation by adding artificial delays. As test datasets we use MNIST [14] and CIFAR10 [15], as they are quite popular for testing whether new algorithms work. MNIST is a grayscale handwritten digit image dataset with images of size 28 x 28; the training set contains 50000 images over 10 classes. CIFAR10 is a color image dataset with 10 classes; its training set contains 50000 images of size 32 x 32, with 5000 per class. We do not apply any augmentation to the two training sets.

For the MNIST dataset, the CNN configuration is defined as:

Input : size=(28,28)
Conv  : filters=8, size=(5,5), stride=1, actfunc=relu
Pool  : size=(2,2), stride=2
Conv  : filters=16, size=(5,5), stride=1, actfunc=relu
Pool  : size=(3,3), stride=3
FC    : neurons=10, actfunc=softmax

and for CIFAR10 we use the following configuration:

Input : size=(32,32,3)
Conv  : filters=12, size=(5,5), stride=1, actfunc=relu
Pool  : size=(3,3), stride=2
Conv  : filters=24, size=(5,5), stride=1, actfunc=relu
Pool  : size=(4,4), stride=4
FC    : neurons=10, actfunc=softmax

For all experiments we used a mini-batch size of 100, the learning rates were set to 0.01, and the parameter updates used the Adagrad method. We ran each experiment 5 times and plot the averaged results.

4.1 Uniform Processing Speed

In this experiment we compare the performance of the three distributed SGD algorithms in a setting with uniform processing speeds. Uniform processing speeds are achieved simply by using a multicore VM; the processing speed of each core might not be exactly the same, but small variations are due to I/O or other background processes. The VM has 11 cores, on which we create 10 client browsers to distribute the SGD computation.

Fig. 4.1 Training classification error for different distributed SGD algorithms

As the results in figure 4.1 show, the convergence rate of async-grad is slower than that of the synchronous version and async-param. Async-param, on the other hand, gives similar performance to the synchronous version, and is even slightly better in terms of wall clock time. The colored shadows represent the standard deviations of the error rate. In both figures we can see that the async-grad method has higher variance than the other two methods. In terms of the number of steps, both async-param and async-grad produce more steps than the synchronous version, as can be seen in figure 4.1a.

4.2 Different Values of α

In order to see the effect of α, we test async-param with different values of α: 0.2, 0.5, and 0.8.

Fig. 4.2 Training classification errors for different values of α

The experimental results in figure 4.2 suggest that these variations of α do not have a significant impact on the overall performance of the async-param method. What is important is that the parameters are integrated; the learning rate and Adagrad seem to take care of smoothing the changes to the parameters.

4.3 Slow Updates

In this experiment we test the performance of async-param in a situation involving delays. Delays in receiving the computed parameters from clients can be due to slow computation or network latency. To simulate this situation, at each iteration we add a random delay that forces a client to wait a certain time before actually sending its update. The actual time delay is the product of a random delay factor (df) and the time the client spent computing on the data. The random delay factor itself is drawn from a half-normal distribution, capped as in (4.1):

$df = \min\left( \left| \mathcal{N}(0,1) \right|,\ 3 \right) \qquad (4.1)$

Limiting the maximum value of df is necessary since this experiment is also run with synchronous SGD, which needs to wait for the updates from all clients and has no mechanism to stop waiting if some clients take too long.
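A minimal sketch of this delay simulation follows (our illustration; it assumes, as the surrounding text indicates, that the value 3 in (4.1) is an upper bound on the delay factor).

import numpy as np

def random_delay_factor(rng, cap=3.0):
    """Draw a delay factor from a half-normal distribution, |N(0, 1)|,
    capped at 3 per our reading of equation (4.1)."""
    return min(abs(rng.standard_normal()), cap)

# The artificial delay a client waits before sending its update is the
# delay factor times the time it actually spent computing.
rng = np.random.default_rng(0)
compute_time = 1.2                        # seconds, purely illustrative
delay = random_delay_factor(rng) * compute_time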

Fig. 4.3 Training classification errors after adding random delays

Our experimental results demonstrate that the convergence rate per iteration of the async-param method is as good as that of the synchronous version; in terms of wall clock time, async-param surpasses the asynchronous version. We test async-param with the adaptive α and with a fixed α value of 0.8. The adaptive method appears to be slightly better during the earlier updates, but eventually the two converge to the same level. The async-grad, on the other hand, is still the worst; its variance even becomes significantly larger at the later steps, while the variances of the other methods become smaller. In conclusion, async-grad appears to be strongly affected by the staleness of the gradients.

4.4 Partial Update

4.4.1 Selection Methods

In this experiment we compare different indexing and selection methods for the partial async-param SGD. The value of ρ was set to 10% for all instances in this experiment.

Fig. 4.4 Training classification errors for different indexing-selection methods

It is clear from figure 4.4a that the sorting method helps convergence compared to the random selection method. Local indexing is slightly better than global indexing in terms of per-step error, and the difference becomes more evident when the methods are compared in wall clock time, as shown in figure 4.4b. One reason is that global sorting is slower than local sorting, because sorting all elements at once has a higher time complexity than sorting a few elements multiple times. We also observed that at the beginning of each run the processor speeds were not stable, so the variances at earlier times are high; this can worsen the time records of a slower method.

4.4.2 Different Values of ρ

In this experiment we test the performance of the partial async-param method for different values of ρ. For the selection method we use local sorting, as it was the best method in the previous experiment.

Fig. 4.5 Training classification errors for different values of ρ

Figure 4.5 shows some interesting results. First, we see that the partial update method works well over a large range of ρ. More surprisingly, some values of ρ lead the partial update to perform better than the full update. The fastest convergence rate in the comparison shown by both figures is obtained by the partial async-param method with ρ = 30%, which also surpasses the performance of the full update significantly. Besides reducing the size of the transferred data, the partial update can thus also be thought of as a kind of regularization.

4.5 Performance on Cifar10

We have seen some interesting results for the asynchronous-parameter SGD tested on the MNIST dataset. Although MNIST is a good dataset for testing a new algorithm, it is considered an easy problem. In order to verify the previous results, we compare the three algorithms on CIFAR10, which is known to be a more challenging dataset than MNIST. We compare the mini-batch error, defined as

$E = \frac{1}{n} \sum_{i=1}^{n} \left( 1 - Y_{c_i} \right)$

where n is the number of examples processed by all clients in each step, and $Y_{c_i}$ is the CNN output of the neuron corresponding to the true class of the i-th example.
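For reference, a short NumPy sketch of this error measure (our illustration, with toy probabilities):

import numpy as np

def minibatch_error(softmax_outputs, true_classes):
    """Mini-batch error E = (1/n) * sum_i (1 - Y_{c_i}), where Y_{c_i} is the
    softmax output of the neuron for the true class of the i-th example."""
    n = len(true_classes)
    y_true = softmax_outputs[np.arange(n), true_classes]
    return float(np.mean(1.0 - y_true))

# Illustrative usage: 3 examples, 10 classes, true-class probability 0.55 each.
probs = np.full((3, 10), 0.05)
probs[0, 2] = probs[1, 7] = probs[2, 0] = 0.55
print(minibatch_error(probs, np.array([2, 7, 0])))   # ≈ 0.45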