A More Robust Asynchronous SGD


A More Robust Asynchronous SGD

Said Al Faraby
Artificial Intelligence, University of Amsterdam

This thesis is submitted for the degree of Master of Science
February 2015


Abstract

With the massive amount of data available for learning good models, the time required to process the data has become a big concern. In this thesis we compare distributed synchronous and asynchronous SGD algorithms, tested on MLitB, a web-browser-based distributed framework. We also introduce new strategies to improve the performance of asynchronous SGD, namely asynchronous-parameter and partial asynchronous-parameter SGD. Finally, we report experimental results for the distributed SGD algorithms. The results show that asynchronous-parameter and partial asynchronous-parameter SGD are more robust to stale parameter updates than traditional asynchronous SGD. Partial asynchronous-parameter SGD obtained similar or better error rates than the full update, with the size of the transferred update starting from 10% of the original size.


Table of contents

List of figures
List of tables
Nomenclature

1 Introduction
  1.1 Motivation
  1.2 Research Goals
  1.3 Thesis Contributions

2 Background
  2.1 Convolutional Neural Networks
    2.1.1 CNN Architecture and Configurations
    2.1.2 Training CNN
  2.2 MLitB Framework
    2.2.1 Architecture
    2.2.2 Synchronous Events Loop
    2.2.3 Asynchronous Events Loop

3 Distributed SGD Algorithms
  3.1 Synchronous SGD
  3.2 Asynchronous-gradient SGD
    3.2.1 Asynchronous SGD
    3.2.2 Asynchronous SGD on MLitB
  3.3 Asynchronous-Parameters SGD
    3.3.1 Motivation
    3.3.2 Previous Work
    3.3.3 Our Approach
    3.3.4 Adaptive α
  3.4 Partial Asynchronous-Parameters
    3.4.1 Motivation
    3.4.2 Partial Update

4 Experiments and Analysis
  4.1 Uniform Processing Speed
  4.2 Different Values of α
  4.3 Slow Update
  4.4 Partial Update
    4.4.1 Selection Methods
    4.4.2 Different Values of ρ
  4.5 Performance on CIFAR10

5 Conclusion and Discussion

References

List of figures

2.1 Convolutional Neural Networks Architecture
2.2 MLitB architecture [23]
2.3 Synchronous event loop
2.4 Asynchronous event loop
3.1 Illustration of updating parameters in asynchronous-gradient SGD
3.2 Illustration of updating parameters in asynchronous-parameter SGD
4.1 Training classification error for different distributed SGD algorithms
4.2 Training classification errors for different values of α
4.3 Training classification errors after adding random delays
4.4 Training classification errors for different indexing-selection methods
4.5 Training classification errors for different values of ρ
4.6 Mini-batch error on CIFAR10 dataset


List of tables

2.1 Activation functions


Chapter 1 Introduction

1.1 Motivation

Years ago, there was a time when the limited quantity of available data prevented researchers from creating accurate models. These days, the capacity of data storage has increased enormously and collecting data has become much easier than before. As a result, gigantic volumes of data can be collected every day. The challenge now is to process the data in a sensible amount of time. Many researchers have been working to speed up learning on very large datasets [4, 7, 10, 21, 26]. Most of this work used a multicore CPU setting to distribute data and models. Some recent works have started to utilize GPUs for general high-performance computation, especially for distributed computation [25, 33]. Besides using multicore machines or computer clusters for distributed computing, there are opportunities to build cheap distributed computing over web browsers with the use of the JavaScript virtual machine [23, 28]. This platform also offers flexibility for research collaboration, with an internet connection as the only requirement. The portability of web technologies enables everyone to build their own system using the shared code.

Stochastic gradient descent (SGD) has become a standard algorithm for solving complex optimization problems. With the increase in the volume of data, there is a need to distribute the data and the SGD computations over many machines. SGD is an iterative optimization algorithm, so a distributed SGD will naturally require synchronization. However, synchronization potentially reduces the efficiency of resource utilization, so many have tried to distribute SGD asynchronously [5, 17, 32].

In this thesis we are interested in investigating distributed computation of SGD via web browsers. There are some inherent issues that differentiate web browsers from the multicore setting as a medium for distributed computation.

In the web browser setting, the processing speeds of users' machines might vary significantly compared to the uniform speeds of a multicore setting. Another factor that might influence the performance of SGD is the high network latency of transferring data over an internet connection. These two factors make distributed SGD with synchronization much less optimal in utilizing computing resources. Although traditional asynchronous SGD could be a straightforward solution, it appears to be less stable and more sensitive to the learning parameter settings and the number of machines [5, 11, 25].

1.2 Research Goals

In this thesis, we attempt to answer the following questions:

1. What are the relative performances of the existing distributed SGD algorithms if they are implemented in the web browser setting?
2. How can we improve asynchronous SGD?
3. Bigger models will cause higher network latency. Can we reduce the amount of transferred data without detrimental effects on learning?

This thesis focuses on comparing the performances of different distributed SGD algorithms that run the same architecture and hyperparameters. Finding the best possible combination of hyperparameters in order to achieve the best performance (e.g. error rate) is out of the scope of this thesis.

1.3 Thesis Contributions

We provide comparisons of synchronous and asynchronous distributed SGD algorithms. We also demonstrate the effectiveness of a new asynchronous SGD algorithm that is more robust to stale updates. Moreover, we present empirical results for a partial update method that is able to reduce the size of the transferred data while retaining performance comparable to the full update.

Chapter 2 Background

2.1 Convolutional Neural Networks

As the case study for this thesis, we use a convolutional neural network as the model for learning an image classification task. Convolutional neural networks, or CNNs, were first introduced by Fukushima [9], whose model was inspired by the model of the human visual nervous system proposed by Hubel and Wiesel. The important aspect of this model is that it is invariant to the position of the input pattern, and only dependent on its shape. The design was later improved by LeCun et al. [19], who also popularized the name Convolutional Neural Networks. Since then, CNNs have shown success in many different applications, including face recognition [18], ImageNet classification [16], and speech recognition [1].

2.1.1 CNN Architecture and Configurations

The main components of a CNN that distinguish it from a common ANN are the convolution and the sub-sampling layers. A convolution layer and its previous layer are connected by sets of weights which are usually called filters. These filters have the same function as the filters used in the convolution operation of image processing, or signal processing in general. For instance, in image processing, there are edge filters to detect edges in input images. The difference is that in a CNN we do not define in advance which patterns to detect, hence we do not know the right values for the filters. Instead, we initialize the filters with random values and learn the correct ones during the learning process. The convolution operations are usually followed by applying an activation function to the resulting values, and the final output is called a feature map. A convolutional layer usually consists of many filters, and each filter produces one feature map.

Sub-sampling or down-sampling usually refers to reducing the size of the input signal. Besides size reduction, sub-sampling can also be thought of as adding translation invariance to the CNN. A commonly used sub-sampling method in CNNs is max-pooling. Originally, max-pooling operates by splitting the input matrix into non-overlapping smaller grids and taking the maximum value of each grid as the output. In practice, people sometimes also use a small overlap between the grids. There are no connection weights in sub-sampling layers, or the weights can be considered constant values of 1. The outputs of sub-sampling operations are also termed feature maps.

Fig. 2.1 Convolutional Neural Networks Architecture

A typical CNN architecture consists of an input layer, followed by some pairs of convolution and sub-sampling layers, and ends with one or more fully connected layers. A graphical illustration of a CNN architecture is shown in figure 2.1. The input layer has a size of 28 x 28, and is followed by the first convolution layer (C1) with 4 filters, each of dimension 5 x 5. The output of the C1 layer consists of 4 feature maps of size 24 x 24, which are then sub-sampled by the first sub-sampling layer (S1) with a grid size of 2 x 2, producing 4 feature maps of 12 x 12. The same procedure is continued by the second pair of convolution (C2: filter size of 3 x 3) and sub-sampling (S2: grid size of 2 x 2) layers. The S2 layer is fully connected to the last layer, which consists of 10 neurons, each of which is the result of a full dot product between all output values in the S2 layer and the connection weights coming into that neuron.

There are some additional configurations to complete the architecture above. Besides filter and grid size, people sometimes use an option called stride to define the distance between the centers of two consecutive filter positions in the input signal, or the distance between two neighboring grids. The last configuration to be specified is the activation functions for the convolution layers and the output layer. There are some well-known activation functions used for ANNs.

Throughout this thesis we will use the rectifier (relu) function for every convolution layer, and the softmax function for the output layer, as those are the most popular choices for CNN classifiers (see table 2.1 for definitions).

Table 2.1 Activation functions
rectifier: f(x) = max(0, x)
softmax:   f(z)_j = \exp(z_j) / \sum_{k=1}^{K} \exp(z_k), for j = 1, ..., K

Finally, the complete configuration of the illustrated CNN can be written in the following textual format:

Input : size=(28,28)
Conv  : filters=4, size=(5,5), stride=1, actfunc=relu
Pool  : size=(2,2), stride=2
Conv  : filters=8, size=(3,3), stride=1, actfunc=relu
Pool  : size=(2,2), stride=2
FC    : neurons=10, actfunc=softmax

2.1.2 Training CNN

A CNN is trained using the back-propagation procedure and gradient descent optimization [19]. Back-propagation was introduced by Rumelhart et al. [29]. The procedure has two passes. The first is the forward pass, in which, given an input x, each neuron in each layer produces an output value. The objective of training is to adjust the connection weights w such that the outputs of the network y(x), which are the values of the neurons in the output layer, become as close as possible to the target values t of the input x. The difference between the output of the network and the target defines the error for the input x. The total error function is defined as

E = \frac{1}{2} \sum_{x,t} \sum_{j} (y(x)_j - t_j)^2    (2.1)

where (x, t) is a pair of input data and target values, y(x) is the vector of output neuron values, and j indexes the output neurons. The gradient descent optimization method is then used to minimize the error function by taking partial derivatives of E with respect to each weight in the network [29]. Computing the partial derivatives is done sequentially, propagating derivatives from neurons in the last layer backward toward the input layer, which is why this pass is called the backward pass.
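To make the definitions in table 2.1 and equation (2.1) concrete, the following is a minimal NumPy sketch, not taken from the thesis or from MLitB (which is implemented in JavaScript); the function names and the toy input are illustrative only.

import numpy as np

def relu(x):
    # rectifier: element-wise max(0, x)
    return np.maximum(0.0, x)

def softmax(z):
    # subtract the max for numerical stability; the result is unchanged
    # because softmax is invariant to adding a constant to all inputs
    e = np.exp(z - np.max(z))
    return e / e.sum()

def total_error(outputs, targets):
    # equation (2.1): half the sum of squared differences between the
    # network outputs and the targets
    return 0.5 * np.sum((outputs - targets) ** 2)

# toy example: scores for K = 10 classes of a single input
z = np.random.randn(10)
y = softmax(relu(z))          # forward pass through relu then softmax
t = np.zeros(10); t[3] = 1.0  # one-hot target for class 3
print(total_error(y, t))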

In the original gradient descent method, which is also called total gradient descent, the gradient of each weight is accumulated over all training data, and the update to the current weights follows

w_{t+1,i} = w_{t,i} - \eta \frac{1}{N} g_i    (2.2)

where \eta is a positive learning rate, g is the accumulated gradient, and i indexes the weight components if all weights are stored in a single vector. This method is guaranteed to converge to a local minimum of the error function. When the volume of training data is huge, employing total gradient descent becomes impractical, because in order to produce one update it needs to average the gradients over the entire massive training set. In the stochastic version, each update is made after processing one randomly picked training example [2, 35]. As a trade-off for the faster update iterations, stochastic gradient descent (SGD) does not retain the general convergence guarantee of total gradient descent, but Zhang [35] and Bottou [2] show that by letting the learning rate \eta \to 0, SGD can still converge. Moreover, in practice the stochastic version offers some advantages over the total gradient algorithm. As discussed by Bottou [2], stochastic gradient descent often converges much faster when the data are redundant. Furthermore, even though it is good that total gradient descent is guaranteed to converge to a local minimum, the fact that it cannot escape that local minimum can become a drawback, for instance when the local minimum is very poor while there are many other local minima that are much better. SGD, with its random behavior, will normally not be trapped in that situation.

Besides using one example for each update, Bottou [2] also mentioned another common practice in implementing SGD, known as mini-batch SGD. Instead of using one example, mini-batch SGD uses a small batch of training examples at each update iteration. Mini-batch SGD is usually preferable because it behaves less randomly, due to averaging gradients over more training examples than in plain SGD. There is also a popular update method called Adagrad, introduced by Duchi et al. [6]. This method adapts the learning rate for each specific weight component. Adagrad is known to help training converge faster and more stably [5, 34]. The weight update of the Adagrad method is defined as

w_{t+1,i} = w_{t,i} - \frac{\eta}{\sqrt{\sum_{t'=1}^{t} g_{t',i}^2}} g_{t,i}    (2.3)
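The following is a minimal NumPy sketch, again not taken from MLitB, of a mini-batch SGD loop with the per-component Adagrad scaling of equation (2.3); the grad_fn argument, the small epsilon added for numerical safety, and the toy objective at the end are assumptions made for the example.

import numpy as np

def adagrad_step(w, grad, sum_sq_grad, eta=0.01, eps=1e-8):
    # accumulate the squared gradients per component (the sum in eq. 2.3)
    sum_sq_grad += grad ** 2
    # scale the learning rate per component and take a gradient step
    w -= eta * grad / (np.sqrt(sum_sq_grad) + eps)
    return w, sum_sq_grad

def minibatch_sgd(w, data, grad_fn, batch_size=100, eta=0.01):
    # grad_fn(w, batch) is assumed to return the gradient averaged over the batch
    sum_sq_grad = np.zeros_like(w)
    indices = np.random.permutation(len(data))          # shuffle once per epoch
    for start in range(0, len(data), batch_size):
        batch = data[indices[start:start + batch_size]]
        g = grad_fn(w, batch)
        w, sum_sq_grad = adagrad_step(w, g, sum_sq_grad, eta)
    return w

# toy usage: pull w toward data sampled around 3.0
data = np.random.randn(1000, 1) + 3.0
grad_fn = lambda w, batch: (w - batch).mean(axis=0)     # gradient of 0.5*||w - x||^2
w = minibatch_sgd(np.zeros(1), data, grad_fn, eta=1.0)
print(w)                                                # has moved most of the way toward 3.0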

2.2 MLitB Framework

In this thesis we use the MLitB framework to run and test several distributed algorithms. Machine Learning in the Browser, or MLitB, is a software framework for doing distributed machine learning computation in browsers. The use of browsers is at the heart of the framework, providing cheap, ubiquitous, and collaborative distributed learning [12, 23]. The common use of big clusters of CPUs or GPUs indeed serves computation speed well, but not everyone can afford access to such facilities. MLitB, on the other hand, effortlessly transforms any device that supports a recent browser into a computing resource, making it affordable to everyone. Furthermore, every client that joins the framework not only contributes resources, but can also collaboratively improve a model by adding new training data, or download a model and modify it to improve it. In addition, in collaborative distributed computing there is a concern about private learning, where people have some data and want to contribute to some learning models, but do not want to share the data with other clients [24]. In this case, bringing the models and computations to their devices, as this framework does, might be the only way to make it possible.

2.2.1 Architecture

Fig. 2.2 MLitB architecture [23]

MLitB uses a client-server architecture and a message-passing communication method. MLitB is implemented mainly in JavaScript and employs recent web technologies such as Web Workers for multithreading and WebSocket for communication. Browsers act as the clients, and there is a server that controls the system and aggregates the learning results from the clients in the form of model parameters.

Figure 2.2 describes the framework in more detail by showing the data flow and communication between the components. The Master Server (1) initiates the framework and stores the configurations and parameters of all models that are currently running. A Boss (3) serves as an interface to create new models, upload data, and manage workers. A new model is sent to the Master Server so that it becomes visible to any Boss in the framework. Data uploaded by any Boss is transferred to the Data Server and can be downloaded by workers, which are usually assigned to work with part of the data. XHR (4) is used for data communication, while for message passing and for transmitting model configurations and parameters (2), the framework relies on WebSocket as well as XHR.

2.2.2 Synchronous Events Loop

Fig. 2.3 Synchronous event loop

MLitB originally implements synchronous distributed computing, where each step starts with the server dispatching the job by sending the most recent parameters to all of the clients. The clients then work with the parameters and their local data to produce updates. The server waits and pools the updates from all clients before aggregating them into a single combined update. Finally, the combined update is used to update the current parameters and produce new parameters (see Figure 2.3). These processes are repeated until the stopping condition is satisfied or the running model is stopped by the user. The synchronization process guarantees that all clients have sent the same number of updates at any given time.

2.2.3 Asynchronous Events Loop

Fig. 2.4 Asynchronous event loop

Unlike in the synchronous distributed computation, the event loop in the asynchronous version is more concise. As seen in figure 2.4, there is no pooling and aggregating process for the clients' updates. Each client can send its updates directly and independently to the updating process, and the resulting parameters are transmitted back immediately to that client. In this asynchronous fashion, each client can produce a different number of updates during the learning process.


Chapter 3 Distributed SGD Algorithms

Since the era of big data began, centralized machine learning computations have seemed inadequate to process the data in a reasonable amount of time. Many machine learning researchers have tried to accelerate the computing time by distributing the data and computations over many machines [3, 8, 10, 21, 27, 31]. Special frameworks have also emerged for handling large-scale computation by distributing the data and computation over many machines; Map-Reduce [4] and GraphLab [20] are prominent examples. Beyond data parallelization, for the case of extremely big models that do not fit into one machine (computer memory or GPU memory), new frameworks for model parallelization have been developed, in which the model parameters and architecture are split across many machines [5] or GPUs [25, 33].

Notice that most of the aforementioned works are built on a multicore setting, and some used very high-speed connections between machines in the style of supercomputers or computer clusters. In this work, however, we focus on developing a distributed computing algorithm, especially SGD, running through users' browsers. In this setting, there are two main differences from the previous works: 1) the processing speeds of the clients' computers vary, and 2) the connections between the clients and the server are usually much slower. As a consequence, we are not competing in terms of speed and scale; instead we are interested in efficient distributed SGD algorithms that are suitable for our setting. In this chapter we present three different distributed SGD algorithms, namely synchronous SGD, asynchronous-gradient SGD, and asynchronous-parameter SGD. In addition, we also present an extension of asynchronous-parameter SGD that reduces data transmission from clients to server, which we call partial asynchronous-parameter SGD.

3.1 Synchronous SGD

The most straightforward extension of centralized SGD into distributed SGD is synchronous SGD. In the Map-Reduce framework, Chu et al. [4] implemented synchronous SGD, which partitions and distributes the data over several nodes. Each machine computes the partial gradients from its local data and sends the resulting gradients to a central node. The central node sums all of the partial gradients, applies the total gradient descent update to the current parameters, and broadcasts the new parameters to the computing nodes. In similar work, McDonald et al. [21] presented synchronous SGD for multinomial logistic regression, and also compared the method with two other methods of distributed training, combining predictions and combining parameters.

Originally, MLitB implements synchronous SGD similar to [4, 21], but with some adaptations for the client-server architecture. The high-level concept of synchronous SGD for a client-server architecture is described in section 2.2.2, and the technical implementations for the client and the server side are presented in algorithms 1 and 2 respectively. The server starts the training by sending the current parameters w_step to all clients via the DispatchJob procedure. Each client works with the given parameters to process N_c training examples from its own local dataset D. The training examples are drawn uniformly at random without replacement by popping one example at a time from the working set, a copied and shuffled version of D. In practice, sampling with replacement performs better than sampling without replacement [27, 36]. Finally, the accumulated gradients computed from the examples are sent to the server, and the client's step_c is incremented. The step_c counter tracks how many updates each client has made.

Algorithm 1 SyncClient
1: D = partition of the dataset on this client
2: step_c ← 0
3: procedure WORK(Parameters w, Mini-batch size N_c)
4:   Initialize total gradient g = 0
5:   for all i ∈ 1...N_c do
6:     if Workingset is empty then
7:       Workingset ← COPYANDSHUFFLE(D)
8:     end if
9:     data ← POP(Workingset)
10:    g ← g + COMPUTEGRADIENT(data, w)
11:  end for
12:  SERVER.POOLGRADIENTS(g)
13:  step_c ← step_c + 1
14: end procedure

Algorithm 2 SyncServer
1: procedure DISPATCHJOB(w)
2:   for all c ∈ C do
3:     Call procedure WORK(w, N_c) for client c
4:   end for
5: end procedure
6: procedure POOLGRADIENTS(Gradients g)
7:   GradientsPool ← GradientsPool + {g}
8:   ngrad ← ngrad + 1
9:   if ngrad = m then
10:    g ← AGGREGATE(GradientsPool)
11:    UPDATE(g)
12:    GradientsPool ← {}
13:    ngrad ← 0
14:  end if
15: end procedure
16: procedure UPDATE(Gradients g)
17:   w_{step+1} ← w_step - ADAGRAD(g)
18:   step ← step + 1
19:   if running then
20:     DISPATCHJOB(w_step)
21:   end if
22: end procedure
Main Algorithm starts from here
23: step = 0
24: Initialize mini-batch size N_c
25: Initialize w_step with random values
26: Initialize SGD and Adagrad components
27: C ← {Client 1 ... Client m}
28: GradientsPool ← {}
29: ngrad ← 0
30: running ← true
31: DISPATCHJOB(w_step)

In the PoolGradients procedure, the server keeps pooling the gradients from every client until it receives the gradients from the last working client. After the last gradients arrive, all gradients are aggregated by a simple averaging function. The resulting gradients are then passed to the Update procedure to produce the new parameters. If the running condition is true, the cycle starts again.
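As an illustration of algorithm 2, here is a minimal single-process NumPy sketch, not the actual MLitB JavaScript implementation, of the server-side pool-and-aggregate cycle; the compute_gradient placeholder and the toy objective are assumptions made for the example.

import numpy as np

class SyncServer:
    def __init__(self, dim, n_clients, eta=0.01, eps=1e-8):
        self.w = np.random.randn(dim) * 0.01   # current parameters w_step
        self.m = n_clients
        self.eta, self.eps = eta, eps
        self.sum_sq_grad = np.zeros(dim)       # Adagrad accumulator
        self.pool = []                         # GradientsPool
        self.step = 0

    def pool_gradients(self, g):
        # collect one gradient per client; update only when all m have arrived
        self.pool.append(g)
        if len(self.pool) == self.m:
            g_avg = np.mean(self.pool, axis=0) # AGGREGATE: simple average
            self.update(g_avg)
            self.pool = []

    def update(self, g):
        # Adagrad update (eq. 2.3) applied to the averaged gradient
        self.sum_sq_grad += g ** 2
        self.w -= self.eta * g / (np.sqrt(self.sum_sq_grad) + self.eps)
        self.step += 1

# simulated cycle: each "client" computes a gradient from the same parameters
def compute_gradient(w):                       # placeholder for a client's work
    return w - 1.0                             # gradient of 0.5*||w - 1||^2

server = SyncServer(dim=5, n_clients=3)
for _ in range(100):                           # 100 synchronous steps
    grads = [compute_gradient(server.w) for _ in range(3)]
    for g in grads:
        server.pool_gradients(g)
print(server.w)                                # moves toward the optimum at w = 1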

3.2 Asynchronous-gradient SGD

When working with distributed synchronized SGD that involves network latency and diverse processing speeds, it is often the case that some clients are idle while waiting for the others to finish. As a result, resource utilization is not optimal, and the waiting time usually increases with the number of clients. In practice, in order to decrease the waiting time, the traditional synchronization procedure is removed [5, 13, 17, 25, 32, 33].

3.2.1 Asynchronous SGD

Ho et al. [13] proposed a stale synchronization method, where any client can continue working with stale parameters as long as the difference between its clock and the clock of the slowest client is less than some threshold. The synchronization for a clock happens once all clients have passed that clock, and the resulting parameters are made visible to all clients. The clock counts the number of updates produced by the clients. Besides clock-based stale synchronization, one variation is based on the accumulated sum of unsynchronized local parameters [32]. These methods aim at giving the slowest client time to catch up without making the others wait, but if there are some consistently slow clients, for instance in the case of different client processing speeds, then the fastest client will always end up waiting for those slow clients.

Langford et al. [17] investigated an asynchronous SGD method for convex problems where each client can update the parameters independently. The authors proved that the method can converge well even though the parameters are mostly updated using stale gradients. Extending to non-convex problems, Dean et al. [5], Paine et al. [25], and Wu et al. [33] also show that asynchronous SGD, with the help of a warm-starting method, is able to converge faster than training on a single machine. In general, the advantage of asynchronous updates is that the client machines or processors can produce more updates per unit of time compared to synchronous updates. By having more updates, asynchronous SGD could potentially increase the convergence rate over time. However, the use of stale gradients removes the general theoretical guarantee of the synchronous version. Moreover, carelessly applying asynchronous SGD to a non-convex problem can result in divergence [11].

3.2.2 Asynchronous SGD on MLitB

To test our asynchronous SGD methods, we make modifications to the MLitB platform.

From an implementation point of view, changing synchronous SGD to asynchronous requires a one-line modification on the client side, while major changes are made on the server side. The only change on the client side is that the gradients are now sent directly to the Update procedure instead of PoolGradients. The step_c information is sent along with the gradient, as it is needed by the server to update its step. In the server algorithm, the PoolGradients procedure is removed, and the DispatchJob procedure is modified such that it sends the job only to a specific client c instead of all clients. The resulting modifications of the server algorithm are shown by algorithm 3. Notice that step now does not indicate the total number of updates that have been made to the parameters. Instead, it signifies the number of cycles that have passed, as in the synchronous SGD. One cycle means that all clients have sent their update to the server. To achieve that, we keep track of all step_c values coming from the clients when they send their updates, and increase the server step if the current gradients come from the last client whose step_c is equal to the server's step. For tracking the number of times the parameters have been updated, a new variable t is added, which we call the parameter timestamp.

Algorithm 3 Async-gradServer
1: procedure DISPATCHJOB(Client c, Parameters w_t)
2:   Call procedure WORK(w_t, N_c) for client c
3: end procedure
4: procedure UPDATE(Client c, Gradient g)
5:   w_{t+1} ← w_t - ADAGRAD(g)
6:   t ← t + 1
7:   if c is the last client whose step_c = the server step then
8:     step ← step + 1
9:   end if
10:  if running then
11:    DISPATCHJOB(c, w_t)
12:  end if
13: end procedure
14: step = 0
15: t ← 0   (parameter timestamp)
16: Initialize mini-batch size N_c
17: Initialize w_t with random values
18: Initialize SGD and Adagrad components
19: C ← {Client 1 ... Client m}
20: running ← true
21: for all c ∈ C do
22:   DISPATCHJOB(c, w_t)
23: end for
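To make the staleness bookkeeping of algorithm 3 concrete, here is a minimal single-process simulation in NumPy (again not the MLitB code); the round-robin client schedule and the toy objective are assumptions made for the example.

import numpy as np

class AsyncGradServer:
    def __init__(self, dim, eta=0.1, eps=1e-8):
        self.w = np.zeros(dim)
        self.t = 0                              # parameter timestamp
        self.eta, self.eps = eta, eps
        self.sum_sq_grad = np.zeros(dim)

    def update(self, grad, grad_timestamp):
        # staleness = how many updates happened since this gradient's parameters were sent out
        staleness = self.t - grad_timestamp
        self.sum_sq_grad += grad ** 2
        self.w -= self.eta * grad / (np.sqrt(self.sum_sq_grad) + self.eps)
        self.t += 1
        return staleness

def gradient(w):                                # toy objective 0.5*||w - 1||^2
    return w - 1.0

server = AsyncGradServer(dim=3)
m = 4                                           # number of simulated clients
# each client caches the parameters (and their timestamp) it last received
cached = [(server.w.copy(), server.t) for _ in range(m)]
worst_staleness = 0
for cycle in range(50):
    for c in range(m):
        w_c, t_c = cached[c]
        g = gradient(w_c)                       # gradient from possibly stale parameters
        worst_staleness = max(worst_staleness, server.update(g, t_c))
        cached[c] = (server.w.copy(), server.t) # client receives fresh parameters
print(server.w)                                 # moves toward the optimum at w = 1
print(worst_staleness)                          # 3 (= m - 1) for this round-robin schedule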

3.3 Asynchronous-Parameters SGD

This section introduces a new distributed asynchronous SGD algorithm that is intended to improve the traditional asynchronous SGD, which we now call asynchronous-gradient SGD or async-grad.

3.3.1 Motivation

In contrast to synchronous SGD, which ensures that the gradients from all clients are computed using the latest parameters, general asynchronous SGD has to permit the clients to update the parameters on the server using stale updates, i.e. updates calculated from slightly older parameters than the latest ones. In the case of async-grad, the updates sent to the server are the gradients. Unfortunately, the new parameters resulting from an update with stale gradients may be completely different from the parameters that would result if the gradients were applied to the parameters from which they were computed. Updating parameters using stale gradients from several machines can result in random changes to the parameters. The overall results might become even worse as the number of machines increases [25]. Figure 3.1 illustrates this problem in a simple and intuitive way.

Fig. 3.1 Illustration of updating parameters in asynchronous-gradient SGD ((a) first step, (b) second step)

Assuming that the clients always update the parameters in the same order, at the first step all clients compute the gradients from the same parameters. The solid arrows are the update vectors from each client (i.e. the output of the Adagrad method).

The updates are applied sequentially. After the last update is done, the position of the resulting parameters (denoted by the flag) is far away from the initial position. The first step is ideally the best case, because every client uses the same initial parameters, so the earlier gradients that update the parameters have low staleness levels. From the second step on, the gradients from each client are computed from different parameter locations, so the directions of the gradients might vary significantly, making the parameter movements look like a random walk (see figure 3.1b). There might be conditions that favor the asynchronous-gradient update, for example when all the gradients from the clients point more or less in the same direction toward a local minimum. With a reasonably small learning rate, summing up these update vectors might shift the parameters toward the local minimum faster than averaging the gradients as in the synchronous version.

Essentially, in order to avoid the random effect of integrating many stale gradients, we need a different form of update, one that produces new parameters that still make sense even when applied to parameters slightly different from the ones it was computed from. Thus, this section discusses one possible instance of asynchronous SGD that uses such a robust form of update.

3.3.2 Previous Work

An update form that fits this criterion better is to integrate the parameters instead of the gradients. Some previous works have investigated methods of integrating parameters in distributed SGD [22, 30, 37]. Zinkevich et al. [37] aimed at a communication-efficient method of distributed SGD. In their work, each machine performs independent SGD training on its local data until convergence. The final parameters from all machines are then integrated to get the combined result. Furthermore, Shamir et al. [30] observed that even though the one-time averaging method reduces the variance, the bias can still be bad. The authors then introduced a method of averaging the parameters several times during learning. Even though convergence bounds are proved, both methods assume a strongly convex loss function, which is not the case for CNNs. In the non-convex direction, McDonald et al. [22] employ iterative parameter mixing to train a structured perceptron for NLP problems. They claimed that one-time parameter mixing was not suitable for non-convex problems, and proposed to mix the parameters at every epoch. In the structured perceptron, the parameters are updated online, and the optimization does not use the gradient descent method. For the gradient descent method in particular, instead of mixing the parameters synchronously at each epoch, averaging the gradients is theoretically superior.

3.3.3 Our Approach

To extend the idea of parameter mixing to a gradient descent strategy for non-convex problems, we present a modification of the asynchronous SGD described in section 3.2. The idea of the algorithm is fairly simple: instead of sending gradients to update the parameters on the server, the clients use the gradients to update their own parameters on their local machines. The resulting parameters are then sent to the server, which in turn uses them to update the current parameters with a procedure that will be explained shortly. We call this method asynchronous-parameter SGD or async-param. The difference between updating parameters using gradients and using parameters is analogous to moving something from one location to another using a direction or a coordinate, respectively. A direction, when applied to different starting points, will end up at different destinations, whereas with a coordinate the destination will always be the same regardless of the starting position.

Fig. 3.2 Illustration of updating parameters in asynchronous-parameter SGD ((a) first step, (b) second step)

Figure 3.2a illustrates the first update done by each client. The clients compute the gradients using the same initial parameters (denoted by the black circle), update their local parameters, and send the resulting parameters to the server. The solid arrows denote the movements from the initial parameters to the parameters coming from the clients. If the parameters are integrated sequentially by the linear interpolation method (α = 0.5), then the colored dots in the figure represent the resulting parameters after the one-by-one integration of the red, blue, and pink parameters respectively. Unlike updating with gradients, the resulting parameters after integrating all updates are still not too far from the initial position, which is how SGD should actually work.

After the first step, the clients compute the gradients from different initial parameters, as shown in figure 3.2b.

The implementation details are given by algorithms 4 and 5. On the client side, the client now not only accumulates the gradients, but also applies the update to the parameters. Each client has its own Adagrad and learning rate properties. On the server side, the only change made is at the parameter update line inside the Update procedure (see algorithm 5 line 2). The Update procedure no longer integrates the gradients, but the parameters instead.

Algorithm 4 Async-paramClient
1: D = partition of the dataset on this client
2: step = 0
3: Initialize SGD components (learning rate, sum of squared gradients for Adagrad)
4: procedure WORK(Parameters w_ct, Mini-batch size N_c)
5:   Initialize total gradient g = 0
6:   for all i ∈ 1...N_c do
7:     if Workingset is empty then
8:       Workingset ← COPYANDSHUFFLE(D)
9:     end if
10:    data ← POP(Workingset)
11:    g ← g + COMPUTEGRADIENT(data, w_ct)
12:  end for
13:  w_ct ← w_ct - ADAGRAD(g)
14:  step ← step + 1
15:  SERVER.UPDATE(w_ct)
16: end procedure

Algorithm 5 Server Async-param
1: procedure UPDATE(Client c, Parameters w_ct)
2:   w_{t+1} ← (1 - α) w_t + α w_ct
3:   t ← t + 1
4:   if c is the last client whose step_c = the server step then
5:     step ← step + 1
6:   end if
7:   if running then
8:     DISPATCHJOB(c, w_t)
9:   end if
10: end procedure

Clearly, when there is only one client, the two methods are equivalent to centralized SGD, as long as async-param completely replaces the latest parameters with the ones coming from the client.

However, in the situation where there are many clients, clients should not completely replace the parameters on the server at each update, because it would mean that at any given time the parameters represent only one client, namely the latest client that sent an update. Instead, all the updates should be integrated. To do this, the update formula should involve both the latest parameters on the server and the new parameters coming from the clients. There are many ways of doing this, and one of them is linear interpolation, as shown by equation (3.1):

w_{t+1} = (1 - α) w_t + α w_{ct}    (3.1)

where 0 ≤ α ≤ 1. The value of α defines the time-scale of the updates. Most of the time we want α to be somewhere between 0 and 1, but in some extreme cases we may want to set α to 0 or 1. For instance, if there is a long delay in the connection between a client and the server such that the parameters from the client arrive much later than normal, then we might want to discard those parameters by setting α to 0. On the other hand, if there is a client which picks up the latest parameters on the server and becomes the first to update those parameters, then we may want to completely replace the server's parameters by setting α to 1.

3.3.4 Adaptive α

Under ideal conditions, where the speeds of the clients are more or less the same and there are no network problems, using a reasonable fixed value of α (e.g. 0.5) should be fine. On the other hand, a more flexible α is desirable to tackle the more realistic behaviour of a network of devices. Intuitively, we want α to be high when the parameters used by the client to produce the current update are new, and low otherwise. This idea can be represented by an exponential function, as in equation (3.2):

α = \exp\left( -c \, \frac{\Delta t}{m} \right)    (3.2)

where \Delta t = t - t_{ct} is the timestamp difference between the current server parameters and the parameters used by client c to produce the current update, and m is the number of client machines running. The value of m can be seen as a normalization constant, because \Delta t grows linearly with m. Hence, the range of the resulting α is more or less the same even when applied to different numbers of clients. Under ideal conditions, the updates from the clients will arrive at the server in a fixed order, so that \Delta t is known and the constant c can be tuned to produce a reasonable α.
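A minimal NumPy sketch of the async-param server update of equation (3.1) combined with the adaptive α of equation (3.2) is given below; it is an illustration rather than the MLitB implementation, and the constant c = 1.0 is an arbitrary choice made for the example.

import numpy as np

def adaptive_alpha(server_t, client_t, n_clients, c=1.0):
    # eq. (3.2): alpha decays with the staleness of the client's parameters,
    # normalized by the number of clients m
    delta_t = server_t - client_t
    return np.exp(-c * delta_t / n_clients)

class AsyncParamServer:
    def __init__(self, dim, n_clients, c=1.0):
        self.w = np.zeros(dim)
        self.t = 0                  # parameter timestamp
        self.m = n_clients
        self.c = c

    def update(self, w_client, client_t):
        # eq. (3.1): linear interpolation between server and client parameters
        alpha = adaptive_alpha(self.t, client_t, self.m, self.c)
        self.w = (1.0 - alpha) * self.w + alpha * w_client
        self.t += 1
        return self.w, self.t       # new parameters sent back to the client

# example: a client that computed its local parameters from a stale timestamp
server = AsyncParamServer(dim=4, n_clients=10)
server.t = 12                       # pretend 12 updates have already happened
w_from_client = np.ones(4)          # locally updated parameters from a client
w_new, t_new = server.update(w_from_client, client_t=9)   # staleness of 3
print(w_new)                        # a blend, weighted by alpha = exp(-3/10)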

3.4 Partial Asynchronous-Parameters

3.4.1 Motivation

In distributed computing in general, minimizing transfer time can be another step toward improving resource utilization after asynchronization. Especially as models get bigger, the performance of a distributed framework can suffer from high network latency. In this situation, there are many ways to keep the waiting time as short as possible, for instance by continuing to work with cached parameters on the local machine while sending an update or receiving recent parameters. However, if we want the client to keep working with the latest parameters, it is necessary to wait for the server to update the parameters and then work with the newly resulting parameters. In this case, there is not much we can do in terms of the algorithm, except make the transferred data as small as possible.

For sparse problems, where each example is known to influence only a small part of the parameters, the Hogwild method shows that updating only the relevant parameters, without any synchronization, can reduce the bottleneck effect in a distributed multicore setting [27]. Unfortunately, the exact notion of sparse problems does not hold for our case. It is almost impossible to know in advance which elements of the CNN model's parameters are influenced by each example. Nevertheless, for models with lots of parameters there is in general a possibility that only a small part of the parameters changes significantly at each update.

3.4.2 Partial Update

Our approach to reducing the network latency follows the idea of updating only part of the parameters. We apply the idea to the async-param method, but it could also be used with async-grad and synchronous SGD. At each step, the clients select elements of the parameters (or of the gradients, for async-grad and synchronous SGD) based on a criterion or a selection method, and then send only the selected part to the server. We denote the fraction of elements that are selected at each update as ρ, and it is specified by the user as a percentage.

Random and Sorted Selection Methods

A good selection method should eventually select the important elements of the update vectors, those that would move the parameters as close as possible to the result of the full parameter update. Random selection is one possible candidate. In the long run, all elements would eventually be chosen equally often, although the convergence might be slower than with the full update, because the method may select unimportant elements before finally selecting the important ones at later steps.

A more desirable selection method should use some information to pick the more important elements over the others. In the case of gradient descent, we can use gradient-based information to rank the importance of the corresponding elements of the parameters. Specifically, after the clients update their local parameters using a gradient-based update (e.g. the result of the Adagrad update method), the elements of the resulting parameters are sorted according to the absolute values of their corresponding gradient-based information. The top-ranked elements of the parameters are then sent to the server.

Local and Global Indexing

Since the clients send only a partial update, it becomes necessary to also attach the index of each update value. Especially for neural network models, including CNNs, where the parameters are structured by filters and layers, there are different indexing methods that can affect the performance of the previous selection methods. In this work we present two methods, namely global indexing and local indexing. With global indexing, the selection methods are run over the whole parameter vector, while with local indexing the selection is done locally, per filter for the convolutional layers and per layer for the fully connected layers. The latter approach ensures that some of the parameters in each filter and each layer are updated at every step.
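The following NumPy sketch, an illustration rather than the thesis code, shows the sorted selection for a single flat parameter vector: the client keeps only the top ρ fraction of elements ranked by the absolute value of their gradient-based update, and sends the selected values together with their (global) indices; the function names are hypothetical.

import numpy as np

def partial_update(w_local, adagrad_update, rho=0.10):
    # rank elements by the magnitude of their gradient-based update and
    # keep the top rho fraction (global indexing over the flat vector)
    k = max(1, int(rho * w_local.size))
    idx = np.argsort(-np.abs(adagrad_update))[:k]   # indices of the k largest
    return idx, w_local[idx]                        # indices + selected values

def apply_partial(w_server, idx, values, alpha=0.5):
    # server side: interpolate only the received elements (eq. 3.1 restricted
    # to the selected indices); the remaining parameters are left untouched
    w_new = w_server.copy()
    w_new[idx] = (1.0 - alpha) * w_server[idx] + alpha * values
    return w_new

# toy example with 1000 parameters and rho = 10%
w_server = np.zeros(1000)
update = np.random.randn(1000) * 0.01               # pretend this is the Adagrad output
w_local = w_server - update                         # client's local step
idx, vals = partial_update(w_local, update, rho=0.10)
print(len(idx))                                     # 100 elements transferred
w_server = apply_partial(w_server, idx, vals)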

Chapter 4 Experiments and Analysis

To test the distributed algorithms, we use the SURFsara HPC Cloud infrastructure. We allocate a VM with many cores and open many browser clients in one VM. The environment might differ from the real one where MLitB is supposed to be deployed, but we can simulate the real situation by adding artificial delays. As test datasets, we use MNIST [14] and CIFAR10 [15], as they are quite popular for testing whether new algorithms work or not. MNIST is a gray-scale handwritten digit image dataset with images of size 28 x 28. The training dataset contains images for 10 classes. CIFAR10 is a color image dataset that contains 10 classes. The training set contains images with 5000 for each class. The CIFAR10 image size is 32 x 32. We do not apply any augmentation to the two training datasets.

For the MNIST dataset, the CNN configuration is defined as:

Input : size=(28,28)
Conv  : filters=8, size=(5,5), stride=1, actfunc=relu
Pool  : size=(2,2), stride=2
Conv  : filters=16, size=(5,5), stride=1, actfunc=relu
Pool  : size=(3,3), stride=3
FC    : neurons=10, actfunc=softmax

and for CIFAR10 we use the following configuration:

Input : size=(32,32,3)
Conv  : filters=12, size=(5,5), stride=1, actfunc=relu
Pool  : size=(3,3), stride=2
Conv  : filters=24, size=(5,5), stride=1, actfunc=relu
Pool  : size=(4,4), stride=4
FC    : neurons=10, actfunc=softmax

For all experiments we used a mini-batch size of 100, the learning rates were set to 0.01, and the parameter updates used the Adagrad method. We also ran each experiment 5 times and plot the averaged results.

4.1 Uniform Processing Speed

In this experiment, we compare the performance of the three distributed SGD algorithms in a setting with uniform processing speeds. The uniform processing speeds are achieved by simply using a multicore VM. The processing speed of each core might not be exactly the same; small variations can be due to I/O or other background processes. The number of cores used for the VM is 11, and from those we create 10 client browsers to distribute the SGD computations.

Fig. 4.1 Training classification error for different distributed SGD algorithms

As the results in figure 4.1 show, the convergence rate of async-grad is slower than that of the synchronous version and async-param. On the other hand, async-param gives similar performance to the synchronous version, and is even slightly better in terms of wall clock time. The colored shadows represent the standard deviations of the error rate. In both figures we can see that the async-grad method has higher variance than the other two methods. In terms of the number of steps, both async-param and async-grad produce more steps than the synchronous version, as can be seen in figure 4.1a.

4.2 Different Values of α

In order to see the effect of α, we test async-param with different values of α, namely 0.2, 0.5, and 0.8.

Fig. 4.2 Training classification errors for different values of α

The experimental results in figure 4.2 suggest that variations of the α value do not have a significant impact on the overall performance of the async-param method. What is important is that the parameters are integrated. The learning rate and Adagrad seem to take care of the smooth changes to the parameters.

4.3 Slow Update

In this experiment we want to test the performance of async-param in situations involving delay. The delay in receiving the computed parameters from clients can be due to slow computation or network latency. In order to simulate this situation, at each iteration we add a random delay that forces the clients to wait a certain time before actually sending the update. The actual time delay is the product of a random delay factor (df) and the time spent to finish computing the data. The random delay factor itself is drawn from a half-normal distribution and capped as in (4.1):

df = \min(\,|N(0, 1)|,\; 3\,)    (4.1)
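A minimal NumPy sketch of drawing this delay factor is shown below, under the assumption that equation (4.1) caps a half-normal sample at an upper bound of 3; the function name is illustrative.

import numpy as np

def delay_factor(rng=np.random):
    # half-normal sample |N(0,1)|, capped at 3 so no client waits indefinitely
    return min(abs(rng.standard_normal()), 3.0)

# each client's artificial wait = delay_factor() * time spent computing its mini-batch
print([round(delay_factor(), 2) for _ in range(5)])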

Limiting the maximum value of df is necessary since this experiment is also run on synchronous SGD, which needs to wait for all updates from the clients but has no mechanism to stop waiting if some clients take too long.

Fig. 4.3 Training classification errors after adding random delays

Our experimental results demonstrate that the convergence rate per iteration of the async-param method is as good as that of the synchronous version. In terms of wall clock time, async-param surpasses the synchronous version. We test async-param with the adaptive α and with a fixed α value of 0.8. The adaptive method appears to be slightly better during the earlier updates, but eventually the two methods converge to the same level. On the other hand, async-grad is still the worst; its variance now even becomes significantly larger at later steps, while the variances of the other methods become smaller. In conclusion, async-grad seems to be greatly affected by the staleness of the gradients.

4.4 Partial Update

4.4.1 Selection Methods

In this experiment we compare different indexing and selection methods for the partial async-param SGD. The ρ value was set to 10% for all instances in this experiment.

Fig. 4.4 Training classification errors for different indexing-selection methods

It is clearly seen from figure 4.4a that the sorting method helps the training converge faster than the random selection method. The local indexing is slightly better than the global one in terms of step-based errors, and the difference becomes more evident when the methods are compared in wall clock time, as shown by figure 4.4b. One of the reasons is that the global-sorting method is slower than the local-sorting method, because sorting more elements at one time has a higher time complexity than sorting few elements multiple times. We also observed that at the beginning of each run the speeds of the processors were not stable, so the variances at the earlier times are high. This could worsen the time records of a slower method.

4.4.2 Different Values of ρ

In this experiment we test the performance of the partial async-param method for different values of ρ. For the selection method, we use local-sorting, as it is the best method from the previous experiment.

Fig. 4.5 Training classification errors for different values of ρ

Figure 4.5 shows some interesting results. First, we see that the partial update method works well for a large range of ρ. The more surprising result is that some values of ρ lead the partial update to perform better than the full update. The fastest convergence rate in the comparison shown by both figures is obtained by the partial async-param method with ρ = 30%, which also significantly surpasses the performance of the full update. Besides reducing the size of the transferred data, the partial update can also be thought of as a kind of regularization method.

4.5 Performance on CIFAR10

We have seen some interesting results from asynchronous-parameter SGD tested on the MNIST dataset. Although MNIST is a good dataset for testing new algorithms, it is considered an easy problem. In order to verify the previous good results, we compare the three algorithms on CIFAR10, which is known to be a more challenging dataset than MNIST. We compare the mini-batch error, which is defined as

E = \frac{1}{n} \sum_{i=1}^{n} (1 - Y_{c_i})

where n is the number of examples processed by all clients in each step, and Y_{c_i} is the CNN output of the neuron corresponding to the true class of the i-th example.
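As an illustration of this error measure (not code from the thesis), the following NumPy snippet computes the mini-batch error from a matrix of softmax outputs and the true class labels; the array names are hypothetical.

import numpy as np

def minibatch_error(probs, labels):
    # probs: (n, K) softmax outputs for the n examples processed in this step
    # labels: (n,) integer true classes
    # E = 1/n * sum_i (1 - Y_{c_i}), where Y_{c_i} is the probability assigned
    # to the true class of example i
    true_class_probs = probs[np.arange(len(labels)), labels]
    return np.mean(1.0 - true_class_probs)

# toy example: 4 examples, 10 classes; uniform predictions give an error of 0.9
probs = np.full((4, 10), 0.1)
labels = np.array([0, 3, 7, 9])
print(minibatch_error(probs, labels))   # 0.9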


Report: Privacy-Preserving Classification on Deep Neural Network Report: Privacy-Preserving Classification on Deep Neural Network Janno Veeorg Supervised by Helger Lipmaa and Raul Vicente Zafra May 25, 2017 1 Introduction In this report we consider following task: how

More information

Ensemble methods in machine learning. Example. Neural networks. Neural networks

Ensemble methods in machine learning. Example. Neural networks. Neural networks Ensemble methods in machine learning Bootstrap aggregating (bagging) train an ensemble of models based on randomly resampled versions of the training set, then take a majority vote Example What if you

More information

PERFORMANCE OF GRID COMPUTING FOR DISTRIBUTED NEURAL NETWORK. Submitted By:Mohnish Malviya & Suny Shekher Pankaj [CSE,7 TH SEM]

PERFORMANCE OF GRID COMPUTING FOR DISTRIBUTED NEURAL NETWORK. Submitted By:Mohnish Malviya & Suny Shekher Pankaj [CSE,7 TH SEM] PERFORMANCE OF GRID COMPUTING FOR DISTRIBUTED NEURAL NETWORK Submitted By:Mohnish Malviya & Suny Shekher Pankaj [CSE,7 TH SEM] All Saints` College Of Technology, Gandhi Nagar, Bhopal. Abstract: In this

More information

A Brief Look at Optimization

A Brief Look at Optimization A Brief Look at Optimization CSC 412/2506 Tutorial David Madras January 18, 2018 Slides adapted from last year s version Overview Introduction Classes of optimization problems Linear programming Steepest

More information

Facial Expression Classification with Random Filters Feature Extraction

Facial Expression Classification with Random Filters Feature Extraction Facial Expression Classification with Random Filters Feature Extraction Mengye Ren Facial Monkey mren@cs.toronto.edu Zhi Hao Luo It s Me lzh@cs.toronto.edu I. ABSTRACT In our work, we attempted to tackle

More information

Chapter 2 Basic Structure of High-Dimensional Spaces

Chapter 2 Basic Structure of High-Dimensional Spaces Chapter 2 Basic Structure of High-Dimensional Spaces Data is naturally represented geometrically by associating each record with a point in the space spanned by the attributes. This idea, although simple,

More information

Deep Learning. Vladimir Golkov Technical University of Munich Computer Vision Group

Deep Learning. Vladimir Golkov Technical University of Munich Computer Vision Group Deep Learning Vladimir Golkov Technical University of Munich Computer Vision Group 1D Input, 1D Output target input 2 2D Input, 1D Output: Data Distribution Complexity Imagine many dimensions (data occupies

More information

Asynchronous Parallel Stochastic Gradient Descent. A Numeric Core for Scalable Distributed Machine Learning Algorithms

Asynchronous Parallel Stochastic Gradient Descent. A Numeric Core for Scalable Distributed Machine Learning Algorithms Asynchronous Parallel Stochastic Gradient Descent A Numeric Core for Scalable Distributed Machine Learning Algorithms J. Keuper and F.-J. Pfreundt Competence Center High Performance Computing Fraunhofer

More information

Machine Learning and Computational Statistics, Spring 2015 Homework 1: Ridge Regression and SGD

Machine Learning and Computational Statistics, Spring 2015 Homework 1: Ridge Regression and SGD Machine Learning and Computational Statistics, Spring 2015 Homework 1: Ridge Regression and SGD Due: Friday, February 6, 2015, at 4pm (Submit via NYU Classes) Instructions: Your answers to the questions

More information

4.12 Generalization. In back-propagation learning, as many training examples as possible are typically used.

4.12 Generalization. In back-propagation learning, as many training examples as possible are typically used. 1 4.12 Generalization In back-propagation learning, as many training examples as possible are typically used. It is hoped that the network so designed generalizes well. A network generalizes well when

More information

On the Effectiveness of Neural Networks Classifying the MNIST Dataset

On the Effectiveness of Neural Networks Classifying the MNIST Dataset On the Effectiveness of Neural Networks Classifying the MNIST Dataset Carter W. Blum March 2017 1 Abstract Convolutional Neural Networks (CNNs) are the primary driver of the explosion of computer vision.

More information

How Learning Differs from Optimization. Sargur N. Srihari

How Learning Differs from Optimization. Sargur N. Srihari How Learning Differs from Optimization Sargur N. srihari@cedar.buffalo.edu 1 Topics in Optimization Optimization for Training Deep Models: Overview How learning differs from optimization Risk, empirical

More information

Memory Bandwidth and Low Precision Computation. CS6787 Lecture 10 Fall 2018

Memory Bandwidth and Low Precision Computation. CS6787 Lecture 10 Fall 2018 Memory Bandwidth and Low Precision Computation CS6787 Lecture 10 Fall 2018 Memory as a Bottleneck So far, we ve just been talking about compute e.g. techniques to decrease the amount of compute by decreasing

More information

Neural Networks. CE-725: Statistical Pattern Recognition Sharif University of Technology Spring Soleymani

Neural Networks. CE-725: Statistical Pattern Recognition Sharif University of Technology Spring Soleymani Neural Networks CE-725: Statistical Pattern Recognition Sharif University of Technology Spring 2013 Soleymani Outline Biological and artificial neural networks Feed-forward neural networks Single layer

More information

The exam is closed book, closed notes except your one-page cheat sheet.

The exam is closed book, closed notes except your one-page cheat sheet. CS 189 Fall 2015 Introduction to Machine Learning Final Please do not turn over the page before you are instructed to do so. You have 2 hours and 50 minutes. Please write your initials on the top-right

More information

Louis Fourrier Fabien Gaie Thomas Rolf

Louis Fourrier Fabien Gaie Thomas Rolf CS 229 Stay Alert! The Ford Challenge Louis Fourrier Fabien Gaie Thomas Rolf Louis Fourrier Fabien Gaie Thomas Rolf 1. Problem description a. Goal Our final project is a recent Kaggle competition submitted

More information

COMP9444 Neural Networks and Deep Learning 7. Image Processing. COMP9444 c Alan Blair, 2017

COMP9444 Neural Networks and Deep Learning 7. Image Processing. COMP9444 c Alan Blair, 2017 COMP9444 Neural Networks and Deep Learning 7. Image Processing COMP9444 17s2 Image Processing 1 Outline Image Datasets and Tasks Convolution in Detail AlexNet Weight Initialization Batch Normalization

More information

732A54/TDDE31 Big Data Analytics

732A54/TDDE31 Big Data Analytics 732A54/TDDE31 Big Data Analytics Lecture 10: Machine Learning with MapReduce Jose M. Peña IDA, Linköping University, Sweden 1/27 Contents MapReduce Framework Machine Learning with MapReduce Neural Networks

More information

Lecture 20: Neural Networks for NLP. Zubin Pahuja

Lecture 20: Neural Networks for NLP. Zubin Pahuja Lecture 20: Neural Networks for NLP Zubin Pahuja zpahuja2@illinois.edu courses.engr.illinois.edu/cs447 CS447: Natural Language Processing 1 Today s Lecture Feed-forward neural networks as classifiers simple

More information

Deep Learning for Visual Computing Prof. Debdoot Sheet Department of Electrical Engineering Indian Institute of Technology, Kharagpur

Deep Learning for Visual Computing Prof. Debdoot Sheet Department of Electrical Engineering Indian Institute of Technology, Kharagpur Deep Learning for Visual Computing Prof. Debdoot Sheet Department of Electrical Engineering Indian Institute of Technology, Kharagpur Lecture - 05 Classification with Perceptron Model So, welcome to today

More information

Neuron Selectivity as a Biologically Plausible Alternative to Backpropagation

Neuron Selectivity as a Biologically Plausible Alternative to Backpropagation Neuron Selectivity as a Biologically Plausible Alternative to Backpropagation C.J. Norsigian Department of Bioengineering cnorsigi@eng.ucsd.edu Vishwajith Ramesh Department of Bioengineering vramesh@eng.ucsd.edu

More information

Artificial Neuron Modelling Based on Wave Shape

Artificial Neuron Modelling Based on Wave Shape Artificial Neuron Modelling Based on Wave Shape Kieran Greer, Distributed Computing Systems, Belfast, UK. http://distributedcomputingsystems.co.uk Version 1.2 Abstract This paper describes a new model

More information

Lecture : Training a neural net part I Initialization, activations, normalizations and other practical details Anne Solberg February 28, 2018

Lecture : Training a neural net part I Initialization, activations, normalizations and other practical details Anne Solberg February 28, 2018 INF 5860 Machine learning for image classification Lecture : Training a neural net part I Initialization, activations, normalizations and other practical details Anne Solberg February 28, 2018 Reading

More information

A Deep Learning Approach to Vehicle Speed Estimation

A Deep Learning Approach to Vehicle Speed Estimation A Deep Learning Approach to Vehicle Speed Estimation Benjamin Penchas bpenchas@stanford.edu Tobin Bell tbell@stanford.edu Marco Monteiro marcorm@stanford.edu ABSTRACT Given car dashboard video footage,

More information

Deep Neural Networks Optimization

Deep Neural Networks Optimization Deep Neural Networks Optimization Creative Commons (cc) by Akritasa http://arxiv.org/pdf/1406.2572.pdf Slides from Geoffrey Hinton CSC411/2515: Machine Learning and Data Mining, Winter 2018 Michael Guerzhoy

More information

Deep Learning With Noise

Deep Learning With Noise Deep Learning With Noise Yixin Luo Computer Science Department Carnegie Mellon University yixinluo@cs.cmu.edu Fan Yang Department of Mathematical Sciences Carnegie Mellon University fanyang1@andrew.cmu.edu

More information

ECS289: Scalable Machine Learning

ECS289: Scalable Machine Learning ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 4, 2016 Outline Multi-core v.s. multi-processor Parallel Gradient Descent Parallel Stochastic Gradient Parallel Coordinate Descent Parallel

More information

Deep Learning. Practical introduction with Keras JORDI TORRES 27/05/2018. Chapter 3 JORDI TORRES

Deep Learning. Practical introduction with Keras JORDI TORRES 27/05/2018. Chapter 3 JORDI TORRES Deep Learning Practical introduction with Keras Chapter 3 27/05/2018 Neuron A neural network is formed by neurons connected to each other; in turn, each connection of one neural network is associated

More information

Machine Learning and Computational Statistics, Spring 2016 Homework 1: Ridge Regression and SGD

Machine Learning and Computational Statistics, Spring 2016 Homework 1: Ridge Regression and SGD Machine Learning and Computational Statistics, Spring 2016 Homework 1: Ridge Regression and SGD Due: Friday, February 5, 2015, at 6pm (Submit via NYU Classes) Instructions: Your answers to the questions

More information

Natural Language Processing CS 6320 Lecture 6 Neural Language Models. Instructor: Sanda Harabagiu

Natural Language Processing CS 6320 Lecture 6 Neural Language Models. Instructor: Sanda Harabagiu Natural Language Processing CS 6320 Lecture 6 Neural Language Models Instructor: Sanda Harabagiu In this lecture We shall cover: Deep Neural Models for Natural Language Processing Introduce Feed Forward

More information

CS281 Section 3: Practical Optimization

CS281 Section 3: Practical Optimization CS281 Section 3: Practical Optimization David Duvenaud and Dougal Maclaurin Most parameter estimation problems in machine learning cannot be solved in closed form, so we often have to resort to numerical

More information

Memory Bandwidth and Low Precision Computation. CS6787 Lecture 9 Fall 2017

Memory Bandwidth and Low Precision Computation. CS6787 Lecture 9 Fall 2017 Memory Bandwidth and Low Precision Computation CS6787 Lecture 9 Fall 2017 Memory as a Bottleneck So far, we ve just been talking about compute e.g. techniques to decrease the amount of compute by decreasing

More information

Convolutional Neural Networks

Convolutional Neural Networks Lecturer: Barnabas Poczos Introduction to Machine Learning (Lecture Notes) Convolutional Neural Networks Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications.

More information

CS489/698: Intro to ML

CS489/698: Intro to ML CS489/698: Intro to ML Lecture 14: Training of Deep NNs Instructor: Sun Sun 1 Outline Activation functions Regularization Gradient-based optimization 2 Examples of activation functions 3 5/28/18 Sun Sun

More information

Combine the PA Algorithm with a Proximal Classifier

Combine the PA Algorithm with a Proximal Classifier Combine the Passive and Aggressive Algorithm with a Proximal Classifier Yuh-Jye Lee Joint work with Y.-C. Tseng Dept. of Computer Science & Information Engineering TaiwanTech. Dept. of Statistics@NCKU

More information

HENet: A Highly Efficient Convolutional Neural. Networks Optimized for Accuracy, Speed and Storage

HENet: A Highly Efficient Convolutional Neural. Networks Optimized for Accuracy, Speed and Storage HENet: A Highly Efficient Convolutional Neural Networks Optimized for Accuracy, Speed and Storage Qiuyu Zhu Shanghai University zhuqiuyu@staff.shu.edu.cn Ruixin Zhang Shanghai University chriszhang96@shu.edu.cn

More information

CS 2750: Machine Learning. Neural Networks. Prof. Adriana Kovashka University of Pittsburgh April 13, 2016

CS 2750: Machine Learning. Neural Networks. Prof. Adriana Kovashka University of Pittsburgh April 13, 2016 CS 2750: Machine Learning Neural Networks Prof. Adriana Kovashka University of Pittsburgh April 13, 2016 Plan for today Neural network definition and examples Training neural networks (backprop) Convolutional

More information

Reddit Recommendation System Daniel Poon, Yu Wu, David (Qifan) Zhang CS229, Stanford University December 11 th, 2011

Reddit Recommendation System Daniel Poon, Yu Wu, David (Qifan) Zhang CS229, Stanford University December 11 th, 2011 Reddit Recommendation System Daniel Poon, Yu Wu, David (Qifan) Zhang CS229, Stanford University December 11 th, 2011 1. Introduction Reddit is one of the most popular online social news websites with millions

More information

Scaling Distributed Machine Learning

Scaling Distributed Machine Learning Scaling Distributed Machine Learning with System and Algorithm Co-design Mu Li Thesis Defense CSD, CMU Feb 2nd, 2017 nx min w f i (w) Distributed systems i=1 Large scale optimization methods Large-scale

More information

3 Types of Gradient Descent Algorithms for Small & Large Data Sets

3 Types of Gradient Descent Algorithms for Small & Large Data Sets 3 Types of Gradient Descent Algorithms for Small & Large Data Sets Introduction Gradient Descent Algorithm (GD) is an iterative algorithm to find a Global Minimum of an objective function (cost function)

More information

The Mathematics Behind Neural Networks

The Mathematics Behind Neural Networks The Mathematics Behind Neural Networks Pattern Recognition and Machine Learning by Christopher M. Bishop Student: Shivam Agrawal Mentor: Nathaniel Monson Courtesy of xkcd.com The Black Box Training the

More information

Inception Network Overview. David White CS793

Inception Network Overview. David White CS793 Inception Network Overview David White CS793 So, Leonardo DiCaprio dreams about dreaming... https://m.media-amazon.com/images/m/mv5bmjaxmzy3njcxnf5bml5banbnxkftztcwnti5otm0mw@@._v1_sy1000_cr0,0,675,1 000_AL_.jpg

More information

Convolution Neural Networks for Chinese Handwriting Recognition

Convolution Neural Networks for Chinese Handwriting Recognition Convolution Neural Networks for Chinese Handwriting Recognition Xu Chen Stanford University 450 Serra Mall, Stanford, CA 94305 xchen91@stanford.edu Abstract Convolutional neural networks have been proven

More information

Regularization and model selection

Regularization and model selection CS229 Lecture notes Andrew Ng Part VI Regularization and model selection Suppose we are trying select among several different models for a learning problem. For instance, we might be using a polynomial

More information

COMP 551 Applied Machine Learning Lecture 16: Deep Learning

COMP 551 Applied Machine Learning Lecture 16: Deep Learning COMP 551 Applied Machine Learning Lecture 16: Deep Learning Instructor: Ryan Lowe (ryan.lowe@cs.mcgill.ca) Slides mostly by: Class web page: www.cs.mcgill.ca/~hvanho2/comp551 Unless otherwise noted, all

More information

SEMANTIC COMPUTING. Lecture 8: Introduction to Deep Learning. TU Dresden, 7 December Dagmar Gromann International Center For Computational Logic

SEMANTIC COMPUTING. Lecture 8: Introduction to Deep Learning. TU Dresden, 7 December Dagmar Gromann International Center For Computational Logic SEMANTIC COMPUTING Lecture 8: Introduction to Deep Learning Dagmar Gromann International Center For Computational Logic TU Dresden, 7 December 2018 Overview Introduction Deep Learning General Neural Networks

More information

CS231A Course Project Final Report Sign Language Recognition with Unsupervised Feature Learning

CS231A Course Project Final Report Sign Language Recognition with Unsupervised Feature Learning CS231A Course Project Final Report Sign Language Recognition with Unsupervised Feature Learning Justin Chen Stanford University justinkchen@stanford.edu Abstract This paper focuses on experimenting with

More information

Lecture : Neural net: initialization, activations, normalizations and other practical details Anne Solberg March 10, 2017

Lecture : Neural net: initialization, activations, normalizations and other practical details Anne Solberg March 10, 2017 INF 5860 Machine learning for image classification Lecture : Neural net: initialization, activations, normalizations and other practical details Anne Solberg March 0, 207 Mandatory exercise Available tonight,

More information

Lecture 37: ConvNets (Cont d) and Training

Lecture 37: ConvNets (Cont d) and Training Lecture 37: ConvNets (Cont d) and Training CS 4670/5670 Sean Bell [http://bbabenko.tumblr.com/post/83319141207/convolutional-learnings-things-i-learned-by] (Unrelated) Dog vs Food [Karen Zack, @teenybiscuit]

More information

Know your data - many types of networks

Know your data - many types of networks Architectures Know your data - many types of networks Fixed length representation Variable length representation Online video sequences, or samples of different sizes Images Specific architectures for

More information

Lecture 1 Notes. Outline. Machine Learning. What is it? Instructors: Parth Shah, Riju Pahwa

Lecture 1 Notes. Outline. Machine Learning. What is it? Instructors: Parth Shah, Riju Pahwa Instructors: Parth Shah, Riju Pahwa Lecture 1 Notes Outline 1. Machine Learning What is it? Classification vs. Regression Error Training Error vs. Test Error 2. Linear Classifiers Goals and Motivations

More information

Distributed Training of Deep Neural Networks: Theoretical and Practical Limits of Parallel Scalability

Distributed Training of Deep Neural Networks: Theoretical and Practical Limits of Parallel Scalability Distributed Training of Deep Neural Networks: Theoretical and Practical Limits of Parallel Scalability Janis Keuper Itwm.fraunhofer.de/ml Competence Center High Performance Computing Fraunhofer ITWM, Kaiserslautern,

More information

Clustering and Visualisation of Data

Clustering and Visualisation of Data Clustering and Visualisation of Data Hiroshi Shimodaira January-March 28 Cluster analysis aims to partition a data set into meaningful or useful groups, based on distances between data points. In some

More information

DEEP LEARNING REVIEW. Yann LeCun, Yoshua Bengio & Geoffrey Hinton Nature Presented by Divya Chitimalla

DEEP LEARNING REVIEW. Yann LeCun, Yoshua Bengio & Geoffrey Hinton Nature Presented by Divya Chitimalla DEEP LEARNING REVIEW Yann LeCun, Yoshua Bengio & Geoffrey Hinton Nature 2015 -Presented by Divya Chitimalla What is deep learning Deep learning allows computational models that are composed of multiple

More information

AM 221: Advanced Optimization Spring 2016

AM 221: Advanced Optimization Spring 2016 AM 221: Advanced Optimization Spring 2016 Prof. Yaron Singer Lecture 2 Wednesday, January 27th 1 Overview In our previous lecture we discussed several applications of optimization, introduced basic terminology,

More information

Why DNN Works for Speech and How to Make it More Efficient?

Why DNN Works for Speech and How to Make it More Efficient? Why DNN Works for Speech and How to Make it More Efficient? Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering, York University, CANADA Joint work with Y.

More information

Fuzzy Set Theory in Computer Vision: Example 3, Part II

Fuzzy Set Theory in Computer Vision: Example 3, Part II Fuzzy Set Theory in Computer Vision: Example 3, Part II Derek T. Anderson and James M. Keller FUZZ-IEEE, July 2017 Overview Resource; CS231n: Convolutional Neural Networks for Visual Recognition https://github.com/tuanavu/stanford-

More information

Deep Learning Cook Book

Deep Learning Cook Book Deep Learning Cook Book Robert Haschke (CITEC) Overview Input Representation Output Layer + Cost Function Hidden Layer Units Initialization Regularization Input representation Choose an input representation

More information

(Refer Slide Time: 1:27)

(Refer Slide Time: 1:27) Data Structures and Algorithms Dr. Naveen Garg Department of Computer Science and Engineering Indian Institute of Technology, Delhi Lecture 1 Introduction to Data Structures and Algorithms Welcome to data

More information

Asynchronous Parallel Learning for Neural Networks and Structured Models with Dense Features

Asynchronous Parallel Learning for Neural Networks and Structured Models with Dense Features Asynchronous Parallel Learning for Neural Networks and Structured Models with Dense Features Xu SUN ( 孙栩 ) Peking University xusun@pku.edu.cn Motivation Neural networks -> Good Performance CNN, RNN, LSTM

More information

An algorithm for Performance Analysis of Single-Source Acyclic graphs

An algorithm for Performance Analysis of Single-Source Acyclic graphs An algorithm for Performance Analysis of Single-Source Acyclic graphs Gabriele Mencagli September 26, 2011 In this document we face with the problem of exploiting the performance analysis of acyclic graphs

More information

Neural Network and Deep Learning. Donglin Zeng, Department of Biostatistics, University of North Carolina

Neural Network and Deep Learning. Donglin Zeng, Department of Biostatistics, University of North Carolina Neural Network and Deep Learning Early history of deep learning Deep learning dates back to 1940s: known as cybernetics in the 1940s-60s, connectionism in the 1980s-90s, and under the current name starting

More information

Parallel and Distributed Deep Learning

Parallel and Distributed Deep Learning Parallel and Distributed Deep Learning Vishakh Hegde Stanford University vishakh@stanford.edu Sheema Usmani Stanford University sheema@stanford.edu Abstract The goal of this report is to explore ways to

More information

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CHAPTER 4 CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS 4.1 Introduction Optical character recognition is one of

More information

Fall 09, Homework 5

Fall 09, Homework 5 5-38 Fall 09, Homework 5 Due: Wednesday, November 8th, beginning of the class You can work in a group of up to two people. This group does not need to be the same group as for the other homeworks. You

More information

SAS Meets Big Iron: High Performance Computing in SAS Analytic Procedures

SAS Meets Big Iron: High Performance Computing in SAS Analytic Procedures SAS Meets Big Iron: High Performance Computing in SAS Analytic Procedures Robert A. Cohen SAS Institute Inc. Cary, North Carolina, USA Abstract Version 9targets the heavy-duty analytic procedures in SAS

More information

CNN Basics. Chongruo Wu

CNN Basics. Chongruo Wu CNN Basics Chongruo Wu Overview 1. 2. 3. Forward: compute the output of each layer Back propagation: compute gradient Updating: update the parameters with computed gradient Agenda 1. Forward Conv, Fully

More information

Akarsh Pokkunuru EECS Department Contractive Auto-Encoders: Explicit Invariance During Feature Extraction

Akarsh Pokkunuru EECS Department Contractive Auto-Encoders: Explicit Invariance During Feature Extraction Akarsh Pokkunuru EECS Department 03-16-2017 Contractive Auto-Encoders: Explicit Invariance During Feature Extraction 1 AGENDA Introduction to Auto-encoders Types of Auto-encoders Analysis of different

More information

Image Compression: An Artificial Neural Network Approach

Image Compression: An Artificial Neural Network Approach Image Compression: An Artificial Neural Network Approach Anjana B 1, Mrs Shreeja R 2 1 Department of Computer Science and Engineering, Calicut University, Kuttippuram 2 Department of Computer Science and

More information

Deep Learning. Visualizing and Understanding Convolutional Networks. Christopher Funk. Pennsylvania State University.

Deep Learning. Visualizing and Understanding Convolutional Networks. Christopher Funk. Pennsylvania State University. Visualizing and Understanding Convolutional Networks Christopher Pennsylvania State University February 23, 2015 Some Slide Information taken from Pierre Sermanet (Google) presentation on and Computer

More information