Training Simplification and Model Simplification for Deep Learning: A Minimal Effort Back Propagation Method

Xu Sun, Xuancheng Ren, Shuming Ma, Bingzhen Wei, Wei Li, and Houfeng Wang

arXiv v1 [cs.LG] 17 Nov 2017

X. Sun, X. Ren, S. Ma, B. Wei, W. Li, and H. Wang are with the School of Electronics Engineering and Computer Science, Peking University, China, and the MOE Key Laboratory of Computational Linguistics, Peking University, China. {xusun, renxc, shumingma, weibz, liweitj47, wanghf}@pku.edu.cn. The first two authors contributed equally to this work. This work is a substantial extension of the work presented at ICML 2017 [1]. The codes are available at

Abstract. We propose a simple yet effective technique to simplify the training and the resulting model of neural networks. In back propagation, only a small subset of the full gradient is computed to update the model parameters. The gradient vectors are sparsified in such a way that only the top-k elements (in terms of magnitude) are kept. As a result, only k rows or columns (depending on the layout) of the weight matrix are modified, leading to a linear reduction in the computational cost. Based on the sparsified gradients, we further simplify the model by eliminating the rows or columns that are seldom updated, which reduces the computational cost of both training and decoding, and can potentially accelerate decoding in real-world applications. Surprisingly, experimental results demonstrate that most of the time we only need to update fewer than 5% of the weights at each back propagation pass. More interestingly, the accuracy of the resulting models is actually improved rather than degraded, and a detailed analysis is given. The model simplification results show that we can adaptively simplify the model, which can often be reduced by around 9x, without any loss of accuracy, or even with improved accuracy.

Index Terms: neural network, back propagation, sparse learning, model pruning.

1 INTRODUCTION

Neural network learning is typically slow, where back propagation usually dominates the computational cost during the learning process. Back propagation entails a high computational cost because it needs to compute full gradients and update all model parameters in each learning step. It is not uncommon for a neural network to have a massive number of model parameters.

In this study, we propose a minimal effort back propagation method, which we call meprop, for neural network learning. The idea is that we compute only a very small but critical portion of the gradient information, and update only the corresponding minimal portion of the parameters in each learning step. This leads to sparsified gradients, such that only highly relevant parameters are updated, while the other parameters stay untouched. The sparsified back propagation leads to a linear reduction in the computational cost.

On top of meprop, we further propose to simplify the trained model by eliminating the less relevant parameters discovered during meprop, so that the computational cost of decoding can also be reduced. We name the method mesimp (minimal effort simplification). The idea is that we record which portion of the parameters is updated at each learning step in meprop, and gradually remove the parameters that are seldom updated. This leads to a simplified model that costs less in computation during decoding, whereas meprop can only speed up the training of the neural networks.
One of the motivations for such a method is that, if we suppose back propagation can determine the importance of input features, then with meprop the essential features are well-trained and the non-essential features are less-trained, so that the robustness of the models can be improved and overfitting can be reduced. As the essential features play a more important role in the final model, there is a chance that the parameters related to non-essential features can be eliminated, which leads to the idea of mesimp.

For a classification task, there are essential features that are decisive for the classification, non-essential features that are helpful but can also be distractions, and irrelevant features that are not useful at all. For example, when classifying a picture as a taxi, the taxi sign is one of the essential features, and the color yellow, which is often the color of a taxi, is one of the non-essential features. Overfitting often occurs when the non-essential features are given too much importance in the model, while meprop intentionally focuses on training the probably essential features to lessen the risk of overfitting.

To realize our approaches, we need to answer four questions. The first question is how to find the highly relevant subset of the parameters for the current sample in stochastic learning. We propose a top-k search method to find the most important parameters. Interestingly, experimental results demonstrate that most of the time we only need to update fewer than 5% of the weights at each back propagation pass. This does not result in a larger number of training iterations. The proposed method is general-purpose, and it is independent of specific models and specific optimizers (e.g., Adam and AdaGrad).

The second question is whether or not this minimal effort back propagation strategy would hurt the accuracy of the trained models. We show that our strategy does not degrade the accuracy of the trained model, even when only a very small portion of the parameters is updated. More interestingly, our experimental results reveal that our strategy actually improves the model accuracy in most cases. Based on our experiments, we find that this is probably because the minimal effort update does not modify weakly relevant parameters in each update, in accordance with our assumption, which makes overfitting less likely, similar to the dropout effect.

The third question is whether or not the decoding cost of the model can be reduced, as meprop can only shorten the training time. Based on meprop, we further apply the technique of mesimp. From our observations, the simplifying strategy can indeed shrink the final model, usually by around 9x, without any loss of accuracy. It also supports our assumption that, in fact, many learned features are not essential to the final correct prediction.

The final question is whether or not the size of the simplified models needs to be set explicitly in advance. In most previous work, the final model size is pre-configured as desired or by heuristic rules, making it hard to simplify models with multiple layers, because naturally each layer should have a different dimension, since it captures a different level of abstraction. In practice, we find that mesimp can adaptively reduce the size of the hidden layers, and automatically decide which features are essential for the task at different abstraction levels, resulting in a model with different hidden layer sizes.

The contributions of this work are as follows:

- We propose a minimal effort back propagation technique for neural network learning, which can automatically find the most important features. Only a small subset of the full gradient is computed to update the model parameters, and this subset is also used to determine whether the related parameters should be kept in the final model.
- Applying the technique to training simplification (meprop), we find that the strategy actually improves the accuracy of the resulting models, rather than degrading it, even though fewer than 5% of the weights are updated at each back propagation pass most of the time. The technique does not entail a larger number of training iterations, and can reduce training time substantially.
- Most importantly, applying the technique to model simplification (mesimp) can potentially reduce decoding time. With the ability to adaptively simplify each layer of the model to keep only the essential features, the resulting model can be reduced to around one ninth of its original size, which amounts to an approximately 9x reduction in decoding cost, with no accuracy loss or even with improved accuracy. It is worth mentioning that, when applied to models with multiple layers, given a single hyper-parameter, mesimp can simplify each hidden layer to a different extent, alleviating the need to set different hyper-parameters for different layers.
- The minimal effort back propagation technique can be applied to different types of deep learning models (MLP and LSTM), can be applied with various optimization methods (Adam and AdaGrad), and works on diverse tasks (natural language processing and image recognition).

2 PROPOSED METHOD

We propose a simple yet effective technique for neural network learning. The forward propagation is computed as usual.
During back propagation, only a small subset of the full gradient is computed to update the model parameters. The gradient vectors are sparsified in such a way that only the top-k components in terms of magnitude are kept. Based on this technique, we further propose to simplify the resulting models by removing the rows that are seldom updated, according to the top-k indices. The model is simplified in such a way that only actively updated rows are kept. We first present the proposed methods, and then describe the implementation details.

2.1 Simplified Back Propagation (meprop)

Forward propagation of neural network models, including feedforward neural networks, RNNs, and LSTMs, consists of linear transformations and non-linear transformations. For simplicity, we take a computation unit with one linear transformation and one non-linear transformation as an example:

$y = Wx$  (1)

$z = \sigma(y)$  (2)

where $W \in \mathbb{R}^{n \times m}$, $x \in \mathbb{R}^m$, $y \in \mathbb{R}^n$, $z \in \mathbb{R}^n$, m is the dimension of the input vector, n is the dimension of the output vector, and $\sigma$ is a non-linear function (e.g., relu, tanh, and sigmoid). During back propagation, we need to compute the gradient of the parameter matrix W and the input vector x:

$\partial z/\partial W_{ij} = \sigma'_i x_j^T \quad (1 \le i \le n,\ 1 \le j \le m)$  (3)

$\partial z/\partial x_i = \sum_j W_{ij}^T \sigma'_j \quad (1 \le j \le n,\ 1 \le i \le m)$  (4)

where $\sigma' \in \mathbb{R}^n$ and $\sigma'_i$ denotes $\partial z_i/\partial y_i$. We can see that the computational cost of back propagation is directly proportional to the dimension of the output vector n.

The proposed meprop uses approximate gradients by keeping only the top-k elements based on the magnitude values. That is, only the top-k elements with the largest absolute values are kept. For example, suppose a vector $v = \langle 1, 2, 3, -4 \rangle$, then $\mathrm{top}_2(v) = \langle 0, 0, 3, -4 \rangle$. We denote the indices of the top-k values of the vector $\sigma'(y)$ as $S = \{t_1, t_2, \ldots, t_k\}$ $(1 \le k \le n)$, and the approximate gradients of the parameter matrix W and the input vector x are:

$\partial z/\partial W_{ij} \leftarrow \sigma'_i x_j^T$ if $i \in \{t_1, t_2, \ldots, t_k\}$, else $0$  (5)

$\partial z/\partial x_i \leftarrow \sum_j W_{ij}^T \sigma'_j$ if $j \in \{t_1, t_2, \ldots, t_k\}$, else $0$  (6)

As a result, only k rows or columns (depending on the layout) of the weight matrix are modified, leading to a linear reduction (k divided by the vector dimension) in the computational cost. The algorithm is described in Algorithm 1.

Algorithm 1 Backward Propagation Simplification for a Computation Unit
1: $z \leftarrow \sigma(Wx)$  ▷ Forward propagation
2: $\sigma' \leftarrow$ gradient of z w.r.t. $y = Wx$
3: $S \leftarrow \{t_1, t_2, \ldots, t_k\}$  ▷ Indices of the k largest derivatives of $\sigma'$ in magnitude
4: $\partial z/\partial W_{ij} \leftarrow \sigma'_i x_j^T$ if $i \in S$ else $0$
5: $\partial z/\partial x_i \leftarrow \sum_j W_{ij}^T \sigma'_j$ if $j \in S$ else $0$

Fig. 1. An illustration of meprop.

Figure 1 is an illustration of meprop for a single computation unit of neural models. The original back propagation uses the full gradient of the output vectors to compute the gradient of the parameters. The proposed method selects the top-k values of the gradient of the output vector, and back propagates the loss through the corresponding subset of the total model parameters.
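To make the top-k sparsification concrete, the following is a minimal NumPy sketch of a meprop-style backward pass for a single computation unit z = σ(Wx), assuming a sigmoid non-linearity for concreteness; the function and variable names are our own illustration, not the authors' released code.

```python
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def meprop_backward(W, x, dL_dz, k):
    """Sparsified backward pass for z = sigmoid(W x).

    Only the top-k components (by magnitude) of the gradient w.r.t. y = W x
    are kept; the rest are treated as zero, so only k rows of W receive
    non-zero gradients.
    """
    y = W @ x                      # forward linear transformation
    z = sigmoid(y)
    sigma_prime = z * (1.0 - z)    # dz/dy for the sigmoid
    dL_dy = dL_dz * sigma_prime    # full gradient w.r.t. y (length n)

    # keep only the k entries of dL_dy with the largest absolute value
    top_idx = np.argpartition(np.abs(dL_dy), -k)[-k:]
    dL_dy_sparse = np.zeros_like(dL_dy)
    dL_dy_sparse[top_idx] = dL_dy[top_idx]

    # only k rows of dL_dW are non-zero; dL_dx uses only k rows of W
    dL_dW = np.outer(dL_dy_sparse, x)
    dL_dx = W.T @ dL_dy_sparse
    return dL_dW, dL_dx
```

In an efficient implementation, the outer product and the transposed multiplication would only touch the k selected rows (or columns) of W, which is where the linear reduction in computation comes from; the dense version above is written for clarity only.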

As for a complete neural network framework with a loss L, the original back propagation computes the gradient of the parameter matrix W as:

$\partial L/\partial W = (\partial L/\partial y) \cdot (\partial y/\partial W)$  (7)

while the gradient of the input vector x is:

$\partial L/\partial x = (\partial y/\partial x) \cdot (\partial L/\partial y)$  (8)

The proposed meprop selects the top-k elements of the gradient $\partial L/\partial y$ to approximate the original gradient, and passes them through the gradient computation graph according to the chain rule. Hence, the gradient of W becomes:

$\partial L/\partial W \leftarrow \mathrm{top}_k(\partial L/\partial y) \cdot (\partial y/\partial W)$  (9)

while the gradient of the vector x is:

$\partial L/\partial x \leftarrow (\partial y/\partial x) \cdot \mathrm{top}_k(\partial L/\partial y)$  (10)

Figure 2 shows an illustration of the computational flow of meprop. The forward propagation is the same as traditional forward propagation, which computes the output vector via a matrix multiplication operation between two input tensors. The original back propagation computes the full gradient for the input vector and the weight matrix. For meprop, back propagation computes an approximate gradient by keeping the top-k values of the gradient flowing backward and masking the remaining values to 0. Figure 3 further shows the computational flow of meprop for the mini-batch case.

Fig. 2. An illustration of the computational flow of meprop.

Fig. 3. An illustration of the computational flow of meprop in a mini-batch learning setting.

2.2 Simplified Model (mesimp)

The method from Section 2.1 simplifies the training process and thus reduces the training time. However, for most deep learning applications in real life, it is even more important to reduce the computational cost of decoding, because although training is time-consuming, it only needs to be done once, while decoding needs to be done whenever there is a new request. In this section, we propose to simplify the model by eliminating the inactive paths, which we define as the neurons whose gradients are not in the top-k. This way, the decoding cost is also reduced.

There are two major concerns about this proposal. The main problem is that we do not know the active paths of unseen examples in advance, as we do not know the gradient information of those examples. Our solution to this problem is that we can obtain the overall inactive paths from the inactive paths of the training samples, which can be removed gradually during training. The second concern is that the reduction in dimension could lead to performance degradation. Surprisingly, from our experimental results, our top-k gradient based method does not deteriorate the model. Instead, with an appropriate configuration, the resulting smaller model often performs better than the baseline large model, or even a baseline model of a similar size. As a matter of fact, right after pruning the performance does drop. However, with the following training, the performance is regained. In what follows, we will briefly introduce the inspiration for the proposed method, and how the model simplification is done.

In the experiments of meprop, we discover an interesting phenomenon: during training, apart from the active paths with top-k gradients, there are some inactive paths that are not activated at all for any of the examples.

We call these paths universal inactive paths. These neurons are not updated at all during training, their parameter values remain the same as their initialized values, and we have every reason to believe that they would not be effective for new samples either. However, the number of such paths may not be enough to bring a substantial contraction to the model.

Based on the previous findings, we generalize the idea of universal inactive paths, and prune the paths that are seldom updated; that is, the paths we eliminate are the paths that are not active for a number of samples. To realize the idea, we keep a record c of how many times each index is in the top-k indices S, during back propagation at the same time as meprop. After several training steps m, we take out the less active paths that are not updated for a sufficient number of samples, e.g., given a 90% prune rate, which results in a simplified model. The record is cleared at each pruning action. By doing this iteratively, the model size approaches a near-stable value in the end. Algorithm 2 describes the method for a computation unit, and an illustration is shown in Figure 4.

Algorithm 2 Model Simplification for a Computation Unit
1: Initialize $W$, $t \leftarrow 0$, $c \leftarrow 0$
2: while training do
3:   Draw (x, y) from training data
4:   $z \leftarrow \sigma(Wx)$  ▷ Forward propagation
5:   $\partial z/\partial W \leftarrow \mathrm{top}_k(\sigma') x^T$  ▷ Sparsified gradient w.r.t. W
6:   $S \leftarrow \{t_1, t_2, \ldots, t_k\}$  ▷ Indices of the k largest derivatives of $\sigma'$ in magnitude
7:   $c_i \leftarrow c_i + 1$ for each $i \in S$  ▷ Record the top-k indices
8:   Update W with $\partial z/\partial W$
9:   if $t \bmod m = 0$ then  ▷ Prune inactive paths
10:    $\theta \leftarrow m \times \text{prune rate}$
11:    for all $i$ where $c_i < \theta$ do
12:      Remove row $i$ from $W$
13:    end for
14:    $c \leftarrow 0$  ▷ Reset c
15:  end if
16:  $t \leftarrow t + 1$
17: end while

Fig. 4. An illustration of model simplification (k = 2). The figure shows the three main stages of the simplified model training. First, as the upper part shows, the model is trained using meprop for several iterations, and a record of the activeness of the paths is kept, indicated by the shades of the neurons. Second, as the middle part shows, the model is simplified based on the collected record, and the inactive paths are eliminated. Third, as the lower part shows, we train the simplified model, also using meprop. We repeat the procedure until the goal of the training is met.

An important hyper-parameter for the method is the pruning threshold. When determining the threshold, the model size and the number of examples between pruning actions should be taken into account. As shown in Algorithm 2, the threshold can be parameterized by the prune interval, that is, how many samples there are between pruning actions, and the prune rate, that is, how active a path must be if it is not to be eliminated. Note that the layer sizes are determined adaptively in a multi-layer setting, and only one threshold is needed for a model with multiple layers to end up with different layer sizes, because the top-k indices of different layers overlap differently across iterations in back propagation. For some layers, the top-k indices are similar across iterations, which results in a larger layer size compared to k. For other layers, the top-k indices are quite different at each iteration, so the counts are spread over more indices and each $c_i$ is lower, and hence the resulting layer size is smaller. How the layer is simplified depends on how the learning is done, which is in accordance with our intuition.
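The record-and-prune cycle of Algorithm 2 is straightforward to express in code. Below is a minimal Python sketch for a single weight matrix, assuming a linear unit and a plain SGD update in place of whichever optimizer is actually used; names such as prune_interval and prune_rate mirror the hyper-parameters described above, and pruning is simulated with a row mask so that gradient shapes stay fixed until the end.

```python
import numpy as np

def train_with_mesimp(W, data, k, prune_interval, prune_rate, lr=0.1):
    """Record top-k usage counts and periodically prune seldom-updated rows.

    W    : (n, m) weight matrix of one computation unit (z = W x for brevity)
    data : iterable of (x, dL_dz) pairs for this unit
    """
    n = W.shape[0]
    active = np.ones(n, dtype=bool)          # which rows are still in the model
    counts = np.zeros(n, dtype=np.int64)     # c in Algorithm 2
    t = 0
    for x, dL_dz in data:
        dL_dy = np.where(active, dL_dz, 0.0) # pruned rows produce no gradient

        # top-k selection among the remaining rows
        kk = min(k, int(active.sum()))
        top_idx = np.argpartition(np.abs(dL_dy), -kk)[-kk:]

        # update only the selected rows and record their activity
        W[top_idx] -= lr * np.outer(dL_dy[top_idx], x)
        counts[top_idx] += 1

        t += 1
        if t % prune_interval == 0:          # prune inactive paths
            threshold = prune_interval * prune_rate
            active &= counts >= threshold
            counts[:] = 0                    # reset the record
    return W[active]                         # rows that survived all pruning actions
```

With prune_interval = m and prune_rate = 0.08, a row survives a pruning action only if it appeared in the top-k during at least 8% of the last m steps, matching the prune = 0.08 setting reported in the experiments.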
In a deep neural network, it is worth noticing that when simplifying a hidden layer, the respective columns in the next layer could also be removed, as the values in those columns represent the connections between the eliminated inputs and the outputs, which are no longer effective. That could reduce the model even further. However, we have not included that in our implementation yet.

There are some extra considerations for LSTM models. In an LSTM, there is a lasting linear memory, and four gates controlling the modification of the memory cells. It makes sense only if the pruning is applied to the memory cells instead of the gates, which are the computation units defined previously, because there is coherence between the memory and the gates.

Otherwise, the pruning would cause chaos and a mismatch of dimensions, as each gate has its own size, and the memory ends up with yet another dimension if it is set to the union of the gates. For LSTM models, we therefore treat an LSTM module as a whole unit for model simplification, instead of treating each gate in an LSTM module as a unit for simplification. However, the top-k gradient selection takes place at the level of gates rather than memory cells. In practice, we still obtain the top-k indices from the gates, but we merge the top-k index records of the gates into one record, and the pruning is applied to memory cells, so that the related gates are pruned as well.

For model simplification, we also propose a kind of cycle mechanism. During our experiments, we find that at the time of the simplification, there is a drop in performance, but it recovers quickly within the following training, and may even surpass the performance before the simplification. This makes us wonder whether the training after simplification is critical to the performance improvement. We propose to divide the training procedure into several stages, and in each stage, we first conduct the training with model simplification, and then conduct normal training. At the start of each stage, we also reinitialize the optimizer if there is historical information of the gradients stored. The reason for this operation is that after model simplification, the dynamics of how the neurons interact with each other change, and the previous gradient information may interfere with the new dynamics of the simplified network. We find this cycle mechanism can improve the resulting model's performance even further on some tasks.

2.3 Implementation

We have coded two neural network models, including an LSTM model for part-of-speech (POS) tagging, and a feedforward NN model (MLP) for transition-based dependency parsing and MNIST image recognition. We use optimizers with automatically adaptive learning rates, including Adam [2] and AdaGrad [3]. In our implementation, we make no modification to the optimizers, although there are many zero elements in the gradients.

Most of the experiments on CPU are conducted on a framework we coded ourselves in C#. This framework builds a dynamic computation graph of the model for each sample, making it suitable for data of variable lengths. A typical training procedure contains three parts: forward propagation, back propagation, and parameter update. We also have an implementation based on the PyTorch framework for GPU based experiments. To focus on the method itself, the results of GPU based experiments will be presented in the appendices.

Where to apply top-k selection

The proposed method aims to reduce the complexity of back propagation by reducing the number of elements in the computationally intensive operations. In our preliminary observations, matrix-matrix or matrix-vector multiplication consumed more than 90% of the time of back propagation. In our implementation, we apply meprop only to the back propagation from the output of the multiplication to its inputs. For other element-wise operations (e.g., activation functions), the original back propagation procedure is kept, because those operations are already fast enough compared with matrix-matrix or matrix-vector multiplication operations. If there are multiple hidden layers, the top-k sparsification needs to be applied to every hidden layer, because the sparsified gradient would otherwise become dense again as it propagates from one layer to the next.
That is, in meprop the gradients are sparsified with a top-k operation at the output of every hidden layer. While we apply meprop to all hidden layers using the same k of top-k, usually the k for the output layer could be different from the k for the hidden layers, because the output layer typically has a very different dimension compared with the hidden layers. For example, there are 10 classes in the MNIST task, so the dimension of the output layer is 10, while we use an MLP with a hidden dimension of 500. Thus, the best k for the output layer could be different from that of the hidden layers.

Choice of top-k algorithms

In our C# implementation, instead of sorting the entire vector, we use the well-known min-heap based top-k selection method, which is slightly changed to focus on memory reuse. The algorithm has a time complexity of O(n log k) and a space complexity of O(k); a short sketch of such a heap-based selection is given after the dataset descriptions below. PyTorch comes with a GPU implementation of a parallelized top-k algorithm, although we are not sure exactly how the operation is implemented.

3 EXPERIMENTS

To demonstrate that the proposed method is general-purpose, we perform experiments on different models (LSTM/MLP), various training methods (Adam/AdaGrad), and diverse tasks.

Transition-based Dependency Parsing (Parsing): Following prior work, we use the English Penn TreeBank (PTB) [4] for evaluation. We follow the standard split of the corpus and use sections 2-21 as the training set (39,832 sentences, 1,900,056 transition examples), 1 section 22 as the development set (1,700 sentences, 80,234 transition examples), and section 23 as the final test set (2,416 sentences, 113,368 transition examples). The evaluation metric is unlabeled attachment score (UAS). We implement a parser using an MLP following [5], which is used as our baseline.

Part-of-Speech Tagging (POS-Tag): We use the standard benchmark dataset from prior work [6], which is derived from the Penn Treebank corpus. We use sections 0-18 of the Wall Street Journal (WSJ) for training (38,219 examples), and sections 22-24 for testing (5,462 examples). The evaluation metric is per-word accuracy. A popular model for this task is the LSTM model [7], 2 which is used as our baseline.

MNIST Image Recognition (MNIST): We use the MNIST handwritten digit dataset [8] for evaluation. MNIST consists of 60,000 28x28 pixel training images and an additional 10,000 test examples. Each image contains a single numerical digit (0-9). We select the first 5,000 images of the training images as the development set and the rest as the training set.

1. A transition example consists of a parsing context and its optimal transition action.
2. In this work, we use the bi-directional LSTM (Bi-LSTM) as the implementation of LSTM.
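As referenced above, here is a minimal sketch of a heap-based top-k index selection in Python, using the standard heapq module rather than the authors' memory-reuse-oriented C# variant; it runs in O(n log k) time with O(k) extra space.

```python
import heapq

def topk_indices(values, k):
    """Return the indices of the k entries with the largest absolute value.

    A size-k min-heap of (|value|, index) pairs is maintained while scanning
    the vector once, giving O(n log k) time and O(k) extra space.
    """
    heap = []  # min-heap over absolute magnitudes
    for i, v in enumerate(values):
        item = (abs(v), i)
        if len(heap) < k:
            heapq.heappush(heap, item)
        elif item > heap[0]:
            heapq.heapreplace(heap, item)  # pop the smallest, push the new item
        # otherwise the value is too small to be in the top-k; skip it
    return [i for _, i in heap]

# example: the gradient vector <1, 2, 3, -4> with k = 2 keeps indices 2 and 3
print(sorted(topk_indices([1, 2, 3, -4], 2)))  # [2, 3]
```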

TABLE 1
meprop results based on LSTM/MLP models and the Adam optimizer. Time means averaged time per iteration. Iter means the number of iterations needed to reach the optimal score on the development data. The model of this iteration is then used to obtain the test score. As we can see, applying meprop can substantially speed up back propagation with improved accuracy.

Parsing (Adam)    Iter   Backprop time (s)   Dev UAS (%)   Test UAS (%)
MLP (h=500)       10     9,
meprop (k=20)            (18.6x)                           (+0.04)

POS-Tag (Adam)    Iter   Backprop time (s)   Dev Acc (%)   Test Acc (%)
LSTM (h=500)      3      16,
meprop (k=10)            (37.1x)                           (+0.03)

MNIST (Adam)      Iter   Backprop time (s)   Dev Acc (%)   Test Acc (%)
MLP (h=500)
meprop (k=80)            (5.9x)                            (+0.07)

The evaluation metric is per-image accuracy. We use the MLP model as the baseline. Following common practice, we use ReLU [9] as the activation function of the hidden layers.

3.1 Experimental Settings

We set the dimension of the hidden layers to 500 for all the tasks. For Parsing, the input dimension is 48 (features) x 50 (dim per feature) = 2,400, and the output dimension is 25. For POS-Tag, the input dimension is 1 (word) x 50 (dim per word) + 7 (features) x 20 (dim per feature) = 190, and the output dimension is 45. For MNIST, the input dimension is 28 (pixels per row) x 28 (pixels per column) x 1 (dim per pixel) = 784, and the output dimension is 10. Based on the development set and prior work, we set the mini-batch size to 10,000 (transition examples), 1 (sentence), and 10 (images) for Parsing, POS-Tag, and MNIST, respectively. Using 10,000 transition examples for Parsing follows [5]. As discussed in Section 2, the optimal k of top-k for the output layer could be different from that of the hidden layers, because their dimensions could be very different. For Parsing and MNIST, we find using the same k for the output and the hidden layers works well, and we simply do so. For the other task, POS-Tag, we find the output layer should use a different k from the hidden layers. For simplicity, we do not apply meprop to the output layer for POS-Tag, because in this task we find the computational cost of the output layer is almost negligible compared with the other layers.

In the experiments of model simplification, we use the Adam optimizer for all the tasks, for the sake of simplicity. In addition, we also apply the cycle mechanism in the reported results. Note that, to simulate the real scenario, we run each configuration 5 times with different random seeds, and choose the best model on the development set to report. The hyper-parameters are tuned based on the development data. For the Adam optimization method, we find the default hyper-parameters work well on the development sets, which are as follows: the learning rate α = 0.001, β1 = 0.9, β2 = 0.999, and the default value of ε. The experiments on CPU are conducted on a computer with an INTEL(R) Xeon(R) 3.0GHz CPU. The experiments on GPU are conducted on an NVIDIA GeForce GTX GPU.

3.2 Experimental Results of meprop

In this experiment, the LSTM is based on one hidden layer and the MLP is based on two hidden layers (experiments on more hidden layers will be presented later). We conduct experiments on different optimization methods, including AdaGrad and Adam. Since meprop is applied to the linear transformations (which entail the major computational cost), we report the linear transformation related backprop time as Backprop Time. It does not include non-linear activations, which usually account for less than 2% of the computational cost.
The total time of back propagation, including non-linear activations, is reported as Overall Backprop Time. Table 1 shows the results based on different models and different optimization methods. In the table, meprop means applying meprop to the corresponding baseline model, h = 500 means that the hidden layer dimension is 500, and k = 20 means that meprop uses the top-20 elements (among 500 in total) for back propagation. Note that, for fair comparisons, all experiments are first conducted on the development data and the test data is not observable. Then, the optimal number of iterations is decided based on the optimal score on the development data, and the model of this iteration is used on the test data to obtain the test scores.

As we can see, applying meprop can substantially speed up back propagation. It provides a linear reduction in the computational cost. Surprisingly, the results demonstrate that we can update only fewer than 5% of the weights at each back propagation pass for the natural language processing tasks. This does not result in a larger number of training iterations. More surprisingly, the accuracy of the resulting models is actually improved rather than decreased. The main reason could be that the minimal effort update does not modify weakly relevant parameters, which makes overfitting less likely, similar to the dropout effect.

Result Analysis of meprop

Changing Optimizer

TABLE 2
meprop: Results using the AdaGrad optimizer. We can see that meprop also works with AdaGrad, indicating that meprop is independent of the optimizer.

Parsing (AdaGrad)   Iter   Test UAS (%)
MLP (h=500)
meprop (k=20)              (+0.03)

POS-Tag (AdaGrad)   Iter   Test Acc (%)
LSTM (h=500)
meprop (k=5)               (+0.32)

MNIST (AdaGrad)     Iter   Test Acc (%)
MLP (h=500)
meprop (k=10)              (+0.48)

Fig. 5. Accuracy vs. meprop's backprop ratio (left). Results of top-k meprop vs. random meprop (middle). Results of top-k meprop vs. baseline with the hidden dimension h (right).

TABLE 3
meprop: Results based on the same k and h. It can be concluded that meprop does not rely on redundant neurons, as the model with the small hidden dimension works much worse.

Parsing (Adam)    Iter   Test UAS (%)
MLP (h=20)
meprop (k=20)            (+1.64)

POS-Tag (Adam)    Iter   Test Acc (%)
LSTM (h=5)
meprop (k=5)             (+0.72)

MNIST (Adam)      Iter   Test Acc (%)
MLP (h=20)
meprop (k=20)            (+2.24)

TABLE 4
meprop: Varying the number of hidden layers on the MNIST task. The experiments demonstrate that meprop can also be applied to traditional deep models.

Layers   Method          Test Acc (%)
2        MLP (h=500)
         meprop (k=25)   (+0.10)
3        MLP (h=500)
         meprop (k=25)   (+0.16)
4        MLP (h=500)
         meprop (k=25)   (+0.05)
5        MLP (h=500)
         meprop (k=25)   (+0.16)

It is important to see whether meprop can be applied with different optimizers, because the minimal effort technique sparsifies the gradient, which affects the update of the parameters. For the AdaGrad learner, the learning rate is set to α = 0.01, 0.01, 0.1 for Parsing, POS-Tag, and MNIST, respectively, and ε = . As shown in Table 2, the results are consistent between AdaGrad and Adam. The results demonstrate that meprop is independent of specific optimization methods. For simplicity, the following experiments use Adam.

Varying Backprop Ratio

In Figure 5 (left), we vary the k of top-k meprop to compare the test accuracy at different ratios of meprop backprop. For example, when k=5, the backprop ratio is 5/500 = 1%. The optimizer is Adam. As we can see, meprop achieves consistently better accuracy than the baseline.

Top-k vs. Random

It is interesting to check the role of the top-k elements. Figure 5 (middle) shows the results of top-k meprop vs. random meprop. Random meprop means that random elements (instead of the top-k ones) are selected for back propagation. As we can see, the top-k version works better than the random version. It suggests that the top-k elements contain the most important information of the gradients.

Varying Hidden Dimension

We still have a question: does top-k meprop work well simply because the original model does not require such a big hidden dimension? For example, perhaps meprop (top-k=5) works simply because the LSTM works well with a hidden dimension of 5, and there is no need to use a hidden dimension of 500. To examine this, we perform experiments using the same hidden dimension as k, and the results are shown in Table 3. As we can see, however, the results of the small hidden dimensions are much worse than those of meprop. In addition, Figure 5 (right) shows more detailed curves obtained by varying the value of k. In the figure, different k gives a different backprop ratio for meprop and a different hidden dimension ratio for LSTM/MLP. As we can see, the answer to that question is negative: meprop does not rely on redundant hidden layer elements.

Adding More Hidden Layers

Another question is whether or not meprop relies on shallow models with only a few hidden layers. To answer this question, we also perform experiments on more hidden layers, from 2 hidden layers to 5 hidden layers.
We find that setting the dropout rate to 0.1 works well for most cases with different numbers of layers. For simplicity of comparison, we use the same dropout rate of 0.1 in this experiment. Table 4 shows that increasing the number of hidden layers does not hurt the performance of meprop.

Adding Dropout

Since we have observed that meprop can reduce the overfitting of deep learning, a natural question is whether meprop is reducing the same type of overfitting risk as dropout. Thus, we use the development data to find a proper value of the dropout rate for those tasks, and then further add meprop to check if further improvement is possible. Table 5 shows the results. As we can see, meprop can achieve further improvement over dropout. In particular, meprop has an improvement of 0.46 UAS on Parsing. The results suggest that the type of overfitting that meprop reduces is probably different from that of dropout.

TABLE 5
meprop: Adding the dropout technique. As the results show, meprop can further improve the performance on top of dropout, suggesting that meprop is reducing a different type of overfitting compared to dropout.

Parsing (Adam)    Dropout   Test UAS (%)
MLP (h=500)
meprop (k=40)               (+0.46)

POS-Tag (Adam)    Dropout   Test Acc (%)
LSTM (h=500)
meprop (k=20)               (+0.11)

MNIST (Adam)      Dropout   Test Acc (%)
MLP (h=500)
meprop (k=25)               (+0.23)

TABLE 6
Results of simple unified top-k meprop based on a whole mini-batch (i.e., unified sparse patterns). The optimizer is Adam. The mini-batch size is 50.

Layers   Method          Test Acc (%)
2        MLP (h=500)
         meprop (k=30)   (+0.11)
5        MLP (h=500)
         meprop (k=50)   (+0.27)

TABLE 7
Acceleration results on the matrix multiplication synthetic data using GPU. The batch size is

Method              Backprop time (ms)
Baseline (h=8192)
meprop (k=8)        8.37 (36.8x)
meprop (k=16)       9.16 (33.6x)
meprop (k=32)       (27.5x)
meprop (k=64)       (21.4x)
meprop (k=128)      (14.5x)
meprop (k=256)      (8.0x)
meprop (k=512)      (4.4x)

TABLE 8
Acceleration results on MNIST using GPU.

Method            Overall backprop time (ms)
MLP (h=8192)      17,696.2
meprop (k=8)      1,501.5 (11.8x)
meprop (k=16)     1,542.8 (11.5x)
meprop (k=32)     1,656.9 (10.7x)
meprop (k=64)     1,828.3 (9.7x)
meprop (k=128)    2,200.0 (8.0x)
meprop (k=256)    3,149.6 (5.6x)
meprop (k=512)    4,874.1 (3.6x)

Thus, a model should be able to take advantage of both meprop and dropout to reduce overfitting.

Speedup on GPU

For implementing meprop on GPU, the simplest solution is to treat the entire mini-batch as one big training example, where the top-k operation is based on the averaged values of all examples in the mini-batch. In this way, the big sparse matrix of the mini-batch will have consistent sparse patterns among examples, and this consistent sparse matrix can be transformed into a small dense matrix by removing the zero values. We call this implementation simple unified top-k. This experiment is based on PyTorch.

Despite its simplicity, Table 6 shows the good performance of this implementation, which is based on a mini-batch size of 50. We also find the speedup on GPU is less significant when the hidden dimension is low. The reason is that our GPU's computational power is not fully consumed by the baseline (with small hidden layers), so that normal back propagation is already fast enough, making it hard for meprop to achieve a substantial speedup. For example, supposing a GPU can finish 1,000 operations in one cycle, there could be no speed difference between a method with 100 operations and a method with 10 operations. Indeed, we find that MLP (h=64) and MLP (h=512) have almost the same GPU speed even on forward propagation (i.e., without meprop), while theoretically there should be an 8x difference. With GPU, the forward propagation time of MLP (h=64) and MLP (h=512) is 572ms and 644ms, respectively. This provides evidence for our hypothesis that our GPU is not fully consumed with the small hidden dimensions. Thus, the speedup test on GPU is more meaningful for heavy models, such that the baseline can at least fully consume the GPU's computational power. To check this, we test the GPU speedup on synthetic data of matrix multiplication with a larger hidden dimension. Indeed, Table 7 shows that meprop achieves much higher speed than the traditional backprop with the large hidden dimension. Furthermore, we test the GPU speedup on MLP with the large hidden dimension [10].
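Before turning to the results in Table 8, the simple unified top-k idea described above can be sketched as follows. This is a minimal PyTorch-style illustration with assumed shapes (Y = X W^T) and our own function name, not the authors' released implementation.

```python
import torch

def unified_topk_backward(W, X, dL_dY, k):
    """Simple unified top-k backward for Y = X @ W.T on a mini-batch.

    One set of top-k indices is chosen for the whole mini-batch, based on the
    averaged absolute gradient over the batch, so the sparse gradient can be
    handled as a small dense matrix of k columns.
    """
    # (batch, n) -> (n,): average magnitude of each output unit's gradient
    avg_mag = dL_dY.abs().mean(dim=0)
    _, idx = torch.topk(avg_mag, k)          # shared indices for all examples

    dY_small = dL_dY[:, idx]                 # (batch, k) dense slice
    dL_dW = torch.zeros_like(W)
    dL_dW[idx] = dY_small.t() @ X            # only k rows of W get gradients
    dL_dX = dY_small @ W[idx]                # uses only the k selected rows
    return dL_dW, dL_dX
```

Because one index set is shared by the whole mini-batch, the selected gradient slice stays dense, so ordinary dense GPU matrix multiplications can be used on the k selected rows instead of sparse operations.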
Table 8 shows that meprop also has a substantial GPU speedup on MNIST with the large hidden dimension. In this experiment, the speedup is based on Overall Backprop Time (see the prior definition). Those results demonstrate that meprop can achieve a good speedup on GPU when it is applied to heavy models. Finally, there are potentially other implementation choices of meprop on GPU. For example, another natural solution is to use a big sparse matrix to represent the sparsified gradient of the output of a mini-batch. Then, a sparse matrix multiplication library can be used to accelerate the computation. This could be an interesting direction of future work.

3.3 Experimental Results of mesimp

In this experiment, we only simplify the hidden layers of the model, and we use the Adam optimizer for all the tasks. We set the cycle to 10 for all the tasks; that is, we first train the model using mesimp for 5 epochs, then train the model normally for 5 epochs, and repeat the procedure till the end.

Table 9 shows the model simplification results based on different models. In the table, meprop means applying meprop to the corresponding baseline model, and mesimp means applying model compression on top of meprop. h = 500 means that the dimension of the model's hidden layers is 500, k = 20 means that in back propagation we propagate the top-20 elements, and prune = 0.08 means that a dimension which is updated in fewer than 8% of the steps during a statistical interval is dropped. As we can see, our method is capable of reducing the models to a relatively small size, while maintaining, if not improving, the performance. The hidden layers of the models are reduced by around 10x, 8x, and 3x for Parsing, POS-Tag, and MNIST, respectively. That means that when the simplified model is deployed, it could achieve more than 10x, 8x, and 3x computational cost reduction, respectively, in decoding, with similar or better performance.

TABLE 9
mesimp results based on LSTM/MLP models. Iter means the number of iterations needed to reach the optimal score on the development data. The model of this iteration is then used to obtain the test score. Dim means the dimension of the model at this iteration. For LSTM, it is the average over the two directions; for MLP, it is the average over the hidden layers. It can be drawn from the results that mesimp can reduce the model to a smaller size, often around 10%, while maintaining, if not improving, the performance.

Parsing                        Iter   Dim          Dev UAS (%)   Test UAS (%)
MLP (h=500)
mesimp (k=20, prune=0.08)             (10.2%)                    (+0.31)

POS-Tag                        Iter   Dim          Dev Acc (%)   Test Acc (%)
LSTM (h=500)
mesimp (k=20, prune=0.08)      3      60 (12.0%)                 (+0.03)

MNIST                          Iter   Dim          Dev Acc (%)   Test Acc (%)
MLP (h=500)
mesimp (k=160, prune=0.10)            (30.8%)                    (+0.11)

TABLE 10
mesimp: The dimensions of the resulting models. The results confirm that mesimp is suitable for deep models, and can adaptively determine the proper sizes of different layers.

Parsing                        #Average   #Hidden
mesimp (k=20, prune=0.08)

POS-Tag                        #Average   #Forward   #Backward
mesimp (k=20, prune=0.08)

MNIST                          #Average   #First   #Second
mesimp (k=160, prune=0.10)

The reason could be that the minimal effort update captures the important features, so that the simplified model is enough to fit the data, while without the minimal effort update, a model of a similar size treats each feature equally at the start, limiting its ability to learn from the data. We will show the corresponding experimental results in the result analysis below. The results show that the simplifying method is effective in reducing the model size, thus bringing a substantial reduction of the computational cost of decoding in real-world tasks. More importantly, the accuracy of the original model is kept, or even more often improved. This means model simplification could make it more feasible to deploy a deep learning system in a computation-constrained environment.

Result Analysis of mesimp

Adaptively setting the size of the hidden layers

It is worth noticing that mesimp is also able to automatically determine the appropriate size of the resulting model for deep neural networks (Table 10). At the beginning, we conduct the experiments on a neural network with a single hidden layer, that is, Parsing, and we get a promising result, as the model size is reduced to 10.2% of its original size. The result of Parsing makes us wonder whether mesimp could also simplify deeper networks, so we continue to run experiments on different models. In the experiments of POS-Tag, the LSTM is a Bi-LSTM, that is, there is a forward LSTM and a backward LSTM with respect to the input sequences, which means it is often very deep in the time dimension (horizontal). As shown in Table 10, the forward and backward LSTMs indeed get different dimensions, 60 and 57 respectively. We further conduct experiments on an MLP with 2 hidden layers (vertical), and the result shows that the first hidden layer and the second hidden layer are again of different dimensions, which confirms that mesimp can adaptively adjust the hidden layer sizes in a multi-layer setting.

TABLE 11
mesimp: Results based on the same k and h for Parsing. We report the results of 5 different runs of the baseline model. It is clear that the simplified model consistently surpasses the traditionally-trained model of the same size, indicating that mesimp may enable a more efficient and effective learning.
Method          Dev UAS (%)   Test UAS (%)
mesimp (h=51)
MLP (h=51)
MLP (h=51)
MLP (h=51)
MLP (h=51)
MLP (h=51)

We also need to remind the readers that mesimp does not need different hyper-parameters to be specified for different layers, while in most of the previous work ([11], [12], [13]), if different layer sizes are pursued, a different hyper-parameter needs to be set separately for each hidden layer, limiting their ability to simplify the models adaptively.

Comparing with the models of similar sizes

One natural and important question is how the simplified model performs compared to a model of a similar size. If the simplified models do not perform as well as normally trained models of similar sizes, the simplifying method may be redundant and unnecessary. The results in Table 11 shed light on that question. We train baseline models with sizes similar to the sizes of the simplified models, and report the results in Table 11. As can be seen, our simplified models perform better than the models trained at similar sizes, especially on the Parsing task. The results show that the model simplification training is not unnecessary, as a simplified model achieves better accuracy than a model trained with a small dimension.

An attempt at revealing why the minimal effort technique works

From the back propagation simplification and model simplification results, we can see that approaches based on active paths, which are measured by back propagation, are effective in reducing overfitting. One of our hypotheses is that for a neural network, for each example, only a small part of the neurons is needed to produce the correct results, and gradients are good identifiers to detect the decisive neurons.

TABLE 12
MNIST: minimal effort activation. Dim means the averaged active dimension of the hidden layers across examples. An epoch range such as 10-20 means that the minimal effort activation technique is applied from epoch 10 to epoch 20. The results show that for an example, a smaller number of neurons is enough to generate the correct prediction, and that by only training the highly-related neurons, the performance could be improved.

MNIST                              Iter   Dim   Test Acc (%)
MLP (h=500)
meact (threshold=0.004, 10-20)                  (+0.24)
meact (threshold=0.004, 15-20)                  (+0.14)

Fig. 6. Change of accuracy (upper) and average absolute update per parameter (lower) in active path finding. To isolate the impact of meact, we fix the random seed, which means the initialization of the parameters and the shuffle are the same between meact and the baseline, so the lines coincide with each other during epochs 1-10, in which the training is exactly the same. As we can see, after meact is applied, the accuracy rises, which indicates that training focused on the most relevant neurons can reduce overfitting, and the update drops, which suggests that in the later stage of normal training, most of the update is caused by fitting the noise, making the already-trained neurons change constantly.

Too many neurons are harmful to the model, because the extra neurons may be trained to fit the noise in the training examples. To examine the hypothesis, we design a new algorithm, which we call meact (minimal effort activation), which activates only the active paths, with respect to each example, in forward propagation, and the experimental results are consistent with our hypothesis.

To realize the idea, for each example, we only activate the paths with the largest accumulated absolute gradients, and the number of the chosen paths is controlled by a threshold. Specifically, we accumulate the absolute gradients of each layer's output for each example, denoted by $g_i(x_j)$, where i is the neuron's index in a layer, and j is the example's identifier. For a layer, if a neuron's accumulated absolute gradient accounts for less than a specified percentage of the sum of the gradients of all the neurons of the layer, that is, $g_i(x_j) < \text{threshold} \cdot \sum_{i=1}^{n} g_i(x_j)$, where n is the number of neurons in the layer and 0 < threshold < 1, the neuron is considered inactive. The paths outside the active paths are deactivated, and we use the previous activation values from the last encounter in training as their outputs, so that the effort in activation is minimized. As the sparse activation is done in forward propagation, the back propagation is sparsified as well, because the deactivated neurons obviously contribute nothing to the results, meaning their gradients are zero, which requires no computation. Note that the method does not reduce the size of the model, and for each example, we obtain its own active paths. During test, the forward propagation is done normally, as we would not know the active paths of these unseen examples.

From the results shown in Table 12, we can see that, for the MNIST task, on average, fewer than 100 neurons are adequate to achieve good or even better results. From the results shown in Figure 6, we can see that, during minimal effort activation, the accuracy rises above the baseline, which shows that the accuracy can benefit from training that focuses more on the related neurons.
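As an illustration of the selection rule described above, the following Python sketch marks the neurons of one layer as active or inactive for a single example, based on accumulated absolute gradients and the relative threshold; the function name and the toy numbers are our own assumptions, not the authors' released implementation.

```python
import numpy as np

def active_neuron_mask(accumulated_abs_grads, threshold):
    """Decide which neurons of a layer are active for one example.

    accumulated_abs_grads : 1-D array g_i(x_j), the accumulated absolute
                            gradients of the layer's outputs for example x_j
    threshold             : fraction in (0, 1); a neuron is inactive if its
                            share of the layer's total gradient mass is below it
    """
    total = accumulated_abs_grads.sum()
    return accumulated_abs_grads >= threshold * total

# toy example: with threshold = 0.1, only neurons holding at least 10% of the
# layer's accumulated gradient mass stay active for this example
g = np.array([0.02, 0.50, 0.03, 0.45])
print(active_neuron_mask(g, 0.1))  # [False  True False  True]
```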
To see how the accuracy improvement is acquired, we further investigate the change of the parameters during training. Normally, the gradient is used to represent the change. However, because we use Adam as the optimizer, where the update is not done by directly using the gradient, we consider the difference of a parameter before and after the Adam update rule as its change. As there are many iterations in an epoch and many parameters in a model, we average the change of all the parameters over all iterations, that is,

$\text{update} = \frac{\sum_{j=1}^{t} \sum_{i=1}^{n} |\delta_i^j|}{n \cdot t}$

where $\delta_i^j$ means the change of parameter i at iteration j, n means the number of parameters, and t means the number of training iterations, and we report this average absolute change per parameter per iteration as the update. We use the absolute value of the update of a parameter, because we would like to see how much the parameters have been modified during the training process, not just the change between the start and the end.

As shown in Figure 6, the update drops sharply when meact is applied, meaning the accuracy improvement is achieved with very little change of the parameters, while the update of normal training is still high, more than 5x the update of meact, suggesting that much of the update is redundant and unnecessary, which could be the result of the model trying to adapt to the noise in the data. As there should be no regular pattern in the noise, fitting it requires more subtle updates of all the parameters, which is much harder and often affects the training of the essential features, thus leading to a lower accuracy than our method, which tries to focus only on the essential features for each example. The results confirm our initial hypothesis that, for an example, only a few neurons are required, and the minimal effort technique provides a simple yet effective way to train and extract the helpful neurons.
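As a brief illustration, the average absolute update per parameter per iteration defined above can be tracked with a small helper like the one below; the snapshot-and-diff approach and the names are our own sketch, not the authors' measurement code.

```python
import numpy as np

class UpdateTracker:
    """Track update = sum_j sum_i |delta_i^j| / (n * t) over training.

    delta_i^j is the change of parameter i produced by the optimizer step at
    iteration j, measured as the difference of parameter values before and
    after the step (so it reflects the actual Adam update, not the raw gradient).
    """
    def __init__(self):
        self.total_abs_change = 0.0
        self.num_params = 0
        self.num_iters = 0

    def record_step(self, params_before, params_after):
        # accumulate |delta| over all parameters for this iteration
        for before, after in zip(params_before, params_after):
            self.total_abs_change += np.abs(after - before).sum()
            if self.num_iters == 0:
                self.num_params += before.size
        self.num_iters += 1

    def average_update(self):
        return self.total_abs_change / (self.num_params * self.num_iters)
```

In a training loop, one would snapshot copies of the parameter arrays before the optimizer step, call record_step with the before/after values, and read average_update at the end of each epoch to reproduce a curve like the lower panel of Figure 6.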


More information

LECTURE NOTES Professor Anita Wasilewska NEURAL NETWORKS

LECTURE NOTES Professor Anita Wasilewska NEURAL NETWORKS LECTURE NOTES Professor Anita Wasilewska NEURAL NETWORKS Neural Networks Classifier Introduction INPUT: classification data, i.e. it contains an classification (class) attribute. WE also say that the class

More information

Deep Learning Applications

Deep Learning Applications October 20, 2017 Overview Supervised Learning Feedforward neural network Convolution neural network Recurrent neural network Recursive neural network (Recursive neural tensor network) Unsupervised Learning

More information

Practical Tips for using Backpropagation

Practical Tips for using Backpropagation Practical Tips for using Backpropagation Keith L. Downing August 31, 2017 1 Introduction In practice, backpropagation is as much an art as a science. The user typically needs to try many combinations of

More information

Assignment 2. Classification and Regression using Linear Networks, Multilayer Perceptron Networks, and Radial Basis Functions

Assignment 2. Classification and Regression using Linear Networks, Multilayer Perceptron Networks, and Radial Basis Functions ENEE 739Q: STATISTICAL AND NEURAL PATTERN RECOGNITION Spring 2002 Assignment 2 Classification and Regression using Linear Networks, Multilayer Perceptron Networks, and Radial Basis Functions Aravind Sundaresan

More information

Neural Networks for Machine Learning. Lecture 15a From Principal Components Analysis to Autoencoders

Neural Networks for Machine Learning. Lecture 15a From Principal Components Analysis to Autoencoders Neural Networks for Machine Learning Lecture 15a From Principal Components Analysis to Autoencoders Geoffrey Hinton Nitish Srivastava, Kevin Swersky Tijmen Tieleman Abdel-rahman Mohamed Principal Components

More information

Asynchronous Parallel Learning for Neural Networks and Structured Models with Dense Features

Asynchronous Parallel Learning for Neural Networks and Structured Models with Dense Features Asynchronous Parallel Learning for Neural Networks and Structured Models with Dense Features Xu SUN ( 孙栩 ) Peking University xusun@pku.edu.cn Motivation Neural networks -> Good Performance CNN, RNN, LSTM

More information

Keras: Handwritten Digit Recognition using MNIST Dataset

Keras: Handwritten Digit Recognition using MNIST Dataset Keras: Handwritten Digit Recognition using MNIST Dataset IIT PATNA February 9, 2017 1 / 24 OUTLINE 1 Introduction Keras: Deep Learning library for Theano and TensorFlow 2 Installing Keras Installation

More information

DEEP LEARNING REVIEW. Yann LeCun, Yoshua Bengio & Geoffrey Hinton Nature Presented by Divya Chitimalla

DEEP LEARNING REVIEW. Yann LeCun, Yoshua Bengio & Geoffrey Hinton Nature Presented by Divya Chitimalla DEEP LEARNING REVIEW Yann LeCun, Yoshua Bengio & Geoffrey Hinton Nature 2015 -Presented by Divya Chitimalla What is deep learning Deep learning allows computational models that are composed of multiple

More information

ECE 5470 Classification, Machine Learning, and Neural Network Review

ECE 5470 Classification, Machine Learning, and Neural Network Review ECE 5470 Classification, Machine Learning, and Neural Network Review Due December 1. Solution set Instructions: These questions are to be answered on this document which should be submitted to blackboard

More information

Lecture on Modeling Tools for Clustering & Regression

Lecture on Modeling Tools for Clustering & Regression Lecture on Modeling Tools for Clustering & Regression CS 590.21 Analysis and Modeling of Brain Networks Department of Computer Science University of Crete Data Clustering Overview Organizing data into

More information

Artificial Intelligence Introduction Handwriting Recognition Kadir Eren Unal ( ), Jakob Heyder ( )

Artificial Intelligence Introduction Handwriting Recognition Kadir Eren Unal ( ), Jakob Heyder ( ) Structure: 1. Introduction 2. Problem 3. Neural network approach a. Architecture b. Phases of CNN c. Results 4. HTM approach a. Architecture b. Setup c. Results 5. Conclusion 1.) Introduction Artificial

More information

MoonRiver: Deep Neural Network in C++

MoonRiver: Deep Neural Network in C++ MoonRiver: Deep Neural Network in C++ Chung-Yi Weng Computer Science & Engineering University of Washington chungyi@cs.washington.edu Abstract Artificial intelligence resurges with its dramatic improvement

More information

5 Learning hypothesis classes (16 points)

5 Learning hypothesis classes (16 points) 5 Learning hypothesis classes (16 points) Consider a classification problem with two real valued inputs. For each of the following algorithms, specify all of the separators below that it could have generated

More information

Recurrent Neural Network (RNN) Industrial AI Lab.

Recurrent Neural Network (RNN) Industrial AI Lab. Recurrent Neural Network (RNN) Industrial AI Lab. For example (Deterministic) Time Series Data Closed- form Linear difference equation (LDE) and initial condition High order LDEs 2 (Stochastic) Time Series

More information

Deep Learning. Vladimir Golkov Technical University of Munich Computer Vision Group

Deep Learning. Vladimir Golkov Technical University of Munich Computer Vision Group Deep Learning Vladimir Golkov Technical University of Munich Computer Vision Group 1D Input, 1D Output target input 2 2D Input, 1D Output: Data Distribution Complexity Imagine many dimensions (data occupies

More information

Sequence Modeling: Recurrent and Recursive Nets. By Pyry Takala 14 Oct 2015

Sequence Modeling: Recurrent and Recursive Nets. By Pyry Takala 14 Oct 2015 Sequence Modeling: Recurrent and Recursive Nets By Pyry Takala 14 Oct 2015 Agenda Why Recurrent neural networks? Anatomy and basic training of an RNN (10.2, 10.2.1) Properties of RNNs (10.2.2, 8.2.6) Using

More information

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CHAPTER 4 CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS 4.1 Introduction Optical character recognition is one of

More information

Plankton Classification Using ConvNets

Plankton Classification Using ConvNets Plankton Classification Using ConvNets Abhinav Rastogi Stanford University Stanford, CA arastogi@stanford.edu Haichuan Yu Stanford University Stanford, CA haichuan@stanford.edu Abstract We present the

More information

Index. Umberto Michelucci 2018 U. Michelucci, Applied Deep Learning,

Index. Umberto Michelucci 2018 U. Michelucci, Applied Deep Learning, A Acquisition function, 298, 301 Adam optimizer, 175 178 Anaconda navigator conda command, 3 Create button, 5 download and install, 1 installing packages, 8 Jupyter Notebook, 11 13 left navigation pane,

More information

JOINT INTENT DETECTION AND SLOT FILLING USING CONVOLUTIONAL NEURAL NETWORKS. Puyang Xu, Ruhi Sarikaya. Microsoft Corporation

JOINT INTENT DETECTION AND SLOT FILLING USING CONVOLUTIONAL NEURAL NETWORKS. Puyang Xu, Ruhi Sarikaya. Microsoft Corporation JOINT INTENT DETECTION AND SLOT FILLING USING CONVOLUTIONAL NEURAL NETWORKS Puyang Xu, Ruhi Sarikaya Microsoft Corporation ABSTRACT We describe a joint model for intent detection and slot filling based

More information

Keras: Handwritten Digit Recognition using MNIST Dataset

Keras: Handwritten Digit Recognition using MNIST Dataset Keras: Handwritten Digit Recognition using MNIST Dataset IIT PATNA January 31, 2018 1 / 30 OUTLINE 1 Keras: Introduction 2 Installing Keras 3 Keras: Building, Testing, Improving A Simple Network 2 / 30

More information

Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling

Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling Authors: Junyoung Chung, Caglar Gulcehre, KyungHyun Cho and Yoshua Bengio Presenter: Yu-Wei Lin Background: Recurrent Neural

More information

Natural Language Processing with Deep Learning CS224N/Ling284. Christopher Manning Lecture 4: Backpropagation and computation graphs

Natural Language Processing with Deep Learning CS224N/Ling284. Christopher Manning Lecture 4: Backpropagation and computation graphs Natural Language Processing with Deep Learning CS4N/Ling84 Christopher Manning Lecture 4: Backpropagation and computation graphs Lecture Plan Lecture 4: Backpropagation and computation graphs 1. Matrix

More information

On the Effectiveness of Neural Networks Classifying the MNIST Dataset

On the Effectiveness of Neural Networks Classifying the MNIST Dataset On the Effectiveness of Neural Networks Classifying the MNIST Dataset Carter W. Blum March 2017 1 Abstract Convolutional Neural Networks (CNNs) are the primary driver of the explosion of computer vision.

More information

Recurrent Neural Networks

Recurrent Neural Networks Recurrent Neural Networks 11-785 / Fall 2018 / Recitation 7 Raphaël Olivier Recap : RNNs are magic They have infinite memory They handle all kinds of series They re the basis of recent NLP : Translation,

More information

Neural Networks (pp )

Neural Networks (pp ) Notation: Means pencil-and-paper QUIZ Means coding QUIZ Neural Networks (pp. 106-121) The first artificial neural network (ANN) was the (single-layer) perceptron, a simplified model of a biological neuron.

More information

Non-Profiled Deep Learning-Based Side-Channel Attacks

Non-Profiled Deep Learning-Based Side-Channel Attacks Non-Profiled Deep Learning-Based Side-Channel Attacks Benjamin Timon UL Transaction Security, Singapore benjamin.timon@ul.com Abstract. Deep Learning has recently been introduced as a new alternative to

More information

COMP 551 Applied Machine Learning Lecture 14: Neural Networks

COMP 551 Applied Machine Learning Lecture 14: Neural Networks COMP 551 Applied Machine Learning Lecture 14: Neural Networks Instructor: (jpineau@cs.mcgill.ca) Class web page: www.cs.mcgill.ca/~jpineau/comp551 Unless otherwise noted, all material posted for this course

More information

Scalable Trigram Backoff Language Models

Scalable Trigram Backoff Language Models Scalable Trigram Backoff Language Models Kristie Seymore Ronald Rosenfeld May 1996 CMU-CS-96-139 School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213 This material is based upon work

More information

MIXED PRECISION TRAINING: THEORY AND PRACTICE Paulius Micikevicius

MIXED PRECISION TRAINING: THEORY AND PRACTICE Paulius Micikevicius MIXED PRECISION TRAINING: THEORY AND PRACTICE Paulius Micikevicius What is Mixed Precision Training? Reduced precision tensor math with FP32 accumulation, FP16 storage Successfully used to train a variety

More information

Artificial Neuron Modelling Based on Wave Shape

Artificial Neuron Modelling Based on Wave Shape Artificial Neuron Modelling Based on Wave Shape Kieran Greer, Distributed Computing Systems, Belfast, UK. http://distributedcomputingsystems.co.uk Version 1.2 Abstract This paper describes a new model

More information

Clustering algorithms and autoencoders for anomaly detection

Clustering algorithms and autoencoders for anomaly detection Clustering algorithms and autoencoders for anomaly detection Alessia Saggio Lunch Seminars and Journal Clubs Université catholique de Louvain, Belgium 3rd March 2017 a Outline Introduction Clustering algorithms

More information

Ensemble methods in machine learning. Example. Neural networks. Neural networks

Ensemble methods in machine learning. Example. Neural networks. Neural networks Ensemble methods in machine learning Bootstrap aggregating (bagging) train an ensemble of models based on randomly resampled versions of the training set, then take a majority vote Example What if you

More information

Automated Diagnosis of Vertebral Fractures using 2D and 3D Convolutional Networks

Automated Diagnosis of Vertebral Fractures using 2D and 3D Convolutional Networks Automated Diagnosis of Vertebral Fractures using 2D and 3D Convolutional Networks CS189 Final Project Naofumi Tomita Overview Automated diagnosis of osteoporosis-related vertebral fractures is a useful

More information

Machine learning for vision. It s the features, stupid! cathedral. high-rise. Winter Roland Memisevic. Lecture 2, January 26, 2016

Machine learning for vision. It s the features, stupid! cathedral. high-rise. Winter Roland Memisevic. Lecture 2, January 26, 2016 Winter 2016 Lecture 2, Januar 26, 2016 f2? cathedral high-rise f1 A common computer vision pipeline before 2012 1. 2. 3. 4. Find interest points. Crop patches around them. Represent each patch with a sparse

More information

How Learning Differs from Optimization. Sargur N. Srihari

How Learning Differs from Optimization. Sargur N. Srihari How Learning Differs from Optimization Sargur N. srihari@cedar.buffalo.edu 1 Topics in Optimization Optimization for Training Deep Models: Overview How learning differs from optimization Risk, empirical

More information

Lecture : Training a neural net part I Initialization, activations, normalizations and other practical details Anne Solberg February 28, 2018

Lecture : Training a neural net part I Initialization, activations, normalizations and other practical details Anne Solberg February 28, 2018 INF 5860 Machine learning for image classification Lecture : Training a neural net part I Initialization, activations, normalizations and other practical details Anne Solberg February 28, 2018 Reading

More information

Deep Learning. Volker Tresp Summer 2014

Deep Learning. Volker Tresp Summer 2014 Deep Learning Volker Tresp Summer 2014 1 Neural Network Winter and Revival While Machine Learning was flourishing, there was a Neural Network winter (late 1990 s until late 2000 s) Around 2010 there

More information

Residual Networks And Attention Models. cs273b Recitation 11/11/2016. Anna Shcherbina

Residual Networks And Attention Models. cs273b Recitation 11/11/2016. Anna Shcherbina Residual Networks And Attention Models cs273b Recitation 11/11/2016 Anna Shcherbina Introduction to ResNets Introduced in 2015 by Microsoft Research Deep Residual Learning for Image Recognition (He, Zhang,

More information

Deepest Neural Networks

Deepest Neural Networks Deepest Neural Networks arxiv:707.0267v [cs.ne] 9 Jul 207 Raúl Rojas Dahlem Center for Machine Learning and Robotics Freie Universität Berlin July 207 Abstract This paper shows that a long chain of perceptrons

More information

Pattern Recognition. Kjell Elenius. Speech, Music and Hearing KTH. March 29, 2007 Speech recognition

Pattern Recognition. Kjell Elenius. Speech, Music and Hearing KTH. March 29, 2007 Speech recognition Pattern Recognition Kjell Elenius Speech, Music and Hearing KTH March 29, 2007 Speech recognition 2007 1 Ch 4. Pattern Recognition 1(3) Bayes Decision Theory Minimum-Error-Rate Decision Rules Discriminant

More information

Supplementary A. Overview. C. Time and Space Complexity. B. Shape Retrieval. D. Permutation Invariant SOM. B.1. Dataset

Supplementary A. Overview. C. Time and Space Complexity. B. Shape Retrieval. D. Permutation Invariant SOM. B.1. Dataset Supplementary A. Overview This supplementary document provides more technical details and experimental results to the main paper. Shape retrieval experiments are demonstrated with ShapeNet Core55 dataset

More information

Visual object classification by sparse convolutional neural networks

Visual object classification by sparse convolutional neural networks Visual object classification by sparse convolutional neural networks Alexander Gepperth 1 1- Ruhr-Universität Bochum - Institute for Neural Dynamics Universitätsstraße 150, 44801 Bochum - Germany Abstract.

More information

Kernel-based online machine learning and support vector reduction

Kernel-based online machine learning and support vector reduction Kernel-based online machine learning and support vector reduction Sumeet Agarwal 1, V. Vijaya Saradhi 2 andharishkarnick 2 1- IBM India Research Lab, New Delhi, India. 2- Department of Computer Science

More information

Notes on Multilayer, Feedforward Neural Networks

Notes on Multilayer, Feedforward Neural Networks Notes on Multilayer, Feedforward Neural Networks CS425/528: Machine Learning Fall 2012 Prepared by: Lynne E. Parker [Material in these notes was gleaned from various sources, including E. Alpaydin s book

More information

4.12 Generalization. In back-propagation learning, as many training examples as possible are typically used.

4.12 Generalization. In back-propagation learning, as many training examples as possible are typically used. 1 4.12 Generalization In back-propagation learning, as many training examples as possible are typically used. It is hoped that the network so designed generalizes well. A network generalizes well when

More information

Image Compression: An Artificial Neural Network Approach

Image Compression: An Artificial Neural Network Approach Image Compression: An Artificial Neural Network Approach Anjana B 1, Mrs Shreeja R 2 1 Department of Computer Science and Engineering, Calicut University, Kuttippuram 2 Department of Computer Science and

More information

Neuron Selectivity as a Biologically Plausible Alternative to Backpropagation

Neuron Selectivity as a Biologically Plausible Alternative to Backpropagation Neuron Selectivity as a Biologically Plausible Alternative to Backpropagation C.J. Norsigian Department of Bioengineering cnorsigi@eng.ucsd.edu Vishwajith Ramesh Department of Bioengineering vramesh@eng.ucsd.edu

More information

Hidden Units. Sargur N. Srihari

Hidden Units. Sargur N. Srihari Hidden Units Sargur N. srihari@cedar.buffalo.edu 1 Topics in Deep Feedforward Networks Overview 1. Example: Learning XOR 2. Gradient-Based Learning 3. Hidden Units 4. Architecture Design 5. Backpropagation

More information

A Quick Guide on Training a neural network using Keras.

A Quick Guide on Training a neural network using Keras. A Quick Guide on Training a neural network using Keras. TensorFlow and Keras Keras Open source High level, less flexible Easy to learn Perfect for quick implementations Starts by François Chollet from

More information

Neural Network Neurons

Neural Network Neurons Neural Networks Neural Network Neurons 1 Receives n inputs (plus a bias term) Multiplies each input by its weight Applies activation function to the sum of results Outputs result Activation Functions Given

More information

SEMANTIC COMPUTING. Lecture 9: Deep Learning: Recurrent Neural Networks (RNNs) TU Dresden, 21 December 2018

SEMANTIC COMPUTING. Lecture 9: Deep Learning: Recurrent Neural Networks (RNNs) TU Dresden, 21 December 2018 SEMANTIC COMPUTING Lecture 9: Deep Learning: Recurrent Neural Networks (RNNs) Dagmar Gromann International Center For Computational Logic TU Dresden, 21 December 2018 Overview Handling Overfitting Recurrent

More information

All You Want To Know About CNNs. Yukun Zhu

All You Want To Know About CNNs. Yukun Zhu All You Want To Know About CNNs Yukun Zhu Deep Learning Deep Learning Image from http://imgur.com/ Deep Learning Image from http://imgur.com/ Deep Learning Image from http://imgur.com/ Deep Learning Image

More information

LSTM for Language Translation and Image Captioning. Tel Aviv University Deep Learning Seminar Oran Gafni & Noa Yedidia

LSTM for Language Translation and Image Captioning. Tel Aviv University Deep Learning Seminar Oran Gafni & Noa Yedidia 1 LSTM for Language Translation and Image Captioning Tel Aviv University Deep Learning Seminar Oran Gafni & Noa Yedidia 2 Part I LSTM for Language Translation Motivation Background (RNNs, LSTMs) Model

More information

CPSC 340: Machine Learning and Data Mining

CPSC 340: Machine Learning and Data Mining CPSC 340: Machine Learning and Data Mining Fundamentals of learning (continued) and the k-nearest neighbours classifier Original version of these slides by Mark Schmidt, with modifications by Mike Gelbart.

More information

Sentiment Classification of Food Reviews

Sentiment Classification of Food Reviews Sentiment Classification of Food Reviews Hua Feng Department of Electrical Engineering Stanford University Stanford, CA 94305 fengh15@stanford.edu Ruixi Lin Department of Electrical Engineering Stanford

More information

Pouya Kousha Fall 2018 CSE 5194 Prof. DK Panda

Pouya Kousha Fall 2018 CSE 5194 Prof. DK Panda Pouya Kousha Fall 2018 CSE 5194 Prof. DK Panda 1 Observe novel applicability of DL techniques in Big Data Analytics. Applications of DL techniques for common Big Data Analytics problems. Semantic indexing

More information

Predicting Messaging Response Time in a Long Distance Relationship

Predicting Messaging Response Time in a Long Distance Relationship Predicting Messaging Response Time in a Long Distance Relationship Meng-Chen Shieh m3shieh@ucsd.edu I. Introduction The key to any successful relationship is communication, especially during times when

More information

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li Learning to Match Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li 1. Introduction The main tasks in many applications can be formalized as matching between heterogeneous objects, including search, recommendation,

More information

Fast or furious? - User analysis of SF Express Inc

Fast or furious? - User analysis of SF Express Inc CS 229 PROJECT, DEC. 2017 1 Fast or furious? - User analysis of SF Express Inc Gege Wen@gegewen, Yiyuan Zhang@yiyuan12, Kezhen Zhao@zkz I. MOTIVATION The motivation of this project is to predict the likelihood

More information

Neural Networks. Robot Image Credit: Viktoriya Sukhanova 123RF.com

Neural Networks. Robot Image Credit: Viktoriya Sukhanova 123RF.com Neural Networks These slides were assembled by Eric Eaton, with grateful acknowledgement of the many others who made their course materials freely available online. Feel free to reuse or adapt these slides

More information

Stacked Denoising Autoencoders for Face Pose Normalization

Stacked Denoising Autoencoders for Face Pose Normalization Stacked Denoising Autoencoders for Face Pose Normalization Yoonseop Kang 1, Kang-Tae Lee 2,JihyunEun 2, Sung Eun Park 2 and Seungjin Choi 1 1 Department of Computer Science and Engineering Pohang University

More information

Inception and Residual Networks. Hantao Zhang. Deep Learning with Python.

Inception and Residual Networks. Hantao Zhang. Deep Learning with Python. Inception and Residual Networks Hantao Zhang Deep Learning with Python https://en.wikipedia.org/wiki/residual_neural_network Deep Neural Network Progress from Large Scale Visual Recognition Challenge (ILSVRC)

More information

Classification Lecture Notes cse352. Neural Networks. Professor Anita Wasilewska

Classification Lecture Notes cse352. Neural Networks. Professor Anita Wasilewska Classification Lecture Notes cse352 Neural Networks Professor Anita Wasilewska Neural Networks Classification Introduction INPUT: classification data, i.e. it contains an classification (class) attribute

More information

Final Report: Classification of Plankton Classes By Tae Ho Kim and Saaid Haseeb Arshad

Final Report: Classification of Plankton Classes By Tae Ho Kim and Saaid Haseeb Arshad Final Report: Classification of Plankton Classes By Tae Ho Kim and Saaid Haseeb Arshad Table of Contents 1. Project Overview a. Problem Statement b. Data c. Overview of the Two Stages of Implementation

More information

Decentralized and Distributed Machine Learning Model Training with Actors

Decentralized and Distributed Machine Learning Model Training with Actors Decentralized and Distributed Machine Learning Model Training with Actors Travis Addair Stanford University taddair@stanford.edu Abstract Training a machine learning model with terabytes to petabytes of

More information

SDA: Software-Defined Accelerator for Large- Scale DNN Systems

SDA: Software-Defined Accelerator for Large- Scale DNN Systems SDA: Software-Defined Accelerator for Large- Scale DNN Systems Jian Ouyang, 1 Shiding Lin, 1 Wei Qi, 1 Yong Wang, 1 Bo Yu, 1 Song Jiang, 2 1 Baidu, Inc. 2 Wayne State University Introduction of Baidu A

More information

arxiv: v1 [cs.cv] 14 Dec 2016

arxiv: v1 [cs.cv] 14 Dec 2016 Detect, Replace, Refine: Deep Structured Prediction For Pixel Wise Labeling arxiv:1612.04770v1 [cs.cv] 14 Dec 2016 Spyros Gidaris University Paris-Est, LIGM Ecole des Ponts ParisTech spyros.gidaris@imagine.enpc.fr

More information

11. Neural Network Regularization

11. Neural Network Regularization 11. Neural Network Regularization CS 519 Deep Learning, Winter 2016 Fuxin Li With materials from Andrej Karpathy, Zsolt Kira Preventing overfitting Approach 1: Get more data! Always best if possible! If

More information

Machine Learning. The Breadth of ML Neural Networks & Deep Learning. Marc Toussaint. Duy Nguyen-Tuong. University of Stuttgart

Machine Learning. The Breadth of ML Neural Networks & Deep Learning. Marc Toussaint. Duy Nguyen-Tuong. University of Stuttgart Machine Learning The Breadth of ML Neural Networks & Deep Learning Marc Toussaint University of Stuttgart Duy Nguyen-Tuong Bosch Center for Artificial Intelligence Summer 2017 Neural Networks Consider

More information

Artificial Intelligence. Programming Styles

Artificial Intelligence. Programming Styles Artificial Intelligence Intro to Machine Learning Programming Styles Standard CS: Explicitly program computer to do something Early AI: Derive a problem description (state) and use general algorithms to

More information

CS231A Course Project Final Report Sign Language Recognition with Unsupervised Feature Learning

CS231A Course Project Final Report Sign Language Recognition with Unsupervised Feature Learning CS231A Course Project Final Report Sign Language Recognition with Unsupervised Feature Learning Justin Chen Stanford University justinkchen@stanford.edu Abstract This paper focuses on experimenting with

More information