Training Simplification and Model Simplification for Deep Learning: A Minimal Effort Back Propagation Method

Xu Sun, Xuancheng Ren, Shuming Ma, Bingzhen Wei, Wei Li, and Houfeng Wang

arXiv v1 [cs.LG] 17 Nov 2017

X. Sun, X. Ren, S. Ma, B. Wei, W. Li, and H. Wang are with the School of Electronics Engineering and Computer Science, Peking University, China, and the MOE Key Laboratory of Computational Linguistics, Peking University, China. {xusun, renxc, shumingma, weibz, liweitj47, wanghf}@pku.edu.cn. The first two authors contributed equally to this work. This work is a substantial extension of the work presented at ICML 2017 [1]. The codes are available at

Abstract. We propose a simple yet effective technique to simplify the training and the resulting model of neural networks. In back propagation, only a small subset of the full gradient is computed to update the model parameters. The gradient vectors are sparsified in such a way that only the top-k elements (in terms of magnitude) are kept. As a result, only k rows or columns (depending on the layout) of the weight matrix are modified, leading to a linear reduction in the computational cost. Based on the sparsified gradients, we further simplify the model by eliminating the rows or columns that are seldom updated, which reduces the computational cost of both training and decoding, and can potentially accelerate decoding in real-world applications. Surprisingly, experimental results demonstrate that most of the time we only need to update fewer than 5% of the weights at each back propagation pass. More interestingly, the accuracy of the resulting models is actually improved rather than degraded, and a detailed analysis is given. The model simplification results show that we can adaptively simplify the model, which can often be reduced by around 9x, without any loss of accuracy, or even with improved accuracy.

Index Terms: neural network, back propagation, sparse learning, model pruning.

1 INTRODUCTION

Neural network learning is typically slow, where back propagation usually dominates the computational cost during the learning process. Back propagation entails a high computational cost because it needs to compute full gradients and update all model parameters in each learning step. It is not uncommon for a neural network to have a massive number of model parameters.

In this study, we propose a minimal effort back propagation method, which we call meprop, for neural network learning. The idea is that we compute only a very small but critical portion of the gradient information, and update only the corresponding minimal portion of the parameters in each learning step. This leads to sparsified gradients, such that only highly relevant parameters are updated, while the other parameters stay untouched. The sparsified back propagation leads to a linear reduction in the computational cost.

On top of meprop, we further propose to simplify the trained model by eliminating the less relevant parameters discovered during meprop, so that the computational cost of decoding can also be reduced. We name the method mesimp (minimal effort simplification). The idea is that we record which portion of the parameters is updated at each learning step in meprop, and gradually remove the parameters that are seldom updated. This leads to a simplified model that costs less in computation during decoding, whereas meprop can only speed up the training of the neural networks.
One of the motivations for such a method is that, if we suppose back propagation can determine the importance of input features, then with meprop the essential features are well-trained and the non-essential features are less-trained, so that the robustness of the models can be improved and overfitting can be reduced. As the essential features play a more important role in the final model, there is a chance that the parameters related to non-essential features can be eliminated, which leads to the idea of mesimp.

For a classification task, there are essential features that are decisive for the classification, non-essential features that are helpful but can also be distractions, and irrelevant features that are not useful at all. For example, when classifying a picture as a taxi, the taxi sign is one of the essential features, and the color yellow, which is often the color of a taxi, is one of the non-essential features. Overfitting often occurs when the non-essential features are given too much importance in the model, while meprop intentionally focuses on training the probably essential features to lessen the risk of overfitting.

To realize our approaches, we need to answer four questions. The first question is how to find the highly relevant subset of the parameters for the current sample in stochastic learning. We propose a top-k search method to find the most important parameters. Interestingly, experimental results demonstrate that most of the time we only need to update fewer than 5% of the weights at each back propagation pass. This does not result in a larger number of training iterations. The proposed method is general-purpose, and it is independent of specific models and specific optimizers (e.g., Adam and AdaGrad).

The second question is whether or not this minimal effort back propagation strategy would hurt the accuracy of the trained models. We show that our strategy does not degrade the accuracy of the trained model, even when only a very small portion of the parameters is updated. More interestingly, our experimental results reveal that our strategy actually improves the model accuracy in most cases. Based on our experiments, we find that this is probably because the minimal effort update does not modify weakly relevant parameters in each update, in accordance with our assumption, which makes overfitting less likely, similar to the dropout effect.

The third question is whether or not the decoding cost of the model can be reduced, as meprop can only shorten the training time. Based on meprop, we further apply the technique of mesimp. From our observations, the simplifying strategy can indeed shrink the final model, usually by around 9x, without any loss of accuracy. It also supports our assumption that, in fact, many learned features are not essential to the final correct prediction.

The final question is whether or not the size of the simplified models needs to be set explicitly in advance. In most previous work, the final model size is pre-configured as desired or by heuristic rules, making it hard to simplify models with multiple layers, because naturally each layer should have a different dimension, since it captures a different level of abstraction. In practice, we find that mesimp can adaptively reduce the size of the hidden layers, and automatically decide which features are essential for the task at different abstraction levels, resulting in a model with different hidden layer sizes.

The contributions of this work are as follows:

- We propose a minimal effort back propagation technique for neural network learning, which can automatically find the most important features. Only a small subset of the full gradient is computed to update the model parameters, and this subset is also used to determine whether the related parameters should be kept in the final model.
- Applying the technique to training simplification (meprop), we find that the strategy actually improves the accuracy of the resulting models, rather than degrading it, even though fewer than 5% of the weights are updated at each back propagation pass most of the time. The technique does not entail a larger number of training iterations, and can reduce training time substantially.
- Most importantly, applying the technique to model simplification (mesimp) can potentially reduce decoding time. With the ability to adaptively simplify each layer of the model to keep only the essential features, the resulting model can be reduced to around one ninth of its original size, which amounts to an approximately 9x reduction in decoding cost, with no accuracy loss or even with improved accuracy. It is worth mentioning that, when applied to models with multiple layers, given a single hyper-parameter, mesimp can simplify each hidden layer to a different extent, alleviating the need to set different hyper-parameters for different layers.
- The minimal effort back propagation technique can be applied to different types of deep learning models (MLP and LSTM), can be applied with various optimization methods (Adam and AdaGrad), and works on diverse tasks (natural language processing and image recognition).

2 PROPOSED METHOD

We propose a simple yet effective technique for neural network learning. The forward propagation is computed as usual.
During back propagation, only a small subset of the full gradient is computed to update the model parameters. The gradient vectors are sparsified in such a way that only the top-k components in terms of magnitude are kept. Based on this technique, we further propose to simplify the resulting models by removing the rows that are seldom updated, according to the top-k indices. The model is simplified in such a way that only actively updated rows are kept. We first present the proposed methods, and then describe the implementation details.

2.1 Simplified Back Propagation (meprop)

Forward propagation of neural network models, including feedforward neural networks, RNNs, and LSTMs, consists of linear transformations and non-linear transformations. For simplicity, we take a computation unit with one linear transformation and one non-linear transformation as an example:

$y = Wx$  (1)

$z = \sigma(y)$  (2)

where $W \in \mathbb{R}^{n \times m}$, $x \in \mathbb{R}^m$, $y \in \mathbb{R}^n$, $z \in \mathbb{R}^n$, m is the dimension of the input vector, n is the dimension of the output vector, and $\sigma$ is a non-linear function (e.g., relu, tanh, and sigmoid). During back propagation, we need to compute the gradient of the parameter matrix W and the input vector x:

$\partial z/\partial W_{ij} = \sigma'_i x_j^T \quad (1 \le i \le n,\ 1 \le j \le m)$  (3)

$\partial z/\partial x_i = \sum_j W_{ij}^T \sigma'_j \quad (1 \le j \le n,\ 1 \le i \le m)$  (4)

where $\sigma' \in \mathbb{R}^n$ and $\sigma'_i$ denotes $\partial z_i/\partial y_i$. We can see that the computational cost of back propagation is directly proportional to the dimension of the output vector n.

The proposed meprop uses approximate gradients by keeping only the top-k elements based on the magnitude values. That is, only the top-k elements with the largest absolute values are kept. For example, suppose a vector $v = \langle 1, 2, 3, -4 \rangle$, then $\mathrm{top}_2(v) = \langle 0, 0, 3, -4 \rangle$. We denote the indices of the top-k values of the vector $\sigma'(y)$ as $S = \{t_1, t_2, \ldots, t_k\}$ $(1 \le k \le n)$, and the approximate gradients of the parameter matrix W and the input vector x are:

$\partial z/\partial W_{ij} \leftarrow \sigma'_i x_j^T$ if $i \in \{t_1, t_2, \ldots, t_k\}$, else $0$  (5)

$\partial z/\partial x_i \leftarrow \sum_j W_{ij}^T \sigma'_j$ if $j \in \{t_1, t_2, \ldots, t_k\}$, else $0$  (6)

As a result, only k rows or columns (depending on the layout) of the weight matrix are modified, leading to a linear reduction (k divided by the vector dimension) in the computational cost. The algorithm is described in Algorithm 1.

Algorithm 1 Backward Propagation Simplification for a Computation Unit
1: $z \leftarrow \sigma(Wx)$  ▷ Forward propagation
2: $\sigma' \leftarrow$ gradient of z w.r.t. $y = Wx$
3: $S \leftarrow \{t_1, t_2, \ldots, t_k\}$  ▷ Indices of the k largest derivatives of $\sigma'$ in magnitude
4: $\partial z/\partial W_{ij} \leftarrow \sigma'_i x_j^T$ if $i \in S$ else $0$
5: $\partial z/\partial x_i \leftarrow \sum_j W_{ij}^T \sigma'_j$ if $j \in S$ else $0$

Fig. 1. An illustration of meprop.

Figure 1 is an illustration of meprop for a single computation unit of neural models. The original back propagation uses the full gradient of the output vectors to compute the gradient of the parameters. The proposed method selects the top-k values of the gradient of the output vector, and back propagates the loss through the corresponding subset of the total model parameters.
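To make the top-k sparsification concrete, the following is a minimal NumPy sketch of a meprop-style backward pass for a single computation unit z = σ(Wx), assuming a sigmoid non-linearity for concreteness; the function and variable names are our own illustration, not the authors' released code.

```python
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def meprop_backward(W, x, dL_dz, k):
    """Sparsified backward pass for z = sigmoid(W x).

    Only the top-k components (by magnitude) of the gradient w.r.t. y = W x
    are kept; the rest are treated as zero, so only k rows of W receive
    non-zero gradients.
    """
    y = W @ x                      # forward linear transformation
    z = sigmoid(y)
    sigma_prime = z * (1.0 - z)    # dz/dy for the sigmoid
    dL_dy = dL_dz * sigma_prime    # full gradient w.r.t. y (length n)

    # keep only the k entries of dL_dy with the largest absolute value
    top_idx = np.argpartition(np.abs(dL_dy), -k)[-k:]
    dL_dy_sparse = np.zeros_like(dL_dy)
    dL_dy_sparse[top_idx] = dL_dy[top_idx]

    # only k rows of dL_dW are non-zero; dL_dx uses only k rows of W
    dL_dW = np.outer(dL_dy_sparse, x)
    dL_dx = W.T @ dL_dy_sparse
    return dL_dW, dL_dx
```

In an efficient implementation, the outer product and the transposed multiplication would only touch the k selected rows (or columns) of W, which is where the linear reduction in computation comes from; the dense version above is written for clarity only.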

As for a complete neural network framework with a loss L, the original back propagation computes the gradient of the parameter matrix W as:

$\partial L/\partial W = (\partial L/\partial y) \cdot (\partial y/\partial W)$  (7)

while the gradient of the input vector x is:

$\partial L/\partial x = (\partial y/\partial x) \cdot (\partial L/\partial y)$  (8)

The proposed meprop selects the top-k elements of the gradient $\partial L/\partial y$ to approximate the original gradient, and passes them through the gradient computation graph according to the chain rule. Hence, the gradient of W becomes:

$\partial L/\partial W \leftarrow \mathrm{top}_k(\partial L/\partial y) \cdot (\partial y/\partial W)$  (9)

while the gradient of the vector x is:

$\partial L/\partial x \leftarrow (\partial y/\partial x) \cdot \mathrm{top}_k(\partial L/\partial y)$  (10)

Figure 2 shows an illustration of the computational flow of meprop. The forward propagation is the same as traditional forward propagation, which computes the output vector via a matrix multiplication operation between two input tensors. The original back propagation computes the full gradient for the input vector and the weight matrix. For meprop, back propagation computes an approximate gradient by keeping the top-k values of the gradient flowing backward and masking the remaining values to 0. Figure 3 further shows the computational flow of meprop for the mini-batch case.

Fig. 2. An illustration of the computational flow of meprop.

Fig. 3. An illustration of the computational flow of meprop in a mini-batch learning setting.

2.2 Simplified Model (mesimp)

The method from Section 2.1 simplifies the training process and thus reduces the training time. However, for most deep learning applications in real life, it is even more important to reduce the computational cost of decoding, because although training is time-consuming, it only needs to be done once, while decoding needs to be done whenever there is a new request. In this section, we propose to simplify the model by eliminating the inactive paths, which we define as the neurons whose gradients are not in the top-k. This way, the decoding cost is also reduced.

There are two major concerns about this proposal. The main problem is that we do not know the active paths of unseen examples in advance, as we do not know the gradient information of those examples. Our solution to this problem is that we can obtain the overall inactive paths from the inactive paths of the training samples, which can be removed gradually during training. The second concern is that the reduction in dimension could lead to performance degradation. Surprisingly, from our experimental results, our top-k gradient based method does not deteriorate the model. Instead, with an appropriate configuration, the resulting smaller model often performs better than the baseline large model, or even a baseline model of a similar size. As a matter of fact, right after pruning the performance does drop. However, with the following training, the performance is regained. In what follows, we will briefly introduce the inspiration for the proposed method, and how the model simplification is done.

In the experiments of meprop, we discover an interesting phenomenon: during training, apart from the active paths with top-k gradients, there are some inactive paths that are not activated at all for any of the examples.

We call these paths universal inactive paths. These neurons are not updated at all during training, their parameter values remain the same as their initialized values, and we have every reason to believe that they would not be effective for new samples either. However, the number of such paths may not be enough to bring a substantial contraction to the model.

Based on the previous findings, we generalize the idea of universal inactive paths, and prune the paths that are seldom updated; that is, the paths we eliminate are the paths that are not active for a number of samples. To realize the idea, we keep a record c of how many times each index is in the top-k indices S, during back propagation at the same time as meprop. After several training steps m, we take out the less active paths that are not updated for a sufficient number of samples, e.g., given a 90% prune rate, which results in a simplified model. The record is cleared at each pruning action. By doing this iteratively, the model size approaches a near-stable value in the end. Algorithm 2 describes the method for a computation unit, and an illustration is shown in Figure 4.

Algorithm 2 Model Simplification for a Computation Unit
1: Initialize $W$, $t \leftarrow 0$, $c \leftarrow 0$
2: while training do
3:   Draw (x, y) from training data
4:   $z \leftarrow \sigma(Wx)$  ▷ Forward propagation
5:   $\partial z/\partial W \leftarrow \mathrm{top}_k(\sigma') x^T$  ▷ Sparsified gradient w.r.t. W
6:   $S \leftarrow \{t_1, t_2, \ldots, t_k\}$  ▷ Indices of the k largest derivatives of $\sigma'$ in magnitude
7:   $c_i \leftarrow c_i + 1$ for each $i \in S$  ▷ Record the top-k indices
8:   Update W with $\partial z/\partial W$
9:   if $t \bmod m = 0$ then  ▷ Prune inactive paths
10:    $\theta \leftarrow m \times \text{prune rate}$
11:    for all $i$ where $c_i < \theta$ do
12:      Remove row $i$ from $W$
13:    end for
14:    $c \leftarrow 0$  ▷ Reset c
15:  end if
16:  $t \leftarrow t + 1$
17: end while

Fig. 4. An illustration of model simplification (k = 2). The figure shows the three main stages of the simplified model training. First, as the upper part shows, the model is trained using meprop for several iterations, and a record of the activeness of the paths is kept, indicated by the shades of the neurons. Second, as the middle part shows, the model is simplified based on the collected record, and the inactive paths are eliminated. Third, as the lower part shows, we train the simplified model, also using meprop. We repeat the procedure until the goal of the training is met.

An important hyper-parameter for the method is the pruning threshold. When determining the threshold, the model size and the number of examples between pruning actions should be taken into account. As shown in Algorithm 2, the threshold can be parameterized by the prune interval, that is, how many samples there are between pruning actions, and the prune rate, that is, how active a path must be if it is not to be eliminated. Note that the layer sizes are determined adaptively in a multi-layer setting, and only one threshold is needed for a model with multiple layers to end up with different layer sizes, because the top-k indices of different layers overlap differently across iterations in back propagation. For some layers, the top-k indices are similar across iterations, which results in a larger layer size compared to k. For other layers, the top-k indices are quite different at each iteration, so the counts are spread over more indices and each $c_i$ is lower, and hence the resulting layer size is smaller. How the layer is simplified depends on how the learning is done, which is in accordance with our intuition.
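The record-and-prune cycle of Algorithm 2 is straightforward to express in code. Below is a minimal Python sketch for a single weight matrix, assuming a linear unit and a plain SGD update in place of whichever optimizer is actually used; names such as prune_interval and prune_rate mirror the hyper-parameters described above, and pruning is simulated with a row mask so that gradient shapes stay fixed until the end.

```python
import numpy as np

def train_with_mesimp(W, data, k, prune_interval, prune_rate, lr=0.1):
    """Record top-k usage counts and periodically prune seldom-updated rows.

    W    : (n, m) weight matrix of one computation unit (z = W x for brevity)
    data : iterable of (x, dL_dz) pairs for this unit
    """
    n = W.shape[0]
    active = np.ones(n, dtype=bool)          # which rows are still in the model
    counts = np.zeros(n, dtype=np.int64)     # c in Algorithm 2
    t = 0
    for x, dL_dz in data:
        dL_dy = np.where(active, dL_dz, 0.0) # pruned rows produce no gradient

        # top-k selection among the remaining rows
        kk = min(k, int(active.sum()))
        top_idx = np.argpartition(np.abs(dL_dy), -kk)[-kk:]

        # update only the selected rows and record their activity
        W[top_idx] -= lr * np.outer(dL_dy[top_idx], x)
        counts[top_idx] += 1

        t += 1
        if t % prune_interval == 0:          # prune inactive paths
            threshold = prune_interval * prune_rate
            active &= counts >= threshold
            counts[:] = 0                    # reset the record
    return W[active]                         # rows that survived all pruning actions
```

With prune_interval = m and prune_rate = 0.08, a row survives a pruning action only if it appeared in the top-k during at least 8% of the last m steps, matching the prune = 0.08 setting reported in the experiments.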
In a deep neural network, it is worth noticing that when simplifying a hidden layer, the respective columns in the next layer could also be removed, as the values in those columns represent the connections between the eliminated inputs and the outputs, which are no longer effective. That could reduce the model even further. However, we have not included that in our implementation yet.

There are some extra considerations for LSTM models. In an LSTM, there is a lasting linear memory, and four gates controlling the modification of the memory cells. It makes sense only if the pruning is applied to the memory cells instead of the gates, which are the computation units defined previously, because there is coherence between the memory and the gates.

Otherwise, the pruning would cause chaos and a mismatch of dimensions, as each gate has its own size, and the memory ends up with yet another dimension if it is set to the union of the gates. For LSTM models, we therefore treat an LSTM module as a whole unit for model simplification, instead of treating each gate in an LSTM module as a unit for simplification. However, the top-k gradient selection takes place at the level of gates rather than memory cells. In practice, we still obtain the top-k indices from the gates, but we merge the top-k index records of the gates into one record, and the pruning is applied to memory cells, so that the related gates are pruned as well.

For model simplification, we also propose a kind of cycle mechanism. During our experiments, we find that at the time of the simplification, there is a drop in performance, but it recovers quickly within the following training, and may even surpass the performance before the simplification. This makes us wonder whether the training after simplification is critical to the performance improvement. We propose to divide the training procedure into several stages, and in each stage, we first conduct the training with model simplification, and then conduct normal training. At the start of each stage, we also reinitialize the optimizer if there is historical information of the gradients stored. The reason for this operation is that after model simplification, the dynamics of how the neurons interact with each other change, and the previous gradient information may interfere with the new dynamics of the simplified network. We find this cycle mechanism can improve the resulting model's performance even further on some tasks.

2.3 Implementation

We have coded two neural network models, including an LSTM model for part-of-speech (POS) tagging, and a feedforward NN model (MLP) for transition-based dependency parsing and MNIST image recognition. We use optimizers with automatically adaptive learning rates, including Adam [2] and AdaGrad [3]. In our implementation, we make no modification to the optimizers, although there are many zero elements in the gradients.

Most of the experiments on CPU are conducted on a framework we coded ourselves in C#. This framework builds a dynamic computation graph of the model for each sample, making it suitable for data of variable lengths. A typical training procedure contains three parts: forward propagation, back propagation, and parameter update. We also have an implementation based on the PyTorch framework for GPU based experiments. To focus on the method itself, the results of GPU based experiments will be presented in the appendices.

Where to apply top-k selection

The proposed method aims to reduce the complexity of back propagation by reducing the number of elements in the computationally intensive operations. In our preliminary observations, matrix-matrix or matrix-vector multiplication consumed more than 90% of the time of back propagation. In our implementation, we apply meprop only to the back propagation from the output of the multiplication to its inputs. For other element-wise operations (e.g., activation functions), the original back propagation procedure is kept, because those operations are already fast enough compared with matrix-matrix or matrix-vector multiplication operations. If there are multiple hidden layers, the top-k sparsification needs to be applied to every hidden layer, because the sparsified gradient would otherwise become dense again as it propagates from one layer to the next.
That is, in meprop the gradients are sparsified with a top-k operation at the output of every hidden layer. While we apply meprop to all hidden layers using the same k of top-k, usually the k for the output layer could be different from the k for the hidden layers, because the output layer typically has a very different dimension compared with the hidden layers. For example, there are 10 classes in the MNIST task, so the dimension of the output layer is 10, while we use an MLP with a hidden dimension of 500. Thus, the best k for the output layer could be different from that of the hidden layers.

Choice of top-k algorithms

In our C# implementation, instead of sorting the entire vector, we use the well-known min-heap based top-k selection method, which is slightly changed to focus on memory reuse. The algorithm has a time complexity of O(n log k) and a space complexity of O(k); a short sketch of such a heap-based selection is given after the dataset descriptions below. PyTorch comes with a GPU implementation of a parallelized top-k algorithm, although we are not sure exactly how the operation is implemented.

3 EXPERIMENTS

To demonstrate that the proposed method is general-purpose, we perform experiments on different models (LSTM/MLP), various training methods (Adam/AdaGrad), and diverse tasks.

Transition-based Dependency Parsing (Parsing): Following prior work, we use the English Penn TreeBank (PTB) [4] for evaluation. We follow the standard split of the corpus and use sections 2-21 as the training set (39,832 sentences, 1,900,056 transition examples), 1 section 22 as the development set (1,700 sentences, 80,234 transition examples), and section 23 as the final test set (2,416 sentences, 113,368 transition examples). The evaluation metric is unlabeled attachment score (UAS). We implement a parser using an MLP following [5], which is used as our baseline.

Part-of-Speech Tagging (POS-Tag): We use the standard benchmark dataset from prior work [6], which is derived from the Penn Treebank corpus. We use sections 0-18 of the Wall Street Journal (WSJ) for training (38,219 examples), and sections 22-24 for testing (5,462 examples). The evaluation metric is per-word accuracy. A popular model for this task is the LSTM model [7], 2 which is used as our baseline.

MNIST Image Recognition (MNIST): We use the MNIST handwritten digit dataset [8] for evaluation. MNIST consists of 60,000 28x28 pixel training images and an additional 10,000 test examples. Each image contains a single numerical digit (0-9). We select the first 5,000 images of the training images as the development set and the rest as the training set.

1. A transition example consists of a parsing context and its optimal transition action.
2. In this work, we use the bi-directional LSTM (Bi-LSTM) as the implementation of LSTM.
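As referenced above, here is a minimal sketch of a heap-based top-k index selection in Python, using the standard heapq module rather than the authors' memory-reuse-oriented C# variant; it runs in O(n log k) time with O(k) extra space.

```python
import heapq

def topk_indices(values, k):
    """Return the indices of the k entries with the largest absolute value.

    A size-k min-heap of (|value|, index) pairs is maintained while scanning
    the vector once, giving O(n log k) time and O(k) extra space.
    """
    heap = []  # min-heap over absolute magnitudes
    for i, v in enumerate(values):
        item = (abs(v), i)
        if len(heap) < k:
            heapq.heappush(heap, item)
        elif item > heap[0]:
            heapq.heapreplace(heap, item)  # pop the smallest, push the new item
        # otherwise the value is too small to be in the top-k; skip it
    return [i for _, i in heap]

# example: the gradient vector <1, 2, 3, -4> with k = 2 keeps indices 2 and 3
print(sorted(topk_indices([1, 2, 3, -4], 2)))  # [2, 3]
```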

TABLE 1
meprop results based on LSTM/MLP models and the Adam optimizer. Time means averaged time per iteration. Iter means the number of iterations needed to reach the optimal score on the development data. The model of this iteration is then used to obtain the test score. As we can see, applying meprop can substantially speed up back propagation with improved accuracy.

Parsing (Adam)    Iter   Backprop time (s)   Dev UAS (%)   Test UAS (%)
MLP (h=500)       10     9,
meprop (k=20)            (18.6x)                           (+0.04)

POS-Tag (Adam)    Iter   Backprop time (s)   Dev Acc (%)   Test Acc (%)
LSTM (h=500)      3      16,
meprop (k=10)            (37.1x)                           (+0.03)

MNIST (Adam)      Iter   Backprop time (s)   Dev Acc (%)   Test Acc (%)
MLP (h=500)
meprop (k=80)            (5.9x)                            (+0.07)

The evaluation metric is per-image accuracy. We use the MLP model as the baseline. Following common practice, we use ReLU [9] as the activation function of the hidden layers.

3.1 Experimental Settings

We set the dimension of the hidden layers to 500 for all the tasks. For Parsing, the input dimension is 48 (features) x 50 (dim per feature) = 2,400, and the output dimension is 25. For POS-Tag, the input dimension is 1 (word) x 50 (dim per word) + 7 (features) x 20 (dim per feature) = 190, and the output dimension is 45. For MNIST, the input dimension is 28 (pixels per row) x 28 (pixels per column) x 1 (dim per pixel) = 784, and the output dimension is 10. Based on the development set and prior work, we set the mini-batch size to 10,000 (transition examples), 1 (sentence), and 10 (images) for Parsing, POS-Tag, and MNIST, respectively. Using 10,000 transition examples for Parsing follows [5]. As discussed in Section 2, the optimal k of top-k for the output layer could be different from that of the hidden layers, because their dimensions could be very different. For Parsing and MNIST, we find using the same k for the output and the hidden layers works well, and we simply do so. For the other task, POS-Tag, we find the output layer should use a different k from the hidden layers. For simplicity, we do not apply meprop to the output layer for POS-Tag, because in this task we find the computational cost of the output layer is almost negligible compared with the other layers.

In the experiments of model simplification, we use the Adam optimizer for all the tasks, for the sake of simplicity. In addition, we also apply the cycle mechanism in the reported results. Note that, to simulate the real scenario, we run each configuration 5 times with different random seeds, and choose the best model on the development set to report. The hyper-parameters are tuned based on the development data. For the Adam optimization method, we find the default hyper-parameters work well on the development sets, which are as follows: the learning rate α = 0.001, β1 = 0.9, β2 = 0.999, and the default value of ε. The experiments on CPU are conducted on a computer with an INTEL(R) Xeon(R) 3.0GHz CPU. The experiments on GPU are conducted on an NVIDIA GeForce GTX GPU.

3.2 Experimental Results of meprop

In this experiment, the LSTM is based on one hidden layer and the MLP is based on two hidden layers (experiments on more hidden layers will be presented later). We conduct experiments on different optimization methods, including AdaGrad and Adam. Since meprop is applied to the linear transformations (which entail the major computational cost), we report the linear transformation related backprop time as Backprop Time. It does not include non-linear activations, which usually account for less than 2% of the computational cost.
The total time of back propagation, including non-linear activations, is reported as Overall Backprop Time. Table 1 shows the results based on different models and different optimization methods. In the table, meprop means applying meprop to the corresponding baseline model, h = 500 means that the hidden layer dimension is 500, and k = 20 means that meprop uses the top-20 elements (among 500 in total) for back propagation. Note that, for fair comparisons, all experiments are first conducted on the development data and the test data is not observable. Then, the optimal number of iterations is decided based on the optimal score on the development data, and the model of this iteration is used on the test data to obtain the test scores.

As we can see, applying meprop can substantially speed up back propagation. It provides a linear reduction in the computational cost. Surprisingly, the results demonstrate that we can update only fewer than 5% of the weights at each back propagation pass for the natural language processing tasks. This does not result in a larger number of training iterations. More surprisingly, the accuracy of the resulting models is actually improved rather than decreased. The main reason could be that the minimal effort update does not modify weakly relevant parameters, which makes overfitting less likely, similar to the dropout effect.

Result Analysis of meprop

Changing Optimizer

TABLE 2
meprop: Results using the AdaGrad optimizer. We can see that meprop also works with AdaGrad, indicating that meprop is independent of the optimizer.

Parsing (AdaGrad)   Iter   Test UAS (%)
MLP (h=500)
meprop (k=20)              (+0.03)

POS-Tag (AdaGrad)   Iter   Test Acc (%)
LSTM (h=500)
meprop (k=5)               (+0.32)

MNIST (AdaGrad)     Iter   Test Acc (%)
MLP (h=500)
meprop (k=10)              (+0.48)

Fig. 5. Accuracy vs. meprop's backprop ratio (left). Results of top-k meprop vs. random meprop (middle). Results of top-k meprop vs. baseline with the hidden dimension h (right).

TABLE 3
meprop: Results based on the same k and h. It can be concluded that meprop does not rely on redundant neurons, as the model with the small hidden dimension works much worse.

Parsing (Adam)    Iter   Test UAS (%)
MLP (h=20)
meprop (k=20)            (+1.64)

POS-Tag (Adam)    Iter   Test Acc (%)
LSTM (h=5)
meprop (k=5)             (+0.72)

MNIST (Adam)      Iter   Test Acc (%)
MLP (h=20)
meprop (k=20)            (+2.24)

TABLE 4
meprop: Varying the number of hidden layers on the MNIST task. The experiments demonstrate that meprop can also be applied to traditional deep models.

Layers   Method          Test Acc (%)
2        MLP (h=500)
         meprop (k=25)   (+0.10)
3        MLP (h=500)
         meprop (k=25)   (+0.16)
4        MLP (h=500)
         meprop (k=25)   (+0.05)
5        MLP (h=500)
         meprop (k=25)   (+0.16)

It is important to see whether meprop can be applied with different optimizers, because the minimal effort technique sparsifies the gradient, which affects the update of the parameters. For the AdaGrad learner, the learning rate is set to α = 0.01, 0.01, 0.1 for Parsing, POS-Tag, and MNIST, respectively, and ε = . As shown in Table 2, the results are consistent between AdaGrad and Adam. The results demonstrate that meprop is independent of specific optimization methods. For simplicity, the following experiments use Adam.

Varying Backprop Ratio

In Figure 5 (left), we vary the k of top-k meprop to compare the test accuracy at different ratios of meprop backprop. For example, when k=5, the backprop ratio is 5/500 = 1%. The optimizer is Adam. As we can see, meprop achieves consistently better accuracy than the baseline.

Top-k vs. Random

It is interesting to check the role of the top-k elements. Figure 5 (middle) shows the results of top-k meprop vs. random meprop. Random meprop means that random elements (instead of the top-k ones) are selected for back propagation. As we can see, the top-k version works better than the random version. It suggests that the top-k elements contain the most important information of the gradients.

Varying Hidden Dimension

We still have a question: does top-k meprop work well simply because the original model does not require such a big hidden dimension? For example, perhaps meprop (top-k=5) works simply because the LSTM works well with a hidden dimension of 5, and there is no need to use a hidden dimension of 500. To examine this, we perform experiments using the same hidden dimension as k, and the results are shown in Table 3. As we can see, however, the results of the small hidden dimensions are much worse than those of meprop. In addition, Figure 5 (right) shows more detailed curves obtained by varying the value of k. In the figure, different k gives a different backprop ratio for meprop and a different hidden dimension ratio for LSTM/MLP. As we can see, the answer to that question is negative: meprop does not rely on redundant hidden layer elements.

Adding More Hidden Layers

Another question is whether or not meprop relies on shallow models with only a few hidden layers. To answer this question, we also perform experiments on more hidden layers, from 2 hidden layers to 5 hidden layers.
We find that setting the dropout rate to 0.1 works well for most cases with different numbers of layers. For simplicity of comparison, we use the same dropout rate of 0.1 in this experiment. Table 4 shows that increasing the number of hidden layers does not hurt the performance of meprop.

Adding Dropout

Since we have observed that meprop can reduce the overfitting of deep learning, a natural question is whether meprop is reducing the same type of overfitting risk as dropout. Thus, we use the development data to find a proper value of the dropout rate for those tasks, and then further add meprop to check if further improvement is possible. Table 5 shows the results. As we can see, meprop can achieve further improvement over dropout. In particular, meprop has an improvement of 0.46 UAS on Parsing. The results suggest that the type of overfitting that meprop reduces is probably different from that of dropout.

TABLE 5
meprop: Adding the dropout technique. As the results show, meprop can further improve the performance on top of dropout, suggesting that meprop is reducing a different type of overfitting compared to dropout.

Parsing (Adam)    Dropout   Test UAS (%)
MLP (h=500)
meprop (k=40)               (+0.46)

POS-Tag (Adam)    Dropout   Test Acc (%)
LSTM (h=500)
meprop (k=20)               (+0.11)

MNIST (Adam)      Dropout   Test Acc (%)
MLP (h=500)
meprop (k=25)               (+0.23)

TABLE 6
Results of simple unified top-k meprop based on a whole mini-batch (i.e., unified sparse patterns). The optimizer is Adam. The mini-batch size is 50.

Layers   Method          Test Acc (%)
2        MLP (h=500)
         meprop (k=30)   (+0.11)
5        MLP (h=500)
         meprop (k=50)   (+0.27)

TABLE 7
Acceleration results on the matrix multiplication synthetic data using GPU. The batch size is

Method              Backprop time (ms)
Baseline (h=8192)
meprop (k=8)        8.37 (36.8x)
meprop (k=16)       9.16 (33.6x)
meprop (k=32)       (27.5x)
meprop (k=64)       (21.4x)
meprop (k=128)      (14.5x)
meprop (k=256)      (8.0x)
meprop (k=512)      (4.4x)

TABLE 8
Acceleration results on MNIST using GPU.

Method            Overall backprop time (ms)
MLP (h=8192)      17,696.2
meprop (k=8)      1,501.5 (11.8x)
meprop (k=16)     1,542.8 (11.5x)
meprop (k=32)     1,656.9 (10.7x)
meprop (k=64)     1,828.3 (9.7x)
meprop (k=128)    2,200.0 (8.0x)
meprop (k=256)    3,149.6 (5.6x)
meprop (k=512)    4,874.1 (3.6x)

Thus, a model should be able to take advantage of both meprop and dropout to reduce overfitting.

Speedup on GPU

For implementing meprop on GPU, the simplest solution is to treat the entire mini-batch as one big training example, where the top-k operation is based on the averaged values of all examples in the mini-batch. In this way, the big sparse matrix of the mini-batch will have consistent sparse patterns among examples, and this consistent sparse matrix can be transformed into a small dense matrix by removing the zero values. We call this implementation simple unified top-k. This experiment is based on PyTorch.

Despite its simplicity, Table 6 shows the good performance of this implementation, which is based on a mini-batch size of 50. We also find the speedup on GPU is less significant when the hidden dimension is low. The reason is that our GPU's computational power is not fully consumed by the baseline (with small hidden layers), so that normal back propagation is already fast enough, making it hard for meprop to achieve a substantial speedup. For example, supposing a GPU can finish 1,000 operations in one cycle, there could be no speed difference between a method with 100 operations and a method with 10 operations. Indeed, we find that MLP (h=64) and MLP (h=512) have almost the same GPU speed even on forward propagation (i.e., without meprop), while theoretically there should be an 8x difference. With GPU, the forward propagation time of MLP (h=64) and MLP (h=512) is 572ms and 644ms, respectively. This provides evidence for our hypothesis that our GPU is not fully consumed with the small hidden dimensions. Thus, the speedup test on GPU is more meaningful for heavy models, such that the baseline can at least fully consume the GPU's computational power. To check this, we test the GPU speedup on synthetic data of matrix multiplication with a larger hidden dimension. Indeed, Table 7 shows that meprop achieves much higher speed than the traditional backprop with the large hidden dimension. Furthermore, we test the GPU speedup on MLP with the large hidden dimension [10].
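Before turning to the results in Table 8, the simple unified top-k idea described above can be sketched as follows. This is a minimal PyTorch-style illustration with assumed shapes (Y = X W^T) and our own function name, not the authors' released implementation.

```python
import torch

def unified_topk_backward(W, X, dL_dY, k):
    """Simple unified top-k backward for Y = X @ W.T on a mini-batch.

    One set of top-k indices is chosen for the whole mini-batch, based on the
    averaged absolute gradient over the batch, so the sparse gradient can be
    handled as a small dense matrix of k columns.
    """
    # (batch, n) -> (n,): average magnitude of each output unit's gradient
    avg_mag = dL_dY.abs().mean(dim=0)
    _, idx = torch.topk(avg_mag, k)          # shared indices for all examples

    dY_small = dL_dY[:, idx]                 # (batch, k) dense slice
    dL_dW = torch.zeros_like(W)
    dL_dW[idx] = dY_small.t() @ X            # only k rows of W get gradients
    dL_dX = dY_small @ W[idx]                # uses only the k selected rows
    return dL_dW, dL_dX
```

Because one index set is shared by the whole mini-batch, the selected gradient slice stays dense, so ordinary dense GPU matrix multiplications can be used on the k selected rows instead of sparse operations.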
Table 8 shows that meprop also has a substantial GPU speedup on MNIST with the large hidden dimension. In this experiment, the speedup is based on Overall Backprop Time (see the prior definition). Those results demonstrate that meprop can achieve a good speedup on GPU when it is applied to heavy models. Finally, there are potentially other implementation choices of meprop on GPU. For example, another natural solution is to use a big sparse matrix to represent the sparsified gradient of the output of a mini-batch. Then, a sparse matrix multiplication library can be used to accelerate the computation. This could be an interesting direction of future work.

3.3 Experimental Results of mesimp

In this experiment, we only simplify the hidden layers of the model, and we use the Adam optimizer for all the tasks. We set the cycle to 10 for all the tasks; that is, we first train the model using mesimp for 5 epochs, then train the model normally for 5 epochs, and repeat the procedure till the end.

Table 9 shows the model simplification results based on different models. In the table, meprop means applying meprop to the corresponding baseline model, and mesimp means applying model compression on top of meprop. h = 500 means that the dimension of the model's hidden layers is 500, k = 20 means that in back propagation we propagate the top-20 elements, and prune = 0.08 means that a dimension which is updated in fewer than 8% of the steps during a statistical interval is dropped. As we can see, our method is capable of reducing the models to a relatively small size, while maintaining, if not improving, the performance. The hidden layers of the models are reduced by around 10x, 8x, and 3x for Parsing, POS-Tag, and MNIST, respectively. That means that when the simplified model is deployed, it could achieve more than 10x, 8x, and 3x computational cost reduction, respectively, in decoding, with similar or better performance.

TABLE 9
mesimp results based on LSTM/MLP models. Iter means the number of iterations needed to reach the optimal score on the development data. The model of this iteration is then used to obtain the test score. Dim means the dimension of the model at this iteration. For LSTM, it is the average over the two directions; for MLP, it is the average over the hidden layers. It can be drawn from the results that mesimp can reduce the model to a smaller size, often around 10%, while maintaining, if not improving, the performance.

Parsing                        Iter   Dim          Dev UAS (%)   Test UAS (%)
MLP (h=500)
mesimp (k=20, prune=0.08)             (10.2%)                    (+0.31)

POS-Tag                        Iter   Dim          Dev Acc (%)   Test Acc (%)
LSTM (h=500)
mesimp (k=20, prune=0.08)      3      60 (12.0%)                 (+0.03)

MNIST                          Iter   Dim          Dev Acc (%)   Test Acc (%)
MLP (h=500)
mesimp (k=160, prune=0.10)            (30.8%)                    (+0.11)

TABLE 10
mesimp: The dimensions of the resulting models. The results confirm that mesimp is suitable for deep models, and can adaptively determine the proper sizes of different layers.

Parsing                        #Average   #Hidden
mesimp (k=20, prune=0.08)

POS-Tag                        #Average   #Forward   #Backward
mesimp (k=20, prune=0.08)

MNIST                          #Average   #First   #Second
mesimp (k=160, prune=0.10)

The reason could be that the minimal effort update captures the important features, so that the simplified model is enough to fit the data, while without the minimal effort update, a model of a similar size treats each feature equally at the start, limiting its ability to learn from the data. We will show the corresponding experimental results in the result analysis below. The results show that the simplifying method is effective in reducing the model size, thus bringing a substantial reduction of the computational cost of decoding in real-world tasks. More importantly, the accuracy of the original model is kept, or even more often improved. This means model simplification could make it more feasible to deploy a deep learning system in a computation-constrained environment.

Result Analysis of mesimp

Adaptively setting the size of the hidden layers

It is worth noticing that mesimp is also able to automatically determine the appropriate size of the resulting model for deep neural networks (Table 10). At the beginning, we conduct the experiments on a neural network with a single hidden layer, that is, Parsing, and we get a promising result, as the model size is reduced to 10.2% of its original size. The result of Parsing makes us wonder whether mesimp could also simplify deeper networks, so we continue to run experiments on different models. In the experiments of POS-Tag, the LSTM is a Bi-LSTM, that is, there is a forward LSTM and a backward LSTM with respect to the input sequences, which means it is often very deep in the time dimension (horizontal). As shown in Table 10, the forward and backward LSTMs indeed get different dimensions, 60 and 57 respectively. We further conduct experiments on an MLP with 2 hidden layers (vertical), and the result shows that the first hidden layer and the second hidden layer are again of different dimensions, which confirms that mesimp can adaptively adjust the hidden layer sizes in a multi-layer setting.

TABLE 11
mesimp: Results based on the same k and h for Parsing. We report the results of 5 different runs of the baseline model. It is clear that the simplified model consistently surpasses the traditionally-trained model of the same size, indicating that mesimp may enable a more efficient and effective learning.
Method          Dev UAS (%)   Test UAS (%)
mesimp (h=51)
MLP (h=51)
MLP (h=51)
MLP (h=51)
MLP (h=51)
MLP (h=51)

We also need to remind the readers that mesimp does not need different hyper-parameters to be specified for different layers, while in most of the previous work ([11], [12], [13]), if different layer sizes are pursued, a different hyper-parameter needs to be set separately for each hidden layer, limiting their ability to simplify the models adaptively.

Comparing with the models of similar sizes

One natural and important question is how the simplified model performs compared to a model of a similar size. If the simplified models do not perform as well as normally trained models of similar sizes, the simplifying method may be redundant and unnecessary. The results in Table 11 shed light on that question. We train baseline models with sizes similar to the sizes of the simplified models, and report the results in Table 11. As can be seen, our simplified models perform better than the models trained at similar sizes, especially on the Parsing task. The results show that the model simplification training is not unnecessary, as a simplified model achieves better accuracy than a model trained with a small dimension.

An attempt at revealing why the minimal effort technique works

From the back propagation simplification and model simplification results, we can see that approaches based on active paths, which are measured by back propagation, are effective in reducing overfitting. One of our hypotheses is that for a neural network, for each example, only a small part of the neurons is needed to produce the correct results, and gradients are good identifiers to detect the decisive neurons.

TABLE 12
MNIST: minimal effort activation. Dim means the averaged active dimension of the hidden layers across examples. An epoch range such as 10-20 means that the minimal effort activation technique is applied from epoch 10 to epoch 20. The results show that for an example, a smaller number of neurons is enough to generate the correct prediction, and that by only training the highly-related neurons, the performance could be improved.

MNIST                              Iter   Dim   Test Acc (%)
MLP (h=500)
meact (threshold=0.004, 10-20)                  (+0.24)
meact (threshold=0.004, 15-20)                  (+0.14)

Fig. 6. Change of accuracy (upper) and average absolute update per parameter (lower) in active path finding. To isolate the impact of meact, we fix the random seed, which means the initialization of the parameters and the shuffle are the same between meact and the baseline, so the lines coincide with each other during epochs 1-10, in which the training is exactly the same. As we can see, after meact is applied, the accuracy rises, which indicates that training focused on the most relevant neurons can reduce overfitting, and the update drops, which suggests that in the later stage of normal training, most of the update is caused by fitting the noise, making the already-trained neurons change constantly.

Too many neurons are harmful to the model, because the extra neurons may be trained to fit the noise in the training examples. To examine the hypothesis, we design a new algorithm, which we call meact (minimal effort activation), which activates only the active paths, with respect to each example, in forward propagation, and the experimental results are consistent with our hypothesis.

To realize the idea, for each example, we only activate the paths with the largest accumulated absolute gradients, and the number of the chosen paths is controlled by a threshold. Specifically, we accumulate the absolute gradients of each layer's output for each example, denoted by $g_i(x_j)$, where i is the neuron's index in a layer, and j is the example's identifier. For a layer, if a neuron's accumulated absolute gradient accounts for less than a specified percentage of the sum of the gradients of all the neurons of the layer, that is, $g_i(x_j) < \text{threshold} \cdot \sum_{i=1}^{n} g_i(x_j)$, where n is the number of neurons in the layer and 0 < threshold < 1, the neuron is considered inactive. The paths outside the active paths are deactivated, and we use the previous activation values from the last encounter in training as their outputs, so that the effort in activation is minimized. As the sparse activation is done in forward propagation, the back propagation is sparsified as well, because the deactivated neurons obviously contribute nothing to the results, meaning their gradients are zero, which requires no computation. Note that the method does not reduce the size of the model, and for each example, we obtain its own active paths. During test, the forward propagation is done normally, as we would not know the active paths of these unseen examples.

From the results shown in Table 12, we can see that, for the MNIST task, on average, fewer than 100 neurons are adequate to achieve good or even better results. From the results shown in Figure 6, we can see that, during minimal effort activation, the accuracy rises above the baseline, which shows that the accuracy can benefit from training that focuses more on the related neurons.
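As an illustration of the selection rule described above, the following Python sketch marks the neurons of one layer as active or inactive for a single example, based on accumulated absolute gradients and the relative threshold; the function name and the toy numbers are our own assumptions, not the authors' released implementation.

```python
import numpy as np

def active_neuron_mask(accumulated_abs_grads, threshold):
    """Decide which neurons of a layer are active for one example.

    accumulated_abs_grads : 1-D array g_i(x_j), the accumulated absolute
                            gradients of the layer's outputs for example x_j
    threshold             : fraction in (0, 1); a neuron is inactive if its
                            share of the layer's total gradient mass is below it
    """
    total = accumulated_abs_grads.sum()
    return accumulated_abs_grads >= threshold * total

# toy example: with threshold = 0.1, only neurons holding at least 10% of the
# layer's accumulated gradient mass stay active for this example
g = np.array([0.02, 0.50, 0.03, 0.45])
print(active_neuron_mask(g, 0.1))  # [False  True False  True]
```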
To see how the accuracy improvement is acquired, we further investigate the change of the parameters during training. Normally, the gradient is used to represent the change. However, because we use Adam as the optimizer, where the update is not done by directly using the gradient, we consider the difference of a parameter before and after the Adam update rule as its change. As there are many iterations in an epoch and many parameters in a model, we average the change of all the parameters over all iterations, that is,

$\text{update} = \frac{\sum_{j=1}^{t} \sum_{i=1}^{n} |\delta_i^j|}{n \cdot t}$

where $\delta_i^j$ means the change of parameter i at iteration j, n means the number of parameters, and t means the number of training iterations, and we report this average absolute change per parameter per iteration as the update. We use the absolute value of the update of a parameter, because we would like to see how much the parameters have been modified during the training process, not just the change between the start and the end.

As shown in Figure 6, the update drops sharply when meact is applied, meaning the accuracy improvement is achieved with very little change of the parameters, while the update of normal training is still high, more than 5x the update of meact, suggesting that much of the update is redundant and unnecessary, which could be the result of the model trying to adapt to the noise in the data. As there should be no regular pattern in the noise, fitting it requires more subtle updates of all the parameters, which is much harder and often affects the training of the essential features, thus leading to a lower accuracy than our method, which tries to focus only on the essential features for each example. The results confirm our initial hypothesis that, for an example, only a few neurons are required, and the minimal effort technique provides a simple yet effective way to train and extract the helpful neurons.
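As a brief illustration, the average absolute update per parameter per iteration defined above can be tracked with a small helper like the one below; the snapshot-and-diff approach and the names are our own sketch, not the authors' measurement code.

```python
import numpy as np

class UpdateTracker:
    """Track update = sum_j sum_i |delta_i^j| / (n * t) over training.

    delta_i^j is the change of parameter i produced by the optimizer step at
    iteration j, measured as the difference of parameter values before and
    after the step (so it reflects the actual Adam update, not the raw gradient).
    """
    def __init__(self):
        self.total_abs_change = 0.0
        self.num_params = 0
        self.num_iters = 0

    def record_step(self, params_before, params_after):
        # accumulate |delta| over all parameters for this iteration
        for before, after in zip(params_before, params_after):
            self.total_abs_change += np.abs(after - before).sum()
            if self.num_iters == 0:
                self.num_params += before.size
        self.num_iters += 1

    def average_update(self):
        return self.total_abs_change / (self.num_params * self.num_iters)
```

In a training loop, one would snapshot copies of the parameter arrays before the optimizer step, call record_step with the before/after values, and read average_update at the end of each epoch to reproduce a curve like the lower panel of Figure 6.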


More information

LECTURE NOTES Professor Anita Wasilewska NEURAL NETWORKS

LECTURE NOTES Professor Anita Wasilewska NEURAL NETWORKS LECTURE NOTES Professor Anita Wasilewska NEURAL NETWORKS Neural Networks Classifier Introduction INPUT: classification data, i.e. it contains an classification (class) attribute. WE also say that the class

More information

Deep Learning Applications

Deep Learning Applications October 20, 2017 Overview Supervised Learning Feedforward neural network Convolution neural network Recurrent neural network Recursive neural network (Recursive neural tensor network) Unsupervised Learning

More information

Practical Tips for using Backpropagation

Practical Tips for using Backpropagation Practical Tips for using Backpropagation Keith L. Downing August 31, 2017 1 Introduction In practice, backpropagation is as much an art as a science. The user typically needs to try many combinations of

More information

Assignment 2. Classification and Regression using Linear Networks, Multilayer Perceptron Networks, and Radial Basis Functions

Assignment 2. Classification and Regression using Linear Networks, Multilayer Perceptron Networks, and Radial Basis Functions ENEE 739Q: STATISTICAL AND NEURAL PATTERN RECOGNITION Spring 2002 Assignment 2 Classification and Regression using Linear Networks, Multilayer Perceptron Networks, and Radial Basis Functions Aravind Sundaresan

More information

Neural Networks for Machine Learning. Lecture 15a From Principal Components Analysis to Autoencoders

Neural Networks for Machine Learning. Lecture 15a From Principal Components Analysis to Autoencoders Neural Networks for Machine Learning Lecture 15a From Principal Components Analysis to Autoencoders Geoffrey Hinton Nitish Srivastava, Kevin Swersky Tijmen Tieleman Abdel-rahman Mohamed Principal Components

More information

Asynchronous Parallel Learning for Neural Networks and Structured Models with Dense Features

Asynchronous Parallel Learning for Neural Networks and Structured Models with Dense Features Asynchronous Parallel Learning for Neural Networks and Structured Models with Dense Features Xu SUN ( 孙栩 ) Peking University xusun@pku.edu.cn Motivation Neural networks -> Good Performance CNN, RNN, LSTM

More information

Keras: Handwritten Digit Recognition using MNIST Dataset

Keras: Handwritten Digit Recognition using MNIST Dataset Keras: Handwritten Digit Recognition using MNIST Dataset IIT PATNA February 9, 2017 1 / 24 OUTLINE 1 Introduction Keras: Deep Learning library for Theano and TensorFlow 2 Installing Keras Installation

More information

DEEP LEARNING REVIEW. Yann LeCun, Yoshua Bengio & Geoffrey Hinton Nature Presented by Divya Chitimalla

DEEP LEARNING REVIEW. Yann LeCun, Yoshua Bengio & Geoffrey Hinton Nature Presented by Divya Chitimalla DEEP LEARNING REVIEW Yann LeCun, Yoshua Bengio & Geoffrey Hinton Nature 2015 -Presented by Divya Chitimalla What is deep learning Deep learning allows computational models that are composed of multiple

More information

ECE 5470 Classification, Machine Learning, and Neural Network Review

ECE 5470 Classification, Machine Learning, and Neural Network Review ECE 5470 Classification, Machine Learning, and Neural Network Review Due December 1. Solution set Instructions: These questions are to be answered on this document which should be submitted to blackboard

More information

Lecture on Modeling Tools for Clustering & Regression

Lecture on Modeling Tools for Clustering & Regression Lecture on Modeling Tools for Clustering & Regression CS 590.21 Analysis and Modeling of Brain Networks Department of Computer Science University of Crete Data Clustering Overview Organizing data into

More information

Artificial Intelligence Introduction Handwriting Recognition Kadir Eren Unal ( ), Jakob Heyder ( )

Artificial Intelligence Introduction Handwriting Recognition Kadir Eren Unal ( ), Jakob Heyder ( ) Structure: 1. Introduction 2. Problem 3. Neural network approach a. Architecture b. Phases of CNN c. Results 4. HTM approach a. Architecture b. Setup c. Results 5. Conclusion 1.) Introduction Artificial

More information

MoonRiver: Deep Neural Network in C++

MoonRiver: Deep Neural Network in C++ MoonRiver: Deep Neural Network in C++ Chung-Yi Weng Computer Science & Engineering University of Washington chungyi@cs.washington.edu Abstract Artificial intelligence resurges with its dramatic improvement

More information

5 Learning hypothesis classes (16 points)

5 Learning hypothesis classes (16 points) 5 Learning hypothesis classes (16 points) Consider a classification problem with two real valued inputs. For each of the following algorithms, specify all of the separators below that it could have generated

More information

Recurrent Neural Network (RNN) Industrial AI Lab.

Recurrent Neural Network (RNN) Industrial AI Lab. Recurrent Neural Network (RNN) Industrial AI Lab. For example (Deterministic) Time Series Data Closed- form Linear difference equation (LDE) and initial condition High order LDEs 2 (Stochastic) Time Series

More information

Deep Learning. Vladimir Golkov Technical University of Munich Computer Vision Group

Deep Learning. Vladimir Golkov Technical University of Munich Computer Vision Group Deep Learning Vladimir Golkov Technical University of Munich Computer Vision Group 1D Input, 1D Output target input 2 2D Input, 1D Output: Data Distribution Complexity Imagine many dimensions (data occupies

More information

Sequence Modeling: Recurrent and Recursive Nets. By Pyry Takala 14 Oct 2015

Sequence Modeling: Recurrent and Recursive Nets. By Pyry Takala 14 Oct 2015 Sequence Modeling: Recurrent and Recursive Nets By Pyry Takala 14 Oct 2015 Agenda Why Recurrent neural networks? Anatomy and basic training of an RNN (10.2, 10.2.1) Properties of RNNs (10.2.2, 8.2.6) Using

More information

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CHAPTER 4 CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS 4.1 Introduction Optical character recognition is one of

More information

Plankton Classification Using ConvNets

Plankton Classification Using ConvNets Plankton Classification Using ConvNets Abhinav Rastogi Stanford University Stanford, CA arastogi@stanford.edu Haichuan Yu Stanford University Stanford, CA haichuan@stanford.edu Abstract We present the

More information

Index. Umberto Michelucci 2018 U. Michelucci, Applied Deep Learning,

Index. Umberto Michelucci 2018 U. Michelucci, Applied Deep Learning, A Acquisition function, 298, 301 Adam optimizer, 175 178 Anaconda navigator conda command, 3 Create button, 5 download and install, 1 installing packages, 8 Jupyter Notebook, 11 13 left navigation pane,

More information

JOINT INTENT DETECTION AND SLOT FILLING USING CONVOLUTIONAL NEURAL NETWORKS. Puyang Xu, Ruhi Sarikaya. Microsoft Corporation

JOINT INTENT DETECTION AND SLOT FILLING USING CONVOLUTIONAL NEURAL NETWORKS. Puyang Xu, Ruhi Sarikaya. Microsoft Corporation JOINT INTENT DETECTION AND SLOT FILLING USING CONVOLUTIONAL NEURAL NETWORKS Puyang Xu, Ruhi Sarikaya Microsoft Corporation ABSTRACT We describe a joint model for intent detection and slot filling based

More information

Keras: Handwritten Digit Recognition using MNIST Dataset

Keras: Handwritten Digit Recognition using MNIST Dataset Keras: Handwritten Digit Recognition using MNIST Dataset IIT PATNA January 31, 2018 1 / 30 OUTLINE 1 Keras: Introduction 2 Installing Keras 3 Keras: Building, Testing, Improving A Simple Network 2 / 30

More information

Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling

Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling Authors: Junyoung Chung, Caglar Gulcehre, KyungHyun Cho and Yoshua Bengio Presenter: Yu-Wei Lin Background: Recurrent Neural

More information

Natural Language Processing with Deep Learning CS224N/Ling284. Christopher Manning Lecture 4: Backpropagation and computation graphs

Natural Language Processing with Deep Learning CS224N/Ling284. Christopher Manning Lecture 4: Backpropagation and computation graphs Natural Language Processing with Deep Learning CS4N/Ling84 Christopher Manning Lecture 4: Backpropagation and computation graphs Lecture Plan Lecture 4: Backpropagation and computation graphs 1. Matrix

More information

On the Effectiveness of Neural Networks Classifying the MNIST Dataset

On the Effectiveness of Neural Networks Classifying the MNIST Dataset On the Effectiveness of Neural Networks Classifying the MNIST Dataset Carter W. Blum March 2017 1 Abstract Convolutional Neural Networks (CNNs) are the primary driver of the explosion of computer vision.

More information

Recurrent Neural Networks

Recurrent Neural Networks Recurrent Neural Networks 11-785 / Fall 2018 / Recitation 7 Raphaël Olivier Recap : RNNs are magic They have infinite memory They handle all kinds of series They re the basis of recent NLP : Translation,

More information

Neural Networks (pp )

Neural Networks (pp ) Notation: Means pencil-and-paper QUIZ Means coding QUIZ Neural Networks (pp. 106-121) The first artificial neural network (ANN) was the (single-layer) perceptron, a simplified model of a biological neuron.

More information

Non-Profiled Deep Learning-Based Side-Channel Attacks

Non-Profiled Deep Learning-Based Side-Channel Attacks Non-Profiled Deep Learning-Based Side-Channel Attacks Benjamin Timon UL Transaction Security, Singapore benjamin.timon@ul.com Abstract. Deep Learning has recently been introduced as a new alternative to

More information

COMP 551 Applied Machine Learning Lecture 14: Neural Networks

COMP 551 Applied Machine Learning Lecture 14: Neural Networks COMP 551 Applied Machine Learning Lecture 14: Neural Networks Instructor: (jpineau@cs.mcgill.ca) Class web page: www.cs.mcgill.ca/~jpineau/comp551 Unless otherwise noted, all material posted for this course

More information

Scalable Trigram Backoff Language Models

Scalable Trigram Backoff Language Models Scalable Trigram Backoff Language Models Kristie Seymore Ronald Rosenfeld May 1996 CMU-CS-96-139 School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213 This material is based upon work

More information

MIXED PRECISION TRAINING: THEORY AND PRACTICE Paulius Micikevicius

MIXED PRECISION TRAINING: THEORY AND PRACTICE Paulius Micikevicius MIXED PRECISION TRAINING: THEORY AND PRACTICE Paulius Micikevicius What is Mixed Precision Training? Reduced precision tensor math with FP32 accumulation, FP16 storage Successfully used to train a variety

More information

Artificial Neuron Modelling Based on Wave Shape

Artificial Neuron Modelling Based on Wave Shape Artificial Neuron Modelling Based on Wave Shape Kieran Greer, Distributed Computing Systems, Belfast, UK. http://distributedcomputingsystems.co.uk Version 1.2 Abstract This paper describes a new model

More information

Clustering algorithms and autoencoders for anomaly detection

Clustering algorithms and autoencoders for anomaly detection Clustering algorithms and autoencoders for anomaly detection Alessia Saggio Lunch Seminars and Journal Clubs Université catholique de Louvain, Belgium 3rd March 2017 a Outline Introduction Clustering algorithms

More information

Ensemble methods in machine learning. Example. Neural networks. Neural networks

Ensemble methods in machine learning. Example. Neural networks. Neural networks Ensemble methods in machine learning Bootstrap aggregating (bagging) train an ensemble of models based on randomly resampled versions of the training set, then take a majority vote Example What if you

More information

Automated Diagnosis of Vertebral Fractures using 2D and 3D Convolutional Networks

Automated Diagnosis of Vertebral Fractures using 2D and 3D Convolutional Networks Automated Diagnosis of Vertebral Fractures using 2D and 3D Convolutional Networks CS189 Final Project Naofumi Tomita Overview Automated diagnosis of osteoporosis-related vertebral fractures is a useful

More information

Machine learning for vision. It s the features, stupid! cathedral. high-rise. Winter Roland Memisevic. Lecture 2, January 26, 2016

Machine learning for vision. It s the features, stupid! cathedral. high-rise. Winter Roland Memisevic. Lecture 2, January 26, 2016 Winter 2016 Lecture 2, Januar 26, 2016 f2? cathedral high-rise f1 A common computer vision pipeline before 2012 1. 2. 3. 4. Find interest points. Crop patches around them. Represent each patch with a sparse

More information

How Learning Differs from Optimization. Sargur N. Srihari

How Learning Differs from Optimization. Sargur N. Srihari How Learning Differs from Optimization Sargur N. srihari@cedar.buffalo.edu 1 Topics in Optimization Optimization for Training Deep Models: Overview How learning differs from optimization Risk, empirical

More information

Lecture : Training a neural net part I Initialization, activations, normalizations and other practical details Anne Solberg February 28, 2018

Lecture : Training a neural net part I Initialization, activations, normalizations and other practical details Anne Solberg February 28, 2018 INF 5860 Machine learning for image classification Lecture : Training a neural net part I Initialization, activations, normalizations and other practical details Anne Solberg February 28, 2018 Reading

More information

Deep Learning. Volker Tresp Summer 2014

Deep Learning. Volker Tresp Summer 2014 Deep Learning Volker Tresp Summer 2014 1 Neural Network Winter and Revival While Machine Learning was flourishing, there was a Neural Network winter (late 1990 s until late 2000 s) Around 2010 there

More information

Residual Networks And Attention Models. cs273b Recitation 11/11/2016. Anna Shcherbina

Residual Networks And Attention Models. cs273b Recitation 11/11/2016. Anna Shcherbina Residual Networks And Attention Models cs273b Recitation 11/11/2016 Anna Shcherbina Introduction to ResNets Introduced in 2015 by Microsoft Research Deep Residual Learning for Image Recognition (He, Zhang,

More information

Deepest Neural Networks

Deepest Neural Networks Deepest Neural Networks arxiv:707.0267v [cs.ne] 9 Jul 207 Raúl Rojas Dahlem Center for Machine Learning and Robotics Freie Universität Berlin July 207 Abstract This paper shows that a long chain of perceptrons

More information

Pattern Recognition. Kjell Elenius. Speech, Music and Hearing KTH. March 29, 2007 Speech recognition

Pattern Recognition. Kjell Elenius. Speech, Music and Hearing KTH. March 29, 2007 Speech recognition Pattern Recognition Kjell Elenius Speech, Music and Hearing KTH March 29, 2007 Speech recognition 2007 1 Ch 4. Pattern Recognition 1(3) Bayes Decision Theory Minimum-Error-Rate Decision Rules Discriminant

More information

Supplementary A. Overview. C. Time and Space Complexity. B. Shape Retrieval. D. Permutation Invariant SOM. B.1. Dataset

Supplementary A. Overview. C. Time and Space Complexity. B. Shape Retrieval. D. Permutation Invariant SOM. B.1. Dataset Supplementary A. Overview This supplementary document provides more technical details and experimental results to the main paper. Shape retrieval experiments are demonstrated with ShapeNet Core55 dataset

More information

Visual object classification by sparse convolutional neural networks

Visual object classification by sparse convolutional neural networks Visual object classification by sparse convolutional neural networks Alexander Gepperth 1 1- Ruhr-Universität Bochum - Institute for Neural Dynamics Universitätsstraße 150, 44801 Bochum - Germany Abstract.

More information

Kernel-based online machine learning and support vector reduction

Kernel-based online machine learning and support vector reduction Kernel-based online machine learning and support vector reduction Sumeet Agarwal 1, V. Vijaya Saradhi 2 andharishkarnick 2 1- IBM India Research Lab, New Delhi, India. 2- Department of Computer Science

More information

Notes on Multilayer, Feedforward Neural Networks

Notes on Multilayer, Feedforward Neural Networks Notes on Multilayer, Feedforward Neural Networks CS425/528: Machine Learning Fall 2012 Prepared by: Lynne E. Parker [Material in these notes was gleaned from various sources, including E. Alpaydin s book

More information

4.12 Generalization. In back-propagation learning, as many training examples as possible are typically used.

4.12 Generalization. In back-propagation learning, as many training examples as possible are typically used. 1 4.12 Generalization In back-propagation learning, as many training examples as possible are typically used. It is hoped that the network so designed generalizes well. A network generalizes well when

More information

Image Compression: An Artificial Neural Network Approach

Image Compression: An Artificial Neural Network Approach Image Compression: An Artificial Neural Network Approach Anjana B 1, Mrs Shreeja R 2 1 Department of Computer Science and Engineering, Calicut University, Kuttippuram 2 Department of Computer Science and

More information

Neuron Selectivity as a Biologically Plausible Alternative to Backpropagation

Neuron Selectivity as a Biologically Plausible Alternative to Backpropagation Neuron Selectivity as a Biologically Plausible Alternative to Backpropagation C.J. Norsigian Department of Bioengineering cnorsigi@eng.ucsd.edu Vishwajith Ramesh Department of Bioengineering vramesh@eng.ucsd.edu

More information

Hidden Units. Sargur N. Srihari

Hidden Units. Sargur N. Srihari Hidden Units Sargur N. srihari@cedar.buffalo.edu 1 Topics in Deep Feedforward Networks Overview 1. Example: Learning XOR 2. Gradient-Based Learning 3. Hidden Units 4. Architecture Design 5. Backpropagation

More information

A Quick Guide on Training a neural network using Keras.

A Quick Guide on Training a neural network using Keras. A Quick Guide on Training a neural network using Keras. TensorFlow and Keras Keras Open source High level, less flexible Easy to learn Perfect for quick implementations Starts by François Chollet from

More information

Neural Network Neurons

Neural Network Neurons Neural Networks Neural Network Neurons 1 Receives n inputs (plus a bias term) Multiplies each input by its weight Applies activation function to the sum of results Outputs result Activation Functions Given

More information

SEMANTIC COMPUTING. Lecture 9: Deep Learning: Recurrent Neural Networks (RNNs) TU Dresden, 21 December 2018

SEMANTIC COMPUTING. Lecture 9: Deep Learning: Recurrent Neural Networks (RNNs) TU Dresden, 21 December 2018 SEMANTIC COMPUTING Lecture 9: Deep Learning: Recurrent Neural Networks (RNNs) Dagmar Gromann International Center For Computational Logic TU Dresden, 21 December 2018 Overview Handling Overfitting Recurrent

More information

All You Want To Know About CNNs. Yukun Zhu

All You Want To Know About CNNs. Yukun Zhu All You Want To Know About CNNs Yukun Zhu Deep Learning Deep Learning Image from http://imgur.com/ Deep Learning Image from http://imgur.com/ Deep Learning Image from http://imgur.com/ Deep Learning Image

More information

LSTM for Language Translation and Image Captioning. Tel Aviv University Deep Learning Seminar Oran Gafni & Noa Yedidia

LSTM for Language Translation and Image Captioning. Tel Aviv University Deep Learning Seminar Oran Gafni & Noa Yedidia 1 LSTM for Language Translation and Image Captioning Tel Aviv University Deep Learning Seminar Oran Gafni & Noa Yedidia 2 Part I LSTM for Language Translation Motivation Background (RNNs, LSTMs) Model

More information

CPSC 340: Machine Learning and Data Mining

CPSC 340: Machine Learning and Data Mining CPSC 340: Machine Learning and Data Mining Fundamentals of learning (continued) and the k-nearest neighbours classifier Original version of these slides by Mark Schmidt, with modifications by Mike Gelbart.

More information

Sentiment Classification of Food Reviews

Sentiment Classification of Food Reviews Sentiment Classification of Food Reviews Hua Feng Department of Electrical Engineering Stanford University Stanford, CA 94305 fengh15@stanford.edu Ruixi Lin Department of Electrical Engineering Stanford

More information

Pouya Kousha Fall 2018 CSE 5194 Prof. DK Panda

Pouya Kousha Fall 2018 CSE 5194 Prof. DK Panda Pouya Kousha Fall 2018 CSE 5194 Prof. DK Panda 1 Observe novel applicability of DL techniques in Big Data Analytics. Applications of DL techniques for common Big Data Analytics problems. Semantic indexing

More information

Predicting Messaging Response Time in a Long Distance Relationship

Predicting Messaging Response Time in a Long Distance Relationship Predicting Messaging Response Time in a Long Distance Relationship Meng-Chen Shieh m3shieh@ucsd.edu I. Introduction The key to any successful relationship is communication, especially during times when

More information

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li Learning to Match Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li 1. Introduction The main tasks in many applications can be formalized as matching between heterogeneous objects, including search, recommendation,

More information

Fast or furious? - User analysis of SF Express Inc

Fast or furious? - User analysis of SF Express Inc CS 229 PROJECT, DEC. 2017 1 Fast or furious? - User analysis of SF Express Inc Gege Wen@gegewen, Yiyuan Zhang@yiyuan12, Kezhen Zhao@zkz I. MOTIVATION The motivation of this project is to predict the likelihood

More information

Neural Networks. Robot Image Credit: Viktoriya Sukhanova 123RF.com

Neural Networks. Robot Image Credit: Viktoriya Sukhanova 123RF.com Neural Networks These slides were assembled by Eric Eaton, with grateful acknowledgement of the many others who made their course materials freely available online. Feel free to reuse or adapt these slides

More information

Stacked Denoising Autoencoders for Face Pose Normalization

Stacked Denoising Autoencoders for Face Pose Normalization Stacked Denoising Autoencoders for Face Pose Normalization Yoonseop Kang 1, Kang-Tae Lee 2,JihyunEun 2, Sung Eun Park 2 and Seungjin Choi 1 1 Department of Computer Science and Engineering Pohang University

More information

Inception and Residual Networks. Hantao Zhang. Deep Learning with Python.

Inception and Residual Networks. Hantao Zhang. Deep Learning with Python. Inception and Residual Networks Hantao Zhang Deep Learning with Python https://en.wikipedia.org/wiki/residual_neural_network Deep Neural Network Progress from Large Scale Visual Recognition Challenge (ILSVRC)

More information

Classification Lecture Notes cse352. Neural Networks. Professor Anita Wasilewska

Classification Lecture Notes cse352. Neural Networks. Professor Anita Wasilewska Classification Lecture Notes cse352 Neural Networks Professor Anita Wasilewska Neural Networks Classification Introduction INPUT: classification data, i.e. it contains an classification (class) attribute

More information

Final Report: Classification of Plankton Classes By Tae Ho Kim and Saaid Haseeb Arshad

Final Report: Classification of Plankton Classes By Tae Ho Kim and Saaid Haseeb Arshad Final Report: Classification of Plankton Classes By Tae Ho Kim and Saaid Haseeb Arshad Table of Contents 1. Project Overview a. Problem Statement b. Data c. Overview of the Two Stages of Implementation

More information

Decentralized and Distributed Machine Learning Model Training with Actors

Decentralized and Distributed Machine Learning Model Training with Actors Decentralized and Distributed Machine Learning Model Training with Actors Travis Addair Stanford University taddair@stanford.edu Abstract Training a machine learning model with terabytes to petabytes of

More information

SDA: Software-Defined Accelerator for Large- Scale DNN Systems

SDA: Software-Defined Accelerator for Large- Scale DNN Systems SDA: Software-Defined Accelerator for Large- Scale DNN Systems Jian Ouyang, 1 Shiding Lin, 1 Wei Qi, 1 Yong Wang, 1 Bo Yu, 1 Song Jiang, 2 1 Baidu, Inc. 2 Wayne State University Introduction of Baidu A

More information

arxiv: v1 [cs.cv] 14 Dec 2016

arxiv: v1 [cs.cv] 14 Dec 2016 Detect, Replace, Refine: Deep Structured Prediction For Pixel Wise Labeling arxiv:1612.04770v1 [cs.cv] 14 Dec 2016 Spyros Gidaris University Paris-Est, LIGM Ecole des Ponts ParisTech spyros.gidaris@imagine.enpc.fr

More information

11. Neural Network Regularization

11. Neural Network Regularization 11. Neural Network Regularization CS 519 Deep Learning, Winter 2016 Fuxin Li With materials from Andrej Karpathy, Zsolt Kira Preventing overfitting Approach 1: Get more data! Always best if possible! If

More information

Machine Learning. The Breadth of ML Neural Networks & Deep Learning. Marc Toussaint. Duy Nguyen-Tuong. University of Stuttgart

Machine Learning. The Breadth of ML Neural Networks & Deep Learning. Marc Toussaint. Duy Nguyen-Tuong. University of Stuttgart Machine Learning The Breadth of ML Neural Networks & Deep Learning Marc Toussaint University of Stuttgart Duy Nguyen-Tuong Bosch Center for Artificial Intelligence Summer 2017 Neural Networks Consider

More information

Artificial Intelligence. Programming Styles

Artificial Intelligence. Programming Styles Artificial Intelligence Intro to Machine Learning Programming Styles Standard CS: Explicitly program computer to do something Early AI: Derive a problem description (state) and use general algorithms to

More information

CS231A Course Project Final Report Sign Language Recognition with Unsupervised Feature Learning

CS231A Course Project Final Report Sign Language Recognition with Unsupervised Feature Learning CS231A Course Project Final Report Sign Language Recognition with Unsupervised Feature Learning Justin Chen Stanford University justinkchen@stanford.edu Abstract This paper focuses on experimenting with

More information