Layer-Specific Adaptive Learning Rates for Deep Networks


arXiv v1 [cs.CV] 15 Oct 2015

Bharat Singh, Soham De, Yangmuzi Zhang, Thomas Goldstein, and Gavin Taylor
Department of Computer Science and Department of Electrical & Computer Engineering, University of Maryland, College Park, MD
Department of Computer Science, US Naval Academy, Annapolis, MD
Email: {bharat, sohamde, ...}

Abstract

The increasing complexity of deep learning architectures is resulting in training times of weeks or even months. This slow training is due in part to "vanishing gradients," in which the gradients used by back-propagation are extremely large for weights connecting deep layers (layers near the output layer) and extremely small for shallow layers (layers near the input layer); this results in slow learning in the shallow layers. Additionally, it has been shown that in highly non-convex problems, such as deep neural networks, there is a proliferation of high-error, low-curvature saddle points, which slows down learning dramatically [1]. In this paper, we attempt to overcome these two problems by proposing an optimization method for training deep neural networks which uses learning rates that are both specific to each layer in the network and adaptive to the curvature of the function, increasing the learning rate at low-curvature points. This enables us to speed up learning in the shallow layers of the network and to quickly escape high-error, low-curvature saddle points. We test our method on standard image classification datasets such as MNIST, CIFAR10 and ImageNet, and demonstrate that our method increases accuracy as well as reduces the required training time over standard algorithms.

I. INTRODUCTION

Deep neural networks have been extremely successful over the past few years, achieving state-of-the-art performance on a large number of tasks such as image classification [2], face recognition [3], sentiment analysis [4], speech recognition [5], etc. One can spot a general trend in these papers: results tend to get better as the amount of training data increases, along with an increase in the complexity of the deep network architecture. However, increasingly complex deep networks can take weeks or months to train, even with high-performance hardware. Thus, there is a need for more efficient methods for training deep networks.

Deep neural networks learn high-level features by performing a sequence of non-linear transformations. Let our training data set A be composed of n data points $a_1, a_2, \ldots, a_n \in \mathbb{R}^m$ and corresponding labels $B = \{b_i\}_{i=1}^n$. Let us consider a 3-layer network with activation function f. Let $X_1$ and $X_2$ denote the weights on each layer that we are trying to learn, i.e., $X_1$ denotes the weights between nodes of the first layer and the second layer, and $X_2$ denotes the weights between nodes of the second layer and the third layer. The learning problem for this specific example can be formulated as the following optimization problem:

$\min_{X_1, X_2} \; \| f(f(A X_1)\, X_2) - B \|_2^2$    (1)

The activation function f can be any non-linear mapping, and is traditionally a sigmoid or tanh function. Recently, rectified linear (ReLU) units ($f(z) = \max\{0, z\}$) have become popular because they tend to be easy to train and yield superior results for some problems [6]. The non-convex objective (1) is usually minimized using iterative methods (such as back-propagation) with the hope of converging to a good local minimum. Most iterative schemes generate additive updates to a set of parameters x (in our case, the weight matrices) of the form

$x^{(k+1)} = x^{(k)} + \Delta^{(k)}$    (2)

where $\Delta^{(k)}$ is some appropriately chosen update.
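To make objective (1) concrete before turning to how it is minimized, here is a minimal NumPy sketch of the 3-layer forward pass and squared loss. This is an illustration only, not the authors' code; the sigmoid is just one of the traditional activation choices named above, and the sizes are arbitrary.

```python
import numpy as np

def f(z):
    # Sigmoid activation; tanh or ReLU could be substituted.
    return 1.0 / (1.0 + np.exp(-z))

def loss(A, B, X1, X2):
    # Objective (1): || f(f(A X1) X2) - B ||_2^2
    hidden = f(A @ X1)       # first non-linear transformation
    output = f(hidden @ X2)  # second non-linear transformation
    return np.sum((output - B) ** 2)

# Tiny synthetic instance: n = 4 data points in R^3, 2-dimensional labels.
rng = np.random.default_rng(0)
A, B = rng.normal(size=(4, 3)), rng.normal(size=(4, 2))
X1, X2 = rng.normal(size=(3, 5)), rng.normal(size=(5, 2))
print(loss(A, B, X1, X2))
```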
Notice that we use slightly different notation here from the standard optimization literature, in that we incorporate the step size or learning rate $t^{(k)}$ within $\Delta^{(k)}$. This is done to help us describe other optimization algorithms easily in the following sections. Thus, $\Delta^{(k)}$ denotes the update to the parameters, and comprises a search direction and a step size or learning rate $t^{(k)}$, which controls how large a step to take in that direction. Most common update rules are variants of gradient descent, where the search direction is given by the negative gradient $g^{(k)}$:

$\Delta^{(k)} = -t^{(k)} g^{(k)} = -t^{(k)} \nabla f(x^{(k)})$    (3)

Since the size of the training data for these deep networks is usually on the order of millions or billions of data points, exact computation of the gradient is not feasible. Rather, the gradient is often estimated using a single data point or a small batch of data points. This is the basis for stochastic gradient descent (SGD) [7], which is the most widely used method for training deep nets. SGD requires manually selecting an initial learning rate, and then designing an update rule for the learning rate which decreases it over time (for example, exponential decay with time). The performance of SGD, however, is very sensitive to this choice of update, leading to adaptive methods that automatically adjust the learning rate as the system learns [8], [9]. When these descent methods are used to train deep networks, additional problems are introduced. As the number of layers in a network increases, the gradients that are propagated back to the initial layers get very small. This dramatically slows down the rate of learning in the initial layers, and slows down convergence of the whole network [10].
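The following sketch ties equations (2) and (3) to the SGD procedure described above. The gradient oracle `grad_minibatch` and the square-root decay schedule are assumptions made for illustration; the paper's experiments use the schedules discussed in Section II.

```python
import numpy as np

def sgd(x, grad_minibatch, t0=0.1, num_iters=1000):
    # x^(k+1) = x^(k) + Delta^(k) with Delta^(k) = -t^(k) g^(k)  (eqs. (2)-(3))
    for k in range(num_iters):
        g = grad_minibatch(x, k)     # stochastic estimate of grad f(x^(k))
        t_k = t0 / np.sqrt(1.0 + k)  # one possible decreasing schedule (assumed)
        x = x - t_k * g
    return x
```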

Recently, it has also been shown that for high-dimensional non-convex problems, such as deep networks, the fraction of local minima with high error relative to the global minimum shrinks exponentially in the number of dimensions. Instead, these problems contain an exponentially large number of high-error saddle points with low curvature [1], [11], [12]. Gradient descent methods, in general, move away from saddle points by following the directions of negative curvature. However, due to the low curvature associated with small negative eigenvalues, the steps taken become very small, thus slowing down learning considerably.

In this paper, we propose a method that alleviates the problems mentioned above. The main contributions of our method are summarized below:
- The learning rates are specific to each layer in the network. This allows larger learning rates to compensate for the small size of gradients in shallow layers.
- The learning rate for each layer tends to increase at low-curvature points. This enables the method to quickly escape high-error, low-curvature saddle points, which occur in abundance in deep networks.
- It is applicable to most existing stochastic gradient optimization methods which use a global learning rate.
- It requires very little extra computation over standard stochastic gradient methods, and requires no storage of previous gradients as in AdaGrad [9].

In Section II, we review some popular gradient methods that have been successful for deep networks. In Section III, we describe our optimization algorithm. Finally, in Section IV we compare our method to standard optimization algorithms on datasets like MNIST, CIFAR10 and ImageNet.

II. RELATED WORK

Stochastic Gradient Descent (SGD) still remains one of the most widely used methods for large-scale machine learning, largely due to its ease of implementation. In SGD, the updates to the parameters are defined by equations (2) and (3), and the learning rate is decreased over time as the iterates approach a local optimum. A standard learning rate update is given by

$t^{(k)} = t^{(0)} / (1 + \gamma k)^p$    (4)

where the initial learning rate $t^{(0)}$, γ and p are hyper-parameters chosen by the user.

Many modifications to the basic gradient descent algorithm have been proposed. A popular method in the convex optimization literature is Newton's method, which uses the Hessian of the objective function f(x) to determine the step:

$\Delta_{nt} = -\nabla^2 f(x^{(k)})^{-1} g^{(k)}$    (5)

Unfortunately, as the number of parameters increases, even to moderate sizes, computing the Hessian becomes very computationally expensive. Thus, many modifications have been proposed which either try to improve the use of first-order information or try to approximate the Hessian of the objective function. In this paper, we focus on modifications to first-order methods.

The classical momentum method [13] is a technique that increases the learning rate for parameters for which the gradient consistently points in the same direction, while decreasing the learning rate for parameters whose gradient changes rapidly. The update equation keeps track of previous updates to the parameters with an exponential decay:

$\Delta^{(k)} = \mu\, \Delta x^{(k-1)} - t\, g^{(k)}$    (6)

where μ ∈ [0, 1] is called the momentum coefficient, and t > 0 is the global learning rate.
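The momentum update (6) can be sketched in a few lines. As before, `grad_fn` is a hypothetical stochastic gradient oracle, and the hyper-parameter values are placeholders rather than the paper's settings.

```python
import numpy as np

def momentum_sgd(x, grad_fn, t=0.01, mu=0.9, num_iters=1000):
    # Delta^(k) = mu * Delta x^(k-1) - t * g^(k)   (eq. (6))
    delta = np.zeros_like(x)
    for k in range(num_iters):
        g = grad_fn(x, k)           # stochastic gradient at x^(k)
        delta = mu * delta - t * g  # exponentially decayed history of updates
        x = x + delta
    return x
```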
Nesterov's Accelerated Gradient (NAG) [14], a first-order method, has a better convergence rate than gradient descent in certain situations. This method predicts the gradient for the next iteration and adjusts the current update based on the predicted gradient: if the gradient is larger at the predicted next point, the current step is larger, and if it is smaller, the step slows down. Recently, [15] showed that this method can be viewed as a momentum method with the update equation

$\Delta^{(k)} = \mu\, \Delta x^{(k-1)} - t\, \nabla f(x^{(k-1)} + \mu\, \Delta x^{(k-1)})$    (7)

Through a carefully designed random initialization, and a particular type of slowly increasing schedule for μ, this method can reach high levels of performance when used on deep networks [15].

Rather than using a single learning rate over all parameters, recent work has shown that a learning rate specific to each parameter can be a much more successful approach. A method that has gained popularity is AdaGrad [9], which uses the following update rule:

$\Delta^{(k)} = -\dfrac{t}{\sqrt{\sum_{i=1}^{k} (g^{(i)})^2}}\; g^{(k)}$    (8)

The denominator is the ℓ2 norm of all the gradients from the previous iterations. This scales the global learning rate t, which is shared by all the parameters, to give a parameter-specific learning rate. One disadvantage of AdaGrad is that it accumulates the gradients over all previous iterations, and this sum continues to grow throughout training. This (along with weight decay) shrinks the learning rate on each parameter until each is infinitesimally small, limiting the number of iterations of useful training.

A method which builds on AdaGrad and attempts to address some of the above-mentioned disadvantages is AdaDelta [8]. AdaDelta accumulates the gradients of the previous time steps using an exponentially decaying average of the squared gradients. This prevents the denominator from becoming infinitesimally small, and ensures that the parameters continue to be updated even after a large number of iterations. It also replaces the global learning rate t with an exponentially decaying average of the squares of the parameter updates Δx over the previous iterations. This method has been shown to perform relatively well when used to train deep networks, and is much less sensitive to the choice of hyper-parameters. However, it does not perform as well as other methods like SGD and AdaGrad in terms of accuracy [8].
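For later comparison with the layer-wise scheme of Section III, here is a sketch of the per-parameter AdaGrad rule (8). The small epsilon in the denominator is a common numerical stabilizer that equation (8) does not show; it and the hyper-parameter values are assumptions.

```python
import numpy as np

def adagrad(x, grad_fn, t=0.01, eps=1e-8, num_iters=1000):
    # Delta^(k) = -(t / sqrt(sum_{i<=k} (g^(i))^2)) * g^(k)   (eq. (8))
    hist = np.zeros_like(x)   # per-parameter running sum of squared gradients
    for k in range(num_iters):
        g = grad_fn(x, k)
        hist += g ** 2        # grows monotonically over all iterations
        x = x - t * g / (np.sqrt(hist) + eps)
    return x
```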

III. OUR APPROACH

Because of the vanishing gradients phenomenon, shallow network layers tend to have much smaller gradients than deep layers, sometimes differing by orders of magnitude from one layer to the next [10]. In most previous work on optimization for deep networks, methods either keep a global learning rate that is shared over all parameters, or use an adaptive learning rate specific to each parameter. Our method exploits the following observation: parameters in the same layer have gradients of similar magnitudes, and can thus efficiently share a common learning rate, while layer-specific learning rates can be used to accelerate layers with smaller gradients. Another advantage of this approach is that by avoiding the computation of large numbers of parameter-specific learning rates, our method remains computationally efficient. Finally, as mentioned in Section I, to avoid slowing down learning at high-error, low-curvature saddle points, we also want our method to take large steps at low-curvature points.

Let $t^{(k)}$ be the learning rate at the k-th iteration for any standard optimization method. In the case of SGD, this would be given by equation (4), while for AdaGrad it would just be the global learning rate t as in equation (8). We propose to modify $t^{(k)}$ as follows:

$t_\ell^{(k)} = t^{(k)} \left(1 + \log\left(1 + 1/\|g_\ell^{(k)}\|_2\right)\right)$    (9)

Here $t_\ell^{(k)}$ denotes the new learning rate for the parameters in the ℓ-th layer at the k-th iteration, and $g_\ell^{(k)}$ denotes the vector of gradients of the parameters in the ℓ-th layer at the k-th iteration. Thus, we use only the gradients within a layer to determine the learning rate for that layer. It is also important to note that we do not use any gradients from previous iterations, and thus save on storage.

From equation (9), we see that when the gradients in a layer are very large, the equation simply reduces to the normal learning rate $t^{(k)}$. However, when the gradients are very small, we are more likely to be near a low-curvature point, so the equation scales up the learning rate. This ensures that the initial layers of the network learn faster, and that we escape high-error, low-curvature saddle points quickly.

We can use this layer-specific learning rate on top of SGD. Using equation (3), the update in that case would be:

$\Delta_\ell^{(k)} = -t_\ell^{(k)} g_\ell^{(k)}$    (10)
$\phantom{\Delta_\ell^{(k)}} = -t^{(k)} \left(1 + \log\left(1 + 1/\|g_\ell^{(k)}\|_2\right)\right) g_\ell^{(k)}$    (11)

where $\Delta_\ell^{(k)}$ denotes the update to the parameters of the ℓ-th layer at the k-th iteration. Similarly, we can modify AdaGrad's update equation (8) to use our modified learning rates:

$\Delta_\ell^{(k)} = -\dfrac{t_\ell^{(k)}}{\sqrt{\sum_{i=1}^{k} (g_\ell^{(i)})^2}}\; g_\ell^{(k)}$    (12)

Note that, unlike AdaGrad, which uses a distinct learning rate for each parameter, we use a different learning rate for each layer, shared by all weights in that layer. Additionally, AdaGrad modifies the learning rate based on the entire history of gradients observed for a weight, while we update a layer's learning rate based only on the gradients observed for the weights of that layer in the current iteration. Thus, our scheme avoids both storing gradient information from previous iterations and computing learning rates for each parameter; it is therefore less computationally and memory intensive than AdaGrad. The proposed layer-specific learning rates also work well on large-scale datasets like ImageNet (when applied on top of SGD), where AdaGrad fails to converge to a good solution. The proposed method can thus be used with any existing optimization technique that uses a global learning rate; it provides a layer-specific learning rate and escapes saddle points quickly, all without sacrificing computation or memory usage.
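Putting equations (9)-(12) together, a minimal sketch of the layer-wise rate applied on top of SGD and AdaGrad might look as follows. The per-layer parameter and gradient dictionaries, the helper names, and the zero-norm guard are assumptions for illustration; this is not the authors' Caffe implementation.

```python
import numpy as np

def layer_rate(t_k, g_layer):
    # Eq. (9): t_l^(k) = t^(k) * (1 + log(1 + 1 / ||g_l^(k)||_2))
    norm = np.linalg.norm(g_layer) + 1e-12  # guard against a zero norm (assumed)
    return t_k * (1.0 + np.log(1.0 + 1.0 / norm))

def sgd_layerwise_step(params, grads, t_k):
    # Eq. (11): per-layer SGD step with the adaptive layer-specific rate.
    for name in params:
        g = grads[name]
        params[name] -= layer_rate(t_k, g) * g

def adagrad_layerwise_step(params, grads, hist, t, eps=1e-8):
    # Eq. (12): AdaGrad step with the global rate t replaced by t_l^(k).
    for name in params:
        g = grads[name]
        hist[name] += g ** 2
        params[name] -= layer_rate(t, g) * g / (np.sqrt(hist[name]) + eps)
```

Note that `layer_rate` reads only the current iteration's gradients for a single layer, which is why the SGD variant needs no per-parameter history at all.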
As we show in Section IV, using our adaptive learning rates on top of existing optimization techniques almost always improves performance on standard datasets.

IV. EXPERIMENTAL RESULTS

A. Datasets

We present image classification results on three standard datasets: MNIST, CIFAR10 and ImageNet (the ILSVRC 2012 dataset, part of the ImageNet challenge). MNIST contains 60,000 handwritten digit images for training and 10,000 handwritten digit images for testing. CIFAR10 has 10 classes with 6,000 images in each class. ImageNet contains 1.2 million color images from 1000 different classes.

B. Experimental Details

We use Caffe [16] to implement our method. Caffe provides optimization methods for Stochastic Gradient Descent (SGD), Nesterov's Accelerated Gradient (NAG) and AdaGrad. For a fair comparison between state-of-the-art methods, we add our adaptive layer-specific learning rate method on top of each of these optimization methods. In our experiments, we demonstrate the effectiveness of our algorithm with convolutional neural networks on the three datasets. On CIFAR10, we use the same global learning rate as provided in Caffe. Since our method always increases the layer-specific learning rate (with respect to other optimization methods) based on the global learning rate, we start with a slightly smaller global learning rate to make the learning less aggressive for the ImageNet experiment. SGD was initialized with the learning rate used in [2] for the experiments on ImageNet.

1) MNIST: We use the same architecture as LeNet for our experiments on MNIST. We present the results of using our proposed layer-specific learning rates on top of stochastic gradient descent, Nesterov's accelerated gradient method and AdaGrad on the MNIST dataset. Since all methods converge very quickly on this dataset, we present the accuracy and loss only for the first 2,000 iterations.

[Table I (columns: Iteration, SGD, Ours-SGD, Nesterov, Ours-NAG, AdaGrad, Ours-AdaGrad): Mean error rate on MNIST after different numbers of iterations for stochastic gradient descent, Nesterov's accelerated gradient and AdaGrad, and their layer-specific adaptive versions. Each method was run 10 times; the mean and standard deviation are reported. The numeric entries were not recoverable.]

[Fig. 1: On the CIFAR10 dataset: plots showing accuracies comparing (a) SGD, (b) NAG and (c) AdaGrad, each with our adaptive layer-wise learning rates. For the SGD plot, results are shown both when the learning rate is stepped down at 50,000 iterations and at 60,000 iterations.]

Table I reports the mean error rate and standard deviation when each method was run 10 times. We observe that our proposed layer-specific learning rate is consistently better than Nesterov's accelerated gradient, stochastic gradient descent and AdaGrad. In all the experiments, the proposed method also attains the maximum accuracy of 99.2%, just like stochastic gradient descent, Nesterov's accelerated gradient and AdaGrad.

2) CIFAR10: On CIFAR10 we use a convolutional neural network with 2 layers of 32 feature maps from 5x5 convolution kernels, each followed by a 3x3 max pooling layer. After this we have another convolution layer with 64 feature maps from a 5x5 convolution kernel, followed by a 3x3 max pooling layer. Finally, we have a fully connected layer with 10 hidden nodes and a soft-max logistic regression layer. After each convolution layer a ReLU non-linearity is applied. This is the same architecture as specified in Caffe. The learning rate was held fixed for the first 60,000 iterations, and was dropped by a factor of 10 at 60,000 and 65,000 iterations.

On this dataset, we again observe that the final error and loss of our method are consistently lower than SGD, NAG and AdaGrad (Table II). After the step down, our adaptive method reaches a higher accuracy than both SGD and NAG. Note that just by using our optimization method (without changing the network architecture) we obtain an improvement of 0.32% over the mean accuracy for SGD. Even if we step down the learning rate at 50,000 iterations, we obtain an accuracy of 82.08%, which is better than SGD after 70,000 iterations, significantly cutting down on the required training time (Fig. 1). Since our method converges much faster when used with SGD, it is possible to perform the step down on the learning rate even earlier, potentially reducing training time further. Although AdaGrad does not perform very well on CIFAR10 with default parameters, we observe an improvement of 1.3% over its mean final accuracy, again with a significant speed-up in training time.

[Table II (columns: Iteration, SGD, Ours-SGD, Nesterov, Ours-NAG, AdaGrad, Ours-AdaGrad): Mean accuracy on CIFAR10 after different numbers of iterations for SGD, NAG and AdaGrad, and their layer-specific adaptive versions. The mean and standard deviation over 5 runs are reported. The numeric entries were not recoverable.]

3) ImageNet: We use an implementation of AlexNet [2] in Caffe, a deep convolutional neural network architecture, for comparing our method with other optimization algorithms. AlexNet consists of 5 convolution layers followed by 3 fully connected layers. More details regarding the architecture can be found in the paper [2]. Since AlexNet is a deep neural network of significant complexity, it is a suitable architecture on which to apply our method. Fig. 2 shows the results of using our method on top of SGD. We observe that our method obtains significantly greater accuracy and lower loss after 100,000 and 200,000 iterations.
Further, we are also able to reach the maximum accuracy of 57.5% on the validation set after 295,000 iterations, which SGD achieves only after 345,000 iterations, resulting in a 15% reduction in training time. Given that such a large model takes more than a week to train properly, this is a significant reduction. Our loss is also consistently lower than that of SGD across all iterations.

[Fig. 2: On the ImageNet dataset: plots comparing stochastic gradient descent with our adaptive layer-wise learning rates: (a) loss with SGD, (b) accuracy with SGD. We can see a consistent improvement in accuracy and loss over the regular SGD method across all iterations.]

In the existing model, we perform a step down by a factor of 10 after every 100,000 iterations. In order to analyze how our method performs when we reduce the number of training iterations, we vary the number of training iterations at a specific learning rate before performing a step down. Table III shows the final accuracy after 350,000 iterations for SGD and our method. Although the final accuracy drops slightly as we decrease the number of iterations after which we perform the step down in the learning rate, it is clearly evident that our method achieves better accuracy than SGD. Note that we report top-1 class accuracy. Since we use the Caffe implementation of the AlexNet architecture and do not use any data augmentation techniques, our results are slightly lower than those reported in [2].

[Table III: Comparison of stochastic gradient descent and our method with the step down performed at different iterations (70,000; 80,000; 90,000) on ImageNet. The accuracy entries were partially lost; the surviving values are 55.84%, 56.57%, and 57.13%.]

V. CONCLUSIONS

In this paper we propose a general method for training deep neural networks using layer-specific adaptive learning rates, which can be used on top of any optimization method with a global learning rate. The method uses the gradients from each layer to compute an adaptive learning rate for that layer. It aims to speed up convergence when the parameters are in a low-curvature saddle-point region. Layer-specific learning rates also enable the method to prevent slow learning in the initial layers of a deep network, which is usually caused by very small gradient values.

ACKNOWLEDGMENT

The authors acknowledge ONR Grant numbers N WX01341 and N, as well as the University of Maryland supercomputing resources.

REFERENCES

[1] R. Pascanu, Y. N. Dauphin, S. Ganguli, and Y. Bengio, "On the saddle point problem for non-convex optimization," arXiv preprint.
[2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012.
[3] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, "DeepFace: Closing the gap to human-level performance in face verification," in 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[4] R. Socher, A. Perelygin, J. Y. Wu, J. Chuang, C. D. Manning, A. Y. Ng, and C. Potts, "Recursive deep models for semantic compositionality over a sentiment treebank," in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2013.
[5] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath et al., "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," IEEE Signal Processing Magazine, vol. 29, no. 6, 2012.
[6] X. Glorot, A. Bordes, and Y. Bengio, "Deep sparse rectifier networks," in Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, JMLR W&CP, vol. 15, 2011.
[7] H. Robbins and S. Monro, "A stochastic approximation method," The Annals of Mathematical Statistics, vol. 22, no. 3.
[8] M. D. Zeiler, "ADADELTA: An adaptive learning rate method," arXiv preprint.
[9] J. Duchi, E. Hazan, and Y. Singer, "Adaptive subgradient methods for online learning and stochastic optimization," The Journal of Machine Learning Research, vol. 12, 2011.
[10] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8.
[11] A. J. Bray and D. S. Dean, "Statistics of critical points of Gaussian fields on large-dimensional spaces," Physical Review Letters, vol. 98, no. 15.
[12] Y. V. Fyodorov and I. Williams, "Replica symmetry breaking condition exposed by random matrix calculation of landscape complexity," Journal of Statistical Physics, vol. 129, no. 5-6.
[13] B. T. Polyak, "Some methods of speeding up the convergence of iteration methods," USSR Computational Mathematics and Mathematical Physics, vol. 4, no. 5, pp. 1-17.
[14] Y. Nesterov, "A method of solving a convex programming problem with convergence rate O(1/k^2)," in Soviet Mathematics Doklady, vol. 27, no. 2, 1983.
[15] I. Sutskever, J. Martens, G. Dahl, and G. Hinton, "On the importance of initialization and momentum in deep learning," in Proceedings of the 30th International Conference on Machine Learning (ICML-13), 2013.
[16] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," arXiv preprint, 2014.
