Stochastic Gradient Descent Algorithm in the Computational Network Toolkit

Brian Guenter, Dong Yu, Adam Eversole, Oleksii Kuchaiev, Michael L. Seltzer
Microsoft Corporation, One Microsoft Way, Redmond, WA 98052
{bguenter, dongyu, adame, olekku, mseltzer}@microsoft.com

Abstract

We introduce the stochastic gradient descent algorithm used in the computational network toolkit (CNTK), a general-purpose machine learning toolkit written in C++ for training and using models that can be expressed as a computational network. We describe the algorithm used to compute the gradients automatically for a given network. We also propose a low-cost automatic learning rate selection algorithm and demonstrate that it works well in practice.

1 Computational Network Toolkit

A computational network (CN) is a directed graph in which each leaf represents an input value or a learnable parameter and each node represents an operator. Figure 1 illustrates an example CN of a log-linear model. Here, each node is identified by a {node name : operator type} pair and takes its ordered children as the operator's inputs. For example, in the figure, T = Times(W, X), which is different from T = Times(X, W). A CN can have many root nodes, which are used under different conditions. For example, one root node may represent a cross-entropy training criterion and another may represent an evaluation criterion. The network in Figure 1 has only one root node, {C : CrossEntropy}. Many machine learning models that can be described via a series of operations, such as neural networks, can be converted into a CN.

[Figure 1: A log-linear model represented as a computational network.]

The computational network toolkit (CNTK) is a general-purpose, C++-based machine learning toolkit for models that can be described as CNs. Figure 2 illustrates the architecture of CNTK. The core of CNTK is an internal representation of a CN, which provides two key methods: Evaluate, which computes the value of a node given its inputs, and ComputeGradient, which computes the gradient of a node with respect to its inputs. These methods are executed using an IExecutionEngine such as a CPU, a GPU, or a data flow graph such as PTask [1]. ICNBuilder reads the network description (or language) and creates a CN. IDataReader reads in features and labels stored in different formats. ILearner optimizes the model parameters with different optimization algorithms.

[Figure 2: Architecture of CNTK.]
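To make this structure concrete, the following is a minimal C++ sketch of the node and execution-engine abstraction described above. The class layout, member names, and signatures are illustrative assumptions made for this sketch, not CNTK's actual headers.

    // Minimal sketch (not CNTK's actual API) of the node abstraction:
    // every node exposes Evaluate and ComputeGradient, and an execution
    // engine decides where those calls run.
    #include <cstddef>
    #include <memory>
    #include <string>
    #include <vector>

    struct Matrix;  // placeholder for the toolkit's CPU/GPU matrix type

    class ComputationNode {
    public:
        explicit ComputationNode(std::string name) : name_(std::move(name)) {}
        virtual ~ComputationNode() = default;

        // Compute this node's value from the values of its ordered children.
        virtual void Evaluate() = 0;

        // Accumulate into the given child the gradient of the training
        // criterion with respect to that child, given this node's gradient.
        virtual void ComputeGradient(std::size_t childIndex) = 0;

        const std::vector<std::shared_ptr<ComputationNode>>& Children() const {
            return children_;
        }

    protected:
        std::string name_;
        std::vector<std::shared_ptr<ComputationNode>> children_;  // ordered operator inputs
    };

    // An execution engine runs Evaluate/ComputeGradient on a particular
    // device: a CPU, a GPU, or a data flow engine such as PTask.
    class IExecutionEngine {
    public:
        virtual ~IExecutionEngine() = default;
        virtual void Forward(ComputationNode& root) = 0;
        virtual void Backward(ComputationNode& root) = 0;
    };

In this picture, ICNBuilder would construct a graph of ComputationNode objects from the network description, IDataReader would bind the input leaves to feature and label matrices, and ILearner would drive the forward/backward passes and update the parameter leaves.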

In this paper, we focus on the well-known stochastic gradient descent (SGD) algorithm. Section 2 describes how gradients are computed efficiently regardless of the CN architecture. In Section 3 a new learning rate selection algorithm is proposed and evaluated. Section 4 provides some practical tricks used in the toolkit's implementation.

2 Automatic Gradient Computation

The objective function f for training the computational network is of the form f : R^m -> R^1. The gradient of this function is

    D(f(g_1(h_1, \ldots, h_m), \ldots, g_n(\ldots))) = \sum_{i=1}^{n} \frac{\partial f}{\partial g_i} D(g_i),    (1)

where the g_i(...) are themselves functions of some h_j, and so on. Naively applying this recursion to the function graph leads to exponential worst-case running time, because graph nodes are evaluated multiple times. Factoring out the common prefix terms \partial f / \partial g_i makes the computation linear in the number of computation nodes in the graph. This is analogous to common subexpression elimination in a conventional expression graph, only here the common subexpressions are the parents of the nodes rather than the children. For scalar functions this factorization can be performed with a simple recursive algorithm [2]. However, for matrix expressions we need differentiation rules that express partial derivatives in terms of matrix expressions, rather than as a large scalar function expression graph.

For functions applied per matrix element, such as the sigmoid, the differentiation rules are analogous to the scalar case. Matrix addition is similarly straightforward. The only case that requires some consideration is matrix multiplication. Suppose our objective function is f(C) = ||C||, where C = AB and ||C|| is the Frobenius norm of C. Writing the elements of the gradient of f with respect to A,

    \frac{\partial f}{\partial a_{k,l}} = \sum_{i,j} \frac{\partial f}{\partial c_{i,j}} \frac{\partial c_{i,j}}{\partial a_{k,l}},    (2)

and noticing that

    \frac{\partial c_{i,j}}{\partial a_{k,l}} = \begin{cases} b_{l,j} & \text{if } i = k \\ 0 & \text{otherwise,} \end{cases}

we can rewrite eq. 2 as

    \frac{\partial f}{\partial a_{k,l}} = \sum_{j} \frac{\partial f}{\partial c_{k,j}} b_{l,j}.    (3)

If we define the matrix dC with elements dc_{i,j} = \frac{\partial f}{\partial c_{i,j}}, then we can write eq. 3 as

    da_{k,l} = \sum_{j} dc_{k,j} b_{l,j}.    (4)

Eq. 4 is just the formula for the matrix multiplication dA = dC B^T. A similar derivation shows that dB = A^T dC. A similar approach was used in Theano [3], although the implementation is different.

A simple recursive algorithm for computing the gradient is shown in Algorithm 1. The function is called on the root node of the computation network, with the dc argument set to a 1x1 matrix containing the value 1. Each node type, such as *, +, or sigmoid, defines its own node.d(child) function.

Algorithm 1 Recursive Gradient

    function GRAD = GRADIENT(f, dc)
        Input: f = root node of the computational network, dc = derivative of the parent node
        Output: df = new network representing the derivative of f
        if timesVisited = parentCount then
            foreach child of this node
                GRADIENT(child, node.d(child))
        else
            node.derivative += dc
            timesVisited += 1
        end if
    end function
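As a concrete illustration of eq. 4 and of how a node type supplies its own node.d(child) rule in Algorithm 1, the sketch below computes the two child gradients of a Times (matrix product) node: dA = dC B^T for the first child and dB = A^T dC for the second. The Matrix struct and function names are assumptions made for this sketch, not CNTK's matrix classes.

    // Illustrative sketch only: a Times node distributing the gradient dC of
    // the criterion w.r.t. its output C = A * B to its two children using
    // dA = dC * B^T (eq. 4) and dB = A^T * dC.
    #include <cstddef>
    #include <vector>

    struct Matrix {
        std::size_t rows = 0, cols = 0;
        std::vector<float> data;  // row-major values
        Matrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0f) {}
        float& at(std::size_t i, std::size_t j) { return data[i * cols + j]; }
        float at(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
    };

    // dA(k,l) = sum_j dC(k,j) * B(l,j), i.e. dA = dC * B^T  (eq. 4)
    Matrix GradientWrtA(const Matrix& dC, const Matrix& B) {
        Matrix dA(dC.rows, B.rows);
        for (std::size_t k = 0; k < dC.rows; ++k)
            for (std::size_t l = 0; l < B.rows; ++l)
                for (std::size_t j = 0; j < dC.cols; ++j)
                    dA.at(k, l) += dC.at(k, j) * B.at(l, j);
        return dA;
    }

    // dB(k,l) = sum_i A(i,k) * dC(i,l), i.e. dB = A^T * dC
    Matrix GradientWrtB(const Matrix& A, const Matrix& dC) {
        Matrix dB(A.cols, dC.cols);
        for (std::size_t k = 0; k < A.cols; ++k)
            for (std::size_t l = 0; l < dC.cols; ++l)
                for (std::size_t i = 0; i < A.rows; ++i)
                    dB.at(k, l) += A.at(i, k) * dC.at(i, l);
        return dB;
    }

In the terms of Algorithm 1, GradientWrtA is what a Times node's node.d(first child) would return and GradientWrtB corresponds to node.d(second child); in practice both products would be dispatched to a BLAS GEMM rather than written as explicit loops.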

3 Automatic Learning Rate Selection

The performance of the stochastic gradient descent (SGD) algorithm depends critically on how the learning rates are chosen. Many researchers have proposed schemes for making learning rates adaptive [4]. These methods can typically be classified into two categories: those based on an estimated Hessian and those based on an empirical search algorithm. We have found these existing schemes to be expensive when applied to large datasets. For example, the efficient Hessian estimation algorithm used in [5] incurs one additional back-propagation pass, or 100% additional computation cost. A full-range empirical search typically requires processing the training data with different learning rates at least 10 times, even if an efficient algorithm is used to adjust the learning rate based on the sign of improvement. Such a large expense is not affordable for real-world tasks with millions or billions of training samples.

We propose a novel method for automatically adjusting learning rates. Similar to the method proposed in [5], our approach can increase or decrease the learning rate and is suitable for non-stationary problems. Unlike other empirical methods, however, our algorithm determines the learning rate before each epoch using a small sample of the training set, which requires only minimal additional computation. Algorithm 2 describes the steps of the learning rate search algorithm. During the first epoch, in which the model parameters have typically been randomly initialized, the algorithm searches for the best learning rate on a set of n random samples using a grid search that decreases the learning rate by a factor of 0.618 each step (0.5 works equally well). For all subsequent epochs, the algorithm searches for the largest learning rate that is sufficient to see improvement in the average training criterion if all N samples in the epoch are used to update the parameters. Note that when the learning rate is fixed, the earlier samples typically improve the training criterion more than the later samples, since the error signal becomes smaller over time. We found that we can approximate this behavior by assuming that the gain is proportional to the square root of the number of samples seen. The learning rate selection algorithm is typically run on 2-5% (or 200-500 mini-batches) of the full epoch size. Note that starting with the second epoch there are two alternative criteria for selecting the learning rate: one is to select the best learning rate, just as in the first epoch, and the other is to select the largest learning rate that improves over a learning rate of 0. In practice, we have found that these two approaches reduce the learning rate quickly and lead to an inferior final model.

Algorithm 2 Learning Rate Search

    function λ = SEARCHLEARNINGRATE(N, n, e_{t-1})
        Input: N = total number of samples in an epoch; n = number of samples used to estimate the learning rate; e_{t-1} = training criterion from the previous epoch
        Output: λ = optimal learning rate
        if t = 0 then
            λ = grid search for the best λ to minimize e^s_λ
        else
            compute e^s_0 with learning rate 0 by running SGD over n samples
            r = sqrt(n / N)
            e_base = (1 - r) * e^s_0 + r * e_{t-1}
            λ = grid search for the largest λ such that e^s_λ ≤ e_base
        end if
        return λ
    end function
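A rough C++ sketch of the per-epoch search in Algorithm 2 follows. The top of the search grid, the number of grid points, and the evalCriterion callback (assumed to run SGD with the given learning rate over the n-sample subset, starting from a copy of the current model, and return the resulting average training criterion e^s_λ) are assumptions made for illustration, not CNTK's implementation.

    // Illustrative sketch of Algorithm 2 (not CNTK's implementation).
    #include <cmath>
    #include <functional>

    double SearchLearningRate(int epoch,                       // t
                              double N,                        // samples per full epoch
                              double n,                        // samples used for the search
                              double prevEpochCriterion,       // e_{t-1}
                              double initialLambda,            // top of the search grid (assumed)
                              int gridSteps,                   // number of grid points (assumed)
                              const std::function<double(double)>& evalCriterion) {
        const double shrink = 0.618;  // grid ratio from the paper (0.5 works equally well)

        if (epoch == 0) {
            // First epoch: pick the grid point with the smallest criterion.
            double bestLambda = initialLambda;
            double bestCrit = evalCriterion(initialLambda);
            double lambda = initialLambda;
            for (int s = 1; s < gridSteps; ++s) {
                lambda *= shrink;
                double crit = evalCriterion(lambda);
                if (crit < bestCrit) { bestCrit = crit; bestLambda = lambda; }
            }
            return bestLambda;
        }

        // Later epochs: largest lambda whose criterion beats the extrapolated baseline.
        double e0 = evalCriterion(0.0);           // e^s_0: run over n samples with learning rate 0
        double r = std::sqrt(n / N);
        double eBase = (1.0 - r) * e0 + r * prevEpochCriterion;  // e_base in Algorithm 2

        double lambda = initialLambda;
        for (int s = 0; s < gridSteps; ++s) {
            if (evalCriterion(lambda) <= eBase) return lambda;   // largest acceptable lambda on the grid
            lambda *= shrink;
        }
        return lambda;  // fall back to the smallest grid point
    }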

[Figure 3: Comparison of different learning rate adjustment strategies on MNIST: (a) training-set cross-entropy per epoch and (b) selected learning rate per epoch.]

Figure 3 (a) and (b) illustrate the training-set cross-entropy improvement and the learning rate selected over epochs with different learning rate adjustment schemes on the MNIST data set [6]. In these experiments, we used a neural network with 784 input units, 256 hidden units, and 10 output units. The SGD algorithm used a mini-batch size of 32. The total training set has 60,000 samples, of which 3200 samples (or 5%) were used to adjust the learning rate. On average, each epoch required 3 passes through the 3200 samples, which increased the total training time by 15%. In the figure, we can clearly see the inferior performance of the other two alternatives. We can also observe that the proposed approach achieves better training-set performance than the fixed learning rate search approach. Note that the proposed algorithm may increase the learning rate. We have also applied this algorithm to speech data and advertisement click prediction data, and it outperformed a hand-tuned learning rate in both cases. For example, on the Switchboard 24-hour training set with 1502 classes (tied triphone states), the proposed algorithm, which used 1.5% of the samples and 5% overhead in the learning rate search, together with dropout [7] at a rate of 0.2 to control overfitting, achieved 49.1% frame accuracy on the evaluation set without pretraining, compared to 47.3% accuracy achieved by the best hand-tuned learning rate strategy with pretraining.

4 Practical Tricks

Most of the work performed in the Evaluate and ComputeGradient functions is expressed using matrix operations. Currently, CNTK implements matrix operations using the following BLAS libraries: the ACML math library from AMD for CPU computation and cuBLAS from NVIDIA for GPU computation. Since cuBLAS does not implement all the functions necessary for CNs, we had to implement many element-wise matrix operations and other functions using custom CUDA kernels.

One of the most important considerations when optimizing GPU code is minimizing memory copies between CPU RAM and GPU RAM. For example, when computing the training loss per minibatch, the resulting value is kept on the GPU rather than transferred to CPU memory, which avoids either a synchronization or an asynchronous memory copy. We also set the cuBLAS pointer mode to CUBLAS_POINTER_MODE_DEVICE, which allows cuBLAS to keep call results in GPU memory.

When implementing CUDA kernels, making use of shared memory and achieving a high level of parallelism are necessary for good performance. Functions like Max(), Sum(), and the Frobenius norm are implemented using a reduction-based approach that maximizes the use of shared memory within each thread block. The goal is to split the work among as many threads as possible, since thread overhead is negligible in CUDA. For classification tasks with many output classes (such as 1502 tied triphone states), the efficiency of functions such as AssignColumnwiseSoftmax() is important. In our implementation, we assign a separate CUDA block with many threads to each matrix column and use a reduction-based approach several times within each column to achieve greater speed.
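The snippet below sketches the pointer-mode trick described above using the public cuBLAS API: with CUBLAS_POINTER_MODE_DEVICE, a per-minibatch reduction result such as a sum of squared errors can stay in GPU memory instead of being copied back after every minibatch. The function name, buffer layout, and the surrounding training loop are assumptions for this sketch, and error checking is omitted.

    // Sketch of keeping a per-minibatch scalar (here, a sum of squared errors)
    // on the GPU by asking cuBLAS to write its result to device memory.
    #include <cublas_v2.h>

    void AccumulateLossOnDevice(cublasHandle_t handle,
                                const float* d_errors,  // device buffer of per-sample errors
                                int n,
                                float* d_loss) {        // device scalar receiving the result
        // Ask cuBLAS to return scalar results via device pointers, so no
        // host/device copy or implicit synchronization is needed here.
        cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_DEVICE);

        // d_loss = sum_i d_errors[i]^2, computed and stored entirely on the GPU.
        cublasSdot(handle, n, d_errors, 1, d_errors, 1, d_loss);
    }

When the host eventually needs the value, for example for end-of-epoch logging, a single copy of the scalar back to CPU memory suffices.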

5 Summary

In this paper we described the SGD algorithm used in CNTK, with a focus on the automatic gradient computation and the learning rate adjustment algorithm. We confirmed the effectiveness of the proposed approach on several real-world tasks.

References

[1] C. J. Rossbach, J. J. Currey, M. Silberstein, B. Ray, and E. Witchel, "PTask: Operating system abstractions to manage GPUs as compute devices," in Symposium on Operating Systems Principles, 2011.

[2] Brian Guenter, "Efficient symbolic differentiation for graphics applications," in ACM SIGGRAPH 2007 Papers (SIGGRAPH '07), New York, NY, USA, 2007.

[3] James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio, "Theano: a CPU and GPU math expression compiler," in Proceedings of the Python for Scientific Computing Conference (SciPy), June 2010. Oral presentation.

[4] Abraham P. George and Warren B. Powell, "Adaptive stepsizes for recursive estimation with applications in approximate dynamic programming," Machine Learning, vol. 65, no. 1, pp. 167-198, 2006.

[5] Tom Schaul, Sixin Zhang, and Yann LeCun, "No more pesky learning rates," arXiv preprint arXiv:1206.1106, 2012.

[6] Yann LeCun and Corinna Cortes, "The MNIST database of handwritten digits," 1998.

[7] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, "Improving neural networks by preventing co-adaptation of feature detectors," http://arxiv.org/abs/1207.0580, 2012.