Gradient Descent Optimization Algorithms for Deep Learning: Batch Gradient Descent, Stochastic Gradient Descent, Mini-batch Gradient Descent


Variants of gradient descent: batch gradient descent, stochastic gradient descent, mini-batch gradient descent. Slide credit: http://sebastianruder.com/optimizing-gradient-descent/index.html#batchgradientdescent

Algorithms covered: SGD = Stochastic GD, Momentum, NAG = Nesterov accelerated gradient, Adagrad, Adadelta, RMSprop, Adam.

Notations
$x = (x_1, \dots, x_n)$: input (random variable) having $n$ features.
$\{(x^{(i)}, y^{(i)})\}_{i=1}^{m}$: labeled dataset with the corresponding target values.
$x_j^{(i)}$: value of feature $j$ in training example $i$.
The labeled data is split into a training set, a cross-validation set, and a test set.
Let $X$ be a random variable. $X \sim \mathcal{N}(\mu, \sigma^2)$: $X$ is distributed as Gaussian (Normal) with mean $\mu$ and variance $\sigma^2$. The probability density function is $p(x;\mu,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$.

Backpropagation Algorithm
The goal is to determine the weights and biases $W = \big(W^{(1)}, b^{(1)}, \dots, W^{(L)}, b^{(L)}\big)$ (all trainable variables) such that the network output $h_W(x)$ minimizes the following error
$E(W) = \frac{1}{2m}\sum_{i=1}^{m}\big\|h_W(x^{(i)}) - y^{(i)}\big\|^2$, where the activation function is $g = \mathrm{ReLU}$.

$E(W) = \frac{1}{2m}\sum_{i=1}^{m}\big\|h_W(x^{(i)}) - y^{(i)}\big\|^2$ with $g = \mathrm{ReLU}$ is its general form. To understand gradient descent optimization, let us consider the following simplified version of the cost function: drop $g$ and simplify notations, writing $W = (b, w)$ and $X = (1, x)$, so that $h_W(X) = W \cdot X$ and $J(W) = \frac{1}{2m}\sum_{i=1}^{m}\big(W \cdot X^{(i)} - y^{(i)}\big)^2$.

Gradient Descent Optimization for a Linear Model (SGD = Stochastic GD, Momentum, NAG = Nesterov accelerated gradient, Adagrad, Adadelta, RMSprop, Adam)
Assume a linear model: the hypothesis is $h_W(x) = b + w_1 x_1 + \dots + w_n x_n$. For notational convenience, we may denote $W = (b, w_1, \dots, w_n)$ and $X = (1, x_1, \dots, x_n)$, so that $h_W(x) = W \cdot X$. Here $b$ represents the bias and $w = (w_1, \dots, w_n)$ represents the weight vector. Cost function: $J(W) = \frac{1}{2m}\sum_{i=1}^{m}\big(h_W(x^{(i)}) - y^{(i)}\big)^2$.
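
As a concrete illustration of the hypothesis and cost above, here is a minimal NumPy sketch (the function names `hypothesis` and `cost` are illustrative, not from the slides; `X` is assumed to already contain the leading 1 column for the bias):

```python
import numpy as np

def hypothesis(W, X):
    """Linear hypothesis h_W(x) = W . X, where each row of X starts with 1 for the bias b."""
    return X @ W

def cost(W, X, y):
    """Cost J(W) = 1/(2m) * sum_i (h_W(x^(i)) - y^(i))^2."""
    m = len(y)
    r = hypothesis(W, X) - y
    return (r @ r) / (2 * m)
```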

Gradient Descent
Cost function: $J(W) = \frac{1}{2m}\sum_{i=1}^{m}\big(h_W(x^{(i)}) - y^{(i)}\big)^2$. The gradient descent algorithm is
Repeat { $W_j := W_j - \eta\,\frac{\partial J(W)}{\partial W_j}$ (simultaneously update $W_j$ for $j = 0, 1, \dots, n$) }   ($\eta$ = learning rate).
There are three variants of gradient descent (batch, stochastic, mini-batch), which differ in how much data we use to compute the gradient of the objective function. Depending on the amount of data, we make a trade-off between the accuracy of the parameter update and the time it takes to perform an update.
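
A minimal batch gradient descent loop for this linear model and cost (a sketch under the assumptions above; `batch_gradient_descent`, `eta`, and `n_iters` are illustrative names):

```python
import numpy as np

def batch_gradient_descent(X, y, eta=0.01, n_iters=1000):
    """Batch GD: every update uses the gradient of J over the full training set."""
    m, n = X.shape
    W = np.zeros(n)
    for _ in range(n_iters):
        grad = X.T @ (X @ W - y) / m   # dJ/dW computed over all m examples
        W = W - eta * grad             # simultaneous update of all W_j
    return W
```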

Quick review of GD & CGD: solve $Ax = y$ with $A$ symmetric positive definite.
Let $r = y - Ax$ be the residual, which indicates how far we are from the correct value of $y$. Let $f$ be a function such that $\nabla f(x) = Ax - y$, i.e. $f(x) = \tfrac{1}{2}x^{\top}Ax - x^{\top}y$; then $\nabla f(x) = -r$, and $e = x - x^{*}$ is the error, where $x^{*}$ is the exact solution.
The GD method is $x_{k+1} = x_k + \alpha_k r_k$, where $\alpha_k$ is chosen such that $0 = r_k^{\top}\big(A(x_k + \alpha_k r_k) - y\big)$. Taking $y - Ax_k = r_k$, we have $\alpha_k = \frac{r_k^{\top} r_k}{r_k^{\top} A r_k}$.
The CGD method is $x_{k+1} = x_k + \alpha_k p_k$, where the search directions $p_k = r_k + \beta_k p_{k-1}$ are chosen to be $A$-conjugate, with $\beta_k = \frac{r_k^{\top} r_k}{r_{k-1}^{\top} r_{k-1}}$ and $\alpha_k = \frac{r_k^{\top} r_k}{p_k^{\top} A p_k}$.
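
A short NumPy sketch of both solvers for a symmetric positive-definite system, implementing the standard updates written above (illustrative code, not from the slides):

```python
import numpy as np

def gd_solve(A, y, n_iters=1000, tol=1e-10):
    """Steepest descent for Ax = y, A symmetric positive definite."""
    x = np.zeros_like(y, dtype=float)
    for _ in range(n_iters):
        r = y - A @ x                      # residual
        if np.linalg.norm(r) < tol:
            break
        alpha = (r @ r) / (r @ (A @ r))    # exact line search along r
        x = x + alpha * r
    return x

def cg_solve(A, y, n_iters=1000, tol=1e-10):
    """Conjugate gradient for Ax = y, A symmetric positive definite."""
    x = np.zeros_like(y, dtype=float)
    r = y - A @ x
    p = r.copy()
    for _ in range(n_iters):
        if np.linalg.norm(r) < tol:
            break
        Ap = A @ p
        alpha = (r @ r) / (p @ Ap)
        x = x + alpha * p
        r_new = r - alpha * Ap
        beta = (r_new @ r_new) / (r @ r)   # makes the new direction A-conjugate to p
        p = r_new + beta * p
        r = r_new
    return x
```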

Mini-batch gradient descent algorithm
If $m = Lk$ and the batch size is $k$, then for $i = 0, k, 2k, \dots, (L-1)k$:
$W := W - \eta\,\frac{1}{k}\sum_{j=i+1}^{i+k}\big(h_W(x^{(j)}) - y^{(j)}\big)\,X^{(j)}$.
The advantage of computing more than one example at a time is that we can use vectorized implementations over the $k$ examples (see the sketch below).
Map-reduce and data parallelism:
1. Divide the training set into $L$ subsets ($L$ may be the number of machines you have).
2. On each of those machines, calculate the partial sum $\sum_{j}\big(h_W(x^{(j)}) - y^{(j)}\big)\,X^{(j)}$ over that machine's subset.
3. (Map-reduce) Combine the partial sums: $\nabla_W J(W) = \frac{1}{m}\sum_{k=1}^{L}(\text{partial sum from the } k\text{-th machine})$, where the data from $(k-1)\tfrac{m}{L}+1$ to $k\tfrac{m}{L}$ are given to the $k$-th machine.
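
A vectorized mini-batch sketch (the per-epoch shuffle is a common practical addition not spelled out on the slide; `k` is the batch size and the names are illustrative):

```python
import numpy as np

def minibatch_gradient_descent(X, y, eta=0.01, k=32, n_epochs=10):
    """Mini-batch GD: each update uses a batch of k examples, computed in one vectorized step."""
    m, n = X.shape
    W = np.zeros(n)
    for _ in range(n_epochs):
        perm = np.random.permutation(m)        # shuffle once per epoch
        for start in range(0, m, k):
            idx = perm[start:start + k]
            Xb, yb = X[idx], y[idx]
            grad = Xb.T @ (Xb @ W - yb) / len(idx)
            W = W - eta * grad
    return W
```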

Stochastic gradient descent (SGD)
From the training set $\{(x^{(i)}, y^{(i)})\}_{i=1}^{m}$ and a given architecture, stochastic gradient descent (SGD) in contrast performs a parameter update for each training example $(x^{(i)}, y^{(i)})$. The algorithm is as follows:
1. Randomly shuffle the data set.
2. Repeat { For $i = 1, \dots, m$ { $W := W - \eta\big(h_W(x^{(i)}) - y^{(i)}\big)\,X^{(i)}$ } }
Here $X^{(i)} = (1, x_1^{(i)}, \dots, x_n^{(i)})$ (the 1 is for the bias $b$). The cost may change a lot on each iteration due to the diversity of the training examples.
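
A corresponding SGD sketch with one update per training example (illustrative NumPy code under the same linear-model assumptions):

```python
import numpy as np

def stochastic_gradient_descent(X, y, eta=0.01, n_epochs=10):
    """SGD: one parameter update per training example."""
    m, n = X.shape
    W = np.zeros(n)
    for _ in range(n_epochs):
        for i in np.random.permutation(m):     # 1. randomly shuffle the data set
            grad = (X[i] @ W - y[i]) * X[i]    # gradient from a single example
            W = W - eta * grad                 # 2. update for each example
    return W
```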

Extensions and variants on SGD
Many improvements on the basic SGD algorithm have been proposed and used. In particular, in machine learning, the need to set a learning rate (step size) has been recognized as problematic. Setting this parameter too high can cause the algorithm to diverge; setting it too low makes it slow to converge. A conceptually simple extension of stochastic gradient descent makes the learning rate a decreasing function of the iteration number, giving a learning rate schedule, so that the first iterations cause large changes in the parameters while the later ones perform only fine-tuning.
Variants discussed next: SGD with Momentum, NAG (Nesterov accelerated gradient), Adagrad, Adadelta, RMSprop, Adam.
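
A minimal sketch of such a decreasing schedule; the $\eta_t = \eta_0 / (1 + \text{decay}\cdot t)$ form is just one common choice, shown for illustration:

```python
def decayed_learning_rate(eta0, t, decay=0.01):
    """Learning-rate schedule: large steps early, fine-tuning later."""
    return eta0 / (1.0 + decay * t)

# Example: eta shrinks from 0.1 toward 0 as the iteration count t grows.
for t in range(0, 1000, 250):
    print(t, decayed_learning_rate(0.1, t))
```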

Momentum: gradient descent algorithm
Stochastic gradient descent with momentum remembers the update $\Delta W$ at each iteration, and determines the next update as a convex combination of the gradient and the previous update.

Momentum method:
1. $v_t = \gamma v_{t-1} + \eta\,\nabla_W J(W)$
2. $W := W - v_t$
We set the momentum term $\gamma$ to a value of around 0.9. A ball that rolls down a hill, blindly following the slope, is highly unsatisfactory. We'd like to have a smarter ball, a ball that has a notion of where it is going so that it knows to slow down before the hill slopes up again.
NAG uses our momentum term $\gamma v_{t-1}$ to move $W$. Computing $W - \gamma v_{t-1}$ thus gives us an approximation of the next position of $W$. We can now effectively look ahead by calculating the gradient not w.r.t. our current parameters but w.r.t. the approximate future position:
$v_t = \gamma v_{t-1} + \eta\,\nabla_W J(W - \gamma v_{t-1})$, $\quad W := W - v_t$.
While Momentum first computes the current gradient (small blue vector in the image) and then takes a big jump in the direction of the updated accumulated gradient (big blue vector), NAG first makes a big jump in the direction of the previous accumulated gradient (brown vector), measures the gradient, and then makes a correction (red vector), which results in the complete NAG update (green vector). This anticipatory update prevents us from going too fast and results in increased responsiveness, which has significantly increased the performance of RNNs on a number of tasks.
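
A sketch of both update rules; `grad_fn` is an assumed callable that returns $\nabla_W J$ at a given point, `W` and `v` may be NumPy arrays or scalars, and all names are illustrative:

```python
def momentum_update(W, v, grad, eta=0.01, gamma=0.9):
    """Momentum: v_t = gamma*v_{t-1} + eta*grad(W);  W := W - v_t."""
    v = gamma * v + eta * grad
    return W - v, v

def nag_update(W, v, grad_fn, eta=0.01, gamma=0.9):
    """NAG: evaluate the gradient at the look-ahead point W - gamma*v before updating."""
    g = grad_fn(W - gamma * v)      # gradient w.r.t. the approximate future position
    v = gamma * v + eta * g
    return W - v, v
```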

Adagrad
$g_t = \nabla_W J(W_t)$, with $g_{t,i}$ the gradient of the objective function w.r.t. parameter $W_i$ at time step $t$. $G_t$ is a diagonal matrix where each diagonal element $(i,i)$ is the sum of the squares of the gradients w.r.t. $W_i$ up to time step $t$; that is, $G_{t,ii} = \sum_{\tau=1}^{t} g_{\tau,i}^2$. The update is
$W_{t+1} = W_t - \frac{\eta}{\sqrt{G_t + \epsilon}} \odot g_t$,
where $A \odot B$ indicates the element-wise matrix-vector multiplication between $A$ and $B$.
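
An Adagrad update sketch that stores the diagonal of $G_t$ as a vector (illustrative NumPy code):

```python
import numpy as np

def adagrad_update(W, G, grad, eta=0.01, eps=1e-8):
    """Adagrad: per-parameter learning rate scaled by the accumulated squared gradients."""
    G = G + grad ** 2                         # diagonal of G_t, kept as a vector
    W = W - eta * grad / (np.sqrt(G) + eps)   # element-wise division plays the role of the odot product
    return W, G
```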

Adadelta is an extension of Adagrad that seeks to reduce its aggressive, monotonically decreasing learning rate. Instead of accumulating all past squared gradients, Adadelta restricts the window of accumulated past gradients to some fixed size $w$, implemented as an exponentially decaying running average $E[g^2]_t = \gamma E[g^2]_{t-1} + (1-\gamma)\,g_t^2$ (the same running average used by RMSprop).
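
An Adadelta update sketch following the running-average formulation above (it additionally tracks a running average of squared updates, as in the original Adadelta method, so no global learning rate is needed; names are illustrative):

```python
import numpy as np

def adadelta_update(W, Eg2, Edx2, grad, rho=0.9, eps=1e-6):
    """Adadelta: running averages of squared gradients (Eg2) and squared updates (Edx2)."""
    Eg2 = rho * Eg2 + (1 - rho) * grad ** 2
    dW = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * grad   # update scaled by RMS ratios
    Edx2 = rho * Edx2 + (1 - rho) * dW ** 2
    return W + dW, Eg2, Edx2
```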

RMSProp
RMSProp (for Root Mean Square Propagation) is also a method in which the learning rate is adapted for each of the parameters. The idea is to divide the learning rate for a weight by a running average of the magnitudes of recent gradients for that weight. RMSprop and Adadelta were both developed independently around the same time, stemming from the need to resolve Adagrad's radically diminishing learning rates; RMSprop is in fact identical to the first update vector of Adadelta.
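
An RMSprop update sketch (the defaults $\gamma = 0.9$ and $\eta = 0.001$ follow common practice; names are illustrative):

```python
import numpy as np

def rmsprop_update(W, Eg2, grad, eta=0.001, gamma=0.9, eps=1e-8):
    """RMSprop: divide the learning rate by a running average of recent gradient magnitudes."""
    Eg2 = gamma * Eg2 + (1 - gamma) * grad ** 2
    W = W - eta * grad / (np.sqrt(Eg2) + eps)
    return W, Eg2
```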

Adaptive Moment Estimation (Adam) is another method that computes adaptive learning rates for each parameter. In addition to storing an exponentially decaying average of past squared gradients $v_t$ like Adadelta and RMSprop, Adam also keeps an exponentially decaying average of past gradients $m_t$, similar to momentum:
$m_t = \beta_1 m_{t-1} + (1-\beta_1)\,g_t$, $\quad v_t = \beta_2 v_{t-1} + (1-\beta_2)\,g_t^2$.
With the bias-corrected estimates $\hat{m}_t = m_t/(1-\beta_1^t)$ and $\hat{v}_t = v_t/(1-\beta_2^t)$, the update is $W_{t+1} = W_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\,\hat{m}_t$.
For comparison, Adadelta and RMSprop use $E[g^2]_t = \gamma E[g^2]_{t-1} + (1-\gamma)\,g_t^2$ and $W_{t+1} = W_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}}\,g_t$.
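
An Adam update sketch implementing the moment estimates and bias correction above (`t` is the 1-based time step; names are illustrative):

```python
import numpy as np

def adam_update(W, m, v, grad, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: decaying averages of gradients (m) and squared gradients (v), with bias correction."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)              # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)              # bias-corrected second moment
    W = W - eta * m_hat / (np.sqrt(v_hat) + eps)
    return W, m, v
```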

Regularized Least Squares: Ridge regression (squared-error cost plus an $L_2$ penalty $\lambda\|w\|_2^2$) and Lasso (squared-error cost plus an $L_1$ penalty $\lambda\|w\|_1$).

Optimality Theorem
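
As a hedged illustration for the ridge case, the standard closed-form minimizer of the ridge cost is $W^{*} = (X^{\top}X + \lambda I)^{-1}X^{\top}y$; a minimal NumPy sketch of that textbook solution (not necessarily the slide's theorem) follows:

```python
import numpy as np

def ridge_closed_form(X, y, lam=1.0):
    """Ridge regression minimizer: solve (X^T X + lam*I) W = X^T y."""
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)
```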