SGD: Stochastic Gradient Descent


Improving SGD
Hantao Zhang
Deep Learning with Python
Reading: http://neuralnetworksanddeeplearning.com/index.html, Chapter 2

SGD: Stochastic Gradient Descent

Main idea: Given a set of input/output examples D = { (x, y) }.
Define the network as a function f(w, x) of the weights w and the input x.
Define the cost, say C = (1/(2|D|)) Σ_{(x,y)∈D} ||a(x) − y||², and try to minimize it.

For each epoch, repeat the following:
1. Compute a(x) = f(w, x) and C = (1/(2|D|)) Σ_{(x,y)∈D} ||a(x) − y||².
2. Compute ∂C/∂w.
3. Update w by w ← w − η (∂C/∂w) to decrease C.

For large datasets this is expensive: we don't want to load all the data D into memory, and the gradient depends on all the data.

An alternative: pick a small subset of examples, called a mini-batch B, with |B| << |D|:
- Approximate the gradient using C = (1/(2|B|)) Σ_{(x,y)∈B} ||a(x) − y||².
- On average ∂C/∂w is the right direction.
- Take a step in that direction.
- Repeat.

|B| = 1 (one example at a time) is a very popular choice, called online update.
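As a minimal sketch of the mini-batch loop described above (not part of the course code), assuming the caller supplies a gradient function grad_fn(w, X_batch, Y_batch) that returns the average gradient over the batch:

import numpy as np

def sgd(w, X, Y, grad_fn, eta=0.1, epochs=10, batch_size=32, seed=0):
    # Mini-batch SGD: shuffle each epoch, estimate the gradient on each
    # mini-batch, and take a step downhill: w <- w - eta * dC/dw
    rng = np.random.default_rng(seed)
    n = len(X)
    for _ in range(epochs):
        order = rng.permutation(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            w = w - eta * grad_fn(w, X[idx], Y[idx])
    return w

# Example use: least-squares cost C = ||Xb w - yb||^2 / (2|B|)
# grad_fn = lambda w, Xb, yb: Xb.T @ (Xb @ w - yb) / len(Xb)

With batch_size=1 this reduces to the online update; with batch_size=n it is full-batch gradient descent.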

Batch Update

- With on-line (stochastic) update we update the weights after every pattern.
- With batch update we accumulate the changes for each weight over a batch and update the weights at the end of the batch.
- Batch update often gives the correct direction of the gradient for the entire data set, while on-line update can make some weight updates in directions quite different from the average gradient of the entire data set, because instances are noisy and a single instance will not represent the average gradient.
- Size of the mini-batch? Another hyperparameter to choose through experiments and experience.

Stochastic Gradient Descent

- Since the true gradient is only approximated, the loss will not always decrease (locally), because the training data points are picked at random.
- SGD still converges over time.
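A minimal sketch of the contrast just described (illustrative only; grad_fn is an assumed per-example gradient function, not from the slides): the on-line version steps after every pattern, the batch version accumulates the changes and steps once.

import numpy as np

def online_update(w, batch_X, batch_Y, grad_fn, eta):
    # on-line (stochastic): apply each example's gradient immediately
    for x, y in zip(batch_X, batch_Y):
        w = w - eta * grad_fn(w, x, y)
    return w

def batch_update(w, batch_X, batch_Y, grad_fn, eta):
    # batch: accumulate the changes, update once at the end of the batch
    acc = np.zeros_like(w)
    for x, y in zip(batch_X, batch_Y):
        acc += grad_fn(w, x, y)
    return w - eta * acc / len(batch_X)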

Computation in a General NN

There are L layers, l ∈ {1, 2, …, L}, plus the input layer (l = 0); each layer is fully connected to the next.

An example of a 4-layer NN:
- 4 weight matrices: W_0, W_1, W_2, W_3, where W_i[j,k] = weight of the link from the j-th neuron in layer i to the k-th neuron in layer i+1.
- 4 outputs: y_1, y_2, y_3, y_4 (excluding the input y_0 = x).
- 4 weighted sums: y_0 → z_0 → y_1 → z_1 → y_2 → z_2 → y_3 → z_3 → y_4, with z_i = y_i W_i + b_i and y_{i+1} = a(z_i).
- 4 θ's and δ's: θ_1, …, θ_4 and δ_0, …, δ_3.

Let the cost be C ≡ ½ (y_L − y)², θ_i ≡ ∂C/∂y_i, δ_i ≡ ∂C/∂z_i. Then
θ_L = y_L − y,  δ_i = θ_{i+1} ⊙ a′(z_i),  θ_i = W_i δ_i,
and ∂C/∂W_i[j,k] = y_i[j] δ_i[k], i.e. ∂C/∂W_i is the outer product of y_i and δ_i.

For a mini-batch B, define ∇_i = Σ_B (∂C/∂W_i). Update W_i by W_i ← W_i − η ∇_i, or W_i ← W_i − η ∇_i / |B|.

Improve readability (one example at a time; negative list indices in the original code changed to positive):

def backprop(self, x, y):
    # feedforward computation
    y_act = x
    y_acts = [x]   # list to store all the activations
    z_sums = []    # list to store all the z vectors
    for b, w in zip(self.biases, self.weights):
        z = np.dot(y_act, w) + b
        z_sums.append(z)
        y_act = self.activation(z)
        y_acts.append(y_act)
    # backward propagation
    theta = self.cost_derivative(y_act, y)          # theta_L = y_L - y
    for i in range(self.num_layers - 1, -1, -1):
        ad = self.activation_derivative(z_sums[i], y_acts[i + 1])
        delta = theta * ad                          # delta_i = theta_{i+1} * a'(z_i)
        self.delta_b[i] = delta
        y_hat = y_acts[i][:, np.newaxis]            # column vector y_i
        delta_hat = delta[np.newaxis, :]            # row vector delta_i
        self.delta_w[i] = np.dot(y_hat, delta_hat)  # outer product: dC/dW_i
        if (i > 0):
            theta = np.dot(self.weights[i], delta)  # theta_i = W_i delta_i
    return (self.delta_b, self.delta_w)

Formulas used (shown as a sidebar on the code slides):
z_i = y_i W_i + b_i,  y_{i+1} = a(z_i)
C ≡ ½ (y_L − y)²,  θ_i ≡ ∂C/∂y_i,  δ_i ≡ ∂C/∂z_i
θ_L = y_L − y,  δ_i = θ_{i+1} ⊙ a′(z_i),  θ_i = W_i δ_i,  ∂C/∂W_i = y_iᵀ δ_i
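To see the θ/δ formulas above in isolation, here is a small standalone sketch (independent of the Network class; all names and sizes are illustrative) that computes ∂C/∂W_i for a 2-layer sigmoid network and checks one entry against a finite difference:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=3)                      # y_0 = x, 3 inputs
y = np.array([0.0, 1.0])                    # target, 2 outputs
W = [rng.normal(size=(3, 4)), rng.normal(size=(4, 2))]
b = [np.zeros(4), np.zeros(2)]

def forward(W):
    ys, zs = [x], []
    for Wi, bi in zip(W, b):
        zs.append(ys[-1] @ Wi + bi)         # z_i = y_i W_i + b_i
        ys.append(sigmoid(zs[-1]))          # y_{i+1} = a(z_i)
    return ys, zs

ys, zs = forward(W)
theta = ys[-1] - y                          # theta_L = y_L - y
grads = [None, None]
for i in (1, 0):
    delta = theta * sigmoid(zs[i]) * (1 - sigmoid(zs[i]))  # delta_i = theta_{i+1} * a'(z_i)
    grads[i] = np.outer(ys[i], delta)       # dC/dW_i = outer product of y_i and delta_i
    theta = W[i] @ delta                    # theta_i = W_i delta_i

# finite-difference check of one entry of dC/dW_0
C = lambda Wl: 0.5 * np.sum((forward(Wl)[0][-1] - y) ** 2)
eps = 1e-6
W_plus = [W[0].copy(), W[1].copy()]
W_plus[0][1, 2] += eps
print(grads[0][1, 2], (C(W_plus) - C(W)) / eps)  # the two numbers should be very close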

Improve performance

Before: one example at a time is fed to backprop. For a mini-batch B, ∇_i = Σ_B (∂C/∂W_i) and W_i ← W_i − η ∇_i (or W_i ← W_i − η ∇_i / |B|).

def update_batch(self, batch_x, batch_y, eta):
    # batch: mini-batch of examples
    # eta: learning rate
    nabla_b = np.array([np.zeros(b.shape) for b in self.biases])
    nabla_w = np.array([np.zeros(w.shape) for w in self.weights])
    for x, y in zip(batch_x, batch_y):
        delta_b, delta_w = self.backprop(x, y)
        nabla_w = nabla_w + delta_w
        nabla_b = nabla_b + delta_b
    self.weights -= eta * nabla_w
    self.biases -= eta * nabla_b

Improve performance

Now: a whole batch of examples is fed to backprop at once.

def update_batch(self, batch_x, batch_y, eta):
    # batch: mini-batch of examples
    # eta: learning rate
    nabla_w, nabla_b = self.backprop(batch_x, batch_y)
    for i in range(self.num_layers):
        self.weights[i] -= eta * np.sum(nabla_w[i], axis=0)
        self.biases[i] -= eta * np.sum(nabla_b[i], axis=0)
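A minimal sketch of how a training loop might slice the data into mini-batches and call update_batch (assuming a Network object net with the method above, and train_x/train_y arrays as in the later mnist.py slide; the helper name sgd_epoch is illustrative):

import numpy as np

def sgd_epoch(net, train_x, train_y, batch_size, eta, rng):
    # shuffle once per epoch, then update on each mini-batch
    order = rng.permutation(len(train_x))
    for start in range(0, len(train_x), batch_size):
        idx = order[start:start + batch_size]
        net.update_batch(train_x[idx], train_y[idx], eta)

# usage (hypothetical):
# rng = np.random.default_rng(0)
# for epoch in range(10):
#     sgd_epoch(net, train_x, np.array(train_y), batch_size=100, eta=0.1, rng=rng)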

Improve performance

Now a whole batch of examples is fed to backprop.

def backprop(self, x, y):
    # feedforward, x is a batch of examples
    y_act = x
    y_acts = [x]
    z_wsums = []
    for b, w in zip(self.biases, self.weights):
        z = np.dot(y_act, w) + b
        z_wsums.append(z)
        y_act = self.activation(z)
        y_acts.append(y_act)
    # backward propagation
    theta = self.cost_derivative(y_act, y)
    for i in range(self.num_layers - 1, -1, -1):
        ad = self.activation_derivative(z_wsums[i], y_acts[i + 1])
        delta = np.multiply(theta, ad)
        y_hat = y_acts[i][:, :, np.newaxis]            # shape (batch, n_i, 1)
        delta_hat = delta[:, np.newaxis, :]            # shape (batch, 1, n_{i+1})
        self.nabla_w[i] = np.multiply(y_hat, delta_hat)  # one outer product per example
        self.nabla_b[i] = delta
        if (i > 0):
            theta = np.dot(delta, np.transpose(self.weights[i]))
    return (self.nabla_w, self.nabla_b)

MNIST: Database of handwritten digits

yann.lecun.com/exdb/mnist/, by Yann LeCun's team at NYU.
- Has a training set of 60K examples (6K examples for each digit) and a test set of 10K examples.
- Each digit is a 28 x 28 pixel grey-level image. The digit itself occupies the central 20 x 20 pixels, and its center of mass lies at the center of the box.
- It is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal effort on preprocessing and formatting.
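A small standalone sketch (illustrative shapes only, not part of the course code) showing why the broadcasting in nabla_w works: multiplying a (batch, n, 1) array by a (batch, 1, m) array produces one outer product per example, and summing over axis 0 matches accumulating per-example np.outer calls.

import numpy as np

rng = np.random.default_rng(0)
batch, n, m = 5, 4, 3
y_i = rng.normal(size=(batch, n))        # activations of layer i, one row per example
delta_i = rng.normal(size=(batch, m))    # deltas of layer i, one row per example

# batched version, as in the vectorized backprop
batched = y_i[:, :, np.newaxis] * delta_i[:, np.newaxis, :]   # shape (batch, n, m)

# per-example version, as in the original backprop
looped = np.zeros((n, m))
for k in range(batch):
    looped += np.outer(y_i[k], delta_i[k])

print(np.allclose(batched.sum(axis=0), looped))   # True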

MNIST: Database of handwritten digits

MNIST also keeps a performance record of image recognition programs (error rates):
- LeCun's convolutional neural network variations: 0.8%, 0.6% and 0.4% on MNIST
- Tangent distance (Simard, LeCun & Denker): 2.5%
- Randomized decision trees (Amit, Geman & Wilder): 0.8%
- k-NN based shape context / TPS matching (Belongie, Malik & Puzicha): 0.6%
- SVM on orientation histograms (Maji & Malik): 0.8%

Network.py's performance:
- architecture = [784, 30, 10], epochs = 30: 4.6% error, 70 seconds
- architecture = [784, 60, 30, 10], epochs = 30: 4.2% error, 100 seconds

mnist.py

import gzip
import pickle
import numpy as np

def vectorized_digit(j):
    """Return a 10-dimensional unit vector with a 1.0 in the j-th position
    and zeroes elsewhere. This is used to convert a digit in (0...9) into a
    corresponding desired output from the neural network."""
    e = np.zeros(10)
    e[j] = 1.0
    return e

f = gzip.open('mnist.pkl.gz', 'rb')
train_data, valid_data, test_data = pickle.load(f, encoding="latin1")
f.close()

# train_x.shape = (50000, 784); 784 = 28x28, one flattened 28x28 image per row
# train_data[1].shape = (50000,)
# train_y.shape = (50000, 10)
train_x = train_data[0]
train_y = [vectorized_digit(y) for y in train_data[1]]
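Since each row of train_x is a flattened 28x28 image, a quick sanity check (not part of mnist.py; assumes matplotlib is available and that train_x, train_data and vectorized_digit from the snippet above are in scope) is to reshape one row and display it:

import matplotlib.pyplot as plt

img = train_x[0].reshape(28, 28)            # un-flatten the 784-vector
plt.imshow(img, cmap='gray')
plt.title("label = %d" % train_data[1][0])
plt.show()

print(vectorized_digit(train_data[1][0]))   # one-hot target for the same example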

mnist.py

import time
start_time = time.time()

import mlnnsgd as mlnn
net = mlnn.network([784, 60, 10])
print('creating Network =', net.sizes)
print('weight shapes:', [w.shape for w in net.weights])
net.sgd(train_x, train_y, epochs=10, batch_size=100, eta=0.1, test_data=test_data)
print("run time: %s seconds" % (time.time() - start_time))

# in mlnn.py:
def evaluate(self, test_data):
    """Return the number of test inputs for which the neural network
    outputs the incorrect result."""
    # vectorized version: classify the whole test set at once
    digits = np.argmax(self.feedforward(test_data[0]), axis=1)
    return np.count_nonzero(digits - test_data[1])
    # equivalent per-example version:
    # s = 0
    # for x, y in zip(test_data[0], test_data[1]):
    #     if np.argmax(self.feedforward(x)) != y:
    #         s = s + 1
    # return s

Backpropagation Observations

- The procedure is (relatively) efficient: all computations are local and use only the inputs and outputs of the current node.
- What is good enough? The outputs rarely reach the target values (0 or 1) exactly; typically, train until the outputs are within 0.1 of the target.
- How can we further improve the performance?

Hyperparameter Selection

- Learning rate: pick a small value, e.g. 0.1, as a starting point.
- Connectivity: typically fully connected between layers.
- Number of hidden nodes: too many nodes make learning slower and could overfit, although too many hidden nodes are usually OK if a reasonable stopping criterion is used; too few will underfit.
- Number of layers: 1 (common) or 2 hidden layers are usually sufficient for good results, and attenuation makes learning very slow with more; modern deep learning approaches, however, show significant improvement using many layers.
- Manually set hyperparameters: trial-and-error runs. Often sequential, or a binary search: find a good value for one hyperparameter with the others held constant, freeze it, find the next hyperparameter, etc.
- Random search is empirically the most consistently effective: typically each hyperparameter is chosen from a uniform distribution on a log scale for each trial (see the sketch after this slide).
- Hyperparameters could also be learned by the learning algorithm, in which case you must take care not to overfit the training data.

Performance of NN Training

Convergence of backpropagation: let gradient descent find a local minimum quickly. What affects convergence?
- NN size and training-set size
- Learning rate
- Initial weight values
- Derivative values

Avoiding overfitting: generalize well and work better on unseen cases. What affects overfitting?
- NN architecture
- Weight values: use weight decay through regularization
- Stopping earlier (early stopping)
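A minimal sketch of random hyperparameter search as described above, drawing each hyperparameter from a uniform distribution on a log scale; train_and_score and the search ranges are purely illustrative assumptions, not part of the course code.

import numpy as np

rng = np.random.default_rng(0)

def log_uniform(low, high):
    # sample uniformly in log space between low and high
    return float(np.exp(rng.uniform(np.log(low), np.log(high))))

def random_search(train_and_score, n_trials=20):
    best = None
    for _ in range(n_trials):
        params = {
            'eta': log_uniform(1e-3, 1.0),           # learning rate
            'batch_size': int(log_uniform(8, 256)),  # mini-batch size
            'hidden': int(log_uniform(16, 256)),     # hidden nodes
        }
        score = train_and_score(params)              # e.g. validation accuracy
        if best is None or score > best[0]:
            best = (score, params)
    return best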

Hidden Nodes

- Typically one fully connected hidden layer. A common initial number is 2n or 2 log n hidden nodes, where n is the number of inputs.
- In practice, train with a small number of hidden nodes, then keep doubling, etc., until there is no more significant improvement on the test set (see the sketch after this slide).
- All output and hidden nodes should have bias weights.
- Hidden nodes discover new higher-order features which are fed into the output layer.
[figure: diagram of input, hidden, and output nodes indexed i, j, k]

Local Minima

- SGD in general has more difficulty with simple tasks than with more complex tasks.
- Good news with MLPs: many dimensions make for many descent options.
- Local minima are more common with very simple/toy problems and very rare with larger problems and larger nets.
- Even if there are occasional local-minima problems, one could simply train multiple nets and pick the best.
- Some algorithms add noise to the updates to escape minima.
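A minimal sketch of the doubling strategy for choosing the number of hidden nodes described above; train_and_eval is an assumed helper that trains a network with the given hidden-layer size and returns its accuracy on a held-out set, and the thresholds are illustrative.

def pick_hidden_nodes(train_and_eval, start=8, max_nodes=1024, min_gain=0.005):
    # keep doubling the hidden-layer size until accuracy stops improving
    # by at least min_gain
    best_nodes, best_acc = start, train_and_eval(start)
    nodes = start * 2
    while nodes <= max_nodes:
        acc = train_and_eval(nodes)
        if acc < best_acc + min_gain:   # no significant improvement: stop
            break
        best_nodes, best_acc = nodes, acc
        nodes *= 2
    return best_nodes, best_acc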

Local Minima and Neural Networks

- A neural network can get stuck in local minima for small networks, but for most large networks (many weights), local minima rarely occur in practice.
- This is because with so many weight dimensions it is unlikely that we are at a minimum in every dimension simultaneously; there is almost always a way down.

Backpropagation Summary

- Excellent empirical results.
- Scaling is the pleasant surprise: local minima become very rare as problem and network complexity increase.
- The most common neural network approach; there are many other different styles of neural networks.
- User-defined parameters are usually handled by multiple experiments.
- Many variants:
  - Regression: typically linear output nodes, normal hidden nodes.
  - Adaptive parameters.
  - Many different learning-algorithm approaches: higher-order gradient descent (Newton, conjugate gradient, etc.).
  - Recurrent networks.
  - Deep networks!
- Still an active research area.