
INF 5860 Machine learning for image classification
Lecture: Neural nets: initialization, activations, normalizations and other practical details
Anne Solberg, March 10, 2017

Mandatory exercise
Available tonight, deadline 3.3.7.
- Implementing softmax classification
- Implementing a 2-layer net with backpropagation
- Seeing the effect of adding feature extraction using histogram of gradients
- Optimizing network parameters on validation data: how high accuracy can you get?

The coming weeks
- No weekly exercise set this week, as you should work on the mandatory exercise.
- Group sessions as normal.
- Next lecture: background on image convolution, filters, and filter banks/multiscale representations. Gives a useful background for convolutional nets.
- In two weeks: continue with useful tricks for making learning work, mainly chapter 8 in Deep Learning and http://cs231n.github.io/neural-networks-3/

Reading material
- http://cs231n.github.io/neural-networks-2/
- Deep Learning 6.2.2 and 6.3 on activation functions
- Deep Learning 8.7.1 on batch normalization

Today
- Activation functions
- Mini-batch gradient descent
- Data preprocessing
- Weight initialization
- Batch normalization
- Training, validation, and test sets
- Searching for the best parameters

The feedforward net problem
Data set X, y. The network computes $f(x_i; \Theta^{(1)}, \ldots, \Theta^{(L)}) = a^{(out)}$.
- Loss for each sample: $J_i$ (softmax or logistic one vs. all), the data loss.
- Regularization loss: $J_r$.
- Total loss: $J = \sum_i J_i + J_r$:
$$J(\Theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K} 1\{y_i = k\}\,\log\frac{e^{\theta_k^T x_i}}{\sum_{j=1}^{K} e^{\theta_j^T x_i}} + \frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}\left(\theta_{ji}^{(l)}\right)^2$$
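A minimal numpy sketch of this total loss for a single softmax layer; the function name and array shapes are illustrative, not the mandatory-exercise code:

    import numpy as np

    def softmax_loss(W, X, y, lam):
        """Total loss J = data loss (softmax cross-entropy) + L2 regularization.
        W: (D, K) weights, X: (m, D) samples, y: (m,) integer labels in [0, K)."""
        m = X.shape[0]
        scores = X.dot(W)                                 # (m, K)
        scores -= scores.max(axis=1, keepdims=True)       # numerical stability
        exp_scores = np.exp(scores)
        probs = exp_scores / exp_scores.sum(axis=1, keepdims=True)
        data_loss = -np.log(probs[np.arange(m), y]).mean()   # sum of J_i / m
        reg_loss = 0.5 * lam * np.sum(W * W)                  # J_r
        return data_loss + reg_loss

    W = 0.001 * np.random.randn(3072, 10)
    X = np.random.randn(5, 3072)
    y = np.array([0, 1, 2, 3, 4])
    print(softmax_loss(W, X, y, 0.1))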

The error surface of a linear neuron
- For a linear neuron with squared error, the error surface is a quadratic bowl. For neural net loss functions it is more complex, but can be approximated locally by a bowl.
- One of the challenges of gradient descent is how to make it converge as well as possible. In two weeks: other parameter update schemes like RMSProp, ADAM, etc.
- Which direction does gradient descent choose on an ellipse? (The gradient is perpendicular to the contour lines, so in general it does not point toward the minimum.)

Convergence of batch gradient descent
- Convergence is often slow.
- If the error surface locally looks like an elongated ellipse, the gradient is large in the direction where we only want a small change, and small in the direction where we want a big change.
- With a high learning rate the process will oscillate, and convergence is slow.
- If the gradient is computed from ALL training samples, there are ways to speed up the process, but for large networks it is normally better to use mini-batch learning.

Batch gradient descent
Batch gradient descent computes the loss summed over ALL training samples before doing a gradient descent update of $\Theta^{(l)}$ from the gradient $D^{(l)}$:
$$J(\Theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K} 1\{y_i = k\}\,\log\frac{e^{\theta_k^T x_i}}{\sum_{j=1}^{K} e^{\theta_j^T x_i}} + \frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}\left(\theta_{ji}^{(l)}\right)^2$$
This is slow if the training data set is large.

Mini-batch gradient descent
Randomly select a small batch, update, then repeat:

    for it in range(num_iterations):
        sample_ind = np.random.choice(num_train, batch_size)
        X_batch = X[sample_ind]
        y_batch = y[sample_ind]
        Ji, dgrad = loss(X_batch, y_batch, lam)   # loss and gradients for every layer (lam = regularization strength)
        for l in range(num_layers):
            Theta[l] -= learning_rate * dgrad[l]  # update all layers

- If batch_size = 1, this is called online learning, and sometimes Stochastic Gradient Descent (SGD). But the term SGD sometimes also means mini-batch gradient descent.
- Common batch-size values: 32, 64, 128.
- Mandatory exercise: implement mini-batch gradient descent.
- We get back to other parameter update schemes in two weeks.

Activation functions
Reading material: cs231n.github.io/neural-networks-1/ and Deep Learning 6.2.2 and 6.3.
This is an active area of research, and new functions are published every year. We will consider:
- Sigmoid activation
- Tanh activation
- ReLU activation
and mention recent alternatives like:
- Leaky ReLU
- Maxout
- ELU

Sigmoid activation
$$g(z) = \frac{1}{1 + e^{-z}}, \qquad g'(z) = g(z)\,(1 - g(z))$$
- Output between 0 and 1
- Historically popular
- Has some shortcomings

Activations
(Figure: a single sigmoid gate z = g(x); the upstream gradient ∂L/∂z is multiplied by the local gradient ∂z/∂x to give ∂L/∂x = (∂L/∂z)(∂z/∂x).)
What happens to the gradient when x = -10, x = 0, x = 10?
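As a numeric answer to the question above, a minimal sketch computing the local gradient of the sigmoid at these inputs:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    for x in (-10.0, 0.0, 10.0):
        s = sigmoid(x)
        local_grad = s * (1 - s)      # dz/dx; dL/dx = dL/dz * local_grad
        print(x, local_grad)          # ~4.5e-05, 0.25, ~4.5e-05

At x = ±10 the sigmoid is saturated and the local gradient is essentially zero, so almost no gradient flows back; at x = 0 the local gradient reaches its maximum of 0.25.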

Sigmoid activation
- Sigmoids kill gradients. Why? If the input is very small or very large, what happens to the local gradient?
- Not zero-centered: if all inputs are positive, then the gradients dJ/dθ will be either all positive or all negative, and the gradient updates tend to zig-zag.
- Somewhat expensive to compute.
- Currently: sigmoids are rarely used!

Tanh activation
$$g(z) = \tanh(z)$$
- Scaled version of the sigmoid
- Output between -1 and 1
- Zero-centered
- Saturates and kills gradients
- Preferred to the sigmoid due to the zero-centering

ReLU activation (Rectified Linear Unit)
$$\mathrm{ReLU}(z) = \max(z, 0), \qquad \mathrm{ReLU}'(z) = \begin{cases} 1 & z > 0 \\ 0 & \text{otherwise} \end{cases}$$
- Does not saturate
- Fast to compute
- Converges fast
- Drawback: units can sometimes die during training and become inactive. If this happens, the gradients through them will be 0 from that point on. Be careful with the learning rate.
- Currently: the best overall recommendation

ReLU
(Figure: a single ReLU gate z = max(x, 0); the upstream gradient ∂L/∂z is multiplied by the local gradient ∂z/∂x to give ∂L/∂x.)
What happens to the gradient when x = -10, x = 0, x = 10?

Leaky ReLU activation
$$\mathrm{LeakyReLU}(z) = \max(0.01\,z,\; z)$$
- Will not die
- Results are not consistent on whether Leaky ReLU is better than ReLU

ELU activation
Exponential Linear Unit:
$$\mathrm{ELU}(z) = \begin{cases} z & z > 0 \\ \alpha\,(\exp(z) - 1) & z \le 0 \end{cases}$$
- Will not die
- Benefits of ReLU, but more expensive to compute

Maxout activation
$$\mathrm{Maxout}(x) = \max(w_1^T x + b_1,\; w_2^T x + b_2)$$
- Here there are two weight vectors (and biases) for each node; the nonlinearity is the max over the two affine functions of the input.
- Can be seen as a generalization of ReLU/Leaky ReLU.
- Doubles the number of parameters per node compared to ReLU.
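For reference, a minimal numpy sketch of the alternatives above; the leak factor 0.01 and α = 1 follow the slides, and the maxout version takes the two affine outputs as its inputs:

    import numpy as np

    def relu(z):
        return np.maximum(z, 0)

    def leaky_relu(z):
        return np.maximum(0.01 * z, z)

    def elu(z, alpha=1.0):
        return np.where(z > 0, z, alpha * (np.exp(z) - 1))

    def maxout(z1, z2):              # z1 = W1 x + b1, z2 = W2 x + b2
        return np.maximum(z1, z2)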

Activations for output vs. hidden layers
For classification, where the loss is either one-vs.-all logistic or softmax, the output layer will have:
- Softmax activation for a softmax loss function
- Sigmoid activation for one-vs.-all logistic loss
When we use ReLU or another activation, this is normally for the hidden layers only.
Remember from logistic classification that
$$h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$$

Review: cost function for softmax neural networks
For a neural net with a softmax loss function, the output is
$$a^{(L)} = h_\Theta(x) = \begin{bmatrix} P(y=1 \mid x, \Theta) \\ P(y=2 \mid x, \Theta) \\ \vdots \\ P(y=K \mid x, \Theta) \end{bmatrix} = \frac{1}{\sum_{k=1}^{K} e^{\theta_k^T x}} \begin{bmatrix} e^{\theta_1^T x} \\ e^{\theta_2^T x} \\ \vdots \\ e^{\theta_K^T x} \end{bmatrix}$$
and the cost is
$$J(\Theta) = \underbrace{-\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K} 1\{y_i = k\}\,\log\frac{e^{\theta_k^T x_i}}{\sum_{j=1}^{K} e^{\theta_j^T x_i}}}_{\text{loss term}} + \underbrace{\frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}\left(\theta_{ji}^{(l)}\right)^2}_{\text{regularization term}}$$
where L is the number of layers and s_l is the number of units in layer l (without bias).

Question
You can check your loss function: set λ = 0. If you generate random data from n (say n = 3) classes with equal probability, what do you expect the initial loss to be? (See the cost function on the previous slide.)
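With λ = 0 and a classifier that is essentially guessing, each class gets probability about 1/n, so the expected loss is -log(1/n) = log n, roughly 1.1 for n = 3. A quick check:

    import numpy as np

    n = 3
    print(np.log(n))   # ~1.0986: the loss you should see at the first iterations

Compare with the training log later in the lecture, which starts at 2.303 ≈ log(10) for the 10 CIFAR-10 classes.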

Guidelines for activation functions
This is an active area of research and the recommendations might change. Currently:
- Use ReLU for the hidden layers, but monitor the fraction of dead units in the network.
- For the output layer: softmax is the most common choice.

Data preprocessing
Scaling of the features matters. If we have the samples
    x_i = (101, 101): y_i = 2
    x_i = (101, 99):  y_i = 0
the error surface is an elongated ellipse. Scaling to zero mean gives
    x_i = (1, 1):  y_i = 2
    x_i = (1, -1): y_i = 0
and a much rounder error surface. (Figure: error surface over (w_1, w_2) before and after scaling.)

Data preprocessing
Scaling of the features matters. If we have the samples
    x_i = (0.1, 10):  y_i = 2
    x_i = (0.1, -10): y_i = 0
the error surface is again an elongated ellipse. Normalizing each feature to unit variance gives
    x_i = (1, 1):  y_i = 2
    x_i = (1, -1): y_i = 0
and a rounder error surface. (Figure: error surface over (w_1, w_2) before and after normalization.)

Common normalization
Standardize the data to zero mean and unit variance: for each feature m, compute the mean μ_m and standard deviation σ_m over the training data set and let
$$x_m \leftarrow \frac{x_m - \mu_m}{\sigma_m}$$
Remark: STORE μ_m and σ_m, because new data/test data must get the same normalization.
Figure from http://cs231n.github.io/neural-networks-2/
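A minimal sketch of this standardization, with illustrative synthetic data; note that the training-set mean and std are stored and reused on the test data:

    import numpy as np

    X_train = np.random.randn(500, 10) * 5 + 3   # illustrative data
    X_test = np.random.randn(100, 10) * 5 + 3

    mean = X_train.mean(axis=0)                  # one mean per feature, from the TRAINING data only
    std = X_train.std(axis=0) + 1e-8             # small epsilon avoids division by zero
    X_train = (X_train - mean) / std
    X_test = (X_test - mean) / std               # reuse the stored training mean/std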

Consider whitening the data
If the features are highly correlated, a principal component transform can be considered to whiten the data.
Drawback: computationally heavy for image data, so it is normally not used there.
Figure from http://cs231n.github.io/neural-networks-2/

Common normalization for image data
Consider e.g. a CIFAR-10 image of shape (32, 32, 3). Two alternatives:
- Subtract the mean image: keep track of a mean image of shape (32, 32, 3).
- Subtract the mean of each channel (r, g, b): keep track of the channel mean, 3 values for RGB.
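A minimal sketch of the two alternatives, using random arrays shaped like CIFAR-10 images (a stand-in for the real data set):

    import numpy as np

    X = np.random.randint(0, 256, size=(1000, 32, 32, 3)).astype(np.float32)

    # Alternative 1: subtract the mean image (shape (32, 32, 3))
    mean_image = X.mean(axis=0)
    X_a = X - mean_image

    # Alternative 2: subtract the per-channel mean (3 values, one per RGB channel)
    mean_channel = X.mean(axis=(0, 1, 2))
    X_b = X - mean_channel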

Weight initialization
Avoid all-zero initialization! If all weights are equal, they will produce the same outputs and the same gradients, and undergo exactly the same parameter updates: they will all learn the same thing.
We break the symmetry by initializing the weights to small random numbers.
Initialization is more complicated for deep networks.

Weight initialization
Consider a neuron with n inputs and $z = \sum_{i=1}^{n} w_i x_i$ (n is called the fan-in). The variance of z is
$$\mathrm{Var}(z) = \mathrm{Var}\Big(\sum_i w_i x_i\Big)$$
and it can be shown (for zero-mean, independent w and x) that
$$\mathrm{Var}(z) = \big(n\,\mathrm{Var}(w)\big)\,\mathrm{Var}(x)$$
If we make sure that Var(w_i) = 1/n for all i, by scaling each weight w_i by $1/\sqrt{n}$, the variance of the output will be 1. This is called Xavier initialization; use it for tanh/sigmoid units.
For ReLU, He et al. propose w = np.random.randn(n)*sqrt(2.0/n) instead, because the max-operation alters the distribution.
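A small sketch checking the variance argument empirically; the fan-in of 500 and the sample count are arbitrary choices for illustration:

    import numpy as np

    n = 500                                       # fan-in
    x = np.random.randn(10000, n)                 # unit-variance inputs

    w_xavier = np.random.randn(n) / np.sqrt(n)    # Var(w_i) = 1/n (Xavier)
    print(x.dot(w_xavier).var())                  # ~1.0: unit variance preserved

    w_he = np.random.randn(n) * np.sqrt(2.0 / n)  # scaling recommended for ReLU layers
    z = np.maximum(x.dot(w_he), 0)                # ReLU activations
    print(np.mean(z ** 2))                        # ~1.0: the factor 2 compensates for the max-operation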

Batch normalization
So far, we have seen that normalizing the inputs and the initial weights towards zero mean and unit variance helps convergence. As training progresses, the mean and variance of each layer's inputs change, and at some point this makes convergence slow again. This is called (internal) covariate shift.
Batch normalization (Ioffe and Szegedy, https://arxiv.org/abs/1502.03167) counteracts this. After a fully connected layer (or convolutional layer), and before the nonlinearity, a batch normalization layer is inserted. This layer makes its input approximately Gaussian with zero mean and unit variance by applying
$$\hat{x}_k = \frac{x_k - \mu_k}{\sqrt{\mathrm{Var}[x_k]}}$$

μ_k and Var[x_k] are computed over each mini-batch during training. This normalization can limit the expressive power of the unit. To maintain it, we rescale to
$$y_k = \gamma_k\,\hat{x}_k + \beta_k$$
What? Does this help? Yes, because the network can learn γ_k and β_k during backpropagation, and it learns them faster: achieving the same scaling without the new parameters would have to go through the input weights, which is much more complicated.
Batch normalization significantly speeds up gradient descent, and even improves the accuracy. USE IT!

Batch normalization: training

Batch normalization: test time
At test time, the mean/std are those computed over the ENTIRE TRAINING set, not the mini-batches used during backprop (you should store these).
Remark: in practice, use a running average over the mini-batch statistics to maintain them during training.
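A minimal sketch of a batch-normalization layer's forward pass, with running averages maintained during training and used at test time as the remark above suggests; the interface and the momentum value are illustrative, not the lecture's code:

    import numpy as np

    def batchnorm_forward(x, gamma, beta, running_mean, running_var,
                          train=True, momentum=0.9, eps=1e-5):
        """Batch norm over a mini-batch x of shape (N, D). Returns the output
        and the updated running statistics."""
        if train:
            mu = x.mean(axis=0)
            var = x.var(axis=0)
            running_mean = momentum * running_mean + (1 - momentum) * mu
            running_var = momentum * running_var + (1 - momentum) * var
        else:
            mu, var = running_mean, running_var          # stored statistics at test time
        x_hat = (x - mu) / np.sqrt(var + eps)            # zero mean, unit variance
        out = gamma * x_hat + beta                       # learned rescaling restores expressive power
        return out, running_mean, running_var

    # usage sketch
    D = 100
    gamma, beta = np.ones(D), np.zeros(D)
    rm, rv = np.zeros(D), np.ones(D)
    out, rm, rv = batchnorm_forward(np.random.randn(64, D), gamma, beta, rm, rv, train=True)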

Optimizing hyperparameters
- Training data set: the part of the data set used in backpropagation to estimate the weights.
- Validation data set (or cross-validation): the part of the data set used to find the best values of the hyperparameters, e.g. the number of nodes and the learning rate.
- Test data: used ONCE, after fitting all parameters, to estimate the final error rate.

Search strategy: coarse-to-fine
- First stage: run only a few epochs (an epoch is one pass through all training samples).
- Second stage: longer runs with a finer search.
- Parameters like the learning rate act multiplicatively, so search in log-space.

Coarse search

    input_size = 32 * 32 * 3
    hidden_size = 50
    num_classes = 10
    best_acc = -1

    for hidden_size in [50, 100, 150]:
        for learning_rate in [5e-2, 1e-3, 1e-4]:
            for reg in [0.3, 0.4, 0.5, 0.6]:
                # Train the network
                net = TwoLayerNet(input_size, hidden_size, num_classes)
                stats = net.train(X_train, y_train, X_val, y_val,
                                  num_iters=1000, batch_size=200,
                                  learning_rate=learning_rate, learning_rate_decay=0.95,
                                  reg=reg, verbose=True)
                # Predict on the validation set
                val_acc = (net.predict(X_val) == y_val).mean()
                print('Hidden size:', hidden_size, 'Learning rate:', learning_rate, 'Reg:', reg)
                print('Validation accuracy:', val_acc)
                if best_acc < val_acc:
                    best_acc = val_acc
                    best_net = net

Consider a random grid
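A sketch of random hyperparameter search with the learning rate and regularization sampled in log-space, as recommended above; the ranges are illustrative:

    import numpy as np

    for trial in range(20):
        learning_rate = 10 ** np.random.uniform(-5, -2)   # log-uniform between 1e-5 and 1e-2
        reg = 10 ** np.random.uniform(-3, 0)               # log-uniform between 1e-3 and 1
        hidden_size = np.random.choice([50, 100, 150])
        # train a network with these values and keep the one with the best validation accuracy
        print(trial, learning_rate, reg, hidden_size)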

Monitor the loss function

Behaviour of loss function

Debug the code
- Take a small subset of the training data.
- Verify that you can overfit this subset, e.g. without regularization.
- Expect a very small loss and a training accuracy of 1.00.
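A minimal sketch of this check, assuming the TwoLayerNet class and the X_train/y_train arrays from the coarse-search snippet above; the subset size and hyperparameters are illustrative:

    num_small = 20
    X_small, y_small = X_train[:num_small], y_train[:num_small]

    net = TwoLayerNet(32 * 32 * 3, 50, 10)
    net.train(X_small, y_small, X_small, y_small,
              num_iters=500, batch_size=num_small,
              learning_rate=1e-3, reg=0.0, verbose=True)   # no regularization
    print((net.predict(X_small) == y_small).mean())        # should reach (close to) 1.00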

Study outputs during training

    iteration 0 / 1000: loss 2.303033
    iteration 100 / 1000: loss nan
    iteration 200 / 1000: loss nan
    iteration 300 / 1000: loss nan
    iteration 400 / 1000: loss nan
    iteration 500 / 1000: loss nan
    iteration 600 / 1000: loss nan
    iteration 700 / 1000: loss nan
    iteration 800 / 1000: loss nan
    iteration 900 / 1000: loss nan
    Hidden size: 50 Learning rate: 0.05 Reg 0.6 Validation accuracy: 0.087

    iteration 0 / 1000: loss 2.3028
    iteration 100 / 1000: loss 1.987349
    iteration 200 / 1000: loss 1.809840
    iteration 300 / 1000: loss 1.734420
    iteration 400 / 1000: loss 1.65647
    iteration 500 / 1000: loss 1.596564
    iteration 600 / 1000: loss 1.730853
    iteration 700 / 1000: loss 1.55625
    iteration 800 / 1000: loss 1.427454
    iteration 900 / 1000: loss 1.50289

In the first run the loss blows up to nan (the learning rate of 0.05 is too high); in the second run the loss decreases normally.

Track ratio of weight updates/weight magnitudes
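A sketch of this heuristic for one weight matrix and its gradient (the arrays here are stand-ins); a ratio around 1e-3 is a commonly used rule of thumb:

    import numpy as np

    W = 0.01 * np.random.randn(100, 10)     # stand-in for one layer's weights
    dW = np.random.randn(100, 10)           # stand-in for its gradient
    learning_rate = 1e-3

    update = -learning_rate * dW
    ratio = np.linalg.norm(update) / np.linalg.norm(W)
    print(ratio)                            # aim for roughly 1e-3; much higher or lower suggests a bad learning rate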

To be continued.