Dr Michel F. Valstar http://cs.nott.ac.uk/~mfv/ Machine Learning MGS Lecture 3: Deep Learning
WHAT IS DEEP LEARNING? Shallow network: only one hidden layer. Deep network: simplest case is two hidden layers. What we really mean by a deep network is many such layers (each rectangle in the figure is a layer).
WHAT IS DEEP LEARNING? Definition: hierarchical organisation with more than one (non-linear) hidden layer between the input and the output variables; the output of one layer is the input of the next layer. Methods: (Deep) Neural Networks, Convolutional Neural Networks, Restricted Boltzmann Machines / Deep Belief Networks, Recurrent Neural Networks.
WHY DEEP LEARNING? Hierarchy is a powerful and compact representation. Deep: (x1 + x2 + a)(x3 + x4 + b) = x1x3 + x1x4 + bx1 + x2x3 + x2x4 + bx2 + ax3 + ax4 + ab. Shallow: x1 + x2 + x3 + x4 + a. A single layer can only form simple sums like the shallow example; composing two such layers yields the compact product on the left, which a shallow model could only match by representing every term of the expansion on the right.
WHY DEEP LEARNING? Sharing lower-level representations to build up an object. Natural data has this hierarchical organisation.
CONVOLUTIONAL NEURAL NETWORKS
Sliding Window Convolution: convolve one filter over the whole image to produce a response map.
Convolutions in a NN: the filter weights are the NN weights, not fixed but learnt from data!! The value of a hidden neuron (its response) is Sigmoid or ReLU applied to the weighted sum of the pixel values of the input layer under the filter.
Convolutions in a NN: using a 3 x 3 convolution filter, jumping one pixel at a time (stride = 1). Input image: dim(Im) = m x n. Response map: dim(Response) = (m-2) x (n-2).
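As a rough illustration (not from the lecture; the function and variable names are my own), here is a minimal numpy sketch of a "valid" sliding-window convolution: a 3 x 3 filter moved one pixel at a time over an m x n image gives an (m-2) x (n-2) response map.

import numpy as np

def conv2d_valid(image, kernel, stride=1):
    """Slide `kernel` over `image` (no padding) and return the response map."""
    m, n = image.shape
    k = kernel.shape[0]                      # assume a square k x k filter
    out_h = (m - k) // stride + 1            # (m-2) x (n-2) when k=3, stride=1
    out_w = (n - k) // stride + 1
    response = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i*stride:i*stride+k, j*stride:j*stride+k]
            response[i, j] = np.sum(patch * kernel)   # dot product of patch and filter
    return response

im = np.random.rand(28, 28)                  # m x n = 28 x 28
w = np.random.randn(3, 3)                    # 3 x 3 filter (learnt weights in a CNN)
print(conv2d_valid(im, w).shape)             # (26, 26) = (m-2) x (n-2)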
Link to linear classification: the simplest NN is a convolution followed by a decision (thresholding). At test time it behaves like logistic regression.
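A hedged sketch of that link, assuming a sigmoid squashing of the filter response (the names patch, w and b are illustrative, not from the slides): at one image location, convolution plus thresholding is a logistic-regression decision on the pixel values under the filter.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A single 3x3 filter applied to one 3x3 patch, plus a threshold, acts like
# a logistic-regression decision on the 9 pixel values.
patch = np.random.rand(3, 3)                 # pixel values (the "input layer")
w = np.random.randn(3, 3)                    # learnt filter weights
b = 0.0                                      # bias
response = sigmoid(np.sum(w * patch) + b)    # convolution output at one location
decision = response > 0.5                    # thresholding = the decision step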
Multiple Convolutions: a grayscale image (1 channel) convolved with n filters gives n response maps, i.e. an n-channel image. You can go deep!
Pooling: spatial sub-sampling.
Conv. + Max-Pooling: convolve the input to get a response map, then max-pool it.
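To make the pooling step concrete, a small numpy sketch (my own illustration, not the lecture's code) of 2 x 2 max-pooling applied to a response map:

import numpy as np

def max_pool(response, size=2):
    """2x2 max-pooling with stride 2: keep the largest response in each block."""
    m, n = response.shape
    m, n = m - m % size, n - n % size            # drop trailing rows/cols if needed
    blocks = response[:m, :n].reshape(m // size, size, n // size, size)
    return blocks.max(axis=(1, 3))

r = np.random.rand(26, 26)                       # a response map from the conv layer
print(max_pool(r).shape)                         # (13, 13): spatially sub-sampled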
Other Aspects:
- Stride: information in contiguous patches is highly redundant; stride = 2 means convolutions are done every 2 pixels.
- Fully connected layers: the last 1 to 3 layers of the network, like a standard NN.
- Decision layer: the last layer makes a decision (typically a logistic regressor).
What does a CNN look like? AlexNet: 7 hidden layers, 650,000 neurons, 60,000,000 parameters. Trained on 2 GPUs for a week (the final CNN). A. Krizhevsky, I. Sutskever, G. Hinton, ImageNet Classification with Deep Convolutional Neural Networks, NIPS 2012.
A more modern one: GoogLeNet
BREAK
Training Convolutional Neural Networks
Training formulation: the loss combines a data term, measuring how well the model fits the training data (e.g. an L2-norm over the k outputs, or (multinomial) logistic regression; anything differentiable works), with a regularisation term that penalises complex solutions to avoid overfitting. The loss is optimised through gradient descent (the backpropagation algorithm).
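A minimal sketch of such a training loss, assuming a squared-error (L2-norm) data term over the k outputs plus an L2 weight penalty; the function name loss and the weighting lam are illustrative choices, not from the lecture.

import numpy as np

def loss(predictions, targets, weights, lam=1e-4):
    """Data-fit term (L2-norm over the k outputs) + regularisation term."""
    data_fit = 0.5 * np.sum((predictions - targets) ** 2)       # fit to training data
    penalty = 0.5 * lam * sum(np.sum(w ** 2) for w in weights)  # penalise complex solutions
    return data_fit + penalty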
Stochastic Gradient Descent (SGD): the training set is often too large to even keep in memory!! Instead of minimising the full loss with gradient descent, SGD takes a step using the gradient of the loss on a single example; in practice you can do this with mini-batches.
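A possible mini-batch SGD loop in numpy (illustrative only; grad_fn stands for whatever routine computes the loss gradient on a batch, e.g. backpropagation):

import numpy as np

def sgd(w, X, y, grad_fn, lr=0.01, batch_size=32, epochs=10):
    """Mini-batch SGD: estimate the gradient on a small random batch each step."""
    n = X.shape[0]
    for _ in range(epochs):
        order = np.random.permutation(n)          # shuffle so batches are random
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            g = grad_fn(w, X[idx], y[idx])        # gradient on the mini-batch only
            w = w - lr * g                        # step downhill
    return w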
Practicalities of GD: as always, take care of overfitting. When using SGD, monitor a proxy of the full loss (e.g. a running average over recent mini-batches). You will always make mistakes computing derivatives; use the empirical (finite-difference) approximation of a derivative to check them.
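One way to do that check, sketched in numpy under the assumption of a central (two-sided) finite difference; loss_fn and analytic_grad are placeholders for your own loss and backprop output.

import numpy as np

def gradient_check(loss_fn, w, analytic_grad, eps=1e-5):
    """Compare backprop gradients with a central-difference approximation."""
    numeric = np.zeros_like(w)
    for i in range(w.size):
        e = np.zeros_like(w)
        e.flat[i] = eps
        numeric.flat[i] = (loss_fn(w + e) - loss_fn(w - e)) / (2 * eps)
    # Relative error should be tiny (e.g. < 1e-6) if the derivatives are right.
    return np.max(np.abs(numeric - analytic_grad) /
                  (np.abs(numeric) + np.abs(analytic_grad) + 1e-12))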
The challenges of training:
- Underfitting: the network is trained only to a sub-optimal configuration. Remedies: variants of gradient descent, sheer computational power, ReLU.
- Overfitting: the network does not generalise well to new data. Remedies: standard regularisation, pre-trained models, drop-out.
Backpropagation and vanishing gradient: backprop can be derived simply using the chain rule, propagating the error we make at the output (and would like NOT to make) back through the layers. The problem: the sigmoid has a strong gradient only for inputs roughly between -1 and 1 and is almost flat elsewhere, so gradients shrink as they pass back through many layers. The Rectified Linear Unit (ReLU) avoids this.
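A small numeric illustration of why ReLU helps (my own example, not from the slides): the sigmoid derivative is at most 0.25 and collapses towards 0 for large inputs, while the ReLU derivative stays at 1 whenever the unit is active.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1 - s)            # at most 0.25, and near 0 once |z| is large

def relu_grad(z):
    return (z > 0).astype(float)  # exactly 1 for all positive inputs

z = np.array([-6.0, -1.0, 0.5, 6.0])
print(sigmoid_grad(z))   # roughly [0.002, 0.20, 0.24, 0.002]: vanishes for large |z|
print(relu_grad(z))      # [0, 0, 1, 1]: stays 1 when the unit is active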
Making things work. Problem: small gradients and large flat valleys. Solutions:
- variants of gradient descent
- powerful GPUs for massive convolution parallelisation (CUDA)
- patience: training from scratch takes weeks/months
- pre-training
Layer-wise greedy pre-training:
- Often unsupervised: pick random images and try to represent them.
- Greedy learning: each layer is optimised independently of the others.
- Large amount of data: since it does not need labels, pick a lot from everywhere.
Then fine-tune with supervised data specific to the problem at hand, training the whole network at once.
The challenges of training:
- Underfitting: the network is trained only to a sub-optimal configuration. Remedies: variants of gradient descent, sheer computational power, ReLU.
- Overfitting: the network does not generalise well to new data. Remedies: standard regularisation, pre-trained models, drop-out.
Use pre-trained networks. The most modern approach (last 1-2 years): take a pre-trained very deep CNN and fine-tune it to your problem! M. Oquab, L. Bottou, I. Laptev, J. Sivic, Learning and Transferring Mid-Level Image Representations using Convolutional Neural Networks, CVPR 2014.
Regularisation: penalise complex solutions to avoid overfitting. AlexNet has 60,000,000 parameters!! and it is a simple CNN. L2 regularisation; L1 regularisation (enforces sparsity).
Regularisation: Dropout. Idea: cripple the neural network by removing hidden units stochastically. Each hidden unit is set to 0 with probability 0.5, so hidden units cannot co-adapt to other units and must be more generally useful. You can use different dropout probabilities, but 0.5 works well in practice.
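A minimal sketch of how dropout is typically applied at training time (my own illustration; the rescaling by 1/(1-p), known as inverted dropout, is a common implementation choice rather than something stated on the slide):

import numpy as np

def dropout(activations, p=0.5):
    """Set each hidden unit to 0 with probability p (training time only)."""
    mask = (np.random.rand(*activations.shape) >= p).astype(float)
    # Scale the survivors so the expected activation is unchanged ("inverted dropout",
    # an assumed implementation detail); at test time, use all units with no mask.
    return activations * mask / (1.0 - p)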
Publicly-available packages
How good does it get? http://parkorbird.flickr.com/ http://demo.caffe.berkeleyvision.org/
Deep learning = CNN? (Sneak Peek)
Generative Models: you can reconstruct data of a given class. Examples: Auto-Encoders, Restricted Boltzmann Machines.
Temporal data Recurrent Neural Networks (e.g. LSTM-NN)
Visualise high-dimensional data
All kinds of problems
Even audio can be done with CNNs
3D faces from 2D images Created by Aaron Jackson in our own CVL http://cvl-demos.cs.nott.ac.uk/vrn/