Machine Learning: The Breadth of ML
Neural Networks & Deep Learning

Marc Toussaint, University of Stuttgart
Duy Nguyen-Tuong, Bosch Center for Artificial Intelligence
Summer 2017

Neural Networks

Consider a regression problem with input $x \in \mathbb{R}^d$ and output $y \in \mathbb{R}$.

- Linear function ($\beta \in \mathbb{R}^d$): $f(x) = \beta^\top x$
- 1-layer neural network function ($W_0 \in \mathbb{R}^{h_1 \times d}$): $f(x) = \beta^\top \sigma(W_0 x)$
- 2-layer neural network function: $f(x) = \beta^\top \sigma(W_1 \sigma(W_0 x))$

Neural networks are a special function model $y = f(x, w)$, i.e. a special way to parameterize non-linear functions.
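The three function models above can be written down directly. The following is a minimal sketch, assuming a logistic sigmoid for $\sigma$ and NumPy arrays for the weights; the layer sizes and random values are hypothetical and only serve to make the example runnable.

```python
import numpy as np

def sigmoid(z):
    """Logistic activation, applied element-wise (assumed choice of sigma)."""
    return 1.0 / (1.0 + np.exp(-z))

def linear(x, beta):
    """f(x) = beta^T x"""
    return beta @ x

def one_layer(x, W0, beta):
    """f(x) = beta^T sigma(W0 x)"""
    return beta @ sigmoid(W0 @ x)

def two_layer(x, W0, W1, beta):
    """f(x) = beta^T sigma(W1 sigma(W0 x))"""
    return beta @ sigmoid(W1 @ sigmoid(W0 @ x))

# Hypothetical sizes: input d = 3, hidden layers h1 = 5 and h2 = 4.
rng = np.random.default_rng(0)
d, h1, h2 = 3, 5, 4
x = rng.normal(size=d)
W0 = rng.normal(size=(h1, d))
W1 = rng.normal(size=(h2, h1))
y1 = one_layer(x, W0, rng.normal(size=h1))      # beta has h1 entries here
y2 = two_layer(x, W0, W1, rng.normal(size=h2))  # beta has h2 entries here
```

Note that the output weights $\beta$ must match the width of the last hidden layer, which is why the two calls use vectors of different length.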

Neural Networks: Training

How do we determine the weights $W_{l,ij}$ in layer $l$ for node $j$, given the samples $\{x_i, y_i\}$?

Idea:
- Initialize the weights $W_{l,j}$ for each layer $l$ and each node $j$.
- First, propagate $x_i$ through the network, bottom up (forward propagation).
- Then compute the error between the prediction and the ground truth $y_i$, given an error function $\ell$.
- Subsequently, propagate the error backwards through the network and recursively compute the error gradients for each $W_{l,ij}$ (back-propagation).
- Update the weights $W_{l,j}$ using the computed error gradients for each sample $\{x_i, y_i\}$.

Notation:
- Consider $L$ hidden layers, each $h_l$-dimensional.
- Let $z_l = W_{l-1} x_{l-1}$ be the inputs to all neurons in layer $l$.
- Let $x_l = \sigma(z_l)$ be the activations of all neurons in layer $l$.
- Redundantly, we denote by $x_0 \equiv x$ the activation of the input layer, and by $\phi(x) \equiv x_L$ the activation of the last hidden layer.

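Neural Networks: Basic Equations

Forward propagation: an $L$-layer NN recursively computes, for $l = 1, \dots, L$,

$$z_l = W_{l-1} x_{l-1}, \qquad x_l = \sigma(z_l),$$

and then computes the output $f \equiv z_{L+1} = W_L x_L$.

Backpropagation: given some loss $\ell(f)$, let $\delta_{L+1} \triangleq \frac{\partial \ell}{\partial f}$. We can recursively compute the loss gradient w.r.t. the inputs of layer $l$, for $l = L, \dots, 1$:

$$\delta_l \triangleq \frac{d\ell}{dz_l} = \frac{d\ell}{dz_{l+1}} \frac{\partial z_{l+1}}{\partial x_l} \frac{\partial x_l}{\partial z_l} = [\delta_{l+1} W_l] \circ [x_l \circ (1 - x_l)]^\top,$$

where $\circ$ is the element-wise product. The factor $x_l \circ (1 - x_l)$ is the derivative of the logistic sigmoid, so this form assumes a logistic $\sigma$.

The gradient w.r.t. the weights is

$$\frac{d\ell}{dW_{l,ij}} = \frac{d\ell}{dz_{l+1,i}} \frac{\partial z_{l+1,i}}{\partial W_{l,ij}} = \delta_{l+1,i}\, x_{l,j} \qquad \text{or} \qquad \frac{d\ell}{dW_l} = \delta_{l+1}^\top x_l^\top.$$

Weight update: many different weight updates are possible given the gradients $\frac{d\ell}{dW_l}$, for example the delta rule:

$$W_l^{new} = W_l^{old} + \Delta W_l = W_l^{old} - \eta \frac{d\ell}{dW_l}.$$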
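The forward and backward recursions can be sketched in a few lines of NumPy. This is an illustrative sketch, not the lecture's reference implementation: it assumes a logistic sigmoid, stores the row vectors $\delta_l$ as 1-D arrays, and uses hypothetical layer sizes and a squared loss on a single sample. A finite-difference check of one weight is included to show how the gradient can be validated.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, Ws):
    """Return the activations [x_0, ..., x_L] and the output f = W_L x_L."""
    xs = [x]
    for W in Ws[:-1]:                    # hidden layers l = 1..L
        xs.append(sigmoid(W @ xs[-1]))   # x_l = sigma(W_{l-1} x_{l-1})
    f = Ws[-1] @ xs[-1]                  # linear output, f = z_{L+1}
    return xs, f

def backward(xs, Ws, delta_out):
    """Return dl/dW_l for every l, given delta_{L+1} = dl/df."""
    grads = [None] * len(Ws)
    delta = delta_out                    # delta_{l+1}, starting at l = L
    for l in range(len(Ws) - 1, -1, -1):
        grads[l] = np.outer(delta, xs[l])                    # delta_{l+1}^T x_l^T
        if l > 0:
            delta = (delta @ Ws[l]) * xs[l] * (1.0 - xs[l])  # delta_l
    return grads

def delta_rule(Ws, grads, eta):
    """W_new = W_old - eta * dl/dW"""
    return [W - eta * g for W, g in zip(Ws, grads)]

# Hypothetical network: d = 3, two hidden layers of width 4, scalar output.
rng = np.random.default_rng(0)
sizes = [3, 4, 4, 1]
Ws = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
x, y = rng.normal(size=3), np.array([0.5])

def loss(Ws):
    _, f = forward(x, Ws)
    return float(np.sum((f - y) ** 2))   # squared error on one sample

xs, f = forward(x, Ws)
grads = backward(xs, Ws, 2.0 * (f - y))  # delta_{L+1} for the squared error

# Finite-difference check of a single weight W_0[1, 2].
eps = 1e-6
Ws_pert = [W.copy() for W in Ws]
Ws_pert[0][1, 2] += eps
numeric = (loss(Ws_pert) - loss(Ws)) / eps
```

Comparing `numeric` with `grads[0][1, 2]` is a cheap way to catch index or transpose mistakes before training.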

Neural Networks: Regression

In the standard regression case, $y \in \mathbb{R}$, we typically assume a squared error loss $\ell(f) = \sum_i (f(x_i, w) - y_i)^2$. We have $\delta_{L+1} = \sum_i 2 (f(x_i, w) - y_i)^\top$.

Regularization: add an $L_2$ or $L_1$ regularization. First compute all gradients as before, then add $\lambda W_{l,ij}$ (for $L_2$) or $\lambda\, \mathrm{sign}(W_{l,ij})$ (for $L_1$) to the gradient. Historically, this is called weight decay, as the additional gradient leads to a step that decays the weights.

The optimal output weights are as for standard regression:

$$W_L^\top = (X^\top X + \lambda I)^{-1} X^\top y,$$

where $X$ is the data matrix of activations $x_L \equiv \phi(x)$.
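The closed-form solution for the output weights can be illustrated as follows. This is a minimal sketch under stated assumptions: the matrix `X` stands in for the last-layer activations $\phi(x_i)$ and is filled with synthetic values, the targets are noise-free, and `w_true` is a hypothetical weight vector used only to check the result.

```python
import numpy as np

def optimal_output_weights(X, y, lam):
    """Solve (X^T X + lam I) w = X^T y for the output weights."""
    h = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(h), X.T @ y)

rng = np.random.default_rng(0)
n, h = 50, 4                          # hypothetical: 50 samples, h_L = 4
X = rng.uniform(size=(n, h))          # stand-in for activations phi(x_i)
w_true = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ w_true                        # noise-free synthetic targets
w_hat = optimal_output_weights(X, y, lam=1e-8)
```

Solving the linear system with `np.linalg.solve` avoids forming the matrix inverse explicitly, which is usually the more stable choice.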

Neural Networks: Classification

Consider the multi-class case $y \in \{1, \dots, M\}$. Then we have $M$ output neurons to represent the discriminative function

$$f(x, y, w) = (W_L x_L)_y, \qquad W_L \in \mathbb{R}^{M \times h_L}.$$

- Choosing the neg-log-likelihood objective gives logistic regression.
- Choosing the hinge loss objective gives "NN + SVM".

For given $x$, let $y^*$ be the correct class. The one-vs-all hinge loss is

$$\sum_{y \neq y^*} \max\{0,\; 1 - (f_{y^*} - f_y)\}.$$

- For an output neuron $y \neq y^*$ this implies a gradient $\delta_y = [f_{y^*} < f_y + 1]$.
- For the output neuron $y^*$ this implies a gradient $\delta_{y^*} = -\sum_{y \neq y^*} [f_{y^*} < f_y + 1]$.

Only data points inside the margin induce an error (and a gradient). This is also called the Perceptron Algorithm.
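The hinge loss and its output gradients can be computed together. The sketch below assumes 0-based class indices and a hypothetical output vector `f` for $M = 3$ classes; the function name is illustrative.

```python
import numpy as np

def hinge_loss_and_delta(f, y_star):
    """One-vs-all hinge loss and its gradient w.r.t. the outputs f."""
    margins = 1.0 - (f[y_star] - f)     # 1 - (f_{y*} - f_y) for every y
    margins[y_star] = 0.0               # the sum excludes y = y*
    active = margins > 0                # indicator [f_{y*} < f_y + 1]
    loss = margins[active].sum()
    delta = active.astype(float)        # delta_y for y != y*
    delta[y_star] = -active.sum()       # delta_{y*} = -(number of violations)
    return loss, delta

f = np.array([2.0, 1.5, -1.0])          # hypothetical outputs, M = 3
loss, delta = hinge_loss_and_delta(f, y_star=0)
```

Here only class 1 lies inside the margin of the correct class 0, so it is the only wrong class that receives a gradient.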

Neural Networks: Dimensionality Reduction

- Dimensionality reduction can be performed with autoencoders.
- An autoencoder is typically a NN of the type [figure: network architecture], which is trained to reproduce the input: $\min \sum_i \| y(x_i) - x_i \|^2$
- The hidden layer ("bottleneck") needs to find a good representation/compression.
- Similar to the PCA objective, but nonlinear.
- Stacking autoencoders (deep autoencoders): [figure]

Remarks

- A NN is usually trained based on the gradient $\nabla_{W_l} f(x)$. (The output weights can be optimized analytically, as for linear regression.)
- NNs are a very powerful function class. (By tweaking/training the weights one can approximate any non-linear function.)

BUT:
- Are there any guarantees on generalization?
- What happens with the gradients when the NN is very deep?
- How can NNs be used to learn intelligent (autonomous) behavior (e.g. autonomous learning, reinforcement learning, robotics)?
- Is there any insight into what the neurons will actually represent (e.g. discovering/developing abstractions, hierarchies)?

Deep Learning is a revival of neural networks and was mainly driven by the last question, i.e. learning useful representations.

Deep Learning: Basic Concept

- Idea: learn hierarchical features from data, from simple features to complex features.
- Deep Learning can also be performed with other frameworks, e.g. Deep Gaussian Processes.

So what changed compared to classical NNs?
- Algorithmic advancements, e.g. dropout, ReLUs, pre-training
- More general models, e.g. Deep GPs, Deep Kernel Machines, ...
- More computational power (e.g. GPUs)
- Large data sets

Deep Learning is useful for very high-dimensional problems with many labeled or unlabeled samples (e.g. vision and speech tasks).

Typical Process to Train a Deep Network

- Pre-process the data, e.g. ZCA, distortions
- Network type, e.g. convolutional network
- Activation function, e.g. ReLU
- Regularization, e.g. dropout
- Network training, e.g. stochastic gradient descent with ADADELTA
- Combining multiple models, e.g. an ensemble of networks
- Optimizing high-level parameters, e.g. with Bayesian optimization

Many heuristics are involved when training deep networks.

Example: 2-D Convolutional Network

[Figure: 2-D convolutional network]

Open parameters:
- Number of layers
- Number of feature maps per convolution
- Filter size for each convolution
- Subsampling size
- Number of hidden units

Pre-Processing Steps

1. Removing means from images
   - Subtract the mean from the images
   - Standardize the data
2. Distortions of images
   - Add distorted images to the training data
   - Randomly translate and rotate images
3. Zero Component Analysis (ZCA)
   - Perform the transformation $\tilde{x} = P^\top \Lambda^{-1} P x$, where $\Lambda = \mathrm{diag}(\sqrt{\sigma_1 + \epsilon}, \sqrt{\sigma_2 + \epsilon}, \dots, \sqrt{\sigma_n + \epsilon})$
   - In practice, $\epsilon$ has the effect of strengthening the edges
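Mean removal and the ZCA transformation can be sketched as below. The sketch rests on assumptions the slide does not spell out: $\sigma_i$ are taken to be the eigenvalues of the data covariance, the rows of $P$ its eigenvectors, and $\Lambda$ the diagonal matrix of $\sqrt{\sigma_i + \epsilon}$. The data and the mixing matrix `A` are synthetic.

```python
import numpy as np

def zca_whiten(X, eps=1e-5):
    """ZCA-whiten the rows of X (one sample per row)."""
    Xc = X - X.mean(axis=0)                    # step 1: remove the mean
    cov = Xc.T @ Xc / Xc.shape[0]              # data covariance
    sigma, U = np.linalg.eigh(cov)             # eigenvalues, eigenvectors (columns of U)
    P = U.T                                    # rows of P are eigenvectors
    Lam_inv = np.diag(1.0 / np.sqrt(sigma + eps))
    W = P.T @ Lam_inv @ P                      # x_tilde = P^T Lam^{-1} P x
    return Xc @ W.T

# Synthetic correlated data (hypothetical mixing matrix A).
rng = np.random.default_rng(0)
A = np.array([[2.0, 0.5, 0.0],
              [0.0, 1.0, 0.3],
              [0.0, 0.0, 0.5]])
X = rng.normal(size=(500, 3)) @ A
Xw = zca_whiten(X)
cov_w = Xw.T @ Xw / Xw.shape[0]                # should be close to the identity
```

Unlike plain PCA whitening, the final rotation back by $P^\top$ keeps the whitened data in the original coordinate system, which is why whitened images still look like images.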

Activation Function: Rectified Linear Units

New activation function: rectified linear units (ReLUs)

- ReLU: $f(z) = \max(0, z)$
- non-saturating
- sparse activation
- helps against vanishing gradients

Relation to logistic activations:

$$\sum_{n=1}^{\infty} \mathrm{logistic}(z + 0.5 - n) \approx \log(1 + e^z) \approx \max(0, z)$$
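The chain of approximations between stacked logistic units, the softplus function $\log(1 + e^z)$, and the ReLU can be checked numerically. This is a small sketch; the truncation of the infinite sum at 50 terms and the evaluation grid are arbitrary choices made for illustration.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def stacked_logistic(z, n_terms=50):
    """Truncated sum over n = 1..n_terms of logistic(z + 0.5 - n)."""
    return sum(logistic(z + 0.5 - n) for n in range(1, n_terms + 1))

def softplus(z):
    return np.log1p(np.exp(z))           # log(1 + e^z)

def relu(z):
    return np.maximum(0.0, z)

z = np.linspace(-5.0, 5.0, 11)           # grid that includes z = 0
gap_sum_softplus = np.max(np.abs(stacked_logistic(z) - softplus(z)))
gap_softplus_relu = np.max(np.abs(softplus(z) - relu(z)))
```

The largest gap between softplus and ReLU occurs at $z = 0$, where it equals $\log 2$; away from zero the two functions agree closely.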

Deep Networks and Overfitting

Overfitting: good training performance, but bad test performance. Deep models are very sensitive to overfitting due to their complex model structures.

How to avoid overfitting:
- Weight decay: penalize $\|W\|_1$ or $\|W\|_2$
- Early stopping: recognize overfitting on a validation data set
- Pre-training: initialize the parameters meaningfully
- Dropout

Dropout

Training (backpropagation):
- randomly deactivate each unit with probability $p$
- compute the error for the new network architecture
- perform a gradient descent step

Prediction (forward propagation):
- multiply the output of each unit by $1 - p$, the probability that the unit was active during training
- this preserves the expected value of the output for a single layer
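The two dropout modes can be sketched as follows. The sketch assumes that $p$ is the probability of deactivating a unit, so the matching scale at prediction time is the keep probability $1 - p$; the function names and the test vector are hypothetical.

```python
import numpy as np

def dropout_train(x, p, rng):
    """Training pass: deactivate each unit independently with probability p."""
    mask = rng.random(x.shape) >= p      # True means the unit stays active
    return x * mask

def dropout_predict(x, p):
    """Prediction pass: keep all units, scale by the keep probability 1 - p."""
    return x * (1.0 - p)

rng = np.random.default_rng(0)
x = np.ones(100_000)                     # hypothetical layer output
p = 0.3
mean_train = dropout_train(x, p, rng).mean()
mean_pred = dropout_predict(x, p).mean()
```

Averaged over many units, the training-time output matches the scaled prediction-time output, which is the sense in which the expected value is preserved for a single layer.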

ADADELTA: Stochastic Gradient Descent

- Update steps are computed on a batch of samples.
- ADADELTA uses only first-order gradients.
- Simple to implement and apply.
- Applicable to large data sets and a large number of parameters (e.g. 500,000).

ADADELTA update rule: $x_{t+1} = x_t + \Delta x_t$, where

$$\Delta x_t = -\eta_t\, g_t = -\alpha\, \frac{\sqrt{\sum_{i=1}^{T} \rho^i (1 - \rho)\, \Delta x_{t-i}^2}}{\sqrt{\sum_{i=0}^{T} \rho^i (1 - \rho)\, g_{t-i}^2}}\; g_t$$

Remarks:
- Adaptive learning rate $\eta_t$. The parameters $\alpha$ and $\rho$ must be chosen.
- The learning rate is estimated from the previous gradients $g_t$ and updates $\Delta x_t$.
- The algorithm has been shown to work well in practice.
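A scalar version of the update can be sketched as below. This follows the commonly published recursive form of ADADELTA, in which the exponentially weighted sums are kept as running averages and a small constant `eps` is added under the square roots; `eps`, the default values, and the scale factor `alpha` (set to 1 here) are assumptions, and the quadratic test objective is hypothetical.

```python
import math

def adadelta_minimize(grad, x0, steps, rho=0.95, eps=1e-6, alpha=1.0):
    """Minimize a scalar function with ADADELTA, given its gradient."""
    x = x0
    avg_g2 = 0.0     # running average of squared gradients
    avg_dx2 = 0.0    # running average of squared updates
    for _ in range(steps):
        g = grad(x)
        avg_g2 = rho * avg_g2 + (1.0 - rho) * g * g
        # adaptive learning rate: RMS of past updates over RMS of gradients
        eta = alpha * math.sqrt(avg_dx2 + eps) / math.sqrt(avg_g2 + eps)
        dx = -eta * g
        avg_dx2 = rho * avg_dx2 + (1.0 - rho) * dx * dx
        x = x + dx
    return x

# Hypothetical objective f(x) = x^2 with gradient 2x, started at x = 1.
x_final = adadelta_minimize(lambda x: 2.0 * x, x0=1.0, steps=50)
```

Because the first updates are scaled by `sqrt(eps)`, progress is slow at the start and the step size builds up from the history of past updates rather than from a hand-tuned learning rate.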

Bayesian Optimization

- Optimizing selected network parameters, e.g. the decay rate $\rho$
- The objective function is unknown (i.e. parameters → prediction errors)
- Bayesian optimization: optimizing while approximating the objective function
- Inferring the objective function from data, i.e. [parameters, errors]

Optimization loop:
1. Initialize the parameters.
2. Train the network with the parameters.
3. Compute the prediction error on validation data.
4. Learn the objective function: parameters → validation error.
5. Choose new parameters according to a criterion, then continue with step 2.

Bayesian Optimization with Gaussian Prior

- Learning the objective function with Gaussian process regression.
- GP prediction for a test point $x_t$: $\mathcal{N}(\mu(x_t), \nu(x_t))$
- The selection criterion is computed based on $\mu(x_t)$ and $\nu(x_t)$.
- Expected Improvement criterion for a given point $x$:

$$EI = \nu(x)\, [\gamma(x)\, \Phi_{norm}(\gamma(x)) + \phi_{norm}(\gamma(x))], \qquad \gamma(x) = \frac{y_{best} - \mu(x)}{\nu(x)}$$

where $\Phi_{norm}(\cdot)$ is the normal cumulative distribution function, $\phi_{norm}(\cdot)$ is the normal probability density function, and $y_{best}$ is the currently best measurement/observation.

[Figure: plot of the Expected Improvement]
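The Expected Improvement criterion is easy to evaluate once the GP prediction at a point is available. The sketch below assumes a minimization setting (smaller validation error is better), treats $\nu$ as the predictive standard deviation, and uses hypothetical values for $\mu$, $\nu$ and $y_{best}$; it does not include the GP regression itself.

```python
import math

def norm_pdf(u):
    """Standard normal probability density function."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def norm_cdf(u):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

def expected_improvement(mu, nu, y_best):
    """EI at a point with GP mean mu and standard deviation nu > 0."""
    gamma = (y_best - mu) / nu
    return nu * (gamma * norm_cdf(gamma) + norm_pdf(gamma))

# Two hypothetical candidates with the same predicted mean as y_best.
ei_certain = expected_improvement(mu=0.5, nu=0.1, y_best=0.5)
ei_uncertain = expected_improvement(mu=0.5, nu=0.3, y_best=0.5)
```

With equal means, the candidate with the larger predictive uncertainty gets the larger Expected Improvement, which is how the criterion trades off exploitation against exploration.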

Ensembles Boosting Prediction Performance

- Standard ML approach to improve test performance: combine the outputs of different models.
- We can use different random weight initializations and training with/without the validation set.

How to combine the predictions?
- Each network gives us a prediction, e.g. $p_1 = (0.4, 0.3, 0.3)$, $p_2 = (0.35, 0.35, 0.3)$, $p_3 = (0.1, 0.9, 0.0)$.
- We can take the arithmetic or geometric mean, e.g. $p_{avg} = (0.28, 0.52, 0.2)$.
- The class prediction is the index with the highest score, e.g. class 2.
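The combination step can be reproduced with the three example predictions. This is a minimal sketch in plain Python; renormalizing the geometric mean so that it sums to one is an added assumption, and the function names are illustrative.

```python
import math

def arithmetic_mean(preds):
    """Class-wise arithmetic mean of several prediction vectors."""
    m = len(preds)
    return [sum(p[k] for p in preds) / m for k in range(len(preds[0]))]

def geometric_mean(preds):
    """Class-wise geometric mean, renormalized to sum to one (assumption)."""
    m = len(preds)
    g = [math.prod(p[k] for p in preds) ** (1.0 / m)
         for k in range(len(preds[0]))]
    total = sum(g)
    return [v / total for v in g]

preds = [(0.4, 0.3, 0.3), (0.35, 0.35, 0.3), (0.1, 0.9, 0.0)]
p_avg = arithmetic_mean(preds)
p_geo = geometric_mean(preds)
# Class labels are 1-based in the text, so add 1 to the argmax index.
predicted_class = max(range(len(p_avg)), key=lambda k: p_avg[k]) + 1
```

Note that the geometric mean assigns zero to any class that a single network rules out completely, as happens for class 3 here, whereas the arithmetic mean does not.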

Results on Traffic Sign Recognition

CCR (%)   Team              Deep Learning used
99.46     IDSIA             yes
98.84     human average     -
98.80     BOSCH deep nets   yes
98.31     Sermanet          yes
96.14     CAOR              no
95.68     INI-RTCV          no
93.18     INI-RTCV          no
92.34     INI-RTCV          no

Correct classification rate (CCR) on the final stage of the German Traffic Sign Recognition Benchmark: 38,880 images for training and 12,960 images for testing from 43 different German road sign classes.

Remarks

- Various approaches for optimizing and training deep nets, e.g. Bayesian optimization, pre-processing, dropout, ...
- The choice of appropriate techniques is based on the application, experience, and knowledge in machine learning:
  - Try out different training approaches
  - Gain experience
  - Keep up with the developments in the deep learning community
- Further research problems: Bayesian deep learning, unsupervised learning, generative deep models, deep reinforcement learning, adversarial problems, etc.

[Figures: error rate and error score by year for traditional approaches vs. approaches involving deep learning (2010 to 2014; annotated error rates 26.2 % and 16.4 %; for 2013 and 2014 only the 10 best results are plotted), and number of deep learning publications on Google Scholar by year (2000 to 2014, as of October 14, 2014).]
