CS489/698: Intro to ML

Size: px

Start display at page:

Download "CS489/698: Intro to ML"

Basil Parks
5 years ago
Views:

1 CS489/698: Intro to ML Lecture 14: Training of Deep NNs Instructor: Sun Sun 1

2 Outline Activation functions Regularization Gradient-based optimization 2

3 Examples of activation functions 3 5/28/18 Sun Sun

4 Sigmoid and tanh Sigmoid Output range (0,1) 1 σ x Gradient range (0,1): σ x Small gradient at saturated regions Ø x = 0, gradient = 0.25 Ø x = 10, gradient = e-05 Ø x = -10, gradient = e-05 tanh Output range (-1,1) Gradient range (0,1): 1 tanh * x Small gradient at saturated regions 4

5 Vanishing gradients Common weight initialization in (-1, 1) Denote input of the i-th σ() as a. 5

6 Avoiding vanishing gradients Proper initialization of network parameters Choose activation functions that do not saturate Use Long Short-Term Memory (LSTM) or Gated Recurrent Unit (GRU) architectures. Both of these RNN architectures are explicitly designed to deal with vanishing gradients and efficiently learn long-range dependencies 6

7 ReLU, Leaky ReLU, and ELU ReLU Computationally efficient Gradient = 1 when x> 0 Gradient = 0 when x<0 (cannot update parameter) Ø Initialize ReLU units with slightly positive values, e.g., 0.01 Ø Try other activation functions, e.g., Leaky ReLU Leaky ReLU Computationally efficient Constant gradient for both x>0 and x<0 ELU Small gradient at negative saturation region 7

8 Maxout Generalize ReLU and Leaky ReLU Ø Piece-wise linear function Do not saturate More parameters 8

9 Tips Sigmoid and tanh functions are sometimes avoided due to the vanishing gradient problem Use ReLU as a default choice Explore other activation functions, e.g., Leaky ReLU and Maxout, if ReLU results in dead hidden units 9

10 Outline Activation functions Regularization Gradient-based optimization 10

11 Overfitting and regularization High capacity increases risk of overfitting Ø # of parameters is often larger than # of data Regularization: any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error Ø Parameter norm penalties (e.g., L1 and L2 norms) Ø Bagging Ø Dropout Ø Data augmentation Ø Early stopping Ø 11 definition from: Ian Goodfellow, Yoshua Bengio and Aaron Courville, Deep learning

12 Bagging (1) Bagging (bootstrap aggregating): reduce generalization error by combining several models Construct a different data set for each model Each dataset has the same size as the original one Sampling with replacement from the original dataset Prediction: Regression: average; Classification: vote Different models usually do not make all the same errors 12

13 Bagging (2) Consider k regression models Model errors: zero-mean multivariate Gaussian ε. : error of the i-th model; E ε *. = v, E ε. ε 4 = c Expected squared error (ESE) If model errors are uncorrelated, i.e., c = 0, ESE = v/k If model errors are perfectly correlated and c = v, ESE = v 13

14 Dropout For each training example keep hidden units with probability p. A different and random network for each training example A smaller network with less capacity e.g., let p = 0.5 for all hidden layers delete ingoing and outgoing links for eliminated hidden units 14

15 Implementation Consider implementing dropout in layer 1. Denote output of layer 1 as h with dimension 4*1 Generate a 4*1 vector d with elements 0 or 1 indicating whether units are kept or not Update h by multiplying h and d element-wise Update h by h/p (``inverted dropout ) 15

16 Inverted dropout Motivation: keep the mean value of output unchanged Example: Consider h (9) in the previous slide. h (9) is used as the input to layer 2. Let p = 0.8. z (*) = W (*) h (9) + b (*) 20% of hidden units in layer 1 are eliminated Without ``inverted dropout, the mean of z (*) would be decreased by 20% roughly 16

17 Prediction Training Use (``inverted ) dropout for each training example Prediction Usually do not use dropout If ``inverted dropout is used in training, no further scaling is needed in testing 17

18 Why works? Intuition: Since each hidden unit can disappear with some probability, the network cannot rely on any particular units and have to spread out the weights à similar to L2 regularization Another interpretation: Dropout is training a large ensemble of models sharing parameters 18

19 Hyperparameter p p: probability of keeping units How to design p? Keep p the same for all layers Or, design p for each layer: set a lower p for overfitting layers, i.e., layers with a large number of units Can also design p for the input, but usually set p = 1 or very close to 1 19

Data augmentation The best way to improve generalization is to have more training data. But often, data are limited. One solution: create fake data, e.g., transform input Example: object recognition (classification) Horizontal/vertical flip Rotation Stretching But may not be readily applicable to many other tasks, e.

20 Data augmentation The best way to improve generalization is to have more training data. But often, data are limited. One solution: create fake data, e.g., transform input Example: object recognition (classification) Horizontal/vertical flip Rotation Stretching But may not be readily applicable to many other tasks, e.g., density estimation 20 6/25/18 Sun Sun

21 Early stopping For models with large capacity, training error decreases steadily validation error first decreases then increases figure from: Ian Goodfellow, Yoshua Bengio and Aaron Courville, Deep learning 21

22 Batch normalization (BN) Use BN for each dimension in hidden layers, either before activation function or after h (3) = f 3 (W (3) h (2) + b (3) ) z (A) either apply BN to z (A) or h (A) γ and β are optimization variables; not hyperparameters 22

23 Why BN? (1) Consider applying BN to z (i.e., before activation) z : mean β, variance γ * Robust to parameter changes in previous layers Independent learning of each layer apply BN Without BN, learning of layer 2 depends on output of layer 1 and hence parameters of layer 1 With BN, mean and variance of z (*) are unchanged 23

24 Why BN? (2) Assume the orders of features in hidden layers are significantly different E.g., feature 1: 1; feature 2: 10 3 ; feature 3: 10-3 With BN, features are of similar scale Speed up learning 24

25 Why BN? (3) BN has a slight regularization effect (not intended) Each mini-batch is scaled by the mean/variance computed on that mini-batch This adds some noise to z and to each layer s activations 25

26 Outline Activation functions Regularization Gradient-based optimization 26

27 Optimization Problem of SGD: slow convergence SGD + momentum g O 1 m! mx L(f(x (i) ; ),y (i) ) i=1 v v +(1 )g v exponentially weighted averages Accumulate an exponentially decaying moving average of past gradients Hyperparameter α determines how quickly the contributions of previous gradients exponentially decay (α = 0.9 usually works well) 27

28 Exponentially weighted averages Let v t = v t 1 +(1 ) t ; =0.9 v 100 = total weight: 1 β F (close to 1 if t is large) 28

29 Illustration Contour of loss function red: SGD + momentum black: SGD figure from: Ian Goodfellow, Yoshua Bengio and Aaron Courville, Deep learning 29

30 RMSProp (root mean square propagation) Greater progress in the more gently sloped directions One of the go-to optimization methods for deep learning discard history from extreme past 30

31 Adam (adaptive moments) A variant on the combination of RMSProp and momentum both first and second moments are included when t is large, the effect of bias correction is small 31

32 Learning rate SGD, SGD+Momentum, AdaGrad, RMSProp, Adam all have learning rate as a hyperparameter. àreduce learning rate over time! Step decay e.g., decrease learning rate by half every few epochs Exponential decay 1/t decay t: iteration number 32

33 Illustration 33 figure from: Stanford cs231n course slides

34 Questions? 34

Deep Neural Networks Optimization

Deep Neural Networks Optimization Creative Commons (cc) by Akritasa http://arxiv.org/pdf/1406.2572.pdf Slides from Geoffrey Hinton CSC411/2515: Machine Learning and Data Mining, Winter 2018 Michael Guerzhoy