
CS489/698: Intro to ML Lecture 14: Training of Deep NNs Instructor: Sun Sun 1

Outline Activation functions Regularization Gradient-based optimization 2

Examples of activation functions 3

Sigmoid and tanh Sigmoid: σ(x) = 1/(1 + e^(-x)); output range (0,1); gradient σ'(x) = σ(x)(1 - σ(x)) lies in (0, 0.25], so the gradient is small at saturated regions: x = 0 gives gradient 0.25; x = 10 gives 4.5396e-05; x = -10 gives 4.5396e-05. tanh: output range (-1,1); gradient 1 - tanh^2(x) lies in (0, 1], again small at saturated regions. 4
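A minimal numerical check of the saturation behaviour above (my own NumPy sketch, not from the slides):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)           # sigma(x) * (1 - sigma(x))

def tanh_grad(x):
    return 1.0 - np.tanh(x) ** 2   # 1 - tanh^2(x)

for x in [0.0, 10.0, -10.0]:
    print(f"x={x:+.0f}  sigmoid grad={sigmoid_grad(x):.4e}  tanh grad={tanh_grad(x):.4e}")
# x=+0 gives sigmoid gradient 2.5000e-01; x=+/-10 give about 4.5396e-05 (saturated)
```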

Vanishing gradients Common weight initialization in (-1, 1). Denote the input of the i-th σ(·) as a_i. During backpropagation the gradient picks up a factor σ'(a_i) ≤ 0.25 (and a weight of magnitude below 1) for every sigmoid layer it passes through, so the gradients of early layers shrink roughly geometrically with depth. 5
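A rough illustrative sketch of this effect (assumed random setup, not from the slides): chaining the per-layer factors |w| · σ'(a) across many sigmoid layers drives the gradient toward zero.

```python
import numpy as np

rng = np.random.default_rng(0)
depth = 20
grad_factor = 1.0
for layer in range(depth):
    w = rng.uniform(-1.0, 1.0)               # common initialization in (-1, 1)
    a = rng.normal()                          # pre-activation input to the sigmoid
    s = 1.0 / (1.0 + np.exp(-a))
    grad_factor *= abs(w) * s * (1.0 - s)     # |w| * sigma'(a) <= 0.25
print(f"gradient factor after {depth} sigmoid layers: {grad_factor:.2e}")
# typically around 1e-20 or smaller: the gradient has effectively vanished
```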

Avoiding vanishing gradients Proper initialization of network parameters. Choose activation functions that do not saturate. Use Long Short-Term Memory (LSTM) or Gated Recurrent Unit (GRU) architectures: both of these RNN architectures are explicitly designed to deal with vanishing gradients and to efficiently learn long-range dependencies. 6

ReLU, Leaky ReLU, and ELU ReLU: computationally efficient; gradient = 1 when x > 0; gradient = 0 when x < 0, so a unit stuck in the negative region cannot update its parameters ("dead" unit). Remedies: initialize ReLU units with slightly positive values, e.g., 0.01, or try other activation functions, e.g., Leaky ReLU. Leaky ReLU: computationally efficient; non-zero constant gradient for both x > 0 and x < 0. ELU: small gradient in the negative saturation region. 7
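For concreteness, a small sketch of the three activations and their gradients (the negative-side slopes are my assumed defaults, not values from the slides):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):                  # assumed negative-side slope
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):                          # assumed ELU scale
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-10.0, -1.0, 0.5, 10.0])
print(relu(x), leaky_relu(x), elu(x))
print("relu grad:      ", np.where(x > 0, 1.0, 0.0))        # exactly 0 for x < 0 ("dead" units)
print("leaky relu grad:", np.where(x > 0, 1.0, 0.01))       # small but non-zero for x < 0
print("elu grad:       ", np.where(x > 0, 1.0, np.exp(x)))  # decays smoothly toward 0 for x < 0
```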

Maxout Generalizes ReLU and Leaky ReLU: each unit outputs the maximum of several learned affine functions, so it is a piecewise-linear function. Does not saturate, but requires more parameters. 8
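A minimal sketch of a maxout unit (the input dimension and the number of linear pieces k are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, k = 5, 3                        # input dimension and number of linear pieces (assumed)
W = rng.normal(size=(k, d_in))        # one weight vector per piece
b = rng.normal(size=k)

def maxout(x):
    # output = max_j (W_j . x + b_j); with k = 2 this can recover ReLU or Leaky ReLU
    return np.max(W @ x + b)

print(maxout(rng.normal(size=d_in)))
```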

Tips Sigmoid and tanh functions are sometimes avoided due to the vanishing gradient problem. Use ReLU as a default choice. Explore other activation functions, e.g., Leaky ReLU and Maxout, if ReLU results in dead hidden units. 9

Outline Activation functions Regularization Gradient-based optimization 10

Overfitting and regularization High capacity increases the risk of overfitting: the # of parameters is often larger than the # of training examples. Regularization: any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error. Examples: parameter norm penalties (e.g., L1 and L2 norms), bagging, dropout, data augmentation, early stopping. (definition from: Ian Goodfellow, Yoshua Bengio and Aaron Courville, Deep Learning) 11

Bagging (1) Bagging (bootstrap aggregating): reduce generalization error by combining several models. Construct a different dataset for each model: each dataset has the same size as the original one and is obtained by sampling with replacement from the original dataset. Prediction: average for regression, vote for classification. Different models usually do not make all the same errors. 12

Bagging (2) Consider k regression models. Model errors: zero-mean multivariate Gaussian; ε_i is the error of the i-th model, with E[ε_i^2] = v and E[ε_i ε_j] = c for i ≠ j. Expected squared error (ESE) of the ensemble: E[((1/k) Σ_i ε_i)^2] = v/k + ((k-1)/k) c. If model errors are uncorrelated, i.e., c = 0, ESE = v/k. If model errors are perfectly correlated and c = v, ESE = v. 13
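A quick numerical sanity check of the two limiting cases (my own sketch, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
k, v, n = 10, 1.0, 200_000

# Uncorrelated errors: ensemble ESE should be about v/k
eps = rng.normal(scale=np.sqrt(v), size=(n, k))
print("c=0:", np.mean(eps.mean(axis=1) ** 2), "expected", v / k)

# Perfectly correlated errors (c = v): ensemble ESE should be about v
eps = np.tile(rng.normal(scale=np.sqrt(v), size=(n, 1)), (1, k))
print("c=v:", np.mean(eps.mean(axis=1) ** 2), "expected", v)
```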

Dropout For each training example, keep each hidden unit with probability p. This gives a different, random network for each training example: a smaller network with less capacity. E.g., let p = 0.5 for all hidden layers, and delete the ingoing and outgoing links of the eliminated hidden units. 14

Implementation Consider implementing dropout in layer 1. Denote the output of layer 1 as h^(1), with dimension 4x1. Generate a 4x1 vector d with elements 0 or 1 indicating whether each unit is kept. Update h^(1) by multiplying h^(1) and d element-wise, then update h^(1) by h^(1)/p ("inverted dropout"). 15
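A minimal NumPy sketch of the inverted-dropout step described above (the layer size is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.8                                   # probability of keeping a unit
h1 = rng.normal(size=(4, 1))              # output of layer 1 (4x1, as on the slide)

d = (rng.random(size=h1.shape) < p).astype(h1.dtype)  # 1 = keep, 0 = drop
h1 = h1 * d                               # eliminate the dropped units
h1 = h1 / p                               # "inverted dropout": rescale to keep the mean unchanged

# At prediction time no mask and no scaling are applied; the layer is used as-is.
print(h1)
```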

Inverted dropout Motivation: keep the mean value of the output unchanged. Example: consider h^(1) from the previous slide; h^(1) is used as the input to layer 2: z^(2) = W^(2) h^(1) + b^(2). Let p = 0.8, so 20% of the hidden units in layer 1 are eliminated. Without "inverted dropout", the mean of z^(2) would be decreased by roughly 20%. 16

Prediction Training: use ("inverted") dropout for each training example. Prediction: usually do not use dropout; if "inverted dropout" is used in training, no further scaling is needed at test time. 17

Why does it work? Intuition: since each hidden unit can disappear with some probability, the network cannot rely on any particular unit and has to spread out the weights → similar to L2 regularization. Another interpretation: dropout trains a large ensemble of models that share parameters. 18

Hyperparameter p p: probability of keeping units. How to choose p? Keep p the same for all layers, or design p for each layer: set a lower p for layers more prone to overfitting, e.g., layers with a large number of units. Can also apply dropout to the input, but usually set p = 1 or very close to 1 there. 19

Data augmentation The best way to improve generalization is to have more training data, but often data are limited. One solution: create fake data, e.g., by transforming the input. Example: object recognition (classification) with horizontal/vertical flips, rotation, stretching. This may not be readily applicable to many other tasks, e.g., density estimation. 20
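A tiny sketch of the flip/rotation transforms mentioned above, applied to a dummy image array (plain NumPy; a real pipeline would typically use a library such as torchvision or PIL):

```python
import numpy as np

image = np.arange(16, dtype=np.float32).reshape(4, 4)   # stand-in for an H x W image

augmented = {
    "horizontal_flip": np.fliplr(image),
    "vertical_flip":   np.flipud(image),
    "rotate_90":       np.rot90(image),
}
# Each transformed copy keeps the original label, effectively enlarging the training set.
for name, img in augmented.items():
    print(name, img.shape)
```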

Early stopping For models with large capacity, the training error decreases steadily while the validation error first decreases and then increases; stop training (or keep the parameters from) the point of lowest validation error. figure from: Ian Goodfellow, Yoshua Bengio and Aaron Courville, Deep Learning 21
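A schematic training loop with early stopping (train_one_epoch, validation_loss and model.get_params are hypothetical helpers standing in for a real training setup; the patience value is an assumption):

```python
def early_stopping_train(model, train_one_epoch, validation_loss, max_epochs=200, patience=10):
    """Stop when the validation loss has not improved for `patience` epochs."""
    best_loss, best_params, epochs_without_improvement = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_one_epoch(model)            # hypothetical: one pass over the training set
        val = validation_loss(model)      # hypothetical: loss on held-out validation data
        if val < best_loss:
            best_loss, best_params = val, model.get_params()  # assumed accessor
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return best_params, best_loss
```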

Batch normalization (BN) Use BN for each dimension in the hidden layers, either before the activation function or after. For layer 3, h^(3) = f_3(z^(3)) with z^(3) = W^(3) h^(2) + b^(3): apply BN either to z^(3) or to h^(3). The scale γ and shift β are optimization variables, not hyperparameters. 22
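A minimal sketch of the BN transform for one mini-batch (training-time statistics only; a real implementation also tracks running averages for use at test time):

```python
import numpy as np

def batch_norm_forward(z, gamma, beta, eps=1e-5):
    """Normalize each dimension of z over the mini-batch, then scale and shift.

    z: (batch_size, num_units); gamma, beta: (num_units,) learned parameters.
    """
    mu = z.mean(axis=0)
    var = z.var(axis=0)
    z_hat = (z - mu) / np.sqrt(var + eps)   # zero mean, unit variance per dimension
    return gamma * z_hat + beta             # output has mean beta and variance about gamma^2

rng = np.random.default_rng(0)
z = rng.normal(loc=5.0, scale=3.0, size=(32, 4))
out = batch_norm_forward(z, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0), out.var(axis=0))    # approximately 0 and 1
```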

Why BN? (1) Consider applying BN to z (i.e., before the activation): z then has mean β and variance γ^2. This makes each layer robust to parameter changes in previous layers and allows more independent learning of each layer. Without BN, the learning of layer 2 depends on the output of layer 1 and hence on the parameters of layer 1; with BN, the mean and variance of z^(2) are unchanged. 23

Why BN? (2) Suppose the orders of magnitude of features in the hidden layers are significantly different, e.g., feature 1: 1, feature 2: 10^3, feature 3: 10^-3. With BN, features are brought to a similar scale, which speeds up learning. 24

Why BN? (3) BN has a slight (unintended) regularization effect: each mini-batch is scaled by the mean/variance computed on that mini-batch, which adds some noise to z and to each layer's activations. 25

Outline Activation functions Regularization Gradient-based optimization 26

Optimization Problem of SGD: slow convergence. SGD + momentum: g ← (1/m) ∇_θ Σ_{i=1}^m L(f(x^(i); θ), y^(i)); v ← α v + (1 - α) g (an exponentially weighted average); θ ← θ - ε v. Accumulate an exponentially decaying moving average of past gradients. The hyperparameter α determines how quickly the contributions of previous gradients decay (α = 0.9 usually works well). 27
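A sketch of this update in code (gradient_fn is a hypothetical function returning the mini-batch gradient; the learning rate value is illustrative):

```python
import numpy as np

def sgd_momentum(theta, gradient_fn, lr=0.01, alpha=0.9, num_steps=500):
    """SGD with momentum using an exponentially weighted average of past gradients."""
    v = np.zeros_like(theta)
    for _ in range(num_steps):
        g = gradient_fn(theta)             # hypothetical mini-batch gradient
        v = alpha * v + (1.0 - alpha) * g  # exponentially weighted average
        theta = theta - lr * v
    return theta

# Example: minimize f(theta) = 0.5 * ||theta||^2, whose gradient is theta.
print(sgd_momentum(np.array([5.0, -3.0]), gradient_fn=lambda t: t))
```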

Exponentially weighted averages Let v_t = β v_{t-1} + (1 - β) g_t with β = 0.9. Then v_100 = 0.1 g_100 + 0.1(0.9) g_99 + 0.1(0.9)^2 g_98 + ... + 0.1(0.9)^99 g_1. The total weight of the coefficients is 1 - β^t (close to 1 if t is large). 28
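A two-line check of the "total weight is 1 - β^t" claim (my own sketch):

```python
import numpy as np

beta, t = 0.9, 100
weights = (1 - beta) * beta ** np.arange(t)   # coefficients 0.1 * 0.9^k, k = 0..99
print(weights.sum(), 1 - beta ** t)           # both are about 0.99997
```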

Illustration Contour of a loss function; red: SGD + momentum, black: SGD. figure from: Ian Goodfellow, Yoshua Bengio and Aaron Courville, Deep Learning 29

RMSProp (root mean square propagation) Makes greater progress in the more gently sloped directions by scaling each coordinate's step by a root mean square of its recent gradients; discards history from the extreme past via an exponentially decaying average of squared gradients. One of the go-to optimization methods for deep learning. 30
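A sketch of the standard RMSProp update (the decay rate rho, the small constant delta, and the learning rate are typical values I've assumed, not taken from the slide):

```python
import numpy as np

def rmsprop(theta, gradient_fn, lr=0.001, rho=0.9, delta=1e-8, num_steps=1000):
    """RMSProp: scale each coordinate's step by an RMS of its recent gradients."""
    r = np.zeros_like(theta)
    for _ in range(num_steps):
        g = gradient_fn(theta)                 # hypothetical mini-batch gradient
        r = rho * r + (1.0 - rho) * g * g      # exponentially decaying average of squared gradients
        theta = theta - lr * g / (np.sqrt(r) + delta)
    return theta

# Coordinates with small, consistent gradients get relatively larger steps.
print(rmsprop(np.array([5.0, -3.0]), gradient_fn=lambda t: t))
```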

Adam (adaptive moments) A variant on the combination of RMSProp and momentum: both first and second moments of the gradient are estimated, with bias correction; when t is large, the effect of the bias correction is small. 31
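A sketch of the Adam update with the commonly used default hyperparameters (assumed here, not given on the slide):

```python
import numpy as np

def adam(theta, gradient_fn, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, num_steps=1000):
    """Adam: momentum-style first moment + RMSProp-style second moment, both bias-corrected."""
    m = np.zeros_like(theta)   # first moment (mean of gradients)
    v = np.zeros_like(theta)   # second moment (mean of squared gradients)
    for t in range(1, num_steps + 1):
        g = gradient_fn(theta)                 # hypothetical mini-batch gradient
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** t)         # bias correction (matters mostly for small t)
        v_hat = v / (1.0 - beta2 ** t)
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta

print(adam(np.array([5.0, -3.0]), gradient_fn=lambda t: t))
```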

Learning rate SGD, SGD+Momentum, AdaGrad, RMSProp, and Adam all have the learning rate as a hyperparameter → reduce the learning rate over time! Step decay: e.g., decrease the learning rate by half every few epochs. Exponential decay. 1/t decay (t: iteration number). 32
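A sketch of the three schedules (the base rate and the decay constants are illustrative assumptions):

```python
import numpy as np

lr0 = 0.1  # assumed initial learning rate

def step_decay(epoch, drop=0.5, every=10):
    return lr0 * drop ** (epoch // every)     # halve the rate every `every` epochs

def exponential_decay(t, k=0.01):
    return lr0 * np.exp(-k * t)               # lr = lr0 * exp(-k t)

def inv_t_decay(t, k=0.01):
    return lr0 / (1.0 + k * t)                # lr = lr0 / (1 + k t)

for t in (0, 10, 100):
    print(t, step_decay(t), exponential_decay(t), inv_t_decay(t))
```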

Illustration figure from: Stanford cs231n course slides 33

Questions? 34