
Machine Learning: The Breadth of ML - Neural Networks & Deep Learning
Marc Toussaint (University of Stuttgart), Duy Nguyen-Tuong (Bosch Center for Artificial Intelligence), Summer 2017

Neural Networks
Consider a regression problem with input $x \in \mathbb{R}^d$ and output $y \in \mathbb{R}$.
- Linear function ($\beta \in \mathbb{R}^d$): $f(x) = \beta^\top x$
- 1-layer Neural Network function ($W_0 \in \mathbb{R}^{h_1 \times d}$): $f(x) = \beta^\top \sigma(W_0 x)$
- 2-layer Neural Network function: $f(x) = \beta^\top \sigma(W_1 \sigma(W_0 x))$
Neural Networks are a special function model $y = f(x, w)$, i.e. a special way to parameterize non-linear functions.
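As a concrete illustration, here is a minimal NumPy sketch of the 1- and 2-layer forward functions above with a sigmoid nonlinearity; the dimensions and random weights are arbitrary assumptions for the example.

```python
import numpy as np

def sigma(z):
    """Element-wise sigmoid nonlinearity."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
d, h1, h2 = 4, 8, 8          # input and hidden dimensions (arbitrary)
x = rng.normal(size=d)

# 1-layer NN: f(x) = beta^T sigma(W0 x)
W0   = rng.normal(size=(h1, d))
beta = rng.normal(size=h1)
f1 = beta @ sigma(W0 @ x)

# 2-layer NN: f(x) = beta2^T sigma(W1 sigma(W0 x))
W1    = rng.normal(size=(h2, h1))
beta2 = rng.normal(size=h2)
f2 = beta2 @ sigma(W1 @ sigma(W0 @ x))

print(f1, f2)   # two scalar predictions
```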

Neural Networks: Training
How do we determine the weights $W_{l,ij}$ (layer $l$, node $j$), given samples $\{x_i, y_i\}$? Idea:
- Initialize the weights $W_l$ for each layer $l$.
- First, propagate $x_i$ through the network, bottom-up (Forward Propagation).
- Then, compute the error between the prediction and the ground truth $y_i$, given a loss function $\ell$.
- Subsequently, propagate the error backwards through the network and recursively compute the error gradients for each $W_{l,ij}$ (Back-Propagation).
- Update the weights $W_l$ using the computed error gradients for each sample $\{x_i, y_i\}$.
Notation: consider $L$ hidden layers, each $h_l$-dimensional.
- Let $z_l = W_{l-1} x_{l-1}$ be the inputs to all neurons in layer $l$.
- Let $x_l = \sigma(z_l)$ be the activation of all neurons in layer $l$.
- Redundantly, we denote by $x_0 \equiv x$ the activation of the input layer, and by $\phi(x) \equiv x_L$ the activation of the last hidden layer.

Neural Networks: Basic Equations
Forward propagation: an $L$-layer NN recursively computes, for $l = 1, \ldots, L$:
$$z_l = W_{l-1} x_{l-1}, \qquad x_l = \sigma(z_l)$$
and then computes the output $f \equiv z_{L+1} = W_L x_L$.
Backpropagation: given some loss $\ell(f)$, let $\delta_{L+1} = \frac{\partial \ell}{\partial f}$. We can recursively compute the loss gradient w.r.t. the inputs of layer $l$, for $l = L, \ldots, 1$:
$$\delta_l = \frac{d\ell}{dz_l} = \frac{d\ell}{dz_{l+1}}\, \frac{\partial z_{l+1}}{\partial x_l}\, \frac{\partial x_l}{\partial z_l} = [\delta_{l+1} W_l] \circ [x_l \circ (1 - x_l)]$$
where $\circ$ is an element-wise product (the factor $x_l \circ (1 - x_l)$ is the derivative of the sigmoid). The gradient w.r.t. the weights is
$$\frac{d\ell}{dW_{l,ij}} = \frac{d\ell}{dz_{l+1,i}}\, \frac{\partial z_{l+1,i}}{\partial W_{l,ij}} = \delta_{l+1,i}\, x_{l,j}, \qquad \text{or} \qquad \frac{d\ell}{dW_l} = \delta_{l+1}^\top x_l^\top$$
Weight update: many different weight-update rules are possible given the gradients $\frac{d\ell}{dW_l}$, for example the delta rule:
$$W_l^{new} = W_l^{old} + \Delta W_l = W_l^{old} - \eta\, \frac{d\ell}{dW_l}$$
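To make these equations concrete, here is a minimal NumPy sketch of forward and backward propagation for a network with sigmoid hidden layers and a linear output, following the notation above (treating the $\delta$ as row vectors); the layer sizes and data are arbitrary assumptions.

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
sizes = [3, 5, 5, 1]                  # input, two hidden layers, output (arbitrary)
W = [rng.normal(scale=0.5, size=(sizes[l+1], sizes[l])) for l in range(len(sizes)-1)]

x, y = rng.normal(size=3), 1.0        # a single training sample

# Forward propagation: z_l = W_{l-1} x_{l-1}, x_l = sigma(z_l)
xs = [x]
for l in range(len(W) - 1):
    xs.append(sigma(W[l] @ xs[-1]))
f = W[-1] @ xs[-1]                    # output f = z_{L+1} = W_L x_L

# Backpropagation with squared error loss l(f) = (f - y)^2
delta = 2.0 * (f - y)                 # delta_{L+1} = dl/df (here 1-dimensional)
grads = [None] * len(W)
grads[-1] = np.outer(delta, xs[-1])   # dl/dW_L = delta_{L+1}^T x_L^T
for l in range(len(W) - 2, -1, -1):
    delta = (delta @ W[l+1]) * xs[l+1] * (1 - xs[l+1])   # recursion for delta_l
    grads[l] = np.outer(delta, xs[l])

eta = 0.1
W = [Wl - eta * g for Wl, g in zip(W, grads)]   # delta-rule update
```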

Neural Networks: Regression
In the standard regression case, $y \in \mathbb{R}$, we typically assume a squared error loss $\ell(f) = \sum_i (f(x_i, w) - y_i)^2$. We have
$$\delta_{L+1} = \sum_i 2\,(f(x_i, w) - y_i)$$
Regularization: add an $L_2$ or $L_1$ regularization. First compute all gradients as before, then add $\lambda W_{l,ij}$ (for $L_2$) or $\lambda\,\mathrm{sign}(W_{l,ij})$ (for $L_1$) to the gradient. Historically, this is called weight decay, as the additional gradient term leads to a step that decays the weights.
The optimal output weights are as for standard ridge regression:
$$W_L = (X^\top X + \lambda I)^{-1} X^\top y$$
where $X$ is the data matrix of last-layer activations $x_L \equiv \phi(x)$.
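A minimal sketch of this analytic output-weight solve in NumPy, assuming the last-layer activations $\phi(x_i)$ have already been computed and stacked into a matrix; the feature matrix and targets here are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
n, h_L = 100, 20
Phi = rng.normal(size=(n, h_L))       # rows are phi(x_i) = x_L for each sample (placeholder)
y = rng.normal(size=n)                # regression targets (placeholder)
lam = 1e-2                            # regularization strength lambda

# W_L = (X^T X + lambda I)^{-1} X^T y, computed as a linear solve for stability
W_L = np.linalg.solve(Phi.T @ Phi + lam * np.eye(h_L), Phi.T @ y)
```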

Neural Networks: Classification
Consider the multi-class case $y \in \{1, \ldots, M\}$. We then have $M$ output neurons to represent the discriminative function
$$f(x, y, w) = (W_L x_L)_y, \qquad W_L \in \mathbb{R}^{M \times h_L}$$
- Choosing a neg-log-likelihood objective $\to$ logistic regression.
- Choosing a hinge loss objective $\to$ NN + SVM.
For a given $x$, let $y^*$ be the correct class. The one-vs-all hinge loss is
$$\sum_{y \ne y^*} \max\{0,\; 1 - (f_{y^*} - f_y)\}$$
- For an output neuron $y \ne y^*$ this implies a gradient $\delta_y = [f_{y^*} < f_y + 1]$.
- For the output neuron $y^*$ this implies a gradient $\delta_{y^*} = -\sum_{y \ne y^*} [f_{y^*} < f_y + 1]$.
Only data points inside the margin induce an error (and a gradient). This is also called the Perceptron Algorithm.
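For illustration, a short NumPy sketch of the one-vs-all hinge loss and the resulting output-layer gradient $\delta$ for one sample; the scores are arbitrary placeholder values.

```python
import numpy as np

f = np.array([1.2, 0.3, 1.0, -0.5])   # output scores f_y for M = 4 classes (placeholder)
y_star = 0                            # correct class

margin_violated = (f[y_star] < f + 1.0)        # indicator [f_{y*} < f_y + 1] per class
margin_violated[y_star] = False                # exclude the correct class itself

loss = np.sum(np.maximum(0.0, 1.0 - (f[y_star] - f[margin_violated])))

delta = margin_violated.astype(float)          # delta_y for y != y*
delta[y_star] = -margin_violated.sum()         # delta_{y*} = -sum of margin violations
print(loss, delta)
```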

Neural Networks: Dimensionality Reduction
Dimensionality reduction can be performed with autoencoders. An autoencoder is typically a NN with a narrow hidden layer that is trained to reproduce its input:
$$\min \sum_i \| y(x_i) - x_i \|^2$$
The hidden layer ("bottleneck") needs to find a good representation/compression. This is similar to the PCA objective, but non-linear. Stacking autoencoders yields Deep Autoencoders.
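A minimal sketch of a one-hidden-layer autoencoder trained by plain gradient descent on the reconstruction objective above (sigmoid bottleneck, linear decoder); the data, layer sizes, and learning rate are arbitrary assumptions.

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n, d, k = 200, 10, 3                      # samples, input dim, bottleneck dim (arbitrary)
X = rng.normal(size=(n, d))

W1, b1 = rng.normal(scale=0.1, size=(d, k)), np.zeros(k)   # encoder
W2, b2 = rng.normal(scale=0.1, size=(k, d)), np.zeros(d)   # decoder
eta = 0.05

for _ in range(500):
    H = sigma(X @ W1 + b1)                # bottleneck activations
    Xhat = H @ W2 + b2                    # reconstruction y(x_i)
    G = 2.0 * (Xhat - X) / n              # gradient of the mean reconstruction error
    dW2, db2 = H.T @ G, G.sum(axis=0)
    dH = G @ W2.T
    dZ = dH * H * (1 - H)                 # sigmoid derivative
    dW1, db1 = X.T @ dZ, dZ.sum(axis=0)
    W1, b1 = W1 - eta * dW1, b1 - eta * db1
    W2, b2 = W2 - eta * dW2, b2 - eta * db2

print(np.mean((sigma(X @ W1 + b1) @ W2 + b2 - X) ** 2))    # final reconstruction error
```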

Remarks
- A NN is usually trained based on the gradients $\nabla_{W_l} f(x)$ (the output weights can be optimized analytically, as for linear regression).
- NNs are a very powerful function class: by tweaking/training the weights one can approximate any non-linear function.
BUT:
- Are there any guarantees on generalization?
- What happens to the gradients when the NN is very deep?
- How can NNs be used to learn intelligent (autonomous) behavior (e.g. Autonomous Learning, Reinforcement Learning, Robotics, etc.)?
- Is there any insight into what the neurons will actually represent (e.g. discovering/developing abstractions, hierarchies, etc.)?
Deep Learning is a revival of Neural Networks and was mainly driven by the last point, i.e. learning useful representations.

Deep Learning: Basic Concept
Idea: learn hierarchical features from data, from simple features to complex features. Deep Learning can also be performed with other frameworks, e.g. Deep Gaussian Processes.
So what has changed compared to classical NNs?
- Algorithmic advancements, e.g. Dropout, ReLUs, pre-training
- More general models, e.g. Deep GPs, Deep Kernel Machines, ...
- More computational power (e.g. GPUs)
- Large data sets
Deep Learning is useful for very high-dimensional problems with many labeled or unlabeled samples (e.g. vision and speech tasks).

Typical Process to Train a Deep Network
- Pre-process the data, e.g. ZCA, distortions
- Choose a network type, e.g. convolutional network
- Choose an activation function, e.g. ReLU
- Choose a regularization, e.g. dropout
- Train the network, e.g. stochastic gradient descent with Adadelta
- Combine multiple models, e.g. an ensemble of networks
- Optimize high-level parameters, e.g. with Bayesian optimization
Many heuristics are involved when training Deep Networks.

Example: 2-D Convolutional Network
Open parameters:
- Number of layers
- Number of feature maps per convolution
- Filter size for each convolution
- Subsampling (pooling) size
- Number of hidden units
A minimal definition sketch of such a network follows below.
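To make the open parameters tangible, here is a hedged PyTorch sketch of a small 2-D convolutional network for 28x28 single-channel images; the specific numbers of layers, feature maps, filter sizes, pooling sizes, and hidden units are illustrative assumptions, not values from the lecture.

```python
import torch
import torch.nn as nn

# Two convolution/pooling stages followed by a fully connected part.
# Open parameters chosen arbitrarily for illustration:
#   feature maps: 8 and 16, filter size: 5x5, subsampling (max-pool) size: 2x2,
#   hidden units: 100, number of classes: 10.
model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=5),    # 1x28x28 -> 8x24x24
    nn.ReLU(),
    nn.MaxPool2d(2),                   # -> 8x12x12
    nn.Conv2d(8, 16, kernel_size=5),   # -> 16x8x8
    nn.ReLU(),
    nn.MaxPool2d(2),                   # -> 16x4x4
    nn.Flatten(),                      # -> 256
    nn.Linear(16 * 4 * 4, 100),        # hidden layer
    nn.ReLU(),
    nn.Linear(100, 10),                # class scores
)

x = torch.randn(32, 1, 28, 28)         # a dummy batch of images
print(model(x).shape)                  # torch.Size([32, 10])
```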

Pre-Processing Steps
1. Removing means from images: subtract the mean from the images, standardize the data.
2. Distortions of images: add distorted images to the training data, e.g. randomly translated and rotated images.
3. Zero Component Analysis (ZCA): perform the transformation
$$\tilde{x} = P^\top \Lambda^{-1} P\, x, \qquad \Lambda = \mathrm{diag}\big(\sqrt{\sigma_1 + \epsilon}, \sqrt{\sigma_2 + \epsilon}, \ldots, \sqrt{\sigma_n + \epsilon}\big)$$
where $P$ and $\sigma_i$ are the eigenvectors and eigenvalues of the data covariance. In practice, $\epsilon$ has the effect of strengthening the edges.
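A minimal NumPy sketch of this ZCA transformation under the assumptions above (eigendecomposition of the data covariance, with a small epsilon added to the eigenvalues); the data is a random placeholder for flattened image patches.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 64))            # placeholder: 1000 flattened 8x8 patches

X = X - X.mean(axis=0)                     # step 1: remove the mean
C = X.T @ X / X.shape[0]                   # data covariance
sig, U = np.linalg.eigh(C)                 # eigenvalues sigma_i, eigenvectors (columns of U)

eps = 1e-2                                 # epsilon strengthens the edges in practice
Lambda_inv = np.diag(1.0 / np.sqrt(sig + eps))
W_zca = U @ Lambda_inv @ U.T               # ZCA whitening matrix (P = U^T in the slide's notation)
X_zca = X @ W_zca.T                        # transformed data, x_tilde = W_zca x
```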

Activation Function: Rectified Linear Units
New activation function: rectified linear units (ReLUs), $f(z) = \max(0, z)$.
- Non-saturating
- Sparse activation
- Helps against vanishing gradients
Relation to logistic activations:
$$\sum_{n=1}^{\infty} \mathrm{logistic}(z + 0.5 - n) \approx \log(1 + e^z) \approx \max(0, z)$$
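A short NumPy check of this relation, evaluating the (truncated) sum of shifted logistics, the softplus $\log(1+e^z)$, and $\max(0,z)$ at a few points; the truncation at n = 50 is an assumption for the sketch.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
n = np.arange(1, 51)                                   # truncate the infinite sum at n = 50
stacked_logistics = logistic(z[:, None] + 0.5 - n[None, :]).sum(axis=1)
softplus = np.log(1.0 + np.exp(z))
relu = np.maximum(0.0, z)

print(np.round(stacked_logistics, 3))   # close to the softplus values
print(np.round(softplus, 3))            # smooth approximation of the ReLU
print(np.round(relu, 3))
```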

Deep Networks and Overfitting
Overfitting: good training performance, bad testing performance. Deep models are very sensitive to overfitting due to their complex model structure.
How to avoid overfitting:
- Weight decay: penalize $\|W\|_1$ or $\|W\|_2$
- Early stopping: recognize overfitting on a validation data set
- Pre-training: initialize the parameters meaningfully
- Dropout

Dropout
Training (Backpropagation):
- Randomly deactivate each unit with probability $p$
- Compute the error for the resulting (thinned) network architecture
- Perform a gradient descent step
Prediction (Forward Propagation):
- Multiply the output of each unit by its retention probability $1 - p$
- This preserves the expected value of the output of a single layer
A NumPy sketch of both phases follows below.
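A minimal sketch of dropout for a single layer, assuming $p$ is the deactivation probability as in the slide; the layer activations are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                                    # probability of deactivating a unit
a = rng.normal(size=(32, 100))             # activations of one layer, batch of 32 (placeholder)

# Training: randomly deactivate units; gradients then flow only through the kept units.
mask = rng.random(a.shape) >= p            # True = unit kept
a_train = a * mask

# Prediction: keep all units, scale by the retention probability 1 - p
# so that the expected layer output matches training.
a_test = a * (1.0 - p)

print(a_train.mean(), a_test.mean())       # similar expected values
```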

ADADELTA: Stochastic Gradient Descent
- Computation of update steps on a batch of samples
- ADADELTA uses only first-order gradients
- Simple to implement and apply
- Applicable to large data sets and large numbers of parameters (on the order of 500,000)
ADADELTA update rule:
$$x_{t+1} = x_t + \Delta x_t, \qquad \Delta x_t = -\eta_t\, g_t, \qquad \eta_t = \alpha\, \frac{\sqrt{\sum_{i=1}^{T} \rho^i (1-\rho)\, \Delta x_{t-i}^2}}{\sqrt{\sum_{i=0}^{T} \rho^i (1-\rho)\, g_{t-i}^2}}$$
Remarks:
- Adaptive learning rate $\eta_t$; the parameters $\alpha$ and $\rho$ must be chosen
- The learning rate is estimated from the previous gradients $g_t$ and updates $\Delta x_t$
- The algorithm has been shown to work well in practice
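A minimal NumPy sketch of the ADADELTA update in its usual recursive form (exponential moving averages of squared gradients and squared updates), which corresponds to the weighted sums above; the test function, $\rho$, $\epsilon$, and the scale $\alpha$ are assumptions for the example.

```python
import numpy as np

def grad(x):
    """Gradient of a simple quadratic test function f(x) = 0.5 * ||x||^2."""
    return x

x = np.array([5.0, -3.0])
rho, eps, alpha = 0.95, 1e-6, 1.0
Eg2 = np.zeros_like(x)     # running average of squared gradients
Edx2 = np.zeros_like(x)    # running average of squared updates

for t in range(200):
    g = grad(x)
    Eg2 = rho * Eg2 + (1 - rho) * g**2
    eta = alpha * np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps)   # adaptive per-parameter learning rate
    dx = -eta * g
    Edx2 = rho * Edx2 + (1 - rho) * dx**2
    x = x + dx

print(x)   # approaches the minimum at the origin
```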

Bayesian Optimization
- Optimize selected network parameters, e.g. the decay rate $\rho$
- The objective function is unknown (i.e. the mapping parameters $\to$ prediction errors)
Bayesian Optimization: optimize while approximating the objective function, inferring the objective function from data, i.e. pairs [parameters, errors]. The loop:
1. Initialize the parameters
2. Train the network with the parameters
3. Compute the prediction error on validation data
4. Learn the objective function: parameters $\to$ validation error
5. Choose new parameters according to a selection criterion, and repeat from step 2

Bayesian Optimization with Gaussian Prior
- Learn the objective function with Gaussian process regression
- The GP prediction for a test point $x_t$ is $\mathcal{N}(\mu(x_t), \nu(x_t))$
- The selection criterion is computed based on $\mu(x_t)$ and $\nu(x_t)$
Expected Improvement criterion for a given point $x$:
$$EI(x) = \sqrt{\nu(x)}\, \big[\gamma(x)\,\Phi_{norm}(\gamma(x)) + \phi_{norm}(\gamma(x))\big], \qquad \gamma(x) = \frac{y_{best} - \mu(x)}{\sqrt{\nu(x)}}$$
where $\Phi_{norm}(\cdot)$ is the normal cumulative distribution function, $\phi_{norm}(\cdot)$ the normal probability density function, and $y_{best}$ the currently best measurement/observation.
[Figure: GP posterior mean and variance with the Expected Improvement criterion plotted over the parameter range.]
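A short NumPy/SciPy sketch of the Expected Improvement computation from a GP posterior mean and variance; the posterior values and the best observation are placeholders.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, nu, y_best):
    """EI for minimization, given GP posterior mean mu and variance nu at candidate points."""
    sigma = np.sqrt(nu)
    gamma = (y_best - mu) / sigma
    return sigma * (gamma * norm.cdf(gamma) + norm.pdf(gamma))

mu = np.array([0.20, 0.15, 0.30])      # predicted validation errors (placeholder)
nu = np.array([0.01, 0.04, 0.02])      # predictive variances (placeholder)
y_best = 0.18                          # best validation error observed so far

ei = expected_improvement(mu, nu, y_best)
print(ei, ei.argmax())                 # pick the candidate with the highest EI
```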

Ensembles: Boosting Prediction Performance
A standard ML approach to improve test performance is to combine the outputs of different models. We can use different random weight initializations and train with/without the validation set.
How to combine the predictions?
- Each network gives us a prediction, e.g. $p_1 = (0.4, 0.3, 0.3)$, $p_2 = (0.35, 0.35, 0.3)$, $p_3 = (0.1, 0.9, 0.0)$
- We can take the arithmetic or geometric mean, e.g. $p_{avg} \approx (0.28, 0.52, 0.2)$
- The class prediction is the index with the highest score, e.g. class 2
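A tiny NumPy sketch of combining the three example predictions by arithmetic and (renormalized) geometric means.

```python
import numpy as np

P = np.array([[0.40, 0.30, 0.30],      # p_1
              [0.35, 0.35, 0.30],      # p_2
              [0.10, 0.90, 0.00]])     # p_3

p_arith = P.mean(axis=0)                               # arithmetic mean
p_geom = P.prod(axis=0) ** (1.0 / len(P))              # geometric mean
p_geom = p_geom / p_geom.sum()                         # renormalize to a distribution

print(np.round(p_arith, 2))            # [0.28 0.52 0.2 ]
print(np.round(p_geom, 2))
print(p_arith.argmax() + 1)            # predicted class: 2
```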

Results on Traffic Sign Recognition

  CCR (%)   Team            Deep Learning used
  99.46     IDSIA           yes
  98.84     human average   -
  98.80     BOSCH           yes (deep nets)
  98.31     Sermanet        yes
  96.14     CAOR            no
  95.68     INI-RTCV        no
  93.18     INI-RTCV        no
  92.34     INI-RTCV        no

Correct classification rate (CCR) on the final stage of the German Traffic Sign Recognition Benchmark: 38,880 images for training and 12,960 images for testing from 43 different German road sign classes.

Remarks
- There are various approaches for optimizing and training Deep Nets, e.g. Bayesian optimization, pre-processing, dropout, ...
- The choice of appropriate techniques depends on the application, on experience and on knowledge in Machine Learning: try out different training approaches, gain experience, and keep up with the developments in the Deep Learning community.
- Further research problems: Bayesian Deep Learning, unsupervised learning, generative deep models, deep reinforcement learning, adversarial problems, etc.
[Figures: classification error rates of traditional vs. deep-learning approaches by year, 2010-2014 (error rates such as 26.2% and 16.4% marked; for 2013/2014 only the 10 best results are plotted), and the number of Deep Learning publications on Google Scholar, 2000-2014 (as of October 14, 2014).]

Deep Learning: further reading
- Weston, Ratle & Collobert: Deep Learning via Semi-Supervised Embedding, ICML 2008.
- Hinton & Salakhutdinov: Reducing the Dimensionality of Data with Neural Networks, Science 313, pp. 504-507, 2006.
- Bengio & LeCun: Scaling Learning Algorithms Towards AI. In Bottou et al. (Eds.), Large-Scale Kernel Machines, MIT Press, 2007.
- Hadsell, Chopra & LeCun: Dimensionality Reduction by Learning an Invariant Mapping, CVPR 2006.
- Glorot & Bengio: Understanding the Difficulty of Training Deep Feedforward Neural Networks, AISTATS 2010.
... and newer papers citing those.