Neural Networks: promises of current research


April 2008 www.apstat.com

Current research on deep architectures. A few labs are currently researching deep neural network training: Geoffrey Hinton's lab at U. Toronto, Yann LeCun's lab at NYU, our LISA lab at U. Montreal... and many others to come, we hope! I will talk about my latest work (with Hugo Larochelle, Yoshua Bengio, Pierre-Antoine Manzagol): Extracting and Composing Robust Features with Denoising Autoencoders, upcoming at ICML 2008.

The problem. Building good predictors on complex domains means learning complicated functions. These are best represented by multiple levels of non-linear operations, i.e. deep architectures. Learning the parameters of deep architectures has proved to be challenging!

Training deep architectures: attempted solutions. Solution 1: initialize at random and do gradient descent (Rumelhart et al., 1986). Disappointing performance; gets stuck in poor solutions. Solution 2: Deep Belief Nets (Hinton et al., 2006): initialize by stacking Restricted Boltzmann Machines, fine-tune with the up-down algorithm. Impressive performance. The key seems to be good unsupervised layer-by-layer initialization... Solution 3: initialize by stacking autoencoders, fine-tune with gradient descent (Bengio et al., 2007; Ranzato et al., 2007). A simple, generic procedure, no sampling required. Performance almost as good as Solution 2... but not quite. Can we do better?


Can we do better? Open question: what would make a good unsupervised criterion for finding good initial intermediate representations? Inspiration: our ability to fill in the blanks in sensory input: missing pixels, small occlusions, image from sound, ... Good fill-in-the-blanks performance indicates that the distribution is well captured, an old notion of associative memory (which motivated Hopfield models (Hopfield, 1982)). What we propose: unsupervised initialization by explicit fill-in-the-blanks training.

The denoising autoencoder. Clean input x ∈ [0, 1]^d is partially destroyed, yielding the corrupted input x̃ ~ q_D(x̃ | x). The corrupted x̃ is mapped to a hidden representation y = f_θ(x̃). From y we reconstruct z = g_θ(y). Parameters are trained to minimize the cross-entropy reconstruction error L_H(x, z).

The input corruption process q_D(x̃ | x). Choose a fixed proportion ν of the components of x at random and reset their values to 0. This can be viewed as replacing a component considered missing by a default value. Other corruption processes could be considered.
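As a concrete illustration, here is a minimal NumPy sketch of this masking corruption; the function name and defaults are mine, chosen for illustration rather than taken from the talk.

import numpy as np

def masking_noise(x, nu, rng=None):
    """Corrupt input x by resetting a fixed proportion nu of its components to 0.

    x  : 1-D array with values in [0, 1] (e.g. a flattened 28x28 image, d = 784)
    nu : proportion of components to destroy, e.g. 0.25
    """
    rng = np.random.default_rng() if rng is None else rng
    x_tilde = x.copy()
    d = x.shape[0]
    # pick nu*d distinct component indices at random and zero them out
    idx = rng.choice(d, size=int(round(nu * d)), replace=False)
    x_tilde[idx] = 0.0
    return x_tilde

# example: corrupt a random stand-in "image" with 25% of its pixels destroyed
x = np.random.default_rng(0).random(784)
x_tilde = masking_noise(x, 0.25)
print((x_tilde == 0).sum(), "of", x.size, "components set to 0")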

Form of parameterized mappings. We use standard sigmoid network layers: y = f_θ(x̃) = sigmoid(W x̃ + b), with W a d'×d weight matrix and b a d'-dimensional bias, and z = g_θ(y) = sigmoid(W' y + b'), with W' a d×d' weight matrix and b' a d-dimensional bias. Denoising using autoencoders was actually introduced much earlier (LeCun, 1987; Gallinari et al., 1987), as an alternative to Hopfield networks (Hopfield, 1982).
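To make the objective concrete, the following is a minimal NumPy sketch of one stochastic gradient step of denoising-autoencoder training, assuming untied weights, the masking corruption above, and a plain SGD update; all names and hyper-parameter values are mine, not the settings used in the experiments.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def dae_sgd_step(x, W, b, W_prime, b_prime, nu=0.25, lr=0.1, rng=None):
    """One SGD step of denoising-autoencoder training on a single example x.

    Encoder: y = sigmoid(W x_tilde + b)        (W: d' x d,  b: d')
    Decoder: z = sigmoid(W' y + b')            (W': d x d', b': d)
    Loss:    cross-entropy L_H(x, z) between the clean x and the reconstruction z.
    """
    rng = np.random.default_rng() if rng is None else rng
    d = x.shape[0]

    # corruption: reset a proportion nu of the components of x to 0
    x_tilde = x.copy()
    idx = rng.choice(d, size=int(round(nu * d)), replace=False)
    x_tilde[idx] = 0.0

    # forward pass
    y = sigmoid(W @ x_tilde + b)           # hidden representation
    z = sigmoid(W_prime @ y + b_prime)     # reconstruction of the *clean* input

    eps = 1e-12
    loss = -np.sum(x * np.log(z + eps) + (1 - x) * np.log(1 - z + eps))

    # backward pass (sigmoid output + cross-entropy gives the error signal z - x)
    delta_out = z - x
    grad_W_prime = np.outer(delta_out, y)
    grad_b_prime = delta_out
    delta_hidden = (W_prime.T @ delta_out) * y * (1 - y)
    grad_W = np.outer(delta_hidden, x_tilde)
    grad_b = delta_hidden

    # in-place SGD update
    W -= lr * grad_W
    b -= lr * grad_b
    W_prime -= lr * grad_W_prime
    b_prime -= lr * grad_b_prime
    return loss

# example: a 784 -> 256 denoising autoencoder on a random stand-in input
d, d_hidden = 784, 256
rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(d_hidden, d)); b = np.zeros(d_hidden)
W_prime = rng.normal(scale=0.01, size=(d, d_hidden)); b_prime = np.zeros(d)
x = rng.random(d)
print(dae_sgd_step(x, W, b, W_prime, b_prime, rng=rng))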

Learning deep networks: layer-wise initialization. 1. Learn the first mapping f_θ by training it as a denoising autoencoder. 2. Remove the scaffolding. Use f_θ directly on the input, yielding the higher-level representation. 3. Learn the next-level mapping f_θ^(2) by training a denoising autoencoder on the current-level representation. 4. Iterate to initialize subsequent layers.
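Putting the pieces together, here is a compact sketch of this greedy layer-wise procedure (a condensed, batch version of the training step above, applied layer by layer); the helper names train_dae_layer and pretrain_stack and all hyper-parameter values are placeholders of mine, not the settings used in the experiments.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_dae_layer(data, n_hidden, nu=0.25, lr=0.1, n_epochs=5, seed=0):
    """Train one denoising-autoencoder layer on `data` (n_examples x d) and
    return the encoder parameters (W, b); the decoder is discarded afterwards."""
    rng = np.random.default_rng(seed)
    d = data.shape[1]
    W = rng.normal(scale=0.01, size=(n_hidden, d)); b = np.zeros(n_hidden)
    Wp = rng.normal(scale=0.01, size=(d, n_hidden)); bp = np.zeros(d)
    for _ in range(n_epochs):
        for x in data:
            x_tilde = x.copy()
            idx = rng.choice(d, size=int(round(nu * d)), replace=False)
            x_tilde[idx] = 0.0
            y = sigmoid(W @ x_tilde + b)
            z = sigmoid(Wp @ y + bp)
            delta_out = z - x                        # sigmoid + cross-entropy
            delta_hid = (Wp.T @ delta_out) * y * (1 - y)
            Wp -= lr * np.outer(delta_out, y);  bp -= lr * delta_out
            W  -= lr * np.outer(delta_hid, x_tilde); b -= lr * delta_hid
    return W, b   # the "scaffolding" (corruption and decoder) is dropped here

def pretrain_stack(data, layer_sizes, nus):
    """Greedy layer-wise initialization: train each layer as a denoising
    autoencoder on the representation produced by the layers below it."""
    encoders, rep = [], data
    for n_hidden, nu in zip(layer_sizes, nus):
        W, b = train_dae_layer(rep, n_hidden, nu=nu)
        encoders.append((W, b))
        rep = sigmoid(rep @ W.T + b)   # propagate to get the next-level representation
    return encoders

# example: 3 hidden layers on a small random "dataset" (stand-in for 28x28 images)
data = np.random.default_rng(1).random((100, 784))
encoders = pretrain_stack(data, layer_sizes=[256, 256, 256], nus=[0.1, 0.1, 0.1])
print([W.shape for W, _ in encoders])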

Learning deep networks: supervised fine-tuning. The initial deep mapping f_θ, f_θ^(2), f_θ^(3) was learnt in an unsupervised way and serves as the initialization for a supervised task. An output layer f_θ^sup is added on top, and the whole network is then fine-tuned globally by gradient descent on the supervised criterion (comparing the output to the target).
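And a matching sketch of the fine-tuning stage, under the same assumptions as the previous sketches: a softmax output layer (my choice for illustration; the talk does not specify the output layer's form) is placed on top of the pretrained encoders and the whole stack is updated by backpropagation on a supervised negative log-likelihood cost.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def finetune_step(x, t, encoders, W_out, b_out, lr=0.1):
    """One supervised gradient step through the whole pretrained stack.

    x        : input vector (no corruption is applied at this stage)
    t        : integer class label
    encoders : list of (W, b) pairs obtained from unsupervised pretraining
    W_out, b_out : randomly initialized output (softmax) layer
    """
    # forward pass through the stacked encoders, keeping activations for backprop
    activations = [x]
    for W, b in encoders:
        activations.append(sigmoid(W @ activations[-1] + b))
    p = softmax(W_out @ activations[-1] + b_out)
    loss = -np.log(p[t] + 1e-12)           # supervised cost (negative log-likelihood)

    # backward pass: output layer first, then down through every encoder layer
    delta = p.copy(); delta[t] -= 1.0      # softmax + NLL error signal
    grad_W_out = np.outer(delta, activations[-1])
    delta_below = W_out.T @ delta
    W_out -= lr * grad_W_out
    b_out -= lr * delta
    delta = delta_below
    for i in reversed(range(len(encoders))):
        W, b = encoders[i]
        h = activations[i + 1]
        delta_pre = delta * h * (1 - h)               # back through the sigmoid
        delta = W.T @ delta_pre                       # error signal for the layer below
        W -= lr * np.outer(delta_pre, activations[i])  # updates the arrays in `encoders` in place
        b -= lr * delta_pre
    return loss

# example: a 784-256-256 stack (e.g. the output of the pretraining sketch above) plus 10 classes
rng = np.random.default_rng(2)
encoders = [(rng.normal(scale=0.01, size=(256, 784)), np.zeros(256)),
            (rng.normal(scale=0.01, size=(256, 256)), np.zeros(256))]
W_out = rng.normal(scale=0.01, size=(10, 256)); b_out = np.zeros(10)
x, t = rng.random(784), 3
print(finetune_step(x, t, encoders, W_out, b_out))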

Perspectives on denoising autoencoders: manifold learning perspective. The denoising autoencoder can be seen as a way to learn a manifold. Suppose the training data concentrate near a low-dimensional manifold. Corrupted examples, obtained by applying the corruption process q_D(X̃ | X), will lie farther from the manifold. The model learns p(X | X̃) to project them back onto the manifold. The intermediate representation Y can be interpreted as a coordinate system for points on the manifold. (Figure: clean points x near the manifold, corrupted points x̃ away from it, and the mapping g_θ(f_θ(x̃)) projecting them back.)

Perspectives on denoising autoencoders: information theoretic perspective. Consider X ~ q(X), with q unknown, X̃ ~ q_D(X̃ | X), and Y = f_θ(X̃). It can be shown that minimizing the expected reconstruction error amounts to maximizing a lower bound on the mutual information I(X; Y). Denoising autoencoder training can thus be justified by the objective that the hidden representation Y should capture as much information as possible about X, even though Y is computed from the corrupted input.
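For completeness, here is a sketch of the standard variational argument behind this kind of bound (the precise statement is the one established in the paper): I(X; Y) = H(X) - H(X | Y) = H(X) + E[log p(X | Y)] ≥ H(X) + E[log q*(X | Y)] for any conditional distribution q*(X | Y), since the gap is a Kullback-Leibler divergence and hence non-negative. Choosing q*(x | y) so that -log q*(x | y) is exactly the cross-entropy reconstruction error L_H(x, g_θ(y)), and noting that H(X) does not depend on θ, maximizing this lower bound over θ is the same as minimizing the expected reconstruction error.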

Perspectives on denoising autoencoders: generative model perspective. Denoising autoencoder training can be shown to be equivalent to maximizing a variational bound on the likelihood of a generative model for the corrupted data. (Figure: two graphical models relating hidden factors Y, observed data X and corrupted data X̃: the variational model and the generative model.)

Benchmark problems: overview. We used a benchmark of challenging classification problems (Larochelle et al., 2007), available at http://www.iro.umontreal.ca/~lisa/icml2007: 8 classification problems on 28×28 grayscale pixel images (d = 784). The first 5 are harder variations on the MNIST digit classification problem. The last 3 are challenging binary classification tasks. All problems have separate training, validation and test sets (test set size: 50 000).

Benchmark problems Variations on MNIST digit classification basic: subset of original MNIST digits: 10 000 training samples, 2 000 validation samples, 50 000 test samples. rot: applied random rotation (angle between 0 and 2π radians) bg-rand: background made of random pixels (value in 0... 255) bg-img: background is random patch from one of 20 images rot-bg-img: combination of rotation and background image
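As a purely illustrative sketch (the exact generation procedure for the benchmark may differ), a rotated variant like rot can be approximated by rotating each image by a uniformly random angle; SciPy's image rotation is used here for convenience.

import numpy as np
from scipy.ndimage import rotate

def make_rot_variant(images, rng=None):
    """Rotate each 28x28 image by a random angle drawn uniformly in [0, 2*pi) radians.

    `images` has shape (n, 28, 28) with pixel values in [0, 1].
    """
    rng = np.random.default_rng() if rng is None else rng
    out = np.empty_like(images)
    for i, img in enumerate(images):
        angle_deg = np.degrees(rng.uniform(0.0, 2.0 * np.pi))
        # bilinear interpolation, keep the 28x28 frame, clip back into [0, 1]
        out[i] = np.clip(rotate(img, angle_deg, reshape=False, order=1), 0.0, 1.0)
    return out

# example on random stand-in images
imgs = np.random.default_rng(0).random((5, 28, 28))
print(make_rot_variant(imgs).shape)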

Benchmark problems Shape discrimination rect: discriminate between tall and wide rectangles on black background. rect-img: borderless rectangle filled with random image patch. Background is a different image patch. convex: discriminate between convex and non-convex shapes.

Experiments. We compared the following algorithms on the benchmark problems: SVM_rbf: Support Vector Machines with Gaussian kernel. DBN-3: Deep Belief Nets with 3 hidden layers (stacked Restricted Boltzmann Machines trained with contrastive divergence). SAA-3: Stacked Autoassociators with 3 hidden layers (no denoising). SdA-3: Stacked Denoising Autoassociators with 3 hidden layers. Hyper-parameters for all algorithms were tuned based on classification performance on the validation set (in particular hidden-layer sizes, and ν for SdA-3).

Performance comparison: results. Test error (%, ± confidence interval); ν is the destruction level selected for SdA-3.

Dataset       SVM_rbf       DBN-3         SAA-3         SdA-3 (ν)
basic          3.03±0.15     3.11±0.15     3.46±0.16     2.80±0.14 (10%)
rot           11.11±0.28    10.30±0.27    10.30±0.27    10.29±0.27 (10%)
bg-rand       14.58±0.31     6.73±0.22    11.28±0.28    10.38±0.27 (40%)
bg-img        22.61±0.37    16.31±0.32    23.00±0.37    16.68±0.33 (25%)
rot-bg-img    55.18±0.44    47.39±0.44    51.93±0.44    44.49±0.44 (25%)
rect           2.15±0.13     2.60±0.14     2.41±0.13     1.99±0.12 (10%)
rect-img      24.04±0.37    22.50±0.37    24.05±0.37    21.59±0.36 (25%)
convex        19.13±0.34    18.63±0.34    18.41±0.34    19.06±0.34 (10%)

Performance comparison: analysis of the results. Pretraining as denoising autoencoders appears to be a very effective initialization technique. SdA-3 outperformed SVMs on all tasks, the difference being statistically significant on 6 of the 8 problems. Compared to Deep Belief Nets, SdA-3 was significantly superior on 4 of the problems, while being significantly worse in only 1 case. More striking is the large gain from using noise (SdA-3) compared to the noise-free case (SAA-3). Note that the best results on most MNIST tasks were obtained with a first hidden layer of size 2000, i.e. with over-complete representations (since d = 784).

Analyzing the filters learnt by denoising autoencoders To view the effect of noise, we look at filters learnt by the first layer on mnist basic for different destruction levels ν. Each filter corresponds to the weights linking input x and one of the first hidden layer units. Filters at the same position in the image for different destruction levels are related only by the fact that the autoencoders were started from the same random initialization point.
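For readers who want to reproduce this kind of figure, a small visualization sketch (my own, using matplotlib): each row of the first-layer weight matrix W is reshaped back into a 28x28 image.

import numpy as np
import matplotlib.pyplot as plt

def plot_filters(W, n_rows=10, n_cols=10, img_shape=(28, 28)):
    """Display the first n_rows*n_cols filters: each row of W (shape d' x 784)
    holds the weights linking the input image to one first-hidden-layer unit."""
    fig, axes = plt.subplots(n_rows, n_cols, figsize=(n_cols, n_rows))
    for ax, w in zip(axes.ravel(), W):
        ax.imshow(w.reshape(img_shape), cmap="gray")
        ax.set_xticks([]); ax.set_yticks([])
    plt.tight_layout()
    plt.show()

# example with random weights standing in for a trained first layer
plot_filters(np.random.default_rng(0).normal(size=(100, 784)))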

Learnt filters (figures, not reproduced here): first-layer filters learnt with 0%, 10%, 25% and 50% of the input destroyed.

Learnt filters: qualitative analysis. With no noise, many filters appear similarly uninteresting (undistinctive, almost uniform grey patches). As we increase the noise level, denoising training forces the filters to differentiate more and capture more distinctive features. Higher noise levels tend to induce less local filters (as we expected). One can distinguish different kinds of filters, from local blob detectors, to stroke detectors, and some full character detectors at the higher noise levels. (Figures: Neuron A at 0%, 10%, 20%, 50% destruction; Neuron B at 0%, 10%, 20%, 50% destruction.)

Conclusion and future work Unsupervised initialization of layers with an explicit denoising criterion appears to help capture interesting structure in the input distribution. This leads to intermediate representations much better suited for subsequent learning tasks such as supervised classification. Resulting algorithm for learning deep networks is simple and improves on state-of-the-art on benchmark problems. Future work will investigate the effect of different types of corruption process.

THANK YOU!

Performance comparison (full table). Test error (%, ± confidence interval); ν is the destruction level selected for SdA-3. Entries were shown in red in the original slide when confidence intervals overlap.

Dataset       SVM_rbf       SVM_poly      DBN-1         DBN-3         SAA-3         SdA-3 (ν)
basic          3.03±0.15     3.69±0.17     3.94±0.17     3.11±0.15     3.46±0.16     2.80±0.14 (10%)
rot           11.11±0.28    15.42±0.32    14.69±0.31    10.30±0.27    10.30±0.27    10.29±0.27 (10%)
bg-rand       14.58±0.31    16.62±0.33     9.80±0.26     6.73±0.22    11.28±0.28    10.38±0.27 (40%)
bg-img        22.61±0.37    24.01±0.37    16.15±0.32    16.31±0.32    23.00±0.37    16.68±0.33 (25%)
rot-bg-img    55.18±0.44    56.41±0.43    52.21±0.44    47.39±0.44    51.93±0.44    44.49±0.44 (25%)
rect           2.15±0.13     2.15±0.13     4.71±0.19     2.60±0.14     2.41±0.13     1.99±0.12 (10%)
rect-img      24.04±0.37    24.05±0.37    23.69±0.37    22.50±0.37    24.05±0.37    21.59±0.36 (25%)
convex        19.13±0.34    19.82±0.35    19.92±0.35    18.63±0.34    18.41±0.34    19.06±0.34 (10%)

References

Bengio, Y., Lamblin, P., Popovici, D., & Larochelle, H. (2007). Greedy layer-wise training of deep networks. Advances in Neural Information Processing Systems 19 (NIPS'06) (pp. 153-160). MIT Press.

Gallinari, P., LeCun, Y., Thiria, S., & Fogelman-Soulie, F. (1987). Mémoires associatives distribuées. Proceedings of COGNITIVA 87. Paris, La Villette.

Hinton, G. E., Osindero, S., & Teh, Y. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18, 1527-1554.

Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, USA, 79.

Larochelle, H., Erhan, D., Courville, A., Bergstra, J., & Bengio, Y. (2007). An empirical evaluation of deep architectures on problems with many factors of variation. Proceedings of the Twenty-Fourth International Conference on Machine Learning (ICML'07) (pp. 473-480). Corvallis, OR: ACM.

LeCun, Y. (1987). Modèles connexionistes de l'apprentissage. Doctoral dissertation, Université de Paris VI.

Ranzato, M., Poultney, C., Chopra, S., & LeCun, Y. (2007). Efficient learning of sparse representations with an energy-based model. Advances in Neural Information Processing Systems 19 (NIPS'06) (pp. 1137-1144). MIT Press.

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533-536.