Extracting and Composing Robust Features with Denoising Autoencoders

Size: px

Start display at page:

Download "Extracting and Composing Robust Features with Denoising Autoencoders"

Jonah Griffith
5 years ago
Views:

1 Presenter: Alexander Truong March 16, 2017 Extracting and Composing Robust Features with Denoising Autoencoders Pascal Vincent, Hugo Larochelle, Yoshua Bengio, Pierre-Antoine Manzagol 1

2 Outline Introduction Autoencoder Other Approaches Analysis Experiment Conclusion 2

3 Introduction Deep architectures efficient for modeling performant at recognition In practice, learning deep architectures was difficult Multi-layer neural networks beyond 1-2 hidden layers provided little benefit (Bengio, 2007) 3

4 Unsupervised pre-training Successful approaches use unsupervised initialization 1. Greedy pre-training by layer 2. Unsupervised learning at each layer 3. Fine-tuning the network for the task Provides good intermediate representations Avoids getting stuck in poor solutions due to random initializations 4

5 Representation Criterion What criteria should an intermediate representation satisfy? Retain minimum information from input Constrained to a given form Goal of authors Investigate additional criteria: robustness to partial destruction of input Introduce robustness to a basic autoencoder 5

6 Autoencoder Can be used for unsupervised training Outputs a reconstruction x of the input x Latent representation y can be different size Parameters θ = W, b, θ = {W, b } x reconstruction g θ y = s(w y + b ) y latent representation or code f θ x = s(wx + b) x input 6

7 Autoencoder Parameters θ, θ are optimized to minimize average reconstruction error of n training inputs θ, θ = argmin θ,θ d 1 n n i=1 L(x i, g θ f θ x i ) L H x, x = [x k log x k + 1 x k log 1 x k ] k=1 Optimization of θ, θ θ, θ = arg min θ,θ E q 0 (X) [L H X, g θ f θ X ] E p(x) [f X ] is expectation q 0 is an empirical distribution of n 7

8 Denoising Autoencoder Perform stochastic mapping x ~ q D x x υ is proportion of destruction for each input x, υd random components are set to 0 Reconstruct clean input x from partially destroyed input x x reconstruction g θ y = s(w y + b ) y latent representation or code x clean input q D f θ x = s(wx + b) x corrupted input 8

9 Denoising Autoencoder Minimize average reconstruction error L H x, x where x = g θ f θ (x ) Compare reconstruction x with clean input x Joint distribution q 0 X, X, Y = q 0 X q D X X δ fθ X (Y) δ u v = 0 when u v Optimize parameters by picking (input sample, randomly corrupted input sample) Denoising Criterion θ, θ = arg min θ,θ E q 0 (X, X) [L H X, g θ f θ X ] 9

10 Stacking k Layers and Fine Tuning Train autoencoders from the output of the previous layer k + 1 layer is optimized w.r.t. a supervised training criterion x መh 1 መh k 1 h (1) h (2) h (k) 10 x h 1 Layer 1 Layer 2 Layer k h k 1

11 Other Approaches Denoising is extensively studied in image processing Proposed procedure criterion to learn intermediate representations not specific to images learns a robust mapping from corrupted inputs 11

12 Analysis The denoising criterion was derived from the perspective of discovering robust representations. arg min θ,θ q 0 (X, X) [L H X, g θ f θ X ] Other perspectives Manifold Learning Stochastic Operator Bottom-up Filtering, Information Theoretical Top-down-Generative Model 12

13 Analysis Manifold Learning Defines a manifold (clean inputs) and mapping from X to manifold g θ (f θ X ) Corrupted inputs X are far from manifold 13

14 Analysis Stochastic Operator p(x X) Joint distribution p X, X = p X p(x X) Empirical distribution q 0 and model p on (X, X) pairs Perform maximum likelihood to yield the denoising criterion 14

15 Analysis Information Theoretical Perspective Maximizing mutual information between Y and X arg max{i X; Y + λ J(Y)} θ Generative Model Perspective From a generative model, recover the denoising training criterion p X, X, Y = p Y p X Y p( X X) 15

16 Experiment MNIST digit classification problem with variations training 2000 validation testing Larachelle (2007) 16

Results SdA-3 achieves equal or greater performance on all but bg-rand Variations: rotation (rot), random background pixels (bg-rand), patches of extracted images (bg-img),

17 Results SdA-3 achieves equal or greater performance on all but bg-rand Variations: rotation (rot), random background pixels (bg-rand), patches of extracted images (bg-img), combination (rot-gb-img), wide rectangles (rect), long rectangles (rect-img), convex shapes (convex) Error Rate Comparison of stacked denoising autoencoders (SdA-3) with other models 17

$procedure At higher noise levels, filters are more sensitive to larger structures Destruction fraction v = 0, 0.$

18 Filter Comparison Without noise, many filters remain uninteresting More features are learned with denoising procedure At higher noise levels, filters are more sensitive to larger structures Destruction fraction v = 0, 0.25,

19 Conclusion Introduced a robust autoencoder for initializing a deep neural network Explicit denoising criterion leads to representations suitable for supervised classification Future work Other types of corruption of both input and representation 19

20 20 Questions?

Neural Networks: promises of current research

Neural Networks: promises of current research April 2008 www.apstat.com Current research on deep architectures A few labs are currently researching deep neural network training: Geoffrey Hinton s lab at U.Toronto Yann LeCun s lab at NYU Our LISA lab