Efficient Learning of Sparse Representations with an Energy-Based Model


1 Efficient Learning of Sparse Representations with an Energy-Based Model. Marc'Aurelio Ranzato, Christopher Poultney, Sumit Chopra, Yann LeCun. Presented by Pascal Lamblin, February 14th, 2007.

2 Outline
1. Pre-processors and feature extractors; coding and decoding
2. Energy function and architecture; the Sparsifying Logistic
3. Learning procedure
4. Feature extraction; initialization of a convolutional neural net; hierarchical extension: topographic maps
(If you were expecting to read nonsense here, too bad for you.)

3 Unsupervised Learning of Representations
- Methods like PCA, ICA, wavelet decompositions...
- Usually, dimensionality is reduced
- But that is not necessary: sparse overcomplete representations
  - Improved separability of classes
  - Better interpretation (sum of basic components)
  - Biological parallel (early visual areas)
(Indeed, I will remain a model of seriousness.)

4 Usual Architecture
[Diagram: input → encoder → code → decoder → output]
- Usually, an encoder and a decoder (possibly sharing parameters)
- The architecture of auto-encoders, restricted Boltzmann machines, PCA, ...
- Sometimes the encoder or decoder is absent (e.g., replaced by a sampling or minimization procedure)
- Here we present a model with both an encoder and a decoder
(I wouldn't want to risk distracting the audience.)

5 Procedure
- Usually (PCA, auto-encoders, ...), we minimize a reconstruction error criterion
- Here, we also want sparsity in the code: an additional constraint
- A Sparsifying Logistic module is placed between the code and the decoder
- Hard to learn through backprop alone: instead, optimize a global energy function that also depends on the codes
- Iterative coordinate descent optimization (like EM)
(But by this point, I think some people are already lost.)

6 Notation and Components
- The input: an image patch X, as a vector
- The encoder: a set of linear filters, the rows of W_C
- The code: a vector Z
- The Sparsifying Logistic: transforms Z into Z̄
- The sparse code vector: Z̄, with components in [0, 1]
- The decoder: reverse linear filters, the columns of W_D
(So there's no harm in asking them a little riddle.)

7 Energy of the System
We want to minimize the global energy of the system, a function of the model's parameters W_C and W_D, the free parameter Z, and the input X:
E(X, Z, W_C, W_D) = E_C(X, Z, W_C) + E_D(X, Z, W_D)
- Code prediction energy: E_C(X, Z, W_C) = 1/2 ||Z − W_C X||²
- Reconstruction energy: E_D(X, Z, W_D) = 1/2 ||X − W_D Z̄||²
There is no hard equality constraint between Z and W_C X, nor between X and W_D Z̄.
(Careful, hang on tight: what goes "Guack! Guack!"?)

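As a concrete aid, here is a minimal numpy sketch of the two energy terms; the array shapes and the name Zbar (for the output of the Sparsifying Logistic) are illustrative assumptions, not from the slides:

```python
import numpy as np

# Minimal sketch (assumed shapes): X is the input of length n, Z the code of
# length m, W_C an m-by-n encoder matrix, W_D an n-by-m decoder matrix, and
# Zbar the output of the Sparsifying Logistic applied to Z.
def code_prediction_energy(X, Z, W_C):
    return 0.5 * np.sum((Z - W_C @ X) ** 2)       # E_C = 1/2 ||Z - W_C X||^2

def reconstruction_energy(X, Zbar, W_D):
    return 0.5 * np.sum((X - W_D @ Zbar) ** 2)    # E_D = 1/2 ||X - W_D Zbar||^2
```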

9 Cool Figure
[Figure: architecture of the energy-based model]
(While you think about it, we'll cut to a commercial break.)

10 In Theory
Consider the k-th training sample:
z̄_i(k) = η e^{β z_i(k)} / ζ_i(k), where ζ_i(k) = η e^{β z_i(k)} + (1 − η) ζ_i(k − 1)
- Like a weighted softmax applied through time
- Higher values of β make the outputs more binary
- Higher values of η increase the firing rate
- The Sparsifying Logistic enforces sparsity of each individual component across the examples; there is no sparsity constraint between the units of a code
(Ragoutoutou! My doggy's stew, mmm, I'm crazy about it!)
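A minimal numpy sketch of one step of this running-average form; the values of η and β here are illustrative assumptions, not taken from the slides:

```python
import numpy as np

def sparsifying_logistic_step(z, zeta_prev, eta=0.02, beta=10.0):
    """One step of the running form: zeta_i accumulates a weighted softmax
    of the code values through time, and zbar_i(k) lies in (0, 1]."""
    num = eta * np.exp(beta * z)           # eta * exp(beta * z_i(k))
    zeta = num + (1.0 - eta) * zeta_prev   # zeta_i(k)
    return num / zeta, zeta                # zbar_i(k), updated zeta_i(k)
```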

11 In Practice
Starting from z̄_i(k) = η e^{β z_i(k)} / ζ_i(k) and ζ_i(k) = η e^{β z_i(k)} + (1 − η) ζ_i(k − 1), we can rewrite
z̄_i(k) = [1 + ((1 − η)/η) ζ_i(k − 1) e^{−β z_i(k)}]^{−1}
- We learn ζ_i across the training set, then fix it
- The module becomes a logistic function with fixed gain and learned bias
- This version of the Sparsifying Logistic is deterministic and does not depend on the ordering of the samples
(Let's carry on in a relaxed mood.)
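The deterministic form then reduces to an ordinary logistic; a sketch under the same assumptions as above (η, β, and the frozen ζ are illustrative):

```python
import numpy as np

def sparsifying_logistic_fixed(z, zeta, eta=0.02, beta=10.0):
    """Deterministic version: zeta has been estimated over the training set
    and frozen, leaving a logistic with fixed gain beta and a learned
    per-unit bias of -log(((1 - eta) / eta) * zeta)."""
    return 1.0 / (1.0 + ((1.0 - eta) / eta) * zeta * np.exp(-beta * z))
```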

12 Procedure
We want to minimize
E(W_C, W_D, Z^1, ..., Z^P) = Σ_{i=1}^{P} [ E_C(X^i, Z^i, W_C) + E_D(X^i, Z^i, W_D) ]
by the procedure
{W_C, W_D} = argmin_{W_C, W_D} min_{Z^1, ..., Z^P} E(W_C, W_D, Z^1, ..., Z^P)
1. Find the optimal Z^i, given W_C and W_D
2. Update the weights W_C and W_D, given the Z^i found at step 1, so as to minimize the energy
3. Iterate until convergence
(Two cows are grazing in a meadow.)

13 Online Version
We consider one sample X at a time. The cost to minimize is C = E_C(X, Z, W_C) + E_D(X, Z, W_D).
1. Initialize Z to Z_init = W_C X
2. Minimize C with respect to Z by gradient descent, initialized at Z_init
3. Compute the gradient of C with respect to W_C and W_D, and perform one gradient step
We iterate over all samples until convergence, as in the sketch below.
(One says to the other: "Doesn't all this mad cow business worry you?")
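A minimal numpy sketch of one online step, under assumed shapes and illustrative step sizes; it reuses the hypothetical sparsifying_logistic_fixed from the previous sketch:

```python
import numpy as np

def online_step(X, W_C, W_D, zeta, n_code_steps=5, lr_z=0.1, lr_w=0.01,
                eta=0.02, beta=10.0):
    """One pass of the online procedure for a single sample X."""
    Z = W_C @ X                                    # 1. initialize the code
    for _ in range(n_code_steps):                  # 2. a few gradient steps on Z
        Zbar = sparsifying_logistic_fixed(Z, zeta, eta, beta)
        dZbar = beta * Zbar * (1.0 - Zbar)         # derivative of the logistic
        grad_Z = (Z - W_C @ X) + dZbar * (W_D.T @ (W_D @ Zbar - X))
        Z -= lr_z * grad_Z
    # 3. one gradient step on the weights, with the code Z held fixed
    Zbar = sparsifying_logistic_fixed(Z, zeta, eta, beta)
    W_C -= lr_w * np.outer(W_C @ X - Z, X)         # gradient of E_C wrt W_C
    W_D -= lr_w * np.outer(W_D @ Zbar - X, Zbar)   # gradient of E_D wrt W_D
    return W_C, W_D
```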

14 So, What Happens?
- Only a few steps of gradient descent are needed to minimize over Z
- At the end of training, even Z_init = W_C X is accurate enough
- So E_C(X, Z, W_C) = 1/2 ||Z − W_C X||² is minimized
- The reconstruction errors from Z_init are also low
- So E_D(X, Z, W_D) = 1/2 ||X − W_D Z̄||² is also minimized
- The minimization procedure manages to drive both energy terms down
- Imposing the hard constraint W_C X = Z does not work, because of the saturation of the sparsifying module
- An L1 penalty term is added on W_C, and an L2 penalty term on W_D
(And the other replies: "Not at all, you can see I'm a rabbit!")

15 Natural Image Patches
- Patches from the Berkeley segmentation data set
- Codes of length 200; training takes minutes on a 2 GHz processor for 200 filters on 100,000 patches
- Filters learned by the decoder are spatially localized and similar to Gabor wavelets, like the receptive fields of V1 neurons
- W_C and W_D are really close after the optimization
(Here you can pretend to take an interest in the talk: there are pictures.)

16 On the MNIST Digit Recognition Data Set
- Input is the whole image (not a patch)
- Codes of length 196 = 14 × 14
- [Figure: some encoder filters, and an example of digit reconstruction]
- Stroke detectors are learned
- Reconstruction: a sum of a few parts
(More pictures; they help pass the time until the end.)

17 On MNIST
- Train filters on 5 × 5 image patches
- Codes of length 50
- Initialize a network with 50 features on layers 1 and 2, 50 on layers 3 and 4, 200 on layer 5, and 10 output units (see the sketch below)

Misclassification    Random init.    Pre-training
No distortions       0.70%           0.60%
Distortions          0.49%           0.39%

(All right, the suspense has lasted long enough.)
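A hypothetical numpy sketch of this initialization, assuming the 50 learned 5 × 5 filters are the rows of the encoder matrix W_C; the supervised net in the slides is deeper, and this only illustrates the first layer:

```python
import numpy as np

W_C = np.random.randn(50, 25)   # stand-in for the 50 learned 5x5 filters

def conv1(image):
    """First layer of the supervised net: responses of the 50 learned
    filters at every valid position of a 2-D input image."""
    H, W = image.shape
    out = np.empty((50, H - 4, W - 4))
    for i in range(H - 4):
        for j in range(W - 4):
            out[:, i, j] = W_C @ image[i:i + 5, j:j + 5].ravel()
    return out

features = conv1(np.random.randn(28, 28))   # e.g. an MNIST-sized image
```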

18 Natural Image Patches (Topographic Maps)
- Patches from the Berkeley segmentation data set
- Codes of length 400
- Nearby filters learn similar weights
[Diagram: hierarchical architecture. Input X, encoder W_c and decoder W_d with Euclidean-distance energies E_c and E_d; code level 1 (size K) is mapped by a convolution and the Sparsifying Logistic to code level 2, the sparse code Z̄.]
(So the answer is...)

19 Conclusion
- An energy-based model for unsupervised learning of sparse overcomplete representations
- Fast and accurate processing after learning
- Sparsifying each unit across the dataset seems easier than sparsifying each example across the code units
- Can be extended to non-linear encoders and decoders
- The sparse code can be used as input to another feature extractor
(A dugg! A duck with a cold.)

20 Questions? Thank you for your attention.

21 Questions? Hopefully not... Thank you for your attention.
