Contractive Auto-Encoders: Explicit Invariance During Feature Extraction
Akarsh Pokkunuru, EECS Department, 03-16-2017
AGENDA
- Introduction to auto-encoders
- Types of auto-encoders
- Analysis of different auto-encoders
- Contractive auto-encoder
- Results and benchmarking tests
- Conclusion
Introduction to Auto-Encoders
Auto-Encoder Introduction
Notation: auto-encoder = AE.
An AE learns to retain the useful information in the input and discard the rest.
AEs are a great technique for:
- Characterizing the input distribution
- Dimensionality reduction
- Rich feature extraction
Using fewer hidden-layer nodes than input nodes creates a bottleneck; if the bottleneck is too narrow, the AE fails to extract enough useful information.
Auto-Encoder Illustration
Composed of two parts:
- Encoder
- Decoder
Auto-Encoder Mathematical Expression
The encoder maps the input x to a hidden, higher-level representation:
  h = f(x) = s_f(W x + b_h)
where h is the hidden-layer representation, s_f is the encoder activation, W is the weight matrix, and b_h is the bias. The encoder output is a reduced-dimension, compact representation of the data.
Auto-Encoder Mathematical Expression Cont.
The decoder tries to reconstruct the original input with as little error as possible:
  y = g(h) = s_g(W' h + b_y)
where g(h) is the decoder function, s_g is the decoder activation, W' is the transpose of the encoder weight matrix (tied weights), and b_y is the bias.
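To make the encoder/decoder pair concrete, here is a minimal NumPy sketch with sigmoid activations and tied weights (both choices match the experimental setting later in the deck). The dimensions and initialization are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assumed dimensions: 784-dim input (e.g. a flattened MNIST digit),
# 1000 hidden units as in the experimental setting below.
d_x, d_h = 784, 1000
rng = np.random.default_rng(0)
W   = rng.normal(0, 0.01, size=(d_h, d_x))  # tied weight matrix, shared by both parts
b_h = np.zeros(d_h)                         # encoder bias
b_y = np.zeros(d_x)                         # decoder bias

def encode(x):
    # h = f(x) = s_f(W x + b_h)
    return sigmoid(W @ x + b_h)

def decode(h):
    # y = g(h) = s_g(W' h + b_y); tied weights, so the decoder uses W transpose
    return sigmoid(W.T @ h + b_y)

x = rng.random(d_x)    # toy input in [0, 1]
y = decode(encode(x))  # reconstruction of x
```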
Auto-Encoder Cont.
Types of activation functions used:
- Linear (identity, binary, etc.)
- Non-linear (sigmoid, tanh, etc.)
Why linear activation?
- Very simple to implement
- But no interesting information at the output
Why non-linear activation?
- Feature-rich output
- Higher computational burden
- Very popular
Training AE and Cost Function
- Initialize the weights and biases of the encoder and decoder.
- Train on the data set, minimizing the reconstruction error / cost function:
  tau_AE(theta) = sum_{x in D} L(x, g(f(x)))
where L is the reconstruction error function (e.g. mean squared error or cross-entropy) and tau_AE(theta) is the cost function.
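Continuing the sketch above, the cost tau_AE(theta) with a cross-entropy reconstruction error could be computed as follows (toy data, illustrative only):

```python
def cross_entropy(x, y, eps=1e-12):
    # L(x, y) = -sum_i [ x_i * log(y_i) + (1 - x_i) * log(1 - y_i) ]
    y = np.clip(y, eps, 1 - eps)  # avoid log(0)
    return -np.sum(x * np.log(y) + (1 - x) * np.log(1 - y))

def ae_cost(X):
    # tau_AE(theta): sum of reconstruction errors over the training examples
    return sum(cross_entropy(x, decode(encode(x))) for x in X)

X = rng.random((100, d_x))  # toy "training set" of 100 examples
print(ae_cost(X))
```

In practice this cost would be minimized by stochastic gradient descent, as the deck notes later.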
Types of Auto-Encoders
Types of Auto-Encoders
Auto-encoders can be categorized as follows:
- Normal AE
- Regularized AE
- Denoising AE
- Sparse AE
- Contractive AE
We will focus on the regularized, denoising, and the proposed contractive AE.
Regularized Auto-Encoder
The idea is to favor small weights by adding a weight-decay penalty to the cost:
  tau_{AE+wd}(theta) = sum_{x in D} L(x, g(f(x))) + lambda * sum_{ij} W_{ij}^2
where lambda controls the strength of the regularization and W is the weight matrix.
Offers significantly better results than the normal AE on most benchmark datasets (MNIST, CIFAR, etc.).
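Reusing the sketch above, weight decay is a one-line addition to the cost; the value of lambda here is an arbitrary placeholder:

```python
def regularized_cost(X, lam=1e-4):
    # tau_AE+wd(theta) = reconstruction error + lambda * sum_ij W_ij^2
    return ae_cost(X) + lam * np.sum(W ** 2)
```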
Denoising Auto-Encoder
A modification of the regularized AE. The idea is to corrupt the input on purpose and train the AE to reconstruct the clean version:
  x~ = x + epsilon
is the corrupted version of the input, and q(x~ | x) is the corruption process (e.g. additive Gaussian noise). Optimization is done by stochastic gradient descent.
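A minimal sketch of the denoising objective, again building on the code above; the Gaussian noise level sigma is an assumed placeholder:

```python
def corrupt(x, sigma=0.1):
    # q(x~ | x): additive isotropic Gaussian corruption
    return x + rng.normal(0.0, sigma, size=x.shape)

def dae_cost(X, sigma=0.1):
    # encode the *corrupted* input, but score against the *clean* target x
    return sum(cross_entropy(x, decode(encode(corrupt(x, sigma)))) for x in X)
```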
Contractive Auto-Encoders
Contractive AE
A modification of the regularized AE. The idea is to avoid/penalize uninteresting features: a penalty on how sensitive the hidden representation is to the input increases robustness, as follows:
  ||J_f(x)||_F^2 = sum_{ij} ( dh_j(x) / dx_i )^2
As a result, the learned mapping is flat, i.e. invariant to small variations in the input samples.
Contractive Auto-Encoder Cont.
||J_f(x)||_F^2 is the squared Frobenius norm of the Jacobian matrix of the encoder.
- If the encoder is linear, the Jacobian penalty reduces to weight decay, so the RAE and CAE are identical.
- The CAE and the denoising AE (DAE) behave similarly, but the CAE encourages flatness directly at the first hidden layer, whereas the DAE encourages flatness only through the reconstruction layer.
- Nevertheless, the cost of computation remains about the same!
Contractive Auto-Encoder Cont.
The cost function is given as follows:
  tau_CAE(theta) = sum_{x in D} [ L(x, g(f(x))) + lambda * ||J_f(x)||_F^2 ]
where lambda has the same role as in the regularized AE and ||J_f(x)||_F^2 is the Jacobian penalty function discussed previously.
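For a sigmoid encoder the Jacobian has the closed form J_f(x) = diag(h * (1 - h)) W, so the Frobenius penalty factorizes and is cheap to compute, which is why the computational cost stays comparable. A sketch continuing the code above (lambda = 0.1 is a placeholder):

```python
def jacobian_penalty(x):
    # For a sigmoid encoder, dh_j/dx_i = h_j (1 - h_j) W_ji, so
    #   ||J_f(x)||_F^2 = sum_j (h_j (1 - h_j))^2 * sum_i W_ji^2
    h = encode(x)
    return np.sum((h * (1 - h)) ** 2 * np.sum(W ** 2, axis=1))

def cae_cost(X, lam=0.1):
    # tau_CAE = sum_x [ L(x, g(f(x))) + lambda * ||J_f(x)||_F^2 ]
    return sum(cross_entropy(x, decode(encode(x))) + lam * jacobian_penalty(x)
               for x in X)
```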
Example
A received-power data set with 4 million samples.
Results and Benchmarking
Considered Models For Comparison
The models considered for performance comparison with the CAE are the auto-encoder variants above (normal, regularized, denoising) and an RBM-based network; the experimental settings for each follow.
Experimental Setting
The experimental setting for the AE variants is as follows:
- Unsupervised training: first a single-layer NN, then extended to multiple layers.
- All auto-encoder variants used tied weights (faster convergence and fewer parameters to optimize).
- A sigmoid activation function for both encoder and decoder.
- A cross-entropy reconstruction error function.
- Optimization by stochastic gradient descent.
- 1000 hidden-layer units during training.
Experimental Setting
The experimental setting for the RBM network is as follows:
- Unsupervised training: first a single-layer NN, then extended to multiple layers.
- Contrastive divergence to train the RBM.
- After training, the feature-extraction parameters W, b are fed to an MLP with an additional randomly initialized output layer for classification.
- Gradient descent is then used for fine-tuning.
Results
Two standard data sets are considered: MNIST and CIFAR-bw (a grayscale version of CIFAR-10). The results are as follows:
Results Cont.
- SAT indicates the average fraction of saturated hidden units. A unit is saturated if its activation is beyond a threshold (e.g. below 0.05 for lower saturation, above 0.95 for upper saturation).
- The Jacobian penalty is a measure of contraction/flatness: the lower the average, the better the invariance to small variations.
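A sketch of how the SAT metric could be measured for a batch, using the sigmoid encoder from the earlier code and the thresholds quoted on the slide:

```python
def saturation_fraction(X, low=0.05, high=0.95):
    # fraction of hidden-unit activations pinned near 0 (lower SAT) or 1 (upper SAT)
    H = sigmoid(X @ W.T + b_h)  # batch of hidden activations, shape (n, d_h)
    return np.mean((H < low) | (H > high))
```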
Results Cont.
Results for stacked (deep) networks are as follows: a two-layer CAE outperforms the other three-layer networks!
How Does Contraction Work?
For a better understanding of how contraction works, we use the following analysis:
- Examine the local behavior of a data point when the contractive penalty is applied.
- Singular values of the Jacobian matrix.
- Contraction affects not just the immediate samples but points beyond them (mean and variance).
- Contraction ratio between two nearby points, d_2(r) / d_1 (defined on a later slide).
- Average contraction ratio for a hidden layer, defined using randomly generated points on a sphere of radius r.
Effect of Singular Values
A large singular value corresponds to a direction of allowed variation. The CAE is better at characterizing low-dimensional inputs.
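The singular value spectrum of the encoder Jacobian can be computed directly from the closed form used earlier; a sketch, reusing x from the first code block:

```python
def jacobian_singular_values(x):
    # J_f(x) = diag(h * (1 - h)) @ W for a sigmoid encoder
    h = encode(x)
    J = (h * (1 - h))[:, None] * W             # shape (d_h, d_x)
    return np.linalg.svd(J, compute_uv=False)  # descending order

sv = jacobian_singular_values(x)
# a fast spectral decay means variation is only "allowed" along a few directions
```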
Contraction Ratio
The contraction ratio can be visualized as follows:
- x_0 is a point from the validation data set.
- x_1 is randomly generated on a sphere of radius r centered at x_0 in input space.
- The contraction ratio between x_0 and x_1 after mapping is given by d_2(r) / d_1, where d_1 is the distance in the original input space and d_2 is the distance in the mapped (feature) space.
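A sketch of the contraction-ratio measurement, reusing the encoder from the first code block and assuming (consistent with the definitions above) that the ratio is feature-space distance over input-space distance:

```python
def contraction_ratio(x0, r):
    v  = rng.normal(size=x0.shape)
    x1 = x0 + r * v / np.linalg.norm(v)  # random point on the sphere of radius r
    d1 = np.linalg.norm(x1 - x0)         # distance in input space (equals r)
    d2 = np.linalg.norm(encode(x1) - encode(x0))  # distance in feature space
    return d2 / d1

# average over many random directions around one point x (here the toy input)
print(np.mean([contraction_ratio(x, r=0.3) for _ in range(100)]))
```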
Contraction ratio vs Radius
- The CAE's contraction ratio keeps decreasing out to the largest radius r.
- The CAE is trying to make the features invariant in all directions around the training examples.
- The reconstruction error ensures that the representation function does not collapse to a constant.
Contraction ratio vs Radius
Measure of the contraction ratio for CIFAR-bw.
Contraction ratio vs Radius
Deeper encoders produce features that are more invariant, over a farther distance.
Conclusion
- The contractive AE uses a Jacobian penalty to induce flatness, i.e. invariance to small variations in the input.
- By examining the contraction ratio and the singular values of the Jacobian, we studied how the CAE becomes robust to small-scale variations in the data set.
- Finally, the penalty function helps the CAE outperform the other auto-encoders.
Thank you