Autoencoding Beyond Pixels Using a Learned Similarity Metric
International Conference on Machine Learning, 2016
Anders Boesen Lindbo Larsen, Hugo Larochelle, Søren Kaae Sønderby, Ole Winther
Technical University of Denmark, University of Copenhagen, Twitter
Presented by: Kevin Liang, 16 December 2016
Introduction
Deep models have achieved impressive successes as discriminative models on large and diverse datasets. Generative models, however, still present difficulties.
Figure: Variational Autoencoder (VAE) generated faces
Current Practice: Pixel-wise Similarity Metrics
Many unsupervised learning methods (e.g., Variational Autoencoders) evaluate performance by comparing the reconstruction with the original input.
Common metric: pixel-wise squared error
Problem: humans do not perceive images element-wise
Proposed: Feature-wise Similarity Metric
Use higher-level, sufficiently invariant representations of images for comparison, and learn the similarity measure rather than hand-designing it.
Propose jointly training a Variational Autoencoder (VAE) and a Generative Adversarial Network (GAN):
The VAE decoder and GAN generator share parameters
The GAN discriminator provides the feature-wise similarity metric
Variational Autoencoder (VAE)
Two networks: Encoder and Decoder

z ~ Enc(x) = q(z|x),   x~ ~ Dec(z) = p(x|z)   (1)

Regularize with a prior over the latent distribution, p(z). Typically: z ~ N(0, I)
Loss to minimize:

L_VAE = -E_{q(z|x)}[ log( p(x|z) p(z) / q(z|x) ) ] = L_llike^pixel + L_prior   (2)

with

L_llike^pixel = -E_{q(z|x)}[ log p(x|z) ]   (3)
L_prior = D_KL( q(z|x) || p(z) )   (4)
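The two VAE loss terms can be sketched numerically. This is a minimal illustration, assuming a Gaussian encoder q(z|x) = N(mu, diag(exp(log_var))) and a unit-variance Gaussian observation model; the function names and toy shapes are ours, not the paper's.

```python
import numpy as np

def kl_prior(mu, log_var):
    """D_KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dims."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

def pixel_llike(x, x_recon):
    """Negative Gaussian log-likelihood up to a constant:
    pixel-wise squared error between input and reconstruction."""
    return 0.5 * np.sum((x - x_recon)**2)

# Toy example: 4-dimensional latent, 8-pixel "image".
mu, log_var = np.zeros(4), np.zeros(4)   # q(z|x) equals the prior here
x = np.ones(8)
loss_vae = pixel_llike(x, x) + kl_prior(mu, log_var)
```

With a perfect reconstruction and a posterior matching the prior, both terms vanish; any mismatch in either increases the loss.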
Generative Adversarial Network (GAN)
Two networks: Generator and Discriminator

x~ = Gen(z),   y = Dis(x) in [0, 1]   (5)

Loss to maximize wrt Dis / minimize wrt Gen:

L_GAN = log(Dis(x)) + log(1 - Dis(Gen(z)))   (6)

with x being a training sample and z ~ p(z)
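A hedged sketch of the objective in Eq. (6), with the discriminator outputs given as plain numbers rather than real networks:

```python
import numpy as np

def gan_loss(dis_real, dis_fake):
    """Eq. (6): log Dis(x) + log(1 - Dis(Gen(z))).
    The discriminator maximizes this; the generator minimizes
    the second term (i.e., tries to push Dis(Gen(z)) toward 1)."""
    return np.log(dis_real) + np.log(1.0 - dis_fake)

# An uncertain discriminator (0.5 on both) gives 2*log(0.5); a sharper
# one (real -> 1, fake -> 0) pushes the value toward 0, its maximum.
```

This makes the adversarial tension visible: the same scalar is a score the discriminator raises and the generator lowers.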
VAE/GAN
In order to properly distinguish generated images from real ones, GAN discriminators must implicitly learn a rich similarity metric.
Let Dis_l(x) be the hidden representation of the l-th layer of the discriminator, and impose a Gaussian observation model in that feature space:

p(Dis_l(x) | z) = N( Dis_l(x) | Dis_l(x~), I ),   where x~ ~ Dec(z)   (7)

Replace the pixel-wise error with a feature-wise one from the discriminator:

L_llike^{Dis_l} = -E_{q(z|x)}[ log p(Dis_l(x) | z) ]   (8)
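The feature-wise error of Eq. (8) is, up to a constant, a squared error in the discriminator's layer-l feature space. In this sketch a fixed random ReLU layer stands in for the learned discriminator features (purely an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 8))           # toy "layer-l" weights

def dis_l(x):
    """Stand-in for the l-th-layer discriminator representation."""
    return np.maximum(W @ x, 0.0)          # ReLU features, 16-dim

def feature_llike(x, x_recon):
    """-log N(Dis_l(x) | Dis_l(x_recon), I) up to an additive constant:
    squared error between the two feature vectors."""
    return 0.5 * np.sum((dis_l(x) - dis_l(x_recon))**2)

x = rng.standard_normal(8)
```

Unlike the pixel-wise loss, two images that differ by small shifts or textures can still land close in feature space, which is the point of the learned metric.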
VAE/GAN Loss

L = L_prior + L_llike^{Dis_l} + L_GAN   (9)

L_llike^{Dis_l}: content error
L_GAN: style error
Decoder and Generator share parameters
Practical Considerations
Limiting error signals to the relevant networks:
Dis should not try to minimize L_llike^{Dis_l}
Do not backpropagate the error signal from L_GAN to Enc
Weighting VAE vs. GAN:
Balance reconstruction against fooling the discriminator when training Dec:

theta_Dec <- theta_Dec - grad_{theta_Dec}( gamma * L_llike^{Dis_l} - L_GAN )   (10)

Discriminating based on samples from p(z) and q(z|x):
In addition to exposing the discriminator to the usual fake and real data, encode the real data and add the reconstructions:

L_GAN = log(Dis(x)) + log(1 - Dis(Dec(z))) + log(1 - Dis(Dec(Enc(x))))   (11)
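The weighted decoder update of Eq. (10) can be illustrated on a single scalar parameter, with the gradients given as plain numbers standing in for backprop results (an assumption for illustration):

```python
def decoder_step(theta, grad_llike, grad_gan, gamma=0.5, lr=0.1):
    """One gradient step on gamma * L_llike - L_GAN, per Eq. (10):
    descend the reconstruction gradient, ascend the GAN gradient
    (the decoder wants to fool the discriminator)."""
    return theta - lr * (gamma * grad_llike - grad_gan)

theta = decoder_step(1.0, grad_llike=2.0, grad_gan=1.0)
# step = -0.1 * (0.5*2.0 - 1.0) = 0.0, so the two signals cancel here
```

The gamma knob trades off reconstruction fidelity against sample realism; the paper tunes it rather than fixing it at the value shown here.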
Summary/Algorithm
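The full training step can be summarized by which loss term updates which network. The sketch below uses trivial stand-in networks (a slicing encoder, a tiling decoder, a fixed random discriminator) purely to show the loss routing; none of these architectures come from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((4, 8))            # toy discriminator weights

def encode(x):                             # toy Enc: first half of x
    return x[:4]

def decode(z):                             # toy Dec: tile z back to 8 dims
    return np.concatenate([z, z])

def dis(x):                                # toy Dis: features + probability
    feats = np.tanh(W @ x)                 # "layer-l" representation
    prob = 1.0 / (1.0 + np.exp(-np.sum(feats)))
    return feats, prob

def losses(x):
    z = encode(x)                          # z ~ q(z|x)
    x_rec = decode(z)                      # reconstruction
    x_gen = decode(rng.standard_normal(4)) # sample with z ~ p(z)
    f_real, p_real = dis(x)
    f_rec, p_rec = dis(x_rec)
    _, p_gen = dis(x_gen)
    l_prior = 0.5 * np.sum(z**2)           # stand-in for the D_KL term
    l_llike = 0.5 * np.sum((f_real - f_rec)**2)               # Eq. (8)
    l_gan = (np.log(p_real) + np.log(1 - p_gen)
             + np.log(1 - p_rec))                             # Eq. (11)
    return l_prior, l_llike, l_gan

# Loss routing, per the practical considerations above:
#   Enc updates on l_prior + l_llike
#   Dec updates on gamma * l_llike - l_gan
#   Dis updates on -l_gan only; l_llike is not backpropagated into Dis
l_prior, l_llike, l_gan = losses(np.ones(8))
```

With these toy networks the reconstruction happens to be exact, so l_llike is zero; in practice each loss is computed on a minibatch and backpropagated only along the routes listed in the comments.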
Models Being Compared
Four models, all with the same architectures for fair comparison:
VAE: vanilla VAE (pixel-wise Gaussian observation model)
VAE_Disl: VAE with the learned feature-wise similarity metric from a GAN
VAE/GAN: like VAE_Disl, but Dec is also optimized wrt L_GAN
GAN: vanilla GAN
Samples
Reconstructions Note: GAN is not included, as it is incapable of performing reconstruction
Visual Attribute Vectors - VAE/GAN
Attribute Query - VAE/GAN
Shortcomings
As a method of training a VAE: train the VAE/GAN model, throw away the GAN, and evaluate the remaining VAE using pixel-wise log-likelihood. Far from competitive with plain VAE models (on the CIFAR-10 dataset).
Semi-supervised: unsupervised pre-training with the VAE/GAN, followed by fine-tuning on a small number of labeled examples. Does not reach state of the art (Ladder Networks, Stacked What-Where Autoencoders).