Deep Generative Models: Variational Autoencoders
Sudeshna Sarkar, 5 April 2017

Generative Nets
Generative nets are generative models that represent probability distributions over multiple variables in some way.
- Directed generative nets
- Differentiable generator nets

Differentiable Generator Nets
Many generative models are based on the idea of using a differentiable generator network. The model transforms samples of latent variables z into samples x, or into distributions over samples x, using a differentiable function g(z; θ^(g)), typically represented by a neural network.
1. Variational autoencoders, which pair the generator net with an inference net
2. Generative adversarial networks, which pair the generator net with a discriminator network
3. Techniques that train generator networks in isolation

Generator Networks
Generator networks are essentially just parameterized computational procedures for generating samples: the architecture provides the family of possible distributions to sample from, and the parameters select a distribution from within that family.
Example: the standard procedure for drawing samples from a normal distribution with mean μ and covariance Σ is to feed samples z from a normal distribution with zero mean and identity covariance into a very simple generator network. This network contains just one affine layer, x = g(z) = μ + Lz, where L is given by the Cholesky decomposition of Σ (so that LLᵀ = Σ).
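As a minimal sketch, this one-affine-layer generator can be written in NumPy as follows; the specific values of mu and Sigma are illustrative placeholders, not from the lecture.

    import numpy as np

    def affine_generator(z, mu, Sigma):
        # One affine layer: x = g(z) = mu + L z, with L the Cholesky factor of Sigma
        L = np.linalg.cholesky(Sigma)
        return mu + L @ z

    # Illustrative 2-D example
    mu = np.array([1.0, -2.0])
    Sigma = np.array([[2.0, 0.5],
                      [0.5, 1.0]])
    z = np.random.randn(2)               # z ~ N(0, I)
    x = affine_generator(z, mu, Sigma)   # x ~ N(mu, Sigma)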

Generator Networks
To generate samples from more complicated distributions, we may use a feedforward network to represent a parametric family of nonlinear functions g, and use training data to infer the parameters selecting the desired function. We can think of g as providing a nonlinear change of variables that transforms the distribution over z into the desired distribution over x; since this mapping is not observed directly, we often use indirect means of learning g.
In some cases, rather than using g to provide a sample of x directly, we use g to define a conditional distribution over x. For example, we could use a generator net whose final layer consists of sigmoid outputs to provide the mean parameters of Bernoulli distributions: p(x_i = 1 | z) = g(z)_i.
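A hedged PyTorch sketch of such a generator net, assuming an MNIST-sized output of 784 pixels and an illustrative latent dimension of 20; the sigmoid layer emits the Bernoulli mean for each output dimension.

    import torch
    import torch.nn as nn

    latent_dim, data_dim = 20, 784   # illustrative sizes (e.g. MNIST pixels)

    # Generator g: z -> Bernoulli means, p(x_i = 1 | z) = g(z)_i
    g = nn.Sequential(
        nn.Linear(latent_dim, 400),
        nn.ReLU(),
        nn.Linear(400, data_dim),
        nn.Sigmoid(),
    )

    z = torch.randn(1, latent_dim)   # z ~ N(0, I)
    probs = g(z)                     # mean parameters of the Bernoulli distributions
    x = torch.bernoulli(probs)       # a binary sample drawn from p(x | z)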

In this case, when we use g to define p(x|z), we impose a distribution over x by marginalizing out z: p(x) = E_z[ p(x|z) ]. The two different approaches to formulating generator nets, emitting the parameters of a conditional distribution versus directly emitting samples, have complementary strengths and weaknesses.

1. Emitting the parameters of a conditional distribution
2. Directly emitting samples
When the generator net defines a conditional distribution over x, it is capable of generating discrete data as well as continuous data. When the generator net provides samples directly, it is capable of generating only continuous data. The advantage of direct sampling is that we are no longer forced to use conditional distributions whose form can easily be written down and algebraically manipulated by a human designer.

Generative modeling seems to be more difficult than classification or regression because the learning process requires optimizing intractable criteria. In differentiable generator nets, the criteria are intractable because the data specifies only the outputs x, not the corresponding inputs z. The learning procedure needs to determine how to arrange z-space in a useful way and, additionally, how to map from z to x. There are several approaches to training differentiable generator nets given only training samples of x.

Variational Autoencoder
Graphical models + neural networks: a directed model that uses learned approximate inference and can be trained purely with gradient-based methods. VAEs let us design complex generative models of data and fit them to large datasets. They can be used to learn a low-dimensional representation Z of high-dimensional data X, such as images (e.g. of faces). X and Z are random variables, so it is possible to sample X from the distribution P(X|Z), thus creating e.g. images of faces, MNIST digits, or speech.

VAE History
Simultaneously discovered by:
- Kingma and Welling. Auto-Encoding Variational Bayes. ICLR, 2014.
- Rezende, Mohamed and Wierstra. Stochastic Backpropagation and Approximate Inference in Deep Generative Models. ICML, 2014.

Manifold Hypothesis

Variational autoencoders (idea of a low-dimensional manifold)

The neural net perspective
The encoder compresses data into a latent space (z); the decoder reconstructs the data given the hidden representation.
Example: x is a 28-by-28-pixel photo of a handwritten digit. The encoder encodes the data into a latent (hidden) representation space z of lower dimension, so the encoder must learn an efficient compression of the data. The lower-dimensional space is stochastic: the encoder outputs the parameters of q_θ(z|x), which is a Gaussian probability density. We can sample from this distribution to get noisy values of the representation z.

The neural net perspective
The decoder is a neural net denoted by p_φ(x|z), with weights and biases φ. Its input is the representation z, and its output is the parameters of the probability distribution of the data. For the 28-by-28-pixel digit example, the decoder takes the latent representation of a digit z and outputs 784 parameters, one for each pixel in the image. Information is lost because we go from a smaller to a larger dimensionality; how much is measured by the reconstruction log-likelihood log p_φ(x|z).
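A hedged PyTorch sketch of this encoder/decoder pair; the hidden width (400) and latent dimension (20) are illustrative assumptions, not values fixed by the lecture.

    import torch
    import torch.nn as nn

    class Encoder(nn.Module):
        # q_theta(z|x): maps a 784-pixel image to the mean and log-variance of a Gaussian over z
        def __init__(self, data_dim=784, hidden=400, latent_dim=20):
            super().__init__()
            self.body = nn.Sequential(nn.Linear(data_dim, hidden), nn.ReLU())
            self.mu = nn.Linear(hidden, latent_dim)
            self.logvar = nn.Linear(hidden, latent_dim)
        def forward(self, x):
            h = self.body(x)
            return self.mu(h), self.logvar(h)

    class Decoder(nn.Module):
        # p_phi(x|z): maps z to 784 Bernoulli means, one per pixel
        def __init__(self, data_dim=784, hidden=400, latent_dim=20):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(latent_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, data_dim), nn.Sigmoid(),
            )
        def forward(self, z):
            return self.net(z)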

Variational Autoencoders
To generate a sample from the model, the VAE first draws a sample z from the code distribution p_model(z). The sample is then run through a differentiable generator network g(z). Finally, x is sampled from a distribution p_model(x; g(z)) = p_model(x|z). During training, the approximate inference network (or encoder) q(z|x) is used to obtain z, and p_model(x|z) is then viewed as a decoder network.
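A brief sketch of this ancestral sampling procedure, reusing the illustrative Decoder class from the sketch above and assuming p_model(z) is a standard normal prior.

    import torch

    decoder = Decoder()                    # from the sketch above (would be trained in practice)
    with torch.no_grad():
        z = torch.randn(1, 20)             # z ~ p_model(z) = N(0, I)
        probs = decoder(z)                 # parameters of p_model(x | z)
        x = torch.bernoulli(probs)         # x ~ p_model(x | z)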

The loss function of the variational autoencoder is the negative log-likelihood with a regularizer. The loss l_i for datapoint x_i is
l_i(θ, φ) = -E_{z ~ q_θ(z|x_i)}[ log p_φ(x_i|z) ] + KL( q_θ(z|x_i) || p(z) ),
and the total loss is L = Σ_i l_i over the N data points. The first term is the reconstruction loss, the expected negative log-likelihood of the i-th datapoint; the expectation is taken with respect to the encoder's distribution over the representations, and this term encourages the decoder to learn to reconstruct the data. The second term is the regularizer: the KL divergence between the encoder's distribution q_θ(z|x_i) and p(z).
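A sketch of this loss for the Gaussian encoder and Bernoulli decoder assumed above, using the closed-form KL divergence between N(μ, σ²) and N(0, I) and the usual single-sample Monte Carlo estimate of the expectation.

    import torch
    import torch.nn.functional as F

    def vae_loss(x, probs, mu, logvar):
        # Reconstruction term: -E_q[log p_phi(x|z)], estimated with one sample of z
        recon = F.binary_cross_entropy(probs, x, reduction="sum")
        # Regularizer: KL( N(mu, sigma^2) || N(0, I) ) in closed form
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        return recon + kl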

In the variational autoencoder, p(z) is specified as a standard normal distribution with mean zero and variance one. This has the effect of keeping the representations of similar digits close together. We train the variational autoencoder using gradient descent to optimize the loss with respect to the parameters θ and φ of the encoder and decoder.
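A minimal training-loop sketch under these assumptions, reusing the illustrative Encoder, Decoder and vae_loss from the sketches above; data_loader is a placeholder for batches of flattened images in [0, 1], and the reparameterized sample z = μ + σ·ε used here is explained on the Reparameterization slide below.

    import torch

    encoder, decoder = Encoder(), Decoder()
    optimizer = torch.optim.Adam(
        list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

    for x in data_loader:                       # placeholder: batches of 784-dim images in [0, 1]
        mu, logvar = encoder(x)
        eps = torch.randn_like(mu)
        z = mu + torch.exp(0.5 * logvar) * eps  # reparameterized sample from q_theta(z|x)
        probs = decoder(z)
        loss = vae_loss(x, probs, mu, logvar)
        optimizer.zero_grad()
        loss.backward()                         # gradients flow to both encoder and decoder
        optimizer.step()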

The probability model perspective
A variational autoencoder contains a specific probability model of data x and latent variables z. The joint probability of the model is p(x, z) = p(x|z) p(z). The generative process for each datapoint i:
- Draw latent variables z_i ~ p(z)
- Draw datapoint x_i ~ p(x|z_i)

Inference in this model
Goal: infer good values of the latent variables given observed data, i.e. the posterior p(z|x) = p(x|z) p(z) / p(x). The evidence p(x) = ∫ p(x|z) p(z) dz requires exponential time to compute, as it needs to be evaluated over all configurations of the latent variables, so we need to approximate this posterior distribution. Variational inference approximates the posterior with a family of distributions q_λ(z|x), where λ indexes the family of distributions; for example, if q were Gaussian, λ_{x_i} = (μ_{x_i}, σ²_{x_i}). How well does our variational posterior q(z|x) approximate the true posterior p(z|x)? The KL divergence measures the information lost when using q to approximate p.

The KL divergence to the true posterior,
KL( q_λ(z|x) || p(z|x) ) = E_q[ log q_λ(z|x) ] - E_q[ log p(x, z) ] + log p(x),
is intractable because it involves the evidence p(x). Consider instead the function ELBO(λ) = E_q[ log p(x, z) ] - E_q[ log q_λ(z|x) ]. We combine this with the KL divergence and rewrite the evidence as
log p(x) = ELBO(λ) + KL( q_λ(z|x) || p(z|x) ).
By Jensen's inequality, the KL divergence is always greater than or equal to zero, so minimizing the KL divergence is equivalent to maximizing the ELBO (Evidence Lower BOund), which is tractable.

In the variational autoencoder model there are only local latent variables (each x_i has its own z_i), so we can decompose the ELBO into a sum where each term depends on a single datapoint: ELBO(λ) = Σ_i ELBO_i(λ), with ELBO_i(λ) = E_{q_λ(z|x_i)}[ log p(x_i|z) ] - KL( q_λ(z|x_i) || p(z) ). This allows us to use stochastic gradient descent with respect to the parameters λ.

Reparameterization
Backpropagation is not possible through random sampling! How do we take derivatives with respect to the parameters of a stochastic variable? If we are given z drawn from a distribution q_θ(z|x) and we want to take derivatives of a function of z with respect to θ, how do we do that? We reparameterize the samples in a clever way, such that the stochasticity is independent of the parameters: e.g. for a normal distribution, z = μ + σε, where ε ~ N(0, 1).
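A brief PyTorch sketch of the trick: because the randomness enters only through ε, gradients of a function of z flow back to μ and log σ². The tensor shapes and the toy loss are illustrative.

    import torch

    mu = torch.zeros(20, requires_grad=True)       # illustrative parameters of q_theta(z|x)
    logvar = torch.zeros(20, requires_grad=True)

    eps = torch.randn(20)                          # eps ~ N(0, I): stochasticity independent of parameters
    z = mu + torch.exp(0.5 * logvar) * eps         # z = mu + sigma * eps, differentiable w.r.t. mu, logvar

    loss = z.pow(2).sum()                          # any differentiable function of z
    loss.backward()                                # gradients now reach mu and logvar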

Reparameterization