Deep Generative Models: Variational Autoencoders
Sudeshna Sarkar, 5 April 2017

Generative Nets
Generative nets are generative models that represent probability distributions over multiple variables in some way.
- Directed generative nets
- Differentiable generator nets

Differentiable Generator Nets
Many generative models are based on the idea of using a differentiable generator network. The model transforms samples of latent variables z into samples x, or into distributions over samples x, using a differentiable function g(z; θ^(g)), typically represented by a neural network.
1. Variational autoencoders, which pair the generator net with an inference net
2. Generative adversarial networks, which pair the generator net with a discriminator network
3. Techniques that train generator networks in isolation

Generator Networks
Generator networks are essentially just parameterized computational procedures for generating samples: the architecture provides the family of possible distributions to sample from, and the parameters select a distribution from within that family.
Example: the standard procedure for drawing samples from a normal distribution with mean μ and covariance Σ is to feed samples z from a normal distribution with zero mean and identity covariance into a very simple generator network. This network contains just one affine layer, x = g(z) = μ + Lz, where L is given by the Cholesky decomposition of Σ (so that LLᵀ = Σ).
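As a minimal sketch, this one-affine-layer generator can be written in NumPy as follows; the specific values of mu and Sigma are illustrative placeholders, not from the lecture.

    import numpy as np

    def affine_generator(z, mu, Sigma):
        # One affine layer: x = g(z) = mu + L z, with L the Cholesky factor of Sigma
        L = np.linalg.cholesky(Sigma)
        return mu + L @ z

    # Illustrative 2-D example
    mu = np.array([1.0, -2.0])
    Sigma = np.array([[2.0, 0.5],
                      [0.5, 1.0]])
    z = np.random.randn(2)               # z ~ N(0, I)
    x = affine_generator(z, mu, Sigma)   # x ~ N(mu, Sigma)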

Generator Networks
To generate samples from more complicated distributions, we may use a feedforward network to represent a parametric family of nonlinear functions g, and use training data to infer the parameters selecting the desired function. We can think of g as providing a nonlinear change of variables that transforms the distribution over z into the desired distribution over x; since this mapping is not observed directly, we often use indirect means of learning g.
In some cases, rather than using g to provide a sample of x directly, we use g to define a conditional distribution over x. For example, we could use a generator net whose final layer consists of sigmoid outputs to provide the mean parameters of Bernoulli distributions: p(x_i = 1 | z) = g(z)_i.
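A hedged PyTorch sketch of such a generator net, assuming an MNIST-sized output of 784 pixels and an illustrative latent dimension of 20; the sigmoid layer emits the Bernoulli mean for each output dimension.

    import torch
    import torch.nn as nn

    latent_dim, data_dim = 20, 784   # illustrative sizes (e.g. MNIST pixels)

    # Generator g: z -> Bernoulli means, p(x_i = 1 | z) = g(z)_i
    g = nn.Sequential(
        nn.Linear(latent_dim, 400),
        nn.ReLU(),
        nn.Linear(400, data_dim),
        nn.Sigmoid(),
    )

    z = torch.randn(1, latent_dim)   # z ~ N(0, I)
    probs = g(z)                     # mean parameters of the Bernoulli distributions
    x = torch.bernoulli(probs)       # a binary sample drawn from p(x | z)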

In this case, when we use g to define p(x|z), we impose a distribution over x by marginalizing out z: p(x) = E_z[ p(x|z) ]. The two different approaches to formulating generator nets, emitting the parameters of a conditional distribution versus directly emitting samples, have complementary strengths and weaknesses.

1. Emitting the parameters of a conditional distribution
2. Directly emitting samples
When the generator net defines a conditional distribution over x, it is capable of generating discrete data as well as continuous data. When the generator net provides samples directly, it is capable of generating only continuous data. The advantage of direct sampling is that we are no longer forced to use conditional distributions whose form can easily be written down and algebraically manipulated by a human designer.

Generative modeling seems to be more difficult than classification or regression because the learning process requires optimizing intractable criteria. In differentiable generator nets, the criteria are intractable because the data specifies only the outputs x, not the corresponding inputs z. The learning procedure needs to determine how to arrange z-space in a useful way and, additionally, how to map from z to x. There are several approaches to training differentiable generator nets given only training samples of x.

Variational Autoencoder
Graphical models + neural networks: a directed model that uses learned approximate inference and can be trained purely with gradient-based methods. VAEs let us design complex generative models of data and fit them to large datasets. They can be used to learn a low-dimensional representation Z of high-dimensional data X, such as images (e.g. of faces). X and Z are random variables, so it is possible to sample X from the distribution P(X|Z), thus creating e.g. images of faces, MNIST digits, or speech.

VAE History
Simultaneously discovered by:
- Kingma and Welling. Auto-Encoding Variational Bayes. ICLR, 2014.
- Rezende, Mohamed and Wierstra. Stochastic Backpropagation and Approximate Inference in Deep Generative Models. ICML, 2014.

Manifold Hypothesis

Variational autoencoders (idea of a low-dimensional manifold)

The neural net perspective
The encoder compresses data into a latent space (z); the decoder reconstructs the data given the hidden representation.
Example: x is a 28-by-28-pixel photo of a handwritten digit. The encoder encodes the data into a latent (hidden) representation space z of lower dimension, so the encoder must learn an efficient compression of the data. The lower-dimensional space is stochastic: the encoder outputs the parameters of q_θ(z|x), which is a Gaussian probability density. We can sample from this distribution to get noisy values of the representation z.

The neural net perspective
The decoder is a neural net denoted by p_φ(x|z), with weights and biases φ. Its input is the representation z, and its output is the parameters of the probability distribution of the data. For the 28-by-28-pixel digit example, the decoder takes the latent representation of a digit z and outputs 784 parameters, one for each pixel in the image. Information is lost because we go from a smaller to a larger dimensionality; how much is measured by the reconstruction log-likelihood log p_φ(x|z).
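A hedged PyTorch sketch of this encoder/decoder pair; the hidden width (400) and latent dimension (20) are illustrative assumptions, not values fixed by the lecture.

    import torch
    import torch.nn as nn

    class Encoder(nn.Module):
        # q_theta(z|x): maps a 784-pixel image to the mean and log-variance of a Gaussian over z
        def __init__(self, data_dim=784, hidden=400, latent_dim=20):
            super().__init__()
            self.body = nn.Sequential(nn.Linear(data_dim, hidden), nn.ReLU())
            self.mu = nn.Linear(hidden, latent_dim)
            self.logvar = nn.Linear(hidden, latent_dim)
        def forward(self, x):
            h = self.body(x)
            return self.mu(h), self.logvar(h)

    class Decoder(nn.Module):
        # p_phi(x|z): maps z to 784 Bernoulli means, one per pixel
        def __init__(self, data_dim=784, hidden=400, latent_dim=20):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(latent_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, data_dim), nn.Sigmoid(),
            )
        def forward(self, z):
            return self.net(z)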

Variational Autoencoders
To generate a sample from the model, the VAE first draws a sample z from the code distribution p_model(z). The sample is then run through a differentiable generator network g(z). Finally, x is sampled from a distribution p_model(x; g(z)) = p_model(x|z). During training, the approximate inference network (or encoder) q(z|x) is used to obtain z, and p_model(x|z) is then viewed as a decoder network.
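A brief sketch of this ancestral sampling procedure, reusing the illustrative Decoder class from the sketch above and assuming p_model(z) is a standard normal prior.

    import torch

    decoder = Decoder()                    # from the sketch above (would be trained in practice)
    with torch.no_grad():
        z = torch.randn(1, 20)             # z ~ p_model(z) = N(0, I)
        probs = decoder(z)                 # parameters of p_model(x | z)
        x = torch.bernoulli(probs)         # x ~ p_model(x | z)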

The loss function of the variational autoencoder is the negative log-likelihood with a regularizer. The loss l_i for datapoint x_i is
l_i(θ, φ) = -E_{z ~ q_θ(z|x_i)}[ log p_φ(x_i|z) ] + KL( q_θ(z|x_i) || p(z) ),
and the total loss is L = Σ_i l_i over the N data points. The first term is the reconstruction loss, the expected negative log-likelihood of the i-th datapoint; the expectation is taken with respect to the encoder's distribution over the representations, and this term encourages the decoder to learn to reconstruct the data. The second term is the regularizer: the KL divergence between the encoder's distribution q_θ(z|x_i) and p(z).
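A sketch of this loss for the Gaussian encoder and Bernoulli decoder assumed above, using the closed-form KL divergence between N(μ, σ²) and N(0, I) and the usual single-sample Monte Carlo estimate of the expectation.

    import torch
    import torch.nn.functional as F

    def vae_loss(x, probs, mu, logvar):
        # Reconstruction term: -E_q[log p_phi(x|z)], estimated with one sample of z
        recon = F.binary_cross_entropy(probs, x, reduction="sum")
        # Regularizer: KL( N(mu, sigma^2) || N(0, I) ) in closed form
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        return recon + kl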

In the variational autoencoder, p(z) is specified as a standard normal distribution with mean zero and variance one. This has the effect of keeping the representations of similar digits close together. We train the variational autoencoder using gradient descent to optimize the loss with respect to the parameters θ and φ of the encoder and decoder.
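A minimal training-loop sketch under these assumptions, reusing the illustrative Encoder, Decoder and vae_loss from the sketches above; data_loader is a placeholder for batches of flattened images in [0, 1], and the reparameterized sample z = μ + σ·ε used here is explained on the Reparameterization slide below.

    import torch

    encoder, decoder = Encoder(), Decoder()
    optimizer = torch.optim.Adam(
        list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

    for x in data_loader:                       # placeholder: batches of 784-dim images in [0, 1]
        mu, logvar = encoder(x)
        eps = torch.randn_like(mu)
        z = mu + torch.exp(0.5 * logvar) * eps  # reparameterized sample from q_theta(z|x)
        probs = decoder(z)
        loss = vae_loss(x, probs, mu, logvar)
        optimizer.zero_grad()
        loss.backward()                         # gradients flow to both encoder and decoder
        optimizer.step()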

The probability model perspective
A variational autoencoder contains a specific probability model of data x and latent variables z. The joint probability of the model is p(x, z) = p(x|z) p(z). The generative process for each datapoint i:
- Draw latent variables z_i ~ p(z)
- Draw datapoint x_i ~ p(x|z_i)

Inference in this model
Goal: infer good values of the latent variables given observed data, i.e. the posterior p(z|x) = p(x|z) p(z) / p(x). The evidence p(x) = ∫ p(x|z) p(z) dz requires exponential time to compute, as it needs to be evaluated over all configurations of the latent variables, so we need to approximate this posterior distribution. Variational inference approximates the posterior with a family of distributions q_λ(z|x), where λ indexes the family of distributions; for example, if q were Gaussian, λ_{x_i} = (μ_{x_i}, σ²_{x_i}). How well does our variational posterior q(z|x) approximate the true posterior p(z|x)? The KL divergence measures the information lost when using q to approximate p.

The KL divergence to the true posterior,
KL( q_λ(z|x) || p(z|x) ) = E_q[ log q_λ(z|x) ] - E_q[ log p(x, z) ] + log p(x),
is intractable because it involves the evidence p(x). Consider instead the function ELBO(λ) = E_q[ log p(x, z) ] - E_q[ log q_λ(z|x) ]. We combine this with the KL divergence and rewrite the evidence as
log p(x) = ELBO(λ) + KL( q_λ(z|x) || p(z|x) ).
By Jensen's inequality, the KL divergence is always greater than or equal to zero, so minimizing the KL divergence is equivalent to maximizing the ELBO (Evidence Lower BOund), which is tractable.

In the variational autoencoder model there are only local latent variables (each x_i has its own z_i), so we can decompose the ELBO into a sum where each term depends on a single datapoint: ELBO(λ) = Σ_i ELBO_i(λ), with ELBO_i(λ) = E_{q_λ(z|x_i)}[ log p(x_i|z) ] - KL( q_λ(z|x_i) || p(z) ). This allows us to use stochastic gradient descent with respect to the parameters λ.

Reparameterization
Backpropagation is not possible through random sampling! How do we take derivatives with respect to the parameters of a stochastic variable? If we are given z drawn from a distribution q_θ(z|x) and we want to take derivatives of a function of z with respect to θ, how do we do that? We reparameterize the samples in a clever way, such that the stochasticity is independent of the parameters: e.g. for a normal distribution, z = μ + σε, where ε ~ N(0, 1).
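A brief PyTorch sketch of the trick: because the randomness enters only through ε, gradients of a function of z flow back to μ and log σ². The tensor shapes and the toy loss are illustrative.

    import torch

    mu = torch.zeros(20, requires_grad=True)       # illustrative parameters of q_theta(z|x)
    logvar = torch.zeros(20, requires_grad=True)

    eps = torch.randn(20)                          # eps ~ N(0, I): stochasticity independent of parameters
    z = mu + torch.exp(0.5 * logvar) * eps         # z = mu + sigma * eps, differentiable w.r.t. mu, logvar

    loss = z.pow(2).sum()                          # any differentiable function of z
    loss.backward()                                # gradients now reach mu and logvar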

Reparameterization