Auto-Encoding Variational Bayes


Auto-Encoding Variational Bayes. Diederik P. (Durk) Kingma and Max Welling, University of Amsterdam. Presented by Durk Kingma, Ph.D. candidate advised by Max Welling.

Problem class. Directed graphical model with x the observed variable, z the (continuous) latent variables, θ the model parameters, and pθ(x, z) the joint PDF, assumed factorized and differentiable. The hard case is an intractable posterior distribution pθ(z|x), e.g. when neural nets are used as components. We want fast approximate posterior inference per datapoint; once inference is available, learning the parameters is easy. D.P. Kingma 2

Latent variable generative model: learn a mapping from some latent variable z to a complicated distribution on x. p(x) = ∫ p(x, z) dz, where p(x, z) = p(x|z) p(z); p(z) is something simple and p(x|z) = f(z). Can we learn to decouple the true explanatory factors underlying the data distribution, e.g. separate identity and expression in face images? (Figure: a simple latent space z mapped through f to a data manifold in x-space. Image from: Ward, A. D., Hamarneh, G.: 3D Surface Parameterization Using Manifold Learning for Medial Shape Representation, Conference on Image Processing, Proc. of SPIE Medical Imaging, 2007.) IFT6266: Representation (Deep) Learning, Aaron Courville 10
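
As a concrete illustration of the generative direction only (not yet the inference problem), here is a minimal numpy sketch of ancestral sampling from p(z) p(x|z); the decoder f, its parameters, and the noise scale are all made-up stand-ins, not the models from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(z, W, b):
    # Hypothetical decoder: a one-layer net mapping latent z to the mean of p(x|z).
    return np.tanh(z @ W + b)

d_z, d_x = 2, 5                      # latent and observed dimensions (arbitrary)
W = rng.normal(size=(d_z, d_x))      # made-up decoder parameters
b = rng.normal(size=d_x)

# Ancestral sampling: z ~ p(z) = N(0, I), then x ~ p(x|z) = N(f(z), 0.1^2 I).
z = rng.normal(size=d_z)
x = f(z, W, b) + 0.1 * rng.normal(size=d_x)
print(z, x)
```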

Variational autoencoder (VAE) approach: leverage neural networks to learn a latent variable model. p(x) = ∫ p(x, z) dz, where p(x, z) = p(x|z) p(z); p(z) is something simple and p(x|z) = f(z), with f a neural network. (Figure: the decoder network f(z) maps latent codes z to observations x.) IFT6266: Representation (Deep) Learning, Aaron Courville 11

What can a VAE do? (Figure: visualisations of learned data manifolds for generative models with a two-dimensional latent space, learned with AEVB. (a) Learned Frey Face manifold, whose two latent coordinates roughly capture pose and expression; (b) learned MNIST manifold. Since the prior on the latent space is Gaussian, linearly spaced coordinates on the unit square were transformed through the inverse CDF of the Gaussian to produce values of the latent variables z, and for each such z the corresponding generative output is plotted.) IFT6266: Representation (Deep) Learning, Aaron Courville 12

The inference / learning challenge: where does z come from? The classic directed-model dilemma: computing the posterior p(z|x) is intractable, yet we need it to train the directed model. (Figure: the decoder f(z) maps z to x; the reverse mapping from x back to z is the unknown.) IFT6266: Representation (Deep) Learning, Aaron Courville 13

Auto-Encoding Variational Bayes. Idea: learn a neural net qφ(z|x), with 'variational parameters' φ, to approximate the posterior; this gives one-shot approximate inference, akin to the recognition model in Wake-Sleep. Then construct an estimator of the variational lower bound that we can optimize jointly w.r.t. φ and θ by stochastic gradient ascent. D.P. Kingma 4

Variational lower bound of the marginal likelihood. D.P. Kingma 5
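
The equation on this slide did not survive transcription; the standard form of the bound, consistent with the VAE slides later in this transcript, is:

\[
\log p_\theta(x) \;\ge\; \mathcal{L}(\theta,\phi;x)
= \mathbb{E}_{q_\phi(z|x)}\!\left[\log p_\theta(x,z) - \log q_\phi(z|x)\right]
= -D_{\mathrm{KL}}\!\left(q_\phi(z|x)\,\|\,p_\theta(z)\right) + \mathbb{E}_{q_\phi(z|x)}\!\left[\log p_\theta(x|z)\right].
\]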

Monte Carlo estimator of the variational bound: draw samples z ~ qφ(z|x) and average log pθ(x, z) − log qφ(z|x). But can we differentiate through the sampling process w.r.t. φ? D.P. Kingma
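
A minimal numpy sketch of this naive estimator for a toy model with Gaussian prior, Gaussian decoder, and Gaussian qφ (all toy choices are assumptions, not the paper's MNIST/Frey Face setup). Note that z is drawn directly from qφ, so the sampled values carry no differentiable path back to φ:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

def log_normal(x, mu, sigma):
    # Sum of independent Gaussian log-densities.
    return norm.logpdf(x, loc=mu, scale=sigma).sum()

x = np.array([0.8, -0.3])           # a single toy datapoint
mu_q, sigma_q = 0.5, 0.9            # variational parameters phi (toy values)

def naive_elbo_estimate(L=100):
    vals = []
    for _ in range(L):
        z = rng.normal(mu_q, sigma_q)  # z ~ q_phi(z|x): sampling blocks the gradient path to phi
        log_p_xz = log_normal(x, np.tanh(z), 0.5) + log_normal(z, 0.0, 1.0)  # log p(x|z) + log p(z)
        log_q = log_normal(z, mu_q, sigma_q)                                 # log q_phi(z|x)
        vals.append(log_p_xz - log_q)
    return np.mean(vals)

print(naive_elbo_estimate())
```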

Variational Autoencoder (VAE). Where does z come from? The classic DAG problem. The VAE approach: introduce an inference machine qφ(z|x) that learns to approximate the posterior pθ(z|x), and define a variational lower bound on the data likelihood, log pθ(x) ≥ L(θ, φ, x), with

L(θ, φ, x) = E_{qφ(z|x)}[log pθ(x, z) − log qφ(z|x)]
           = E_{qφ(z|x)}[log pθ(x|z) + log pθ(z) − log qφ(z|x)]
           = −D_KL(qφ(z|x) ‖ pθ(z)) + E_{qφ(z|x)}[log pθ(x|z)],

where the KL term acts as a regularization term and the expectation is the reconstruction term. What is qφ(z|x)? IFT6266: Representation (Deep) Learning, Aaron Courville 14

VAE inference model. The VAE approach: introduce an inference model qφ(z|x) that learns to approximate the intractable posterior pθ(z|x) by optimizing the variational lower bound L(θ, φ, x) = −D_KL(qφ(z|x) ‖ pθ(z)) + E_{qφ(z|x)}[log pθ(x|z)]. We parameterize qφ(z|x) with another neural network: qφ(z|x) = q(z; g(x, φ)), alongside the generative network pθ(x|z) = p(x; f(z, θ)). (Figure: encoder network g(x) mapping x to z, decoder network f(z) mapping z to x.) IFT6266: Representation (Deep) Learning, Aaron Courville 15

Key reparameterization trick. Construct samples z ~ qφ(z|x) in two steps: 1. ε ~ p(ε) (a random seed independent of φ); 2. z = g(φ, ε, x) (a differentiable perturbation), such that z ~ qφ(z|x) (the correct distribution). Example: if q(z|x) = N(μ(x), σ(x)²), take ε ~ N(0, I) and z = μ(x) + σ(x) · ε. Another option is the (approximate) inverse CDF; many more possibilities exist (see paper). D.P. Kingma 7
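
A minimal numpy sketch of the two constructions named above for a 1-D Gaussian qφ: the location-scale form and the inverse-CDF form (the specific μ, σ values are made up for illustration):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
mu, sigma = 1.5, 0.4                 # encoder outputs for one x (toy values)
n = 100_000

# Location-scale reparameterization: eps ~ N(0, 1), z = mu + sigma * eps.
eps = rng.normal(size=n)
z_ls = mu + sigma * eps

# Inverse-CDF reparameterization: u ~ U(0, 1), z = F^{-1}(u) for F the CDF of N(mu, sigma^2).
u = rng.uniform(size=n)
z_cdf = norm.ppf(u, loc=mu, scale=sigma)

# Both sample paths follow q(z|x) = N(mu, sigma^2), but are deterministic,
# differentiable functions of (mu, sigma) given the external noise.
print(z_ls.mean(), z_ls.std(), z_cdf.mean(), z_cdf.std())
```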

Reparametrization trick: adding a few details plus one really important trick. Let's consider z to be real-valued with qφ(z|x) = N(z; μz(x), σz(x)). Parametrize z as z = μz(x) + σz(x)·εz, where εz ~ N(0, 1). Optionally, parametrize x the same way: x = μx(z) + σx(z)·εx, where εx ~ N(0, 1). (Figure: the encoder g outputs μz(x), σz(x); the decoder f outputs μx(z), σx(z).) IFT6266: Representation (Deep) Learning, Aaron Courville 16

SGVB estimator: rewrite the Monte Carlo estimate of the bound using reparameterized samples z = g(φ, ε, x) with ε ~ p(ε). Really simple and appropriate for differentiation w.r.t. φ and θ! D.P. Kingma
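
Continuing the toy Gaussian example from the naive estimator above (same assumptions), a sketch of the reparameterized SGVB estimate: with the noise ε drawn once and held fixed, the estimate is a smooth function of the variational parameters and could be differentiated by any autodiff tool (finite differences are used here only to show it works):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)

def log_normal(x, mu, sigma):
    return norm.logpdf(x, loc=mu, scale=sigma).sum()

x = np.array([0.8, -0.3])               # same toy datapoint as before
eps = rng.normal(size=50)               # external noise, drawn once, independent of phi

def sgvb_estimate(mu_q, sigma_q):
    # Reparameterized samples: deterministic in (mu_q, sigma_q) given eps.
    vals = []
    for e in eps:
        z = mu_q + sigma_q * e
        log_p_xz = log_normal(x, np.tanh(z), 0.5) + log_normal(z, 0.0, 1.0)
        log_q = log_normal(z, mu_q, sigma_q)
        vals.append(log_p_xz - log_q)
    return np.mean(vals)

# Finite differences now give a sensible gradient w.r.t. the variational parameters.
h = 1e-5
print((sgvb_estimate(0.5 + h, 0.9) - sgvb_estimate(0.5 - h, 0.9)) / (2 * h))
```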

Variational auto-encoder. (Diagram: the generative model p defines p(z) and p(x|z), the decoder; the recognition model q defines q(z|x) = N(μ, σ), the encoder, with noise ε injected through the reparameterization.) D.P. Kingma 11

Why reparametrization helps: the randomness is moved into an external noise variable ε, so the sampled z becomes a deterministic, differentiable function of φ and gradients can flow from the bound back to the variational parameters. September 19, 2016, 1/6
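
To make the benefit concrete, here is a small numpy experiment (not from the slides) comparing two estimators of d/dμ E_{z~N(μ,1)}[z²], whose true value is 2μ: the score-function (REINFORCE-style) estimator and the reparameterized (pathwise) estimator. The pathwise estimator typically has much lower variance:

```python
import numpy as np

rng = np.random.default_rng(4)
mu, n = 1.0, 10_000
f = lambda z: z ** 2                       # objective inside the expectation; true gradient is 2*mu

# Score-function estimator: f(z) * d/dmu log N(z; mu, 1) = f(z) * (z - mu), with z ~ N(mu, 1).
z = rng.normal(mu, 1.0, size=n)
score_grads = f(z) * (z - mu)

# Reparameterized (pathwise) estimator: z = mu + eps, so d f(z)/dmu = 2 * (mu + eps).
eps = rng.normal(size=n)
reparam_grads = 2.0 * (mu + eps)

print("true gradient:", 2 * mu)
print("score function: mean %.3f, std %.3f" % (score_grads.mean(), score_grads.std()))
print("reparameterized: mean %.3f, std %.3f" % (reparam_grads.mean(), reparam_grads.std()))
```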

Training with backpropagation! Thanks to the reparametrization trick, we can simultaneously train both the generative model pθ(x|z) and the inference model qφ(z|x) by optimizing the variational bound with gradient backpropagation. Objective function: L(θ, φ, x) = −D_KL(qφ(z|x) ‖ pθ(z)) + E_{qφ(z|x)}[log pθ(x|z)]. (Figure: forward propagation runs x through qφ(z|x) to z and through pθ(x|z) to the reconstruction x̂; backward propagation flows the gradient back along the same path.) IFT6266: Representation (Deep) Learning, Aaron Courville 17

Auto-Encoding Variational Bayes, online algorithm: repeat { draw a minibatch, compute gradients of the estimator by backprop (Torch7 / Theano), update parameters with e.g. Adagrad } until convergence. Scales to very large datasets! D.P. Kingma 9
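
A minimal sketch of this loop in PyTorch (a modern stand-in for the Torch7/Theano setup on the slide, so the framework and the tiny MLP architecture are assumptions): Gaussian reparameterization in the encoder, closed-form KL plus Bernoulli reconstruction term, Adagrad updates on random stand-in data where MNIST would normally go:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_x, d_h, d_z = 784, 200, 20                  # sizes roughly in the spirit of the MNIST experiments

class VAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Linear(d_x, d_h)
        self.enc_mu = nn.Linear(d_h, d_z)
        self.enc_logvar = nn.Linear(d_h, d_z)
        self.dec = nn.Sequential(nn.Linear(d_z, d_h), nn.Tanh(), nn.Linear(d_h, d_x))

    def forward(self, x):
        h = torch.tanh(self.enc(x))
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        eps = torch.randn_like(mu)                      # reparameterization: z = mu + sigma * eps
        z = mu + torch.exp(0.5 * logvar) * eps
        return self.dec(z), mu, logvar

def neg_elbo(x_logits, x, mu, logvar):
    # Negative lower bound: Bernoulli reconstruction term + closed-form Gaussian KL.
    rec = F.binary_cross_entropy_with_logits(x_logits, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kl

model = VAE()
opt = torch.optim.Adagrad(model.parameters(), lr=0.01)
data = torch.rand(512, d_x)                             # stand-in data; real images would go here

for step in range(100):                                 # repeat ... until convergence
    idx = torch.randint(0, data.size(0), (64,))         # random minibatch
    x = data[idx]
    x_logits, mu, logvar = model(x)
    loss = neg_elbo(x_logits, x, mu, logvar)
    opt.zero_grad()
    loss.backward()
    opt.step()
```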

Model used in experiments. (The slide shows the objective split into a (noisy) negative reconstruction error plus regularization terms.) D.P. Kingma 10

Special case with Gaussian prior and posterior. Suppose p(z) = N(z; 0, I) and qφ(z|x) = N(z; μφ(x), σφ²(x)). The variational bound is

L = ln pθ(x) − D_KL(qφ(z|x) ‖ pθ(z|x))                      (1)
  = E_{qφ(z|x)}[ln pθ(x|z)] − D_KL(qφ(z|x) ‖ p(z)),          (2)

and the KL divergence has a closed form:

D_KL(q(z|x) ‖ p(z)) = ½ Σ_{j=1}^{D} ( μ_j(x)² + σ_j(x)² − 2 ln σ_j(x) − 1 ).

Deterministic regularization, stochastic data term. September 19, 2016, 2/6
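
A small numpy check (toy values, not from the slides) that the closed-form expression above matches a Monte Carlo estimate of D_KL(q ‖ p) for a diagonal Gaussian q and a standard normal prior p:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
mu = np.array([0.3, -1.2, 0.7])        # toy encoder outputs for one datapoint
sigma = np.array([0.5, 1.3, 0.8])

# Closed form: 0.5 * sum(mu^2 + sigma^2 - 2*ln(sigma) - 1)
kl_closed = 0.5 * np.sum(mu**2 + sigma**2 - 2 * np.log(sigma) - 1)

# Monte Carlo: E_q[log q(z) - log p(z)] with z ~ q
z = mu + sigma * rng.normal(size=(200_000, mu.size))
log_q = norm.logpdf(z, loc=mu, scale=sigma).sum(axis=1)
log_p = norm.logpdf(z, loc=0.0, scale=1.0).sum(axis=1)
kl_mc = np.mean(log_q - log_p)

print(kl_closed, kl_mc)   # the two values should agree to roughly two decimal places
```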

Results: Marginal likelihood lower bound D.P. Kingma 12

Results: Marginal log-likelihood Monte Carlo EM does not scale well to large datasets D.P. Kingma 13

Robustness to high-dimensional latent space D.P. Kingma 14

Effect of KL term: component collapse. (Figure from Laurent Dinh & Vincent Dumoulin.) IFT6266: Representation (Deep) Learning, Aaron Courville 19

Component collapse & depth. Deep model: some component collapse; deeper model: more component collapse. (Figures from Laurent Dinh & Vincent Dumoulin.) IFT6266: Representation (Deep) Learning, Aaron Courville 20

Samples from MNIST (simple ancestral sampling) D.P. Kingma 15

2D latent space: Frey Face (figure axes: z1, z2). D.P. Kingma 16

2D latent space: MNIST (figure axes: z1, z2). D.P. Kingma 17

Labeled Faces in the Wild (random samples from generative model) D.P. Kingma 19

Conditional generation using M2: from central pixels to full image. September 19, 2016, 3/6

Conditional generation: from central pixels to full image. September 19, 2016, 4/6

Semi-supervised Learning with Deep Generative Models (Diederik P. Kingma, Danilo J. Rezende, Shakir Mohamed, Max Welling). They study two basic approaches. M1: standard unsupervised feature learning ('self-taught learning'): train features z on unlabeled data, then train a classifier to map from z to the label y. Generative model (recall that x = data, z = latent features): p(z) = N(z | 0, I); pθ(x|z) = f(x; z, θ). M2: generative semi-supervised model: p(y) = Cat(y | π); p(z) = N(z | 0, I); pθ(x|y, z) = f(x; y, z, θ), where Cat(y | π) is the multinomial distribution and the class labels are treated as latent when unobserved. IFT6266: Representation (Deep) Learning, Aaron Courville 23

Semi-supervised Learning with Deep Generative Models (Kingma, Rezende, Mohamed, Welling). M1+M2: combination semi-supervised model. Train the generative semi-supervised model M2 on the unsupervised features z1 learned by M1 from unlabeled data, instead of on the raw data. The result is a deep generative model pθ(x, y, z1, z2) = p(y) p(z2) pθ(z1 | y, z2) pθ(x | z1), where p(y) and p(z2) are as above and both conditionals are parameterized as deep neural networks. IFT6266: Representation (Deep) Learning, Aaron Courville 24

Semi-supervised Learning with Deep Generative Models (Kingma, Rezende, Mohamed, Welling). Approximate posterior (encoder model): following the VAE strategy, we parametrize the approximate posterior with a high-capacity model such as an MLP or some other deep model (convnet, RNN, etc.). M1: qφ(z|x) = N(z | μφ(x), diag(σφ²(x))). M2: qφ(z|y, x) = N(z | μφ(y, x), diag(σφ²(x))); qφ(y|x) = Cat(y | πφ(x)). Here μφ(·), σφ²(·) and πφ(·) are parameterized by deep MLPs that can share parameters. IFT6266: Representation (Deep) Learning, Aaron Courville 25

Semi-supervised Learning with Deep Generative Models (Kingma, Rezende, Mohamed, Welling). M2: the lower bound for the generative semi-supervised model. Objective with labeled data:

log pθ(x, y) ≥ E_{qφ(z|x,y)}[log pθ(x|y, z) + log pθ(y) + log p(z) − log qφ(z|x, y)] = −L(x, y).

Objective without labels (posterior inference over the unobserved label y):

log pθ(x) ≥ E_{qφ(y,z|x)}[log pθ(x|y, z) + log pθ(y) + log p(z) − log qφ(y, z|x)] = Σ_y qφ(y|x)(−L(x, y)) + H(qφ(y|x)) = −U(x).

Semi-supervised objective: J = Σ_{(x,y)∼p̃_l} L(x, y) + Σ_{x∼p̃_u} U(x). Actually, for classification, they use J^α = J + α · E_{p̃_l(x,y)}[−log qφ(y|x)]. IFT6266: Representation (Deep) Learning, Aaron Courville 26
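
A minimal numpy sketch (all functions and values are toy stand-ins, not the paper's architecture) of how the unlabeled-data term U(x) marginalizes the per-class bound L(x, y) over qφ(y|x) and subtracts the entropy term, as in the equation above:

```python
import numpy as np

n_classes = 10

def labeled_bound_L(x, y):
    # Toy stand-in for L(x, y) = -E_q[log p(x|y,z) + log p(y) + log p(z) - log q(z|x,y)];
    # a real model would evaluate the encoder, decoder, and a sampled z here.
    return 50.0 + 2.0 * y + 0.1 * float(np.abs(x).sum())

def q_y_given_x(x):
    # Toy stand-in classifier q_phi(y|x): any valid probability vector over the classes.
    logits = np.linspace(0.0, 1.0, n_classes) * float(x.mean())
    p = np.exp(logits - logits.max())
    return p / p.sum()

def unlabeled_objective_U(x):
    q = q_y_given_x(x)
    expected_L = sum(q[y] * labeled_bound_L(x, y) for y in range(n_classes))  # sum_y q(y|x) L(x, y)
    entropy = -float(np.sum(q * np.log(q)))                                   # H(q(y|x))
    return expected_L - entropy

x = np.random.default_rng(6).random(784)   # toy "image"
print(unlabeled_objective_U(x))
```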

Semi-supervised MNIST classification results (Kingma, Rezende, Mohamed, Welling). The combination model M1+M2 shows a dramatic improvement. Table 1: Benchmark results of semi-supervised classification on MNIST with few labels (test error %):

N     NN     CNN    TSVM   CAE    MTC    AtlasRBF       M1+TSVM         M2              M1+M2
100   25.81  22.98  16.81  13.47  12.03  8.10 (± 0.95)  11.82 (± 0.25)  11.97 (± 1.71)  3.33 (± 0.14)
600   11.44  7.68   6.16   6.3    5.13   -              5.72 (± 0.049)  4.94 (± 0.13)   2.59 (± 0.05)
1000  10.7   6.45   5.38   4.77   3.64   3.68 (± 0.12)  4.24 (± 0.07)   3.60 (± 0.56)   2.40 (± 0.02)
3000  6.04   3.35   3.45   3.22   2.57   -              3.49 (± 0.04)   3.92 (± 0.63)   2.18 (± 0.04)

Full MNIST test error: 0.96% (for comparison, current SOTA: 0.78%). IFT6266: Representation (Deep) Learning, Aaron Courville 27

Conditional generation using M2 September 19, 2016 5 / 6

Conditional generation using M2 September 19, 2016 6/6