Score function estimator and variance reduction techniques

Wilker Aziz, University of Amsterdam. May 24, 2018.

Outline

Variational inference for belief networks

Generative model with a neural network likelihood. [Graphical model: $\lambda \to z \to x \leftarrow \theta$, repeated over $N$ observations.] Let $z \in \{0,1\}^d$ and

$$q_\lambda(z \mid x) = \prod_{i=1}^{d} q_\lambda(z_i \mid x) = \prod_{i=1}^{d} \mathrm{Bern}\big(z_i \mid \mathrm{sigmoid}(f_\lambda(x))\big) \tag{1}$$

Jointly optimise the generative model $p_\theta(x \mid z)$ and the inference model $q_\lambda(z \mid x)$ under the same objective (ELBO). (Mnih and Gregor, 2014)

Objective

$$\log p_\theta(x) \;\geq\; \underbrace{\mathbb{E}_{q_\lambda(z \mid x)}[\log p_\theta(x, Z)] + \mathbb{H}\big(q_\lambda(z \mid x)\big)}_{\text{ELBO}} \;=\; \mathbb{E}_{q_\lambda(z \mid x)}[\log p_\theta(x \mid Z)] - \mathrm{KL}\big(q_\lambda(z \mid x) \,\|\, p(z)\big)$$

Parameter estimation

$$\arg\max_{\theta, \lambda}\; \mathbb{E}_{q_\lambda(z \mid x)}[\log p_\theta(x \mid Z)] - \mathrm{KL}\big(q_\lambda(z \mid x) \,\|\, p(z)\big)$$

Generative Network Gradient

$$\nabla_\theta \Big( \mathbb{E}_{q_\lambda(z \mid x)}[\log p_\theta(x \mid z)] - \underbrace{\mathrm{KL}\big(q_\lambda(z \mid x) \,\|\, p(z)\big)}_{\text{constant wrt } \theta} \Big) = \underbrace{\mathbb{E}_{q_\lambda(z \mid x)}\big[\nabla_\theta \log p_\theta(x \mid z)\big]}_{\text{expected gradient :)}}$$

$$\overset{\mathrm{MC}}{\approx} \frac{1}{K} \sum_{k=1}^{K} \nabla_\theta \log p_\theta(x \mid z^{(k)}), \qquad z^{(k)} \sim q_\lambda(Z \mid x)$$

Note: $q_\lambda(z \mid x)$ does not depend on $\theta$.
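This gradient is straightforward to implement. Below is a minimal sketch, assuming PyTorch; the callables `p_theta` (returning $\log p_\theta(x \mid z)$) and `q_lambda` (returning the Bernoulli parameters $\psi$) are hypothetical stand-ins for the networks, not code from the lecture.

```python
import torch

def generative_surrogate(p_theta, q_lambda, x, K=5):
    """MC estimate of the generative gradient: differentiating this surrogate
    wrt theta gives (1/K) sum_k grad_theta log p_theta(x | z^(k))."""
    psi = q_lambda(x)                                          # Bernoulli parameters of q_lambda(z|x)
    z = torch.distributions.Bernoulli(probs=psi).sample((K,))  # z^(k) ~ q_lambda(Z|x); no gradient flows through sampling
    log_px = torch.stack([p_theta(x, z_k) for z_k in z])       # log p_theta(x | z^(k))
    return -log_px.mean()                                      # minimise the negative of the MC average
```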

Inference Network Gradient

$$\nabla_\lambda \Big( \mathbb{E}_{q_\lambda(z \mid x)}[\log p_\theta(x \mid z)] - \underbrace{\mathrm{KL}\big(q_\lambda(z \mid x) \,\|\, p(z)\big)}_{\text{analytical}} \Big) = \nabla_\lambda \mathbb{E}_{q_\lambda(z \mid x)}[\log p_\theta(x \mid z)] - \underbrace{\nabla_\lambda \mathrm{KL}\big(q_\lambda(z \mid x) \,\|\, p(z)\big)}_{\text{analytical computation}}$$

The first term again requires approximation by sampling, but there is a problem.

Inference Network Gradient

$$\nabla_\lambda \mathbb{E}_{q_\lambda(z \mid x)}[\log p_\theta(x \mid z)] = \nabla_\lambda \int q_\lambda(z \mid x) \log p_\theta(x \mid z) \,\mathrm{d}z = \int \underbrace{\nabla_\lambda\big(q_\lambda(z \mid x)\big) \log p_\theta(x \mid z)}_{\text{not an expectation}} \,\mathrm{d}z$$

- The MC estimator is non-differentiable: we cannot sample first.
- Differentiating the expression does not yield an expectation: we cannot approximate it via MC.
- The original VAE employed a reparameterisation, but...

Bernoulli pmf

$$\mathrm{Bern}(z_j \mid b_j) = b_j^{z_j} (1 - b_j)^{1 - z_j} = \begin{cases} b_j & \text{if } z_j = 1 \\ 1 - b_j & \text{if } z_j = 0 \end{cases}$$

Can we reparameterise a Bernoulli variable?

Reparameterisation requires a Jacobian matrix

Not really :(

$$q(z; \lambda) = \phi\big(\epsilon = h(z, \lambda)\big) \underbrace{\big|\det J_{h(z,\lambda)}\big|}_{\text{change of density}} \tag{2}$$

The elements of the Jacobian matrix $J_{h(z,\lambda)}[i, j] = \dfrac{\partial h_i(z, \lambda)}{\partial z_j}$ are not defined for non-differentiable functions.

Outline

We can again use the log identity for derivatives:

$$\nabla_\lambda \mathbb{E}_{q_\lambda(z \mid x)}[\log p_\theta(x \mid z)] = \nabla_\lambda \int q_\lambda(z \mid x) \log p_\theta(x \mid z) \,\mathrm{d}z = \int \underbrace{\nabla_\lambda\big(q_\lambda(z \mid x)\big) \log p_\theta(x \mid z)}_{\text{not an expectation}} \,\mathrm{d}z$$

$$= \int q_\lambda(z \mid x)\, \nabla_\lambda\big(\log q_\lambda(z \mid x)\big) \log p_\theta(x \mid z) \,\mathrm{d}z = \underbrace{\mathbb{E}_{q_\lambda(z \mid x)}\big[\log p_\theta(x \mid z)\, \nabla_\lambda \log q_\lambda(z \mid x)\big]}_{\text{expected gradient :)}}$$

Score function estimator: high variance

We can now build an MC estimator:

$$\nabla_\lambda \mathbb{E}_{q_\lambda(z \mid x)}[\log p_\theta(x \mid z)] = \mathbb{E}_{q_\lambda(z \mid x)}\big[\log p_\theta(x \mid z)\, \nabla_\lambda \log q_\lambda(z \mid x)\big]$$

$$\overset{\mathrm{MC}}{\approx} \frac{1}{K} \sum_{k=1}^{K} \log p_\theta(x \mid z^{(k)})\, \nabla_\lambda \log q_\lambda(z^{(k)} \mid x), \qquad z^{(k)} \sim q_\lambda(Z \mid x)$$

It is fully differentiable, but:
- the magnitude of $\log p_\theta(x \mid z)$ varies widely;
- the model likelihood does not contribute to the direction of the gradient;
- it has too much variance to be useful.
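A minimal sketch of this estimator, again assuming PyTorch and the hypothetical `p_theta` / `q_lambda` callables from before. The key detail is detaching the learning signal $\log p_\theta(x \mid z)$ so that, wrt $\lambda$, the gradient of the surrogate is exactly the expression above.

```python
import torch

def score_function_surrogate(p_theta, q_lambda, x, K=16):
    """grad_lambda of this surrogate is the MC estimate
    (1/K) sum_k log p_theta(x|z^(k)) grad_lambda log q_lambda(z^(k)|x)."""
    psi = q_lambda(x)                                    # depends on lambda
    q = torch.distributions.Bernoulli(probs=psi)
    z = q.sample((K,))                                   # sampling already blocks gradients
    log_px = torch.stack([p_theta(x, z_k) for z_k in z]).detach()  # learning signal, a constant wrt lambda
    log_qz = q.log_prob(z).sum(-1)                       # log q_lambda(z^(k)|x)
    return -(log_px * log_qz).mean()
```

Note that `log_px` only rescales the gradient of `log_qz`, which is exactly why its widely varying magnitude drives the variance of the estimator.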

When variance is high we can
- sample more: won't scale;
- use variance reduction techniques (e.g. baselines and control variates): excellent idea, and now it's time for it!

Example Model

Let us consider a latent factor model for topic modelling:
- a document $x = (x_1, \ldots, x_n)$ consists of $n$ i.i.d. categorical draws from the model;
- the categorical distribution in turn depends on the binary latent factors $z = (z_1, \ldots, z_k)$, which are also i.i.d.

$$z_j \sim \mathrm{Bernoulli}(\phi) \qquad (1 \le j \le k)$$
$$x_i \sim \mathrm{Categorical}\big(g_\theta(z)\big) \qquad (1 \le i \le n)$$

Here $\phi$ specifies the Bernoulli prior and $g_\theta(\cdot)$ is a function computed by a neural network with a softmax output.
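As a concrete reference, here is one possible PyTorch sketch of this generative model; the layer sizes and the class name `GenerativeModel` are illustrative choices, not part of the lecture.

```python
import torch
import torch.nn as nn

class GenerativeModel(nn.Module):
    """p_theta(x|z): k binary latent factors -> one categorical distribution over V word types."""
    def __init__(self, k, vocab_size, hidden=128):
        super().__init__()
        self.g_theta = nn.Sequential(
            nn.Linear(k, hidden), nn.ReLU(),
            nn.Linear(hidden, vocab_size),   # logits; Categorical applies the softmax
        )

    def log_prob(self, x, z):
        """log p_theta(x|z) for a document x of word ids: sum over the n i.i.d. draws."""
        px = torch.distributions.Categorical(logits=self.g_theta(z))
        return px.log_prob(x).sum(-1)
```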

Example Model

[Figure: graphical model with latent factors $z_1, z_2, z_3$ and observations $x_1, x_2, x_3, x_4$.]

At inference time the latent variables are marginally dependent. For our variational distribution we are going to assume that they are not (recall: the mean field assumption).

Inference Network

[Figure: inference network with parameters $\lambda$ connecting the observations $x_1, \ldots, x_4$ to the latent factors $z_1, z_2, z_3$.]

The inference network needs to predict $k$ Bernoulli parameters $\psi$. Any neural network with a sigmoid output will do that job:

$$q_\lambda(z \mid x) = \prod_{j=1}^{k} \mathrm{Bern}(z_j \mid \psi_j), \qquad \text{where } \psi_1^k = f_\lambda(x) \tag{3}$$
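A companion sketch of the inference network in Eq. (3), under the same assumptions (PyTorch; the bag-of-words encoder is an illustrative choice).

```python
import torch.nn as nn

class InferenceModel(nn.Module):
    """q_lambda(z|x): predicts k Bernoulli parameters psi = f_lambda(x) with a sigmoid output."""
    def __init__(self, vocab_size, k, hidden=128):
        super().__init__()
        self.f_lambda = nn.Sequential(
            nn.Linear(vocab_size, hidden), nn.ReLU(),
            nn.Linear(hidden, k), nn.Sigmoid(),
        )

    def forward(self, x_bow):
        # x_bow: bag-of-words count vector for the document
        return self.f_lambda(x_bow)          # psi in (0,1)^k
```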

Computation Graph

[Figure: the inference model $f_\lambda(x)$ outputs $\psi$; $z \sim \mathrm{Bernoulli}(\psi)$ is sampled and passed to the generation model $p_\theta(x \mid z)$; the quantities attached to the graph are $\log q_\lambda(z \mid \psi)$, $\log p_\theta(x \mid z)$ and the KL term.]

Reparametrisation Gradient

[Figure, for comparison: in the reparametrised case the inference model predicts $\mu$ and $\sigma$ from $x$, noise $\epsilon \sim \mathcal{N}(0, 1)$ is sampled, $z$ is computed from $(\mu, \sigma, \epsilon)$ and passed to the generation model, and the KL term is computed analytically from $\mu$ and $\sigma$.]

Pros and Cons

Pros:
- applicable to all distributions;
- many libraries come with samplers for common distributions.

Cons:
- high variance!

Outline

Control variates

Suppose we want to estimate $\mathbb{E}[f(Z)]$ and we know the expected value of another function $\psi(Z)$ on the same support. Then it holds that

$$\mathbb{E}[f(Z)] = \mathbb{E}[f(Z) - \psi(Z)] + \mathbb{E}[\psi(Z)] \tag{4}$$

If $\psi(z) = f(z)$ and we estimate the expected value of $f(z) - \psi(z)$, then we have reduced the variance to 0. In general,

$$\mathrm{Var}(f - \psi) = \mathrm{Var}(f) - 2\,\mathrm{Cov}(f, \psi) + \mathrm{Var}(\psi) \tag{5}$$

so if $f$ and $\psi$ are strongly correlated, with $2\,\mathrm{Cov}(f, \psi) > \mathrm{Var}(\psi)$, then we improve on the original estimation problem. (Greensmith et al., 2004)
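To see Eqs. (4)-(5) in action, here is a small NumPy check on a toy problem of our own choosing (not from the lecture): estimate $\mathbb{E}[e^Z]$ for $Z \sim \mathcal{N}(0,1)$ using $\psi(z) = z$ as a control variate, whose expectation is known to be 0 and which is positively correlated with $e^z$.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 100                                       # samples per estimate
z = rng.standard_normal((10_000, K))          # 10k repetitions of the experiment

f = np.exp(z)                                 # target: E[f(Z)] = exp(1/2) ~ 1.6487
psi = z                                       # control variate with E[psi(Z)] = 0

naive = f.mean(axis=1)                        # plain MC estimate of E[f(Z)]
controlled = (f - psi).mean(axis=1) + 0.0     # Eq. (4): estimate of E[f - psi] plus the known E[psi]

print(naive.mean(), controlled.mean())        # both unbiased, close to 1.6487
print(naive.var(), controlled.var())          # the controlled estimator has roughly half the variance
```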

Reducing variance of the score function estimator

Back to the score function estimator:

$$\mathbb{E}_{q_\lambda(z \mid x)}\big[\log p_\theta(x \mid z)\, \nabla_\lambda \log q_\lambda(z \mid x)\big]$$

$$= \mathbb{E}_{q_\lambda(z \mid x)}\Big[\underbrace{\log p_\theta(x \mid z)\, \nabla_\lambda \log q_\lambda(z \mid x)}_{f(z)} - \underbrace{C(x)\, \nabla_\lambda \log q_\lambda(z \mid x)}_{\psi(z)}\Big] + \mathbb{E}_{q_\lambda(z \mid x)}\Big[\underbrace{C(x)\, \nabla_\lambda \log q_\lambda(z \mid x)}_{\psi(z)}\Big]$$

The last term is very simple!

$\mathbb{E}[\psi(Z)]$

$$\mathbb{E}_{q_\lambda(z \mid x)}\big[C(x)\, \nabla_\lambda \log q_\lambda(z \mid x)\big] = C(x)\, \mathbb{E}_{q_\lambda(z \mid x)}\big[\nabla_\lambda \log q_\lambda(z \mid x)\big] = C(x)\, \mathbb{E}_{q_\lambda(z \mid x)}\left[\frac{\nabla_\lambda q_\lambda(z \mid x)}{q_\lambda(z \mid x)}\right]$$

$$= C(x) \int q_\lambda(z \mid x)\, \frac{\nabla_\lambda q_\lambda(z \mid x)}{q_\lambda(z \mid x)} \,\mathrm{d}z = C(x) \int \nabla_\lambda q_\lambda(z \mid x) \,\mathrm{d}z = C(x)\, \nabla_\lambda \int q_\lambda(z \mid x) \,\mathrm{d}z = C(x)\, \nabla_\lambda 1 = 0$$
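The same fact, that the expected score is zero, is easy to verify numerically. A quick PyTorch sanity check (the parameter values are arbitrary):

```python
import torch

psi = torch.tensor([0.2, 0.7, 0.5], requires_grad=True)   # Bernoulli parameters, playing the role of lambda
q = torch.distributions.Bernoulli(probs=psi)
z = q.sample((100_000,))                                   # many samples from q

avg_log_q = q.log_prob(z).sum(-1).mean()                   # MC average of log q(z)
grad, = torch.autograd.grad(avg_log_q, psi)                # MC estimate of E[grad_lambda log q(z)]
print(grad)                                                # each coordinate is close to zero
```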

Improved estimator

Back to the score function estimator:

$$\mathbb{E}_{q_\lambda(z \mid x)}\big[\log p_\theta(x \mid z)\, \nabla_\lambda \log q_\lambda(z \mid x)\big]$$

$$= \mathbb{E}_{q_\lambda(z \mid x)}\big[\log p_\theta(x \mid z)\, \nabla_\lambda \log q_\lambda(z \mid x) - C(x)\, \nabla_\lambda \log q_\lambda(z \mid x)\big] + \mathbb{E}_{q_\lambda(z \mid x)}\big[C(x)\, \nabla_\lambda \log q_\lambda(z \mid x)\big]$$

$$= \mathbb{E}_{q_\lambda(z \mid x)}\big[\log p_\theta(x \mid z)\, \nabla_\lambda \log q_\lambda(z \mid x) - C(x)\, \nabla_\lambda \log q_\lambda(z \mid x)\big]$$

$$= \mathbb{E}_{q_\lambda(z \mid x)}\big[\big(\log p_\theta(x \mid z) - C(x)\big)\, \nabla_\lambda \log q_\lambda(z \mid x)\big]$$

$C(x)$ is called a baseline.

Baselines

Baselines can be constant

$$\mathbb{E}_{q_\lambda(z \mid x)}\big[\big(\log p_\theta(x \mid z) - C\big)\, \nabla_\lambda \log q_\lambda(z \mid x)\big] \tag{6}$$

or input-dependent

$$\mathbb{E}_{q_\lambda(z \mid x)}\big[\big(\log p_\theta(x \mid z) - C(x)\big)\, \nabla_\lambda \log q_\lambda(z \mid x)\big] \tag{7}$$

or both

$$\mathbb{E}_{q_\lambda(z \mid x)}\big[\big(\log p_\theta(x \mid z) - C - C(x)\big)\, \nabla_\lambda \log q_\lambda(z \mid x)\big] \tag{8}$$

(Williams, 1992)

Full power of control variates

If we design $C(\cdot)$ to depend on the variable of integration $z$, we exploit the full power of control variates, but designing and using such control variates requires more careful treatment. (Blei et al., 2012; Ranganath et al., 2014; Gregor et al., 2014)

Learning baselines

Baselines are predicted by a regression model (e.g. a neural net). One idea is to centre the learning signal, in which case we train the baseline with an $L_2$ loss:

$$\rho = \arg\min_\rho\ \big(C_\rho(x) - \log p_\theta(x \mid z)\big)^2$$

(Gu et al., 2015)
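In code, assuming PyTorch and a hypothetical `baseline_net` (a small regression network playing the role of $C_\rho$), this is a plain squared-error loss on a detached learning signal:

```python
def baseline_loss(baseline_net, x, log_px_given_z):
    # log_px_given_z is the sampled learning signal log p_theta(x|z^(k));
    # detaching it makes this loss train only rho, not theta.
    c = baseline_net(x)
    return ((c - log_px_given_z.detach()) ** 2).mean()
```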

Putting it together

Parameter estimation:

$$\arg\max_{\theta, \lambda}\; \mathbb{E}_{q_\lambda(z \mid x)}[\log p_\theta(x \mid Z)] - \mathrm{KL}\big(q_\lambda(z \mid x) \,\|\, p(z)\big)$$
$$\arg\min_\rho\ \big(C_\rho(x) - \log p_\theta(x \mid z)\big)^2$$

Generative gradient:

$$\mathbb{E}_{q_\lambda(z \mid x)}\big[\nabla_\theta \log p_\theta(x \mid z)\big]$$

Inference gradient:

$$\mathbb{E}_{q_\lambda(z \mid x)}\big[\big(\log p_\theta(x \mid z) - C(x)\big)\, \nabla_\lambda \log q_\lambda(z \mid x)\big]$$
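The whole recipe fits in one training step. The sketch below (PyTorch assumed; all names such as `p_theta`, `q_lambda`, `baseline_net` are illustrative) builds a single surrogate loss whose gradients reproduce the three quantities above: the generative gradient for $\theta$, the baselined score function gradient plus the analytical KL gradient for $\lambda$, and the $L_2$ baseline loss for $\rho$.

```python
import torch

def training_step(p_theta, q_lambda, baseline_net, optimiser, x, prior_p=0.5, K=1):
    psi = q_lambda(x)
    q = torch.distributions.Bernoulli(probs=psi)                # q_lambda(z|x)
    prior = torch.distributions.Bernoulli(probs=torch.full_like(psi, prior_p))
    z = q.sample((K,))                                          # z^(k) ~ q_lambda(Z|x)

    log_px = torch.stack([p_theta(x, z_k) for z_k in z])        # log p_theta(x|z^(k)), trains theta
    log_qz = q.log_prob(z).sum(-1)                              # log q_lambda(z^(k)|x), trains lambda
    kl = torch.distributions.kl_divergence(q, prior).sum(-1)    # analytical KL(q || prior), trains lambda
    c = baseline_net(x)                                         # baseline C_rho(x)

    signal = (log_px - c).detach()                              # centred learning signal, constant wrt all parameters
    elbo_surrogate = log_px.mean() + (signal * log_qz).mean() - kl
    loss = -elbo_surrogate + ((c - log_px.detach()) ** 2).mean()  # maximise the ELBO, regress the baseline

    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
    return loss.item()
```

A usage sketch would instantiate the two networks from the earlier examples together with a small baseline regressor, and pass a single optimiser over all of their parameters.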

Summary

- Reparametrisation is not available for discrete variables.
- Use the score function estimator instead.
- It has high variance.
- Always use baselines for variance reduction!

Literature

David M. Blei, Michael I. Jordan, and John W. Paisley. Variational Bayesian inference with stochastic search. In ICML, 2012. URL http://icml.cc/2012/papers/687.pdf.

Evan Greensmith, Peter L. Bartlett, and Jonathan Baxter. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5(Nov):1471-1530, 2004.

Karol Gregor, Ivo Danihelka, Andriy Mnih, Charles Blundell, and Daan Wierstra. Deep autoregressive networks. In ICML, pages 1242-1250, 2014. URL http://proceedings.mlr.press/v32/gregor14.html.

Shixiang Gu, Sergey Levine, Ilya Sutskever, and Andriy Mnih. MuProp: Unbiased backpropagation for stochastic neural networks. arXiv preprint arXiv:1511.05176, 2015.

Andriy Mnih and Karol Gregor. Neural variational inference and learning in belief networks. arXiv preprint arXiv:1402.0030, 2014.

Rajesh Ranganath, Sean Gerrish, and David Blei. Black box variational inference. In AISTATS, pages 814-822, 2014. URL http://proceedings.mlr.press/v33/ranganath14.pdf.

Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229-256, 1992. URL https://doi.org/10.1007/BF00992696.