Score function estimator and variance reduction techniques


Wilker Aziz, University of Amsterdam. May 24, 2018.

Outline

Variational inference for belief networks

Generative model with a neural network likelihood (graphical model: latent z and observed x, repeated over N data points; inference parameters λ, generative parameters θ). Let z ∈ {0,1}^d and

$$ q_\lambda(z \mid x) = \prod_{i=1}^{d} q_\lambda(z_i \mid x) = \prod_{i=1}^{d} \mathrm{Bern}\big(z_i \mid \mathrm{sigmoid}(f_\lambda(x))_i\big) \tag{1} $$

We jointly optimise the generative model p_θ(x|z) and the inference model q_λ(z|x) under the same objective (the ELBO). (Mnih and Gregor, 2014)

Objective

$$ \log p_\theta(x) \;\geq\; \underbrace{\mathbb{E}_{q_\lambda(z\mid x)}[\log p_\theta(x, Z)] + \mathbb{H}\big(q_\lambda(z\mid x)\big)}_{\text{ELBO}} \;=\; \mathbb{E}_{q_\lambda(z\mid x)}[\log p_\theta(x\mid Z)] - \mathrm{KL}\big(q_\lambda(z\mid x)\,\|\,p(z)\big) $$

Parameter estimation:

$$ \arg\max_{\theta,\lambda}\; \mathbb{E}_{q_\lambda(z\mid x)}[\log p_\theta(x\mid Z)] - \mathrm{KL}\big(q_\lambda(z\mid x)\,\|\,p(z)\big) $$

Generative Network Gradient

The KL term is constant with respect to θ, so

$$ \nabla_\theta \Big( \mathbb{E}_{q_\lambda(z\mid x)}[\log p_\theta(x\mid z)] - \underbrace{\mathrm{KL}\big(q_\lambda(z\mid x)\,\|\,p(z)\big)}_{\text{constant wrt } \theta} \Big) = \mathbb{E}_{q_\lambda(z\mid x)}\big[\underbrace{\nabla_\theta \log p_\theta(x\mid z)}_{\text{expected gradient :)}}\big] $$

which we approximate by Monte Carlo:

$$ \overset{\text{MC}}{\approx} \frac{1}{K} \sum_{k=1}^{K} \nabla_\theta \log p_\theta(x\mid z^{(k)}), \qquad z^{(k)} \sim q_\lambda(Z\mid x) $$

Note: q_λ(z|x) does not depend on θ, so the samples act as constants for this gradient (see the sketch below).
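A minimal PyTorch sketch of this estimator, under assumptions not in the slides: a hypothetical generative network `log_px_given_z(x, z)` that returns log p_θ(x|z) for differentiable parameters θ, and Bernoulli means `q_probs` produced by the inference network.

```python
import torch

def generative_gradient(x, q_probs, log_px_given_z, K=8):
    """Accumulate the MC estimate (1/K) sum_k grad_theta log p_theta(x | z^(k))
    in the .grad fields of the generative parameters."""
    samples = []
    for _ in range(K):
        # z^(k) ~ q_lambda(Z|x); the sample is a constant with respect to theta
        z = torch.bernoulli(q_probs).detach()
        samples.append(log_px_given_z(x, z))
    # backward() through the average yields the MC gradient estimate
    torch.stack(samples).mean().backward()
```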

Inference Network Gradient

$$ \nabla_\lambda \Big( \mathbb{E}_{q_\lambda(z\mid x)}[\log p_\theta(x\mid z)] - \underbrace{\mathrm{KL}\big(q_\lambda(z\mid x)\,\|\,p(z)\big)}_{\text{analytical}} \Big) = \nabla_\lambda \mathbb{E}_{q_\lambda(z\mid x)}[\log p_\theta(x\mid z)] - \underbrace{\nabla_\lambda \mathrm{KL}\big(q_\lambda(z\mid x)\,\|\,p(z)\big)}_{\text{analytical computation}} $$

The first term again requires approximation by sampling, but there is a problem.
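For the fully factorised Bernoulli case of this lecture, the KL term is indeed available in closed form. As a sketch, writing ψ = sigmoid(f_λ(x)) for the posterior parameters and assuming a factorised Bern(φ) prior (the prior used later in the example model):

$$ \mathrm{KL}\big(q_\lambda(z\mid x)\,\|\,p(z)\big) = \sum_{i=1}^{d} \left[ \psi_i \log\frac{\psi_i}{\varphi} + (1-\psi_i)\log\frac{1-\psi_i}{1-\varphi} \right] $$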

Inference Network Gradient

$$ \nabla_\lambda \mathbb{E}_{q_\lambda(z\mid x)}[\log p_\theta(x\mid z)] = \nabla_\lambda \int q_\lambda(z\mid x)\, \log p_\theta(x\mid z)\, \mathrm{d}z = \int \underbrace{\nabla_\lambda \big(q_\lambda(z\mid x)\big)\, \log p_\theta(x\mid z)}_{\text{not an expectation}}\, \mathrm{d}z $$

- The MC estimator is non-differentiable: we cannot sample first and then differentiate.
- Differentiating the expression does not yield an expectation: we cannot approximate it via MC.
- The original VAE employed a reparameterisation, but...

Bernoulli pmf

$$ \mathrm{Bern}(z_j \mid b_j) = b_j^{z_j} (1 - b_j)^{1 - z_j} = \begin{cases} b_j & \text{if } z_j = 1 \\ 1 - b_j & \text{if } z_j = 0 \end{cases} $$

Can we reparameterise a Bernoulli variable?

Reparameterisation requires a Jacobian matrix

Not really :(

$$ q(z; \lambda) = \phi\big(\epsilon = h(z, \lambda)\big)\,\underbrace{\big|\det J_{h(z,\lambda)}\big|}_{\text{change of density}} \tag{2} $$

The elements of the Jacobian matrix

$$ J_{h(z,\lambda)}[i, j] = \frac{\partial h_i(z, \lambda)}{\partial z_j} $$

are not defined for non-differentiable functions.

Outline

We can again use the log identity for derivatives

Using the identity ∇_λ q_λ(z|x) = q_λ(z|x) ∇_λ log q_λ(z|x):

$$ \begin{aligned} \nabla_\lambda \mathbb{E}_{q_\lambda(z\mid x)}[\log p_\theta(x\mid z)] &= \nabla_\lambda \int q_\lambda(z\mid x)\, \log p_\theta(x\mid z)\, \mathrm{d}z \\ &= \int \underbrace{\nabla_\lambda\big(q_\lambda(z\mid x)\big)\, \log p_\theta(x\mid z)}_{\text{not an expectation}}\, \mathrm{d}z \\ &= \int q_\lambda(z\mid x)\, \nabla_\lambda\big(\log q_\lambda(z\mid x)\big)\, \log p_\theta(x\mid z)\, \mathrm{d}z \\ &= \mathbb{E}_{q_\lambda(z\mid x)}\big[\underbrace{\log p_\theta(x\mid z)\, \nabla_\lambda \log q_\lambda(z\mid x)}_{\text{expected gradient :)}}\big] \end{aligned} $$

Score function estimator: high variance

We can now build an MC estimator:

$$ \nabla_\lambda \mathbb{E}_{q_\lambda(z\mid x)}[\log p_\theta(x\mid z)] = \mathbb{E}_{q_\lambda(z\mid x)}\big[\log p_\theta(x\mid z)\, \nabla_\lambda \log q_\lambda(z\mid x)\big] \overset{\text{MC}}{\approx} \frac{1}{K} \sum_{k=1}^{K} \log p_\theta(x\mid z^{(k)})\, \nabla_\lambda \log q_\lambda(z^{(k)}\mid x), \qquad z^{(k)} \sim q_\lambda(Z\mid x) $$

but:

- the magnitude of log p_θ(x|z) varies widely
- the model likelihood does not contribute to the direction of the gradient
- there is too much variance for the estimator to be useful

but: it is fully differentiable! (A toy numerical illustration follows.)
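As a sketch only (a toy problem, not the lecture's model): we estimate ∇_λ E_{Bern(z|sigmoid(λ))}[f(z)] with the score function estimator and compare its spread to the exact gradient. Here f is a made-up stand-in for log p_θ(x|z) with a large z-independent magnitude, which is exactly what makes the estimator noisy.

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 0.3                        # a single Bernoulli logit (illustrative)
p = 1.0 / (1.0 + np.exp(-lam))   # sigmoid(lam)
f = lambda z: z - 100.0          # stand-in for log p_theta(x|z): big offset, small z-dependence

# exact gradient: d/dlam E[f(Z)] = (f(1) - f(0)) * p * (1 - p)
exact = (f(1.0) - f(0.0)) * p * (1.0 - p)

def score_function_estimate(K=10):
    z = rng.binomial(1, p, size=K).astype(float)
    # grad_lam log Bern(z | sigmoid(lam)) = z - p
    return np.mean(f(z) * (z - p))

estimates = [score_function_estimate() for _ in range(2000)]
print(f"exact {exact:.3f}   MC mean {np.mean(estimates):.3f}   MC std {np.std(estimates):.2f}")
```

The estimator is unbiased, but its standard deviation is orders of magnitude larger than the gradient it estimates.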

When variance is high we can

- sample more: won't scale
- use variance reduction techniques (e.g. baselines and control variates): excellent idea, and now it's time for it!

Example Model

Let us consider a latent factor model for topic modelling:

- a document x = (x_1, ..., x_n) consists of n i.i.d. categorical draws from the model
- the categorical distribution in turn depends on binary latent factors z = (z_1, ..., z_k), which are also i.i.d.

$$ z_j \sim \mathrm{Bernoulli}(\varphi) \quad (1 \leq j \leq k) \qquad\qquad x_i \sim \mathrm{Categorical}\big(g_\theta(z)\big) \quad (1 \leq i \leq n) $$

Here φ specifies the Bernoulli prior and g_θ(·) is a function computed by a neural network with a softmax output. (A small sampling sketch follows.)
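A minimal numpy sketch of ancestral sampling from this model, assuming a made-up linear map followed by a softmax in place of the real neural network g_θ; the sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
k, n, vocab = 3, 5, 10            # latent factors, document length, vocabulary size
phi = 0.5                         # Bernoulli prior parameter
W = rng.normal(size=(vocab, k))   # stand-in for the parameters of g_theta

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

z = rng.binomial(1, phi, size=k)        # z_j ~ Bernoulli(phi), i.i.d.
probs = softmax(W @ z)                  # g_theta(z): categorical parameters
x = rng.choice(vocab, size=n, p=probs)  # x_i ~ Categorical(g_theta(z)), i.i.d.
print(z, x)
```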

Example Model

(Figure: directed graphical model with latent factors z_1, z_2, z_3 and observed words x_1, ..., x_4.)

At inference time the latent variables are marginally dependent. For our variational distribution we are going to assume that they are not (recall: the mean field assumption).

Inference Network

(Figure: inference network with parameters λ mapping x_1, ..., x_4 to the factors z_1, z_2, z_3.)

The inference network needs to predict k Bernoulli parameters ψ. Any neural network with a sigmoid output will do that job:

$$ q_\lambda(z \mid x) = \prod_{j=1}^{k} \mathrm{Bern}(z_j \mid \psi_j), \qquad \text{where } \psi_1^k = f_\lambda(x) \tag{3} $$
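A minimal PyTorch sketch of such an inference network, with illustrative sizes (a bag-of-words input of dimension `vocab`, one hidden layer); none of these architectural choices come from the slides.

```python
import torch
import torch.nn as nn

class InferenceNet(nn.Module):
    """Predicts k Bernoulli parameters psi = f_lambda(x) with a sigmoid output."""
    def __init__(self, vocab=10, hidden=64, k=3):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(vocab, hidden), nn.Tanh(),
                                 nn.Linear(hidden, k), nn.Sigmoid())

    def forward(self, x):
        return self.net(x)  # psi, shape (batch, k)

def log_q(z, psi):
    """log q_lambda(z|x) = sum_j log Bern(z_j | psi_j)."""
    return torch.distributions.Bernoulli(probs=psi).log_prob(z).sum(-1)
```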

Computation Graph

(Figure: computation graph. The inference model (parameters λ) maps x to ψ; z ~ Bernoulli(ψ) is sampled; the generation model (parameters θ) evaluates log p_θ(x|z); the objective combines log q_λ(z|ψ), log p_θ(x|z) and the KL term.)

Reparametrisation Gradient

(Figure, for comparison: the Gaussian reparametrisation graph. The inference model (parameters λ) predicts μ and σ, noise ε ~ N(0, 1) is combined with them to give z, and the generation model (parameters θ) and the KL term close the graph, so gradients can flow back to λ.)

Pros and Cons

Pros
- applicable to all distributions
- many libraries come with samplers for common distributions

Cons
- high variance!

Outline

Control variates

Suppose we want to estimate E[f(Z)] and we know the expected value of another function ψ(z) on the same support. Then it holds that

$$ \mathbb{E}[f(Z)] = \mathbb{E}[f(Z) - \psi(Z)] + \mathbb{E}[\psi(Z)] \tag{4} $$

If ψ(z) = f(z) and we estimate the expected value of f(z) − ψ(z), then we have reduced the variance to 0. In general,

$$ \mathrm{Var}(f - \psi) = \mathrm{Var}(f) - 2\,\mathrm{Cov}(f, \psi) + \mathrm{Var}(\psi) \tag{5} $$

so if f and ψ are strongly correlated, with 2 Cov(f, ψ) > Var(ψ), then we improve on the original estimation problem. (Greensmith et al., 2004)
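A sketch only, with made-up f and ψ: estimating E[f(Z)] for Z ~ N(0, 1), using the linear function ψ(z) = 1 + 0.5z (whose expectation under the standard normal is known to be 1) as a control variate.

```python
import numpy as np

rng = np.random.default_rng(2)
f = lambda z: np.exp(0.5 * z)     # function whose expectation we want (illustrative)
psi = lambda z: 1.0 + 0.5 * z     # control variate, correlated with f, E[psi(Z)] = 1

def plain(K=100):
    z = rng.standard_normal(K)
    return f(z).mean()

def with_cv(K=100):
    z = rng.standard_normal(K)
    return (f(z) - psi(z)).mean() + 1.0   # E[f - psi] estimated by MC, plus E[psi]

plain_runs = [plain() for _ in range(2000)]
cv_runs = [with_cv() for _ in range(2000)]
print(f"plain: mean {np.mean(plain_runs):.3f}  std {np.std(plain_runs):.4f}")
print(f"cv   : mean {np.mean(cv_runs):.3f}  std {np.std(cv_runs):.4f}")
```

Both estimators target the same expectation; the control-variate version has a visibly smaller spread across runs.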

Reducing variance of the score function estimator

Back to the score function estimator:

$$ \begin{aligned} \mathbb{E}_{q_\lambda(z\mid x)}\big[\log p_\theta(x\mid z)\, \nabla_\lambda \log q_\lambda(z\mid x)\big] ={}& \mathbb{E}_{q_\lambda(z\mid x)}\big[\underbrace{\log p_\theta(x\mid z)\, \nabla_\lambda \log q_\lambda(z\mid x) - C(x)\, \nabla_\lambda \log q_\lambda(z\mid x)}_{f(z)-\psi(z)}\big] \\ &+ \mathbb{E}_{q_\lambda(z\mid x)}\big[\underbrace{C(x)\, \nabla_\lambda \log q_\lambda(z\mid x)}_{\psi(z)}\big] \end{aligned} $$

The last term is very simple!

E[ψ(Z)]

$$ \begin{aligned} \mathbb{E}_{q_\lambda(z\mid x)}\big[C(x)\, \nabla_\lambda \log q_\lambda(z\mid x)\big] &= C(x)\, \mathbb{E}_{q_\lambda(z\mid x)}\big[\nabla_\lambda \log q_\lambda(z\mid x)\big] \\ &= C(x)\, \mathbb{E}_{q_\lambda(z\mid x)}\left[\frac{\nabla_\lambda q_\lambda(z\mid x)}{q_\lambda(z\mid x)}\right] \\ &= C(x) \int q_\lambda(z\mid x)\, \frac{\nabla_\lambda q_\lambda(z\mid x)}{q_\lambda(z\mid x)}\, \mathrm{d}z \\ &= C(x) \int \nabla_\lambda q_\lambda(z\mid x)\, \mathrm{d}z \\ &= C(x)\, \nabla_\lambda \int q_\lambda(z\mid x)\, \mathrm{d}z \\ &= C(x)\, \nabla_\lambda 1 \\ &= 0 \end{aligned} $$
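A quick numerical sanity check of this identity, sketched on a single Bernoulli with logit λ, for which ∇_λ log q_λ(z) = z − sigmoid(λ):

```python
import numpy as np

rng = np.random.default_rng(3)
lam = 0.7
p = 1.0 / (1.0 + np.exp(-lam))
z = rng.binomial(1, p, size=1_000_000).astype(float)
# score of a Bernoulli with logit parameterisation: grad_lam log q(z) = z - p
print((z - p).mean())  # close to 0, as the derivation predicts
```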

Improved estimator

Back to the score function estimator:

$$ \begin{aligned} \mathbb{E}_{q_\lambda(z\mid x)}\big[\log p_\theta(x\mid z)\, \nabla_\lambda \log q_\lambda(z\mid x)\big] &= \mathbb{E}_{q_\lambda(z\mid x)}\big[\log p_\theta(x\mid z)\, \nabla_\lambda \log q_\lambda(z\mid x) - C(x)\, \nabla_\lambda \log q_\lambda(z\mid x)\big] + \mathbb{E}_{q_\lambda(z\mid x)}\big[C(x)\, \nabla_\lambda \log q_\lambda(z\mid x)\big] \\ &= \mathbb{E}_{q_\lambda(z\mid x)}\big[\log p_\theta(x\mid z)\, \nabla_\lambda \log q_\lambda(z\mid x) - C(x)\, \nabla_\lambda \log q_\lambda(z\mid x)\big] \\ &= \mathbb{E}_{q_\lambda(z\mid x)}\big[\big(\log p_\theta(x\mid z) - C(x)\big)\, \nabla_\lambda \log q_\lambda(z\mid x)\big] \end{aligned} $$

C(x) is called a baseline. (A variance comparison appears below.)
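Continuing the toy Bernoulli example from earlier (all quantities illustrative): subtracting a constant baseline, here simply E[f(Z)], leaves the estimator unbiased but removes most of its variance, because the large z-independent part of f no longer multiplies the score.

```python
import numpy as np

rng = np.random.default_rng(4)
lam, K = 0.3, 10
p = 1.0 / (1.0 + np.exp(-lam))
f = lambda z: z - 100.0                 # same stand-in log-likelihood as before
C = p * f(1.0) + (1.0 - p) * f(0.0)     # constant baseline: here simply E[f(Z)]

def estimate(baseline=0.0):
    z = rng.binomial(1, p, size=K).astype(float)
    return np.mean((f(z) - baseline) * (z - p))   # score of the Bernoulli is z - p

plain = [estimate() for _ in range(2000)]
based = [estimate(C) for _ in range(2000)]
print(f"no baseline:   mean {np.mean(plain):.3f}   std {np.std(plain):.3f}")
print(f"with baseline: mean {np.mean(based):.3f}   std {np.std(based):.3f}")
```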

Baselines

Baselines can be constant,

$$ \mathbb{E}_{q_\lambda(z\mid x)}\big[\big(\log p_\theta(x\mid z) - C\big)\, \nabla_\lambda \log q_\lambda(z\mid x)\big] \tag{6} $$

input-dependent,

$$ \mathbb{E}_{q_\lambda(z\mid x)}\big[\big(\log p_\theta(x\mid z) - C(x)\big)\, \nabla_\lambda \log q_\lambda(z\mid x)\big] \tag{7} $$

or both:

$$ \mathbb{E}_{q_\lambda(z\mid x)}\big[\big(\log p_\theta(x\mid z) - C - C(x)\big)\, \nabla_\lambda \log q_\lambda(z\mid x)\big] \tag{8} $$

(Williams, 1992)

Full power of control variates

If we design C(·) to depend on the variable of integration z, we exploit the full power of control variates, but designing and using such control variates requires more careful treatment. (Blei et al., 2012; Ranganath et al., 2014; Gregor et al., 2014)

Learning baselines

Baselines are predicted by a regression model (e.g. a neural network). One idea is to centre the learning signal, in which case we train the baseline with an L2 loss:

$$ \rho^{\star} = \arg\min_{\rho}\; \big(C_\rho(x) - \log p_\theta(x\mid z)\big)^2 $$

(Gu et al., 2015)

Putting it together

Parameter estimation:

$$ \arg\max_{\theta,\lambda}\; \mathbb{E}_{q_\lambda(z\mid x)}[\log p_\theta(x\mid Z)] - \mathrm{KL}\big(q_\lambda(z\mid x)\,\|\,p(z)\big) \qquad\qquad \arg\min_{\rho}\; \big(C_\rho(x) - \log p_\theta(x\mid z)\big)^2 $$

Generative gradient:

$$ \mathbb{E}_{q_\lambda(z\mid x)}\big[\nabla_\theta \log p_\theta(x\mid z)\big] $$

Inference gradient:

$$ \mathbb{E}_{q_\lambda(z\mid x)}\big[\big(\log p_\theta(x\mid z) - C(x)\big)\, \nabla_\lambda \log q_\lambda(z\mid x)\big] $$
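The sketch below puts these pieces into a single PyTorch training step. It is only an illustration of the estimators, not the lecture's exact model: `inf_net`, `gen_net` and `baseline_net` are assumed user-defined modules (the inference net outputs Bernoulli probabilities, the generative net returns log p_θ(x|z) per datapoint, the baseline net returns a scalar C_ρ(x)), the prior is factorised Bern(φ), and `optimizer` covers the parameters of all three networks.

```python
import torch

def training_step(x, inf_net, gen_net, baseline_net, optimizer, phi=0.5):
    """One update: score function estimator with a baseline for lambda,
    plain backprop for theta, and an L2 loss for the baseline parameters rho."""
    optimizer.zero_grad()

    psi = inf_net(x)                                  # Bernoulli parameters psi = f_lambda(x)
    q = torch.distributions.Bernoulli(probs=psi)
    z = q.sample()                                    # z ~ q_lambda(Z|x); no gradient through sampling

    log_px_z = gen_net(x, z)                          # log p_theta(x|z), shape (batch,)
    prior = torch.distributions.Bernoulli(probs=torch.full_like(psi, phi))
    kl = torch.distributions.kl_divergence(q, prior).sum(-1)   # analytical KL per datapoint

    C = baseline_net(x).squeeze(-1)                   # input-dependent baseline C_rho(x)
    learning_signal = (log_px_z - C).detach()         # centred signal, treated as a constant

    # Surrogate whose gradients reproduce the estimators above:
    #   theta:  grad of -log p_theta(x|z)
    #   lambda: -(log p - C) * grad log q  (score function estimator) + grad KL
    #   rho:    grad of the L2 baseline loss
    log_qz = q.log_prob(z).sum(-1)
    loss = (-log_px_z - learning_signal * log_qz + kl
            + (C - log_px_z.detach()) ** 2).mean()
    loss.backward()
    optimizer.step()
    return loss.item()
```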

Summary

- Reparametrisation is not available for discrete variables.
- Use the score function estimator instead.
- It has high variance.
- Always use baselines for variance reduction!

Literature I

David M. Blei, Michael I. Jordan, and John W. Paisley. Variational Bayesian inference with stochastic search. In ICML, 2012.

Evan Greensmith, Peter L. Bartlett, and Jonathan Baxter. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5, 2004.

Karol Gregor, Ivo Danihelka, Andriy Mnih, Charles Blundell, and Daan Wierstra. Deep autoregressive networks. In ICML, 2014.

Shixiang Gu, Sergey Levine, Ilya Sutskever, and Andriy Mnih. MuProp: unbiased backpropagation for stochastic neural networks. arXiv preprint, 2015.

Literature II

Andriy Mnih and Karol Gregor. Neural variational inference and learning in belief networks. arXiv preprint, 2014.

Rajesh Ranganath, Sean Gerrish, and David Blei. Black box variational inference. In AISTATS, 2014.

Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4), 1992.
