Warped Mixture Models

Size: px

Start display at page:

Download "Warped Mixture Models"

Cory Sutton
5 years ago
Views:

1 Warped Mixture Models Tomoharu Iwata, David Duvenaud, Zoubin Ghahramani Cambridge University Computational and Biological Learning Lab March 11, 2013

2 OUTLINE Motivation Gaussian Process Latent Variable Model Warped Mixtures Generative Model Inference Results Generative Model Inference Future Work: Variational Inference Semi-supervised learning Other Latent Priors Life of a Bayesian Model

3 MOTIVATION I: MANIFOLD SEMI-SUPERVISED LEARNING Most manifold learning algorithms start by constructing a graph locally. X O X O

4 MOTIVATION I: MANIFOLD SEMI-SUPERVISED LEARNING Most manifold learning algorithms start by constructing a graph locally. X O X O Most don t update original topology to account for long-range structure or label information.

5 Most don t update original topology to account for long-range structure or label information. Often hard to recover from a bad connectivity graph. MOTIVATION I: MANIFOLD SEMI-SUPERVISED LEARNING Most manifold learning algorithms start by constructing a graph locally. X O X O

6 MOTIVATION II: INFINITE GAUSSIAN MIXTURE MODEL Dirichelt Process prior on cluster weights. Recovers number of clusters automatically. Since each cluster must be Gaussian, number of clusters is often innapropriate

7 MOTIVATION II: INFINITE GAUSSIAN MIXTURE MODEL Dirichelt Process prior on cluster weights. Recovers number of clusters automatically. Since each cluster must be Gaussian, number of clusters is often innapropriate How to create nonparametric cluster shapes?

8 GAUSSIAN PROCESS LATENT VARIABLE MODEL Suppose observations Y = (y 1,, y N ) where y n R D, latent coordinates X = (x 1,, x N ), where x n R Q. y = f (x). observed y coordinate Latent x coordinate

9 GAUSSIAN PROCESS LATENT VARIABLE MODEL Suppose observations Y = (y 1,, y N ) where y n R D, latent coordinates X = (x 1,, x N ), where x n R Q. y = f (x) observed y coordinate Latent x coordinate

10 GAUSSIAN PROCESS LATENT VARIABLE MODEL Suppose observations Y = (y 1,, y N ) where y n R D, latent coordinates X = (x 1,, x N ), where x n R Q. y = f (x) observed y coordinate Latent x coordinate

11 GAUSSIAN PROCESS LATENT VARIABLE MODEL Suppose observations Y = (y 1,, y N ) where y n R D, latent coordinates X = (x 1,, x N ), where x n R Q. y = f (x) observed y coordinate Latent x coordinate

12 GAUSSIAN PROCESS LATENT VARIABLE MODEL Suppose observations Y = (y 1,, y N ) where y n R D, latent coordinates X = (x 1,, x N ), where x n R Q. y = f (x). observed y coordinate Latent x coordinate

13 GAUSSIAN PROCESS LATENT VARIABLE MODEL Suppose observations Y = (y 1,, y N ) where y n R D, latent coordinates X = (x 1,, x N ), where x n R Q. y = f (x). observed y coordinate Latent x coordinate

14 GAUSSIAN PROCESS LATENT VARIABLE MODEL Suppose observations Y = (y 1,, y N ) where y n R D, latent coordinates X = (x 1,, x N ), where x n R Q. y = f (x). 0 observed y coordinate Latent x coordinate

15 GAUSSIAN PROCESS LATENT VARIABLE MODEL Suppose observations Y = (y 1,, y N ) where y n R D, have latent coordinates X = (x 1,, x N ), where x n R Q. y d = f d (x), where each f d (x) GP(0, K). Mapping marginal likelihood: p(y X, θ) = (2π) DN 2 K D 2 exp ( 1 ) 2 tr(y K 1 Y) Prior on x, treated mainly as a regularizer: p(x) = N (x 0, I) Can be interpreted as a density model.

16 GAUSSIAN PROCESS LATENT VARIABLE MODEL Suppose observations Y = (y 1,, y N ) where y n R D, have latent coordinates X = (x 1,, x N ), where x n R Q. y d = f d (x), where each f d (x) GP(0, K). Mapping marginal likelihood: p(y X, θ) = (2π) DN 2 K D 2 exp ( 1 ) 2 tr(y K 1 Y) Prior on x, treated mainly as a regularizer: p(x) = N (x 0, I) Can be interpreted as a density model. Can give warped densities; how to get clusters?

17 WARPED MIXTURE MODEL GPs Latent space Observed space A sample from the iwmm prior: Sample a latent mixture of Gaussians. Warp the latent mixture to produce non-gaussian manifolds in observed space. Some areas with almost no density; some edges and peaks.

18 WARPED MIXTURE MODEL An extension of GP-LVM, where p(x) is a mixture of Gaussians. Or: An extension of igmm, where mixture is warped. Given mixture assignments, likelihood has only two parts: GP-LVM and GMM p(y X, Z, θ) = (2π) DN 2 K D 2 exp ( 1 ) 2 tr(k 1 YY ) }{{} i c=1 GP-LVM Likelihood λ c N (x i µ c, R 1 c )I(x i Z c ) } {{ } Mixture of Gaussians Likelihood

19 INFERENCE observed y coordinate Latent x coordinate Find posterior over latent X s. Many ways to do inference; high dimension of latent space means derivatives helpful. Our scheme: Alternate: 1. Sampling latent cluster assignments 2. Updating latent positions and GP hypers with HMC No cross-validation, but HMC params annoying to set Show demo!

20 INFERENCE: MIXING Changing number of clusters helps mixing.

21 DENSITY RESULTS Automatically reduces latent dimension, separately per-cluster! Wishart prior may be causing problems.

22 LATENT VISUALIZATION Hard to summarize posterior which is symmetric - average for now. VB might address problem of summarizing posterior.

23 Captures number, dimension, relationship between manifolds. LATENT VISUALIZATION: UMIST FACES DATASET

24 THE WARPED DENSITY MODEL What if we take density model of GP-LVM seriously? Why not just warp one Gaussian? Even one latent Gaussian can be made fairly flexible.

25 THE WARPED DENSITY MODEL Even one latent Gaussian can be made fairly flexible, but must place some mass between clusters. Also easier to interpret latent clusters.

26 RESULTS Evaluated iwmm as a density model, as well as a clustering model.

27 LIMITATIONS O(N 3 ) runtime Stationary kernel means diffculty modeling clusters of different sizes.

28 FUTURE WORK: VARIATIONAL INFERENCE Joint work with James Hensman. Optimization instead of integration. SVI could allow large datasets. Non-convex optimization is hard; harder than mixing?

29 FUTURE WORK: VARIATIONAL INFERENCE Joint work with James Hensman. Optimization instead of integration. SVI could allow large datasets. Non-convex optimization is hard; harder than mixing?

30 FUTURE WORK: VARIATIONAL INFERENCE Joint work with James Hensman. Optimization instead of integration. SVI could allow large datasets. Non-convex optimization is hard; harder than mixing?

31 FUTURE WORK: VARIATIONAL INFERENCE Joint work with James Hensman. Optimization instead of integration. SVI could allow large datasets. Non-convex optimization is hard; harder than mixing?

32 FUTURE WORK: VARIATIONAL INFERENCE Joint work with James Hensman. Optimization instead of integration. SVI could allow large datasets. Non-convex optimization is hard; harder than mixing?

33 FUTURE WORK: VARIATIONAL INFERENCE Joint work with James Hensman. Optimization instead of integration. SVI could allow large datasets. Non-convex optimization is hard; harder than mixing?

34 FUTURE WORK: VARIATIONAL INFERENCE Joint work with James Hensman. Optimization instead of integration. SVI could allow large datasets. Non-convex optimization is hard; harder than mixing?

35 FUTURE WORK: VARIATIONAL INFERENCE Joint work with James Hensman. Optimization instead of integration. SVI could allow large datasets. Non-convex optimization is hard; harder than mixing?

36 FUTURE WORK: VARIATIONAL INFERENCE Joint work with James Hensman. Optimization instead of integration. SVI could allow large datasets. Non-convex optimization is hard; harder than mixing?

37 FUTURE WORK: VARIATIONAL INFERENCE Joint work with James Hensman. Optimization instead of integration. SVI could allow large datasets. Non-convex optimization is hard; harder than mixing?

38 FUTURE WORK: VARIATIONAL INFERENCE Joint work with James Hensman. Optimization instead of integration. SVI could allow large datasets. Non-convex optimization is hard; harder than mixing?

39 FUTURE WORK: VARIATIONAL INFERENCE Joint work with James Hensman. Optimization instead of integration. SVI could allow large datasets. Non-convex optimization is hard; harder than mixing?

40 FUTURE WORK: VARIATIONAL INFERENCE Joint work with James Hensman. Optimization instead of integration. SVI could allow large datasets. Non-convex optimization is hard; harder than mixing?

41 FUTURE WORK: VARIATIONAL INFERENCE Joint work with James Hensman. Optimization instead of integration. SVI could allow large datasets. Non-convex optimization is hard; harder than mixing?

42 FUTURE WORK: VARIATIONAL INFERENCE Joint work with James Hensman. Optimization instead of integration. SVI could allow large datasets. Non-convex optimization is hard; harder than mixing?

43 FUTURE WORK: VARIATIONAL INFERENCE Joint work with James Hensman. Optimization instead of integration. SVI could allow large datasets. Non-convex optimization is hard; harder than mixing?

44 FUTURE WORK: VARIATIONAL INFERENCE Joint work with James Hensman. Optimization instead of integration. SVI could allow large datasets. Non-convex optimization is hard; harder than mixing?

45 FUTURE WORK: VARIATIONAL INFERENCE Joint work with James Hensman. Optimization instead of integration. SVI could allow large datasets. Non-convex optimization is hard; harder than mixing?

46 FUTURE WORK: VARIATIONAL INFERENCE Joint work with James Hensman. Optimization instead of integration. SVI could allow large datasets. Non-convex optimization is hard; harder than mixing?

47 FUTURE WORK: VARIATIONAL INFERENCE Joint work with James Hensman. Optimization instead of integration. SVI could allow large datasets. Non-convex optimization is hard; harder than mixing?

48 FUTURE WORK: VARIATIONAL INFERENCE Joint work with James Hensman. Optimization instead of integration. SVI could allow large datasets. Non-convex optimization is hard; harder than mixing?

49 FUTURE WORK: VARIATIONAL INFERENCE Joint work with James Hensman. Optimization instead of integration. SVI could allow large datasets. Non-convex optimization is hard; harder than mixing?

50 FUTURE WORK: VARIATIONAL INFERENCE Joint work with James Hensman. Optimization instead of integration. SVI could allow large datasets. Non-convex optimization is hard; harder than mixing?

51 FUTURE WORK: VARIATIONAL INFERENCE Joint work with James Hensman. Optimization instead of integration. SVI could allow large datasets. Non-convex optimization is hard; harder than mixing?

52 FUTURE WORK: VARIATIONAL INFERENCE Joint work with James Hensman. Optimization instead of integration. SVI could allow large datasets. Non-convex optimization is hard; harder than mixing?

53 FUTURE WORK: VARIATIONAL INFERENCE Joint work with James Hensman. Optimization instead of integration. SVI could allow large datasets. Non-convex optimization is hard; harder than mixing?

54 FUTURE WORK: VARIATIONAL INFERENCE Joint work with James Hensman. Optimization instead of integration. SVI could allow large datasets. Non-convex optimization is hard; harder than mixing?

55 FUTURE WORK: VARIATIONAL INFERENCE Joint work with James Hensman. Optimization instead of integration. SVI could allow large datasets. Non-convex optimization is hard; harder than mixing?

56 FUTURE WORK: VARIATIONAL INFERENCE Joint work with James Hensman. Optimization instead of integration. SVI could allow large datasets. Non-convex optimization is hard; harder than mixing?

57 FUTURE WORK: SEMI-SUPERVISED LEARNING X O X O Spread labels along regions of high density.

58 OTHER PRIORS ON LATENT DENSITIES Density model is separate from warping model. Hierarchical clustering (bio applications) Deep Gaussian Processes

59 LIFE OF A BAYESIAN MODEL Write down generative model. Sample from it to see if it looks reasonable. Fiddle with sampler for a month. Maybe years later, a decent inference scheme comes out. Modeling decisions are in principle separate from inference scheme Can verify approximate inference schemes on examples. Modeling sophistication is far ahead of inference sophistication

60 LIFE OF A BAYESIAN MODEL Write down generative model. Sample from it to see if it looks reasonable. Fiddle with sampler for a month. Maybe years later, a decent inference scheme comes out. Modeling decisions are in principle separate from inference scheme Can verify approximate inference schemes on examples. Modeling sophistication is far ahead of inference sophistication Thanks!

Clustering Lecture 5: Mixture Model

Clustering Lecture 5: Mixture Model Jing Gao SUNY Buffalo 1 Outline Basics Motivation, definition, evaluation Methods Partitional Hierarchical Density-based Mixture model Spectral methods Advanced topics