Warped Mixture Models

Warped Mixture Models Tomoharu Iwata, David Duvenaud, Zoubin Ghahramani Cambridge University Computational and Biological Learning Lab March 11, 2013

OUTLINE: Motivation; Gaussian Process Latent Variable Model; Warped Mixtures (generative model, inference, results); Future Work (variational inference, semi-supervised learning, other latent priors); Life of a Bayesian Model.

MOTIVATION I: MANIFOLD SEMI-SUPERVISED LEARNING Most manifold learning algorithms start by constructing a graph locally. [Figure: two labeled points, X and O, among rows of unlabeled points.] Most don't update the original topology to account for long-range structure or label information, and it is often hard to recover from a bad connectivity graph.

MOTIVATION II: INFINITE GAUSSIAN MIXTURE MODEL A Dirichlet process prior on the cluster weights recovers the number of clusters automatically. But since each cluster must be Gaussian, the inferred number of clusters is often inappropriate. How can we create nonparametric cluster shapes?

GAUSSIAN PROCESS LATENT VARIABLE MODEL Suppose observations Y = (y_1, ..., y_N), where y_n ∈ R^D, have latent coordinates X = (x_1, ..., x_N), where x_n ∈ R^Q, and y = f(x). [Figures: several random draws of a one-dimensional warping function, plotted as the observed y coordinate against the latent x coordinate.]

GAUSSIAN PROCESS LATENT VARIABLE MODEL Suppose observations Y = (y_1, ..., y_N), where y_n ∈ R^D, have latent coordinates X = (x_1, ..., x_N), where x_n ∈ R^Q, and y_d = f_d(x), where each f_d(x) ~ GP(0, K). Mapping marginal likelihood: p(Y | X, θ) = (2π)^{-DN/2} |K|^{-D/2} exp(-½ tr(Y^T K^{-1} Y)). Prior on x, treated mainly as a regularizer: p(x) = N(x | 0, I). The GP-LVM can be interpreted as a density model and can give warped densities; but how do we get clusters?
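
As a concrete reference, here is a minimal numpy sketch of the marginal likelihood above, under an assumed squared-exponential kernel; the function names and hyperparameter defaults are illustrative, not from the talk.

```python
import numpy as np

def rbf_kernel(X, lengthscale=1.0, variance=1.0, jitter=1e-6):
    # Squared-exponential kernel over latent coordinates X (N x Q), with jitter for stability.
    sq_dists = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2.0 * X @ X.T
    return variance * np.exp(-0.5 * sq_dists / lengthscale**2) + jitter * np.eye(len(X))

def gplvm_log_marginal(Y, X, lengthscale=1.0, variance=1.0):
    # log p(Y | X, theta) = -DN/2 log(2 pi) - D/2 log|K| - 1/2 tr(Y^T K^{-1} Y)
    N, D = Y.shape
    K = rbf_kernel(X, lengthscale, variance)
    L = np.linalg.cholesky(K)
    logdet = 2.0 * np.sum(np.log(np.diag(L)))
    Kinv_Y = np.linalg.solve(L.T, np.linalg.solve(L, Y))  # K^{-1} Y via Cholesky solves
    return -0.5 * D * N * np.log(2 * np.pi) - 0.5 * D * logdet - 0.5 * np.sum(Y * Kinv_Y)
```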

WARPED MIXTURE MODEL [Figure: a latent space mapped through GPs to an observed space.] A sample from the iWMM prior: sample a latent mixture of Gaussians, then warp the latent mixture to produce non-Gaussian manifolds in observed space. The result has some areas with almost no density, and some edges and peaks.
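
A hedged sketch of that generative process, using a finite truncation of the Dirichlet process and the `rbf_kernel` helper from the previous sketch; the truncation level, hyperparameter values, and noise-free warping are illustrative simplifications rather than the paper's exact prior.

```python
import numpy as np

def sample_warped_mixture(N=200, Q=2, D=2, C=3, alpha=1.0, seed=0):
    rng = np.random.default_rng(seed)
    weights = rng.dirichlet([alpha / C] * C)      # truncated DP-style mixture weights
    z = rng.choice(C, size=N, p=weights)          # latent cluster assignments
    mu = rng.normal(0.0, 2.0, size=(C, Q))        # latent cluster means
    X = mu[z] + 0.3 * rng.normal(size=(N, Q))     # latent mixture-of-Gaussians sample
    K = rbf_kernel(X)                             # GP covariance over latent points
    L = np.linalg.cholesky(K)
    Y = L @ rng.normal(size=(N, D))               # one independent GP draw per output dimension
    return X, Y, z
```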

WARPED MIXTURE MODEL An extension of the GP-LVM in which p(x) is a mixture of Gaussians; or, equivalently, an extension of the iGMM in which the mixture is warped. Given mixture assignments, the likelihood has only two parts, a GP-LVM term and a mixture-of-Gaussians term: p(Y, X | Z, θ) = (2π)^{-DN/2} |K|^{-D/2} exp(-½ tr(K^{-1} Y Y^T)) [GP-LVM likelihood] × ∏_i ∑_c λ_c N(x_i | μ_c, R_c^{-1}) I(x_i ∈ Z_c) [mixture-of-Gaussians likelihood].
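
For fixed assignments, the two-part likelihood can be sketched as the GP-LVM term plus a Gaussian-mixture term over the latent coordinates; `gplvm_log_marginal` is from the earlier sketch, and the mixture parameters (log weights, means, covariances) are assumed given here.

```python
import numpy as np
from scipy.stats import multivariate_normal

def iwmm_log_joint(Y, X, z, log_weights, means, covs, lengthscale=1.0, variance=1.0):
    # GP-LVM term: log p(Y | X, theta)
    ll = gplvm_log_marginal(Y, X, lengthscale, variance)
    # Mixture term: sum_i [ log lambda_{z_i} + log N(x_i | mu_{z_i}, R_{z_i}^{-1}) ]
    for c in np.unique(z):
        members = (z == c)
        ll += members.sum() * log_weights[c]
        ll += np.sum(multivariate_normal.logpdf(X[members], means[c], covs[c]))
    return ll
```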

INFERENCE [Figure: observed y coordinate plotted against latent x coordinate.] Find the posterior over the latent X's. There are many ways to do inference; the high dimension of the latent space means derivatives are helpful. Our scheme alternates between: 1. sampling latent cluster assignments, and 2. updating latent positions and GP hyperparameters with HMC. No cross-validation is needed, but the HMC parameters are annoying to set. Show demo!
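
The alternating scheme can be sketched as follows; the HMC step is a generic leapfrog update, and `gibbs_sample_assignments`, `logp`, and `grad_logp` are placeholders for the Gibbs move over assignments and the iWMM log posterior over X (and its gradient), which are not spelled out in the talk.

```python
import numpy as np

def hmc_step(x, logp, grad_logp, step_size=0.01, n_leapfrog=20, rng=None):
    # One Hamiltonian Monte Carlo move on the latent coordinates x (flattened array).
    rng = rng or np.random.default_rng()
    p0 = rng.normal(size=x.shape)                  # sample momentum
    x_new, p = x.copy(), p0.copy()
    p += 0.5 * step_size * grad_logp(x_new)        # initial half step for momentum
    for i in range(n_leapfrog):
        x_new += step_size * p                     # full step for position
        if i < n_leapfrog - 1:
            p += step_size * grad_logp(x_new)      # full step for momentum
    p += 0.5 * step_size * grad_logp(x_new)        # final half step for momentum
    # Metropolis accept/reject on the joint energy of (x, p)
    log_accept = (logp(x_new) - 0.5 * np.sum(p**2)) - (logp(x) - 0.5 * np.sum(p0**2))
    return x_new if np.log(rng.uniform()) < log_accept else x

# Alternating sampler skeleton (gibbs_sample_assignments is a placeholder):
# for t in range(n_iters):
#     z = gibbs_sample_assignments(X, z)     # 1. resample latent cluster assignments
#     X = hmc_step(X, logp, grad_logp)       # 2. HMC move on latent positions and GP hypers
```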

INFERENCE: MIXING Changing the number of clusters helps mixing.

DENSITY RESULTS The model automatically reduces the latent dimension, separately per cluster. The Wishart prior may be causing problems.

LATENT VISUALIZATION The posterior is symmetric and hard to summarize; we average for now. Variational Bayes might address the problem of summarizing the posterior.

LATENT VISUALIZATION: UMIST FACES DATASET Captures the number, dimension, and relationships between manifolds.

THE WARPED DENSITY MODEL What if we take the density model of the GP-LVM seriously? Why not just warp a single Gaussian? Even one latent Gaussian can be made fairly flexible.

THE WARPED DENSITY MODEL Even one latent Gaussian can be made fairly flexible, but it must place some mass between clusters. Latent clusters are also easier to interpret.

RESULTS We evaluated the iWMM as a density model as well as a clustering model.

LIMITATIONS O(N^3) runtime. A stationary kernel means difficulty modeling clusters of different sizes.

FUTURE WORK: VARIATIONAL INFERENCE Joint work with James Hensman. Optimization instead of integration. SVI could allow large datasets. Non-convex optimization is hard; is it harder than mixing?

FUTURE WORK: SEMI-SUPERVISED LEARNING [Figure: two labeled points, X and O, among rows of unlabeled points.] Spread labels along regions of high density.

OTHER PRIORS ON LATENT DENSITIES The density model is separate from the warping model. Other possible latent priors: hierarchical clustering (for biological applications), deep Gaussian processes.

LIFE OF A BAYESIAN MODEL Write down a generative model. Sample from it to see if it looks reasonable. Fiddle with the sampler for a month. Maybe years later, a decent inference scheme comes out. Modeling decisions are in principle separate from the inference scheme, and we can verify approximate inference schemes on examples. Modeling sophistication is far ahead of inference sophistication. Thanks!