Variational Methods for Discrete-Data Latent Gaussian Models

Variational Methods for Discrete-Data Latent Gaussian Models. University of British Columbia, Vancouver, Canada. March 6, 2012

The Big Picture. Joint density models for data with mixed data types. Bayesian models: a principled and robust approach. Algorithms that are not only accurate and fast, but also easy to tune and implement, and intuitive (speed-accuracy trade-offs). Slide 2 of 46

Sources of Discrete Data. User rating data. Survey/voting data and blogs for sentiment analysis. Health data. Consumer choice data. Sports/game data. Tag correlation. Slide 3 of 46

Motivation: Recommendation System. Movie rating dataset with missing values and different types of data. [Table: 6 movies × 6 users rating matrix with many missing entries, e.g. Movie1 is rated 9, 2, 3, 9 by four of the six users.] Slide 4 of 46

Missing Ratings. From Wikipedia on the Netflix Prize dataset: the training set is such that the average user rated over 200 movies, and the average movie was rated by over 5,000 users. But there is wide variance in the data: some movies in the training set have as few as 3 ratings, while one user rated over 17,000 movies. [Figure: rating sparsity in the MovieLens dataset.] Slide 6 of 46

Sources of Discrete Data. User rating data. Survey/voting data and blogs for sentiment analysis. Health data. Consumer choice data. Sports/game data. Tag correlation. Slide 8 of 46

What we need! For these datasets, we need a method of analysis which: handles missing values efficiently; makes efficient use of the data by weighting reliable data vectors more than unreliable ones; and makes efficient use of the data by fusing different types of data (binary, ordinal, categorical, count, text). Slide 9 of 46

Factor Model. The D × N data matrix Y (movies × users) is factorized via a D × L loading matrix W and an L × N matrix Z of latent factors (L factors, N users): ExpFamily likelihood p(y_n | η_n) with natural parameters η_n = W z_n; Gaussian prior z_n ~ N(μ, Σ). Slide 10 of 46. Collins et al. 2002, Khan et al. 2010, Yu et al. 2009
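
A minimal generative sketch of this factor model in NumPy, assuming a Bernoulli-logit emission and standard-normal loadings purely for illustration (the variable names are mine, not the talk's):

```python
import numpy as np

rng = np.random.default_rng(0)
D, L, N = 6, 2, 100            # movies, latent factors, users

# Model parameters: loadings W and the Gaussian prior on the factors.
W = rng.standard_normal((D, L))
mu, Sigma = np.zeros(L), np.eye(L)

Z = rng.multivariate_normal(mu, Sigma, size=N)   # latent factors, one per user
Eta = Z @ W.T                                    # natural parameters eta_n = W z_n

# Bernoulli-logit emission for binary ratings; other columns could instead use
# Gaussian, Poisson, etc. emissions to handle mixed data types.
Y = rng.binomial(1, 1.0 / (1.0 + np.exp(-Eta)))  # (N, D) observed matrix
```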

Bayesian Learning. [Graphical model: parameters μ, Σ and W; latent z_n; observations y_n; plate n = 1:N.] This talk: lower bound maximization. Slide 11 of 46

Variational Methods. Design tractable bounds to reduce approximation error. Efficient optimization, since lower bounds are concave: good convergence rates and easy convergence diagnostics. Efficient expectation-maximization (EM) algorithms for parameter learning. Comparable performance to MCMC, but much faster. Algorithms with a wide range of speed-accuracy trade-offs. Slide 12 of 46

Outline Latent Gaussian models Bounds for binary data Bounds for categorical data Results Future work and conclusions Slide 13 of 46

Outline Latent Gaussian models Definition and examples Problem with parameter learning Bounds for binary data Bounds for categorical data Results Future work and conclusions Slide 14 of 46

Latent Gaussian Model (LGM). [Figure: graphical model with latent Gaussian variables z_n ∈ R^L, linear predictors η_n = W z_n ∈ R^D (W is D × L), and observations y_n; prior z_n ~ N(μ, Σ); plate n = 1:N.] Slide 15 of 46

Likelihood Examples. Data type → distribution: Real → Gaussian; Count → Poisson; Binary → Bernoulli-logit; Categorical → Multinomial-logit; Ordinal → Proportional-odds. Slide 16 of 46

Parameter Estimation. [Figure: the LGM graphical model with parameters μ, Σ, W and a Bernoulli-logit likelihood; plate n = 1:N.] Slide 19 of 46

Jensen's Lower Bound. Slide 20 of 46
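
For reference, the bound here is presumably the standard one that Jensen's inequality gives for the marginal likelihood, with a Gaussian variational posterior q(z) = N(z | m, V):

```latex
\log p(\mathbf{y} \mid \theta)
  = \log \int q(\mathbf{z})\,\frac{p(\mathbf{y}, \mathbf{z} \mid \theta)}{q(\mathbf{z})}\, d\mathbf{z}
  \;\ge\; \mathbb{E}_{q}\!\left[\log p(\mathbf{y}, \mathbf{z} \mid \theta)\right]
        + \mathbb{H}\!\left[q\right].
```

Maximizing the right-hand side over (m, V) is the E-step; maximizing over θ = (μ, Σ, W) is the M-step.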

Variational Lower Bound. Generalized EM algorithm. The E-step involves minimizing a convex function, allowing early stopping in the E-step. (Almost) no tuning parameters. Easy convergence diagnostics. Slide 21 of 46

Outline Latent Gaussian models Bounds for binary data Bernoulli-logistic likelihood The Bohning bound (Khan, Marlin, Bouchard, Murphy, NIPS 2010) Piecewise bounds (Marlin, Khan, Murphy, ICML 2011) Bounds for categorical data Results Future work and conclusions Slide 22 of 46

Bernoulli-Logit Likelihood. [Figure: the expected Bernoulli-logit log-likelihood under a Gaussian N(m, V): an intractable expectation plus some other tractable terms in m and V.] Slide 23 of 46
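
Written out (a reconstruction from the surrounding slides, not the slide's own rendering): for y ∈ {0, 1} with linear predictor η,

```latex
\log p(y \mid \eta) = y\,\eta - \log(1 + e^{\eta}),
\qquad
\mathbb{E}_{\mathcal{N}(\eta \mid m, V)}\!\left[\log p(y \mid \eta)\right]
  = y\,m - \mathbb{E}\!\left[\log(1 + e^{\eta})\right].
```

The first term is already tractable in m; the Gaussian expectation of log(1 + e^η) is not, and bounding it by a quadratic yields the remaining tractable terms in m and V.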

Local Variational Bounds. Replacing the intractable term with a quadratic bound leaves only tractable terms in m and V. Bohning's bound (Khan, Marlin, Bouchard, Murphy 2010). Jaakkola's bound (Jaakkola and Jordan 1996). Piecewise quadratic bounds (Marlin, Khan, Murphy 2011). Slide 24 of 46
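
A quick numerical check of the two quadratic bounds, assuming their standard forms (fixed curvature 1/4 for Bohning; λ(ξ) = tanh(ξ/2)/(4ξ) for Jaakkola); this is my reconstruction, not code from the talk:

```python
import numpy as np

def llp(x):
    """Log-logistic partition function log(1 + e^x)."""
    return np.logaddexp(0.0, x)

def bohning_bound(x, psi):
    """Quadratic upper bound on llp with fixed curvature 1/4, expanded at psi."""
    s = 1.0 / (1.0 + np.exp(-psi))                 # sigma(psi) = llp'(psi)
    return llp(psi) + s * (x - psi) + 0.125 * (x - psi) ** 2

def jaakkola_bound(x, xi):
    """Quadratic upper bound on llp with variational parameter xi (tight at +/- xi)."""
    lam = np.tanh(xi / 2.0) / (4.0 * xi)           # curvature lambda(xi)
    return lam * (x ** 2 - xi ** 2) + (x - xi) / 2.0 + llp(xi)

x = np.linspace(-6, 6, 1001)
for name, b in [("Bohning", bohning_bound(x, 0.0)),
                ("Jaakkola", jaakkola_bound(x, 2.0))]:
    gap = b - llp(x)
    print(name, "max error:", gap.max(), "min gap (>= 0):", gap.min())
```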

Bohning Bound is Faster.

O(L^3 ND), generic bound with per-case curvature A_n:

    for n = 1:N
        V_n = (W^T A_n W + I)^{-1}
        m_n = ..
    end

O(L^2 ND), shared curvature A (as in the Bohning bound):

    V = (W^T A W + I)^{-1}
    for n = 1:N
        m_n = ..
    end

Slide 25 of 46
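
A sketch of why this works, in NumPy (illustrative variable names, not the talk's code): with a generic quadratic bound the curvature A_n differs per data case, so every case needs its own L × L inverse, whereas the Bohning bound's fixed curvature (the constant 1/4 for the Bernoulli-logit model) lets one inverse be shared across all N cases:

```python
import numpy as np

rng = np.random.default_rng(0)
D, L, N = 50, 5, 1000
W = rng.standard_normal((D, L))
A = rng.uniform(0.05, 0.25, size=(N, D))   # per-case diagonal curvatures

# Generic bound: one L x L inverse per data case -> the O(L^3 N D) regime.
def covariances_generic(A):
    M = np.einsum('dl,nd,dk->nlk', W, A, W) + np.eye(L)   # (N, L, L)
    return np.linalg.inv(M)                               # batched inverses

# Bohning bound: curvature is the same constant for every case and dimension,
# so a single shared inverse suffices -> the O(L^2 N D) regime.
def covariance_bohning():
    return np.linalg.inv(0.25 * (W.T @ W) + np.eye(L))

V_all = covariances_generic(A)     # shape (N, L, L)
V_shared = covariance_bohning()    # shape (L, L), shared by all n
```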

Piecewise Bounds Are More Accurate. [Figure: log(1 + e^x) against the Bohning, Jaakkola, and piecewise bounds; the piecewise bound is built from pieces Q_1(x), Q_2(x), Q_3(x).] Slide 27 of 46

Details of Piecewise Bounds. Find the cut points and the parameters of each piece by minimizing the maximum error: linear pieces (Hsiung, Kim and Boyd, 2008); quadratic pieces (Nelder-Mead method). The resulting piecewise bounds are fixed, and accuracy can be increased by increasing the number of pieces. Slide 28 of 46
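
A rough sketch of the fitting idea with SciPy's Nelder-Mead, here for three linear pieces bounding log(1 + e^x) on a grid; the piece parameterization, penalty term, and initial values are my own illustration, and the paper's minimax construction is more careful:

```python
import numpy as np
from scipy.optimize import minimize

xs = np.linspace(-6.0, 6.0, 2001)
f = np.log1p(np.exp(xs))              # log(1 + e^x), the term to be bounded

def bound(params, x):
    # Three linear pieces a_r * x + b_r with cut points t1 < t2.
    t1, t2, a1, b1, a2, b2, a3, b3 = params
    return np.where(x < t1, a1 * x + b1,
           np.where(x < t2, a2 * x + b2, a3 * x + b3))

def objective(params):
    err = bound(params, xs) - f
    # Penalize any violation of the upper-bound constraint bound(x) >= f(x),
    # then minimize the maximum error over the grid.
    penalty = 1e3 * np.sum(np.maximum(-err, 0.0))
    return np.max(np.abs(err)) + penalty

init = np.array([-2.0, 2.0, 0.0, 0.2, 0.5, 0.8, 1.0, 0.2])
res = minimize(objective, init, method="Nelder-Mead",
               options={"maxiter": 20000, "xatol": 1e-8, "fatol": 1e-10})
print("max error of fitted 3-piece bound:",
      np.max(np.abs(bound(res.x, xs) - f)))
```

Because the bounds are fixed once fitted, this optimization happens offline, and accuracy can be dialed up simply by adding pieces.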

Outline Latent Gaussian models Bounds for binary data Bounds for categorical data Multinomial-logistic likelihood and local variational bounds Stick-breaking likelihood (Khan, Mohamed, Marlin, Murphy, AISTATS 2012) Results Future work and conclusions Slide 29 of 46

Multinomial-Logit Likelihood Slide 30 of 46
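
In the standard form (my reconstruction, not the slide's own equation): for K categories with predictor vector η_n,

```latex
p(y_n = k \mid \boldsymbol{\eta}_n)
  = \frac{\exp(\eta_{nk})}{\sum_{j=1}^{K} \exp(\eta_{nj})},
\qquad
\mathbb{E}_{q}\!\left[\log p(y_n = k \mid \boldsymbol{\eta}_n)\right]
  = m_{nk} - \mathbb{E}_{q}\!\left[\operatorname{lse}(\boldsymbol{\eta}_n)\right],
```

where lse(η) = log Σ_j exp(η_j). The Gaussian expectation of the log-sum-exp term is intractable, which motivates the local bounds on the next slide.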

Local Variational Bounds. The Bohning bound: fast, with closed-form updates. The log bound (Blei and Lafferty 2006): more accurate than the Bohning bound, but slower. The product-of-sigmoids bound (Bouchard 2007). Slide 31 of 46

Stick-Breaking Likelihood. [Figure: a unit stick on [0, 1] broken into pieces, one per category.] Slide 32 of 46
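
A minimal sketch of the stick-breaking construction, assuming the usual parameterization in which category k takes a σ(η_k) fraction of the remaining stick (the function and variable names are mine):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def stick_breaking_probs(eta):
    """Map K-1 real-valued predictors to K category probabilities.

    Category k receives a sigmoid(eta_k) fraction of the stick left over
    after categories 1..k-1; the last category gets the remainder.
    """
    v = sigmoid(eta)                                       # fractions broken off
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v)))
    return np.append(v, 1.0) * remaining                   # length K, sums to 1

p = stick_breaking_probs(np.array([0.3, -1.2, 0.5]))
print(p, p.sum())   # probabilities over K = 4 categories; sum is 1
```

Because every category probability is a product of sigmoids, the piecewise bounds developed for the binary case apply term by term.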

Outline Latent Gaussian models Bounds for binary data Bounds for categorical data Results Future work and conclusions Slide 33 of 46

Speed-Accuracy Trade-offs. Binary FA: UCI voting dataset (D = 15, N = 435). [Figure: speed vs. accuracy for Bohning, Jaakkola, piecewise linear with 3 pieces, piecewise quadratic with 3 pieces, and piecewise quadratic with 10 pieces.] Slide 34 of 46

Comparison with EP. Binary Gaussian process classification: Ionosphere dataset (D = 200), with kernel Σ_ij = σ exp(−‖x_i − x_j‖²/s). Slide 35 of 46
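
A small sketch of that kernel as written on the slide (hypothetical helper, with σ and s as the two hyperparameters):

```python
import numpy as np

def se_kernel(X, sigma=1.0, s=1.0):
    """Squared-exponential covariance: Sigma_ij = sigma * exp(-||x_i - x_j||^2 / s)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # pairwise squared distances
    return sigma * np.exp(-np.maximum(d2, 0.0) / s)  # clamp round-off negatives

# In the GP case, Sigma is the prior covariance of the latent function values.
X = np.random.default_rng(0).standard_normal((5, 3))
print(se_kernel(X, sigma=1.0, s=2.0).shape)          # (5, 5)
```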

EP vs PW: Posterior Distribution. [Figure: EP vs. the piecewise bound on three measures: (negative) KL lower bound to the marginal likelihood, approximation to the marginal likelihood, and prediction error.] Slide 38 of 46

Comparison with EP. Both methods give very similar results for GPs. Our approach can be easily extended to factor models. The variational EM objective function is well-defined and can be optimized by minimizing convex functions. Numerically stable. Slide 39 of 46

Multiclass Gaussian Process. Glass dataset (D = 143, K = 6). [Figure: prediction error and negative log-likelihood for MCMC, Bohning, Log, VB-probit, and Stick-PW.] Slide 40 of 46

Categorical Factor Analysis. Glass dataset (D = 10, N = 958, sum of K = 29). [Figure: comparison of Logit-log and Stick-PW.] Slide 41 of 46

Outline Latent Gaussian models Bounds for binary data Bounds for categorical data Results Future work and conclusions Slide 42 of 46

Future Work. Large-scale collaborative filtering: use convexity to design approximate gradient methods. Sparse Gaussian posterior distributions. Tuning HMC using Bayesian optimization methods. Latent sparse-factor models. Conditional models (e.g., to model tag-image correlation). Slide 43 of 46

Conclusions. Variational methods show performance comparable to existing approaches. The main source of error is the bounding error; piecewise bounds are designed to control these errors, giving good control over speed-accuracy trade-offs. Variational lower bounds can be optimized efficiently: convex optimization methods give fast convergence rates and easy convergence diagnostics, and lead to efficient expectation-maximization (EM) algorithms. Slide 44 of 46

Collaborators: Kevin Murphy (UBC), Benjamin Marlin (UMass Amherst), Guillaume Bouchard (XRCE, France), Shakir Mohamed (U. Cambridge, now at UBC). Slide 45 of 46

Thank You Slide 46 of 46