A Deterministic Global Optimization Method for Variational Inference

Hachem Saddiki
Mathematics and Statistics, University of Massachusetts, Amherst
saddiki@math.umass.edu

Andrew C. Trapp
Operations and Industrial Engineering, Worcester Polytechnic Institute
atrapp@wpi.edu

Patrick Flaherty
Mathematics and Statistics, University of Massachusetts, Amherst
flaherty@math.umass.edu

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Abstract

Variational inference methods for latent variable statistical models have gained popularity because they are relatively fast, can handle large data sets, and have deterministic convergence guarantees. However, in practice it is unclear whether the fixed point identified by the variational inference algorithm is a local or a global optimum. Here, we propose a method for constructing iterative optimization algorithms for variational inference problems that are guaranteed to converge to the ε-global optimum of the variational lower bound on the log-likelihood. We derive inference algorithms for two variational approximations to a standard Bayesian Gaussian mixture model (BGMM). We present a minimal data set for empirically testing convergence and show that a variational inference algorithm frequently converges to a local optimum, while our algorithm always converges to the globally optimal variational lower bound. We characterize the loss incurred by selecting a non-optimal variational approximating distribution, suggesting that the selection of the approximating variational distribution deserves as much attention as the selection of the original statistical model for a given data set.

1 Introduction

Maximum likelihood estimation of latent (hidden) variable models is computationally challenging because one must integrate over the latent variables to compute the likelihood. Often, the integral is high-dimensional and computationally intractable, so one must resort to approximation methods such as variational expectation-maximization [1]. The variational expectation-maximization algorithm alternates between maximizing an evidence lower bound (ELBO) on the log-likelihood with respect to the model parameters and maximizing the lower bound with respect to the variational distribution parameters. Variational expectation-maximization is popular because it is computationally efficient and performs deterministic coordinate ascent on the ELBO surface.

However, variational inference has some statistical and computational issues. First, the ELBO is often multimodal, so the deterministic algorithm may converge to a local rather than a global optimum depending on the initial parameter estimates. Second, the variational distribution is often chosen for computational convenience rather than for accuracy with respect to the posterior distribution, so the lower bound may be far from tight. The magnitude of the cost of using a poor approximation is typically not known. Here, we develop a deterministic global optimization algorithm for variational inference to address the first issue, and we quantify the optimal ELBO for an example model to address the second issue.

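To fix ideas, the following is a minimal sketch of the coordinate ascent alternation described above for a toy unit-variance univariate Gaussian mixture, where the variational distribution over each latent assignment is a categorical with responsibilities r; the toy model, data, and closed-form updates are standard illustrative choices, not the algorithms derived later in this paper.

```python
import numpy as np

def log_normal(y, mean):
    """Elementwise log N(y | mean, 1)."""
    return -0.5 * np.log(2.0 * np.pi) - 0.5 * (y - mean) ** 2

def elbo(y, r, pi, m):
    """Evidence lower bound for a K-component, unit-variance, univariate mixture."""
    log_joint = np.log(pi)[None, :] + log_normal(y[:, None], m[None, :])
    return float(np.sum(r * (log_joint - np.log(r))))

def variational_em(y, K=2, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)
    m = rng.normal(size=K)                      # random initialization of the means
    for _ in range(iters):
        # Ascent step in the variational parameters: r_ik proportional to pi_k N(y_i | m_k, 1).
        log_r = np.log(pi)[None, :] + log_normal(y[:, None], m[None, :])
        r = np.exp(log_r - log_r.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)
        # Ascent step in the model parameters: closed-form maximizers of the ELBO.
        pi = r.mean(axis=0)
        m = (r * y[:, None]).sum(axis=0) / r.sum(axis=0)
    return pi, m, elbo(y, r, pi, m)

# Illustrative data; different initial seeds can reach different fixed points.
y = np.array([-5.1, -4.8, -5.3, 4.9, 5.2, 5.0])
for seed in range(3):
    print(seed, variational_em(y, K=2, seed=seed))
```

Because the ELBO surface is multimodal, the fixed point reached by this deterministic alternation can depend on the random initialization, which is the failure mode the approach below is designed to rule out.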
2 GOP Variational Inference

The general Global OPtimization (GOP) algorithm was first developed by Floudas et al. [4, 5, 3], who extended the decomposition ideas of Benders [2] and Geoffrion [7]. While the framework has been used successfully for problems in process design, control, and computational chemistry, it has not, to our knowledge, been applied to statistical inference. Here, we apply the GOP algorithm to variational inference in hierarchical exponential family models. In Section 2.1 we state the problem conditions required by the GOP algorithm, and in Section 2.2 we briefly review its mathematical theory. The review of the GOP algorithm in this section is not novel and is necessarily brief; a more complete presentation of the general algorithm can be found in [3].

2.1 Problem Statement

GOP addresses biconvex optimization problems that can be formulated as

    min_{α, β}  f(α, β)
    s.t.        g(α, β) ≤ 0
                h(α, β) = 0
                α ∈ A, β ∈ B,                                             (1)

where A and B are convex compact sets, and g(α, β) and h(α, β) are vectors of inequality and equality constraints, respectively. We require the following conditions on (1):

1. f(α, β) is convex in α for every fixed β, and convex in β for every fixed α;
2. g(α, β) is convex in α for every fixed β, and convex in β for every fixed α;
3. h(α, β) is affine in α for every fixed β, and affine in β for every fixed α;
4. first-order constraint qualifications (e.g. Slater's conditions) are satisfied for every fixed β.

If these conditions are satisfied, we have a biconvex optimization problem and we can use the GOP algorithm to find the ε-global optimum.

Partitioning and Transformation of Decision Variables

Variational inference is typically formulated as an alternating coordinate ascent algorithm in which the evidence lower bound is iteratively maximized with respect to the model parameters and then with respect to the variational distribution parameters. However, the objective function under that partitioning of the decision variables is not necessarily biconvex. Instead, we consider all of the decision variables (model parameters and variational parameters) in the variational inference optimization problem jointly. We are then free to partition the variables however we choose so that the GOP problem conditions are satisfied. One may also be able to transform variables in the original variational optimization problem, adding equality constraints as necessary, to satisfy the GOP problem conditions. For larger problems and more complex models, the partitioning of the decision variables may not be apparent. Hansen [8] proposed an algorithm to bilinearize quadratic and polynomial problems, rational polynomials, and problems involving hyperbolic functions.

Biconvexity for Exponential Family Hierarchical Latent Variable Models

The variational EM problem is a biconvex optimization problem under certain conditions. We cast the variational EM problem into biconvex form by partitioning the model parameters φ and the variational parameters ξ into α and β such that the GOP conditions are satisfied. All of the functions involved are analytical and differentiable.

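To make conditions 1 and 2 concrete, the following minimal sketch checks a toy objective f(α, β) = (αβ - 1)² + α² + β² (our illustrative choice, unrelated to the models in this paper): every one-dimensional slice in α or β is convex, yet the function is jointly nonconvex.

```python
import numpy as np

def f(alpha, beta):
    # Toy objective: convex in alpha for fixed beta and in beta for fixed alpha,
    # but not jointly convex.
    return (alpha * beta - 1.0) ** 2 + alpha ** 2 + beta ** 2

def second_difference(g, x, h=1e-3):
    """Finite-difference estimate of g''(x); nonnegative values indicate convexity."""
    return (g(x + h) - 2.0 * g(x) + g(x - h)) / h ** 2

grid = np.linspace(-2.0, 2.0, 41)

# Conditions 1-2 along the alpha block: the slice f(., beta) has nonnegative curvature.
slices_convex = all(
    second_difference(lambda a: f(a, b), a0) >= -1e-6
    for b in grid for a0 in grid
)

# Joint convexity would need a positive semidefinite Hessian everywhere; at
# (alpha, beta) = (1, -1) the Hessian [[2b^2+2, 4ab-2], [4ab-2, 2a^2+2]] is indefinite.
a, b = 1.0, -1.0
hessian = np.array([[2 * b ** 2 + 2, 4 * a * b - 2],
                    [4 * a * b - 2, 2 * a ** 2 + 2]])
print("convex along each alpha slice:", slices_convex)                  # True
print("joint Hessian determinant at (1, -1):", np.linalg.det(hessian))  # negative
```

By symmetry the β slices behave the same way, so the toy problem satisfies the biconvexity conditions even though joint convexity, and hence standard global optimality guarantees, do not hold.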
2.2 GOP Theory

Projection

We can cast the original optimization problem (1) as inner and outer optimization problems by projecting the problem onto the space of the β variables:

    min_{β}   v(β)
    s.t.      v(β) = min_{α ∈ A}  f(α, β)
                     s.t.  g(α, β) ≤ 0
                           h(α, β) = 0
              β ∈ B ∩ V,                                                  (2)

where V = {β : h(α, β) = 0 and g(α, β) ≤ 0 for some α ∈ A}. For a given βᵗ, the inner minimization problem in (2) is called the primal problem, so the function v(β) is defined by the solutions of the primal problem for different values of β. However, the last two constraints in (2) are defined only implicitly, which makes the problem difficult to solve directly.

Relaxed Dual Problem

Dropping the last two constraints gives a relaxed dual problem,

    min_{β ∈ B, v_B}  v_B
    s.t.              v_B ≥ min_{α ∈ A}  L(α, β, λ, μ)   for all μ ≥ 0, λ,        (3)

where v_B ∈ ℝ and the Lagrange function of the primal problem is

    L(α, β, λ, μ) = f(α, β) + μᵀ g(α, β) + λᵀ h(α, β).

We call the inner minimization over α in (3) the inner relaxed dual problem.

In summary, the primal problem contains more constraints than the original problem because β is fixed, and so provides an upper bound on the original problem. The relaxed dual problem (3) contains fewer constraints than the original problem, and so provides a lower bound on the original problem. Because the primal and relaxed dual problems are formed from an inner-outer optimization problem, they can be combined into an iterative alternating algorithm that determines the global solution of the original problem. For the rest of the mathematical development we make the algorithm iterations explicit by introducing an index t. The key point is that information is passed from the primal problem to the relaxed dual problem through the Lagrange multipliers {λᵗ, μᵗ}, and information is passed from the relaxed dual problem to the primal problem through the optimal dual variables βᵗ.

3 Bayesian Gaussian Mixture Model - Point Mass Approximation

Here, we consider a variant of the Gaussian mixture model in which the cluster means are endowed with a Gaussian prior. We take the variational model for the posterior distribution to be a point mass function for discrete latent random variables and a Dirac delta function for continuous latent random variables. Variational expectation-maximization with a point-mass approximating distribution is exactly classical expectation-maximization [6, p. 337].

3.1 GOP Algorithm Derivation

Model Structure

A Gaussian mixture model with a prior distribution on the cluster means is

    M_k | Γ        ~ Gaussian(0, Γ),          for k = 1, ..., K,
    Z_i | π        ~ Categorical(K, π),       for i = 1, ..., N,
    Y_i | Z_i, M   ~ Gaussian(m_{z_i}, 1),    for i = 1, ..., N,              (4)

with model parameters φ ≜ {π, Γ} = {π_1, ..., π_K, Γ}, unobserved (latent) variables {Z, M_1, ..., M_K}, and observed variable Y = y. Our inferential aim is the joint posterior distribution of (Z_i, M) given Y_i and φ̂, for i = 1, ..., N, where φ̂ is the maximum likelihood parameter estimate. The log-likelihood of the model is ℓ_D(φ) ≜ log f(y | φ).
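For intuition about the generative process in (4), here is a minimal sketch of drawing a synthetic data set from the model; the particular values of K, N, π, and Γ (treated as a scalar variance) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative settings; the experiments in Section 3.2 use a fixed small data set.
K, N = 2, 4
Gamma = 10.0                  # prior variance of the cluster means (assumed scalar)
pi = np.array([0.5, 0.5])     # mixing proportions

# M_k | Gamma ~ Gaussian(0, Gamma), for k = 1, ..., K
M = rng.normal(loc=0.0, scale=np.sqrt(Gamma), size=K)

# Z_i | pi ~ Categorical(K, pi), for i = 1, ..., N
Z = rng.choice(K, size=N, p=pi)

# Y_i | Z_i, M ~ Gaussian(m_{Z_i}, 1), for i = 1, ..., N
Y = rng.normal(loc=M[Z], scale=1.0)

print("cluster means M:", M)
print("assignments Z:  ", Z)
print("observations Y: ", Y)
```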

3.2 Experimental Results

Here, we examine the convergence of the upper and lower bounds on the ELBO for a small data set: y = (10, 10, 5, 25). We find that this minimal data set is useful for testing inference algorithms for the (Bayesian) Gaussian mixture model. The data set has two properties that promote local optima: repeated observations and outliers.

Convergence Experiment

We examined the optimal ELBO values for GOP and variational EM (VEM) by randomly drawing initial model and variational parameters and running each algorithm to convergence. We drew 100 random initial values and observed that variational EM converged to a sub-optimal local fixed point in 87% of the trials, while GOP converged to the global optimum in 100% of the trials. We verified the globally optimal value using BARON, an independent global optimization solver [10, 11]. These results show that, empirically, GOP always converges to the global optimum, whereas the VEM algorithm failed to reach the global optimum more often than it succeeded.

Figure 1 shows the VEM and GOP iterations for two random seeds, with time on the x-axis and the ELBO on the y-axis. In Figure 1a, VEM converges to a local optimum that lies outside the bounds provided by GOP; Figure 1b shows a case where VEM converges to the global optimum. The locally optimal ELBO is -108.86 and the globally optimal ELBO is -84.04. These results quantify, and agree with, anecdotal evidence in the literature [9].

[Figure 1: Comparison of GOP, VEM, and BARON for two random initializations. (a) GOP iterations for a parameter initialization where VEM converges to a local optimum. (b) GOP iterations for a parameter initialization where VEM converges to the global optimum.]

4 Bayesian Gaussian Mixture Model - Gaussian Approximation

In this section we explore the quality of the approximation provided by two different variational models for the Bayesian Gaussian mixture model: a point mass distribution and a Gaussian distribution. We ran GOP, adapted to each approximation, on the small data set described above. The globally optimal ELBO is -84.04 for the point mass approximation and -82.75 for the Gaussian approximation, so the Gaussian approximation attains a strictly higher (tighter) bound. This result shows that we do incur a loss in accuracy when selecting a sub-optimal approximating distribution, suggesting that careful consideration must be given to the choice of approximating distribution.
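Returning to the convergence experiment of Section 3.2, the following minimal sketch shows one way to organize the random-restart comparison; run_vem and run_gop are hypothetical stand-ins for implementations of the two algorithms (each returning the final ELBO for one random initialization), and the tolerance is an illustrative choice.

```python
import numpy as np

def compare_restarts(y, run_vem, run_gop, n_restarts=100, tol=1e-3, seed=0):
    """Fraction of restarts on which each algorithm reaches the best ELBO found.

    run_vem and run_gop are hypothetical callables taking (y, rng) and returning
    the final ELBO for one random initialization of that algorithm.
    """
    rng = np.random.default_rng(seed)
    vem = np.array([run_vem(y, rng) for _ in range(n_restarts)])
    gop = np.array([run_gop(y, rng) for _ in range(n_restarts)])
    best = max(vem.max(), gop.max())   # proxy for the global optimum (cf. the BARON check)
    return {
        "vem_global_fraction": float(np.mean(vem >= best - tol)),
        "gop_global_fraction": float(np.mean(gop >= best - tol)),
    }
```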

References

[1] Matthew J. Beal and Zoubin Ghahramani. The variational Bayesian EM algorithm for incomplete data: with application to scoring graphical model structures. Bayesian Statistics, 7, 2003.

[2] J. F. Benders. Partitioning procedures for solving mixed-variables programming problems. Computational Management Science, 2(1):3-19, January 1962.

[3] Christodoulos A. Floudas. Deterministic Global Optimization: Theory, Methods and Applications. Springer US, 2000.

[4] Christodoulos A. Floudas and V. Visweswaran. A global optimization algorithm (GOP) for certain classes of nonconvex NLPs I. Theory. Computers & Chemical Engineering, 14(12):1397-1417, December 1990.

[5] Christodoulos A. Floudas and Vishy Visweswaran. Primal-relaxed dual global optimization approach. Journal of Optimization Theory and Applications, 78(2):187-225, 1993.

[6] Andrew Gelman, John B. Carlin, Hal Stern, David Dunson, Aki Vehtari, and Donald Rubin. Bayesian Data Analysis. Chapman and Hall, 3rd edition, 2013.

[7] A. M. Geoffrion. Generalized Benders decomposition. Journal of Optimization Theory and Applications, 1972.

[8] P. Hansen, B. Jaumard, and Junjie Xiong. Decomposition and interval arithmetic applied to global minimization of polynomial and rational functions. Journal of Computational Optimization, 3(4):421-437, December 1993.

[9] G. D. Murray. Contribution to the discussion of the paper by A. P. Dempster, N. M. Laird, and D. B. Rubin. Journal of the Royal Statistical Society, Series B (Methodological), 39:27-28, 1977.

[10] Nikolaos V. Sahinidis. BARON: A general purpose global optimization software package. Journal of Global Optimization, 8(2):201-205, 1996.

[11] M. Tawarmalani and N. V. Sahinidis. A polyhedral branch-and-cut approach to global optimization. Mathematical Programming, 103(2):225-249, 2005.