An Efficient Model Selection for Gaussian Mixture Model in a Bayesian Framework


Ji Won Yoon

J. W. Yoon is with the Center for Information Security Technology (CIST), Korea University, Korea. E-mail: jiwon yoon@korea.ac.kr

Abstract

In order to cluster or partition data, we often use Expectation-Maximization (EM) or Variational approximation with a Gaussian Mixture Model (GMM), which is a parametric probability density function represented as a weighted sum of K̂ Gaussian component densities. However, model selection to find the underlying K̂ is one of the key concerns in GMM clustering, since we can obtain the desired clusters only when K̂ is known. In this paper, we propose a new model selection algorithm to explore K̂ in a Bayesian framework. The proposed algorithm builds the density of the model order, which information criteria such as AIC and BIC basically fail to reconstruct. In addition, this algorithm reconstructs the density quickly compared to time-consuming Monte Carlo simulation.

1 INTRODUCTION

The Gaussian Mixture Model (GMM) is a well-known approach for clustering data. A GMM is a parametric probability density function represented as a weighted sum of Gaussian component densities. Given an assumed model M_K, the GMM takes the form

$$p(y \mid M_K) = \sum_{k=1}^{K} \pi_k \, p(y \mid \mu_k, Q_k) \qquad (1)$$

where y is a set of N measurements (observations) of multivariate (d-dimensional), continuous-valued form. Here, π_k and p(y | µ_k, Q_k) represent the mixture weight and the Gaussian density of the k-th component respectively. Each component density is the multivariate Gaussian p(y | µ_k, Q_k) = N(y; µ_k, Q_k) with mean µ_k and covariance Q_k of the k-th component. Further, the non-negative weights sum to one, i.e. Σ_{k=1}^{K} π_k = 1 and π_k ≥ 0. In this parameterized form, we can collectively represent the hidden variables by x_k = (π_k, µ_k, Q_k) for k = 1, ..., K. Now, our interest is to reconstruct the posterior distribution p(x_{1:K} | y) given the model assumption that the Gaussian Mixture Model has K components. Let K* be the optimal number of Gaussian components, where K* = arg max_K p(K | y). It is known that if K is known, the desired x_{1:K} is straightforwardly estimated by Variational approximation or the classic Expectation-Maximization (EM) algorithm.
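As a concrete illustration of this fixed-K estimation step, the short Python sketch below fits a GMM with a known number of components via EM using scikit-learn. It is an illustrative stand-in written for this text, not code from the paper; the variational alternative mentioned above could be swapped in via sklearn.mixture.BayesianGaussianMixture.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# toy 2-D data from two well-separated groups
rng = np.random.default_rng(42)
y = np.vstack([rng.normal(-3, 1, size=(150, 2)), rng.normal(3, 1, size=(150, 2))])

K = 2  # number of components assumed known here
gmm = GaussianMixture(n_components=K, covariance_type="full", random_state=0).fit(y)

# x_{1:K} = (pi_k, mu_k, Q_k): the weights, means and covariances of Eq. (1)
print("weights pi_k :", np.round(gmm.weights_, 3))
print("means   mu_k :", np.round(gmm.means_, 2))
print("first covariance Q_1:\n", np.round(gmm.covariances_[0], 2))
labels = gmm.predict(y)  # hard cluster assignments from the fitted responsibilities
```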

However, in general the optimal number of components K is not known, and therefore it is rather difficult to estimate the hidden parameters x_{1:K}. This is because the dimension of x_{1:K} changes with varying K. Generically, the model selection problem for GMM involves finding the optimal K* by K* = arg max_K p(K | y). Many studies in the literature have addressed model order estimation for GMM [? ? ? ? ?]. For instance, Keribin et al. [?] estimated the number of components of mixture models using a maximum penalized likelihood. Information criteria have also been applied to GMM model selection, such as AIC [?], BIC [?] and the Entropy criterion [?]. Among Monte Carlo approaches, Richardson et al. developed inference for GMM model selection using reversible jump Markov chain Monte Carlo (RJMCMC) [?]. More recently, Nobile et al. introduced an efficient clustering algorithm, the so-called allocation sampler [?], which infers the model order and the clusters by Monte Carlo simulation with an efficient marginalized proposal distribution.

2 STATISTICAL BACKGROUND

2.1 Integrated Nested Laplace Approximation (INLA)

Suppose that we have a set of hidden variables f and a set of observations Y. Integrated Nested Laplace Approximation (INLA) [?] approximates the marginal posterior p(f | Y) by

$$p(f \mid Y) = \int_{\theta} p(f \mid Y, \theta)\, p(\theta \mid Y)\, d\theta \approx \int_{\theta} p(f \mid Y, \theta)\, \tilde{p}(\theta \mid Y)\, d\theta \approx \sum_{\theta_i} p(f \mid Y, \theta_i)\, \tilde{p}(\theta_i \mid Y)\, \Delta_{\theta_i}$$

where

$$\tilde{p}(\theta \mid Y) \propto \left. \frac{p(f, Y, \theta)}{p_F(f \mid Y, \theta)} \right|_{f = f^*(\theta)} = \left. \frac{p(Y \mid f, \theta)\, p(f \mid \theta)\, p(\theta)}{p_F(f \mid Y, \theta)} \right|_{f = f^*(\theta)}. \qquad (2)$$

Here, F denotes a simple functional approximation close to p(f | Y, θ), such as a Gaussian approximation, and f*(θ) is a value of that functional approximation. In the simple Gaussian case, the proper choice of f*(θ) is the mode of the Gaussian approximation p_G(f | Y, θ). Given the log posterior, we can calculate the mode θ* and its Hessian matrix H_{θ*} via quasi-Newton style optimization, θ* = arg max_θ log p̃(θ | Y); guided by H_{θ*}, we then do a grid search from the mode in all directions until log p̃(θ* | Y) − log p̃(θ | Y) > ϕ for a given threshold ϕ.
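To make the ratio in Eq. (2) concrete, the following Python sketch (an illustration written for this text, not the paper's code) applies it to a toy model y_i ~ N(f, σ²) with f ~ N(0, τ²) and a single hyperparameter θ = τ, under an assumed Exp(1) prior on τ. Because this toy model is Gaussian, the Gaussian approximation p_G(f | Y, θ) is exact, so the snippet only illustrates the mechanics: evaluate the ratio at the mode f*(θ) on a grid of θ values, locate the mode, and keep grid points whose log density lies within a threshold ϕ of it, as described above.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
sigma = 1.0                                    # observation noise assumed known
y = rng.normal(loc=1.5, scale=sigma, size=50)  # toy data
n = len(y)

def log_p_tilde(tau):
    """log p~(theta|Y) from Eq. (2): joint over Gaussian approximation, at f*(theta)."""
    prec = n / sigma**2 + 1.0 / tau**2          # posterior precision of f given Y, tau
    mean = (y.sum() / sigma**2) / prec          # posterior mean of f given Y, tau
    f_star = mean                               # mode of p_G(f|Y,theta)
    log_lik = norm.logpdf(y, loc=f_star, scale=sigma).sum()            # log p(Y|f*,theta)
    log_prior_f = norm.logpdf(f_star, loc=0.0, scale=tau)              # log p(f*|theta)
    log_prior_theta = -tau                                             # assumed Exp(1) prior on tau
    log_gauss_approx = norm.logpdf(f_star, loc=mean, scale=prec**-0.5) # log p_G(f*|Y,theta)
    return log_lik + log_prior_f + log_prior_theta - log_gauss_approx

# crude grid exploration, then keep points within phi of the mode (cf. Section 2.1)
taus = np.linspace(0.05, 5.0, 200)
logs = np.array([log_p_tilde(t) for t in taus])
mode_idx = logs.argmax()
phi = 5.0
kept = taus[logs[mode_idx] - logs < phi]
post = np.exp(logs - logs.max())
post /= post.sum()
print(f"mode of p~(tau|Y) near tau = {taus[mode_idx]:.2f} "
      f"(normalized weight {post[mode_idx]:.3f}); {kept.size} grid points kept")
```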

3 PROPOSED APPROACH

In this study, we extend our previous work, which addressed model selection for the K-nearest neighbour classifier using the K-ORder Estimation Algorithm (KOREA) [?], to resolve the model selection problem in the clustering domain. Our proposed algorithm reconstructs the distribution of the number of components using Eq. (2).

3.1 Obtaining the optimal number of components

Let y denote a set of observations and let x_{1:K} be the set of model parameters given a model order K. The first step of our algorithm is to estimate the optimal number of components, K* = arg max_K p(K | y). According to Eq. (2), we can obtain an approximated marginal posterior distribution by

$$p(K \mid y) \propto \left. \frac{p(y, x_{1:K}, K)}{p_F(x_{1:K} \mid y, K)} \right|_{x_{1:K} = x_{1:K}^{*}(K)}. \qquad (3)$$

This equation has the property that K is an integer variable, whereas θ in Eq. (2) is, in general, continuous. Ignoring this difference, we can still use a quasi-Newton method to obtain the optimal K efficiently. Alternatively, we can evaluate all candidates between 1 and K_max if K_max is not too large. Otherwise, we may still use the quasi-Newton style algorithm together with a rounding operator that maps the real-valued estimate to an integer K.

3.2 Bayesian Model Selection for GMM

In the GMM of Eq. (1), we have four different types of hidden variables describing the components: the means (µ_{1:K}), the precisions (Q_{1:K}), the weights (π_{1:K}), and the unknown number of components K. Therefore, given Eq. (3), we can write

$$p(K \mid y) \propto \left. \frac{p(y \mid x_{1:K})\, p(x_{1:K})\, p(K)}{p_F(x_{1:K} \mid y, K)} \right|_{x_{1:K} = x_{1:K}^{*}(K)}$$

where y = y_{1:N} and x_{1:K} = (π_{1:K}, µ_{1:K}, Q_{1:K}). However, it is rather difficult to obtain an approximating distribution p_F(x_{1:K} | y, K) close to the target distribution, since no closed form is available. Worse, it is infeasible to build the Hessian matrix via a quasi-Newton method, since this becomes extremely slow when the dimension of x is large and Q is not a vector but a matrix. Therefore, we introduce a labeling indicator z to decompose the mixture model and apply the Variational approach [?] to obtain p_F(z, x_{1:K} | y, K) such that q(z, x_{1:K}) = p_F(z, x_{1:K} | y, K). That is, we re-define the problem by adding the component indicators z of the observations. We finally obtain a form similar to that of Eq. (3):

$$p(K \mid y) \propto \left. \frac{p(y, z, x_{1:K}, K)}{p_F(z, x_{1:K} \mid y, K)} \right|_{(z, x_{1:K}) = (z^{*}, x_{1:K}^{*})} = \frac{p(y \mid z^{*}, x_{1:K}^{*})\, p(z^{*} \mid x_{1:K}^{*})\, p(x_{1:K}^{*})\, p(K)}{q^{*}(z^{*}, x_{1:K}^{*})} \qquad (4)$$

where z* and x*_{1:K} are the mode of q*(z, x_{1:K}), the approximated posterior obtained by variational approximation. Here, the model order prior is

$$p(K) = \frac{\exp(-K)}{\sum_{j=1}^{K_{\max}} \exp(-j)}.$$
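A minimal sketch of the resulting model-order scan, assuming a direct evaluation of all candidates K = 1, ..., K_max (Section 3.1), is given below in Python. It is not the paper's implementation: as a stand-in for the variational ratio of Eq. (4) it scores each K with the crude evidence surrogate log p(y | K) ≈ −BIC/2 from scikit-learn's EM fit, combines that with the exponential prior p(K) above, and normalizes over K. The structure of the scan, not the surrogate, is the point.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def model_order_posterior(y, k_max=10, seed=0):
    """Return an approximate posterior over K = 1..k_max.

    Surrogate evidence: log p(y|K) ~= -BIC/2 from an EM fit (an assumption, not Eq. (4));
    prior: p(K) proportional to exp(-K), as in Section 3.2.
    """
    log_scores = []
    for k in range(1, k_max + 1):
        gmm = GaussianMixture(n_components=k, n_init=3, random_state=seed).fit(y)
        log_evidence = -0.5 * gmm.bic(y)      # crude evidence surrogate
        log_prior = -k                        # log p(K) up to a constant
        log_scores.append(log_evidence + log_prior)
    log_scores = np.array(log_scores)
    post = np.exp(log_scores - log_scores.max())
    return post / post.sum()

# usage on toy two-component data
rng = np.random.default_rng(1)
y = np.vstack([rng.normal(0, 1, size=(200, 2)), rng.normal(5, 1, size=(200, 2))])
post_k = model_order_posterior(y, k_max=8)
print("argmax_K p(K|y) =", 1 + int(post_k.argmax()))
```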

4 EVALUATION

In order to evaluate the performance of our proposed approach, we simulated our clustering algorithm on an artificial dataset and several real experimental datasets. We first investigated the performance of the proposed algorithm on two-dimensional synthetic datasets for GMM clustering. Given K̂, data were generated by the hierarchical model

$$\mu_j = \left[ 2\cos\!\left(\tfrac{2\pi}{\hat{K}} j\right),\; 2\sin\!\left(\tfrac{2\pi}{\hat{K}} j\right) \right], \quad \Sigma_j \sim \mathcal{W}(I_{2\times 2}, \cdot) \quad \text{for } j \in \{1, 2, \ldots, \hat{K}\},$$

$$\pi_{1:\hat{K}} \sim \mathcal{DP}(1/\hat{K}, 1/\hat{K}, \ldots, 1/\hat{K}), \qquad y_i \sim \sum_{s=1}^{\hat{K}} \pi_s\, \mathcal{N}(\,\cdot\,; \mu_s, \Sigma_s) \quad \text{for } i \in \{1, 2, \ldots, N\} \qquad (5)$$

where W and DP represent the Wishart distribution and the Dirichlet Process respectively. In the experiments, we tested the performance by varying the number of clusters K̂ ∈ {1, 2, 3, 4, 5} and the number of data points N.

Fig. 1. Clustered synthetic datasets where the number of clusters K̂ and the number of observations N are varied: (a) AIC, (b) BIC, and (c) our proposed approach.

Figures 1 and 2 compare the performance of our approach with that of the other model selection algorithms, AIC and BIC. Figure 1 shows the outputs of the three model selection approaches for different K̂ and N. Interestingly, our proposed approach is effective even when K̂ = 1, where both AIC and BIC fail. In addition, whereas AIC and BIC can find K* only when N is large, our proposed approach builds a distinguishable and clear posterior distribution in all cases, and this enables us to detect the apparent K* close to K̂ easily.

The next question concerns the stability of our approach in noisy environments. We therefore ran five parallel, randomly seeded simulations. The mean and MSE (mean square error) of K* over the five runs are displayed in Figure 2. We find that our proposed algorithm is stable even when K̂ = 1 and K̂ = 2, where AIC and BIC are not effective. Furthermore, the MSE of AIC and BIC increases sharply as N grows when K̂ = 1, 2, 3, whereas our approach maintains a small (close to zero) and stable MSE as N increases.
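For readers who want to generate data of this shape, the Python sketch below draws from a hierarchical model in the spirit of Eq. (5). It is an illustration based on the description above rather than the authors' code: the Wishart degrees of freedom (3 here), the use of the Wishart draw directly as a covariance matrix, and a symmetric Dirichlet in place of the DP notation are all assumptions.

```python
import numpy as np
from scipy.stats import wishart

def sample_gmm_dataset(k_hat, n, df=3, seed=0):
    """Draw n points from a k_hat-component 2-D GMM in the spirit of Eq. (5)."""
    rng = np.random.default_rng(seed)
    # component means placed on a circle of radius 2
    angles = 2.0 * np.pi * np.arange(1, k_hat + 1) / k_hat
    mus = np.stack([2.0 * np.cos(angles), 2.0 * np.sin(angles)], axis=1)
    # component covariances drawn from a Wishart with identity scale (df is an assumption)
    covs = [wishart.rvs(df=df, scale=np.eye(2), random_state=rng) for _ in range(k_hat)]
    # mixture weights from a symmetric Dirichlet(1/k_hat)
    pis = rng.dirichlet(np.full(k_hat, 1.0 / k_hat))
    # sample component labels, then observations
    z = rng.choice(k_hat, size=n, p=pis)
    y = np.array([rng.multivariate_normal(mus[j], covs[j]) for j in z])
    return y, z

y, z = sample_gmm_dataset(k_hat=3, n=500)
print(y.shape, np.bincount(z))
```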

Fig. 2. Mean (a) and MSE (b) of K* over five different runs, with varying K̂ and N.

We also evaluated the performance of our algorithm with three well-known real datasets used in [?]: enzymatic activity in the blood of 245 unrelated individuals [?], acidity in a sample of lakes in the Northeastern United States [?], and galaxy data with the velocities of 82 distant galaxies. Figure 3 shows the performance comparison between AIC, BIC, MCMC, and our approach. The top sub-graphs demonstrate the histograms of the datasets with different numbers of mixture components. AIC and BIC with varying K are plotted in the center row of sub-graphs. The bottom sub-figures display the reconstructed distributions of p(K | Y) obtained by the MCMC sampler used in [?] and by our approach.*

Fig. 3. Histograms and clustering results of the one-dimensional real datasets: (a) Enzyme, (b) Acidity and (c) Galaxy.

5 CONCLUSION

For Gaussian mixture clustering, we proposed a novel model selection algorithm based on functional approximation in a Bayesian framework. This algorithm has a few advantages compared to other conventional model selection techniques. First, the proposed approach can quickly provide a proper distribution of the model order, which other approaches do not provide; only a few time-consuming techniques, such as Monte Carlo simulation, can provide it. In addition, since the proposed algorithm is based on a Bayesian scheme, we do not need to run cross validation, as is usually done in performance evaluation.

* The mathematical models used in MCMC and our approach are slightly different.