Neural Computing Research Group
Dept of Computer Science & Applied Mathematics
Aston University
Birmingham B4 7ET, United Kingdom
Tel: +44 (0)121 333 4631   Fax: +44 (0)121 333 4586
http://www.ncrg.aston.ac.uk/

GTM: A Principled Alternative to the Self-Organizing Map

Christopher M. Bishop   C.M.Bishop@aston.ac.uk
Markus Svensen   svensjfm@aston.ac.uk
Christopher K. I. Williams   C.K.I.Williams@aston.ac.uk

Technical Report NCRG/96/031
April 15, 1997

Published in Proceedings 1996 International Conference on Artificial Neural Networks, ICANN'96, Ed. C. von der Malsburg, W. von Seelen, J. C. Vorbruggen and B. Sendhoff, Springer-Verlag, 164-170.

Abstract

The Self-Organizing Map (SOM) algorithm has been extensively studied and has been applied with considerable success to a wide variety of problems. However, the algorithm is derived from heuristic ideas and this leads to a number of significant limitations. In this paper, we consider the problem of modelling the probability density of data in a space of several dimensions in terms of a smaller number of latent, or hidden, variables. We introduce a novel form of latent variable model, which we call the GTM algorithm (for Generative Topographic Map), which allows general non-linear transformations from latent space to data space, and which is trained using the EM (expectation-maximization) algorithm. Our approach overcomes the limitations of the SOM, while introducing no significant disadvantages. We demonstrate the performance of the GTM algorithm on simulated data from flow diagnostics for a multi-phase oil pipeline.

1 Introduction

The Self-Organizing Map (SOM) algorithm of Kohonen (1982) represents a form of unsupervised learning in which a set of unlabelled data vectors t_n (n = 1, ..., N) in a D-dimensional data space is summarized in terms of a set of reference vectors having a spatial organization corresponding (generally) to a two-dimensional sheet. While this algorithm has achieved many successes in practical applications, it also suffers from some major deficiencies, many of which are highlighted in Kohonen (1995) and reviewed in this paper [1].

[1] Biological metaphor is sometimes invoked when motivating the SOM procedure. It should be stressed that our goal here is not neuro-biological modelling, but rather the development of effective algorithms for data analysis, for which biological realism is irrelevant.

From the perspective of statistical pattern recognition, the fundamental goal of unsupervised learning is to develop a representation of the distribution p(t) from which the data were generated. In this paper we consider the problem of modelling p(t) in terms of a number (usually two) of latent or hidden variables. By considering a particular class of such models we arrive at a formulation in terms of a constrained Gaussian mixture which can be trained using the EM (expectation-maximization) algorithm. The topographic nature of the representation is an intrinsic feature of the model and is not dependent on the details of the learning process. Our model defines a generative distribution p(t) and will be referred to as the GTM (Generative Topographic Map) algorithm.

2 Latent Variables

The goal of a latent variable model is to find a representation for the distribution p(t) of data in a D-dimensional space t = (t_1, ..., t_D) in terms of a number L of latent variables x = (x_1, ..., x_L). This is achieved by first considering a non-linear function y(x; W), governed by a set of parameters W, which maps points x in the latent space into corresponding points y(x; W) in the data space. Typically we are interested in the situation in which the dimensionality L of the latent-variable space is less than the dimensionality D of the data space, since our premise is that the data itself has an intrinsic dimensionality which is less than D. The transformation y(x; W) then maps the latent-variable space into an L-dimensional non-Euclidean manifold embedded within the data space.

If we define a probability distribution p(x) on the latent-variable space, this will induce a corresponding distribution p(y|W) in the data space. We shall refer to p(x) as the prior distribution of x, for reasons which will become clear shortly. Since L < D, the distribution in t-space would be confined to a manifold of dimension L and hence would be singular. Since in reality the data will only approximately live on a lower-dimensional manifold, it is appropriate to include a noise model for the t vector. We therefore define the distribution of t, for given x and W, to be a spherical Gaussian centred on y(x; W) with variance β^{-1}, so that p(t|x, W, β) = N(t | y(x; W), β^{-1} I). The distribution in t-space, for a given value of W, is then obtained by integration over the x-distribution

$$ p(t \mid W, \beta) = \int p(t \mid x, W, \beta)\, p(x)\, dx. \qquad (1) $$

For a given data set D = (t_1, ..., t_N) of N data points, we can determine the parameter matrix W, and the inverse variance β, using maximum likelihood.
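To make the generative structure concrete, the following minimal sketch (not the authors' code) draws data according to the model just described: latent points are sampled from an assumed prior p(x) over a two-dimensional latent space, mapped through an illustrative smooth non-linear function standing in for y(x; W), and perturbed by spherical Gaussian noise with an assumed inverse variance β. The particular mapping, prior, and value of β are placeholders chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
beta = 50.0                          # inverse noise variance (assumed value for the sketch)

def y(x):
    """Illustrative smooth non-linear mapping from 2-D latent space to 3-D data space."""
    u, v = x
    return np.array([u, v, 0.5 * np.sin(np.pi * u) * np.cos(np.pi * v)])

# Generative process of equation (1): draw x ~ p(x), then t ~ N(y(x; W), (1/beta) I).
X = rng.uniform(-1.0, 1.0, size=(1000, 2))          # samples from an assumed uniform prior p(x)
manifold = np.array([y(x) for x in X])               # points on the embedded 2-D manifold
T = manifold + rng.normal(scale=beta ** -0.5, size=manifold.shape)   # noisy data vectors t_n
```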

The log-likelihood function is given by

$$ L(W, \beta) = \sum_{n=1}^{N} \ln p(t_n \mid W, \beta). \qquad (2) $$

In principle we can now seek the maximum likelihood solution for the weight matrix, once we have specified the prior distribution p(x) and the functional form of the mapping y(x; W), by maximizing L(W, β).

The latent variable model can be related to the Kohonen SOM algorithm by choosing p(x) to be a sum of delta functions centred on the nodes of a regular grid in latent space,

$$ p(x) = \frac{1}{K} \sum_{i=1}^{K} \delta(x - x_i). $$

This form of p(x) allows the integral in (1) to be performed analytically. Each point x_i is then mapped to a corresponding point y(x_i; W) in data space, which forms the centre of a Gaussian density function, as illustrated in Figure 1. Thus the distribution function in data space takes the form of a Gaussian mixture model

$$ p(t \mid W, \beta) = \frac{1}{K} \sum_{i=1}^{K} p(t \mid x_i, W, \beta) $$

and the log-likelihood function (2) becomes

$$ L(W, \beta) = \sum_{n=1}^{N} \ln \left\{ \frac{1}{K} \sum_{i=1}^{K} p(t_n \mid x_i, W, \beta) \right\}. \qquad (3) $$

This distribution is a constrained Gaussian mixture, since the centres of the Gaussians cannot move independently but are related through the function y(x; W). Note that, provided the mapping function y(x; W) is smooth and continuous, the projected points y(x_i; W) will necessarily have a topographic ordering.

Figure 1: We consider a prior distribution p(x) consisting of a superposition of delta functions, located at the nodes of a regular grid in latent space. Each node x_i is mapped to a point y(x_i; W) in data space, which forms the centre of the corresponding Gaussian distribution.

2.1 The EM Algorithm

If we choose a particular parametrized form for y(x; W) which is a differentiable function of W, we can use standard techniques for non-linear optimization, such as conjugate gradients or quasi-Newton methods, to find a weight matrix W* and an inverse variance β* which maximize L(W, β). However, our model consists of a mixture distribution, which suggests that we might seek an EM algorithm (Dempster, Laird, and Rubin 1977; Bishop 1995). By making a careful choice of the model y(x; W) we will see that the M-step can be solved exactly. In particular we shall choose y(x; W) to be given by a generalized linear network model of the form

$$ y(x; W) = W \phi(x) \qquad (4) $$

where the elements of φ(x) consist of M fixed basis functions φ_j(x), and W is a D × M matrix with elements w_kj. Generalized linear networks possess the same universal approximation capabilities as multi-layer adaptive networks, provided the basis functions φ_j(x) are chosen appropriately.
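As an illustration of the generalized linear model (4) and the constrained-mixture likelihood (3), the sketch below builds the matrix of basis-function values over a regular latent grid (the matrix Φ used in the M-step of the next section) and evaluates the log-likelihood for a given weight matrix W and inverse variance β. The grid size, the number and width of the basis functions, and the choice of Gaussian radial basis functions are assumptions made for the sketch; this excerpt of the paper does not fix those details.

```python
import numpy as np

def latent_grid(side=10):
    """Regular side x side grid of latent points x_i in [-1, 1]^2 (K = side**2 nodes)."""
    s = np.linspace(-1.0, 1.0, side)
    return np.array([[u, v] for u in s for v in s])                 # K x 2

def basis_matrix(X, basis_centres, width=0.3):
    """K x M matrix Phi with Phi[i, j] = phi_j(x_i); Gaussian radial basis functions assumed."""
    d2 = ((X[:, None, :] - basis_centres[None, :, :]) ** 2).sum(axis=-1)   # K x M squared distances
    return np.exp(-d2 / (2.0 * width ** 2))

def log_likelihood(T, Phi, W, beta):
    """Log-likelihood (3): T is N x D data; the mixture centres are the rows of Phi @ W.T."""
    K = Phi.shape[0]
    D = T.shape[1]
    Y = Phi @ W.T                                                   # K x D centres y(x_i; W)
    d2 = ((T[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)        # N x K squared distances
    log_comp = 0.5 * D * np.log(beta / (2.0 * np.pi)) - 0.5 * beta * d2
    m = log_comp.max(axis=1, keepdims=True)                         # log-sum-exp for stability
    return float(np.sum(m[:, 0] + np.log(np.exp(log_comp - m).sum(axis=1)) - np.log(K)))
```

For example, with X = latent_grid(), Phi = basis_matrix(X, latent_grid(4)) and a D × M weight matrix W, log_likelihood(T, Phi, W, beta) returns the quantity that the EM algorithm of the next section maximizes.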

By setting the derivatives of (3) with respect to w_kj to zero, we obtain

$$ \Phi^{\mathrm{T}} G \Phi W^{\mathrm{T}} = \Phi^{\mathrm{T}} R T \qquad (5) $$

where Φ is a K × M matrix with elements Φ_ij = φ_j(x_i), T is an N × D matrix with elements t_nk, and R is a K × N matrix with elements R_in given by

$$ R_{in}(W, \beta) = \frac{p(t_n \mid x_i, W, \beta)}{\sum_{i'=1}^{K} p(t_n \mid x_{i'}, W, \beta)} \qquad (6) $$

which represents the posterior probability, or responsibility, of mixture component i for data point n. Finally, G is a K × K diagonal matrix with elements G_ii = Σ_{n=1}^{N} R_in(W, β). Equation (5) can be solved for W using standard matrix inversion techniques. Similarly, optimizing with respect to β, we obtain

$$ \frac{1}{\beta} = \frac{1}{ND} \sum_{i=1}^{K} \sum_{n=1}^{N} R_{in}(W, \beta)\, \lVert y(x_i; W) - t_n \rVert^2. \qquad (7) $$

Here (6) corresponds to the E-step, while (5) and (7) correspond to the M-step. Typically the EM algorithm gives satisfactory convergence after a few tens of cycles (a code sketch of one complete cycle follows the comparison list below). An on-line version of this algorithm can be obtained by using the Robbins-Monro procedure to find a zero of the objective function.

3 Relation to the Self-Organizing Map

The list below describes some of the problems with the SOM procedure and how the GTM algorithm solves them.

1. The SOM algorithm is not derived by optimizing an objective function, unlike GTM. Indeed it has been proven (Erwin, Obermayer, and Schulten 1992) that such an objective function cannot exist for the SOM algorithm.

2. In GTM the neighbourhood-preserving nature of the mapping is an automatic consequence of the choice of a smooth, continuous function y(x; W). Neighbourhood-preservation is not guaranteed by the SOM procedure.

3. There is no assurance that the code-book vectors will converge using SOM. Convergence of the batch GTM algorithm is guaranteed by the EM algorithm, and the Robbins-Monro theorem provides a convergence proof for the on-line version.

4. GTM defines an explicit probability density function in data space. In contrast, SOM does not define a density model. Attempts have been made to interpret the density of codebook vectors as a model of the data distribution, but with limited success.

5. For SOM, the choice of how the neighbourhood function should shrink over time during training is arbitrary, and so this must be optimized empirically. There is no neighbourhood function to select for GTM.

6. It is difficult to know by what criteria to compare different runs of the SOM procedure. For GTM one simply compares the likelihood of the data under the model, and standard statistical tests can be used for model comparison.
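The following sketch spells out one complete EM cycle of Section 2.1: the E-step responsibilities (6), followed by the M-step updates (5) for W and (7) for β. It assumes the quantities defined above (basis matrix Φ of size K × M, data matrix T of size N × D, weight matrix W of size D × M) and is an illustration of the update equations rather than the authors' implementation.

```python
import numpy as np

def em_step(T, Phi, W, beta):
    """One EM cycle for GTM: E-step (6), then M-step updates (5) and (7).

    T: N x D data, Phi: K x M basis matrix, W: D x M weights, beta: inverse noise variance.
    Returns the updated W, beta and the K x N responsibility matrix R.
    """
    N, D = T.shape

    # E-step: R[i, n] = p(t_n | x_i) / sum_i' p(t_n | x_i')            (equation 6)
    Y = Phi @ W.T                                          # K x D centres y(x_i; W)
    d2 = ((Y[:, None, :] - T[None, :, :]) ** 2).sum(-1)    # K x N squared distances
    log_p = -0.5 * beta * d2
    log_p -= log_p.max(axis=0, keepdims=True)              # stabilize before normalizing
    R = np.exp(log_p)
    R /= R.sum(axis=0, keepdims=True)                      # responsibilities sum to 1 over i

    # M-step for W: solve  Phi^T G Phi W^T = Phi^T R T                 (equation 5)
    G = np.diag(R.sum(axis=1))                             # K x K diagonal, G_ii = sum_n R_in
    A = Phi.T @ G @ Phi                                    # M x M (a small ridge term could be
    B = Phi.T @ (R @ T)                                    #  added if A is ill-conditioned)
    W_new = np.linalg.solve(A, B).T                        # D x M

    # M-step for beta: 1/beta = (1/ND) sum_{i,n} R_in ||y(x_i; W_new) - t_n||^2   (equation 7)
    Y_new = Phi @ W_new.T
    d2_new = ((Y_new[:, None, :] - T[None, :, :]) ** 2).sum(-1)
    beta_new = N * D / np.sum(R * d2_new)

    return W_new, beta_new, R

# Typical usage: repeat em_step for a few tens of cycles, monitoring the log-likelihood (3).
```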

Figure 2: Examples of the posterior probabilities (responsibilities) of the latent-space points at an early stage (left) and a late stage (right) during the convergence of the GTM algorithm, evaluated for a single data point from the training set in the oil-flow problem discussed in Section 4.

Notwithstanding these key differences, there are very close similarities between the SOM and GTM techniques. Figure 2 shows the posterior probabilities (responsibilities) corresponding to the oil-flow problem considered in Section 4. At an early stage of training, the responsibility for representing a particular data point is spread over a relatively large region of the map. As the EM algorithm proceeds, this responsibility 'bubble' shrinks automatically. The responsibilities (computed in the E-step) govern the updating of W and β in the M-step and, together with the smoothing effect of the basis functions φ_j(x), play an analogous role to the neighbourhood function in the SOM procedure.

4 Experimental Results

We present results from the application of this algorithm to a problem involving 12-dimensional data arising from diagnostic measurements of oil flows along multi-phase pipelines (Bishop and James 1993). The three phases in the pipe (oil, water and gas) can belong to one of three different geometrical configurations, corresponding to stratified, homogeneous, and annular flows, and the data set consists of 1000 points drawn with equal probability from the 3 classes. We take the latent-variable space to be two-dimensional, since our goal in this application is data visualization.

Each data point t_n induces a posterior distribution p(x | t_n, W, β) in x-space. However, it is often convenient to project each data point down to a unique point in x-space, which can be done by finding the mean of the posterior distribution (sketched in the code example below). Figure 3 shows the oil data visualized with GTM and SOM. The CPU times taken for the GTM, SOM with a Gaussian neighbourhood, and SOM with a bubble neighbourhood were 644, 1116 and 355 seconds respectively. In each case the algorithms were run for 25 complete passes through the data set.

In conclusion, we have provided an alternative algorithm to the SOM which overcomes its principal deficiencies while retaining its general characteristics. We know of no significant disadvantage in using the GTM algorithm in place of the SOM. While we believe the SOM procedure is superseded by the GTM algorithm, it should be noted that the SOM has provided much of the inspiration for developing GTM. The relationships between GTM and a number of other algorithms are discussed in a longer paper available from: http://www.ncrg.aston.ac.uk/papers/postscript/ncrg_96_015.ps.Z
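The posterior-mean projection used for the GTM plot in Figure 3 follows directly from the E-step: under the grid prior, the mean of p(x | t_n, W, β) is the responsibility-weighted average of the latent grid points. A minimal sketch, assuming the K × N responsibility matrix R from the EM sketch above and the K × L array of latent grid points:

```python
import numpy as np

def posterior_mean_projection(R, latent_points):
    """N x L array of posterior means: row n is sum_i R[i, n] * x_i, i.e. the point in
    latent space to which data vector t_n is projected for visualization."""
    return R.T @ latent_points
```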

Figure 3: The left plot shows the posterior-mean projection of the oil-flow data in the latent space of the non-linear model. The plot on the right shows the same data set visualized using the batch SOM procedure, in which each data point is assigned to the point on the feature map corresponding to the codebook vector to which it is nearest. In both plots, crosses, circles and plus-signs represent the three different oil-flow configurations.

Acknowledgements

This work was supported by EPSRC grant GR/K51808: Neural Networks for Visualisation of High-Dimensional Data.

References

Bishop, C. M. (1995). Neural Networks for Pattern Recognition. Oxford University Press.

Bishop, C. M. and G. D. James (1993). Analysis of multiphase flows using dual-energy gamma densitometry and neural networks. Nuclear Instruments and Methods in Physics Research A327, 580-593.

Dempster, A. P., N. M. Laird, and D. B. Rubin (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Stat. Soc. B 39(1), 1-38.

Erwin, E., K. Obermayer, and K. Schulten (1992). Self-organizing maps: ordering, convergence properties and energy functions. Biological Cybernetics 67, 47-55.

Kohonen, T. (1982). Self-organized formation of topologically correct feature maps. Biological Cybernetics 43, 59-69.

Kohonen, T. (1995). Self-Organizing Maps. Berlin: Springer-Verlag.