Experimental Analysis of GTM


Elias Pampalk

In the past years many different data mining techniques have been developed. The goal of the seminar Kosice-Vienna is to compare some of them in order to determine which is preferable for which type of dataset. To this end we analyze three datasets with different properties regarding dimensionality and amount of data. I summarize the generative topographic mapping algorithm, discuss the results of the experiments, and show the similarity to related architectures.

1. INTRODUCTION

Nonlinear methods for statistical data analysis have become more and more popular thanks to the rapid development of computers. The fields in which they are applied are as varied as the methods themselves. Generative topographic mapping (GTM) was developed by [Bishop et al. 1997] as a principled alternative to the self-organizing map (SOM) algorithm [Kohonen 1982], in which a set of unlabelled data vectors is summarized in terms of a set of reference vectors having a spatial organization corresponding to a (generally) two-dimensional sheet. While the SOM algorithm has achieved many successes in practical applications, it also suffers from significant deficiencies, many of which are highlighted in [Kohonen 1995]: the absence of a cost function, the lack of a theoretical basis for choosing learning rate parameter schedules and neighborhood parameters to ensure topographic ordering, the absence of any general proofs of convergence, and the fact that the model does not define a probability density. These problems can all be traced back to the heuristic origins of the SOM algorithm. The GTM algorithm overcomes most of the limitations of the SOM while introducing no significant disadvantages. The datasets used to analyze the strengths and weaknesses of GTM range from low dimensional with few vectors to very high dimensional with many vectors.

In Section 2 I give an overview of related work. I summarize the method in Section 3. In Section 4 I describe the experiments I made and discuss the similarities to the SOM. Conclusions are presented in Section 5.

2. RELATED WORK

A lot of research has been done in the field of statistical data analysis. GTM belongs to the family of unsupervised methods; another representative is, for example, k-means clustering. An important aspect of unsupervised training is data visualization: the high-dimensional data space is mapped onto a (mostly two-dimensional) space. The two main criteria are preserving the topology and clustering the data. The most common tool is the SOM, but there are others, for example Sammon's mapping. The GTM is a rather new probabilistic re-formulation of the SOM. The developers of the GTM algorithm have compared it with a batch algorithm of the SOM and obtained the result that the computational cost of the GTM algorithm is about one third higher.

3. THE METHOD

3.1 Principles

GTM consists of a constrained mixture of Gaussians in which the model parameters are determined by maximum likelihood using the EM algorithm. It is defined by specifying a set of points $\{x_i\}$ in latent space, together with a set of basis functions $\{\Phi_j(x)\}$. The adaptive parameters $W$ and $\beta$ define a constrained mixture of Gaussians with centers $W\Phi(x_i)$ and a common covariance matrix given by $\beta^{-1}I$. After initializing $W$ and $\beta$, training involves alternating between the E-step, in which the posterior probabilities are evaluated, and the M-step, in which $W$ and $\beta$ are re-estimated.

3.2 Details

The continuous function $y = f(x, W)$ defines a mapping from the latent space into the data space, where $x$ is a latent variable (vector), $W$ are the parameters of the mapping (matrix), and $y$ is a vector in the higher-dimensional data space. The transformation $f(x, W)$ maps the latent-variable space into a non-Euclidean manifold embedded within the data space. Defining a probability distribution $p(x)$ on the latent-variable space induces a corresponding distribution $p(y \mid W)$ in the data space.

Since in reality the data $t$ will only approximately live on a lower-dimensional manifold, it is appropriate to include a noise model. A radially symmetric Gaussian distribution centered on $f(x, W)$ is chosen:

$$p(t \mid x, W, \beta) = \left(\frac{\beta}{2\pi}\right)^{D/2} \exp\left\{-\frac{\beta}{2}\,\|f(x, W) - t\|^2\right\},$$

where $D$ is the dimension of the data space and $\beta^{-1}$ is the variance of the Gaussian distribution. The distribution in the data space, for a given value of $W$, is then obtained by integration over the $x$-distribution:

$$p(t \mid W, \beta) = \int p(t \mid x, W, \beta)\,p(x)\,dx.$$

For a given data set $D = (t_1, \ldots, t_N)$ of $N$ data points, the parameters $W$ and $\beta$ can be determined using maximum likelihood. The log-likelihood function is given by

$$L(W, \beta) = \ln \prod_{n=1}^{N} p(t_n \mid W, \beta).$$

The (prior) distribution $p(x)$ is chosen so that the integral over $x$ can be solved:

$$p(x) = \frac{1}{K} \sum_{i=1}^{K} \delta(x - x_i),$$

where $K$ is the number of latent points.

Each point $x_i$ is mapped to a corresponding point $f(x_i, W)$ in the data space, which forms the center of a Gaussian density function,

$$p(t \mid W, \beta) = \frac{1}{K} \sum_{i=1}^{K} p(t \mid x_i, W, \beta),$$

and the log-likelihood function becomes

$$L(W, \beta) = \sum_{n=1}^{N} \ln \left\{ \frac{1}{K} \sum_{i=1}^{K} p(t_n \mid x_i, W, \beta) \right\}.$$

This corresponds to a constrained Gaussian mixture model, since the centers of the Gaussians, given by $f(x_i, W)$, cannot move independently but are related through the function $f(x, W)$. If the mapping function $f(x, W)$ is smooth and continuous, the projected points $f(x_i, W)$ will necessarily have a topographic ordering in the data space. If a particular parametrized form for $f(x, W)$ is chosen which is a differentiable function of $W$ (for example a feed-forward network with sigmoidal hidden units), then any standard non-linear optimization method, such as conjugate gradients, can be used. However, since the model consists of a mixture distribution, the EM algorithm is used. $f(x, W)$ is chosen to be given by a generalized linear regression model of the form $f(x, W) = W\Phi(x)$, where the elements of $\Phi(x)$ consist of $M$ fixed basis functions $\Phi_j(x)$, and $W$ is a $D \times M$ matrix.

3.3 Matlab Toolbox

The GTM toolbox for Matlab [Svensen 1999] provides a set of functions to generate GTMs and use them for visualization of data. There are standard functions for the two main steps: setup and training. Setup refers to the process of generating an initial GTM model, made up of a set of components (Matlab matrices). Training refers to adapting the initial model to a data set in order to improve the fit to that data.

The standard initialization (gtm_stp2) consists of the following steps: first a latent variable sample is generated; then the centers of the basis functions are generated and the activations of the basis functions are computed, given the latent variable sample; finally an initial weight matrix mapping from the output of the basis functions to the data space, and an initial value for the inverse variance of the Gaussian mixture, are computed using the first two principal components.

The standard training function (gtm_trn) basically consists of two steps: in the E-step a matrix is calculated containing the responsibilities assumed by each Gaussian mixture component for each of the data points. These responsibilities are used in the M-step for calculating new parameters of the Gaussian mixture. Both steps are calculated in each training cycle.
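To make the two steps concrete, the following is a minimal sketch of one possible batch EM loop for GTM in plain Matlab. It is an illustration of the procedure described above, not the gtm_trn implementation: the weight-regularization term (lambda) is omitted, the variable names (T, FI, W, beta, nCycles) are my own, and implicit expansion (Matlab R2016b or later) is assumed for the broadcasted operations.

% Assumed inputs: T (N x D data matrix), FI (K x M basis-function activations,
% one row per latent grid point), W (initial weights, stored here as M x D so
% that FI*W gives the K x D component centers; the W in the text is its
% transpose), beta (initial inverse noise variance), nCycles (training cycles).
[N, D] = size(T);
for cycle = 1:nCycles
    % E-step: responsibilities of each mixture component for each data point.
    % The common (beta/(2*pi))^(D/2) factor cancels in the normalization.
    Y    = FI * W;                                    % component centers, K x D
    dist = sum(Y.^2, 2) - 2*Y*T' + sum(T.^2, 2)';     % squared distances, K x N
    R    = exp(-0.5 * beta * dist);
    R    = R ./ (sum(R, 1) + realmin);                % normalize over the K components
    % M-step: re-estimate the weights and the common inverse variance.
    G    = diag(sum(R, 2));                           % total responsibility per component
    W    = (FI' * G * FI) \ (FI' * (R * T));          % solve for the new weights
    Y    = FI * W;                                    % centers under the new weights
    dist = sum(Y.^2, 2) - 2*Y*T' + sum(T.^2, 2)';
    beta = N * D / sum(sum(R .* dist));               % new inverse variance
end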

4. EXPERIMENTS

4.1 Introduction

Diagrams. GTM defines a distribution in the latent space for each data point. There are a few ways to visualize this. The first would be to look at the distribution of each single data point; normally this is not desired. The second way would be to plot the means of the data points. However, the distribution could be multi-modal, in which case the mean can give a very misleading summary of the distribution. So another way would be to plot the means together with the corresponding modes; the problem with this approach is that with many data points it becomes difficult to recognize anything. Therefore, for each experiment I plot a diagram of the means to represent the data points and to easily recognize clusters, and I plot the means with their corresponding modes to indicate whether the distributions are multi-modal. Further, I plot the sum of the distributions of all points in the data set; this is easily done by adding the single distributions and normalizing the result. The fourth plot I always make is the log-likelihood. At [ elan/elias/kosice/] a complete description of the experiments, including the Matlab scripts used, can be found.

Parameters. The following parameters must be specified when using the GTM toolbox. A more detailed description can be found in [Svensen 1999].

Number of latent points. Using an L-dimensional latent space for visualization, it is recommended to have $O(10^L)$ latent points in the support of each basis function. The latent points lie on a regular square grid; settings such as 10x5 are not possible. With too few points per basis function the smoothness of the mapping is lost. The number of latent points is limited computationally, but a very high number would be ideal.

Number of basis functions. A limited number of basis functions will necessarily restrict the possible forms that the mapping can take. The number of basis functions must be a square number; settings such as 2x3 are not possible.

Sigma. A scalar giving the relative width of the basis functions. The absolute width is calculated as sigma times the distance between two neighboring basis function centers. When basis functions overlap, their responses are correlated, which causes the smoothness of the mapping. More or narrower basis functions allow a more flexible mapping, while fewer or broader functions prescribe a smoother mapping. Sigma = 1.0 is a good starting point.

Cycles. The GTM implemented in Matlab uses a batch algorithm. In the test runs I made, the log-likelihood had converged after 5 to 30 cycles.

Lambda. The weight regularization factor governs the degree of weight decay applied during training. It controls the scaling by restricting the magnitude of the weights. All experiments I made used the recommended value.
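The means and modes diagrams described above can be computed directly from the responsibilities. The following is a minimal sketch (my own illustration, not a toolbox function); it assumes the matrices R (K x N responsibilities, as in the training sketch of Section 3.3) and X (K x 2 latent grid coordinates).

means = R' * X;                      % posterior mean of each data point in latent space, N x 2
[~, idx] = max(R, [], 1);            % dominant mixture component for each data point
modes = X(idx, :);                   % posterior mode of each data point, N x 2
plot(means(:,1), means(:,2), 'o');   % means diagram
hold on
plot(modes(:,1), modes(:,2), 'x');   % modes overlaid to reveal multi-modality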

4.2 Animals

Description. A toy example, very useful to test the various parameters: small, easy to handle, and intuitively interpretable. 16 animals, described by 13 attributes.

Raw Data Experiments. The following four diagrams illustrate the effects of the GTM parameters.

Fig. 1. Legend for figures 2, 3 and 4.

Fig. 2. Parameters: 3x3 latent points, 2x2 basis functions, 1.0 sigma, 10 cycles.

A low number of latent points results in a low resolution of the distribution. Since this dataset contains only a few basic clusters, a low number of latent points is sufficient. In figure 2 (means) you can see that the clusters are separated: horse, zebra and cow are mapped to the upper left; dove, hen, duck and goose are mapped to the upper right; tiger and lion are mapped to the middle left, cat to the center; owl, hawk and eagle are mapped to the middle and lower right; fox, dog and wolf can be found at the lower left. Notice that convergence is reached after about 5 cycles.

Figure 3 shows the effects of a higher number of latent points: the distribution has a higher resolution. Compared to figure 2, a clearer picture of the data structure is revealed. The low number of basis functions and the high sigma cause the smoothness of the mapping. Each hill in the distribution represents a cluster. The peak at (-1, 1) is caused by two identical vectors. Because the standard setup procedure uses a principal component initialization, the clusters are at almost the same locations as in figure 2.

Fig. 3. Parameters: 20x20 latent points, 2x2 basis functions, 3.0 sigma, 10 cycles.

Fig. 4. Parameters: 20x20 latent points, 2x2 basis functions, 0.2 sigma, 10 cycles.

Decreasing sigma reduces the smoothness of the mapping, as can be seen in figure 4. The hills in the distribution are much higher and rougher than in figure 3. Notice that the distributions of the data points are now multi-modal (modes diagram).

Increasing the number of basis functions dramatically increases the flexibility of the mapping. Figure 5 shows the result: the distribution is spiked. Because there are more basis functions than data points, the log-likelihood diagram looks strange. Even with these parameter settings, topology and clusters are basically correct.

Fig. 5. Parameters: 20x20 latent points, 5x5 basis functions, 0.2 sigma, 10 cycles.

Evaluation. I encountered no problems with this dataset. The results made sense and the standard functions worked fine. Time was no problem. The clusters generated look very nice. Finding good parameters was not a problem.

4.3 MIS

Description. This data describes characteristics of software modules (size, complexity measures, ...); a medium-sized data set with low dimensionality: 420 vectors, described by 13 attributes.

Fig. 6. Legend for figures 7, 8 and 9. The numbers represent how often the source code has been modified: a plotted circle represents a source code which has been modified up to 50 times, and so on.

Raw Data Experiments. It is difficult to find parameters with which a result other than a mapping onto a single point is generated. Figure 7 shows one of my best tries. Notice the log-likelihood diagram; I suspect a numerical error. On the right side of the means diagram there are mainly data points which represent source codes which have been modified a lot. On the left side there is a big cluster of the rest.

Fig. 7. Parameters: 20x20 latent points, 10x10 basis functions, 2.5 sigma, 10 cycles.

Fig. 8. Parameters: 30x30 latent points, 2x2 basis functions, 1.0 sigma, 20 cycles.

Normalized by Attribute Experiments. In figure 8, again, mainly data points which represent a high modification count are on the right side of the means diagram, and there is one big cluster of the rest. Notice that the distribution has a form similar to waves; possibly this is a side effect of the GTM algorithm. This strange form was generated with almost any parameter setting.
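The exact preprocessing is not spelled out in the text; the following sketch shows my reading of the two variants used in this section, with T as the raw N x D data matrix (implicit expansion assumed, as before).

Tattr = (T - mean(T, 1)) ./ (std(T, 0, 1) + realmin);   % normalized by attribute: zero mean, unit variance per column
Tunit = T ./ (sqrt(sum(T.^2, 2)) + realmin);            % vector length normalized: each row scaled to unit Euclidean length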

Fig. 9. Parameters: 20x20 latent points, 2x2 basis functions, 4.0 sigma, 20 cycles.

Vector Length Normalized Experiments. In figure 9 there seems to be no left-right separation between data points with a low and a high modification count. What can be seen is that there are small clusters of similar data points. One example is the top right, where many squares occupy the same space. This can also be seen in the distribution: there is a high peak at (-1, -1).

Evaluation. I used the standard procedures. Time was not a real problem, but I was not able to produce good looking results with the raw data. The convergence problems I had might be caused by numerical errors. With the vector length normalized, a very strange form was generated; the form reminded me of waves in water after dropping a stone. Normalizing the attributes caused more or less random results, or at least I could not detect any structure.

4.4 TIME Magazine

Description. Newspaper articles of the TIME Magazine from the 1960s; a medium-sized data set with very high dimensionality: 420 vectors (articles) described by 5923 attributes.

Vector Length Normalized Experiments. Matlab does not provide functions that would make it possible to easily analyze maps with many (420) vectors. To read a document plotted at a certain point on the map I have to find out by hand which vector it is. One way to do this is to mark one vector after the other with a special symbol (for example + instead of o), which makes it possible to distinguish that vector from the others. It is impossible to analyze a map that way within acceptable time limits. The approach I chose was a comparison: I used the results of Michael Dittenbach, who has trained a flat SOM and labeled the nodes. I took some nodes that looked good and marked these clusters with symbols so I could recognize them in the diagrams.
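Marking such a cluster in the means diagram is straightforward once the document indices are known. A minimal sketch (my own illustration; vietIdx is a hypothetical index vector of the documents belonging to the South Vietnam SOM nodes, and means is computed as in Section 4.1):

plot(means(:,1), means(:,2), 'o');                 % all documents
hold on
plot(means(vietIdx,1), means(vietIdx,2), '+');     % the marked cluster stands out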

viet(1): SOM node (1, 8); content: South Vietnam, religious crisis.
viet(2): SOM node (1, 7); content: South Vietnam.
viet(3): SOM node (1, 6); content: South Vietnam, military.
moscow, ...: SOM node (11, 7); content: Russia, communism, Khrushchev.
khrushch(1): SOM node (11, 8); content: Khrushchev, Cold War.
khrushch(2): SOM node (10, 7); content: Russia, economy.
khrushch(3): SOM node (10, 8); content: Russia, culture.
zanzibar, kenya: SOM node (3, 4); content: Kenya, Zanzibar.
nato, nuclear: SOM node (2, 11); content: NATO, nuclear weapons.

Fig. 10. Legend for figures 11 and 12.

Figure 11 shows a mapping of the TIME Magazine documents. The clusters are basically identical to those found by the SOM; especially the South Vietnam cluster is separated very nicely. Overlap of the symbols + and x produces a *, as seen at the top left of the means diagram. Because of the low number of basis functions, the clusters in the center are cramped. The connection between the triangles pointing down and the pentagrams came out very nicely. Notice the convergence after the second cycle; this can be observed with all other parameter settings as well.

Figure 12 shows a mapping with more basis functions. The clusters are the same again, except that the Russian documents have moved closer together. It is hard to recognize in this plot, but the triangles pointing down [khrushch(1)] are not all in the same region: all but one are at the top left with the other documents on Russia, and the remaining one is at the top right with the documents on NATO and nuclear weapons. The reason is that this document is about Russian nuclear weapons. The South Vietnam cluster is now located at the lower right. While principal component analysis was used to initialize the variables in the MIS and animals experiments, with the TIME dataset I had to use a random initialization.

Evaluation. The quality of the mappings seems to be good. The results are very similar to those of the flat SOM. I observed 9 clusters that had been found by the SOM; GTM found the same clusters and topology with almost any parameter setting. For practical use with text analysis it would be necessary to develop a better user interface: it is difficult to find the documents corresponding to the plotted diagrams.
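One simple interface improvement of the kind asked for above would be an interactive lookup in the means diagram. A minimal sketch (my own illustration, not part of the toolbox; titles is a hypothetical cell array holding one article title per data vector):

[px, py] = ginput(1);                                        % click a position in the means plot
[~, n] = min((means(:,1) - px).^2 + (means(:,2) - py).^2);   % nearest plotted document
disp(titles{n})                                              % show which article it is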

Fig. 11. Parameters: 10x10 latent points, 3x3 basis functions, 2.0 sigma, 5 cycles.

Fig. 12. Parameters: 10x10 latent points, 4x4 basis functions, 4.0 sigma, 5 cycles.

I encountered several problems while working with this dataset. First of all, it was not possible to use the standard procedures because of the high dimensionality. The principal component analysis which is used by default to initialize the model took too long and, in my case, also needed too much memory (over 500MB). The GTM toolbox also provides the possibility to use a random initialization, which I used. The next problem was numerical errors: some of the functions in the GTM toolbox by default use algorithms which are fast but not very precise. This caused divisions by zero when the means were calculated.
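A common remedy for this kind of underflow (a general numerical trick, not the fix used in the report) is to shift each column of squared distances by its minimum before exponentiating, so that at least one responsibility per data point equals exp(0) = 1 and the normalizer can never be zero:

dist0 = dist - min(dist, [], 1);    % column-wise shift of the K x N distance matrix
R     = exp(-0.5 * beta * dist0);   % largest entry per column is now exactly 1
R     = R ./ sum(R, 1);             % safe normalization: each column sum is at least 1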

The next problem was the efficiency of some calculations; I was able to solve it by changing some matrix multiplications into loops. Clearly this toolbox was not developed for high dimensional data. The remaining problem is the time it takes to initialize, train and visualize a map. With average parameters it takes me about 20 minutes for one run. Almost half of the time is spent calculating the distribution of the entire dataset (if plotting modes and means is enough, a run only takes half the time).

5. CONCLUSION

I have presented the results of experiments done with different datasets. I have explained the problems of the GTM toolbox for Matlab and I have shown the similarity between GTM and SOM. For datasets with only a few clusters the GTM is a good alternative to the SOM. For datasets with many unclear clusters, as in text data mining, it is necessary to develop a better interface to work with the results.

REFERENCES

Bishop, C. M., Svensén, M., and Williams, C. K. I. 1997. GTM: The Generative Topographic Mapping.
Kohonen, T. 1995. Self-Organizing Maps.
Kohonen, T. 1982. Self-organized formation of topologically correct feature maps.
Svensén, M. 1999. The GTM Toolbox - User's Guide.
