Topographic Local PCA Maps


Peter Meinicke and Helge Ritter
Neuroinformatics Group, University of Bielefeld
E-mail: {pmeinick, helge}@techfak.uni-bielefeld.de

Abstract

We present a model for coupling Local Principal Component Analysers based on a probabilistic notion of neighbourhood, which is inspired by the Self-Organizing Map (SOM). With our approach, topologically ordered configurations of Local PCAs arise from homotopy-based minimization of a global error function. We indicate that such an approach can be viewed as a natural generalization of the basic SOM while, unlike the SOM, it is not restricted to capturing the variation of multivariate data only along a small number of grid dimensions. We show the close relations to the Adaptive Subspace SOM (ASSOM), and with experimental results on synthetic and high-dimensional real-world data we demonstrate the capabilities of the model.

1 Introduction

Local PCA (LPCA) learning [7, 1] can be viewed as a plausible extension of the conventional Vector Quantization (VQ) framework. It replaces the point prototypes by linear manifolds, which can considerably improve generalization, especially in high-dimensional data spaces. In both the LPCA and the VQ framework the problem of overfitting quickly arises if we increase the number of prototypes to improve the approximation capabilities of the model. In the VQ case, one successful way to restrict model complexity is to introduce couplings that constrain the flexibility of each prototype relative to a set of "neighbours". The well-known SOM [8, 10] imposes the relationships among the K prototypes by means of a neighbourhood function h_jk which determines the coupling strength between prototypes j and k. Usually, h_jk is related to the closeness of lattice points ("nodes") j, k on a fictitious grid, which is chosen to resemble the topology of the data in order to obtain good generalization. For a biological motivation of such an approach we refer to [8, 10].

In this paper we propose to extend the point-wise SOM prototypes to linear manifold prototypes, which are able to extract some local subspace structure from the data. Optimization of the neighbourhood-coupled manifolds results in a special kind of feature map which we refer to as a "Topographic Local PCA" (T-LPCA) map. With respect to its representational capabilities, T-LPCA can be viewed as a natural generalization of the classical SOM; in addition it can also realize the Adaptive Subspace SOM (ASSOM) [8], which turns out to be a special case of the T-LPCA framework as well. Therefore a probabilistic version of the SOM can be easily realized within the Topographic Local PCA framework, and indeed it plays an important role during learning via successive model refinement.

2 T-LPCA Prototypes

In essence, the definition of our T-LPCA model is based on the combination of two formalisms, which provide variability with respect to the kind of prototype and a certain kind of probabilistic neighbourhood, respectively. More specifically, the prototype variability is achieved by a parametrized distance function which provides a smooth transition from points to linear manifolds. The neighbourhood coupling is achieved by a set of assignment probabilities which combine the squared distances w.r.t. distinct prototypes. This idea of a probabilistic notion of neighbourhood goes back to [9] and has been extended to "Soft Topographic Vector Quantization" by [4].
The combination of the above formalisms results in a function which measures the error of a d-dimensional point x with respect to the neighbourhood of prototype j:

E_j(x; α) = (1/2) Σ_{k=1}^{K} h_jk ||(I − α V_k V_k^T)(x − w_k)||²    (1)

for j = 1, ..., K with α ∈ [0, 1]. The error function sums up (squared) neighbourhood-weighted central (SOM) distances for α = 0 and orthogonal distances w.r.t. some linear manifolds for α = 1. Here I is the d × d identity matrix and the V_k are d × q matrices with orthonormal columns which determine the directions of the manifolds.
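For concreteness, eq. (1) can be evaluated as in the following minimal numpy sketch; the function and variable names are ours and serve only to illustrate the roles of α, h_jk, w_k and V_k, not the authors' implementation.

    import numpy as np

    def neighbourhood_error(x, alpha, H, W, V, j):
        """Expected alpha-distance of point x w.r.t. the j-th neighbourhood, eq. (1).

        x     : (d,) data point
        alpha : interpolation parameter in [0, 1] (0 = SOM point distance, 1 = manifold distance)
        H     : (K, K) neighbourhood matrix with unit row sums
        W     : (K, d) prototype centres w_k
        V     : (K, d, q) orthonormal direction matrices V_k
        """
        d = x.shape[0]
        err = 0.0
        for k in range(H.shape[1]):
            P = np.eye(d) - alpha * V[k] @ V[k].T   # I - alpha * V_k V_k^T
            r = P @ (x - W[k])
            err += H[j, k] * (r @ r)
        return 0.5 * err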

Each V_k V_k^T thus represents an orthogonal projection onto a q-dimensional subspace that becomes associated with node k. The w_k are points in data space which correspond to the SOM prototypes for α = 0 and which determine the distance of the linear manifolds from the origin for α = 1.

Within a statistical interpretation [9, 4], a data point is not deterministically assigned to a single prototype but instead to a set of probabilistically related prototypes. The final assignment depends on a discrete random variable r_j which assigns a point to the k-th prototype with probability h_jk. Therefore Σ_k h_jk = 1 and the distribution of r_j defines what we refer to as the "neighbourhood" of prototype j or, for short, the j-th neighbourhood. The error function (1) therefore defines the expected squared α-distance of a data point w.r.t. the distribution of r_j. The topological justification for this probabilistic notion of neighbourhood comes from the requirement that, within a neighbourhood j, the probability h_jk be a monotonic function of the distance between the corresponding nodes of prototypes j and k on a prespecified grid ("array") of chosen topology. The corresponding neighbourhood function, which maps array distances to probabilities, establishes the connection to the classical SOM. In subsection 3.3 we propose a Gaussian realization of that function which leads to a convenient parametrization of the probabilities. For notational convenience the neighbourhood probabilities of all K random variables are collected in the prespecified neighbourhood matrix H = (h_jk), which encodes the topological relations between all prototypes. To simplify notation, in the following we suppress the H-dependency of E_j(·).

3 Optimization

Learning from the sample X = {x_1, ..., x_N} ⊂ R^d requires minimization of the following global error function:

E(M, Θ; α) = Σ_{i=1}^{N} Σ_{j=1}^{K} m_ij E_j(x_i; α)    (2)

where Θ comprises all model parameters w_1, V_1, ..., w_K, V_K and the N × K matrix M = (m_ij) contains a set of membership variables m_ij which denote the membership of data point i w.r.t. a certain neighbourhood j. Within a hard-clustering framework the membership variables would be binary and would assign each data point to exactly one neighbourhood. In order to avoid poor local minima of the above objective function we do not start with a direct minimization of (2) by hard-clustering, where a nearest-neighbour(hood) partitioning of the data and a subsequent re-estimation of the parameter values are iterated. Instead we use a homotopy-based method which gradually deforms an initial error function with a well-defined global minimum until the original error function is minimized in a final optimization step. This technique is usually referred to as deterministic annealing [11] and will be the subject of the next subsection.
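The global error (2) is simply the membership-weighted sum of the neighbourhood errors; a minimal sketch reusing neighbourhood_error from above (again our naming, for illustration only):

    def global_error(X, M, alpha, H, W, V):
        """Global error (2): membership-weighted sum of neighbourhood errors E_j."""
        N, K = M.shape
        return sum(M[i, j] * neighbourhood_error(X[i], alpha, H, W, V, j)
                   for i in range(N) for j in range(K))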
3.1 Deterministic Annealing

In the following, the value of m_ij is viewed as the probability that data point i belongs to neighbourhood j, requiring

Σ_{j=1}^{K} m_ij = 1,   m_ij ≥ 0,   j = 1, ..., K    (3)

In this way we introduce a set of N random variables s_i with probability distributions P{s_i = j} = m_ij which randomize the data-to-neighbourhood assignments. Thus (2) denotes the expected error w.r.t. these s_i distributions. In contrast to the r_j of the previous section it does not make sense to prespecify the distributions of the s_i; the corresponding probabilities have to be estimated from the data.

However, such an estimation scheme is not well-defined unless some further constraints are imposed on the m_ij. A suitable approach is to constrain the entropy of the s_i distributions. This technique can be derived from the well-known "maximum entropy" principle [5] and can simply be implemented by adding a regularization term to the error function (2):

E(M, Θ; α, β) = E(M, Θ; α) + (1/β) Σ_{i=1}^{N} Σ_{j=1}^{K} m_ij log m_ij    (4)

where β plays the role of an inverse temperature [11]. For an infinitely high temperature, i.e. β → 0, minimization of (4) w.r.t. the m_ij yields the maximum-entropy solution for the s_i distributions, with all probabilities equal to 1/K. For α = 0, optimization of the point prototypes yields ŵ_k = (1/N) Σ_i x_i for all k as the unique minimizers of (4), since the Hessian of the error function is positive definite in this case [11]. This means that for β → 0 all optimal prototypes coincide in the global sample mean. If β is increased without neighbourhood constraint, this state remains stable as long as the value of 1/β exceeds the largest eigenvalue of the sample covariance matrix, which is known as the "critical temperature" [11]. With neighbourhood constraint this critical temperature also depends on H [4].
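This critical value is easy to compute; a minimal numpy sketch (section 5 uses the same eigenvalue to set the initial temperature 1/β_min):

    import numpy as np

    def critical_inverse_temperature(X):
        """Without neighbourhood coupling, the sample-mean solution stays stable
        as long as 1/beta exceeds the largest eigenvalue of the sample covariance."""
        C = np.cov(X, rowvar=False, bias=True)     # sample covariance of the N x d data matrix
        return 1.0 / np.linalg.eigvalsh(C).max()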

With further increasing β the prototype vectors undergo a series of splittings in order to minimize the regularized error function, and in the limit β → ∞ a hard-clustering of the data is achieved. The technique described so far is well known as "deterministic annealing" and has the reputation of being rather robust against shallow local minima, provided an adequate annealing schedule is chosen.

3.2 Parameter Estimation

For given values of α, β and H, minimization of (4) can be achieved by a special version of the EM algorithm [2], in which the following two steps are iterated until convergence.

E-Step: Given some parameter values Θ, the optimal membership probabilities can be derived from the corresponding stationarity conditions (zero first derivatives) under the constraint (3), which yield

m̂_ij = exp{−β E_j(x_i; α)} / Σ_k exp{−β E_k(x_i; α)}    (5)

for i = 1, ..., N and j = 1, ..., K.

M-Step: Given some values for the membership probabilities, the optimal prototypes are derived from the corresponding stationarity conditions, which yield the following local means for k = 1, ..., K:

ŵ_k = (1/n_k) Σ_{i=1}^{N} Σ_{j=1}^{K} m̂_ij h_jk x_i    (6)

n_k = Σ_{i=1}^{N} Σ_{j=1}^{K} m̂_ij h_jk    (7)

For α > 0, the optimal direction matrices in (4) are defined by

V̂_k = argmax_V tr(S_k V V^T)  subject to  V^T V = I_q    (8)

with tr(·) denoting the trace operation, I_q the q × q identity matrix and

S_k = (1/n_k) Σ_{i=1}^{N} Σ_{j=1}^{K} m̂_ij h_jk (x_i − ŵ_k)(x_i − ŵ_k)^T    (9)

a local covariance matrix. It can be shown that a (non-unique) maximizer of the trace in (8) is found from an eigenvalue decomposition of S_k, with V̂_k containing as columns those eigenvectors associated with the q largest eigenvalues of S_k (see e.g. [6], p. 9). Thus estimation of the optimal direction matrices amounts to performing K local PCAs.

3.3 Varying α and H

To obtain a good local minimum of the global error we combine the above deterministic annealing with two other deformations of the error function, which involve a gradual increase of the parameter α and a successive modification of H realizing a "shrinking" neighbourhood. Since the splitting scheme of the previous subsection is not well-defined for general linear manifolds, it makes sense to first apply deterministic annealing to a set of initial point prototypes with α = 0. In order to control the extent of the neighbourhoods it is necessary to provide a suitable parametrization of the neighbourhood matrix H. A convenient choice is a Gaussian neighbourhood function, which for a 1-D array leads to the probabilities

h_jk = (1/Z_j) exp( −|j − k|² / (2σ²) )    (10)

where Z_j is chosen to give unit row sums of H. Such a scheme easily generalizes to higher-dimensional arrays and controls the neighbourhood width via the variance σ² of the Gaussian neighbourhood function. With the specification of a neighbourhood function we can now extend the global error function to E(M, Θ; α, β, σ) in order to make the σ-dependency explicit. As in the SOM case, minimization of this function should start with a large neighbourhood width, which is then successively decreased until, in the limit, all couplings between prototypes may vanish.
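To make the updates concrete, the following minimal numpy sketch implements the Gaussian neighbourhood matrix (10) and one EM iteration, i.e. the E-step (5) followed by the M-step (6)-(9). All names are ours; this is an illustrative sketch, not the authors' implementation.

    import numpy as np

    def gaussian_neighbourhood(K, sigma):
        """Neighbourhood matrix H for a 1-D array of K nodes, eq. (10);
        for sigma = 0 the Gaussian is replaced by a Dirac impulse (H = I)."""
        if sigma == 0.0:
            return np.eye(K)
        idx = np.arange(K)
        H = np.exp(-(idx[:, None] - idx[None, :]) ** 2 / (2.0 * sigma ** 2))
        return H / H.sum(axis=1, keepdims=True)        # Z_j gives unit row sums

    def neighbourhood_errors(X, alpha, H, W, V):
        """E_j(x_i; alpha) of eq. (1) for all data points i and neighbourhoods j."""
        N, K = X.shape[0], H.shape[0]
        E = np.zeros((N, K))
        for k in range(K):
            R = X - W[k]
            R = R - alpha * (R @ V[k]) @ V[k].T        # apply I - alpha V_k V_k^T
            sq = np.sum(R * R, axis=1)                 # squared alpha-distances to prototype k
            E += 0.5 * np.outer(sq, H[:, k])           # weight by h_jk for every neighbourhood j
        return E

    def em_step(X, alpha, beta, H, W, V, q):
        """One EM iteration of section 3.2 for fixed alpha, beta and H."""
        # E-step, eq. (5): soft (or, for beta = inf, hard) memberships
        E = neighbourhood_errors(X, alpha, H, W, V)
        if np.isinf(beta):
            M = np.eye(H.shape[0])[np.argmin(E, axis=1)]
        else:
            logits = -beta * E
            logits -= logits.max(axis=1, keepdims=True)    # numerical stabilisation
            M = np.exp(logits)
            M /= M.sum(axis=1, keepdims=True)
        # M-step, eqs. (6)-(9): local means, local covariances, K local PCAs
        G = M @ H                                      # G[i, k] = sum_j m_ij h_jk
        n = np.maximum(G.sum(axis=0), 1e-12)           # eq. (7), guarded against empty nodes
        W = (G.T @ X) / n[:, None]                     # eq. (6)
        V_new = []
        for k in range(H.shape[0]):
            R = X - W[k]
            S = (R * G[:, k:k + 1]).T @ R / n[k]       # eq. (9), local covariance matrix
            vals, vecs = np.linalg.eigh(S)
            V_new.append(vecs[:, np.argsort(vals)[-q:]])   # eq. (8): top-q eigenvectors
        return M, W, np.stack(V_new)

Note that for α = 0 the direction matrices do not influence the error, so the annealing phase with point prototypes effectively trains a probabilistic SOM.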
Although other strategies are conceivable, the following overall optimization scheme has proven useful in all our experiments. We always start with α = 0 at some high temperature 1/β_min and with a large neighbourhood width. Then β is increased in a few optimization steps according to an exponential schedule. After this initial deterministic annealing phase we continue with zero-temperature hard-clustering and, according to a linear schedule, increase α and decrease σ in a few steps until α = 1 and σ = 0, respectively. For σ = 0 the Gaussian neighbourhood function is replaced by a Dirac impulse. The overall optimization scheme is shown in Table 1 in a more algorithmic fashion.
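Read algorithmically (Table 1 below states the same scheme in compact form), the schedule could be sketched as follows, reusing gaussian_neighbourhood and em_step from above; the initialisation follows section 5, while beta_max, the jitter and the number of EM iterations per stage are illustrative choices of ours.

    import numpy as np

    def fit_tlpca(X, K, q, beta_max=1e3, eta=2.0, d_alpha=0.2, sigma_max=2.0, seed=0):
        """Homotopy-based optimisation of a 1-D T-LPCA map (cf. Table 1)."""
        N, d = X.shape
        rng = np.random.default_rng(seed)
        C = np.cov(X, rowvar=False, bias=True)
        beta = 1.0 / np.linalg.eigvalsh(C).max()       # 1/beta_min = largest covariance eigenvalue
        alpha, sigma = 0.0, sigma_max                  # sigma_max = twice the grid spacing (sec. 5)
        W = np.tile(X.mean(axis=0), (K, 1)) + 1e-3 * rng.standard_normal((K, d))  # jittered mean
        V = np.stack([np.linalg.qr(rng.standard_normal((d, q)))[0] for _ in range(K)])
        # Phase 1: deterministic annealing with point prototypes (alpha = 0)
        while beta < beta_max:
            H = gaussian_neighbourhood(K, sigma)
            for _ in range(5):                         # EM iterations per temperature (a choice)
                M, W, V = em_step(X, alpha, beta, H, W, V, q)
            beta *= eta                                # exponential schedule for beta
        # Phase 2: zero temperature; increase alpha, shrink the neighbourhood
        while True:
            H = gaussian_neighbourhood(K, sigma)
            for _ in range(5):
                M, W, V = em_step(X, alpha, np.inf, H, W, V, q)
            if alpha >= 1.0:
                break
            alpha = min(1.0, alpha + d_alpha)
            sigma = (1.0 - alpha) * sigma_max
        return M, W, V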

➊ Define β_max > β_min > 0; σ_max > 0; η > 1; Δα > 0
➋ Initialize α = 0; β = β_min; σ = σ_max
➌ Minimize E(M, Θ; α, β, σ)
➍ Set β = η β
➎ If β < β_max goto ➌
➏ Minimize E(M, Θ; α, ∞, σ)
➐ Set α = α + Δα; σ = (1 − α) σ_max
➑ If α ≤ 1 goto ➏, else stop

Table 1: Homotopy-based optimization scheme; for example values of the constants β_min, β_max, σ_max, η and Δα see section 5.

4 Relations to the ASSOM

Learning with the Adaptive Subspace SOM (ASSOM) [8] can be viewed as an online variant of our T-LPCA optimization scheme for the particular case β → ∞ and α = 1, with the linear manifolds passing through the origin, i.e. w_k = 0, k = 1, ..., K. Due to the latter constraint the ASSOM cannot formally be viewed as a generalization of the SOM, and in practice it would not be possible to build the ASSOM from an initially given SOM by simply extending the prototypes. In addition, the constraint specializes the ASSOM to certain kinds of data distributions, as illustrated by the experiments of the next section. In cases where the local means w_k of the T-LPCA map are highly correlated with the main directions of the V_k, the ASSOM can be expected to yield a similarly good representation of the data. Our experimental results indicate that this might be the case for the handwritten digit image data which we used for T-LPCA training, since figure 3 shows that the main directions in the second row (from the bottom) are approximately scaled versions of the corresponding means in the bottom row. However, the noisy circle (see figures 1 and 2) is an example which is better suited to the more general T-LPCA model, since that model allows the linear manifolds to have arbitrary offsets w.r.t. the origin.

5 Experimental Results

In all experiments we applied the optimization scheme described in section 3 and Table 1. The initial neighbourhood width σ_max was set to twice the grid spacing and the initial temperature 1/β_min to the largest eigenvalue of the sample covariance matrix. In the deterministic annealing phase we used a factor η = 2 to increase β over 10 iterations. During each iteration the EM optimization of subsection 3.2 was applied to reduce the global error. In the second, zero-temperature phase α was incremented by Δα = 0.2.

5.1 Noisy Circle

In the first experiment, prototypes with one-dimensional subspaces (q = 1) were fitted to a synthetic data set of 100 points, generated by sampling from the unit circle and adding isotropic Gaussian noise with standard deviation 0.1. We used a model with K = 6 prototypes and a 1D array of equally spaced nodes. The residual squared error was 0.00858 per data point on average and the resulting model is depicted in figure 1. For comparison we also fitted a model with zero local means in order to achieve an ASSOM-like representation. The average squared error was 0.0191 in this case and the result is shown in figure 2.

Figure 1: T-LPCA model with 1D topology and K = 6 prototypes fitted to a 100-point sample of the noisy circle.

5.2 Feature Representation

In the second experiment we used a downsampled 1000-point subset of the MNIST database (http://www.research.att.com/yann/ocr/mnist/) containing 8 × 8 images of handwritten "1" digits. In the 64-dimensional data space we fitted a T-LPCA model with K = 6 prototypes and q = 5 subspace dimensions.

Figure 2: A T-LPCA map with w_k = 0, k = 1, ..., K, leads to an ASSOM-like model with all K = 6 lines passing through the origin.

Again the nodes were arranged on a regular 1D array. From the result in figure 3 we see that most of the non-linear variation, in this case mainly due to rotation in the image plane, is captured along the horizontal array dimension. In addition, some linear feature filters emerge along the vertical subspace dimensions.

Figure 3: T-LPCA map from a 1000-point sample of 64-dimensional image vectors of handwritten "1" digits; the bottom row shows the local means w_k, the rows above show the column vectors of the direction matrices V_k for q = 5.

5.3 Visualization

A convenient visualization of a 2D SOM can be achieved if the distance between neighbouring prototypes is mapped to the greylevel of a corresponding image region, according to the topology of the underlying SOM array. In essence the resulting visualization resembles the so-called U-map [12], and one might argue that this concept easily carries over to linear manifold prototypes. However, a suitable distance metric is not obvious. The distance between two subspaces S_j and S_k, represented by the matrices V_j, V_k (see section 2), can be defined as [3]

dist(S_j, S_k) = ||V_j V_j^T − V_k V_k^T||_2    (11)

which equals the largest singular value of V_j V_j^T − V_k V_k^T. However, this distance does not involve the local means, and the results we achieved using the direction matrices only were rather poor. As a possible alternative we investigated an extension of the usual point-to-point distance to a sum of point-to-manifold distances,

D_jk = (1/2) ||(2I − V_j V_j^T − V_k V_k^T)(w_j − w_k)||    (12)

which is simply half the orthogonal distance of the local mean w_j to linear manifold k plus half the distance of w_k to manifold j. For zero-dimensional subspaces it reduces to the usual point-to-point distance normally used for U-map imaging. As an illustrative example we used this "pseudo"-distance to build a U-map from a T-LPCA model which we had trained on 8 × 8 images of the digits "0" to "4". The training set contained 1000 examples of each digit, which were used to optimize a model with 36 nodes arranged on a regular 6 × 6 grid. The resulting U-map is shown in figure 4.

Figure 4: T-LPCA U-map for the 6 × 6 model built from digit data; fields between units have a greylevel proportional to the D_jk "distance" defined in the text; on diagonals the minimum of both distances is taken; fields of units take the median of their surrounding fields; labels indicate the most common digit class mapped to the corresponding prototype.
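A minimal numpy sketch of the two candidate distances (11) and (12), with our naming:

    import numpy as np

    def subspace_distance(Vj, Vk):
        """Eq. (11): largest singular value of V_j V_j^T - V_k V_k^T (spectral norm)."""
        return np.linalg.norm(Vj @ Vj.T - Vk @ Vk.T, ord=2)

    def pseudo_distance(wj, Vj, wk, Vk):
        """Eq. (12): point-to-manifold "pseudo"-distance used for the U-map."""
        d = wj.shape[0]
        A = 2.0 * np.eye(d) - Vj @ Vj.T - Vk @ Vk.T
        return 0.5 * np.linalg.norm(A @ (wj - wk))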

6 Conclusion

We conclude that the T-LPCA map is a highly promising extension of the SOM, capable of capturing some high-dimensional local linear variation in addition to the global non-linear variation along the low-dimensional SOM array. We showed that T-LPCA maps can be formulated as a probabilistic generalization of the SOM by means of a suitable parametrization of a global error function which is minimized by homotopy-based optimization.

References

[1] Christoph Bregler and Stephen M. Omohundro. Surface learning with applications to lipreading. In Jack D. Cowan, Gerald Tesauro, and Joshua Alspector, editors, Advances in Neural Information Processing Systems, volume 6, pages 43-50. Morgan Kaufmann Publishers, Inc., 1994.
[2] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1-38, 1977.
[3] G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins University Press, Baltimore, third edition, 1996.
[4] T. Graepel, M. Burger, and K. Obermayer. Phase transitions in stochastic self-organizing maps. Physical Review E, 56(4):3876-3890, 1997.
[5] E. Jaynes. Information theory and statistical mechanics. Physical Review, 106(4):620-630, 1957.
[6] I. T. Jolliffe. Principal Component Analysis. Springer, New York, 1986.
[7] Nanda Kambhatla and Todd K. Leen. Fast nonlinear dimension reduction. In Advances in Neural Information Processing Systems, volume 6, pages 152-159. Morgan Kaufmann Publishers, Inc., 1994.
[8] T. Kohonen. Self-Organizing Maps. Springer, Berlin, 1995.
[9] Stephen P. Luttrell. A Bayesian analysis of self-organizing maps. Neural Computation, 6(5):767-794, 1994.
[10] H. Ritter, T. Martinetz, and K. Schulten. Neural Computation and Self-Organizing Maps: An Introduction. Addison-Wesley, Reading, MA, 1992.
[11] K. Rose, E. Gurewitz, and G. C. Fox. Vector quantization by deterministic annealing. IEEE Transactions on Information Theory, 38(4):1249-1257, 1992.
[12] A. Ultsch. Self-organizing neural networks for visualization and classification. In O. Opitz, B. Lausen, and R. Klar, editors, Information and Classification, pages 307-313. Springer, Berlin, 1993.