A Stochastic Optimization Approach for Unsupervised Kernel Regression

Oliver Kramer
Institute of Structural Mechanics, Bauhaus-University Weimar
oliver.kramer@uni-weimar.de

Fabian Gieseke
Institute of Structural Mechanics, Bauhaus-University Weimar
fabian.gieseke@uni-weimar.de

Abstract
Unsupervised kernel regression (UKR), the unsupervised counterpart of the Nadaraya-Watson estimator, is a dimension reduction technique for learning low-dimensional manifolds. It is based on optimizing representative low-dimensional latent variables with regard to the data space reconstruction error. The problem of scaling initial local linear embedding solutions, and the subsequent optimization in latent space, is a continuous multi-modal optimization problem. In this paper we employ evolutionary approaches for the optimization of the UKR model. Based on local linear embedding solutions, the stochastic search techniques are used to optimize the UKR model. An experimental study compares the covariance matrix adaptation evolution strategy to an iterated local search evolution strategy.

I. INTRODUCTION

Unsupervised kernel regression (UKR) is the unsupervised counterpart of the Nadaraya-Watson estimator. It is based on optimizing latent variables w.r.t. the data space reconstruction error. The evolved latent variables define a UKR solution. Projection back into data space yields the manifold. But the UKR optimization problem has multiple local optima. In the following, we employ evolutionary optimization strategies, i.e., the CMA-ES and the Powell evolution strategy (Powell-ES) [11], to optimize UKR manifolds. Furthermore, we analyze the influence of local linear embedding (LLE) [18] as initialization procedure.

This paper is organized as follows. In Section II we give a brief overview of manifold learning techniques. In Section III we introduce the UKR problem by Meinicke et al. [12], [13]. In this section we also briefly present the standard UKR optimization approach that we are going to replace by an evolutionary framework. Section IV introduces the components of the evolutionary approach. An experimental study in Section V gives insights into the evolutionary optimization process. Important results are summarized in Section VI.

II. RELATED WORK

High-dimensional data is usually difficult to interpret and visualize. But many data sets show correlations between their variables. For high-dimensional data, a low-dimensional simplified manifold can often be found that represents characteristics of the original data. The assumption is that the structurally relevant data lies on an embedded manifold of lower dimension. In the following, we give an overview of well-known manifold learning techniques, pointing out that the overview can only be a subjective depiction of a wide field of methods.

One of the most famous and widely used dimension reduction methods is principal component analysis (PCA), which assumes linearity of the manifold. Pearson [16] provided early work in this field. He fitted lines and planes to a given set of points. The standard PCA as most frequently known today can be found in the depiction of Jolliffe [8]. PCA computes the eigenvectors corresponding to the largest eigenvalues of the covariance matrix of the data samples.
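As an aside, this eigendecomposition view of PCA can be written in a few lines of NumPy. The sketch below is only illustrative and not part of the paper; Y denotes an N x d data matrix and q the target dimension.

```python
import numpy as np

def pca(Y, q):
    """Project the N x d data matrix Y onto its q leading principal components."""
    Yc = Y - Y.mean(axis=0)                        # center the data
    C = np.cov(Yc, rowvar=False)                   # d x d covariance matrix
    eigval, eigvec = np.linalg.eigh(C)             # eigendecomposition
    W = eigvec[:, np.argsort(eigval)[::-1][:q]]    # eigenvectors of the q largest eigenvalues
    return Yc @ W                                  # N x q low-dimensional representation

# usage: X = pca(np.random.rand(100, 5), q=2)
```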
An approach for learning nonlinear manifolds is kernel PCA [19], which projects the data into a Hilbert space, similarly to the SVM and SVR principle. Hastie and Stuetzle [6] introduced principal curves, which are self-consistent smooth curves that pass through the middle of data clouds. Self-consistency means that the principal curve is the average of the data projected onto it. Bad initializations can lead to bad local optima in the previous approaches. A solution to this problem is the k-segments algorithm by Verbeek, Vlassis, and Kröse [21], which alternates the fitting of unconnected local principal axes and the connection of the segments to form a polygonal line.

Self-organizing feature maps [10] proposed by Kohonen learn a topological mapping from data space to a map of neurons, i.e., they perform a mapping to discrete values based on neural (codebook) vectors in data space. During the training phase the neural vectors are pulled into the direction of the training data. Generative topographic mapping (GTM) by Bishop, Svensén, and Williams [2], [3] is similar to self-organizing feature maps, but assumes that the observed data has been generated by a parametric model, e.g., a Gaussian mixture model. It can be seen as a constrained mixture of Gaussians, while the SOM can be viewed as a constrained vector quantizer.

Multi-dimensional scaling is a further class of dimension reduction methods and is based on the pointwise embedding of the data set, i.e., for each high-dimensional point y_i a low-dimensional point x_i is found for which similarity or dissimilarity conditions hold. For example, the pairwise (Euclidean) distances of two low-dimensional points shall be consistent with their high-dimensional counterparts. Another famous dimension reduction method based on multi-dimensional scaling is Isomap, introduced by Tenenbaum, Silva, and Langford [20]. It is based on three steps: first, a neighborhood graph of Y is computed; second, the shortest distances between its nodes y_i and y_j are computed (e.g., with Dijkstra's algorithm); third, multi-dimensional scaling is applied based on the distances along the neighborhood graph, which represent curvilinear rather than Euclidean distances. The optimal embedding can be computed by the solution of an eigendecomposition.

III. UNSUPERVISED KERNEL REGRESSION

In this section we introduce the UKR approach, regularization techniques, and the Klanke-Ritter optimization scheme [9].

A. Kernel Functions

UKR is based on kernel density functions K : R^q -> R. A typical kernel function is the (multivariate) Gaussian kernel

    K_G(z) = \frac{1}{(2\pi)^{q/2} \det(H)} \exp\left( -\frac{1}{2} \| H^{-1} z \|^2 \right),    (1)

with bandwidth matrix H = diag(h_1, h_2, ..., h_q). Figure 1 illustrates that the result significantly depends on the choice of an appropriate bandwidth: the kernel density estimate is visualized for a random data cloud, and a too small bandwidth results in an overfitted model (left).

Fig. 1. Comparison of kernel density estimates with two bandwidths of a data cloud.

B. UKR Formulation

UKR has been introduced by Meinicke et al. [12], [13] within a general regression framework for the reversed problem of learning manifolds. The task of UKR is to find the input parameters of the regression function, which are the latent variables of the low-dimensional manifold. UKR reverses the Nadaraya-Watson estimator [14], [22], i.e., the latent variables X = (x_1, ..., x_N) ∈ R^{q×N} become parameters of the system; in particular, X becomes the lower-dimensional representation of the observed data Y = (y_1, ..., y_N) ∈ R^{d×N}. The regression function can be written as follows:

    f(x; X) = \sum_{i=1}^{N} y_i \frac{K(x - x_i)}{\sum_{j=1}^{N} K(x - x_j)},    (2)

which is the reversed Nadaraya-Watson estimator. The free parameters X define the manifold, i.e., the low-dimensional representation of the data Y. Parameter x is the location where the function is evaluated, and is based on the entries of X. For convenience, Klanke and Ritter [9] introduced a vector b(·) ∈ R^N of basis functions that define the ratios of the kernels:

    b_i(x; X) = \frac{K(x - x_i)}{\sum_{j=1}^{N} K(x - x_j)}.    (3)

Each component i of the vector b(·) contains the relative kernel density of point x w.r.t. the i-th point of matrix X. Equation (2) can also be written in terms of these basis functions:

    f(x; X) = \sum_{i=1}^{N} y_i b_i(x; X) = Y b(x; X).    (4)

The matrix Y of observed data is fixed, and the basis functions are tuned during the learning process. The basis functions b_i sum to one, as they are normalized by the denominator. The quality of the principal manifold learning is evaluated with the data space reconstruction error, i.e., the Euclidean distance between the training data and its reconstruction,

    R(X) = \frac{1}{N} \| Y - Y B(X) \|_F^2,    (5)

using the Frobenius norm and the matrix of basis functions. The Frobenius norm of a matrix A ∈ R^{m×n} is defined as

    \| A \|_F = \sqrt{ \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2 }.    (6)

To summarize, b_i ∈ R is a relative kernel density, b ∈ R^N is a vector of basis functions, and B ∈ R^{N×N} is a matrix whose columns consist of the vectors of basis functions,

    B(X) = (b(x_1; X), ..., b(x_N; X)).    (7)

Hence, the product of Y ∈ R^{d×N} and B ∈ R^{N×N} (which is the Nadaraya-Watson estimator) results in a d × N matrix. Leave-one-out cross-validation (LOO-CV) can easily be employed by setting the diagonal entries of B to zero, normalizing the columns, and then applying Equation (5). Instead of applying LOO-CV, a UKR model can be regularized via penalizing extension in latent space (corresponding to penalizing small bandwidths in kernel regression) [9]:

    R(X) = \frac{1}{N} \| Y - Y B(X) \|_F^2 + \lambda \| X \|_F^2.    (8)

The regularized approach will be used in the comparison between the CMA-ES and the Powell-ES in Section V-C.
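To make the pieces above concrete, the following NumPy sketch evaluates the LOO-CV reconstruction error of Equations (5) and (8) for a given latent matrix X. It is a minimal sketch, not the authors' implementation: it assumes a Gaussian kernel with unit bandwidths, and the function and variable names are illustrative.

```python
import numpy as np

def basis_matrix(X, loo_cv=True):
    """Kernel basis matrix B(X) of Eq. (7); X has shape (q, N)."""
    # pairwise squared distances between latent points
    D2 = np.sum((X[:, :, None] - X[:, None, :]) ** 2, axis=0)
    K = np.exp(-0.5 * D2)                        # Gaussian kernel, unit bandwidth (assumption)
    if loo_cv:
        np.fill_diagonal(K, 0.0)                 # leave-one-out: drop self-contributions
    return K / K.sum(axis=0, keepdims=True)      # normalize columns, Eq. (3)

def ukr_error(X, Y, lam=0.0, loo_cv=True):
    """Data space reconstruction error, Eq. (5), optionally penalized as in Eq. (8)."""
    B = basis_matrix(X, loo_cv)
    N = Y.shape[1]
    residual = np.linalg.norm(Y - Y @ B, ord="fro") ** 2 / N
    return residual + lam * np.linalg.norm(X, ord="fro") ** 2
```

In this sketch, the CV-error discussed in the following sections corresponds to calling ukr_error with loo_cv=True.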
C. Klanke-Ritter Optimization Scheme

Klanke and Ritter [9] have introduced an optimization scheme consisting of various steps. It uses PCA [8] and multiple LLE [18] solutions for initialization, see Section III-D. In particular, the optimization scheme consists of the following steps:

1) initialization of n + 1 candidate solutions: n solutions from LLE, one solution from PCA,
2) selection of the best initial solution w.r.t. the CV-error,
3) search for optimal scale factors that scale the best LLE solution to a UKR solution w.r.t. the CV-error,
4) selection of the most promising solution w.r.t. the CV-error,
5) CV-error minimization: if the best solution stems from PCA, search for optimal regularization parameters η with the homotopy method; if the best solution stems from LLE, CV-error minimization with the homotopy method or resilient backpropagation (RPROP) [17],
6) final density threshold selection.

For a detailed and formal description of the steps, we refer to the depiction of Klanke and Ritter [9]. They discuss that spectral methods do not always yield an initial set of latent variables that is close to a sufficiently deep local minimum. We will employ evolution strategies to solve the global optimization problem, but also use LLE for initial solutions.

D. Local Linear Embedding

For non-linear manifolds, LLE by Roweis and Saul [18] is often employed. LLE assumes local linearity of manifolds. It works as follows for mapping high-dimensional points y ∈ Y to low-dimensional embedding vectors x ∈ X. LLE computes a neighborhood graph, like the other nonlinear spectral embedding methods, see Section II. Then it computes the weights w_{ij} that best reconstruct each point y_i from its neighbors, minimizing the cost function

    R(W) = \sum_{i=1}^{N} \left\| y_i - \sum_{j} w_{ij} y_j \right\|^2.    (9)

The resulting weights capture the geometric structure of the data, as they are invariant under rotation, scaling, and translation of the data vectors. Then, LLE computes the low-dimensional vectors x_i that are best reconstructed by these weights, minimizing

    R(X) = \sum_{i=1}^{N} \left\| x_i - \sum_{j} w_{ij} x_j \right\|^2.    (10)

For a detailed introduction to LLE we refer to [18], and to Chang and Yeung [4] for a variant robust against outliers. A free parameter of LLE is the neighborhood size, which can range from 1 to N. LLE is employed as initialization routine: the best LLE solution (w.r.t. the CV-error of the UKR manifold) is used as the basis for the subsequent stochastic CV-error minimization.

IV. EVOLUTIONARY UNSUPERVISED KERNEL REGRESSION

A. Evolutionary Optimization Scheme

We employ the CMA-ES and the Powell-ES to solve two steps of the UKR optimization framework. Our aim is to replace the rather complicated optimization scheme briefly summarized in Section III-C. The evolutionary scheme consists of the following steps:

1) initialization of n candidate LLE solutions,
2) selection of the best initial solution X̂_init w.r.t. the CV-error,
3) search for optimal scale factors

    s^* = \arg\min_{s} R_{CV}(\mathrm{diag}(s) \cdot \hat{X}_{init})    (11)

with the CMA-ES/Powell-ES, and
4) CV-error minimization with the CMA-ES/Powell-ES.

We employ Huber's loss function [7], see Section IV-B, for the following experiments. The subsequent sections describe an experimental analysis of the evolutionary approach. A side effect of the use of an evolutionary scheme is that arbitrary, also non-differentiable, kernel functions can be employed. LOO-CV can easily be implemented by setting the diagonal entries of B to zero, normalizing the columns, and then applying Equation (5). In the following, we compare two optimization approaches for optimizing the UKR model: (1) Powell's conjugate gradient ES [11], and (2) the CMA-ES [15].
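This pipeline can be sketched end-to-end in Python. The sketch below is illustrative rather than the authors' implementation: it assumes scikit-learn's LocallyLinearEmbedding for step 1), the pycma package (cma.fmin) as CMA-ES for steps 3) and 4), and the ukr_error function from the sketch in Section III-B; neighborhood sizes, step sizes, and budgets are placeholders.

```python
import numpy as np
import cma                                            # pycma package (assumption)
from sklearn.manifold import LocallyLinearEmbedding   # used here for the LLE initialization

def evolutionary_ukr(Y, q=2, lam=0.0, neighbor_grid=(5, 10, 15, 20), budget=40000):
    """Y: data matrix of shape (d, N); returns an optimized latent matrix X of shape (q, N)."""
    d, N = Y.shape
    # 1) + 2) LLE candidates for several neighborhood sizes, keep the best w.r.t. the CV-error
    candidates = [LocallyLinearEmbedding(n_neighbors=k, n_components=q).fit_transform(Y.T).T
                  for k in neighbor_grid]
    X_init = min(candidates, key=lambda X: ukr_error(X, Y, lam=lam))

    # 3) scale factor search, Eq. (11): optimize a q-dimensional scaling vector s
    def scale_objective(s):
        return ukr_error(np.diag(s) @ X_init, Y, lam=lam)
    res = cma.fmin(scale_objective, np.ones(q), 0.5,
                   options={"maxfevals": budget // 2, "verbose": -9})
    X = np.diag(res[0]) @ X_init

    # 4) final CV-error minimization over all q * N latent coordinates
    def cv_objective(x_flat):
        return ukr_error(x_flat.reshape(q, N), Y, lam=lam)
    res = cma.fmin(cv_objective, X.ravel(), 0.1,
                   options={"maxfevals": budget // 2, "verbose": -9})
    return res[0].reshape(q, N)
```

A Powell-ES variant would replace the two cma.fmin calls with restarts of Powell's local search embedded in a (μ+λ)-ES, as described in Section V-C.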
B. Huber's Loss

In regression, typically different loss functions are used that weight the residuals. In the best case, the loss function is chosen according to the needs of the underlying data mining model. With the design of a loss function, the emphasis of outliers can be controlled. Let L : R^q × R^d -> R be the loss function. In the univariate case d = 1, a loss function is defined as L = \sum_{i=1}^{N} L(y_i, f(x_i)). The L1 loss is defined as

    L_1 = \sum_{i=1}^{N} | y_i - f(x_i) |,    (12)

and the L2 loss is defined as

    L_2 = \sum_{i=1}^{N} ( y_i - f(x_i) )^2.    (13)

Huber's loss [7] is a differentiable alternative to the L1 loss and makes use of a trade-off point δ between the L1 and the L2 characteristic, i.e.,

    L_H = \sum_{i=1}^{N} L_h( y_i - f(x_i) ),    (14)

with

    L_h(r) = \begin{cases} \frac{1}{2} r^2 & \text{if } |r| < \delta, \\ \delta |r| - \frac{1}{2} \delta^2 & \text{if } |r| \ge \delta. \end{cases}    (15)

The parameter δ allows a problem-specific adjustment to certain problem characteristics. In the experimental part we use a fixed setting of δ.
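A minimal NumPy version of this loss, as used below for both training and test residuals, could look as follows; the default δ is an illustrative placeholder, not a value from the paper.

```python
import numpy as np

def huber_loss(residuals, delta=1.0):
    """Huber's loss, Eqs. (14)/(15): quadratic for small residuals, linear for large ones."""
    r = np.abs(np.asarray(residuals, dtype=float))
    quadratic = 0.5 * r ** 2                    # |r| < delta branch
    linear = delta * r - 0.5 * delta ** 2       # |r| >= delta branch
    return np.sum(np.where(r < delta, quadratic, linear))

# usage: huber_loss(y - f_x, delta=1.0)
```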

V. EXPERIMENTAL STUDY

A. Datasets and Fitness Measure

The experimental analysis is based on the following data sets:

- 2D-S: a noisy S-shaped data set with d = 2 and additive noise, plus a noise-free test set,
- 3D-S: a noisy S-shaped data set with d = 3 and additive noise, plus a noise-free test set,
- digits 7: samples with d = 256 (16 × 16 greyscale values) of digit 7 from the digits data set [5], plus a separate test set.

The test error is computed by the projection of test points uniformly generated in latent space and mapped to data space. The test error is the sum of distances between each test point x_t ∈ T and the closest projection, plus the sum of distances between each projected point x_p ∈ P and the closest test point:

    R_t = \sum_{x_t \in T} \min_{x_p \in P} \| x_p - x_t \| + \sum_{x_p \in P} \min_{x_t \in T} \| x_p - x_t \|.    (16)

For the training (validation) and test measurement of residuals, Huber's loss function is employed, see Section IV-B.
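The bidirectional distance of Equation (16) can be computed in a few lines of NumPy; the sketch below assumes the test set T and the projections P are given as arrays of data space points, one point per row.

```python
import numpy as np

def test_error(P, T):
    """Eq. (16): sum of nearest-neighbor distances from T to P plus from P to T."""
    # pairwise Euclidean distances, shape (|P|, |T|)
    D = np.linalg.norm(P[:, None, :] - T[None, :, :], axis=-1)
    return D.min(axis=0).sum() + D.min(axis=1).sum()

# usage: test_error(P=np.random.rand(50, 3), T=np.random.rand(40, 3))
```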
B. LLE Initialization and Scaling Balance

Initialization with LLE is part of the optimization scheme introduced by Klanke and Ritter [9]. The question arises how much optimization effort should be invested into the scaling in comparison to the CV-error minimization process. In the following, we analyze the balance of optimization effort for both optimization steps with a budget of 40,000 optimization steps. Table I shows the CV-errors (1) of the best LLE model, (2) after scaling optimization with the CMA-ES, (3) after the final CV-error minimization, and (4) the error on the test set. We test five optimization balances; the first number indicates the number of steps for the LLE scaling optimization, the second number states the number of steps for the latent variable-based optimization. We test the following combinations: (0/40), meaning no scaling and 40,000 generations of final CV minimization, the mixed balances (10/30), (20/20), and (30/10), and finally (40/0), meaning 40,000 scaling steps and no final CV optimization. The values shown in Table I are the best (minimal) values over the performed runs. The results show that a mixed variant achieves the lowest test error, while the (0/40) variant achieves the lowest training error.

TABLE I. Experimental analysis of optimization steps invested into scaling of the LLE solutions and CV-error minimization (minimal values).

balance   0/40    10/30   20/20   30/10   40/0
LLE       7.7     7.7     7.7     7.7     7.7
scaling   7.7     .88     .95     .44     .8969
CV        .865    .9579   .4789   .777    .8969
test      .76     .5      .95     .       .9

C. CMA-ES and Powell-ES

In the following, we compare two optimization algorithms, i.e., the CMA-ES and the Powell-ES (also known as Powell-ILS, see [11]), as optimization approaches for the UKR learning problem. The Powell-ES is based on Powell's fast local search. To overcome local optima, the Powell-ES makes use of a (μ+λ)-ES [1]. Each objective variable x ∈ R is mutated using Gaussian mutation [1] with a step size (variance σ), and a Powell search is conducted until a local optimum is found. The step sizes of the ES are adapted as follows: if in successive iterations the same local optimum is found, the step sizes are increased to overcome local optima; in turn, if different local optima are found, the step sizes are decreased.

Table II compares the two optimizers on the three data sets, using the penalized variant, see Equation (8). The values show the test error, i.e., the distance between the original data and the projections of the samples after 40,000 fitness function evaluations, split between scale optimization and CV-error minimization. The experiments show that the Powell-ES achieves a lower training error in each of the experiments. But only on the 2D-S problem is this reflected in a lower test error. This means that on 3D-S and digits an overfitting effect occurred that could not be prevented with the penalty regularization approach. A deeper analysis of the balancing parameter λ will be necessary.

TABLE II. Experimental comparison of CMA-ES and Powell-ES on the three data sets 2D-S, 3D-S, and digits.

          CMA-ES                    Powell-ES
          train     test    max     train     test    max
2D-S      9.448     .       .845    8.7854    .5      .4
3D-S      7.74      .5      .5      5.89      .96     .488
digits    5.97      .47     .99     48.865    .68     .54

Fig. 2. Evolutionary UKR on the noisy S data set. The figures show the results of various optimization effort balances between scaling and final CV-error minimization, and of a regularized approach with λ = .5, see Equation (8). The projection is used for visualization.

Figure 2 gives a visual impression of the evolved manifolds. The original data is shown by (blue) spots, the UKR manifold is drawn as a (red) line. With the help of the test set, we compute a test error, see Equation (16).

VI. CONCLUSIONS

The multi-modal optimization problem of UKR can be solved with the CMA-ES or the Powell-ES, leading to an easier optimization framework that is also capable of handling non-differentiable loss and kernel functions. However, initial LLE solutions still improve the optimization process. Overfitting effects might occur that have to be avoided by improved regularization approaches. For this sake, further research has to be invested into the parameter λ, e.g., employing grid search. In the future, we plan to balance regularization with multi-objective optimization techniques.
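As one possible realization of such a grid search, the short sketch below re-fits the model for a handful of candidate λ values and keeps the one with the lowest unpenalized LOO-CV error; it reuses the illustrative ukr_error and evolutionary_ukr functions from the earlier sketches, and the candidate grid and scoring criterion are assumptions rather than choices made in the paper.

```python
def select_lambda(Y, q=2, grid=(0.01, 0.1, 0.5, 1.0)):
    """Grid search for the penalty weight of Eq. (8): re-fit the latent variables
    per candidate, score by the unpenalized LOO-CV error of the result."""
    scores = {lam: ukr_error(evolutionary_ukr(Y, q=q, lam=lam), Y, lam=0.0)
              for lam in grid}
    return min(scores, key=scores.get)
```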

REFERENCES

[1] H.-G. Beyer and H.-P. Schwefel. Evolution strategies - A comprehensive introduction. Natural Computing, 1:3-52, 2002.
[2] C. M. Bishop, M. Svensén, and C. K. I. Williams. Developments of the generative topographic mapping. Neurocomputing, 21(1-3):203-224, 1998.
[3] C. M. Bishop, M. Svensén, and C. K. I. Williams. GTM: The generative topographic mapping. Neural Computation, 10(1):215-234, 1998.
[4] H. Chang and D. Yeung. Robust locally linear embedding. Pattern Recognition, 39:1053-1065, 2006.
[5] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer, Berlin, 2009.
[6] T. Hastie and W. Stuetzle. Principal curves. Journal of the American Statistical Association, 84(406):502-516, 1989.
[7] P. J. Huber. Robust Statistics. Wiley, New York, 1981.
[8] I. Jolliffe. Principal Component Analysis. Springer Series in Statistics. Springer, New York, 1986.
[9] S. Klanke and H. Ritter. Variants of unsupervised kernel regression: General cost functions. Neurocomputing, 70(7-9):1289-1303, 2007.
[10] T. Kohonen. Self-Organizing Maps. Springer, 2001.
[11] O. Kramer. Fast blackbox optimization: Iterated local search and the strategy of Powell. In Proceedings of the 2009 International Conference on Genetic and Evolutionary Methods, 2009.
[12] P. Meinicke. Unsupervised Learning in a Generalized Regression Framework. PhD thesis, University of Bielefeld, 2000.
[13] P. Meinicke, S. Klanke, R. Memisevic, and H. Ritter. Principal surfaces from unsupervised kernel regression. IEEE Trans. Pattern Anal. Mach. Intell., 27(9):1379-1391, 2005.
[14] E. Nadaraya. On estimating regression. Theory of Probability and Its Applications, 10:186-190, 1964.
[15] A. Ostermeier, A. Gawelczyk, and N. Hansen. A derandomized approach to self adaptation of evolution strategies. Evolutionary Computation, 2(4):369-380, 1994.
[16] K. Pearson. On lines and planes of closest fit to systems of points in space. Philosophical Magazine, 2(6):559-572, 1901.
[17] M. Riedmiller and H. Braun. A direct adaptive method for faster backpropagation learning: The RPROP algorithm. In Proceedings of the IEEE International Conference on Neural Networks, pages 586-591, 1993.
[18] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323-2326, 2000.
[19] B. Schölkopf, A. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299-1319, 1998.
[20] J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290:2319-2323, 2000.
[21] J. Verbeek, N. Vlassis, and B. Kröse. A k-segments algorithm for finding principal curves. Pattern Recognition Letters, 23:1009-1017, 2002.
[22] G. Watson. Smooth regression analysis. Sankhya Series A, 26:359-372, 1964.