In: F. Fogelman and P. Gallinari, editors, ICANN'95: International Conference on Artificial Neural Networks, pages 217-222, Paris, France, 1995. EC2 & Cie.

Incremental Learning of Local Linear Mappings

Bernd Fritzke
Institut fur Neuroinformatik, Ruhr-Universitat Bochum, Germany
http://www.neuroinformatik.ruhr-uni-bochum.de

Abstract

A new incremental network model for supervised learning is proposed. The model builds up a structure of units, each of which has an associated local linear mapping (LLM). Error information obtained during training is used to determine where to insert new units whose LLMs are interpolated from their neighbors. Simulation results for several classification tasks indicate fast convergence as well as good generalization. The ability of the model to also perform function approximation is demonstrated by an example.

1 Introduction

Local (or piece-wise) linear mappings (LLMs) are an economical means of describing a "well-behaved" function f: R^n → R^m. The principle is to approximate the function (which may be given by a number of input/output samples (ξ, ζ) ∈ R^n × R^m) with a set of linear mappings, each of which is constrained to a local region of the input space R^n. LLM-based methods have been used earlier to learn the inverse kinematics of robot arms [7], for classification [4], and for time series prediction [6].

A general problem which has to be solved when using LLMs is to partition the input space into a number of parcels such that within each parcel the function f can be described sufficiently well by a linear mapping. Those parcels may be rather large in areas of R^n where f indeed behaves approximately linearly and must be smaller where this is not the case. The total number of parcels needed depends on the desired approximation accuracy and may be limited by the amount of available sample data, since over-fitting might occur.

A widely used method to achieve a partitioning of the input space into parcels is to choose a number of centers in R^n and use the corresponding Voronoi tessellation (which associates each point with the center at minimum Euclidean distance). Existing LLM-based approaches generally assume a fixed number of centers which are distributed in input space by some vector quantization method. Thereafter, or even during the vector quantization, the linear mapping f_c: R^n → R^m associated with each center c is learned by evaluating data pairs. A problem with this approach, however, is that the vector quantization method is driven only by the n-dimensional input part of the data pairs (ξ, ζ) and therefore does not take into account the linearity or non-linearity of f at all. Rather, the centers are distributed according to the density of the input data, which may result in a partition that is sub-optimal for the given task. It may happen, e.g., that a region of R^n where f is perfectly linear is partitioned into many parcels because a large part of the available input data happens to lie in this region.

In this paper we propose a method for incrementally generating a partition of the input space. Our approach uses locally accumulated approximation error to determine where to insert new centers (and associated LLMs). The principle of insertion based on accumulated error has been used earlier for the incremental construction of radial basis function networks [2, 1]. Here we adapt the same idea for LLM-based networks.
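As a small illustration of this partitioning idea, the following Python sketch (not part of the paper; the function and variable names are chosen for this example only) shows the Voronoi assignment of an input vector to its parcel; a separate linear mapping would then be fitted within each parcel.

    import numpy as np

    # Illustration of the Voronoi-based partitioning described above (a sketch,
    # not code from the paper): every input is assigned to the parcel of the
    # center with minimum Euclidean distance.

    def voronoi_parcel(xi, centers):
        """Index of the Voronoi parcel (nearest center) responsible for input xi."""
        return int(np.argmin(np.linalg.norm(centers - xi, axis=1)))

    # Example: three centers in the plane; the point (0.9, 1.1) falls into parcel 2.
    centers = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 1.0]])
    print(voronoi_parcel(np.array([0.9, 1.1]), centers))  # prints 2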

The rest of the paper is organized as follows: we first briefly describe the "growing neural gas" method which we have proposed earlier [3]. Then the combination with LLMs is outlined and finally some simulation results are given.

2 Growing Neural Gas

"Growing neural gas" (GNG) is an unsupervised network model which learns topologies [3]: it incrementally constructs a graph representation of a given data set which is n-dimensional but may stem from a lower-dimensional sub-manifold of the input space R^n. In the following we assume that the data obeys some (unknown) probability distribution P(ξ). In particular, the data set need not be finite but may also be generated continuously by some stationary process.

The GNG method distributes a set of centers (or units) in R^n. This is partially done by adaptation steps but mostly by interpolation of new centers from existing ones. Between two centers there may be an edge indicating neighborhood in R^n. These edges, which are used for interpolation (see below), are inserted with the "competitive Hebbian learning" rule [5] during the mentioned adaptation steps. The "competitive Hebbian learning" rule can simply be stated as: "Insert an edge between the nearest and the second-nearest center with respect to the current input signal."

The GNG algorithm is the following (for a more detailed discussion see [3]):

0. Start with two units a and b at random positions w_a and w_b in R^n.

1. Generate an input signal ξ according to P(ξ).

2. Find the nearest unit s_1 and the second-nearest unit s_2.

3. Increment the age of all edges emanating from s_1.

4. Add the squared distance between the input signal and the nearest unit in input space to a local error variable:

   Δerror(s_1) = ||w_{s_1} - ξ||²

5. Move s_1 and its direct topological neighbors¹ towards ξ by fractions ε_b and ε_n, respectively, of the total distance:

   Δw_{s_1} = ε_b (ξ - w_{s_1})
   Δw_n = ε_n (ξ - w_n)   for all direct neighbors n of s_1

6. If s_1 and s_2 are connected by an edge, set the age of this edge to zero. If such an edge does not exist, create it.

7. Remove edges with an age larger than a_max. If this results in units having no emanating edges, remove them as well.

8. If the number of input signals generated so far is an integer multiple of a parameter λ, insert a new unit as follows: Determine the unit q with the maximum accumulated error. Insert a new unit r halfway between q and its neighbor f with the largest error variable:

   w_r = 0.5 (w_q + w_f)

   Insert edges connecting the new unit r with units q and f, and remove the original edge between q and f. Decrease the error variables of q and f by multiplying them with a constant α. Initialize the error variable of r with the new value of the error variable of q.

9. Decrease all error variables by multiplying them with a decay constant d.

10. If a stopping criterion (e.g., net size or some performance measure) is not yet fulfilled, continue with step 1.

¹ Throughout this paper the term neighbors denotes units which are topological neighbors in the graph (as opposed to units within a small Euclidean distance of each other in input space).
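For concreteness, the following Python sketch walks through the loop above. It is not the authors' implementation: the data structures, the value of a_max and the stopping rule (a fixed signal budget plus a cap on the number of units) are simplifying assumptions, while the remaining default parameters are taken from the simulation shown in figure 1.

    import numpy as np

    # Minimal sketch of the GNG loop above (not the authors' code). `sample` is
    # assumed to be a callable returning one input signal xi drawn from P(xi);
    # edges are stored as a dict mapping frozenset({i, j}) -> age.

    def gng(sample, n_signals, eps_b=0.02, eps_n=0.0006, lam=300,
            a_max=50, alpha=0.5, d=0.9995, max_units=100):
        W = np.array([sample(), sample()], dtype=float)   # step 0: two units at random positions
        errors = np.zeros(2)                              # accumulated error per unit
        edges = {}                                        # frozenset({i, j}) -> age

        def neighbors(i):
            return [j for e in edges for j in e if i in e and j != i]

        for step in range(1, n_signals + 1):
            xi = np.asarray(sample(), dtype=float)        # step 1: draw an input signal
            dist = np.linalg.norm(W - xi, axis=1)
            s1, s2 = np.argsort(dist)[:2]                 # step 2: nearest and second-nearest unit
            for e in edges:                               # step 3: age all edges emanating from s1
                if s1 in e:
                    edges[e] += 1
            errors[s1] += dist[s1] ** 2                   # step 4: accumulate squared distance
            W[s1] += eps_b * (xi - W[s1])                 # step 5: adapt the winner ...
            for n in neighbors(s1):
                W[n] += eps_n * (xi - W[n])               # ... and its topological neighbors
            edges[frozenset((int(s1), int(s2)))] = 0      # step 6: create or refresh the edge
            edges = {e: a for e, a in edges.items() if a <= a_max}  # step 7: drop old edges
            # (removal of units left without any edge is omitted here for brevity)
            if step % lam == 0 and len(W) < max_units:    # step 8: insert a new unit
                q = int(np.argmax(errors))
                nq = neighbors(q)
                if nq:
                    f = max(nq, key=lambda j: errors[j])
                    W = np.vstack([W, 0.5 * (W[q] + W[f])])   # w_r = 0.5 (w_q + w_f)
                    r = len(W) - 1
                    del edges[frozenset((q, f))]
                    edges[frozenset((q, r))] = 0
                    edges[frozenset((f, r))] = 0
                    errors[q] *= alpha                    # decrease errors of q and f by alpha
                    errors[f] *= alpha
                    errors = np.append(errors, errors[q]) # init error of r with new error of q
            errors *= d                                   # step 9: decay all error variables
        return W, edges                                   # step 10: stop after a fixed signal budget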

How does this method work? The adaptation steps towards the input signals (step 5) lead to a general movement of all units towards those areas of the input space where signals come from (P(ξ) > 0). The insertion of edges (step 6) between the nearest and the second-nearest unit with respect to an input signal generates a single connection of the "induced Delaunay triangulation", a subgraph of the Delaunay triangulation restricted to areas of the input space with P(ξ) > 0. The removal of edges (step 7) is necessary to get rid of those edges which are no longer part of the "induced Delaunay triangulation" because their end points have moved and other units are in between them. This is achieved by local edge aging (step 3) around the nearest unit combined with age re-setting of those edges (step 6) which already exist between nearest and second-nearest units. With insertion and removal of edges the model tries to construct and then track the "induced Delaunay triangulation", which is a slowly moving target due to the adaptation of the reference vectors. The accumulation of squared distances (step 4) during the adaptation helps to identify units lying in areas of the input space where the mapping from signals to units causes much error. To reduce this error, new units are inserted in such regions.

3 GNG and LLM

The GNG model just described is unsupervised, and it inserts new units in order to reduce the mean distortion error. For this reason the distortion error is locally accumulated and new units are inserted near the unit with maximum accumulated error. How can this principle be used for supervised learning? We first have to define what the network's output is (which was not necessary for unsupervised learning). Then we can use the difference between actual and desired output to guide the insertion of new units.

Our original problem was to approximate a function f: R^n → R^m which is given by a number of data pairs (ξ, ζ) ∈ R^n × R^m. One should note that this problem includes classification tasks as a special case: the different classes can be encoded by a small number of m-dimensional vectors which are often chosen to be binary (1-out-of-m).

With every unit c of the GNG network (c is positioned at w_c in input space) we now associate an m-dimensional output vector ζ_c and an m × n matrix A_c. The vector ζ_c is the output of the network for the case ξ = w_c, i.e., for input vectors coinciding with one of the centers. For a general input vector ξ the nearest center s_1 is determined and the output g(ξ) of the network is computed from the LLM realized by the stored value ζ_{s_1} and the matrix A_{s_1} as follows:

   g(ξ) = ζ_{s_1} + A_{s_1} (ξ - w_{s_1})

We now have to change the original GNG algorithm to incorporate the LLMs. Since we are interested in reducing the expectation of the mean square error E(|ζ - g(ξ)|²) for data pairs (ξ, ζ), we change step 4 of the GNG algorithm to

   Δerror(s_1) = |ζ - g(ξ)|²

This means that we now locally accumulate the error with respect to the function to be approximated. New units are inserted where the approximation is poor.
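In code, this supervised modification amounts to storing an output vector and a matrix per unit and replacing the error accumulation of step 4. A hedged sketch, continuing the one in section 2 (variable names are assumptions, not the paper's code), might look as follows.

    import numpy as np

    # Each unit i now carries an output vector zeta[i] and an m x n matrix A[i]
    # in addition to its center W[i]. The modified step 4 accumulates the squared
    # output error instead of the distortion error used in the unsupervised loop.

    def g(xi, s1, W, zeta, A):
        # LLM output of the nearest unit s1 for input xi
        return zeta[s1] + A[s1] @ (xi - W[s1])

    def accumulate_error(xi, target, s1, W, zeta, A, errors):
        # replaces  errors[s1] += dist[s1] ** 2  in the sketch of section 2
        errors[s1] += float(np.sum((target - g(xi, s1, W, zeta, A)) ** 2))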

The LLMs associated with the units of our network are initially set at random. At each adaptation step the data pair (ξ, ζ) is used two-fold: ξ is used (as before) for center adaptation, and the whole pair (ξ, ζ) is used to improve the LLM of the nearest center s_1. This is done with a simple delta rule:

   Δζ_{s_1} = ε_m (ζ - g(ξ))
   ΔA_{s_1} = ε_m (ζ - g(ξ)) ⊗ (ξ - w_{s_1})

Thereby ε_m is an adaptation parameter and ⊗ denotes the outer product of two vectors. When a new unit r is inserted (step 8 of the GNG algorithm), its LLM is interpolated from its neighbors q and f:

   ζ_r = 0.5 (ζ_q + ζ_f)
   A_r = 0.5 (A_q + A_f)

A stopping criterion has to be defined to finish the growth process. This can be chosen arbitrarily depending on the application. A possible choice is to observe network performance on a validation set during training and stop when this performance begins to decrease. Alternatively, the error on the training set may be used, or simply the number of units in the network, if for some reason a specific network size is desired.
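A hedged sketch of these two operations, in the same illustrative style as before (not the original implementation), is given below; ε_m defaults to the value used in the simulation of figure 1.

    import numpy as np

    # zeta and A are arrays of per-unit output vectors and matrices, W the array
    # of unit positions, as in the previous sketch.

    def update_llm(xi, target, s1, W, zeta, A, eps_m=0.15):
        residual = target - (zeta[s1] + A[s1] @ (xi - W[s1]))  # zeta - g(xi)
        zeta[s1] += eps_m * residual                           # delta rule for the output vector
        A[s1] += eps_m * np.outer(residual, xi - W[s1])        # outer-product update of A_{s_1}

    def interpolate_llm(q, f, zeta, A):
        # LLM of a newly inserted unit r is interpolated from its neighbors q and f
        return 0.5 * (zeta[q] + zeta[f]), 0.5 * (A[q] + A[f])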

4 Simulation Examples

In the following some simulation examples are given in order to provide some insight into the performance of the method and the kind of solutions generated.

Let us first consider the XOR problem. XOR is not interesting per se, but since it is well-known we find it useful as an initial example. In figure 1 the final output of a GNG-LLM network for an XOR-like problem is shown together with the decision regions illustrating how the network generalizes over unseen patterns. The solution shown was obtained after the presentation of 300 single patterns. In contrast, a 2-2-1 (input-hidden-output) multi-layer perceptron (MLP) trained with back-propagation (plus momentum) needed over 10000 patterns to converge on the same data.

Figure 1 (a: output of the GNG-LLM network, b: decision regions): A solution of an XOR "problem" found by the described GNG-LLM network. The data stems from four square regions in the unit square. Diagonally opposing squares belong to one class. The generated network consists of only two units, each associated with a local linear mapping. The output of the network (a) can be thresholded to obtain sharp decision regions (b), which have been determined for a square region here. The parameters of this simulation were: ε_b = 0.02, ε_n = 0.0006, λ = 300, ε_m = 0.15, α = 0.5, d = 0.9995.

The development of the network for another classification problem is shown in figure 2. The total number of presented patterns for the GNG-LLM network was 5400 in this case (CPU-time²: 17 sec.). A 2-7-1 MLP needed 75,000 presented patterns (CPU-time: 118 sec.).

Figure 2 (a: 2 units, b: 7 units, c: 18 units): The development of a solution for a two-class classification problem. The training data stems from the two approximately u-shaped regions. Each region is one class. The parameters of this simulation are identical to those in the previous example.

As a larger classification example a high-dimensional problem shall be examined. In this case it is the vowel data from the CMU benchmark collection, which has been investigated with several network models (among them MLPs) by Robinson in his thesis [8]. The data consists of 990 10-dimensional vectors derived from vowels spoken by male and female speakers. 528 vectors from four male and four female speakers are used to train the networks. The remaining 462 frames from four male and three female speakers are for testing. Since training and test data originate from disjoint speaker sets, the task is probably a difficult one. We observed 100 GNG-LLM networks growing until size 70 (see figure 3). The performance on the test set was checked at sizes 5, 10, ..., 70. The mean misclassification rate was 48 % (compared to the 44-67 % reported by Robinson for the models he investigated). About 9 % of the GNG-LLM networks of size 20 and up had a performance superior to 44 % error, the best result Robinson achieved (he got it with the nearest neighbor classifier). An important practical aspect is that the GNG-LLM networks needed only about 60 training epochs³ to reach their maximum size. Robinson, in contrast, reported that he used 3000 epochs for the models he investigated.

GNG-LLM networks can also be used for function approximation. A simple example (on which we cannot elaborate here due to lack of space) is shown in figure 4. Function approximation with GNG-LLM networks is a field we intend to investigate more closely in the future.

² CPU time measurements are always problematic but we assume they may be useful for some readers. The computations have all been performed on (one processor of) an SGI Challenge L computer. Times on a Sparc 20 are about four times as large.

³ This is equivalent to 528 · 60 = 31680 single patterns, or 11 min. of SGI Challenge L CPU time. A 10-88-11 MLP (one of the sizes Robinson had investigated) needed over 4 hours to converge (and had a test error of 60 %).

Figure 3: Performance of GNG-LLM networks on the vowel test data during growth (misclassification rate on the vowel test set, in %, plotted against the number of units; mean with standard deviation, minimum and maximum error). 100 networks have been evaluated and were allowed to grow until size 70. The graph does not show any signs of over-fitting, although the final mean performance of about 48 % error is already reached at size 20. The exact network size does not seem to influence performance critically.

Figure 4: A GNG-LLM network learns to approximate a two-dimensional bell curve. Shown is the training data set (a) and the output of the networks with 3, 15, and 65 units (b, c, d). The last plot (d) has the training data overlaid to ease comparison.

References

[1] B. Fritzke. Fast learning with incremental RBF networks. Neural Processing Letters, 1(1):2-5, 1994.

[2] B. Fritzke. Growing cell structures - a self-organizing network for unsupervised and supervised learning. Neural Networks, 7(9):1441-1460, 1994.

[3] B. Fritzke. A growing neural gas network learns topologies. In G. Tesauro, D. S. Touretzky, and T. K. Leen, editors, Advances in Neural Information Processing Systems 7 (to appear). MIT Press, Cambridge, MA, 1995.

[4] E. Littmann and H. Ritter. Cascade LLM networks. In I. Aleksander and J. Taylor, editors, Artificial Neural Networks 2, pages 253-257. Elsevier Science Publishers B.V., North Holland, 1992.

[5] T. M. Martinetz. Competitive Hebbian learning rule forms perfectly topology preserving maps. In ICANN'93: International Conference on Artificial Neural Networks, pages 427-434, Amsterdam, 1993. Springer.

[6] T. M. Martinetz, S. G. Berkovich, and K. J. Schulten. Neural-gas network for vector quantization and its application to time-series prediction. IEEE Transactions on Neural Networks, 4(4):558-569, 1993.

[7] H. J. Ritter, T. M. Martinetz, and K. J. Schulten. Topology-conserving maps for learning visuo-motor-coordination. Neural Networks, 2:159-168, 1989.

[8] A. J. Robinson. Dynamic Error Propagation Networks. Ph.D. thesis, Cambridge University, Cambridge, 1989.