In: F. Fogelman and P. Gallinari, editors, ICANN'95: International Conference on Artificial Neural Networks, pages 217-222, Paris, France, 1995. EC2 & Cie.

Incremental Learning of Local Linear Mappings

Bernd Fritzke
Institut für Neuroinformatik, Ruhr-Universität Bochum, Germany
http://www.neuroinformatik.ruhr-uni-bochum.de

Abstract

A new incremental network model for supervised learning is proposed. The model builds up a structure of units, each of which has an associated local linear mapping (LLM). Error information obtained during training is used to determine where to insert new units, whose LLMs are interpolated from their neighbors. Simulation results for several classification tasks indicate fast convergence as well as good generalization. The ability of the model to also perform function approximation is demonstrated by an example.

1 Introduction

Local (or piece-wise) linear mappings (LLMs) are an economic means of describing a "well-behaved" function f : R^n → R^m. The principle is to approximate the function (which may be given by a number of input/output samples (ξ, ζ) ∈ R^n × R^m) with a set of linear mappings, each of which is constrained to a local region of the input space R^n. LLM-based methods have been used earlier to learn the inverse kinematics of robot arms [7], for classification [4], and for time series prediction [6].

A general problem which has to be solved when using LLMs is to partition the input space into a number of parcels such that within each parcel the function f can be described sufficiently well by a linear mapping. The parcels may be rather large in areas of R^n where f indeed behaves approximately linearly and must be smaller where this is not the case. The total number of parcels needed depends on the desired approximation accuracy and may be limited by the amount of available sample data, since over-fitting might occur.
A widely used method to achieve a partitioning of the input space into parcels is to choose a number of centers in R^n and use the corresponding Voronoi tessellation (which associates each point with the center at minimum Euclidean distance). Existing LLM-based approaches generally assume a fixed number of centers, which are distributed in input space by some vector quantization method. Thereafter, or even during the vector quantization, the linear mapping f_c : R^n → R^m associated with each center c is learned by evaluating data pairs. A problem with this approach, however, is that the vector quantization method is driven only by the n-dimensional input part of the data pairs (ξ, ζ) and, therefore, does not take into account at all the linearity or non-linearity of f. Rather, the centers are distributed according to the density of the input data, which may result in a partition that is sub-optimal for the given task. It may happen, e.g., that a region of R^n where f is perfectly linear is partitioned into many parcels because a large part of the available input data happens to lie in this region.

In this paper we propose a method for incrementally generating a partition of the input space. Our approach uses locally accumulated approximation error to determine where to insert new centers (and associated LLMs). The principle of insertion based on accumulated error has been used earlier for the incremental construction of radial basis function networks [2, 1]. Here we adapt the same idea for LLM-based networks.
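The Voronoi partitioning mentioned above can be stated in a few lines of code. The following sketch (illustrative names, not from the paper) assigns every input vector to its nearest center:

```python
import numpy as np

# Illustrative sketch: partition inputs among centers by the Voronoi
# rule, i.e., each point belongs to the center with minimum Euclidean
# distance.

def voronoi_assign(X, centers):
    """Return, for each row of X, the index of its nearest center.

    X       : (p, n) array of input vectors
    centers : (k, n) array of center positions
    """
    # squared distances between every input and every center, via broadcasting
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)
```

Fitting one linear map per parcel then amounts to a least-squares problem restricted to the inputs assigned to that parcel.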
The rest of the paper is organized as follows: we first briefly describe the "growing neural gas" method which we have proposed earlier [3]. Then the combination with LLMs is outlined, and finally some simulation results are given.

2 Growing Neural Gas

"Growing neural gas" (GNG) is an unsupervised network model which learns topologies [3]: it incrementally constructs a graph representation of a given data set which is n-dimensional but may stem from a lower-dimensional sub-manifold of the input space R^n. In the following we assume that the data obeys some (unknown) probability distribution P(ξ). In particular, the data set need not be finite but may also be generated continuously by some stationary process.

The GNG method distributes a set of centers (or units) in R^n. This is partially done by adaptation steps but mostly by interpolation of new centers from existing ones. Between two centers there may be an edge indicating neighborhood in R^n. These edges, which are used for interpolation (see below), are inserted with the "competitive Hebbian learning" rule [5] during the mentioned adaptation steps. The "competitive Hebbian learning" rule can simply be stated as: "Insert an edge between the nearest and second-nearest center with respect to the current input signal."

The GNG algorithm is the following (for a more detailed discussion see [3]):

0. Start with two units a and b at random positions w_a and w_b in R^n.
1. Generate an input signal ξ according to P(ξ).
2. Find the nearest unit s1 and the second-nearest unit s2.
3. Increment the age of all edges emanating from s1.
4. Add the squared distance between the input signal and the nearest unit in input space to a local error variable:
   Δerror(s1) = ‖w_s1 − ξ‖²
5. Move s1 and its direct topological neighbors¹ towards ξ by fractions ε_b and ε_n, respectively, of the total distance:
   Δw_s1 = ε_b (ξ − w_s1)
   Δw_n  = ε_n (ξ − w_n)   for all direct neighbors n of s1
6. If s1 and s2 are connected by an edge, set the age of this edge to zero.
   If such an edge does not exist, create it.
7. Remove edges with an age larger than a_max. If this results in units having no emanating edges, remove them as well.
8. If the number of input signals generated so far is an integer multiple of a parameter λ, insert a new unit as follows: Determine the unit q with the maximum accumulated error. Insert a new unit r halfway between q and its neighbor f with the largest error variable:
   w_r = 0.5 (w_q + w_f)
   Insert edges connecting the new unit r with units q and f, and remove the original edge between q and f. Decrease the error variables of q and f by multiplying them with a constant α. Initialize the error variable of r with the new value of the error variable of q.

¹ Throughout this paper the term neighbors denotes units which are topological neighbors in the graph (as opposed to units within a small Euclidean distance of each other in input space).
9. Decrease all error variables by multiplying them with a decay constant d.
10. If a stopping criterion (e.g., net size or some performance measure) is not yet fulfilled, continue with step 1.

How does this method work? The adaptation steps towards the input signals (step 5) lead to a general movement of all units towards those areas of the input space where signals come from (P(ξ) > 0). The insertion of edges (step 6) between the nearest and the second-nearest unit with respect to an input signal generates a single connection of the "induced Delaunay triangulation", a subgraph of the Delaunay triangulation restricted to areas of the input space with P(ξ) > 0. The removal of edges (step 7) is necessary to get rid of those edges which are no longer part of the "induced Delaunay triangulation" because their end points have moved and other units now lie between them. This is achieved by local edge aging (step 3) around the nearest unit, combined with age re-setting (step 6) of those edges which already exist between the nearest and second-nearest units. With insertion and removal of edges the model tries to construct and then track the "induced Delaunay triangulation", which is a slowly moving target due to the adaptation of the reference vectors. The accumulation of squared distances (step 4) during the adaptation helps to identify units lying in areas of the input space where the mapping from signals to units causes much error. To reduce this error, new units are inserted in such regions.

3 GNG and LLM

The GNG model just described is unsupervised, and it inserts new units in order to reduce the mean distortion error. For this purpose the distortion error is locally accumulated, and new units are inserted near the unit with maximum accumulated error. How can this principle be used for supervised learning? We first have to define what the network's output is (which was not necessary for unsupervised learning).
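Before extending the model, it may help to see the unsupervised GNG loop of Section 2 in code form. The following is a minimal sketch, not the reference implementation: the graph is stored as a dictionary mapping edge tuples to ages, parameter defaults follow the values reported later for the simulations, and the removal of isolated units (part of step 7) is omitted for brevity.

```python
import numpy as np

# Minimal sketch of the GNG loop (steps 0-10); data structures and helper
# names are illustrative, not from the paper.

def gng(sample, n_steps, dim, eps_b=0.02, eps_n=0.0006, lam=300,
        a_max=50, alpha=0.5, d=0.9995, rng=None):
    rng = rng or np.random.default_rng(0)
    w = [rng.random(dim), rng.random(dim)]   # step 0: two units a and b
    error = [0.0, 0.0]
    age = {}                                 # edge (i, j) with i < j -> age

    def edge(i, j):
        return (min(i, j), max(i, j))

    for step in range(1, n_steps + 1):
        xi = sample(rng)                     # step 1: input signal xi ~ P
        dists = [np.linalg.norm(xi - wc) for wc in w]
        s1, s2 = np.argsort(dists)[:2]       # step 2: two nearest units
        for e in age:                        # step 3: age edges at s1
            if s1 in e:
                age[e] += 1
        error[s1] += dists[s1] ** 2          # step 4: accumulate sq. distance
        w[s1] += eps_b * (xi - w[s1])        # step 5: adapt s1 ...
        for e in age:
            if s1 in e:
                n = e[0] if e[1] == s1 else e[1]
                w[n] += eps_n * (xi - w[n])  # ... and its direct neighbors
        age[edge(s1, s2)] = 0                # step 6: refresh or create edge
        for e in [e for e in age if age[e] > a_max]:
            del age[e]                       # step 7 (unit removal omitted)
        if step % lam == 0:                  # step 8: insert a new unit
            q = int(np.argmax(error))
            nbrs = [e[0] if e[1] == q else e[1] for e in age if q in e]
            if nbrs:
                f = max(nbrs, key=lambda n: error[n])
                w.append(0.5 * (w[q] + w[f]))
                r = len(w) - 1
                del age[edge(q, f)]
                age[edge(q, r)] = age[edge(f, r)] = 0
                error[q] *= alpha
                error[f] *= alpha
                error.append(error[q])
            # step 9: decay all error variables
        error = [e * d for e in error]
    return np.array(w), age                  # step 10: stop after n_steps
```

Here the stopping criterion is simply a fixed number of steps; any of the criteria discussed in the text could be substituted.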
The difference between actual and desired output can then be used to guide the insertion of new units. Our original problem was to approximate a function f : R^n → R^m which is given by a number of data pairs (ξ, ζ) ∈ R^n × R^m. One should note that this problem includes classification tasks as a special case: the different classes can be encoded by a small number of m-dimensional vectors, which are often chosen to be binary (1-out-of-m).

With every unit c of the GNG network (c is positioned at w_c in input space) we now associate an m-dimensional output vector ζ_c and an m × n matrix A_c. The vector ζ_c is the output of the network for the case ξ = w_c, i.e., for input vectors coinciding with one of the centers. For a general input vector ξ the nearest center s1 is determined, and the output g(ξ) of the network is computed from the LLM realized by the stored vector ζ_s1 and the matrix A_s1 as follows:

   g(ξ) = ζ_s1 + A_s1 (ξ − w_s1)

We now have to change the original GNG algorithm to incorporate the LLMs. Since we are interested in reducing the expectation of the mean square error E(|ζ − g(ξ)|²) for data pairs (ξ, ζ), we change step 4 of the GNG algorithm to

   Δerror(s1) = |ζ − g(ξ)|²

This means that we now locally accumulate the error with respect to the function to be approximated. New units are inserted where the approximation is poor.
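The network output g(ξ) and the modified step 4 can be sketched as follows (illustrative names; the arrays hold one row per unit):

```python
import numpy as np

# Sketch of the supervised GNG-LLM output and error accumulation.
# w: (k, n) unit positions, zeta: (k, m) output vectors,
# A: (k, m, n) local linear maps, err: (k,) accumulated error variables.
# All names are illustrative, not from the paper.

def gng_llm_output(xi, w, zeta, A):
    """g(xi) = zeta_s1 + A_s1 (xi - w_s1) for the nearest unit s1."""
    s1 = int(np.argmin(np.linalg.norm(w - xi, axis=1)))
    return s1, zeta[s1] + A[s1] @ (xi - w[s1])

def accumulate_error(xi, target, w, zeta, A, err):
    """Modified step 4: add |target - g(xi)|^2 to the nearest unit's error."""
    s1, g = gng_llm_output(xi, w, zeta, A)
    err[s1] += float(np.sum((target - g) ** 2))
    return s1
```

Insertion in step 8 then picks the unit with the largest accumulated err, exactly as in the unsupervised case, but the error now measures approximation quality rather than distortion.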
The LLMs associated with the units of our network are initially set at random. At each adaptation step the data pair (ξ, ζ) is used two-fold: ξ is used (as before) for center adaptation, and the whole pair (ξ, ζ) is used to improve the LLM of the nearest center s1. This is done with a simple delta rule:

   Δζ_s1 = ε_m (ζ − g(ξ))
   ΔA_s1 = ε_m (ζ − g(ξ)) ⊗ (ξ − w_s1)

Here ε_m is an adaptation parameter and ⊗ denotes the outer product of two vectors. When a new unit r is inserted (step 8 of the GNG algorithm), its LLM is interpolated from its neighbors q and f:

   ζ_r = 0.5 (ζ_q + ζ_f)
   A_r = 0.5 (A_q + A_f)

A stopping criterion has to be defined to finish the growth process. It can be chosen freely depending on the application. A possible choice is to observe network performance on a validation set during training and stop when this performance begins to decrease. Alternatively, the error on the training set may be used, or simply the number of units in the network if for some reason a specific network size is desired.

4 Simulation Examples

In the following, some simulation examples are given in order to provide some insight into the performance of the method and the kind of solutions generated. Let us first consider the XOR problem. XOR is not interesting per se, but since it is well-known, we find it useful as an initial example. In figure 1 the final output of a GNG-LLM network for an XOR-like problem is shown together with the decision regions, illustrating how the network generalizes over unseen patterns. The solution shown was obtained after the presentation of 300 single patterns. In contrast, a 2-2-1 (input-hidden-output) multi-layer perceptron (MLP) trained with back-propagation (plus momentum) needed over 10000 patterns to converge on the same data.

The development of the network for another classification problem is shown in figure 2. The total number of presented patterns for the GNG-LLM network was 5400 in this case (CPU time²: 17 sec.).
A 2-7-1 MLP needed 75,000 presented patterns (CPU time: 118 sec.).

As a larger classification example, a high-dimensional problem shall be examined: the vowel data from the CMU benchmark collection, which has been investigated with several network models (among them MLPs) by Robinson in his thesis [8]. The data consists of 990 10-dimensional vectors derived from vowels spoken by male and female speakers. 528 vectors from four male and four female speakers are used to train the networks. The remaining 462 frames from four male and three female speakers are for testing. Since training and test data originate from disjoint speaker sets, the task is probably a difficult one.

We observed 100 GNG-LLM networks growing until size 70 (see figure 3). The performance on the test set was checked at sizes 5, 10, ..., 70. The mean misclassification rate was 48 % (compared to 44-67 % reported by Robinson for the models he investigated). A 10-88-11 MLP (one of the sizes Robinson had investigated) needed over 4 hours to converge (and had a test error of 60 %). About 9 % of the GNG-LLM networks of size 20 and up had a performance superior to 44 % error, the best result Robinson achieved (he got it with the nearest neighbor classifier). An important practical aspect is that the GNG-LLM networks needed only about 60 training epochs³ to reach their maximum size. Robinson, in contrast, reported that he used 3000 epochs for the models he investigated.

GNG-LLM networks can also be used for function approximation. A simple example (on which we cannot elaborate here due to lack of space) is shown in figure 4. Function approximation with GNG-LLM networks is a field we intend to investigate more closely in the future.

Figure 1: A solution of an XOR "problem" found by the described GNG-LLM network: (a) output of the GNG-LLM network, (b) decision regions. The data stems from four square regions in the unit square; diagonally opposing squares belong to one class. The generated network consists of only two units, each associated with a local linear mapping. The output of the network (a) can be thresholded to obtain sharp decision regions (b), which have been determined for a square region here. The parameters of this simulation were: ε_b = 0.02, ε_n = 0.0006, λ = 300, ε_m = 0.15, α = 0.5, d = 0.9995.

Figure 2: The development of a solution for a two-class classification problem: (a) 2 units, (b) 7 units, (c) 18 units. The training data stems from the two approximately u-shaped regions; each region is one class. The parameters of this simulation are identical to those in the previous example.

Figure 3: Performance of GNG-LLM networks on the vowel test data during growth (x-axis: number of units, 0 to 70; y-axis: % misclassifications on the vowel test set; curves: mean error with standard deviation, maximum error, and minimum error). 100 networks have been evaluated and were allowed to grow until size 70. The graph does not show any signs of over-fitting, although the final mean performance of about 48 % error is already reached at size 20. The exact network size does not seem to influence performance critically.

Figure 4: A GNG-LLM network learns to approximate a two-dimensional bell curve. Shown are the training data set (a) and the output of the networks with 3, 15, and 65 units (b, c, d). The last plot (d) has the training data overlaid to ease comparison.

² CPU time measurements are always problematic, but we assume they may be useful for some readers. The computations have all been performed on (one processor of) an SGI Challenge L computer. Times on a Sparc 20 are about four times as large.

³ This is equivalent to 60 · 528 = 31680 single patterns, or 11 min. of SGI Challenge L CPU time.

References

[1] B. Fritzke. Fast learning with incremental RBF networks. Neural Processing Letters, 1(1):2-5, 1994.

[2] B. Fritzke. Growing cell structures - a self-organizing network for unsupervised and supervised learning. Neural Networks, 7(9):1441-1460, 1994.

[3] B. Fritzke. A growing neural gas network learns topologies. In G. Tesauro, D. S. Touretzky, and T. K. Leen, editors, Advances in Neural Information Processing Systems 7 (to appear). MIT Press, Cambridge, MA, 1995.

[4] E. Littmann and H. Ritter. Cascade LLM networks. In I. Aleksander and J. Taylor, editors, Artificial Neural Networks 2, pages 253-257. Elsevier Science Publishers B.V., North-Holland, 1992.

[5] T. M. Martinetz. Competitive Hebbian learning rule forms perfectly topology preserving maps. In ICANN'93: International Conference on Artificial Neural Networks, pages 427-434, Amsterdam, 1993. Springer.

[6] T. M. Martinetz, S. G. Berkovich, and K. J. Schulten. "Neural-gas" network for vector quantization and its application to time-series prediction. IEEE Transactions on Neural Networks, 4(4):558-569, 1993.

[7] H. J. Ritter, T. M. Martinetz, and K. J. Schulten. Topology-conserving maps for learning visuo-motor coordination. Neural Networks, 2:159-168, 1989.

[8] A. J. Robinson. Dynamic Error Propagation Networks. Ph.D. thesis, Cambridge University, Cambridge, 1989.