10 Self-Organizing Systems II: Competitive Learning


10.1 Introduction

In this chapter we continue our study of self-organizing systems by considering a special class of artificial neural networks known as self-organizing feature maps. These networks are based on competitive learning; the output neurons of the network compete among themselves to be activated or fired, with the result that only one output neuron, or one neuron per group, is on at any one time. The output neurons that win the competition are called winner-takes-all neurons. One way of inducing a winner-takes-all competition among the output neurons is to use lateral inhibitory (negative feedback) paths between them; such an idea was originally proposed by Rosenblatt (1958).

In a self-organizing feature map, the neurons are placed at the nodes of a lattice that is usually one- or two-dimensional; higher-dimensional maps are also possible but not as common. The neurons become selectively tuned to various input patterns (vectors) or classes of input patterns in the course of a competitive learning process. The locations of the neurons so tuned (i.e., the winning neurons) tend to become ordered with respect to each other in such a way that a meaningful coordinate system for different input features is created over the lattice (Kohonen, 1990a). A self-organizing feature map is therefore characterized by the formation of a topographic map of the input patterns, in which the spatial locations (i.e., coordinates) of the neurons in the lattice correspond to intrinsic features of the input patterns, hence the name self-organizing feature map.

The development of this special class of artificial neural networks is motivated by a distinct feature of the human brain; simply put, the brain is organized in many places in such a way that different sensory inputs are represented by topologically ordered computational maps. In particular, sensory inputs such as tactile (Kaas et al., 1983), visual (Hubel and Wiesel, 1962, 1977), and acoustic (Suga, 1985) are mapped onto different areas of the cerebral cortex in a topologically ordered manner. Thus the computational map constitutes a basic building block in the information-processing infrastructure of the nervous system. A computational map is defined by an array of neurons representing slightly differently tuned processors or filters, which operate on the sensory information-bearing signals in parallel. Consequently, the neurons transform input signals into a place-coded probability distribution that represents the computed values of parameters by sites of maximum relative activity within the map (Knudsen et al., 1987). The information so derived is of such a form that it can be readily accessed by higher-order processors using relatively simple connection schemes.

Organization of the Chapter

The material presented in this chapter on computational maps is organized as follows: In Section 10.2 we expand on the idea of computational maps in the brain.

Then, in Section 10.3, we describe two feature-mapping models, one originally developed by Willshaw and von der Malsburg (1976) and the other by Kohonen (1982a), which are able to explain or capture the essential features of computational maps in the brain. The two models differ from each other in the form of the inputs used. The rest of the chapter is devoted to detailed considerations of Kohonen's model, which has attracted a great deal of attention in the literature. In Section 10.4 we describe the formation of activity bubbles, which refers to the modification of the primary excitations by the use of lateral feedback. This then paves the way for the mathematical formulation of Kohonen's model in Section 10.5. In Section 10.6 we describe some important properties of the model, followed by additional notes of a practical nature in Section 10.7 on the operation of the model. In Section 10.8 we describe a hybrid combination of the Kohonen model and a supervised linear filter for adaptive pattern classification. Learning vector quantization, an alternative method of improving the pattern-classification performance of the Kohonen model, is described in Section 10.9. The chapter concludes with Section 10.10 on applications of the Kohonen model, and some final thoughts on the subject in Section 10.11.

10.2 Computational Maps in the Cerebral Cortex

Anyone who examines a human brain cannot help but be impressed by the extent to which the brain is dominated by the cerebral cortex. The brain is almost completely enveloped by the cortex, tending to obscure the other parts. Although it is only about 2 mm thick, its surface area, when spread out, is about 2400 cm2 (i.e., about six times the size of this page). What is even more impressive is the fact that there are billions of neurons and hundreds of billions of synapses in the cortex. For sheer complexity, the cerebral cortex probably exceeds any other known structure (Hubel and Wiesel, 1977).

Figure 10.1 presents a cytoarchitectural map of the cerebral cortex as worked out by Brodmann (Shepherd, 1988; Brodal, 1981). The different areas of the cortex are identified by the thickness of their layers and the types of neurons within them. Some of the most important specific areas are as follows:

Motor cortex: motor strip, area 4; premotor area, area 6; frontal eye fields, area 8.
Somatosensory cortex: areas 3, 1, and 2.
Visual cortex: areas 17, 18, and 19.
Auditory cortex: areas 41 and 42.

Figure 10.1 shows clearly that different sensory inputs (motor, somatosensory, visual, auditory, etc.) are mapped onto corresponding areas of the cerebral cortex in an orderly fashion. These cortical maps are not entirely genetically predetermined; rather, they are sketched in during the early development of the nervous system. However, it is uncertain how cortical maps are sketched in this manner. Four major hypotheses have been advanced by neurobiologists (Udin and Fawcett, 1988):

1. The target (postsynaptic) structure possesses addresses (i.e., chemical signals) that are actively searched for by the ingrowing connections (axons).
2. The structure, starting from zero (i.e., an informationless target structure), self-organizes using learning rules and system interactions.

FIGURE 10.1 Cytoarchitectural map of the cerebral cortex. The different areas are identified by the thickness of their layers and types of cells within them. Some of the most important specific areas are as follows. Motor cortex: motor strip, area 4; premotor area, area 6; frontal eye fields, area 8. Somatosensory cortex: areas 3, 1, 2. Visual cortex: areas 17, 18, 19. Auditory cortex: areas 41 and 42. (From G.M. Shepherd, 1988; A. Brodal, 1981; with permission of Oxford University Press.)

3. Axons, as they grow, physically maintain neighborhood relationships, and therefore arrive at the target structure already topographically arranged.
4. Axons grow out in a topographically arranged time sequence, and connect to a target structure that is generated in a matching temporal fashion.

All these hypotheses have experimental support of their own, and appear to be correct to some extent. In fact, different structures may use one mechanism or another, or it could be that multiple mechanisms are involved. Once the cortical maps have been formed, they remain plastic to a varying extent, and therefore adapt to subsequent changes in the environment or the sensors themselves. The degree of plasticity, however, depends on the type of system in question. For example, a retinotopic map (i.e., the map from the retina to the visual cortex) remains plastic for only a relatively short period of time after its formation, whereas the somatosensory map remains plastic longer (Kaas et al., 1983).

An example of a cortical mapping is shown in Figure 10.2. This figure is a schematic representation of computational maps in the primary visual cortex of cats and monkeys. The basis of this representation was discovered originally by Hubel and Wiesel (1962). In Fig. 10.2 we recognize two kinds of repeating computational maps:

1. Maps of preferred line orientation, representing the angle of tilt of a line stimulus
2. Maps of ocular dominance, representing the relative strengths of excitatory influence of each eye

The major point of interest here is the fact that line orientation and ocular dominance are mapped across the cortical surface along independent axes.

FIGURE 10.2 Computational maps in the primary visual cortex of monkeys. (a) A schematic diagram of hypercolumns for ocular dominance and line orientation in the visual cortex. (b) Representative curves of neurons in the visual cortex for line orientation. (From E. I. Knudsen et al., 1987, with permission from the Annual Review of Neuroscience, 10, by Annual Reviews, Inc.)

Although in Fig. 10.2 (for convenience of presentation) we have shown these two maps to be orthogonal to each other, there is no direct evidence to suggest that they are related in this fashion.

The use of computational maps offers the following advantages (Knudsen et al., 1987):

- Efficient Information Processing. The nervous system is required to analyze complex events arising in a dynamic environment on a continuous basis. This, in turn, requires the use of processing strategies that permit the rapid handling of large amounts of information. Computational maps, performed by parallel processing arrays, are ideally suited for this task. In particular, computational maps provide a method for the rapid sorting and processing of complex stimuli, and for representing the results obtained in a simple and systematic form.
- Simplicity of Access to Processed Information. The use of computational maps simplifies the schemes of connectivity required to utilize the information by higher-order processors.
- Common Form of Representation. A common, mapped representation of the results of different kinds of computations permits the nervous system to employ a single strategy for making sense of information.
- Facilitation of Additional Interactions. By representing a feature of interest in topographic form, maps enable us to sharpen the tuning of the processor in ways that would not be possible otherwise. For example, regional interactions such as excitatory facilitation and lateral inhibition can work only on sensory information that is mapped.

10.3 Two Basic Feature-Mapping Models

What can we deduce from the above discussion of computational maps in the brain that would guide the self-building of topographic maps? The answer essentially lies in the principle of topographic map formation, which may be stated as follows (Kohonen, 1982a): The spatial location of an output neuron in the topographic map corresponds to a particular domain or feature of the input data.

The output neurons are usually arranged in a one- or two-dimensional lattice, a topology that ensures that each neuron has a set of neighbors. The manner in which the input patterns are specified determines the nature of the feature-mapping model. In particular, we may distinguish two basic models, as illustrated in Fig. 10.3 for a two-dimensional lattice of output neurons that are fully connected to the inputs. Both models were inspired by the pioneering self-organizing studies of von der Malsburg (1973), who noted that a model of the visual cortex could not be entirely genetically predetermined; rather, a self-organizing process involving synaptic learning may be responsible for the local ordering of feature-sensitive cortical cells. However, global topographic ordering was not achieved, because the model used a fixed (small) neighborhood. The computer simulation by von der Malsburg was perhaps the first to demonstrate self-organization.

The model of Fig. 10.3a was originally proposed by Willshaw and von der Malsburg (1976) on biological grounds to explain the problem of retinotopic mapping from the retina to the visual cortex (in higher vertebrates). Specifically, there are two separate two-dimensional lattices of neurons connected together, with one projecting onto the other.

FIGURE 10.3 (a) Willshaw-von der Malsburg's model. (b) Kohonen's model.

One of the two lattices represents presynaptic (input) neurons, and the other lattice represents postsynaptic (output) neurons. The postsynaptic lattice uses a short-range excitatory mechanism as well as a long-range inhibitory mechanism. These two mechanisms are local in nature and critically important for self-organization. The two lattices are interconnected by modifiable synapses of a Hebbian type. Strictly speaking, therefore, the postsynaptic neurons are not winner-takes-all; rather, a threshold is used to ensure that only a few postsynaptic neurons will fire at any one time. Moreover, to prevent a steady buildup in the synaptic weights that may lead to network instability, the total weight associated with each postsynaptic neuron is limited by an upper boundary condition. (Amari, 1980, relaxes this restriction on the synaptic weights of the postsynaptic neurons somewhat; the mathematical analysis presented by Amari elucidates the dynamical stability of a cortical map formed by self-organization.) Thus, for each neuron, some synaptic weights increase while others are made to decrease. The basic idea of the Willshaw-von der Malsburg model is for the geometric proximity of presynaptic neurons to be coded in the form of correlations in their electrical activity, and to use these correlations in the postsynaptic lattice so as to connect neighboring presynaptic neurons to neighboring postsynaptic neurons. A topologically ordered mapping is thereby produced by self-organization. Note, however, that the Willshaw-von der Malsburg model is specialized to mappings where the input dimension is the same as the output dimension.

The second model, of Fig. 10.3b, introduced by Kohonen (1982a), is not meant to explain neurobiological details. Rather, the model tries to capture the essential features of computational maps in the brain and yet remain computationally tractable. The model's neurobiological feasibility is discussed in Kohonen (1993). It appears that the Kohonen model is more general than the Willshaw-von der Malsburg model in the sense that it is capable of performing data compression (i.e., dimensionality reduction on the input). In reality, the Kohonen model belongs to the class of vector coding algorithms. We say so because the model provides a topological mapping that optimally places a fixed number of vectors (i.e., codewords) into a higher-dimensional input space, and thereby facilitates data compression. The Kohonen model may therefore be derived in two ways. We may use basic ideas of self-organization, motivated by neurobiological considerations, to derive the model, which is the traditional approach (Kohonen, 1982a, 1988b, 1990a). Alternatively, we may use a vector quantization approach that uses a model involving an encoder and a decoder, which is motivated by communication-theoretic considerations (Luttrell, 1989b, 1991). In this chapter, we consider both approaches.

The Kohonen model has received much more attention in the literature than the Willshaw-von der Malsburg model. The model possesses certain properties, discussed later in the chapter, which make it particularly interesting for understanding and modeling cortical maps in the brain. The remainder of the chapter is devoted to the Kohonen model, the derivation of the self-organizing feature map (SOFM) usually associated with the Kohonen model, its basic properties, and applications.

10.4 Modification of Stimulus by Lateral Feedback

In order to pave the way for the development of self-organizing feature maps, we first discuss the use of lateral feedback as a mechanism for modifying the form of excitation applied to a neural network.
By lateral feedback we mean a special form of feedback that is dependent on lateral distance from the point of its application. For the purpose of this discussion, it is adequate to consider the one-dimensional lattice of neurons shown in Fig. 10.4, which contains two different types of connections. There are forward connections from the primary source of excitation, and those that are internal to the network by virtue of self-feedback and lateral feedback.

FIGURE 10.4 One-dimensional lattice of neurons with feedforward connections and lateral feedback connections; the latter connections are shown only for the neuron at the center of the array.

In Fig. 10.4, the input signals are applied in parallel to the neurons. These two types of local connections serve two different purposes. The weighted sum of the input signals at each neuron is designed to perform feature detection. Hence each neuron produces a selective response to a particular set of input signals. The feedback connections, on the other hand, produce excitatory or inhibitory effects, depending on the distance from the neuron. Following biological motivation, the lateral feedback is usually described by a Mexican hat function, the form of which is depicted in Fig. 10.5. According to this figure, we may distinguish three distinct areas of lateral interaction between neurons:

1. A short-range lateral excitation area.
2. A penumbra of inhibitory action.
3. An area of weaker excitation that surrounds the inhibitory penumbra; this third area is usually ignored.

These areas are designated as 1, 2, and 3, respectively, in Fig. 10.5. The neural network described here exhibits two important characteristics. First, the network tends to concentrate its electrical activity into local clusters, referred to as activity bubbles (Kohonen, 1988b). Second, the locations of the activity bubbles are determined by the nature of the input signals.

Let x1, x2, ..., xp denote the input signals (excitations) applied to the network, where p is the number of input terminals. Let wj1, wj2, ..., wjp denote the corresponding synaptic weights of neuron j. Let c_{j,-K}, ..., c_{j,-1}, c_{j,0}, c_{j,1}, ..., c_{j,K} denote the lateral feedback weights connected to neuron j, where K is the radius of the lateral interaction. Let y1, y2, ..., yN denote the output signals of the network, where N is the number of neurons in the network.

In the visual cortex, the short-range lateral excitation extends up to a radius of 50 to 100 μm, the penumbra of inhibitory action reaches up to a radius of 200 to 500 μm, and the area of weaker excitation surrounding the penumbra reaches up to a radius of several centimeters (Kohonen, 1982a).
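To make the shape of the lateral interaction concrete, the short sketch below builds a Mexican-hat profile of lateral weights c_{j,k} as a difference of two Gaussians. The difference-of-Gaussians form and the particular widths and amplitudes are illustrative assumptions, not values given in the text; the resulting profile reproduces the central excitatory area (area 1) and the inhibitory penumbra (area 2), while the weak outer excitatory area (area 3, usually ignored as noted above) is not modeled.

```python
import numpy as np

def mexican_hat_weights(K, sigma_exc=2.0, sigma_inh=6.0, a_exc=1.0, a_inh=0.6):
    """Lateral feedback weights c_{j,k} for k = -K..K, sketched as a narrow
    excitatory Gaussian minus a broader inhibitory one (assumed parameters)."""
    k = np.arange(-K, K + 1, dtype=float)
    excitation = a_exc * np.exp(-k**2 / (2.0 * sigma_exc**2))
    inhibition = a_inh * np.exp(-k**2 / (2.0 * sigma_inh**2))
    return excitation - inhibition   # positive near k = 0, negative farther out

c = mexican_hat_weights(K=10)
print(np.round(c, 3))   # central values > 0 (area 1), flanking values < 0 (area 2)
```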

FIGURE 10.5 The Mexican hat function of lateral interconnections.

We may thus express the output signal (response) of neuron j as follows:

$$y_j = \varphi\!\left( I_j + \sum_{k=-K}^{K} c_{j,k}\, y_{j+k} \right), \qquad j = 1, 2, \ldots, N \tag{10.1}$$

where φ(·) is some nonlinear function that limits the value of yj and ensures that yj ≥ 0. The term I_j serves the function of a stimulus, representing the total external control exerted on neuron j by the weighted effect of the input signals; that is,

$$I_j = \sum_{i=1}^{p} w_{ji}\, x_i \tag{10.2}$$

Typically, the stimulus I_j is a smooth function of the spatial index j. The solution to the nonlinear equation (10.1) is found iteratively, using a relaxation technique. Specifically, we reformulate it as a difference equation as follows:

$$y_j(n+1) = \varphi\!\left( I_j + \beta \sum_{k=-K}^{K} c_{j,k}\, y_{j+k}(n) \right), \qquad j = 1, 2, \ldots, N \tag{10.3}$$

where n denotes discrete time. Thus, yj(n + 1) is the output of neuron j at time n + 1, and y_{j+k}(n) is the output of neuron j + k at the previous time n. The parameter β in the argument on the right-hand side of Eq. (10.3) controls the rate of convergence of the relaxation process.

The relaxation equation (10.3) represents a feedback system as illustrated in the signal-flow graph shown in Fig. 10.6, where z^{-1} is the unit-delay operator. The parameter β plays the role of a feedback factor of the system. The system includes both positive and negative feedback, corresponding to the excitatory and inhibitory parts of the Mexican hat function, respectively. The limiting action of the nonlinear activation function φ(·) causes the spatial response yj(n) to stabilize in a certain fashion, dependent on the value assigned to β. If β is large enough, then in the final state corresponding to n → ∞, the values of yj tend to concentrate inside a spatially bounded cluster, that is, an activity bubble. The bubble is centered at a point where the initial response yj(0) due to the stimulus I_j is maximum. The width of the activity bubble depends on the ratio of the excitatory to inhibitory lateral interconnections. In particular, we may state the following:

FIGURE 10.6 Signal-flow graph depicting the relaxation equation (10.3) as a feedback system.

- If the positive feedback is made stronger, the activity bubble becomes wider.
- If, on the other hand, the negative feedback is enhanced, the activity bubble becomes sharper.

Of course, if the net feedback acting around the system is too negative, formation of the activity bubble is prevented. The various aspects of the bubble formation are illustrated in the computer experiment presented next.

Computer Experiment on Bubble Formation

To simplify the computation involved in the formation of an activity bubble by lateral feedback, the Mexican hat function is approximated by the function shown in Fig. 10.7. Two specific points should be noted here:

- The area of weak excitation surrounding the inhibitory region is ignored.
- The areas of excitatory feedback and inhibitory feedback are normalized to unity.

To further simplify the simulations, the function φ(·) is taken to be the piecewise-linear function shown in Fig. 10.8. Specifically, we have

$$\varphi(x) = \begin{cases} a, & x \ge a \\ x, & 0 \le x < a \\ 0, & x < 0 \end{cases} \tag{10.4}$$

with a = 10. The one-dimensional lattice of Fig. 10.4 is assumed to consist of N = 51 neurons. The stimulus I_j acting on neuron j is assumed to be half a sinusoid, as shown by

$$I_j = 2 \sin\!\left(\frac{\pi j}{50}\right), \qquad 0 \le j \le 50 \tag{10.5}$$

The stimulus I_j so defined is zero at both ends of the lattice, and has a peak value of 2 at its mid-point. For convenience of presentation we have redefined the range occupied by j in Eq. (10.5) as the closed interval [0, N - 1]. Figure 10.9 shows 10 steps of the relaxation equation (10.3) for the conditions described herein, and for two different values of the feedback factor:

Feedback Factor β = 2. The simulation results for this feedback factor are presented in Fig. 10.9a. The spatial response yj(n) begins with a width of 50, corresponding to n = 0. Then, with increasing n, it becomes narrower and higher. The limiting action of the function φ(·) causes the response yj(n) to stabilize, such that in the final state (corresponding to n → ∞), a bubble is formed with all the neurons located inside it effectively having an activation of a = 10 and all the neurons located outside it having essentially zero activation. From the simulation results presented in Fig. 10.9a, it is apparent that the activity bubble so formed is centered around the highest value of the stimulus I_j, occurring at the mid-point of the lattice.

FIGURE 10.8 Piecewise-linear function.
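A minimal simulation of this experiment is sketched below. It follows Eqs. (10.3) through (10.5) with N = 51 neurons, the piecewise-linear limiter with a = 10, and the half-sinusoid stimulus. The rectangular approximation of the Mexican hat uses an assumed excitatory radius and inhibitory annulus (each normalized to unit total area, as stated above); the exact radii are not given in the text, so they are illustrative choices.

```python
import numpy as np

N, a, beta, steps = 51, 10.0, 2.0, 10
R_EXC, R_INH = 3, 10        # assumed radii; the text only fixes the areas, not the radii

def phi(x, a=a):
    """Piecewise-linear limiter of Eq. (10.4)."""
    return np.clip(x, 0.0, a)

# Rectangular approximation of the Mexican hat (Fig. 10.7): an excitatory core and an
# inhibitory annulus, each normalized to unit total area as stated in the text.
k = np.arange(-R_INH, R_INH + 1)
core = np.abs(k) <= R_EXC
c = np.zeros_like(k, dtype=float)
c[core] = 1.0 / core.sum()
c[~core] = -1.0 / (~core).sum()

# Half-sinusoid stimulus of Eq. (10.5).
j = np.arange(N)
I = 2.0 * np.sin(np.pi * j / (N - 1))

# Relaxation of Eq. (10.3): y_j(n+1) = phi(I_j + beta * sum_k c_{j,k} y_{j+k}(n)).
y = phi(I.copy())                        # initial response y_j(0) due to the stimulus alone
for n in range(steps):
    lateral = np.convolve(y, c[::-1], mode="same")   # sum_k c_k * y_{j+k}
    y = phi(I + beta * lateral)

print(np.round(y, 2))  # with beta = 2 the response should concentrate into a bubble near mid-lattice
```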

FIGURE 10.9 Bubble formation for (a) feedback factor β = 2 and (b) a feedback factor β smaller than the critical value. The iteration numbers are shown on the graphs.

Feedback Factor β Smaller Than a Critical Value. If the feedback factor β is smaller than a critical value, the net gain of the system will not be enough to permit the formation of an activity bubble. This behavior is illustrated in Fig. 10.9b.

Consequences of Bubble Formation

Given that an activity bubble is formed as illustrated in Fig. 10.9a, what can we learn from it? First, we may idealize the output of neuron j as follows:

$$y_j = \begin{cases} a, & \text{neuron } j \text{ is inside the bubble} \\ 0, & \text{neuron } j \text{ is outside the bubble} \end{cases} \tag{10.6}$$

where a is the limiting value of the nonlinear function φ(·) defining the input-output relation of neuron j. Second, we may exploit the bubble formation to take a computational shortcut, so as to emulate the effect accomplished by lateral feedback in the form of a Mexican hat function. Specifically, we may do away with lateral feedback connections by introducing a topological neighborhood of active neurons that corresponds to the activity bubble. Third, adjustment of the lateral connections is accomplished by permitting the size of the neighborhood of active neurons to vary. In particular, making this neighborhood wider corresponds to making the positive lateral feedback stronger. On the other hand, making the neighborhood of active neurons narrower corresponds to enhancing the negative lateral feedback. Similar remarks apply to a two-dimensional lattice, except that the neighborhood of active neurons now becomes two-dimensional. We shall make use of these points (in the context of one- and two-dimensional lattices) in the formulation of the self-organizing feature map, presented in the next section.

10.5 Self-Organizing Feature-Mapping Algorithm

The principal goal of the self-organizing feature-mapping (SOFM) algorithm developed by Kohonen (1982a) is to transform an incoming signal pattern of arbitrary dimension into a one- or two-dimensional discrete map, and to perform this transformation adaptively in a topologically ordered fashion. Many activation patterns are presented to the network, one at a time. Typically, each input presentation consists simply of a localized region, or spot, of activity against a quiet background. Each such presentation causes a corresponding localized group of neurons in the output layer of the network to be active.

The essential ingredients of the neural network embodied in such an algorithm are as follows:

- A one- or two-dimensional lattice of neurons that computes simple discriminant functions of inputs received from an input of arbitrary dimension
- A mechanism that compares these discriminant functions and selects the neuron with the largest discriminant function value
- An interactive network that activates the selected neuron and its neighbors simultaneously
- An adaptive process that enables the activated neurons to increase their discriminant function values in relation to the input signals

In this section we develop the SOFM algorithm; the properties of the algorithm are discussed in the next section. To proceed with the development of the algorithm, consider Fig. 10.10, which depicts a two-dimensional lattice of neurons. In this figure we have connected the same set of input (sensory) signals to all the neurons in the lattice. The input vector, representing the set of input signals, is denoted by

$$\mathbf{x} = [x_1, x_2, \ldots, x_p]^T \tag{10.7}$$

The synaptic weight vector of neuron j is denoted by

$$\mathbf{w}_j = [w_{j1}, w_{j2}, \ldots, w_{jp}]^T, \qquad j = 1, 2, \ldots, N \tag{10.8}$$

FIGURE 10.10 Two-dimensional lattice of neurons.

To find the best match of the input vector x with the synaptic weight vectors wj, we simply compare the inner products wjTx for j = 1, 2, ..., N and select the largest. This assumes that the same threshold is applied to all the neurons. Note also that wjTx is identical with the stimulus I_j in Eq. (10.2). Thus, by selecting the neuron with the largest inner product wjTx, we will have in effect determined the location where the activity bubble is to be formed.

In the formulation of an adaptive algorithm, we often find it convenient to normalize the weight vectors wj to constant Euclidean norm (length). In such a situation, the best-matching criterion described here is equivalent to the minimum Euclidean distance between vectors. Specifically, if we use the index i(x) to identify the neuron that best matches the input vector x, we may then determine i(x) by applying the condition

$$i(\mathbf{x}) = \arg\min_j \|\mathbf{x} - \mathbf{w}_j\|, \qquad j = 1, 2, \ldots, N \tag{10.9}$$

where ||·|| denotes the Euclidean norm of the argument vector. According to Eq. (10.9), i(x) is the subject of attention, for after all it is the value of i that we want. The particular neuron i that satisfies this condition is called the best-matching or winning neuron for the input vector x. By using Eq. (10.9), a continuous input space is mapped onto a discrete set of neurons. Depending on the application of interest, the response of the network could be either the index of the winning neuron (i.e., its position in the lattice), or the synaptic weight vector that is closest to the input vector in a Euclidean sense.

The topology of iterations in the SOFM algorithm defines which neurons in the two-dimensional lattice are in fact neighbors. Let Λ_{i(x)}(n) denote the topological neighborhood of winning neuron i(x). The neighborhood Λ_{i(x)}(n) is chosen to be a function of the discrete time n; hence we may also refer to Λ_{i(x)}(n) as a neighborhood function. Numerous simulations have shown that the best results in self-organization are obtained if the neighborhood function Λ_{i(x)}(n) is selected fairly wide in the beginning and then permitted to shrink with time n (Kohonen, 1990a). This behavior is equivalent to initially using a strong positive lateral feedback, and then enhancing the negative lateral feedback.
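In matrix form, the winner selection of Eq. (10.9) is a one-line computation. The sketch below (with hypothetical array names, not from the text) returns the index of the best-matching neuron for an input vector x, given an N-by-p weight matrix W whose rows are the vectors wj.

```python
import numpy as np

def best_matching_neuron(x, W):
    """Eq. (10.9): index of the neuron whose weight vector is closest to x
    in the Euclidean sense. W has shape (N, p); x has shape (p,)."""
    return int(np.argmin(np.linalg.norm(W - x, axis=1)))

# Example with arbitrary numbers:
W = np.array([[0.0, 0.0], [1.0, 1.0], [0.9, 0.2]])
print(best_matching_neuron(np.array([0.8, 0.1]), W))   # -> 2
```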

The important point to note here is that the use of a neighborhood function Λ_{i(x)}(n) around the winning neuron i(x) provides a clever computational shortcut for emulating the formation of a localized response by lateral feedback.

Another useful way of viewing the variation of the neighborhood function Λ_{i(x)}(n) around a winning neuron i(x) is as follows (Luttrell, 1989a, 1992). The purpose of a wide Λ_{i(x)}(n) is essentially to correlate the directions of the weight updates of a large number of neurons in the lattice. As the width of Λ_{i(x)}(n) is decreased, so is the number of neurons whose update directions are correlated. This phenomenon becomes particularly obvious when the training of a self-organized feature map is played on a computer graphics screen. It is rather wasteful of computer resources to move a large number of degrees of freedom around in a correlated fashion, as in the standard SOFM algorithm. Instead, it is much better to use renormalized SOFM training (Luttrell, 1992), according to which we work with a much smaller number of normalized degrees of freedom. This operation is easily performed in discrete form simply by having a Λ_{i(x)}(n) of constant width, but gradually increasing the total number of neurons. The new neurons are just inserted halfway between the old ones, and the smoothness properties of the SOFM algorithm guarantee that the new ones join the adaptation in a graceful manner (Luttrell, 1989a). A summary of the renormalized SOFM algorithm is presented in the problems at the end of the chapter.

Adaptive Process

For the network to be self-organizing, the synaptic weight vector wj of neuron j is required to change in relation to the input vector x. The question is how to make the change. In Hebb's postulate of learning, a synaptic weight is increased with a simultaneous occurrence of presynaptic and postsynaptic activities. The use of such a rule is well suited for associative learning. For the type of unsupervised learning being considered here, however, the Hebbian hypothesis in its basic form is unsatisfactory for the following reason: changes in connectivities occur in one direction only, which finally drives all the synaptic weights into saturation. To overcome this problem, we may modify the Hebbian hypothesis simply by including a nonlinear forgetting term -g(yj)wj, where wj is the synaptic weight vector of neuron j and g(yj) is some positive scalar function of its response yj. The only requirement imposed on the function g(yj) is that the constant term in the Taylor series expansion of g(yj) be zero, so that we may write

$$g(y_j) = 0 \quad \text{for } y_j = 0 \text{ and for all } j \tag{10.10}$$

The significance of this requirement will become apparent momentarily. Given such a function, we may then express the differential equation that defines the computational map produced by the SOFM algorithm as

$$\frac{d\mathbf{w}_j}{dt} = \eta\, y_j\, \mathbf{x} - g(y_j)\, \mathbf{w}_j, \qquad j = 1, 2, \ldots, N \tag{10.11}$$

where t denotes the continuous time and η is the learning-rate parameter of the algorithm. We may further simplify Eq. (10.11) in light of the activity bubble formation phenomenon described previously. To be specific, if the input vector x is changing at a rate that is slow compared to the synaptic weight vector wj for all j, we may justifiably assume that, due to the clustering effect (i.e., the formation of an activity bubble), the response yj of neuron j is at either a low or a high saturation value, depending on whether neuron j is outside or inside the bubble, respectively. Correspondingly, the function g(yj) takes on a binary character.
Thus, identifying the neighborhood function Λ_{i(x)} (around the winning neuron i(x)) with the activity bubble, we may write

$$y_j = \begin{cases} 1, & \text{neuron } j \text{ is active (i.e., inside the neighborhood } \Lambda_{i(\mathbf{x})}) \\ 0, & \text{neuron } j \text{ is inactive (i.e., outside the neighborhood } \Lambda_{i(\mathbf{x})}) \end{cases} \tag{10.12}$$

In a corresponding way, we may express the function g(yj) as

$$g(y_j) = \begin{cases} \alpha, & \text{neuron } j \text{ is active (on)} \\ 0, & \text{neuron } j \text{ is inactive (off)} \end{cases} \tag{10.13}$$

where α is some positive constant, and the second line is a consequence of the conditions described in Eqs. (10.10) and (10.12). Accordingly, we may simplify Eq. (10.11) as follows:

$$\frac{d\mathbf{w}_j}{dt} = \begin{cases} \eta\mathbf{x} - \alpha\mathbf{w}_j, & \text{neuron } j \text{ is inside the neighborhood } \Lambda_{i(\mathbf{x})} \\ 0, & \text{neuron } j \text{ is outside the neighborhood } \Lambda_{i(\mathbf{x})} \end{cases} \tag{10.14}$$

Without loss of generality, we may use the same scaling factor for the input vector x and the weight vector wj. In other words, we may put α = η, in which case Eq. (10.14) further simplifies to

$$\frac{d\mathbf{w}_j}{dt} = \begin{cases} \eta(\mathbf{x} - \mathbf{w}_j), & \text{neuron } j \text{ is inside the neighborhood } \Lambda_{i(\mathbf{x})} \\ 0, & \text{neuron } j \text{ is outside the neighborhood } \Lambda_{i(\mathbf{x})} \end{cases} \tag{10.15}$$

According to Eq. (10.15), the weight vector wj tends to follow the input vector x with increasing time t. Finally, using a discrete-time formalism, Eq. (10.15) is cast in a form whereby, given the synaptic weight vector wj(n) of neuron j at discrete time n, we may compute the updated value wj(n + 1) at time n + 1 as follows (Kohonen, 1982a, 1990a):

$$\mathbf{w}_j(n+1) = \begin{cases} \mathbf{w}_j(n) + \eta(n)\,[\mathbf{x}(n) - \mathbf{w}_j(n)], & j \in \Lambda_{i(\mathbf{x})}(n) \\ \mathbf{w}_j(n), & \text{otherwise} \end{cases} \tag{10.16}$$

Here, Λ_{i(x)}(n) is the neighborhood function around the winning neuron i(x) at time n, and η(n) is the corresponding value of the learning-rate parameter; the reason for making the learning-rate parameter time-dependent is explained later. The update equation (10.16) is the desired formula for computing the feature map.

The effect of the update equation (10.16) is to move the synaptic weight vector wi of the winning neuron i toward the input vector x. Upon repeated presentations of the training data, the synaptic weight vectors tend to follow the distribution of the input vectors due to the neighborhood updating. The algorithm therefore leads to a topological ordering of the feature map in the input space in the sense that neurons that are adjacent in the lattice will tend to have similar synaptic weight vectors. We will have much more to say on this issue in Section 10.6.
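The discrete update of Eq. (10.16) can be written as a single vectorized step. The sketch below (a hypothetical helper, not taken from the text) applies the update to every neuron whose lattice position falls inside the current square neighborhood of the winner, and leaves all other weight vectors unchanged.

```python
import numpy as np

def sofm_update(W, positions, x, winner, eta, radius):
    """One application of Eq. (10.16).

    W         : (N, p) synaptic weight vectors w_j(n)
    positions : (N, 2) lattice coordinates of the neurons
    x         : (p,)   current input vector x(n)
    winner    : int    index i(x) of the best-matching neuron
    eta       : float  learning-rate parameter eta(n)
    radius    : float  radius of the square neighborhood Lambda_i(x)(n)
    """
    inside = np.max(np.abs(positions - positions[winner]), axis=1) <= radius
    W[inside] += eta * (x - W[inside])   # w_j(n+1) = w_j(n) + eta(n) [x(n) - w_j(n)]
    return W
```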

Summary of the SOFM Algorithm

The essence of Kohonen's SOFM algorithm is that it substitutes a simple geometric computation for the more detailed properties of the Hebb-like rule and lateral interactions. There are three basic steps involved in the application of the algorithm after initialization, namely, sampling, similarity matching, and updating. These three steps are repeated until the map formation is completed. The algorithm is summarized as follows (Kohonen, 1988b, 1990a):

1. Initialization. Choose random values for the initial weight vectors wj(0). The only restriction here is that the wj(0) be different for j = 1, 2, ..., N, where N is the number of neurons in the lattice. It may be desirable to keep the magnitude of the weights small.
2. Sampling. Draw a sample x from the input distribution with a certain probability; the vector x represents the sensory signal.
3. Similarity Matching. Find the best-matching (winning) neuron i(x) at time n, using the minimum-distance Euclidean criterion:
$$i(\mathbf{x}) = \arg\min_j \|\mathbf{x}(n) - \mathbf{w}_j\|, \qquad j = 1, 2, \ldots, N$$
4. Updating. Adjust the synaptic weight vectors of all neurons, using the update formula
$$\mathbf{w}_j(n+1) = \begin{cases} \mathbf{w}_j(n) + \eta(n)\,[\mathbf{x}(n) - \mathbf{w}_j(n)], & j \in \Lambda_{i(\mathbf{x})}(n) \\ \mathbf{w}_j(n), & \text{otherwise} \end{cases}$$
where η(n) is the learning-rate parameter, and Λ_{i(x)}(n) is the neighborhood function centered around the winning neuron i(x); both η(n) and Λ_{i(x)}(n) are varied dynamically during learning for best results.
5. Continuation. Continue with step 2 until no noticeable changes in the feature map are observed.

Selection of Parameters

The learning process involved in the computation of a feature map is stochastic in nature, which means that the accuracy of the map depends on the number of iterations of the SOFM algorithm. Moreover, the success of map formation is critically dependent on how the main parameters of the algorithm, namely, the learning-rate parameter η and the neighborhood function Λ_i, are selected. Unfortunately, there is no theoretical basis for the selection of these parameters. They are usually determined by a process of trial and error. Nevertheless, the following observations provide a useful guide (Kohonen, 1988b, 1990a).

1. The learning-rate parameter η(n) used to update the synaptic weight vector wj(n) should be time-varying. In particular, during the first 1000 iterations or so, η(n) should begin with a value close to unity; thereafter, η(n) should decrease gradually, but staying above 0.1. The exact form of variation of η(n) with n is not critical; it can be linear, exponential, or inversely proportional to n. It is during this initial phase of the algorithm that the topological ordering of the weight vectors wj(n) takes place. This phase of the learning process is called the ordering phase. The remaining (relatively long) iterations of the algorithm are needed principally for the fine tuning of the computational map; this second phase of the learning process is called the convergence phase. For good statistical accuracy, η(n) should be maintained during the convergence phase at a small value (on the order of 0.01 or less) for a fairly long period of time, which is typically thousands of iterations.
2. For topological ordering of the weight vectors wj to take place, careful consideration has to be given to the neighborhood function Λ_i(n). Generally, the function Λ_i(n) is taken to include neighbors in a square region around the winning neuron, as illustrated in Fig. 10.11. For example, a radius of one includes the winning neuron plus the eight neurons that immediately surround it.

FIGURE 10.11 Square topological neighborhood Λ_i, of varying size, around winning neuron i, identified as a black circle.

The neighborhood function Λ_i(n) is made wide at the start of training and is then shrunk linearly with time n to a small value of only a couple of neighboring neurons. During the convergence phase of the algorithm, Λ_i(n) should contain only the nearest neighbors of winning neuron i, which may eventually be 1 or 0 neighboring neurons. It is of interest to note that, by the appropriate use of topological neighborhoods, the SOFM algorithm ensures that the neurons in the network are not underutilized, a problem that plagues other competitive learning networks (Ahalt et al., 1990). Also, as mentioned previously, we may improve the utilization of computer resources by using the renormalized SOFM training scheme in place of varying the neighborhood function Λ_i(n) with n (Luttrell, 1989a).

Computer Simulations

We illustrate the behavior of the SOFM algorithm by using computer simulations to study a network with 100 neurons, arranged in the form of a two-dimensional lattice with 10 rows and 10 columns. The network is trained with a two-dimensional input vector x, whose elements x1 and x2 are uniformly distributed in the region {-1 < x1 < +1; -1 < x2 < +1}. To initialize the network, the synaptic weights are chosen from a random set.

Figure 10.12 shows four stages of training as the network learns to represent the input distribution. Figure 10.12a shows the initial values of the synaptic weights, randomly chosen. Figures 10.12b, 10.12c, and 10.12d present the values of the synaptic weight vectors, plotted as dots in the input space, after 50, 1000, and 10,000 iterations, respectively. The lines drawn in Fig. 10.12 connect neighboring neurons (across rows and columns) in the network.

The results shown in Fig. 10.12 demonstrate the ordering phase and the convergence phase that characterize the learning process of the SOFM algorithm. During the ordering phase, the map unfolds to form a mesh, as shown in Figs. 10.12b and 10.12c. At the end of this phase, the neurons are mapped in the correct order. During the convergence phase, the map spreads out to fill the input space. At the end of this second phase, shown in Fig. 10.12d, the statistical distribution of the neurons in the map approaches that of the input vectors, except for some edge effects. This property holds for a uniform distribution of input vectors, but would not be the case for any other input distribution, as discussed later.
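The simulation just described is easy to reproduce. The sketch below trains a 10-by-10 lattice on inputs drawn uniformly from the square, following the three steps of the algorithm (sampling, similarity matching, updating) with a learning rate and a square neighborhood radius that both decay with time. The particular decay schedules, iteration counts, and random seed are illustrative choices consistent with the guidelines above, not values prescribed by the text.

```python
import numpy as np

rng = np.random.default_rng(0)

rows, cols, p = 10, 10, 2
N = rows * cols
positions = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
W = rng.uniform(-0.1, 0.1, size=(N, p))          # small random initial weights

n_iter = 10_000
eta_conv = 0.01                                  # convergence-phase learning rate
radius0 = max(rows, cols) / 2.0

for n in range(n_iter):
    x = rng.uniform(-1.0, 1.0, size=p)           # sampling from the uniform square

    i_win = int(np.argmin(np.linalg.norm(W - x, axis=1)))   # similarity matching, Eq. (10.9)

    if n < 1000:                                 # ordering phase: eta from ~0.9 down to 0.1
        frac = n / 1000.0
        eta = 0.9 * (1.0 - frac) + 0.1 * frac
        radius = radius0 * (1.0 - frac) + 1.0 * frac
    else:                                        # convergence phase: small eta, nearest neighbors
        eta, radius = eta_conv, 1.0

    inside = np.max(np.abs(positions - positions[i_win]), axis=1) <= radius
    W[inside] += eta * (x - W[inside])           # updating, Eq. (10.16)

# After training, neighboring lattice nodes should hold neighboring weight vectors,
# and the weights should spread out over the square, as in Fig. 10.12d.
print(np.round(W.reshape(rows, cols, p)[0, :3], 2))
```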

FIGURE 10.12 Demonstration of the SOFM algorithm for a uniformly distributed two-dimensional input, using a two-dimensional lattice. (a) Initial random weights. (b) Network after 50 iterations. (c) After 1000 iterations. (d) After 10,000 iterations.

10.6 Properties of the SOFM Algorithm

Once the SOFM algorithm has converged, the feature map computed by the algorithm displays important statistical characteristics of the input, as discussed in this section.

To begin with, let X denote a spatially continuous input (sensory) space, the topology of which is defined by the metric relationship of the vectors x in X. Let A denote a spatially discrete output space, the topology of which is endowed by arranging a set of neurons as the computation nodes of a lattice. Let Φ denote a nonlinear transformation called a feature map, which maps the input space X onto the output space A, as shown by

$$\Phi: X \to A \tag{10.17}$$

Equation (10.17) may be viewed as an abstraction of Eq. (10.9) that defines the location of a winning neuron i(x) developed in response to an input vector x. For example, in a neurobiological context, the input space X may represent the coordinate set of somatosensory receptors distributed densely over the entire body surface. Correspondingly, the output space A represents the set of neurons located in that layer of the cerebral cortex to which the somatosensory receptors are confined.

Given an input vector x, the SOFM algorithm proceeds by first identifying a best-matching or winning neuron i(x) in the output space A, in accordance with the feature map Φ. The synaptic weight vector wi of neuron i(x) may then be viewed as a pointer for that neuron into the input space X. These two operations are depicted in Fig. 10.13.

The self-organizing feature mapping Φ has some important properties, as described here:

PROPERTY 1. Approximation of the Input Space
The self-organizing feature map Φ, represented by the set of synaptic weight vectors {wj | j = 1, 2, ..., N} in the output space A, provides a good approximation to the input space X.

The basic aim of the SOFM algorithm is to store a large set of input vectors x in X by finding a smaller set of prototypes wj in A, so as to provide a good approximation to the original input space X. The theoretical basis of the idea just described is rooted in vector quantization theory, the motivation for which is dimensionality reduction or data compression (Gray, 1984). It is therefore appropriate that we present a brief discussion of this theory.

FIGURE 10.13 Illustration of the relationship between the feature map Φ and the weight vector wi of winning neuron i.

Consider Fig. 10.14, where c(x) acts as an encoder for the input vector x and x'(c) acts as a decoder for c(x). The vector x is selected at random from a training set (i.e., input space X), subject to an underlying probability density function f(x). The optimum encoding-decoding scheme is determined by varying the functions c(x) and x'(c), so as to minimize an average distortion, defined by

$$D = \frac{1}{2} \int_{-\infty}^{\infty} d\mathbf{x}\, f(\mathbf{x})\, d(\mathbf{x}, \mathbf{x}') \tag{10.18}$$

where the factor 1/2 has been introduced for convenience of presentation, d(x, x') is a distortion measure, and the integration is performed over the entire input space X. A popular choice for the distortion measure d(x, x') is the square of the Euclidean distance between the input vector x and the reproduction (reconstruction) vector x'; that is,

$$d(\mathbf{x}, \mathbf{x}') = \|\mathbf{x} - \mathbf{x}'\|^2 = (\mathbf{x} - \mathbf{x}')^T(\mathbf{x} - \mathbf{x}') \tag{10.19}$$

Thus we may rewrite Eq. (10.18) as

$$D = \frac{1}{2} \int_{-\infty}^{\infty} d\mathbf{x}\, f(\mathbf{x})\, \|\mathbf{x} - \mathbf{x}'(c(\mathbf{x}))\|^2 \tag{10.20}$$

The necessary conditions for the minimization of the average distortion D are embodied in the LBG algorithm, so named in recognition of its originators, Linde, Buzo, and Gray (1980). The conditions are twofold:

CONDITION 1. Given the input vector x, choose the code c = c(x) to minimize the squared error distortion ||x - x'(c)||^2.

CONDITION 2. Given the code c, compute the reproduction vector x' = x'(c) as the centroid of those input vectors x that satisfy condition 1.

Condition 1 is recognized as a nearest-neighbor encoding rule. Conditions 1 and 2 imply that the average distortion D is stationary (i.e., at a local minimum) with respect to variations in the encoder c(x) and decoder x'(c), respectively. The LBG algorithm, for the implementation of vector quantization, operates in a batch training mode. Basically, the algorithm consists of alternately adjusting the encoder c(x) in accordance with condition 1, and then adjusting the decoder x'(c) in accordance with condition 2, until the average distortion D reaches a minimum. In order to overcome the local-minimum problem, it may be necessary to run the LBG algorithm several times with different initial code vectors.

FIGURE 10.14 Encoder-decoder model.
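A compact batch implementation of these two conditions, working on a finite training set rather than a continuous density f(x), is sketched below. The sample data, the number of code vectors, and the fixed number of epochs used as a stopping rule are illustrative assumptions.

```python
import numpy as np

def lbg(X, n_codes, n_epochs=50, seed=0):
    """Batch vector quantization by alternating the two LBG conditions:
    condition 1 (nearest-neighbor encoding) and condition 2 (centroid decoding)."""
    rng = np.random.default_rng(seed)
    codebook = X[rng.choice(len(X), size=n_codes, replace=False)].copy()
    for _ in range(n_epochs):
        # Condition 1: encode each x by the index of its nearest code vector.
        codes = np.argmin(((X[:, None, :] - codebook[None, :, :]) ** 2).sum(-1), axis=1)
        # Condition 2: replace each code vector by the centroid of the vectors mapped to it.
        for c in range(n_codes):
            members = X[codes == c]
            if len(members) > 0:
                codebook[c] = members.mean(axis=0)
    return codebook

X = np.random.default_rng(1).uniform(-1.0, 1.0, size=(500, 2))   # illustrative training set
print(np.round(lbg(X, n_codes=8), 2))
```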

The LBG algorithm described briefly here is closely related to the SOFM algorithm. Following Luttrell (1989b), we may delineate the form of this relationship by considering the scheme shown in Fig. 10.15, where we have introduced a signal-independent noise process ν following the encoder c(x). The noise ν is associated with the communication channel between the encoder and the decoder, the purpose of which is to account for the possibility that the output code c(x) may be distorted. On the basis of the model shown in Fig. 10.15, we may consider a modified form of average distortion as follows:

$$D_1 = \frac{1}{2} \int_{-\infty}^{\infty} d\mathbf{x}\, f(\mathbf{x}) \int_{-\infty}^{\infty} d\nu\, \pi(\nu)\, \|\mathbf{x} - \mathbf{x}'(c(\mathbf{x}) + \nu)\|^2 \tag{10.21}$$

where π(ν) is the probability density function of the noise ν, and the second integration is over all possible realizations of the noise. We wish to minimize the average distortion D1 with respect to the function x'(c) in the model of Fig. 10.15. Differentiating Eq. (10.21) with respect to x'(c), we get the following partial derivative:

$$\frac{\partial D_1}{\partial \mathbf{x}'(c)} = -\int_{-\infty}^{\infty} d\mathbf{x}\, f(\mathbf{x})\, \pi(c - c(\mathbf{x}))\, [\mathbf{x} - \mathbf{x}'(c)] \tag{10.22}$$

Thus, in order to minimize the average distortion D1, conditions 1 and 2 stated earlier for the LBG algorithm must be modified as follows (Luttrell, 1989b):

CONDITION I. Given the input vector x, choose the code c = c(x) to minimize the distortion measure

$$\int_{-\infty}^{\infty} d\nu\, \pi(\nu)\, \|\mathbf{x} - \mathbf{x}'(c(\mathbf{x}) + \nu)\|^2 \tag{10.23}$$

CONDITION II. Given the code c, compute the reconstruction vector x'(c) to satisfy the condition

$$\mathbf{x}'(c) = \frac{\displaystyle\int_{-\infty}^{\infty} d\mathbf{x}\, f(\mathbf{x})\, \pi(c - c(\mathbf{x}))\, \mathbf{x}}{\displaystyle\int_{-\infty}^{\infty} d\mathbf{x}\, f(\mathbf{x})\, \pi(c - c(\mathbf{x}))} \tag{10.24}$$

Equation (10.24) is obtained simply by setting the partial derivative in Eq. (10.22) equal to zero and then solving for x'(c) in closed form.

FIGURE 10.15 Noisy encoder-decoder model.

The model described in Fig. 10.14 may be viewed as a special case of that shown in Fig. 10.15. In particular, if we set the probability density function π(ν) equal to a Dirac delta function δ(ν), conditions I and II reduce to conditions 1 and 2 for the LBG algorithm, respectively. To simplify condition I, we assume that π(ν) is a smooth function of ν. It may then be shown that, to a second order of approximation, the distortion measure defined in Eq. (10.23) consists of two components (Luttrell, 1989b):

- The conventional distortion term, defined by the squared error distortion ||x - x'(c)||^2
- A curvature term that arises from the noise model π(ν)

Consider next condition II. A straightforward approach to realizing this condition is to use stochastic gradient descent learning. In particular, we pick input vectors x at random from the input space X according to the factor f(x) dx, and update the reconstruction vector x'(c') as follows (Luttrell, 1989b):

$$\mathbf{x}'(c') \leftarrow \mathbf{x}'(c') + \eta\, \pi(c' - c(\mathbf{x}))\, [\mathbf{x} - \mathbf{x}'(c')] \tag{10.25}$$

where η is the learning-rate parameter and c(x) is the nearest-neighbor encoding approximation to condition I. The update equation (10.25) is obtained by inspection of the partial derivative in Eq. (10.22). This update is applied to all c' for which we have

$$\pi(c' - c(\mathbf{x})) > 0 \tag{10.26}$$

We may think of the gradient descent procedure described in Eq. (10.25) as a way of minimizing the distortion measure D1 in Eq. (10.21). That is, Eqs. (10.24) and (10.25) are essentially of the same type, except for the fact that (10.24) is batch and (10.25) is continuous. The update equation (10.25) is identical to Kohonen's SOFM algorithm, bearing in mind the correspondences listed in Table 10.1. Accordingly, we may state that the LBG algorithm for vector quantization is the batch training version of the SOFM algorithm with zero neighborhood size; for zero neighborhood, π(0) = 1. Note that in order to obtain the LBG algorithm from the batch version of the SOFM algorithm we do not need to make any approximations, because the curvature terms (and all higher-order terms) make no contribution when the neighborhood has zero width.

The important point to note from the discussion presented here is that the SOFM algorithm is a vector quantization algorithm, which provides a good approximation to the input space X. Indeed, this viewpoint provides another approach for deriving the SOFM algorithm, as exemplified by Eq. (10.25). We will have more to say on the factor π(·) of this update equation later in the chapter.

PROPERTY 2. Topological Ordering
The feature map Φ computed by the SOFM algorithm is topologically ordered in the sense that the spatial location of a neuron in the lattice corresponds to a particular domain or feature of input patterns.

TABLE 10.1 Correspondence Between the SOFM Algorithm and the Model of Fig. 10.15

    Encoding-Decoding Model of Fig. 10.15    SOFM Algorithm
    Encoder c(x)                             Best-matching neuron i(x)
    Reconstruction vector x'(c)              Synaptic weight vector wj
    Function pi(c - c(x))                    Neighborhood function Lambda_i(x)
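To see the correspondence of Table 10.1 in code, the sketch below performs the stochastic update of Eq. (10.25) with a Gaussian choice for π(·) over the lattice distance between code indices. The Gaussian form and its width are assumptions made for illustration (the text only requires π to be a smooth density); with this choice the update is exactly a SOFM step with a soft, rather than all-or-nothing, neighborhood.

```python
import numpy as np

def noisy_encoder_update(W, positions, x, eta, sigma):
    """One stochastic update of Eq. (10.25).

    W[c'] plays the role of the reconstruction vector x'(c'), the index of the
    best-matching row plays the role of the code c(x), and pi(c' - c(x)) is taken
    to be a Gaussian of the lattice distance between c' and c(x) (assumed form).
    """
    c_x = int(np.argmin(np.linalg.norm(W - x, axis=1)))       # encoder c(x), Table 10.1
    d2 = np.sum((positions - positions[c_x]) ** 2, axis=1)    # squared lattice distances
    pi = np.exp(-d2 / (2.0 * sigma ** 2))                     # pi(c' - c(x)) > 0 for all c'
    W += eta * pi[:, None] * (x - W)                          # x'(c') <- x'(c') + eta pi [x - x'(c')]
    return W
```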


More information

Machine Learning Classifiers and Boosting

Machine Learning Classifiers and Boosting Machine Learning Classifiers and Boosting Reading Ch 18.6-18.12, 20.1-20.3.2 Outline Different types of learning problems Different types of learning algorithms Supervised learning Decision trees Naïve

More information

Artificial neural networks are the paradigm of connectionist systems (connectionism vs. symbolism)

Artificial neural networks are the paradigm of connectionist systems (connectionism vs. symbolism) Artificial Neural Networks Analogy to biological neural systems, the most robust learning systems we know. Attempt to: Understand natural biological systems through computational modeling. Model intelligent

More information

Centroid Neural Network based clustering technique using competetive learning

Centroid Neural Network based clustering technique using competetive learning Proceedings of the 2009 IEEE International Conference on Systems, Man, and Cybernetics San Antonio, TX, USA - October 2009 Centroid Neural Network based clustering technique using competetive learning

More information

Chapter 2 Basic Structure of High-Dimensional Spaces

Chapter 2 Basic Structure of High-Dimensional Spaces Chapter 2 Basic Structure of High-Dimensional Spaces Data is naturally represented geometrically by associating each record with a point in the space spanned by the attributes. This idea, although simple,

More information

6. Dicretization methods 6.1 The purpose of discretization

6. Dicretization methods 6.1 The purpose of discretization 6. Dicretization methods 6.1 The purpose of discretization Often data are given in the form of continuous values. If their number is huge, model building for such data can be difficult. Moreover, many

More information

Image Compression with Competitive Networks and Pre-fixed Prototypes*

Image Compression with Competitive Networks and Pre-fixed Prototypes* Image Compression with Competitive Networks and Pre-fixed Prototypes* Enrique Merida-Casermeiro^, Domingo Lopez-Rodriguez^, and Juan M. Ortiz-de-Lazcano-Lobato^ ^ Department of Applied Mathematics, University

More information

DESIGN OF KOHONEN SELF-ORGANIZING MAP WITH REDUCED STRUCTURE

DESIGN OF KOHONEN SELF-ORGANIZING MAP WITH REDUCED STRUCTURE DESIGN OF KOHONEN SELF-ORGANIZING MAP WITH REDUCED STRUCTURE S. Kajan, M. Lajtman Institute of Control and Industrial Informatics, Faculty of Electrical Engineering and Information Technology, Slovak University

More information

Unsupervised Learning

Unsupervised Learning Networks for Pattern Recognition, 2014 Networks for Single Linkage K-Means Soft DBSCAN PCA Networks for Kohonen Maps Linear Vector Quantization Networks for Problems/Approaches in Machine Learning Supervised

More information

3 Nonlinear Regression

3 Nonlinear Regression 3 Linear models are often insufficient to capture the real-world phenomena. That is, the relation between the inputs and the outputs we want to be able to predict are not linear. As a consequence, nonlinear

More information

Biometrics Technology: Image Processing & Pattern Recognition (by Dr. Dickson Tong)

Biometrics Technology: Image Processing & Pattern Recognition (by Dr. Dickson Tong) Biometrics Technology: Image Processing & Pattern Recognition (by Dr. Dickson Tong) References: [1] http://homepages.inf.ed.ac.uk/rbf/hipr2/index.htm [2] http://www.cs.wisc.edu/~dyer/cs540/notes/vision.html

More information

A Computational Approach To Understanding The Response Properties Of Cells In The Visual System

A Computational Approach To Understanding The Response Properties Of Cells In The Visual System A Computational Approach To Understanding The Response Properties Of Cells In The Visual System David C. Noelle Assistant Professor of Computer Science and Psychology Vanderbilt University March 3, 2004

More information

Chapter 14 Global Search Algorithms

Chapter 14 Global Search Algorithms Chapter 14 Global Search Algorithms An Introduction to Optimization Spring, 2015 Wei-Ta Chu 1 Introduction We discuss various search methods that attempts to search throughout the entire feasible set.

More information

MRT based Adaptive Transform Coder with Classified Vector Quantization (MATC-CVQ)

MRT based Adaptive Transform Coder with Classified Vector Quantization (MATC-CVQ) 5 MRT based Adaptive Transform Coder with Classified Vector Quantization (MATC-CVQ) Contents 5.1 Introduction.128 5.2 Vector Quantization in MRT Domain Using Isometric Transformations and Scaling.130 5.2.1

More information

Pattern Recognition. Kjell Elenius. Speech, Music and Hearing KTH. March 29, 2007 Speech recognition

Pattern Recognition. Kjell Elenius. Speech, Music and Hearing KTH. March 29, 2007 Speech recognition Pattern Recognition Kjell Elenius Speech, Music and Hearing KTH March 29, 2007 Speech recognition 2007 1 Ch 4. Pattern Recognition 1(3) Bayes Decision Theory Minimum-Error-Rate Decision Rules Discriminant

More information

NEURAL NETWORKS. Typeset by FoilTEX 1

NEURAL NETWORKS. Typeset by FoilTEX 1 NEURAL NETWORKS Typeset by FoilTEX 1 Basic Concepts The McCulloch-Pitts model Hebb s rule Neural network: double dynamics. Pattern Formation and Pattern Recognition Neural network as an input-output device

More information

3 Nonlinear Regression

3 Nonlinear Regression CSC 4 / CSC D / CSC C 3 Sometimes linear models are not sufficient to capture the real-world phenomena, and thus nonlinear models are necessary. In regression, all such models will have the same basic

More information

Pattern Classification Algorithms for Face Recognition

Pattern Classification Algorithms for Face Recognition Chapter 7 Pattern Classification Algorithms for Face Recognition 7.1 Introduction The best pattern recognizers in most instances are human beings. Yet we do not completely understand how the brain recognize

More information

A Self-Organizing Binary System*

A Self-Organizing Binary System* 212 1959 PROCEEDINGS OF THE EASTERN JOINT COMPUTER CONFERENCE A Self-Organizing Binary System* RICHARD L. MATTSONt INTRODUCTION ANY STIMULUS to a system such as described in this paper can be coded into

More information

Clustering CS 550: Machine Learning

Clustering CS 550: Machine Learning Clustering CS 550: Machine Learning This slide set mainly uses the slides given in the following links: http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf http://www-users.cs.umn.edu/~kumar/dmbook/dmslides/chap8_basic_cluster_analysis.pdf

More information

Ensemble methods in machine learning. Example. Neural networks. Neural networks

Ensemble methods in machine learning. Example. Neural networks. Neural networks Ensemble methods in machine learning Bootstrap aggregating (bagging) train an ensemble of models based on randomly resampled versions of the training set, then take a majority vote Example What if you

More information

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Cluster Analysis Mu-Chun Su Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Introduction Cluster analysis is the formal study of algorithms and methods

More information

Two-dimensional Totalistic Code 52

Two-dimensional Totalistic Code 52 Two-dimensional Totalistic Code 52 Todd Rowland Senior Research Associate, Wolfram Research, Inc. 100 Trade Center Drive, Champaign, IL The totalistic two-dimensional cellular automaton code 52 is capable

More information

Applying Supervised Learning

Applying Supervised Learning Applying Supervised Learning When to Consider Supervised Learning A supervised learning algorithm takes a known set of input data (the training set) and known responses to the data (output), and trains

More information

Neural Networks CMSC475/675

Neural Networks CMSC475/675 Introduction to Neural Networks CMSC475/675 Chapter 1 Introduction Why ANN Introduction Some tasks can be done easily (effortlessly) by humans but are hard by conventional paradigms on Von Neumann machine

More information

Website: HOPEFIELD NETWORK. Inderjeet Singh Behl, Ankush Saini, Jaideep Verma. ID-

Website:   HOPEFIELD NETWORK. Inderjeet Singh Behl, Ankush Saini, Jaideep Verma.  ID- International Journal Of Scientific Research And Education Volume 1 Issue 7 Pages 154-162 2013 ISSN (e): 2321-7545 Website: http://ijsae.in HOPEFIELD NETWORK Inderjeet Singh Behl, Ankush Saini, Jaideep

More information

Clustering and Visualisation of Data

Clustering and Visualisation of Data Clustering and Visualisation of Data Hiroshi Shimodaira January-March 28 Cluster analysis aims to partition a data set into meaningful or useful groups, based on distances between data points. In some

More information

Image Compression Using SOFM

Image Compression Using SOFM Image Compression Using SOFM Ankit Aggarwal (03d05009) Anshu Agrawal (03005006) November 12, 2006 Why Image Compression? Application of data compression on digital images. Computer images are extremely

More information

Machine Learning for Signal Processing Clustering. Bhiksha Raj Class Oct 2016

Machine Learning for Signal Processing Clustering. Bhiksha Raj Class Oct 2016 Machine Learning for Signal Processing Clustering Bhiksha Raj Class 11. 13 Oct 2016 1 Statistical Modelling and Latent Structure Much of statistical modelling attempts to identify latent structure in the

More information

What is a receptive field? Why a sensory neuron has such particular RF How a RF was developed?

What is a receptive field? Why a sensory neuron has such particular RF How a RF was developed? What is a receptive field? Why a sensory neuron has such particular RF How a RF was developed? x 1 x 2 x 3 y f w 1 w 2 w 3 T x y = f (wx i i T ) i y x 1 x 2 x 3 = = E (y y) (y f( wx T)) 2 2 o o i i i

More information

CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION

CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION 6.1 INTRODUCTION Fuzzy logic based computational techniques are becoming increasingly important in the medical image analysis arena. The significant

More information

13. Learning Ballistic Movementsof a Robot Arm 212

13. Learning Ballistic Movementsof a Robot Arm 212 13. Learning Ballistic Movementsof a Robot Arm 212 13. LEARNING BALLISTIC MOVEMENTS OF A ROBOT ARM 13.1 Problem and Model Approach After a sufficiently long training phase, the network described in the

More information

Novel Lossy Compression Algorithms with Stacked Autoencoders

Novel Lossy Compression Algorithms with Stacked Autoencoders Novel Lossy Compression Algorithms with Stacked Autoencoders Anand Atreya and Daniel O Shea {aatreya, djoshea}@stanford.edu 11 December 2009 1. Introduction 1.1. Lossy compression Lossy compression is

More information

C. Poultney S. Cho pra (NYU Courant Institute) Y. LeCun

C. Poultney S. Cho pra (NYU Courant Institute) Y. LeCun Efficient Learning of Sparse Overcomplete Representations with an Energy-Based Model Marc'Aurelio Ranzato C. Poultney S. Cho pra (NYU Courant Institute) Y. LeCun CIAR Summer School Toronto 2006 Why Extracting

More information

Edge and local feature detection - 2. Importance of edge detection in computer vision

Edge and local feature detection - 2. Importance of edge detection in computer vision Edge and local feature detection Gradient based edge detection Edge detection by function fitting Second derivative edge detectors Edge linking and the construction of the chain graph Edge and local feature

More information

D-Optimal Designs. Chapter 888. Introduction. D-Optimal Design Overview

D-Optimal Designs. Chapter 888. Introduction. D-Optimal Design Overview Chapter 888 Introduction This procedure generates D-optimal designs for multi-factor experiments with both quantitative and qualitative factors. The factors can have a mixed number of levels. For example,

More information

A Population Based Convergence Criterion for Self-Organizing Maps

A Population Based Convergence Criterion for Self-Organizing Maps A Population Based Convergence Criterion for Self-Organizing Maps Lutz Hamel and Benjamin Ott Department of Computer Science and Statistics, University of Rhode Island, Kingston, RI 02881, USA. Email:

More information

Neuro-Fuzzy Comp. Ch. 8 May 12, 2005

Neuro-Fuzzy Comp. Ch. 8 May 12, 2005 Neuro-Fuzzy Comp. Ch. 8 May, 8 Self-Organizing Feature Maps Self-Organizing Feature Maps (SOFM or SOM) also known as Kohonen maps or topographic maps were first introduced by von der Malsburg (97) and

More information

Dr. Qadri Hamarsheh Supervised Learning in Neural Networks (Part 1) learning algorithm Δwkj wkj Theoretically practically

Dr. Qadri Hamarsheh Supervised Learning in Neural Networks (Part 1) learning algorithm Δwkj wkj Theoretically practically Supervised Learning in Neural Networks (Part 1) A prescribed set of well-defined rules for the solution of a learning problem is called a learning algorithm. Variety of learning algorithms are existing,

More information

Introduction to visual computation and the primate visual system

Introduction to visual computation and the primate visual system Introduction to visual computation and the primate visual system Problems in vision Basic facts about the visual system Mathematical models for early vision Marr s computational philosophy and proposal

More information

Texture Classification by Combining Local Binary Pattern Features and a Self-Organizing Map

Texture Classification by Combining Local Binary Pattern Features and a Self-Organizing Map Texture Classification by Combining Local Binary Pattern Features and a Self-Organizing Map Markus Turtinen, Topi Mäenpää, and Matti Pietikäinen Machine Vision Group, P.O.Box 4500, FIN-90014 University

More information

Texture Segmentation by Windowed Projection

Texture Segmentation by Windowed Projection Texture Segmentation by Windowed Projection 1, 2 Fan-Chen Tseng, 2 Ching-Chi Hsu, 2 Chiou-Shann Fuh 1 Department of Electronic Engineering National I-Lan Institute of Technology e-mail : fctseng@ccmail.ilantech.edu.tw

More information

Lecture 3: Linear Classification

Lecture 3: Linear Classification Lecture 3: Linear Classification Roger Grosse 1 Introduction Last week, we saw an example of a learning task called regression. There, the goal was to predict a scalar-valued target from a set of features.

More information

A Keypoint Descriptor Inspired by Retinal Computation

A Keypoint Descriptor Inspired by Retinal Computation A Keypoint Descriptor Inspired by Retinal Computation Bongsoo Suh, Sungjoon Choi, Han Lee Stanford University {bssuh,sungjoonchoi,hanlee}@stanford.edu Abstract. The main goal of our project is to implement

More information

CHAPTER 4 STOCK PRICE PREDICTION USING MODIFIED K-NEAREST NEIGHBOR (MKNN) ALGORITHM

CHAPTER 4 STOCK PRICE PREDICTION USING MODIFIED K-NEAREST NEIGHBOR (MKNN) ALGORITHM CHAPTER 4 STOCK PRICE PREDICTION USING MODIFIED K-NEAREST NEIGHBOR (MKNN) ALGORITHM 4.1 Introduction Nowadays money investment in stock market gains major attention because of its dynamic nature. So the

More information

COMPUTATIONAL INTELLIGENCE

COMPUTATIONAL INTELLIGENCE COMPUTATIONAL INTELLIGENCE Fundamentals Adrian Horzyk Preface Before we can proceed to discuss specific complex methods we have to introduce basic concepts, principles, and models of computational intelligence

More information

Particle Swarm Optimization applied to Pattern Recognition

Particle Swarm Optimization applied to Pattern Recognition Particle Swarm Optimization applied to Pattern Recognition by Abel Mengistu Advisor: Dr. Raheel Ahmad CS Senior Research 2011 Manchester College May, 2011-1 - Table of Contents Introduction... - 3 - Objectives...

More information

Character Recognition

Character Recognition Character Recognition 5.1 INTRODUCTION Recognition is one of the important steps in image processing. There are different methods such as Histogram method, Hough transformation, Neural computing approaches

More information

A Class of Instantaneously Trained Neural Networks

A Class of Instantaneously Trained Neural Networks A Class of Instantaneously Trained Neural Networks Subhash Kak Department of Electrical & Computer Engineering, Louisiana State University, Baton Rouge, LA 70803-5901 May 7, 2002 Abstract This paper presents

More information

STEREO-DISPARITY ESTIMATION USING A SUPERVISED NEURAL NETWORK

STEREO-DISPARITY ESTIMATION USING A SUPERVISED NEURAL NETWORK 2004 IEEE Workshop on Machine Learning for Signal Processing STEREO-DISPARITY ESTIMATION USING A SUPERVISED NEURAL NETWORK Y. V. Venkatesh, B. S. Venhtesh and A. Jaya Kumar Department of Electrical Engineering

More information

Extract an Essential Skeleton of a Character as a Graph from a Character Image

Extract an Essential Skeleton of a Character as a Graph from a Character Image Extract an Essential Skeleton of a Character as a Graph from a Character Image Kazuhisa Fujita University of Electro-Communications 1-5-1 Chofugaoka, Chofu, Tokyo, 182-8585 Japan k-z@nerve.pc.uec.ac.jp

More information

Controlling the spread of dynamic self-organising maps

Controlling the spread of dynamic self-organising maps Neural Comput & Applic (2004) 13: 168 174 DOI 10.1007/s00521-004-0419-y ORIGINAL ARTICLE L. D. Alahakoon Controlling the spread of dynamic self-organising maps Received: 7 April 2004 / Accepted: 20 April

More information

Artifacts and Textured Region Detection

Artifacts and Textured Region Detection Artifacts and Textured Region Detection 1 Vishal Bangard ECE 738 - Spring 2003 I. INTRODUCTION A lot of transformations, when applied to images, lead to the development of various artifacts in them. In

More information

Character Recognition Using Convolutional Neural Networks

Character Recognition Using Convolutional Neural Networks Character Recognition Using Convolutional Neural Networks David Bouchain Seminar Statistical Learning Theory University of Ulm, Germany Institute for Neural Information Processing Winter 2006/2007 Abstract

More information

Exercise 2: Hopeld Networks

Exercise 2: Hopeld Networks Articiella neuronnät och andra lärande system, 2D1432, 2004 Exercise 2: Hopeld Networks [Last examination date: Friday 2004-02-13] 1 Objectives This exercise is about recurrent networks, especially the

More information

Design and Performance Analysis of and Gate using Synaptic Inputs for Neural Network Application

Design and Performance Analysis of and Gate using Synaptic Inputs for Neural Network Application IJIRST International Journal for Innovative Research in Science & Technology Volume 1 Issue 12 May 2015 ISSN (online): 2349-6010 Design and Performance Analysis of and Gate using Synaptic Inputs for Neural

More information

Face Recognition using Eigenfaces SMAI Course Project

Face Recognition using Eigenfaces SMAI Course Project Face Recognition using Eigenfaces SMAI Course Project Satarupa Guha IIIT Hyderabad 201307566 satarupa.guha@research.iiit.ac.in Ayushi Dalmia IIIT Hyderabad 201307565 ayushi.dalmia@research.iiit.ac.in Abstract

More information

Time Series Prediction as a Problem of Missing Values: Application to ESTSP2007 and NN3 Competition Benchmarks

Time Series Prediction as a Problem of Missing Values: Application to ESTSP2007 and NN3 Competition Benchmarks Series Prediction as a Problem of Missing Values: Application to ESTSP7 and NN3 Competition Benchmarks Antti Sorjamaa and Amaury Lendasse Abstract In this paper, time series prediction is considered as

More information

Neural Network Weight Selection Using Genetic Algorithms

Neural Network Weight Selection Using Genetic Algorithms Neural Network Weight Selection Using Genetic Algorithms David Montana presented by: Carl Fink, Hongyi Chen, Jack Cheng, Xinglong Li, Bruce Lin, Chongjie Zhang April 12, 2005 1 Neural Networks Neural networks

More information

Deep Learning. Deep Learning. Practical Application Automatically Adding Sounds To Silent Movies

Deep Learning. Deep Learning. Practical Application Automatically Adding Sounds To Silent Movies http://blog.csdn.net/zouxy09/article/details/8775360 Automatic Colorization of Black and White Images Automatically Adding Sounds To Silent Movies Traditionally this was done by hand with human effort

More information

Topographic Mapping with fmri

Topographic Mapping with fmri Topographic Mapping with fmri Retinotopy in visual cortex Tonotopy in auditory cortex signal processing + neuroimaging = beauty! Topographic Mapping with fmri Retinotopy in visual cortex Tonotopy in auditory

More information

Data Compression. The Encoder and PCA

Data Compression. The Encoder and PCA Data Compression The Encoder and PCA Neural network techniques have been shown useful in the area of data compression. In general, data compression can be lossless compression or lossy compression. In

More information

IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 1

IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 1 TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 1 AQ:1 A Spiking Self-Organizing Map Combining STDP, Oscillations, and Continuous Learning Timothy Rumbell, Susan L. Denham, and Thomas Wennekers 1

More information

Notes on Multilayer, Feedforward Neural Networks

Notes on Multilayer, Feedforward Neural Networks Notes on Multilayer, Feedforward Neural Networks CS425/528: Machine Learning Fall 2012 Prepared by: Lynne E. Parker [Material in these notes was gleaned from various sources, including E. Alpaydin s book

More information

Schedule for Rest of Semester

Schedule for Rest of Semester Schedule for Rest of Semester Date Lecture Topic 11/20 24 Texture 11/27 25 Review of Statistics & Linear Algebra, Eigenvectors 11/29 26 Eigenvector expansions, Pattern Recognition 12/4 27 Cameras & calibration

More information

ECG782: Multidimensional Digital Signal Processing

ECG782: Multidimensional Digital Signal Processing ECG782: Multidimensional Digital Signal Processing Object Recognition http://www.ee.unlv.edu/~b1morris/ecg782/ 2 Outline Knowledge Representation Statistical Pattern Recognition Neural Networks Boosting

More information

Unsupervised Learning

Unsupervised Learning Unsupervised Learning Unsupervised learning Until now, we have assumed our training samples are labeled by their category membership. Methods that use labeled samples are said to be supervised. However,

More information

The Encoding Complexity of Network Coding

The Encoding Complexity of Network Coding The Encoding Complexity of Network Coding Michael Langberg Alexander Sprintson Jehoshua Bruck California Institute of Technology Email: mikel,spalex,bruck @caltech.edu Abstract In the multicast network

More information

Adaptive Robotics - Final Report Extending Q-Learning to Infinite Spaces

Adaptive Robotics - Final Report Extending Q-Learning to Infinite Spaces Adaptive Robotics - Final Report Extending Q-Learning to Infinite Spaces Eric Christiansen Michael Gorbach May 13, 2008 Abstract One of the drawbacks of standard reinforcement learning techniques is that

More information

Driven Cavity Example

Driven Cavity Example BMAppendixI.qxd 11/14/12 6:55 PM Page I-1 I CFD Driven Cavity Example I.1 Problem One of the classic benchmarks in CFD is the driven cavity problem. Consider steady, incompressible, viscous flow in a square

More information

2D image segmentation based on spatial coherence

2D image segmentation based on spatial coherence 2D image segmentation based on spatial coherence Václav Hlaváč Czech Technical University in Prague Center for Machine Perception (bridging groups of the) Czech Institute of Informatics, Robotics and Cybernetics

More information

9.1. K-means Clustering

9.1. K-means Clustering 424 9. MIXTURE MODELS AND EM Section 9.2 Section 9.3 Section 9.4 view of mixture distributions in which the discrete latent variables can be interpreted as defining assignments of data points to specific

More information