
Subgoal Chaining and the Local Minimum Problem

Jonathan P. Lewis (jonl@dcs.st-and.ac.uk), Michael K. Weir (mkw@dcs.st-and.ac.uk)
Department of Computer Science, University of St. Andrews, St. Andrews, Fife KY16 9SS, Scotland

Abstract

It is well known that performing gradient descent on fixed surfaces may result in poor travel through getting stuck in local minima and other surface features. Subgoal chaining in supervised learning is a method to improve travel for neural networks by directing local variation in the surface during training. This paper shows, however, that linear subgoal chains such as those used in ERA are not sufficient to overcome the local minimum problem, and examines non-linear subgoal chains as a possible alternative.

Introduction

A problem long recognised as important for gradient descent techniques used in optimisation is that of local minima. The problem is how to avoid convergence to solutions of the minimisation condition \nabla F = \nabla f(S) = 0 that do not correspond to the lowest value of F, where F is a potential function over a state space S. An interesting technique, expanded range approximation (ERA), has recently been put forward [1] which purports to deal with the problem for supervised feedforward neural networks, where the potential function is the Least Mean Square (LMS) output error and the state space is the neural weight space.

The ERA method consists of compressing the range of target values d_p for the training set down to their mean value \bar{d} for each output unit,

    \bar{d} = \frac{1}{P} \sum_{p=1}^{P} d_p    (1)

and then progressively expanding these compressed targets linearly back toward their original values. That is, a modified training set for the inputs x_p is defined as

    S(\lambda) = \{ x_p, \{ \bar{d} + \lambda (d_p - \bar{d}) \} \}    (2)

where the parameter \lambda is increased in regular steps from 0 to 1. In our own terminology, the value of \lambda corresponds to a particular subgoal setting, and the increases in \lambda generate a linear subgoal chain from the mean-valued targets to the final goal targets for the original training set.
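The compressed-target schedule in (1) and (2) can be sketched as follows. This is our own minimal illustration; the function name and the 10-step schedule are illustrative, not taken from the paper.

```python
import numpy as np

def era_subgoal_targets(d, n_steps=10):
    """Yield linearly expanded target sets per equations (1) and (2).

    d : array of shape (P, M) -- goal targets for P patterns, M output units.
    Each yielded array interpolates between the per-unit mean (lambda = 0)
    and the original targets (lambda = 1).
    """
    d = np.asarray(d, dtype=float)
    d_bar = d.mean(axis=0)                  # equation (1): one mean per output unit
    for step in range(1, n_steps + 1):
        lam = step / n_steps                # lambda rises in regular steps to 1
        yield d_bar + lam * (d - d_bar)     # equation (2)

# XOR-style targets: the first subgoal set sits near the mean value 0.5,
# and the last subgoal set recovers the original targets exactly.
targets = [[0.2], [0.8], [0.8], [0.2]]
chain = list(era_subgoal_targets(targets, n_steps=10))
```

The chain's first element is almost flat at the mean, which is why ERA's travel always begins at or near the mid-range output state discussed below.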
There have been other approaches to achieving optimal learning, such as in [2]. Our group has also developed a subgoal chain approach [3] to improve robustness through better goal direction. However, what is interesting about ERA is its simplicity and the claim made for it, namely that ERA "is guaranteed to succeed in avoiding local minima in a large class of problems". This claim appears to be supported empirically through ERA achieving a 100% success rate on the XOR problem.

In this paper, we examine the above claim made for ERA in the avoidance of local minima, both theoretically and empirically. In particular we show that the linear chaining unfortunately still fails to avoid convergence to local minima for the goal. Finally, our analysis and results suggest that subgoal chaining remains a potentially attractive approach provided the realisability of the subgoals is taken into account.

2 The Status of Attractors in Linear Subgoal Chains

The designers of ERA use three main stages. The first is to begin training on the mid-range output state as the first subgoal, defined by (1). The mid-range output state corresponds to the global minimum weight states for this first subgoal, which can be set exactly or trained iteratively. Some mid-range weight states for an N-H-M net are given exactly by

    w_{0k} = \ln\left( \frac{\bar{d}_k}{1 - \bar{d}_k} \right) - \sum_{j=1}^{H} w_{jk} f(w_{0j}), \quad w_{ij} = 0, \; i = 1,\dots,N, \; j = 1,\dots,H, \; k = 1,\dots,M    (3)

where w_{ij} are weights to hidden unit j from unit i, w_{jk} are weights to output unit k from unit j, and unit 0 is the bias unit. For a single layer N-M net the mid-range state is similar to (3), only without the sum, and with w_{jk} = 0 for k = 1,\dots,M and j = 1,\dots,N.

The second stage is to make a small enough step along the linear chain so that the new subgoal's global minimum contains the mid-range state in its basin. The third stage is to repeat the same size of step to move iteratively along the chain until the goal is reached.
The claim is made that the range can thereby be progressively expanded up to \lambda = 1 without displacing the system from the global minimum at any step. This gives rise to the notion that local minima are avoided, so that travel to the goal's global minimum is always successful.
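As a numerical check on the single-layer form of (3): with all input weights zero, the bias weight alone places every output at the target mean, whatever the input. A minimal sketch, assuming a sigmoid activation; the function names are ours.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def midrange_bias(d_bar):
    """Single-layer form of equation (3): with all input weights zero,
    the bias weight w_{0k} = ln(d_bar_k / (1 - d_bar_k)) makes every
    output equal to the mean target d_bar_k regardless of the input."""
    d_bar = np.asarray(d_bar, dtype=float)
    return np.log(d_bar / (1.0 - d_bar))

w0 = midrange_bias([0.5])   # mean of goal targets 0.2 and 0.8
out = sigmoid(w0 + 0.0)     # inputs contribute nothing: their weights are zero
```

For a mean target of 0.5 the bias is exactly zero, i.e. the origin of weight space is a mid-range state; this fact is used in the counter-example constructions below.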

Our main concern is with the third stage in the design. The main design feature is that the travel surface is changed with every subgoal. This surface change, however, may not entirely fit with the designers' intentions. In particular there is the possibility of the attractor influencing the weight state transitions changing from being a global to a local minimum. In this event, the ERA method can no longer rely on passing from one global minimum to the next. The worst case is when the weight state is in a subgoal attractor basin which is that of a local minimum for the goal. In this situation, the local minimum is the most attractive state, whereafter no further progress can be made towards the goal through the linear subgoal chain.

This point is illustrated in Figures 1 & 2. Figure 1 is a stylised view of weight space with a local minimum as the current state and a global minimum. Figure 2 is a stylised view of the output space corresponding to the weight space in Figure 1. In Figure 2, various paths such as the line L1 attempted by ERA are not realisable. In particular, paths from local minima which monotonically decrease error with respect to the goal do not exist, no matter what surface variation occurs. Other paths such as L2 or L3, which initially lead away from the goal, exist but have an initial increase in error. The increase is not only with respect to the goal but to any subgoal along L1. Consequently, the weight state will converge to the local minimum rather than the global minimum when linear chaining is used.

Figure 1: Weight space with a local minimum W1 and global minimum W2. Dashed lines indicate error contours. L2 and L3 are paths from W1 to W2.

In short, the designers of ERA do not allow for the fact that its linear subgoal chain may not have an associated connecting path to the goal, no matter how small the step size is. The consequence is that ERA's main claim, to be guaranteed to succeed in avoiding local minima in a large class of problems, is undermined.
On the contrary, local minima are a major source of difficulty for ERA.

3 Empirical Examples of Goal-failure

In this section we construct a number of examples which show ERA failing to attain the global minimum for the goal. These examples fall into a variety of categories, principally those where the network does and does not contain hidden units. Some examples for each category are minimal and artificial in order to show the principles of construction clearly. In a further example we also use a larger and more irregular data set to indicate the scope of the counter examples. The general principle of construction behind all the examples will be the same, namely to place a local minimum for the goal in the way of the training path of weight states being trained by the linear subgoal chain. For simplicity, this is mostly done by placing local minima at or near a weight state satisfying the initial ERA subgoal. This makes it either certain or at least likely that the local minimum basin on the goal's error-weight surface contains ERA's mid-range weight state.

Figure 2: Output space with achievable outputs 1 and 2 corresponding to weight states W1 and W2 in Figure 1. The concentric circles denote error contour lines with respect to the goal at 2. L1 is an unrealisable path if 1 is a local minimum for the goal. L2 and L3 are realisable routes to the goal 2 from the local minimum, but not using a linear subgoal chain.

3.1 Example Construction Principles

By using a technique similar to one used by Brady [4] and examined in [5] it is possible to establish local minima in LMS error. Many of our problems differ from Brady's, though, in using inseparable data, so that the best fit to the data involves misclassification. The training set is divided between non-spoiler points and a relatively small set of spoiler points. Without the spoiler points, the non-spoiler points, by definition, meet their targets exactly at the global minimum. When the spoilers are added there is no weight state where the new data set meets all of its targets exactly. Furthermore, the spoilers are set so that there is, apart from the global minimum, at least one local minimum for the goal at or near mid-range states satisfying ERA's initial subgoal. The local minimum occurs where the points all have exact or near mid-range values. The global minimum has the non-spoilers nearly meeting their targets exactly, leaving the fewer spoilers with relatively high error. Figure 3, explained in detail below, illustrates a design for a 2-1 net where the points have exact mid-range values at the local minimum. The local minimum has zero weight values in this case, while the global minimum corresponds to a linear separation near the line H in Figure 3.

3.2 Counter Examples for Single-layer Nets

The following counter examples are for a 2-1 net, where a local minimum for the goal is at or near ERA's mid-range state. A weight state for the mid-range may be set to be a local minimum by first arranging the spoiler and non-spoiler points relative to one another so that

    \frac{\partial E}{\partial w_{ij}} = 0, \quad \forall i, j    (4)

where E is the LMS error over all the patterns and w_{ij} is a weight from input i to output unit j as defined in (3). To check that the zero-valued gradients are minima, 2nd order derivatives or convergence tests can be used. In order to ensure (4) holds we make the following observations. Firstly,

    \frac{\partial E}{\partial w_{ij}} = \sum_{p=1}^{P} \frac{\partial E_p}{\partial ex_{pj}} \frac{\partial ex_{pj}}{\partial w_{ij}} = \sum_{p=1}^{P} \delta_{pj} \, inp_{pi} = \sum_{p \in A} \delta_{pj} \, inp_{pi} + \sum_{p \in B} \delta_{pj} \, inp_{pi}    (5)

where E_p is the LMS error for pattern p, ex_{pj} is the excitation of output unit j for pattern p, and inp_{pi} is the input i for pattern p.
\delta_{pj} is defined for each pattern p and output unit j with a sigmoidal activation function as

    \delta_{pj} = \frac{\partial E_p}{\partial ex_{pj}} = -(T_{pj} - out_{pj}) \, out_{pj} (1 - out_{pj})    (6)

where T_{pj} is the target for pattern p and output unit j, and out_{pj} is the output of unit j for pattern p. If we set the number of patterns for each class to be equal, we establish the mid-range output value as

    out_{MR} = \frac{T_{Aj} + T_{Bj}}{2}    (7)

where T_{Aj} and T_{Bj} are the goal targets for patterns of class A and B respectively. Using (6) and (7) then yields

    \delta_{Aj} = -(T_{Aj} - out_{MR}) \, out_{MR} (1 - out_{MR}) = (T_{Bj} - out_{MR}) \, out_{MR} (1 - out_{MR}) = -\delta_{Bj}    (8)

For the bias weight we have therefore created zero gradients, because inp_{pi} = 1 in (5) for a bias unit i. In order to create zero-valued error-weight gradients for weights connected from all input lines i, we require in addition that

    \sum_{p \in A} inp_{pi} = \sum_{p \in B} inp_{pi}    (9)

Figure 3: An inseparable input point set for a 2-1 net which creates a local minimum for the goal at the mid-range output state. There are 4 points in each class: 7 non-spoilers and 1 spoiler. The global minimum corresponds to a linear separation near H.

The training set in Figure 3 is designed to obey (7) and (9) and so has zero-valued error-weight gradients at the mid-range state. In contrast, the points in Figure 4 have been generated in a more random way to make the example less artificial. The latter input point set no longer has a local minimum at the mid-range exactly, but merely close to the mid-range. It is possible to do this for single layer nets by moving points slightly relative to the positions determined by (9) in a random fashion.

Figure 5 shows a function approximation problem. This counter example is an exception to the design placing a local minimum at or near the mid-range state. The design is similar to one described in [4], which was modified so that the previous global minimum state was shifted in weight space and status to become a local minimum at the end of ERA's subgoal path.
The goal targets of 0.29 and 0.77 created the desired local minimum.
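Conditions (4)-(9) can be checked numerically: with equal class sizes, matched per-class input sums as in (9), and all weights zero (a mid-range state when the target mean is 0.5), the LMS error-weight gradient vanishes. A small sketch under those assumptions; the point coordinates are our own toy data, not the paper's Figure 3 set.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lms_gradient(X, T, w):
    """Gradient of E = 0.5 * sum_p (T_p - out_p)^2 for a single sigmoid unit.

    X : (P, n) inputs with a leading bias column of ones.
    Returns sum_p delta_p * inp_p, matching equation (5)."""
    out = sigmoid(X @ w)
    delta = -(T - out) * out * (1.0 - out)     # equation (6)
    return X.T @ delta

# Two classes of equal size whose per-class input sums match, per (9)
A = np.array([[1.0, 1.0], [-1.0, -1.0]])       # class A, targets 0.8
B = np.array([[1.0, -1.0], [-1.0, 1.0]])       # class B, targets 0.2
X = np.hstack([np.ones((4, 1)), np.vstack([A, B])])
T = np.array([0.8, 0.8, 0.2, 0.2])

w_mid = np.zeros(3)               # mid-range state: every output is 0.5
g = lms_gradient(X, T, w_mid)     # zero vector, by (8) and (9)
```

At the mid-range state each class A delta is the exact negative of each class B delta, so with matched input sums every gradient component cancels, as the derivation shows.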

Figure 4: An inseparable input point set for a 2-1 net which creates a local minimum for the goal near the mid-range output state. There are 25 points in each class, including the 4 class A spoilers. H1 and H2 denote the global and local minimum separation lines respectively.

3.3 Counter Examples for Multi-layer Nets

The following counter example is for a 2-2-1 net. Figure 6 shows an example which has a local minimum for the goal at the mid-range state. Zero error-weight gradients are again created through equations. The mid-range weight states given by (3) constrain links from the hidden units j to carry the same output K to all other units over all the patterns. For the output units we therefore observe that

    \frac{\partial E}{\partial w_{jk}} = \sum_{p=1}^{P} \delta_{pk} \, out_{pj} = \sum_{p \in A} \delta_{pk} K + \sum_{p \in B} \delta_{pk} K    (10)

where w_{jk} is the weight from hidden unit j to output unit k, ex_{pk} is the excitation of output unit k for pattern p, and \delta_{pk} is in essence the same output unit delta as defined in (6), with k substituted for j in (6). With the same substitution made in (7) we obtain

    \delta_{Ak} = -\delta_{Bk}    (11)

From (10) and (11) we can therefore see that for all links from the bias unit and hidden units we have zero-valued error-weight gradients for the output unit. Now we require

    \frac{\partial E}{\partial w_{ij}} = \sum_{p=1}^{P} \delta_{pj} \, out_{pi} = 0    (12)

where w_{ij} are weights to hidden unit j from previous layers, and out_{pi} is the output of unit i to a hidden unit for pattern p. \delta_{pj} is a hidden unit delta for unit j. For a sigmoidal activation function it can be written as

    \delta_{pj} = \sum_{k} \delta_{pk} w_{jk} \, out_{pj} (1 - out_{pj})    (13)

where out_{pj} is the output of hidden unit j for pattern p.

Figure 5: A function approximation example for a 2-1 net which creates a local minimum associated with the line H1 at the end of ERA's path from the mid-range to the goal. The global minimum hyperplane is near the line H2.

Figure 6: A training set for a 2-2-1 net which creates a local minimum at the mid-range output state. There are 8 points in each class.
The global minimum occurs near the 2 hidden unit hyperplanes associated with the lines H1 and H2. The local minimum hidden unit hyperplanes are parallel to the x-y plane. With w_{jk} and out_{pj} constant over all patterns p, it follows from (11) and (13) that

    \delta_{Aj} = -\delta_{Bj}    (14)

In conjunction, (12) and (14) applied recursively ensure that we have zero-valued error-weight gradients for all links from hidden units, and from the bias unit, to other hidden units. For links from input lines to hidden units to have zero-valued error-weight gradients, we now in addition require (9) to be satisfied. The input and target sets in Figure 6 have been designed to obey the conditions in (7) and (9) and hence have zero-valued error-weight gradients.
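The multi-layer conditions can be probed the same way: put a 2-2-1 net at a mid-range state per (3), use a class-balanced training set satisfying (7) and (9), and confirm that the backpropagated gradients of (10)-(13) vanish. A sketch with our own toy points, not Figure 6's set; when the target mean is 0.5, the all-zero weight state is one such mid-range state.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gradients_221(X, T, W1, W2):
    """LMS error-weight gradients for a 2-2-1 sigmoid net (bias column in X).

    W1 : (3, 2) input->hidden weights; W2 : (3, 1) hidden->output weights
    (row 0 of each is the bias weight). Returns (dE/dW1, dE/dW2) using the
    deltas of equations (6) and (13)."""
    H = sigmoid(X @ W1)                          # hidden outputs
    Hb = np.hstack([np.ones((len(X), 1)), H])    # prepend bias unit
    out = sigmoid(Hb @ W2)
    d_out = -(T - out) * out * (1.0 - out)       # output deltas, equation (6)
    d_hid = (d_out @ W2[1:].T) * H * (1.0 - H)   # hidden deltas, equation (13)
    return X.T @ d_hid, Hb.T @ d_out             # equations (12) and (10)

# class-balanced points with matching per-class input sums, per (7) and (9)
X = np.hstack([np.ones((4, 1)),
               np.array([[1.0, 1.0], [-1.0, -1.0],     # class A, target 0.8
                         [1.0, -1.0], [-1.0, 1.0]])])  # class B, target 0.2
T = np.array([[0.8], [0.8], [0.2], [0.2]])

# mid-range state from equation (3): all weights zero when the target mean is 0.5
g1, g2 = gradients_221(X, T, np.zeros((3, 2)), np.zeros((3, 1)))
```

Here the hidden outputs carry the constant K = 0.5 over all patterns, the output deltas cancel in class pairs per (11), and the hidden deltas are zero outright since the hidden-to-output weights are zero, so both gradient matrices vanish.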

3.4 Experiments

The problems were all run from 25 different random initial weight-states. Training was done using back-propagation. The learning rate was set to be very low, with no momentum being used, in order to encourage robust, i.e. successful, training where possible. The tolerance for determining successful training on both subgoal and goal targets was problem dependent and was set in order to distinguish between final local and global minimum states. The goal target values were 0.2 and 0.8, apart from the function approximation example, where they were 0.29 and 0.77. For ERA, 10 and 100 step subgoal chains were tested.

3.4.1 Single Layer Net Results

In this section we present the results for the various test problems with a short discussion. The results summarised in Table 1 show ERA failing completely on all single layer net problems, whereas standard (Std) training can find the global minimum with up to 32% success. One can see that if ERA's mid-range state is a local minimum for the goal, ERA is bound to fail to reach the global minimum (data set 1). Failure also occurs when the mid-range state lies near a local minimum for the goal (data set 2), or when a local minimum for the goal lies at the end of the training path directed by the linear subgoal chain (data set 3). Failing to reach the global minimum results in severe misclassification for the linearly inseparable problems (1 and 2). For the function approximation example (data set 3), failure to reach the global minimum produces poor approximation.

    Data set | Training method | Learning rate | % finding global | Average cycles for success
    1        | Std             | 0.1           | 8                | 906.5
    1        | ERA             | 0.1           | 0                | fails
    2        | Std             | 0.005         | 4                | 10060.00
    2        | ERA             | 0.005         | 0                | fails
    3        | Std             | 0.1           | 32               | 78.60
    3        | ERA             | 0.1           | 0                | fails

    Table 1: Results for single layer net experiments. Data sets 1, 2 and 3 refer to the training sets displayed in Figures 3, 4 and 5 respectively.

3.4.2 Multi Layer Net Results

Table 2 shows the results for the multi-layer net experiments.
One can see that for data set 4, which has zero-valued error-weight gradients, ERA has almost a 50% chance of finding the global minimum, whereas for the single layer counter examples ERA fails completely. Unlike the single layer case, there is more than one mid-range weight state now, only some of which both obey (7) and (9) and are local minima. It would appear that roughly 50% of the mid-range weight states are local minima, since we get roughly 50% failure for ERA.

    Data set | Training method | Learning rate | Step size | % finding global | Average cycles for success
    4        | Std             | 0.1           | N/A       | 72               | 99539.00
    4        | ERA             | 0.1           | 0.1       | 48               | 474907.84
    4        | ERA             | 0.1           | 0.01      | 48               | 758429.80

    Table 2: Results for multi layer net experiments. Data set 4 refers to the counter example displayed in Figure 6.

It should be noted that when ERA succeeds in finding the global minimum in data set 4 it takes a lot longer than a standard unchained technique, due to initially shallow gradients at ERA's starting weight state. On data set 4, ERA takes roughly 2 to 4 times as many cycles as standard training and has a significantly lower success rate.

4.0 Conclusions

The successful construction and testing of counter examples for linear subgoal chaining leads to the question of why the success and failure rates for ERA's linear subgoal chaining can be so extreme, i.e. 100% success for XOR and 100% failure for some counter examples. We surmise that the answer lies in ERA's starting conditions in particular and in the mechanism of linear subgoal chaining in general. ERA's starting conditions are such that it is necessary to pass through or near the mid-range state to begin with for every problem. This means that the outcome is dependent on the mid-range state rather than the initial weight-state. We have found that for XOR the first subgoal after the mid-range state is exactly realisable. The same is true for all subsequent subgoals, with a successful path to the goal being the result. Hence all weight initialisations yield success.
The counter examples causing failure placed a local minimum for the goal on ERA's travel path so that the state transitions converge towards it and become stuck. This resulted in ERA's failure every time because, once convergent towards a local minimum for the goal, there is no significant progress towards the remaining subgoals.

Despite appearances to the contrary, linear subgoal chaining is very similar to the standard unchained approach in the basis for its success and failure. That is, while it undoubtedly generates different travel paths to those of the unchained approach, due to using varying error-weight surfaces, it has similar attractor basins. What we mean by this is that linear subgoal chaining may be treated as having an underlying travel surface which is an amalgamation of the varying error-weight surfaces. The basins of this surface may differ in shape but are the same in number and have the same attractors. Success or failure for ERA therefore depends on whether the mid-range state is in the goal's attractor basin on the underlying travel surface or not, and, step-size issues apart, nothing else. That is why it is all-or-none for some problems. Linear subgoal chaining without the mid-range starting condition and initialised randomly may be expected to have a more variable success rate, depending on the basin distribution for a problem.

It has become clear that the linear subgoal chaining technique employed in [1] can fail to reach the global minimum, just as standard unchained training can. Nonetheless we believe subgoal chaining potentially remains an attractive approach provided the realisability of subgoals can be taken into account.

5. Non-linear Subgoal Chaining

The counter example in Figure 3 was used to test the feasibility of a non-linear subgoal chaining approach. The subgoal chains were derived from outputs taken along weight-space paths between the local and global minimum for the goal. The local and global minimum weight states were obtained beforehand in multiple training runs using a standard unchained approach.

The results for the test with non-linear chains are displayed in Table 3. It shows that the non-linear subgoal chaining method (NLSG) manages to obtain the global minimum in 100% of the trials for data set 1, on which standard training succeeds in 8% of the trials and ERA in 0%. This is a major improvement on the standard unchained method and ERA. In terms of training cycles the non-linear chains perform similarly to standard training (see Table 1, data set 1).

    Data set | Training method | Learning rate | % finding global | Average cycles for success
    1        | NLSG            | 0.1           | 100              | 98.72

    Table 3: Results for non-linear subgoal chaining experiments. The training set for data set 1 is displayed in Figure 3.

The potential of non-linear subgoal chaining is confirmed by the experimental runs. We believe that such non-linear subgoal chains may be found by performing adaptive subgoal chain shaping during training, according to progress and realisability criteria.

6. Summary

This paper has presented results to show that linear subgoal chaining such as used in [1] cannot overcome the local minimum problem. In doing so we have provided a mathematical model for designing training sets which have a local minimum at ERA's starting mid-range state, and have proposed non-linear subgoal chaining as a feasible technique to overcome local minima. Non-linear subgoal chaining has been shown to be potentially very useful in overcoming local minima without causing any substantial loss in training speed. Such chaining is currently being developed, and as our research progresses we intend to report on the value of this technique.

References

[1] Gorse, D., Shepherd, A. J. and Taylor, J. G. The new ERA in supervised learning. Neural Networks, 10(2):343-352, 1995.

[2] Cetin, B. C., Barhen, J. and Burdick, J. W. Terminal repeller unconstrained subenergy tunneling (TRUST) for fast global optimization. Journal of Optimization Theory and Applications, 77:97-126, 1993.

[3] Weir, M. and Fernandes, A. Tangent Hyperplanes and Subgoals as a Means of Controlling Direction in Goal Finding. Proceedings of the World Conference on Neural Networks, Vol III:438-443, San Diego, California, 1994.

[4] Brady, M. L. et al. Back-Propagation Fails to Separate Where Perceptrons Succeed. IEEE Transactions on Circuits and Systems, 36(5), May 1989.

[5] Sontag, E. D. and Sussmann, H. J. Back-propagation can give rise to spurious local minima even for networks without hidden layers. Complex Systems, 3:91-106, 1989.
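The non-linear chain idea above, reading subgoal targets off the network outputs along a weight-space path between the two minima, can be sketched as follows. The straight-line weight interpolation and all names are our own illustration; the paper derived its paths from prior training runs rather than a simple segment, so this is only a sketch of the principle that each such subgoal is realisable by construction.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def nonlinear_subgoal_chain(w_local, w_global, X, n_steps=10):
    """Derive subgoal target sets from outputs along a weight-space path.

    Unlike ERA's linear target interpolation, each subgoal here is the
    network output at an intermediate weight state, so every subgoal is
    achievable by some weight state by construction. A straight segment
    in weight space stands in for the trained paths used in the paper.
    """
    chain = []
    for step in range(1, n_steps + 1):
        t = step / n_steps
        w = (1.0 - t) * w_local + t * w_global   # point on the weight path
        chain.append(sigmoid(X @ w))             # realisable output subgoal
    return chain

# single sigmoid unit, bias column plus two inputs (toy data, ours)
X = np.hstack([np.ones((4, 1)),
               np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]])])
chain = nonlinear_subgoal_chain(np.zeros(3), np.array([0.0, 4.0, 4.0]), X, 10)
```

Training toward each chained target set in turn then follows the realisable route from the local minimum (the all-zero state) toward the chosen goal state, in the spirit of paths L2 and L3 in Figure 2.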