Proceedings of IJCNN '93, pp. 276-279, Oct. 1993, Nagoya, Japan.

A GENETIC ALGORITHM FOR TRAINING RECURRENT NEURAL NETWORKS

V. Petridis, S. Kazarlis and A. Papaikonomou
Dept. of Electrical Eng., Faculty of Engineering, University of Thessaloniki, Box 438, Thessaloniki 54006, GREECE

Abstract

A hybrid genetic algorithm is proposed for training neural networks with recurrent connections. A fully connected recurrent ANN model is employed and tested on a number of different problems. Simulation results are presented for three problems: generation of a stable limit cycle, sequence recognition, and storage and reproduction of temporal sequences.

1. Introduction

Although recurrent ANN models seem promising for solving problems associated with time, they suffer from a lack of efficient training algorithms. A number of algorithms have been proposed in the past [1-4] for different models of ANNs with recurrent connections, but these training algorithms appear to have a limited scope. In this paper we present a hybrid genetic algorithm for training recurrent ANNs which is robust and exhibits enhanced training abilities on a range of difficult problems.

2. Network model

We assume a fully connected recurrent neural network that consists of sigmoid units. Let W denote the weight matrix of the network; the topology of the ANN is shown in figure 1. The dimension of W is $n \times (n+m+1)$, where n is the total number of units and m is the number of input lines (the neuron thresholds are trainable). The total number of weights is $N = n(n+m+1)$. If $y_j(t)$ is the output of the jth unit at time t and $x_i(t)$ is the value of the ith input line at the same time, then the total input to the kth unit at time t is given by the summation

$$S_k(t) = \sum_{j=1}^{n} w_{kj}\, y_j(t) + \sum_{i=n+1}^{n+m+1} w_{ki}\, x_{i-n}(t) \qquad (1)$$

where $x_{m+1}(t) = 1$ is an extra input that controls the threshold. The output of every unit at time t is

$$y_k(t) = f(S_k(t)) \qquad (2)$$

where f is the typical hyperbolic tangent function.

Figure 1. [Topology of the fully connected recurrent ANN.]

Let Y denote the set of units (indexed by k) for which there exists a desired target value $d_k(t)$ for every time step, and let

$$J(t) = \frac{1}{2} \sum_{k \in Y} \left[ e_k(t) \right]^2 \qquad (3)$$

denote the overall network error at time t, where $e_k(t) = d_k(t) - y_k(t)$ measures the distance between the output $y_k$ and the desired output $d_k$ at the same time.
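To make the dynamics of equations (1)-(3) concrete, the following minimal sketch implements one synchronous simulation step in Python/NumPy. The shapes, the tanh nonlinearity and the handling of the constant threshold input follow the definitions above; all variable and function names are our own.

```python
import numpy as np

def step(W, y, x):
    """One synchronous update of the fully connected recurrent network.

    W : (n, n+m+1) weight matrix; the last column holds the trainable thresholds.
    y : (n,)  current unit outputs y_j(t).
    x : (m,)  current external input lines x_i(t).
    Returns the unit outputs for the next time step, per equations (1)-(2).
    """
    # Augment with the constant extra input x_{m+1}(t) = 1 that carries the threshold.
    z = np.concatenate([y, x, [1.0]])   # [y_1..y_n, x_1..x_m, 1]
    S = W @ z                           # S_k(t), equation (1)
    return np.tanh(S)                   # y_k(t) = f(S_k(t)), equation (2)

def error_at_t(y, d, output_units):
    """Instantaneous error J(t) of equation (3) over the target units in Y."""
    e = d - y[output_units]             # e_k(t) = d_k(t) - y_k(t)
    return 0.5 * np.sum(e ** 2)
```

Accumulating `error_at_t` over the presentation period yields the total error used as the training criterion below.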

We want to adjust the weights $w_{kj}$ in matrix W so that the total error

$$J_T = \sum_{t=1}^{T} J(t)$$

becomes less than a predefined quantity, where $J_T$ is the total network error calculated over the period T of the desired output signal, or the presentation time T of a temporal sequence. The training algorithm that has been used is presented in the following section.

3. The training algorithm

The training method used is based on Genetic Algorithms (GAs), first proposed by Holland [7] and more recently reviewed and enhanced by others [8-9]. GAs are conceptually based on natural genetic and evolution mechanisms working on populations of solutions, in contrast to other search techniques that work on a single solution. GAs search not the real-parameter solution space but a bit-string encoding of it. In this way they mimic natural chromosome genetics by applying genetic-like operators in the search for the global optimum. In this paper GAs search for the optimum set of real weights for a recurrent ANN in a variety of problems.

The training algorithm tries to find the optimum N-dimensional weight vector for the given problem. Every weight is encoded as a 16-bit string (an unsigned 16-bit integer), which gives $2^{16} = 65536$ different values for every weight in the real range (-2.0 ... 2.0), resulting in a resolution of about $6.1 \times 10^{-5}$. The N 16-bit integers define an N-dimensional integer-coded weight vector, $Z = [z_1, \ldots, z_N]^T$, which defines what we call the network weight point in the weight space. These N 16-bit strings are concatenated to form a solution bit string of N x 16 bits called the genotype. Each genotype is decoded uniquely to an N-dimensional weight vector called the phenotype. The resulting genotype space is vast: for example, for a 35-connection ANN the genotype strings are 35 x 16 = 560 bits long, giving a search space of $2^{560} \approx 3.77 \times 10^{168}$ different values.

According to the GAs' principles, a population of genotypes must initially be generated at random. After the production of M such genotypes they are all evaluated with the following procedure: the genotype is decoded to a weight vector and then the performance of the ANN is evaluated on the specific problem. $J_T$ is taken as the fitness quality of the particular genotype. Genetic evolution takes place by means of three genetic operators:

a) Roulette-wheel parent selection. Two genotypes are selected from the parent population with probability proportional to their fitness values.

b) Crossover. If a probability test is passed, the two genotypes are combined (exchange bits) to form a new genotype which incorporates characteristics from both parent genotypes. The produced genotype (offspring) is a member of the next generation's population.

c) Mutation. With a small probability, random bits of the offspring genotype flip from 0 to 1 and vice versa, introducing characteristics that do not exist in the parent population.

The above procedure is repeated M times to give M new offspring genotypes. To this population we add the best parent genotype (a technique called elitism), forming a new population that wholly replaces the parents. This elitism mechanism guarantees that a good solution, once found, cannot be lost. The production of an offspring population is called a generation. Many such generations are required for the population to converge to an optimum solution.
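The encoding and the generation cycle just described can be sketched in Python as follows. The 16-bit quantization, the weight range (-2.0, 2.0), roulette-wheel selection, crossover, bitwise mutation and elitism follow the text; the population size M, the operator probabilities, and the fitness mapping (the GA maximizes fitness while $J_T$ is an error, so some transformation such as $1/(1+J_T)$ must be assumed) are placeholders of our own.

```python
import numpy as np

BITS, LO, HI = 16, -2.0, 2.0        # 16 bits per weight over the range (-2.0, 2.0)

def decode(genotype, N):
    """Map a genotype of N*16 bits to its phenotype: an N-dim weight vector."""
    weights = np.empty(N)
    for i in range(N):
        chunk = genotype[i * BITS:(i + 1) * BITS]
        z = int("".join(map(str, chunk)), 2)            # unsigned 16-bit integer z_i
        weights[i] = LO + (HI - LO) * z / (2**BITS - 1)  # quantized real weight
    return weights

def roulette(pop, fits, rng):
    """Roulette-wheel selection: probability proportional to (positive) fitness."""
    return pop[rng.choice(len(pop), p=fits / fits.sum())]

def generation(pop, fitness, p_cross=0.8, p_mut=0.001, rng=np.random.default_rng()):
    """One generation: M offspring plus the best parent (elitism)."""
    fits = np.array([fitness(g) for g in pop])
    offspring = []
    for _ in range(len(pop)):
        a, b = roulette(pop, fits, rng), roulette(pop, fits, rng)
        child = a.copy()
        if rng.random() < p_cross:                      # crossover probability test
            cut = rng.integers(1, len(a))               # single-point bit exchange
            child = np.concatenate([a[:cut], b[cut:]])
        flip = rng.random(len(child)) < p_mut           # bitwise mutation mask
        child[flip] ^= 1
        offspring.append(child)
    offspring.append(pop[np.argmax(fits)].copy())       # elitism: keep best parent
    return offspring
```

Iterating `generation` and decoding the best genotype reproduces the evolution loop of the text; the crossover and mutation probabilities shown are illustrative defaults, not values from the paper.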

The above algorithm is more or less a classic implementation of GAs. Although this scheme is proven effective in finding near-optimal solutions, it needs a very large number of generations to converge: it is capable of finding the "basin" of the global optimum, but thereafter it proceeds extremely slowly towards the global optimum [6]. In order to accelerate the search, we introduce a new operator called "phenotype mutation". This operator is applied only to the best genotype of every generation. It performs a local direct search in the neighborhood of the best genotype in an effort to find a better point, using the network weight point defined by the best genotype as the base point of the search (i.e. the point where the search starts). This operator increases to a great extent the speed of convergence of the GA, especially when the error surface does not have many discontinuities. The search procedure is summarized as follows:

Step 1. Select the best genotype as a base point (denoted by $Z_0$). Define steps $s_1 (=1)$, $s_2 (=5)$, $s_3 (=-1)$ and $s_4 (=-5)$. Set p = 1.

Step 2. Sequentially explore the four points $Z_{pi} = Z_{p-1} + \Delta_{pi}$, where $\Delta_{pi} = [\delta_{p1}, \ldots, \delta_{pN}]^T \cdot s_i$, for i = 1...4, and $\delta_{pr}$ is the Kronecker delta: $\delta_{pr} = 1$ if p = r and $\delta_{pr} = 0$ if p ≠ r. Move the base point to the first successful point (that is, the point that improves the performance of the ANN); the new base point is denoted by $Z_p$. If no point is successful, the new base point is $Z_p = Z_{p-1}$.

Step 3. Increase p by 1 and repeat step 2 until p = N. When p = N the search has gone through all N weights and the current base point is a possible improvement over $Z_0$; $Z_N$ is the offspring that results from phenotype mutation.

It is obvious that for every generation the new operator adds at most N x 4 fitness evaluations. In return, it accelerates, to a great extent, the search towards the global optimum.

It is worth mentioning that the two classic genetic operators, crossover and mutation, compete over convergence: crossover forces convergence while mutation forces diversity in the population. Therefore a balance between the two operators should be maintained. To this end, we have implemented a method for the adaptive on-line determination of their probabilities. For every generation we calculate statistics on the deviation of the population's fitness from the best genotype's fitness. If too many genotypes evaluate to a fitness near the population's best (premature convergence), there is no progress due to genotype similarity; when this happens, the crossover probability is lowered while the mutation probability is strengthened. The opposite happens if too many genotypes have fitness far from the population's best (too much diversity): in such a situation there is no progress because only very few (the best) genotypes have good qualities, hence crossover does not effectively produce better solutions. In this way genotype diversity is always kept at a reasonable level, avoiding both premature convergence and excessive diversity.

The training algorithm terminates when one of two things happens: the algorithm finds a solution equal to or better than a pre-estimated satisfactory near-optimum solution, or the population converges so that it does not produce a different solution over a given number of generations.
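The phenotype-mutation operator amounts to a coordinate-wise local search around the best weight vector. Below is a minimal sketch of steps 1-3, operating directly on the decoded weight vector for simplicity; the step sizes (±1 and ±5 quantization units) follow our reading of the procedure above, and `evaluate` is a placeholder fitness function (higher is better).

```python
import numpy as np

def phenotype_mutation(Z0, evaluate, steps=(1, 5, -1, -5), resolution=6.1e-5):
    """Local direct search around the best genotype's weight point Z0.

    For each weight p = 1..N, try the four perturbed points
    Z_{p-1} + delta_p * s_i (s_i in quantization units) and move the base
    point to the first one that improves the fitness, per steps 1-3.
    """
    Z = Z0.copy()
    best = evaluate(Z)
    for p in range(len(Z)):             # step 3: sweep all N weights
        for s in steps:                 # step 2: the four trial points
            trial = Z.copy()
            trial[p] += s * resolution  # perturb only coordinate p
            f = evaluate(trial)
            if f > best:                # first successful point becomes Z_p
                Z, best = trial, f
                break                   # otherwise Z_p = Z_{p-1}
    return Z                            # Z_N: the phenotype-mutation offspring
```

Since each coordinate tries at most four points, the cost is at most N x 4 fitness evaluations per generation, matching the bound stated above.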

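The adaptive determination of the two operator probabilities can be sketched as follows. The paper gives no explicit formulas, so the deviation statistic, the thresholds and the adjustment rate in this sketch are purely illustrative assumptions.

```python
def adapt_probabilities(fits, p_cross, p_mut, low=0.05, high=0.5, rate=0.05):
    """Adjust crossover/mutation probabilities from population diversity.

    fits: fitness values of the current generation. The spread of the
    population relative to its best member drives the adjustment; the
    measure and thresholds here are illustrative placeholders.
    """
    best = max(fits)
    spread = sum(best - f for f in fits) / (len(fits) * max(best, 1e-12))
    if spread < low:       # premature convergence: genotypes too similar
        p_cross -= rate    # weaken crossover ...
        p_mut += rate      # ... strengthen mutation to restore diversity
    elif spread > high:    # excessive diversity: crossover is ineffective
        p_cross += rate
        p_mut -= rate
    return min(max(p_cross, 0.0), 1.0), min(max(p_mut, 0.0), 1.0)
```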
4. Simulation Results

The previously described ANN model and training algorithm have been tested on three problems:

a) The first problem has been the generation of a stable limit cycle, in the form of two sinusoidal functions of different phases. The model chosen has had 6 neurons, two of which have been the output units y1 and y2. The input is kept constant at 0.2, and 6 samples per period have been used for training. After approximately 25 generations the outputs have been capable of following the sinusoidal oscillations with a negligible total error of $10^{-3}$. Concerning the stability of the limit cycle, we observed that small changes to the input level (±5%) left the output virtually unaffected. Moreover, in cases of excessive input-level changes, e.g. an input level of 0.7, the outputs were driven to a certain constant point; but when the input level was restored to the initial value of 0.2, the outputs quickly returned to the originally trained limit cycle, without any deviation. The whole model behavior is shown in Figure 2.

Figure 2. [Limit-cycle behaviour: outputs y1 and y2 for input = 0.2, for input = 0.7, and after the return to input = 0.2.]

b) The second task has been a sequence recognition problem. The network has had to recognize and classify 5 input sequences, chosen to be samples of sinusoidal functions of different frequencies. The number of network units in this case has been increased to 8, with one output unit which had to classify the input oscillation frequency among a set of 5 possible values. The input sinusoid signals were presented one by one, for five periods each. It required about 7 generations for the ANN to be trained. During the consulting (recall) mode, the ANN has been capable of recognizing the presented input perfectly.

c) The ANN has been trained to store and reproduce a temporal sequence consisting of a number of 4 x 4 pixel patterns [5]. The ANN implemented was an 8-neuron model, four of which have been chosen as output units. Each output unit has had to produce a single line of four pixels, thus each output value has been decoded into a 4-bit word (16 levels). The solution evaluation has been performed over the presentation of the whole sequence, continuously, a number of times. After approximately 45 generations the weight configuration found has been capable of reproducing the temporal sequence with zero Hamming distance from the original. It should be emphasized that the temporal sequence has been reproduced starting from any frame, without any grey states between frames. The frames of the learned temporal sequence are shown in Figure 3.

Figure 3. [Frames of the learned 4 x 4 pixel temporal sequence.]

5. Conclusions

In all three examples the behaviour of the trained ANN is very robust. This is an indication that the hybrid training algorithm has found a solution near the optimum. Moreover, it has been demonstrated that recurrent ANNs are capable of solving difficult time-series problems.

References

[1] L. B. Almeida, "Backpropagation in Perceptrons with Feedback", Neural Computers (Neuss 1987), pp. 199-208, 1987.
[2] F. J. Pineda, "Generalization of Backpropagation to Recurrent Neural Networks", Physical Review Letters, vol. 59, pp. 2229-2232, Nov. 1987.
[3] R. J. Williams and D. Zipser, "A Learning Algorithm for Continually Running Fully Recurrent Neural Networks", Neural Computation, vol. 1, pp. 270-280, 1989.
[4] B. A. Pearlmutter, "Learning State Space Trajectories in Recurrent Neural Networks", Neural Computation, vol. 1, pp. 263-269, 1989.
[5] M. Reiss and J. G. Taylor, "Storing Temporal Sequences", Neural Networks, vol. 4, pp. 773-787, 1991.
[6] V. Petridis, S. Kazarlis, A. Papaikonomou and A. Filelis, "A Hybrid Genetic Algorithm for Training Neural Networks", in Proceedings of ICANN '92, pp. 953-956, Sep. 1992, Brighton, England.
[7] J. H. Holland, "Outline for a Logical Theory of Adaptive Systems", Journal of the ACM, vol. 9, pp. 297-314, July 1962.
[8] D. E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, Reading, Mass.: Addison-Wesley, 1989.
[9] L. Davis (ed.), Handbook of Genetic Algorithms, Van Nostrand Reinhold, New York, 1991.