A Population-Based Learning Algorithm Which Learns Both Architectures and Weights of Neural Networks†

Yong Liu and Xin Yao
Computational Intelligence Group, Department of Computer Science
University College, The University of New South Wales
Australian Defence Force Academy, Canberra, ACT, Australia 2600
Email: xin@csadfa.cs.adfa.oz.au

Abstract

One of the major issues in the field of artificial neural networks (ANNs) is the design of their architectures. There is strong biological and engineering evidence that the information processing capability of an ANN is determined by its architecture. This paper proposes a new population-based learning algorithm (PBLA) which learns both an ANN's architecture and its weights. An evolutionary approach is used to evolve a population of ANNs. Unlike other evolutionary approaches to ANN learning, each ANN (i.e., individual) in the population is evaluated by partial training rather than complete training. Substantial savings in computational cost can be achieved by such progressive partial training. This training process can change both an ANN's architecture and its weights. Our preliminary experiments have demonstrated the effectiveness of the algorithm.

1 Introduction

One of the major issues in the field of ANNs is the design of their architectures. There is strong biological and engineering evidence that the information processing capability of an ANN is determined by its architecture. Given a learning task, if the network is too small, it will not be capable of forming a good model of the problem. On the other hand, if the network is too big, the ANN may overfit the training data and have very poor generalisation ability. With little or no prior knowledge of the problem, one usually determines the architecture by trial and error. There is no systematic way to design a near optimal architecture automatically for a given task.
This work is supported by the Australian Research Council through its small grant scheme and by a University College Special Research Grant.
† Published in Proc. of ICYCS'95 Workshop on Soft Computing, ed. X. Yao and X. Li, pp. 29-38, July 1995. To appear in Chinese Journal of Advanced Software Research (Allerton Press, Inc., New York, NY 10011), Vol. 3, No. 1, 1996.

Research on constructive and destructive algorithms is an effort towards the automatic design of architectures. Roughly speaking, a constructive algorithm starts with the smallest possible network and gradually increases its size until performance begins to level off, while a destructive algorithm does the opposite, i.e., starts with a maximal network and deletes unnecessary layers, nodes and connections during training. Design of the optimal architecture for an ANN can be formulated as a search problem in the architecture space, where each point represents an architecture. Given some performance criteria
about architectures, the performance levels of all architectures form a surface in this space, and optimal architecture design is equivalent to finding the highest point on this surface. Because the surface is infinitely large, nondifferentiable, complex, deceptive and multimodal, evolutionary algorithms are a better candidate for searching it than the constructive and destructive algorithms mentioned above. Because of the advantages of the evolutionary design of architectures, a lot of research has been carried out in recent years [1, 2]. In the evolution of architectures, each architecture is evaluated through back-propagation (BP) training. This process is often very time consuming and sensitive to the initial conditions of training. Such evaluation of an architecture is also extremely noisy [2]. Furthermore, if the measure of fitness is the sum of squared errors on the training set, this method may generate networks that overlearn the training data. One way to avoid this is to add a complexity term, e.g. the number of connections in the architecture, to the fitness function. However, this penalty term method tends to generate ANNs that are not able to learn; in the extreme case, a network might try to gain rewards by pruning off all of its connections. In order to solve the above problems, this paper proposes a new population-based learning algorithm which learns both an ANN's architecture and its weights. An evolutionary approach is used to evolve a population of ANNs. Each individual of the population is evaluated by partial training using a modified BP. Because the network architectures in the population differ from each other, it is not suitable to keep the learning rate η fixed for all individuals. In PBLA, we modify the classical BP by dynamically adapting η during training for each member of the population.
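The per-individual adaptation of η can be sketched as follows. The bounds 0.1-0.6 match the learning-rate range in Table 1; the grow/shrink factors and the accept/reject return flag are illustrative assumptions, not values given in the paper:

```python
def adapt_learning_rate(eta, new_error, old_error,
                        eta_min=0.1, eta_max=0.6,
                        grow=1.05, shrink=0.6):
    """One adaptation step for a single network in the population.

    If the error decreased over the last k epochs, increase eta and
    accept the new weights; otherwise decrease eta and signal that
    the new weights and error should be discarded.  The grow/shrink
    factors are illustrative assumptions.
    """
    if new_error < old_error:
        return min(eta * grow, eta_max), True    # keep new weights
    return max(eta * shrink, eta_min), False     # revert weights
```

Each network in the population carries its own η, so a rate that suits a small sparse network does not get imposed on a large dense one.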
When a parent network is selected for breeding from the population, PBLA first tests its performance to determine whether to continue training or to mutate the architecture. If the parent network is promising, PBLA continues training with the modified BP. Otherwise, PBLA switches from the modified BP to simulated annealing (SA). If SA still cannot make the network escape from the local minimum, PBLA mutates the architecture of the network to generate a new one. To speed up network optimisation, we apply the nonconvergent method [3] to guide mutation. In Section 2, we describe PBLA at the population level and the individual level. Section 3 reports experimental results with PBLA on a number of parity problems. Finally, some conclusions are given in Section 4.

2 A Population-Based Learning Algorithm

2.1 Network Architecture

In the published literature on ANNs, a large number of structures have been considered and studied. These can be categorised into two broad classes: feedforward neural networks and recurrent networks. Here, a class of feedforward neural networks called generalised multilayer perceptrons is considered. The architecture of such a network is shown in Figure 1, where X and Y are the inputs and outputs respectively. We assume the following:

    x_i = X_i,                              1 \le i \le m        (1)
    net_i = \sum_{j=1}^{i-1} w_{ij} x_j,    m < i \le N + n      (2)
    x_j = f(net_j),                         m < j \le N + n      (3)
    Y_i = x_{N+i},                          1 \le i \le n        (4)
where f is the sigmoid function

    f(z) = 1 / (1 + e^{-z})    (5)

m and n are the numbers of inputs and outputs respectively, and N is a constant integer no less than m. The value of N + n determines how many nodes are in the network (if we include the inputs as nodes).

Figure 1: A generalised multilayer perceptron.

In Figure 1, there are N + n circles, representing all of the nodes in the network, including the input nodes. The first m circles are really just copies of the inputs X_1, ..., X_m. Every other node in the network, such as node i, which calculates net_i and x_i, takes inputs from every node that precedes it in the network. Even the last output node, which generates Y_n, takes input from the other output nodes, such as the one which outputs Y_{n-1}. In neural network terminology, this network is "fully connected" in the extreme. It is generally agreed that it is inadvisable for a generalised multilayer perceptron to be fully connected. In this context, we may therefore raise the following question: given that a generalised multilayer perceptron should not be fully connected, how should the connections of the network be allocated? This question is of no major concern for small-scale applications, but it is certainly crucial to the successful application of BP to large-scale, real-world problems. However, there is no systematic way to design a near optimal architecture automatically for a given task. Our approach is to learn both the architectures and the weights of neural networks based on evolutionary algorithms. In PBLA, we choose the architecture and the weights w_{ij} so as to minimise the squared error over a training set containing T patterns:

    E = \sum_{t=1}^{T} E(t) = \frac{1}{2} \sum_{t=1}^{T} \sum_{i=1}^{n} [Y_i(t) - Z_i(t)]^2    (6)

where Y_i(t) and Z_i(t) are the actual and desired outputs of node i for pattern t.
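For concreteness, the forward pass of equations (1)-(5) can be sketched as below. The weight-matrix layout and the omission of node thresholds are simplifying assumptions of this sketch, not part of the paper's formulation:

```python
import numpy as np

def forward(X, w, m, N, n):
    """Forward pass of a generalised multilayer perceptron, eqs. (1)-(5).

    w is an (N+n) x (N+n) matrix with w[i, j] the weight of the
    connection from node j to node i; only entries with j < i are used,
    so the network is feedforward.  Node thresholds are omitted for
    brevity.
    """
    x = np.zeros(N + n)
    x[:m] = X                                # eq. (1): input nodes copy X
    for i in range(m, N + n):
        net = w[i, :i] @ x[:i]               # eq. (2): sum over all predecessors
        x[i] = 1.0 / (1.0 + np.exp(-net))    # eqs. (3) and (5): sigmoid
    return x[N:]                             # eq. (4): the last n nodes are outputs
```

With all weights zero, every non-input node outputs f(0) = 0.5, which gives a quick sanity check of the implementation.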
2.2 The Evolutionary Process

PBLA uses an evolutionary-programming-like algorithm to evolve a population of ANNs. The method works as follows:
Step 1 Randomly generate an initial population of M feedforward neural networks. The number of hidden nodes and the initial connection density of each network in the population are chosen within certain ranges. The random initial weights are uniformly distributed inside a small range.

Step 2 Partially train each network in the population for a certain number of epochs using the modified BP. The number of epochs is fixed by a control parameter set by the user. The error E of each network is checked after partial training. If E has not been significantly reduced, the assumption is that the network is trapped in a local minimum, and the network is marked `failure'. Otherwise it is marked `success'.

Step 3 Rank the networks in the population according to their error values (from the best to the worst).

Step 4 Use rank-based selection to pick one parent network from the population. If its mark is `success', go to Step 5. Otherwise go to Step 6.

Step 5 Partially train the parent network to obtain an offspring network and mark it in the same way as in Step 2. Insert this offspring into the population ranking, replacing its parent network. Go back to Step 4.

Step 6 Train the parent network with SA to obtain an offspring network. If SA reduces the error E of the parent network significantly, mark the offspring network `success' and insert it into the ranking, replacing its parent network, then go back to Step 4. Otherwise discard this offspring and go to Step 7.

Step 7 Delete hidden nodes.
1. Randomly delete hidden nodes from the parent network.
2. Partially train the pruned network to obtain an offspring network. If the offspring network is better than the worst network in the population, insert the former into the ranking and remove the latter, then go back to Step 4. Otherwise discard this offspring and go to Step 8.

Step 8 Delete connections.
1. Calculate the approximate importance of each connection in the parent network using the nonconvergent method.
Randomly delete connections from the parent network according to the calculated importance.
2. Partially train the pruned network to obtain an offspring network and decide whether to accept or reject it in the same way as in Step 7. If the offspring network is accepted, go back to Step 4. Otherwise discard this offspring and go to Step 9.

Step 9 Add connections/nodes.
1. Calculate the approximate importance of each virtual connection with zero weight. Randomly add connections to the parent network according to the calculated importance to obtain Offspring 1.
2. Add new nodes to the parent network through splitting existing nodes to obtain Offspring 2.
3. Partially train Offspring 1 and Offspring 2, then choose the better one as the surviving offspring. Insert the surviving offspring into the ranking and remove the worst network from the population.

Step 10 Repeat Step 4 to Step 9 until an acceptable network has been found or until a certain number of generations has been reached.

Evolutionary algorithms provide alternative approaches to the design of architectures. Such evolutionary approaches consist of two major stages. The first stage is to decide the genotype representation scheme of architectures. The second stage is the evolution itself, driven by evolutionary search procedures in which genetic operators have to be decided in conjunction with the representation scheme. The key issue is to decide how much information about an architecture should be encoded into the genotype representation. At one extreme, all the detail, i.e. every connection and node of an architecture, can be specified by the genotype representation. This kind of representation scheme is called the direct encoding scheme. At the other extreme, only the most important parameters of an architecture, such as the number of hidden layers and the number of hidden nodes in each layer, are encoded; other details about the architecture are left to the training process to decide. This kind of representation scheme is called the indirect encoding scheme. In the direct encoding scheme, each connection in an architecture is directly specified by its binary representation. For example, an N x N matrix C = (c_{ij}) can represent an architecture with N nodes, where c_{ij} indicates the presence or absence of a connection from node i to node j: c_{ij} = 1 indicates a connection and c_{ij} = 0 indicates no connection. In fact, c_{ij} can even be the connection weight from node i to node j, so that both the topological structure and the connection weights of an ANN are evolved at the same time. Each such matrix has a direct one-to-one mapping to the corresponding architecture.
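As a small illustration of the direct encoding scheme, the sketch below generates a random binary connectivity matrix and keeps only its strict upper triangle, which constrains the encoded network to be feedforward. The function name and the seeded NumPy generator are our own illustrative choices, not part of the paper:

```python
import numpy as np

def random_direct_encoding(num_nodes, density, seed=0):
    """Direct encoding sketch: c[i, j] = 1 encodes a connection from
    node i to node j.  Zeroing everything except the strict upper
    triangle (i < j) rules out self-loops and recurrent connections,
    so the matrix always encodes a feedforward architecture.
    """
    rng = np.random.default_rng(seed)
    c = (rng.random((num_nodes, num_nodes)) < density).astype(int)
    return np.triu(c, k=1)   # keep only i < j entries
```

The `density` argument plays the role of the initial connection density used when generating the initial population in Step 1.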
Constraints on the architectures being explored can easily be incorporated into such a representation scheme by setting constraints on the matrix, e.g. a feedforward ANN will have nonzero entries only in the upper triangle of the matrix. Because the direct encoding scheme is relatively simple and straightforward to implement, we decided to use it to encode our network architectures. However, the direct encoding scheme does not scale well, since large architectures require very large matrices to represent. To implement this representation scheme efficiently on a conventional computer, one would use a linked list holding the connections actually implemented for each node; for sparse feedforward neural networks this saves a lot of memory. In PBLA, each network in the population is evaluated by partial training. The fitness is calculated as the sum of squared errors over the training set. Since the evaluation of the networks is very expensive, PBLA adopts a rank-based selection mechanism to enhance selection pressure, which has been demonstrated to be a key factor in obtaining a near optimum. In PBLA, the networks in the population are first sorted in non-descending order according to their fitness. Let the M sorted networks be numbered 0, 1, ..., M - 1. Then the (M - j)th network is selected with probability

    p(M - j) = \frac{j}{\sum_{k=1}^{M} k}    (7)

The selected network is then manipulated by the following five mutations: partial training, node deletion, connection deletion, connection addition and node addition. In order to avoid unnecessary training and premature convergence, we adopt the following replacement strategy. If the offspring is obtained through progressive partial training, the algorithm accepts it and removes its parent network. If the offspring is obtained through deleting connections or nodes, the algorithm accepts
it only when it is better than the worst network in the population; in that case, the algorithm removes the worst network. If the offspring is obtained through adding connections or nodes, the algorithm always accepts it and removes the worst network in the population.

2.3 Partial Training and Evaluation

BP is currently the most popular algorithm for the supervised training of ANNs, and there have been many successful applications of BP in various areas. However, it is well known that finding optimal weights with the classical BP is very slow. Numerous heuristic optimisation algorithms have been proposed to improve the convergence speed of the classical BP. Although most of these have been somewhat successful, they usually introduce additional parameters which must be varied from one problem to another and which, if not chosen properly, can actually slow the rate of convergence. In the classical BP, the learning rate η is kept fixed throughout training. The learning rate is often very small in order to prevent oscillations and ensure convergence. However, a very small fixed value of η slows down the convergence of BP. The use of a fixed learning rate may also not suit the evolutionary design of architectures: because all individuals in the population differ from each other, a learning rate appropriate for one network is not necessarily appropriate for the others, so every network should have its own learning rate. Unfortunately, the search for a good fixed learning rate can itself become a hard problem. In PBLA, learning is accelerated through learning rate adaptation. The initial learning rates η_i (i = 1, ..., M) of all individuals in the initial population have the same value. Each individual adjusts its learning rate within a certain range during the evolutionary process according to a simple heuristic principle. During partial training, the error E is checked after every k epochs. If E decreases, the learning rate is increased.
Otherwise, the learning rate is reduced; in the latter case the new weights and error are discarded. Another drawback of BP is its gradient descent nature: BP often gets trapped in a local minimum of the error function and is very inefficient at finding a global minimum when the error function is multimodal and nondifferentiable. There are two ways of escaping a local minimum. One is to mutate the network architecture; the other is to adopt global optimisation methods to train the network. It is worth pointing out that the capability of an ANN depends not only on the network architecture but also on the weights; when a network is trapped in a local minimum, it is not clear whether this is due to the weights or to an inappropriate architecture. In order to find a smaller network, PBLA first switches from the modified BP to SA in order to find better weights. Only when SA fails to improve the error E does PBLA start to mutate the network architecture.

2.4 Architecture Mutation

An issue in the evolution of architectures is when and how the architectures should be mutated. In PBLA, when the hybrid algorithm that combines the modified BP and SA fails to improve the error E of the parent network, the algorithm starts to mutate its architecture. The mutation is divided into a deletion phase and an addition phase. The architecture is first mutated by deleting hidden nodes or connections. If the new network is better than the worst network in the population, it is accepted. Otherwise, the algorithm adds connections or hidden nodes to the network and then chooses the better of the resulting offspring to survive. The selection of which node to remove or split is uniform over the collection of hidden nodes. The deletion of a node involves the complete removal of the node and all its incident connections. In order to preserve the knowledge achieved by the parent network, hidden
nodes are added to the parent network through splitting existing nodes. The two nodes obtained by splitting an existing node i have the same connections as the existing node. The weights of the new nodes take the following values:

    w^1_{ij} = w^2_{ij} = w_{ij},        j < i    (8)
    w^1_{ki} = (1 + \alpha) w_{ki},      k > i    (9)
    w^2_{ki} = -\alpha w_{ki},           k > i    (10)

where w is the weight vector of the existing node i, w^1 and w^2 are the weight vectors of the new nodes, and α is a mutation factor which may take either a fixed or a random value.

Table 1: The parameter set used in PBLA experiments

    Population size                               20
    Initial number of hidden nodes                2-N
    Initial connection density                    0.75
    Initial learning rate                         0.5
    Range of learning rate                        0.1-0.6
    Number of epochs for each partial training    100
    Number of mutated hidden nodes                1-2
    Number of mutated connections                 1-3
    Number of temperatures in SA                  5
    Number of iterations at each temperature      100

The addition or deletion of a connection depends on the importance of the connection in the network. The simplest approach is to delete the smallest weight in the network. This, however, is not always the best approach, since the solution could be quite sensitive to that weight. The nonconvergent method measures the importance of weights by final weight test variables based on significance tests for deviations from zero in the weight update process [3]. Denote by \Delta w^t_{ij}(w) = -\eta \, \partial L_t / \partial w_{ij} the weight update given by the local gradient of the linear error function L (L = \sum_{t=1}^{T} \sum_{i=1}^{n} |Y_i(t) - Z_i(t)|) with respect to example t and weight w_{ij}. The significance of the deviation of w_{ij} from zero is defined by the test variable

    test(w_{ij}) = \frac{\sum_{t=1}^{T} \xi^t_{ij}}{\sqrt{\sum_{t=1}^{T} (\xi^t_{ij} - \bar{\xi}_{ij})^2}}    (11)

where \xi^t_{ij} = w_{ij} + \Delta w^t_{ij}(w) and \bar{\xi}_{ij} denotes the average over the set \{\xi^t_{ij}, t = 1, ..., T\}. A large value of the test variable test(w_{ij}) indicates high importance of the connection with weight w_{ij}.
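The node-splitting mutation of equations (8)-(10) can be sketched as follows, assuming weights are stored in a matrix w where w[i, j] is the weight from node j to node i; the fixed value of the mutation factor alpha is an illustrative assumption:

```python
import numpy as np

def split_node(w, i, alpha=0.4):
    """Split node i into two nodes per eqs. (8)-(10).

    The duplicate node is inserted directly after node i.  Both new
    nodes receive node i's incoming weights (eq. 8); each outgoing
    weight w_ki is replaced by (1 + alpha) * w_ki and -alpha * w_ki
    (eqs. 9-10), so their summed effect on every downstream node k
    is unchanged.  alpha = 0.4 is an illustrative fixed value.
    """
    w2 = np.insert(w, i + 1, 0.0, axis=0)    # new (empty) row for the copy
    w2 = np.insert(w2, i + 1, 0.0, axis=1)   # new (empty) column for the copy
    w2[i + 1, :i + 1] = w2[i, :i + 1]        # eq. (8): duplicate incoming weights
    out = w2[i + 2:, i].copy()               # outgoing weights w_ki, k > i
    w2[i + 2:, i] = (1 + alpha) * out        # eq. (9)
    w2[i + 2:, i + 1] = -alpha * out         # eq. (10)
    return w2
```

Because (1 + α)w_{ki} − αw_{ki} = w_{ki}, the split network initially computes exactly the same function as its parent, which is how the mutation preserves the knowledge acquired so far.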
The advantage of the nonconvergent method is that it does not require the training process to converge, so we can use it to measure the relevance of connections during the evolutionary process. At the same time, since the test variables can be calculated for weights that have already been set to zero, they can also be used to determine which connections should be added to the network.

3 Experiments

In order to test the efficiency of PBLA, we applied it to the N-bit parity problem with N ranging from 4 to 8. The parity problem is a very difficult problem because the most similar patterns (those which differ by a single bit) require different answers. In the N-bit parity problem, the required output is 1 if the input pattern contains an odd number of 1s and 0 otherwise. All 2^N patterns were used in training. PBLA was run with the parameters shown in Table 1. In solving the N-bit parity problem for N = 4 to 8, the Cascade-Correlation algorithm requires (2, 2-3, 3, 4-5, 5-6) hidden nodes respectively [4]; the Perceptron Cascade algorithm requires (2,
2, 3, 3, 4) hidden nodes respectively [5]; and the tower algorithm requires N/2 hidden nodes [6]. The first algorithm uses Gaussian hidden nodes; the last two use linear threshold nodes. All networks constructed by the above algorithms use shortcut connections. Using a single hidden layer, FNNCA can construct neural networks with (3, 4, 5, 5) hidden nodes respectively that solve this problem for N = 4 to 7 [7]. Based on ten runs of PBLA for each value of N, statistics of the best networks obtained are summarised in Table 2, where "number of epochs" indicates the total number of learning epochs taken by PBLA up to the point where the best network was obtained. Figure 2 shows an optimum network obtained by PBLA for the 7-bit parity problem. Remarkably, PBLA can solve the 8-bit parity problem with a network having only 3 hidden nodes, as shown in Figure 3. The parameters of the networks of Figures 2-3 are given in Tables 3-4, where "T" indicates the thresholds of the hidden and output nodes.

Table 2: Summary of results obtained with use of PBLA

                           Parity-4   Parity-5   Parity-6   Parity-7   Parity-8
    Number of      Min          2          2          3          3          3
    hidden nodes   Max          3          4          4          5          6
                   Mean       2.3        2.5        3.2        3.4        4.6
                   SD       0.483      0.707      0.422      0.699      1.188
    Number of      Min         13         17         28         31         38
    connections    Max         16         27         37         48         76
                   Mean      14.5       20.6       30.3       34.7       55.0
                   SD       1.080      3.718      3.368      5.122     14.262
    Number of      Min       4950      17100      62050     103850      97150
    epochs         Max      27250      38200     312150     283700     401850
                   Mean     14625      30245     132525     177417     249625
                   SD        7332       7416      82728      72834     127562
    Error of       Min     8.3e-6     1.1e-2     1.5e-3     4.2e-4     3.9e-4
    networks       Max     1.4e-3     5.0e-2     6.1e-2     3.2e-2     2.1e-2
                   Mean    5.0e-4     1.4e-2     1.2e-2     8.9e-3     5.2e-3
                   SD      3.5e-3     1.6e-2     1.8e-2     9.5e-3     7.1e-3

Table 3: Parameters for the network of Figure 2 (threshold T and weights from nodes 1-10)

    Node     T      1      2      3      4      5      6      7      8      9     10
      8   -42.8  -32.4    0    -5.6  -23.6  -33.6  -33.6   41.6    0      0      0
      9   -75.3  -32.0   43.2  -41.1  -34.5  -34.8  -34.8   39.8  -58.9   0      0
     10   -85.0  -28.1   28.6  -28.0  -28.0  -28.2  -28.2   29.3  -47.6  -41.3   0
     11    59.9   12.9  -13.5   13.0   13.0   13.0   13.0  -13.4    0      0    81.8
It is clear that PBLA is superior to the existing constructive algorithms in terms of network size. PBLA not only yields appropriate architectures, but can also generate optimal ones.
Figure 2: An optimum network for the 7-bit parity problem.

Figure 3: An optimum network for the 8-bit parity problem.
Table 4: Parameters for the network of Figure 3 (threshold T and weights from nodes 1-11)

    Node     T      1      2      3      4      5      6      7      8      9     10     11
      9   -12.4   25.2   27.7  -29.4  -28.9  -29.7  -25.4  -28.5   27.8    0      0      0
     10   -40.4   19.6   18.9  -18.1  -19.1  -18.5  -17.3  -18.8   20.4  -67.6    0      0
     11   -48.1   16.0   16.1  -15.9  -16.3  -15.8  -15.9  -15.8   16.7  -55.0  -26.7    0
     12    45.7  -10.0  -11.0   10.0    9.9    9.4   10.0    9.6  -11.4    6.8    2.3   76.3

4 Conclusion

A population-based learning algorithm is proposed to generate near optimal feedforward neural networks dynamically for the task at hand. Unlike other evolutionary approaches to ANN learning, each ANN in the population is evaluated by partial training rather than complete training. This training process can change both an ANN's architecture and its weights. Our preliminary experiments have demonstrated the effectiveness of the algorithm. Global search procedures such as evolutionary algorithms are usually computationally expensive to run. It is nevertheless beneficial to introduce global search into the design of ANNs, especially when there is little prior knowledge available and high performance is required, because trial-and-error and other heuristic methods are very inefficient in such circumstances. There have already been experiments demonstrating the advantages of hybrid global/local search, but the issue of an optimal combination of different search procedures needs further investigation. With the increasing power of parallel computers, the evolutionary design of large ANNs becomes feasible. The evolutionary process offers a new way to discover possible new ANN architectures.

References

[1] J. D. Schaffer, D. Whitley, and L. J. Eshelman. Combinations of genetic algorithms and neural networks: a survey of the state of the art. In D. Whitley and J. D. Schaffer, editors, Proceedings of the International Workshop on Combinations of Genetic Algorithms and Neural Networks (COGANN-92), pp. 1-37. IEEE Computer Society Press, Los Alamitos, CA, 1992.

[2] X. Yao. Evolutionary artificial neural networks.
International Journal of Neural Systems, 4(3):203-222, 1993.

[3] W. Finnoff, F. Hergert, and H. G. Zimmermann. Improving model selection by nonconvergent methods. Neural Networks, 6:771-783, 1993.

[4] S. E. Fahlman and C. Lebiere. The cascade-correlation learning architecture. In D. S. Touretzky, editor, Advances in Neural Information Processing Systems 2, pp. 524-532. Morgan Kaufmann, San Mateo, CA, 1990.

[5] N. Burgess. A constructive algorithm that converges for real-valued input patterns. International Journal of Neural Systems, 5(1):59-66, 1994.
[6] J.-P. Nadal. Study of a growth algorithm for a feedforward network. International Journal of Neural Systems, 1:55-59, 1989.

[7] R. Setiono and L. C. K. Hui. Use of a quasi-Newton method in a feedforward neural network construction algorithm. IEEE Transactions on Neural Networks, 6(1):273-277, 1995.