A Population-Based Learning Algorithm Which Learns Both Architectures and Weights of Neural Networks†

Yong Liu and Xin Yao
Computational Intelligence Group, Department of Computer Science
University College, The University of New South Wales
Australian Defence Force Academy, Canberra, ACT, Australia 2600
Email: xin@csadfa.cs.adfa.oz.au

† Published in Proc. of ICYCS'95 Workshop on Soft Computing, ed. X. Yao and X. Li, pp. 29-38, July 1995. To appear in Chinese Journal of Advanced Software Research (Allerton Press, Inc., New York, NY 10011), Vol. 3, No. 1, 1996. This work is supported by the Australian Research Council through its small grant scheme and by a University College Special Research Grant.

Abstract

One of the major issues in the field of artificial neural networks (ANNs) is the design of their architectures. There is strong biological and engineering evidence that the information processing capability of an ANN is determined by its architecture. This paper proposes a new population-based learning algorithm (PBLA) which learns both an ANN's architecture and its weights. An evolutionary approach is used to evolve a population of ANNs. Unlike other evolutionary approaches to ANN learning, each ANN (i.e., individual) in the population is evaluated by partial training rather than complete training. Substantial savings in computational cost can be achieved by such progressive partial training. The training process can change both an ANN's architecture and its weights. Our preliminary experiments have demonstrated the effectiveness of the algorithm.

1 Introduction

One of the major issues in the field of ANNs is the design of their architectures. There is strong biological and engineering evidence that the information processing capability of an ANN is determined by its architecture. Given a learning task, if the network is too small, it will not be capable of forming a good model of the problem. On the other hand, if the network is too big, the ANN may overfit the training data and have very poor generalisation ability. With little or no prior knowledge of the problem, one usually determines the architecture by trial and error. There is no systematic way to design a near-optimal architecture automatically for a given task.

Research on constructive and destructive algorithms is an effort towards the automatic design of architectures. Roughly speaking, a constructive algorithm starts with the smallest possible network and gradually increases its size until performance begins to level off, while a destructive algorithm does the opposite, i.e. it starts with the maximal network and deletes unnecessary layers, nodes and connections during training.

Design of the optimal architecture for an ANN can be formulated as a search problem in the architecture space, where each point represents an architecture.

Given some performance criteria about architectures, the performance level of all architectures forms a surface in this space. Optimal architecture design is then equivalent to finding the highest point on this surface. Because the surface is infinitely large, nondifferentiable, complex, deceptive and multimodal, evolutionary algorithms are a better candidate for searching it than the constructive and destructive algorithms mentioned above.

Because of the advantages of the evolutionary design of architectures, a lot of research has been carried out in recent years [1, 2]. In the evolution of architectures, each architecture is evaluated through back-propagation (BP) training. This process is often very time consuming and sensitive to the initial conditions of the training. Such evaluation of an architecture is also extremely noisy [2]. Furthermore, if the measure of fitness is the sum of squared errors on the training set, this method may generate networks that over-learn the training data. One way to avoid this is to add a complexity term, e.g. the number of connections in the architecture, to the fitness function. However, this penalty-term method tends to generate ANNs that are not able to learn; in the extreme case, a network might try to gain rewards by pruning off all of its connections.

In order to solve the above problems, this paper proposes a new population-based learning algorithm which learns both an ANN's architecture and its weights. An evolutionary approach is used to evolve a population of ANNs. Each individual of the population is evaluated by partial training using a modified BP. Because the network architectures in the population differ from each other, it is not suitable to keep the learning rate fixed for all individuals. In PBLA, we modify the classical BP by dynamically adapting the learning rate $\eta$ during training for each member of the population. When a parent network is selected for breeding from the population, PBLA first tests its performance to determine whether to continue training or to mutate the architecture. If the parent network is promising, PBLA continues training using the modified BP. Otherwise, PBLA switches from the modified BP to simulated annealing (SA). If SA still cannot make the network escape from the local minimum, PBLA mutates the architecture of the network to generate a new one. To speed up network optimisation, we apply the nonconvergent method [3] to guide mutation.

In Section 2, we describe PBLA at the population level and the individual level. Section 3 reports experimental results with PBLA on a number of parity problems. Finally, some conclusions are given in Section 4.

2 A Population-Based Learning Algorithm

2.1 Network Architecture

In the published literature on ANNs, a large number of structures have been considered and studied. These can be categorised into two broad classes: feedforward neural networks and recurrent networks. Here, a class of feedforward neural networks called generalised multilayer perceptrons is considered. The architecture of such a network is shown in Figure 1, where $X$ and $Y$ are the inputs and outputs respectively. We assume the following:

$$x_i = X_i, \qquad 1 \le i \le m \qquad (1)$$
$$net_i = \sum_{j=1}^{i-1} w_{ij} x_j, \qquad m < i \le N + n \qquad (2)$$
$$x_j = f(net_j), \qquad m < j \le N + n \qquad (3)$$
$$Y_i = x_{N+i}, \qquad 1 \le i \le n \qquad (4)$$

where $f$ is the following sigmoid function:

$$f(z) = 1/(1 + e^{-z}) \qquad (5)$$

Here $m$ and $n$ are the numbers of inputs and outputs respectively, and $N$ is a constant that can be any integer no less than $m$. The value of $N + n$ determines how many nodes are in the network (if we include the inputs as nodes).

Figure 1: A generalised multilayer perceptron.

In Figure 1, there are $N + n$ circles, representing all of the nodes in the network, including the input nodes. The first $m$ circles are really just copies of the inputs $X_1, \ldots, X_m$. Every other node in the network, such as node number $i$, which calculates $net_i$ and $x_i$, takes inputs from every node that precedes it in the network. Even the last output node, which generates $Y_n$, takes input from other output nodes, such as the one which outputs $Y_{n-1}$. In neural network terminology, this network is "fully connected" in the extreme.

It is generally agreed that it is inadvisable for a generalised multilayer perceptron to be fully connected. In this context, we may therefore raise the following question: given that a generalised multilayer perceptron should not be fully connected, how should the connections of the network be allocated? This question is of no major concern in small-scale applications, but it is certainly crucial to the successful application of BP to large-scale, real-world problems. However, there is no systematic way to design a near-optimal architecture automatically for a given task. Our present approach is to learn both the architectures and the weights of neural networks based on evolutionary algorithms. In PBLA, we choose the architecture and the weights $w_{ij}$ so as to minimise the squared error over a training set that contains $T$ patterns:

$$E = \sum_{t=1}^{T} E(t) = \frac{1}{2} \sum_{t=1}^{T} \sum_{i=1}^{n} [Y_i(t) - Z_i(t)]^2 \qquad (6)$$

where $Y_i(t)$ and $Z_i(t)$ are the actual and desired outputs of node $i$ for pattern $t$.

2.2 The Evolutionary Process

PBLA uses an evolutionary-programming-like algorithm to evolve a population of ANNs. The method works as follows:

Step 1: Randomly generate an initial population of $M$ feedforward neural networks. The number of hidden nodes and the initial connection density for each network in the population are chosen within certain ranges. The random initial weights are uniformly distributed inside a small range.

Step 2: Partially train each network in the population for a certain number of epochs using the modified BP. The number of epochs is fixed by a control parameter set by the user. The error $E$ of each network is checked after partial training. If $E$ has not been significantly reduced, the assumption is that the network is trapped in a local minimum, and the network is marked 'failure'; otherwise it is marked 'success'.

Step 3: Rank the networks in the population according to their error values, from the best to the worst.

Step 4: Use rank-based selection to pick one parent network from the population. If its mark is 'success', go to Step 5; otherwise go to Step 6.

Step 5: Partially train the parent network to obtain an offspring network and mark it in the same way as in Step 2. Insert this offspring into the ranked population, replacing its parent network. Go back to Step 4.

Step 6: Train the parent network with SA to obtain an offspring network. If SA reduces the error $E$ of the parent network significantly, mark the offspring network 'success' and insert it into the ranking, replacing its parent network; then go back to Step 4. Otherwise discard this offspring and go to Step 7.

Step 7: Delete hidden nodes.
1. Randomly delete hidden nodes from the parent network.
2. Partially train the pruned network to obtain an offspring network. If the offspring network is better than the worst network in the population, insert the former into the ranking and remove the latter; then go back to Step 4. Otherwise discard this offspring and go to Step 8.

Step 8: Delete connections.
1. Calculate the approximate importance of each connection in the parent network using the nonconvergent method. Randomly delete connections from the parent network according to the calculated importance.
2. Partially train the pruned network to obtain an offspring network and decide whether to accept or reject it in the same way as in Step 7. If the offspring network is accepted, go back to Step 4. Otherwise discard this offspring and go to Step 9.

Step 9: Add connections/nodes.
1. Calculate the approximate importance of each virtual connection with zero weight. Randomly add connections to the parent network according to the calculated importance to obtain Offspring 1.
2. Add new nodes to the parent network through splitting existing nodes to obtain Offspring 2.
3. Partially train Offspring 1 and Offspring 2, then choose the better one as the surviving offspring. Insert the surviving offspring into the ranking and remove the worst network from the population.

Step 10: Repeat Steps 4 to 9 until an acceptable network has been found or until a certain number of generations has been reached.

Evolutionary algorithms provide alternative approaches to the design of architectures. Such evolutionary approaches consist of two major stages. The first stage is to decide the genotype representation scheme of architectures. The second stage is the evolution itself, driven by evolutionary search procedures in which genetic operators have to be decided in conjunction with the representation scheme. The key issue is to decide how much information about an architecture should be encoded into the genotype representation. At one extreme, all the detail, i.e. every connection and node of an architecture, can be specified by the genotype representation. This kind of representation scheme is called a direct encoding scheme. At the other extreme, only the most important parameters of an architecture, such as the number of hidden layers and the number of hidden nodes in each layer, are encoded; the remaining details of the architecture are left to the training process to decide. This kind of representation scheme is called an indirect encoding scheme.

In the direct encoding scheme, each connection in an architecture is directly specified by its binary representation. For example, an $N \times N$ matrix $C = (c_{ij})_{N \times N}$ can represent an architecture with $N$ nodes, where $c_{ij}$ indicates the presence or absence of the connection from node $i$ to node $j$. We can use $c_{ij} = 1$ to indicate a connection and $c_{ij} = 0$ to indicate no connection. In fact, $c_{ij}$ can even be the connection weight from node $i$ to node $j$, so that both the topological structure and the connection weights of an ANN are evolved at the same time. Each such matrix has a direct one-to-one mapping to the corresponding architecture. Constraints on the architectures being explored can easily be incorporated into such a representation scheme by setting constraints on the matrix, e.g. a feedforward ANN will have nonzero entries only in the upper triangle of the matrix. Because the direct encoding scheme is relatively simple and straightforward to implement, we decided to use it to code our network architectures. However, the direct encoding scheme does not scale well, since large architectures require very large matrices to represent. To implement this representation scheme efficiently on a conventional computer, one would use a linked list to represent the connections actually implemented for each node. It is obvious that for sparse feedforward neural networks a lot of memory can be saved.

In PBLA, each network in the population is evaluated by partial training. The fitness is calculated as the sum of the squared errors over the training set. Since the evaluation of the networks is very expensive, PBLA adopts a rank-based selection mechanism to enhance selection pressure. Selection pressure has been shown to be a key factor in obtaining a near optimum. In PBLA, the networks in the population are first sorted in non-descending order according to their fitness. Let the $M$ sorted networks be numbered $0, 1, \ldots, M-1$. Then the $(M-j)$-th network is selected with probability

$$p(M-j) = \frac{j}{\sum_{k=1}^{M} k} \qquad (7)$$

The selected network is then manipulated by the following five mutations: partial training, deletion of hidden nodes, deletion of connections, addition of connections, and addition of nodes.
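
As an illustration of the linear ranking in Eq. (7), the following is a minimal Python sketch (our own, not code from the paper; the function name and the array-based representation are assumptions) of how one parent index could be drawn from a population that has already been sorted from best to worst.

```python
import numpy as np

def select_parent_index(M, rng):
    """Rank-based selection of Eq. (7): with networks sorted from best
    (index 0) to worst (index M-1), the (M-j)-th network is chosen with
    probability j / sum_{k=1}^{M} k, so the best network is the most likely
    parent and the worst network the least likely."""
    ranks = np.arange(M, 0, -1)        # index 0 gets rank M, index M-1 gets rank 1
    probs = ranks / ranks.sum()        # denominator is sum_{k=1}^{M} k
    return rng.choice(M, p=probs)

rng = np.random.default_rng(0)
parent = select_parent_index(20, rng)  # e.g. a population of M = 20 networks
```

With $M = 20$, as in Table 1, the best network is selected with probability 20/210 (about 0.095) and the worst with probability 1/210 (about 0.005).
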
In order to avoid unnecessary training and premature convergence, we adopt the following replacement strategy. If the offspring is obtained through progressive partial training, the algorithm accepts it and removes its parent network. If the offspring is obtained through deleting connections or nodes, the algorithm accepts it only when it is better than the worst network in the population; in that case, the algorithm removes the worst network. If the offspring is obtained through adding connections or nodes, the algorithm always accepts it and removes the worst network in the population.

2.3 Partial Training and Evaluation

BP is currently the most popular algorithm for the supervised training of ANNs, and there have been successful applications of BP algorithms in various areas. However, it is well known that finding optimal weights using the classical BP is very slow. Numerous heuristic optimisation algorithms have been proposed to improve the convergence speed of the classical BP. Although most of these have been somewhat successful, they usually introduce additional parameters which must be varied from one problem to another and, if not chosen properly, can actually slow the rate of convergence.

In the classical BP, the learning rate $\eta$ is kept fixed throughout training. The learning rate is often very small in order to prevent oscillations and ensure convergence. However, a very small fixed value for $\eta$ slows down the rate of convergence of BP. The use of a fixed learning rate may not suit the evolutionary design of architectures. Because the individuals in the population all differ from each other, a learning rate appropriate for one network is not necessarily appropriate for other networks in the population. Every network should have its own individual learning rate. Unfortunately, the search for a good fixed learning rate can itself become a hard problem. In PBLA, learning is accelerated through learning rate adaptation. The initial learning rates $\eta_i$ ($i = 1, \ldots, M$) of all individuals in the initial population have the same value. Each individual then adjusts its learning rate within a certain range during the evolutionary process according to a simple heuristic: during partial training, the error $E$ is checked after every $k$ epochs; if $E$ decreases, the learning rate is increased, otherwise the learning rate is reduced and the new weights and error are discarded.

Another drawback of BP is due to its gradient-descent nature. BP often gets trapped in a local minimum of the error function and is very inefficient at finding a global minimum if the error function is multimodal and nondifferentiable. There are two ways of escaping a local minimum. One is to mutate the network architecture. The other is to adopt global optimisation methods to train the network. It is worth pointing out that the capability of an ANN depends not only on the network architecture but also on the weights. When a network is trapped in a local minimum, it is not clear whether this is due to the weights or to an inappropriate network architecture. In order to find a smaller network, PBLA first switches from the modified BP to SA in order to find better weights. Only when SA fails to improve the error $E$ does PBLA start to mutate the network architecture.

2.4 Architecture Mutation

An issue in the evolution of architectures is when and how the architectures should be mutated. In PBLA, when the hybrid algorithm that combines the modified BP and SA fails to improve the error $E$ of the parent network, the algorithm starts to mutate its architecture. The mutation is divided into a deletion phase and an addition phase. The architecture is first mutated by deleting hidden nodes or connections. If the new network is better than the worst network in the population, it is accepted.
Otherwise, the algorithm adds connections or hidden nodes to the network and then chooses the better of the two resulting offspring to survive. The selection of which node to remove or split is uniform over the collection of hidden nodes. The deletion of a node involves the complete removal of the node and all of its incident connections.

In order to preserve the knowledge acquired by the parent network, hidden nodes are added to the parent network through splitting existing nodes. The two nodes obtained by splitting an existing node $i$ have the same connections as the existing node. The weights of these new nodes take the following values:

$$w^1_{ij} = w^2_{ij} = w_{ij}, \qquad j < i \qquad (8)$$
$$w^1_{ki} = (1 + \alpha)\, w_{ki}, \qquad k > i \qquad (9)$$
$$w^2_{ki} = -\alpha\, w_{ki}, \qquad k > i \qquad (10)$$

where $w$ is the weight vector of the existing node $i$, $w^1$ and $w^2$ are the weight vectors of the two new nodes, and $\alpha$ is a mutation factor which may take either a fixed or a random value.

The addition or deletion of a connection depends on the importance of the connection in the network. The simplest approach is to delete the smallest weight in the network. This, however, is not always the best approach, since the solution can be quite sensitive to that weight. The nonconvergent method measures the importance of weights by the final weight test variables, based on significance tests for deviations from zero in the weight update process [3]. Denoting by $\Delta w^t_{ij}(w) = -\partial L_t / \partial w_{ij}$ the local gradient of the linear error function $L = \sum_{t=1}^{T} \sum_{i=1}^{n} |Y_i(t) - Z_i(t)|$ with respect to example $t$ and weight $w_{ij}$, the significance of the deviation of $w_{ij}$ from zero is defined by the test variable

$$\text{test}(w_{ij}) = \frac{\sum_{t=1}^{T} \xi^t_{ij}}{\sqrt{\sum_{t=1}^{T} \left(\xi^t_{ij} - \bar{\xi}_{ij}\right)^2}} \qquad (11)$$

where $\xi^t_{ij} = w_{ij} + \Delta w^t_{ij}(w)$ and $\bar{\xi}_{ij}$ denotes the average over the set $\{\xi^t_{ij},\ t = 1, \ldots, T\}$. A large value of the test variable $\text{test}(w_{ij})$ indicates high importance of the connection with weight $w_{ij}$. The advantage of the nonconvergent method is that it does not require the training process to converge, so we can use it to measure the relevance of connections during the evolutionary process. At the same time, since these test variables can also be calculated for weights that have already been set to zero, they can be used to determine which connections should be added to the network.

3 Experiments

In order to test the efficiency of PBLA, we applied it to the N-bit parity problem with N ranging from 4 to 8. The parity problem is a very difficult problem because the most similar patterns (those which differ by a single bit) require different answers. In the N-bit parity problem, the required output is 1 if the input pattern contains an odd number of 1s, and 0 otherwise. All $2^N$ patterns were used in training. PBLA was run with the parameters shown in Table 1.

Table 1: The parameter set used in the PBLA experiments

    Population size                              20
    Initial number of hidden nodes               2-N
    Initial connection density                   0.75
    Initial learning rate                        0.5
    Range of learning rate                       0.1-0.6
    Number of epochs for each partial training   100
    Number of mutated hidden nodes               1-2
    Number of mutated connections                1-3
    Number of temperatures in SA                 5
    Number of iterations at each temperature     100

In solving the N-bit parity problem for N = 4 to 8, the Cascade-Correlation algorithm requires (2, 2-3, 3, 4-5, 5-6) hidden nodes respectively [4]; the Perceptron Cascade algorithm requires (2, 2, 3, 3, 4) hidden nodes respectively [5]; and the tower algorithm requires N/2 hidden nodes [6]. The first algorithm uses Gaussian hidden nodes, while the last two use linear threshold nodes. All networks constructed by the above algorithms use short-cut connections. Using a single hidden layer, FNNCA can construct neural networks with (3, 4, 5, 5) hidden nodes that solve this problem for N = 4 to 7 [7].

Based on ten runs of PBLA for each value of N, the best networks obtained are summarised in Table 2, where "number of epochs" indicates the total number of learning epochs taken by PBLA up to the point when the best network was obtained.

Table 2: Summary of results obtained with PBLA

    Problem instance               Parity-4      Parity-5      Parity-6      Parity-7      Parity-8
    Number of        Min           2             2             3             3             3
    hidden nodes     Max           3             4             4             5             6
                     Mean          2.3           2.5           3.2           3.4           4.6
                     SD            0.483         0.707         0.422         0.699         1.188
    Number of        Min           13            17            28            31            38
    connections      Max           16            27            37            48            76
                     Mean          14.5          20.6          30.3          34.7          55.0
                     SD            1.080         3.718         3.368         5.122         14.262
    Number of        Min           4950          17100         62050         103850        97150
    epochs           Max           27250         38200         312150        283700        401850
                     Mean          14625         30245         132525        177417        249625
                     SD            7332          7416          82728         72834         127562
    Error of         Min           8.3 x 10^-6   1.1 x 10^-2   1.5 x 10^-3   4.2 x 10^-4   3.9 x 10^-4
    networks         Max           1.4 x 10^-3   5.0 x 10^-2   6.1 x 10^-2   3.2 x 10^-2   2.1 x 10^-2
                     Mean          5.0 x 10^-4   1.4 x 10^-2   1.2 x 10^-2   8.9 x 10^-3   5.2 x 10^-3
                     SD            3.5 x 10^-3   1.6 x 10^-2   1.8 x 10^-2   9.5 x 10^-3   7.1 x 10^-3

Figure 2 shows an optimum network obtained by PBLA for the 7-bit parity problem. Remarkably, PBLA can solve the 8-bit parity problem with a network having only 3 hidden nodes (Figure 3). The parameters of the networks of Figures 2 and 3 are given in Tables 3 and 4, where "T" indicates the thresholds of the hidden and output nodes, and the entry in row $i$ and column $j$ is the weight of the connection from node $j$ to node $i$.

Table 3: Parameters for the network of Figure 2

    Node     T       1       2       3       4       5       6       7       8       9      10
    8     -42.8   -32.4     0      -5.6   -23.6   -33.6   -33.6    41.6     0       0       0
    9     -75.3   -32.0    43.2   -41.1   -34.5   -34.8   -34.8    39.8   -58.9     0       0
    10    -85.0   -28.1    28.6   -28.0   -28.0   -28.2   -28.2    29.3   -47.6   -41.3     0
    11     59.9    12.9   -13.5    13.0    13.0    13.0    13.0   -13.4     0       0      81.8

It is clear that PBLA is superior to the existing constructive algorithms in terms of the size of the networks it produces. PBLA not only yields appropriate architectures, it can also generate optimal ones.
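
To make the architecture mutations of Section 2.4 concrete, the following minimal Python sketch (our own illustration, not code from the paper) implements the node-splitting rule of Eqs. (8)-(10) and the connection-importance test variable of Eq. (11). The weight-matrix convention, the function names and the value of the mutation factor are assumptions.

```python
import numpy as np

def split_node(W, i, alpha=0.4):
    """Split node i following Eqs. (8)-(10).

    W[a, b] is the weight of the connection from node b into node a (zero
    means no connection), so W is lower triangular for a feedforward net.
    The clone of node i is inserted at index i + 1; later nodes shift by one.
    alpha is the mutation factor (fixed here at an arbitrary 0.4).
    """
    n = W.shape[0]

    def new_index(k):               # map an old node index to its index in the new net
        return k if k <= i else k + 1

    Wn = np.zeros((n + 1, n + 1))
    for a in range(n):
        for b in range(n):
            Wn[new_index(a), new_index(b)] = W[a, b]
    clone = i + 1
    Wn[clone, :] = Wn[i, :]                              # Eq. (8): copy incoming weights
    for k in range(n):
        if k > i and W[k, i] != 0.0:                     # outgoing connections of node i
            Wn[new_index(k), i] = (1.0 + alpha) * W[k, i]   # Eq. (9)
            Wn[new_index(k), clone] = -alpha * W[k, i]      # Eq. (10)
    return Wn

def connection_importance(w_ij, local_updates):
    """Eq. (11): nonconvergent test variable for one weight, given the local
    weight updates Delta w^t_ij collected over the T training examples."""
    xi = w_ij + np.asarray(local_updates, dtype=float)   # xi^t_ij = w_ij + Delta w^t_ij
    return xi.sum() / np.sqrt(((xi - xi.mean()) ** 2).sum())
```

In PBLA the connections to delete or add are then chosen at random, biased by these importance values, as described in Steps 8 and 9 of Section 2.2.
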

Figure 2: An optimum network for the 7-bit parity problem (input nodes 1-7, hidden nodes 8-10, output node 11).

Figure 3: An optimum network for the 8-bit parity problem (input nodes 1-8, hidden nodes 9-11, output node 12).
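
The networks in Figures 2 and 3 are generalised multilayer perceptrons as defined in Section 2.1. As a concrete illustration of the forward computation in Eqs. (1)-(5), here is a minimal Python sketch (our own; it uses 0-based indexing and omits the node thresholds listed in Tables 3 and 4, whose exact convention is not spelled out above).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))           # Eq. (5)

def forward(W, X, m, n_out):
    """Forward pass of a generalised multilayer perceptron (Eqs. (1)-(4)).

    W[i, j] is the weight from node j into node i (only j < i may be nonzero),
    the first m nodes are the inputs and the last n_out nodes are the outputs.
    """
    total = W.shape[0]                        # N + n nodes in all
    x = np.zeros(total)
    x[:m] = X                                 # Eq. (1): input nodes copy the inputs
    for i in range(m, total):
        net_i = W[i, :i] @ x[:i]              # Eq. (2): sum over all preceding nodes
        x[i] = sigmoid(net_i)                 # Eq. (3)
    return x[total - n_out:]                  # Eq. (4): outputs are the last n_out nodes
```

For the network of Figure 3, W would be a 12 x 12 lower-triangular matrix filled from Table 4 (with the thresholds handled according to whatever bias convention the original implementation uses), with m = 8 inputs and n_out = 1 output.
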

Table 4: Parameters for the network of Figure 3

    Node     T       1       2       3       4       5       6       7       8       9      10      11
    9     -12.4    25.2    27.7   -29.4   -28.9   -29.7   -25.4   -28.5    27.8     0       0       0
    10    -40.4    19.6    18.9   -18.1   -19.1   -18.5   -17.3   -18.8    20.4   -67.6     0       0
    11    -48.1    16.0    16.1   -15.9   -16.3   -15.8   -15.9   -15.8    16.7   -55.0   -26.7     0
    12     45.7   -10.0   -11.0    10.0     9.9     9.4    10.0     9.6   -11.4     6.8     2.3    76.3

4 Conclusion

A population-based learning algorithm has been proposed to dynamically generate a near-optimal feedforward neural network for the task at hand. Unlike other evolutionary approaches to ANN learning, each ANN in the population is evaluated by partial training rather than complete training. The training process can change both an ANN's architecture and its weights. Our preliminary experiments have demonstrated the effectiveness of the algorithm.

Global search procedures such as evolutionary algorithms are usually computationally expensive to run. It is nevertheless beneficial to introduce global search into the design of ANNs, especially when little prior knowledge is available and high performance is required, because trial-and-error and other heuristic methods are very inefficient in such circumstances. There have already been some experiments which demonstrate the advantages of hybrid global/local search, but the issue of an optimal combination of different search procedures needs further investigation. With the increasing power of parallel computers, the evolutionary design of large ANNs becomes feasible. The evolutionary process offers a new way to discover possible new ANN architectures.

References

[1] J. D. Schaffer, D. Whitley, and L. J. Eshelman. Combinations of genetic algorithms and neural networks: a survey of the state of the art. In D. Whitley and J. D. Schaffer, editors, Proceedings of the International Workshop on Combinations of Genetic Algorithms and Neural Networks (COGANN-92), pp. 1-37. IEEE Computer Society Press, Los Alamitos, CA, 1992.

[2] X. Yao. Evolutionary artificial neural networks. International Journal of Neural Systems, 4(3):203-222, 1993.

[3] W. Finnoff, F. Hergert, and H. G. Zimmermann. Improving model selection by nonconvergent methods. Neural Networks, 6:771-783, 1993.

[4] S. E. Fahlman and C. Lebiere. The cascade-correlation learning architecture. In D. S. Touretzky, editor, Advances in Neural Information Processing Systems 2, pp. 524-532. Morgan Kaufmann, San Mateo, CA, 1990.

[5] N. Burgess. A constructive algorithm that converges for real-valued input patterns. International Journal of Neural Systems, 5(1):59-66, 1994.

[6] J.-P. Nadal. Study of a growth algorithm for a feedforward network. International Journal of Neural Systems, 1:55-59, 1989.

[7] R. Setiono and L. C. K. Hui. Use of a quasi-Newton method in a feedforward neural network construction algorithm. IEEE Transactions on Neural Networks, 6(1):273-277, 1995.