Effect of the PSO Topologies on the Performance of the PSO-ELM

2012 Brazilian Symposium on Neural Networks

Effect of the PSO Topologies on the Performance of the PSO-ELM

Elliackin M. N. Figueiredo and Teresa B. Ludermir
Center of Informatics, Federal University of Pernambuco, Recife, Brazil
Email: {emnf,tbl}@cin.ufpe.br

Abstract: In recent years, the Extreme Learning Machine (ELM) has been hybridized with Particle Swarm Optimization (PSO). This hybridization is named PSO-ELM. In most of these hybridizations, the PSO uses the global topology. However, other topologies were designed to improve the performance of the PSO. The performance of the PSO depends on its topology, and there is no single best topology for all problems. Thus, in this paper, we investigate the effect of the PSO topology on the performance of the PSO-ELM. In our study, we investigate five topologies: Global, Local, Von Neumann, Wheel and Four Clusters. The results showed that the Global topology can be more promising than all other topologies.

Keywords: Extreme Learning Machine; Particle Swarm Optimization; topology

I. INTRODUCTION

Extreme Learning Machine (ELM) is an efficient learning algorithm for single-hidden layer feedforward neural networks (SLFNs) [1], created to overcome the drawbacks of gradient-based methods. In recent years, Particle Swarm Optimization (PSO) has been combined with the ELM to improve the generalization capacity of SLFNs. In 2006, Xu and Shu presented an evolutionary ELM based on PSO, named PSO-ELM, and applied it to a prediction task [2]. In that work, the PSO was used to optimize the input weights and hidden biases of the SLFN, and the objective function used to evaluate each particle was the root mean squared error (RMSE) on the validation set. In 2011, Han et al. [3] proposed an improved extreme learning machine, named IPSO-ELM, which uses an improved PSO with adaptive inertia to select the input weights and hidden biases of the SLFN. Unlike the PSO-ELM, the IPSO-ELM optimizes the input weights and hidden biases according to both the RMSE on the validation set and the norm of the output weights. Thus, this algorithm can obtain good performance with a more compact and better-conditioned SLFN than other ELM approaches. In the same year, Saraswathi et al. [4] proposed a combination of an Integer-Coded Genetic Algorithm (ICGA), PSO and ELM for cancer classification. In that work, the ICGA is used for feature selection, while the PSO-ELM classifier is used to find the optimal input weights such that the ELM classifier can discriminate the cancer classes efficiently. This PSO-ELM was proposed to handle sparse and highly imbalanced data sets.

In all these works, the social topology used in the PSO is the global one. In this topology, all particles are fully connected; that is, each particle can communicate with every other particle, and each particle is attracted towards the best solution found by the entire swarm. Besides the global topology, other topologies were designed to improve the performance of the PSO. As observed in [5], the performance of the PSO depends strongly on its topology, and there is no outright best topology for all problems. Thus, it is important to investigate the effect of the PSO topologies on the performance of the PSO-ELM in the training of SLFNs. Studies on the effect of the PSO topology on the performance of the PSO in training neural networks have been carried out in the past [6].
More recently, in 2010, van Wyk and Engelbrecht investigated overfitting in neural networks trained with the PSO using three topologies: Global, Local and Von Neumann [7]. The objective of this paper is to carry out a similar study in the context of the PSO-ELM; that is, to investigate the effect of different PSO topologies on the performance of the PSO-ELM classifier. In this paper, we study five topologies: Global, Local, Von Neumann, Wheel and Four Clusters. The experimental results showed that the Global topology can be more promising than all other topologies.

The rest of this paper is organized as follows. Section II presents the ELM algorithm. Section III presents the PSO algorithm and the topologies used in this work. The experimental arrangement is detailed in Section IV. Section V discusses the experimental results. The paper is concluded in Section VI along with future work.

II. EXTREME LEARNING MACHINE

Extreme Learning Machine (ELM) is a simple learning algorithm for single-hidden layer feedforward neural networks (SLFNs) proposed by Huang et al. [1]. This learning algorithm not only learns much faster, with higher generalization performance, than traditional gradient-based learning algorithms, but also avoids many difficulties faced by gradient-based methods, such as stopping criteria, learning rate, learning epochs, and local minima [8]. ELM works as follows.

For $N$ distinct training samples $(\mathbf{x}_i, \mathbf{t}_i) \in \mathbb{R}^n \times \mathbb{R}^m$, where $\mathbf{x}_i = [x_{i1}, x_{i2}, \ldots, x_{in}]^T$ and $\mathbf{t}_i = [t_{i1}, t_{i2}, \ldots, t_{im}]^T$, an SLFN with $K$ hidden nodes and activation function $g(x)$ is mathematically modeled as

$$\mathbf{H}\boldsymbol{\beta} = \mathbf{T} \quad (1)$$

where $\mathbf{H} = \{h_{ij}\}$ ($i = 1, \ldots, N$ and $j = 1, \ldots, K$) is the hidden layer output matrix, with $h_{ij} = g(\mathbf{w}_j^T \mathbf{x}_i + b_j)$ denoting the output of the $j$th hidden node with respect to $\mathbf{x}_i$; $\mathbf{w}_j = [w_{j1}, w_{j2}, \ldots, w_{jn}]^T$ is the weight vector connecting the $j$th hidden node and the input nodes, and $b_j$ is the bias (threshold) of the $j$th hidden node; $\boldsymbol{\beta} = [\boldsymbol{\beta}_1, \boldsymbol{\beta}_2, \ldots, \boldsymbol{\beta}_K]^T$ is the matrix of output weights, where $\boldsymbol{\beta}_j = [\beta_{j1}, \beta_{j2}, \ldots, \beta_{jm}]^T$ ($j = 1, \ldots, K$) is the weight vector connecting the $j$th hidden node and the output nodes; and $\mathbf{T} = [\mathbf{t}_1, \mathbf{t}_2, \ldots, \mathbf{t}_N]^T$ is the matrix of targets.

Thus, the ELM first generates the input weights and hidden biases randomly. Next, the determination of the output weights, linking the hidden layer to the output layer, consists simply in finding the least-squares solution to the linear system given by Eq. 1. The minimum norm least-squares (LS) solution to this system is

$$\hat{\boldsymbol{\beta}} = \mathbf{H}^{\dagger}\mathbf{T} \quad (2)$$

where $\mathbf{H}^{\dagger}$ is the Moore-Penrose (MP) generalized inverse of the matrix $\mathbf{H}$.
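To make Eqs. 1-2 concrete, the following is a minimal sketch of the ELM training step in Python/NumPy under the setting used later in Section IV (sigmoid activation, weights and biases drawn uniformly from [-1,1]); the function and variable names are ours, not from the paper:

```python
import numpy as np

def elm_train(X, T, K, rng=np.random.default_rng(0)):
    """Train an SLFN with the ELM algorithm (Eqs. 1-2).
    X: (N, n) inputs; T: (N, m) targets; K: number of hidden nodes."""
    n = X.shape[1]
    W = rng.uniform(-1.0, 1.0, size=(K, n))   # random input weights w_j
    b = rng.uniform(-1.0, 1.0, size=K)        # random hidden biases b_j
    H = 1.0 / (1.0 + np.exp(-(X @ W.T + b)))  # hidden layer output matrix (sigmoid g)
    beta = np.linalg.pinv(H) @ T              # Eq. 2: beta = H† T (Moore-Penrose)
    return W, b, beta

def elm_predict(X, W, b, beta):
    """Network outputs for inputs X, given the trained parameters."""
    H = 1.0 / (1.0 + np.exp(-(X @ W.T + b)))
    return H @ beta
```

The only fitted quantity is beta; this is what makes the ELM fast compared with gradient-based training.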
III. PARTICLE SWARM OPTIMIZATION

This section presents the basic Particle Swarm Optimization algorithm and the information-sharing topologies.

A. Basic Algorithm

Particle Swarm Optimization (PSO) is a population-based stochastic optimization technique developed by Kennedy and Eberhart in 1995, which attempts to model the flight pattern of a flock of birds [9]. In PSO, each particle in the swarm represents a candidate solution for the objective function $f(\mathbf{x})$. The current position of the $i$th particle at iteration $t$ is denoted by $\mathbf{x}_i(t)$. In each iteration of the PSO, each particle moves through the search space with a velocity $\mathbf{v}_i(t)$ calculated as follows:

$$\mathbf{v}_i(t+1) = w\,\mathbf{v}_i(t) + c_1\mathbf{r}_1[\mathbf{y}_i(t) - \mathbf{x}_i(t)] + c_2\mathbf{r}_2[\hat{\mathbf{y}}(t) - \mathbf{x}_i(t)] \quad (3)$$

where $w$ is the inertia weight, $\mathbf{y}_i(t)$ is the personal best position of particle $i$ at iteration $t$, and $\hat{\mathbf{y}}(t)$ is the global best position of the swarm at iteration $t$. The personal best position, called pbest, is the best position found by the particle during the search process up to iteration $t$. The global best position, called gbest, is the best position found by the entire swarm up to iteration $t$. The quality (fitness) of a solution is measured by the objective function $f(\mathbf{x}_i(t))$, calculated from the current position of particle $i$. The terms $c_1$ and $c_2$ are acceleration coefficients. The terms $\mathbf{r}_1$ and $\mathbf{r}_2$ are vectors whose components $r_{1j}$ and $r_{2j}$ are random numbers sampled from a uniform distribution $U(0,1)$. The velocity is limited to the range $[v_{min}, v_{max}]$. After updating the velocity, the new position of particle $i$ at iteration $t+1$ is calculated using Eq. 4:

$$\mathbf{x}_i(t+1) = \mathbf{x}_i(t) + \mathbf{v}_i(t+1). \quad (4)$$

Algorithm 1 summarizes the PSO.

Algorithm 1 - PSO Algorithm
1:  Initialize the swarm S;
2:  while stopping condition is false do
3:    for all particles i of the swarm do
4:      Calculate the fitness f(x_i(t));
5:      Set the personal best position y_i(t);
6:    end for
7:    Set the global best position ŷ(t) of the swarm;
8:    for all particles i of the swarm do
9:      Update the velocity v_i(t);
10:     Update the position x_i(t);
11:   end for
12: end while
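The core update of Eqs. 3-4 amounts to a few lines of array code. Below is a sketch with illustrative names; y_hat holds each particle's neighborhood-best position, so the same routine serves all of the topologies described next in Section III-B (for the Global topology, every row of y_hat is simply the gbest position):

```python
import numpy as np

def pso_step(x, v, y, y_hat, w, c1, c2, v_min, v_max, rng):
    """One velocity/position update for the whole swarm (Eqs. 3-4).
    x, v, y: (S, I) arrays of positions, velocities and personal bests;
    y_hat: (S, I) array with each particle's neighborhood-best position."""
    r1 = rng.uniform(0.0, 1.0, size=x.shape)  # components r_1j ~ U(0,1)
    r2 = rng.uniform(0.0, 1.0, size=x.shape)  # components r_2j ~ U(0,1)
    v = w * v + c1 * r1 * (y - x) + c2 * r2 * (y_hat - x)
    v = np.clip(v, v_min, v_max)              # velocity limited to [v_min, v_max]
    return x + v, v                           # Eq. 4
```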

B. Topologies

The PSO described above is known as the gbest PSO. In this PSO, each particle is connected to every other particle in the swarm, in a type of topology referred to as the Global topology. Different topologies have been developed for the PSO and empirically studied, each affecting the performance of the PSO in a potentially drastic way [10]. In this paper, five topologies well known in the literature were selected for comparison:

Global: In this topology, the swarm is fully connected and each particle can therefore communicate with every other particle, as illustrated in Fig. 1a. Thus, each particle is attracted towards the best solution found by the entire swarm. This topology has been shown to provide fast information flow between particles and faster convergence than other topologies, but with a susceptibility to being trapped in local minima.

Local: In this topology, each particle communicates with its $n_N$ immediate neighbors. Thus, instead of a global best particle, each particle selects the best particle in its neighborhood. When $n_N = 2$, each particle communicates with its immediately adjacent neighbors, as illustrated in Fig. 1b. In this paper, $n_N = 2$ was used.

Von Neumann: In this topology, each particle is connected to its left, right, top and bottom neighbors on a two-dimensional lattice, as illustrated in Fig. 1c. This topology has been shown in a number of empirical studies to outperform other topologies in a large number of problems [10].

Wheel: In this topology, the particles in a neighborhood are isolated from one another. One particle serves as the focal point, and all information flow occurs through the focal particle, as illustrated in Fig. 1d. In this work, the focal particle was chosen randomly.

Four Clusters: In this topology, four clusters of particles are formed, with two connections between clusters; each cluster is a fully connected subswarm, as illustrated in Fig. 1e.

[Figure 1. Topologies of the PSO: (a) Global, (b) Local, (c) Von Neumann, (d) Wheel, (e) Four Clusters.]

It is worth mentioning that the neighborhood of the particles in all topologies is based on their indexes. For a more detailed description of these topologies, the reader may consult [5].
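Since the neighborhoods are index-based, each topology reduces to a list of neighbor indices per particle. The sketch below shows one plausible construction of the Local ring and Von Neumann lattice neighborhoods and the selection of the neighborhood best; the exact wiring used by the authors (lattice dimensions, which particles link the four clusters) is not specified in the paper, so these details are assumptions:

```python
import numpy as np

def local_ring(S, n_N=2):
    """Local topology: particle i plus its n_N adjacent indices on a ring."""
    half = n_N // 2
    return [[(i + d) % S for d in range(-half, half + 1)] for i in range(S)]

def von_neumann(rows, cols):
    """Von Neumann topology: left/right/top/bottom neighbors on a 2-D lattice
    (wrap-around assumed), with particle index i at row i // cols."""
    nbrs = []
    for i in range(rows * cols):
        r, c = divmod(i, cols)
        nbrs.append([i,
                     r * cols + (c - 1) % cols, r * cols + (c + 1) % cols,
                     ((r - 1) % rows) * cols + c, ((r + 1) % rows) * cols + c])
    return nbrs

def neighborhood_best(nbrs, y, fit_y):
    """For each particle, the best (lowest-fitness) personal best among
    its neighbors; this is the y_hat used in the update of Eq. 3."""
    return np.array([y[min(n, key=lambda j: fit_y[j])] for n in nbrs])
```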
IV. EXPERIMENTAL ARRANGEMENT

In this section, we give the details of the experimental arrangement.

A. PSO-ELM Description

In this work, the PSO is used to select the input weights and hidden biases of the SLFN, and the MP generalized inverse is used to analytically calculate the output weights. In the PSO used, the inertia varies linearly with time, as in [3]. The fitness of each particle is the root mean squared error (RMSE) on the validation set. We used the RMSE because it is a continuous metric, so the search process can be conducted in a smooth way.

B. Performance Metrics

In this paper, two metrics are used to evaluate the performance of the PSO-ELM: the root mean squared error over the validation set, referred to as RMSE_V, and the test accuracy, denoted by TA. The test accuracy refers to the percentage of correct classifications produced by the trained SLFNs on the testing set. Thus, the TA is calculated using Eq. 5:

$$TA = 100\,\frac{n_c}{n_T} \quad (5)$$

where $n_T$ is the size of the testing set and $n_c$ is the number of correct classifications. The RMSE over a validation set of size $P$ is calculated using Eq. 6:

$$RMSE_V = \sqrt{\frac{\sum_{p=1}^{P}\sum_{k=1}^{K}(t_{k,p} - o_{k,p})^2}{PK}} \quad (6)$$

where $K$ is the number of output units in the SLFN, $t_{k,p}$ is the target for pattern $p$ at output $k$, and $o_{k,p}$ is the output obtained by the network for pattern $p$ at output $k$. The other metric used in this work is the global swarm diversity $D_S$ [11], calculated as follows:

$$D_S = \frac{1}{S}\sum_{i=1}^{S}\sqrt{\sum_{k=1}^{I}(x_{ik} - \bar{x}_k)^2} \quad (7)$$

where $S$ is the swarm size, $I$ is the dimensionality of the search space, and $\mathbf{x}_i$ and $\bar{\mathbf{x}}$ are the position of particle $i$ and the swarm center, respectively. This metric is used to measure the convergence of the PSO: a small value indicates swarm convergence around the swarm center, while a large value indicates a higher dispersion of particles away from the center.

C. Problem Sets

In our experiments, 5 benchmark classification problems obtained from the UCI Machine Learning Repository [12] were used. These problems present different degrees of difficulty and different numbers of classes. Missing values were handled using the most frequent value of each attribute. The problems and the chosen sizes of the hidden layer are given in Tab. I; the hidden layer sizes used were the same as in [13]. In our experiments, all attributes were normalized into the range [0,1], while the outputs (targets) were normalized into [-1,1]. The ELM activation function used was the sigmoid function. Each dataset was divided into training, validation and testing sets, as specified in Tab. II.

For all topologies, 30 independent executions were done for each dataset. The training, validation and testing sets were randomly generated at each trial, and all topologies were executed on the same generated sets.
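Combining Sections II and IV-A/IV-B, a particle encodes the input weights and hidden biases, and its fitness is the validation RMSE of Eq. 6 after solving Eq. 2 on the training set. A minimal sketch follows; the particle layout (weights first, then biases) and the argmax decoding used for TA are our assumptions, not details given in the paper:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fitness_rmse_v(p, K, X_tr, T_tr, X_val, T_val):
    """Fitness of a particle: RMSE on the validation set (Eq. 6).
    Assumed layout of p: K*n input weights followed by K hidden biases."""
    n = X_tr.shape[1]
    W = p[:K * n].reshape(K, n)                # input weights selected by PSO
    b = p[K * n:K * n + K]                     # hidden biases selected by PSO
    beta = np.linalg.pinv(sigmoid(X_tr @ W.T + b)) @ T_tr   # Eq. 2
    O = sigmoid(X_val @ W.T + b) @ beta        # validation outputs o_{k,p}
    return np.sqrt(np.sum((T_val - O) ** 2) / T_val.size)   # divide by P*K

def test_accuracy(O, T):
    """TA (Eq. 5) as a percentage, assuming one-of-K target encoding."""
    return 100.0 * np.mean(O.argmax(axis=1) == T.argmax(axis=1))

def swarm_diversity(X):
    """D_S (Eq. 7): mean distance of the particles to the swarm center."""
    return np.mean(np.linalg.norm(X - X.mean(axis=0), axis=1))
```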

Table I
DATA SETS USED IN THE EXPERIMENTS ALONG WITH THE HIDDEN LAYER SIZES OF THE SLFNS

Problem                  Attributes  Classes  Hidden Nodes
Cancer                   9           2        5
Pima Indians Diabetes    8           2        10
Glass Identification     9           7        10
Heart                    13          2        5
Iris                     4           3        10

Table II
PARTITION OF THE DATA SETS USED IN THE EXPERIMENTS

Problem                  Training  Validation  Testing  Total
Cancer                   350       175         174      699
Pima Indians Diabetes    252       258         258      768
Glass Identification     114       50          50       214
Heart                    130       70          70       270
Iris                     70        40          40       150

[Figure 2. RMSE_V evolution for Cancer: RMSE_V over the iterations (mean over 30 samples) for the Global, Local, Von Neumann, Wheel and Four Clusters topologies.]

D. PSO Configuration

The following PSO configuration was used in the experiments: the control parameters $c_1$ and $c_2$ were both set to 2.0. We used an adaptive inertia $w$, with initial inertia 0.9 and final inertia 0.4; this same scheme was used in the IPSO-ELM. The swarm size used was 20. We used few particles because we wanted to observe the effect of the topology on information sharing; if a very large number of particles were used, as in [2] and [3], the effect of the topology on the communication of the particles could be hidden. It is worth noting that we did not try to find the best number of particles for each topology; the same number of particles was used for all of them. The PSO was executed for 1000 iterations. We used many iterations because we wanted to provide enough time for the exchange of information between the particles in all topologies. All components of the particles were randomly initialized within the range [-1,1]. In this work, we used the reflexive approach as done in [2]: out-bounded particles in a given dimension invert their direction in that dimension and leave the boundary keeping the same velocity. We used $v_{max} = 1$ and $v_{min} = -1$ for all dimensions of the search space.
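The linearly decreasing inertia and the reflexive bound handling described above could be implemented as below. This sketch reflects our reading of the reflexive scheme of [2] (position reflected back into the domain, velocity component reversed with the same magnitude), not the authors' actual code:

```python
import numpy as np

def inertia(t, t_max, w_start=0.9, w_end=0.4):
    """Inertia weight decreasing linearly from 0.9 to 0.4 over the run."""
    return w_start - (w_start - w_end) * t / t_max

def reflect(x, v, lo=-1.0, hi=1.0):
    """Assumed reflexive bound handling: an out-of-bounds component re-enters
    the domain mirrored at the boundary, and its velocity component changes
    sign while keeping the same magnitude."""
    out_lo, out_hi = x < lo, x > hi
    x = np.where(out_lo, 2 * lo - x, x)
    x = np.where(out_hi, 2 * hi - x, x)
    v = np.where(out_lo | out_hi, -v, v)
    return x, v
```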
V. RESULTS AND DISCUSSION

In this section, we present the results obtained. Figure 2 shows the evolution of RMSE_V over the iterations of the five PSO topologies on the Cancer dataset; the graph corresponds to the mean over 30 samples. As can be seen from the graph, all topologies showed the same behavior until approximately the 400th iteration. After this point, the decrease in RMSE_V occurred differently for each topology. The Wheel topology, for example, presented an empirically slower decrease of RMSE_V than all other topologies. This result is consistent with the literature, which highlights the slow sharing of information in this topology, and it was observed in a similar way on all data sets evaluated. For the other topologies, the behavior varied over the data sets. However, the topologies with the lowest RMSE_V in the last iteration were always the Global and Von Neumann topologies: for Cancer and Heart, the Global topology presented a better RMSE_V than all other topologies, while for Glass, Iris and Diabetes, the Von Neumann topology showed a better RMSE_V than all other topologies.

Tables III, IV, V, VI and VII show the results for the data sets Cancer, Diabetes, Glass, Heart and Iris, respectively. In each table, the value marked with * is the lowest RMSE_V and the lowest diversity D_S among the topologies.

We performed a one-way analysis of variance (ANOVA) to assess whether the performances of the topologies are statistically different or not. We used this technique because the RMSE_V samples came from a normal distribution for all topologies (we used Levene's test with a significance level of 5% to test whether the samples of the topologies had equal variances). Figure 3 shows the 95% confidence intervals of the RMSE_V for each topology on the Cancer dataset (to generate these confidence intervals, we performed a multiple comparison test using the Matlab function multcompare). As can be seen, the topologies Global, Local, Von Neumann, and Four Clusters have statistically equivalent performance, while the Wheel topology has a significantly lower performance than these topologies.
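For readers reproducing the analysis outside Matlab, analogous tests are available in SciPy. This is a sketch of an equivalent pipeline (the paper used Matlab's ANOVA and multcompare), not a reproduction of the authors' scripts:

```python
from scipy import stats

def compare_topologies(rmse_by_topology):
    """rmse_by_topology: dict mapping topology name -> list of the 30 final
    RMSE_V values on one dataset. Returns the p-values of the three tests."""
    groups = list(rmse_by_topology.values())
    _, p_levene = stats.levene(*groups)      # equal-variance check (5% level)
    _, p_anova = stats.f_oneway(*groups)     # one-way ANOVA on RMSE_V
    _, p_kruskal = stats.kruskal(*groups)    # Kruskal-Wallis, as used for TA
                                             # (TA was not normally distributed)
    return p_levene, p_anova, p_kruskal
```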

Table III
RESULTS FOR CANCER: MEAN AND STANDARD DEVIATION (IN PARENTHESES) OF RMSE_V, TA AND D_S

Topology       RMSE_V             TA (%)         D_S
Global         0.2907* (0.0440)   96.34 (1.33)   0.8399* (0.1833)
Local          0.3018 (0.0384)    96.36 (1.44)   3.0889 (0.1709)
Von Neumann    0.2927 (0.0417)    96.32 (1.40)   1.8905 (0.4683)
Wheel          0.3225 (0.0345)    96.46 (1.52)   2.4464 (0.4609)
Four Clusters  0.2929 (0.0461)    96.34 (1.41)   1.6197 (0.4516)

[Figure 3. Confidence intervals (95%) of the RMSE_V for the Cancer dataset.]

Table IV
RESULTS FOR DIABETES: MEAN AND STANDARD DEVIATION (IN PARENTHESES) OF RMSE_V, TA AND D_S

Topology       RMSE_V             TA (%)         D_S
Global         0.7587 (0.0263)    76.87 (2.40)   2.1321* (0.5129)
Local          0.7605 (0.0235)    76.25 (2.25)   4.0701 (0.1816)
Von Neumann    0.7574* (0.0230)   77.13 (2.39)   3.6352 (0.3877)
Wheel          0.7716 (0.0227)    76.69 (2.11)   2.8073 (0.5493)
Four Clusters  0.7580 (0.0240)    76.67 (2.18)   3.2000 (0.4128)

[Figure 4. Confidence intervals (95%) of the RMSE_V for the Heart dataset.]

The Diabetes and Glass datasets showed the same result, and their graphs are not shown here due to space limitations. For the Iris dataset, all topologies presented the same performance in statistical terms. Figure 4 shows the confidence intervals for the Heart dataset. As can be seen, the topologies Global, Von Neumann and Four Clusters have statistically equivalent performance, while the Wheel topology has a performance significantly different from all other topologies. Thus, in general, the topologies Global, Von Neumann and Four Clusters have statistically the same performance. An explanation for this is that these topologies are more connected than the others, and thus they behave similarly.

In relation to the test accuracy, all topologies showed statistically equivalent performance according to the Kruskal-Wallis test with 5% significance (we used the Kruskal-Wallis test because the test accuracies did not follow a normal distribution on all the data sets). In terms of diversity, the Global topology empirically achieved a lower result than all other topologies over all datasets. It is important to note that all topologies had low diversity values with small standard deviations over all datasets, indicating convergence of the PSO for all topologies.

VI. CONCLUSION

In this paper, we investigated the effect of five topologies on the performance of the PSO-ELM. The results showed that on all data sets evaluated, the Global and Von Neumann topologies achieved a lower RMSE_V, in an empirical analysis, than all other topologies. However, the topologies Global, Von Neumann and Four Clusters were statistically equivalent on all datasets. Analyzing the convergence, we see that the diversity of the Global topology in the last iteration is lower than that of all other topologies on all datasets. This means that the Global topology has a good capacity for exploitation, which can make it promising in finding good solutions for the PSO-ELM.

Interestingly, all topologies showed the same test accuracy in statistical terms. A hypothesis that can explain this fact is that all topologies got stuck in the same local minimum. This point needs to be studied further and is left as future work. Another future work is to use dynamic topologies, such as the Dynamic Clan topology [14]; these topologies provide a good trade-off between exploration and exploitation.

Table V
RESULTS FOR GLASS: MEAN AND STANDARD DEVIATION (IN PARENTHESES) OF RMSE_V, TA AND D_S

Topology       RMSE_V             TA (%)         D_S
Global         0.5478 (0.0311)    62.93 (6.80)   2.2693* (0.4405)
Local          0.5500 (0.0272)    63.07 (6.12)   4.3205 (0.1675)
Von Neumann    0.5442* (0.0330)   63.40 (7.22)   3.8036 (0.3916)
Wheel          0.5675 (0.0262)    62.00 (7.11)   3.2929 (0.4367)
Four Clusters  0.5494 (0.0330)    62.47 (7.06)   3.4969 (0.5615)

Table VI
RESULTS FOR HEART: MEAN AND STANDARD DEVIATION (IN PARENTHESES) OF RMSE_V, TA AND D_S

Topology       RMSE_V             TA (%)         D_S
Global         0.5773* (0.0491)   81.05 (4.73)   0.9695* (0.2889)
Local          0.6024 (0.0457)    82.10 (4.51)   3.6299 (0.1294)
Von Neumann    0.5935 (0.0490)    81.48 (4.74)   2.3723 (0.5204)
Wheel          0.6441 (0.0454)    82.71 (4.97)   2.5188 (0.4675)
Four Clusters  0.5874 (0.0451)    81.14 (4.43)   2.2232 (0.4917)

Table VII
RESULTS FOR IRIS: MEAN AND STANDARD DEVIATION (IN PARENTHESES) OF RMSE_V, TA AND D_S

Topology       RMSE_V             TA (%)         D_S
Global         0.2937 (0.0457)    95.33 (2.69)   1.7396* (0.2877)
Local          0.2940 (0.0449)    95.67 (3.14)   3.1875 (0.1376)
Von Neumann    0.2911* (0.0485)   96.08 (2.84)   2.7731 (0.2384)
Wheel          0.3074 (0.0443)    95.75 (2.87)   2.3951 (0.5140)
Four Clusters  0.2952 (0.0405)    96.00 (2.98)   2.5861 (0.3020)

REFERENCES

[1] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, "Extreme learning machine: a new learning scheme of feedforward neural networks," in Proceedings of the IEEE International Joint Conference on Neural Networks, vol. 2, 2004, pp. 985-990.
[2] Y. Xu and Y. Shu, "Evolutionary extreme learning machine - based on particle swarm optimization," ser. Lecture Notes in Computer Science, 2006, vol. 3971.
[3] F. Han, H. Yao, and Q. Ling, "An improved extreme learning machine based on particle swarm optimization," ser. Lecture Notes in Computer Science, 2011, vol. 6840.
[4] S. Saraswathi, S. Sundaram, N. Sundararajan, M. Zimmermann, and M. Nilsen-Hamilton, "ICGA-PSO-ELM approach for accurate multiclass cancer classification resulting in reduced gene sets in which genes encoding secreted proteins are highly represented," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 8, no. 2, pp. 452-463, 2011.
[5] A. P. Engelbrecht, Computational Intelligence: An Introduction. John Wiley & Sons Ltd, 2007.
[6] R. Mendes, P. Cortez, M. Rocha, and J. Neves, "Particle swarms for feedforward neural network training," in Proceedings of the International Joint Conference on Neural Networks, IJCNN'02, vol. 2, 2002, pp. 1895-1899.
[7] A. van Wyk and A. Engelbrecht, "Overfitting by PSO trained feedforward neural networks," in IEEE Congress on Evolutionary Computation (CEC), 2010, pp. 1-8.
[8] J. Cao, Z. Lin, and G.-B. Huang, "Composite function wavelet neural networks with extreme learning machine," Neurocomputing, vol. 73, no. 7-9, pp. 1405-1416, 2010.
[9] J. Kennedy and R. Eberhart, "Particle swarm optimization," in Proceedings of the IEEE International Conference on Neural Networks, vol. 4, 1995, pp. 1942-1948.
[10] J. Kennedy and R. Mendes, "Population structure and particle swarm performance," in Proceedings of the 2002 Congress on Evolutionary Computation, CEC'02, vol. 2, 2002, pp. 1671-1676.
[11] O. Olorunda and A. Engelbrecht, "Measuring exploration/exploitation in particle swarms using swarm diversity," in IEEE Congress on Evolutionary Computation, CEC 2008, June 2008, pp. 1128-1134.
[12] A. Frank and A. Asuncion, "UCI machine learning repository," 2010. [Online]. Available: http://archive.ics.uci.edu/ml (Accessed: 12 July 2012)
[13] D. N. G. Silva, L. D. S. Pacifico, and T. B. Ludermir, "An evolutionary extreme learning machine based on group search optimization," in IEEE Congress on Evolutionary Computation (CEC), 2011.
[14] C. Bastos-Filho, D. Carvalho, E. Figueiredo, and P. de Miranda, "Dynamic clan particle swarm optimization," in Ninth International Conference on Intelligent Systems Design and Applications, ISDA'09, 2009, pp. 249-254.