Automatic Clustering using a Genetic Algorithm with New Solution Encoding and Operators


Carolina Raposo (1), Carlos Henggeler Antunes (2), and João Pedro Barreto (1)

(1) Institute of Systems and Robotics, Dept. of Electrical and Computer Engineering, University of Coimbra, Portugal  {carolinaraposo,jpbar}@isr.uc.pt
(2) INESC Coimbra, Dept. of Electrical and Computer Engineering, University of Coimbra, Portugal  ch@deec.uc.pt

Abstract. Genetic algorithms (GA) are randomized search and optimization techniques that have proven to be robust and effective in large-scale problems. In this work, we propose a new GA approach for solving the automatic clustering problem, ACGA - Automatic Clustering Genetic Algorithm. It is capable of finding the optimal number of clusters in a dataset and of correctly assigning each data point to a cluster without any prior knowledge about the data. An encoding scheme that had not yet been tested with GA is adopted, and new genetic operators are developed. The algorithm can use any cluster validity function as fitness function. Experimental validation shows that this new approach outperforms the classical clustering methods K-means and FCM. The method provides good results and requires a small number of iterations to converge.

Keywords: Genetic Algorithms, Clustering, Calinski-Harabasz index, K-means, Fuzzy C-Means

1 Introduction

Clustering is the problem of classifying an unlabeled dataset into groups of similar objects. Each group, called a cluster, consists of objects that are similar to one another, according to a certain measure, and dissimilar to objects of other groups, with respect to the same measure. Clustering has had an important role in areas such as computer vision [1], data mining and document categorization [2], and bioinformatics [3]. The major difficulty in the clustering problem is the fact that it is an unsupervised task, i.e., in most cases the structural characteristics of the problem are unknown. In particular, the spatial distribution of the data in terms of number, volumes, and shapes of clusters is not known.

A classical clustering method is K-means [4], in which the data is divided into a set of K clusters. This division depends on the initial clustering, and is computed by minimizing the squared error between the centroid of a cluster and its elements.

One variant of K-means is the Fuzzy C-Means (FCM), in which the elements of the dataset may belong to more than one cluster with different membership degrees [5]. FCM proceeds by updating the cluster centroids and the membership degrees until a convergence condition is attained. Both methods have proven to perform well in many situations. However, they can easily get stuck in local minima, depending on the initial clustering. Moreover, for problems where the number of clusters is unknown a priori, it is necessary to perform the clustering for different values of K and choose the best solution according to a certain evaluation function. This becomes impractical and tedious in the presence of a large number of clusters.

These issues (local minima and unknown number of clusters) can be tackled using GA [6], which have proven to be effective in search and optimization problems [7]. GA are high-level search heuristics that mimic the process of natural evolution, based on the principle of survival of the fittest, using selection, crossover, and mutation mechanisms. Solutions (individuals) are encoded as strings, called chromosomes, and, at each iteration, the fitness of the solutions in the population is computed to guide the evolutionary process. In [8], a GA was proposed to find the optimal partition of a dataset, given the number of clusters. More recently, methods based on GA have been proposed to solve the automatic clustering problem, i.e., when no information about the dataset is known [9-13].

We propose a new GA approach for solving the automatic clustering problem, ACGA - Automatic Clustering Genetic Algorithm - which uses a solution encoding based on the one presented in [14]. Two new mutation operators and one new crossover operator have been developed in order to guarantee high diversity in the population. Any internal cluster validity measure can be used as fitness function; here, the Calinski-Harabasz (CH) index [15] was chosen. An experimental validation using real [16] and synthetic [17] datasets was conducted, and the performance of ACGA was compared with the K-means and FCM algorithms. ACGA outperforms these two classical methods, converging to solutions with higher fitness function values.

The interest and motivation of this study have been provided in this introduction. An overview of the GA approach, describing the encoding scheme and genetic operators, is presented in Section 2. Experiments and results are discussed in Section 3. Finally, some conclusions are drawn in Section 4.

2 Overview of the GA Approach

In this Section, a new GA approach for solving the automatic clustering problem is presented. An encoding scheme based on [14] is used, and new genetic operators (mutation and crossover) are developed. Solutions represent different points in the search space, to which quality values are assigned according to a pre-defined fitness function. The best B solutions of the population are preserved and pass to the next generation, thus endowing the evolutionary process with a certain degree of elitism. A fixed number of solutions is then selected to act as parents.

Crossover is applied to the parent chromosomes, originating new solutions, which are then subject to mutation. The new population is evaluated, and this procedure is repeated until a stopping criterion is satisfied. Algorithm 1 shows the main steps of the GA. In the next Subsections, the encoding scheme is described, as well as the procedures for generating the initial population, evaluating each solution, selecting the parent population, and applying the crossover and mutation operators.

Algorithm 1: Pseudo-code of the GA.
  t <- 0;
  Generate initial population P(t);
  Evaluate P(t);
  while stopping criterion not satisfied do
      Select parent population P'(t) from P(t);
      Apply genetic operators to P'(t), yielding P(t+1);
      Replace random solutions in P(t+1) with the best B solutions in P(t);
      Evaluate P(t+1);
      t <- t + 1;
  Result: best solution (highest fitness value) of the population in the last generation.

2.1 Encoding Scheme

In this work, we adopted an encoding scheme based on the one used in [14]. Each solution i is a fixed-length string represented by activation values A_ij and cluster centroids m_ij, j = 1, ..., K_max, where K_max is the maximum number of clusters defined by the user. Solution i of the population is then represented as:

    A_i1 A_i2 ... A_iKmax  m_i1 m_i2 ... m_iKmax

where the A_ij are activation values defined in the interval [0, 1]. An activation value A_ij larger than 0.5 means that the cluster with centroid m_ij is active, i.e., it is selected for partitioning the dataset. Otherwise, the respective cluster is inactive in chromosome i. Since each centroid m_ij is a d-dimensional vector, where d is the number of features of the data, each solution is a string of total length K_max + d·K_max. As an example, for a maximum number of clusters K_max = 3 and a dataset with d = 2 features, the string

    0.1  0.8  0.6  10  0.5  11  0.9  14  0.3

represents a solution i with two active clusters, with centroids m_i2 = [11, 0.9] and m_i3 = [14, 0.3].

This encoding scheme has advantages over the more popular label encoding, where each position is an integer value corresponding to one cluster. These include the fact that it may lead to smaller solution vectors in the presence of large datasets, and that solutions with isolated one-element clusters are less likely to be generated, since the labeling of each data point only depends on the location of the centroids. However, this encoding has the disadvantage of requiring one label to be assigned to each data point before computing the fitness of each solution. This is done by finding the active centroid closest to each point (smallest Euclidean distance).

2.2 Initialization of the Population

Each candidate solution in the population is initialized independently, with the activation values and the cluster centroids initialized separately. The K_max activation values A_ij are initialized as random values drawn from the standard uniform distribution in the interval [0, 1]. In order to guarantee that each initial solution has at least 2 active clusters, a minimum of 2 activation values are forced to be larger than 0.5. For initializing the K_max centroids m_ij, the interval in which the data points are contained is determined. More specifically, for each dimension relative to each feature, the minimum and the maximum values in the dataset are determined. Then, for each dimension, K_max random values are sampled from the corresponding interval, originating K_max points that are the initial centroids. The size of the population P (number of candidate solutions) is a user-defined value.

2.3 Solution Evaluation

A quality measure is assigned to each solution, computed using a fitness function. In this work, we chose to define the fitness function as the Calinski-Harabasz (CH) index [15], which is the quotient between the inter-cluster and the intra-cluster average squared distances. For solution s, CH(s) is computed as in equation (1):

CH(s) = \frac{\left( \sum_{i=1}^{N} \|X_i - m_s\|^2 - \sum_{j=1}^{K_s} \sum_{X \in C_{sj}} \|X - m_{sj}\|^2 \right) / (K_s - 1)}{\left( \sum_{j=1}^{K_s} \sum_{X \in C_{sj}} \|X - m_{sj}\|^2 \right) / (N - K_s)},   (1)

where N is the number of objects X_i in the dataset, m_s is the centroid of the dataset, K_s is the number of clusters in solution s, and X are the objects in cluster C_sj with centroid m_sj. \|\cdot\| denotes the L2 norm. Larger inter-cluster distances and smaller intra-cluster distances lead to larger CH values, i.e., higher quality solutions.
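To make the encoding and evaluation concrete, the following is a minimal sketch, not the authors' implementation, of how a chromosome of this form could be initialized, decoded into a hard partition, and scored with the CH index of equation (1). The names init_solution, decode, and ch_index, as well as the toy data, are illustrative assumptions.

```python
import numpy as np

def init_solution(data, k_max, rng):
    """Random chromosome: K_max activation values in [0, 1] plus K_max centroids
    sampled uniformly inside the per-feature range of the data."""
    act = rng.uniform(0.0, 1.0, k_max)
    # Force at least 2 activation values above 0.5, as required for initial solutions.
    act[rng.choice(k_max, 2, replace=False)] = rng.uniform(0.51, 1.0, 2)
    lo, hi = data.min(axis=0), data.max(axis=0)
    cents = rng.uniform(lo, hi, size=(k_max, data.shape[1]))
    return act, cents

def decode(data, act, cents):
    """Assign each data point to the nearest active centroid (activation > 0.5)."""
    active = cents[act > 0.5]
    d2 = ((data[:, None, :] - active[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def ch_index(data, labels):
    """Calinski-Harabasz index as in equation (1); returns 0 for one-cluster partitions."""
    n, ks = len(data), len(np.unique(labels))
    if ks < 2:
        return 0.0
    total_ss = ((data - data.mean(axis=0)) ** 2).sum()
    within_ss = sum(((data[labels == j] - data[labels == j].mean(axis=0)) ** 2).sum()
                    for j in np.unique(labels))
    return ((total_ss - within_ss) / (ks - 1)) / (within_ss / (n - ks))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                     # toy data, 2 features
act, cents = init_solution(X, k_max=20, rng=rng)
print(ch_index(X, decode(X, act, cents)))
```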

Fig. 1. Example of the crossover between two parent solutions p_1 and p_2, originating two offspring c_1 and c_2. The solutions correspond to a maximum number of clusters K_max = 3 and d = 2 features.

Note that for one-cluster solutions the term K_s - 1 is substituted by 1, and we have the relation

\sum_{i=1}^{N} \|X_i - m_s\|^2 = \sum_{j=1}^{K_s} \sum_{X \in C_{sj}} \|X - m_{sj}\|^2,   (2)

leading to a CH value of 0. Thus, it is not necessary to explicitly deal with one-cluster solutions, since they are automatically assigned a null quality value.

2.4 Selection of the Parent Solutions

Before applying the genetic operators, a set of parent solutions must be selected. This is done in a tournament scheme by selecting P_p random solutions from the population and finding the one with the highest fitness, which will act as a parent solution. This procedure is repeated P times, originating a parent population of the same size as the original population.

2.5 Genetic Operators

The genetic operators should be suited to the encoding scheme. Their objective is to ensure diversity in the population, such that better solutions can be produced through the evolutionary process. In this work, one new crossover scheme is presented, as well as two new mutation operators.

Crossover. Crossover is applied to each pair of parent solutions in the population, originating two offspring. A binary mask of length K_max is created, where 1 occurs with probability p_cross. The centroids in the parent solutions corresponding to the positions in the mask with a unitary value are exchanged, originating two children. Figure 1 shows an example of the crossover between parent solutions p_1 and p_2 yielding offspring c_1 and c_2.
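A possible rendering of this crossover operator, under the same array-based chromosome layout as the earlier sketch (activation vector plus centroid matrix), is given below. The mask semantics and the exchange of centroids follow the description above; the function signature is an assumption.

```python
import numpy as np

def crossover(parent1, parent2, p_cross, rng):
    """Exchange centroids between two parents at the positions where a random
    binary mask of length K_max has a unitary value (1 occurs with prob. p_cross)."""
    (act1, cents1), (act2, cents2) = parent1, parent2
    mask = rng.random(len(act1)) < p_cross
    child1_c, child2_c = cents1.copy(), cents2.copy()
    child1_c[mask], child2_c[mask] = cents2[mask], cents1[mask]
    # Activation values are left untouched: only mutation changes them.
    return (act1.copy(), child1_c), (act2.copy(), child2_c)
```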

(a) An example of the activation value mutation. (b) An example of the centroid mutation.
Fig. 2. Examples of the two types of mutation proposed in this work: mutation in the activation values, and centroid mutation. The solution vectors correspond to a maximum number of clusters K_max = 3 and d = 2 features.

Mutation. Each solution produced by the crossover operator goes through a mutation process. Two different types of mutation have been considered: mutation in the activation values, and mutation in the centroids.

Activation value mutation. Similarly to the crossover operator, a binary mask of length K_max is created, where 1 occurs with probability p_mut. Let α be a user-defined parameter that controls the perturbation of the activation values. For each unitary value in the mask, a random value r_α is drawn in the interval [-α, α]. The activation value in the corresponding position is modified by adding r_α: A'_ij = A_ij + r_α, where A'_ij is the modified activation value. If the resulting activation value A'_ij does not belong to the interval [0, 1], it is forced to 0 or 1:

A'_{ij} = \begin{cases} 1 & \text{if } A'_{ij} > 1 \\ 0 & \text{if } A'_{ij} < 0 \end{cases}.   (3)

Figure 2(a) shows an example of the application of mutation to the activation values. The centroid values remain unaltered, and only randomly chosen activation values, selected with probability p_mut, are altered. Since the crossover operator does not influence the activation values, this type of mutation is the only way to achieve diversity in the number of clusters between solutions in consecutive iterations of the algorithm.

Centroid mutation. The mutation operator applied to the centroids is similar to the activation value mutation, with the difference that the perturbation is a d-dimensional vector created according to the range of values of each feature of the data points. Let ε be a user-defined parameter that controls the amplitude of the change in the centroids. For each dimension w of the data points, w = 1, ..., d, let g_w = h_w - l_w be the difference between the largest value of feature w found in the dataset, h_w, and the smallest value of the same feature, l_w. By defining t_w = ε·g_w for each feature, a vector r_ε = [r_ε1, ..., r_εd] can be created by drawing random values r_εw from the intervals [-t_w, t_w]. The centroids m_ij that correspond to unitary values in the binary mask (created as in the activation value mutation scheme) are modified by adding the perturbation vector r_ε: m'_ij = m_ij + r_ε, where m'_ij is the modified centroid. Again, if the new centroid exceeds the boundary values of the data points in any dimension, it is forced to the corresponding boundary value:

m'_{ijw} = \begin{cases} h_w & \text{if } m'_{ijw} > h_w \\ l_w & \text{if } m'_{ijw} < l_w \end{cases}, \quad w = 1, \dots, d.   (4)
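Both mutation operators might be implemented along the following lines. This is a sketch under the same chromosome layout as the earlier snippets, with the clamping of equations (3) and (4) done via np.clip; the function names and signatures are illustrative assumptions.

```python
import numpy as np

def mutate_activations(act, p_mut, alpha, rng):
    """Perturb randomly chosen activation values by r in [-alpha, alpha],
    clamping the result to [0, 1] as in equation (3)."""
    act = act.copy()
    mask = rng.random(len(act)) < p_mut
    act[mask] = np.clip(act[mask] + rng.uniform(-alpha, alpha, mask.sum()), 0.0, 1.0)
    return act

def mutate_centroids(cents, data, p_mut, eps, rng):
    """Perturb randomly chosen centroids by a vector drawn per feature w from
    [-eps*g_w, eps*g_w], clamping to the data range as in equation (4)."""
    cents = cents.copy()
    lo, hi = data.min(axis=0), data.max(axis=0)
    t = eps * (hi - lo)                                   # t_w = eps * g_w
    mask = rng.random(len(cents)) < p_mut
    noise = rng.uniform(-t, t, size=(mask.sum(), len(t)))
    cents[mask] = np.clip(cents[mask] + noise, lo, hi)
    return cents
```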

Figure 2(b) shows an example of the application of mutation to the centroids. In this case, only randomly chosen centroids are altered. The activation value mutation may drastically change the number of clusters in a solution, originating child solutions considerably different from the parent solutions. This may be undesirable, since it easily leads to solutions with significantly lower fitness values. Thus, this type of mutation is applied less often than the centroid mutation.

The algorithm proceeds until either a maximum number of iterations I_max or a fixed number of iterations without improvement of the best solution in the population I_ni is reached.

3 Experiments and Results

In this Section, we compare the performance of ACGA with two classical clustering methods: K-means [4] and FCM [5]. Since these two methods require the user to input the number of clusters, we performed the clustering for a varying number of clusters and chose the solution with the highest quality value, using the same fitness function used in the GA (CH index). We tested K = 2, ..., K_max clusters, with K_max = 20.

Experiments were performed using the datasets in Table 1. The first three are synthetic datasets downloaded from [17], presenting different degrees of cluster overlapping, as can be seen in Figure 3. The remaining sets are real-life datasets obtained from [16], which are commonly used for the validation of clustering methods. The reported values refer to the original labeling provided with the datasets. Table 1 shows the number of clusters K, the number of objects N, and the number of features d for each dataset. Moreover, the average intra-cluster and inter-cluster distances were computed.

Table 1. Description of the datasets used in the experiments. K is the number of clusters, N is the number of data points, and d is the number of features.

Dataset     K    N     d   Avg Intra-Cluster Dist.   Avg Inter-Cluster Dist.   CH value
S2          15   5000  2   42294.60                  182459.12                 12541
S3          10   3254  2   54507.97                  174914.69                  5002
S4          15   5000  2   55312.63                  149595.93                  5385
Iris         3    150  4       0.67                       2.15                   486
B. Cancer    2    638  9       2.66                       4.65                   303
Seeds        3    210  7       1.58                       3.85                   310
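For reference, the baseline protocol used for K-means (running K = 2, ..., 20 and keeping the partition with the highest CH value) could be reproduced roughly as follows with scikit-learn. The dataset here is a random placeholder, not one of the sets in Table 1, and the FCM baseline would follow the same pattern with a fuzzy C-means implementation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

def best_kmeans_by_ch(X, k_max=20, seed=0):
    """Run K-means for K = 2..k_max and keep the labeling with the highest CH index."""
    best = (None, -np.inf, None)  # (labels, CH value, K)
    for k in range(2, k_max + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        ch = calinski_harabasz_score(X, labels)
        if ch > best[1]:
            best = (labels, ch, k)
    return best

X = np.random.default_rng(0).normal(size=(300, 2))  # placeholder data
labels, ch, k = best_kmeans_by_ch(X)
print(f"best K = {k}, CH = {ch:.2f}")
```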

(a) Dataset S2  (b) Dataset S3  (c) Dataset S4
Fig. 3. The three synthetic datasets used in the experiments, presenting different degrees of cluster overlapping. Colors are used to identify the clusters.

For each cluster, the intra-cluster distance is the mean of the distances of all data points to the cluster centroid. The inter-cluster distance is computed, for each cluster, as the minimal distance between cluster centroids. Table 1 also shows the CH value, computed using equation (1).

For all the datasets, 20 independent runs of each method were performed. Since K-means and FCM depend on the initial clustering, which is obtained by randomly sampling the dataset and by generating random membership values, respectively, slightly different results may be obtained in different runs. For ACGA, the values in Table 2 were assigned to the parameters described in Section 2.

Table 3 shows the results obtained for all the datasets in Table 1, using K-means, FCM, and ACGA. The last column of Table 3 presents the values of the Adjusted Rand (AR) index [18], which is a measure of agreement between two partitions and is frequently used in cluster validation. We computed the AR index between the original labelings and the ones obtained with each method. This index indicates how similar the obtained partition is to the original one. For two random partitions, the expected value of the AR index is 0, and its maximum value is 1, obtained when comparing equal partitions. By comparing the AR index values in Table 3, it can be seen that ACGA generally outperforms both K-means and FCM.

Table 2. User-defined parameter values used in the experiments.

Maximum number of clusters, K_max                          20
Population size, P                                         40
No. of solutions chosen for parent selection, P_p           3
Crossover probability, p_cross                             0.3
Mutation probability, p_mut                                0.1
Activation value mutation parameter, α                     0.1
Centroid mutation parameter, ε                             0.3
No. of elite solutions, B                                   4
Maximum no. of iterations, I_max                         5000
Maximum no. of iterations without improvement, I_ni        500
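The AR index used in Table 3 is available in standard tooling. A minimal sketch, assuming ground-truth labels are available for the dataset (the label arrays below are illustrative):

```python
from sklearn.metrics import adjusted_rand_score

true_labels = [0, 0, 1, 1, 2, 2]   # original labeling provided with the dataset (illustrative)
found_labels = [1, 1, 0, 0, 2, 2]  # labeling produced by a clustering method (illustrative)

# Prints 1.0: the AR index is invariant to permutations of the cluster labels.
print(adjusted_rand_score(true_labels, found_labels))
```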

Table 3. Results obtained over 20 independent runs using ACGA, K-means, and FCM. The first line of each dataset corresponds to the ground truth values in Table 1.

Dataset    Method    Avg. No. Clu.    Avg. Inter-Cluster Dist.   Avg. CH               Avg. AR
S2         -         15               182459.12                  12541                 -
           ACGA      15.000±0.000     183081.207±316.112         13461.515±420.630     0.937±0.003
           K-means   16.967±1.426     148383.088±10516.459       10658.228±1697.369    0.859±0.048
           FCM       15.233±0.430     171068.182±5657.558        13106.362±924.506     0.926±0.027
S3         -         10               174914.69                  5002                  -
           ACGA      10.000±0.000     177104.685±420.960         6307.633±16.153       0.806±0.003
           K-means   10.267±1.112     173354.951±15347.364       5482.044±584.628      0.736±0.042
           FCM       10.133±0.434     170345.561±9489.542        6091.281±387.783      0.785±0.041
S4         -         15               149595.93                  5385                  -
           ACGA      15.000±0.000     152034.821±358.558         7865.943±12.925       0.724±0.003
           K-means   15.933±1.460     138291.741±7920.736        6788.132±605.469      0.664±0.035
           FCM       15.567±0.626     141600.223±4524.683        7570.782±339.299      0.713±0.019
Iris       -         3                2.15                       486                   -
           ACGA      3.000±0.000      2.185±0.000                506.297±0.000         0.886±0.000
           K-means   3.067±0.450      1.932±0.319                473.496±75.401        0.791±0.158
           FCM       3.000±0.000      1.940±0.000                506.297±0.000         0.886±0.000
B. Cancer  -         2                4.65                       303                   -
           ACGA      2.000±0.000      11.793±1.117               1026.084±0.351        0.847±0.004
           K-means   2.000±0.000      13.816±0.000               1026.262±0.000        0.846±0.000
           FCM       2.000±0.000      13.886±0.000               1023.939±0.000        0.830±0.000
Seeds      -         3                3.85                       310                   -
           ACGA      3.000±0.000      3.107±0.153                329.191±3.108         0.704±0.012
           K-means   3.000±0.000      3.988±0.000                328.267±0.000         0.706±0.000
           FCM       3.000±0.000      3.970±0.000                329.918±0.000         0.694±0.000

As an example, an AR value of 0.886 obtained for the Iris dataset means that 6 data points were misclassified, corresponding to 4%. The superiority of ACGA is clear by comparing the results obtained for all synthetic datasets, where both K-means and FCM failed to partition the dataset into the correct number of clusters in every run. Figure 4 shows a partition of the S2 dataset obtained with the three methods, where each cluster is identified by a color. Comparing the results to the original clusters in Figure 3(a), it can be seen that our solution (Figure 4(a)) yielded the correct number of clusters, with only a few misclassified objects. K-means (Figure 4(b)) performed poorly, since it was not capable of finding the optimal number of clusters. FCM (Figure 4(c)) produced a result considerably better than K-means. However, it still failed to correctly classify many data points. Moreover, ACGA leads to smaller intra-cluster distances and larger inter-cluster distances, and consequently larger CH values. Due to cluster overlapping, the CH values computed using the original partitions (Table 1) are, in some cases, lower than the ones obtained with ACGA. Also, ACGA presents much smaller standard deviations than both K-means and FCM, especially in the synthetic datasets. This means that ACGA is capable of producing stable results, independently of the initialization, converging to very similar solutions in different runs.

(a) ACGA.  (b) K-means.  (c) Fuzzy C-Means.
Fig. 4. Results obtained for dataset S2 (refer to Figure 3(a)) using ACGA, K-means, and FCM. Clusters are identified by colors.

Table 4. Average number of iterations required to reach the results in Table 3.

Dataset     S2      S3      S4      Iris   B. Cancer   Seeds
Iterations  2936.5  2199.5  3316.0  15.6   588.5       353.3

Table 4 shows the average number of iterations necessary to achieve the solutions in Table 3. The values were obtained by subtracting the number of iterations without improvement of the best solution in the population, I_ni, from the total number of iterations. It can be seen that, for smaller and well-behaved datasets, ACGA is fast and converges after a small number of iterations. For larger datasets, such as S2, it requires approximately 3000 iterations.

Generally, using parameters such as the mutation probability that adapt to the evolution of the algorithm is a good technique, since higher diversity in the population can be achieved, preventing the algorithm from getting stuck in local minima. Thus, we tested the influence of an increase in p_mut triggered by reaching a certain number of iterations without improvement of the best solution in the population. Results obtained with dataset S2 are shown in Figure 5(a), which plots the average (blue) and maximum (red) quality of the population, as well as the number of clusters of the best solution (black), at each iteration. It can be seen that, despite increasing diversity in the population, applying mutation at higher rates leads to a decrease in the average quality of the population, which is clear from the evolution over the last 500 iterations. Thus, we kept all parameters fixed after a tuning process, and the algorithm behaves as depicted in Figure 5(b), where the average quality curve always presents the same pattern. The maximum quality in the population does not decrease, due to elitism: the best B solutions in each generation pass to the next generation unaltered. This moderate degree of selective pressure due to elitism contributed to improving the results without reducing exploration. Also, the high population diversity is evinced by the significant changes in the average fitness in consecutive iterations of the algorithm. The curves show that significant changes in the maximum quality are generally caused by changes in the number of clusters.

(a) Results using adaptive mutation probability. (b) Results using fixed mutation probability.
Fig. 5. Evolution of the fitness and number of clusters of the solutions in the population for dataset S2. The figures show the maximum (red) and average (blue) fitnesses of the population, as well as the number of clusters of the best solution at each iteration (black). Fitness values have been scaled by a factor of 10^3.

4 Conclusion

We present a new GA approach, ACGA, for automatically finding the optimal number of clusters in real and synthetic datasets with different degrees of overlapping, and for correctly assigning each data point to a cluster. It uses an encoding scheme that, to the best of our knowledge, had never been incorporated into GA, which required the development of new genetic operators that ensure diversity in the population. ACGA does not require any information about the data and is able to outperform two classical clustering methods: K-means and FCM. Experiments included three synthetic and three real datasets, and the results show that ACGA leads to partitions very similar to the original ones, requiring a small number of iterations to converge.

Acknowledgments. This work was financially supported by a PhD grant (SFRH/BD/88446/2012) from the Portuguese Foundation for Science and Technology (FCT). The author Carlos Henggeler Antunes acknowledges the support of FCT project PEst-OE/EEI/UI0308/2014 and QREN Mais Centro Program iCIS project (CENTRO-07-ST24-FEDER-002003).

References

1. Belahbib, F., Souami, F.: Genetic algorithm clustering for color image quantization. 3rd European Workshop on Visual Information Processing (EUVIP), pp. 83-87 (2011)
2. Mecca, G., Raunich, S., Pappalardo, A.: A New Algorithm for Clustering Search Results. Data and Knowledge Engineering, vol. 62, pp. 504-522 (2007)

3. Valafar, F.: Pattern Recognition Techniques in Microarray Data Analysis: A Survey. Annals of the New York Academy of Sciences, vol. 980, pp. 41-64 (2002)
4. Hartigan, J., Wong, M.: Algorithm AS 136: A K-Means Clustering Algorithm. Applied Statistics, vol. 28, no. 1, pp. 100-108 (1979)
5. Bezdek, J., Ehrlich, R., Full, W.: FCM: The fuzzy c-means clustering algorithm. Computers and Geosciences, vol. 10, no. 2-3, pp. 191-203 (1984)
6. Holland, J.: Genetic algorithms. Scientific American (1992)
7. Srinivas, M., Patnaik, M.: Genetic algorithms: A survey. IEEE Computer, vol. 27, no. 6, pp. 17-26 (1994)
8. Murthy, C., Chowdhury, N.: In search of optimal clusters using GA. Pattern Recognition Letters, vol. 17, pp. 825-832 (1996)
9. Tseng, L., Yang, S.: A genetic approach to the automatic clustering problem. Pattern Recognition, vol. 34, no. 2, pp. 415-424 (2001)
10. Agustin-Blas, L., Salcedo-Sanz, S., Jimenez-Fernandez, S., Carro-Calvo, L., Del Ser, J., Portilla-Figueras, J.A.: A new grouping GA for clustering problems. Expert Systems with Applications, vol. 39, no. 10 (2012)
11. Sheikh, R., Raghuwanshi, M., Jaiswal, A.: Genetic Algorithm Based Clustering: A Survey. First International Conference on Emerging Trends in Engineering and Technology, vol. 2, no. 6, pp. 314-319 (2008)
12. Liu, Y., Wu, X., Shen, Y.: Automatic clustering using genetic algorithms. Applied Mathematics and Computation, vol. 218, no. 4, pp. 1267-1279 (2011)
13. He, H., Tan, Y.: A two-stage genetic algorithm for automatic clustering. Neurocomputing, vol. 81, pp. 49-59 (2012)
14. Das, S., Abraham, A., Konar, A.: Automatic Clustering Using an Improved Differential Evolution Algorithm. IEEE Transactions on Systems, Man and Cybernetics, Part A: Systems and Humans, vol. 38, no. 1, pp. 218-237 (2008)
15. Calinski, R., Harabasz, J.: A dendrite method for cluster analysis. Communications in Statistics, vol. 3, no. 1, pp. 1-27 (1974)
16. Asuncion, A., Newman, J.: UCI Machine Learning Repository, http://www.ics.uci.edu/~mlearn/mlrepository.html, Irvine, CA: University of California, Department of Information and Computer Science (2007)
17. Speech and Image Processing Unit: Clustering datasets, http://www.cs.joensuu.fi/sipu/datasets/
18. Hubert, L., Arabie, P.: Comparing Partitions. Journal of Classification, no. 2, pp. 193-218 (1985)