Clustering using a coarse-grained parallel Genetic Algorithm: A Preliminary Study

Nalini K. Ratha, Anil K. Jain, Moon J. Chung
Department of Computer Science, Michigan State University, East Lansing, MI 48824
ratha@cps.msu.edu, jain@cps.msu.edu, chung@cps.msu.edu

Abstract

Genetic Algorithms (GAs) are useful in solving complex optimization problems. By posing pattern clustering as an optimization problem, GAs can be used to obtain optimal minimum squared-error partitions. In order to improve the total execution time, a distributed algorithm has been developed using a divide and conquer approach. Using a standard communication library called PVM, the distributed algorithm has been implemented on a workstation cluster. The GA approach gives better quality clusters than a standard K-Means clustering algorithm for many data sets, and the distributed implementation achieves a near-linear speedup.

Keywords: Genetic Algorithm, Pattern Clustering, PVM, Workstation cluster.

1 Introduction

Clustering algorithms group patterns based on measures of similarity or dissimilarity. Data clustering is an important technique in the field of exploratory data analysis. Clustering algorithms can be broadly classified into one of two types: (i) hierarchical or (ii) partitional. Hierarchical clustering is concerned with obtaining a nested hierarchical partition of the data, whereas in partitional clustering we are interested in generating a single partition describing the groups or clusters present in the data. Partitional clustering can be formally defined as follows: given a collection of n pattern vectors, where each pattern is an m-dimensional vector characterized by the set of features (x_1, x_2, ..., x_m), find the clusters present in the data.
A cluster is defined by the similarity of the patterns present in it. The number of clusters may be known or unknown. Jain and Dubes [6] describe a number of clustering techniques and indices for cluster validity. The minimum squared-error is a well-known criterion used to obtain a specified number of clusters. The squared-error for a set of n m-dimensional patterns partitioned into K clusters is given by

E_K = \sum_{k=1}^{K} e_k,  (1)

where K is the desired number of clusters, and e_k is defined as

e_k = \sum_{i=1}^{n_k} (x_i^{(k)} - m^{(k)})^t (x_i^{(k)} - m^{(k)}),  (2)

m^{(k)} = \frac{1}{n_k} \sum_{i=1}^{n_k} x_i^{(k)},  n = \sum_{k=1}^{K} n_k.  (3)

The mean-squared-error criterion function typically looks for clusters with hyperellipsoidal shapes. The most well-known squared-error clustering algorithms are K-Means, ISODATA, CLUSTER, and FORGY. The main problem with these algorithms is the nonoptimality of the resulting partitions. Moreover, these algorithms often give different clusters when run with different initial cluster centers, as they usually get stuck at a local minimum. The simplest and most well-known partitional clustering algorithm is the K-Means algorithm. A sequential version of the K-Means algorithm is shown in Table 1. The algorithm requires that the user specify the desired number of clusters. The partitional clustering problem can also be viewed as an optimization problem. For the squared-error criterion, the clustering problem can be stated as finding the clusters (i.e., finding a labeling for the patterns) such that the between-group scatter is maximized and the within-group scatter is minimized. Many stochastic techniques exist in the literature which address the issue of achieving the global minimum of a criterion function. Simulated annealing and
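As an illustration, the squared-error criterion defined above can be computed directly from a labeling. The following is a minimal sketch; the function and variable names are ours, not the paper's.

```python
import numpy as np

def squared_error(patterns, labels, K):
    """Sum over clusters of squared distances to each cluster mean."""
    E = 0.0
    for k in range(K):
        members = patterns[labels == k]
        if len(members) == 0:
            continue
        m_k = members.mean(axis=0)          # cluster mean m^(k)
        E += ((members - m_k) ** 2).sum()   # e_k for cluster k
    return E

# Two tight clusters around (0, 0) and (10, 10):
X = np.array([[0., 0.], [0., 1.], [10., 10.], [10., 11.]])
print(squared_error(X, np.array([0, 0, 1, 1]), K=2))  # -> 1.0
```

Each cluster here contributes 0.5 (two points, each 0.5 units from the cluster mean), so the correct two-cluster labeling scores 1.0, while a single-cluster labeling would score far higher.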
Input: n m-dimensional patterns, and K (the number of desired clusters).
Output: K non-overlapping clusters, i.e., a labeling of all the n patterns with labels from the set {1..K}.
Method:
1. Select K points randomly as cluster centers.
2. Repeat
   For i = 1 to n do: if pattern[i] is closest to the j-th cluster center, assign it to cluster j.
   Compute new cluster centers as the average of the patterns in each cluster.
   Until no changes occur in the cluster centers.

Table 1: K-Means algorithm.

genetic algorithms are some of these techniques. Simulated annealing has been used to solve the partitional clustering problem [7]. A genetic algorithm (GA) is a search procedure based on the "survival of the fittest" principle [3]. The "fittest candidate" is the best solution at any given time. By running the evolution process for a sufficiently large number of generations, we hope to reach the global minimum. The genetic algorithm is a model of machine learning [5]. It mimics the evolutionary process in nature by creating a population of individuals represented by chromosomes. These individuals go through a process of evolution in which different individuals compete for resources in the environment. The "fittest" individual survives and propagates its genes. The "crossover" operation is the process of exchanging chunks of genetic information between a pair of chromosomes. As in the natural evolution process, GAs also define a "mutation" operation, in which a gene undergoes a change by itself. A general scheme for a GA is shown in Table 2. The main issues in designing a GA are (i) a suitable problem representation that enables application of the GA operators, (ii) selecting a suitable candidate fitness evaluation function, and (iii) defining the crossover and mutation functions. Other global parameters, such as the population size, the crossover and mutation probabilities, and the number of generations, also play an important role in obtaining good quality results with GAs.
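The K-Means procedure above can be sketched as follows; this is an illustrative implementation with our own names and random initialization, not the authors' code.

```python
import numpy as np

def k_means(patterns, K, rng=None, max_iter=100):
    rng = np.random.default_rng(rng)
    # Step 1: pick K patterns at random as the initial cluster centers.
    centers = patterns[rng.choice(len(patterns), K, replace=False)]
    for _ in range(max_iter):
        # Assign each pattern to its closest center.
        d = ((patterns[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        # Recompute each center as the mean of its assigned patterns.
        new = np.array([patterns[labels == k].mean(axis=0)
                        if (labels == k).any() else centers[k]
                        for k in range(K)])
        if np.allclose(new, centers):   # stop when centers are stable
            break
        centers = new
    return labels, centers
```

As the paper notes, the result depends on the random initialization, which is exactly the local-minimum behavior the GA approach is meant to avoid.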
Genetic algorithms have been used in many pattern recognition and image processing applications, including image segmentation [1], feature selection [11], and shape analysis [2]. The main drawback of genetic algorithms is the amount of time taken for convergence. The search space grows exponentially as a function of the problem size; hence, the number of generations needed to reach a global solution increases rapidly. A number of methods have been described in the literature to improve the convergence [3]. We adopt a divide and conquer strategy to combat the unacceptable convergence time. The divide and conquer approach lends itself to a coarse-grained parallel implementation. Squared-error clustering algorithms are compute intensive. As a result, a number of parallel clustering algorithms have been proposed in the literature. Ni and Jain [9] describe a systolic array-based algorithm that can be implemented in VLSI. Li et al. [8] proposed a SIMD algorithm with O(k log NM) time complexity using NM processing elements (PEs). Another SIMD algorithm, described by Ranka and Sahni [10], has a time complexity of O(k + log NM) with NM processors in a hypercube. We have used a set of general-purpose workstations connected over a local area network (LAN) to implement the squared-error clustering algorithm using a genetic algorithm. The purpose of this paper is twofold. First, we show that the local-minimum problem associated with a standard squared-error clustering algorithm can be overcome by using a genetic algorithm. Second, the slow convergence of genetic algorithms can be somewhat alleviated by using a cluster of workstations. Thus, the combination of the two approaches can result in good clusters within a reasonable execution time. The remainder of the paper is organized as follows. Section 2 describes a sequential genetic algorithm for partitional clustering. The parallel algorithm using a coarse-grained approach is described in Section 3.
Both the sequential and distributed algorithms have been implemented. An analysis of the results in terms
of the quality of clusters and speedup is carried out in Section 4. The conclusions and future work are presented in Section 5.

2 A Genetic Algorithm for Clustering

The squared-error clustering problem can also be posed as a label assignment problem. Each of the n patterns needs to be assigned a label from the set {1, ..., K} such that the squared error in Eq. (1) is minimized. Using this definition of clustering, we form the chromosome as a bit stream of pattern labels, and we can apply the genetic operators to this bit stream. The sequential genetic algorithm for pattern clustering is described in Table 3. The clustering problem has now been presented as an optimization problem. The standard crossover and mutation operators can be applied to potential solutions represented as bit streams. However, we need to define the fitness function. The fitness of a new-generation candidate should be better than that of its parents [5]. We define a variation of the squared-error as the fitness function. The fitness value of a candidate is computed as follows:

1. Let WorstScore be the squared-error when all the patterns form a single cluster.
2. Let PresentScore be the squared-error obtained by the present assignment of labels.
3. FitnessScore = e^{-PresentScore/(WorstScore \cdot T)}, where T is a normalization constant.

The normalization is done so that the fitness value lies between 0 and 1 and can be used as the probability of a candidate being selected for crossover. From the squared-error criterion point of view, a crossover need not result in a better solution. Hence, we restrict crossover to cases where it results in a lower squared-error value with respect to the parents; otherwise, the generated candidate is rejected. In this way, we ensure that the population moves towards a global optimum.

3 Coarse-Grained Parallel Algorithm

The main drawback of any GA scheme is the time taken to converge to the global minimum.
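The fitness mapping above can be sketched as follows. This is our own reading of the three steps (the exponential form and all names are assumptions on our part): the squared-error of the current labeling is normalized against the one-cluster worst case and mapped through exp(-x) into (0, 1].

```python
import numpy as np

def fitness(patterns, labels, K, T=1.0):
    def sq_err(lbl, k_count):
        e = 0.0
        for k in range(k_count):
            members = patterns[lbl == k]
            if len(members):
                e += ((members - members.mean(axis=0)) ** 2).sum()
        return e
    # WorstScore: all patterns in a single cluster.
    worst = sq_err(np.zeros(len(patterns), dtype=int), 1)
    # PresentScore: squared-error of the candidate labeling.
    present = sq_err(labels, K)
    return np.exp(-present / (worst * T))

X = np.array([[0., 0.], [0., 1.], [10., 10.], [10., 11.]])
good = fitness(X, np.array([0, 0, 1, 1]), 2)
bad = fitness(X, np.array([0, 1, 0, 1]), 2)
# A correct labeling scores closer to 1 than a scrambled one.
```

Since PresentScore never exceeds WorstScore, the value stays in (0, 1] and can be used directly as a selection probability, as the paper describes.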
In order to speed up the total execution time, we need to explore distributed/parallel methods. There are two ways to parallelize the above algorithm: (i) divide and conquer, and (ii) distributed computation. In the first method, the n pattern vectors are divided into P groups, assuming that P processors are available. Each of the P processors works on the data assigned to it using the sequential algorithm. After each processor is done with its task, we have PK clusters. The master then runs a K-Means pass on the PK cluster centers to obtain the desired K clusters. The advantage of this method is that it needs very little communication between the processors. The disadvantage is that it is difficult to balance the load on a heterogeneous workstation cluster, as each subset of the data might take a different number of passes to complete. Hence, the overall execution time may depend on the slowest workstation. In the distributed computation method, the pattern vectors are distributed as before, but at the end of every phase the result is communicated to the master. A minor variation is that the partial results (best candidates) can be sent asynchronously. This method has a higher communication overhead, but the work load can be balanced. We use the divide and conquer method because of its low communication requirement. A schematic diagram of this approach is shown in Figure 1. The distributed algorithm, which follows a simple divide and conquer strategy, is described in Table 4; it is based on a master-slave protocol.

4 Results

Both the sequential and distributed algorithms have been implemented. We used the PVM communication library to implement the distributed algorithm. PVM was developed at the Oak Ridge National Laboratory [4] and is available as public domain software. It supports heterogeneous computing: users call architecture-independent (transparent) subroutines for passing messages between the nodes.
There is no synchronization involved in our algorithm, as the slaves are independent of each other. However, the master has to wait for the results from all the slaves before it can start the merge pass. The following data sets are used to evaluate the performance of the genetic algorithm approach:

1. A set of two-dimensional points shown in Figure 2. This data set contains two clusters.
2. A data set on which the K-Means algorithm fails, also shown in Figure 2. This data set contains three clusters.
3. A subset of the IRIS data. The IRIS data set is well-known in the pattern recognition literature; it has four features, 3 classes, and 50 patterns per class. We have chosen only 10 patterns per class.
4. The full IRIS data (150 patterns, 4 features, 3 classes).
t := 0;
initialize population P(t);
evaluate P(t);
do while (true)
    t := t + 1;
    P' := select parents P(t);
    recombine P'(t);
    mutate P'(t);
    evaluate P'(t);
    new population := survivors(P, P');
end;

Table 2: A simple genetic algorithm.

Input: n m-dimensional patterns, and K (the desired number of clusters).
Output: K non-overlapping clusters, i.e., a labeling of all the n patterns with labels {1, ..., K}.
Method: Let P_s = population size, k = \lceil \log_2 K \rceil, p_c = probability of crossover, and p_m = probability of mutation.
1. Coding: Each pattern can take a label of k bits. Hence, the string length is nk bits.
2. Initial Population: Randomly generate P_s bit streams of nk bits each.
3. For i = 1 to P_s, compute the fitness value of each candidate.
4. Reproduction: Reproduce the i-th string with probability proportional to its fitness value.
5. Crossover: Each pair of strings undergoes a crossover at a randomly chosen position with probability p_c.
6. Mutation: A mutation is carried out by flipping randomly chosen bits with probability p_m.
7. Repeat steps (3-6) for a specified number of generations.

Table 3: Sequential GA for clustering.
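The coding and operator steps of Table 3 can be sketched on plain bit strings; the helper names below are ours, and the decode step's clamping of invalid codes is an assumption (the paper does not say how out-of-range labels are handled).

```python
import math
import random

def encode(labels, K):
    """Each label takes ceil(log2 K) bits; concatenate them."""
    k = max(1, math.ceil(math.log2(K)))
    return "".join(format(l, f"0{k}b") for l in labels)

def decode(bits, K):
    k = max(1, math.ceil(math.log2(K)))
    return [min(int(bits[i:i + k], 2), K - 1)  # clamp invalid codes
            for i in range(0, len(bits), k)]

def crossover(a, b, rng):
    """Single-point crossover at a random position."""
    p = rng.randrange(1, len(a))
    return a[:p] + b[p:], b[:p] + a[p:]

def mutate(bits, p_m, rng):
    """Flip each bit independently with probability p_m."""
    return "".join(b if rng.random() > p_m else str(1 - int(b))
                   for b in bits)

rng = random.Random(0)
print(encode([0, 0, 1, 1], K=2))  # -> 0011
```

With K = 2 each label is a single bit, so the chromosome for n patterns is simply an n-bit string, which is the representation the paper's crossover and mutation act on.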
Figure 1: Scheme for a distributed clustering approach. The n m-dimensional data points are divided into P groups of n/P points, one per PE; the resulting KP cluster centers are collected by the master.

Input: n m-dimensional patterns, and K (the desired number of clusters).
Output: K clusters.
Method:
1. Data Distribution: Assign the n patterns to the P processors in a round-robin fashion, thus dividing the data as evenly as possible.
2. Computation Phase: Each PE clusters the data set assigned to it using the sequential method described previously. At the end of the run, the result is sent to the master.
3. Merge Phase: The master collects the PK cluster centers and applies a K-Means algorithm to these PK points. It is assumed that PK << n.
4. Reassignment of Labels: Depending on the result of the merge phase, the patterns are assigned new labels to get the final set of K clusters.

Table 4: A coarse-grained parallel GA for clustering.
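The master-slave scheme of Table 4 can be sketched as follows, with Python threads standing in for PVM slaves; all names, the thread-based transport, and the simplified deterministic per-slave K-Means (used in place of the per-slave GA) are our assumptions, not the paper's implementation. Each slave clusters its n/P patterns, and the master then runs K-Means on the P*K returned centers and relabels all patterns.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def simple_kmeans(X, K, iters=20):
    centers = X[:K].copy()          # deterministic init for the sketch
    for _ in range(iters):
        lbl = ((X[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        for k in range(K):
            if (lbl == k).any():
                centers[k] = X[lbl == k].mean(0)
    return centers

def parallel_cluster(X, K, P):
    # Step 1: round-robin data distribution (each chunk needs >= K points).
    chunks = [X[i::P] for i in range(P)]
    # Step 2: each "slave" clusters its chunk independently.
    with ThreadPoolExecutor(max_workers=P) as pool:
        partial = list(pool.map(lambda c: simple_kmeans(c, K), chunks))
    # Step 3: merge pass over the P*K centers at the master.
    pk_centers = np.vstack(partial)
    final = simple_kmeans(pk_centers, K)
    # Step 4: reassign labels against the final K centers.
    labels = ((X[:, None] - final[None]) ** 2).sum(-1).argmin(1)
    return labels, final
```

The slaves never communicate with each other, matching the paper's observation that the only synchronization point is the master waiting for all partial results before the merge pass.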
The clusters obtained by the K-Means algorithm for the two synthetic data sets are shown in Figure 3. The clusters obtained for these data sets using the coarse-grained parallel GA are shown in Figure 4. The cluster labels for the 30 patterns of the IRIS data are shown in Figure 5. For the full IRIS data (150 patterns), the confusion matrices of the assigned labels are shown in Table 5 for K-Means and the parallel genetic algorithm, respectively. Out of 150 patterns, 15 patterns were misclassified by the parallel genetic algorithm, in contrast to 16 patterns misclassified by the K-Means algorithm. Typically, the K-Means algorithm is run more than once to verify that the solution obtained does not correspond to a local minimum. For large data sets, this can be very costly. For our synthetic data set in Figure 3, with multiple runs of the K-Means algorithm, we were able to obtain the correct labels for the three clusters. The total execution time for the full IRIS data set using 1 to 5 workstations is shown in Table 6. For this experiment, we used Sun SPARCstation 10 workstations which were mostly idle during the experiment. The clustering results obtained using GAs are better than those of the standard K-Means algorithm. The performance in terms of speed is evaluated using the following definition of speedup:

Speedup = (Execution time on 1 workstation) / (Execution time on P workstations).

The resulting speedup for 1 to 5 workstations is given in Table 6. The best speedup is 4.1 for 5 workstations.

5 Conclusions and Future Work

We have implemented a distributed genetic algorithm for pattern clustering on a workstation cluster. The clustering results are better for the genetic algorithm than for the K-Means algorithm. The evaluation criterion for the parallel implementation is the ratio of the execution time on a single workstation to the execution time on P workstations. The obtained speedup is near linear.
We feel that, because of our divide and conquer approach, we obtained better results compared to a single distributed large-population GA. We have not addressed the following issues in our implementation: (i) load balancing in case of heterogeneous nodes in the cluster, (ii) fault tolerance, (iii) large data sets, and (iv) advanced GA features such as 2-point crossover, restricted mating, and other genetic operators such as inversion, reordering, and epistasis [3].

References

[1] Philippe Andrey and Philippe Tarroux. Unsupervised image segmentation using a distributed genetic algorithm. Pattern Recognition, 27(5):659-673, May 1994.
[2] Jerzy Bala and Harry Wechsler. Shape analysis using genetic algorithms. Pattern Recognition Letters, 14(12):965-973, December 1993.
[3] R. Bianchini and C. Brown. Parallel Genetic Algorithms on distributed-memory architectures. Technical Report 436, Computer Science Department, The University of Rochester, New York, 1993.
[4] Al Geist, Adam Beguelin, Jack Dongarra, Weicheng Jiang, Robert Manchek, and Vaidy Sunderam. PVM 3 User's Guide and Reference Manual. Oak Ridge National Laboratory, Tennessee, 1993.
[5] David E. Goldberg. Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley, New York, 1989.
[6] Anil K. Jain and Richard C. Dubes. Algorithms for Clustering Data. Prentice-Hall, Englewood Cliffs, New Jersey, 1988.
[7] R. W. Klein and R. C. Dubes. Experiments in projection and clustering by simulated annealing. Pattern Recognition, 22(2):213-220, 1989.
[8] Xiaobo Li and Zhixi Fang. Parallel algorithms for clustering on hypercube SIMD computers. In Proc. of IEEE Computer Vision and Pattern Recognition, pages 30-33, 1986.
[9] Lionel M. Ni and Anil K. Jain. A VLSI systolic architecture for pattern clustering. IEEE Trans. on Pattern Analysis and Machine Intelligence, PAMI-7(1):80-89, January 1985.
[10] Sanjay Ranka and Sartaj Sahni. Clustering on a hypercube multicomputer. IEEE Trans. on Parallel and Distributed Systems, 2(2):129-137, April 1991.
[11] W. Siedlecki and J. Sklansky. A note on genetic algorithms for large-scale feature selection. Pattern Recognition Letters, 10(5):335-347, November 1989.
Figure 2: Data sets used for evaluating the GA clustering algorithm. (a) 2-cluster data; (b) 3-cluster data.

Figure 3: Results of the K-Means algorithm. (a) 2-cluster data; (b) 3-cluster data.

Figure 4: Results of the Genetic Algorithm. (a) 2-cluster data; (b) 3-cluster data.
Figure 5: Clustering results for 30 patterns from the IRIS data set: (a) labels by the K-Means algorithm; (b) labels by GA-based clustering. Note that some patterns have been misclassified by the K-Means algorithm.

(a) K-Means                         (b) Parallel GA
True Class | Assigned c1  c2  c3    True Class | Assigned c1  c2  c3
c1         |          50   0   0    c1         |          48   2   0
c2         |           0  48   2    c2         |          11  39   0
c3         |           0  14  36    c3         |           0   2  48

Table 5: Confusion matrices for the IRIS data set: (a) K-Means; (b) Parallel GA.

No. of Workstations | Execution Time | Speedup
1                   | 1740           | 1.0
2                   |  883           | 1.95
3                   |  624           | 2.8
4                   |  521           | 3.3
5                   |  424           | 4.1

Table 6: Total execution time (in milliseconds) and speedup.