Clustering using a Coarse-Grained Parallel Genetic Algorithm: A Preliminary Study

Nalini K. Ratha, Anil K. Jain, Moon J. Chung
Department of Computer Science, Michigan State University, East Lansing, MI 48824
ratha@cps.msu.edu, jain@cps.msu.edu, chung@cps.msu.edu

Abstract

Genetic Algorithms (GA) are useful in solving complex optimization problems. By posing pattern clustering as an optimization problem, GAs can be used to obtain optimal minimum squared-error partitions. In order to improve the total execution time, a distributed algorithm has been developed using the divide and conquer approach. Using a standard communication library called PVM, the distributed algorithm has been implemented on a workstation cluster. The GA approach gives better quality clusters for many data sets compared to a standard K-Means clustering algorithm. We have achieved a near-linear speedup for the distributed implementation.

Keywords: Genetic Algorithm, Pattern Clustering, PVM, Workstation cluster.

1 Introduction

Clustering algorithms group patterns based on measures of similarity or dissimilarity. Data clustering is an important technique in the field of exploratory data analysis. Clustering algorithms can be broadly classified into one of two types: (i) hierarchical or (ii) partitional. Hierarchical clustering is concerned with obtaining a nested hierarchical partition of the data. In partitional clustering we are interested in generating a single partition describing the groups or clusters present in the data. A formal definition of partitional clustering is as follows: given a collection of n pattern vectors, where each pattern is an m-dimensional vector characterized by the set of features $(x_1, x_2, \ldots, x_m)$, find the clusters present in the data. A cluster is defined by the similarity of the patterns present in it. The number of clusters may be known or unknown. Jain and Dubes [6] describe a number of clustering techniques and indices for cluster validity.

The minimum squared error is a well-known criterion used to obtain a specified number of clusters. The squared error for a set of n m-dimensional patterns is given by

$$E_K = \sum_{k=1}^{K} e_k, \qquad (1)$$

where K is the desired number of clusters and $e_k$ is defined as

$$e_k = \sum_{i=1}^{n_k} (x_i^{(k)} - m^{(k)})^t (x_i^{(k)} - m^{(k)}), \qquad (2)$$

where

$$m^{(k)} = \frac{1}{n_k} \sum_{i=1}^{n_k} x_i^{(k)}, \qquad n = \sum_{k=1}^{K} n_k. \qquad (3)$$

The mean-squared-error criterion function typically looks for clusters with hyperellipsoidal shapes. The most well-known squared-error clustering algorithms are K-Means, ISODATA, CLUSTER, and FORGY. The main problem with these algorithms is the non-optimality of the resulting partitions. Moreover, these algorithms often give different clusters when run with different initial cluster centers, as they usually get stuck at a local minimum. The simplest and most well-known partitional clustering algorithm is the K-Means algorithm. A sequential version of the K-Means algorithm is shown in Table 1. The algorithm requires that the user specify the desired number of clusters.

The partitional clustering problem can also be viewed as an optimization problem. For the squared-error criterion, the clustering problem can be stated as finding the clusters (i.e., finding a labeling for the patterns) such that the between-group scatter is maximized and the within-group scatter is minimized.
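For concreteness, the criterion of Eqs. (1)-(3) can be computed directly from a labeling. The following minimal Python sketch is our illustration; the function name and the NumPy dependency are ours, not part of the authors' implementation.

```python
import numpy as np

def squared_error(patterns, labels, K):
    """Squared error E_K of a K-cluster partition (Eqs. 1-3).

    patterns: (n, m) array of n m-dimensional pattern vectors.
    labels:   length-n array of cluster labels in {0, ..., K-1}.
    """
    E = 0.0
    for k in range(K):
        cluster = patterns[labels == k]      # patterns assigned to cluster k
        if len(cluster) == 0:
            continue
        center = cluster.mean(axis=0)        # cluster mean m^(k), Eq. (3)
        diff = cluster - center
        E += np.sum(diff * diff)             # e_k, Eq. (2)
    return E                                 # E_K, Eq. (1)
```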
Many stochastic techniques exist in the literature which address the issue of achieving the global minimum of a criterion function; simulated annealing and genetic algorithms are two such techniques.

Input: n m-dimensional patterns, and K (the number of desired clusters).
Output: Non-overlapping K clusters, i.e., a labeling of all the n patterns with labels from the set [1..K].
Method:
1. Select K points randomly as cluster centers.
2. Repeat
       For i = 1 to n do: if pattern[i] is closest to the j-th cluster center, assign it to cluster j.
       Compute the new cluster centers as the average of the patterns in each cluster.
   Until no changes occur in the cluster centers.

Table 1: K-Means algorithm.
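A direct transcription of the K-Means pass of Table 1 into Python might look as follows; this is a sketch under the same stopping rule (centers unchanged between passes), with helper names of our own choosing.

```python
import numpy as np

def k_means(patterns, K, seed=None):
    """Sequential K-Means of Table 1: random initial centers, then
    alternate assignment and center updates until centers stop moving."""
    rng = np.random.default_rng(seed)
    n, m = patterns.shape
    centers = patterns[rng.choice(n, size=K, replace=False)]   # step 1
    while True:
        # Assign each pattern to its closest center.
        dists = np.linalg.norm(patterns[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each center as the average of its assigned patterns.
        new_centers = np.array([
            patterns[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(K)
        ])
        if np.allclose(new_centers, centers):   # no change: converged
            return labels, centers
        centers = new_centers
```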

Simulated annealing has been used to solve the partitional clustering problem [7]. A genetic algorithm (GA) is a search procedure based on the "survival of the fittest" principle [3]. The "fittest candidate" is the best solution at any given time. By running the evolution process for a sufficiently large number of generations, we are hopeful of reaching the global minimum.

The genetic algorithm is a model of machine learning [5]. It mimics the evolutionary process in nature by creating a population of individuals represented by chromosomes. These individuals go through a process of evolution. Different individuals compete for resources in the environment. The "fittest" individual survives and propagates its genes. The "crossover" operation is the process of exchanging chunks of genetic information between a pair of chromosomes. As in the natural evolution process, GAs also define a "mutation" process, where a gene undergoes changes in itself. A general scheme for a GA is shown in Table 2. The main issues involved in designing a GA are (i) a suitable problem representation that enables application of the GA operators, (ii) selecting a suitable candidate fitness evaluation function, and (iii) defining the crossover and mutation functions. Other global parameters, such as the population size, the crossover and mutation probabilities, and the number of generations, also play an important role in obtaining good quality results with GAs. Genetic algorithms have been used in many pattern recognition and image processing applications, including image segmentation [1], feature selection [11], and shape analysis [2].

The main drawback of genetic algorithms is the amount of time taken for convergence. The search space grows exponentially as a function of the problem size; hence, the number of generations needed to reach a global solution increases rapidly. A number of methods have been described in the literature to improve the convergence [3]. We adopt a divide and conquer strategy to combat the unacceptable convergence time. The divide and conquer approach lends itself to a coarse-grained parallel implementation.

Squared-error clustering algorithms are compute intensive. As a result, a number of parallel clustering algorithms have been proposed in the literature. Ni and Jain [9] describe a systolic array-based algorithm that can be implemented in VLSI. Li and Fang [8] proposed a SIMD algorithm with O(K log NM) time complexity using NM processing elements (PEs). Another SIMD algorithm, described by Ranka and Sahni [10], has a time complexity of O(K + log NM) with NM processors in a hypercube. We have used a set of general-purpose workstations connected over a local area network (LAN) to implement the squared-error clustering algorithm using a genetic algorithm.

The purpose of this paper is twofold. First, we show that the local minima problem associated with a standard squared-error clustering algorithm can be overcome by using a genetic algorithm.
Second, the slow convergence of genetic algorithms can be somewhat alleviated by using a cluster of workstations. Thus, the combination of the two approaches can result in good clusters within a reasonable execution time.

The remainder of the paper is organized as follows. Section 2 describes a sequential genetic algorithm for partitional clustering. The parallel algorithm using a coarse-grained approach is described in Section 3. Both the sequential and distributed algorithms have been implemented; an analysis of the results in terms of cluster quality and speedup is carried out in Section 4. Conclusions and future work are presented in Section 5.

2 A Genetic Algorithm for Clustering

The squared-error clustering problem can also be posed as a label assignment problem. Each of the n patterns needs to be assigned a label from the set {1, ..., K} such that the squared error in Eq. (1) is minimized. Using this definition of clustering, we form the chromosome as a bit stream of pattern labels, to which the genetic operators can be applied. The sequential genetic algorithm for pattern clustering is described in Table 3.

The clustering problem has now been presented as an optimization problem, and the standard crossover and mutation operators can be applied to potential solutions represented as bit streams. However, we need to define the fitness function. The fitness of a new-generation candidate should be better than that of its parents [5]. We define a variation of the squared error as the fitness function. The fitness value of a candidate is computed as follows:

1. Let WorstScore be the squared error when all the patterns form a single cluster.
2. Let PresentScore be the squared error obtained by the present assignment of labels.
3. $\mathit{FitnessScore} = e^{-\mathit{PresentScore}/(\mathit{WorstScore} \cdot T)}$, where T is a normalization constant.

The normalization is done so that the fitness value lies between 0 and 1, which can then be used as the probability of a candidate being selected for crossover. From the squared-error criterion point of view, a crossover need not result in a better solution. Hence, we restrict the crossover to cases where it results in a lower squared-error value with respect to the parents; otherwise, the generated candidate is rejected. In this way, we ensure that the population moves towards a global optimum.

3 Coarse-Grained Parallel Algorithm

The main drawback of any GA scheme is the time taken to converge to the global minimum. In order to speed up the total execution time, we need to explore distributed/parallel methods. There are two ways to parallelize the above algorithm: (i) divide and conquer, and (ii) distributed computation. In the first method, the n pattern vectors are divided into P groups, assuming that P processors are available. Each of the P processors works on the data assigned to it using the sequential algorithm. After each processor is done with its task, we have PK clusters, and the master runs a K-Means pass on the PK cluster centers to obtain the desired K clusters. The advantage of this method is that it needs very little communication between the processors. The disadvantage is that it is difficult to balance the load on a heterogeneous workstation cluster, as each subset of the data might take a different number of passes to complete; hence, the overall execution time may depend on the slowest workstation. In the distributed computation method, the pattern vectors are distributed as before, but at the end of every phase the result is communicated to the master. A minor variation of this is that the partial results (best candidates) can be sent asynchronously. This method has a higher communication overhead, but the work load can be balanced. We use the divide and conquer method because of its low communication requirement. A schematic diagram of this approach is shown in Figure 1. The distributed algorithm is fairly simple using the divide and conquer strategy and is described in Table 4; it is based on a master-slave protocol.

4 Results

Both the sequential and distributed algorithms have been implemented.
We used the PVM communication library to implement the distributed algorithm. PVM was developed at Oak Ridge National Laboratory [4] and is available as public domain software. It supports heterogeneous computing: users call architecture-independent (transparent) subroutines for passing messages between the nodes. There is no synchronization involved in our algorithm, as the slaves are independent of each other. However, the master has to wait for the results from all the slaves before it can start the merge pass.

The following data sets are used to evaluate the performance of the genetic algorithm approach:

1. A set of two-dimensional points shown in Figure 2(a). This data set contains two clusters.
2. A data set on which the K-Means algorithm fails, shown in Figure 2(b). This data set contains three clusters.
3. A subset of the IRIS data. The IRIS data set is well known in the pattern recognition literature. It has four features, 3 classes, and 50 patterns per class. We have chosen only 10 patterns per class.
4. The full IRIS data (150 patterns, 4 features, 3 classes).
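As an aside, the IRIS subset in item 3 is easy to reproduce with today's tooling. The sketch below uses scikit-learn, which is obviously not part of the original 1994 setup; taking the first 10 patterns per class is our assumption, since the paper does not say which 10 were chosen.

```python
from sklearn.datasets import load_iris
import numpy as np

iris = load_iris()
X, y = iris.data, iris.target              # 150 patterns, 4 features, 3 classes

# Keep the first 10 patterns of each class: the 30-pattern subset of item 3.
idx = np.concatenate([np.flatnonzero(y == c)[:10] for c in range(3)])
X30, y30 = X[idx], y[idx]
```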

t := 0;
initialize population(t);
evaluate population(t);
do while (true)
    t := t + 1;
    p := select parents(t);
    recombine(p);
    mutate(p);
    evaluate population(p);
    new population := survivors(population, p);
end;

Table 2: A simple genetic algorithm.

Input: n m-dimensional patterns, and K (the desired number of clusters).
Output: Non-overlapping K clusters, i.e., a labeling of all the n patterns with labels from {1, ..., K}.
Method: Let $P_s$ = population size, $k = \lceil \log_2 K \rceil$, $p_c$ = probability of crossover, and $p_m$ = probability of mutation.
1. Coding: Each pattern can take a label of k bits. Hence the string length is nk bits.
2. Initial Population: Randomly generate $P_s$ streams of nk bits each.
3. For i = 1 to $P_s$, compute the fitness value of each candidate.
4. Reproduction: Reproduce the i-th string with probability proportional to its fitness value.
5. Crossover: Each pair of strings undergoes a crossover at a randomly chosen position with probability $p_c$.
6. Mutation: A mutation is carried out by flipping randomly chosen bits with probability $p_m$.
7. Repeat steps (2)-(6) for a specified number of generations.

Table 3: Sequential GA for clustering.
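Putting Table 3 together with the fitness rule of Section 2, a compact Python sketch of the sequential GA might look as follows. It is our illustration, not the authors' code: it reuses squared_error from the earlier sketch, keeps labels as integers rather than k-bit strings, and uses roulette-wheel selection plus the accept-only-improving-offspring rule; the population size, mutation rate, and generation count are illustrative defaults, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def fitness(patterns, labels, K, worst, T=1.0):
    """FitnessScore = exp(-PresentScore / (WorstScore * T)), in (0, 1]."""
    return np.exp(-squared_error(patterns, labels, K) / (worst * T))

def ga_cluster(patterns, K, pop_size=20, p_m=0.01, generations=100):
    n = len(patterns)
    # WorstScore: squared error with all patterns in a single cluster.
    worst = squared_error(patterns, np.zeros(n, dtype=int), 1)
    pop = [rng.integers(0, K, size=n) for _ in range(pop_size)]   # label strings
    for _ in range(generations):
        fit = np.array([fitness(patterns, c, K, worst) for c in pop])
        probs = fit / fit.sum()
        new_pop = []
        while len(new_pop) < pop_size:
            # Reproduction: pick parents with probability proportional to fitness.
            i, j = rng.choice(pop_size, size=2, p=probs)
            a, b = pop[i].copy(), pop[j].copy()
            # Crossover at a randomly chosen position.
            cut = rng.integers(1, n)
            child = np.concatenate([a[:cut], b[cut:]])
            # Mutation: reassign randomly chosen labels with probability p_m.
            mask = rng.random(n) < p_m
            child[mask] = rng.integers(0, K, size=mask.sum())
            # Accept the child only if it improves on both parents (Section 2).
            if squared_error(patterns, child, K) <= min(
                    squared_error(patterns, a, K), squared_error(patterns, b, K)):
                new_pop.append(child)
            else:
                new_pop.append(a)    # otherwise keep a parent
        pop = new_pop
    return min(pop, key=lambda c: squared_error(patterns, c, K))
```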

Figure 1: Scheme for the distributed clustering approach. The n m-dimensional data points are split into P groups of n/P points, one per processing element PE 1, ..., PE P; the resulting PK cluster centers are merged by the master.

Input: n m-dimensional patterns, and K (the desired number of clusters).
Output: K clusters.
Method:
1. Data Distribution: Assign the n patterns to P processors in a round-robin fashion, thus dividing the data as evenly as possible.
2. Computation Phase: Each PE clusters the data set assigned to it using the sequential method described previously. At the end of the run, the result is sent to the master.
3. Merge Phase: The master collects the PK cluster centers and applies a K-Means algorithm to these PK points. It is assumed that PK << n.
4. Reassignment of Labels: Depending on the result of the merge phase, the patterns are assigned new labels to get the final set of K clusters.

Table 4: A coarse-grained parallel GA for clustering.
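PVM is rarely available today, so the sketch below imitates the master-slave protocol of Table 4 with Python's multiprocessing in place of PVM message passing; this substitution, and the reuse of ga_cluster, k_means, and squared_error from the earlier sketches, are ours.

```python
from multiprocessing import Pool
import numpy as np

def slave(args):
    """Computation phase: run the sequential GA on one data partition."""
    part, K = args
    labels = ga_cluster(part, K)
    # Report this partition's cluster centers back to the master.
    return np.array([part[labels == k].mean(axis=0)
                     for k in range(K) if np.any(labels == k)])

def parallel_ga_cluster(patterns, K, P=4):
    # 1. Data distribution: round-robin split into P groups.
    parts = [patterns[i::P] for i in range(P)]
    # 2. Computation phase on P worker processes (the "slaves").
    with Pool(P) as pool:
        centers = np.vstack(pool.map(slave, [(p, K) for p in parts]))
    # 3. Merge phase: K-Means on the (at most) PK collected centers.
    _, final_centers = k_means(centers, K)
    # 4. Reassignment: label every pattern by its closest final center.
    d = np.linalg.norm(patterns[:, None, :] - final_centers[None, :, :], axis=2)
    return d.argmin(axis=1)
```

On platforms that spawn rather than fork worker processes, the Pool call must sit under an `if __name__ == "__main__":` guard.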

The clusters obtained by the K-Means algorithm for the two synthetic data sets are shown in Figure 3. The clusters obtained for these data sets using the coarse-grained parallel GA are shown in Figure 4. The cluster labels for the 30 patterns of the IRIS subset are shown in Figure 5. For the full IRIS data (150 patterns), the confusion matrices of the assigned labels are shown in Table 5(a) and Table 5(b) for K-Means and the parallel genetic algorithm, respectively. Out of 150 patterns, 15 patterns were misclassified by the parallel genetic algorithm, in contrast to 16 patterns misclassified by the K-Means algorithm. Typically, the K-Means algorithm is run more than once to verify that the solution obtained does not correspond to a local minimum; for large data sets, this can be very costly. For our synthetic data set in Figure 3, with multiple runs of the K-Means algorithm we were able to obtain the correct labels for the three clusters.

The total execution time for the full IRIS data set using 1-5 workstations is shown in Table 6. For this experiment, we used Sun SPARCstation 10 workstations which were mostly idle during the experiment. The clustering results obtained using GAs are better than those of the standard K-Means algorithm. The performance in terms of speed is evaluated using the following definition of speedup:

$$\mathit{Speedup} = \frac{\text{Execution time on 1 workstation}}{\text{Execution time on } P \text{ workstations}}.$$

The resulting speedup for 1-5 workstations is given in Table 6. The best speedup is 4.2 for 5 workstations.

5 Conclusions and Future Work

We have implemented a distributed genetic algorithm for pattern clustering on a workstation cluster. The clustering results are better for the genetic algorithm than for the K-Means algorithm. The evaluation criterion for the parallel implementation is the ratio of the execution time on a single workstation to the execution time on P workstations; the obtained speedup is near linear. We believe that, because of our divide and conquer approach, we obtained better results compared to a distributed large-population scheme. We have not addressed the following issues in our implementation: (i) load balancing in the case of heterogeneous nodes in the cluster, (ii) fault tolerance, (iii) large data sets, and (iv) advanced GA features such as 2-point crossover, restricted mating, and other genetic operators such as inversion, reordering, and epistasis [3].

References

[1] Philippe Andrey and Philippe Tarroux. Unsupervised image segmentation using a distributed genetic algorithm. Pattern Recognition, 27(5):659-673, May 1994.
[2] Jerzy Bala and Harry Wechsler. Shape analysis using genetic algorithms. Pattern Recognition Letters, 14(12):965-973, December 1993.
[3] R. Bianchini and C. Brown. Parallel Genetic Algorithms on distributed-memory architectures. Technical Report 436, Computer Science Department, The University of Rochester, New York, 1993.
[4] Al Geist, Adam Beguelin, Jack Dongarra, Weicheng Jiang, Robert Manchek, and Vaidy Sunderam. PVM 3 User's Guide and Reference Manual. Oak Ridge National Laboratory, Tennessee, 1993.
[5] David E. Goldberg. Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley, New York, 1989.
[6] Anil K. Jain and Richard C. Dubes. Algorithms for Clustering Data. Prentice-Hall, Englewood Cliffs, New Jersey, 1988.
[7] R. W. Klein and R. C. Dubes. Experiments in projection and clustering by simulated annealing. Pattern Recognition, 22:213-220, 1989.
[8] Xiaobo Li and Zhixi Fang. Parallel algorithms for clustering on hypercube SIMD computers. In Proc. of IEEE Computer Vision and Pattern Recognition, pages 130-133, 1986.
[9] Lionel M. Ni and Anil K. Jain. A VLSI systolic architecture for pattern clustering. IEEE Trans. on Pattern Analysis and Machine Intelligence, PAMI-7(1):80-89, January 1985.
[10] Sanjay Ranka and Sartaj Sahni. Clustering on a hypercube multicomputer. IEEE Trans. on Parallel and Distributed Systems, 2(2):129-137, April 1991.
[11] W. Siedlecki and J. Sklansky. A note on genetic algorithms for large-scale feature selection. Pattern Recognition Letters, 10(11):335-346, November 1989.

Figure 2: Data sets used for evaluating the GA clustering algorithm. (a) 2-cluster data; (b) 3-cluster data.

Figure 3: Results of the K-Means algorithm. (a) 2-cluster data; (b) 3-cluster data.

Figure 4: Results of the genetic algorithm. (a) 2-cluster data; (b) 3-cluster data.

Figure 5: Clustering results for the 30-pattern subset of the IRIS data set: (a) labels assigned by the K-Means algorithm; (b) labels assigned by GA-based clustering. Note that several patterns have been misclassified by the K-Means algorithm.

(a) K-Means                      (b) Parallel GA
            Assigned Class                   Assigned Class
True Class  c1    c2    c3       True Class  c1    c2    c3
c1          50     0     0       c1          48     2     0
c2           0    48     2       c2          11    39     0
c3           0    14    36       c3           0     2    48

Table 5: Confusion matrices for the full IRIS data set: (a) K-Means; (b) parallel GA.

No. of Workstations    Execution Time    Speedup
1                      1724.0            1.0
2                       883              1.95
3                       614              2.8
4                       522              3.3
5                       414              4.2

Table 6: Total execution time (in milliseconds).
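As a quick arithmetic check, the speedup column of Table 6 is just the one-workstation time divided by each row's time:

```python
times = {1: 1724.0, 2: 883.0, 3: 614.0, 4: 522.0, 5: 414.0}  # ms, Table 6
speedup = {p: round(times[1] / t, 2) for p, t in times.items()}
print(speedup)   # {1: 1.0, 2: 1.95, 3: 2.81, 4: 3.3, 5: 4.16}
```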