A Clustering Method with Efficient Number of Clusters Selected Automatically Based on Shortest Path


Makki Akasha, Ibrahim Musa Ishag, Dong Gyu Lee, Keun Ho Ryu
Database/Bioinformatics Laboratory, Chungbuk National University, Cheongju, Korea
{Makki, Ibrahim, dglee, khryu}@dblab.chungbuk.ac.kr

Abstract

The proposed method finds the optimal number of clusters in large datasets efficiently, without any intervention from the user, based on the relationships among the data objects. It is divided into two main steps. The first is a filtering step, which uses the shortest path between data objects. The second is a clustering step, which uses the mean distance along the optimal route to obtain the number of clusters. The main advantage of this algorithm is its ability to detect the typical number of clusters among the objects in a dataset. Theoretical analysis and empirical evidence show that our method can generate the cluster groups automatically and more efficiently than other methods. We expect these results to be of interest to researchers and practitioners because they suggest a simple but elegant and effective alternative for clustering large datasets.

Keywords: clustering; data mining; shortest path

Introduction

The problem of clustering datasets has become very important. Clustering algorithms divide datasets into subsets or classes, and they have been used in many applications such as knowledge discovery, compression, and medical applications. Data objects with many attributes, that is, with high dimensionality, can be represented in a multidimensional vector space.

Figure 1. Representation of data in two-dimensional space

The main objective of clustering is to find a rational and valid organization of the data based on the relationships among data objects. Objects within one cluster are more similar to each other than to objects belonging to different clusters or classes. Traditional clustering algorithms can be divided into two types: hierarchical (agglomerative) clustering and divisive clustering [1]. In agglomerative clustering, the number of clusters does not need to be specified manually; only local neighbors are considered at each step. Divisive clustering comes in two forms: crisp clustering, where each object belongs to only one cluster, and fuzzy clustering, where each object belongs to every cluster to a certain degree. The disadvantages of the divisive approach are the difficulty of determining the number of clusters and its sensitivity to noise and outliers [2].

Figure 2. How a genetic algorithm works to find an optimal solution

Genetic algorithms are general methods for searching for solutions to a problem in a large space of candidate solutions, as illustrated in figure 2 [3]. They apply genetic operators such as selection, crossover, and mutation to solve the problem, and every solution has a fitness value that depends on the problem definition.

Figure 3. Search space

For example, in figure 3 the fitness function at the point x = 809 has a small value. Solutions are used to produce the next generation by reproduction, and solutions with higher fitness values have a greater chance of reproducing. A solution, or chromosome, can be represented with non-binary numbers of integer or floating-point type.

Proposed Method

This section explains our new method, a clustering method that efficiently selects the number of clusters based on the shortest path. The method tries to find the optimal number of clusters automatically, based on the relationships among the data objects, and is divided into two main steps.

The first is the filtering step, which finds the strongest relationship among the objects in a dataset; our method uses the shortest path for this purpose. It begins by reading the dataset objects and calculating the relationships among them using Euclidean distance. The traveling salesman problem formulation generates sample solutions over the known relationships, which are shown in figures 1 and 6(a). The genetic algorithm calculates a fitness value for every solution; two solutions are selected and produce a new solution that replaces one of its parents, and this process continues until the genetic algorithm finds the best solution, that is, the strongest relationship. The best solution satisfies equation (1) and is shown in figure 8:

Route_min = min( sum_{i=1}^{n-1} d(x_i, x_{i+1}) + d(x_1, x_n) )    (1)

Second, after the filtering step is finished, the clustering step finds the clusters that can be detected in the dataset. Our approach uses the mean distance of the shortest path, given by

AVG = DP_min / n    (2)

where DP_min is the total length of the shortest path and n is the number of objects. The clustering step begins by calculating this mean. It then searches for edges longer than the mean and removes them from the path, so that the shortest path is divided into sub-paths after one or two iterations, as shown in figure 7(c) and (d). The process is then repeated on every sub-path. If the next iteration would produce more than three sub-paths containing a single object, the process stops. Figure 4 shows the steps of our proposed algorithm for obtaining the clusters from a dataset.

Agglomerative fuzzy clustering algorithms such as k-means and c-means give many results for different choices of the number of clusters, and these results must then be compared to find the best one; such algorithms take more time and need intervention from the user [4]. We therefore propose this method to find the number of clusters automatically from the relationships among the objects. The details of the proposed approach are given in the following sections.
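To make the clustering step concrete, here is a minimal Python sketch of a single splitting pass under assumed names of our own (split_route, points, route); the recursive application to each sub-path and the exact terminal condition described above are omitted, and the code is not taken from the authors' implementation.

```python
import numpy as np

def split_route(points, route):
    """One splitting pass of the clustering step (a sketch, not the authors' code).

    points : (n, d) NumPy array of data objects.
    route  : list of point indices in the order found by the filtering step.
    Returns a list of sub-paths (lists of indices), obtained by cutting the
    route at every edge whose length exceeds AVG = DP_min / n, equation (2).
    """
    # Lengths of consecutive edges along the route (treated as an open path).
    edges = [np.linalg.norm(points[route[i]] - points[route[i + 1]])
             for i in range(len(route) - 1)]
    avg = sum(edges) / len(route)   # AVG = DP_min / n, with n the number of objects

    sub_paths, current = [], [route[0]]
    for i, d in enumerate(edges):
        if d > avg:                 # cut the path at an edge longer than the mean
            sub_paths.append(current)
            current = []
        current.append(route[i + 1])
    sub_paths.append(current)
    return sub_paths
```

In the full method this pass would be repeated on each returned sub-path until the terminal condition holds, and sub-paths containing a single object would be treated as outliers.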

Genetic Components

Figure 4. Proposed method steps

The genetic algorithm starts by randomly selecting an initial population. Successive generations are derived by applying the selection, crossover, and mutation operators to the previous tour population: the current generation is acted upon by these three operators to produce the next one. Before these operations are performed, a fitness value f_i is evaluated for each solution [1]. The method employed here first computes DP_i, the total Euclidean distance of each path, and then computes f_i as

f_i = DP_max - DP_i    (3)

where DP_max is the longest Euclidean distance over the solutions in the population [1].

Selection Operation

The selection operator chooses two members from the solutions available in the population to participate in the subsequent crossover and mutation operations. There are two popular methods for implementing this selection. The first, roulette selection, uses a probability based on the fitness of the solution, computed as

P_i = f_i / sum_j f_j    (4)

The second method, deterministic sampling, assigns to each solution (path) a value S_i evaluated as

S_i = TRUNCAT(P_i * NS)    (5)

where TRUNCAT rounds a real value up to the next integer and NS is the number of solutions (paths). The selection operator ensures that each solution participates as a parent exactly S_i times.
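As a hedged illustration of how equations (3)-(5) fit together, the short Python sketch below computes them for a whole population; the function name is ours, and TRUNCAT is read here as integer truncation, although the text above can also be read as rounding up.

```python
import numpy as np

def fitness_and_selection(route_lengths):
    """Sketch of equations (3)-(5). route_lengths[i] is DP_i, the total
    Euclidean length of path i in the current population (assumed not all equal)."""
    dp = np.asarray(route_lengths, dtype=float)
    f = dp.max() - dp                   # (3): f_i = DP_max - DP_i
    p = f / f.sum()                     # (4): roulette probability P_i = f_i / sum_j f_j
    ns = len(dp)
    s = np.trunc(p * ns).astype(int)    # (5): S_i = TRUNCAT(P_i * NS), read as truncation
    return f, p, s
```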

Crossover Operation

After the selection step, the solutions are passed through the crossover operation. There are many proposed crossover procedures [1]; ours is shown in the following figure.

Figure 5. Proposed crossover method

The proposed crossover does not require solving any hard sub-problems, yet it can give nearly optimal solutions for data clustering.

Figure 6. An example of the proposed method

Figure 6 shows how the proposed method works. In the filtering step, it begins with the initial relationships, as in figure 6(a), and uses the genetic algorithm to find the optimal route, as in figure 6(b). In the clustering step, it begins by dividing the optimal path into sub-paths, as in figure 6(c), and continues dividing until the terminal condition becomes true, as in figure 6(d), at which point it stops. Finally, each sub-path is considered a cluster, as in figure 6(e). The method stops when the terminal condition becomes true because further iterations would divide the path down to single objects, the leaf level.
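For the crossover operation above, the paper's own operator appears only in figure 5, so the sketch below uses a standard order crossover (OX) for permutation-encoded tours purely as an illustrative stand-in; it is a common TSP operator and should not be read as a reconstruction of the authors' figure-5 procedure.

```python
import random

def order_crossover(parent_a, parent_b):
    """Simplified order crossover (OX) for permutation-encoded tours.
    parent_a and parent_b are lists containing the same city indices (length >= 2)."""
    n = len(parent_a)
    i, j = sorted(random.sample(range(n), 2))
    child = [None] * n
    child[i:j + 1] = parent_a[i:j + 1]              # keep a contiguous slice of parent A
    fill = [c for c in parent_b if c not in child]  # remaining cities, in B's order
    it = iter(fill)
    return [c if c is not None else next(it) for c in child]
```

OX copies a contiguous slice of one parent and fills the remaining positions in the relative order of the other parent, so every child remains a valid tour.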

Figure 7. Clustering process

Figure 7 shows the clusters of the dataset as they appear during different stages. During the clustering step the proposed method divides the shortest path into many sub-paths, and this process is applied to each sub-path separately until the terminal condition becomes true. Every sub-path that has more than one object is then considered a cluster; otherwise its objects are considered outliers.

The running time of the algorithm can be estimated from the size of the dataset N. The TSP formulation generates a sample of routes; supposing there are K relationships among the objects, the time complexity of one pass is O(NK). The genetic algorithm finds the strongest relationship after M iterations, so the total running time of the filtering step is O(MNK). The clustering step is much cheaper and depends only on the number of stages shown in figure 7; assuming it takes L iterations to find the clusters, the total complexity of the algorithm is O(MNKL).

Experimental Results

Our experimental setup was a Pentium 4 computer with 1 GB of memory and a 2.8 GHz CPU, running Windows XP Professional. We implemented the program in MATLAB and used the iris dataset with different sizes [5]. Table 1 shows the results of the experiments.

TABLE 1. Results of our method on the iris dataset with different sizes (columns 3 and 5), compared with the results of the k-means algorithm (columns 2 and 4)

Figure 8 shows the first test. It clusters the dataset into 7 clusters, according to the first row of table 1. The first part of figure 8 shows how the proposed method finds the shortest path among the objects; since the objects are few, the shortest path is found very quickly during the filtering step. The second part of figure 8 shows how the proposed method finds the typical clusters.

Figure 8. Output of the proposed method (50 tuples)

Figure 9 shows the second test. It groups the dataset into two clusters, according to the second row of table 1. The first part of figure 9 shows how the proposed method finds the shortest path; since there are many objects, this step takes more time than the clustering step. The second part of figure 9 shows how the clustering step finds the typical clusters.

Figure 9. Output of the proposed method (166 tuples)

Figure 10 shows the third test. It groups the dataset into 3 clusters and 12 outliers, according to the third row of table 1. The first part of figure 10 shows the filtering step, which finds the shortest path; this step takes more time when the dataset is very large. The second part of figure 10 shows the clustering step, which depends on the filtering step.

Figure 10. Output of the proposed method (768 tuples)

Conclusion

In this paper we proposed a novel method for clustering objects based on their relationships. The method has two main steps: a filtering step that finds the optimal route, that is, the strongest relationship among the data, using the shortest path; and a clustering step that divides that optimal route into a number of sub-routes. The advantage of this method is that it builds the clusters automatically, without any intervention from the user. We plan to examine further extensions of the proposed method on larger datasets.

Acknowledgment

This work was supported by a grant of the Korean Ministry of Education, Science and Technology (the Regional Core Research Program / Chungbuk BIT Research-Oriented University Consortium) and by the Korea Science and Engineering Foundation (KOSEF) grant funded by the Korea government (MEST) (No. R11-2008-014-02002-0).

References

1. M. J. Li, M. K. Ng, and Y. M. Cheung, "Agglomerative Fuzzy K-Means Clustering Algorithm with Selection of Number of Clusters," IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 11, November 2008.
2. C. F. Tsai, H. C. Wu, and C. W. Tsai, "A New Data Clustering Approach for Data Mining in Large Databases," Proceedings of the 2002 International Symposium on Parallel Architectures, Algorithms and Networks (ISPAN '02), p. 315, 2002.
3. H. L. R. Encarnación, S. M. B. Suárez, W. H. Rivera, V. C. Vázquez, M. A. S. Figueroa, and A. R. Toro, "Genetic Algorithm Approach for Recorder Cycle Time Determination in Multi-Stage System," University of Puerto Rico, 2003.

4. B. F. A. Dulaimi and H. A. Ali, "Enhanced Traveling Salesman Problem Solving by Genetic Algorithm Technique (TSPGA)," PWASET, vol. 28, April 2008, ISSN 1307-6884.
5. http://neural.cs.nthu.edu.tw/