A Clustering Method with Efficient Number of Clusters Selected Automatically Based on Shortest Path

Makki Akasha, Ibrahim Musa Ishag, Dong Gyu Lee, Keun Ho Ryu
Database/Bioinformatics Laboratory, Chungbuk National University, Cheongju, Korea
{Makki, Ibrahim, dglee, khryu}@dblab.chungbuk.ac.kr

Abstract

The proposed method finds the optimal number of clusters in large datasets efficiently, without any intervention from the user, based on the relationships among the data objects. It is divided into two main steps. The first is a filtering step, which uses the shortest path between data objects. The second is a clustering step, which uses the mean distance along the optimal route to obtain the number of clusters. The main advantage of this algorithm is its ability to detect the typical number of clusters among the objects in a dataset. Theoretical analysis and empirical evidence show that our method can self-generate the cluster groups more efficiently than other methods. We expect these results to be of interest to researchers and practitioners because they suggest a simple but elegant and effective alternative for clustering large datasets.

Keywords: clustering; data mining; shortest path

Introduction

The problem of clustering datasets has become very important. Clustering algorithms divide datasets into subsets or classes. They have been used in many applications such as knowledge discovery, compression, and medical organizations. Objects in datasets with many attributes, i.e., high dimensionality, can be represented in a multidimensional vector space.

Figure 1. Representation of data in two-dimensional space

The main objective of clustering is to find a rational and valid organization of the data based on the relationships among the data objects. Objects within one cluster are more similar to each other than to objects belonging to different clusters or classes.
Traditional clustering algorithms can be divided into two types: hierarchical (agglomerative) clustering and divisive clustering [1]. In agglomerative clustering, the number of clusters does not need to be specified manually; only local neighbors are considered at each step. Divisive clustering itself has two types: crisp clustering, where each object belongs to exactly one cluster, and fuzzy clustering, where each object belongs to every cluster to a certain degree. The disadvantages of the divisive approach are the difficulty of determining the number of clusters and its sensitivity to noise and outliers [2].

Figure 2. How a genetic algorithm searches for an optimal solution
Genetic algorithms are general methods for searching for solutions to a problem in a large space of candidate solutions, as illustrated in Figure 2 [3]. They apply genetic operators such as selection, crossover, and mutation to solve the problem. Every solution has a fitness function value that depends on the problem definition.

Figure 3. Search space

For example, in Figure 3 the fitness function has a small value at the point x = 809. Solutions are used to produce the next generation of solutions by reproduction; solutions with higher fitness values have a greater chance to reproduce. A solution, or chromosome, can also be represented as non-binary numbers of integer or floating-point type.

Proposed Method

This section explains our new method: a clustering method that efficiently selects the number of clusters based on the shortest path. The method tries to find the optimal number of clusters automatically based on the relationships among data objects, and it is divided into two main steps. The first is a filtering step, which finds the strongest relationship among the objects in the dataset; our method uses the shortest path for this purpose. It begins by reading the dataset objects and calculating the relationships among them using Euclidean distance. Posing this as a traveling salesman problem generates sample route solutions over the known relationships, as shown in Figures 1 and 6(a). The genetic algorithm calculates a fitness function value for every solution; two solutions are selected, producing a new solution that replaces one of its parents. This process continues until the genetic algorithm finds the best solution, i.e., the strongest relationship. The best solution satisfies equation (1) and is shown in Figure 8.
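The first substep of the filtering step, computing the pairwise Euclidean distances among the data objects, can be sketched as follows. This is a minimal illustration; the function name and data layout (each object as a list of feature values) are our own assumptions, not the paper's code.

```python
import math

def distance_matrix(objects):
    """Pairwise Euclidean distances among data objects.

    `objects` is a list of feature vectors (one list of numbers per object);
    the result is a symmetric n-by-n matrix of distances.
    """
    n = len(objects)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist = math.sqrt(sum((a - b) ** 2
                                 for a, b in zip(objects[i], objects[j])))
            d[i][j] = d[j][i] = dist  # distance is symmetric
    return d
```

These distances are the edge weights over which the genetic algorithm later searches for the shortest route.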
Route_min = min( Σ_{i=1..n-1} d(x_i, x_{i+1}) + d(x_1, x_n) )    (1)

Second, after the filtering step finishes, the clustering step finds the clusters that can be detected in the dataset. Our approach uses the mean distance of the shortest path, given by the following equation:

AVG = DP_min / n    (2)

where DP_min is the total length of the shortest path and n is the number of objects. The clustering step begins by calculating this mean of the shortest path. It then searches for edges whose length is greater than the mean and removes them from the path; the shortest path thus divides into sub-paths after one or two iterations, as shown in Figure 7(c and d). This process is then repeated on every sub-path. If the next iteration would produce more than three paths containing a single object, the process stops. Figure 4 shows the steps of our proposed algorithm for obtaining clusters from a dataset.

Agglomerative and fuzzy clustering algorithms such as k-means and c-means give many results for different choices of the number of clusters, and those results must then be compared to find the best one; such algorithms therefore take more time and need intervention from users [4]. We propose this method to find the number of clusters automatically based on the relationships among the objects. The details of the proposed approach are given in the following sections.
Genetic Components

Figure 4. Proposed method steps

The genetic algorithm starts by randomly selecting an initial population. Successive generations are derived by applying the selection, crossover, and mutation operators to the previous population of tours: the current generation is acted upon by these three operators to produce the next one. Before performing these operations, the fitness function f_i is evaluated [1]. The method employed here first calculates DP_i, the total Euclidean distance of each path, and then computes f_i using the following equation:

f_i = DP_max - DP_i    (3)

where DP_max is the longest Euclidean distance over the solutions in the population [1].

Selection Operation

The selection operator chooses two members from the solutions available within the population to participate in the subsequent crossover and mutation operations. There are two popular methods for implementing this selection. The first, called roulette selection, uses a probability based on the fitness of each solution, computed as:

P_i = f_i / Σ_j f_j    (4)

The second method, called deterministic sampling, assigns a value S_i to each solution or path, evaluated as:

S_i = TRUNC(P_i × NS)    (5)

where TRUNC rounds a real number up to the next integer and NS is the number of solutions (paths). The selection operator ensures that each solution participates as a parent exactly S_i times.

Crossover Operation

After the selection step, the solutions are passed through the crossover operation. There are many proposed crossover procedures [1]; ours is shown in the following figure.
Figure 5. Proposed crossover method

Our proposed crossover is shown in Figure 5. It does not require solving any hard sub-problems, yet it can give nearly optimal solutions for data clustering.

Figure 6. An example of the proposed method

Figure 6 shows how our proposed method works. In the filtering step, it begins with the initial relationships, as in Figure 6(a), and uses the genetic algorithm to find the optimal route, as in Figure 6(b). In the clustering step, it begins by dividing the optimal path into sub-paths, as in Figure 6(c), and continues dividing until the termination condition becomes true, as in Figure 6(d), at which point it stops. Finally, each sub-path is considered a cluster, as in Figure 6(e). The method must stop when the termination condition becomes true, because further iteration could divide the path all the way down to single objects.
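The genetic components above (fitness, equation (3); roulette selection, equation (4); and a crossover that recombines two parent routes) can be sketched as follows. The paper's own crossover is shown only in its Figure 5, so this sketch substitutes the standard order crossover (OX) as a stand-in; all function names are illustrative.

```python
import random

def fitness(route_lengths):
    # f_i = DP_max - DP_i (equation (3)): shorter routes get higher fitness.
    dp_max = max(route_lengths)
    return [dp_max - dp for dp in route_lengths]

def roulette_pick(route_lengths, rng=random):
    # Roulette selection: sample an index with probability
    # P_i = f_i / sum_j f_j (equation (4)).
    f = fitness(route_lengths)
    r = rng.uniform(0, sum(f))
    acc = 0.0
    for i, fi in enumerate(f):
        acc += fi
        if r <= acc:
            return i
    return len(f) - 1

def order_crossover(parent_a, parent_b, lo, hi):
    # Standard OX: keep parent_a[lo:hi] in place, then fill the remaining
    # positions with parent_b's cities in their original order, skipping
    # duplicates, so the child is always a valid tour (a permutation).
    segment = parent_a[lo:hi]
    rest = [c for c in parent_b if c not in segment]
    return rest[:lo] + segment + rest[lo:]
```

Note that under equation (3) the longest route in the population receives fitness zero and is never selected as a parent, which is how the filtering step pressures the population toward the shortest route.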
Figure 7. Clustering process

Figure 7 shows the clusters of the dataset as they appear during different stages. During the clustering step the proposed method divides the shortest path into many sub-paths; this process is applied to each sub-path separately until the termination condition becomes true. Every sub-path that has more than one object is then considered a cluster; otherwise, its object is treated as an outlier.

The running time of this algorithm can be estimated from the dataset size N. The TSP generates a sample of routes; supposing there are K relationships among the objects, one pass takes O(NK) time. The genetic algorithm finds the strongest relationship after M iterations, so the total running time of the filtering step is O(MNK). The clustering step, by contrast, is very cheap; it depends on the number of stages, as in Figure 7. Assuming the clustering step takes L iterations to find the clusters, the total complexity of our algorithm is O(MNKL).

Experimental Results

Our experimental hardware was a Pentium 4 computer with 1 GB of memory and a 2.8 GHz CPU, running Windows XP Professional. We implemented the program in MATLAB and used the iris dataset with different sizes [5]. Table 1 shows the results of our experiments.

TABLE 1. Results of our method on the iris dataset (columns 3 and 5) with different sizes, compared with the results of the k-means algorithm (columns 2 and 4)

Figure 8 shows the first test: it clusters the dataset into 7 clusters, according to the first row of Table 1. The first section of Figure 8 shows how the proposed method finds the shortest path among the objects; since there are few objects, the filtering step finds the shortest path very quickly. The second section of Figure 8 shows how the proposed method finds the typical clusters.
Figure 8. The output of our proposed method (50 tuples)

Figure 9 shows the second test: it groups the dataset into two clusters, according to the second row of Table 1. The first section of Figure 9 shows how the proposed method finds the shortest path; since there are many objects, this takes more time than the clustering step. The second section of Figure 9 shows how the clustering step finds the typical clusters.

Figure 9. The output of our proposed method (166 tuples)

Figure 10 shows the third test: it groups the dataset into 3 clusters and 12 outliers, according to the third row of Table 1. The first section of Figure 10 shows the filtering step, which finds the shortest path; this step takes more time when the dataset is very large. The second section of Figure 10 shows the clustering step, which depends on the filtering step.

Figure 10. The output of our proposed method (768 tuples)

Conclusion

In this paper, we proposed a novel method for clustering objects based on their relationships. Our method has two main steps: a filtering step that finds the optimal route, or strongest relationship, among the data using the shortest path, and a clustering step that repeatedly divides that optimal route into sub-routes. The advantage of this method is that it builds the clusters automatically, without any intervention from users. We plan to examine further extensions of our method on larger datasets.

Acknowledgment

This work was supported by a grant of the Korean Ministry of Education, Science and Technology (The Regional Core Research Program / Chungbuk BIT Research-Oriented University Consortium) and by the Korea Science and Engineering Foundation (KOSEF) grant funded by the Korea government (MEST) (No. R11-2008-014-02002-0).

References

1. M. J. Li, M. K. Ng, and Y. M. Cheung, "Agglomerative Fuzzy K-Means Clustering Algorithm with Selection of Number of Clusters," IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 11, November 2008.
2. C. F. Tsai, H. C. Wu, and C. W. Tsai, "A New Data Clustering Approach for Data Mining in Large Databases," in Proc. International Symposium on Parallel Architectures, Algorithms and Networks (ISPAN '02), p. 315, 2002.
3. H. L. R. Encarnación, S. M. B. Suárez, W. H. Rivera, V. C. Vázquez, M. A. S. Figueroa, and A. R. Toro, "Genetic Algorithm Approach for Reorder Cycle Time Determination in Multi-Stage Systems," University of Puerto Rico, 2003.
4. B. F. A. Dulaimi and H. A. Ali, "Enhanced Traveling Salesman Problem Solving by Genetic Algorithm Technique (TSPGA)," PWASET, vol. 28, April 2008, ISSN 1307-6884.
5. http://neural.cs.nthu.edu.tw/