OPTIMIZATION AND ANALYSIS OF CLUSTERS

Dost Muhammad Khan, Faisal Shahzad, Najia Saher, Nawaz Mohamudally*
Department of Computer Science & IT, The Islamia University of Bahawalpur, PAKISTAN
khan.dostkhan@iub.edu.pk, faisalsd@gmail.com, najia_saher@hotmail.com
* School of Innovative Technologies and Engineering (SITE), University of Technology, MAURITIUS, alimohamudally@utm.intnet.mu

ABSTRACT: Clustering is an important area of data mining which is used to find patterns in a dataset. The K-means clustering algorithm is used to group a large dataset into clusters of smaller sets of similar data. The number of clusters K has to be specified as an input parameter in advance. There are different approaches and methods for choosing the value of K; in practice, several values of K are tried and the one that gives the most desirable result is selected. There is no statistical answer to the question of what the most suitable, appropriate, accurate and correct value of K is for a given dataset. In this paper we try to find a statistical answer to this question by presenting different optimization approaches and methods for the value of K. Furthermore, in order to extract patterns and features from a dataset, we also present the analysis of the clusters of different datasets obtained through these optimization approaches.

Keywords: Clustering; Clusters Analysis; Clusters Optimization; MDL; Gap Statistics

1. INTRODUCTION
In cluster analysis, the basic and fundamental problem is to optimize the number of clusters, which directly affects the results of clustering. Cluster analysis consists of: clustering element and variable selection, variable standardization, choosing a measure of association, selection of a clustering method, determining the number of clusters and interpretation of the clusters, as enumerated in [31]. The focus of this paper is on the optimization and analysis of clusters in order to select interesting patterns from a dataset. It is a common saying that clusters are not in the data but in the viewing eye.

Clustering is considered a fundamental task in data mining which is used to classify the data, i.e. to create groups of data, and it is often used for segmenting data. Since the aim and objective of clustering is to discover a new set of categories, the new groups are of interest in themselves and their assessment is intrinsic; therefore, the outcome of clustering is descriptive. Choosing an appropriate clustering method is another critical step. There exist a number of clustering algorithms, among which the K-means clustering algorithm is commonly used due to its simplicity of implementation and fast execution. It appears extensively in the machine learning literature and in most data mining suites. The algorithm is studied not only in classification, data mining and machine learning, but there is also an increasing number of practitioners in market research, bioinformatics, customer management, engineering and other application areas [20,24,27].

The K-means clustering algorithm takes as input a predefined number of clusters, i.e. the K in its name, while "means" stands for the average, i.e. the location of all the members of a particular cluster. The algorithm is used for finding similar patterns due to its simplicity and fast execution [1]. The following steps explain the execution of the algorithm:
Step 1: Input the number of clusters K and the number of iterations.
Step 2: Calculate the initial centroids and divide the datapoints into K clusters.
Step 3: Move datapoints between clusters using the distance formula and re-calculate the new centroids.
Step 4: Repeat step 3 until no datapoint has to be moved.
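As an illustration of these four steps, the following minimal sketch (ours, not the authors' implementation; it assumes NumPy and a numeric data matrix X whose rows are datapoints) shows the basic K-means loop:

import numpy as np

def k_means(X, k, iterations=100, seed=0):
    # Step 1 inputs: the data matrix X, the number of clusters k and the iteration limit.
    rng = np.random.default_rng(seed)
    # Step 2: initial centroids (here simply k distinct random datapoints).
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(iterations):
        # Step 3: assign every datapoint to its nearest centroid (squared Euclidean distance).
        distances = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = distances.argmin(axis=1)
        # Step 4: stop when no datapoint is moved to another cluster.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Re-calculate the centroid of every non-empty cluster.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids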
The algorithm follows an iterative procedure. The number of clusters K and the number of iterations are the required, basic inputs of the K-means algorithm; without these inputs the algorithm cannot start its execution. There is still no answer to questions such as how many clusters are required and what the exact number of clusters is for a given dataset. In this paper we discuss different optimization approaches and apply them to different real-life datasets. Furthermore, the numbers of clusters proposed by these approaches are analyzed, which helps to discover the pattern(s) in a dataset; finally, a comparison is drawn between these approaches and the results of the different optimization approaches are verified.

The rest of the paper is organized as follows: Section 2 is about related work, and in section 3 we discuss the improved methods and approaches to optimize the number of clusters K. Section 4 is about the methodology, the results are discussed in section 5 and finally the conclusion is drawn in section 6.

2. Related Work
We present some of the existing techniques and approaches for optimizing the number of clusters in a dataset. In [28] the proposed measure for the value of K is called validity, which is a ratio of intra-cluster and inter-cluster distances. The intra-cluster distance is the sum of squared distances within all clusters and the inter-cluster distance is the distance between two clusters. Dividing the intra-cluster distance by the inter-cluster distance gives the validity, which is used to choose the value of K. The approach is tested over the red, blue and green colour space and the results are encouraging and satisfactory. In this approach one first has to fix the maximum value of K, called k_max, i.e. an image can be segmented with 2 to k_max clusters. The proposed approach is tested on the K-means clustering data mining algorithm. The value of k_max is optimized through the ratio (validity) and the number of clusters present in an image is determined automatically. We observe that a lot of computation is required in this approach and that the value of k_max has to be decided in advance; these are two limitations of this approach.

In [19] a method called the gap statistic is proposed to estimate the number of clusters K in a dataset. The proposed method is tested on the K-means clustering data mining algorithm. The technique computes the sum of pairwise distances D_r of all datapoints in a cluster, calculates a pooled sum of squares W_k around the cluster means, and then compares log(W_k) with its expectation, where the number of clusters varies from 2 to n. Finally, the gap is calculated by subtracting the two values, and in this way the optimal number of clusters is obtained. The formula is applicable only if more than two clusters are required. We observe that a lot of computation is required, and another overhead of this optimization approach is that the value of the expectation must be greater than 2.

In [29] it is suggested that the number of clusters K can be optimized through model-based cluster analysis techniques, which are also helpful in answering questions such as how many clusters there are and which clustering method is suitable for a dataset. Different model-based approaches are discussed, such as probability models, expectation-maximization (EM), Bayesian methods and the modelling of noise and outliers, and a comparison between them is drawn. Each model-based approach has its own advantages and disadvantages. The optimal number of clusters is demonstrated through visualization. Again, a lot of computation is required in these approaches.

In [24] a comparison is drawn between various cluster optimization approaches, such as variance-based, structural, consensus distribution, hierarchical and re-sampling approaches, in the K-means clustering data mining algorithm. iK-means, an intelligent version of the K-means algorithm, is used in this comparison. It is proposed that data and cluster size, cluster shape, cluster intermix and data standardization are the determinant factors for optimizing the number of clusters.

In [23], commonly used existing methods of cluster optimization, such as a value of K specified within a range, specified by the user, derived from statistical measures, set to the number of classes, or chosen through visualization, are first reviewed; then a recursive evaluation function f(K) is proposed to suggest multiple values of K for a dataset. The function is divided into three different equations for K = 1, K = 2 and K > 2, where the weighted gap method is used. Different weights are used for each value of K. The sum of squared errors of two adjacent values of K is divided, and if the result is approximately equal to 1 then these are the suggested values of K. The function is tested on the incremental K-means algorithm.

In [30] two new methods are proposed to determine the number of clusters in a dataset, and it is also claimed that these techniques perform better than those already available. The weighted gap method is applied in the proposed methods. The commonly used optimization techniques are also discussed and a comparison with the proposed techniques is drawn.

It is also observed that the K-means clustering algorithm, in its different forms, is chosen to test all the proposed approaches. After the extensive review of the optimization approaches discussed above, the following limitations are enumerated:
1. It is difficult to take into account different equations with different weights; the process of allocating the weights is not simple and is difficult to understand even for an experienced user.
2. Since all the proposed approaches are based on randomly generated initial centroids, different initial centroids can affect the results.
3. All the approaches discussed above are computationally expensive even for small and medium datasets.
In order to address these limitations we propose improved cluster optimization approaches in the next section.

3. Approaches of Clusters Optimization
There are 30 different methods and approaches available to optimize the number of clusters in a dataset [32]. In this section we discuss the improved methods and approaches to optimize the number of clusters K using the K-means clustering data mining algorithm.

3.1 K: the User Defined Approach
The user has to set the required parameters, i.e. the number of clusters K and the number of iterations, in the K-means clustering algorithm. If no parameter is set then default values are taken automatically by the algorithm, e.g. in ODM and MS SQL Server the default number of clusters is 10. In order to obtain satisfactory results, the clustering algorithm is tested with different values of K. The issue with this method is how to validate and evaluate the obtained clustering results, which is only possible through visualization of these results; but it is difficult to validate and evaluate the results for multidimensional datasets through visualization. This is a popular approach implemented in many data mining suites [4-7]. Some data mining suites also provide a maximum limit for the number of clusters K.

3.2 Information Criterion Approach
The value of K can be specified through the value of the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC), the most commonly used information criteria [12-15]. The information criterion is often applied to probabilistic clustering approaches; Expectation Maximization (EM) is one example of probabilistic clustering. The K-means clustering algorithm, on the other hand, uses a partitioning clustering approach. Therefore certain assumptions are made about the distribution of the data, e.g. we can only compute the AIC or BIC of datasets which follow a set of Gaussian distributions [8-10]. Although EM and K-means clustering are based on different hypotheses, models and criteria, some ideas are common to both [11]. Researchers believe that the value of the information criterion is not applicable as the value of K because of its probabilistic nature, but we propose that it is just a number that one can use as the value of K. Since the value of the information criterion of a dataset is sometimes very high, it is not appropriate to apply the same value directly; therefore the logarithm of the information criterion is taken as the number of clusters. This is illustrated in equation (1):

K = ln(ValueOfInformationCriterion)   (1)

where K is the number of clusters and ln(ValueOfInformationCriterion) is the logarithm of the value of the information criterion, either AIC or BIC.
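A sketch of equation (1) is given below (ours, not the authors' code; the information criterion value is assumed to come from whatever probabilistic model was fitted, and the lower bound of 2 clusters is our assumption, following the range reported later in section 5):

import math

def k_from_information_criterion(ic_value):
    # Equation (1): take the natural logarithm of the (often very large) AIC or BIC value.
    # ic_value is assumed positive and reported by a probabilistic clustering model,
    # e.g. an EM / Gaussian mixture fit; the floor of 2 clusters is an assumption.
    return max(2, round(math.log(ic_value)))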

3.3 K Equivalent to the Number of Parameters
The number of clusters is equated to the number of parameters (classes) in a dataset. It is a very simple approach to optimize the value of K. Although clustering is an unsupervised data mining task, this approach effectively treats it in a supervised manner, since it relies on the class labels of the dataset. The issue with this approach is that one has to assume that each produced cluster contains only objects belonging to one class; this assumption does not hold for real problems [16,17]. Another problem with this approach is that having too many or too few parameters in a dataset can raise the problem of over-fitting or under-fitting respectively. The issue of over- and under-fitting can be addressed through a model selection criterion [22].

3.4 K Specified through the Datapoints
The value of K can also be determined from the number of datapoints of the dataset. The formula, also called the rule of thumb, is given in equation (2) [18]:

K = sqrt(N / 2)   (2)

where K is the number of clusters and N is the number of datapoints of the given dataset, N > 0. One major issue of this method is that if the dataset has a large number of datapoints then the value of K will be large; e.g. a large dataset can easily yield a value such as K = 77 under this formula. It is not only difficult to select the appropriate results from 77 clusters, but the time and space complexity of the algorithm also becomes high. The formula in equation (2) is not even suitable for small datasets; therefore we propose that, instead of taking the square root of the number of datapoints, the logarithm of the number of datapoints is taken, as given in equation (3):

K = ln(N)   (3)

where K is the number of clusters and N is the number of datapoints of the given dataset, N > 0. The formula in equation (3) is suitable for all types of datasets, i.e. small, medium and large. We take the logarithm of the number of datapoints because the logarithm grows very slowly, which keeps the value of K manageable even when dealing with very large datasets.
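A quick sketch contrasting the two formulas (ours; it assumes only that N is the number of datapoints):

import math

def k_rule_of_thumb(n_points):
    # Equation (2): K = sqrt(N / 2); grows quickly with the number of datapoints.
    return round(math.sqrt(n_points / 2))

def k_logarithmic(n_points):
    # Equation (3): K = ln(N); stays small even for very large datasets.
    return max(1, round(math.log(n_points)))

# For the Iris dataset (N = 150) these give roughly 9 and 5 clusters respectively.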
3.5 K Determined through the Stepwise Method
In the stepwise method we start with the number of clusters K equal to 1 and then gradually increase the value by 1. The distance, also called the distortion of a cluster, is computed for each value of K. The stepwise method ensures that the optimization is not carried to the extreme. We apply the squared Euclidean distance (any distance formula can be applied) to compute the distance; the formula is given in equation (4):

d_j = sum_{k=1..N} (x_ik - x_jk)^2   (4)

where d_j is the distance between the datapoints x_i and x_j, N is the total number of attributes of the given dataset and i, j, k = 1, 2, ..., n. The K-means clustering data mining algorithm uses the sum of squared error criterion for the re-assignment of any sample from one cluster to another, as given in equation (5):

S_k = sum_j d_j   (5)

where K is the number of clusters, S_k is the sum of squared errors of the k-th cluster and d_j is the distance. As the number of clusters K increases, the value of S_k decreases. In order to determine the number of clusters of the dataset, the sum of squared errors for k clusters is divided by the sum of squared errors for (k - 1) clusters; the result is denoted the ratio. The formula for the ratio is given in equation (6), which is also known as the Gap Statistic [19]:

ratio = S_k / S_(k-1)   (6)

where S_k and S_(k-1) are the sums of squared errors for k and (k - 1) clusters respectively. The formula in equation (6) is valid for k >= 2. If the value of k = 1 then it is assumed that there is no need to create clusters of the given dataset, i.e. the dataset is uniformly distributed. The value of the ratio comes close to 1 as the number of clusters increases. The number of values before the convergence to the uniform distribution, i.e. before the value of the ratio comes close to 1, determines the number of clusters for the given dataset. It is an iterative and lengthy process [23,24]. One major constraint of this method is that, since the K-means clustering algorithm is sensitive to the initial centroids, different initial centroids may result in different clusters [20]. Applying this approach, the value of the number of clusters can be the same for small, medium and large datasets, i.e. the value of K does not depend on the objects of the dataset. The number of attributes, parameters and datapoints are the objects of a dataset.

4. METHODOLOGY
The following methodology is adopted in this research paper. For the cluster analysis, the first step is to create appropriate vertical partitions of the dataset. In a vertical data partition the different sets of attributes are on different nodes, and each node must contain a unique identifier to facilitate matching between partitions; therefore the class attribute is common to all created partitions. Suppose there are n attributes in a dataset and we want to create m partitions. The formula is given in equation (7) [25]:

m = (n - 1) / 2   (7)

where 0 < m <= n; the value of m is rounded to the nearest integer. The number of attributes in each partition is 2 along with the class (target) attribute, which is why 1 is subtracted from n.
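A sketch of this partitioning step is shown below (ours; it assumes pandas and that the class attribute is passed by name, and rounds upwards, which matches the partition counts reported for the datasets in section 5):

import math
import pandas as pd

def vertical_partitions(df, class_column):
    # Equation (7): m = (n - 1) / 2 partitions, each holding two attributes plus the class.
    attributes = [c for c in df.columns if c != class_column]
    m = math.ceil(len(attributes) / 2)               # n - 1 attributes, two per partition
    partitions = []
    for i in range(m):
        pair = attributes[2 * i:2 * i + 2]            # the pairing of attributes is arbitrary
        partitions.append(df[pair + [class_column]])  # the class attribute is common to all
    return partitions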

Figure 1. Graphs between K (the number of clusters) and the ratio of the sums of squared errors for the datasets (a) Iris, (b) Sales, (c) BreastCancer, (d) Diabetes, (e) Cars and (f) DNA.

The selection of attributes for each partition is arbitrary and depends on the user; different combinations of attributes will give different results. All the clustering algorithms in data mining are distance based. Therefore, the attributes for the partitions must be selected in such a way that all attributes can have an equal impact on the distance computation. This is to avoid obtaining clusters that are dominated by the attributes with the largest amounts of variation. Furthermore, the population, the percentage of the parameters and the value of the Minimal Description Length (MDL) of each cluster are computed. The cluster(s) with the smallest MDL value(s) are preferred. The formula for the MDL is given in equation (8):

MDL = -2*log(likelihood) + k*log(n)   (8)

where k is the number of parameters, n is the number of datapoints of the given dataset, -2*log(likelihood) is the Model Accuracy and k*log(n) is the Model Size. Therefore, we can write the formula for the MDL as given in equation (9):

MDL = Model Size + Model Accuracy   (9)

The Minimal Description Length (MDL) is also referred to as the BIC (Bayesian Information Criterion) [21,22].
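A minimal sketch of equations (8) and (9) (ours; the log-likelihood is assumed to be reported by whatever probabilistic model was fitted to the cluster, and the sign convention follows the usual BIC form):

import math

def mdl(log_likelihood, k, n):
    # Equation (8): MDL = -2*log(likelihood) + k*log(n).
    model_accuracy = -2.0 * log_likelihood   # the Model Accuracy term
    model_size = k * math.log(n)             # the Model Size term
    return model_size + model_accuracy       # equation (9): smaller values are preferred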
5. RESULTS AND DISCUSSION
In this section we discuss the results of the optimization and of the analysis of clusters respectively.

5.1 Optimization of Clusters
The following datasets are chosen to test the different cluster optimization approaches discussed above:
i. Iris: The dataset contains information about flowers. There are 3 different classes (parameters) of flowers. It has 150 datapoints with 5 attributes [2].
ii. Sales: The dataset contains information about sales. There are 2 different classes (parameters) of sales. It has only 24 datapoints with 5 attributes [2].
iii. BreastCancer: A medical dataset containing information about breast cancer. There are 2 different classes (parameters) of breast cancer. It has 233 datapoints with 10 attributes [2].
iv. Diabetes: A medical dataset containing information about diabetes. There are 2 different classes (parameters) of diabetes. It has 788 datapoints with 9 attributes [3].
v. Cars: A vehicle dataset containing information about different brands of cars from different countries. There are 30 different classes (parameters) of cars. It has 261 datapoints with 9 attributes [2].
vi. DNA: A dataset containing information about DNA. There are 3 different classes (parameters) of DNA, with 181 attributes [2].

We discuss each optimization approach presented in section 3:
1. Applying the user-defined approach to specify the value of K, the user can enter any value of K from a range of 2 to 20, i.e. the minimum value of K is 2 and the maximum is 20. In this method the value of K does not depend on the number of parameters, the number of datapoints or the number of attributes. Before deciding on the value of K, the user checks the number of parameters, the number of attributes and the number of datapoints of the dataset.
2. Since the information criterion is used to determine the fitness (over-, under- or best-fit) of the given dataset, the value of K can easily be specified by taking the logarithm of the value of the information criterion. In this method the value of K depends on the number of parameters and the number of attributes; the number of datapoints has no significant impact on the value of the information criterion. Since selecting the number of parameters and the number of attributes is a model selection problem, one has to take care of these important aspects of a dataset. Using too many or too few parameters and attributes can raise the problem of over- or under-fitting respectively, which shows the significance of the parameters and attributes of the dataset. There is a limit to the number of parameters and the number of attributes in a dataset; going under or above this limit will affect the fitness of the dataset. Therefore, the value of K remains between 2 and 23 in this approach [22].
3. The value of K can easily be determined through the number of parameters in the class attribute of a dataset. In this approach the value of K changes with any change in the number of parameters; the number of attributes and the number of datapoints have no impact on the value of K. This approach to optimizing the number of clusters is applied in many data mining suites. The assumption in this optimization approach is that the dataset must be best-fitted; if the right dataset is not selected then there is a risk of over- or under-fitting and consequently the results are not valid.
4. Similarly, one can apply the logarithm of the number of datapoints of a dataset as the value of K. In this approach the value of K changes with any change in the number of datapoints; the number of attributes and the number of parameters have no impact on the value of K.
These four methods are straightforward and simple, and the value of K can be optimized with little computation. However, in these methods the number of parameters, the number of attributes and the number of datapoints play an important role in specifying the value of K; we can say that the value of K depends on the objects of the dataset.
5. The evaluation of the stepwise method is tested on the above datasets. The K-means clustering algorithm is applied with the value of K ranging from 1 to 10. The main constraint of the K-means clustering algorithm is that it is sensitive to the value of the initial centroids; therefore, in order to avoid this constraint, the data are normalized, which also reduces the effect of differences in the ranges of the attributes of the datasets.
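A sketch of this normalization step (ours; the paper does not name a specific scaler, so scikit-learn's MinMaxScaler is assumed here):

from sklearn.preprocessing import MinMaxScaler

def normalize(X):
    # Rescale every attribute to [0, 1] so that no attribute dominates the distance computation.
    return MinMaxScaler().fit_transform(X)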

The selected datasets have small, medium and high values of attributes, parameters and datapoints. The distortion (distance), the sum of squared errors and the ratio of the sums of squared errors are calculated for each value of K. A lot of computation is involved in this approach. When K = 1 we cannot compute the value of the ratio; therefore it is assumed that the dataset is in a normal uniform distribution. As the value of K increases, the sum of squared errors decreases and we also observe variation in the value of the ratio. Figure 1 shows the graphs between K, the number of clusters, and the ratio of the sums of squared errors for the selected datasets. The explanation of the graphs in Figure 1 is given below.

The graph in Figure 1(a) shows how the value of the ratio of the sums of squared errors reflects the result of clustering on the dataset Iris. The graph can be divided into two regions; one region is from K = 2 to 4 and the second is after K = 5. The value of the ratio is approximately constant and approaches 1 from K = 5 to 9. The number of clusters recommended through this method for this dataset is 4, because after this value the graph is in the uniform distribution. Hence the value of K = 4 for the dataset Iris.

The graph in Figure 1(b) shows that there is a single region and all objects are in a uniform distribution, which also reflects the result of clustering on the dataset Sales. The value of the ratio is approximately constant and approaches 1 from K = 4 to 9. The number of clusters recommended through this method for this dataset is 3, because after this value the graph is in the uniform distribution. Hence the value of K = 3 for the dataset Sales.

The graph in Figure 1(c) shows how the value of the ratio of the sums of squared errors reflects the result of clustering on the dataset BreastCancer. The graph has two regions; roughly the second half of the graph is in the uniform distribution and the first half is not. The value of the ratio is approximately constant and approaches 1 from K = 7 to 9. The number of clusters recommended through this method for this dataset is 6, because after this value the graph is in the uniform distribution. Hence the value of K = 6 for the dataset BreastCancer.

The graph in Figure 1(d) shows how the value of the ratio of the sums of squared errors reflects the result of clustering on the dataset Diabetes. The graph can be divided into two regions; one region is from K = 2 to 5, which is not in the uniform distribution, and the second is after K = 6, which is in the uniform distribution. The value of the ratio is approximately constant and approaches 1 from K = 6 to 9. The number of clusters recommended through this method for this dataset is 5, because after this value the graph is in the uniform distribution. Hence the value of K = 5 for the dataset Diabetes.

The graph in Figure 1(e) shows how the value of the ratio of the sums of squared errors reflects the result of clustering on the dataset Cars. The graph is not in the uniform distribution, which shows that the dataset is not properly normalized. The value of the ratio is either small or large from K = 2 to 8; after K = 9 the graph moves into the normal distribution and the value of the ratio approaches the constant value 1. The number of clusters recommended through this method for this dataset is 8, because after this value the graph is in the uniform distribution. Hence the value of K = 8 for the dataset Cars.

The graph in Figure 1(f) shows how the value of the ratio of the sums of squared errors reflects the result of clustering on the dataset DNA. The graph moves into the normal uniform distribution after the value of K = 7; before this value of K the graph is not in the uniform distribution. The value of the ratio approaches the constant value 1 when K = 8 to 9. The number of clusters recommended through this method for this dataset is 7, because after this value the graph is in the uniform distribution. Hence the value of K = 7 for the dataset DNA.

We can draw the conclusion from the graphs in Figure 1 that the value of K corresponding to a ratio value of 1.0 indicates the uniform distribution, and the value (K - 1) can be recommended as the number of clusters; if there is no such value then K = 1. The value of K specified through the stepwise method does not depend on the objects of a dataset. The method is tested on six different datasets for K = 1 to 10; more values of K and more datasets can be selected for further verification of this method.

The number of clusters recommended through the five different optimization methods and approaches for the datasets is given in Table 1. Table 1 is a summary of all five different approaches to optimize the number of clusters in the K-means clustering data mining algorithm. The recommended values of K are small, except in the case of the parametric (class parameters) approach for the dataset Cars, where the number of parameters is very high. The main issue is how to validate and evaluate these recommended numbers of clusters from the optimization approaches. Data visualization techniques can be applied to verify these results, but it is difficult to validate and evaluate the results for multidimensional datasets through visualization [26]. Finally, the user has to decide the appropriate number of clusters for the given dataset using data visualization techniques.

A comparison of the five different optimization approaches for the selected datasets is illustrated in Figure 2. Figure 2 demonstrates the comparison of the five different approaches to optimize the number of clusters in the K-means clustering data mining algorithm for the different datasets. The recommended number of clusters, i.e. the value of K, obtained through the stepwise approach is rational compared to the rest of the approaches. For the validation of the number of clusters, data visualization techniques can be used. Moreover, a comparison between the optimization approaches discussed in sections 2 and 3 is drawn in Table 2.
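The ratio-versus-K curves summarized above can be reproduced with a short script of the following shape (a sketch, not the authors' code; it assumes scikit-learn and a normalized numeric matrix X):

from sklearn.cluster import KMeans

def ratio_curve(X, k_max=10, seed=0):
    # S_k: the sum of squared errors (inertia) of the K-means solution for each K, equation (5).
    sse = {k: KMeans(n_clusters=k, random_state=seed, n_init=10).fit(X).inertia_
           for k in range(1, k_max + 1)}
    # Equation (6): ratio = S_k / S_(k-1), defined for K >= 2.
    return {k: sse[k] / sse[k - 1] for k in range(2, k_max + 1)}

# The recommended K is read off just before the ratio settles close to 1.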

Figure 2. A comparison of the different optimization approaches: the number of clusters K recommended by the User Defined, Information Criterion, Class Parameters, Datapoints and Stepwise approaches for each of the datasets Iris, Diabetes, BreastCancer, Sales, Cars and DNA.

It is obvious that our proposed optimization approaches are simple, easy to implement and of lower computational complexity compared to the others. The number of clusters should be neither too small nor too large, because it is difficult to extract patterns in both cases. It is also observed that the proposed approaches, also called the improved optimization approaches, are computationally less expensive than the existing approaches. Since both the proposed and the existing approaches provide realistic numbers of clusters for a dataset, we conclude that all the approaches are useful and helpful for optimizing the number of clusters in a dataset.

5.2 Analysis of Clusters
Cluster analysis means partitioning data into meaningful subgroups, which is also known as the characterization of clusters. It is an important tool for unsupervised learning, but the major issue is to estimate the optimal number of clusters. We provide the cluster analysis of the datasets Iris, BreastCancer, Diabetes and Sales using the different optimization approaches presented in this paper. The cluster analysis gives a clear picture of each cluster and also helps to extract the features and patterns from a dataset. For this purpose we first create the vertical partitions of each dataset by applying equation (7). The number of parameters of the dataset becomes the value of K. We analyze each cluster by providing information such as the population of the cluster, the percentage of each parameter in the cluster and the value of the Minimal Description Length (MDL) of the cluster. The MDL of a cluster is computed using equation (9). Finally, we select the cluster having the minimum value of MDL. The following tables demonstrate the whole process of cluster analysis for the selected datasets.

Case 1: The analysis of clusters of the datasets using the number-of-parameters approach. Since there are 5 attributes and 3 parameters in the dataset Iris, 2 partitions and 3 clusters per partition are created according to the methodology. Table 3 shows the cluster analysis of the dataset Iris.

The explanation of Table 3 is: The clusters of both partitions are almost evenly populated. The first cluster of the first partition has Iris-setosa, a single parameter; similarly, the second cluster has the Iris-versicolor parameter and the third one has Iris-virginica, i.e. each cluster of this partition has a distinct parameter of the dataset. The first cluster of the second partition has two parameters, Iris-versicolor and Iris-virginica, but the percentage value of Iris-versicolor is high; the second cluster has Iris-setosa, a single parameter; and the third one again has two parameters, Iris-versicolor and Iris-virginica, but the percentage value of Iris-versicolor is high, i.e. each cluster of this partition is dominated by a distinct parameter of the dataset. The MDL values of the first cluster of the first partition and the second cluster of the second partition are the same, and the rest of the clusters have the same values as each other. The conclusion is that each cluster has a distinct parameter as output. Cluster 1 of partition 1 and cluster 2 of partition 2 have the minimum value of MDL; therefore the features and patterns of these two clusters are selected from the dataset Iris.

Since there are 10 attributes and 2 parameters in the dataset BreastCancer, 5 partitions and 2 clusters per partition are created according to the methodology. Table 4 shows the cluster analysis of the dataset BreastCancer. The explanation of Table 4 is: The population of each partition of the dataset is the same, i.e. the first cluster of each partition is less populated than the second cluster. The percentage value of the parameter Malignant is high in the less populated cluster of each partition. The percentage value of the parameter Benign is high in the highly populated cluster of each partition. The MDL value of each cluster of the dataset is almost the same. The conclusion is that the output of the first cluster of each partition is Malignant and Benign is the output of the second cluster of each partition. Cluster 1 of partition 3 and cluster 1 of partition 5 have almost the same minimum value of MDL; therefore the features and patterns of these two clusters are selected from the dataset BreastCancer.

Since there are 9 attributes and 2 parameters in the dataset Diabetes, 4 partitions and 2 clusters per partition are created according to the methodology. Table 5 shows the cluster analysis of the dataset Diabetes. For the rest of the cases we take the dataset Iris only; other datasets can similarly be used to test the different optimization approaches.

Case 2: The analysis of clusters of the dataset Iris using the Stepwise approach.

The explanation of Table 5 is: The first cluster of each partition has a greater population compared to the second cluster. The MDL value of each cluster of the dataset is almost the same. The conclusion is that the output of the first cluster of each partition is Cat2, and the result of the second cluster of each partition is either Cat1 or Cat2, because the percentages of both parameters are very close. Only cluster 2 of partition 2 has the minimum value of MDL; therefore the feature and pattern of this cluster are selected from the dataset Diabetes.

Since there are 5 attributes and 2 parameters in the dataset Sales, 2 partitions and 2 clusters per partition are created according to the methodology. Table 6 shows the cluster analysis of the dataset Sales.

Table 1. The Number of Clusters based on Different Approaches
Datasets/Approaches: User Defined, Information Criterion, Class Parameters, Datapoints, Stepwise; datasets: Iris, Diabetes, BreastCancer, Sales, Cars, DNA.

Table 2. A Comparison of Clusters Optimization Approaches
Criteria/Approaches: User-defined | Information Criterion | Number of Parameters | Datapoints | Stepwise | Others (discussed in section 2)
Dataset Objects: Independent | Dependent | Dependent | Dependent | Independent | Independent
Computational Complexity: Low | Medium | Medium | Medium | High | Very High
Easy to Implement: Yes | Yes | Yes | Yes | Yes | Not easy even for the experienced user
Overview: Simple and easy | Simple and easy | Simple and easy | Simple and easy | Not simple but it is easy | Very complex and complicated for the user
Conclusion: User dependent | Computation is required | Computation is required | Computation is required | Less computation but still it is easy and simple | A lot of computation can even confuse the experienced user

Table 3. The Clusters Analysis of Dataset Iris
1 petal_length, petal_width, class 1 Iris-setosa=50 Iris-setosa= Iris-versicolor=48 Iris-virginica=4 Iris-versicolor= Iris-versicolor=2 Iris-virginica=46
2 sepal_length, sepal_width, class 1 Iris-versicolor=12 Iris-virginica=35 Iris-virginica=7.7 Iris-versicolor=4.2 Iris-virginica=95.8 Iris-versicolor=25.5 Iris-virginica= Iris-setosa=50 Iris-setosa=100 3 Iris-versicolor=38 Iris-virginica=15 Iris-versicolor=71.7 Iris-virginica=28.3

Table 4. The Clusters Analysis of Dataset BreastCancer
1 CT, UCS, Class 1 Benign =162 Benign= Malignant=10 Malignant=5.8 2 Malignant =59 Malignant= Benign=2 Benign=3.3
2 UCSh, MAdh, Class 1 Malignant =59 Malignant= Benign=6 Benign=9.2 2 Benign=158 Benign= Malignant =10 Malignant=6.0
3 SECS, BNu, Class 1 Malignant =52 Malignant= Benign=7 Benign= Benign=157 Benign= Malignant =17 Malignant=9.8
4 BCh, NNuc, Class 1 Malignant =47 Malignant= Benign=4 Benign=7.8 2 Benign=160 Malignant =22 Benign=87.9 Malignant=

5 BCh, Mitoses, Class 1 Malignant =55 Malignant= Benign=7 Benign= Benign=157 Benign= Malignant =14 Malignant=8.2

Table 5. The Clusters Analysis of Dataset Diabetes
1 NTP, DPF, CLASS 1 Cat2=384 Cat2= Cat1=96 Cat1= Cat1=164 Cat1= Cat2=144 Cat2=
2 PGC, 2HIS, CLASS 1 Cat2=474 Cat2= Cat1=198 Cat1= Cat1=62 Cat1= Cat2=54 Cat2=
3 DBP, AGE, CLASS 1 Cat1=176 Cat1= Cat2=214 Cat2= Cat1=84 Cat1= Cat2=314 Cat2=
4 TSFT, BMI, CLASS 1 Cat2=330 Cat2= Cat1=86 Cat1= Cat1=174 Cat2=198 Cat1=46.8 Cat2=

Table 6. The Clusters Analysis of Dataset Sales
1 AverageSales, ForecastSales, Class 1 OK=10 OK= NotOK=8 NotOK= NotOK=6 OK=0 0.9 NotOK=100
2 CalculatedSeasonalIndex, AverageIndex, Class 1 OK=15 OK= NotOK=0 2 OK=9 OK=100 NotOK=0 1.3

Table 7. The Cluster Analysis of Dataset Iris
1 petal_length, petal_width, class 1 Iris-virginica=36 Iris-virginica= Iris-setosa=50 Iris-setosa=100 3 Iris-versicolor=26 Iris-versicolor=100 4 Iris-versicolor=24 Iris-virginica=14
2 sepal_length, sepal_width, class 1 Iris-versicolor=6 Iris-virginica=21 Iris-versicolor=63.2 Iris-virginica=36.8 Iris-versicolor=22.2 Iris-virginica= Iris-setosa=49 Iris-setosa= Iris-versicolor=25 Iris-virginica=7 Iris-setosa=1 4 Iris-versicolor=19 Iris-virginica=22 Iris-versicolor=75.8 Iris-virginica=21.2 Iris-setosa=3.0 Iris-versicolor=46.3 Iris-virginica=

Table 8. The Cluster Analysis of Dataset Iris
1 petal_length, petal_width, class 1 Iris-virginica=32 Iris-virginica= Iris-setosa=50 Iris-setosa= Iris-versicolor=23 Iris-versicolor= Iris-versicolor=19 Iris-virginica=18 Iris-versicolor=51.4 Iris-virginica= Iris-versicolor=8 Iris-versicolor=
2 sepal_length, sepal_width, class 1 Iris-versicolor=6 Iris-virginica=21 Iris-versicolor=22.2 Iris-virginica= Iris-setosa=17 Iris-setosa=100 3 Iris-versicolor=19 Iris-virginica=22 4 Iris-versicolor=25 Iris-virginica=7 Iris-versicolor=46.3 Iris-virginica=53.7 Iris-versicolor=78.1 Iris-virginica= Iris-setosa=33 Iris-setosa=100.0

Table 9. The Cluster Analysis of Dataset Iris
1 petal_length, petal_width, class 1 Iris-virginica=13 Iris-virginica=100 2 Iris-virginica=21 Iris-virginica=100 3 Iris-versicolor=8 Iris-versicolor=100 4 Iris-versicolor=10 Iris-versicolor=100 5 Iris-versicolor=5 Iris-virginica=15 Iris-versicolor=25.0 Iris-virginica= Iris-versicolor=12 Iris-versicolor= Iris-virginica=1 Iris-versicolor=1 Iris-virginica=50.0 Iris-versicolor= Iris-versicolor=7 Iris-versicolor=100 9 Iris-versicolor=7 Iris-versicolor= Iris-setosa=50 Iris-setosa=

2 sepal_length, sepal_width, class 1 Iris-virginica=10 Iris-virginica= Iris-versicolor=3 Iris-versicolor= Iris-virginica=11 Iris-virginica= Iris-versicolor=4 Iris-virginica=3 Iris-versicolor=57.1 Iris-virginica= Iris-setosa=28 Iris-setosa= Iris-virginica=4 Iris-versicolor=9 6 Iris-versicolor=3 Iris-virginica=6 Iris-virginica=30.8 Iris-versicolor=69.2 Iris-versicolor=33.3 Iris-virginica= Iris-setosa=21 Iris-setosa= Iris-versicolor=7 Iris-virginica=10 9 Iris-versicolor=5 Iris-virginica=1 Iris-setosa=1 10 Iris-versicolor=19 Iris-virginica=5 Iris-versicolor=41.2 Iris-virginica=58.8 Iris-versicolor=71.4 Iris-virginica=14.3 Iris-setosa=14.3 Iris-versicolor=79.2 Iris-virginica=

The explanation of Table 6 is: The second cluster of both partitions is less populated compared to the first cluster. The second partition is dominated by a single parameter, OK. The first cluster of the first partition has both parameters, OK and NotOK, but again the percentage value of OK is greater than that of NotOK. The second cluster of the first partition has only the single parameter NotOK. The MDL value of the first cluster of the first partition is high compared to the rest of the clusters. The conclusion is that the output of each cluster is dominated by the single parameter OK, except for the second cluster of the first partition. All the clusters except cluster 1 of partition 1 have almost the same minimum value of MDL; therefore the features and patterns of these three clusters are selected from the dataset Sales.

The explanation of Table 7 is: The first three clusters of the first partition are dominated by a single parameter, either Iris-virginica, Iris-setosa or Iris-versicolor, i.e. each cluster of this partition has a distinct parameter of the dataset. The fourth and last cluster of the first partition has two parameters, Iris-versicolor and Iris-virginica, but the percentage value of Iris-versicolor is high. The second cluster of the second partition has only Iris-setosa, a single parameter, and the remaining three clusters have the two parameters Iris-versicolor and Iris-virginica. We can say that each cluster is dominated by a distinct parameter of the dataset. The MDL value of the third cluster of the second partition is high and that of the third cluster of the first partition is low compared to the rest of the clusters. Therefore, cluster 3 of partition 1 is the selected feature and pattern of the dataset Iris.

Case 3: The analysis of clusters of the dataset Iris using the Datapoints approach.

The explanation of Table 8 is: Clusters 1, 2, 3 and 5 of the first partition are dominated by a single parameter, either Iris-virginica, Iris-setosa or Iris-versicolor, i.e. each cluster of this partition has a distinct parameter of the dataset. Cluster 4 of the first partition has two parameters, Iris-versicolor and Iris-virginica, but the percentage value of Iris-versicolor is high. Clusters 2 and 5 of the second partition have only Iris-setosa, a single parameter, and the remaining three clusters have the two parameters Iris-versicolor and Iris-virginica. We can say that each cluster is dominated by a distinct parameter of the dataset. The MDL value of cluster 4 of the second partition is high and that of cluster 5 of the first partition is low compared to the rest of the clusters. Therefore, cluster 5 of partition 1 is the selected feature and pattern of the dataset Iris.

Case 4: The analysis of clusters of the dataset Iris using the User-defined approach.
The explanation of Table 9 is: Clusters 1, 2, 3, 4, 6, 8, 9 and 10 of the first partition are dominated by a single parameter, either Iris-virginica, Iris-setosa or Iris-versicolor, i.e. each cluster of this partition has a distinct parameter of the dataset. Clusters 5 and 7 of the first partition have the two parameters Iris-versicolor and Iris-virginica. Clusters 2, 4 and 7 of the second partition have a single parameter and the remaining clusters have two parameters. We can say that each cluster is dominated by a distinct parameter of the dataset. The MDL value of cluster 9 of the second partition is high, while clusters 3, 8 and 9 of the first partition and clusters 5 and 8 of the second partition have low MDL values compared to the rest of the clusters. Therefore, the clusters with low MDL values are the selected features and patterns of the dataset Iris.

We also notice that increasing the number of clusters of a dataset makes the clusters more difficult to handle; therefore, only the required number of clusters should be used in order to extract the features and patterns from the given dataset. The conclusion from the above tables of the cluster analysis is that increasing the number of parameters and attributes of the dataset will increase the number of clusters and the number of partitions respectively, which will increase the domain of knowledge. Increasing the sample size will only increase the population of the clusters, which does not change the domain of knowledge extraction. We can say that the number of clusters K plays an important, vital and decisive role in pattern and feature extraction from a dataset.

6. CONCLUSION
The objectives set at the outset of this research paper are: first, to optimize the value of K, the number of clusters, in the K-means clustering algorithm and second, to provide the cluster analysis obtained by applying the optimization approaches to different datasets. For the first objective, different optimization methods and approaches are presented, namely the value of K specified by the user, through the value of the information criterion, through the parameters of the dataset, through the datapoints of a dataset and through the stepwise method. The first four methods are straightforward and computationally inexpensive; in these four methods, the value of K depends on the objects of a dataset, such as the number of parameters, attributes and datapoints. Although the last optimization approach, the stepwise method, is computationally expensive, the value of K it yields does not depend on the objects of a dataset. These methods are tested and validated on 6 different datasets and the recommended value of K is rational. For the second objective, the cluster analysis of 4 different datasets is presented, which helps the user to extract the patterns and features from the given dataset. We conclude that the value of K is a determinant factor for pattern and feature extraction from a dataset. If the value of K is correctly optimized then the results will be acceptable; otherwise there is a risk of deteriorated or no pattern and feature extraction at all. Further experiments are required to validate the value of K, and the optimization of K will remain an avenue for further research.

ACKNOWLEDGEMENT
The authors are thankful to The Islamia University of Bahawalpur, Pakistan for providing financial assistance to carry out this research activity under HEC project 6467/F II.

REFERENCES
[1] MacQueen, J.B., Some Methods for Classification and Analysis of Multivariate Observations, Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, University of California Press, 1:
[2] URL: US Census Bureau for the datasets Iris, BreastCancer and Sales, visited
[3] URL: National Institute of Diabetes and Digestive and Kidney Diseases, Pima Indians Diabetes Dataset, visited
[4] SPSS Clementine Data Mining System (Integral Solutions Limited, Basingstoke, Hampshire), User Guide Version 5,
[5] DataEngine 3.0 Intelligent Data Analysis an Easy Job, Management Intelligenter Technologien GmbH, Germany,
[6] Kerr, A., Hall, H. K., and Kozub, S., Doing Statistics with SPSS, (Sage, London),
[7] S-PLUS 6 for Windows Guide to Statistics, Vol.
2, Insightful Corporation, Seattle, Washington, 2001, man2.pdf, [8] Hardy, A., On the number of clusters, Comput. Statist. Data Analysis, 23, 83 96, [9] Theodoridis, S. and Koutroubas, K., Pattern Recognition, Academic Press, London, [10] Halkidi, M., Batistakis, Y., and Vazirgiannis, M., Cluster validity methods, Part I. SIGMOD Record, 31(2); available online [11] Bradley, S. and Fayyad, U. M., Refining initial points for K-means clustering, In Proceedings of the Fifteenth International Conference on Machine Learning (ICML 98) (Ed. J. Shavlik), Madison, Wisconsin, pp (Morgan Kaufmann, San Francisco, California), [12] Ishioka, T., Extended K-means with an efficient estimation of the number of clusters, In Proceedings of the Second International Conference on Intelligent Data Engineering and Automated Learning (IDEAL 2000), Hong Kong, PR China, pp , [13] Kanungo, T., Mount, D. M., Netanyahu, N., Piatko, C., Silverman, R., and Wu, A., The efficient K-means clustering algorithm: analysis and implementation, IEEE Trans. Pattern Analysis Mach. Intell., 24(7), , [14] Pelleg, D. and Moore, A., Accelerating exact K-means algorithms with geometric reasoning, In Proceedings of the Conference on Knowledge Discovery in Databases (KDD 99), San Diego, California, pp , [15] Pelleg, D. and Moore, A., X-means: extending K- means with efficient estimation of the number of clusters, In Proceedings of the 17th International Conference on Machine Learning (ICML 2000), Stanford, California, , [16] Kothari, R. and Pitts, D., On finding the number of clusters, Pattern Recognition Lett., 20, ,


More information

Clustering. Chapter 10 in Introduction to statistical learning

Clustering. Chapter 10 in Introduction to statistical learning Clustering Chapter 10 in Introduction to statistical learning 16 14 12 10 8 6 4 2 0 2 4 6 8 10 12 14 1 Clustering ² Clustering is the art of finding groups in data (Kaufman and Rousseeuw, 1990). ² What

More information

I How does the formulation (5) serve the purpose of the composite parameterization

I How does the formulation (5) serve the purpose of the composite parameterization Supplemental Material to Identifying Alzheimer s Disease-Related Brain Regions from Multi-Modality Neuroimaging Data using Sparse Composite Linear Discrimination Analysis I How does the formulation (5)

More information

ECLT 5810 Clustering

ECLT 5810 Clustering ECLT 5810 Clustering What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters Cluster analysis Grouping

More information

Clustering. CE-717: Machine Learning Sharif University of Technology Spring Soleymani

Clustering. CE-717: Machine Learning Sharif University of Technology Spring Soleymani Clustering CE-717: Machine Learning Sharif University of Technology Spring 2016 Soleymani Outline Clustering Definition Clustering main approaches Partitional (flat) Hierarchical Clustering validation

More information

Efficiency of k-means and K-Medoids Algorithms for Clustering Arbitrary Data Points

Efficiency of k-means and K-Medoids Algorithms for Clustering Arbitrary Data Points Efficiency of k-means and K-Medoids Algorithms for Clustering Arbitrary Data Points Dr. T. VELMURUGAN Associate professor, PG and Research Department of Computer Science, D.G.Vaishnav College, Chennai-600106,

More information

Monika Maharishi Dayanand University Rohtak

Monika Maharishi Dayanand University Rohtak Performance enhancement for Text Data Mining using k means clustering based genetic optimization (KMGO) Monika Maharishi Dayanand University Rohtak ABSTRACT For discovering hidden patterns and structures

More information

Mine Blood Donors Information through Improved K- Means Clustering Bondu Venkateswarlu 1 and Prof G.S.V.Prasad Raju 2

Mine Blood Donors Information through Improved K- Means Clustering Bondu Venkateswarlu 1 and Prof G.S.V.Prasad Raju 2 Mine Blood Donors Information through Improved K- Means Clustering Bondu Venkateswarlu 1 and Prof G.S.V.Prasad Raju 2 1 Department of Computer Science and Systems Engineering, Andhra University, Visakhapatnam-

More information

CHAPTER 4: CLUSTER ANALYSIS

CHAPTER 4: CLUSTER ANALYSIS CHAPTER 4: CLUSTER ANALYSIS WHAT IS CLUSTER ANALYSIS? A cluster is a collection of data-objects similar to one another within the same group & dissimilar to the objects in other groups. Cluster analysis

More information

Clustering and Visualisation of Data

Clustering and Visualisation of Data Clustering and Visualisation of Data Hiroshi Shimodaira January-March 28 Cluster analysis aims to partition a data set into meaningful or useful groups, based on distances between data points. In some

More information

Accelerating K-Means Clustering with Parallel Implementations and GPU computing

Accelerating K-Means Clustering with Parallel Implementations and GPU computing Accelerating K-Means Clustering with Parallel Implementations and GPU computing Janki Bhimani Electrical and Computer Engineering Dept. Northeastern University Boston, MA Email: bhimani@ece.neu.edu Miriam

More information

9.1. K-means Clustering

9.1. K-means Clustering 424 9. MIXTURE MODELS AND EM Section 9.2 Section 9.3 Section 9.4 view of mixture distributions in which the discrete latent variables can be interpreted as defining assignments of data points to specific

More information

A Classifier with the Function-based Decision Tree

A Classifier with the Function-based Decision Tree A Classifier with the Function-based Decision Tree Been-Chian Chien and Jung-Yi Lin Institute of Information Engineering I-Shou University, Kaohsiung 84008, Taiwan, R.O.C E-mail: cbc@isu.edu.tw, m893310m@isu.edu.tw

More information

Introduction to Artificial Intelligence

Introduction to Artificial Intelligence Introduction to Artificial Intelligence COMP307 Machine Learning 2: 3-K Techniques Yi Mei yi.mei@ecs.vuw.ac.nz 1 Outline K-Nearest Neighbour method Classification (Supervised learning) Basic NN (1-NN)

More information

Instance-Based Representations. k-nearest Neighbor. k-nearest Neighbor. k-nearest Neighbor. exemplars + distance measure. Challenges.

Instance-Based Representations. k-nearest Neighbor. k-nearest Neighbor. k-nearest Neighbor. exemplars + distance measure. Challenges. Instance-Based Representations exemplars + distance measure Challenges. algorithm: IB1 classify based on majority class of k nearest neighbors learned structure is not explicitly represented choosing k

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2009 CS 551, Spring 2009 c 2009, Selim Aksoy (Bilkent University)

More information

Data Mining: Models and Methods

Data Mining: Models and Methods Data Mining: Models and Methods Author, Kirill Goltsman A White Paper July 2017 --------------------------------------------------- www.datascience.foundation Copyright 2016-2017 What is Data Mining? Data

More information

Expectation Maximization (EM) and Gaussian Mixture Models

Expectation Maximization (EM) and Gaussian Mixture Models Expectation Maximization (EM) and Gaussian Mixture Models Reference: The Elements of Statistical Learning, by T. Hastie, R. Tibshirani, J. Friedman, Springer 1 2 3 4 5 6 7 8 Unsupervised Learning Motivation

More information

Fast Efficient Clustering Algorithm for Balanced Data

Fast Efficient Clustering Algorithm for Balanced Data Vol. 5, No. 6, 214 Fast Efficient Clustering Algorithm for Balanced Data Adel A. Sewisy Faculty of Computer and Information, Assiut University M. H. Marghny Faculty of Computer and Information, Assiut

More information

A Study on Clustering Method by Self-Organizing Map and Information Criteria

A Study on Clustering Method by Self-Organizing Map and Information Criteria A Study on Clustering Method by Self-Organizing Map and Information Criteria Satoru Kato, Tadashi Horiuchi,andYoshioItoh Matsue College of Technology, 4-4 Nishi-ikuma, Matsue, Shimane 90-88, JAPAN, kato@matsue-ct.ac.jp

More information

SSV Criterion Based Discretization for Naive Bayes Classifiers

SSV Criterion Based Discretization for Naive Bayes Classifiers SSV Criterion Based Discretization for Naive Bayes Classifiers Krzysztof Grąbczewski kgrabcze@phys.uni.torun.pl Department of Informatics, Nicolaus Copernicus University, ul. Grudziądzka 5, 87-100 Toruń,

More information

ECLT 5810 Clustering

ECLT 5810 Clustering ECLT 5810 Clustering What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters Cluster analysis Grouping

More information

Comparative Study Of Different Data Mining Techniques : A Review

Comparative Study Of Different Data Mining Techniques : A Review Volume II, Issue IV, APRIL 13 IJLTEMAS ISSN 7-5 Comparative Study Of Different Data Mining Techniques : A Review Sudhir Singh Deptt of Computer Science & Applications M.D. University Rohtak, Haryana sudhirsingh@yahoo.com

More information

Basic Concepts Weka Workbench and its terminology

Basic Concepts Weka Workbench and its terminology Changelog: 14 Oct, 30 Oct Basic Concepts Weka Workbench and its terminology Lecture Part Outline Concepts, instances, attributes How to prepare the input: ARFF, attributes, missing values, getting to know

More information

Performance Measure of Hard c-means,fuzzy c-means and Alternative c-means Algorithms

Performance Measure of Hard c-means,fuzzy c-means and Alternative c-means Algorithms Performance Measure of Hard c-means,fuzzy c-means and Alternative c-means Algorithms Binoda Nand Prasad*, Mohit Rathore**, Geeta Gupta***, Tarandeep Singh**** *Guru Gobind Singh Indraprastha University,

More information

Analyzing Outlier Detection Techniques with Hybrid Method

Analyzing Outlier Detection Techniques with Hybrid Method Analyzing Outlier Detection Techniques with Hybrid Method Shruti Aggarwal Assistant Professor Department of Computer Science and Engineering Sri Guru Granth Sahib World University. (SGGSWU) Fatehgarh Sahib,

More information

A Comparative Study of Data Mining Process Models (KDD, CRISP-DM and SEMMA)

A Comparative Study of Data Mining Process Models (KDD, CRISP-DM and SEMMA) International Journal of Innovation and Scientific Research ISSN 2351-8014 Vol. 12 No. 1 Nov. 2014, pp. 217-222 2014 Innovative Space of Scientific Research Journals http://www.ijisr.issr-journals.org/

More information

MultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A

MultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A MultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A. 205-206 Pietro Guccione, PhD DEI - DIPARTIMENTO DI INGEGNERIA ELETTRICA E DELL INFORMAZIONE POLITECNICO DI BARI

More information

Machine Learning (BSMC-GA 4439) Wenke Liu

Machine Learning (BSMC-GA 4439) Wenke Liu Machine Learning (BSMC-GA 4439) Wenke Liu 01-25-2018 Outline Background Defining proximity Clustering methods Determining number of clusters Other approaches Cluster analysis as unsupervised Learning Unsupervised

More information

Improved Performance of Unsupervised Method by Renovated K-Means

Improved Performance of Unsupervised Method by Renovated K-Means Improved Performance of Unsupervised Method by Renovated P.Ashok Research Scholar, Bharathiar University, Coimbatore Tamilnadu, India. ashokcutee@gmail.com Dr.G.M Kadhar Nawaz Department of Computer Application

More information

Rank Measures for Ordering

Rank Measures for Ordering Rank Measures for Ordering Jin Huang and Charles X. Ling Department of Computer Science The University of Western Ontario London, Ontario, Canada N6A 5B7 email: fjhuang33, clingg@csd.uwo.ca Abstract. Many

More information

An Enhanced K-Medoid Clustering Algorithm

An Enhanced K-Medoid Clustering Algorithm An Enhanced Clustering Algorithm Archna Kumari Science &Engineering kumara.archana14@gmail.com Pramod S. Nair Science &Engineering, pramodsnair@yahoo.com Sheetal Kumrawat Science &Engineering, sheetal2692@gmail.com

More information

Cluster Analysis. CSE634 Data Mining

Cluster Analysis. CSE634 Data Mining Cluster Analysis CSE634 Data Mining Agenda Introduction Clustering Requirements Data Representation Partitioning Methods K-Means Clustering K-Medoids Clustering Constrained K-Means clustering Introduction

More information

NORMALIZATION INDEXING BASED ENHANCED GROUPING K-MEAN ALGORITHM

NORMALIZATION INDEXING BASED ENHANCED GROUPING K-MEAN ALGORITHM NORMALIZATION INDEXING BASED ENHANCED GROUPING K-MEAN ALGORITHM Saroj 1, Ms. Kavita2 1 Student of Masters of Technology, 2 Assistant Professor Department of Computer Science and Engineering JCDM college

More information

INITIALIZING CENTROIDS FOR K-MEANS ALGORITHM AN ALTERNATIVE APPROACH

INITIALIZING CENTROIDS FOR K-MEANS ALGORITHM AN ALTERNATIVE APPROACH Volume 118 No. 18 2018, 1565-1570 ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version) url: http://www.ijpam.eu ijpam.eu INITIALIZING CENTROIDS FOR K-MEANS ALGORITHM AN ALTERNATIVE APPROACH

More information

A PSO-based Generic Classifier Design and Weka Implementation Study

A PSO-based Generic Classifier Design and Weka Implementation Study International Forum on Mechanical, Control and Automation (IFMCA 16) A PSO-based Generic Classifier Design and Weka Implementation Study Hui HU1, a Xiaodong MAO1, b Qin XI1, c 1 School of Economics and

More information

Nearest Clustering Algorithm for Satellite Image Classification in Remote Sensing Applications

Nearest Clustering Algorithm for Satellite Image Classification in Remote Sensing Applications Nearest Clustering Algorithm for Satellite Image Classification in Remote Sensing Applications Anil K Goswami 1, Swati Sharma 2, Praveen Kumar 3 1 DRDO, New Delhi, India 2 PDM College of Engineering for

More information

Chapter 8 The C 4.5*stat algorithm

Chapter 8 The C 4.5*stat algorithm 109 The C 4.5*stat algorithm This chapter explains a new algorithm namely C 4.5*stat for numeric data sets. It is a variant of the C 4.5 algorithm and it uses variance instead of information gain for the

More information

Machine Learning. Unsupervised Learning. Manfred Huber

Machine Learning. Unsupervised Learning. Manfred Huber Machine Learning Unsupervised Learning Manfred Huber 2015 1 Unsupervised Learning In supervised learning the training data provides desired target output for learning In unsupervised learning the training

More information

Data Clustering Hierarchical Clustering, Density based clustering Grid based clustering

Data Clustering Hierarchical Clustering, Density based clustering Grid based clustering Data Clustering Hierarchical Clustering, Density based clustering Grid based clustering Team 2 Prof. Anita Wasilewska CSE 634 Data Mining All Sources Used for the Presentation Olson CF. Parallel algorithms

More information

Solution Sketches Midterm Exam COSC 6342 Machine Learning March 20, 2013

Solution Sketches Midterm Exam COSC 6342 Machine Learning March 20, 2013 Your Name: Your student id: Solution Sketches Midterm Exam COSC 6342 Machine Learning March 20, 2013 Problem 1 [5+?]: Hypothesis Classes Problem 2 [8]: Losses and Risks Problem 3 [11]: Model Generation

More information

Unsupervised Learning. Presenter: Anil Sharma, PhD Scholar, IIIT-Delhi

Unsupervised Learning. Presenter: Anil Sharma, PhD Scholar, IIIT-Delhi Unsupervised Learning Presenter: Anil Sharma, PhD Scholar, IIIT-Delhi Content Motivation Introduction Applications Types of clustering Clustering criterion functions Distance functions Normalization Which

More information

AN IMPROVED K-MEANS CLUSTERING ALGORITHM FOR IMAGE SEGMENTATION

AN IMPROVED K-MEANS CLUSTERING ALGORITHM FOR IMAGE SEGMENTATION AN IMPROVED K-MEANS CLUSTERING ALGORITHM FOR IMAGE SEGMENTATION WILLIAM ROBSON SCHWARTZ University of Maryland, Department of Computer Science College Park, MD, USA, 20742-327, schwartz@cs.umd.edu RICARDO

More information

Unsupervised Learning

Unsupervised Learning Unsupervised Learning Unsupervised learning Until now, we have assumed our training samples are labeled by their category membership. Methods that use labeled samples are said to be supervised. However,

More information

Comparing and Selecting Appropriate Measuring Parameters for K-means Clustering Technique

Comparing and Selecting Appropriate Measuring Parameters for K-means Clustering Technique International Journal of Soft Computing and Engineering (IJSCE) Comparing and Selecting Appropriate Measuring Parameters for K-means Clustering Technique Shreya Jain, Samta Gajbhiye Abstract Clustering

More information

A Review of K-mean Algorithm

A Review of K-mean Algorithm A Review of K-mean Algorithm Jyoti Yadav #1, Monika Sharma *2 1 PG Student, CSE Department, M.D.U Rohtak, Haryana, India 2 Assistant Professor, IT Department, M.D.U Rohtak, Haryana, India Abstract Cluster

More information

Gene Clustering & Classification

Gene Clustering & Classification BINF, Introduction to Computational Biology Gene Clustering & Classification Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Introduction to Gene Clustering

More information

Introduction to Computer Science

Introduction to Computer Science DM534 Introduction to Computer Science Clustering and Feature Spaces Richard Roettger: About Me Computer Science (Technical University of Munich and thesis at the ICSI at the University of California at

More information

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany Syllabus Fri. 27.10. (1) 0. Introduction A. Supervised Learning: Linear Models & Fundamentals Fri. 3.11. (2) A.1 Linear Regression Fri. 10.11. (3) A.2 Linear Classification Fri. 17.11. (4) A.3 Regularization

More information

On Sample Weighted Clustering Algorithm using Euclidean and Mahalanobis Distances

On Sample Weighted Clustering Algorithm using Euclidean and Mahalanobis Distances International Journal of Statistics and Systems ISSN 0973-2675 Volume 12, Number 3 (2017), pp. 421-430 Research India Publications http://www.ripublication.com On Sample Weighted Clustering Algorithm using

More information

2 Computation with Floating-Point Numbers

2 Computation with Floating-Point Numbers 2 Computation with Floating-Point Numbers 2.1 Floating-Point Representation The notion of real numbers in mathematics is convenient for hand computations and formula manipulations. However, real numbers

More information

University of Florida CISE department Gator Engineering. Clustering Part 2

University of Florida CISE department Gator Engineering. Clustering Part 2 Clustering Part 2 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville Partitional Clustering Original Points A Partitional Clustering Hierarchical

More information

Motivation. Technical Background

Motivation. Technical Background Handling Outliers through Agglomerative Clustering with Full Model Maximum Likelihood Estimation, with Application to Flow Cytometry Mark Gordon, Justin Li, Kevin Matzen, Bryce Wiedenbeck Motivation Clustering

More information

Data Mining - Data. Dr. Jean-Michel RICHER Dr. Jean-Michel RICHER Data Mining - Data 1 / 47

Data Mining - Data. Dr. Jean-Michel RICHER Dr. Jean-Michel RICHER Data Mining - Data 1 / 47 Data Mining - Data Dr. Jean-Michel RICHER 2018 jean-michel.richer@univ-angers.fr Dr. Jean-Michel RICHER Data Mining - Data 1 / 47 Outline 1. Introduction 2. Data preprocessing 3. CPA with R 4. Exercise

More information

Selection of n in K-Means Algorithm

Selection of n in K-Means Algorithm International Journal of Information & Computation Technology. ISSN 0974-2239 Volume 4, Number 6 (2014), pp. 577-582 International Research Publications House http://www. irphouse.com Selection of n in

More information

Machine Learning Chapter 2. Input

Machine Learning Chapter 2. Input Machine Learning Chapter 2. Input 2 Input: Concepts, instances, attributes Terminology What s a concept? Classification, association, clustering, numeric prediction What s in an example? Relations, flat

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2008 CS 551, Spring 2008 c 2008, Selim Aksoy (Bilkent University)

More information

Datasets Size: Effect on Clustering Results

Datasets Size: Effect on Clustering Results 1 Datasets Size: Effect on Clustering Results Adeleke Ajiboye 1, Ruzaini Abdullah Arshah 2, Hongwu Qin 3 Faculty of Computer Systems and Software Engineering Universiti Malaysia Pahang 1 {ajibraheem@live.com}

More information

A Genetic Algorithm Approach for Clustering

A Genetic Algorithm Approach for Clustering www.ijecs.in International Journal Of Engineering And Computer Science ISSN:2319-7242 Volume 3 Issue 6 June, 2014 Page No. 6442-6447 A Genetic Algorithm Approach for Clustering Mamta Mor 1, Poonam Gupta

More information

KEYWORDS: Clustering, RFPCM Algorithm, Ranking Method, Query Redirection Method.

KEYWORDS: Clustering, RFPCM Algorithm, Ranking Method, Query Redirection Method. IJESRT INTERNATIONAL JOURNAL OF ENGINEERING SCIENCES & RESEARCH TECHNOLOGY IMPROVED ROUGH FUZZY POSSIBILISTIC C-MEANS (RFPCM) CLUSTERING ALGORITHM FOR MARKET DATA T.Buvana*, Dr.P.krishnakumari *Research

More information

Mining di Dati Web. Lezione 3 - Clustering and Classification

Mining di Dati Web. Lezione 3 - Clustering and Classification Mining di Dati Web Lezione 3 - Clustering and Classification Introduction Clustering and classification are both learning techniques They learn functions describing data Clustering is also known as Unsupervised

More information

International Journal of Computer Engineering and Applications, Volume VIII, Issue III, Part I, December 14

International Journal of Computer Engineering and Applications, Volume VIII, Issue III, Part I, December 14 International Journal of Computer Engineering and Applications, Volume VIII, Issue III, Part I, December 14 DESIGN OF AN EFFICIENT DATA ANALYSIS CLUSTERING ALGORITHM Dr. Dilbag Singh 1, Ms. Priyanka 2

More information

ADaM version 4.0 (Eagle) Tutorial Information Technology and Systems Center University of Alabama in Huntsville

ADaM version 4.0 (Eagle) Tutorial Information Technology and Systems Center University of Alabama in Huntsville ADaM version 4.0 (Eagle) Tutorial Information Technology and Systems Center University of Alabama in Huntsville Tutorial Outline Overview of the Mining System Architecture Data Formats Components Using

More information

Machine Learning (BSMC-GA 4439) Wenke Liu

Machine Learning (BSMC-GA 4439) Wenke Liu Machine Learning (BSMC-GA 4439) Wenke Liu 01-31-017 Outline Background Defining proximity Clustering methods Determining number of clusters Comparing two solutions Cluster analysis as unsupervised Learning

More information

Unsupervised learning on Color Images

Unsupervised learning on Color Images Unsupervised learning on Color Images Sindhuja Vakkalagadda 1, Prasanthi Dhavala 2 1 Computer Science and Systems Engineering, Andhra University, AP, India 2 Computer Science and Systems Engineering, Andhra

More information

Cluster quality 15. Running time 0.7. Distance between estimated and true means Running time [s]

Cluster quality 15. Running time 0.7. Distance between estimated and true means Running time [s] Fast, single-pass K-means algorithms Fredrik Farnstrom Computer Science and Engineering Lund Institute of Technology, Sweden arnstrom@ucsd.edu James Lewis Computer Science and Engineering University of

More information

Comparision between Quad tree based K-Means and EM Algorithm for Fault Prediction

Comparision between Quad tree based K-Means and EM Algorithm for Fault Prediction Comparision between Quad tree based K-Means and EM Algorithm for Fault Prediction Swapna M. Patil Dept.Of Computer science and Engineering,Walchand Institute Of Technology,Solapur,413006 R.V.Argiddi Assistant

More information

Fuzzy Ant Clustering by Centroid Positioning

Fuzzy Ant Clustering by Centroid Positioning Fuzzy Ant Clustering by Centroid Positioning Parag M. Kanade and Lawrence O. Hall Computer Science & Engineering Dept University of South Florida, Tampa FL 33620 @csee.usf.edu Abstract We

More information

Normalization based K means Clustering Algorithm

Normalization based K means Clustering Algorithm Normalization based K means Clustering Algorithm Deepali Virmani 1,Shweta Taneja 2,Geetika Malhotra 3 1 Department of Computer Science,Bhagwan Parshuram Institute of Technology,New Delhi Email:deepalivirmani@gmail.com

More information

Dimension reduction : PCA and Clustering

Dimension reduction : PCA and Clustering Dimension reduction : PCA and Clustering By Hanne Jarmer Slides by Christopher Workman Center for Biological Sequence Analysis DTU The DNA Array Analysis Pipeline Array design Probe design Question Experimental

More information

Introduction to Mobile Robotics

Introduction to Mobile Robotics Introduction to Mobile Robotics Clustering Wolfram Burgard Cyrill Stachniss Giorgio Grisetti Maren Bennewitz Christian Plagemann Clustering (1) Common technique for statistical data analysis (machine learning,

More information