OPTIMIZATION AND ANALYSIS OF CLUSTERS

Dost Muhammad Khan, Faisal Shahzad, Najia Saher, Nawaz Mohamudally*
Department of Computer Science & IT, The Islamia University of Bahawalpur, PAKISTAN
khan.dostkhan@iub.edu.pk, faisalsd@gmail.com, najia_saher@hotmail.com
* School of Innovative Technologies and Engineering (SITE), University of Technology, MAURITIUS, alimohamudally@utm.intnet.mu

ABSTRACT: Clustering is an important area of data mining which is used to find patterns in a dataset. The K-means clustering algorithm is used to group a large dataset into clusters of smaller sets of similar data. The number of clusters K has to be specified as an input parameter in advance. There are different approaches and methods for choosing the value of K; in practice, several values of K are tried and the one that gives the most desirable result is selected. There is no statistical answer to the question of what the most suitable, appropriate, accurate and correct value of K is for a given dataset. In this paper we try to find a statistical answer to this question by presenting different optimization approaches and methods for the value of K. Furthermore, in order to extract patterns and features from a dataset, we also present the analysis of the clusters of different datasets obtained through these optimization approaches.

Keywords: Clustering; Clusters Analysis; Clusters Optimization; MDL; Gap Statistics

1. INTRODUCTION
In cluster analysis, the basic and fundamental problem is to optimize the number of clusters, which directly affects the results of clustering. Cluster analysis consists of: clustering element and variable selection, variable standardization, choosing a measure of association, selection of a clustering method, determining the number of clusters and interpretation of the clusters, as enumerated in [31]. The focus of this paper is on the optimization and analysis of clusters in order to select interesting patterns from a dataset. It is a common saying that clusters are not in the data but in the viewing eye.

Clustering is considered a fundamental task in data mining which is used to classify the data, i.e. to create groups of data, and it is often used for segmenting data. Since the aim and objective of clustering is to discover a new set of categories, the new groups are of interest in themselves and their assessment is intrinsic; therefore, the outcome of clustering is descriptive. Choosing an appropriate clustering method is another critical step. There exist a number of clustering algorithms, among which the K-means clustering algorithm is commonly used due to its simplicity of implementation and fast execution. It appears extensively in the machine learning literature and in most data mining suites. The algorithm is studied not only in classification, data mining and machine learning, but there is also an increasing number of practitioners in market research, bioinformatics, customer management, engineering and other application areas [20,24,27].

The K-means clustering algorithm takes as input a predefined number of clusters, i.e. the K in its name, while "means" stands for the average, i.e. the location of all the members of a particular cluster. The algorithm is used for finding similar patterns due to its simplicity and fast execution [1]. The following steps explain the execution of the algorithm:
Step 1: Input the number of clusters K and the number of iterations.
Step 2: Calculate the initial centroids and divide the datapoints into K clusters.
Step 3: Move datapoints between clusters using the distance formula and re-calculate the new centroids.
Step 4: Repeat step 3 until no datapoint has to be moved.
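As an illustration of these four steps, the following minimal sketch (ours, not the authors' implementation; it assumes NumPy and a numeric data matrix X whose rows are datapoints) shows the basic K-means loop:

import numpy as np

def k_means(X, k, iterations=100, seed=0):
    # Step 1 inputs: the data matrix X, the number of clusters k and the iteration limit.
    rng = np.random.default_rng(seed)
    # Step 2: initial centroids (here simply k distinct random datapoints).
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(iterations):
        # Step 3: assign every datapoint to its nearest centroid (squared Euclidean distance).
        distances = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = distances.argmin(axis=1)
        # Step 4: stop when no datapoint is moved to another cluster.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Re-calculate the centroid of every non-empty cluster.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids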
The algorithm follows an iterative procedure. The number of clusters K and the number of iterations are the required, basic inputs of the K-means algorithm; without these inputs the algorithm cannot start its execution. There is still no answer to questions such as how many clusters are required and what the exact number of clusters is for a given dataset. In this paper we discuss different optimization approaches and apply them to different real-life datasets. Furthermore, the numbers of clusters proposed by these approaches are analyzed, which helps to discover the pattern(s) in a dataset; finally, a comparison is drawn between these approaches and the results of the different optimization approaches are verified.

The rest of the paper is organized as follows: Section 2 is about related work, and in section 3 we discuss the improved methods and approaches to optimize the number of clusters K. Section 4 is about the methodology, the results are discussed in section 5 and finally the conclusion is drawn in section 6.

2. Related Work
We present some of the existing techniques and approaches for optimizing the number of clusters in a dataset. In [28] the proposed measure for the value of K is called validity, which is a ratio of intra-cluster and inter-cluster distances. The intra-cluster distance is the sum of squared distances within all clusters and the inter-cluster distance is the distance between two clusters. Dividing the intra-cluster distance by the inter-cluster distance gives the validity, which is used to choose the value of K. The approach is tested over the red, blue and green colour space and the results are encouraging and satisfactory. In this approach one first has to fix the maximum value of K, called k_max, i.e. an image can be segmented with 2 to k_max clusters. The proposed approach is tested on the K-means clustering data mining algorithm. The value of k_max is optimized through the ratio (validity) and the number of clusters present in an image is determined automatically. We observe that a lot of computation is required in this approach and that the value of k_max has to be decided in advance; these are two limitations of this approach.

In [19] a method called the gap statistic is proposed to estimate the number of clusters K in a dataset. The proposed method is tested on the K-means clustering data mining algorithm. The technique computes the sum of pairwise distances D_r of all datapoints in a cluster, calculates a pooled sum of squares W_k around the cluster means, and then compares log(W_k) with its expectation, where the number of clusters varies from 2 to n. Finally, the gap is calculated by subtracting the two values, and in this way the optimal number of clusters is obtained. The formula is applicable only if more than two clusters are required. We observe that a lot of computation is required, and another overhead of this optimization approach is that the value of the expectation must be greater than 2.

In [29] it is suggested that the number of clusters K can be optimized through model-based cluster analysis techniques, which are also helpful in answering questions such as how many clusters there are and which clustering method is suitable for a dataset. Different model-based approaches are discussed, such as probability models, expectation-maximization (EM), Bayesian methods and the modelling of noise and outliers, and a comparison between them is drawn. Each model-based approach has its own advantages and disadvantages. The optimal number of clusters is demonstrated through visualization. Again, a lot of computation is required in these approaches.

In [24] a comparison is drawn between various cluster optimization approaches, such as variance-based, structural, consensus distribution, hierarchical and re-sampling approaches, in the K-means clustering data mining algorithm. iK-means, an intelligent version of the K-means algorithm, is used in this comparison. It is proposed that data and cluster size, cluster shape, cluster intermix and data standardization are the determinant factors for optimizing the number of clusters.

In [23], commonly used existing methods of cluster optimization, such as a value of K specified within a range, specified by the user, derived from statistical measures, set to the number of classes, or chosen through visualization, are first reviewed; then a recursive evaluation function f(K) is proposed to suggest multiple values of K for a dataset. The function is divided into three different equations for K = 1, K = 2 and K > 2, where the weighted gap method is used. Different weights are used for each value of K. The sum of squared errors of two adjacent values of K is divided, and if the result is approximately equal to 1 then these are the suggested values of K. The function is tested on the incremental K-means algorithm.

In [30] two new methods are proposed to determine the number of clusters in a dataset, and it is also claimed that these techniques perform better than those already available. The weighted gap method is applied in the proposed methods. The commonly used optimization techniques are also discussed and a comparison with the proposed techniques is drawn.

It is also observed that the K-means clustering algorithm, in its different forms, is chosen to test all the proposed approaches. After the extensive review of the optimization approaches discussed above, the following limitations are enumerated:
1. It is difficult to take into account different equations with different weights; the process of allocating the weights is not simple and is difficult to understand even for an experienced user.
2. Since all the proposed approaches are based on randomly generated initial centroids, different initial centroids can affect the results.
3. All the approaches discussed above are computationally expensive even for small and medium datasets.
In order to address these limitations we propose improved cluster optimization approaches in the next section.

3. Approaches of Clusters Optimization
There are 30 different methods and approaches available to optimize the number of clusters in a dataset [32]. In this section we discuss the improved methods and approaches to optimize the number of clusters K using the K-means clustering data mining algorithm.

3.1 K: the User Defined Approach
The user has to set the required parameters, i.e. the number of clusters K and the number of iterations, in the K-means clustering algorithm. If no parameter is set then default values are taken automatically by the algorithm, e.g. in ODM and MS SQL Server the default number of clusters is 10. In order to obtain satisfactory results, the clustering algorithm is tested with different values of K. The issue with this method is how to validate and evaluate the obtained clustering results, which is only possible through visualization of these results; but it is difficult to validate and evaluate the results for multidimensional datasets through visualization. This is a popular approach implemented in many data mining suites [4-7]. Some data mining suites also provide a maximum limit for the number of clusters K.

3.2 Information Criterion Approach
The value of K can be specified through the value of the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC), the most commonly used information criteria [12-15]. The information criterion is often applied to probabilistic clustering approaches; Expectation Maximization (EM) is one example of probabilistic clustering. The K-means clustering algorithm, on the other hand, uses a partitioning clustering approach. Therefore certain assumptions are made about the distribution of the data, e.g. we can only compute the AIC or BIC of datasets which follow a set of Gaussian distributions [8-10]. Although EM and K-means clustering are based on different hypotheses, models and criteria, some ideas are common to both [11]. Researchers believe that the value of the information criterion is not applicable as the value of K because of its probabilistic nature, but we propose that it is just a number that one can use as the value of K. Since the value of the information criterion of a dataset is sometimes very high, it is not appropriate to apply the same value directly; therefore the logarithm of the information criterion is taken as the number of clusters. This is illustrated in equation (1):

K = ln(ValueOfInformationCriterion)   (1)

where K is the number of clusters and ln(ValueOfInformationCriterion) is the logarithm of the value of the information criterion, either AIC or BIC.
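A sketch of equation (1) is given below (ours, not the authors' code; the information criterion value is assumed to come from whatever probabilistic model was fitted, and the lower bound of 2 clusters is our assumption, following the range reported later in section 5):

import math

def k_from_information_criterion(ic_value):
    # Equation (1): take the natural logarithm of the (often very large) AIC or BIC value.
    # ic_value is assumed positive and reported by a probabilistic clustering model,
    # e.g. an EM / Gaussian mixture fit; the floor of 2 clusters is an assumption.
    return max(2, round(math.log(ic_value)))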

3.3 K Equivalent to the Number of Parameters
The number of clusters is equated to the number of parameters (classes) in a dataset. It is a very simple approach to optimize the value of K. Although clustering is an unsupervised data mining task, this approach effectively treats it in a supervised manner, since it relies on the class labels of the dataset. The issue with this approach is that one has to assume that each produced cluster contains only objects belonging to one class; this assumption does not hold for real problems [16,17]. Another problem with this approach is that having too many or too few parameters in a dataset can raise the problem of over-fitting or under-fitting respectively. The issue of over- and under-fitting can be addressed through a model selection criterion [22].

3.4 K Specified through the Datapoints
The value of K can also be determined from the number of datapoints of the dataset. The formula, also called the rule of thumb, is given in equation (2) [18]:

K = sqrt(N / 2)   (2)

where K is the number of clusters and N is the number of datapoints of the given dataset, N > 0. One major issue of this method is that if the dataset has a large number of datapoints then the value of K will be large; e.g. a large dataset can easily yield a value such as K = 77 under this formula. It is not only difficult to select the appropriate results from 77 clusters, but the time and space complexity of the algorithm also becomes high. The formula in equation (2) is not even suitable for small datasets; therefore we propose that, instead of taking the square root of the number of datapoints, the logarithm of the number of datapoints is taken, as given in equation (3):

K = ln(N)   (3)

where K is the number of clusters and N is the number of datapoints of the given dataset, N > 0. The formula in equation (3) is suitable for all types of datasets, i.e. small, medium and large. We take the logarithm of the number of datapoints because the logarithm grows very slowly, which keeps the value of K manageable even when dealing with very large datasets.
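A quick sketch contrasting the two formulas (ours; it assumes only that N is the number of datapoints):

import math

def k_rule_of_thumb(n_points):
    # Equation (2): K = sqrt(N / 2); grows quickly with the number of datapoints.
    return round(math.sqrt(n_points / 2))

def k_logarithmic(n_points):
    # Equation (3): K = ln(N); stays small even for very large datasets.
    return max(1, round(math.log(n_points)))

# For the Iris dataset (N = 150) these give roughly 9 and 5 clusters respectively.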
3.5 K Determined through the Stepwise Method
In the stepwise method we start with the number of clusters K equal to 1 and then gradually increase the value by 1. The distance, also called the distortion of a cluster, is computed for each value of K. The stepwise method ensures that the optimization is not carried to the extreme. We apply the squared Euclidean distance (any distance formula can be applied) to compute the distance; the formula is given in equation (4):

d_j = sum_{k=1..N} (x_ik - x_jk)^2   (4)

where d_j is the distance between the datapoints x_i and x_j, N is the total number of attributes of the given dataset and i, j, k = 1, 2, ..., n. The K-means clustering data mining algorithm uses the sum of squared error criterion for the re-assignment of any sample from one cluster to another, as given in equation (5):

S_k = sum_j d_j   (5)

where K is the number of clusters, S_k is the sum of squared errors of the k-th cluster and d_j is the distance. As the number of clusters K increases, the value of S_k decreases. In order to determine the number of clusters of the dataset, the sum of squared errors for k clusters is divided by the sum of squared errors for (k - 1) clusters; the result is denoted the ratio. The formula for the ratio is given in equation (6), which is also known as the Gap Statistic [19]:

ratio = S_k / S_(k-1)   (6)

where S_k and S_(k-1) are the sums of squared errors for k and (k - 1) clusters respectively. The formula in equation (6) is valid for k >= 2. If the value of k = 1 then it is assumed that there is no need to create clusters of the given dataset, i.e. the dataset is uniformly distributed. The value of the ratio comes close to 1 as the number of clusters increases. The number of values before the convergence to the uniform distribution, i.e. before the value of the ratio comes close to 1, determines the number of clusters for the given dataset. It is an iterative and lengthy process [23,24]. One major constraint of this method is that, since the K-means clustering algorithm is sensitive to the initial centroids, different initial centroids may result in different clusters [20]. Applying this approach, the value of the number of clusters can be the same for small, medium and large datasets, i.e. the value of K does not depend on the objects of the dataset. The number of attributes, parameters and datapoints are the objects of a dataset.

4. METHODOLOGY
The following methodology is adopted in this research paper. For the cluster analysis, the first step is to create appropriate vertical partitions of the dataset. In a vertical data partition the different sets of attributes are on different nodes, and each node must contain a unique identifier to facilitate matching between partitions; therefore the class attribute is common to all created partitions. Suppose there are n attributes in a dataset and we want to create m partitions. The formula is given in equation (7) [25]:

m = (n - 1) / 2   (7)

where 0 < m <= n; the value of m is rounded to the nearest integer. The number of attributes in each partition is 2 along with the class (target) attribute, which is why 1 is subtracted from n.
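A sketch of this partitioning step is shown below (ours; it assumes pandas and that the class attribute is passed by name, and rounds upwards, which matches the partition counts reported for the datasets in section 5):

import math
import pandas as pd

def vertical_partitions(df, class_column):
    # Equation (7): m = (n - 1) / 2 partitions, each holding two attributes plus the class.
    attributes = [c for c in df.columns if c != class_column]
    m = math.ceil(len(attributes) / 2)               # n - 1 attributes, two per partition
    partitions = []
    for i in range(m):
        pair = attributes[2 * i:2 * i + 2]            # the pairing of attributes is arbitrary
        partitions.append(df[pair + [class_column]])  # the class attribute is common to all
    return partitions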

Figure 1. Graphs between K (the number of clusters) and the ratio of the sums of squared errors for the datasets (a) Iris, (b) Sales, (c) BreastCancer, (d) Diabetes, (e) Cars and (f) DNA.

The selection of attributes for each partition is arbitrary and depends on the user; different combinations of attributes will give different results. All the clustering algorithms in data mining are distance based. Therefore, the attributes for the partitions must be selected in such a way that all attributes can have an equal impact on the distance computation. This is to avoid obtaining clusters that are dominated by the attributes with the largest amounts of variation. Furthermore, the population, the percentage of the parameters and the value of the Minimal Description Length (MDL) of each cluster are computed. The cluster(s) with the smallest MDL value(s) are preferred. The formula for the MDL is given in equation (8):

MDL = -2*log(likelihood) + k*log(n)   (8)

where k is the number of parameters, n is the number of datapoints of the given dataset, -2*log(likelihood) is the Model Accuracy and k*log(n) is the Model Size. Therefore, we can write the formula for the MDL as given in equation (9):

MDL = Model Size + Model Accuracy   (9)

The Minimal Description Length (MDL) is also referred to as the BIC (Bayesian Information Criterion) [21,22].
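A minimal sketch of equations (8) and (9) (ours; the log-likelihood is assumed to be reported by whatever probabilistic model was fitted to the cluster, and the sign convention follows the usual BIC form):

import math

def mdl(log_likelihood, k, n):
    # Equation (8): MDL = -2*log(likelihood) + k*log(n).
    model_accuracy = -2.0 * log_likelihood   # the Model Accuracy term
    model_size = k * math.log(n)             # the Model Size term
    return model_size + model_accuracy       # equation (9): smaller values are preferred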
5. RESULTS AND DISCUSSION
In this section we discuss the results of the optimization and of the analysis of clusters respectively.

5.1 Optimization of Clusters
The following datasets are chosen to test the different cluster optimization approaches discussed above:
i. Iris: The dataset contains information about flowers. There are 3 different classes (parameters) of flowers. It has 150 datapoints with 5 attributes [2].
ii. Sales: The dataset contains information about sales. There are 2 different classes (parameters) of sales. It has only 24 datapoints with 5 attributes [2].
iii. BreastCancer: A medical dataset containing information about breast cancer. There are 2 different classes (parameters) of breast cancer. It has 233 datapoints with 10 attributes [2].
iv. Diabetes: A medical dataset containing information about diabetes. There are 2 different classes (parameters) of diabetes. It has 788 datapoints with 9 attributes [3].
v. Cars: A vehicle dataset containing information about different brands of cars from different countries. There are 30 different classes (parameters) of cars. It has 261 datapoints with 9 attributes [2].
vi. DNA: A dataset containing information about DNA. There are 3 different classes (parameters) of DNA, with 181 attributes [2].

We discuss each optimization approach presented in section 3:
1. Applying the user-defined approach to specify the value of K, the user can enter any value of K from a range of 2 to 20, i.e. the minimum value of K is 2 and the maximum is 20. In this method the value of K does not depend on the number of parameters, the number of datapoints or the number of attributes. Before deciding on the value of K, the user checks the number of parameters, the number of attributes and the number of datapoints of the dataset.
2. Since the information criterion is used to determine the fitness (over-, under- or best-fit) of the given dataset, the value of K can easily be specified by taking the logarithm of the value of the information criterion. In this method the value of K depends on the number of parameters and the number of attributes; the number of datapoints has no significant impact on the value of the information criterion. Since selecting the number of parameters and the number of attributes is a model selection problem, one has to take care of these important aspects of a dataset. Using too many or too few parameters and attributes can raise the problem of over- or under-fitting respectively, which shows the significance of the parameters and attributes of the dataset. There is a limit to the number of parameters and the number of attributes in a dataset; going under or above this limit will affect the fitness of the dataset. Therefore, the value of K remains between 2 and 23 in this approach [22].
3. The value of K can easily be determined through the number of parameters in the class attribute of a dataset. In this approach the value of K changes with any change in the number of parameters; the number of attributes and the number of datapoints have no impact on the value of K. This approach to optimizing the number of clusters is applied in many data mining suites. The assumption in this optimization approach is that the dataset must be best-fitted; if the right dataset is not selected then there is a risk of over- or under-fitting and consequently the results are not valid.
4. Similarly, one can apply the logarithm of the number of datapoints of a dataset as the value of K. In this approach the value of K changes with any change in the number of datapoints; the number of attributes and the number of parameters have no impact on the value of K.
These four methods are straightforward and simple, and the value of K can be optimized with little computation. However, in these methods the number of parameters, the number of attributes and the number of datapoints play an important role in specifying the value of K; we can say that the value of K depends on the objects of the dataset.
5. The evaluation of the stepwise method is tested on the above datasets. The K-means clustering algorithm is applied with the value of K ranging from 1 to 10. The main constraint of the K-means clustering algorithm is that it is sensitive to the value of the initial centroids; therefore, in order to avoid this constraint, the data are normalized, which also reduces the effect of differences in the ranges of the attributes of the datasets.
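A sketch of this normalization step (ours; the paper does not name a specific scaler, so scikit-learn's MinMaxScaler is assumed here):

from sklearn.preprocessing import MinMaxScaler

def normalize(X):
    # Rescale every attribute to [0, 1] so that no attribute dominates the distance computation.
    return MinMaxScaler().fit_transform(X)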

The selected datasets have small, medium and high values of attributes, parameters and datapoints. The distortion (distance), the sum of squared errors and the ratio of the sums of squared errors are calculated for each value of K. A lot of computation is involved in this approach. When K = 1 we cannot compute the value of the ratio; therefore it is assumed that the dataset is in a normal uniform distribution. As the value of K increases, the sum of squared errors decreases and we also observe variation in the value of the ratio. Figure 1 shows the graphs between K, the number of clusters, and the ratio of the sums of squared errors for the selected datasets. The explanation of the graphs in Figure 1 is given below.

The graph in Figure 1(a) shows how the value of the ratio of the sums of squared errors reflects the result of clustering on the dataset Iris. The graph can be divided into two regions; one region is from K = 2 to 4 and the second is after K = 5. The value of the ratio is approximately constant and approaches 1 from K = 5 to 9. The number of clusters recommended through this method for this dataset is 4, because after this value the graph is in the uniform distribution. Hence the value of K = 4 for the dataset Iris.

The graph in Figure 1(b) shows that there is a single region and all objects are in a uniform distribution, which also reflects the result of clustering on the dataset Sales. The value of the ratio is approximately constant and approaches 1 from K = 4 to 9. The number of clusters recommended through this method for this dataset is 3, because after this value the graph is in the uniform distribution. Hence the value of K = 3 for the dataset Sales.

The graph in Figure 1(c) shows how the value of the ratio of the sums of squared errors reflects the result of clustering on the dataset BreastCancer. The graph has two regions; roughly the second half of the graph is in the uniform distribution and the first half is not. The value of the ratio is approximately constant and approaches 1 from K = 7 to 9. The number of clusters recommended through this method for this dataset is 6, because after this value the graph is in the uniform distribution. Hence the value of K = 6 for the dataset BreastCancer.

The graph in Figure 1(d) shows how the value of the ratio of the sums of squared errors reflects the result of clustering on the dataset Diabetes. The graph can be divided into two regions; one region is from K = 2 to 5, which is not in the uniform distribution, and the second is after K = 6, which is in the uniform distribution. The value of the ratio is approximately constant and approaches 1 from K = 6 to 9. The number of clusters recommended through this method for this dataset is 5, because after this value the graph is in the uniform distribution. Hence the value of K = 5 for the dataset Diabetes.

The graph in Figure 1(e) shows how the value of the ratio of the sums of squared errors reflects the result of clustering on the dataset Cars. The graph is not in the uniform distribution, which shows that the dataset is not properly normalized. The value of the ratio is either small or large from K = 2 to 8; after K = 9 the graph moves into the normal distribution and the value of the ratio approaches the constant value 1. The number of clusters recommended through this method for this dataset is 8, because after this value the graph is in the uniform distribution. Hence the value of K = 8 for the dataset Cars.

The graph in Figure 1(f) shows how the value of the ratio of the sums of squared errors reflects the result of clustering on the dataset DNA. The graph moves into the normal uniform distribution after the value of K = 7; before this value of K the graph is not in the uniform distribution. The value of the ratio approaches the constant value 1 when K = 8 to 9. The number of clusters recommended through this method for this dataset is 7, because after this value the graph is in the uniform distribution. Hence the value of K = 7 for the dataset DNA.

We can draw the conclusion from the graphs in Figure 1 that the value of K corresponding to a ratio value of 1.0 indicates the uniform distribution, and the value (K - 1) can be recommended as the number of clusters; if there is no such value then K = 1. The value of K specified through the stepwise method does not depend on the objects of a dataset. The method is tested on six different datasets for K = 1 to 10; more values of K and more datasets can be selected for further verification of this method.

The number of clusters recommended through the five different optimization methods and approaches for the datasets is given in Table 1. Table 1 is a summary of all five different approaches to optimize the number of clusters in the K-means clustering data mining algorithm. The recommended values of K are small, except in the case of the parametric (class parameters) approach for the dataset Cars, where the number of parameters is very high. The main issue is how to validate and evaluate these recommended numbers of clusters from the optimization approaches. Data visualization techniques can be applied to verify these results, but it is difficult to validate and evaluate the results for multidimensional datasets through visualization [26]. Finally, the user has to decide the appropriate number of clusters for the given dataset using data visualization techniques.

A comparison of the five different optimization approaches for the selected datasets is illustrated in Figure 2. Figure 2 demonstrates the comparison of the five different approaches to optimize the number of clusters in the K-means clustering data mining algorithm for the different datasets. The recommended number of clusters, i.e. the value of K, obtained through the stepwise approach is rational compared to the rest of the approaches. For the validation of the number of clusters, data visualization techniques can be used. Moreover, a comparison between the optimization approaches discussed in sections 2 and 3 is drawn in Table 2.
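The ratio-versus-K curves summarized above can be reproduced with a short script of the following shape (a sketch, not the authors' code; it assumes scikit-learn and a normalized numeric matrix X):

from sklearn.cluster import KMeans

def ratio_curve(X, k_max=10, seed=0):
    # S_k: the sum of squared errors (inertia) of the K-means solution for each K, equation (5).
    sse = {k: KMeans(n_clusters=k, random_state=seed, n_init=10).fit(X).inertia_
           for k in range(1, k_max + 1)}
    # Equation (6): ratio = S_k / S_(k-1), defined for K >= 2.
    return {k: sse[k] / sse[k - 1] for k in range(2, k_max + 1)}

# The recommended K is read off just before the ratio settles close to 1.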

Figure 2. A comparison of the different optimization approaches: the number of clusters K recommended by the User Defined, Information Criterion, Class Parameters, Datapoints and Stepwise approaches for each of the datasets Iris, Diabetes, BreastCancer, Sales, Cars and DNA.

It is obvious that our proposed optimization approaches are simple, easy to implement and of lower computational complexity compared to the others. The number of clusters should be neither too small nor too large, because it is difficult to extract patterns in both cases. It is also observed that the proposed approaches, also called the improved optimization approaches, are computationally less expensive than the existing approaches. Since both the proposed and the existing approaches provide realistic numbers of clusters for a dataset, we conclude that all the approaches are useful and helpful for optimizing the number of clusters in a dataset.

5.2 Analysis of Clusters
Cluster analysis means partitioning data into meaningful subgroups, which is also known as the characterization of clusters. It is an important tool for unsupervised learning, but the major issue is to estimate the optimal number of clusters. We provide the cluster analysis of the datasets Iris, BreastCancer, Diabetes and Sales using the different optimization approaches presented in this paper. The cluster analysis gives a clear picture of each cluster and also helps to extract the features and patterns from a dataset. For this purpose we first create the vertical partitions of each dataset by applying equation (7). The number of parameters of the dataset becomes the value of K. We analyze each cluster by providing information such as the population of the cluster, the percentage of each parameter in the cluster and the value of the Minimal Description Length (MDL) of the cluster. The MDL of a cluster is computed using equation (9). Finally, we select the cluster having the minimum value of MDL. The following tables demonstrate the whole process of cluster analysis for the selected datasets.

Case 1: The analysis of clusters of the datasets using the number-of-parameters approach. Since there are 5 attributes and 3 parameters in the dataset Iris, 2 partitions and 3 clusters per partition are created according to the methodology. Table 3 shows the cluster analysis of the dataset Iris.

The explanation of Table 3 is: The clusters of both partitions are almost evenly populated. The first cluster of the first partition has Iris-setosa, a single parameter; similarly, the second cluster has the Iris-versicolor parameter and the third one has Iris-virginica, i.e. each cluster of this partition has a distinct parameter of the dataset. The first cluster of the second partition has two parameters, Iris-versicolor and Iris-virginica, but the percentage value of Iris-versicolor is high; the second cluster has Iris-setosa, a single parameter; and the third one again has two parameters, Iris-versicolor and Iris-virginica, but the percentage value of Iris-versicolor is high, i.e. each cluster of this partition is dominated by a distinct parameter of the dataset. The MDL values of the first cluster of the first partition and the second cluster of the second partition are the same, and the rest of the clusters have the same values as each other. The conclusion is that each cluster has a distinct parameter as output. Cluster 1 of partition 1 and cluster 2 of partition 2 have the minimum value of MDL; therefore the features and patterns of these two clusters are selected from the dataset Iris.

Since there are 10 attributes and 2 parameters in the dataset BreastCancer, 5 partitions and 2 clusters per partition are created according to the methodology. Table 4 shows the cluster analysis of the dataset BreastCancer. The explanation of Table 4 is: The population of each partition of the dataset is the same, i.e. the first cluster of each partition is less populated than the second cluster. The percentage value of the parameter Malignant is high in the less populated cluster of each partition. The percentage value of the parameter Benign is high in the highly populated cluster of each partition. The MDL value of each cluster of the dataset is almost the same. The conclusion is that the output of the first cluster of each partition is Malignant and Benign is the output of the second cluster of each partition. Cluster 1 of partition 3 and cluster 1 of partition 5 have almost the same minimum value of MDL; therefore the features and patterns of these two clusters are selected from the dataset BreastCancer.

Since there are 9 attributes and 2 parameters in the dataset Diabetes, 4 partitions and 2 clusters per partition are created according to the methodology. Table 5 shows the cluster analysis of the dataset Diabetes. For the rest of the cases we take the dataset Iris only; other datasets can similarly be used to test the different optimization approaches.

Case 2: The analysis of clusters of the dataset Iris using the Stepwise approach.

The explanation of Table 5 is: The first cluster of each partition has a greater population compared to the second cluster. The MDL value of each cluster of the dataset is almost the same. The conclusion is that the output of the first cluster of each partition is Cat2, and the result of the second cluster of each partition is either Cat1 or Cat2, because the percentages of both parameters are very close. Only cluster 2 of partition 2 has the minimum value of MDL; therefore the feature and pattern of this cluster are selected from the dataset Diabetes.

Since there are 5 attributes and 2 parameters in the dataset Sales, 2 partitions and 2 clusters per partition are created according to the methodology. Table 6 shows the cluster analysis of the dataset Sales.

Table 1. The Number of Clusters based on Different Approaches
Datasets/Approaches: User Defined, Information Criterion, Class Parameters, Datapoints, Stepwise; datasets: Iris, Diabetes, BreastCancer, Sales, Cars, DNA.

Table 2. A Comparison of Clusters Optimization Approaches
Criteria/Approaches: User-defined | Information Criterion | Number of Parameters | Datapoints | Stepwise | Others (discussed in section 2)
Dataset Objects: Independent | Dependent | Dependent | Dependent | Independent | Independent
Computational Complexity: Low | Medium | Medium | Medium | High | Very High
Easy to Implement: Yes | Yes | Yes | Yes | Yes | Not easy even for the experienced user
Overview: Simple and easy | Simple and easy | Simple and easy | Simple and easy | Not simple but it is easy | Very complex and complicated for the user
Conclusion: User dependent | Computation is required | Computation is required | Computation is required | Less computation but still it is easy and simple | A lot of computation can even confuse the experienced user

Table 3. The Clusters Analysis of Dataset Iris
1 petal_length, petal_width, class 1 Iris-setosa=50 Iris-setosa= Iris-versicolor=48 Iris-virginica=4 Iris-versicolor= Iris-versicolor=2 Iris-virginica=46
2 sepal_length, sepal_width, class 1 Iris-versicolor=12 Iris-virginica=35 Iris-virginica=7.7 Iris-versicolor=4.2 Iris-virginica=95.8 Iris-versicolor=25.5 Iris-virginica= Iris-setosa=50 Iris-setosa=100 3 Iris-versicolor=38 Iris-virginica=15 Iris-versicolor=71.7 Iris-virginica=28.3

Table 4. The Clusters Analysis of Dataset BreastCancer
1 CT, UCS, Class 1 Benign =162 Benign= Malignant=10 Malignant=5.8 2 Malignant =59 Malignant= Benign=2 Benign=3.3
2 UCSh, MAdh, Class 1 Malignant =59 Malignant= Benign=6 Benign=9.2 2 Benign=158 Benign= Malignant =10 Malignant=6.0
3 SECS, BNu, Class 1 Malignant =52 Malignant= Benign=7 Benign= Benign=157 Benign= Malignant =17 Malignant=9.8
4 BCh, NNuc, Class 1 Malignant =47 Malignant= Benign=4 Benign=7.8 2 Benign=160 Malignant =22 Benign=87.9 Malignant=

5 BCh, Mitoses, Class 1 Malignant =55 Malignant= Benign=7 Benign= Benign=157 Benign= Malignant =14 Malignant=8.2

Table 5. The Clusters Analysis of Dataset Diabetes
1 NTP, DPF, CLASS 1 Cat2=384 Cat2= Cat1=96 Cat1= Cat1=164 Cat1= Cat2=144 Cat2=
2 PGC, 2HIS, CLASS 1 Cat2=474 Cat2= Cat1=198 Cat1= Cat1=62 Cat1= Cat2=54 Cat2=
3 DBP, AGE, CLASS 1 Cat1=176 Cat1= Cat2=214 Cat2= Cat1=84 Cat1= Cat2=314 Cat2=
4 TSFT, BMI, CLASS 1 Cat2=330 Cat2= Cat1=86 Cat1= Cat1=174 Cat2=198 Cat1=46.8 Cat2=

Table 6. The Clusters Analysis of Dataset Sales
1 AverageSales, ForecastSales, Class 1 OK=10 OK= NotOK=8 NotOK= NotOK=6 OK=0 0.9 NotOK=100
2 CalculatedSeasonalIndex, AverageIndex, Class 1 OK=15 OK= NotOK=0 2 OK=9 OK=100 NotOK=0 1.3

Table 7. The Cluster Analysis of Dataset Iris
1 petal_length, petal_width, class 1 Iris-virginica=36 Iris-virginica= Iris-setosa=50 Iris-setosa=100 3 Iris-versicolor=26 Iris-versicolor=100 4 Iris-versicolor=24 Iris-virginica=14
2 sepal_length, sepal_width, class 1 Iris-versicolor=6 Iris-virginica=21 Iris-versicolor=63.2 Iris-virginica=36.8 Iris-versicolor=22.2 Iris-virginica= Iris-setosa=49 Iris-setosa= Iris-versicolor=25 Iris-virginica=7 Iris-setosa=1 4 Iris-versicolor=19 Iris-virginica=22 Iris-versicolor=75.8 Iris-virginica=21.2 Iris-setosa=3.0 Iris-versicolor=46.3 Iris-virginica=

Table 8. The Cluster Analysis of Dataset Iris
1 petal_length, petal_width, class 1 Iris-virginica=32 Iris-virginica= Iris-setosa=50 Iris-setosa= Iris-versicolor=23 Iris-versicolor= Iris-versicolor=19 Iris-virginica=18 Iris-versicolor=51.4 Iris-virginica= Iris-versicolor=8 Iris-versicolor=
2 sepal_length, sepal_width, class 1 Iris-versicolor=6 Iris-virginica=21 Iris-versicolor=22.2 Iris-virginica= Iris-setosa=17 Iris-setosa=100 3 Iris-versicolor=19 Iris-virginica=22 4 Iris-versicolor=25 Iris-virginica=7 Iris-versicolor=46.3 Iris-virginica=53.7 Iris-versicolor=78.1 Iris-virginica= Iris-setosa=33 Iris-setosa=100.0

Table 9. The Cluster Analysis of Dataset Iris
1 petal_length, petal_width, class 1 Iris-virginica=13 Iris-virginica=100 2 Iris-virginica=21 Iris-virginica=100 3 Iris-versicolor=8 Iris-versicolor=100 4 Iris-versicolor=10 Iris-versicolor=100 5 Iris-versicolor=5 Iris-virginica=15 Iris-versicolor=25.0 Iris-virginica= Iris-versicolor=12 Iris-versicolor= Iris-virginica=1 Iris-versicolor=1 Iris-virginica=50.0 Iris-versicolor= Iris-versicolor=7 Iris-versicolor=100 9 Iris-versicolor=7 Iris-versicolor= Iris-setosa=50 Iris-setosa=

2 sepal_length, sepal_width, class 1 Iris-virginica=10 Iris-virginica= Iris-versicolor=3 Iris-versicolor= Iris-virginica=11 Iris-virginica= Iris-versicolor=4 Iris-virginica=3 Iris-versicolor=57.1 Iris-virginica= Iris-setosa=28 Iris-setosa= Iris-virginica=4 Iris-versicolor=9 6 Iris-versicolor=3 Iris-virginica=6 Iris-virginica=30.8 Iris-versicolor=69.2 Iris-versicolor=33.3 Iris-virginica= Iris-setosa=21 Iris-setosa= Iris-versicolor=7 Iris-virginica=10 9 Iris-versicolor=5 Iris-virginica=1 Iris-setosa=1 10 Iris-versicolor=19 Iris-virginica=5 Iris-versicolor=41.2 Iris-virginica=58.8 Iris-versicolor=71.4 Iris-virginica=14.3 Iris-setosa=14.3 Iris-versicolor=79.2 Iris-virginica=

The explanation of Table 6 is: The second cluster of both partitions is less populated compared to the first cluster. The second partition is dominated by a single parameter, OK. The first cluster of the first partition has both parameters, OK and NotOK, but again the percentage value of OK is greater than that of NotOK. The second cluster of the first partition has only the single parameter NotOK. The MDL value of the first cluster of the first partition is high compared to the rest of the clusters. The conclusion is that the output of each cluster is dominated by the single parameter OK, except for the second cluster of the first partition. All the clusters except cluster 1 of partition 1 have almost the same minimum value of MDL; therefore the features and patterns of these three clusters are selected from the dataset Sales.

The explanation of Table 7 is: The first three clusters of the first partition are dominated by a single parameter, either Iris-virginica, Iris-setosa or Iris-versicolor, i.e. each cluster of this partition has a distinct parameter of the dataset. The fourth and last cluster of the first partition has two parameters, Iris-versicolor and Iris-virginica, but the percentage value of Iris-versicolor is high. The second cluster of the second partition has only Iris-setosa, a single parameter, and the remaining three clusters have the two parameters Iris-versicolor and Iris-virginica. We can say that each cluster is dominated by a distinct parameter of the dataset. The MDL value of the third cluster of the second partition is high and that of the third cluster of the first partition is low compared to the rest of the clusters. Therefore, cluster 3 of partition 1 is the selected feature and pattern of the dataset Iris.

Case 3: The analysis of clusters of the dataset Iris using the Datapoints approach.

The explanation of Table 8 is: Clusters 1, 2, 3 and 5 of the first partition are dominated by a single parameter, either Iris-virginica, Iris-setosa or Iris-versicolor, i.e. each cluster of this partition has a distinct parameter of the dataset. Cluster 4 of the first partition has two parameters, Iris-versicolor and Iris-virginica, but the percentage value of Iris-versicolor is high. Clusters 2 and 5 of the second partition have only Iris-setosa, a single parameter, and the remaining three clusters have the two parameters Iris-versicolor and Iris-virginica. We can say that each cluster is dominated by a distinct parameter of the dataset. The MDL value of cluster 4 of the second partition is high and that of cluster 5 of the first partition is low compared to the rest of the clusters. Therefore, cluster 5 of partition 1 is the selected feature and pattern of the dataset Iris.

Case 4: The analysis of clusters of the dataset Iris using the User-defined approach.
The explanation of Table 9 is: Clusters 1, 2, 3, 4, 6, 8, 9 and 10 of the first partition are dominated by a single parameter, either Iris-virginica, Iris-setosa or Iris-versicolor, i.e. each cluster of this partition has a distinct parameter of the dataset. Clusters 5 and 7 of the first partition have the two parameters Iris-versicolor and Iris-virginica. Clusters 2, 4 and 7 of the second partition have a single parameter and the remaining clusters have two parameters. We can say that each cluster is dominated by a distinct parameter of the dataset. The MDL value of cluster 9 of the second partition is high, while clusters 3, 8 and 9 of the first partition and clusters 5 and 8 of the second partition have low MDL values compared to the rest of the clusters. Therefore, the clusters with low MDL values are the selected features and patterns of the dataset Iris.

We also notice that increasing the number of clusters of a dataset makes the clusters more difficult to handle; therefore, only the required number of clusters should be used in order to extract the features and patterns from the given dataset. The conclusion from the above tables of the cluster analysis is that increasing the number of parameters and attributes of the dataset will increase the number of clusters and the number of partitions respectively, which will increase the domain of knowledge. Increasing the sample size will only increase the population of the clusters, which does not change the domain of knowledge extraction. We can say that the number of clusters K plays an important, vital and decisive role in pattern and feature extraction from a dataset.

6. CONCLUSION
The objectives set at the outset of this research paper are: first, to optimize the value of K, the number of clusters, in the K-means clustering algorithm and second, to provide the cluster analysis obtained by applying the optimization approaches to different datasets. For the first objective, different optimization methods and approaches are presented, namely the value of K specified by the user, through the value of the information criterion, through the parameters of the dataset, through the datapoints of a dataset and through the stepwise method. The first four methods are straightforward and computationally inexpensive; in these four methods, the value of K depends on the objects of a dataset, such as the number of parameters, attributes and datapoints. Although the last optimization approach, the stepwise method, is computationally expensive, the value of K it yields does not depend on the objects of a dataset. These methods are tested and validated on 6 different datasets and the recommended value of K is rational. For the second objective, the cluster analysis of 4 different datasets is presented, which helps the user to extract the patterns and features from the given dataset. We conclude that the value of K is a determinant factor for pattern and feature extraction from a dataset. If the value of K is correctly optimized then the results will be acceptable; otherwise there is a risk of deteriorated or no pattern and feature extraction at all. Further experiments are required to validate the value of K, and the optimization of K will remain an avenue for further research.

ACKNOWLEDGEMENT
The authors are thankful to The Islamia University of Bahawalpur, Pakistan for providing financial assistance to carry out this research activity under HEC project 6467/F II.

REFERENCES
[1] MacQueen, J.B., Some Methods for Classification and Analysis of Multivariate Observations, Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, University of California Press, 1:
[2] URL: US Census Bureau for the datasets Iris, BreastCancer and Sales, visited
[3] URL: National Institute of Diabetes and Digestive and Kidney Diseases, Pima Indians Diabetes Dataset, visited
[4] SPSS Clementine Data Mining System (Integral Solutions Limited, Basingstoke, Hampshire), User Guide Version 5,
[5] DataEngine 3.0 Intelligent Data Analysis an Easy Job, Management Intelligenter Technologien GmbH, Germany,
[6] Kerr, A., Hall, H. K., and Kozub, S., Doing Statistics with SPSS, (Sage, London),
[7] S-PLUS 6 for Windows Guide to Statistics, Vol.
2, Insightful Corporation, Seattle, Washington, 2001, man2.pdf, [8] Hardy, A., On the number of clusters, Comput. Statist. Data Analysis, 23, 83 96, [9] Theodoridis, S. and Koutroubas, K., Pattern Recognition, Academic Press, London, [10] Halkidi, M., Batistakis, Y., and Vazirgiannis, M., Cluster validity methods, Part I. SIGMOD Record, 31(2); available online [11] Bradley, S. and Fayyad, U. M., Refining initial points for K-means clustering, In Proceedings of the Fifteenth International Conference on Machine Learning (ICML 98) (Ed. J. Shavlik), Madison, Wisconsin, pp (Morgan Kaufmann, San Francisco, California), [12] Ishioka, T., Extended K-means with an efficient estimation of the number of clusters, In Proceedings of the Second International Conference on Intelligent Data Engineering and Automated Learning (IDEAL 2000), Hong Kong, PR China, pp , [13] Kanungo, T., Mount, D. M., Netanyahu, N., Piatko, C., Silverman, R., and Wu, A., The efficient K-means clustering algorithm: analysis and implementation, IEEE Trans. Pattern Analysis Mach. Intell., 24(7), , [14] Pelleg, D. and Moore, A., Accelerating exact K-means algorithms with geometric reasoning, In Proceedings of the Conference on Knowledge Discovery in Databases (KDD 99), San Diego, California, pp , [15] Pelleg, D. and Moore, A., X-means: extending K- means with efficient estimation of the number of clusters, In Proceedings of the 17th International Conference on Machine Learning (ICML 2000), Stanford, California, , [16] Kothari, R. and Pitts, D., On finding the number of clusters, Pattern Recognition Lett., 20, ,


More information

Clustering. Chapter 10 in Introduction to statistical learning

Clustering. Chapter 10 in Introduction to statistical learning Clustering Chapter 10 in Introduction to statistical learning 16 14 12 10 8 6 4 2 0 2 4 6 8 10 12 14 1 Clustering ² Clustering is the art of finding groups in data (Kaufman and Rousseeuw, 1990). ² What

More information

I How does the formulation (5) serve the purpose of the composite parameterization

I How does the formulation (5) serve the purpose of the composite parameterization Supplemental Material to Identifying Alzheimer s Disease-Related Brain Regions from Multi-Modality Neuroimaging Data using Sparse Composite Linear Discrimination Analysis I How does the formulation (5)

More information

ECLT 5810 Clustering

ECLT 5810 Clustering ECLT 5810 Clustering What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters Cluster analysis Grouping

More information

Clustering. CE-717: Machine Learning Sharif University of Technology Spring Soleymani

Clustering. CE-717: Machine Learning Sharif University of Technology Spring Soleymani Clustering CE-717: Machine Learning Sharif University of Technology Spring 2016 Soleymani Outline Clustering Definition Clustering main approaches Partitional (flat) Hierarchical Clustering validation

More information

Efficiency of k-means and K-Medoids Algorithms for Clustering Arbitrary Data Points

Efficiency of k-means and K-Medoids Algorithms for Clustering Arbitrary Data Points Efficiency of k-means and K-Medoids Algorithms for Clustering Arbitrary Data Points Dr. T. VELMURUGAN Associate professor, PG and Research Department of Computer Science, D.G.Vaishnav College, Chennai-600106,

More information

Monika Maharishi Dayanand University Rohtak

Monika Maharishi Dayanand University Rohtak Performance enhancement for Text Data Mining using k means clustering based genetic optimization (KMGO) Monika Maharishi Dayanand University Rohtak ABSTRACT For discovering hidden patterns and structures

More information

Mine Blood Donors Information through Improved K- Means Clustering Bondu Venkateswarlu 1 and Prof G.S.V.Prasad Raju 2

Mine Blood Donors Information through Improved K- Means Clustering Bondu Venkateswarlu 1 and Prof G.S.V.Prasad Raju 2 Mine Blood Donors Information through Improved K- Means Clustering Bondu Venkateswarlu 1 and Prof G.S.V.Prasad Raju 2 1 Department of Computer Science and Systems Engineering, Andhra University, Visakhapatnam-

More information

CHAPTER 4: CLUSTER ANALYSIS

CHAPTER 4: CLUSTER ANALYSIS CHAPTER 4: CLUSTER ANALYSIS WHAT IS CLUSTER ANALYSIS? A cluster is a collection of data-objects similar to one another within the same group & dissimilar to the objects in other groups. Cluster analysis

More information

Clustering and Visualisation of Data

Clustering and Visualisation of Data Clustering and Visualisation of Data Hiroshi Shimodaira January-March 28 Cluster analysis aims to partition a data set into meaningful or useful groups, based on distances between data points. In some

More information

Accelerating K-Means Clustering with Parallel Implementations and GPU computing

Accelerating K-Means Clustering with Parallel Implementations and GPU computing Accelerating K-Means Clustering with Parallel Implementations and GPU computing Janki Bhimani Electrical and Computer Engineering Dept. Northeastern University Boston, MA Email: bhimani@ece.neu.edu Miriam

More information

9.1. K-means Clustering

9.1. K-means Clustering 424 9. MIXTURE MODELS AND EM Section 9.2 Section 9.3 Section 9.4 view of mixture distributions in which the discrete latent variables can be interpreted as defining assignments of data points to specific

More information

A Classifier with the Function-based Decision Tree

A Classifier with the Function-based Decision Tree A Classifier with the Function-based Decision Tree Been-Chian Chien and Jung-Yi Lin Institute of Information Engineering I-Shou University, Kaohsiung 84008, Taiwan, R.O.C E-mail: cbc@isu.edu.tw, m893310m@isu.edu.tw

More information

Introduction to Artificial Intelligence

Introduction to Artificial Intelligence Introduction to Artificial Intelligence COMP307 Machine Learning 2: 3-K Techniques Yi Mei yi.mei@ecs.vuw.ac.nz 1 Outline K-Nearest Neighbour method Classification (Supervised learning) Basic NN (1-NN)

More information

Instance-Based Representations. k-nearest Neighbor. k-nearest Neighbor. k-nearest Neighbor. exemplars + distance measure. Challenges.

Instance-Based Representations. k-nearest Neighbor. k-nearest Neighbor. k-nearest Neighbor. exemplars + distance measure. Challenges. Instance-Based Representations exemplars + distance measure Challenges. algorithm: IB1 classify based on majority class of k nearest neighbors learned structure is not explicitly represented choosing k

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2009 CS 551, Spring 2009 c 2009, Selim Aksoy (Bilkent University)

More information

Data Mining: Models and Methods

Data Mining: Models and Methods Data Mining: Models and Methods Author, Kirill Goltsman A White Paper July 2017 --------------------------------------------------- www.datascience.foundation Copyright 2016-2017 What is Data Mining? Data

More information

Expectation Maximization (EM) and Gaussian Mixture Models

Expectation Maximization (EM) and Gaussian Mixture Models Expectation Maximization (EM) and Gaussian Mixture Models Reference: The Elements of Statistical Learning, by T. Hastie, R. Tibshirani, J. Friedman, Springer 1 2 3 4 5 6 7 8 Unsupervised Learning Motivation

More information

Fast Efficient Clustering Algorithm for Balanced Data

Fast Efficient Clustering Algorithm for Balanced Data Vol. 5, No. 6, 214 Fast Efficient Clustering Algorithm for Balanced Data Adel A. Sewisy Faculty of Computer and Information, Assiut University M. H. Marghny Faculty of Computer and Information, Assiut

More information

A Study on Clustering Method by Self-Organizing Map and Information Criteria

A Study on Clustering Method by Self-Organizing Map and Information Criteria A Study on Clustering Method by Self-Organizing Map and Information Criteria Satoru Kato, Tadashi Horiuchi,andYoshioItoh Matsue College of Technology, 4-4 Nishi-ikuma, Matsue, Shimane 90-88, JAPAN, kato@matsue-ct.ac.jp

More information

SSV Criterion Based Discretization for Naive Bayes Classifiers

SSV Criterion Based Discretization for Naive Bayes Classifiers SSV Criterion Based Discretization for Naive Bayes Classifiers Krzysztof Grąbczewski kgrabcze@phys.uni.torun.pl Department of Informatics, Nicolaus Copernicus University, ul. Grudziądzka 5, 87-100 Toruń,

More information

ECLT 5810 Clustering

ECLT 5810 Clustering ECLT 5810 Clustering What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters Cluster analysis Grouping

More information

Comparative Study Of Different Data Mining Techniques : A Review

Comparative Study Of Different Data Mining Techniques : A Review Volume II, Issue IV, APRIL 13 IJLTEMAS ISSN 7-5 Comparative Study Of Different Data Mining Techniques : A Review Sudhir Singh Deptt of Computer Science & Applications M.D. University Rohtak, Haryana sudhirsingh@yahoo.com

More information

Basic Concepts Weka Workbench and its terminology

Basic Concepts Weka Workbench and its terminology Changelog: 14 Oct, 30 Oct Basic Concepts Weka Workbench and its terminology Lecture Part Outline Concepts, instances, attributes How to prepare the input: ARFF, attributes, missing values, getting to know

More information

Performance Measure of Hard c-means,fuzzy c-means and Alternative c-means Algorithms

Performance Measure of Hard c-means,fuzzy c-means and Alternative c-means Algorithms Performance Measure of Hard c-means,fuzzy c-means and Alternative c-means Algorithms Binoda Nand Prasad*, Mohit Rathore**, Geeta Gupta***, Tarandeep Singh**** *Guru Gobind Singh Indraprastha University,

More information

Analyzing Outlier Detection Techniques with Hybrid Method

Analyzing Outlier Detection Techniques with Hybrid Method Analyzing Outlier Detection Techniques with Hybrid Method Shruti Aggarwal Assistant Professor Department of Computer Science and Engineering Sri Guru Granth Sahib World University. (SGGSWU) Fatehgarh Sahib,

More information

A Comparative Study of Data Mining Process Models (KDD, CRISP-DM and SEMMA)

A Comparative Study of Data Mining Process Models (KDD, CRISP-DM and SEMMA) International Journal of Innovation and Scientific Research ISSN 2351-8014 Vol. 12 No. 1 Nov. 2014, pp. 217-222 2014 Innovative Space of Scientific Research Journals http://www.ijisr.issr-journals.org/

More information

MultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A

MultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A MultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A. 205-206 Pietro Guccione, PhD DEI - DIPARTIMENTO DI INGEGNERIA ELETTRICA E DELL INFORMAZIONE POLITECNICO DI BARI

More information

Machine Learning (BSMC-GA 4439) Wenke Liu

Machine Learning (BSMC-GA 4439) Wenke Liu Machine Learning (BSMC-GA 4439) Wenke Liu 01-25-2018 Outline Background Defining proximity Clustering methods Determining number of clusters Other approaches Cluster analysis as unsupervised Learning Unsupervised

More information

Improved Performance of Unsupervised Method by Renovated K-Means

Improved Performance of Unsupervised Method by Renovated K-Means Improved Performance of Unsupervised Method by Renovated P.Ashok Research Scholar, Bharathiar University, Coimbatore Tamilnadu, India. ashokcutee@gmail.com Dr.G.M Kadhar Nawaz Department of Computer Application

More information

Rank Measures for Ordering

Rank Measures for Ordering Rank Measures for Ordering Jin Huang and Charles X. Ling Department of Computer Science The University of Western Ontario London, Ontario, Canada N6A 5B7 email: fjhuang33, clingg@csd.uwo.ca Abstract. Many

More information

An Enhanced K-Medoid Clustering Algorithm

An Enhanced K-Medoid Clustering Algorithm An Enhanced Clustering Algorithm Archna Kumari Science &Engineering kumara.archana14@gmail.com Pramod S. Nair Science &Engineering, pramodsnair@yahoo.com Sheetal Kumrawat Science &Engineering, sheetal2692@gmail.com

More information

Cluster Analysis. CSE634 Data Mining

Cluster Analysis. CSE634 Data Mining Cluster Analysis CSE634 Data Mining Agenda Introduction Clustering Requirements Data Representation Partitioning Methods K-Means Clustering K-Medoids Clustering Constrained K-Means clustering Introduction

More information

NORMALIZATION INDEXING BASED ENHANCED GROUPING K-MEAN ALGORITHM

NORMALIZATION INDEXING BASED ENHANCED GROUPING K-MEAN ALGORITHM NORMALIZATION INDEXING BASED ENHANCED GROUPING K-MEAN ALGORITHM Saroj 1, Ms. Kavita2 1 Student of Masters of Technology, 2 Assistant Professor Department of Computer Science and Engineering JCDM college

More information

INITIALIZING CENTROIDS FOR K-MEANS ALGORITHM AN ALTERNATIVE APPROACH

INITIALIZING CENTROIDS FOR K-MEANS ALGORITHM AN ALTERNATIVE APPROACH Volume 118 No. 18 2018, 1565-1570 ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version) url: http://www.ijpam.eu ijpam.eu INITIALIZING CENTROIDS FOR K-MEANS ALGORITHM AN ALTERNATIVE APPROACH

More information

A PSO-based Generic Classifier Design and Weka Implementation Study

A PSO-based Generic Classifier Design and Weka Implementation Study International Forum on Mechanical, Control and Automation (IFMCA 16) A PSO-based Generic Classifier Design and Weka Implementation Study Hui HU1, a Xiaodong MAO1, b Qin XI1, c 1 School of Economics and

More information

Nearest Clustering Algorithm for Satellite Image Classification in Remote Sensing Applications

Nearest Clustering Algorithm for Satellite Image Classification in Remote Sensing Applications Nearest Clustering Algorithm for Satellite Image Classification in Remote Sensing Applications Anil K Goswami 1, Swati Sharma 2, Praveen Kumar 3 1 DRDO, New Delhi, India 2 PDM College of Engineering for

More information

Chapter 8 The C 4.5*stat algorithm

Chapter 8 The C 4.5*stat algorithm 109 The C 4.5*stat algorithm This chapter explains a new algorithm namely C 4.5*stat for numeric data sets. It is a variant of the C 4.5 algorithm and it uses variance instead of information gain for the

More information

Machine Learning. Unsupervised Learning. Manfred Huber

Machine Learning. Unsupervised Learning. Manfred Huber Machine Learning Unsupervised Learning Manfred Huber 2015 1 Unsupervised Learning In supervised learning the training data provides desired target output for learning In unsupervised learning the training

More information

Data Clustering Hierarchical Clustering, Density based clustering Grid based clustering

Data Clustering Hierarchical Clustering, Density based clustering Grid based clustering Data Clustering Hierarchical Clustering, Density based clustering Grid based clustering Team 2 Prof. Anita Wasilewska CSE 634 Data Mining All Sources Used for the Presentation Olson CF. Parallel algorithms

More information

Solution Sketches Midterm Exam COSC 6342 Machine Learning March 20, 2013

Solution Sketches Midterm Exam COSC 6342 Machine Learning March 20, 2013 Your Name: Your student id: Solution Sketches Midterm Exam COSC 6342 Machine Learning March 20, 2013 Problem 1 [5+?]: Hypothesis Classes Problem 2 [8]: Losses and Risks Problem 3 [11]: Model Generation

More information

Unsupervised Learning. Presenter: Anil Sharma, PhD Scholar, IIIT-Delhi

Unsupervised Learning. Presenter: Anil Sharma, PhD Scholar, IIIT-Delhi Unsupervised Learning Presenter: Anil Sharma, PhD Scholar, IIIT-Delhi Content Motivation Introduction Applications Types of clustering Clustering criterion functions Distance functions Normalization Which

More information

AN IMPROVED K-MEANS CLUSTERING ALGORITHM FOR IMAGE SEGMENTATION

AN IMPROVED K-MEANS CLUSTERING ALGORITHM FOR IMAGE SEGMENTATION AN IMPROVED K-MEANS CLUSTERING ALGORITHM FOR IMAGE SEGMENTATION WILLIAM ROBSON SCHWARTZ University of Maryland, Department of Computer Science College Park, MD, USA, 20742-327, schwartz@cs.umd.edu RICARDO

More information

Unsupervised Learning

Unsupervised Learning Unsupervised Learning Unsupervised learning Until now, we have assumed our training samples are labeled by their category membership. Methods that use labeled samples are said to be supervised. However,

More information

Comparing and Selecting Appropriate Measuring Parameters for K-means Clustering Technique

Comparing and Selecting Appropriate Measuring Parameters for K-means Clustering Technique International Journal of Soft Computing and Engineering (IJSCE) Comparing and Selecting Appropriate Measuring Parameters for K-means Clustering Technique Shreya Jain, Samta Gajbhiye Abstract Clustering

More information

A Review of K-mean Algorithm

A Review of K-mean Algorithm A Review of K-mean Algorithm Jyoti Yadav #1, Monika Sharma *2 1 PG Student, CSE Department, M.D.U Rohtak, Haryana, India 2 Assistant Professor, IT Department, M.D.U Rohtak, Haryana, India Abstract Cluster

More information

Gene Clustering & Classification

Gene Clustering & Classification BINF, Introduction to Computational Biology Gene Clustering & Classification Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Introduction to Gene Clustering

More information

Introduction to Computer Science

Introduction to Computer Science DM534 Introduction to Computer Science Clustering and Feature Spaces Richard Roettger: About Me Computer Science (Technical University of Munich and thesis at the ICSI at the University of California at

More information

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany Syllabus Fri. 27.10. (1) 0. Introduction A. Supervised Learning: Linear Models & Fundamentals Fri. 3.11. (2) A.1 Linear Regression Fri. 10.11. (3) A.2 Linear Classification Fri. 17.11. (4) A.3 Regularization

More information

On Sample Weighted Clustering Algorithm using Euclidean and Mahalanobis Distances

On Sample Weighted Clustering Algorithm using Euclidean and Mahalanobis Distances International Journal of Statistics and Systems ISSN 0973-2675 Volume 12, Number 3 (2017), pp. 421-430 Research India Publications http://www.ripublication.com On Sample Weighted Clustering Algorithm using

More information

2 Computation with Floating-Point Numbers

2 Computation with Floating-Point Numbers 2 Computation with Floating-Point Numbers 2.1 Floating-Point Representation The notion of real numbers in mathematics is convenient for hand computations and formula manipulations. However, real numbers

More information

University of Florida CISE department Gator Engineering. Clustering Part 2

University of Florida CISE department Gator Engineering. Clustering Part 2 Clustering Part 2 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville Partitional Clustering Original Points A Partitional Clustering Hierarchical

More information

Motivation. Technical Background

Motivation. Technical Background Handling Outliers through Agglomerative Clustering with Full Model Maximum Likelihood Estimation, with Application to Flow Cytometry Mark Gordon, Justin Li, Kevin Matzen, Bryce Wiedenbeck Motivation Clustering

More information

Data Mining - Data. Dr. Jean-Michel RICHER Dr. Jean-Michel RICHER Data Mining - Data 1 / 47

Data Mining - Data. Dr. Jean-Michel RICHER Dr. Jean-Michel RICHER Data Mining - Data 1 / 47 Data Mining - Data Dr. Jean-Michel RICHER 2018 jean-michel.richer@univ-angers.fr Dr. Jean-Michel RICHER Data Mining - Data 1 / 47 Outline 1. Introduction 2. Data preprocessing 3. CPA with R 4. Exercise

More information

Selection of n in K-Means Algorithm

Selection of n in K-Means Algorithm International Journal of Information & Computation Technology. ISSN 0974-2239 Volume 4, Number 6 (2014), pp. 577-582 International Research Publications House http://www. irphouse.com Selection of n in

More information

Machine Learning Chapter 2. Input

Machine Learning Chapter 2. Input Machine Learning Chapter 2. Input 2 Input: Concepts, instances, attributes Terminology What s a concept? Classification, association, clustering, numeric prediction What s in an example? Relations, flat

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2008 CS 551, Spring 2008 c 2008, Selim Aksoy (Bilkent University)

More information

Datasets Size: Effect on Clustering Results

Datasets Size: Effect on Clustering Results 1 Datasets Size: Effect on Clustering Results Adeleke Ajiboye 1, Ruzaini Abdullah Arshah 2, Hongwu Qin 3 Faculty of Computer Systems and Software Engineering Universiti Malaysia Pahang 1 {ajibraheem@live.com}

More information

A Genetic Algorithm Approach for Clustering

A Genetic Algorithm Approach for Clustering www.ijecs.in International Journal Of Engineering And Computer Science ISSN:2319-7242 Volume 3 Issue 6 June, 2014 Page No. 6442-6447 A Genetic Algorithm Approach for Clustering Mamta Mor 1, Poonam Gupta

More information

KEYWORDS: Clustering, RFPCM Algorithm, Ranking Method, Query Redirection Method.

KEYWORDS: Clustering, RFPCM Algorithm, Ranking Method, Query Redirection Method. IJESRT INTERNATIONAL JOURNAL OF ENGINEERING SCIENCES & RESEARCH TECHNOLOGY IMPROVED ROUGH FUZZY POSSIBILISTIC C-MEANS (RFPCM) CLUSTERING ALGORITHM FOR MARKET DATA T.Buvana*, Dr.P.krishnakumari *Research

More information

Mining di Dati Web. Lezione 3 - Clustering and Classification

Mining di Dati Web. Lezione 3 - Clustering and Classification Mining di Dati Web Lezione 3 - Clustering and Classification Introduction Clustering and classification are both learning techniques They learn functions describing data Clustering is also known as Unsupervised

More information

International Journal of Computer Engineering and Applications, Volume VIII, Issue III, Part I, December 14

International Journal of Computer Engineering and Applications, Volume VIII, Issue III, Part I, December 14 International Journal of Computer Engineering and Applications, Volume VIII, Issue III, Part I, December 14 DESIGN OF AN EFFICIENT DATA ANALYSIS CLUSTERING ALGORITHM Dr. Dilbag Singh 1, Ms. Priyanka 2

More information

ADaM version 4.0 (Eagle) Tutorial Information Technology and Systems Center University of Alabama in Huntsville

ADaM version 4.0 (Eagle) Tutorial Information Technology and Systems Center University of Alabama in Huntsville ADaM version 4.0 (Eagle) Tutorial Information Technology and Systems Center University of Alabama in Huntsville Tutorial Outline Overview of the Mining System Architecture Data Formats Components Using

More information

Machine Learning (BSMC-GA 4439) Wenke Liu

Machine Learning (BSMC-GA 4439) Wenke Liu Machine Learning (BSMC-GA 4439) Wenke Liu 01-31-017 Outline Background Defining proximity Clustering methods Determining number of clusters Comparing two solutions Cluster analysis as unsupervised Learning

More information

Unsupervised learning on Color Images

Unsupervised learning on Color Images Unsupervised learning on Color Images Sindhuja Vakkalagadda 1, Prasanthi Dhavala 2 1 Computer Science and Systems Engineering, Andhra University, AP, India 2 Computer Science and Systems Engineering, Andhra

More information

Cluster quality 15. Running time 0.7. Distance between estimated and true means Running time [s]

Cluster quality 15. Running time 0.7. Distance between estimated and true means Running time [s] Fast, single-pass K-means algorithms Fredrik Farnstrom Computer Science and Engineering Lund Institute of Technology, Sweden arnstrom@ucsd.edu James Lewis Computer Science and Engineering University of

More information

Comparision between Quad tree based K-Means and EM Algorithm for Fault Prediction

Comparision between Quad tree based K-Means and EM Algorithm for Fault Prediction Comparision between Quad tree based K-Means and EM Algorithm for Fault Prediction Swapna M. Patil Dept.Of Computer science and Engineering,Walchand Institute Of Technology,Solapur,413006 R.V.Argiddi Assistant

More information

Fuzzy Ant Clustering by Centroid Positioning

Fuzzy Ant Clustering by Centroid Positioning Fuzzy Ant Clustering by Centroid Positioning Parag M. Kanade and Lawrence O. Hall Computer Science & Engineering Dept University of South Florida, Tampa FL 33620 @csee.usf.edu Abstract We

More information

Normalization based K means Clustering Algorithm

Normalization based K means Clustering Algorithm Normalization based K means Clustering Algorithm Deepali Virmani 1,Shweta Taneja 2,Geetika Malhotra 3 1 Department of Computer Science,Bhagwan Parshuram Institute of Technology,New Delhi Email:deepalivirmani@gmail.com

More information

Dimension reduction : PCA and Clustering

Dimension reduction : PCA and Clustering Dimension reduction : PCA and Clustering By Hanne Jarmer Slides by Christopher Workman Center for Biological Sequence Analysis DTU The DNA Array Analysis Pipeline Array design Probe design Question Experimental

More information

Introduction to Mobile Robotics

Introduction to Mobile Robotics Introduction to Mobile Robotics Clustering Wolfram Burgard Cyrill Stachniss Giorgio Grisetti Maren Bennewitz Christian Plagemann Clustering (1) Common technique for statistical data analysis (machine learning,

More information