A k-means Clustering Algorithm on Numeric Data


Volume 117 No. 7 2017, 157-164. ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version). url: http://www.ijpam.eu

P. Praveen 1, B. Rama 2

1. Assistant Professor in Department of CSE, S R Engineering, Warangal, Telangana
2. Assistant Professor in Department of CS, Kakatiya University, Warangal, Telangana

prawin1731@gmail.com, rama.abbidi@gmail.com

Abstract - Data clustering is the process of grouping data elements based on some aspect of the relationship between the elements in the group. Clustering has many applications, such as data compression, data mining, pattern recognition and machine learning, and there are many different clustering methods. In this paper we examine the K-means method of clustering and how the selection of the initial seeds used to divide the data into clusters affects the result. We also survey the main clustering algorithms and the problems of splitting clusters in K-means clustering using initial seed points.

Keywords - Clustering, Data Mining, Distance measure, Hierarchical clustering, K-means clustering.

I INTRODUCTION

Data Mining

There is an immense amount of data available in the information industry. This data is of no use until it is converted into useful information. It is necessary to analyze this huge amount of data and extract useful information from it [9]. Extraction of information is not the only process we need to perform; data mining also involves other processes, such as Data Cleaning, Data Integration, Data Transformation, Data Mining, Pattern Evaluation and Data Presentation. Once all these processes are complete, we can use this information in many applications, such as Fraud Detection, Market Analysis, Production Control, Science Exploration, and so on. Data mining is defined as extracting information from huge sets of data.
In other words, data mining is the process of mining knowledge from data [8].

a. Association

One of the well-known techniques of data mining is association rules, which are used to find relationships or associations between items. The problem of finding relations between items is often termed market basket analysis. In this problem the presence of items within baskets is identified so that customers' buying habits can be analysed. The technique is used in inventory management, sales promotion, etc. [3][7]. The discovery of association rules depends primarily on finding the frequent sets, which can require multiple passes through the database. The algorithms aim to reduce the number of passes by generating a candidate set that should turn out to contain the frequent sets. Many different algorithms have been designed to find association rules; they differ in how they handle candidate sets and how they reduce the number of scans of the database. Some recent association rule mining algorithms do not create a candidate set at all. In practice the frequent sets generated are very large in number, and this can be constrained by selecting only those items in which the user is interested [10]. Consider a set of items and a transaction database, which is a set of transactions. An association rule for a transaction database takes the form X => Y, where X and Y are sets of items called item sets. There are two important basic measures for association rules, support (s) and confidence (c). Since the database is large, users are concerned with only those.

b. Classification

In general, data classification is a two-step process. In the first step, known as the learning step, a model that describes a predetermined set of classes or concepts is constructed by analyzing a set of training data instances. Every instance is assumed to belong to a predefined class.
In the second step, the model is tested using a different data set, which is used to estimate the classification accuracy of the model. If the accuracy of the model is considered acceptable, the model can be used to classify future data instances for which the class label is not known [5]. In the end, the model acts as a classifier in the decision-making process. Several techniques can be used for classification, such as decision trees, Bayesian methods, rule-based algorithms, and neural networks. Decision tree classifiers are quite popular because the construction of the tree does not require any domain expert knowledge or parameter setting, and is appropriate for exploratory knowledge discovery [6]. A decision tree can produce a model with rules that are human-readable and interpretable. Decision trees have the advantage of easy interpretation and understanding, so decision makers can compare the model with their domain knowledge for validation and justify their decisions [4].

c. Clustering

Clustering can be considered the most important unsupervised learning problem; like every other problem of this kind, it deals with finding structure in a collection of unlabeled data. A loose definition of clustering could be "the process of organizing objects into groups whose members are similar in some way" [3]. A cluster is therefore a collection of objects that are similar to one another and dissimilar to the objects belonging to other clusters. We can show this with a simple graphical example:

Figure 1: Example of Clustering

In this case we easily identify the four clusters into which the data can be divided; the similarity criterion is distance: two or more objects belong to the same cluster if they are close according to a given distance (in this case geometrical distance). This is called distance-based clustering [9]. Another kind of clustering is conceptual clustering: two or more objects belong to the same cluster if it defines a concept common to all those objects.
In other words, objects are grouped according to their fit to descriptive concepts, not according to simple similarity measures [2].

Cluster Analysis Types:

Data matrix: for n objects and p variables (attributes), for example n persons described by p attributes such as age, height and gender, a relational table in the form of an n-by-p matrix is generated [1][2].

Dissimilarity matrix: the proximities available for all pairs of the n objects are stored in an n-by-n matrix [2]. Here d(i, j) is the measured difference or dissimilarity between objects i and j, with d(i, j) = d(j, i), e.g. d(2, 1) = d(1, 2), and d(i, i) = 0, that is, d(1, 1) = d(2, 2) = ... = d(n, n) = 0.

To find the distance d(i, j) between i and j, the most popular distance measure is the Euclidean distance:

$\lVert (u_1, v_1) - (u_2, v_2) \rVert_2 = \left( (u_2 - u_1)^2 + (v_2 - v_1)^2 \right)^{1/2}$

For instance, if the distance between P1 and P2 is 5, then d(P1, P2) = 5.
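As a minimal sketch of the dissimilarity matrix and the Euclidean distance described above, the following Python builds the n-by-n matrix of pairwise distances. The function names `euclidean` and `dissimilarity_matrix` and the sample points are illustrative assumptions, not part of the paper.

```python
import math

def euclidean(p, q):
    """Euclidean (l2) distance between two points of equal dimension."""
    return math.sqrt(sum((b - a) ** 2 for a, b in zip(p, q)))

def dissimilarity_matrix(points):
    """Return the n-by-n matrix D with D[i][j] = d(i, j)."""
    n = len(points)
    return [[euclidean(points[i], points[j]) for j in range(n)] for i in range(n)]

# Hypothetical sample objects, each with two numeric attributes.
points = [(2, 10), (2, 5), (8, 4)]
D = dissimilarity_matrix(points)
```

By construction the matrix is symmetric with a zero diagonal, so in practice only the lower triangle needs to be stored.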

Example: for P1 = (2, 10) and P2 = (2, 5), $d(P_1, P_2) = \sqrt{(2 - 2)^2 + (5 - 10)^2} = \sqrt{0 + 25} = 5$.

Clustering algorithms are mainly divided into two groups: partitional and hierarchical. A partitional method distributes the data set into mutually exclusive groups [18]; for example, the data are grouped into a pre-determined number k of clusters [1] by minimizing

$\min \sum_{j=1}^{k} \sum_{d \in C_j} \lVert d - x_j \rVert^{q}$

where $x_j$ is the center of the j-th cluster, $C_j$ is the set of data points d, out of the n elements of the data set, that are nearest to $x_j$, and the integer q defines the nature of the distance function (q = 2 for a data set of numeric values).

II CLUSTERING ALGORITHMS

a. Partitioning-based

In such algorithms, all clusters are determined at once. Initial groups are specified and reallocated towards a union. In other words, partitioning algorithms divide data objects into a number of partitions, where each partition represents a cluster. These clusters should fulfil the following requirements: (1) each cluster must contain at least one object, and (2) each object must belong to exactly one cluster. In the K-means algorithm, for example, the center is the average of all points, with coordinates representing the arithmetic mean. In the K-medoids algorithm, objects that are near the center represent the clusters. There are many other partitioning algorithms, such as K-modes, PAM, CLARA, CLARANS and FCM [7].

Algorithm K-means

K is the number of clusters required in the final result.

Step 1: Partition the objects (randomly) into K non-empty subsets.
Step 2: Compute the seed points as the centroids of the clusters of the current partition.
Step 3: Assign each object to the cluster with the nearest seed point.
Step 4: Go back to Step 2; stop when the assignment no longer changes.

b. Hierarchical Clustering Algorithms

Hierarchical clustering is divided into two main types: agglomerative and divisive.

1. Agglomerative clustering: It is also known as AGNES (Agglomerative Nesting). It works in a bottom-up manner.
That is, each object is initially considered a single-element cluster (leaf). At each step of the algorithm, the two clusters that are the most similar are combined into a new, bigger cluster (node). This procedure is iterated until all points are members of one single big cluster (root) (see figure below). The result is a tree that can be plotted as a dendrogram [2].

2. Divisive clustering: It is also known as DIANA (Divisive Analysis) and it works in a top-down manner [6]. The algorithm is the inverse of AGNES. It begins with the root, in which all objects are included in a single cluster. At each step of the iteration, the most heterogeneous cluster is divided into two. The process is iterated until every object is in its own cluster (see figure below).
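The bottom-up (AGNES-style) procedure can be sketched in a few lines of Python. This is a naive illustration under stated assumptions, not the paper's implementation: the names `agnes` and `single_link` are hypothetical, similarity is taken to be the smallest pairwise Euclidean distance between two clusters (one possible choice among several), and the sample points are made up.

```python
import math

def single_link(c1, c2):
    """Smallest pairwise Euclidean distance between two clusters."""
    return min(math.dist(p, q) for p in c1 for q in c2)

def agnes(points, linkage=single_link):
    """Naive bottom-up clustering; returns the merge history (the dendrogram)."""
    clusters = [[p] for p in points]  # every object starts as a leaf
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters that are most similar (closest).
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: linkage(clusters[ab[0]], clusters[ab[1]]),
        )
        merges.append((clusters[i], clusters[j], linkage(clusters[i], clusters[j])))
        merged = clusters[i] + clusters[j]
        # Replace the two merged clusters by their union (a new node).
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# Hypothetical data: two tight pairs of points that lie far apart.
history = agnes([(0, 0), (0, 1), (10, 0), (10, 1)])
```

Each entry of `history` records the two clusters merged and the distance at which they were joined, which is the information a dendrogram plots. The search over all cluster pairs at every step makes this sketch suitable only for small data sets.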

Figure 2: Agglomerative Clustering

As we learned for k-means, we measure the (dis)similarity of observations using distance measures (e.g. Euclidean distance, Manhattan distance, etc.). In R, the Euclidean distance is used by default to measure the dissimilarity between each pair of observations. As we already know, it is easy to compute the dissimilarity measure between two pairs of observations with the get_dist function. However, a bigger question is: how do we measure the dissimilarity between two clusters of observations? A number of different cluster agglomeration methods (i.e. linkage methods) have been developed to answer this question. The most common methods are:

Maximum or complete linkage clustering: It computes all pairwise dissimilarities between the elements in cluster 1 and the elements in cluster 2, and considers the largest value (i.e. the maximum value) of these dissimilarities as the distance between the two clusters. It tends to produce more compact clusters.

Minimum or single linkage clustering: It computes all pairwise dissimilarities between the elements in cluster 1 and the elements in cluster 2, and considers the smallest of these dissimilarities as a linkage criterion. It tends to produce long, loose clusters.

Mean or average linkage clustering: It computes all pairwise dissimilarities between the elements in cluster 1 and the elements in cluster 2, and considers the average of these dissimilarities as the distance between the two clusters.

Centroid linkage clustering: It computes the dissimilarity between the centroid for cluster 1 (a mean vector of length p variables) and the centroid for cluster 2.

Ward's minimum variance method: It minimizes the total within-cluster variance.
At each step the pair of clusters with minimum between-cluster distance are merged.

c. Density-based

Here, data objects are separated based on their regions of density, connectivity and boundary. They are closely related to point-nearest neighbours. A cluster, defined as a connected dense component, grows in any direction that density leads to. Therefore, density-based algorithms are capable of discovering clusters of arbitrary shape. This also provides a natural protection against outliers. Thus the overall density of a point is analyzed to determine the functions of datasets that influence a particular data point. DBSCAN, OPTICS and DENCLUE are algorithms that use such a method to filter out noise (outliers) and discover clusters of arbitrary shape [10].

d. Grid-based

The data space is divided into grids. The main advantage of this approach is its fast processing time, because it goes through the dataset once to compute the statistical values for the grids. The accumulated grid data make grid-based clustering techniques independent of the number of data objects: they employ a uniform grid to collect regional statistical data and then perform the clustering on the grid, instead of on the data directly [5]. The performance of a grid-based method depends on the size of the grid, which is usually much less than the size of the data. However, for highly irregular data distributions, a single uniform grid may not be sufficient to obtain the required clustering quality or fulfil the time requirement. Wave-Cluster and STING are typical examples of this category [9][4].
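The grid-based idea can be illustrated with a deliberately simplified Python sketch. It is neither STING nor Wave-Cluster: the function name `grid_cluster` and the parameters `cell_size` and `min_count` are assumptions made for the example. The data are scanned once to count points per cell, and the clustering itself is then performed on the cells by joining neighbouring dense cells.

```python
from collections import defaultdict

def grid_cluster(points, cell_size, min_count):
    """Cluster 2-D points by grouping adjacent dense grid cells."""
    # Single pass over the data: collect the points falling in each cell.
    cells = defaultdict(list)
    for x, y in points:
        cells[(int(x // cell_size), int(y // cell_size))].append((x, y))
    dense = {c for c, members in cells.items() if len(members) >= min_count}

    # Clustering happens on the grid: flood-fill over neighbouring dense cells.
    clusters, seen = [], set()
    for start in dense:
        if start in seen:
            continue
        seen.add(start)
        stack, group = [start], []
        while stack:
            cx, cy = stack.pop()
            group.extend(cells[(cx, cy)])
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    nb = (cx + dx, cy + dy)
                    if nb in dense and nb not in seen:
                        seen.add(nb)
                        stack.append(nb)
        clusters.append(sorted(group))
    return sorted(clusters)

# Hypothetical data: two dense regions and one isolated point.
sample = [(0.1, 0.2), (0.4, 0.3), (1.2, 0.5), (1.4, 0.1),
          (8.0, 8.1), (8.3, 8.4), (4.0, 4.0)]
result = grid_cluster(sample, cell_size=1.0, min_count=2)
```

Points in cells that hold fewer than `min_count` objects are left unassigned, and the outcome depends strongly on the chosen cell size, which mirrors the sensitivity to grid size noted above.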


III EXPERIMENTAL APPROACH OF K-MEANS

K-Means: Step-By-Step Example

As a simple illustration of a k-means algorithm, consider the following data set consisting of the scores of two variables on each of seven individuals. This data set is to be grouped into two clusters. As a first step in finding a sensible initial partition, let the A and B values of the two individuals furthest apart (using the Euclidean distance measure) define the initial cluster means, giving:

Table 1: Initial Data set

The remaining individuals are now examined in sequence and allocated to the cluster to which they are closest, in terms of Euclidean distance to the cluster mean. The mean vector is recalculated each time a new member is added. This leads to the following series of steps:

Now the initial partition has changed, and the two clusters at this stage have the following characteristics:

But we cannot yet be sure that each individual has been assigned to the right cluster. So, we compare each individual's distance to its own cluster mean and to that of the opposite cluster, and we find:
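The seeding, sequential allocation and comparison steps described so far can be sketched in Python. The seven (A, B) scores below are assumed values chosen for illustration; the sketch does not claim they are the values of Table 1. The names `data`, `mean`, `clusters` and `misplaced` are likewise hypothetical.

```python
import math

# Assumed scores (A, B) for individuals 1..7; illustrative values only.
data = {1: (1.0, 1.0), 2: (1.5, 2.0), 3: (3.0, 4.0), 4: (5.0, 7.0),
        5: (3.5, 5.0), 6: (4.5, 5.0), 7: (3.5, 4.5)}

def mean(ids):
    """Mean vector of the individuals listed in ids."""
    return tuple(sum(data[i][k] for i in ids) / len(ids) for k in range(2))

# Initial cluster means: the two individuals furthest apart.
seed1, seed2 = max(((i, j) for i in data for j in data if i < j),
                   key=lambda ij: math.dist(data[ij[0]], data[ij[1]]))
clusters = [[seed1], [seed2]]

# Allocate the remaining individuals in sequence; the cluster mean is
# recomputed from the current members each time a distance is needed.
for i in data:
    if i in (seed1, seed2):
        continue
    dists = [math.dist(data[i], mean(c)) for c in clusters]
    clusters[dists.index(min(dists))].append(i)

# Compare each individual's distance to its own mean and to the other mean.
means = [mean(c) for c in clusters]
misplaced = [i for k, c in enumerate(clusters) for i in c
             if math.dist(data[i], means[1 - k]) < math.dist(data[i], means[k])]
```

With these assumed scores, `clusters` ends up as `[[1, 2, 3], [4, 5, 6, 7]]` and `misplaced` contains only individual 3.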

Only individual 3 is nearer to the mean of the opposite cluster (Cluster 2) than its own (Cluster 1). In other words, each individual's distance to its own cluster mean should be smaller than the distance to the other cluster's mean (which is not the case with individual 3). Thus, individual 3 is relocated to Cluster 2, resulting in the new partition:

The iterative relocation would now continue from this new partition until no more relocations occur. However, in this example each individual is now nearer its own cluster mean than that of the other cluster, so the iteration stops, choosing the latest partitioning as the final cluster solution. It is also possible that the k-means algorithm won't find a final solution. In this case it would be a good idea to consider stopping the algorithm after a pre-chosen maximum number of iterations.

IV CONCLUSION

In this paper, we have examined k-means and hierarchical clustering algorithms for a purely numeric synthetic data set. So far we have examined distance measures for single link, complete link and average link on numeric data sets, as well as non-hierarchical clustering methods. On a number of real-world data sets the results achieved were highly encouraging. The K-means algorithm can also be examined further to achieve better performance. This paper considered hierarchical and non-hierarchical clustering methods designed for numerical data. We have examined K-means, and in future work we will extend it to mixed data.

References

1. P. Praveen and B. Rama, "An empirical comparison of Clustering using hierarchical methods and K-means," 2016 2nd International Conference on Advances in Electrical, Electronics, Information, Communication and Bio-Informatics (AEEICB), Chennai, 2016, pp doi: /AEEICB
2. Zhang, S. and Chen, Z. The research of hilbert r-tree spatial index algorithm based on hybrid clustering. International Conference on Electronic and Mechanical Engineering and Information Technology, 2011,
3. P. Praveen, B. Rama and T. Sampath Kumar, "An efficient clustering algorithm of minimum Spanning Tree," 2017 Third International Conference on Advances in Electrical, Electronics, Information, Communication and Bio-Informatics (AEEICB), Chennai, 2017, pp doi: /AEEICB
4. Jiawei Han, Micheline Kamber, Data Mining concepts and techniques, 2nd Edition.

5. A.K. Jain, M.N. Murty, P.J. Flynn, Data clustering: a review, ACM Computing Surveys 31 (3) (1999)
6. N. Yuruk, M. Mete, X. Xu, T. A. J. Schweiger, A divisive hierarchical structural clustering algorithm for networks, in: Proceedings of the 7th IEEE International Conference on Data Mining Workshops, 2007, pp
7. Fang, H. and Saad, Y. (2008). Farthest centroids divisive clustering. In Proc. ICMLA, pages
8. De Carvalho, F. A. T. and De Souza, R. M. C. R. (2010). Unsupervised pattern recognition models for mixed feature-type symbolic data. Pattern Recognition Letters, 31(5):
9. Arun K Pujari, Data Mining Techniques, second edition.
10. Musa J. Jafar, A Tools-Based Approach to Teaching Data Mining Methods, Journal of Information Technology Education, Volume 9,
