CHAPTER 4

K-MEANS AND UCAM CLUSTERING ALGORITHM

4.1 Introduction

Clustering has been used in a number of applications such as engineering, biology, medicine and data mining. The most popular clustering algorithm in these fields is K-Means, since it is simple, fast and efficient. K-Means was developed by MacQueen. The K-Means algorithm is effective in producing clusters for many practical applications, but the computational complexity of the original algorithm is very high, especially for large datasets. K-Means is a partition clustering method that separates the data into K groups; its main drawback is the a priori fixation of the number of clusters and of the seeds [16]. To rectify these drawbacks, a new algorithm is proposed, namely the Unique Clustering with Affinity Measure (UCAM) clustering algorithm, which starts its computation without specifying the number of clusters or the initial seeds. UCAM works purely on an affinity measure, which fixes the number of resultant clusters: the dataset is divided into clusters with the help of a threshold value, and the uniqueness of the clusters depends on that value. The number of clusters increases on decreasing the threshold value, and the number of clusters decreases by increasing the
threshold value; more unique clusters are obtained when the threshold value is smaller.

4.2 K-Means Clustering

The main objective of clustering is to group objects that are similar into one cluster and to separate dissimilar objects by assigning them to different clusters. One of the most popular clustering methods is the K-Means clustering algorithm. It classifies objects into a pre-defined number of clusters given by the user (assume K clusters). The idea is to choose random cluster centres, one for each cluster; these centres are preferably as far as possible from each other. The algorithm uses the Euclidean distance measure between two multidimensional data points

X = (x1, x2, x3, ..., xm)    (4.1)
Y = (y1, y2, y3, ..., ym)    (4.2)

The Euclidean distance between the points X and Y is defined as

D(X, Y) = ( Σ (xi - yi)^2 )^(1/2), summed over i = 1, ..., m    (4.3)

The K-Means method aims to minimize the sum of squared distances between all points and their cluster centre. The algorithmic steps are described in the following Figure 4.1.
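As a quick check, the distance of Equation (4.3) can be evaluated directly; the following is a minimal Python sketch (the function name is illustrative, not from the thesis):

```python
import math

def euclidean(x, y):
    # Equation (4.3): square root of the summed squared differences
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

print(euclidean((0, 0), (3, 4)))  # classic 3-4-5 triangle, prints 5.0
```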
Input: D = {d1, d2, d3, ..., dn} // Set of n data points.
       K - Number of desired clusters.
Output: A set of K clusters.
Method:
1. Select the number of clusters. Let this number be k.
2. Pick k seeds as centroids of the k clusters. The seeds may be picked randomly unless the user has some insight into the data.
3. Compute the Euclidean distance of each object in the dataset from each of the centroids.
4. Allocate each object to its nearest cluster, based on the distances computed in the previous step.
5. Recompute the centroid of each cluster as the mean of the attribute values of the objects in that cluster.
6. Check whether the stopping criterion has been met (e.g. the cluster membership is unchanged). If yes, go to step 7; if not, go to step 3.
7. [Optional] One may decide to stop at this stage, or to split a cluster or combine two clusters heuristically until a stopping criterion is met.

Figure 4.1: K-Means Clustering Algorithm

Though the K-Means algorithm is simple, the quality of its final clustering suffers from a drawback, since it depends heavily on the initial centroids.
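The steps of Figure 4.1 can be sketched in Python. This is an illustrative sketch, not the thesis's own implementation; it hard-codes the ten student records used in the worked example that follows and picks the first k tuples as seeds (sequential selection):

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def k_means(data, k, max_iter=100):
    # Step 2: pick the first k tuples as seeds (sequential selection).
    centroids = [list(data[i]) for i in range(k)]
    assignment = None
    for _ in range(max_iter):
        # Steps 3-4: allocate each object to the nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: euclidean(point, centroids[c]))
            for point in data
        ]
        # Step 6: stop once cluster membership no longer changes.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 5: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [data[i] for i, a in enumerate(assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment

# Student records S1..S10: (age, mark1, mark2, mark3).
students = [
    (18, 73, 75, 57), (18, 79, 85, 75), (23, 70, 70, 52),
    (20, 55, 55, 55), (22, 85, 86, 87), (19, 91, 90, 89),
    (20, 70, 65, 60), (21, 53, 56, 59), (19, 82, 82, 60),
    (47, 75, 76, 77),
]
labels = k_means(students, k=3)
```

With the seeds S1, S2 and S3 this run converges to the grouping {S1, S9}, {S2, S5, S6, S10}, {S3, S4, S7, S8}; with different seeds the result can differ, which is exactly the sensitivity discussed above.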
The K-Means clustering algorithm is now applied to a very small sample of ten students' records, containing student number, age and the marks obtained in three subjects, as shown in Table 4.1.

Table 4.1: Students Information

Stud-no  Age  Mark1  Mark2  Mark3
S1       18   73     75     57
S2       18   79     85     75
S3       23   70     70     52
S4       20   55     55     55
S5       22   85     86     87
S6       19   91     90     89
S7       20   70     65     60
S8       21   53     56     59
S9       19   82     82     60
S10      47   75     76     77

The process of K-Means clustering is initiated with initial seeds, which are selected either sequentially or randomly. Each seed acts as the centroid of a cluster in the initial stage. In this example three initial seeds are selected in a sequential manner: the objects S1, S2 and S3 are the initial seeds, as represented in Table 4.2 below.
Table 4.2: The three seeds from Table 4.1

Stud-no  Age  Mark1  Mark2  Mark3
S1       18   73     75     57
S2       18   79     85     75
S3       23   70     70     52

Applying the K-Means algorithm to the sample data in Table 4.1, initialized with the seeds indicated in Table 4.2, produces three clusters: Table 4.3 with two objects, and Tables 4.4 and 4.5 with four objects each.

Table 4.3: Cluster C1 obtained through K-Means

Stud-no  Age  Mark1  Mark2  Mark3
S1       18   73     75     57
S9       19   82     82     60

Table 4.4: Cluster C2 obtained through K-Means

Stud-no  Age  Mark1  Mark2  Mark3
S2       18   79     85     75
S5       22   85     86     87
S6       19   91     90     89
S10      47   75     76     77
Table 4.5: Cluster C3 obtained through K-Means

Stud-no  Age  Mark1  Mark2  Mark3
S3       23   70     70     52
S4       20   55     55     55
S7       20   70     65     60
S8       21   53     56     59

The K-Means execution results in three clusters, as noted below:

C1 = {S1, S9}    (4.4)
C2 = {S2, S5, S6, S10}    (4.5)
C3 = {S3, S4, S7, S8}    (4.6)

where S1, S2, ..., S10 are the students' records, of which only the numeric attributes are considered. In this study the K-Means clustering algorithm results in three clusters in which low marks and high marks are mixed, because none of the initial seeds has marks above 90. Hence, if the initial seeds are not defined properly, the result will not be unique; moreover, since the run is constrained to k = 3, it can only ever produce three clusters. Because the initial seeds are selected randomly, two executions on the same dataset will not give the same result unless the seeds happen to be identical. The main drawback of K-Means is
that the initial seeds and the number of clusters must be specified in advance, although they are difficult to predict at an early stage.

4.3 UCAM Clustering Algorithm

In cluster analysis one does not know in advance what classes or clusters exist; the problem to be solved is to group the given data into meaningful clusters. The UCAM algorithm is developed with the same motive. UCAM is a clustering algorithm intended primarily for numeric data, and it mainly addresses the drawback of the K-Means clustering algorithm. In K-Means, the process is initiated with the initial seeds and the number of clusters to be obtained; but the number of clusters cannot be predicted from a single view of the dataset, and the result may not be unique if the number of clusters and the initial seeds are not properly identified. UCAM instead performs clustering with the help of an affinity measure. The process is initiated without any centroids and without a predefined number of clusters; a threshold value is set to obtain unique clusters, and increasing or decreasing the threshold value fixes the number of resultant clusters [85]. The step-by-step procedure for UCAM is given in Figure 4.2.

Input: D = {d1, d2, d3, ..., dn} // Set of n data points.
       T - Threshold value.
Output: Clusters. The number of clusters depends on the affinity measure.
Method:
1. Set the threshold value T.
2. If the current tuple is the first tuple of the dataset, create a new cluster structure.
3. If it is not the first tuple, compute its similarity measure with each existing cluster.
4. Get the minimum value of the computed similarities, S.
5. Get the cluster index Ci which corresponds to S.
6. If S <= T, add the current tuple to Ci.
7. If S > T, create a new cluster.
8. Continue the process until the last tuple of the dataset.

Figure 4.2: UCAM Clustering Algorithm

The UCAM algorithm is now applied to the sample data given in Table 4.1. The process is initiated with a threshold value T and yields the following five clusters: Table 4.6 with three objects, Table 4.7 with three objects, Table 4.8 with two objects, and Tables 4.9 and 4.10 with one object each.

Table 4.6: Cluster C1 obtained through UCAM

Stud-no  Age  Mark1  Mark2  Mark3
S1       18   73     75     57
S3       23   70     70     52
S7       20   70     65     60
Table 4.7: Cluster C2 obtained through UCAM

Stud-no  Age  Mark1  Mark2  Mark3
S2       18   79     85     75
S5       22   85     86     87
S6       19   91     90     89

Table 4.8: Cluster C3 obtained through UCAM

Stud-no  Age  Mark1  Mark2  Mark3
S4       20   55     55     55
S8       21   53     56     59

Table 4.9: Cluster C4 obtained through UCAM

Stud-no  Age  Mark1  Mark2  Mark3
S9       19   82     82     60
Table 4.10: Cluster C5 obtained through UCAM

Stud-no  Age  Mark1  Mark2  Mark3
S10      47   75     76     77

The UCAM execution results in five clusters, as noted below:

C1 = {S1, S3, S7}    (4.7)
C2 = {S2, S5, S6}    (4.8)
C3 = {S4, S8}    (4.9)
C4 = {S9}    (4.10)
C5 = {S10}    (4.11)

The uniqueness of the clusters depends on the initial setting of the threshold value: if the threshold value increases, the number of clusters decreases. In UCAM there is no initial prediction of the number of resultant clusters; the clustering is based purely on the affinity measure.

As observed in Section 4.2, the K-Means algorithm produced three clusters in which low and high marks were mixed, because none of the initial seeds had marks above 90; if the initial seeds are not defined properly the result will not be unique, and the constrained run can only ever yield three clusters. The UCAM algorithm, in contrast, is initiated with the threshold alone and produces a unique result with five clusters:

C1: Cluster with medium marks.
C2: Cluster with high marks.
C3: Cluster with low marks.
C4 = {S9}    (4.12)
C5 = {S10}    (4.13)

S9 is a student with good marks in two subjects and a low mark in one subject; S9 should therefore receive particular attention in subject 3, which would improve the ranking of the institution. S10 stands out because his age differs markedly from that of the other students. Both approximate clustering and unique clustering can be obtained by increasing or decreasing the threshold value.

4.4 Measurements on Cluster Uniqueness

The cluster representations of K-Means and UCAM are illustrated through the scatter graphs shown below in Figure 4.3 and Figure 4.4, in which each symbol indicates a separate cluster.
Figure 4.3: Clustering through K-Means

Figure 4.4: Clustering through UCAM

In Figure 4.4 all the clusters are unique in their representation compared with the K-Means clustering of Figure 4.3. The dark shaded symbols are peculiar objects; depending on the application, such an object is either projected out or merged with a nearby cluster by adjusting the threshold value. Both approximate clusters and unique clusters are obtained by increasing or decreasing the threshold value.
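The UCAM procedure of Figure 4.2 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the thesis's implementation: the chapter does not fix the affinity measure or the threshold, so the sketch assumes the affinity of a tuple with a cluster is the Euclidean distance to that cluster's running centroid, and uses T = 15 on the student records of Table 4.1:

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def ucam(data, threshold):
    clusters = []   # each cluster is a list of indices into `data`
    centroids = []  # running centroid of each cluster
    for i, tup in enumerate(data):
        # Steps 3-5: affinity of the tuple with every existing cluster
        # (assumed here to be distance to the cluster centroid).
        dists = [euclidean(tup, c) for c in centroids]
        if dists and min(dists) <= threshold:
            # Step 6: within the threshold, so join the closest cluster.
            ci = dists.index(min(dists))
            clusters[ci].append(i)
            members = [data[j] for j in clusters[ci]]
            centroids[ci] = [sum(col) / len(members) for col in zip(*members)]
        else:
            # Steps 2 and 7: first tuple, or no cluster within T: open a new one.
            clusters.append([i])
            centroids.append(list(tup))
    return clusters

# Student records S1..S10 from Table 4.1: (age, mark1, mark2, mark3).
students = [
    (18, 73, 75, 57), (18, 79, 85, 75), (23, 70, 70, 52),
    (20, 55, 55, 55), (22, 85, 86, 87), (19, 91, 90, 89),
    (20, 70, 65, 60), (21, 53, 56, 59), (19, 82, 82, 60),
    (47, 75, 76, 77),
]
clusters = ucam(students, threshold=15)
```

Under these assumptions the run reproduces the five clusters of Equations (4.7)-(4.11): {S1, S3, S7}, {S2, S5, S6}, {S4, S8}, {S9}, {S10}. Raising the threshold to T = 25 merges nearby clusters and the same run yields only four, illustrating how the threshold trades approximate clustering against unique clustering.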
4.5 Comparative Analysis

The UCAM algorithm produces unique clusters purely on the basis of the affinity measure, so clustering errors caused by poorly chosen seeds do not arise. A major advantage is that both rough clustering and accurate unique clustering are possible by adjusting the threshold value. In K-Means clustering, by contrast, there is a chance of error if the initial seeds are not identified properly. A comparative study of K-Means and UCAM clustering is shown in Table 4.11.

Table 4.11: Comparative study on K-Means and UCAM Clustering Algorithms

                             K-Means                    UCAM
Initial number of clusters   K                          -
Centroid                     Initial seeds              -
Threshold value              -                          T
Cluster result               Depends on initial seeds   Depends on threshold value
Cluster error                Yes, if seeds are wrong    -

4.6 Discussion

Clustering is a widely used technique in data mining applications for discovering patterns in large datasets. In this chapter the traditional K-Means
algorithm is analyzed, and it is found that the quality of the resultant clusters depends on the initial seeds, which are selected either sequentially or randomly. The K-Means algorithm must be initiated with the number of clusters k and the initial seeds, and for real-time large databases it is difficult to predict either of these accurately. To overcome this drawback, the current chapter focused on developing the UCAM (Unique Clustering with Affinity Measure) algorithm, which clusters without being given initial seeds or a number of clusters; unique clustering is obtained with the help of affinity measures.

4.7 Summary

In this chapter, the new UCAM algorithm is used for data clustering. This approach removes the overhead of fixing the cluster count and the initial seeds, as required in K-Means; instead, it fixes a threshold value to obtain a unique clustering. The proposed method improves scalability and reduces clustering error, and it ensures that the whole clustering mechanism completes in good time without loss of correctness in the clusters.