CHAPTER 4 K-MEANS AND UCAM CLUSTERING ALGORITHM

4.1 Introduction

Clustering has been used in a number of applications such as engineering, biology, medicine and data mining. The most popular clustering algorithm across these fields is K-Means, since it is simple, fast and efficient. K-Means was developed by MacQueen. The K-Means algorithm is effective in producing clusters for many practical applications, but the computational complexity of the original algorithm is very high, especially for large datasets. K-Means is a partition clustering method that separates data into K groups. The main drawback of this algorithm is the a priori fixation of the number of clusters and the seeds [16].

To rectify the drawbacks of the K-Means algorithm, a new algorithm is proposed, namely the Unique Clustering with Affinity Measure (UCAM) clustering algorithm, which starts its computation without specifying the number of clusters or the initial seeds. UCAM works purely on an affinity measure, which fixes the number of resultant clusters. It divides the dataset into a number of clusters with the help of a threshold value, and the uniqueness of the clusters is based on this value.

The number of clusters increases as the threshold value decreases and decreases as the threshold value increases; a smaller threshold value yields more unique clusters.

4.2 K-Means Clustering

The main objective of clustering is to group objects that are similar into one cluster and to separate dissimilar objects by assigning them to different clusters. One of the most popular clustering methods is the K-Means clustering algorithm. It classifies objects into a pre-defined number of clusters given by the user (assume K clusters). The idea is to choose random cluster centres, one for each cluster, preferably as far from each other as possible. The algorithm uses the Euclidean distance measure between two multidimensional data points

X = (x_1, x_2, x_3, ..., x_m)    (4.1)
Y = (y_1, y_2, y_3, ..., y_m)    (4.2)

The Euclidean distance between the points X and Y is defined as

D(X, Y) = ( Σ_{i=1}^{m} (x_i - y_i)^2 )^{1/2}    (4.3)

The K-Means method aims to minimize the sum of squared distances between all points and their cluster centre.
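As a concrete illustration (a sketch, not part of the original text), Eq. (4.3) translates directly into the following Python helper; the name euclidean is my own and is reused in the sketches later in this chapter.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two m-dimensional points, Eq. (4.3)."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))
```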

The algorithmic steps are described in Figure 4.1.

Input:
  D = {d_1, d_2, d_3, ..., d_n}  // set of n data points
  K  // number of desired clusters
Output:
  A set of K clusters.
Method:
1. Select the number of clusters. Let this number be K.
2. Pick K seeds as centroids of the K clusters. The seeds may be picked randomly unless the user has some insight into the data.
3. Compute the Euclidean distance of each object in the dataset from each of the centroids.
4. Allocate each object to the nearest cluster, based on the distances computed in the previous step.
5. Recompute the centroids of the clusters as the means of the attribute values of the objects in each cluster.
6. Check whether the stopping criterion has been met (e.g. the cluster membership is unchanged). If yes, go to step 7; if not, go to step 3.
7. [Optional] One may decide to stop at this stage or to split a cluster or combine two clusters heuristically until a stopping criterion is met.

Figure 4.1: K-Means Clustering Algorithm

Though the K-Means algorithm is simple, it has some drawbacks in the quality of the final clustering, since the result depends strongly on the initial centroids.
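The steps of Figure 4.1 can be sketched in Python as follows, assuming the euclidean helper above. The thesis does not prescribe an implementation, so the function and variable names are illustrative; sequential seeding (the first K points) mirrors the worked example in this section.

```python
def kmeans(data, k, max_iter=100):
    """Minimal K-Means following steps 1-6 of Figure 4.1 (a sketch)."""
    centroids = [list(p) for p in data[:k]]            # step 2: sequential seeds
    for _ in range(max_iter):
        # steps 3-4: allocate each object to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in data:
            nearest = min(range(k), key=lambda c: euclidean(p, centroids[c]))
            clusters[nearest].append(p)
        # step 5: recompute centroids as attribute-wise means
        updated = [[sum(dim) / len(c) for dim in zip(*c)] if c else centroids[i]
                   for i, c in enumerate(clusters)]
        # step 6: stop when the centroids (hence memberships) are unchanged
        if updated == centroids:
            break
        centroids = updated
    return clusters, centroids
```

Sequential seeding keeps runs reproducible; with random seeding (e.g. random.sample), two runs on the same data can differ, which is exactly the drawback discussed below.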

To illustrate, the K-Means clustering algorithm is applied to a very small sample of ten students' records, each containing the student number, age and the marks obtained in three subjects, as shown in Table 4.1.

Table 4.1: Students' Information

Stud-No  Age  Mark1  Mark2  Mark3
S1        18    73     75     57
S2        18    79     85     75
S3        23    70     70     52
S4        20    55     55     55
S5        22    85     86     87
S6        19    91     90     89
S7        20    70     65     60
S8        21    53     56     59
S9        19    82     82     60
S10       47    75     76     77

The process of K-Means clustering is initiated with initial seeds, selected either sequentially or randomly. Each seed acts as the centroid of a cluster in the initial stage. In this example three initial seeds are selected sequentially: the objects S1, S2 and S3, as shown in Table 4.2.

Table 4.2: The three seeds from Table 4.1

Stud-No  Age  Mark1  Mark2  Mark3
S1        18    73     75     57
S2        18    79     85     75
S3        23    70     70     52

Applying K-Means to the sample data of Table 4.1, initialized with the seeds of Table 4.2, produces three clusters: Table 4.3 with 2 objects, and Tables 4.4 and 4.5 with 4 objects each.

Table 4.3: Cluster C1 obtained through K-Means

Stud-No  Age  Mark1  Mark2  Mark3
S1        18    73     75     57
S9        19    82     82     60

Table 4.4: Cluster C2 obtained through K-Means

Stud-No  Age  Mark1  Mark2  Mark3
S2        18    79     85     75
S5        22    85     86     87
S6        19    91     90     89
S10       47    75     76     77

Table 4.5: Cluster C3 obtained through K-Means

Stud-No  Age  Mark1  Mark2  Mark3
S3        23    70     70     52
S4        20    55     55     55
S7        20    70     65     60
S8        21    53     56     59

The K-Means execution results in three clusters:

C1 = {S1, S9}    (4.4)
C2 = {S2, S5, S6, S10}    (4.5)
C3 = {S3, S4, S7, S8}    (4.6)

where S1, S2, ..., S10 are the students' records, considering only the numeric attributes.

In the above study, K-Means produces three clusters in which both low and high marks appear together, since none of the initial seeds has marks above 90. Hence, if the initial seeds are not defined properly, the result will not be unique; moreover, since K is fixed, the result is constrained to three clusters. In K-Means the initial seeds are randomly selected, so two executions on the same dataset will not give the same result unless the initial seeds are the same. This run can be reproduced with the kmeans sketch above, as shown next.
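The snippet below simply encodes Table 4.1 and runs the sketch with the sequential seeds S1, S2, S3. Exact agreement with (4.4)-(4.6) depends on tie-breaking, so treat the expected output as illustrative.

```python
# Table 4.1, keyed by student number: (age, mark1, mark2, mark3)
students = {
    "S1": (18, 73, 75, 57), "S2": (18, 79, 85, 75), "S3": (23, 70, 70, 52),
    "S4": (20, 55, 55, 55), "S5": (22, 85, 86, 87), "S6": (19, 91, 90, 89),
    "S7": (20, 70, 65, 60), "S8": (21, 53, 56, 59), "S9": (19, 82, 82, 60),
    "S10": (47, 75, 76, 77),
}

clusters, _ = kmeans(list(students.values()), k=3)   # seeds: S1, S2, S3
for i, c in enumerate(clusters, 1):
    members = [s for s, rec in students.items() if rec in c]
    print(f"C{i} = {members}")                        # expect (4.4)-(4.6)
```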

The main drawback of K-Means is that the initial seeds and the number of clusters must be defined in advance, though they are difficult to predict at an early stage.

4.3 UCAM Clustering Algorithm

In cluster analysis one does not know in advance what classes or clusters exist, and the problem to be solved is to group the given data into meaningful clusters. The UCAM algorithm is developed with this motive. UCAM is a clustering algorithm basically for numeric data, and it mainly addresses the drawbacks of the K-Means clustering algorithm: in K-Means the process is initiated with initial seeds and the number of clusters to be obtained, but the number of clusters cannot be predicted from a single view of the dataset, and the result may not be unique if the number of clusters and the initial seeds are not properly identified. UCAM instead clusters by means of an affinity measure. The process is initiated without any centroid and without the number of clusters to be produced; a threshold value is set for making unique clusters, and increasing or decreasing the threshold value fixes the number of resultant clusters [85]. The step-by-step procedure for UCAM is given in Figure 4.2.

Input:
  D = {d_1, d_2, d_3, ..., d_n}  // set of n data points
  T  // threshold value
Output:
  Clusters; the number of clusters depends on the affinity measure.
Method:
1. Set the threshold value T.
2. If the current tuple is the first tuple of the dataset, create a new cluster structure.
3. Otherwise, compute the affinity of the tuple with each existing cluster.
4. Get the minimum value of the computed affinities, S.
5. Get the index i of the cluster C_i corresponding to S.
6. If S <= T, add the current tuple to C_i.
7. If S > T, create a new cluster.
8. Continue until the last tuple of the dataset.

Figure 4.2: UCAM Clustering Algorithm
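A minimal sketch of Figure 4.2 follows. The thesis does not pin down the affinity measure, so the sketch assumes Euclidean distance to each existing cluster's running centroid; under that assumption a tuple joins the closest cluster when the distance is within T and otherwise founds a new cluster.

```python
def centroid(cluster):
    """Attribute-wise mean of a cluster's tuples."""
    return [sum(dim) / len(cluster) for dim in zip(*cluster)]

def ucam(data, T):
    """Minimal UCAM following Figure 4.2 (a sketch; affinity = Euclidean
    distance to the cluster centroid, which is an assumption)."""
    clusters = []
    for p in data:
        if not clusters:                 # step 2: first tuple opens a cluster
            clusters.append([p])
            continue
        # steps 3-5: affinity of p with each cluster; take the minimum S
        dists = [euclidean(p, centroid(c)) for c in clusters]
        s = min(dists)
        i = dists.index(s)
        if s <= T:                       # step 6: within threshold, join C_i
            clusters[i].append(p)
        else:                            # step 7: otherwise start a new cluster
            clusters.append([p])
    return clusters
```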

Implementing the UCAM algorithm on the sample data of Table 4.1, initiated with a threshold value T, results in the following five clusters: Table 4.6 with 3 objects, Table 4.7 with 3 objects, Table 4.8 with 2 objects, and Tables 4.9 and 4.10 with 1 object each.

Table 4.6: Cluster C1 obtained through UCAM

Stud-No  Age  Mark1  Mark2  Mark3
S1        18    73     75     57
S3        23    70     70     52
S7        20    70     65     60

Table 4.7: Cluster C2 obtained through UCAM

Stud-No  Age  Mark1  Mark2  Mark3
S2        18    79     85     75
S5        22    85     86     87
S6        19    91     90     89

Table 4.8: Cluster C3 obtained through UCAM

Stud-No  Age  Mark1  Mark2  Mark3
S4        20    55     55     55
S8        21    53     56     59

Table 4.9: Cluster C4 obtained through UCAM

Stud-No  Age  Mark1  Mark2  Mark3
S9        19    82     82     60

Table 4.10: Cluster C5 obtained through UCAM

Stud-No  Age  Mark1  Mark2  Mark3
S10       47    75     76     77

The UCAM execution results in five clusters:

C1 = {S1, S3, S7}    (4.7)
C2 = {S2, S5, S6}    (4.8)
C3 = {S4, S8}    (4.9)
C4 = {S9}    (4.10)
C5 = {S10}    (4.11)

The uniqueness of the clusters depends on the initial setting of the threshold value: as the threshold value increases, the number of clusters decreases. In UCAM there is no initial prediction of the number of resultant clusters; the result is based purely on the affinity measure, as the sweep below illustrates.
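The thesis does not report the exact threshold used for the five-cluster result, so rather than guess it, the following sweep (reusing the students data from Section 4.2) demonstrates the stated behaviour: the cluster count generally shrinks as T grows. The threshold values are illustrative only.

```python
for T in (5, 10, 15, 20, 30, 50):
    found = ucam(list(students.values()), T)
    print(f"T = {T:>2}: {len(found)} clusters")
```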

In the above study, K-Means produced three clusters in which low and high marks were mixed, because none of the initial seeds had marks above 90; if the initial seeds are not defined properly the result is not unique, and the fixed K constrains the outcome to three clusters. UCAM, in contrast, is initiated with the threshold alone and produces a unique result with five clusters:

C1 - cluster with medium marks.
C2 - cluster with high marks.
C3 - cluster with low marks.
C4 = {S9}    (4.12)
C5 = {S10}    (4.13)

S9 is the student with good marks in two subjects and a low mark in one subject, so S9 should be given more attention in subject 3 in order to improve the ranking of the institution. S10 should be examined separately, since his age differs markedly from that of the other students. Both approximate clustering and unique clustering can be obtained by increasing or decreasing the threshold value.

4.4 Measurements on Cluster Uniqueness

The cluster representations of K-Means and UCAM are illustrated through the scatter graphs in Figures 4.3 and 4.4, in which each symbol indicates a separate cluster.

Figure 4.3: Clustering through K-Means

Figure 4.4: Clustering through UCAM

In Figure 4.4 all the clusters are unique in representation compared to the K-Means clustering of Figure 4.3. The dark-shaded symbols are peculiar objects; depending on the application they are either projected out or merged with a nearby cluster by adjusting the threshold value. Both approximate clusters and unique clusters can be obtained by increasing or decreasing the threshold value.

4.5 Comparative Analysis

The UCAM algorithm produces unique clusterings purely on the basis of an affinity measure, hence there is no possibility of error in the clustering. One major advantage is that both rough clustering and accurate unique clustering are possible by adjusting the threshold value. In K-Means clustering, by contrast, there is a chance of error if the initial seeds are not identified properly. The comparative study of K-Means and UCAM clustering is shown in Table 4.11.

Table 4.11: Comparative study of the K-Means and UCAM clustering algorithms

Algorithm | Initial number of clusters | Centroid      | Threshold value | Cluster result             | Clustering error
K-Means   | K                          | Initial seeds | -               | Depends on initial seeds   | Yes, if seeds are wrong
UCAM      | -                          | -             | T               | Depends on threshold value | -

4.6 Discussion

Clustering is a widely used technique in data mining applications for discovering patterns in large datasets.

In this chapter the traditional K-Means algorithm was analyzed, and the quality of its resultant clustering was found to depend on the initial seeds, whether selected sequentially or randomly. The K-Means algorithm must be initiated with the number of clusters K and the initial seeds, and for large real-time databases it is difficult to predict these accurately. To overcome this drawback, the current chapter focused on developing the UCAM (Unique Clustering with Affinity Measure) algorithm, which clusters without being given initial seeds or a number of clusters; unique clustering is obtained with the help of affinity measures.

4.7 Summary

In this chapter the new UCAM algorithm is used for data clustering. The approach removes the overhead of fixing the cluster count and initial seeds, as required by K-Means, and instead fixes a threshold value to obtain a unique clustering. The proposed method improves scalability and reduces clustering error, and it ensures that the whole clustering process completes in good time without loss of cluster correctness.