A SURVEY ON CLUSTERING ALGORITHMS

Ms. Kirti M. Patil¹ and Dr. Jagdish W. Bakal²

¹ P.G. Scholar, Department of Computer Engineering, ARMIET, Mumbai University, India
² Principal, S.S.J.C.O.E., Mumbai University, India

ABSTRACT

Nowadays, clustering is a central object of research in several fields such as machine learning and pattern recognition. Clustering plays an outstanding role in information retrieval, text summarization, marketing, bioinformatics, medicine and many more. Clustering is the process of dividing data into meaningful groups, called clusters, which are formed on the basis of the similarity and dissimilarity of the objects they contain. Clustering algorithms are used to cluster the data objects and are generally categorized as hard or soft. Some clustering algorithms, such as K-means, Fuzzy C-means (FCM), hierarchical clustering and mixture of Gaussians, are the most widely used. This paper focuses on these clustering algorithms and their advantages and disadvantages.

Keywords: K-means, Fuzzy C-means, Hierarchical, Mixture of Gaussians

[1] INTRODUCTION

Clustering or cluster analysis is the process of grouping a set of objects into groups that are meaningful, useful or both. The groups are not predefined. Clustering can be used in many application domains such as marketing, medicine, bioinformatics, economics and anthropology. Clustering is sometimes referred to as unsupervised learning, since it finds structure in the data without labels. A clustering is a set of clusters that together contain all objects in the data set. Clustering can be distinguished as hard clustering, in which each object belongs to exactly one cluster, and soft clustering, in which each object belongs to every cluster to a certain degree. Objects are grouped in such a way that objects in the same group are more similar to each other than to objects in other groups. Clustering algorithms are classified as exclusive (K-means), overlapping (Fuzzy C-means), hierarchical, and probabilistic (mixture of Gaussians). The most widely used clustering algorithms are the following:

K-means
Fuzzy C-means
Hierarchical clustering
Mixture of Gaussians

[2] CLUSTERING ALGORITHMS

K-means Algorithm

The K-means clustering algorithm is an unsupervised learning algorithm that solves the well-known clustering problem. It classifies a given data set into a certain number of clusters. The main idea is to define k centers, one for each cluster, and to group objects into k groups based on their attributes. The main purpose of k-means clustering is to classify the data. K-means uses the squared Euclidean distance to allocate objects to clusters. The quality of a clustering is determined by the following squared error function:

J(V) = \sum_{i=1}^{c} \sum_{j=1}^{c_i} \| x_i - v_j \|^2

where \| x_i - v_j \| is the Euclidean distance between x_i and v_j, c_i is the number of data points in the i-th cluster, and c is the number of cluster centers.

Algorithmic steps for k-means clustering. Here X = {x_1, x_2, x_3, ..., x_n} is the set of data points and V = {v_1, v_2, ..., v_c} is the set of centers.

1) Select any c cluster centers.
2) Calculate the distance between each data point and each cluster center.
3) Assign each data point to the cluster center from which its distance is the minimum over all cluster centers.
4) Recalculate the new cluster centers using

v_i = (1/c_i) \sum_{j=1}^{c_i} x_j

where c_i is the number of data points in the i-th cluster.
5) Recalculate the distance between each data point and the newly obtained cluster centers.
6) If no data point was reassigned, stop; otherwise repeat from step 3). (A minimal sketch of these steps is given after the lists below.)

Advantages:-
1) Easy to understand.
2) Gives the best results when the data sets are distinct, i.e. well separated from each other.

Disadvantages:-
1) Requires a priori specification of the number of cluster centers.
2) Unable to handle noisy data and outliers.
3) Converges to a local optimum of the squared error function.
4) Euclidean distance measures can unequally weight underlying factors.
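As a concrete illustration of steps 1)-6), here is a minimal NumPy sketch. It is not the authors' implementation; the random initialisation from data points, the iteration cap, and the two-blob example data are assumptions made only for the example.

```python
import numpy as np

def kmeans(X, c, max_iter=100, seed=0):
    """Minimal k-means following steps 1)-6) above.
    X: (n, d) data matrix; c: number of cluster centers (chosen a priori)."""
    rng = np.random.default_rng(seed)
    # 1) Select c cluster centers (here: c distinct random data points).
    centers = X[rng.choice(len(X), size=c, replace=False)]
    for _ in range(max_iter):
        # 2) Squared Euclidean distance from every point to every center.
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        # 3) Assign each point to its nearest center.
        labels = dist.argmin(axis=1)
        # 4) Recompute each center as the mean of its assigned points
        #    (v_i = (1/c_i) * sum of the points in cluster i; assumes
        #    no cluster becomes empty, a simplification for the sketch).
        new_centers = np.array([X[labels == i].mean(axis=0) for i in range(c)])
        # 5)-6) Stop once the centers, and hence the assignments, no longer change.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Illustrative run on two well-separated blobs, where k-means does best.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])
centers, labels = kmeans(X, c=2)
```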

Fuzzy C-Means Algorithm

The Fuzzy C-means algorithm was developed by Jim Bezdek in 1974 [1]. It is an unsupervised clustering algorithm which assigns a membership value to each data point for each cluster center, on the basis of the distance between the data point and the cluster center. The degree of membership of each data item to each cluster is calculated, and this value decides the cluster to which the data item is taken to belong. The memberships of each data item sum to one over the clusters. The following formulas specify the membership degrees and the cluster centers:

\mu_{ij} = 1 / \sum_{k=1}^{c} ( d_{ij} / d_{ik} )^{2/(m-1)}

v_j = ( \sum_{i=1}^{n} \mu_{ij}^m x_i ) / ( \sum_{i=1}^{n} \mu_{ij}^m ), for j = 1, ..., c

where m is the fuzziness index, m \in [1, \infty], c is the number of cluster centers, \mu_{ij} is the membership of the i-th data point to the j-th cluster center, and d_{ij} is the Euclidean distance between the i-th data point and the j-th cluster center.

Algorithmic steps for Fuzzy C-means clustering. Here X = {x_1, x_2, x_3, ..., x_n} is the set of data points and V = {v_1, v_2, v_3, ..., v_c} is the set of centers.

1) Select any c cluster centers.
2) Calculate the fuzzy memberships \mu_{ij} using the first formula above.
3) Calculate the fuzzy centers v_j using the second formula above.
4) Repeat steps 2) and 3) until the minimum value of J is achieved or \| U^{(k+1)} - U^{(k)} \| < \beta, where k is the iteration step, \beta \in [0, 1] is the termination criterion, U = (\mu_{ij})_{n \times c} is the fuzzy membership matrix, and J is the objective function, J(U, V) = \sum_{i=1}^{n} \sum_{j=1}^{c} \mu_{ij}^m \| x_i - v_j \|^2. (A minimal sketch of these steps is given after the lists below.)

Advantages:-
1) Performs better than the K-means algorithm.
2) Gives the best results on overlapped data sets.

Disadvantages:-
1) Requires a priori specification of the number of clusters.
2) Euclidean distance measures can unequally weight underlying factors.
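The two update formulas translate almost line-for-line into code. The following minimal NumPy sketch of steps 1)-4) assumes the common choices m = 2 and a random initial membership matrix; it is illustrative, not the authors' implementation.

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, beta=1e-5, max_iter=100, seed=0):
    """Minimal FCM sketch of steps 1)-4) above.
    m > 1 is the fuzziness index, beta the termination criterion."""
    rng = np.random.default_rng(seed)
    n = len(X)
    # Initialise the fuzzy membership matrix U so each row sums to one.
    U = rng.random((n, c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(max_iter):
        Um = U ** m
        # 3) Fuzzy centers v_j: membership-weighted means of the data.
        V = (Um.T @ X) / Um.sum(axis=0)[:, None]
        # Euclidean distances d_ij between data point i and center j.
        d = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2)
        d = np.fmax(d, 1e-12)  # guard against division by zero
        # 2) Membership update: mu_ij = 1 / sum_k (d_ij / d_ik)^(2/(m-1)).
        U_new = 1.0 / ((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1))).sum(axis=2)
        # 4) Stop when the membership matrix stops changing: ||U(k+1) - U(k)|| < beta.
        if np.linalg.norm(U_new - U) < beta:
            U = U_new
            break
        U = U_new
    return V, U
```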

Hierarchical Clustering Algorithm

A hierarchical clustering algorithm (HCA) creates a set of clusters by recursively partitioning the instances and grouping them into a tree structure. This tree structure is known as a dendrogram; it shows how the clusters produced by a hierarchical method at different levels relate to one another. The root of the dendrogram is a single cluster in which all elements are grouped together, and the leaves are single-element clusters.

Figure 1: Dendrogram.

Hierarchical clustering algorithms are divided into two types:
i) Agglomerative algorithms [merging]: The clustering process starts with the unclustered items and merges clusters until all items belong to one cluster. Pairwise similarity measures determine which clusters to merge.
ii) Divisive algorithms [splitting]: These algorithms initially place all items in one cluster and repeatedly split clusters into smaller ones; a cluster is split when its elements are not sufficiently close to each other.

Algorithmic steps for HCA (agglomerative):-
1) Start with each instance in its own cluster.
2) Until there is only one cluster: among the current clusters, determine the two clusters c_i and c_j that are most similar.
3) Replace c_i and c_j with a single cluster c_i \cup c_j. (A minimal sketch of this loop is given after the lists below.)

Advantages:-
1) Ease of handling any form of similarity or distance.
2) Hierarchical clustering algorithms are more versatile.

Disadvantages:-
1) The algorithm can never undo what was done previously: merges and splits are final.
2) No objective function is directly minimized.
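The three steps above admit a short, naive sketch. Single linkage (the distance between two clusters is that of their closest pair) is assumed here, since the section leaves the pairwise similarity measure open; this is an O(n^3)-style illustration, not an efficient implementation.

```python
import numpy as np

def single_link_agglomerative(X):
    """Naive sketch of the agglomerative steps above, with single linkage.
    Returns the merge sequence, i.e. a flat record of the dendrogram."""
    # 1) Start with every instance in its own cluster.
    clusters = [[i] for i in range(len(X))]
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    merges = []
    # 2) Until only one cluster remains...
    while len(clusters) > 1:
        best = (np.inf, 0, 1)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # ...find the two most similar (closest) clusters:
                # single linkage takes the closest pair of members.
                d = dist[np.ix_(clusters[a], clusters[b])].min()
                if d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((tuple(clusters[a]), tuple(clusters[b]), d))
        # 3) Replace c_a and c_b with the single cluster c_a U c_b.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges
```

Reading the merge sequence from first to last reconstructs the dendrogram of Figure 1 bottom-up, from the single-element leaves to the all-element root.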

Mixture of Gaussians

In model-based clustering, a particular model is assumed for the clusters and the fit between the data and the model is optimized. Gaussian (continuous) or Poisson (discrete) distributions are combined into a mixture of distributions modelling the entire data set. The Expectation-Maximization (EM) algorithm is used to find the parameters of a mixture of Gaussians. EM for a Gaussian mixture is an iterative procedure that starts from some initial estimate of the parameters θ and then iteratively updates θ until convergence is detected. Each iteration consists of an E-step and an M-step. (A minimal sketch of both steps is given after the lists below.)

E-step: Estimates the missing values (the cluster assignments) using the current estimate of θ. Initially, this can be done by finding a weighted average of the observed data.

M-step: Finds the new estimates of the parameters θ that maximize the expected likelihood, using the estimates of the missing data from the E-step.

Advantages:-
1) Fastest algorithm for learning mixture models.

Disadvantages:-
1) The algorithm always uses all the components it has access to, so it needs complex held-out data criteria to decide how many components to use in the absence of external cues.
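The sketch below shows one possible form of the E-step and M-step for a spherical Gaussian mixture (one shared variance per component, a simplifying assumption; full GMMs estimate full covariance matrices). It is illustrative, not the paper's method.

```python
import numpy as np

def gmm_em(X, k, n_iter=50, seed=0):
    """Minimal EM sketch for a spherical Gaussian mixture model."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Initial estimates of theta = (mixing weights, means, variances).
    w = np.full(k, 1.0 / k)
    mu = X[rng.choice(n, size=k, replace=False)]
    var = np.full(k, X.var())
    for _ in range(n_iter):
        # E-step: responsibilities r[i, j], the posterior probability that
        # point i was generated by component j, under the current theta.
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        log_p = np.log(w) - 0.5 * (sq / var + d * np.log(2 * np.pi * var))
        log_p -= log_p.max(axis=1, keepdims=True)  # numerical stability
        r = np.exp(log_p)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate theta by responsibility-weighted averages.
        nk = r.sum(axis=0)
        w = nk / n
        mu = (r.T @ X) / nk[:, None]
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        var = (r * sq).sum(axis=0) / (d * nk)
    return w, mu, var
```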

[3] LITERATURE SURVEY

T. Kanungo and D. M. Mount present a simple and efficient implementation of Lloyd's k-means clustering algorithm, called the filtering algorithm. The algorithm is easy to implement, requiring a kd-tree as the only major data structure. [2]

Rui Xu and Donald Wunsch II present a survey of clustering algorithms for data sets appearing in statistics, computer science, and machine learning, and illustrate their applications on some benchmark data sets. [3]

A. Baraldi and P. Blonda review the issues related to clustering approaches and their relationships to different methods. [4]

M.-S. Yang gives a summary of fuzzy set theory as applied in cluster analysis. The paper focuses mostly on fuzzy clustering based on fuzzy relations and objective functions, and on the fuzzy generalized k-nearest neighbor rule. [5]

Brendan J. Frey and Delbert Dueck treat clustering as learning a set of cluster centers such that the sum of squared errors between data points and their nearest centers is small; the exemplars are centers selected from the actual data points. [6]

Jianbo Shi and Jitendra Malik developed an algorithm based on viewing perceptual grouping as a process that extracts global impressions of a scene or image; the grouping provides a hierarchical description of the scene. In their paper, graph segmentation is done with the normalized cut criterion, an unbiased measure of disassociation between subgroups of a graph. [7]

P. Corsini, B. Lazzerini, and F. Marcelloni present a new fuzzy clustering algorithm, the any-relation clustering algorithm, which partitions a data set by minimizing the Euclidean distance between an object in a cluster and the prototype of the cluster. The proposed algorithm is based on fuzzy relational object data and is more stable and scalable, with faster convergence. [8]

M. Kuchaki Rafsanjani, Z. Asghari Varzaneh, and N. Emami Chukanlo discuss the clustering process, several hierarchical clustering algorithms and their attributes, the advantages and disadvantages of hierarchical clustering algorithms, and compare the algorithms with each other. [9]

A. K. Jain, M. N. Murty, and P. J. Flynn examine the various steps in clustering and discuss fuzzy, neural, evolutionary, and knowledge-based approaches to clustering. Their paper describes the applications of clustering. [10]

[4] CONCLUSION

Clustering or cluster analysis is the process of grouping a set of objects into groups that are meaningful, useful or both. Clustering can be hard or soft. Objects are clustered on the basis of the similarities and dissimilarities of the objects in the group. Various clustering algorithms are used for clustering data objects.

The K-means clustering algorithm is an unsupervised learning algorithm that solves the well-known clustering problem. It is easy to understand but requires a priori specification of the number of cluster centers. The Fuzzy C-means algorithm (FCM) assigns membership values to each data point on the basis of the distance between the data point and each cluster center. It performs better than the K-means algorithm but also requires a priori specification of the number of cluster centers. A hierarchical clustering algorithm (HCA) creates a set of clusters grouped into a tree structure called a dendrogram; HCAs are divided into two types, agglomerative and divisive. Hierarchical clustering algorithms are more versatile, but no objective function is directly minimized. The Expectation-Maximization (EM) algorithm is used to find the parameters of a mixture of Gaussians and alternates between an Expectation (E) step and a Maximization (M) step; it is the fastest algorithm for learning mixture models.

REFERENCES

[1] J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, New York: Plenum Press, 1981.
[2] T. Kanungo and D. M. Mount, An Efficient k-Means Clustering Algorithm: Analysis and Implementation, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 7, 2002.
[3] Rui Xu and Donald Wunsch II, Survey of Clustering Algorithms, IEEE Transactions on Neural Networks, vol. 16, no. 3, May 2005.
[4] A. Baraldi and P. Blonda, A Survey of Fuzzy Clustering Algorithms for Pattern Recognition, Parts I and II, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 29, no. 6, pp. 778-801, Dec. 1999.
[5] M.-S. Yang, A Survey of Fuzzy Clustering, Mathematical and Computer Modelling, vol. 18, no. 11, pp. 1-16, 1993.
[6] B. J. Frey and D. Dueck, Clustering by Passing Messages between Data Points, Science, vol. 315, pp. 972-976, 2007.
[7] J. Shi and J. Malik, Normalized Cuts and Image Segmentation, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 888-905, Aug. 2000.
[8] P. Corsini, B. Lazzerini, and F. Marcelloni, A New Fuzzy Relational Clustering Algorithm Based on the Fuzzy C-Means Algorithm, Soft Computing, vol. 9, pp. 439-447, 2005.
[9] M. Kuchaki Rafsanjani, Z. Asghari Varzaneh, and N. Emami Chukanlo, A Survey of Hierarchical Clustering Algorithms, The Journal of Mathematics and Computer Science, vol. 5, no. 3, pp. 229-240, 2012.
[10] A. K. Jain, M. N. Murty, and P. J. Flynn, Data Clustering: A Review, ACM Computing Surveys, vol. 31, no. 3, pp. 264-323, 1999.
[11] C. F. J. Wu, On the Convergence Properties of the EM Algorithm, The Annals of Statistics, vol. 11, no. 1, pp. 95-103, 1983.
[12] M. Jordan and R. Jacobs, Hierarchical Mixtures of Experts and the EM Algorithm, Neural Computation, vol. 6, pp. 181-214, 1994.