Volume 3, Issue 5, May 2013, ISSN: 2277 128X
International Journal of Advanced Research in Computer Science and Software Engineering
Research Paper, available online at: www.ijarcsse.com

A Survey of Clustering Techniques

Ramandeep Kaur*, Dr. Gurjit Singh Bhathal
Department of Computer Engineering, University College of Engineering, Punjabi University, Patiala, Punjab, India

Abstract: Clustering can be considered the most important unsupervised learning technique; like every other problem of this kind, it deals with finding structure in a collection of unlabeled data. Clustering is the process of organizing objects into groups whose members are similar in some way. A cluster is therefore a collection of objects that are similar to one another and dissimilar to the objects belonging to other clusters. In this paper, we describe clustering techniques and the algorithms used for them.

Keywords: Clustering, goals of clustering, clustering techniques, clustering algorithms.

I. INTRODUCTION
Clustering is the process of partitioning a set of data (or objects) into a set of meaningful sub-classes, called clusters. It helps users understand the natural grouping or structure in a dataset. A good clustering method produces high-quality clusters in which the intra-cluster similarity is high and the inter-cluster similarity is low (a small example of quantifying this criterion appears after the requirements list below). The quality of a clustering result depends on both the similarity measure used by the method and its implementation. The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.

Fig. 1: The result of cluster analysis

The following are typical requirements of clustering in data mining:
1) Scalability: Many clustering algorithms work well on small data sets containing fewer than 200 data objects; however, a large database may contain millions of objects.
2) Ability to deal with different types of attributes: Many algorithms are designed to cluster interval-based (numerical) data. However, applications may require clustering other types of data, such as binary, categorical (nominal), and ordinal data, or mixtures of these data types.
3) Discovery of clusters with arbitrary shape: Many clustering algorithms determine clusters based on Euclidean or Manhattan distance measures. Algorithms based on such distance measures tend to find spherical clusters of similar size and density. It is important to develop algorithms that can detect clusters of arbitrary shape.
4) Minimal requirements for domain knowledge to determine input parameters: Many clustering algorithms require users to supply certain parameters for the cluster analysis (such as the number of desired clusters), and the clustering results can be quite sensitive to these input parameters.
5) Ability to deal with noisy data: Most real-world databases contain outliers or missing, unknown, or erroneous data. Some clustering algorithms are sensitive to such data and may produce clusters of poor quality.
6) Insensitivity to the order of input records: Some clustering algorithms are sensitive to the order of the input data and may, for example, generate dramatically different clusters for different orderings. It is important to develop algorithms that are insensitive to input order.
7) High dimensionality: A database or data warehouse can contain many dimensions or attributes. Many clustering algorithms are good at handling low-dimensional data involving only two to three dimensions, and human eyes are good at judging the quality of clustering for up to three dimensions.
8) Constraint-based clustering: Real-world applications may need to perform clustering under various kinds of constraints. Suppose your job is to choose the locations for a given number of new automatic cash-dispensing machines (ATMs) in a city. To decide upon this, you may cluster households while considering constraints such as the city's rivers and highway networks and the customer requirements of each region.
9) Interpretability and usability: Users expect clustering results to be interpretable, comprehensible, and usable; that is, clustering may need to be tied to specific semantic interpretations and applications. It is important to study how an application's goal may influence the selection of clustering methods.
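The intra-cluster versus inter-cluster similarity criterion above can be made quantitative. One common choice (our illustration, not a method prescribed by this paper) is the silhouette coefficient. A minimal sketch, assuming scikit-learn is installed; the synthetic dataset and the range of candidate k values are illustrative assumptions:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: 300 points drawn around 4 centers (illustrative only).
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# For several candidate cluster counts, cluster with k-means and report
# the silhouette coefficient: values near 1 indicate high intra-cluster
# similarity and low inter-cluster similarity.
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))
```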

II. Goals of Clustering
The goal of clustering is to determine the intrinsic grouping in a set of unlabeled data, but how do we decide what constitutes a good clustering? It can be shown that there is no absolute "best" criterion that is independent of the final aim of the clustering. Consequently, it is the user who must supply this criterion, in such a way that the clustering results will suit their needs. For instance, we could be interested in finding representatives for homogeneous groups (data reduction), in finding "natural" clusters and describing their unknown properties (natural data types), in finding useful and suitable groupings (useful data classes), or in finding unusual data objects (outlier detection).

III. Clustering Techniques
A. Hierarchical Clustering
Hierarchical clustering creates a hierarchical decomposition of the given set of data objects. It can be classified as either agglomerative or divisive, based on how the hierarchical decomposition is formed. The agglomerative approach, also called the bottom-up approach, starts with each object forming a separate group; it successively merges the groups that are close to one another, until all the groups are merged into one. The divisive approach, also called the top-down approach, starts with all of the objects in the same cluster; a cluster is then split into smaller clusters, until eventually each object forms a cluster of its own. Dendrograms are a natural visualization for hierarchical clustering, as they show the hierarchical relations between clusters, and the method has been shown to capture concentric clusters. On the other hand, it is not easy to define the level at which the hierarchy should be cut into clusters.

Fig. 2: Example of agglomerative clustering and the corresponding hierarchical-clustering dendrogram
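As an illustration of the agglomerative (bottom-up) approach described above, the following sketch builds a cluster hierarchy with SciPy and cuts it into flat clusters. This is our example, not code from the paper; the toy points and the choice of Ward linkage are assumptions:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Six toy 2-D points forming two visually obvious groups (illustrative).
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.9, 1.0],
              [8.0, 8.2], [8.1, 7.9], [7.9, 8.0]])

# Agglomerative clustering: every point starts as its own cluster and the
# two closest clusters are merged repeatedly (Ward linkage assumed here).
Z = linkage(X, method="ward")

# Cut the hierarchy into two flat clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # e.g. [1 1 1 2 2 2]

# scipy.cluster.hierarchy.dendrogram(Z) would plot the merge tree,
# showing the hierarchical relations between clusters discussed above.
```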

B. Density-Based Clustering
Most partitioning methods cluster objects based on the distance between objects. Such methods can find only spherical-shaped clusters and have difficulty discovering clusters of arbitrary shape. Other clustering methods have therefore been developed based on the notion of density. Density-based clustering does not require the number of clusters to be specified a priori, as opposed to k-means; it has a notion of noise; and it requires just two parameters while being mostly insensitive to the ordering of the points in the database. The quality of density-based clustering depends on the distance measure used in its density function. (A minimal sketch of DBSCAN, a representative density-based algorithm, is given after subsection D below.)

Fig. 3: Example of density-based clustering

C. Grid-Based Clustering
Grid-based clustering quantizes the object space into a finite number of cells that form a grid structure. All of the clustering operations are performed on this grid structure. The main advantage of the approach is its fast processing time, which is typically independent of the number of data objects and depends only on the number of cells in each dimension of the quantized space. STING is a typical example of a grid-based method.

Fig. 4: Example of grid-based clustering

D. Model-Based Clustering
Model-based methods hypothesize a model for each of the clusters and find the best fit of the data to the given model. A model-based algorithm may locate clusters by constructing a density function that reflects the spatial distribution of the data points.
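To make the density-based discussion in subsection B concrete, here is a minimal sketch using DBSCAN from scikit-learn, which takes exactly the two parameters alluded to above: a neighborhood radius (eps) and a minimum neighbor count (min_samples). The data and parameter values are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two small groups plus one isolated point (illustrative data).
X = np.array([[1.0, 1.0], [1.1, 1.2], [0.9, 1.1],
              [5.0, 5.0], [5.1, 5.2], [4.9, 5.1],
              [9.0, 0.0]])  # far from everything, likely noise

# eps = neighborhood radius, min_samples = density threshold; no number
# of clusters is supplied, in contrast to k-means.
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)

# Points labeled -1 are treated as noise rather than forced into a cluster.
print(labels)  # e.g. [0 0 0 1 1 1 -1]
```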

E. Partitioning Methods
Given a database of n objects, a partitioning method constructs k partitions of the data, where each partition represents a cluster and k <= n. That is, it classifies the data into k groups which together satisfy the following requirements: each group must contain at least one object, and each object must belong to exactly one group. Given k, the number of partitions to construct, a partitioning method creates an initial partitioning and then uses an iterative relocation technique that attempts to improve the partitioning by moving objects from one group to another.

Fig. 5: Example of a partitioning method

a) K-means algorithm
1. It accepts as input the number of clusters to group the data into, and the dataset to cluster.
2. It then creates the first K initial clusters (K = number of clusters needed) by choosing K rows of data randomly from the dataset.
3. The k-means algorithm calculates the arithmetic mean of each cluster formed in the dataset; the arithmetic mean of a cluster is the mean of all the individual records in the cluster. In each of the first K initial clusters there is only one record, so the arithmetic mean of such a cluster is simply the set of values that make up that record.
4. Next, k-means assigns each record in the dataset to exactly one of the initial clusters: each record is assigned to the nearest cluster (the cluster it is most similar to), using a measure of distance or similarity such as the Euclidean distance.
5. K-means re-assigns each record in the dataset to the most similar cluster and re-calculates the arithmetic mean of all the clusters in the dataset.
6. K-means re-assigns each record in the dataset to exactly one of the new clusters formed; a record or data point is assigned to the nearest cluster using the same measure of distance or similarity.
7. The preceding steps are repeated until stable clusters are formed and the k-means clustering procedure is complete. Stable clusters are formed when new iterations of the algorithm create no new clusters, i.e., the cluster center (arithmetic mean) of each cluster is the same as the old cluster center. There are different techniques for deciding when stable clusters have formed and the procedure is complete; a minimal code sketch of these steps follows.
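The following sketch implements the steps above in NumPy. The random choice of K rows as initial clusters (step 2) and the stopping test on unchanged cluster centers (step 7) mirror the description; the function name, the toy data, and the empty-cluster guard are our own assumptions:

```python
import numpy as np

def k_means(X, k, rng=np.random.default_rng(0)):
    """Minimal k-means, following steps 1-7 described above."""
    # Step 2: choose K rows of the dataset at random as initial clusters.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    while True:
        # Steps 4 and 6: assign every record to the nearest cluster
        # center, using Euclidean distance as the similarity measure.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Steps 3 and 5: recompute the arithmetic mean of each cluster
        # (keeping the old center if a cluster happens to become empty).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Step 7: stable clusters, i.e. the centers no longer change.
        if np.allclose(new_centers, centers):
            return labels, centers
        centers = new_centers

# Tiny usage example with two well-separated groups.
X = np.array([[1.0, 1.0], [1.2, 0.8], [8.0, 8.0], [8.2, 7.9]])
labels, centers = k_means(X, k=2)
print(labels)  # e.g. [0 0 1 1] (cluster numbering may differ)
```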

b) Fuzzy c-means
1. It is an extension of k-means.
2. Fuzzy c-means allows data points to be assigned to more than one cluster.
3. In c-means, each data point has a degree of membership (or probability) of belonging to each cluster.

Fuzzy c-means algorithm. Let $x_i$ be the vector of values for data point $g_i$.
1. Initialize the membership matrix $U^{(0)} = [u_{ij}]$ for data points $g_i$ and clusters $cl_j$ at random.
2. At the $k$-th step, compute the fuzzy centroids $C^{(k)} = [c_j]$ for $j = 1, \dots, n_c$, where $n_c$ is the number of clusters, $m$ is the fuzziness parameter, and $n$ is the number of data points:
$$c_j = \frac{\sum_{i=1}^{n} (u_{ij})^m \, x_i}{\sum_{i=1}^{n} (u_{ij})^m}$$
3. Update the fuzzy memberships $U^{(k)} = [u_{ij}]$:
$$u_{ij} = \frac{1}{\sum_{l=1}^{n_c} \left( \frac{\lVert x_i - c_j \rVert}{\lVert x_i - c_l \rVert} \right)^{\frac{2}{m-1}}}$$
4. If $\lVert U^{(k)} - U^{(k-1)} \rVert < \varepsilon$, then STOP; otherwise return to step 2.
5. Determine the membership cutoff: for each data point $g_i$, assign $g_i$ to cluster $cl_j$ if its membership $u_{ij}$ in $U^{(k)}$ exceeds the chosen cutoff $\alpha$.
(A minimal code sketch of this algorithm is given at the end of the paper.)

IV. Why We Use the Fuzzy C-Means Algorithm
In the k-means algorithm, each point belongs to exactly one cluster, whereas in fuzzy c-means a point may belong partially to several similar clusters. Fuzzy c-means can provide better results than k-means because of the membership degrees within its clusters. The fuzzy c-means algorithm also provides flexibility, so changes can easily be made.

V. Conclusion and Future Work
This paper presents a survey of different clustering techniques. In future work, we will apply the fuzzy c-means algorithm to obtain improved results on images.
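Finally, the sketch referenced in the fuzzy c-means section above: a minimal NumPy implementation of steps 1 to 4 of the algorithm. The fuzziness parameter m = 2, the tolerance, and the toy data are illustrative assumptions:

```python
import numpy as np

def fuzzy_c_means(X, n_c, m=2.0, eps=1e-5, rng=np.random.default_rng(0)):
    """Minimal fuzzy c-means, following steps 1-4 above."""
    n = len(X)
    # Step 1: random membership matrix, each row normalized to sum to 1.
    U = rng.random((n, n_c))
    U /= U.sum(axis=1, keepdims=True)
    while True:
        # Step 2: fuzzy centroids c_j = sum_i u_ij^m x_i / sum_i u_ij^m.
        W = U ** m
        C = (W.T @ X) / W.sum(axis=0)[:, None]
        # Step 3: membership update
        # u_ij = 1 / sum_l (||x_i - c_j|| / ||x_i - c_l||)^(2/(m-1)).
        D = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2) + 1e-12
        U_new = 1.0 / (D ** (2.0 / (m - 1.0)))
        U_new /= U_new.sum(axis=1, keepdims=True)
        # Step 4: stop when the memberships change by less than eps.
        if np.linalg.norm(U_new - U) < eps:
            return U_new, C
        U = U_new

# Tiny usage example: two groups, soft membership degrees per point.
X = np.array([[1.0, 1.0], [1.1, 0.9], [8.0, 8.0], [7.9, 8.1]])
U, C = fuzzy_c_means(X, n_c=2)
print(U.round(2))  # each row gives one point's degree in each cluster
```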