Review on Various Clustering Methods for the Image Data


Madhuri A. Tayal (1), M. M. Raghuwanshi (2)
(1) SRKNEC Nagpur, (2) NYSS Nagpur; (1, 2) Nagpur University, Nagpur [Maharashtra], INDIA.

madhuri_kalpe@rediffmail.com, m_raghuwanshi@rediffmail.com

ABSTRACT

Nowadays, storing information (data) is not a problem; storing it so that it can be used effectively is. Clustering is the classification of patterns into groups of similar items: the data within a group are similar to one another, but quite different from the data in other groups. The clustering problem has been addressed in many fields, which shows how widely usable it is. In this paper, clustering is applied to image data. Feature values are extracted from the images, and the final categorization depends on these values. The computational complexities of the different methods are also listed. The paper ends with some difficulties and their solutions, and with the results of the clustering.

Keywords: classification, clustering, feature extraction, feature selection

I. INTRODUCTION

We are living in a world full of data. Every day, people encounter a large amount of information and store or represent it as data for further analysis and management. One of the vital means of dealing with these data is to classify or group them into a set of categories or clusters. Clustering refers to the process of grouping samples so that the samples are similar within each group [1]. Clustering can be considered the most important unsupervised learning problem; like every other problem of this kind, it deals with finding structure in a collection of unlabeled data. A cluster is therefore a collection of objects that are similar to one another and dissimilar to the objects belonging to other clusters. Important survey papers on clustering techniques also exist in the literature.
Starting from a statistical pattern recognition viewpoint, Jain, Murty, and Flynn [2] reviewed clustering algorithms and other important issues related to cluster analysis. The purpose of this paper is to describe the influential and important clustering algorithms rooted in statistics, computer science, and machine learning, with emphasis on advances in recent years. One issue in cluster analysis, how to choose the number of clusters, is also summarized in the last section.

II. CLUSTERING ALGORITHMS

Different starting points and criteria usually lead to different taxonomies of clustering algorithms [1][2][3]. A rough but widely agreed frame is to classify clustering techniques as hierarchical clustering and partitional clustering, based on the properties of the clusters generated. Hierarchical clustering groups data objects with a sequence of partitions, either from singleton clusters to a cluster including all individuals or vice versa, while partitional clustering directly divides data objects into some prespecified number of clusters without the hierarchical structure. We follow this frame in surveying the clustering algorithms in the literature. Beginning with the discussion of different algorithms, we focus on hierarchical clustering and classical partitional clustering algorithms in the following sections.

Distance and Similarity Measure

An important component of a clustering algorithm is the distance measure between data points. If the components of the data instance vectors are all in the same physical units, the simple Euclidean distance metric may be sufficient to group similar data instances successfully. The distance between two clusters can be measured by [1]:

1. Euclidean distance
2. City block distance

In addition, some similarity and dissimilarity measures are listed in Table I [3].

Table I: Similarity and Dissimilarity Measures for Quantitative Features [3]
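The two distances named above can be illustrated with a minimal sketch. The function names `euclidean_distance` and `city_block_distance` and the sample points are illustrative choices, not taken from the paper; feature vectors are assumed to be equal-length sequences of numbers.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance: square root of the summed squared differences."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def city_block_distance(a, b):
    """City block (Manhattan) distance: sum of absolute coordinate differences."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b))


if __name__ == "__main__":
    p, q = (0.0, 0.0), (3.0, 4.0)  # hypothetical two-feature points
    print("Euclidean: ", euclidean_distance(p, q))
    print("City block:", city_block_distance(p, q))
```

The city block distance is never smaller than the Euclidean distance between the same two points, so the two measures can rank pairs of points differently when the features differ in several coordinates at once.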


III. CLASSIFICATION

Clustering algorithms may be broadly classified as listed below:

A. Hierarchical
   Agglomerative:
   a) Single linkage
   b) Complete linkage
   c) Group average linkage
   d) Median linkage
   e) Centroid linkage
   f) Ward's method
   g) Balanced iterative reducing and clustering using hierarchies (BIRCH)
   h) Clustering using representatives (CURE)
   i) Robust clustering using links (ROCK)
   Divisive:
   Divisive analysis (DIANA), monothetic analysis (MONA)
B. Squared Error-Based (Vector Quantization)
   a) K-means
C. Fuzzy
   a) Fuzzy c-means (FCM)
   b) Mountain method (MM), possibilistic c-means clustering algorithm (PCM)
   c) Fuzzy c-shells (FCS)
D. Neural Networks-Based
   a) Learning vector quantization (LVQ)
   b) Self-organizing feature map (SOFM), ART
   c) Simplified ART (SART)
   d) Hyperellipsoidal clustering network
   e) Self-splitting competitive learning network (SPLL)
E. Kernel-Based
   a) Kernel K-means
   b) Support vector clustering (SVC)
F. Data visualization/high-dimensional data
   a) Iterative self-organizing data analysis technique (ISODATA)
   b) Genetic K-means algorithm (GKA)
   c) Partitioning around medoids (PAM)

The computational complexities of various clustering algorithms are given in Table II.

Table II: Computational Complexity of Clustering Algorithms [3]

Clustering algorithms can also be distinguished by how they assign data to clusters: exclusive, overlapping, hierarchical, and probabilistic clustering. In the first case, data are grouped in an exclusive way, so that a datum that belongs to one cluster cannot be included in another. A simple example is shown in the figure below, where the points are separated by a straight line in a two-dimensional plane. In contrast, the second type, overlapping clustering, uses fuzzy sets to cluster data, so that each point may belong to two or more clusters with different degrees of membership; each datum is then associated with an appropriate membership value. A hierarchical clustering algorithm, instead, is based on the union of the two nearest clusters.
The beginning condition is realized by setting every datum as its own cluster; after a number of iterations, the desired final clusters are reached. Finally, the last kind of clustering uses a completely probabilistic approach.

IV. HIERARCHICAL CLUSTERING ALGORITHM

Given a set of N items to be clustered and an N×N distance (or similarity) matrix, the basic process of hierarchical clustering is:

1. Start by assigning each item to its own cluster, so that N items give N clusters, each containing just one item. Let the distances (similarities) between the clusters be the same as the distances (similarities) between the items they contain.
2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that there is now one cluster fewer.
3. Compute the distances (similarities) between the new cluster and each of the old clusters.
4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.

Step 3 can be done in different ways, which is what distinguishes single-linkage from complete-linkage and average-linkage clustering. In single-linkage clustering (also called the connectedness or minimum method), the distance between one cluster and another is the shortest distance from any member of one cluster to any member of the other. If the data consist of similarities, the similarity between one cluster and another is the greatest similarity from any member of one cluster to any member of the other. In complete-linkage clustering (also called the diameter or maximum method), the distance between one cluster and another is the greatest distance from any member of one cluster to any member of the other. In average-linkage clustering, the distance between one cluster and another is the average distance from any member of one cluster to any member of the other. The results with image data are shown later in the paper.

Single Linkage Algorithm: The single linkage algorithm is also called the minimum method. The distance between two clusters is the smallest distance between two points such that one point is in each cluster. If C_i and C_j are clusters, the distance is

$$D_{SL}(C_i, C_j) = \min_{a \in C_i,\, b \in C_j} d(a, b)$$

Complete Linkage Algorithm: The complete linkage algorithm is also called the maximum method. The distance between two clusters is the largest distance between two points such that one point is in each cluster. If C_i and C_j are clusters, the distance is

$$D_{CL}(C_i, C_j) = \max_{a \in C_i,\, b \in C_j} d(a, b)$$

Average Linkage Algorithm: The average linkage algorithm is an attempt to compromise between the extremes of the single and complete linkage algorithms. The distance between two clusters is the average distance between two points such that one point is in each cluster. If C_i and C_j are clusters containing n_i and n_j points, the distance is

$$D_{AL}(C_i, C_j) = \frac{1}{n_i n_j} \sum_{a \in C_i,\, b \in C_j} d(a, b)$$

The main weaknesses of agglomerative clustering methods are that they do not scale well, with a time complexity of at least O(n^2), where n is the total number of objects, and that they can never undo what was done previously.

V. K-MEANS CLUSTERING

K-means is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem. The procedure classifies a given data set into a certain number of clusters (say k) fixed a priori. The main idea is to define k centroids, one for each cluster. These centroids should be placed carefully, because different locations lead to different results [6]. The better choice is therefore to place them as far away from each other as possible. The next step is to take each point of the data set and associate it with the nearest centroid. When no point is pending, the first step is complete and an early grouping is done. At this point, k new centroids are recalculated as the barycenters of the clusters resulting from the previous step. With these k new centroids, a new binding is made between the same data points and the nearest new centroid. This generates a loop, in which the k centroids change their location step by step until no more changes occur; in other words, the centroids no longer move. Finally, this algorithm aims at minimizing an objective function, in this case a squared error function. The objective function is

$$J = \sum_{j=1}^{k} \sum_{i=1}^{n} \left\| x_i^{(j)} - c_j \right\|^2$$

where $\left\| x_i^{(j)} - c_j \right\|^2$ is a chosen distance measure between a data point $x_i^{(j)}$ and the cluster centre $c_j$, and $J$ is an indicator of the distance of the n data points from their respective cluster centres.

The algorithm is composed of the following steps:

1. Place K points into the space represented by the objects that are being clustered. These points represent initial group centroids.
2. Assign each object to the group that has the closest centroid.
3. When all objects have been assigned, recalculate the positions of the K centroids.
4. Repeat Steps 2 and 3 until the centroids no longer move.
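The four steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `k_means`, the `max_iter` and `seed` parameters, and the choice to draw the initial centroids at random from the data points are all assumptions made for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to the given point."""
    return min(range(len(centroids)),
               key=lambda j: squared_distance(point, centroids[j]))


def k_means(points, k, max_iter=100, seed=0):
    """Return (centroids, labels, objective) for a list of numeric tuples."""
    rng = random.Random(seed)
    # Step 1: initial centroids. Assumption: k distinct data points drawn at random.
    centroids = [tuple(p) for p in rng.sample(points, k)]
    for _ in range(max_iter):
        # Step 2: assign each object to the group with the closest centroid.
        labels = [nearest(p, centroids) for p in points]
        # Step 3: recalculate each centroid as the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                dim = len(members[0])
                new_centroids.append(
                    tuple(sum(m[d] for m in members) / len(members)
                          for d in range(dim)))
            else:
                # Assumption: an empty cluster keeps its previous centroid.
                new_centroids.append(centroids[j])
        # Step 4: stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    # Recompute labels so they always match the returned centroids.
    labels = [nearest(p, centroids) for p in points]
    # Squared-error objective: summed squared distance to the assigned centroid.
    objective = sum(squared_distance(p, centroids[lab])
                    for p, lab in zip(points, labels))
    return centroids, labels, objective


if __name__ == "__main__":
    # Hypothetical two-feature data with two well-separated groups.
    data = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
            (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
    centroids, labels, objective = k_means(data, k=2)
    print(centroids, labels, objective)
```

Changing `seed` changes the initial centroids; on data that are less cleanly separated than this example, different seeds can end in different final partitions.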


This produces a separation of the objects into groups from which the metric to be minimized can be calculated.

Advantages

1. K-means is a simple algorithm that has been adapted to many problem domains.
2. It is more automated than manual thresholding of an image.
3. It is a good candidate for extension to work with fuzzy feature vectors.

Disadvantages

1. Although it can be proved that the procedure will always terminate, the k-means algorithm does not necessarily find the optimal configuration, corresponding to the global minimum of the objective function.
2. The algorithm is also significantly sensitive to the initial randomly selected cluster centers. The k-means algorithm can be run multiple times to reduce this effect.

A large number of attempts have been made to estimate the appropriate number of clusters K, and some representative examples are illustrated in the following [6]. Some solutions are:

1. Visualization of the data set. For data points that can be effectively projected onto a two-dimensional Euclidean space, commonly depicted with a histogram or scatterplot, direct observation can provide good insight into the value of K. However, the complexity of most real data sets restricts the effectiveness of this strategy to a small scope of applications.
2. Construction of certain indices (or stopping rules). These indices usually emphasize intra-cluster compactness and inter-cluster isolation, and consider the combined effects of several factors, including the defined squared error, the geometric or statistical properties of the data, the number of patterns, the dissimilarity (or similarity), and the number of clusters. Milligan and Cooper compared and ranked 30 indices according to their performance over a series of artificial data sets.
3. Optimization of some criterion functions under a probabilistic mixture-model framework. In a statistical framework, finding the correct number of clusters (components) is equivalent to fitting a model to the observed data and optimizing some criterion.

Fig. 1 Different image patterns

The patterns can be clustered using a number of features. The basic features are color, shape, and texture. Here, one of the basic features (color) is used together with two new features: the number of objects and the size of the object. Various methods for detecting size, shape, etc. are available in [7] and elsewhere in the literature. The value of each feature is shown in Table III. The results of the clustering experiment are shown in Fig. 2. The results are the same for the single, complete, and average linkage algorithms, and also the same for the Euclidean and city block distances.

Table III: Image Feature Values with Respect to Patterns

Pattern No | Color          | No. of objects | Size of object
1          | 15 (White)     |                |
2          | (Dark Gray)    |                |
3          | 14 (Yellow)    |                |
4          | 06 (Brown)     |                |
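The agglomerative procedure of Section IV, with the single, complete, and average linkage rules and a choice of distance, can be sketched as below. This is a naive illustration under stated assumptions rather than the authors' implementation: the function names, the `num_clusters` stopping parameter (the text merges down to one cluster; stopping earlier is a convenience), and the feature vectors are all hypothetical. In particular, the numbers in `features` are invented for the example and are not the values of Table III.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def city_block(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def cluster_distance(ci, cj, points, dist, linkage):
    """Distance between two clusters (lists of point indices) under a linkage rule."""
    pair_distances = [dist(points[a], points[b]) for a in ci for b in cj]
    if linkage == "single":      # shortest member-to-member distance
        return min(pair_distances)
    if linkage == "complete":    # greatest member-to-member distance
        return max(pair_distances)
    if linkage == "average":     # mean over all n_i * n_j pairs
        return sum(pair_distances) / len(pair_distances)
    raise ValueError("unknown linkage: " + linkage)


def agglomerative(points, num_clusters=1, dist=euclidean, linkage="single"):
    """Merge the closest pair of clusters until num_clusters remain."""
    clusters = [[i] for i in range(len(points))]  # every item starts alone
    merges = []                                   # merge history, as in a dendrogram
    while len(clusters) > num_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(clusters[i], clusters[j],
                                     points, dist, linkage)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters)
                    if idx not in (i, j)] + [merged]
    return clusters, merges


if __name__ == "__main__":
    # Hypothetical (color code, number of objects, object size) vectors.
    features = [(15, 2, 40), (8, 5, 10), (14, 2, 38), (6, 5, 12)]
    for dist in (euclidean, city_block):
        for linkage in ("single", "complete", "average"):
            clusters, _ = agglomerative(features, 2, dist, linkage)
            print(dist.__name__, linkage, clusters)
```

Indices in the output are zero-based. With these made-up vectors, every combination of linkage and distance produces the same two groups, which mirrors the kind of agreement the text reports, but only because the example data were chosen to be well separated.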


Fig. 2 Dendrogram for the clustering

Fig. 3 Different clusters for different patterns

After experimentation, it is found that patterns 1 and 3 are categorised into cluster 1 and patterns 2 and 4 into cluster 2, as shown in Fig. 2. So, depending on the number of features and their values, we can separate the patterns into different clusters.

Applications

Clustering algorithms can be applied in many fields, for instance:

Marketing: finding groups of customers with similar behaviour, given a large database of customer data containing their properties and past buying records;
Biology: classification of plants and animals given their features;
Insurance: identifying groups of motor insurance policy holders with a high average claim cost; identifying frauds;
City-planning: identifying groups of houses according to their house type, value, and geographical location;
Earthquake studies: clustering observed earthquake epicenters to identify dangerous zones;
WWW: document classification; clustering weblog data to discover groups of similar access patterns;
and many more.

VI. CONCLUSIONS AND FUTURE WORK

In this paper, we studied various clustering methods and their complexities. We studied the K-means clustering algorithm, its advantages and disadvantages, and the problems encountered with it. In the literature, we found that most existing clustering methods depend on image features such as gray levels, texture, and color. Clustering can also be done on histograms [7]. The presented work consists of the basic idea and implementation of some of the basic clustering methods. In future, we will implement more clustering methods and their utilities.

REFERENCES

[1] Earl Gose, Richard Johnsonbaugh, and Steve Jost, Pattern Recognition and Image Analysis (textbook).
[2] A. K. Jain, M. N. Murty, and P. J. Flynn, "Data Clustering: A Review," ACM Computing Surveys, vol. 31, no. 3, September.
[3] Rui Xu and Donald Wunsch, "Survey of Clustering Algorithms," IEEE Transactions on Neural Networks, vol. 16, no. 3, May.
[4] Anil K. Jain, Robert P. W. Duin, and Jianchang Mao, "Statistical Pattern Recognition: A Review," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 1, January.
[5] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan, "Automatic subspace clustering of high dimensional data for data mining applications," in Proc. ACM SIGMOD Int. Conf. Management of Data, 1998, pp.
[6] Hui Xiong, Junjie Wu, and Jian Chen, "K-Means Clustering Versus Validation Measures: A Data-Distribution Perspective," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 39, no. 2, April.
[7] Rafael Gonzalez and Richard E. Woods, Digital Image Processing (textbook).
