HARD, SOFT AND FUZZY C-MEANS CLUSTERING TECHNIQUES FOR TEXT CLASSIFICATION

Size: px

Start display at page:

Download "HARD, SOFT AND FUZZY C-MEANS CLUSTERING TECHNIQUES FOR TEXT CLASSIFICATION"

Mark Bridges
5 years ago
Views:

1 HARD, SOFT AND FUZZY C-MEANS CLUSTERING TECHNIQUES FOR TEXT CLASSIFICATION 1 M.S.Rekha, 2 S.G.Nawaz 1 PG SCALOR, CSE, SRI KRISHNADEVARAYA ENGINEERING COLLEGE, GOOTY 2 ASSOCIATE PROFESSOR, SRI KRISHNADEVARAYA ENGINEERING COLLEGE, GOOTY Abstract:-The Multiple Prototype Fuzzy Clustering Model (FCMP), introduced by Nascimento, Mirkin and Moura-Pires (1999), proposes a framework for partitional fuzzy clustering which suggests a model of how the data are generated from a cluster structure to be identified. In the model, it is assumed that the membership of each entity to a cluster expresses a part of the cluster prototype re ected in the entity. In this paper we extend the FCMP framework to a number of clustering criteria, and study the FCMP properties on fitting the underlying proposed model from which data is generated. A comparative study with the Fuzzy c-means algorithm is also presented. This paper presents Centre-based clustering approaches for clustering Y-STR data. The main goal is to investigate and observe the performance of the fundamental clustering approaches when partitioning Y-STR data. Two fundamental Centre-based hard clustering approaches, k-means and k-modes algorithms, and two fundamental Centre-based soft clustering approaches, fuzzy k-means and fuzzy k-modes algorithms were chosen for evaluation of Y-STR haplogroup and Y-STR Surname datasets. The results show that the soft k-means clustering algorithm produces the best average of the clustering accuracy (99.62%) for Y-STR haplogroup data as well Y-STR surname data (97.61%). The overall results show that the soft clustering approach is better (92.11%) than the hard clustering approach (81.20%) in clustering Y-STR data. However, the approach for clustering Y-STR data should be further investigated to find the best way of achieving 100% of the clustering results. Keywords: Soft Clustering, Hard Clustering, Fuzzy c-means Clustering, and text classification. 1.Introduction: Partitional clustering essentially deals with the task of partitioning a set of entities into a number of homogeneous clusters, with respect to a suitable similarity measure. Due to the fuzzy nature of many practical problems, a number of fuzzy clustering methods have been developed following the general fuzzy set theory strategies outlined by Zadeh, [1]. The main difference between the traditional hard clustering and fuzzy clustering can be stated as follows. While in hard clustering an entity belongs only to one cluster, in fuzzy clustering entities are allowed to belong to many clusters with different degrees of membership. The most known method of fuzzy clustering is the Fuzzy c- Means method (FCM), initially proposed by Dunn [2] and generalized by Bezdek [3],[4] and other authors [5],[6] (see [7] for an overview). Usually, membership functions are de ned based on a distance function, such that membership degrees express proximities of entities to cluster centers (i.e. prototypes). By choosing a suitable distance function (see [6],[8], [9]) different cluster shapes can be identified. However, these approaches typically fail to explicitly describe how the fuzzy cluster structure relates to the data from which it is derived. Nascimento, Mirkin and Moura-Pires [10] proposed a framework for fuzzy clustering based on a model of how the data is generated from a cluster structure to be identified. In this approach, the underlying fuzzy c-partition is supposed to be de ned in such a way that the membership of an entity to a cluster expresses a part of the cluster s prototype re ected in the entity. This way, an entity may bear 60% of a prototype A and 40% of prototype B, which simultaneously expresses the entity s membership to the respective clusters. The prototypes are considered as offered by the knowledge domain. This idea was initially proposed by Mirkin and Satarov as the socalled ideal type fuzzy clustering model [11] (see also [12]), such that each observed entity is defined as a convex combination of the prototypes, and the coe cients are the entity membership values. In our work, we consider a different way for pertaining observed entities to the prototypes: any entity may independently relate to any prototype, which is similar to the assumption in FCM criterion. The model is called the Fuzzy Clustering Multiple Prototype (FCMP) model. In this paper we extend the FCMP framework to a number of clustering criteria and present the main results of the study of the FCMP as a model-based 213

2 approach for clustering as well as its comparison with the Hard, Soft and FCM algorithm. 1. Centre Based Hard and Soft Clustering Approaches The centre-based clustering has been evolving significantly, even though the results have not reached 100% of the clustering accuracy for all benchmarks datasets yet. The trend now has shifted from the hard clustering approaches to the soft clustering approaches. The soft clustering approaches seem to be a promising approach particularly in dealing with categorical data (See Ng and Li (2009) and Kim et al., (2007)). The hard clustering is sometimes called non-fuzzy clustering, whereas the soft clustering is referred to fuzzy clustering. From a general perspective, the hard and soft clustering can be seen as differing in the assigning of values for a partition matrix. The hard clustering approach only assigns a value of 1 or 0. In contrast, the soft clustering is more relaxed, allowing the values to be part of more than one cluster. The higher the value is, the higher the degree of confidence that the objects belong to that cluster. The Centre-based clustering can be described as follows: Let us suppose that the objective is to partition a data set, D into cluster, C. Suppose that k is known as a priori. Let X ={X1, X2,, Xn} be set of data with set of attributes A ={A1,A2,, Am}. The partition of D, whether hard or soft partition is to minimize the cost function as Equation (1), and subject to Equation (2), (3) and (4). Where k( n ) is a known number of clusters, W is a (k x n) partition matrix, Z is [z 1,z 2, z k ] R mk and d (z l, x i ) is a dissimilarity measure between z l and x i. The algorithm can be generalized as: Step 1: Choose an initial point Z (1) R mk Determine W (1) such that F( W,Z (1) ) is minimized. Set t=1. Step 2: Determine Z (t=1) such that F( W t,z (t+1) ) is minimized. If F( W t,z (t+1) )= F( W t,z (t) ) then stop; otherwise goto step 3 Step 3: Determine W (t+1) such that F( W( t+1),z (t+1) ) minimized. If that F( W( t+1),z (t+1) )= that F( W( t),z (t+1) ) then stop; otherwise t=t+1 and goto step 2 From an optimization perspective, the main focus is to solve problem P as described by [13]. The problem P can be solved by iteratively solving the following two problems [14]: Problem P1: Fix Z = Z and solve the reduced problem P(W,Z ) Problem P2: Fix W = Ŵ and solve the reduced problem P( W, Z). Thus, the differences between the hard clustering and the soft clustering are as follows: In the hard clustering, the problem P1 is minimized by Equation (5). 214

3 whereas, in the soft clustering, the problem P1 is minimized by Equation (6). However, in the problem P2, the hard clustering is minimized according to the k-means and k-mode respectively. The k-means minimize as in Equation (7). The main difference between the k-means and k-modes algorithms is that, the k-means handles numerical data, whereas the k-mode handles categorical data. Thus, mean is a mechanism for k-means algorithm to update its centroids and mode for k -Modes algorithm. Consequently, the k-means uses Euclidean distance as in Equation (12) and the k-modes uses a simple dissimilarity measure, introduced by [15] as in Equation (13) and (14). whereas, in the k-modes minimize as in Equation (8). z lj =a j (r) (8) (r) where a j is the mode of attribute values of A j in cluster C l such that Further, in the soft clustering, for the Fuzzy k -Means, the minimize z is given by Equation (10). 2. FUZZY C-MEANS ALGORITHM The fuzzy c-means (FCM) algorithm [3] is one of the most widely used methods in fuzzy clustering. It is based on the concept of fuzzy c-partition, introduced by Ruspini [13], summarized as follows. LetX = [x 1..x n ] be a set of given data, where each data point x k (k = 1..n) is a vector in R p, Ucn be a set of real c x n matrices, and c be an integer, 2 c < n. Then, the fuzzy c-partition space for X is the set Where α [1, ) is a weighting exponent. In the Fuzzy k -Modes, the minimize z is given by Equation (11). (15) 215

4 where u ik is the membership value of x k in cluster i (i =1,, c). The aim of the FCM algorithm is to find an optimal fuzzy c-partition and corresponding prototypes minimizing the objective function (16) In (16), V = (v 1, v 2,., v c ) is a matrix of unknown cluster centers (prototypes) vi 2 <p, k k is the Euclidean norm, and the weighting exponent m in [1, ) is a constant that influences the membership values. To minimize criterion Jm, under the fuzzy constraints defined in (15), the FCM algorithm is de ned as an alternating minimization algorithm (cf.[3] for the derivations), as follows. Choose a value for c; m and ", a small positive constant; then, generate randomly a fuzzy c-partition U0 and set iteration number t = 0. A two-step iterative process works as follows. Given the membership values u(t) ik, the cluster centers v(t) i ( i = 1,, c ) are calculated by (17) Given the new cluster centers v (t) i, update membership values u (t) ik: According to the discussion about the traditional FCM algorithm, the initial condition of cluster centers influences the performance of the algorithm. The best choice of the original cluster centers needs to consider the features of the data set. In this paper, meteorological data is chosen as our experimental data. Meteorological data is different from other experiment data. If we just use the traditional FCM algorithm to deal with the meteorological data, there will be a large error when clustering a certain object. To solve the initialization problem, we put forward an improved FCM algorithm in term of selecting the initial cluster centers. Nowadays, there are several methods to select the original cluster centers. In the following section we will go through some commonly used methods. (1) Randomly The traditional FCM algorithm determines initial cluster centers randomly. This method is simple and generally applicable to all data but usually causes local minima. (2) user-specified Normally, users decide original cluster centers by some priori knowledge. According to the understanding of the data, users always can obtain logical cluster centers to achieve the purpose of the global optimum. (3) Randomly classify objects into several clusters, compute the center of each cluster and determine them as cluster centers More time consumption is spent when randomly classifying objects in this method. When the number of objects in data sets is very small, the cost of time can be ignored. Nevertheless, as the number of objects increases, the speed of the increasing cost of time can be largely rapid. (18) The process stops when U (t+1) - U (t) = ", or a predefined number of iterations is reached. 3. The Improved FCM Algorithm for Meteorological Data (4) Select the farthest points as cluster centers Generally speaking, this method selects initial cluster centers following to the maximum distance principle. It can achieve high efficiency if there are no outliers or noisy points in data sets. But if the data sets contain some outliers, outliers are easier to be chosen as the cluster centers. 216

5 (5) Select points with the maximum density The number of objects whose distance is less than the given radius from the observed object is defined as the density of the observed object. After computing the density of each object, the object whose density is the largest is chosen as the cluster center. Then compute densities of objects whose distances are larger than the given distance from the selected center centers, also choose the object whose density is the largest as the second cluster center. And so forth until the number of cluster centers reaches the given number. This method ensures that cluster centers are far away from each other to avoid the objective function into local minima. r d In our paper, we adopt a new method to determine cluster centers which is based on the fifth method as mentioned above. In our method, we first randomly select the observed object and compute the density of the observed object. If the density of the observed object is not less than the given density parameter, the observed object can be seen as the cluster center. Secondly we keep selecting the second cluster center satisfying the above constraints in the data set which excludes the objects which are cluster centers or objects whose distances are less than the given distance parameter. Finally we obtain the given number of cluster centers after repeating the above process. The distance parameter and the density parameter are decided by users according to the characteristics of the data sets and the priori knowledge. This selection strategy spends less time than the fifth method because time of computing densities of all objects in the data set is saved, while this method maintains the advantage of avoiding the object function into local minima. 4. Conclusion: The hard, soft and FCMP framework proposes a model of how the data are generated from a cluster structure to be identified. This implies direct interpretability of the fuzzy membership values, which should be considered a motivation for introducing the model-based methods. Based on the experimental results obtained in this research, the following can be stated. The FCMP-2 algorithm is able to restore the original prototypes from which data have been generated, and FCMP-0 can be viewed as a device for estimating the number of clusters in the underlying structure to be found. For small dimension spaces, FCMP-1 is an intermediate model between FCMP-0 and FCMP-2, and can be viewed as a model based parallel to FCM. On the high dimension data, FCMP-1 degenerates in a hard clustering approach. Also, FCM drastically decreases the number of prototypes in the high dimension spaces (at least with the proposed data generator). This model-based clustering approach seems appealing in the sense that, on doing cluster analysis, the experts of a knowledge domain usually have conceptual understanding of how the domain is organized in terms of tentative prototypes. This knowledge may well serve as the initial setting for data based structurization of the domain. In such a case, the belongingness of data entities to clusters are based on how much they share the features of corresponding prototypes. This seems useful in such application areas as mental disorders in psychiatry or consumer behavior in marketing. However, the effective utility of the multiple prototypes model still remains to be demonstrated with real data. References: [1] L. Zadeh, Fuzzy sets, Information and Control, 8, pp , [2] J. Dunn, A fuzzy relative of the Isodata process and its use in detecting compact, well-separated clusters, Journal of Cybernetics, 3(3), pp , [3] J. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum Press, New York, [4] J. Bezdek and R. Hathaway, Recent convergence results for the fuzzy c-means clustering algorithms, Journal of Classified- cation, 5(2), pp , [5] G. Gustafson and W. Kessel, Fuzzy clustering with a fuzzy covariance matrix, in Proc. IEEE CDC, pp , San Diego, [6] F. Klawonn and A. Keller, Fuzzy clustering based on modified distance measures, J. Kok, D. Hand and M. Berthold (eds.), Advances in Intelligent Data Analysis, Third International Symposium,(IDA 99), Lecture Notes in Computer Science, 1642, Springer-Verlag, pp , [7] R. Kruse, F. Hoppner, F. Klawonn and T. Runkler, Fuzzy Cluster Analysis, John Wiley and Sons, [8] R. Dave, Fuzzy shell-clustering and applications to circle detection of digital images, International Journal of General Systems, 16(4), pp ,

6 [9] L. Bobrowski and J. Bezdek, c-means with l1 and l1 norms, IEEE Transactions on Systems, Man and Cybernetics, 21(3), pp , 1991 [10] S. Nascimento, B. Mirkin and F. Moura-Pires, Multiple prototypes model for fuzzy clustering, in J. Kok, D. Hand and M. Berthold, (eds.), Advances in Intelligent Data Analysis. Third International Symposium,(IDA 99), Lecture Notes in Computer Science, 1642, Springer-Verlag, pp , [11] B. Mirkin and G. Satarov, Method of fuzzy additive types for analysis of multidimensional data: I, II, Automation and Remote Control, 51(5, 6), pp , pp , [13] Bobrowski L, Bezdek JC (1991) c-means clustering with the l1 and l norms. IEEE Transactions on Systems, Man and Cybernetics. 21(3): Chaturvedi A, Foods K, Green JE (2001) K-modes Clustering. Journal of Classification, 18: [14] Huang Z (1998) Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Mining and Knowledge Discovery, 2: [15] Kaufman L and Rousseeuw PJ (1987) Clustering by means of medoids. Elsevier, [12] B. Mirkin, Mathematical Classi cation and Clustering, Kluwer Academic Publishers,

Fuzzy Clustering: Insights and a New Approach

Mathware & Soft Computing 11 2004) 125-142 Fuzzy Clustering: Insights and a New Approach F. Klawonn Department of Computer Science University of Applied Sciences Braunschweig/Wolfenbuettel Salzdahlumer