RPKM: The Rough Possibilistic K-Modes

Asma Ammar 1, Zied Elouedi 1, and Pawan Lingras 2

1 LARODEC, Institut Supérieur de Gestion de Tunis, Université de Tunis, 41 Avenue de la Liberté, 2000 Le Bardo, Tunisie
asma.ammar@voila.fr, zied.elouedi@gmx.fr
2 Department of Mathematics and Computing Science, Saint Mary's University, Halifax, Nova Scotia, B3H 3C3, Canada
pawan@cs.smu.ca

Abstract. Clustering categorical data sets under an uncertain framework is a fundamental task in data mining. In this paper, we propose a new method, based on the k-modes clustering method and on rough set and possibility theories, for clustering objects into several clusters. Possibility theory handles the uncertainty in the belonging of objects to different clusters by specifying possibilistic membership degrees, while rough set theory detects and clusters peripheral objects using the upper and lower approximations. We introduce modifications to the standard version of the k-modes approach (SKM) to obtain the rough possibilistic k-modes method, denoted RPKM. These modifications make it possible to assign objects to clusters characterized by rough boundaries. Experimental results on benchmark UCI data sets indicate the effectiveness of the proposed RPKM.

1 Introduction

Clustering is an unsupervised learning technique whose main aim is to discover the structure of unlabeled data by grouping similar objects together. Clustering methods fall into two main categories: hard (or crisp) methods and soft methods. Crisp approaches assign each object of the training set to exactly one cluster. In contrast, in soft approaches, an object may belong to several clusters. Clustering objects into separate clusters is a difficult task because clusters may not have precise boundaries. To deal with this imperfection, many theories of uncertainty have been proposed.
We can mention the fuzzy set, possibility, and rough set theories, which have been used with different clustering methods to handle uncertainty [1] [5] [6] [11]. In this work, we develop the rough possibilistic k-modes method, denoted RPKM. This approach is based on the standard k-modes (SKM) and uses possibility and rough set theories to handle uncertainty in the belonging of objects to several clusters. Hence, it forms clusters with rough boundaries. The use of these uncertainty theories provides two main advantages: they express the degree of belongingness of each object to several clusters through possibilistic membership values, and they allow the detection of peripheral objects (i.e. objects that belong to several clusters) using the upper and lower approximations.

L. Chen et al. (Eds.): ISMIS 2012, LNAI 7661, pp. 81-86, 2012. © Springer-Verlag Berlin Heidelberg 2012
2 The K-Modes Method

The k-modes method (SKM) [9] [10] deals with large categorical data sets. It is based on the k-means [7] and uses the simple matching dissimilarity measure and a frequency-based update function to cluster the objects into k clusters. Assume that we have two objects X1 and X2 with m categorical attributes, defined respectively by X1 = (x11, x12, ..., x1m) and X2 = (x21, x22, ..., x2m). The simple matching measure, denoted d (0 ≤ d ≤ m), is defined in Equation (1):

d(X1, X2) = Σ_{t=1}^{m} δ(x1t, x2t).   (1)

Note that δ(x1t, x2t) is equal to 0 if x1t = x2t and equal to 1 otherwise. Thus, d = 0 if all the attribute values of X1 and X2 are identical, and d = m if the two objects share no attribute values. Generally, given a set of n objects S = {X1, X2, ..., Xn} with k-modes Q = {Q1, Q2, ..., Qk} for a set of k clusters C = {C1, C2, ..., Ck}, with k ≤ n, the clustering cost function to minimize is

min D(W, Q) = Σ_{j=1}^{k} Σ_{i=1}^{n} ω_{i,j} d(X_i, Q_j),

where W is an n × k partition matrix and ω_{i,j} ∈ {0, 1} is the membership degree of X_i in C_j.

3 Possibility and Rough Set Theories

3.1 Possibility Theory

Possibility Distribution. Let Ω = {ω1, ω2, ..., ωn} be the universe of discourse, where ωi is an element (an event or a state) from Ω [12]. The possibilistic scale, denoted L, is defined in the quantitative setting by [0, 1]. A fundamental concept in possibility theory is the possibility distribution function, denoted π. It is defined from the set Ω to L and associates to each element ωi ∈ Ω a value from L. In addition, we mention normalization, expressed by max_i {π(ωi)} = 1; complete knowledge, defined by ∃ω0, π(ω0) = 1 and π(ω) = 0 otherwise; and total ignorance, defined by ∀ω ∈ Ω, π(ω) = 1.

3.2 Rough Set Theory

Information System. Data sets used in rough set theory (RST) are presented through a table known as an information table.
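As a small illustration, the simple matching dissimilarity of Equation (1) above can be sketched in Python (a minimal sketch; the function and variable names, and the toy attribute values, are ours, not from the paper):

```python
def simple_matching(x1, x2):
    """Equation (1): count of attribute positions where two objects differ."""
    assert len(x1) == len(x2), "objects must have the same m attributes"
    return sum(1 for a, b in zip(x1, x2) if a != b)

# Two objects with m = 4 categorical attributes (toy values):
X1 = ("red", "small", "round", "soft")
X2 = ("red", "large", "round", "hard")
print(simple_matching(X1, X2))  # prints 2: the objects differ on two attributes
```

Since d only counts mismatches, it stays between 0 (identical objects) and m (no common values), as stated above.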
Generally, an information system (IS) is a pair S = (U, A), where U and A are finite and nonempty sets: U is the universe and A is the set of attributes. The value set of a, also called the domain of a, is denoted V_a and defined for every a ∈ A such that a : U → V_a.

Indiscernibility Relation. Assume that S = (U, A) is an IS. The equivalence relation IND_S(B), for any B ⊆ A, is defined in Equation (2):

IND_S(B) = {(x, y) ∈ U² : ∀a ∈ B, a(x) = a(y)}.   (2)
Here, IND_S(B) is the B-indiscernibility relation, and a(x) and a(y) denote respectively the value of attribute a for the elements x and y.

Approximation of Sets. Suppose we have an IS S = (U, A), B ⊆ A, and Y ⊆ U. The set Y can be described through the attribute values from B using two sets, the B-upper approximation B̄(Y) and the B-lower approximation B̲(Y) of Y:

B̄(Y) = ∪_{y ∈ U} {B(y) : B(y) ∩ Y ≠ ∅}.   (3)

B̲(Y) = ∪_{y ∈ U} {B(y) : B(y) ⊆ Y}.   (4)

By B(y) we denote the equivalence class of B identified by the element y. An equivalence class of B describes elementary knowledge, called a granule. The B-boundary region of Y is defined by BN_B(Y) = B̄(Y) − B̲(Y).

4 Rough Possibilistic K-Modes

The aim of the RPKM is to handle uncertainty in the belonging of objects to several clusters through possibilistic membership degrees, and to detect peripheral objects in the clustering task using rough sets. In several cases, an object may be similar to different clusters and belong to each of them to a different degree; this happens when the values of an object are highly similar to the values of several modes. Assigning such an object to exactly one cluster is difficult, or even impossible in some situations, and can make the clustering results inaccurate. To avoid this limitation, the RPKM defines possibilistic memberships, using possibility theory, to specify the degree of belongingness of each object to the different clusters. The RPKM then derives clusters with rough boundaries by applying the upper and lower approximations: each object is assigned to an upper or a lower approximation according to its possibilistic membership.

4.1 The RPKM Parameters

1. The simple matching dissimilarity measure: like the SKM, the RPKM deals with categorical and certain attribute values, so the simple matching measure of Equation (1) is applied.
It indicates how dissimilar the objects are from the clusters by comparing their attribute values.
2. The possibilistic membership degree: it represents the degree of belongingness of each object of the training set to the available clusters. It is denoted ω_ij, where i and j index respectively the object and the cluster, and it expresses the degree of similarity between the object and the cluster. To obtain this possibilistic membership, defined in [0, 1], we transform the dissimilarity value given by Equation (1) into a similarity value, with similarity = total number of attributes − dissimilarity. After that, we normalize the obtained result.
3. The update of the cluster modes: it uses Equation (5).

∀j ≤ k, ∀t ∈ A, Mode_jt = argmax_v Σ_{i=1}^{n} ω_{ijtv},   (5)

where ∀i ≤ n, max_j(ω_ij) = 1; ω_{ijtv} is the possibilistic membership degree of object i relative to cluster j, restricted to the objects taking value v for attribute t; and A is the set of attributes.
4. The derivation of the rough clusters from the possibilistic memberships: we adapt the ratio of [3] [4] to specify to which region each peripheral object belongs. After the final ω_ij of each object is computed, we compute the ratio defined in Equation (6):

ratio_ij = max_j(ω_ij) / ω_ij.   (6)

The ratio relative to each object is then compared to a threshold T ≥ 1 [3] [4]. If ratio_ij ≤ T, the object i belongs to the upper bound of the cluster j. If an object belongs to the upper bound of exactly one cluster j, then it also belongs to the lower bound of j. Note that every object in the data set satisfies the rough set properties [3].

4.2 The RPKM Algorithm

1. Randomly select the k initial modes, one mode per cluster.
2. Compute the dissimilarity between all objects and modes using Equation (1), then compute the membership degree of each object to the k clusters.
3. Allocate each object to the k clusters using the possibilistic memberships.
4. Update the cluster modes using Equation (5).
5. Retest the similarity between objects and modes. Reallocate objects to clusters using the possibilistic membership degrees, then update the modes.
6. Repeat step 5 until all objects are stable.
7. Derive the rough clustering from the possibilistic membership degrees by computing the ratio of each object using Equation (6) and assigning each object to the upper or the lower bound of a cluster.

5 Experiments

5.1 The Framework

In order to test the RPKM, we have used several real-world data sets taken from the UCI machine learning repository [8].
They consist of Shuttle Landing Control (SLC), Balloons (Bal), Post-Operative Patient (POP), Congressional Voting Records (CVR), Balance Scale (BS), Tic-Tac-Toe Endgame (TE), Solar-Flare (SF) and Car Evaluation (CE).
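Before turning to the results, the core RPKM steps of Sections 4.1 and 4.2 (possibilistic memberships derived from Equation (1), the mode update of Equation (5), and the ratio test of Equation (6)) can be sketched as follows. This is a minimal illustration under our own naming; the threshold value T = 1.5 and the possibilistic reading of "normalize" (scale so the largest membership is 1) are our assumptions, not values taken from the paper:

```python
def dissim(x, mode):
    # Equation (1): simple matching dissimilarity
    return sum(1 for a, b in zip(x, mode) if a != b)

def memberships(x, modes, m):
    # Section 4.1, item 2: similarity = m - dissimilarity, then normalize
    # possibilistically so that the largest membership equals 1 (assumption)
    sims = [m - dissim(x, q) for q in modes]
    top = max(sims)
    # if the object matches no mode at all, fall back to total ignorance
    return [s / top if top else 1.0 for s in sims]

def update_mode(objects, weights, m):
    # Equation (5): for each attribute t, keep the value v that maximizes
    # the sum of possibilistic memberships of the objects taking v at t
    mode = []
    for t in range(m):
        score = {}
        for obj, w in zip(objects, weights):
            score[obj[t]] = score.get(obj[t], 0.0) + w
        mode.append(max(score, key=score.get))
    return tuple(mode)

def rough_assign(w, T=1.5):
    # Equation (6): ratio_ij = max_j(w_ij) / w_ij; object i belongs to the
    # upper bound of cluster j when ratio_ij <= T, and to the lower bound
    # of j when j is the only such cluster
    top = max(w)
    upper = [j for j, wij in enumerate(w) if wij > 0 and top / wij <= T]
    lower = upper if len(upper) == 1 else []
    return upper, lower

# Toy run with m = 2 categorical attributes and k = 2 modes (values ours):
modes = [("a", "x"), ("b", "y")]
for obj in [("a", "x"), ("a", "y"), ("b", "y")]:
    w = memberships(obj, modes, m=2)
    print(obj, w, rough_assign(w))
```

An object whose ratio passes the test for more than one cluster, such as ("a", "y") here, ends up only in the upper bounds, i.e. in the rough boundary region.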
5.2 Evaluation Criteria

The evaluation criteria consist of the accuracy (AC), the iteration number (IN), and the execution time (ET). The accuracy AC = (Σ_{l=1}^{k} a_l) / n is the rate of correctly classified objects, where n is the total number of objects and a_l is the number of objects correctly classified in cluster C_l. It can be verified that the objects with the highest degree are in the correct clusters. The IN denotes the number of iterations needed to classify the objects into k rough clusters. The ET is the time taken to form the k rough clusters and to classify the objects.

5.3 Experimental Results

In this section, we compare the RPKM with the SKM and with the KM-PM (the k-modes method based on possibilistic membership) proposed in [2]. The KM-PM is an improved version of the SKM in which each object is assigned to all the clusters with different memberships; it specifies how similar each object is to the different clusters, but it cannot detect the boundary region computed through the set approximations, as the RPKM does.

Table 1. The evaluation criteria of RPKM vs. SKM and KM-PM

Data sets        SLC    Bal    POP    CVR    BS     TE      SF       CE
SKM    AC        0.61   0.52   0.684  0.825  0.785  0.513   0.87     0.795
       IN        8      9      11     12     13     12      14       11
       ET/s      12.43  14.55  17.23  29.66  37.81  128.98  2661.63  3248.61
KM-PM  AC        0.63   0.65   0.74   0.79   0.82   0.59    0.91     0.87
       IN        4      4      8      6      2      10      12       12
       ET/s      10.28  12.56  15.23  28.09  31.41  60.87   87.39    197.63
RPKM   AC        0.67   0.68   0.77   0.83   0.88   0.61    0.94     0.91
       IN        4      4      8      6      2      10      12       12
       ET/s      11.04  13.14  16.73  29.11  35.32  70.12   95.57    209.68

As shown in Table 1, the RPKM improves on the clustering of both the SKM and the KM-PM. Both the KM-PM and the RPKM allow objects to belong to several clusters, in contrast to the SKM, which forces each object into exactly one cluster. This difference in behavior leads to different clustering results. Generally, the RPKM and the KM-PM provide better results than the SKM on the three evaluation criteria.
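For concreteness, the AC criterion of Section 5.2 can be computed as below (the per-cluster counts are invented for illustration; they are not taken from Table 1):

```python
def accuracy(correct_per_cluster, n):
    # AC = (sum over l = 1..k of a_l) / n, where a_l is the number of
    # objects correctly classified in cluster C_l
    return sum(correct_per_cluster) / n

# k = 3 clusters, n = 100 objects, with 40, 35 and 19 correct assignments:
print(accuracy([40, 35, 19], 100))  # prints 0.94
```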
Furthermore, we observe that the RPKM gives the most accurate results for all data sets. Moving to the second evaluation criterion, the IN, the KM-PM and the RPKM need the same number of iterations to cluster the objects, produce the final partitions, and detect the peripheral objects. For the last criterion, the execution time of the RPKM is higher than that of the KM-PM, since our approach needs additional time to detect the boundary regions and to specify to which bound (upper or lower) each object belongs. We can also observe that the ET of the RPKM is lower than that of the SKM. This is due to the time taken by the SKM to assign each object to a distinct cluster, which slows down the SKM algorithm. Moreover, in the SKM it is possible to obtain several candidate modes for a particular cluster, which leads to a random choice; this may affect the stability of the partition and, as a result, increase the execution time. Overall, the RPKM improves the clustering task by providing more accurate results through the detection of clusters with rough boundaries.

6 Conclusion

In this paper, we have addressed uncertainty in the clustering task by combining the SKM with the possibility and rough set theories. This combination is realized in the RPKM, which successfully clusters objects using possibilistic membership degrees and detects objects that belong to rough clusters. The RPKM has been tested and evaluated on several data sets from the UCI machine learning repository [8]. Experimental results on these well-known data sets show the effectiveness of our method compared to the SKM and the KM-PM.

References

1. Ammar, A., Elouedi, Z.: A New Possibilistic Clustering Method: The Possibilistic K-Modes. In: Pirrone, R., Sorbello, F. (eds.) AI*IA 2011. LNCS, vol. 6934, pp. 413-419. Springer, Heidelberg (2011)
2. Ammar, A., Elouedi, Z., Lingras, P.: K-Modes Clustering Using Possibilistic Membership. In: Greco, S., Bouchon-Meunier, B., Coletti, G., Fedrizzi, M., Matarazzo, B., Yager, R.R. (eds.) IPMU 2012, Part III. CCIS, vol. 299, pp. 596-605. Springer, Heidelberg (2012)
3. Joshi, M., Lingras, P., Rao, C.R.: Correlating Fuzzy and Rough Clustering. Fundamenta Informaticae (2011) (in press)
4. Lingras, P., Nimse, S., Darkunde, N., Muley, A.: Soft clustering from crisp clustering using granulation for mobile call mining. In: Proceedings of GrC 2011: International Conference on Granular Computing, pp. 410-416 (2011)
5. Lingras, P., West, C.: Interval Set Clustering of Web Users with Rough K-means. Journal of Intelligent Information Systems 23, 5-16 (2004)
6. Lingras, P., Hogo, M., Snorek, M., Leonard, B.: Clustering Supermarket Customers Using Rough Set Based Kohonen Networks.
In: Zhong, N., Raś, Z.W., Tsumoto, S., Suzuki, E. (eds.) ISMIS 2003. LNCS (LNAI), vol. 2871, pp. 169-173. Springer, Heidelberg (2003)
7. MacQueen, J.B.: Some methods for classification and analysis of multivariate observations. In: Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, pp. 281-296 (1967)
8. Murphy, P.M., Aha, D.W.: UCI Repository of Machine Learning Databases (1996), http://www.ics.uci.edu/mlearn
9. Huang, Z.: Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Mining and Knowledge Discovery 2, 283-304 (1998)
10. Huang, Z., Ng, M.K.: A note on k-modes clustering. Journal of Classification 20, 257-261 (2003)
11. Pal, N.R., Pal, K., Keller, J.M., Bezdek, J.C.: A possibilistic fuzzy c-means clustering algorithm. IEEE Transactions on Fuzzy Systems, 517-530 (2005)
12. Zadeh, L.A.: Fuzzy sets as a basis for a theory of possibility. Fuzzy Sets and Systems 1, 3-28 (1978)