RPKM: The Rough Possibilistic K-Modes

Similar documents
Collaborative Rough Clustering

Qualitative classification and evaluation in possibilistic decision trees

A Generalized Decision Logic Language for Granular Computing

HARD, SOFT AND FUZZY C-MEANS CLUSTERING TECHNIQUES FOR TEXT CLASSIFICATION

On Generalizing Rough Set Theory

Mining High Order Decision Rules

Minimal Test Cost Feature Selection with Positive Region Constraint

Belief Hierarchical Clustering

Granular Computing based on Rough Sets, Quotient Space Theory, and Belief Functions

RFCM: A Hybrid Clustering Algorithm Using Rough and Fuzzy Sets

HFCT: A Hybrid Fuzzy Clustering Method for Collaborative Tagging

Classification with Diffuse or Incomplete Information

Rough Sets, Neighborhood Systems, and Granular Computing

ROUGH SETS THEORY AND UNCERTAINTY INTO INFORMATION SYSTEM

EFFICIENT ATTRIBUTE REDUCTION ALGORITHM

A Rough Set Approach for Generation and Validation of Rules for Missing Attribute Values of a Data Set

CHAPTER 3 A FAST K-MODES CLUSTERING ALGORITHM TO WAREHOUSE VERY LARGE HETEROGENEOUS MEDICAL DATABASES

A fuzzy k-modes algorithm for clustering categorical data. Citation IEEE Transactions on Fuzzy Systems, 1999, v. 7 n. 4, p.

A New Method For Forecasting Enrolments Combining Time-Variant Fuzzy Logical Relationship Groups And K-Means Clustering

Mining Surgical Meta-actions Effects with Variable Diagnoses Number

A Rough Set Approach to Data with Missing Attribute Values

QUALITY OF CLUSTER INDEX BASED ON STUDY OF DECISION TREE

Determination of Similarity Threshold in Clustering Problems for Large Data Sets

Semi-Supervised Clustering with Partial Background Information

Data with Missing Attribute Values: Generalization of Indiscernibility Relation and Rule Induction

A framework for fuzzy models of multiple-criteria evaluation

APPLICATION OF THE FUZZY MIN-MAX NEURAL NETWORK CLASSIFIER TO PROBLEMS WITH CONTINUOUS AND DISCRETE ATTRIBUTES

Approximate Reasoning with Fuzzy Booleans

Approximation of Relations. Andrzej Skowron. Warsaw University. Banacha 2, Warsaw, Poland. Jaroslaw Stepaniuk

CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION

Interval Sets and Interval-Set Algebras

A Comparison of Global and Local Probabilistic Approximations in Mining Data with Many Missing Attribute Values

S-APPROXIMATION SPACES: A FUZZY APPROACH

Mining Local Association Rules from Temporal Data Set

Enhancing K-means Clustering Algorithm with Improved Initial Center

A Logic Language of Granular Computing

ROUGH MEMBERSHIP FUNCTIONS: A TOOL FOR REASONING WITH UNCERTAINTY

The Rough Set View on Bayes Theorem

Formal Concept Analysis and Hierarchical Classes Analysis

Induction of Strong Feature Subsets

Using Decision Boundary to Analyze Classifiers

self-organizing maps and symbolic data

Granular Computing: A Paradigm in Information Processing Saroj K. Meher Center for Soft Computing Research Indian Statistical Institute, Kolkata

UNSUPERVISED STATIC DISCRETIZATION METHODS IN DATA MINING. Daniela Joiţa Titu Maiorescu University, Bucharest, Romania

A Parallel Evolutionary Algorithm for Discovery of Decision Rules

REDUNDANCY OF MULTISET TOPOLOGICAL SPACES

Some questions of consensus building using co-association

Optimization with linguistic variables

Applying Rough Set Concepts to Clustering

Discretizing Continuous Attributes Using Information Theory

Rough Approximations under Level Fuzzy Sets

Modeling the Real World for Data Mining: Granular Computing Approach

6. Dicretization methods 6.1 The purpose of discretization

Efficient SQL-Querying Method for Data Mining in Large Data Bases

Fuzzy Set-Theoretical Approach for Comparing Objects with Fuzzy Attributes

On Solving Fuzzy Rough Linear Fractional Programming Problem

Rough Set Approaches to Rule Induction from Incomplete Data

A Modular Reduction Method for k-nn Algorithm with Self-recombination Learning

Chapter 4 Fuzzy Logic

Dynamic Clustering of Data with Modified K-Means Algorithm

A Weighted Majority Voting based on Normalized Mutual Information for Cluster Analysis

Moving Object Segmentation Method Based on Motion Information Classification by X-means and Spatial Region Segmentation

Fuzzy Sets and Systems. Lecture 1 (Introduction) Bu- Ali Sina University Computer Engineering Dep. Spring 2010

Attribute Reduction using Forward Selection and Relative Reduct Algorithm

Granular Computing. Y. Y. Yao

An Accelerated MapReduce-based K-prototypes for Big Data

A Decision-Theoretic Rough Set Model

American International Journal of Research in Science, Technology, Engineering & Mathematics

Semantics of Fuzzy Sets in Rough Set Theory

Multiple-Criteria Fuzzy Evaluation: The FuzzME Software Package

Swarm Based Fuzzy Clustering with Partition Validity

Package SoftClustering

Fuzzy Queueing Model Using DSW Algorithm

A study on lower interval probability function based decision theoretic rough set models

Data Analysis and Mining in Ordered Information Tables

Strong Chromatic Number of Fuzzy Graphs

Similarity Measures of Pentagonal Fuzzy Numbers

Application of Fuzzy Classification in Bankruptcy Prediction

On the packing chromatic number of some lattices

Learning the Three Factors of a Non-overlapping Multi-camera Network Topology

Sequences Modeling and Analysis Based on Complex Network

Advances in Fuzzy Rough Set Theory for Temporal Databases

Feature Selection with Adjustable Criteria

On Reduct Construction Algorithms

Efficient Rule Set Generation using K-Map & Rough Set Theory (RST)

Supervised vs. Unsupervised Learning

ECM A Novel On-line, Evolving Clustering Method and Its Applications

Cost Minimization Fuzzy Assignment Problem applying Linguistic Variables

Data Mining & Feature Selection

Information Granulation and Approximation in a Decision-theoretic Model of Rough Sets

A Novel Fuzzy Rough Granular Neural Network for Classification

10701 Machine Learning. Clustering

Cluster analysis of 3D seismic data for oil and gas exploration

A Proposed Approach for Solving Rough Bi-Level. Programming Problems by Genetic Algorithm

Estimating Missing Attribute Values Using Dynamically-Ordered Attribute Trees

RPKM: The Rough Possibilistic K-Modes

Asma Ammar 1, Zied Elouedi 1, and Pawan Lingras 2
1 LARODEC, Institut Supérieur de Gestion de Tunis, Université de Tunis, 41 Avenue de la Liberté, 2000 Le Bardo, Tunisie
asma.ammar@voila.fr, zied.elouedi@gmx.fr
2 Department of Mathematics and Computing Science, Saint Mary's University, Halifax, Nova Scotia, B3H 3C3, Canada
pawan@cs.smu.ca

Abstract. Clustering categorical data sets under an uncertain framework is a fundamental task in data mining. In this paper, we propose a new method based on the k-modes clustering method that uses rough set and possibility theories to cluster objects into several clusters. While possibility theory handles the uncertainty in the belonging of objects to different clusters by specifying possibilistic membership degrees, rough set theory detects and clusters peripheral objects using the upper and lower approximations. We introduce modifications to the standard version of the k-modes approach (SKM) to obtain the rough possibilistic k-modes method, denoted RPKM. These modifications make it possible to assign objects to different clusters characterized by rough boundaries. Experimental results on benchmark UCI data sets indicate the effectiveness of our proposed method, i.e. the RPKM.

1 Introduction

Clustering is an unsupervised learning technique whose main aim is to discover the structure of unlabeled data by grouping similar objects together. There are two main categories of clustering methods: hard (or crisp) methods and soft methods. Crisp approaches assign each object of the training set to exactly one cluster. In contrast, in soft approaches objects may belong to several clusters. Clustering objects into separate clusters is a difficult task because clusters may not have precise boundaries. In order to deal with this imperfection, many theories of uncertainty have been proposed.
We can mention the fuzzy set, possibility, and rough set theories, which have been used with different clustering methods to handle uncertainty [1] [5] [6] [11]. In this work, we develop the rough possibilistic k-modes method, denoted RPKM. This approach is based on the standard k-modes (SKM) and uses possibility and rough set theories to handle uncertainty in the belonging of objects to several clusters. Hence, it forms clusters with rough boundaries. The use of these uncertainty theories provides several advantages: they can express the degree of belongingness of each object to several clusters using possibilistic membership values, and they allow the detection of peripheral objects (i.e. objects that belong to several clusters) using the upper and lower approximations.

L. Chen et al. (Eds.): ISMIS 2012, LNAI 7661, pp. 81-86, 2012. © Springer-Verlag Berlin Heidelberg 2012

2 The K-Modes Method

The k-modes method (SKM) [9] [10] deals with large categorical data sets. It is based on the k-means [7] and uses the simple matching dissimilarity measure and a frequency-based function to cluster the objects into k clusters. Assume we have two objects X_1 and X_2 with m categorical attributes, defined respectively by X_1 = (x_{11}, x_{12}, ..., x_{1m}) and X_2 = (x_{21}, x_{22}, ..., x_{2m}). The simple matching measure, denoted d (0 \le d \le m), is given in Equation (1):

d(X_1, X_2) = \sum_{t=1}^{m} \delta(x_{1t}, x_{2t}). (1)

Note that \delta(x_{1t}, x_{2t}) is equal to 0 if x_{1t} = x_{2t} and equal to 1 otherwise. Hence, d = 0 if all the attribute values of X_1 and X_2 are identical, and d = m if there is no similarity between them. Generally, given a set of n objects S = {X_1, X_2, ..., X_n} with its k modes Q = {Q_1, Q_2, ..., Q_k} for a set of k clusters C = {C_1, C_2, ..., C_k}, we can aggregate it into k clusters with k \le n. The clustering cost function to minimize is

min D(W, Q) = \sum_{j=1}^{k} \sum_{i=1}^{n} \omega_{i,j} d(X_i, Q_j),

where W is an n \times k partition matrix and \omega_{i,j} \in {0, 1} is the membership degree of X_i in C_j.

3 Possibility and Rough Set Theories

3.1 Possibility Theory

Possibility Distribution. Let \Omega = {\omega_1, \omega_2, ..., \omega_n} be the universe of discourse, where \omega_i is an element (an event or a state) from \Omega [12]. The possibilistic scale, denoted L, is defined in the quantitative setting by [0, 1]. A fundamental concept in possibility theory is the possibility distribution function, denoted \pi. It is defined from the set \Omega to L and associates to each element \omega_i \in \Omega a value from L. We also mention the normalization condition, max_i {\pi(\omega_i)} = 1; complete knowledge, defined by \exists \omega_0 such that \pi(\omega_0) = 1 and \pi(\omega) = 0 otherwise; and total ignorance, defined by \forall \omega \in \Omega, \pi(\omega) = 1.

3.2 Rough Set Theory

Information System. Data sets used in RST are presented through a table known as an information table.
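As a quick illustration of the simple matching measure of Equation (1), here is a minimal Python sketch; the example attribute values are our own, not from the paper:

```python
def simple_matching(x1, x2):
    """Simple matching dissimilarity d(X1, X2) of Equation (1):
    the number of attributes on which the two objects disagree."""
    assert len(x1) == len(x2)
    return sum(1 for a, b in zip(x1, x2) if a != b)

# d ranges from 0 (all m attribute values identical) to m (no value shared)
print(simple_matching(("red", "round", "small"), ("red", "square", "small")))  # -> 1
print(simple_matching(("red", "round", "small"), ("red", "round", "small")))   # -> 0
```

The SKM cost D(W, Q) then sums this distance, weighted by the crisp memberships, over all objects and modes.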
Generally, an information system (IS) is a pair S = (U, A), where U and A are finite and nonempty sets: U is the universe and A is the set of attributes. The value set of a, also called the domain of a, is denoted V_a and defined for every a \in A such that a : U \to V_a.

Indiscernibility Relation. Assume that S = (U, A) is an IS. The equivalence relation IND_S(B), for any B \subseteq A, is defined in Equation (2):

IND_S(B) = {(x, y) \in U^2 | \forall a \in B, a(x) = a(y)}. (2)

Here IND_S(B) is the B-indiscernibility relation, and a(x) and a(y) denote the value of attribute a for the elements x and y, respectively.

Approximation of Sets. Suppose we have an IS S = (U, A), B \subseteq A and Y \subseteq U. The set Y can be described through the attribute values from B using two sets, called the B-upper approximation \overline{B}(Y) and the B-lower approximation \underline{B}(Y) of Y:

\overline{B}(Y) = \bigcup_{y \in U} {B(y) : B(y) \cap Y \neq \emptyset}. (3)

\underline{B}(Y) = \bigcup_{y \in U} {B(y) : B(y) \subseteq Y}. (4)

By B(y) we denote the equivalence class of B identified by the element y. An equivalence class of B describes elementary knowledge called a granule. The B-boundary region of Y is BN_B(Y) = \overline{B}(Y) - \underline{B}(Y).

4 Rough Possibilistic K-Modes

The aim of the RPKM is to deal with uncertainty in the belonging of objects to several clusters based on possibilistic membership degrees, and to detect peripheral objects in the clustering task using rough sets. In several cases an object can be similar to different clusters and belong to each of them with a different degree; this can be caused by high similarities between the values of the modes and the object. Assigning such an object to exactly one cluster is difficult, and even impossible in some situations; besides, it can make the clustering results inaccurate. To avoid this limitation, the RPKM defines possibilistic memberships, using possibility theory, to specify the degree of belongingness of each object to the different clusters. It then derives clusters with rough boundaries by applying the upper and lower approximations. Thus, an object is assigned to an upper or a lower approximation according to its possibilistic membership.

4.1 The RPKM Parameters

1. The simple matching dissimilarity measure: The RPKM deals with categorical and certain attribute values as the SKM does, so the simple matching measure of Equation (1) is applied.
It indicates how dissimilar the objects are from the clusters by comparing their attribute values.

2. The possibilistic membership degree: It represents the degree of belongingness of each object of the training set to the available clusters. It is denoted \omega_{ij}, where i and j index the object and the cluster, respectively; \omega_{ij} expresses the degree of similarity between an object and a cluster. To obtain this possibilistic membership, which lies in [0, 1], we transform the dissimilarity value obtained through Equation (1) into a similarity value such that similarity = total number of attributes - dissimilarity. After that, we normalize the obtained result.
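The paper does not spell out the normalization step; one plausible reading, dividing each similarity by the object's largest similarity so that max_j \omega_{ij} = 1 (matching the normalization condition of possibility theory), can be sketched as follows. The function name and example values are our own:

```python
def possibilistic_memberships(obj, modes):
    """Possibilistic degrees w_ij in [0, 1] for one object against all modes.

    similarity = m - d (total number of attributes minus the simple
    matching dissimilarity); dividing by the largest similarity makes
    the object's highest membership equal to 1.  This normalization is
    an assumption, not taken verbatim from the paper.
    """
    m = len(obj)
    sims = [m - sum(1 for a, b in zip(obj, mode) if a != b) for mode in modes]
    top = max(sims)
    return [s / top for s in sims] if top > 0 else [1.0] * len(modes)

modes = [("red", "round", "small"), ("blue", "square", "big")]
print(possibilistic_memberships(("red", "round", "big"), modes))  # -> [1.0, 0.5]
```

The object above shares two attribute values with the first mode and one with the second, so it belongs fully to the first cluster and partially to the second.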

3. The update of the cluster modes: It uses Equation (5):

\forall j \le k, \forall t \le A, Mode_{jt} = arg max_v \sum_{i=1}^{n} \omega_{ijtv}, (5)

where \forall i \le n, max_j(\omega_{ij}) = 1, \omega_{ijtv} is the possibilistic membership degree of the object i relative to the cluster j defined for the value v of the attribute t, and A is the total number of attributes.

4. The derivation of the rough clusters from the possibilistic memberships: We adapt the ratio of [3] [4] to specify to which region each peripheral object belongs. After the final \omega_{ij} of each object is fixed, we compute the ratio defined by Equation (6):

ratio_{ij} = max_j(\omega_{ij}) / \omega_{ij}. (6)

The ratio relative to each object is then compared to a threshold T \ge 1 [3] [4]. If ratio_{ij} \le T, the object i belongs to the upper bound of the cluster j. If an object belongs to the upper bound of exactly one cluster j, then it belongs to the lower bound of j. Note that every object in the data set satisfies the rough set properties [3].

4.2 The RPKM Algorithm

1. Randomly select the k initial modes, one mode for each cluster.
2. Compute the dissimilarity between all objects and modes using Equation (1), then determine the membership degree of each object to the k clusters.
3. Allocate each object to the k clusters using its possibilistic memberships.
4. Update the cluster modes using Equation (5).
5. Retest the similarity between objects and modes; reallocate objects to clusters using the possibilistic membership degrees, then update the modes.
6. Repeat step 5 until all objects are stable.
7. Derive the rough clustering from the possibilistic membership degrees by computing the ratio of each object using Equation (6) and assigning each object to the upper or the lower bound of a cluster.

5 Experiments

5.1 The Framework

In order to test the RPKM, we used several real-world data sets taken from the UCI machine learning repository [8].
They consist of Shuttle Landing Control (SLC), Balloons (Bal), Post-Operative Patient (POP), Congressional Voting Records (CVR), Balance Scale (BS), Tic-Tac-Toe Endgame (TE), Solar-Flare (SF) and Car Evaluation (CE).
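Putting the pieces together, the RPKM loop of Section 4.2 together with the ratio test of Equation (6) can be sketched roughly as below. This is our own simplified reading: the function names, the toy data, and the threshold value T = 1.5 are illustrative, not from the paper.

```python
import random

def dissim(x, y):
    # Equation (1): simple matching dissimilarity
    return sum(1 for a, b in zip(x, y) if a != b)

def memberships(obj, modes):
    # Normalized possibilistic degrees; max_j w_ij = 1 (our assumption)
    sims = [len(obj) - dissim(obj, mode) for mode in modes]
    top = max(sims)
    return [s / top if top else 1.0 for s in sims]

def rpkm(data, k, T=1.5, max_iter=20, seed=0):
    """Simplified rough possibilistic k-modes (a sketch of Section 4.2)."""
    rng = random.Random(seed)
    modes = rng.sample(data, k)                 # step 1: random initial modes
    m = len(data[0])
    for _ in range(max_iter):                   # steps 2-6
        W = [memberships(x, modes) for x in data]
        # Equation (5): per attribute, keep the value with largest summed membership
        new_modes = []
        for j in range(k):
            mode = []
            for t in range(m):
                score = {}
                for i, x in enumerate(data):
                    score[x[t]] = score.get(x[t], 0.0) + W[i][j]
                mode.append(max(score, key=score.get))
            new_modes.append(tuple(mode))
        if new_modes == modes:                  # objects are stable
            break
        modes = new_modes
    W = [memberships(x, modes) for x in data]
    # Step 7 / Equation (6): ratio_ij = max_j(w_ij) / w_ij, compared with T
    upper = [set() for _ in range(k)]
    lower = [set() for _ in range(k)]
    for i, w in enumerate(W):
        in_upper = [j for j in range(k) if w[j] > 0 and max(w) / w[j] <= T]
        for j in in_upper:
            upper[j].add(i)
        if len(in_upper) == 1:                  # upper bound of exactly one cluster
            lower[in_upper[0]].add(i)           # => also in its lower bound
    return modes, W, upper, lower

data = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")]
modes, W, upper, lower = rpkm(data, k=2)
```

Each object always lands in the upper bound of its best cluster (its ratio there is 1), and the lower bounds are, by construction, subsets of the upper bounds.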

5.2 Evaluation Criteria

The evaluation criteria consist of the accuracy (AC), the iteration number (IN) and the execution time (ET). The accuracy AC = (1/n) \sum_{l=1}^{k} a_l is the rate of correctly classified objects, where n is the total number of objects and a_l is the number of objects correctly classified in cluster C_l. It can be verified that the objects with the highest degree are in the correct clusters. The IN denotes the number of iterations needed to classify the objects into k rough clusters. The ET is the time taken to form the k rough clusters and to classify the objects.

5.3 Experimental Results

In this section, we compare the RPKM with the SKM and with the KM-PM (the k-modes method based on possibilistic membership) proposed in [2], an improved version of the SKM in which each object is assigned to all clusters with different memberships. The KM-PM specifies how similar each object is to the different clusters; however, it cannot detect the boundary region computed using the set approximations, as the RPKM does.

Table 1. The evaluation criteria of RPKM vs. SKM and KM-PM

Data sets      SLC    Bal    POP    CVR    BS     TE      SF       CE
SKM    AC      0.61   0.52   0.684  0.825  0.785  0.513   0.87     0.795
       IN      8      9      11     12     13     12      14       11
       ET/s    12.43  14.55  17.23  29.66  37.81  128.98  2661.63  3248.61
KM-PM  AC      0.63   0.65   0.74   0.79   0.82   0.59    0.91     0.87
       IN      4      4      8      6      2      10      12       12
       ET/s    10.28  12.56  15.23  28.09  31.41  60.87   87.39    197.63
RPKM   AC      0.67   0.68   0.77   0.83   0.88   0.61    0.94     0.91
       IN      4      4      8      6      2      10      12       12
       ET/s    11.04  13.14  16.73  29.11  35.32  70.12   95.57    209.68

As shown in Table 1, the RPKM improves on the clustering of both the SKM and the KM-PM. Both the KM-PM and the RPKM allow objects to belong to several clusters, in contrast to the SKM, which forces each object to belong to exactly one cluster. This difference in behavior leads to different clustering results. Generally, the RPKM and the KM-PM provide better results than the SKM on the three evaluation criteria.
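The accuracy criterion defined in Section 5.2 reduces to a one-line computation; a minimal sketch, where the per-cluster counts are made-up numbers, not taken from Table 1:

```python
def accuracy(correct_per_cluster, n):
    """AC = (1/n) * sum of a_l over the k clusters, where a_l is the
    number of correctly classified objects in cluster C_l."""
    return sum(correct_per_cluster) / n

# e.g. three clusters with 30, 25 and 20 correctly classified objects out of 100
print(accuracy([30, 25, 20], 100))  # -> 0.75
```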
Furthermore, we observe that the RPKM gives the most accurate results for all data sets. Moving to the second evaluation criterion, the IN, the KM-PM and the RPKM need the same number of iterations to cluster the objects, produce the final partitions, and detect the peripheral objects. However, for the last criterion, the execution time, the RPKM is slower than the KM-PM, since our approach needs additional time to detect the boundary regions and to specify to which bound (upper or lower) each object belongs. We can also observe that the ET of the RPKM is lower than the ET of the SKM. This is due to the time taken by the SKM to assign each object to a distinct cluster, which slows down the SKM algorithm. Moreover, in the SKM it is possible to obtain several modes for a particular

cluster, which leads to a random choice; this may affect the stability of the partition and, as a result, increase the execution time. Generally, the RPKM improves the clustering task by providing more accurate results through the detection of clusters with rough boundaries.

6 Conclusion

In this paper, we have addressed uncertainty in the clustering task by combining the SKM with the possibility and rough set theories. This combination is realized in the RPKM, which successfully clusters objects using possibilistic membership degrees and detects objects that belong to rough clusters. The RPKM has been tested and evaluated on several data sets from the UCI machine learning repository [8]. Experimental results on these well-known data sets show the effectiveness of our method compared to the SKM and the KM-PM.

References

1. Ammar, A., Elouedi, Z.: A New Possibilistic Clustering Method: The Possibilistic K-Modes. In: Pirrone, R., Sorbello, F. (eds.) AI*IA 2011. LNCS, vol. 6934, pp. 413-419. Springer, Heidelberg (2011)
2. Ammar, A., Elouedi, Z., Lingras, P.: K-Modes Clustering Using Possibilistic Membership. In: Greco, S., Bouchon-Meunier, B., Coletti, G., Fedrizzi, M., Matarazzo, B., Yager, R.R. (eds.) IPMU 2012, Part III. CCIS, vol. 299, pp. 596-605. Springer, Heidelberg (2012)
3. Joshi, M., Lingras, P., Rao, C.R.: Correlating Fuzzy and Rough Clustering. Fundamenta Informaticae (2011) (in press)
4. Lingras, P., Nimse, S., Darkunde, N., Muley, A.: Soft clustering from crisp clustering using granulation for mobile call mining. In: Proceedings of GrC 2011: International Conference on Granular Computing, pp. 410-416 (2011)
5. Lingras, P., West, C.: Interval Set Clustering of Web Users with Rough K-means. Journal of Intelligent Information Systems 23, 5-16 (2004)
6. Lingras, P., Hogo, M., Snorek, M., Leonard, B.: Clustering Supermarket Customers Using Rough Set Based Kohonen Networks.
In: Zhong, N., Raś, Z.W., Tsumoto, S., Suzuki, E. (eds.) ISMIS 2003. LNCS (LNAI), vol. 2871, pp. 169-173. Springer, Heidelberg (2003)
7. MacQueen, J.B.: Some methods for classification and analysis of multivariate observations. In: Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, pp. 281-296 (1967)
8. Murphy, P.M., Aha, D.W.: UCI repository of machine learning databases (1996), http://www.ics.uci.edu/mlearn
9. Huang, Z.: Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Mining and Knowledge Discovery 2, 283-304 (1998)
10. Huang, Z., Ng, M.K.: A note on k-modes clustering. Journal of Classification 20, 257-261 (2003)
11. Pal, N.R., Pal, K., Keller, J.M., Bezdek, J.C.: A possibilistic fuzzy c-means clustering algorithm. IEEE Transactions on Fuzzy Systems 13, 517-530 (2005)
12. Zadeh, L.A.: Fuzzy sets as a basis for a theory of possibility. Fuzzy Sets and Systems 1, 3-28 (1978)