Collaborative Rough Clustering

Sushmita Mitra, Haider Banka, and Witold Pedrycz

Machine Intelligence Unit, Indian Statistical Institute, Kolkata, India
{sushmita, hbanka r}@isical.ac.in
Dept. of Electrical & Computer Engg., University of Alberta, Edmonton, Canada TG G
pedrycz@ece.ualberta.ca

Abstract. A novel collaborative clustering is proposed through the use of rough sets. The Davies-Bouldin clustering validity index is extended to the rough framework, to generate the optimal number of clusters during collaboration.

Keywords: Rough sets, collaborative clustering, cluster validity.

1 Introduction

A cluster is a collection of data objects that are similar to one another within the same cluster but dissimilar to the objects in other clusters. Clustering large data is a mining problem of considerable interest. A recent trend in intelligent system design for large-scale problems is to split the original task into simpler subtasks and use a module for each of the subtasks. It has been shown that by combining the output of several modules in an ensemble, one can improve the generalization ability over that of a single model [].

Collaborative clustering deals with revealing a structure that is common or similar to a number of subsets []. For example, let us consider a large population of data about client information, distributed over multiple databases. An intelligent approach to mining such a large volume of information would be to analyze each individual database of a sub-population locally and subsequently combine (or collaborate on) the results at a globally abstract level. This also satisfies certain security (or privacy) concerns of clients, since individual data (or samples) need not be shared.

Recently various soft computing methodologies have been applied to handle the different challenges posed by data mining [], involving large heterogeneous datasets and clustering [].
Fuzzy sets and rough sets have been incorporated in the c-means framework to develop the fuzzy c-means (FCM) [] and rough c-means (RCM) [] algorithms. Typically, rough sets [] are used to model the clusters in terms of upper and lower approximations. Collaborative clustering has been investigated by Pedrycz [], using the FCM algorithm.

In this article we present a novel collaborative clustering through the use of rough sets. The Davies-Bouldin (DB) index is extended to the rough framework, and helps in determining the optimal number of clusters during collaboration. Section 2 provides the basic description of the c-means and RCM clustering algorithms, along with the modified DB index. Collaboration in the RCM framework is presented in Section 3. Experimental results are demonstrated in Section 4. Finally, Section 5 concludes the article.

S.K. Pal et al. (Eds.): PReMI, LNCS, pp.,. © Springer-Verlag Berlin Heidelberg

2 Rough Clustering

The c-means algorithm proceeds by partitioning $N$ objects into $c$ nonempty subsets. During each partition, the centroids or means of the clusters are computed. The main steps of the c-means algorithm [] are as follows:

1. Assign initial means $m_i$ (also called centroids).
2. Assign each data object $X_k$ to the cluster $U_i$ for the closest mean.
3. Compute new mean for each cluster using

$$m_i = \frac{\sum_{X_k \in U_i} X_k}{c_i}, \qquad (1)$$

   where $c_i$ is the number of objects in cluster $U_i$.
4. Iterate until the criterion function converges, i.e., there are no more new assignments.

The theory of rough sets [] has recently emerged as another major mathematical tool for managing uncertainty that arises from granularity in the domain of discourse. The intention is to approximate a rough (imprecise) concept in the domain of discourse by a pair of exact concepts, called the lower and upper approximations. These exact concepts are determined by an indiscernibility relation on the domain, which, in turn, may be induced by a given set of attributes ascribed to the objects of the domain. The lower approximation is the set of objects definitely belonging to the vague concept, whereas the upper approximation is the set of objects possibly belonging to the same.

In the rough c-means (RCM) algorithm, the concept of c-means is extended by viewing each cluster as an interval or rough set []. A rough set $Y$ is characterized by its lower and upper approximations $\underline{B}Y$ and $\overline{B}Y$ respectively. This permits overlaps between clusters. Here an object $X_k$ can be part of at most one lower approximation. If $X_k \in \underline{B}Y$ of cluster $Y$, then simultaneously $X_k \in \overline{B}Y$.
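As a small illustration of the lower and upper approximations described above, the following sketch computes both for a concept, given the equivalence classes of an indiscernibility relation. The function name `approximations` and the toy universe are hypothetical and chosen only for this example; they do not come from the paper.

```python
def approximations(blocks, concept):
    """Return (lower, upper) approximations of `concept`.

    blocks  : list of sets, the equivalence classes of the indiscernibility relation
    concept : set of objects to approximate
    """
    lower, upper = set(), set()
    for block in blocks:
        if block <= concept:   # block lies entirely inside the concept
            lower |= block
        if block & concept:    # block shares at least one object with the concept
            upper |= block
    return lower, upper


# Hypothetical toy universe {1, ..., 6} split into three indiscernibility classes.
blocks = [{1, 2}, {3, 4}, {5, 6}]
concept = {1, 2, 3}

lower, upper = approximations(blocks, concept)
print(sorted(lower))  # [1, 2]
print(sorted(upper))  # [1, 2, 3, 4]
```

Objects 3 and 4 cannot be told apart by the relation, so they fall in the boundary (upper but not lower approximation), which is the kind of overlap that RCM allows between clusters.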
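If $X_k$ is not a part of any lower approximation, then it belongs to two or more upper approximations. Adapting eqn. (1), the centroid $m_i$ of cluster $U_i$ is computed as

$$
m_i =
\begin{cases}
w_{low} \dfrac{\sum_{X_k \in \underline{B}U_i} X_k}{|\underline{B}U_i|} + w_{up} \dfrac{\sum_{X_k \in (\overline{B}U_i - \underline{B}U_i)} X_k}{|\overline{B}U_i - \underline{B}U_i|} & \text{if } \underline{B}U_i \neq \emptyset \wedge \overline{B}U_i - \underline{B}U_i \neq \emptyset \\[2ex]
\dfrac{\sum_{X_k \in \underline{B}U_i} X_k}{|\underline{B}U_i|} & \text{if } \overline{B}U_i - \underline{B}U_i = \emptyset \\[2ex]
\dfrac{\sum_{X_k \in (\overline{B}U_i - \underline{B}U_i)} X_k}{|\overline{B}U_i - \underline{B}U_i|} & \text{otherwise,}
\end{cases}
\qquad (2)
$$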

where the parameters $w_{low}$ and $w_{up}$ correspond to the relative importance of the lower and upper bounds respectively. Here $|\underline{B}U_i|$ indicates the number of patterns in the lower approximation of cluster $U_i$, while $|\overline{B}U_i - \underline{B}U_i|$ is the number of elements in the rough boundary lying between the two approximations. The RCM algorithm is outlined as follows.

1. Assign initial means $m_i$ for the $c$ clusters.
2. Assign each data object (pattern point) $X_k$ to the lower approximation $\underline{B}U_i$ or upper approximation $\overline{B}U_i$, $\overline{B}U_j$ of cluster pairs $U_i$, $U_j$ by computing the difference in its distance $d_{ik} - d_{jk}$ from cluster centroid pairs $m_i$ and $m_j$.
3. Let $d_{ik}$ be minimal. If $d_{jk} - d_{ik}$ is less than some threshold then $X_k \in \overline{B}U_i$ and $X_k \in \overline{B}U_j$ and $X_k$ cannot be a member of any lower approximation, else $X_k \in \underline{B}U_i$ such that distance $d_{ik}$ is minimum over the $c$ clusters.
4. Compute new mean for each cluster $U_i$ using eqn. (2).
5. Iterate until convergence, i.e., there are no more new assignments.

The expression in eqn. (2) boils down to eqn. (1) when the lower approximation is equal to the upper approximation, implying an empty boundary region. It is observed that the performance of the algorithm is dependent on the choice of $w_{low}$, $w_{up}$ and threshold. We allowed $w_{up} = 1 - w_{low}$, $0.5 < w_{low} < 1$ and $0 <$ threshold $< 0.5$.

2.1 Clustering Validity Index

The partitive clustering algorithms, described above, require prespecification of the number of clusters. The results are dependent on the choice of $c$. There exist validity indices to evaluate the goodness of clustering, corresponding to a given value of $c$. In this article we compute the optimal number of clusters $c$ in terms of the Davies-Bouldin cluster validity index [].

The Davies-Bouldin index (DB) is a function of the ratio of the sum of within-cluster distance to between-cluster separation. Let $\{x_1, \ldots, x_{c_k}\}$ be a set of patterns lying in a cluster $U_k$.
Then the average distance between objects within the cluster $U_k$ is expressed as

$$S(U_k) = \frac{\sum_{i, i'} \|x_i - x_{i'}\|}{c_k (c_k - 1)},$$

where $x_i, x_{i'} \in U_k$, and $i \neq i'$. The between-cluster separation is defined as

$$d(U_k, U_l) = \frac{\sum_{i, j} \|x_i - x_j\|}{c_k c_l},$$

where $x_i \in U_k$, $x_j \in U_l$, such that $k \neq l$. The optimal clustering, for $c = c_0$, minimizes

$$DB = \frac{1}{c} \sum_{i=1}^{c} \max_{j \neq i} \left\{ \frac{S(U_i) + S(U_j)}{d(U_i, U_j)} \right\}, \qquad (3)$$

for $1 \leq i, j \leq c$. Thereby, the within-cluster distance $S(U_i)$ is minimized while the between-cluster separation $d(U_i, U_j)$ gets maximized.
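The index of eqn. (3) can be sketched in a few lines of NumPy. This is a minimal illustration of the formulas as written above (average pairwise distances, not the centroid-based variant found in some libraries); the function names and the toy data are hypothetical, and treating a singleton cluster as having zero within-cluster distance is an assumption made here.

```python
import numpy as np


def within_cluster_distance(X):
    """S(U_k): average distance over ordered pairs of distinct points in a cluster."""
    n = len(X)
    if n < 2:
        return 0.0  # assumption: a singleton cluster has zero scatter
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    return dists.sum() / (n * (n - 1))  # diagonal entries are zero


def between_cluster_separation(Xk, Xl):
    """d(U_k, U_l): average distance over all cross-cluster pairs."""
    return np.linalg.norm(Xk[:, None, :] - Xl[None, :, :], axis=2).mean()


def davies_bouldin(clusters):
    """DB of eqn. (3) for a list of (n_i, dim) arrays, one per cluster."""
    c = len(clusters)
    S = [within_cluster_distance(X) for X in clusters]
    total = 0.0
    for i in range(c):
        total += max(
            (S[i] + S[j]) / between_cluster_separation(clusters[i], clusters[j])
            for j in range(c) if j != i
        )
    return total / c


# Hypothetical toy data: the same four points partitioned well and badly.
good = [np.array([[0.0, 0.0], [0.0, 1.0]]), np.array([[10.0, 0.0], [10.0, 1.0]])]
bad = [np.array([[0.0, 0.0], [10.0, 0.0]]), np.array([[0.0, 1.0], [10.0, 1.0]])]

db_good = davies_bouldin(good)
db_bad = davies_bouldin(bad)
print(db_good < db_bad)  # True
```

The compact, well-separated partition gives the smaller DB value, which is why the number of clusters is chosen to minimize the index.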

The rough within-cluster distance is formulated as

$$
S_r(U_i) =
\begin{cases}
w_{low} \dfrac{\sum_{X_k \in \underline{B}U_i} \|X_k - m_i\|}{|\underline{B}U_i|} + w_{up} \dfrac{\sum_{X_k \in (\overline{B}U_i - \underline{B}U_i)} \|X_k - m_i\|}{|\overline{B}U_i - \underline{B}U_i|} & \text{if } \underline{B}U_i \neq \emptyset \wedge \overline{B}U_i - \underline{B}U_i \neq \emptyset \\[2ex]
\dfrac{\sum_{X_k \in \underline{B}U_i} \|X_k - m_i\|}{|\underline{B}U_i|} & \text{if } \overline{B}U_i - \underline{B}U_i = \emptyset \\[2ex]
\dfrac{\sum_{X_k \in (\overline{B}U_i - \underline{B}U_i)} \|X_k - m_i\|}{|\overline{B}U_i - \underline{B}U_i|} & \text{otherwise,}
\end{cases}
\qquad (4)
$$

using eqn. (2). Rough DB now becomes

$$DB_r = \frac{1}{c} \sum_{i=1}^{c} \max_{j \neq i} \left\{ \frac{S_r(U_i) + S_r(U_j)}{d(U_i, U_j)} \right\}. \qquad (5)$$

3 Collaborative RCM Clustering

In this section a collaborative rough c-means clustering is proposed, by incorporating collaboration between different partitions or sub-populations. Let a dataset be divided into $P$ sub-populations or modules. Each sub-population is independently clustered to reveal its structure. Collaboration is incorporated by exchanging information between the modules regarding the local partitions, in terms of the collection of prototypes computed within the individual modules. This sort of divide-and-conquer strategy enables efficient mining of large databases. The required communication links are hence at a higher level of abstraction, thereby representing information granules (rough clusters) in terms of their prototypes.

There exist two phases in the algorithm.

1. Generation of RCM clusters within the modules, without collaboration. Here we employ $0.5 < w_{low} < 1$, thereby providing more importance to samples lying within the lower approximation of clusters while computing their prototypes locally.
2. Collaborative RCM between the clusters, computed locally for each module of the large dataset. Now we use $0 < w_{low} < 0.5$ (we chose the complement, $1 - w_{low}$, of the first-phase value), with a lower value providing higher precedence to samples lying in the boundary region of overlapping clusters. Here a cluster $U_i$ may be merged with an overlapping cluster $U_j$ if

$$\frac{|\overline{B}U_i - \underline{B}U_i|}{|BU_i|} \qquad (6)$$

   is maximum for all $i = 1, \ldots, c$ and $d(m_i, m_j)$ is minimum for $j \neq i$.

Let there be $c_1$ and $c_2$ clusters, generated by RCM, in a pair of modules ($P = 2$) under consideration. During collaboration, we begin with $c_1 + c_2$ cluster prototypes and merge using eqn. (6).
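The rough quantities used in this section can be sketched as follows: one RCM assignment pass, the rough centroid of eqn. (2), the rough within-cluster distance of eqn. (4), and the rough DB index of eqn. (5). This is a minimal illustration, not the authors' implementation. It assumes $w_{up} = 1 - w_{low}$, assumes every cluster has at least one object in its upper approximation, and, because the text does not say which members enter $d(U_i, U_j)$ for rough clusters, takes the separation over upper-approximation members. All function names, the toy data and the parameter values are hypothetical.

```python
import numpy as np


def rough_assign(X, means, threshold):
    """One RCM assignment pass; returns boolean masks (lower, upper) of shape (c, N)."""
    d = np.linalg.norm(X[None, :, :] - means[:, None, :], axis=2)  # d[i, k]
    c, n = d.shape
    lower = np.zeros((c, n), dtype=bool)
    upper = np.zeros((c, n), dtype=bool)
    for k in range(n):
        i = int(np.argmin(d[:, k]))
        close = [j for j in range(c) if j != i and d[j, k] - d[i, k] < threshold]
        upper[i, k] = True
        if close:
            upper[close, k] = True  # boundary object: member of no lower approximation
        else:
            lower[i, k] = True      # unambiguous object: lower (and upper) of cluster i
    return lower, upper


def _combine(low_vals, bnd_vals, w_low):
    """Apply the three cases shared by eqns. (2) and (4), with w_up = 1 - w_low."""
    if len(low_vals) and len(bnd_vals):
        return w_low * low_vals.mean(axis=0) + (1.0 - w_low) * bnd_vals.mean(axis=0)
    if len(low_vals):              # empty boundary: reduces to the plain mean
        return low_vals.mean(axis=0)
    return bnd_vals.mean(axis=0)   # empty lower approximation


def rough_centroids(X, lower, upper, w_low):
    """Centroids m_i of eqn. (2)."""
    boundary = upper & ~lower
    return np.array([_combine(X[lower[i]], X[boundary[i]], w_low)
                     for i in range(len(lower))])


def rough_scatter(X, means, lower, upper, w_low):
    """Rough within-cluster distances S_r(U_i) of eqn. (4)."""
    boundary = upper & ~lower
    out = []
    for i in range(len(means)):
        dist = np.linalg.norm(X - means[i], axis=1)
        out.append(_combine(dist[lower[i]], dist[boundary[i]], w_low))
    return np.array(out)


def rough_db(X, means, lower, upper, w_low):
    """Rough DB index of eqn. (5); separation taken over upper approximations (assumption)."""
    S = rough_scatter(X, means, lower, upper, w_low)
    c = len(means)
    total = 0.0
    for i in range(c):
        ratios = []
        for j in range(c):
            if j == i:
                continue
            Xi, Xj = X[upper[i]], X[upper[j]]
            d_ij = np.linalg.norm(Xi[:, None, :] - Xj[None, :, :], axis=2).mean()
            ratios.append((S[i] + S[j]) / d_ij)
        total += max(ratios)
    return total / c


# Hypothetical toy data: two groups plus one point midway between them.
X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5], [2.5, 2.5]], dtype=float)
means0 = np.array([[0.0, 0.0], [5.0, 5.0]])

lower, upper = rough_assign(X, means0, threshold=0.3)
means1 = rough_centroids(X, lower, upper, w_low=0.9)
db_r = rough_db(X, means1, lower, upper, w_low=0.9)
print(means1)
print(db_r)
```

In this toy run the midway point lands in both upper approximations and in no lower approximation, so it pulls each centroid only through the small weight $1 - w_{low}$. Iterating `rough_assign` and `rough_centroids` until the masks stop changing gives the full RCM loop.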

Fig. 1. Collaborative rough clustering on synthetic dataset for (a) Module A and (b) Module B, with RCM

Fig. 2. Collaborative fuzzy clustering on synthetic dataset for (a) Module A and (b) Module B, with FCM

Table 1. Sample comparative performance of collaborative RCM on synthetic data
Columns: Module (A, B); before collaboration: prototypes, samples in lower approximation, rough DB; after collaboration: prototypes, samples in lower approximation, rough DB.

4 Results

Results are provided on a two-dimensional synthetic dataset [], using RCM and FCM, as shown in Fig. 1 and Fig. 2 respectively. There are two modules (A and B) corresponding to ten samples each, partitioned into three clusters. Sample results using threshold = ., $w_{low}$ = . for RCM, and $w_{low}$ = . = . for collaborative RCM, are provided in Table 1. It is observed that the Rough DB decreases for both modules after collaboration. Figs. 1(a) and 1(b) indicate the clustering before (solid line) and after (dotted line) collaboration, using modules A and B respectively with RCM.

5 Conclusion

Collaborative clustering is a promising approach towards modeling agent-based systems. While handling large data in this framework, each intelligent agent may concentrate on information discovery (or clustering) within a module. These agents can communicate with each other at the cluster interface, using an appropriate protocol, with their cluster profiles represented in terms of the centroids. In this article, we have presented a novel collaborative clustering using the RCM algorithm. The Davies-Bouldin clustering validity index is modified, by incorporating rough concepts, to determine optimal clustering during collaboration.

Acknowledgement

This work is partly supported by CSIR (No. ()//EMR II), New Delhi.

References

1. Mitra, S., Acharya, T.: Data Mining: Multimedia, Soft Computing, and Bioinformatics. John Wiley, New York ()
2. Pedrycz, W.: Collaborative fuzzy clustering. Pattern Recognition Letters ()
3. Bezdek, J.C.: Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum Press, New York ()
4. Lingras, P., West, C.: Interval set clustering of Web users with rough k-means. Technical Report No. -, Dept. of Mathematics and Computer Science, St. Mary's University, Halifax, Canada ()
5. Pawlak, Z.: Rough Sets, Theoretical Aspects of Reasoning about Data. Kluwer Academic, Dordrecht ()
6. Tou, J.T., Gonzalez, R.C.: Pattern Recognition Principles. Addison-Wesley, London ()
7. Bezdek, J.C., Pal, N.R.: Some new indexes for cluster validity. IEEE Transactions on Systems, Man, and Cybernetics, Part-B ()