Fuzzy C-means Clustering with Temporal-based Membership Function

Indian Journal of Science and Technology, Vol (S()), DOI:./ijst//viS/, December ISSN (Print) : - ISSN (Online) : -

Aseel Mousa * and Yuhanis Yusof
School of Computing, Universiti Utara Malaysia, Sintok, Kedah, Malaysia; assiso@yahoo.com, yuhanis@uum.edu.my
*Author for correspondence

Abstract
Objective: This paper proposes a method that creates clusters using temporal information. Despite its popularity, the Fuzzy C-means (FCM) algorithm does not use temporal information when creating clusters, which affects clustering accuracy. This paper presents an improved FCM algorithm that incorporates temporal information into the membership function used for clustering. Methods: The proposed FCM algorithm uses the temporal neighbourhood of data points as the basis for clustering. To evaluate the algorithm, experiments were performed on three multi-labelled datasets: clinical free text (medical), textual email messages (Enron), and Bibtex. Findings: The experimental results show that the proposed function yields a smaller objective function value while requiring fewer iterations. Application: The proposed work will benefit data mining in domains such as information retrieval, healthcare and business management, owing to its ability to group data points that are not mutually exclusive.

Keywords: Fuzzy C-mean, Data Clustering, Data Mining, Multi-labelled Data

Introduction
Clustering distinguishes among things in conformity with certain requirements and rules. A clustering algorithm is a set of rules for classifying data whose distribution is unknown. Its main goal is to find the structure hidden in the data and, as far as possible, to assign data of the same nature to the same class according to some measure of similarity. Clustering analysis, which leads to a fuzzy partition of the sample space, has been widely used in areas such as data mining and pattern recognition. An example of a clustering algorithm is Fuzzy C-means (FCM), which has been successfully applied in fields such as medical imaging, target recognition and image segmentation.

In one study, the authors propose an improved FCM algorithm based on Particle Swarm Optimization (PSO) to solve the problem of premature convergence of the FCM clustering algorithm. The results show that the improved method handles noise better than previous methods. Furthermore, clustering performance improves on datasets whose dimensionality is greater than the number of samples. In another study, the authors propose an improved clustering approach based on cluster density (FCM-CD). Taking into account the global dot density in a cluster, a distance correction regulatory factor is built and applied to FCM. The experimental results reveal that FCM-CD tolerates different densities and various cluster shapes well and achieves higher clustering accuracy. A further study proposes a new fuzzy c-means method for improving Magnetic Resonance Imaging (MRI) segmentation. The method, known as Possibilistic Fuzzy C-Means (PFCM), hybridizes the Fuzzy C-Means (FCM) and Possibilistic C-Means (PCM) functions. It is realized by modifying the objective function of the conventional PCM algorithm with Gaussian exponent weights to produce memberships and possibilities simultaneously, along with the usual point prototypes or cluster centres for each cluster.

Another study proposes an improved FCM algorithm that adopts a novel strategy for selecting the initial cluster centres, addressing the difficulty that the traditional FCM clustering algorithm has in selecting them. Elsewhere, the authors present an automatic, effective intuitionistic fuzzy c-means, which is an extension of the standard Intuitionistic Fuzzy C-Means (IFCM). They present a model called RBF Kernel-based Intuitionistic Fuzzy C-Means (KIFCM), which replaces the Euclidean norm with other distances.

Although there have been many studies on distance measures, existing approaches do not consider the temporal information of the data. Such information may provide insight into the characteristics of the data and hence produce more accurate clusters. This paper introduces a membership function that operates on the temporal information of the data. The proposed function is then integrated into the FCM algorithm to cluster multi-labelled datasets.

Fuzzy C-mean
The FCM algorithm assigns data points (pixels) to each category using fuzzy memberships. FCM determines and iteratively updates the membership values of a data point in clusters that are defined beforehand, so any data point can be related to all clusters through its membership values. The algorithm assigns each data point a membership for each cluster centre, based on the distance between the data point and the centroid. The conventional FCM is listed below.

Algorithm. Conventional FCM
1: Initialize the membership matrix $U = [u_{ij}]$.
2: Choose a parameter $\varepsilon > 0$ to stop the iteration, and initialize the iteration counter $l$.
3: for $l$ up to $x$: set the time matrix.
4: At step $k$, calculate the centre vectors $C^{(k)} = [c_j]$ by $c_j = \sum_{i=1}^{n} u_{ij}^{m} x_i \,/\, \sum_{i=1}^{n} u_{ij}^{m}$.
5: Update the membership matrix from $U^{(k)}$ to $U^{(k+1)}$ by $u_{ij} = 1 \,/\, \sum_{r=1}^{c} (d_{ij}/d_{ir})^{2/(m-1)}$.
6: if $\lVert U^{(k+1)} - U^{(k)} \rVert > \varepsilon$, return to the centre-calculation step;
7: else stop at some iteration $l$; end if

The iteration minimizes the objective function $J_m = \sum_{i=1}^{n} \sum_{j=1}^{c} u_{ij}^{m} \lVert x_i - c_j \rVert^{2}$, where $m$ is any real number greater than 1, $u_{ij}$ is the degree of membership of $x_i$ in cluster $j$, $x_i$ is the $i$th of the $d$-dimensional measured data, $c_j$ is the $d$-dimensional centroid of cluster $j$, $\lVert \cdot \rVert$ is any norm expressing the similarity between a measured data point and a centroid, $c$ is the number of cluster centres, $n$ is the number of data points, $d_{ir}$ is the distance between data point $i$ and cluster centre $r$, and $d_{ij}$ is the Euclidean distance between the $i$th data point and the $j$th centre.

Methods
This study focuses on improving the existing fuzzy function employed in the FCM algorithm in order to produce better clustering. This section elaborates on the phases undertaken in the study.

Data Collection
Three multi-label datasets are used in this study: medical, Enron and Bibtex. They are described in the table below. All of them were obtained from the Mulan website. Mulan is an open-source library for learning from multi-label datasets, and its site provides a table of multi-label datasets with descriptions: http://mulan.sourceforge.net/datasets.html.

Table. Description of datasets
Name      Domain   Instances   Labels   Density
Medical   Text
Enron     Text
Bibtex    Text

Function Design
Currently, the standard FCM algorithm does not use the temporal relationship between data points, so it may not be robust for close data points that have strong relationships (i.e. overlapping data points). Overlapping means that a data point may intrinsically belong to more than one cluster; in other words, each data point is mapped to more than one label. The two-dimensional plot of overlapped classes illustrates such overlapping clusters.
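As a point of reference for the modification developed in this section, the conventional FCM iteration from the Fuzzy C-mean section can be sketched in Python as follows. This is a minimal illustration rather than the implementation used in the experiments: it assumes NumPy, Euclidean distance, a fuzzifier of m = 2, random initial memberships and a stopping test on the largest membership change. The function name `fcm` and all of its parameters are hypothetical.

```python
import numpy as np

def fcm(X, c, m=2.0, eps=1e-5, max_iter=300, seed=0):
    """Conventional fuzzy c-means (illustrative sketch).

    X: (n, d) data, c: number of clusters, m > 1: fuzzifier.
    Returns centres (c, d), memberships U (n, c) and the objective history.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Random initial membership matrix whose rows sum to 1.
    U = rng.random((n, c))
    U /= U.sum(axis=1, keepdims=True)
    history = []
    for _ in range(max_iter):
        Um = U ** m
        # Centre update: membership-weighted mean of the data.
        centres = (Um.T @ X) / Um.sum(axis=0)[:, None]
        # Euclidean distance d_ij between point i and centre j.
        dist = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        dist = np.fmax(dist, 1e-12)  # guard against division by zero
        # Objective J_m = sum_i sum_j u_ij^m d_ij^2 for the current U and centres.
        history.append(float((Um * dist ** 2).sum()))
        # Membership update: u_ij = 1 / sum_r (d_ij / d_ir)^(2 / (m - 1)).
        inv = dist ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        # Stop once no membership changes by eps or more (assumed max-norm test).
        converged = np.abs(U_new - U).max() < eps
        U = U_new
        if converged:
            break
    return centres, U, history
```

Calling `fcm(X, c)` on an (n, d) array returns soft memberships whose rows sum to 1. The recorded objective values do not increase from one iteration to the next, which is the behaviour the objective-function comparison in the Results relies on.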

Figure. Two-D plot for overlapped classes.

Non-overlapped data means that a data point should belong to exactly one cluster. In other words, clustering assigns an object to exactly one class, even though there are two or more class labels.

Temporal information indicates that data points that are neighbours in time are highly correlated and thus possess the same feature value, so the probability that they belong to the same cluster is high. To exploit the temporal information contained in the data, a temporal function is defined:

Equation. Probability that a pixel belongs to a cluster
Equation. New membership function

The temporal function gives the probability that a pixel belongs to a cluster, and t is the time associated with each input vector. Each vector (data point) corresponds to a time value that is stored in the time matrix. The function is built from the membership of the pixel at time t in the cluster, the number of cluster centres c, and a square array (SQ) of time values centred on the pixel's time value in the temporal domain; µnewij denotes the new membership function in the temporal domain. In this study, a square array of the time matrix associated with the matrix of data points is used for easier representation (i.e. the SQ matrix). The temporal function of a data point for a cluster is large if the majority of its neighbourhood belongs to the same cluster. The temporal function supports the conventional membership function on ordinary datasets and reduces the overlapping weight on multi-label datasets: overlapped data points that are related to more than one cluster are reprocessed by applying time-factor matching in the temporal membership function.

Clustering starts by applying the conventional FCM to calculate the membership function. This membership function is then mapped to the temporal function to compute the temporal membership function, as in the two equations above. The operation stops when the distance between the cluster centres and the data points is less than a threshold. This threshold represents a minimum distance between a data point and a cluster centre, guaranteeing an optimal estimate of the number of clusters needed to solve the overlap problem. This distance relates each data point to its real cluster more reliably than a larger distance would, and thus still ensures the recognition of barely detectable clusters while reducing the overlap between clusters.

The proposed temporal-based membership function is included as a step in the improved FCM, which is listed below.

Algorithm. Improved FCM
1: Initialize the membership matrix $U = [u_{ij}]$.
2: Choose a parameter $\varepsilon > 0$ to stop the iteration, and initialize the iteration counter $l$.
3: for each $i$ up to $x$: set the time matrix; end for
4: At step $k$, calculate the centre vectors $C^{(k)} = [c_j]$ by $c_j = \sum_{i=1}^{n} u_{ij}^{m} x_i \,/\, \sum_{i=1}^{n} u_{ij}^{m}$.
5: Update the membership matrix from $U^{(k)}$ to $U^{(k+1)}$ by $u_{ij} = 1 \,/\, \sum_{r=1}^{c} (d_{ij}/d_{ir})^{2/(m-1)}$.
6: if $\lVert U^{(k+1)} - U^{(k)} \rVert > \varepsilon$, return to the centre-calculation step; else stop at some iteration $l$; end if
7: Calculate the temporal function and map it to the new membership function.
8: Branch on the distance threshold described above (if the condition holds, go to one step; else go to another); end if

Results
In this section, the results of the experiments are presented. The figure of cluster densities shows the cluster densities for the datasets under consideration.

Figure. Cluster densities of (a) medical, (b) Enron and (c) Bibtex.

The first table below gives the iteration count that yields the minimum objective function, while the second gives the average distance and the objective function value for the three datasets. For the medical dataset, the improved FCM is better than the conventional FCM in terms of both the average distance between the cluster centres and the data points and the objective function. For the Enron dataset, the improved method is better than the conventional one in objective function, while the difference in average distance between the conventional and improved FCM is small. Likewise, for the Bibtex dataset the average distance is close between the conventional and improved FCM, while the objective function values of the two methods differ. This is because of the high overlap in the data. The result indicates that the conventional FCM is sensitive to overlapping data; hence, temporal information is useful in mapping a data point to its cluster. Clustering performance is highly affected by data structure and cluster density, and it is poor when the cluster densities differ greatly.

Table. Iteration count that leads to the minimum objective function
Dataset   Conventional FCM   Improved FCM
Medical
Enron
Bibtex

Table. Average distance and objective function for the three datasets
Method             Medical                             Enron                               Bibtex
                   Avg. distance, Objective function   Avg. distance, Objective function   Avg. distance, Objective function
Conventional FCM
Improved FCM
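To make the comparison above more concrete, the temporal step that separates the improved FCM from the conventional one can be sketched in Python. The sketch rests on an assumed form, analogous to spatial FCM: the temporal function for point i and cluster j is taken to be the sum of the memberships in cluster j of the points whose time values lie within a window around point i's time, and the new membership is the old membership multiplied by that sum and renormalized over clusters. The window half-width, the unit exponents and the name `temporal_membership` are illustrative assumptions, not the definitions used in this study.

```python
import numpy as np

def temporal_membership(U, times, half_width=1.0):
    """Reweight FCM memberships by temporal neighbours (assumed form).

    U: (n, c) memberships whose rows sum to 1.
    times: (n,) time value attached to each data point.
    half_width: hypothetical window; points within it count as neighbours.
    """
    U = np.asarray(U, dtype=float)
    times = np.asarray(times, dtype=float)
    # neighbours[i, k] is True when point k lies in point i's time window
    # (each point is its own neighbour, so every row has at least one True).
    neighbours = np.abs(times[:, None] - times[None, :]) <= half_width
    # Temporal function h_ij: summed membership of i's neighbours in cluster j.
    h = neighbours.astype(float) @ U
    # New membership: u_ij * h_ij, renormalized so that each row sums to 1.
    weighted = U * h
    return weighted / weighted.sum(axis=1, keepdims=True)

# Four points, two clusters; the second point is ambiguous (0.5 / 0.5)
# but its neighbours in time both lean towards cluster 0.
U = np.array([[0.9, 0.1], [0.5, 0.5], [0.8, 0.2], [0.1, 0.9]])
times = np.array([0.0, 1.0, 2.0, 10.0])
U_new = temporal_membership(U, times)
# The ambiguous point moves towards cluster 0 (0.5 becomes about 0.73).
```

Under these assumptions, a point whose temporal neighbours mostly belong to one cluster has its membership in that cluster strengthened, while an isolated point such as the last one simply has its existing preference sharpened.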
The cluster-density figure shows that the medical dataset has widely differing cluster densities, which affects clustering performance. The Enron and Bibtex datasets, on the other hand, have similar cluster densities.

The two objective-function figures show the progress of the objective function for the conventional FCM and for the improved FCM, respectively. From these two figures, it can be concluded that the improved FCM is better than the conventional FCM in terms of both iteration count and objective function value. The number of iterations that yields the minimum objective function for the employed datasets is smaller than that required by the conventional FCM. The value of the objective function affects the accuracy of the clustering: when a data point is close to the cluster centre, the objective function becomes small and the membership value becomes high.

Figure. Conventional FCM objective function for (a) medical, (b) enron and (c) bibtex.
Figure. Improved FCM objective function for (a) medical, (b) enron and (c) bibtex.

Conclusion
The Fuzzy C-mean (FCM) algorithm is one of the most well-known clustering algorithms. Nevertheless, it does not use the temporal information contained in the data to create clusters based on temporal matching. In this paper, a new membership function is proposed for inclusion in the FCM. The membership functions of the neighbours in the temporal domain are enumerated to obtain the probability that a data point belongs to a specific cluster. The aim of the improved FCM is to produce quality clusters for multi-labelled datasets. The experiments revealed that the proposed method yields a smaller objective function value while requiring fewer iterations.

References
Niu Q, Huang X. An improved fuzzy c-means clustering algorithm based on PSO. Journal of Software.
Lou X, Li J, Liu H. Improved fuzzy c-means clustering algorithm based on cluster density. Journal of Computational Information Systems.

Chattopadhyay S, Pratihar DK, Sarkar SCD. A comparative Study of Fuzzy C-means algorithm and Entropy-based Fuzzy Clustering algorithms. Computing and Informatics.
Zanaty EA. An adaptive fuzzy C-means algorithm for improving MRI segmentation. Open Journal of Medical Imaging.
Blacknell D, Griffiths H. Radar Automatic Target Recognition (ATR) and Non-Cooperative Target Recognition (NCTR).
Kannan SR, Ramathilagam S, Pandiyarajan R. Modified bias field fuzzy C-means for effective segmentation of brain MRI. Transactions on computational science VIII. Gavrilova ML: Springer-Verlag.
Lu Y, Ma T, Yi C, Xie X, Tian W, Zhong S. Implementation of the Fuzzy C-Means Clustering algorithm in meteorological data. International Journal of Database Theory and Application.
Kaur P, Soni AK, Gosain A. RETRACTED: A robust kernelized intuitionistic Fuzzy C-means Clustering algorithm in segmentation of noisy medical images. Pattern Recognition Letters.
William R. Hierarchical Temporal Memory Cortical Learning algorithm for pattern recognition. ProQuest, UMI Dissertation Publishing; Oct.
Jun W, Shi-Tong W. Double indices FCM algorithm based on hybrid distance metric learning. Journal of Software.
Grabusts P. The choice of metrics for clustering algorithms. International Scientific and Practical Conference; Izdevniecība: Rēzeknes Augstskola, Rēzekne.
Cai W, Chen S, Zhang D. Fast and Robust Fuzzy C-Means Clustering Algorithms incorporating local information for image segmentation. Pattern Recognition Letters.
Mullner D. Modern Hierarchical, Agglomerative Clustering algorithms. Library C, editor. arXiv.
Tsai D-M, Lin C-C. Fuzzy C-means based clustering for linearly and nonlinearly separable data. Pattern Recognition Letters.
Schwämmle V, Jensen ON. A simple and fast method to determine the parameters for Fuzzy C means Cluster analysis. Bioinformatics.