Hybrid Fuzzy C-Means Clustering Technique for Gene Expression Data

P. Valarmathie (1), Dr MV Srinath (2), Dr T. Ravichandran (3), K. Dinakaran (4)
(1) Dept. of Computer Science and Engineering, Dr. MGR University, Chennai, India
(2) Dept. of Computer Science and Engineering, Mahendra Engineering College, Namakkal, India
(3) Dept. of Computer Science and Engineering, Hindustan Institute of Tech., Coimbatore, India
(4) Dept. of Computer Science and Engineering, RMK Engineering College, Chennai, India

ABSTRACT
The challenging issue in microarray techniques is to analyze and interpret the large volume of data they produce. This can be achieved with clustering techniques from data mining. In hard clustering techniques such as hierarchical and k-means clustering, data is divided into distinct clusters and each data element belongs to exactly one cluster, so the outcome of the clustering is often incorrect. The problems of hard clustering can be addressed by fuzzy clustering. Among fuzzy clustering techniques, fuzzy c-means (FCM) is the most suitable for microarray gene expression data. The problem with fuzzy c-means is that the number of clusters to be generated for the given dataset must be specified in advance. This can be solved by combining the method with the popular probabilistic Expectation Maximization (EM) algorithm, which provides a statistical framework for modelling the cluster structure of gene expression data. The main objective of the proposed hybrid fuzzy c-means method is to determine the precise number of clusters and to interpret them efficiently.

Keywords: Fuzzy C-Means, Gene Expression Data, Expectation Maximization, Hard Clustering

I. INTRODUCTION
The emergence of microarray technology has made it possible to monitor the expression levels of thousands of genes simultaneously. The challenge is to analyze and interpret this large volume of data effectively. Two statistical operations commonly applied to microarray data are classification and clustering, of which clustering is the more significant for microarray data analysis [1][2]. Clustering problems arise in many applications, such as data mining and knowledge discovery, data compression, pattern recognition and pattern classification. Here the aim is to group similar genes so that genes within the same cluster are similar to each other and different from genes in other clusters [3]. Depending on the nature of the data and the purpose of the clustering, different measures of similarity may be used to place objects into clusters, and the similarity measure controls how the clusters are formed [4].

Numerous clustering techniques are available for gene expression data. One is hierarchical clustering, which was widely used in early work. A common problem with this method is that its results are visualized as a dendrogram, which is difficult to read when the dataset is large [5]. In the popular k-means method, the user is often uncertain how to choose the number of clusters. In hard clustering, data is divided into distinct clusters and each data element belongs to exactly one cluster. In some situations, however, an object may belong to more than one cluster, and each element then has an associated set of membership levels. Clustering may therefore be either crisp or fuzzy. Fuzzy clustering of microarray data has an advantage over crisp partitioning because of the great amount of imprecision and uncertainty associated with gene expression data [6]. The problem with fuzzy clustering is that the number of clusters to be generated for the given dataset must be specified in advance; the proposed method addresses this.

In the EM (Expectation Maximization) algorithm, the probability of belonging to each cluster is calculated for every data object. The model parameters Θ = {θ_i | 1 <= i <= k} and the hidden parameters {γ_ir | 1 <= i <= k, 1 <= r <= n}, where k is the number of components and n is the number of data objects, are estimated to represent the probability that a data object belongs to a cluster. The EM algorithm estimates these unknown parameters. In the expectation step, the hidden parameters are conditionally estimated from the data given the current estimates of the model parameters. In the maximization step, the model parameters are estimated so as to maximize the likelihood of the complete data given the estimated hidden parameters. When the algorithm converges, each data object is assigned to the component with the maximum conditional probability [7][8]. To solve the problem in fuzzy clustering, we combine fuzzy c-means with the EM algorithm.

II. FUZZY CLUSTERING
Fuzzy clustering is the process of assigning membership levels and then using them to assign data elements to one or more clusters. It gives more information on the similarity of each object [9]. One of the most widely used fuzzy clustering algorithms is the fuzzy c-means (FCM) algorithm. The fuzzy c-means algorithm attempts to partition a finite collection of elements into a collection of c fuzzy clusters with respect to some given criterion. Given a finite set of data, X = {x_1, ..., x_n}, and the vector of cluster centers, V = {v_1, v_2, ..., v_c}, an objective function is defined with the membership degree between each datum x_j and cluster center v_i:

$$ J_m(X, U, V) = \sum_{j=1}^{n} \sum_{i=1}^{c} (\mu_{ij})^{m} \, d^{2}(x_j, v_i) \qquad (1) $$

where µ_ij is the membership degree of x_j in the ith cluster, an element of the membership matrix U = [µ_ij]; d^2 is the square of the Euclidean distance; and m is the fuzziness parameter, which sets the degree of fuzziness of each datum's membership and must be greater than 1.0 [10]. Like the k-means algorithm, FCM aims to minimize an objective function. The standard objective function differs from the k-means squared-error criterion by the addition of the membership values µ_ij, which yields fuzzier clusters [11].

III. PROPOSED METHOD
The problem with fuzzy c-means is that the number of clusters to be generated for the given dataset must be specified in advance; the proposed method addresses this. In this method, fuzzy c-means is combined with the EM (Expectation Maximization) algorithm, which provides a statistical framework for modelling the cluster structure of gene expression data. EM makes use of probabilistic models that can explain the probabilistic characteristics of the given system and helps to find the precise number of clusters for the given dataset, so that the number of components returned by EM can be used as the number of clusters k. The main objective of this hybrid method is to minimize the objective function value in fuzzy c-means. The sample dataset used to examine the performance of the proposed method is yeast data downloaded from the website [12], which consists of the expression levels of 61 genes under 15 different conditions.
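The fuzzy c-means step that the hybrid method runs once k is fixed can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the paper gives only the objective in Eq. (1), so the alternating center and membership updates below are the standard textbook FCM updates, and the names `fcm` and `sq_dist`, the random initialization, and the stopping tolerance are assumptions made for the example.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance d^2 between two points."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def fcm(points, c, m=2.0, iters=100, tol=1e-6, seed=0):
    """Alternate the standard FCM updates (assumed, not taken from the paper).

    Returns (U, V, J): U[i][j] is the membership of point j in cluster i,
    V holds the cluster centers, J is the objective of Eq. (1).
    """
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # Random initial memberships, normalized so each point's memberships sum to 1.
    U = [[rng.random() + 1e-9 for _ in range(n)] for _ in range(c)]
    for j in range(n):
        s = sum(U[i][j] for i in range(c))
        for i in range(c):
            U[i][j] /= s
    prev, V, J = float("inf"), [], 0.0
    for _ in range(iters):
        # Center update: mean of the points weighted by mu_ij ** m.
        V = []
        for i in range(c):
            w = [U[i][j] ** m for j in range(n)]
            tot = sum(w)
            V.append([sum(w[j] * points[j][d] for j in range(n)) / tot
                      for d in range(dim)])
        # Membership update: mu_ij is proportional to d^2(x_j, v_i) ** (-1 / (m - 1)).
        for j in range(n):
            inv = [max(sq_dist(points[j], V[i]), 1e-12) ** (-1.0 / (m - 1.0))
                   for i in range(c)]
            s = sum(inv)
            for i in range(c):
                U[i][j] = inv[i] / s
        # Objective of Eq. (1) for the current memberships and centers.
        J = sum(U[i][j] ** m * sq_dist(points[j], V[i])
                for i in range(c) for j in range(n))
        if abs(prev - J) < tol:
            break
        prev = J
    return U, V, J
```

Calling `fcm(points, c=8, m=1.3)` would mirror the settings reported in the results (k = 8, membership coefficient 1.3), although the objective values it returns depend on the data and initialization and are not expected to reproduce the paper's table.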

IV. RESULTS AND DISCUSSIONS
The EM algorithm gives the number of clusters, as illustrated in Fig. 1, which plots the BIC of each model against the number of components. The different models are drawn in different colors to distinguish them, and among them the model EEE indicates the best and most accurate number of components. The silhouette values of the best model are shown in Fig. 2; the average silhouette width is 0.11 with k = 8. Fig. 3 shows the point variability for this dataset when k is 8 and the membership coefficient is 1.3. This prediction helps researchers choose the number of clusters k, and Table 1 shows how the objective function value changes with the membership coefficient for each value of k.

[Figure 1: BIC (y-axis) versus number of components (x-axis, 2 to 8) for the models EII, VII, EEI, VEI, EVI, VVI, EEE, EEV, VEV and VVV.]
Figure 1. The best model, EEV, is the highest point in the plot. The number of components is eight, which represents the maximum number of possible clusters.

The membership coefficient in this method is 2 by default, but that value does not suit every kind of dataset, so we used different membership coefficient values. Among the three values tried, Table 1 shows the minimum objective function value, 2.0, for the membership coefficient 1.3 at k = 8. From this result we can infer that k = 8 is the best value and can produce the desired results.

The method described in this paper performs clustering on microarray gene expression data. One of its main advantages is its capability of determining the precise number of clusters, so that the researcher can analyze and interpret the results efficiently.

[Figure 2: Silhouette plot of fanny(x = da, k = 8, memb.exp = 1.3); n = 35, 8 clusters; x-axis: silhouette width s_i. Cluster j : size n_j | average silhouette width: 1 : 7 | 0.33; 2 : 8 | -0.13; 3 : 3 | 0.06; 4 : 7 | 0.32; 5 : 1 | 0.00; 6 : 7 | 0.02; 7 : 1 | 0.00; 8 : 1 | 0.00. Average silhouette width: 0.11.]
Figure 2. Silhouette values for the best model, with the k value and the value of the membership exponent.

[Figure 3: clusplot(fanny(x = da, k = 8, memb.exp = 1.3)); axes: Component 1 and Component 2. These two components explain 43.83 % of the point variability.]
Figure 3. The maximum point variability between two components.
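The silhouette width reported in Figure 2 can be computed as follows. This is a minimal sketch, assuming Euclidean distance and a crisp label per point (for fuzzy output, the cluster with the highest membership); the function name is hypothetical. Points in single-member clusters are given a width of 0, which matches the 0.00 entries for the three singleton clusters in the figure.

```python
import math

def silhouette_widths(points, labels):
    """Return the silhouette width s_i of every point.

    s_i = (b - a) / max(a, b), where a is the mean distance to the other
    members of the point's own cluster and b is the smallest mean distance
    to the members of any other cluster.
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    widths = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own or len(clusters) < 2:
            widths.append(0.0)  # singleton cluster or single-cluster partition
            continue
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items() if other != lab)
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return widths
```

The average silhouette width is then `sum(widths) / len(widths)`. Values near 1 indicate well-separated clusters, while values near 0 or below, such as the -0.13 of cluster 2 in Figure 2, indicate points that lie between clusters or closer to a neighboring one.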

Table 1. Objective function values for different k values

Membership coefficient    K = 6    K = 7    K = 8
1.1                       2.83     2.65     2.4
1.2                       2.72     2.39     2.3
1.3                       2.43     2.29     2.0

REFERENCES
1. Michael B. Eisen, Paul T. Spellman, Patrick O. Brown, and David Botstein, Cluster analysis and display of genome-wide expression patterns, Proc. Natl. Acad. Sci. USA, Vol. 95, pp. 14863-14868, December 1998.
2. RM Suresh, K Dinakaran, P Valarmathie, Model based modified k-means clustering for microarray data, International Conference on Information Management and Engineering, Vol. 13, pp. 271-273, 2009, IEEE.
3. Han, Kamber, Data Mining: Concepts and Techniques, Elsevier publications, 2005.
4. K. Dinakaran, RM. Suresh, P. Valarmathie, Clustering gene expression data using self organizing maps, Journal of Computer Applications, Vol. 1, No. 4, 2008.
5. Anil K. Jain and Richard C. Dubes, Algorithms for Clustering Data, Prentice Hall, New Jersey, 1988.
6. Anirban Mukhopadhyay, Ujjwal Maulik and Sanghamitra Bandyopadhyay, Efficient two stage fuzzy clustering of microarray gene expression data, International Conference on Information Technology (ICIT 06), 2006, IEEE.
7. Shi Zhong, Joydeep Ghosh, A unified framework for model based clustering, Journal of Machine Learning Research 4 (2003) 1001-1037.
8. Wei Pan, Jizhen Lin and Chap T Le, Model-based cluster analysis of microarray gene expression data, Genome Biology 2002, 3(2):research0009.1-0009.8.
9. Seo Young Kim, Tai Myong Choi, Fuzzy types clustering for microarray data, PWASET Volume 4, February 2005, ISSN 1307-6884.
10. Han-Saem Park and Sung-Bae Cho, Evolutionary fuzzy clustering for gene expression profile analysis, SCIS&ISIS2006@Tokyo, Japan (September 20-24, 2006).
11. D. Dembele and P. Kastner, Fuzzy c-means method for clustering microarray data, Bioinformatics, Vol. 19, No. 8, pp. 973-980, 2003.
12. http://genomics.stanford.edu