Hybrid Fuzzy C-Means Clustering Technique for Gene Expression Data

1 P. Valarmathie, 2 Dr MV Srinath, 3 Dr T. Ravichandran, 4 K. Dinakaran

1 Dept. of Computer Science and Engineering, Dr. MGR University, Chennai, India.
2 Dept. of Computer Science and Engineering, Mahendra Engineering College, Namakkal, India.
3 Dept. of Computer Science and Engineering, Hindustan Institute of Tech., Coimbatore, India.
4 Dept. of Computer Science and Engineering, RMK Engineering College, Chennai, India.

ABSTRACT

A challenging issue in microarray analysis is analyzing and interpreting the large volume of data. This can be achieved with clustering techniques from data mining. In hard clustering techniques such as hierarchical and k-means clustering, data is divided into distinct clusters, where each data element belongs to exactly one cluster, so the outcome of the clustering may often be incorrect. The problems of hard clustering can be addressed by fuzzy clustering. Among fuzzy clustering techniques, fuzzy c-means (FCM) is the most suitable for microarray gene expression data. A problem with fuzzy c-means is that the number of clusters to be generated for the given dataset must be specified in advance. This can be solved by combining FCM with the popular probabilistic Expectation Maximization (EM) algorithm, which provides a statistical framework for modeling the cluster structure of gene expression data. The main objective of the proposed hybrid fuzzy c-means method is to determine the precise number of clusters and to interpret the clusters efficiently.

Keywords: Fuzzy C-Means, Gene Expression Data, Expectation Maximization, Hard Clustering

I. INTRODUCTION

The emergence of microarray technology has made it possible to monitor the expression levels of thousands of genes simultaneously. The challenge is to effectively analyze and interpret this large volume of data.
Two statistical operations commonly applied to microarray data are classification and clustering, of which clustering is the more significant for microarray data analysis [1][2]. Clustering problems arise in many different applications, such as data mining and knowledge discovery, data compression, pattern recognition, and pattern classification. Here the goal is to group similar genes into one cluster, so that genes within the same cluster are similar to each other and different from genes in other clusters [3]. Depending on the nature of the data and the purpose for which clustering is used, different measures of similarity may be used to place objects into clusters; the similarity measure controls how the clusters are formed [4].

Numerous clustering techniques are available for gene expression data. Hierarchical clustering was commonly used in the early days. A common problem with this method is that its results are visualized as a dendrogram, which is difficult to read when the dataset is large [5]. In the popular k-means clustering method, the user is often uncertain how to define the precise number of clusters.

In hard clustering, data is divided into distinct clusters, where each data element belongs to exactly one cluster. In some situations, however, an object may belong to more than one cluster, and associated with each element is a set of membership levels. Clustering may therefore be either crisp or fuzzy. Fuzzy clustering of microarray data has an advantage over crisp partitioning because of the great amount of imprecision and uncertainty
associated with gene expression data [6]. The problem with fuzzy clustering is that the number of clusters to be generated for the given dataset must be specified in advance; the proposed method addresses this.

In the EM (Expectation Maximization) algorithm, the probability of belonging to each cluster is calculated for each data object. The model parameters $\Theta = \{\theta_i \mid 1 \le i \le k\}$ and the hidden parameters $\{\gamma_{ir} \mid 1 \le i \le k,\ 1 \le r \le n\}$, where $k$ is the number of components and $n$ is the number of data objects, are estimated to represent the probability that a data object belongs to a cluster. The EM algorithm estimates these unknown parameters. In the expectation step, the hidden parameters are conditionally estimated from the data using the current estimates of the model parameters. In the maximization step, the model parameters are estimated so as to maximize the likelihood of the complete data given the estimated hidden parameters. When the algorithm converges, each data object is assigned to the component with the maximum conditional probability [7][8]. To solve the problem in fuzzy clustering, we combine FCM with the EM algorithm.

II. FUZZY CLUSTERING

Fuzzy clustering is a process of assigning membership levels and then using them to assign data elements to one or more clusters. It gives more information on the similarity of each object [9]. One of the most widely used fuzzy clustering algorithms is the fuzzy c-means (FCM) algorithm. The fuzzy c-means algorithm attempts to partition a finite collection of elements into a collection of $c$ fuzzy clusters with respect to some given criterion.

Given a finite set of data, $X = \{x_1, \ldots, x_n\}$, and the central vectors of the fuzzy clusters, $V = \{v_1, v_2, \ldots, v_c\}$, an objective function is defined in terms of the membership degree between each datum $x_j$ and cluster center $v_i$:

$$J_m(X, U, V) = \sum_{j=1}^{n} \sum_{i=1}^{c} (\mu_{ij})^m \, d^2(x_j, v_i) \qquad (1)$$

where $\mu_{ij}$ is the membership degree of $x_j$ in the $i$-th cluster, an element of the membership matrix $U = [\mu_{ij}]$; $d^2$ is the square of the Euclidean distance; and $m$ is the fuzziness parameter, which controls the degree of fuzziness of each datum's membership and must be greater than 1.0 [10]. Like the k-means algorithm, FCM aims to minimize an objective function. The standard objective function differs from the k-means squared-error criterion by the addition of the membership values $\mu_{ij}$, which yields fuzzier clusters [11].

III. PROPOSED METHOD

The problem with fuzzy c-means is that the number of clusters to be generated for the given dataset must be specified in advance; the proposed method addresses this. In this method, fuzzy c-means is combined with the EM (Expectation Maximization) algorithm, which provides a statistical framework for modeling the cluster structure of gene expression data. EM makes use of probabilistic models that can explain the probabilistic characteristics of the given system and helps to find the precise number of clusters for the given dataset, so that the value resulting from EM can be used as the number of clusters k. The main objective of using this hybrid method is to minimize the objective function value in fuzzy c-means. The sample dataset used to examine the performance of the proposed method is yeast data downloaded from the website [12], which consists of the expression levels of 61 genes under 15 different conditions.
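As an illustration of Eq. (1), the following is a minimal pure-Python sketch of the standard FCM alternating updates (memberships, then centers) together with the objective J_m. It is not the authors' implementation: the function names (`fcm`, `update_memberships`, `update_centers`, `objective`), the `init_centers` argument, and the toy data are assumptions made for this example. The number of clusters `c` is taken as given, for instance from the EM step of the proposed method.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance d^2(a, b)."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def update_memberships(X, V, m):
    """Return mu with mu[i][j] = membership of point j in cluster i."""
    mu = [[0.0] * len(X) for _ in V]
    for j, x in enumerate(X):
        d2 = [sq_dist(x, v) for v in V]
        hits = [i for i, d in enumerate(d2) if d == 0.0]
        if hits:
            # x_j lies exactly on one or more centers: share membership there.
            for i in hits:
                mu[i][j] = 1.0 / len(hits)
            continue
        inv = [d ** (-1.0 / (m - 1.0)) for d in d2]
        total = sum(inv)
        for i, w in enumerate(inv):
            mu[i][j] = w / total
    return mu


def update_centers(X, mu, m):
    """Each center is the mean of the data weighted by mu_ij ** m."""
    dim = len(X[0])
    centers = []
    for row in mu:
        w = [u ** m for u in row]
        total = sum(w)
        centers.append(tuple(
            sum(wj * x[k] for wj, x in zip(w, X)) / total for k in range(dim)
        ))
    return centers


def objective(X, mu, V, m):
    """J_m(X, U, V) from Eq. (1)."""
    return sum(mu[i][j] ** m * sq_dist(x, V[i])
               for j, x in enumerate(X) for i in range(len(V)))


def fcm(X, c, m=2.0, init_centers=None, max_iter=200, tol=1e-12, seed=0):
    """Alternate membership and center updates until the centers stop moving."""
    if m <= 1.0:
        raise ValueError("fuzziness parameter m must be greater than 1")
    rng = random.Random(seed)
    V = [tuple(v) for v in (init_centers or rng.sample(X, c))]
    history = []
    for _ in range(max_iter):
        mu = update_memberships(X, V, m)
        V_new = update_centers(X, mu, m)
        history.append(objective(X, mu, V_new, m))
        shift = max(sq_dist(a, b) for a, b in zip(V, V_new))
        V = V_new
        if shift < tol:
            break
    # mu was computed from the previous centers; at convergence they coincide.
    return V, mu, history


# Toy data (hypothetical): two well-separated groups of three points each.
X = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
V, mu, history = fcm(X, c=2, m=1.3, init_centers=[X[0], X[3]])
```

The `history` list records J_m after every iteration, so it can be used to compare runs with different values of `m` or `c`, in the spirit of Table 1.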
IV. RESULTS AND DISCUSSIONS

The EM algorithm gives the precise number of clusters, as illustrated in Fig. 1, which plots the candidate models against the number of components. The different models are drawn in different colors to distinguish them, and among them the model EEE indicates the best and most accurate number of components. The silhouette values of the best model are shown in Fig. 2; the average silhouette width is 0.11 with k = 8. Fig. 3 shows the point variability for this dataset when k is 8 and the membership coefficient is 1.3. This prediction is useful to researchers who need to define the number of clusters k. Table 1 shows how the objective function value changes with the membership coefficient for each value of k.

[Figure 1: BIC (y-axis) against number of components (x-axis, 2 to 8) for the models EII, VII, EEI, VEI, EVI, VVI, EEE, EEV, VEV and VVV.]

Figure 1. Shows the best model, EEV, as the highest point in the plot.

The number of components is eight, which represents the maximum number of possible clusters. The default value of the membership coefficient in this method is 2, but this does not suit every kind of dataset, so we used different membership coefficient values. Among the three values tried, Table 1 shows the minimum objective function value, 2.0, for the membership coefficient 1.3 at k = 8. From this result, we can infer that k = 8 is the best value and can produce the desired results.

The method described in this paper allows clustering to be performed on microarray gene expression data. One of the main advantages of the proposed method is its capability of determining the precise number of clusters, so that the researcher can analyze and interpret the results efficiently.
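The model-selection step reported in this section, choosing the number of components at the highest BIC, can be sketched as follows. This is a simplified illustration under stated assumptions and not the authors' code: it fits one-dimensional Gaussian mixtures with unequal variances by EM, whereas the model names in Figure 1 (EII to VVV) denote multivariate covariance structures that are not reproduced here. The function names, the chunk-based initialisation, the variance floor, and the synthetic data are all hypothetical. BIC is written as 2 log L - p log n, so that larger is better, matching the "highest point" reading of Figure 1.

```python
import math
import random


def normal_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def mixture_density(x, weights, means, variances):
    return sum(w * normal_pdf(x, mu, v) for w, mu, v in zip(weights, means, variances))


def em_gmm_1d(data, k, n_iter=200, var_floor=1e-2):
    """Fit a k-component 1-D Gaussian mixture by EM (requires len(data) >= k)."""
    xs = sorted(data)
    n = len(xs)
    # Assumed initialisation: split the sorted data into k contiguous chunks.
    chunks = [xs[i * n // k:(i + 1) * n // k] for i in range(k)]
    weights = [len(ch) / n for ch in chunks]
    means = [sum(ch) / len(ch) for ch in chunks]
    variances = [max(sum((x - mu) ** 2 for x in ch) / len(ch), var_floor)
                 for ch, mu in zip(chunks, means)]
    for _ in range(n_iter):
        # E-step: conditional probability of each component for each point.
        resp = []
        for x in xs:
            p = [w * normal_pdf(x, mu, v)
                 for w, mu, v in zip(weights, means, variances)]
            s = sum(p) + 1e-300
            resp.append([q / s for q in p])
        # M-step: re-estimate weights, means and variances from those probabilities.
        for i in range(k):
            ni = sum(r[i] for r in resp) + 1e-300
            weights[i] = ni / n
            means[i] = sum(r[i] * x for r, x in zip(resp, xs)) / ni
            variances[i] = max(
                sum(r[i] * (x - means[i]) ** 2 for r, x in zip(resp, xs)) / ni,
                var_floor,
            )
    loglik = sum(math.log(mixture_density(x, weights, means, variances) + 1e-300)
                 for x in xs)
    return weights, means, variances, loglik


def bic(loglik, k, n):
    """BIC = 2 log L - p log n, with p = 3k - 1 free parameters in this 1-D model."""
    return 2.0 * loglik - (3 * k - 1) * math.log(n)


def select_k(data, candidates):
    """Return the candidate k with the highest BIC, plus all BIC values."""
    bics = {k: bic(em_gmm_1d(data, k)[3], k, len(data)) for k in candidates}
    return max(bics, key=bics.get), bics


# Synthetic data (hypothetical): three groups centred at 0, 5 and 10.
rng = random.Random(0)
data = [rng.gauss(center, 0.5) for center in (0.0, 5.0, 10.0) for _ in range(40)]
best_k, bics = select_k(data, candidates=range(1, 6))
weights, means, variances, loglik = em_gmm_1d(data, 3)
```

In the hybrid scheme, `best_k` would then be passed to fuzzy c-means as the number of clusters.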
[Figure 2: Silhouette plot of fanny(x = da, k = 8, memb.exp = 1.3); n = 35, 8 clusters; x-axis: silhouette width s_i.]

Cluster j   Size n_j   Average s_i
1           7           0.33
2           8          -0.13
3           3           0.06
4           7           0.32
5           1           0.00
6           7           0.02
7           1           0.00
8           1           0.00

Average silhouette width: 0.11

Figure 2. Shows the silhouette values for the best model, with the k value and the value of the membership exponent (memb.exp).

[Figure 3: clusplot(fanny(x = da, k = 8, memb.exp = 1.3)); axes: Component 1 and Component 2. These two components explain 43.83 % of the point variability.]

Figure 3. Shows the maximum point variability between two components.
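The average silhouette width quoted for Figure 2 can be checked with a short sketch. The function below computes silhouette widths for a crisp labelling (each point assigned to its highest-membership cluster), using Euclidean distance and the common convention that a singleton cluster gets width 0. Both choices are assumptions here, as are the function names and the toy points. The last lines recompute the overall average from the per-cluster sizes and averages listed with Figure 2; because those averages are rounded to two decimals, the result only approximately reproduces the reported value.

```python
import math


def euclid(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def silhouette_widths(points, labels):
    """Silhouette width s_i = (b_i - a_i) / max(a_i, b_i) for every point."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette widths need at least two clusters")
    widths = []
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            widths.append(0.0)  # assumed convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(euclid(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(euclid(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items() if other != lab
        )
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return widths


# Toy example (hypothetical): two tight pairs and one isolated point.
points = [(0, 0), (0, 1), (10, 0), (10, 1), (50, 50)]
labels = [0, 0, 1, 1, 2]
widths = silhouette_widths(points, labels)

# Per-cluster sizes and average widths as listed with Figure 2.
sizes = [7, 8, 3, 7, 1, 7, 1, 1]
averages = [0.33, -0.13, 0.06, 0.32, 0.00, 0.02, 0.00, 0.00]
overall = sum(n * s for n, s in zip(sizes, averages)) / sum(sizes)
print(round(overall, 2))  # 3.83 / 35 is about 0.109, i.e. 0.11 after rounding
```

A width near 1 indicates a well-placed point, a width near 0 a point between clusters, and a negative width a point that is on average closer to another cluster, which is how the negative average of cluster 2 in Figure 2 should be read.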
Table 1: Objective function values for different k values

Membership coefficient   K = 6   K = 7   K = 8
1.1                      2.83    2.65    2.4
1.2                      2.72    2.39    2.3
1.3                      2.43    2.29    2.0

REFERENCES

1. Michael B. Eisen, Paul T. Spellman, Patrick O. Brown, and David Botstein, Cluster analysis and display of genome-wide expression patterns, Proc. Natl. Acad. Sci. USA, Vol. 95, pp. 14863-14868, December 1998.
2. RM. Suresh, K. Dinakaran, P. Valarmathie, Model based modified k-means clustering for microarray data, International Conference on Information Management and Engineering, Vol. 13, pp. 271-273, IEEE, 2009.
3. Han, Kamber, Data Mining: Concepts and Techniques, Elsevier Publications, 2005.
4. K. Dinakaran, RM. Suresh, P. Valarmathie, Clustering gene expression data using self organizing maps, Journal of Computer Applications, Vol. 1, No. 4, 2008.
5. Anil K. Jain and Richard C. Dubes, Algorithms for Clustering Data, Prentice Hall, New Jersey, 1988.
6. Anirban Mukhopadhyay, Ujjwal Maulik and Sanghamitra Bandyopadhyay, Efficient two stage fuzzy clustering of microarray gene expression data, International Conference on Information Technology (ICIT 06), IEEE, 2006.
7. Shi Zhong, Joydeep Ghosh, A unified framework for model based clustering, Journal of Machine Learning Research, Vol. 4, pp. 1001-1037, 2003.
8. Wei Pan, Jizhen Lin and Chap T. Le, Model-based cluster analysis of microarray gene expression data, Genome Biology, 3(2):research0009.1-0009.8, 2002.
9. Seo Young Kim, Tai Myong Choi, Fuzzy types clustering for microarray data, PWASET, Volume 4, February 2005, ISSN 1307-6884.
10. Han-Saem Park and Sung-Bae Cho, Evolutionary fuzzy clustering for gene expression profile analysis, SCIS&ISIS 2006, Tokyo, Japan, September 20-24, 2006.
11. D. Dembele and P. Kastner, Fuzzy c-means method for clustering microarray data, Bioinformatics, Vol. 19, No. 8, pp. 973-980, 2003.
12. http://genomics.stanford.edu