Mining Gene Expression Data Using PCA Based Clustering


Vol. 5, No. 1, January-June 2012, pp , Published by Serials Publications, ISSN:

N.P. Gopalan¹ and B. Sathiyabhama²*

¹ Department of Computer Applications, National Institute of Technology, Tiruchirappalli, Tamilnadu, India, gopalan@nitt.edu
² Department of Computer Science and Engineering, National Institute of Technology, Tiruchirappalli, Tamilnadu, India, sathya674@yahoo.co.in

ABSTRACT: The amount of laboratory data in molecular biology and bioinformatics grows exponentially each year owing to advanced technologies such as DNA microarrays, so new efficient and effective clustering methods must be developed to process this fast-growing volume of biological data. Numerous clustering techniques have been applied to gene expression data to extract biologically significant patterns, but issues such as clustering quality, high dimensionality of the input data and computational efficiency still need to be addressed. A novel hybrid clustering algorithm is proposed that blends Principal Component Analysis (PCA) with enhanced correlation based clustering. PCA is a classical statistical technique for finding patterns in high-dimensional data. The empirical results show that this approach provides more stable clustering performance in terms of quality and efficiency. The resulting clusters offer potential insight into gene function, molecular biological processes and regulatory mechanisms.

Keywords: Clustering analysis; Bioinformatics; Gene expression data; Principal Component Analysis

1. INTRODUCTION

DNA microarray technology has made it possible to simultaneously monitor the expression levels of thousands of genes during important biological processes across collections of related samples.
It holds enormous promise in areas such as revealing the function of genes in various cell populations, tumor classification, drug target identification, understanding cellular pathways, and predicting the outcome of therapy [1], [2]. A major application of microarray technology is gene expression profiling to predict outcome in multiple tumor types [3]. Data mining methods can be applied to various gene expression data sets, including cancer data sets, to identify distinct genes for classifying tumors. Cluster analysis, one such data mining technique, seeks to partition a given data set into groups based on specified features so that the data points within a group are more similar to each other than to the points in different groups [2]. Clustering techniques are useful in identifying (yet unknown) subclasses of tumors, or clusters of genes that are co-regulated or share the same function [4]. These methods have been successful in separating certain types of genes associated with different types of leukemia and lymphoma [3]. Genes that are grouped into biologically relevant clusters because they have similar expression patterns are called co-expressed genes. Clustering has become an efficient and mandatory tool for in-silico analysis of gene expression data [5], [6], [7], [8], [9]. A variant of the hierarchical clustering algorithm is used by Eisen et al. [7] to identify groups of co-expressed yeast genes. A two-way clustering technique [5] is used to detect clusters of correlated genes and tissues. Self-Organizing Maps (SOM) [9] are used to identify clusters in the yeast cell cycle data set and the human hematopoietic differentiation data set. Biologically meaningful clusters of yeast chodata have been determined using the genetic enhanced K-Means clustering method [10]. A variety of clustering validation measures are used in the literature to evaluate the validity of clustering results [11], [12].

* Corresponding Author: sathya674@yahoo.co.in
Numerous validation indexes, such as the Jaccard coefficient, the simple matching coefficient and Hubert's Γ (gamma) statistic (HGS) [13], are used in practice to evaluate the stability of parameters and the reliability of clustering algorithms, but they are applied only in the post-validation phase. Clustering techniques have the drawbacks of poor clustering quality and destabilization of clusters [4], [14], [15], [16]. Vincent Tseng et al. have used a correlation based clustering algorithm for partitioning co-regulated genes. To improve the quality of clustering, the validation technique is integrated into the clustering process [13]. In the initial stage of clustering, this algorithm adds highly negatively correlated elements in addition to positively correlated elements. In a later phase, it removes the cluster members that were inaccurately added, and hence it consumes more computational resources. Recently the authors have developed a variant of sparse matrices to represent the gene expression similarity matrix [17], [18]. A sparse matrix is a suitable data structure for effective

memory utilization. The authors also improved the validation statistic by replacing the basic Hubert's statistic with a fast heuristic, namely Enhanced HGS (EHGS). Computational intelligence [18], [19] is generally accepted to include evolutionary computation and is used to increase the precision of the resolved structure. The genetic algorithm (GA) has proven to be a robust and effective search method that requires very little information about the problem in order to explore a large search space. A blend of computational intelligence and clustering approaches provides rapid, automated feature selection and pattern recognition for a wide assortment of gene expression profiles [19]. Most clustering algorithms suffer from the high dimensionality and huge size of the data. A good clustering algorithm is required to analyze these fast-growing gene expression data sets efficiently and effectively, but the dimensionality and size of the data pose challenging problems in both computational and biomedical research, and the difficult task ahead is transforming gene expression data into subject-specific knowledge. Various methods have been developed to reduce the size of gene expression data [20], [21], [22]. In the proposed work, the clustering algorithm is integrated with a dimensionality reduction technique, namely Principal Component Analysis (PCA), whose goal is to reduce the dimensionality of the data to facilitate visualization and further analysis. PCA is often used as a pre-processing step for the clustering analysis of large data sets and is widely used on gene expression data.

2. RESEARCH METHODOLOGY

The high dimensionality of gene expression data sets and the high percentage of irrelevant or redundant genes make it very difficult either to classify samples or to pick out substantial genes in a context where little domain knowledge is available.
To address this problem, PCA has been applied to analyze gene expression data. PCA is a classical statistical technique that reduces the dimensionality of the data by transforming it to a set of variables that summarize the features of the data without much loss of information [22]. The principal components (PCs) are uncorrelated and ordered. PCA is closely related to a mathematical technique called Singular Value Decomposition (SVD), which is applied before the clustering process so that only the relevant data is passed to the clustering step. SVD takes a gene expression data matrix A of order n × p, where the n rows represent the genes and the p columns (p is approximately equal to n) represent the experimental conditions. The SVD theorem is as follows:

$$A_{n \times p} = U_{n \times n} \, S_{n \times p} \, V^{T}_{p \times p} \qquad (1)$$

$$U^{T} U = I_{n \times n}, \qquad V^{T} V = I_{p \times p} \qquad (2)$$

U and V are orthogonal. The columns of U are the left singular vectors (the gene coefficient vectors), and S has the same dimension as A. SVD thus represents an expansion of the original data in a coordinate system where the covariance matrix is diagonal. SVD consists of finding the eigenvalues and eigenvectors of the following:

$$A A^{T} \quad \text{and} \quad A^{T} A \qquad (3)$$

The components are selected according to the eigenvectors. The selected eigenvectors form a feature vector, which is the basis of data compression. The eigenvector with the highest eigenvalue is the first principal component of the data set: it points through the middle of the data and captures the most significant relationship between the data dimensions. The least significant components are ignored. The clustering algorithm then forms the similarity matrix from the PCs, using only the relevant, biologically significant data. Unlike traditional clustering algorithms, the proposed approach uses a constraint based addition procedure to add elements to the clusters. It never removes an element from a cluster once added, and outliers are filtered out during the initial phase of the clustering process.
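The PCA pre-processing step described above can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names `pca_via_svd` and `correlation_similarity`, the column-wise centering, the number of retained components and the use of Pearson correlation as the similarity measure are all assumptions made for the example.

```python
import numpy as np


def pca_via_svd(A, n_components):
    """Project genes (rows of A) onto the leading principal components.

    A is an n x p matrix: n genes (rows) by p conditions (columns).
    Centering each column is an assumption of this sketch.
    """
    A_centered = A - A.mean(axis=0)
    # Thin SVD: A_centered = U @ diag(s) @ Vt, singular values in descending order.
    U, s, Vt = np.linalg.svd(A_centered, full_matrices=False)
    components = Vt[:n_components]            # leading right singular vectors
    scores = A_centered @ components.T        # reduced coordinates of each gene
    explained = s ** 2 / np.sum(s ** 2)       # share of variance per component
    return scores, components, explained[:n_components]


def correlation_similarity(scores):
    """Gene-by-gene Pearson correlation computed on the reduced coordinates."""
    return np.corrcoef(scores)


# Hypothetical toy data: 8 genes measured under 5 conditions.
rng = np.random.default_rng(0)
A = rng.normal(size=(8, 5))
scores, components, explained = pca_via_svd(A, n_components=3)
similarity = correlation_similarity(scores)
```

Keeping only the leading components discards the directions with the smallest singular values, which is the sense in which the least significant components are ignored; how many components to keep is left open here and would need to be chosen for the data at hand.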
Consequently, the stability and quality of the clustering process are improved. To assess the predictive power of the clustering algorithm and the quality of the clustering results, a combination of EHGS [17] and the figure of merit (FOM) [11] is used. A typical gene expression data set contains the measurements of expression levels of n genes measured under n experimental conditions. The expression levels of co-regulated genes are expected to vary similarly across the n conditions. Consequently, clustering the genes based on similarities among these expression level measurements should isolate clusters of biologically related genes. The EHGS is as follows:

$$\Gamma = \frac{1}{M} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \frac{\bigl(A(i,j) - \mu_A\bigr)\bigl(B(i,j) - \mu_B\bigr)}{\sigma_A \, \sigma_B}, \qquad M = \frac{n(n-1)}{2} \qquad (4)$$

Here μ and σ denote the mean and standard deviation of the entries of the matrices A and B. A clustering algorithm is said to have good predictive power if genes in the same cluster tend to have similar expression levels. Among the experimental conditions, the condition that is not used to produce the clusters serves as the leave-one-out condition and is assumed to be the least significant constraint [11]. The clustering process is evaluated with reference to the left-out condition. This scheme does not guarantee that the left-out condition is the appropriate one for determining the predictive power of the clustering algorithm. There is no proof that the left-out condition is not biologically significant, since every condition is equally likely to be the one left out. Hence the proposed algorithm uses a variant of FOM, a set of scalar quantities that determine the predictive power of clusters. This implies that a set of threshold parameters is attached to every cluster produced for each pre-defined biological condition, i.e. T_i is the threshold parameter for

the ith cluster (where 1 ≤ i ≤ k) and k is the number of clusters. This heuristic helps in reducing redundant computations. The idea behind the FOM is that the data from conditions 0, 1, 2, ..., (m-1) are used to estimate the predictive power of the algorithm. Suppose k clusters C_1, C_2, ..., C_k are obtained, with cluster sizes s_1, s_2, ..., s_k, such that s_i ≤ n. Let R(i, j) be the expression level of gene i under condition j in the similarity matrix, and let FOM(i, k) be the FOM for k clusters using condition i as validation along with a threshold value T_i. Thus FOM is defined as

$$FOM(i, k) = \frac{1}{n} \bigl( R(X, e) \bigr) \le T_i \qquad (5)$$

Numerous strategies are available in the literature to set the threshold value [23]. There is no theoretical proof that the chosen value is appropriate for clustering. In the proposed approach, a metric is assigned specifically for gene expression data clustering. The threshold value T_i is the expected average distance (according to the distance metric) of the objects assigned to the cluster C_i. A mixture of the correlation based enhanced variant and FOM operators is used to select the appropriate cluster for the gene expression data and to validate the clusters simultaneously. Hence the proposed algorithm, Gene clustering using Correlation Search Technique (G-CST), is scalable, efficient and resilient in determining biologically significant patterns.

3. THE PROPOSED (G-CST) ALGORITHM

Input: Gene expression data (n × n matrix).
Output: Biologically significant clusters.

begin
1. Initialization:
     A: input gene expression similarity matrix of order n × n
     Perform PCA: A A^T and A^T A
     M = n(n - 1)/2
     S_A = sum over i = 1..n-1, j = i+1..n of A(i, j)
     S_B = 0
     C = ∅
     G = {1, 2, 3, ..., n}
     CGST = ∅
2. while (G is not empty) do
   begin
     C_open = ∅;
     for i := 1 to n do a(i) = 0;
     Max_EHGS = Γ_max;
     pick an element c from G with maximum neighbours;
     remove c from G;
     for i := 1 to n do a(i) = A(c, i);
     C_open = {c};
     FOM(i, k) = (1/n)(R(X, e));
     // Adding elements to the clusters
     repeat
       while (the Max_EHGS criterion and the FOM(i, k) threshold T_i are satisfied) do
       begin
         select c from G with highest a(i);
         remove c from G;
         S_B = S_B + |C_open|;
         S_AB = S_AB + a(c);
         for i := 1 to n do a(i) = a(i) + A(c, i);
       end
     until all elements in A(i, j) are assigned;
   end
3. Return the collection of clusters;
end

Figure 1: Pseudo Code for G-CST Clustering

The algorithm consists of an initialization step and an iterative step. In the initialization step, the algorithm first computes the PCA by determining the eigenvectors. The similarity matrix is constructed from the feature vectors. After that the appropriate population is generated. The iterative step successively selects elements and allocates them to the appropriate cluster.

3.1 Clustering and Validation

The input to the algorithm is a raw gene expression data matrix. This is converted into a sparse symmetric similarity matrix of the gene expression data set. The algorithm constructs clusters one at a time. The current cluster is denoted by C_open. Each cluster is started from a seed value and constructed incrementally by adding items to C_open. The addition of data items is computed using EHGS [17], denoted Γ. The current maximum is represented as Γ_add(k). An element k is added if it has a high positive correlation Γ_max, i.e. high similarity. The algorithm also places gene data items of low similarity in different clusters according to the Γ value. The value of Γ lies in (−1, 1), and a higher value of Γ represents better clustering quality. A data item is added to the cluster if it satisfies the maximum neighbours criterion and a threshold value. In general, the threshold value depends on the number of patterns and the number of features in the data set. The

set of clusters is stabilized by consecutive addition operations. To start a new cluster, the data item with the maximum number of neighbours (closest data items) is used. A threshold value is also applied whenever an element is added. This automatically filters out outlier data items, while valid items are inserted into their respective clusters. The mixture of validation measures provides increased predictive performance relative to other methods of pattern recognition. These are the principal heuristics attached to this algorithm, and they are responsible for assigning clusters to all the valid items. Unlike in the correlation based clustering algorithm devised by Vincent Tseng et al. [13], added items need not be removed from the cluster.

4. EMPIRICAL RESULTS

To assess the performance of the proposed approach, it is compared with the K-Means, E-CAST [23] and E-CST algorithms on cancer gene expression data sets. Several other algorithms are available in the literature for comparison; these three were chosen because they are closely related to the proposed algorithm. Data sets [24] from breast cell lines are used here to evaluate the proposed methodology. A mixture of FOM and EHGS is used to estimate the predictive power of the clustering algorithms. To obtain reliable clustering results, the proposed approach, K-Means, E-CAST and Enhanced Correlation Search Technique (E-CST) algorithms are executed 25, 20, 25 and 20 times respectively. Transfection with a single oncogene is expected to generate similar expression profiles, presumably because only a few genes are biologically influenced. Therefore, it is desirable to see whether profiles of the different phenotypes can be partitioned. Owing to the presence of noise in the data and the similarity between the different samples, common clustering techniques such as K-Means and E-CAST failed to produce good quality clusters.
Expression levels of the four cell lines were measured in two separate sets of four measurements. The cluster structures of these data sets are determined in advance. From the given data set, users can set parameters to generate various kinds of gene expression data sets that vary in the number of clusters and the number of genes in each cluster. Seed genes are generated first, and they must have the same number of constraints for all clusters. If the seed genes and the threshold values are appropriately incorporated and tested in the algorithm, the resulting clusters have high intra-cluster similarity and low inter-cluster similarity. During the initial phase of the clustering process the outliers or noise are purged successfully. The proposed approach (G-CST) is compared with the other clustering algorithms. Table 1 gives the details of the data sets, cluster structure and clustering patterns for the proposed approach, E-CAST, K-Means and E-CST, together with their computational time (running time in Table 1). The newly designed algorithm outperforms the others, both quantitatively and qualitatively, in computational time and memory utilization. E-CST comes closest to it in accuracy. In addition, the results illustrate that the quality of clustering is better with the proposed algorithm. This can provide more accurate results and insight into molecular processes, morphological characteristics and gene control functions. Figures 2 and 3 depict a large contiguous group of genes sharing similar expression patterns over a set of conditions. This type of clustering structure highlights the biological significance of the underlying genes. The curved lines in Figures 2 and 3 represent the sum of the average FOM(i, k) and EHGS measures on the experimental conditions. The newly designed algorithm and E-CST are very sensitive to outliers.
The number of clusters is a crucial parameter in traditional clustering algorithms, whereas the proposed approach produces the clusters automatically without any user input. The result of this clustering analysis may be a group of co-regulated genes (i.e. genes that exhibit similar experimental behavior) placed in the same cluster. Such groups express the relationships between the clusters and the functional categories in biological activities. The behavior of the clustering algorithms on the data sets presented here demonstrates a feature of gene expression that makes this method particularly useful. It is known that genes expressed together share common functions. Gene expression patterns suffice to separate genes into functional categories across a relatively small and redundant collection of conditions. It has been observed that the addition of more, and more diverse, conditions can only enhance these observations. The behavior of the clustering algorithms on gene expression data set 2 is very similar to that on data set 1, as shown in Table 1. When the number of clusters is small, the E-CAST and K-Means algorithms have comparable FOM and EHGS, which are lower than those of the new approach and E-CST. When the number of clusters is large, the proposed algorithm has comparable FOM and EHGS. In Figure 2, the knee-shaped structure in the curves between one and two clusters indicates that cluster separation is minimal for data set 2. The data sets considered for evaluation exhibit declining validity measures under all algorithms as the number of clusters increases. Two factors contribute to this. First, the algorithms may be finding higher quality clusters, as they subdivide large, coarse clusters into smaller, more homogeneous ones. Second, simply increasing the number of clusters will tend to decrease the validity measures.
There is an obvious negative slope in both figures, showing that clustering results with low values tend to have high correspondence with the given functional categorization. The mixture of FOM and EHGS provides a meaningful estimate of cluster quality.
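To make the two kinds of validity measure discussed above concrete, the sketch below computes a normalized Hubert statistic and a leave-one-out figure of merit on a toy example. It is a hedged illustration only: it uses the textbook normalized Hubert statistic and the FOM as defined by Yeung et al. [11], not the authors' EHGS heuristic or their thresholded FOM variant, and the function names, the co-membership matrix B and the toy data are assumptions introduced for the example.

```python
import numpy as np


def normalized_hubert_gamma(A, labels):
    """Correlation between similarities A(i, j) and co-membership B(i, j) over pairs i < j."""
    labels = np.asarray(labels)
    upper = np.triu_indices(len(labels), k=1)      # the M = n(n-1)/2 gene pairs
    a = np.asarray(A, dtype=float)[upper]
    # B(i, j) = 1 when genes i and j share a cluster, else 0 (assumed definition).
    b = (labels[:, None] == labels[None, :])[upper].astype(float)
    if a.std() == 0 or b.std() == 0:
        return 0.0                                  # statistic undefined; 0 by convention here
    return float(np.mean((a - a.mean()) * (b - b.mean())) / (a.std() * b.std()))


def figure_of_merit(R, labels, e):
    """Root mean squared deviation of the left-out condition e from its cluster means."""
    labels = np.asarray(labels)
    total = 0.0
    for c in np.unique(labels):
        values = R[labels == c, e]
        total += np.sum((values - values.mean()) ** 2)
    return float(np.sqrt(total / R.shape[0]))


# Hypothetical toy data: 4 genes, similarity matrix A and expression matrix R.
A = np.array([[1.0, 0.9, 0.1, 0.2],
              [0.9, 1.0, 0.2, 0.1],
              [0.1, 0.2, 1.0, 0.8],
              [0.2, 0.1, 0.8, 1.0]])
R = np.array([[1.0, 1.1, 0.9],
              [1.2, 1.0, 1.1],
              [5.0, 5.2, 4.9],
              [5.1, 4.9, 5.0]])

good, bad = [0, 0, 1, 1], [0, 1, 0, 1]
gamma_good = normalized_hubert_gamma(A, good)
gamma_bad = normalized_hubert_gamma(A, bad)
fom_good = figure_of_merit(R, good, e=0)
fom_bad = figure_of_merit(R, bad, e=0)
```

In this toy case the partition that matches the similarity structure scores a higher Hubert statistic and a lower FOM than the mismatched one, which is the direction of agreement the two measures are expected to show.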

Table 1
Experimental Results for 2 Gene Expression Data Sets

Algorithm | No. of Clusters | Enhanced Statistic | No. of Outliers | Running Time | Patterns (approx)
E-CST | 63, | | ,10 | O(n log n) | 3000, 200
E-CAST | 58, | | ,20 | O(n^2) | 2500, 180
K-means | 27, | | ,12 | O(n^2) | 2000, 75
Proposed G-CST | 67, | | < 100, 10 | >= O(n log n) | > 3000, 200

The number of outliers shown in Table 1 is the approximate total across all the clusters. In G-CST the outliers are clearly greatly reduced, since only relevant genes are considered for building clusters. All the possible biologically significant patterns are extracted from the clusters, which are formed on the basis of the constraints specified by the proposed algorithm.

Figure 2: Clustering Behaviour of Data Set 1

Figure 3: Clustering Behaviour of Data Set 2

The proposed approach is superior to the existing approaches in quality, efficiency, stability and memory utilization. Figures 2 and 3 show its strength in capturing sharply coherent tendencies in gene expression data. In addition, the functionally enriched clusters highlight the fact that these clusters carry significant biological meaning.

7. CONCLUSION AND FUTURE WORK

As the number of microarray experiments continues to increase drastically and these techniques become more and more a part of personalized healthcare, computational methods must expand to support this growth. Most of the clustering algorithms used in practice have certain inherent difficulties. The novel approach presented here clusters gene expression data sets and produces good results. This clustering process shows great promise for gleaning information from gene expression profiles. To evaluate the performance of the method, cancer gene expression data sets have been used and it has been compared with the E-CST, E-CAST and K-Means clustering algorithms.
It is clear that the healthcare industry requires methods to rapidly translate microarray data into practical use. Future work includes the application of more real data sets and a theoretical analysis of how the threshold parameter is determined. A key roadblock remains discovering exact patterns with good predictive accuracy while retaining high clustering accuracy. Combinations of computational intelligence approaches hold promise for rapid, automated pattern recognition for a wide assortment of data. A blend of parallel approaches such as genetic algorithms and characterization guided clustering may improve performance and will play an increasingly important role in gene expression analysis.

References

[1] P.T. Spellman, G. Sherlock, M.Q. Zhang, V.R. Iyer, K. Anders, M.B. Eisen, P.O. Brown, D. Botstein, and B. Futcher, Comprehensive Identification of Cell Cycle-Regulated Genes of the Yeast Saccharomyces Cerevisiae by Microarray Hybridization, Molecular Biology of the Cell, 9(12), pp ,
[2] T. Zhang, R. Ramakrishnan, and M. Livny, BIRCH: An Efficient Data Clustering Method for Very Large Databases, Proc. ACM SIGMOD Int'l Conf. Management of Data, pp ,
[3] T.R. Golub, D.K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J.P. Mesirov, H. Coller, M. Loh, J.R. Downing, M. Caligiuri et al., Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring, Science, 286: ,
[4] M.S. Chen, J. Han, and P.S. Yu, Data Mining: An Overview from a Database Perspective, IEEE Trans. Knowledge and Data Eng., 8(6), pp , Dec
[5] U. Alon, N. Barkai, D.A. Notterman, K. Gish, S. Ybarra, D. Mack, and A.J. Levine, Broad Patterns of Gene Expression Revealed by Clustering Analysis of Tumor and Normal Colon Tissues Probed by Oligonucleotide Arrays, Proc. Nat'l Academy of Sciences, 96, pp , 1999.

[6] A. Ben-Dor and Z. Yakhini, Clustering Gene Expression Patterns, J. Computational Biology, 6, pp ,
[7] M.B. Eisen, P.T. Spellman, P.O. Brown, and D. Botstein, Cluster Analysis and Display of Genome-Wide Expression Patterns, Proc. Nat'l Academy of Sciences, 95, pp ,
[8] M.K. Kerr and G.A. Churchill, Bootstrapping Cluster Analysis: Assessing the Reliability of Conclusions from Microarray Experiments, Proc. Nat'l Academy of Sciences, 98(16), pp ,
[9] P. Tamayo, D. Slonim, J. Mesirov, Q. Zhu, S. Kitareewan, E. Dmitrovsky, E.S. Lander, and T.R. Golub, Interpreting Patterns of Gene Expression with Self-Organizing Maps: Methods and Application to Hematopoietic Differentiation, Proc. Nat'l Academy of Sciences, 96(6), pp ,
[10] N.P. Gopalan and B. Sathiyabhama, Scalable Biclustering Gene Expression Data using Genetic Enhanced K-Means Algorithm, Proc. National Conference on High Performance Computing - VISION 06, pp
[11] K.Y. Yeung, D.R. Haynor, and W.L. Ruzzo, Validating Clustering for Gene Expression Data, Bioinformatics, 17(4), pp ,
[12] A.K. Jain and R.C. Dubes, Algorithms for Clustering Data, Englewood Cliffs, N.J.: Prentice Hall,
[13] Vincent S. Tseng and Ching-Pin Kao, Efficiently Mining Gene Expression Data via a Novel Parameterless Clustering Method, IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2(4), pp , Dec
[14] S. Guha, R. Rastogi, and K. Shim, CURE: An Efficient Clustering Algorithm for Large Databases, ACM Int'l Conf. Management of Data, pp ,
[15] S. Guha, R. Rastogi, and K. Shim, ROCK: A Robust Clustering Algorithm for Categorical Attributes, 15th Int'l Conf. Data Eng., pp ,
[16] T. Kohonen, The Self-Organizing Map, Proc. IEEE, 78(9), pp ,
[17] B. Sathiyabhama and N.P. Gopalan, Correlation Search Technique for Clustering Cancer Gene Expression Data, WSEAS International Conferences, Lisbon, Sep
[18] B. Sathiyabhama and N.P. Gopalan, Enhanced Correlation Search Technique for Clustering Cancer Gene Expression Data, WSEAS Transactions on Information Science and Applications, 12, 3, 2006, pp
[19] G.B. Fogel and D.W. Corne, Evolutionary Computation in Bioinformatics, Morgan Kaufmann, San Francisco
[20] E.P. Xing and R.M. Karp, CLIFF: Clustering of High-Dimensional Microarray Data via Iterative Feature Filtering Using Normalized Cuts, Bioinformatics, 17(4), ,
[21] Ka Yee Yeung and Walter L. Ruzzo, An Empirical Study on Principal Component Analysis for Clustering Gene Expression Data, Technical Report UW-CSE , Department of Computer Science and Engineering, University of Washington,
[22] N.S. Holter, M. Mitra, A. Maritan, M. Cieplak, J.R. Banavar, and N.V. Fedoroff, Fundamental Patterns Underlying Gene Expression Profiles: Simplicity from Complexity, Proceedings of the National Academy of Sciences USA, 97, ,
[23] Abdelghani Bellaachia et al., E-CAST: A Data Mining Algorithm for Gene Expression Data, Proc. BIOKDD02: Workshop on Data Mining in Bioinformatics (with SIGKDD02 conference), pp
[24] Kluger, H., Kacinski, B., Kluger, Y., Mironenko, Gilmore Hebert, M., Chang, J., Perkins, A.S., and Sapi, E., Microarray Analysis of Invasive and Metastatic in a Breast Cancer Model, poster presented at the Gordon Conference on Cancer, Newport, RI, 2001.
