Weighted Consensus Clustering for Identifying Functional Modules In Protein-Protein Interaction Networks

Yi Zhang (1), Erliang Zeng (2), Tao Li (1), Giri Narasimhan (1)
(1) School of Computer Science, Florida International University, {yzhan004,taoli,giri}@cs.fiu.edu
(2) Department of Computer Science and Engineering, University of Notre Dame, ezeng@nd.edu

Abstract

In this article we present a new approach, weighted consensus clustering, to identify clusters in protein-protein interaction (PPI) networks, where each cluster corresponds to a group of functionally similar proteins. In weighted consensus clustering, different input clustering results are weighted differently, i.e., a weight is introduced for each input clustering and the weights are automatically determined by an optimization process. We evaluate our proposed method with standard measures such as modularity, normalized mutual information (NMI), and the Gene Ontology (GO) consortium database, and compare the performance of our approach with other consensus clustering methods. Experimental results demonstrate the effectiveness of our proposed approach.

1. Introduction

Proteins are central components of the cell machinery in living systems and operate at almost every level of cell function. They usually interact with other proteins, either in pairs or as components of larger complexes. Protein-protein interaction (PPI) maps provide a valuable new perspective for a better understanding of the functional organization of the proteome [2]. PPI networks are important information sources related to biological processes and complex metabolic functions of the cell [7]. Many researchers have theorized that these networks contain biologically relevant functional modules [5, 22]. Identifying functional modules from PPI networks is thus an important and challenging task in the post-genomic era.

In general, a PPI network can be represented as a graph, where the nodes represent proteins and the edges indicate interactions between two proteins. With this network model, functional modules can be identified as cliques [14]. Traditional graph clustering algorithms have also been applied to find functional modules as clusters by partitioning the graph [13, 8, 6]. The Markov clustering method (MCL) is a fast and scalable unsupervised clustering algorithm for graph partitioning based on the simulation of (stochastic) flow in graphs, and has been applied to predict functional modules [13]. The RNSC approach proposed by King et al. detects functional modules using a restricted-neighborhood search clustering algorithm with a cost function [8]. A bipartite graph is used by Ding et al. to represent the protein-complex relationship, and MinMaxCut clustering is proposed to find meaningful functional modules [6]. A comparative study of various clustering methods for protein-protein interaction networks can be found in [4].

Despite the significant progress that has been made in the area, the application of existing algorithms for extracting functional modules from PPI data is far from satisfactory, due to the following challenges [3]. The first challenge is data quality. Different high-throughput screening methods, such as yeast two-hybrid systems and mass spectrometry, discover different PPI sets, and the overlap among them is low [20]. In addition, the PPI data discovered by high-throughput experimental systems are considered to have a very high false positive rate [20, 11].
Second, extracting functional modules by partitioning the network using classical graph partitioning or clustering schemes is inherently difficult, even if the network is assumed to be noise free: PPI networks are characterized by a few nodes (hubs) with very large degrees, while most other nodes have very few interactions. Third, some proteins are believed to be multi-functional. Hence, clustering algorithms need to support soft assignment, i.e., assigning a protein to multiple groups.

In this paper, we present our research efforts to address the above challenges. We first pre-process the data using a line graph transformation based on two topological matrices, which turns the PPI network into a sparser network with reduced interactions and can lead to a more biologically relevant partitioning than the original graph [18]. We then use weighted consensus clustering to combine multiple, diverse, and independent clustering results to improve the quality and robustness of identification. Consensus clustering offers an appealing framework for taking advantage of the strengths of individual clustering algorithms.

Empirical evidence has suggested that consensus clustering can improve clustering robustness and discover useful cluster structures even when the data are quite noisy [17]. In weighted consensus clustering, different input clustering results are weighted differently, i.e., a weight is introduced for each input clustering and the weights are automatically determined by an optimization process similar to kernel matrix learning [9]. In addition, our weighted consensus clustering framework provides a natural soft assignment, since the values in the clustering solutions reflect the degree of association between data points and clusters. We evaluate our proposed method with standard measures such as modularity, normalized mutual information (NMI), and the Gene Ontology (GO) consortium database, and compare the performance of our approach (weighted consensus clustering) with PCA-rbr [3], CSPA [15], and HGPA [16]. Experimental results illustrate the effectiveness of our approach.

2. Our Proposed Method

Figure 1 describes the flowchart of our proposed method. In particular, our approach follows the framework of consensus clustering for identifying PPI functional modules [3], and it consists of three components. First, the clustering-coefficient-based and betweenness-based similarity matrices proposed in [3] are used to weight each edge. Second, four clustering methods are used to generate individual base clustering results. Finally, weighted consensus clustering is used to aggregate the individual clusterings and obtain the final functional modules. We describe each component in the following subsections.

[Figure 1. The overview of our proposed approach.]

2.1. Similarity Measures

Two different similarity measures are designed to capture diverse topological properties of the PPI network. The goal is to weight the edges of the PPI network to reflect the reliability of the corresponding interactions; in other words, this step can be used to reduce noise and to incorporate topological and network properties of the PPI data.

Clustering coefficient-based similarity: This method is based on the clustering coefficient [21], which represents the interconnectivity of a vertex's neighbors. The similarity between two nodes $v_i$ and $v_j$ can be calculated by Eq. (1) [3]:

$S_{cc}(v_i, v_j) = CC_{v_i} + CC_{v_j} - CC'_{v_i} - CC'_{v_j}$,  (1)

where the clustering coefficient $CC(v)$ of a node $v$ is computed using Eq. (2) [3]:

$CC(v) = \dfrac{2 n_v}{k_v (k_v - 1)}$,  (2)

where $n_v$ denotes the number of triangles that go through node $v$ and $k_v$ is the degree of node $v$, i.e., the number of edges connected to node $v$. $CC'_{v_i}$ and $CC'_{v_j}$ are the clustering coefficients of the two nodes recalculated after removing the interaction (edge) between them.

Betweenness-based similarity: This method is based on the shortest-path edge betweenness measure [12]; it computes, for each edge, the fraction of shortest paths that pass through that edge, as in the following equation [3]:

$S_b(v_i, v_j) = 1 - \dfrac{SP}{SP_{max}}$,  (3)

where $SP$ is the number of shortest paths passing through the edge $(v_i, v_j)$ and $SP_{max}$ is the maximum number of shortest paths passing through any edge in the graph.

Both of these similarity measures are defined only for connected pairs. They are rescaled into the range from 0 to 1 using min-max normalization.
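To make the two edge-weighting schemes concrete, the following is a minimal sketch (our own illustration, not the authors' code), assuming an undirected networkx graph; note that unnormalized edge betweenness centrality is used here as a stand-in for the shortest-path count SP in Eq. (3).

```python
import networkx as nx

def cc_similarity(G, u, v):
    """Clustering-coefficient-based similarity, Eqs. (1)-(2).

    The CC' values are the clustering coefficients recomputed after
    removing the edge (u, v)."""
    cc_u, cc_v = nx.clustering(G, u), nx.clustering(G, v)
    H = G.copy()
    H.remove_edge(u, v)
    return cc_u + cc_v - nx.clustering(H, u) - nx.clustering(H, v)

def betweenness_similarity(G):
    """Betweenness-based similarity, Eq. (3): S = 1 - SP / SP_max."""
    # unnormalized edge betweenness approximates the number of shortest
    # paths passing through each edge
    sp = nx.edge_betweenness_centrality(G, normalized=False)
    sp_max = max(sp.values())
    return {e: 1.0 - val / sp_max for e, val in sp.items()}

def min_max_rescale(weights):
    """Rescale a dict of edge weights into [0, 1] (min-max normalization)."""
    lo, hi = min(weights.values()), max(weights.values())
    return {e: (w - lo) / (hi - lo) if hi > lo else 0.0
            for e, w in weights.items()}
```

Either similarity can then be attached to the edges of the PPI graph before running the base clustering algorithms described next.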
2.2. Base Clustering Algorithms

We use four conventional graph clustering algorithms, namely three methods in Metis (rbr, direct, and Metis) and spectral clustering, to obtain the base clusterings. The algorithms are described below:

(i) Repeated bisections (rbr): a top-down clustering algorithm that uses a sequence of k-1 repeated bisections (k is the number of clusters) to compute the desired k-way clustering solution;

(ii) Direct k-way partitioning (direct): k instances are selected from the dataset as the seeds of the k clusters, and each instance is then assigned to the cluster corresponding to its most similar seed by computing its similarity with the k seeds; this method finds the k clusters simultaneously;

(iii) Multilevel k-way partitioning (Metis): a multilevel partitioning algorithm that works in three phases: coarsening, initial partitioning, and refinement;

(iv) Spectral clustering: a spectral clustering algorithm obtained by recursively applying a spectral method for graph partitioning [19].

2.3. Consensus Clustering Method

From Section 2.1, two similarity matrices are obtained. Coupled with the four base clustering algorithms described in Section 2.2, we obtain 2 x 4 = 8 sets of base clusterings. The goal of the consensus method is to combine these 8 individual clusterings to derive a better clustering solution. We use weighted consensus clustering [10], in which each input clustering is weighted and the weights are automatically determined.

Formally, suppose we are given a set of T clusterings (or partitions) $P = \{P_1, P_2, \ldots, P_T\}$ of the dataset. Each partition $P_t$, $t = 1, \ldots, T$, consists of a set of clusters $C^t = \{C^t_1, C^t_2, \ldots, C^t_k\}$, where k is the number of clusters of partition $P_t$.

The weighted consensus clustering framework is based on nonnegative matrix factorization (NMF), and the optimization involves two major steps. First, we define the connectivity matrix of partition $P_t$ as

$M(P_t)_{ij} = \begin{cases} 1, & (i, j) \in C_k(P_t) \\ 0, & \text{otherwise} \end{cases}$  (4)

i.e., the connectivity between nodes i and j is 1 if they belong to the same cluster $C_k$ of $P_t$, and 0 otherwise. Second, we build the weighted consensus association between nodes i and j as

$\tilde{M} = \dfrac{1}{T} \sum_{t=1}^{T} w_t M(P_t)$,  (5)

where $w = (w_1, w_2, \ldots, w_T)^T$ and $\sum_{t=1}^{T} w_t = 1$. We then solve the following optimization problem:

$\min_{U} \sum_{i,j=1}^{n} (\tilde{M}_{ij} - U_{ij})^2 = \|\tilde{M} - U\|^2$,  (6)

where U is the desired clustering solution. We specify the clustering solution by a cluster indicator matrix $H \in \{0, 1\}^{n \times k}$, where n is the number of instances and k the number of clusters. The above optimization problem can then be written as

$\min_{H \ge 0} \|\tilde{M} - H H^T\|^2$.  (7)

Let D be a diagonal matrix indicating the number of points in each cluster, i.e.,

$D = \mathrm{diag}(H^T H) = \mathrm{diag}(n_1, \ldots, n_k)$.  (8)

Then the optimization problem in Eq. (7) reduces to a symmetric NMF problem:

$\min_{\tilde{H}^T \tilde{H} = I,\ \tilde{H}, D \ge 0} \|\tilde{M} - \tilde{H} D \tilde{H}^T\|^2$,  (9)

where $\tilde{H} = H D^{-1/2}$ is the normalized cluster indicator. Note that once w is fixed, $\tilde{H}$ and D can be obtained using the multiplicative update rules below:

(a) update $\tilde{H}$: $\tilde{H} \leftarrow \tilde{H} \odot \dfrac{\tilde{M} \tilde{H} D}{\tilde{H} \tilde{H}^T \tilde{M} \tilde{H} D}$;

(b) update $D$: $D \leftarrow D \odot \dfrac{\tilde{H}^T \tilde{M} \tilde{H}}{\tilde{H}^T \tilde{H} D \tilde{H}^T \tilde{H}}$.

Therefore, to optimize Eq. (9), we iterate the following two steps:

(i) solve for $\tilde{H}$ with w fixed, using the NMF-based updates above;

(ii) solve for w with $\tilde{H}$ fixed:

$\min_{w} J = \mathrm{Tr}[\tilde{M}^T \tilde{M}] - 2\,\mathrm{Tr}[\tilde{H}^T \tilde{M} \tilde{H}] + \mathrm{Tr}[\tilde{H}^T \tilde{H} \tilde{H}^T \tilde{H}]$,  (10)

where $\|\tilde{M}\|^2 = w^T A w$ and $\mathrm{Tr}(\tilde{H}^T \tilde{M} \tilde{H}) = b^T w$ with $b_i = \mathrm{Tr}[\tilde{H}^T M(P_i) \tilde{H}]$, and

$A_{ij} = \mathrm{Tr}[M(P_i) M(P_j)] = \sum_{uv} M(P_i)_{uv} M(P_j)_{uv}$.  (11)

Thus, for fixed $\tilde{H}$, the problem becomes

$\min_{w} J = w^T A w - 2 b^T w + \mathrm{const}$.  (12)

More details of the weighted consensus clustering can be found in [10].
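The alternating procedure of Eqs. (4)-(12) can be summarized in a short sketch. This is our own illustrative implementation, not the authors' released code: it assumes the base clusterings are given as hard label vectors, applies the multiplicative updates above for $\tilde{H}$ and D, and solves the weight subproblem of Eq. (12) with SciPy's SLSQP solver under the constraints $\sum_t w_t = 1$, $w_t \ge 0$; the names `connectivity` and `weighted_consensus` are ours.

```python
import numpy as np
from scipy.optimize import minimize

def connectivity(labels):
    """Connectivity matrix of one partition, Eq. (4)."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(float)

def weighted_consensus(partitions, k, n_iter=50, seed=0):
    """Sketch of weighted consensus clustering [10], Eqs. (4)-(12).

    partitions : list of T label vectors (the base clusterings)
    Returns the (soft) cluster indicator H (n x k) and the weights w."""
    rng = np.random.default_rng(seed)
    Ms = [connectivity(p) for p in partitions]
    T, n = len(Ms), Ms[0].shape[0]
    A = np.array([[np.sum(Mi * Mj) for Mj in Ms] for Mi in Ms])  # Eq. (11)
    w = np.full(T, 1.0 / T)
    H = np.abs(rng.standard_normal((n, k)))
    D = np.eye(k)
    eps = 1e-9
    cons = [{"type": "eq", "fun": lambda x: x.sum() - 1.0}]
    for _ in range(n_iter):
        M = sum(wt * Mt for wt, Mt in zip(w, Ms)) / T            # Eq. (5)
        # (a), (b): multiplicative updates for the symmetric NMF of Eq. (9)
        H *= (M @ H @ D) / (H @ (H.T @ M @ H @ D) + eps)
        D *= (H.T @ M @ H) / (H.T @ H @ D @ H.T @ H + eps)
        # Eq. (12): quadratic program in w (scaled by the 1/T of Eq. (5))
        b = np.array([np.trace(H.T @ Mt @ H) for Mt in Ms]) / T
        obj = lambda x: x @ A @ x / T**2 - 2.0 * b @ x
        w = minimize(obj, w, method="SLSQP",
                     bounds=[(0.0, None)] * T, constraints=cons).x
    return H, w
```

Hard modules can be read off by taking the largest entry in each row of H, while the row distributions themselves feed the soft assignment described in Section 2.4.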
2.4. Soft Consensus Clustering Method

In general, the cluster indicator $\tilde{H}$ is not exactly orthogonal. This slight deviation from rigorous orthogonality yields the benefit of soft clustering. Suppose a protein has a posterior distribution over the clusters of (0.96, 0, 0.04, 0, ..., 0). This protein clearly falls into a single cluster; we say it has a 1-peak distribution. Suppose another protein has a posterior distribution of (0.48, 0.48, 0.04, 0, ..., 0). This protein is clustered into two clusters; we say it has a 2-peak distribution. In general, we characterize each protein as having a 1-peak, 2-peak, 3-peak, etc. distribution.

For K protein clusters, we set K prototype distributions:

$(1, 0, \ldots, 0),\ \left(\tfrac{1}{2}, \tfrac{1}{2}, 0, \ldots, 0\right),\ \ldots,\ \left(\tfrac{1}{K}, \ldots, \tfrac{1}{K}\right)$.  (13)

For each protein, we assign it to the closest prototype distribution based on the Euclidean distance, allowing all possible permutations of the clusters. In practice, the fewer peaks the posterior distribution of a protein has, the more uniquely the protein belongs to a single module.
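As an illustration of this peak-counting step, the sketch below (our own hypothetical helper, `peak_count`) normalizes one row of H into a posterior distribution and matches it against the prototypes of Eq. (13); sorting the row in descending order stands in for "allowing all possible permutations of the clusters".

```python
import numpy as np

def peak_count(h_row, K):
    """Assign a protein's posterior (one row of H) to the closest
    prototype distribution of Eq. (13) and return its number of peaks."""
    p = np.sort(h_row / h_row.sum())[::-1]        # descending permutation
    best, best_dist = 1, np.inf
    for m in range(1, K + 1):
        proto = np.zeros(K)
        proto[:m] = 1.0 / m                       # m-peak prototype
        dist = np.linalg.norm(p - proto)          # Euclidean distance
        if dist < best_dist:
            best, best_dist = m, dist
    return best
```

A protein whose closest prototype has more than one peak is treated as multi-functional and can be assigned to each of the corresponding modules.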

3. Experiments

3.1. Data Description

In this paper we use the MIPS Yeast Protein-Protein Interaction Database, a collection of manually curated, high-quality PPI data gathered from the scientific literature by expert curators. It consists of 8617 interactions between 871 proteins.

3.2. Comparison Methods

We compare our method with the following consensus clustering algorithms.

PCA-based consensus algorithm: eight independent base clusterings are obtained using the four graph clustering algorithms with the two topology-based metrics; the PCA-rbr algorithm is then used to perform consensus clustering [3].

CSPA (cluster-based similarity partitioning algorithm): a clustering signifies a relationship between objects in the same cluster and can thus be used to establish a measure of pairwise similarity. This induced similarity measure is then used to recluster the objects, yielding a combined clustering [15].

HGPA (hypergraph partitioning algorithm): this algorithm approximates the maximum mutual information objective with a constrained minimum cut objective. Essentially, the cluster ensemble problem is posed as a partitioning problem of a suitably defined hypergraph, where hyperedges represent clusters [16].

As mentioned before, some proteins are believed to have multiple functions. To perform soft assignment with the above consensus clustering algorithms, the following soft-consensus procedure is used: the probability of a protein belonging to an alternative cluster is computed from its distance to the nodes in that cluster. Using the average shortest-path distance, this measure is quantified as [3]:

$P(i, cl_k) = \dfrac{1}{|V_{cl_k}|} \sum_{j \in cl_k} \left(1 - \dfrac{SP(i, j)}{Diam_G}\right)$,  (14)

where $SP(i, j)$ is the length of the shortest path between nodes i and j, $Diam_G$ is the diameter of the PPI graph (the maximum length over all shortest paths), and $V_{cl_k}$ denotes the set of nodes in cluster $cl_k$. Note that a greater P means stronger interactions between node i and cluster $cl_k$, so we can define a threshold for assigning a protein to an alternative cluster: if $P(i, cl_k)$ is larger than this threshold, node i also belongs to cluster $cl_k$.
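A direct reading of Eq. (14), again as an illustrative sketch rather than the authors' code (the name `soft_assignment_prob` is ours), assuming an undirected, connected networkx graph and a cluster given as a list of node ids:

```python
import networkx as nx

def soft_assignment_prob(G, i, cluster, diam=None):
    """Eq. (14): probability that protein i also belongs to `cluster`,
    based on its average shortest-path distance to the cluster's nodes."""
    if diam is None:
        diam = nx.diameter(G)                    # Diam_G of the PPI graph
    total = 0.0
    for j in cluster:
        sp = nx.shortest_path_length(G, source=i, target=j)
        total += 1.0 - sp / diam
    return total / len(cluster)
```

A protein is then added to every alternative cluster for which P exceeds the chosen threshold.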

3.3. Evaluation Methods

The following evaluation methods are used in our experiments.

Topological measure: modularity. First, we use the topology-based modularity measure originally proposed in [12] and used in [3]. It is computed as

$M = \sum_{i} \left( d_{ii} - \Big(\sum_{j} d_{ij}\Big)^2 \right)$,  (15)

where each element $d_{ij}$ represents the fraction of edges that link nodes between clusters i and j, and $d_{ii}$ represents the fraction of edges linking nodes within cluster i. According to this equation, more edges linking nodes within the same cluster and fewer edges linking nodes in different clusters lead to a higher modularity value.

[Figure 2. Comparison of the performance of the four consensus clustering algorithms using modularity.]

Information-theoretic measure: normalized mutual information. Second, we use NMI to compare the performance of each algorithm, as originally described in [16, 3]. NMI is computed as

$\phi^{NMI}(\lambda_a, \lambda_b) = \dfrac{2}{n} \sum_{l=1}^{k_a} \sum_{h=1}^{k_b} n_l^h \log_{k_a \cdot k_b} \left( \dfrac{n_l^h \, n}{n_l \, n^h} \right)$,  (16)

where $\lambda_a$ and $\lambda_b$ are the computed labels and the true labels of the instances, respectively; $k_a$ is the number of clusters in $\lambda_a$ and $k_b$ the number of clusters in $\lambda_b$; $n_l$ is the number of instances in cluster l of $\lambda_a$, $n^h$ is the number of instances in cluster h of $\lambda_b$, $n_l^h$ is the number of instances in both cluster l of $\lambda_a$ and cluster h of $\lambda_b$, and n is the total number of instances. If $\lambda_a$ and $\lambda_b$ are identical, $\phi^{NMI}(\lambda_a, \lambda_b)$ equals 1.

[Figure 3. Comparison of the performance of the four consensus clustering algorithms using NMI.]

Domain-based measures. We also use known biological associations from the Gene Ontology (GO) Consortium online database [1] to test whether the clusters obtained in our experiments correspond to known functional modules. The GO dataset provides information on cellular components, molecular functions, and biological processes; we only use the biological process annotations, which refer to entities at both the cellular and organism levels of granularity, to calculate P-values and clustering scores.

P-value comparison: the P-value measures the statistical and biological significance of a cluster of proteins. In our case, we evaluate a P-value for each annotation in each cluster produced by each algorithm. Assume the cluster under evaluation has size n and contains m proteins with a particular biological annotation in the GO database. Then

$P\text{-value} = \sum_{i=m}^{n} \dfrac{\binom{M}{i} \binom{N-M}{n-i}}{\binom{N}{n}}$,  (17)

where N is the total number of proteins in the GO database and M is the number of proteins in the GO database with the particular biological annotation. For each cluster of each algorithm, we choose the smallest P-value over all annotations as the cluster's P-value [3]. We set the cutoff value (explained in the clustering score comparison below) to 0.05 and obtain the significant clusters for each algorithm; across all algorithms, the minimum number of significant clusters is 31.

[Figure 4. Comparison of the performance of the four consensus clustering algorithms using log(P-value).]

Clustering score comparison: the clustering score is used to evaluate the overall clustering result of each algorithm. The cutoff value differentiates significant clusters from insignificant ones: a cluster with a P-value greater than the cutoff is insignificant [3]. The score is defined as

$CScore = 1 - \dfrac{\sum_{i=1}^{n_S} \min(p_i) + n_I \cdot \text{cutoff}}{(n_S + n_I) \cdot \text{cutoff}}$,  (18)

where $n_S$ is the number of significant clusters, $n_I$ is the number of insignificant clusters, and $\min(p_i)$ denotes the smallest P-value of significant cluster i. According to Eq. (18), a greater CScore generally indicates a better clustering result.
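The GO-based measures of Eqs. (17)-(18) amount to a hypergeometric tail probability plus a simple aggregate. A minimal sketch, assuming SciPy and precomputed annotation counts (function and variable names here are our own, not from the paper):

```python
from scipy.stats import hypergeom

def go_p_value(m, n, M, N):
    """Eq. (17): P(X >= m) for a cluster of size n containing m proteins
    with an annotation held by M of the N proteins in the GO database."""
    # scipy's hypergeom is parameterized as (population, annotated, drawn)
    return hypergeom.sf(m - 1, N, M, n)

def clustering_score(p_values, cutoff=0.05):
    """Eq. (18): CScore from the smallest per-cluster P-values."""
    sig = [p for p in p_values if p <= cutoff]
    n_s, n_i = len(sig), len(p_values) - len(sig)
    return 1.0 - (sum(sig) + n_i * cutoff) / ((n_s + n_i) * cutoff)
```

Here `p_values` is the list of per-cluster P-values, each already taken as the minimum over that cluster's annotations.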

3.4. Results Analysis

The performance comparisons under the different evaluation measures (modularity, NMI, P-values, and clustering scores) are shown in Figure 2, Figure 3, Figure 4, and Figure 5, respectively. In these figures, our proposed method is denoted as WClustering and PCA refers to the PCA-based consensus algorithm. For the domain-based measures, the minimum number of significant clusters is 31, so we only show the top 31 significant clusters for each algorithm. From the comparisons, we observe that the weighted consensus method clearly outperforms the other three methods under all evaluation measures.

[Figure 5. Comparison of the performance of the four consensus clustering algorithms using the clustering score.]

4. Conclusions

In this paper, we presented our work on using weighted consensus clustering for identifying functional modules from PPI data. Our weighted consensus clustering framework is able to combine multiple, diverse, and independent clustering results to improve the quality and robustness of identification.

Acknowledgment

The work is partially supported by an NSF DBI grant.

References

[1] The Gene Ontology Consortium online database, GO/goTermFinder.
[2] R. Aebersold and M. Mann. Mass spectrometry-based proteomics. Nature, 422(6928).
[3] S. Asur, D. Ucar, and S. Parthasarathy. An ensemble framework for clustering protein-protein interaction networks. Bioinformatics, 23(13):i29-i40.
[4] S. Brohee and J. van Helden. Evaluation of clustering algorithms for protein-protein interaction networks. BMC Bioinformatics, 7:488.
[5] C. Brun, C. Herrmann, and A. Guenoche. Clustering proteins from interaction networks for the prediction of cellular functions. BMC Bioinformatics, 5(95).
[6] C. Ding, X. He, R. F. Meraz, and S. R. Holbrook. A unified representation of multiprotein complex data for modeling interaction networks. Proteins: Structure, Function, and Bioinformatics, 57(1):99-108.
[7] N. Guelzim, S. Bottani, P. Bourgine, and F. Kepes. Topological and causal structure of the yeast transcriptional regulatory network. Nature Genetics, 31(1):60-63.
[8] A. D. King, N. Pržulj, and I. Jurisica. Protein complex prediction via cost-based clustering. Bioinformatics, 20(17).
[9] G. Lanckriet, N. Cristianini, P. Bartlett, L. Ghaoui, and M. Jordan. Learning the kernel matrix with semi-definite programming. In Proceedings of the International Conference on Machine Learning (ICML).
[10] T. Li and C. Ding. Weighted consensus clustering. In Proceedings of the 2008 SIAM International Conference on Data Mining, 2008.
[11] E. Nabieva, K. Jim, A. Agarwal, B. Chazelle, and M. Singh. Whole-proteome prediction of protein function via graph-theoretic analysis of interaction maps. Bioinformatics, 21(1).
[12] M. E. J. Newman and M. Girvan. Finding and evaluating community structure in networks.
[13] J. B. Pereira-Leal, A. J. Enright, and C. A. Ouzounis. Detection of functional modules from protein interaction networks. Proteins, 54(1):49-57.
[14] V. Spirin and L. A. Mirny. Protein complexes and functional modules in molecular networks. Proc. Natl. Acad. Sci. USA, 100(21).
[15] A. Strehl and J. Ghosh. Relationship-based clustering and visualization for high-dimensional data mining.
[16] A. Strehl, J. Ghosh, and C. Cardie. Cluster ensembles - a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research, 3.
[17] A. Topchy, M. Law, A. Jain, and A. Fred. Analysis of consensus partition in cluster ensemble. In Proceedings of the International Conference on Data Mining.
[18] D. Ucar, S. Parthasarathy, S. Asur, and C. Wang. Effective pre-processing strategies for functional clustering of a protein-protein interactions network. In IEEE International Symposium on Bioinformatics and Bioengineering.
[19] U. von Luxburg. A tutorial on spectral clustering. Technical report.
[20] C. von Mering, R. Krause, B. Snel, M. Cornell, S. G. Oliver, S. Fields, and P. Bork. Comparative assessment of large-scale data sets of protein-protein interactions. Nature, 417(6887).
[21] D. J. Watts and S. H. Strogatz. Collective dynamics of small-world networks. Nature, 393(6684).
[22] M. Wu, X. Li, C.-K. Kwoh, and S.-K. Ng. A core-attachment based method to detect protein complexes in PPI networks. BMC Bioinformatics, 10(1):169, 2009.
