WITH the coming of the postgenomic era, proteomics


IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 13, NO. 4, JULY/AUGUST 2016

Detecting Functional Modules Based on a Multiple-Grain Model in Large-Scale Protein-Protein Interaction Networks

Junzhong Ji, Jiawei Lv, Cuicui Yang, and Aidong Zhang

Abstract: Detecting functional modules from a Protein-Protein Interaction (PPI) network is a fundamental and actively studied problem in proteomics research, where many computational approaches have played an important role in recent years. However, detecting functional modules effectively and efficiently in large-scale PPI networks remains a challenging problem. We present a new framework, based on a multiple-grain model of PPI networks, to detect functional modules in PPI networks. First, we give a multiple-grain representation model of a PPI network, which has a smaller scale with super nodes. Next, we design the protein grain partitioning method, which employs a functional similarity or a structural similarity to merge some proteins layer by layer. Third, a refining mechanism with border node tests is proposed to address the protein overlapping of different modules during the grain eliminating process. Finally, systematic experiments are conducted on five large-scale yeast and human networks. The results show that the framework not only significantly reduces the running time of functional module detection, but also effectively identifies overlapping modules while keeping competitive performance, which makes it well suited to detecting functional modules in large-scale PPI networks.

Index Terms: Computational biology, large-scale PPI networks, functional module detection, multiple-grain model

1 INTRODUCTION

WITH the coming of the postgenomic era, proteomics research has gradually become one of the most important areas in the field of life science [1]. As a biomolecular relationship network, the protein-protein interaction (PPI) network plays an important role in biological activities.
Hence, the analysis of PPI networks naturally serves as the basis for a better understanding of cellular organization, processes, and functions [2]. From the biological point of view, cellular functions and biochemical events are carried out in a coordinated way by groups of proteins interacting with each other in functional modules (or complexes), and the modular structure of a PPI network is critical to its functions. For instance, the retromer complex is a key component of the endosomal protein sorting machinery [3]. The Anaphase-Promoting Complex is an E3 ubiquitin ligase that marks target cell cycle proteins for degradation by the 26S proteasome [4]. Therefore, how to correctly identify such functional modules (or complexes) in PPI networks becomes a vital scientific problem for understanding the structures and functions of these fundamental cellular networks, further discovering the mechanisms of diseases, and developing new medicines.

As many studies have pointed out [5], [6], [7], [8], functional modules are those protein groups where proteins interact with each other at different times and places, while protein complexes are those protein groups where the proteins in the same complex interact with each other at the same time and place. Because the protein interaction data used in this paper carry no temporal or spatial information, we use the concept of functional modules in this research.

J. Ji, J. Lv, and C. Yang are with Beijing Municipal Key Laboratory of Multimedia and Intelligent Software Technology, the College of Computer Science and Technology, Beijing University of Technology, Beijing , China. jjz01@bjut.edu.cn, {zhangoic, yangcc_2008}@163.com.
A. Zhang is with the Department of Computer Science and Engineering, University at Buffalo, The State University of New York, Buffalo, NY azhang@cse.buffalo.edu.
Manuscript received 15 Nov. 2014; revised 4 June 2015; accepted 9 Sept Date of publication 18 Sept. 2015; date of current version 4 Aug
For information on obtaining reprints of this article, please send to: reprints@ieee.org, and reference the Digital Object Identifier below.
Digital Object Identifier no /TCBB

So far, there are a number of biological experimental methods to detect functional modules in PPI networks, such as tandem affinity purification with mass spectrometry [9], [10], and the protein-fragment complementation assay (PCA) [11]. However, these experimental methods have several limitations, such as too many processing steps and too much time consumed, which motivated computer scientists to investigate efficient, novel, and robust ways to fully exploit protein interaction data to mine functional modules. In the past decade, many computational approaches based on machine learning and data mining have grown rapidly and become useful complements to the experimental methods [12]. Though these approaches differ in the ideas and schemes they employ, graph clustering is their basic and key technique, which relies on the observation that closely connected protein areas in PPI networks correspond to protein functional modules. Up to now, a variety of classic clustering approaches, such as density-based clustering [13], [14], [15], hierarchical clustering [16], [17], [18], partition-based clustering [19], [20], [21], and flow simulation-based clustering [22], [23], [24], have been applied to identifying functional modules in PPI networks. In recent years, a number of newly emerging approaches [25], [26], [27], [28] have also employed novel computational models to effectively identify functional modules in a PPI network.
More recently, some nature-inspired swarm intelligence algorithms have been successfully applied to the detection of functional modules in PPI networks [29], [30], [31], [32]. Above all, using computational approaches to detect protein functional modules in PPI networks has received considerable attention, and researchers have proposed many detection ideas and schemes over the past few years [12], which shows the strong vitality and broad application prospects of detecting functional modules in PPI networks.

© 2015 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE permission. See for more information.

However, with the increasing availability of biological experiments and information mining techniques, PPI networks for most species have become larger and larger, which brings a new challenge for many existing computational approaches, especially those with higher time complexity. That is, some computational approaches may be limited in dealing with large-scale PPI networks by their own complexity. To resolve the problem, some researchers have performed meaningful explorations. For example, Cho et al. proposed a new method based on graph reduction [33], which first selects a small number of informative proteins from a large network, and then transforms the intricate small-world, scale-free network into a simple graph with high modularity. Oliveira and Seok proposed a multilevel algorithm with structured analysis of unweighted networks [34], [35], which constructs high-quality groups of merged nodes before applying a clustering algorithm. These two methods can remarkably enhance the efficiency of detecting functional modules by reducing the scale of a PPI network. In this paper, we present a new framework to efficiently detect functional modules in the PPI network.
The main contributions of this paper include the following:
1) From the perspective of granular computation, we define a multiple-grain model of a PPI network, which has a small scale, and propose a new framework to detect functional modules in a large-scale PPI network.
2) To obtain a smaller network, we design a grain partitioning method of proteins, which combines the functional similarity with the structural similarity to merge some proteins layer by layer.
3) To address the protein overlapping of different modules in a PPI network, we develop a refining mechanism with border node tests during the grain eliminating process.
4) Systematic experiments have been conducted to verify the proposed framework using several state-of-the-art algorithms on the benchmark testing sets of yeast and human networks.

The rest of this paper is organized as follows. Section 2 gives the multiple-grain representation model of a PPI network. Section 3 introduces the proposed algorithm framework, the grain partitioning method, and the refining mechanism with border node tests. Next, Section 4 presents and analyzes the experimental results. Section 5 concludes this paper and outlines future research work.

2 MULTIPLE GRAIN MODEL OF PPI NETWORKS

In the real world, there are many complex and difficult problems that are hard to solve accurately. Therefore, people often do not pursue the best solution in a single step, but gradually approach a solution of a certain precision step by step. This is the multi-granularity analysis method, which uses a refining policy from coarse to fine. By means of such an idea, people can efficiently solve many large-scale problems with high complexity. In other words, the multi-granularity representation is an effective method to reduce the scale of problems and improve the efficiency of problem solving.
Generally speaking, a PPI network is typically represented as an undirected graph $G = (V, E)$ with a set of nodes $V$ and a set of edges $E$, where $V$ and $E$ represent proteins and interactions between proteins, respectively. In PPI networks, proteins belonging to the same functional module perform similar functions in some biological activities. They are either closely linked, reflecting topological similarity, or closely related in function, reflecting functional similarity. That is, from the view of structure partition, closely related elements can be viewed as the same kind of element either in structure or in function. For a PPI network, if two proteins have a high degree of similarity both in topology and in function, then the two proteins are more likely to be grouped into the same functional module. Assuming that we could merge these similar proteins in a PPI network into the same virtual protein in advance, the scale of the PPI network would decrease significantly, and the running time of detecting functional modules in the PPI network could be greatly shortened. Based on this idea of problem reduction, we first introduce some definitions.

Definition 1 (Super protein node). Given any two nodes $i, j \in V$ in a graph $G$, once the similarity degree between the two nodes is greater than a preset fusion threshold, they can be merged into a larger node $s$. We call the larger node $s$ a super protein node.

Because the integration process can be run multiple times in the whole problem-reduction process, a super protein node in a higher-layer network may contain a number of proteins of the initial graph. From the view of function or structure, these proteins are considered to be similar to each other. However, it should be noted that a super protein node in the current layer contains at most two nodes of the previous layer, since each fusion within a layer involves only two nodes.
If two nodes are merged, their Gene Ontology (GO) annotation information is combined by union. Moreover, the local graph structure is simplified by the change of neighbors. For a node $i$ in a graph $G$, the set of its neighbor nodes can be denoted as $Neigh(i) = \{ j \mid j \in V, (i, j) \in E \}$. For a super protein node $s$, the set of its neighbor nodes is described as follows.

Definition 2 (Neighbor of a super protein node). Given a super protein node $s$ including protein nodes $i$ and $j$ at the current layer, the set of its neighbor nodes is defined as $Neigh(s) = \{ l \mid l \in Neigh(i) \cup Neigh(j) \setminus s \}$.

Without loss of generality, there are some protein nodes which do not meet the fusion requirement. These protein nodes can be defined as follows.

Definition 3 (Isolated protein node). If a protein in a PPI network is not merged with any other protein after several integrations, it is called an isolated protein node.
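The merge described by Definitions 1 and 2 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the dictionary-of-sets graph representation, the function name `merge_pair`, and the GO labels in the usage note are all assumptions made for the example.

```python
def merge_pair(adj, go, i, j, s):
    """Merge nodes i and j into a super node s.

    adj maps each node to the set of its neighbors (undirected graph);
    go maps each node to its set of GO terms. Both layouts are assumptions
    of this sketch. New dictionaries are returned; inputs are not modified.
    """
    merged = {i, j}
    new_adj = {}
    for node, neigh in adj.items():
        if node in merged:
            continue
        # Edges that pointed at i or j now point at the super node s.
        new_adj[node] = {s if n in merged else n for n in neigh}
    # Neigh(s) = (Neigh(i) ∪ Neigh(j)) \ s, so s never neighbors itself.
    new_adj[s] = (adj[i] | adj[j]) - merged
    # GO annotations of the merged nodes are combined by union.
    new_go = {n: terms for n, terms in go.items() if n not in merged}
    new_go[s] = go.get(i, set()) | go.get(j, set())
    return new_adj, new_go
```

For example, with the made-up graph `adj = {"A": {"B", "C"}, "B": {"A", "D"}, "C": {"A"}, "D": {"B"}}`, merging `A` and `B` into `"AB"` leaves `"AB"` with neighbors `C` and `D`, and both `C` and `D` with the single neighbor `"AB"`.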

Fig. 1. The schematic diagram of a MGM-PPI.

In general, a higher-layer network not only contains some super protein nodes, but also includes some isolated protein nodes. Once two nodes are merged into a super node, its local graph structure and GO annotation information are updated before further testing in the next layer. That is, super protein nodes and isolated protein nodes in the same layer are handled in the same way. Based on the definitions above, a multiple-grain model for a PPI network (MGM-PPI) is given in the following.

Definition 4 (Multiple Grain Model for a PPI network). By merging similar proteins, we can transform a large-scale PPI network with a single granularity of protein nodes into a small-scale network with different granularities of super protein nodes, layer by layer. We call the connected graph with different granularities a MGM-PPI.

Fig. 1 gives a schematic diagram of a MGM-PPI acquired by two merge processes, where the top layer gives an initial PPI network $G_0$, the middle layer shows a coarse-grained connection diagram $G_1$ after the first merge process, and the third layer denotes a more coarse-grained connection diagram $G_2$ after the second merge process. Clearly, the more often the merge process occurs, the smaller the MGM-PPI becomes and the simpler its connection relationships are. Because employing the multiple-grain model in detecting functional modules from a PPI network can greatly decrease the scale of the PPI network and reduce the computational complexity, it can substantially help module detection algorithms scale to large networks.
3 FRAMEWORK BASED ON THE MGM-PPI

Based on the MGM-PPI, this section presents a new framework to detect functional modules from large-scale PPI networks, which aims to improve the time performance. Fig. 2 shows the flowchart of the proposed framework, where the three key steps are automatically-layered grain partitioning, module detecting, and grain eliminating with a refining strategy. The first step performs the coarsening process, which transforms a large-scale PPI network into a smaller one by merging close nodes into corresponding super nodes layer by layer, $G_0 \to G_1 \to \cdots \to G_i \to \cdots \to G_k$, where $G_i$ represents the $i$th layer MGM-PPI. Obviously, as the number of layers $k$ increases, the super nodes become bigger and bigger while the number of nodes (edges) in the corresponding MGM-PPI gets smaller and smaller. This step is essential for the framework to establish the MGM-PPI. The second step carries out a functional module detection process on the coarsest layer, where an appropriate algorithm with high accuracy can be used. The third step performs the restoring and refining process on the clustering result of the coarsest layer. That is, following the order $G_k \to G_{k-1} \to \cdots \to G_i \to \cdots \to G_0$, the step completes grain eliminating layer by layer, where in addition to the transformation from one layer with larger super nodes to another layer with smaller ones, a refining strategy is employed to optimize the results generated.

3.1 Automatically-Layered Grain Partitioning

The goal of grain partitioning is to reduce a PPI network to a smaller one so that many algorithms can be applied at a lower cost. A protein node has one of two states in this phase: unmatched or matched. At the initial stage of each layer, every protein node in a PPI network is set to the unmatched state. Then we traverse the nodes one by one to find node pairs that can be integrated.
Once a node pair meets a preset merging condition, the states of the two protein nodes are changed from unmatched to matched. After traversing all nodes in the current layer, the matched protein pairs are merged to form corresponding super nodes with larger granularity, while the remaining unmatched proteins keep their original grain size. In detail, the coarsening process consists of the following four steps:
1) First, all protein nodes in the current layer are sorted in a roughly descending order of degree;
2) Then we select the first unmatched node, and search its neighbor nodes in descending order to find an unmatched node which satisfies any merging condition. When there is such a pair of nodes, their states are switched from unmatched to matched;
3) We repeat step 2 until all nodes at the current layer are traversed, so that all matching pairs of nodes are found, and merge every pair of matched nodes to build the corresponding super protein node; and
4) We continue to perform the above three steps on a higher-layer network until the end condition of grain partitioning is satisfied.

Fig. 2. The flowchart of the proposed framework.

In the above process, there are two key problems to be addressed. The first one is how to set a reasonable merging condition. We employ a functional similarity measure ($f_{ij}$) and a topology similarity measure ($s_{ij}$) to determine whether a pair of nodes $(i, j)$ can be merged. The judgment rule is that a pair of protein nodes $(i, j)$ will be merged into a super node only if $f_{ij} \ge \varepsilon_1$ or $s_{ij} \ge \varepsilon_2$, where $\varepsilon_1$ and $\varepsilon_2$ are two thresholds. To compute the functional similarity $f_{ij}$ for two proteins $i$ and $j$, we utilize their GO information. If the two protein nodes are annotated with two GO term sets $g_i$ and $g_j$, respectively, $f_{ij}$ can be calculated by [36]:

$$f_{ij} = \frac{|g_i \cap g_j|}{|g_i \cup g_j|}. \qquad (1)$$

Based on the Czekanowski-Dice distance $d_{ij}$ [37], we calculate the topology similarity between two proteins. The formula is as follows:

$$s_{ij} = 1 - d_{ij} = 1 - \frac{|Int(i) \cup Int(j)| - |Int(i) \cap Int(j)|}{|Int(i) \cup Int(j)| + |Int(i) \cap Int(j)|}, \qquad (2)$$

where $Int(i) = Neigh(i) \cup \{i\}$. Obviously, when two proteins $i$ and $j$ interact with completely different neighbor sets of proteins, the similarity between them reaches its smallest value, 0. In other cases, the similarity value is larger than 0, and smaller than or equal to 1.
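The two similarity measures and one matching pass over a layer can be sketched as follows. This is an illustrative sketch under stated assumptions: the graph is a dictionary of neighbor sets, the comparison against the thresholds is taken as greater-than-or-equal, ties in degree are broken by node name, and the names `functional_similarity`, `topology_similarity`, and `match_pairs` are hypothetical.

```python
def functional_similarity(go_i, go_j):
    """Eq. (1): overlap of two GO term sets divided by their union."""
    union = go_i | go_j
    return len(go_i & go_j) / len(union) if union else 0.0


def topology_similarity(adj, i, j):
    """Eq. (2): one minus the Czekanowski-Dice distance of Int(i), Int(j)."""
    int_i = adj[i] | {i}  # Int(i) = Neigh(i) ∪ {i}
    int_j = adj[j] | {j}
    union = len(int_i | int_j)
    inter = len(int_i & int_j)
    return 1.0 - (union - inter) / (union + inter)


def match_pairs(adj, go, eps1, eps2):
    """One matching pass: visit nodes by descending degree and pair each
    unmatched node with its first unmatched neighbor that meets either
    threshold. Tie-breaking by name is an assumption of this sketch."""
    def by_degree(n):
        return (-len(adj[n]), str(n))

    matched, pairs = set(), []
    for i in sorted(adj, key=by_degree):
        if i in matched:
            continue
        for j in sorted(adj[i], key=by_degree):
            if j in matched or j == i:
                continue
            f_ij = functional_similarity(go.get(i, set()), go.get(j, set()))
            if f_ij >= eps1 or topology_similarity(adj, i, j) >= eps2:
                matched.update((i, j))
                pairs.append((i, j))
                break  # i is matched; move on to the next node
    return pairs
```

On a made-up triangle `A`, `B`, `C` with an extra node `D` attached to `C`, and no GO annotations, a threshold of `eps2 = 0.9` pairs only `A` with `B`, since they share the identical closed neighborhood. The returned pairs would then be merged into super nodes to form the next layer.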
Another key problem is how to determine the number of layers for the grain partitioning. That is, a suitable condition needs to be set to end the coarsening process. Merging nodes with high topological similarity and functional similarity changes the topological structure of the PPI network, so we carried out many experiments to explore the relationships between some graph metrics (such as density, average shortest path length, edge betweenness centrality, average node degree, clustering coefficient of nodes, etc.) and the number of layers. We found an interesting phenomenon: the density of the $k$th layer MGM-PPI network, $den(G_k) = 2e / (v(v-1))$ with $v = |V_k|$ and $e = |E_k|$, is a single-peaked function of the layer number. The density first rises gradually to a maximum value and then declines, significantly or slightly, as the layer number $k$ increases. More importantly, once the density change between two adjacent layers is small, fewer nodes are merged, which means that the contribution of further grain partitioning becomes very small. To ensure the efficiency of problem reduction and the performance of module detection, the automatically-layered coarsening mechanism is designed as:

$$K = \arg\min_{k} \{ |den(G_k) - den(G_{k-1})| \le \mu,\ k > 0 \}, \qquad (3)$$

where $\mu$ is a preset small threshold value (e.g., ) and $K$ is the minimum layer at which the density of the MGM-PPI approaches its maximum value.

3.2 Grain Elimination with a Refining Strategy

Once the grain partitioning ends, we acquire a smaller-scale network based on the MGM-PPI, to which a module detecting algorithm can be applied to quickly find initial functional modules. These modules may consist of many super protein nodes and some isolated protein nodes. To obtain the final functional modules made up of the initial protein nodes, we have to perform the grain eliminating process from coarse to fine.
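Before moving on to the restoring step, the stopping rule of Eq. (3) that ends grain partitioning can be illustrated with a short Python sketch. It assumes the node and edge counts of each layer are already known; the function names `density` and `first_stable_layer` are hypothetical, and the layer sizes in the usage note are made up for illustration.

```python
def density(num_nodes, num_edges):
    """den(G_k) = 2e / (v(v - 1)) for a layer with v nodes and e edges."""
    if num_nodes < 2:
        return 0.0
    return 2.0 * num_edges / (num_nodes * (num_nodes - 1))


def first_stable_layer(layer_sizes, mu):
    """Return the smallest k > 0 whose density differs from that of layer
    k - 1 by at most mu, or None if no layer qualifies.

    layer_sizes[k] is the (node count, edge count) pair of layer G_k.
    """
    dens = [density(v, e) for v, e in layer_sizes]
    for k in range(1, len(dens)):
        if abs(dens[k] - dens[k - 1]) <= mu:
            return k
    return None
```

With the made-up sizes `[(10, 9), (6, 6), (4, 4), (4, 4)]` the densities are 0.2, 0.4, about 0.667, and about 0.667, so a small threshold such as `mu = 0.01` selects layer 3, the first layer at which coarsening no longer changes the density.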
That is, we need to restore the node connections of the previous layer, layer by layer. The basic operation separates super nodes layer by layer until no super node remains in the corresponding network (i.e., the initial PPI network is recovered). Moreover, the separation of super nodes at each layer may change the connections between modules. To address proteins that overlap different modules, we design a refining mechanism to test which modules some protein nodes reasonably belong to, where a border protein node is defined as follows:

Definition 5 (Border protein node). Though some protein nodes are partitioned into different functional modules, there are still connections between these nodes that cross different modules at a particular layer. We call the protein nodes that lie on connections across different modules border protein nodes.

Generally, a border protein node is most likely to be shared by two modules and to become part of their overlap. To reflect the overlapping nature of some functional modules, we employ a loop to refine the grain eliminating results at each layer. The loop goes through each module in the current layer individually; once there is a connection between the module and a neighboring module, the two border protein nodes involved in the connection are tested. Our testing method is based on the density of the subgraphs of functional modules:

$$Den(M_l) = \frac{2 |E_l|}{|V_l| (|V_l| - 1)}, \qquad (4)$$

where $M_l$ is the $l$th module, $|E_l|$ is the number of edges between its nodes, and $|V_l|$ is the number of nodes in $M_l$. The larger the value of $Den(M_l)$, the better the clustering result. Thus, we take the node-sharing gain in the density of functional modules as the criterion to test whether border nodes should be shared between two neighboring modules.
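As a rough illustration of Eq. (4) and the node-sharing gain, the following Python sketch computes the density of a module and the change in density when a border node is added to it. The adjacency-set representation and the names `module_density` and `sharing_gain` are assumptions of this sketch; treating a positive gain as the signal to share the node is how the sketch reads the criterion.

```python
def module_density(nodes, adj):
    """Eq. (4): Den(M) = 2|E| / (|V|(|V| - 1)) on the subgraph induced by
    the set `nodes`. adj maps each node to its set of neighbors."""
    n = len(nodes)
    if n < 2:
        return 0.0
    # Summing in-module neighbor counts counts every edge twice,
    # which already supplies the factor 2 in the numerator.
    twice_edges = sum(len(adj[u] & nodes) for u in nodes)
    return twice_edges / (n * (n - 1))


def sharing_gain(module, candidate, adj):
    """Density change of `module` if border node `candidate` joins it."""
    return module_density(module | {candidate}, adj) - module_density(module, adj)
```

In a made-up graph where module `{A, B, C}` is the path `A-B-C`, a border node `D` linked to all three members raises the density from about 0.667 to about 0.833 (positive gain), whereas a node `E` linked only to `C` lowers it to 0.5 (negative gain), so only `D` would be a candidate for sharing.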
Let $M_p$ and $M_q$ be two neighboring modules at the current layer, with $i \in V_p$, $j \in V_q$, and a connection between $i$ and $j$; then $i$ and $j$ are border protein nodes for $M_q$ and $M_p$, respectively. The specific testing method for $j$ is as follows: let $M'_p = M_p \cup \{j\}$ be a new module, and calculate $Den(M_p)$ and $Den(M'_p)$. If $Den(M'_p) > Den(M_p)$, then $j$ should be shared by $M_p$. Otherwise, $j$ should not be shared by $M_p$, and this test is ignored. The testing process is repeated until all modules at the current layer have been passed through.

TABLE 1
Data Sets Used in Our Experiments

Data sets    Http address   Preprocessed data            Processed data
                            Proteins   Interactions      Layers   Proteins   Interactions
DIPScere                    ,126       22,                        ,803       14,183
MIPS                        4,545      12,                        ,160       6,683
BioGrid                     ,334       80,                        ,817       40,920
DIPHsapi                    ,086       5,                         ,172       3,264
H-InvDB 1                   5,858      27,                        ,518       19,974

1 The original data contains 9,086 proteins and 31,030 interactions. Limited by the memory of our machine, an extra operation in the preprocessing step removes all nodes whose degree is 1.

3.3 Framework Description and Complexity Analysis

The procedure of the proposed framework carries out initialization, grain partition, module detection, grain eliminating, and output of the detected modules. To reduce the scale of the network quickly, we employ a strategy in which the nodes with the largest degree are merged first with their close neighbors. The detailed pseudocode is shown in Algorithm 1.

Algorithm 1. Framework Based on MGM-PPI
Input: Graph $G_0 = G(V, E)$: a PPI network, $|V| = N$;
Output: $M$: the set of modules ($|M| = m$);
1. Initialization:
   Set parameters $\varepsilon_1$, $\varepsilon_2$ and $\mu$;
   /* $\varepsilon_1$: functional similarity threshold */
   /* $\varepsilon_2$: topology similarity threshold */
   /* $\mu$: density gain threshold */
2. Grain partition:
   $k = 0$;
   Do {
      Set all nodes to the unmatched state;
      Compute the degree of all nodes in $G_k$, and sort them in descending order;
      For each unmatched node $i$ in $G_k$ {
         Sort the nodes of $Neigh(i)$ in descending order;
         For each unmatched node $j \in Neigh(i)$ {
            If ($f_{ij} \ge \varepsilon_1$) or ($s_{ij} \ge \varepsilon_2$) then
               Label the pair of nodes as matched and exit the loop;
         }
      }
      $k = k + 1$;
      Merge each pair of matched nodes to form a super node, build the corresponding connections, and construct a graph $G_k$;
   } while ($|den(G_k) - den(G_{k-1})| > \mu$)
   $K = k$;
   Obtain the coarsest graph $G_K$;
3. Module detection:
   Employ a classic algorithm to detect modules on $G_K$;
   Get initial modules for $G_K$;
4. Grain eliminating:
   While ($k > 0$) do {
      Restore the node connections at the $(k-1)$th layer network;
      For each connection between two modules
         Refine some border nodes by the judging criteria;
      $k = k - 1$;
   }
5. Output:
   Return the $m$ functional modules of $G_0$.

Based on the description of Algorithm 1, the complexity of the detection framework can be analyzed as follows. Let the maximum node degree in a PPI network be $n$. In the initialization process, the computing complexity is $O(1)$. In the grain partition process, the time complexity is $O(K(N + N \log N + N(n^2 \log n)))$, which is approximately $O(N \log N)$ because $n \ll N$ and $K$ is a very small number compared with $N$. For the module detection process, the time complexity depends on the algorithm used, but the complexity of the initial problem can be greatly reduced by grain partition. In the grain eliminating process, the time complexity is $O(K(N/2 + m(m-1)c)) \approx O(K(N + m^2))$, where $c$ is the maximum number of connections between two modules. In the output process, the time complexity is $O(m)$. Thus, the overall complexity mainly depends on the module detection algorithm, since the other processes have lower complexities. Most importantly, the module detection algorithm is applied to the smaller-scale PPI network, so there is great potential to improve the running efficiency.

4 EXPERIMENTAL RESULTS AND DISCUSSION

In this section, we use five large protein-protein interaction datasets to perform our empirical study. Using several evaluation metrics, we assess the effectiveness of the framework, and compare test results using some typical algorithms on these PPI datasets. The experimental platform is a PC with Core 2, 2.13 GHz CPU, 2.99 GB RAM, and Windows XP.
4.1 PPI Data Sets and Their Corresponding Number of Layers

We have performed our experiments over five publicly available benchmark PPI datasets, including three yeast datasets (DIPScere, MIPS, BioGrid2014) and two human datasets (DIPHsapi, H-InvDB). Table 1 shows a summary of the data sets used in our experiments, where the second column gives the web links, the third and fourth columns present the numbers of proteins and interactions after data preprocessing, and the fifth, sixth, and seventh columns present the number of layers determined automatically and the numbers of proteins and interactions after grain partitioning. A cleaning step, which deletes all self-connected and repeated interactions, is performed in data preprocessing. To evaluate the protein modules mined by the algorithms, the set of real functional modules from [39] is selected as the benchmark for yeast. This benchmark set, which consists of 428 protein functional modules, is constructed from three main sources: the MIPS [38], Aloy et al. [40], and the SGD database [41] based on the Gene Ontology notations. For human, we use the set of real functional modules from [42] as the benchmark, which consists of 1,259 functional modules and 3,458 proteins. After removing the modules that do not include any protein in DIPHsapi or H-InvDB, we get two corresponding validation sets of functional modules: the validation set for DIPHsapi comprises 793 functional modules and 2,443 proteins, and the validation set for H-InvDB comprises 901 functional modules and 2,666 proteins.

As mentioned in Section 3.1, to automatically determine the number of layers for any PPI dataset, the relationship between the number of layers and various network characteristics (e.g., density, degree, path, centrality, coefficient, etc.) has been experimentally investigated. From the results, we discovered that the number of layers and the density of the corresponding graphs are related. Fig. 3 shows the density curves over the number of layers for the five PPI datasets. Based on the automatically-layered coarsening mechanism, we get 4, 4, 3, 3, and 4 layers for the five grain partitionings, respectively.

Fig. 3. The relationships between density and number of layers for five PPI datasets.

4.2 Evaluation Metrics

In this section, we employ two popular sets of measurements [43] to evaluate the quality of the detected modules and the general performance of the detection methods.

4.2.1 Precision, Recall, F-Measure, and Coverage

Many research works use a neighborhood affinity score to assess the degree of matching between the identified functional modules and the real ones. The score $NA(p, b)$ between an identified module $p = (V_p, E_p)$ and a real module $b = (V_b, E_b)$ is defined as:

$$NA(p, b) = \frac{|V_p \cap V_b|^2}{|V_p| \, |V_b|}. \qquad (5)$$

If $NA(p, b) \ge \omega$, then $p$ and $b$ are considered to be matched (generally, $\omega = 0.2$). Let $P$ be the set of functional modules identified by some computational method and $B$ be the real functional module set in the benchmark networks. Then $N_{cp} = |\{ p \mid p \in P, \exists b \in B, NA(p, b) \ge \omega \}|$, and $N_{cb} = |\{ b \mid b \in B, \exists p \in P, NA(p, b) \ge \omega \}|$. Thus, Precision and Recall can be defined as follows [44]:

$$Precision = \frac{N_{cp}}{|P|}, \qquad (6)$$

and

$$Recall = \frac{N_{cb}}{|B|}. \qquad (7)$$

F-measure is the harmonic mean of Precision and Recall, so it can be used to evaluate the overall performance. It is defined as:

$$F = \frac{2 \times Precision \times Recall}{Precision + Recall}. \qquad (8)$$

Moreover, Coverage assesses how many proteins in a PPI network can be clustered into the detected modules by a computational method. It can be defined as follows [45]:

$$Coverage = \frac{\left| \bigcup_{i=1}^{|P|} V_{p_i} \right|}{|V|}, \qquad (9)$$

where $|V| = N$ denotes the size of the PPI network and $V_{p_i}$ is the set of proteins in the $i$th detected module.

4.2.2 Sensitivity, Positive Predictive Value, and Accuracy

Sensitivity ($S_n$), positive predictive value ($PPV$), and accuracy ($Acc$) are also common measures to assess the performance of module detection methods. Let $T_{ij}$ be the number of proteins common to the $i$th ground-truth module and the $j$th identified module. Then $S_n$ and $PPV$ can be defined as [39]:

$$S_n = \frac{\sum_{i=1}^{|B|} \max_j \{T_{ij}\}}{\sum_{i=1}^{|B|} N_i}, \qquad (10)$$

and

$$PPV = \frac{\sum_{j=1}^{|P|} \max_i \{T_{ij}\}}{\sum_{j=1}^{|P|} T_{\cdot j}}, \qquad (11)$$

where $N_i$ is the number of proteins in the $i$th benchmark module, and $T_{\cdot j} = \sum_{i=1}^{|B|} T_{ij}$. As a general metric, the accuracy of an identification can be calculated as the geometric mean of $S_n$ and $PPV$:

$$Acc = (S_n \times PPV)^{1/2}. \qquad (12)$$

4.3 Effects of Parameters

As a swarm intelligence-based algorithm, NACO-FMD [31] has excellent robustness and good detection accuracy on various PPI data. Thus, we take NACO-FMD as an instance and perform many experiments on DIPScere to determine a better parameter configuration for the framework.
The parameters of NACO-FMD were set as $m_a = 100$ (number of ants), $NI = 20$ (number of iterations), $\alpha = 1.5$, $\beta = 5.0$, $\delta = 0.8$. During all experiments, the value of a single parameter is changed while the values of the other parameters are kept fixed.
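The matching-based metrics of Section 4.2 that are tracked in these parameter experiments can be sketched in Python as follows. This is a minimal sketch assuming that each module is represented as a set of protein identifiers, that the neighborhood affinity uses the usual $|V_p| \cdot |V_b|$ denominator, and that the matching threshold defaults to 0.2; the function names are hypothetical.

```python
def neighborhood_affinity(p, b):
    """NA(p, b) = |Vp ∩ Vb|^2 / (|Vp| * |Vb|) for two protein sets."""
    if not p or not b:
        return 0.0
    return len(p & b) ** 2 / (len(p) * len(b))


def precision_recall_f(predicted, benchmark, omega=0.2):
    """Precision, Recall and F-measure from lists of protein sets.

    A predicted module counts as correct if it matches at least one
    benchmark module (NA >= omega), and symmetrically for recall.
    """
    n_cp = sum(
        1 for p in predicted
        if any(neighborhood_affinity(p, b) >= omega for b in benchmark)
    )
    n_cb = sum(
        1 for b in benchmark
        if any(neighborhood_affinity(p, b) >= omega for p in predicted)
    )
    precision = n_cp / len(predicted) if predicted else 0.0
    recall = n_cb / len(benchmark) if benchmark else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure
```

For instance, with the made-up modules `predicted = [{"a", "b", "c"}, {"x", "y"}]` and `benchmark = [{"a", "b", "d"}, {"m", "n"}, {"c"}]`, one of two predicted modules and two of three benchmark modules are matched, giving a precision of 1/2, a recall of 2/3, and an F-measure of 4/7.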

TABLE 2
Layer Number and Running Time of NACO-FMD with the New Framework for Different $m$, $\varepsilon_1$, and $\varepsilon_2$ Values
(Rows: layer number, clustering time (s), grain partition time (s), grain elimination time (s), total running time (s). Column groups: $m$ values (E-4) of 0.5, [1.5, 3.5], [4.5, 6.5], 7.5, and [8.5, 9.5]; $\varepsilon_1$ values; $\varepsilon_2$ values.)

Table 2 lists the layer number of the MGM-PPI, the clustering time, the grain partition time, the grain elimination time, and the total running time for different $m$, $\varepsilon_1$, and $\varepsilon_2$ values, where s denotes seconds. Columns 2 to 6 show that the smaller $m$ is, the larger the layer number and the shorter the total running time. Columns 7 to 14 and Columns 15 to 22 show that when $\varepsilon_1$ or $\varepsilon_2$ increases, the number of node fusions becomes smaller, which reduces the layer number and increases the total running time.

Fig. 4 gives the effects of the parameters $m$, $\varepsilon_1$, and $\varepsilon_2$ on six metrics, where Fig. 4a shows the performance curves for $m$ when $\varepsilon_1 = 0.5$ and $\varepsilon_2 = 0.3$, Fig. 4b shows the performance curves for $\varepsilon_1$ when $m = 2.5$E-04 and $\varepsilon_2 = 0.3$, and Fig. 4c shows the performance curves for $\varepsilon_2$ when $m = 2.5$E-04 and $\varepsilon_1 = 0.5$. In Fig. 4a, some metrics show different tendencies as $m$ increases; however, the two general metrics (F-measure and accuracy) are relatively stable, which shows that the framework is not sensitive to the parameter $m$. In Fig. 4b, as $\varepsilon_1$ increases, the values of precision, recall, and F-measure gradually increase until they reach their respective maxima at $\varepsilon_1 = 0.5$ or 0.6 and then slowly decrease. Though there are some differences among the three curves of PPV, accuracy, and sensitivity, the accuracy is maintained at around 0.32 when $\varepsilon$ … In Fig. 4c, the three metrics of precision, recall, and F-measure have the same trend: each metric reaches its maximum at $\varepsilon_2 = 0.4$.
The three metrics of PPV, accuracy, and sensitivity have similar values except for $\varepsilon_2 = 0.2$. To get better quality at a shorter running time, we employ $m \in [1.5, 3.5]$ E-4, $\varepsilon_1 = 0.5$, and $\varepsilon_2 = 0.3$ in our framework for the DIPScere data. Above all, the value of $m$ directly determines the layer number, while $\varepsilon_1$ and $\varepsilon_2$ indirectly influence the layer number by controlling the node fusion. Either way, they affect the running time of module detection with the framework. From these results, we can give a simple suggestion for presetting the three parameters: the determination of the parameters has to comprehensively consider the running time and the other detection performances. Based on numerous similar experiments, the three parameters are set as $\varepsilon_1 = 0.5$, $\varepsilon_2 = 0.3$, $m = 0.00025$ for the three yeast data sets and $\varepsilon_1 = 0.4$, $\varepsilon_2 = 0.4$, $m = 0.00015$ for the human data with sparse connections.

Fig. 4. The effects of the parameters $m$, $\varepsilon_1$, and $\varepsilon_2$ on six evaluation metrics. (a) Six performance curves for $m$; (b) six performance curves for $\varepsilon_1$; (c) six performance curves for $\varepsilon_2$.

TABLE 3
The Performance Comparisons of Several Typical Detection Algorithms on DIP
(Columns: algorithm, category, coverage, precision, recall, F-measure, sensitivity, PPV, accuracy, time (s). Rows: CFinder and CFinder* (density); DPClus and DPClus* (density); Jerarca and Jerarca* (hierarchy); RNSC and RNSC* (partition); MCL and MCL* (flow); ADMSC and ADMSC* (spectral); Core and Core* (core); COACH and COACH* (core); NACO-FMD and NACO-FMD* (swarm); HAM-FMD and HAM-FMD* (swarm).)
1 An asterisk refers to the corresponding algorithm using the multiple-grain model.
2 DPClus runs with d = 0.9, cp = 0.4, and minimum cluster size = 2.
3 ADMSC runs with C = 600, b = 1.4.
4 COACH runs with V = 0:
5 NACO-FMD runs with a = 1.5, b = 5.0, r = 0.5, d = 0.3.
6 HAM-FMD runs with a = 1.5, b = 5.0, r = 0.5, d = 0.3, P_o = 0.2, P_c = 0.05, P_m = 0.5.

4.4 Comparative Evaluations

To evaluate the new framework, we employed 10 typical algorithms, i.e., CFinder [14], DPClus [15], Jerarca [18], RNSC [19], MCL [22], ADMSC [26], Core [27], COACH [28], NACO-FMD [31], and HAM-FMD [32], to demonstrate its strengths on seven performance metrics in our experiments. Based on the survey in [12], these algorithms cover seven main categories: density-based, hierarchy-based, partition-based, flow simulation-based, spectral clustering-based, core attachment-based, and swarm intelligence-based approaches. We again employed the DIPScere data to perform comprehensive comparisons among the 10 different algorithms. Table 3 shows the detailed performance comparisons of these typical detection algorithms on the same DIP data.
For each detection algorithm and its variation with the new framework, we list the category, the seven metrics, and the running time in seconds. In all experiments, we use the simplest or the best default values for those algorithms that require parameter configurations. It is not difficult to see that applying the new framework effectively reduces the running time of most algorithms. For some fast algorithms, such as MCL, COACH, CFinder, and RNSC, the new framework has no significant influence on the running time or even increases it somewhat. The main reason is that grain partitioning and grain elimination in the new framework also need extra running time, which may exceed the time saved. For the other algorithms, the new framework significantly improves the time performance. Most variation algorithms using the new framework achieve good results on many metrics, which are better than or comparable with those of the original algorithms. From a technical perspective, the framework is supposed to be independent of the detection algorithm. However, merging nodes with high topological similarity and functional similarity may change the topological structure of the PPI network, and it may also mislead PPI clustering techniques that rely on neighbourhood analysis. Therefore, using the new framework with some clustering techniques may also degrade some performance metrics.

To further investigate the computational results of the time-consuming algorithms, we select six of the 10 algorithms, namely NACO-FMD, HAM-FMD, Core, Jerarca, DPClus, and ADMSC, and carry out a large number of experiments. NACO-FMD is an ant colony intelligence-based algorithm for detecting functional modules in a PPI network, which combines topological characteristics with functional information in the ant colony optimization process [31]. HAM-FMD is a hybrid approach which employs ant colony optimization and multi-agent evolution to detect functional modules in PPI networks [32]. Essentially, NACO-FMD and HAM-FMD are two swarm intelligence-based algorithms.

TABLE 4
The Results of NACO-FMD, HAM-FMD, and Core Algorithms on Five Different Data Sets
(Data sets: DIPScere, MIPS, BioGrid2014, DIPHsapi, H-InvDB. Metrics for each data set: number of clusters, size of average module, $N_{cp}$, $N_{cb}$. Columns: original and variation for each of NACO-FMD, HAM-FMD, and Core.)

Core is a core attachment-based algorithm [27], which successively predicts core proteins, identifies their attachment proteins, and forms functional modules in a PPI network. Jerarca is a hierarchy-based algorithm, which first computes weighted distances and then employs a neighbor-joining algorithm to build dendrograms [18]. ADMSC is a spectral clustering-based algorithm; it analytically solves the cluster structure of PPI networks as a problem of random walks in the diffusion process [26]. DPClus is a density-based algorithm, which uses a cluster periphery-tracking mechanism to generate clusters [15]. Tables 4 and 5 show the detailed comparative results of these algorithms on the five different data sets, where a variation algorithm refers to the corresponding algorithm with the multiple-grain framework.
For each detection algorithm, we list the number of clusters detected (number of clusters), the average number of proteins in each cluster (size of average module), the number of detected modules which match at least one real module ($N_{cp}$), and the number of real modules that match at least one detected module ($N_{cb}$). Taking NACO-FMD on the DIPScere data as an example, the original algorithm detected 571 modules, of which 127 match 212 real modules, while the variation algorithm detected 545 modules, of which 139 match 202 real modules. From the two tables, we observe that the new framework can produce different results for various algorithms.

TABLE 5
The Results of Jerarca, DPClus, and ADMSC Algorithms on Five Different Data Sets
(Data sets: DIPScere, MIPS, BioGrid2014, DIPHsapi, H-InvDB. Metrics for each data set: number of clusters, size of average module, $N_{cp}$, $N_{cb}$. Columns: original and variation for each of Jerarca, DPClus, and ADMSC.)

Fig. 5. A comparison of coverage for six algorithms and their corresponding variations on five datasets.
Fig. 6. A comparison of precision for six algorithms and their corresponding variations on five datasets.
Fig. 7. A comparison of recall for six algorithms and their corresponding variations on five datasets.
Fig. 8. A comparison of F-measure for six algorithms and their corresponding variations on five datasets.
Fig. 9. A comparison of sensitivity for six algorithms and their corresponding variations on five datasets.
Fig. 10. A comparison of PPV for six algorithms and their corresponding variations on five datasets.

Figs. 5, 6, 7, 8, 9, 10, and 11 show the overall comparison results of these methods and their corresponding variations (marked with an asterisk) in terms of seven metrics on the five different data sets, respectively. From Fig. 5, we find that applying our framework sometimes increases the coverage value and sometimes decreases it, which means that the change of coverage depends not only on the detection algorithm but also on the test data. The main reason is that the new framework performs many fusions while building the MGM-PPI, so some algorithms have a chance to recover nodes that they would otherwise miss.
Thus, algorithms with lower Coverage values (about 0.5 or less) can increase their Coverage values on some data, such as Core on the DIPScere, MIPS, and DIPHsapi data and NACO-FMD on the DIPHsapi data. From Fig. 6, one can easily see that all algorithms with the multiple-grain framework achieve better precision on the DIPScere data. However, for the other four data sets, some algorithms increase or keep their precision values while others decrease them. By experimental analysis, we find that if the average node degree of the obtained MGM-PPI is close to that of the initial graph, then using the new framework can improve the precision performance.

Fig. 11. A comparison of accuracy for six algorithms and their corresponding variations on the datasets.

As shown in Fig. 7, applying the new framework makes the recall of the six algorithms decline in general; on the MIPS data in particular, all algorithms obtain worse recall values. A main reason is that the two extra processes (grain partitioning and grain eliminating) on MIPS make $N_{cb}$ decline. Of course, several algorithms still show better recall performance, e.g., HAM-FMD and Jerarca on DIPScere, HAM-FMD on BioGrid2014, NACO-FMD on DIPHsapi, and NACO-FMD and HAM-FMD on H-InvDB.

Fig. 8 gives the F-measure comparisons, which combine the precision and recall performances in each case. It is obvious that applying our framework increases the F-measure of the six algorithms on DIPScere because the corresponding precision increases. However, for the other four data sets, the new framework generally tends to lower the F-measure, apart from several increases (NACO-FMD on DIPHsapi and H-InvDB, HAM-FMD on H-InvDB) and several unchanged values (DPClus on MIPS; NACO-FMD, HAM-FMD, and ADMSC on BioGrid2014; and HAM-FMD and ADMSC on DIPHsapi). Because the corresponding recall decreases on MIPS, the five variation algorithms other than DPClus obtain worse F-measure values. Moreover, since H-InvDB is a large, sparse human data set, some variation algorithms (DPClus, Core, Jerarca, ADMSC) obtain poor precision and recall, so their F-measure values decrease significantly.

As shown in Fig. 9, applying the new framework results in some irregular changes in the sensitivity values. For instance, DPClus increases its sensitivity values on the four data sets other than DIPHsapi, while Jerarca decreases them on all test data.
For some data sets, the new framework can help some algorithms merge new proteins into some real functional modules, which may make the ratio of proteins covered by the predicted modules either decline or increase. The sensitivity performance depends not only on the changes in the topological characteristics of the graph but also on the graph clustering mechanism. From Fig. 10, one can easily see that the PPV values of these algorithms decrease slightly in many cases. The main reason is that the new framework is able to detect overlapping modules, which potentially causes the denominator of Eq. (11) to increase. Fig. 11 shows the comparison of accuracy values for the six algorithms and their corresponding variations on the five data sets. Since using the new framework makes most algorithms obtain worse PPV values, the corresponding accuracy values also generally decline. However, some algorithms get better PPV or sensitivity values on some data, so their accuracy values can be improved by employing the new framework, such as Core on DIPScere and BioGrid2014, NACO-FMD on DIPHsapi and H-InvDB, and HAM-FMD on H-InvDB.

From the above results on seven measures, we conclude that applying the new framework to a detection algorithm on some data sets may improve some metrics (e.g., coverage, precision, and F-measure) while likely worsening others. The two main factors are the topological changes caused by the new framework and the clustering mechanisms employed. On the whole, however, the introduction of the new framework roughly maintains the competitive performance of the original algorithms.
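The remark that overlapping modules can enlarge the denominator of Eq. (11) can be made concrete with a toy calculation. The sketch below is an illustration under stated assumptions only: the helper name `sn_and_ppv` and the two small module sets are hypothetical and are not drawn from the paper's data. It compares a disjoint set of identified modules with an overlapping one against the same two benchmark modules.

```python
def sn_and_ppv(benchmark, identified):
    """Return (S_n, PPV, PPV denominator) following Eqs. (10) and (11)."""
    # t[i][j]: proteins shared by benchmark module i and identified module j.
    t = [[len(b & p) for p in identified] for b in benchmark]
    sn = sum(max(row) for row in t) / sum(len(b) for b in benchmark)
    columns = list(zip(*t))
    denominator = sum(sum(col) for col in columns)  # sum over j of T_.j
    ppv = sum(max(col) for col in columns) / denominator
    return sn, ppv, denominator


# Hypothetical benchmark modules that share protein "C".
benchmark = [{"A", "B", "C"}, {"C", "D", "E"}]

# Case 1: identified modules are disjoint ("C" is assigned only once).
disjoint = [{"A", "B", "C"}, {"D", "E"}]
# Case 2: identified modules overlap on "C".
overlapping = [{"A", "B", "C"}, {"C", "D", "E"}]

sn_d, ppv_d, den_d = sn_and_ppv(benchmark, disjoint)
sn_o, ppv_o, den_o = sn_and_ppv(benchmark, overlapping)

print(f"disjoint:    Sn={sn_d:.3f}  PPV={ppv_d:.3f}  denominator={den_d}")
print(f"overlapping: Sn={sn_o:.3f}  PPV={ppv_o:.3f}  denominator={den_o}")
```

In this toy case the overlapping assignment raises the PPV denominator from 6 to 8 and lowers PPV from 5/6 to 0.75, even though sensitivity rises from 5/6 to 1, which is consistent with the opposite movements of sensitivity and PPV described for Figs. 9 and 10.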
To explicitly reveal the significance of the new framework, we further compared the time performance of the six algorithms and their corresponding variations on the five datasets. The experimental results are shown in Table 6, where the original and multiple-grain rows denote the running time of the original algorithms and of the same algorithms with the multiple-grain framework, respectively, and the upgrade rate row shows the degree to which the new framework shortens the running time. From these results, we find that the new framework greatly shortens the running time of all six algorithms in every case: the minimum upgrade rate is 54.9 percent, the maximum upgrade rate is 94.5 percent, and the maximum time saved is 100,828 seconds (Core on H-InvDB).

TABLE 6
Time Comparisons for Six Algorithms and Their Corresponding Variations on the Five Datasets
(Data sets: DIPScere, MIPS, BioGrid2014, DIPHsapi, H-InvDB. Rows for each data set: original (s), multiple-grain (s), upgrade rate (%). Columns: NACO-FMD, HAM-FMD, Core, Jerarca, DPClus, ADMSC.)

We have systematically compared the NACO-FMD, HAM-FMD, Core, Jerarca, DPClus, and ADMSC algorithms with their corresponding variations under the new framework in terms of various metrics. The outstanding time performance of the new framework on the five data sets demonstrates that the proposed method can dramatically increase the time efficiency of module detection algorithms for various PPI data while keeping the other performance metrics competitive. Thus, the new framework has great potential for detecting functional modules in large-scale PPI networks.

5 CONCLUSIONS

How to efficiently identify functional modules by means of novel computational approaches is still an important scientific problem in computational biology. In the era of big data, people have more opportunities to obtain large-scale PPI networks. Therefore, developing efficient and robust ways to detect functional modules in such large-scale PPI networks has become a new challenge. In this paper, we proposed a novel framework to accelerate the process of detecting functional modules from large-scale PPI networks. First, we define a multiple-grain model of a PPI network by which the scale of a PPI network can be reduced.
Then, we employ a functional or structural similarity to design a protein grain partitioning method. Finally, we apply a refining mechanism with border node tests to handle the protein overlapping of different modules during the grain eliminating process. Systematic experiments with six algorithms on five datasets show that the new framework not only significantly reduces the running time of module detection algorithms but also effectively identifies overlapping modules while keeping competitive performance on several metrics. Thus, the proposed framework is well suited to detecting functional modules in large-scale PPI networks. Our future work includes investigating the relationships among the framework, the algorithm, and the data characteristics in order to design a better algorithm with high performance in various aspects.

ACKNOWLEDGMENTS

This work is partly supported by the NSFC Research Program, the National 973 Key Basic Research Program of China (2014CB744601), the Specialized Research Fund for the Doctoral Program of Higher Education, and the Beijing Municipal Education Research Plan key project (Beijing Municipal Fund Class B) (KZ…).

REFERENCES

[1] S. D. Patterson and R. H. Aebersold, "Proteomics: The first decade and beyond," Nature Genetics, vol. 33.
[2] A. D. Zhang, Protein Interaction Networks: Computational Analysis. Cambridge, U.K.: Cambridge Univ. Press.
[3] M. N. J. Seaman, "Recycle your receptors with retromer," Trends Cell Biol., vol. 15, no. 2.
[4] D. O. Morgan, The Cell Cycle: Principles of Control. London, U.K.: New Science Press.
[5] V. Spirin and L. A. Mirny, "Protein complexes and functional modules in molecular networks," in Proc. Nat. Acad. Sci., vol. 100, no. 21.
[6] B. L. Chen and F. X. Wu, "Identifying protein complexes based on multiple topological structures in PPI networks," IEEE Trans. Nanobiosci., vol. 12, no. 3, Sep.
[7] B. L. Chen, W. W. Fan, J. Liu, and F. X. Wu, "Identifying protein complexes and functional modules: From static PPI networks to dynamic PPI networks," Brief. Bioinform., vol. 15, no. 2.
[8] X. L. Li, M. Wu, C. K. Kwoh, and S. K. Ng, "Computational approaches for detecting protein complexes from protein interaction networks: A survey," BMC Genomics, vol. 11, suppl. 1, p. S3.
[9] G. Rigaut, A. Shevchenko, B. Rutz, M. Wilm, and M. Mann, "A generic protein purification method for protein complex characterization and proteome exploration," Nature Biotechnol., vol. 17, no. 10.
[10] A. C. Gavin, M. Boesche, R. Krause, et al., "Functional organization of the yeast proteome by systematic analysis of protein complexes," Nature, vol. 415, no. 6868.
[11] K. Tarassov, V. Messier, C. R. Landry, S. Radinovic, M. M. Molina, and I. Shames, "An in vivo map of the yeast protein interactome," Science, vol. 320, no. 5882.
[12] J. Z. Ji, A. D. Zhang, C. N. Liu, X. M. Quan, and Z. J. Liu, "Survey: Functional module detection from protein-protein interaction networks," IEEE Trans. Knowl. Data Eng., vol. 26, no. 2, Feb.
[13] G. D. Bader and C. W. Hogue, "An automated method for finding molecular complexes in large protein interaction networks," BMC Bioinform., vol. 4, no. 1, p. 2.
[14] B. Adamcsek, G. Palla, I. J. Farkas, I. Derenyi, and T. Vicsek, "CFinder: Locating cliques and overlapping modules in biological networks," Bioinformatics, vol. 22, no. 8.
[15] M. Altaf-Ul-Amin, Y. Shinbo, K. Mihara, K. Kurokawa, and S. Kanaya, "Development and implementation of an algorithm for detection of protein complexes in large interaction networks," BMC Bioinform., vol. 7, no. 1, p. 207.
[16] E. Ravasz, A. L. Somera, D. A. Mongru, Z. N. Oltvai, and A. L. Barabasi, "Hierarchical organization of modularity in metabolic networks," Science, vol. 297, no. 5586.
[17] V. Arnau, S. Mars, and I. Marin, "Iterative cluster analysis of protein interaction data," Bioinformatics, vol. 21, no. 3.
[18] R. Aldecoa and I. Marin, "Jerarca: Efficient analysis of complex networks using hierarchical clustering," PLoS ONE, vol. 5, no. 7, p. e11585.
[19] A. D. King, N. Przulj, and I. Jurisica, "Protein complex prediction via cost-based clustering," Bioinformatics, vol. 20, no. 17.
[20] B. J. Frey and D. Dueck, "Clustering by passing messages between data points," Science, vol. 315, no. 5814.
[21] A. Abdullah, S. Deris, S. Z. M. Hashim, and H. M. Jamil, "Graph partitioning method for functional module detections of protein interaction network," in Proc. Int. Conf. Comput. Technol. Develop., 2009.
[22] A. J. Enright, S. Van Dongen, and C. A. Ouzounis, "An efficient algorithm for large-scale detection of protein families," Nucleic Acids Res., vol. 30, no. 7.
[23] J. B. Pereira-Leal, A. J. Enright, and C. A. Ouzounis, "Detection of functional modules from protein interaction networks," Proteins, vol. 54.
[24] Y. R. Cho, W. Hwang, M. Ramanathan, and A. D. Zhang, "Semantic integration to identify overlapping functional modules in protein interaction networks," BMC Bioinform., vol. 8, no. 1, p. 265.
[25] W. Hwang, Y. R. Cho, A. D. Zhang, and M. Ramanathan, "CASCADE: A novel quasi all paths-based network analysis algorithm for clustering biological interactions," BMC Bioinform., vol. 9, no. 1, p. 64, 2008.

[26] K. Inoue, W. Li, and H. Kurata, "Diffusion model based spectral clustering for protein-protein interaction networks," PLoS ONE, vol. 5, no. 9, p. e12623.
[27] H. C. Leung, S. M. Yiu, Q. Xiang, and F. Y. Chin, "Predicting protein complexes from PPI data: A core-attachment approach," J. Comput. Bio., vol. 16, no. 2, pp. 133-144, 2009.
[28] M. Wu, X. L. Li, C. K. Kwoh, and S. K. Ng, "A core-attachment based method to detect protein complexes in PPI networks," BMC Bioinform., vol. 10, no. 1, p. 169.
[29] J. Sallim, R. Abdullah, and A. T. Khader, "ACOPIN: An ACO algorithm with TSP approach for clustering proteins from protein interaction network," in Proc. 2nd UKSIM Eur. Symp. Comput. Model. Simul., 2008, pp. 203-208.
[30] S. Wu, X. J. Lei, and J. F. Tian, "Clustering PPI network based on functional flow model through artificial bee colony algorithm," in Proc. 7th Int. Conf. Natural Comput., 2011.
[31] J. Z. Ji, Z. J. Liu, A. D. Zhang, L. Jiao, and C. N. Liu, "Improved ant colony optimization for detecting functional modules in protein-protein interaction networks," in Proc. 3rd Int. Conf. Inform. Comput. Appl., 2012.
[32] J. Z. Ji, Z. J. Liu, A. D. Zhang, C. C. Yang, and C. N. Liu, "HAM-FMD: Mining functional modules in protein-protein interaction networks using ant colony optimization and multi-agent evolution," Neurocomputing, vol. 121.
[33] Y. R. Cho, W. Hwang, and A. D. Zhang, "Efficient modularization of weighted protein interaction networks using k-hop graph reduction," in Proc. 6th IEEE Symp. Bioinform. Bioeng., 2006.
[34] S. Oliveira and S. C. Seok, "Multilevel approaches for large-scale proteomic networks," Int. J. Comput. Math., vol. 84, no. 5.
[35] S. Oliveira and S. C. Seok, "A matrix-based multilevel approach to identify functional protein modules," Int. J. Bioinform. Res. Appl., vol. 4, no. 1.
[36] A. Schlicker and M. Albrecht, "FunSimMat: A comprehensive functional similarity database," Nucleic Acids Res., vol. 36, pp. D434-D439.
[37] C. Brun, C. Herrmann, and A. Guenoche, "Clustering proteins from interaction networks for the prediction of cellular functions," BMC Bioinform., vol. 5, no. 1, p. 95.
[38] H. W. Mewes, et al., "MIPS: Analysis and annotation of proteins from whole genomes," Nucleic Acids Res., vol. 32, Database issue, pp. D41-D44.
[39] C. C. Friedel, J. Krumsiek, and R. Zimmer, "Bootstrapping the interactome: Unsupervised identification of protein complexes in yeast," Res. Comput. Molecular Bio., vol. 4955, pp. 3-16.
[40] P. Aloy, et al., "Structure-based assembly of protein complexes in yeast," Science, vol. 303, no. 5666.
[41] S. S. Dwight, et al., "Saccharomyces genome database provides secondary gene annotation using the gene ontology," Nucleic Acids Res., vol. 30, no. 1.
[42] T. Junichi, et al., "H-InvDB in 2013: An omics study platform for human functional gene and transcript discovery," Nucleic Acids Res., vol. 41, no. D1, pp. D915-D919.
[43] X. L. Li, M. Wu, C. K. Kwoh, and S. K. Ng, "Computational approaches for detecting protein complexes from protein interaction networks: A survey," BMC Genomics, vol. 11, suppl. 1, p. S3.
[44] H. N. Chua, K. Ning, W. K. Sung, H. W. Leong, and L. Wong, "Using indirect protein-protein interactions for protein complex prediction," in Proc. CSB, 2007.
[45] W. Hwang, Y. R. Cho, A. D. Zhang, and M. Ramanathan, "CASCADE: A novel quasi all paths-based network analysis algorithm for clustering biological interactions," BMC Bioinform., vol. 9, no. 1, p. 64.

Junzhong Ji received the PhD degree in computer science and application technology from the Beijing University of Technology. He is a professor and supervisor at the Computer Science College, Beijing University of Technology, a Youth Skeleton teacher in Beijing, and a senior member of the China Computer Federation.
He was a visiting scholar at the Norwegian University and State University of New York at Buffalo, respectively. His research interests include data mining, machine learning, swarm intelligence, and bioinformatics. He has published over 80 papers in these areas.

Jiawei Lv is currently working toward the master's degree at the Beijing University of Technology. His main research interests include computational biology and data mining.

Cuicui Yang is currently working toward the PhD degree in computer science at the Beijing University of Technology. Her main interests include artificial intelligence and data mining.

Aidong Zhang is a UB distinguished professor and the chair of the Department of Computer Science and Engineering, State University of New York at Buffalo. Her research interests include bioinformatics, data mining, multimedia and database systems, and content-based image retrieval. She is an author of over 200 research publications in these areas. She has chaired or served on over 100 program committees of international conferences and workshops, and currently serves on several journal editorial boards. She has published two books, Protein Interaction Networks: Computational Analysis (Cambridge University Press, 2009) and Advanced Analysis of Gene Expression Microarray Data (World Scientific Publishing Co., Inc., 2006). She received the US National Science Foundation CAREER award and the State University of New York (SUNY) Chancellor's Research Recognition award. She is a fellow of the IEEE.
