
IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 13, NO. 4, JULY/AUGUST 2016

Detecting Functional Modules Based on a Multiple-Grain Model in Large-Scale Protein-Protein Interaction Networks

Junzhong Ji, Jiawei Lv, Cuicui Yang, and Aidong Zhang

Abstract: Detecting functional modules from a Protein-Protein Interaction (PPI) network is a fundamental and actively studied issue in proteomics research, where many computational approaches have played an important role in recent years. However, how to detect functional modules effectively and efficiently in large-scale PPI networks is still a challenging problem. We present a new framework, based on a multiple-grain model of PPI networks, to detect functional modules in PPI networks. First, we give a multiple-grain representation model of a PPI network, which has a smaller scale with super nodes. Next, we design a protein grain partitioning method, which employs a functional similarity or a structural similarity to merge proteins layer by layer. Third, a refining mechanism with border node tests is proposed to address the overlapping of proteins between different modules during the grain eliminating process. Finally, systematic experiments are conducted on five large-scale yeast and human networks. The results show that the framework not only significantly reduces the running time of functional module detection, but also effectively identifies overlapping modules while keeping competitive performance; it is thus well suited to detecting functional modules in large-scale PPI networks.

Index Terms: Computational biology, large-scale PPI networks, functional module detection, multiple-grain model

1 INTRODUCTION

With the coming of the postgenomic era, proteomics research has gradually become one of the most important areas in the field of life science [1]. As a biomolecular relationship network, the protein-protein interaction (PPI) network plays an important role in biological activities.
Hence, the analysis of PPI networks naturally serves as a basis for better understanding cellular organization, processes, and functions [2]. From the biological point of view, cellular functions and biochemical events are carried out in coordination by groups of proteins that interact with each other in functional modules (or complexes), and the modular structure of a PPI network is critical to its functions. For instance, the retromer complex is a key component of the endosomal protein sorting machinery [3], and the Anaphase-Promoting Complex is an E3 ubiquitin ligase that marks target cell cycle proteins for degradation by the 26S proteasome [4]. Therefore, correctly identifying such functional modules (or complexes) in PPI networks is a vital scientific problem for understanding the structures and functions of these fundamental cellular networks, for further discovering the mechanisms of diseases, and for developing new medicines. As many studies have pointed out [5], [6], [7], [8], functional modules are protein groups whose proteins interact with each other at different times and places, while protein complexes are protein groups whose proteins interact with each other at the same time and place. Because the protein interaction data used in this paper carry no temporal or spatial information, we use the concept of functional modules in this research.

J. Ji, J. Lv, and C. Yang are with the Beijing Municipal Key Laboratory of Multimedia and Intelligent Software Technology, College of Computer Science and Technology, Beijing University of Technology, Beijing, China. E-mail: jjz01@bjut.edu.cn, {zhangoic, yangcc_2008}@163.com.
A. Zhang is with the Department of Computer Science and Engineering, University at Buffalo, The State University of New York, Buffalo, NY. E-mail: azhang@cse.buffalo.edu.
Manuscript received 15 Nov. 2014; revised 4 June 2015; accepted 9 Sept. Date of publication 18 Sept. 2015; date of current version 4 Aug. For information on obtaining reprints of this article, please send e-mail to: reprints@ieee.org, and reference the Digital Object Identifier below.
Digital Object Identifier no. /TCBB
© 2015 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE permission.

So far, there are a number of biological experimental methods to detect functional modules in PPI networks, such as tandem affinity purification with mass spectrometry [9], [10] and the protein-fragment complementation assay (PCA) [11]. However, these experimental methods have several limitations, such as too many processing steps and too much time consumed, which has motivated computer scientists to investigate efficient, novel, and robust ways to fully exploit protein interaction data to mine functional modules. In the past decade, many computational approaches based on machine learning and data mining have grown rapidly and become useful complements to the experimental methods [12]. Although these approaches differ in the ideas and schemes they employ, the clustering analysis of a graph is their basic and key technique, which rests on the fact that closely connected protein areas in PPI networks correspond to protein functional modules. Up to now, a variety of classic clustering approaches, such as density-based clustering [13], [14], [15], hierarchical clustering [16], [17], [18], partition-based clustering [19], [20], [21], and flow simulation-based clustering [22], [23], [24], have been applied to identifying functional modules in PPI networks. In recent years, a number of new approaches have also emerged [25], [26], [27], [28], which employ novel computational models to effectively identify functional modules in a PPI network. More recently, some nature-inspired swarm intelligence algorithms have been successfully applied to the detection of functional modules in PPI networks [29], [30], [31], [32].

In summary, using computational approaches to detect protein functional modules in PPI networks has received considerable attention, and researchers have proposed many detection ideas and schemes over the past few years [12], which shows the vitality and broad application prospects of this line of work. However, with the increasing availability of biological experiments and information mining techniques, PPI networks for most species have become larger and larger, which brings a new challenge for many existing computational approaches, especially for those with higher time complexity. That is, some computational approaches may be limited in dealing with large-scale PPI networks because of their own complexity. To resolve this problem, some researchers have carried out meaningful explorations. For example, Cho et al. proposed a method based on graph reduction [33], which first selects a small number of informative proteins from a large network and transforms the intricate small-world, scale-free network into a simple graph with high modularity. Oliveira and Seok proposed a multilevel algorithm with structured analysis of unweighted networks [34], [35], which constructs high-quality groups of merged nodes before applying a clustering algorithm. These two methods can remarkably enhance the efficiency of detecting functional modules by reducing the scale of a PPI network. In this paper, we present a new framework to efficiently detect functional modules in a PPI network.
The main contributions of this paper include the following:

1) From the perspective of granular computation, we define a multiple-grain model of a PPI network, which has a small scale, and propose a new framework to detect functional modules in a large-scale PPI network.
2) To obtain a smaller network, we design a grain partitioning method for proteins, which combines the functional similarity with the structural similarity to merge proteins layer by layer.
3) To address the overlapping of proteins between different modules in a PPI network, we develop a refining mechanism with border node tests during the grain eliminating process.
4) Systematic experiments have been conducted to verify the proposed framework using several state-of-the-art algorithms on benchmark testing sets of yeast and human networks.

The rest of this paper is organized as follows. Section 2 gives the multiple-grain representation model of a PPI network. Section 3 introduces the proposed algorithm framework, the grain partitioning method, and the refining mechanism with border node tests. Section 4 presents and analyzes the experimental results. Section 5 concludes this paper and outlines future research work.

2 MULTIPLE GRAIN MODEL OF PPI NETWORKS

In the real world, there are many complex and difficult problems that are hard to solve accurately. Therefore, people often do not pursue the best solution in a single step, but approach a solution of a certain precision step by step. This is the multi-granularity analysis method, which uses a refining policy from coarse to fine. By means of this idea, many large-scale problems with high complexity can be solved efficiently. In other words, the multi-granularity representation is an effective way to reduce the scale of problems and improve the efficiency of problem solving.
Generally speaking, a PPI network is represented as an undirected graph $G = (V, E)$ with a set of nodes $V$ and a set of edges $E$, where $V$ and $E$ represent proteins and the interactions between them, respectively. In PPI networks, proteins belonging to the same functional module tend to perform similar functions in some biological activities. They are either closely linked, which reflects topological similarity, or closely related in function, which reflects functional similarity. That is, from the view of structure partition, closely related elements can be viewed as the same kind of element either in structure or in function. For a PPI network, if two proteins have a high degree of similarity in both topology and function, then the two proteins are more likely to be grouped into the same functional module. If we merge such similar proteins in a PPI network into the same virtual protein in advance, the scale of the PPI network decreases significantly, and the running time of functional module detection can be greatly shortened. Based on this idea of problem reduction, we first introduce some definitions.

Definition 1 (Super protein node). Given any two nodes $i, j \in V$ in a graph $G$, once the similarity degree between the two nodes is greater than a preset fusion threshold, they can be merged into a larger node $s$. We call the larger node $s$ a super protein node.

Because the integration process can be run multiple times in the whole problem-reduction process, a super protein node in a higher layer network may contain a number of proteins of the initial graph. From the view of function or structure, these proteins are considered to be similar to each other. However, it should be noted that a super protein node in the current layer network contains at most two nodes of the previous layer network, since each fusion within one layer involves only two nodes.
If two nodes are merged, their Gene Ontology (GO) annotation information is combined by union. Moreover, the local graph structure is simplified by the change of neighbors. For a node $i$ in a graph $G$, the set of its neighbor nodes is denoted as $Neigh(i) = \{ j \mid j \in V, (i, j) \in E \}$. For a super protein node $s$, the set of its neighbor nodes is described as follows.

Definition 2 (Neighbor of a super protein node). Given a super protein node $s$ that includes protein nodes $i$ and $j$ of the current layer network, the set of its neighbor nodes is defined as $Neigh(s) = \{ l \mid l \in (Neigh(i) \cup Neigh(j)) \setminus s \}$.

Some protein nodes do not meet the fusion requirement. These protein nodes are defined as follows.

Definition 3 (Isolated protein node). If a protein in a PPI network is not merged with any other protein after several integrations, it is called an isolated protein node.
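Definitions 1 and 2 can be illustrated with a minimal sketch of a single merge. It assumes the network is stored as a dictionary of neighbor sets and the GO annotations as a dictionary of term sets; the function name `merge_pair` and the super-node naming scheme are hypothetical and not taken from the paper.

```python
def merge_pair(adj, go, i, j):
    """Merge nodes i and j into one super node (illustrative sketch).

    adj: dict mapping node -> set of neighbor nodes (undirected graph)
    go:  dict mapping node -> set of GO terms
    Returns (new_adj, new_go, s), leaving the inputs unchanged.
    """
    s = f"({i}+{j})"  # hypothetical label for the super node

    # Definition 2: Neigh(s) = (Neigh(i) U Neigh(j)) minus the merged members.
    neigh_s = (adj[i] | adj[j]) - {i, j}

    new_adj = {}
    for u, nbrs in adj.items():
        if u in (i, j):
            continue
        # Edges that pointed at i or j now point at the super node s.
        new_adj[u] = {s if v in (i, j) else v for v in nbrs}
    new_adj[s] = neigh_s

    # GO annotations of the merged nodes are combined by union.
    new_go = {u: terms for u, terms in go.items() if u not in (i, j)}
    new_go[s] = go[i] | go[j]
    return new_adj, new_go, s
```

Applying `merge_pair` to every matched pair of one layer would yield the next, coarser layer; nodes that are never passed to it remain isolated protein nodes in the sense of Definition 3.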

Fig. 1. The schematic diagram of an MGM-PPI.

In general, a higher layer network contains not only some super protein nodes but also some isolated protein nodes. Once two nodes are merged into a super node, its local graph structure and GO annotation information are changed or united before further testing in the next layer. That is, super protein nodes and isolated protein nodes in the same layer network are handled in the same way. Based on the definitions above, a multiple-grain model for a PPI network (MGM-PPI) is given as follows.

Definition 4 (Multiple Grain Model for a PPI network). By merging similar proteins, we can transform, layer by layer, a large-scale PPI network whose protein nodes all have the same granularity into a small-scale network whose super protein nodes have different granularities. We call the connected graph with different granularities an MGM-PPI.

Fig. 1 gives a schematic diagram of an MGM-PPI acquired by two merge processes, where the top layer gives an initial PPI network $G_0$, the middle layer shows a coarse-grained connection diagram $G_1$ after the first merge process, and the third layer shows a more coarse-grained connection diagram $G_2$ after the second merge process. Clearly, the more often the merge process occurs, the smaller the MGM-PPI acquired and the simpler the connection relationships formed. Because employing the multiple-grain model in detecting functional modules can greatly decrease the scale of the PPI network and reduce the computational complexity, it is of great significance for increasing the ability of module detection algorithms to handle large-scale networks.
3 FRAMEWORK BASED ON THE MGM-PPI

Based on the MGM-PPI, this section presents a new framework to detect functional modules from large-scale PPI networks, which aims to improve the time performance. Fig. 2 shows the flowchart of the proposed framework, whose three key steps are automatically-layered grain partitioning, module detecting, and grain eliminating with a refining strategy. The first step performs the coarsening process, which transforms a large-scale PPI network into a smaller one by merging close nodes into corresponding super nodes layer by layer, $G_0 \to G_1 \to \cdots \to G_i \to \cdots \to G_k$, where $G_i$ represents the $i$th layer MGM-PPI. Obviously, as the number of layers $k$ increases, the super nodes become bigger and bigger while the number of nodes (edges) in the corresponding MGM-PPI gets smaller and smaller. This step is essential for the framework to establish the MGM-PPI. The second step carries out a functional module detection process on the coarsest layer, where an appropriate algorithm with high accuracy can be used. The third step performs a restoring and refining process on the clustering result of the coarsest layer. That is, following the order $G_k \to G_{k-1} \to \cdots \to G_i \to \cdots \to G_0$, this step completes grain eliminating layer by layer, where, in addition to the transformation from one layer with larger super nodes to another layer with smaller ones, a refining strategy is employed to optimize the results generated.

3.1 Automatically-Layered Grain Partitioning

The goal of grain partitioning is to reduce a PPI network to a smaller one so that many algorithms can be applied at a lower cost. A protein node has one of two states in this phase: unmatched or matched. At the initial stage of each layer, every protein node in the PPI network is set to the unmatched state. Then we traverse the nodes one by one to find node pairs that can be integrated.
Once a node pair meets a preset merging condition, the states of the two protein nodes are changed from unmatched to matched. After all nodes in the current layer have been traversed, the matched protein pairs are merged to form corresponding super nodes with larger granularity, while the remaining unmatched proteins keep their original grain size. In detail, the coarsening process consists of the following four steps:

1) All protein nodes in the current layer are sorted in a roughly descending order of degree.
2) We select the first unmatched node and search its neighbor nodes in descending order to find an unmatched node that satisfies any merging condition. When there is such a pair of nodes, their states are switched from unmatched to matched.
3) We repeat step 2 until all nodes in the current layer have been traversed. We thus find all matching pairs of nodes, and merge every pair of matched nodes to build the corresponding super protein node.
4) We continue to perform the above three steps on a higher layer network until the end condition of grain partitioning is satisfied.

Fig. 2. The flowchart of the proposed framework.

In the above process, there are two key problems to be addressed. The first one is how to set a reasonable merging condition. We employ a functional similarity measure ($f_{ij}$) and a topology similarity measure ($s_{ij}$) to determine whether a pair of nodes $(i, j)$ can be merged. The judgment rule is that a pair of protein nodes $(i, j)$ is merged into a super node only if $f_{ij} \ge \varepsilon_1$ or $s_{ij} \ge \varepsilon_2$, where $\varepsilon_1$ and $\varepsilon_2$ are two thresholds. To compute the functional similarity $f_{ij}$ of two proteins $i$ and $j$, we utilize their GO information. If the two protein nodes are annotated with two GO term sets $g_i$ and $g_j$, respectively, $f_{ij}$ can be calculated by [36]:

$$ f_{ij} = \frac{|g_i \cap g_j|}{|g_i \cup g_j|}. \tag{1} $$

Based on the Czekanovski-Dice distance $d_{ij}$ [37], we calculate the topology similarity between two proteins as follows:

$$ s_{ij} = 1 - d_{ij} = 1 - \frac{|Int(i) \cup Int(j)| - |Int(i) \cap Int(j)|}{|Int(i) \cup Int(j)| + |Int(i) \cap Int(j)|}, \tag{2} $$

where $Int(i) = Neigh(i) \cup \{i\}$. Obviously, when two proteins $i$ and $j$ interact with completely different neighbor sets of proteins, the similarity between them reaches its smallest value 0. In other cases, the similarity value is larger than 0 and smaller than or equal to 1.
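The two similarity measures in (1) and (2) and one matching pass (steps 1 to 3 above) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the graph is a dictionary of neighbor sets, thresholds are compared with "greater than or equal", ties in degree are broken by node name, two proteins without any GO annotation are given functional similarity 0, and the function names are hypothetical.

```python
def functional_similarity(go_i, go_j):
    """Eq. (1): Jaccard index of two GO term sets (0.0 if both are empty)."""
    union = go_i | go_j
    return len(go_i & go_j) / len(union) if union else 0.0


def topology_similarity(adj, i, j):
    """Eq. (2): 1 minus the Czekanovski-Dice distance of Int(i) and Int(j)."""
    int_i = adj[i] | {i}  # Int(i) = Neigh(i) U {i}
    int_j = adj[j] | {j}
    union = len(int_i | int_j)
    inter = len(int_i & int_j)
    return 1.0 - (union - inter) / (union + inter)


def match_pass(adj, go, eps1, eps2):
    """One traversal of the current layer; returns the matched node pairs."""
    def by_degree(nodes):
        # Descending degree; node name as an assumed tie-breaker.
        return sorted(nodes, key=lambda u: (-len(adj[u]), u))

    matched = set()
    pairs = []
    for i in by_degree(adj):
        if i in matched:
            continue
        for j in by_degree(adj[i]):
            if j in matched:
                continue
            if (functional_similarity(go[i], go[j]) >= eps1
                    or topology_similarity(adj, i, j) >= eps2):
                matched.update((i, j))
                pairs.append((i, j))
                break  # each node takes part in at most one merge per layer
    return pairs
```

For a triangle A, B, C with an extra node D attached to C, the highest-degree node C is paired with A on topology alone, since Int(C) and Int(A) share three of four nodes and $s_{CA} = 6/7$.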
Another key problem is how to determine the number of layers for the grain partitioning. That is, a suitable termination condition needs to be set to end the coarsening process. Merging nodes with high topological and functional similarity changes the topological structure of the PPI network, so we carried out many experiments to explore the relationships between graph metrics (such as density, average shortest path length, edge betweenness centrality, average node degree, and clustering coefficient of nodes) and the number of layers. We found that the density of the $k$th layer MGM-PPI network, $den(G_k) = 2e / (v(v-1))$ with $v = |V_k|$ and $e = |E_k|$, is a function of the layer number with a single extreme value. As the layer number $k$ increases, the density first rises gradually to a maximum value and then declines, significantly or slightly. More importantly, once the density change between two adjacent layer networks is small, fewer nodes are merged, which means that the contribution of further grain partitioning becomes very small. To ensure both the efficiency of problem reduction and the performance of module detection, the automatically-layered coarsening mechanism is designed as:

$$ K = \arg\min_{k} \{ |den(G_k) - den(G_{k-1})| \le \mu, \; k > 0 \}, \tag{3} $$

where $\mu$ is a preset small threshold value (e.g., ) and $K$ is the minimum layer at which the density of the MGM-PPI approaches its maximum value.

3.2 Grain Elimination with a Refining Strategy

Once the grain partitioning ends, we acquire a smaller-scale network based on the MGM-PPI, to which a module detecting algorithm can be applied to quickly find initial functional modules. These modules may consist of many super protein nodes and some isolated protein nodes. To obtain the final functional modules, which consist of initial protein nodes, we have to perform the grain eliminating process from coarse to fine.
That is, we need to restore the node connections of the previous layer network, layer by layer, where the basic operation is that super nodes are separated layer by layer until there is no super node in the corresponding network (i.e., until the initial PPI network is recovered). Moreover, the separation of super nodes at each layer may cause changes in the connections between modules. To address proteins that overlap between different modules, we design a refining mechanism to test which module some protein nodes reasonably belong to, where a border protein node is defined as follows.

Definition 5 (Border protein node). Although some protein nodes are partitioned into different functional modules, there are still connections between these nodes that cross different modules at a particular layer. We call the protein nodes involved in connections across different modules border protein nodes.

Generally, a border protein node is likely to be shared by two modules and to become part of their overlap. To reflect the overlapping nature of functional modules, we employ a loop to refine the grain eliminating results at each layer network. The loop looks through each module in the current layer network individually; once there is a connection between the module and a neighboring module, the two border protein nodes involved in the connection are tested. Our testing method is based on the density of the subgraphs of functional modules:

$$ Den(M_l) = \frac{2 |E_l|}{|V_l| (|V_l| - 1)}, \tag{4} $$

where $M_l$ is the $l$th module, $|E_l|$ is the number of edges between nodes in $M_l$, and $|V_l|$ is the number of nodes in $M_l$. The larger the value of $Den(M_l)$, the better the clustering result. Thus, we take the node-sharing gain in the density of functional modules as the criterion to test whether border nodes should be shared between two neighboring modules.
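The density in (4) and a node-sharing check built on it can be sketched as below. The sketch assumes, in line with the test described in the following paragraph, that a border node is shared with a neighboring module only when adding it strictly increases that module's density; the dictionary-of-sets graph representation, the zero density assigned to modules with fewer than two nodes, and the function names are illustrative assumptions.

```python
def module_density(adj, members):
    """Eq. (4): 2|E_l| / (|V_l| (|V_l| - 1)) for the subgraph induced by members."""
    n = len(members)
    if n < 2:
        return 0.0  # assumed convention for degenerate modules
    # Every internal edge is counted once from each endpoint, hence the // 2.
    edges = sum(len(adj[u] & members) for u in members) // 2
    return 2.0 * edges / (n * (n - 1))


def should_share(adj, module, border_node):
    """True if adding border_node to module strictly raises the module density."""
    return module_density(adj, module | {border_node}) > module_density(adj, module)


# Toy network: module {A, B, C} is a path A-B-C; D touches all three, E only A.
toy = {
    "A": {"B", "D", "E"},
    "B": {"A", "C", "D"},
    "C": {"B", "D"},
    "D": {"A", "B", "C"},
    "E": {"A"},
}
module = {"A", "B", "C"}
print(should_share(toy, module, "D"), should_share(toy, module, "E"))
```

In the toy network, D raises the density of the module from 2/3 to 5/6 and would be shared, while E lowers it to 1/2 and would stay outside.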
Let $M_p$ and $M_q$ be two neighboring modules at the current layer, with $i \in V_p$, $j \in V_q$, and a connection between $i$ and $j$; then $i$ and $j$ are border protein nodes for $M_q$ and $M_p$, respectively. The specific testing method for $j$ is as follows. Let $M'_p = M_p \cup \{j\}$ be a new module, and calculate $Den(M_p)$ and $Den(M'_p)$. If $Den(M'_p) > Den(M_p)$, then $j$ should be shared by $M_p$. Otherwise, $j$ should not be shared by $M_p$, and this test is ignored. The testing process is repeated until all modules at the current layer have been passed through.

TABLE 1
Data Sets Used in Our Experiments

Data sets | Http address | Preprocessed data: Proteins | Preprocessed data: Interactions | Layers | Processed data: Proteins | Processed data: Interactions
DIPScere  |  | ,126  | 22, |  | ,803 | 14,183
MIPS      |  | 4,545 | 12, |  | ,160 | 6,683
BioGrid   |  | ,334  | 80, |  | ,817 | 40,920
DIPHsapi  |  | ,086  | 5,  |  | ,172 | 3,264
H-InvDB^1 |  | 5,858 | 27, |  | ,518 | 19,974

^1 The original data contains 9,086 proteins and 31,030 interactions. Limited by the memory of our machine, an extra operation in the preprocessing step removes all nodes whose degree is 1.

3.3 Framework Description and Complexity Analysis

The procedure of the proposed framework consists of initialization, grain partition, module detection, grain eliminating, and output of the detected modules. To reduce the scale of the network quickly, we employ the strategy that the nodes with the largest degree are merged first with their close neighbors. The detailed pseudocode is shown in Algorithm 1.

Algorithm 1. Framework Based on MGM-PPI
Input: Graph $G_0 = G(V, E)$: a PPI network, $|V| = N$;
Output: $M$: the set of modules ($|M| = m$);
1. Initialization:
   Set parameters $\varepsilon_1$, $\varepsilon_2$ and $\mu$;
   /* $\varepsilon_1$: functional similarity threshold */
   /* $\varepsilon_2$: topology similarity threshold */
   /* $\mu$: density gain threshold */
2. Grain partition:
   $k = 0$;
   Do {
      Set all nodes to the unmatched state;
      Compute the degree of all nodes in $G_k$, and sort them in descending order;
      For each unmatched node $i$ in $G_k$ {
         Sort the nodes of $Neigh(i)$ in descending order;
         For each unmatched node $j \in Neigh(i)$ {
            If ($f_{ij} \ge \varepsilon_1$) or ($s_{ij} \ge \varepsilon_2$) then
               Label the pair of nodes as matched and exit the loop;
         }
      }
      $k = k + 1$;
      Merge each pair of matched nodes to form a super node, build the corresponding connections, and construct a graph $G_k$;
   } while ($|den(G_k) - den(G_{k-1})| > \mu$)
   $K = k$; Obtain the coarsest graph $G_K$;
3. Module detection:
   Employ a classic algorithm to detect modules on $G_K$;
   Get the initial modules for $G_K$;
4. Grain eliminating:
   While ($k > 0$) do {
      Restore the node connections of the $(k-1)$th layer network;
      For each connection between two modules
         Refine the border nodes by the judging criterion;
      $k = k - 1$;
   }
5. Output:
   Return the $m$ functional modules of $G_0$.

Based on the description of Algorithm 1, the complexity of the detection framework can be analyzed as follows. Let the maximum node degree in a PPI network be $n$. In the initialization process, the computing complexity is $O(1)$. In the grain partition process, the time complexity is $O(K(N + N \log N + N(n^2 \log n)))$, which is approximately $O(N \log N)$ because $n \ll N$ and $K$ is a very small number compared with $N$. For the module detection process, the time complexity depends on the algorithm used, but the complexity of the initial problem is greatly reduced by grain partition. In the grain eliminating process, the time complexity is $O(K(N/2 + m(m-1)c)) \approx O(K(N + m^2))$, where $c$ is the maximum number of connections between two modules. In the output process, the time complexity is $O(m)$. Thus, the overall complexity mainly depends on the module detection algorithm, since the other processes have lower complexities. Most importantly, the module detection algorithm is applied to the smaller-scale PPI network, so there is great potential to improve the running efficiency.

4 EXPERIMENTAL RESULTS AND DISCUSSION

In this section, we use five large protein-protein interaction datasets to perform our empirical study. Using several evaluation metrics, we assess the effectiveness of the framework and compare test results obtained with some typical algorithms on these PPI datasets. The experimental platform is a PC with a Core 2 2.13 GHz CPU, 2.99 GB RAM, and Windows XP.
4.1 PPI Data Sets and Their Corresponding Number of Layers

We have performed our experiments on five publicly available benchmark PPI datasets, including three yeast datasets (DIPScere , MIPS, BioGrid2014) and two human datasets (DIPHsapi , H-InvDB). Table 1 summarizes the data sets used in our experiments, where the second column gives the web links, the third and fourth columns present the numbers of proteins and interactions after data preprocessing, and the fifth, sixth, and seventh columns present the number of layers determined automatically and the numbers of proteins and interactions after grain partitioning. A cleaning step, which deletes all self-connected and repeated interactions, is performed in data preprocessing. To evaluate the protein modules mined by the algorithms, the set of real functional modules from [39] is selected as the benchmark for yeast. This benchmark set, which consists of 428 protein functional modules, is constructed from three main sources: the MIPS [38], Aloy et al. [40], and the SGD database [41] based on the Gene Ontology notations. For human, we use the set of real functional modules from [42] as the benchmark, which consists of 1,259 functional modules and 3,458 proteins. After removing the modules that do not include any protein in DIPHsapi or H-InvDB, we get two corresponding validation sets of functional modules: the validation set for DIPHsapi comprises 793 functional modules and 2,443 proteins, and the validation set for H-InvDB comprises 901 functional modules and 2,666 proteins.

As mentioned in Section 3.1, to automatically determine the number of layers for any PPI dataset, the relationship between the number of layers and several network characteristics (e.g., density, degree, path, centrality, coefficient, etc.) has been experimentally investigated. From the results, we found that the number of layers and the density of the corresponding graphs are related. Fig. 3 shows the density curves over the number of layers for the five PPI datasets. Based on the automatically-layered coarsening mechanism, we obtain the layer numbers 4, 4, 3, 3, and 4 for the five grain partitionings, respectively.

Fig. 3. The relationships between density and number of layers for five PPI datasets.

4.2 Evaluation Metrics

In this section, we employ two popular sets of measurements [43] to evaluate the quality of the detected modules and the general performance of the detection methods.

4.2.1 Precision, Recall, F-Measure, and Coverage

Many research works use a neighborhood affinity score to assess the degree of matching between the identified functional modules and the real ones. The score $NA(p, b)$ between an identified module $p = (V_p, E_p)$ and a real module $b = (V_b, E_b)$ is defined as:

$$ NA(p, b) = \frac{|V_p \cap V_b|^2}{|V_p| \, |V_b|}. \tag{5} $$

If $NA(p, b) \ge \omega$, then $p$ and $b$ are considered to be matched (generally, $\omega = 0.2$). Let $P$ be the set of functional modules identified by a computational method and $B$ be the real functional module set in the benchmark networks. Then $N_{cp} = |\{ p \mid p \in P, \exists b \in B, NA(p, b) \ge \omega \}|$ and $N_{cb} = |\{ b \mid b \in B, \exists p \in P, NA(p, b) \ge \omega \}|$. Thus, Precision and Recall can be defined as follows [44]:

$$ Precision = \frac{N_{cp}}{|P|}, \tag{6} $$

and

$$ Recall = \frac{N_{cb}}{|B|}. \tag{7} $$

F-measure is the harmonic mean of Precision and Recall, so it can be used to evaluate the overall performance. It is defined as:

$$ F = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}. \tag{8} $$

Moreover, Coverage assesses how many proteins in a PPI network can be clustered into the detected modules by a computational method. It can be defined as follows [45]:

$$ Coverage = \frac{\left| \bigcup_{i=1}^{|P|} V_{p_i} \right|}{|V|}, \tag{9} $$

where $|V| = N$ denotes the size of the PPI network and $V_{p_i}$ is the set of proteins in the $i$th detected module.

4.2.2 Sensitivity, Positive Predictive Value, and Accuracy

Sensitivity ($S_n$), positive predictive value ($PPV$), and accuracy ($Acc$) are also common measures to assess the performance of module detection methods. Let $T_{ij}$ be the number of proteins common to the $i$th ground-truth module and the $j$th identified module. Then $S_n$ and $PPV$ can be defined as [39]:

$$ S_n = \frac{\sum_{i=1}^{|B|} \max_j \{ T_{ij} \}}{\sum_{i=1}^{|B|} N_i}, \tag{10} $$

and

$$ PPV = \frac{\sum_{j=1}^{|P|} \max_i \{ T_{ij} \}}{\sum_{j=1}^{|P|} T_{\cdot j}}, \tag{11} $$

where $N_i$ is the number of proteins in the $i$th benchmark module, and $T_{\cdot j} = \sum_{i=1}^{|B|} T_{ij}$. As a general metric, the accuracy of an identification can be calculated as the geometric mean of $S_n$ and $PPV$:

$$ Acc = (S_n \cdot PPV)^{1/2}. \tag{12} $$

4.3 Effects of Parameters

As a swarm intelligence-based algorithm, NACO-FMD [31] has excellent robustness and good detection accuracy on various PPI data. Thus, we take NACO-FMD as an instance and perform many experiments on DIPScere to determine a better parameter configuration for the framework.
The parameters of NACO-FMD were set as $m_a = 100$ (number of ants), $NI = 20$ (number of iterations), $\alpha = 1.5$, $\beta = 5.0$, and $\delta = 0.8$. In all experiments, the value of a single parameter is changed while the values of the other parameters are kept fixed.
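The six metrics tracked in this parameter study are those of Section 4.2. As a hedged illustration of how (5) to (8) and (10) to (12) could be computed, the sketch below treats every module as a non-empty set of protein names; the function names, the default $\omega = 0.2$, and the zero returned when a denominator vanishes are assumptions of the sketch, not details given in the paper.

```python
from math import sqrt


def na_score(p, b):
    """Eq. (5): neighborhood affinity between two protein sets."""
    return len(p & b) ** 2 / (len(p) * len(b))


def precision_recall_f(predicted, benchmark, omega=0.2):
    """Eqs. (6)-(8): matching-based Precision, Recall and F-measure."""
    n_cp = sum(any(na_score(p, b) >= omega for b in benchmark) for p in predicted)
    n_cb = sum(any(na_score(p, b) >= omega for p in predicted) for b in benchmark)
    precision = n_cp / len(predicted)
    recall = n_cb / len(benchmark)
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


def sn_ppv_acc(predicted, benchmark):
    """Eqs. (10)-(12): Sensitivity, PPV and their geometric mean."""
    # t[i][j] = proteins shared by benchmark module i and predicted module j.
    t = [[len(b & p) for p in predicted] for b in benchmark]
    sn = sum(max(row) for row in t) / sum(len(b) for b in benchmark)
    columns = list(zip(*t))
    col_total = sum(sum(col) for col in columns)
    ppv = sum(max(col) for col in columns) / col_total if col_total else 0.0
    return sn, ppv, sqrt(sn * ppv)


predicted = [{"A", "B", "C"}, {"D", "E"}]
benchmark = [{"A", "B"}, {"E", "F", "G", "H", "I", "J"}]
print(precision_recall_f(predicted, benchmark))
print(sn_ppv_acc(predicted, benchmark))
```

In this toy example only the first predicted module matches a benchmark module (NA = 2/3), so Precision, Recall, and F-measure are all 0.5, while $S_n = 3/8$ and $PPV = 1$.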

TABLE 2
Layer Number and Running Time of NACO-FMD with the New Framework for Different $\mu$, $\varepsilon_1$, and $\varepsilon_2$ Values
Test items (columns): $\mu$ values ($\times$E-4): 0.5, [1.5, 3.5], [4.5, 6.5], 7.5, [8.5, 9.5]; $\varepsilon_1$ values; $\varepsilon_2$ values. Rows: layer number, clustering time (s), grain partition time (s), grain elimination time (s), total running time (s).

Table 2 lists the layer number of the MGM-PPI, the clustering time, the grain partition time, the grain elimination time and the total running time for different $\mu$, $\varepsilon_1$ and $\varepsilon_2$ values, where s denotes seconds. Columns 2 to 6 show that the smaller $\mu$ is, the larger the layer number and the shorter the total running time. Columns 7 to 14 and Columns 15 to 22 show that when $\varepsilon_1$ or $\varepsilon_2$ increases, the number of node fusions becomes smaller, which reduces the layer number and increases the total running time. Fig. 4 gives the effects of the parameters $\mu$, $\varepsilon_1$ and $\varepsilon_2$ on six metrics, where Fig. 4a shows the performance curves for $\mu$ when $\varepsilon_1 = 0.5$, $\varepsilon_2 = 0.3$, Fig. 4b shows the performance curves for $\varepsilon_1$ when $\mu = 2.5$E-04, $\varepsilon_2 = 0.3$, and Fig. 4c shows the performance curves for $\varepsilon_2$ when $\mu = 2.5$E-04, $\varepsilon_1 = 0.5$. In Fig. 4a, although some metrics have different tendencies as $\mu$ increases, the two general metrics (F-measure and accuracy) are relatively stable, which shows that the framework is not sensitive to the parameter $\mu$. In Fig. 4b, as $\varepsilon_1$ increases, the values of precision, recall and F-measure gradually increase until they reach their respective maxima at $\varepsilon_1 = 0.5$ or 0.6 and then slowly decrease. Although the three curves of PPV, accuracy and sensitivity differ somewhat, the accuracy is maintained at around 0.32 when $\varepsilon_1$ … In Fig. 4c, the three metrics precision, recall and F-measure have the same trend: each metric reaches its maximum at $\varepsilon_2 = 0.4$.
The three metrics PPV, accuracy and sensitivity have similar values except at $\varepsilon_2 = 0.2$. To get better quality in a shorter time, we employ $\mu \in [1.5, 3.5]$ E-4, $\varepsilon_1 = 0.5$ and $\varepsilon_2 = 0.3$ in our framework for the DIPScere data. Above all, the value of $\mu$ directly determines the layer number, while $\varepsilon_1$ and $\varepsilon_2$ indirectly influence the layer number by deciding which nodes are fused. Either way, they affect the running time of module detection with the framework.

Fig. 4. The effects of the parameters $\mu$, $\varepsilon_1$, and $\varepsilon_2$ on six evaluation metrics. (a) Six performance curves about $\mu$; (b) Six performance curves about $\varepsilon_1$; (c) Six performance curves about $\varepsilon_2$.

From these results, we can give a simple suggestion for presetting the three parameters. That is, the determination of parameters has to comprehensively consider the

running time and other detection performances. Based on numerous similar experiments, the three parameters are set to $\varepsilon_1 = 0.5$, $\varepsilon_2 = 0.3$, $\mu = 0.00025$ for the three yeast data sets and $\varepsilon_1 = 0.4$, $\varepsilon_2 = 0.4$, $\mu = 0.00015$ for the human data with sparse connections.

TABLE 3
The Performance Comparisons of Several Typical Detection Algorithms on DIP
Columns: Algorithms, Category, Coverage, Precision, Recall, F-measure, Sensitivity, PPV, Accuracy, Time (s). Rows (each algorithm with its multiple-grain variation): CFinder and CFinder* (Density), DPClus and DPClus* (Density), Jerarca and Jerarca* (Hierarchy), RNSC and RNSC* (Partition), MCL and MCL* (Flow), ADMSC and ADMSC* (Spectral), Core and Core* (Core), COACH and COACH* (Core), NACO-FMD and NACO-FMD* (Swarm), HAM-FMD and HAM-FMD* (Swarm).
1 An asterisk refers to the corresponding algorithm using the multiple-grain model. 2 DPClus runs with $d = 0.9$, $cp = 0.4$ and minimum cluster size $= 2$. 3 ADMSC runs with $C = 600$, $\beta = 1.4$. 4 COACH runs with $\omega = 0.$… 5 NACO-FMD runs with $\alpha = 1.5$, $\beta = 5.0$, $\rho = 0.5$, $\delta = 0.3$. 6 HAM-FMD runs with $\alpha = 1.5$, $\beta = 5.0$, $\rho = 0.5$, $\delta = 0.3$, $P_o = 0.2$, $P_c = 0.05$, $P_m = 0.5$.

4.4 Comparative Evaluations

To evaluate the new framework, we employed 10 typical algorithms, i.e., CFinder [14], DPClus [15], Jerarca [18], RNSC [19], MCL [22], ADMSC [26], Core [27], COACH [28], NACO-FMD [31], and HAM-FMD [32], to demonstrate its strengths on seven performance measures in our experiments. Based on the survey in [12], these algorithms cover seven main categories: density-based, hierarchy-based, partition-based, flow simulation-based, spectral clustering-based, core attachment-based, and swarm intelligence-based approaches. We again employed the DIPScere data to perform comprehensive comparisons among the 10 algorithms. Table 3 shows the detailed performance comparisons of these typical detection algorithms on the same DIP data.
For each detection algorithm and its variation with the new framework, we have listed the category, the seven metrics and the running time (seconds). In all experiments, we use the simplest or the best default values for those algorithms that require parameter configuration. It is not difficult to see that applying the new framework effectively reduces the running time for most of the algorithms. For some fast algorithms such as MCL, COACH, CFinder and RNSC, using the new framework has no significant influence on the running time or increases it somewhat. The main reason is that grain partitioning and elimination in the new framework also need extra running time, which may exceed the time saved. For the other algorithms, using the new framework significantly improves the time performance. Most variation algorithms using the new framework achieve good results on many measures, which are better than or comparable with those of the original algorithms. From a technical perspective, the framework is supposed to be independent of detection algorithms. However, merging nodes with high topological similarity and functional similarity in the framework may change the topological structure of the PPI network, and it may also mislead some PPI clustering techniques that rely on neighbourhood analysis. Therefore, using the new framework with some clustering techniques may also degrade some performance measures. To further investigate the computational results of the time-consuming algorithms, we select six of the 10 algorithms, namely NACO-FMD, HAM-FMD, Core, Jerarca, DPClus and ADMSC, to carry out a large number of experiments. NACO-FMD is an ant colony intelligence-based algorithm for detecting functional modules in a PPI network, which combines topological characteristics with functional information in the ant colony optimization process [31].
HAM-FMD is a hybrid approach which employs ant colony optimization and multi-agent evolution to detect functional modules in PPI networks [32]. Essentially, NACO-FMD and HAM-FMD are two swarm intelligence-based algorithms.

TABLE 4
The Results of NACO-FMD, HAM-FMD, and Core Algorithms on Five Different Data Sets
Columns: NACO-FMD, HAM-FMD and Core, each with an original and a variation result. Rows, for each data set (DIPScere, MIPS, BioGrid2014, DIPHsapi, H-InvDB): number of clusters, size of average module, $N_{cp}$, $N_{cb}$.

Core is a core attachment-based algorithm [27], which successively predicts core proteins, identifies their attachment proteins, and forms functional modules in a PPI network. Jerarca is a hierarchy-based algorithm, which first computes weighted distances and then employs a neighbor-joining algorithm to build dendrograms [18]. ADMSC is a spectral clustering-based algorithm that analytically solves the cluster structure of PPI networks as a problem of random walks in the diffusion process [26]. DPClus is a density-based algorithm, which uses a cluster periphery-tracking mechanism to generate clusters [15]. Tables 4 and 5 show the detailed comparative results of these algorithms on the five different data sets, where a variation algorithm refers to the corresponding algorithm with the multiple-grain framework.
TABLE 5
The Results of Jerarca, DPClus, and ADMSC Algorithms on Five Different Data Sets
Columns: Jerarca, DPClus and ADMSC, each with an original and a variation result. Rows, for each data set (DIPScere, MIPS, BioGrid2014, DIPHsapi, H-InvDB): number of clusters, size of average module, $N_{cp}$, $N_{cb}$.

For each detection algorithm, we have listed the number of clusters detected (number of clusters), the average number of proteins in each cluster (size of average module), the number of

detected modules which match at least one real module ($N_{cp}$) and the number of real modules that match at least one detected module ($N_{cb}$). Taking NACO-FMD on the DIPScere data as an example, the original algorithm has detected 571 modules, of which 127 match 212 real modules, while the variation algorithm has detected 545 modules, of which 139 match 202 real modules. From the two tables, we observe that using the new framework can produce different results for various algorithms.

Fig. 5. A comparison of coverage for six algorithms and their corresponding variations on five datasets.
Fig. 6. A comparison of precision for six algorithms and their corresponding variations on five datasets.
Fig. 7. A comparison of recall for six algorithms and their corresponding variations on five datasets.
Fig. 8. A comparison of F-measure for six algorithms and their corresponding variations on five datasets.
Fig. 9. A comparison of sensitivity for six algorithms and their corresponding variations on five datasets.
Fig. 10. A comparison of PPV for six algorithms and their corresponding variations on five datasets.

Figs. 5, 6, 7, 8, 9, 10, and 11 show the overall comparison results of these methods and their corresponding variations (marked with an asterisk) in terms of seven metrics on the five data sets. From Fig. 5, we find that applying our framework sometimes increases and sometimes decreases the coverage value, which means that the change of coverage depends not only on the detection algorithm but also on the testing data. The main reason is that the new framework performs many fusions while building the MGM-PPI, which gives some algorithms a chance to get back missing nodes.
Thus, algorithms with lower Coverage values ($\le 0.5$) can increase their Coverage values on some data, such as Core on the DIPScere, MIPS and DIPHsapi data, and NACO-FMD on the DIPHsapi data. From Fig. 6, one can easily see that all algorithms with the multiple-grain framework achieve better precision on the DIPScere data. However, for the other four data sets, some algorithms increase or keep their precision values while others decrease them. By experimental analysis, we find that if the average node degree of the MGM-PPI obtained is close to that of the initial graph,

then using the new framework can improve the precision performance.

Fig. 11. A comparison of accuracy for six algorithms and their corresponding variations on the datasets.

As shown in Fig. 7, applying the new framework makes the recall of the six algorithms decline in general; on the MIPS data in particular, all algorithms obtain worse recall values. A main reason is that the two extra processes (grain partitioning and grain eliminating) on MIPS reduce $N_{cb}$. Of course, several algorithms still show better recall, e.g., HAM-FMD and Jerarca on DIPScere, HAM-FMD on BioGrid2014, NACO-FMD on DIPHsapi, and NACO-FMD and HAM-FMD on H-InvDB. Fig. 8 gives the F-measure comparisons, which combine the precision and recall performances in each case. It is obvious that applying our framework increases the F-measure of the six algorithms on DIPScere because the corresponding precision increases. However, for the other four data sets, the new framework generally tends to lower the F-measure, apart from several increases (NACO-FMD on DIPHsapi and H-InvDB, HAM-FMD on H-InvDB) and several unchanged values (DPClus on MIPS, NACO-FMD, HAM-FMD and ADMSC on BioGrid2014, and HAM-FMD and ADMSC on DIPHsapi). Because the corresponding recall decreases on MIPS, the five variation algorithms other than DPClus obtain a worse F-measure. Moreover, since H-InvDB is a large sparse human data set, some variation algorithms (DPClus, Core, Jerarca, ADMSC) obtain poor precision and recall, so their F-measure values decrease significantly. As shown in Fig. 9, applying the new framework results in some irregular changes in sensitivity values. For instance, DPClus increases its sensitivity values on the four data sets other than DIPHsapi, while Jerarca decreases them on all testing data.
For some data sets, the new framework can help some algorithms to merge new proteins into some real functional modules, which may make the ratio of proteins covered by the predicted modules either decline or increase. The sensitivity performance depends not only on the changes in the topological characteristics of the graph, but also on the graph clustering mechanisms. From Fig. 10, one can easily see that the PPV values of these algorithms decrease slightly in many cases. The main reason is that the new framework has the ability to detect overlapping modules, which can cause the denominator of Eq. (11) to increase. Fig. 11 shows the comparison of accuracy values for the six algorithms and their corresponding variations on the five data sets. Since using the new framework makes most algorithms obtain worse PPV values, the corresponding accuracy values also generally decline. However, some algorithms get better PPV or sensitivity values on some data, so their accuracy values can be improved by employing the new framework, such as Core on DIPScere and BioGrid2014, NACO-FMD on DIPHsapi and H-InvDB, and HAM-FMD on H-InvDB. From the above results on seven measures, we conclude that applying the new framework to a detection algorithm on some data sets may improve some performance measures (e.g., coverage, precision and F-measure) and is also likely to worsen others. The two main factors behind this are the topological changes caused by the new framework and the clustering mechanisms employed. On the whole, however, the introduction of the new framework roughly maintains the competitive performance of the original algorithms.
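The effect of overlapping modules on the denominator of Eq. (11) can be made concrete with a small Python sketch of $S_n$, $PPV$ and $Acc$ (Eqs. (10) to (12)). The function names and the toy modules below are hypothetical and are not taken from the experiments; the sketch only illustrates, under these assumptions, how predicted modules that overlap and span two benchmark modules enlarge $\sum_j T_{\cdot j}$ without enlarging the numerator.

```python
import math
from typing import List, Set, Tuple

Module = Set[str]


def sn_ppv_acc(benchmark: List[Module],
               predicted: List[Module]) -> Tuple[float, float, float]:
    """Sensitivity, PPV and accuracy of Eqs. (10), (11) and (12)."""
    if not benchmark or not predicted:
        return 0.0, 0.0, 0.0
    # T[i][j]: proteins shared by benchmark module i and predicted module j.
    t = [[len(b & p) for p in predicted] for b in benchmark]
    n_pred = len(predicted)

    sn_num = sum(max(row) for row in t)           # sum_i max_j T_ij
    sn_den = sum(len(b) for b in benchmark)       # sum_i N_i
    ppv_num = sum(max(row[j] for row in t) for j in range(n_pred))  # sum_j max_i T_ij
    ppv_den = sum(sum(row) for row in t)          # sum_j T_.j

    sn = sn_num / sn_den if sn_den else 0.0
    ppv = ppv_num / ppv_den if ppv_den else 0.0
    return sn, ppv, math.sqrt(sn * ppv)


if __name__ == "__main__":
    real = [{"a", "b", "c"}, {"d", "e", "f"}]
    # Disjoint prediction that reproduces the benchmark exactly.
    disjoint = [{"a", "b", "c"}, {"d", "e", "f"}]
    # Overlapping prediction: each module also absorbs a protein of the other.
    overlapping = [{"a", "b", "c", "d"}, {"c", "d", "e", "f"}]
    print(sn_ppv_acc(real, disjoint))
    print(sn_ppv_acc(real, overlapping))
```

With the overlapping prediction, sensitivity is unchanged but each predicted module now shares one protein with the other benchmark module, so the PPV denominator grows from 6 to 8 and the accuracy falls with it.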
TABLE 6
Time Comparisons for Six Algorithms and Their Corresponding Variations on the Five Datasets
Columns: NACO-FMD, HAM-FMD, Core, Jerarca, DPClus, ADMSC. Rows, for each data set (DIPScere, MIPS, BioGrid2014, DIPHsapi, H-InvDB): original time (s), multiple-grain time (s), upgrade rate (%).

To explicitly reveal the significance of the new framework, we further performed the time performance comparison between six algorithms and their corresponding

variations on the five datasets. The experimental results are shown in Table 6, where the original and multiple-grain rows denote the running time of the original algorithms and of the same algorithms with the multiple-grain framework, respectively, and the upgrade rate row shows the degree to which the new framework shortens the running time. From these results, we find that employing the new framework greatly reduces the running time of the six algorithms in all cases: the minimum upgrade rate is 54.9 percent, the maximum upgrade rate is 94.5 percent, and the maximum time saved is 100,828 seconds (Core on H-InvDB). We have systematically performed comparisons among the NACO-FMD, HAM-FMD, Core, Jerarca, DPClus and ADMSC algorithms and their corresponding variations with the new framework in terms of various metrics. The outstanding time performance of the new framework on the five data sets demonstrates that the proposed method can dramatically increase the time efficiency of module detection algorithms for various PPI data while keeping other performance measures competitive. Thus, the new framework has great potential for detecting functional modules in large-scale PPI networks.

5 CONCLUSIONS

How to efficiently identify functional modules by means of novel computational approaches is still a vital scientific problem in computational biology. In the era of big data, people have more opportunities to obtain large-scale PPI networks. Therefore, developing efficient and robust ways to detect functional modules in such large-scale PPI networks has become a new challenge. In this paper, we proposed a novel framework to accelerate the process of detecting functional modules from large-scale PPI networks. First, we define a multiple-grain model of a PPI network by which the scale of a PPI network can be reduced.
Then, we employ a functional or structural similarity to design a protein grain partitioning method. Finally, we apply a refining mechanism with border node tests to handle the protein overlapping of different modules during the grain eliminating process. Systematic experiments with six algorithms on five datasets show that the new framework not only significantly reduces the running time of module detection algorithms, but also effectively identifies overlapping modules while keeping some performance measures competitive. Thus the proposed framework is well suited to detecting functional modules in large-scale PPI networks. Our future work includes investigating the relationships among the framework, algorithm and data characteristics to further design a better algorithm with high performance in various aspects.

ACKNOWLEDGMENTS

This work is partly supported by NSFC Research Program ( and ), National 973 Key Basic Research Program of China (2014CB744601), Specialized Research Fund for the Doctoral Program of Higher Education ( ), and the Beijing Municipal Education Research Plan key project (Beijing Municipal Fund Class B) (KZ ).

REFERENCES

[1] S. D. Patterson and R. H. Aebersold, Proteomics: The first decade and beyond, Nature Genetics, vol. 33, pp , [2] A. D. Zhang, Protein Interaction Networks: Computational Analysis. Cambridge, U.K.: Cambridge Univ. Press, [3] M. N. J. Seaman, Recycle your receptors with retromer, Trends Cell Biol., vol. 15, no. 2, pp , [4] D. O. Morgan, The Cell Cycle: Principles of Control. London, U.K.: New Science Press, [5] V. Spirin and L. A. Mirny, Protein complexes and functional modules in molecular networks, in Proc. Nat. Acad. Sci., vol. 100, no. 21, pp , [6] B. L. Chen and F. X. Wu, Identifying protein complexes based on multiple topological structures in PPI networks, IEEE Trans. Nanobiosci., vol. 12, no. 3, pp , Sep [7] B. L. Chen, W. W. Fan, J. Liu, and F. X.
Wu, Identifying protein complexes and functional modules: From static PPI networks to dynamic PPI networks, Brief. Bioinform., vol. 15, no. 2, pp , [8] X. L. Li, M. Wu, C. K. Kwoh, and S. K. Ng, Computational approaches for detecting protein complexes from protein interaction networks: A survey, BMC Genomics, vol. 11, suppl 1, p. S3, [9] G. Rigaut, A. Shevchenko, B. Rutz, M. Wilm, and M. Mann, A generic protein purification method for protein complex characterization and proteome exploration, Nature Biotechnol., vol. 17, no. 10, pp , [10] A. C. Gavin, M. Boesche, R. Krause, et al., Functional organization of the yeast proteome by systematic analysis of protein complexes, Nature, vol. 415, no. 6868, pp , [11] K. Tarassov, V. Messier, C. R. Landry, S. Radinovic, M. M. Molina, and I. Shames, An in vivo map of the yeast protein interactome, Science, vol. 320, no. 5882, pp , [12] J. Z. Ji, A. D. Zhang, C. N. Liu, X. M. Quan, and Z. J. Liu, Survey: Functional module detection from protein-protein interaction networks, IEEE Trans. Knowl Data Eng., vol. 26, no. 2, pp , Feb [13] G. D. Bader and C. W. Hogue, An automated method for finding molecular complexes in large protein interaction networks, BMC Bioinform., vol. 4, no. 1, p. 2, [14] B. Adamcsek, G. Palla, I. J. Farkas, I. Derenyi, and T. Vicsek, CFinder: Locating cliques and overlapping modules in biological networks, Bioinformatics, vol. 22, no. 8, pp , [15] M. Altaf-UI-Amin, Y. Shinbo, K. Mihara, K. Kurokawa, and S. Kanaya, Development and implementation of an algorithm for detection of protein complexes in large interaction networks, BMC Bioinform., vol. 7, no. 1, p. 207, [16] E. Ravasz, A. L. Somera, D. A. Mongru, Z. N. Oltvai, and A. L. Barabasi, Hierarchical organization of modularity in metabolic networks, Science, vol. 297, no. 5586, pp , [17] V. Arnau, S. Mars, and I. Marin, Iterative cluster analysis of protein interaction data, Bioinformatics, vol. 21, no. 3, pp , [18] R. Aldecoa and I. 
Marin, Jerarca: Efficient analysis of complex networks using hierarchical clustering, PLoS ONE, vol. 5, no. 7, p. e11585, [19] A. D. King, N. Przulj, and I. Jurisica, Protein complex prediction via cost-based clustering, Bioinformatics, vol. 20, no. 17, pp , [20] B. J. Frey and D. Dueck, Clustering by passing messages between data points, Science, vol. 15, no. 5814, pp , [21] A. Abdullah, S. Deris, S. Z. M. Hashim, and H. M. Jamil, Graph partitioning method for functional module detections of protein interaction network, in Proc. Int. Conf. Comput. Technol. Develop., 2009, pp [22] A. J. Enright, S. Van Dongen, and C. A. Ouzounis, An efficient algorithm for large-scale detection of protein families, Nucleic Acids Res., vol. 30, no. 7, pp , [23] J. B. Pereira-Leal, A. J. Enright, and C. A. Ouzounis, Detection of functional modules from protein interaction networks, Proteins, vol. 54, pp , [24] Y. R. Cho, W. Hwang, M. Ramanathan, and A. D. Zhang, Semantic integration to identify overlapping functional modules in protein interaction networks, BMC Bioinform., vol. 8, no. 1, p. 265, [25] W. Hwang, Y. R. Cho, A. D. Zhang, and M. Ramanathan, CASCADE: A novel quasi all paths-based network analysis algorithm for clustering biological interactions, BMC Bioinform., vol. 9, no. 1, p. 64, 2008.

[26] K. Inoue, W. Li, and H. Kurata, Diffusion model based spectral clustering for protein-protein interaction networks, PLoS ONE, vol. 5, no. 9, p. e12623, [27] H. C. Leung, S. M. Yiu, Q. Xiang, and F. Y. Chin, Predicting protein complexes from PPI data: A core-attachment approach, J. Comput. Bio., vol. 16, no. 2, pp , [28] M. Wu, X. L. Li, C. K. Kwoh, and S. K. Ng, A core-attachment based method to detect protein complexes in PPI networks, BMC Bioinform., vol. 10, no. 1, p. 169, [29] J. Sallim, R. Abdullah, and A. T. Khader, ACOPIN: An ACO Algorithm with TSP approach for clustering proteins from protein interaction network, in Proc. 2nd UKSIM Eur. Symp. Comput. Model. Simul., 2008, pp [30] S. Wu, X. J. Lei, and J. F. Tian, Clustering PPI network based on functional flow model through artificial bee colony algorithm, in Proc. 7th Int. Conf. Natural Comput., 2011, pp [31] J. Z. Ji, Z. J. Liu, A. D. Zhang, L. Jiao, and C. N. Liu, Improved ant colony optimization for detecting functional modules in protein-protein interaction networks, in Proc. 3rd Int. Conf. Inform. Comput. Appl., 2012, pp [32] J. Z. Ji, Z. J. Liu, A. D. Zhang, C. C. Yang, and C. N. Liu, HAM-FMD: Mining functional modules in protein-protein interaction networks using ant colony optimization and multi-agent evolution, Neurocomputing, vol. 121, pp , [33] Y. R. Cho, W. Hwang, and A. D. Zhang, Efficient modularization of weighted protein interaction networks using k-hop graph reduction, in Proc. 6th IEEE Symp. Bioinform. Bioeng., 2006, pp [34] S. Oliveira and S. C. Seok, Multilevel approaches for large-scale proteomic networks, Int. J. Comput. Math., vol. 84, no. 5, pp , [35] S. Oliveira and S. C. Seok, A matrix-based multilevel approach to identify functional protein modules, Int. J. Bioinform. Res. Appl., vol. 4, no. 1, pp , [36] A. Schlicker and M.
Albrecht, FunSimMat: A comprehensive functional similarity database, Nucleic Acids Res., vol. 36, pp. D434–D439, [37] C. Brun, C. Herrmann, and A. Guenoche, Clustering proteins from interaction networks for the prediction of cellular functions, BMC Bioinform., vol. 5, no. 1, p. 95, [38] H. W. Mewes, et al., MIPS: Analysis and annotation of proteins from whole genomes, Nucleic Acids Res., vol. 32, no. Database issue, pp. D41–D44, [39] C. C. Friedel, J. Krumsiek, and R. Zimmer, Bootstrapping the interactome: Unsupervised identification of protein complexes in yeast, Res. Comput. Molecular Bio., vol. 4955, pp. 3–16, [40] P. Aloy, et al., Structure-based assembly of protein complexes in yeast, Science, vol. 303, no. 5666, pp , [41] S. S. Dwight, et al., Saccharomyces genome database provides secondary gene annotation using the gene ontology, Nucleic Acids Res., vol. 30, no. 1, pp , [42] T. Junichi, et al., H-InvDB in 2013: An omics study platform for human functional gene and transcript discovery, Nucleic Acids Res., vol. 41, no. D1, pp. D915–D919, [43] X. L. Li, M. Wu, C. K. Kwoh, and S. K. Ng, Computational approaches for detecting protein complexes from protein interaction networks: A survey, BMC Genomics, vol. 11, no. suppl. 1, p. S3, [44] H. N. Chua, K. Ning, W. K. Sung, H. W. Leong, and L. Wong, Using indirect protein-protein interactions for protein complex prediction, in Proc. CSB, 2007, pp [45] W. Hwang, Y. R. Cho, A. D. Zhang, and M. Ramanathan, CASCADE: A novel quasi all paths-based network analysis algorithm for clustering biological interactions, BMC Bioinform., vol. 9, no. 1, p. 64,

Junzhong Ji received the PhD degree in computer science and application technology from the Beijing University of Technology. He is a professor and supervisor at the Computer Science College, Beijing University of Technology, Youth Skeleton teacher in Beijing, and senior member of the China Computer Federation.
He was a visiting scholar at the Norwegian University and State University of New York at Buffalo, respectively. His research interests include data mining, machine learning, swarm intelligence, and bioinformatics. He has published over 80 papers in these areas.

Jiawei Lv is currently working toward the master's degree at the Beijing University of Technology. His main research interests include computational biology and data mining.

Cuicui Yang is currently working toward the PhD degree in computer science at the Beijing University of Technology. Her main interests include artificial intelligence and data mining.

Aidong Zhang is UB distinguished professor and the chair in the Department of Computer Science and Engineering, State University of New York at Buffalo. Her research interests include bioinformatics, data mining, multimedia and database systems, and content-based image retrieval. She is an author of over 200 research publications in these areas. She has chaired or served on over 100 program committees of international conferences and workshops, and currently serves on several journal editorial boards. She has published two books: Protein Interaction Networks: Computational Analysis (Cambridge University Press, 2009) and Advanced Analysis of Gene Expression Microarray Data (World Scientific Publishing Co., Inc. 2006). She received the US National Science Foundation CAREER award and the State University of New York (SUNY) Chancellor's Research Recognition award. She is a fellow of the IEEE.


More information

CHAPTER 6 PERCEPTUAL ORGANIZATION BASED ON TEMPORAL DYNAMICS

CHAPTER 6 PERCEPTUAL ORGANIZATION BASED ON TEMPORAL DYNAMICS CHAPTER 6 PERCEPTUAL ORGANIZATION BASED ON TEMPORAL DYNAMICS This chapter presents a computational model for perceptual organization. A figure-ground segregation network is proposed based on a novel boundary

More information

Web Data mining-a Research area in Web usage mining

Web Data mining-a Research area in Web usage mining IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661, p- ISSN: 2278-8727Volume 13, Issue 1 (Jul. - Aug. 2013), PP 22-26 Web Data mining-a Research area in Web usage mining 1 V.S.Thiyagarajan,

More information

APPLICATION OF MULTIPLE RANDOM CENTROID (MRC) BASED K-MEANS CLUSTERING ALGORITHM IN INSURANCE A REVIEW ARTICLE

APPLICATION OF MULTIPLE RANDOM CENTROID (MRC) BASED K-MEANS CLUSTERING ALGORITHM IN INSURANCE A REVIEW ARTICLE APPLICATION OF MULTIPLE RANDOM CENTROID (MRC) BASED K-MEANS CLUSTERING ALGORITHM IN INSURANCE A REVIEW ARTICLE Sundari NallamReddy, Samarandra Behera, Sanjeev Karadagi, Dr. Anantha Desik ABSTRACT: Tata

More information

SDC: A Distributed Clustering Protocol for Peer-to-Peer Networks

SDC: A Distributed Clustering Protocol for Peer-to-Peer Networks SDC: A Distributed Clustering Protocol for Peer-to-Peer Networks Yan Li 1, Li Lao 2, and Jun-Hong Cui 1 1 Computer Science & Engineering Dept., University of Connecticut, CT 06029 2 Computer Science Dept.,

More information

Hierarchical Multi level Approach to graph clustering

Hierarchical Multi level Approach to graph clustering Hierarchical Multi level Approach to graph clustering by: Neda Shahidi neda@cs.utexas.edu Cesar mantilla, cesar.mantilla@mail.utexas.edu Advisor: Dr. Inderjit Dhillon Introduction Data sets can be presented

More information

Image-Space-Parallel Direct Volume Rendering on a Cluster of PCs

Image-Space-Parallel Direct Volume Rendering on a Cluster of PCs Image-Space-Parallel Direct Volume Rendering on a Cluster of PCs B. Barla Cambazoglu and Cevdet Aykanat Bilkent University, Department of Computer Engineering, 06800, Ankara, Turkey {berkant,aykanat}@cs.bilkent.edu.tr

More information

Improving Image Segmentation Quality Via Graph Theory

Improving Image Segmentation Quality Via Graph Theory International Symposium on Computers & Informatics (ISCI 05) Improving Image Segmentation Quality Via Graph Theory Xiangxiang Li, Songhao Zhu School of Automatic, Nanjing University of Post and Telecommunications,

More information

Identification of Functional Modules in Protein Interaction Networks

Identification of Functional Modules in Protein Interaction Networks Seminar Spring 2009 Identification of Functional Modules in Protein Interaction Networks Lei Shi Department of Computer Science and Engineering State University of New York at Buffalo Protein-Protein Interaction

More information
