Detecting Communities in K-Partite K-Uniform (Hyper)Networks


Liu X, Murata T. Detecting communities in K-partite K-uniform (hyper)networks. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY 6(5): Sept. 0. DOI 0.007/s

Xin Liu and Tsuyoshi Murata, Member, ACM, IEEE

Department of Computer Science, Tokyo Institute of Technology, Tokyo 5-855, Japan

tsinllew@ai.cs.titech.ac.jp; murata@cs.titech.ac.jp

Received October, 00; revised July 4, 0.

Abstract

In social tagging systems such as Delicious and Flickr, users collaboratively manage tags to annotate resources. Naturally, a social tagging system can be modeled as a (user, tag, resource) hypernetwork, where there are three different types of nodes, namely users, resources and tags, and each hyperedge has three end nodes, connecting a user, a resource and a tag that the user employs to annotate the resource. How, then, can we automatically cluster related users, resources and tags, respectively? This is a problem of community detection in a 3-partite, 3-uniform hypernetwork. More generally, given a K-partite K-uniform (hyper)network, where each (hyper)edge is a K-tuple composed of nodes of K different types, how can we automatically detect communities for nodes of different types? In this paper, by turning this problem into a problem of finding an efficient compression of the (hyper)network's structure, we propose a quality function for measuring the goodness of partitions of a K-partite K-uniform (hyper)network into communities, and develop a fast community detection method based on optimization. Our method overcomes the limitations of state-of-the-art techniques and has several desirable properties: it is comprehensive, parameter-free, and scalable. We compare our method with existing methods on both synthetic and real-world datasets.
Keywords: community detection, bipartite graph, tripartite hypergraph, clustering, social tagging

1 Introduction

Networks are appropriate models for studying the structures and dynamics of real-world systems, where nodes represent the fundamental entities of a system and edges represent the relationships between entities. One crucial step when conducting such a study is to detect communities: groups of related nodes that correspond to functional subunits of the underlying system such as protein complexes and social spheres. Community detection in unipartite networks has been extensively investigated [-5].

Fig.1. (a) A unipartite network. (b) ⟨2, 2⟩-network. (c) ⟨3, 3⟩-hypernetwork, where the 3-way hyperedges are represented as curved lines.

Regular Paper. This work was supported in part by JSPS Grant-in-Aid under Grant No and IBM Ph.D. Fellowship.

Footnote: In unipartite networks, there is no restriction on edges, i.e., an edge can connect any two nodes. Unipartite networks are considered suited to modeling homogeneous systems composed of entities of a single type, e.g., a social network where nodes represent individuals and edges represent the friendship between two individuals (see Fig.1(a)).

Footnote: Since edges have more than two end nodes when K > 2, they are actually hyperedges. Correspondingly, the networks are hypernetworks.

0 Springer Science + Business Media, LLC & Science Press, China

There are many heterogeneous systems that can be modeled as K-partite K-uniform (hyper)networks, or simply ⟨K, K⟩-(hyper)networks, where the nodes can be divided into K disjoint sets and each (hyper)edge has K end nodes, one in each set. For example, the DBLP computer science bibliography can be modeled


as an author-paper network, where edges between authors and papers represent the authoring relationship. This is a case of ⟨2, 2⟩-networks, often referred to as bipartite networks (see Fig.1(b)). In social tagging systems, users collaboratively manage tags to annotate resources. The tagging relationship involves three entities of different types: a user, a resource and a tag that the user employs to annotate the resource. A social tagging system can be naturally modeled as a (user, tag, resource) hypernetwork. This is a case of ⟨3, 3⟩-hypernetworks (see Fig.1(c)).

As for community detection in ⟨K, K⟩-(hyper)networks, a common strategy is to reduce a ⟨K, K⟩-(hyper)network to simpler unipartite networks [6-7], ⟨2, 2⟩-networks [8], or ⟨K, 2⟩-networks [9]. One major drawback of this strategy is that some valuable information of the original (hyper)network is lost during reduction [0], so the subsequently detected communities are less accurate. Researchers have also proposed extended modularity optimization [-5] and tensor decomposition [6] methods. But these two methods are designed to detect communities of one-to-one correspondence, as shown in Fig.2(a). Real-world heterogeneous systems are often more complex than that. For example: 1) in the DBLP computer science bibliography, a research group may publish papers on data mining and also on natural language processing; 2) in social tagging systems, a group of users may be interested in resources about programming and also enjoy resources about sports news; a collection of resources about flower photos may be annotated by some users with tags like "flower", "beautiful", "nature", and also annotated by some other users with semantically different tags like "canon", "macro", "00 mm lens" (because they are photography enthusiasts). Hence, communities of many-to-many correspondence, as shown in Fig.2(b), are practically more significant.
To the best of our knowledge, no existing method is able to detect communities of many-to-many correspondence. Besides, another disadvantage of some existing methods [9,6-0] is that they require experimenters to specify certain parameters such as the numbers of communities. In practice, such a priori knowledge is difficult to obtain.

Fig.2. (a) Communities of one-to-one correspondence. (b) Communities of many-to-many correspondence in a ⟨3, 3⟩-hypernetwork.

Rosvall and Bergstrom recently proposed an information compression method for community detection in unipartite networks []. The main insight is to convert the community detection problem to a problem of finding an efficient compression of the network's structure. In this paper, we extend their idea and propose a framework to address the problem of community detection in ⟨K, K⟩-(hyper)networks. Specifically, we propose a quality function for measuring the goodness of partitions of a ⟨K, K⟩-(hyper)network into communities, and develop a fast algorithm for optimizing the quality function. Our method overcomes the limitations of existing methods and has the following key properties.

Comprehensive: it is able to handle broad families of ⟨K, K⟩-(hyper)networks, and handles both communities of one-to-one correspondence and communities of many-to-many correspondence.
Parameter-free: it can automatically detect communities, without any a priori knowledge such as the numbers of communities.
Accurate: it is more sensitive than previous methods.
Scalable: it is fast and scalable to large-scale data.

The rest of the paper is organized as follows. Section 2 reviews related research. Section 3 formulates the problem of community detection in ⟨K, K⟩-(hyper)networks. We introduce our method, including the quality function and the optimization algorithm, in Section 4, with attendant notes on implementation in the Appendix. Section 5 presents experimental results, followed by a conclusion in Section 6.
Footnote: ⟨K, 2⟩-networks are networks whose nodes can be divided into K disjoint sets and in which each edge can only connect two nodes in different sets. When K = 2, ⟨K, K⟩-network and ⟨K, 2⟩-network indicate the same thing, namely the bipartite network. When K > 2, a ⟨K, K⟩-hypernetwork differs fundamentally from a ⟨K, 2⟩-network, since the building block of the former is the hyperedge connecting K end nodes, while the building block of the latter is the pairwise edge connecting two end nodes.

Footnote: A community is said to have correspondence with another community if there are dense connections between them.

Footnote: In a community detection problem, the numbers of communities are not specified by experimenters but should be found by the method itself [6,]. Thus, the classic clustering and graph partitioning approaches, which require experimenters to specify the numbers of communities, are not qualified.

2 Related Work

Here we discuss related work from four areas:

unipartite network, ⟨2, 2⟩-network, ⟨K, 2⟩-network (K > 2), and ⟨K, K⟩-hypernetwork (K > 2).

2.1 Unipartite Network

The study of community detection in unipartite networks has a long history. It is closely related to graph partitioning [3] in graph theory and computer science, and hierarchical clustering [4] in sociology. In recent years, this study has attracted a great deal of interest, especially in the realm of statistical physics [-5]. Particular attention should be paid to modularity [5-6], a quality function for measuring the goodness of partitions of a unipartite network into communities. It is based on the idea that a random network is not expected to have community structure, so the possible existence of communities in a given network is revealed by comparing certain quantities in the given network with those in an appropriate random network. Although modularity suffers from the resolution limit [7] attributed to the existence of multiscale community structure [8], it has been widely used. Modularity optimization [9-37] is perhaps the most popular method for community detection in unipartite networks, partly by virtue of its parameter-free nature.

2.2 ⟨2, 2⟩-Network

Due to the success of modularity optimization, some researchers extended it to ⟨2, 2⟩-networks [-3]. In their extended modularity formulas, a community seeks only one community of another node type, with dense connections between them. Hence, one major drawback of these methods is that they are biased towards communities of one-to-one correspondence, and cannot handle communities of many-to-many correspondence. In parallel, Guimerà et al. proposed another extended modularity optimization method [6], which focuses on community detection in one node set at a time. Their method essentially amounts to carrying out modularity optimization in a unipartite network reduced from the original ⟨2, 2⟩-network.
Thus, as mentioned before, it suffers from the accuracy problem.

There is much research on co-clustering objects of two types. The information-theoretic co-clustering algorithm [7] is among the first to address this problem. Follow-up work includes [8-0]. However, all of these methods require a priori knowledge such as the numbers of clusters.

2.3 ⟨K, 2⟩-Network (K > 2)

Community detection in ⟨K, 2⟩-networks is related to the problem of co-clustering objects of multiple types. Successful methods include consistent bipartite graph co-partitioning [38], high order co-clustering [39], spectral relational clustering [40], graph approximation [4], and collective matrix factorization [4-43]. All of these methods need experimenters to specify certain parameters.

2.4 ⟨K, K⟩-Hypernetwork (K > 2)

A common strategy for processing a ⟨K, K⟩-hypernetwork is to reduce it to simpler networks. Take a (user, tag, resource) hypernetwork as an example. Zlatić's method [7] reduces the hypernetwork to three unipartite networks, for users, tags and resources respectively, based on a node similarity measure, and then uses a bottom-up optimization algorithm in each of the unipartite networks. Neubauer's method [8] reduces the hypernetwork to a user-tag ⟨2, 2⟩-network, a user-resource ⟨2, 2⟩-network and a tag-resource ⟨2, 2⟩-network to formulate an extended modularity, and then applies a modularity optimization algorithm. Lu's method [9] decomposes each 3-way (user, tag, resource) hyperedge into pairwise user-tag, user-resource, and tag-resource edges to build a user-tag-resource ⟨3, 2⟩-network, and then employs the K-means algorithm. One major drawback of this class of methods is that some valuable information of the original hypernetwork is lost during reduction [0], so the subsequently detected communities are less accurate.

Murata extended modularity optimization to ⟨K, K⟩-hypernetworks [4-5].
In his extended modularity formula, a community seeks only one community of another node type, with dense connections between them. Hence, this method is biased towards communities of one-to-one correspondence. Lin et al. presented an approach for detecting communities in rich media social networks by consistent decomposition of multiple tensors [6], which also applies to the community detection problem in ⟨K, K⟩-hypernetworks. However, they only consider the same numbers of communities in different node sets (due to the restriction on dimensions of the decomposed factor matrices), implying one-to-one correspondence between communities. Another weakness is that these numbers have to be specified by experimenters in advance.

There is also research towards understanding the properties and structures of the ⟨3, 3⟩-hypernetwork model for social tagging systems. Zlatić et al. discussed hyperedge distributions, node similarity, and correlations [7]. Cattuto et al. studied path length, clustering coefficient, and node connectivity [44]. Halpin et al. focused on the dynamics of tag distributions [45].

3 Problem Formulation

One fundamental issue is the definition of

community in a ⟨K, K⟩-(hyper)network. Generally, a community should be a group of related nodes that correspond to a functional subunit of the underlying real-world system. In unipartite networks, a community is often understood as a group of nodes with dense connections between them. But this notion is not suitable for ⟨K, K⟩-(hyper)networks, since nodes of the same type are not connected. Instead, we consider a ⟨K, K⟩-(hyper)network community as a group of "parallel" nodes (of the same type), i.e., nodes that are similar to one another as regards their relations to nodes of other types. In other words, "parallel" nodes connect to other nodes in similar ways [46-47]. This is a natural assumption, because a group of "parallel" nodes is very likely to form a functional subunit. For example, in a social tagging system, users having similar tagging actions are very likely to share the same interest; resources that are annotated with many common tags are very likely to be in the same category.

In the following, we formulate the problem of community detection in ⟨K, K⟩-(hyper)networks. Assume an undirected and unweighted ⟨K, K⟩-(hyper)network $H = (V^{(1)} \cup V^{(2)} \cup \cdots \cup V^{(K)}, E)$, where $V^{(k)}$ is the k-th node set, and $E \subseteq \{(v^{(1)}_{i_1}, v^{(2)}_{i_2}, \ldots, v^{(K)}_{i_K}) \mid v^{(k)}_{i_k} \in V^{(k)}\}$ is the set of K-way (hyper)edges. Suppose $n^{(k)} = |V^{(k)}|$ is the number of nodes in the k-th node set, and $m = |E|$ the number of (hyper)edges. The structure of H can be represented by a K-dimensional array A of size $n^{(1)} \times n^{(2)} \times \cdots \times n^{(K)}$, with elements

$$A_{i_1 i_2 \cdots i_K} = \begin{cases} 1, & \text{if } (v^{(1)}_{i_1}, v^{(2)}_{i_2}, \ldots, v^{(K)}_{i_K}) \in E; \\ 0, & \text{otherwise.} \end{cases}$$

The problem of community detection in H is: given A, how can we find a "good" partition

$$C = \{V^{(1)}_{\alpha_1}\}_{\alpha_1=1}^{c^{(1)}} \cup \{V^{(2)}_{\alpha_2}\}_{\alpha_2=1}^{c^{(2)}} \cup \cdots \cup \{V^{(K)}_{\alpha_K}\}_{\alpha_K=1}^{c^{(K)}}$$

that divides $V^{(1)}, V^{(2)}, \ldots, V^{(K)}$ into disjoint communities, respectively:

$$\bigcup_{\alpha_1=1}^{c^{(1)}} V^{(1)}_{\alpha_1} = V^{(1)}, \quad \bigcup_{\alpha_2=1}^{c^{(2)}} V^{(2)}_{\alpha_2} = V^{(2)}, \quad \ldots, \quad \bigcup_{\alpha_K=1}^{c^{(K)}} V^{(K)}_{\alpha_K} = V^{(K)}.$$
Note that the numbers of communities $c^{(k)}$ ($k = 1, 2, \ldots, K$) are not known a priori. The meaning of "good" is twofold: 1) nodes in the same community are "parallel" (in the sense described in the first paragraph of Section 3, and the same hereinafter); 2) (hyper)edges between communities are either dense or sparse, so that the correspondence between communities is clear. It is easy to see that the above criteria of "good" apply to both communities of one-to-one correspondence and communities of many-to-many correspondence. Whether we should make a partition into the former or the latter is determined by the intrinsic structure of H itself. Table 1 summarizes the notations frequently used in this paper.

Table 1. Notations for a ⟨K, K⟩-(Hyper)Network H

Symbol : Meaning
$V^{(k)}$ : The k-th node set
$E$ : The (hyper)edge set
$n$ : The total number of nodes
$n^{(k)}$ : The number of nodes in the k-th node set
$c^{(k)}$ : The number of communities in the k-th node set
$m$ : The total number of (hyper)edges
$v^{(k)}_{i_k}$ : The $i_k$-th node of the k-th node set
$V^{(k)}_{\alpha_k}$ : The $\alpha_k$-th community of the k-th node set
$n^{(k)}_{\alpha_k}$ : The number of nodes in the $\alpha_k$-th community of the k-th node set
$S^{(k)}$ : The vector whose element $S^{(k)}_{i_k}$ indicates the community membership of the $i_k$-th node of the k-th node set
$A$ : The K-dimensional array whose element $A_{i_1 i_2 \cdots i_K}$ indicates the number of (hyper)edges between $v^{(1)}_{i_1}, v^{(2)}_{i_2}, \ldots, v^{(K)}_{i_K}$
$M$ : The K-dimensional array whose element $M_{\alpha_1 \alpha_2 \cdots \alpha_K}$ indicates the number of (hyper)edges between $V^{(1)}_{\alpha_1}, V^{(2)}_{\alpha_2}, \ldots, V^{(K)}_{\alpha_K}$

4 The Proposed Method

In this section we present our method for solving the problem formulated in Section 3. We first define a quality function for measuring the goodness of partitions of a ⟨K, K⟩-(hyper)network into communities, and then propose an algorithm for optimizing the quality function.
4.1 Quality Function

The main insight of the information compression method proposed by Rosvall and Bergstrom [] is to convert the community detection problem to a problem of finding an efficient compression of the network's structure. Their study focuses on compressing the structure of a unipartite network. In the following, we extend their idea to show how to compress the structural information of a ⟨K, K⟩-(hyper)network, in order to formulate our quality function.

Let us envision a communication process of transmitting the structural information of a ⟨K, K⟩-(hyper)network H. A signaler knows the structure of H

and aims to transmit much of the information in a reduced fashion to a receiver over a noiseless channel. To do so, the signaler makes a partition of H into communities and encodes the structural information $X = \{A\}$ as compressed information summarizing the community partition:

$$Y = \{S^{(1)}, S^{(2)}, \ldots, S^{(K)}, M\},$$

where $S^{(k)}$ is the community membership vector of the k-th node set, and M is the community connectivity array. For a partition dividing the k-th node set into $c^{(k)}$ communities, we have $S^{(k)} = [S^{(k)}_1, S^{(k)}_2, \ldots, S^{(k)}_{n^{(k)}}]$, where $S^{(k)}_{i_k} \in \{1, 2, \ldots, c^{(k)}\}$ indicates the community membership of node $v^{(k)}_{i_k}$. The community connectivity array M is a K-dimensional array of size $c^{(1)} \times c^{(2)} \times \cdots \times c^{(K)}$, with element $M_{\alpha_1 \alpha_2 \cdots \alpha_K} \in \{0, 1, \ldots, m\}$ indicating the number of (hyper)edges between communities $V^{(1)}_{\alpha_1}, V^{(2)}_{\alpha_2}, \ldots, V^{(K)}_{\alpha_K}$. That is,

$$M_{\alpha_1 \alpha_2 \cdots \alpha_K} = \sum_{v^{(1)}_{i_1} \in V^{(1)}_{\alpha_1}} \sum_{v^{(2)}_{i_2} \in V^{(2)}_{\alpha_2}} \cdots \sum_{v^{(K)}_{i_K} \in V^{(K)}_{\alpha_K}} A_{i_1 i_2 \cdots i_K}.$$

It is easy to derive that the description length (in bits) of the compressed information Y is

$$L(Y) = \sum_{k=1}^{K} n^{(k)} \log c^{(k)} + \left(\prod_{k=1}^{K} c^{(k)}\right) \log(m + 1),$$

where the logarithm is taken in base 2.

After receiving Y, the receiver knows the community membership of each node and the number of (hyper)edges between each community K-tuple $(V^{(1)}_{\alpha_1}, V^{(2)}_{\alpha_2}, \ldots, V^{(K)}_{\alpha_K})$. Then he tries to recover the original structural information X by constructing possible candidates. The number of different candidates is given by

$$\prod_{\alpha_1=1}^{c^{(1)}} \prod_{\alpha_2=1}^{c^{(2)}} \cdots \prod_{\alpha_K=1}^{c^{(K)}} \binom{n^{(1)}_{\alpha_1} n^{(2)}_{\alpha_2} \cdots n^{(K)}_{\alpha_K}}{M_{\alpha_1 \alpha_2 \cdots \alpha_K}} \qquad (1)$$

where $n^{(k)}_{\alpha_k} = |V^{(k)}_{\alpha_k}|$ is the number of nodes in community $V^{(k)}_{\alpha_k}$, the parentheses in (1) denote the binomial coefficient, and each binomial coefficient gives the number of different candidates for recovering the original $M_{\alpha_1 \alpha_2 \cdots \alpha_K}$ (hyper)edges between $V^{(1)}_{\alpha_1}, V^{(2)}_{\alpha_2}, \ldots, V^{(K)}_{\alpha_K}$.
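The quantities defined above can be evaluated directly on a small example. The following is a minimal sketch, assuming hyperedges are given as distinct K-tuples of 0-based node indices and memberships as one list of labels per node set; the function names (`community_connectivity`, `length_of_Y`, `log2_candidates`) are hypothetical and are not taken from the paper. M is stored sparsely as a counter, since empty cells contribute a binomial coefficient of 1.

```python
from collections import Counter
from math import comb, log2, prod


def community_connectivity(hyperedges, memberships):
    """Sparse form of M: number of hyperedges between each community K-tuple."""
    M = Counter()
    for edge in hyperedges:
        key = tuple(memberships[k][node] for k, node in enumerate(edge))
        M[key] += 1
    return M


def length_of_Y(memberships, m):
    """L(Y) in bits: membership vectors plus the entries of M."""
    c = [len(set(s)) for s in memberships]              # c^(k) for each node set
    bits_for_S = sum(len(s) * log2(ck) for s, ck in zip(memberships, c))
    bits_for_M = prod(c) * log2(m + 1)
    return bits_for_S + bits_for_M


def log2_candidates(memberships, M):
    """Base-2 logarithm of the number of candidate structures."""
    sizes = [Counter(s) for s in memberships]           # community sizes per node set
    total = 0.0
    for alphas, count in M.items():                     # empty cells add log2(1) = 0
        cells = prod(sizes[k][a] for k, a in enumerate(alphas))
        total += log2(comb(cells, count))
    return total


# Toy <2, 2>-network: two complete 2x2 blocks, 8 edges in total.
edges = ({(i, j) for i in (0, 1) for j in (0, 1)}
         | {(i, j) for i in (2, 3) for j in (2, 3)})
S = [[0, 0, 1, 1], [0, 0, 1, 1]]
M = community_connectivity(edges, S)
print(length_of_Y(S, len(edges)), log2_candidates(S, M))
```

In this toy case the partition matches the block structure exactly, so every non-empty cell of M is completely filled and the receiver has only one candidate to consider.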
Hence, the description length (in bits) of the additional information for the receiver to recover X (i.e., the conditional information between X and Y) is

$$L(X|Y) = \log\left[\prod_{\alpha_1=1}^{c^{(1)}} \prod_{\alpha_2=1}^{c^{(2)}} \cdots \prod_{\alpha_K=1}^{c^{(K)}} \binom{n^{(1)}_{\alpha_1} n^{(2)}_{\alpha_2} \cdots n^{(K)}_{\alpha_K}}{M_{\alpha_1 \alpha_2 \cdots \alpha_K}}\right].$$

The objective is that the signaler transmits the least while the receiver receives the most. This is apparently a dilemma. If the signaler makes a partition of $V^{(k)}$ into $n^{(k)}$ communities ($k = 1, 2, \ldots, K$), meaning one community for each node, there would be no compression on Y. Thus, the receiver can recover X completely, without any additional information ($L(X|Y) = 0$), while the signaler has to transmit the most ($L(Y)$ is the largest). Conversely, if the signaler makes a partition of $V^{(k)}$ into only one community ($k = 1, 2, \ldots, K$), Y would be in the most compressed form. Thus, the signaler transmits the least information ($L(Y)$ is the smallest), while the receiver needs the most additional information to recover X ($L(X|Y)$ is the largest).

Fig.3. Information compression on a ⟨3, 3⟩-hypernetwork.

If the signaler makes a "good" partition (as described in Section 3, and the same hereinafter), he can

highlight certain regularities (e.g., the similarities of nodes in the same community) and filter out relatively unimportant details (e.g., the dissimilarities of nodes in the same community). Intuitively, the compression based on it would achieve the optimal trade-off between $L(Y)$ and $L(X|Y)$ (see Fig.3 for illustration). According to the minimum description length (MDL) principle [48],

$$Q_H(C) = L(Y) + L(X|Y) = \sum_{k=1}^{K} n^{(k)} \log c^{(k)} + \left(\prod_{k=1}^{K} c^{(k)}\right) \log(m + 1) + \log\left[\prod_{\alpha_1=1}^{c^{(1)}} \prod_{\alpha_2=1}^{c^{(2)}} \cdots \prod_{\alpha_K=1}^{c^{(K)}} \binom{n^{(1)}_{\alpha_1} n^{(2)}_{\alpha_2} \cdots n^{(K)}_{\alpha_K}}{M_{\alpha_1 \alpha_2 \cdots \alpha_K}}\right]$$

would attain its minimum value. This is the quality function for C. Clearly, the lower the value of $Q_H$, the better the partition C. It should be emphasized that $Q_H$ highly values a "good" partition, which applies to both communities of one-to-one correspondence and communities of many-to-many correspondence.

4.2 Algorithm for Minimizing Quality Function

Now we can evaluate a partition based on the quality function $Q_H$, and a low value of $Q_H$ indicates a good partition. The task is then to search over all possible partitions for one that has a minimum $Q_H$. However, as with modularity optimization [9-37], finding the globally optimal solution is NP-hard [49]. Thus an approximate algorithm is required. We develop an algorithm modified from the label propagation algorithm (LPA) [37,50-5]. LPA is fast for minimizing $Q_H$ (it was originally designed for maximizing the modularity [5]), but it is prone to getting stuck in a poor local minimum [37]. Our algorithm achieves a proper balance between accuracy and speed. It can be divided into two iterative phases. In Phase 1, we run LPA in the original (hyper)network H, to reach a local minimum of $Q_H$. In Phase 2, we run LPA in a reduced (hyper)network $H'$, to escape the local minimum previously reached.

Specific procedures and detailed explanations of our algorithm are as follows. Initially we assign each node a unique label, indicating its community membership.
Therefore, there are as many initial communities as there are nodes. Then we enter Phase 1 (running LPA in H).

(a) In each step, in a random sequential order, update each node's label to one of the existing labels (the existing labels in the node set of the considered node) that generates the greatest decrease of $Q_H$; if no new label generates a decrease of $Q_H$, keep the node's label unchanged (the label updating rule).
(b) Repeat (a), each step in a new random sequential order, until no decrease of $Q_H$ can be achieved.
(c) Identify communities as nodes bearing the same labels.

Algorithm 1. Detecting Communities in a ⟨K, K⟩-(Hyper)Network H by Minimizing $Q_H$
Input: Connectivity array A of H
Output: A community partition C
Begin
  Assign each node in H a unique label;
  repeat
    //Phase 1
    repeat
      Update each node's label in H;
    until a local minimum of $Q_H$
    //Phase 2
    Build a reduced ⟨K, K⟩-(hyper)network $H'$;
    Assign each node in $H'$ a unique label;
    repeat
      Update each node's label in $H'$;
    until a local minimum of $Q_H$
    Retrieve node labels in H from the corresponding labels in $H'$;
  until no change in $Q_H$
  Identify communities as groups of nodes bearing the same labels;
End

In the above procedures, note that 1) according to the label updating rule, $Q_H$ never increases; 2) some initial labels vanish while some others become popular. Thus, Phase 1 in effect seeks to minimize $Q_H$ by combining communities. At the initial stage, when each community is composed of one node, communities can be combined easily. Later, as communities grow larger, combining them becomes harder: combining community $V^{(k)}_{\alpha}$ with community $V^{(k)}_{\beta}$ means that the labels of all nodes in $V^{(k)}_{\alpha}$ are updated to the label of $V^{(k)}_{\beta}$, but it is difficult to achieve this consensus based on the updating rule of each individual node. Consequently, Phase 1 finally stops and converges to a local minimum of $Q_H$. In Phase 2 we try to escape the local minimum.
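Phase 1 as described above can be sketched in a few lines. This is a deliberately naive illustration, not the authors' implementation: it recomputes the quality function from scratch for every candidate label, so it does not have the near linear running time discussed in the Appendix. The names `quality` and `lpa_phase1`, the `seed` parameter, and the small tolerance used to compare floating-point values are assumptions made for this sketch, and hyperedges are assumed to be distinct tuples of 0-based node indices.

```python
import random
from collections import Counter
from math import comb, log2, prod


def quality(hyperedges, labels):
    """Q_H = L(Y) + L(X|Y), where labels[k][i] is the label of node i in set k."""
    m = len(hyperedges)
    sizes = [Counter(s) for s in labels]                # community sizes per node set
    M = Counter(tuple(labels[k][v] for k, v in enumerate(e)) for e in hyperedges)
    l_y = sum(len(s) * log2(len(cs)) for s, cs in zip(labels, sizes))
    l_y += prod(len(cs) for cs in sizes) * log2(m + 1)
    l_xy = sum(log2(comb(prod(cs[a] for cs, a in zip(sizes, alphas)), cnt))
               for alphas, cnt in M.items())
    return l_y + l_xy


def lpa_phase1(hyperedges, set_sizes, seed=0):
    """Label propagation in H until no single-node move decreases Q_H."""
    rng = random.Random(seed)
    labels = [list(range(n)) for n in set_sizes]        # one unique label per node
    best = quality(hyperedges, labels)
    nodes = [(k, i) for k, n in enumerate(set_sizes) for i in range(n)]
    improved = True
    while improved:
        improved = False
        rng.shuffle(nodes)                              # new random sequential order
        for k, i in nodes:
            current = labels[k][i]
            best_label = current
            for cand in sorted(set(labels[k])):         # existing labels in node set k
                if cand == current:
                    continue
                labels[k][i] = cand
                q = quality(hyperedges, labels)
                if q < best - 1e-12:                    # keep the greatest decrease
                    best, best_label = q, cand
            labels[k][i] = best_label                   # unchanged if nothing improved
            improved = improved or best_label != current
    return labels, best
```

Because a label is changed only when it strictly lowers the quality value, the value never increases during the run, which mirrors note 1) above.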
Following the method presented in [35], we build a reduced (hyper)network $H'$ whose nodes are now the

communities identified in Phase 1. Then, we assign each node in $H'$ a unique label, and sequentially and repeatedly update their labels (running LPA in $H'$). Note that the nodes are now the communities identified in Phase 1. By updating a node's label to another, we actually combine these two communities and escape the local minimum. After Phase 2 finishes, we come back to the original (hyper)network H and enter Phase 1 again. That is, we first retrieve each node's label from the corresponding label in $H'$, and then run LPA to reach another local minimum. With these two phases being repeated iteratively, a fairly good solution can finally be obtained.

It should be noted that the above algorithm, though described as minimizing $Q_H$, can be used to optimize other quality functions (for maximization, just make some minor modifications, e.g., decrease → increase, minimum → maximum). In the Appendix, we discuss implementation issues and show how this algorithm can be sped up to run in near linear time.

Footnote: In order to calculate $Q_H$ in $H'$, we introduce the node weight (equal to the number of nodes in the corresponding communities in H) and the (hyper)edge weight (equal to the number of (hyper)edges between the corresponding communities in H) [35,5].

Footnote: For a given node $v_x$ in H, there is a node $v'_x$ in $H'$ corresponding to $v_x$'s community, so we retrieve $v_x$'s label from $v'_x$'s label.

Footnote: In Phase 2, all nodes' labels in a community are forced to be updated jointly. By running LPA in H, we can update the individual labels separately and further refine the solution.

5 Experiments

In this section we present experiments for evaluating the performance of our method in a ⟨2, 2⟩-network and in ⟨3, 3⟩-hypernetworks.

5.1 Southern Women ⟨2, 2⟩-Network

First, we use the famous Southern women ⟨2, 2⟩-network as a touchstone. The original data underlying this network were collected by Davis et al. during the 1930s as part of an extensive study of class and race in black and white society in the Deep South [53]. It describes the participation of 18 women in 14 social events. From it we can derive a ⟨2, 2⟩-network whose nodes represent women and social events, and whose edges represent the participation of the women in the events. This data and the corresponding ⟨2, 2⟩-network have been intensively studied by social scientists.

1) Based on ethnographic knowledge, Davis made a partition of the 18 women into two groups: women 1-9 in the first group and women 9-18 in the second (woman 9 is a secondary member of both groups) [53].
2) Freeman reviewed different studies and identified a consensus partition of the 18 women into two groups: women 1-9 in the first group and women 10-18 in the second [54].

Taken together, we consider the partition that divides women into two communities {1-9} and {10-18} as the ground truth (since it is in agreement with the findings of both Davis and Freeman).

Fig.4. Partitions of the Southern women ⟨2, 2⟩-network obtained by (a) our method, (b) the extended modularity optimization advanced by Guimerà [6], (c) the extended modularity optimization proposed by Murata [], (d) the extended modularity optimization presented by Barber [], and (e) the extended modularity optimization brought forward by Suzuki [3]. Women are indicated as circle symbols located at the top side, while events are indicated as square symbols located at the bottom side. Nodes in the same community are painted in the same color.

Except for our method, existing approaches for community detection in ⟨2, 2⟩-networks include the extended modularity optimization methods [6,-3] and

the co-clustering algorithms [7-0]. Here we only consider the extended modularity optimization methods, since the co-clustering algorithms require a priori knowledge such as the numbers of clusters and are not qualified in a strict sense [6,]. Fig.4 shows the partitions obtained by different methods. (The partitions obtained by the extended modularity optimization methods are reported in [6,, 3].) It is satisfying to find that only our method's partition for women (the top side nodes in Fig.4(a)) is consistent with the ground truth proposed by the social scientists. Our partition of events into three communities is also reasonable, as it conforms to the criteria of "good": 1) nodes within the same event community are "parallel"; 2) the correspondence between communities is clear: event communities {1-6} and {10-14} correspond with woman communities {1-9} and {10-18}, respectively, while event community {7-9} corresponds with both woman communities. In addition, our method detects communities of many-to-many correspondence, while the others detect communities of one-to-one correspondence. (Although there is many-to-many correspondence between the communities described in Fig.4(e), this partition seems somewhat strange, since several communities consist of only one node.)

5.2 Synthetic ⟨3, 3⟩-Hypernetwork

Now, let us concentrate on comparing our method with existing methods through a standard benchmark test [,8,4-5,5] in synthetic ⟨3, 3⟩-hypernetworks. The basic scheme is as follows. 1) We generate a set of random ⟨3, 3⟩-hypernetworks with known community structure (the true partition). 2) Applying various community detection methods to these hypernetworks (the true partition is hidden at this time), we compare the similarities between the partitions obtained by these methods and the true partition. 3) The more similar an obtained partition is to the true partition, the better the corresponding method.
To quantify the similarity between two partitions, we adopt the widely used normalized mutual information (NMI) []. NMI is an information-theoretic measure that calculates the amount of common information between two partitions. Specifically,

$$\mathrm{NMI}(P, P') = \frac{-2 \sum_{\alpha=1}^{c_P} \sum_{\beta=1}^{c_{P'}} N_{\alpha\beta} \log\left(\frac{N_{\alpha\beta} N}{N_\alpha N_\beta}\right)}{\sum_{\alpha=1}^{c_P} N_\alpha \log\left(\frac{N_\alpha}{N}\right) + \sum_{\beta=1}^{c_{P'}} N_\beta \log\left(\frac{N_\beta}{N}\right)},$$

where $P$ and $P'$ are two partitions on the same set of nodes, $c_P$ is the number of communities in $P$, $c_{P'}$ is the number of communities in $P'$, $N$ is the total number of nodes, $N_\alpha$ is the number of nodes in the α-th community of $P$, $N_\beta$ is the number of nodes in the β-th community of $P'$, and $N_{\alpha\beta}$ is the number of nodes that are both in the α-th community of $P$ and the β-th community of $P'$. If $P$ and $P'$ match completely, we have a maximum NMI value of 1, whereas if $P$ and $P'$ are totally independent of one another, we have a minimum value of 0.

For comparison, we consider several methods that cover state-of-the-art techniques. They are, in order, the extended modularity optimization method (ExModularity) [4], the tensor decomposition method (MetaFac) [6], and the method that involves reduction of a ⟨3, 3⟩-hypernetwork to ⟨2, 2⟩-networks (BiReduction) [8] (a brief description of these three methods is included in Subsection 2.4). In addition, we consider another method (UniReduction), modified from Zlatić's approach [7], which involves reduction of a ⟨3, 3⟩-hypernetwork to unipartite networks.

A detailed description of UniReduction is as follows. For clarity, we color the three node sets $V^{(1)}$, $V^{(2)}$, and $V^{(3)}$ in a ⟨3, 3⟩-hypernetwork red, green, and blue, respectively. Suppose we are to detect red communities. We first reduce the original hypernetwork to a weighted unipartite network of red nodes, and then employ modularity optimization in this unipartite network. The edge weight $w^{(1)}_{xy}$ between $v^{(1)}_x$ and $v^{(1)}_y$ in the unipartite network is equal to their similarity in the original hypernetwork as measured by the Jaccard index [55].
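The NMI measure defined earlier in this subsection can be computed from two label lists as in the sketch below. The function name `nmi` is hypothetical, and the return value of 1.0 when both partitions consist of a single community (where the denominator vanishes) is a convention assumed here, not something stated in the text.

```python
from collections import Counter
from math import log


def nmi(p, q):
    """Normalized mutual information between two labelings of the same nodes."""
    n = len(p)
    n_a = Counter(p)                    # community sizes in the first partition
    n_b = Counter(q)                    # community sizes in the second partition
    n_ab = Counter(zip(p, q))           # overlap sizes; zero overlaps are simply absent
    numerator = -2.0 * sum(c * log(c * n / (n_a[a] * n_b[b]))
                           for (a, b), c in n_ab.items())
    denominator = (sum(c * log(c / n) for c in n_a.values())
                   + sum(c * log(c / n) for c in n_b.values()))
    if denominator == 0:                # both partitions have a single community
        return 1.0
    return numerator / denominator
```

The measure depends only on how nodes are grouped, not on the label names, so `nmi([0, 0, 1, 1], [1, 1, 0, 0])` evaluates to 1 up to floating-point rounding.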
Specifically,

$$w_{xy}^{(1)} = \frac{\left|\Gamma(v_x^{(1)}) \cap \Gamma(v_y^{(1)})\right|}{\left|\Gamma(v_x^{(1)}) \cup \Gamma(v_y^{(1)})\right|} = \frac{\sum_{i_2=1}^{n^{(2)}} \sum_{i_3=1}^{n^{(3)}} A_{x i_2 i_3} A_{y i_2 i_3}}{\sum_{i_2=1}^{n^{(2)}} \sum_{i_3=1}^{n^{(3)}} \left(A_{x i_2 i_3} + A_{y i_2 i_3} - A_{x i_2 i_3} A_{y i_2 i_3}\right)},$$

where $\Gamma(v_x^{(1)})$ and $\Gamma(v_y^{(1)})$ denote the neighbor sets of $v_x^{(1)}$ and $v_y^{(1)}$. The justification for this method is that the more similar two nodes are, the larger the edge weight between them, and thus the more likely modularity optimization is to group them into the same community.

(Footnote: The major difference between Zlatić's approach and UniReduction lies in the similarity measure used to calculate the edge weight. In Zlatić's approach, the similarity measure considers neighbors in only one node set, i.e., either the green or the blue neighbors of $v_x^{(1)}$ and $v_y^{(1)}$, so some information in the original hyperedges is lost. In UniReduction, the similarity measure considers both green and blue neighbors.)

Note that ExModularity, UniReduction, and BiReduction all rely on heuristics to optimize their respective quality functions. The original ExModularity and

BiReduction use the greedy bottom-up algorithm [9]. To be fair, we consistently use Algorithm (described in Subsection 4., and the same hereinafter) as the optimization algorithm for our method, ExModularity, UniReduction, and BiReduction. (In fact, Algorithm is empirically found to give better optimization results than the greedy bottom-up algorithm.)

5.2.1 Communities of One-to-One Correspondence

For the first case, we compare these methods on a set of synthetic ⟨3, 3⟩-hypernetworks with built-in communities of one-to-one correspondence. The hypernetwork generation procedure is as follows.

1) Set the numbers of red, green and blue nodes to $n^{(1)} = n^{(2)} = n^{(3)} = 160$. Set the numbers of red, green and blue communities to $c^{(1)} = c^{(2)} = c^{(3)} = 8$. Divide the red, green and blue nodes equally into $c^{(1)}$, $c^{(2)}$ and $c^{(3)}$ communities. That is, $n^{(1)}_{\alpha_1} = n^{(2)}_{\alpha_2} = n^{(3)}_{\alpha_3} = 20$ ($\alpha_1 = 1, \ldots, c^{(1)}$; $\alpha_2 = 1, \ldots, c^{(2)}$; $\alpha_3 = 1, \ldots, c^{(3)}$).

2) For each community triple $(V^{(1)}_{\alpha_1}, V^{(2)}_{\alpha_2}, V^{(3)}_{\alpha_3})$, first set the hyperedge density as

$$p_{\alpha_1\alpha_2\alpha_3} = \begin{cases} p_{\text{dense}} \pm \text{rand}, & \text{if } \alpha_1 = \alpha_2 = \alpha_3; \\ \text{rand}, & \text{otherwise,} \end{cases} \qquad (2)$$

where rand is a random number from 0.0 to .0, which acts as a noise factor. Then, with probability $p_{\alpha_1\alpha_2\alpha_3}$, randomly place hyperedges connecting nodes $v^{(1)}_{i_1} \in V^{(1)}_{\alpha_1}$, $v^{(2)}_{i_2} \in V^{(2)}_{\alpha_2}$ and $v^{(3)}_{i_3} \in V^{(3)}_{\alpha_3}$.

From (2), we can see that dense hyperedges are placed between the $\alpha$-th red, $\alpha$-th green and $\alpha$-th blue communities ($\alpha = 1, \ldots, 8$), which constitutes their one-to-one correspondence. By gradually decreasing $p_{\text{dense}}$ from 0.08 to 0.00, we generate different hypernetworks with progressively weaker community structure, and thus pose greater and greater challenges to the methods. Fig.5 shows the performances of the various methods on this set of hypernetworks (values are averaged over 0 runs).
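The generation procedure above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' generator: the noise width `noise` and its default are placeholders, "± rand" is implemented as adding or subtracting the noise with equal probability, and densities are clipped to [0, 1]. The function name and the return format (a set of index triples plus the true labels) are also hypothetical.

```python
import random
from itertools import product


def generate_one_to_one(p_dense, n_comm=8, comm_size=20, noise=0.001, seed=0):
    """Random <3,3>-hypernetwork with one-to-one community correspondence.

    Returns (hyperedges, labels): a set of (red, green, blue) node-index
    triples, and the true community label of each node. All three node sets
    have the same size here, so one label list serves for each of them.
    """
    rng = random.Random(seed)
    members = [range(a * comm_size, (a + 1) * comm_size) for a in range(n_comm)]
    hyperedges = set()
    for a1, a2, a3 in product(range(n_comm), repeat=3):
        rand = rng.uniform(0.0, noise)        # noise factor (assumed range)
        if a1 == a2 == a3:
            # dense triple: p_dense plus or minus the noise (assumed 50/50)
            p = p_dense + rng.choice((-1.0, 1.0)) * rand
        else:
            p = rand                          # noise-only triple
        p = min(max(p, 0.0), 1.0)
        # one Bernoulli trial per candidate hyperedge in this triple
        for triple in product(members[a1], members[a2], members[a3]):
            if rng.random() < p:
                hyperedges.add(triple)
    labels = [node // comm_size for node in range(n_comm * comm_size)]
    return hyperedges, labels
```

With the default sizes this performs about four million Bernoulli trials per hypernetwork, which is slow but workable in pure Python; a vectorized version would be preferable for repeated runs.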
On the whole, the performance of each method varies in a similar way across the red, green and blue node sets (since the three node sets play symmetric roles in the hypernetwork generation procedure). Specifically, our method, ExModularity, and BiReduction perform excellently, correctly detecting not only the numbers of communities but also the community membership of each node almost all the way to the point $p_{\text{dense}}$ = . At the turning stage, i.e., when $p_{\text{dense}}$ falls from to 0.03, our method slightly outperforms ExModularity and BiReduction, as shown in the embedded figures. MetaFac, though given a priori knowledge of the true numbers of communities (the numbers of red, green and blue communities are all set to 8), does not provide remarkable results: it cannot detect the community memberships of all nodes, even at $p_{\text{dense}}$ = . The record of UniReduction is even worse. Its performance begins to deteriorate as early as $p_{\text{dense}}$ = .

Fig.5. Performances of various methods in the ⟨3, 3⟩-hypernetworks with built-in communities of one-to-one correspondence. (a) Red node set. (b) Green node set. (c) Blue node set.

(Footnote: For different $p_{\text{dense}}$, the number of hyperedges generally ranges from to)

All of these methods miss the true community structure when $p_{\text{dense}}$ < 0.05. To see why, we compare the partitions obtained by our method, ExModularity, UniReduction and BiReduction with the true partition. Surprisingly, the obtained partitions consistently score much better than the true partition under their respective quality functions, although from an objective point of view the true partition should score better than any other partition. In other words, in this regime the quality functions can no longer tell right from wrong and become invalid. Consequently, all of the methods fail, since one can never expect an optimization algorithm to find the

true partition under an invalid quality function. The only difference is that, under the different quality functions, our method favors a partition into a single community (in each node set), while ExModularity and BiReduction prefer a partition into over 40 communities (in each node set). According to information theory, a partition into one community is totally independent of the true partition, leading to an NMI value of 0 (this is why our method's value dives towards 0). On the other hand, a randomly generated partition into 40 communities is expected to have an NMI value of 0.30±0.000 (based on tests), which is comparable to the values achieved by ExModularity and BiReduction (this is why these two methods can still show good-looking records even when no community structure exists, i.e., at $p_{\text{dense}}$ = 0.00). The apparent advantage is therefore an artifact of NMI, and ExModularity and BiReduction are actually not superior to our method when $p_{\text{dense}}$ < .

5.2.2 Communities of Many-to-Many Correspondence

For the second case, we generate a set of random ⟨3, 3⟩-hypernetworks with built-in communities of many-to-many correspondence. The procedure is as follows.

1) Set the numbers of red, green and blue nodes to $n^{(1)} = 120$, $n^{(2)} = 160$, $n^{(3)} = 200$. Set the numbers of red, green and blue communities to $c^{(1)} = 6$, $c^{(2)} = 8$, $c^{(3)} = 10$. Divide the red, green and blue nodes equally into $c^{(1)}$, $c^{(2)}$ and $c^{(3)}$ communities. That is, $n^{(1)}_{\alpha_1} = n^{(2)}_{\alpha_2} = n^{(3)}_{\alpha_3} = 20$ ($\alpha_1 = 1, \ldots, c^{(1)}$; $\alpha_2 = 1, \ldots, c^{(2)}$; $\alpha_3 = 1, \ldots, c^{(3)}$).

2) For each community triple $(V^{(1)}_{\alpha_1}, V^{(2)}_{\alpha_2}, V^{(3)}_{\alpha_3})$, first set the hyperedge density as

$$p_{\alpha_1\alpha_2\alpha_3} = \begin{cases} p_{\text{dense}} \pm \text{rand}, & \text{with prob. } 0.1; \\ \text{rand}, & \text{with prob. } 0.9. \end{cases} \qquad (3)$$

Then, with probability $p_{\alpha_1\alpha_2\alpha_3}$, randomly place hyperedges connecting nodes $v^{(1)}_{i_1} \in V^{(1)}_{\alpha_1}$, $v^{(2)}_{i_2} \in V^{(2)}_{\alpha_2}$ and $v^{(3)}_{i_3} \in V^{(3)}_{\alpha_3}$.
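Compared with the one-to-one case, the only change is how the dense community triples are chosen. As a minimal sketch of the density rule (3), the function below assigns a density to each of the 6 × 8 × 10 = 480 community triples, selecting each as dense with probability 48/480 = 0.1. The function name, the noise width `noise`, and the equal-probability treatment of "± rand" are assumptions made for illustration.

```python
import random
from itertools import product


def many_to_many_densities(p_dense, n_comms=(6, 8, 10), p_select=0.1,
                           noise=0.001, seed=0):
    """Assign a hyperedge density to every community triple.

    Returns (densities, dense_triples): a dict mapping each community
    triple to its density, and the list of triples selected as dense.
    """
    rng = random.Random(seed)
    densities = {}
    dense_triples = []
    for triple in product(*(range(c) for c in n_comms)):
        rand = rng.uniform(0.0, noise)        # noise factor (assumed range)
        if rng.random() < p_select:
            # selected as dense: p_dense plus or minus the noise
            p = p_dense + rng.choice((-1.0, 1.0)) * rand
            dense_triples.append(triple)
        else:
            p = rand                          # noise-only triple
        densities[triple] = min(max(p, 0.0), 1.0)
    return densities, dense_triples
```

Because dense triples are drawn independently of one another, a given red community will typically be densely connected to several green and blue communities at once, which is what produces the many-to-many correspondence.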
From (3), we can see that around 48 of the $c^{(1)} c^{(2)} c^{(3)} = 480$ community triples are randomly selected and given dense hyperedges. Since 48 is much larger than $c^{(1)}$, $c^{(2)}$ and $c^{(3)}$, there must be many-to-many correspondence between communities. As before, we gradually decrease $p_{\text{dense}}$ from 0.08 to 0.00 and generate a set of hypernetworks whose community structure is more and more difficult to detect. The performances of the various methods on this set of hypernetworks are shown in Fig.6 (values are averaged over 0 runs).

Fig.6. Performances of various methods in the ⟨3, 3⟩-hypernetworks with built-in communities of many-to-many correspondence. (a) Red node set. (b) Green node set. (c) Blue node set.

As Fig.6 shows, our method outperforms the others by a large margin. It works almost perfectly all the way until $p_{\text{dense}}$ = 0.05, with a sudden degradation thereafter. (The reason that our method's NMI value is well below the others when $p_{\text{dense}}$ < 0.05 is the same as discussed in the test for communities of one-to-one correspondence.) For the other methods, we observe two common features. 1) None of them can detect community memberships with 100% accuracy, even when $p_{\text{dense}}$ = . 2) Their performances deteriorate much earlier than that of our method, often with records fluctuating wildly before the turning points. Therefore, as expected, these methods cannot handle communities of many-to-many correspondence. Specifically, UniReduction is the second best most of the time, followed by MetaFac. Note that MetaFac is given at least an estimate of the true numbers of communities (the numbers of red, green and blue communities are all set to 8), so its performance is unimpressive. In contrast to their excellent performances on the previous set of hypernetworks, BiReduction and ExModularity do not show satisfactory results this time.

Remark. One may wonder why the performance of our method in this test is better than that in
(Footnote: For different $p_{\text{dense}}$, the number of hyperedges generally ranges from to)

the previous test for communities of one-to-one correspondence. This is because more community triples are given dense connections this time (recall that around 48 community triples receive dense hyperedges in the generation procedure for communities of many-to-many correspondence, while only 8 community triples receive dense hyperedges in the generation procedure for communities of one-to-one correspondence). Thus, our method has more information to exploit, resulting in better performance.

These two tests show that our method is competent at detecting both communities of one-to-one correspondence and communities of many-to-many correspondence. In the former case, our method slightly outperforms the state-of-the-art techniques. In the latter case, which is more significant in practice, our method outperforms the others by a large margin. Whether a partition falls into the former case or the latter is determined by the intrinsic structure of the (hyper)network itself.

6 Conclusion

Based on the information compression idea [], we define a quality function for measuring the goodness of partitions of a ⟨K, K⟩-(hyper)network into communities, and develop an algorithm for optimizing this quality function. Our method provides a framework for solving the community detection problem in ⟨K, K⟩-(hyper)networks. Compared with existing methods, our method is competent for both communities of one-to-one correspondence and communities of many-to-many correspondence. It should be emphasized that our method is automatic and independent of any a priori knowledge such as the numbers of communities. With a carefully designed algorithm (see Appendix), our method also has the desired property of scalability. The framework proposed in this paper is a first step towards analyzing complex heterogeneous systems.
A real-world heterogeneous system often contains several different kinds of relationships: some relationships are formed between entities of the same type, some are between entities of different types. Generally, the relationship between entities of the same type can be modeled as a unipartite network, while the relationship between entities of different types can be modeled as a K, K -(hyper)network. Thus, such a heterogeneous system can be modeled as multiple related (hyper)networks, one for each kind of relationship. Community detection from multiple related (hyper)networks is left for our future work. References [] Fortunato S. Community detection in graphs. Physics Reports, 00, 486: [] Danon L, Duch L, Guilera A D, Arenas A. Comparing community structure identification. J. Stat. Mech, 005, 9: P [3] Lancichinetti A, Fortunato S. Community detection algorithms: A comparative analysis. Phys. Rev. E, 009, 80(5): [4] Leskovec J, Lang K J, Mahoney M W. Empirical comparison of algorithms for network community detection. In Proc. the 9th International Conference on World Wide Web, Raleigh, USA, Apr. 6-30, 00, pp [5] Shen H, Cheng X. Spectral methods for the detection of network community structure: A comparative analysis. J. Stat. Mech., 00, 0: P000. [6] Guimerà R, Pardo M S, Amaral L A N. Module identification in bipartite and directed networks. Phys. Rev. E, 007, 76(3): [7] Zlatić V, Ghoshal G, Caldarelli G. Hypergraph topological quantities for tagged social networks. Phys. Rev. E, 009, 80(3): [8] Neubauer N, Obermayer K. Towards community detection in k-partite k-uniform hypergraphs. In Workshop on Analyzing Networks and Learning with Graphs, Whistler, BC, Canada, Dec., 009. [9] Lu C, Chen X, Park E K. Exploit the tripartite network of social tagging for web clustering. In Proc. the 8th ACM Conference on Information and Knowledge Management, Hong Kong, China, Nov. -6, 009, pp [0] Zhou T, Ren J, Medo M, Zhang Y C. 
Bipartite network projection and personal recommendation. Phys. Rev. E, 007, 76(4): [] Barber M J. Modularity and community detection in bipartite network. Phys. Rev. E, 007, 76(6): [] Murata T, Ikeya T. A new modularity for detecting one-tomany correspondence of communities in bipartite networks. Advances in Complex Systems, 00, 3(): 9-3. [3] Suzuk, Wakita K. Extracting multi-facet community structure from bipartite networks. In Proc. International Conference on Computational Science and Engineering, Vancouver, BC, Canada, Aug. 9-3, 009, pp [4] Murata T. Detecting communities from tripartite networks. In Proc. the 9th International Conference on World Wide Web, Raleigh, USA, Apr. 6-30, 00, pp [5] Murata T. Modularity for heterogeneous networks. In Proc. the st ACM Conference on Hypertext and Hypermedia, Toronto, Canada, Jun. 3-6, 00, pp [6] Lin Y R, Sun J, Castro P, Konuru R, Sundaram H, Kelliher A. Metafac: Community discovery via relational hypergraph factorization. In Proc. the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Paris, France, Jun. 8-Jul., 009, pp [7] Dhillon I S, Mallela S, Modha D S. Information-theoretic coclustering. In Proc. the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington DC, USA, Aug. 4-7, 003, pp [8] Li T. A general model for clustering binary data. In Proc. the th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Chicago, USA, Aug. -4, 005, pp [9] Banerjee A, Dhillon I, Ghosh J, Merugu S, Modha D S. A generalized maximum entropy approach to Bregman co-clustering and matrix approximation. Journal of Machine Learning Research, 007, 8:

[0] Long B, Zhang Z, Yu P S. A probabilistic framework for relational clustering. In Proc. the 3th ACM International Conference on Knowledge Discovery and Data Mining, San Jose, USA, Aug. -5, 007, pp [] Newman M E J. Networks: An Introduction. New York: Oxford University Press, 00. [] Rosvall M, Bergstrom C T. An information-theoretic framework for resolving community structure in complex networks. Proc. Natl. Acad. Sci. USA, 007, 04(8): [3] Kernighan B, Lin S. An efficient heuristic procedure to partition graphs. Bell Syst. Tech. J., 970, 49(): [4] Scott J. Social Network Analysis: A Handbook. Second Edition, Sage Publications, Newberry Park, CA, 000. [5] Newman M E J, Girvan M. Finding and evaluating community structure in networks. Phys. Rev. E, 004, 69(): 063. [6] Newman M E J. Modularity and community structure in networks. Proc. Natl. Acad. Sci. USA, 006, 03(3): [7] Fortunato S, Barthélemy M. Resolution limit in community detection. Proc. Natl. Acad. Sci. USA, 007, 04(): [8] Shen H, Cheng X, Fang B. Covariance, correlation matrix, and the multiscale community structure of networks. Phys. Rev. E, 00, 8(): 064. [9] Newman M E J. Fast algorithm for detecting community structure in networks. Phys. Rev. E, 004, 69(6): [30] Clauset A, Newman M E J, Moore C. Finding community structure in very large networks. Phys. Rev. E, 004, 70(6): 066. [3] Duch L, Arenas A. Community detection in complex networks using extremal optimization. Phys. Rev. E, 005, 7(): [3] Medus A, Acuna G, Dorso C O. Detection of community structures in networks via global optimization. Physica A, 005, 358(-4): [33] Newman M E J. Finding community structure in networks using the eigenvectors of matrices. Phys. Rev. E, 006, 74(3): [34] Schuetz P, Caflisch A. Efficient modularity optimization by multistep greedy algorithm and vertex refinement. Phys. Rev. E, 008, 77(4): 046. [35] Blondel V D, Guillaume J L, Lambiotte R, Lefebvre E. 
Fast unfolding of communities in large networks. J. Stat. Mech., 008, 0: P0008. [36] Zhang X S, Wang R S, Wang Y, Wang J, Qiu Y, Wang L, Chen L. Modularity optimization in community detection of complex networks. Europhys. Lett., 009, 87(3): [37] Liu X, Murata T. Advanced modularity-specialized label propagation algorithm for detecting communities in networks. Physica A, 00, 389(7): [38] Gao B, Liu T Y, Zheng X, Cheng Q S, Ma W Y. Consistent bipartite graph co-partitioning for star-structured high-order heterogeneous data co-clustering. In Proc. the th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Chicago, USA, Aug. -4, 005, pp [39] Greco G, Guzzo A, Pontieri L. An information-theoretic framework for high-order co-clustering of heterogeneous objects. In Proc. the 5th Italian Symposium on Advanced Database Systems, Torre Canne, Italy, Jun. 7-0, 007, pp [40] Long B, Zhang Z F, Wu X Y, Yu P S. Spectral clustering for multi-type relational data. In Proc. the 3rd International Conference on Machine Learning, Pittsburgh, USA, Jun. 5-9, 006, pp [4] Long B, Wu X, Zhang Z, Yu P S. Unsupervised learning on k-partite graphs. In Proc. the th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, USA, Aug. 0-3, 006, pp [4] Singh A P, Gordon G J. Relational learning via collective matrix factorization. In Proc. the 4th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp , Las Vegas, USA, Aug. 4-7, 008. [43] Singh A P, Gordon G J. A unified view of matrix factorization models. In Proc. the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, Antwerp, Belgium, Sept. 5-9, 008, pp [44] Cattuto C, Schmitz C, Baldassarri A, Servedio V D P, Loreto V, Hotho A, Grahl M, Stumme G. Network properties of folksonomies. AI Communications, 007, 0(4): [45] Halpin H, Robu V, Shepherd H. The complex dynamics of collaborative tagging. In Proc. 
the 6th International Conference on World Wide Web, Banff, Canada, May 8-, 007, pp.-0. [46] Long B, Wu X, Zhang Z, Yu P S. Community learning by graph approximation. In Proc. the 7th IEEE International Conference on Data Mining, Omaha, USA, Oct. 8-3, 007, pp.3-4. [47] Long B, Zhang Z, Yu P S, Xu T. Clustering on complex graphs. In Proc. the 3rd National Conference on Artificial Intelligence, Chicago, USA, Jul. 3-7, 008, pp [48] Rissanen J. Modelling by shortest data description. Automatica, 978, 4(5): [49] Brandes U, Delling D, Gaertler M, Görke R, Hoefer M, Nikolski Z, Wagner D. On modularity np-completeness and beyond. Technical Report 006-9, ITI Wagner, Faculty of Informatics, Universität Karlsruhe, 006. [50] Raghavan U N, Albert R, Kumara S. Near linear time algorithm to detect community structures in large-scale networks. Phys. Rev. E, 007, 76(3): [5] Barber M J, Clark J W. Detecting network communities by propagating labels under constraints. Phys. Rev. E, 009, 80(): 069. [5] Arenas A, Duch J, Fernández A, Gómez S. Size reduction of complex networks preserving modularity. New Journal of Physics, 007, 9: 76. [53] Davis A, Gardner B B, Gardner M R. Deep South. Chicago: University of Chicago Press, IL, 94. [54] Breiger R, Carley K, Pattison P (eds.) Dynamic Social Network Modeling and Analysis: Workshop Summary and Papers, Washington, DC: The National Academics Press, USA, 003. [55] Jaccard P. Étude comparative de la distribution florale dans une portion des alpes et des jura. Bull. Soc. Vaudoise Sci. Nat., 90, 37: [56] Lü L, Zhou T. Link prediction in complex networks: A survey. Physica A, 0, 390: Xin Liu is a Ph.D. candidate in the Department of Computer Science, Tokyo Institute of Technology. He received the B.S. degree in computing and information science from Wuhan University of Technology in 004, and the M.S. degree in computer science from Wuhan University in 007. His research interests include Web mining and social network analysis.

Tsuyoshi Murata is an associate professor in the Department of Computer Science, Graduate School of Information Science and Engineering, Tokyo Institute of Technology. He obtained his doctor's degree in computer science at Tokyo Institute of Technology in 1997, on the topic of machine discovery of geometrical theorems. At Tokyo Institute of Technology, he conducts research on Web mining, artificial intelligence, and social network analysis. He is a member of IEEE, AAAI, ACM, JSAI, IPSJ and JSSST.

Appendix Implementation Issues

In this appendix, we focus on the implementation issues of Algorithm. The core of the algorithm is LPA, where we sequentially and repeatedly update each node's label to reach a local minimum of $Q_H$. Without loss of generality, suppose we are now updating the label of node $v_x^{(1)}$. The updating rule is

$$S_x^{(1)} = \arg\min_{l} \, Q_H\left(S_x^{(1)} = l\right),$$

where $Q_H(S_x^{(1)} = l)$ denotes the quality function when $v_x^{(1)}$ takes the label $l$. This rule can be rewritten as

$$S_x^{(1)} = \arg\min_{l} \left(\Phi(l) + \Psi(l)\right),$$

$$\Phi(l) = \sum_{\alpha_2=1}^{c^{(2)}} \cdots \sum_{\alpha_K=1}^{c^{(K)}} \left[ \log \binom{\left(1 + \sum_{i_1 \neq x} \delta(S^{(1)}_{i_1}, l)\right) \prod_{k=2}^{K} \sum_{i_k=1}^{n^{(k)}} \delta(S^{(k)}_{i_k}, \alpha_k)}{\sum_{i_2, \ldots, i_K} A_{x i_2 \cdots i_K} \prod_{k=2}^{K} \delta(S^{(k)}_{i_k}, \alpha_k) + \sum_{i_1 \neq x} \sum_{i_2, \ldots, i_K} A_{i_1 i_2 \cdots i_K} \, \delta(S^{(1)}_{i_1}, l) \prod_{k=2}^{K} \delta(S^{(k)}_{i_k}, \alpha_k)} - \log \binom{\left(\sum_{i_1 \neq x} \delta(S^{(1)}_{i_1}, l)\right) \prod_{k=2}^{K} \sum_{i_k=1}^{n^{(k)}} \delta(S^{(k)}_{i_k}, \alpha_k)}{\sum_{i_1 \neq x} \sum_{i_2, \ldots, i_K} A_{i_1 i_2 \cdots i_K} \, \delta(S^{(1)}_{i_1}, l) \prod_{k=2}^{K} \delta(S^{(k)}_{i_k}, \alpha_k)} \right],$$

$$\Psi(l) = \begin{cases} n^{(1)} \log\left(c^{(1)} + 1\right) - n^{(1)} \log c^{(1)} + \prod_{k=2}^{K} c^{(k)} \log(m + 1), & \text{if } \sum_{i_1 \neq x} \delta(S^{(1)}_{i_1}, l) = 0; \\ 0, & \text{otherwise,} \end{cases}$$

where $\delta(\cdot, \cdot)$ is the Kronecker delta (the proof is omitted here due to space constraints).

In the implementation, the numbers of (hyper)edges between communities, namely $M$, are kept up to date in real time. Other stored data include the (hyper)network's connectivity array $A$, the community membership vectors $S^{(k)}$, and the number of nodes in each community, $n^{(k)}_{\alpha_k}$. To check a candidate label $l$ as the new label of $v_x^{(1)}$, we need to calculate $\Phi(l)$ and $\Psi(l)$.
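The label update at the core of the algorithm can be sketched generically. The sketch below assumes the quality function is supplied as a hypothetical callable `quality(labels)` returning the value to be minimized. It re-evaluates the whole function for every candidate label instead of using the incremental Φ(l) + Ψ(l) form, and it only considers labels already in use, so it illustrates the update rule rather than the efficient implementation discussed here. All names are illustrative.

```python
def label_propagation(labels, quality, max_sweeps=100):
    """Greedy label updates until no single-node change lowers `quality`.

    `labels` is a list with one community label per node; `quality` maps
    such a list to a number (lower is better). Returns (labels, value).
    """
    labels = list(labels)
    best = quality(labels)
    for _ in range(max_sweeps):
        changed = False
        for x in range(len(labels)):
            original = labels[x]
            best_label = original
            # try every label currently in use as a candidate for node x
            for candidate in sorted(set(labels)):
                if candidate == original:
                    continue
                labels[x] = candidate
                value = quality(labels)
                if value < best:              # strict improvement only
                    best, best_label = value, candidate
            labels[x] = best_label            # keep the argmin label
            changed = changed or best_label != original
        if not changed:
            break                             # local minimum reached
    return labels, best
```

Accepting only strict improvements guarantees that the quality value never increases, so the loop stops at a local minimum or after `max_sweeps` sweeps, whichever comes first.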
To compute $\Phi(l)$, we traverse the communities that have connections with $v_x^{(1)}$ or with the community labeled $l$. This operation requires $O(d^{(1)})$ time, where $d^{(1)}$ is the average degree of nodes in $V^{(1)}$. $\Psi(l)$ can be calculated in $O(1)$ time. Suppose we consider all existing labels as candidate labels. In the worst case (i.e., the initial stage, when each node forms its own community), the number of candidate labels is as large as $n^{(1)}$. The running time of updating the label of $v_x^{(1)}$ is then $O(d^{(1)} n^{(1)}) = O(m)$. Further, one step of label updating (i.e., sequentially updating each node's label once) requires $O\left(m \sum_{k=1}^{K} n^{(k)}\right)$ time. Relatively few steps are needed for LPA to converge [37,50-5]. The number of passes of Phase 1 and Phase 2 is also small in practice (see the running details shown below). As a result, the overall time complexity of our algorithm is $O\left(m \sum_{k=1}^{K} n^{(k)}\right)$. The speed of the algorithm can be further improved. Possible techniques are as follows.
