Network community detection with edge classifiers trained on LFR graphs


Twan van Laarhoven and Elena Marchiori
Department of Computer Science, Radboud University Nijmegen, The Netherlands

Abstract. Graphs generated using the Lancichinetti-Fortunato-Radicchi (LFR) model are widely used for assessing the performance of network community detection algorithms. This paper investigates an alternative use of LFR graphs: as training data for learning classifiers that discriminate between edges that lie within a community and edges that run between communities. The LFR generator has a parameter that controls the extent to which communities are mixed, and hence harder to detect. We show experimentally that a linear edge-wise weighted support vector machine classifier trained on a graph with more mixed communities also works well when tested on easier graph instances, while it achieves mixed performance on real-life networks, with a tendency towards finding many communities.

1 Introduction

Network community detection is the task of identifying communities in a graph. Informally, a community is a set of nodes such that there are many edges inside the community and relatively few edges linking it to the rest of the graph. Here we consider the traditional view of the community structure of a graph as a partition of its nodes into groups such that each group is a community.

A simple way to find communities in a graph is to identify and remove edges that connect nodes belonging to different communities. The connected components of the resulting graph are the communities (see for instance [1, 2]). The question of community detection then becomes a question of how to pick the set of edges to remove. In this paper we investigate the effectiveness of a direct approach to this task, based on supervised learning.

Using supervised learning requires training data, that is, graphs with a known community structure. We consider training data generated using the LFR model [3].
This model accounts for the fact that complex real-life networks are characterized by heterogeneous distributions of degree and community sizes, and that the degrees of nodes, as well as the sizes of communities, follow a power-law distribution. Constructing LFR graphs is efficient, with complexity linear in the number of edges of the graph.

We consider a simple learning setting: linear classifiers acting on local features of an edge. Such features can be computed efficiently, since they are based only on the degrees of the edge's nodes and the number of triangles containing that edge. They are used in heuristic methods for community detection, e.g. [2].

This work has been partially funded by the Netherlands Organization for Scientific Research (NWO) within the NWO project 612.066.927.

We investigate empirically two main questions: (a) Is there one classifier trained on a specific type of LFR graphs that also generalizes well to graphs generated using other LFR parameter settings? (b) If such a classifier exists, what is its performance on real-world graphs? Extensive experiments indicate that one can build such a classifier, e.g. using the linear edge-wise weighted support vector machine. Its generalization performance on the considered real-world graphs is shown to be mixed, with a tendency to detect a high number of clusters. In general the results indicate that it is possible to learn a simple supervised model based on a few local topological features for identifying community structure in artificial and real-world networks. Therefore community detection, that is, graph clustering, can be addressed in a supervised setting.

2 Community detection by classifying edges

The problem of detecting communities in a graph $G = (V, E)$ can be formulated as a binary classification task, where the goal is to decide whether each edge $ab \in E$ is within a community or between communities. A classifier is a function $h$ that assigns a score to each edge, where edges with a positive score are considered to be within a community. Given these scores, we can construct the reduced subgraph, containing only the edges $ab \in E$ for which $h(x_{ab}) > 0$. In other words, the reduced graph contains only edges classified as being within a community. The connected components of this reduced graph are considered the graph's communities.

In order to use a classifier on edges, we need to associate a feature vector $x_{ab}$ to each edge $ab \in E$ connecting two nodes $a, b \in V$. We use simple features employed in heuristic methods for network analysis: the degrees $d_a, d_b$ of $a, b$, the number of triangles containing $ab$, and the mean values of these features. This gives a total of 8 features, all of which can be calculated efficiently.
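As a rough illustration of how cheap these local features are, the following Python sketch computes the degrees $d_a$, $d_b$ and the triangle count $t_{ab}$ for every edge of a small undirected graph. The function names and the toy graph are hypothetical and are not taken from the paper's source code; the mean-valued features that bring the total to 8 are omitted because their exact definition is not spelled out in this text.

```python
def build_adjacency(edges):
    """Build undirected adjacency sets from a list of (a, b) edges."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def edge_features(edges):
    """Map each edge (a, b) to the local features (d_a, d_b, t_ab)."""
    adj = build_adjacency(edges)
    features = {}
    for a, b in edges:
        # A triangle containing ab corresponds to a common neighbour of a and b.
        t_ab = len(adj[a] & adj[b])
        features[(a, b)] = (len(adj[a]), len(adj[b]), t_ab)
    return features


# Toy graph (hypothetical): a triangle 0-1-2 with a pendant edge 2-3.
toy_edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
toy_features = edge_features(toy_edges)
for edge in toy_edges:
    print(edge, toy_features[edge])
```

Each edge only needs the neighbour sets of its two endpoints, which is why the computation can be split across different parts of the graph.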
Because the features are local, this calculation can in principle also be done in parallel for different parts of the graph. We consider undirected graphs, where $ab \in E$ if and only if $ba \in E$. Therefore features should be symmetric, that is, $x_{ab} = x_{ba}$. To this end we replace the degrees with the minimum and maximum degree values, $\min(d_a, d_b)$ and $\max(d_a, d_b)$.

Several heuristic scoring functions have been used in the literature to rank edges. For instance, in [2] one of the score functions used to rank edges for performing community detection is the fraction of possible triangles that contain an edge $ab$,

$$s_{\mathrm{Radicchi}}(ab) = \frac{t_{ab} + 1}{\min(d_a - 1, d_b - 1)},$$

where $t_{ab}$ is the number of triangles containing the edge $ab$, which is equivalent to the number of paths of length 2 between $a$ and $b$. The algorithm then iteratively removes the edge with the lowest score from the graph, until some stopping criterion is reached. At that point, the graph will contain only edges with scores greater than some threshold $\tau$, that is, edges $ab$ such that $s_{\mathrm{Radicchi}}(ab) > \tau$. For the above score function we can rewrite this condition in the form of a linear classifier,

$$h(x_{ab}) = \langle (\tau + 1, -\tau, 0, 1), x_{ab} \rangle > 0,$$

where $x_{ab} = (1, \min(d_a, d_b), \max(d_a, d_b), t_{ab})$. Indeed, for $\min(d_a, d_b) > 1$, multiplying both sides of $s_{\mathrm{Radicchi}}(ab) > \tau$ by $\min(d_a, d_b) - 1$ gives $(\tau + 1) - \tau \min(d_a, d_b) + t_{ab} > 0$. This expression is in the standard form of a linear classifier, $\langle w, x \rangle > 0$.

The Jaccard similarity between the adjacency lists of $a$ and $b$ has also been used as a score function [4]. If $\mathrm{Adj}(a)$ denotes the set of nodes adjacent to $a$, this Jaccard similarity can be written as

$$s_{\mathrm{Jaccard}}(ab) = \frac{|\mathrm{Adj}(a) \cap \mathrm{Adj}(b)|}{|\mathrm{Adj}(a) \cup \mathrm{Adj}(b)|}.$$

Since $|\mathrm{Adj}(a) \cap \mathrm{Adj}(b)| = t_{ab}$ and $|\mathrm{Adj}(a) \cup \mathrm{Adj}(b)| = d_a + d_b - t_{ab}$, the condition $s_{\mathrm{Jaccard}}(ab) > \tau$ can also be rewritten as a linear classifier, $\langle (0, -\tau, -\tau, 1 + \tau), x_{ab} \rangle > 0$.

Instead of using such a fixed score function, we can try to learn one based on a training graph. The learning problem looks like a standard linear classification problem. For each edge $ab \in E$ we use the features $x_{ab}$ defined above, and labels $y_{ab}$ such that $y_{ab} = 1$ if $a$ and $b$ are in the same community, and $y_{ab} = -1$ otherwise. The problem of finding the weight vector $w$ is then a linear classification problem that can be solved by, for instance, a weighted Support Vector Machine. We call this the edge-wise classification problem.

Note that incorrectly keeping a between-community edge usually results in a much larger error in the community structure than incorrectly removing a within-community edge, because a single kept between-community edge merges two communities into one connected component. Therefore the weight of negative training instances should be higher.

In extreme cases, all edges incident to a node could be classified as between-community edges. In that case, the connected component containing that node would be a singleton. However, a single node alone is not a community. We found that we can significantly improve the quality of the clustering if we avoid such singletons. To enforce that all communities consist of at least two nodes, for each node $a$ we keep the edge with the highest classifier score $h(x_{ab})$, regardless of whether this score is positive.

3 LFR benchmark as training data

To learn a classifier, we need training data, i.e. one or more graphs with a known community structure.
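Given such a graph with known communities, the edge labels $y_{ab}$ of Section 2 follow directly from the node-to-community assignment. The Python sketch below derives the labels together with per-instance weights that emphasize between-community edges. The helper names and the value `negative_weight=4.0` are placeholders chosen for illustration only; they are not values or code from the paper.

```python
def edge_labels(edges, community):
    """Return y_ab = +1 for within-community edges and -1 otherwise."""
    return [1 if community[a] == community[b] else -1 for a, b in edges]


def instance_weights(labels, negative_weight=4.0):
    """Weight between-community (negative) edges more heavily.

    negative_weight is a hypothetical placeholder, not a value from the paper.
    """
    return [negative_weight if y < 0 else 1.0 for y in labels]


# Toy training graph (hypothetical): two groups joined by the edge (2, 3).
train_edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (3, 5)]
train_community = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}

train_labels = edge_labels(train_edges, train_community)
train_weights = instance_weights(train_labels)
print(train_labels)
print(train_weights)
```

Labels and weights in this form could then be handed to any linear classifier that supports weighted training instances.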
For real-world applications, such training data can be hard to come by. Therefore, we instead use benchmarks specifically constructed to closely resemble real-world graphs. A popular example is the LFR benchmark by Lancichinetti, Fortunato, and Radicchi [3]. In this benchmark, the size of each community is drawn from a power-law distribution, as is the degree of each node. Clauset et al. [5] have previously observed that real-world graphs also have such a power-law degree distribution. Therefore, we hope that by using LFR graphs as training data, we can train classifiers that also work well on real-world testing graphs.

The LFR model has several parameters. The most important one is the mixing parameter $\mu$, which controls the fraction of edges that are between communities. Essentially this can be viewed as the amount of noise in the graph. If $\mu = 0$ all edges are within-community edges; if $\mu = 1$ all edges are between nodes in different communities. Other parameters control the number of nodes, the distribution of community sizes, the distribution of degrees, etc. We used the LFR implementation from http://sites.google.com/site/santofortunato.

We consider the four benchmarks used in [6], two with small communities of between 10 and 50 nodes, and two with large communities of between 20 and 100 nodes. Each graph has either 1000 or 5000 nodes in total. Furthermore the GN benchmark (see [1]) is considered. This is an instance of LFR with community sizes equal to 32 nodes, and the degree of each node equal to 16. We refer to the resulting 5 classes of graphs as Small1000, Small5000, Big1000, Big5000 and GN.

4 Experiments

Since the true community structure is known for the benchmark graphs, we can assess the quality of the resulting clustering by means of the Normalized Mutual Information (NMI) metric, computed between a given output clustering and the true one [7].

As learning method we consider a linear edge-wise weighted SVM, using the LIBLINEAR implementation (available at www.csie.ntu.edu.tw/~cjlin/liblinear/). To select the hyperparameters of the learning method, namely the regularization parameter and the relative weight of within-community edges, we use a grid search procedure.

First, for each testing graph we used another graph generated with the same settings as training data. That means that the distribution of training and test graphs is the same. The left plot of figure 1 shows the NMI as a function of the mixing parameter.
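For readers who want to reproduce the vertical axis of such plots, here is a minimal Python sketch of the NMI between two partitions, using the normalization $2\,I(A;B) / (H(A) + H(B))$ associated with [7]. The function name `nmi` and the list-of-labels input format are assumptions of this sketch, not part of the paper's code.

```python
from collections import Counter
from math import log


def nmi(labels_a, labels_b):
    """Normalized mutual information 2*I(A;B) / (H(A) + H(B)).

    labels_a[i] and labels_b[i] are the cluster labels of node i
    in the two partitions being compared.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both partitions must label the same nodes")
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))

    mutual = sum(
        c / n * log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in joint.items()
    )
    entropy_a = -sum(c / n * log(c / n) for c in count_a.values())
    entropy_b = -sum(c / n * log(c / n) for c in count_b.values())

    if entropy_a + entropy_b == 0:
        # Both partitions consist of one cluster and are therefore identical.
        return 1.0
    return 2 * mutual / (entropy_a + entropy_b)


# Identical partitions up to renaming of the labels give the maximum score.
print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))
# Statistically independent partitions give the minimum score.
print(nmi([0, 0, 1, 1], [0, 1, 0, 1]))
```

With this normalization the score lies between 0 and 1 and does not depend on how the clusters are named.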
The right plot in figure 1 shows the NMI for a classifier that was trained on a big community graph with 1000 nodes and $\mu = 0.5$. The same classifier was used for all tests. The two plots in figure 1 show that its performance is similar to that obtained with a classifier trained for the specific network generation parameter settings. This shows that the classifier generalizes well to other parameter settings of the LFR benchmark.

[Figure 1, two panels: Normalized Mutual Information (vertical axis, 0 to 1) against the mixing parameter $\mu$ (horizontal axis, 0 to 0.8), with one curve each for Small1000, Small5000, Big1000, Big5000 and GN.]

Figure 1: The performance of classifier-based community detection on graphs generated with the LFR benchmark, as measured with Normalized Mutual Information. The error bars show the standard deviation across different training and testing datasets. In the left plot a new classifier is trained for each graph. In the right plot the same classifier is used in all cases, trained on a big community graph (Big1000) with $\mu = 0.5$.

Further experiments indicate that, in general, classifiers trained on benchmark graphs with small communities also work well for graphs with large communities, and vice versa. Additionally, classifiers trained with a particular mixing parameter $\mu$ still perform well on graphs with less mixing. The converse is not true, however, since graphs with low $\mu$ are essentially too easy. They do not have any examples of between-community edges with a larger number of triangles, which do appear with higher $\mu$.

The reason for this generalization result is that the good linear classifiers found by the training algorithm are very similar. This suggests that for a particular set of features there might be a large class of graphs for which the same scoring or classification function is optimal.

Table 1: Results on real-world datasets.

            Normalized Mutual Information               Number of communities
Dataset     Classifier   R. Weak   R. Strong   Infomap   Actual   Classifier
Zachary     0.649        0         0           0.568     2        4
Football    0.923        0.908     0.201       0.924     12       15
PolBooks    0.522        0         0           0.537     3        9
PolBlogs    0.134        0.014     0.014       0.340     2        322

If the LFR benchmark is representative of real-world data, then the classifiers we learned for artificial graphs should also generalize to real-world graphs. Since the classifier trained on the Big1000 dataset with $\mu = 0.5$ generalizes well to other LFR benchmark graphs, it is interesting to examine its performance on real-world datasets. We consider real-life networks employed in previous studies. Zachary: Zachary's karate club; Football: a network of American football games; Political books: a network of books about US politics; Political blogs: hyperlinks between weblogs on US politics.
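Applying a trained classifier to any of these networks follows the procedure of Section 2: keep the positively scored edges, additionally keep each node's highest-scoring edge so that no node with at least one edge ends up alone, and read off the connected components. The following Python sketch assumes the edge scores $h(x_{ab})$ have already been computed; the function name, the toy scores, and the use of a union-find structure are choices made for this illustration rather than details taken from the paper.

```python
def detect_communities(nodes, scored_edges):
    """Return communities as a list of node sets.

    scored_edges is a list of (a, b, score) triples, where score plays the
    role of the classifier output h(x_ab) for the undirected edge ab.
    """
    parent = {v: v for v in nodes}

    def find(v):
        # Follow parent links to the root, halving the path on the way.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    def union(a, b):
        parent[find(a)] = find(b)

    best = {}  # node -> (score, a, b) of its highest-scoring incident edge
    for a, b, score in scored_edges:
        if score > 0:
            union(a, b)  # edge classified as within-community
        for v in (a, b):
            if v not in best or score > best[v][0]:
                best[v] = (score, a, b)

    # Singleton avoidance: keep every node's best edge, even if its score <= 0.
    for _, a, b in best.values():
        union(a, b)

    groups = {}
    for v in nodes:
        groups.setdefault(find(v), set()).add(v)
    return list(groups.values())


# Toy scores (hypothetical): node 5 has only one, negatively scored, edge.
toy_scored_edges = [
    (0, 1, 1.0), (1, 2, 0.5), (2, 3, -0.2), (3, 4, 0.7), (4, 5, -0.3),
]
toy_communities = detect_communities(range(6), toy_scored_edges)
print(sorted(sorted(c) for c in toy_communities))
```

In the toy example node 5 would be a singleton under the plain threshold rule; keeping its best edge attaches it to the community of node 4. Nodes without any edge still remain singletons in this sketch.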
We compare the results against those of the Infomap method [8], a representative of the state of the art in network community detection [6], and the two methods proposed in [2], because they are similar to the classifier-based method. Indeed they also use the number of triangles and node degrees as features, and make local decisions for each edge. The two methods differ in their stopping criterion: one is based on weak communities, the other on strong communities. We refer to them as R. Weak and R. Strong, respectively.

Table 1 shows the results. For the Zachary, Football and Political books networks, the results of the classifier are close to or better than those of Infomap, and better than those of R. Weak and R. Strong. This does not hold for the Political blogs network: the actual network is considered to have just 2 communities, but the classifier finds a much larger number. Radicchi's method has problems with these graphs, often returning a single cluster that contains all nodes.

5 Conclusion

We investigated how community detection can be approached as a classification problem, using LFR graphs as training data. An inherent limitation of the proposed method is that it removes edges in one go, while existing algorithms for community detection based on edge removal incorporate an adaptive mechanism in which the scores of the edges are updated each time an edge (or set of edges) is removed. How to incorporate such a mechanism in a supervised setting without making the training process too complex is an open issue.

An extended version of this paper and source code are available from http://cs.ru.nl/~t.vanlaarhoven/learning-communities-2013/.

References

[1] M. Girvan and M. E. J. Newman. Community structure in social and biological networks. Proceedings of the National Academy of Sciences of the United States of America, 99(12):7821-7826, 2002.
[2] F. Radicchi, C. Castellano, F. Cecconi, V. Loreto, and D. Parisi. Defining and identifying communities in networks. Proceedings of the National Academy of Sciences of the United States of America, 101(9):2658-2663, 2004.
[3] A. Lancichinetti, S. Fortunato, and F. Radicchi. Benchmark graphs for testing community detection algorithms. Phys. Rev. E, 78(4), 2008.
[4] Y.-Y. Ahn, J. P. Bagrow, and S. Lehmann. Link communities reveal multiscale complexity in networks. Nature, 466:761-764, 2010.
[5] A. Clauset, C. R. Shalizi, and M. E. J. Newman. Power-law distributions in empirical data. SIAM Rev., 51:661-703, November 2009. ISSN 0036-1445.
[6] A. Lancichinetti and S. Fortunato. Community detection algorithms: A comparative analysis. Phys. Rev. E, 80:056117, Nov 2009.
[7] L. Danon, J. Duch, A. Arenas, and A. Díaz-Guilera. Comparing community structure identification. Journal of Statistical Mechanics: Theory and Experiment, 2005(09):P09008, 2005.
[8] M. Rosvall and C. T. Bergstrom. Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences of the United States of America, 105(4):1118-1123, January 2008.