An Empirical Analysis of Communities in Real-World Networks


Chuan Sheng Foo
Computer Science Department, Stanford University
csfoo@cs.stanford.edu

ABSTRACT
Little work has been done on the characterization of community structure as opposed to the design of algorithms for detecting communities. In this work, we perform an analysis of real-world communities by computing a variety of features and comparing their distributions to those of features computed on randomly generated communities. We find that real-world communities are more internally well-connected than their random counterparts, and that conductance does not necessarily differentiate real from random communities.

1. INTRODUCTION
Much work in the field of networks has focused on the characterization and modelling of various properties of real-world networks. This includes properties such as power-law degree distributions, the small-world property of networks, densification of networks, and shrinking diameters [6]. The characterization of these global network properties has led to the development of network models that are able to generate new networks that accurately replicate the characteristics of their real-world counterparts [5]. In addition, recent work has also focused on discovering the local structure of networks, investigating the process of how new connections are formed in real-world networks [4].

In comparison, far less work has been done on the intermediate level of organization in networks, which involves the characterization of groups or communities in networks. Instead, most work on the community level of network organization has focused on algorithms for detecting communities in networks. Such approaches tend to be unsupervised, and focus on finding groups of nodes that maximize some pre-defined quality score that is thought to accurately describe the properties of a community (see [2] for a recent review and discussion).
Ideally, one would like to have a community detection algorithm that could automatically learn the salient characteristics of given sample communities in networks, and use this information to detect other novel communities. In other words, one would like a supervised approach to community detection that uses empirically derived properties of communities to perform detection. The key advantage of such an approach lies precisely in this use of empirically derived properties: it eliminates the need to build into the detection algorithm predefined ideas of what communities look like, which may not be an accurate representation of reality.

This work is a first step towards the goal of supervised community detection, providing an empirical analysis of various communities in real-world networks. It may be seen as an extension of the work in [7], which analyzes community structure in real-world networks, where communities are defined as subsets of nodes with low conductance. In comparison, we analyze ground-truth communities by computing various features on them and comparing the feature distributions with those of various randomized baselines. The reasoning behind this approach is that features which discriminate between real and randomized communities are those which define a community.

Our analysis yielded greater insight into the structure of a community, showing that the common intuition that a community is a group of nodes that are better connected to each other than to the rest of the network is not entirely true. Specifically, we discovered that communities are not necessarily easier to separate from the rest of the network than their randomized counterparts, although true communities are indeed more internally well-connected than their randomized counterparts.
Besides having ramifications for community detection algorithms based on these assumptions about community structure, our discoveries also suggest new hypotheses about the structure of communities, which we describe in the discussion section.

2. METHODOLOGY
In order to understand the structure of real-world communities, our approach was to find features of communities that would allow us to discriminate them from random subsets of nodes. We therefore computed a variety of features on communities, and then compared them to the values of the same features on randomized versions of the communities. We first describe the features that we computed and then describe how randomized communities were constructed for the purposes of comparison.

2.1 Features used to describe communities
Let $G(V, E)$ be an undirected graph with $n = |V|$ nodes and $m = |E|$ edges. A community is then defined to be a subset of nodes $S \subseteq V$ in $G$. We define the following quantities: the number of nodes in $S$, $n_S = |S|$; the number of edges in $S$, $m_S = |\{(u, v) : u \in S, v \in S\}|$; and the number of edges on the boundary of $S$, $c_S = |\{(u, v) : u \in S, v \notin S\}|$. Also, let $d(v)$ denote the degree of node $v$. We computed the following 17 features for each community $S$:

Edges inside: $m_S$, the number of edges inside the community.
Edges cut: $c_S$, the number of edges that need to be removed to disconnect the nodes in $S$ from the rest of the network.
Expansion: $\frac{c_S}{n_S}$, the number of edges per node that point outside the community.
Internal density: $1 - \frac{m_S}{n_S (n_S - 1)/2}$, based on the internal edge density of the community $S$.
Conductance: $\frac{c_S}{2 m_S + c_S}$, the fraction of total edge volume that points outside the community [9, 3].
Normalized Cut: $\frac{c_S}{2 m_S + c_S} + \frac{c_S}{2 (m - m_S) + c_S}$ [9].
Cut Ratio: $\frac{c_S}{n_S (n - n_S)}$, the fraction of all possible edges leaving the community.
Maximum-ODF (Out Degree Fraction): $\max_{u \in S} \frac{|\{(u, v) : v \notin S\}|}{d(u)}$, the maximum fraction of edges of a node pointing outside the community [1].
Average-ODF: $\frac{1}{n_S} \sum_{u \in S} \frac{|\{(u, v) : v \notin S\}|}{d(u)}$, the average fraction of a node's edges pointing outside the community [1].
Flake-ODF: $\frac{|\{u : u \in S, |\{(u, v) : v \in S\}| < d(u)/2\}|}{n_S}$, the fraction of nodes in $S$ that have fewer edges pointing inside than to the outside of the community [1].
Volume: $\sum_{u \in S} d(u)$, the sum of the degrees of the nodes in $S$.
Number of open triads
Number of closed triads
Ratio of number of open triads to closed triads
Clustering coefficient
Size of largest connected component in S
Size of the largest connected component as a fraction of the size of S

Table 1: Statistics for networks used in this work

Dataset        # Nodes     # Edges    # Communities
DBLP            851523     1135266             3050
Flickr          584207     2257488            14051
LinkedIn       7550955    29161896              147
LiveJournal    4847571    42851237           385959

2.2 Random baselines used for comparison
To determine which of the abovementioned features truly characterize real communities, we sought to see if the distributions of these features would remain the same on communities that were randomly generated. To do so, we adopted the following 3 procedures to generate random communities:

Network rewiring: We keep the node sets of the communities the same, but we rewire the underlying network using an edge switching procedure as described in [8]. This results in communities having a different internal edge structure due to the rewired network.
Randomized memberships: We randomly assign nodes in the graph to communities while keeping the number of communities, the number of members in each community, and the number of communities each node is a member of the same as in the original data. This is achieved using a membership swapping technique, where we pick two random nodes and swap their community memberships for a randomly chosen community from each node.
Randomized memberships (neighbor constrained): This is similar to the randomized memberships procedure, except that the two nodes whose memberships are swapped are required to be neighbors in the graph. To do this, we pick a random node from the graph, and then pick one of its neighbors to execute the membership swap as before.

3. EXPERIMENTAL SETUP
We computed the 17 described statistics on communities and their randomized counterparts, using the 3 methods previously described, on the following 4 datasets. Basic statistics of these datasets may be found in Table 1.
DBLP co-authorship network: The DBLP network is constructed from publications at various Computer Science conferences and journals. Authors are nodes in the network, while edges are defined by co-authorship on a publication; each publication gives rise to a clique on the authors of the publication. Communities in this network are defined by the publication venue, e.g., KDD, ICML, FOCS.
Flickr photo-sharing network: Members of the Flickr network can post and tag photos. There is also an underlying social network of friendships, which we used as the network in this work. Communities are then defined by co-tagging: two members are in the same community if they have photos tagged by other members with the same tag; we did not consider tags provided by members on their own photos.

LinkedIn social network: Here we have data on the underlying social network, as well as community data defined by the various industries that members work in. Due to the coarse definition of an industry, the communities in this dataset tend to be rather large (at least a few thousand members).
LiveJournal blogging network: This is the network of bloggers and their friends. LiveJournal has a concept of communities which bloggers join based on similar interests. We used these communities as the ground truth in this work. This dataset is probably the most useful of the 4, since the communities are user-defined rather than defined by us in this work.

Figure 3: Plot of the average of the size of the largest connected component in the community as a fraction of the community size, versus community size, for the LiveJournal dataset. Each point in the plot gives the average fraction over all communities of the particular size given on the x-axis.

We then plotted the average and median values of each feature versus the community size, as well as the distribution of each feature over various community size ranges. For each feature, we manually inspected the plots to determine if the feature was significant, by checking for a separation between the points for the original communities and their randomized counterparts.

4. RESULTS
4.1 Communities may not have low conductance
As there are many algorithms based on the assumption that communities are subgraphs that are easily separated from the network (for example, those based on graph cuts), we decided to investigate whether real communities have this property. One popular measure used to quantify the notion that a community is easily separated is conductance (see Section 2.1 for a more precise definition): if a community has low conductance, it is more easily separated from the rest of the network. We found that the data does not entirely support this assumption.
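For reference, the cut-based features of Section 2.1 that this section relies on, conductance in particular, could be computed along the following lines. This is only a minimal sketch and not the code used in this work: the adjacency-set graph representation, the function name `cut_features`, and the choice to return 0 when a denominator vanishes are all assumptions made for illustration.

```python
from typing import Dict, Set

Graph = Dict[int, Set[int]]  # assumed representation: undirected adjacency sets


def cut_features(adj: Graph, community: Set[int]) -> Dict[str, float]:
    """Compute a few of the Section 2.1 features for one community S."""
    if not community:
        raise ValueError("community must contain at least one node")
    n = len(adj)                                       # n = |V|
    m = sum(len(nbrs) for nbrs in adj.values()) // 2   # m = |E|
    n_s = len(community)                               # n_S = |S|
    inside_endpoints = 0                               # counts each inside edge twice
    c_s = 0                                            # edges on the boundary of S
    for u in community:
        for v in adj[u]:
            if v in community:
                inside_endpoints += 1
            else:
                c_s += 1
    m_s = inside_endpoints // 2                        # edges inside S
    volume = 2 * m_s + c_s                             # sum of degrees in S
    rest = 2 * (m - m_s) + c_s                         # second normalized-cut denominator
    # Assumption: a ratio with a zero denominator is reported as 0.
    conductance = c_s / volume if volume else 0.0
    return {
        "edges_inside": m_s,
        "edges_cut": c_s,
        "expansion": c_s / n_s,
        "conductance": conductance,
        "normalized_cut": conductance + (c_s / rest if rest else 0.0),
        "cut_ratio": c_s / (n_s * (n - n_s)) if n_s < n else 0.0,
        "volume": volume,
    }


# Toy usage (hypothetical data): two triangles joined by a single edge.
toy = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(cut_features(toy, {0, 1, 2}))
```

On the toy graph, the community {0, 1, 2} has three edges inside and one boundary edge, so its conductance is 1/7. A random node subset of the same size would typically cut more edges and score higher.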
From the plots of the average conductance versus community size shown in Figure 1, we observe that the conductance of ground-truth communities is not appreciably different from that of randomized communities in the Flickr and LiveJournal networks, across all community sizes. We observe some separation in the DBLP plot, but this is probably an artefact of the way the network was constructed. In the construction of the network, all authors of a single publication are presumed to be connected, so the network is composed of overlapping cliques. This could result in a lower conductance score, since there are likely to be more edges inside the communities due to the cliques. The separation in the LinkedIn plot is possibly also due to the coarse definitions of the industries which were used to define communities, and to the properties of the network. As it is a social network used for professional networking, people are more likely to have friends within their own industries, and if the industries are coarsely defined, then there are few cross-community edges, resulting in a lower conductance.

4.2 Communities have greater internal connectivity
While real communities may not necessarily be more easily separated from the rest of the network than random communities, they have greater internal connectivity than their randomized counterparts. This is shown in the plots of Figure 2, which show the average number of edges inside the community versus community size. As opposed to the plots for conductance, here there is a clear separation between the true communities and their randomized counterparts. In all 4 networks, we see that across all community sizes, the true communities have a greater number of edges inside.
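The randomized counterparts in these comparisons come from the procedures of Section 2.2. As a rough illustration of the randomized-memberships baseline, the sketch below swaps one randomly chosen community membership between two randomly chosen nodes. It is not the code used in this work: the function name `randomize_memberships`, the dictionary representation, and the rule of skipping a swap that would create a duplicate membership are assumptions introduced here.

```python
import random
from typing import Dict, List, Set


def randomize_memberships(
    memberships: Dict[int, Set[str]],
    num_swaps: int,
    seed: int = 0,
) -> Dict[int, Set[str]]:
    """Shuffle community memberships by repeated pairwise swaps.

    Community sizes and the number of communities per node are preserved.
    """
    rng = random.Random(seed)
    # Work on a copy so the caller's data is left untouched.
    result = {node: set(comms) for node, comms in memberships.items()}
    # Only nodes with at least one membership can take part in a swap.
    nodes: List[int] = [node for node, comms in result.items() if comms]
    if len(nodes) < 2:
        return result
    for _ in range(num_swaps):
        u, v = rng.sample(nodes, 2)          # two distinct random nodes
        a = rng.choice(sorted(result[u]))    # one of u's communities
        b = rng.choice(sorted(result[v]))    # one of v's communities
        # Assumption: skip swaps that would give a node a community it already
        # belongs to, since that would shrink its membership count.
        if b in result[u] or a in result[v]:
            continue
        result[u].remove(a)
        result[u].add(b)
        result[v].remove(b)
        result[v].add(a)
    return result


# Toy usage (hypothetical data): five nodes, four communities.
original = {0: {"a", "b"}, 1: {"a"}, 2: {"b", "c"}, 3: {"c"}, 4: {"d"}}
print(randomize_memberships(original, num_swaps=100, seed=1))
```

The neighbor-constrained variant would differ only in how the pair is drawn: pick `u` at random and then `v` from the neighbors of `u` in the graph, leaving the swap itself unchanged.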
We also see that communities on the rewired network tend to have the fewest edges inside, which suggests that rewiring the network may not be a good random baseline, as it destroys much of the network structure. In addition, in plots of the size of the largest connected component in the community as a fraction of the community size, we found that real communities have a much higher fraction of their nodes in the largest connected component than random communities do. An illustrative plot is shown in Figure 3 for the LiveJournal dataset; the observed trend is similar for the other 3 networks, but the separation was not as significant.

Figure 1: Plots showing the average conductance versus community size. Each point in the plot represents the average conductance over all communities of the particular size given on the x-axis. (a) DBLP (b) Flickr (c) LinkedIn (d) LiveJournal

Figure 2: Plots showing the average number of edges inside the community versus community size. Each point in the plot represents the average number of edges over all communities of the particular size given on the x-axis. (a) DBLP (b) Flickr (c) LinkedIn (d) LiveJournal

5. CONCLUSIONS AND FUTURE WORK
In this work we gained a better understanding of community structure by analysing the distributions of various features computed on real and random communities. Our observations suggest that communities may not have low conductance but do have greater internal connectivity. Out of the 17 measures we computed, most others did not show any significant signal that could be used to differentiate real communities from random ones.

Another possible explanation for our observations with regard to the conductance plots is that we are overcounting the number of edges on the boundary of a community: such edges may be between nodes that, with respect to the community in question, are not in the same community, but may be together in a different community altogether. Thus, we should not need to separate these nodes as they are, after all, in the same community. In other words, we need to take into account the fact that nodes may be part of multiple communities, and we should not count edges on the boundary between nodes that belong together in a different community. To this end, we have re-computed the measures with this intuition in mind, only counting boundary edges if their endpoints do not share any community memberships. Preliminary results show that this is indeed an important factor, as seen from Figure 4, a plot of the average conductance versus community size on the LiveJournal dataset. As compared to the similar plot in Figure 1(d), we observe that there is now a separation between the real and random communities, especially at smaller community sizes up to about 1000. The corresponding plot for DBLP showed that all communities had zero conductance once multiple community memberships were taken into consideration, suggesting that it is perhaps easy to find communities in such a graph due to the artefacts of graph construction.

Figure 4: Plot of the average conductance versus community size in the LiveJournal dataset. Each point in the plot represents the average conductance over all communities of the particular size given on the x-axis.

In the future, we will analyse more datasets to see if our observations hold in general. We will then use these observations to design a community detection algorithm, based not on preconceived ideas of what a community looks like, but on actual insight gained from empirical studies such as this one.

6. ACKNOWLEDGEMENTS
I would like to thank Professor Leskovec for his insights and guidance throughout the course of the project.

7. REFERENCES
[1] G. W. Flake, S. Lawrence, and C. L. Giles. Efficient identification of web communities. In Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 150-160. ACM Press, 2000.
[2] S. Fortunato. Community detection in graphs. arXiv:0906.0612v1 [physics.soc-ph], 2009.
[3] R. Kannan, S. Vempala, and A. Vetta. On clusterings: Good, bad and spectral. J. ACM, 51(3):497-515, 2004.
[4] J. Leskovec, L. Backstrom, R. Kumar, and A. Tomkins. Microscopic evolution of social networks. In KDD '08: Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 462-470, New York, NY, USA, 2008. ACM.
[5] J. Leskovec and C. Faloutsos. Scalable modeling of real graphs using kronecker multiplication. In ICML '07: Proceedings of the 24th international conference on Machine learning, pages 497-504, New York, NY, USA, 2007. ACM.
[6] J. Leskovec, J. Kleinberg, and C. Faloutsos. Graph evolution: Densification and shrinking diameters. ACM Trans. Knowl. Discov. Data, 1(1):2, 2007.
[7] J. Leskovec, K. J. Lang, A. Dasgupta, and M. W. Mahoney. Community structure in large networks: Natural cluster sizes and the absence of large well-defined clusters. CoRR, abs/0810.1355, 2008.
[8] R. Milo, N. Kashtan, S. Itzkovitz, M. E. J. Newman, and U. Alon. On the uniform generation of random graphs with prescribed degree sequences. May 2004.
[9] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell., 22(8):888-905, 2000.