Community Structure Detection. Amar Chandole Ameya Kabre Atishay Aggarwal

Similar documents
Modularity CMSC 858L

Fast algorithm for detecting community structure in networks

Oh Pott, Oh Pott! or how to detect community structure in complex networks

Social-Network Graphs

TELCOM2125: Network Science and Analysis

Community Structure and Beyond

V4 Matrix algorithms and graph partitioning

Mining Social Network Graphs

Finding and evaluating community structure in networks

Extracting Communities from Networks

SOMSN: An Effective Self Organizing Map for Clustering of Social Networks

arxiv: v1 [physics.data-an] 27 Sep 2007

Spectral Methods for Network Community Detection and Graph Partitioning

Non Overlapping Communities

Clusters and Communities

Clustering Algorithms on Graphs Community Detection 6CCS3WSN-7CCSMWAL

Social Data Management Communities

A new Pre-processing Strategy for Improving Community Detection Algorithms

Community Detection by Affinity Propagation

Overlapping Communities

My favorite application using eigenvalues: partitioning and community detection in social networks

Extracting Information from Complex Networks

Understanding complex networks with community-finding algorithms

1 Homophily and assortative mixing

Community Detection in Social Networks

CS224W: Analysis of Networks Jure Leskovec, Stanford University

Community Detection. Community

Online Social Networks and Media. Community detection

Community Detection Methods using Eigenvectors of Matrices

Detecting community structure in networks

Visual Representations for Machine Learning

Betweenness Measures and Graph Partitioning

A Novel Parallel Hierarchical Community Detection Method for Large Networks

Clustering Algorithms for general similarity measures

Clustering. Informal goal. General types of clustering. Applications: Clustering in information search and analysis. Example applications in search

Statistical Physics of Community Detection

Network community detection with edge classifiers trained on LFR graphs

ALTERNATIVES TO BETWEENNESS CENTRALITY: A MEASURE OF CORRELATION COEFFICIENT

Types of general clustering methods. Clustering Algorithms for general similarity measures. Similarity between clusters

Clustering on networks by modularity maximization

CSE 255 Lecture 6. Data Mining and Predictive Analytics. Community Detection

Applications. Foreground / background segmentation Finding skin-colored regions. Finding the moving objects. Intelligent scissors

An Exploratory Journey Into Network Analysis A Gentle Introduction to Network Science and Graph Visualization

A Novel Similarity-based Modularity Function for Graph Partitioning

7. Decision or classification trees

CSE 258 Lecture 6. Web Mining and Recommender Systems. Community Detection

Jure Leskovec, Cornell/Stanford University. Joint work with Kevin Lang, Anirban Dasgupta and Michael Mahoney, Yahoo! Research

Keywords: dynamic Social Network, Community detection, Centrality measures, Modularity function.

CSE 158 Lecture 6. Web Mining and Recommender Systems. Community Detection

Complex-Network Modelling and Inference

Bipartite Graph Partitioning and Content-based Image Clustering

Lesson 2 7 Graph Partitioning

Chapter 11. Network Community Detection

Dynamically Motivated Models for Multiplex Networks 1

CSE 5243 INTRO. TO DATA MINING

arxiv: v1 [physics.soc-ph] 19 Sep 2007

Cluster Analysis. Ying Shen, SSE, Tongji University

Discovery of Community Structure in Complex Networks Based on Resistance Distance and Center Nodes

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1

CS 534: Computer Vision Segmentation and Perceptual Grouping

CSE 5243 INTRO. TO DATA MINING

Social and Technological Network Analysis. Lecture 4: Community Detec=on and Overlapping Communi=es. Dr. Cecilia Mascolo

Web Structure Mining Community Detection and Evaluation

Hierarchical Clustering

Unsupervised Learning and Clustering

V2: Measures and Metrics (II)

Clustering CS 550: Machine Learning

An Efficient Algorithm for Community Detection in Complex Networks

Kernighan/Lin - Preliminary Definitions. Comments on Kernighan/Lin Algorithm. Partitioning Without Nodal Coordinates Kernighan/Lin

Clustering Lecture 3: Hierarchical Methods

EE 701 ROBOT VISION. Segmentation

arxiv:cond-mat/ v2 [cond-mat.stat-mech] 30 Aug 2004

CUT: Community Update and Tracking in Dynamic Social Networks

Lecture 5: Graphs. Rajat Mittal. IIT Kanpur

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler

The Generalized Topological Overlap Matrix in Biological Network Analysis

Network Community Detection

Basics of Network Analysis

Diffusion and Clustering on Large Graphs

Centralities (4) By: Ralucca Gera, NPS. Excellence Through Knowledge

Spectral Graph Multisection Through Orthogonality

Community Detection: Comparison of State of the Art Algorithms

1 More stochastic block model. Pr(G θ) G=(V, E) 1.1 Model definition. 1.2 Fitting the model to data. Prof. Aaron Clauset 7 November 2013

INTRODUCTION SECTION 9.1

Content-based Image and Video Retrieval. Image Segmentation

HW4 VINH NGUYEN. Q1 (6 points). Chapter 8 Exercise 20

Community Detection in Directed Weighted Function-call Networks

CS224W Final Report: Study of Communities in Bitcoin Network

CAIM: Cerca i Anàlisi d Informació Massiva

Data Clustering Hierarchical Clustering, Density based clustering Grid based clustering

Online Social Networks and Media

CS 231A CA Session: Problem Set 4 Review. Kevin Chen May 13, 2016

Big Data Analytics. Special Topics for Computer Science CSE CSE Feb 11

Local Fiedler Vector Centrality for Detection of Deep and Overlapping Communities in Networks

Unsupervised Learning. Presenter: Anil Sharma, PhD Scholar, IIIT-Delhi

2007 by authors and 2007 World Scientific Publishing Company

Section 7.13: Homophily (or Assortativity) By: Ralucca Gera, NPS

Unsupervised Learning

Hierarchical Multi level Approach to graph clustering

Finding community structure in very large networks

Transcription:

Community Structure Detection Amar Chandole Ameya Kabre Atishay Aggarwal

What is a network? Group or system of interconnected people or things Ways to represent a network: Matrices Sets Sequences Time series Graphs 2

Networks as Graphs Entities represented as nodes Connections between entities represented as edges Example Social Network People: Nodes Friendship: Edges Citation Network Documents: Nodes Citation: Edges 3

Community Structures Group of vertices within which connections are dense but between which they are sparser Can be found in the Internet, the world wide web, citation networks, transportation networks, email networks, food webs, social and biochemical networks 4

Need to study Community Structures Networks of interest found to naturally divide into communities or modules Communities generally have similar behavior, decision making characteristics Properties of communities may be quite different from average properties of the network 5

Detection of Community Structures Computer Science Approach Graph Partitioning Social Science Approach Block Modeling or Hierarchical Clustering or Community Structure Detection 6

Graph Partitioning Goal: Find division of network irrespective of existence of a good division Condition: Number and size of groups is generally known in advance Example: Division of tasks between different cores of a parallel computer 7

Graph Partitioning 8

Community Structure Detection Goal: Find a good division of the network provided that it exists Condition: Number and size of groups are determined by the network itself Example: Find most popular messaging application in each country 9

Community Structure Detection Most Popular Messaging App in Every Country Political Orientation Across Blogs on the Internet 1

Girvan and Newman (GN) Algorithm Edge betweenness Number of shortest paths that run between a pair of nodes Each edge assigned a betweenness score Edges with high betweenness scores removed Betweenness recalculated 11

Performance of GN Algorithm Computationally very expensive For m edges and n nodes: Complexity of O(m2n) for regular graphs O(n3) for sparse graphs 12

What do the papers that we reviewed propose? 13

Paper 1 : Fast algorithm for detecting community structure in networks (arxiv 3) Author: M. E. J. Newman University of Michigan, Ann Arbor 14

Drawbacks of graph partitioning and GN algorithm Computation effort required is immense Feasible even upto thousand nodes Infeasible for present day s interactive networks containing millions and billions nodes 15

Overcoming drawbacks Use of modularity Optimizing modularity 16

1. Modularity Number of edges falling within groups minus the expected number in an equivalent network with edges placed at random Positive Magnitude indicates the probability of a community structure Zero Edges are simply a random distribution Negative Indicates that a community structure must be absent 17

Example scenario for Modularity Suppose we have 3 vertices and 6 edges. Then by random distribution of edges, the network is expected to look like: EXPECTED: 18

Example scenario for Modularity Suppose we have 3 vertices and 6 edges. Then by random distribution of edges, the network is expected to look like: ACTUAL: 19

Example scenario for Modularity EXPECTED: Too many connections than exppected. Something is fishy here! ACTUAL: Modularity will find out this deviation from expected value using different expressions (explained in later slides) and help us divide network into communities. 2

2. Ways to Optimize Modularity Simulated Annealing Genetic Algorithms Greedy Algorithms Spectral Techniques 21

What purpose does Modularity serve in this paper? Values of Q indicative of presence of community structure Q = indicates random presence Q >=.3 indicates strong presence 22

Mathematical Definitions Modularity Q is defined as: Q = (eii - ai2) i where, - eij is the fraction of edges in network that connect vertices in group i to those in group j. - ai = eij j Δ Q = eij + eji - 2 aiaj = 2 ( eij - aiaj) 23

Optimizing modularity Ways to divide n vertices into g non-empty groups is Sn(g) Number of distinct community divisions: Sn(g) for g=1 to n Maximum 2-3 vertices 24

Greedy Algorithm Each vertex is considered as a community Join 2 vertices such that it results in the greatest increase or smallest decrease in the value of modularity Process can be represented as a dendrogram and cuts through it give communities 25

Steps to follow 1. Calculate eij matrix for the graph 2. Calculate Δ Q for every possible combination of 2 nodes 3. Join vertices with highest value of Δ Q 4. Calculate Q after the join and update dendrogram 5. Recompute eij matrix 26

Community Structures 1. We have the dendrogram with values of Q after each join 2. Starting from the highest value of Q, if it is greater than.3, cut through it 3. We obtain the final community structure when no value >.3 left to be cut 27

Example 28

The eij matrix :.3.1.3.2.1.2.1.1.3.3 29

All combinations Δ Q = 2 ( eij - aiaj) (1,2) 2 (.3 -.4 *.5 ) =.2 (2,3) 2 (.2 -.4 *.5 ) =. (1,3) 2 (. -.4 *.4 ) = -.32 (3,4) 2 (.1 -.4 *.4 ) = -.12 (4,5) 2 (.3 -.3 *.4 ) =.36 The eij matrix :.3.1.3.2.1.2.1.1.3.3 Max Δ Q =.36 Thus, join 4 & 5 and calculate Q = -.28 3

The updated eij matrix :.3.1.3.2.1.2.1.1.3 The original eij matrix :.3.1.3.2.1.2.1.1.3.3 31

All combinations Δ Q = 2 ( eij - aiaj) The updated eij matrix :.3.1 (1,2) 2 (.3 -.4 *.5 ) =.2.3.2 (2,3) 2 (.2 -.4 *.5 ) =..1.2.1 (1,3) 2 (. -.4 *.4 ) = -.32.1.3 (3, 4-5) 2 (.1 -.4 *.4 ) = -.12 Max Δ Q =.2 Thus, join 1 and 2 and calculate Q = -.41 32

The updated eij matrix :.3.3.3.1.1.3 The previous eij matrix :.3.1.3.2.1.2.1.1.3 33

All combinations The updated eij matrix : Δ Q = 2 ( eij - aiaj).3.3 (1-2, 3) 2 (.3 -.4 *.6 ) =.12.3.1 (3, 4-5) 2 (.1 -.4 *.4 ) = -.12.1.3 Max Δ Q =.12 Thus, join 1-2 with 3 and calculate Q =.5 34

The updated eij matrix : The previous eij matrix :.3.3.3.1.1.3.6.1.1.3 35

All combinations Join the 2 communities: 1-2-3 and 4-5 The updated eij matrix :.6.1.1.3 As there is just one single component now, we stop. Q =.88 36

Splitting into communities 1. Now, we observe that the maximum value of Q is.88 2. As, this is >=.3, we cut through it 3. We obtain 2 communities - (1-2-3) and (4-5) 4. No other value of Q>=.3 5. The final community structure: (1-2-3) and (4-5) 37

The dendrogram 38

Resulting community structure division Community 1 Community 2 39

Performance of Greedy Algorithm Computationally less expensive For m edges and n nodes Complexity of O(m+n) for each join operation O(n(m+n)) for n-1 join operations 4

Computer generated graph n=128 vertices. 4 groups of 32. Zin + Zout = 16 41

Observations: 9% correct for Zout<=6 For Zout=5, Greedy algo : 97.4% vertices GN algo : 98.9% vertices. For greater Zout, Greedy performs better. 42

Karate Club Network Friendships between 34 members of a club at a US university over a 2-year period. The club naturally split into 2 due to some disputes. 43

Observations: Peak modularity Q =.381 Vertex 1 classified wrongly. GN algorithm classified vertex 3 wrongly. 44

Games Network Schedule of games between American college football teams in a single season. Teams divided into groups or conferences. Intra-conference games more frequent than inter-conference games. 45

Observations: 46

How much faster does a greedy algorithm perform in comparison to the Girvan and Newman algorithm? 47

Some comparisons: For the Games Network: Greedy algorithm : Less than 1th of a second. GN algorithm: Little over a second 1275-node Jazz musicians network: Greedy algorithm : 1 second GN algorithm : 3 hours. 48

42 minutes! vs. 3-5 years! 49

Physicists network Network of collaborations between physicists as documented by papers posted on the widely-used Physics E-print Archive at arxiv.org 56,276 scientists in all branches of physics Scientists are considered connected if they have coauthored one or more papers posted on the archive 5

Observations: 6 communities Peak modularity of Q =.713 4 large communities - 77% vertices Astrophysics, High-energy physics, 2 of condensed matter physics 51

52

Greedy algorithm conclusions: The author uses a bottom-up approach The algorithm can be recursively applied to identify intricate groups and communities. The algorithm is computationally feasible It is possible to study larger networks now. 53

Paper 2 : Modularity and community structure in networks (PNAS 6) Author: M. E. J. Newman University of Michigan, Ann Arbor 54

What is a good division for community detection? A good division is not merely the one that has fewer edges between communities; A good division is the one that has fewer than expected edges between communities! 55

What purpose does Modularity serve here? Modularity quantifies the idea that, a true community structure in a network corresponds to a statistically surprising arrangement of edges The paper reformulates modularity in terms of the spectral properties of the network. 56

Spectral properties? Properties of a graph in relationship to the eigenvalues and eigenvectors of matrices associated to the graph, such as its adjacency matrix. 57

Eigenvalues and Eigenvectors If A is a matrix, then the eigenvalues are given by the solution of λ such that A - λi = (1) where I is identity matrix. For each value of λ obtained from eq. 1, the following equation gives an eigenvector: Av = λv (2) 58

Spectral Algorithm Start with the adjacency matrix A of the network Convert A into modularity matrix B Find all eigenvalues and eigenvectors of B Consider the eigenvector corresponding to the highest eigenvalue and use the signs of elements in this vector to decide the group. 59

Steps to follow 1. Calculate the eij matrix for the graph 2. Calculate Δ Q for every possible combination of 2 nodes. 3. Join the vertices with highest value of Δ Q 4. Calculate Q after the join and update the dendrogram 5. Recompute the eij matrix 6

Example: 61

Example: Adjacency matrix A= 3 1 3 2 1 2 1 1 3 3 62

Calculating Modularity Matrix Modularity matrix B is a real symmetric matrix with elements Bij = Aij - kikj/2m where, - Aij is the adjacency matrix - ki and kj are the degrees of vertices - m is ½ iki 63

Calculating Modularity Matrix B12 = A12- k1k2/2m = 3-4*5/2*1 =3-1 =2 64

Calculating Modularity Matrix Doing the same calculation for all Bij, we get B= -.8 2.2 -.8 -.6 2-1.25 1-1 -.75.2 1 -.8.2 -.6 -.8-1.2 -.8 2.4 -.6 -.75 -.6 2.4 -.45 65

Eigenvalues and Eigenvectors for B Leading eigenvector Always present! Eigenvalues Corresponding eigenvectors λ1 3.95139 v1 (.442855,.499515,.199318, -.511499, -.52997) λ2-2.61989 v2 (.144319, -.295586,.292573, -.654337,.614854) λ3-2.978 v3 (.629671, -.664442,.152521,.237745, -.286785) λ4.8213 v4 (.387237,.451396,.436317,.51464,.452162) λ5 v5 (1, 1, 1, 1, 1) 66

Splitting into communities Leading eigenvector is the one with highest eigenvalue Node number: λ1 3.95139 Community: 1 2 3 4 5 v1 (.442855,.499515,.199318, -.511499, -.52997) +ve +ve A A +ve A -ve -ve B B 67

Resulting community structure division Community 1 Community 2 68

What else do the eigenvectors suggest? The elements of the leading eigenvector measure how firmly each vertex belongs to its assigned community. Large vector elements strong central members of their communities. Small vector elements more ambivalent 69

Special case! There is always an eigenvector (1, 1, 1,, 1) with eigenvalue for every modularity matrix B This is a property of matrices having sum of rows and columns If all the eigenvectors have negative eigenvalues, then this special eigenvector acts as the leading eigenvector The network is said to be INDIVISIBLE in this case Big highlight point of this algorithm as many algorithms divide into communities even when it is not needed, based on sheer number of edges between nodes (if they are less) 7

More than two communities? Possible - because this algorithm knows when to stop dividing! Start the algorithm with original network Run the same algorithm over the newly formed communities Continue unless all the communities obtained are indivisible 71

Example verified in the paper A network of books about politics, compiled by V. Krebs. Vertices represent 15 recent books on American politics bought from the on-line bookseller Amazon.com Edges join pairs of books that are frequently purchased by the same buyer. Books were divided (by author) according to their stated or apparent political alignment, liberal or conservative, except for a small number of books that were explicitly bipartisan or centrist, or had no clear affiliation. 72

Result & Interpretation Two strong communities are formed one consists almost entirely of conservative books and other liberal. Centrist books belong to their own communities and are not, in most cases, as expected Indicates that political moderates form their own purchasing community. Books appear to form communities of copurchasing that align closely with political views Liberal Conservative Centrist Fig. Dashed lines divide the four communities found by the algorithm, and shapes represent the political alignment of the books. 73

How better does leading eigenvector method do in comparison to the Girvan and Newman algorithm? Context Collaboration network of 27 vertices 74

2 minutes! vs. 2-3 years! 75

Performance Fast and accurate Most time consuming part is evaluation of leading eigenvector of modularity matrix For n nodes Complexity of O(n2 logn) Better than all except Greedy - O(nlog2 n) But quality is better than Greedy, especially in large networks Takes ~2 min to process network with ~27, vertices on a standard PC (circa 26) 76

Milestones in Community Structure Detection Girvan, Newman Clauset, Newman, Moore Over 5 papers* Community structure in social and biological networks Finding community structure in very large networks On the topic Community Structure Detection 22 23 24 26 27-217 Newman Newman Fast algorithm for detecting community structure in networks Modularity and Community Structure Detection * As found on Google Scholar77

Conclusion Both greedy and spectral algorithm have their own advantages and shortcomings, but both of them have been strong foundational algorithms in the starting days of Community Structure Detection back in early 2s. While spectral has more accurate results, greedy is faster. Both make use of the concept of Modularity successfully to get meaningful & insightful results. 78

References Modularity and community structure in networks. (PNAS 6) Community structure in social and biological networks (PNAS 2) Finding community structure in very large networks (arxiv 4) Wikipedia Fast algorithm for detecting community structure in networks (arxiv 3) 79

Thank you! We are so tired! :/ Questions? :) 8