Non Overlapping Communities Davide Mottin, Konstantina Lazaridou HassoPlattner Institute Graph Mining course Winter Semester 2016
Acknowledgements Most of this lecture is taken from: http://web.stanford.edu/class/cs224w/slides Some other material: https://www.ismll.uni-hildesheim.de/lehre/cmie-11w/script/lecture5.pdf 2
Lecture road Introduction to graph clustering Hierarchical approaches Spectral clustering 3
Networks & Communities We often think of networks being organized into modules, cluster, communities: 4
Goal: Find Densely Linked Clusters 5
Micro-Markets in Sponsored Search Find micro-markets by partitioning the query-to-advertiser graph: query advertiser Andersen, R. and Lang, K.J. Communities from seed sets. WWW, 2006. 6
Movies and Actors Clusters in Movies-to-Actors graph: Andersen, R. and Lang, K.J. Communities from seed sets. WWW, 2006. 7
Twitter & Facebook Discovering social circles, circles of trust: Mcauley, J. and Leskovec, J. Discovering social circles in ego networks. TKDD, 2014. 8
Lecture road Introduction to graph clustering Hierarchical approaches Spectral clustering 9
Community Detection How to find communities? We will work with undirected (unweighted) networks 10
Method 1: Strength of Weak Ties Edge betweenness: Number of shortest paths passing over the edge Intuition: b=16 b=7.5 Edge strengths (call volume) in a real network Edge betweenness in a real network 11
[Girvan-Newman 02] Method 1: Girvan-Newman Divisive hierarchical clustering based on the notion of edge betweenness: Number of shortest paths passing through the edge Girvan-Newman Algorithm: Undirected unweighted networks Repeat until no edges are left: Calculate betweenness of edges Remove edges with highest betweenness Recompute betweenness Connected components are communities Gives a hierarchical decomposition of the network 12
Girvan-Newman: Example 1 12 33 49 Need to re-compute betweenness at every step 13
Girvan-Newman: Example Step 1: Step 2: Step 3: Hierarchical network decomposition: 14
Girvan-Newman: Results Communities in physics collaborations 15
Girvan-Newman: Results Zachary s Karate club: Hierarchical decomposition 16
We need to resolve 2 questions 1. How to compute betweenness? 2. How to select the number of clusters? 17
How to Compute Betweenness? Want to compute betweenness of paths starting at node A Breath first search starting from A: 0 1 2 3 4 18
Betweenness centrality Betweennes is a measure of the brokerage power of a node: the higher the betweenness the higher the capacity of a node of acting as a broker and spreading a particular information Gives also the idea of the capacity of a graph to be easily disconnected Shortest paths from s to t including v betweenness v = * σ,- v σ,-,/0/- Shortest paths from s to t Takes O n 2 Require all the shortest paths! in the worst case: all nodes shortest path computation Defined for nodes 19
How to Compute Betweenness? Count the number of shortest paths from A to all other nodes of the network : Variant of Dijkstra or Floyd-Warshall algorithms 20
How to Compute Betweenness? * σ,- v σ,-,/0/- Compute betweenness by working up the tree: If there are multiple paths count them fractionally The algorithm: Add edge flows: -- node flow = 1+ child edges -- split the flow up based on the parent value Repeat the BFS procedure for each starting node U 1 path to K. Split evenly 1+0.5 paths to J Split 1:2 1+1 paths to H Split evenly 21
How to Compute Betweenness? Compute betweenness by working up the tree: If there are multiple paths count them fractionally The algorithm: Add edge flows: -- node flow = 1+ child edges -- split the flow up based on the parent value Repeat the BFS procedure for each starting node U 1 path to K. Split evenly 1+0.5 paths to J Split 1:2 1+1 paths to H Split evenly 22
Summary of Betweenness computation 1. Build one BFS structures for each node 2. Determine flow values for each edge using the previous procedure and (3 steps) 3. Sum up the flow values of each edge in all BFS structures to get its betweenness value Notice: we are counting the flow between each pair of nodes X and Y twice (once when BFS from X and once when BFS from Y) at the end we divide everything by two 4. Use these betweenness values to identify the edges of highest betweenness for purposes of removing them in the Girvan- Newman method 23
We need to resolve 2 questions 1. How to compute betweenness? 2. How to select the number of clusters? 24
Network Communities Communities: sets of tightly connected nodes Define: Modularity Q A measure of how well a network is partitioned into communities Given a partitioning of the network into groups s Î S: Q sî S [ (# edges within group s) (expected # edges within group s) ] Need a null model! 25
Null Model: Configuration Model Given real G on n nodes and m edges, construct rewired network G Same degree distribution but random connections Consider G as a multigraph (=multiple edges for the same pair of nodes) Question: what is the expected number of edges between two nodes? Imagine to split the edges in nodes + half-edges i The expected number of edges between nodes i and j of degrees k i and k j equals to: k i k j The expected number of edges in (multigraph) G : = 1 k i k j 2 i N j N = 1 1 k 2m 2 2m i j N k j = 1 2m 2m = m 4m i N = = k ik j 2m 2m Note: * k ` = 2m j 26
Modularity Modularity of partitioning S of graph G: Q sî S [ (# edges within group s) (expected # edges within group s) ] Q G, S = 1 A 2m ij k ik j s S i s j s 2m Normalizing cost.: -1<Q<1 Adjacency matrix Modularity values take range [ 1,1] It is positive if the number of edges within groups exceeds the expected number 0. 3 < Q < 0. 7 means significant community structure 27
Reasoning about modularity Can modularity = 1? Answer: NO!!! Range: [ x y, 1) Whatis the modularity of a two completed connected components of n/2 nodes? Can you find an example of a network partitioning with modularity < 0? 28
Modularity: Number of clusters Modularity is useful for selecting the number of clusters: Q Why not optimize Modularity directly? 29
Lecture road Introduction to graph clustering Hierarchical approaches Spectral clustering 30
Graph Partitioning Undirected graph G(V, E): Bi-partitioning task: Divide vertices into two disjoint groups A, B 2 1 3 4 5 6 A 1 2 3 Questions: How can we define a good partition of G? How can we efficiently identify such a partition? 4 5 6 B 31
Graph Partitioning What makes a good partition? Maximize the number of within-group connections Minimize the number of between-group connections 1 5 2 3 4 6 A B 32
Graph Cuts Express partitioning objectives as a function of the edge cut of the partition Cut: Set of edges with only one vertex in a group: A 2 1 3 4 5 6 B cut(a,b) = 2 33
Graph Cut Criterion Criterion: Minimum-cut Minimize weight of connections between groups Degenerate case: argmin A,B cut(a, B) Optimal cut Minimum cut Problem: Only considers external cluster connections Does not consider internal cluster connectivity 34
Graph Cut Criteria NP-hard Criterion: Normalized-cut Connectivity between groups relative to the density of each group vol(a): total weight of the edges with at least one endpoint in A: vol A = n Why use this criterion? n Produces more balanced partitions How do we efficiently find a good partition? Problem: Computing optimal cut is NP-hard i A k i Shi, J., & Malik, J. (2000). Normalized cuts and image segmentation. TPAMI, 22(8), 888-905. 35
Spectral Graph Partitioning A: adjacency matrix of undirected G A ij = 1 if (i, j) is an edge, else 0 x is a vector in Rˆ with components (x 1,, x n ) Think of it as a label/value of each node of G What is the meaning of A x? ˆ ŒŽx y = A Œ x Œ = x Œ,Œ Entry y i is a sum of labels x j of neighbors of i 36
What is the meaning of Ax? j th coordinate of A x : Sum of the x-values of neighbors of j Make this a new value at node j A x = λ x Spectral Graph Theory: Analyze the spectrum of matrix representing G Spectrum: Eigenvectors x i of a graph, ordered by the magnitude (strength) of their corresponding eigenvalues λ i : 37
Questions? 38
References Andersen, R. and Lang, K.J. Communities from seed sets. WWW, 2006. Mcauley, J. and Leskovec, J. Discovering social circles in ego networks. TKDD, 2014. Shi, J., & Malik, J. (2000). Normalized cuts and image segmentation. TPAMI, 22(8), 888-905. Fiedler, M., 1973. Algebraic connectivity of graphs. Czechoslovak mathematical journal, 23(2), pp.298-305. 39