Jure Leskovec, Cornell/Stanford University. Joint work with Kevin Lang, Anirban Dasgupta and Michael Mahoney, Yahoo! Research

Network: an interaction graph.
Nodes represent entities.
Edges represent interactions between pairs of entities.

Are there natural clusters, communities, partitions, etc.?
Concept-based clusters, link-based clusters, density-based clusters, etc.

Bid, click and impression information for keyword x advertiser pairs.
Mine information at query time to provide new ads.
Maximize CTR, RPS, advertiser ROI.

Find micro-markets by partitioning the query x advertiser graph.
[Figure: query x advertiser matrix; axes: query, advertiser]

Linear (low-rank) methods: if Gaussian, then a low-rank space is good.
Kernel (non-linear) methods: if low-dimensional manifold, then kernels are good.
Hierarchical methods: top-down and bottom-up, common in the social sciences.
Graph partitioning methods: define an edge-counting metric (conductance, expansion, modularity, etc.) and optimize!

It is a matter of common experience that communities exist in networks... Although not precisely defined, communities are usually thought of as sets of nodes with better connections amongst its members than with the rest of the world.

Communities: sets of nodes with lots of connections inside and few to the outside (the rest of the network).
Assumption: networks are (hierarchically) composed of communities.
[Figure: communities, clusters, groups, modules]

Communities: sets of nodes with lots of connections inside and few to the outside (the rest of the network).
Assumption: networks are (hierarchically) composed of communities.
[Figure: hierarchical community structure]
Question: are large networks really like this?

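How community-like is a set of nodes?
Let A be the adjacency matrix of G = (V, E). The conductance of a set S of nodes is:
$$\Phi(S) = \frac{\sum_{i \in S,\, j \notin S} A_{ij}}{\min\{A(S),\, A(\bar{S})\}}, \qquad A(S) = \sum_{i \in S} \sum_{j \in V} A_{ij}$$
The Network Community Profile (NCP) plot of the graph is:
$$\Phi(k) = \min_{S \subset V,\ |S| = k} \Phi(S)$$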
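To make the two quantities concrete, here is a minimal sketch, assuming conductance is taken as the number of cut edges divided by the smaller of the two volumes (sum of degrees) and that the graph is a small, unweighted, undirected dict of neighbor sets. The function names `conductance` and `ncp` and the toy graph are illustrative assumptions. The brute-force minimum is only feasible for tiny graphs, since finding the best set of each size is NP-hard in general.

```python
from itertools import combinations

def conductance(adj, S):
    """Cut edges divided by the smaller of vol(S) and vol(V minus S).

    Assumes S is a non-empty proper subset and the graph has no isolated nodes.
    """
    S = set(S)
    cut = sum(1 for u in S for v in adj[u] if v not in S)
    vol_S = sum(len(adj[u]) for u in S)
    vol_rest = sum(len(adj[u]) for u in adj) - vol_S
    return cut / min(vol_S, vol_rest)

def ncp(adj, max_k):
    """Phi(k): minimum conductance over all node sets of size k (brute force)."""
    nodes = sorted(adj)
    return {k: min(conductance(adj, S) for S in combinations(nodes, k))
            for k in range(1, max_k + 1)}

# Toy graph (hypothetical): two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

profile = ncp(adj, 3)
for k, phi in profile.items():
    print(k, round(phi, 3))
```

In this toy graph the best set of size 3 is one of the triangles, which has one cut edge against a volume of 7.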

What is the best community of 5 nodes? Score: Φ(S) = # edges cut / # edges inside

What is the best community of 5 nodes? Bad community: Φ = 5/6 = 0.83. Score: Φ(S) = # edges cut / # edges inside

What is the best community of 5 nodes? Bad community: Φ = 5/7 = 0.7. Better community: Φ = 2/5 = 0.4. Score: Φ(S) = # edges cut / # edges inside

What is the best community of 5 nodes? Bad community: Φ = 5/7 = 0.7. Better community: Φ = 2/5 = 0.4. Best community: Φ = 2/8 = 0.25. Score: Φ(S) = # edges cut / # edges inside

Network community profile (NCP) plot: plot the score of the best community of size k.
[Figure: log Φ(k) versus community size, log k; marked points k=5 with Φ(5)=0.25 and k=7 with Φ(7)=0.18]

Idea: use approximation algorithms for NP-hard graph partitioning problems as experimental probes of network structure.
Spectral (quadratic approximation): confuses long paths with deep cuts.
Multi-commodity flow (log(n) approximation): difficulty with expanders.
SDP (sqrt(log(n)) approximation): best in theory.
Metis (multi-resolution heuristic): common in practice.
X+MQI: MQI as a post-processing step on the output of another method X, e.g., Metis.
Local Spectral: connected and tighter sets (empirically).
Metis+MQI: best conductance (empirically).
We are not interested in partitions per se, but in probing network structure.

[Figure panels: d-dimensional meshes; California road network]

Zachary's university karate club social network.
During the study the club split into 2.
The split (squares vs. circles) corresponds to cut B.

Collaborations between scientists in Networks [Newman, 2005]

[Ravasz & Barabasi, 2003] [Clauset, Moore & Newman, 2008]

Previously, researchers examined the community structure of small networks (~100 nodes).
We examined more than 100 different large networks.
Large networks look very different!

Typical example: General relativity collaboration network (4,158 nodes, 13,422 edges)

[Figure: NCP plot, Φ(k) (conductance) versus k (community size). Annotations: better and better communities; best community has ~100 nodes; communities get worse and worse]

Definition: a whisker is a maximal set of nodes connected to the network by a single edge.
Whiskers are responsible for the downward slope of the NCP plot.
[Figure: NCP plot with the largest whisker marked]
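A minimal sketch of how whiskers might be extracted under this definition, assuming a connected, simple, undirected graph stored as a dict of neighbor sets. It finds all bridges, treats the largest piece left after deleting them as the core, and reports each connected group of the remaining nodes as one whisker. The helper names and the toy graph are hypothetical, and taking the largest bridge-free component as the core is an assumption made for illustration.

```python
def find_bridges(adj):
    """Return bridge edges (as frozensets) of a simple undirected graph."""
    disc, low, bridges = {}, {}, set()
    timer = 0
    for root in adj:
        if root in disc:
            continue
        disc[root] = low[root] = timer
        timer += 1
        stack = [(root, None, iter(adj[root]))]
        while stack:
            node, parent, neighbors = stack[-1]
            advanced = False
            for nxt in neighbors:
                if nxt == parent:
                    continue
                if nxt in disc:
                    low[node] = min(low[node], disc[nxt])
                else:
                    disc[nxt] = low[nxt] = timer
                    timer += 1
                    stack.append((nxt, node, iter(adj[nxt])))
                    advanced = True
                    break
            if not advanced:
                stack.pop()
                if parent is not None:
                    low[parent] = min(low[parent], low[node])
                    if low[node] > disc[parent]:
                        bridges.add(frozenset((parent, node)))
    return bridges

def components(nodes, adj, blocked=frozenset()):
    """Connected components induced on `nodes`, ignoring edges in `blocked`."""
    seen, comps = set(), []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        comp, stack = set(), [start]
        while stack:
            u = stack.pop()
            comp.add(u)
            for v in adj[u]:
                if v in nodes and v not in seen and frozenset((u, v)) not in blocked:
                    seen.add(v)
                    stack.append(v)
        comps.append(comp)
    return comps

def find_whiskers(adj):
    """Return (core, whiskers); each whisker hangs off the core by one edge."""
    all_nodes = set(adj)
    core = max(components(all_nodes, adj, find_bridges(adj)), key=len)
    return core, components(all_nodes - core, adj)

# Toy graph (hypothetical): a 4-clique core, a small tree, and a triangle.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3),
         (3, 4), (4, 5), (4, 6),
         (0, 7), (7, 8), (8, 9), (9, 7)]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

core, whisker_sets = find_whiskers(adj)
print(sorted(core), [sorted(w) for w in whisker_sets])
```

The triangle in the toy graph is a whisker that is not a tree, since it is still attached to the rest of the graph by a single edge.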

Denser and denser core of the network.
The core contains ~60% of the nodes and ~80% of the edges.
Network structure: core-periphery (jellyfish, octopus).
Whiskers are responsible for good communities.

Each new edge inside the community costs more.
Each node has twice as many children.
[Figure: NCP plot with Φ = 1/3 = 0.33, Φ = 2/4 = 0.5, Φ = 8/6 = 1.33, Φ = 64/14 = 4.57]

Whiskers in real networks are non-trivial (richer than trees).
[Figure: whiskers and the edge to cut]

Whiskers in real networks are larger than the whiskers expected based on density and degree sequence.

Nothing happens!
Now we have 2-edge-connected whiskers to deal with.
This indicates the recursiveness of our core-periphery structure: as we remove the periphery, the core itself breaks into a core and a periphery.

What if we allow cuts that give disconnected communities?
Cut all whiskers.
Compose communities out of whiskers.
How good a community do we get?
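One rough way to explore this question: each whisker hangs off the network by a single edge, so a union of j whiskers cuts exactly j edges, and its conductance is j divided by the total volume of the union (as long as the union holds less than half of the network's volume). The sketch below greedily adds whiskers in order of decreasing volume. The function name, the greedy ordering, and the example volumes are assumptions for illustration, not necessarily the procedure behind the slides.

```python
def bag_of_whiskers_profile(whisker_volumes):
    """Conductance of the union of the j largest-volume whiskers, j = 1, 2, ...

    Each whisker contributes one cut edge, so the union of j whiskers cuts
    j edges. Assumes the union stays below half the total network volume.
    Returns a list of (j, union_volume, conductance) tuples.
    """
    profile = []
    total = 0
    for j, vol in enumerate(sorted(whisker_volumes, reverse=True), start=1):
        total += vol
        profile.append((j, total, j / total))
    return profile

# Hypothetical whisker volumes (sum of node degrees inside each whisker).
for j, vol, phi in bag_of_whiskers_profile([3, 7, 5]):
    print(j, vol, round(phi, 3))
```

The union can never beat the single best whisker in conductance, but it gives a candidate community at sizes where no single whisker exists.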

[Figure: LiveJournal; curves: rewired network, Local Spectral, bag-of-whiskers, Metis+MQI]

Regularization properties: spectral embeddings stretch along directions in which the random walk mixes slowly.
The resulting hyperplane cuts have "good" conductance, but may not be the optimal cuts.
[Figure panels: spectral embedding; flow-based embedding]

[Figure: ext/int; dots are connected clusters]
Metis+MQI (red) gives sets with better conductance.
Local Spectral (blue) gives tighter and more well-rounded sets.

Two ca. 500 node communities from Local Spectral:
Two ca. 500 node communities from Metis+MQI:

... can be computed from:
Spectral embedding (independent of balance)
SDP-based methods (for volume-balanced partitions)

What is a good model that explains such network structure?
None of the existing models work.
[Figure: NCP plots of existing models. Shape labels: Flat; Down and Flat; Flat and Down. Panels: Pref. attachment; Small World; Geometric Pref. Attachment]

Note: Sparsity is the issue, not heavy tails per se. (Power laws with an exponent between 2 and 3 give us the appropriate sparsity.)

Forest Fire [LKF05]: connections spread like a fire.
A new node joins the network.
It selects a seed node.
It connects to some of the seed's neighbors.
Continue recursively.
Notes:
Preferential attachment flavor: the second neighbor is not uniform at random.
Copying flavor: since we burn the seed's neighbors.
Hierarchical flavor: the seed is the parent.
Local flavor: burn near (in a diffusion sense) the seed vertex.
As a community grows, it blends into the core of the network.
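The burning process above can be sketched as follows. This is a simplified, undirected variant written for illustration: the function name `forest_fire`, the single `burn_prob` parameter, and the geometric rule for how many neighbors catch fire are assumptions, not the exact parameterization of [LKF05].

```python
import random

def forest_fire(n, burn_prob=0.3, seed=0):
    """Grow an undirected graph on n nodes with a simplified forest-fire rule.

    burn_prob must be below 1. Each burning node ignites a geometrically
    distributed number of its not-yet-burned neighbors
    (mean burn_prob / (1 - burn_prob)).
    """
    rng = random.Random(seed)
    adj = {0: set()}
    for new in range(1, n):
        start = rng.randrange(new)        # seed node, chosen uniformly
        burned = {start}
        frontier = [start]
        while frontier:                   # fire spreads recursively
            node = frontier.pop()
            candidates = sorted(v for v in adj[node] if v not in burned)
            rng.shuffle(candidates)
            count = 0
            while rng.random() < burn_prob:
                count += 1
            for v in candidates[:count]:
                burned.add(v)
                frontier.append(v)
        adj[new] = set()
        for v in burned:                  # the new node links to every burned node
            adj[new].add(v)
            adj[v].add(new)
    return adj

graph = forest_fire(200, burn_prob=0.3, seed=1)
print(len(graph), sum(len(nbrs) for nbrs in graph.values()) // 2)
```

Because every new node links at least to its seed node, the resulting graph is always connected; higher values of `burn_prob` give denser graphs.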

[Figure: NCP plot; curves: rewired network, bag of whiskers]

Whiskers:
The largest whisker has ~100 nodes.
Whisker size is independent of network size.
Core:
60% of the nodes, 80% of the edges.
The core has little structure (hard to cut).
Still more structure than the random network.

The Dunbar number: 150 individuals is the maximum community size.
On-line communities have 60 members and break down at around 80; military, churches, divisions, etc. are all close to Dunbar's 150.
Common bond vs. common identity theory:
Common bond communities (people are attached to individual community members) are smaller and more cohesive.
Common identity communities (people are attached to the group as a whole) are focused around a common interest and tend to be larger and more diverse.
What edges mean and community identification:
Social networks: the reasons an individual adds a link to a friend can vary enormously.
Citation networks or web graphs: links are more expensive and are more semantically uniform.

Networks with ground-truth communities:
LiveJournal12: users create and explicitly join on-line groups.
DBLP co-authorships: publication venues can be viewed as communities.
Amazon product co-purchasing: each item belongs to one or more hierarchically organized categories, as defined by Amazon.
IMDB collaboration: countries of production and languages may be viewed as communities.

[Figure panels: LiveJournal; DBLP; Amazon; IMDB. Curves: rewired network, ground truth]

The NCP plot is a way to analyze network community structure.
Our results agree with previous work on small networks (people did not hit Dunbar's limit).
But large networks are different: whiskers + core (core-periphery) structure.
Small, well-isolated communities blend into the core of the network as they grow.

Assume a recursive Kronecker model. Fit it to G. We get
$$K = \begin{pmatrix} 0.9 & 0.5 \\ 0.5 & 0.1 \end{pmatrix}$$
What does this tell about the network structure?
Core: 0.9 edges. Core-periphery: 0.5 edges. Periphery: 0.1 edges.
No communities. No good cuts.
As opposed to
$$\begin{pmatrix} 0.9 & 0.1 \\ 0.1 & 0.9 \end{pmatrix}$$
which gives a hierarchy.
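To see why the fitted initiator suggests a core-periphery structure, the following sketch takes Kronecker powers of the two 2x2 matrices and compares expected degrees. It assumes each entry of the Kronecker power is read as an independent edge probability; the helper names `kron` and `kron_power` and the choice of the third power are illustrative assumptions.

```python
def kron(A, B):
    """Kronecker product of two square matrices given as lists of lists."""
    nb = len(B)
    size = len(A) * nb
    return [[A[i // nb][j // nb] * B[i % nb][j % nb] for j in range(size)]
            for i in range(size)]

def kron_power(K, k):
    """k-th Kronecker power of K (k >= 1)."""
    P = K
    for _ in range(k - 1):
        P = kron(P, K)
    return P

K_core = [[0.9, 0.5], [0.5, 0.1]]   # initiator from the fit above
K_hier = [[0.9, 0.1], [0.1, 0.9]]   # the contrasting hierarchical initiator

P_core = kron_power(K_core, 3)      # 8 x 8 matrix of edge probabilities
P_hier = kron_power(K_hier, 3)

# Expected degree of each node is its row sum.
deg_core = [sum(row) for row in P_core]
deg_hier = [sum(row) for row in P_hier]
print([round(d, 3) for d in deg_core])
print([round(d, 3) for d in deg_hier])
```

With the fitted initiator the expected degrees fall off steeply from the first node to the last, giving a dense core and a sparse periphery, while the hierarchical initiator gives every node the same expected degree, with most of that probability mass placed inside nested diagonal blocks.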