Jure Leskovec, Cornell/Stanford University. Joint work with Kevin Lang, Anirban Dasgupta and Michael Mahoney, Yahoo! Research

Network: an interaction graph. Nodes represent entities; edges represent interactions between pairs of entities.

Are there natural clusters, communities, partitions, etc.? Concept-based clusters, link-based clusters, density-based clusters, ...

Bid, click, and impression information for keyword × advertiser pairs. Mine information at query time to provide new ads. Maximize CTR, RPS, and advertiser ROI.

Find micro-markets by partitioning the query × advertiser graph.
[Figure: query × advertiser graph]

Linear (low-rank) methods: if Gaussian, then a low-rank space is good.
Kernel (non-linear) methods: if low-dimensional manifold, then kernels are good.
Hierarchical methods: top-down and bottom-up, common in the social sciences.
Graph partitioning methods: define an edge-counting metric (conductance, expansion, modularity, etc.) and optimize!
It is a matter of common experience that communities exist in networks... Although not precisely defined, communities are usually thought of as sets of nodes with better connections amongst its members than with the rest of the world.

Communities: sets of nodes with lots of connections inside and few to the outside (the rest of the network).
Assumption: networks are (hierarchically) composed of communities.
Communities, clusters, groups, modules.

Communities: sets of nodes with lots of connections inside and few to the outside (the rest of the network).
Assumption: networks are (hierarchically) composed of communities.
Hierarchical community structure.
Question: Are large networks really like this?

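How community-like is a set of nodes?
Let $A$ be the adjacency matrix of $G=(V,E)$. The conductance of a set $S$ of nodes is:
$$\Phi(S) = \frac{\sum_{i \in S,\, j \notin S} A_{ij}}{\min\{A(S),\, A(\bar{S})\}}, \qquad A(S) = \sum_{i \in S} \sum_{j \in V} A_{ij}$$
The Network Community Profile (NCP) plot of the graph is:
$$\Phi(k) = \min_{S \subset V,\, |S| = k} \Phi(S)$$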
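The two definitions on this slide can be tried out on a toy graph. The following is a minimal sketch, assuming the standard volume-based conductance (cut edges divided by the smaller of the two degree sums) and an exhaustive search over all node sets of size k, which is only feasible for tiny graphs. The function names and the `two_triangles` example graph are illustrative and not part of the talk.

```python
from itertools import combinations

def conductance(adj, S):
    """Cut edges leaving S divided by min(vol(S), vol(rest)); adj maps node -> neighbors."""
    S = set(S)
    cut = sum(1 for u in S for v in adj[u] if v not in S)
    vol_S = sum(len(adj[u]) for u in S)
    vol_rest = sum(len(adj[u]) for u in adj) - vol_S
    denom = min(vol_S, vol_rest)
    return cut / denom if denom else float("inf")

def ncp(adj, max_k):
    """Brute-force NCP: for each size k, the best (lowest) conductance over all sets of k nodes."""
    nodes = sorted(adj)
    return {
        k: min(conductance(adj, S) for S in combinations(nodes, k))
        for k in range(1, max_k + 1)
    }

# Hypothetical example: two triangles joined by the single edge 2-3.
two_triangles = {
    0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
    3: [2, 4, 5], 4: [3, 5], 5: [3, 4],
}

if __name__ == "__main__":
    for k, phi in ncp(two_triangles, 3).items():
        print(k, round(phi, 3))
```

On real networks the minimum over all sets of size k cannot be enumerated, which is why the talk relies on approximation algorithms as probes. Note also that the simplified score used on the following slides (edges cut over edges inside) has a different denominator from the volume-based one in this sketch.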

What is the best community of 5 nodes?
Score: Φ(S) = # edges cut / # edges inside

What is the best community of 5 nodes?
Bad community: Φ = 5/6 = 0.83
Score: Φ(S) = # edges cut / # edges inside

What is the best community of 5 nodes?
Bad community: Φ = 5/7 = 0.7
Better community: Φ = 2/5 = 0.4
Score: Φ(S) = # edges cut / # edges inside

What is the best community of 5 nodes?
Bad community: Φ = 5/7 = 0.7
Better community: Φ = 2/5 = 0.4
Best community: Φ = 2/8 = 0.25
Score: Φ(S) = # edges cut / # edges inside

Network community profile (NCP) plot: plot the score of the best community of size k.
k=5: Φ(5) = 0.25
k=7: Φ(7) = 0.18
[Axes: log Φ(k) vs. community size, log k]

Idea: use approximation algorithms for NP-hard graph partitioning problems as experimental probes of network structure.
Spectral (quadratic approximation): confuses long paths with deep cuts.
Multi-commodity flow (log(n) approximation): has difficulty with expanders.
SDP (sqrt(log(n)) approximation): best in theory.
Metis (multi-resolution heuristic): common in practice.
X+MQI: MQI applied as a post-processing step on the output of another method X, e.g., Metis.
Local Spectral: connected and tighter sets (empirically).
Metis+MQI: best conductance (empirically).
We are not interested in partitions per se, but in probing network structure.

d-dimensional meshes; California road network.

Zachary's university karate club social network. During the study the club split into 2. The split (squares vs. circles) corresponds to cut B.

Collaborations between scientists in Networks [Newman, 2005]

[Ravasz&Barabasi, 2003] [Clauset,Moore&Newman, 2008]

Previously, researchers examined the community structure of small networks (~100 nodes). We examined more than 100 different large networks. Large networks look very different!

Typical example: General relativity collaboration network (4,158 nodes, 13,422 edges)

[NCP plot: Φ(k) (conductance) vs. k (community size)]
Better and better communities, then communities get worse and worse. The best community has ~100 nodes.

Definition: a whisker is a maximal set of nodes connected to the rest of the network by a single edge.
[NCP plot, with the largest whisker marked]
Whiskers are responsible for the downward slope of the NCP plot.
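The whisker definition can be made concrete with a small search. This is a rough sketch, not the procedure used in the talk: it assumes a connected, undirected graph, tests every edge for being a bridge by removing it, treats the smaller side of each bridge as the candidate whisker (an assumption that only makes sense when the core is the larger side), and keeps the candidates that are not contained in another one. All names are illustrative.

```python
def reachable(adj, start, removed_edge):
    """Nodes reachable from start when the undirected edge removed_edge is ignored."""
    a, b = removed_edge
    seen = {start}
    stack = [start]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if (u, v) == (a, b) or (u, v) == (b, a) or v in seen:
                continue
            seen.add(v)
            stack.append(v)
    return seen

def find_whiskers(adj):
    """Maximal node sets attached to the rest of a connected graph by exactly one edge."""
    candidates = []
    for u in adj:
        for v in adj[u]:
            if u < v:
                side = reachable(adj, u, (u, v))
                if v not in side:  # removing (u, v) disconnects the graph: a bridge
                    other = set(adj) - side
                    candidates.append(min(side, other, key=len))
    # Drop candidates strictly contained in a larger candidate (keep only maximal sets).
    return [s for s in candidates if not any(s < t for t in candidates)]

# Hypothetical example: a 4-clique "core" on nodes 0..3 with two whiskers,
# the path 3-4-5 and the pendant node 6 attached to node 0.
example = {
    0: [1, 2, 3, 6], 1: [0, 2, 3], 2: [0, 1, 3], 3: [0, 1, 2, 4],
    4: [3, 5], 5: [4], 6: [0],
}

if __name__ == "__main__":
    print(sorted(sorted(w) for w in find_whiskers(example)))
```

The edge-by-edge bridge test costs one traversal per edge, so it is only meant for small illustrations; a linear-time bridge-finding algorithm would be the natural replacement on large networks.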

Denser and denser core of the network. The core contains ~60% of the nodes and ~80% of the edges.
Network structure: core-periphery (jellyfish, octopus).
Whiskers are responsible for good communities.

Each new edge inside the community costs more. Each node has twice as many children.
[NCP plot] Φ = 1/3 = 0.33, Φ = 2/4 = 0.5, Φ = 8/6 ≈ 1.3, Φ = 64/14 ≈ 4.6
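The four scores on this slide follow a simple recurrence, sketched below under one reading of the figure (an assumption, since the picture itself is not in the transcript): the community starts with 3 inside edges and 1 cut edge, each growth step absorbs every node at the far end of a cut edge, and the number of children per absorbed node doubles from one level to the next (2, 4, 8, ...). The function name and its defaults are illustrative.

```python
def growing_tree_scores(steps=4, inside=3, cut=1, children=2):
    """Return (cut, inside, cut / inside) for each step of growing the community."""
    rows = []
    for _ in range(steps):
        rows.append((cut, inside, cut / inside))
        inside += cut      # every cut edge becomes an inside edge once its endpoint is absorbed
        cut *= children    # each absorbed node contributes `children` new cut edges
        children *= 2      # nodes one level further out have twice as many children
    return rows

if __name__ == "__main__":
    # Reproduces the ratios 1/3, 2/4, 8/6 and 64/14 quoted on the slide.
    for cut, inside, score in growing_tree_scores():
        print(f"{cut}/{inside} = {score:.2f}")
```

Under this reading, the inside-edge count grows only by the previous cut size while the cut size is multiplied by an ever larger factor, so the score rises with every step.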

[Figure label: edge to cut]
Whiskers in real networks are non-trivial (richer than trees).

Whiskers in real networks are larger than the whiskers expected based on density and degree sequence.


Nothing happens! Now we have 2-edge-connected whiskers to deal with. This indicates the recursiveness of our core-periphery structure: as we remove the periphery, the core itself breaks into a core and a periphery.

What if we allow cuts that give disconnected communities?
Cut all whiskers. Compose communities out of whiskers.
How good a community do we get?
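One way to explore the question is a small knapsack-style dynamic program over whiskers. This is only a sketch under stated assumptions, not the talk's procedure: each whisker is summarized by a hypothetical pair (number of nodes, number of internal edges), it hangs off the core by exactly one edge, and the union of chosen whiskers is assumed to hold less than half of the total volume, so its conductance is (number of whiskers) / (sum of their volumes).

```python
def bag_of_whiskers_ncp(whiskers, max_k):
    """Best conductance of a union of whiskers for each total size k (0/1 knapsack)."""
    # best[k][j] = largest total volume of exactly j whiskers with k nodes in total
    best = [dict() for _ in range(max_k + 1)]
    best[0][0] = 0
    for size, internal_edges in whiskers:
        volume = 2 * internal_edges + 1  # internal edges count twice, plus the single cut edge
        for k in range(max_k - size, -1, -1):  # descending k: each whisker used at most once
            for j, v in list(best[k].items()):
                if v + volume > best[k + size].get(j + 1, -1):
                    best[k + size][j + 1] = v + volume
    # A union of j whiskers cuts exactly j edges.
    return {
        k: min(j / v for j, v in best[k].items())
        for k in range(1, max_k + 1) if best[k]
    }

if __name__ == "__main__":
    # Hypothetical whiskers: an edge (2 nodes), a single node, and a triangle.
    for k, phi in bag_of_whiskers_ncp([(2, 1), (1, 0), (3, 3)], 6).items():
        print(k, round(phi, 3))
```

For a fixed size k, fewer whiskers with more internal edges give a better score, which is why the table tracks the maximum volume separately for each whisker count.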

[Figure: LiveJournal NCP plot with curves for the rewired network, Local Spectral, bag-of-whiskers, and Metis+MQI]

Regularization properties: spectral embeddings stretch along directions in which the random walk mixes slowly. The resulting hyperplane cuts have "good" conductance, but may not be the optimal cuts.
[Figure: spectral embedding vs. flow-based embedding]

[Figure: ext/int; dots are connected clusters]
Metis+MQI (red) gives sets with better conductance. Local Spectral (blue) gives tighter and more well-rounded sets.

Two ca. 500-node communities from Local Spectral. Two ca. 500-node communities from Metis+MQI.

... can be computed from: spectral embedding (independent of balance); SDP-based methods (for volume-balanced partitions).

What is a good model that explains such network structure? None of the existing models work.
[Figure: NCP plots labeled Flat, Down and Flat, Flat and Down, for Pref. attachment, Small World, and Geometric Pref. Attachment respectively]

Note: sparsity is the issue, not heavy tails per se. (Power laws with exponent between 2 and 3 give us the appropriate sparsity.)

Forest Fire [LKF05]: connections spread like a fire.
A new node joins the network, selects a seed node, connects to some of its neighbors, and continues recursively.
Notes:
Preferential attachment flavor: the second neighbor is not uniform at random.
Copying flavor: since we burn the seed's neighbors.
Hierarchical flavor: the seed is the parent.
Local flavor: we burn near the seed vertex (in a diffusion sense).
As a community grows, it blends into the core of the network.
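The burning process described above can be sketched in a few lines. This is a simplified, undirected variant written for illustration, not the exact model of [LKF05]: it assumes a single burning probability `p`, a uniformly random seed node, and a geometric number of neighbors burned at each visited node. The function name and parameters are hypothetical.

```python
import random

def forest_fire_graph(n, p=0.35, seed=0):
    """Grow an undirected graph with n nodes; returns a dict node -> set of neighbors."""
    rng = random.Random(seed)
    adj = {0: set()}
    for new in range(1, n):
        start = rng.randrange(new)       # seed node, chosen uniformly among existing nodes
        burned = {start}
        frontier = [start]
        while frontier:
            w = frontier.pop()
            # Number of neighbors to burn from w: geometric, mean p / (1 - p).
            x = 0
            while rng.random() < p:
                x += 1
            unburned = [u for u in adj[w] if u not in burned]
            rng.shuffle(unburned)
            for u in unburned[:x]:
                burned.add(u)
                frontier.append(u)       # the fire keeps spreading from u
        adj[new] = set(burned)           # the new node links to every burned node
        for u in burned:
            adj[u].add(new)
    return adj

if __name__ == "__main__":
    g = forest_fire_graph(1000, p=0.35, seed=42)
    edges = sum(len(nbrs) for nbrs in g.values()) // 2
    print(len(g), "nodes,", edges, "edges")
```

Because every new node links at least to its seed node, the graph stays connected; larger values of `p` let the fire spread further and produce denser graphs.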

[Figure: rewired network vs. bag of whiskers]

Whiskers: the largest whisker has ~100 nodes. Whisker size is independent of network size.
Core: 60% of the nodes, 80% of the edges. The core has little structure (hard to cut), but still more structure than the random network.

The Dunbar number: 150 individuals is the maximum community size. On-line communities have 60 members and break down at around 80; military, churches, divisions, etc. are all close to Dunbar's 150.
Common bond vs. common identity theory:
Common bond communities (people are attached to individual community members) are smaller and more cohesive.
Common identity communities (people are attached to the group as a whole) are focused around a common interest and tend to be larger and more diverse.
What edges mean and community identification:
Social networks: the reasons an individual adds a link to a friend can vary enormously.
Citation networks or web graphs: links are more expensive and are more semantically uniform.

Networks with ground truth communities:
LiveJournal12: users create and explicitly join on-line groups.
DBLP co-authorships: publication venues can be viewed as communities.
Amazon product co-purchasing: each item belongs to one or more hierarchically organized categories, as defined by Amazon.
IMDB collaboration: countries of production and languages may be viewed as communities.

[Figure: NCP plots for LiveJournal, DBLP, Amazon, and IMDB, with curves for the rewired network and ground truth]

The NCP plot is a way to analyze network community structure.
Our results agree with previous work on small networks (people did not hit Dunbar's limit).
But large networks are different: whiskers + core (core-periphery) structure.
Small, well-isolated communities blend into the core of the network as they grow.


Assume a recursive Kronecker model. Fit it to G. We get
$$K = \begin{pmatrix} 0.9 & 0.5 \\ 0.5 & 0.1 \end{pmatrix}$$
What does this tell us about the network structure?
Core: 0.9 edges. Core-periphery: 0.5 edges. Periphery: 0.1 edges. No communities, no good cuts.
As opposed to
$$\begin{pmatrix} 0.9 & 0.1 \\ 0.1 & 0.9 \end{pmatrix}$$
which gives a hierarchy.
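The contrast between the two initiator matrices can be checked numerically. The sketch below assumes the usual stochastic Kronecker construction with a 2×2 initiator, in which the probability of an edge (u, v) is the product of initiator entries indexed by the corresponding bits of u and v. The helper names are illustrative, and the "mass" counts ordered pairs including self-pairs.

```python
def edge_prob(K, u, v, levels):
    """Edge probability in the levels-th Kronecker power of a 2x2 initiator K."""
    p = 1.0
    for _ in range(levels):
        p *= K[u % 2][v % 2]   # one factor per bit position of the node ids
        u //= 2
        v //= 2
    return p

def block_mass(K, levels):
    """Expected edge mass in the four top-level blocks (split on the highest bit)."""
    n = 2 ** levels
    top = levels - 1
    mass = [[0.0, 0.0], [0.0, 0.0]]
    for u in range(n):
        for v in range(n):
            mass[u >> top][v >> top] += edge_prob(K, u, v, levels)
    return mass

core_periphery = [[0.9, 0.5], [0.5, 0.1]]   # fitted initiator from the slide
hierarchy = [[0.9, 0.1], [0.1, 0.9]]        # the contrasting initiator from the slide

if __name__ == "__main__":
    for name, K in (("core-periphery", core_periphery), ("hierarchy", hierarchy)):
        print(name, [[round(x, 2) for x in row] for row in block_mass(K, 3)])
```

With the fitted initiator, most of the mass sits in the core block, a moderate amount links core to periphery, and the periphery block is nearly empty, so no split separates two dense halves. With the contrasting initiator, both diagonal blocks are dense and the off-diagonal blocks are sparse, which repeats at every level and yields nested communities.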
