Local higher-order graph clustering

Similar documents
Local Higher-Order Graph Clustering

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Local Network Community Detection with Continuous Optimization of Conductance and Weighted Kernel K-Means

Reduce and Aggregate: Similarity Ranking in Multi-Categorical Bipartite Graphs

Local Community Detection Based on Small Cliques

Non-exhaustive, Overlapping k-means

COMMUNITY detection is one of the most important

Diffusion and Clustering on Large Graphs

Non Overlapping Communities

A Local Algorithm for Structure-Preserving Graph Cut

Similarity Ranking in Large- Scale Bipartite Graphs

TOOLS FOR HIGHER-ORDER NETWORK ANALYSIS

Community detection using boundary nodes in complex networks

Parallel Local Graph Clustering

CS224W: Analysis of Networks Jure Leskovec, Stanford University

An Empirical Analysis of Communities in Real-World Networks

Network community detection with edge classifiers trained on LFR graphs

Community Detection: Comparison of State of the Art Algorithms

Dynamic Structural Similarity on Graphs

Community detection. Leonid E. Zhukov

An Efficient Algorithm for Community Detection in Complex Networks

Parallel Local Graph Clustering

Implementation of Network Community Profile using Local Spectral algorithm and its application in Community Networking

Distribution-Free Models of Social and Information Networks

COMP5331: Knowledge Discovery and Data Mining

Community Detection Using Random Walk Label Propagation Algorithm and PageRank Algorithm over Social Network

A Novel Parallel Hierarchical Community Detection Method for Large Networks

arxiv: v1 [cs.si] 25 Sep 2015

Finding and Visualizing Graph Clusters Using PageRank Optimization. Fan Chung and Alexander Tsiatas, UCSD WAW 2010

Local Partitioning using PageRank

CS249: SPECIAL TOPICS MINING INFORMATION/SOCIAL NETWORKS

A Local Seed Selection Algorithm for Overlapping Community Detection

Diffusion and Clustering on Large Graphs

Why is a power law interesting? 2. it begs a question about mechanism: How do networks come to have power-law degree distributions in the first place?

Fast Nearest Neighbor Search on Large Time-Evolving Graphs

On the Permanence of Vertices in Network Communities. Tanmoy Chakraborty Google India PhD Fellow IIT Kharagpur, India

Supplementary material to Epidemic spreading on complex networks with community structures

Community detection algorithms survey and overlapping communities. Presented by Sai Ravi Kiran Mallampati

Extracting Information from Complex Networks

Graph Exploration: Taking the User into the Loop

FlowPro: A Flow Propagation Method for Single Community Detection

SCALABLE LOCAL COMMUNITY DETECTION WITH MAPREDUCE FOR LARGE NETWORKS

Mining Social Network Graphs

Web Structure Mining Community Detection and Evaluation

Snowball Sampling a Large Graph

Summary: What We Have Learned So Far

Comparison of Centralities for Biological Networks

Lecture Note: Computation problems in social. network analysis

Introduction to Bioinformatics

Spectral Graph Multisection Through Orthogonality

The Small Community Phenomenon in Networks: Models, Algorithms and Applications

Studying the Properties of Complex Network Crawled Using MFC

Centralities (4) By: Ralucca Gera, NPS. Excellence Through Knowledge

Graph Partitioning Algorithms

TELCOM2125: Network Science and Analysis

Local Community Detection in Dynamic Graphs Using Personalized Centrality

Information Networks: PageRank

Pregel. Ali Shah

Instructor: Dr. Mehmet Aktaş. Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University

CSE 701: LARGE-SCALE GRAPH MINING. A. Erdem Sariyuce

Incoming, Outgoing Degree and Importance Analysis of Network Motifs

Subsampling Graphs 1

Lecture 11: Clustering and the Spectral Partitioning Algorithm A note on randomized algorithm, Unbiased estimates

Online Social Networks and Media. Community detection

CEIL: A Scalable, Resolution Limit Free Approach for Detecting Communities in Large Networks

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

Sparsification of Social Networks Using Random Walks

Jure Leskovec, Cornell/Stanford University. Joint work with Kevin Lang, Anirban Dasgupta and Michael Mahoney, Yahoo! Research

EXTREMAL OPTIMIZATION AND NETWORK COMMUNITY STRUCTURE

CS224W: Analysis of Networks Jure Leskovec, Stanford University

Advanced Algorithms and Models for Computational Biology -- a machine learning approach

An Exploratory Journey Into Network Analysis A Gentle Introduction to Network Science and Graph Visualization

MCL. (and other clustering algorithms) 858L

Lecture 6: Spectral Graph Theory I

Relative Centrality and Local Community Detection

Classify My Social Contacts into Circles Stanford University CS224W Fall 2014

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

A Simple Acceleration Method for the Louvain Algorithm

CSCI-B609: A Theorist s Toolkit, Fall 2016 Sept. 6, Firstly let s consider a real world problem: community detection.

BigML homework 6: Efficient Approximate PageRank

Graph Theory. Graph Theory. COURSE: Introduction to Biological Networks. Euler s Solution LECTURE 1: INTRODUCTION TO NETWORKS.

Theme Identification in RDF Graphs

Edge Weight Method for Community Detection in Scale-Free Networks

TPA: Fast, Scalable, and Accurate Method for Approximate Random Walk with Restart on Billion Scale Graphs

Jure Leskovec Including joint work with Y. Perez, R. Sosič, A. Banarjee, M. Raison, R. Puttagunta, P. Shah

Gaps between theory practice in!!! large scale matrix computations for networks

Computing Heat Kernel Pagerank and a Local Clustering Algorithm

Generalized Louvain method for community detection in large networks

Scalable Clustering of Signed Networks Using Balance Normalized Cut

Parallel Local Graph Clustering

Locality-sensitive hashing and biological network alignment

Expected Nodes: a quality function for the detection of link communities

Computational Complexity CSC Professor: Tom Altman. Capacitated Problem

arxiv: v2 [cs.si] 22 Mar 2013

Efficient Counting of Network Motifs

Analysis of Biological Networks. 1. Clustering 2. Random Walks 3. Finding paths

CS 5614: (Big) Data Management Systems. B. Aditya Prakash Lecture #21: Graph Mining 2

SLPA: Uncovering Overlapping Communities in Social Networks via A Speaker-listener Interaction Dynamic Process

Community Detection in Bipartite Networks:

Transcription:

Local higher-order graph clustering Hao Yin Stanford University yinh@stanford.edu Austin R. Benson Cornell University arb@cornell.edu Jure Leskovec Stanford University jure@cs.stanford.edu David F. Gleich Purdue University dgleich@purdue.edu * Code and data available at http://snap.stanford.edu/mappr

Background: Local clustering Local cluster Seed node Different from global clustering: Target community detection; Algorithm only explores a local neighborhood of seed node. 2

Background: Local clustering Local cluster Seed node Used to find communities of an individual in social networks [Jeub et al., PRE, 15]. find members of a protein complex in PPI networks [Voevodski et al., BMC Sys. Biol., 09]. find related videos in online media [Gargi et al., ICWSM, 11]. and much more... [Epasto et al., WWW, 14; Jiang et al., Sys. Biol., 09]. 3

Background: Local clustering Existing methods find clusters with many internal edges and few external edges; Usually formulated as finding a cluster S with minimal (edge) conductance [Schaeffer, 07]. #(edges cut) φ S = Vol S S o edges cut: o Vol S = #(edge end points in S) = deg(u) 4 6 4 edges cut 20 edge end points edge conductance = 4 / 20 However, edges are not the only interesting structures in networks! 4

Background: Higher-order structure Higher-order connectivity patterns, or network motifs, mediate complex networks. Triangles in social networks. Bi-fans in neural Aconnectivity networks. Rapoport, 1953; Granovetter, 1973. Milo et al., 2002; Benson et al., 2016. C Feed-forward loops in genetic transcription networks. B Mangan et al., 2003; Alon, 2007. 5

Background: Motif-based graph clustering Idea: Find a cluster S with minimal motif conductance [Benson et al., 16] φ 7 S = #(motifs cut) Vol 7 S S o o motifs cut: Vol 7 S = #(motif end points in S) Significant improvement in ground truth community detection and knowledge discovery [Benson et al., 16] However, current motif-based clustering methods are global, and no motif-based local clustering method exists! 6

Our work: Local motif-based graph clustering better results Global method Local method Edge-based [Fiedler, 1973] [Anderson et al., 2006] Motif-based [Benson et al., 2016] Our Work scalable and faster Problem: Input: a network, a seed node, and a motif. Output: a cluster containing the seed node with minimal motif conductance. 7

Our work: Local motif-based graph clustering Seed node Different motifs give different local clustering results! 8

Our work: Local motif-based graph clustering Challenge A generalization of the conductance minimization problem which is NP-hard [Wagner and Wagner, 1993]. No existing methods for local clustering based on motif conductance. Our approximate solution: MAPPR Motif-based Approximate Personalized PageRank Algorithm A generalization of the APPR method [Andersen et al., 06]. 9

MAPPR: Overview Key ideas and steps: 1. Create a weighted graph with weighted edge conductance equals the motif conductance in the original graph; 2. Find a cluster of minimal weighted edge conductance. unweighted graph φ S = #(edges cut) d(u) 4 6 weighted graph φ ; S = = >4?(6) 4 6 w(e) d ; u d ; u = @ w(e) 4 = 10

MAPPR: Overview Key ideas and steps: 1. Create a weighted graph with weighted edge conductance equals the motif conductance in the original graph; 2. Find a cluster of minimal weighted edge conductance. Properties: o Runtime guarantee Ø Procedure stops upon finding a good cluster, no need to explore the rest of graph. o Quality guarantee Ø Finds a near-optimal cluster regarding motif conductance. 11

MAPPR Step I: Weighted graph Create a weighted graph with w i, j = #motif instances containing nodes i and j. The motif conductance (approximately) equals the weighted edge conductance in this weighted graph [Benson et at.,16]. 12

MAPPR Step II: APPR vector Compute an approximate PPR vector for this weighted graph. o The PPR vector p is the stationary distribution of a random walk which at each step it teleports back to the seed with some probability. o p u measures an integrated closeness of node u to the seed. o On a weighted graph, we choose each edge with probability proportional to its weight. o We adapted the approximate PPR algorithm [Anderson et al., 06] for weighted graphs. 13

MAPPR Step III: Sweep Use the sweep procedure on APPR vector p to output the set with minimal weighted edge conductance [Anderson et al., 06]. 1) Sort nodes by p(u)/d ; (u) : u E, u F, u [J] ; best local motif-based cluster 2) Compute the conductance of each S L = u E, u F,, u L ; 3) Output the S N with minimal weighted edge conductance. 14

MAPPR: Runtime Theory After motif counting, for each seed, the algorithm finishes in time proportional to the output cluster size! No dependence on graph size! Key part of proof: Interpret integer-weighted edges as parallel unweighted edges, then apply the previous analysis [Anderson et al., 06]. Practice Takes < 2 seconds / seed on 2 billion edge graphs! Global motif-based method takes 2 hours. 15

MAPPR: Quality Theory For any unknown target community T, MAPPR seeded with most nodes in T would output a cluster S with φ 7 S OR min φ 7 T, φ 7 T / η. η is the inverse mixing time of the subgraph induced by T. Guaranteed to find a near-optimal cluster. Results inherited from classic APPR analysis [Anderson et al., 06, Zhu et al., 13]. Practice Better recovers ground truth communities! 16

MAPPR: Better discovery on synthetic networks Random graph models with planted community structure: 1. Planted partition model 17

MAPPR: Better discovery on synthetic networks Random graph models with planted community structure: 1. Planted partition model 2. Lancichinetti-Fortunato-Radicchi (LFR) model [Lancichinetti et al., 08, 09] A variant of planted partition model with power-law degree distribution, community overlapping, power-law community size distribution, etc. 18

MAPPR: Better discovery on synthetic networks Experiment Procedure: Seeded at every node, compute the F E score of the MAPPR cluster with the ground truth cluster, and take the average. 19

MAPPR: Better discovery on synthetic networks Experiment Procedure: Seeded at every node, compute the F E score of the MAPPR cluster with the ground truth cluster, and take the average. Repeat this experiment under different mixing level μ : fraction of edges across the ground truth communities. μ = 0.2 μ = 0.4 μ = 0.6 μ = 0.8 20

MAPPR: Better discovery on synthetic networks A large region of μ where MAPPR with triangle motif outperforms the edge-based method! Intuition: Edges across different communities are less likely to form a triangle. 21

MAPPR: Better discovery on real-world networks email-eu network Nodes are people at a research institute. Each person belongs to a department. Edges are email correspondence. Given a person as a seed, can we recover other members of his/her department? Graph Statistics: Nodes: 1K Edges: 25.6K Departments: 28 Depart. Sizes: 10 -- 109 New dataset! http://snap.stanford.edu/data/email-eu-core.html 22

MAPPR: Better discovery on real-world networks email-eu network Nodes are people at a research institute. Each person belongs to a department. Edges are email correspondence. APPR MAPPR Method Average F 1 0.40 0.50 Given a person as a seed, can we recover other members of his/her department? A 25% improvement! v Triangle motif better discovers ground truth communities! 23

MAPPR: Better discovery on real-world networks email-eu network Nodes are people at a research institute. Each person belongs to a department. Edges are email correspondence. Given a person as a seed, can we recover other members of his/her department? Motif used in MAPPR Average F 1 0.50 0.45 0.47 v Both feed-forward loop and cycle are important in discovering communities structure in communication network! 24

More in the paper: Finding good seeds Idea: A vertex with low 1-hop neighborhood motif conductance has lower motif conductance in its MAPPR cluster. Theory: Vertex 1-hop neighborhoods have low motif conductance. o Related with higher-order clustering coefficient [Yin et al., 17]. 25

Recap Proposed the MAPPR algorithm for local motif-based graph clustering ü Runtime guarantee ü Quality guarantee ü Better recovery in synthetic models and real-world datasets Finding Good Seeds 26

Local higher-order graph clustering Hao Yin, Austin R. Benson, Jure Leskovec, David F. Gleich * Code and data available at http://snap.stanford.edu/mappr