Finding and Visualizing Graph Clusters Using PageRank Optimization. Fan Chung and Alexander Tsiatas, UCSD WAW 2010


What is graph clustering? The division of a graph into several partitions. Clusters should be selected according to some criteria, for example: balance, graph connectivity, and separation.

Why use graph clustering? Identification of communities or logical subgroups within a larger network. Targeted advertising. Product recommendations.

Why use graph clustering? Image segmentation and computer vision [Shi, Malik 00]: identifying distinct objects or surfaces in an image. Effective resistance to network epidemics [Chung, Horn, Tsiatas 09]. Many applications in machine learning and data processing.

Graph clustering algorithms k-means [MacQueen 67; Lloyd 82] NP-hard to solve exactly; many heuristic iterative algorithms. Requires a notion of pairwise distances and is usually used for vector-based data. It is intractable to embed a graph into a low-dimensional space, or to cluster data in high-dimensional space. Using the usual (shortest-path) graph distance as a metric is not discerning in real-world graphs with the small-world phenomenon: all distances are small!

Graph clustering algorithms Spectral clustering [Shi, Malik 00; Ng, Jordan, Weiss 02] Relies on matrix computation, which can be intractable for large networks. Splits the graph into 2 parts; for more, recursively apply the algorithm. We will develop an algorithm for k parts without recursive division. Markov clustering [Enright, Van Dongen, Ouzounis 02] Also reliant on matrix computations.

Graph clustering algorithms Affinity propagation [Frey, Dueck 07] A heuristic algorithm using pairwise distances. Local partitioning algorithms [Andersen, Chung, Lang 06; Andersen, Chung 07] Algorithms for finding one smaller cut within a network. We will find a more balanced partition into k parts. Does not require pairwise distances. Many others [Schaeffer 07]

Our contribution A new graph clustering algorithm: find k centers of mass using PageRank, avoiding the need for a high-dimensional embedding; use the centers to derive k clusters using Voronoi diagrams; perform the computations efficiently using inexpensive approximation algorithms. A graph drawing algorithm: use PageRank to assist in determining node locations, highlighting local cluster structure. PageRank helps overcome several problems!

What is PageRank? Personalized PageRank [Brin, Page 98; Jeh, Widom 03] vectors quantify structural relationships between vertices and a specified starting distribution (or vertex) s: PageRank is the stationary distribution of a random walk (with transition probability matrix W) that restarts to s randomly, i.e. pr(α,s) = α s + (1−α) pr(α,s) W. The restart rate is controlled by the jumping constant α.
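As a concrete illustration, here is a minimal sketch of computing a personalized PageRank vector by power iteration. It assumes the plain random-walk matrix W = D⁻¹A (a lazy-walk variant would differ only in W); the toy graph and all names are illustrative, not from the slides.

```python
import numpy as np

def personalized_pagerank(A, s, alpha, iters=1000):
    """pr(alpha, s): fixed point of p = alpha*s + (1-alpha)*p @ W,
    where W = D^{-1} A is the random-walk transition matrix."""
    d = A.sum(axis=1)
    W = A / d[:, None]            # row-stochastic walk matrix
    p = s.copy()
    for _ in range(iters):
        p = alpha * s + (1 - alpha) * p @ W
    return p

# Toy "dumbbell": two triangles {0,1,2} and {3,4,5} joined by the edge (2,3).
A = np.zeros((6, 6))
for u, v in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[u, v] = A[v, u] = 1.0

s = np.zeros(6)
s[0] = 1.0                        # restart distribution concentrated at vertex 0
p = personalized_pagerank(A, s, alpha=0.15)
# p is a probability distribution biased toward vertex 0's triangle.
```

Note that each iteration preserves total probability mass, so p remains a distribution; the restart term is what keeps it biased toward s rather than converging to the global stationary distribution.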

Why use PageRank? Proven to be effective and efficient at finding relevance in link-based data. Intuitive interpretation of vertex relationships: the vth component of pr(α,u) quantifies how well-suited v is to be a representative center for u. A natural metric for pairwise distances, giving more information than simple graph distances. Proven to be effective in Web search, local partitioning, and combating epidemics. Performance: using approximation algorithms [Andersen, Chung, Lang 06; Chung, Zhao 10], PageRank vectors can be calculated efficiently.

Pairwise distances using PageRank In Euclidean space (for k-means): dist(u,v) = ‖u − v‖. Using PageRank: dist_α(u,v) = ‖(pr(α,u) − pr(α,v)) D^{−1/2}‖, where D is the diagonal degree matrix. Throughout, we use node degrees and cluster volumes as normalizing factors. Generalizing to probability distributions p, q over vertices: dist_α(p,q) = ‖(pr(α,p) − pr(α,q)) D^{−1/2}‖.
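A hedged sketch of this distance, assuming it takes the degree-normalized form dist_α(u,v) = ‖(pr(α,u) − pr(α,v)) D^{−1/2}‖; the toy two-triangle graph is illustrative.

```python
import numpy as np

def pagerank(A, s, alpha, iters=1000):
    # Personalized PageRank by power iteration, with W = D^{-1} A.
    d = A.sum(axis=1)
    W = A / d[:, None]
    p = s.copy()
    for _ in range(iters):
        p = alpha * s + (1 - alpha) * p @ W
    return p

def pagerank_distance(A, u, v, alpha):
    """dist_alpha(u, v) = || (pr(alpha,u) - pr(alpha,v)) D^{-1/2} ||."""
    n = A.shape[0]
    d = A.sum(axis=1)
    pu = pagerank(A, np.eye(n)[u], alpha)
    pv = pagerank(A, np.eye(n)[v], alpha)
    return np.linalg.norm((pu - pv) / np.sqrt(d))

# Two triangles {0,1,2} and {3,4,5} joined by the edge (2,3).
A = np.zeros((6, 6))
for x, y in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[x, y] = A[y, x] = 1.0

# Vertices in the same dense cluster are closer than vertices across the bridge.
same = pagerank_distance(A, 0, 1, alpha=0.15)
cross = pagerank_distance(A, 0, 4, alpha=0.15)
```

Unlike shortest-path distance, which would report 1 vs 3 here, this metric reflects how differently the two personalized walks spread through the cluster structure.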

Centers of mass and clusters c is an ε-center, or center of mass, for a vertex set S if its PageRank distance to S is small. Here, c can be an individual vertex or a more general probability distribution. A set C of k centers determines a set of k clusters R_c, one for each c in C: each vertex is assigned to its closest center. In other words, the clusters are determined by a Voronoi diagram using the PageRank distances and the centers C.

Evaluating centers and clusters We need some way to describe how good a set of centers C and their corresponding clusters R_c are. For k-means, the quality measure is the sum of squared Euclidean distances from each point to its nearest center; using PageRank, we define μ(C) analogously with PageRank distances and degree weights. Here, c_v is the center of mass closest to v.

Evaluating centers and clusters μ(C) quantifies how well each center c in C represents its cluster R_c. We also need a metric evaluating how structurally different each cluster R_c is from the overall graph, using the random-walk stationary distribution π. If Ψ_α(C) is large, then the clusters are well-separated.

Evaluating a graph for clustered structure We interpret the PageRank vector p = pr(α,v) of a vertex v as giving the suitability of other vertices to be its center of mass. We define the α-PageRank-variance Φ(α): if Φ(α) is small, then each vertex v is close, in PageRank distance, to the center of mass predicted by its PageRank vector, indicating a clustered structure.

Evaluating a graph for clustered structure We also define the α-cluster-variance Ψ(α): if Ψ(α) is large, then the centers of mass predicted by PageRank vectors are far from the stationary distribution, also indicating a clustered structure.

Relationship between metrics The objective is to find a good set of centers C: this means μ(C) is small and Ψ_α(C) is large. But if we use PageRank vectors to guess centers of mass, these metrics are similar to Φ(α) and Ψ(α). If we take enough samples for centers of mass using PageRank, these metrics can be made arbitrarily close.

Relationship between metrics Thus, to find small μ(C) and large Ψ_α(C), we must find an α that gives small Φ(α) and large Ψ(α).

Selecting the jumping constant α We need to choose α so that Φ(α) is small. (We will see later that if a graph has a clustered structure, Ψ(α) will be large.) We can find local minima of Φ(α) by finding roots of its derivative Φ′(α). We can also just use binary search over (0,1). There can be several good values for α, indicating a layered clustering structure.

Selecting the jumping constant α: an illustration Let G be a dumbbell graph: two cliques of 20 nodes connected by a single edge. [Plots of Φ(α), Φ′(α), Ψ(α), and Ψ′(α) for the dumbbell graph]

The clustering algorithm PageRank-ClusteringA(G,k,ε): For each vertex v, compute its PageRank vector pr(α,v). For each root α of Φ′(α): If Φ(α) ≤ ε and k − Ψ(α) ≤ 2ε: Select c log n sets of k potential centers, randomly sampled from π. (Here, c is some large constant.) For each set S = {v_1, ..., v_k}, let C be the set of centers of mass where c_i = pr(α,v_i). If μ(C) ≤ Φ(α) + ε and Ψ_α(C) ≥ Ψ(α) − ε, return the clusters given by the k Voronoi regions according to the PageRank distances using C. Otherwise, there may be no output: the graph does not have a clustered structure.
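A much-simplified sketch of the core Voronoi step of this algorithm, assuming the seed vertices are given rather than sampled from π, and omitting the Φ/Ψ acceptance tests and the approximation machinery; the graph and parameters are illustrative.

```python
import numpy as np

def pagerank(A, s, alpha, iters=1000):
    # Personalized PageRank by power iteration, with W = D^{-1} A.
    d = A.sum(axis=1)
    W = A / d[:, None]
    p = s.copy()
    for _ in range(iters):
        p = alpha * s + (1 - alpha) * p @ W
    return p

def voronoi_clusters(A, seeds, alpha):
    """Assign every vertex to the nearest center of mass c_i = pr(alpha, v_i),
    using the degree-normalized PageRank distance."""
    n = A.shape[0]
    d = A.sum(axis=1)
    centers = [pagerank(A, np.eye(n)[v], alpha) for v in seeds]
    labels = []
    for v in range(n):
        pv = pagerank(A, np.eye(n)[v], alpha)
        dists = [np.linalg.norm((c - pv) / np.sqrt(d)) for c in centers]
        labels.append(int(np.argmin(dists)))
    return labels

# Two triangles {0,1,2} and {3,4,5} joined by the edge (2,3).
A = np.zeros((6, 6))
for x, y in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[x, y] = A[y, x] = 1.0

labels = voronoi_clusters(A, seeds=[0, 4], alpha=0.25)
# The two triangles come out as the two Voronoi regions: [0, 0, 0, 1, 1, 1]
```

Even the bridge vertices 2 and 3, which touch both sides, are assigned to their own triangle, since their personalized walks recirculate mostly within it.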

Analysis of the clustering algorithm The algorithm PageRank-ClusteringA does not always return a clustering, but we will show that it does for a special class of (k,h,β,ε)-clusterable graphs G: G can be partitioned into k parts so that each part S satisfies: S has Cheeger ratio at most h. S has volume at least β vol(G)/k. There is a subset S′ with vol(S′) ≥ (1−ε) vol(S) and Cheeger ratio bounded from below. Here, the Cheeger ratio of a set H is the ratio of the number of edges leaving H to vol(H).
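The Cheeger ratio itself is straightforward to compute directly from the definition; a minimal sketch (the toy graph is illustrative):

```python
import numpy as np

def cheeger_ratio(A, S):
    """Edges leaving S divided by vol(S), where vol(S) is the sum of the
    degrees of the vertices in S."""
    S = set(S)
    n = A.shape[0]
    boundary = sum(A[u, v] for u in S for v in range(n) if v not in S)
    vol = A[list(S)].sum()        # sum of rows = sum of degrees in S
    return boundary / vol

# Two triangles {0,1,2} and {3,4,5} joined by the edge (2,3).
A = np.zeros((6, 6))
for x, y in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[x, y] = A[y, x] = 1.0

h = cheeger_ratio(A, [0, 1, 2])   # 1 boundary edge, volume 2+2+3 = 7 → 1/7
```

A small ratio, as here, means few edges leave the set relative to its volume: exactly the "good cluster" condition in the definition above.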

Analysis of the clustering algorithm Theorem. Suppose a graph G has a (k,h,β,ε)-clustering and α, ε in (0,1) satisfy ε ≥ hk/(2αβ). Then, with high probability, PageRank-ClusteringA returns a set C of k centers with Φ(α) ≤ ε and Ψ_α(C) > k − 2ε, and the k clusters are near-optimal according to μ(C), with an additive error term ε.

Analysis of the clustering algorithm The theorem mainly follows from the definition of a (k,h,β,ε)-clustering, the theory above surrounding the evaluative cluster metrics Φ and Ψ, and probabilistic sampling arguments. The rest follows from the following claim: if G can be partitioned into k clusters, each with Cheeger ratio at most h, and ε ≥ hk/(2αβ), then Ψ(α) ≥ k − 2ε. This claim can be proven using a generalization of a known connection between PageRank and the Cheeger ratio [Andersen, Chung, Lang 06].

Analysis of the clustering algorithm The computational complexity of PageRank-ClusteringA is dominated by several computations: finding the roots of Φ′(α); O(k log n) calculations of μ(C) and Ψ_α(C); O(n) PageRank vector calculations. These computations can be expensive, but fortunately we have inexpensive approximation algorithms: finding roots and calculating functions using sampling techniques [Rudelson, Vershynin 07]; using approximate PageRank vectors [Andersen, Chung, Lang 06; Chung, Zhao 10]. These techniques are outlined in an algorithm PageRank-ClusteringB.

PageRank and graph visualization Many graph visualization algorithms have trouble showing local structure in complex networks without resorting to hierarchical layouts. PageRank's quantitative information can be used to assist force-based graph layout algorithms [Kamada, Kawai 89], highlighting local clusters around k centers of mass. For each center c and every non-center v, simulate a spring with force inversely proportional to the vth component of pr(α,c). For two centers c and c′, simulate a spring with a strong repellent force. We also overlay a Voronoi diagram in Euclidean space to highlight the clusters.
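A crude sketch of the layout idea: pin the centers apart (here, evenly on a circle, as a stand-in for the repellent springs) and place each remaining vertex at an average of the center positions weighted by its PageRank components, so that each vertex lands nearest the center that best represents it. This is a simplification of the spring system described above, not the authors' implementation; the graph, function names, and parameters are illustrative.

```python
import numpy as np

def personalized_pagerank(A, s, alpha, iters=1000):
    # Personalized PageRank by power iteration, with W = D^{-1} A.
    d = A.sum(axis=1)
    W = A / d[:, None]
    p = s.copy()
    for _ in range(iters):
        p = alpha * s + (1 - alpha) * p @ W
    return p

def pagerank_layout(A, centers, alpha=0.25):
    """Pin centers evenly on the unit circle; place every other vertex at
    the average of the center positions, weighted by pr(alpha, c)[v]."""
    n, k = A.shape[0], len(centers)
    angles = 2 * np.pi * np.arange(k) / k
    cpos = np.stack([np.cos(angles), np.sin(angles)], axis=1)
    prs = [personalized_pagerank(A, np.eye(n)[c], alpha) for c in centers]
    pos = np.zeros((n, 2))
    for v in range(n):
        w = np.array([p[v] for p in prs])   # attraction toward each center
        pos[v] = (w / w.sum()) @ cpos
    for i, c in enumerate(centers):
        pos[c] = cpos[i]                    # centers stay pinned apart
    return pos

# Two triangles {0,1,2} and {3,4,5} joined by edge (2,3); centers 0 and 4.
A = np.zeros((6, 6))
for u, v in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[u, v] = A[v, u] = 1.0

pos = pagerank_layout(A, centers=[0, 4])
# Vertices 1, 2 land on center 0's side (x > 0); vertices 3, 5 on center 4's (x < 0).
```

A real force-directed implementation would iterate spring forces instead of taking a single weighted average, but the weighting already makes the two local clusters visually separate.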

PageRank and graph visualization: an example Social network of dolphins [Newman, Girvan 04] with 2 clusters

PageRank and graph drawing: an example Network of NCAA Division I football opponents [Girvan, Newman 02], highlighting several conferences

Open questions Improved performance and scalability; graph visualization bottlenecks. Applications of the graph clustering algorithm in specific settings: biological graphs, social networks.

Questions? Thank you!