Fast Nearest Neighbor Search on Large Time-Evolving Graphs

Fast Nearest Neighbor Search on Large Time-Evolving Graphs
Leman Akoglu, Srinivasan Parthasarathy, Rohit Khandekar, Vibhore Kumar, Deepak Rajan, Kun-Lung Wu

Graphs are everywhere Leman Akoglu Fast Nearest Neighbor Search on Large Time-Evolving Graphs 3

... and LARGE and TIME-evolving!
- 1.32 billion monthly active users (June 30, 2014)

Proximity problem on graphs
Also known as: NN-search, similarity, closeness, relevance
Q: Which nodes are close to A?
[Figure: example graph around node A, all edge weights 1]

Application: Recommendations

Other applications
- Finding communities (e.g. co-authorship networks such as DBLP)
- Anomaly detection (e.g. infected hosts, potential suspects)
- Link prediction
- Keyword search
- Content-based image retrieval
- Fighting spam

Proximity measures for graphs
- Several metrics: shortest paths, commute time, hitting time, SimRank, ...
- Prevalent (robust) metric: Personalized PageRank (PPR)
PPR captures:
- many,
- short,
- heavy-weighted paths
[Figure: example graph around node A, all edge weights 1]

PPR is based on RWR (random walk with restart)
[Figure: 12-node example graph with the RWR score of each node]
Slides adapted from http://www.cs.cmu.edu/~htong/pub_new.htm
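
To make the RWR idea concrete, here is a minimal power-iteration sketch. It is not the method proposed in this talk (it is closer to the brute-force baseline). The function name, the toy path graph, and the choice alpha = 0.85 are assumptions for illustration; alpha is the probability of following an edge, so 1 - alpha is the restart probability. The sketch also assumes every node has at least one edge.

```python
import numpy as np

def personalized_pagerank(W, q, alpha=0.85, tol=1e-10, max_iter=1000):
    """RWR scores for query node q on a weighted adjacency matrix W."""
    W = np.asarray(W, dtype=float)
    n = W.shape[0]
    # Row-normalize edge weights into transition probabilities T(u, v).
    T = W / W.sum(axis=1, keepdims=True)
    restart = np.zeros(n)
    restart[q] = 1.0
    p = restart.copy()
    for _ in range(max_iter):
        # Follow an edge with probability alpha, otherwise jump back to q.
        p_next = alpha * (T.T @ p) + (1.0 - alpha) * restart
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next
    return p

# Hypothetical path graph 0 - 1 - 2 - 3 with unit weights; query node 0.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
scores = personalized_pagerank(W, q=0)
ranking = np.argsort(-scores)   # nodes ordered from closest to farthest
```

Each iteration touches every edge of the graph, which is why this direct approach becomes expensive on large graphs.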

Problem Definition
Maintain a LARGE, time-varying, edge-weighted graph G(t), so that we can answer the following query efficiently:
Given a query node q in G(t) at time t, find vertices in G(t) that are close to q (w.r.t. the Personalized PageRank score).

Road Map
- Motivation
- Problem Definition
- Previous work
- Our Approach
  - Graph clustering
  - Intra-Cluster & Inter-Cluster Random Walks (baby steps & BIG steps)
- Time-Varying Graphs
- Experiments
- Conclusions

Previous Work on PPR
- D. Fogaras, B. Rácz, K. Csalogány, T. Sarlós. Towards scaling fully personalized PageRank. In Internet Mathematics 2004.
- Hanghang Tong, Christos Faloutsos, and Jia-Yu Pan. Fast Random Walk with Restart and Its Applications. In ICDM 2006.
- Soumen Chakrabarti. Dynamic personalized PageRank in entity-relation graphs. In WWW 2007.
- H. Tong, S. Papadimitriou, P. S. Yu, and C. Faloutsos. Proximity Tracking on Time-Evolving Bipartite Graphs. In SDM 2008.
- P. Sarkar, A. W. Moore. Fast nearest-neighbor search in disk-resident graphs. In KDD 2010.
- Bahman Bahmani, Abdur Chowdhury, Ashish Goel. Fast Incremental and Personalized PageRank. In PVLDB 2010.
- Bahman Bahmani, Kaushik Chakrabarti, Dong Xin. Fast personalized PageRank on MapReduce. In SIGMOD 2011.
- P. A. Lofgren, S. Banerjee, A. Goel, C. Seshadhri. FAST-PPR: Scaling Personalized PageRank Estimation for Large Graphs. In KDD 2014.
We consider both large AND time-varying graphs!

Our Method: ClusterRank
1) Pre-computation
   a. Graph clustering
   b. Compute meta-info for each cluster
2) Query processing
   a. Identify relevant clusters to consider
   b. Combine their meta-info to compute an answer

Graph Clustering
- We work with large graphs (that do not fit in main memory), so we cluster vertices such that each cluster is small enough.
- Need good clusters: many intra-cluster edges, but few inter-cluster edges.
  - Random walks are more likely to stay within a cluster
  - A good cluster is already a good approximation of the close neighborhood of the vertices in it
Note: In some cases the graph may be clustered naturally (e.g. a Web graph spread across many servers).

Graph Clustering
- Many graph clustering algorithms exist, e.g. based on community detection, spectral partitioning, etc.
- Reid Andersen, Fan Chung, and Kevin Lang (ACL). Local Graph Partitioning using PageRank Vectors. FOCS, 2006.
- Advantages:
  - Local algorithm: complexity depends on the output cluster size
  - Gives clusters of different sizes, which can overlap
  - Can cluster while the graph is on disk

What is good clustering?
ACL [FOCS06]'s measure is conductance: Φ ∈ [0, 1]
Example: Φ = 3 / (4+3+4+4+2) = 3/17 ≈ 0.176
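
The slide's arithmetic matches the usual definition of conductance: the number of edges leaving the cluster divided by the smaller of the two volumes (sums of degrees) on either side of the cut. The sketch below uses that standard definition. The graph is hypothetical, built only so that the cluster has the same degrees (4, 3, 4, 4, 2) and the same 3 cut edges as the numbers in the example.

```python
from itertools import combinations

def conductance(edges, cluster):
    """Conductance of `cluster` in an undirected, unweighted graph."""
    cluster = set(cluster)
    cut = 0        # edges with exactly one endpoint in the cluster
    vol_in = 0     # sum of degrees of nodes inside the cluster
    vol_out = 0    # sum of degrees of nodes outside the cluster
    for u, v in edges:
        inside = (u in cluster) + (v in cluster)
        if inside == 1:
            cut += 1
        vol_in += inside
        vol_out += 2 - inside
    return cut / min(vol_in, vol_out)

# Hypothetical graph: cluster {0..4} has degrees 4, 3, 4, 4, 2 and 3 cut edges.
inner = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 4), (3, 4)]
outer = list(combinations(range(5, 10), 2))   # dense remainder of the graph
cut_edges = [(0, 5), (2, 6), (3, 7)]
phi = conductance(inner + outer + cut_edges, cluster=range(5))
```

A lower value means fewer edges leave the cluster relative to its size, so a random walk started inside is more likely to stay inside.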

Graph Clustering
[Figure: clustering of graph G]

Our Method: ClusterRank
1) Pre-computation
   a. Graph clustering
   b. Compute meta-info for each cluster
2) Query processing
   a. Identify relevant clusters to consider
   b. Combine their meta-info to compute an answer

Compute meta-info for each cluster
- C(u,v): the expected number of times (Count) a RW starting at node u in cluster S hits node v before exiting S (it can exit by walking to another cluster or by restarting to q).
- E(u,v): the probability that a RW starting at node u in cluster S Exits S to node v (an out-bound node in B), assuming the query (restart) vertex q is outside S.
Example: for the cluster S shown, the C matrix is 5x5 (|S| x |S|) and the E matrix is 5x3 (|S| x (|B| + 1): 2 out-bound nodes in B, plus q).
[Figure: example cluster S]

Compute meta-info for each cluster
Intra-cluster random walks → baby steps
[Figure: clusters S1, S2, S3, S4 and query node q]

Compute meta-info for each cluster
Recursive definition for C
[Equation not preserved in the transcript]
- T(u, v): transition probability from u to v
- N(u): set of neighbor nodes of node u
- (1 − α): restart probability
- S: set of nodes in the given cluster
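
One recursion for C that is consistent with the definitions on this slide is sketched below. It is a reconstruction from those definitions and may differ from the authors' exact formula. It assumes the query vertex q lies outside S, so a restart counts as leaving the cluster.

```latex
C(u, v) \;=\; \mathbb{1}[u = v] \;+\; \alpha \sum_{w \,\in\, N(u) \cap S} T(u, w)\, C(w, v)
```

The first term counts the visit at the start of the walk when u = v. The second term says that, with probability α, the walk follows an edge, and only steps that stay inside S keep contributing visits.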

Compute meta-info for each cluster
Closed-form formulae for C and E
[Equations not preserved in the transcript; they are expressed in terms of two matrices:]
- the |S| x |S| transition matrix within the cluster
- the |S| x (|B| + 1) matrix with exit probabilities to nodes in B ∪ {q}
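
As an illustration of what such closed forms can look like, the sketch below computes C as the inverse of (I − αT_S) and E as C times the one-step exit probabilities. This follows from the definitions of C and E given earlier, under the assumption that q is outside the cluster. The names `T_S`, `T_B`, and `cluster_meta_info`, and the 3-node example, are hypothetical and not taken from the slides.

```python
import numpy as np

def cluster_meta_info(T_S, T_B, alpha):
    """Return (C, E) for one cluster, assuming the query q lies outside it.

    T_S: |S| x |S| transition probabilities between nodes of the cluster.
    T_B: |S| x |B| transition probabilities to out-bound nodes.
    alpha: probability of following an edge (1 - alpha is the restart prob.).
    """
    n = T_S.shape[0]
    # Expected visit counts: I + (a T_S) + (a T_S)^2 + ... = (I - a T_S)^-1
    C = np.linalg.inv(np.eye(n) - alpha * T_S)
    # One-step exit probabilities: to each node in B, or to q by restarting.
    exit_step = np.hstack([alpha * T_B, np.full((n, 1), 1.0 - alpha)])
    E = C @ exit_step
    return C, E

# Hypothetical 3-node cluster with 2 out-bound nodes.
T_S = np.array([[0.0, 0.5, 0.5],
                [0.5, 0.0, 0.0],
                [1/3, 0.0, 0.0]])
T_B = np.array([[0.0, 0.0],
                [0.5, 0.0],
                [1/3, 1/3]])
C, E = cluster_meta_info(T_S, T_B, alpha=0.85)
```

Each row of E sums to 1, because a walk with a nonzero restart probability leaves the cluster eventually.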

Our Method: ClusterRank
1) Pre-computation (OFFLINE)
   a. Graph clustering
   b. Compute meta-info for each cluster
2) Query processing (ONLINE)
   a. Identify relevant clusters to consider
   b. Combine their meta-info to compute an answer

Query processing
Update meta-info for q's cluster
- C_q (C given q): [equation not preserved in the transcript]
- E_q (E given q): [equation not preserved in the transcript]
- K: |S| x |S| matrix of 0s with column q all 1s (rank 1!)
→ Fast Sherman-Morrison matrix inverse update
Recall: closed-form formulae for C and E
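
The Sherman-Morrison formula gives the inverse of a matrix after a rank-1 change without inverting from scratch. The sketch below applies it under one plausible reading of this slide, which is an assumption: when q is inside S, restarts land on q and no longer leave the cluster, so (1 − α)K is added to the in-cluster transitions. The matrix `T_S` and the choice of q are hypothetical.

```python
import numpy as np

def sherman_morrison(A_inv, u, v):
    """Inverse of (A + u v^T), given A_inv = A^-1 and vectors u, v."""
    Au = A_inv @ u
    vA = v @ A_inv
    return A_inv - np.outer(Au, vA) / (1.0 + v @ Au)

alpha = 0.85
T_S = np.array([[0.0, 0.5, 0.5],
                [0.5, 0.0, 0.0],
                [1/3, 0.0, 0.0]])
n = T_S.shape[0]
C = np.linalg.inv(np.eye(n) - alpha * T_S)   # pre-computed offline

q = 0                                         # index of the query inside S
ones = np.ones(n)
e_q = np.zeros(n)
e_q[q] = 1.0
# K = ones e_q^T, so (I - alpha T_S) changes by the rank-1 term -(1 - alpha) K.
C_q = sherman_morrison(C, -(1.0 - alpha) * ones, e_q)
```

The update needs only matrix-vector products and one outer product, which is much cheaper than a fresh inversion for a large cluster.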

Query processing
Inter-cluster graph M over relevant clusters
[Figure: clusters S1, S2, S3, S4 and query node q]

Query processing
Inter-cluster random walks → BIG steps
[Figure: inter-cluster graph M with query node q]


Query processing
Combine intra- and inter-cluster meta-info to compute the final answer ("lift" C matrices)
[Figure: clusters S1, S2, S3, S4]
Theorem: ClusterRank gives exact PPR scores.

Road Map
- Motivation
- Problem Definition
- Previous work
- Our Approach
  - Graph clustering
  - Intra-Cluster & Inter-Cluster Random Walks (baby steps & BIG steps)
- Time-Varying Graphs
- Experiments
- Conclusions

Time-varying ClusterRank
- WLOG: assume a single edge (u,v) is added
- Observation: the changes in the transition and exit-probability matrices are low-rank → compute the new C & E by the Sherman-Morrison (SM) formula
- 4 cases studied in the paper:
  - Both u and v are new vertices
  - Either u or v is a new vertex
  - u and v are in the same cluster
  - u and v are in different clusters
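
As a sketch of the "same cluster" case: adding an undirected edge (u, v) changes only rows u and v of the in-cluster transition matrix, so C can be refreshed with two rank-1 Sherman-Morrison updates. This assumes C has the form (I − αT_S)^-1 and that the graph is unweighted. The helper names and the 3-node cluster are hypothetical, and the paper's handling of the four cases may differ.

```python
import numpy as np

def update_C_for_row_change(C, alpha, u, delta_row):
    """Update C = (I - alpha*T_S)^-1 after row u of T_S changes by delta_row."""
    # (I - alpha*T_S) changes by -alpha * e_u delta_row^T, a rank-1 matrix.
    Cu = C[:, u]            # C @ e_u
    dC = delta_row @ C      # delta_row^T @ C
    return C + alpha * np.outer(Cu, dC) / (1.0 - alpha * dC[u])

def transition(W_S, degree):
    """Transition probabilities inside S; `degree` counts all edges of a node."""
    return W_S / degree[:, None]

alpha = 0.85
# Hypothetical cluster: path 0 - 1 - 2, where nodes 0 and 2 each also have
# one edge leaving the cluster (so their full degree is 2).
W_S = np.array([[0.0, 1.0, 0.0],
                [1.0, 0.0, 1.0],
                [0.0, 1.0, 0.0]])
degree = np.array([2.0, 2.0, 2.0])
T_old = transition(W_S, degree)
C = np.linalg.inv(np.eye(3) - alpha * T_old)

# Add the undirected edge (0, 2): rows 0 and 2 of T_S change.
W_new = W_S.copy()
W_new[0, 2] = W_new[2, 0] = 1.0
degree_new = degree + np.array([1.0, 0.0, 1.0])
T_new = transition(W_new, degree_new)
for u in (0, 2):
    C = update_C_for_row_change(C, alpha, u, T_new[u] - T_old[u])
```

After the loop, C matches what a fresh inversion with the new transition matrix would give. E would then be recomputed from the updated C and the new exit probabilities.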

Graph datasets

Dataset      #edges  #nodes  #clusters  description
Synthetic    909K    300K    100        Planted partitions
Amazon       900K    262K    3739       Product co-purchase
Web          1.1M    325K    2793       http://nd.edu links
DBLP         1.1M    329K    4670       Co-authorships
LiveJournal  21.5M   2.7M    15252      Friendships

Dataset      median Φ  avg. Φ  med. size  avg. size
Amazon       0.1385    0.1486  17         98.5
Web          0.0625    0.0871  31         129.4
DBLP         0.2117    0.2196  27         102.4
LiveJournal  0.5500    0.5319  43         237.3

Pre-computation
Pre-computation time depends on 1) graph size, 2) #clusters, 3) parallelization

Query Processing: set up
- Instead of all clusters, focus on a subset of relevant clusters (small neighborhoods around the query vertex, 1 or 2 hops away).
- Allow for a maximum of B boundary vertices
- Sparsify the inter-cluster matrix: zero out entries close to zero
- 100 randomly chosen query vertices

Evaluation criterion
- We report accuracy and running time for k nearest neighbor (kNN) queries.
- Accuracy = Relative Average Goodness (RAG) score @k

  RAG(@k) = (total true score of output) / (total true score of optimum)

Note: precision, i.e. overlap with the optimum, is *not* a good measure (due to ties/near-ties).
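
A small sketch of the RAG score as defined above, with made-up scores, shows why near-ties hurt precision but barely affect RAG. The function name and the score values are hypothetical.

```python
def rag_at_k(true_scores, returned, k):
    """Relative Average Goodness of the first k returned nodes."""
    achieved = sum(true_scores[v] for v in returned[:k])
    optimum = sum(sorted(true_scores.values(), reverse=True)[:k])
    return achieved / optimum

# Hypothetical exact PPR scores; "c" and "d" are nearly tied.
true_scores = {"a": 0.30, "b": 0.20, "c": 0.101, "d": 0.100, "e": 0.05}
approx_top3 = ["a", "b", "d"]   # swaps in "d" for the near-tied "c"
rag = rag_at_k(true_scores, approx_top3, k=3)
precision = len(set(approx_top3) & {"a", "b", "c"}) / 3
```

Here precision drops to 2/3 although the returned set is almost as good as the optimum, while RAG stays above 0.99.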

Synthetic graphs

SYNTHETIC                                   2-HOP    1-HOP
Average RAG (50) score (100 runs)
  B = 5K                                    0.9986   0.9865
  B = 1K                                    0.9892   0.9865
ClusterRank average response time (sec.)
  B = 5K                                    5.12     2.18
  B = 1K                                    2.86     2.12

Brute-force: 5.16 sec.

Real graphs

Dynamic updates
- 500K-edge DBLP graph + 1K new edges
- Avg: 42.12 seconds
- Avg: 2.78 clusters
Note: load/store time of the C, E matrices is included

Dynamic updates
DBLP 1959-2001
[Figure: two panels, "+1K edges in time" and "+500K edges in time"]

Summary
- ClusterRank: k nearest neighbor queries based on Personalized PageRank scores
  - Works with large and time-evolving graphs
  - Fast query time: sub-linear computation on pre-computed meta-info
  - Efficient dynamic updates via low-rank matrix updates
  - Disk-based: query processing and dynamic updates touch only the relevant subset of clusters
- Future directions
  - Cluster tracking and localized re-clustering
  - Extension to hitting / commute time

Thank You!
leman@cs.stonybrook.edu
http://www.cs.stonybrook.edu/~leman

Back-up Slides

Recursive definition for E
[Equation not preserved in the transcript]
- T(u, v): transition probability from u to v
- N(u): set of neighbor nodes of node u
- (1 − α): restart probability
- S: set of nodes in the given cluster
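
One recursion for E that is consistent with these definitions is sketched below. It is a reconstruction and may differ from the authors' exact formula. It assumes q is outside S and treats the restart as an exit to q.

```latex
E(u, v) = \alpha\, T(u, v) + \alpha \sum_{w \in N(u) \cap S} T(u, w)\, E(w, v) \quad \text{for } v \in B,
\qquad
E(u, q) = (1 - \alpha) + \alpha \sum_{w \in N(u) \cap S} T(u, w)\, E(w, q)
```

In both cases the first term is the probability of exiting in a single step, and the sum covers walks that first move to another node w inside S.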

Closed formulations for C and E
[Equation for C not preserved in the transcript], where I is an identity matrix of size |S| x |S|
Similarly, [equation for E not preserved in the transcript]

What if s (query vertex) ∈ S?

At query time, given the query vertex s, only the two matrices (C and E) of the cluster in which s resides are updated.
K is rank 1! Therefore, we use the Sherman-Morrison lemma to update C.
Complexity: multiplication of an |S| x 1 vector and a 1 x |S| vector.
Note that we do not need to run SVD, as K is only rank 1.