Fast Nearest Neighbor Search on Large Time-Evolving Graphs

Leman Akoglu, Srinivasan Parthasarathy, Rohit Khandekar, Vibhore Kumar, Deepak Rajan, Kun-Lung Wu

Graphs are everywhere

... and LARGE and TIME-evolving!
- 1.32 billion monthly active users (June 30, 2014)

Proximity problem on graphs
(also: NN-search, similarity, closeness, relevance)
Q: Which nodes are close to A?
[Figure: example graph around node A; every edge has weight 1]

Application: Recommendations

Other applications
- Finding communities (e.g. co-authorship networks such as DBLP)
- Anomaly detection (e.g. infected hosts, potential suspects)
- Link prediction
- Keyword search
- Content-based image retrieval
- Fighting spam

Proximity measures for graphs
- Several metrics: shortest paths, commute time, hitting time, SimRank, ...
- Prevalent (robust) metric: Personalized PageRank (PPR)
PPR captures:
- many,
- short,
- heavy-weighted paths
[Figure: example graph with unit edge weights]

PPR is based on RWR (random walk with restart)
[Figure: example graph with nodes 1-12 annotated with RWR scores]
Slides adapted from http://www.cs.cmu.edu/~htong/pub_new.htm
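
A minimal sketch of how RWR scores can be computed by power iteration is given here for orientation. It is not the method of this talk (it is closer to the brute-force baseline), and the function name, the dictionary-based graph format, and the 4-node path graph are all assumptions made for illustration. The restart probability is written as 1 - alpha, matching the notation used later in the slides.

```python
def personalized_pagerank(adj, q, alpha=0.85, tol=1e-12, max_iter=10_000):
    """Power iteration for a random walk with restart (RWR) from query node q.

    adj maps each node to a dict {neighbor: edge weight}. At every step the
    walker follows an edge with probability alpha (chosen proportionally to
    edge weight) and jumps back to q with probability 1 - alpha.
    Assumption: every node has at least one outgoing edge.
    """
    scores = {u: (1.0 if u == q else 0.0) for u in adj}
    for _ in range(max_iter):
        nxt = {u: 0.0 for u in adj}
        for u, nbrs in adj.items():
            total = sum(nbrs.values())
            for v, w in nbrs.items():
                nxt[v] += alpha * scores[u] * w / total
        nxt[q] += 1.0 - alpha  # restart mass returns to the query node
        diff = sum(abs(nxt[u] - scores[u]) for u in adj)
        scores = nxt
        if diff < tol:
            break
    return scores


# Hypothetical 4-node path graph a - b - c - d with unit weights.
path = {
    "a": {"b": 1.0},
    "b": {"a": 1.0, "c": 1.0},
    "c": {"b": 1.0, "d": 1.0},
    "d": {"c": 1.0},
}
ppr = personalized_pagerank(path, "a")
ranking = sorted(ppr, key=ppr.get, reverse=True)  # nodes by decreasing proximity to "a"
```

Every iteration touches every edge, which is why this direct approach becomes expensive on large graphs.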

Problem Definition
Maintain a LARGE, time-varying, edge-weighted graph G(t), so that we can answer the following query efficiently:
Given a query node q in G(t) at time t, find vertices in G(t) that are close to q (w.r.t. the Personalized PageRank score).

Road Map
- Motivation
- Problem Definition
- Previous work
- Our Approach
  - Graph clustering
  - Intra-Cluster & Inter-Cluster Random Walks (baby steps & BIG steps)
- Time-Varying Graphs
- Experiments
- Conclusions

Previous Work on PPR
- D. Fogaras, B. Rácz, K. Csalogány, Tamás Sarlós. Towards scaling fully personalized PageRank. In Internet Mathematics 2004.
- Hanghang Tong, Christos Faloutsos, and Jia-Yu Pan. Fast Random Walk with Restart and Its Applications. In ICDM 2006.
- Soumen Chakrabarti. Dynamic personalized PageRank in entity-relation graphs. In WWW 2007.
- H. Tong, S. Papadimitriou, P. S. Yu and C. Faloutsos. Proximity Tracking on Time-Evolving Bipartite Graphs. In SDM 2008.
- P. Sarkar, A. W. Moore. Fast nearest-neighbor search in disk-resident graphs. In KDD 2010.
- Bahman Bahmani, Abdur Chowdhury, Ashish Goel. Fast Incremental and Personalized PageRank. In PVLDB 2010.
- Bahman Bahmani, Kaushik Chakrabarti, Dong Xin. Fast personalized PageRank on MapReduce. In SIGMOD 2011.
- P. A. Lofgren, S. Banerjee, A. Goel, C. Seshadhri. FAST-PPR: Scaling Personalized PageRank Estimation for Large Graphs. In KDD 2014.
We consider both large AND time-varying graphs!

Our Method: ClusterRank
1) Pre-computation
   a. Graph clustering
   b. Compute meta-info for each cluster
2) Query processing
   a. Identify relevant clusters to consider
   b. Combine their meta-info to compute an answer

Graph Clustering
- We work with large graphs (that do not fit in main memory), so we cluster the vertices such that each cluster is small enough to fit in memory.
- Need good clusters: many intra-cluster edges, but few inter-cluster edges.
  - Random walks are more likely to stay within a cluster
  - A good cluster is already a good approximation of the close neighborhood of the vertices in the cluster
Note: In some cases, the graph could be clustered naturally (e.g. the Web graph across many servers).

Graph Clustering
- Many graph clustering algorithms exist, e.g. based on community detection, spectral partitioning, etc.
- Reid Andersen, Fan Chung, and Kevin Lang (ACL). Local Graph Partitioning using PageRank Vectors. FOCS, 2006.
- Advantages:
  - Local algorithm: complexity depends on output cluster size
  - Gives clusters of different sizes, which can be overlapping
  - Can do clustering while the graph is on disk

What is good clustering?
ACL [FOCS06]'s measure is conductance: Φ ∈ [0, 1]
Example: Φ = 3 / (4+3+4+4+2) = 3/17 ≈ 0.18
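
The arithmetic in this example reads as (weight of edges leaving the cluster) / (sum of degrees inside the cluster). A small sketch of that computation follows, using the standard definition of conductance with the smaller of the two volumes in the denominator. The graph is hypothetical: it is built only so that a 5-node cluster has degrees 4, 3, 4, 4, 2 and 3 cut edges, and the helper names are illustrative rather than taken from the slides.

```python
from collections import defaultdict
from itertools import combinations


def build_adj(edges):
    """Undirected, unit-weight adjacency map from an edge list."""
    adj = defaultdict(dict)
    for u, v in edges:
        adj[u][v] = 1.0
        adj[v][u] = 1.0
    return adj


def conductance(adj, S):
    """cut(S) / min(vol(S), vol(V - S)), where vol is the sum of weighted degrees."""
    S = set(S)
    cut = sum(w for u in S for v, w in adj[u].items() if v not in S)
    vol_S = sum(sum(adj[u].values()) for u in S)
    vol_rest = sum(sum(nbrs.values()) for u, nbrs in adj.items() if u not in S)
    return cut / min(vol_S, vol_rest)


# Hypothetical graph: 7 edges inside the cluster, 3 cut edges, and a
# 5-clique outside so that the cluster has the smaller volume.
inside = [("s0", "s1"), ("s1", "s2"), ("s2", "s3"), ("s3", "s4"),
          ("s4", "s0"), ("s0", "s2"), ("s1", "s3")]
cut_edges = [("s0", "x0"), ("s2", "x1"), ("s3", "x2")]
outside = list(combinations(["x0", "x1", "x2", "x3", "x4"], 2))

graph = build_adj(inside + cut_edges + outside)
cluster = ["s0", "s1", "s2", "s3", "s4"]
phi = conductance(graph, cluster)  # 3 cut edges over a cluster volume of 17
```

Lower values indicate a better cluster: few edges leave it relative to the edges it contains.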

Graph Clustering
[Figure: clustering of graph G]

Compute meta-info for each cluster
- C(u,v): the expected number of times (Count) that a random walk (RW) starting at node u in cluster S hits node v before exiting S (it can exit by walking to another cluster or by restarting to q).
- E(u,v): the probability that a RW starting at node u in cluster S Exits S to node v, an out-bound node in B (assuming the query (restart) vertex q is outside S).
In the pictured example, the C matrix for S is 5x5 (|S| x |S|) and the E matrix is 5x3 (|S| x (|B| + 1): 2 out-bound nodes in B, plus q).
[Figure: example cluster S]

Compute meta-info for each cluster
Intra-cluster random walks → baby steps
[Figure: clusters S1, S2, S3, S4 and query node q]

Compute meta-info for each cluster
Recursive definition for C
- T(u, v): transition probability from u to v
- N(u): set of neighbor nodes of node u
- (1 - α): restart probability
- S: set of nodes in the given cluster
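
One recursion consistent with these definitions is written out here. It is offered as a reconstruction, not as the slide's verbatim formula, and the indicator notation is an assumption. A walk starting at u counts one visit if u = v; with probability α it then follows an edge, and only steps that stay inside S keep contributing.

```latex
C(u, v) \;=\; \mathbb{1}[u = v] \;+\; \alpha \sum_{w \,\in\, N(u) \cap S} T(u, w)\, C(w, v),
\qquad u, v \in S .
```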

Compute meta-info for each cluster
Closed-form formulae for C and E
[Equations: closed forms for C and, similarly, for E]
The formulae use two matrices:
- the |S| x |S| transition matrix
- the |S| x (|B| + 1) matrix with exit probabilities to nodes in B ∪ {q}
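
As a hedged illustration of what such closed forms can look like, the sketch below assumes C = (I - α P_S)^{-1} and E = C [α P_B | (1 - α) 1], where P_S holds the transition probabilities within S and P_B those from S to the out-bound nodes in B. The symbols P_S and P_B, the helper names, and the 3-node cluster are assumptions, not taken from the slides. The inverse is approximated with a Neumann series to keep the example dependency-free.

```python
def mat_mul(A, B):
    """Plain-Python matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def visit_counts(P_S, alpha, tol=1e-14, max_iter=10_000):
    """C = I + (alpha P_S) + (alpha P_S)^2 + ... = (I - alpha P_S)^{-1}."""
    n = len(P_S)
    step = [[alpha * p for p in row] for row in P_S]
    C = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in C]
    for _ in range(max_iter):
        term = mat_mul(term, step)
        C = [[c + t for c, t in zip(cr, tr)] for cr, tr in zip(C, term)]
        if max(abs(x) for row in term for x in row) < tol:
            break
    return C


def exit_probs(C, P_B, alpha):
    """E = C @ [alpha * P_B | (1 - alpha) * 1]; the last column is the restart to q."""
    one_step_exit = [[alpha * p for p in row] + [1.0 - alpha] for row in P_B]
    return mat_mul(C, one_step_exit)


# Hypothetical cluster S = {0, 1, 2} with edges 0-1, 0-2, 1-2, plus one edge
# from node 1 to out-bound node b0 and one from node 2 to out-bound node b1.
alpha = 0.85
P_S = [[0.0, 0.5, 0.5],
       [1 / 3, 0.0, 1 / 3],
       [1 / 3, 1 / 3, 0.0]]
P_B = [[0.0, 0.0],
       [1 / 3, 0.0],
       [0.0, 1 / 3]]

C = visit_counts(P_S, alpha)   # |S| x |S|
E = exit_probs(C, P_B, alpha)  # |S| x (|B| + 1)
```

Under these assumptions each row of E sums to 1, since a walk that starts inside S must eventually leave it, either along an edge into B or by restarting to q.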

Our Method: ClusterRank
1) Pre-computation (OFFLINE)
   a. Graph clustering
   b. Compute meta-info for each cluster
2) Query processing (ONLINE)
   a. Identify relevant clusters to consider
   b. Combine their meta-info to compute an answer

Query processing
Update meta-info for q's cluster
- C_q (C given q) and E_q (E given q) [equations]
- K: |S| x |S| matrix of 0s with column q all 1s (rank 1!) → fast Sherman-Morrison matrix inverse update
Recall: closed-form formulae for C and E
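
The Sherman-Morrison identity states that (A + u v^T)^{-1} = A^{-1} - (A^{-1} u)(v^T A^{-1}) / (1 + v^T A^{-1} u). The sketch below applies it under two assumptions that are not confirmed by the slides: that C = (I - α P_S)^{-1}, and that placing q inside S changes the matrix being inverted by -(1 - α) K. All names and the 3-node cluster are hypothetical.

```python
def mat_mul(A, B):
    """Plain-Python matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def neumann_inverse(M, tol=1e-14, max_iter=10_000):
    """Approximate (I - M)^{-1} as I + M + M^2 + ... (needs spectral radius < 1)."""
    n = len(M)
    total = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in total]
    for _ in range(max_iter):
        term = mat_mul(term, M)
        total = [[t + s for t, s in zip(tr, sr)] for tr, sr in zip(total, term)]
        if max(abs(x) for row in term for x in row) < tol:
            break
    return total


def sherman_morrison(A_inv, u, v):
    """Return (A + u v^T)^{-1} from A^{-1} using only vector products."""
    n = len(A_inv)
    A_inv_u = [sum(A_inv[i][k] * u[k] for k in range(n)) for i in range(n)]
    vT_A_inv = [sum(v[k] * A_inv[k][j] for k in range(n)) for j in range(n)]
    denom = 1.0 + sum(v[k] * A_inv_u[k] for k in range(n))
    if abs(denom) < 1e-12:
        raise ValueError("rank-1 update makes the matrix singular")
    return [[A_inv[i][j] - A_inv_u[i] * vT_A_inv[j] / denom for j in range(n)]
            for i in range(n)]


# Hypothetical 3-node cluster; nodes 1 and 2 each have one edge leaving the cluster.
alpha, n, q = 0.85, 3, 0
P_S = [[0.0, 0.5, 0.5],
       [1 / 3, 0.0, 1 / 3],
       [1 / 3, 1 / 3, 0.0]]
C = neumann_inverse([[alpha * p for p in row] for row in P_S])

# K = 1 e_q^T has column q all 1s, so -(1 - alpha) K = u v^T with:
u = [-(1.0 - alpha)] * n
v = [float(j == q) for j in range(n)]
C_q = sherman_morrison(C, u, v)

# Check against the matrix that C_q is supposed to invert.
M_q = [[float(i == j) - alpha * P_S[i][j] - (1.0 - alpha) * float(j == q)
        for j in range(n)] for i in range(n)]
residual = max(abs(x - float(i == j))
               for i, row in enumerate(mat_mul(M_q, C_q))
               for j, x in enumerate(row))
```

The update costs one outer product of an |S| x 1 and a 1 x |S| vector, in contrast to recomputing the inverse from scratch.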

Query processing
Inter-cluster graph M over relevant clusters
[Figure: clusters S1, S2, S3, S4 and query node q]

Query processing
Inter-cluster random walks → BIG steps
[Figure: inter-cluster graph M with query node q]

Query processing
Combine intra- and inter-cluster meta-info to compute the final answer ("lift" the C matrices)
[Figure: clusters S1, S2, S3, S4]
Theorem: ClusterRank gives exact PPR scores.

Time-varying ClusterRank
- WLOG: assume a single edge (u,v) is added
- Observation: the changes in the two matrices (the intra-cluster transition matrix and the exit-probability matrix) are low-rank → compute the new C & E by the Sherman-Morrison (SM) formula
- 4 cases studied in the paper:
  - Both u and v are new vertices
  - Either u or v is a new vertex
  - u and v are in the same cluster
  - u and v are in different clusters

Graph datasets

Dataset      #edges  #nodes  #clusters  description
Synthetic    909K    300K    100        Planted partitions
Amazon       900K    262K    3739       Product co-purchase
Web          1.1M    325K    2793       http://nd.edu links
DBLP         1.1M    329K    4670       Co-authorships
LiveJournal  21.5M   2.7M    15252      Friendships

Dataset      median Φ  avg. Φ  med. size  avg. size
Amazon       0.1385    0.1486  17         98.5
Web          0.0625    0.0871  31         129.4
DBLP         0.2117    0.2196  27         102.4
LiveJournal  0.5500    0.5319  43         237.3

Pre-computation
Pre-computation time depends on 1) graph size, 2) #clusters, 3) parallelization

Query Processing: set up
- Instead of all clusters, focus on a subset of relevant clusters (small neighborhoods around the query vertex, 1 or 2 hops away).
- Allow for a maximum of B boundary vertices
- Sparsify the inter-cluster matrix: zero out entries close to zero
- 100 randomly chosen query vertices

Evaluation criterion
- We report accuracy and running time for k nearest neighbor (kNN) queries.
- Accuracy = Relative Average Goodness (RAG) score @k
  RAG(@k) = (total true score of output) / (total true score of optimum)
Note: precision, i.e. overlap with the optimum, is *not* a good measure (due to ties/near-ties).
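
A short sketch of the RAG computation, with a toy case showing why overlap-based precision penalizes ties. The function names and the scores are made up for illustration; "true score" stands for the exact PPR score of each vertex.

```python
def rag_at_k(true_scores, returned, k):
    """Total true score of the k returned vertices over that of the true top-k."""
    achieved = sum(true_scores[v] for v in returned[:k])
    optimum = sum(sorted(true_scores.values(), reverse=True)[:k])
    return achieved / optimum


def precision_at_k(true_scores, returned, k):
    """Fraction of the returned vertices that appear in one fixed true top-k list."""
    best = sorted(true_scores, key=true_scores.get, reverse=True)[:k]
    return len(set(best) & set(returned[:k])) / k


# Hypothetical true scores: b and c are tied for second place.
true_scores = {"a": 0.4, "b": 0.2, "c": 0.2, "d": 0.1}
returned = ["a", "c"]  # an equally good answer that picks c instead of b

rag = rag_at_k(true_scores, returned, k=2)
precision = precision_at_k(true_scores, returned, k=2)
```

Here the returned pair is as good as the optimum, so RAG is 1, while precision against the fixed list [a, b] is only one half.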

Synthetic graphs

SYNTHETIC                                   2-HOP    1-HOP
Average RAG(@50) score (100 runs)
  B = 5K                                    0.9986   0.9865
  B = 1K                                    0.9892   0.9865
ClusterRank average response time (sec.)
  B = 5K                                    5.12     2.18
  B = 1K                                    2.86     2.12

Brute-force: 5.16 sec.

Real graphs

Dynamic updates
- 500K-edge DBLP graph + 1K new edges
[Figure annotations: Avg: 42.12 seconds; Avg: 2.78 clusters]
Note: load/store time of C, E matrices included

Dynamic updates
DBLP 1959-2001
[Figure panels: +1K edges in time; +500K edges in time]

Summary
- ClusterRank: k nearest neighbor queries based on Personalized PageRank scores
  - Works with large and time-evolving graphs
  - Fast query time: sub-linear computation on pre-computed meta-info
  - Efficient dynamic updates by low-rank matrices
  - Disk-based: query processing and dynamic updates only on the relevant subset of clusters
- Future directions
  - Cluster tracking and localized re-clustering
  - Extension to hitting / commute time

Thank You!
leman@cs.stonybrook.edu
http://www.cs.stonybrook.edu/~leman

Back-up Slides

Recursive definition for E
- T(u, v): transition probability from u to v
- N(u): set of neighbor nodes of node u
- (1 - α): restart probability
- S: set of nodes in the given cluster
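
A recursion for E that is consistent with these definitions is sketched here as a reconstruction, not as the slide's verbatim formula. It assumes that q lies outside S and that the restart exit to q is tracked separately from the edge exits to nodes in B.

```latex
E(u, v) \;=\; \alpha\, T(u, v) \;+\; \alpha \sum_{w \,\in\, N(u) \cap S} T(u, w)\, E(w, v),
\qquad u \in S,\; v \in B,
\\[4pt]
E(u, q) \;=\; (1 - \alpha) \;+\; \alpha \sum_{w \,\in\, N(u) \cap S} T(u, w)\, E(w, q),
\qquad u \in S .
```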

Closed-form formulae for C and E
[Equation: closed form for C as a matrix inverse], where I is the |S| x |S| identity matrix.
Similarly, [equation: closed form for E].

What if s (the query vertex) ∈ S?

At query time, given the query vertex s, only the two matrices (C and E) of the cluster in which s resides are updated.
K is rank 1! Therefore, we use the Sherman-Morrison lemma to update C.
Complexity: multiplication of an |S| x 1 vector and a 1 x |S| vector.
Note that we do not need to run SVD, since K is only rank 1!