Fast Nearest Neighbor Search on Large Time-Evolving Graphs

1 Fast Nearest Neighbor Search on Large Time-Evolving Graphs
Leman Akoglu, Srinivasan Parthasarathy, Rohit Khandekar, Vibhore Kumar, Deepak Rajan, Kun-Lung Wu

2 Graphs are everywhere Leman Akoglu Fast Nearest Neighbor Search on Large Time-Evolving Graphs 3

3 ... and LARGE and TIME-evolving!
- 1.32 billion monthly active users (June 30, 2014)

4 Proximity problem on graphs (also: NN-search, similarity, closeness, relevance)
Q: Which nodes are close to A?
[Figure: example graph with nodes A, B, D, E, F, G, H, I, J and unit edge weights]

5 Application: Recommendations

6 Other applications
- Finding communities (e.g. co-authorship networks such as DBLP)
- Anomaly detection (e.g. infected hosts, potential suspects)
- Link prediction
- Keyword search
- Content-based image retrieval
- Fighting spam

7 Proximity measures for graphs
- Several metrics: shortest paths, commute time, hitting time, SimRank, ...
- Prevalent (robust) metric: Personalized PageRank (PPR)
- PPR captures many, short, heavy-weighted paths
[Figure: example graph with nodes A, B, D, E, F, G, H, I, J and unit edge weights]

8 PPR is based on RWR (Random Walk with Restart) [slides adapted from an external source]

9 Problem Definition
Maintain a LARGE, time-varying, edge-weighted graph G(t), so that we can answer the following query efficiently:
- Given a query node q in G(t) at time t,
- find vertices in G(t) that are close to q (w.r.t. the Personalized PageRank score).
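As a point of reference for the query above, here is a minimal brute-force sketch of PPR by power iteration. It assumes the convention used later in the slides that (1 − α) is the restart probability; the function name, the toy graph, and the default α = 0.85 are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def personalized_pagerank(W, q, alpha=0.85, tol=1e-12, max_iter=1000):
    """Power iteration for PPR with restart probability (1 - alpha) at node q.

    W[u, v] is the weight of edge u -> v; every node needs at least one out-edge.
    """
    T = W / W.sum(axis=1, keepdims=True)   # row-stochastic transition matrix
    restart = np.zeros(len(W))
    restart[q] = 1.0
    p = restart.copy()
    for _ in range(max_iter):
        # Follow an edge with probability alpha, jump back to q otherwise.
        p_next = alpha * (T.T @ p) + (1 - alpha) * restart
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next
    return p

# Toy 4-node undirected graph with unit weights (hypothetical example data).
W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
scores = personalized_pagerank(W, q=0)
top_k = np.argsort(-scores)[:3]   # candidate nearest neighbors of q, best first
```

Each iteration touches every edge, which is the cost the clustering approach in the following slides tries to avoid at query time.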

10 Road Map
- Motivation
- Problem Definition
- Previous work
- Our Approach
  - Graph clustering
  - Intra-Cluster & Inter-Cluster Random Walks (baby steps & BIG steps)
- Time-Varying Graphs
- Experiments
- Conclusions

11 Previous Work on PPR
- D. Fogaras, B. Rácz, K. Csalogány, T. Sarlós. Towards scaling fully personalized PageRank. In Internet Mathematics.
- H. Tong, C. Faloutsos, J.-Y. Pan. Fast Random Walk with Restart and Its Applications. In ICDM.
- S. Chakrabarti. Dynamic personalized PageRank in entity-relation graphs. In WWW.
- H. Tong, S. Papadimitriou, P. S. Yu, C. Faloutsos. Proximity Tracking on Time-Evolving Bipartite Graphs. In SDM.
- P. Sarkar, A. W. Moore. Fast nearest-neighbor search in disk-resident graphs. In KDD.
- B. Bahmani, A. Chowdhury, A. Goel. Fast Incremental and Personalized PageRank. In PVLDB.
- B. Bahmani, K. Chakrabarti, D. Xin. Fast personalized PageRank on MapReduce. In SIGMOD.
- P. A. Lofgren, S. Banerjee, A. Goel, C. Seshadhri. FAST-PPR: Scaling Personalized PageRank Estimation for Large Graphs. In KDD.
We consider both large AND time-varying graphs!

12 Our Method: ClusterRank
1) Pre-computation
   a. Graph clustering
   b. Compute meta-info for each cluster
2) Query processing
   a. Identify relevant clusters to consider
   b. Combine their meta-info to compute an answer

13 Graph Clustering
- We work with large graphs (that do not fit in main memory), so we cluster vertices such that each cluster is small enough.
- We need good clusters: many intra-cluster edges, but few inter-cluster edges.
  - Random walks are more likely to stay within the cluster.
  - A good cluster is already a good approximation of the close neighborhood of the vertices in it.
Note: In some cases, the graph could be clustered naturally (e.g. a Web graph across many servers).

14 Graph Clustering
- Many graph clustering algorithms exist, e.g. based on community detection, spectral partitioning, etc.
- We use Reid Andersen, Fan Chung, and Kevin Lang (ACL). Local Graph Partitioning using PageRank Vectors. FOCS 2006.
- Advantages:
  - Local algorithm: complexity depends on the output cluster size
  - Gives clusters of different sizes, which can be overlapping
  - Can do clustering while the graph is on disk

15 What is good clustering?
The measure used by ACL [FOCS06] is conductance, Φ ∈ [0, 1].
Example from the figure: Φ = 3 / (…) = 0.17
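For readers without the figure, the sketch below computes conductance using the standard definition Φ(S) = cut(S, V∖S) / min(vol(S), vol(V∖S)), where vol is the sum of weighted degrees. The function name and the two-triangle example graph are illustrative assumptions; the slide's own example graph is not recoverable.

```python
import numpy as np

def conductance(W, S):
    """Conductance of node set S in an undirected graph with symmetric weights W."""
    in_S = np.zeros(len(W), dtype=bool)
    in_S[list(S)] = True
    cut = W[in_S][:, ~in_S].sum()          # total weight of edges leaving S
    degree = W.sum(axis=1)
    vol_S, vol_rest = degree[in_S].sum(), degree[~in_S].sum()
    return cut / min(vol_S, vol_rest)

# Hypothetical example: triangles {0,1,2} and {3,4,5} joined by the edge 2-3.
W = np.zeros((6, 6))
for a, b in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    W[a, b] = W[b, a] = 1.0
phi = conductance(W, {0, 1, 2})   # 1 cut edge over volume 7
```

A low value means few edges leave the cluster relative to its size, which is the property the previous slide asks of a good cluster.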

16 Graph Clustering
[Figure: clustering of graph G]

17 Our Method: ClusterRank
1) Pre-computation
   a. Graph clustering
   b. Compute meta-info for each cluster
2) Query processing
   a. Identify relevant clusters to consider
   b. Combine their meta-info to compute an answer

18 Compute meta-info for each cluster
- C(u,v): the expected number of times (Count) a random walk (RW) starting at node u in cluster S hits node v before exiting S (it can exit by walking to another cluster or by restarting to q).
- E(u,v): the probability that a RW starting at node u in cluster S Exits S to node v, an out-bound node in B (assuming the query (restart) vertex q is outside S).
Example: the C matrix for S is 5×5 (|S| × |S|); the E matrix is 5×3 (|S| × (|B| + 1): 2 out-bound nodes in B, plus q).

19 Compute meta-info for each cluster
Intra-cluster random walks → baby steps
[Figure: clusters S1, S2, S3, S4 and query node q]

20 Compute meta-info for each cluster: recursive definition for C
[Equation: recursive definition of C(u,v)]
- T(u,v): transition probability from u to v
- N(u): set of neighbor nodes of node u
- (1 − α): restart probability
- S: set of nodes in the given cluster
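The slide's equation did not survive extraction. One recursion consistent with the definitions listed here (a reconstruction under the assumption that the walk continues along an edge with probability α and that the starting visit is counted) is:

```latex
C(u,v) \;=\; \mathbb{1}[u = v] \;+\; \alpha \sum_{w \,\in\, N(u) \cap S} T(u,w)\, C(w,v),
\qquad u, v \in S .
```

The first term counts the walk's presence at u itself; the sum follows one step to a neighbor w that is still inside S and then counts visits to v from there.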

21 Compute meta-info for each cluster: closed-form formulae for C and E
[Equations: closed forms of C and, similarly, of E]
- The formula for C uses the |S| × |S| transition matrix restricted to S.
- The formula for E uses an |S| × (|B| + 1) matrix with exit probabilities to nodes in B ∪ {q}.
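Since the formulae themselves are missing, the following sketch shows one closed form that matches the definitions of C and E given earlier: C = (I − αT_S)⁻¹ and E = C · T_exit, where T_exit stacks the one-step exit probabilities to B and to q. This is a reconstruction under the assumption that α is the continuation probability; `cluster_meta_info`, `T_exit`, and the example graph are hypothetical names and data.

```python
import numpy as np

def cluster_meta_info(T, S, B, alpha=0.85):
    """Return (C, E) for cluster S, assuming the query q lies outside S.

    T: full row-stochastic transition matrix.
    S: node ids in the cluster.
    B: out-bound node ids (outside S, reachable in one step from S).
    """
    S, B = list(S), list(B)
    T_S = T[np.ix_(S, S)]                                  # |S| x |S|
    C = np.linalg.inv(np.eye(len(S)) - alpha * T_S)        # expected visit counts
    # One-step exit probabilities: to each b in B, and to q via restart (last column).
    T_exit = np.hstack([alpha * T[np.ix_(S, B)],
                        np.full((len(S), 1), 1 - alpha)])  # |S| x (|B| + 1)
    E = C @ T_exit
    return C, E

# Hypothetical example: triangles {0,1,2} and {3,4,5} joined by the edge 2-3.
W = np.zeros((6, 6))
for a, b in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    W[a, b] = W[b, a] = 1.0
T = W / W.sum(axis=1, keepdims=True)
C, E = cluster_meta_info(T, S=[0, 1, 2], B=[3])
```

When B contains every out-of-cluster neighbor, each row of E sums to 1, because the walk must eventually leave S either through B or through a restart.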

22 Our Method: ClusterRank
1) Pre-computation (OFFLINE)
   a. Graph clustering
   b. Compute meta-info for each cluster
2) Query processing (ONLINE)
   a. Identify relevant clusters to consider
   b. Combine their meta-info to compute an answer

23 Query processing: update meta-info for q's cluster
- C_q (C given q) and E_q (E given q)
  [Equations: closed forms of C_q and E_q]
- K: |S| × |S| matrix of 0s with column q all 1s (rank 1!) → fast Sherman-Morrison matrix inverse update
Recall: closed-form formulae for C and E.
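The rank-1 update can be sketched as follows, assuming C = (I − αT_S)⁻¹ and C_q = (I − αT_S − (1 − α)K)⁻¹, i.e. that restarts land back inside S at q. The exact form of C_q on the slide is not recoverable, so this form, the function name, and the example graph are assumptions.

```python
import numpy as np

def update_C_for_query(C, q_local, alpha=0.85):
    """Sherman-Morrison update of C = inv(I - alpha*T_S) to
    C_q = inv(I - alpha*T_S - (1 - alpha)*K), where K is zero except for
    column q_local, which is all ones (so K = ones * e_q^T has rank 1)."""
    n = len(C)
    Cu = C @ np.full(n, 1 - alpha)     # C u, with u = (1 - alpha) * ones
    vC = C[q_local, :]                 # v^T C, with v = e_q
    denom = 1.0 - Cu[q_local]          # 1 - v^T C u
    return C + np.outer(Cu, vC) / denom

# Hypothetical example: triangles {0,1,2} and {3,4,5} joined by the edge 2-3.
alpha = 0.85
W = np.zeros((6, 6))
for a, b in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    W[a, b] = W[b, a] = 1.0
T = W / W.sum(axis=1, keepdims=True)
T_S = T[:3, :3]                        # cluster S = {0, 1, 2}
C = np.linalg.inv(np.eye(3) - alpha * T_S)
C_q = update_C_for_query(C, q_local=0, alpha=alpha)
```

The update costs one matrix-vector product and one outer product, instead of a fresh |S| × |S| inversion per query.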

24 Query processing: inter-cluster graph M over relevant clusters
[Figure: clusters S1, S2, S3, S4 and query node q]

25 Query processing: inter-cluster random walks → BIG steps
[Figure: inter-cluster graph M and query node q]

26 Query processing: combine intra- and inter-cluster meta-info to compute the final answer ("lift" C matrices)
[Figure: clusters S1, S2, S3, S4]

27 Query processing: combine intra- and inter-cluster meta-info to compute the final answer ("lift" C matrices)
[Figure: clusters S1, S2, S3, S4]
Theorem: ClusterRank gives exact PPR scores.
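To illustrate why combining intra- and inter-cluster information can be exact, here is a simplified dense sketch. It is not the authors' procedure: it treats a restart as the end of a walk (so PPR = (1 − α) × expected visit counts from q), uses disjoint clusters that cover all nodes, and skips the per-query Sherman-Morrison step. It relies on the factorization I − αT = (I − αD)(I − M), where D holds intra-cluster transitions and M = C·αO holds exit probabilities. All names and the random test graph are hypothetical.

```python
import numpy as np

def ppr_via_clusters(T, clusters, q, alpha=0.85):
    """Exact PPR for query q assembled from per-cluster matrices.

    clusters: list of disjoint node-id lists covering all nodes.
    """
    n = len(T)
    C = np.zeros((n, n))            # block-diagonal intra-cluster visit counts
    O = T.copy()                    # will hold inter-cluster transitions only
    for S in clusters:
        idx = np.ix_(S, S)
        C[idx] = np.linalg.inv(np.eye(len(S)) - alpha * T[idx])
        O[idx] = 0.0
    M = C @ (alpha * O)             # M[u, b]: prob. a walk from u leaves its cluster at b
    e_q = np.zeros(n)
    e_q[q] = 1.0
    # BIG steps: expected number of times each node is a cluster entry point.
    entries = np.linalg.solve((np.eye(n) - M).T, e_q)
    # Baby steps: lift entry counts through C to per-node visit counts.
    return (1 - alpha) * (entries @ C)

# Hypothetical random weighted digraph with 12 nodes and 3 clusters of 4.
rng = np.random.default_rng(0)
W = rng.random((12, 12))
np.fill_diagonal(W, 0.0)
T = W / W.sum(axis=1, keepdims=True)
clusters = [list(range(0, 4)), list(range(4, 8)), list(range(8, 12))]
scores = ppr_via_clusters(T, clusters, q=5)
```

In this dense toy version nothing is saved computationally; the point is only that the cluster-level quantities reproduce the global scores without approximation.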

28 Road Map
- Motivation
- Problem Definition
- Previous work
- Our Approach
  - Graph clustering
  - Intra-Cluster & Inter-Cluster Random Walks (baby steps & BIG steps)
- Time-Varying Graphs
- Experiments
- Conclusions

29 Time-varying ClusterRank
- WLOG: assume a single edge (u,v) is added.
- Observation: the changes in the transition and exit-probability matrices are low-rank → compute the new C & E by the Sherman-Morrison (SM) formula.
- 4 cases studied in the paper:
  - Both u and v are new vertices
  - Either u or v is a new vertex
  - u and v are in the same cluster
  - u and v are in different clusters
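A minimal sketch of the "u and v in the same cluster" case follows. It assumes a directed edge u → v, so that only row u of the cluster's transition matrix changes (for an undirected edge, the same update would be applied once per endpoint), and it assumes C = (I − αT_S)⁻¹. The function name, argument layout, and example numbers are hypothetical.

```python
import numpy as np

def add_edge_update_C(C, T_S, d_out, u, v, w, alpha=0.85):
    """Update C = inv(I - alpha*T_S) after adding a directed edge u -> v of
    weight w inside the cluster. d_out[u] is u's total out-weight, including
    edges that leave the cluster. Returns (C_new, T_S_new, d_out_new)."""
    new_row = T_S[u] * d_out[u]          # back to raw intra-cluster weights
    new_row[v] += w
    new_row /= d_out[u] + w              # renormalize by the new out-weight
    delta = new_row - T_S[u]             # only row u changes: rank-1 update
    # inv(A - a e_u delta^T) = C + a (C e_u)(delta^T C) / (1 - a delta^T C e_u)
    Cu = C[:, u]
    dC = delta @ C
    C_new = C + alpha * np.outer(Cu, dC) / (1.0 - alpha * dC[u])
    T_new = T_S.copy()
    T_new[u] = new_row
    d_new = d_out.copy()
    d_new[u] += w
    return C_new, T_new, d_new

# Hypothetical 3-node cluster: path 0 - 1 - 2, node 2 also has weight 2 leaving S.
alpha = 0.85
W_in = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
d_out = np.array([1.0, 2.0, 3.0])
T_S = W_in / d_out[:, None]
C = np.linalg.inv(np.eye(3) - alpha * T_S)
C_new, T_new, d_new = add_edge_update_C(C, T_S, d_out, u=0, v=2, w=1.0, alpha=alpha)
```

Only the affected cluster's matrices are touched, which is what keeps updates local.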

30 Graph datasets

Dataset      #edges  #nodes  #clusters  Description
Synthetic    909K    300K    100        Planted partitions
Amazon       900K    262K    3739       Product co-purchase
Web          1.1M    325K    …          links
DBLP         1.1M    329K    4670       Co-authorships
LiveJournal  21.5M   2.7M    …          Friendships

Dataset      median Φ  avg. Φ  med. size  avg. size
Amazon       …         …       …          …
Web          …         …       …          …
DBLP         …         …       …          …
LiveJournal  …         …       …          …

31 Pre-computation
Pre-computation time depends on 1) graph size, 2) #clusters, 3) parallelization.

32 Query processing: set-up
- Instead of all clusters, focus on a subset of relevant clusters (small neighborhoods around the query vertex, 1 or 2 hops away).
- Allow for a maximum of B boundary vertices.
- Sparsify the inter-cluster matrix: zero out entries close to zero.
- 100 randomly chosen query vertices.

33 Evaluation criterion
- We report accuracy and running time for k-nearest-neighbor (kNN) queries.
- Accuracy = Relative Average Goodness (RAG):
  RAG(@k) = (total true score of output) / (total true score of optimum)
Note: precision, i.e. overlap with the optimum, is *not* a good measure (due to ties/near-ties).
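The RAG definition above can be sketched directly. The function name and the example scores are hypothetical; `true_scores` stands for exact PPR scores with respect to the query.

```python
import numpy as np

def rag_at_k(true_scores, returned, k):
    """Relative Average Goodness: true score mass of the k returned nodes
    divided by the true score mass of the optimal top-k."""
    true_scores = np.asarray(true_scores, dtype=float)
    achieved = true_scores[list(returned)[:k]].sum()
    optimum = np.sort(true_scores)[::-1][:k].sum()
    return achieved / optimum

# Hypothetical true scores with a near-tie between nodes 1 and 2.
true_scores = [0.40, 0.25, 0.24, 0.11]
rag_exact = rag_at_k(true_scores, returned=[0, 1], k=2)     # optimal answer
rag_near_tie = rag_at_k(true_scores, returned=[0, 2], k=2)  # swaps in the near-tie
```

In the second call the overlap with the optimum is only 50%, yet the returned nodes carry almost the same true score mass, which is why the slide prefers RAG over precision.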

34 Synthetic graphs
[Figure: SYNTHETIC. Average RAG(50) score over 100 runs and average response time (sec.) for ClusterRank with 1-hop and 2-hop neighborhoods, B = 1K and B = 5K. Brute-force response time: 5.16 sec.]

35 Real graphs

36 Dynamic updates
- 500K-edge DBLP graph + 1K new edges
- Avg: 42.12 seconds; avg: 2.78 clusters
Note: load/store time of the C and E matrices is included.

37 Dynamic updates
[Figure: DBLP, … K edges in time and +500K edges in time]

38 Summary
- ClusterRank: k-nearest-neighbor queries based on Personalized PageRank scores
  - Works with large and time-evolving graphs
  - Fast query time: sub-linear computation on pre-computed meta-info
  - Efficient dynamic updates by low-rank matrices
  - Disk-based: query processing and dynamic updates only on the relevant subset of clusters
- Future directions
  - Cluster tracking and localized re-clustering
  - Extension to hitting / commute time

39 Thank You!

40 Back-up Slides

41 Recursive definition for E
[Equation: recursive definition of E(u,v)]
- T(u,v): transition probability from u to v
- N(u): set of neighbor nodes of node u
- (1 − α): restart probability
- S: set of nodes in the given cluster
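The slide's equation is missing. A recursion consistent with the definition of E and the symbols listed here (a reconstruction, assuming α is the continuation probability and the query vertex q is outside S) is:

```latex
E(u,v) \;=\; \alpha\, T(u,v) \;+\; \alpha \sum_{w \,\in\, N(u) \cap S} T(u,w)\, E(w,v),
\qquad v \in B,
\\[4pt]
E(u,q) \;=\; (1 - \alpha) \;+\; \alpha \sum_{w \,\in\, N(u) \cap S} T(u,w)\, E(w,q).
```

In each line the first term is the probability of exiting in the very next step (along an edge to v, or by restarting to q), and the sum covers walks that first move to another node w inside S.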

42 Closed-form formulae for C and E
[Equation: closed form of C], where 1 is the identity matrix of size |S| × |S|.
Similarly, [Equation: closed form of E].

43 What if s (the query vertex) ∈ S?

44 At query time, given the query vertex s, only the two matrices (C and E) of the cluster in which s resides are updated.
K is rank 1! Therefore, we use the Sherman-Morrison lemma to update C.
Complexity: multiplication of an |S| × 1 vector and a 1 × |S| vector.
Note that we do not need to run SVD, as K is only rank 1.
