Scalable Mining of Massive Networks: Distance-based Centrality, Similarity, and Influence. Edith Cohen Tel Aviv University

Size: px

Start display at page:

Download "Scalable Mining of Massive Networks: Distance-based Centrality, Similarity, and Influence. Edith Cohen Tel Aviv University"

Melanie Snow
5 years ago
Views:

1 Scalable Mining of Massive Networks: Distance-based Centrality, Similarity, and Influence Edith Cohen Tel Aviv University

2 Graph Datasets: Represent relations between things Bowtie structure of the Web Broder et. al. 200 Dolphin interactions

3 Graph Datasets Hyperlinks (the Web) Social graphs (Facebook, Twitter, LinkedIn, ): friend, follow, like logs, phone call logs, messages Commerce transactions (Amazon,eBay) Road networks Communication networks Protein interactions

4 Analytics on Graphs Centralities/Influence The power/importance/coverage of a node or a set of nodes Applications: ranking, viral marketing, Similarities/Communities How tightly related are 2 or more nodes Applications: Friend/Product Recommendations, attribute completion, Advertising, Prediction

5 Challenges Modeling: Models must capture reality Be flexible to allow fitting to problem (tunable parameters) Scalability: 0 9+ interactions (edges), 0 8+ entities (nodes)

6 Centrality, Similarity, Influence Structural measures (based on the set of interactions) Local measures: only depend on the set of neighbors (paths of length ) Degree centrality, Influence as cardinality of union of neighbors Similarity by relation of neighbor sets (Jaccard, Adamic/Adar) Advantages: Gets the first order bit Scalable Disadvantages: Limited recall Spammable

7 Centrality, Similarity, Influence Structural measures (based on the set of interactions) Global Measures: depend on the paths ensemble Higher recall but much less scalable Random walk based (Geometric/Heat Kernel): Centrality, similarity, influence thru (personalized) page rank Katz (Sum of paths, discounted by length) Resistance: Similarity by effective resistance Distance/reachability based Measures: centrality, similarity, influence

8 Distance/Reachability based measures of Centrality, Influence, Similarity Closeness centrality: Ability of a node to reach other nodes [Bavelas 950]+++: Influence: Ability of a set of nodes to reach others [GLM200,KKT2003,] [Gomez- Rodriguez+ ICML 20], [Du+ NIPS 203]. [C DPW 204] +++: Closeness similarity: Relates two nodes based on the similarity of their reach [C DFGGW203], (local: [AA 2005,LK2007]):

9 This talk: Unified treatment of distance/reachability-based measures Models: Basic definition of distance-based Different ways of using distance/reachability Argue that models capture intuitive properties Scalability: Sketching and estimation

SP distances: Distance vectors 0 5 6 7 0 0 3 4 5 5 6

10 SP distances: Distance vectors

11 SP distances: Distance vectors

12 SP distances d ij : Distance vectors

13 SP distances d ij : Distance-based measures Relate entities based on their vectors. Closeness centrality C( ) Closeness Similarity Sim(, ) Influence Inf(,,, )

14 SP distances d ij : Distance-based Models Inbound/outbound/undirected distance Distance/Reachability/Dijkstra (NN) rank Topic filter β v 0: weighted relevance of entity Distances vs. decayed distances α d 0 Robustness by randomizing lengths/presence of edges (REL), or using multiple example instances

15 SP distances d ij : Distance-based Models Inbound/outbound/undirected distance Distance/Reachability/Dijkstra (NN) rank Topic filter β v 0: weighted relevance of entity Distances vs. decayed distances α d 0 Robustness by randomizing lengths/presence of edges (REL), or using multiple example instances

16 Topic Filter β(i) Weigh entities (nodes or entries of vectors) according to Topic, interests, education level, age, community, geography, language, product type, Applications: focus measure on topic. Useful for recommendations, attribute completion, targeted ads Similar to the role of start state in PPR Evil β:

17 SP distances d ij : Distance-based Models Inbound/outbound/undirected distance Distance/Reachability/Dijkstra (NN) rank Topic filter β v 0: weighted relevance of entity Distances vs. decayed distances α d 0 Robustness by randomizing lengths/presence of edges (REL), or using multiple example instances

18 Farness-penalty vs. Closeness-reward To emphasize penalty for weak coverage of far nodes, we use distances as is: The norm of the distances vector of a node is dominated by far nodes. As in classic closeness centrality, facility location, k-medians/means To emphasize reward for better coverage of close nodes, we apply a decay ( kernel ) function α d to the distance: The norm is dominated by closer nodes. As in influence models and local measures. Exact computation: both requires full distance vectors. Scalable approximation: different techniques

19 Quantifying Closeness Reward: Distance Decay (Kernel) Functions Specify how importance/relevance decays with distance α d = exp λd (exponential), d (geomertic), exp λd 2 (Gaussian) d <? : 0 (reachability) d < T? : 0 (threshold) Similar to the role of kernel distribution in PPR

20 Closeness Vectors SP distance vectors: Closeness vectors: α d ij = +d ij β

21 Closeness Centrality Classic (farness penalty) [Bavelas/Sabidussi/Beuchamp 950, 965, 966] C i = (n )/ d ij β(j) Distance-decay (closeness reward) [C Kaplan 2004, Opsahl et. al. 200, Dangalchev 2006, Boldi Vigna 203, ] C i = α(d ij ) β(j) j j

22 Closeness Matrix (distance decay+topic filter) c ij = α(d ij ) β(j) c i (row i ) is the closeness vector of node i Centrality: Influence: of a set of nodes C i = c ij Inf S = max j j i S c ij

23 Example: (dist-decay) Closeness Centrality SP distance vectors: Closeness vectors: α d ij = +d ij β C 2.05 C 2.

24 Example + topic filter: Closeness to Evil α d ij = +d ij 0. β evilness C 2.05 total C 2. total.24 evil part 0.29 evil part

25 Example: Influence from Vectors SP distance vectors: Closeness vectors: α d ij = +d ij β Inf, 3.64 Sum of max (Submodular) C 2. C 2.05

26 Closeness Similarity: Local Measures Local: sim(i, j) depends only on immediate neighbors of i, j Jaccard: Γ i Γ j Adamic Adar: Γ i Γ j x Γ i Γ(j) log Γ x J(, )= 2 9 J(, )= 3 5 AA(, )= log 5 AA(, )= 3 log 4 Work well for d ij 2, but no recall for d ij > 2

27 Closeness Similarity: Global Measures Based on full closeness vectors c ij = α(d ij ) β(j) weighted Jaccard: Cosine: L p Distance: h sim i, j = β h min c ih,c jh sim i, j = c i c j. h β h max c ih,c jh L p -dist i, j = c i c j p

28 Example: (global) Closeness Similarity SP distance vectors: Closeness vectors: α d ij = +d ij β Weighted Jaccard: x min C ix,c jx 0.2 x max C ix,c jx Cosine similarity: x C ix C jx C i 2 C j L Distance: C ix C jx x 2.8

29 So far: Definitions and some motivation of distance-based measures Next: Scalability thru Sketches and estimators

30 Scalable Closeness Centrality Closeness centrality of a node v can be computed exactly using a SSSP computation from v. But this takes near-linear time per node: Does not scale when we want centralities of many or all nodes We want to estimate all centralities within a small relative error using near-linear computation Classic (farness penalty): [C DPW : COSN 204] Previously additive error: [EW SODA 200, OCL 2008] Distance-decay (closeness reward): All-distances sketches [C 94], [C Kaplan SIGMOD 2004] [C : PODS 204].

31 Applications General Tools Scaling Up (closeness reward) All-distances and reachability node sketches (ADS) [C 94], [C Kaplan 2004 [C : PODS 204]. Estimators applicable with ADS: [C K PODC 2007, VLDB 2008, SIGMETRICS 2008], [C PODC 204,KDD 204] Distance-decay closeness centrality [C 94], [C Kaplan 2004] [Boldi Vigna 203] [C : PODS 204]. Closeness similarity [CDFGGW: COSN 203] Influence computation and maximization [CWY KDD 2009] [C Delling Pajor Werneck: CIKM 204] [Du Song Gomez-Rodruiguez Zha NIPS 203] [C Delling Pajor Werneck 205]

32 All-Distances Sketches (ADS) [C 94]+ per-node summary structures of Un/Directed, Un/Weighted networks For a node i n : ADS(i) samples the distance vector of i m edges, n nodes, parameter k trades-off sketch size and information

33 All-Distances Sketches: Definition ADS(v) is a list of pairs of the form (i, d vi ) Draw a random permutation of the nodes: r n [n] i ADS v r(i) < k th smallest rank amongst nodes that are closer to v than i This is a bottom-k ADS, there are other ADS flavors

34 All-Distances Sketches: Definition ADS(v) is a list of pairs of the form (i, d vi ) Draw a random permutation of the nodes: r n [n] i ADS v r(i) < k th smallest rank amongst nodes that are closer to v than i n j= Expected size of ADS(i) is min, k < k ln n j The j th closest node to i is included with probability min {, k j }

35 SP distances: ADS example Random permutation of nodes

36 ADS example k = All nodes sorted by SP distance from k = :

37 ADS example k = 2 Sorted by SP distances from k = 2:

38 Sketch Coordination We use the same permutation to obtain the ADS of all nodes, as a result: ADS Sketches of different nodes are coordinated: related in a way that is useful for queries that involve multiple nodes (similarities, influence, distance) [Brewer, Early, Joyce 972] Generalize MinHash sketches by adding a time/distance dimension

39 Computing ADSs Efficiently Pruned Dijkstra s Algorithm: Build ADS entries by increasing permutation rank. Local (Dynamic programming) Algorithm: Build ADS entries by increasing distance. Either way, we scan outgoing edges of a node only after its ADS is modified number of edge traversals is bounded by ADS size: O(km ln n)

40 Computing ADSs Efficiently Perform pruned Dijkstra from nodes by increasing permutation rank: k =

41 Estimation from sketches

42 Estimation with All-Distances Sketches Closeness centrality of node i, from ADS(i) with NRMSE O( ) (+concentration) [C 94, C Kaplan 04, C 4] k Influence of S, from ADS(i) i S, with NRMSE O( ) k [C 94, Chen WY KDD 09, Du SGZ NIPS 3, C DPW 204 ]+++ Closeness similarity sim(i, j) (W/Jaccard, cosine), from ADS(i) and ADS(j) w. RMSE O( ) [C DFGGW COSN 3] k L p distance, from ADS(i) and ADS(j) with RMSE O( ) times max norm using [C KDD 204] k Side note: ADSs can also be used as distance oracles (estimate pairwise distances), spanners, and more

43 Historic Inverse Probability (HIP) probability & estimator [C 204] The HIP inclusion probability p vi of i ADS(v) is conditioned on fixed rank values of other nodes. To estimate centrality of v, we estimate the presence of i with respect to v: I v i (= if v i, 0 otherwise) using inverse-probability [HT52] a vi = p. This is unbiased (when p > 0). For bottom-k ADS and r U[0,] p vi = k th {r(u) u ADS v d vu < d vi }

44 Example: HIP estimates Bottom-2 ADS of p: a = : p p:2 nd smallest r value among closer nodes

45 HIP cardinality estimate Bottom-2 ADS of distance: a: Query: n 6 v = { i d vi 6 } = I dvi 6 i n 6 (v) = a vi = = 3.59 i,d vi ADS v d vi 6

46 Quality of HIP cardinality Estimate Lemma: The HIP neighborhood cardinality estimator n d (v) = a vi i,d vi ADS v d vi d Has Coefficient of Variation σ μ 2k 2 This is 2 improvement over MinHash estimators [C 94]. Also optimally uses sketch.

47 HIP estimates of Centrality C v = α(d vi )β(i) i α non increasing; β some filter C v = a vi α(d vi ) β(i) i ADS(v)

HIP estimates: closeness to good/evil Bottom-2 ADS of distance: 0

48 HIP estimates: closeness to good/evil Bottom-2 ADS of distance: a: β: Filter: β 0, measures goodness Distance-decay kernel : e d vi C v = a vi β(i)e d vi = e e e 0 + i ADS(v)

49 Similarity/Influence Estimation We work with HIP inclusions and sum estimators (estimate separately contribution of each node) We use monotone estimation formulation [C K Random 204; C PODC 204] and the L* estimator (unique optimal monotone unbiased nonnegative sum estimator) Influence: can be estimated by merging ADS sketches and applying centrality estimator, but L* is tighter Similarity: inverse-probability on joint inclusion may not even apply, L* gets around the problem

50 Estimating Jaccard Closeness Similarity SP-distance; using kernel α d ij = +d ij : x sim i, j = min C ix,c jx = x max C ix,c jx X x > 0 or N x > 0 only when x ADS i ADS(j) easy to compute sim i, j using only ADS i ADS(j) We use L* for X x. We show N x (inverse probability). Get relative error on denominator, additive error of same order on numerator. k x min max d ix +, d jx + d ix +, d jx + Sum estimator: For all x obtain unbiased estimates N x of min d ix +, d jx + x sim i, j = N x x ; X x of max x X x d ix +, d jx + ;

51 Estimating Jaccard Closeness Similarity Estimate N x for min d ix +, d jx + : For x ADS i ADS(j) We compute HIP probability p for inclusion in intersection An inverse probability estimate: N x = p min, d ix + d jx + Otherwise, estimate is N x = 0 This is the best we can do because if x is not in intersection, data is consistent with 0 value any nonzero unbiased estimator must return 0

52 Estimating Jaccard Closeness Similarity Estimate X x for max d ix +, d jx + For x ADS i ADS(j) We can never determine X x if x is only reachable from one of i, j. We can only lower bound X x. The L* estimator [C 204] is an optimal extension of inverse probability to such a setting.

53 NN-rank based closeness similarity sim(i, j) = x x min max, π ix π jx, π ix π jx π ix : Dijkstra (NN) rank of x with respect to i Lemma: sim i, j is approximated well by proof: Pr[ x ADS i ADS j ] k min Pr[ x ADS i ADS j ] k max ADS i ADS j ADS i ADS j, π ix, π ix ADSs can be stored very compactly: no distances, hash of nodes π jx π jx

54 Next: Enhancing distance/reachability-based models with REL REL: randomizing lengths/presence of edges (Implicit) The Independent Cascade Diffusion model [Kempe Klienberg Tardos KDD 2003] Closeness similarity [C DFGGW 203] Timed Diffusion [Gomez-Rodriguez BS 20, Du SGZ 203, C DPW 204,.]

55 Strength of relation between two entities Basic Intuitions for dependence on path ensemble: Increase with shorter paths Increase with more (redundant paths)

56 Boosting distance by Randomizing Edge Lengths (REL) We expect the strength of a relation to Increase with the multiplicity of paths Decrease with the length of paths Eigenvector measures: Rooted PageRank, RWR, Hitting Time, Commute Time, Eff. Resistance, Katz Satisfy this strictly. SP distances: weakly

57 Randomizing Edge Lengths (REL) Replace edge lengths (or presence) with independent random variables Look at the expected measure

minimum of expectations.43.88.23 0.6 0.53.64 0..33 0.

58 Randomizing Edge Lengths (REL) Expected distance is lower with multiple paths: expectation of the minimum < minimum of expectations

59 Randomizing Edge Lengths (REL) Scalability: use Monte Carlo simulations of model to generate fixed instances (graphs) Build sketches for multiple instances

60 Benefits of REL in network analysis Measure does reward for paths multiplicity Robustness: Measure is not sensitive to small changes in edge weights (edge presence for reachability)

61 Next: Some (preliminary) experimental results

62 Scalability: Timed Influence with REL Combined ADS sketches for all nodes, obtained from l = 64 instances generated by assigning independent Exp distributed edge lengths. Sketch parameter k = 64 Centrality ( seed) and Influence (50 seeds) queries Queries α x = exp( 0x) preproc seed 50 seeds #nodes #edges [h:m] time μs %err time μs %err Slashdot 77K 828K : % % Gowalla 97K.9M 3: % % Twitter F 456K 5M 9: % % [C Delling Pajor Werneck 204]

63 Closeness Similarity evaluation [C Delling Fuchs Goldberg Goldszmid Werneck COSN 203] Data Sets Nodes x 0 6 Edges x 0 6 ADS label size arxiv DBLP twitter smallworld

64 Similarity Evaluation Spearman coefficient: correlation between meta-data ranking and similarity measure ranking( = perfect match, 0= random rankings). On pairs selected uniformly by ground-truth similarity. arxiv DBLP Twitter SmallWord Adamic-Adar hops SP distance RWR RWR RWR RWR Closeness Closeness REL

65 Similarity Evaluation No clear winner (networks too different) Local measures are limited RWR has very good recall (with tuning) but computationally expensive Closeness (+REL) : robust, good recall, fast queries (microseconds) arxiv DBLP Twitter SmallWord Adamic-Adar hops SP distance RWR RWR RWR RWR Closeness Closeness REL

66 Conclusion Distance/reachability based measures of centrality, influence, and similarity: global measures that are flexible and highly scalable Presented a unified treatment Scalability through: All-Distances and reachability sketches Estimators applicable to sketches Future: Model fitting framework, Evaluations, Algorithms Engineering

68 Components of my work are joint work with (subsets of): Daniel Delling, Fabian Fuchs, Andrew Goldberg, Moises Goldszmidt, Haim Kaplan, Thomas Pajor, and Renato Werneck

Computing Classic Closeness Centrality, at Scale

Computing Classic Closeness Centrality, at Scale Edith Cohen Joint with: Thomas Pajor, Daniel Delling, Renato Werneck Very Large Graphs Model relations and interactions (edges) between entities (nodes)