Uncertain Data Management Non-relational models: Graphs

Antoine Amarilli (Télécom ParisTech), Silviu Maniu (Université Paris-Sud)
January 16th

Credits (2/21)
- M. Potamias, F. Bonchi, A. Gionis, G. Kollios. k-nearest Neighbors in Uncertain Graphs. PVLDB 3(1). (number of samples, median measure, figure in slide 17, algorithm in slide 20)
- M. Ball. Computational Complexity of Network Reliability Analysis: An Overview. IEEE Trans. Reliab. R-35(3).
- L. Valiant. The Complexity of Enumeration and Reliability Problems. SIAM J. Comput. 8(3). (complexity of reliability/reachability)
PDFs of the slides available at

Uncertain Graphs (3/21)
Graphs: a natural way to represent data in various domains
- transport data: road, air links between locations
- social networks: relationships between humans, citation networks
- interactions between proteins: contacts due to biochemical processes
For all the above examples, the links are not exact. (Why?)

(Deterministic) Graphs (4/21)
A graph $G = (V, E)$ is formed of
- a set $V$ of vertices (nodes)
- a set $E \subseteq V \times V$ of edges
[Figure: example graph on the nodes a, b, c, d]

Uncertain Graphs (5/21)
An uncertain graph $\mathcal{G} = (V, E, p)$ is formed of
- a set $V$ of vertices (nodes)
- a set $E \subseteq V \times V$ of edges
- a function $p : E \to [0, 1]$, giving the probability $p_e$ that the edge $e \in E$ exists
What are the possible worlds and their probability for this model?
[Figure: example uncertain graph on the nodes a, b, c, d; the edge labels 0.2, 0.5 and 0.7 are visible]

Uncertain Graphs: Possible Worlds (6/21)
A possible world of $\mathcal{G}$, denoted $G \sqsubseteq \mathcal{G}$, is a deterministic graph $G = (V, E_G)$ where each $e \in E_G$ is chosen from $E$.
The probability of $G$ is:
$$\Pr[G] = \prod_{e \in E_G} p_e \prod_{e \in E \setminus E_G} (1 - p_e)$$
How many possible worlds are there?
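The definition above can be checked by brute force on a tiny graph. The following is a minimal sketch, not taken from the slides: the function name `possible_worlds` and the three example edges are hypothetical, and edges are assumed to be independent.

```python
from itertools import product

def possible_worlds(edges):
    """Yield (edge_set, probability) for every possible world.

    `edges` maps a (u, v) pair to its existence probability p_e.
    """
    edge_list = list(edges)
    # Each edge is independently kept or dropped: one world per choice vector.
    for choice in product([False, True], repeat=len(edge_list)):
        kept = frozenset(e for e, keep in zip(edge_list, choice) if keep)
        prob = 1.0
        for e, keep in zip(edge_list, choice):
            prob *= edges[e] if keep else 1.0 - edges[e]
        yield kept, prob

# Hypothetical example graph (not the one drawn on the slide).
example = {("a", "b"): 0.2, ("b", "c"): 0.5, ("c", "d"): 0.7}
worlds = list(possible_worlds(example))
print(len(worlds))                   # one world per subset of the edges
print(sum(p for _, p in worlds))     # the probabilities sum to 1, up to rounding
```

Enumerating worlds this way is only feasible for a handful of edges, since the number of subsets doubles with every edge added.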

Uncertain Graphs: Other models (7/21)
Other models are possible:
- each edge is replaced by a distribution of weights: instead of choosing if the edge exists or not, a possible world is an instantiation of weights
- each edge has a formula of events, capturing correlations
- probabilities can be on nodes also: equivalent to the edge model (Why?)

Queries on Uncertain Graphs (8/21)
Generally, the queries we want to answer are distance queries:
- the reachability or reliability query: get the probability that two nodes s and t are connected
- queries on the distance distribution: $p_{s,t}(d) = \sum_{G \,:\, d_G(s,t) = d} \Pr[G]$
Multiple uses of distance queries: link prediction, social search, travel estimation
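To make the two query types concrete, the sketch below computes the distance distribution exactly by summing $\Pr[G]$ over all possible worlds, and derives the reliability from it. It assumes directed, independent edges and hop distances; the names `hop_distance` and `distance_distribution` and the example probabilities are illustrative only.

```python
import math
from collections import deque
from itertools import product

def hop_distance(edge_list, s, t):
    """BFS hop distance from s to t in a deterministic directed graph."""
    adj = {}
    for u, v in edge_list:
        adj.setdefault(u, []).append(v)
    dist = {s: 0}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        if u == t:
            return dist[u]
        for v in adj.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return math.inf  # t is unreachable in this world

def distance_distribution(edges, s, t):
    """Exact p_{s,t}(d): sum Pr[G] over the worlds G where d_G(s, t) = d."""
    edge_list = list(edges)
    result = {}
    for choice in product([False, True], repeat=len(edge_list)):
        prob = 1.0
        kept = []
        for e, keep in zip(edge_list, choice):
            prob *= edges[e] if keep else 1.0 - edges[e]
            if keep:
                kept.append(e)
        d = hop_distance(kept, s, t)
        result[d] = result.get(d, 0.0) + prob
    return result

# Hypothetical probabilities, chosen only for illustration.
edges = {("b", "a"): 0.3, ("b", "c"): 0.5, ("c", "a"): 0.5}
dist = distance_distribution(edges, "b", "a")
reliability = 1.0 - dist.get(math.inf, 0.0)  # probability that b reaches a
print(dist, reliability)
```

The reliability query is thus a special case of the distance distribution: it is the total mass on finite distances.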

Queries on Uncertain Graphs (10/21)
What is the distance (in hops) between b and a?
- BFS (or Dijkstra's algorithm) finds the edge b → a
- the cost of BFS is $O(|V| + |E|)$ (linear in the size of the graph)
[Figure: deterministic graph on the nodes a, b, c, d]

Queries on Uncertain Graphs (11/21)
What is the distance (in hops) between b and a?
- the edge b → a does not appear in all possible worlds: $p_{b,a}(1) = p(b \to a)$
- there are two other possible paths, of length 2 (b → c → a) and 3 (b → d → c → a): $p_{b,a}(2) = (1 - p_{b,a}(1)) \, p(b \to c \to a)$
[Figure: uncertain graph on the nodes a, b, c, d; an edge label 0.3 is visible]
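As a worked illustration, assume (hypothetically, since the figure is not reproduced here) that the only edges are b → a, b → c, c → a, b → d and d → c, each present independently. Under that assumption the full distance distribution would be as follows. The paths of length 2 and 3 share the edge c → a, so their events are not independent, which is why the third line conditions on b → c being absent rather than on the whole length-2 path being absent.

```latex
\begin{aligned}
p_{b,a}(1) &= p(b \to a) \\
p_{b,a}(2) &= \bigl(1 - p(b \to a)\bigr)\, p(b \to c)\, p(c \to a) \\
p_{b,a}(3) &= \bigl(1 - p(b \to a)\bigr)\,\bigl(1 - p(b \to c)\bigr)\, p(b \to d)\, p(d \to c)\, p(c \to a) \\
p_{b,a}(\infty) &= 1 - p_{b,a}(1) - p_{b,a}(2) - p_{b,a}(3)
\end{aligned}
```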

Queries on Uncertain Graphs (12/21)
What is the distance (in hops) between b and a?
- the number of paths is exponential in the size of the graph
- specifically, there are 3! paths
[Figure: graph on the nodes a, b, c, d]

Queries on Uncertain Graphs (13/21)
Distance query answering in uncertain graphs is at least as hard as in relational databases (answers are logical formulas over paths, and the number of paths can be exponential).
Computing the reachability probability (i.e., the probability that there is a path between a source and a target) is known to be #P-hard [Valiant, SIAM J. Comp, 1979].

Computing Answers to Distance Queries on Probabilistic Graphs (14/21)
Distance estimations in uncertain graphs can be approximated via Monte Carlo sampling:
1. generate sampled graphs for r rounds (is this the optimal way for an s, t distance estimation?)
2. compute the desired measure (reachability probability, distance distributions) by averaging results
Same issue: how many rounds?
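The two steps above can be sketched as follows. This is an illustrative implementation under stated assumptions (directed, independent edges; hop distances; a fixed seed for reproducibility); the names `sample_world` and `monte_carlo_distances` are not from the slides.

```python
import math
import random
from collections import deque

def sample_world(edges, rng):
    """Step 1: keep each edge independently with its probability."""
    return [e for e, p in edges.items() if rng.random() < p]

def hop_distance(edge_list, s, t):
    """BFS hop distance from s to t, or infinity if t is unreachable."""
    adj = {}
    for u, v in edge_list:
        adj.setdefault(u, []).append(v)
    dist = {s: 0}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        if u == t:
            return dist[u]
        for v in adj.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return math.inf

def monte_carlo_distances(edges, s, t, r, seed=0):
    """Step 2: estimate p_{s,t}(d) as the fraction of sampled worlds with distance d."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(r):
        d = hop_distance(sample_world(edges, rng), s, t)
        counts[d] = counts.get(d, 0) + 1
    return {d: c / r for d, c in counts.items()}

# Hypothetical single-edge graph: the true reachability is 0.5.
estimate = monte_carlo_distances({("s", "t"): 0.5}, "s", "t", r=20000)
print(1.0 - estimate.get(math.inf, 0.0))  # estimated reachability probability
```

The estimate fluctuates from one seed to the next; how large r must be for a given accuracy is the question the following slides address.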

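Number of Samples: Median Distance (15/21)
Median distance:
$$d_M(s, t) = \arg\max_D \left\{ \sum_{d=0}^{D} p_{s,t}(d) \le \frac{1}{2} \right\}$$
Let $\mu$ be the real median, and $\alpha$ and $\beta$ values $\pm\epsilon n$ away from $\mu$. Then for
$$r > \frac{c}{\epsilon^2} \log\left(\frac{2}{\delta}\right)$$
and a good choice of $c$:
$$\Pr(\hat{\mu} \in [\alpha, \beta]) > 1 - \delta$$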
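Read literally, the median distance is the largest $D$ whose cumulative mass does not exceed 1/2. A minimal sketch of that definition, assuming integer hop distances, $s \ne t$, and a distribution given as a dictionary (the function name and the example values are hypothetical):

```python
import math

def median_distance(dist):
    """Largest D such that sum_{d=0}^{D} p(d) <= 1/2.

    `dist` maps an integer distance (or math.inf) to its probability.
    Returns math.inf when the cumulative mass of finite distances never exceeds 1/2.
    """
    finite = [d for d in dist if d != math.inf]
    cumulative = 0.0
    for D in range(0, max(finite, default=-1) + 1):
        cumulative += dist.get(D, 0.0)
        if cumulative > 0.5:
            return D - 1  # the previous D was the last one within the 1/2 budget
    return math.inf

print(median_distance({1: 0.25, 2: 0.25, 3: 0.5}))
```

Exact comparisons against 0.5 are sensitive to floating-point rounding, so a distribution estimated from samples would be safer to handle with integer counts.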

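Number of Samples: Expected Distance (16/21)
Expected reliable distance (generalization of reliability):
$$d_{ER}(s, t) = \sum_{d \mid d < \infty} d \cdot \frac{p_{s,t}(d)}{1 - p_{s,t}(\infty)}$$
By estimating the connectivity $\rho$, we need to sample at least
$$r \ge \max\left\{ \frac{3}{\epsilon^2 \rho}, \frac{(n-1)^2}{2\epsilon^2} \right\} \cdot \log\left(\frac{2}{\delta}\right)$$
for an $(\epsilon, \delta)$ approximation.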
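The expected reliable distance is the mean distance conditioned on s and t being connected. A small sketch of the formula, assuming the distribution is given as a dictionary with `math.inf` as the key for the disconnected case (the function name and example values are illustrative):

```python
import math

def expected_reliable_distance(dist):
    """Mean of the finite distances, renormalised by the connection probability."""
    p_inf = dist.get(math.inf, 0.0)
    if p_inf >= 1.0:
        raise ValueError("s and t are never connected")
    finite_mass = sum(d * p for d, p in dist.items() if d != math.inf)
    return finite_mass / (1.0 - p_inf)

# Hypothetical distribution: connected with probability 0.6.
print(expected_reliable_distance({1: 0.3, 2: 0.3, math.inf: 0.4}))
```

When $p_{s,t}(\infty) = 0$ this reduces to the plain expected distance.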

Number of Samples In Reality (17/21)
The number of needed samples can be surprisingly low (but it depends on the actual probabilities).
[Figure 5 from Potamias et al.: MSE vs. worlds; 200 worlds are enough. Panels: (a) BIOMINE, (b) FLICKR. x-axis: number of worlds; y-axis: mean squared error; series: Median, Majority, ExpectedRel, Reliability.]

Sampling Graphs (18/21)
Generating the entirety of the graph $G_i$ for each round $i < r$ is not optimal:
- we do not need to sample the entire graph $G_i$
- we can start from s and do a BFS or Dijkstra search, sampling only the outgoing edges of the nodes we visit
- based on the sampled outgoing edges, we repeat the computation from each newly reached node, until we find t
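A minimal sketch of this lazy sampling, assuming hop distances (BFS rather than Dijkstra), directed independent edges, and an adjacency-list input; `lazy_sampled_distance` is a hypothetical name:

```python
import math
import random
from collections import deque

def lazy_sampled_distance(graph, s, t, rng):
    """Sample one world on the fly and return the hop distance from s to t.

    `graph` maps a node to a list of (neighbour, probability) outgoing edges.
    An edge is only sampled when its source node is expanded by the BFS.
    """
    dist = {s: 0}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        if u == t:
            return dist[u]  # stop early: the rest of the world is never sampled
        for v, p in graph.get(u, []):
            # Edges into already-reached nodes cannot change the BFS result.
            if v not in dist and rng.random() < p:
                dist[v] = dist[u] + 1
                queue.append(v)
    return math.inf

rng = random.Random(0)
graph = {"s": [("a", 0.5), ("t", 0.1)], "a": [("t", 0.9)]}  # hypothetical example
samples = [lazy_sampled_distance(graph, "s", "t", rng) for _ in range(1000)]
print(sum(d != math.inf for d in samples) / len(samples))  # estimated reachability
```

Each node is expanded at most once per round, so every outgoing edge is sampled at most once per sampled world, as in full sampling, but edges far from s are never touched.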

Example: Median Distance k-nn (19/21)
k-nn (k nearest neighbours): finding the k nodes closest to s by some measure
- let us consider the median distance (reminder: it is the largest distance D such that the distribution has mass at most 0.5 on distances up to D)
We only care about the top-k nodes, and not their values, and we do not want to evaluate the whole graph if possible:
- we can evaluate a truncated distribution up to a distance D:
$$p_{D,s,t}(d) = \begin{cases} p_{s,t}(d) & \text{if } d < D \\ \sum_{x=D}^{\infty} p_{s,t}(x) & \text{if } d = D \\ 0 & \text{if } d > D \end{cases}$$
- for any two nodes $t_1$, $t_2$: $d_{D,M}(s, t_1) < d_{D,M}(s, t_2)$ implies $d_M(s, t_1) < d_M(s, t_2)$
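The truncation simply moves all mass at distance D or beyond onto D. A short sketch, assuming the distribution is a dictionary keyed by distance with `math.inf` for the disconnected case (the function name `truncate` is hypothetical):

```python
import math

def truncate(dist, D):
    """Truncated distribution p_{D,s,t}: mass at distances >= D is lumped onto D."""
    truncated = {}
    for d, p in dist.items():
        key = d if d < D else D  # math.inf also compares as >= D
        truncated[key] = truncated.get(key, 0.0) + p
    return truncated

# Hypothetical distribution, truncated at D = 2.
print(truncate({1: 0.3, 2: 0.3, 3: 0.2, math.inf: 0.2}, 2))
```

The mass below D is unchanged, which is why a search that has only explored up to distance D already knows this part of the distribution.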

Example: Median Distance k-nn (20/21)
- start from a small distance D
- decide whether there are nodes to add to the k-nn set
- increase the distance, and re-start each sampled graph from the new distance

Excerpt from Potamias et al. shown on the slide:

[...] to increase D as you go and to perform all r repetitions of the Dijkstra algorithm in parallel. The algorithm proceeds in rounds, starting from distance D = 0, and increasing the distance by γ. In each round, we resume all r executions of the Dijkstra from where they had left in the previous round, and pause them when they reach all nodes with distance at most D. If the distribution p_{D,s,t} of a node t reaches the 50% of its mass, then t is added to the k-nn solution. All other nodes that will be added in later steps will have greater or equal median distances. The algorithm terminates once the solution set contains at least k nodes. This scheme works for any order statistic other than the median.

Algorithm 1 Median-Distance k-nn
Input: Probabilistic graph G = (V, E, P, W), node s ∈ V, number of samples r, number k, distance increment γ
Output: T_k, a result set of k nodes for the k-nn query
T_k ← ∅; D ← 0
Initiate r executions of Dijkstra from s
while |T_k| < k do
    D ← D + γ
    for i ← 1 : r do
        Continue visiting nodes in the i-th execution of Dijkstra until reaching distance D
        For each node t ∈ V visited update the distribution p_{D,s,t} {Create the distribution p_{D,s,t} if t has never been visited before}
    end for
    for all nodes t ∉ T_k for which p_{D,s,t} exists do
        if median(p_{D,s,t}) < D then
            T_k ← T_k ∪ {t}
        end if
    end for
end while

4.4 Majority-distance k-nn pruning
The k-nn algorithm for Majority-Distance is similar to the one for Median-Distance. There are two main differences: In the case of the median, the distance of a node t from s is determined once the truncated distribution p_{D,s,t} reaches the 50% of its mass. In the case of the majority, let d_1 be the current majority value in p_{D,s,t}, and let r_t be all Dijkstra executions in which a node t has been visited. The condition for ensuring that d_1 will be the exact majority distance is p_{D,s,t}(d_1) ≥ (r − r_t)/r. The above conditions take care of the (worst) case that a node will appear with the same ...
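The round-based scheme can be sketched in a few lines under simplifying assumptions that are not in the source: unit edge weights (so each paused execution is a lazy BFS rather than Dijkstra), increment γ = 1, directed independent edges, and a node is added once at least half of the r executions have reached it, which is how "reaches the 50% of its mass" is interpreted here. The function name `median_knn` is hypothetical.

```python
import random

def median_knn(graph, s, k, r=200, seed=0):
    """Return at least k nodes ordered by sampled median hop distance from s.

    `graph` maps a node to a list of (neighbour, probability) outgoing edges.
    Fewer than k nodes are returned if the search runs out of reachable nodes.
    """
    rng = random.Random(seed)
    visited = [{s} for _ in range(r)]    # per-execution set of reached nodes
    frontier = [[s] for _ in range(r)]   # per-execution nodes at the current distance D
    reached = {}                         # node -> number of executions that reached it
    result = []
    while len(result) < k and any(frontier):
        # Resume every execution and advance it by one hop (D <- D + 1).
        for i in range(r):
            next_frontier = []
            for u in frontier[i]:
                for v, p in graph.get(u, []):
                    # The edge is sampled now, the only time u is expanded in world i.
                    if rng.random() < p and v not in visited[i]:
                        visited[i].add(v)
                        next_frontier.append(v)
                        reached[v] = reached.get(v, 0) + 1
            frontier[i] = next_frontier
        # A node joins the answer once half of the sampled worlds have reached it.
        for t, count in reached.items():
            if t not in result and 2 * count >= r:
                result.append(t)
    return result

chain = {"s": [("a", 1.0)], "a": [("b", 1.0)], "b": [("c", 1.0)]}  # hypothetical graph
print(median_knn(chain, "s", k=2))
```

On this deterministic chain the search stops after two rounds and never expands b, which illustrates the pruning: nodes beyond the k-th neighbour are not visited.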

Example: Median Distance k-nn (21/21)
The algorithm does not need to visit all nodes.
[Figure: Median Pruning (200 worlds); visited nodes on DBLP, BIOMINE and FLICKR]
