Graph Exploration: How to do better than the random walk? Adrian Kosowski. INRIA Bordeaux Sud-Ouest.


1 Graph Exploration: How to do better than the random walk?
Adrian Kosowski, INRIA Bordeaux Sud-Ouest
Réunion Displexity, La Rochelle, April 4, 2013

2 Talk outline
- Introduction to network exploration
- The random walk model
  - When is the random walk good enough? Overview of basic properties and applications
  - When is the random walk not good enough?
- Trying to do better in practice and in theory: biasing probabilities and Metropolis-Hastings walks

3 Introduction to Network Exploration

4 Introduction
Graph exploration: definition and motivation
- A walker is placed on some node of the input network.
- The walker is allowed to traverse edges (links) of the network.
- The goal is to perpetually visit all the nodes of the graph.
- Different possible optimization criteria:
  - we would like the walker to complete its first exploration as quickly as possible;
  - we would like to guarantee regularity of exploration.
- Motivation: crawling webs, sampling of nodes and gathering statistics, ranking nodes, performing network maintenance, ...
- Depending on the scenario, walkers may vary in terms of physical implementation, capabilities, and available memory/resources.
- 40 years of CS theory behind the problem (automata on graphs, L = SL, ...)

5 The setting of this talk
The network:
- Think locally: the global topology of the network is not known
- The network may potentially change in time
- We may possibly know some global parameters:
  - a bound on n, the number of network nodes
  - a rough idea of the degree distribution in the graph
- Network links are undirected! (like Facebook)
The walker:
- For most of the time, we see the walker as a crawler process (a bit like GoogleBot)
- When visiting a node, we learn its neighbors; this comes with a fixed cost
- Only following links is possible; teleportation is not allowed (e.g. no numeric node IDs)

6 Modeling network data
Example. A view of data accessible by crawling the Facebook frontend. (Gjoka et al., INFOCOM '10)

7 Modeling network data
Example. A graph-based model with explicit port labeling.

8 Modeling network data
Example. A graph-based model.

9 Objectives of exploration
Several parameters/properties to consider:
1. Time until all nodes have been visited at least once (cover time)
2. Time between two subsequent visits to a node, in the limit (refresh time)
3. Convergence to some limit frequency of visits to specific nodes/edges
   - Time after which the limit frequency has been approximately reached (blanket time)
   - Time after which the walk reaches a probability distribution on nodes/edges corresponding to the limit frequency (mixing time)
4. Properties characterizing short walks:
   - A short walk quickly discovers many nodes
   - A short walk samples nodes/edges with a specified probability
Convention: worst-case set-up + averaging over all possible runs of the walk algorithm.

10 The Random Walk


11 The random walk model
What is the random walk?
- We are lost in an unknown network G = (V,E)
- We leave each node along one of the adjacent links, chosen uniformly at random
- The process is Markovian: the next step of exploration does not depend on the exploration history
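The definition above can be sketched in a few lines of Python. This is only an illustration, not code from the talk: the adjacency-list dictionary `adj`, the function name `random_walk_cover`, and the `max_steps` safety cap are all assumptions made for the example.

```python
import random

def random_walk_cover(adj, start, rng, max_steps=1_000_000):
    """Run a simple random walk from `start`.

    Returns the number of steps taken until every node has been visited
    (or `max_steps` if the cap is reached first).
    """
    current, visited, steps = start, {start}, 0
    while len(visited) < len(adj) and steps < max_steps:
        # Markovian step: uniform choice among incident links, no history used.
        current = rng.choice(adj[current])
        visited.add(current)
        steps += 1
    return steps

# Hypothetical example graph: a cycle on 6 nodes.
cycle = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
print(random_walk_cover(cycle, 0, random.Random(1)))
```

Averaging the returned value over many runs and over the worst starting node gives an empirical estimate of the cover time discussed later in the talk.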

12 The random walk model
Inspired by nature: Brownian motion

13 The random walk model
A geometric setting: paths taken by the Roomba™ (artwork by IBRumba)

14 The random walk model
Classical networking applications?
- Information search/packet circulation in p2p networks: an alternative to flooding [Gkantsidis 2004]
- Sampling of nodes in a web or social network (to be discussed later)
- Self-stabilizing mutual exclusion (tokens following random walks meet and coalesce, until 1 remains) [Israeli-Jalfon 1990]
- Picking a spanning tree of a graph uniformly at random (applications of loop-erased walks) [Broder 1988, Wilson 1995]

15 The random walk model
Advantages?
- Simple, resource-efficient, independent of network location
- Equitable: uses all edges fairly ($1/|E|$ frequency in the limit)
- Recovers quickly after a slight modification of the graph
- Covers web-type graphs quickly, in almost linear time
- Parallel walks are faster than one [Alon et al. 2008, Sauerwald et al. 2008, Cooper et al. 2009]
  - By deploying several independent walks, the cover time is reduced
  - Sometimes even the total number of steps made by all walks before each node of the network has been visited is reduced!
Disadvantages?
- Completely useless in terms of worst-case performance
- Expected cover time of $\Theta(n^3)$ for some graphs
- Short walks may get stuck in local network neighborhoods

16 The random walk model
How to analyze the random walk?
First parameter: hitting time H(u,v)
What is the expected number of steps for a random walk to reach node v from node u?

17 The random walk model
How to analyze the random walk?
Second parameter: commute time Com(u,v) = H(u,v) + H(v,u)
What is the expected number of steps for a random walk to reach node v from node u, and then return to node u?
The electrical resistor network analogy: treat every edge as a unit resistor and let R(u,v) be the effective resistance between u and v.
Theorem [Chandra, Raghavan, Ruzzo, Smolensky, Tiwari 1989]: $\mathrm{Com}(u,v) = 2|E| \cdot R(u,v)$.
Theorem [Foster 1949]: $\sum_{\{u,v\} \in E} R(u,v) = n - 1$.
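Both theorems can be checked numerically on small graphs. The sketch below is an illustration under stated assumptions, not part of the talk: it computes exact hitting times by solving the standard linear system h(x) = 1 + average of h over the neighbors of x (with h(target) = 0), then derives the commute time and, via the first theorem, the effective resistance. The helper names (`hitting_times_to`, `commute_time`, `effective_resistance`) are hypothetical, and the graph is assumed to be simple and connected.

```python
from fractions import Fraction

def hitting_times_to(adj, target):
    """Exact H(x, target) for all x, by Gauss-Jordan elimination over Fractions."""
    nodes = [x for x in adj if x != target]
    idx = {x: i for i, x in enumerate(nodes)}
    k = len(nodes)
    # Row for x encodes: deg(x)*h(x) - sum of h(y) over neighbors y != target = deg(x).
    rows = [[Fraction(0)] * (k + 1) for _ in range(k)]
    for x in nodes:
        i = idx[x]
        rows[i][i] += len(adj[x])
        rows[i][k] = Fraction(len(adj[x]))
        for y in adj[x]:
            if y != target:
                rows[i][idx[y]] -= 1
    for col in range(k):
        piv = next(r for r in range(col, k) if rows[r][col] != 0)
        rows[col], rows[piv] = rows[piv], rows[col]
        p = rows[col][col]
        rows[col] = [a / p for a in rows[col]]
        for r in range(k):
            if r != col and rows[r][col] != 0:
                f = rows[r][col]
                rows[r] = [a - f * b for a, b in zip(rows[r], rows[col])]
    h = {x: rows[idx[x]][k] for x in nodes}
    h[target] = Fraction(0)
    return h

def commute_time(adj, u, v):
    return hitting_times_to(adj, v)[u] + hitting_times_to(adj, u)[v]

def effective_resistance(adj, u, v):
    # Uses Com(u,v) = 2|E| R(u,v) to define R from the commute time.
    m = sum(len(nb) for nb in adj.values()) // 2
    return commute_time(adj, u, v) / (2 * m)

# Hypothetical example: the cycle on 4 nodes.
cycle4 = {i: [(i - 1) % 4, (i + 1) % 4] for i in range(4)}
print(commute_time(cycle4, 0, 2), effective_resistance(cycle4, 0, 1))
```

On the 4-cycle, adjacent nodes are joined by a unit resistor in parallel with a path of three unit resistors, so R = 3/4 on each of the four edges and the sum is n - 1 = 3, as Foster's theorem predicts.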

18 The random walk model
How to analyze the random walk?
Third parameter: cover time Cov
- Cov(u): what is the expected number of steps for a random walk to reach all nodes of the graph, starting from node u?
- $\mathrm{Cov} = \max_{u \in V} \mathrm{Cov}(u)$
Theorem [Aleliunas, Karp, Lipton, Lovasz, Rackoff 1979]: The cover time is upper bounded by the sum of commute times along the edges of any spanning tree of the graph.
Theorem [Feige 1995]: By using the best spanning tree, we obtain for any graph: $\mathrm{Cov} \le 4n^3/27$ (up to lower-order terms). The bound is tight, and the worst-case example is precisely known.

19 The random walk model
How to analyze the random walk?
The lollipop graph: worst-case cover time $4n^3/27 - o(n^3)$
(Figure: a clique on 2n/3 nodes attached to a path of n/3 nodes.)

20 The random walk model
How to analyze the random walk?
Order of the cover time for different graph classes:

Graph class             Cover time (order)
Cliques                 $n \log n$
Paths, cycles           $n^2$
2-dimensional grids     $n \log^2 n$
3-dimensional grids     $n \log n$
Complete k-ary trees    $n \log^2 n / \log k$
Expanders               $n \log n$
Regular graphs          not more than $n^2$

21 The random walk model
How to analyze the random walk?
- In the limit, the random walk visits all edges with the same frequency ($1/|E|$)
- If the graph is not bipartite, then in the limit, at any given moment, the probability of finding ourselves at a vertex v is proportional to the degree of v.
Fourth parameter: blanket time B
Intuitively, what is the expected number of steps of a random walk before all edges of the graph have been visited a similar number of times?
Theorem [Ding, Lee, Peres 2011]: $\mathrm{Cov} \le B \le \mathrm{const} \cdot \mathrm{Cov}$

22 Can we do better than the Random Walk?


25 Tweaking the random walk
Disadvantages of the random walk
- Completely useless in terms of worst-case performance
  - Partial remedy: use a deterministic strategy instead (*$n^2$ time overhead)
- Expected cover time of $\Theta(n^3)$ for some weakly connected graphs
  - Partial remedy: use Metropolis-Hastings biasing [TBD]
  - Trick: try to tweak the input graph [Zhue et al. 2012]
- Short walks may get stuck in local network neighborhoods
  - More precisely: a random walk of small length t is expected to visit about t edges [Broder et al. 1994], but may possibly visit very few nodes
  - Partial remedy: use Metropolis-Hastings biasing [TBD]
  - Trick: if you feel you are stuck, teleport yourself to a random location [Jin et al. 2011]

26 Biased walks and the Metropolis-Hastings algorithm

27 Biased walks
A biased walk is one in which the next node is chosen by the walker from among its neighbors, but transition probabilities need not be equal.
The bias can be:
- Topological: based on the structure of the graph, degrees/importance of nodes, etc.
- Dependent on exploration history: e.g. walks which never back-track to the node they have just come from
In general, we want to keep the process as close to reversible Markovian as possible.
A simple way to obtain the desired form of bias (Markovian, reversible):
- put positive real-valued weights on edges
- at each step, choose an incident edge with probability proportional to its weight (relative to the sum of all weights of incident edges)
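A minimal sketch of the edge-weighted walk just described, assuming the weights are stored as a symmetric nested dictionary `weights[v][u] == weights[u][v]` (a representation chosen for this example only). The helper `stationary` relies on the standard fact that, for symmetric weights, the limit probability of a node is proportional to the total weight of its incident edges.

```python
import random

def weighted_step(weights, v, rng):
    """Leave v along an incident edge chosen with probability proportional to its weight."""
    neighbors = list(weights[v])
    return rng.choices(neighbors, weights=[weights[v][u] for u in neighbors])[0]

def stationary(weights):
    """Limit distribution for symmetric weights: pi(v) ~ total weight incident to v."""
    total = sum(sum(w.values()) for w in weights.values())
    return {v: sum(w.values()) / total for v, w in weights.items()}

# Hypothetical weighted triangle.
weights = {
    "a": {"b": 9, "c": 1},
    "b": {"a": 9, "c": 2},
    "c": {"a": 1, "b": 2},
}
pi = stationary(weights)
print(pi)
print(weighted_step(weights, "a", random.Random(0)))
```

Reversibility shows up as detailed balance: pi(u) P(u,v) = pi(v) P(v,u) for every edge, since both sides equal the edge weight divided by the total weight.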

28 Metropolis walks
There have been several recent papers showing how to bias random walks, given helper information about the topology of the graph, etc. The effort required to collect this information means that effectively a normal walker needs to do on the order of $n^3$ steps, anyway.
There is an exception: the Metropolis-Hastings walk weighted by node degrees [Metropolis 1959, Nonaka et al. 2010]:
- For each edge connecting u and v, put the following weight on it: $\min\{1/\deg(u),\ 1/\deg(v)\}$
- Add self-loops at each node, so that the weights of all incident edges sum to 1, for all nodes.
Consequence: all nodes will be visited equally often during exploration.

29 Metropolis walks
The Metropolis-Hastings walk weighted by node degrees [Metropolis 1959, Nonaka et al. 2010] does not require helper information about the topology. The walk can be implemented as follows:

    NextState(v: node)
        u <- neighbor of v in G chosen uniformly at random;
        with probability min{deg(v)/deg(u), 1}: move to u;
        otherwise: remain at v;

[Lee et al. 2012, K. 2013]
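The NextState rule translates almost directly into Python. The sketch below assumes an adjacency-list dictionary `adj` and a `random.Random` instance; these names are illustrative. The resulting transition probability from v to a neighbor u is (1/deg(v)) · min{deg(v)/deg(u), 1} = min{1/deg(v), 1/deg(u)}, which is symmetric, so the limit distribution on nodes is uniform.

```python
import random

def metropolis_step(adj, v, rng):
    """One step of the degree-based Metropolis-Hastings walk from node v."""
    u = rng.choice(adj[v])  # propose a uniformly random neighbor
    # Accept with probability min{deg(v)/deg(u), 1}; otherwise stay at v.
    if rng.random() < min(len(adj[v]) / len(adj[u]), 1.0):
        return u
    return v

# Hypothetical example: a star with center 0 and four leaves.
star = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
rng = random.Random(7)
counts = {v: 0 for v in star}
v = 0
for _ in range(100_000):
    v = metropolis_step(star, v, rng)
    counts[v] += 1
print({node: c / 100_000 for node, c in counts.items()})
```

On the star, the plain random walk spends half of its time at the center; the Metropolis walk instead spends roughly one fifth of its time at each of the five nodes, at the price of idling at the leaves.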

30 Metropolis walks
The Metropolis walk has a worst-case performance superior to that of the random walk.
A Metropolis walker explores a graph in $O(n^2 \log n)$ steps w.h.p. [Nonaka et al. 2010]
Note: the random walk does not carry any state when traversing edges. A little bit of memory is necessary to implement Metropolis-Hastings.
- Any strategy with $o(n^3)$ cover time requires some state memory carried over edges.
- A Metropolis walker can be implemented in O(log n) bits of memory.

31 Combining Metropolis walks and Random walks
Note: the Metropolis walk is slower than the random walk on many graphs, even for the star.
Is this strategy of practical importance? Yes. There are several elegant ways of combining the Metropolis walk with the random walk.
A variant of the Metropolis walk explores all graphs in $O(n^2 \log n)$ steps w.h.p., and not more slowly (up to a factor of 2) than the random walk.

32 Combining Metropolis walks and Random walks
One way of combining the Metropolis walk with the random walk:

    NextState(v: node)
        u <- neighbor of v in G chosen uniformly at random;
        with probability min{ (deg(u)^{-1} + d^{-1}) / (deg(v)^{-1} + d^{-1}), 1 }: move to u;
        otherwise: remain at v;

The above method relies on knowledge of the average degree d = 2m/n (it can be done without).
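A short sketch of this combined rule, again with an assumed adjacency-list dictionary and hypothetical function names. One consequence that can be verified by detailed balance (a derivation made here, not a claim from the slides) is that the limit probability of a node v is proportional to d + deg(v), i.e. a blend of the uniform distribution of the Metropolis walk and the degree-proportional distribution of the random walk. Exact rational arithmetic is used so the check is not affected by rounding.

```python
import random
from fractions import Fraction

def average_degree(adj):
    """d = 2m/n, computed exactly."""
    return Fraction(sum(len(nb) for nb in adj.values()), len(adj))

def acceptance(adj, v, u, d):
    """min{ (1/deg(u) + 1/d) / (1/deg(v) + 1/d), 1 } as an exact fraction."""
    a_u = Fraction(1, len(adj[u])) + 1 / d
    a_v = Fraction(1, len(adj[v])) + 1 / d
    return min(a_u / a_v, Fraction(1))

def combined_step(adj, v, d, rng):
    """One step of the combined Metropolis/random walk from node v."""
    u = rng.choice(adj[v])
    return u if rng.random() < acceptance(adj, v, u, d) else v

# Hypothetical example: a star with center 0 and four leaves.
star = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
d = average_degree(star)
print(acceptance(star, 1, 0, d))  # probability of leaving a leaf towards the center
print(combined_step(star, 0, d, random.Random(0)))
```

On this star a leaf moves to the center with probability 7/13 per step, compared with 1/4 for the pure Metropolis walk, which illustrates why the combined walk loses less time idling.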

33 Short Metropolis walks
In expectation, a Metropolis walk of length $t \le D^2$ discovers at least $t^{1/2}$ nodes. [K. 2013]
Potentially useful property in local searches around network neighborhoods.
Trick: subdivide nodes of high degree to achieve a better discovery rate.
E.g. combine the above with a landmark distribution scheme [Broder et al. 1994]

34 Space-time tradeoffs for s-t connectivity
Algorithm (Broder et al.): Short walks from landmarks
1. Pick k landmark nodes in the graph. Add s and t to the set of landmark nodes.
2. Repeat [a polylogarithmic number of times]: from each landmark, run a random walk of length t. If a walk starting from landmark $l_1$ visits a landmark $l_2$, mark them as belonging to the same component of landmarks (SET UNION operation).
3. Return YES, if s and t belong to the same component of landmarks. Return NO, otherwise.


39 Space-time tradeoffs for s-t connectivity
Improvement: replacing the random walk by the Metropolis walk (and fixing some details to make it work)
Algorithm (Broder et al.): Short walks from landmarks
1. Pick k landmark nodes in the graph. Add s and t to the set of landmark nodes.
2. Repeat [a polylogarithmic number of times]: from each landmark, run a random Metropolis walk of length t. If a walk starting from landmark $l_1$ visits a landmark $l_2$, mark them as belonging to the same component of landmarks (SET UNION operation).
3. Return YES, if s and t belong to the same component of landmarks. Return NO, otherwise.
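The three steps can be sketched as follows. This is a simplified illustration only: it omits the "details to make it work" mentioned on the slide, and all names (`landmark_connectivity`, `walk_len`, `rounds`, the union-find helper `find`) are assumptions of the example. The step function is passed in, so the same code runs with the plain random walk or the Metropolis walk. The graph is assumed to have no isolated nodes.

```python
import random

def random_step(adj, v, rng):
    return rng.choice(adj[v])

def metropolis_step(adj, v, rng):
    u = rng.choice(adj[v])
    return u if rng.random() < min(len(adj[v]) / len(adj[u]), 1.0) else v

def find(parent, x):
    """Union-find lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def landmark_connectivity(adj, s, t, k, walk_len, rounds, step, rng):
    # Step 1: pick k landmarks and add s and t.
    landmarks = set(rng.sample(sorted(adj), k)) | {s, t}
    parent = {l: l for l in landmarks}
    # Step 2: short walks from every landmark, merging landmarks that meet.
    for _ in range(rounds):
        for start in landmarks:
            v = start
            for _ in range(walk_len):
                v = step(adj, v, rng)
                if v in landmarks:
                    parent[find(parent, start)] = find(parent, v)  # SET UNION
    # Step 3: YES iff s and t ended up in the same component of landmarks.
    return find(parent, s) == find(parent, t)

# Hypothetical example: two disjoint triangles.
g = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}
print(landmark_connectivity(g, 0, 3, 2, 50, 3, metropolis_step, random.Random(3)))
```

The error is one-sided: a YES answer is always correct, because landmarks are only merged when a walk physically travels between them, while a NO answer may be wrong if the walks are too short or too few.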

40 Short Metropolis walks
The above process can be used to test if a graph is connected. For a well chosen value of k, we obtain the following theoretical result:
In the RAM memory model, given S log n bits of space, one can test Undirected s-t Connectivity (USTCON) in time $T = \tilde{O}(\max\{n^2/S,\ m\})$. [K. 2013]
Corollary: an almost-linear time algorithm for checking if a graph is connected, running in space $S = n^2/m$: more space-efficient than BFS/DFS!

41 Metropolis walks
Advantages?
- Simple, resource-efficient, independent of network location
- Equitable: uses all nodes fairly
- Recovers quickly after a slight modification of the graph
- Covers web-type graphs quickly, in almost linear time
- Parallel walks are faster than one
- Expected cover time not worse than $O(n^2 \log n)$
- After some fine-tuning, short Metropolis walks visit nodes more quickly than short random walks
Disadvantages?
- Unbounded pessimistic cover time
- In practical scenarios, a little slower than the Random Walk

42 Applications of biased walks in node sampling and ranking

43 Uniform sampling problem
Goal: we would like to measure some metric on the nodes of a network. E.g. what is the number of connections people have on average on Facebook/Google+?
We need to find a representative subset of nodes of the network.
- The optimal solution: just pick network nodes uniformly at random. Unfortunately, this is not feasible: we do not know all people / p2p hosts on the network, and in many cases numeric IDs may not be relied upon.
- A first attempt: take a network subset with BFS. Often unsatisfactory.
- A feasible solution: run a walk in the network!
  - Approach 1: run a short random walk starting from a random node, then fix it to account for over-representation of high degree nodes
  - Approach 2: run a short Metropolis-Hastings walk.
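Approach 1 can be illustrated with a common correction: since the random walk visits a node with probability proportional to its degree, each sample is weighted by 1/deg when averaging. The slides do not specify which compensation is used, so the estimator below (a ratio of weighted sums) is an assumption of this sketch, as are the function and parameter names.

```python
import random

def reweighted_mean(adj, start, steps, f, rng):
    """Estimate the uniform average of f over nodes from a random walk.

    Each visited node v contributes with weight 1/deg(v), which cancels the
    degree-proportional bias of the walk.
    """
    v, num, den = start, 0.0, 0.0
    for _ in range(steps):
        v = rng.choice(adj[v])
        w = 1.0 / len(adj[v])
        num += w * f(v)
        den += w
    return num / den

# Hypothetical example: average degree of a star with four leaves (true value 8/5).
star = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
degree = lambda v: len(star[v])
print(reweighted_mean(star, 0, 1000, degree, random.Random(0)))
```

On the star, the uncorrected average of the degrees seen along the walk is 2.5 (the walk alternates between the center and a leaf), whereas the reweighted estimate recovers the true mean degree of 1.6.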

44 Uniform sampling problem
Example: sampling the degree distribution of Facebook nodes, Spring 2009 [results and figures of Gjoka et al., INFOCOM '10]
- Metropolis-Hastings: not bad
- Random Walk with compensation: good
- Random Walk, no compensation: bad
- BFS, no compensation: bad
General conclusion from the literature: the compensated Random Walk seems to slightly outperform Metropolis-Hastings in most tests.

45 Non-uniform network exploration
Goal: we would like our walker to visit some nodes more often than others.
- Scenario 1: A walk to estimate Google PageRank. Propose a walk on the web which converges to a limit distribution in which more important websites are visited more often than less important ones. [Agarwal, Chakrabarti 2006]
- Scenario 2: A walk to recommend new Facebook links. A supervised walk, finding interactions between nodes which are likely to exist in reality, but are missing from the social network. [Backstrom, Leskovec 2011]
- Scenario 3: A walk to sample non-uniform populations. Suppose we want to compare the mean income of social network users in China and the Vatican. We need a sample of 100 users from China and 100 users from the Vatican. How to get one quickly with a walk on the social network? [Kurant et al. 2011]
The solution to all cases: biased walks with weights on edges dynamically adapted in a learning process. Fine-tuning is quite tricky.

46 What may the future bring?
New application domains: walks in the real-time analysis of live information
- Following news as it spreads virally through the social web
- Walks traversing different media, e.g. a re-tweeted FB post linking to a blog entry based on live TV news coverage
- Walks to evaluate the extent/threat due to a spreading rumour, detect the source of viral info
- Walks policing the web for brand infringement & copyright enforcement
New challenges:
- A clear need for biased walks of different types
- A clear need to better understand how such walks parallelize
- An opportunity to use the abstraction of a walk in the modeling and analysis of heterogeneous networks/webs
- Possibly, a need for coordination of multiple walks, resources shared among walks, etc.
- Empirical studies of walks in evolving networks (e.g. networks growing by several percent during the duration of the walk).

47 What may the future bring?
New techniques, developing a combination of:
- probabilistic analysis
- spectral graph theory
- computational complexity
- modeling of dynamic systems
- rumour spreading models
- network optimization
- distributed computing
- machine learning
- game/equilibria theory
- sampling & statistics

48 Deterministic Graph Exploration
Is there still time?

49 The labeled graph model
Assumptions of the labeled graph model:
- The explored graph G = (V,E) is simple, undirected, and connected
- The nodes of the graph do not have any labels or colors which are known to the agent (anonymous graph property)
- When located at a vertex, the agent can distinguish among the edges adjacent to the current node
- The agent is aware of the edge by which it entered the current node
- There are two distinct types of local orientations of edges at a node: implicit cyclic ordering, and explicit port labeling

50 Focus: computations in anonymous networks
In the anonymous model, the agent is an automaton with state memory:
    f(STATE, IN-PORT) = (STATE, OUT-PORT)
No identifiers, no global knowledge.
Rationale: testing limits of computability, with profound implications in other areas: log-space complexity theory, fault-tolerant routing, token distribution schemes.
(Figure: the explored graph and the information accessible to the agent, its "view".)

51 De-randomizing random walks
How to make the random walk deterministic?
We perform an exploration using a robot equipped with some memory (state) and knowledge of the ports in the graph:
    f(STATE, IN-PORT, DEGREE) = (STATE, OUT-PORT)
The following properties are extremely desirable:
- The memory size of the robot should be as small as possible
- The worst-case cover time of the robot should be polynomial
- If possible, other properties should be retained (e.g. equity of edge visits)
First variant: we assume nothing about the port labeling of the graph (i.e., worst case labeling).
First approach: sequences of port numbers that work for any graph.

52 Universal Sequences
Universal Traversal Sequences (UTS-s)
A UTS(n,d) is a sequence of numbers $(t_1, \ldots, t_k)$ in 1..d, such that the robot
    f(STEP i, PORT ?, DEGREE d) = (STEP i+1, PORT $t_i$)
covers any d-regular graph of (at most) n vertices in at most k steps.
Theorem [Aleliunas, Karp, Lipton, Lovasz, Rackoff 1979]: For any n, there exists a UTS(n,d) of length $k \sim n^5 \log n$.
Proof: the probabilistic method
- Fix a sequence S of length $k = n^5 \log n$, chosen uniformly at random.
- Let G be an arbitrary graph. Then a random walk following S explores G with probability $1 - \varepsilon$, where $\varepsilon = O(2^{-n^2 \log n})$.
- Let F(G) be the set of all sequences that fail to explore G. We have: $|F(G)| \le \varepsilon\, d^k$.
- Let $G_n$ be the set of all graphs of order at most n. By the union bound, the total number of sequences which fail for at least one graph from $G_n$ is at most $|G_n|\, \varepsilon\, d^k$, which is less than $d^k$.
- So, at least one sequence succeeds for all graphs!

53 Universal Sequences
How much memory is required to construct a UTS efficiently?
- Nisan's generic derandomizer (1992): $O(\log^2 n)$ memory, but the length of the sequence is no longer polynomial: $O(n^{\log n})$
- It is not clear even if a sequence of polynomial length can be constructed in polynomial time
- Some explicit constructions are known, e.g. for cycles
It turns out that it is easier to apply UXS-s instead!
Universal Exploration Sequences (UXS-s)
A UXS(n,d) is a sequence of numbers $(x_1, \ldots, x_k)$ in 1..d, such that the robot
    f(STEP i, PORT p, DEGREE d) = (STEP i+1, PORT [p + $x_i$])
covers any d-regular graph of (at most) n vertices in at most k steps. Here p is the port by which the robot entered the current node, and port numbers are added modulo d.
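The difference between the two kinds of sequences is that a UXS entry is an offset relative to the entry port, not an absolute port number. A minimal sketch of a robot following an exploration sequence is given below; the representation `ports[v][i] = (neighbor, back_port)`, the 0-based port numbering, and the convention that the robot starts as if it had entered through port 0 are assumptions of the example. The sequence used is not claimed to be universal; it merely happens to explore the example cycle.

```python
def follow_exploration_sequence(ports, start, seq, entry_port=0):
    """Follow offsets `seq`; return the list of nodes visited, in order.

    ports[v][i] = (u, j) means: port i of v leads to u, and the same edge
    is port j at u (so arriving at u this way, the entry port is j).
    """
    v, p = start, entry_port
    visited = [v]
    for x in seq:
        d = len(ports[v])
        out = (p + x) % d        # exit port = entry port + offset, modulo degree
        v, p = ports[v][out]     # move along the edge, remember the new entry port
        visited.append(v)
    return visited

# Hypothetical port-labeled cycle on 5 nodes:
# port 0 of node i leads to i+1, port 1 leads to i-1.
n = 5
ring = {i: [((i + 1) % n, 1), ((i - 1) % n, 0)] for i in range(n)}
print(follow_exploration_sequence(ring, 0, [1, 1, 1, 1]))
```

With every offset equal to 1 the robot never backtracks on a 2-regular graph, whatever the port labeling, which is why exploration sequences for cycles are easy to construct.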

54 Touching the foundations of computer science
The L = SL complexity class problem
- L is the complexity class containing decision problems which can be solved by a deterministic Turing machine using a logarithmic amount of memory space.
- SL (Symmetric Logspace) is the complexity class of problems log-space reducible to USTCON (undirected s-t connectivity), which is the problem of determining whether two vertices of a graph are in the same connected component.
A positive answer [Reingold, STOC 2005]
- UXS(n,d) can be constructed by a machine equipped with O(log n) memory
- By applying a slight modification of the sequence, a robot can explore any (not necessarily regular) graph of order at most n, thus solving USTCON.
Note: the problem for the related oblivious (UTS-based) variant is open!

55 Helping the robot: guiding using counters
- Guide the agent along the least often used edge [Ilcinkas et al., 2010]
  Explores the graph periodically, with an exploration period of O(m D) in graphs of diameter D with m edges. Improves previous bound of 4n (Ilcinkas 06).
- Guide the agent along the edge not in use for the longest time
  A poor strategy, with an exponential exploration time.
- Guide the agent along the port not in use for the longest time: the rotor-router, introduced by [Yanovski et al. 2003], also [Bampas et al. 2009]
  A fast and robust exploration strategy (w.r.t. changes of graph structure), stabilizing to a periodic traversal of an Eulerian cycle within $\Theta(mD)$ steps.
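The rotor-router rule is simple enough to state as code. In the sketch below, "the port not in use for the longest time" is realized in the usual way, by a pointer at each node that advances cyclically through the node's ports; the adjacency-list representation, the initial pointer positions (all at port 0) and the function name are assumptions of this example.

```python
def rotor_router_cover(adj, start, max_steps=10**6):
    """Run the rotor-router from `start`; return steps until all nodes are visited."""
    pointer = {v: 0 for v in adj}   # per-node rotor, stored at the node, not in the agent
    v, visited, steps = start, {start}, 0
    while len(visited) < len(adj) and steps < max_steps:
        nxt = adj[v][pointer[v]]                      # leave along the pointed port
        pointer[v] = (pointer[v] + 1) % len(adj[v])   # advance the rotor cyclically
        v = nxt
        visited.add(v)
        steps += 1
    return steps

# Hypothetical example: a path on 4 nodes (m = 3 edges, diameter D = 3).
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(rotor_router_cover(path, 0))
```

The process is fully deterministic: the agent itself carries no state, and all memory sits in the rotors at the nodes, which is what makes the strategy robust to restarts of the agent.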

56 Thank you.
