Reaching consensus about gossip: convergence times and costs

Florence Bénézit, Patrick Denantes, Alexandros G. Dimakis, Patrick Thiran, Martin Vetterli

School of IC, Station 14, EPFL, CH-1015 Lausanne, Switzerland
Department of Electrical Engineering and Computer Science (EECS), University of California, Berkeley, Berkeley, CA 94720, USA

Abstract

Gossip algorithms have recently received significant attention, mainly because they constitute simple and robust methods for distributed information processing over networks. However, for many topologies that are realistic for wireless ad-hoc and sensor networks (such as grids and random geometric graphs), standard nearest-neighbor gossip converges slowly. Moreover, we show that the convergence of gossip algorithms consists of a transient phase and a steady-state phase, which have not been distinguished so far. In this paper, we first introduce metrics for convergence time and cost that allow us to clearly characterize the steady-state regime of the convergence, not only for i.i.d. but for all stationary and ergodic time-varying networks. These metrics are based on Oseledec's theorem, which gives an almost-sure description of the algorithm's convergence rate. We next describe a variation of geographic gossip that averages along routed paths, and which is proven to be order-optimal (cost of $O(n)$ messages for a network of $n$ nodes) for grids and random geometric graphs, in sharp contrast with standard nearest-neighbor gossip ($O(n^2)$ messages). This paper summarizes some of the results in [1] and [2].

I. INTRODUCTION

Gossip algorithms are distributed message-passing schemes designed to disseminate and process information over wireless sensor and ad-hoc networks. They have received significant interest because the problem of computing a global function of data distributively over a network, using only localized message-passing, is fundamental for numerous applications.
The simplest setup is the following: $n$ nodes are placed on a graph whose edges correspond to reliable communication links. Each node is initially given a scalar (which could correspond to some sensor measurement like temperature) and we are interested in solving the distributed averaging problem: namely, to find a distributed message-passing algorithm by which all nodes can compute the average of all $n$ scalars. A scheme that computes the average can easily be modified to compute any linear function (projection) of the measurements as well as more general functions. Furthermore, the scalars can be replaced with vectors and generalized to address problems like distributed filtering and optimization as well as distributed detection in sensor networks [5], [6], [7].

Gossip algorithms [8], [9] solve the averaging problem by iteratively having a random node wake up, pick one of its one-hop neighbors and compute a pairwise average by exchanging messages with their current estimates. Initially, every node uses its own measurement as its estimate of the average. At each iteration, the node that wakes up and its chosen neighbor update their estimates with the pairwise average of their current estimates. An attractive property of gossip is that no coordination is required for convergence, as long as the communication graph is connected. We will refer to this algorithm as standard or nearest-neighbor gossip.

Gossip algorithms can converge up to any desired level of accuracy but not to the exact value, which makes convergence time difficult to define. In this paper, we use a metric introduced in [2] which focuses on the stationary regime of the decay of the error to zero. It can be proven that, after a transient regime, the error decays exponentially with a deterministic rate. The inverse of this rate constitutes our definition of consensus time. We similarly define consensus cost as the number of messages needed to reduce the error by a factor of $e$ in the stationary regime.
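The nearest-neighbor gossip iteration described above can be sketched in a few lines. This is a minimal illustration, not the authors' simulation code: the function name `standard_gossip`, the ring topology, the initial values and the iteration count are all assumptions chosen for the example.

```python
import random


def standard_gossip(values, neighbors, iterations, seed=0):
    """Run nearest-neighbor gossip and return the final estimates.

    values    -- initial measurement of each node
    neighbors -- neighbors[i] is the list of one-hop neighbors of node i
    """
    rng = random.Random(seed)
    x = list(values)
    n = len(x)
    for _ in range(iterations):
        i = rng.randrange(n)              # a random node wakes up
        j = rng.choice(neighbors[i])      # and picks one of its one-hop neighbors
        x[i] = x[j] = (x[i] + x[j]) / 2.0  # both adopt the pairwise average
    return x


# Hypothetical example: a ring of 20 nodes, node i initially measures the value i.
n = 20
ring = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}
initial = [float(i) for i in range(n)]

final = standard_gossip(initial, ring, iterations=20000)
true_average = sum(initial) / n
max_error = max(abs(v - true_average) for v in final)
print("largest deviation from the true average:", max_error)
```

Each pairwise update leaves the sum of the estimates unchanged, which is why the estimates can only converge to the true average and to no other value.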
These performance definitions are valid if the time variations of the network and the waking-up processes in the algorithms are stationary and ergodic.

The performance of gossip algorithms varies from one algorithm to another, deteriorates with the number $n$ of nodes in the network and depends on the graph topology modeling the network. For example, while standard gossip requires only $O(n)$ messages to converge on a fully connected graph, it needs $O(n^2)$ messages on the 2-D grid, where it thus scales as badly as flooding all the measurements everywhere. Unfortunately for standard gossip, grids and random geometric graphs are the more realistic topologies for sensor networks, and on these the algorithm requires $O(n^2)$ and $O(n^2/\log n)$ messages, respectively. It is thus of great interest to design more competitive gossip algorithms specifically for these topologies.

Dimakis et al. [10] introduced geographic gossip, which is based on a simple idea: since fully connected graphs spread gossip fast, the nodes can use geographic information to create a very well connected overlay graph using geographic routing. Geographic gossip converges on any connected graph as fast as standard gossip on the complete graph, but it is penalized by the extra cost of routing. Consensus cost in geographic gossip is thus $O(n\sqrt{n})$ messages for grids and $O(n\sqrt{n/\log n})$ for random geometric graphs.

In this paper we explore a natural extension of geographic gossip: instead of averaging only the starting node and the
destination node of each route, what happens if the whole route's estimates are averaged together? The surprising answer to this question is that this algorithm, which we call path averaging, actually converges as fast as standard gossip on the fully connected graph, i.e., it requires only $O(n)$ messages, which is order-optimal since it matches obvious lower bounds.

In Section II, we formally define gossip algorithms, give conditions for convergence and specify the network models. In Section III we show that consensus time and consensus cost have valid definitions under the ergodicity conditions detailed in Section II. Section IV states a theorem concerning the linear behavior of path averaging on grids with horizontal-vertical routing and gives the intuition of the proof. Section V presents experiments on path averaging on random geometric graphs with greedy routing, and we comment on the results with the help of our analysis of Section IV.

II. NETWORK MODEL AND PROBLEM FORMULATION

A. Distributed averaging algorithms: the general case

We consider a time-varying network of $n$ nodes, and the goal is to make available at each node the average value of the measurements of all the nodes in the network, or at least a good approximation. At each time-slot $t$ ($t$ discrete), the nodes can communicate with each other over all currently active graph edges, or communication links. We will restrict the exchanged messages to contain only a current estimate $x_i(t)$ of the sending node $i$, and in the path averaging case some routing information. Note that we assume that the messages have infinite accuracy; the effects of message quantization have only recently been explored [3], [4].

At each time step $t$, every node $i$ may perform an update operation of its estimate $x_i(t)$ of the overall average. This operation is linear, and relies only on the current average estimates from node $i$ and from the nodes $i$ communicated with.
The update equation for node $i$ at time $t$ then reads, for $1 \le i \le n$,

$$x_i(t+1) = w_{ii}(t)\,x_i(t) + \sum_{j \in S_i(t)} w_{ij}(t)\,x_j(t), \tag{1}$$

where $w_{ij}(t)$ are the weighting factors gathered in a weight matrix $W(t)$ such that $x(t+1) = W(t)\,x(t)$, where $x(t) = [x_1(t), \ldots, x_n(t)]^T$. The weight values are defined by the specific averaging algorithm that is being used, and $S_i(t)$ is the set of nodes that node $i$ communicates with in time-slot $t$. $x_i(0)$ is the initial measurement at node $i$ and $x_{ave} := \mathbf{1}^T x(0)/n = \mathbf{1}^T x(t)/n$ denotes the true average, where $\mathbf{1} = [1, \ldots, 1]^T$ is the all-ones vector. Gossip is a particular class of distributed averaging algorithms:

Definition 1: Gossip. In gossip algorithms, at each time-slot $t$, a random set $S(t)$ of nodes exchange their estimates, such that for every node $i \in S(t)$, $S_i(t) = S(t)$, and moreover

$$x_i(t+1) = \frac{1}{|S(t)|} \sum_{k \in S(t)} x_k(t). \tag{2}$$

For every node $j \notin S(t)$, $S_j(t)$ is the empty set and $w_{jj} = 1$.

B. Conditions for convergence to true average

We denote by $J_n$ the $n \times n$ averaging matrix with all elements $J_{k,l} = 1/n$, and by $\|\cdot\|_2$ the spectral norm. It is also useful to define the matrix $A(t) = W(t) - J_n$ and the vector of the estimation errors $\epsilon(t) = x(t) - x_{ave}\mathbf{1}$. The algorithm converges almost surely (a.s.) if $P[\lim_{t \to \infty} \epsilon(t) = 0] = 1$. There are two necessary conditions for convergence:

$$\mathbf{1}^T W(t) = \mathbf{1}^T, \qquad W(t)\,\mathbf{1} = \mathbf{1}, \tag{3}$$

which respectively ensure that the average is preserved at every iteration, and that $\mathbf{1}$ is a fixed point. Conditions for convergence in expectation and in mean square can be found in [9]. We present here sufficient conditions for a.s. convergence and convergence in second moment, in the case where $\{W(t)\}_{t \ge 0}$ is stationary and ergodic:

Conservation properties: conditions (3).
Contraction property: $\|W(t)\|_2 \le 1$.
Connectivity property: $E[T_\eta] < \infty$, where $T_\eta := \inf_t \{t \ge 1 : \prod_{p=0}^{t} W(t-p) \ge \eta > 0\}$ is a stopping time.

In other words, there can be isolated nodes at any iteration, but every node has to eventually connect to the network, which has to be jointly connected. This result was recently proved in [11].

C. Stationary and ergodic networks

It is important to notice that all distributed averaging algorithms present time-varying averaging weights. Moreover, these changes are random. That is, we can see the sequence of averaging matrices as a realization of a random process $\{W(t)\}_{t \ge 0}$. From there, it might seem difficult to define a deterministic convergence speed, which is observed not only in a particular run, but repeatedly in almost every realization (that is, with probability 1) of the process $\{W(t)\}_{t \ge 0}$. We show in Section III that this is in fact possible when this process satisfies two conditions: stationarity and ergodicity.

These conditions are quite general and satisfied by most network models. In particular, they translate easily to gossip algorithms: $\{W(t)\}_{t \ge 0}$ is stationary and ergodic if and only if $\{S(t)\}_{t \ge 0}$ is stationary and ergodic. For the latter condition to be fulfilled, it is necessary that the nodes actively connecting to the network form a stationary and ergodic process as well. Node isolation can be due to a number of factors, e.g. node failures, transmission channel quality, node sleeping modes, etc. Furthermore, the randomness of the algorithm itself has to be stationary and ergodic. This is easily done by waking up nodes i.i.d. over time and by having the woken-up node choose the rest of the set $S(t)$ i.i.d. over time as well.

In conclusion, this paper deals with two different sources of randomness, the network topology variations and the randomized algorithm, which we aggregate into a single stochastic process $\{W(t)\}_{t \ge 0}$. Our analysis is valid as long as this process is stationary and ergodic, a condition satisfied by most natural network and algorithm definitions.
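To make Definition 1 and the conservation conditions of Section II-B concrete, the following sketch builds the weight matrix $W(t)$ for one gossip set $S(t)$ and checks that its rows and columns sum to one. The helper names `gossip_matrix` and `apply_weights`, the network size and the chosen set are assumptions made for illustration only.

```python
def gossip_matrix(n, S):
    """Weight matrix W(t) when the nodes in S average their estimates together."""
    members = set(S)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        if i in members:
            # Nodes in S(t) replace their estimate by the average over S(t).
            for k in members:
                W[i][k] = 1.0 / len(members)
        else:
            # Nodes outside S(t) keep their estimate: w_jj = 1.
            W[i][i] = 1.0
    return W


def apply_weights(W, x):
    """One iteration x(t+1) = W(t) x(t)."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


# Hypothetical example: 5 nodes, nodes 0, 2 and 3 gossip in this time-slot.
n = 5
W = gossip_matrix(n, S={0, 2, 3})
x = [4.0, 8.0, 1.0, 7.0, 0.0]
x_next = apply_weights(W, x)

row_sums = [sum(row) for row in W]                               # W(t) 1 = 1
col_sums = [sum(W[i][j] for i in range(n)) for j in range(n)]    # 1^T W(t) = 1^T
print(x_next, row_sums, col_sums)
```

A gossip matrix built this way is symmetric and idempotent (averaging twice over the same set changes nothing), so it is an orthogonal projection and its spectral norm equals 1, which is consistent with the contraction property.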
III. METRICS FOR CONVERGENCE TIME AND COST

For the distributed averaging algorithms described in Section II, the estimates $x(t)$ and the error $\epsilon(t) = x(t) - x_{ave}\mathbf{1}$ for $t > 0$ are random vectors, since the network is time-varying and the algorithms have a randomized behavior. However, in the long run, the error decays exponentially with a deterministic rate $1/T_c$, where $T_c$ is called consensus time. The following theorem, which is a direct application of Oseledec's theorem, precisely states the existence of this rate [2]:

Theorem 1: If $\{W(t)\}_{t \ge 0}$ is a stationary and ergodic process, then the limit

$$-\lim_{t \to \infty} \frac{1}{t} \log \|\epsilon(t)\|, \tag{4}$$

where $\|\cdot\|$ denotes the $\ell_2$ norm, exists and is a constant $\gamma$ with probability 1.

Definition 2: Consensus time $T_c$. Whenever the coefficient $\gamma$ is well defined, consensus time $T_c$ is defined as follows:

$$T_c = \frac{1}{\gamma}. \tag{5}$$

In other words, after a transient regime, the number of iterations needed to reduce the error $\epsilon$ by a factor $e$ is almost surely equal to $T_c$, which therefore characterizes the speed of convergence of the algorithm. $T_c$ is easy to measure in experiments, and has analytical upper bounds. However, lower bounding this quantity remains an open problem.

Theorem 2: Bounding $T_c$ for gossip algorithms. Whenever consensus time $T_c$ is well defined, $T_c$ can be bounded as follows in the case of gossip algorithms:

$$T_c \le \frac{-2}{\log \lambda_2(E[W])} \le \frac{2}{1 - \lambda_2(E[W])}, \tag{6}$$

where $\lambda_2(E[W])$ denotes the second largest eigenvalue of $E[W]$. Note that bounds can be found in [2] for more general averaging algorithms.

We compare algorithms in terms of the amount of required communication. More specifically, let $R(t)$ represent the number of one-hop radio transmissions required in time-slot $t$. In a standard gossip protocol, the quantity $R(t) \equiv R$ is simply a constant, whereas for path averaging, which we will study in the next sections, $\{R(t)\}_{t \ge 1}$ is a sequence of i.i.d. random variables. The total communication cost up to time-slot $t$, measured in one-hop transmissions, is given by the random variable $C(t) = \sum_{k=1}^{t} R(k)$.
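The remark that $T_c$ is easy to measure in experiments can be illustrated with a small sketch. It estimates the decay rate $\gamma > 0$ of $\log \|\epsilon(t)\|$ for pairwise gossip on the complete graph and returns $T_c = 1/\gamma$. Everything here is an assumption made for the example: the function name, the network size, the burn-in length used to skip the transient regime, and the trick of re-centering and re-normalizing the error vector every few iterations to avoid floating-point underflow (valid because the update is linear). For this particular model one can check by hand that $\lambda_2(E[W]) = 1 - 1/(n-1)$, so the looser bound of Theorem 2 is $2/(1 - \lambda_2(E[W])) = 2(n-1)$.

```python
import math
import random


def estimate_consensus_time(n, burn_in, steps, block=50, seed=1):
    """Estimate T_c = 1/gamma for pairwise gossip on the complete graph.

    burn_in and steps are assumed to be positive multiples of block.
    """
    rng = random.Random(seed)
    eps = [rng.gauss(0.0, 1.0) for _ in range(n)]  # stands in for the error vector
    log_decay = 0.0
    for t in range(1, burn_in + steps + 1):
        i, j = rng.sample(range(n), 2)             # uniformly random pair of nodes
        eps[i] = eps[j] = (eps[i] + eps[j]) / 2.0
        if t % block == 0:
            # Remove any residual mean, then rescale to unit norm.
            mean = sum(eps) / n
            eps = [e - mean for e in eps]
            norm = math.sqrt(sum(e * e for e in eps))
            if t > burn_in:
                # log of the norm reduction achieved over the last block
                log_decay += math.log(norm)
            eps = [e / norm for e in eps]
    gamma = -log_decay / steps   # decay rate of log ||eps(t)|| per iteration
    return 1.0 / gamma


n = 16
t_c = estimate_consensus_time(n, burn_in=200, steps=20000)
upper_bound = 2 * (n - 1)        # 2 / (1 - lambda_2(E[W])) for this model
print("estimated T_c:", t_c, "upper bound:", upper_bound)
```

The estimate is a finite-sample quantity and fluctuates from seed to seed, so it should be read as an approximation of the almost-sure limit in Theorem 1 rather than its exact value.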
Consensus cost $C_c$ is defined as follows [2]:

Theorem 3: If $\{W(t)\}_{t \ge 0}$ is a stationary and ergodic process, then the following limit exists and is a constant with probability 1:

$$-\lim_{t \to \infty} \frac{1}{C(t)} \log \|\epsilon(t)\|. \tag{7}$$

Definition 3: Consensus cost $C_c$. Consensus cost $C_c$ is defined as follows:

$$\frac{1}{C_c} = -\lim_{t \to \infty} \frac{1}{C(t)} \log \|\epsilon(t)\| = \lim_{t \to \infty} \frac{t}{C(t)} \cdot \left(-\lim_{t \to \infty} \frac{1}{t} \log \|\epsilon(t)\|\right).$$

Thus, $C_c = E[R(1)]\,T_c$ is the number of one-hop transmissions needed in the long run to reduce the error by a factor $e$ with probability 1.

IV. PATH AVERAGING ON GRIDS

A. Description and performance

In this paper, a grid of $n$ nodes is a torus of size $\sqrt{n} \times \sqrt{n}$. Path averaging with horizontal-vertical routing is a gossip algorithm that proceeds as follows. At each iteration $t$, a random node $I$ wakes up and randomly chooses a destination node $J$ so that the random pairs $(I, J)$ are independently and uniformly distributed. Node $I$ also flips a fair coin to choose the first direction: horizontal or vertical. If for instance horizontal was picked as the first direction, the path between $I$ and $J$ is then defined by the shortest path between $I$ and $J$ that is routed horizontally first, then vertically. If vertical was picked then the path is routed vertically first. As the message travels to $J$, each node adds its own estimate to the message estimate (initialized with $x_I(t)$) and also increments a counter that was started at 1 by node $I$. When $J$ receives the message, it computes the average of the estimates of the route and routes this value back to $I$, retracing the incoming message's steps. In this way, the estimates of the nodes belonging to the random route are updated to their global average.

Theorem 4 (Path averaging on grids): On a $\sqrt{n} \times \sqrt{n}$ torus grid, the consensus time $T_c(n)$ of path averaging is $O(\sqrt{n})$. Furthermore, the consensus cost is linear: $C_c(n) = O(n)$.

This result is interesting since we cannot expect to achieve better than $C_c(n) = O(n)$. Indeed, at least $n$ messages are required to average the values measured by $n$ nodes.

B. Intuition of the proof

The detailed proof of Theorem 4 can be found in [1]. The proof, which relies on Theorem 2, has two steps: first evaluate the matrix $E[W]$ and then upper bound its second largest eigenvalue. In Fig. 1, we show the entries $E[W_{ij}]$ of the matrix $E[W]$ as a function of the distance $d(i,j)$ between node $i$ and node $j$, for standard gossip, geographic gossip and path averaging. In standard gossip only neighboring nodes can average their estimates, so $E[W_{ij}]$ falls to 0 if the distance $d(i,j)$ is larger than the connection radius $r(n)$. In geographic gossip, every pair of nodes can average their estimates with equal chance, thus $E[W_{ij}]$ is a constant. In path averaging, the closer nodes are, the more likely they are to be on the same route. In that case $E[W_{ij}]$ is a decreasing function of $d(i,j)$.

The main phenomenon driving the estimates to the true average is diffusion. The larger the coefficients $E[W_{ij}]$ are, the more efficient the diffusion is. Also, diffusion is a step-by-step phenomenon. In standard gossip, the information mixes slowly because many steps are needed to get information from node $i$ to a faraway node $j$ ($E[W_{ij}] = 0$ if $d(i,j) > r(n)$). By contrast, diffusion in geographic gossip proceeds in one step since any pair of nodes can average their estimates together, but in this case $E[W_{ij}]$ is small, which slows down the diffusion. However, one-step diffusion is a good enough
asset to make geographic gossip more efficient than standard gossip. The situation in path averaging is a trade-off between these two situations. The coefficients $E[W_{ij}]$ concentrate on close nodes, which leads to high values of these coefficients, and this concentration is wide enough to reach nodes that are 1/2 away from each other in the unit square. In path averaging, we can describe the diffusion phenomenon as an efficient two-step diffusion. This idea is formalized in the proof [1]. In conclusion, the proof teaches us that path averaging is a good trade-off between promoting local averaging to increase averaging intensity (large $E[W_{ij}]$) and favoring long-distance averaging to get an efficient diffusion pattern.

Fig. 1: Behavior of $E[W_{ij}]$ as a function of the distance in norm 1 between $i$ and $j$ for standard gossip, geographic gossip and box-path averaging. (Axes: $E[W_{ij}]$ versus $d(i,j)$; the connection radius $r(n)$ is marked on the distance axis.)

V. PATH AVERAGING ON RANDOM GEOMETRIC GRAPHS

A. Description of the algorithm

We can now introduce the path averaging algorithm for random geometric graphs. At each time-slot one random node activates and selects a random position (target) on the unit square region where the nodes are spread out (no node needs to be located exactly on the target). It then creates a packet that contains its current estimate of the average, its position, the number of visited nodes so far (one), and the target location, and passes the packet to a neighbor that is randomly chosen among the neighbors closer to the target. As nodes receive the packet, randomly and greedily forwarding it towards the target, they add their value to the sum and increase the counter.
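The forward pass described so far can be sketched as follows. To keep the example self-contained, the sketch also completes the iteration by setting every node on the route to the route average, and it stops the route at the first node that has no neighbor closer to the target. The function names, the network size, the seed and the iteration count are assumptions made for illustration; the connection radius uses $r(n) = \sqrt{c \log n / n}$ with $c = 4.5$.

```python
import math
import random


def dist(p, q):
    """Euclidean distance between two points of the unit square."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def random_geometric_graph(n, c, rng):
    """Place n nodes uniformly in the unit square; link pairs within r(n)."""
    r = math.sqrt(c * math.log(n) / n)
    pos = [(rng.random(), rng.random()) for _ in range(n)]
    nbrs = [[j for j in range(n) if j != i and dist(pos[i], pos[j]) <= r]
            for i in range(n)]
    return pos, nbrs


def greedy_route(start, target, pos, nbrs, rng):
    """Random greedy routing: hop to a random neighbor closer to the target."""
    path = [start]
    while True:
        here = path[-1]
        d_here = dist(pos[here], target)
        closer = [j for j in nbrs[here] if dist(pos[j], target) < d_here]
        if not closer:   # no neighbor is closer: the route ends here
            return path
        path.append(rng.choice(closer))


def path_averaging_step(x, pos, nbrs, rng):
    """One iteration: route toward a random target, then average the route."""
    start = rng.randrange(len(x))
    target = (rng.random(), rng.random())
    path = greedy_route(start, target, pos, nbrs, rng)
    total, count = 0.0, 0
    for node in path:    # running sum and hop counter carried by the packet
        total += x[node]
        count += 1
    for node in path:    # the route average is written back along the path
        x[node] = total / count
    return path


rng = random.Random(7)
n = 100
pos, nbrs = random_geometric_graph(n, c=4.5, rng=rng)
x0 = [rng.random() for _ in range(n)]
x = list(x0)
mean = sum(x0) / n
sample_path = path_averaging_step(x, pos, nbrs, rng)
for _ in range(3000):
    path_averaging_step(x, pos, nbrs, rng)
error_before = max(abs(v - mean) for v in x0)
error_after = max(abs(v - mean) for v in x)
print("max error before:", error_before, "after:", error_after)
```

Because each hop strictly reduces the distance to the target, a route never visits a node twice and always terminates.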
When the packet reaches its destination node (the first node all of whose one-hop neighbors are farther from the target than the node itself), the destination node computes the average of all the nodes on the path, and reroutes that information backwards on the same route. See Fig. 2 for an illustration of random greedy routing. The algorithm iterates this process of averaging along routes until any desired level of convergence.

Fig. 2: Random greedy routing. Node $i$ has to choose the next node in the route among the nodes that are its neighbors (inside the ball of radius $r(n)$ centered at node $i$) and that are closer to the target than $i$ (inside the ball of radius $d$ centered at the target, where $d$ is the distance between node $i$ and the target). The next node is thus randomly chosen in the intersection of the two balls.

B. Simulations

Fig. 3 compares the behavior of standard gossip, geographic gossip and path averaging on random geometric graphs with an increasing number $n$ of nodes and connection radius $r(n) = \sqrt{c \log n / n}$, $c = 4.5$. We can see that path averaging, also called gossip along the way, performs strikingly better than standard gossip and geographic gossip. This improvement can be explained with the analysis of path averaging on grids. Just as on grids, path averaging on random geometric graphs averages nodes together often if they are close, leading to a concentration of averaging coefficients $E[W_{ij}]$. Long-distance averaging is also frequent enough to get an efficient diffusion pattern. Path averaging is a good trade-off between cheap short-distance averaging and expensive, but diffusive, long-distance averaging. A closer look at simulations shows that as the scaling coefficient $c$ of the connection radius increases, consensus cost approaches a linear behavior in $n$. Proving this is part of future work.

VI. CONCLUSION

In this paper we introduced a novel gossip algorithm for distributed averaging.
The proposed algorithm operates in a distributed and asynchronous manner on locally connected graphs and requires an order-optimal number of communicated messages. The execution of path averaging relies on knowledge of geographic locations; this location information is independently useful and likely to exist in many application scenarios.

The key idea that makes path averaging so efficient is the opportunistic combination of routing and averaging. We believe that the idea of greedily routing towards a randomly pre-selected target (and possibly processing information on the routed paths) is a very useful primitive for designing message-passing algorithms on networks that have planar geometry. The reason is that the target introduces some directionality in the scheduling of message passing which avoids diffusive behavior. Other than computing linear functions, such path-processing algorithms can be designed for information dissemination or more general message-passing computations such as marginal computations or MAP estimates for probabilistic graphical models. Processing and forwarding the messages on random paths can avoid the diffusive nature of random walks and accelerate the convergence of message-passing. We plan to investigate such protocols in future work.

Fig. 3: Consensus cost of standard gossip, geographic gossip (without rejection sampling) and path averaging (gossip along the way) with $r(n) = \sqrt{4.5 \log n / n}$. The simulations were performed over 15 graphs per $n$. (Axes: number of messages versus network size $n$.)

ACKNOWLEDGMENT

This work was supported (in part) by the National Competence Center in Research on Mobile Information and Communication Systems (NCCR-MICS), a center supported by the Swiss National Science Foundation under grant number 5005-67322.

REFERENCES

[1] F. Bénézit, A. G. Dimakis, P. Thiran, and M. Vetterli. Gossip along the way: Order-optimal consensus through randomized path averaging. Submitted for publication, 2008.
[2] P. Denantes, F. Bénézit, P. Thiran, and M. Vetterli. Which Distributed Averaging Algorithm Should I Choose for my Sensor Network? In Proc. IEEE Infocom 08, Phoenix, April 2008.
[3] A. Nedic, A. Olshevsky, A. Ozdaglar, and J. N. Tsitsiklis. On Distributed Averaging Algorithms and Quantization Effects. Submitted for publication, 2007.
[4] T. C. Aysal, M. J. Coates, and M. G. Rabbat. Rates of Convergence of Distributed Average Consensus Using Probabilistic Quantization. In Proc. of the Allerton Conference on Communication, Control, and Computing, Sep. 2007.
[5] D. Spanos, R. Olfati-Saber, and R. Murray. Distributed Kalman filtering in sensor networks with quantifiable performance. In 2005 Fourth International Symposium on Information Processing in Sensor Networks, 2005.
[6] L. Xiao, S. Boyd, and S. Lall. A scheme for asynchronous distributed sensor fusion based on average consensus. In 2005 Fourth International Symposium on Information Processing in Sensor Networks, 2005.
[7] V. Saligrama, M. Alanyali, and O. Savas. Distributed detection in sensor networks with packet losses and finite capacity links. In IEEE Transactions on Signal Processing, to appear, 2007.
[8] D. Kempe, A. Dobra, and J. Gehrke. Gossip-based computation of aggregate information. In Proc. IEEE Conference of Foundations of Computer Science (FOCS), 2003.
[9] S. Boyd, A. Ghosh, B. Prabhakar, and D. Shah. Randomized gossip algorithms. In IEEE Transactions on Information Theory, Special issue of IEEE Transactions on Information Theory and IEEE/ACM Transactions on Networking, 2006.
[10] A. G. Dimakis, A. D. Sarwate, and M. J. Wainwright. Geographic gossip: efficient aggregation for sensor networks. In ACM/IEEE Symposium on Information Processing in Sensor Networks, 2006.
[11] P. Denantes. Performance of Averaging Algorithms in Time-Varying Networks. http://icapeople.epfl.ch/thiran/dipl Denantes.pdf, 2007.