Decentralized Collaborative Learning of Personalized Models over Networks
Paul Vanhaesebrouck (INRIA), Aurélien Bellet (INRIA), Marc Tommasi (Université de Lille)

Abstract

We consider a set of learning agents in a collaborative peer-to-peer network, where each agent learns a personalized model according to its own learning objective. The question addressed in this paper is: how can agents improve upon their locally trained model by communicating with other agents that have similar objectives? We introduce and analyze two asynchronous gossip algorithms running in a fully decentralized manner. Our first approach, inspired from label propagation, aims to smooth pre-trained local models over the network while accounting for the confidence that each agent has in its initial model. In our second approach, agents jointly learn and propagate their model by making iterative updates based on both their local dataset and the behavior of their neighbors. To optimize this challenging objective, our decentralized algorithm is based on ADMM.

1 Introduction

Increasing amounts of data are being produced by interconnected devices such as mobile phones, connected objects, sensors, etc. For instance, history logs are generated when a smartphone user browses the web, gives product ratings and executes various applications. The currently dominant approach to extract useful information from such data is to collect all users' personal data on a server (or a tightly coupled system hosted in a data center) and apply centralized machine learning and data mining techniques. However, this centralization poses a number of issues, such as the need for users to surrender their personal data to the service provider without much control on how the data will be used, while incurring potentially high bandwidth and device battery costs.

(Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS) 2017, Fort Lauderdale, Florida, USA. JMLR: W&CP volume 54. Copyright 2017 by the author(s).)
Even when the learning algorithm can be distributed in a way that keeps data on users' devices, a central entity is often still required for aggregation and coordination (see e.g., McMahan et al., 2016). In this paper, we envision an alternative setting where many users (agents) with local datasets collaborate to learn models by engaging in a fully decentralized peer-to-peer network. Unlike existing work focusing on problems where agents seek to agree on a global consensus model (see e.g., Nedic and Ozdaglar, 2009; Wei and Ozdaglar, 2012; Duchi et al., 2012), we study the case where each agent learns a personalized model according to its own learning objective. We assume that the network graph is given and reflects a notion of similarity between agents (two agents are neighbors in the network if they have a similar learning objective), but each agent is only aware of its direct neighbors. An agent can then learn a model from its (typically scarce) personal data but also from interactions with its neighborhood.

As a motivating example, consider a decentralized recommender system (Boutet et al., 2013, 2014) in which each user rates a small number of movies on a smartphone application and expects personalized recommendations of new movies. In order to train a reliable recommender for each user, one should rely not only on the user's limited data but also on information brought by users with similar taste/profile. The peer-to-peer communication graph could be established when some users go to the same movie theater or attend the same cultural event, and similarity weights between users could be computed based on historical data (e.g., counting how many times people have met in such locations).

Our contributions are as follows. After formalizing the problem of interest, we propose two asynchronous and fully decentralized algorithms for collaborative learning of personalized models. They belong to the family of gossip algorithms (Shah, 2009; Dimakis et al., 2010): agents only communicate with a single neighbor at a time, which makes our algorithms suitable for deployment in large real peer-to-peer networks. Our first approach, called model propagation, is inspired by the graph-based label propagation technique of Zhou et al. (2004). In a first phase, each agent learns a model based on its local data only, without communicating with others. In a second phase, the model parameters are regularized so as to be smooth over the network graph. We introduce confidence values to account for potential discrepancies in the agents' training set sizes, and derive a novel asynchronous gossip algorithm which is simple and efficient. We prove that this algorithm converges to the optimal solution of the problem. Our second approach, called collaborative learning, is more flexible as it interweaves learning and propagation in a single process. Specifically, it optimizes a trade-off between the smoothness of the model parameters over the network on the one hand, and the models' accuracy on the local datasets on the other hand. For this formulation, we propose an asynchronous gossip algorithm based on a decentralized version of the Alternating Direction Method of Multipliers (ADMM) (Boyd et al., 2011). Finally, we evaluate the performance of our methods on two synthetic collaborative tasks: mean estimation and linear classification. Our experiments show the superiority of the proposed approaches over baseline strategies, and confirm the efficiency of our decentralized algorithms.

The rest of the paper is organized as follows. Section 2 formally describes the problem of interest and discusses related work. Our model propagation approach is introduced in Section 3, along with our decentralized algorithm. Section 4 describes our collaborative learning approach, and derives an equivalent formulation which is amenable to optimization using decentralized ADMM. Finally, Section 5 presents our numerical results, and we conclude in Section 6.
2 Preliminaries

2.1 Notations and Problem Setting

We consider a set of n agents V = [n], where [n] := {1, ..., n}. Given a convex loss function ℓ : ℝ^p × X × Y → ℝ, the goal of agent i is to learn a model θ_i ∈ ℝ^p whose expected loss E_{(x_i, y_i) ∼ μ_i} ℓ(θ_i; x_i, y_i) is small with respect to an unknown and fixed distribution μ_i over X × Y. Each agent i has access to a set of m_i ≥ 0 i.i.d. training examples S_i = {(x_i^j, y_i^j)}_{j=1}^{m_i} drawn from μ_i. We allow the training set size to vary widely across agents (some may even have no data at all). This is important in practice, as some agents may be more active than others, may have recently joined the service, etc.

In isolation, agent i can learn a solitary model θ_i^sol by minimizing the loss over its local dataset S_i:

θ_i^sol ∈ arg min_{θ ∈ ℝ^p} L_i(θ) = Σ_{j=1}^{m_i} ℓ(θ; x_i^j, y_i^j).    (1)

The goal for the agents is to improve upon their solitary models by leveraging information from other users in the network. Formally, we consider a weighted connected graph G = (V, E) over the set V of agents, where E ⊆ V × V is the set of undirected edges. We denote by W ∈ ℝ^{n×n} the symmetric nonnegative weight matrix associated with G, where W_ij gives the weight of edge (i, j) ∈ E and, by convention, W_ij = 0 if (i, j) ∉ E or i = j. We assume that the weights represent the underlying similarity between the agents' objectives: W_ij should tend to be large (resp. small) when the objectives of agents i and j are similar (resp. dissimilar). While we assume in this paper that the weights are given, in practical scenarios one could for instance use auxiliary information such as users' profiles (when available) and/or prediction disagreement to estimate them. For notational convenience, we define the diagonal matrix D ∈ ℝ^{n×n} where D_ii = Σ_{j=1}^n W_ij. We will also denote by N_i = {j : W_ij > 0} the set of neighbors of agent i. We assume that the agents only have a local view of the network: they know their neighbors and the associated weights, but not the global topology or even how many agents participate in the network.
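As a toy illustration of (1) (not from the paper's code; the dataset values are made up), consider the quadratic loss ℓ(θ; x) = (θ − x)², for which the solitary model is simply the local sample mean:

```python
import numpy as np

# Sketch of Eq. (1) for the quadratic loss l(theta; x) = (theta - x)^2:
# the minimizer of L_i(theta) = sum_j (theta - x_i^j)^2 is the local mean.
def solitary_model(local_samples):
    x = np.asarray(local_samples, dtype=float)
    return x.mean()

def local_loss(theta, local_samples):
    x = np.asarray(local_samples, dtype=float)
    return np.sum((theta - x) ** 2)

S_i = np.array([0.5, 1.5, 2.0, 0.0])   # made-up local dataset of agent i
theta_sol = solitary_model(S_i)        # = 1.0 for this dataset
```

With scarce data (small m_i), this estimate is noisy, which is precisely what the collaborative approaches below aim to correct.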
Our goal is to propose decentralized algorithms allowing agents to collaboratively improve upon their solitary models by leveraging information from their neighbors.

2.2 Related Work

Several peer-to-peer algorithms have been developed for decentralized averaging (Kempe et al., 2003; Boyd et al., 2006; Colin et al., 2015) and optimization (Nedic and Ozdaglar, 2009; Ram et al., 2010; Duchi et al., 2012; Wei and Ozdaglar, 2012, 2013; Iutzeler et al., 2013; Colin et al., 2016). These approaches solve a consensus problem of the form:

min_{θ ∈ ℝ^p} Σ_{i=1}^n L_i(θ),    (2)

resulting in a global solution common to all agents (e.g., a classifier minimizing the prediction error over the union of all datasets). This is unsuitable for our setting, where agents have personalized objectives.

Our problem is reminiscent of Multi-Task Learning (MTL) (Caruana, 1997), where one jointly learns models for related tasks. Yet, there are several differences with our setting. In MTL, the number of tasks is often small, training sets are well-balanced across tasks, and
all tasks are usually assumed to be positively related (a popular assumption is that all models share a common subspace). Lastly, the algorithms are centralized, aside from the distributed MTL approach of Wang et al. (2016), which is synchronous and relies on a central server.

3 Model Propagation

In this section, we present our model propagation approach. We first introduce a global optimization problem, and then propose and analyze an asynchronous gossip algorithm to solve it.

3.1 Problem Formulation

In this formulation, we assume that each agent i has learned a solitary model θ_i^sol by minimizing its local loss, as in (1). This can be done without any communication between agents. Our goal here consists in adapting these models by making them smoother over the network graph. In order to account for the fact that the solitary models were learned on training sets of different sizes, we use c_i ∈ (0, 1] to denote the confidence we put in the model θ_i^sol of agent i ∈ {1, ..., n}. The c_i's should be proportional to the number of training points: one may for instance set c_i = m_i / max_j m_j (plus some small constant in the case where m_i = 0). Denoting Θ = [θ_1; ...; θ_n] ∈ ℝ^{n×p}, the objective function we aim to minimize is as follows:

Q_MP(Θ) = (1/2) ( Σ_{i<j} W_ij ‖θ_i − θ_j‖² + μ Σ_{i=1}^n D_ii c_i ‖θ_i − θ_i^sol‖² ),    (3)

where μ > 0 is a trade-off parameter and ‖·‖ denotes the Euclidean norm. The first term in the right-hand side of (3) is a classic quadratic form used to smooth the models within neighborhoods: the distance between the new models of agents i and j is encouraged to be small when the weight W_ij is large. The second term prevents models with large confidence from diverging too much from their original values, so that they can propagate useful information to their neighborhood. On the other hand, models with low confidence are allowed larger deviations: in the extreme case where agent i has very little or even no data (i.e., c_i is negligible), its model is fully determined by the neighboring models. The presence of D_ii in the second term is simply for normalization.
We have the following result (the proof is in the supplementary material).

Proposition 1 (Closed-form solution). Let P = D^{-1} W be the stochastic similarity matrix associated with the graph G and Θ^sol = [θ_1^sol; ...; θ_n^sol] ∈ ℝ^{n×p}. The solution Θ* = arg min_{Θ ∈ ℝ^{n×p}} Q_MP(Θ) is given by

Θ* = ᾱ (I − ᾱ(I − C) − αP)^{-1} C Θ^sol,    (4)

with α ∈ (0, 1) such that μ = (1 − α)/α, and ᾱ = 1 − α.

Our formulation is a generalization of the semi-supervised label propagation technique of Zhou et al. (2004), which can be recovered by setting C = I (same confidence for all nodes). Note that it is strictly more general: we can see from (4) that unless the confidence values are equal for all agents, the confidence information cannot be incorporated by using different solitary models Θ^sol or by considering a different graph (because ᾱ(I − C) + αP is not stochastic). The asynchronous gossip algorithm we present below thus also applies to label propagation, for which, to the best of our knowledge, no such algorithm was previously known.

Computing the closed-form solution (4) requires knowledge of the global network and of all solitary models, which are unknown to the agents. Our starting point for the derivation of an asynchronous gossip algorithm is the following iterative form: for any t ≥ 0,

Θ(t + 1) = (αI + ᾱC)^{-1} ( αP Θ(t) + ᾱC Θ^sol ).    (5)

The sequence (Θ(t))_{t∈ℕ} can be shown to converge to (4) regardless of the choice of initial value Θ(0); see the supplementary material for details. An interesting observation about this recursion is that it can be decomposed into agent-centric updates which only involve neighborhoods. Indeed, for any agent i and any t ≥ 0:

θ_i(t + 1) = (1 / (α + ᾱc_i)) ( α Σ_{j ∈ N_i} (W_ij / D_ii) θ_j(t) + ᾱ c_i θ_i^sol ).

The iteration (5) can thus be understood as a decentralized but synchronous process where, at each step, every agent communicates with all its neighbors to collect their current model parameters and uses this information to update its model.
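To make the relationship between the closed form (4) and the recursion (5) concrete, here is a small self-contained sketch (not the authors' code; the graph, confidence values and the choice α = 0.9 are made up for illustration) that checks numerically that the synchronous iteration converges to the closed-form solution:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy instance (illustrative values only): n agents, scalar models (p = 1),
# dense symmetric nonnegative weights with zero diagonal.
n = 6
W = rng.random((n, n))
W = (W + W.T) / 2
np.fill_diagonal(W, 0.0)
D = np.diag(W.sum(axis=1))
P = np.linalg.inv(D) @ W            # stochastic similarity matrix P = D^{-1} W

c = rng.uniform(0.2, 1.0, size=n)   # confidence values c_i in (0, 1]
C = np.diag(c)
alpha = 0.9                         # so that mu = (1 - alpha) / alpha
abar = 1.0 - alpha
theta_sol = rng.normal(size=(n, 1))

# Closed form (4): Theta* = abar (I - abar (I - C) - alpha P)^{-1} C Theta_sol
I = np.eye(n)
theta_star = abar * np.linalg.solve(I - abar * (I - C) - alpha * P,
                                    C @ theta_sol)

# Synchronous recursion (5), started from zero; it converges to (4)
# from any initial value.
theta = np.zeros((n, 1))
M = np.linalg.inv(alpha * I + abar * C)
for _ in range(2000):
    theta = M @ (alpha * P @ theta + abar * C @ theta_sol)
```

The iteration matrix (αI + ᾱC)^{-1} αP has infinity norm at most α/(α + ᾱ min_i c_i) < 1, which is why the recursion contracts to the fixed point (4).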
Even assuming that the agents have access to a global clock to synchronize the updates (which is unrealistic in many practical scenarios), synchronization incurs large delays since all agents must finish the update at step t before anyone starts step t + 1. The fact that agents must contact all their neighbors at each iteration further hinders the efficiency of the algorithm. To avoid these limitations, we propose below an asynchronous gossip algorithm.

3.2 Asynchronous Gossip Algorithm

In the asynchronous setting, each agent has a local clock ticking at the times of a rate-1 Poisson process, and wakes up when it ticks. As local clocks are i.i.d.,
4 Decentralzed Collaboratve Learnng of Personalzed Models over Networks t s equvalent to actvatng a sngle node unformly at random at each tme step (Boyd et al., 2006). 1 The dea behnd our algorthm s the followng. At any tme t 0, each agent wll mantan a (possbly outdated) knowledge of ts neghbors models. For mathematcal convenence, we wll consder a matrx Θ (t) R n p where ts -th lne Θ (t) Rp s agent s model at tme t, and for j, ts j-th lne Θ j (t) Rp s agent s last knowledge of the model of agent j. For any j / N {} and any t 0, we wll mantan Θ j (t) = 0. Let Θ = [ Θ 1,..., Θ n ] R n2 p be the horzontal stackng of all the Θ s. If agent wakes up at tme step t, two consecutve actons are performed: communcaton step: agent selects a random neghbor j N wth prob. π j and both agents update ther knowledge of each other s model: Θ j (t + 1) = Θ j j (t) and Θ j(t + 1) = Θ (t), update step: agents and j update ther own models based on current knowledge. For l {, j}: Θ l l(t + 1) = (α + ᾱc l ) 1 (α W ) lk Θk D l (t + 1) + ᾱc l θl sol. (6) ll k N l All other varables n the network reman unchanged. In the communcaton step above, π j corresponds to the probablty that agent selects agent j. For any n, we have π [0, 1] n such that n j=1 πj = 1 and π j > 0 f and only f j N. Our algorthm belongs to the famly of gossp algorthms as each agent communcates wth at most one neghbor at a tme. Gossp algorthms are known to be very effectve for decentralzed computaton n peer-topeer networks (see Dmaks et al., 2010; Shah, 2009). Thanks to ts asynchronous updates, our algorthm has the potental to be much faster than a synchronous verson when executed n a large peer-to-peer network. The man result of ths secton shows that our algorthm converges to a state where all nodes have ther optmal model (and those of ther neghbors). Theorem 1 (Convergence). Let Θ(0) n2 p R be some arbtrary ntal value and ( Θ(t)) t N be the sequence generated by our algorthm. 
Let Θ* = arg min_{Θ ∈ ℝ^{n×p}} Q_MP(Θ) be the optimal solution to model propagation. For any i ∈ [n], we have:

lim_{t→∞} E[ Θ̃_j^i(t) ] = Θ*_j   for j ∈ N_i ∪ {i}.

¹ Our analysis straightforwardly extends to the case where agents have clocks ticking at different rates.

Sketch of proof. The first step of the proof is to rewrite the algorithm as an equivalent random iterative process over Θ̃ ∈ ℝ^{n²×p} of the form Θ̃(t + 1) = A(t) Θ̃(t) + b(t), for any t ≥ 0. Then, we show that the spectral radius of E[A(t)] is smaller than 1, which allows us to exhibit the convergence to the desired quantity. The full proof can be found in the supplementary material.

4 Collaborative Learning

In the approach presented in the previous section, models are learned locally by each agent and then propagated through the graph. In this section, we allow the agents to simultaneously learn their model and propagate it through the network. In other words, agents iteratively update their models based on both their local dataset and the behavior of their neighbors. While in general this is computationally more costly than merely propagating pre-trained models, we can expect significant improvements in terms of accuracy. As in the case of model propagation, we first introduce the global objective function and then propose an asynchronous gossip algorithm, which is based on the general paradigm of ADMM (Boyd et al., 2011).

4.1 Problem Formulation

In contrast to model propagation, the objective function to minimize here takes into account the loss of each personal model on the local dataset, rather than simply the distance to the solitary model:

Q_CL(Θ) = Σ_{i<j} W_ij ‖θ_i − θ_j‖² + μ Σ_{i=1}^n D_ii L_i(θ_i),    (7)

where μ > 0 is a trade-off parameter. The associated optimization problem is Θ* = arg min_{Θ ∈ ℝ^{n×p}} Q_CL(Θ). The first term in the right-hand side of (7) is the same as in the model propagation objective (3) and tends to favor models that are smooth on the graph.
However, while in model propagation enforcing smoothness on the models may potentially translate into a significant decrease of accuracy on the local datasets (even for relatively small changes in parameter values with respect to the solitary models), here the second term prevents this. It allows more flexibility in settings where very different parameter values define models which actually give very similar predictions. Note that the confidence is built into the second term, as L_i is a sum over the local dataset of agent i. In general, there is no closed-form expression for Θ*, but we can solve the problem with a decentralized iterative algorithm, as shown in the rest of this section.
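For intuition, the objective (7) can be examined numerically in the special case of the quadratic loss, where Q_CL is a quadratic function of Θ and plain centralized gradient descent already finds the minimizer that the decentralized algorithm of the next subsection targets. The following sketch uses made-up data and parameter values (n = 5, μ = 0.5) and is not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy instance with the quadratic loss, so L_i(theta) = sum_j (theta - x_i^j)^2.
n, mu, lr, steps = 5, 0.5, 0.01, 5000
W = rng.random((n, n))
W = (W + W.T) / 2
np.fill_diagonal(W, 0.0)
d = W.sum(axis=1)                                   # D_ii
data = [rng.normal(loc=rng.normal(), size=10) for _ in range(n)]
xbar = np.array([x.mean() for x in data])           # local sample means
m = np.array([len(x) for x in data])                # local dataset sizes

def q_cl(theta):
    smooth = sum(W[i, j] * (theta[i] - theta[j]) ** 2
                 for i in range(n) for j in range(i + 1, n))
    local = sum(d[i] * np.sum((theta[i] - data[i]) ** 2) for i in range(n))
    return smooth + mu * local

def grad(theta):
    # d/dtheta_i of (7): 2 (D_ii theta_i - (W theta)_i)
    #                    + 2 mu D_ii m_i (theta_i - xbar_i)
    return 2 * (d * theta - W @ theta) + 2 * mu * d * m * (theta - xbar)

theta = np.zeros(n)
for _ in range(steps):
    theta -= lr * grad(theta)
```

Note how the local term scales with m_i through the unnormalized sum L_i: agents with more data are pulled more strongly toward their own sample mean, which is the implicit confidence weighting mentioned above.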
4.2 Asynchronous Gossip Algorithm

We propose an asynchronous decentralized algorithm for minimizing (7) based on the Alternating Direction Method of Multipliers (ADMM). This general method is a popular way to solve consensus problems of the form (2) in distributed and decentralized settings (see e.g., Boyd et al., 2011; Wei and Ozdaglar, 2012, 2013; Iutzeler et al., 2013). In our setting, we do not seek a consensus in the classic sense of (2), since our goal is to learn a personalized model for each agent. However, we show below that we can reformulate (7) as an equivalent partial consensus problem which is amenable to decentralized optimization with ADMM.

Problem reformulation. Let Θ^i be the set of |N_i| + 1 variables θ_j ∈ ℝ^p for j ∈ N_i ∪ {i}, and denote θ_j by Θ_j^i. This is similar to the notation used in Section 3, except that here we consider Θ^i as living in ℝ^{(|N_i|+1)×p}. We now define

Q_CL^i(Θ^i) = (1/2) Σ_{j ∈ N_i} W_ij ‖θ_i − θ_j‖² + μ D_ii L_i(θ_i),

so that we can rewrite our problem (7) as min_{Θ ∈ ℝ^{n×p}} Σ_{i=1}^n Q_CL^i(Θ^i). In this formulation, the objective functions associated with the agents are dependent, as they share some decision variables in Θ. In order to apply decentralized ADMM, we need to decouple the objectives. The idea is to introduce a local copy Θ̃^i ∈ ℝ^{(|N_i|+1)×p} of the decision variables Θ^i for each agent i, and to impose equality constraints Θ̃_i^i = Θ̃_i^j for all i ∈ [n], j ∈ N_i. This partial consensus can be seen as requiring that two neighboring agents agree on each other's personalized model. We further introduce 4 secondary variables Z_{e,i}^i, Z_{e,j}^i, Z_{e,i}^j and Z_{e,j}^j for each edge e = (i, j), which can be viewed as estimates of the models Θ̃_i and Θ̃_j known by each end of e and will allow an efficient decomposition of the ADMM updates. Formally, denoting Θ̃ = [Θ̃^1; ...; Θ̃^n] ∈ ℝ^{(2|E|+n)×p} and Z ∈ ℝ^{4|E|×p}, we introduce the formulation

min_{Θ̃ ∈ ℝ^{(2|E|+n)×p}, Z ∈ C_E} Σ_{i=1}^n Q_CL^i(Θ̃^i)
s.t. ∀ e = (i, j) ∈ E:  Z_{e,i}^i = Θ̃_i^i,  Z_{e,j}^i = Θ̃_j^i,  Z_{e,j}^j = Θ̃_j^j,  Z_{e,i}^j = Θ̃_i^j,    (8)

where C_E = {Z ∈ ℝ^{4|E|×p} : Z_{e,i}^i = Z_{e,i}^j and Z_{e,j}^i = Z_{e,j}^j for all e = (i, j) ∈ E}.
It is easy to see that Problem (8) is equivalent to the original problem (7) in the following sense: the minimizer Θ̃* of (8) satisfies (Θ̃*^i)_j = Θ*_j for all i ∈ [n] and j ∈ N_i ∪ {i}. Further observe that the set of constraints involving Θ̃ can be written DΘ̃ + HZ = 0, where H = −I of dimension 4|E| × 4|E| is diagonal and invertible, and D of dimension 4|E| × (2|E| + n) contains exactly one entry equal to 1 in each row. The assumptions of Wei and Ozdaglar (2013) are thus met, and we can apply asynchronous decentralized ADMM.

Before presenting the algorithm, we derive the augmented Lagrangian associated with Problem (8). Let Λ_{e,i}^i, Λ_{e,j}^i be the dual variables associated with the constraints involving Θ̃^i in (8). For convenience, we denote by Z^i ∈ ℝ^{2|N_i|×p} the set of secondary variables {Z_{e,i}^i, Z_{e,j}^i}_{e=(i,j)∈E} associated with agent i. Similarly, we denote by Λ^i ∈ ℝ^{2|N_i|×p} the set of dual variables {Λ_{e,i}^i, Λ_{e,j}^i}_{e=(i,j)∈E}. The augmented Lagrangian is given by:

L_ρ(Θ̃, Z, Λ) = Σ_{i=1}^n L_ρ^i(Θ̃^i, Z^i, Λ^i),

where ρ > 0 is a penalty parameter, Z ∈ C_E and

L_ρ^i(Θ̃^i, Z^i, Λ^i) = Q_CL^i(Θ̃^i) + Σ_{j : e=(i,j) ∈ E} [ Λ_{e,i}^i · (Θ̃_i^i − Z_{e,i}^i) + Λ_{e,j}^i · (Θ̃_j^i − Z_{e,j}^i) + (ρ/2)( ‖Θ̃_i^i − Z_{e,i}^i‖² + ‖Θ̃_j^i − Z_{e,j}^i‖² ) ].

Algorithm. ADMM consists in approximately minimizing the augmented Lagrangian L_ρ(Θ̃, Z, Λ) by alternating minimization with respect to the primal variable Θ̃ and the secondary variable Z, together with an iterative update of the dual variable Λ. We first briefly discuss how to instantiate the initial values Θ̃(0), Z(0) and Λ(0). The only constraint on these initial values is to have Z(0) ∈ C_E, so a simple option is to initialize all variables to 0. That said, it is typically advantageous to use a warm-start strategy. For instance, each agent i can send its solitary model to its neighbors, and then set Θ̃_i^i = θ_i^sol and Θ̃_j^i = θ_j^sol for all j ∈ N_i, with Z_{e,i}^i = Θ̃_i^i and Z_{e,j}^i = Θ̃_j^i for all e = (i, j) ∈ E, and Λ(0) = 0. Alternatively, one can initialize the algorithm with the model propagation solution obtained using the method of Section 3.

Recall from Section 3.2 that in the asynchronous setting, a single agent wakes up at each time step and selects one of its neighbors. Assume that agent i wakes up at some iteration t ≥ 0 and selects j ∈ N_i.
Denoting e = (i, j), the iteration goes as follows:

1. Agent i updates its primal variables:

Θ̃^i(t + 1) = arg min_{Θ^i ∈ ℝ^{(|N_i|+1)×p}} L_ρ^i(Θ^i, Z^i(t), Λ^i(t)),

and sends Θ̃_i^i(t + 1), Θ̃_j^i(t + 1), Λ_{e,i}^i(t), Λ_{e,j}^i(t) to agent j. Agent j executes the same steps with respect to i.
2. Using Θ̃_j^j(t + 1), Θ̃_i^j(t + 1), Λ_{e,j}^j(t), Λ_{e,i}^j(t) received from j, agent i updates its secondary variables:

Z_{e,i}^i(t + 1) = (1/2) [ (1/ρ)( Λ_{e,i}^i(t) + Λ_{e,i}^j(t) ) + Θ̃_i^i(t + 1) + Θ̃_i^j(t + 1) ],
Z_{e,j}^i(t + 1) = (1/2) [ (1/ρ)( Λ_{e,j}^j(t) + Λ_{e,j}^i(t) ) + Θ̃_j^j(t + 1) + Θ̃_j^i(t + 1) ].

Agent j updates its secondary variables symmetrically, so by construction we have Z(t + 1) ∈ C_E.

3. Agent i updates its dual variables:

Λ_{e,i}^i(t + 1) = Λ_{e,i}^i(t) + ρ ( Θ̃_i^i(t + 1) − Z_{e,i}^i(t + 1) ),
Λ_{e,j}^i(t + 1) = Λ_{e,j}^i(t) + ρ ( Θ̃_j^i(t + 1) − Z_{e,j}^i(t + 1) ).

Agent j updates its dual variables symmetrically. All other variables in the network remain unchanged.

Step 1 has a simple solution for some loss functions commonly used in machine learning (such as the quadratic and L1 losses), and when this is not the case, ADMM is typically robust to approximate solutions of the corresponding subproblems (obtained for instance after a few steps of gradient descent); see Boyd et al. (2011) for examples and further practical considerations. Asynchronous ADMM converges almost surely to an optimal solution at a rate of O(1/t) for convex objective functions (see Wei and Ozdaglar, 2013).

5 Experiments

In this section, we provide numerical experiments to evaluate the performance of our decentralized algorithms with respect to accuracy, convergence rate and the amount of communication. To this end, we introduce two synthetic collaborative tasks: mean estimation and linear classification.

5.1 Collaborative Mean Estimation

We first introduce a simple task in which the goal of each agent is to estimate the mean of a 1D distribution. To this end, we adapt the two intertwining moons dataset popular in semi-supervised learning (Zhou et al., 2004). We consider a set of 300 agents, together with auxiliary information about each agent i in the form of a vector v_i ∈ ℝ². The true distribution μ_i of an agent is either N(1, 40) or N(−1, 40) depending on whether v_i belongs to the upper or lower moon; see Figure 1(a). Each agent i receives m_i samples x_i^1, ..., x_i^{m_i} ∈ ℝ from its distribution μ_i.
Its solitary model is then given by θ_i^sol = (1/m_i) Σ_{j=1}^{m_i} x_i^j, which corresponds to the use of the quadratic loss function ℓ(θ; x_i^j) = (θ − x_i^j)². Finally, the graph over agents is the complete graph where the weight between agents i and j is given by a Gaussian kernel on the agents' auxiliary information, W_ij = exp(−‖v_i − v_j‖²/2σ²), with σ = 0.1 for appropriate scaling. In all experiments, the parameter α of model propagation was set to 0.99, which gave the best results on a held-out set of random problem instances. We first use this mean estimation task to illustrate the importance of considering confidence values in our model propagation formulation, and then to evaluate the efficiency of our asynchronous decentralized algorithm.

Relevance of confidence values. Our goal here is to show that introducing confidence values into the model propagation approach can significantly improve the overall accuracy, especially when the agents receive unbalanced amounts of data. In this experiment, we only compare model propagation with and without confidence values, so we compute the optimal solutions directly using the closed-form solution (4). We generate several problem instances with varying standard deviation for the confidence values c_i. More precisely, we sample c_i for each agent from a uniform distribution centered at 1/2 with width ε ∈ [0, 1]. The number of samples m_i given to agent i is then set to m_i = c_i · 100. The larger ε, the more variance in the size of the local datasets. Figures 1(b)-1(d) give a visualization of the models before and after propagation on a problem instance for the hardest setting ε = 1. Figure 2 (left-middle) shows results averaged over 1000 random problem instances for several values of ε. As expected, when the local dataset sizes are well-balanced (small ε), model propagation performs the same with or without the use of confidence values. Indeed, both have similar L2 error with respect to the target mean, and the win ratio is about 0.5. However, the performance gap in favor of using confidence values increases sharply with ε.
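A compact stand-in for this experimental setup can be sketched as follows (not the paper's code: the two moons are replaced by two Gaussian clusters of auxiliary vectors, and only 20 agents are used for brevity, while σ = 0.1, the N(±1, 40) distributions and m_i = 100·c_i follow the text):

```python
import numpy as np

rng = np.random.default_rng(2)

n, sigma, eps = 20, 0.1, 1.0
cluster = rng.integers(0, 2, size=n)                   # stand-in for the two moons
v = 0.05 * rng.normal(size=(n, 2)) + cluster[:, None]  # auxiliary vectors in R^2
true_mean = np.where(cluster == 0, 1.0, -1.0)          # N(1, 40) vs N(-1, 40)

# Gaussian kernel on auxiliary information: W_ij = exp(-||v_i - v_j||^2 / (2 sigma^2))
sq_dists = ((v[:, None, :] - v[None, :, :]) ** 2).sum(axis=-1)
W = np.exp(-sq_dists / (2 * sigma ** 2))
np.fill_diagonal(W, 0.0)

# Unbalanced local datasets: c_i ~ U centered at 1/2 with width eps, m_i = 100 c_i
c = rng.uniform(0.5 - eps / 2, 0.5 + eps / 2, size=n).clip(1e-2, 1.0)
m = np.maximum(1, np.round(100 * c)).astype(int)
theta_sol = np.array([rng.normal(true_mean[i], np.sqrt(40), size=m[i]).mean()
                      for i in range(n)])
```

With σ = 0.1, cross-cluster weights are negligible while within-cluster weights stay close to 1, so smoothing mostly averages solitary models within each cluster.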
For ε = 1, the win ratio is overwhelmingly in favor of using confidence values. Strikingly, the error of model propagation with confidence values remains constant as ε increases. These results empirically confirm the relevance of introducing confidence values into the objective function.

Asynchronous algorithm. In this second experiment, we compare the asynchronous model propagation algorithm with the synchronous variant given by (5). We are interested in the average L2 error of the models as a function of the number of pairwise communications (number of exchanges from one agent to another). Note that a single iteration of the synchronous (resp. asynchronous) algorithm corresponds to 2|E| (resp. 2) communications. For the asynchronous algorithm, we set the neighbor selection distribution π^i of agent i to be uniform over the set of neighbors N_i.

(a) Ground models; (b) Solitary models; (c) MP without confidence; (d) MP with confidence.
Figure 1: Illustration of the collaborative mean estimation task, where each point represents an agent and its 2D coordinates the associated auxiliary information. Figure 1(a) shows the ground truth models (blue for mean 1 and red for mean −1). Figure 1(b) shows the solitary models (local averages) for an instance where ε = 1. Figures 1(c)-1(d) show the models after propagation, without/with the use of confidence values.

Figure 2: Results on the mean estimation task. (Left-middle) Model propagation with and without confidence values w.r.t. the unbalancedness of the local datasets. The left figure shows the L2 errors, while the middle one shows the percentage of wins in favor of using confidence values. (Right) L2 error of the synchronous and asynchronous model propagation algorithms with respect to the number of pairwise communications.

Figure 2 (right) shows the results on a problem instance generated as in the previous experiment (with ε = 1). Since the asynchronous algorithm is randomized, we average its results over 100 random runs. We see that our asynchronous algorithm achieves an accuracy/communication trade-off which is almost as good as that of the synchronous one, without requiring any synchronization. It is thus expected to be much faster than the synchronous algorithm on large decentralized networks with communication delays and/or without efficient global synchronization.

5.2 Collaborative Linear Classification

In the previous mean estimation task, the squared distance between two model parameters (i.e., estimated means) translates into the same difference in L2 error with respect to the target mean. Therefore, our collaborative learning formulation is essentially equivalent to our model propagation approach.
To show the benefits that can be brought by collaborative learning, we now consider a linear classification task. Since two linear separators with significantly different parameters can lead to similar predictions on a given dataset, incorporating the local errors into the objective function rather than simply the distances between parameters should lead to more accurate models.

We consider a set of 100 agents whose goal is to perform linear classification in ℝ^p. For ease of visualization, the target (true) model of each agent lies in a 2-dimensional subspace: we represent it as a vector in ℝ^p with the first two entries drawn from a normal distribution centered at the origin and the remaining ones equal to 0. We consider the similarity graph where the weight between two agents i and j is a Gaussian kernel on the distance between target models, where the distance here refers to the length of the chord of the angle φ_ij between target models projected on a unit circle. More formally, W_ij = exp((cos(φ_ij) − 1)/σ), with σ = 0.1 for appropriate scaling. Edges with negligible weights are ignored to speed up computation. We refer the reader to the supplementary material for a 2D visualization of the target models and the links between them. Every agent receives a random number of training points drawn uniformly between 1 and 20. Each training point (in ℝ^p) is drawn uniformly around the origin, and the binary label is given by the prediction of the target linear separator. We then add label noise by randomly flipping each label with a small probability. The loss function used by the agents is the hinge loss, given by ℓ(θ; (x_i, y_i)) = max(0, 1 − y_i θ^T x_i). As in the previous experiment, for each algorithm we tune the value of α on a held-out set of random problem instances.

Figure 3: Results on the linear classification task. (Left) Test accuracy of model propagation and collaborative learning with varying feature space dimension. (Middle) Average test accuracy of model propagation and collaborative learning with respect to the number of training points available to the agent (feature dimension p = 50). (Right) Test accuracy of synchronous and asynchronous collaborative learning and asynchronous model propagation with respect to the number of pairwise communications (linear classification task, p = 50).

Finally, we evaluate the quality of the learned model of each agent by computing the accuracy on a separate sample of 100 test points drawn from the same distribution as the training set. In the following, we use this linear classification task to compare the performance of collaborative learning against model propagation, and to evaluate the efficiency of our asynchronous algorithms.

MP vs. CL. In this first experiment, we compare the accuracy of the models learned by model propagation and collaborative learning with the feature space dimension p ranging from 2 to 100. Figure 3 (left) shows the results averaged over 10 randomly generated problem instances for each value of p. As baselines, we also plot the average accuracy of the solitary models and of the global consensus model minimizing (2). The accuracy of all models decreases with the feature space dimension, which comes from the fact that the expected number of training samples remains constant. As expected, the consensus model achieves very poor performance since agents have very different objectives.
On the other hand, both model propagation and collaborative learning are able to improve very significantly over the solitary models, even in higher dimensions where on average these initial models barely outperform a random guess. Furthermore, collaborative learning always outperforms model propagation.

We further analyze these results by plotting the accuracy with respect to the size of the local training set (Figure 3, middle). As expected, the accuracy of the solitary models is higher for larger training sets. Furthermore, collaborative learning converges to models which have similar accuracy regardless of the training size, effectively correcting for the initial unbalancedness. While model propagation also performs well, it is consistently outperformed by collaborative learning on all training sizes. This gap is larger for agents with more training data: in model propagation, the large confidence values associated with these agents prevent them from deviating much from their solitary models, thereby limiting their own gain in accuracy.

Asynchronous algorithms. This second experiment compares our asynchronous collaborative learning algorithm with a synchronous variant also based on ADMM (see the supplementary material for details) in terms of the number of pairwise communications. Figure 3 (right) shows that our asynchronous algorithm performs as well as its synchronous counterpart and should thus be largely preferred for deployment in real peer-to-peer networks. It is also worth noting that asynchronous model propagation converges an order of magnitude faster than collaborative learning, as it only propagates models that are pre-trained locally. Model propagation can thus provide a valuable warm-start initialization for collaborative learning.

Scalability. We also observe experimentally that the number of iterations needed by our decentralized algorithms to converge scales favorably with the size of the network (see the supplementary material for details).
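For completeness, the chordal similarity weights used in the classification experiments above can be sketched as follows (a hypothetical reconstruction from the description, with n = 10 and p = 50 for brevity; σ = 0.1 as in the text):

```python
import numpy as np

rng = np.random.default_rng(3)

# Target models live in a 2-dimensional subspace of R^p: only the first two
# coordinates are nonzero. W_ij = exp((cos(phi_ij) - 1) / sigma), where phi_ij
# is the angle between the target models of agents i and j.
n, p, sigma = 10, 50, 0.1
targets = np.zeros((n, p))
targets[:, :2] = rng.normal(size=(n, 2))

unit = targets / np.linalg.norm(targets, axis=1, keepdims=True)
cos_phi = np.clip(unit @ unit.T, -1.0, 1.0)
W = np.exp((cos_phi - 1.0) / sigma)
np.fill_diagonal(W, 0.0)
```

Edges with negligible weight would then be dropped, as described above, to sparsify the graph and speed up computation.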
6 Conclusion

We proposed, analyzed and evaluated two asynchronous peer-to-peer algorithms for the novel setting of decentralized collaborative learning of personalized models. This work opens up interesting perspectives. The link between the similarity graph and the generalization performance of the resulting models should be formally analyzed. This could in turn guide the design of generic methods to estimate the graph weights, making our approaches widely applicable. Other directions of interest include the development of privacy-preserving algorithms as well as extensions to time-evolving networks and sequential arrival of data.

Acknowledgments. This work was partially supported by grant ANR-16-CE and by a grant from CPER Nord-Pas de Calais/FEDER DATA Advanced data science and technologies.
References

Boutet, A., Frey, D., Guerraoui, R., Jégou, A., and Kermarrec, A.-M. (2013). WHATSUP: A Decentralized Instant News Recommender. In Proceedings of the 27th IEEE International Symposium on Parallel and Distributed Processing (IPDPS).

Boutet, A., Frey, D., Guerraoui, R., Kermarrec, A.-M., and Patra, R. (2014). HyRec: Leveraging Browsers for Scalable Recommenders. In Proceedings of the 15th International Middleware Conference.

Boyd, S., Ghosh, A., Prabhakar, B., and Shah, D. (2006). Randomized Gossip Algorithms. IEEE/ACM Transactions on Networking, 14(SI).

Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. (2011). Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers. Foundations and Trends in Machine Learning, 3(1).

Caruana, R. (1997). Multitask Learning. Machine Learning, 28(1).

Colin, I., Bellet, A., Salmon, J., and Clémençon, S. (2015). Extending Gossip Algorithms to Distributed Estimation of U-statistics. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS).

Colin, I., Bellet, A., Salmon, J., and Clémençon, S. (2016). Gossip Dual Averaging for Decentralized Optimization of Pairwise Functions. In Proceedings of the 33rd International Conference on Machine Learning (ICML).

Dimakis, A. G., Kar, S., Moura, J. M. F., Rabbat, M. G., and Scaglione, A. (2010). Gossip Algorithms for Distributed Signal Processing. Proceedings of the IEEE, 98(11).

Duchi, J. C., Agarwal, A., and Wainwright, M. J. (2012). Dual Averaging for Distributed Optimization: Convergence Analysis and Network Scaling. IEEE Transactions on Automatic Control, 57(3).

Iutzeler, F., Bianchi, P., Ciblat, P., and Hachem, W. (2013). Asynchronous Distributed Optimization using a Randomized Alternating Direction Method of Multipliers. In Proceedings of the 52nd IEEE Conference on Decision and Control (CDC).

Kempe, D., Dobra, A., and Gehrke, J. (2003). Gossip-Based Computation of Aggregate Information. In Proceedings of the 44th Annual IEEE Symposium on Foundations of Computer Science (FOCS).

McMahan, H. B., Moore, E., Ramage, D., and Agüera y Arcas, B. (2016). Federated Learning of Deep Networks using Model Averaging. Technical report, arXiv.

Nedic, A. and Ozdaglar, A. E. (2009). Distributed Subgradient Methods for Multi-Agent Optimization. IEEE Transactions on Automatic Control, 54(1).

Ram, S. S., Nedic, A., and Veeravalli, V. V. (2010). Distributed Stochastic Subgradient Projection Algorithms for Convex Optimization. Journal of Optimization Theory and Applications, 147(3).

Shah, D. (2009). Gossip Algorithms. Foundations and Trends in Networking, 3(1).

Wang, J., Kolar, M., and Srebro, N. (2016). Distributed Multi-Task Learning. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS).

Wei, E. and Ozdaglar, A. E. (2012). Distributed Alternating Direction Method of Multipliers. In Proceedings of the 51st IEEE Conference on Decision and Control (CDC).

Wei, E. and Ozdaglar, A. E. (2013). On the O(1/k) Convergence of Asynchronous Distributed Alternating Direction Method of Multipliers. In IEEE Global Conference on Signal and Information Processing (GlobalSIP).

Zhou, D., Bousquet, O., Lal, T. N., Weston, J., and Schölkopf, B. (2004). Learning with Local and Global Consistency. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS), volume 16.
More information