Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers

Size: px

Start display at page:

Download "Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers"

Gavin Cole
5 years ago
Views:

1 Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers Stephen Boyd MIIS Xi An, 1/7/12 source: Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers (Boyd, Parikh, Chu, Peleato, Eckstein) 1

2 Goals robust methods for arbitrary-scale optimization machine learning/statistics with huge data-sets dynamic optimization on large-scale network decentralized optimization devices/processors/agents coordinate to solve large problem, by passing relatively small messages 2

3 Outline Dual decomposition Method of multipliers Alternating direction method of multipliers Common patterns Examples Consensus and exchange Conclusions 3

4 Outline Dual decomposition Method of multipliers Alternating direction method of multipliers Common patterns Examples Consensus and exchange Conclusions Dual decomposition 4

5 Dual problem convex equality constrained optimization problem minimize f(x) subject to Ax = b Lagrangian: L(x,y) = f(x)+y T (Ax b) dual function: g(y) = inf x L(x,y) dual problem: maximize g(y) recover x = argmin x L(x,y ) Dual decomposition 5

6 Dual ascent gradient method for dual problem: y k+1 = y k +α k g(y k ) g(y k ) = A x b, where x = argmin x L(x,y k ) dual ascent method is x k+1 := argmin x L(x,y k ) // x-minimization y k+1 := y k +α k (Ax k+1 b) // dual update works, with lots of strong assumptions Dual decomposition 6

7 Dual decomposition suppose f is separable: f(x) = f 1 (x 1 )+ +f N (x N ), x = (x 1,...,x N ) then L is separable in x: L(x,y) = L 1 (x 1,y)+ +L N (x N,y) y T b, L i (x i,y) = f i (x i )+y T A i x i x-minimization in dual ascent splits into N separate minimizations x k+1 i := argmin x i L i (x i,y k ) which can be carried out in parallel Dual decomposition 7

8 Dual decomposition dual decomposition (Everett, Dantzig, Wolfe, Benders ) x k+1 i := argmin xi L i (x i,y k ), i = 1,...,N y k+1 := y k +α k ( N i=1 A ix k+1 i b) scatter y k ; update x i in parallel; gather A i x k+1 i solve a large problem by iteratively solving subproblems (in parallel) dual variable update provides coordination works, with lots of assumptions; often slow Dual decomposition 8

9 Outline Dual decomposition Method of multipliers Alternating direction method of multipliers Common patterns Examples Consensus and exchange Conclusions Method of multipliers 9

10 Method of multipliers a method to robustify dual ascent use augmented Lagrangian (Hestenes, Powell 1969), ρ > L ρ (x,y) = f(x)+y T (Ax b)+(ρ/2) Ax b 2 2 method of multipliers (Hestenes, Powell; analysis in Bertsekas 1982) x k+1 := argminl ρ (x,y k ) x y k+1 := y k +ρ(ax k+1 b) (note specific dual update step length ρ) Method of multipliers 1

11 Method of multipliers dual update step optimality conditions (for differentiable f): (primal and dual feasibility) since x k+1 minimizes L ρ (x,y k ) Ax b =, f(x )+A T y = = x L ρ (x k+1,y k ) = x f(x k+1 )+A T ( y k +ρ(ax k+1 b) ) = x f(x k+1 )+A T y k+1 dual update y k+1 = y k +ρ(x k+1 b) makes (x k+1,y k+1 ) dual feasible primal feasibility achieved in limit: Ax k+1 b Method of multipliers 11

12 Method of multipliers (compared to dual decomposition) good news: converges under much more relaxed conditions (f can be nondifferentiable, take on value +,...) bad news: quadratic penalty destroys splitting of the x-update, so can t do decomposition Method of multipliers 12

13 Outline Dual decomposition Method of multipliers Alternating direction method of multipliers Common patterns Examples Consensus and exchange Conclusions Alternating direction method of multipliers 13

14 Alternating direction method of multipliers a method with good robustness of method of multipliers which can support decomposition robust dual decomposition or decomposable method of multipliers proposed by Gabay, Mercier, Glowinski, Marrocco in 1976 Alternating direction method of multipliers 14

15 Alternating direction method of multipliers ADMM problem form (with f, g convex) minimize f(x) + g(z) subject to Ax+Bz = c two sets of variables, with separable objective L ρ (x,z,y) = f(x)+g(z)+y T (Ax+Bz c)+(ρ/2) Ax+Bz c 2 2 ADMM: x k+1 := argmin x L ρ (x,z k,y k ) // x-minimization z k+1 := argmin z L ρ (x k+1,z,y k ) // z-minimization y k+1 := y k +ρ(ax k+1 +Bz k+1 c) // dual update Alternating direction method of multipliers 15

16 Alternating direction method of multipliers if we minimized over x and z jointly, reduces to method of multipliers instead, we do one pass of a Gauss-Seidel method we get splitting since we minimize over x with z fixed, and vice versa Alternating direction method of multipliers 16

17 ADMM and optimality conditions optimality conditions (for differentiable case): primal feasibility: Ax+Bz c = dual feasibility: f(x)+a T y =, g(z)+b T y = since z k+1 minimizes L ρ (x k+1,z,y k ) we have = g(z k+1 )+B T y k +ρb T (Ax k+1 +Bz k+1 c) = g(z k+1 )+B T y k+1 so with ADMM dual variable update, (x k+1,z k+1,y k+1 ) satisfies second dual feasibility condition primal and first dual feasibility are achieved as k Alternating direction method of multipliers 17

18 ADMM with scaled dual variables combine linear and quadratic terms in augmented Lagrangian L ρ (x,z,y) = f(x)+g(z)+y T (Ax+Bz c)+(ρ/2) Ax+Bz c 2 2 = f(x)+g(z)+(ρ/2) Ax+Bz c+u 2 2 +const. with u k = (1/ρ)y k ADMM (scaled dual form): x k+1 ( := argmin f(x)+(ρ/2) Ax+Bz k c+u k 2 2) x z k+1 ( := argmin g(z)+(ρ/2) Ax k+1 +Bz c+u k 2 2) z u k+1 := u k +(Ax k+1 +Bz k+1 c) Alternating direction method of multipliers 18

19 Convergence assume (very little!) f, g convex, closed, proper L has a saddle point then ADMM converges: iterates approach feasibility: Ax k +Bz k c objective approaches optimal value: f(x k )+g(z k ) p Alternating direction method of multipliers 19

20 Related algorithms operator splitting methods (Douglas, Peaceman, Rachford, Lions, Mercier, s, 1979) proximal point algorithm (Rockafellar 1976) Dykstra s alternating projections algorithm (1983) Spingarn s method of partial inverses (1985) Rockafellar-Wets progressive hedging (1991) proximal methods (Rockafellar, many others, 1976 present) Bregman iterative methods (28 present) most of these are special cases of the proximal point algorithm Alternating direction method of multipliers 2

21 Outline Dual decomposition Method of multipliers Alternating direction method of multipliers Common patterns Examples Consensus and exchange Conclusions Common patterns 21

22 Common patterns x-update step requires minimizing f(x)+(ρ/2) Ax v 2 2 (with v = Bz k c+u k, which is constant during x-update) similar for z-update several special cases come up often can simplify update by exploiting structure in these cases Common patterns 22

23 Decomposition suppose f is block-separable, f(x) = f 1 (x 1 )+ +f N (x N ), x = (x 1,...,x N ) A is conformably block separable: A T A is block diagonal then x-update splits into N parallel updates of x i Common patterns 23

24 Proximal operator consider x-update when A = I some special cases: x + ( ) = argmin f(x)+(ρ/2) x v 2 2 = proxf,ρ (v) x f = I C (indicator fct. of set C) x + := Π C (v) (projection onto C) f = λ 1 (l 1 norm) x + i := S λ/ρ (v i ) (soft thresholding) (S a (v) = (v a) + ( v a) + ) Common patterns 24

25 Quadratic objective f(x) = (1/2)x T Px+q T x+r x + := (P +ρa T A) 1 (ρa T v q) use matrix inversion lemma when computationally advantageous (P +ρa T A) 1 = P 1 ρp 1 A T (I +ρap 1 A T ) 1 AP 1 (direct method) cache factorization of P +ρa T A (or I +ρap 1 A T ) (iterative method) warm start, early stopping, reducing tolerances Common patterns 25

26 Smooth objective f smooth can use standard methods for smooth minimization gradient, Newton, or quasi-newton preconditionned CG, limited-memory BFGS (scale to very large problems) can exploit warm start early stopping, with tolerances decreasing as ADMM proceeds Common patterns 26

27 Outline Dual decomposition Method of multipliers Alternating direction method of multipliers Common patterns Examples Consensus and exchange Conclusions Examples 27

28 Constrained convex optimization consider ADMM for generic problem minimize f(x) subject to x C ADMM form: take g to be indicator of C algorithm: minimize f(x) + g(z) subject to x z = x k+1 ( := argmin f(x)+(ρ/2) x z k +u k 2 2) x z k+1 := Π C (x k+1 +u k ) u k+1 := u k +x k+1 z k+1 Examples 28

29 Lasso lasso problem: minimize (1/2) Ax b 2 2 +λ x 1 ADMM form: minimize (1/2) Ax b 2 2 +λ z 1 subject to x z = ADMM: x k+1 := (A T A+ρI) 1 (A T b+ρz k y k ) z k+1 := S λ/ρ (x k+1 +y k /ρ) y k+1 := y k +ρ(x k+1 z k+1 ) Examples 29

30 Lasso example example with dense A R 15 5 (15 measurements; 5 regressors) computation times factorization (same as ridge regression) 1.3s subsequent ADMM iterations.3s lasso solve (about 5 ADMM iterations) 2.9s full regularization path (3 λ s) 4.4s not bad for a very short Matlab script Examples 3

31 Sparse inverse covariance selection S: empirical covariance of samples from N(,Σ), with Σ 1 sparse (i.e., Gaussian Markov random field) estimate Σ 1 via l 1 regularized maximum likelihood minimize Tr(SX) logdetx +λ X 1 methods: COVSEL (Banerjee et al 28), graphical lasso (FHT 28) Examples 31

32 Sparse inverse covariance selection via ADMM ADMM form: minimize Tr(SX) logdetx +λ Z 1 subject to X Z = ADMM: X k+1 ( := argmin Tr(SX) logdetx +(ρ/2) X Z k +U k 2 F) X Z k+1 := S λ/ρ (X k+1 +U k ) U k+1 := U k +(X k+1 Z k+1 ) Examples 32

33 Analytical solution for X-update compute eigendecomposition ρ(z k U k ) S = QΛQ T form diagonal matrix X with let X k+1 := Q XQ T X ii = λ i + λ 2 i +4ρ 2ρ cost of X-update is an eigendecomposition Examples 33

34 Sparse inverse covariance selection example Σ 1 is 1 1 with 1 4 nonzeros graphical lasso (Fortran): 2 seconds 3 minutes ADMM (Matlab): 3 1 minutes (depends on choice of λ) very rough experiment, but with no special tuning, ADMM is in ballpark of recent specialized methods (for comparison, COVSEL takes 25+ min when Σ 1 is a 4 4 tridiagonal matrix) Examples 34

35 Outline Dual decomposition Method of multipliers Alternating direction method of multipliers Common patterns Examples Consensus and exchange Conclusions Consensus and exchange 35

36 Consensus optimization want to solve problem with N objective terms minimize N i=1 f i(x) e.g., f i is the loss function for ith block of training data ADMM form: N minimize i=1 f i(x i ) subject to x i z = x i are local variables z is the global variable x i z = are consistency or consensus constraints can add regularization using a g(z) term Consensus and exchange 36

37 Consensus optimization via ADMM L ρ (x,z,y) = N i=1( fi (x i )+yi T(x ) i z)+(ρ/2) x i z 2 2 ADMM: x k+1 i ( := argmin fi (x i )+yi kt (x i z k )+(ρ/2) x i z k 2 ) 2 x i N z k+1 := 1 N i=1 ( x k+1 i +(1/ρ)yi k ) y k+1 i := y k i +ρ(x k+1 i z k+1 ) with regularization, averaging in z update is followed by prox g,ρ Consensus and exchange 37

38 Consensus optimization via ADMM using N i=1 yk i x k+1 i =, algorithm simplifies to := argmin x i y k+1 i := y k i +ρ(x k+1 i x k+1 ) where x k = (1/N) N i=1 xk i ( fi (x i )+yi kt (x i x k )+(ρ/2) x i x k 2 ) 2 in each iteration gather x k i and average to get x k scatter the average x k to processors update y k i locally (in each processor, in parallel) update x i locally Consensus and exchange 38

39 Statistical interpretation f i is negative log-likelihood for parameter x given ith data block x k+1 i is MAP estimate under prior N(x k +(1/ρ)y k i,ρi) prior mean is previous iteration s consensus shifted by price of processor i disagreeing with previous consensus processors only need to support a Gaussian MAP method type or number of data in each block not relevant consensus protocol yields global maximum-likelihood estimate Consensus and exchange 39

40 Consensus classification data (examples) (a i,b i ), i = 1,...,N, a i R n, b i { 1,+1} linear classifier sign(a T w +v), with weight w, offset v margin for ith example is b i (a T i w+v); want margin to be positive loss for ith example is l(b i (a T i w +v)) l is loss function (hinge, logistic, probit, exponential,...) choose w, v to minimize 1 N N i=1 l(b i(a T i w+v))+r(w) r(w) is regularization term (l 2, l 1,...) split data and use ADMM consensus to solve Consensus and exchange 4

41 Consensus SVM example hinge loss l(u) = (1 u) + with l 2 regularization baby problem with n = 2, N = 4 to illustrate examples split into 2 groups, in worst possible way: each group contains only positive or negative examples Consensus and exchange 41

42 Iteration Consensus and exchange 42

43 Iteration Consensus and exchange 43

44 Iteration Consensus and exchange 44

45 Distributed lasso example example with dense A R 4 8 (roughly 3 GB of data) distributed solver written in C using MPI and GSL no optimization or tuned libraries (like ATLAS, MKL) split into 8 subsystems across 1 (8-core) machines on Amazon EC2 computation times loading data factorization subsequent ADMM iterations 3s 5m.5 2s lasso solve (about 15 ADMM iterations) 5 6m Consensus and exchange 45

46 Exchange problem N minimize i=1 f i(x i ) subject to N i=1 x i = another canonical problem, like consensus in fact, it s the dual of consensus can interpret as N agents exchanging n goods to minimize a total cost (x i ) j means agent i receives (x i ) j of good j from exchange (x i ) j < means agent i contributes (x i ) j of good j to exchange constraint N i=1 x i = is equilibrium or market clearing constraint optimal dual variable y is a set of valid prices for the goods suggests real or virtual cash payment (y ) T x i by agent i Consensus and exchange 46

47 Exchange ADMM solve as a generic constrained convex problem with constraint set scaled form: x k+1 i unscaled form: x k+1 i C = {x R nn x 1 +x 2 + +x N = } ( := argmin fi (x i )+(ρ/2) x i x k i +x k +u k 2 ) 2 x i u k+1 := u k +x k+1 ( := argmin fi (x i )+y kt x i +(ρ/2) x i (x k i x k ) 2 ) 2 x i y k+1 := y k +ρx k+1 Consensus and exchange 47

48 Interpretation as tâtonnement process tâtonnement process: iteratively update prices to clear market work towards equilibrium by increasing/decreasing prices of goods based on excess demand/supply dual decomposition is the simplest tâtonnement algorithm ADMM adds proximal regularization incorporate agents prior commitment to help clear market convergence far more robust convergence than dual decomposition Consensus and exchange 48

49 Distributed dynamic energy management N devices exchange power in time periods t = 1,...,T x i R T is power flow profile for device i f i (x i ) is cost of profile x i (and encodes constraints) x 1 + +x N = is energy balance (in each time period) dynamic energy management problem is exchange problem exchange ADMM gives distributed method for dynamic energy management each device optimizes its own profile, with quadratic regularization for coordination residual (energy imbalance) is driven to zero Consensus and exchange 49

50 Generators x t t 3 example generators left: generator costs/limits; right: ramp constraints can add cost for power changes Consensus and exchange 5

51 Fixed loads t 2 example fixed loads cost is + for not supplying load; zero otherwise Consensus and exchange 51

52 Shiftable load t total energy consumed over an interval must exceed given minimum level limits on energy consumed in each period cost is + for violating constraints; zero otherwise Consensus and exchange 52

53 Battery energy storage system t energy store with maximum capacity, charge/discharge limits black: battery charge, red: charge/discharge profile cost is + for violating constraints; zero otherwise Consensus and exchange 53

54 Electric vehicle charging system t black: desired charge profile; blue: charge profile shortfall cost for not meeting desired charge Consensus and exchange 54

55 HVAC t thermal load (e.g., room, refrigerator) with temperature limits magenta: ambient temperature; blue: load temperature red: cooling energy profile cost is + for violating constraints; zero otherwise Consensus and exchange 55

56 External tie t buy/sell energy from/to external grid at price p ext (t)±γ(t) solid: p ext (t); dashed: p ext (t)±γ(t) Consensus and exchange 56

57 Smart grid example 1 devices 3 generators 2 fixed loads 1 shiftable load 1 EV charging systems 1 battery 1 HVAC system 1 external tie Consensus and exchange 57

58 Convergence iteration: k = t t left: solid: optimal generator profile, dashed: profile at kth iteration right: residual vector x k Consensus and exchange 58

59 Convergence iteration: k = t t left: solid: optimal generator profile, dashed: profile at kth iteration right: residual vector x k Consensus and exchange 58

60 Convergence iteration: k = t t left: solid: optimal generator profile, dashed: profile at kth iteration right: residual vector x k Consensus and exchange 58

61 Convergence iteration: k = t t left: solid: optimal generator profile, dashed: profile at kth iteration right: residual vector x k Consensus and exchange 58

62 Convergence iteration: k = t t left: solid: optimal generator profile, dashed: profile at kth iteration right: residual vector x k Consensus and exchange 58

63 Convergence iteration: k = t t left: solid: optimal generator profile, dashed: profile at kth iteration right: residual vector x k Consensus and exchange 58

64 Convergence iteration: k = t t left: solid: optimal generator profile, dashed: profile at kth iteration right: residual vector x k Consensus and exchange 58

65 Convergence iteration: k = t t left: solid: optimal generator profile, dashed: profile at kth iteration right: residual vector x k Consensus and exchange 58

66 Convergence iteration: k = t t left: solid: optimal generator profile, dashed: profile at kth iteration right: residual vector x k Consensus and exchange 58

67 Convergence iteration: k = t t left: solid: optimal generator profile, dashed: profile at kth iteration right: residual vector x k Consensus and exchange 58

68 Convergence iteration: k = t t left: solid: optimal generator profile, dashed: profile at kth iteration right: residual vector x k Consensus and exchange 58

69 Convergence iteration: k = t t left: solid: optimal generator profile, dashed: profile at kth iteration right: residual vector x k Consensus and exchange 58

70 Outline Dual decomposition Method of multipliers Alternating direction method of multipliers Common patterns Examples Consensus and exchange Conclusions Conclusions 59

71 Summary and conclusions ADMM is the same as, or closely related to, many methods with other names has been around since the 197s gives simple single-processor algorithms that can be competitive with state-of-the-art can be used to coordinate many processors, each solving a substantial problem, to solve a very large problem Conclusions 6

Alternating Direction Method of Multipliers

Alternating Direction Method of Multipliers CS 584: Big Data Analytics Material adapted from Stephen Boyd (https://web.stanford.edu/~boyd/papers/pdf/admm_slides.pdf) & Ryan Tibshirani (http://stat.cmu.edu/~ryantibs/convexopt/lectures/21-dual-meth.pdf)