The Alternating Direction Method of Multipliers

Customizable Software Solver Package
Peter Sutor, Jr.
Project Advisor: Professor Tom Goldstein
April 27, 2016

Background: The Dual Problem

Consider the following (primal) problem: min_x f(x) subject to Ax = b. Its important components are:

1. The Lagrangian: L(x, y) = f(x) + y^T(Ax - b). We refer to the original variable x as the primal variable and to y as the dual variable.
2. The dual function: g(y) = inf_x L(x, y). A function of the dual variable alone; it gives a lower bound on the optimal objective value.
3. The dual problem: max_y g(y), the problem of finding the best such lower bound.

End goal: recover x* = argmin_x L(x, y*), where x* and y* are the optimal primal and dual points.
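
To make the dual function concrete, here is a small numerical sketch (not from the slides; the quadratic objective, variable names, and data are illustrative). For f(x) = (1/2)||x||_2^2 the dual function has the closed form g(y) = -(1/2)||A^T y||_2^2 - b^T y, and maximizing it recovers the minimum-norm solution of Ax = b:

    import numpy as np

    # Toy primal problem: minimize (1/2)||x||^2 subject to Ax = b.
    # Minimizing the Lagrangian over x gives x = -A^T y, so the dual function is
    #   g(y) = -(1/2)||A^T y||^2 - b^T y.
    rng = np.random.default_rng(0)
    A = rng.standard_normal((3, 6))
    b = rng.standard_normal(3)

    # Dual problem: maximize g(y). Setting grad g = -A A^T y - b = 0 gives y*.
    y_star = -np.linalg.solve(A @ A.T, b)

    # Recover the primal solution from the dual one: x* = argmin_x L(x, y*).
    x_star = -A.T @ y_star

    g_val = -0.5 * np.sum((A.T @ y_star) ** 2) - b @ y_star   # dual optimal value
    p_val = 0.5 * np.sum(x_star ** 2)                         # primal optimal value

    print(np.allclose(A @ x_star, b))   # True: constraint satisfied
    print(np.isclose(g_val, p_val))     # True: the lower bound is tight here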

Background: The Alternating Direction Method of Multipliers (ADMM)

ADMM combines the robustness of the Method of Multipliers with the decomposability of Dual Decomposition, so parallel x-updates are possible. Problem form (with f and g both convex):

    min f(x) + g(z) subject to Ax + Bz = c.

The objective is separable into two sets of variables. ADMM uses an augmented Lagrangian to enable decomposition. With r = Ax + Bz - c and the scaled dual variable u = y/ρ:

    L_ρ(x, z, y) = f(x) + g(z) + y^T r + (ρ/2)||r||_2^2
                 = f(x) + g(z) + (ρ/2)||r + u||_2^2 + const = L_ρ(x, z, u)
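
The "const" term comes from completing the square in r; spelled out (this step is standard and is not on the slide):

    y^T r + (ρ/2)||r||_2^2 = (ρ/2)||r + y/ρ||_2^2 - (1/(2ρ))||y||_2^2
                           = (ρ/2)||r + u||_2^2 - (ρ/2)||u||_2^2,

so the constant is -(ρ/2)||u||_2^2, which does not depend on x or z and therefore does not affect the x- and z-minimizations.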

Background: ADMM Algorithm

Repeat for k = 0, 1, 2, ... up to a specified iteration limit, or until convergence:

1. x^(k+1) := argmin_x L_ρ(x, z^(k), u^(k))
2. z^(k+1) := argmin_z L_ρ(x^(k+1), z, u^(k))
3. u^(k+1) := u^(k) + (Ax^(k+1) + Bz^(k+1) - c)

Recall the proximal operator (with v = Bz^(k) - c + u^(k)):

    prox_{f,ρ}(v) := argmin_x ( f(x) + (ρ/2)||Ax + v||_2^2 )

If g(z) = λ||z||_1, then prox_{g,ρ}(v) is computed by soft-thresholding (with v = Ax^(k+1) - c + u^(k)):

    z_i^(k+1) := sign(v_i) (|v_i| - λ/ρ)_+
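
As a concrete illustration, here is a minimal sketch of the scaled-form ADMM loop above (not the project's library code; the function names and the A = I, B = -I, c = 0 problem shape are assumptions), using soft-thresholding for g(z) = λ||z||_1:

    import numpy as np

    def soft_threshold(v, kappa):
        # Elementwise prox of kappa*||.||_1 (soft-thresholding).
        return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

    def admm_l1(prox_f, lam, rho, n, iters=200):
        """Scaled ADMM for min f(x) + lam*||z||_1 subject to x - z = 0.

        prox_f(v, rho) should return argmin_x f(x) + (rho/2)*||x - v||^2.
        """
        x = np.zeros(n)
        z = np.zeros(n)
        u = np.zeros(n)
        for _ in range(iters):
            x = prox_f(z - u, rho)                  # x-update
            z = soft_threshold(x + u, lam / rho)    # z-update (soft-thresholding)
            u = u + (x - z)                         # scaled dual update
        return x, z, u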

Progress on ADMM Library: In This Project

Our goal is to make ADMM easier to use in practice. Getting the most out of ADMM means tuning, for each specific problem, parameters such as the step size ρ, the starting values for x and z, and efficient proximal operators. We want a comprehensive library for general ADMM use, with:

- Generalized ADMM functionality (with customizable options).
- Adaptive step-size selection.
- Ready-to-go, optimized functions for the problems ADMM is most often used for (with customizable options).
- High-performance computing capabilities (MPI).
- Implementations in Python and Matlab.

Progress on ADMM Library: Prior Progress

1. Created a fully customizable general ADMM function:
   - Convergence checking of proximal operators.
   - Multiple types of stopping conditions.
   - Over/under-relaxation.
   - Complete run-time information.
   - Accelerated and Fast ADMM.
2. Created a library of solvers for problems ADMM is most used for:
   - Constrained convex optimization: linear and quadratic programming.
   - l1-norm problems: least absolute deviations, Huber fitting, and basis pursuit.
   - l1 regularization: linear SVMs, LASSO, TVM, sparse inverse covariance selection, logistic regression.

Progress on ADMM Library: Prior Progress (continued)

3. Testing and validation software for ADMM and the solvers:
   - For ADMM: a general solver (simple quadratic model) to test on.
   - For the solvers: tester functions that set up random problems with known solutions and solve them, plus a batch tester that runs solvers over a problem-size scaling function.
4. Adaptive step sizes: tried several interpolation + 1D least-squares methods:
   1. On He and Yuan's w = [x^T, z^T, u^T]^T values.
   2. On Esser's φ- and ψ-based residuals.
   Step sizes tended to explode.

Progress on ADMM Library: Further Progress

File organization and setup routines:
- With solvers, testers, and other files, about 30 programs.
- Organized into subfolders containing solvers, testers, and examples.
- A routine automatically sets up paths no matter which file is run.

Code restructuring:
- Streamlined solver code.
- Added alternative algorithms to some solvers.
- Prepped all code for parallel implementation.
- Implemented local parallel capabilities in ADMM to use all cores efficiently.

Parallel Updates: Decomposition in ADMM

Suppose the function f is separable in x = (x_1, ..., x_n)^T; then

    f(x) = f_{n_1}(x_{n_1}) + ... + f_{n_m}(x_{n_m}),   x = (x_{n_1}, ..., x_{n_m}),   with sum_{i=1}^m n_i = n,

and the proximal operator for f decomposes accordingly. Thus the x-minimization step in ADMM splits into m separate minimizations that can be carried out in parallel:

    x_{n_i}^(k+1) := prox_{f_{n_i},ρ}(x_{n_i}, z, u)

Parallel Updates: Parallelizing Updates

We can use this observation to parallelize x-updates in ADMM, and there is no reason the same cannot be done for g. We often use simple z-updates (soft-thresholding, projections) that can be applied component-wise or block-component-wise.

For the u-update, u^(k+1) := u^(k) + (Ax^(k+1) + Bz^(k+1) - c), we can compute x̂ = Ax^(k+1) and ẑ = Bz^(k+1) by a similar parallel computation, and then clearly update u^(k+1) component-wise:

    u_i^(k+1) := u_i^(k) + (x̂_i + ẑ_i - c_i)

Note that the update chunks n_i can differ between x, z, and u.

Parallel Updates: Implementing Parallel Updates

Interpret user-provided proximal operators for f or g as component proximal operators:
- Normally the user supplies a function proxf(x,z,u,rho); this changes to proxf(x,z,u,i), with vector variables passed by reference.
- Instead of looping over i, distribute the workload to workers (processors) in each update.
- The user provides slices (n_1, ..., n_m) for every update they wish to parallelize, as acknowledgement that Parallel ADMM should be performed.

In Matlab, all of this is easy to do for local parallel processes: use a parfor loop over i in each parallelized update (x or z); most matrix/vector operations are already distributed among workers. A Python analogue is sketched below.
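
A minimal sketch of the same idea in Python (not the library's actual interface; the slice-based prox signature, the l1 example, and the pool size are assumptions): each worker applies a component proximal operator to its slice, and the results are reassembled.

    import numpy as np
    from multiprocessing import Pool

    def prox_g_slice(args):
        # Component prox for g(z) = lam*||z||_1 on one slice: soft-thresholding.
        v_slice, kappa = args
        return np.sign(v_slice) * np.maximum(np.abs(v_slice) - kappa, 0.0)

    def parallel_z_update(v, slices, lam, rho, workers=4):
        # Distribute each slice of v = x + u to a worker, then reassemble z.
        with Pool(workers) as pool:
            parts = pool.map(prox_g_slice, [(v[s], lam / rho) for s in slices])
        z = np.empty_like(v)
        for s, part in zip(slices, parts):
            z[s] = part
        return z

    # Example: split a length-1000 vector into 4 slices of 250 components each.
    # slices = [slice(250 * i, 250 * (i + 1)) for i in range(4)]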

LASSO Using Parallel ADMM: Example, the LASSO Problem

Standard LASSO formulation:

    min_x (1/2)||Dx - s||_2^2 + λ||x||_1

ADMM form: min f(x) + g(z) subject to x - z = 0, where f(x) = (1/2)||Dx - s||_2^2 and g(z) = λ||z||_1, with

    L_ρ(x, z, u) = f(x) + g(z) + (ρ/2)||x - z + u||_2^2 + const(u).

The proximal operator for f returns the x such that

    ∇_x L_ρ(x, z, u) = D^T(Dx - s) + ρ(x - z + u) = 0,

giving the update step

    x := (D^T D + ρI)^{-1} (D^T s + ρ(z - u)).

The z-update can be done in parallel (soft-thresholding). What about the x-update?
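
A minimal serial sketch of this LASSO solver (illustrative only, not the library's implementation; function and variable names are assumptions), caching a Cholesky factorization of D^T D + ρI so the x-update is a cheap triangular solve at each iteration:

    import numpy as np
    from scipy.linalg import cho_factor, cho_solve

    def lasso_admm(D, s, lam, rho=1.0, iters=500):
        # min (1/2)||Dx - s||^2 + lam*||x||_1 via ADMM on x - z = 0.
        n = D.shape[1]
        x = np.zeros(n); z = np.zeros(n); u = np.zeros(n)
        chol = cho_factor(D.T @ D + rho * np.eye(n))  # cached factorization
        DTs = D.T @ s
        for _ in range(iters):
            x = cho_solve(chol, DTs + rho * (z - u))                       # x-update
            z = np.sign(x + u) * np.maximum(np.abs(x + u) - lam / rho, 0)  # z-update
            u = u + (x - z)                                                # dual update
        return z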

LASSO Using Parallel ADMM: Parallel LASSO

Slice the rows of D and s into chunks {D_{n_i}} and {s_{n_i}}; each chunk gets its own x-update:

    x_{n_i} := (D_{n_i}^T D_{n_i} + ρI)^{-1} (D_{n_i}^T s_{n_i} + ρ(z - u_{n_i}))

Consensus update: z = S_{λ/(ρN)}(x̄ + ū), where N is the number of chunks and x̄, ū are averages over the chunks.

In both serial and parallel LASSO, cache the Cholesky factorization of the matrix to be inverted and reuse it to solve the linear system at every update. Parallel preprocessing: each chunk's factorization must be computed and stored. Solution: add a parameter to the options struct, options.preprocess, a function handle to a local preprocessing function in the user's program.
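
A sketch of the consensus form above (again illustrative; the chunk list, N, and variable names are assumptions), run serially here so the logic is easy to follow. Each chunk's factorization is computed once up front, which is the role options.preprocess plays in the library; the inner loop over chunks is the part that would be parallelized.

    import numpy as np
    from scipy.linalg import cho_factor, cho_solve

    def consensus_lasso(chunks, lam, rho=1.0, iters=500):
        # chunks: list of (D_i, s_i) row-blocks of D and s.
        N = len(chunks)
        n = chunks[0][0].shape[1]
        # Preprocessing: factor each chunk's D_i^T D_i + rho*I once and cache it.
        chols = [cho_factor(D.T @ D + rho * np.eye(n)) for D, _ in chunks]
        DTs = [D.T @ s for D, s in chunks]
        X = np.zeros((N, n)); U = np.zeros((N, n)); z = np.zeros(n)
        for _ in range(iters):
            for i in range(N):                                    # parallelizable loop
                X[i] = cho_solve(chols[i], DTs[i] + rho * (z - U[i]))
            v = X.mean(axis=0) + U.mean(axis=0)
            z = np.sign(v) * np.maximum(np.abs(v) - lam / (rho * N), 0)  # consensus z
            U += X - z                                             # per-chunk dual update
        return z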

LASSO Using Parallel ADMM: LASSO Serial vs. Parallel, 2^11 Rows, 2^3 Columns
[Plot not reproduced in the transcription.]

LASSO Using Parallel ADMM: LASSO Serial vs. Parallel, Thorough Test
[Plot not reproduced in the transcription.]

Unwrapped ADMM: Transpose Reduction

We want a more efficient parallel x-update for a tall, skinny matrix D, which is the typical case. Note that

    (1/2)||Dx - s||_2^2 = (1/2) x^T (D^T D) x - x^T D^T s + (1/2)||s||_2^2,

so a central server needs only D^T D and D^T s. For tall, large D, D^T D has far fewer entries than D. Moreover, slicing D and s by rows,

    D^T D = sum_i D_i^T D_i   and   D^T s = sum_i D_i^T s_i,

so each server need only compute its local components, which are then aggregated on the central server. Once D^T D and D^T s are computed, solve with ADMM.
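
A tiny sketch of the aggregation step (illustrative; the chunk list and function names are assumptions): each worker forms its local D_i^T D_i and D_i^T s_i, and only these small products are sent to the central server.

    import numpy as np

    def local_products(D_i, s_i):
        # Computed on each worker: only an n-by-n matrix and an n-vector leave the machine.
        return D_i.T @ D_i, D_i.T @ s_i

    def aggregate(chunks):
        # Central server: sum the local products to obtain D^T D and D^T s.
        gram_parts, rhs_parts = zip(*(local_products(D_i, s_i) for D_i, s_i in chunks))
        return sum(gram_parts), sum(rhs_parts)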

Unwrapped ADMM: Unwrapped ADMM

Problem statement: min_z g(z) subject to z = Dx. ADMM form: min f(x) + g(z) subject to Dx - z = 0, where f(x) = 0 and g(z) is unchanged. Define the pseudoinverse of D as D^+ = (D^T D)^{-1} D^T. Since f(x) = 0, the proximal operator for f is simply the x such that

    ∇_x (ρ/2)||Dx - z + u||_2^2 = ρ D^T (Dx - z + u) = 0,

which gives x = (D^T D)^{-1} D^T (z - u) = D^+ (z - u). Cache D^+. For a separable function g, the z-update can be parallelized. Can we parallelize the x-update?

Unwrapped ADMM: Unwrapped ADMM with Transpose Reduction

Exploit transpose reduction: slice D by rows into D = [D_1^T ... D_N^T]^T. Then the update becomes

    x = D^+ (z - u) = W sum_i D_i^T (z_i - u_i),   where W = (sum_i D_i^T D_i)^{-1}.

Note that for skinny D, W is very small and the linear system solve is much cheaper. In a distributed setting, we can:

1. Store D_i^T D_i, D_i^T, z_i and u_i on each machine.
2. Have the central server compute and cache W.
3. Have the central server add up d_i = D_i^T (z_i - u_i) into the sum d and compute Wd.

In a local, parallel setting, we can do everything the distributed version does, but locally, computing the summations in parallel.
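
A sketch of this x-update (illustrative names; a serial loop stands in for the per-machine work, and D is assumed to have full column rank): W is computed and cached once, and each iteration only needs the small partial sums d_i.

    import numpy as np

    def cache_W(D_chunks):
        # One-time cost: W = (sum_i D_i^T D_i)^{-1}, small (n-by-n) when D is skinny.
        G = sum(D.T @ D for D in D_chunks)
        return np.linalg.inv(G)

    def x_update(W, D_chunks, z_chunks, u_chunks):
        # x = W * sum_i D_i^T (z_i - u_i); each term can be formed on its own worker.
        d = sum(D.T @ (z - u) for D, z, u in zip(D_chunks, z_chunks, u_chunks))
        return W @ d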

Linear SVM Using Unwrapped ADMM and Transpose Reduction: Example, Linear SVMs

General form: min_x (1/2)||x||^2 + C h(Dx), with C a regularization parameter, D the training data, and l the training labels. The hinge loss is h(z) = sum_{k=1}^M max(1 - l_k z_k, 0). Unwrapped ADMM can solve this problem, even for the 0-1 loss. With v = l ∘ (Dx + u) (elementwise product):

    Hinge loss:  z^(k+1) = Dx + u + l ∘ max(min(1 - v, C/ρ), 0)
    0-1 loss:    z^(k+1) = l ∘ I(v ≥ 1 or v < 1 - 2C/ρ)

Both parallel and serial Unwrapped ADMM can be used, since the z-update is a component-wise computation. Parallel sums and preprocessing are performed via options.preprocess.
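
A sketch of the hinge-loss z-update above (illustrative, not the library's code; it is component-wise, so it parallelizes trivially over slices of the data):

    import numpy as np

    def svm_hinge_z_update(Dx, u, labels, C, rho):
        # Prox of C * hinge loss at w = Dx + u (scaled ADMM), per the formula above:
        #   z = w + l * max(min(1 - v, C/rho), 0), with v = l * w elementwise.
        w = Dx + u
        v = labels * w
        return w + labels * np.maximum(np.minimum(1.0 - v, C / rho), 0.0)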

Linear SVM Using Unwrapped ADMM and Transpose Reduction: Solution, Serial vs. Parallel SVM for 2^9 Rows
[Plot not reproduced in the transcription.]

Linear SVM Using Unwrapped ADMM and Transpose Reduction: Serial vs. Parallel, Thorough Test
[Plot not reproduced in the transcription.]

Linear SVM Using Unwrapped ADMM and Transpose Reduction: Serial vs. Parallel, Thorough Test (continued)
[Plot not reproduced in the transcription.]

Final Stretch

- Finish code restructuring on about 4 solvers.
- Add parallel versions of a few more solvers.
- Test and validate everything using the testers.
- Documentation.
- Write the final report.

Future Work: High Performance Computing

We have an efficient parallel implementation of ADMM that can take full advantage of all cores on a machine. We would like a distributed version:
- Parallel ADMM allows for distributed computing: distribute to many machines, then use all cores with clever parfor usage.
- Optimize my solvers for big data.
- Looking into MatlabMPI to do this: distributed computing via MPI-like behavior, with the potential to completely automate ADMM usage for big data.

Future Work: Adaptive Step Sizes

Previous attempts at adaptive step sizes suffered from step-size blowup. Restarting when a blowup is detected negates this and tends to improve convergence regardless of the starting step size. Drawback: no theoretical support for this heuristic. Future work: adaptive step sizes with strong theoretical support and better results.

Future Work: A Better Software Library

- Create general solvers and group more specific ones under them: this streamlines the code, and general solvers are more useful for users.
- More solvers: solvers for consensus and sharing problems, and more distributed solvers for big data.
- More user-friendliness, examples, and documentation.
- We want this to be a base for future ADMM research.

References

S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, "Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers," Foundations and Trends in Machine Learning, vol. 3, no. 1, pp. 1-122, 2010.

T. Goldstein, G. Taylor, K. Barabin, and K. Sayre, "Unwrapping ADMM: Efficient Distributed Computing via Transpose Reduction," CoRR, vol. abs/1504.02147, 2015.

E. Esser, "Applications of Lagrangian-Based Alternating Direction Methods and Connections to Split Bregman," April 2009.

B. He and X. Yuan, "On non-ergodic convergence rate of Douglas-Rachford alternating direction method of multipliers," Numerische Mathematik, vol. 130, iss. 3, pp. 567-577, 2014.

H. Everett, "Generalized Lagrange multiplier method for solving problems of optimum allocation of resources," Operations Research, vol. 11, no. 3, pp. 399-417, 1963.

The End

Thank you! Any questions?