
The Alternating Direction Method of Multipliers with Adaptive Step Size Selection
Peter Sutor, Jr.
Project Advisor: Professor Tom Goldstein
October 8, 2015

Introduction: Presentation Outline

1. Convex Optimization for Large Datasets
2. Background Information
3. The Alternating Direction Method of Multipliers (ADMM)
4. Using ADMM to Solve Problems
5. Project Description
6. Adaptive Step Size Selection
7. Testing and Validation
8. Project Timeline and Deliverables

Introduction: The General Problem

Convex optimization: we wish to find an optimal $x^* \in X$ such that
$$f(x^*) = \min \{ f(x) : x \in X \},$$
where $X \subseteq \mathbb{R}^n$ is called the feasible set and $f : \mathbb{R}^n \to \mathbb{R}$ is the objective function.
- The objective function $f$ is convex on $\mathbb{R}^n$.
- The feasible set $X$ is a closed convex set.

Large-scale optimization: huge data sets, for which traditional minimization techniques may be too slow, and decentralized optimization.

Introduction: How Can ADMM Help?

We need robust methods for:
- Arbitrary-scale optimization.
- Decentralized optimization.

The Alternating Direction Method of Multipliers (ADMM):
- Solves convex optimization problems by splitting them into smaller, easier-to-handle pieces.
- Can solve these pieces in parallel.
- Is robust, and handles the forms of optimization we want.

The Dual Ascent Method: The Dual Problem

Consider the following (primal) problem:
$$\min_x f(x) \quad \text{subject to} \quad Ax = b.$$

Important components of this problem:
1. The Lagrangian: $L(x, y) = f(x) + y^T(Ax - b)$. We refer to the original $x$ variable as the primal variable and to $y$ as the dual variable.
2. The dual function: $g(y) = \inf_x L(x, y)$, a new function purely of the dual variable, which gives a lower bound on the objective value.
3. The dual problem: $\max_y g(y)$, the problem of finding the best lower bound.

End goal: recover $x^* = \arg\min_x L(x, y^*)$, where $x^*$ and $y^*$ are the corresponding optimizers.

The Dual Ascent Method (DAM)

DAM is a gradient-type method that solves the dual problem, characterized by the iteration
$$y^{(k+1)} = y^{(k)} + \alpha^{(k)} \nabla g(y^{(k)}),$$
where $\alpha^{(k)}$ is the step size at iteration $k$. Note that $\nabla g(y^{(k)}) = A\tilde{x} - b$, where $\tilde{x} = \arg\min_x L(x, y^{(k)})$.

Repeat for $k = 0$ up to a given number of steps $n$, or until convergence:
1. $x^{(k+1)} := \arg\min_x L(x, y^{(k)})$
2. $y^{(k+1)} := y^{(k)} + \alpha^{(k)} (Ax^{(k+1)} - b)$
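To make the iteration concrete, here is a minimal NumPy sketch of dual ascent, assuming a quadratic objective $f(x) = \frac{1}{2}x^T P x + q^T x$ so that the x-update has a closed form; the function name, step size, and random test data are illustrative only, and the step size must be small enough for the problem at hand.

```python
import numpy as np

def dual_ascent(P, q, A, b, alpha=0.1, iters=500):
    """Sketch of dual ascent for min 0.5*x'Px + q'x subject to Ax = b
    (quadratic objective assumed so the x-update is a linear solve)."""
    m, n = A.shape
    y = np.zeros(m)
    for _ in range(iters):
        # x-update: argmin_x f(x) + y'(Ax - b)  =>  Px + q + A'y = 0
        x = np.linalg.solve(P, -(q + A.T @ y))
        # dual gradient step: grad g(y) = Ax - b
        y = y + alpha * (A @ x - b)
    return x, y

# Illustrative usage with random data
rng = np.random.default_rng(0)
n, m = 10, 4
M = rng.standard_normal((n, n))
P = M @ M.T + n * np.eye(n)        # positive definite
q = rng.standard_normal(n)
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
x, y = dual_ascent(P, q, A, b, alpha=0.5)
print("||Ax - b|| =", np.linalg.norm(A @ x - b))
```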

Dual Decomposition (DD)

Suppose our objective is separable; then
$$f(x) = f_1(x_1) + \cdots + f_m(x_m), \qquad x = (x_1, \ldots, x_m).$$
The same goes for the Lagrangian:
$$L(x, y) = L_1(x_1, y) + \cdots + L_m(x_m, y) - y^T b,$$
where $L_i(x_i, y) = f_i(x_i) + y^T A_i x_i$. Thus, the x-minimization step in DAM splits into $m$ separate minimizations that can be carried out in parallel:
$$x_i^{(k+1)} := \arg\min_{x_i} L_i(x_i, y^{(k)}).$$

Dual Decomposition (continued)

Idea: distribute $y^{(k)}$, update each $x_i$ in parallel, then add up the $A_i x_i^{(k+1)}$ terms.

DD as proposed by Everett, Dantzig, Wolfe, and Benders. Repeat for $k = 0$ up to a given number of steps $n$, or until convergence:
1. $x_i^{(k+1)} := \arg\min_{x_i} L_i(x_i, y^{(k)})$, for $i = 1, \ldots, m$
2. $y^{(k+1)} := y^{(k)} + \alpha^{(k)} \left( \sum_{i=1}^m A_i x_i^{(k+1)} - b \right)$

This solves a large problem by solving parallel sub-problems, coordinating through the dual variable (see the sketch below). Drawbacks: it needs the assumption that $f$ is separable, and it can be slow at times.
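A minimal sketch of this loop for an assumed block-separable quadratic objective, with illustrative names; each block solve in the list comprehension is independent, which is exactly where the parallelism would go.

```python
import numpy as np

def dual_decomposition(P_blocks, q_blocks, A_blocks, b, alpha=0.1, iters=500):
    """Sketch of dual decomposition for a block-separable quadratic:
    f(x) = sum_i 0.5*x_i'P_i x_i + q_i'x_i, subject to sum_i A_i x_i = b.
    Each x_i-solve below is independent and could run in parallel."""
    m = b.shape[0]
    y = np.zeros(m)
    for _ in range(iters):
        # independent block solves: x_i = argmin f_i(x_i) + y'A_i x_i
        xs = [np.linalg.solve(P, -(q + A.T @ y))
              for P, q, A in zip(P_blocks, q_blocks, A_blocks)]
        # gather the A_i x_i terms, then take the dual gradient step
        resid = sum(A @ x for A, x in zip(A_blocks, xs)) - b
        y = y + alpha * resid
    return xs, y
```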

Method of Multipliers (MM)

Need a more robust DAM? Use the Method of Multipliers: swap the Lagrangian for an augmented Lagrangian,
$$L_\rho(x, y) = f(x) + y^T(Ax - b) + (\rho/2)\|Ax - b\|_2^2, \qquad \rho > 0.$$

Method of Multipliers as proposed by Hestenes and Powell. Repeat for $k = 0$ up to a given $n$, or until convergence:
1. $x^{(k+1)} := \arg\min_x L_\rho(x, y^{(k)})$
2. $y^{(k+1)} := y^{(k)} + \rho(Ax^{(k+1)} - b)$

Here $\rho$ serves as the dual update step length.
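A minimal sketch of MM, again assuming a quadratic objective so the augmented-Lagrangian x-update reduces to a linear solve; names and defaults are illustrative.

```python
import numpy as np

def method_of_multipliers(P, q, A, b, rho=1.0, iters=100):
    """Sketch of MM for an assumed quadratic f(x) = 0.5*x'Px + q'x
    subject to Ax = b."""
    m, n = A.shape
    y = np.zeros(m)
    for _ in range(iters):
        # x-update: minimize L_rho in x; here a closed form:
        # (P + rho*A'A) x = -q - A'y + rho*A'b
        x = np.linalg.solve(P + rho * A.T @ A, -q - A.T @ y + rho * A.T @ b)
        # dual update with step length rho
        y = y + rho * (A @ x - b)
    return x, y
```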

MM: The Dual Update Step

If $f$ is differentiable, the optimality conditions are:
- Primal feasibility: $Ax^* - b = 0$
- Dual feasibility: $\nabla f(x^*) + A^T y^* = 0$

At each iteration $k$, $x^{(k+1)}$ minimizes $L_\rho(x, y^{(k)})$, so
$$\nabla_x L_\rho(x^{(k+1)}, y^{(k)}) = \nabla f(x^{(k+1)}) + A^T\big(y^{(k)} + \rho(Ax^{(k+1)} - b)\big) = \nabla f(x^{(k+1)}) + A^T y^{(k+1)} = 0.$$

Thus the dual update makes $(x^{(k+1)}, y^{(k+1)})$ dual feasible; primal feasibility is achieved as $Ax^{(k+1)} - b \to 0$ when $k \to \infty$.

MM: The Good, the Bad and the Ugly

- The Good: conditions for convergence are much more relaxed than for DD; e.g., $f$ doesn't have to be differentiable.
- The Bad: the quadratic penalty from using the augmented Lagrangian prevents us from separating the x-update as in DD; thus, we can't use DD with MM.
- The Ugly: we can't use DD and MM simultaneously and have the advantages of both methods, at least not with the setup we have here.

The Alternating Direction Method of Multipliers (ADMM): What is ADMM?

ADMM finds a way to combine the advantages of DD and MM: the robustness of the Method of Multipliers, and support for Dual Decomposition's parallel x-updates.

Problem form:
$$\min_{x, z} \; f(x) + g(z) \quad \text{subject to} \quad Ax + Bz = c,$$
where $f$ and $g$ are both convex and the objective is separable into two sets of variables. ADMM defines an augmented Lagrangian to enable decomposition:
$$L_\rho(x, z, y) = f(x) + g(z) + y^T(Ax + Bz - c) + \frac{\rho}{2}\|Ax + Bz - c\|_2^2.$$

ADMM Algorithm

Algorithm proposed by Gabay, Mercier, Glowinski, and Marrocco. Repeat for $k = 0$ up to a specified $n$, or until convergence:
1. $x^{(k+1)} := \arg\min_x L_\rho(x, z^{(k)}, y^{(k)})$
2. $z^{(k+1)} := \arg\min_z L_\rho(x^{(k+1)}, z, y^{(k)})$
3. $y^{(k+1)} := y^{(k)} + \rho(Ax^{(k+1)} + Bz^{(k+1)} - c)$

ADMM doesn't minimize over $x$ and $z$ together, as MM would; it alternates between the two smaller minimization steps, which are often simple to carry out (the z-minimization step may reduce to solving a linear system of equations, for example). The augmented Lagrangian's extra penalty term $\frac{\rho}{2}\|Ax + Bz - c\|_2^2$ is what enables this separation. A generic sketch of the loop follows.
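Here is a generic sketch of the loop, with the two minimization steps left as user-supplied oracles (their signatures are an assumption for illustration, not a fixed API), followed by a tiny scalar usage example.

```python
import numpy as np

def admm(argmin_x, argmin_z, A, B, c, rho=1.0, iters=200):
    """Generic unscaled-form ADMM loop (sketch). Assumed oracle signatures:
      argmin_x(z, y, rho) -> argmin_x L_rho(x, z, y)
      argmin_z(x, y, rho) -> argmin_z L_rho(x, z, y)"""
    x = np.zeros(A.shape[1])
    z = np.zeros(B.shape[1])
    y = np.zeros(c.shape[0])
    for _ in range(iters):
        x = argmin_x(z, y, rho)                  # step 1: x-minimization
        z = argmin_z(x, y, rho)                  # step 2: z-minimization
        y = y + rho * (A @ x + B @ z - c)        # step 3: dual update
    return x, z, y

# Tiny usage example: min 0.5*(x - 3)^2 + 0.5*(z - 1)^2  subject to  x - z = 0
A, B, c = np.array([[1.0]]), np.array([[-1.0]]), np.zeros(1)
fx = lambda z, y, rho: (3.0 - y + rho * z) / (1.0 + rho)   # closed-form x-update
fz = lambda x, y, rho: (1.0 + y + rho * x) / (1.0 + rho)   # closed-form z-update
x, z, y = admm(fx, fz, A, B, c)    # x and z both approach the optimum 2.0
```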

Convergence of ADMM: Optimality Conditions

For the differentiable case, the optimality conditions are:
- Primal feasibility: $Ax^* + Bz^* - c = 0$
- Dual feasibility: $\nabla f(x^*) + A^T y^* = 0$ and $\nabla g(z^*) + B^T y^* = 0$

Since $z^{(k+1)}$ minimizes $L_\rho(x^{(k+1)}, z, y^{(k)})$, it follows that
$$0 = \nabla g(z^{(k+1)}) + B^T y^{(k)} + \rho B^T(Ax^{(k+1)} + Bz^{(k+1)} - c) = \nabla g(z^{(k+1)}) + B^T y^{(k+1)}.$$
The dual update thus makes $(x^{(k+1)}, z^{(k+1)}, y^{(k+1)})$ satisfy the second dual feasibility condition; the other conditions are achieved as $k \to \infty$.

Convergence of ADMM

Assumptions required:
1. The functions $f$ and $g$ are closed, convex, and proper. A function is closed if for any $\alpha \in \mathbb{R}$, the set $\{x \in \operatorname{dom}(f) : f(x) \le \alpha\}$ is closed. A convex function is proper if $f(x) < +\infty$ for some $x$ and $f(x) > -\infty$ for every $x$.
2. For $\rho = 0$, the (unaugmented) Lagrangian $L_0$ has a saddle point.

If these assumptions hold, then ADMM converges: the iterates approach feasibility, and the objective function approaches its optimal value.

Useful Tricks: Scaling Dual Variables

Let $r = Ax + Bz - c$. Then
$$\begin{aligned}
L_\rho(x, z, y) &= f(x) + g(z) + y^T r + (\rho/2)\|r\|_2^2 \\
&= f(x) + g(z) + (\rho/2)\|r + (1/\rho)y\|_2^2 - (1/(2\rho))\|y\|_2^2 \\
&= f(x) + g(z) + (\rho/2)\|r + u\|_2^2 + \text{const}_y \\
&= L_\rho(x, z, u),
\end{aligned}$$
where $u = (1/\rho)y$ is the scaled dual variable and $\text{const}_y$ depends only on $y$. Now the algorithm is:
1. $x^{(k+1)} := \arg\min_x L_\rho(x, z^{(k)}, u^{(k)})$
2. $z^{(k+1)} := \arg\min_z L_\rho(x^{(k+1)}, z, u^{(k)})$
3. $u^{(k+1)} := u^{(k)} + (Ax^{(k+1)} + Bz^{(k+1)} - c)$
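The same skeleton in scaled form, to highlight that only the dual update (and the penalty term seen by the oracles) changes; the oracle signatures are again an assumption made for illustration.

```python
import numpy as np

def admm_scaled(argmin_x, argmin_z, A, B, c, rho=1.0, iters=200):
    """Scaled-form ADMM loop (sketch), with u = y/rho in place of y.
    Only the dual update changes: the factor rho disappears."""
    x = np.zeros(A.shape[1])
    z = np.zeros(B.shape[1])
    u = np.zeros(c.shape[0])
    for _ in range(iters):
        x = argmin_x(z, u, rho)         # minimizes f(x) + (rho/2)||Ax + Bz - c + u||^2
        z = argmin_z(x, u, rho)         # minimizes g(z) + (rho/2)||Ax + Bz - c + u||^2
        u = u + A @ x + B @ z - c       # scaled dual update
    return x, z, rho * u                # rho*u recovers the unscaled dual y
```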

Useful Tricks: Writing Problems in ADMM Form

Generic problem: $\min f(x)$ subject to $x \in S$.
ADMM form: $\min f(x) + g(z)$ subject to $x - z = 0$, where $g(z) = I_S(z)$, the indicator function that $z$ is in $S$.

Notice that $B$ is the identity (up to sign), so the z-minimization boils down to
$$\arg\min_z \big( g(z) + (\rho/2)\|z - v\|_2^2 \big) = \operatorname{prox}_{g,\rho}(v), \qquad v = x^{(k+1)} + u^{(k)}$$
(the proximal function). Since $g(z)$ is an indicator function, this is just the projection of $v$ onto $S$. When $g$ is instead an $\ell_1$ penalty $\lambda\|z\|_1$ (as in the lasso and total variation problems below), the proximal operator is soft-thresholding:
$$z_i^{(k+1)} := (v_i - \lambda/\rho)_+ - (-v_i - \lambda/\rho)_+.$$
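Two proximal operators of the kind used in this z-update, sketched in NumPy: projection onto a box (an example of an indicator function) and soft-thresholding (the $\ell_1$ prox); the names and example values are illustrative.

```python
import numpy as np

def project_box(v, lo, hi):
    """Prox of the indicator of the box S = {z : lo <= z <= hi}:
    simply the Euclidean projection of v onto S."""
    return np.clip(v, lo, hi)

def soft_threshold(v, kappa):
    """Prox of kappa*||z||_1: elementwise soft-thresholding,
    z_i = (v_i - kappa)_+ - (-v_i - kappa)_+."""
    return np.maximum(0.0, v - kappa) - np.maximum(0.0, -v - kappa)

# Example: soft_threshold(np.array([2.0, -0.3, 0.7]), 0.5) -> [1.5, 0.0, 0.2]
```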

Common Problems Solved by ADMM

- Basis Pursuit
- Sparse Inverse Covariance Selection
- Huber Fitting
- Intersection of Polyhedra
- Lasso Problem
- Least Absolute Deviations
- Linear Programming
- $\ell_1$-Regularized Logistic Regression
- Regressor Selection (nonconvex)
- Quadratic Programming
- Support Vector Machines (SVMs)
- Total Variation Minimization (e.g., image denoising)
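As one concrete instance from this list, here is a sketch of scaled-form ADMM for the lasso, using the standard split $f(x) = \frac{1}{2}\|Ax - b\|_2^2$, $g(z) = \lambda\|z\|_1$, $x - z = 0$ (as in the Boyd et al. reference); the function name, parameter defaults, and random test data are illustrative.

```python
import numpy as np

def admm_lasso(A, b, lam, rho=1.0, iters=500):
    """Sketch of scaled-form ADMM for min 0.5*||Ax - b||_2^2 + lam*||x||_1,
    with the split f(x) = 0.5*||Ax - b||^2, g(z) = lam*||z||_1, x - z = 0."""
    m, n = A.shape
    x, z, u = np.zeros(n), np.zeros(n), np.zeros(n)
    Atb = A.T @ b
    M = A.T @ A + rho * np.eye(n)      # x-update system matrix (fixed across iterations)
    for _ in range(iters):
        # x-update: (A'A + rho*I) x = A'b + rho*(z - u)
        x = np.linalg.solve(M, Atb + rho * (z - u))
        # z-update: soft-thresholding, the prox of (lam/rho)*||.||_1
        v = x + u
        z = np.maximum(0.0, v - lam / rho) - np.maximum(0.0, -v - lam / rho)
        # scaled dual update
        u = u + x - z
    return z

# Illustrative usage with a random sparse recovery problem
rng = np.random.default_rng(0)
A = rng.standard_normal((40, 100))
x_true = np.zeros(100); x_true[:5] = rng.standard_normal(5)
b = A @ x_true + 0.01 * rng.standard_normal(40)
x_hat = admm_lasso(A, b, lam=0.1)
```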

Example: Total Variation Minimization (1-D Case)

In essence, total variation is an infinitesimal version of absolute value; in one dimension it can be defined as $V(x) = \sum_i |x_{i+1} - x_i|$. The one-dimensional problem then takes the form
$$\min_x \; E(x, b) + \lambda V(x) = \min_x \; \frac{1}{2}\|x - b\|_2^2 + \lambda \sum_i |x_{i+1} - x_i|,$$
where $x, b \in \mathbb{R}^n$. Let's write the problem in ADMM form:
$$\min_{x, z} \; \frac{1}{2}\|x - b\|_2^2 + g(z) \quad \text{subject to} \quad dx - z = 0,$$
where $d$ is the forward-difference operator (so $\lambda \sum_i |x_{i+1} - x_i| = \lambda\|dx\|_1$) and $g(z) = \lambda\|z\|_1$, whose proximal operator is the soft-thresholding mentioned before.

Example: Total Variation Minimization (continued)

What is the x-minimization step? Using the scaled augmented Lagrangian $L_\rho(x, z, u)$, the problem is to minimize over $x$:
$$\frac{1}{2}\|x - b\|_2^2 + g(z) + \frac{\rho}{2}\|dx - z + u\|_2^2 + \text{const}.$$
So set the gradient in $x$ to zero to minimize:
$$\nabla_x L_\rho(x, z, u) = x - b + \rho d^T(dx - z + u) = 0.$$
Group the $x$ terms on one side: $(I + \rho d^T d)x = \rho d^T(z - u) + b$, so
$$x = (I + \rho d^T d)^{-1}\big(\rho d^T(z - u) + b\big).$$

ADMM Algorithm to Solve the TV Problem

1. Form the difference matrix $D$ approximating $d$ in $dx$, using the stencil $[-1, 1]$ along the diagonal, with circular wrapping.
2. x-update: solve the system $(I + \rho D^T D)\, x^{(k+1)} = b + \rho D^T(z^{(k)} - y^{(k)})$.
3. z-update: shrinkage of $P = Dx^{(k+1)} + y^{(k)}$, i.e. $z^{(k+1)} := \max(0, P - \lambda/\rho) - \max(0, -P - \lambda/\rho)$, taken elementwise.
4. Dual update: $y^{(k+1)} := y^{(k)} + Dx^{(k+1)} - z^{(k+1)}$ (here $y$ plays the role of the scaled dual variable $u$).
5. Repeat steps 2-4 until convergence or until $n$ iterations are reached.

A NumPy sketch of this loop is given below.
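The sketch follows the steps above; the dense difference matrix and the parameter values are for illustration only (a real implementation would exploit the structure of $D$), and the test signal is made up.

```python
import numpy as np

def tv_denoise_1d(b, lam, rho=1.0, iters=200):
    """Sketch of the 1-D total variation ADMM loop above
    (y is the scaled dual variable, as in the slides)."""
    n = b.shape[0]
    # Difference matrix D with stencil [-1, 1] and circular wrapping
    D = np.eye(n, k=1) - np.eye(n)
    D[-1, 0] = 1.0
    x = b.copy()
    z = np.zeros(n)
    y = np.zeros(n)
    M = np.eye(n) + rho * D.T @ D            # x-update system matrix (fixed across iterations)
    for _ in range(iters):
        # x-update: (I + rho*D'D) x = b + rho*D'(z - y)
        x = np.linalg.solve(M, b + rho * D.T @ (z - y))
        # z-update: elementwise shrinkage of P = Dx + y
        P = D @ x + y
        z = np.maximum(0.0, P - lam / rho) - np.maximum(0.0, -P - lam / rho)
        # scaled dual update
        y = y + D @ x - z
    return x

# Illustrative usage: denoise a noisy piecewise-constant signal
rng = np.random.default_rng(0)
signal = np.concatenate([np.zeros(50), np.ones(50), 0.3 * np.ones(50)])
noisy = signal + 0.1 * rng.standard_normal(signal.shape)
clean = tv_denoise_1d(noisy, lam=0.5)
```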

Example: Total Variation Minimization

Examples of denoising in one dimension (figure slide).

Project Goals: In This Project...

Our goal is to make ADMM easier to use in practice: upload $A$, $B$, and $c$, then run the appropriate function. Maximizing ADMM's potential means tweaking parameters such as the step size $\rho$ and more. We hope to create a comprehensive library for general ADMM use, with:
- Generalized ADMM functionality.
- Adaptive step-size selection.
- Ready-to-go, optimized functions for the problems ADMM is most used for.
- High-performance computing capabilities (MPI).
- Implementations in Python and Matlab.

Project Goals: Major Obstacles

1. There is no known step-size selection algorithm for ADMM. In practice, choosing $\rho$ is typically done by fine-tuning and testing; the optimal $\rho$ changes with the problem, and perhaps even with the data. It may be possible to dynamically choose an optimal $\rho$ at every iteration instead.
2. How to dynamically choose $\rho$? There are several possible strategies we will try; thorough testing is required to see which ones work and which work best.
3. When should the algorithm stop? This may require multiple types of stopping conditions.
4. How do we validate that a given input will actually converge?

Project Goals: Stopping Conditions

Primal ($p$) and dual ($d$) residuals in ADMM at step $k+1$:
$$p^{k+1} = Ax^{k+1} + Bz^{k+1} - c, \qquad d^{k+1} = \rho A^T B (z^{k+1} - z^k).$$
Reasonable stopping criteria: $\|p^k\|_2 \le \epsilon^{\text{pri}}$ and $\|d^k\|_2 \le \epsilon^{\text{dual}}$. There are many ways to choose these tolerances. One common example, where $p \in \mathbb{R}^{n_1}$ and $d \in \mathbb{R}^{n_2}$:
$$\epsilon^{\text{pri}} = \sqrt{n_1}\,\epsilon^{\text{abs}} + \epsilon^{\text{rel}} \max(\|Ax^k\|_2, \|Bz^k\|_2, \|c\|_2), \qquad \epsilon^{\text{dual}} = \sqrt{n_2}\,\epsilon^{\text{abs}} + \epsilon^{\text{rel}} \|A^T y^k\|_2,$$
where $\epsilon^{\text{abs}}$ and $\epsilon^{\text{rel}}$ are chosen constants referred to as the absolute and relative tolerance. A small helper implementing this check is sketched below.
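A small helper sketching this stopping test; the function name and default tolerances are illustrative.

```python
import numpy as np

def admm_converged(A, B, c, x, z, z_prev, y, rho, eps_abs=1e-4, eps_rel=1e-3):
    """Returns True when both the primal and dual residual norms
    fall below the tolerances defined above."""
    p = A @ x + B @ z - c                      # primal residual
    d = rho * A.T @ B @ (z - z_prev)           # dual residual
    n1, n2 = p.shape[0], d.shape[0]
    eps_pri = np.sqrt(n1) * eps_abs + eps_rel * max(
        np.linalg.norm(A @ x), np.linalg.norm(B @ z), np.linalg.norm(c))
    eps_dual = np.sqrt(n2) * eps_abs + eps_rel * np.linalg.norm(A.T @ y)
    return np.linalg.norm(p) <= eps_pri and np.linalg.norm(d) <= eps_dual
```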

Project Goals: Adaptive Step Size Selection Strategies

1. Perform a partial step with a pre-existing estimate of the step size, then linearly interpolate the residuals as a function of the unknown full step size. Solve this to find what a better step size would have been, and use that as the next step size.
2. It is possible to optimize the dual problem's step size: keep the penalty term's $\rho$ constant and manipulate only the dual step size.
3. ADMM can be viewed as a type of Douglas-Rachford method (DRM). Using results from Esser's paper on DRM, one can solve for an optimal step size by finding the gradient of the dual step in DRM.

Project Details: Testing and Validation

Testing:
- Main functionality can be tested with randomized data (random $A$, $B$, and $c$); this is standard for convex optimization.
- For specific problems, e.g. SVM classifiers, we can compare against existing solvers and datasets, for example the MNIST handwritten digit data.

Validation:
- Compare performance between the adaptive step-size selection strategies.
- Also compare these strategies to the performance of normal ADMM without adaptive step-size selection.

Project Details: Project Timeline

Fall Semester Goals:
- End of October: Implement generic ADMM and solvers for the Lasso problem, TV minimization, and SVMs.
- Early November: Implement scripts for general testing, convergence checking, and stopping-condition strategies.
- End of November: Try out implementations of all three adaptive step-size selection strategies.
- Early December: Finalize bells and whistles on ADMM options; compile testing/validation data.

Spring Semester Goals:
- End of February: Implement the full library of standard problem solvers.
- End of March: Finish implementing MPI in the ADMM library.
- End of April: Finish porting the code to the Python version.
- Early May: Compile new testing/validation data.

Project Details: Deliverables

ADMM Library (Python and Matlab versions):
- Contains general ADMM with adaptive step-size routines and standard solvers for the common problems ADMM solves.
- Scripts for generating random test data and results.
- Scripts for validating the performance of adaptive ADMM against regular ADMM for each adaptive strategy.
- A report on observed testing/validation results and on findings with adaptive ADMM; this may eventually lead to a paper.
- Datasets used for testing the standard solvers (or references to where to obtain them, if they are too big).

References

- S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, "Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers," Foundations and Trends in Machine Learning, vol. 3, no. 1, pp. 1-122, 2010.
- E. Esser, "Applications of Lagrangian-Based Alternating Direction Methods and Connections to Split Bregman," April 2009.
- H. Everett, "Generalized Lagrange multiplier method for solving problems of optimum allocation of resources," Operations Research, vol. 11, no. 3, pp. 399-417, 1963.
- M. R. Hestenes, "Multiplier and gradient methods," Journal of Optimization Theory and Applications, vol. 4, pp. 302-320, 1969.
- J. Eckstein and M. Fukushima, "Some reformulations and applications of the alternating direction method of multipliers," Large Scale Optimization: State of the Art, pp. 119-138, 1993.