Sparse Optimization Lecture: Proximal Operator/Algorithm and Lagrange Dual


Sparse Optimization Lecture: Proximal Operator/Algorithm and Lagrange Dual

Instructor: Wotao Yin, July 2013 (online discussions on piazza.com)

Those who complete this lecture will learn:
- the proximal operator and its basic properties
- the proximal algorithm
- the proximal algorithm applied to the Lagrange dual

1 / 20

Gradient descent / forward Euler

Assume the function f is convex and differentiable, and consider min f(x).

Gradient descent iteration (with step size c):

    x^{k+1} = x^k − c ∇f(x^k)

x^{k+1} minimizes the following local quadratic approximation of f:

    f(x^k) + ⟨∇f(x^k), x − x^k⟩ + (1/2c) ‖x − x^k‖²₂

Compare with the forward Euler iteration, a.k.a. the explicit update:

    x(t+1) = x(t) − Δt ∇f(x(t))

2 / 20
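To make the explicit update concrete, here is a minimal Python sketch (not from the slides; A, b, and the step size c are made-up toy data) of gradient descent on a smooth quadratic:

```python
import numpy as np

# Toy smooth convex objective f(x) = 0.5*||A x - b||_2^2 (made-up data).
A = np.array([[2.0, 0.0],
              [0.0, 1.0]])
b = np.array([2.0, 1.0])

def grad_f(x):
    return A.T @ (A @ x - b)

c = 0.2           # step size; needs c < 2/L with L = lambda_max(A^T A) = 4 here
x = np.zeros(2)
for _ in range(200):
    x = x - c * grad_f(x)   # x^{k+1} = x^k - c * grad f(x^k)

# x approaches the least-squares solution x* = (1, 1)
```

With c inside (0, 2/L) each eigencomponent of the error contracts geometrically, which is why the fixed step works here.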

Backward Euler / implicit gradient descent

Backward Euler iteration, also known as the implicit update:

    solve x(t+1) = x(t) − Δt ∇f(x(t+1)) for x(t+1),

equivalent to:
1. u(t+1) solves u = ∇f(x(t) − Δt u),
2. x(t+1) = x(t) − Δt u(t+1).

We can view it as the implicit gradient descent:

    (x^{k+1}, u^{k+1}) solve x = x^k − cu, u = ∇f(x).

c is the step size, very different from a standard step size.
The explicit (implicit) update uses the gradient at the start (end) point.

3 / 20

Implicit gradient step = proximal operation

Proximal update:

    prox_{cf}(z) := argmin_x f(x) + (1/2c) ‖x − z‖².

Optimality condition:

    0 = c ∇f(x*) + (x* − z), i.e. (x*, u*) solve x = z − cu, u = ∇f(x).

Given input z:
- prox_{cf}(z) returns the solution x*
- ∇f(prox_{cf}(z)) returns u*

Proposition
The proximal operator is equivalent to an implicit gradient (or backward Euler) step.

4 / 20
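The equivalence can be checked numerically. The sketch below (my own toy example, not from the slides) takes f(x) = (λ/2)‖x‖², whose prox has the closed form z/(1 + cλ), and solves the implicit equation x = z − c∇f(x) by fixed-point iteration:

```python
import numpy as np

# f(x) = (lam/2)*||x||^2, so grad f(x) = lam*x and
# prox_{cf}(z) = argmin_x f(x) + (1/2c)||x - z||^2 = z / (1 + c*lam).
lam, c = 2.0, 0.25
z = np.array([3.0, -1.5])

prox = z / (1.0 + c * lam)   # closed-form proximal point

# Backward-Euler step: solve x = z - c * grad f(x) by fixed-point iteration
# (converges here because c*lam = 0.5 < 1).
x = z.copy()
for _ in range(100):
    x = z - c * lam * x

# x and prox coincide: the prox is exactly the implicit gradient step
```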

Proximal operator handles sub-differentiable f

Assume f is a closed, proper, sub-differentiable convex function. ∂f(x) denotes the subdifferential of f at x. Recall u ∈ ∂f(x) if

    f(x′) ≥ f(x) + ⟨u, x′ − x⟩, ∀x′ ∈ Rⁿ.

∂f is point-to-set; in neither direction is it unique. prox is well-defined for sub-differentiable f, and it is point-to-point: prox maps any input to a unique point.

5 / 20

Proximal operator

    prox_{cf}(z) := argmin_x f(x) + (1/2c) ‖x − z‖².

Since the objective is strongly convex, the solution prox_{cf}(z) is unique. Since f is proper, dom prox_{cf} = Rⁿ.

The following are equivalent:
- x* = prox_{cf}(z) = argmin_x f(x) + (1/2c) ‖x − z‖²,
- x* solves 0 ∈ c ∂f(x) + (x − z),
- (x*, u*) solve x = z − cu, u ∈ ∂f(x).

A point x* minimizes f if and only if x* = prox_f(x*).

6 / 20

Examples

- f(x) = ι_C(x), the indicator of a set C
- f(x) = (λ/2) ‖x‖²₂
- f(x) = ‖x‖₁
- f(x) = Σᵢ ‖x_{gᵢ}‖₂ (group sparsity)
- f(X) = ‖X‖_* (nuclear norm)

7 / 20
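Two of these prox maps have well-known closed forms; as an illustration (a sketch, not part of the slides), the indicator of a box yields a projection, and the ℓ1 norm yields soft-thresholding:

```python
import numpy as np

def prox_l1(z, c):
    """prox of c*||.||_1: componentwise soft-thresholding by c."""
    return np.sign(z) * np.maximum(np.abs(z) - c, 0.0)

def prox_box(z, lo, hi):
    """prox of the indicator of C = [lo, hi]^n: Euclidean projection onto the box."""
    return np.clip(z, lo, hi)

z = np.array([1.5, -0.3, 0.0])
s = prox_l1(z, 0.5)        # entries shrink toward 0; small ones vanish
p = prox_box(z, 0.0, 1.0)  # entries clipped into [0, 1]
```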

Examples

Given prox_f for a function f, it is easy to derive prox_g for:
- g(x) = αf(x) + β
- g(x) = f(αx + b)
- g(x) = f(x) + aᵀx + β
- g(x) = f(x) + (ρ/2) ‖x − a‖²

8 / 20
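As a sanity check of the last rule (my own toy verification; the closed-form identity below is obtained by completing the square and is my assumption, not stated on the slides): for g(x) = f(x) + (ρ/2)(x − a)² with f = |·|, one gets prox_{cg}(z) = prox_{c̃f}((z + cρa)/(1 + cρ)) with c̃ = c/(1 + cρ).

```python
import numpy as np

def prox_abs(z, c):
    # prox of c*|.|: soft-thresholding
    return np.sign(z) * np.maximum(np.abs(z) - c, 0.0)

# g(x) = |x| + (rho/2)*(x - a)^2; assumed identity:
# prox_{cg}(z) = prox_{ct*|.|}((z + c*rho*a) / (1 + c*rho)), ct = c/(1 + c*rho)
c, rho, a, z = 0.5, 2.0, 1.0, 3.0
ct = c / (1.0 + c * rho)
closed = prox_abs((z + c * rho * a) / (1.0 + c * rho), ct)

# brute-force check: minimize |x| + (rho/2)(x-a)^2 + (1/2c)(x-z)^2 on a grid
xs = np.linspace(-5.0, 5.0, 200001)
obj = np.abs(xs) + (rho / 2) * (xs - a) ** 2 + (xs - z) ** 2 / (2 * c)
brute = xs[np.argmin(obj)]
```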

Resolvent of ∂f

∂f is a point-to-set mapping, and so is I + c∂f; in general, (I + c∂f)⁻¹ is a point-to-set mapping. However, we claim

    prox_{cf} = (I + c∂f)⁻¹.

Since prox_{cf}(z) is always unique, (I + c∂f)⁻¹ is a point-to-point mapping. (I + c∂f)⁻¹ is known as the resolvent of ∂f with parameter c.

By the way, ∇f is the gradient operator, and (I − c∇f) is the gradient-descent operator.

9 / 20

Moreau envelope

Idea: to smooth a closed, proper, nonsmooth convex function f.

Definition:

    M_{cf}(x) = inf_y f(y) + (1/2c) ‖y − x‖².

dom M_{cf} = Rⁿ even if dom f is not; M_{cf} ∈ C¹ even if f is not. In fact,

    M_{cf} = (1/c) ((cf)* + (1/2)‖·‖²)*,

and the conjugate of a strongly convex function is differentiable (with Lipschitz gradient).

Relation with prox_{cf}:
- ∇M_{cf}(x) = (1/c)(x − prox_{cf}(x))
- prox_{cf}(x) = x − c ∇M_{cf}(x), an explicit gradient step on M_{cf}
- prox_{f*}(x) = ∇M_f(x)

Example: the Huber function is M_{cf} where f(x) = ‖x‖₁.

c is not a usual step size. As c → ∞, prox_{cf}(x) tends to a minimizer of f.

10 / 20
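The Huber example can be verified numerically; the sketch below (toy code, scalar case f = |·| applied componentwise) evaluates the envelope through its minimizer, the soft-thresholding prox:

```python
import numpy as np

def prox_abs(x, c):
    # prox of c*|.|: soft-thresholding
    return np.sign(x) * np.maximum(np.abs(x) - c, 0.0)

def moreau_env(x, c):
    # M_{cf}(x) = min_y |y| + (1/2c)(y - x)^2, evaluated at the minimizer
    y = prox_abs(x, c)
    return np.abs(y) + (y - x) ** 2 / (2.0 * c)

def huber(x, c):
    # quadratic near 0, linear (slope 1) in the tails
    return np.where(np.abs(x) <= c, x ** 2 / (2.0 * c), np.abs(x) - c / 2.0)

x = np.linspace(-3.0, 3.0, 13)
# the Moreau envelope of |.| equals the Huber function,
# and grad M_{cf}(x) = (x - prox_{cf}(x)) / c clips x into [-1, 1] for c = 1
```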

Proximal algorithm

Assume that f has a minimizer; then iterate

    x^{k+1} = prox_{c_k f}(x^k).

prox is firmly nonexpansive:

    ‖prox_f(x) − prox_f(y)‖² ≤ ⟨prox_f(x) − prox_f(y), x − y⟩, ∀x, y ∈ Rⁿ.

It converges to the minimizer as long as c_k > 0 and Σ_{k=1}^∞ c_k = ∞. For example, one can fix c_k ≡ c.

Step-sized iteration: fix c > 0 and pick α_k ∈ (0, 2) uniformly away from 0 and 2:

    x^{k+1} = α_k prox_{cf}(x^k) + (1 − α_k) x^k.

The convergence takes a finite number of iterations if f is polyhedral (i.e., piecewise linear).

11 / 20
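A tiny illustration of the finite-convergence claim (toy values, not from the slides): the proximal iteration on the polyhedral function f(x) = |x| moves a distance c per step and lands exactly on the minimizer:

```python
import numpy as np

def prox_abs(z, c):
    # prox of c*|.|: soft-thresholding
    return np.sign(z) * np.maximum(np.abs(z) - c, 0.0)

# proximal iteration x^{k+1} = prox_{cf}(x^k) with fixed c_k = c
x, c = 5.0, 1.0
trace = [x]
while x != 0.0:
    x = prox_abs(x, c)   # each step moves distance c toward the minimizer 0
    trace.append(x)

# trace is 5 -> 4 -> 3 -> 2 -> 1 -> 0: the minimizer is reached exactly,
# in finitely many steps, as claimed for polyhedral f
```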

Proximal algorithm

Diminishing regularization:

    x^{k+1} = argmin_x f(x) + (1/2c) ‖x − x^k‖²₂.

As x^k → x*, ∇f(x^k) → 0, and thus the pull of f(x) becomes weaker; hence ‖x^{k+1} − x^k‖ tends to be smaller.

Many algorithms use prox_{c_k f}(x^k), either entirely or as a part (but most of them were motivated through other means). Although prox_{c_k f} can sometimes be difficult to compute, it simplifies for some sub-differentiable functions, in particular those arising in duality (our next focus).

12 / 20

Lagrange duality

Convex problem:

    min f(x) s.t. Ax = b.

Relax the constraints and price their violation (pay a price if violated one way; get paid if violated the other way; the payment is linear in the violation):

    L(x; y) := f(x) + yᵀ(Ax − b).

For later use, define the augmented Lagrangian:

    L_A(x; y, c) := f(x) + yᵀ(Ax − b) + (c/2) ‖Ax − b‖²₂.

Minimize L for a fixed price y:

    d(y) := −min_x L(x; y).

Always, d(y) is convex. The Lagrange dual problem:

    min_y d(y).

Given a dual solution y*, recover x* = argmin_x L(x; y*) (under which conditions?).

Question: how to compute the explicit/implicit gradients of d(y)?

13 / 20

Dual explicit gradient (ascent) algorithm

Assume d(y) is differentiable (true if f(x) is strictly convex. Is this if-and-only-if?).

Gradient descent iteration (if the maximizing form of the dual is used, it is called gradient ascent):

    y^{k+1} = y^k − c ∇d(y^k).

It turns out to be relatively easy to compute ∇d, via an unconstrained subproblem:

    ∇d(y) = b − Ax̄, where x̄ = argmin_x L(x; y).

Dual gradient iteration:
1. x^k solves min_x L(x; y^k);
2. y^{k+1} = y^k − c(b − Ax^k).

14 / 20
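A minimal sketch of the dual gradient iteration (made-up toy problem, not from the slides): for min ½‖x‖² s.t. Ax = b, the x-subproblem has the closed form x = −Aᵀy.

```python
import numpy as np

# Toy equality-constrained problem: min 0.5*||x||^2  s.t.  A x = b.
A = np.array([[1.0, 1.0]])
b = np.array([1.0])

y = np.zeros(1)
c = 0.5
for _ in range(100):
    x = -A.T @ y             # step 1: x^k = argmin_x L(x; y^k) = -A^T y^k
    y = y - c * (b - A @ x)  # step 2: dual gradient step

# x converges to (0.5, 0.5), the projection of the origin onto {x : x1 + x2 = 1}
```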

Sub-gradient of d(y)

Assume d(y) is sub-differentiable (which condition on the primal can guarantee this?).

Lemma
Given a dual point y and x̄ = argmin_x L(x; y), we have b − Ax̄ ∈ ∂d(y).

Proof.
Recall (i) u ∈ ∂d(y) if d(y′) ≥ d(y) + ⟨u, y′ − y⟩ for all y′; (ii) d(y) := −min_x L(x; y).
From (ii) and the definition of x̄,

    d(y) + ⟨b − Ax̄, y′ − y⟩ = −L(x̄; y) + (b − Ax̄)ᵀ(y′ − y)
                            = −[f(x̄) + yᵀ(Ax̄ − b)] − (Ax̄ − b)ᵀ(y′ − y)
                            = −[f(x̄) + (y′)ᵀ(Ax̄ − b)]
                            = −L(x̄; y′) ≤ −min_x L(x; y′) = d(y′).

From (i), b − Ax̄ ∈ ∂d(y). ∎

15 / 20

Dual explicit (sub)gradient iteration

The iteration:
1. x^k solves min_x L(x; y^k);
2. y^{k+1} = y^k − c_k(b − Ax^k).

Notes:
- (b − Ax^k) ∈ ∂d(y^k), as shown on the last slide
- it does not require d(y) to be differentiable
- convergence might require a careful choice of c_k (e.g., a diminishing sequence) if d(y) is only sub-differentiable (or lacks a Lipschitz continuous gradient)

16 / 20

Dual implicit gradient

Goal: to descend using the (sub)gradient of d at the next point y^{k+1}.

Following from the Lemma, we have b − Ax^{k+1} ∈ ∂d(y^{k+1}), where

    x^{k+1} = argmin_x L(x; y^{k+1}).

Since the implicit step is y^{k+1} = y^k − c(b − Ax^{k+1}), we can derive:

    x^{k+1} = argmin_x L(x; y^{k+1})
    ⟺ 0 ∈ ∂_x L(x^{k+1}; y^{k+1}) = ∂f(x^{k+1}) + Aᵀy^{k+1}
                                  = ∂f(x^{k+1}) + Aᵀ(y^k − c(b − Ax^{k+1})).

Therefore, while x^{k+1} is a solution to min_x L(x; y^{k+1}), it is also a solution to

    min_x L_A(x; y^k, c), where L_A(x; y^k, c) = f(x) + (y^k)ᵀ(Ax − b) + (c/2) ‖Ax − b‖²,

which is independent of y^{k+1}.

17 / 20

Dual implicit gradient

Proposition
Assuming y⁺ = y − c(b − Ax⁺), the following are equivalent:
1. x⁺ solves min_x L(x; y⁺);
2. x⁺ solves min_x L_A(x; y, c).

18 / 20

Dual implicit gradient iteration

The iteration y^{k+1} = prox_{cd}(y^k) is commonly known as the augmented Lagrangian method or the method of multipliers.

Implementation:
1. x^{k+1} solves min_x L_A(x; y^k, c);
2. y^{k+1} = y^k − c(b − Ax^{k+1}).

Proposition
The following are equivalent:
1. the augmented Lagrangian iteration;
2. the implicit gradient iteration on d(y);
3. the proximal iteration y^{k+1} = prox_{cd}(y^k).

19 / 20
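The implementation can be sketched on a toy problem (made-up data, not from the slides): for f(x) = ½‖x‖² with constraint Ax = b, step 1 reduces to the linear system (I + cAᵀA)x = cAᵀb − Aᵀy.

```python
import numpy as np

# Toy problem: min 0.5*||x||^2  s.t.  A x = b.
A = np.array([[1.0, 1.0]])
b = np.array([1.0])
n = A.shape[1]

y = np.zeros(1)
c = 1.0
for _ in range(30):
    # step 1: x^{k+1} = argmin_x L_A(x; y^k, c); optimality gives
    #   0 = x + A^T y + c A^T (A x - b)  =>  (I + c A^T A) x = c A^T b - A^T y
    x = np.linalg.solve(np.eye(n) + c * (A.T @ A), c * (A.T @ b) - A.T @ y)
    # step 2: multiplier update
    y = y - c * (b - A @ x)

# x converges to (0.5, 0.5) with multiplier y = -0.5, using a FIXED step c,
# illustrating that the implicit iteration needs no diminishing step size
```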

Dual explicit/implicit (sub)gradient computation

Definitions:

    L(x; y) = f(x) + yᵀ(Ax − b)
    L_A(x; y, c) = L(x; y) + (c/2) ‖Ax − b‖²₂

Objective: d(y) = −min_x L(x; y).

Explicit (sub)gradient iteration: y^{k+1} = y^k − c ∇d(y^k), or use a subgradient in ∂d(y^k):
1. x^{k+1} = argmin_x L(x; y^k);
2. y^{k+1} = y^k − c(b − Ax^{k+1}).

Implicit (sub)gradient step: y^{k+1} = prox_{cd}(y^k):
1. x^{k+1} = argmin_x L_A(x; y^k, c);
2. y^{k+1} = y^k − c(b − Ax^{k+1}).

The implicit iteration is more stable; the step size c does not need to diminish.

20 / 20