Optimization for Machine Learning


Optimization for Machine Learning (Problems; Algorithms - C)
Suvrit Sra, Massachusetts Institute of Technology
PKU Summer School on Data Science (July 2017)

Course materials: http://suvrit.de/teaching.html
Some references:
Introductory Lectures on Convex Optimization (Nesterov)
Convex Optimization (Boyd & Vandenberghe)
Nonlinear Programming (Bertsekas)
Convex Analysis (Rockafellar)
Fundamentals of Convex Analysis (Hiriart-Urruty, Lemaréchal)
Lectures on Modern Convex Optimization (Nemirovski)
Optimization for Machine Learning (Sra, Nowozin, Wright)
Theory of Convex Optimization for Machine Learning (Bubeck)
NIPS 2016 Optimization Tutorial (Bach, Sra)
Some related courses:
EE227A, Spring 2013 (Sra, UC Berkeley)
10-801, Spring 2014 (Sra, CMU)
EE364a,b (Boyd, Stanford)
EE236b,c (Vandenberghe, UCLA)
Venues: NIPS, ICML, UAI, AISTATS, SIOPT, Math. Prog.

Lecture Plan
Introduction (3 lectures)
Problems and algorithms (5 lectures)
Non-convex optimization, perspectives (2 lectures)

Nonsmooth convergence rates
Theorem (Nesterov). Let $B = \{x : \|x - x^0\|_2 \le D\}$ and assume $x^* \in B$. There exists a convex function $f \in C^0_L(B)$ (with $L > 0$) such that, for $0 \le k \le n-1$, the lower bound
$f(x^k) - f(x^*) \ge \frac{LD}{2(1 + \sqrt{k+1})}$
holds for any algorithm that generates $x^k$ by linearly combining the previous iterates and subgradients.
Exercise: So design problems where we can do better!

Composite problems

Composite objectives
Frequently ML problems take the regularized form
minimize $f(x) := \ell(x) + r(x)$
Example: $\ell(x) = \frac{1}{2}\|Ax - b\|^2$ and $r(x) = \lambda\|x\|_1$: Lasso, L1-LS, compressed sensing.
Example: $\ell(x)$ a logistic loss and $r(x) = \lambda\|x\|_1$: L1-logistic regression, sparse LR.

Composite objective minimization
minimize $f(x) := \ell(x) + r(x)$
Subgradient method: $x^{k+1} = x^k - \alpha_k g^k$, $g^k \in \partial f(x^k)$; converges slowly, at rate $O(1/\sqrt{k})$.
But $f$ is smooth plus nonsmooth: we should exploit the smoothness of $\ell$ for a better method!

Proximal Gradient Method
$\min f(x)$ s.t. $x \in \mathcal{X}$: projected (sub)gradient $x \leftarrow P_{\mathcal{X}}(x - \alpha \nabla f(x))$.
$\min f(x) + h(x)$: proximal gradient $x \leftarrow \operatorname{prox}_{\alpha h}(x - \alpha \nabla f(x))$, where $\operatorname{prox}_{\alpha h}$ denotes the proximity operator of $h$.
Why? If we can compute $\operatorname{prox}_h(x)$ easily, prox-grad converges as fast as gradient methods for smooth problems!

Proximity operator
Projection: $P_{\mathcal{X}}(y) := \operatorname{argmin}_{x \in \mathbb{R}^n} \frac{1}{2}\|x - y\|_2^2 + \mathbb{1}_{\mathcal{X}}(x)$.
Proximity: replace the indicator $\mathbb{1}_{\mathcal{X}}$ by a closed convex function $r$:
$\operatorname{prox}_r(y) := \operatorname{argmin}_{x \in \mathbb{R}^n} \frac{1}{2}\|x - y\|_2^2 + r(x)$.

Proximity operator (figure: projection onto the $\ell_1$-norm ball of radius $\rho(\lambda)$).

Proximity operators
Exercise: Let $r(x) = \|x\|_1$. Solve $\operatorname{prox}_{\lambda r}(y)$:
$\min_{x \in \mathbb{R}^n} \frac{1}{2}\|x - y\|_2^2 + \lambda\|x\|_1$.
Hint 1: The problem decomposes into $n$ independent subproblems of the form $\min_{x \in \mathbb{R}} \frac{1}{2}(x - y)^2 + \lambda|x|$.
Hint 2: Consider the two cases: either $x = 0$ or $x \neq 0$.
Exercise: Moreau decomposition $y = \operatorname{prox}_h(y) + \operatorname{prox}_{h^*}(y)$ (notice the analogy to $V = S \oplus S^{\perp}$ in linear algebra).
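As a quick numerical check of both exercises, here is a minimal sketch. It assumes the standard soft-thresholding solution of the first exercise, and that the conjugate of $\lambda\|\cdot\|_1$ is the indicator of the $\ell_\infty$-ball of radius $\lambda$, so its prox is projection onto that ball; function names are illustrative.

```python
import numpy as np

def prox_l1(y, lam):
    # Soft-thresholding: closed-form solution of min_x 0.5*||x - y||^2 + lam*||x||_1
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

def prox_l1_conj(y, lam):
    # Assumption from the lead-in: conjugate of lam*||.||_1 is the indicator of the
    # l_inf ball of radius lam, so its prox is projection onto that ball.
    return np.clip(y, -lam, lam)

y = np.array([3.0, -0.2, 0.7, -1.5])
lam = 0.5
# Moreau decomposition check: y = prox_h(y) + prox_{h*}(y)
print(np.allclose(prox_l1(y, lam) + prox_l1_conj(y, lam), y))   # True
```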

How to cook up prox-grad?
Lemma. $x^* = \operatorname{prox}_{\alpha h}(x^* - \alpha \nabla f(x^*))$ for any $\alpha > 0$ iff $x^*$ is optimal:
$0 \in \nabla f(x^*) + \partial h(x^*)$
$\Leftrightarrow 0 \in \alpha \nabla f(x^*) + \alpha \partial h(x^*)$
$\Leftrightarrow x^* \in \alpha \nabla f(x^*) + (I + \alpha \partial h)(x^*)$
$\Leftrightarrow x^* - \alpha \nabla f(x^*) \in (I + \alpha \partial h)(x^*)$
$\Leftrightarrow x^* = (I + \alpha \partial h)^{-1}(x^* - \alpha \nabla f(x^*))$
$\Leftrightarrow x^* = \operatorname{prox}_{\alpha h}(x^* - \alpha \nabla f(x^*))$
The above fixed-point equation suggests the iteration
$x^{k+1} = \operatorname{prox}_{\alpha_k h}(x^k - \alpha_k \nabla f(x^k))$.
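A minimal sketch of this iteration for the lasso example from earlier ($\ell(x) = \frac{1}{2}\|Ax - b\|^2$, $r(x) = \lambda\|x\|_1$); step size, iteration count, and data are illustrative, not tuned.

```python
import numpy as np

def prox_grad_lasso(A, b, lam, iters=500):
    """Proximal-gradient (ISTA-style) sketch: x_{k+1} = prox_{a*r}(x_k - a*grad l(x_k))."""
    alpha = 1.0 / np.linalg.norm(A, 2) ** 2        # step 1/L with L = ||A||_2^2 (Lipschitz const)
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        z = x - alpha * A.T @ (A @ x - b)          # gradient step on the smooth part
        x = np.sign(z) * np.maximum(np.abs(z) - alpha * lam, 0.0)   # prox of alpha*lam*||.||_1
    return x

# tiny usage example on synthetic data
rng = np.random.default_rng(0)
A = rng.standard_normal((40, 100))
x_true = np.zeros(100); x_true[:5] = 1.0
b = A @ x_true + 0.01 * rng.standard_normal(40)
print(np.round(prox_grad_lasso(A, b, lam=0.1)[:8], 3))
```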

Convergence

Proximal-gradient works, why?
$x^{k+1} = \operatorname{prox}_{\alpha_k h}(x^k - \alpha_k \nabla f(x^k))$, i.e., $x^{k+1} = x^k - \alpha_k G_{\alpha_k}(x^k)$.
Gradient mapping: the gradient-like object
$G_{\alpha}(x) = \frac{1}{\alpha}\bigl(x - \operatorname{prox}_{\alpha h}(x - \alpha \nabla f(x))\bigr)$
Our lemma shows: $G_{\alpha}(x) = 0$ if and only if $x$ is optimal, so $G_{\alpha}$ is analogous to $\nabla f$.
If $x$ is locally optimal, then $G_{\alpha}(x) = 0$ (nonconvex $f$).
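A small sketch of the gradient mapping for the same lasso setup (a hypothetical helper, reusing the soft-thresholding prox); at a minimizer its norm should be numerically zero.

```python
import numpy as np

def gradient_mapping(x, A, b, lam, alpha):
    # G_alpha(x) = (x - prox_{alpha*h}(x - alpha*grad f(x))) / alpha,
    # here for f(x) = 0.5*||Ax - b||^2 and h(x) = lam*||x||_1
    z = x - alpha * A.T @ (A @ x - b)
    prox = np.sign(z) * np.maximum(np.abs(z) - alpha * lam, 0.0)
    return (x - prox) / alpha
```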

Convergence analysis
Assumption: Lipschitz continuous gradient, denoted $f \in C^1_L$:
$\|\nabla f(x) - \nabla f(y)\|_2 \le L\|x - y\|_2$
Gradient vectors of nearby points are close to each other; the objective function has bounded curvature; the speed at which the gradient varies is bounded.
Lemma (Descent). Let $f \in C^1_L$. Then
$f(y) \le f(x) + \langle \nabla f(x),\, y - x\rangle + \frac{L}{2}\|y - x\|_2^2$
For convex $f$, compare with $f(y) \ge f(x) + \langle \nabla f(x),\, y - x\rangle$.

Descent lemma
Proof. Since $f \in C^1_L$, by Taylor's theorem, for the vector $z_t = x + t(y - x)$ we have
$f(y) = f(x) + \int_0^1 \langle \nabla f(z_t),\, y - x\rangle\, dt$.
Adding and subtracting $\langle \nabla f(x),\, y - x\rangle$ on the rhs,
$f(y) - f(x) - \langle \nabla f(x),\, y - x\rangle = \int_0^1 \langle \nabla f(z_t) - \nabla f(x),\, y - x\rangle\, dt$
$\bigl|f(y) - f(x) - \langle \nabla f(x),\, y - x\rangle\bigr| \le \int_0^1 \bigl|\langle \nabla f(z_t) - \nabla f(x),\, y - x\rangle\bigr|\, dt$
$\le \int_0^1 \|\nabla f(z_t) - \nabla f(x)\|_2\, \|y - x\|_2\, dt \le L \int_0^1 t\, \|x - y\|_2^2\, dt = \frac{L}{2}\|x - y\|_2^2$.
This bounds $f(y)$ around $x$ with quadratic functions.

Descent lemma corollary
Let $y = x - \alpha G_{\alpha}(x)$. Then
$f(y) \le f(x) + \langle \nabla f(x),\, y - x\rangle + \frac{L}{2}\|y - x\|_2^2 = f(x) - \alpha\langle \nabla f(x),\, G_{\alpha}(x)\rangle + \frac{\alpha^2 L}{2}\|G_{\alpha}(x)\|_2^2$.
Corollary. So if $0 \le \alpha \le 1/L$, we have
$f(y) \le f(x) - \alpha\langle \nabla f(x),\, G_{\alpha}(x)\rangle + \frac{\alpha}{2}\|G_{\alpha}(x)\|_2^2$.
Lemma. Let $y = x - \alpha G_{\alpha}(x)$. Then, for any $z$ we have
$f(y) + h(y) \le f(z) + h(z) + \langle G_{\alpha}(x),\, x - z\rangle - \frac{\alpha}{2}\|G_{\alpha}(x)\|_2^2$.
Exercise: Prove! (Hint: $f$, $h$ are convex and $G_{\alpha}(x) - \nabla f(x) \in \partial h(y)$.)

Convergence analysis
We've actually shown that $x^+ = x - \alpha G_{\alpha}(x)$ is a descent method. Write $\phi = f + h$; plugging $z = x$ into the lemma gives
$\phi(x^+) \le \phi(x) - \frac{\alpha}{2}\|G_{\alpha}(x)\|_2^2$.
Exercise: Why does this inequality suffice to show convergence?
Use $z = x^*$ to obtain progress in terms of the iterates:
$\phi(x^+) - \phi^* \le \langle G_{\alpha}(x),\, x - x^*\rangle - \frac{\alpha}{2}\|G_{\alpha}(x)\|_2^2$
$= \frac{1}{2\alpha}\bigl[2\alpha\langle G_{\alpha}(x),\, x - x^*\rangle - \alpha^2\|G_{\alpha}(x)\|_2^2\bigr]$
$= \frac{1}{2\alpha}\bigl[\|x - x^*\|_2^2 - \|x - x^* - \alpha G_{\alpha}(x)\|_2^2\bigr]$
$= \frac{1}{2\alpha}\bigl[\|x - x^*\|_2^2 - \|x^+ - x^*\|_2^2\bigr]$.

Convergence rate
Set $x \leftarrow x_k$, $x^+ \leftarrow x_{k+1}$, and $\alpha = 1/L$. Then sum:
$\sum_{i=1}^{k+1}\bigl(\phi(x_i) - \phi^*\bigr) \le \frac{L}{2}\sum_{i=1}^{k+1}\bigl[\|x_i - x^*\|_2^2 - \|x_{i+1} - x^*\|_2^2\bigr] \le \frac{L}{2}\|x_1 - x^*\|_2^2$ (telescoping sum).
Since $(\phi(x_k))$ is a decreasing sequence, it follows that
$\phi(x_{k+1}) - \phi^* \le \frac{1}{k+1}\sum_{i=1}^{k+1}\bigl(\phi(x_i) - \phi^*\bigr) \le \frac{L}{2(k+1)}\|x_1 - x^*\|_2^2$.
This is the well-known $O(1/k)$ rate. But for $C^1_L$ convex functions, the optimal rate is $O(1/k^2)$!

Accelerated Proximal Gradient
$\min\ \phi(x) = f(x) + h(x)$. Let $x^0 = y^0 \in \operatorname{dom} h$. For $k \ge 1$:
$x^k = \operatorname{prox}_{\alpha_k h}(y^{k-1} - \alpha_k \nabla f(y^{k-1}))$
$y^k = x^k + \frac{k-1}{k+2}(x^k - x^{k-1})$.
Framework due to Nesterov (1983, 2004); also Beck, Teboulle (2009). Simplified analysis: Tseng (2008).
Uses extra memory for interpolation; same computational cost as ordinary prox-grad; convergence rate theoretically optimal:
$\phi(x^k) - \phi^* \le \frac{2L}{(k+1)^2}\|x^0 - x^*\|_2^2$.
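A hedged sketch of the accelerated scheme for the same lasso instance (step size $1/L$, momentum coefficient $\frac{k-1}{k+2}$ as above; names and data are illustrative).

```python
import numpy as np

def accel_prox_grad_lasso(A, b, lam, iters=500):
    """Accelerated proximal gradient sketch for 0.5*||Ax - b||^2 + lam*||x||_1."""
    alpha = 1.0 / np.linalg.norm(A, 2) ** 2        # step 1/L
    x_prev = x = y = np.zeros(A.shape[1])
    for k in range(1, iters + 1):
        z = y - alpha * A.T @ (A @ y - b)          # gradient step at the interpolated point y
        x = np.sign(z) * np.maximum(np.abs(z) - alpha * lam, 0.0)   # prox step
        y = x + (k - 1) / (k + 2) * (x - x_prev)   # momentum / interpolation step
        x_prev = x
    return x
```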

Proximal splitting methods
$\ell(x) + f(x) + h(x)$
Direct use of prox-grad is not easy: it requires computing $\operatorname{prox}_{\lambda(f+h)}$, i.e., $(I + \lambda(\partial f + \partial h))^{-1}$.
Example:
$\min\ \frac{1}{2}\|x - y\|_2^2 + \underbrace{\lambda\|x\|_2}_{f(x)} + \underbrace{\mu\sum_{i=1}^{n-1}|x_{i+1} - x_i|}_{h(x)}$
But a good feature: $\operatorname{prox}_f$ and $\operatorname{prox}_h$ are separately easier. Can we exploit that?

Proximal splitting: operator notation
$(I + \partial f + \partial h)^{-1}$ is hard, but $(I + \partial f)^{-1}$ and $(I + \partial h)^{-1}$ are easy.
Let us derive a fixed-point equation that splits the operators.
Assume we are solving $\min f(x) + h(x)$, where both $f$ and $h$ are convex but potentially nondifferentiable.
Notice: we implicitly assumed $\partial(f + h) = \partial f + \partial h$.

Proximal splitting
$0 \in \partial f(x) + \partial h(x)$
$\Leftrightarrow 2x \in (I + \partial f)(x) + (I + \partial h)(x)$
Key idea of splitting: a new variable!
$z \in (I + \partial h)(x) \;\Rightarrow\; x = \operatorname{prox}_h(z)$
$2x - z \in (I + \partial f)(x) \;\Rightarrow\; x = (I + \partial f)^{-1}(2x - z)$
Not a fixed-point equation yet; we need one more idea.

Douglas-Rachford splitting
Reflection operator: $R_h(z) := 2\operatorname{prox}_h(z) - z$.
Douglas-Rachford method: with $z \in (I + \partial h)(x)$ and $x = \operatorname{prox}_h(z)$, we have $R_h(z) = 2x - z$.
$0 \in \partial f(x) + \partial h(x) \;\Leftrightarrow\; 2x \in (I + \partial f)(x) + (I + \partial h)(x) \;\Leftrightarrow\; 2x - z \in (I + \partial f)(x) \;\Leftrightarrow\; x = \operatorname{prox}_f(R_h(z))$
But $R_h(z) = 2x - z$ implies $z = 2x - R_h(z)$, hence
$z = 2\operatorname{prox}_f(R_h(z)) - R_h(z) = R_f(R_h(z))$.
Finally, $z$ is on both sides of the equation: a fixed-point equation.

Douglas-Rachford method
$0 \in \partial f(x) + \partial h(x)$, with the fixed-point characterization $x = \operatorname{prox}_h(z)$, $z = R_f(R_h(z))$.
DR method: given $z^0$, iterate for $k \ge 0$:
$x^k = \operatorname{prox}_h(z^k)$
$v^k = \operatorname{prox}_f(2x^k - z^k)$
$z^{k+1} = z^k + \gamma_k(v^k - x^k)$
Theorem. If $f + h$ admits minimizers, and $(\gamma_k)$ satisfies $\gamma_k \in [0, 2]$ and $\sum_k \gamma_k(2 - \gamma_k) = \infty$, then the DR iterates $v^k$ and $x^k$ converge to a minimizer.
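A minimal sketch of the DR iteration, written generically against two user-supplied prox operators (assumed exact); the toy usage below minimizes $\frac{1}{2}\|x\|_1$ plus the indicator of the box $[1, 2]^n$, purely for illustration.

```python
import numpy as np

def douglas_rachford(prox_f, prox_h, z0, gamma=1.0, iters=200):
    """DR: x_k = prox_h(z_k); v_k = prox_f(2*x_k - z_k); z_{k+1} = z_k + gamma*(v_k - x_k)."""
    z = z0.astype(float).copy()
    for _ in range(iters):
        x = prox_h(z)
        v = prox_f(2 * x - z)
        z = z + gamma * (v - x)
    return prox_h(z)            # the x-iterates converge to a minimizer of f + h

prox_f = lambda z: np.sign(z) * np.maximum(np.abs(z) - 0.5, 0.0)   # f = 0.5*||x||_1
prox_h = lambda z: np.clip(z, 1.0, 2.0)                            # h = indicator of [1, 2]^n
print(douglas_rachford(prox_f, prox_h, z0=np.zeros(3)))            # -> [1. 1. 1.]
```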

Douglas-Rachford method
For $\gamma_k = 1$, we have $z^{k+1} = z^k + v^k - x^k$, i.e.,
$z^{k+1} = z^k + \operatorname{prox}_f\bigl(2\operatorname{prox}_h(z^k) - z^k\bigr) - \operatorname{prox}_h(z^k)$.
Dropping superscripts and writing $P$ for prox, we have $z \leftarrow Tz$ with
$T = I + P_f \circ (2P_h - I) - P_h$
Lemma. DR can be written as $z \leftarrow \frac{1}{2}(R_f R_h + I)z$, where $R_f$ denotes the reflection operator $2P_f - I$ (similarly $R_h$). Exercise: Prove this claim.
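A quick numerical check of the lemma, using the same two illustrative prox operators as in the DR sketch above.

```python
import numpy as np

prox_f = lambda z: np.sign(z) * np.maximum(np.abs(z) - 0.5, 0.0)   # illustrative prox_f
prox_h = lambda z: np.clip(z, 1.0, 2.0)                            # illustrative prox_h
R = lambda prox, z: 2 * prox(z) - z                                # reflection operator 2P - I

z = np.array([0.3, -1.2, 2.5])
Tz   = z + prox_f(R(prox_h, z)) - prox_h(z)    # T = I + P_f(2P_h - I) - P_h
half = 0.5 * (R(prox_f, R(prox_h, z)) + z)     # 0.5*(R_f R_h + I)
print(np.allclose(Tz, half))                   # True
```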

Proximal methods cornucopia
Douglas-Rachford splitting; ADMM (a special case of DR applied to the dual); Proximal-Dykstra; proximal methods for $f_1 + f_2 + \cdots + f_n$; Peaceman-Rachford; proximal quasi-Newton and Newton methods; and many other variations...

Best approximation problem
$\min\ \delta_A(x) + \delta_B(x)$ where $A \cap B = \emptyset$. Can we use DR?
Using a clever analysis of Bauschke & Combettes (2004), DR can still be applied! However, it generates diverging iterates, which can be projected back to obtain a solution to
$\min\ \|a - b\|_2$ s.t. $a \in A,\ b \in B$.
See Jegelka, Bach, Sra (NIPS 2013) for an example.

ADMM
Consider a separable objective with constraints:
$\min\ f(x) + g(z)$ s.t. $Ax + Bz = c$.
The objective function separates into $x$ and $z$ variables; the constraint prevents a trivial decoupling.
Introduce the augmented Lagrangian (AL):
$L_{\rho}(x, z, y) := f(x) + g(z) + y^T(Ax + Bz - c) + \frac{\rho}{2}\|Ax + Bz - c\|_2^2$
Now apply a Gauss-Seidel style update to the AL:
$x^{k+1} = \operatorname{argmin}_x\ L_{\rho}(x, z^k, y^k)$
$z^{k+1} = \operatorname{argmin}_z\ L_{\rho}(x^{k+1}, z, y^k)$
$y^{k+1} = y^k + \rho(Ax^{k+1} + Bz^{k+1} - c)$
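A hedged sketch of these updates for the lasso split $f(x) = \frac{1}{2}\|Ax - b\|^2$, $g(z) = \lambda\|z\|_1$ with constraint $x - z = 0$ (so $A = I$, $B = -I$, $c = 0$ in the template above); it uses the common scaled dual variable $u = y/\rho$, which is equivalent to the $y$-update shown.

```python
import numpy as np

def admm_lasso(A, b, lam, rho=1.0, iters=200):
    """ADMM sketch for 0.5*||Ax - b||^2 + lam*||z||_1 subject to x - z = 0."""
    n = A.shape[1]
    x = z = u = np.zeros(n)                         # u = y / rho (scaled dual variable)
    M = A.T @ A + rho * np.eye(n)                   # x-update solves a ridge-type linear system
    Atb = A.T @ b
    for _ in range(iters):
        x = np.linalg.solve(M, Atb + rho * (z - u))                      # argmin_x L_rho
        z = np.sign(x + u) * np.maximum(np.abs(x + u) - lam / rho, 0.0)  # argmin_z L_rho (soft-threshold)
        u = u + x - z                                                    # scaled dual update
    return z
```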