Lecture 18: March 23

10-725/36-725: Convex Optimization                                Spring 2015
Lecture 18: March 23
Lecturer: Ryan Tibshirani                                Scribes: James Duyck

Note: LaTeX template courtesy of UC Berkeley EECS dept.

Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications. They may be distributed outside this class only with the permission of the Instructor.

These notes refer to the documents algs-table.pdf and midterm-practice.pdf, which were used as reference during the lecture.

18.1 Algorithm Review

We assume all criteria are convex.

18.1.1 Gradient Descent

To add constraints, take a gradient step, then project onto the constraint set. Gradient descent is a special case of the subgradient method. Each iteration computes the gradient and adds a multiple of it to the current point, so the iteration cost is cheap. The gradient is required to be Lipschitz for a convergence rate of O(1/ɛ) iterations to reach accuracy ɛ. There are ways to get faster convergence: if we use acceleration, we can get a rate of O(1/√ɛ), and with strong convexity the convergence is linear, O(log(1/ɛ)). The constants in convergence rates depend on the conditioning, for example of the Hessian, so we could have linear convergence but poor performance in practice due to a large constant.

Q: Did we discuss projection onto a constraint set?
A: Not generically, but for some special cases, for example in Lecture 8. Linear constraints, some easy cones, boxes, and L1 and L2 balls are easy to project onto (there are closed forms or efficient algorithms). For a generic constraint set, the projection is itself a separate optimization problem.

18.1.2 Subgradient Method

This works for any convex criterion, with no further restrictions. Again, we handle constraints by taking a step along a subgradient, then projecting onto the constraint set. We could choose step sizes to be fixed, but then we wouldn't necessarily have a convergent algorithm. A typical choice of step size is O(1/k), where k is the step number. The subgradient method is quite a bit slower than gradient descent, but more general.
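
To make the projection idea concrete, here is a minimal sketch of the projected subgradient method for a small illustrative problem, min ‖Ax − b‖₁ subject to ‖x‖₂ ≤ 1. The problem, data, and step-size constant are assumptions for illustration, not from the lecture; the L2-ball projection is one of the closed forms mentioned above.

    import numpy as np

    def proj_l2_ball(x, r=1.0):
        # Projection onto {x : ||x||_2 <= r}: rescale x if it lies outside the ball.
        nrm = np.linalg.norm(x)
        return x if nrm <= r else (r / nrm) * x

    def projected_subgradient(A, b, iters=1000):
        # min ||Ax - b||_1  subject to  ||x||_2 <= 1   (illustrative problem)
        x = np.zeros(A.shape[1])
        best_f, best_x = np.inf, x
        for k in range(1, iters + 1):
            g = A.T @ np.sign(A @ x - b)          # a subgradient of ||Ax - b||_1
            x = proj_l2_ball(x - (1.0 / k) * g)   # O(1/k) step, then project
            f = np.linalg.norm(A @ x - b, 1)
            if f < best_f:                        # the method is not a descent method,
                best_f, best_x = f, x             # so track the best iterate seen
        return best_x, best_f

Since the subgradient method is not a descent method, the sketch returns the best iterate seen so far rather than the last one.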

18.1.3 Proximal Gradient Descent

For the criterion f = g + h, g is smooth, and h has to be simple so that we can evaluate its prox operator:

    min_x g(x) + h(x),    g smooth, h nonsmooth.                    (18.1)

To add a constraint x ∈ C, fold it into the nonsmooth part using the 0-∞ indicator function I_C:

    min_{x ∈ C} g(x) + h(x)  ⟺  min_x g(x) + h(x) + I_C(x),         (18.2)

so we apply proximal gradient descent with nonsmooth part h + I_C:

    min_x g(x) + [h(x) + I_C(x)].                                   (18.3)

The prox operator of the combined nonsmooth part is

    prox_{h + I_C, t}(x)                                            (18.4)
      = argmin_z (1/(2t)) ‖x − z‖₂² + h(z) + I_C(z)                 (18.5)
      = argmin_{z ∈ C} (1/(2t)) ‖x − z‖₂² + h(z),                   (18.6)

and the update is

    x⁺ = prox_{h + I_C, t}(x − t ∇g(x)).                            (18.7)

The prox operator of h does not necessarily give the prox operator of h + I_C. For example, if h is the L1 norm, prox_h is soft-thresholding (a lasso sketch follows Section 18.1.4 below). If the constraint is z ≥ 0, there is an easy fix: soft-threshold, then set negative entries to 0. But if C is some polyhedron, the prox might be much harder to evaluate; you might have to use iterative optimization in an inner loop. Sometimes the prox operator decomposes, or decomposes approximately, as mentioned in recent NIPS papers.

We tend to use prox methods when the prox operator is moderately cheap (e.g., soft-thresholding costs O(n) for a length-n input). For expensive methods like Newton's method, we are willing to pay the price of expensive iterations because they buy fast convergence rates; an expensive prox operator does not.

Q: For gradient and proximal gradient, do we have the same convergence rate?
A: Yes, but remember that for proximal gradient descent we have to evaluate both the gradient and the prox operator on each iteration.

18.1.4 Newton's Method

Add equality constraints by solving a bigger linear system (called the KKT matrix). We typically choose the step size to be 1 (pure Newton's method), but this may not converge; the safest way is to use backtracking line search. For each iteration, we have to compute the Hessian and solve a linear system in it (possibly O(n³)). This may be cheaper for a structured (e.g., banded) Hessian. The convergence rate is O(log log(1/ɛ)), which is about 5 when ɛ = 2⁻³². Remember that the quadratic O(log log(1/ɛ)) convergence rate is only local. The global convergence rate is linear, O(log(1/ɛ)), but the method usually reaches the locally quadratic region quickly.
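
To make the prox discussion in Section 18.1.3 concrete, here is a minimal sketch of proximal gradient descent for the lasso, min (1/2)‖Ax − b‖₂² + λ‖x‖₁, where the prox of λ‖·‖₁ is exactly the O(n) soft-thresholding mentioned above. The lasso is not an example worked in the lecture; the fixed step size 1/L, with L the Lipschitz constant of ∇g, and the iteration count are illustrative choices.

    import numpy as np

    def soft_threshold(v, tau):
        # prox of tau*||.||_1: shrink each entry toward zero by tau
        return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

    def prox_gradient_lasso(A, b, lam, iters=500):
        # min_x  0.5*||Ax - b||_2^2 + lam*||x||_1
        L = np.linalg.norm(A, 2) ** 2   # Lipschitz constant of grad g: largest singular value squared
        t = 1.0 / L                     # fixed step size
        x = np.zeros(A.shape[1])
        for _ in range(iters):
            grad = A.T @ (A @ x - b)    # gradient of the smooth part g
            x = soft_threshold(x - t * grad, t * lam)   # prox (soft-thresholding) step
        return x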

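Likewise, a hedged sketch of damped Newton's method with backtracking line search, as described in Section 18.1.4. The backtracking parameters α and β and the stopping rule based on the Newton decrement are conventional choices, not values from the lecture.

    import numpy as np

    def newton(f, grad, hess, x0, tol=1e-10, alpha=0.25, beta=0.5):
        # Newton's method with backtracking line search for smooth, strictly convex f.
        x = x0.astype(float)
        while True:
            g, H = grad(x), hess(x)
            d = np.linalg.solve(H, -g)    # Newton direction: the O(n^3) linear solve
            if -(g @ d) / 2 <= tol:       # half the squared Newton decrement is small: stop
                return x
            t = 1.0                       # start from the pure Newton step
            while f(x + t * d) > f(x) + alpha * t * (g @ d):
                t *= beta                 # backtrack until sufficient decrease
            x = x + t * d
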
18.1.5 Quasi-Newton Methods

A quasi-Newton method doesn't solve a linear system in the Hessian; it approximates the Hessian with a sum of low-rank matrices, so the iteration cost is lower than for Newton's method. The convergence rate is superlinear: somewhere between linear and quadratic. Roughly, n quasi-Newton steps are equivalent to one step of Newton's method.

Point: constants typically depend on the problem size n. This can make a big difference to the complexity.

18.1.6 Barrier Method

We assume that the inequality constraints are twice smooth. We handle the equality constraints as in Newton's method. We have to choose a step size for the inner loop (typically by backtracking line search), and an outer-loop barrier parameter, which multiplies the criterion and is increased by a factor (e.g., 10) at each outer iteration. The iteration is very expensive because we have to solve another optimization problem (by Newton's method) on each outer iteration. The convergence rate is linear, O(log(1/ɛ)), in the outer loop, and also in the total number of Newton steps (under enough conditions): we can use a constant number of Newton steps on each outer iteration to get a desired accuracy. (A minimal code sketch of this outer/inner structure follows Section 18.1.7 below.)

Q: The barrier method doesn't require twice-smooth constraints if we don't use Newton's method on the inner loop?
A: Using something like gradient descent on the inner loop would require a huge number of iterations to get the accuracy required. Newton's method can get close to the central path in few steps.

18.1.7 Primal-Dual IPM

Compared to the barrier method, there is no outer or inner loop, and there are the same parameters as in the barrier method. Each iteration is a single Newton step, though that step can be quite large.

There's not one answer to what you should use!
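
As promised above, here is a minimal log-barrier method for an inequality-constrained LP, min cᵀx subject to Ax ≤ b, with a Newton inner loop on t·cᵀx − Σᵢ log(bᵢ − aᵢᵀx). This is a sketch under stated assumptions: the starting point must be strictly feasible, the parameter choices (t = 1, μ = 10, the m/t stopping rule) are standard illustrative defaults, and a full implementation would also enforce sufficient decrease in the line search.

    import numpy as np

    def barrier_lp(c, A, b, x, t=1.0, mu=10.0, eps=1e-8):
        # min c^T x  s.t.  Ax <= b, for strictly feasible x (A @ x < b componentwise).
        m = A.shape[0]
        while m / t > eps:                         # m/t bounds the duality gap on the central path
            for _ in range(50):                    # inner loop: Newton on the centering problem
                s = b - A @ x                      # slacks, must stay strictly positive
                g = t * c + A.T @ (1.0 / s)        # gradient of t*c^T x - sum(log(b - Ax))
                H = A.T @ np.diag(1.0 / s**2) @ A  # Hessian of the barrier term
                d = np.linalg.solve(H, -g)         # Newton direction
                if -(g @ d) / 2 <= 1e-10:          # centered enough for this t
                    break
                step = 1.0
                while np.any(b - A @ (x + step * d) <= 0):
                    step *= 0.5                    # backtrack to keep the log terms defined
                x = x + step * d
            t *= mu                                # outer loop: increase the barrier parameter
        return x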

18.2 Midterm Practice

18.2.1 Problem 6

Derive the dual of

    min_{x ∈ ℝⁿ} Σ_{i=1}^N ‖A_i x + b_i‖₂ + (1/2)‖x‖₂².            (18.8)

For the problem as stated, without constraints, the dual function is a constant equal to f*, the optimal criterion value, from the minimum of the Lagrangian (which is just the criterion), but that's not very useful. You make a substitution that doesn't change the problem but allows you to find a dual. A matrix composed with a nonsmooth function is difficult to handle directly, so we want a substitution that pulls out the matrix. There are multiple ways to derive different duals, but they are equivalent.

Let A_i x = z_i:

    min_{x ∈ ℝⁿ, z} Σ_{i=1}^N ‖z_i + b_i‖₂ + (1/2)‖x‖₂²
    subject to A_i x = z_i,  i = 1, …, N.                           (18.9)

We associate one dual variable u_i with each of the constraints. Then we can write the Lagrangian

    L(x, z, u) = Σ_{i=1}^N ( ‖z_i + b_i‖₂ + u_iᵀ(z_i − A_i x) ) + (1/2)‖x‖₂².   (18.10)

We minimize the Lagrangian over the primal variables to find the dual function. Rearrange so that we separate terms that only involve z_i from terms that only involve x:

    min_{x,z} L(x, z, u) = Σ_{i=1}^N min_{z_i} ( ‖z_i + b_i‖₂ + u_iᵀ z_i )
                           + min_x ( (1/2)‖x‖₂² − (Σ_{i=1}^N A_iᵀ u_i)ᵀ x ).    (18.11)

Minimize the terms involving x (set the gradient to zero, so x = Σ_i A_iᵀ u_i):

    min_x (1/2)‖x‖₂² − (Σ_{i=1}^N A_iᵀ u_i)ᵀ x = −(1/2) ‖Σ_{i=1}^N A_iᵀ u_i‖₂².  (18.12)

We want to minimize the terms involving z_i. We know how to minimize ‖y_i‖₂ + u_iᵀ y_i, so we substitute y_i = z_i + b_i:

    min_{z_i} ‖z_i + b_i‖₂ + u_iᵀ z_i                               (18.13)
      = min_{y_i} ‖y_i‖₂ + u_iᵀ y_i − u_iᵀ b_i                      (18.14)
      = −I{‖u_i‖_* ≤ 1} − u_iᵀ b_i                                  (18.15)
      = −I{‖u_i‖₂ ≤ 1} − u_iᵀ b_i,                                  (18.16)

where I{·} is 0 when the condition holds and ∞ otherwise, and the last step uses that the 2-norm is its own dual norm. The indicators can be written as constraints. Then the dual problem is

    max_u − Σ_{i=1}^N u_iᵀ b_i − (1/2) ‖Σ_{i=1}^N A_iᵀ u_i‖₂²       (18.18)
    subject to ‖u_i‖₂ ≤ 1,  i = 1, …, N.                            (18.19)
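
A quick numerical sanity check of this derivation: solve the primal (18.8) and the dual (18.18)-(18.19) on random data and compare optimal values. This is an illustrative script (not part of the notes) using the CVXPY modeling package; the data and sizes are arbitrary.

    import cvxpy as cp
    import numpy as np

    rng = np.random.default_rng(0)
    N, m, n = 3, 4, 5
    A = [rng.standard_normal((m, n)) for _ in range(N)]
    b = [rng.standard_normal(m) for _ in range(N)]

    # Primal (18.8)
    x = cp.Variable(n)
    primal = cp.Problem(cp.Minimize(
        sum(cp.norm(A[i] @ x + b[i], 2) for i in range(N)) + 0.5 * cp.sum_squares(x)))
    primal.solve()

    # Dual (18.18)-(18.19)
    u = [cp.Variable(m) for _ in range(N)]
    dual = cp.Problem(
        cp.Maximize(-sum(u[i] @ b[i] for i in range(N))
                    - 0.5 * cp.sum_squares(sum(A[i].T @ u[i] for i in range(N)))),
        [cp.norm(u[i], 2) <= 1 for i in range(N)])
    dual.solve()

    print(primal.value, dual.value)   # the two optimal values should agree (strong duality)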

18.2.2 Problem 1

Write down the dual of

    min_{y₁, y₂} −4y₁ + 2y₂                                         (18.20)
    subject to y₁ − y₂ ≤ −2                                         (18.21)
               −y₁ + y₂ ≤ −1                                        (18.22)
               y₁, y₂ ≥ 0.                                          (18.23)

First note the primal is infeasible: if we add the two constraints, we get 0 ≤ −3. If we minimize over an empty feasible set, the optimal criterion value is +∞.

We want to write the dual and specify the duality gap. We have strong duality (duality gap 0) if Slater's condition holds for the primal constraints or, equivalently, for the dual constraints. For a linear program, this holds if either problem is feasible; the only way in which the duality gap is nonzero is if both the primal and the dual are infeasible. So the duality gap here is infinite if the dual is also infeasible, and is 0 otherwise.

There are multiple ways to solve this. You can assign a multiplier to each constraint and add them to bound the criterion, you can use the Lagrangian, or you can think of the problem as a general-form LP. First find the Lagrangian, writing the constraints as

    y₁ − y₂ + 2 ≤ 0                                                 (18.24)
    −y₁ + y₂ + 1 ≤ 0                                                (18.25)
    −y₁ ≤ 0                                                         (18.26)
    −y₂ ≤ 0.                                                        (18.27)

Then

    L(y₁, y₂, u₁, …, u₄) = −4y₁ + 2y₂ + u₁(y₁ − y₂ + 2) + u₂(−y₁ + y₂ + 1) − u₃y₁ − u₄y₂   (18.28)
                         = y₁(−4 + u₁ − u₂ − u₃) + y₂(2 − u₁ + u₂ − u₄) + 2u₁ + u₂.        (18.29)

The Lagrangian is linear in y₁ and y₂, so when we minimize over y₁ and y₂ to find the dual function, the result is undefined (−∞) if the coefficient on y₁ or y₂ is nonzero. The dual therefore has the two constraints that the coefficients on y₁ and y₂ are zero:

    max_{u₁, …, u₄} 2u₁ + u₂                                        (18.30)
    subject to −4 + u₁ − u₂ − u₃ = 0                                (18.31)
               2 − u₁ + u₂ − u₄ = 0                                 (18.32)
               u₁, …, u₄ ≥ 0.                                       (18.33)

Add the top two constraints together:

    −2 − u₃ − u₄ = 0                                                (18.34)
    u₃ + u₄ = −2,                                                   (18.35)

which is impossible with u₃, u₄ ≥ 0. So the dual is also infeasible, and the duality gap is infinite.
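
As a quick check of this conclusion, an illustrative script (not part of the notes) using scipy.optimize.linprog reports both the primal and the dual as infeasible (status code 2):

    from scipy.optimize import linprog

    # Primal (18.20)-(18.23): min -4*y1 + 2*y2, inequality constraints, y >= 0 (default bounds)
    primal = linprog(c=[-4, 2], A_ub=[[1, -1], [-1, 1]], b_ub=[-2, -1])

    # Dual (18.30)-(18.33) as a minimization: min -(2*u1 + u2), equality constraints, u >= 0
    dual = linprog(c=[-2, -1, 0, 0],
                   A_eq=[[1, -1, -1, 0], [-1, 1, 0, -1]],
                   b_eq=[4, -2])

    print(primal.status, dual.status)   # 2 2 -> both infeasible, so the duality gap is infinite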