Convex Programs. COMPSCI 371D Machine Learning.


Convex Programs
COMPSCI 371D Machine Learning
COMPSCI 371D Machine Learning Convex Programs 1 / 21

Logistic Regression → Support Vector Machines
Support Vector Machines (SVMs) and Convex Programs
- SVMs are linear predictors
- Defined for both regression and classification; multi-class versions exist
- We will cover only binary SVM classification
- We'll need some new math: Convex Programs
  - Optimization of convex functions with affine constraints (equalities and inequalities)
  - Lagrange multipliers, but for inequalities
COMPSCI 371D Machine Learning Convex Programs 2 / 21

Logistic Regression → Support Vector Machines
Outline
1. Logistic Regression → Support Vector Machines
2. Local Convex Minimization → Convex Programs
3. Shape of the Solution Set
4. Geometry: Closed, Convex Polyhedral Cones
5. Cone Duality
6. The KKT Conditions
7. Lagrangian Duality
COMPSCI 371D Machine Learning Convex Programs 3 / 21

Logistic Regression → Support Vector Machines
Logistic Regression → SVMs
- A logistic-regression classifier places the decision boundary somewhere (and approximately) between the two classes
  - Loss is never zero!
  - The exact location of the boundary can be determined by samples that are very distant from the boundary (even on the correct side of it)
- SVMs place the boundary exactly half-way between the two classes (with exceptions to allow for classes that are not linearly separable)
  - Only samples close to the boundary matter: these are the support vectors
- SVMs are effectively immune from the curse of dimensionality
- A kernel trick allows going beyond linear classifiers
COMPSCI 371D Machine Learning Convex Programs 4 / 21

Local Convex Minimization → Convex Programs
Local Convex Minimization → Convex Programs
- Convex function $f(u) : \mathbb{R}^m \to \mathbb{R}$
- $f$ differentiable, with continuous first derivatives
- Unconstrained minimization: $u^* = \arg\min_{u \in \mathbb{R}^m} f(u)$
- Constrained minimization: $u^* = \arg\min_{u \in C} f(u)$ with
  $C = \{u \in \mathbb{R}^m : Au + b \ge 0\}$
  ($k$ affine constraints: $A$ is a $k \times m$ matrix and $b \in \mathbb{R}^k$)
- $f$ is a convex function
- $C$ is a convex set: if $u, v \in C$, then $tu + (1-t)v \in C$ for $t \in [0, 1]$
- This specific $C$ is bounded by hyperplanes
- This is a convex program
COMPSCI 371D Machine Learning Convex Programs 5 / 21
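For concreteness, here is a minimal Python sketch of such a convex program solved numerically. The quadratic objective and the triangular feasible set are made-up toy choices, and SciPy's SLSQP solver simply stands in for whatever optimizer one would actually use.

```python
# A minimal sketch (not the course's code): solve a tiny convex program
#   min_u f(u)  subject to  A u + b >= 0
import numpy as np
from scipy.optimize import minimize

# Convex objective: squared distance to the point (2, 1)
def f(u):
    return (u[0] - 2.0) ** 2 + (u[1] - 1.0) ** 2

# Feasible set C = {u : A u + b >= 0}: the triangle u1 >= 0, u2 >= 0, u1 + u2 <= 1
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [-1.0, -1.0]])
b = np.array([0.0, 0.0, 1.0])

constraints = {"type": "ineq", "fun": lambda u: A @ u + b}   # componentwise >= 0
res = minimize(f, x0=np.zeros(2), method="SLSQP", constraints=[constraints])

print("u* =", res.x)      # the constrained minimizer
print("f(u*) =", res.fun)
```

In this toy instance the unconstrained minimizer $(2, 1)$ lies outside $C$, so the constrained solution lands on the boundary, at $u^* = (1, 0)$.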

Shape of the Solution Set
Shape of the Solution Set
- Just as for the unconstrained problem:
  - There is one $f^*$, but there can be multiple $u^*$ (a flat valley)
  - The set of solution points $u^*$ is convex
  - If $f$ is strictly convex at $u^*$, then $u^*$ is the unique solution point
COMPSCI 371D Machine Learning Convex Programs 6 / 21

Shape of the Solution Set
Zero Gradient → KKT Conditions
- For the unconstrained problem, the solution is characterized by $\nabla f(u^*) = 0$
- Constraints can generate new minima and maxima
- Example: $f(u) = e^u$
  [Three plots of $f(u) = e^u$ over $0 \le u \le 1$, showing the minima and maxima created by different active constraints]
- What is the new characterization? The Karush-Kuhn-Tucker conditions, necessary and sufficient
COMPSCI 371D Machine Learning Convex Programs 7 / 21
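A quick numerical illustration of this point (my own toy check, not from the slides): $e^u$ has no stationary point, yet on the constrained interval $0 \le u \le 1$ a minimum and a maximum appear at the endpoints.

```python
# f(u) = e^u: grad f never vanishes, so the unconstrained condition grad f(u*) = 0
# can never hold; the constraints 0 <= u <= 1 nonetheless create a minimum at u = 0.
import numpy as np
from scipy.optimize import minimize_scalar

f = np.exp
res = minimize_scalar(f, bounds=(0.0, 1.0), method="bounded")
print("constrained minimizer:", res.x)    # ~0, where the constraint u >= 0 is active
print("gradient there:", np.exp(res.x))   # ~1, not zero: we need a new characterization
```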

Geometry: Closed, Convex Polyhedral Cones
Geometry of C
- The neighborhoods of points of $\mathbb{R}^m$ all look the same
- Not so for $C$: different points look different
- Example, $m = 2$: $C$ is a convex polygon (exercise: prove that it is convex)
  - View from the interior: same as $\mathbb{R}^2$
  - View from a side: $C$ looks like a half-plane
  - View from a corner: $C$ looks like an angle
COMPSCI 371D Machine Learning Convex Programs 8 / 21

Geometry: Closed, Convex Polyhedral Cones
Active Constraints
- A constraint $c_i(u) \ge 0$ is active at $u$ if $c_i(u) = 0$: $u$ touches that constraint
- The active set: $\mathcal{A}(u) = \{i : c_i(u) = 0\}$
  [Figure: a closed convex polygon $C$ bounded by numbered constraints, with example points having $\mathcal{A}(u) = \{\}$, $\mathcal{A}(u) = \{2\}$, and $\mathcal{A}(u) = \{2, 3\}$; the point shown in black is infeasible, the others are feasible]
COMPSCI 371D Machine Learning Convex Programs 9 / 21
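A small helper of the kind one might write to compute the active set numerically. The affine form $c(u) = Au + b$, the helper name, and the tolerance are my own choices.

```python
# Compute the active set A(u) = {i : c_i(u) = 0} for affine constraints c(u) = A u + b >= 0.
import numpy as np

def active_set(u, A, b, tol=1e-9):
    c = A @ u + b
    if np.any(c < -tol):
        raise ValueError("u is infeasible: some c_i(u) < 0")
    return {i for i, ci in enumerate(c) if abs(ci) <= tol}

# Unit square 0 <= u1 <= 1, 0 <= u2 <= 1 written as A u + b >= 0
A = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
b = np.array([0.0, 0.0, 1.0, 1.0])
print(active_set(np.array([0.5, 0.5]), A, b))  # set(): interior point
print(active_set(np.array([0.0, 0.5]), A, b))  # {0}: on one side
print(active_set(np.array([0.0, 0.0]), A, b))  # {0, 1}: at a corner
```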

Geometry: Closed, Convex Polyhedral Cones
Closed, Convex Polyhedral (CCP) Cones
[Figure: the feasible set viewed from a corner point $u_0$, with hyperplane normals $a_1, a_2$ and generator rays $r_1, r_2$; near $u_0$ the set looks like a cone]
- View from $u_0$: move the origin to $u_0$
- $\{u \in \mathbb{R}^2 : a_1^T u \ge 0,\ a_2^T u \ge 0\}$ (implicit representation)
- $\{u \in \mathbb{R}^2 : u = \alpha_1 r_1 + \alpha_2 r_2 \text{ with } \alpha_1, \alpha_2 \ge 0\}$ (parametric representation)
- Both representations always exist (Farkas-Minkowski-Weyl)
- The number of hyperplane normals $a_i$ and of generators $r_j$ is not always the same; conversion between the two is typically complex. We won't need it.
COMPSCI 371D Machine Learning Convex Programs 10 / 21
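The agreement between the two representations can be probed numerically. The sketch below uses a made-up 2D cone (generators $r_1 = (1,0)$, $r_2 = (1,1)$ with the matching inward normals) and tests membership both ways.

```python
# Check that the implicit and parametric representations describe the same 2D cone.
import numpy as np
from scipy.optimize import nnls

R = np.array([[1.0, 1.0],
              [0.0, 1.0]])        # columns are the generators r1, r2
A = np.array([[0.0, 1.0],
              [1.0, -1.0]])       # rows are the inward normals a1, a2

def in_cone_implicit(u, tol=1e-9):
    return bool(np.all(A @ u >= -tol))           # a_i^T u >= 0 for all i

def in_cone_parametric(u, tol=1e-9):
    alpha, residual = nnls(R, u)                 # u ~ R alpha with alpha >= 0
    return residual <= tol

for u in [np.array([2.0, 1.0]), np.array([-1.0, 0.5]), np.array([3.0, 0.0])]:
    print(u, in_cone_implicit(u), in_cone_parametric(u))   # the two answers agree
```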

Geometry: Closed, Convex Polyhedral Cones
Lines, Half Lines, Half-Planes, Planes
- Line in $\mathbb{R}^2$: $a^T u \ge 0$, $-a^T u \ge 0$; or $u = \alpha_1 r + \alpha_2 (-r)$ with $\alpha_1, \alpha_2 \ge 0$
- Half line: $a^T u \ge 0$, $-a^T u \ge 0$, $r^T u \ge 0$; or $u = \alpha r$ with $\alpha \ge 0$
- Half plane: $a^T u \ge 0$; or $u = \alpha_1 r + \alpha_2 (-r) + \alpha_3 a$ (glue two angles together)
- Plane: no inequalities at all; or $u = \alpha_1 r_1 + \alpha_2 r_2 + \alpha_3 r_3$ with $r_1, r_2, r_3$, say, 120 degrees apart (glue three angles together)
- The parametric representation is not unique
- More variety in $\mathbb{R}^3$, much more in $\mathbb{R}^d$
COMPSCI 371D Machine Learning Convex Programs 11 / 21

Geometry: Closed, Convex Polyhedral Cones
General CCP Cones
- Implicit: $P = \{u \in \mathbb{R}^m : p_i^T u \ge 0 \text{ for } i = 1, \ldots, k\}$
- Parametric: $P = \{u \in \mathbb{R}^m : u = \sum_{j=1}^{\ell} \alpha_j r_j \text{ with } \alpha_1, \ldots, \alpha_\ell \ge 0\}$
- $P$ is a cone: if $u \in P$, then $\alpha u \in P$ for any $\alpha \ge 0$
- Both representations always exist (Farkas-Minkowski-Weyl)
- The conversion is algorithmically complex (the "representation conversion problem")
COMPSCI 371D Machine Learning Convex Programs 12 / 21

Cone Duality
Cone Duality
- $P = \{u \in \mathbb{R}^m : p_i^T u \ge 0 \text{ for } i = 1, \ldots, k\}$
- $D = \{u \in \mathbb{R}^m : u = \sum_{i=1}^{k} \alpha_i p_i \text{ with } \alpha_1, \ldots, \alpha_k \ge 0\}$
- $D$ is the dual of $P$: different cones, same vectors $p_i$, different representations!
- All and only the points in $P$ have a nonnegative inner product with all and only the points in $D$
- $D$ is the dual of $P$, and $P$ is the dual of $D$
COMPSCI 371D Machine Learning Convex Programs 13 / 21
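The nonnegative-inner-product statement is easy to probe numerically. The vectors $p_i$ below are a made-up example, and sampling is my own quick device rather than anything from the slides.

```python
# Every point of P = {u : p_i^T u >= 0} has a nonnegative inner product with
# every point of D = {nonnegative combinations of the p_i}.
import numpy as np

rng = np.random.default_rng(0)
P_rows = np.array([[1.0, 0.0], [1.0, 1.0]])   # the vectors p_i (rows)

# Points of D: nonnegative combinations of the p_i
D_pts = rng.uniform(0, 1, size=(200, 2)) @ P_rows

# Points of P: rejection-sample vectors with p_i^T u >= 0 for all i
samples = rng.uniform(-1, 1, size=(5000, 2))
P_pts = samples[np.all(samples @ P_rows.T >= 0, axis=1)]

print(np.min(P_pts @ D_pts.T))   # >= 0 (up to floating point), as claimed
```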

The KKT Conditions
Necessary and Sufficient Condition for a Minimum
- $u^*$ is a minimum iff $f(u)$ does not decrease when moving away from $u^*$ while staying in $C$
- $f$ is differentiable, so we can look at $\nabla f(u^*)$
- No matter how we choose a direction $s \in \mathbb{R}^m$: if a tiny step away from $u^*$ and along $s$ keeps us in $C$, then $s^T \nabla f(u^*)$ must be $\ge 0$
- If $s \in P = \{s : s^T \nabla c_i(u^*) \ge 0 \text{ for } i \in \mathcal{A}(u^*)\}$, then $s^T \nabla f(u^*)$ must be $\ge 0$
- So $\nabla f(u^*)$ must have a nonnegative inner product with all vectors in $P$
- That is, $\nabla f(u^*)$ is in the dual cone of $P$:
  $D = \{g : g = \sum_{i \in \mathcal{A}(u^*)} \lambda_i \nabla c_i(u^*) \text{ with } \lambda_i \ge 0\}$
COMPSCI 371D Machine Learning Convex Programs 14 / 28

The KKT Conditions
The Karush-Kuhn-Tucker Conditions
- $\nabla f(u^*)$ is in $D = \{g : g = \sum_{i \in \mathcal{A}(u^*)} \lambda_i \nabla c_i(u^*) \text{ with } \lambda_i \ge 0\}$
- $\nabla f(u^*) = \sum_{i \in \mathcal{A}(u^*)} \lambda_i \nabla c_i(u^*)$ for some $\lambda_i \ge 0$
- $\frac{\partial}{\partial u}\left[ f(u^*) - \sum_{i \in \mathcal{A}(u^*)} \lambda_i c_i(u^*) \right] = 0$ for some $\lambda_i \ge 0$
- There exist $\lambda$ s.t. $\frac{\partial}{\partial u} L(u^*, \lambda) = 0$, where
  $L(u, \lambda) \stackrel{\text{def}}{=} f(u) - \sum_{i \in \mathcal{A}(u)} \lambda_i c_i(u)$ with $\lambda_i \ge 0$ for $i \in \mathcal{A}(u)$
- $L$ is the Lagrangian function of this convex program
- The values in $\lambda$ are called the Lagrange multipliers
- The KKT conditions also hold in the interior of $C$! (there the active set is empty, and they reduce to $\nabla f(u^*) = 0$)
COMPSCI 371D Machine Learning Convex Programs 15 / 28

The KKT Conditions
Technical Cleanup
- Simplifying the sum $\sum_{i \in \mathcal{A}(u)} \lambda_i c_i(u)$
- Anywhere in $C$, we have $\lambda_i \ge 0$ (cone) and $c_i(u) \ge 0$ (feasible $u$), so
  $\sum_{i=1}^{k} \lambda_i c_i(u) = \lambda^T c(u) \ge 0$
- At $(u^*, \lambda^*)$, the multiplier $\lambda_i$ is in the sum only if $c_i(u^*) = 0$, so we can replace the sum with $\lambda^T c(u)$ if we add the complementarity constraint
  $(\lambda^*)^T c(u^*) = 0$,
  which requires $\lambda_i^* = 0$ for $i \notin \mathcal{A}(u^*)$
- At $(u^*, \lambda^*)$, the sum $\sum_{i \in \mathcal{A}(u)} \lambda_i c_i(u)$ is equivalent to $\lambda^T c(u)$ together with the constraint $(\lambda^*)^T c(u^*) = 0$
COMPSCI 371D Machine Learning Convex Programs 16 / 28

The KKT Conditions
The KKT Conditions
- It is necessary and sufficient for $u^*$ to be a solution to a convex program that there exists $\lambda^*$ such that
  $\frac{\partial}{\partial u} L(u^*, \lambda^*) = 0$ (no descent direction)
  $c(u^*) \ge 0$ (feasibility)
  $\lambda^* \ge 0$ (cone)
  $(\lambda^*)^T c(u^*) = 0$ (complementarity)
  where $L(u, \lambda) \stackrel{\text{def}}{=} f(u) - \lambda^T c(u)$
- We can now recognize a minimum if we see one
- Reduces to $\nabla f(u^*) = 0$ in the unconstrained case
- Since both $f(u)$ and $-\lambda^T c(u)$ are convex (the latter is affine in $u$), so is $L(u, \lambda)$
- Therefore, $(u^*, \lambda^*)$ is a global minimum of $L$ w.r.t. $u$ (because of the first KKT condition)
COMPSCI 371D Machine Learning Convex Programs 17 / 28
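For affine constraints $c(u) = Au + b$, the four conditions are easy to check numerically. The sketch below is my own (the helper name and tolerance are arbitrary choices); it uses the one-dimensional example that appears on the next slides.

```python
# Check the KKT conditions for a candidate (u, lambda) of min f(u) s.t. c(u) = A u + b >= 0.
# For affine constraints, the gradient of lambda^T c(u) with respect to u is A^T lambda.
import numpy as np

def kkt_satisfied(u, lam, grad_f, A, b, tol=1e-6):
    c = A @ u + b
    stationarity    = np.allclose(grad_f(u) - A.T @ lam, 0.0, atol=tol)  # dL/du = 0
    feasibility     = np.all(c >= -tol)                                  # c(u) >= 0
    cone            = np.all(lam >= -tol)                                # lambda >= 0
    complementarity = abs(lam @ c) <= tol                                # lambda^T c(u) = 0
    return stationarity and feasibility and cone and complementarity

# Tiny 1-D instance: min e^u subject to u >= 0 (the example used on the following slides)
A = np.array([[1.0]]); b = np.array([0.0])
grad_f = lambda u: np.exp(u)
print(kkt_satisfied(np.array([0.0]), np.array([1.0]), grad_f, A, b))   # True
print(kkt_satisfied(np.array([0.5]), np.array([0.0]), grad_f, A, b))   # False
```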

The KKT Conditions
A Tiny Example
- In simple examples, we can solve the KKT conditions and find the minimum (as opposed to just checking whether a given $u$ is a minimum)
- This is analogous to solving the normal equations $\nabla f(u^*) = 0$ in the unconstrained case
- Example: $\min_u f(u) = e^u$ subject to $c(u) = u \ge 0$
- Lagrangian: $L(u, \lambda) = f(u) - \lambda c(u) = e^u - \lambda u$
- We obviously know the solution, $u^* = 0$
  [Plot of $f(u) = e^u$ over $0 \le u \le 1$]
COMPSCI 371D Machine Learning Convex Programs 18 / 28

The KKT Conditions
A Tiny Example: KKT Conditions
- Lagrangian: $L(u, \lambda) = e^u - \lambda u$
- $\frac{\partial}{\partial u} L(u^*, \lambda^*) = e^{u^*} - \lambda^* = 0$
- $c(u^*) = u^* \ge 0$
- $\lambda^* \ge 0$
- $\lambda^* c(u^*) = \lambda^* u^* = 0$
- Solving the first condition yields $\lambda^* = e^{u^*} > 0$, which makes $\lambda^* \ge 0$ moot
- Complementarity then yields $u^* = 0$, which is feasible [Yay!]
- We have $\lambda^* = e^{u^*} = e^0 = 1$
- Since $\lambda^* > 0$, this constraint is active
COMPSCI 371D Machine Learning Convex Programs 19 / 28
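The same answer can be cross-checked with a generic numerical solver (my own quick check; SLSQP is just one convenient choice): the solver recovers $u^* \approx 0$, and the stationarity condition then gives the multiplier.

```python
# Minimize e^u over u >= 0 numerically and recover lambda* = e^{u*} from stationarity.
import numpy as np
from scipy.optimize import minimize

res = minimize(lambda u: np.exp(u[0]), x0=[1.0], method="SLSQP",
               constraints=[{"type": "ineq", "fun": lambda u: u[0]}])
u_star = res.x[0]
print("u* ≈", u_star)                       # ≈ 0
print("lambda* = e^{u*} ≈", np.exp(u_star))  # ≈ 1, matching the KKT solution above
```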

Lagrangian Duality
Lagrangian Duality
- We will find a transformation of a convex program
  $f^* \stackrel{\text{def}}{=} \min_{u \in C} f(u)$, with $C \stackrel{\text{def}}{=} \{u \in \mathbb{R}^m : c(u) \ge 0\}$,
  into an equivalent maximization problem, called the Lagrangian dual
- The original problem is called the primal
- This is not cone duality!
- This transformation will seem arbitrary for now
- The dual involves only the Lagrange multipliers $\lambda$, and not $u$
- The dual is often simpler than the primal
- For SVMs, the dual leads to SVMs with nonlinear decision boundaries
COMPSCI 371D Machine Learning Convex Programs 20 / 28

Lagrangian Duality
Derivation of Lagrangian Duality
- Duality is based on bounds on the Lagrangian $L(u, \lambda) \stackrel{\text{def}}{=} f(u) - \lambda^T c(u)$,
  where $\lambda^T c(u) \ge 0$ in $C$ (for $\lambda \ge 0$) and $(\lambda^*)^T c(u^*) = 0$
- Therefore, $L(u^*, \lambda^*) = f(u^*)$
- Also, $L(u, \lambda) \le f(u)$ for all $u \in C$ and $\lambda \ge 0$
- Even more so, $D(\lambda) \stackrel{\text{def}}{=} \min_{u \in C} L(u, \lambda) \le f(u)$ for all $u \in C$ and $\lambda \ge 0$, and in particular
  $D(\lambda) \le f^* \stackrel{\text{def}}{=} f(u^*)$ for all $\lambda \ge 0$
- $D$ is called the Lagrangian dual of $L$
- For different $\lambda$, the bound varies in tightness. For what $\lambda$ is it tightest?
COMPSCI 371D Machine Learning Convex Programs 21 / 28
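This bound can be seen numerically on the tiny example $\min e^u$ s.t. $u \ge 0$, where $f^* = 1$. The sketch below (my own check; the bounded search interval stands in for the minimization over $C = [0, \infty)$) evaluates $D(\lambda)$ for a few multipliers.

```python
# Weak duality on the tiny example: D(lambda) <= f* = 1 for every lambda >= 0.
import numpy as np
from scipy.optimize import minimize_scalar

def D(lam, u_max=50.0):
    # D(lambda) = min_{u in C} L(u, lambda) with L(u, lambda) = e^u - lambda * u
    res = minimize_scalar(lambda u: np.exp(u) - lam * u,
                          bounds=(0.0, u_max), method="bounded")
    return res.fun

f_star = 1.0
for lam in [0.5, 1.0, 2.0, 5.0]:
    print(f"lambda = {lam}: D(lambda) = {D(lam):.4f} <= f* = {f_star}")
# Every value stays below f*; the bound is tight at lambda* = 1
```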

Lagrangian Duality
Derivation of Lagrangian Duality, Continued
- $L(u, \lambda) \stackrel{\text{def}}{=} f(u) - \lambda^T c(u)$
- $D(\lambda) \stackrel{\text{def}}{=} \min_{u \in C} L(u, \lambda) \le f(u)$
- $(\lambda^*)^T c(u^*) = 0$ (complementarity)
- $D(\lambda^*) = \min_{u \in C} L(u, \lambda^*) = L(u^*, \lambda^*) = f(u^*) - (\lambda^*)^T c(u^*) = f(u^*) = f^*$
- Thus, $D(\lambda) \le f^*$ and $D(\lambda^*) = f^*$, so that
  $\max_{\lambda \ge 0} D(\lambda) = f^*$ and $\arg\max_{\lambda \ge 0} D(\lambda) = \lambda^*$.
- This is the dual problem. The value at the solution is the same as that of the primal problem, and the $\lambda^*$ where the maximum is achieved is the same $\lambda^*$ that yields the solution to the primal problem
COMPSCI 371D Machine Learning Convex Programs 22 / 28

Lagrangian Duality
Summary of Lagrangian Duality
$$f^* \;=\; \underbrace{f(u^*) = \min_{u \in C} f(u)}_{\text{primal}} \;=\; L(u^*, \lambda^*) \;=\; \underbrace{\max_{\lambda \ge 0} D(\lambda) = D(\lambda^*)}_{\text{dual}}$$
where $D(\lambda) \stackrel{\text{def}}{=} \min_{u \in C} L(u, \lambda)$
- $L$ has a minimum in $u$ and a maximum in $\lambda$ at the solution $(u^*, \lambda^*)$ of both problems
- $(u^*, \lambda^*)$ is a saddle point for $L$
COMPSCI 371D Machine Learning Convex Programs 23 / 28

Lagrangian Duality
A Tiny Example, Continued
- $f(u) = e^u$ subject to $c(u) = u \ge 0$
- Lagrangian: $L(u, \lambda) = f(u) - \lambda c(u) = e^u - \lambda u$
  [Plot of $f(u) = e^u$ over $0 \le u \le 1$]
COMPSCI 371D Machine Learning Convex Programs 24 / 28

Lagrangian Duality
A Tiny Example: The Dual
- Lagrangian: $L(u, \lambda) = e^u - \lambda u$
- $D(\lambda) = \min_{u \ge 0} L(u, \lambda) = \min_{u \ge 0} (e^u - \lambda u)$
- Differentiate $L$ and set to zero to find the minimum
- Been there, done that: we know that $L(u, \lambda)$ has a minimum in $u$ when $\lambda = e^u$
- However, we are now interested in the value of $u$ for fixed $\lambda$, so we solve for $u$: $u = \ln \lambda$
- $D(\lambda) = L(\ln \lambda, \lambda) = e^{\ln \lambda} - \lambda \ln \lambda = \lambda (1 - \ln \lambda)$
COMPSCI 371D Machine Learning Convex Programs 25 / 28

Lagrangian Duality
A Tiny Example: The Dual
- $D(\lambda) = \lambda (1 - \ln \lambda)$
- Maximizing $D$ in $\lambda$ yields the same $\lambda^*$ as the primal problem
- We have eliminated $u$
- Compute the derivative and set it to zero:
  $\frac{dD}{d\lambda} = 1 - \ln \lambda - 1 = -\ln \lambda = 0$ for $\lambda^* = 1$ [Yay!]
- If we hadn't already solved the primal, we could plug $\lambda^* = 1$ into the Lagrangian: $L(u, \lambda^*) = e^u - u$
- We have eliminated $\lambda$
COMPSCI 371D Machine Learning Convex Programs 26 / 28
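A quick numerical confirmation of the dual solution (my own check): maximize $D(\lambda) = \lambda(1 - \ln\lambda)$ over $\lambda > 0$ by minimizing $-D$.

```python
# Maximize the dual D(lambda) = lambda * (1 - ln lambda) numerically.
import numpy as np
from scipy.optimize import minimize_scalar

D = lambda lam: lam * (1.0 - np.log(lam))
res = minimize_scalar(lambda lam: -D(lam), bounds=(1e-6, 10.0), method="bounded")
print("lambda* ≈", res.x)        # ≈ 1
print("D(lambda*) ≈", -res.fun)  # ≈ 1 = f*, the primal optimal value
```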

Lagrangian Duality
Solving the Primal
- Given $\lambda^* = 1$, the problem $\min_u f(u) = e^u$ subject to $c(u) = u \ge 0$ becomes
  $\min_u L(u, \lambda^*) = \min_u L(u, 1) = \min_u (e^u - u)$
- Compute the derivative and set it to zero:
  $\frac{dL}{du} = e^u - 1 = 0$ for $u^* = 0$ [Yay!]
- We found $u^*$ without constraints
COMPSCI 371D Machine Learning Convex Programs 27 / 28

Lagrangian Duality
Summary of Lagrangian Duality
- Worth repeating in light of the example:
$$f^* \;=\; \underbrace{f(u^*) = \min_{u \in C} f(u)}_{\text{primal}} \;=\; L(u^*, \lambda^*) \;=\; \underbrace{\max_{\lambda \ge 0} D(\lambda) = D(\lambda^*)}_{\text{dual}}$$
where $D(\lambda) \stackrel{\text{def}}{=} \min_{u \in C} L(u, \lambda)$
- $L$ has a minimum in $u$ and a maximum in $\lambda$ at the solution $(u^*, \lambda^*)$ of both problems
- $(u^*, \lambda^*)$ is a saddle point for $L$
COMPSCI 371D Machine Learning Convex Programs 28 / 28
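The saddle-point statement can also be checked numerically on the tiny example, at $(u^*, \lambda^*) = (0, 1)$ where $L(u^*, \lambda^*) = 1$. The grid sizes below are my own arbitrary choices.

```python
# Saddle-point check on L(u, lambda) = e^u - lambda * u at (u*, lambda*) = (0, 1):
# L(u*, lambda) <= L(u*, lambda*) <= L(u, lambda*) for all feasible u and lambda >= 0.
import numpy as np

L = lambda u, lam: np.exp(u) - lam * u
u_star, lam_star = 0.0, 1.0

lams = np.linspace(0.0, 10.0, 101)   # vary lambda at fixed u = u*
us = np.linspace(0.0, 5.0, 101)      # vary feasible u at fixed lambda = lambda*
print(np.max(L(u_star, lams)) <= L(u_star, lam_star) + 1e-12)   # maximum in lambda at u*
print(np.min(L(us, lam_star)) >= L(u_star, lam_star) - 1e-12)   # minimum in u at lambda*
```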