Convex Optimization and Machine Learning

Convex Optimization and Machine Learning
Mengliu Zhao
Machine Learning Reading Group, School of Computing Science, Simon Fraser University
March 12, 2014

Introduction

Formulation of the binary SVM problem: given a training data set

D = {(x_i, y_i) : x_i ∈ R^n, y_i ∈ {−1, 1}, i = 1, 2, ..., m}   (1)

we want to find the maximum-margin hyperplane, which can be described by its normal vector w (b is an offset) satisfying:

minimize ‖w‖²  subject to  y_i(wᵀx_i − b) ≥ 1,  i = 1, 2, ..., m   (2)

Comment: We encounter many constrained minimization problems in machine learning.
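As a concrete illustration of problem (2) (not part of the talk), the following is a minimal sketch that solves the hard-margin primal on a small synthetic, linearly separable data set with the cvxpy modeling library; the toy data and all parameter choices are illustrative assumptions.

    import numpy as np
    import cvxpy as cp

    # toy linearly separable data in R^2 (illustrative, not from the talk)
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal([2, 2], 0.5, (20, 2)),
                   rng.normal([-2, -2], 0.5, (20, 2))])
    y = np.hstack([np.ones(20), -np.ones(20)])

    # primal problem (2): minimize ||w||^2  s.t.  y_i (w^T x_i - b) >= 1
    w, b = cp.Variable(2), cp.Variable()
    constraints = [cp.multiply(y, X @ w - b) >= 1]
    cp.Problem(cp.Minimize(cp.sum_squares(w)), constraints).solve()
    print("w =", w.value, "b =", b.value)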

Why Do We Want Convex Problems?

Outline
1 Lagrange Dual Form
2 Dual Decomposition, Augmented Lagrangian and ADMM
3 SVM and Convex Optimization

Convex Optimization Problems

The general form of a convex optimization problem is the following:

minimize f_0(x)
subject to f_i(x) ≤ 0,  i = 1, 2, ..., m
           h_j(x) = 0,  j = 1, 2, ..., n   (3)

where f_0 and the f_i are convex functions and the h_j are linear (affine) functions.

Property: The feasible set of a convex optimization problem is also convex. In other words, a convex optimization problem minimizes a convex function over a convex set.

General Constrained Problems and Lagrange Duality

However, most constrained problems we optimize are not convex:

minimize f_0(x)
subject to f_i(x) ≤ 0,  i = 1, 2, ..., m
           h_j(x) = 0,  j = 1, 2, ..., n   (4)

Lagrangian:

L(x, λ, ν) = f_0(x) + Σ_{i=1}^m λ_i f_i(x) + Σ_{j=1}^n ν_j h_j(x)   (5)

The λ_i (≥ 0) and ν_j are called Lagrange multipliers or dual variables; the Lagrangian dual function is defined as:

g(λ, ν) = inf_x L(x, λ, ν)   (6)

Geometric Explanation: Primal Problem

minimize x² + 15²  subject to  x² − 15² ≤ 0   (7)

f_1(x) = x² − 15²   (8)

[Figure on slide: plot of the objective f_0(x) = x² + 15² and the constraint function f_1(x).]

Geometric Explanation: Dual Problem

L(x, λ) = x² + 15² + λ(x² − 15²)   (9)

λ = 0.1 : 0.1 : 2   (10)

g(λ) = inf_x { x² + 15² + λ(x² − 15²) },  λ = 0.1 : 0.1 : 2   (11)

[Figure on slide: the Lagrangian L(x, λ) for this range of λ, and the resulting dual function g(λ).]
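A minimal numeric sketch of this toy example (not from the slides; it only assumes numpy) that tabulates g(λ) over the same grid of λ and shows weak duality against the primal optimum p* = 225:

    import numpy as np

    xs = np.linspace(-20, 20, 2001)         # grid over x (illustrative range)
    lambdas = np.arange(0.1, 2.01, 0.1)     # lambda = 0.1 : 0.1 : 2 as on the slide

    # primal optimum: min x^2 + 15^2  s.t.  x^2 - 15^2 <= 0 is attained at x = 0
    p_star = 15**2
    for lam in lambdas:
        L = xs**2 + 15**2 + lam * (xs**2 - 15**2)   # Lagrangian L(x, lambda)
        g = L.min()                                  # g(lambda) = inf_x L(x, lambda)
        print(f"lambda = {lam:.1f}   g(lambda) = {g:7.1f}   <= p* = {p_star}")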

Geometric Explanation: Two Observations

Observation (I): The dual function g(λ) is concave, so

maximize g(λ)  subject to  λ ≥ 0   (12)

is a convex optimization problem.

Observation (II): Let p* be the optimal value of the primal problem; then

g(λ) ≤ p*  for all λ ≥ 0   (13)

Economic Explanation

A company has production cost f_0 and must keep certain quantities f_i below limits a_i (rules, resources):

minimize f_0(x)  subject to  f_i(x) − a_i ≤ 0,  i = 1, 2, ..., m   (14)

Suppose, however, that the company can pay a price λ_i ≥ 0 per unit of violation of rule i, which is added back to the total cost:

g(λ) = inf_x { f_0(x) + Σ_i λ_i (f_i(x) − a_i) }   (15)

In this case the dual optimal value d*, i.e. max_λ g(λ), is the company's cost under the least favorable set of prices λ.

Strong & Weak Duality

How well does the dual problem approximate the original problem?

1 Weak duality: the optimal duality gap is always non-negative, p* − d* ≥ 0.   (16)
2 Strong duality: the duality gap is zero, p* = d*.   (17)

Q: When does strong duality hold?

Theorem (Slater's Theorem)
Let D be the domain of the problem and assume the primal problem is convex:

minimize f_0(x)
subject to f_i(x) ≤ 0,  i = 1, 2, ..., m
           h_j(x) = 0,  j = 1, 2, ..., n   (18)

If there exists x ∈ relint D with f_i(x) < 0 for i = 1, ..., m, then strong duality holds.

KKT Condition

For the constrained problem

minimize f_0(x)
subject to f_i(x) ≤ 0,  i = 1, 2, ..., m
           h_j(x) = 0,  j = 1, 2, ..., n   (19)

if x* is a primal minimizer, then it satisfies the following necessary (stationarity) condition for some multipliers λ_i* ≥ 0 and ν_j*:

∇f_0(x*) + Σ_{i=1}^m λ_i* ∇f_i(x*) + Σ_{j=1}^n ν_j* ∇h_j(x*) = 0   (20)

(together with primal and dual feasibility and complementary slackness, λ_i* f_i(x*) = 0).

Outline
1 Lagrange Dual Form
2 Dual Decomposition, Augmented Lagrangian and ADMM
3 SVM and Convex Optimization

Dual Form, Then What?

Once we have the dual problem, it is easy to solve, e.g., by a gradient approach (dual ascent).

Property: If g(λ) is a differentiable convex (concave) function, then ∇g(λ*) = 0 iff λ* is the global minimizer (maximizer).

Comment: A number of conditions need to be satisfied for the gradient method to be stable.

Dual Ascent for Solving the Dual Problem

Let us look at a simplified version of the constrained problem:

minimize f(x)  subject to  Ax = b   (21)

Its dual form:

max_λ g(λ) = max_λ { min_x L(x, λ) }   (22)

L(x, λ) = f(x) + λᵀ(Ax − b)   (23)

Update x and λ at each iteration:

x^{k+1} = argmin_x L(x, λ^k)   (24)
λ^{k+1} = λ^k + α^{k+1} (A x^{k+1} − b)   (25)

(the residual A x^{k+1} − b is the gradient of g at λ^k)

Question: What if we have a much more complex situation?
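A minimal sketch of dual ascent (24)–(25) for an equality-constrained quadratic f(x) = ½ xᵀP x + qᵀx; the particular P, q, A, b, step size, and iteration count are illustrative assumptions, not from the talk.

    import numpy as np

    # illustrative problem data: minimize (1/2) x^T P x + q^T x  s.t.  A x = b
    P = np.array([[2.0, 0.0], [0.0, 4.0]])   # positive definite
    q = np.array([-1.0, -1.0])
    A = np.array([[1.0, 1.0]])
    b = np.array([1.0])
    alpha = 0.5                               # fixed dual step size (assumed)

    lam = np.zeros(1)
    for k in range(200):
        # x-step (24): argmin_x L(x, lam) solves the linear system P x = -(q + A^T lam)
        x = np.linalg.solve(P, -(q + A.T @ lam))
        # dual ascent step (25): move along the gradient of g, i.e. the residual Ax - b
        lam = lam + alpha * (A @ x - b)

    print("x =", x, "lambda =", lam)   # converges to x = [2/3, 1/3], lambda = [-1/3]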

Dual Decomposition

Suppose the problem is high-dimensional, x̂ = (x, z), and f(x̂) is separable:

f(x̂) = f_1(x) + f_2(z)   (26)
A x̂ − b = (A_1 x − b_1) + (A_2 z − b_2)   (27)

Then we can run dual ascent on each block separately:

L_1(x, λ) = f_1(x) + λᵀ(A_1 x − b_1)   (28)
L_2(z, λ) = f_2(z) + λᵀ(A_2 z − b_2)   (29)

x^{k+1} = argmin_x L_1(x, λ^k)   (30)
z^{k+1} = argmin_z L_2(z, λ^k)   (31)
λ^{k+1} = λ^k + α^{k+1} [ (A_1 x^{k+1} − b_1) + (A_2 z^{k+1} − b_2) ]   (32)

Comment
1 Simple dual ascent is usually slow.
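A minimal sketch of dual decomposition on the same quadratic example as above, split into two one-dimensional blocks f_1(x) = x² − x and f_2(z) = 2z² − z with coupling constraint x + z = 1 (so A_1 = A_2 = [1], b_1 = b_2 = 0.5); all of this is an illustrative assumption, not from the talk. The two block updates can run in parallel.

    alpha = 0.5     # fixed dual step size (assumed)
    lam = 0.0
    for k in range(200):
        # the two subproblems (30)-(31) decouple and can be solved in parallel
        x = (1.0 - lam) / 2.0      # argmin_x  x^2 - x + lam * (x - 0.5)
        z = (1.0 - lam) / 4.0      # argmin_z  2 z^2 - z + lam * (z - 0.5)
        # dual update (32) with the combined residual (A1 x - b1) + (A2 z - b2)
        lam = lam + alpha * (x + z - 1.0)

    print("x =", x, "z =", z, "lambda =", lam)   # same solution as the example above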

Alternative to Dual Ascent: the Augmented Lagrangian

Primal problem:

minimize f(x)  subject to  Ax = b   (33)

Augmented Lagrangian:

L_θ(x, λ) = f(x) + λᵀ(Ax − b) + (θ/2) ‖Ax − b‖²₂   (34)

Update by the method of multipliers (fixed step θ):

x^{k+1} := argmin_x L_θ(x, λ^k)   (35)
λ^{k+1} := λ^k + θ (A x^{k+1} − b)   (36)
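A minimal sketch of the method of multipliers (35)–(36) on the same quadratic example; the penalty parameter θ and all data are illustrative assumptions.

    import numpy as np

    P = np.array([[2.0, 0.0], [0.0, 4.0]])
    q = np.array([-1.0, -1.0])
    A = np.array([[1.0, 1.0]])
    b = np.array([1.0])
    theta = 10.0    # penalty parameter, also the fixed dual step size (assumed)

    lam = np.zeros(1)
    for k in range(50):
        # x-step (35): minimizing the augmented Lagrangian of a quadratic f
        # amounts to solving (P + theta A^T A) x = -q - A^T lam + theta A^T b
        x = np.linalg.solve(P + theta * A.T @ A, -q - A.T @ lam + theta * A.T @ b)
        # multiplier update (36) with fixed step theta
        lam = lam + theta * (A @ x - b)

    print("x =", x, "lambda =", lam)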

Method of Multipliers

Compared to dual ascent:
1 Good news: convergence under much more relaxed conditions;
2 Bad news: dual decomposition no longer works (the quadratic penalty term couples the blocks)!

Comment: ADMM can help!

ADMM: Alternating Direction Method of Multipliers

minimize f(x) + g(z)  subject to  Ax + Bz = b   (37)

Its augmented Lagrangian is:

L_θ(x, z, λ) = f(x) + g(z) + λᵀ(Ax + Bz − b) + (θ/2) ‖Ax + Bz − b‖²₂   (38)

ADMM scheme:

x^{k+1} := argmin_x L_θ(x, z^k, λ^k)
z^{k+1} := argmin_z L_θ(x^{k+1}, z, λ^k)
λ^{k+1} := λ^k + θ (A x^{k+1} + B z^{k+1} − b)   (39)
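A minimal ADMM sketch (not from the talk) for a lasso-type instance of (37): f(x) = ½‖Cx − d‖², g(z) = μ‖z‖₁, with constraint x − z = 0 (so A = I, B = −I, b = 0); C, d, μ, θ, and the iteration count are illustrative assumptions. The z-step is the soft-thresholding proximal operator of the ℓ1 norm.

    import numpy as np

    rng = np.random.default_rng(0)
    C = rng.normal(size=(30, 10))
    d = rng.normal(size=30)
    mu, theta = 1.0, 1.0

    x, z, lam = np.zeros(10), np.zeros(10), np.zeros(10)
    CtC, Ctd = C.T @ C, C.T @ d
    for k in range(200):
        # x-step: minimize (1/2)||Cx - d||^2 + lam^T (x - z) + (theta/2)||x - z||^2
        x = np.linalg.solve(CtC + theta * np.eye(10), Ctd - lam + theta * z)
        # z-step: soft-thresholding with threshold mu / theta
        v = x + lam / theta
        z = np.sign(v) * np.maximum(np.abs(v) - mu / theta, 0.0)
        # dual update with step theta
        lam = lam + theta * (x - z)

    print("nonzero entries of z:", np.count_nonzero(z))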

A Closer Look at ADMM

Comment: We need more convincing evidence that the scheme will work! The unnatural thing here is the new variable z, so we check the KKT (stationarity) condition of the constrained problem above with respect to z:

∇g(z) + Bᵀλ = 0   (40)

and verify that it is satisfied by the ADMM scheme. Since z^{k+1} minimizes L_θ(x^{k+1}, z, λ^k),

0 = ∇g(z^{k+1}) + Bᵀλ^k + θ Bᵀ(A x^{k+1} + B z^{k+1} − b)   (41)
  = ∇g(z^{k+1}) + Bᵀλ^{k+1}   (42)

which means the KKT condition (40) is satisfied by (z^{k+1}, λ^{k+1}) after every iteration.

Outline
1 Lagrange Dual Form
2 Dual Decomposition, Augmented Lagrangian and ADMM
3 SVM and Convex Optimization

Dual Form of SVM

Now let us come back to the constrained version of the SVM model:

minimize ‖w‖²  subject to  y_i(wᵀx_i − b) ≥ 1,  i = 1, 2, ..., m   (43)

It is easy to convert it to the Lagrangian dual form (writing the objective as ‖w‖²/2, which has the same minimizer):

max_λ { min_{w,b} { ‖w‖²/2 + Σ_i λ_i [1 − y_i(wᵀx_i − b)] } }   (44)

Comment: This formulation is too complex; we can simplify it further.

Dual Form of SVM

Checking the KKT conditions by taking first-order derivatives of the Lagrangian ‖w‖²/2 + Σ_i λ_i [1 − y_i(wᵀx_i − b)] with respect to w and b:

w = Σ_i λ_i y_i x_i   (45)
0 = Σ_i λ_i y_i   (46)

Substituting these back into (44), we obtain:

max_λ g(λ) = max_λ { Σ_i λ_i − (1/2) Σ_{i,j} y_i y_j λ_i λ_j x_iᵀ x_j }
s.t. λ_i ≥ 0,  i = 1, 2, ..., m
     Σ_i λ_i y_i = 0   (47)
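As a concrete illustration (not code from the talk), the dual (47) can be solved with cvxpy on the toy data X, y from the primal sketch above, and w recovered from the KKT condition (45); the quadratic term is written as ½‖Σ_i λ_i y_i x_i‖², which equals ½ Σ_{i,j} y_i y_j λ_i λ_j x_iᵀx_j. All names reused here are assumptions carried over from the earlier sketch.

    import numpy as np
    import cvxpy as cp

    # reuse the toy X (m x n) and y (m,) from the primal sketch above
    m = X.shape[0]
    M = X * y[:, None]                    # rows are y_i * x_i

    lam = cp.Variable(m)
    # dual objective (47): sum_i lam_i - (1/2) || sum_i lam_i y_i x_i ||^2
    objective = cp.Maximize(cp.sum(lam) - 0.5 * cp.sum_squares(M.T @ lam))
    constraints = [lam >= 0, y @ lam == 0]
    cp.Problem(objective, constraints).solve()

    # recover the primal solution from the KKT condition (45)
    w = M.T @ lam.value
    print("w =", w)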

Summary
1 Lagrangian duality and the KKT condition
2 Dual decomposition, augmented Lagrangian, ADMM
3 An example using Lagrangian duality on the SVM

Thank You!