Lecture 7: Support Vector Machine


1 Lecture 7: Support Vector Machine
Hien Van Nguyen, University of Houston
9/28/2017

2 Separating hyperplane
Red and green dots can be separated by a separating hyperplane.
Two classes are separable, i.e., each class lies on one side of the hyperplane, if and only if some hyperplane $w^T x + b = 0$ satisfies $y_i (w^T x_i + b) > 0$ for every sample, with labels $y_i \in \{-1, +1\}$.
The margin is the smallest distance from a sample to the hyperplane.
9/28/2017 Machine Learning 2

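3 Geometry of separating hyperplane
Let the hyperplane be $\{x : w^T x + b = 0\}$.
For any point $x_0$ on the hyperplane: $w^T x_0 = -b$.
For any two points $x_1, x_2$ on the hyperplane: $w^T (x_1 - x_2) = 0$, so $w$ is normal to the hyperplane.
The signed distance from a point $x$ to the hyperplane is given by
$$\frac{w^T x + b}{\|w\|}$$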
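
The signed-distance formula $(w^T x + b) / \|w\|$ can be checked numerically. The following is a minimal sketch in plain Python; the hyperplane $3 x_1 + 4 x_2 - 5 = 0$ and the test points are made-up values chosen so that $\|w\| = 5$, not data from the lecture.

```python
import math

def signed_distance(w, b, x):
    """Signed distance from x to the hyperplane {x : w.x + b = 0}."""
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    if norm_w == 0.0:
        raise ValueError("w must be a non-zero vector")
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm_w

# Hypothetical hyperplane 3*x1 + 4*x2 - 5 = 0, so ||w|| = 5.
w, b = [3.0, 4.0], -5.0
print(signed_distance(w, b, [3.0, 4.0]))  # point on the positive side
print(signed_distance(w, b, [0.0, 0.0]))  # point on the negative side
print(signed_distance(w, b, [1.0, 0.5]))  # point lying on the hyperplane
```

The sign tells which side of the hyperplane the point is on, and the magnitude is the ordinary Euclidean distance.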

4 Linear classifier review
Training data consist of $n$ pairs $(x_1, y_1), \dots, (x_n, y_n)$, with labels $y_i \in \{-1, +1\}$.
Objective: learn a classifier, represented by a hyperplane, that separates the two classes.
Classification rule is given by $\hat{y} = \operatorname{sign}(w^T x + b)$.
There are infinitely many hyperplanes separating the red and green samples. Which one should we choose?

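5 Max-margin classifier
Separable classes: choose the hyperplane that maximizes the margin $M$:
$$\max_{w, b, \|w\| = 1} M \quad \text{subject to} \quad y_i (w^T x_i + b) \ge M, \quad i = 1, \dots, n$$
Important: the weight vector must have unit norm; otherwise $M$ can be made arbitrarily large just by rescaling $(w, b)$.
Equivalent form:
$$\min_{w, b} \frac{1}{2} \|w\|^2 \quad \text{subject to} \quad y_i (w^T x_i + b) \ge 1, \quad i = 1, \dots, n$$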
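
To see why the margin is a sensible criterion for choosing among separating hyperplanes, the sketch below computes the geometric margin $\min_i y_i (w^T x_i + b) / \|w\|$ for two candidate hyperplanes. The four data points and both candidates are made-up illustration values. Dividing by $\|w\|$ plays the role of the unit-norm constraint, so rescaling $w$ and $b$ does not change the result.

```python
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def margin(w, b, X, y):
    """Smallest signed distance y_i (w.x_i + b) / ||w|| over the data set."""
    norm_w = math.sqrt(dot(w, w))
    return min(yi * (dot(w, x) + b) / norm_w for x, yi in zip(X, y))

# Hypothetical, linearly separable toy data.
X = [[2.0, 0.0], [3.0, 1.0], [-2.0, 0.0], [-3.0, -1.0]]
y = [1, 1, -1, -1]

m1 = margin([1.0, 0.0], 0.0, X, y)    # hyperplane x1 = 0
m2 = margin([1.0, 0.0], -1.0, X, y)   # hyperplane x1 = 1
m3 = margin([10.0, 0.0], 0.0, X, y)   # same hyperplane as m1, rescaled
print(m1, m2, m3)
```

Both hyperplanes separate the data (their margins are positive), but the first one leaves more room on each side, which is the one a max-margin classifier prefers.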

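6 Derivation of equivalent form
We can get rid of the unit-norm constraint by dividing each side by the norm:
$$\frac{1}{\|w\|} y_i (w^T x_i + b) \ge M, \quad \text{i.e.,} \quad y_i (w^T x_i + b) \ge M \|w\|$$
Any positively scaled multiple of a solution $(w, b)$ is also a solution, so we can pick a convenient scale.
Setting $\|w\| = 1/M$, maximizing $M$ becomes minimizing $\|w\|$, and we obtain the equivalent form
$$\min_{w, b} \frac{1}{2} \|w\|^2 \quad \text{subject to} \quad y_i (w^T x_i + b) \ge 1, \quad i = 1, \dots, n$$
This is a convex optimization problem. More specifically, it is a quadratic programming problem.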

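7 Support vector machine
If the classes are not separable, we can use the same approach.
We allow some points to be on the wrong side of the margin, but penalize them. In the standard soft-margin formulation this is done with slack variables $\xi_i$ and a penalty weight $C > 0$:
$$\min_{w, b, \xi} \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i (w^T x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, n$$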
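
One way to make "allow but penalize" concrete is with slack variables $\xi_i = \max(0, 1 - y_i (w^T x_i + b))$, which measure how far each point falls short of the margin. The sketch below assumes the standard soft-margin objective $\frac{1}{2}\|w\|^2 + C \sum_i \xi_i$; the penalty weight `C`, the data, and the candidate hyperplane are illustration values, not taken from the slides.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def slacks(w, b, X, y):
    """xi_i = max(0, 1 - y_i (w.x_i + b)); zero for points on or outside the margin."""
    return [max(0.0, 1.0 - yi * (dot(w, x) + b)) for x, yi in zip(X, y)]

def soft_margin_objective(w, b, X, y, C):
    """0.5 * ||w||^2 + C * sum of slacks (C is an assumed penalty weight)."""
    return 0.5 * dot(w, w) + C * sum(slacks(w, b, X, y))

# Hypothetical data: the last point has label -1 but sits on the positive side.
X = [[2.0, 0.0], [0.5, 0.0], [-2.0, 0.0], [1.0, 0.0]]
y = [1, 1, -1, -1]
w, b = [1.0, 0.0], 0.0

xi = slacks(w, b, X, y)
objective = soft_margin_objective(w, b, X, y, C=1.0)
print(xi)
print(objective)
```

A slack between 0 and 1 means the point is inside the margin but still correctly classified; a slack above 1 means it is on the wrong side of the hyperplane.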

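8 Hinge loss
Equivalent form: the non-separable problem can be written without constraints as
$$\min_{w, b} \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \max\bigl(0, 1 - y_i (w^T x_i + b)\bigr)$$
where $C > 0$ weights the penalty.
SVM uses the hinge loss $\max(0, 1 - y f(x))$.
Hinge loss rises linearly while square loss rises quadratically, so hinge loss is more robust to outliers.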
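
The linear versus quadratic growth can be seen on a single badly misclassified sample. This is a small sketch; the scores $\pm 4$ are arbitrary illustration values, and the square loss is taken as $(y - f(x))^2$.

```python
def hinge_loss(y, score):
    """Hinge loss max(0, 1 - y * f(x))."""
    return max(0.0, 1.0 - y * score)

def squared_loss(y, score):
    """Square loss (y - f(x))^2."""
    return (y - score) ** 2

# An outlier: true label +1, but the classifier outputs a score of -4.
print(hinge_loss(1, -4.0), squared_loss(1, -4.0))

# A confidently correct sample: true label +1, score +4.
print(hinge_loss(1, 4.0), squared_loss(1, 4.0))
```

The outlier costs five times more under square loss than under hinge loss, and square loss also penalizes the confidently correct sample, which hinge loss ignores.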

9 Properties of SVM
The solution is unique.
It is more robust to noisy examples than least squares or ridge regression.
The solution depends only on a sparse subset of examples. In the case of separable classes, these examples lie on the margin and are called support vectors.
Removing a training sample is less likely to affect SVM than least-squares regression, because SVM's solution depends only on a small number of support vectors.

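10 Review on Lagrangian Duality
Optimization problem:
$$\min_x f_0(x) \quad \text{subject to} \quad f_i(x) \le 0, \; i = 1, \dots, m, \qquad h_i(x) = 0, \; i = 1, \dots, p$$
Lagrangian:
$$L(x, \lambda, \mu) = f_0(x) + \sum_{i=1}^{m} \lambda_i f_i(x) + \sum_{i=1}^{p} \mu_i h_i(x)$$
$\lambda_i$ and $\mu_i$ are called Lagrange multipliers.
Note: the sign of the $\lambda$ terms changes if the direction of the inequality constraints changes.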

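11 Review on Lagrangian Duality
Lemma:
$$\sup_{\lambda \ge 0, \, \mu} L(x, \lambda, \mu) = \begin{cases} f_0(x) & \text{if } x \text{ is feasible} \\ +\infty & \text{otherwise} \end{cases}$$
Using the lemma, the constrained formulation can be rewritten as (Primal):
$$p^* = \inf_x \sup_{\lambda \ge 0, \, \mu} L(x, \lambda, \mu)$$
Dual form:
$$d^* = \sup_{\lambda \ge 0, \, \mu} \inf_x L(x, \lambda, \mu)$$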

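12 Lagrange dual optimization
Dual function:
$$g(\lambda, \mu) = \inf_x L(x, \lambda, \mu)$$
It is the pointwise infimum of a family of affine functions of $(\lambda, \mu)$, hence a concave function.
Dual problem:
$$\max_{\lambda, \mu} g(\lambda, \mu) \quad \text{subject to} \quad \lambda \ge 0$$
This is a convex optimization problem, since the objective being maximized is concave and the constraints are convex.
Theorem (weak duality): $d^* \le p^*$, where $p^*$ and $d^*$ are the optimal values of the primal and dual problems.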
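
Weak duality can be verified by hand on a one-dimensional toy problem, chosen here purely for illustration: minimize $x^2$ subject to $1 - x \le 0$. The primal optimum is $p^* = 1$ at $x = 1$. The Lagrangian is $x^2 + \lambda (1 - x)$, minimized over $x$ at $x = \lambda / 2$, which gives the dual function $g(\lambda) = \lambda - \lambda^2 / 4$.

```python
def dual_function(lam):
    """g(lam) = inf_x [x^2 + lam * (1 - x)]; the infimum is attained at x = lam / 2."""
    x_min = lam / 2.0
    return x_min ** 2 + lam * (1.0 - x_min)

# Primal optimum of: minimize x^2 subject to 1 - x <= 0 (attained at x = 1).
p_star = 1.0

for lam in [0.0, 0.5, 1.0, 2.0, 3.0, 4.0]:
    g = dual_function(lam)
    print(f"lambda={lam:.1f}  g={g:.4f}  lower bound holds: {g <= p_star}")
```

Every $\lambda \ge 0$ gives a lower bound on $p^*$, and the bound is tight at $\lambda = 2$, so this particular problem also has strong duality.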

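13 Slater's condition for strong duality
Strong duality holds when the dual optimum equals the primal optimum, $d^* = p^*$.
If $f_0$ is convex, the $f_i$ are convex, and the equality constraints $h_i$ are affine, then strong duality holds when Slater's condition holds: there exists a strictly feasible point $x$, with $f_i(x) < 0$ for all $i$ and $h_i(x) = 0$ for all $i$.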

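14 Karush-Kuhn-Tucker condition
A solution $x^*$ to the primal and $(\lambda^*, \mu^*)$ to the dual satisfies the following conditions:
Primal feasibility: $f_i(x^*) \le 0$, $h_i(x^*) = 0$
Dual feasibility: $\lambda_i^* \ge 0$
Complementary slackness: $\lambda_i^* f_i(x^*) = 0$
These are necessary conditions for any pair of primal and dual optimal points when strong duality holds.
For convex optimization, the KKT conditions (these together with a vanishing gradient of the Lagrangian at $x^*$) are also sufficient.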
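
As a sanity check, the conditions can be tested on a small convex problem picked for illustration: minimize $x_1^2 + x_2^2$ subject to $2 - x_1 - x_2 \le 0$. The sketch below also checks stationarity (vanishing gradient of the Lagrangian), which completes the KKT set. The helper name `kkt_report` and the candidate points are hypothetical.

```python
def kkt_report(x, lam, tol=1e-9):
    """Check the KKT conditions for: minimize x1^2 + x2^2 s.t. 2 - x1 - x2 <= 0."""
    x1, x2 = x
    f1 = 2.0 - x1 - x2                        # inequality constraint value, must be <= 0
    grad = (2.0 * x1 - lam, 2.0 * x2 - lam)   # gradient in x of x1^2 + x2^2 + lam * f1
    return {
        "primal_feasible": f1 <= tol,
        "dual_feasible": lam >= -tol,
        "complementary_slackness": abs(lam * f1) <= tol,
        "stationary": all(abs(g) <= tol for g in grad),
    }

optimal = kkt_report((1.0, 1.0), 2.0)      # candidate optimum with its multiplier
not_optimal = kkt_report((2.0, 2.0), 0.0)  # feasible point that is not optimal
print(optimal)
print(not_optimal)
```

The point $(1, 1)$ with $\lambda = 2$ passes every check. The point $(2, 2)$ is feasible and satisfies complementary slackness with $\lambda = 0$, but fails stationarity, so it cannot be optimal.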

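15 Solving SVM
Recall our optimization problem:
$$\min_{w, b} \frac{1}{2} \|w\|^2 \quad \text{subject to} \quad y_i (w^T x_i + b) \ge 1, \quad i = 1, \dots, n$$
Lagrangian:
$$L(w, b, \alpha) = \frac{1}{2} \|w\|^2 - \sum_{i=1}^{n} \alpha_i \bigl[ y_i (w^T x_i + b) - 1 \bigr], \quad \alpha_i \ge 0$$
Only a few $\alpha_i$ are non-zero.
Solving the dual problem:
$$\max_{\alpha \ge 0} \min_{w, b} L(w, b, \alpha)$$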

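16 Solving SVM
Take partial derivatives with respect to $(w, b)$ and set them to zero:
$$w = \sum_{i=1}^{n} \alpha_i y_i x_i, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0$$
Plugging into the SVM formulation:
$$\max_{\alpha} \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j \quad \text{subject to} \quad \alpha_i \ge 0, \quad \sum_{i=1}^{n} \alpha_i y_i = 0$$
The objective depends only on dot products $x_i^T x_j$, not on the explicit $x$.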
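
The dependence on dot products alone is easy to see in code. The sketch below evaluates the hard-margin dual objective $\sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^T x_j$ on a made-up two-point data set. The multipliers $\alpha = (0.5, 0.5)$ were worked out by hand for this toy case and are an assumption of the example, not the output of a solver.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def dual_objective(alpha, X, y):
    """sum_i alpha_i - 0.5 * sum_ij alpha_i alpha_j y_i y_j (x_i . x_j)."""
    n = len(X)
    quad = sum(
        alpha[i] * alpha[j] * y[i] * y[j] * dot(X[i], X[j])
        for i in range(n)
        for j in range(n)
    )
    return sum(alpha) - 0.5 * quad

# Hypothetical toy data: one point per class, symmetric about the origin.
X = [[1.0, 0.0], [-1.0, 0.0]]
y = [1, -1]

best = dual_objective([0.5, 0.5], X, y)     # hand-derived optimal multipliers
other = dual_objective([0.25, 0.25], X, y)  # another feasible choice
print(best, other)
```

For this data the max-margin solution is $w = (1, 0)$, $b = 0$, with primal value $\frac{1}{2}\|w\|^2 = 0.5$, which matches the dual value at $\alpha = (0.5, 0.5)$. The data enter only through `dot(X[i], X[j])`.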

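17 Solving SVM
Once we have solved for the Lagrange multipliers $\alpha_i$, we can reconstruct the SVM weights as:
$$w = \sum_{i=1}^{n} \alpha_i y_i x_i$$
Only a few $\alpha_i$ values are non-zero. Samples corresponding to the non-zero $\alpha_i$ are called support vectors.
The decision function depends entirely on the support vectors.
For testing new data $z$, compute
$$f(z) = w^T z + b = \sum_{i=1}^{n} \alpha_i y_i x_i^T z + b$$
and classify $z$ by the sign of $f(z)$.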
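
The reconstruction step can be sketched as follows, assuming the multipliers are already known. The data and $\alpha$ values are hand-derived for a toy hard-margin problem. The bias is recovered from a support vector via $y_s (w^T x_s + b) = 1$, which is a standard consequence of complementary slackness rather than a formula stated on the slide.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def reconstruct(alpha, X, y, tol=1e-12):
    """Recover w, b and the support-vector indices from the multipliers."""
    dim = len(X[0])
    w = [sum(a * yi * x[k] for a, yi, x in zip(alpha, y, X)) for k in range(dim)]
    support = [i for i, a in enumerate(alpha) if a > tol]
    s = support[0]
    # Support vectors satisfy y_s (w.x_s + b) = 1, so b = y_s - w.x_s (since y_s^2 = 1).
    b = y[s] - dot(w, X[s])
    return w, b, support

def predict(w, b, z):
    """Classify z by the sign of w.z + b."""
    return 1 if dot(w, z) + b >= 0 else -1

# Hypothetical data; the third point lies far from the boundary, so alpha_3 = 0.
X = [[1.0, 0.0], [-1.0, 0.0], [3.0, 2.0]]
y = [1, -1, 1]
alpha = [0.5, 0.5, 0.0]

w, b, support = reconstruct(alpha, X, y)
print(w, b, support)
print(predict(w, b, [2.0, 5.0]), predict(w, b, [-0.5, 1.0]))
```

Dropping the third sample from `X` would leave `w` and `b` unchanged, since its multiplier is zero.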

18 Interpretation
The optimal decision function depends on a small number of data points, so removing a data point is less likely to affect the decision function.
We only need to specify inner products between data points, without explicitly specifying a vector representation of the data points.

19 Non-Linear Decision Boundary
So far, we have only considered a large-margin classifier with a linear decision boundary. How do we generalize it to become nonlinear?
Idea: transform $x$ to a non-linear, high-dimensional space.
Input space: the space where the data points $x$ lie.
Feature space: the space of $\Phi(x)$ after transformation.
Why transform: a linear operation in feature space is equivalent to a non-linear operation in input space, so we can recycle our SVM solution.
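
A one-dimensional illustration of the idea, with made-up data: points with $|x| > 1$ are labeled $+1$ and points with $|x| < 1$ are labeled $-1$. No single threshold on $x$ separates them, but after the hypothetical map $\Phi(x) = (x, x^2)$ the linear rule $x^2 - 1 \ge 0$ does.

```python
def phi(x):
    """Hypothetical feature map from the 1-D input space to a 2-D feature space."""
    return (x, x * x)

def predict(w, b, features):
    """Linear classifier in feature space."""
    score = sum(wi * fi for wi, fi in zip(w, features)) + b
    return 1 if score >= 0 else -1

def separable_by_threshold(X, y):
    """In 1-D, linear separability means labels change at most once along the line."""
    labels = [yi for _, yi in sorted(zip(X, y))]
    changes = sum(1 for a, b in zip(labels, labels[1:]) if a != b)
    return changes <= 1

X = [-2.0, -0.5, 0.5, 2.0]
y = [1, -1, -1, 1]

w, b = (0.0, 1.0), -1.0  # hyperplane x^2 - 1 = 0 in feature space
preds = [predict(w, b, phi(x)) for x in X]
print(separable_by_threshold(X, y))  # linear separability in input space
print(preds)
```

The decision boundary is a straight line in feature space, but in input space it corresponds to the two points $x = \pm 1$, which is a non-linear boundary.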

20 Example

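21 Kernel method
Recall the SVM formula, written in feature space:
$$\max_{\alpha} \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \Phi(x_i)^T \Phi(x_j)$$
We only need to know the inner products $K(x_i, x_j) = \Phi(x_i)^T \Phi(x_j)$.
Many geometric operations (e.g., norms, distances, angles) can be expressed as inner products.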
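
As one example of a geometric operation expressed through inner products, the squared distance in feature space is $\|\Phi(x) - \Phi(z)\|^2 = K(x, x) - 2 K(x, z) + K(z, z)$. The sketch below checks this for the quadratic kernel $K(x, z) = (x^T z)^2$ on 2-D inputs, whose explicit feature map is $\Phi(x) = (x_1^2, \sqrt{2} x_1 x_2, x_2^2)$. The kernel choice and the two test points are assumptions made for illustration.

```python
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def quadratic_kernel(x, z):
    """K(x, z) = (x.z)^2, computed without forming the feature vectors."""
    return dot(x, z) ** 2

def phi(x):
    """Explicit feature map of the quadratic kernel for 2-D inputs."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def kernel_distance_sq(k, x, z):
    """Squared feature-space distance using kernel evaluations only."""
    return k(x, x) - 2.0 * k(x, z) + k(z, z)

x, z = [1.0, 2.0], [3.0, 1.0]
via_kernel = kernel_distance_sq(quadratic_kernel, x, z)
diff = [a - b for a, b in zip(phi(x), phi(z))]
via_features = dot(diff, diff)
print(via_kernel, via_features)
```

Both routes give the same number up to floating-point rounding, but the kernel route never constructs $\Phi(x)$.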

22 Example

23 Kernel method
Linear kernel: $K(x, z) = x^T z$
Polynomial kernel: $K(x, z) = (x^T z + c)^d$
Radial Basis Function (Gaussian) kernel: $K(x, z) = \exp\bigl(-\|x - z\|^2 / (2 \sigma^2)\bigr)$
What are the pros and cons of using a kernel?
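
The three kernels can be written as plain functions and used to build the Gram matrix that replaces the dot products in the dual. This is a sketch under a common parametrization, $(x^T z + c)^d$ for the polynomial kernel and $\exp(-\|x - z\|^2 / (2 \sigma^2))$ for the RBF kernel; the parameter names `c`, `d`, `sigma` and the sample points are assumptions.

```python
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def linear_kernel(x, z):
    return dot(x, z)

def polynomial_kernel(x, z, c=1.0, d=2):
    """(x.z + c)^d with assumed offset c and degree d."""
    return (dot(x, z) + c) ** d

def rbf_kernel(x, z, sigma=1.0):
    """exp(-||x - z||^2 / (2 sigma^2)) with assumed bandwidth sigma."""
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def gram_matrix(kernel, X):
    """Matrix of pairwise kernel values K[i][j] = kernel(X[i], X[j])."""
    return [[kernel(a, b) for b in X] for a in X]

X = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
K = gram_matrix(rbf_kernel, X)
print(linear_kernel(X[0], X[1]), polynomial_kernel(X[0], X[1]))
print(K)
```

On the question of pros and cons, one visible trade-off is that the Gram matrix has $n \times n$ entries, so its cost grows with the number of samples rather than with the dimension of the feature space.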

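24 Properties of SVM solution
Taking the derivative of the Lagrangian with respect to the primal variables and setting it to zero gives the stationarity conditions; for $(w, b)$ these are $w = \sum_i \alpha_i y_i x_i$ and $\sum_i \alpha_i y_i = 0$.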

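25 Karush-Kuhn-Tucker condition
The gradient of the Lagrangian must vanish at the optimal point. Thus we have
$$\nabla f_0(x^*) + \sum_{i=1}^{m} \lambda_i^* \nabla f_i(x^*) + \sum_{i=1}^{p} \mu_i^* \nabla h_i(x^*) = 0$$
This is a necessary condition for any pair of primal and dual optimal points when strong duality holds.
For convex optimization, the KKT conditions are also sufficient.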

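26 Dual function
Definition:
$$g(\lambda, \mu) = \inf_x L(x, \lambda, \mu)$$
The dual function lower-bounds the Lagrangian, and hence the primal optimal value: for any $\lambda \ge 0$ and any feasible $x$,
$$g(\lambda, \mu) \le L(x, \lambda, \mu) \le f_0(x), \quad \text{so} \quad g(\lambda, \mu) \le p^*$$
Exercise: verify the above relationship.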
