MATH 567: Mathematical Techniques in Data Science Support vector machines

Size: px

Start display at page:

Download "MATH 567: Mathematical Techniques in Data Science Support vector machines"

Carmel Leonard
5 years ago
Views:

1 1/8 MATH 567: Mathematical Techniques in Data Science Support vector machines Dominique Guillot Departments of Mathematical Sciences University of Delaware March 13, 2016

2 Hyperplanes 2/8 Recall: A hyperplane H in V = R n is a subspace of V of dimension n 1 (i.e., a subspace of codimension 1). Each hyperplane is determined by a nonzero vector β R n via H = {x R n : β T x = 0} = span(β).

3 Hyperplanes 2/8 Recall: A hyperplane H in V = R n is a subspace of V of dimension n 1 (i.e., a subspace of codimension 1). Each hyperplane is determined by a nonzero vector β R n via H = {x R n : β T x = 0} = span(β). An ane hyperplane H in R n is a subset of the form where β 0 R, β R n. H = {x R n : β 0 + β T x = 0}

4 Hyperplanes Recall: A hyperplane H in V = R n is a subspace of V of dimension n 1 (i.e., a subspace of codimension 1). Each hyperplane is determined by a nonzero vector β R n via H = {x R n : β T x = 0} = span(β). An ane hyperplane H in R n is a subset of the form where β 0 R, β R n. H = {x R n : β 0 + β T x = 0} We often use the term hyperplane for ane hyperplane. 2/8

5 Hyperplanes (cont.) 3/8 Let H = {x R n : β 0 + β T x = 0}.

6 Hyperplanes (cont.) Let H = {x R n : β 0 + β T x = 0}. Note that for x 0, x 1 H, β T (x 0 x 1 ) = 0. Thus β is perpendicular to H. It follows that for x R n, d(x, H) = βt β (x x 0) = β 0 + β T x. β 3/8

7 4/8 Separating hyperplane Suppose we have binary data with labels {+1, 1}. We want to separate data using an (ane) hyperplane. ESL, Figure (Orange = least-squares)

8 4/8 Separating hyperplane Suppose we have binary data with labels {+1, 1}. We want to separate data using an (ane) hyperplane. ESL, Figure (Orange = least-squares) Classify using G(x) = sgn(x T β + β 0 ).

9 4/8 Separating hyperplane Suppose we have binary data with labels {+1, 1}. We want to separate data using an (ane) hyperplane. ESL, Figure (Orange = least-squares) Classify using G(x) = sgn(x T β + β 0 ). Separating hyperplane may not be unique. Separating hyperplane may not exist (i.e., data may not be separable).

10 Margins 5/8 Uniqueness problem: when the data is separable, choose the hyperplane to maximize the margin (the no man's land).

11 Margins 5/8 Uniqueness problem: when the data is separable, choose the hyperplane to maximize the margin (the no man's land). Data: (y i, x i ) {+1, 1} R p (i = 1,..., n). Suppose β 0 + β T x is a separating hyperplane with β = 1.

12 Margins 5/8 Uniqueness problem: when the data is separable, choose the hyperplane to maximize the margin (the no man's land). Data: (y i, x i ) {+1, 1} R p (i = 1,..., n). Suppose β 0 + β T x is a separating hyperplane with β = 1. Note that: y i (x T i β + β 0 ) > 0 Correct classication y i (x T i β + β 0 ) < 0 Incorrect classication

13 Margins Uniqueness problem: when the data is separable, choose the hyperplane to maximize the margin (the no man's land). Data: (y i, x i ) {+1, 1} R p (i = 1,..., n). Suppose β 0 + β T x is a separating hyperplane with β = 1. Note that: y i (x T i β + β 0 ) > 0 Correct classication y i (x T i β + β 0 ) < 0 Incorrect classication Also, y i (x T i β + β 0) = distance between x and hyperplane (since β = 1). 5/8

14 Margins (cont.) 6/8 Thus, if the data is separable, we can solve max M β 0,β R p, β =1 subject to y i (x T i β + β 0 ) M (i = 1,..., n).

15 Margins (cont.) 6/8 Thus, if the data is separable, we can solve max M β 0,β R p, β =1 subject to y i (x T i β + β 0 ) M (i = 1,..., n). We will transform the problem into a usual form used in convex optimization.

16 6/8 Margins (cont.) Thus, if the data is separable, we can solve max M β 0,β R p, β =1 subject to y i (x T i β + β 0 ) M (i = 1,..., n). We will transform the problem into a usual form used in convex optimization. We can remove β = 1 by replacing the constraint by 1 β y i(x T i β+β 0 ) M, or equivalently, y i (x T i β+β 0 ) M β.

17 Margins (cont.) 6/8 Thus, if the data is separable, we can solve max M β 0,β R p, β =1 subject to y i (x T i β + β 0 ) M (i = 1,..., n). We will transform the problem into a usual form used in convex optimization. We can remove β = 1 by replacing the constraint by 1 β y i(x T i β+β 0 ) M, or equivalently, y i (x T i β+β 0 ) M β. We can always rescale (β, β 0 ) so that β = 1/M. Our problem is therefore equivalent to 1 min β 0,β R p 2 β 2 subject to y i (x T i β + β 0 ) 1 (i = 1,..., n).

18 Margins (cont.) Thus, if the data is separable, we can solve max M β 0,β R p, β =1 subject to y i (x T i β + β 0 ) M (i = 1,..., n). We will transform the problem into a usual form used in convex optimization. We can remove β = 1 by replacing the constraint by 1 β y i(x T i β+β 0 ) M, or equivalently, y i (x T i β+β 0 ) M β. We can always rescale (β, β 0 ) so that β = 1/M. Our problem is therefore equivalent to 1 min β 0,β R p 2 β 2 subject to y i (x T i β + β 0 ) 1 (i = 1,..., n). We now recognize the problem as a convex optimization problem with a quadratic objective, and linear inequality constraints. 6/8

19 Support vector machines 7/8 The previous problem works well when the data is separable. What happens if there is no way to nd a margin?

20 Support vector machines 7/8 The previous problem works well when the data is separable. What happens if there is no way to nd a margin? We allow some points to be on the wrong side of the margin, but keep control on the error.

21 Support vector machines 7/8 The previous problem works well when the data is separable. What happens if there is no way to nd a margin? We allow some points to be on the wrong side of the margin, but keep control on the error. We replace y i (x T i β + β 0) M by y i (x T i β + β 0 ) M(1 ξ i ), ξ i 0, and add the constraint n ξ i C for some xed constant C > 0. i=1

22 Support vector machines 7/8 The previous problem works well when the data is separable. What happens if there is no way to nd a margin? We allow some points to be on the wrong side of the margin, but keep control on the error. We replace y i (x T i β + β 0) M by y i (x T i β + β 0 ) M(1 ξ i ), ξ i 0, and add the constraint n ξ i C for some xed constant C > 0. i=1 The problem becomes: max M β 0,β R p, β =1 subject to y i (x T i β + β 0 ) M(1 ξ i ) n ξ i 0, ξ i C. i=1

23 Support vector machines (cont.) 8/8 As before, we can transform the problem into its normal form: 1 min β 0,β 2 β 2 subject to y i (x T i β + β 0 ) 1 ξ i n ξ i 0, ξ i C. i=1 Problem can be solved using standard optimization packages.

SUPPORT VECTOR MACHINES

SUPPORT VECTOR MACHINES Today Reading AIMA 18.9 Goals (Naïve Bayes classifiers) Support vector machines 1 Support Vector Machines (SVMs) SVMs are probably the most popular off-the-shelf classifier! Software