Machine Learning: Support Vector Machines
Fabio Vandin
November 20, 2017
Classification and Margin

Consider a classification problem with two classes:
- instance set $X = \mathbb{R}^d$
- label set $Y = \{-1, +1\}$

Training data: $S = ((x_1, y_1), \dots, (x_m, y_m))$
Hypothesis set $H$ = halfspaces

Assumption: the data is linearly separable $\Rightarrow$ there exists a halfspace that perfectly classifies the training set.

In general there are multiple separating hyperplanes $\Rightarrow$ which one is the best choice?
Classification and Margin

[Figure: several candidate separating hyperplanes for the same training set]

The last one seems the best choice, since it can tolerate more noise. Informally, for a given separating halfspace we define its margin as its minimum distance to an example in the training set $S$.

Intuition: the best separating hyperplane is the one with the largest margin. How do we find it?
Linearly Separable Training Set

A training set $S = ((x_1, y_1), \dots, (x_m, y_m))$ is linearly separable if there exists a halfspace $(w, b)$ such that $y_i = \mathrm{sign}(\langle w, x_i \rangle + b)$ for all $i = 1, \dots, m$. Equivalent to:
$$\forall i = 1, \dots, m: \quad y_i(\langle w, x_i \rangle + b) > 0$$
Informally: the margin of a separating hyperplane is its minimum distance to an example in the training set $S$.
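The separability condition is easy to check numerically. A minimal sketch in Python (the toy halfspace and data below are assumptions for illustration, not from the slides):

```python
import numpy as np

def separates(w, b, X, y):
    """True iff y_i * (<w, x_i> + b) > 0 for every training example."""
    return bool(np.all(y * (X @ w + b) > 0))

# Toy data on opposite sides of the line x1 + x2 = 0 (assumed example)
X = np.array([[1.0, 1.0], [-1.0, -1.0]])
y = np.array([1, -1])
print(separates(np.array([1.0, 1.0]), 0.0, X, y))  # True
```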
Separating Hyperplane and Margin

Given the hyperplane defined by $L = \{v : \langle w, v \rangle + b = 0\}$, and given $x$, the distance of $x$ to $L$ is
$$d(x, L) = \min\{\|x - v\| : v \in L\}$$
Claim: if $\|w\| = 1$ then $d(x, L) = |\langle w, x \rangle + b|$. (Proof: Claim 15.1 [UML])
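The claim can be verified numerically by projecting $x$ onto $L$ and comparing distances; a small sketch, with a unit-norm $w$, offset $b$, and point $x$ of my own choosing:

```python
import numpy as np

w = np.array([0.6, 0.8])   # assumed example with ||w|| = 1
b = -1.0
x = np.array([2.0, 3.0])

claim = abs(w @ x + b)     # |<w, x> + b|

# Direct computation: the closest point of L to x is v = x - (<w, x> + b) * w
v = x - (w @ x + b) * w
direct = np.linalg.norm(x - v)

print(claim, direct)       # both 2.6
```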
Margin and Support Vectors

The margin of a separating hyperplane is the distance of the closest example in the training set to it. If $\|w\| = 1$ the margin is:
$$\min_{i \in \{1, \dots, m\}} |\langle w, x_i \rangle + b|$$
The closest examples are called support vectors.
Support Vector Machine (SVM)

Hard-SVM: seek the separating hyperplane with the largest margin (only for linearly separable data).

Computational problem:
$$\arg\max_{(w,b): \|w\|=1} \; \min_{i \in \{1,\dots,m\}} |\langle w, x_i \rangle + b| \quad \text{subject to } \forall i: y_i(\langle w, x_i \rangle + b) > 0$$

Equivalent formulation (due to the separability assumption):
$$\arg\max_{(w,b): \|w\|=1} \; \min_{i \in \{1,\dots,m\}} y_i(\langle w, x_i \rangle + b)$$
Hard-SVM: Quadratic Programming Formulation

input: $(x_1, y_1), \dots, (x_m, y_m)$
solve:
$$(w_0, b_0) = \arg\min_{(w,b)} \|w\|^2 \quad \text{subject to } \forall i: y_i(\langle w, x_i \rangle + b) \ge 1$$
output: $\hat{w} = \frac{w_0}{\|w_0\|}$, $\hat{b} = \frac{b_0}{\|w_0\|}$

How do we get a solution? This is a quadratic optimization problem: the objective is a convex quadratic function and the constraints are linear inequalities $\Rightarrow$ Quadratic Programming solvers!

Proposition: The output of the algorithm above is a solution to the Equivalent Formulation in the previous slide.
Proof: on the board and in the book (Lemma 15.2 [UML])
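As an illustration, the QP above maps almost directly onto a generic convex solver. A minimal sketch using cvxpy (the solver choice and the toy data are my own assumptions, not part of the slides):

```python
import cvxpy as cp
import numpy as np

# Toy linearly separable data (assumed for illustration)
X = np.array([[2.0, 2.0], [1.5, 2.5], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m, d = X.shape

w = cp.Variable(d)
b = cp.Variable()

# min ||w||^2  subject to  y_i * (<w, x_i> + b) >= 1 for all i
constraints = [cp.multiply(y, X @ w + b) >= 1]
prob = cp.Problem(cp.Minimize(cp.sum_squares(w)), constraints)
prob.solve()

# Normalize to get the unit-norm output (w_hat, b_hat)
w0, b0 = w.value, b.value
w_hat, b_hat = w0 / np.linalg.norm(w0), b0 / np.linalg.norm(w0)
print(w_hat, b_hat)
```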
Equivalent Formulation and Support Vectors

Equivalent formulation (homogeneous halfspaces): assume the first component of each $x \in X$ is 1; then
$$w_0 = \arg\min_{w} \|w\|^2 \quad \text{subject to } \forall i: y_i \langle w, x_i \rangle \ge 1$$

Support vectors = vectors at minimum distance from the hyperplane defined by $w_0$. The support vectors are the only ones that matter for defining $w_0$!

Proposition: Let $w_0$ be as above. Let $I = \{i : |\langle w_0, x_i \rangle| = 1\}$. Then there exist coefficients $\alpha_1, \dots, \alpha_m$ such that
$$w_0 = \sum_{i \in I} \alpha_i x_i$$
Support vectors = $\{x_i : i \in I\}$

Note: solving Hard-SVM is equivalent to finding $\alpha_i$ for $i = 1, \dots, m$, with $\alpha_i \ne 0$ only for the support vectors.
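For intuition, off-the-shelf SVM implementations expose exactly these quantities. For example, with scikit-learn (using a very large C to approximate the hard margin is an assumption of this sketch):

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [1.5, 2.5], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e10).fit(X, y)  # hard margin approximated by large C

print(clf.support_vectors_)  # the x_i with i in I (alpha_i != 0)
print(clf.dual_coef_)        # the nonzero alpha_i * y_i
print(clf.coef_)             # w, a combination of the support vectors only
```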
Soft-SVM

Hard-SVM works if the data is linearly separable. What if the data is not linearly separable? $\Rightarrow$ Soft-SVM

Idea: modify the constraints of Hard-SVM to allow for some violations, but account for the violations in the objective function.
Soft-SVM Constraints

Hard-SVM constraints: $y_i(\langle w, x_i \rangle + b) \ge 1$

Soft-SVM constraints: introduce slack variables $\xi_1, \dots, \xi_m \ge 0$; for each $i = 1, \dots, m$:
$$y_i(\langle w, x_i \rangle + b) \ge 1 - \xi_i$$
$\xi_i$: how much the constraint $y_i(\langle w, x_i \rangle + b) \ge 1$ is violated.

Soft-SVM minimizes a combination of
- the norm of $w$
- the average of the $\xi_i$

The tradeoff between the two terms is controlled by a parameter $\lambda \in \mathbb{R}$, $\lambda > 0$.
Soft-SVM: Optimization Problem

input: $(x_1, y_1), \dots, (x_m, y_m)$, parameter $\lambda > 0$
solve:
$$\min_{w, b, \xi} \left( \lambda \|w\|^2 + \frac{1}{m} \sum_{i=1}^m \xi_i \right) \quad \text{subject to } \forall i: y_i(\langle w, x_i \rangle + b) \ge 1 - \xi_i \text{ and } \xi_i \ge 0$$
output: $w, b$

Equivalent formulation: consider the hinge loss
$$\ell^{\text{hinge}}((w, b), (x, y)) = \max\{0, 1 - y(\langle w, x \rangle + b)\}$$
Given $(w, b)$ and a training set $S$, the empirical risk $L_S^{\text{hinge}}((w, b))$ is
$$L_S^{\text{hinge}}((w, b)) = \frac{1}{m} \sum_{i=1}^m \ell^{\text{hinge}}((w, b), (x_i, y_i))$$
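The hinge loss and its empirical risk translate directly into NumPy; a minimal sketch (function names are my own):

```python
import numpy as np

def hinge_loss(w, b, x, y):
    """l_hinge((w, b), (x, y)) = max{0, 1 - y(<w, x> + b)}"""
    return max(0.0, 1.0 - y * (np.dot(w, x) + b))

def empirical_hinge_risk(w, b, X, y):
    """L_S^hinge((w, b)): average hinge loss over the training set."""
    return float(np.mean(np.maximum(0.0, 1.0 - y * (X @ w + b))))
```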
Soft-SVM: Solve Soft-SVM as RLM (Regularized Loss Minimization)

$$\min_{w, b, \xi} \left( \lambda \|w\|^2 + \frac{1}{m} \sum_{i=1}^m \xi_i \right) \quad \text{subject to } \forall i: y_i(\langle w, x_i \rangle + b) \ge 1 - \xi_i \text{ and } \xi_i \ge 0$$

Equivalent formulation with the hinge loss:
$$\min_{w, b} \left( \lambda \|w\|^2 + \frac{1}{m} \sum_{i=1}^m \ell^{\text{hinge}}((w, b), (x_i, y_i)) \right)$$
that is
$$\min_{w, b} \left( \lambda \|w\|^2 + L_S^{\text{hinge}}(w, b) \right)$$

Note:
- $\lambda \|w\|^2$: $\ell_2$ regularization
- $L_S^{\text{hinge}}(w, b)$: empirical risk for the hinge loss
Soft-SVM: Solution

We need to solve:
$$\min_{w, b} \left( \lambda \|w\|^2 + \frac{1}{m} \sum_{i=1}^m \ell^{\text{hinge}}((w, b), (x_i, y_i)) \right)$$
How?
- standard solvers for optimization problems
- Stochastic Gradient Descent
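A minimal SGD sketch for the homogeneous case ($b$ absorbed into $w$ via a constant first feature, as on the earlier slide). The step-size schedule and iteration count are assumptions, in the spirit of Pegasos-style subgradient descent:

```python
import numpy as np

def sgd_soft_svm(X, y, lam=0.1, T=1000, seed=0):
    """SGD for  min_w  lam * ||w||^2 + L_S^hinge(w)  (homogeneous halfspaces).

    At each step, pick a random example and follow a subgradient of the
    regularizer plus the hinge loss on that example.
    """
    rng = np.random.default_rng(seed)
    m, d = X.shape
    w = np.zeros(d)
    for t in range(1, T + 1):
        eta = 1.0 / (2.0 * lam * t)        # decreasing step size (assumed schedule)
        i = rng.integers(m)                # pick a random training example
        grad = 2.0 * lam * w               # gradient of lam * ||w||^2
        if y[i] * np.dot(w, X[i]) < 1:     # hinge loss active: add its subgradient
            grad -= y[i] * X[i]
        w -= eta * grad
    return w
```

Predictions are then $\mathrm{sign}(\langle w, x \rangle)$ on instances augmented with the same constant feature.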
Exercise 1

Your friend has developed a new machine learning algorithm for binary classification (i.e., $y \in \{-1, 1\}$) with 0-1 loss and tells you that it achieves a generalization error of only 0.05. However, when you look at the learning problem he is working on, you find out that $\Pr_D[y = 1] = 0.95$...

Assume that $\Pr_D[y = \ell] = p_\ell$. Derive the generalization error of the (dumb) hypothesis/model that always predicts $\ell$.

Use the result above to decide whether your friend's algorithm has learned something or not.
Exercise 2

Consider a linear regression problem, where $X = \mathbb{R}^d$ and $Y = \mathbb{R}$, with mean squared loss. The hypothesis set is the set of constant functions, that is $H = \{h_a : a \in \mathbb{R}\}$, where $h_a(x) = a$. Let $S = ((x_1, y_1), \dots, (x_m, y_m))$ denote the training set.

Derive the hypothesis $h \in H$ that minimizes the training error.

Use the result above to explain why, for a given hypothesis $\hat{h}$ from the set of all linear models, the coefficient of determination
$$R^2 = 1 - \frac{\sum_{i=1}^m (\hat{h}(x_i) - y_i)^2}{\sum_{i=1}^m (y_i - \bar{y})^2}$$
where $\bar{y}$ is the average of the $y_i$, $i = 1, \dots, m$, is a measure of how well $\hat{h}$ performs (on the training set).
Exercise 3

Consider the classification problem with $X = \mathbb{R}^2$, $Y = \{0, 1\}$. Consider the hypothesis class $H = \{h_{(c,a)} : c \in \mathbb{R}^2, a \in \mathbb{R}\}$ with
$$h_{(c,a)}(x) = \begin{cases} 1 & \text{if } \|x - c\| \le a \\ 0 & \text{otherwise} \end{cases}$$
Find the VC-dimension of $H$.
Exercise 4

Assume we have the following dataset ($x_i \in \mathbb{R}^2$) and, by solving the SVM for classification, we get the corresponding optimal dual variables $\alpha_i$:

 i   x_i^T           y_i   alpha_i
 1   [ 0.2  -1.4]    -1    0
 2   [-2.1   1.7]     1    0
 3   [ 0.9   1.0]     1    0.5
 4   [-1.0  -3.1]    -1    0
 5   [-0.2  -1.0]    -1    0.25
 6   [-0.2   1.3]     1    0
 7   [ 2.0  -1.0]    -1    0.25
 8   [ 0.5   2.1]     1    0

Answer the following:
(A) Which are the support vectors?
(B) Draw a schematic picture reporting the data points (approximately) and the optimal separating hyperplane, and mark the support vectors. Would it be possible, by moving only two data points, to obtain the SAME separating hyperplane with only 2 support vectors? If so, draw the modified configuration (approximately).
Exercise 5

Consider the ridge regression problem $\arg\min_w \lambda \|w\|^2 + \sum_{i=1}^m (\langle w, x_i \rangle - y_i)^2$. Let: $h_S$ be the hypothesis obtained by ridge regression with training set $S$; $h^*$ be the hypothesis of minimum generalization error among all linear models.

(A) Draw, in the plot below, a typical behaviour of (i) the training error and (ii) the test/generalization error of $h_S$ as a function of $\lambda$.

(B) Draw, in the plot below, a typical behaviour of (i) $L_D(h_S) - L_D(h^*)$ and (ii) $L_D(h_S) - L_S(h_S)$ as a function of $\lambda$.

[Figure: two empty plots labelled (A) and (B), with axes for training error and test error as a function of $\lambda$]