Introduction to Machine Learning


1 Introduction to Machine Learning: Maximum Margin Methods
Varun Chandola, Computer Science & Engineering, State University of New York at Buffalo, Buffalo, NY, USA. CSE 474/574

2 Outline
Maximum Margin Classifiers: Linear Classification via Hyperplanes; Concept of Margin
Support Vector Machines: SVM Learning
Solving the SVM Optimization Problem: Constrained Optimization and Lagrange Multipliers; Karush-Kuhn-Tucker Conditions; Support Vectors; Optimization Constraints

3 Maximum Margin Classifiers
Remember the Perceptron! If the data is linearly separable, Perceptron training guarantees learning a decision boundary. But there can be other boundaries, depending on the initial value for $w$. So what is the best boundary?
(Figure: two classes in the $x_1$-$x_2$ plane, the decision function $y = w^\top x + b$, and a separating hyperplane $w^\top x = -b$ at distance $-b/\|w\|$ from the origin with normal vector $w$.)

6 Linear Hyperplane
Separates a D-dimensional space into two half-spaces. Defined by $w \in \mathbb{R}^D$, which is orthogonal to the hyperplane. This hyperplane goes through the origin. How do you check whether a point lies above or below the hyperplane? And what happens for points on the hyperplane?

7 Make the hyperplane not go through the origin: add a bias b.
$b > 0$ moves the hyperplane along $w$; $b < 0$ moves it opposite to $w$.
How to check whether a point lies above or below the hyperplane? If $w^\top x + b > 0$ then $x$ is above; else, below.
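To make the half-space test concrete, here is a minimal sketch in Python (the $w$, $b$, and test points are hypothetical, chosen only for illustration):

```python
# Which side of the hyperplane w^T x + b = 0 does a point fall on?
# w, b, and the test points below are hypothetical illustration values.
import numpy as np

w, b = np.array([2.0, -1.0]), 0.5
for x in (np.array([1.0, 1.0]), np.array([-1.0, 3.0])):
    side = w @ x + b
    print(x, "above" if side > 0 else "below" if side < 0 else "on the hyperplane")
```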

8 Line as a Decision Surface
The decision boundary is represented by the hyperplane $w$; for binary classification, $w$ points towards the positive class.
Decision Rule: $y = \mathrm{sign}(w^\top x + b)$, i.e., $w^\top x + b > 0 \Rightarrow y = +1$ and $w^\top x + b < 0 \Rightarrow y = -1$.
(Figure: hyperplane $w^\top x + b = 0$ separating the half-spaces $w^\top x + b > 0$ and $w^\top x + b < 0$.)

9 What is the Best Hyperplane Separator?
The Perceptron can find a hyperplane that separates the data... if the data is linearly separable. But there can be many choices! Find the one with the best separability (largest margin); it gives better generalization performance. 1. Intuitive reason 2. Theoretical foundations

10 What is a Margin?
The margin is the distance between an example and the decision line, denoted by $\gamma$.
For a positive point: $\gamma = \frac{w^\top x + b}{\|w\|}$. For a negative point: $\gamma = -\frac{w^\top x + b}{\|w\|}$.
Functional interpretation: the margin is positive if the prediction is correct and negative if the prediction is incorrect.
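A short sketch of the margin computation, assuming a hypothetical hyperplane $(w, b)$ and a few labeled points; multiplying by the label $y$ folds the two cases above into one expression:

```python
# Signed geometric margin gamma_n = y_n (w^T x_n + b) / ||w||;
# positive iff the point is correctly classified. Values are hypothetical.
import numpy as np

w, b = np.array([1.0, 1.0]), -3.0
X = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 0.5]])
y = np.array([-1, 1, 1])

gamma = y * (X @ w + b) / np.linalg.norm(w)
print(gamma)   # all positive: every point is on its correct side
```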

11 Maximum Margin Principle
(Figure: maximum-margin hyperplane $w^\top x + b = 0$ with margin boundaries $w^\top x + b = 1$ and $w^\top x + b = -1$; the gap between the boundaries has width $2/\|w\|$.)

12 Support Vector Machines
A hyperplane-based classifier defined by $w$ and $b$, like the perceptron. Find the hyperplane with maximum separation margin on the training data. Assume that the data is linearly separable (we will relax this later), i.e., zero training error (loss).
SVM Prediction Rule: $y = \mathrm{sign}(w^\top x + b)$
SVM Learning: Input: training data $\{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$. Objective: learn $w$ and $b$ that maximize the margin.

13 SVM Learning
The SVM learning task as an optimization problem: find $w$ and $b$ that give zero training error and maximize the margin ($= 2/\|w\|$), which is the same as minimizing $\|w\|$.
Optimization Formulation: minimize over $w, b$: $\frac{\|w\|^2}{2}$ subject to $y_n(w^\top x_n + b) \geq 1$, $n = 1, \ldots, N$.
This is an optimization problem with N linear inequality constraints.

14 A Different Interpretation of Margin
What impact does the margin have on $w$? A large margin implies a small $\|w\|$; a small $\|w\|$ implies regularized/simple solutions; and simple solutions give better generalizability (Occam's Razor). Computational Learning Theory provides a formal justification [1].

19 Solving the Optimization Problem
Optimization Formulation: minimize over $w, b$: $\frac{\|w\|^2}{2}$ subject to $y_n(w^\top x_n + b) \geq 1$, $n = 1, \ldots, N$.
There is a quadratic objective function to minimize with N inequality constraints. Off-the-shelf packages exist: quadprog (MATLAB), CVXOPT. Is that the best way?
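As a sketch of the off-the-shelf route, the following hands the primal QP to CVXOPT's qp solver over the stacked variable $z = [w; b]$, on a hypothetical separable dataset (not from the slides):

```python
# Hard-margin primal SVM as a QP: minimize (1/2)||w||^2
# subject to y_n (w^T x_n + b) >= 1. Data below is a hypothetical toy set.
import numpy as np
from cvxopt import matrix, solvers

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.4, (20, 2)), rng.normal(+2, 0.4, (20, 2))])
y = np.array([-1.0] * 20 + [+1.0] * 20)
N, D = X.shape

P = matrix(np.diag([1.0] * D + [0.0]))                     # penalize w, not b
q = matrix(np.zeros(D + 1))
G = matrix(-y[:, None] * np.hstack([X, np.ones((N, 1))]))  # -y_n [x_n; 1]^T z <= -1
h = matrix(-np.ones(N))

z = np.array(solvers.qp(P, q, G, h)['x']).ravel()
w, b = z[:D], z[D]
print("w =", w, "b =", b)
```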

20 Basic Optimization
Unconstrained: minimize over $x, y$: $f(x, y) = 2 - x^2 - 2y^2$.
Constrained: minimize over $x, y$: $f(x, y) = 2 - x^2 - 2y^2$ subject to $h(x, y) = x + y - 1 = 0$.

22 Lagrange Multipliers - A Primer
A tool for solving constrained optimization problems of differentiable functions: minimize over $x, y$: $f(x, y) = 2 - x^2 - 2y^2$ subject to $h(x, y) : x + y - 1 = 0$.
A Lagrange multiplier ($\beta$) lets you combine the two equations into one: minimize over $x, y, \beta$: $L(x, y, \beta) = f(x, y) + \beta h(x, y)$.
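A worked check of the primer example with SymPy, assuming the objective and constraint as reconstructed above; setting all partial derivatives of the Lagrangian to zero recovers the constrained stationary point:

```python
# Stationary point of L(x, y, beta) = f(x, y) + beta * h(x, y)
# for f = 2 - x^2 - 2y^2 and h = x + y - 1.
import sympy as sp

x, y, beta = sp.symbols('x y beta', real=True)
f = 2 - x**2 - 2*y**2
h = x + y - 1
L = f + beta * h

sols = sp.solve([sp.diff(L, v) for v in (x, y, beta)], [x, y, beta], dict=True)
print(sols)   # x = 2/3, y = 1/3, beta = 4/3
```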

24 Multiple Constraints
minimize over $x, y, z$: $f(x, y, z) = x^2 + 4y^2 + 2z^2 + 6y + z$ subject to $h_1(x, y, z) : x + z^2 - 1 = 0$ and $h_2(x, y, z) : x^2 + y^2 - 1 = 0$.
Use one multiplier per constraint: $L(x, y, z, \beta) = f(x, y, z) + \sum_i \beta_i h_i(x, y, z)$.

26 Handling Inequality Constraints
minimize over $x, y$: $f(x, y) = x^3 + y^2$ subject to an inequality constraint $g(x) \geq 0$.
Inequality constraints are transferred as constraints on the Lagrange multiplier, $\alpha$.

28 Handling Both Types of Constraints
minimize over $w$: $f(w)$ subject to $g_i(w) \leq 0$, $i = 1, \ldots, k$ and $h_i(w) = 0$, $i = 1, \ldots, l$.
Generalized Lagrangian: $L(w, \alpha, \beta) = f(w) + \sum_{i=1}^{k} \alpha_i g_i(w) + \sum_{i=1}^{l} \beta_i h_i(w)$, subject to $\alpha_i \geq 0$, $\forall i$.

29 Primal and Dual Formulations
Primal Optimization: Let $\theta_P$ be defined as $\theta_P(w) = \max_{\alpha, \beta : \alpha_i \geq 0} L(w, \alpha, \beta)$.
One can prove that the optimal value of the original constrained problem is the same as $p^* = \min_w \theta_P(w) = \min_w \max_{\alpha, \beta : \alpha_i \geq 0} L(w, \alpha, \beta)$.

30 Primal and Dual Formulations (II)
Dual Optimization: Consider $\theta_D$, defined as $\theta_D(\alpha, \beta) = \min_w L(w, \alpha, \beta)$. The dual optimization problem can be posed as $d^* = \max_{\alpha, \beta : \alpha_i \geq 0} \theta_D(\alpha, \beta) = \max_{\alpha, \beta : \alpha_i \geq 0} \min_w L(w, \alpha, \beta)$.
Is $d^* = p^*$? Note that $d^* \leq p^*$: the max-min of a function is always less than or equal to its min-max, since $\theta_D(\alpha, \beta) \leq L(w, \alpha, \beta) \leq \theta_P(w)$ for every feasible $(w, \alpha, \beta)$. When will they be equal? When $f(w)$ is convex and the constraints are affine.

31 Relation between primal and dual
In general $d^* \leq p^*$; for the SVM optimization the equality holds, provided certain conditions are true, known as the Karush-Kuhn-Tucker (KKT) conditions. For $d^* = p^* = L(w^*, \alpha^*, \beta^*)$:
$\nabla_w L(w^*, \alpha^*, \beta^*) = 0$
$\frac{\partial L(w^*, \alpha^*, \beta^*)}{\partial \beta_i} = 0$, $i = 1, \ldots, l$
$\alpha_i^* g_i(w^*) = 0$, $i = 1, \ldots, k$
$g_i(w^*) \leq 0$, $i = 1, \ldots, k$
$\alpha_i^* \geq 0$, $i = 1, \ldots, k$

32 Lagrangian Multipliers for SVM
Optimization Formulation: minimize over $w, b$: $\frac{\|w\|^2}{2}$ subject to $y_n(w^\top x_n + b) \geq 1$, $n = 1, \ldots, N$.
A Toy Example: $x \in \mathbb{R}^2$, two training points: $(x_1, y_1) = ((1, 1), -1)$ and $(x_2, y_2) = ((2, 2), +1)$. Find the best hyperplane $w = (w_1, w_2)$.

34 Optimization problem for the toy example
minimize over $w$: $f(w) = \frac{\|w\|^2}{2}$ subject to $g_1(w, b) = y_1(w^\top x_1 + b) - 1 \geq 0$ and $g_2(w, b) = y_2(w^\top x_2 + b) - 1 \geq 0$.
Substituting the actual values for $(x_1, y_1)$ and $(x_2, y_2)$: minimize $f(w) = \frac{\|w\|^2}{2}$ subject to $g_1(w, b) = -(w^\top x_1 + b) - 1 \geq 0$ and $g_2(w, b) = (w^\top x_2 + b) - 1 \geq 0$, i.e., $-(w_1 + w_2 + b) \geq 1$ and $2w_1 + 2w_2 + b \geq 1$. A worked check follows below.
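For the worked check, one can assume both toy points end up on the margin (both constraints active) and solve the resulting linear system coming from the stationarity conditions; this is a sketch, not part of the original slides:

```python
# Toy SVM via KKT: w = sum_n alpha_n y_n x_n, sum_n alpha_n y_n = 0,
# plus the two active margin constraints. y1 = -1, x1 = (1,1); y2 = +1, x2 = (2,2).
import sympy as sp

w1, w2, b, a1, a2 = sp.symbols('w1 w2 b a1 a2', real=True)
eqs = [
    sp.Eq(w1, -a1 * 1 + a2 * 2),      # stationarity in w (first coordinate)
    sp.Eq(w2, -a1 * 1 + a2 * 2),      # stationarity in w (second coordinate)
    sp.Eq(-a1 + a2, 0),               # stationarity in b
    sp.Eq(-(w1 + w2 + b), 1),         # x1 on the margin
    sp.Eq(2 * w1 + 2 * w2 + b, 1),    # x2 on the margin
]
print(sp.solve(eqs, [w1, w2, b, a1, a2]))   # w = (1, 1), b = -3, a1 = a2 = 1
```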

36 Back to SVM Optimization
Optimization Formulation: minimize over $w, b$: $\frac{\|w\|^2}{2}$ subject to $y_n(w^\top x_n + b) \geq 1$, $n = 1, \ldots, N$.
Introducing Lagrange multipliers $\alpha_n$, $n = 1, \ldots, N$, and rewriting as a (primal) Lagrangian: minimize over $w, b, \alpha$: $L_P(w, b, \alpha) = \frac{\|w\|^2}{2} + \sum_{n=1}^{N} \alpha_n \{1 - y_n(w^\top x_n + b)\}$ subject to $\alpha_n \geq 0$, $n = 1, \ldots, N$.

37 Solving the Lagrangian
Set the gradient of $L_P$ to 0: $\frac{\partial L_P}{\partial w} = 0 \Rightarrow w = \sum_{n=1}^{N} \alpha_n y_n x_n$ and $\frac{\partial L_P}{\partial b} = 0 \Rightarrow \sum_{n=1}^{N} \alpha_n y_n = 0$.
Substituting into $L_P$ gives the dual $L_D$. Dual Lagrangian Formulation: maximize over $\alpha$: $L_D(\alpha) = \sum_{n=1}^{N} \alpha_n - \frac{1}{2} \sum_{m,n=1}^{N} \alpha_m \alpha_n y_m y_n (x_m^\top x_n)$ subject to $\sum_{n=1}^{N} \alpha_n y_n = 0$ and $\alpha_n \geq 0$, $n = 1, \ldots, N$.
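A minimal sketch of handing $L_D$ to an off-the-shelf QP solver (CVXOPT), assuming data arrays X (N x D) and y (N,) with labels in {-1, +1}; CVXOPT minimizes, so the objective is negated:

```python
# Dual SVM QP: maximize sum_n a_n - (1/2) sum_{m,n} a_m a_n y_m y_n x_m^T x_n
# subject to a_n >= 0 and sum_n a_n y_n = 0.
import numpy as np
from cvxopt import matrix, solvers

def svm_dual(X, y):
    N = X.shape[0]
    K = X @ X.T                                      # Gram matrix of inner products
    P = matrix(np.outer(y, y) * K)                   # quadratic term
    q = matrix(-np.ones(N))                          # negated linear term
    G, h = matrix(-np.eye(N)), matrix(np.zeros(N))   # a_n >= 0
    A = matrix(y.astype(float).reshape(1, -1))       # sum_n a_n y_n = 0
    b = matrix(np.zeros(1))
    alpha = np.array(solvers.qp(P, q, G, h, A, b)['x']).ravel()
    w = (alpha * y) @ X                              # w = sum_n a_n y_n x_n
    return alpha, w
```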

39 Solving the Dual
The dual Lagrangian is a quadratic programming problem in the $\alpha_n$'s; use off-the-shelf solvers. Having found the $\alpha_n$'s: $w = \sum_{n=1}^{N} \alpha_n y_n x_n$. What will be the bias term $b$?

40 Investigating the Karush-Kuhn-Tucker Conditions
For the primal and dual formulations, we can optimize the dual formulation (as shown earlier); the solution should satisfy the Karush-Kuhn-Tucker (KKT) conditions.

41 The Karush-Kuhn-Tucker Conditions
$\nabla_w L_P(w, b, \alpha) = w - \sum_{n=1}^{N} \alpha_n y_n x_n = 0$ (1)
$\frac{\partial}{\partial b} L_P(w, b, \alpha) = -\sum_{n=1}^{N} \alpha_n y_n = 0$ (2)
$y_n \{w^\top x_n + b\} - 1 \geq 0$ (3)
$\alpha_n \geq 0$ (4)
$\alpha_n (y_n \{w^\top x_n + b\} - 1) = 0$ (5)
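These conditions are easy to verify numerically; the sketch below checks them for the toy solution derived earlier (w = (1, 1), b = -3, alpha = (1, 1)):

```python
# Numeric KKT check for the toy problem.
import numpy as np

X = np.array([[1.0, 1.0], [2.0, 2.0]]); y = np.array([-1.0, 1.0])
w, b, alpha = np.array([1.0, 1.0]), -3.0, np.array([1.0, 1.0])

margins = y * (X @ w + b) - 1
print(np.allclose(w, (alpha * y) @ X))        # (1) stationarity in w
print(np.isclose(alpha @ y, 0))               # (2) stationarity in b
print(np.all(margins >= -1e-9))               # (3) primal feasibility
print(np.all(alpha >= 0))                     # (4) dual feasibility
print(np.allclose(alpha * margins, 0))        # (5) complementary slackness
```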

42 Estimating Bias b
Use KKT condition #5: for $\alpha_n > 0$, $y_n\{w^\top x_n + b\} - 1 = 0$, which means that $b = -\frac{1}{2}\left[\max_{n : y_n = -1} w^\top x_n + \min_{n : y_n = 1} w^\top x_n\right]$.
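In code, this is a one-liner over the score vector; a sketch assuming X, y, and the learned w are available from the dual solution above:

```python
# b = -1/2 * ( max_{n: y_n = -1} w^T x_n + min_{n: y_n = +1} w^T x_n )
import numpy as np

def svm_bias(X, y, w):
    scores = X @ w
    return -0.5 * (scores[y == -1].max() + scores[y == 1].min())
```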

43 Key Observation from Dual Formulation
Most $\alpha_n$'s are 0. KKT condition #5: $\alpha_n(y_n\{w^\top x_n + b\} - 1) = 0$. If $x_n$ is not on the margin, then $y_n\{w^\top x_n + b\} > 1$ and hence $\alpha_n = 0$; $\alpha_n \neq 0$ only for $x_n$ on the margin. These are the support vectors, and they are all we need for prediction, as the sketch below shows.
(Figure: margin boundaries $w^\top x + b = \pm 1$ around the hyperplane $w^\top x + b = 0$, with margin width $2/\|w\|$.)
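A sketch of prediction using only the support vectors, assuming alpha, X, y, and b are available from the dual solution; examples with $\alpha_n = 0$ drop out of the sum:

```python
# f(x) = sign( sum_{n in SV} alpha_n y_n x_n^T x + b )
import numpy as np

def svm_predict(x, X, y, alpha, b, tol=1e-6):
    sv = alpha > tol                  # indices of the support vectors
    return np.sign((alpha[sv] * y[sv] * (X[sv] @ x)).sum() + b)
```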

44 What have we seen so far?
For linearly separable data, SVM learns a weight vector $w$ that maximizes the margin. SVM training is a constrained optimization problem: each training example should lie outside the margin, giving N constraints.

45 What if data is not linearly separable?
We cannot go for zero training error, but we can still learn a maximum margin hyperplane: 1. allow some examples to be misclassified; 2. allow some examples to fall inside the margin. How do you set up the optimization for SVM training?

48 Cutting Some Slack
(Figure: hyperplane $w^\top x + b = 0$ with margin boundaries $w^\top x + b = \pm 1$; some examples fall inside the margin or on the wrong side.)

49 Introducing Slack Variables
Separable case: to ensure zero training loss, the constraint was $y_n(w^\top x_n + b) \geq 1$, $n = 1, \ldots, N$.
Non-separable case: relax the constraint to $y_n(w^\top x_n + b) \geq 1 - \xi_n$, $n = 1, \ldots, N$. $\xi_n$ is called a slack variable ($\xi_n \geq 0$); for a misclassification, $\xi_n > 1$.

51 Relaxing the Constraint
It is OK to have some misclassified training examples; some $\xi_n$'s will be non-zero. We also want to minimize the number of such examples, i.e., minimize $\sum_{n=1}^{N} \xi_n$.
Optimization Problem for the Non-Separable Case: minimize over $w, b$: $f(w, b) = \|w\|^2 + C \sum_{n=1}^{N} \xi_n$ subject to $y_n(w^\top x_n + b) \geq 1 - \xi_n$ and $\xi_n \geq 0$, $n = 1, \ldots, N$. C controls the trade-off between the margin and the margin error.
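To see the role of C empirically, a quick sketch with scikit-learn's linear-kernel SVC on a hypothetical noisy two-class dataset; larger C penalizes slack more heavily:

```python
# Sweep C on overlapping Gaussian blobs; larger C typically yields
# fewer support vectors and a harder margin.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1.0, (50, 2)), rng.normal(+1, 1.0, (50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    print(C, clf.n_support_, clf.score(X, y))
```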

53 Estimating Weights
What is the role of C? The optimization procedure is similar to the separable case (QP for the dual), and the weights have the same expression: $w = \sum_{n=1}^{N} \alpha_n y_n x_n$.
The support vectors are slightly different: 1. points on the margin ($\xi_n = 0$); 2. points inside the margin but on the correct side ($0 < \xi_n < 1$); 3. points on the wrong side of the hyperplane ($\xi_n \geq 1$).

54 Concluding Remarks on SVM
Training time for SVM training is $O(N^3)$; many faster but approximate approaches exist (approximate QP solvers, online training). SVMs can be extended in different ways: 1. non-linear boundaries (kernel trick); 2. multi-class classification; 3. probabilistic output; 4. regression (Support Vector Regression).
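As a glimpse of the first extension, a sketch of the kernel trick on XOR-style data, which no linear boundary can separate (scikit-learn assumed):

```python
# A linear kernel cannot fit XOR; an RBF kernel can.
import numpy as np
from sklearn.svm import SVC

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])
print(SVC(kernel='linear').fit(X, y).score(X, y))           # at most 0.75
print(SVC(kernel='rbf', gamma=2.0).fit(X, y).score(X, y))   # 1.0
```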

55 References
[1] V. Vapnik. Statistical Learning Theory. Wiley, 1998.
