Notes on Support Vector Machines


Western Kentucky University
From the SelectedWorks of Matt Bogard
Summer May, 2012

Notes on Support Vector Machines
Matt Bogard, Western Kentucky University

Statistical Reference No. 30: Notes on Support Vector Machines

The most basic idea of support vector machines is to find a line (or hyperplane) that separates classes of data, and to use this information to classify new examples. But we don't want just any line; we want the line that maximizes the distance between the classes. This line turns out to be a separating hyperplane that is equidistant between the supporting hyperplanes of the sets that make up each distinct class. The notes that follow discuss the concepts of supporting and separating hyperplanes and inner products as they relate to support vector machines (SVMs). Using simple examples, much detail is given to the mathematical notation used to represent hyperplanes, as well as to how SVM classification works.

What is a separating hyperplane? From a graduate-level treatment of microeconomic theory we learn different versions of what's referred to as the separating hyperplane theorem:

(Image adapted from Green, 1995)

Suppose A, B ⊆ ℝᴺ are disjoint convex sets. Then there is a p ∈ ℝᴺ with p ≠ 0 and a value c ∈ ℝ such that p·x > c for all x ∈ A and p·y < c for all y ∈ B; i.e., there is a hyperplane H that separates A and B.

And we also learn different versions of the supporting hyperplane theorem:

Suppose B ⊆ ℝᴺ is convex and x is on the boundary of B. Then there is p ∈ ℝᴺ with p ≠ 0 such that p·x ≥ p·y for all y ∈ B; i.e., the hyperplane H supports the set B.

(Image adapted from Green, 1995)

Now, having somewhat rigorously defined what we mean by separating and supporting hyperplanes, how are they utilized in the context of support vector machines? The simplest story we can tell about support vector machines involves a case where we have two classes, defined such that y = 1 for one class (A) and y = -1 for the other (B). The idea is to draw a line that separates the two classes such that all of the examples where y = 1 fall on one side of the line and all of the examples where y = -1 fall on the other. This line is typically formulated using vector notation:

f(x) = wᵀx + b, such that f(x) > 0 if y = 1 and f(x) < 0 if y = -1.
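
To make the decision function concrete, here is a minimal R sketch (in the same style as the visualization code at the end of these notes) that classifies a point by the sign of f(x). The weight vector w = (1,1) and bias b = -8 are the values from the worked example developed below:

# decision function f(x) = w'x + b; sign(f(x)) gives the predicted class
w <- c(1, 1)   # weight vector from the worked example below
b <- -8        # bias from the worked example below
f <- function(x) sum(w * x) + b

sign(f(c(3, 11)))  # 1: predict y = 1, i.e. class A
sign(f(c(4, -1)))  # -1: predict y = -1, i.e. class B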

The vector w is referred to as the weight vector, and the value b is the bias. When b = 0 the equation represents the set of points such that wᵀx = 0, which is all of the points orthogonal to w; these form the plane (or line) that goes through the origin (more on this later). Adding the bias term b translates the line away from the origin. As depicted above, the equation, equivalently specified as wᵀx + b = 0, w·x + b = 0, or <w,x> + b = 0, forms a hyperplane separating the classes. With this we have a dot product, or inner product, between two vectors, with an additive bias term b, that gives us this plane and also formulates a decision function f(x) such that sign(f(x)) tells us on which side of the plane an example falls. In the literature this function is often called a linear discriminant function; in the engineering literature, classifiers that result from a linear combination of input features and return a sign have been referred to as perceptrons (as noted in Hastie et al., these also form the foundation for neural networks).

Having defined a separating hyperplane and a decision function that classifies our training examples, SVMs also reference two additional hyperplanes. These are supporting hyperplanes that support the classes y = 1 and y = -1, or sets A and B respectively. The supporting hyperplanes are parallel to the separating hyperplane and can be represented as follows:

H_a: wᵀx + b = 1, or w·x + b = 1, or <w,x> + b = 1
H_b: wᵀx + b = -1, or w·x + b = -1, or <w,x> + b = -1

So the idea of SVMs is not only to find a separating hyperplane H_p that separates both classes, but to find the specific H_p that does the best job of separating them. This can be formulated in terms of maximizing margin. Margin is defined as the distance between the classes, specifically the distance between each class's supporting hyperplane, which turns out to be 2/‖w‖: the distance from any point x₀ to the plane wᵀx + b = 0 is |wᵀx₀ + b|/‖w‖, and since the supporting hyperplanes sit at wᵀx + b = ±1, each lies a distance 1/‖w‖ from the separating hyperplane, for a total margin of 2/‖w‖. Maximizing margin amounts to finding the separating hyperplane that is equidistant from each class's supporting hyperplane; a quick numeric check follows below.
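
As a quick numeric check on the margin formula (using the rescaled weight vector w = (1/3, 1/3) that emerges from the worked example later in these notes), 2/‖w‖ matches the perpendicular distance between the two supporting lines x + y = 11 and x + y = 5:

# margin between the supporting hyperplanes w'x + b = 1 and w'x + b = -1
w <- c(1, 1) / 3    # rescaled weights from the worked example below
2 / sqrt(sum(w^2))  # 4.2426... = 3*sqrt(2)

# the same value as the perpendicular distance between x + y = 11 and x + y = 5
(11 - 5) / sqrt(2)  # 4.2426...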

The optimization problem for SVMs is then specified as follows:

Find w and b to maximize 2/‖w‖
subject to
H_a: wᵀx + b ≥ 1 for y = 1 (x ∈ A)
H_b: wᵀx + b ≤ -1 for y = -1 (x ∈ B)

The solutions w and b are then used to formulate the classification rule:

D(x) = wᵀx + b, such that if D(x) > 0 then x ∈ A, else x ∈ B.

This may seem straightforward to someone who has had a good course in multivariable calculus and linear algebra (I'll refer to my references for the basic logical extensions, including duality, kernels, and non-linearly separable sets). For others, or for those who perhaps don't recall these concepts, I think the most basic but puzzling question may be: how in the world do we get a line from something like wᵀx + b = 0? It sort of looks like the equation we may have seen in econometrics, y = β₀ + β₁X, but is it? And what about the equivalent formulation <w,x> + b = 0? One may easily understand the concept of linear regression, the equation of a line, and even dot products or inner products, but still get very confused about how to tie these concepts together to arrive at the notation I've used above. The best explanation I have found that would get one started on the right path was a post entitled "The Perceptron, and All the Things it Can't Perceive," posted on the Math Programming blog by Jeremy Kun. Borrowing from his explanation, I present the following data points belonging to classes A and B (the same points generated in the R code at the end of these notes):

Set A          Set B
x    y         x    y
3    11        0    3
6    10        2    2
5    6         3    2
8    5         4    -1

We can plot these points, and plot a line separating them (this will be our hypothetical separating hyperplane H_p), as follows:

Note the equation for this line is actually y = -x + 8. Paying attention to the scale and where the origin actually is, we see that this is a basic line that intersects the y-axis at y = 8 (where x = 0) and intersects the x-axis at x = 8. This equation can be re-written in normal form as:

x + y - 8 = 0

If we realize that we have implied coefficients of 1 on x and y, we can specify the normal equation using vector notation as:

<(1,1),(x,y)> - 8 = 0

We see that the dot product, or inner product, of the coefficient vector (1,1) (the weight vector w) and the variable vector (x,y) (the vector x) gives us x + y. When we add the term -8 (the bias b), we get back the normal-form equation we are familiar with, that is, x + y - 8 = 0. By representing the line this way, as <w,x> + b = 0, we can see that different values for the weight vector and for b give us different lines; hence the representation of the hyperplanes used in SVMs (verified numerically below). Similarly, we can also define two lines that will represent our hypothetical supporting hyperplanes for sets A and B (depicted using both normal and vector form):

H_a: y = -x + 11, or x + y - 11 = 0, or <(1,1),(x,y)> - 11 = 0, for set A
H_b: y = -x + 5, or x + y - 5 = 0, or <(1,1),(x,y)> - 5 = 0, for set B
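
To see the vector representation in action, a few lines of R confirm that <w,(x,y)> + b evaluates to zero exactly when a point lies on the corresponding line (the test points are the intercepts of H_p and data points known to sit on H_a and H_b):

# inner-product form of a line: <w,(x,y)> + b = 0
w <- c(1, 1)
g <- function(p, b) sum(w * p) + b # returns 0 when p lies on the line

g(c(0, 8), -8)  # 0: the y-intercept (0,8) lies on H_p, x + y - 8 = 0
g(c(8, 0), -8)  # 0: the x-intercept (8,0) lies on H_p
g(c(5, 6), -11) # 0: the data point (5,6) lies on H_a, x + y - 11 = 0
g(c(3, 2), -5)  # 0: the data point (3,2) lies on H_b, x + y - 5 = 0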

It turns out that these lines can also be described in terms of the weights and bias that define the separating hyperplane H_p, as follows:

H_p: <w,x> + b = 0 (as in x + y - 8 = 0, so w = (1,1) and b = -8)
H_a: <w,x> + b = 3 (note: this is x + y - 8 = 3, which is the same as x + y - 11 = 0, our original normal equation for H_a)
H_b: <w,x> + b = -3 (note: this is x + y - 8 = -3, which is x + y - 5 = 0, our original normal equation for H_b)

We can see that for every point on H_a we get <w,x> + b = 3 using the weights and bias from H_p. Example: for the point (5,6) in A, which lies on H_a, <w,x> + b = 1·5 + 1·6 - 8 = 3.

We can see that for every point on H_b we get <w,x> + b = -3 using the weights and bias from H_p. Example: for the point (3,2) in B, which lies on H_b, <w,x> + b = 1·3 + 1·2 - 8 = -3.

If we normalize these results by dividing through by 3 (which turns out to be the vertical distance from H_p to each supporting hyperplane), rescaling w to (1/3, 1/3) and b to -8/3, we get the following:

H_p: <w,x> + b = 0
H_a: <w,x> + b = 1
H_b: <w,x> + b = -1

This is how we described the planes in our original geometric visualization of SVMs above. So, if we assume that H_p is the separating hyperplane that maximizes the margin between sets A and B, the weights that define H_p give us our decision function. Using simple math and the data points defined for sets A and B, we can see how this works. Our decision rule is:

D(x) = wᵀx + b = <(1,1),(x,y)> - 8 = x + y - 8
If D(x) > 0 then x ∈ A; else if D(x) < 0 then x ∈ B.

Example: for the point (3,11) we get D(x) = 3 + 11 - 8 = 6 > 0, so our decision rule classifies that point as y = 1, i.e., belonging to class (set) A.

Example: for the point (4,-1) we get D(x) = 4 + (-1) - 8 = -5 < 0, so our decision rule classifies that point as y = -1, i.e., belonging to class (set) B.

We can see that the decision rule (again, derived from the equation for H_p) correctly classifies all of the points in sets A and B:

x    y    D(x)   Sign   Class
3    11   6      +      A
6    10   8      +      A
5    6    3      +      A
8    5    5      +      A
0    3    -5     -      B
2    2    -4     -      B
3    2    -3     -      B
4    -1   -5     -      B

References:

Mas-Colell, A., Whinston, M., and Green, J. Microeconomic Theory. Oxford University Press, 1995.

Hastie, T., Tibshirani, R., and Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second Edition, February 2009.

Wu, F., Olson, B., Dobbs, D., and Honavar, V. "Comparing Kernels for Predicting Protein Binding Sites from Amino Acid Sequence." IEEE Joint Conference on Neural Networks, Vancouver, Canada. IEEE Press, in press.

Kun, Jeremy. "The Perceptron, and All the Things it Can't Perceive." Math Programming blog.

Lovell, Brian C., and Walder, Christian J. "Support Vector Machines for Business Applications." The University of Queensland and Max Planck Institute, Tübingen. {lovell, walder}@itee.uq.edu.au

Burges, Christopher J.C. "A Tutorial on Support Vector Machines for Pattern Recognition." Bell Laboratories, Lucent Technologies. Data Mining and Knowledge Discovery 2, 1998.

Ben-Hur, Asa (Department of Computer Science, Colorado State University), and Weston, Jason (NEC Labs America, Princeton, NJ, USA). "A User's Guide to Support Vector Machines."

Fletcher, Tristan. "Support Vector Machines Explained."

Welling, Max. "Support Vector Machines." Department of Computer Science, University of Toronto. welling@cs.toronto.edu

R-Code Used for Data Visualization

# *
# PROGRAM NAME: SVM and vector geometry
# DATE: 5/22/12
# CREATED BY: Matt Bogard
# PROJECT FILE: P:\R Code References\Data Mining_R
# *
# PURPOSE: provide intuition for SVMs
# *

rm(list=ls()) # clear any existing objects
ls()          # view open data sets

# define points for class 'A'
x <- c(3,6,5,8)
y <- c(11,10,6,5)
A <- cbind(x,y)

# define points for class 'B'
x <- c(0,2,3,4)
y <- c(3,2,2,-1)
B <- cbind(x,y)

# create the training data set
temp1 <- rbind(A,B)
class <- c(1,1,1,1,-1,-1,-1,-1)
temp2 <- cbind(temp1,class)

# plot the data (axis limits set directly in the plot call)
plot_it <- function(){
  plot(temp1, xlim = c(-1,12), ylim = c(-1,12))
  points(A, pch = 'A', col = 'red')
  points(B, pch = 'B', col = 'blue')
}
plot_it()

# fit the hyperplanes
x <- seq(-1,20, by = .01)
Hu <- -1*x + 11 # upper supporting hyperplane
Hs <- -1*x + 8  # separating hyperplane
Hl <- -1*x + 5  # lower supporting hyperplane

# plot the hyperplanes
lines(x,Hu)
lines(x,Hs)
lines(x,Hl)
title(main = "SVM and vector geometry")
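
As a closing check, we can evaluate the decision rule D(x) = x + y - 8 over all eight training points to reproduce the classification table above, and we can fit an actual linear SVM to the same data. The following is a minimal sketch assuming the e1071 package is installed; note that the fitted maximum-margin separator is whatever these particular points dictate, so its weights need not coincide exactly with our hypothetical H_p:

# evaluate D(x) = x + y - 8 over all eight training points
D <- temp1[,"x"] + temp1[,"y"] - 8
cbind(temp2, D = D, predicted = sign(D)) # predicted signs match the class column

# fit a linear SVM to the same data (assumes the e1071 package is installed)
library(e1071)
fit <- svm(temp1, factor(class), type = "C-classification",
           kernel = "linear", cost = 1000, scale = FALSE)
w_fit <- t(fit$coefs) %*% fit$SV # recover the fitted weight vector
b_fit <- -fit$rho                # and the fitted bias
w_fit; b_fit

Setting a large cost effectively turns off the soft-margin penalty, so the fit approximates the hard-margin problem described above; the recovered w_fit and b_fit define the separating hyperplane the SVM actually chooses for these points.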
