SUPPORT VECTOR MACHINES

Today
  Reading: AIMA 18.9 (SVMs)
Goals
  Finish backpropagation
  Support vector machines
Backpropagation

1. Begin with randomly initialized weights.
2. Apply the neural network to each training example (each pass through the examples is called an epoch).
3. If it misclassifies an example, modify the weights.
4. Continue until the neural network classifies all training examples correctly.

(Derive the gradient-descent update rule.)
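A minimal sketch of that loop, assuming a single linear unit with a fixed learning rate (a perceptron-style update, not the full backpropagation derivation for multi-layer networks; the data and learning rate are placeholders):

```python
import numpy as np

def train_single_unit(X, y, lr=0.1, max_epochs=100):
    """Sketch of the loop above: X is (N, d), y has labels in {-1, +1}."""
    rng = np.random.default_rng(0)
    w = rng.normal(size=X.shape[1])       # 1. randomly initialized weights
    b = 0.0
    for epoch in range(max_epochs):       # 2. one pass over the data = one epoch
        mistakes = 0
        for x_i, y_i in zip(X, y):
            if y_i * (w @ x_i + b) <= 0:  # 3. misclassified -> modify the weights
                w += lr * y_i * x_i
                b += lr * y_i
                mistakes += 1
        if mistakes == 0:                 # 4. stop once all examples are classified correctly
            break
    return w, b
```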
Support Vector Machines (SVMs)

SVMs are probably the most popular off-the-shelf classifier!

Software packages
  LIBSVM (LIBLINEAR), on the Resources page
  SVM-Light

Which is the best decision boundary?

[Figure: scatter plots in the (x1, x2) plane for x1 AND x2 and x1 OR x2, with several candidate decision boundaries.]
Support Vector Machines

[Figure: the maximum-margin decision hyperplane, with the support vectors marked; the goal is to maximize the margin.]

What defines a hyperplane?
What defines a hyperplane?

A hyperplane is defined by:
  A vector w
    Perpendicular to the hyperplane
    Often called the weight vector
  A scalar b
    Selects, from among all hyperplanes perpendicular to w, the one at distance $|b|/\|w\|$ from the origin

Classify a new instance (prediction)

Data: $D = \{(x_i, y_i)\}_{i=1}^{N}$ with labels $y_i \in \{-1, +1\}$.

  $w \cdot x + b = 0$   x is on the decision boundary
  $w \cdot x + b < 0$   x is below the decision boundary
  $w \cdot x + b > 0$   x is above the decision boundary

Prediction: $h(x) = \mathrm{sign}(w \cdot x + b)$
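A minimal sketch of the prediction rule, assuming w and b are already given by training (the values below are made up purely for illustration):

```python
import numpy as np

def predict(w, b, X):
    """Return +1 / -1 labels for the rows of X using h(x) = sign(w . x + b)."""
    scores = X @ w + b                 # signed distance from the boundary, up to a factor of ||w||
    return np.where(scores >= 0, 1, -1)

w = np.array([1.0, -2.0])
b = 0.5
X_new = np.array([[0.0, 0.0], [3.0, 1.0], [-1.0, 2.0]])
print(predict(w, b, X_new))
```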
Learning (Training)

How do we calculate the margin?

[Figure: a point x_i, a reference point r_0 on the hyperplane, and the weight vector w; the margin is the projection of (x_i - r_0) onto w.]

The margin $\gamma_i$ is the length of the projection of the vector $(x_i - r_0)$ onto the weight vector. The length of the projection of one vector onto another is just a dot product!

  $\gamma_i = (x_i - r_0) \cdot \frac{w}{\|w\|} = \frac{x_i \cdot w}{\|w\|} - \frac{r_0 \cdot w}{\|w\|} = \frac{x_i \cdot w + b}{\|w\|}$

(The last step uses the fact that $r_0$ lies on the hyperplane, so $r_0 \cdot w = -b$.)

Learning (Training)

Maximize the margin, step by step:

  $\max_{\gamma, w, b} \ \gamma \quad \text{s.t.} \quad y_j \, \frac{x_j \cdot w + b}{\|w\|} \ge \gamma \ \ \forall j$,   where $\gamma = \min_i \gamma_i$
    (multiply by $y_j$ to ensure the quantity is positive)

  Scale $w$ so that $\gamma = 1/\|w\|$:
  $\max_{w, b} \ \frac{1}{\|w\|} \quad \text{s.t.} \quad y_j \, (x_j \cdot w + b) \ge 1 \ \ \forall j$

  Maximizing $1/\|w\|$ is the same as minimizing $\|w\|$:
  $\min_{w, b} \ \|w\| \quad \text{s.t.} \quad y_j \, (x_j \cdot w + b) \ge 1 \ \ \forall j$
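A small sketch of the margin formula on a toy dataset; the candidate (w, b) and the points are illustrative only:

```python
import numpy as np

# Geometric margins gamma_i = y_i (w . x_i + b) / ||w|| for a toy dataset.
w = np.array([1.0, 1.0])
b = -1.0
X = np.array([[2.0, 2.0], [0.0, 0.0], [3.0, 0.0], [-1.0, 0.0]])
y = np.array([1, -1, 1, -1])

gammas = y * (X @ w + b) / np.linalg.norm(w)
print("per-example margins:", gammas)
print("margin of this hyperplane:", gammas.min())   # the quantity SVM training maximizes
```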
Solving the Optimization Problem

  $\min_{w, b} \ \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_j \, (x_j \cdot w + b) \ge 1 \ \ \forall j$

We need to optimize a quadratic function subject to linear constraints. The solution involves constructing a dual problem in which a Lagrange multiplier (a scalar) is associated with every constraint in the primal problem:

  $\max_{\alpha} \ \min_{w, b} \ \frac{1}{2}\|w\|^2 - \sum_{i=1}^{N} \alpha_i \left[ y_i (w \cdot x_i + b) - 1 \right]$

Dual (in the Lagrange multipliers $\alpha_i$):

  $\max_{\alpha} \ \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j \, y_i y_j \, (x_i \cdot x_j)$
  subject to $\alpha_i \ge 0$ and $\sum_i \alpha_i y_i = 0$
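To make the dual concrete, here is a sketch that hands it to a general-purpose constrained solver (SciPy's SLSQP) on a tiny toy problem; real packages such as LIBSVM use specialized QP solvers instead, and the data here is made up:

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [0.0, 0.0], [3.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, -1.0, 1.0, -1.0])
N = len(y)

G = (y[:, None] * y[None, :]) * (X @ X.T)   # G[i, j] = y_i y_j (x_i . x_j)

def neg_dual(alpha):
    # negate the dual objective because minimize() minimizes
    return 0.5 * alpha @ G @ alpha - alpha.sum()

constraints = [{"type": "eq", "fun": lambda a: a @ y}]   # sum_i alpha_i y_i = 0
bounds = [(0.0, None)] * N                               # alpha_i >= 0

result = minimize(neg_dual, np.zeros(N), bounds=bounds, constraints=constraints)
print("alphas:", np.round(result.x, 3))
```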
Solving the Optimization Problem

The solution has the form:

  $w = \sum_{i=1}^{N} \alpha_i y_i x_i$   and   $b = y_i - w \cdot x_i$  for any $x_i$ such that $\alpha_i \ne 0$

Each non-zero alpha indicates that the corresponding $x_i$ is a support vector.

The classifying function has the form:

  $h(x) = \mathrm{sign}\!\left( \sum_i \alpha_i y_i \, (x_i \cdot x) + b \right)$

It relies on a dot product between the test point $x$ and the support vectors $x_i$.

Soft-margin Classification

Slack variables $\xi_i$ can be added to allow misclassification of difficult or noisy examples. We still try to minimize training-set errors and to place the hyperplane far from each class (large margin).

[Figure: two points on the wrong side of their margin, with slack $\xi_i$ and $\xi_j$.]
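A sketch of that solution form using scikit-learn (assumed available): SVC stores $\alpha_i y_i$ for the support vectors in dual_coef_, so $w = \sum_i \alpha_i y_i x_i$ can be rebuilt from them; the C parameter is the soft-margin trade-off between slack and margin, and the large value below approximates a hard margin on this toy data:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [0.0, 0.0], [3.0, 0.0], [-1.0, 0.0]])
y = np.array([1, -1, 1, -1])

clf = SVC(kernel="linear", C=1e6)            # large C ~ hard margin
clf.fit(X, y)

w = clf.dual_coef_ @ clf.support_vectors_    # w = sum_i alpha_i y_i x_i
b = clf.intercept_
print("support vectors:", clf.support_vectors_)
print("w:", w, "b:", b)
print("prediction for [1, 1]:", clf.predict([[1.0, 1.0]]))
```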
How many support vectors?

Determined by the alphas in the optimization. Typically only a small proportion of the training data. The number of support vectors determines the run time for prediction.

How fast are SVMs?

Training
  Time for training is dominated by the time for solving the underlying quadratic programming problem.
  Slower than Naive Bayes; non-linear SVMs are worse.
Testing (Prediction)
  Fast, as long as we don't have too many support vectors.
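A quick way to see this in practice, again assuming scikit-learn and a made-up dataset: a trained SVC reports how many training points it kept as support vectors.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, size=(200, 2)), rng.normal(+1, 1, size=(200, 2))])
y = np.array([-1] * 200 + [+1] * 200)

clf = SVC(kernel="rbf", C=1.0).fit(X, y)
print("support vectors per class:", clf.n_support_)
print("fraction of training data kept:", clf.support_vectors_.shape[0] / len(X))
```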
Non-linear SVMs

General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable:

  $\Phi: x \mapsto \varphi(x)$

The Kernel Trick

The linear classifier relies on an inner product between vectors $x_i \cdot x_j$:

  $g(x) = \mathrm{sign}\!\left( \sum_i \alpha_i y^{(i)} \, (x^{(i)} \cdot x) + b \right)$

If every example is mapped into a high-dimensional space via some transformation $\Phi: x \mapsto \varphi(x)$, then the inner product becomes:

  $g(x) = \mathrm{sign}\!\left( \sum_i \alpha_i y^{(i)} \, \varphi(x^{(i)}) \cdot \varphi(x) + b \right)$

A kernel function is a function that corresponds to a dot product in some transformed feature space:

  $K(x_i, x_j) = \varphi(x_i)^{T} \varphi(x_j)$
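A sketch of the kernelized decision function; the alphas, b, and support vectors are assumed to come from training, and the polynomial kernel is chosen purely for illustration:

```python
import numpy as np

def poly_kernel(a, b, d=2):
    """Polynomial kernel (1 + a . b)^d."""
    return (1.0 + a @ b) ** d

def g(x, support_X, support_y, alpha, b):
    """g(x) = sign( sum_i alpha_i y_i K(x_i, x) + b )."""
    s = sum(a_i * y_i * poly_kernel(x_i, x)
            for a_i, y_i, x_i in zip(alpha, support_y, support_X))
    return np.sign(s + b)
```

With the linear kernel K(a, b) = a . b this reduces to the earlier h(x) = sign(w . x + b), since the weighted sum of support vectors is just w.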
The Kernel Trick

The kernel K may be cheaper to compute than doing the actual transformation $\varphi$; it performs the transformation implicitly. For example, with $x = (x_1, x_2, x_3)$:

  $\varphi(x) = (x_1 x_1,\ x_1 x_2,\ x_1 x_3,\ x_2 x_1,\ x_2 x_2,\ x_2 x_3,\ x_3 x_1,\ x_3 x_2,\ x_3 x_3)$

  $K(x, z) = (x \cdot z)^2 = \left( \sum_{i=1}^{n} x_i z_i \right)\!\left( \sum_{j=1}^{n} x_j z_j \right) = \sum_{i=1}^{n} \sum_{j=1}^{n} x_i x_j z_i z_j = \sum_{i,j=1}^{n} (x_i x_j)(z_i z_j) = \varphi(x) \cdot \varphi(z)$

Kernels

Why use kernels?
  Make a non-separable problem separable.
  Map data into a better representational space.

Common kernels
  Linear
  Polynomial: $K(x, z) = (1 + x^T z)^d$
  Radial basis function (infinite-dimensional space)
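A quick numerical check of the identity above, with arbitrary example vectors: the quadratic kernel $(x \cdot z)^2$ gives the same value as explicitly mapping through $\varphi$ and taking a dot product, but with O(n) work instead of O(n^2).

```python
import numpy as np

def phi(x):
    """Explicit feature map: all pairwise products x_i * x_j."""
    return np.outer(x, x).ravel()

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])

print((x @ z) ** 2)        # kernel evaluation
print(phi(x) @ phi(z))     # explicit transformation, same value
```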
Summary

Support Vector Machines (SVMs)
  Find the maximum-margin hyperplane.
  Only the support vectors are needed to determine the hyperplane.
  Use slack variables to allow some error.
  Use a kernel function to make non-separable data separable.
  Often among the best-performing classifiers.
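An end-to-end sketch tying these pieces together, assuming scikit-learn and a made-up two-blob dataset: a soft-margin SVM with an RBF kernel, where C controls the slack/margin trade-off.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1, 0.7, size=(100, 2)), rng.normal(+1, 0.7, size=(100, 2))])
y = np.array([-1] * 100 + [+1] * 100)

clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X, y)

print("training accuracy:", clf.score(X, y))
print("number of support vectors:", clf.support_vectors_.shape[0])
print("prediction for (0.9, 1.1):", clf.predict([[0.9, 1.1]]))
```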