Support Vector Machines


SVM Discussion Overview
1. Importance of SVMs
2. Overview of Mathematical Techniques Employed
3. Margin Geometry
4. SVM Training Methodology
5. Overlapping Distributions
6. Dealing with Multiple Classes
7. SVM and Computational Learning Theory
8. Relevance Vector Machines

1. Importance of SVMs. The SVM is a discriminative method that brings together: 1. computational learning theory, 2. previously known methods in linear discriminant functions, and 3. optimization theory. Also called sparse kernel machines: kernel methods predict based on linear combinations of a kernel function evaluated at the training points (e.g., Parzen windows); sparse because not all pairs of training points need be used. Also called maximum margin classifiers. Widely used for solving problems in classification, regression and novelty detection.

2. Mathematical Techniques Used. 1. The linearly separable case is considered, since with an appropriate nonlinear mapping φ to a sufficiently high dimension, two categories are always separable by a hyperplane. 2. To handle non-linear separability, the data are preprocessed to be represented in a much higher-dimensional space than the original feature space; the kernel trick reduces the computational overhead.

3. Support Vectors and Margin. Support vectors are the nearest patterns, at distance b from the hyperplane. The SVM finds the hyperplane with maximum distance (margin distance b) from the nearest training patterns. [Figure: three support vectors are shown as solid dots.]

Margin Maximization. Why maximize the margin? The motivation is found in computational learning theory, or statistical learning theory (PAC learning, VC dimension). Insight is given as follows (Tong and Koller 2000): model the distribution for each class using a Parzen density estimator with Gaussian kernels having a common parameter σ; instead of the optimum boundary, determine the best hyperplane relative to the learned density model. As σ → 0 the optimum hyperplane has maximum margin, and the hyperplane becomes independent of data points that are not support vectors.

Distance from an Arbitrary Point to the Plane. Hyperplane: g(x) = w^t x + w_0 = 0, where w is the weight vector and w_0 is the bias. Lemma: the distance from x to the plane is r = g(x)/||w||. Proof: write x = x_p + r w/||w||, where x_p is the projection of x onto the plane and r is the distance from x to the plane. Then g(x) = w^t (x_p + r w/||w||) + w_0 = g(x_p) + r ||w|| = r ||w||, so r = g(x)/||w||. QED. Corollary: the distance of the origin to the plane is r = g(0)/||w|| = w_0/||w||, since g(0) = w^t 0 + w_0 = w_0. Thus w_0 = 0 implies that the plane passes through the origin. [Figure: a point x, its projection x_p, and the regions g(x) > 0 and g(x) < 0.]
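To make the lemma concrete, here is a minimal Python sketch (my addition, not from the slides); the hyperplane coefficients are chosen arbitrarily for illustration:

```python
import numpy as np

def signed_distance_to_hyperplane(x, w, w0):
    """Signed distance r = g(x) / ||w|| from point x to the hyperplane g(x) = w.x + w0 = 0."""
    return (np.dot(w, x) + w0) / np.linalg.norm(w)

# Hyperplane 3*x1 + 4*x2 - 5 = 0 (coefficients chosen arbitrarily for the example)
w, w0 = np.array([3.0, 4.0]), -5.0
print(signed_distance_to_hyperplane(np.array([0.0, 0.0]), w, w0))  # -1.0, i.e. |g(0)|/||w|| = |w0|/||w|| = 1
print(signed_distance_to_hyperplane(np.array([3.0, 4.0]), w, w0))  # 4.0
```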

Choosing a Margin. Work in the augmented space: g(y) = a^t y, obtained by choosing a_0 = w_0 and y_0 = 1, i.e., the plane passes through the origin. For each of the patterns y_k, let z_k = ±1 depending on whether the pattern is in class ω_1 or ω_2. Thus if g(y) = 0 is a separating hyperplane then z_k g(y_k) > 0, k = 1,..,n. Since the distance of a point y to the hyperplane g(y) = 0 is g(y)/||a||, we could require that the hyperplane be such that all points are at least a distance b from it, i.e., z_k g(y_k)/||a|| ≥ b.

SVM Margin Geometry. [Figure: the hyperplane g(y) = a^t y = 0 together with the margin hyperplanes g(y) = +1 and g(y) = -1.] The optimal hyperplane is orthogonal to the shortest line connecting the convex hulls of the two classes, which has length 2/||a||.

Statement of the Optimization Problem. The goal is to find the weight vector a that satisfies z_k g(y_k)/||a|| ≥ b, k = 1,...,n, while maximizing b. To ensure uniqueness we impose the constraint b ||a|| = 1, or b = 1/||a||, which implies that we also require that ||a||^2 be minimized. Support vectors are the (transformed) training patterns for which equality holds in the above equation. This is called a quadratic optimization problem, since we are trying to minimize a quadratic function subject to a set of linear inequality constraints.

4. SVM Training Methodology. 1. Training is formulated as an optimization problem; the dual problem is stated to reduce the computational complexity, and the kernel trick is used to reduce computation. 2. Determination of the model parameters corresponds to a convex optimization problem, so the solution is straightforward (a local solution is a global optimum). 3. Makes use of Lagrange multipliers.

Joseph-Louis Lagrange (1736-1813). French mathematician, born in Turin, Italy. Succeeded Euler at the Berlin Academy. Narrowly escaped execution in the French Revolution thanks to Lavoisier, who was himself guillotined. Made key contributions to calculus and dynamics.

SVM Training: Optimization Problem. Optimize arg min_{a,b} (1/2)||a||^2, subject to the constraints z_k a^t y_k ≥ 1, k = 1,...,n. This can be cast as an unconstrained problem by introducing Lagrange undetermined multipliers, with one multiplier α_k for each constraint. The Lagrange function is L(a, α) = (1/2)||a||^2 − Σ_{k=1}^{n} α_k [z_k a^t y_k − 1].

Optimization of the Lagrange Function. The Lagrange function is L(a, α) = (1/2)||a||^2 − Σ_{k=1}^{n} α_k [z_k a^t y_k − 1]. We seek to minimize L with respect to the weight vector a and maximize it w.r.t. the undetermined multipliers α_k ≥ 0. The last term represents the goal of classifying the points correctly. The Karush-Kuhn-Tucker construction shows that this can be recast as a maximization problem, which is computationally better.

Dual Optimization Problem. The problem is reformulated as one of maximizing L(α) = Σ_{k=1}^{n} α_k − (1/2) Σ_{k=1}^{n} Σ_{j=1}^{n} α_k α_j z_k z_j K(y_j, y_k), subject to the constraints Σ_{k=1}^{n} z_k α_k = 0 and α_k ≥ 0, k = 1,...,n, given the training data, where the kernel function is defined by K(y_j, y_k) = y_j^t y_k = φ(x_j)^t φ(x_k).
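A minimal numerical sketch of this dual (my own illustration, not the course's implementation): it maximizes L(α) for a toy linearly separable set with a linear kernel, using scipy's SLSQP solver; the data points are made up.

```python
import numpy as np
from scipy.optimize import minimize

# Toy 2-D linearly separable data; labels z_k in {-1, +1}
Y = np.array([[2.0, 2.0], [2.0, 3.0], [0.0, 0.0], [-1.0, 0.5]])
z = np.array([1.0, 1.0, -1.0, -1.0])
K = Y @ Y.T                                   # linear kernel K(y_j, y_k) = y_j . y_k

def neg_dual(alpha):                          # negate L(alpha) because scipy minimizes
    return -(alpha.sum() - 0.5 * alpha @ (np.outer(z, z) * K) @ alpha)

res = minimize(neg_dual, np.zeros(len(z)),
               bounds=[(0.0, None)] * len(z),                         # alpha_k >= 0
               constraints=[{'type': 'eq', 'fun': lambda a: a @ z}],  # sum_k z_k alpha_k = 0
               method='SLSQP')
alpha = res.x
a = (alpha * z) @ Y                           # weight vector a = sum_k alpha_k z_k y_k
print(alpha.round(3), a.round(3))
```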

Solution of the Dual Problem. Implementation: solved using quadratic programming. Alternatively, since it only needs inner products of the training data, it can be implemented using kernel functions, which is a crucial property for generalizing to the non-linear case. The solution is given by a = Σ_k α_k z_k y_k.

Summary of SVM Optimization Problems. [Slide reproduces the primal and dual problems in a different notation; the quadratic term is highlighted.]

Kernel Function: Key Property. If the kernel function is chosen with the property K(x, y) = φ(x) · φ(y), then the computational expense of the increased dimensionality is avoided. The polynomial kernel K(x, y) = (x · y)^d can be shown (next slide) to correspond to a map φ into the space spanned by all products of exactly d dimensions.

A Polynomial Kernel Function. Suppose x = (x_1, x_2)^t is the input vector. The feature space mapping is φ(x) = (x_1^2, √2 x_1 x_2, x_2^2)^t. Then the inner product is φ(x) · φ(y) = (x_1^2, √2 x_1 x_2, x_2^2)(y_1^2, √2 y_1 y_2, y_2^2)^t = (x_1 y_1 + x_2 y_2)^2. The polynomial kernel function that computes the same value is K(x, y) = (x · y)^2 = (x_1 y_1 + x_2 y_2)^2, i.e., K(x, y) = φ(x) · φ(y). Computing the inner product φ(x) · φ(y) directly needs six feature values and 3 × 3 = 9 multiplications; the kernel function K(x, y) needs 2 multiplications and a squaring, yet gives the same value.
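The equivalence is easy to verify numerically; a quick sketch (added here for illustration, with arbitrary vectors x and y):

```python
import numpy as np

def phi(v):
    """Explicit feature map for the d = 2 polynomial kernel: (v1^2, sqrt(2)*v1*v2, v2^2)."""
    return np.array([v[0]**2, np.sqrt(2) * v[0] * v[1], v[1]**2])

def poly_kernel(x, y):
    """K(x, y) = (x . y)^2, computed without building the feature vectors."""
    return np.dot(x, y) ** 2

x, y = np.array([1.0, 3.0]), np.array([2.0, -1.0])
print(np.dot(phi(x), phi(y)))   # 1.0
print(poly_kernel(x, y))        # 1.0 -- same value, far fewer multiplications
```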

Another Polynomial (Quadratic) Kernel Function. K(x, y) = (x · y + 1)^2. This one maps d = 2, p = 2 into a six-dimensional space and contains all powers of x up to degree 2: K(x, y) = φ(x) · φ(y), where φ(x) = (x_1^2, x_2^2, √2 x_1 x_2, √2 x_1, √2 x_2, 1). Computing the inner product directly needs 36 multiplications; the kernel function needs 4 multiplications.

Non-Linear Case. Use a mapping function φ(·) to a sufficiently high dimension so that data from the two categories can always be separated by a hyperplane. Assume each pattern x_k has been transformed to y_k = φ(x_k), for k = 1,..,n. First choose the non-linear φ functions to map the input vector to a higher-dimensional feature space. The dimensionality of the space can be arbitrarily high, limited only by computational resources.

Mapping into a Higher-Dimensional Feature Space. Each input point x (on a 1-d line) is mapped by y = Φ(x) to a point in 3-d, so points on the 1-d line are mapped onto a curve in 3-d. Linear separation in the 3-d space is then possible. The linear discriminant function in 3-d has the form g(x) = a_1 y_1 + a_2 y_2 + a_3 y_3.

Pattern Transformation Using Kernels. Problem with a high-dimensional mapping: very many parameters. A polynomial of degree p over d variables leads to O(d^p) variables in the feature space. Example: if d = 50 and p = 2 we need a feature space of size about 2500. Solution: the dual optimization problem needs only inner products. Each pattern x_k is transformed into a pattern y_k, where y_k = Φ(x_k); the dimensionality of the mapped space can be arbitrarily high.
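A tiny sketch of the cost argument (my addition; the random vectors are just for illustration): the explicit degree-p feature space grows as O(d^p), while a kernel evaluation touches only the d original coordinates.

```python
import numpy as np

d, p = 50, 2
print(d ** p)                       # 2500: the explicit degree-p feature space has O(d^p) coordinates

# The kernel works with only the d original coordinates
rng = np.random.default_rng(0)
x, y = rng.standard_normal(d), rng.standard_normal(d)
print(np.dot(x, y) ** p)            # (x.y)^p in O(d) work, no explicit O(d^p) expansion needed
```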

Example of SVM Results. Two classes in two dimensions, synthetic data. [Figure shows contours of constant g(x) obtained from an SVM with a Gaussian kernel function; the decision boundary, the margin boundaries, and the support vectors are shown, illustrating the sparsity of the SVM.]

Demo http://www.csie.ntu.edu.tw/~cjlin/libsvm/ Svmtoy.exe

SVM for the XOR Problem. XOR: binary-valued features x_1, x_2; not solved by a linear discriminant function. φ maps the input x = [x_1, x_2] into the six-dimensional feature space y = [1, √2 x_1, √2 x_2, √2 x_1 x_2, x_1^2, x_2^2]. [Figure: input space and feature sub-space; hyperplanes corresponding to √2 x_1 x_2 = ±1 and hyperbolas corresponding to x_1 x_2 = ±1.]

SVM for XOR: Maximization Problem. We seek to maximize Σ_{k=1}^{4} α_k − (1/2) Σ_{k=1}^{4} Σ_{j=1}^{4} α_k α_j z_k z_j y_j^t y_k, subject to the constraints 0 ≤ α_k, k = 1, 2, 3, 4, and Σ_{k=1}^{4} z_k α_k = 0. From problem symmetry, at the solution α_1 = α_3 and α_2 = α_4.

SVM for XOR: Maximization Problem (continued). One can use iterative gradient descent, or analytical techniques for such a small problem. The solution is α* = (1/8, 1/8, 1/8, 1/8). The last term of the optimization problem implies that all four points are support vectors (unusual, and due to the symmetric nature of XOR). The final discriminant function is g(x_1, x_2) = x_1 x_2. The decision hyperplane is defined by g(x_1, x_2) = 0, and the margin is given by b = 1/||a|| = √2. [Figure: hyperbolas corresponding to x_1 x_2 = ±1 in the input space and hyperplanes corresponding to √2 x_1 x_2 = ±1 in the feature sub-space.]
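As a sanity check (my own sketch, assuming the six-dimensional map above and the dual from the earlier slide), solving the XOR dual numerically recovers α_k ≈ 1/8 for all four points and a weight vector whose only nonzero component is the √2 x_1 x_2 term, i.e. g(x) = x_1 x_2:

```python
import numpy as np
from scipy.optimize import minimize

def phi(x1, x2):
    """Six-dimensional feature map used for the XOR example."""
    r2 = np.sqrt(2.0)
    return np.array([1.0, r2 * x1, r2 * x2, r2 * x1 * x2, x1**2, x2**2])

X = [(1, 1), (1, -1), (-1, 1), (-1, -1)]
z = np.array([1.0, -1.0, -1.0, 1.0])           # class = sign of x1 * x2
Y = np.array([phi(x1, x2) for x1, x2 in X])
K = Y @ Y.T

neg_dual = lambda a: -(a.sum() - 0.5 * a @ (np.outer(z, z) * K) @ a)
res = minimize(neg_dual, np.full(4, 0.1), bounds=[(0.0, None)] * 4,
               constraints=[{'type': 'eq', 'fun': lambda a: a @ z}], method='SLSQP')
alpha = res.x
a = (alpha * z) @ Y
print(alpha.round(4))      # ~[0.125 0.125 0.125 0.125]
print(a.round(4))          # only the sqrt(2)*x1*x2 component is ~1/sqrt(2), so g(x) = x1*x2
```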

5. Overlapping Class Distributions. We assumed the training data are linearly separable in the mapped space y. The resulting SVM gives exact separation in the input space x, although the decision boundary is nonlinear. In practice the class-conditional distributions will overlap, in which case exact separation of the training data leads to poor generalization. We therefore need to allow the SVM to misclassify some training points.

ν-SVM Applied to Non-Separable Data. [Figure: support vectors are indicated by circles.] This is done by introducing slack variables, with one slack variable per training data point, and maximizing the margin while softly penalizing points that lie on the wrong side of the margin boundary. ν is an upper bound on the fraction of margin errors (points lying on the wrong side of the margin boundary).
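For reference, this formulation is exposed in scikit-learn as NuSVC; a minimal sketch on synthetic overlapping data (my example, not from the slides):

```python
import numpy as np
from sklearn.svm import NuSVC

rng = np.random.default_rng(0)
# Two overlapping Gaussian blobs
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(1.5, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

clf = NuSVC(nu=0.2, kernel='rbf', gamma='scale').fit(X, y)   # nu upper-bounds the fraction of margin errors
print(clf.support_vectors_.shape[0], "support vectors")
```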

6. Multiclass SVMs (one-versus-rest). The SVM is fundamentally a two-class classifier. Several methods have been suggested for combining multiple two-class classifiers. The most widely used approach is one-versus-rest (also recommended by Vapnik): use the data from class C_k as the positive examples and the data from the remaining K−1 classes as the negative examples. Disadvantages: an input can be assigned to multiple classes simultaneously; the training sets are imbalanced (e.g., 90% of the data in one class and 10% in the other); and the symmetry of the original problem is lost.

Multiclass SVMs (one-versus-one). Train K(K−1)/2 different two-class SVMs on all possible pairs of classes, and classify test points according to which class has the highest number of votes. This again leads to ambiguities in classification. For large K it requires significantly more training time than one-versus-rest, and also more computation time for evaluation; the latter can be alleviated by organizing the classifiers into a directed acyclic graph (DAGSVM).
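Both strategies exist as wrappers in scikit-learn; a small sketch (mine, on a synthetic three-class problem) showing how many binary classifiers each one trains:

```python
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=5, n_informative=4,
                           n_redundant=0, n_classes=3, random_state=0)

ovr = OneVsRestClassifier(SVC(kernel='rbf')).fit(X, y)   # K binary classifiers
ovo = OneVsOneClassifier(SVC(kernel='rbf')).fit(X, y)    # K(K-1)/2 pairwise classifiers
print(len(ovr.estimators_), len(ovo.estimators_))        # 3 and 3 for K = 3; 10 and 45 for K = 10
```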

7. SVM and Computational Learning Theory. The SVM is historically motivated and analyzed using a theoretical framework called computational learning theory, also called the Probably Approximately Correct or PAC learning framework. The goal of the PAC framework is to understand how large a data set needs to be in order to give good generalization. The key quantity in PAC learning is the Vapnik-Chervonenkis (VC) dimension, which provides a measure of the complexity of a space of functions.

All dichotomies of 3 points in 2 dimensions are linearly separable.

VC Dimension of Hyperplanes in R^d. The VC dimension provides the complexity of a class of decision functions. For hyperplanes in R^d, the VC dimension = d + 1.

Fraction of Dichotomies That Are Linearly Separable. The fraction of dichotomies of n points in d dimensions that are linearly separable is f(n, d) = 1 for n ≤ d + 1, and f(n, d) = (2/2^n) Σ_{i=0}^{d} C(n−1, i) for n > d + 1. When the number of points is at most d + 1, all dichotomies are linearly separable. Capacity of a hyperplane: at n = 2(d + 1), called the capacity of the hyperplane, nearly one half of the dichotomies are still linearly separable (f(n, d) = 0.5). A hyperplane is not over-determined until the number of samples is several times the dimensionality. [Figure: f(n, d) plotted against n/(d + 1).]
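The formula is easy to tabulate; a short sketch (my addition) that also checks the capacity statement:

```python
from math import comb

def f(n, d):
    """Fraction of dichotomies of n points in general position in d dimensions
    that are linearly separable."""
    if n <= d + 1:
        return 1.0
    return 2.0 * sum(comb(n - 1, i) for i in range(d + 1)) / 2 ** n

d = 2
for n in (3, 4, 5, 6, 8, 12):
    print(n, round(f(n, d), 3))
# At the capacity n = 2*(d+1) = 6, exactly half the dichotomies are separable: f(6, 2) = 0.5
```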

Capacity of a Line when d = 2. [Figure: some separable cases and some non-separable cases.] At the capacity n = 2(d + 1), half of the dichotomies are linearly separable. VC dimension = d + 1 = 3.

Possible Method of Training an SVM. Based on a modification of the Perceptron training rule given below: instead of all misclassified samples, use the worst-classified samples.

Support Vectors Are the Worst-Classified Samples. Support vectors are the training samples that define the optimal separating hyperplane. They are the most difficult to classify, and the patterns most informative for the classification task. The worst-classified pattern at any stage is the one on the wrong side of the decision boundary farthest from the boundary; at the end of the training period such a pattern will be one of the support vectors. Finding the worst-classified pattern is computationally expensive: for each update we need to search through the entire training set to find the worst-classified sample, so this is only used for small problems. The more commonly used method is different.
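A minimal sketch of that idea (my own, not the course's algorithm): a perceptron-style loop that at each step updates on the worst-classified sample, i.e. the one with the smallest margin z_k a^t y_k.

```python
import numpy as np

def train_worst_sample_perceptron(Y, z, lr=0.1, n_iters=200):
    """Perceptron-style training that updates on the worst-classified sample each step.
    Y: (n, d) augmented patterns, z: (n,) labels in {-1, +1}."""
    a = np.zeros(Y.shape[1])
    for _ in range(n_iters):
        margins = z * (Y @ a)                # z_k * g(y_k) for every pattern
        k = int(np.argmin(margins))          # worst-classified (smallest margin) sample
        if margins[k] > 0:                   # every pattern is on the correct side: stop (a simplification)
            break
        a += lr * z[k] * Y[k]                # standard perceptron update on that sample
    return a

# Tiny linearly separable example in augmented space (bias = first coordinate)
Y = np.array([[1, 2.0, 2.0], [1, 2.0, 3.0], [1, -1.0, -1.0], [1, 0.0, -2.0]])
z = np.array([1.0, 1.0, -1.0, -1.0])
print(train_worst_sample_perceptron(Y, z))
```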

Generalization Error of SVM. If there are n training patterns, the expected value of the generalization error rate is bounded according to E_n[P(error)] ≤ E_n[number of support vectors]/n, i.e., the expected error is at most the expected number of support vectors divided by n, where the expectation is over all training sets of size n (drawn from the distributions describing the categories). This also means that the error rate on the support vectors will be n times the error rate on the total sample. Leave-one-out bound: if we have n points in the training set, train the SVM on n−1 of them and test on the single remaining point; an error can occur only if the left-out point is a support vector. If we find a transformation φ that separates the data well, then the expected number of support vectors is small, and hence the expected error rate is small.
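The support-vector count and the leave-one-out error can be compared empirically; a sketch (mine, using scikit-learn on synthetic, roughly separable data):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=60, n_features=2, n_informative=2,
                           n_redundant=0, class_sep=2.0, random_state=1)
n = len(y)

clf = SVC(kernel='linear', C=10.0).fit(X, y)
loo_acc = cross_val_score(SVC(kernel='linear', C=10.0), X, y, cv=LeaveOneOut()).mean()

print("support vectors / n :", clf.n_support_.sum() / n)   # the bound compares this ratio to the LOO error
print("leave-one-out error :", 1.0 - loo_acc)
```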

8. Relevance Vector Machines. Addresses several limitations of SVMs: the SVM does not provide a posteriori probabilities, whereas relevance vector machines provide such output; the extension of SVMs to multiple classes is problematic; there is a complexity parameter C or ν that must be found using a hold-out method; and predictions are linear combinations of kernel functions centered on training data points, where the kernels must be positive definite.