Support Vector Machines


Support Vector Machines. Xiaojin Zhu, jerryzhu@cs.wisc.edu, Computer Sciences Department, University of Wisconsin, Madison. [Based on slides from Andrew Moore, http://www.cs.cmu.edu/~awm/tutorials] slide 1

The no-math version. Support vector machines slide 2

Courtesy of Steve Chien of NASA/JPL. Lake Mendota, Madison, WI. Identify areas of land cover (land, ice, water, snow) in a scene. Three algorithms: scientist manually derived, automatic best ratio, and Support Vector Machine (SVM).

Classifier      Expert Derived   Automated Ratio   SVM
cloud           45.7%            43.7%             58.5%
ice             60.1%            34.3%             80.4%
land            93.6%            94.7%             94.0%
snow            63.5%            90.4%             71.6%
water           84.2%            74.3%             89.1%
unclassified    45.7%

[Figure panels: Visible Image, Expert Labeled, Expert Derived, Automated Ratio, SVM] slide 3

Support vector machines: the state-of-the-art classifier. Class labels: one symbol denotes +1, the other denotes -1. How would you classify this data? slide 4

Linear classifier: if all you can do is draw a straight line, this is a linear decision boundary. slide 5

Linear classifier Another OK decision boundary slide 6

slide 7

Any of these would be fine... but which is the best? slide 8

The margin. Margin: the width by which the boundary can be widened before hitting a data point. slide 9

SVM: maximize the margin. The simplest SVM (the linear SVM) is the linear classifier with the maximum margin. slide 10

SVM: linearly non-separable data What if the data is not linearly separable? slide 11

SVM: linearly non-separable data Two solutions: Allow a few points on the wrong side (slack variables), and/or Map data to a higher dimensional space, do linear classification there (kernel) slide 12

SVM: more than two classes. N-class problem: split the task into N binary tasks: class 1 vs. the rest (classes 2..N), class 2 vs. the rest (classes 1, 3..N), ..., class N vs. the rest. Finally, pick the class that puts the point furthest into the positive region. slide 13
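A minimal, illustrative sketch of this one-vs-rest scheme (it assumes scikit-learn's SVC; the names X, y, classes are made up and not from the slides):

# One-vs-rest multiclass SVM: train one binary SVM per class,
# then pick the class whose classifier scores the point highest.
import numpy as np
from sklearn.svm import SVC

def train_one_vs_rest(X, y, classes):
    """Train one binary SVM per class (class k vs. the rest)."""
    models = {}
    for k in classes:
        y_binary = np.where(y == k, 1, -1)   # class k -> +1, everything else -> -1
        models[k] = SVC(kernel="linear").fit(X, y_binary)
    return models

def predict_one_vs_rest(models, X_new):
    """Pick the class that puts each point furthest into the positive region."""
    scores = np.column_stack([m.decision_function(X_new) for m in models.values()])
    labels = list(models.keys())
    return np.array([labels[i] for i in scores.argmax(axis=1)])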

SVM: get your hands on it. There are many implementations: http://www.support-vector.net/software.html, http://svmlight.joachims.org/. You don't have to know the rest of the class in order to use an SVM. (end of class) (But you want to know more, don't you?) slide 14

The math version. Support vector machines slide 15

Vector. A vector W in d-dimensional space is a list of d numbers, e.g. W = (-1, 2). A vector is a vertical column; ᵀ is matrix transpose. A vector is a line segment, with an arrow, in space. The norm of a vector W is its length: ||W|| = sqrt(W·W). Inner product: X·Y = Σ_i X_i Y_i. What d-dimensional points X make W·X = 0? slide 16
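A tiny numpy illustration of these definitions (the numbers are just the example W = (-1, 2) and one point chosen to lie on the line W·X = 0):

# Norm and inner product for the example vector W = (-1, 2).
import numpy as np

W = np.array([-1.0, 2.0])
X = np.array([2.0, 1.0])            # W . X = (-1)*2 + 2*1 = 0, so X lies on the line W . X = 0

norm_W = np.sqrt(W @ W)             # ||W|| = sqrt(W . W) = sqrt(5), about 2.236
print(norm_W, np.linalg.norm(W))    # the two ways of computing the norm agree
print(W @ X)                        # inner product sum_i W_i X_i = 0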

Lines. W·X is a scalar (a single number): X's projection onto W, scaled by ||W||. W·X = 0 specifies the set of points X forming the line perpendicular to W that passes through (0,0). W·X = 1 (i.e. W·X - 1 = 0) is the line parallel to W·X = 0, shifted by 1/||W||; likewise W·X = -1 (i.e. W·X + 1 = 0) is shifted by 1/||W|| on the other side. What if the boundary doesn't go through the origin? slide 17

SVM boundary and margin. Want: find W and b (the offset) such that all positive training points (X, Y=1) are in the red zone, all negative ones (X, Y=-1) are in the blue zone, and the margin M is maximized. [Figure: plus-plane W·X + b = 1, classifier boundary W·X + b = 0, minus-plane W·X + b = -1.] M = 2/||W|| (why? the plus- and minus-planes are W·X + b = +1 and W·X + b = -1, and their separation along the direction of W is 2/||W||). How do we find such W, b? slide 18

SVM as constrained optimization. Variables: W, b. Objective function: maximize the margin M = 2/||W||. This is equivalent to minimizing ||W||, or ||W||^2 = W·W, or ½ W·W. Assume N training points (Xi, Yi), Yi = 1 or -1. Subject to: each training point is on the correct side (the constraints). How many constraints do we have? slide 19

SVM as constrained optimization. Variables: W, b. Objective function: maximize the margin M = 2/||W||, equivalently minimize ||W||, ||W||^2 = W·W, or ½ W·W. Assume N training points (Xi, Yi), Yi = 1 or -1. Subject to: each training point is on the correct side. How many constraints do we have? N: W·Xi + b >= 1 if Yi = 1, and W·Xi + b <= -1 if Yi = -1. We can unify them: Yi(W·Xi + b) >= 1. We have a continuous constrained optimization problem! What do we do? slide 20

SVM as QP. min over W, b of ½ W·W, subject to Yi(W·Xi + b) >= 1 for all i. The objective is convex and quadratic, and the constraints are linear. This problem is known as a Quadratic Program (QP), for which efficient global solution algorithms exist. slide 21
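A minimal sketch of this primal QP with a general-purpose solver (it assumes scipy and a made-up, linearly separable toy set; real SVM packages use specialized QP solvers instead):

# Hard-margin SVM primal: minimize 0.5 * W.W  subject to  Yi*(W.Xi + b) >= 1 for all i.
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])   # toy, linearly separable
Y = np.array([1.0, 1.0, -1.0, -1.0])
d = X.shape[1]

def objective(z):                    # z = [W_1, ..., W_d, b]
    W = z[:d]
    return 0.5 * W @ W

def margin_constraints(z):           # must be >= 0:  Yi*(W.Xi + b) - 1
    W, b = z[:d], z[d]
    return Y * (X @ W + b) - 1.0

res = minimize(objective, x0=np.zeros(d + 1), method="SLSQP",
               constraints=[{"type": "ineq", "fun": margin_constraints}])
W_opt, b_opt = res.x[:d], res.x[d]
print("W =", W_opt, " b =", b_opt, " margin =", 2 / np.linalg.norm(W_opt))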

What about this? The non-separable case. Can we insist on Yi(W·Xi + b) >= 1 for all i? slide 22

Trick #1: slack variables. Relax the constraints: allow a few bad apples. For a given linear boundary W, b we can compute how far into the wrong side a bad point is (its slack, e.g. e_2, e_7, e_11 in the figure). Here's how we relax the constraints: Yi(W·Xi + b) >= 1 - e_i. slide 23

Trick #1: SVM with slack variables. min over W, b, e of ½ W·W + C Σ_i e_i, subject to Yi(W·Xi + b) >= 1 - e_i for all i, and e_i >= 0 for all i (why?). C is the trade-off parameter. slide 24

Trick #1: SVM with slack variables. min over W, b, e of ½ W·W + C Σ_i e_i, subject to Yi(W·Xi + b) >= 1 - e_i for all i, and e_i >= 0 for all i. C is the trade-off parameter. Originally we optimized the variables W (a d-dimensional vector) and b; now we optimize W, b, e_1, ..., e_N, and we have 2N constraints. Still a QP (the soft-margin SVM). slide 25
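A brief usage sketch of the C trade-off, assuming scikit-learn (its SVC solves this soft-margin problem; the toy data and the values of C are made up for illustration):

# Soft-margin linear SVM: C trades margin width against the total slack.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 2.5], [-1.0, -1.0], [-2.0, -1.5], [-0.5, -0.6]])
Y = np.array([1, 1, -1, -1, 1])      # the last point is on the "wrong" side, so slack is needed

for C in (0.1, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, Y)
    W = clf.coef_[0]
    # Smaller C tolerates more slack (typically a wider margin); larger C punishes slack harder.
    print(f"C={C:6.1f}  margin={2 / np.linalg.norm(W):.3f}  train acc={clf.score(X, Y):.2f}")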

Another look at the non-separable case. Here's another non-separable dataset, in 1-dimensional space (points on a line around x = 0). We could of course use slack variables, but here's another trick! slide 26

Trick #2: map data to a higher dimensional space. We can map data x from 1-D to 2-D by x → (x, x^2). slide 27

Trick #2: map data to a higher dimensional space. We can map data x from 1-D to 2-D by x → Φ(x) = (x, x^2). Now the data is linearly separable in the new space! We can run the SVM in the new space; the linear boundary in the new space corresponds to a nonlinear boundary in the old space. slide 28
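A small numpy sketch of this 1-D example (the points and the separating threshold are made up, but the mapping is the slide's x → (x, x^2)):

# 1-D data that no single threshold separates becomes separable after x -> (x, x^2).
import numpy as np

x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
y = np.where(np.abs(x) < 1, 1, -1)        # +1 near the origin, -1 far away: not separable in 1-D

Phi = np.column_stack([x, x ** 2])        # Phi(x) = (x, x^2)
# In the new space the horizontal line x^2 = 1 separates the classes:
print(np.sign(1.0 - Phi[:, 1]) == y)      # all True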

Another example: map (x1, x2) → (x1, x2, x1^2 + x2^2). slide 29

Trick #2: map data to a higher dimensional space. In general we might want to map an already high-dimensional X = (x1, x2, ..., xd) into some much higher, even infinite-dimensional, space Φ(X). Problems: how do you represent infinite dimensions? We need to learn (among other things) W, which lives in the new space, and learning a large (or infinite) number of variables in a QP is not a good idea. slide 30

Trick #2: kernels. We will do several things: convert it into an equivalent QP problem which does not use W, or even Φ(X) alone! It only uses the inner product Φ(X)·Φ(Y), where X and Y are two training points. The solution also only uses such inner products. It still seems infeasible to compute for high (infinite) dimensions, but there are smart ways to compute such inner products, known as kernels (a function of two variables): kernel(X, Y) = the inner product Φ(X)·Φ(Y). Why should you care: one kernel, one new (higher dimensional) space. You will impress friends at cocktail parties. slide 31

Prepare to bite the bullet. Here's the original QP formulation: min over W, b of ½ W·W, subject to Yi(W·Xi + b) >= 1 for all i. Remember Lagrange multipliers? L = ½ W·W - Σ_i a_i [Yi(W·Xi + b) - 1], subject to a_i >= 0 for all i (there are N multipliers, one per constraint; they must be non-negative because these are inequality constraints). We want the gradient of L to vanish with respect to W, b, a. Try this and you'll get: ∂L/∂W = W - Σ_i a_i Yi Xi = 0, so W = Σ_i a_i Yi Xi; and ∂L/∂b = -Σ_i a_i Yi = 0, so Σ_i a_i Yi = 0. Put them back into the Lagrangian L. slide 32

Biting the bullet. max over {a_i} of Σ_i a_i - ½ Σ_{i,j} a_i a_j Yi Yj (Xi·Xj), subject to a_i >= 0 for all i and Σ_i a_i Yi = 0. This is an equivalent QP problem (the dual). Before, we optimized W (d variables); now we optimize a (N variables): which is better? X only appears in the inner product. slide 33

Biting the bullet. If we map X to the new space Φ(X): max over {a_i} of Σ_i a_i - ½ Σ_{i,j} a_i a_j Yi Yj (Φ(Xi)·Φ(Xj)), subject to a_i >= 0 for all i and Σ_i a_i Yi = 0. This is an equivalent QP problem. Before, we optimized W (d variables); now we optimize a (N variables): which is better? X only appears in the inner product. slide 34

Biting the bullet. With the kernel K(Xi, Xj) = Φ(Xi)·Φ(Xj) (if we map X to the new space Φ(X)): max over {a_i} of Σ_i a_i - ½ Σ_{i,j} a_i a_j Yi Yj K(Xi, Xj), subject to a_i >= 0 for all i and Σ_i a_i Yi = 0. This is an equivalent QP problem. Before, we optimized W (d variables); now we optimize a (N variables): which is better? X only appears through the inner product. slide 35
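A minimal sketch of solving this dual with a general-purpose solver (it assumes scipy, uses a linear kernel and made-up toy data; real SVM libraries use specialized solvers such as SMO):

# Dual SVM: maximize sum_i a_i - 0.5 * sum_ij a_i a_j Y_i Y_j K(X_i, X_j)
# subject to a_i >= 0 and sum_i a_i Y_i = 0.
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
Y = np.array([1.0, 1.0, -1.0, -1.0])
N = len(Y)

K = X @ X.T                                     # linear kernel matrix, K[i, j] = Xi . Xj
P = (Y[:, None] * Y[None, :]) * K               # matrix of the quadratic term

neg_dual = lambda a: 0.5 * a @ P @ a - a.sum()  # minimize the negative of the dual objective
res = minimize(neg_dual, x0=np.zeros(N), method="SLSQP",
               bounds=[(0.0, None)] * N,
               constraints=[{"type": "eq", "fun": lambda a: a @ Y}])
a = res.x
W = (a * Y) @ X                                 # W = sum_i a_i Y_i X_i (only possible for the linear kernel)
print("a =", np.round(a, 3), " W =", W)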

What's special about the kernel? Say the data is two-dimensional: s = (s1, s2). We decide to use a particular mapping into 6-dimensional space: Φ(s) = (s1^2, s2^2, √2 s1 s2, s1, s2, 1). Let another point be t = (t1, t2). Then Φ(s)·Φ(t) = s1^2 t1^2 + s2^2 t2^2 + 2 s1 s2 t1 t2 + s1 t1 + s2 t2 + 1. slide 36

What's special about the kernel? Say the data is two-dimensional: s = (s1, s2). We decide to use a particular mapping into 6-dimensional space: Φ(s) = (s1^2, s2^2, √2 s1 s2, √2 s1, √2 s2, 1). Let another point be t = (t1, t2). Then Φ(s)·Φ(t) = s1^2 t1^2 + s2^2 t2^2 + 2 s1 s2 t1 t2 + 2 s1 t1 + 2 s2 t2 + 1. Let the kernel be K(s, t) = (s·t + 1)^2. Verify that they are the same. We saved computation. slide 37
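A quick numerical check of this identity (the two test points are arbitrary):

# Verify that Phi(s) . Phi(t) equals the quadratic kernel (s.t + 1)^2.
import numpy as np

def phi(p):
    p1, p2 = p
    return np.array([p1**2, p2**2, np.sqrt(2)*p1*p2, np.sqrt(2)*p1, np.sqrt(2)*p2, 1.0])

s, t = np.array([1.5, -0.7]), np.array([2.0, 0.3])
lhs = phi(s) @ phi(t)            # inner product computed in the 6-D feature space
rhs = (s @ t + 1.0) ** 2         # quadratic kernel computed directly in 2-D
print(np.isclose(lhs, rhs))      # True: same number, less computation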

Kernels. You ask: is there such a good K for any Φ I pick? The inverse question: given some K, is there a Φ so that K(X, Y) = Φ(X)·Φ(Y)? Mercer's condition tells us when the inverse question has the answer yes (this is positive semi-definiteness, if you must know). Φ may be infinite dimensional; we may not be able to write it down explicitly. slide 38

Some frequently used kernels. Linear kernel: K(X, Y) = X·Y. Quadratic kernel: K(X, Y) = (X·Y + 1)^2. Polynomial kernel: K(X, Y) = (X·Y + 1)^n. Radial Basis Function kernel: K(X, Y) = exp(-||X - Y||^2 / σ^2). Many, many other kernels. Hacking with SVMs: create various kernels, hope their space is meaningful, plug them into the SVM, and pick the one with good classification accuracy (equivalent to feature engineering). Kernel summary: a QP of size N; a nonlinear SVM in the original space; the new space is possibly high- or infinite-dimensional; efficient if K is easy to compute. Kernels can be combined with slack variables. slide 39
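The same kernels written out in numpy (sigma and n are illustrative parameter names, not fixed by the slides):

# Frequently used kernels as plain functions of two vectors.
import numpy as np

def linear_kernel(X, Y):
    return X @ Y

def polynomial_kernel(X, Y, n=2):            # n = 2 gives the quadratic kernel
    return (X @ Y + 1.0) ** n

def rbf_kernel(X, Y, sigma=1.0):
    return np.exp(-np.sum((X - Y) ** 2) / sigma ** 2)

X, Y = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(linear_kernel(X, Y), polynomial_kernel(X, Y), rbf_kernel(X, Y))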

Why the name "support vector machines"? Recall the dual: max over {a_i} of Σ_i a_i - ½ Σ_{i,j} a_i a_j Yi Yj K(Xi, Xj), subject to a_i >= 0 for all i and Σ_i a_i Yi = 0. The decision boundary is f(Xnew) = W·Xnew + b = Σ_i a_i Yi (Xi·Xnew) + b (or Σ_i a_i Yi K(Xi, Xnew) + b with a kernel). In practice, many a_i's will be zero in the solution! The few Xi with a_i > 0 lie on the margin (the blue or red lines); they are the support vectors. slide 40
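A sketch of this decision function, written to plug into a dual solution like the one from the earlier scipy example (the function and variable names are illustrative, not from the slides):

# Decision function from the dual solution: f(x_new) = sum_i a_i Y_i K(X_i, x_new) + b.
import numpy as np

def support_vector_indices(a, tol=1e-6):
    # In practice most a_i are zero; the support vectors are the few points with a_i > 0.
    return np.where(a > tol)[0]

def recover_b(X, Y, a, kernel, tol=1e-6):
    # Any support vector Xi satisfies Yi * f(Xi) = 1, which pins down b.
    i = support_vector_indices(a, tol)[0]
    return Y[i] - sum(aj * yj * kernel(xj, X[i]) for aj, yj, xj in zip(a, Y, X))

def decision_function(x_new, X, Y, a, b, kernel):
    return sum(ai * yi * kernel(xi, x_new) for ai, yi, xi in zip(a, Y, X)) + b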

What you should know: the intuition, and where to find software; vectors, lines, and lengths; the margin; QPs with linear constraints; how to handle non-separable data: slack variables and kernels (a new feature space). Ref: A Tutorial on Support Vector Machines for Pattern Recognition (1998), Christopher J. C. Burges. slide 41