Lecture 9. Support Vector Machines

Lecture 9. Support Vector Machines. COMP90051 Statistical Machine Learning, Semester 2, 2017. Lecturer: Andrey Kan. Copyright: University of Melbourne.

This lecture: support vector machines (SVMs) as maximum margin classifiers; deriving the hard margin SVM objective; the SVM objective as a regularised loss function.

Maximum Margin Classifier. A new twist to the binary classification problem.

Beginning: linear SVMs. In the first part, we will consider a basic setup of SVMs, something called the linear hard margin SVM. These definitions will not make much sense now, but we will come back to them later today. The point is to keep in mind that SVMs are more powerful than they may initially appear. Also for this part, we will assume that the data is linearly separable, i.e., there exists a hyperplane that perfectly separates the classes. We will consider training using all data at once.

SVM is a linear binary classifier. SVM is a binary classifier: predict class A if $s \geq 0$ and predict class B if $s < 0$, where $s = b + \sum_{i=1}^{m} x_i w_i$. SVM is a linear classifier: $s$ is a linear function of the inputs, and the separating boundary is linear. (Figure: example for 2D data, showing the plane with data points $x_1, x_2$, the plane with values of $s$, and the separating boundary.)
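
To make the prediction rule concrete, here is a minimal sketch in Python (not from the lecture; the boundary parameters below are assumed example values):

```python
import numpy as np

def svm_predict(x, w, b):
    """Linear SVM prediction: the score s = b + sum_i x_i * w_i is a
    linear function of the inputs, and its sign decides the class."""
    s = b + np.dot(w, x)
    return "A" if s >= 0 else "B"

# Assumed 2D example: boundary x_1 + x_2 - 1 = 0
w = np.array([1.0, 1.0])
b = -1.0
print(svm_predict(np.array([2.0, 2.0]), w, b))  # "A" (s = 3 >= 0)
print(svm_predict(np.array([0.0, 0.0]), w, b))  # "B" (s = -1 < 0)
```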

SVM and the perceptron. In fact, the previous slide was taken from the perceptron lecture. Given learned parameter values, an SVM makes predictions exactly like a perceptron. That is not particularly groundbreaking. Are we done here? What makes SVMs different is the way the parameters are learned. Remember that here learning means choosing parameters that optimise a predefined criterion, e.g., the perceptron minimises the perceptron loss that we studied earlier. SVMs use a different objective/criterion for choosing parameters.

Choosing the separation boundary. An SVM is a linear binary classifier, so choosing parameters essentially means choosing how to draw a separation boundary (hyperplane). In 2D, the problem can be visualised as follows. (Figure: candidate lines A, B and C in the $x_1$-$x_2$ plane.) Which boundary should we use? Line C is a clear no, but A and B both perfectly separate the classes.

Which boundary should we use? Provided the dataset is linearly separable, the perceptron will find a boundary that separates the classes perfectly. This can be any such boundary, e.g., A or B. For the perceptron, all such boundaries are equally good, because the perceptron loss is zero for each of them.

Which boundary should we use? Provided the dataset is linearly separable, the perceptron will find a boundary that separates the classes perfectly. This can be any such boundary, e.g., A or B. But they don't look equally good to us: line A seems to be more reliable. When a new data point arrives, line B is likely to misclassify it.

Aiming for the safest boundary. Intuitively, the most reliable boundary would be the one that is between the classes and as far away from both classes as possible. The SVM objective captures this observation: SVMs aim to find the separation boundary that maximises the margin between the classes.

Maximum margin classifier. An SVM is a linear binary classifier. During training, the SVM aims to find the separating boundary that maximises the margin. For this reason, SVMs are also called maximum margin classifiers. The training data is fixed, so the margin is defined by the location and orientation of the separating boundary, which, of course, are defined by the SVM parameters. Our next step is therefore to formalise our objective by expressing margin width as a function of the parameters (and the data).

The Mathy Part. In which we derive the SVM objective using geometry.

Margin width. While the margin can be thought of as the space between two dashed lines, it is more convenient to define margin width as the distance between the separating boundary and the nearest data point(s). In the figure, the separating boundary is exactly between the classes, so the distances to the nearest red and blue points are the same. The point(s) on the margin boundaries from either side are called support vectors.

Margin width (continued). We want to maximise the distance to the support vectors. However, before doing this, let's derive an expression for the distance to an arbitrary point.

Distance from point to hyperplane (1/3). Consider an arbitrary point $X$ (from either of the classes, and not necessarily the closest one to the boundary), and let $X_p$ denote the projection of $X$ onto the separating boundary. Now, let $\mathbf{r}$ be the vector $\vec{XX_p}$. Note that $\mathbf{r}$ is perpendicular to the boundary, and also that $\|\mathbf{r}\|$ is the required distance. The separation boundary is defined by parameters $\mathbf{w}$ and $b$. From our previous lecture, recall that $\mathbf{w}$ is a vector normal (perpendicular) to the boundary. (In the figure, $\mathbf{w}$ is drawn from an arbitrary starting point.)

Distance from point to hyperplane (2/3). Vectors $\mathbf{r}$ and $\mathbf{w}$ are parallel, but not necessarily of the same length. Thus $\mathbf{r} = \mathbf{w}\frac{\|\mathbf{r}\|}{\|\mathbf{w}\|}$. Next, points $X$ and $X_p$ can be viewed as vectors $\mathbf{x}$ and $\mathbf{x}_p$. From the vector addition rule, we have that $\mathbf{x} + \mathbf{r} = \mathbf{x}_p$, or $\mathbf{x} + \mathbf{w}\frac{\|\mathbf{r}\|}{\|\mathbf{w}\|} = \mathbf{x}_p$. Now let's multiply both sides of this equation by $\mathbf{w}$ and also add $b$: $\mathbf{w}'\mathbf{x} + b + \mathbf{w}'\mathbf{w}\frac{\|\mathbf{r}\|}{\|\mathbf{w}\|} = \mathbf{w}'\mathbf{x}_p + b$. Since $\mathbf{x}_p$ lies on the boundary, $\mathbf{w}'\mathbf{x}_p + b = 0$, so $\mathbf{w}'\mathbf{x} + b + \|\mathbf{w}\|^2\frac{\|\mathbf{r}\|}{\|\mathbf{w}\|} = 0$. The distance is $\|\mathbf{r}\| = -\frac{\mathbf{w}'\mathbf{x} + b}{\|\mathbf{w}\|}$.

Distance from point to hyperplane (3/3). However, if we took our point from the other side of the boundary, vectors $\mathbf{r}$ and $\mathbf{w}$ would be anti-parallel, giving us $\mathbf{r} = -\mathbf{w}\frac{\|\mathbf{r}\|}{\|\mathbf{w}\|}$. In this case, the distance is $\|\mathbf{r}\| = \frac{\mathbf{w}'\mathbf{x} + b}{\|\mathbf{w}\|}$. We will return to this fact shortly, and for now we combine the two cases in the following result: the distance is $\|\mathbf{r}\| = \pm\frac{\mathbf{w}'\mathbf{x} + b}{\|\mathbf{w}\|}$.
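
As a quick numerical check of this formula, here is a short sketch (not from the lecture; the hyperplane and points are assumed example values):

```python
import numpy as np

def signed_distance(x, w, b):
    """Signed distance from point x to the hyperplane w'x + b = 0.
    The absolute value is the geometric distance; the sign tells
    which side of the boundary x lies on."""
    return (np.dot(w, x) + b) / np.linalg.norm(w)

# Assumed example: boundary x_1 + x_2 - 1 = 0 in 2D
w = np.array([1.0, 1.0])
b = -1.0
print(signed_distance(np.array([2.0, 2.0]), w, b))  #  3/sqrt(2) ~  2.12
print(signed_distance(np.array([0.0, 0.0]), w, b))  # -1/sqrt(2) ~ -0.71
```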

Encoding the side using labels. Training data is a collection $\{\mathbf{x}_i, y_i\}$, $i = 1, \dots, n$, where each $\mathbf{x}_i$ is an $m$-dimensional instance and $y_i$ is the corresponding binary label, encoded as $-1$ or $1$. Given a perfect separation boundary, the $y_i$ encode the side of the boundary each $\mathbf{x}_i$ is on. Thus the distance from the $i$-th point to a perfect boundary can be encoded as $\|\mathbf{r}_i\| = \frac{y_i(\mathbf{w}'\mathbf{x}_i + b)}{\|\mathbf{w}\|}$.

Maximum margin objective. The distance from the $i$-th point to a perfect boundary can be encoded as $\|\mathbf{r}_i\| = \frac{y_i(\mathbf{w}'\mathbf{x}_i + b)}{\|\mathbf{w}\|}$. The margin width is the distance to the closest point. Thus SVMs aim to maximise $\min_{i=1,\dots,n} \frac{y_i(\mathbf{w}'\mathbf{x}_i + b)}{\|\mathbf{w}\|}$ as a function of $\mathbf{w}$ and $b$. Do you see any problems with this objective? Remember that $\|\mathbf{w}\| = \sqrt{w_1^2 + \dots + w_m^2}$.

Non-unique representation. A separating boundary (e.g., a line in 2D) is a set of points that satisfy $\mathbf{w}'\mathbf{x} + b = 0$ for some given $\mathbf{w}$ and $b$. However, the same set of points will also satisfy $\tilde{\mathbf{w}}'\mathbf{x} + \tilde{b} = 0$, with $\tilde{\mathbf{w}} = \alpha\mathbf{w}$ and $\tilde{b} = \alpha b$, where $\alpha$ is an arbitrary non-zero constant. The same boundary, and essentially the same classifier, can be expressed with infinitely many parameter combinations.
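
A quick numerical illustration (a sketch with assumed example values): scaling both parameters by the same constant leaves every prediction unchanged.

```python
import numpy as np

# Assumed boundary parameters and a few test points
w, b = np.array([1.0, 1.0]), -1.0
points = np.array([[2.0, 2.0], [0.0, 0.0], [0.5, 0.4]])

for alpha in (1.0, 0.1, 25.0):             # arbitrary non-zero constants
    scores = points @ (alpha * w) + alpha * b
    print(alpha, np.sign(scores))           # the signs (predictions) never change
```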

Resolving ambiguity. Consider a candidate separating line. Which parameter combination should we choose to represent it? As humans, we do not really care, but maths/machines require a precise answer. A possible way to resolve the ambiguity: measure the distance to the closest point ($i^*$), and rescale the parameters such that $\frac{y_{i^*}(\mathbf{w}'\mathbf{x}_{i^*} + b)}{\|\mathbf{w}\|} = \frac{1}{\|\mathbf{w}\|}$. For a given candidate boundary and fixed training points, there will be only one way of scaling $\mathbf{w}$ and $b$ in order to satisfy this requirement.

Constraining the objective. SVMs aim to maximise $\min_{i=1,\dots,n} \frac{y_i(\mathbf{w}'\mathbf{x}_i + b)}{\|\mathbf{w}\|}$. Introduce the (arbitrary) extra requirement $\frac{y_{i^*}(\mathbf{w}'\mathbf{x}_{i^*} + b)}{\|\mathbf{w}\|} = \frac{1}{\|\mathbf{w}\|}$, where $i^*$ denotes the closest point, so that $\frac{1}{\|\mathbf{w}\|}$ is the distance to the closest point. We now have that SVMs aim to find $\arg\min_{\mathbf{w}} \|\mathbf{w}\|$ s.t. $y_i(\mathbf{w}'\mathbf{x}_i + b) \geq 1$ for $i = 1, \dots, n$.

Hard margin SVM objective. We now have a major result: SVMs aim to find $\arg\min_{\mathbf{w}} \|\mathbf{w}\|$ s.t. $y_i(\mathbf{w}'\mathbf{x}_i + b) \geq 1$ for $i = 1, \dots, n$. (Figure: the margin extends a distance of $\frac{1}{\|\mathbf{w}\|}$ on each side of the separating boundary.) Note 1: parameter $b$ is optimised indirectly by influencing the constraints. Note 2: all points are required to be on or outside the margin. Therefore, this version of the SVM is called the hard-margin SVM.
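
To make the objective concrete, here is a minimal sketch (not from the lecture) that solves this constrained problem with the CVXPY modelling library on a made-up, linearly separable 2D dataset; the data and variable names are assumptions for illustration.

```python
import numpy as np
import cvxpy as cp

# Made-up, linearly separable 2D training data: labels are +1 / -1
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 1.5],
              [0.0, 0.0], [-1.0, 0.5], [0.5, -1.0]])
y = np.array([1, 1, 1, -1, -1, -1])

w = cp.Variable(2)
b = cp.Variable()

# Hard-margin objective: minimise ||w|| s.t. y_i (w'x_i + b) >= 1 for all i
constraints = [cp.multiply(y, X @ w + b) >= 1]
problem = cp.Problem(cp.Minimize(cp.norm(w, 2)), constraints)
problem.solve()

print("w =", w.value, " b =", b.value)
print("margin width = 1/||w|| =", 1.0 / np.linalg.norm(w.value))
```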

Geometry of SVM training (1/2). Training a linear SVM essentially means moving/rotating plane B so that the separating boundary changes. This is achieved by changing the $w_i$ and $b$. (Figure: plane A contains the data; plane B shows the values of $b + \sum_{i=1}^{m} x_i w_i$; the separating boundary is where the planes intersect.)

Geometry of SVM training (2/2). However, we can also rotate plane B along the separating boundary. In this case, the boundary does not change: it is the same classifier! The additional requirement fixes the angle between planes A and B to a particular constant. (Figure: as before, plane A contains the data and plane B shows the values of $b + \sum_{i=1}^{m} x_i w_i$.)

SVM Objective as Regularised Loss. Relating the resulting objective function to that of other machine learning methods.

Previously on COMP90051: (1) choose/design a model; (2) choose/design a discrepancy function; (3) find parameter values that minimise the discrepancy with the training data. A loss function measures the discrepancy between the prediction and the true value for a single example. Training error is the average (or sum) of the losses over all examples in the dataset, so step 3 essentially means minimising training error. We defined loss functions for the perceptron and ANNs, and aimed to minimise the loss during training. But how do SVMs fit this pattern?

Regularised training error as objective. Recall the ridge regression objective: minimise $\sum_{i=1}^{n}(y_i - \mathbf{w}'\mathbf{x}_i)^2 + \lambda\|\mathbf{w}\|^2$, where the first term is the data-dependent training error and the second term is the data-independent regularisation term. The hard margin SVM objective is $\arg\min_{\mathbf{w}} \|\mathbf{w}\|$ s.t. $y_i(\mathbf{w}'\mathbf{x}_i + b) \geq 1$ for $i = 1, \dots, n$. The constraints can be interpreted as the loss $l_\infty = 0$ if $1 - y_i(\mathbf{w}'\mathbf{x}_i + b) \leq 0$, and $l_\infty = \infty$ if $1 - y_i(\mathbf{w}'\mathbf{x}_i + b) > 0$.

Hard margin SVM loss. The constraints can be interpreted as the loss $l_\infty = 0$ if $1 - y_i(\mathbf{w}'\mathbf{x}_i + b) \leq 0$, and $l_\infty = \infty$ if $1 - y_i(\mathbf{w}'\mathbf{x}_i + b) > 0$. In other words, for each point: if it's on the right side of the boundary and at least $\frac{1}{\|\mathbf{w}\|}$ units away from the boundary, we're OK and the loss is 0; if the point is on the wrong side, or too close to the boundary, we immediately give infinite loss, thus prohibiting such a solution altogether.
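
A minimal sketch of this loss for a single labelled point (not from the lecture; the boundary parameters are assumed example values):

```python
import numpy as np

def hard_margin_loss(x, y, w, b):
    """Hard-margin SVM loss for one labelled point (x, y), with y in {-1, +1}.
    Returns 0 when y(w'x + b) >= 1 (correct side, at least 1/||w|| away
    from the boundary) and infinity otherwise."""
    return 0.0 if 1.0 - y * (np.dot(w, x) + b) <= 0 else np.inf

# Assumed example boundary x_1 + x_2 - 1 = 0
w, b = np.array([1.0, 1.0]), -1.0
print(hard_margin_loss(np.array([2.0, 2.0]), 1, w, b))  # 0.0 (constraint satisfied)
print(hard_margin_loss(np.array([0.5, 0.4]), 1, w, b))  # inf (too close / wrong side)
```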

This lecture: support vector machines (SVMs) as maximum margin classifiers; deriving the hard margin SVM objective; the SVM objective as a regularised loss function.