Optimal Separating Hyperplane and the Support Vector Machine. Volker Tresp Summer 2018


(Vapnik's) Optimal Separating Hyperplane
Let's consider a linear classifier with $y_i \in \{-1, 1\}$. If the classes are linearly separable, a separating hyperplane can be found. Among all solutions one chooses the one that maximizes the margin $C$.

Optimal Separating Hyperplane (2D) [figure]

Cost Function with Constraints
Thus we want to find a classifier $\hat{y}_i = \mathrm{sign}(h_i)$ with $h_i = \sum_{j=0}^{M-1} w_j x_{i,j}$. The following inequality constraints need to be fulfilled at the solution: $y_i h_i \ge 1$, $i = 1, \dots, N$.
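
A minimal NumPy sketch of this classifier and a check of the constraints; the helper names and the convention that the first input column is a constant 1 carrying the offset $w_0$ are illustrative assumptions, not part of the lecture:

```python
import numpy as np

def predict(X, w):
    """X: (N, M) inputs whose first column is all ones (carries w_0); w: (M,) weights."""
    h = X @ w           # h_i = sum_j w_j x_{i,j}
    return np.sign(h)   # yhat_i = sign(h_i), in {-1, +1}

def constraints_satisfied(X, y, w):
    """Hard-margin constraints y_i h_i >= 1 for all i."""
    return bool(np.all(y * (X @ w) >= 1))
```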

Maximizing the Margin
Of all possible solutions, one chooses the one that maximizes the margin. This can be achieved by finding the solution where the sum of the squares of the weights is minimal: $w_{\mathrm{opt}} = \arg\min_w w^T w = \arg\min_w \sum_{j=1}^{M-1} w_j^2$, where $w = (w_1, \dots, w_{M-1})$ (this means that the offset $w_0$ is not part of $w$); $y_i \in \{-1, 1\}$.

Margin and Support Vectors
The margin becomes $C = 1 / \|w_{\mathrm{opt}}\|$. For the support vectors we have $y_i (x_i^T w_{\mathrm{opt}}) = 1$.
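
As a sketch of how the margin can be read off in practice: the toy data, the scikit-learn SVC call, and the very large regularization value used to mimic a hard margin are all assumptions for illustration (note that scikit-learn's parameter C is unrelated to the margin $C$ of the slides):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])  # separable toy data
y = np.array([-1] * 20 + [1] * 20)

svm = SVC(kernel="linear", C=1e6).fit(X, y)   # very large C approximates the hard-margin SVM
w = svm.coef_.ravel()
margin = 1.0 / np.linalg.norm(w)              # margin C = 1 / ||w_opt||
print("margin:", margin)
```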

Optimal Separating Hyperplane (1D) [figure]

Optimization Problem
The optimization problem is: minimize $w^T w$ under the constraints $1 - y_i(x_i^T w) \le 0$ for all $i$. We get the Lagrangian $L_P = \frac{1}{2} w^T w + \sum_{i=1}^{N} \mu_i [1 - y_i(x_i^T w)]$. The Lagrangian is minimized with respect to $w$ and maximized with respect to $\mu_i \ge 0$ (saddle-point solution).
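
The step from this saddle-point problem to the solution on the next slide is the stationarity condition of the Lagrangian (a standard intermediate step, added here for completeness): $\frac{\partial L_P}{\partial w} = w - \sum_{i=1}^{N} \mu_i y_i x_i = 0$, hence $w = \sum_{i=1}^{N} \mu_i y_i x_i$. Together with the KKT conditions $\mu_i \ge 0$ and $\mu_i [1 - y_i(x_i^T w)] = 0$, this means $\mu_i > 0$ is only possible for points with $y_i(x_i^T w) = 1$, i.e., the support vectors.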

Solution
The problem is solved via the Wolfe dual, and the solution can be written as $w = \sum_{i=1}^{N} y_i \mu_i x_i = \sum_{i \in SV} y_i \mu_i x_i$. Thus the sum is over the terms where the Lagrange multiplier is not zero, i.e., the support vectors. We can also write the solution as $h(x) = w_0 + \sum_{j=1}^{M-1} w_j x_j = w_0 + x^T w = w_0 + x^T \sum_{i \in SV} y_i \mu_i x_i = w_0 + \sum_{i \in SV} y_i \mu_i x^T x_i = w_0 + \sum_{i \in SV} y_i \mu_i k(x_i, x)$. Thus we immediately get a kernel solution with $k(x_i, x) = x^T x_i$.

The solution can be written as a weighted sum over support vector kernels! Naturally, if one works with basis functions, one gets $k(x, x_i) = \phi(x)^T \phi(x_i)$.
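
A sketch that checks this representation numerically with scikit-learn; the toy data are an assumption, and scikit-learn's dual_coef_ attribute plays the role of the products $y_i \mu_i$ for the support vectors:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)         # linearly separable toy labels

svm = SVC(kernel="linear").fit(X, y)

x_new = rng.normal(size=(1, 2))
k = x_new @ svm.support_vectors_.T                  # k(x_i, x) = x^T x_i for all support vectors
h_manual = svm.intercept_ + k @ svm.dual_coef_.ravel()   # w_0 + sum_{i in SV} y_i mu_i k(x_i, x)
print(np.allclose(h_manual, svm.decision_function(x_new)))  # True
```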

Non-separable Classes
If the classes are not linearly separable, one needs to extend the approach. One introduces the so-called slack variables $\xi_i$. Find $w_{\mathrm{opt}} = \arg\min_w w^T w$ under the constraints $y_i(x_i^T w) \ge 1 - \xi_i$, $i = 1, \dots, N$, and $\xi_i \ge 0$, where $\sum_{i=1}^{N} \xi_i \le 1/\gamma$. The smaller $\gamma > 0$, the more slack is permitted. For $\gamma \to \infty$, one obtains the hard constraints.

Optimization
The optimal separating hyperplane is found via a more involved constrained optimization of the quadratic cost function with linear constraints (a quadratic program). $\gamma$ is a hyperparameter.

Optimization via Penalty Method
We define $w_{\mathrm{opt}} = \arg\min_w \left( 2\gamma \sum_{i=1}^{N} |1 - y_i(x_i^T w)|_+ + w^T w \right)$, where $|1 - y_i(x_i^T w)|_+$ is the penalty term. Here, $|\mathrm{arg}|_+ = \max(\mathrm{arg}, 0)$. With a finite $\gamma$, slack is permitted.
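
A minimal NumPy sketch of this penalty formulation, minimized by plain subgradient descent; the function name, step size, and iteration count are assumptions for illustration and not the optimizer used in the lecture:

```python
import numpy as np

def fit_penalty_svm(X, y, gamma=1.0, lr=0.01, epochs=500):
    """Minimize  2*gamma * sum_i |1 - y_i (x_i^T w)|_+  +  w^T w  by subgradient descent."""
    N, M = X.shape
    w = np.zeros(M)
    for _ in range(epochs):
        slack = 1 - y * (X @ w)                  # 1 - y_i h_i
        active = slack > 0                       # points that violate the margin
        # subgradient of the penalty term plus gradient of w^T w
        grad = -2 * gamma * (y[active, None] * X[active]).sum(axis=0) + 2 * w
        w -= lr * grad
    return w
```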

Comparison of Cost Functions
We consider a data point of class 1 and the contribution of one data point to the cost function / negative log-likelihood. The contribution of $x_i$ to the cost is:
Least squares (blue): $\mathrm{cost}(x_i, y_i = 1) = (1 - h(x_i))^2$
Perceptron (black): $\mathrm{cost}(x_i, y_i = 1) = |{-h(x_i)}|_+$
Vapnik's optimal hyperplane (green): $\mathrm{cost}(x_i, y_i = 1) = \gamma |1 - h(x_i)|_+$
Logistic regression (magenta): $\mathrm{cost}(x_i, y_i = 1) = \log(1 + \exp(-h(x_i)))$
Neural network (red): $\mathrm{cost}(x_i, y_i = 1) = (1 - \mathrm{sig}(h(x_i)))^2$
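
For reference, a short sketch of the per-point costs listed above, written as NumPy expressions for a class $y_i = 1$ example as a function of $h(x_i)$ (with $\gamma = 1$ assumed for the hinge term):

```python
import numpy as np

def sig(h):
    return 1.0 / (1.0 + np.exp(-h))

h = np.linspace(-3, 3, 601)            # candidate values of h(x_i)
least_squares = (1 - h) ** 2           # (1 - h)^2
perceptron    = np.maximum(-h, 0)      # |-h|_+
vapnik_hinge  = np.maximum(1 - h, 0)   # |1 - h|_+  (gamma = 1)
logistic      = np.log1p(np.exp(-h))   # log(1 + exp(-h))
neural_net    = (1 - sig(h)) ** 2      # (1 - sig(h))^2
```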

Toy Example
Data for two classes (red, green) are generated, and the classes overlap. The true class boundary is shown in violet. The continuous line is the separating hyperplane found by the linear SVM. All data points within the dotted region are support vectors (62% of all data points). $\gamma$ is large (little slack is permitted).

Toy Example (cont'd)
Linear SVM with small $\gamma$: the solution has many more support vectors (85% of all data points). The test error is almost the same.

Toy Example (cont'd)
With polynomial kernels, the test error is reduced since the fit is better. Note that although the support vectors are close to the separating plane in the basis function space, this is not necessarily true in input space.

Toy Example (cont'd)
Gaussian kernels give the best results. Most data points are support vectors.

Comments
The idea of searching for solutions with a large margin has been extended to many other problems.

Optimal Separating Hyperplane and Perceptron
Vapnik was interested in solutions which, like the Perceptron, focus on the currently misclassified examples. He could preserve this idea in the Optimal Separating Hyperplane. The Perceptron requires $y_i h_i \ge 0$, $i = 1, \dots, N$, which leads to non-unique solutions (even when weight decay is added). The important insight of Vapnik was that requiring $y_i h_i \ge 1$, $i = 1, \dots, N$, leads to unique solutions when weight decay is added.
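
A small sketch of the uniqueness point (toy data assumed): any positive rescaling of a separating $w$ still satisfies the Perceptron constraints $y_i h_i \ge 0$, whereas $y_i h_i \ge 1$ rules out arbitrarily small rescalings, and minimizing $w^T w$ under these constraints then picks out a unique scale:

```python
import numpy as np

X = np.array([[1.0, 2.0], [-1.0, -2.0]])   # toy data, linearly separable
y = np.array([1, -1])
w = np.array([0.5, 0.5])                    # one separating direction

for c in (0.1, 1.0, 10.0):
    h = X @ (c * w)
    print(c, bool(np.all(y * h >= 0)), bool(np.all(y * h >= 1)))
# y_i h_i >= 0 holds for every c > 0; y_i h_i >= 1 only above a threshold scale.
```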