Kernel Methods & Support Vector Machines


Arvind Visvanathan, CSCE 970 Pattern Recognition.

Question: Can you draw a single line to separate the two classes?

Outline: Kernels (Definition, Working, Duality, Construction & Validity, Types).

Curve fitting: Data. Synthetic data.

Curve fitting: Error function. A predicting polynomial and its error function.

Curve fitting: Solutions.

Curve fitting: Regularization term. A new error function that penalizes complex functions.

Curve fitting: Modified solutions.

Curve fitting: Polynomial coefficients. With the penalty term, the coefficients are reduced.
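The error function itself is not written out on these slides; a standard regularized sum-of-squares form for polynomial curve fitting (an assumption, using w for the coefficients and lambda for the regularization weight) is:

```latex
% Regularized sum-of-squares error (standard form, assumed; not copied from the slides)
\[
\tilde{E}(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\bigl(y(x_n,\mathbf{w}) - t_n\bigr)^{2}
  + \frac{\lambda}{2}\lVert\mathbf{w}\rVert^{2}
\]
```

The penalty term discourages large coefficients, which is why the fitted polynomial coefficients shrink in the last slide of this block.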

Outline: Kernels (Definition, Working, Duality, Construction & Validity, Types).

Kernels: Definition. The kernel function.

Kernels: Details. A kernel is an inner product in feature space; inputs live in the input space and are mapped to a feature space, and the feature-space mapping is implicit.
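In symbols, with Φ denoting the (possibly implicit) feature-space mapping used throughout these slides:

```latex
% Kernel as an inner product in feature space
\[
k(\mathbf{x}, \mathbf{x}') = \boldsymbol{\Phi}(\mathbf{x})^{\mathsf T}\,\boldsymbol{\Phi}(\mathbf{x}')
\]
```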

Kernels: History. Kernels are symmetric functions. Introduced by Aizerman et al. in 1964 and reintroduced as large-margin classifiers by Boser et al. in 1992, giving rise to Support Vector Machines.

Kernels: Summary. A kernel is an inner product in feature space, and existing techniques can be extended with the kernel trick: formulate the algorithm so that input vectors enter only through inner products, then substitute a kernel for the inner product, so that the formulation is effectively solved in a higher-dimensional feature space.
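A minimal sketch of the kernel trick for a degree-2 polynomial kernel in two dimensions; the feature map and the test points are illustrative choices, not taken from the slides:

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel in 2-D
    (an illustrative, assumed choice)."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def poly_kernel(x, z):
    """Kernel trick: the same inner product, computed in input space."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

explicit = np.dot(phi(x), phi(z))   # inner product in feature space
implicit = poly_kernel(x, z)        # no feature vectors ever built
print(explicit, implicit)           # both print 16.0
```

Because the two numbers agree, an algorithm written purely in terms of inner products never needs Φ(x) explicitly; it only needs kernel values.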

Kernels: Working (XOR example). Input space x = (x1, x2). Class 1: (+1, +1), (-1, -1). Class 2: (+1, -1), (-1, +1). No linear classifier can separate them.

Kernels: Working (XOR example). Solution: transform the inputs into a higher dimension. The remapped classes are: Class 1: (1, 1.414, 1), (1, 1.414, 1); Class 2: (1, -1.414, 1), (1, -1.414, 1).

Kernels: Working (XOR example). The remapped points are now linearly separable.

Kernels: Working (XOR example). The mapping is implicit.
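A feature map consistent with the remapped coordinates listed above (the slides do not state it, so this particular map is inferred), together with the kernel it induces:

```latex
% Inferred mapping for the XOR example; sqrt(2) = 1.414 explains the middle coordinate
\[
\boldsymbol{\Phi}(x_1, x_2) = \bigl(x_1^{2},\ \sqrt{2}\,x_1 x_2,\ x_2^{2}\bigr),
\qquad
k(\mathbf{x}, \mathbf{z}) = \boldsymbol{\Phi}(\mathbf{x})^{\mathsf T}\boldsymbol{\Phi}(\mathbf{z}) = (\mathbf{x}^{\mathsf T}\mathbf{z})^{2}
\]
```

The middle coordinate is +1.414 for Class 1 and -1.414 for Class 2, so a linear boundary on that single coordinate separates the XOR classes.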

Duality. Any linear model, for regression or classification, can be formulated in terms of a dual representation, and the kernel function arises naturally in the dual. This will be important in the later sections.

Duality: linear regression model. The error function is a data-dependent error plus a weight-vector-dependent regularizer.

Duality: linear regression model. Φ is the design matrix, whose n-th row is the feature vector Φ(x_n)^T.

Duality: dual formulation. Substitute w as a linear combination of the feature vectors, w = Φ^T a.

Duality: Gram matrix. Introduce the Gram matrix.
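With Φ the design matrix whose n-th row is Φ(x_n)^T, the Gram matrix is:

```latex
% Gram matrix in terms of the design matrix and the kernel
\[
\mathbf{K} = \boldsymbol{\Phi}\boldsymbol{\Phi}^{\mathsf T},
\qquad
K_{nm} = \boldsymbol{\Phi}(\mathbf{x}_n)^{\mathsf T}\boldsymbol{\Phi}(\mathbf{x}_m) = k(\mathbf{x}_n, \mathbf{x}_m)
\]
```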

Duality: the kernel trick.

Duality: solution. Solve for a by setting the gradient to zero.

Duality: model, written in terms of the design matrix Φ (rows are the feature vectors).
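For the regularized least-squares model above, the dual solution and the resulting prediction take the standard kernelized form (assumed here, matching the usual derivation rather than copied from the slides):

```latex
% Dual solution of regularized least squares (standard form, assumed)
\[
\mathbf{a} = (\mathbf{K} + \lambda \mathbf{I}_N)^{-1}\mathbf{t},
\qquad
y(\mathbf{x}) = \mathbf{k}(\mathbf{x})^{\mathsf T}(\mathbf{K} + \lambda \mathbf{I}_N)^{-1}\mathbf{t},
\quad\text{where } k_n(\mathbf{x}) = k(\mathbf{x}_n, \mathbf{x})
\]
```

Inverting K + lambda I_N is an N x N problem, which is where the O(N^3) dual cost on the next slide comes from; the primal instead works with an M x M system, giving O(M^3).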

Duality: model. Primal: dimension M, solution cost O(M^3). Dual: dimension N, solution cost O(N^3). Whether the dual is good, bad, or ugly depends on how M compares with N.

Duality: examples. Images of 500 x 500 pixels, with one feature per pixel: M = 250,000 features, N = 10,000 images. Primal O(M^3) is about 1.5 x 10^16; dual O(N^3) is 1.0 x 10^12; a factor of roughly 10,000.

Duality: examples. A protein sequence of 1 million characters, with one feature per character: M = 1,000,000 features, N = 10,000 sequences. Primal O(M^3) is 1.0 x 10^18; dual O(N^3) is 1.0 x 10^12; a factor of 1 million.

Duality: what did we learn? Start from an error function in terms of w; transform it into the dual form, eliminating w, so that it is expressed in terms of the Gram matrix and the kernel function only; we can return to the original formulation by expressing a in terms of w and the inputs x.

Kernels: Construction. Exploit kernel substitution and construct valid kernels. One approach: choose a feature space Φ(x) and find the corresponding kernel. How can we construct a valid kernel without constructing Φ(x) explicitly?

Kernels: Validity. A kernel k(x, x') is a valid kernel if the Gram matrix K is positive semidefinite for all possible choices of the set {x_n}. A matrix K is positive semidefinite if c^T K c >= 0 for all vectors c; equivalently, all eigenvalues of K satisfy λ_i >= 0.
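A quick numerical version of this validity check, sketched with a Gaussian kernel as an assumed example (any candidate kernel could be dropped in):

```python
import numpy as np

def rbf_kernel(x, z, sigma=1.0):
    """Gaussian (RBF) kernel, used here only to illustrate the PSD check."""
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))          # 20 arbitrary input points x_n

# Gram matrix with K[n, m] = k(x_n, x_m)
K = np.array([[rbf_kernel(xn, xm) for xm in X] for xn in X])

eigenvalues = np.linalg.eigvalsh(K)   # K is symmetric, so eigvalsh applies
print(eigenvalues.min() >= -1e-10)    # True: no negative eigenvalues beyond round-off
```

A single negative eigenvalue for some choice of points would prove a candidate kernel invalid; passing the check for one sample of points is only evidence, not a proof.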

Kernels: Eigenvectors and eigenvalues. From the eigenvector equation K u_i = λ_i u_i, K is positive semidefinite if and only if every eigenvalue λ_i >= 0.

Kernels: Construction.

Kernels: Polynomial kernel. Various forms of the polynomial kernel.
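The usual variants, in this document's notation (the exact list is assumed, since the slide only names them):

```latex
% Common polynomial kernel forms (assumed list)
\[
k(\mathbf{x},\mathbf{x}') = (\mathbf{x}^{\mathsf T}\mathbf{x}')^{2},
\qquad
k(\mathbf{x},\mathbf{x}') = (\mathbf{x}^{\mathsf T}\mathbf{x}' + c)^{2},
\qquad
k(\mathbf{x},\mathbf{x}') = (\mathbf{x}^{\mathsf T}\mathbf{x}' + c)^{M},
\qquad c > 0
\]
```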

Kernels: Gaussian kernel. The simplest form. What is the dimension of the feature space? From which kernel does the proof of validity follow? The kernel can be modified to use a non-Euclidean distance.

Kernels: Gaussian kernel. The feature space is infinite-dimensional, as the power-series expansion shows.
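The Gaussian kernel and the factorization behind the infinite-dimension, power-series remark (standard derivation, assumed):

```latex
% Gaussian kernel; the middle factor expands as a power series of polynomial kernels
\[
k(\mathbf{x},\mathbf{x}') = \exp\!\Bigl(-\tfrac{\lVert\mathbf{x}-\mathbf{x}'\rVert^{2}}{2\sigma^{2}}\Bigr)
= \exp\!\Bigl(-\tfrac{\mathbf{x}^{\mathsf T}\mathbf{x}}{2\sigma^{2}}\Bigr)
  \exp\!\Bigl(\tfrac{\mathbf{x}^{\mathsf T}\mathbf{x}'}{\sigma^{2}}\Bigr)
  \exp\!\Bigl(-\tfrac{\mathbf{x}'^{\mathsf T}\mathbf{x}'}{2\sigma^{2}}\Bigr),
\qquad
\exp\!\Bigl(\tfrac{\mathbf{x}^{\mathsf T}\mathbf{x}'}{\sigma^{2}}\Bigr)
= \sum_{k=0}^{\infty}\frac{(\mathbf{x}^{\mathsf T}\mathbf{x}')^{k}}{\sigma^{2k}\,k!}
\]
```

The series is a weighted sum of polynomial kernels of every order, so the implicit feature space is infinite-dimensional, and validity follows from the polynomial kernel together with the standard kernel-construction rules.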

Outline: Kernels, Classifiers, Multiclass.

We have studied non-linear kernel methods. Problem: k(x_n, x_m) must be computed for all possible pairs of x_n and x_m, which is computationally infeasible when the training set runs into millions of examples.

Solution: a sparse solution, in which k(x_n, x_m) needs to be computed only for a subset of the training data points. This applies to both classification and regression.

This is a convex optimization problem with a single, unique global optimum.

Classification problem: N input vectors {x_1, ..., x_N} with corresponding target values {t_1, ..., t_N}, where t_n is -1 or +1; Class 1 has t_n = +1 and Class 2 has t_n = -1.

Classifier model: y(x) = w^T Φ(x) + b. The class of a new, unseen example is given by sgn(y(x)).

Assumption: the training data are linearly separable in feature space, so there exists at least one setting of the parameters w and b with y(x_n) > 0 for all training examples having t_n = +1 and y(x_n) < 0 for all training examples having t_n = -1; equivalently, t_n y(x_n) > 0 for all training points.

Margin. If many solutions exist, find the best one: the solution that gives the smallest generalization error, the maximum margin classifier.

Margin: the smallest distance between the decision boundary and any of the samples.

Maximize the margin to find the globally optimal solution.

Canonical decision hyperplane. Model: y(x) = w^T Φ(x) + b, with Class 1 where w^T Φ(x_n) + b > 0 and Class 2 where w^T Φ(x_n) + b < 0. Scale the weight vector so that, for the points closest to the decision plane, Class 1 gives w^T Φ(x_n) + b = 1 and Class 2 gives w^T Φ(x_n) + b = -1.

Canonical decision hyperplane. What is the minimum distance to the hyperplane? From w^T Φ(x_1) + b = 1 and w^T Φ(x_2) + b = -1, the difference gives w^T (Φ(x_1) - Φ(x_2)) = 2; normalizing the weight vector, (w^T / ||w||)(Φ(x_1) - Φ(x_2)) = 2 / ||w||. The minimum distance is therefore 1 / ||w||, and for any point the distance is y(x) / ||w||.
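Putting the distance argument together, the maximum-margin objective in this notation is (standard form, stated as an assumption):

```latex
% Maximum-margin objective; with canonical scaling the inner minimum equals 1
\[
\arg\max_{\mathbf{w},\, b}\ \Bigl\{\frac{1}{\lVert\mathbf{w}\rVert}\,
  \min_{n}\bigl[t_n\bigl(\mathbf{w}^{\mathsf T}\boldsymbol{\Phi}(\mathbf{x}_n) + b\bigr)\bigr]\Bigr\}
\]
```

With the canonical scaling above, the inner minimum equals 1, so maximizing the margin is the same as maximizing 1 / ||w||.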

Primal formulation. For points on the margin we can thus state t_n (w^T Φ(x_n) + b) = 1, and for all the data points the constraint t_n (w^T Φ(x_n) + b) >= 1 is satisfied.

Primal formulation. Points for which equality holds are said to have an active constraint; the remainder are inactive. By definition there is always at least one data point with an active constraint, and after maximization there will be at least two.

Maximize the distance to the margin: the distance from the points to the classifier.

Maximize the distance to the margin. A direct solution is a complex problem; the solution is to formulate the problem using Lagrange multipliers.

Strict primal formulation. Optimization: minimizing the weight vector maximizes the margin, subject to the constraints that all the training samples are correctly classified. This is a constrained optimization problem.

Primal formulation. Introduce Lagrange multipliers to incorporate the constraints into the optimization.
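The resulting constrained problem and its Lagrangian, written out in this notation (standard hard-margin form, given as an assumption):

```latex
% Hard-margin primal and its Lagrangian (standard form, assumed)
\[
\min_{\mathbf{w},\, b}\ \frac{1}{2}\lVert\mathbf{w}\rVert^{2}
\quad\text{s.t.}\quad
t_n\bigl(\mathbf{w}^{\mathsf T}\boldsymbol{\Phi}(\mathbf{x}_n) + b\bigr) \ge 1,\quad n = 1,\dots,N
\]
\[
L(\mathbf{w}, b, \mathbf{a}) = \frac{1}{2}\lVert\mathbf{w}\rVert^{2}
 - \sum_{n=1}^{N} a_n\bigl\{t_n\bigl(\mathbf{w}^{\mathsf T}\boldsymbol{\Phi}(\mathbf{x}_n) + b\bigr) - 1\bigr\},
\qquad a_n \ge 0
\]
```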

Primal (intuition). Minimize with respect to the primal variables (w, b), maximize with respect to the dual variables a, and find the saddle point.

Primal (intuition). If a constraint is violated, i.e. t_i (w^T Φ(x_i) + b) - 1 < 0, then L can be increased by increasing a_i, while w and b change at the same time to decrease L. To prevent a_i [t_i (w^T Φ(x_i) + b) - 1] from becoming an arbitrarily large negative number, w and b will eventually ensure that the constraint is satisfied.

Formulation: gradient with respect to w.

Formulation: gradient with respect to b.

Conversion to the dual. Setting the derivatives with respect to w and b equal to zero gives the following conditions: gradient w.r.t. w = 0 and gradient w.r.t. b = 0.

Dual formulation. Eliminating w and b from L(w, b, a) gives the dual representation, which is a maximization problem subject to constraints.
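The stationarity conditions and the resulting dual problem, written out (standard derivation, assumed to match the slide's equations):

```latex
% Stationarity conditions and the hard-margin dual (standard form, assumed)
\[
\frac{\partial L}{\partial \mathbf{w}} = 0 \;\Rightarrow\; \mathbf{w} = \sum_{n} a_n t_n \boldsymbol{\Phi}(\mathbf{x}_n),
\qquad
\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{n} a_n t_n = 0
\]
\[
\max_{\mathbf{a}}\ \tilde{L}(\mathbf{a}) = \sum_{n} a_n
 - \frac{1}{2}\sum_{n}\sum_{m} a_n a_m t_n t_m\, k(\mathbf{x}_n, \mathbf{x}_m)
\quad\text{s.t.}\quad a_n \ge 0,\ \ \sum_{n} a_n t_n = 0
\]
```

Substituting w back into y(x) = w^T Φ(x) + b gives the kernelized prediction y(x) = Σ_n a_n t_n k(x, x_n) + b, which is what the next slide evaluates for a new example.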

Solution. To predict for a new example, substitute the value of w into the model; the prediction is then expressed in terms of the kernel. What is the complexity?

Karush-Kuhn-Tucker. The constrained optimization satisfies the Karush-Kuhn-Tucker (KKT) conditions, given below.
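For this problem the KKT conditions read (standard statement, assumed):

```latex
% KKT conditions for the hard-margin SVM
\[
a_n \ge 0,
\qquad
t_n\, y(\mathbf{x}_n) - 1 \ge 0,
\qquad
a_n\bigl(t_n\, y(\mathbf{x}_n) - 1\bigr) = 0
\]
```

The last (complementary slackness) condition means that for every point either a_n = 0 or t_n y(x_n) = 1.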

Karush-Kuhn-Tucker: support vectors. The points with a_n > 0 are the support vectors; all other points have a_n = 0 and drop out of the prediction.

Solution. The original solution for an unknown x costs O(N); the modified solution, using only the support vectors, costs O(S). Are we there yet? N is much larger than S.

Solution for the threshold b. For all support vectors the margin constraint holds with equality; after substitution, solve for the threshold and average over all support vectors.
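A numerically stable way to write the threshold, averaging over the support-vector set S with N_S = |S| (standard form, assumed):

```latex
% Averaged threshold over the support vectors (standard form, assumed)
\[
b = \frac{1}{N_S}\sum_{n \in S}\Bigl(t_n - \sum_{m \in S} a_m t_m\, k(\mathbf{x}_n, \mathbf{x}_m)\Bigr)
\]
```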

Example classification problem.

Hard margin. We assumed that a linear classifier exists. Hard margin: exact separation in feature space (a linear separator), which may correspond to a non-linear separator in input space; all examples are classified correctly.

Soft margin. We assumed that a linear classifier exists. Soft margin: some examples may be misclassified, and some may lie closer to the separator than the margin; both incur a penalty during learning, handled with slack variables.

Slack variables. One for each training example: zero if the example is correctly classified with margin at least 1, positive otherwise; it equals 1 on the decision plane and exceeds 1 if the example is misclassified.

Slack variables.
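Written out, with ξ_n the slack variable of example n (standard soft-margin definitions, assumed):

```latex
% Slack variables and the softened margin constraint (standard definitions, assumed)
\[
\xi_n =
\begin{cases}
0 & \text{if } t_n\, y(\mathbf{x}_n) \ge 1,\\
\lvert t_n - y(\mathbf{x}_n)\rvert & \text{otherwise,}
\end{cases}
\qquad
t_n\, y(\mathbf{x}_n) \ge 1 - \xi_n,\quad \xi_n \ge 0
\]
```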

C-SVM: primal formulation (minimize an objective subject to constraints; a standard form follows).
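A standard C-SVM primal in this notation (an assumed reconstruction of the objective and constraints named above):

```latex
% C-SVM primal (standard form, assumed)
\[
\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}}\ \frac{1}{2}\lVert\mathbf{w}\rVert^{2} + C\sum_{n=1}^{N}\xi_n
\quad\text{s.t.}\quad
t_n\bigl(\mathbf{w}^{\mathsf T}\boldsymbol{\Phi}(\mathbf{x}_n) + b\bigr) \ge 1 - \xi_n,\ \ \xi_n \ge 0
\]
```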

C-SVM: primal formulation.

C-SVM: from primal to dual.

C-SVM: dual formulation.
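The corresponding dual (standard form, assumed) is identical to the hard-margin dual except that the multipliers acquire a box constraint:

```latex
% C-SVM dual: same as the hard-margin dual, with 0 <= a_n <= C
\[
\max_{\mathbf{a}}\ \tilde{L}(\mathbf{a}) = \sum_{n} a_n
 - \frac{1}{2}\sum_{n}\sum_{m} a_n a_m t_n t_m\, k(\mathbf{x}_n,\mathbf{x}_m)
\quad\text{s.t.}\quad 0 \le a_n \le C,\ \ \sum_{n} a_n t_n = 0
\]
```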

C-SVM: KKT conditions.

C-SVM: interpretation.

C-SVM: problem. The value to take for C is not clear, and the intuition behind C is not clear either. Solution: the ν-SVM.

ν-SVM: formulation. Minimize the objective subject to the constraints (a standard form follows).
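A standard ν-SVM formulation (Schölkopf et al.), given as an assumed reconstruction of the objective and constraints:

```latex
% nu-SVM primal (standard form, assumed)
\[
\min_{\mathbf{w},\, b,\, \boldsymbol{\xi},\, \rho}\ \frac{1}{2}\lVert\mathbf{w}\rVert^{2} - \nu\rho + \frac{1}{N}\sum_{n=1}^{N}\xi_n
\quad\text{s.t.}\quad
t_n\bigl(\mathbf{w}^{\mathsf T}\boldsymbol{\Phi}(\mathbf{x}_n) + b\bigr) \ge \rho - \xi_n,\ \ \xi_n \ge 0,\ \ \rho \ge 0
\]
```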

ν-SVM: interpretation. If there is a solution with ρ > 0, then ν is an upper bound on the fraction of margin errors (points with ξ > 0) and a lower bound on the fraction of support vectors.

ν-SVM: interpretation.

Multiclass. The SVM is defined as a binary classifier; what if there are M classes instead of 2? There are several ways around this.

Multiclass: one-vs-rest. Train M classifiers; classifier M_i is trained with the examples of class i as positive and the examples of every other class as negative. For an unseen example, run it through all the classifiers and assign the class whose classifier has the maximum y(x) (see the example and sketch below).

Multiclass: one-vs-rest example.
Classifier:    Class 1  Class 2  Class 3  Class 4  Class 5  Class 6
Output y(x):     1.23     2.34     4.32     7.50     4.98     3.32
The example is assigned to Class 4, whose classifier has the largest output.
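A minimal one-vs-rest sketch, assuming scikit-learn's SVC is available; the data, labels, and parameters are synthetic placeholders rather than anything from the slides:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))            # placeholder data
y = rng.integers(0, 3, size=300)         # placeholder labels for 3 classes

classes = np.unique(y)
classifiers = []
for c in classes:
    # One binary SVM per class: class c as positive, the rest as negative.
    clf = SVC(kernel="rbf", C=1.0)
    clf.fit(X, (y == c).astype(int))
    classifiers.append(clf)

def predict(x):
    # Assign the class whose classifier gives the largest decision value y(x).
    scores = [clf.decision_function(x.reshape(1, -1))[0] for clf in classifiers]
    return classes[int(np.argmax(scores))]

print(predict(np.array([0.5, -0.2])))
```

Comparing raw decision values across independently trained classifiers is exactly the scale problem raised on the next slide.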

Multiclass: one-vs-rest problems. The scale of y(x) can differ between classifiers: one classifier may output in the range (-100, 100), another in (-1, 10). The training set is also imbalanced: with 10 classes of 10 examples each, every classifier sees 10 positive and 90 negative samples. Even so, one-vs-rest is the most widely used multiclass approach.

Multiclass: one-vs-one. Train M(M-1)/2 classifiers; each classifier is trained on the points from two classes.

Multiclass: one-vs-one. For an unknown input, run all the classifiers and classify using a voting approach (see the sketch below). Example: Classifier(1,2) outputs 1, Classifier(1,3) outputs 3, Classifier(1,4) outputs 1, Classifier(2,3) outputs 3, Classifier(2,4) outputs 4, Classifier(3,4) outputs 3.
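The voting step, written as a small pure-Python sketch using the pairwise outputs from the example above:

```python
from collections import Counter

# Predictions of the six pairwise classifiers from the example above.
pairwise_predictions = {
    ("1", "2"): "1", ("1", "3"): "3", ("1", "4"): "1",
    ("2", "3"): "3", ("2", "4"): "4", ("3", "4"): "3",
}

winner, votes = Counter(pairwise_predictions.values()).most_common(1)[0]
print(winner, votes)   # class 3 wins with 3 votes
```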

Multiclass: one-vs-one problems. Too many classifiers: 20 classes require 190 classifiers, and the training time grows quickly as M increases.

Multiclass: error-correcting codes. Use on the order of log2(M) + C classifiers, each predicting one bit of a class codeword:
           C-1   C-2   C-3
Class 0:    -1    -1    -1
Class 1:    -1    -1     1
Class 2:    -1     1    -1
...
Class 7:     1     1     1

Question: Can you draw a single line to separate the two classes?

Question: Can you draw a single line to separate the two classes with a minimum of 5% of the training examples being support vectors?

Question: Can you build a multiclass linear separator for the three classes?

Questions?