Data Analysis 3: Support Vector Machines
Jan Platoš, October 30, 2017
Department of Computer Science, Faculty of Electrical Engineering and Computer Science, VŠB - Technical University of Ostrava

Table of contents
1. Support Vector Machines
2. Other applications of the Kernel Methods

Support Vector Machines

- Naturally defined for binary classification of numeric data.
- Multi-class generalization is possible using several different strategies.
- Categorical features may be binarized and used.
- The class labels are assumed to be from the set {−1, +1}.
- Separating hyperplanes are used as the classification criterion, as with all linear models.
- The hyperplane is determined using the notion of margin.

Support Vector Machines - Linearly separable case

- A hyperplane exists that cleanly separates the points belonging to the two classes.
- An infinite number of ways of constructing such a separating hyperplane exists.
- The maximum-margin hyperplane has to be chosen, i.e. the one whose minimum perpendicular distance to the data points is maximal.

Figure 1: Linearly separable case (hard separation): the separating hyperplane, the two parallel hyperplanes through the support vectors, the margin, and a test instance.

Support Vector Machines - Linearly separable case

- A hyperplane that cleanly separates the two linearly separable classes exists.
- The margin of the hyperplane is defined as the sum of its distances to the closest training points belonging to each of the two classes.
- The distance from the separating hyperplane to the closest training points of either class is the same.
- Two hyperplanes parallel to the separating one may be constructed so that they touch the training points of either class and have no data points between them.
- The training points lying on these hyperplanes are referred to as the support vectors.
- The distance between these two parallel hyperplanes is the margin.
- The separating hyperplane lies precisely in the middle between them in order to achieve the most accurate classification.

Support Vector Machines - Linearly separable case

Determination of the maximum-margin hyperplane:

- Set up a non-linear programming optimization formulation that maximizes the margin, expressed as a function of the coefficients of the hyperplane.
- The optimal coefficients are then determined by solving this optimization problem.

Support Vector Machines - Linearly separable case

Definition:

- n is the number of data points in the training set D.
- The i-th data point is denoted as (X_i, y_i), where X_i is a d-dimensional row vector and y_i ∈ {−1, +1} is the binary class variable.
- The separating hyperplane has the form W · X + b = 0, where W = (w_1, ..., w_d) is the d-dimensional row vector representing the direction of the normal of the hyperplane, and b is a scalar, also known as the bias.

Problem: learn the (d + 1) coefficients corresponding to W and b from the training data so that the margin is maximized.

Support Vector Machines - Linearly separable case

- The points from either class have to lie on opposite sides of the hyperplane:

  W · X_i + b ≥ 0  for all i : y_i = +1
  W · X_i + b ≤ 0  for all i : y_i = −1

- By introducing the margin parameter, normalizing, and transforming, we get

  y_i (W · X_i + b) ≥ +1  for all i    (1)

- The goal is to maximize the distance between the two parallel hyperplanes,

  2 / ||W|| = 2 / sqrt(Σ_{i=1}^{d} w_i²)

- Instead of maximizing this term, we may equivalently minimize ||W||² / 2.
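As a concrete illustration, the following minimal sketch approximates the hard-margin case by fitting a linear SVM with a very large C and reading off W, b, and the margin width 2/||W||. It assumes scikit-learn is available; the toy data and the value C = 1e6 are arbitrary illustrative choices, not part of the lecture.

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable clusters with labels -1 and +1 (illustrative data).
rng = np.random.default_rng(0)
X_pos = rng.normal(loc=[2.0, 2.0], scale=0.3, size=(20, 2))
X_neg = rng.normal(loc=[-2.0, -2.0], scale=0.3, size=(20, 2))
X = np.vstack([X_pos, X_neg])
y = np.array([+1] * 20 + [-1] * 20)

# A very large C allows (almost) no margin violations, i.e. the hard-margin case.
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

W = clf.coef_[0]                   # normal vector of the separating hyperplane
b = clf.intercept_[0]              # bias term
margin = 2.0 / np.linalg.norm(W)   # distance between the two support hyperplanes

print("W =", W, "b =", b)
print("margin width 2/||W|| =", margin)
print("support vectors:\n", clf.support_vectors_)
```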

Support Vector Machines - Linearly separable case

- Minimization of ||W||² / 2 is a quadratic programming problem, because a quadratic objective is minimized subject to a set of linear constraints (see Eq. 1).
- Each data point leads to one constraint, which makes the SVM computationally complex.
- One possible method for solving such a problem is Lagrangian relaxation.
- It introduces a set of non-negative multipliers λ = (λ_1, ..., λ_n), one associated with each constraint.
- The constraints are then relaxed and the objective function is augmented by a Lagrangian penalty for constraint violation:

  L_P = ||W||² / 2 − Σ_{i=1}^{n} λ_i [ y_i (W · X_i + b) − 1 ]

Support Vector Machines - Linearly separable case

- L_P is converted into a pure maximization problem by eliminating the minimization part.
- The variables W and b are eliminated using gradient-based conditions, i.e. the gradients with respect to them are set to zero:

  ∂L_P/∂W = W − Σ_{i=1}^{n} λ_i y_i X_i = 0

- The expression for W then follows directly:

  W = Σ_{i=1}^{n} λ_i y_i X_i

- The same approach applied to the variable b yields

  Σ_{i=1}^{n} λ_i y_i = 0

Support Vector Machines - Linearly separable case

- The final Lagrangian dual is as follows:

  L_D = Σ_{i=1}^{n} λ_i − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} λ_i λ_j y_i y_j (X_i · X_j)

- The class label for a test instance Z, defined by the decision boundary, may be computed as

  F(Z) = sign(W · Z + b) = sign( Σ_{i=1}^{n} λ_i y_i (X_i · Z) + b )

- L_D is solved using gradient ascent with respect to the parameter vector λ, as sketched below.
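A minimal NumPy sketch of this gradient-ascent approach follows. It is an illustrative simplification: only the non-negativity constraints λ_i ≥ 0 are enforced by projection, the equality constraint Σ_i λ_i y_i = 0 is ignored for brevity, and b is recovered afterwards from the support vectors; it is not a production solver.

```python
import numpy as np

def svm_dual_gradient_ascent(X, y, lr=0.001, n_iter=5000):
    """Maximize L_D(lambda) = sum_i lambda_i
                              - 1/2 sum_ij lambda_i lambda_j y_i y_j (X_i . X_j)
    by projected gradient ascent (lambda_i >= 0 only; the equality constraint
    sum_i lambda_i y_i = 0 is not enforced in this simplified sketch)."""
    n = X.shape[0]
    # G_ij = y_i y_j (X_i . X_j)
    G = (y[:, None] * X) @ (y[:, None] * X).T
    lam = np.zeros(n)
    for _ in range(n_iter):
        grad = 1.0 - G @ lam                    # dL_D / dlambda_i
        lam = np.maximum(lam + lr * grad, 0.0)  # ascent step + projection onto lambda >= 0
    # Recover the primal solution: W = sum_i lambda_i y_i X_i
    W = (lam * y) @ X
    sv = lam > 1e-6                             # support vectors have lambda_i > 0
    b = np.mean(y[sv] - X[sv] @ W)              # from y_i (W . X_i + b) = 1 on support vectors
    return lam, W, b

def predict(Z, W, b):
    # F(Z) = sign(W . Z + b)
    return np.sign(Z @ W + b)
```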

Support Vector Machines - Soft margin for linearly non-separable data

- The margin is defined as soft, with penalization of margin-violating constraints.
- The definition of the Lagrangian is very similar to the Lagrangian for the separable case, with new constraints on the hyperplanes that introduce slack variables ξ_i:

  W · X_i + b ≥ +1 − ξ_i  for all i : y_i = +1
  W · X_i + b ≤ −1 + ξ_i  for all i : y_i = −1
  ξ_i ≥ 0  for all i

- The objective function O is then defined as

  O = ||W||² / 2 + C Σ_{i=1}^{n} ξ_i

- The parameter C affects the error allowed during training (smaller C, larger allowed training error).

Figure 2: Soft margin for non-separable data: margin violation with penalty-based slack variables.
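The effect of C can be seen with a short sketch using scikit-learn's SVC on overlapping classes; the data and the chosen C values are arbitrary assumptions used only for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Overlapping classes => some margin violations xi_i are unavoidable (illustrative data).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([1, 1], 1.0, (50, 2)),
               rng.normal([-1, -1], 1.0, (50, 2))])
y = np.array([+1] * 50 + [-1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Small C tolerates more violations (wider margin, more support vectors);
    # large C penalizes violations heavily (narrower margin, fewer support vectors).
    print(f"C={C:>6}: training accuracy={clf.score(X, y):.2f}, "
          f"#support vectors={len(clf.support_)}")
```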

Support Vector Machines - Non-linear decision boundary

- In real cases, the decision boundary is not linear.
- The points may be transformed into a higher-dimensional space to enable a linear decision boundary, as illustrated below.

Support Vector Machines - Non-linear decision boundary y 1 08 06 04 02 02 04 06 08 1 x 13

Support Vector Machines - Non-linear decision boundary y 1 08 06 04 02 02 04 06 08 1 x 13

Support Vector Machines - Non-linear decision boundary y 1 08 06 04 02 02 04 06 08 1 x 13

Support Vector Machines - Non-linear decision boundary y 1 08 06 04 02 02 04 06 08 1 x 13

Support Vector Machines - The Kernel Trick

- It leverages the important observation that the SVM formulation can be solved entirely in terms of dot products (or similarities) between pairs of data points.
- The feature values themselves are not important or needed.
- The key is to define the pairwise dot product (similarity function) directly in the d'-dimensional transformed representation Φ(X):

  K(X_i, X_j) = Φ(X_i) · Φ(X_j)

- Only the dot product is required, therefore there is no need to compute the transformed feature values Φ(X) explicitly.
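A small numeric check illustrates this: for the homogeneous degree-2 polynomial kernel K(x, z) = (x · z)² on 2-D inputs, the corresponding (assumed, standard) explicit map is Φ(x) = (x1², √2·x1·x2, x2²), and both sides agree without Φ ever being needed inside the SVM.

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for 2-D inputs:
    Phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2)."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def poly_kernel(x, z):
    """Homogeneous polynomial kernel of degree 2: K(x, z) = (x . z)^2."""
    return (x @ z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

# The kernel computes the dot product in the transformed space without
# ever forming Phi(x) explicitly.
print(poly_kernel(x, z))    # 1.0, since (1*3 + 2*(-1))^2 = 1
print(phi(x) @ phi(z))      # 1.0, the same value via the explicit map
```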

Support Vector Machines - The Kernel Trick

- The final Lagrangian dual with this substitution is defined as follows:

  L_D = Σ_{i=1}^{n} λ_i − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} λ_i λ_j y_i y_j K(X_i, X_j)

- The class label for a test instance Z, defined by the decision boundary, may be computed as follows (see the sketch below):

  F(Z) = sign( Σ_{i=1}^{n} λ_i y_i K(X_i, Z) + b )
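The kernelized decision function translates directly into code. The following is a minimal sketch, assuming the multipliers λ, the bias b, and the training data have already been obtained from a dual solver; the threshold 1e-6 for picking the support vectors is an arbitrary choice.

```python
import numpy as np

def kernel_decision(Z, X_train, y_train, lam, b, kernel):
    """F(Z) = sign( sum_i lambda_i * y_i * K(X_i, Z) + b ).
    Only training points with lambda_i > 0 (the support vectors) contribute,
    so the sum is restricted to them."""
    sv = lam > 1e-6
    score = sum(l * yi * kernel(xi, Z)
                for l, yi, xi in zip(lam[sv], y_train[sv], X_train[sv]))
    return np.sign(score + b)

# Example kernel to plug in (sigma chosen arbitrarily):
rbf = lambda x, z, sigma=1.0: np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))
# label = kernel_decision(Z, X_train, y_train, lam, b, rbf)
```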

Support Vector Machines - The Kernel Trick

- All computations are performed in the original space.
- The actual transformation Φ(·) does not need to be known as long as the kernel similarity function K(·, ·) is known.
- The kernel function has to be chosen carefully.
- The kernel function has to satisfy Mercer's theorem to be considered valid.
- This theorem ensures that the n × n kernel matrix (called the Gram matrix) S = [K(X_i, X_j)] is symmetric and positive semidefinite.
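These two properties can be checked numerically for any candidate kernel by building the Gram matrix on a sample and inspecting its eigenvalues; the RBF kernel and the sample data below are illustrative choices.

```python
import numpy as np

def rbf_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

# Build the n x n Gram matrix S_ij = K(X_i, X_j) for some sample points.
rng = np.random.default_rng(3)
X = rng.normal(size=(10, 3))
S = np.array([[rbf_kernel(xi, xj) for xj in X] for xi in X])

# A valid (Mercer) kernel yields a symmetric, positive semidefinite Gram matrix:
print("symmetric:", np.allclose(S, S.T))
print("smallest eigenvalue:", np.linalg.eigvalsh(S).min())  # >= 0 up to round-off
```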

Support Vector Machines - The Kernel Trick Function Linear kernel Form K(X i, X j ) = X i X j + c Polynomial kernel Gaussian Radial Basis Function (RBF) K(X i, X j ) = (αx i X j + c) h K(X i, X j ) = exp ( X i X j 2 2σ 2 ) Sigmoid kernel K(X i, X j ) = tanh(κx i X j δ) ( ) Exponential kernel K(X i, X j ) = exp X i X j 2σ ( 2 ) Laplacian kernel K(X i, X j ) = exp X i X j σ Rational Quadratic Kernel K(X i, X j ) = 1 X i X j 2 X i X j 2 +c 17

Other applications of the Kernel Methods

Kernel k-means

- The squared distance of a point X to the centroid µ of cluster C can be expressed using dot products only:

  ||X − µ||² = ||X − (Σ_{X_i ∈ C} X_i) / |C|||² = X · X − (2 / |C|) Σ_{X_i ∈ C} X · X_i + (1 / |C|²) Σ_{X_i, X_j ∈ C} X_i · X_j

- µ is the centroid of cluster C; each data point is assigned to the cluster with the minimal kernel-based distance (see the sketch below).

Kernel PCA

- Replacement of the dot products in the mean-centered data matrix.

Kernel Fisher Discriminant

Kernel Linear Discriminant Analysis
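A minimal sketch of the kernel k-means assignment step based on this identity, assuming the n × n Gram matrix K has been precomputed with some Mercer kernel and an initial partition into clusters is given; the full algorithm would alternate such assignment steps until the labels stop changing.

```python
import numpy as np

def kernel_kmeans_assign(K, clusters):
    """One assignment step of kernel k-means.
    K is the precomputed n x n kernel (Gram) matrix, clusters is a list of
    index arrays, one per cluster. Uses
    ||Phi(X) - mu_C||^2 = K(X,X) - (2/|C|) * sum_{i in C} K(X, X_i)
                          + (1/|C|^2) * sum_{i,j in C} K(X_i, X_j)
    and returns the index of the closest cluster for every point."""
    n = K.shape[0]
    dist = np.empty((n, len(clusters)))
    for c, idx in enumerate(clusters):
        # The last term is constant for a given cluster.
        intra = K[np.ix_(idx, idx)].sum() / len(idx) ** 2
        dist[:, c] = np.diag(K) - 2.0 * K[:, idx].mean(axis=1) + intra
    return dist.argmin(axis=1)
```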

Questions?