Support Vector Machines


Overview: RBF networks; support vector machines; good decision boundaries; the optimization problem; the soft-margin hyperplane; non-linear decision boundaries; the kernel trick; approximation accuracy; overtraining.


Gaussian Response Function

Each hidden-layer unit computes

    h_i = exp(−D_i² / (2σ²))

where x is an input vector, u_i is the weight (center) vector of hidden neuron i, and

    D_i² = (x − u_i)ᵀ(x − u_i)

Location of the centers u_i: the location of each receptive field is critical. Apply clustering to the training set; each resulting cluster center becomes the center u_i of the receptive field of a hidden neuron.
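As a minimal sketch (assuming NumPy; the function name is illustrative), the Gaussian response of the hidden units can be computed as:

```python
import numpy as np

def rbf_activations(x, centers, sigma):
    """Gaussian response h_i = exp(-D_i^2 / (2 sigma^2)) of each hidden unit."""
    d2 = np.sum((centers - x) ** 2, axis=1)   # D_i^2 = (x - u_i)^T (x - u_i)
    return np.exp(-d2 / (2 * sigma ** 2))

# A unit responds maximally (h_i = 1) when the input lies exactly on its center:
h = rbf_activations(np.array([1.0, 2.0]),
                    centers=np.array([[1.0, 2.0], [4.0, 0.0]]),
                    sigma=1.0)
```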

Determining σ

The following heuristic performs well in practice: for each hidden-layer neuron i, find the RMS distance between its center u_i and the centers c_k of its N nearest neighbors, and assign this value to σ_i:

    σ_i = sqrt( (1/N) Σ_{k=1}^{N} ||u_i − c_k||² )

The output neuron produces the linear weighted sum

    o = Σ_{i=0}^{n} w_i h_i

The weights are adapted by the LMS rule:

    Δw_i = η (t − o) h_i
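The σ heuristic and the LMS update above can be sketched as follows; `sigma_heuristic` and `lms_step` are illustrative names, not a fixed API:

```python
import numpy as np

def sigma_heuristic(centers, n_neighbors=2):
    """sigma_i = RMS distance from center u_i to the centers of its N nearest neighbors."""
    sigmas = []
    for i, u in enumerate(centers):
        d2 = np.sum((centers - u) ** 2, axis=1)
        d2[i] = np.inf                       # exclude the center itself
        nearest = np.sort(d2)[:n_neighbors]  # squared distances to N nearest neighbors
        sigmas.append(np.sqrt(nearest.mean()))
    return np.array(sigmas)

def lms_step(w, h, t, eta=0.1):
    """One LMS update of the output weights: w_i += eta * (t - o) * h_i."""
    o = w @ h
    return w + eta * (t - o) * h
```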

Why does an RBF network work? The hidden layer applies a nonlinear transformation φ(·) from the input space to the hidden space; in the hidden space, a linear discrimination can be performed.

Support Vector Machines

An SVM is a linear machine: it constructs a hyperplane as the decision surface in such a way that the margin of separation between positive and negative examples is maximized, which gives good generalization performance. The support vector learning algorithm may construct the following three learning machines:
Polynomial learning machines
Radial-basis function networks
Two-layer perceptrons

Extension to Non-linear Decision Boundaries

Key idea: transform x_i to a higher-dimensional space to make the problem easier.
Input space: the space the x_i live in.
Feature space: the space of the φ(x_i) after the transformation.
Why transform? A linear operation in the feature space is equivalent to a non-linear operation in the input space, and the classification task can become easier with a proper transformation (example: XOR).

Possible problems with the transformation: a high computational burden, and it can be hard to get a good estimate. SVM solves these two issues simultaneously: the kernel trick gives efficient computation, and minimizing ||w||² leads to a good classifier.
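The XOR example can be made concrete with a small sketch; the map φ(x) = (x1, x2, x1·x2) used here is one possible choice of transformation, not the only one:

```python
import numpy as np

# XOR on {-1, +1}^2 is not linearly separable in the input space,
# but after the map phi(x) = (x1, x2, x1*x2) it is:
X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]])
y = np.array([-1, 1, 1, -1])                     # XOR labels

phi = np.column_stack([X, X[:, 0] * X[:, 1]])    # append the product feature
w = np.array([0.0, 0.0, -1.0])                   # hyperplane w . phi(x) = 0
pred = np.sign(phi @ w)                          # the product coordinate separates the classes
```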

Example Transformation

Define the kernel function K(x, y) as K(x, y) = (1 + xᵀy)², and consider the corresponding transformation φ(·) into the space of monomials of degree up to two. The inner product φ(x)ᵀφ(y) can be computed by K without going through the map φ(·) explicitly. This transformation solves the XOR problem, which a linear machine in the input space does not.
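The claim that K computes the feature-space inner product can be checked numerically. For simplicity this sketch uses the homogeneous degree-2 kernel K(x, y) = (xᵀy)², whose explicit map in two dimensions is φ(x) = (x1², √2·x1·x2, x2²):

```python
import numpy as np

def K(x, y):
    """Homogeneous degree-2 polynomial kernel."""
    return (x @ y) ** 2

def phi(x):
    """Explicit feature map for K: phi(x) = (x1^2, sqrt(2) x1 x2, x2^2)."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
y = np.array([3.0, 4.0])
# K computes the inner product in feature space without forming phi:
assert np.isclose(K(x, y), phi(x) @ phi(y))
```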

Kernel Trick

The relationship between the kernel function K and the mapping φ(·) is

    K(x, y) = φ(x)ᵀφ(y)

This is known as the kernel trick. In practice, we specify K, thereby specifying φ(·) indirectly, instead of choosing φ(·) directly. Intuitively, K(x, y) represents our desired notion of similarity between data points x and y, and this comes from our prior knowledge. K(x, y) needs to satisfy a technical condition (Mercer's condition) in order for φ(·) to exist. The use of kernel functions can turn any algorithm that depends only on dot products into a nonlinear algorithm.

Examples of Kernel Functions

Polynomial kernel with degree d (specified by the user): K(x, y) = (xᵀy + 1)^d
Radial basis function kernel with width σ: K(x, y) = exp(−||x − y||² / (2σ²)); closely related to radial-basis-function neural networks.
Sigmoid kernel with parameters κ and θ: K(x, y) = tanh(κ xᵀy + θ); it does not satisfy the Mercer condition for all κ and θ.

Research on different kernel functions for different applications is very active. Many kernel mapping functions can be used, but few have been found to work well for a wide variety of applications; the most commonly recommended kernel function is the radial basis function (RBF).
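The three kernels can be written out directly (standard forms; the parameter names are illustrative):

```python
import numpy as np

def poly_kernel(x, y, d=2):
    """Polynomial kernel of degree d: (x.y + 1)^d."""
    return (x @ y + 1) ** d

def rbf_kernel(x, y, sigma=1.0):
    """RBF kernel of width sigma: exp(-||x - y||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def sigmoid_kernel(x, y, kappa=1.0, theta=0.0):
    """Sigmoid kernel: tanh(kappa * x.y + theta)."""
    return np.tanh(kappa * (x @ y) + theta)

x = np.array([0.0, 1.0]); y = np.array([0.0, 1.0])
# An RBF kernel of any width assigns similarity 1 to identical points:
assert rbf_kernel(x, y) == 1.0
```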

Multi-class Classification

SVM is fundamentally a two-class classifier. One can change the QP formulation to allow multi-class classification. More commonly, the data set is divided into two parts intelligently in different ways, and a separate SVM is trained for each way of dividing it; multi-class classification is then done by combining the outputs of all the SVM classifiers, e.g. by
Majority rule
Error-correcting codes
A directed acyclic graph
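A minimal sketch of the majority rule over pairwise classifiers, with toy stand-ins for trained SVMs (`ovo_predict` and the classifier mapping are illustrative, not a fixed API):

```python
from collections import Counter
from itertools import combinations

def ovo_predict(x, binary_classifiers, classes):
    """One-vs-one majority rule: each pairwise classifier votes for one class.

    binary_classifiers maps a class pair (a, b) to a function f(x)
    that returns either a or b.
    """
    votes = Counter(binary_classifiers[pair](x) for pair in combinations(classes, 2))
    return votes.most_common(1)[0][0]

# Toy stand-ins for trained pairwise SVMs on classes 0, 1, 2:
clfs = {(0, 1): lambda x: 0, (0, 2): lambda x: 2, (1, 2): lambda x: 2}
winner = ovo_predict(None, clfs, [0, 1, 2])   # class 2 wins 2 of 3 votes
```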

Two-Class Problem: the Linearly Separable Case

Many decision boundaries can separate these two classes (Class 1 and Class 2). Which one should we choose? Examples of bad decision boundaries: boundaries that pass very close to the points of one class.

Good Decision Boundary: the Margin Should Be Large

The decision boundary should be as far away from the data of both classes as possible; that is, we should maximize the margin m. For two support vectors x_1 and x_2 on opposite sides of the boundary, projecting their difference onto the normal direction gives the margin:

    (w/||w||)ᵀ(x_1 − x_2) = 2/||w||

Let g(x) = wᵀx + b. Any point x can be written as x = x_p + r·w/||w||, where x_p is the normal projection of x onto the optimal hyperplane (so g(x_p) = wᵀx_p + b = 0) and r is the signed distance to the hyperplane. Then

    g(x) = wᵀx + b = wᵀx_p + b + r·(wᵀw)/||w|| = r·||w||

For the support vectors, g(x) = wᵀx + b = ±1 for class labels d = ±1, so

    r = g(x)/||w|| = 1/||w|| if d = 1, and −1/||w|| if d = −1

and the margin is m = 2|r| = 2/||w||.

The Optimization Problem

Let {x_1, ..., x_n} be our data set and let y_i ∈ {1, −1} be the class label of x_i. The decision boundary should classify all points correctly: y_i (wᵀx_i + b) ≥ 1 for all i. Maximizing the margin 2/||w|| subject to these constraints is a constrained optimization problem.
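A small numerical check of the constraints and the margin 2/||w||, under an assumed toy data set:

```python
import numpy as np

def check_constraints_and_margin(w, b, X, y):
    """Check y_i (w.x_i + b) >= 1 for all i, and report the margin 2/||w||."""
    ok = bool(np.all(y * (X @ w + b) >= 1 - 1e-12))
    return ok, 2 / np.linalg.norm(w)

# 1-D toy set separated by the hyperplane x = 0, with w = 1, b = 0:
X = np.array([[-2.0], [-1.0], [1.0], [3.0]])
y = np.array([-1, -1, 1, 1])
ok, margin = check_constraints_and_margin(np.array([1.0]), 0.0, X, y)
```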

The Optimization Problem

Introduce Lagrange multipliers α_i ≥ 0 and form the Lagrange function

    L(w, b, α) = (1/2)||w||² − Σ_{i=1}^{N} α_i ( y_i [wᵀx_i + b] − 1 )

which is minimized with respect to w and b.

We can transform the problem to its dual: maximize

    Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j xᵢᵀx_j

subject to α_i ≥ 0 and Σ_i α_i y_i = 0. This is a quadratic programming (QP) problem, so the global maximum over the α_i can always be found. w can be recovered as w = Σ_i α_i y_i x_i.
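Recovering w from the multipliers can be sketched as follows; the α values here are made up for illustration, not the output of a real QP solve:

```python
import numpy as np

# w = sum_i alpha_i y_i x_i, with alpha_i = 0 for non-support vectors:
X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0]])
y = np.array([1, 1, -1])
alpha = np.array([0.5, 0.0, 0.5])   # illustrative values; only support vectors get alpha > 0
w = (alpha * y) @ X
```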

A Geometrical Interpretation

[Figure: the two classes and the separating hyperplane. Only the points on the margin have nonzero multipliers (here α_1 = 0.8, α_6 = 1.4, α_8 = 0.6); these are the support vectors. All other points have α_i = 0.]

What About Data That Are Not Linearly Separable?

We allow an error ξ_i in the classification of x_i.

Soft-Margin Hyperplane

Define ξ_i = 0 if there is no error for x_i; the ξ_i are just slack variables from optimization theory. We want to minimize

    (1/2)||w||² + C Σ_i ξ_i

where C is a tradeoff parameter between error and margin. The optimization problem becomes: minimize this objective subject to y_i (wᵀx_i + b) ≥ 1 − ξ_i and ξ_i ≥ 0.

The Optimization Problem

The dual of the problem is the same as in the linearly separable case; the only difference is that there is now an upper bound C on the α_i (0 ≤ α_i ≤ C). w is again recovered as w = Σ_i α_i y_i x_i, and once again a QP solver can be used to find the α_i.
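The soft-margin objective can be evaluated directly once the slacks are written as ξ_i = max(0, 1 − y_i(wᵀx_i + b)); a toy sketch with illustrative names:

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C):
    """(1/2)||w||^2 + C * sum_i xi_i, with xi_i = max(0, 1 - y_i (w.x_i + b))."""
    xi = np.maximum(0.0, 1 - y * (X @ w + b))
    return 0.5 * (w @ w) + C * xi.sum()

X = np.array([[2.0], [0.5], [-2.0]])
y = np.array([1, 1, -1])
# The middle point violates the margin (y*(w.x + b) = 0.5 < 1), so xi = 0.5:
obj = soft_margin_objective(np.array([1.0]), 0.0, X, y, C=10.0)
```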

Conclusion

SVM is a useful alternative to neural networks. The two key concepts of SVM are maximizing the margin and the kernel trick. Much active research is taking place in areas related to SVM, and many SVM implementations are available on the web for you to try on your own data sets.

In the radial-basis-function type of support vector machine, the number of radial basis functions and their centers are determined by the number of support vectors and their values. In the two-layer-perceptron type of support vector machine, the number of hidden neurons and their weight vectors are determined by the number of support vectors and their values.

Measuring Approximation Accuracy

Compare the network's output with the correct values. The mean squared error F(w) of the network on the data D = {(x_1, t_1), (x_2, t_2), ..., (x_m, t_m)} is

    F(w) = (1/m) Σ_{d=1}^{m} ||t_d − o_d||²
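A direct transcription of F(w), assuming vector-valued targets t_d and outputs o_d:

```python
import numpy as np

def mse(targets, outputs):
    """F(w) = (1/m) * sum_d ||t_d - o_d||^2 over the m training examples."""
    t = np.asarray(targets, dtype=float)
    o = np.asarray(outputs, dtype=float)
    return np.mean(np.sum((t - o) ** 2, axis=-1))
```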


Bibliography

Simon Haykin, Neural Networks: A Comprehensive Foundation, Second Edition, Prentice Hall, 1999.