Machine Learning Lecture 9


Machine Learning Lecture 9: Nonlinear SVMs
Bastian Leibe, RWTH Aachen
http://www.vision.rwth-aachen.de/, leibe@vision.rwth-aachen.de
19.05.2013

Course Outline
- Fundamentals (2 weeks): Bayes Decision Theory, Probability Density Estimation
- Discriminative Approaches (5 weeks): Linear Discriminant Functions, Statistical Learning Theory & SVMs, Ensemble Methods & Boosting, Randomized Trees, Forests & Ferns
- Generative Models (4 weeks): Bayesian Networks, Markov Random Fields

Nonlinear Support Vector Machines
- Nonlinear basis functions
- The kernel trick
- Mercer's condition
- Popular kernels

Recap: Support Vector Machine (SVM)
Basic idea: the SVM tries to find a classifier which maximizes the margin between positive and negative data points. Up to now we considered linear classifiers w^T x + b = 0. Formulated as a convex optimization problem: find the hyperplane satisfying

  \arg\min_{w,b} \frac{1}{2}\|w\|^2

under the constraints t_n (w^T x_n + b) \geq 1 for all n, based on training data points x_n and target values t_n \in \{-1, 1\}.

Recap: SVM Primal Formulation
Lagrangian primal form:

  L_p = \frac{1}{2}\|w\|^2 - \sum_n a_n [t_n (w^T x_n + b) - 1] = \frac{1}{2}\|w\|^2 - \sum_n a_n [t_n y(x_n) - 1]

Recap: SVM Solution
The solution of L_p needs to fulfill the KKT conditions (necessary and sufficient):

  a_n \geq 0, \quad t_n y(x_n) - 1 \geq 0, \quad a_n [t_n y(x_n) - 1] = 0
  (general KKT form: \lambda \geq 0, \; f(x) \geq 0, \; \lambda f(x) = 0)

Solution for the hyperplane, computed as a linear combination of the training examples:

  w = \sum_n a_n t_n x_n

Sparse solution: a_n \neq 0 only for some points, the support vectors. Only the SVs actually influence the decision boundary! Compute b by averaging over all support vectors:

  b = \frac{1}{N_S} \sum_{n \in S} \Big( t_n - \sum_{m \in S} a_m t_m x_m^T x_n \Big)
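A minimal sketch of the solution formulas above, assuming scikit-learn and NumPy and an invented toy dataset: it fits a linear SVM and checks that the weight vector equals \sum_n a_n t_n x_n over the support vectors (scikit-learn's dual_coef_ stores a_n t_n).

```python
# Minimal sketch: recover w = sum_n a_n t_n x_n and b from a fitted linear SVM.
# Data and C are illustrative; scikit-learn and NumPy are assumed.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, t = make_blobs(n_samples=100, centers=2, random_state=0)
t = 2 * t - 1                                  # relabel targets as {-1, +1}

clf = SVC(kernel="linear", C=10.0).fit(X, t)

# dual_coef_ holds a_n * t_n for the support vectors only
w_from_duals = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w_from_duals, clf.coef_))    # True: w = sum_n a_n t_n x_n
print("number of support vectors:", clf.support_vectors_.shape[0])
print("b (intercept):", clf.intercept_)
```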

Recap: SVM Support Vectors
The training points for which a_n > 0 are called support vectors. Graphical interpretation: the support vectors are the points on the margin. They define the margin and thus the hyperplane; all other data points can be discarded! (Slide adapted from Bernt Schiele; image source: C. Burges, 1998)

Recap: SVM Dual Formulation
Maximize

  L_d(a) = \sum_n a_n - \frac{1}{2} \sum_n \sum_m a_n a_m t_n t_m (x_m^T x_n)

under the conditions a_n \geq 0 for all n and \sum_n a_n t_n = 0.
Comparison: L_d is equivalent to the primal form L_p, but only depends on the a_n. L_p scales with O(D^3); L_d scales with O(N^3), in practice between O(N) and O(N^2). (Slide adapted from Bernt Schiele)

Recap: SVM for Non-Separable Data
Slack variables: one slack variable \xi_n \geq 0 for each training data point. Interpretation: \xi_n = 0 for points that are on the correct side of the margin, and \xi_n = |t_n - y(x_n)| for all other points. A point on the decision boundary has \xi_n = 1; a misclassified point has \xi_n > 1. We do not have to set the slack variables ourselves! They are jointly optimized together with w.

Recap: SVM New Dual Formulation
New SVM dual: maximize

  L_d(a) = \sum_n a_n - \frac{1}{2} \sum_n \sum_m a_n a_m t_n t_m (x_m^T x_n)

under the conditions 0 \leq a_n \leq C and \sum_n a_n t_n = 0. This is all that changed! It is again a quadratic programming problem; solve as before. (Slide adapted from Bernt Schiele)

Interpretation of Support Vectors
Those are the hard examples! We can visualize them, e.g. for face detection. (Image source: E. Osuna, F. Girosi, 1997)

Nonlinear Support Vector Machines
- Nonlinear basis functions
- The kernel trick
- Mercer's condition
- Popular kernels
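The dual above is a small quadratic program and, for tiny problems, can be solved directly with a generic solver. A toy sketch, assuming SciPy and invented 2D data (practical SVM implementations use dedicated methods such as SMO instead):

```python
# Toy sketch: solve the soft-margin SVM dual
#   max_a  sum_n a_n - 1/2 sum_{n,m} a_n a_m t_n t_m x_m^T x_n
#   s.t.   0 <= a_n <= C,  sum_n a_n t_n = 0
# with a generic solver. Fine for ~40 points; real SVMs use SMO etc.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
t = np.hstack([-np.ones(20), np.ones(20)])
C = 10.0

K = X @ X.T                               # linear kernel Gram matrix
Q = (t[:, None] * t[None, :]) * K         # Q_nm = t_n t_m x_n^T x_m

def neg_dual(a):                          # minimize the negative dual
    return 0.5 * a @ Q @ a - a.sum()

res = minimize(neg_dual, np.zeros(len(t)), method="SLSQP",
               bounds=[(0, C)] * len(t),
               constraints=[{"type": "eq", "fun": lambda a: a @ t}])
a = res.x

w = (a * t) @ X                           # w = sum_n a_n t_n x_n
sv = a > 1e-6                             # support vectors (a_n > 0)
b = np.mean(t[sv] - X[sv] @ w)            # slide formula; for a_n = C use margin SVs only
print("support vectors:", sv.sum(), " b =", b)
```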

So Far
- Only looked at the linearly separable case. The current problem formulation has no solution if the data are not linearly separable! We need to introduce some tolerance to outlier data points: slack variables.
- Only looked at linear decision boundaries. This is not sufficient for many applications; we want to generalize the ideas to non-linear boundaries.

Nonlinear SVM
Linear SVMs: datasets that are linearly separable with some noise work well. But what are we going to do if the dataset is just too hard? How about mapping the data to a higher-dimensional space? (1D example figures omitted; image source: B. Schölkopf, A. Smola, 2002; slide credit: Raymond Mooney)

Another Example
Non-separable by a hyperplane in 2D, but separable by a surface in 3D. (Slide credit: Bill Freeman)

Nonlinear SVM Feature Spaces
General idea: the original input space can be mapped to some higher-dimensional feature space where the training set is separable. (Slide credit: Raymond Mooney)

Nonlinear SVM
General idea: apply a nonlinear transformation \phi to the data points x_n:

  x \in R^D, \quad \phi : R^D \to H

A hyperplane in the higher-dimensional space H (a linear classifier in H),

  w^T \phi(x) + b = 0,

is a nonlinear classifier in R^D. (Slide credit: Bernt Schiele)
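A small sketch of this mapping idea, assuming scikit-learn and two invented concentric-circle classes: the data is not linearly separable in R^2, but becomes separable after the degree-2 polynomial map \phi(x) = (x_1^2, \sqrt{2} x_1 x_2, x_2^2) discussed on the next slides.

```python
# Sketch: a dataset that is not linearly separable in 2D becomes separable
# after the polynomial feature map phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2).
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(1)
angles = rng.uniform(0, 2 * np.pi, 200)
radii = np.hstack([np.full(100, 1.0), np.full(100, 3.0)])   # two circles
X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
X += rng.normal(scale=0.1, size=X.shape)
t = np.hstack([-np.ones(100), np.ones(100)])

def phi(X):
    return np.column_stack([X[:, 0] ** 2,
                            np.sqrt(2) * X[:, 0] * X[:, 1],
                            X[:, 1] ** 2])

print("linear accuracy in R^2:",
      LinearSVC(C=1.0, max_iter=10000).fit(X, t).score(X, t))
print("linear accuracy in R^3:",
      LinearSVC(C=1.0, max_iter=10000).fit(phi(X), t).score(phi(X), t))
```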

What Could This Look Like?
Example: mapping to polynomial space, x, y \in R^2:

  \phi(x) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)^T

Motivation: it is easier to separate data in the higher-dimensional space. But wait: isn't there a big problem? How should we evaluate the decision function? (Image source: C. Burges, 1998)

Problem with High-dimensional Basis Functions
In order to apply the SVM, we need to evaluate the function

  y(x) = w^T \phi(x) + b

using the hyperplane, which is itself defined as

  w = \sum_n a_n t_n \phi(x_n)

What happens if we try this for a million-dimensional feature space \phi(x)? Oh-oh...

Solution: The Kernel Trick
Important observation: \phi(x) only appears in the form of dot products \phi(x)^T \phi(y):

  y(x) = w^T \phi(x) + b = \sum_n a_n t_n \phi(x_n)^T \phi(x) + b

Trick: define a so-called kernel function k(x, y) = \phi(x)^T \phi(y). Now, in place of the dot product, use the kernel instead:

  y(x) = \sum_n a_n t_n k(x_n, x) + b

The kernel function implicitly maps the data to the higher-dimensional space (without having to compute \phi(x) explicitly)!

Back to Our Previous Example
2nd-degree polynomial kernel:

  \phi(x)^T \phi(y) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)(y_1^2, \sqrt{2}\, y_1 y_2, y_2^2)^T
                    = x_1^2 y_1^2 + 2 x_1 x_2 y_1 y_2 + x_2^2 y_2^2 = (x^T y)^2 =: k(x, y)

Whenever we evaluate the kernel function k(x, y) = (x^T y)^2, we implicitly compute the dot product in the higher-dimensional feature space. (Image source: C. Burges, 1998)

SVMs with Kernels
Using kernels: applying the kernel trick is easy. Just replace every dot product by a kernel function,

  x^T y \to k(x, y)

and we're done. Instead of the raw input space, we're now working in a higher-dimensional (potentially infinite-dimensional!) space, where the data is more easily separable. "Sounds like magic..."

Which Functions are Valid Kernels?
The kernel needs to define an implicit mapping to a higher-dimensional feature space \phi(x). When is this the case?
Mercer's theorem (modernized version): every positive definite symmetric function is a kernel. Positive definite symmetric functions correspond to a positive definite symmetric Gram matrix

  K = \begin{pmatrix} k(x_1,x_1) & k(x_1,x_2) & \cdots & k(x_1,x_n) \\ k(x_2,x_1) & k(x_2,x_2) & \cdots & k(x_2,x_n) \\ \vdots & & \ddots & \vdots \\ k(x_n,x_1) & k(x_n,x_2) & \cdots & k(x_n,x_n) \end{pmatrix}

(positive definite = all eigenvalues are > 0). (Slide credit: Raymond Mooney)
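Both claims above, that (x^T y)^2 equals a dot product in the mapped space and that a valid kernel yields a Gram matrix with no negative eigenvalues, can be spot-checked numerically. A short sketch with randomly drawn points (NumPy assumed):

```python
# Sketch: check k(x,y) = (x^T y)^2 == phi(x)^T phi(y), and inspect the
# eigenvalues of a Gram matrix (Mercer: they should be non-negative).
import numpy as np

rng = np.random.default_rng(2)

def phi(v):                              # explicit map for the 2nd-degree kernel
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

x, y = rng.normal(size=2), rng.normal(size=2)
print(np.isclose((x @ y) ** 2, phi(x) @ phi(y)))      # True

X = rng.normal(size=(50, 2))
K = (X @ X.T) ** 2                       # Gram matrix of the polynomial kernel
eigvals = np.linalg.eigvalsh(K)
print(eigvals.min() >= -1e-9)            # no significantly negative eigenvalues
```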

Kernels Fulfilling Mercer's Condition
- Polynomial kernel: k(x, y) = (x^T y + 1)^p
- Radial Basis Function kernel, e.g. Gaussian: k(x, y) = \exp\{ -\frac{(x - y)^2}{2\sigma^2} \}
- Hyperbolic tangent kernel, e.g. sigmoid: k(x, y) = \tanh(\kappa\, x^T y + \delta)
(and many, many more...) Actually, this last one was wrong in the original SVM paper: the sigmoid kernel does not fulfill Mercer's condition for all parameter values. (Slide credit: Bernt Schiele)

Example: Bag of Visual Words Representation
General framework in visual recognition: create a codebook (vocabulary) of prototypical image features, represent images as histograms over codebook activations, and compare two images by any histogram kernel, e.g. the \chi^2 kernel. (Slide adapted from Christoph Lampert)

Nonlinear SVM Dual Formulation
SVM dual: maximize

  L_d(a) = \sum_n a_n - \frac{1}{2} \sum_n \sum_m a_n a_m t_n t_m k(x_m, x_n)

under the conditions 0 \leq a_n \leq C and \sum_n a_n t_n = 0. Classify new data points using

  y(x) = \sum_n a_n t_n k(x_n, x) + b

SVM Demo
Applet from libsvm (http://www.csie.ntu.edu.tw/~cjlin/libsvm/)

Summary: SVMs
Properties
- Empirically, SVMs work very, very well. SVMs are currently among the best performers for a number of classification tasks ranging from text to genomic data.
- SVMs can be applied to complex data types beyond feature vectors (e.g. graphs, sequences, relational data) by designing kernel functions for such data.
- SVM techniques have been applied to a variety of other tasks, e.g. SV regression, one-class SVMs, ...
- The kernel trick has been used for a wide variety of applications. It can be applied wherever dot products are in use, e.g. kernel PCA, kernel FLD, ...
- A good overview, software, and tutorials are available on http://www.kernel-machines.org/
Limitations
- How to select the right kernel? Best practice guidelines are available for many applications.
- How to select the kernel parameters? (Massive) cross-validation; usually several parameters are optimized together in a grid search.
- Solving the quadratic programming problem: standard QP solvers do not perform too well on SVM tasks; dedicated methods have been developed for this, e.g. SMO.
- Speed of evaluation: evaluating y(x) scales linearly in the number of SVs; too expensive if we have a large number of support vectors. There are techniques to reduce the effective SV set.
- Training for very large datasets (millions of data points): stochastic gradient descent and other approximations can be used.
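The classification rule y(x) = \sum_n a_n t_n k(x_n, x) + b from the dual can be reproduced by hand from a fitted kernel SVM. A sketch assuming scikit-learn (whose SVC wraps libsvm); the dataset and RBF parameters are illustrative:

```python
# Sketch: evaluate y(x) = sum_n a_n t_n k(x_n, x) + b by hand and compare
# with the library's decision function (SVC's dual_coef_ stores a_n * t_n).
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, t = make_moons(n_samples=200, noise=0.15, random_state=0)
t = 2 * t - 1

gamma = 0.5                                    # k(x,y) = exp(-gamma * ||x - y||^2)
clf = SVC(kernel="rbf", C=1.0, gamma=gamma).fit(X, t)

X_new = np.array([[0.0, 0.5], [2.0, -0.5]])
K = rbf_kernel(X_new, clf.support_vectors_, gamma=gamma)     # k(x, x_n) for all SVs
y_manual = K @ clf.dual_coef_.ravel() + clf.intercept_

print(np.allclose(y_manual, clf.decision_function(X_new)))   # True
print("predicted labels:", np.sign(y_manual))
```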

Nonlinear Support Vector Machines
- Nonlinear basis functions
- The kernel trick
- Mercer's condition
- Popular kernels

Recap: Kernels Fulfilling Mercer's Condition
- Polynomial kernel: k(x, y) = (x^T y + 1)^p
- Radial Basis Function kernel, e.g. Gaussian: k(x, y) = \exp\{ -\frac{(x - y)^2}{2\sigma^2} \}
- Hyperbolic tangent kernel, e.g. sigmoid: k(x, y) = \tanh(\kappa\, x^T y + \delta)
(and many, many more...) Actually, that last one was wrong in the original SVM paper... (Slide credit: Bernt Schiele)

VC Dimension for Polynomial Kernel
Polynomial kernel of degree p: k(x, y) = (x^T y)^p. Dimensionality of H:

  \dim(H) = \binom{D + p - 1}{p}

Example: D = 16 \times 16 = 256, p = 4 gives \dim(H) = 183{,}181{,}376. The hyperplane in H then has VC-dimension \dim(H) + 1.

VC Dimension for Gaussian RBF Kernel
Radial Basis Function: k(x, y) = \exp\{ -\frac{(x - y)^2}{2\sigma^2} \}. In this case, H is infinite-dimensional, since

  \exp(x) = 1 + \frac{x}{1!} + \frac{x^2}{2!} + \dots + \frac{x^n}{n!} + \dots

Since only the kernel function is used by the SVM, this is no problem. The hyperplane in H then has VC-dimension \dim(H) + 1 = \infty.

VC Dimension for Gaussian RBF Kernel (cont.)
Intuitively: if we make the radius of the RBF kernel sufficiently small, then each data point can be associated with its own kernel. However, this also means that we can get a finite VC-dimension if we set a lower limit to the RBF radius. (Image source: C. Burges)

Example: RBF Kernels
Decision boundary on a toy problem for different RBF kernel widths \sigma. (Image source: B. Schölkopf, A. Smola, 2002)
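The dimensionality count above is a binomial coefficient and can be reproduced in a few lines (Python's math.comb assumed):

```python
# Sketch: dimensionality of the feature space of the homogeneous polynomial
# kernel k(x,y) = (x^T y)^p for D-dimensional inputs: C(D + p - 1, p).
from math import comb

D = 16 * 16        # e.g. 16x16 pixel images -> D = 256
p = 4
print(comb(D + p - 1, p))    # 183181376, matching the slide
```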

But, but, but...
Don't we risk overfitting with those enormously high-dimensional feature spaces?
- No matter what the basis functions are, there are really only up to N parameters a_1, a_2, ..., a_N, and most of them are usually set to zero by the maximum-margin criterion. The data effectively lives in a low-dimensional subspace of H.
What about the VC dimension? I thought a low VC-dimension was good (in the sense of the risk bound)?
- Yes, but the maximum margin classifier "magically" solves this. Reason (Vapnik): by maximizing the margin, we can reduce the VC-dimension. Empirically, SVMs have very good generalization performance.

Theoretical Justification for Maximum Margins: Gap Tolerant Classifier
The classifier is defined by a ball in R^d with diameter D enclosing all points and two parallel hyperplanes with distance M (the margin). Points in the ball are assigned class {-1, 1} depending on which side of the margin they fall. The VC dimension of this classifier depends on the margin M:
- M \leq \sqrt{3/4}\, D: 3 points can be shattered
- \sqrt{3/4}\, D < M < D: 2 points can be shattered
- M \geq D: 1 point can be shattered
By maximizing the margin, we can minimize the VC dimension. (Image source: C. Burges)

Theoretical Justification for Maximum Margins (cont.)
For the general case, Vapnik has proven the following: the class of optimal linear separators has VC dimension h bounded from above as

  h \leq \min\left( \left\lceil \frac{D^2}{\rho^2} \right\rceil, m_0 \right) + 1

where \rho is the margin, D is the diameter of the smallest sphere that can enclose all of the training examples, and m_0 is the dimensionality. Intuitively, this implies that regardless of the dimensionality m_0 we can minimize the VC dimension by maximizing the margin \rho. Thus, the complexity of the classifier is kept small regardless of dimensionality. (Slide credit: Raymond Mooney)

Nonlinear Support Vector Machines
- Nonlinear basis functions
- The kernel trick
- Mercer's condition
- Popular kernels

SVM Analysis
Traditional soft-margin formulation:

  \min_{w \in R^D,\, \xi_n \in R^+} \; \frac{1}{2}\|w\|^2 + C \sum_n \xi_n

subject to the constraints t_n y(x_n) \geq 1 - \xi_n. A different way of looking at it: we can reformulate the constraints into the objective function,

  \min_{w \in R^D} \; \frac{1}{2}\|w\|^2 + C \sum_n [1 - t_n y(x_n)]_+

where [x]_+ := \max\{0, x\}. The first term is the L2 regularizer (maximize the margin); the second is the hinge loss (most points should be on the correct side of the margin). (Slide adapted from Christoph Lampert)

Recap: Error Functions
Ideal misclassification error function, plotted against z_n = t_n y(x_n): this is what we want to approximate. Unfortunately, it is not differentiable, and the gradient is zero for misclassified points, so we cannot minimize it by gradient descent. (Image source: Bishop, 2006)
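The unconstrained regularizer-plus-hinge-loss objective above is non-differentiable, but it can still be minimized with subgradient steps, as the next slides note. A toy sketch with invented data and step size, purely to illustrate the idea:

```python
# Sketch: plain subgradient descent on the unconstrained SVM objective
#   J(w, b) = 1/2 ||w||^2 + C * sum_n max(0, 1 - t_n (w^T x_n + b)).
# A toy illustration; practical solvers use SGD (e.g. Pegasos) or SMO.
import numpy as np

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(-1.5, 1, (50, 2)), rng.normal(1.5, 1, (50, 2))])
t = np.hstack([-np.ones(50), np.ones(50)])
C, lr = 1.0, 0.01

w, b = np.zeros(2), 0.0
for it in range(500):
    margins = t * (X @ w + b)
    viol = margins < 1                     # points inside the margin or misclassified
    # subgradient: w - C * sum_{violators} t_n x_n  (and -C * sum t_n for b)
    grad_w = w - C * (t[viol][:, None] * X[viol]).sum(axis=0)
    grad_b = -C * t[viol].sum()
    w -= lr * grad_w
    b -= lr * grad_b

print("training accuracy:", np.mean(np.sign(X @ w + b) == t))
```

A stochastic variant would update with the subgradient of a single randomly drawn point per step, which is what makes training on very large datasets feasible.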

Recap: Error Functions (Loss Functions)
Comparison of error functions plotted against z_n = t_n y(x_n): ideal misclassification error (not differentiable), squared error (sensitive to outliers, penalizes "too correct" data points), hinge error (robust to outliers, favors sparse solutions). (Image source: Bishop, 2006)
- Squared error, used in least-squares classification: very popular, leads to closed-form solutions. However, it is sensitive to outliers due to the squared penalty, and it penalizes "too correct" data points. Generally it does not lead to good classifiers.
- Hinge error, used in SVMs: zero error for points outside the margin (z_n > 1), which gives sparsity; linear penalty for misclassified points (z_n < 1), which gives robustness; not differentiable around z_n = 1, so it cannot be optimized directly.

SVM Discussion
SVM optimization function:

  \min_{w \in R^D} \; \frac{1}{2}\|w\|^2 + C \sum_n [1 - t_n y(x_n)]_+

(L2 regularizer plus hinge loss). The hinge loss enforces sparsity: only a subset of training data points actually influences the decision boundary. This is different from the sparsity obtained through the regularizer; there, only a subset of input dimensions is used. The result is an unconstrained optimization problem, but with a non-differentiable function. Solve, e.g., by subgradient descent; currently most efficient: stochastic gradient descent. (Slide adapted from Christoph Lampert)

Nonlinear Support Vector Machines
- Nonlinear basis functions
- The kernel trick
- Mercer's condition
- Popular kernels

Example Application: Text Classification
Problem: classify a document into a number of categories. Representation: bag-of-words approach, i.e. a histogram of word counts (on a learned dictionary). Very high-dimensional feature space (~10,000 dimensions), few irrelevant features. This was one of the first applications of SVMs: T. Joachims (1997).

Example Application: Text Classification
Results: (results figure omitted)
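A minimal sketch of the bag-of-words plus linear SVM setup, assuming scikit-learn; the tiny document set and labels are invented for illustration, whereas Joachims' experiments used real corpora with roughly 10,000-dimensional vocabularies:

```python
# Sketch of the bag-of-words + linear SVM text classifier described above.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

docs = ["cheap pills buy now", "limited offer buy cheap",
        "meeting agenda for monday", "lecture notes on kernels"]
labels = [1, 1, -1, -1]                      # 1 = spam-like, -1 = normal

vectorizer = CountVectorizer()               # histogram of word counts
X = vectorizer.fit_transform(docs)           # sparse document-term matrix

clf = LinearSVC(C=1.0).fit(X, labels)
test = vectorizer.transform(["buy cheap pills"])
print(clf.predict(test))                     # [1]
```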

Example Application: Text Classification
This is also how you could implement a simple spam filter: incoming email -> word activations (against a dictionary) -> SVM -> mailbox or trash.

Example Application: OCR
Handwritten digit recognition on the US Postal Service (USPS) database, a standard benchmark task for many learning algorithms.

Historical Importance
USPS benchmark: 2.5% error: human performance.
Different learning algorithms:
- 16.2% error: decision tree (C4.5)
- 5.9% error: (best) 2-layer neural network
- 5.1% error: LeNet 1, a (massively hand-tuned) 5-layer network
Different SVMs:
- 4.0% error: polynomial kernel (p = 3, 274 support vectors)
- 4.1% error: Gaussian kernel (\sigma = 0.3, 291 support vectors)

Example Application: OCR (Results)
Almost no overfitting with higher-degree kernels.

Example Application: Object Detection
Sliding-window approach with an object/non-object classifier.

Example Application: Pedestrian Detection
E.g. a histogram representation (HOG): map each grid cell in the input window to a histogram of gradient orientations; train a linear SVM using a training set of pedestrian vs. non-pedestrian windows. [Dalal & Triggs, CVPR 2005]
N. Dalal, B. Triggs, Histograms of Oriented Gradients for Human Detection, CVPR 2005.
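A sketch of the HOG-plus-linear-SVM pipeline in the spirit of Dalal & Triggs, assuming scikit-image and scikit-learn; the 64x128 windows below are random placeholders standing in for real pedestrian and background crops:

```python
# Sketch of the HOG + linear SVM detector pipeline. The "windows" here are
# random arrays standing in for 64x128 pedestrian / non-pedestrian crops;
# with real data you would extract them from annotated images.
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

rng = np.random.default_rng(5)
pos_windows = rng.random((20, 128, 64))    # placeholder "pedestrian" crops
neg_windows = rng.random((20, 128, 64))    # placeholder background crops

def hog_features(win):
    return hog(win, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), block_norm="L2-Hys")

X = np.array([hog_features(w) for w in np.vstack([pos_windows, neg_windows])])
t = np.hstack([np.ones(20), -np.ones(20)])

detector = LinearSVC(C=0.01, max_iter=10000).fit(X, t)
# At test time: slide a 64x128 window over the image (and over scales),
# compute its HOG descriptor, and classify it with `detector`.
print("feature dimension:", X.shape[1])
```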

Many Other Applications
Lots of other applications in all fields of technology:
- OCR
- Text classification
- Computer vision
- High-energy physics
- Monitoring of household appliances
- Protein secondary structure prediction
- Design of decision feedback equalizers (DFE) in telephony
Extensions: one-class SVMs, ... (Detailed references in Schölkopf & Smola, 2002.)

Summary: SVMs
Properties
- Empirically, SVMs work very, very well. SVMs are currently among the best performers for a number of classification tasks ranging from text to genomic data.
- SVMs can be applied to complex data types beyond feature vectors (e.g. graphs, sequences, relational data) by designing kernel functions for such data.
- SVM techniques have been applied to a variety of other tasks, e.g. SV regression, one-class SVMs, ...
- The kernel trick has been used for a wide variety of applications. It can be applied wherever dot products are in use, e.g. kernel PCA, kernel FLD, ...
- A good overview, software, and tutorials are available on http://www.kernel-machines.org/
Limitations
- How to select the right kernel? Requires domain knowledge and experiments.
- How to select the kernel parameters? (Massive) cross-validation; usually several parameters are optimized together in a grid search.
- Solving the quadratic programming problem: standard QP solvers do not perform too well on SVM tasks; dedicated methods have been developed for this, e.g. SMO.
- Speed of evaluation: evaluating y(x) scales linearly in the number of SVs; too expensive if we have a large number of support vectors. There are techniques to reduce the effective SV set.
- Training for very large datasets (millions of data points): stochastic gradient descent and other approximations can be used.

You Can Try It At Home
Lots of SVM software is available, e.g.:
- svmlight (http://svmlight.joachims.org/): command-line based interface; source code available (in C); interfaces to Python, MATLAB, Perl, Java, DLL, ...
- libsvm (http://www.csie.ntu.edu.tw/~cjlin/libsvm/): library for inclusion with your own code; C++ and Java sources; interfaces to Python, R, MATLAB, Perl, Ruby, Weka, C#.NET, ...
Both include fast training and evaluation algorithms, support for multi-class SVMs, automated training and cross-validation, ... Easy to apply to your own problems!

References and Further Reading
More information on SVMs can be found in Chapter 7.1 of Bishop's book. You can also look at Schölkopf & Smola (some chapters available online).
- Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
- B. Schölkopf, A. Smola, Learning with Kernels, MIT Press, 2002. http://www.learning-with-kernels.org/
A more in-depth introduction to SVMs is available in the following tutorial:
- C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, Vol. 2(2), pp. 121-167, 1998.
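A minimal sketch of the cross-validated grid search recommended in the summary, assuming scikit-learn (whose SVC wraps libsvm); the digits dataset stands in for USPS and the parameter grid is illustrative:

```python
# Sketch: grid search with cross-validation over C and gamma for an RBF SVM.
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, t = load_digits(return_X_y=True)          # small stand-in for USPS
X_tr, X_te, t_tr, t_te = train_test_split(X, t, test_size=0.3, random_state=0)

param_grid = {"C": [0.1, 1, 10, 100], "gamma": [1e-4, 1e-3, 1e-2]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5).fit(X_tr, t_tr)

print("best parameters:", search.best_params_)
print("test accuracy:", search.best_estimator_.score(X_te, t_te))
```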