Machine Learning Lecture 9

Size: px

Start display at page:

Download "Machine Learning Lecture 9"

Easter Henderson
6 years ago
Views:

1 Course Outline Machine Learning Lecture 9 Fundamentals ( weeks) Bayes Decision Theory Probability Density Estimation Nonlinear SVMs Discriminative Approaches (5 weeks) Linear Discriminant Functions Statistical Learning Theory & SVMs Ensemble Methods & Boosting Randomized Trees, Forests & Ferns Bastian Leibe RWTH Aachen Generative Models (4 weeks) Bayesian Networks Markov Random Fields leibe@vision.rwth-aachen.de 4 Recap: Support Vector Machine (SVM) Nonlinear Support Vector Machines Nonlinear basis functions The Kernel trick Mercer s condition Popular kernels 5 Basic idea The SVM tries to find a classifier which maximizes the margin between pos. and neg. data points. Up to now: consider linear classifiers w T x + b = 0 Formulation as a convex optimization problem Find the hyperplane satisfying 1 arg min w;b kwk under the constraints t n (w T x n + b) 1 8n based on training data points x n and target values t n f 1; 1g. Margin 6 Recap: SVM Primal Formulation Recap: SVM Solution Lagrangian primal form L p = 1 kwk a n tn (w T x n + b) 1 ª = 1 kwk a n ft n y(x n ) 1g The solution of L p needs to fulfill the KKT conditions Necessary and sufficient conditions a n 0 KKT: 0 t n y(x n ) 1 0 f(x) 0 a n ft n y(x n ) 1g = 0 f(x) = 0 Solution for the hyperplane Computed as a linear combination of the training examples w = a n t n x n Sparse solution: a n 0 only for some points, the support vectors Only the SVs actually influence the decision boundary! Compute b by averaging over all support vectors: Ã b = 1 X t n X! a m t m x T N mx n S ns ms 7 8 1

Recap: SVM Support Vectors The training points for which a n > 0 are called support vectors. Graphical interpretation: The support vectors are the points on the margin.

Burges, 1998 Recap: SVM Dual Formulation Maximize L d (a) = a n 1 a n a m t n t m (x T m x n) m=1 under the conditions a n 0 8n a n t n = 0 Comparison L d is equivalent to the primal form L p, but

2 Recap: SVM Support Vectors The training points for which a n > 0 are called support vectors. Graphical interpretation: The support vectors are the points on the margin. They define the margin and thus the hyperplane. All other data points can be discarded! 9 Slide adapted from Bernt Schiele Image source: C. Burges, 1998 Recap: SVM Dual Formulation Maximize L d (a) = a n 1 a n a m t n t m (x T m x n) m=1 under the conditions a n 0 8n a n t n = 0 Comparison L d is equivalent to the primal form L p, but only depends on a n. L p scales with O(D 3 ). L d scales with O(N 3 ) in practice between O(N) and O(N ). 10 Slide adapted from Bernt Schiele So Far SVM Non-Separable Data Only looked at linearly separable case Current problem formulation has no solution if the data are not linearly separable! Need to introduce some tolerance to outlier data points. Margin? Non-separable data I.e. the following inequalities cannot be satisfied for all data points w T x n + b +1 for t n = +1 w T x n + b 1 for t n = 1 Instead use w T x n + b +1» n for t n = +1 w T x n + b 1 +» n for t n = 1 with slack variables» n 0 8n SVM Soft-Margin Classification SVM Non-Separable Data Slack variables One slack variable» n 0 for each training data point. Interpretation» n = 0 for points that are on the correct side of the margin.» n = t n y(x n ) for all other points (linear penalty).»» 1 w Point on decision boundary:» n = 1 Separable data 1 Minimize kwk Trade-off parameter! Non-separable data 1 Minimize kwk +C» n» 4» 3 Misclassified point:» n > 1 We do not have to set the slack variables ourselves! They are jointly optimized together with w. How that? 15 16

SVM New Primal Formulation SVM New Dual Formulation New SVM Primal: Optimize L p = 1 kwk +C» n a n (t n y(x n ) 1 +» n ) ¹ n» n New SVM Dual: Maximize L d (a) = a n 1 a n a m t n t m (x T m x n) m=1

3 SVM New Primal Formulation SVM New Dual Formulation New SVM Primal: Optimize L p = 1 kwk +C» n a n (t n y(x n ) 1 +» n ) ¹ n» n New SVM Dual: Maximize L d (a) = a n 1 a n a m t n t m (x T m x n) m=1 KKT conditions a n 0 t n y(x n ) 1 +» n 0 a n (t n y(x n ) 1 +» n ) = 0 Constraint t n y(x n ) 1» n ¹ n 0» n 0 ¹ n» n = 0 Constraint» n 0 KKT: 0 f(x) 0 f(x) = 0 under the conditions 0 a n C a n t n = 0 This is all that changed! This is again a quadratic programming problem Solve as before (more on that later) 17 Slide adapted from Bernt Schiele 18 SVM New Solution Interpretation of Support Vectors Solution for the hyperplane Computed as a linear combination of the training examples w = a n t n x n Those are the hard examples! We can visualize them, e.g. for face detection Again sparse solution: a n = 0 for points outside the margin. The slack points with» n > 0 are now also support vectors! Compute b by averaging over all N M points with 0 < a n < C: Ã b = 1 X t n X! a m t m x T N mx n M nm mm 19 0 Image source: E. Osuna, F. Girosi, 1997 So Far Nonlinear Support Vector Machines Nonlinear basis functions The Kernel trick Mercer s condition Popular kernels Only looked at linearly separable case Current problem formulation has no solution if the data are not linearly separable! Need to introduce some tolerance to outlier data points. Slack variables. Only looked at linear decision boundaries This is not sufficient for many applications. Want to generalize the ideas to non-linear boundaries.» 4» 3»» 1 w 1 5 Image source: B. Schoelkopf, A. Smola, 00 3

4 Nonlinear SVM Linear SVMs Datasets that are linearly separable with some noise work well: Another Example Non-separable by a hyperplane in D 0 x But what are we going to do if the dataset is just too hard? 0 x How about mapping data to a higher-dimensional space: x Slide credit: Raymond Mooney 0 x 6 Slide credit: Bill Freeman 7 Another Example Nonlinear SVM Feature Spaces Separable by a surface in 3D General idea: The original input space can be mapped to some higher-dimensional feature space where the training set is separable: : x Á(x) Slide credit: Bill Freeman 8 Slide credit: Raymond Mooney 9 Nonlinear SVM What Could This Look Like? General idea Nonlinear transformation Á of the data points x n : x R D Á : R D! H Example: Mapping to polynomial space, x, y R : Hyperplane in higher-dim. space H (linear classifier in H) w T Á(x) + b = 0 3 p x 1 Á(x) = 4 x1 x 5 Nonlinear classifier in R D. x Motivation: Easier to separate data in higher-dimensional space. But wait isn t there a big problem? How should we evaluate the decision function? Slide credit: Bernt Schiele Image source: C. Burges,

5 Problem with High-dim. Basis Functions Solution: The Kernel Trick Problem In order to apply the SVM, we need to evaluate the function Using the hyperplane, which is itself defined as What happens if we try this for a million-dimensional feature space Á(x)? Oh-oh y(x) = w T Á(x) + b w = a n t n Á(x n ) Important observation Á(x) only appears in the form of dot products Á(x) T Á(y): y(x) = w T Á(x) + b = a n t n Á(x n ) T Á(x) + b Trick: Define a so-called kernel function k(x,y) = Á(x) T Á(y). Now, in place of the dot product, use the kernel instead: y(x) = a n t n k(x n ; x) + b The kernel function implicitly maps the data to the higherdimensional space (without having to compute Á(x) explicitly)! 3 33 Back to Our Previous Example SVMs with Kernels nd degree polynomial kernel: 3 3 x Á(x) T p 1 p y1 Á(y) = 4 x1 x 5 4 y1 y 5 x y = x 1y 1 + x 1 x y 1 y + x y = (x T y) =: k(x; y) Whenever we evaluate the kernel function k(x,y) = (x T y), we implicitly compute the dot product in the higher-dimensional feature space. 34 Image source: C. Burges, 1998 Using kernels Applying the kernel trick is easy. Just replace every dot product by a kernel function x T y! k(x; y) and we re done. Instead of the raw input space, we re now working in a higherdimensional (potentially infinite dimensional!) space, where the data is more easily separable. Wait does this always work? The kernel needs to define an implicit mapping to a higher-dimensional feature space Á(x). When is this the case? Sounds like magic 35 Which Functions are Valid Kernels? Kernels Fulfilling Mercer s Condition Mercer s theorem (modernized version): Every positive definite symmetric function is a kernel. Polynomial kernel k(x; y) = (x T y + 1) p Positive definite symmetric functions correspond to a positive definite symmetric Gram matrix: K = Slide credit: Raymond Mooney k(x 1,x 1 ) k(x 1,x ) k(x 1,x 3 ) k(x 1,x n ) k(x,x 1 ) k(x,x ) k(x,x 3 ) k(x,x n ) k(x n,x 1 ) k(x n,x ) k(x n,x 3 ) k(x n,x n ) (positive definite = all eigenvalues are > 0) 36 Radial Basis Function kernel ¾ (x y) k(x; y) = exp ½ ¾ Hyperbolic tangent kernel (and many, many more ) Slide credit: Bernt Schiele k(x; y) = tanh( x T y + ±) e.g. Gaussian e.g. Sigmoid Actually, this was wrong in the original SVM paper

Example: Bag of Visual Words Representation Nonlinear SVM Dual Formulation General framework in visual recognition Create a codebook (vocabulary) of prototypical image features Represent images as

ams over codebook activations Compare two image

6 Example: Bag of Visual Words Representation Nonlinear SVM Dual Formulation General framework in visual recognition Create a codebook (vocabulary) of prototypical image features Represent images as histograms over codebook activations Compare two images by any histogram kernel, e.g. Â kernel SVM Dual: Maximize under the conditions 0 a n C a n t n = 0 Classify new data points using Slide adapted from Christoph Lampert SVM Demo Summary: SVMs Applet from libsvm ( 40 Properties Empirically, SVMs work very, very well. SVMs are currently among the best performers for a number of classification tasks ranging from text to genomic data. SVMs can be applied to complex data types beyond feature vectors (e.g. graphs, sequences, relational data) by designing kernel functions for such data. SVM techniques have been applied to a variety of other tasks e.g. SV Regression, One-class SVMs, The kernel trick has been used for a wide variety of applications. It can be applied wherever dot products are in use e.g. Kernel PCA, kernel FLD, Good overview, software, and tutorials available on 41 Summary: SVMs Limitations How to select the right kernel? Best practice guidelines are available for many applications How to select the kernel parameters? (Massive) cross-validation. Usually, several parameters are optimized together in a grid search. Solving the quadratic programming problem Standard QP solvers do not perform too well on SVM task. Dedicated methods have been developed for this, e.g. SMO. Speed of evaluation Evaluating y(x) scales linearly in the number of SVs. Too expensive if we have a large number of support vectors. There are techniques to reduce the effective SV set. Training for very large datasets (millions of data points) Stochastic gradient descent and other approximations can be used 4 Nonlinear Support Vector Machines Nonlinear basis functions The Kernel trick Mercer s condition Popular kernels 43 6

Recap: Kernels Fulfilling Mercer s Condition VC Dimension for Polynomial Kernel Polynomial kernel k(x; y) = (x T y + 1) p Radial Basis Function kernel ¾ (x y) k(x; y) = exp ½ ¾ Hyperbolic tangent

7 Recap: Kernels Fulfilling Mercer s Condition VC Dimension for Polynomial Kernel Polynomial kernel k(x; y) = (x T y + 1) p Radial Basis Function kernel ¾ (x y) k(x; y) = exp ½ ¾ Hyperbolic tangent kernel k(x; y) = tanh( x T y + ±) e.g. Gaussian e.g. Sigmoid Polynomial kernel of degree p: Dimensionality of H: Example: k(x; y) = (x T y) p µ D + p 1 p D = = 56 p = 4 dim(h) = 183:181:376 (and many, many more ) Actually, that was wrong in the original SVM paper... The hyperplane in H then has VC-dimension dim(h) + 1 Slide credit: Bernt Schiele VC Dimension for Gaussian RBF Kernel VC Dimension for Gaussian RBF Kernel Radial Basis Function: ¾ (x y) k(x; y) = exp ½ ¾ Intuitively If we make the radius of the RBF kernel sufficiently small, then each data point can be associated with its own kernel. In this case, H is infinite dimensional! exp(x) = 1 + x 1! + x! + : : : + xn n! + : : : Since only the kernel function is used by the SVM, this is no problem. The hyperplane in H then has VC-dimension dim(h) + 1 = 1 However, this also means that we can get finite VC-dimension if we set a lower limit to the RBF radius Image source: C. Burges Example: RBF Kernels But but but Decision boundary on toy problem Don t we risk overfitting with those enormously highdimensional feature spaces? No matter what the basis functions are, there are really only up to N parameters: a 1, a,, a N and most of them are usually set to zero by the maximum margin criterion. The data effectively lives in a low-dimensional subspace of H. RBF Kernel width (¾) What about the VC dimension? I thought low VC-dim was good (in the sense of the risk bound)? Yes, but the maximum margin classifier magically solves this. Reason (Vapnik): by maximizing the margin, we can reduce the VC-dimension. Empirically, SVMs have very good generalization performance. 48 Image source: B. Schoelkopf, A. Smola,

Theoretical Justification for Maximum Margins Theoretical Justification for Maximum Margins Gap Tolerant Classifier Classifier is defined by a ball in R d with diameter D enclosing all points and two

8 Theoretical Justification for Maximum Margins Theoretical Justification for Maximum Margins Gap Tolerant Classifier Classifier is defined by a ball in R d with diameter D enclosing all points and two parallel hyperplanes with distance M (the margin). Points in the ball are assigned class {-1,1} depending on which side of the margin they fall. For the general case, Vapnik has proven the following: The class of optimal linear separators has VC dimension h bounded from above as D h min, 1 m0 where ρ is the margin, D is the diameter of the smallest sphere that can enclose all of the training examples, and m 0 is the dimensionality. VC dimension of this classifier depends on the margin M 3/4 D 3 points can be shattered 3/4 D < M < D points can be shattered Intuitively, this implies that regardless of dimensionality m 0 we can minimize the VC dimension by maximizing the margin ρ. M D 1 point can be shattered By maximizing the margin, we can minimize the VC dimension Thus, complexity of the classifier is kept small regardless of dimensionality. 50 Image source: C. Burges Slide credit: Raymond Mooney 51 SVM Analysis Nonlinear Support Vector Machines Nonlinear basis functions The Kernel trick Mercer s condition Popular kernels Traditional soft-margin formulation 1 min wr D ;»nr + kwk + C» n subject to the constraints Different way of looking at it Maximize the margin Most points should be on the correct side of the margin We can reformulate the constraints into the objective function. 1 min wr D kwk + C [1 t n y(x n )] + 53 L regularizer where [x] + := max{0,x}. Slide adapted from Christoph Lampert Hinge loss 54 Recap: Error Functions Recap: Error Functions Ideal misclassification error Ideal misclassification error Squared error Sensitive to outliers! Penalizes too correct data points! Not differentiable! z n = t n y(x n ) z n = t n y(x n ) Ideal misclassification error function (black) This is what we want to approximate, Unfortunately, it is not differentiable. The gradient is zero for misclassified points. We cannot minimize it by gradient descent. 55 Image source: Bishop, 006 Squared error used in Least-Squares Classification Very popular, leads to closed-form solutions. However, sensitive to outliers due to squared penalty. Penalizes too correct data points Generally does not lead to good classifiers. 56 Image source: Bishop, 006 8

9 Error Functions (Loss Functions) SVM Discussion Robust to outliers! Ideal misclassification error Squared error Hinge error SVM optimization function 1 min wr D kwk + C [1 t n y(x n )] + L regularizer Hinge loss Not differentiable! Hinge error used in SVMs Favors sparse solutions! z n = t n y(x n ) Zero error for points outside the margin (z n > 1) sparsity Linear penalty for misclassified points (z n < 1) robustness Not differentiable around z n = 1 Cannot be optimized directly. 57 Image source: Bishop, 006 Hinge loss enforces sparsity Only a subset of training data points actually influences the decision boundary. This is different from sparsity obtained through the regularizer! There, only a subset of input dimensions are used. Unconstrained optimization, but non-differentiable function. Solve, e.g. by subgradient descent Currently most efficient: stochastic gradient descent Slide adapted from Christoph Lampert 58 Example Application: Text Classification Nonlinear Support Vector Machines Nonlinear basis functions The Kernel trick Mercer s condition Popular kernels Problem: Classify a document in a number of categories Representation: Bag-of-words approach Histogram of word counts (on learned dictionary) Very high-dimensional feature space (~ dimensions) Few irrelevant features This was one of the first applications of SVMs T. Joachims (1997)? Example Application: Text Classification Example Application: Text Classification Results: This is also how you could implement a simple spam filter Dictionary SVM Mailbox Incoming Word activations Trash

Example Application: OCR Historical Importance Handwritten digit recognition US Postal Service Database Standard benchmark task for many learning algorithms USPS benchmark.

1% error: LeNet 1 (massively hand-tuned) 5-layer network Different SVMs 4.0% error: Polynomial kernel (p=3, 74 support vectors) 4.1% error: Gaussian kernel (¾=0.

Classifier 65 E.g. histogram representation (HOG) Map each grid cell in the input window to a histogram of gradient orientations. Train a linear SVM using training set of pedestrian vs.

10 Example Application: OCR Historical Importance Handwritten digit recognition US Postal Service Database Standard benchmark task for many learning algorithms USPS benchmark.5% error: human performance Different learning algorithms 16.% error: Decision tree (C4.5) 5.9% error: (best) -layer Neural Network 5.1% error: LeNet 1 (massively hand-tuned) 5-layer network Different SVMs 4.0% error: Polynomial kernel (p=3, 74 support vectors) 4.1% error: Gaussian kernel (¾=0.3, 91 support vectors) Example Application: OCR Example Application: Object Detection Results Almost no overfitting with higher-degree kernels. Sliding-window approach Obj./non-obj. Classifier 65 E.g. histogram representation (HOG) Map each grid cell in the input window to a histogram of gradient orientations. Train a linear SVM using training set of pedestrian vs. non-pedestrian windows. [Dalal & Triggs, CVPR 005] Example Application: Pedestrian Detection Many Other Applications Lots of other applications in all fields of technology OCR Text classification Computer vision High-energy physics Monitoring of household appliances Protein secondary structure prediction Design on decision feedback equalizers (DFE) in telephony N. Dalal, B. Triggs, Histograms of Oriented Gradients for Human Detection, CVPR 005 (Detailed references in Schoelkopf & Smola, 00, pp. 1)

SVM techniques have been applied to a variety of other tasks e.g. SV Regression, One-class SVMs, The kernel trick has been used for a wide variety of applications.

11 Summary: SVMs Nonlinear Support Vector Machines Extensions One-class SVMs Properties Empirically, SVMs work very, very well. SVMs are currently among the best performers for a number of classification tasks ranging from text to genomic data. SVMs can be applied to complex data types beyond feature vectors (e.g. graphs, sequences, relational data) by designing kernel functions for such data. SVM techniques have been applied to a variety of other tasks e.g. SV Regression, One-class SVMs, The kernel trick has been used for a wide variety of applications. It can be applied wherever dot products are in use e.g. Kernel PCA, kernel FLD, Good overview, software, and tutorials available on Summary: SVMs You Can Try It At Home Limitations How to select the right kernel? Requires domain knowledge and experiments How to select the kernel parameters? (Massive) cross-validation. Usually, several parameters are optimized together in a grid search. Solving the quadratic programming problem Standard QP solvers do not perform too well on SVM task. Dedicated methods have been developed for this, e.g. SMO. Speed of evaluation Lots of SVM software available, e.g. svmlight ( Command-line based interface Source code available (in C) Interfaces to Python, MATLAB, Perl, Java, DLL, libsvm ( Library for inclusion with own code C++ and Java sources Interfaces to Python, R, MATLAB, Perl, Ruby, Weka, C+.NET, Evaluating y(x) scales linearly in the number of SVs. Too expensive if we have a large number of support vectors. There are techniques to reduce the effective SV set. Training for very large datasets (millions of data points) Stochastic gradient descent and other approximations can be used 71 Both include fast training and evaluation algorithms, support for multi-class SVMs, automated training and cross-validation, Easy to apply to your own problems! 7 References and Further Reading More information on SVMs can be found in Chapter 7.1 of Bishop s book. You can also look at Schölkopf & Smola (some chapters available online). Christopher M. Bishop Pattern Recognition and Machine Learning Springer, 006 B. Schölkopf, A. Smola Learning with Kernels MIT Press, 00 A more in-depth introduction to SVMs is available in the following tutorial: C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, Vol. (), pp

Machine Learning Lecture 9

Course Outline Machine Learning Lecture 9 Fundamentals ( weeks) Bayes Decision Theory Probability Density Estimation Nonlinear SVMs 19.05.013 Discriminative Approaches (5 weeks) Linear Discriminant Functions