Behavioral Data Mining. Lecture 10 Kernel methods and SVMs


Loraine Tyler
6 years ago

1 Behavioral Data Mining Lecture 10 Kernel methods and SVMs

2 Outline: SVMs as large-margin linear classifiers; Kernel methods; SVM algorithms

3 SVMs as large-margin classifiers The separating plane maximizes the distance (margin) to the nearest data point.

4 SVMs as large-margin classifiers Two small-margin separators

5 SVMs as large-margin classifiers

6 Generalization Add new points from the same distributions: no errors.

7 Generalization More misclassified points with small-margin separators

8 Support Vectors Generically an SVM requires d+1 points to specify the planes in d dimensions.

9 SVM definition The basic linear SVM is defined by a separating plane, represented by a weight vector $w$ and an intercept $b$. The classifier function is then $f(x) = \operatorname{sign}(w^T x + b)$. We can find an SVM classifier by solving the system of constraints (a quadratic programming problem): $\max_{w,b} \gamma$ (maximize the margin), where $w^T x + b \ge \gamma$ for points $x$ in the first class, $w^T x + b \le -\gamma$ for points $x$ in the second class, and with $w^T w = 1$.

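10 SVM definition This is equivalent to (scale by $1/\gamma$): $\min_{w,b} \tfrac{1}{2} w^T w$, where $w^T x + b \ge 1$ for points $x$ in the first class and $w^T x + b \le -1$ for points $x$ in the second class. The factor $\tfrac{1}{2}$ is a convenience and does not change the minimizer.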
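The classifier function and the margin it achieves can be sketched in a few lines of Python. This is a minimal illustration, not part of the lecture: the toy points, the weights and the `+ b` sign convention for the intercept are all assumptions.

```python
import math

def decision_value(w, b, x):
    """Signed score w^T x + b (the '+ b' sign convention is an assumption)."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(w, b, x):
    """Predict +1 for the first class and -1 for the second."""
    return 1 if decision_value(w, b, x) >= 0 else -1

def geometric_margin(w, b, points, labels):
    """Smallest signed distance from any labeled point to the plane."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(y * decision_value(w, b, x) / norm
               for x, y in zip(points, labels))

# Hypothetical toy data: two points per class in 2-d.
points = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
labels = [1, 1, -1, -1]
w, b = (0.5, 0.5), -1.0  # satisfies y * (w^T x + b) >= 1 on every point

predictions = [classify(w, b, x) for x in points]
margin = geometric_margin(w, b, points, labels)
```

With the constraints scaled so the closest points score exactly 1, the margin equals 1 divided by the norm of w, which is why minimizing $w^T w$ maximizes the margin.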

11 Soft-Margin SVM Individual penalties $\xi_i$ are added for any points that don't make the global margin.

12 Soft-margin SVM Most real datasets won't partition perfectly. But you can minimize the total margin violation for those points. Soft-margin SVM: $\min_{w,b,\xi} \tfrac{1}{2} w^T w + C \sum_i \xi_i$, where $y_i (w^T x_i + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$, with $y_i \in \{-1, 1\}$ the class labels.
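A small sketch of how the soft-margin objective is evaluated for a fixed plane may help here. It assumes the common scaling $\tfrac{1}{2} w^T w + C \sum_i \xi_i$, with each slack taken at its smallest feasible value $\xi_i = \max(0, 1 - y_i (w^T x_i + b))$. The data, weights and value of C are hypothetical.

```python
def slack(w, b, x, y):
    """Margin violation xi = max(0, 1 - y * (w^T x + b)) for one point."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return max(0.0, 1.0 - y * score)

def soft_margin_objective(w, b, points, labels, C):
    """0.5 * w^T w + C * (total slack); the 0.5 scaling is an assumption."""
    regularizer = 0.5 * sum(wi * wi for wi in w)
    total_slack = sum(slack(w, b, x, y) for x, y in zip(points, labels))
    return regularizer + C * total_slack

# Hypothetical data: the second point sits inside the margin and the
# third is on the wrong side of the plane.
points = [(2.0, 0.0), (0.5, 0.0), (-0.5, 0.0), (-2.0, 0.0)]
labels = [1, 1, 1, -1]
w, b, C = (1.0, 0.0), 0.0, 1.0

slacks = [slack(w, b, x, y) for x, y in zip(points, labels)]
objective = soft_margin_objective(w, b, points, labels, C)
```

Points that clear the margin contribute zero slack, so only margin violators affect the penalty term. A larger C makes violations more expensive relative to a wide margin.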

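13 Support Representation Since the SVM hyperplane is fully described by a (small) number of points, it may be very useful to represent it that way. The dual form is: maximize over $\alpha_i$ the objective $\sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j k(x_i, x_j)$ subject to $0 \le \alpha_i \le C$ and $\sum_i \alpha_i y_i = 0$, where $k(x_i, x_j)$ is a kernel function, which for now we will assume to be the inner product $x_i^T x_j$. Then we have $w = \sum_i \alpha_i y_i x_i$ (the $y_i$ are $+1, -1$ labels for set membership).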
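The support representation can be checked on a two-point problem small enough to solve by hand. The sketch below assumes the standard dual objective $\sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j k(x_i, x_j)$ and the linear kernel. The data and the values of `alpha` are hypothetical and were worked out by hand, not produced by a solver.

```python
def linear_kernel(x, z):
    """Plain inner product x^T z."""
    return sum(xi * zi for xi, zi in zip(x, z))

def dual_objective(alpha, labels, points, kernel=linear_kernel):
    """sum_i alpha_i - 0.5 * sum_ij alpha_i alpha_j y_i y_j k(x_i, x_j)."""
    n = len(points)
    pair_sum = sum(alpha[i] * alpha[j] * labels[i] * labels[j]
                   * kernel(points[i], points[j])
                   for i in range(n) for j in range(n))
    return sum(alpha) - 0.5 * pair_sum

def weights_from_dual(alpha, labels, points):
    """Recover w = sum_i alpha_i y_i x_i (valid for the linear kernel)."""
    dim = len(points[0])
    return [sum(a * y * x[d] for a, y, x in zip(alpha, labels, points))
            for d in range(dim)]

def dual_decision_value(alpha, labels, points, b, x, kernel=linear_kernel):
    """Score a new point using kernel evaluations against training points."""
    return sum(a * y * kernel(xi, x)
               for a, y, xi in zip(alpha, labels, points)) + b

# Two training points, one per class; both are support vectors.
points = [(1.0, 0.0), (-1.0, 0.0)]
labels = [1, -1]
alpha = [0.5, 0.5]  # satisfies sum_i alpha_i y_i = 0

w = weights_from_dual(alpha, labels, points)
dual_value = dual_objective(alpha, labels, points)
score = dual_decision_value(alpha, labels, points, 0.0, (2.0, 0.0))
```

Here the dual value 0.5 equals the primal value $\tfrac{1}{2} w^T w$ for $w = (1, 0)$, and the decision value never needs $w$ explicitly, only kernel evaluations against the training points.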

14 Text classification with SVMs For the Reuters news article dataset.

15 Outline: SVMs as large-margin linear classifiers; Kernel methods; SVM algorithms

16 Kernel methods Motivation: add non-linear features that can separate sets that linear classifiers cannot.

17 Kernel Function The kernel function $k$ is defined on vectors $x, x'$ as: $k(x, x') = \phi(x)^T \phi(x')$, where the function $\phi$ is the feature map and its range is the feature space. Ex: for the 1d example we just saw, a standard polynomial kernel would be: $k(x, x') = (1 + x x')^2$. And the formula for $k$ is satisfied with the feature map: $\phi(x) = (1, \sqrt{2}\,x, x^2)$.
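The claim that the feature map reproduces the kernel can be verified numerically. The sketch below assumes the 1-d quadratic kernel $(1 + x x')^2$ and the feature map $(1, \sqrt{2}\,x, x^2)$. The function names are illustrative only.

```python
import math

def poly_kernel_1d(x, z):
    """Quadratic polynomial kernel on scalars: (1 + x * z) ** 2."""
    return (1.0 + x * z) ** 2

def feature_map(x):
    """Explicit feature map phi(x) = (1, sqrt(2) * x, x ** 2)."""
    return (1.0, math.sqrt(2.0) * x, x * x)

def feature_inner_product(x, z):
    """Inner product phi(x)^T phi(z) computed in the 3-d feature space."""
    return sum(a * b for a, b in zip(feature_map(x), feature_map(z)))

# The two computations agree up to floating-point rounding.
kernel_value = poly_kernel_1d(2.0, 3.0)
feature_value = feature_inner_product(2.0, 3.0)
```

Expanding $(1 + x x')^2 = 1 + 2 x x' + x^2 x'^2$ shows where the $\sqrt{2}$ comes from: it splits the cross term evenly between the two feature vectors.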

18 Common Kernels Polynomial: $k(x, x') = (1 + x^T x')^d$. Radial basis (Gaussian): $k(x, x') = \exp(-\|x - x'\|^2 / (2\sigma^2))$. The RBF kernel is expressible as a function of a vector inner product: $\|x - x'\|^2 = x^T x - 2 x^T x' + x'^T x'$. And it therefore fills the requirements of a kernel (positive definiteness).
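As a rough illustration, the RBF kernel can be computed from inner products alone and its Gram matrix spot-checked for positive semidefiniteness. The parameterization with `sigma` and the sample points are assumptions. The quadratic-form check is only a necessary condition on a few vectors, not a proof.

```python
import math

def dot(x, z):
    """Vector inner product x^T z."""
    return sum(xi * zi for xi, zi in zip(x, z))

def rbf_kernel(x, z, sigma=1.0):
    """exp(-||x - z||^2 / (2 sigma^2)), using only inner products."""
    sq_dist = dot(x, x) - 2.0 * dot(x, z) + dot(z, z)
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def gram_matrix(points, kernel):
    """Matrix of kernel values K[i][j] = k(points[i], points[j])."""
    return [[kernel(p, q) for q in points] for p in points]

def quadratic_form(K, c):
    """c^T K c, which must be non-negative for a valid kernel."""
    n = len(c)
    return sum(c[i] * K[i][j] * c[j] for i in range(n) for j in range(n))

# Hypothetical sample points in 2-d.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
K = gram_matrix(points, rbf_kernel)
checks = [quadratic_form(K, c) for c in ([1, -1, 0], [1, -2, 1], [0, 1, -1])]
```

Each point has kernel value 1 with itself, and the value decays toward 0 as points move apart, with `sigma` setting the length scale.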

19 RBF Kernels Most useful for spatial and image data

20 Kernels for Text The inner product between bag-of-words vectors for text yields a sparse vector kernel. It can be normalized using any term weighting, e.g. TF-IDF, to produce more realistic results. String kernel: $k(x, x') = \sum_s \phi_s(x)\,\phi_s(x')$, where $\phi_s(x)$ is the number of occurrences of the string $s$ in $x$. This kernel provides a very precise match measurement, and can be computed in time and space
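Both text kernels can be sketched with sparse count dictionaries. Two assumptions are made here: cosine normalization stands in for a term weighting such as TF-IDF, and the string kernel is restricted to substrings of one fixed length p (a "spectrum" variant), computed naively. The slide's exact weighting and its efficient algorithm are not reproduced.

```python
from collections import Counter
import math

def bag_of_words(text):
    """Sparse term-count vector for whitespace-separated tokens."""
    return Counter(text.lower().split())

def sparse_dot(a, b):
    """Inner product of two sparse count vectors."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller vector
    return sum(count * b[term] for term, count in a.items() if term in b)

def normalized_kernel(a, b):
    """Cosine-normalized kernel k(a, b) / sqrt(k(a, a) * k(b, b))."""
    return sparse_dot(a, b) / math.sqrt(sparse_dot(a, a) * sparse_dot(b, b))

def substring_counts(s, p):
    """Occurrence counts of every length-p substring of s."""
    return Counter(s[i:i + p] for i in range(len(s) - p + 1))

def spectrum_kernel(s, t, p):
    """Sum over length-p strings of (count in s) * (count in t)."""
    return sparse_dot(substring_counts(s, p), substring_counts(t, p))

doc_a = bag_of_words("the cat sat")
doc_b = bag_of_words("the cat ran")
word_kernel = sparse_dot(doc_a, doc_b)
cosine = normalized_kernel(doc_a, doc_b)
string_kernel = spectrum_kernel("abab", "bab", 2)
```

In the string example, "ab" occurs twice in the first string and once in the second, and "ba" occurs once in each, giving 2 * 1 + 1 * 1 = 3.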

21 The kernel trick in general The kernel transformation allows non-linear generalizations of linear SVM classification. But the same idea works for any operation that is based solely on inner products between vectors. For example, principal component analysis can be done with only kernel operations. So can Gram-Schmidt orthogonalization and QR decomposition.
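A simpler operation than PCA makes the same point: the squared distance between two points in feature space expands into inner products, so it can be computed from kernel values alone as $k(x, x) - 2 k(x, z) + k(z, z)$. The sketch below is an illustration chosen for brevity, not something taken from the slides. It uses the 1-d quadratic kernel and compares against the explicit feature map.

```python
import math

def poly_kernel_1d(x, z):
    """Quadratic polynomial kernel on scalars: (1 + x * z) ** 2."""
    return (1.0 + x * z) ** 2

def kernel_sq_distance(x, z, kernel):
    """||phi(x) - phi(z)||^2 using kernel evaluations only."""
    return kernel(x, x) - 2.0 * kernel(x, z) + kernel(z, z)

def explicit_sq_distance(x, z):
    """Same distance computed directly in feature space, for comparison."""
    phi_x = (1.0, math.sqrt(2.0) * x, x * x)
    phi_z = (1.0, math.sqrt(2.0) * z, z * z)
    return sum((a - b) ** 2 for a, b in zip(phi_x, phi_z))

via_kernel = kernel_sq_distance(2.0, 3.0, poly_kernel_1d)
via_features = explicit_sq_distance(2.0, 3.0)
```

The kernel version never constructs the feature vectors, which matters when the feature space is very high-dimensional or, as with the RBF kernel, infinite-dimensional.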

22 Outline: SVMs as large-margin linear classifiers; Kernel methods; SVM algorithms

23 Computing SVMs The good news is that SVM optimization is a convex optimization problem, so the solution can be found by gradient search. The bad news is that the set of active constraints can change many times, and predicting running time is very difficult. While it's polynomial, there is a wide range of times as a function of problem difficulty.

24 Computing SVMs Three widely used algorithms: SMO-SVM: uses low-dimensional projections for fast improvements in loss. SVM light: uses a partition of the problem into active and inactive sets; roughly quadratic complexity in n training samples. SVM perf: has linear complexity in the number of training samples.

25 Computing SVMs Older algorithms:

26 SVM perf Sample running times (Joachims, KDD 2006). An implementation is available.

27 Computing SVMs Recently there have been many approaches to SVM optimization using stochastic gradient and/or quasi-Newton methods. Several adaptive scaling methods provide orders of magnitude speedup over vanilla SGD. See e.g. "Adaptive Bound Optimization for Online Convex Optimization", Brendan McMahan and Matthew Streeter, COLT; "A Reliable Effective Terascale Linear Learning System", Alekh Agarwal, Olivier Chapelle, Miroslav Dudík, John Langford, AISTATS. Bounds may be expressed in terms of accuracy and regularization parameter, instead of n.

28 Computing SVMs On the CCAT classification task:

29 Computing SVMs Pegasos running time in seconds

30 Time vs. sample size Remarkably, both analysis and experiments with Pegasos show that its running time actually decreases as a function of training dataset size, for a given accuracy. SVM Optimization: Inverse Dependence on Training Set Size - Shai Shalev-Shwartz, Nathan Srebro, ICML 2008.
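For orientation, the core Pegasos update can be sketched as a stochastic subgradient step on the objective $\tfrac{\lambda}{2} w^T w$ plus the average hinge loss. This is a simplified sketch under stated assumptions: one example per step, no bias term, no optional projection step, and hypothetical data and hyperparameters. It does not reproduce the experiments on the slides.

```python
import random

def pegasos(points, labels, lam=0.1, num_steps=1000, seed=0):
    """Simplified Pegasos: SGD on lam/2 * ||w||^2 + average hinge loss."""
    rng = random.Random(seed)
    w = [0.0] * len(points[0])
    for t in range(1, num_steps + 1):
        i = rng.randrange(len(points))       # pick one training example
        x, y = points[i], labels[i]
        eta = 1.0 / (lam * t)                # step size 1 / (lambda * t)
        violated = y * sum(wj * xj for wj, xj in zip(w, x)) < 1.0
        w = [(1.0 - eta * lam) * wj for wj in w]   # shrink (regularizer)
        if violated:                         # hinge loss is active
            w = [wj + eta * y * xj for wj, xj in zip(w, x)]
    return w

def predict(w, x):
    """Sign of w^T x (no bias term in this sketch)."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0 else -1

# Hypothetical linearly separable toy data.
points = [(2.0, 1.0), (1.0, 2.0), (2.0, 2.0),
          (-1.0, -2.0), (-2.0, -1.0), (-2.0, -2.0)]
labels = [1, 1, 1, -1, -1, -1]

w = pegasos(points, labels)
predictions = [predict(w, x) for x in points]
```

Each step touches a single example, so the cost per step does not depend on the number of training samples. That property is what makes the running-time behavior described above possible.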

31 Time vs. sample size

32 Summary SVMs as large-margin linear classifiers: primal and dual versions using kernels. Kernel methods: sample kernels for spatial and text data. SVM algorithms: direct optimization and stochastic gradient.
