Advanced Studies in Applied Statistics (WBL), ETHZ Applied Multivariate Statistics Spring 2018, Week 11

1 Advanced Studies in Applied Statistics (WBL), ETHZ Applied Multivariate Statistics Spring 2018, Week 11 Lecturer: Beate Sick Remark: Much of the material has been developed together with Oliver Dürr for different lectures at ZHAW.

2 Topics of today
- Support Vector Machine as a linear classifier
- The idea of a separating hyperplane with a fat margin and support vectors
- Going from linearly perfectly separable data to not perfectly separable data
- hyper-parameters: cost and kernel parameters (e.g. γ)
- extension to more than two classes
- The kernel trick allows to find non-linear separation boundaries with margin
- The kernel trick allows to add newly constructed features on the fly
- Common kernels

3 Support Vector Machine (SVM) - the basic idea
Each observation is a vector of values (p-dimensional). SVM constructs a hyperplane to separate the class members.
[Figure: observations in the space spanned by feature 1 (X1), feature 2 (X2) and feature 3 (X3), separated by a hyperplane]

4 Linearly perfectly separable case

5 Support Vector Machine - Hyperplanes
Each observation (a column vector) can be viewed as a point in a p-dimensional space (p = number of features). A linear binary classifier constructs a hyperplane separating class members from non-members in this space.
[Figure: observations in the X1-X2 plane with several possible separating hyperplanes - which one should we choose?]

6 Idea of SVM in case of linearly separable data
Among the many hyperplanes that can separate the data, SVM chooses a specific one, namely the maximum margin hyperplane, which maximizes the distance from the hyperplane to the closest training point.
[Figure: two classes in the feature 1 / feature 2 plane with the maximum margin boundary - take the fattest margin!]

7 Why do we want a large margin?
- small margin -> might overfit the training data
- large margin -> does not overfit the training data
Both separation boundaries separate the classes in the training data, but for new test data we expect a better classification performance for the large margin classifier, since test data should have a better chance to be on the right side of the blue plane than of the green plane.

8 SVM - Support Vectors
Training examples that lie closest to the decision boundary determine the hyperplane and are called support vectors. All other training examples do not contribute to the specification of the boundary and can be moved without changing the hyperplane.
[Figure: margin in the x1-x2 plane; the support vectors lie at the margin in the separable case, or also within the margin otherwise]

9 We want to find the separating hyperplane: find the β-vector
Each hyperplane is given by: β0 + β1·X1 + ... + βp·Xp = 0
Conditions for a hyperplane that separates the classes y = ±1:
β0 + β1·x_i1 + ... + βp·x_ip > 0 if y_i = +1
β0 + β1·x_i1 + ... + βp·x_ip < 0 if y_i = -1
Reformulated as a single condition (using a notation trick with y = ±1):
y_i·(β0 + β1·x_i1 + ... + βp·x_ip) > 0
Note that for each β-vector that fulfils this condition for each i, any rescaled version (e.g. 10·β) is also a solution. To get a unique solution for β we need a constraint, which is usually: Σ_{j=1}^p βj² = 1

10 Optimization objective in SVM: find the β-vector that maximizes M with
y_i·(β0 + β1·x_i1 + ... + βp·x_ip) ≥ M for all i
subject to (unter Nebenbedingung) Σ_{j=1}^p βj² = 1
Remark: optimization under constraints is a hard business which we skip here. For details see e.g. ELS chapter 12.
A β-vector fulfilling these constraints corresponds to a separating hyperplane with which we achieve a classification where all data are on the right side of the plane and have at least distance M to the plane (such a hard margin is only possible in the separable case).

11 Linearly not perfectly separable case

12 For non-separable data we need to allow for misclassifications!
Let's check this out for an easy (separable) example. Idea: we buy some misclassifications and pay back with a large margin.
- Large cost C for misclassification -> no training error but a very small margin
- Small cost C -> we can afford one training error and get paid off with a large margin

13 Optimization objective in SVM with soft margin
Intuitive approach to the optimization: find the β-vector that maximizes M with
y_i·(β0 + β1·x_i1 + ... + βp·x_ip) ≥ M·(1 - ξ_i)
s.t. Σ_{j=1}^p βj² = 1, ξ_i ≥ 0, Σ_{i=1}^n ξ_i ≤ C  (the slack variables ξ_i soften the margin)
Figure credits: Elements of Statistical Learning (ELS)
The cost C is a tuning parameter that determines the cost or penalty for each unit-step of ξ on the wrong side of the margin; it gives a budget to pay for some misclassifications.
Remark: optimization under constraints is a hard business which we skip here. For details see e.g. ELS chapter 12.
Two possible formulations for a soft margin:
[1] y_i·(x_i^t·β + β0) ≥ M - ξ_i
[2] y_i·(x_i^t·β + β0) ≥ M·(1 - ξ_i)
Remark taken from ELS chapter 12.2: Formulation [1] seems more natural, since it measures overlap in actual distance from the margin; [2] measures the overlap in relative distance, which depends on M. SVM uses [2] since it leads to a convex optimization problem.

14 SVM in case of the not perfectly separable setting
[Figure: margin of total width 2M in the x1-x2 plane; the support vectors have distance at most M to the hyperplane (in the not perfectly separable case)]

15 Soft margin: errors on training data are allowed but they cost
We use a soft margin that accepts some misclassifications of the training examples, tuned by the tuning parameter C that corresponds to the cost of training errors.
- Small cost C for training errors tends to under-fit the training data.
- Large cost C for training errors tends to over-fit the training data.
[Figure: a wide soft margin (low cost) and a narrow margin (high cost) in the X1-X2 plane]
Low cost C: a high number of training errors is not so expensive, so we can afford a large soft margin -> if the training data change slightly, the boundary tends to be stable -> classifier with low variance & higher bias.
High cost C: a low number of training errors (to avoid costs) is often only achievable with a narrow margin -> if the training data change slightly, the boundary might vary a lot -> classifier with high variance & low bias.

16 Why do we want a large margin?
Small margin (large cost C for training errors): We expect that the separation hyperplane depends more on the details of the concrete realization of the data and can better separate complex data. Hence it tends to over-fit the training data (-> smaller bias, larger variance) and might have worse performance on test data.
Large margin (small cost C for training errors): The separation hyperplane needs to adapt less to the details of the concrete realization and can e.g. ignore small deviations from a general linear separation boundary. Hence it tends to under-fit the training data (-> larger bias, smaller variance) and tends to show comparable performance on test data.

17 Making predictions with an SVM model
An SVM directly fits a decision value. The classification rule of an SVM model is
C(x) = sign( β0 + β1·x_1 + ... + βp·x_p ) ∈ {-1, +1}
svm.fit = svm(y ~ ., data = dat.train)
test.pred = predict(svm.fit, newdata = dat.test)
SVM does not estimate probabilities! However, in R it is still possible to get probabilities, coming from fitting a logistic regression to the estimated SVM decision values. For implementation details see the JSS article.
svm.fit = svm(y ~ ., data = dat.train, probability = TRUE)
test.pred = predict(svm.fit, newdata = dat.test, probability = TRUE)
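
A minimal sketch (assuming a two-class training set dat.train and a test set dat.test with response y, as above) of how the decision values and the derived probabilities can be extracted with e1071:

library(e1071)
# fit with the probability model enabled (logistic scaling of the decision values)
svm.fit <- svm(y ~ ., data = dat.train, probability = TRUE)
# ask predict() to return decision values and probabilities as attributes
test.pred <- predict(svm.fit, newdata = dat.test,
                     decision.values = TRUE, probability = TRUE)
head(attr(test.pred, "decision.values"))  # signed decision values w.r.t. the hyperplane
head(attr(test.pred, "probabilities"))    # class probabilities from the fitted logistic model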

18 Tuning an SVM model for prediction performance in R
library(e1071)
library(MASS)     # provides the cats data set
data(cats)        # classify the sex of cats using body and heart weight
dim(cats)         # 144 rows, 3 columns (Sex, Bwt, Hwt)
train = sample(nrow(cats), 80)
cats.train = cats[train, ]
cats.test = cats[-train, ]
svm.linear = svm(Sex ~ ., data = cats.train, cost = 0.01)
train.pred = predict(svm.linear, cats.train)
sum(cats.train$Sex != train.pred) / nrow(cats.train)  # 32%
test.pred = predict(svm.linear, cats.test)
sum(cats.test$Sex != test.pred) / nrow(cats.test)     # 32%
set.seed(4711)
tune.out = tune(svm, Sex ~ ., data = cats.train,
                ranges = list(cost = 10^seq(-2, 1, by = 0.25)))
tune.out$best.parameters$cost
svm.opt = svm(Sex ~ ., data = cats.train, cost = tune.out$best.parameters$cost)
train.pred = predict(svm.opt, cats.train)
sum(cats.train$Sex != train.pred) / nrow(cats.train)  # 15%
test.pred = predict(svm.opt, cats.test)
sum(cats.test$Sex != test.pred) / nrow(cats.test)     # 26%
see also the R lab by Hastie & Tibshirani
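
As a side note (a detail of the e1071 API, not from the slide): the object returned by tune() already contains the model refitted with the best parameters, so the optimal SVM does not have to be refitted by hand:

summary(tune.out)               # cross-validated error for each candidate cost
svm.opt <- tune.out$best.model  # model refitted with the best cost found by tune()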

19 Reformulate the SVM objective as a loss function: hinge loss
Find the β-vector that maximizes M with y_i·(β0 + β1·x_i1 + ... + βp·x_ip) ≥ M·(1 - ξ_i)
s.t. Σ_{j=1}^p βj² = 1, ξ_i ≥ 0, Σ_{i=1}^n ξ_i ≤ C
This is equivalent to minimizing, over β0, β1, ..., βp,
Σ_{i=1}^n max[0, 1 - y_i·(β0 + β1·x_i1 + ... + βp·x_ip)] + 1/(2C)·Σ_{j=1}^p βj²
where the first term is the hinge loss and the second term is the penalty term.
Hastie, Tibshirani: The equivalence is not obvious, the derivation is pretty hard. ELS chapter 12: It is easy to show ;-) (Exercise 12.1) that this loss function has the same solution as the optimization problem.
Hinge loss, in terms of the signed distance of a point to the hyperplane:
- distance > M: correctly classified with margin, no cost (further distance is not rewarded)
- distance between 0 and M: correctly classified, but without margin; the cost is higher the more the point penetrates into the margin
- distance = 0: the point lies on the hyperplane
- distance < 0: incorrectly classified; the cost is higher the more the point is on the false side
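
A minimal R sketch (function and variable names are illustrative, not from the slides) that evaluates this hinge-loss objective for a given coefficient vector; it only illustrates the formula and is not how svm() actually optimizes:

# X: n x p matrix of features, y: labels coded as +1 / -1
hinge_objective <- function(beta0, beta, X, y, cost) {
  f <- beta0 + as.vector(X %*% beta)      # decision values f(x_i)
  hinge <- pmax(0, 1 - y * f)             # hinge loss per observation
  sum(hinge) + sum(beta^2) / (2 * cost)   # hinge loss plus ridge-type penalty
}

# toy usage
set.seed(1)
X <- matrix(rnorm(20), ncol = 2)
y <- ifelse(X[, 1] + X[, 2] > 0, 1, -1)
hinge_objective(beta0 = 0, beta = c(1, 1), X = X, y = y, cost = 1)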

20 Logistic regression: recap
log( P(Y=1|X) / (1 - P(Y=1|X)) ) = β0 + β1·X1 + ... + βp·Xp =: z
P(Y=1|X) = e^z / (1 + e^z)
With logistic regression we estimate the probability for Y = 1!
Useful notation (coding Y_i ∈ {0, 1}): π(x_i) = P(Y_i=1|X_i), 1 - π(x_i) = P(Y_i=0|X_i), so that P(Y_i|X_i) = π(x_i)^{y_i}·(1 - π(x_i))^{1-y_i}
We estimate the coefficients by maximizing the likelihood (the coefficients are contained in π, since π is determined as a function of the linear predictor).
Maximize the likelihood: L(β) = Π_{i=1}^n P(Y_i|X_i)
Equivalently, minimize the loss function given by the negative log-likelihood:
l(β) = - Σ_{i=1}^n [ y_i·log π(x_i) + (1 - y_i)·log(1 - π(x_i)) ]
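
For reference, a minimal sketch (the data frame dat with a 0/1 response y is an assumption, not from the slide) of the corresponding fit in R and the negative log-likelihood it minimizes:

# dat: data frame with a 0/1 response y and predictor columns
lr.fit <- glm(y ~ ., data = dat, family = binomial)
pi.hat <- fitted(lr.fit)       # estimated probabilities P(Y = 1 | X)
negloglik <- -sum(dat$y * log(pi.hat) + (1 - dat$y) * log(1 - pi.hat))
negloglik                      # equals -logLik(lr.fit) up to numerical accuracy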

21 Comparing SVM and logistic regression (LR) loss functions
Penalized (ridge) logistic regression, with π = P(Y=1|x):
Loss_LR(x) = -[ y·log( e^{x'β} / (1 + e^{x'β}) ) + (1 - y)·log( 1 / (1 + e^{x'β}) ) ] + λ·Σ_{j=1}^p βj²
SVM (hinge loss):
Loss_SVM(x) = max[0, 1 - y·x'β] + 1/(2C)·Σ_{j=1}^p βj²
For labels y = ±1 the LR loss per observation can be written as log(1 + e^{-y·x'β}); it is close to zero for large positive y·x'β and grows roughly linearly (≈ -y·x'β) for strongly negative y·x'β, i.e. it behaves very much like the hinge loss.
The SVM hinge loss is very similar to the LR loss used in (regularized) logistic regression, which is given by the negative log-likelihood. However, LR aims to get the estimated probability correct and not only to be, with a margin, on the right side of the decision boundary. For details see ELS chapter 12.
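
A small sketch (purely illustrative) that plots the two per-observation losses as functions of y·f(x) to make the similarity visible:

yf <- seq(-3, 3, length.out = 200)       # y * f(x), the signed margin of a point
hinge <- pmax(0, 1 - yf)                 # SVM hinge loss
logistic <- log(1 + exp(-yf))            # LR negative log-likelihood (y coded as +/-1)
plot(yf, hinge, type = "l", xlab = "y * f(x)", ylab = "loss")
lines(yf, logistic / log(2), lty = 2)    # rescaled so that both losses equal 1 at y*f(x) = 0
legend("topright", legend = c("hinge (SVM)", "logistic (LR, rescaled)"), lty = 1:2)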

22 Sidetrack: Using SVM for regression
By using a different loss function, the ε-insensitive loss function
L_ε(y, f(x)) = max{ 0, |y - f(x)| - ε },
SVMs can also perform regression. This loss function ignores errors that are smaller than a certain threshold ε > 0, thus creating a tube around the true output.
R: if the target variable y is not a factor variable, svm() fits a regression model.
library(e1071)
# data: a data frame with columns X and Y, loaded beforehand
plot(data)
model <- svm(Y ~ X, data)
predictedY <- predict(model, data)
points(data$X, predictedY, col = "red", pch = 4)
(example taken from an online SVM regression tutorial)
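
A hedged sketch (the parameter grids are illustrative choices): ε and the cost of such a regression SVM can be tuned with the same tune() mechanism as in the classification case:

library(e1071)
tuneResult <- tune(svm, Y ~ X, data = data,
                   ranges = list(epsilon = seq(0, 1, 0.1), cost = 2^(2:7)))
tunedModel <- tuneResult$best.model          # regression SVM with the best epsilon and cost
tunedPredictedY <- predict(tunedModel, data)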

23 More than 2 classes

24 SVM - More than 2 classes (one vs. rest)
SVM is a binary classifier: it can only separate two classes. What if there are more than 2 classes? For N > 2 classes, fit N times 'one vs. rest'.
[Figure: three 'one vs. rest' decision boundaries in the gene X / gene Y plane; the unknown observation o has distance ~ -3, ~ -2 and ~ +2 to the boundaries of the three single classes]
o has the highest distance to the decision boundary in the 'green vs. all' case -> classify o as green.
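
A side note (a detail of the e1071 implementation, not from the slide): when svm() is given a factor response with more than two levels it handles the multiclass problem automatically; internally libsvm uses a one-against-one scheme (one binary SVM per pair of classes plus voting) rather than one vs. rest, but from the user's perspective the call is unchanged:

library(e1071)
fit3 <- svm(Species ~ ., data = iris, kernel = "linear")   # 3 classes, handled automatically
table(predict(fit3, iris), iris$Species)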

25 SVM in R (two classes)
library(e1071)
iris1 = iris[51:150, ]                     # only versicolor and virginica
iris1$Species = droplevels(iris1$Species)  # drop the unused setosa level
table(iris1$Species)
fit = svm(Species ~ ., data = iris1, kernel = "linear", cost = 10)
res = predict(fit, iris1)
sum(res == iris1$Species)
res_tune = tune(svm, Species ~ ., data = iris1, kernel = "linear",
                ranges = list(cost = c(0.1, 1, 10)))
summary(res_tune)   # detailed performance results: cost, error, dispersion

26 SVM on iris with 3 classes
library(mlr)
svm.learner = makeLearner("classif.svm", kernel = "linear")
plotLearnerPrediction(learner = svm.learner, task = iris.task)

27 Kernel trick for non-linear separation boundaries

28 Add a new feature to linearly separate the classes in higher dimensions
With only a single variable x the two classes are not separable by a point (a hyperplane in 1D). Taking the single variable x together with x², they become separable by a line (a hyperplane in 2D).
[Figure: the same data shown on the x axis (1D), in the (x, x²) plane where a line separates the classes, and projected back to 1D]
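
A minimal sketch (toy data, names assumed) of this idea in R: one class sits in the middle of the other on a single axis, so no single threshold separates them, but after adding x² a linear SVM does:

library(e1071)
set.seed(1)
x <- c(rnorm(50, mean = 0, sd = 0.5),      # class A: centered around 0
       rnorm(50, mean = 3, sd = 0.5) * sample(c(-1, 1), 50, replace = TRUE))  # class B: far left or right
y <- factor(rep(c("A", "B"), each = 50))
dat <- data.frame(x = x, x2 = x^2, y = y)  # add the constructed feature x^2
fit <- svm(y ~ x + x2, data = dat, kernel = "linear")  # linear SVM in the expanded 2D space
mean(predict(fit, dat) == dat$y)           # (close to) perfect separation on this toy data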

29 Add a new feature to linearly separate the classes in higher dimensions
The data are not linearly separable in the 2D space spanned by the original features x1 and x2, but they are linearly separable in 3D space after adding a new feature x_new that was constructed from x1 and x2.
[Figure: the data in the (x1, x2) plane and in the 3D space spanned by x1, x2 and x_new]

30 The basic idea of non-linear SVM with the kernel trick: add new features and classify in the new feature space
1) Mapping from the input space to the feature space, e.g.
Φ(X) = ( X1, X2, X1², X2², √2·X1·X2 )
corresponding to a polynomial kernel of degree 2
2) Find a linear boundary in the feature space
3) Mapping the boundary back into the input space leads to a non-linear boundary in the input space
[Figure: observations mapped from the low-dimensional input space to the high-dimensional feature space, where a linear boundary is found and then mapped back]
Remark: this feature expansion trick also works for other classifiers, see exercises.

31 The dual formulation allows for the computational kernel trick
We skip the nasty math that leads to this computationally beautiful result.
Find the β-vector that maximizes M with y_i·(β0 + β1·x_i1 + ... + βp·x_ip) ≥ M·(1 - ξ_i)
s.t. Σ_{j=1}^p βj² = 1, ξ_i ≥ 0, Σ_{i=1}^n ξ_i ≤ C
In the dual formulation, only the scalar product between feature vectors enters the optimization:
Σ_{j=1}^p x_ij·x_i'j =: K(x_i, x_i')
When going to a high-dimensional feature space we could determine the scalar product between each pair of new feature vectors, called the kernel, in the optimization objective before performing the optimization.
Here comes the computational beauty of the kernel trick: often the kernel of the expanded features can be calculated with much less computational cost than really doing the mapping to the new features and then taking their scalar product!
For details see e.g. Andrew Ng's lecture notes, chapter 7 on kernels.

32 The kernel trick in case of a polynomial kernel of degree two
To find the separating hyperplane we have to minimize the dual loss. The only place where x enters is via the scalar products
x_i^t·x_i' = Σ_{j=1}^p x_ij·x_i'j =: K(x_i, x_i')
Polynomial kernel of degree 2: replace K(x_i, x_i') = Σ_{j=1}^p x_ij·x_i'j with
K(x_i, x_i') = Σ_{j=1}^p x_ij·x_i'j + Σ_{j=1}^p x_ij²·x_i'j²
This is the same as explicitly making new squared features X1², ..., Xp², but they are computed on the fly.
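
A tiny numerical check (a sketch, not part of the slides) that this kernel is indeed the ordinary scalar product of the explicitly expanded feature vectors Φ(x) = (x1, ..., xp, x1², ..., xp²):

set.seed(1)
x  <- rnorm(4)                    # one observation with p = 4 features
xp <- rnorm(4)                    # a second observation
phi <- function(v) c(v, v^2)      # explicit feature expansion
k_explicit <- sum(phi(x) * phi(xp))          # scalar product in the expanded space
k_trick    <- sum(x * xp) + sum(x^2 * xp^2)  # kernel computed 'on the fly'
all.equal(k_explicit, k_trick)               # TRUE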

33 A hot topic in the 1990s and early 2000s, and still used

34 Most important kernels
The (only) kernels used in practice:
- Linear (just the inner product): K(x, x') = <x, x'>   (in R: kernel = "linear")
- Gaussian, aka radial basis function (RBF): K(x, x') = exp( -γ·||x - x'||² ), often with γ = 1/(2σ²)
- Polynomial of degree d: K(x, x') = ( <x, x'> + c )^d
In R we use the tune.svm function of the e1071 package to find good settings:
obj = tune.svm(x, y, cost = seq(0.5, 30, 0.5), gamma = seq(0.1, 3, 0.1))  # cost and gamma must be positive
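
A minimal sketch (toy data, illustrative parameter values) showing how these kernels are selected in e1071; the kernel-specific parameters gamma, degree and coef0 have defaults and can also be passed explicitly:

library(e1071)
set.seed(1)
dat <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
dat$y <- factor(ifelse(dat$x1^2 + dat$x2^2 > 1.5, "out", "in"))   # non-linear class structure
fit.lin  <- svm(y ~ ., data = dat, kernel = "linear",     cost = 1)
fit.rbf  <- svm(y ~ ., data = dat, kernel = "radial",     cost = 1, gamma = 1)
fit.poly <- svm(y ~ ., data = dat, kernel = "polynomial", cost = 1, degree = 2)
sapply(list(linear = fit.lin, radial = fit.rbf, poly = fit.poly),
       function(m) mean(predict(m, dat) == dat$y))   # training accuracy of each kernel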

35 SVM with kernel trick extension
[Figure: SVM decision boundaries with a polynomial kernel and with an RBF (Gaussian) kernel]

36 Feature expansion in case of the Gaussian kernel
See also the great lecture of Andrew Ng.
We call each training observation x^(l) a landmark. As new features for an observation we determine a similarity measure to each landmark -> we have as many new features as we have observations in our training data set. As similarity measure we use the density value of a normal distribution which is centered at the landmark and evaluated at the position of the new observation:
Φ(x) = ( f1(x), f2(x), ..., fn(x) )  with  fi(x) = exp( -||x - x^(i)||² / (2σ²) )
[Figure: a new observation '?' and two landmarks x^(k) and x^(l) in the input space]

37 Gaussian kernel (cntd.)
Φ(x) = ( f1(x), f2(x), ..., fn(x) )  with  fi(x) = exp( -||x - x^(i)||² / (2σ²) )
[Figure: mapping from the input space to the feature space, finding the boundary there, and mapping it back to the low-dimensional input space]
Find the β-vector that maximizes M with y_i·(β0 + β1·f1(x_i) + ... + βn·fn(x_i)) ≥ M·(1 - ξ_i)
The RBF SVM classification rule C(x) = sign( β0 + β1·f1(x) + ... + βn·fn(x) ) results in something similar to a k-nearest-neighbour classification.
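
A small sketch (names assumed, sigma chosen arbitrarily) that computes this RBF feature expansion by hand for a training matrix X: each row of the resulting matrix contains the similarities of one observation to all n landmarks:

rbf_features <- function(X, landmarks = X, sigma = 1) {
  # squared Euclidean distances between every observation and every landmark
  d2 <- outer(rowSums(X^2), rowSums(landmarks^2), "+") - 2 * X %*% t(landmarks)
  exp(-d2 / (2 * sigma^2))        # matrix of new features f_i(x)
}

set.seed(1)
X <- matrix(rnorm(20), ncol = 2)  # 10 observations, 2 original features
F <- rbf_features(X)              # 10 x 10 matrix: one RBF feature per landmark
dim(F)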

38 The Gaussian kernel has two tuning parameters: the cost and γ (~ 1/width)
- The higher the cost C (for training errors), the smaller the soft margin.
- The larger the width σ of the Gaussian, the smoother the decision boundary.

39 The Gaussian kernel can help if classes are split into clusters
In a space in which the members of a class form one or more clusters, an accurate classifier might place a Gaussian around each cluster, thereby separating the clusters from the remaining space of non-class members. This effect can be accomplished by placing a Gaussian with standard deviation σ over each support vector in the training set.

40 The Gaussian kernel in R
# tuning parameters of the Gaussian kernel
library(e1071)
library(manipulate)
set.seed(1)
x = matrix(rnorm(200 * 2), ncol = 2)
x[1:100, ] = x[1:100, ] + 2
x[101:150, ] = x[101:150, ] - 2
y = c(rep(1, 150), rep(2, 50))
dat = data.frame(x = x, y = as.factor(y))
manipulate({
  svmfit = svm(y ~ ., data = dat, kernel = "radial", gamma = gamma, cost = cost)
  plot(svmfit, dat)   # plotting
}, gamma = slider(0.1, 10), cost = slider(0.1, 10))
see also the R lab on non-linear SVM by Hastie & Tibshirani

41 Comparison of SVM and kNN classification
[Figure comparing SVM and k-nearest-neighbour classification; figure credits given on the original slide]

42 Separation and dimensionality
Consider examples of 2 classes:
- Draw 2 points on a line. Can you always separate them?
- Draw 3 points in a plane (not on a line!). Can you always separate them?
- Imagine 4 points in 3D. Can you always separate them?
If the number of features is larger than the number of examples (p > n), you can always perfectly separate 2 classes. To avoid overfitting, work with a linear kernel and a large margin (small cost C).

43 A word of warning
It is quite fancy to write "I have used Gaussian kernels." But always consider whether you really need them! If the number of features is larger than the number of examples (p > n), you probably don't need them; overfitting is then the problem. It is a good idea to try a linear kernel first!

44 SVM versus logistic regression or LDA
- When classes are (nearly) separable, SVM does better than LR. However, in this case LDA can also be used, and it provides probabilities.
- For non-separable data, LR (with ridge penalty) and SVM are very similar.
- If you wish to estimate probabilities, LR is the method of choice.
- For non-linear boundaries, Gaussian RBF kernel SVMs are popular.
- Feature expansion also works for LR and LDA, but the computations are more expensive.
see Hastie & Tibshirani

45 Summary
- SVM is a linear classifier with high prediction performance, based on a separating hyperplane with a fat margin and support vectors.
- If variables have different units we should scale them before applying SVM.
- SVM performs well also with relatively few training data.
- We should use cross-validation to tune the hyper-parameters: cost and the kernel parameters (e.g. γ).
- SVM works well also for more than 2 classes.
- SVM is not suitable if you need probability predictions.
- The kernel trick allows to find non-linear separation boundaries with margin.
- The kernel trick allows to add newly constructed features on the fly.
- The optimal choice of the kernel depends on the unknown data structure.
