12 Classification using Support Vector Machines


Bioinformatics I, WS 14/15, D. Huson, January 28, 2015

This lecture is based on the following sources, which are all recommended reading:

- F. Markowetz. Klassifikation mit Support Vector Machines. Chapter 16 of the lecture Genomische Datenanalyse, Max-Planck-Institut für Molekulare Genetik, 2003, molgen.mpg.de/statistik03/docs/kapitel_16.pdf.
- C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121-167, 1998.
- N. Cristianini and J. Shawe-Taylor. An introduction to Support Vector Machines and other kernel-based learning methods. Cambridge University Press.
- F. Mittag et al. Use of Support Vector Machines for disease risk prediction in genome-wide association studies: concerns and opportunities. Human Mutation (2012), DOI: /humu

An SVM performs binary classification of datapoints, based on a set of labeled training datapoints. For example, you may want to predict the diagnosis of a new patient based on the past diagnoses of patients with similar symptoms.

[Figure: patients with known diagnosis are used to train a decision function; the decision function maps a new patient to a prognosis, e.g. cancer yes/no.]

We can phrase this as a supervised learning problem:

- We are given a training set of patients with known diagnosis (either positive or negative).
- We learn or train a decision function f that can distinguish between the positive and negative patients in the training set.
- We want the decision function to generalize well so as to correctly diagnose new patients.

A key issue is to ensure that f neither overfits nor underfits the given training data.

More formally, the setup will be as follows:

We are given a training set $X = \{(x_i, y_i)\}_{i=1}^{n}$ consisting of:

- datapoints $x_i \in \mathbb{R}^k$ (e.g. each a k-dimensional patient profile),
- two classes $y_i \in \{+1, -1\}$ (e.g. healthy and sick),

and a decision function (that depends on $X$)

$$f : \mathbb{R}^k \to \{+1, -1\}$$

that provides the classification $f(x)$ for any datapoint $x \in \mathbb{R}^k$.

Linearly separable data

Assume that the training datapoints in the healthy class (green) and the sick class (red) form two separate clouds.

[Figure: two separate clouds of green and red points.]

If we are given a new datapoint x that has no label, how can we use the labeled points to predict the label of x?

Key idea: Determine a hyperplane H that separates the green and red points from each other. If x lies on the same side of H as the green points, then classify x as green (healthy); otherwise x is red (sick). If such a separating hyperplane exists, then the data is called linearly separable.

A hyperplane H in k-dimensional space can be defined by a normal vector $w \in \mathbb{R}^k$ and an offset $b \in \mathbb{R}$ (which determines the distance to the origin):

$$H = \{x \mid \langle w, x \rangle + b = 0\},$$

where $\langle \cdot, \cdot \rangle$ denotes the scalar product. We can always ensure that the normal vector points toward the side that contains the positive class.

Let $x \in \mathbb{R}^k$ be a datapoint. We can determine on which side of the hyperplane H the point lies using

$$f(x) = \operatorname{sign}(\langle w, x \rangle + b).$$

If $f(x)$ is positive, then x lies on the positive side of the hyperplane; if $f(x)$ is negative, then x lies on the other side.

Classification of a new datapoint

Training: Determine w and b such that the corresponding hyperplane H separates the positive and negative training datapoints.

Classification: Determine on which side of H the new datapoint x lies. If x lies on the side of the positive training datapoints, then x is assigned to class +1; otherwise, x is assigned to class -1. We also call H the decision boundary.

Definition (Linearly separable) We call a training set $X = \{(x_i, y_i)\}_{i=1}^{n}$ linearly separable by a hyperplane $\langle w, x \rangle + b = 0$ if a vector w and a constant b exist such that the following holds:

$$\langle w, x_i \rangle + b > 0 \quad \text{if } y_i = +1 \qquad (12.1)$$
$$\langle w, x_i \rangle + b < 0 \quad \text{if } y_i = -1 \qquad (12.2)$$

Here, w is the hyperplane's normal vector and $\frac{|b|}{\|w\|}$ is the distance of the hyperplane from the origin.

The hyperplane can be represented by a linear combination of the training datapoints, such that

$$w = \sum_{i=1}^{n} \alpha_i y_i x_i, \quad \text{with } \alpha_i \ge 0.$$

Using this, we can formulate the following decision function:

$$f(x) = \operatorname{sign}\left( \sum_{i=1}^{n} \alpha_i y_i \langle x_i, x \rangle + b \right).$$

The hyperplane is defined only by those training points $x_i$ for which $\alpha_i > 0$ holds, called support vectors, hence the name Support Vector Machine.

Optimal hyperplane for linearly separable data

If data is linearly separable, then there will usually be many different hyperplanes to choose from. It makes sense to choose a hyperplane that separates the two clouds as clearly as possible, which is done by selecting a hyperplane of maximum margin, that is, maximum distance to the nearest point in either cloud (D).

[Figure: panels showing candidate separating hyperplanes, including (D).]
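As a minimal sketch of the decision function and of conditions (12.1) and (12.2), the following Python code classifies points with a given hyperplane, once from w directly and once from coefficients $\alpha_i$. The toy datapoints, the hyperplane, the coefficients and all function names are assumptions made for illustration; they are hand-picked, not the result of training.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Scalar product <u, v>."""
    return sum(a * b for a, b in zip(u, v))


def sign(value: float) -> int:
    """Return +1, -1, or 0 (0 means the point lies exactly on H)."""
    return (value > 0) - (value < 0)


def classify(w: Vector, b: float, x: Vector) -> int:
    """Decision function f(x) = sign(<w, x> + b)."""
    return sign(dot(w, x) + b)


def separates(w: Vector, b: float, X: Sequence[Vector], y: Sequence[int]) -> bool:
    """Check conditions (12.1) and (12.2) for every training datapoint."""
    return all(classify(w, b, x_i) == y_i for x_i, y_i in zip(X, y))


def normal_from_alphas(alphas, X, y) -> List[float]:
    """Normal vector w = sum_i alpha_i * y_i * x_i."""
    return [
        sum(a * y_i * x_i[j] for a, x_i, y_i in zip(alphas, X, y))
        for j in range(len(X[0]))
    ]


def classify_dual(alphas, b, X, y, x) -> int:
    """Decision function f(x) = sign(sum_i alpha_i y_i <x_i, x> + b)."""
    total = sum(a * y_i * dot(x_i, x) for a, x_i, y_i in zip(alphas, X, y))
    return sign(total + b)


# Hypothetical toy data (not from the lecture).
X_train = [(2.0, 2.0), (3.0, 1.0), (-1.0, -2.0), (-2.0, 0.0)]
y_train = [+1, +1, -1, -1]

# Hand-picked hyperplane x1 + x2 = 0 (an assumption, not a trained result).
w, b = (1.0, 1.0), 0.0

# Hand-picked non-negative coefficients (also an assumption).
alphas = [0.1, 0.2, 0.3, 0.0]

if __name__ == "__main__":
    print("separates training set:", separates(w, b, X_train, y_train))
    print("label of (0.5, 1.0):", classify(w, b, (0.5, 1.0)))
    print("w from alphas:", normal_from_alphas(alphas, X_train, y_train))
```

Note that the fourth datapoint has $\alpha_4 = 0$ and therefore does not influence the hyperplane built from the coefficients, which mirrors the role of non-support vectors.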

To address this, w and b are scaled so that $|\langle w, x_i \rangle + b| = 1$ holds for the support vectors. The distance from the hyperplane to the support vectors is then $\frac{1}{\|w\|}$, which equals the margin of the hyperplane. The problem of determining w so as to maximize the margin of the corresponding hyperplane can therefore be formulated as the problem of maximizing $\frac{1}{\|w\|}$ or, equivalently, of minimizing $\|w\|^2$.

Given a training set X, we can find the hyperplane that maximizes the margin by solving this:

Optimal Separating Hyperplane (OSH) Problem:

$$\min_{w,b} \left\{ \frac{1}{2} \|w\|^2 \right\} \quad \text{subject to } y_i (\langle w, x_i \rangle + b) \ge 1 \text{ for } i = 1, \dots, n.$$

The optimization problem is solved using the Lagrange function and dualization, the details of which we will not pursue further.

Support vectors lie on the boundary of the margin.

[Figure: maximum-margin hyperplane with the support vectors on the margin boundary.]

Soft Margin SVM

Above we assume that a hyperplane exists that separates all positive and negative training datapoints. Such perfect linear separation is usually not possible, and some training datapoints will violate the margin constraints.

[Figure: training datapoints that violate the margin.]
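To make the OSH problem more concrete, the sketch below evaluates its ingredients for two hand-picked candidate hyperplanes on a small toy set: the constraint values $y_i(\langle w, x_i \rangle + b)$, the margin $\frac{1}{\|w\|}$, the objective $\frac{1}{2}\|w\|^2$, and the indices of the points that meet the constraint with equality. It does not solve the optimization problem; the data, the candidates and the function names are assumptions for illustration.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Scalar product <u, v>."""
    return sum(a * b for a, b in zip(u, v))


def constraint_values(w, b, X, y) -> List[float]:
    """Values y_i * (<w, x_i> + b); the OSH problem requires each to be >= 1."""
    return [y_i * (dot(w, x_i) + b) for x_i, y_i in zip(X, y)]


def is_feasible(w, b, X, y, tol: float = 1e-9) -> bool:
    """True if (w, b) satisfies all OSH constraints (up to a tolerance)."""
    return all(v >= 1.0 - tol for v in constraint_values(w, b, X, y))


def margin(w: Vector) -> float:
    """Margin 1 / ||w|| of a hyperplane scaled as in the OSH problem."""
    return 1.0 / math.sqrt(dot(w, w))


def objective(w: Vector) -> float:
    """OSH objective (1/2) * ||w||^2."""
    return 0.5 * dot(w, w)


def on_margin(w, b, X, y, tol: float = 1e-9) -> List[int]:
    """Indices of datapoints with y_i * (<w, x_i> + b) = 1 (margin boundary)."""
    return [i for i, v in enumerate(constraint_values(w, b, X, y))
            if abs(v - 1.0) <= tol]


# Hypothetical toy data (not from the lecture).
X_train = [(1.0, 1.0), (2.0, 3.0), (-1.0, -1.0), (-3.0, -2.0)]
y_train = [+1, +1, -1, -1]

# Two hand-picked feasible candidates (assumptions, not solver output).
candidate_a = ((0.5, 0.5), 0.0)
candidate_b = ((1.0, 0.0), 0.0)

if __name__ == "__main__":
    for name, (w, b) in [("a", candidate_a), ("b", candidate_b)]:
        print(name, is_feasible(w, b, X_train, y_train),
              margin(w), objective(w), on_margin(w, b, X_train, y_train))
```

Both candidates satisfy the constraints, but candidate a has the smaller objective and hence the larger margin. For this toy set its margin equals half the distance between the closest pair of points from opposite classes, which no separating hyperplane can exceed.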

This can be handled by a so-called soft margin SVM, as follows. Slack variables $\xi_i$ are introduced that allow datapoints to violate the margin. A constant C is used to penalize violations, and the optimization problem is then phrased as:

$$\min_{w,b,\xi} \left\{ \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \xi_i \right\} \quad \text{subject to } y_i (\langle w, x_i \rangle + b) \ge 1 - \xi_i, \quad \xi_i \ge 0.$$

Note that C must be chosen with care: choosing C too big will lead to overfitting, while choosing C too small will allow too many violations and will thus lead to a failure in learning.

Non linearly separable data

What if the data are so far from being linearly separable that a soft margin is not the solution? One idea would be to use a more complex, non-linear function to define the decision boundary (here between red and green). A better idea is to define a mapping Φ of the points into a higher-dimensional space in which they are separable.

[Figure: red and green points that are not linearly separable, and their images under Φ in a higher-dimensional space.]
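The idea of mapping into a higher-dimensional space can be illustrated with a deliberately small sketch. It assumes hypothetical one-dimensional data and the map $x \mapsto (x, x^2)$, neither of which is taken from the lecture: no threshold on the line separates the two classes, but a hand-picked hyperplane in the plane does.

```python
from typing import List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Scalar product <u, v>."""
    return sum(a * b for a, b in zip(u, v))


def sign(value: float) -> int:
    """Return +1, -1, or 0."""
    return (value > 0) - (value < 0)


def separable_1d(xs: Sequence[float], ys: Sequence[int]) -> bool:
    """In one dimension, linear separability means one class lies entirely
    to the left of the other."""
    pos = [x for x, y in zip(xs, ys) if y == +1]
    neg = [x for x, y in zip(xs, ys) if y == -1]
    return max(neg) < min(pos) or max(pos) < min(neg)


def phi(x: float) -> Tuple[float, float]:
    """Illustrative feature map R -> R^2 (an assumption for this sketch)."""
    return (x, x * x)


# Hypothetical data: the positive class lies on both sides of the negative one.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [+1, -1, -1, -1, +1]

# Hand-picked hyperplane in feature space: z2 - 2.5 = 0.
w, b = (0.0, 1.0), -2.5

predictions: List[int] = [sign(dot(w, phi(x)) + b) for x in xs]

if __name__ == "__main__":
    print("separable on the line:", separable_1d(xs, ys))
    print("predictions in feature space:", predictions)
```

After the mapping, the second coordinate $x^2$ is large for the outer points and small for the inner ones, so a single linear threshold on it reproduces all labels.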

In more detail, we map all datapoints into a higher-dimensional space $\mathcal{H}$ in which a scalar product is defined, using a feature map Φ:

$$\Phi : \mathbb{R}^k \to \mathcal{H}, \quad x \mapsto \Phi(x),$$

and then we separate the points $\{(\Phi(x_i), y_i)\}_{i=1}^{n}$ in $\mathcal{H}$. Note that the dimension of the so-called feature space $\mathcal{H}$ is usually much higher than that of the training datapoints. A simple example:

$$\Phi : \mathbb{R}^2 \to \mathbb{R}^3, \quad (g_1, g_2) \mapsto (z_1, z_2, z_3) := (g_1^2, \sqrt{2}\, g_1 g_2, g_2^2).$$

In this approach, we need to:

- compute a separating hyperplane in the high-dimensional feature space $\mathcal{H}$, and
- compute scalar products of the form $\langle \Phi(p), \Phi(q) \rangle$.

This can be difficult or impossible if the dimension of $\mathcal{H}$ is very big.

The kernel trick

The mentioned problems can be addressed using the kernel trick: Rather than using a scalar product in $\mathcal{H}$, we use a kernel function K in $\mathbb{R}^k$, which both replaces the map Φ and plays the role of a scalar product in $\mathcal{H}$.

- In the decision function, both input datapoints and training datapoints are only involved in scalar products with each other, thus allowing the use of a kernel function.
- We did not discuss how to determine w and b; this calculation makes use of the training datapoints only through the computation of scalar products, again allowing the use of a kernel function.

Example: Assume we are given datapoints $p = (g_1, g_2)$ and $q = (h_1, h_2)$. Apply the feature map $\Phi : (g_1, g_2) \mapsto (g_1^2, \sqrt{2}\, g_1 g_2, g_2^2)$ and then compute the scalar product:

$$\begin{aligned}
\langle \Phi(p), \Phi(q) \rangle &= (g_1^2, \sqrt{2}\, g_1 g_2, g_2^2)\,(h_1^2, \sqrt{2}\, h_1 h_2, h_2^2)^t \\
&= g_1^2 h_1^2 + 2 g_1 h_1 g_2 h_2 + g_2^2 h_2^2 \\
&= (g_1 h_1 + g_2 h_2)^2 \\
&= \langle p, q \rangle^2 =: K(p, q).
\end{aligned}$$

In this example, we can compute the scalar product between Φ(p) and Φ(q) directly, without applying Φ: We simply compute the square of the scalar product of p and q in $\mathbb{R}^2$.

The necessary conditions for a kernel function are as follows (based on Mercer's Theorem):

Definition (Kernel function) A function $K : \mathbb{R}^k \times \mathbb{R}^k \to \mathbb{R}$ is a kernel function if

1. it is symmetric, i.e. we have $K(p, q) = K(q, p)$ for all p, q, and
2. the kernel matrix K with entries $K_{ij} = K(x_i, x_j)$ is positive (semi)-definite for all training datapoints $\{x_i\}_{i=1}^{n}$, that is, we have

$$a^t K a = \sum_{i,j} a_i a_j K_{ij} \ge 0 \quad \forall a \in \mathbb{R}^n.$$
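The identity $\langle \Phi(p), \Phi(q) \rangle = \langle p, q \rangle^2$ and the two conditions of the definition can be checked numerically. The sketch below does this for a few hypothetical datapoints (not from the lecture). The check of the second condition only samples random vectors a, so it is a spot check under that assumption and not a proof of positive semi-definiteness.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Scalar product <u, v>."""
    return sum(a * b for a, b in zip(u, v))


def phi(p: Vector) -> Tuple[float, float, float]:
    """Feature map (g1, g2) -> (g1^2, sqrt(2) g1 g2, g2^2)."""
    g1, g2 = p
    return (g1 * g1, math.sqrt(2.0) * g1 * g2, g2 * g2)


def kernel(p: Vector, q: Vector) -> float:
    """K(p, q) = <p, q>^2, computed in R^2 without applying phi."""
    return dot(p, q) ** 2


def kernel_matrix(points: Sequence[Vector],
                  k: Callable[[Vector, Vector], float]) -> List[List[float]]:
    """Kernel matrix with entries K_ij = k(x_i, x_j)."""
    return [[k(x_i, x_j) for x_j in points] for x_i in points]


def quadratic_form(K: List[List[float]], a: Sequence[float]) -> float:
    """a^t K a = sum_{i,j} a_i a_j K_ij."""
    n = len(a)
    return sum(a[i] * a[j] * K[i][j] for i in range(n) for j in range(n))


# Hypothetical datapoints (not from the lecture).
points = [(1.0, 2.0), (-0.5, 0.3), (2.0, -1.0), (0.0, 1.5)]
K = kernel_matrix(points, kernel)

# Condition 1: symmetry of the kernel matrix.
is_symmetric = all(K[i][j] == K[j][i]
                   for i in range(len(points)) for j in range(len(points)))

# Condition 2 (spot check): a^t K a >= 0 for randomly drawn vectors a.
rng = random.Random(0)
samples = [[rng.uniform(-1.0, 1.0) for _ in points] for _ in range(200)]
min_form = min(quadratic_form(K, a) for a in samples)

if __name__ == "__main__":
    p, q = points[0], points[1]
    print("explicit:", dot(phi(p), phi(q)), "kernel:", kernel(p, q))
    print("symmetric:", is_symmetric, "smallest a^t K a seen:", min_form)
```

The kernel evaluation needs one scalar product in $\mathbb{R}^2$ and one squaring, whereas the explicit route first builds two vectors in $\mathbb{R}^3$; for feature spaces of much higher dimension this difference is the point of the trick.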

Why is this a trick? When using a kernel function, we do not need to know anything about the feature space. All we need is a function that provides a similarity measure. An ideal kernel function assigns a higher similarity score to any pair of objects that belong to the same class than it does to any pair of objects from different classes. This is the case if the implicit mapping by the kernel function pulls similar objects close together and pushes dissimilar objects further apart from each other in the induced feature space.

Examples of kernel functions

Frequently used kernel functions are:

- Linear kernel: $K(x_i, x_j) = \langle x_i, x_j \rangle$
- Radial basis function (rbf) kernel: $K(x_i, x_j) = \exp\left( -\frac{\|x_i - x_j\|^2}{2\sigma_0^2} \right)$
- Polynomial kernel: $K(x_i, x_j) = (s \langle x_i, x_j \rangle + c)^d$
- Sigmoid kernel: $K(x_i, x_j) = \tanh(s \langle x_i, x_j \rangle + c)$
- Convex combinations of kernels: $K(x_i, x_j) = \lambda_1 K_1(x_i, x_j) + \lambda_2 K_2(x_i, x_j)$
- Normalization kernel: $K(x_i, x_j) = \frac{K'(x_i, x_j)}{\sqrt{K'(x_i, x_i)\, K'(x_j, x_j)}}$

where s, c, d and $\lambda_i$ are kernel-specific parameters, and $\sigma_0^2 = \operatorname{mean} \|x_i - x_j\|$.

Summary

- The most typical application of SVMs is in binary classification of new datapoints, based on a training set of labeled examples.
- If the datapoints are linearly separable, then a simple scalar product calculation can be used to classify new points.
- Non linearly separable data may become separable after mapping to a higher-dimensional feature space. The kernel trick allows one to do this implicitly.
- Soft margin SVMs allow some training datapoints to appear on the wrong side of the plane.
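The kernel functions listed above can be written down directly. The following sketch is one possible plain-Python rendering; the function names, the default parameter values and the way $\sigma_0^2$ is passed in as an argument are assumptions for illustration and are not prescribed by the lecture.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]
Kernel = Callable[[Vector, Vector], float]


def dot(u: Vector, v: Vector) -> float:
    """Scalar product <u, v>."""
    return sum(a * b for a, b in zip(u, v))


def linear(x: Vector, z: Vector) -> float:
    """Linear kernel <x, z>."""
    return dot(x, z)


def rbf(x: Vector, z: Vector, sigma_sq: float = 1.0) -> float:
    """Radial basis function kernel exp(-||x - z||^2 / (2 * sigma_sq))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma_sq))


def polynomial(x: Vector, z: Vector, s: float = 1.0, c: float = 1.0, d: int = 2) -> float:
    """Polynomial kernel (s * <x, z> + c)^d."""
    return (s * dot(x, z) + c) ** d


def sigmoid(x: Vector, z: Vector, s: float = 1.0, c: float = 0.0) -> float:
    """Sigmoid kernel tanh(s * <x, z> + c)."""
    return math.tanh(s * dot(x, z) + c)


def convex_combination(k1: Kernel, k2: Kernel, lam1: float) -> Kernel:
    """lam1 * k1 + (1 - lam1) * k2, with 0 <= lam1 <= 1."""
    return lambda x, z: lam1 * k1(x, z) + (1.0 - lam1) * k2(x, z)


def normalized(k: Kernel) -> Kernel:
    """Normalization kernel k(x, z) / sqrt(k(x, x) * k(z, z))."""
    return lambda x, z: k(x, z) / math.sqrt(k(x, x) * k(z, z))


if __name__ == "__main__":
    x, z = (1.0, 2.0), (3.0, 4.0)
    mixed = convex_combination(linear, rbf, 0.5)
    for name, k in [("linear", linear), ("rbf", rbf), ("polynomial", polynomial),
                    ("sigmoid", sigmoid), ("mixed", mixed),
                    ("normalized polynomial", normalized(polynomial))]:
        print(name, k(x, z))
```

Because every function has the same two-argument shape, any of them can be substituted wherever a scalar product $\langle x_i, x \rangle$ appears in the decision function.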
