Data Mining: Concepts and Techniques
Chapter 9 Classification: Support Vector Machines

Support Vector Machines (SVMs)

SVMs are a set of related supervised learning methods used for classification. Based on the training data set, an SVM solves an optimization problem to find the maximum-margin hyperplane, which is then used to classify new data instances.

Applications: pattern recognition, classification, learning, decision making for games

Types of SVMs: linear vs. nonlinear; binary vs. multi-class; internal vs. external

SVM: Support Vector Machines

A relatively new classification method for both linear and nonlinear data. It uses a nonlinear mapping to transform the original training data into a higher dimension. In the new dimension, it searches for the linear optimal separating hyperplane (i.e., the decision boundary). With an appropriate nonlinear mapping to a sufficiently high dimension, data from two classes can always be separated by a hyperplane. SVM finds this hyperplane using support vectors (the essential training tuples) and margins (defined by the support vectors).

SVM History and Applications

Vapnik and colleagues (1992); groundwork from Vapnik & Chervonenkis' statistical learning theory in the 1960s.
Features: training can be slow, but accuracy is high owing to the ability to model complex nonlinear decision boundaries (margin maximization).
Used for: classification and numeric prediction.
Applications: handwritten digit recognition, object recognition, speaker identification, benchmarking time-series prediction tests.

Decision Boundary

Consider a two-class, linearly separable classification problem. There are many possible decision boundaries. Are all decision boundaries equally good?

Basic Idea

Notation: boldface x, w denote vectors; x, y, w denote scalars.
Input: training pairs {(x_1, y_1), ...}
Output: a classification function f(x) such that
  f(x_i) > 0 for y_i = +1
  f(x_i) < 0 for y_i = -1
The decision boundary is the set where f(x) = w^T x + b = 0, which in two dimensions reads w_1 x_1 + w_2 x_2 + b = 0.
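To make the notation concrete, here is a minimal sketch (not from the slides) of the classification function f(x) = w^T x + b in Python with NumPy; the weight vector, bias, and test point are made-up values chosen only to illustrate the sign rule.

```python
import numpy as np

def linear_decision(x, w, b):
    """Signed value of f(x) = w^T x + b; the sign gives the predicted class."""
    return np.dot(w, x) + b

# Hypothetical weights, bias, and test point (illustration only).
w = np.array([1.0, -2.0])
b = 0.5
x = np.array([3.0, 1.0])
print(linear_decision(x, w, b))   # 1.5 > 0, so predict y = +1
```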

SVM General Philosophy

[Figure: two separating hyperplanes, one with a small margin and one with a large margin; the support vectors are the points lying on the margin.]

SVM When Data Is Linearly Separable

Let the data D be (X_1, y_1), ..., (X_|D|, y_|D|), where each X_i is a training tuple associated with the class label y_i.
There are infinitely many lines (hyperplanes) separating the two classes, but we want to find the best one (the one that minimizes classification error on unseen data).
SVM searches for the hyperplane with the largest margin, i.e., the maximum marginal hyperplane (MMH).

Large-Margin Decision Boundary

The decision boundary should be as far away from the data of both classes as possible, so we should maximize the margin m. The distance between the origin and the line w^T x = k is k/||w||; consequently, the margin between the two sides of the boundary works out to m = 2/||w||.

SVM: Linearly Separable Case

A separating hyperplane can be written as W . X + b = 0, where W = {w_1, w_2, ..., w_n} is a weight vector and b is a scalar (bias). For 2-D data it can be written as w_0 + w_1 x_1 + w_2 x_2 = 0 (with w_0 playing the role of b).
The hyperplanes defining the sides of the margin are:
  H_1: w_0 + w_1 x_1 + w_2 x_2 >= 1 for y_i = +1, and
  H_2: w_0 + w_1 x_1 + w_2 x_2 <= -1 for y_i = -1
Any training tuples that fall on hyperplane H_1 or H_2 (i.e., on the sides defining the margin) are support vectors.
Finding the MMH is a constrained (convex) quadratic optimization problem: a quadratic objective function with linear constraints, solved with quadratic programming (QP) and Lagrange multipliers.
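As an illustration of the maximum-margin hyperplane, here is a sketch using scikit-learn (not part of the slides); the toy points and the very large C used to approximate the hard-margin case are my own assumptions. The fitted model exposes W, b, the support vectors, and hence the margin 2/||W||.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up, linearly separable toy data.
X = np.array([[1, 1], [2, 2], [2, 0],        # class +1
              [-1, -1], [-2, -2], [-2, 0]])  # class -1
y = np.array([+1, +1, +1, -1, -1, -1])

# A very large C approximates the hard-margin SVM for separable data.
clf = SVC(kernel="linear", C=1e6).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, "b =", b)
print("support vectors:\n", clf.support_vectors_)
print("margin 2/||w|| =", 2 / np.linalg.norm(w))
```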

Finding the Decision Boundary

Let {x_1, ..., x_n} be our data set and let y_i in {+1, -1} be the class label of x_i. The decision boundary should classify all points correctly, i.e., y_i (w^T x_i + b) >= 1 for all i.
The decision boundary can be found by solving the following constrained optimization problem:
  minimize (1/2) ||w||^2
  subject to y_i (w^T x_i + b) >= 1 for i = 1, ..., n
This is a constrained optimization problem: determine w and b, define the hyperplane x^T w + b = 0, and use the decision function f(x) = sign(w^T x + b) to classify new data points.

Constrained Optimization

Suppose we want to minimize f(x) subject to g(x) = 0. A necessary condition for x_0 to be a solution is that the gradient of the Lagrangian L(x, a) = f(x) + a g(x) vanishes at x_0; a is the Lagrange multiplier.
For multiple constraints g_i(x) = 0, i = 1, ..., m, we need a Lagrange multiplier a_i for each of the constraints, and the condition becomes grad f(x_0) + sum_i a_i grad g_i(x_0) = 0.
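The primal problem above can also be handed directly to a generic constrained optimizer. The following is only a sketch, assuming SciPy is available; the toy data are invented, and SLSQP stands in for the specialized QP solvers an SVM package would actually use.

```python
import numpy as np
from scipy.optimize import minimize

# Made-up, linearly separable toy data.
X = np.array([[2.0, 2.0], [2.0, 0.0], [-2.0, -2.0], [-2.0, 0.0]])
y = np.array([+1.0, +1.0, -1.0, -1.0])

def objective(p):                    # p = [w1, w2, b]
    w = p[:2]
    return 0.5 * np.dot(w, w)        # (1/2)||w||^2

# One inequality constraint per training point: y_i (w^T x_i + b) - 1 >= 0
constraints = [{"type": "ineq",
                "fun": lambda p, xi=xi, yi=yi: yi * (np.dot(p[:2], xi) + p[2]) - 1.0}
               for xi, yi in zip(X, y)]

res = minimize(objective, x0=np.zeros(3), method="SLSQP", constraints=constraints)
w, b = res.x[:2], res.x[2]
print("w =", w, "b =", b, "margin =", 2 / np.linalg.norm(w))
```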

Constrained Optimization (continued)

The case of inequality constraints g_i(x) <= 0 is similar, except that the Lagrange multipliers a_i must be non-negative. If x_0 is a solution to the constrained optimization problem, there must exist a_i >= 0 for i = 1, ..., m such that x_0 satisfies
  grad f(x_0) + sum_i a_i grad g_i(x_0) = 0
The function L(x, a) = f(x) + sum_i a_i g_i(x) is also known as the Lagrangian; we want to set its gradient to 0.

Back to the Original Problem

The Lagrangian is
  L = (1/2) w^T w - sum_i a_i [ y_i (w^T x_i + b) - 1 ]
Note that ||w||^2 = w^T w. Setting the gradient of L with respect to w and b to zero, we have
  w = sum_i a_i y_i x_i  and  sum_i a_i y_i = 0

The Dual Problem

If we substitute w = sum_i a_i y_i x_i into the Lagrangian, we have
  L = sum_i a_i - (1/2) sum_i sum_j a_i a_j y_i y_j (x_i^T x_j)
Note that sum_i a_i y_i = 0, so the term involving b vanishes. This is a function of the a_i only.

The new objective function is in terms of the a_i only. It is known as the dual problem: if we know w, we know all a_i; if we know all a_i, we know w. The original problem is known as the primal problem. The objective function of the dual problem needs to be maximized. The dual problem is therefore:
  maximize sum_i a_i - (1/2) sum_i sum_j a_i a_j y_i y_j (x_i^T x_j)
  subject to a_i >= 0 for all i and sum_i a_i y_i = 0
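The dual can be written as a standard QP and handed to an off-the-shelf solver. The sketch below assumes the cvxopt package and reuses the same invented toy points; a tiny ridge is added to the quadratic term purely for numerical stability.

```python
import numpy as np
from cvxopt import matrix, solvers

X = np.array([[2.0, 2.0], [2.0, 0.0], [-2.0, -2.0], [-2.0, 0.0]])
y = np.array([+1.0, +1.0, -1.0, -1.0])
n = len(y)

K = X @ X.T                                         # Gram matrix of inner products x_i^T x_j
P = matrix(np.outer(y, y) * K + 1e-8 * np.eye(n))   # quadratic term (small ridge for stability)
q = matrix(-np.ones(n))                             # minimize (1/2) a^T P a - sum_i a_i
G = matrix(-np.eye(n))                              # -a_i <= 0, i.e. a_i >= 0
h = matrix(np.zeros(n))
A_eq = matrix(y.reshape(1, -1))                     # sum_i a_i y_i = 0
b_eq = matrix(0.0)

solvers.options["show_progress"] = False
a = np.ravel(solvers.qp(P, q, G, h, A_eq, b_eq)["x"])
w = (a * y) @ X                                     # recover w = sum_i a_i y_i x_i
print("a =", a)
print("w =", w)
```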

The Dual Problem (continued)

This is a quadratic programming (QP) problem, so a global maximum over the a_i can always be found. w can then be recovered as w = sum_i a_i y_i x_i.

Characteristics of the Solution

Many of the a_i are zero, so w is a linear combination of a small number of data points. The x_i with non-zero a_i are called support vectors (SV). The decision boundary is determined only by the SVs.
Let t_j (j = 1, ..., s) be the indices of the s support vectors. We can write
  w = sum_{j=1..s} a_{t_j} y_{t_j} x_{t_j}
For testing with a new data point z, compute
  f(z) = w^T z + b = sum_{j=1..s} a_{t_j} y_{t_j} (x_{t_j}^T z) + b
and classify z as class 1 if the sum is positive, and as class 2 otherwise.
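To see that only the support vectors matter at test time, the sketch below (again scikit-learn on the earlier made-up points, not something from the slides) recomputes f(z) from the stored support vectors, their coefficients a_{t_j} y_{t_j} (exposed as dual_coef_), and b (intercept_), and checks it against the library's own decision value.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [2, 2], [2, 0], [-1, -1], [-2, -2], [-2, 0]], dtype=float)
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)

z = np.array([0.5, 1.5])
# f(z) = sum_j a_tj y_tj (x_tj^T z) + b, using only the support vectors.
f_z = np.sum(clf.dual_coef_[0] * (clf.support_vectors_ @ z)) + clf.intercept_[0]
print(f_z, clf.decision_function([z])[0])   # the two values agree
print("class 1" if f_z > 0 else "class 2")
```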

A Geometrical Interpretation

[Figure: a trained SVM on two classes; most points have a_i = 0, and only the points on the margin have non-zero multipliers, e.g., a_1 = 0.8, a_6 = 1.4, a_8 = 0.6.]

Why Is SVM Effective on High-Dimensional Data?

The complexity of the trained classifier is characterized by the number of support vectors rather than by the dimensionality of the data.
The support vectors are the essential or critical training examples; they lie closest to the decision boundary (MMH).
If all other training examples were removed and training were repeated, the same separating hyperplane would be found.
The number of support vectors can be used to compute an (upper) bound on the expected error rate of the SVM classifier, which is independent of the data dimensionality.
Thus, an SVM with a small number of support vectors can have good generalization, even when the dimensionality of the data is high.

Extension to a Non-linear Decision Boundary

So far, we have only considered large-margin classifiers with a linear decision boundary. How can they be generalized to become nonlinear?
Key idea: transform x_i to a higher-dimensional space to make life easier.
  Input space: the space where the points x_i are located.
  Feature space: the space of Φ(x_i) after the transformation.
Why transform? A linear operation in the feature space is equivalent to a nonlinear operation in the input space, and classification can become easier with a proper transformation.

Transforming the Data

[Figure: points in the input space are mapped by Φ(.) into the feature space, where a linear boundary separates the two classes.]

Note: in practice the feature space is of higher dimension than the input space. Computation in the feature space can be costly because it is high-dimensional; the feature space may even be infinite-dimensional. The kernel trick comes to the rescue.
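A small sketch of the transformation idea (an assumed example, not one from the slides): points inside a disk versus points outside it cannot be separated by a line in the input space, but after an explicit degree-2 feature map a linear boundary works well.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))
y = np.where(np.sum(X**2, axis=1) < 1.0, 1, -1)   # inner disk vs. outer region

def phi(X):
    # Degree-2 feature map Phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2)
    return np.column_stack([X[:, 0]**2, np.sqrt(2) * X[:, 0] * X[:, 1], X[:, 1]**2])

Phi = phi(X)
acc_input = SVC(kernel="linear").fit(X, y).score(X, y)        # linear boundary in input space
acc_feature = SVC(kernel="linear").fit(Phi, y).score(Phi, y)  # linear boundary in feature space
print(acc_input, acc_feature)   # the feature-space accuracy is noticeably higher
```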

Kernel Functions for Nonlinear Classification

Instead of computing the dot product on the transformed data, it is mathematically equivalent to apply a kernel function K(X_i, X_j) to the original data, i.e., K(X_i, X_j) = Φ(X_i) . Φ(X_j).

Typical Kernel Functions

  Polynomial kernel of degree h: K(X_i, X_j) = (X_i . X_j + 1)^h
  Gaussian radial basis function (RBF) kernel: K(X_i, X_j) = exp(-||X_i - X_j||^2 / (2 sigma^2))
  Sigmoid kernel: K(X_i, X_j) = tanh(kappa X_i . X_j - delta)

SVMs can also be used for classifying multiple (> 2) classes and for regression analysis (with additional parameters).

Modification Due to Kernel Function

Change all inner products to kernel functions. For training, the dual objective becomes:
  Original: maximize sum_i a_i - (1/2) sum_i sum_j a_i a_j y_i y_j (x_i^T x_j)
  With kernel function: maximize sum_i a_i - (1/2) sum_i sum_j a_i a_j y_i y_j K(x_i, x_j)
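The identity K(X_i, X_j) = Φ(X_i) . Φ(X_j) can be checked numerically for the degree-2 case. This small sketch (my own example, not from the slides) uses the homogeneous kernel K(x, z) = (x . z)^2, whose feature map in two dimensions is Φ(x) = (x1^2, sqrt(2) x1 x2, x2^2).

```python
import numpy as np

def phi(x):
    # Explicit feature map for the degree-2 homogeneous polynomial kernel (2-D input).
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def poly2_kernel(x, z):
    # The same inner product computed directly in the input space.
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(np.dot(phi(x), phi(z)), poly2_kernel(x, z))   # both equal 1.0
```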

Modification Due to Kernel Function (continued)

For testing, the new data point z is classified as class 1 if f(z) >= 0 and as class 2 if f(z) < 0:
  Original: f(z) = sum_{j=1..s} a_{t_j} y_{t_j} (x_{t_j}^T z) + b
  With kernel function: f(z) = sum_{j=1..s} a_{t_j} y_{t_j} K(x_{t_j}, z) + b

Example

[Figure: value of the discriminant function for a 1-D example with class 1 points at x = 1, 2, 6 and class 2 points at x = 4, 5; a nonlinear kernel yields a discriminant that is positive around the class 1 points and negative around the class 2 points.]
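As a sketch of this example (the kernel choice and parameters below are assumptions; the slides only show the resulting discriminant), a degree-2 polynomial kernel separates the 1-D data with class 1 at x = 1, 2, 6 and class 2 at x = 4, 5, which no linear boundary can.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0], [2.0], [4.0], [5.0], [6.0]])
y = np.array([1, 1, 2, 2, 1])          # class 1 at 1, 2, 6; class 2 at 4, 5

# A degree-2 polynomial kernel; a large C approximates the hard-margin case.
clf = SVC(kernel="poly", degree=2, coef0=1.0, C=1e6).fit(X, y)

print(clf.predict(X))                   # reproduces the training labels: [1 1 2 2 1]
print(clf.predict([[4.5]]))             # a point between 4 and 5 -> class 2
```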

Strengths and Weaknesses of SVM

Strengths:
  Training is relatively easy: there are no local optima, unlike in neural networks.
  It scales relatively well to high-dimensional data.
  The tradeoff between classifier complexity and error can be controlled explicitly.
Weaknesses:
  A good kernel function needs to be chosen.

SVM vs. Neural Network

SVM:
  Deterministic algorithm
  Nice generalization properties
  Harder to train: learned in batch mode using quadratic programming techniques
  Using kernels, can learn very complex functions
Neural Network:
  Nondeterministic algorithm
  Generalizes well but doesn't have a strong mathematical foundation
  Can easily be learned in an incremental fashion
  To learn complex functions, use a multilayer perceptron (nontrivial)

SVM Related Links

SVM website: http://www.kernel-machines.org/
Representative implementations:
  LIBSVM: an efficient implementation of SVM supporting multiclass classification, nu-SVM, and one-class SVM, with interfaces for Java, Python, and other languages
  SVM-light: simpler, but its performance is not better than LIBSVM's; supports only binary classification and is available only in C
  SVM-torch: another recent implementation, also written in C