SELF-ADAPTIVE SUPPORT VECTOR MACHINES


SELF-ADAPTIVE SUPPORT VECTOR MACHINES AND AUTOMATIC FEATURE SELECTION

By PENG DU, M.Sc., B.Sc.

A thesis submitted to the School of Graduate Studies in partial fulfillment of the requirements for the degree Master of Science, McMaster University. © Copyright by Peng Du, June 2004

MASTER OF SCIENCE (2004), COMPUTING & SOFTWARE, McMaster University, Hamilton, Ontario. TITLE: SELF-ADAPTIVE SUPPORT VECTOR MACHINES AND AUTOMATIC FEATURE SELECTION. AUTHOR: Peng Du, M.Sc., B.Sc. (McMaster University). SUPERVISORS: Dr. Tamás Terlaky, Dr. Jiming Peng. NUMBER OF PAGES: vi, 75

Abstract

We handle the problem of model and feature selection for Support Vector Machines (SVMs) in this thesis. To select the best model for a given training set, we embed the standard convex quadratic problem of SVM training in an upper level problem where the optimal objective value of the quadratic problem is minimized over the feasible range of the kernel parameters. This leads to a bi-level optimization problem for which an optimal solution always exists. The problem of feature selection can be solved simultaneously under the same framework, except that the kernel function is extended by introducing an independent kernel parameter for each feature in the original space. Two solution strategies for the bi-level problems of these Self-Adaptive SVMs (SASVMs) are studied. Under the first strategy, the lower level problem is replaced by its Karush-Kuhn-Tucker (KKT) conditions, and the bi-level problem is then solved as a non-linear, non-convex one-level problem by general non-linear solvers. Experimental results show that this strategy can only handle small training sets. The second strategy is designed around the derivative free optimization (DFO) method. Its main idea is to minimize, over a controlled range of the kernel parameters, a quadratic model of the optimal objective value that is constructed from the solutions of the lower level problem at discrete values of the kernel parameters. Finally, SASVMs are applied to several benchmark data sets: multi-spectral and hyper-spectral Remote Sensing images, the MNIST data set of hand-written digits, and a breast cancer classification data set. The accuracies of SASVMs are compared with those achieved by other classifiers.

Acknowledgements

This work was built upon the ideas and help of my supervisors, Dr. J. Peng and Dr. T. Terlaky. I thank them for their continuous support and for the time and energy they spent teaching and guiding me. I also want to thank my colleagues at the research laboratory for making my study a wonderful experience that I will never forget. I would like to thank the members of my committee, Dr. J. Zucker, Dr. C. Anand, Dr. J. Peng and Dr. T. Terlaky, for their kind and careful examination and their useful suggestions. I am also indebted to Dr. T. Joachims for providing SVMlight and F. Ellen for the DFO software package. Finally, I want to thank my wife, Yu Wang. I could not have finished this work without her support.

Contents

Abstract
Acknowledgements

1 Support Vector Machines
   1.1 Introduction to Classification
   1.2 Learning Machines and Generalization Performance
       1.2.1 VC Dimension
       1.2.2 Generalization Performance
   1.3 Linear Support Vector Machines
       1.3.1 The Linearly Separable Case
       1.3.2 The Linearly Non-separable Case
   1.4 Non-linear Support Vector Machines
       1.4.1 Non-linear Mappings and Feature Spaces
       1.4.2 Kernel Functions

2 Self Adaptive Support Vector Machines
   2.1 Introduction
   2.2 The Restricted Training Task
   2.3 The Structural Training Task
       2.3.1 The VC Dimensions of SVMs with Polynomial and RBF Kernels
       2.3.2 Self Adaptive Support Vector Machines
       2.3.3 An Illustrative Example
   2.4 Automatic Feature Selection
       2.4.1 Extended Kernel Functions
       2.4.2 An Illustrative Example

3 Solving the BLP of SASVM
   3.1 General Bi-level Problems
   3.2 Converting to a One-level Problem
   3.3 Derivative Free Approach
       Motivations
       A Class of Derivative Free Algorithms

4 Implementation of SASVM Package
   Implementation Issues
   OOP for SASVM
   Major Classes in SASVM Package: Class Model, Class ModelSet, Class HypothesisSpace, Class Task
   Structure of SASVM
   How to Use SASVM Package: Running Environment, Use SASVM for Training, Use SASVM for Testing and Classification, Use SASVM for Feature Selection, SASVM's Error Codes

5 SASVMs for Real Classification Problems
   Thematic Mapper Remote Sensing Images: A Thematic Mapper Scene of Tippecanoe County, Result Analysis
   Hyper-spectral Remote Sensing Images: Indian Pine 1992 AVIRIS Image, Result Analysis
   Handwritten Digit Classification: Data Set Description, Handwritten Digit Classification
   Breast Cancer Diagnosis: Data Set Description, Performance on Breast Cancer Data Set

6 Conclusions

Chapter 1 Support Vector Machines

This chapter gives a basic introduction to Support Vector Machines (SVMs), a family of learning machines for classification problems. In Section 1.1, we introduce and discuss what classification is. In Section 1.2, we introduce some basic concepts and some important results regarding SVMs. In Sections 1.3 and 1.4, we discuss two types of SVMs: linear and non-linear SVMs.

1.1 Introduction to Classification

We deal with the classification problem in this thesis. Generally speaking, the classification problem is to classify a set of objects, which are commonly called instances, into pre-defined classes or categories. For instance, we can classify the earth's surface into pre-defined classes like residential areas, commercial areas and natural heritage areas. Furthermore, we can classify the natural heritage areas into forest, prairie, wet land and so on. Classification can be done in two different ways: unsupervised classification and supervised classification. The foundation of unsupervised classification is this: if two instances are very close to each other, it is very likely that they belong to the same class. The closeness of two instances is normally measured by the distance between them. There are many types of distances in the literature; for more information about distances, we refer to the book written by Landgrebe [21]. One distinct feature of unsupervised classification is that the classification is applied to a set of instances for which we do not know the true classes. On the other hand, supervised classification is a method based on a set

of instances with true classes known in advance. The given set of instances is further partitioned into two parts: a training set and a testing set. A supervised classification starts with a training process followed by a testing process, then goes back to the training set until a decision boundary is constructed. The decision boundary is a line or surface separating one class from the other. The algorithm used to construct a decision boundary from a training set is called a classifier. Since the process of constructing the decision boundary from a training set can also be considered as a process of learning the information contained in the training set, a classification algorithm is also called a learning machine. The term classifier is also used to refer to the constructed decision boundary, which is called a learned machine if the algorithm is called a learning machine. In this thesis, we discuss support vector machines, which are sets of supervised classification algorithms.

1.2 Learning Machines and Generalization Performance

Formally speaking, a classification problem can be expressed in this way: suppose that we are given a training set Z = {(x^1, y_1), (x^2, y_2), ..., (x^l, y_l)} of l instances, where each instance carries n attributes (x^i = (x^i_1, ..., x^i_n)) and a class label y_i ∈ {+1, -1}. We want to construct a decision boundary separating the two classes such that the probability of classification error, or misclassification, made by the resulting decision boundary on a new instance is minimized. The complete set of all possible instances is called a population; therefore a training set Z is always a subset of the corresponding population. Usually, we assume that a population follows a determined but unknown probability distribution P(x, y) and that the training set Z is drawn independently and identically distributed according to P(x, y). Since a decision boundary separates instances from two classes, what a learning machine learns is in fact a mapping x -> y from a set of mappings {f(x; α, b) : R^n -> {+1, -1}}. The functions f(x; α, b) themselves are labelled by the adjustable parameters α and b. The process of learning is to find the appropriate values for α and b based on the information in Z. In the literature, the set of possible mappings is usually called the hypothesis space, denoted by the symbol H. A learning machine is actually determined by the hypothesis space associated with it.

1.2.1 VC Dimension

To find the appropriate values for α and b, let us first take a close look at the hypothesis space H. Suppose we have a training set of size l; it can be labelled in 2^l possible ways. If for each labelling we can find at least one member of H which correctly assigns those labels, we say that the given set is shattered by the set of functions H. To better understand this concept, let us consider points in R^2 and a hypothesis space consisting of all possible linear functions in R^2. If we have only 3 points in R^2, and thus 8 possible labellings, then no matter how we label them, at least one linear function can be found that correctly assigns the labels (see Figure 1.1). However, if we have 4 points located at the 4 corners of a rectangle and we label them alternately, then no linear function can correctly assign the labels. Hence the hypothesis space of linear functions in R^2 can shatter at most 3 points. Based on the concept of shattering, we can define the VC dimension.

Figure 1.1: Three points in R^2 shattered by lines

Definition 1.2.1 [34] The VC dimension of a hypothesis space H is the maximum number of training points that can be shattered by H.

Note that the VC dimension is a property of a hypothesis space or a learning machine. It has nothing to do with particular training sets, although it is defined in terms of the sizes of training sets. The VC dimension measures the capacity of a hypothesis space to assign a training set correctly no matter how the training
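To make the shattering argument concrete, the short sketch below enumerates every ±1 labelling of a point set and checks whether a linear classifier can realize it. It is only an illustration under stated assumptions: scikit-learn's SVC with a very large C is used as a stand-in for a hard-margin linear machine and is not part of the thesis software.

```python
# Sketch: check empirically which labellings of a point set a linear
# classifier can realize (assumes scikit-learn; illustrative only).
import itertools
import numpy as np
from sklearn.svm import SVC

def can_shatter(points):
    """Return True if every +/-1 labelling of `points` is linearly separable."""
    n = len(points)
    for labels in itertools.product([-1, 1], repeat=n):
        if len(set(labels)) < 2:              # one-class labellings are trivially separable
            continue
        clf = SVC(kernel="linear", C=1e6)     # large C approximates a hard margin
        clf.fit(points, labels)
        if clf.score(points, labels) < 1.0:   # this labelling cannot be realized
            return False
    return True

three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
four  = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(can_shatter(three))  # expected: True  (3 points in R^2 are shattered)
print(can_shatter(four))   # expected: False (the alternating/XOR labelling fails)
```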

set is labelled. A hypothesis space with a higher VC dimension usually means more flexibility to simulate complex decision boundaries.

1.2.2 Generalization Performance

Since a higher VC dimension usually means more flexibility to simulate complex decision boundaries, why don't we exclusively use hypothesis spaces with high VC dimensions? A simple answer is that our ultimate goal in a classification task is to achieve high generalization performance. Theoretically, generalization performance can be measured as the actual risk, which is defined as:

    R(\alpha, b) = \int \frac{1}{2} |y - f(x; \alpha, b)| \, dP(x, y).

This formula represents the actual risk. However, unless we know what P(x, y) is, it is not easy to evaluate R(α, b). Fortunately, the empirical risk provides an estimate for R(α, b), though of course not the exact value of R(α, b). The empirical risk R_emp(α, b) on the training set Z is defined as:

    R_{emp}(\alpha, b) = \frac{1}{2l} \sum_{i=1}^{l} |y_i - f(x^i; \alpha, b)|.

R_emp(α, b) is a fixed value for a particular choice of α and b and a given training set with a finite number of instances. Based on the empirical risk and the VC dimension, we can build an upper bound on the actual risk, which is given in the following theorem.

Theorem 1.2.2 [34] Let H be a hypothesis space with VC dimension h. Let Z be a training set of size l. For any probability distribution P(x, y) on the population, with probability 1 - η, any hypothesis in H that achieves an empirical risk R_emp(α, b) on Z has actual risk no more than

    R(\alpha, b) \le R_{emp}(\alpha, b) + \sqrt{\frac{h\left(\log(2l/h) + 1\right) - \log(\eta/4)}{l}}.   (1.2.1)

The second term on the right-hand side is called the VC confidence. It is a function of the VC dimension: when the VC dimension increases, the VC confidence increases as well. As a result, we end up with a loose upper bound on R(α, b). If h is known in advance, we can easily compute the right-hand side.
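Since the labels and predictions take values in {−1, +1}, the empirical risk above is simply the fraction of misclassified training instances. A minimal sketch (the function and variable names are illustrative only):

```python
# Empirical risk R_emp = (1/2l) * sum_i |y_i - f(x^i)| for labels in {-1, +1};
# it equals the misclassification rate on the training set.
import numpy as np

def empirical_risk(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    l = len(y_true)
    return np.abs(y_true - y_pred).sum() / (2.0 * l)

y_true = np.array([ 1, -1,  1,  1, -1])
y_pred = np.array([ 1,  1,  1, -1, -1])   # two mistakes out of five
print(empirical_risk(y_true, y_pred))     # 0.4
```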

This theorem bears an important message applicable to any classification task. When we train a learning machine on a training set Z, we want not only to achieve a low empirical risk, but also to keep h as low as possible in order to achieve the lowest actual risk. To better understand this, let us imagine that we have a sequence of learning machines with their values of h given in a monotonic order (recall that one learning machine is just a set of functions f(x; α, b)). Suppose the confidence level η is fixed and that, for each learning machine in this sequence, there is no trouble in finding the best function which minimizes the right-hand side of (1.2.1). Hence, a sequence of best functions is found. By taking from this sequence the function that minimizes the right-hand side of (1.2.1), we are choosing the function that gives the lowest upper bound on the actual risk over the whole sequence. This gives a principle of training for a classification task, which is the essential idea of structural risk minimization. The principle of structural risk minimization was first introduced by Vapnik in 1979 [33], where he introduced a structure by dividing the union of the hypothesis spaces of a learning machine sequence into nested subsets (see Figure 1.2). The VC dimensions of these nested subsets can be estimated or bounded above. The union of the functions is structured in such a way that the inner nested subsets have smaller VC dimensions than the outer subsets. What structural risk minimization does is to find the subset of the set of functions which minimizes a bound on the actual risk.

Figure 1.2: Nested Subsets of Functions Ordered by VC Dimensions

To facilitate our discussion, we introduce another structure on learning machines according to their VC dimensions.

Definition 1.2.3 Given a sequence of learning machines and a corresponding sequence of hypothesis spaces, suppose the VC dimension of every learning machine in the sequence is given. A structural hypothesis space, denoted

by H_S, is the given sequence of hypothesis spaces ordered by their VC dimensions.

We point out that, based on Definition 1.2.3, the hypothesis spaces of higher VC dimension do not have to be supersets of the hypothesis spaces of lower VC dimension. Structural hypothesis spaces will be further discussed when we discuss self-adaptive support vector machines in Chapter 2.

1.3 Linear Support Vector Machines

In Sections 1.1 and 1.2, we discussed learning machines and their generalization performance from a general viewpoint. In this section, we discuss a specific learning machine.

1.3.1 The Linearly Separable Case

SVMs were first introduced by Vapnik in 1995 [34]. Their main task is to find a hyperplane that separates a given training set Z correctly and leaves as much distance as possible from the closest instances to the hyperplane on both sides. The distance from the closest instances to the separating hyperplane is called the margin, and the hyperplane that realizes the maximal margin is called the optimal hyperplane. The closest instances are called support vectors, which gives the algorithm its name. We start with the simplest case: linear machines trained on linearly separable training sets. In this case, the hypothesis space is given as:

    H := \{ f(x; w, b) = w^T x + b \mid w, x \in R^n, b \in R \},   (1.3.2)

where w and b are the adjustable parameters. The main task of SVMs is to determine the appropriate values for w and b that maximize the margin and at the same time separate the data points correctly. We can easily prove that this statement is true if and only if we have f(x; w, b) = +1 or -1 at each support vector x. Hence, we have the following optimization problem:

    \min_{w,b} \ \frac{1}{2} w^T w \quad \text{s.t.} \quad y_i (w^T x^i + b) \ge 1, \quad i = 1, \dots, l.   (1.3.3)

1.3.2 The Linearly Non-separable Case

Note that when the two classes are not linearly separable, problem (1.3.3) becomes infeasible. One way to deal with this is to introduce slack variables ξ_i ∈ R_+, i = 1, ..., l; then we have:

    \min_{w,b,\xi} \ \frac{1}{2} w^T w + C\xi^T e \quad \text{s.t.} \quad y_i (w^T x^i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, l,   (1.3.4)

where C is a penalty factor chosen by the user and e is the all-one vector of appropriate dimension. Problem (1.3.4) is a standard convex quadratic programming problem, whose dual problem can be written as follows:

    \max_{\alpha} \ \alpha^T e - \frac{1}{2} \alpha^T K \alpha \quad \text{s.t.} \quad y^T \alpha = 0, \quad Ce \ge \alpha \ge 0,   (1.3.5)

where α ∈ R^l is the vector of dual variables, y = [y_1, y_2, ..., y_l]^T, and K ∈ R^{l×l} with K_ij = y_i y_j (x^{iT} x^j). Problem (1.3.5) is a standard quadratic programming problem as well, for which efficient algorithms are available. Furthermore, if problem (1.3.5) is a strictly convex quadratic problem, then its optimal solution is unique. The optimality conditions for problem (1.3.5) can be written as:

    w - \sum_{i=1}^{l} \alpha_i y_i x^i = 0,
    \sum_{i=1}^{l} \alpha_i y_i = 0,
    C - \alpha_i - \mu_i = 0, \quad i = 1, \dots, l,
    \alpha_i \left( y_i (x^{iT} w + b) - 1 + \xi_i \right) = 0, \quad i = 1, \dots, l,
    \mu_i \xi_i = 0, \quad i = 1, \dots, l,
    y_i (x^{iT} w + b) - 1 + \xi_i \ge 0, \quad i = 1, \dots, l,
    \xi \ge 0, \quad \alpha \ge 0, \quad \mu \ge 0.   (1.3.6)

From duality theory in optimization, we know that, by solving (1.3.6), we can obtain solutions to both the primal and dual problems.

Let α* denote the unique optimal solution of the dual problem. Then, by using the first equality condition in (1.3.6), the optimal decision function can be written as:

    f(x; \alpha^*, b^*) = \sum_{i=1}^{l} y_i \alpha_i^* (x^{iT} x) + b^*.   (1.3.7)

Observe that b does not appear in the dual problem, so b* must be found by making use of the primal constraints and the complementarity conditions, i.e. b* is chosen so that:

    y_i f(x^i; \alpha^*, b^*) = 1, \quad \forall i : 0 < \alpha_i^* < C.   (1.3.8)

Alternatively, we can write down the hypothesis space equivalent to (1.3.2) in terms of the dual variables. It is:

    H := \left\{ f(x; \alpha, b) = \sum_{i=1}^{l} y_i \alpha_i (x^{iT} x) + b \ \middle| \ \alpha \in R^l, b \in R \right\}.   (1.3.9)
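As a hedged illustration of how (1.3.5)-(1.3.8) fit together, the sketch below solves the dual for a tiny linearly separable set with a general-purpose solver and then recovers w and b from the optimality conditions. SciPy's SLSQP routine is assumed here purely for convenience; the thesis itself relies on SVMlight for this step.

```python
# Solve the dual (1.3.5) for a toy data set, then recover w and b via (1.3.6)-(1.3.8).
import numpy as np
from scipy.optimize import minimize

X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0], [3.0, 1.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
C = 10.0
l = len(y)

K = (y[:, None] * y[None, :]) * (X @ X.T)       # K_ij = y_i y_j <x^i, x^j>

def neg_dual(alpha):                             # maximize e^T a - 1/2 a^T K a
    return 0.5 * alpha @ K @ alpha - alpha.sum()

cons = [{"type": "eq", "fun": lambda a: y @ a}]  # y^T alpha = 0
bounds = [(0.0, C)] * l                          # 0 <= alpha_i <= C
res = minimize(neg_dual, np.zeros(l), method="SLSQP",
               bounds=bounds, constraints=cons)
alpha = res.x

w = (alpha * y) @ X                              # w = sum_i alpha_i y_i x^i
sv = (alpha > 1e-6) & (alpha < C - 1e-6)         # indices with 0 < alpha_i < C
b = np.mean(y[sv] - X[sv] @ w)                   # enforce y_i f(x^i) = 1 on them
print(alpha, w, b)
print(np.sign(X @ w + b))                        # reproduces the training labels
```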

1.4 Non-linear Support Vector Machines

1.4.1 Non-linear Mappings and Feature Spaces

In order to learn a non-linear decision function for a linearly non-separable training set, a common strategy is to map the representation of the data from the original input space, denoted by X, to a usually more complex space, denoted by F. This new space is called the feature space in the literature. Suppose a mapping is given as:

    x = (x_1, x_2, \dots, x_n) \mapsto \Phi(x) = (\phi_1(x), \phi_2(x), \dots, \phi_N(x)).

Here φ_i(x), i = 1, ..., N, are usually non-linear functions representing the new features in the N-dimensional feature space F. Non-linear SVMs try to find a hyperplane separating the training set Z in the new feature space F. The separating problem in F can be formulated as follows:

    \min_{\tilde w, b, \xi} \ \frac{1}{2} \tilde w^T \tilde w + C\xi^T e \quad \text{s.t.} \quad y_i (\tilde w^T \Phi(x^i) + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, l,   (1.4.10)

where \tilde w ∈ R^N; \tilde w and b are the decision variables defining the separating hyperplane in F. The hypothesis space in this case is:

    H := \{ f(x; \tilde w, b) = \tilde w^T \Phi(x) + b \mid \tilde w \in R^N, b \in R \}.   (1.4.11)

The dual problem of (1.4.10) is:

    \max_{\alpha} \ \alpha^T e - \frac{1}{2} \alpha^T K \alpha \quad \text{s.t.} \quad y^T \alpha = 0, \quad Ce \ge \alpha \ge 0,   (1.4.12)

where, in this case, K_ij = y_i y_j Φ(x^i)^T Φ(x^j). Note that the dual problem is the same as problem (1.3.5) except that the matrix K is different. In other words, the introduction of non-linear mappings does not influence the formulation of the dual problem except through this matrix, which is the only part related to the non-linear mapping. Consequently, the optimality conditions for problem (1.4.12) are very similar to the optimality conditions (1.3.6) for the linear case; to save space, they are not listed here. By applying the optimality conditions, the hypothesis space of non-linear SVMs in terms of the dual variables can be written as:

    H := \left\{ f(x; \alpha, b) = \sum_{i=1}^{l} y_i \alpha_i \left( \Phi(x^i)^T \Phi(x) \right) + b \ \middle| \ \alpha \in R^l, b \in R \right\}.   (1.4.13)

Just like in the linear case, extra effort is needed to determine b because it does not appear in the dual problem. As we can see from the above discussion, we take two steps to build a non-linear decision boundary in the original input space X: first, the data representation is mapped from X to F by Φ(x) : X -> F; then, a linear learning machine is applied to find the optimal hyperplane in F. If we can compute the inner product Φ(x^i)^T Φ(x^j) in F directly as a function of variables in X, then these two steps can be merged into one. Such a method is called a kernel method.

Definition A kernel function K : R^n × R^n -> R is defined, for all x, z ∈ X, as K(x, z) = Φ(x)^T Φ(z).

Definition A kernel matrix K ∈ R^{l×l} on a training set Z, with respect to a kernel function K(x, z), is defined as:

    K_{ij} = K(x^i, x^j), \quad i, j = 1, \dots, l.

Correspondingly, the hypothesis space can be rewritten as:

    H := \left\{ f(x; \alpha, b) = \sum_{i=1}^{l} y_i \alpha_i K(x^i, x) + b \ \middle| \ \alpha \in R^l, b \in R \right\}.   (1.4.14)

Defining a kernel function is frequently more convenient and natural than defining a mapping from X to F. Once the kernel function and kernel matrix are defined, we can write down the optimization problem and the hypothesis space in terms of the kernel function and kernel matrix without knowing the exact formulation of the underlying mapping from X to F. This is called implicit mapping. One problem in non-linear SVMs is to select the kernel functions so that problem (1.4.12) is well defined and solvable.
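The following sketch makes the idea of implicit mapping tangible for the degree-2 polynomial kernel in R^2: the explicit feature map can still be written down and checked against K(x, z) = (x^T z + c)^2, even though the kernel method never needs it. The helper names are illustrative only.

```python
# Explicit degree-2 polynomial feature map versus the kernel computed in X.
import numpy as np

def phi(x, c):
    x1, x2 = x
    return np.array([x1**2, x2**2,
                     np.sqrt(2.0) * x1 * x2,
                     np.sqrt(2.0 * c) * x1,
                     np.sqrt(2.0 * c) * x2,
                     c])

def poly_kernel(x, z, c):
    return (x @ z + c) ** 2

rng = np.random.default_rng(0)
x, z = rng.normal(size=2), rng.normal(size=2)
c = 1.0
print(phi(x, c) @ phi(z, c))   # inner product computed in the feature space F
print(poly_kernel(x, z, c))    # same value computed directly in the input space X
```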

1.4.2 Kernel Functions

To introduce various kernel functions that enjoy certain desirable properties, we need the following theorem, which was proved by Mercer [24].

Theorem Let X be a compact subset of R^n. Suppose K is a continuous symmetric function such that the integral operator T_K : L_2(X) -> L_2(X),

    (T_K f)(\cdot) = \int_{X} K(\cdot, x) f(x) \, dx,

is positive, that is,

    \int_{X \times X} K(x, z) f(x) f(z) \, dx \, dz \ge 0 \quad \text{for all } f \in L_2(X).

Then we can expand K(x, z) in a uniformly convergent series on X × X in terms of T_K's normalized (‖φ_j‖_{L_2} = 1) eigenfunctions φ_j ∈ L_2(X) and non-negative eigenvalues λ_j ≥ 0 as:

    K(x, z) = \sum_{j=1}^{\infty} \lambda_j \phi_j(x) \phi_j(z).

Suppose that we arbitrarily take a set of points {x^1, x^2, ..., x^l} from the compact set X, and denote the values of a function f on this set by a vector v. Let a kernel function K(x, z) be given and the kernel matrix be denoted by K. Then, for this set, the positivity condition

    \int_{X \times X} K(x, z) f(x) f(z) \, dx \, dz \ge 0, \quad \forall f \in L_2(X),

becomes v^T K v ≥ 0. Hence the positivity condition in Mercer's theorem is equivalent to requiring that, for any finite subset of X, the corresponding matrix K is positive semidefinite.

Corollary [34] The following functions satisfy Mercer's positivity condition, and thus they are kernel functions:

1. K(x, z; σ) = exp(−‖x − z‖² / (2σ²)),
2. K(x, z; d, c) = (x^T z + c)^d,

where σ and d are kernel parameters, and c is a constant. They are called the Gaussian RBF kernel and the polynomial kernel, respectively. The kernel parameters σ and d play important roles in the control of the VC dimension. We call c a constant, instead of a kernel parameter, because it plays a role different from VC dimension control. This will be explained in detail in Section 2.3.1.
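The finite-sample form of Mercer's positivity condition can be checked numerically: on any sample, the Gaussian RBF kernel matrix should have no (significantly) negative eigenvalues. A small sketch, with arbitrary data and tolerance:

```python
# Numerical check of v^T K v >= 0 for the Gaussian RBF kernel on a finite sample.
import numpy as np

def rbf_kernel_matrix(X, sigma):
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # squared pairwise distances
    return np.exp(-d2 / (2.0 * sigma**2))

X = np.random.default_rng(1).normal(size=(30, 4))
K = rbf_kernel_matrix(X, sigma=1.5)
eigvals = np.linalg.eigvalsh(K)
print(eigvals.min() >= -1e-10)   # expected: True, i.e. K is positive semidefinite
```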


Chapter 2 Self Adaptive Support Vector Machines

2.1 Introduction

Since their first introduction, SVMs have been recognized as major learning algorithms for real world classification problems, such as hand-written digit recognition [22], image recognition [26] and protein homology detection [14]. However, some questions remain open: how to select the right kernel function for a classification task, how to find appropriate values for the kernel parameters, how much to penalize a misclassification, and how to decide automatically the contribution of each attribute or feature so that feature selection can be done systematically. In this thesis, we try to address two of these problems. The first is how to find the best value for the kernel parameter, which is referred to as model selection in the literature. We call an SVM Self-Adaptive, or simply an SASVM, if it is able to automatically tune the kernel parameters to the best values for any given data set. The second problem is automatic feature selection. An SVM with automatic feature selection is an SVM learning algorithm that is able to automatically identify the features that are most important for a classification task. Model selection is not new in the field of machine learning, and some research has been done on model selection for support vector machines as well. In 1999, Chapelle and Vapnik [7] presented new functionals for support vector machine model selection. These new functionals can be used to estimate the generalization error. More results regarding the generalization error of

SVMs were presented by Chapelle and Vapnik in [6], where a gradient descent algorithm was introduced to minimize some estimates of the generalization error over a set of kernel parameters. However, the estimates of the generalization error being minimized are complicated functions, which in turn makes the optimization process difficult. The problem of feature selection for SVMs was first studied by Bradley et al. in 1998 [25]. Weston et al. [13] introduced a method for the feature selection problem. In 2002, Chapelle and Vapnik suggested another method in which each feature in the original input space carries its own kernel parameter [6]. If the value of a kernel parameter is very small, the corresponding feature can be removed from the data set without influencing the classification accuracy significantly. In this thesis we apply this idea to the mathematical model defined for the model selection problem. Consequently, the problems of model and feature selection can be handled simultaneously.

2.2 The Restricted Training Task

Recall the concept of a structural hypothesis space from Section 1.2.2. In what follows, we define a restricted hypothesis space and the corresponding training task.

Definition Suppose a training set Z and a penalty factor C are given. Let K(x, z) be a kernel function corresponding to an unknown but determined non-linear mapping from X to F. A restricted hypothesis space H_R is a set of functions with parameters α, b given as:

    H_R := \left\{ f(x; \alpha, b) = \sum_{i=1}^{l} y_i \alpha_i K(x^i, x) + b \ \middle| \ \alpha \in R^l, b \in R \right\},

where (x^i, y_i) ∈ Z and 0 ≤ α_i ≤ C, i = 1, 2, ..., l.

Definition Suppose a penalty factor C and a training set Z are given. A restricted training task is a search process to find a function f_R*(x) := f_R(x; α*, b*) in a restricted hypothesis space H_R such that the objective function \tilde w^T \tilde w + Cξ^T e is minimized, i.e.

    f_R^*(x) = \arg\min_{f(x;\alpha,b) \in H_R} \left( \tilde w^T \tilde w + C\xi^T e \right),

where \tilde w is the coefficient vector with dimension equal to that of the fixed feature space associated with H_R.

A restricted training task can easily be completed by solving problem (1.4.12) with a fixed kernel function and penalty C. There is no free parameter in this formulation, and it is guaranteed by the convexity of problem (1.4.12) that it has a unique solution.
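In other words, a restricted training task is one fit of a standard SVM with the kernel and C held fixed. The sketch below assumes scikit-learn's SVC as a stand-in for SVMlight and shows how the optimal objective value α^Te − ½α^TKα of (1.4.12), which reappears later as the quantity minimized by the upper level, can be read off the fitted dual coefficients.

```python
# One restricted training task: fixed RBF kernel width and fixed C, one convex QP.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, y = make_moons(n_samples=100, noise=0.2, random_state=0)
sigma, C = 0.5, 1.0
gamma = 1.0 / (2.0 * sigma**2)              # SVC's kernel is exp(-gamma ||x - z||^2)

clf = SVC(kernel="rbf", gamma=gamma, C=C).fit(X, y)

a = clf.dual_coef_.ravel()                  # a_i = y_i * alpha_i on the support vectors
K_sv = rbf_kernel(clf.support_vectors_, gamma=gamma)
dual_objective = np.abs(a).sum() - 0.5 * a @ K_sv @ a
print(dual_objective)                       # optimal value of problem (1.4.12)
```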

2.3 The Structural Training Task

It is clear from Section 2.2 that a standard convex quadratic problem needs to be solved only once in order to complete a restricted training task. An immediate question that follows is: what if we use kernel functions with kernel parameters? Before answering this question, let us consider a more fundamental one: what happens when the values of the kernel parameters are changed?

2.3.1 The VC Dimensions of SVMs with Polynomial and RBF Kernels

We first set the stage by giving the following theorem about the VC dimension of hyperplanes. Its proof can be found in [5].

Theorem [5] Consider a set of m points in R^n. Choose any one of the points as the origin. Then the m points can be shattered by hyperplanes if and only if the position vectors of the remaining points are linearly independent.

Corollary [3] The VC dimension of the set of hyperplanes in R^n is n + 1.

This corollary can easily be verified, as we can always choose n + 1 points in R^n, let one of them be the origin, and keep the rest linearly independent. This cannot be done for n + 2 points. Now we are ready to describe the VC dimension of SVMs with kernel functions. First, let us discuss polynomial kernels, which are given as:

    K(x, z) = (x^T z + c)^d,

where x, z ∈ R^n. When d = 2, we have:

    (x^T z + c)^2 = \left( \sum_{i=1}^{n} x_i z_i + c \right) \left( \sum_{j=1}^{n} x_j z_j + c \right)
                  = \sum_{i=1}^{n} \sum_{j=1}^{n} x_i x_j z_i z_j + 2c \sum_{i=1}^{n} x_i z_i + c^2
                  = \sum_{(i,j)=(1,1)}^{(n,n)} (x_i x_j)(z_i z_j) + \sum_{i=1}^{n} (\sqrt{2c}\, x_i)(\sqrt{2c}\, z_i) + c^2.

Observe that there are \binom{n+1}{2} + n + 1 = \binom{n+2}{2} distinct features, being all the monomials of degree up to 2. Their relative weights among the degrees 0, 1 and 2 are controlled by the constant c. Similar derivations can be made for the case of degree up to d, where there are \binom{n+d}{d} distinct features, being all the monomials of degree up to d. This analysis leads to the following theorem.

Theorem [34] If the dimension of the original input space X is n, then the VC dimension of SVMs with polynomial kernels of degree d, i.e. K(x, z) = (x^T z + c)^d, x, z ∈ X, is \binom{n+d}{d} + 1.

The kernel parameter d plays an important role in controlling the VC dimension: when d increases, the VC dimension increases very quickly.

Theorem [5] Consider the Gaussian RBF kernel K(x, z) = exp(−‖x − z‖² / (2σ²)) with RBF width σ. Assume σ can be arbitrarily small and the penalty factor C can take any value. Also assume that training sets can be chosen arbitrarily from X such that the distances between any pair of instances are much larger than σ. Then support vector machines with Gaussian RBF kernels have infinite VC dimension.

A proof of this theorem can be found in [5]. Note that the assumptions in the theorem are very strong and usually cannot be satisfied in a real situation. In fact, a learning machine with infinite VC dimension is not what we want. On the contrary, we want to limit the VC dimension so that the learned machine has a better chance of achieving high generalization performance.
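For reference, the VC-dimension bound for polynomial kernels stated above is easy to tabulate; the small snippet below simply evaluates C(n+d, d) + 1 and illustrates how quickly it grows with the degree d.

```python
# VC-dimension bound for SVMs with polynomial kernels: C(n+d, d) + 1.
from math import comb

def poly_svm_vc_dimension(n, d):
    return comb(n + d, d) + 1

for d in (1, 2, 3, 5):
    print(d, poly_svm_vc_dimension(10, d))   # grows rapidly with the degree d
```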

Suppose we are training a learning machine on a set Z of size l. Then the maximal rank of the kernel matrices defined by Gaussian RBF kernels is l, which means that the VC dimension is at most l + 1. Now, if we increase σ, the size of the subset of Z satisfying the distance requirement decreases, which in turn causes the VC dimension to decrease. Hence, for Gaussian RBF kernels, σ actually controls the VC dimension of the corresponding SVMs. We end this section by giving two important propositions and the definition of the structural training task.

Proposition The hypothesis space of SVMs with polynomial kernels trained on a set Z,

    H_S = \{ f(x; \alpha, b; d) \mid \alpha \in R^l, b \in R, d \in R_+ \}, \quad \text{where} \quad f(x; \alpha, b; d) = \sum_{i=1}^{l} y_i \alpha_i (x^T x^i + c)^d + b,

is a structural hypothesis space. For a particular choice of d, the hypothesis space is a restricted hypothesis space. Furthermore, the VC dimensions of H_S increase with d.

Proposition The hypothesis space of SVMs with Gaussian RBF kernels trained on a set Z,

    H_S = \{ f(x; \alpha, b; \sigma) \mid \alpha \in R^l, b \in R, \sigma \in R_+ \}, \quad \text{where} \quad f(x; \alpha, b; \sigma) = \sum_{i=1}^{l} y_i \alpha_i \exp\!\left( -\frac{\|x - x^i\|^2}{2\sigma^2} \right) + b,

is a structural hypothesis space. For a particular choice of σ, the hypothesis space is a restricted hypothesis space. Furthermore, the VC dimensions of H_S decrease with σ.

Definition Suppose a penalty factor C and a training set Z are given. Let H_S denote a structural hypothesis space with kernel functions K(x, z; σ) and kernel parameter σ. A structural training task is a search process to find a function f*(x) := f(x; α*, b*; σ*) in H_S such that the objective function \tilde w^T \tilde w + Cξ^T e is minimized, i.e.

    f^*(x) = \arg\min_{f(x;\alpha,b;\sigma) \in H_S} \left( \tilde w^T \tilde w + C\xi^T e \right),

where \tilde w is the coefficient vector with dimension equal to that of the feature space corresponding to the current choice of σ.

2.3.2 Self Adaptive Support Vector Machines

In this section, we discuss how to complete a structural training task. When we face a structural training task, we actually face a hypothesis space more complex than a restricted hypothesis space, but one with a nice structure: the VC dimension of the structural hypothesis space changes monotonically with its kernel parameter. Let us take a close look at the kernel parameter, because it provides not only a way to organize the hypothesis space, but also the key to completing the structural training task. For a particular choice of the kernel parameter, the non-linear mapping from X to F is fixed, and the structural hypothesis space is reduced to a restricted hypothesis space. We know from Section 2.2 that the corresponding restricted training task can be completed by solving a standard convex quadratic optimization problem. Now, if we tune the kernel parameter, the non-linear mapping varies with the kernel parameter and so does the resulting restricted hypothesis space. If the kernel parameter can take any value in a certain range, there can be infinitely many non-linear mappings. Let us denote the whole set of mappings by M. Essentially, what a structural training task asks is: from the set M, find the particular mapping that minimizes \tilde w^T \tilde w + Cξ^T e. Translating this into a mathematical model, we have the following optimization problem:

    \min_{\Phi(x) \in M} \left[ \ \min_{\tilde w, b} \ \tilde w^T \tilde w + C\xi^T e \quad \text{s.t.} \quad y_i (\tilde w^T \Phi(x^i) + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, l \ \right].   (2.3.1)

This is a bi-level optimization problem. The upper level problem is usually called the leader's problem, and the lower level problem the follower's problem. In our problem (2.3.1), the decision variable of the leader's problem is the mapping Φ(x) drawn from the whole space of mappings M, and the variables of the follower's problem are \tilde w and b with appropriate dimensions. Details about bi-level optimization problems are given in Chapter 3. It is natural and easy to formulate the problem of a structural training task as in (2.3.1). However, it is difficult to solve this problem. The difficulty comes from two aspects. First, the follower's problem is not defined completely until the mapping is given explicitly. In other words, when the

mappings are different, the formulations of the follower's problems are different as well. Second, it is hard to define a descent search direction in the space M of mappings. The good news is that the introduction of kernel functions solves these problems. This becomes clear if we rewrite problem (2.3.1) with the follower's problem replaced by its dual problem. It looks like this:

    \min_{\sigma} \ F(\sigma, \alpha(\sigma)) = \alpha^T(\sigma) e - \frac{1}{2} \alpha^T(\sigma) K(\sigma) \alpha(\sigma)
    \text{s.t.} \quad \alpha(\sigma) = \arg\max_{\alpha} \ \alpha^T e - \frac{1}{2} \alpha^T K(\sigma) \alpha \quad \text{s.t.} \quad y^T \alpha = 0, \quad Ce \ge \alpha \ge 0,   (2.3.2)

where K(σ) is a matrix parametric in σ, (K(σ))_{ij} = K(x^i, x^j; σ). Now, instead of minimizing over a space of mappings, we minimize over the kernel parameter σ, which controls the non-linear mapping from X to F. Problem (2.3.2) is called the bi-level problem of SASVMs, or simply the BLP of SASVMs. We will discuss how to solve this problem in Chapter 3. By solving the BLP of SASVMs for a given kernel function type and penalty factor C, we are simultaneously tuning σ, α and b to appropriate values. In problem (2.3.2), we treat the penalty factor as a given constant. It is natural to deal with the kernel parameter and the penalty factor separately because they play different roles in machine learning. The kernel parameter affects the underlying mapping from X to F, and thus the VC dimension of the structural hypothesis space, while the penalty factor maintains the trade-off between the maximal margin and the number of misclassifications. If the classification task is critical, and the cost of a misclassification is very expensive, then we should use a large penalty parameter. With this strategy, we expect that SASVMs can achieve high accuracy by using learning machines with high VC dimensions. For this purpose, we need a high quality training set, which usually means a large number of instances and a distribution consistent with P(x, y). If we do not have high quality training sets, then we should set the penalty factor to a small number so that a smooth decision boundary can be constructed. More study is still needed on choosing an appropriate value of the penalty factor. We close this section with the following note.

Note Consider classifying a training set Z = {(x^1, y_1), (x^2, y_2), ..., (x^l, y_l)} with a given penalty factor C. Let the type of the kernel function K(x, z; σ) be specified with a parameter σ so that the hypothesis space

    H_S := \left\{ f(x; \alpha, b; \sigma) = \sum_{i=1}^{l} y_i \alpha_i K(x^i, x; \sigma) + b \ \middle| \ x \in R^n, \alpha \in R^l, b \in R \right\}

is a structural hypothesis space. Let α*, b* and σ* be an optimal solution of the bi-level problem (2.3.2). Then the decision function

    f(x; \alpha^*, b^*; \sigma^*) = \sum_{i=1}^{l} y_i \alpha_i^* K(x^i, x; \sigma^*) + b^*

completes the structural training task, i.e.

    f(x; \alpha^*, b^*; \sigma^*) = \arg\min_{f(x;\alpha,b;\sigma) \in H_S} \left( \tilde w^T \tilde w + C\xi^T e \right).

2.3.3 An Illustrative Example

An experiment is conducted on a training set of 100 points from two classes scattered in a 2-dimensional space (see Figure 2.1). Gaussian RBF kernel functions are used in this experiment, and C is set to 0.1. First, SVMlight [16] is run at six different values of σ; the corresponding decision boundaries are shown in Figure 2.1. As expected, for the smaller values of σ, the accuracy rates are very close to 100% because of the high VC dimensions, but the decision boundaries overfit the training set. In particular, for the smallest σ, the decision boundary is reduced to small islands around the points indicated by + signs. On the other hand, for the largest σ, the decision boundary does not give a good separation due to the lower VC dimension of the learning machine imposed by the large kernel parameter. Visually, we can see from Figure 2.1 that an intermediate value of σ is possibly the best for this data set. Figure 2.2 shows the optimal values of the follower's objective for different values of 1/(2σ²), along with the generalization accuracy estimates, called ξα-estimates, defined in [17]. We can see from this figure that, as 1/(2σ²) varies, the point where the maximal generalization accuracy estimate is achieved is very close to the point with the minimal optimal objective value. Similar results are found in experiments on other data sets. Hence, by solving the BLP of SASVMs, we tune the kernel parameter automatically to the value that most likely optimizes the generalization performance.
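A hedged sketch of the self-adaptive idea in (2.3.2): the optimal value of the lower-level dual is treated as a function F(σ) and minimized over σ by a derivative-free one-dimensional search. scikit-learn's SVC and SciPy's bounded scalar minimizer stand in for SVMlight and the DFO package used in the thesis; the data set is synthetic.

```python
# Upper level: minimize F(sigma) = optimal dual objective of the restricted task.
import numpy as np
from scipy.optimize import minimize_scalar
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, y = make_moons(n_samples=100, noise=0.2, random_state=0)
C = 0.1

def lower_level_optimum(sigma):
    """F(sigma): optimal value of the dual (lower level) at this kernel width."""
    gamma = 1.0 / (2.0 * sigma ** 2)
    clf = SVC(kernel="rbf", gamma=gamma, C=C).fit(X, y)
    a = clf.dual_coef_.ravel()                      # a_i = y_i * alpha_i
    K_sv = rbf_kernel(clf.support_vectors_, gamma=gamma)
    return np.abs(a).sum() - 0.5 * a @ K_sv @ a

res = minimize_scalar(lower_level_optimum, bounds=(0.01, 10.0), method="bounded")
print(res.x, res.fun)   # sigma chosen by the upper level, and the value F(sigma)
```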

Figure 2.1: Optimal Decision Boundaries for Different σ

2.4 Automatic Feature Selection

2.4.1 Extended Kernel Functions

For feature selection, we need an indicator of the importance of each feature to a classification task. We use a vector β ∈ R^n to describe the contributions of the features to the final decision boundary. A small value of an element of β means that the corresponding feature does not contribute very much to the classification; consequently, we can remove it from the data set without affecting the classification accuracy significantly. The vector β can be integrated into the kernel functions. Kernel functions with the vector β are called extended kernel functions.

Figure 2.2: Optimal Objective Values vs ξα-Estimates

For instance, the extended Gaussian kernel functions can be written as

    K_{ext}(x^i, x^j; \sigma; \beta) = \exp\!\left( -\frac{\sum_{k=1}^{n} \beta_k (x^i_k - x^j_k)^2}{2\sigma^2} \right).   (2.4.3)

Note that setting the vector β equal to the all-one vector gives rise to the original Gaussian RBF kernel function. Similarly, polynomial kernels can also be extended as follows:

    K_{ext}(x^i, x^j; d; \beta) = \left( \sum_{k=1}^{n} \beta_k (x^i_k x^j_k) + c \right)^{d},   (2.4.4)

and the linear kernel function can be extended as:

    K_{ext}(x^i, x^j; \beta) = \sum_{k=1}^{n} \beta_k (x^i_k x^j_k).   (2.4.5)

Correspondingly, we can define the extended kernel matrix K_ext(σ, β) for Gaussian RBF kernels and K_ext(d, β) for polynomial kernels.
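A minimal sketch of the extended Gaussian kernel (2.4.3), assuming the reconstruction above: each feature k receives its own weight β_k, so a feature whose β_k is driven towards zero barely influences the kernel value and becomes a candidate for removal.

```python
# Extended Gaussian kernel with per-feature weights beta_k.
import numpy as np

def extended_rbf(x, z, sigma, beta):
    """K_ext(x, z) = exp(-sum_k beta_k (x_k - z_k)^2 / (2 sigma^2))."""
    beta = np.asarray(beta, dtype=float)
    return np.exp(-np.sum(beta * (x - z) ** 2) / (2.0 * sigma ** 2))

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, 2.5, -1.0])
print(extended_rbf(x, z, sigma=1.0, beta=[1.0, 1.0, 1.0]))   # ordinary RBF kernel
print(extended_rbf(x, z, sigma=1.0, beta=[1.0, 1.0, 0.0]))   # third feature ignored
```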

Now we have a different problem, with more variables at the upper level. For example, if we use extended Gaussian RBF kernel functions, then we have

    \min_{\sigma, \beta} \ F(\sigma, \alpha(\sigma, \beta)) = \alpha^T(\sigma, \beta) e - \frac{1}{2} \alpha^T(\sigma, \beta) K_{ext}(\sigma, \beta) \alpha(\sigma, \beta)
    \text{s.t.} \quad \alpha(\sigma, \beta) = \arg\max_{\alpha} \ \alpha^T e - \frac{1}{2} \alpha^T K_{ext}(\sigma, \beta) \alpha \quad \text{s.t.} \quad y^T \alpha = 0, \quad Ce \ge \alpha \ge 0.   (2.4.6)

2.4.2 An Illustrative Example

An experiment is done on a data set of 26 points scattered in a 2-dimensional space. The corresponding bi-level problem is converted into a one-level problem by means of the KKT-condition replacement discussed in Chapter 3. In the optimal kernel parameter vector found by LOQO, the value associated with the second feature is very small (on the order of 10^-9). Thus, we remove this feature from the training set and run the optimization process again. As expected, the optimal kernel value for the remaining feature is unchanged. This verifies our observation that the second feature does not play a significant role in defining the decision boundary. Figure 2.3 displays the optimal decision boundaries found by LOQO for this data set before and after feature selection. The upper part of Figure 2.3 displays the optimal boundary in the original 2-dimensional space, where a common kernel parameter is used for both features. The generalization accuracy estimate in this situation is 61.54%. The lower part of Figure 2.3 displays the decision boundary in the 1-dimensional space onto which the points are projected. The number of misclassifications made by the 1-dimensional decision function is 3, one less than the number made by the 2-dimensional decision function. Meanwhile, the 1-dimensional decision function is simpler than that in the 2-dimensional space in the sense that it involves only one feature. The generalization accuracy estimate of the 1-dimensional decision function is 76.92%, which is much higher than the generalization accuracy achieved in the 2-dimensional space.

Figure 2.3: Optimal Decision Boundaries in R^2 and R (panels: 2-dimensional decision boundary; 1-dimensional decision boundary)

Chapter 3 Solving the BLP of SASVM

In this chapter, we discuss how to solve the bi-level problems introduced in Chapter 2 for model and feature selection. Some important concepts and properties of general bi-level programming problems are described in Section 3.1. Then, in Section 3.2, we present how to convert the bi-level problems of interest into one-level problems so that they can be solved by general non-linear solvers. Finally, in Section 3.3, we describe the derivative free optimization (DFO) method and apply it to the general BLP.

3.1 General Bi-level Problems

We consider general non-linear bi-level programming problems of the following form:

    \min_{x \in \mathcal{X}} \ F(x, y)
    \text{s.t.} \quad y \in \arg\min_{y \in \mathcal{Y}} \ f(x, y) \quad \text{s.t.} \quad g_i(x, y) \le 0, \ i \in I, \quad h_j(x, y) = 0, \ j \in E,   (3.1.1)

where I and E are sets of indices. The upper level problem is often called the leader's problem, and the lower level problem the follower's problem. A bi-level programming problem works like two players in a strategic game, where each player has a set of variables to control and each can respond according to how the opponent responds. Their goal

is to minimize or maximize their own objective functions. To facilitate our discussion, let us introduce the following concepts.

Definition [8] The follower's feasible region, denoted by Ω(x), is the set of allowable choices for the follower under the current leader's choice. It is parametric in the vector x, i.e.

    \Omega(x) = \{ y \mid g_i(x, y) \le 0, \ i \in I, \ \text{and} \ h_j(x, y) = 0, \ j \in E, \ x \in \mathcal{X}, \ y \in \mathcal{Y} \}.

Definition [8] The rational reaction set, denoted by M(x), is the set of optimal solutions (optimal choices) for the follower under the current leader's choice, i.e.

    M(x) = \{ y \mid y = \arg\min [ f(x, y) : y \in \Omega(x) ] \}.

We assume that M(x) is not empty for all vectors x in X.

Definition [8] The inducible region, denoted by IR, is the union, over all possible vectors x that the leader can choose, of the corresponding rational reaction sets M(x), i.e.

    IR = \{ (x, y) \mid x \in \mathcal{X}, \ y \in M(x) \}.

Generally speaking, M(x) is a point-to-set mapping, which means that, for a given leader's choice, there might be more than one optimal solution of the follower's problem. In particular, if f, g and h are twice continuously differentiable in y for all y ∈ Ω(x), f is strictly convex in y for all y ∈ Ω(x), and Ω(x) is a compact convex set, then M(x) is a continuous point-to-point mapping [11]. In this case, we usually denote it by y(x). As a consequence, the leader's objective function is continuous in x and can be written as F(x, y(x)). The difficult part is that, in most cases, y(x) is implicitly defined by the follower's problem, and it is very hard to compute the gradient of the leader's objective F(x, y(x)) with respect to x. This difficulty prevents us from applying many optimization solvers directly to the problem, as most optimization methods use the gradient information of the objective. From now on, we shall assume that y(x) is a continuous point-to-point mapping. In fact, there is a way to treat a bi-level optimization problem as a one-level optimization problem over the inducible region IR. If y(x) is a continuous function, IR is a continuous curve in the space R^{n+m}. The differentiability of y(x) and F(x, y(x)) plays an important role in the selection of available algorithms. If F(x, y(x)) is differentiable everywhere w.r.t. x, then we can apply algorithms that use the first order derivative of the leader's objective, calculated from the solution of the follower's problem. The leader's gradient information can be acquired in a variety of ways, which is carefully examined in [8, 19]. In the case where F(x, y(x)) is not

differentiable w.r.t. x but is Lipschitz continuous, we can use a bundle method [30, 8], or apply algorithms that do not use gradient information. We summarize the results regarding the differentiability of the general BLP as follows; these results are from [8]. Let u, v be the Lagrangian multipliers introduced for the inequality and equality constraints, respectively, in the follower's problem. Denote the active set of inequality constraints by A(x, y), i.e. A(x, y) = { i ∈ I | g_i(x, y) = 0 }, and the Lagrangian function of the follower's problem for a fixed x by L(y, u, v; x), i.e.

    L(y, u, v; x) = f(x, y) + u^T g(x, y) + v^T h(x, y).

The existence of the one-to-one mapping y(x) means that there exist unique Lagrangian multipliers u, v such that the KKT conditions [18, 20] are satisfied. Besides the KKT conditions, the following conditions are used in the study of the differentiability of y(x):

The Linear Independence Constraint Qualification (LICQ) holds at y if ∇_y g_i(x, y), i ∈ A(x, y), and ∇_y h_j(x, y), j ∈ E, are linearly independent.

The Strict Complementary Slackness condition (SCS) holds at y with respect to (u, v) if u_i > 0, i ∈ A(x, y).

The Strong Second-Order Sufficient Condition (SSOSC) holds at y w.r.t. (u, v) if

    d^T \nabla^2_y L(y, u, v; x) \, d > 0 \quad \forall d \ne 0 \ \text{with} \ d^T \nabla_y g_i(x, y) = 0, \ i \in A(x, y), \quad d^T \nabla_y h_j(x, y) = 0, \ j \in E.

Proposition [10] Suppose the KKT, SSOSC, SCS and LICQ conditions hold at y_0 with multipliers (u_0, v_0) for the follower's problem with x = x_0, and that f, g and h are C^3 in a neighborhood of (x_0, y_0). Then, for x in a neighborhood of x_0, there exists a unique twice continuously differentiable function z(x) = [y(x), u(x), v(x)]^T satisfying the KKT, SSOSC, SCS and LICQ conditions at y(x), with multipliers [u(x), v(x)], for the follower's problem.

Proposition [15, 23, 29] Suppose the KKT, SSOSC and LICQ conditions hold at y_0 with multipliers (u_0, v_0) for the follower's problem with x = x_0, and that f, g and h are C^3 in a neighborhood of (x_0, y_0). Then, for x in a neighborhood of x_0, there exists a unique twice continuously differentiable function z(x) = [y(x), u(x), v(x)]^T satisfying the KKT, SSOSC and LICQ conditions at y(x), with multipliers [u(x), v(x)], for the follower's problem.
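The nested, derivative-free treatment of such problems can be sketched generically: for every leader candidate x, the follower's problem is solved numerically to obtain y(x), and F(x, y(x)) is then minimized without gradients. The toy problem and the use of SciPy's Nelder-Mead method below are illustrative assumptions only; the thesis applies a DFO method to the BLP of SASVMs.

```python
# Generic nested strategy for a bi-level problem: solve the follower's problem
# for each leader candidate, then minimize F(x, y(x)) derivative-free.
import numpy as np
from scipy.optimize import minimize

def follower(x):
    """Solve min_y (y - x)^2 + y^2  s.t. y >= 0, for a fixed leader choice x."""
    res = minimize(lambda y: (y[0] - x) ** 2 + y[0] ** 2,
                   x0=[0.0], bounds=[(0.0, None)])
    return res.x[0]                       # the rational reaction y(x)

def leader_objective(x):
    y = follower(x[0])                    # evaluate y(x) by an inner solve
    return (x[0] - 3.0) ** 2 + (y - 1.0) ** 2

res = minimize(leader_objective, x0=[0.0], method="Nelder-Mead")
print(res.x, follower(res.x[0]))          # approximate bi-level solution
```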


More information

A Truncated Newton Method in an Augmented Lagrangian Framework for Nonlinear Programming

A Truncated Newton Method in an Augmented Lagrangian Framework for Nonlinear Programming A Truncated Newton Method in an Augmented Lagrangian Framework for Nonlinear Programming Gianni Di Pillo (dipillo@dis.uniroma1.it) Giampaolo Liuzzi (liuzzi@iasi.cnr.it) Stefano Lucidi (lucidi@dis.uniroma1.it)

More information

Classification by Support Vector Machines

Classification by Support Vector Machines Classification by Support Vector Machines Florian Markowetz Max-Planck-Institute for Molecular Genetics Computational Molecular Biology Berlin Practical DNA Microarray Analysis 2003 1 Overview I II III

More information

Support Vector Machines for Classification and Regression

Support Vector Machines for Classification and Regression UNIVERSITY OF SOUTHAMPTON Support Vector Machines for Classification and Regression by Steve R. Gunn Technical Report Faculty of Engineering and Applied Science Department of Electronics and Computer Science

More information

Math 5593 Linear Programming Lecture Notes

Math 5593 Linear Programming Lecture Notes Math 5593 Linear Programming Lecture Notes Unit II: Theory & Foundations (Convex Analysis) University of Colorado Denver, Fall 2013 Topics 1 Convex Sets 1 1.1 Basic Properties (Luenberger-Ye Appendix B.1).........................

More information

Support Vector Machines. James McInerney Adapted from slides by Nakul Verma

Support Vector Machines. James McInerney Adapted from slides by Nakul Verma Support Vector Machines James McInerney Adapted from slides by Nakul Verma Last time Decision boundaries for classification Linear decision boundary (linear classification) The Perceptron algorithm Mistake

More information

Mathematical Programming and Research Methods (Part II)

Mathematical Programming and Research Methods (Part II) Mathematical Programming and Research Methods (Part II) 4. Convexity and Optimization Massimiliano Pontil (based on previous lecture by Andreas Argyriou) 1 Today s Plan Convex sets and functions Types

More information

Kernels + K-Means Introduction to Machine Learning. Matt Gormley Lecture 29 April 25, 2018

Kernels + K-Means Introduction to Machine Learning. Matt Gormley Lecture 29 April 25, 2018 10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Kernels + K-Means Matt Gormley Lecture 29 April 25, 2018 1 Reminders Homework 8:

More information

LECTURE 5: DUAL PROBLEMS AND KERNELS. * Most of the slides in this lecture are from

LECTURE 5: DUAL PROBLEMS AND KERNELS. * Most of the slides in this lecture are from LECTURE 5: DUAL PROBLEMS AND KERNELS * Most of the slides in this lecture are from http://www.robots.ox.ac.uk/~az/lectures/ml Optimization Loss function Loss functions SVM review PRIMAL-DUAL PROBLEM Max-min

More information

Support Vector Machines.

Support Vector Machines. Support Vector Machines srihari@buffalo.edu SVM Discussion Overview. Importance of SVMs. Overview of Mathematical Techniques Employed 3. Margin Geometry 4. SVM Training Methodology 5. Overlapping Distributions

More information

Support Vector Machines

Support Vector Machines Support Vector Machines SVM Discussion Overview. Importance of SVMs. Overview of Mathematical Techniques Employed 3. Margin Geometry 4. SVM Training Methodology 5. Overlapping Distributions 6. Dealing

More information

Lecture 9: Support Vector Machines

Lecture 9: Support Vector Machines Lecture 9: Support Vector Machines William Webber (william@williamwebber.com) COMP90042, 2014, Semester 1, Lecture 8 What we ll learn in this lecture Support Vector Machines (SVMs) a highly robust and

More information

10. Support Vector Machines

10. Support Vector Machines Foundations of Machine Learning CentraleSupélec Fall 2017 10. Support Vector Machines Chloé-Agathe Azencot Centre for Computational Biology, Mines ParisTech chloe-agathe.azencott@mines-paristech.fr Learning

More information

Combine the PA Algorithm with a Proximal Classifier

Combine the PA Algorithm with a Proximal Classifier Combine the Passive and Aggressive Algorithm with a Proximal Classifier Yuh-Jye Lee Joint work with Y.-C. Tseng Dept. of Computer Science & Information Engineering TaiwanTech. Dept. of Statistics@NCKU

More information

Shiqian Ma, MAT-258A: Numerical Optimization 1. Chapter 2. Convex Optimization

Shiqian Ma, MAT-258A: Numerical Optimization 1. Chapter 2. Convex Optimization Shiqian Ma, MAT-258A: Numerical Optimization 1 Chapter 2 Convex Optimization Shiqian Ma, MAT-258A: Numerical Optimization 2 2.1. Convex Optimization General optimization problem: min f 0 (x) s.t., f i

More information

Support Vector Machines

Support Vector Machines Support Vector Machines . Importance of SVM SVM is a discriminative method that brings together:. computational learning theory. previously known methods in linear discriminant functions 3. optimization

More information

Module 1 Lecture Notes 2. Optimization Problem and Model Formulation

Module 1 Lecture Notes 2. Optimization Problem and Model Formulation Optimization Methods: Introduction and Basic concepts 1 Module 1 Lecture Notes 2 Optimization Problem and Model Formulation Introduction In the previous lecture we studied the evolution of optimization

More information

Lab 2: Support vector machines

Lab 2: Support vector machines Artificial neural networks, advanced course, 2D1433 Lab 2: Support vector machines Martin Rehn For the course given in 2006 All files referenced below may be found in the following directory: /info/annfk06/labs/lab2

More information

Lecture 5: Duality Theory

Lecture 5: Duality Theory Lecture 5: Duality Theory Rajat Mittal IIT Kanpur The objective of this lecture note will be to learn duality theory of linear programming. We are planning to answer following questions. What are hyperplane

More information

FERDINAND KAISER Robust Support Vector Machines For Implicit Outlier Removal. Master of Science Thesis

FERDINAND KAISER Robust Support Vector Machines For Implicit Outlier Removal. Master of Science Thesis FERDINAND KAISER Robust Support Vector Machines For Implicit Outlier Removal Master of Science Thesis Examiners: Dr. Tech. Ari Visa M.Sc. Mikko Parviainen Examiners and topic approved in the Department

More information

Applied Lagrange Duality for Constrained Optimization

Applied Lagrange Duality for Constrained Optimization Applied Lagrange Duality for Constrained Optimization Robert M. Freund February 10, 2004 c 2004 Massachusetts Institute of Technology. 1 1 Overview The Practical Importance of Duality Review of Convexity

More information

HW2 due on Thursday. Face Recognition: Dimensionality Reduction. Biometrics CSE 190 Lecture 11. Perceptron Revisited: Linear Separators

HW2 due on Thursday. Face Recognition: Dimensionality Reduction. Biometrics CSE 190 Lecture 11. Perceptron Revisited: Linear Separators HW due on Thursday Face Recognition: Dimensionality Reduction Biometrics CSE 190 Lecture 11 CSE190, Winter 010 CSE190, Winter 010 Perceptron Revisited: Linear Separators Binary classification can be viewed

More information

DM6 Support Vector Machines

DM6 Support Vector Machines DM6 Support Vector Machines Outline Large margin linear classifier Linear separable Nonlinear separable Creating nonlinear classifiers: kernel trick Discussion on SVM Conclusion SVM: LARGE MARGIN LINEAR

More information

Theoretical Concepts of Machine Learning

Theoretical Concepts of Machine Learning Theoretical Concepts of Machine Learning Part 2 Institute of Bioinformatics Johannes Kepler University, Linz, Austria Outline 1 Introduction 2 Generalization Error 3 Maximum Likelihood 4 Noise Models 5

More information

Lecture 19: Convex Non-Smooth Optimization. April 2, 2007

Lecture 19: Convex Non-Smooth Optimization. April 2, 2007 : Convex Non-Smooth Optimization April 2, 2007 Outline Lecture 19 Convex non-smooth problems Examples Subgradients and subdifferentials Subgradient properties Operations with subgradients and subdifferentials

More information

Distance-to-Solution Estimates for Optimization Problems with Constraints in Standard Form

Distance-to-Solution Estimates for Optimization Problems with Constraints in Standard Form Distance-to-Solution Estimates for Optimization Problems with Constraints in Standard Form Philip E. Gill Vyacheslav Kungurtsev Daniel P. Robinson UCSD Center for Computational Mathematics Technical Report

More information

Kernel-based online machine learning and support vector reduction

Kernel-based online machine learning and support vector reduction Kernel-based online machine learning and support vector reduction Sumeet Agarwal 1, V. Vijaya Saradhi 2 andharishkarnick 2 1- IBM India Research Lab, New Delhi, India. 2- Department of Computer Science

More information

Convex Programs. COMPSCI 371D Machine Learning. COMPSCI 371D Machine Learning Convex Programs 1 / 21

Convex Programs. COMPSCI 371D Machine Learning. COMPSCI 371D Machine Learning Convex Programs 1 / 21 Convex Programs COMPSCI 371D Machine Learning COMPSCI 371D Machine Learning Convex Programs 1 / 21 Logistic Regression! Support Vector Machines Support Vector Machines (SVMs) and Convex Programs SVMs are

More information

Data-driven Kernels for Support Vector Machines

Data-driven Kernels for Support Vector Machines Data-driven Kernels for Support Vector Machines by Xin Yao A research paper presented to the University of Waterloo in partial fulfillment of the requirement for the degree of Master of Mathematics in

More information

Kernels and representation

Kernels and representation Kernels and representation Corso di AA, anno 2017/18, Padova Fabio Aiolli 20 Dicembre 2017 Fabio Aiolli Kernels and representation 20 Dicembre 2017 1 / 19 (Hierarchical) Representation Learning Hierarchical

More information

Linear Bilevel Programming With Upper Level Constraints Depending on the Lower Level Solution

Linear Bilevel Programming With Upper Level Constraints Depending on the Lower Level Solution Linear Bilevel Programming With Upper Level Constraints Depending on the Lower Level Solution Ayalew Getachew Mersha and Stephan Dempe October 17, 2005 Abstract Focus in the paper is on the definition

More information

Support Vector Machines and their Applications

Support Vector Machines and their Applications Purushottam Kar Department of Computer Science and Engineering, Indian Institute of Technology Kanpur. Summer School on Expert Systems And Their Applications, Indian Institute of Information Technology

More information

Unconstrained Optimization Principles of Unconstrained Optimization Search Methods

Unconstrained Optimization Principles of Unconstrained Optimization Search Methods 1 Nonlinear Programming Types of Nonlinear Programs (NLP) Convexity and Convex Programs NLP Solutions Unconstrained Optimization Principles of Unconstrained Optimization Search Methods Constrained Optimization

More information

Linear programming and duality theory

Linear programming and duality theory Linear programming and duality theory Complements of Operations Research Giovanni Righini Linear Programming (LP) A linear program is defined by linear constraints, a linear objective function. Its variables

More information

Convexization in Markov Chain Monte Carlo

Convexization in Markov Chain Monte Carlo in Markov Chain Monte Carlo 1 IBM T. J. Watson Yorktown Heights, NY 2 Department of Aerospace Engineering Technion, Israel August 23, 2011 Problem Statement MCMC processes in general are governed by non

More information

DM545 Linear and Integer Programming. Lecture 2. The Simplex Method. Marco Chiarandini

DM545 Linear and Integer Programming. Lecture 2. The Simplex Method. Marco Chiarandini DM545 Linear and Integer Programming Lecture 2 The Marco Chiarandini Department of Mathematics & Computer Science University of Southern Denmark Outline 1. 2. 3. 4. Standard Form Basic Feasible Solutions

More information

Support Vector Machines

Support Vector Machines Support Vector Machines RBF-networks Support Vector Machines Good Decision Boundary Optimization Problem Soft margin Hyperplane Non-linear Decision Boundary Kernel-Trick Approximation Accurancy Overtraining

More information

Support Vector Machines for Face Recognition

Support Vector Machines for Face Recognition Chapter 8 Support Vector Machines for Face Recognition 8.1 Introduction In chapter 7 we have investigated the credibility of different parameters introduced in the present work, viz., SSPD and ALR Feature

More information

Classification by Support Vector Machines

Classification by Support Vector Machines Classification by Support Vector Machines Florian Markowetz Max-Planck-Institute for Molecular Genetics Computational Molecular Biology Berlin Practical DNA Microarray Analysis 2003 1 Overview I II III

More information

Programs. Introduction

Programs. Introduction 16 Interior Point I: Linear Programs Lab Objective: For decades after its invention, the Simplex algorithm was the only competitive method for linear programming. The past 30 years, however, have seen

More information

Chapter 6. Curves and Surfaces. 6.1 Graphs as Surfaces

Chapter 6. Curves and Surfaces. 6.1 Graphs as Surfaces Chapter 6 Curves and Surfaces In Chapter 2 a plane is defined as the zero set of a linear function in R 3. It is expected a surface is the zero set of a differentiable function in R n. To motivate, graphs

More information

Support Vector Machines

Support Vector Machines Support Vector Machines About the Name... A Support Vector A training sample used to define classification boundaries in SVMs located near class boundaries Support Vector Machines Binary classifiers whose

More information

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CHAPTER 4 CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS 4.1 Introduction Optical character recognition is one of

More information

MODEL SELECTION AND REGULARIZATION PARAMETER CHOICE

MODEL SELECTION AND REGULARIZATION PARAMETER CHOICE MODEL SELECTION AND REGULARIZATION PARAMETER CHOICE REGULARIZATION METHODS FOR HIGH DIMENSIONAL LEARNING Francesca Odone and Lorenzo Rosasco odone@disi.unige.it - lrosasco@mit.edu June 6, 2011 ABOUT THIS

More information

Divide and Conquer Kernel Ridge Regression

Divide and Conquer Kernel Ridge Regression Divide and Conquer Kernel Ridge Regression Yuchen Zhang John Duchi Martin Wainwright University of California, Berkeley COLT 2013 Yuchen Zhang (UC Berkeley) Divide and Conquer KRR COLT 2013 1 / 15 Problem

More information

Practical Implementations Of The Active Set Method For Support Vector Machine Training With Semi-definite Kernels

Practical Implementations Of The Active Set Method For Support Vector Machine Training With Semi-definite Kernels University of Central Florida Electronic Theses and Dissertations Doctoral Dissertation (Open Access) Practical Implementations Of The Active Set Method For Support Vector Machine Training With Semi-definite

More information

Convexity: an introduction

Convexity: an introduction Convexity: an introduction Geir Dahl CMA, Dept. of Mathematics and Dept. of Informatics University of Oslo 1 / 74 1. Introduction 1. Introduction what is convexity where does it arise main concepts and

More information

MATHEMATICS II: COLLECTION OF EXERCISES AND PROBLEMS

MATHEMATICS II: COLLECTION OF EXERCISES AND PROBLEMS MATHEMATICS II: COLLECTION OF EXERCISES AND PROBLEMS GRADO EN A.D.E. GRADO EN ECONOMÍA GRADO EN F.Y.C. ACADEMIC YEAR 2011-12 INDEX UNIT 1.- AN INTRODUCCTION TO OPTIMIZATION 2 UNIT 2.- NONLINEAR PROGRAMMING

More information

CS 473: Algorithms. Ruta Mehta. Spring University of Illinois, Urbana-Champaign. Ruta (UIUC) CS473 1 Spring / 36

CS 473: Algorithms. Ruta Mehta. Spring University of Illinois, Urbana-Champaign. Ruta (UIUC) CS473 1 Spring / 36 CS 473: Algorithms Ruta Mehta University of Illinois, Urbana-Champaign Spring 2018 Ruta (UIUC) CS473 1 Spring 2018 1 / 36 CS 473: Algorithms, Spring 2018 LP Duality Lecture 20 April 3, 2018 Some of the

More information

Don t just read it; fight it! Ask your own questions, look for your own examples, discover your own proofs. Is the hypothesis necessary?

Don t just read it; fight it! Ask your own questions, look for your own examples, discover your own proofs. Is the hypothesis necessary? Don t just read it; fight it! Ask your own questions, look for your own examples, discover your own proofs. Is the hypothesis necessary? Is the converse true? What happens in the classical special case?

More information

Content-based image and video analysis. Machine learning

Content-based image and video analysis. Machine learning Content-based image and video analysis Machine learning for multimedia retrieval 04.05.2009 What is machine learning? Some problems are very hard to solve by writing a computer program by hand Almost all

More information

Machine Learning. Support Vector Machines. Fabio Vandin November 20, 2017

Machine Learning. Support Vector Machines. Fabio Vandin November 20, 2017 Machine Learning Support Vector Machines Fabio Vandin November 20, 2017 1 Classification and Margin Consider a classification problem with two classes: instance set X = R d label set Y = { 1, 1}. Training

More information

Perceptron Learning Algorithm

Perceptron Learning Algorithm Perceptron Learning Algorithm An iterative learning algorithm that can find linear threshold function to partition linearly separable set of points. Assume zero threshold value. 1) w(0) = arbitrary, j=1,

More information

Lecture notes on the simplex method September We will present an algorithm to solve linear programs of the form. maximize.

Lecture notes on the simplex method September We will present an algorithm to solve linear programs of the form. maximize. Cornell University, Fall 2017 CS 6820: Algorithms Lecture notes on the simplex method September 2017 1 The Simplex Method We will present an algorithm to solve linear programs of the form maximize subject

More information

Generative and discriminative classification techniques

Generative and discriminative classification techniques Generative and discriminative classification techniques Machine Learning and Category Representation 013-014 Jakob Verbeek, December 13+0, 013 Course website: http://lear.inrialpes.fr/~verbeek/mlcr.13.14

More information

LECTURE 13: SOLUTION METHODS FOR CONSTRAINED OPTIMIZATION. 1. Primal approach 2. Penalty and barrier methods 3. Dual approach 4. Primal-dual approach

LECTURE 13: SOLUTION METHODS FOR CONSTRAINED OPTIMIZATION. 1. Primal approach 2. Penalty and barrier methods 3. Dual approach 4. Primal-dual approach LECTURE 13: SOLUTION METHODS FOR CONSTRAINED OPTIMIZATION 1. Primal approach 2. Penalty and barrier methods 3. Dual approach 4. Primal-dual approach Basic approaches I. Primal Approach - Feasible Direction

More information

5 Day 5: Maxima and minima for n variables.

5 Day 5: Maxima and minima for n variables. UNIVERSITAT POMPEU FABRA INTERNATIONAL BUSINESS ECONOMICS MATHEMATICS III. Pelegrí Viader. 2012-201 Updated May 14, 201 5 Day 5: Maxima and minima for n variables. The same kind of first-order and second-order

More information

ECG782: Multidimensional Digital Signal Processing

ECG782: Multidimensional Digital Signal Processing ECG782: Multidimensional Digital Signal Processing Object Recognition http://www.ee.unlv.edu/~b1morris/ecg782/ 2 Outline Knowledge Representation Statistical Pattern Recognition Neural Networks Boosting

More information

CS446: Machine Learning Fall Problem Set 4. Handed Out: October 17, 2013 Due: October 31 th, w T x i w

CS446: Machine Learning Fall Problem Set 4. Handed Out: October 17, 2013 Due: October 31 th, w T x i w CS446: Machine Learning Fall 2013 Problem Set 4 Handed Out: October 17, 2013 Due: October 31 th, 2013 Feel free to talk to other members of the class in doing the homework. I am more concerned that you

More information

Mathematical Themes in Economics, Machine Learning, and Bioinformatics

Mathematical Themes in Economics, Machine Learning, and Bioinformatics Western Kentucky University From the SelectedWorks of Matt Bogard 2010 Mathematical Themes in Economics, Machine Learning, and Bioinformatics Matt Bogard, Western Kentucky University Available at: https://works.bepress.com/matt_bogard/7/

More information

Some Advanced Topics in Linear Programming

Some Advanced Topics in Linear Programming Some Advanced Topics in Linear Programming Matthew J. Saltzman July 2, 995 Connections with Algebra and Geometry In this section, we will explore how some of the ideas in linear programming, duality theory,

More information

Support vector machines. Dominik Wisniewski Wojciech Wawrzyniak

Support vector machines. Dominik Wisniewski Wojciech Wawrzyniak Support vector machines Dominik Wisniewski Wojciech Wawrzyniak Outline 1. A brief history of SVM. 2. What is SVM and how does it work? 3. How would you classify this data? 4. Are all the separating lines

More information

Constrained optimization

Constrained optimization Constrained optimization A general constrained optimization problem has the form where The Lagrangian function is given by Primal and dual optimization problems Primal: Dual: Weak duality: Strong duality:

More information

Kernel SVM. Course: Machine Learning MAHDI YAZDIAN-DEHKORDI FALL 2017

Kernel SVM. Course: Machine Learning MAHDI YAZDIAN-DEHKORDI FALL 2017 Kernel SVM Course: MAHDI YAZDIAN-DEHKORDI FALL 2017 1 Outlines SVM Lagrangian Primal & Dual Problem Non-linear SVM & Kernel SVM SVM Advantages Toolboxes 2 SVM Lagrangian Primal/DualProblem 3 SVM LagrangianPrimalProblem

More information

A generalized quadratic loss for Support Vector Machines

A generalized quadratic loss for Support Vector Machines A generalized quadratic loss for Support Vector Machines Filippo Portera and Alessandro Sperduti Abstract. The standard SVM formulation for binary classification is based on the Hinge loss function, where

More information