SELF-ADAPTIVE SUPPORT VECTOR MACHINES


SELF-ADAPTIVE SUPPORT VECTOR MACHINES AND AUTOMATIC FEATURE SELECTION

By PENG DU, M.Sc., B.Sc.

A thesis submitted to the School of Graduate Studies in partial fulfillment of the requirements for the degree Master of Science, McMaster University. © Copyright by Peng Du, June 2004

MASTER OF SCIENCE (2004), COMPUTING & SOFTWARE, McMaster University, Hamilton, Ontario. TITLE: SELF-ADAPTIVE SUPPORT VECTOR MACHINES AND AUTOMATIC FEATURE SELECTION. AUTHOR: Peng Du, M.Sc., B.Sc. (McMaster University). SUPERVISORS: Dr. Tamás Terlaky, Dr. Jiming Peng. NUMBER OF PAGES: vi, 75

Abstract

We handle the problem of model and feature selection for Support Vector Machines (SVMs) in this thesis. To select the best model for a given training set, we embed the standard convex quadratic problem of SVM training in an upper level problem where the optimal objective value of the quadratic problem is minimized over the feasible range of the kernel parameters. This leads to a bi-level optimization problem for which an optimal solution always exists. The problem of feature selection can be solved simultaneously under the same framework, except that the kernel function is extended by introducing an independent kernel parameter for each feature in the original space. Two solution strategies for the bi-level problems of these Self-Adaptive SVMs (SASVMs) are studied. Under the first strategy, the lower level problem is replaced by its Karush-Kuhn-Tucker (KKT) conditions, and the bi-level problem is then solved as a non-linear, non-convex one-level problem by general non-linear solvers. Experimental results show that this strategy can only handle small training sets. The second strategy is designed around the derivative free optimization (DFO) method. Its main idea is to minimize, over a controlled range of the kernel parameters, a quadratic model of the optimal objective value that is constructed from the solutions of the lower level problem at discrete values of the kernel parameters. Finally, SASVMs are applied to several benchmark data sets: multi-spectral and hyper-spectral Remote Sensing images, the MNIST data set of hand-written digits, and a breast cancer classification data set. The accuracies of SASVMs are compared with those achieved by other classifiers.

Acknowledgements

This work was built upon the ideas and help of my supervisors, Dr. J. Peng and Dr. T. Terlaky. I thank them for their continuous support and for the time and energy they spent teaching and guiding me. I also want to thank my colleagues at the research laboratory for making my study a wonderful experience that I will never forget. I would like to thank the members of my committee, Dr. J. Zucker, Dr. C. Anand, Dr. J. Peng and Dr. T. Terlaky, for their kind and careful examination and their useful suggestions. I am also indebted to Dr. T. Joachims for providing SVMlight and F. Ellen for the DFO software package. Finally, I want to thank my wife, Yu Wang. I could not have finished this work without her support.

Contents

Abstract
Acknowledgements

1 Support Vector Machines
   1.1 Introduction to Classification
   1.2 Learning Machines and Generalization Performance
       1.2.1 VC Dimension
       1.2.2 Generalization Performance
   1.3 Linear Support Vector Machines
       1.3.1 The Linearly Separable Case
       1.3.2 The Linearly Non-separable Case
   1.4 Non-linear Support Vector Machines
       1.4.1 Non-linear Mappings and Feature Spaces
       1.4.2 Kernel Functions

2 Self Adaptive Support Vector Machines
   2.1 Introduction
   2.2 The Restricted Training Task
   2.3 The Structural Training Task
       2.3.1 The VC Dimensions of SVMs with Polynomial and RBF Kernels
       2.3.2 Self Adaptive Support Vector Machines
       2.3.3 An Illustrative Example
   2.4 Automatic Feature Selection
       2.4.1 Extended Kernel Functions
       2.4.2 An Illustrative Example

3 Solving the BLP of SASVM
   3.1 General Bi-level Problems
   3.2 Converting to a One-level Problem
   3.3 Derivative Free Approach
       Motivations
       A Class of Derivative Free Algorithms

4 Implementation of SASVM Package
   Implementation Issues
   OOP for SASVM
   Major Classes in SASVM Package: Class Model, Class ModelSet, Class HypothesisSpace, Class Task
   Structure of SASVM
   How to Use SASVM Package: Running Environment, Use SASVM for Training, Use SASVM for Testing and Classification, Use SASVM for Feature Selection, SASVM's Error Codes

5 SASVMs for Real Classification Problems
   Thematic Mapper Remote Sensing Images: A Thematic Mapper Scene of Tippecanoe County, Result Analysis
   Hyper-spectral Remote Sensing Images: Indian Pine 1992 AVIRIS Image, Result Analysis
   Handwritten Digit Classification: Data Set Description, Handwritten Digit Classification
   Breast Cancer Diagnosis: Data Set Description, Performance on Breast Cancer Data Set

6 Conclusions

Chapter 1 Support Vector Machines

This chapter gives a basic introduction to Support Vector Machines (SVMs), a family of learning machines for classification problems. In Section 1.1, we introduce and discuss what classification is. In Section 1.2, we introduce some basic concepts and some important results regarding SVMs. In Sections 1.3 and 1.4, we discuss two types of SVMs: linear and non-linear SVMs.

1.1 Introduction to Classification

We deal with the classification problem in this thesis. Generally speaking, the classification problem is to classify a set of objects, which are commonly called instances, into pre-defined classes or categories. For instance, we can classify the earth's surface into pre-defined classes like residential areas, commercial areas and natural heritage areas. Furthermore, we can classify the natural heritage areas into forest, prairie, wet land and so on. Classification can be done in two different ways: unsupervised classification and supervised classification. The foundation of unsupervised classification is this: if two instances are very close to each other, it is very likely that they belong to the same class. The closeness of two instances is normally measured by the distance between them. There are many types of distances in the literature; for more information about distances, we refer to the book written by Landgrebe [21]. One distinct feature of unsupervised classification is that the classification is applied to a set of instances for which we do not know the true classes. On the other hand, supervised classification is a method based on a set

of instances with true classes known in advance. The given set of instances is further partitioned into two parts: a training set and a testing set. A supervised classification starts with a training process followed by a testing process, then goes back to the training set until a decision boundary is constructed. The decision boundary is a line or surface separating one class from the other. The algorithm used to construct a decision boundary from a training set is called a classifier. Since the process of constructing the decision boundary from a training set can also be considered as a process of learning the information contained in the training set, a classification algorithm is also called a learning machine. The term classifier is also used to refer to the constructed decision boundary, which is called a learned machine if the algorithm is called a learning machine. In this thesis, we discuss support vector machines, which are sets of supervised classification algorithms.

1.2 Learning Machines and Generalization Performance

Formally speaking, a classification problem can be expressed in this way: suppose that we are given a training set Z = {(x^1, y_1), (x^2, y_2), ..., (x^l, y_l)} of l instances, where each instance carries n attributes (x^i = (x^i_1, ..., x^i_n)) and a class label y_i ∈ {+1, -1}. We want to construct a decision boundary separating the two classes such that the probability of classification error, or misclassification, made by the resulting decision boundary on a new instance is minimized. The complete set of all possible instances is called a population; therefore a training set Z is always a subset of the corresponding population. Usually, we assume that a population follows a determined but unknown probability distribution P(x, y) and that the training set Z is drawn independently and identically distributed according to P(x, y). Since a decision boundary separates instances from two classes, what a learning machine learns is in fact a mapping x -> y from a set of mappings {f(x; α, b) : R^n -> {+1, -1}}. The functions f(x; α, b) themselves are labelled by the adjustable parameters α and b. The process of learning is to find the appropriate values for α and b based on the information in Z. In the literature, the set of possible mappings is usually called the hypothesis space, denoted by the symbol H. A learning machine is actually determined by the hypothesis space associated with it.

1.2.1 VC Dimension

To find the appropriate values for α and b, let us first take a close look at the hypothesis space H. Suppose we have a training set of size l; it can be labelled in 2^l possible ways. If for each labelling we can find at least one member of H which correctly assigns those labels, we say that the given set is shattered by the set of functions H. To better understand this concept, let us consider points in R^2 and a hypothesis space consisting of all possible linear functions in R^2. If we have only 3 points in R^2, and thus 8 possible labellings, then no matter how we label them, at least one linear function can be found that correctly assigns the labels (see Figure 1.1). However, if we have 4 points located at the 4 corners of a rectangle and we label them alternately, then no linear function can correctly assign the labels. Hence the hypothesis space of linear functions in R^2 can shatter at most 3 points. Based on the concept of shattering, we can define the VC dimension.

Figure 1.1: Three points in R^2 shattered by lines

Definition 1.2.1 [34] The VC dimension of a hypothesis space H is the maximum number of training points that can be shattered by H.

Note that the VC dimension is a property of a hypothesis space or a learning machine. It has nothing to do with particular training sets, although it is defined in terms of the sizes of training sets. The VC dimension measures the capacity of a hypothesis space to assign a training set correctly no matter how the training
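To make the shattering argument concrete, the short sketch below enumerates every ±1 labelling of a point set and checks whether a linear classifier can realize it. It is only an illustration under stated assumptions: scikit-learn's SVC with a very large C is used as a stand-in for a hard-margin linear machine and is not part of the thesis software.

```python
# Sketch: check empirically which labellings of a point set a linear
# classifier can realize (assumes scikit-learn; illustrative only).
import itertools
import numpy as np
from sklearn.svm import SVC

def can_shatter(points):
    """Return True if every +/-1 labelling of `points` is linearly separable."""
    n = len(points)
    for labels in itertools.product([-1, 1], repeat=n):
        if len(set(labels)) < 2:              # one-class labellings are trivially separable
            continue
        clf = SVC(kernel="linear", C=1e6)     # large C approximates a hard margin
        clf.fit(points, labels)
        if clf.score(points, labels) < 1.0:   # this labelling cannot be realized
            return False
    return True

three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
four  = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(can_shatter(three))  # expected: True  (3 points in R^2 are shattered)
print(can_shatter(four))   # expected: False (the alternating/XOR labelling fails)
```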

set is labelled. A hypothesis space with a higher VC dimension usually means more flexibility to simulate complex decision boundaries.

1.2.2 Generalization Performance

Since a higher VC dimension usually means more flexibility to simulate complex decision boundaries, why don't we exclusively use hypothesis spaces with high VC dimensions? A simple answer is that our ultimate goal in a classification task is to achieve high generalization performance. Theoretically, generalization performance can be measured as the actual risk, which is defined as:

    R(\alpha, b) = \int \frac{1}{2} |y - f(x; \alpha, b)| \, dP(x, y).

This formula represents the actual risk. However, unless we know what P(x, y) is, it is not easy to evaluate R(α, b). Fortunately, the empirical risk provides an estimate for R(α, b), though of course not the exact value of R(α, b). The empirical risk R_emp(α, b) on the training set Z is defined as:

    R_{emp}(\alpha, b) = \frac{1}{2l} \sum_{i=1}^{l} |y_i - f(x^i; \alpha, b)|.

R_emp(α, b) is a fixed value for a particular choice of α and b and a given training set with a finite number of instances. Based on the empirical risk and the VC dimension, we can build an upper bound on the actual risk, which is given in the following theorem.

Theorem 1.2.2 [34] Let H be a hypothesis space with VC dimension h. Let Z be a training set of size l. For any probability distribution P(x, y) on the population, with probability 1 - η, any hypothesis in H that achieves an empirical risk R_emp(α, b) on Z has actual risk no more than

    R(\alpha, b) \le R_{emp}(\alpha, b) + \sqrt{\frac{h\left(\log(2l/h) + 1\right) - \log(\eta/4)}{l}}.   (1.2.1)

The second term on the right-hand side is called the VC confidence. It is a function of the VC dimension: when the VC dimension increases, the VC confidence increases as well. As a result, we end up with a loose upper bound on R(α, b). If h is known in advance, we can easily compute the right-hand side.
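Since the labels and predictions take values in {−1, +1}, the empirical risk above is simply the fraction of misclassified training instances. A minimal sketch (the function and variable names are illustrative only):

```python
# Empirical risk R_emp = (1/2l) * sum_i |y_i - f(x^i)| for labels in {-1, +1};
# it equals the misclassification rate on the training set.
import numpy as np

def empirical_risk(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    l = len(y_true)
    return np.abs(y_true - y_pred).sum() / (2.0 * l)

y_true = np.array([ 1, -1,  1,  1, -1])
y_pred = np.array([ 1,  1,  1, -1, -1])   # two mistakes out of five
print(empirical_risk(y_true, y_pred))     # 0.4
```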

This theorem bears an important message applicable to any classification task. When we train a learning machine on a training set Z, we want not only to achieve a low empirical risk, but also to keep h as low as possible in order to achieve the lowest actual risk. To better understand this, let us imagine that we have a sequence of learning machines with their values of h given in a monotonic order (recall that one learning machine is just a set of functions f(x; α, b)). Suppose the confidence level η is fixed and that, for each learning machine in this sequence, there is no trouble in finding the best function which minimizes the right-hand side of (1.2.1). Hence, a sequence of best functions is found. By taking from this sequence the function that minimizes the right-hand side of (1.2.1), we are choosing the function that gives the lowest upper bound on the actual risk over the whole sequence. This gives a principle of training for a classification task, which is the essential idea of structural risk minimization. The principle of structural risk minimization was first introduced by Vapnik in 1979 [33], where he introduced a structure by dividing the union of the hypothesis spaces of a learning machine sequence into nested subsets (see Figure 1.2). The VC dimensions of these nested subsets can be estimated or bounded above. The union of the functions is structured in such a way that the inner nested subsets have smaller VC dimensions than the outer subsets. What structural risk minimization does is to find the subset of the set of functions which minimizes a bound on the actual risk.

Figure 1.2: Nested Subsets of Functions Ordered by VC Dimensions

To facilitate our discussion, we introduce another structure on learning machines according to their VC dimensions.

Definition 1.2.3 Given a sequence of learning machines and a corresponding sequence of hypothesis spaces, suppose the VC dimension of every learning machine in the sequence is given. A structural hypothesis space, denoted

by H_S, is the given sequence of hypothesis spaces ordered by their VC dimensions.

We point out that, based on Definition 1.2.3, the hypothesis spaces of higher VC dimension do not have to be supersets of the hypothesis spaces of lower VC dimension. Structural hypothesis spaces will be further discussed when we discuss self-adaptive support vector machines in Chapter 2.

1.3 Linear Support Vector Machines

In Sections 1.1 and 1.2, we discussed learning machines and their generalization performance from a general viewpoint. In this section, we discuss a specific learning machine.

1.3.1 The Linearly Separable Case

SVMs were first introduced by Vapnik in 1995 [34]. Their main task is to find a hyperplane that separates a given training set Z correctly and leaves as much distance as possible from the closest instances to the hyperplane on both sides. The distance from the closest instances to the separating hyperplane is called the margin, and the hyperplane that realizes the maximal margin is called the optimal hyperplane. The closest instances are called support vectors, which gives the algorithm its name. We start with the simplest case: linear machines trained on linearly separable training sets. In this case, the hypothesis space is given as:

    H := \{ f(x; w, b) = w^T x + b \mid w, x \in R^n, b \in R \},   (1.3.2)

where w and b are the adjustable parameters. The main task of SVMs is to determine the appropriate values for w and b that maximize the margin and at the same time separate the data points correctly. We can easily prove that this statement is true if and only if we have f(x; w, b) = +1 or -1 at each support vector x. Hence, we have the following optimization problem:

    \min_{w,b} \ \frac{1}{2} w^T w \quad \text{s.t.} \quad y_i (w^T x^i + b) \ge 1, \quad i = 1, \dots, l.   (1.3.3)

1.3.2 The Linearly Non-separable Case

Note that when the two classes are not linearly separable, problem (1.3.3) becomes infeasible. One way to deal with this is to introduce slack variables ξ_i ∈ R_+, i = 1, ..., l; then we have:

    \min_{w,b,\xi} \ \frac{1}{2} w^T w + C\xi^T e \quad \text{s.t.} \quad y_i (w^T x^i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, l,   (1.3.4)

where C is a penalty factor chosen by the user and e is the all-one vector of appropriate dimension. Problem (1.3.4) is a standard convex quadratic programming problem, whose dual problem can be written as follows:

    \max_{\alpha} \ \alpha^T e - \frac{1}{2} \alpha^T K \alpha \quad \text{s.t.} \quad y^T \alpha = 0, \quad Ce \ge \alpha \ge 0,   (1.3.5)

where α ∈ R^l is the vector of dual variables, y = [y_1, y_2, ..., y_l]^T, and K ∈ R^{l×l} with K_ij = y_i y_j (x^{iT} x^j). Problem (1.3.5) is a standard quadratic programming problem as well, for which efficient algorithms are available. Furthermore, if problem (1.3.5) is a strictly convex quadratic problem, then its optimal solution is unique. The optimality conditions for problem (1.3.5) can be written as:

    w - \sum_{i=1}^{l} \alpha_i y_i x^i = 0,
    \sum_{i=1}^{l} \alpha_i y_i = 0,
    C - \alpha_i - \mu_i = 0, \quad i = 1, \dots, l,
    \alpha_i \left( y_i (x^{iT} w + b) - 1 + \xi_i \right) = 0, \quad i = 1, \dots, l,
    \mu_i \xi_i = 0, \quad i = 1, \dots, l,
    y_i (x^{iT} w + b) - 1 + \xi_i \ge 0, \quad i = 1, \dots, l,
    \xi \ge 0, \quad \alpha \ge 0, \quad \mu \ge 0.   (1.3.6)

From duality theory in optimization, we know that, by solving (1.3.6), we can obtain solutions to both the primal and dual problems.

Let α* denote the unique optimal solution of the dual problem. Then, by using the first equality condition in (1.3.6), the optimal decision function can be written as:

    f(x; \alpha^*, b^*) = \sum_{i=1}^{l} y_i \alpha_i^* (x^{iT} x) + b^*.   (1.3.7)

Observe that b does not appear in the dual problem, so b* must be found by making use of the primal constraints and the complementarity conditions, i.e. b* is chosen so that:

    y_i f(x^i; \alpha^*, b^*) = 1, \quad \forall i : 0 < \alpha_i^* < C.   (1.3.8)

Alternatively, we can write down the hypothesis space equivalent to (1.3.2) in terms of the dual variables. It is:

    H := \left\{ f(x; \alpha, b) = \sum_{i=1}^{l} y_i \alpha_i (x^{iT} x) + b \ \middle| \ \alpha \in R^l, b \in R \right\}.   (1.3.9)
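As a hedged illustration of how (1.3.5)-(1.3.8) fit together, the sketch below solves the dual for a tiny linearly separable set with a general-purpose solver and then recovers w and b from the optimality conditions. SciPy's SLSQP routine is assumed here purely for convenience; the thesis itself relies on SVMlight for this step.

```python
# Solve the dual (1.3.5) for a toy data set, then recover w and b via (1.3.6)-(1.3.8).
import numpy as np
from scipy.optimize import minimize

X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0], [3.0, 1.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
C = 10.0
l = len(y)

K = (y[:, None] * y[None, :]) * (X @ X.T)       # K_ij = y_i y_j <x^i, x^j>

def neg_dual(alpha):                             # maximize e^T a - 1/2 a^T K a
    return 0.5 * alpha @ K @ alpha - alpha.sum()

cons = [{"type": "eq", "fun": lambda a: y @ a}]  # y^T alpha = 0
bounds = [(0.0, C)] * l                          # 0 <= alpha_i <= C
res = minimize(neg_dual, np.zeros(l), method="SLSQP",
               bounds=bounds, constraints=cons)
alpha = res.x

w = (alpha * y) @ X                              # w = sum_i alpha_i y_i x^i
sv = (alpha > 1e-6) & (alpha < C - 1e-6)         # indices with 0 < alpha_i < C
b = np.mean(y[sv] - X[sv] @ w)                   # enforce y_i f(x^i) = 1 on them
print(alpha, w, b)
print(np.sign(X @ w + b))                        # reproduces the training labels
```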

1.4 Non-linear Support Vector Machines

1.4.1 Non-linear Mappings and Feature Spaces

In order to learn a non-linear decision function for a linearly non-separable training set, a common strategy is to map the representation of the data from the original input space, denoted by X, to a usually more complex space, denoted by F. This new space is called the feature space in the literature. Suppose a mapping is given as:

    x = (x_1, x_2, \dots, x_n) \mapsto \Phi(x) = (\phi_1(x), \phi_2(x), \dots, \phi_N(x)).

Here φ_i(x), i = 1, ..., N, are usually non-linear functions representing the new features in the N-dimensional feature space F. Non-linear SVMs try to find a hyperplane separating the training set Z in the new feature space F. The separating problem in F can be formulated as follows:

    \min_{\tilde w, b, \xi} \ \frac{1}{2} \tilde w^T \tilde w + C\xi^T e \quad \text{s.t.} \quad y_i (\tilde w^T \Phi(x^i) + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, l,   (1.4.10)

where \tilde w ∈ R^N; \tilde w and b are the decision variables defining the separating hyperplane in F. The hypothesis space in this case is:

    H := \{ f(x; \tilde w, b) = \tilde w^T \Phi(x) + b \mid \tilde w \in R^N, b \in R \}.   (1.4.11)

The dual problem of (1.4.10) is:

    \max_{\alpha} \ \alpha^T e - \frac{1}{2} \alpha^T K \alpha \quad \text{s.t.} \quad y^T \alpha = 0, \quad Ce \ge \alpha \ge 0,   (1.4.12)

where, in this case, K_ij = y_i y_j Φ(x^i)^T Φ(x^j). Note that the dual problem is the same as problem (1.3.5) except that the matrix K is different. In other words, the introduction of non-linear mappings does not influence the formulation of the dual problem except through this matrix, which is the only part related to the non-linear mapping. Consequently, the optimality conditions for problem (1.4.12) are very similar to the optimality conditions (1.3.6) for the linear case; to save space, they are not listed here. By applying the optimality conditions, the hypothesis space of non-linear SVMs in terms of the dual variables can be written as:

    H := \left\{ f(x; \alpha, b) = \sum_{i=1}^{l} y_i \alpha_i \left( \Phi(x^i)^T \Phi(x) \right) + b \ \middle| \ \alpha \in R^l, b \in R \right\}.   (1.4.13)

Just like in the linear case, extra effort is needed to determine b because it does not appear in the dual problem. As we can see from the above discussion, we take two steps to build a non-linear decision boundary in the original input space X: first, the data representation is mapped from X to F by Φ(x) : X -> F; then, a linear learning machine is applied to find the optimal hyperplane in F. If we can compute the inner product Φ(x^i)^T Φ(x^j) in F directly as a function of variables in X, then these two steps can be merged into one. Such a method is called a kernel method.

Definition A kernel function K : R^n × R^n -> R is defined, for all x, z ∈ X, as K(x, z) = Φ(x)^T Φ(z).

Definition A kernel matrix K ∈ R^{l×l} on a training set Z, with respect to a kernel function K(x, z), is defined as:

    K_{ij} = K(x^i, x^j), \quad i, j = 1, \dots, l.

Correspondingly, the hypothesis space can be rewritten as:

    H := \left\{ f(x; \alpha, b) = \sum_{i=1}^{l} y_i \alpha_i K(x^i, x) + b \ \middle| \ \alpha \in R^l, b \in R \right\}.   (1.4.14)

Defining a kernel function is frequently more convenient and natural than defining a mapping from X to F. Once the kernel function and kernel matrix are defined, we can write down the optimization problem and the hypothesis space in terms of the kernel function and kernel matrix without knowing the exact formulation of the underlying mapping from X to F. This is called implicit mapping. One problem in non-linear SVMs is to select the kernel functions so that problem (1.4.12) is well defined and solvable.
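The following sketch makes the idea of implicit mapping tangible for the degree-2 polynomial kernel in R^2: the explicit feature map can still be written down and checked against K(x, z) = (x^T z + c)^2, even though the kernel method never needs it. The helper names are illustrative only.

```python
# Explicit degree-2 polynomial feature map versus the kernel computed in X.
import numpy as np

def phi(x, c):
    x1, x2 = x
    return np.array([x1**2, x2**2,
                     np.sqrt(2.0) * x1 * x2,
                     np.sqrt(2.0 * c) * x1,
                     np.sqrt(2.0 * c) * x2,
                     c])

def poly_kernel(x, z, c):
    return (x @ z + c) ** 2

rng = np.random.default_rng(0)
x, z = rng.normal(size=2), rng.normal(size=2)
c = 1.0
print(phi(x, c) @ phi(z, c))   # inner product computed in the feature space F
print(poly_kernel(x, z, c))    # same value computed directly in the input space X
```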

1.4.2 Kernel Functions

To introduce various kernel functions that enjoy certain desirable properties, we need the following theorem, which was proved by Mercer [24].

Theorem Let X be a compact subset of R^n. Suppose K is a continuous symmetric function such that the integral operator T_K : L_2(X) -> L_2(X),

    (T_K f)(\cdot) = \int_{X} K(\cdot, x) f(x) \, dx,

is positive, that is,

    \int_{X \times X} K(x, z) f(x) f(z) \, dx \, dz \ge 0 \quad \text{for all } f \in L_2(X).

Then we can expand K(x, z) in a uniformly convergent series on X × X in terms of T_K's normalized (‖φ_j‖_{L_2} = 1) eigenfunctions φ_j ∈ L_2(X) and non-negative eigenvalues λ_j ≥ 0 as:

    K(x, z) = \sum_{j=1}^{\infty} \lambda_j \phi_j(x) \phi_j(z).

Suppose that we arbitrarily take a set of points {x^1, x^2, ..., x^l} from the compact set X, and denote the values of a function f on this set by a vector v. Let a kernel function K(x, z) be given and the kernel matrix be denoted by K. Then, for this set, the positivity condition

    \int_{X \times X} K(x, z) f(x) f(z) \, dx \, dz \ge 0, \quad \forall f \in L_2(X),

becomes v^T K v ≥ 0. Hence the positivity condition in Mercer's theorem is equivalent to requiring that, for any finite subset of X, the corresponding matrix K is positive semidefinite.

Corollary [34] The following functions satisfy Mercer's positivity condition, and thus they are kernel functions:

1. K(x, z; σ) = exp(−‖x − z‖² / (2σ²)),
2. K(x, z; d, c) = (x^T z + c)^d,

where σ and d are kernel parameters, and c is a constant. They are called the Gaussian RBF kernel and the polynomial kernel, respectively. The kernel parameters σ and d play important roles in the control of the VC dimension. We call c a constant, instead of a kernel parameter, because it plays a role different from VC dimension control. This will be explained in detail in Section 2.3.1.
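The finite-sample form of Mercer's positivity condition can be checked numerically: on any sample, the Gaussian RBF kernel matrix should have no (significantly) negative eigenvalues. A small sketch, with arbitrary data and tolerance:

```python
# Numerical check of v^T K v >= 0 for the Gaussian RBF kernel on a finite sample.
import numpy as np

def rbf_kernel_matrix(X, sigma):
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # squared pairwise distances
    return np.exp(-d2 / (2.0 * sigma**2))

X = np.random.default_rng(1).normal(size=(30, 4))
K = rbf_kernel_matrix(X, sigma=1.5)
eigvals = np.linalg.eigvalsh(K)
print(eigvals.min() >= -1e-10)   # expected: True, i.e. K is positive semidefinite
```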


Chapter 2 Self Adaptive Support Vector Machines

2.1 Introduction

Since their first introduction, SVMs have been recognized as major learning algorithms for real world classification problems, such as hand-written digit recognition [22], image recognition [26] and protein homology detection [14]. However, some questions remain open: how to select the right kernel function for a classification task, how to find appropriate values for the kernel parameters, how much to penalize a misclassification, and how to decide automatically the contribution of each attribute or feature so that feature selection can be done systematically. In this thesis, we try to address two of these problems. The first is how to find the best value for the kernel parameter, which is referred to as model selection in the literature. We call an SVM Self-Adaptive, or simply an SASVM, if it is able to automatically tune the kernel parameters to the best values for any given data set. The second problem is automatic feature selection. An SVM with automatic feature selection is an SVM learning algorithm that is able to automatically identify the features that are most important for a classification task. Model selection is not new in the field of machine learning, and some research has been done on model selection for support vector machines as well. In 1999, Chapelle and Vapnik [7] presented new functionals for support vector machine model selection. These new functionals can be used to estimate the generalization error. More results regarding the generalization error of

SVMs were presented by Chapelle and Vapnik in [6], where a gradient descent algorithm was introduced to minimize some estimates of the generalization error over a set of kernel parameters. However, the estimates of the generalization error being minimized are complicated functions, which in turn makes the optimization process difficult. The problem of feature selection for SVMs was first studied by Bradley et al. in 1998 [25]. Weston et al. [13] introduced a method for the feature selection problem. In 2002, Chapelle and Vapnik suggested another method in which each feature in the original input space carries its own kernel parameter [6]. If the value of a kernel parameter is very small, the corresponding feature can be removed from the data set without influencing the classification accuracy significantly. In this thesis we apply this idea to the mathematical model defined for the model selection problem. Consequently, the problems of model and feature selection can be handled simultaneously.

2.2 The Restricted Training Task

Recall the concept of a structural hypothesis space from Section 1.2.2. In what follows, we define a restricted hypothesis space and the corresponding training task.

Definition Suppose a training set Z and a penalty factor C are given. Let K(x, z) be a kernel function corresponding to an unknown but determined non-linear mapping from X to F. A restricted hypothesis space H_R is a set of functions with parameters α, b given as:

    H_R := \left\{ f(x; \alpha, b) = \sum_{i=1}^{l} y_i \alpha_i K(x^i, x) + b \ \middle| \ \alpha \in R^l, b \in R \right\},

where (x^i, y_i) ∈ Z and 0 ≤ α_i ≤ C, i = 1, 2, ..., l.

Definition Suppose a penalty factor C and a training set Z are given. A restricted training task is a search process to find a function f_R*(x) := f_R(x; α*, b*) in a restricted hypothesis space H_R such that the objective function \tilde w^T \tilde w + Cξ^T e is minimized, i.e.

    f_R^*(x) = \arg\min_{f(x;\alpha,b) \in H_R} \left( \tilde w^T \tilde w + C\xi^T e \right),

where \tilde w is the coefficient vector with dimension equal to that of the fixed feature space associated with H_R.

A restricted training task can easily be completed by solving problem (1.4.12) with a fixed kernel function and penalty C. There is no free parameter in this formulation, and it is guaranteed by the convexity of problem (1.4.12) that it has a unique solution.
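In other words, a restricted training task is one fit of a standard SVM with the kernel and C held fixed. The sketch below assumes scikit-learn's SVC as a stand-in for SVMlight and shows how the optimal objective value α^Te − ½α^TKα of (1.4.12), which reappears later as the quantity minimized by the upper level, can be read off the fitted dual coefficients.

```python
# One restricted training task: fixed RBF kernel width and fixed C, one convex QP.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, y = make_moons(n_samples=100, noise=0.2, random_state=0)
sigma, C = 0.5, 1.0
gamma = 1.0 / (2.0 * sigma**2)              # SVC's kernel is exp(-gamma ||x - z||^2)

clf = SVC(kernel="rbf", gamma=gamma, C=C).fit(X, y)

a = clf.dual_coef_.ravel()                  # a_i = y_i * alpha_i on the support vectors
K_sv = rbf_kernel(clf.support_vectors_, gamma=gamma)
dual_objective = np.abs(a).sum() - 0.5 * a @ K_sv @ a
print(dual_objective)                       # optimal value of problem (1.4.12)
```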

2.3 The Structural Training Task

It is clear from Section 2.2 that a standard convex quadratic problem needs to be solved only once in order to complete a restricted training task. An immediate question that follows is: what if we use kernel functions with kernel parameters? Before answering this question, let us consider a more fundamental one: what happens when the values of the kernel parameters are changed?

2.3.1 The VC Dimensions of SVMs with Polynomial and RBF Kernels

We first set the stage by giving the following theorem about the VC dimension of hyperplanes. Its proof can be found in [5].

Theorem [5] Consider a set of m points in R^n. Choose any one of the points as the origin. Then the m points can be shattered by hyperplanes if and only if the position vectors of the remaining points are linearly independent.

Corollary [3] The VC dimension of the set of hyperplanes in R^n is n + 1.

This corollary can easily be verified, as we can always choose n + 1 points in R^n, let one of them be the origin, and keep the rest linearly independent. This cannot be done for n + 2 points. Now we are ready to describe the VC dimension of SVMs with kernel functions. First, let us discuss polynomial kernels, which are given as:

    K(x, z) = (x^T z + c)^d,

where x, z ∈ R^n. When d = 2, we have:

    (x^T z + c)^2 = \left( \sum_{i=1}^{n} x_i z_i + c \right) \left( \sum_{j=1}^{n} x_j z_j + c \right)
                  = \sum_{i=1}^{n} \sum_{j=1}^{n} x_i x_j z_i z_j + 2c \sum_{i=1}^{n} x_i z_i + c^2
                  = \sum_{(i,j)=(1,1)}^{(n,n)} (x_i x_j)(z_i z_j) + \sum_{i=1}^{n} (\sqrt{2c}\, x_i)(\sqrt{2c}\, z_i) + c^2.

Observe that there are \binom{n+1}{2} + n + 1 = \binom{n+2}{2} distinct features, being all the monomials of degree up to 2. Their relative weights among the degrees 0, 1 and 2 are controlled by the constant c. Similar derivations can be made for the case of degree up to d, where there are \binom{n+d}{d} distinct features, being all the monomials of degree up to d. This analysis leads to the following theorem.

Theorem [34] If the dimension of the original input space X is n, then the VC dimension of SVMs with polynomial kernels of degree d, i.e. K(x, z) = (x^T z + c)^d, x, z ∈ X, is \binom{n+d}{d} + 1.

The kernel parameter d plays an important role in controlling the VC dimension: when d increases, the VC dimension increases very quickly.

Theorem [5] Consider the Gaussian RBF kernel K(x, z) = exp(−‖x − z‖² / (2σ²)) with RBF width σ. Assume σ can be arbitrarily small and the penalty factor C can take any value. Also assume that training sets can be chosen arbitrarily from X such that the distances between any pair of instances are much larger than σ. Then support vector machines with Gaussian RBF kernels have infinite VC dimension.

A proof of this theorem can be found in [5]. Note that the assumptions in the theorem are very strong and usually cannot be satisfied in a real situation. In fact, a learning machine with infinite VC dimension is not what we want. On the contrary, we want to limit the VC dimension so that the learned machine has a better chance of achieving high generalization performance.
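For reference, the VC-dimension bound for polynomial kernels stated above is easy to tabulate; the small snippet below simply evaluates C(n+d, d) + 1 and illustrates how quickly it grows with the degree d.

```python
# VC-dimension bound for SVMs with polynomial kernels: C(n+d, d) + 1.
from math import comb

def poly_svm_vc_dimension(n, d):
    return comb(n + d, d) + 1

for d in (1, 2, 3, 5):
    print(d, poly_svm_vc_dimension(10, d))   # grows rapidly with the degree d
```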

Suppose we are training a learning machine on a set Z of size l. Then the maximal rank of the kernel matrices defined by Gaussian RBF kernels is l, which means that the VC dimension is at most l + 1. Now, if we increase σ, the size of the subset of Z satisfying the distance requirement decreases, which in turn causes the VC dimension to decrease. Hence, for Gaussian RBF kernels, σ actually controls the VC dimension of the corresponding SVMs. We end this section by giving two important propositions and the definition of the structural training task.

Proposition The hypothesis space of SVMs with polynomial kernels trained on a set Z,

    H_S = \{ f(x; \alpha, b; d) \mid \alpha \in R^l, b \in R, d \in R_+ \}, \quad \text{where} \quad f(x; \alpha, b; d) = \sum_{i=1}^{l} y_i \alpha_i (x^T x^i + c)^d + b,

is a structural hypothesis space. For a particular choice of d, the hypothesis space is a restricted hypothesis space. Furthermore, the VC dimensions of H_S increase with d.

Proposition The hypothesis space of SVMs with Gaussian RBF kernels trained on a set Z,

    H_S = \{ f(x; \alpha, b; \sigma) \mid \alpha \in R^l, b \in R, \sigma \in R_+ \}, \quad \text{where} \quad f(x; \alpha, b; \sigma) = \sum_{i=1}^{l} y_i \alpha_i \exp\!\left( -\frac{\|x - x^i\|^2}{2\sigma^2} \right) + b,

is a structural hypothesis space. For a particular choice of σ, the hypothesis space is a restricted hypothesis space. Furthermore, the VC dimensions of H_S decrease with σ.

Definition Suppose a penalty factor C and a training set Z are given. Let H_S denote a structural hypothesis space with kernel functions K(x, z; σ) and kernel parameter σ. A structural training task is a search process to find a function f*(x) := f(x; α*, b*; σ*) in H_S such that the objective function \tilde w^T \tilde w + Cξ^T e is minimized, i.e.

    f^*(x) = \arg\min_{f(x;\alpha,b;\sigma) \in H_S} \left( \tilde w^T \tilde w + C\xi^T e \right),

where \tilde w is the coefficient vector with dimension equal to that of the feature space corresponding to the current choice of σ.

2.3.2 Self Adaptive Support Vector Machines

In this section, we discuss how to complete a structural training task. When we face a structural training task, we actually face a hypothesis space more complex than a restricted hypothesis space, but one with a nice structure: the VC dimension of the structural hypothesis space changes monotonically with its kernel parameter. Let us take a close look at the kernel parameter, because it provides not only a way to organize the hypothesis space, but also the key to completing the structural training task. For a particular choice of the kernel parameter, the non-linear mapping from X to F is fixed, and the structural hypothesis space is reduced to a restricted hypothesis space. We know from Section 2.2 that the corresponding restricted training task can be completed by solving a standard convex quadratic optimization problem. Now, if we tune the kernel parameter, the non-linear mapping varies with the kernel parameter and so does the resulting restricted hypothesis space. If the kernel parameter can take any value in a certain range, there can be infinitely many non-linear mappings. Let us denote the whole set of mappings by M. Essentially, what a structural training task asks is: from the set M, find the particular mapping that minimizes \tilde w^T \tilde w + Cξ^T e. Translating this into a mathematical model, we have the following optimization problem:

    \min_{\Phi(x) \in M} \left[ \ \min_{\tilde w, b} \ \tilde w^T \tilde w + C\xi^T e \quad \text{s.t.} \quad y_i (\tilde w^T \Phi(x^i) + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, l \ \right].   (2.3.1)

This is a bi-level optimization problem. The upper level problem is usually called the leader's problem, and the lower level problem the follower's problem. In our problem (2.3.1), the decision variable of the leader's problem is the mapping Φ(x) drawn from the whole space of mappings M, and the variables of the follower's problem are \tilde w and b with appropriate dimensions. Details about bi-level optimization problems are given in Chapter 3. It is natural and easy to formulate the problem of a structural training task as in (2.3.1). However, it is difficult to solve this problem. The difficulty comes from two aspects. First, the follower's problem is not defined completely until the mapping is given explicitly. In other words, when the

mappings are different, the formulations of the follower's problems are different as well. Second, it is hard to define a descent search direction in the space M of mappings. The good news is that the introduction of kernel functions solves these problems. This becomes clear if we rewrite problem (2.3.1) with the follower's problem replaced by its dual problem. It looks like this:

    \min_{\sigma} \ F(\sigma, \alpha(\sigma)) = \alpha^T(\sigma) e - \frac{1}{2} \alpha^T(\sigma) K(\sigma) \alpha(\sigma)
    \text{s.t.} \quad \alpha(\sigma) = \arg\max_{\alpha} \ \alpha^T e - \frac{1}{2} \alpha^T K(\sigma) \alpha \quad \text{s.t.} \quad y^T \alpha = 0, \quad Ce \ge \alpha \ge 0,   (2.3.2)

where K(σ) is a matrix parametric in σ, (K(σ))_{ij} = K(x^i, x^j; σ). Now, instead of minimizing over a space of mappings, we minimize over the kernel parameter σ, which controls the non-linear mapping from X to F. Problem (2.3.2) is called the bi-level problem of SASVMs, or simply the BLP of SASVMs. We will discuss how to solve this problem in Chapter 3. By solving the BLP of SASVMs for a given kernel function type and penalty factor C, we are simultaneously tuning σ, α and b to appropriate values. In problem (2.3.2), we treat the penalty factor as a given constant. It is natural to deal with the kernel parameter and the penalty factor separately because they play different roles in machine learning. The kernel parameter affects the underlying mapping from X to F, and thus the VC dimension of the structural hypothesis space, while the penalty factor maintains the trade-off between the maximal margin and the number of misclassifications. If the classification task is critical, and the cost of a misclassification is very expensive, then we should use a large penalty parameter. With this strategy, we expect that SASVMs can achieve high accuracy by using learning machines with high VC dimensions. For this purpose, we need a high quality training set, which usually means a large number of instances and a distribution consistent with P(x, y). If we do not have high quality training sets, then we should set the penalty factor to a small number so that a smooth decision boundary can be constructed. More study is still needed on choosing an appropriate value of the penalty factor. We close this section with the following note.

Note Consider classifying a training set Z = {(x^1, y_1), (x^2, y_2), ..., (x^l, y_l)} with a given penalty factor C. Let the type of the kernel function K(x, z; σ) be specified with a parameter σ so that the hypothesis space

    H_S := \left\{ f(x; \alpha, b; \sigma) = \sum_{i=1}^{l} y_i \alpha_i K(x^i, x; \sigma) + b \ \middle| \ x \in R^n, \alpha \in R^l, b \in R \right\}

is a structural hypothesis space. Let α*, b* and σ* be an optimal solution of the bi-level problem (2.3.2). Then the decision function

    f(x; \alpha^*, b^*; \sigma^*) = \sum_{i=1}^{l} y_i \alpha_i^* K(x^i, x; \sigma^*) + b^*

completes the structural training task, i.e.

    f(x; \alpha^*, b^*; \sigma^*) = \arg\min_{f(x;\alpha,b;\sigma) \in H_S} \left( \tilde w^T \tilde w + C\xi^T e \right).

2.3.3 An Illustrative Example

An experiment is conducted on a training set of 100 points from two classes scattered in a 2-dimensional space (see Figure 2.1). Gaussian RBF kernel functions are used in this experiment, and C is set to 0.1. First, SVMlight [16] is run at six different values of σ; the corresponding decision boundaries are shown in Figure 2.1. As expected, for the smaller values of σ, the accuracy rates are very close to 100% because of the high VC dimensions, but the decision boundaries overfit the training set. In particular, for the smallest σ, the decision boundary is reduced to small islands around the points indicated by + signs. On the other hand, for the largest σ, the decision boundary does not give a good separation due to the lower VC dimension of the learning machine imposed by the large kernel parameter. Visually, we can see from Figure 2.1 that an intermediate value of σ is possibly the best for this data set. Figure 2.2 shows the optimal values of the follower's objective for different values of 1/(2σ²), along with the generalization accuracy estimates, called ξα-estimates, defined in [17]. We can see from this figure that, as 1/(2σ²) varies, the point where the maximal generalization accuracy estimate is achieved is very close to the point with the minimal optimal objective value. Similar results are found in experiments on other data sets. Hence, by solving the BLP of SASVMs, we tune the kernel parameter automatically to the value that most likely optimizes the generalization performance.
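A hedged sketch of the self-adaptive idea in (2.3.2): the optimal value of the lower-level dual is treated as a function F(σ) and minimized over σ by a derivative-free one-dimensional search. scikit-learn's SVC and SciPy's bounded scalar minimizer stand in for SVMlight and the DFO package used in the thesis; the data set is synthetic.

```python
# Upper level: minimize F(sigma) = optimal dual objective of the restricted task.
import numpy as np
from scipy.optimize import minimize_scalar
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, y = make_moons(n_samples=100, noise=0.2, random_state=0)
C = 0.1

def lower_level_optimum(sigma):
    """F(sigma): optimal value of the dual (lower level) at this kernel width."""
    gamma = 1.0 / (2.0 * sigma ** 2)
    clf = SVC(kernel="rbf", gamma=gamma, C=C).fit(X, y)
    a = clf.dual_coef_.ravel()                      # a_i = y_i * alpha_i
    K_sv = rbf_kernel(clf.support_vectors_, gamma=gamma)
    return np.abs(a).sum() - 0.5 * a @ K_sv @ a

res = minimize_scalar(lower_level_optimum, bounds=(0.01, 10.0), method="bounded")
print(res.x, res.fun)   # sigma chosen by the upper level, and the value F(sigma)
```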

Figure 2.1: Optimal Decision Boundaries for Different σ

2.4 Automatic Feature Selection

2.4.1 Extended Kernel Functions

For feature selection, we need an indicator of the importance of each feature to a classification task. We use a vector β ∈ R^n to describe the contributions of the features to the final decision boundary. A small value of an element of β means that the corresponding feature does not contribute very much to the classification; consequently, we can remove it from the data set without affecting the classification accuracy significantly. The vector β can be integrated into the kernel functions. Kernel functions with the vector β are called extended kernel functions.

Figure 2.2: Optimal Objective Values vs ξα-Estimates

For instance, the extended Gaussian kernel functions can be written as

    K_{ext}(x^i, x^j; \sigma; \beta) = \exp\!\left( -\frac{\sum_{k=1}^{n} \beta_k (x^i_k - x^j_k)^2}{2\sigma^2} \right).   (2.4.3)

Note that setting the vector β equal to the all-one vector gives rise to the original Gaussian RBF kernel function. Similarly, polynomial kernels can also be extended as follows:

    K_{ext}(x^i, x^j; d; \beta) = \left( \sum_{k=1}^{n} \beta_k (x^i_k x^j_k) + c \right)^{d},   (2.4.4)

and the linear kernel function can be extended as:

    K_{ext}(x^i, x^j; \beta) = \sum_{k=1}^{n} \beta_k (x^i_k x^j_k).   (2.4.5)

Correspondingly, we can define the extended kernel matrix K_ext(σ, β) for Gaussian RBF kernels and K_ext(d, β) for polynomial kernels.
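A minimal sketch of the extended Gaussian kernel (2.4.3), assuming the reconstruction above: each feature k receives its own weight β_k, so a feature whose β_k is driven towards zero barely influences the kernel value and becomes a candidate for removal.

```python
# Extended Gaussian kernel with per-feature weights beta_k.
import numpy as np

def extended_rbf(x, z, sigma, beta):
    """K_ext(x, z) = exp(-sum_k beta_k (x_k - z_k)^2 / (2 sigma^2))."""
    beta = np.asarray(beta, dtype=float)
    return np.exp(-np.sum(beta * (x - z) ** 2) / (2.0 * sigma ** 2))

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, 2.5, -1.0])
print(extended_rbf(x, z, sigma=1.0, beta=[1.0, 1.0, 1.0]))   # ordinary RBF kernel
print(extended_rbf(x, z, sigma=1.0, beta=[1.0, 1.0, 0.0]))   # third feature ignored
```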

Now we have a different problem, with more variables at the upper level. For example, if we use extended Gaussian RBF kernel functions, then we have

    \min_{\sigma, \beta} \ F(\sigma, \alpha(\sigma, \beta)) = \alpha^T(\sigma, \beta) e - \frac{1}{2} \alpha^T(\sigma, \beta) K_{ext}(\sigma, \beta) \alpha(\sigma, \beta)
    \text{s.t.} \quad \alpha(\sigma, \beta) = \arg\max_{\alpha} \ \alpha^T e - \frac{1}{2} \alpha^T K_{ext}(\sigma, \beta) \alpha \quad \text{s.t.} \quad y^T \alpha = 0, \quad Ce \ge \alpha \ge 0.   (2.4.6)

2.4.2 An Illustrative Example

An experiment is done on a data set of 26 points scattered in a 2-dimensional space. The corresponding bi-level problem is converted into a one-level problem by means of the KKT-condition replacement discussed in Chapter 3. In the optimal kernel parameter vector found by LOQO, the value associated with the second feature is very small (on the order of 10^-9). Thus, we remove this feature from the training set and run the optimization process again. As expected, the optimal kernel value for the remaining feature is unchanged. This verifies our observation that the second feature does not play a significant role in defining the decision boundary. Figure 2.3 displays the optimal decision boundaries found by LOQO for this data set before and after feature selection. The upper part of Figure 2.3 displays the optimal boundary in the original 2-dimensional space, where a common kernel parameter is used for both features. The generalization accuracy estimate in this situation is 61.54%. The lower part of Figure 2.3 displays the decision boundary in the 1-dimensional space onto which the points are projected. The number of misclassifications made by the 1-dimensional decision function is 3, one less than the number made by the 2-dimensional decision function. Meanwhile, the 1-dimensional decision function is simpler than that in the 2-dimensional space in the sense that it involves only one feature. The generalization accuracy estimate of the 1-dimensional decision function is 76.92%, which is much higher than the generalization accuracy achieved in the 2-dimensional space.

Figure 2.3: Optimal Decision Boundaries in R^2 and R (panels: 2-dimensional decision boundary; 1-dimensional decision boundary)

Chapter 3 Solving the BLP of SASVM

In this chapter, we discuss how to solve the bi-level problems introduced in Chapter 2 for model and feature selection. Some important concepts and properties of general bi-level programming problems are described in Section 3.1. Then, in Section 3.2, we present how to convert the bi-level problems of interest into one-level problems so that they can be solved by general non-linear solvers. Finally, in Section 3.3, we describe the derivative free optimization (DFO) method and apply it to the general BLP.

3.1 General Bi-level Problems

We consider general non-linear bi-level programming problems of the following form:

    \min_{x \in \mathcal{X}} \ F(x, y)
    \text{s.t.} \quad y \in \arg\min_{y \in \mathcal{Y}} \ f(x, y) \quad \text{s.t.} \quad g_i(x, y) \le 0, \ i \in I, \quad h_j(x, y) = 0, \ j \in E,   (3.1.1)

where I and E are sets of indices. The upper level problem is often called the leader's problem, and the lower level problem the follower's problem. A bi-level programming problem works like two players in a strategic game, where each player has a set of variables to control and each can respond according to how the opponent responds. Their goal

is to minimize or maximize their own objective functions. To facilitate our discussion, let us introduce the following concepts.

Definition [8] The follower's feasible region, denoted by Ω(x), is the set of allowable choices for the follower under the current leader's choice. It is parametric in the vector x, i.e.

    \Omega(x) = \{ y \mid g_i(x, y) \le 0, \ i \in I, \ \text{and} \ h_j(x, y) = 0, \ j \in E, \ x \in \mathcal{X}, \ y \in \mathcal{Y} \}.

Definition [8] The rational reaction set, denoted by M(x), is the set of optimal solutions (optimal choices) for the follower under the current leader's choice, i.e.

    M(x) = \{ y \mid y = \arg\min [ f(x, y) : y \in \Omega(x) ] \}.

We assume that M(x) is not empty for all vectors x in X.

Definition [8] The inducible region, denoted by IR, is the union, over all possible vectors x that the leader can choose, of the corresponding rational reaction sets M(x), i.e.

    IR = \{ (x, y) \mid x \in \mathcal{X}, \ y \in M(x) \}.

Generally speaking, M(x) is a point-to-set mapping, which means that, for a given leader's choice, there might be more than one optimal solution of the follower's problem. In particular, if f, g and h are twice continuously differentiable in y for all y ∈ Ω(x), f is strictly convex in y for all y ∈ Ω(x), and Ω(x) is a compact convex set, then M(x) is a continuous point-to-point mapping [11]. In this case, we usually denote it by y(x). As a consequence, the leader's objective function is continuous in x and can be written as F(x, y(x)). The difficult part is that, in most cases, y(x) is implicitly defined by the follower's problem, and it is very hard to compute the gradient of the leader's objective F(x, y(x)) with respect to x. This difficulty prevents us from applying many optimization solvers directly to the problem, as most optimization methods use the gradient information of the objective. From now on, we shall assume that y(x) is a continuous point-to-point mapping. In fact, there is a way to treat a bi-level optimization problem as a one-level optimization problem over the inducible region IR. If y(x) is a continuous function, IR is a continuous curve in the space R^{n+m}. The differentiability of y(x) and F(x, y(x)) plays an important role in the selection of available algorithms. If F(x, y(x)) is differentiable everywhere w.r.t. x, then we can apply algorithms that use the first order derivative of the leader's objective, calculated from the solution of the follower's problem. The leader's gradient information can be acquired in a variety of ways, which is carefully examined in [8, 19]. In the case where F(x, y(x)) is not

differentiable w.r.t. x but is Lipschitz continuous, we can use a bundle method [30, 8], or apply algorithms that do not use gradient information. We summarize the results regarding the differentiability of the general BLP as follows; these results are from [8]. Let u, v be the Lagrangian multipliers introduced for the inequality and equality constraints, respectively, in the follower's problem. Denote the active set of inequality constraints by A(x, y), i.e. A(x, y) = { i ∈ I | g_i(x, y) = 0 }, and the Lagrangian function of the follower's problem for a fixed x by L(y, u, v; x), i.e.

    L(y, u, v; x) = f(x, y) + u^T g(x, y) + v^T h(x, y).

The existence of the one-to-one mapping y(x) means that there exist unique Lagrangian multipliers u, v such that the KKT conditions [18, 20] are satisfied. Besides the KKT conditions, the following conditions are used in the study of the differentiability of y(x):

The Linear Independence Constraint Qualification (LICQ) holds at y if ∇_y g_i(x, y), i ∈ A(x, y), and ∇_y h_j(x, y), j ∈ E, are linearly independent.

The Strict Complementary Slackness condition (SCS) holds at y with respect to (u, v) if u_i > 0, i ∈ A(x, y).

The Strong Second-Order Sufficient Condition (SSOSC) holds at y w.r.t. (u, v) if

    d^T \nabla^2_y L(y, u, v; x) \, d > 0 \quad \forall d \ne 0 \ \text{with} \ d^T \nabla_y g_i(x, y) = 0, \ i \in A(x, y), \quad d^T \nabla_y h_j(x, y) = 0, \ j \in E.

Proposition [10] Suppose the KKT, SSOSC, SCS and LICQ conditions hold at y_0 with multipliers (u_0, v_0) for the follower's problem with x = x_0, and that f, g and h are C^3 in a neighborhood of (x_0, y_0). Then, for x in a neighborhood of x_0, there exists a unique twice continuously differentiable function z(x) = [y(x), u(x), v(x)]^T satisfying the KKT, SSOSC, SCS and LICQ conditions at y(x), with multipliers [u(x), v(x)], for the follower's problem.

Proposition [15, 23, 29] Suppose the KKT, SSOSC and LICQ conditions hold at y_0 with multipliers (u_0, v_0) for the follower's problem with x = x_0, and that f, g and h are C^3 in a neighborhood of (x_0, y_0). Then, for x in a neighborhood of x_0, there exists a unique twice continuously differentiable function z(x) = [y(x), u(x), v(x)]^T satisfying the KKT, SSOSC and LICQ conditions at y(x), with multipliers [u(x), v(x)], for the follower's problem.
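The nested, derivative-free treatment of such problems can be sketched generically: for every leader candidate x, the follower's problem is solved numerically to obtain y(x), and F(x, y(x)) is then minimized without gradients. The toy problem and the use of SciPy's Nelder-Mead method below are illustrative assumptions only; the thesis applies a DFO method to the BLP of SASVMs.

```python
# Generic nested strategy for a bi-level problem: solve the follower's problem
# for each leader candidate, then minimize F(x, y(x)) derivative-free.
import numpy as np
from scipy.optimize import minimize

def follower(x):
    """Solve min_y (y - x)^2 + y^2  s.t. y >= 0, for a fixed leader choice x."""
    res = minimize(lambda y: (y[0] - x) ** 2 + y[0] ** 2,
                   x0=[0.0], bounds=[(0.0, None)])
    return res.x[0]                       # the rational reaction y(x)

def leader_objective(x):
    y = follower(x[0])                    # evaluate y(x) by an inner solve
    return (x[0] - 3.0) ** 2 + (y - 1.0) ** 2

res = minimize(leader_objective, x0=[0.0], method="Nelder-Mead")
print(res.x, follower(res.x[0]))          # approximate bi-level solution
```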


More information

A Truncated Newton Method in an Augmented Lagrangian Framework for Nonlinear Programming

A Truncated Newton Method in an Augmented Lagrangian Framework for Nonlinear Programming A Truncated Newton Method in an Augmented Lagrangian Framework for Nonlinear Programming Gianni Di Pillo (dipillo@dis.uniroma1.it) Giampaolo Liuzzi (liuzzi@iasi.cnr.it) Stefano Lucidi (lucidi@dis.uniroma1.it)

More information

Classification by Support Vector Machines

Classification by Support Vector Machines Classification by Support Vector Machines Florian Markowetz Max-Planck-Institute for Molecular Genetics Computational Molecular Biology Berlin Practical DNA Microarray Analysis 2003 1 Overview I II III

More information

Support Vector Machines for Classification and Regression

Support Vector Machines for Classification and Regression UNIVERSITY OF SOUTHAMPTON Support Vector Machines for Classification and Regression by Steve R. Gunn Technical Report Faculty of Engineering and Applied Science Department of Electronics and Computer Science

More information

Math 5593 Linear Programming Lecture Notes

Math 5593 Linear Programming Lecture Notes Math 5593 Linear Programming Lecture Notes Unit II: Theory & Foundations (Convex Analysis) University of Colorado Denver, Fall 2013 Topics 1 Convex Sets 1 1.1 Basic Properties (Luenberger-Ye Appendix B.1).........................

More information

Support Vector Machines. James McInerney Adapted from slides by Nakul Verma

Support Vector Machines. James McInerney Adapted from slides by Nakul Verma Support Vector Machines James McInerney Adapted from slides by Nakul Verma Last time Decision boundaries for classification Linear decision boundary (linear classification) The Perceptron algorithm Mistake

More information

Mathematical Programming and Research Methods (Part II)

Mathematical Programming and Research Methods (Part II) Mathematical Programming and Research Methods (Part II) 4. Convexity and Optimization Massimiliano Pontil (based on previous lecture by Andreas Argyriou) 1 Today s Plan Convex sets and functions Types

More information

Kernels + K-Means Introduction to Machine Learning. Matt Gormley Lecture 29 April 25, 2018

Kernels + K-Means Introduction to Machine Learning. Matt Gormley Lecture 29 April 25, 2018 10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Kernels + K-Means Matt Gormley Lecture 29 April 25, 2018 1 Reminders Homework 8:

More information

LECTURE 5: DUAL PROBLEMS AND KERNELS. * Most of the slides in this lecture are from

LECTURE 5: DUAL PROBLEMS AND KERNELS. * Most of the slides in this lecture are from LECTURE 5: DUAL PROBLEMS AND KERNELS * Most of the slides in this lecture are from http://www.robots.ox.ac.uk/~az/lectures/ml Optimization Loss function Loss functions SVM review PRIMAL-DUAL PROBLEM Max-min

More information

Support Vector Machines.

Support Vector Machines. Support Vector Machines srihari@buffalo.edu SVM Discussion Overview. Importance of SVMs. Overview of Mathematical Techniques Employed 3. Margin Geometry 4. SVM Training Methodology 5. Overlapping Distributions

More information

Support Vector Machines

Support Vector Machines Support Vector Machines SVM Discussion Overview. Importance of SVMs. Overview of Mathematical Techniques Employed 3. Margin Geometry 4. SVM Training Methodology 5. Overlapping Distributions 6. Dealing

More information

Lecture 9: Support Vector Machines

Lecture 9: Support Vector Machines Lecture 9: Support Vector Machines William Webber (william@williamwebber.com) COMP90042, 2014, Semester 1, Lecture 8 What we ll learn in this lecture Support Vector Machines (SVMs) a highly robust and

More information

10. Support Vector Machines

10. Support Vector Machines Foundations of Machine Learning CentraleSupélec Fall 2017 10. Support Vector Machines Chloé-Agathe Azencot Centre for Computational Biology, Mines ParisTech chloe-agathe.azencott@mines-paristech.fr Learning

More information

Combine the PA Algorithm with a Proximal Classifier

Combine the PA Algorithm with a Proximal Classifier Combine the Passive and Aggressive Algorithm with a Proximal Classifier Yuh-Jye Lee Joint work with Y.-C. Tseng Dept. of Computer Science & Information Engineering TaiwanTech. Dept. of Statistics@NCKU

More information

Shiqian Ma, MAT-258A: Numerical Optimization 1. Chapter 2. Convex Optimization

Shiqian Ma, MAT-258A: Numerical Optimization 1. Chapter 2. Convex Optimization Shiqian Ma, MAT-258A: Numerical Optimization 1 Chapter 2 Convex Optimization Shiqian Ma, MAT-258A: Numerical Optimization 2 2.1. Convex Optimization General optimization problem: min f 0 (x) s.t., f i

More information

Support Vector Machines

Support Vector Machines Support Vector Machines . Importance of SVM SVM is a discriminative method that brings together:. computational learning theory. previously known methods in linear discriminant functions 3. optimization

More information

Module 1 Lecture Notes 2. Optimization Problem and Model Formulation

Module 1 Lecture Notes 2. Optimization Problem and Model Formulation Optimization Methods: Introduction and Basic concepts 1 Module 1 Lecture Notes 2 Optimization Problem and Model Formulation Introduction In the previous lecture we studied the evolution of optimization

More information

Lab 2: Support vector machines

Lab 2: Support vector machines Artificial neural networks, advanced course, 2D1433 Lab 2: Support vector machines Martin Rehn For the course given in 2006 All files referenced below may be found in the following directory: /info/annfk06/labs/lab2

More information

Lecture 5: Duality Theory

Lecture 5: Duality Theory Lecture 5: Duality Theory Rajat Mittal IIT Kanpur The objective of this lecture note will be to learn duality theory of linear programming. We are planning to answer following questions. What are hyperplane

More information

FERDINAND KAISER Robust Support Vector Machines For Implicit Outlier Removal. Master of Science Thesis

FERDINAND KAISER Robust Support Vector Machines For Implicit Outlier Removal. Master of Science Thesis FERDINAND KAISER Robust Support Vector Machines For Implicit Outlier Removal Master of Science Thesis Examiners: Dr. Tech. Ari Visa M.Sc. Mikko Parviainen Examiners and topic approved in the Department

More information

Applied Lagrange Duality for Constrained Optimization

Applied Lagrange Duality for Constrained Optimization Applied Lagrange Duality for Constrained Optimization Robert M. Freund February 10, 2004 c 2004 Massachusetts Institute of Technology. 1 1 Overview The Practical Importance of Duality Review of Convexity

More information

HW2 due on Thursday. Face Recognition: Dimensionality Reduction. Biometrics CSE 190 Lecture 11. Perceptron Revisited: Linear Separators

HW2 due on Thursday. Face Recognition: Dimensionality Reduction. Biometrics CSE 190 Lecture 11. Perceptron Revisited: Linear Separators HW due on Thursday Face Recognition: Dimensionality Reduction Biometrics CSE 190 Lecture 11 CSE190, Winter 010 CSE190, Winter 010 Perceptron Revisited: Linear Separators Binary classification can be viewed

More information

DM6 Support Vector Machines

DM6 Support Vector Machines DM6 Support Vector Machines Outline Large margin linear classifier Linear separable Nonlinear separable Creating nonlinear classifiers: kernel trick Discussion on SVM Conclusion SVM: LARGE MARGIN LINEAR

More information

Theoretical Concepts of Machine Learning

Theoretical Concepts of Machine Learning Theoretical Concepts of Machine Learning Part 2 Institute of Bioinformatics Johannes Kepler University, Linz, Austria Outline 1 Introduction 2 Generalization Error 3 Maximum Likelihood 4 Noise Models 5

More information

Lecture 19: Convex Non-Smooth Optimization. April 2, 2007

Lecture 19: Convex Non-Smooth Optimization. April 2, 2007 : Convex Non-Smooth Optimization April 2, 2007 Outline Lecture 19 Convex non-smooth problems Examples Subgradients and subdifferentials Subgradient properties Operations with subgradients and subdifferentials

More information

Distance-to-Solution Estimates for Optimization Problems with Constraints in Standard Form

Distance-to-Solution Estimates for Optimization Problems with Constraints in Standard Form Distance-to-Solution Estimates for Optimization Problems with Constraints in Standard Form Philip E. Gill Vyacheslav Kungurtsev Daniel P. Robinson UCSD Center for Computational Mathematics Technical Report

More information

Kernel-based online machine learning and support vector reduction

Kernel-based online machine learning and support vector reduction Kernel-based online machine learning and support vector reduction Sumeet Agarwal 1, V. Vijaya Saradhi 2 andharishkarnick 2 1- IBM India Research Lab, New Delhi, India. 2- Department of Computer Science

More information

Convex Programs. COMPSCI 371D Machine Learning. COMPSCI 371D Machine Learning Convex Programs 1 / 21

Convex Programs. COMPSCI 371D Machine Learning. COMPSCI 371D Machine Learning Convex Programs 1 / 21 Convex Programs COMPSCI 371D Machine Learning COMPSCI 371D Machine Learning Convex Programs 1 / 21 Logistic Regression! Support Vector Machines Support Vector Machines (SVMs) and Convex Programs SVMs are

More information

Data-driven Kernels for Support Vector Machines

Data-driven Kernels for Support Vector Machines Data-driven Kernels for Support Vector Machines by Xin Yao A research paper presented to the University of Waterloo in partial fulfillment of the requirement for the degree of Master of Mathematics in

More information

Kernels and representation

Kernels and representation Kernels and representation Corso di AA, anno 2017/18, Padova Fabio Aiolli 20 Dicembre 2017 Fabio Aiolli Kernels and representation 20 Dicembre 2017 1 / 19 (Hierarchical) Representation Learning Hierarchical

More information

Linear Bilevel Programming With Upper Level Constraints Depending on the Lower Level Solution

Linear Bilevel Programming With Upper Level Constraints Depending on the Lower Level Solution Linear Bilevel Programming With Upper Level Constraints Depending on the Lower Level Solution Ayalew Getachew Mersha and Stephan Dempe October 17, 2005 Abstract Focus in the paper is on the definition

More information

Support Vector Machines and their Applications

Support Vector Machines and their Applications Purushottam Kar Department of Computer Science and Engineering, Indian Institute of Technology Kanpur. Summer School on Expert Systems And Their Applications, Indian Institute of Information Technology

More information

Unconstrained Optimization Principles of Unconstrained Optimization Search Methods

Unconstrained Optimization Principles of Unconstrained Optimization Search Methods 1 Nonlinear Programming Types of Nonlinear Programs (NLP) Convexity and Convex Programs NLP Solutions Unconstrained Optimization Principles of Unconstrained Optimization Search Methods Constrained Optimization

More information

Linear programming and duality theory

Linear programming and duality theory Linear programming and duality theory Complements of Operations Research Giovanni Righini Linear Programming (LP) A linear program is defined by linear constraints, a linear objective function. Its variables

More information

Convexization in Markov Chain Monte Carlo

Convexization in Markov Chain Monte Carlo in Markov Chain Monte Carlo 1 IBM T. J. Watson Yorktown Heights, NY 2 Department of Aerospace Engineering Technion, Israel August 23, 2011 Problem Statement MCMC processes in general are governed by non

More information

DM545 Linear and Integer Programming. Lecture 2. The Simplex Method. Marco Chiarandini

DM545 Linear and Integer Programming. Lecture 2. The Simplex Method. Marco Chiarandini DM545 Linear and Integer Programming Lecture 2 The Marco Chiarandini Department of Mathematics & Computer Science University of Southern Denmark Outline 1. 2. 3. 4. Standard Form Basic Feasible Solutions

More information

Support Vector Machines

Support Vector Machines Support Vector Machines RBF-networks Support Vector Machines Good Decision Boundary Optimization Problem Soft margin Hyperplane Non-linear Decision Boundary Kernel-Trick Approximation Accurancy Overtraining

More information

Support Vector Machines for Face Recognition

Support Vector Machines for Face Recognition Chapter 8 Support Vector Machines for Face Recognition 8.1 Introduction In chapter 7 we have investigated the credibility of different parameters introduced in the present work, viz., SSPD and ALR Feature

More information

Classification by Support Vector Machines

Classification by Support Vector Machines Classification by Support Vector Machines Florian Markowetz Max-Planck-Institute for Molecular Genetics Computational Molecular Biology Berlin Practical DNA Microarray Analysis 2003 1 Overview I II III

More information

Programs. Introduction

Programs. Introduction 16 Interior Point I: Linear Programs Lab Objective: For decades after its invention, the Simplex algorithm was the only competitive method for linear programming. The past 30 years, however, have seen

More information

Chapter 6. Curves and Surfaces. 6.1 Graphs as Surfaces

Chapter 6. Curves and Surfaces. 6.1 Graphs as Surfaces Chapter 6 Curves and Surfaces In Chapter 2 a plane is defined as the zero set of a linear function in R 3. It is expected a surface is the zero set of a differentiable function in R n. To motivate, graphs

More information

Support Vector Machines

Support Vector Machines Support Vector Machines About the Name... A Support Vector A training sample used to define classification boundaries in SVMs located near class boundaries Support Vector Machines Binary classifiers whose

More information

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CHAPTER 4 CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS 4.1 Introduction Optical character recognition is one of

More information

MODEL SELECTION AND REGULARIZATION PARAMETER CHOICE

MODEL SELECTION AND REGULARIZATION PARAMETER CHOICE MODEL SELECTION AND REGULARIZATION PARAMETER CHOICE REGULARIZATION METHODS FOR HIGH DIMENSIONAL LEARNING Francesca Odone and Lorenzo Rosasco odone@disi.unige.it - lrosasco@mit.edu June 6, 2011 ABOUT THIS

More information

Divide and Conquer Kernel Ridge Regression

Divide and Conquer Kernel Ridge Regression Divide and Conquer Kernel Ridge Regression Yuchen Zhang John Duchi Martin Wainwright University of California, Berkeley COLT 2013 Yuchen Zhang (UC Berkeley) Divide and Conquer KRR COLT 2013 1 / 15 Problem

More information

Practical Implementations Of The Active Set Method For Support Vector Machine Training With Semi-definite Kernels

Practical Implementations Of The Active Set Method For Support Vector Machine Training With Semi-definite Kernels University of Central Florida Electronic Theses and Dissertations Doctoral Dissertation (Open Access) Practical Implementations Of The Active Set Method For Support Vector Machine Training With Semi-definite

More information

Convexity: an introduction

Convexity: an introduction Convexity: an introduction Geir Dahl CMA, Dept. of Mathematics and Dept. of Informatics University of Oslo 1 / 74 1. Introduction 1. Introduction what is convexity where does it arise main concepts and

More information

MATHEMATICS II: COLLECTION OF EXERCISES AND PROBLEMS

MATHEMATICS II: COLLECTION OF EXERCISES AND PROBLEMS MATHEMATICS II: COLLECTION OF EXERCISES AND PROBLEMS GRADO EN A.D.E. GRADO EN ECONOMÍA GRADO EN F.Y.C. ACADEMIC YEAR 2011-12 INDEX UNIT 1.- AN INTRODUCCTION TO OPTIMIZATION 2 UNIT 2.- NONLINEAR PROGRAMMING

More information

CS 473: Algorithms. Ruta Mehta. Spring University of Illinois, Urbana-Champaign. Ruta (UIUC) CS473 1 Spring / 36

CS 473: Algorithms. Ruta Mehta. Spring University of Illinois, Urbana-Champaign. Ruta (UIUC) CS473 1 Spring / 36 CS 473: Algorithms Ruta Mehta University of Illinois, Urbana-Champaign Spring 2018 Ruta (UIUC) CS473 1 Spring 2018 1 / 36 CS 473: Algorithms, Spring 2018 LP Duality Lecture 20 April 3, 2018 Some of the

More information

Don t just read it; fight it! Ask your own questions, look for your own examples, discover your own proofs. Is the hypothesis necessary?

Don t just read it; fight it! Ask your own questions, look for your own examples, discover your own proofs. Is the hypothesis necessary? Don t just read it; fight it! Ask your own questions, look for your own examples, discover your own proofs. Is the hypothesis necessary? Is the converse true? What happens in the classical special case?

More information

Content-based image and video analysis. Machine learning

Content-based image and video analysis. Machine learning Content-based image and video analysis Machine learning for multimedia retrieval 04.05.2009 What is machine learning? Some problems are very hard to solve by writing a computer program by hand Almost all

More information

Machine Learning. Support Vector Machines. Fabio Vandin November 20, 2017

Machine Learning. Support Vector Machines. Fabio Vandin November 20, 2017 Machine Learning Support Vector Machines Fabio Vandin November 20, 2017 1 Classification and Margin Consider a classification problem with two classes: instance set X = R d label set Y = { 1, 1}. Training

More information

Perceptron Learning Algorithm

Perceptron Learning Algorithm Perceptron Learning Algorithm An iterative learning algorithm that can find linear threshold function to partition linearly separable set of points. Assume zero threshold value. 1) w(0) = arbitrary, j=1,

More information

Lecture notes on the simplex method September We will present an algorithm to solve linear programs of the form. maximize.

Lecture notes on the simplex method September We will present an algorithm to solve linear programs of the form. maximize. Cornell University, Fall 2017 CS 6820: Algorithms Lecture notes on the simplex method September 2017 1 The Simplex Method We will present an algorithm to solve linear programs of the form maximize subject

More information

Generative and discriminative classification techniques

Generative and discriminative classification techniques Generative and discriminative classification techniques Machine Learning and Category Representation 013-014 Jakob Verbeek, December 13+0, 013 Course website: http://lear.inrialpes.fr/~verbeek/mlcr.13.14

More information

LECTURE 13: SOLUTION METHODS FOR CONSTRAINED OPTIMIZATION. 1. Primal approach 2. Penalty and barrier methods 3. Dual approach 4. Primal-dual approach

LECTURE 13: SOLUTION METHODS FOR CONSTRAINED OPTIMIZATION. 1. Primal approach 2. Penalty and barrier methods 3. Dual approach 4. Primal-dual approach LECTURE 13: SOLUTION METHODS FOR CONSTRAINED OPTIMIZATION 1. Primal approach 2. Penalty and barrier methods 3. Dual approach 4. Primal-dual approach Basic approaches I. Primal Approach - Feasible Direction

More information

5 Day 5: Maxima and minima for n variables.

5 Day 5: Maxima and minima for n variables. UNIVERSITAT POMPEU FABRA INTERNATIONAL BUSINESS ECONOMICS MATHEMATICS III. Pelegrí Viader. 2012-201 Updated May 14, 201 5 Day 5: Maxima and minima for n variables. The same kind of first-order and second-order

More information

ECG782: Multidimensional Digital Signal Processing

ECG782: Multidimensional Digital Signal Processing ECG782: Multidimensional Digital Signal Processing Object Recognition http://www.ee.unlv.edu/~b1morris/ecg782/ 2 Outline Knowledge Representation Statistical Pattern Recognition Neural Networks Boosting

More information

CS446: Machine Learning Fall Problem Set 4. Handed Out: October 17, 2013 Due: October 31 th, w T x i w

CS446: Machine Learning Fall Problem Set 4. Handed Out: October 17, 2013 Due: October 31 th, w T x i w CS446: Machine Learning Fall 2013 Problem Set 4 Handed Out: October 17, 2013 Due: October 31 th, 2013 Feel free to talk to other members of the class in doing the homework. I am more concerned that you

More information

Mathematical Themes in Economics, Machine Learning, and Bioinformatics

Mathematical Themes in Economics, Machine Learning, and Bioinformatics Western Kentucky University From the SelectedWorks of Matt Bogard 2010 Mathematical Themes in Economics, Machine Learning, and Bioinformatics Matt Bogard, Western Kentucky University Available at: https://works.bepress.com/matt_bogard/7/

More information

Some Advanced Topics in Linear Programming

Some Advanced Topics in Linear Programming Some Advanced Topics in Linear Programming Matthew J. Saltzman July 2, 995 Connections with Algebra and Geometry In this section, we will explore how some of the ideas in linear programming, duality theory,

More information

Support vector machines. Dominik Wisniewski Wojciech Wawrzyniak

Support vector machines. Dominik Wisniewski Wojciech Wawrzyniak Support vector machines Dominik Wisniewski Wojciech Wawrzyniak Outline 1. A brief history of SVM. 2. What is SVM and how does it work? 3. How would you classify this data? 4. Are all the separating lines

More information

Constrained optimization

Constrained optimization Constrained optimization A general constrained optimization problem has the form where The Lagrangian function is given by Primal and dual optimization problems Primal: Dual: Weak duality: Strong duality:

More information

Kernel SVM. Course: Machine Learning MAHDI YAZDIAN-DEHKORDI FALL 2017

Kernel SVM. Course: Machine Learning MAHDI YAZDIAN-DEHKORDI FALL 2017 Kernel SVM Course: MAHDI YAZDIAN-DEHKORDI FALL 2017 1 Outlines SVM Lagrangian Primal & Dual Problem Non-linear SVM & Kernel SVM SVM Advantages Toolboxes 2 SVM Lagrangian Primal/DualProblem 3 SVM LagrangianPrimalProblem

More information

A generalized quadratic loss for Support Vector Machines

A generalized quadratic loss for Support Vector Machines A generalized quadratic loss for Support Vector Machines Filippo Portera and Alessandro Sperduti Abstract. The standard SVM formulation for binary classification is based on the Hinge loss function, where

More information