A model for a complex polynomial SVM kernel
Dana Simian
"Lucian Blaga" University of Sibiu
Faculty of Sciences, Department of Computer Science
Str. Dr. Ion Ratiu 5-7, Sibiu, ROMANIA
d simian@yahoo.com

Abstract: SVM models are obtained by convex optimization and are able to learn and generalize in high-dimensional input spaces. The kernel method is a very powerful idea: using an appropriate kernel, the data are projected into a space of higher dimension in which they are separable by a hyperplane. Usually simple kernels are used, but real problems require more complex kernels. The aim of this paper is to introduce and analyze a multiple kernel based only on simple polynomial kernels.

Key Words: Multiple Kernel, SVM, Classification, Genetic Algorithm

1 Introduction

Support Vector Machines (SVMs) represent a class of neural networks introduced by Vapnik ([28]). We give a brief presentation of SVMs for binary classification in section 2. We limit our discussion to the binary case; there are many methods for generalizing to several classes. In recent years, SVMs have become a very popular tool for machine learning tasks and have been successfully applied to classification, regression and novelty detection. SVMs have been applied in various fields: particle identification, face identification, text categorization, bioinformatics, database marketing ([23]).

Generally, the task of classification is to find a rule which, based on external observations, assigns an object to one of several classes. A classification task supposes the existence of training and testing data given in the form of data instances. Each instance in the training set contains one target value, named the class label, and several attributes, named features. The goal of SVM is to produce a model which predicts the target values of instances in the testing set, for which only the attributes are given. SVM solutions are characterized by a convex optimization problem.
If the data set is separable, we obtain an optimal separating hyperplane with a maximal margin ([5]). To avoid the difficulties posed by non-separable data, the idea of kernel substitution is used. Kernel methods transform the data into a feature space F, which usually has a huge dimension ([2], [28]). With a suitable choice of kernel, the data can become separable in feature space despite being non-separable by a hyperplane in the original input space. The basic properties of a kernel function are derived from Mercer's theorem ([5]). Under certain conditions, kernel functions can be interpreted as representing the inner product of data objects implicitly mapped into a nonlinear feature space. The kernel trick is to calculate this inner product in the feature space without knowing the mapping function explicitly.

There are many kernels used in standard kernel-based methods ([6], [10], [16], [26]). Standard kernel-based classifiers use only a single kernel, but complex applications need to consider a combination of kernels. Recent developments are oriented toward finding complex kernels and studying their behavior on different problems ([9], [10], [11], [1], [7], [17], [18], [19]). It is very important to build multiple kernels adapted to the input data. The problem of choosing an optimal kernel function and optimal values for the problem parameters is known as the model selection problem ([10]). Methods for solving this problem using multiple kernels and genetic programming techniques are presented in [9], [11].

The aim of this paper is to propose a model for a complex polynomial-based SVM kernel, starting from the method presented in [9], [11]. The paper is organized as follows: in section 2 we present some general elements related to SVMs; in section 3 we present related work on the construction of multiple kernels; in section 4 we introduce and discuss our model; conclusions and further directions of study are given in section 5.
2 Support Vector Machines and kernels

The SVM algorithm can solve binary or multiclass classification problems. Many real-life data sets involve multiclass classification. We consider in this section only the problem of binary classification, because there are many known methods to generalize a binary classifier to an $n$-class classifier ([5], [28]).

Let the data points $x_i \in \mathbb{R}^d$, $i \in \{1, \dots, m\}$, be given together with their labels $y_i \in \{-1, 1\}$. We look for a function $f$ which associates to each input $x$ its correct label $y = f(x)$. This function is named the decision function and represents a hyperplane which divides the input data into two regions:

$$f(x) = \mathrm{sign}(\langle w, x\rangle + b), \qquad (1)$$

where $w = \sum_{i=1}^m \alpha_i y_i x_i$. If the data set is separable, the conditions for classification without training error are $y_i(\langle w, x_i\rangle + b) > 0$. To maximize the margin, the task is

$$\min \frac{1}{2}\|w\|^2,$$

subject to the constraints $y_i(\langle w, x_i\rangle + b) \ge 1$, $i \in \{1, \dots, m\}$. The Wolfe dual problem requires the maximization with respect to $\alpha_i$ of the function

$$W(\alpha) = \sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j \langle x_i, x_j\rangle \qquad (2)$$

subject to the constraints

$$\alpha_i \ge 0, \qquad \sum_{i=1}^m \alpha_i y_i = 0, \qquad (3)$$

with $\alpha_i$ Lagrange multipliers (hence $\alpha_i \ge 0$). This constrained quadratic programming problem gives an optimal separating hyperplane with a maximal margin if the data set is separable.

For data sets which are not linearly separable we use the kernel method, which projects the input data $x$ into a feature Hilbert space $F$:

$$\phi: X \to F, \quad x \mapsto \phi(x)$$

The functional form of the mapping $\phi(x_i)$ does not need to be known; it is implicitly defined by the choice of kernel:

$$K(x_i, x_j) = \langle \phi(x_i), \phi(x_j)\rangle$$

The kernel represents the inner product in the higher-dimensional Hilbert space of features. Feasible kernels must satisfy Mercer's conditions ([5]). For a given choice of kernel, the learning task involves maximization of the objective function

$$W(\alpha) = \sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j \langle \phi(x_i), \phi(x_j)\rangle \qquad (4)$$

subject to the constraints (3).
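The kernel trick described above can be illustrated concretely. The following sketch (in Python with NumPy, not part of the original paper) checks that the polynomial kernel $(\langle x, y\rangle + 1)^2$ equals the inner product of an explicit degree-2 feature map in $\mathbb{R}^2$; the feature map `phi` is the standard one for this kernel, and the sample points are arbitrary.

```python
import numpy as np

def poly_kernel(x, y, d=2):
    # K(x, y) = (<x, y> + 1)^d  -- the single-parameter polynomial kernel
    return (np.dot(x, y) + 1.0) ** d

def phi(x):
    # Explicit feature map for d = 2 and x in R^2:
    # phi(x) = (1, sqrt(2)x1, sqrt(2)x2, x1^2, x2^2, sqrt(2)x1x2),
    # so that <phi(x), phi(y)> = 1 + 2<x, y> + <x, y>^2 = (<x, y> + 1)^2
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, x1 ** 2, x2 ** 2, s * x1 * x2])

x = np.array([0.5, -1.0])
y = np.array([2.0, 0.25])
# The kernel trick: the same inner product, without ever computing phi
assert np.isclose(poly_kernel(x, y, 2), np.dot(phi(x), phi(y)))
```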
The decision function is

$$f(x) = \sum_{i=1}^m y_i \alpha_i K(x_i, x) + b \qquad (5)$$

The problem presented so far assumes no training errors. Most real data sets contain noise, whose presence is modeled by means of positive slack variables $\xi_i$. In this case the optimization problem is

$$\min_{w, b, \xi} \frac{1}{2}\langle w, w\rangle + C\sum_{i=1}^m \xi_i \qquad (6)$$

subject to the constraints

$$y_i(\langle w, x_i\rangle + b) \ge 1 - \xi_i \qquad (7)$$

$$\xi_i \ge 0 \qquad (8)$$

For a more detailed mathematical formulation of the problem see [5], [16], [28]. Different simple kernels are often used ([5], [9], [6], [16]):

Linear:
$$K_{lin}(x_1, x_2) = \langle x_1, x_2\rangle \qquad (9)$$

Polynomial:
$$K_{pol}(x_1, x_2) = (\gamma \langle x_1, x_2\rangle + r)^d, \quad \gamma > 0 \qquad (10)$$

RBF:
$$K_{RBF}(x_1, x_2) = \exp\left(-\frac{1}{2\sigma^2}\|x_1 - x_2\|^2\right) \qquad (11)$$

Sigmoidal:
$$K_{sig}(x_1, x_2) = \tanh(\gamma \langle x_1, x_2\rangle + coef) \qquad (12)$$

In some papers the linear case is included in the polynomial one, with the parameters $\gamma, r$ chosen as $\gamma = r = 1$, that is

$$K_d(x_1, x_2) = (\langle x_1, x_2\rangle + 1)^d, \quad d \in \mathbb{Z}_+ \qquad (13)$$
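The four simple kernels (9)-(12) are straightforward to implement. A minimal Python sketch (illustrative only; the default parameter values are chosen arbitrarily):

```python
import numpy as np

def k_lin(x1, x2):
    return np.dot(x1, x2)                                  # eq. (9)

def k_pol(x1, x2, gamma=1.0, r=1.0, d=2):
    return (gamma * np.dot(x1, x2) + r) ** d               # eq. (10)

def k_rbf(x1, x2, sigma=1.0):
    diff = np.asarray(x1) - np.asarray(x2)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma**2))  # eq. (11)

def k_sig(x1, x2, gamma=1.0, coef=0.0):
    return np.tanh(gamma * np.dot(x1, x2) + coef)          # eq. (12)

x1, x2 = np.array([1.0, 2.0]), np.array([3.0, -1.0])
# With gamma = r = 1, k_pol reduces to the single-parameter form (13)
assert k_pol(x1, x2) == (np.dot(x1, x2) + 1.0) ** 2
```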
3 Multiple kernels

Usually the choice of kernel is made empirically, and standard SVM classifiers use a single kernel. Simple kernels depend on one or more parameters, and there are different methods for optimizing these kernel parameters ([7], [15], [29]). Recent papers have shown that multiple kernels give better results than single ones. It is known from the algebra of kernels that the set of operations

$$(+, *, \exp) \qquad (14)$$

preserves Mercer's conditions, and therefore we can obtain multiple kernels using these operations. One possibility is to use a linear combination of simple kernels and to optimize the weights ([1], [9], [15], [17] - [19], [24]). For optimizing the weights, two different kinds of approaches can be found: one reduces the problem to a convex optimization problem ([15], [24]); the other uses evolutionary methods ([9], [21]).

Complex nonlinear multiple kernels were proposed in [10], where a hybrid approach using a genetic algorithm and an SVM algorithm is proposed. Every chromosome codes the expression of a multiple kernel. The quality of a chromosome is the classification accuracy (the number of correctly classified items over the total number of items) obtained by running the SVM algorithm with the multiple kernel coded in that chromosome. The hybrid technique from [10] is structured in two levels: a macro level and a micro level. The macro level is the genetic algorithm which builds the multiple kernel. The micro level is the SVM algorithm which computes the quality of the chromosomes. The accuracy rate is computed by the SVM algorithm on a validation set of data.

4 Main results. Our model

Our goal is to build and analyze a multiple kernel starting from the simple polynomial kernel presented in (13), which has a single parameter, the degree $d$. We use the idea of the model proposed in [10].
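A small numerical check can illustrate that the operations (14) applied to the polynomial kernel (13) yield valid kernels in practice: the Gram matrix of each combined kernel on a sample stays symmetric and positive semidefinite. This is an illustrative sketch, not a proof; the sample points and degrees below are arbitrary choices of ours.

```python
import numpy as np

def gram(kernel, X):
    # Gram matrix G[i, j] = K(x_i, x_j) over a data sample X
    return np.array([[kernel(a, b) for b in X] for a in X])

def k_d(d):
    # Simple polynomial kernel of eq. (13), parameterized by its degree d
    return lambda x1, x2: (np.dot(x1, x2) + 1.0) ** d

# The operations (+, *, exp) of (14) applied to kernel values
k_sum  = lambda x1, x2: k_d(2)(x1, x2) + k_d(3)(x1, x2)
k_prod = lambda x1, x2: k_d(2)(x1, x2) * k_d(3)(x1, x2)
k_exp  = lambda x1, x2: np.exp(k_d(1)(x1, x2))

X = np.array([[0.1, 0.4], [-0.3, 0.2], [0.5, -0.5]])
for k in (k_sum, k_prod, k_exp):
    G = gram(k, X)
    assert np.allclose(G, G.T)                     # symmetric
    assert np.min(np.linalg.eigvalsh(G)) > -1e-9   # positive semidefinite
```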
In a first level, we build and evaluate multiple kernels obtained from (13) using the set of operations (14).

4.1 Representation of the multiple kernel

Using the model from [10], we represent the multiple kernel as a tree whose terminal nodes contain a single kernel, represented by the parameter $d$ of that kernel, and whose intermediate nodes are operations from the set (14). In the macro level we build and evolve the multiple kernels using a genetic algorithm. A chromosome codes a tree associated to a multiple kernel. The cardinal $m$ of the set of terminals is given as input data. We combine the kernels

$$\{K_{d_1}, \dots, K_{d_m}\}, \qquad (15)$$

where $d_i$ represents the degree parameter of the simple polynomial kernel. The multiple kernel is nonlinear when the operation $*$ or $\exp$ appears in its expression. The following tree represents the kernel $K_{d_1} * K_{d_2} + \exp(K_{d_1})$:

          +
         / \
        *   exp
       / \    \
     d_1 d_2  d_1

If we want to obtain a nonlinear kernel, we must impose the presence of at least one of the operations $*$ or $\exp$. The number $L$ represents the maximum number of levels of the tree coded in a chromosome and is also given as input data. Next we introduce the main operations of this algorithm.

4.2 Operations of the genetic algorithm

Initialization

We first randomly generate the number of levels of the chromosome, $l \in \{1, \dots, L\}$. We use a recursive algorithm for the initialization of the chromosome population. The root of a chromosome tree is an operation from the set (14). If a node contains one of the operations $\{+, *\}$, each of its descendants can be either an operation or a kernel (represented by the parameter $d$). If a node contains the operation $\exp$, only one of its descendants is initialized with an operation or a kernel and the other is null. On level $l$ there can be only kernels. The initialization of a chromosome stops when the depth $l$ is reached. The initial population will contain a given number, $CN$, of chromosomes.
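A possible encoding of such a chromosome tree, with a recursive random initialization along the lines described above, might look as follows. This is an illustrative Python sketch, not the paper's implementation: the `Node` class, the degree set, and the random depth policy for descendants are our assumptions.

```python
import random
import numpy as np

OPS = ['+', '*', 'exp']      # the operation set (14)
DEGREES = [1, 2, 3, 4]       # terminal set: degrees d_i of the kernels (15)

class Node:
    def __init__(self, value, left=None, right=None):
        self.value, self.left, self.right = value, left, right

def random_tree(levels):
    # Recursive initialization: internal nodes hold operations, leaves hold
    # kernel degrees; an 'exp' node gets a single descendant (the other is
    # null), and at depth 1 only kernels remain.
    if levels <= 1:
        return Node(random.choice(DEGREES))
    op = random.choice(OPS)
    if op == 'exp':
        return Node(op, left=random_tree(random.randint(1, levels - 1)))
    return Node(op,
                random_tree(random.randint(1, levels - 1)),
                random_tree(random.randint(1, levels - 1)))

def eval_tree(node, x1, x2):
    # Evaluate the multiple kernel coded by the tree at a pair of points.
    if isinstance(node.value, int):   # leaf: K_d(x1, x2) = (<x1, x2> + 1)^d
        return (np.dot(x1, x2) + 1.0) ** node.value
    if node.value == 'exp':
        return np.exp(eval_tree(node.left, x1, x2))
    a = eval_tree(node.left, x1, x2)
    b = eval_tree(node.right, x1, x2)
    return a + b if node.value == '+' else a * b

# The tree from the figure, K_{d1} * K_{d2} + exp(K_{d1}), here with d1=2, d2=3:
example = Node('+', Node('*', Node(2), Node(3)), Node('exp', Node(2)))
```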
Another possible criterion for stopping the initialization process is time: the population will contain the chromosomes generated in the initialization process within a given time $T$.

Evaluation

The evaluation of a chromosome is made by running the SVM algorithm on a particular data set. To do this we divide the data into two subsets: the training subset, used for problem modeling, and the test subset, used for evaluation. The training subset is in turn randomly divided into a learning subset and a validation subset. The SVM algorithm uses the learning subset for training and the validation subset for computing the classification accuracy, which is used as the fitness function of the genetic algorithm. In [10] the following values are indicated: 20% for the test set and 80% for the training set, of which 2/3 for the learning set and 1/3 for the validation set. We propose to run and compare the results for different sizes of these subsets, more exactly 70%, 80% and 90% for the training set, and for each of these, different sizes of the learning (2/3, 3/4) and validation subsets.

Crossover

A percentage $p\%$ of the most valuable chromosomes is chosen for selection. The crossover is realized in the same way as in [10]: we make a single cut in each parent chromosome tree and exchange the descendants obtained in this way (under the cut) between the parents. Each descendant is evaluated using the SVM algorithm and replaces the weakest chromosome of the previous population if it is better.

Mutations

We use three types of mutation, applied to a very small percentage of the population selected for reproduction.

1. Tree mutation. This mutation is the one described in [10]. We make a cut at a point of the tree represented in a chromosome and replace the subtree under the cut with a randomly generated subtree. The new subtree is generated using an algorithm similar to the initialization one.
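The splitting of the data into learning, validation and test subsets described above can be sketched as follows. This is a Python illustration of the stated percentages; the function name and the use of an index permutation are our choices.

```python
import numpy as np

def split_data(n, train_frac=0.8, learn_frac=2/3, seed=0):
    # Split indices 0..n-1 into learning / validation / test subsets:
    # train_frac of the data forms the training subset (the rest is the test
    # subset); within the training subset, learn_frac goes to the learning
    # subset and the remainder to the validation subset.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_train = int(round(train_frac * n))
    train, test = idx[:n_train], idx[n_train:]
    n_learn = int(round(learn_frac * n_train))
    return train[:n_learn], train[n_learn:], test  # learning, validation, test

learn, valid, test = split_data(100)   # 80/20 split, 2/3 of 80 for learning
assert len(test) == 20
```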
This kind of mutation is realized on $p_1\%$ of the selected population.

2. Operation mutation. An operation from the chromosome tree is randomly selected and replaced by another operation from the set (14). The percentage associated to this mutation is $p_2\%$. This mutation can be generalized by changing several operations.

3. Kernel mutation. A kernel from the chromosome tree is randomly selected and replaced by another kernel from the set (15). The percentage associated to this mutation is $p_3\%$. This mutation can be generalized by changing several kernels.

A mutation descendant is evaluated using the SVM algorithm and replaces the weakest chromosome of the previous population if it is better. All the operations proposed for crossover and mutation produce valid multiple kernels ([10]). In order to compare our results with those obtained in [10], we propose an initial population of 20 chromosomes and the following values: $p = 0.8$, $p_1 = p_2 = p_3 = 0.1$. The end criterion for the genetic algorithm can be the time or the number of generations. For every generation, before the reproduction process, the best chromosome is stored.

4.3 SVM algorithm

The SVM algorithm uses the learning subset for training the SVM model and the validation subset for computing the classification accuracy, which is the fitness function for the genetic algorithm. After the genetic algorithm ends, the best chromosome gives the multiple kernel, which is evaluated on the test subset. The way this multiple kernel is constructed ensures that it is a veritable kernel, that is, it satisfies Mercer's conditions. The SVM algorithm is taken from libsvm [6] and modified for our kernels.

5 Conclusion and further directions of study

In this paper we presented a model for SVM multiple kernels which uses only simple polynomial kernels.
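The operation and kernel mutations might be sketched as follows, on a chromosome encoded as a nested list (an illustrative encoding of our own; the paper does not prescribe one). Note that this sketch swaps operations only between + and *, since replacing a binary operation by exp would also change the node's arity.

```python
import random

OPS = ['+', '*', 'exp']      # the operation set (14)
DEGREES = [1, 2, 3, 4]       # degrees of the kernel set (15)

def nodes(tree):
    # Collect every subtree; each node is a list [label, *children].
    found = [tree]
    for child in tree[1:]:
        found += nodes(child)
    return found

def operation_mutation(tree, rng=random):
    # Pick a random binary-operation node and swap in the other binary op;
    # 'exp' is excluded so the node keeps both descendants valid.
    node = rng.choice([n for n in nodes(tree) if n[0] in ('+', '*')])
    node[0] = '+' if node[0] == '*' else '*'

def kernel_mutation(tree, rng=random):
    # Pick a random leaf and replace its degree with another from (15).
    node = rng.choice([n for n in nodes(tree) if n[0] in DEGREES])
    node[0] = rng.choice([d for d in DEGREES if d != node[0]])

# Leaves are encoded as one-element lists so they can mutate in place:
tree = ['+', ['*', [2], [3]], ['exp', [2]]]   # K_2 * K_3 + exp(K_2)
kernel_mutation(tree)
operation_mutation(tree)
```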
We start from the model presented in [10], but our model differs from it: it handles only simple polynomial kernels, and the macro level, that is, the genetic algorithm used, is different. Our intention was to study the possibility of obtaining nonlinear multiple kernels using only simple polynomial classifiers and to analyze their performance in comparison with other multiple kernels. Therefore we chose for our experiments the most common data sets, used in a great number of papers ([9], [10], [24], [16]) and taken from the libsvm website ([6]). For the beginning we use the form (13) of the polynomial kernel. One of our further directions of study is to extend our model to the form (10) of the simple polynomial kernel and to use triplets of parameters $(\gamma, r, d)$. Another direction of study is the refinement of the training process of the SVM algorithm in our approach. Further numerical experiments are required in order to assess the power of our evolved kernel.

Kernel substitution can be used to define many other types of learning machines distinct from SVMs. A possible direction of study is to generalize our approach in order to obtain multiple kernels for tasks other than classification. An interesting and important direction of study is a comparative study of the action of different types of kernels on various benchmark data sets. Our open problem is: is it possible to associate to data from different fields a single standard kernel which generates multiple kernels (using a generalization of our method) with very good behavior for these data for a given task (the first task being classification)?

Acknowledgment: The research was supported by the research grant of the Romanian Ministry of Education and Research, CNCSIS 33/2007.

References:
[1] Bach F. R., Lanckriet G. R. G., Jordan M. I., Multiple kernel learning, conic duality, and the SMO algorithm, Machine Learning, Proceedings of ICML 2004, ACM, 2004.
[2] Baudat G., Anouar F., Kernel-based Methods and Function Approximation, International Joint Conference on Neural Networks, Washington, 2001.
[3] Boser B. E., Guyon I., Vapnik V., A training algorithm for optimal margin classifiers, COLT, 1992.
[4] Blake C. L., Newman D. J., Hettich S., Merz C. J., UCI repository of machine learning databases.
[5] Campbell C., An Introduction to Kernel Methods, in Radial Basis Function Networks: Design and Applications, Springer Verlag, Berlin, 2000.
[6] Chang C.-C., Lin C.-J., LIBSVM: a library for support vector machines, software available at cjlin/libsvm.
[7] Chapelle O., Vapnik V., Bousquet O., Mukherjee S., Choosing multiple parameters for support vector machines, Machine Learning, 46(1/3), 2002.
[8] Cristianini N., Shawe-Taylor J., An Introduction to Support Vector Machines, Cambridge University Press.
[9] Diosan L., Oltean M., Rogozan A., Pecuchet J. P., Improving SVM performance using a linear combination of kernels, Adaptive and Natural Computing Algorithms, ICANNGA'07, volume 4432 of LNCS, 2007.
[10] Diosan L., Rogozan A., Pecuchet J. P., Une approche evolutive pour generer des noyaux multiples (An evolutionary approach for generating multiple kernels), portal VODEL.
[11] Diosan L., Oltean M., Rogozan A., Pecuchet J. P., Genetically Designed Multiple-Kernels for Improving the SVM Performance, portal VODEL.
[12] Goldberg D. E., Genetic Algorithms in Search, Optimization and Machine Learning, Addison Wesley.
[13] Howley T., Madden M. G., The genetic kernel support vector machine: Description and evaluation, Artif. Intell. Rev., 24(3-4), 2005.
[14] Howley T., Madden M. G., An evolutionary approach to automatic kernel construction, in S. D. Kollias, A. Stafylopatis, W. Duch, and E. Oja, editors, Artificial Neural Networks - ICANN 2006, volume 4132 of LNCS, Springer, 2006.
[15] Lanckriet G. R. G., Cristianini N., Bartlett P., El Ghaoui L., Jordan M. I., Learning the kernel matrix with semidefinite programming, Journal of Machine Learning Research, 5, 2004.
[16] Müller K.-R., Mika S., Rätsch G., Tsuda K., Schölkopf B., An Introduction to Kernel-Based Learning Algorithms, IEEE Transactions on Neural Networks, Vol. 12, No. 2, 2001.
[17] Nguyen H. N., Ohn S. Y., Choi W. J., Combined kernel function for support vector machine and learning method based on evolutionary algorithm, in N. R. Pal, N. Kasabov, R. K. Mudi, S. Pal, and S. K. Parui, editors, Neural Information Processing, 11th International Conference, ICONIP 2004, volume 3316 of LNCS, Springer, 2004.
[18] Ohn S. Y., Nguyen H. N., Chi S. D., Evolutionary parameter estimation algorithm for combined kernel function in support vector machine, in C.-H. Chi and K.-Y. Lam, editors, Content Computing, Advanced Workshop on Content Computing, AWCC 2004, volume 3309 of LNCS, Springer, 2004.
[19] Ohn S. Y., Nguyen H. N., Kim D. S., Park J. S., Determining optimal decision model for support vector machine by genetic algorithm, in J. Zhang, J.-H. He, and Y. Fu, editors, Computational and Information Science, First International Symposium, CIS 2004, volume 3314 of LNCS, Springer, 2004.
[20] Ong C. S., Smola A., Williamson B., Learning the kernel with hyperkernels, Journal of Machine Learning Research, 6, 2005.
[21] Stahlbock R., Lessmann S., Crone S., Genetically constructed kernels for support vector machines, Proc. of German Operations Research, Springer, 2005.
[22] Schölkopf B., Smola A. J., Learning with Kernels, The MIT Press.
[23] Schölkopf B., Tsuda K., Vert J. P., Kernel Methods in Computational Biology, MIT Press.
[24] Sonnenburg S., Rätsch G., Schäfer C., Schölkopf B., Large scale multiple kernel learning, Journal of Machine Learning Research, 7, 2006.
[25] Steinwart I., On the optimal parameter choice for ν-support vector machines.
[26] Suykens J. A. K., Support Vector Machines: A Nonlinear Modelling and Control Perspective, European Journal of Control, Special Issue on Fundamental Issues in Control, 7(2-3), 2001.
[27] Syswerda G., Uniform crossover in genetic algorithms, ICGA, 1989.
[28] Vapnik V., The Nature of Statistical Learning Theory, Springer.
[29] Weston J., Mukherjee S., Chapelle O., Pontil M., Poggio T., Vapnik V., Feature selection for SVMs, in T. K. Leen, T. G. Dietterich, and V. Tresp, editors, NIPS, MIT Press, 2000.
More informationAn Extended Level Method for Efficient Multiple Kernel Learning
An Extended Level Method for Efficient Multiple Kernel Learning Zenglin Xu Rong Jin Irwin King Michael R. Lyu Dept. of Computer Science & Engineering The Chinese University of Hong Kong Shatin, N.T., Hong
More informationSupport Vector Machines
Support Vector Machines . Importance of SVM SVM is a discriminative method that brings together:. computational learning theory. previously known methods in linear discriminant functions 3. optimization
More informationSVM Classification in Multiclass Letter Recognition System
Global Journal of Computer Science and Technology Software & Data Engineering Volume 13 Issue 9 Version 1.0 Year 2013 Type: Double Blind Peer Reviewed International Research Journal Publisher: Global Journals
More informationMathematical Themes in Economics, Machine Learning, and Bioinformatics
Western Kentucky University From the SelectedWorks of Matt Bogard 2010 Mathematical Themes in Economics, Machine Learning, and Bioinformatics Matt Bogard, Western Kentucky University Available at: https://works.bepress.com/matt_bogard/7/
More informationA Summary of Support Vector Machine
A Summary of Support Vector Machine Jinlong Wu Computational Mathematics, SMS, PKU May 4,2007 Introduction The Support Vector Machine(SVM) has achieved a lot of attention since it is developed. It is widely
More informationA Subspace Kernel for Nonlinear Feature Extraction
A Subspace Kernel for Nonlinear Feature Extraction Mingrui Wu, Jason Farquhar Max Planck Institute for Biological Cybernetics, 72076 Tübingen, Germany {firstname.lastname}@tuebingen.mpg.de Abstract Kernel
More informationEfficient Tuning of SVM Hyperparameters Using Radius/Margin Bound and Iterative Algorithms
IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 13, NO. 5, SEPTEMBER 2002 1225 Efficient Tuning of SVM Hyperparameters Using Radius/Margin Bound and Iterative Algorithms S. Sathiya Keerthi Abstract This paper
More informationClassification by Support Vector Machines
Classification by Support Vector Machines Florian Markowetz Max-Planck-Institute for Molecular Genetics Computational Molecular Biology Berlin Practical DNA Microarray Analysis 2003 1 Overview I II III
More informationData Mining in Bioinformatics Day 1: Classification
Data Mining in Bioinformatics Day 1: Classification Karsten Borgwardt February 18 to March 1, 2013 Machine Learning & Computational Biology Research Group Max Planck Institute Tübingen and Eberhard Karls
More information9. Support Vector Machines. The linearly separable case: hard-margin SVMs. The linearly separable case: hard-margin SVMs. Learning objectives
Foundations of Machine Learning École Centrale Paris Fall 25 9. Support Vector Machines Chloé-Agathe Azencot Centre for Computational Biology, Mines ParisTech Learning objectives chloe agathe.azencott@mines
More informationAn Experimental Multi-Objective Study of the SVM Model Selection problem
An Experimental Multi-Objective Study of the SVM Model Selection problem Giuseppe Narzisi Courant Institute of Mathematical Sciences New York, NY 10012, USA narzisi@nyu.edu Abstract. Support Vector machines
More informationFingerprint Classification with Combinations of Support Vector Machines
Fingerprint Classification with Combinations of Support Vector Machines Yuan Yao 1, Paolo Frasconi 2, and Massimiliano Pontil 3,1 1 Department of Mathematics, City University of Hong Kong, Hong Kong 2
More informationSVM in Analysis of Cross-Sectional Epidemiological Data Dmitriy Fradkin. April 4, 2005 Dmitriy Fradkin, Rutgers University Page 1
SVM in Analysis of Cross-Sectional Epidemiological Data Dmitriy Fradkin April 4, 2005 Dmitriy Fradkin, Rutgers University Page 1 Overview The goals of analyzing cross-sectional data Standard methods used
More informationEVOLVING HYPERPARAMETERS OF SUPPORT VECTOR MACHINES BASED ON MULTI-SCALE RBF KERNELS
EVOLVING HYPERPARAMETERS OF SUPPORT VECTOR MACHINES BASED ON MULTI-SCALE RBF KERNELS Tanasanee Phienthrakul and Boonserm Kijsirikul Department of Conputer Engineering, Chulalongkorn Uniersity, Thailand
More informationKernel-based online machine learning and support vector reduction
Kernel-based online machine learning and support vector reduction Sumeet Agarwal 1, V. Vijaya Saradhi 2 andharishkarnick 2 1- IBM India Research Lab, New Delhi, India. 2- Department of Computer Science
More informationFEATURE GENERATION USING GENETIC PROGRAMMING BASED ON FISHER CRITERION
FEATURE GENERATION USING GENETIC PROGRAMMING BASED ON FISHER CRITERION Hong Guo, Qing Zhang and Asoke K. Nandi Signal Processing and Communications Group, Department of Electrical Engineering and Electronics,
More informationOne-class Problems and Outlier Detection. 陶卿 中国科学院自动化研究所
One-class Problems and Outlier Detection 陶卿 Qing.tao@mail.ia.ac.cn 中国科学院自动化研究所 Application-driven Various kinds of detection problems: unexpected conditions in engineering; abnormalities in medical data,
More informationKernel SVM. Course: Machine Learning MAHDI YAZDIAN-DEHKORDI FALL 2017
Kernel SVM Course: MAHDI YAZDIAN-DEHKORDI FALL 2017 1 Outlines SVM Lagrangian Primal & Dual Problem Non-linear SVM & Kernel SVM SVM Advantages Toolboxes 2 SVM Lagrangian Primal/DualProblem 3 SVM LagrangianPrimalProblem
More informationSupport Vector Machine Ensemble with Bagging
Support Vector Machine Ensemble with Bagging Hyun-Chul Kim, Shaoning Pang, Hong-Mo Je, Daijin Kim, and Sung-Yang Bang Department of Computer Science and Engineering Pohang University of Science and Technology
More informationRobustness of Selective Desensitization Perceptron Against Irrelevant and Partially Relevant Features in Pattern Classification
Robustness of Selective Desensitization Perceptron Against Irrelevant and Partially Relevant Features in Pattern Classification Tomohiro Tanno, Kazumasa Horie, Jun Izawa, and Masahiko Morita University
More informationFeature Selection for Multi-Class Problems Using Support Vector Machines
Feature Selection for Multi-Class Problems Using Support Vector Machines Guo-Zheng Li, Jie Yang, Guo-Ping Liu, Li Xue Institute of Image Processing & Pattern Recognition, Shanghai Jiao Tong University,
More informationCOMS 4771 Support Vector Machines. Nakul Verma
COMS 4771 Support Vector Machines Nakul Verma Last time Decision boundaries for classification Linear decision boundary (linear classification) The Perceptron algorithm Mistake bound for the perceptron
More informationJ. Weston, A. Gammerman, M. Stitson, V. Vapnik, V. Vovk, C. Watkins. Technical Report. February 5, 1998
Density Estimation using Support Vector Machines J. Weston, A. Gammerman, M. Stitson, V. Vapnik, V. Vovk, C. Watkins. Technical Report CSD-TR-97-3 February 5, 998!()+, -./ 3456 Department of Computer Science
More informationSoftware Documentation of the Potential Support Vector Machine
Software Documentation of the Potential Support Vector Machine Tilman Knebel and Sepp Hochreiter Department of Electrical Engineering and Computer Science Technische Universität Berlin 10587 Berlin, Germany
More informationA Short SVM (Support Vector Machine) Tutorial
A Short SVM (Support Vector Machine) Tutorial j.p.lewis CGIT Lab / IMSC U. Southern California version 0.zz dec 004 This tutorial assumes you are familiar with linear algebra and equality-constrained optimization/lagrange
More informationRobust 1-Norm Soft Margin Smooth Support Vector Machine
Robust -Norm Soft Margin Smooth Support Vector Machine Li-Jen Chien, Yuh-Jye Lee, Zhi-Peng Kao, and Chih-Cheng Chang Department of Computer Science and Information Engineering National Taiwan University
More informationMultiple Kernel Machines Using Localized Kernels
Multiple Kernel Machines Using Localized Kernels Mehmet Gönen and Ethem Alpaydın Department of Computer Engineering Boğaziçi University TR-3434, Bebek, İstanbul, Turkey gonen@boun.edu.tr alpaydin@boun.edu.tr
More information10. Support Vector Machines
Foundations of Machine Learning CentraleSupélec Fall 2017 10. Support Vector Machines Chloé-Agathe Azencot Centre for Computational Biology, Mines ParisTech chloe-agathe.azencott@mines-paristech.fr Learning
More informationDECISION-TREE-BASED MULTICLASS SUPPORT VECTOR MACHINES. Fumitake Takahashi, Shigeo Abe
DECISION-TREE-BASED MULTICLASS SUPPORT VECTOR MACHINES Fumitake Takahashi, Shigeo Abe Graduate School of Science and Technology, Kobe University, Kobe, Japan (E-mail: abe@eedept.kobe-u.ac.jp) ABSTRACT
More informationA generalized quadratic loss for Support Vector Machines
A generalized quadratic loss for Support Vector Machines Filippo Portera and Alessandro Sperduti Abstract. The standard SVM formulation for binary classification is based on the Hinge loss function, where
More informationA Support Vector Method for Hierarchical Clustering
A Support Vector Method for Hierarchical Clustering Asa Ben-Hur Faculty of IE and Management Technion, Haifa 32, Israel David Horn School of Physics and Astronomy Tel Aviv University, Tel Aviv 69978, Israel
More informationRule Based Learning Systems from SVM and RBFNN
Rule Based Learning Systems from SVM and RBFNN Haydemar Núñez 1, Cecilio Angulo 2 and Andreu Català 2 1 Laboratorio de Inteligencia Artificial, Universidad Central de Venezuela. Caracas, Venezuela hnunez@strix.ciens.ucv.ve
More informationBagging for One-Class Learning
Bagging for One-Class Learning David Kamm December 13, 2008 1 Introduction Consider the following outlier detection problem: suppose you are given an unlabeled data set and make the assumptions that one
More informationMaximum Margin Binary Classifiers using Intrinsic and Penalty Graphs
Maximum Margin Binary Classifiers using Intrinsic and Penalty Graphs Berkay Kicanaoglu, Alexandros Iosifidis and Moncef Gabbouj Department of Signal Processing, Tampere University of Technology, Tampere,
More informationGeneralized version of the support vector machine for binary classification problems: supporting hyperplane machine.
E. G. Abramov 1*, A. B. Komissarov 2, D. A. Kornyakov Generalized version of the support vector machine for binary classification problems: supporting hyperplane machine. In this paper there is proposed
More informationLocal Linear Approximation for Kernel Methods: The Railway Kernel
Local Linear Approximation for Kernel Methods: The Railway Kernel Alberto Muñoz 1,JavierGonzález 1, and Isaac Martín de Diego 1 University Carlos III de Madrid, c/ Madrid 16, 890 Getafe, Spain {alberto.munoz,
More informationFeature Selection Methods for an Improved SVM Classifier
Feature Selection Methods for an Improved SVM Classifier Daniel Morariu, Lucian N. Vintan, and Volker Tresp Abstract Text categorization is the problem of classifying text documents into a set of predefined
More informationBuilding Sparse Multiple-Kernel SVM Classifiers Mingqing Hu, Yiqiang Chen, and James Tin-Yau Kwok
IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 20, NO. 5, MAY 2009 827 Building Sparse Multiple-Kernel SVM Classifiers Mingqing Hu, Yiqiang Chen, and James Tin-Yau Kwok Abstract The support vector machines
More informationKernel Principal Component Analysis: Applications and Implementation
Kernel Principal Component Analysis: Applications and Daniel Olsson Royal Institute of Technology Stockholm, Sweden Examiner: Prof. Ulf Jönsson Supervisor: Prof. Pando Georgiev Master s Thesis Presentation
More informationSecond Order SMO Improves SVM Online and Active Learning
Second Order SMO Improves SVM Online and Active Learning Tobias Glasmachers and Christian Igel Institut für Neuroinformatik, Ruhr-Universität Bochum 4478 Bochum, Germany Abstract Iterative learning algorithms
More informationSupport Vector Machines for Face Recognition
Chapter 8 Support Vector Machines for Face Recognition 8.1 Introduction In chapter 7 we have investigated the credibility of different parameters introduced in the present work, viz., SSPD and ALR Feature
More informationIntroduction to Machine Learning
Introduction to Machine Learning Maximum Margin Methods Varun Chandola Computer Science & Engineering State University of New York at Buffalo Buffalo, NY, USA chandola@buffalo.edu Chandola@UB CSE 474/574
More informationLeave-One-Out Support Vector Machines
Leave-One-Out Support Vector Machines Jason Weston Department of Computer Science Royal Holloway, University of London, Egham Hill, Egham, Surrey, TW20 OEX, UK. Abstract We present a new learning algorithm
More informationSupport Vector Machines.
Support Vector Machines srihari@buffalo.edu SVM Discussion Overview 1. Overview of SVMs 2. Margin Geometry 3. SVM Optimization 4. Overlapping Distributions 5. Relationship to Logistic Regression 6. Dealing
More informationLearning texture similarity with perceptual pairwise distance
University of Wollongong Research Online Faculty of Engineering and Information Sciences - Papers: Part A Faculty of Engineering and Information Sciences 2005 Learning texture similarity with perceptual
More informationSupport vector machine (II): non-linear SVM. LING 572 Fei Xia
Support vector machine (II): non-linear SVM LING 572 Fei Xia 1 Linear SVM Maximizing the margin Soft margin Nonlinear SVM Kernel trick A case study Outline Handling multi-class problems 2 Non-linear SVM
More informationS. Sreenivasan Research Scholar, School of Advanced Sciences, VIT University, Chennai Campus, Vandalur-Kelambakkam Road, Chennai, Tamil Nadu, India
International Journal of Civil Engineering and Technology (IJCIET) Volume 9, Issue 10, October 2018, pp. 1322 1330, Article ID: IJCIET_09_10_132 Available online at http://www.iaeme.com/ijciet/issues.asp?jtype=ijciet&vtype=9&itype=10
More informationLab 2: Support vector machines
Artificial neural networks, advanced course, 2D1433 Lab 2: Support vector machines Martin Rehn For the course given in 2006 All files referenced below may be found in the following directory: /info/annfk06/labs/lab2
More informationWell Analysis: Program psvm_welllogs
Proximal Support Vector Machine Classification on Well Logs Overview Support vector machine (SVM) is a recent supervised machine learning technique that is widely used in text detection, image recognition
More informationkernlab A Kernel Methods Package
New URL: http://www.r-project.org/conferences/dsc-2003/ Proceedings of the 3rd International Workshop on Distributed Statistical Computing (DSC 2003) March 20 22, Vienna, Austria ISSN 1609-395X Kurt Hornik,
More informationCombining SVMs with Various Feature Selection Strategies
Combining SVMs with Various Feature Selection Strategies Yi-Wei Chen and Chih-Jen Lin Department of Computer Science, National Taiwan University, Taipei 106, Taiwan Summary. This article investigates the
More informationCS570: Introduction to Data Mining
CS570: Introduction to Data Mining Classification Advanced Reading: Chapter 8 & 9 Han, Chapters 4 & 5 Tan Anca Doloc-Mihu, Ph.D. Slides courtesy of Li Xiong, Ph.D., 2011 Han, Kamber & Pei. Data Mining.
More informationLecture 7: Support Vector Machine
Lecture 7: Support Vector Machine Hien Van Nguyen University of Houston 9/28/2017 Separating hyperplane Red and green dots can be separated by a separating hyperplane Two classes are separable, i.e., each
More information