An Experimental Multi-Objective Study of the SVM Model Selection problem

Giuseppe Narzisi
Courant Institute of Mathematical Sciences, New York, NY 10012, USA
narzisi@nyu.edu

Abstract. Support Vector Machines (SVMs) are a powerful method for both regression and classification. However, any SVM formulation requires the user to set two or more parameters that govern the training process, and these parameters can have a strong effect on the resulting performance of the engine. Moreover, the design of learning systems is inherently a multi-objective optimization problem: it requires finding a suitable trade-off between at least two conflicting objectives, model complexity and accuracy. In this work the SVM model selection problem is cast as a multi-objective optimization problem, where the error and the number of support vectors of the model define the two objectives. Experimental analysis is presented on a well-known test bed of datasets using two different kernels: RBF and sigmoid.

Key words: Support Vector Machine, Multi-Objective Optimization, NSGA-II, SVM Model Selection.

1 Introduction

Support Vector Machines have proven to be very effective methods for classification and regression [12]. However, in order to obtain good generalization errors the user needs to choose appropriate values for the parameters of the model. The kernel parameters, together with the regularization parameter C, are called the hyperparameters of the SVM, and the problem of tuning them in order, for example, to improve the generalization of the model is called the SVM model selection problem. The standard method for determining the hyperparameters is grid search. In the simple grid-search approach the hyperparameters are varied with a fixed step size through a wide range of values and the performance of every combination is measured. Because of its computational complexity, grid search is only suitable for the adjustment of very few parameters.
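To make the cost of the grid-search approach concrete, the following is a minimal Python sketch. It is an illustration only, not the procedure used in the experiments: the helper names (`log2_grid`, `grid_search`), the exponent ranges, and the toy score surface standing in for a cross-validation error are all assumptions.

```python
import itertools
import math


def log2_grid(lo, hi, step):
    """Return [2**lo, 2**(lo+step), ...] for integer exponents up to hi."""
    return [2.0 ** e for e in range(lo, hi + 1, step)]


def grid_search(score, grids):
    """Evaluate `score` on every combination of grid values.

    `score` is a hypothetical callable (lower is better), for example a
    cross-validation error; `grids` maps parameter names to value lists.
    Returns (best_params, best_score, number_of_evaluations).
    """
    names = list(grids)
    best_params, best_score, evaluations = None, float("inf"), 0
    for values in itertools.product(*(grids[n] for n in names)):
        params = dict(zip(names, values))
        s = score(**params)
        evaluations += 1
        if s < best_score:
            best_params, best_score = params, s
    return best_params, best_score, evaluations


def toy_score(C, gamma):
    """Toy stand-in for an error surface (assumption, not real SVM data)."""
    return (math.log2(C) - 3) ** 2 + (math.log2(gamma) + 4) ** 2


grids = {"C": log2_grid(-5, 15, 2), "gamma": log2_grid(-10, 4, 2)}
best, best_score, n_evals = grid_search(toy_score, grids)
print(best, best_score, n_evals)
```

With 11 values for C and 8 for gamma the loop already performs 88 evaluations; each additional hyperparameter multiplies this count by the size of its grid, which is why the approach does not scale beyond a few parameters.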
Further, the choice of the discretization of the search space may be crucial. Figure 1 shows the typical parameter surface for the error and the number of support vectors as a function of the hyperparameters C and γ for the diabetes dataset. Recently, gradient-based approaches have been explored for choosing the hyperparameters [2, 6, 8]. However, they have some drawbacks and limitations. First of all, the score function used to evaluate the quality of a set

of hyperparameters must be differentiable, which excludes important measures such as the number of support vectors. Also, because the objective function is strongly multimodal, the performance of the gradient-based heuristic depends on the initialization, which means that the algorithm can easily get stuck in a sub-optimal local minimum.

Fig. 1. Parameter surface of the error (a) and the number of SVs (b) as a function of the two hyperparameters C and γ for the diabetes dataset using 5-fold cross-validation.

The main idea missing from this kind of approach is that the SVM model selection problem is inherently a multi-objective optimization problem. Designing supervised learning systems for classification requires finding a suitable trade-off between several objectives. Typically we want to reduce the complexity of the model while at the same time obtaining a model with a high accuracy level (or low error rate). Sometimes the model with the best generalization may not be the best choice if the price to pay is working with a model that is very complex in terms of both time and space. Usually this problem is tackled by aggregating the objectives into a scalar function (a linear weighting of the objectives) and applying standard methods to the resulting single-objective optimization problem. However, it has been shown that this approach is not a good solution because it requires that the aggregate function correctly match the problem, which is not an easy task. The best solution is to apply the multi-objective approach directly in order to find the Pareto-optimal set of solutions for the problem.
Among the many possible approaches to solving a multi-objective optimization problem, the last decade has seen Multi-Objective Evolutionary Algorithms (MOEAs) emerge as a prominent method in this area. Successful applications have already been obtained in the machine learning area in the case of feature selection for SVMs [9, 10]. Experiments similar to the ones presented in this paper have been proposed in [7], where the split modified radius margin bounds and the training error were used in conjunction with the number of SVs. The experiments presented in this work differ from that approach in many ways: 1) the impact of different kernels

is analyzed; 2) the simple, straightforward 2-objective formulation is considered (number of SVs and CV error) before any additional sophistication; 3) the standard NSGA-II algorithm is used instead of the NSES algorithm proposed in [7]; 4) the error is evaluated using the 5-fold cross-validation method.

There are many reasons for using a multi-objective evolutionary approach for SVM model selection: the ability to obtain in one run not just a single model but several models that are optimal (in the Pareto sense) with respect to the selected objectives or criteria; the best SVM model can be selected later from the Pareto front according to some higher-level information or preferences; multiple hyperparameters can be tuned at the same time, overcoming the limitation of the naive grid-search method; the objectives/criteria do not need to be differentiable (as required by gradient-based methods); efficient exploration of the multimodal search space associated with the parameters.

The goal of this research work is to show the effectiveness of this approach for SVM model selection using a very simple 2-objective formulation that takes into account the complexity and the accuracy of the model. The paper is organized as follows. Section 2 introduces SVMs and SVM model selection from the perspective of multi-objective optimization. Section 3 gives the background on multi-objective optimization. Section 4 introduces the model selection metrics and the class of multi-objective evolutionary algorithms. Section 5 reports the results obtained on a test bed of four datasets widely used in the literature. Finally, the conclusions are presented and possible future lines of investigation are given.
2 Multi-objective view of SVM

The first evidence of the multi-objective nature of SVMs is directly related to their standard formulation in the inconsistent case, the so-called C-SVM formulation:

\[
\min \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i
\quad \text{subject to} \quad y_i [w \cdot x_i + b] \ge 1 - \xi_i, \;\; \xi_i \ge 0, \;\; i \in [1, m]
\tag{1}
\]

where C is the regularization parameter, which determines the trade-off between the margin and the sum of the slack variables $\sum_{i=1}^{m} \xi_i$. The constant C is usually determined using some heuristic approach. However, the more natural formulation of the problem is the following:

\[
\min \; \frac{1}{2}\|w\|^2, \qquad \min \; \sum_{i=1}^{m} \xi_i
\quad \text{subject to} \quad y_i [w \cdot x_i + b] \ge 1 - \xi_i, \;\; \xi_i \ge 0, \;\; i \in [1, m]
\tag{2}
\]

where the objective in (1) is split into two different conflicting objectives, overcoming the problem of determining the parameter C. Even if this formulation

is more natural than (1), little work on this problem is present in the literature. It would be interesting to analyze this problem using the theoretical approach presented by Mihalis Yannakakis in [13], where he discusses the conditions under which an approximate trade-off curve can be constructed efficiently (in polynomial time).

The multi-objective nature of SVM training is also present at the level of model selection. The typical evaluation criterion for a classifier is the accuracy of the model in classifying newly generated points, and this metric is often used alone in order to select or generate good classifiers. However, there are many other important factors that must be taken into account when selecting an SVM model. A possible (non-exhaustive) list is the following: number of input features; bound on the generalization error (e.g., radius margin bound); number of support vectors. In this paper we consider the last one, the number of SVs, as an additional selection criterion.

3 Multi-Objective Optimization

When an optimization problem involves more than one objective function, the task of finding one (or more) optimum solution(s) is known as the Multi-Objective Optimization Problem (MOOP) [4]. An optimum solution with respect to one objective may not be optimum with respect to another objective. As a consequence, one cannot choose a solution that is optimal with respect to only one objective. In problems characterized by more than one conflicting objective, there is no single optimum solution; instead there exists a set of solutions which are all optimal, called the Pareto-optimal front.
A general multi-objective optimization problem is defined as follows (minimization case):

\[
\begin{aligned}
\min \quad & F(x) = [f_1(x), f_2(x), \ldots, f_M(x)] \\
\text{subject to} \quad & E(x) = [e_1(x), e_2(x), \ldots, e_L(x)] \ge 0, \\
& x_i^{(L)} \le x_i \le x_i^{(U)}, \quad i = 1, \ldots, N,
\end{aligned}
\tag{3}
\]

where $x = (x_1, x_2, \ldots, x_N)$ is the vector of the N decision variables, M is the number of objectives $f_i$, L is the number of constraints $e_j$, and $x_i^{(L)}$ and $x_i^{(U)}$ are respectively the lower and upper bound for each decision variable $x_i$. Two different solutions are compared using the concept of dominance, which induces a strict partial order in the objective space F. Here a solution a is said to dominate a solution b if it is better or equal in all objectives and better in at least one objective. For the minimization case we have:

\[
F(a) \prec F(b) \iff
\begin{cases}
f_i(a) \le f_i(b) \quad \forall i \in \{1, \ldots, M\}, \\
\exists j \in \{1, \ldots, M\} : \; f_j(a) < f_j(b).
\end{cases}
\tag{4}
\]
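The dominance relation in (4) translates almost literally into code. The sketch below is a plain-Python illustration with hypothetical function names; it checks dominance between two objective vectors and filters a list down to its nondominated points with a simple quadratic scan. The (error, number of SVs) pairs are made up for the example and are not experimental results.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if objective vector a dominates b (minimization case)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(points: List[Sequence[float]]) -> List[Sequence[float]]:
    """Return the points that no other point dominates (O(n^2) scan)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (error %, number of SVs) pairs, for illustration only.
models = [(12.0, 570), (14.0, 370), (14.5, 400), (13.0, 600), (20.0, 300)]
front = pareto_front(models)
print(front)  # [(12.0, 570), (14.0, 370), (20.0, 300)]
```

Note that a point never dominates itself, since the strict inequality in (4) cannot hold for any objective.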

In the specific case of SVM model selection, the hyperparameters are the decision variables of the problem, the range of exploration for each parameter gives the bounds for each decision variable, and the model selection criteria are the objectives (no constraints are used in this formulation).

4 Method

4.1 Model selection metrics

As discussed in Section 2, there are many criteria that can be used for SVM model selection. In this section we introduce the two objectives that have been used for the simulations.

Accuracy. The most direct way to evaluate the quality of an SVM model is to consider its classification performance (accuracy). In the simple case the data is split into a training set and a validation set. The first set is used to generate the SVM model, and the second set is used to evaluate the performance of the classifier. In this work we use the more general approach called L-fold cross-validation (CV). The data is partitioned into L disjoint sets D_1, D_2, ..., D_L and the SVM is trained L times, each time on all the data except the set D_i, which is then used as validation data. The accuracy (or error) is computed as the mean over the L different experiments. For reasons of computational complexity we use 5-fold CV for each dataset.

Number of support vectors. We know that in the hard-margin case the number of SVs is an upper bound on the expected number of errors made by the leave-one-out procedure. Moreover, the space and time complexity of the SVM classifier scales with the number of SVs. It follows that it is important to have an SVM model with a small number of support vectors (SVs). As with the 5-fold CV error, the number of SVs is computed as the mean over the 5 different experiments of the CV method.

4.2 Multi-Objective Evolutionary Algorithms

Evolutionary algorithms (EAs) are search methods that take their inspiration from natural selection and survival of the fittest in the biological world.
EAs differ from more traditional optimization techniques in that they involve a search from a population of solutions, not from a single point. Each iteration of an EA involves a competitive selection that weeds out poor solutions. The solutions with high fitness are recombined with other solutions by swapping parts of a solution with another. Solutions are also mutated by making a small change to a single element of the solution. Recombination and mutation are used to generate new solutions that are biased towards regions of the space for which good solutions have already been seen. Multi-Objective Evolutionary Algorithms (MOEAs) are a special class of EAs with the goal of solving problems involving many conflicting objectives [4].
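The generic loop described above (competitive selection, recombination by swapping parts of two solutions, mutation of a single element) can be sketched in a few lines of Python. This is a deliberately simple single-objective illustration and not NSGA-II: the binary tournament, the one-point crossover, the Gaussian mutation, the elitist copy of the best individual, and all names and parameter values are assumptions chosen for brevity.

```python
import random


def tournament(population, fitness, rng):
    """Competitive selection: the better of two random individuals wins."""
    a, b = rng.sample(population, 2)
    return a if fitness(a) <= fitness(b) else b


def evolve(fitness, population, generations, rng, mutation_scale=0.1):
    """Run a minimal EA (minimization) and return the final population."""
    for _ in range(generations):
        # Carry over a copy of the current best individual (simple elitism).
        offspring = [list(min(population, key=fitness))]
        while len(offspring) < len(population):
            p1 = tournament(population, fitness, rng)
            p2 = tournament(population, fitness, rng)
            # Recombination: head of one parent joined to tail of the other.
            cut = rng.randrange(1, len(p1))
            child = p1[:cut] + p2[cut:]
            # Mutation: small change to a single element of the solution.
            k = rng.randrange(len(child))
            child[k] += rng.gauss(0.0, mutation_scale)
            offspring.append(child)
        population = offspring
    return population


def sphere(x):
    """Toy objective with its minimum at the origin."""
    return sum(v * v for v in x)


rng = random.Random(0)
start = [[rng.uniform(-5, 5) for _ in range(3)] for _ in range(20)]
final = evolve(sphere, start, generations=30, rng=rng)
print(min(map(sphere, start)), min(map(sphere, final)))
```

Because the best individual is copied unchanged into every new generation, the best fitness in the population can never get worse from one generation to the next.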

Fig. 2. NSGA-II and LIBSVM pipeline. (Diagram: NSGA-II evolves a population of hyperparameters; the LIBSVM library returns the error and the mean number of SVs on 5-fold cross-validation; the output Pareto fronts (trade-off curves) feed a decision-making phase for SVM model selection, followed by a test on new data.)

Over the last decade, a steady stream of MOEAs has continued to be proposed and studied [4, 3]. MOEAs have been successfully applied to several real-world problems (protein folding, circuit design, safety-related systems, etc.) even though no strong proof of convergence is available. Among the growing class of MOEAs, in this work we employ the well-known NSGA-II [5] (Nondominated Sorting Genetic Algorithm II). NSGA-II uses a fast nondominated sorting approach to sort a population of solutions into different nondomination levels. It then uses elitism and a crowded-comparison operator for diversity preservation.

Table 1. Benchmark datasets.

Name        Size   Features  Repository
diabetes    768    8         UCI
australian  690    14        Statlog
german      1,000  24        Statlog
splice      1,000  60        Delve

5 Results

5.1 Experiments

In this research work we deal with the standard application of SVMs to binary classification. We used a common benchmark of four datasets (Table 1 shows their characteristics). We consider two different kernels and their parameters:

RBF (radial basis function): $K(u, v) = \exp(-\gamma \|u - v\|^2)$
Sigmoid: $K(u, v) = \tanh(\gamma u^T v + \mathrm{coef}_0)$
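For reference, the two kernels can be written directly from their definitions. The following plain-Python sketch uses hypothetical function names and list-based vectors; it is meant only to make the role of γ and coef_0 explicit and is not how LIBSVM implements them internally.

```python
import math


def rbf_kernel(u, v, gamma):
    """K(u, v) = exp(-gamma * ||u - v||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)


def sigmoid_kernel(u, v, gamma, coef0):
    """K(u, v) = tanh(gamma * <u, v> + coef0)."""
    dot = sum(a * b for a, b in zip(u, v))
    return math.tanh(gamma * dot + coef0)


u, v = [1.0, 2.0], [2.0, 0.0]
print(rbf_kernel(u, v, gamma=0.5))
print(sigmoid_kernel(u, v, gamma=0.5, coef0=0.0))
```

The RBF kernel equals 1 when the two points coincide and decays towards 0 as they move apart, with γ controlling how fast; the sigmoid kernel adds coef_0 as a third hyperparameter to tune.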

It follows that the hyperparameters considered are (C, γ) for the RBF kernel and (C, γ, coef_0) for the sigmoid kernel. The parameter ranges are: $\log_2 C \in [-5, \ldots, 15]$, $\log_2 \gamma \in [-10, \ldots, 4]$, $\mathrm{coef}_0 \in [0, 1]$. Following the values suggested in [5], the NSGA-II parameters are set as follows: p_c = 0.9, p_m = 0.1, ν_c = 10, ν_m = 20. No effort has been spent in this work on tuning these parameters, which would likely improve the efficiency of the algorithm. A population size of 60 individuals is used and each simulation is carried out for a total of 250 generations. Each plot shows the Pareto fronts (trade-off curves) of all the points (SVM models) sampled by the algorithm after the first 50 generations. As described later, 50 iterations are enough to converge towards the final approximated Pareto front. SVMs are constructed using the LIBSVM library [1], version 2.84 (http://www.csie.ntu.edu.tw/~cjlin/libsvm). Figure 2 shows the interaction between NSGA-II and the LIBSVM library.

Fig. 3. Diabetes dataset: Pareto front of the sampled points using RBF (a) and sigmoid (b) kernel; mean evolution of the population for the error and the number of SVs during the optimization of NSGA-II using RBF kernel (c) and sigmoid (d) kernel.
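The encoding of an individual and the evaluation of the two objectives of Section 4.1 can be sketched as follows. This is an illustration under stated assumptions, not the code used for the experiments: it assumes the lower bounds of the exponent ranges are negative (log2 C from -5, log2 γ from -10), and `train_and_test` is a hypothetical callable standing in for a LIBSVM training run that returns the validation error and the number of support vectors for one fold.

```python
import random

# Bounds of the decision variables in the encoded (log2) space.
BOUNDS = {"log2_C": (-5.0, 15.0), "log2_gamma": (-10.0, 4.0), "coef0": (0.0, 1.0)}


def decode(individual, kernel):
    """Map an encoded decision vector to SVM hyperparameters."""
    params = {"C": 2.0 ** individual[0], "gamma": 2.0 ** individual[1]}
    if kernel == "sigmoid":
        params["coef0"] = individual[2]
    return params


def k_folds(n, k, rng):
    """Partition the indices 0..n-1 into k disjoint folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cv_objectives(train_and_test, n, params, k=5, rng=None):
    """Return (mean validation error, mean number of SVs) over k folds.

    `train_and_test(train_idx, val_idx, params)` is a hypothetical callable
    that trains on train_idx and returns (error, n_support_vectors).
    """
    rng = rng or random.Random(0)
    folds = k_folds(n, k, rng)
    errors, n_svs = [], []
    for i, val in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        err, n_sv = train_and_test(train, val, params)
        errors.append(err)
        n_svs.append(n_sv)
    return sum(errors) / k, sum(n_svs) / k


def stub(train_idx, val_idx, params):
    """Placeholder trainer (assumption): fixed error, SV count = train size."""
    return 0.2, len(train_idx)


params = decode([3.0, -4.0, 0.5], kernel="sigmoid")
print(params, cv_objectives(stub, n=768, params=params))
```

Searching in the log2 space keeps the bounds of the decision variables linear while the actual values of C and γ span several orders of magnitude.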

5.2 Discussion

Figures 3, 4, 5 and 6 show the results obtained using the experimental protocol defined above. Inspecting the results we observe, first of all, that approximate Pareto fronts are effectively obtained for each of the datasets, showing that the two objectives are in conflict. This is also evident from the analysis of the evolution curves: an improvement in one objective is nearly always accompanied by a worsening in the other, but the interaction during the evolution produces a global minimization of both objectives. The choice of the kernel clearly affects the final outcome of the optimization algorithm. In particular, the RBF kernel performs better than the sigmoid kernel. Inspecting the Pareto fronts obtained, we note that the RBF kernel yields a better distribution of solutions along the two objectives. This is an important factor in multi-objective optimization: we want Pareto fronts with a wide range of values so that the selection of a final point in the second step (decision making) is facilitated.

Fig. 4. Australian dataset: Pareto front of the sampled points using RBF (a) and sigmoid (b) kernel; mean evolution of the population for the error and the number of SVs during the optimization of NSGA-II using RBF kernel (c) and sigmoid (d) kernel.

For each dataset we also plot the mean evolution curves for the error and the number of support vectors for the population of SVM models at each iteration. Inspecting the plots we observe that the algorithm generally converges very quickly to a set of good SVM models (within the first 50 iterations). It then uses the rest of the time to explore the space of solutions locally for additional, finer refinement. If we compare the accuracy of the SVM models obtained using this method with other approaches in the literature, we find comparable results. For example, the best error obtained for the diabetes dataset with this approach is 21.7%, while the errors obtained by Keerthi in [8], Chapelle in [2] and Staelin in [11] are respectively 24.33%, 23.19% and 20.3%. Similarly, for the splice dataset we obtain an error of 12.4%, while the errors obtained by Keerthi in [8] and Staelin in [11] are respectively 10.16% and 11.7%.

Fig. 5. Splice dataset: Pareto front of the sampled points using RBF (a) and sigmoid (b) kernel; mean evolution of the population for the error and the number of SVs during the optimization of NSGA-II using RBF kernel (c) and sigmoid (d) kernel.

An important advantage of this approach is that, together with good models in terms of accuracy, the algorithm also generates many other models with different numbers of support vectors, which are relevant when the complexity of the

final model is an important factor for the final model selection. For example, in the case of the splice dataset, we could be happy to lose some degree of accuracy and select a solution with an error of 14% instead of 12%, in favor of a model that has a much lower complexity, 370 SVs instead of 570 (see Figure 5).

Fig. 6. German dataset: Pareto front of the sampled points using RBF (a) and sigmoid (b) kernel; mean evolution of the population for the error and the number of SVs during the optimization of NSGA-II using RBF kernel (c) and sigmoid (d) kernel.

6 Conclusions and possible future investigations

The SVM model selection problem clearly presents the characteristics of a multi-objective optimization problem. The results in this experimental work have shown that it is possible to effectively obtain approximated Pareto fronts of SVM models based on a simple 2-objective formulation in which the accuracy and the complexity of the model are compared for Pareto dominance. This approach makes it possible to visualize the characteristic trade-off curve for a specific dataset, from which the user can select a specific model according to their own preferences and computational needs.

The proposed method also obtains results comparable to other approaches in the literature, with the advantage that a set of Pareto-optimal solutions (not a single one) is generated as output. Of course a deeper investigation is required, and many different lines of investigation can be considered: extending the formulation from 2 objectives to k objectives (k > 2), including other important model selection criteria (for example, the number of input features); studying the performance of the proposed approach on the regression case; adapting the approach to the multi-class classification case, where it is harder to choose appropriate values for the base binary models of a decomposition scheme.

References

1. Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
2. Olivier Chapelle, Vladimir Vapnik, Olivier Bousquet, and Sayan Mukherjee. Choosing multiple parameters for support vector machines. Machine Learning, 46(1-3):131-159, 2002.
3. Coello Coello and G. B. Lamont. Applications of Multi-Objective Evolutionary Algorithms. World Scientific, 2004.
4. Kalyanmoy Deb. Multi-Objective Optimization Using Evolutionary Algorithms. John Wiley & Sons, Inc., New York, NY, USA, 2001.
5. Kalyanmoy Deb, Samir Agrawal, Amrit Pratap, and T. Meyarivan. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans. Evolutionary Computation, 6(2):182-197, 2002.
6. Tobias Glasmachers and Christian Igel. Gradient-based adaptation of general Gaussian kernels. Neural Comput., 17(10):2099-2105, 2005.
7. Christian Igel. Multi-objective model selection for support vector machines. Evolutionary Multi-Criterion Optimization, pages 534-546, 2005.
8. S.S. Keerthi. Efficient tuning of SVM hyperparameters using radius/margin bound and iterative algorithms. IEEE Transactions on Neural Networks, 13:1225-1229, 2002.
9. S. Pang and N. Kasabov. Inductive vs. transductive inference, global vs. local models: SVM, TSVM, and SVMT for gene expression classification problems. International Joint Conference on Neural Networks (IJCNN), 2:1197-1202, 2004.
10. S.Y.M. Shi, P.N. Suganthan, and K. Deb. Multi-class protein fold recognition using multi-objective evolutionary algorithms. IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology, pages 61-66, 2004.
11. Carl Staelin. Parameter selection for support vector machines. HP Labs Technical Reports, 2002.
12. Vladimir N. Vapnik. The nature of statistical learning theory. Springer-Verlag New York, Inc., New York, NY, USA, 1995.
13. Mihalis Yannakakis. Approximation of multiobjective optimization problems. Algorithms and Data Structures: 7th International Workshop, pages 1, 2001.