Efficient Tuning of SVM Hyperparameters Using Radius/Margin Bound and Iterative Algorithms


IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 13, NO. 5, SEPTEMBER 2002

S. Sathiya Keerthi

Manuscript received March 14, 2001; revised December 21, 2001 and January 10, 2002. The author is with the Department of Mechanical Engineering, National University of Singapore, Singapore 119260, Singapore (e-mail: mpessk@guppy.mpe.nus.edu.sg). Publisher Item Identifier S 1045-9227(02)05563-7.

Abstract: This paper discusses implementation issues related to the tuning of the hyperparameters of a support vector machine (SVM) with L2 soft margin, for which the radius/margin bound is taken as the index to be minimized and iterative techniques are employed for computing radius and margin. The implementation is shown to be feasible and efficient, even for large problems having more than 10 000 support vectors.

Index Terms: Hyperparameter tuning, support vector machines (SVMs).

I. INTRODUCTION

The basic problem addressed in this paper is the two-category classification problem. Let {(x_i, y_i)}, i = 1, ..., m, be a given set of training examples, where x_i is the ith input vector and y_i ∈ {1, -1} is the target value; y_i = 1 denotes that x_i is in class 1 and y_i = -1 denotes that x_i is in class 2. In this paper, we consider the support vector machine (SVM) problem formulation that uses the L2 soft margin, given by

  \min_{w, b, \xi} \ \frac{1}{2}\|w\|^2 + \frac{C}{2}\sum_i \xi_i^2
  \ \ \text{s.t.}\ \ y_i\,(w \cdot \phi(x_i) + b) \ge 1 - \xi_i \ \ \forall i.

This problem is usually converted (see [5] for details) to the SVM problem with hard margin given by

  \min_{\tilde{w}, b} \ \frac{1}{2}\|\tilde{w}\|^2
  \ \ \text{s.t.}\ \ y_i\,(\tilde{w} \cdot \tilde{z}_i + b) \ge 1 \ \ \forall i,   (1)

where z̃_i denotes the transformed vector corresponding to x_i in a (modified) feature space whose kernel is

  \tilde{k}(x_i, x_j) = k(x_i, x_j) + \frac{\delta_{ij}}{C}, \quad \delta_{ij} = 1 \ \text{if}\ i = j,\ 0 \ \text{otherwise},   (2)

and k is the kernel function. Popular choices for k are the

  \text{Gaussian kernel:}\quad k(x_i, x_j) = \exp\!\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)   (3a)
  \text{Polynomial kernel:}\quad k(x_i, x_j) = (1 + x_i \cdot x_j)^d.   (3b)

The solution of (1) is obtained by solving the dual problem

  \max_{\alpha} \ \sum_i \alpha_i - \frac{1}{2}\sum_i \sum_j \alpha_i \alpha_j y_i y_j \tilde{k}(x_i, x_j)
  \ \ \text{s.t.}\ \ \sum_i \alpha_i y_i = 0 \ \text{and}\ \alpha_i \ge 0 \ \ \forall i.   (4)

At optimality, the objective functions in (1) and (4) are equal.

Let θ denote the vector of hyperparameters (such as C and σ²) in a given SVM formulation. Tuning of θ is usually done by minimizing an estimate of the generalization error, such as the leave-one-out (LOO) error or the k-fold cross-validation error. It was shown by Vapnik and Chapelle [14] that the following bound holds:

  \text{LOO Error} \le \frac{4}{m}\,\tilde{R}^2\,\|\tilde{w}\|^2,   (5)

where w̃ is the solution of (1), R̃ is the radius of the smallest sphere that contains all the z̃_i vectors, and m is the number of training examples. R̃² can be obtained as the optimal objective function value of the following problem (see [10] and [13] for details):

  \max_{\beta} \ \sum_i \beta_i \tilde{k}(x_i, x_i) - \sum_i \sum_j \beta_i \beta_j \tilde{k}(x_i, x_j)
  \ \ \text{s.t.}\ \ \sum_i \beta_i = 1 \ \text{and}\ \beta_i \ge 0 \ \ \forall i.   (6)

The right-hand side of (5), f = (4/m) R̃² ‖w̃‖², is usually referred to as the radius/margin bound. Note that both ‖w̃‖² and R̃² depend on θ and, hence, f is also a function of θ.

The first experiments on using the radius/margin bound for model selection were done by Schölkopf et al. [10]; see also [1]. Recently, Chapelle et al. [2] used matrix-based quadratic programming solvers for (1) and (6) to successfully demonstrate the usefulness of f for tuning hyperparameters. Since it is difficult, even for medium-size problems with a few thousand examples, to load the entire kernel matrix of k̃(x_i, x_j) values into computer memory and do matrix operations on it, conventional finitely terminating quadratic programming solvers are not very suitable for solving (4) and (6). Hence, specially designed iterative algorithms [5], [6], [8], [11] that are asymptotically converging are popular for solving (4) and (6). The use of these algorithms allows the easy tuning of hyperparameters in large-scale problems. The main aim of this paper is to discuss the implementation issues associated with this, and to use the resulting implementation to study the usefulness of the radius/margin bound on several benchmark problems.
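To make the quantities above concrete, the short sketch below builds the modified Gaussian kernel matrix of (2) and (3a) and evaluates ‖w̃‖², R̃², and the bound f of (5), assuming that a dual solution α of (4) and a solution β of (6) have already been obtained from some iterative solver. This is only an illustration; the function names are placeholders and are not part of the implementation described in the paper.

```python
# Illustrative sketch (not the author's code): evaluating the radius/margin
# quantities of (2)-(6) once dual solutions alpha (of (4)) and beta (of (6))
# are available from some iterative solver.
import numpy as np

def gaussian_kernel_matrix(X, sigma2):
    # k(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 sigma^2)), see (3a)
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-d2 / (2.0 * sigma2))

def modified_kernel_matrix(X, C, sigma2):
    # k_tilde(x_i, x_j) = k(x_i, x_j) + delta_ij / C, see (2)
    K = gaussian_kernel_matrix(X, sigma2)
    return K + np.eye(len(X)) / C

def w2_from_alpha(Kt, y, alpha):
    # ||w_tilde||^2 = sum_ij alpha_i alpha_j y_i y_j k_tilde(x_i, x_j)
    v = alpha * y
    return float(v @ Kt @ v)

def r2_from_beta(Kt, beta):
    # R_tilde^2 = sum_i beta_i k_tilde(x_i, x_i) - sum_ij beta_i beta_j k_tilde(x_i, x_j), see (6)
    return float(beta @ np.diag(Kt) - beta @ Kt @ beta)

def radius_margin_bound(Kt, y, alpha, beta):
    # f = (4/m) R_tilde^2 ||w_tilde||^2, the right-hand side of (5)
    m = len(y)
    return 4.0 / m * r2_from_beta(Kt, beta) * w2_from_alpha(Kt, y, alpha)
```

In practice, α would come from an SMO-type or nearest point solver applied to the k̃ matrix (the paper uses the algorithm of [5]) and β from a solver for (6); a toy solver for (6) is sketched at the end of Section II.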

It should be mentioned here that Cristianini et al. [4] carried out the first set of experiments using the radius/margin bound together with iterative SVM methods. However, their experiments were done on the hard margin problem without the C parameter and the threshold. To solve (4), they employed the kernel adatron algorithm, which is extremely easy to implement, but very slow. Further, they made no mention of the ease with which the gradient of the radius/margin bound with respect to the hyperparameters can be computed.

II. IMPLEMENTATION ISSUES

We will assume that f is differentiable with respect to C and σ².¹ To speed up the tuning, it is appropriate to use a gradient-based technique such as a quasi-Newton algorithm or a conjugate-gradient method to minimize f. Quasi-Newton algorithms are particularly suitable because they work well even when the function and gradient are not computed exactly. Conjugate-gradient methods, on the other hand, are known to be sensitive to such errors.

¹ The contour plots given later in Figs. 1 and 2 seem to indicate that this is a reasonable assumption.

A. Evaluation of f

We have employed the nearest point algorithm given in [5] for solving (4) and evaluating ‖w̃‖². The numerical experiments of that paper show that this algorithm is very efficient for solving the hard margin problem in (1) and (4). The sequential minimal optimization (SMO) algorithm [7], [6] is an excellent alternative. To determine R̃² via (6), the SMO algorithm discussed in [11, Sec. 4] is very suitable. This algorithm was modified along the lines outlined in [6] so that it runs very fast.

B. Evaluation of the Gradient of f

The computation of the gradient of f requires the gradients of ‖w̃‖² and R̃². Recently, Chapelle et al. gave a very useful result (see [2, Lemma 2]) which makes these gradient computations extremely easy once (4) and (6) are solved. It is important to appreciate the usefulness of their result, particularly from the viewpoint of this paper, in which iterative, non-matrix-based techniques are used for solving (4) and (6). Clearly, ‖w̃‖² depends on α, and α, in turn, depends on C and σ². Yet, because α itself is computed via an optimization problem [i.e., (4)], it turns out that the gradient of α with respect to the hyperparameters does not enter into the computation of the gradient of ‖w̃‖². Since R̃² is also computed via an optimization problem [i.e., (6)], a similar result holds for R̃² and β.

Remark 1: The easiest way to appreciate the above result is to consider a function of the form f(θ) = min_x L(x, θ). Let x̄(θ) denote the solution of the minimization problem; then ∂L/∂x = 0 at x̄(θ). Now f(θ) = L(x̄(θ), θ). Hence

  \frac{df}{d\theta} = \frac{\partial L}{\partial x}\Big|_{\bar{x}(\theta)} \frac{d\bar{x}}{d\theta} + \frac{\partial L}{\partial \theta}\Big|_{\bar{x}(\theta)} = \frac{\partial L}{\partial \theta}\Big|_{\bar{x}(\theta)}.   (7)

Thus, the gradient of f with respect to θ can be obtained simply by differentiating L with respect to θ, as if x̄ did not depend on θ. The corresponding arguments for the constrained optimization problems in (4) and (6) are a bit more complicated (see [2] for details). Nevertheless, the above arguments, together with (4), should easily help one to appreciate the fact that the determination of the gradient of ‖w̃‖² with respect to θ does not require the gradient of α with respect to θ. In a similar way, by (6), the determination of the gradient of R̃² with respect to θ does not require the gradient of β. It is important to note that the determination of these gradients of α and β requires expensive matrix operations involving the kernel matrix. Hence, Chapelle et al.'s result concerning the avoidance of these gradients in the evaluation of the gradients of ‖w̃‖² and R̃² gives excellent support for the radius/margin criterion when iterative techniques are employed for solving (4) and (6).
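The identity (7) in Remark 1 is easy to check numerically on a toy problem. The sketch below uses a made-up function L (it has nothing to do with the SVM objectives; it is chosen only because its minimizer over x is available in closed form) and compares a finite-difference estimate of df/dθ with ∂L/∂θ evaluated at the minimizer x̄(θ).

```python
# Numerical check of Remark 1 / (7) on a toy problem (illustrative only).
# L(x, theta) = (x - sin(theta))^2 + theta * x, so x_bar(theta) = sin(theta) - theta/2
# and f(theta) = L(x_bar(theta), theta).
import numpy as np

def L(x, theta):
    return (x - np.sin(theta))**2 + theta * x

def x_bar(theta):
    return np.sin(theta) - theta / 2.0

def f(theta):
    return L(x_bar(theta), theta)

def dL_dtheta(x, theta):
    # Partial derivative of L with respect to theta at fixed x.
    return -2.0 * (x - np.sin(theta)) * np.cos(theta) + x

theta0, h = 0.7, 1e-6
fd = (f(theta0 + h) - f(theta0 - h)) / (2.0 * h)   # finite-difference df/dtheta
env = dL_dtheta(x_bar(theta0), theta0)             # partial L / partial theta at x_bar
print(fd, env)   # the two numbers agree up to finite-difference error, as (7) predicts
```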
For other criteria such as the LOO error, the k-fold CV error, or other approximate measures, such an easy evaluation of the gradient of the performance function with respect to the hyperparameters is ruled out. This issue is particularly important when a large number of hyperparameters other than C and σ² (such as input weighting parameters) are also considered for tuning, because when the number of optimization variables is large, gradient-based optimization methods are many times faster than methods which use function values only.

Remark 2: Since iterative algorithms for (4) and (6) converge only asymptotically, a termination criterion is usually employed to terminate them finitely. This termination criterion has to be chosen with care, for the following reason. Take, for example, the function f(θ) = min_x L(x, θ) mentioned in Remark 1. Suppose f is to be evaluated at some given θ. During the solution of min_x L(x, θ), we use a termination criterion and only obtain x̂(θ), which is an approximation of x̄(θ). Since ∂L/∂x is not exactly zero at x̂(θ), the last equality in (7) does not hold and, hence, the derivative of x̂ with respect to θ is needed to compute df/dθ. If the effect of this derivative is to be ignored, then it is important to ensure that the termination criterion used in the solution of min_x L(x, θ) is stringent enough to make ∂L/∂x at x̂(θ) sufficiently small. Unfortunately, it is not easy to come up with precise values of tolerance to do this. A simple approach that works well is to use reasonably small tolerances and, if gradient methods face failure, to decrease these tolerances further.

In the rest of this paper, we consider only the Gaussian kernel given by (3a) and take θ = (C, σ²). Application of Chapelle et al.'s [2] gradient calculations, using (7), yields the following expressions:

  \frac{\partial f}{\partial C} = \frac{4}{m}\left[\|\tilde{w}\|^2 \frac{\partial \tilde{R}^2}{\partial C} + \tilde{R}^2 \frac{\partial \|\tilde{w}\|^2}{\partial C}\right], \qquad
  \frac{\partial f}{\partial \sigma^2} = \frac{4}{m}\left[\|\tilde{w}\|^2 \frac{\partial \tilde{R}^2}{\partial \sigma^2} + \tilde{R}^2 \frac{\partial \|\tilde{w}\|^2}{\partial \sigma^2}\right].   (8)

The derivatives of ‖w̃‖² are given by

  \frac{\partial \|\tilde{w}\|^2}{\partial C} = \frac{1}{C^2}\sum_i \alpha_i^2, \qquad
  \frac{\partial \|\tilde{w}\|^2}{\partial \sigma^2} = -\frac{1}{2\sigma^4}\sum_i \sum_j \alpha_i \alpha_j y_i y_j\, k(x_i, x_j)\, \|x_i - x_j\|^2,   (9)

where α is the solution of (4).

The derivatives of R̃² are given by

  \frac{\partial \tilde{R}^2}{\partial C} = \frac{1}{C^2}\left[\sum_i \beta_i^2 - 1\right]   (10)

and

  \frac{\partial \tilde{R}^2}{\partial \sigma^2} = -\frac{1}{2\sigma^4}\sum_i \sum_j \beta_i \beta_j\, k(x_i, x_j)\, \|x_i - x_j\|^2,   (11)

where β is the solution of (6). Thus, the gradient of f is cheaply computed once f has been computed (since α, β, ‖w̃‖², and R̃² are all already available).

C. Variable Transformation

As suggested by Chapelle et al. [2], we use

  (\log C, \ \log \sigma^2)   (12)

as the variables for optimization instead of C and σ². This is a common transformation, usually suggested elsewhere in the literature too.
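The expressions (8) through (11), together with the variable transformation (12), translate directly into code. The sketch below is an illustration under the same assumptions as before: a Gaussian kernel of the form (3a), with dual solutions α of (4) and β of (6) supplied by some external solver; the function name is a placeholder. It returns f and its gradient with respect to (log C, log σ²) using the chain rule, i.e., ∂f/∂ log C = C ∂f/∂C and ∂f/∂ log σ² = σ² ∂f/∂σ².

```python
# Illustrative sketch: f and its gradient in (log C, log sigma^2), following (8)-(12).
# alpha solves (4) and beta solves (6) for the current (C, sigma2); both are inputs here.
import numpy as np

def f_and_gradient(X, y, alpha, beta, C, sigma2):
    m = len(y)
    sq = np.sum(X**2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T      # ||x_i - x_j||^2
    K = np.exp(-D2 / (2.0 * sigma2))                     # Gaussian kernel (3a)
    Kt = K + np.eye(m) / C                               # modified kernel (2)

    v = alpha * y
    w2 = v @ Kt @ v                                      # ||w_tilde||^2
    r2 = beta @ np.diag(Kt) - beta @ Kt @ beta           # R_tilde^2 from (6)
    f = 4.0 / m * r2 * w2                                # right-hand side of (5)

    # (9): derivatives of ||w_tilde||^2
    dw2_dC = np.sum(alpha**2) / C**2
    dw2_ds2 = -(v @ (K * D2) @ v) / (2.0 * sigma2**2)
    # (10), (11): derivatives of R_tilde^2
    dr2_dC = (np.sum(beta**2) - 1.0) / C**2
    dr2_ds2 = -(beta @ (K * D2) @ beta) / (2.0 * sigma2**2)
    # (8): product rule, then chain rule for the log variables of (12)
    df_dC = 4.0 / m * (w2 * dr2_dC + r2 * dw2_dC)
    df_ds2 = 4.0 / m * (w2 * dr2_ds2 + r2 * dw2_ds2)
    grad_log = np.array([C * df_dC, sigma2 * df_ds2])
    return f, grad_log
```

The (f, gradient) pair is exactly what a quasi-Newton routine such as the one in [12], or any off-the-shelf BFGS implementation, consumes at each outer iteration.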
D. Choosing Initial Values of C and σ²

Unless we have some good knowledge about the problem, it is not easy to choose good initial values for C and σ². We have experimented with two different pairs of initial conditions. The first pair is (13). Let R_x denote the radius of the smallest sphere in the input space that contains all the examples, i.e., the x_i's. The second pair of initial conditions is (14). In all the datasets tried in this paper, each component of the x_i's is normalized to lie between -1 and 1. Hence, for all the numerical experiments, we have simply used R_x² = n, where n is the dimension of x.² Detailed testing shows that (14) gives better results than (13). There was one dataset (Splice) for which (13) actually failed; see Fig. 2 for details.

² In the case of the Adult-7 dataset, each x has only 15 nonzero entries. Hence, n is set to 15 for that example.

E. Issues Associated With the Gradient Descent Algorithm

To minimize f, there are many choices of optimization methods. In this work, the Broyden-Fletcher-Goldfarb-Shanno (BFGS) quasi-Newton algorithm [12] has been used. A conjugate-gradient method was also tried, but it required many more f evaluations than the BFGS algorithm.³ Since each f evaluation is expensive [it requires the solution of (4) and (6)], the BFGS method was preferred.

³ As discussed at the beginning of Section II, this could be due to the sensitivity of the conjugate-gradient method to errors in the evaluation of f and its gradient.

Each optimization iteration involves the determination of a search direction d using the BFGS method. Then a line search is performed along that direction to look for a point that satisfies certain approximate conditions associated with the following problem:

  \min_{\lambda \ge 0} \ f(\theta + \lambda d).   (15)

Since the gradient of f is easily computed once f is obtained, it is effective to use a line search technique that uses both function values and gradients. The code in [12] employs such a technique. For the BFGS algorithm, λ = 1 is a natural choice to try as the first step size in each line search. This choice is so good that the line search usually attempts only one or two values of λ before successfully terminating an optimization iteration. Usually, the goodness of the choice λ = 1 is expected to hold strongly as the minimizer of f is approached, since the BFGS step then approaches a Newton root-finding step. However, this does not happen in our case, for the following reason. As the minimizer is approached, the gradient values are small, and the effect of the errors associated with the solution of (4) and (6) on the gradient evaluation becomes more important. Thus, the line search sometimes requires many evaluations of f in the end steps. In numerical experiments, it was observed that reaching the minimizer of f too closely is not important⁴ for arriving at good values of the hyperparameters. Hence, it is a good idea to terminate the line search (as well as the optimization process) if more than ten values of λ have been attempted in that line search.

⁴ This should not be confused with our stress, in Remark 2, on the accurate determination of f and its gradient by solving (4) and (6) accurately.

The optimization process generates a sequence of points in the space of hyperparameters. Successive points attempted by the process are usually located not so far from each other. It is important to turn this to advantage in the solution of (4) and (6). Thus, if α and β denote the solutions of (4) and (6) at some θ, and the optimization process next tries a new point, then α and β are used to obtain good starting points for the solution of (4) and (6) at the new point. This gives significant gains in computational time. Since the constraints in (6) do not depend on the hyperparameters, β can be directly carried over for the solution of (6) at the new point. For (4), we already said that the nearest point formulation in [5] is employed. Since the constraints in the nearest point formulation are also independent of the hyperparameters, carrying over the variables for the solution at the new point is easy for the nearest point algorithm too.

The choice of criterion for terminating the optimization process is also very important. As already mentioned, reaching the minimizer of f too closely is not crucial. Hence, the criterion used can be loose. The following choice has worked quite well. Suppose BFGS starts an optimization iteration at one point, then successfully completes a line search and reaches the next point. Optimization is terminated if the condition in (16) holds.

Typically, the complete optimization process uses only ten to 40 f evaluations.
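To give a flavor of how an asymptotically converging, warm-startable solver for (6) can look, here is a minimal Frank-Wolfe style sketch over the simplex. This is only an illustration, not the method used in the paper (the paper uses the SMO procedure of [11, Sec. 4] for (6) and the nearest point algorithm of [5] for (4)). The beta0 argument lets the β obtained at the previous hyperparameter point be carried over as the starting point at the new point, exactly as described above.

```python
# Illustrative Frank-Wolfe style solver for the R_tilde^2 problem (6):
#   max_beta  sum_i beta_i Kt[i,i] - beta^T Kt beta,  beta on the simplex.
# Not the SMO procedure used in the paper; shown only to illustrate warm starting.
import numpy as np

def solve_r2(Kt, beta0=None, tol=1e-6, max_iter=10000):
    m = Kt.shape[0]
    beta = np.full(m, 1.0 / m) if beta0 is None else beta0.copy()
    diag = np.diag(Kt).copy()
    for _ in range(max_iter):
        grad = diag - 2.0 * Kt @ beta            # gradient of the objective
        j = int(np.argmax(grad))
        d = -beta.copy()
        d[j] += 1.0                              # direction e_j - beta
        gap = grad @ d                           # Frank-Wolfe duality gap
        if gap <= tol:
            break
        curv = d @ Kt @ d
        step = 1.0 if curv <= 0 else min(1.0, gap / (2.0 * curv))
        beta += step * d                         # stays on the simplex
    r2 = beta @ diag - beta @ Kt @ beta
    return r2, beta

# Warm start across successive hyperparameter points: reuse the previous beta.
# r2, beta = solve_r2(Kt_old)
# r2_new, beta_new = solve_r2(Kt_new, beta0=beta)
```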

III. COMPUTATIONAL EXPERIMENTS

We have numerically tested the ideas⁵ on several benchmark datasets given in [9]. To test the usefulness of the code for solving large-scale problems, we have also tested it on the Adult-7 dataset in [7]. All computations were done on a Pentium 4 1.5-GHz machine running Windows. The Gaussian kernel was employed; thus, C and σ² formed the hyperparameters, and (14) was used for initializing them. For comparison, we also tuned C and σ² by five-fold cross validation. The search was done on a two-dimensional 11 × 11 grid in the (log C, log σ²) space. To use previous solutions effectively, the search on the grid was done along a spiral, outward from the central grid values of C and σ².

⁵ An experimental version of the code, running on a Matlab interface through the mex facility, is available from the author.

Some important quantities associated with the datasets and the performance are given in Table I. While the generalization performance of the five-fold and radius/margin methods is comparable, the radius/margin method is much faster. The speed-up achieved is expected to be even greater when there are more hyperparameters to be tuned.

[Table I: Performance of the code on the data sets considered. Here, n = number of input variables; m = number of training examples; m_test = number of test examples; n_f = number of f evaluations used by the radius/margin (RM) method (the number for the five-fold method is always 221); TestErr = percentage error on the test set; and m_sv = final number of support vectors for the radius/margin method.]

For a few datasets, Fig. 1 and the left-hand side of Fig. 2 show the sequence of points generated by the BFGS optimization method on plots in which contours of constant f are drawn for various values of f. In the case of the Splice and Banana datasets, for which the test sets are large, the right-hand-side plots of Fig. 2 show contours of test set error. These are given to point out how good the radius/margin criterion is.

[Fig. 1: Contour plots of equal f values for the Adult-7, Breast Cancer, Diabetis, and Flare-Solar datasets. Two marker types denote the points generated by the BFGS algorithm starting from the initial conditions in (13) and (14), respectively.]

[Fig. 2: The two plots on the left-hand side give radius/margin contour plots for the Splice and Banana datasets; two marker types denote the points generated by the BFGS algorithm using the initial conditions in (13) and (14), respectively. In the case of the Splice dataset, for initial condition (13), optimization was terminated after two f evaluations, since a very large C value was attempted at the third f evaluation point and the computing time required for that C became too large. The two plots on the right-hand side give contour plots of test set error; M denotes the location of the point of least test set error.]

A. Using the Approximation R̃² ≈ 1

When the Gaussian kernel function is used, the approximation R̃² ≈ 1 is sometimes tried to simplify computations, i.e., ‖w̃‖² alone is used as the tuning criterion. We did some experiments to check the usefulness of this approximation. For four datasets, Fig. 3 shows the variation of R̃²‖w̃‖², ‖w̃‖², and the test set error with respect to σ² for fixed values of C. It is clear that all three functions are quite well correlated and, hence, as far as the tuning of σ² is concerned, using ‖w̃‖² seems to be a good approximation to make. This agrees with the observation made by Cristianini et al. [4]. However, using ‖w̃‖² for tuning C is dangerous.
Note, using (9), that ‖w̃‖² is always increasing with C. Clearly, ‖w̃‖² alone is inadequate for the determination of C.

[Fig. 3: Variation of R̃²‖w̃‖², ‖w̃‖², and TestErr with respect to σ² for fixed C values. In each graph, the vertical axis is normalized differently for R̃²‖w̃‖², ‖w̃‖², and TestErr. This was done because, for tuning, the point of minimum of each function is important and not its actual value.]
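As an implementation aside on the five-fold comparison of Section III: visiting the 11 × 11 grid of (log C, log σ²) values along a spiral from the center outward keeps successive grid points adjacent, so the previous solutions can be reused. One way to generate such an ordering is sketched below; the grid values shown are hypothetical, since the paper does not list the grid range.

```python
# Illustrative spiral enumeration of an 11 x 11 hyperparameter grid, center outward,
# so that successive (log C, log sigma^2) points are neighbors and warm starts help.
import numpy as np

def spiral_order(n=11):
    c = n // 2
    i, j = c, c                                  # start at the central grid point
    order = [(i, j)]
    step, d = 1, 0
    moves = [(0, 1), (1, 0), (0, -1), (-1, 0)]   # right, down, left, up
    while len(order) < n * n:
        for _ in range(2):                       # each step length is used twice
            di, dj = moves[d % 4]
            for _ in range(step):
                i, j = i + di, j + dj
                if 0 <= i < n and 0 <= j < n:
                    order.append((i, j))
            d += 1
        step += 1
    return order

log2C = np.arange(-5, 6)       # hypothetical 11 grid values for log2 C
log2s2 = np.arange(-5, 6)      # hypothetical 11 grid values for log2 sigma^2
for i, j in spiral_order(11):
    C, sigma2 = 2.0 ** log2C[i], 2.0 ** log2s2[j]
    # ... run five-fold cross validation at (C, sigma2), warm-starting from the previous point
```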

IV. CONCLUSION

In this paper, we have discussed various implementation issues associated with the tuning of hyperparameters for the SVM L2 soft margin problem by minimizing the radius/margin criterion and employing iterative techniques for obtaining the radius and margin. The experiments indicate the usefulness of the radius/margin criterion and the associated implementation. The extension of the implementation to the simultaneous tuning of many other hyperparameters, such as those associated with feature selection, different cost values, etc., looks very possible. Our current research is focused on this direction.

REFERENCES

[1] C. J. C. Burges, "A tutorial on support vector machines for pattern recognition," Data Mining and Knowledge Discovery, vol. 2, no. 2, 1998.
[2] O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee, "Choosing kernel parameters for support vector machines," Machine Learning, pp. 131-159, 2002. [Online]. Available: http://www-connex.lip6.fr/~chapelle/
[3] C. Cortes and V. Vapnik, "Support vector networks," Machine Learning, vol. 20, pp. 273-297, 1995.
[4] N. Cristianini, C. Campbell, and J. Shawe-Taylor, "Dynamically adapting kernels in support vector machines," in Advances in Neural Information Processing Systems, 1999. [Online]. Available: http://lara.enm.bris.ac.uk/cig/pubs/1999/nips98.ps.gz
[5] S. S. Keerthi, S. K. Shevade, C. Bhattacharyya, and K. R. K. Murthy, "A fast iterative nearest point algorithm for support vector machine classifier design," IEEE Trans. Neural Networks, vol. 11, pp. 124-136, Jan. 2000.
[6] S. S. Keerthi, S. K. Shevade, C. Bhattacharyya, and K. R. K. Murthy, "Improvements to Platt's SMO algorithm for SVM classifier design," Neural Computation, vol. 13, no. 3, pp. 637-649, 2001.
[7] J. Platt, "Sequential minimal optimization," 1998. [Online]. Available: http://www.research.microsoft.com/~jplatt/smo.html
[8] J. Platt, "Fast training of support vector machines using sequential minimal optimization," in Advances in Kernel Methods: Support Vector Learning, B. Schölkopf, C. J. C. Burges, and A. J. Smola, Eds. Cambridge, MA: MIT Press, 1998.
[9] G. Rätsch, "Benchmark datasets," 1999. [Online]. Available: http://ida.first.gmd.de/~raetsch/data/benchmarks.htm
[10] B. Schölkopf, C. Burges, and V. Vapnik, "Extracting support data for a given task," presented at the 1st Int. Conf. on Knowledge Discovery and Data Mining, U. M. Fayyad and R. Uthurusamy, Eds., Menlo Park, CA, 1995.
[11] B. Schölkopf, J. C. Platt, J. Shawe-Taylor, and A. J. Smola, "Estimating the support of a high-dimensional distribution," Microsoft Research, Redmond, WA, Tech. Rep., 1999. [Online]. Available: http://www.kernel-machines.org/papers/oneclass-tr.ps.gz
[12] D. F. Shanno and K. H. Phua, "Minimization of unconstrained multivariate functions," ACM Trans. Math. Software, vol. 6, pp. 618-622, 1980.
[13] V. Vapnik, Statistical Learning Theory. New York: Wiley, 1998.
[14] V. Vapnik and O. Chapelle, "Bounds on error expectation for support vector machines," Neural Computation, vol. 12, no. 9, 2000.