K-Fold Cross Validation for Error Rate Estimate in Support Vector Machines


Davide Anguita 1, Alessandro Ghio 1, Sandro Ridella 1, and Dario Sterpi 2
1 Dept. of Biophysical and Electronic Engineering, University of Genova, Via Opera Pia 11A, I Genova, Italy, {Davide.Anguita, Alessandro.Ghio, Sandro.Ridella}@unige.it
2 Smartware & Data Mining s.r.l., Via Gabriele D'Annunzio 2/78, I Genova, Italy, Dario.Sterpi@smartwaredm.it

Abstract

In this paper, we review the k-Fold Cross Validation (KCV) technique, applied to the Support Vector Machine (SVM) classification algorithm. We compare several variations of the KCV technique: some of them are often used by practitioners but lack any theoretical justification, while others are less common but more rigorous in identifying a correct classifier. The latter make it possible to establish an upper bound on the error rate of the SVM, which is a way to guarantee, in a statistical sense, the reliability of the classifier and therefore turns out to be quite important in many real-world applications. Experimental results on well-known benchmarking datasets allow us to perform the comparison and support our claims.

Keywords: Model Selection, Support Vector Machine, k-fold Cross Validation

1. Introduction

The Support Vector Machine (SVM) is one of the state-of-the-art techniques for classification tasks: it belongs to the field of Artificial Neural Networks (ANNs) [1] but is characterized by the solid foundations of Statistical Learning Theory (SLT) [2]. Thanks to its very good performance in real-world Data Mining applications, it quickly became part of commercial Data Mining suites [3]. SVM learning is performed by finding a set of parameters (analogous to the weights of an ANN) through the solution of a Convex Constrained Quadratic Programming (CCQP) problem, for which many effective techniques have been developed [4]. This is a large improvement over traditional ANNs, which require the solution of a difficult non-linear optimization problem. The search for optimal parameters, however, does not complete the learning process: a set of additional variables (hyperparameters) must be tuned to reach the optimal classification performance, similarly to ANNs, where the hyperparameter is the number of hidden nodes. This tuning is not trivial and is an open research problem [3], [5], [6], [7]. The process of finding the best hyperparameters is usually called the model selection phase in the Machine Learning literature and is strictly linked to the evaluation of the generalization ability of the SVM or, in other words, the error rate attainable by the SVM on new (unseen) data. In fact, it is common practice to select the optimal SVM (i.e. the optimal hyperparameters) by choosing the one with the lowest generalization error. Obviously, the generalization error cannot be computed exactly, but SLT proposes several methods for obtaining a probabilistic upper bound for it: using this bound, it is possible to select the optimal SVM and also to assess the quality of the classification. The methods for performing the model selection phase can be divided into two categories [8], [5]. Theoretical methods, like the Vapnik-Chervonenkis (VC) bound [2] or the margin bound [7], provide deep insights into the classification algorithms but are often inapplicable, incomputable, or too loose to be of any practical use [9].
On the other hand, practitioners have found several procedures [3], [5] which work well in practice but do not offer any theoretical guarantee about the generalization error. Some of them rely on well-known statistical procedures, but the underlying hypotheses are not always satisfied or are only asymptotically valid. An example is the well-known Bootstrap resampling technique [10], [11]: we have to assume that the error distribution is Gaussian, which is not always the case and can cause an underestimation of the generalization error [12]. One of the most popular resampling techniques is the k-Fold Cross Validation (KCV) procedure [13], which is simple, effective and reliable [14], [15], [16]. We will show in this work that KCV is also theoretically sound, and we will describe exactly under what circumstances it provides a rigorous upper bound on the generalization error. The KCV technique consists in splitting a dataset into k independent subsets: in turn, all but one of these subsets are used to train a classifier, while the remaining one is used to evaluate the generalization error. After the training, it is possible to compute an upper bound of the generalization error for each one of the trained classifiers but, as we now have k different models, some questions arise: 1) How do these k models have to be used and/or combined for classifying new data? 2) Each model is trained on a subset of the original dataset, so it does not use all the available information.

Can they be used for retraining a classifier on the entire dataset? 3) If we retrain a classifier on the entire dataset, should we modify the hyperparameters accordingly, and how? Our target, in this work, is to answer the previous questions by analyzing and comparing techniques and heuristics for the KCV applied to the model selection task. The paper is organized as follows: Section 2 briefly describes the SVM algorithm. Section 3 details the KCV procedure and the rigorous generalization error upper bound, while in Section 4 some techniques for building the SVM solution after the KCV procedure are presented. Finally, in Section 5 we show some experimental results and comparisons of the above techniques, tested on several well-known benchmarking datasets.

2. The Support Vector Machine

Let us consider a dataset composed of l patterns {(x_1, y_1), ..., (x_l, y_l)}, where x_i \in R^n and y_i = \pm 1. The dimensionality of each pattern, n = \dim(x_i) \; \forall i, represents the number of features characterizing the dataset. The SVM learning phase consists in solving the following CCQP:

\min_{\alpha} \; \frac{1}{2} \alpha^T Q \alpha + r^T \alpha    (1)
subject to 0 \le \alpha_i \le C \;\; \forall i \in [1, ..., l], \quad y^T \alpha = 0,

where r_i = -1 \; \forall i, C is a hyperparameter that must be tuned, and Q is a symmetric positive semidefinite l \times l matrix with

q_{ij} = y_i y_j K(x_i, x_j),    (2)

where K(\cdot, \cdot) is a Mercer kernel function, which allows dealing also with nonlinear mappings of the data [17]. After solving problem (1), the feed-forward classification of a new pattern x at run time can be computed as

f(x) = \sum_{i=1}^{l} y_i \alpha_i K(x_i, x) + b,    (3)

and the class of the new pattern x is determined according to the sign of f(x). The bias term b can be calculated through the Karush-Kuhn-Tucker (KKT) conditions, which hold at optimality. The above description allows us to identify the first hyperparameter (C) that must be tuned for obtaining the optimal classification performance. In the case of the linear SVM this is the only one but, when dealing with nonlinear SVMs, the kernel function gives rise to at least another one: the width of the Gaussian (γ) or the order of the polynomial (p). In general, we will indicate with {C, ·} the set of hyperparameters to tune, where C is the regularization term, while · indicates the other possible hyperparameters. Some rule-of-thumb methods have been suggested for deriving the hyperparameters in a very simple and efficient way [3]. This approach allows selecting the hyperparameter values with a single pass through the dataset, but the accuracy of the resulting classifier is obviously not the best one. Another approach, which is the most used and effective (practical) procedure, is an exhaustive grid search over the hyperparameters: the CCQP problem is solved several times with different {C, ·} settings and the generalization error is estimated at each step. Finally, the optimal hyperparameters are chosen in correspondence with the minimum of the estimated generalization error. Obviously, both the coarseness of the grid and the size of the search space severely influence the quality of the solution and the amount of computation time needed by the search procedure, especially when the generalization error estimate performed at each step is particularly time consuming. In any case, given a set of fixed hyperparameter values, the use of KCV allows performing the generalization error estimate, as we show in the following section.
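As an illustration of the grid-search procedure described above, the following sketch uses scikit-learn (not the authors' original implementation) to select {C, γ} for a Gaussian-kernel SVM via k-fold cross validation; the synthetic data, grid ranges and k = 10 are assumptions chosen only for the example.

```python
# Illustrative sketch (not the paper's code): exhaustive grid search over
# the SVM hyperparameters {C, gamma} using k-fold cross validation.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# Toy data standing in for one of the benchmark datasets.
X, y = make_classification(n_samples=400, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

# Hyperparameter grid: ranges and step counts are example choices only.
param_grid = {
    "C": np.logspace(-1, 3, 9),
    "gamma": np.logspace(-4, 2, 9),
}

search = GridSearchCV(
    SVC(kernel="rbf"),                               # Gaussian kernel
    param_grid,
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0),
    scoring="accuracy",
)
search.fit(X, y)

nu_kcv = 1.0 - search.best_score_                    # average validation error rate
print("best hyperparameters:", search.best_params_)
print("KCV error estimate  :", nu_kcv)
```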
3. k-Fold Cross Validation (KCV) and the KCV guaranteed bound

We revise here the analysis performed in [18]. The KCV consists in dividing the training set into k parts, each one consisting of l/k samples: k-1 parts are used, in turn, as a training set and the remaining one is used as a validation set. The error attained by the trained SVM on the validation set can be reliably used for estimating \pi, the true generalization error, because the validation data have not been used for training the model. In fact, if we consider the error rate on the j-th validation set,

\nu_{VAL}^{(j)} = \frac{k}{l} \sum_{i=1}^{l/k} I(y_i, \hat{y}_i),    (4)

where l/k is the number of validation patterns, y_i is the true output for the i-th pattern, \hat{y}_i is the SVM output and

I(y_i, \hat{y}_i) = 0 if y_i = \hat{y}_i, 1 otherwise,    (5)

we can bound the generalization error by inverting the cumulative Binomial distribution [19]. For simplicity, we will make use here of the Azuma-Hoeffding inequality [20], which allows us to write the following bound in explicit form:

Pr\{ \pi \ge \nu_{VAL}^{(j)} + \varepsilon \} \le e^{-2 \varepsilon^2 l / k}.    (6)

By setting a user-defined confidence value \delta,

\delta = e^{-2 \varepsilon^2 l / k},    (7)

we obtain

\varepsilon = \sqrt{ \frac{-k \ln\delta}{2l} }    (8)

and, substituting in Eq. (6), we have

Pr\left\{ \pi \ge \nu_{VAL}^{(j)} + \sqrt{ \frac{-k \ln\delta}{2l} } \right\} \le \delta.    (9)

Therefore, for any of the k classifiers, the following bound holds with probability (1 - \delta):

\pi \le \nu_{VAL}^{(j)} + \sqrt{ \frac{-k \ln\delta}{2l} }.    (10)

We can always pick one trained SVM at random to classify a new point, and it is possible to show [21] that the performance of the model will be bounded by

\pi \le \nu_{KCV} + \sqrt{ \frac{-k \ln\delta}{2l} },    (11)

where

\nu_{KCV} = \frac{1}{k} \sum_{j=1}^{k} \nu_{VAL}^{(j)}.    (12)

Note that k influences not only the training times but also the trade-off between the stability of the error average and the size of the confidence term (i.e. the term under the square root). Common practice suggests k = 5 or k = 10 [22], which usually offer a good compromise.

Fig. 1: Typical trend of the generalization error bound as a function of the number of training patterns l, with \nu_{KCV} = 10%, \delta = 5% and k = 10. The dotted line indicates the \nu_{KCV} value.

Fig. 1 shows a typical trend of the above bound as a function of the number of training patterns: we stress again the fact that tighter bounds are possible (e.g. decreasing at rates between O(1/\sqrt{l}) and O(1/l)), but we omit them here for the sake of clarity.
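A minimal sketch of the guaranteed bound of Eq. (11), computing the confidence term from the Hoeffding-style expression as reconstructed above (not taken verbatim from the authors' code); the numeric values of ν_KCV, δ, k and l are example inputs only.

```python
# Illustrative sketch: KCV error estimate plus the guaranteed upper bound of
# Eq. (11), with the confidence term written as sqrt(-k*ln(delta)/(2*l)).
import math

def kcv_guaranteed_bound(nu_kcv, l, k, delta):
    """Upper bound on the generalization error, holding with probability 1 - delta."""
    epsilon = math.sqrt(-k * math.log(delta) / (2.0 * l))
    return nu_kcv + epsilon

if __name__ == "__main__":
    nu_kcv = 0.10    # average validation error over the k folds (example value)
    l, k = 1000, 10  # training-set size and number of folds (example values)
    delta = 0.05     # user-defined confidence
    print("pi <=", kcv_guaranteed_bound(nu_kcv, l, k, delta))
```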

4. KCV solution

One of the problems with the k-Fold Cross Validation approach lies in the fact that KCV finds k different SVMs, each one trained on a subset of the original training set made up of (k-1)l/k patterns. From a theoretical point of view, as remarked in the previous section, a rigorous technique for combining them consists in picking one SVM at random every time a new pattern must be classified: in this case, the run-time error rate is guaranteed to be upper bounded by Eq. (11) with the user-defined probability. The main drawback in this case is the waste of memory because, even if at each time only one SVM is used, all the k SVMs must be retained. We will refer to this approach in this paper as the Random SVM (RDM) technique. From a practical point of view, other solutions can be found: one possibility consists in averaging (in some way) the k solutions. An option is to build a new SVM by computing the average of the parameters of the k SVMs [5]: this heuristic results in a large memory saving, so it is worth exploring. We call this approach the Averaging (AVG) method. A third method is often used by practitioners: it consists in building a new SVM using the entire (original) dataset and the same optimal hyperparameters found through the KCV procedure. Even if this could appear to be the best solution, it is the least justified from a theoretical point of view. In fact, the KCV estimate is no longer valid for the retrained SVM, because it is different from the k original ones, and, at the same time, no patterns are left outside the training set that could be used for estimating the generalization error. Some works trying to study this case from a theoretical point of view have appeared in the literature: the idea is to exploit the stability of the algorithm, that is, its ability to perform roughly in the same way when trained on two datasets that differ only by a fraction of patterns. Unfortunately, the proposed methods are of limited practical use [2], [13]. In any case, as this technique is widely used, we will include it in our experiments as the Retraining (RET) technique. Some argue that a better way of retraining the final SVM on the entire dataset could be derived by noting that the hyperparameters are found on a subset of the original dataset, consisting of (k-1)l/k patterns.
Then, the hyperparameter must be adapted to the larger dataset by scaling it accordingly:

\frac{C'}{l} = \frac{C}{(k-1)l/k},    (13)

where C is the optimal hyperparameter obtained with the KCV and C' is the hyperparameter value to be used for retraining the SVM. The rationale behind Eq. (13) is that the hyperparameter value, normalized by the number of training patterns, has to be the same in both cases. Note that the other hyperparameters are not involved in this resizing, since they are strictly linked to the number of features n, which remains fixed, and not to the number of patterns. We will refer to this method as Retraining with Heuristic (RWH).
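To make the strategies of this section concrete, here is a small sketch (again with scikit-learn, assuming a Gaussian-kernel SVM) of the RDM, RET and RWH variants; the AVG combination, which requires averaging the dual parameters of the k machines, is omitted for brevity, and the C rescaling follows Eq. (13), i.e. C' = C k/(k-1).

```python
# Illustrative sketch (not the authors' code) of three ways to build the
# final classifier after a k-fold cross-validation run: RDM, RET and RWH.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

def fit_fold_svms(X, y, C, gamma, k=10, seed=0):
    """Train one SVM per fold on the k-1 training parts, as in the KCV."""
    svms = []
    cv = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed)
    for train_idx, _ in cv.split(X, y):
        svms.append(SVC(kernel="rbf", C=C, gamma=gamma).fit(X[train_idx], y[train_idx]))
    return svms

def predict_rdm(svms, X_new, rng=np.random.default_rng(0)):
    """RDM: pick one of the k stored SVMs at random for every new pattern."""
    return np.array([svms[rng.integers(len(svms))].predict(x.reshape(1, -1))[0]
                     for x in X_new])

def fit_ret(X, y, C, gamma):
    """RET: retrain on the whole dataset with the KCV-selected hyperparameters."""
    return SVC(kernel="rbf", C=C, gamma=gamma).fit(X, y)

def fit_rwh(X, y, C, gamma, k=10):
    """RWH: rescale C as in Eq. (13), i.e. C' = C * k / (k - 1), then retrain."""
    return SVC(kernel="rbf", C=C * k / (k - 1), gamma=gamma).fit(X, y)
```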

However, this adaptation of the hyperparameter C is in contrast with the theory of large-margin classifiers. In fact, the SVM problem of Eq. (1) is only a computationally simplified formulation of the following:

\min_{\alpha, \Gamma} \; \frac{1}{2\Gamma} \alpha^T Q \alpha + r^T \alpha + \frac{\Gamma}{2 w_O^2}    (14)
subject to 0 \le \alpha_i \le 1 \;\; \forall i \in [1, ..., l], \quad y^T \alpha = 0, \quad \Gamma \ge 0,

where w_O^2 is the maximum margin [2] and \Gamma = C only when the solutions of the two formulations coincide, independently of the number of samples. All the methods described above are summarized in Tab. 1.

5. Experimental Results

In order to test the methods described in the previous section, we perform several experiments using some well-known benchmarking datasets. In particular, the datasets used in our experiments are described in Tab. 2: they were introduced by G. Rätsch for the purpose of benchmarking machine learning algorithms [23]. For each replicate of Rätsch's datasets, the following experimental setup is applied:
- the data features are normalized to have zero mean and standard deviation equal to two, so that most of the values lie in the range [-1, +1];
- a Gaussian kernel is used;
- a model selection, using k-Fold Cross Validation with k = 10 [22], is performed. The optimal hyperparameters {C, ·} are found using an exhaustive grid search, where the range and the number of steps are shown in Tab. 3;
- all the k SVMs are stored for benchmarking the RDM technique;
- a new SVM, built by averaging the parameters of the k SVMs, is stored for benchmarking the AVG method;
- the best hyperparameters found with the KCV procedure are used for training a new SVM on the whole dataset, which is used for benchmarking the RET technique;
- finally, we train a new SVM using the rescaled hyperparameter C', computed according to Eq. (13), for benchmarking the RWH method.

Note that the KCV procedure is performed only on the training sets, so that the test sets are never used for finding the optimal hyperparameters. This guarantees that the test data are independent of the training and validation steps. To obtain a lower bound on the error rate attainable by an SVM for each dataset, we can select the hyperparameters that minimize the number of misclassifications on the test set. Obviously, this figure cannot be used for performance assessment purposes, but it acts as a reference value. We refer to these values as the Test Set (TS) error rate. Tab. 4 presents the average and the standard deviation of the error rate obtained on Rätsch's test sets with the different methods: the second column (TS) represents the error rate with the test set approach (i.e. the best achievable rate). Note that in one case (the Image dataset) the average error for AVG and RWH is lower than the TS rate, but this is due to statistical fluctuations, as these two results are characterized by higher standard deviations. Since it is not easy to compare the various methods, we compute a simple performance index, which describes the relative performance of each method with respect to the TS error rate:

\frac{\nu_i - \nu_{TS}}{\nu_{TS}},    (15)

where \nu_i is the error rate obtained with the i-th method on the test set, while \nu_{TS} is the reference value. The performance indexes are shown in Tab. 5; the last row indicates the number of times that a method results to be the best performer. From Tab. 5, it is possible to see that retraining the SVM with the KCV hyperparameters is the best performer in almost half of the cases. Moreover, Tab. 6 shows that the performance index for the RET method, averaged over all thirteen Rätsch datasets, is the lowest one.
Surprisingly, the only theoretically justified method (RDM) is the best one in only three cases, but shows an average performance similar to RET. The worst performing method turns out to be AVG. As the average error is not robust to outliers, we also report the same analysis using quartile values instead of mean and standard deviation. Tab. 7 shows the values of the performance index computed using the median of the misclassification percentage, and Tab. 8 reports the corresponding values averaged over all thirteen datasets: the RET method still results the best one, but the RDM method is now characterized by the second-best performance. Looking at the results, it is clear that: (1) the AVG technique, which is often used in practice, is the worst performing one, considering either the mean or the median of the error rate; (2) the RET method seems to be the most effective, even though, at the time of this writing, no theoretical result is available to justify its performance; (3) the RDM method is characterized by a good performance (similar to the RET method with respect to both mean and median values) and has the advantage that the error rate is guaranteed by Eq. (11) with a user-defined confidence value; (4) the rescaling of the C hyperparameter, despite being used by practitioners, appears to be useless for improving the error rate performance.
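A small helper for the relative performance index of Eq. (15), assuming error rates expressed as fractions; the numbers in the example are placeholders and are not values taken from the paper's tables.

```python
# Illustrative helper for the performance index of Eq. (15):
# (nu_i - nu_TS) / nu_TS, i.e. the relative degradation of method i with
# respect to the test-set reference error rate.
def performance_index(nu_method, nu_ts):
    return (nu_method - nu_ts) / nu_ts

if __name__ == "__main__":
    # Placeholder error rates for illustration only (NOT the paper's results).
    nu_ts = 0.10
    for name, nu in {"RDM": 0.11, "AVG": 0.13, "RET": 0.105, "RWH": 0.12}.items():
        print(name, round(performance_index(nu, nu_ts), 3))
```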

Table 1: Methods for building the final SVM after the KCV procedure.
- Random SVM (RDM): all k SVMs found during the KCV are saved; each time a new pattern must be classified, one SVM is picked at random.
- Average SVM (AVG): a new SVM is built by averaging the parameters of the k SVMs found during the KCV.
- Retraining SVM (RET): a new SVM is retrained on the whole original dataset, using the optimal hyperparameters found with the KCV.
- Retraining with Heuristic SVM (RWH): a new SVM is retrained on the whole original dataset, after rescaling the hyperparameters.

Table 2: The Rätsch datasets (columns: name, number of features, training samples, test samples, realizations). The thirteen datasets are: Banana, Breast-Cancer, Diabetis, Flare-Solar, German, Heart, Image, Ringnorm, Splice, Thyroid, Titanic, Twonorm, Waveform.

Table 3: Range and number of steps of the grid-search procedure for the hyperparameters C and γ.

6. Conclusions

We have reviewed in this work the k-Fold Cross Validation technique applied to the SVM classification algorithm. Our purpose is twofold: on one hand, we want to verify whether some techniques that are often used by practitioners can be justified from a theoretical point of view and perform as expected; on the other hand, we want to benchmark the only theoretically rigorous technique against common practice. The first point is not to be neglected: the theoretical justification of a method is not only of academic interest but is a way to guarantee (in a statistical sense) the reliability of the result, which is of paramount importance in many fields (e.g. law [24]). By analyzing the experimental results described in the previous section, we are now able to answer the three questions raised in the introduction: 1) the best ways to combine the k classifiers appear to be the retraining solution (RET) and the randomly chosen model (RDM). The former performs slightly better in practice, while the latter allows predicting the classifier error rate on unobserved data; 2) based on the previous comments, we can state that neither a retraining nor a combination of the k SVMs is really necessary. However, when memory is an issue, a retraining of the classifier (RET) could be performed, provided we are not interested in generalization error estimates; 3) the experimental results clearly show that rescaling the hyperparameters does not increase the classifier performance. Our analysis clearly shows that the final user has two main choices. The first alternative (RDM) guarantees the quality of the classifier, which is necessary in several fields such as, for example, legal practice [24]: the price to pay for this additional information is a slight reduction in classification performance. The second alternative (RET)

results in a better performing classifier, but at the expense of a violation of the theoretical assumptions. Practitioners can safely use this method in all the applications where a rigorous approach is not necessary. Our results support the need for additional research to fill this gap between theory and practice.

Table 4: Mean and standard deviation of the error rate on Rätsch's test sets for the TS reference and the RDM, AVG, RET and RWH methods.

Table 5: Performance indexes computed with the average error rate (best result for each dataset in bold face), with the last row indicating how many times each method is the best performer.

Table 6: Mean values, over the thirteen datasets, of the performance indexes of Tab. 5.

Table 7: Performance indexes computed with the median error rate (best results in bold face), with the last row indicating how many times each method is the best performer.

Table 8: Mean values, over the thirteen datasets, of the performance indexes of Tab. 7.

References

[1] B. Schoelkopf, K. K. Sung, C. Burges, F. Girosi, P. Niyogi, T. Poggio, V. Vapnik, Comparing Support Vector Machines with Gaussian Kernels to Radial Basis Function Classifiers, IEEE Trans. on Signal Processing, vol. 45.
[2] V. Vapnik, The Nature of Statistical Learning Theory, Springer.
[3] B. L. Milenova, J. S. Yarmus, M. M. Campos, SVM in Oracle Database 10g: Removing the barriers to widespread adoption of Support Vector Machines, Proc. of the 31st Int. Conf. on Very Large Data Bases.
[4] S. S. Keerthi, S. K. Shevade, C. Bhattacharyya, K. R. K. Murthy, Improvements to Platt's SMO algorithm for SVM classifier design, Neural Computation, vol. 13.
[5] D. Anguita, A. Boni, S. Ridella, F. Rivieccio, D. Sterpi, Theoretical and practical model selection methods for Support Vector classifiers, in Support Vector Machines: Theory and Applications, edited by L. Wang, Springer.
[6] B. Schoelkopf, A. Smola, Learning with Kernels, The MIT Press.
[7] J. Shawe-Taylor, N. Cristianini, Margin distribution and soft margin, in Advances in Large Margin Classifiers, edited by A. Smola, P. Bartlett, B. Schoelkopf, D. Schuurmans, The MIT Press.

[8] K. Duan, S. S. Keerthi, A. Poo, Evaluation of simple performance measures for tuning SVM parameters, Neurocomputing, vol. 51.
[9] C. J. C. Burges, A tutorial on Support Vector Machines for classification, Data Mining and Knowledge Discovery, vol. 2.
[10] B. Efron, R. Tibshirani, An Introduction to the Bootstrap, Chapman and Hall.
[11] B. Efron, R. Tibshirani, Improvements on Cross Validation: the 632+ bootstrap method, Journal of the American Statistical Association, vol. 92.
[12] D. Anguita, A. Boni, S. Ridella, Evaluating the generalization ability of Support Vector Machines through the Bootstrap, Neural Processing Letters, vol. 11.
[13] M. Anthony, S. B. Holden, Cross Validation for binary classification by real valued functions: theoretical analysis, Proc. of the 11th Conf. on Computational Learning Theory.
[14] M. Dumler, Microsoft SQL Server 2008 Product Overview, Microsoft Corporation.
[15] Cross Validation (Analysis Services - Data Mining), in Microsoft SQL Server 2008 Books Online, Microsoft Corporation. Available online.
[16] M. Kaariainen, Semi supervised model selection based on Cross Validation, Proc. of the IEEE Int. Joint Conf. on Neural Networks 2006, IJCNN 2006.
[17] C. Cortes, V. Vapnik, Support vector networks, Machine Learning, vol. 27.
[18] D. Anguita, S. Ridella, F. Rivieccio, K-Fold Generalization Capability Assessment for Support Vector Classifiers, Proc. of the IEEE Int. Joint Conf. on Neural Networks, IJCNN 2005.
[19] M. Kaariainen, J. Langford, A comparison of tight generalization error bounds, Proc. of the 22nd Int. Conf. on Machine Learning.
[20] K. Azuma, Weighted sums of certain dependent random variables, Tohoku Math. Journal, vol. 19.
[21] A. Blum, A. Kalai, J. Langford, Beating the Hold-Out: Bounds for K-fold and Progressive Cross-Validation, Computational Learning Theory.
[22] C.-W. Hsu, C.-C. Chang, C.-J. Lin, A practical guide to Support Vector classification, Technical report, Dept. of Computer Science, National Taiwan University.
[23] G. Raetsch, T. Onoda, K. R. Mueller, Soft margins for AdaBoost, Machine Learning, vol. 42.
[24] L. Roberge, S. B. Long, D. B. Burnham, Data Warehouses and Data Mining tools for the legal profession: using information technology to raise the standard of practice, Syracuse Law Review, vol. 52, 2002.
