Training Data Selection for Support Vector Machines


Jigang Wang, Predrag Neskovic, and Leon N Cooper
Institute for Brain and Neural Systems, Physics Department, Brown University, Providence RI 02912, USA
jigang@brown.edu, pedja@brown.edu, Leon_Cooper@brown.edu

Abstract. In recent years, support vector machines (SVMs) have become a popular tool for pattern recognition and machine learning. Training an SVM involves solving a constrained quadratic programming problem, which requires large memory and enormous amounts of training time for large-scale problems. In contrast, the SVM decision function is fully determined by a small subset of the training data, called support vectors. Therefore, it is desirable to remove from the training set the data that are irrelevant to the final decision function. In this paper we propose two new methods that select a subset of data for SVM training. Using real-world datasets, we compare the effectiveness of the proposed data selection strategies in terms of their ability to reduce the training set size while maintaining the generalization performance of the resulting SVM classifiers. Our experimental results show that a significant amount of training data can be removed by our proposed methods without degrading the performance of the resulting SVM classifiers.

1 Introduction

Support vector machines (SVMs), introduced by Vapnik and coworkers in the structural risk minimization (SRM) framework [1-3], have gained wide acceptance due to their solid statistical foundation and the good generalization performance demonstrated in a wide range of applications. Training an SVM involves solving a constrained quadratic programming (QP) problem, which requires large memory and takes enormous amounts of training time for large-scale applications [4]. On the other hand, the SVM decision function depends only on a small subset of the training data, called support vectors. Therefore, if one knew in advance which patterns correspond to the support vectors, the same solution could be obtained by solving a much smaller QP problem that involves only the support vectors. The problem is then how to select training examples that are likely to be support vectors. Recently, there has been considerable research on data selection for SVM training.

(This work is partially supported by ARO under grant W911NF. Jigang Wang is supported by a dissertation fellowship from Brown University.)

For example, Shin and Cho proposed a method that selects patterns near the decision boundary based on neighborhood properties [5]. In [6-8], k-means clustering is employed to select patterns from the training set. In [9], Zhang and King proposed a β-skeleton algorithm to identify support vectors. In [10], Abe and Inoue used the Mahalanobis distance to estimate boundary points. In the reduced SVM (RSVM) setting, Lee and Mangasarian chose a subset of training examples by random sampling [11]. In [12], it was shown that uniform random sampling is the optimal robust selection scheme in terms of several statistical criteria.

In this paper, we introduce two new data selection methods for SVM training. The first method selects training data based on a statistical confidence measure that we describe later. The second method uses the minimal distance from a training example to the training examples of a different class as a criterion to select patterns near the decision boundary. This method is motivated by the geometrical interpretation of SVMs based on the (reduced) convex hulls. To understand how effective these strategies are in terms of their ability to reduce the training set size while maintaining the generalization performance, we compare the results obtained by SVM classifiers trained with data selected by these two new methods, by random sampling, and by the data selection method based on the distance from a training example to the desired optimal separating hyperplane. Our comparative study shows that a significant amount of training data can be removed from the training set by our methods without degrading the performance of the resulting SVM classifier. We also find that, despite its simplicity, random sampling performs well and often provides results comparable to those obtained by the method based on the desired SVM outputs. Furthermore, we find that incorporating the class distribution information of the training set often improves the efficiency of the data selection methods.

The remainder of the paper is organized as follows. In Section 2, we give a brief overview of support vector machines for classification and the corresponding training problem. In Section 3, we present the two new methods that select subsets of training examples for training SVMs. In Section 4 we report experimental results on several real-world datasets. Concluding remarks are provided in Section 5.

2 Related Background

Given a set of training data $\{(x_1, y_1), \ldots, (x_n, y_n)\}$, where $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, 1\}$, support vector machines seek to construct an optimal separating hyperplane by solving the following quadratic optimization problem:

$$\min_{w,b} \; \frac{1}{2}\langle w, w\rangle + C\sum_{i=1}^{n}\xi_i \qquad (1)$$

subject to the constraints

$$y_i(\langle w, x_i\rangle + b) \ge 1 - \xi_i, \quad i = 1, \ldots, n, \qquad (2)$$

where $\xi_i \ge 0$ for $i = 1, \ldots, n$ are slack variables introduced to handle the non-separable case [2]. The constant $C > 0$ is a parameter that controls the trade-off between the separation margin and the number of training errors. Using the Lagrange multiplier method, one can easily obtain the following Wolfe dual of the primal quadratic programming problem:

$$\min_{\alpha} \; \frac{1}{2}\sum_{i,j=1}^{n}\alpha_i\alpha_j y_i y_j \langle x_i, x_j\rangle - \sum_{i=1}^{n}\alpha_i \qquad (3)$$

subject to

$$0 \le \alpha_i \le C, \quad i = 1, \ldots, n, \qquad \text{and} \qquad \sum_{i=1}^{n}\alpha_i y_i = 0. \qquad (4)$$

Solving the dual problem, one obtains the multipliers $\alpha_i$, $i = 1, \ldots, n$, which give $w$ as the expansion

$$w = \sum_{i=1}^{n}\alpha_i y_i x_i. \qquad (5)$$

According to the Karush-Kuhn-Tucker (KKT) optimality conditions, we have

$$\alpha_i = 0 \;\Rightarrow\; y_i(\langle w, x_i\rangle + b) \ge 1 \text{ and } \xi_i = 0,$$
$$0 < \alpha_i < C \;\Rightarrow\; y_i(\langle w, x_i\rangle + b) = 1 \text{ and } \xi_i = 0,$$
$$\alpha_i = C \;\Rightarrow\; y_i(\langle w, x_i\rangle + b) \le 1 \text{ and } \xi_i \ge 0.$$

Therefore, only the $\alpha_i$ that correspond to training examples $x_i$ lying either on the margin or inside the margin area are non-zero. All the remaining $\alpha_i$ are zero, and the corresponding training examples are irrelevant to the final solution. Knowing the normal vector $w$, the bias term $b$ can be determined from the KKT conditions $y_i(\langle w, x_i\rangle + b) = 1$ for $0 < \alpha_i < C$. This leads to the linear decision function $f(x) = \mathrm{sgn}\big(\sum_{i=1}^{n}\alpha_i y_i \langle x, x_i\rangle + b\big)$.

In practice, linear decision functions are generally not rich enough for pattern separation. To allow for more general decision surfaces, one can apply the kernel trick by replacing the inner products $\langle x_i, x_j\rangle$ in the dual problem with suitable kernel functions $k(x_i, x_j)$. Effectively, support vector machines implicitly map training vectors $x_i$ in $\mathbb{R}^d$ to feature vectors $\Phi(x_i)$ in some high-dimensional feature space $\mathbb{F}$ such that inner products in $\mathbb{F}$ are given by $\langle\Phi(x_i), \Phi(x_j)\rangle = k(x_i, x_j)$. Consequently, the optimal hyperplane in the feature space $\mathbb{F}$ represents a nonlinear decision function of the form

$$f(x) = \mathrm{sgn}\Big(\sum_{i=1}^{n}\alpha_i y_i k(x, x_i) + b\Big). \qquad (6)$$

To train an SVM classifier, one therefore needs to solve the dual quadratic programming problem (3) under the constraints (4). For a small training set, standard QP solvers, such as CPLEX, LOQO, MINOS and Matlab QP routines, can be readily used to obtain the solution.
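As a concrete illustration of the decision function in Eq. (6), the following minimal Python/NumPy sketch evaluates $f(x)$ with a Gaussian kernel, assuming the multipliers $\alpha_i$ and the bias $b$ have already been obtained by solving (3)-(4). The function names, the kernel width, and the toy multipliers in the usage example are illustrative assumptions, not values from the paper.

```python
import numpy as np

def gaussian_kernel(x, z, gamma=0.5):
    """Gaussian (RBF) kernel k(x, z) = exp(-gamma * ||x - z||^2)."""
    diff = x - z
    return np.exp(-gamma * np.dot(diff, diff))

def svm_decision(x, support_X, support_y, alphas, b, gamma=0.5):
    """Evaluate f(x) = sgn(sum_i alpha_i y_i k(x, x_i) + b), as in Eq. (6).

    Only examples with alpha_i > 0 (the support vectors) contribute,
    which is why the expansion can be restricted to them.
    """
    s = sum(a * y * gaussian_kernel(x, xi, gamma)
            for a, y, xi in zip(alphas, support_y, support_X))
    return np.sign(s + b)

# Toy usage with made-up multipliers (in practice they come from solving (3)-(4)).
if __name__ == "__main__":
    support_X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
    support_y = np.array([1, -1, 1])
    alphas = np.array([0.7, 1.0, 0.3])   # hypothetical dual solution
    b = -0.1                             # hypothetical bias term
    print(svm_decision(np.array([0.5, 0.9]), support_X, support_y, alphas, b))
```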

However, for a large training set, such solvers quickly become intractable because of the large memory requirements and the enormous amount of training time involved. To alleviate the problem, a number of solutions have been proposed that exploit the sparsity of the SVM solution and the KKT conditions. The first such solution, known as chunking [13], uses the fact that only the support vectors are relevant for the final solution. At each step, chunking solves a QP problem that consists of all non-zero Lagrange multipliers $\alpha_i$ from the last step and some of the $\alpha_i$ that violate the KKT conditions. The size of the QP problem varies but finally equals the number of non-zero Lagrange multipliers. At the last step, the entire set of non-zero Lagrange multipliers is identified and the QP problem is solved. Another solution, proposed in [14], solves the large QP problem by breaking it down into a series of smaller QP sub-problems. This decomposition method is justified by the observation that solving a sequence of QP sub-problems that always contain at least one training example violating the KKT conditions will eventually lead to the optimal solution. More recently, a method called sequential minimal optimization (SMO) was proposed by Platt [15], which approaches the problem by iteratively solving QP sub-problems of size two. The key idea is that a QP sub-problem of size two can be solved analytically without invoking a quadratic optimizer. This method has been reported to be several orders of magnitude faster than the classical chunking algorithm.

All of the above training methods make use of the whole training set. However, according to the KKT optimality conditions, the final separating hyperplane is fully determined by the support vectors. In many real-world applications, the number of support vectors is expected to be much smaller than the total number of training examples. Therefore, the speed of SVM training would be significantly improved if only the set of support vectors were used for training, and the solution would be exactly the same as if the whole training set were used. In theory, one has to solve the full QP problem in order to identify the support vectors. However, the support vectors are training examples that lie close to the decision boundary. Therefore, if there exists a computationally efficient way to find a small set of training data that with high probability contains the desired support vectors, the speed of SVM training can be improved without degrading the generalization performance. The size of the reduced training set may still be larger than the set of desired support vectors; however, as long as it is much smaller than the full training set, SVM training will be significantly faster, because most SVM training algorithms scale quadratically in the number of training examples on many problems [4]. In the next section, we propose two new data selection strategies that explore this possibility.

3 Training Data Selection for Support Vector Machines

3.1 Data Selection based on Confidence Measure

A good heuristic for identifying boundary points is the number of training examples contained in the largest sphere that can be centered at a training example without covering an example of a different class.

Centered at each training example $x_i$, let us draw a sphere that is as large as possible without covering a training example of a different class, and count the number of training examples that fall inside the sphere. We denote this number by $N(x_i)$. Obviously, the larger the number $N(x_i)$, the more training examples (of the same class as $x_i$) are scattered around $x_i$, the less likely $x_i$ is to be close to the decision boundary, and the less likely $x_i$ is to be a support vector. Hence, this number can be used as a criterion to decide which training examples should belong to the reduced training set. For each training example $x_i$, we compute the number $N(x_i)$, sort the training data according to the corresponding values of $N(x_i)$, and choose the subset of data with the smallest numbers $N(x_i)$ as the reduced training set. It can be shown that $N(x_i)$ is related to the statistical confidence that can be associated with the class label $y_i$ of the training example $x_i$. For this reason, we call this data selection scheme the confidence measure-based training set selection.

3.2 Data Selection based on Hausdorff Distance

Our second data selection strategy is based on the Hausdorff distance. In the separable case, it has been shown that the optimal SVM separating hyperplane is identical to the hyperplane that bisects the line segment connecting the two closest points of the convex hulls of the positive and of the negative training examples [16, 17]. The problem of finding the two closest points in the convex hulls can be formulated as

$$\min_{z_+, z_-} \; \|z_+ - z_-\|^2 \qquad (7)$$

subject to

$$z_+ = \sum_{i: y_i = 1}\alpha_i x_i \quad \text{and} \quad z_- = \sum_{i: y_i = -1}\alpha_i x_i, \qquad (8)$$

where the $\alpha_i \ge 0$ satisfy the constraints $\sum_{i: y_i = 1}\alpha_i = 1$ and $\sum_{i: y_i = -1}\alpha_i = 1$. Based on this geometrical interpretation, the support vectors are the vertices of each convex hull that are closest to the convex hull of the training examples from the opposite class. For the non-separable case, a similar result holds with the convex hulls replaced by the reduced convex hulls [16, 17]. Therefore, a good heuristic for determining whether a training example is likely to be a support vector is its distance to the convex hull of the training examples of the opposite class.

Computing the distance from a training example $x_i$ to the convex hull of the training examples of the opposite class involves solving a smaller quadratic programming problem. To simplify the computation, the distance from a training example to the closest training example of the opposite class can be used as an approximation. We denote this minimal distance by

$$d(x_i) = \min_{j: y_j \neq y_i} \|x_i - x_j\|, \qquad (9)$$

which is also the Hausdorff distance between the training example $x_i$ and the set of training examples that belong to a different class.
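The two selection criteria just described, $N(x_i)$ and $d(x_i)$, can both be computed from pairwise distances. The following is a minimal brute-force NumPy sketch (quadratic in the number of examples); the function names and the selection fraction in the usage comment are illustrative assumptions rather than details specified in the paper.

```python
import numpy as np

def selection_scores(X, y):
    """Compute N(x_i) and d(x_i) for every training example.

    d(x_i): distance from x_i to the nearest example of the opposite
            class, as in Eq. (9).
    N(x_i): number of training examples strictly inside the largest
            sphere centered at x_i that excludes all opposite-class
            examples, i.e. examples closer to x_i than d(x_i).
    Brute-force O(n^2) computation for clarity.
    """
    n = X.shape[0]
    # Pairwise Euclidean distances.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    d = np.empty(n)
    N = np.empty(n, dtype=int)
    for i in range(n):
        opposite = y != y[i]
        d[i] = dists[i, opposite].min()
        # Exclude x_i itself; count neighbors strictly inside the sphere.
        inside = dists[i] < d[i]
        inside[i] = False
        N[i] = inside.sum()
    return N, d

def select_reduced_set(scores, fraction):
    """Return indices of the `fraction` of examples with the smallest scores."""
    k = max(1, int(round(fraction * len(scores))))
    return np.argsort(scores)[:k]

# Example: keep the 30% of examples with the smallest N(x_i) (confidence-based)
# or the smallest d(x_i) (Hausdorff-based).
# N, d = selection_scores(X_train, y_train)
# idx_conf = select_reduced_set(N, 0.3)
# idx_haus = select_reduced_set(d, 0.3)
```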

To select a subset of training examples, we sort the training set according to $d(x_i)$ and select the examples with the smallest Hausdorff distances $d(x_i)$ as the reduced training set. This method will be referred to as the Hausdorff distance-based selection method.

3.3 Data Selection based on Random Sampling and Desired SVM Outputs

To study the effectiveness of the proposed data selection strategies, we compare them to two other strategies. One is random sampling and the other is a data selection strategy based on the distance from the training examples to the desired separating hyperplane. The random sampling strategy simply selects a small portion of the training data uniformly at random to form the reduced training set. This method is straightforward to implement and requires no extra computation.

The other data selection strategy we compare our methods to is implemented as follows. Given the training set and the parameter setting, we solve the full QP problem to obtain the desired separating hyperplane. Then, for each training example $x_i$, we compute its distance to the desired separating hyperplane as

$$f(x_i) = y_i\Big(\sum_{j=1}^{n}\alpha_j y_j k(x_i, x_j) + b\Big). \qquad (10)$$

Note that Eq. (10) takes the class information into account, so training examples that are misclassified by the desired separating hyperplane have negative distances. According to the KKT conditions, support vectors are training examples with relatively small values of the distance $f(x_i)$. We sort the training examples according to their distances to the separating hyperplane and select a subset of training examples with the smallest distances as the reduced training set. This strategy, although impractical because one needs to solve the full QP problem first, is ideal for comparison purposes, as the distance from a training example to the desired separating hyperplane provides the optimal criterion for selecting the support vectors.

4 Results and Discussion

In this section we report experimental results on several real-world datasets from the UCI Machine Learning Repository [18]. The SVM training algorithm was implemented based on the SMO method. For all datasets, Gaussian kernels were used, and the generalization error of the SVMs was estimated using 5-fold cross-validation. For each training set, according to the data selection method used, a portion of the training set (ranging from 10 to 100 percent) was selected as the reduced training set to train the SVM classifier. The error rate reported is the average error rate of the resulting SVM classifiers on the test sets over the 5 iterations. Due to space limits, only results on three datasets are presented.
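The two reference strategies of Section 3.3 can be sketched as follows, here using scikit-learn's SVC as a convenient stand-in for the authors' SMO-based implementation; the kernel parameters, the selection fraction, and the function names are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

def svm_output_ranking(X, y, C=1.0, gamma=0.5):
    """Rank training examples by the criterion of Eq. (10).

    Trains a full SVM first (which is what makes this selection strategy
    impractical in general), then returns y_i * f(x_i) for every example;
    misclassified examples receive negative values.
    """
    clf = SVC(kernel="rbf", C=C, gamma=gamma).fit(X, y)
    return y * clf.decision_function(X)

def random_subset(n, fraction, seed=0):
    """Uniform random sampling of a reduced training set (returns indices)."""
    rng = np.random.default_rng(seed)
    k = max(1, int(round(fraction * n)))
    return rng.choice(n, size=k, replace=False)

# Usage sketch: keep the 30% of examples closest to (or inside) the margin,
# or a 30% uniform random sample, then retrain on the reduced set.
# margins = svm_output_ranking(X_train, y_train)
# idx_svm = np.argsort(margins)[: int(0.3 * len(y_train))]
# idx_rnd = random_subset(len(y_train), 0.3)
# reduced_clf = SVC(kernel="rbf").fit(X_train[idx_svm], y_train[idx_svm])
```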

Note that when the data selection method is based on the desired SVM outputs, the SVM training procedure has to be run twice in each iteration. The first time, an SVM classifier is trained on the full training set to obtain the desired separating hyperplane. Then a portion of the training examples is selected to form the reduced training set based on their distances to the desired separating hyperplane (see Eq. (10)). The second time, an SVM classifier is trained on the reduced training set.

Given a training set and a particular data selection criterion, there are two ways to form the reduced training set. One can either select training examples regardless of which classes they belong to, or select training examples from each class separately while maintaining the class distribution. It was found in our experiments that selecting training examples from each class separately often improves the classification accuracy of the resulting SVM classifiers. Therefore, we only report results for this case; a short sketch of this per-class selection appears below, after the discussion of Table 1.

Table 1 shows the error rates of SVMs on the Wisconsin Breast Cancer dataset when trained with reduced training sets of various sizes selected by the four different data selection methods. This dataset consists of 683 examples from two classes (excluding the 16 examples with missing attribute values). Each example has 8 attributes. The size of the training set in each iteration is 547 and the size of the test set is 136. The average number of support vectors is 238.6, which is 43.62% of the training set size.

Table 1. Error rates of SVMs on the Breast Cancer dataset when trained with reduced training sets of various sizes
Percent   Confidence   Hausdorff   Random   SVM

From Table 1 one can see that a significant amount of data can be removed from the training set without degrading the performance of the resulting SVM classifier. When more than 10% of the training data is selected, the confidence-based data selection method outperforms the other two methods. Its performance is actually as good as that of the method based on the desired SVM outputs. The method based on the Hausdorff distance gives the worst results. When the data reduction rate is high, e.g., when less than 10 percent of the training data is selected, the results obtained by the Hausdorff distance-based method and random sampling are much better than those based on the confidence measure and the desired SVM outputs.
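A minimal sketch of the per-class selection mentioned above, assuming a precomputed criterion array (for example $N(x_i)$ from Section 3.1 or the distances of Eq. (10)); the function name and the fraction in the usage comment are illustrative.

```python
import numpy as np

def stratified_select(scores, y, fraction):
    """Select a reduced training set class by class.

    Within each class, the examples with the smallest criterion values
    (e.g. N(x_i), d(x_i), or y_i * f(x_i)) are kept, so the class
    proportions of the full training set are preserved.
    """
    keep = []
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)
        k = max(1, int(round(fraction * len(idx))))
        order = np.argsort(scores[idx])
        keep.extend(idx[order[:k]])
    return np.array(sorted(keep))

# Usage sketch: 20% of each class, ranked by the confidence criterion N(x_i).
# idx = stratified_select(N, y_train, 0.2)
# X_reduced, y_reduced = X_train[idx], y_train[idx]
```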

Table 2 shows the corresponding results obtained on the BUPA Liver dataset, which consists of 345 examples, each with 6 attributes. The sizes of the training and test sets in each iteration are 276 and 69, respectively. The average number of support vectors is 222.2, which is 80.51% of the size of the training sets. Interestingly, the method based on the desired SVM outputs has the worst overall results. When less than 80% of the data is selected for training, the Hausdorff distance-based method and random sampling have similar performance and outperform the methods based on the confidence measure and the desired SVM outputs.

Table 2. Results on the BUPA Liver dataset
Percent   Confidence   Hausdorff   Random   SVM

Table 3 provides the results on the Ionosphere dataset, which has a total of 351 examples, each with 34 attributes. The sizes of the training and test sets in each iteration are 281 and 70, respectively. The average number of support vectors is 159.8, which is 56.87% of the size of the training sets.

Table 3. Results on the Ionosphere dataset
Percent   Confidence   Hausdorff   Random   SVM

From Table 3 we see that the data selection method based on the desired SVM outputs gives the best results when more than 20% of the data is selected. When more than 50% of the data is selected, the results of the confidence-based method are very close to the best achievable results. However, when the reduction rate is high, the performance of random sampling is the best. The Hausdorff distance-based method has the worst overall results.

An interesting finding of the experiments is that the performance of the SVM classifiers deteriorates significantly when the reduction rate is high, e.g., when the size of the reduced training set is much smaller than the number of desired support vectors. This is especially true for the data selection strategies based on the desired SVM outputs and on the proposed heuristics. On the other hand, the effect is less pronounced for random sampling, as we have seen that random sampling usually has better relative performance at higher data reduction rates.

From a theoretical point of view, this is not surprising: when only a subset of the support vectors is chosen as the reduced training set, there is no guarantee that the solution of the reduced QP problem will remain the same. In fact, if the reduction rate is high and the criterion is based on the desired SVM outputs or on the proposed heuristics, the reduced training set is likely to be dominated by outliers, leading to worse classification performance. To overcome this problem, we can remove those training examples that lie far inside the margin area, since they are likely to be outliers. For the data selection strategy based on the desired SVM outputs, this means that we can discard the part of the training data that has extremely small values of the distance to the desired separating hyperplane (see Eq. (10)). For the methods based on the confidence measure and the Hausdorff distance, we can similarly discard the part of the training data that has extremely small values of $N(x_i)$ or of the Hausdorff distance.

In Table 4 we show the results of the proposed solution on the Breast Cancer dataset.

Table 4. Results on the Breast Cancer dataset
Percent   Confidence   Hausdorff   Random   SVM

Comparing Tables 1 and 4, it is easy to see that, when only a very small subset of the training data (compared to the number of desired support vectors) is selected for SVM training, removing training patterns that are extremely close to the decision boundary according to the confidence measure or according to the underlying SVM outputs significantly improves the performance of the resulting SVM classifiers. The effect is less obvious for the methods based on the Hausdorff distance and random sampling. Similar results have also been observed on other datasets but are not reported here due to space limits.
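The trimming step proposed above can be sketched as follows; the trim fraction used here is an illustrative assumption, not a value reported in the experiments.

```python
import numpy as np

def select_with_trimming(scores, fraction, trim_fraction=0.05):
    """Form a reduced training set while discarding probable outliers.

    Examples are ranked by a selection criterion (N(x_i), d(x_i), or the
    SVM output criterion of Eq. (10)); the `trim_fraction` of examples
    with the very smallest values is discarded as likely outliers, and
    the next-smallest examples are kept as the reduced training set.
    """
    n = len(scores)
    order = np.argsort(scores)
    n_trim = int(round(trim_fraction * n))
    n_keep = max(1, int(round(fraction * n)))
    return order[n_trim:n_trim + n_keep]

# Usage sketch: keep 10% of the data after discarding the 5% of examples
# that lie deepest inside the margin according to the chosen criterion.
# idx = select_with_trimming(margins, fraction=0.10, trim_fraction=0.05)
```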

5 Conclusion

In this paper we presented two new data selection methods for SVM training. To analyze their effectiveness in terms of their ability to reduce the training data while maintaining the generalization performance of the resulting SVM classifiers, we conducted a comparative study using several real-world datasets. More specifically, we compared the results obtained by these two new methods with the results of the simple random sampling scheme and with the results obtained by the selection method based on the desired SVM outputs. Through our experiments, several important observations have been made: (1) In many applications, significant data reduction can be achieved without degrading the performance of the SVM classifiers. For that purpose, the performance of the confidence measure-based selection method is often comparable to or better than the performance of the method based on the desired SVM outputs. (2) When the reduction rate is high, some of the training examples that are extremely close to the decision boundary have to be removed in order to maintain the generalization performance of the resulting SVM classifiers. (3) In spite of its simplicity, random sampling performs consistently well, especially when the reduction rate is high. However, at low reduction rates, random sampling performs noticeably worse than the confidence measure-based method. (4) When conducting training data selection, sampling training data from each class separately according to the class distribution often improves the performance of the resulting SVM classifiers.

By directly comparing the various data selection schemes with the scheme based on the desired SVM outputs, we are able to conclude that the confidence measure provides a criterion for training data selection that is almost as good as the optimal criterion based on the desired SVM outputs. At high reduction rates, removing training data that are likely to be outliers boosts the performance of the resulting SVM classifiers. Random sampling performs consistently well in our experiments, which is consistent with the results obtained by Syed et al. [19] and with the theoretical analysis of Huang and Lee [12]. The robustness of random sampling at high reduction rates suggests that, although an SVM classifier is fully determined by its support vectors, the generalization performance of an SVM is less reliant on the choice of training data than it might appear.

References

1. Boser, B. E., Guyon, I. M., Vapnik, V. N.: A training algorithm for optimal margin classifiers. In: Haussler, D. (ed.): Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory (1992)
2. Cortes, C., Vapnik, V. N.: Support vector networks. Machine Learning 20 (1995)
3. Vapnik, V. N.: Statistical Learning Theory. Wiley, New York, NY (1998)
4. Joachims, T.: Making large-scale SVM learning practical. In: Schölkopf, B., Burges, C. J. C., Smola, A. J. (eds.): Advances in Kernel Methods - Support Vector Learning. MIT Press, Cambridge, MA (1999)
5. Shin, H. J., Cho, S. Z.: Fast pattern selection for support vector classifiers. In: Proceedings of the 7th Pacific-Asia Conference on Knowledge Discovery and Data Mining. Lecture Notes in Artificial Intelligence (LNAI 2637) (2003)
6. Almeida, M. B., Braga, A. P., Braga, J. P.: SVM-KM: speeding SVMs learning with a priori cluster selection and k-means. In: Proceedings of the 6th Brazilian Symposium on Neural Networks (2000)
7. Zheng, S. F., Lu, X. F., Zheng, N. N., Xu, W. P.: Unsupervised clustering based reduced support vector machines. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 2 (2003)
8. Koggalage, R., Halgamuge, S.: Reducing the number of training samples for fast support vector machine classification. Neural Information Processing - Letters and Reviews 2(3) (2004)
9. Zhang, W., King, I.: Locating support vectors via β-skeleton technique. In: Proceedings of the International Conference on Neural Information Processing (ICONIP) (2002)
10. Abe, S., Inoue, T.: Fast training of support vector machines by extracting boundary data. In: Proceedings of the International Conference on Artificial Neural Networks (ICANN) (2001)
11. Lee, Y. J., Mangasarian, O. L.: RSVM: Reduced support vector machines. In: Proceedings of the First SIAM International Conference on Data Mining (2001)
12. Huang, S. Y., Lee, Y. J.: Reduced support vector machines: a statistical theory. Technical report, Institute of Statistical Science, Academia Sinica, Taiwan (2004)
13. Vapnik, V. N.: Estimation of Dependences Based on Empirical Data. Springer-Verlag, Berlin (1982)
14. Osuna, E., Freund, R., Girosi, F.: Support vector machines: training and applications. A.I. Memo, MIT A.I. Lab. (1996)
15. Platt, J.: Fast training of support vector machines using sequential minimal optimization. In: Schölkopf, B., Burges, C. J. C., Smola, A. J. (eds.): Advances in Kernel Methods - Support Vector Learning. MIT Press, Cambridge, MA (1999)
16. Bennett, K. P., Bredensteiner, E. J.: Duality and geometry in SVM classifiers. In: Proceedings of the 17th International Conference on Machine Learning (2000)
17. Crisp, D. J., Burges, C. J. C.: A geometric interpretation of ν-SVM classifiers. Advances in Neural Information Processing Systems 12 (1999)
18. Blake, C. L., Merz, C. J.: UCI Repository of machine learning databases. mlearn/mlrepository.html (1998)
19. Syed, N. A., Liu, H., Sung, K. K.: A study of support vectors on model independent example selection. In: Proceedings of the Workshop on Support Vector Machines at the International Joint Conference on Artificial Intelligence (1999)
