Feature Selection for Multi-Class Problems Using Support Vector Machines


Guo-Zheng Li, Jie Yang, Guo-Ping Liu, Li Xue
Institute of Image Processing & Pattern Recognition, Shanghai Jiao Tong University, Shanghai, China, 200030

Abstract. Since feature selection can remove irrelevant features and improve the performance of learning systems, it is a crucial step in machine learning. Feature selection methods using support vector machines have obtained satisfactory results, but previous works usually address binary classification and need auxiliary techniques to be extended to multi-class problems. In this paper, we propose a prediction risk based feature selection method using multi-class support vector machines. The performance of the proposed method is compared with that of the previous optimal brain damage based feature selection methods using binary support vector machines. Experiments on UCI data sets show that the prediction risk based method obtains better results than the previous methods using support vector machines on multi-class problems.

1 Introduction

Feature selection is one of the key topics in machine learning and related fields [1-3]; it can remove irrelevant and even noisy features and hence improve the quality of the data set and the performance of learning systems. In recent years many feature selection algorithms have been developed, but no single algorithm is suitable for all problems. Since neural computing makes no assumption about the distribution of the data, feature selection using neural computing can perform well and really improve the performance of neural learning machines whenever a training data set is available [4, 5].

Support vector machines (SVMs), proposed in the 1990s, have exhibited excellent performance in many applications and have become standard tools in neural computing [6, 7]. Compared with other neural computing methods such as multi-layer perceptrons trained by back propagation, SVMs realize the data-dependent principle of structural risk minimization, have better generalization ability, and can obtain the optimal solution [8]. Although SVMs are powerful algorithms, too many irrelevant features can reduce their performance, so feature selection methods for SVMs have been proposed [5, 3]. Weston et al. proposed to use a leave-one-out error bound as the selection criterion [9]; Guyon et al. used the second derivative of the objective function as the criterion [5].

Rakotomamonjy studied the zeroth order and first order forms of the above criterion and showed that the optimal brain damage measure used by Guyon et al. is better than the others [10]. It is worth noting that all of the above algorithms are based on binary classification SVMs. Since SVM classification algorithms are designed for binary problems, techniques such as one-against-one or one-against-all are needed to build multi-class SVMs. At the same time, the capability of feature selection methods that use binary SVMs is limited. Weston et al. evaluated features for multi-class problems by summing, over all binary SVMs of the multi-class classifier, the measures of the corresponding features [11]. In order to use the multi-class SVM itself as the learning machine for selecting features effectively, we propose to use the prediction risk based feature selection method [12].

The rest of this paper is arranged as follows. The prediction risk based feature selection method using multi-class SVMs is described in Section 2; Section 3 presents experiments on multi-class UCI data sets; and Section 4 gives some discussion.

2 Prediction risk based feature selection method

We use feature selection to improve the accuracy of multi-class support vector machines, which are first introduced in brief.

Support vector machines (SVMs), proposed in the 1990s, have become state-of-the-art methods in machine learning [8, 7] and have exhibited excellent performance in many applications such as digit recognition [13], text categorization [14], computer vision [15], biological data mining [5], and medical diagnosis [16]. In this paper, the 2-norm soft margin SVM [7] is used as the binary classifier; following statistical learning theory [8], it minimizes the training error together with the 2-norm of the slack variables. The dual objective function is

L = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{l} y_i y_j \alpha_i \alpha_j K(x_i, x_j) - \frac{1}{2C} \alpha^T \alpha,

where C is the parameter that controls the trade-off between the training error and the norm of the slack variables, \alpha is the vector of Lagrange multipliers, and K(x_i, x_j) is the kernel function [17] introduced into SVMs to handle nonlinear problems. The radial basis function (RBF) kernel is considered a superior choice [18]:

K(x, z) = \exp(-\|x - z\|^2 / \sigma^2),

where x and z are input examples and \sigma is the radius.

There are several methods to construct multi-class classifiers from binary SVMs; among them the one-against-one method is recommended [19]. If there are k classes in the data set, k(k-1)/2 binary SVMs are trained, one on each pair of class labels. In this work, the one-against-one method is used to build the multi-class classifier, and max-wins voting is used to predict the labels of test examples.
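
For concreteness, the following is a minimal sketch of such a multi-class RBF-kernel SVM. The paper does not name a software library, so scikit-learn is assumed here purely for illustration; note that its SVC implements the standard 1-norm soft margin C-SVC rather than the 2-norm variant described above.

```python
from sklearn.svm import SVC

C, sigma = 100.0, 0.5                       # the fixed parameters used later in Section 3.2
clf = SVC(C=C, kernel='rbf',
          gamma=1.0 / sigma**2,             # K(x, z) = exp(-||x - z||^2 / sigma^2)
          decision_function_shape='ovo')    # one-against-one: k(k-1)/2 binary SVMs
# clf.fit(X_train, y_train); clf.predict(X_test)   # prediction uses max-wins voting
```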

For more details of SVMs, please refer to http://www.kernel-machines.org.

2.1 The previous work

Several embedded feature selection methods using binary classification SVMs have been proposed. Guyon et al. proposed to use optimal brain damage as the selection criterion [5]. Furthermore, optimal brain damage was studied by Rakotomamonjy and shown to be better than the other measures proposed before [10].

Optimal brain damage (OBD), proposed by LeCun et al. [20], uses the change of the objective function as the selection criterion, defined as the second order term of the Taylor series of the objective function:

S_i = \frac{1}{2} \frac{\partial^2 L}{\partial w_i^2} (\Delta w_i)^2,

in which L is the objective function of the learning machine and w is the weight vector of the features. OBD has been used for feature selection in artificial neural networks and obtained satisfactory results [21]. With binary classification SVMs, OBD has performed well on gene analysis problems [5]. For binary classification SVMs, the OBD measure is defined [5] as

S_i = \frac{1}{2} \alpha^T K(x_k, x_h) \alpha - \frac{1}{2} \alpha^T K(x_k^{(-i)}, x_h^{(-i)}) \alpha,

where \alpha is the vector of Lagrange multipliers of the SVM and the superscript (-i) in K(x_k^{(-i)}, x_h^{(-i)}) means that component i has been removed. The feature with the smallest S_i is removed.

The methods proposed in the previous works are based on binary classification SVMs. To extend them to multi-class SVMs, we have to compute the measure for each individual binary SVM. One way is to compute, for each feature, the sum of the measures over the individual binary SVMs and remove the feature with the smallest sum. However, all these methods are based on the individual binary SVMs, not on the multi-class SVM itself, so we propose the prediction risk based feature selection method for multi-class problems, which uses the multi-class SVM directly.
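
The following sketch illustrates the OBD criterion for a binary RBF SVM and its sum-over-pairwise-SVMs extension to multi-class problems described above. It is an illustration under assumptions, not the authors' code: scikit-learn's SVC is used, its dual_coef_ attribute stores y_k alpha_k for the support vectors, and the function names are hypothetical.

```python
import numpy as np
from itertools import combinations
from sklearn.svm import SVC

def rbf_kernel(A, B, gamma):
    """K(x, z) = exp(-gamma * ||x - z||^2)."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def obd_scores_binary(X, y, C, gamma):
    """S_i = 1/2 a^T K a - 1/2 a^T K^(-i) a for one binary SVM (a = y * alpha)."""
    svm = SVC(C=C, kernel='rbf', gamma=gamma).fit(X, y)
    sv = X[svm.support_]                        # support vectors
    a = svm.dual_coef_.ravel()                  # y_k * alpha_k
    full = 0.5 * a @ rbf_kernel(sv, sv, gamma) @ a
    scores = np.empty(X.shape[1])
    for i in range(X.shape[1]):
        sv_i = np.delete(sv, i, axis=1)         # kernel recomputed without feature i
        scores[i] = full - 0.5 * a @ rbf_kernel(sv_i, sv_i, gamma) @ a
    return scores

def obd_scores_multiclass(X, y, C=100.0, sigma=0.5):
    """Multi-class extension: sum the per-feature OBD measures over all
    one-against-one binary SVMs; the feature with the smallest sum is removed."""
    gamma = 1.0 / sigma ** 2
    total = np.zeros(X.shape[1])
    for c1, c2 in combinations(np.unique(y), 2):
        mask = (y == c1) | (y == c2)
        total += obd_scores_binary(X[mask], y[mask], C, gamma)
    return total
```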

2.2 Prediction risk based feature selection method

The prediction risk based feature selection method, proposed by Moody et al. [12], evaluates a feature by the change in training error when that feature is replaced by its mean value:

S_i = ERR(\bar{x}^i) - ERR,

where ERR is the training error and ERR(\bar{x}^i) is the error on the training set with feature i replaced by its mean, defined as

ERR(\bar{x}^i) = \frac{1}{N} \sum_{j=1}^{N} \left( \tilde{y}(x_j^1, \ldots, \bar{x}^i, \ldots, x_j^M) \neq y_j \right),

in which M and N are the numbers of features and instances respectively, \bar{x}^i is the mean value of the ith feature, and \tilde{y}(\cdot) is the predicted label of the jth example with its ith feature replaced by that mean value. The feature with the smallest S_i is removed, because changing it causes the least error, which indicates it is the least important one. This measure was used for feature selection in regularized feedforward neural networks and obtained better results than other measures such as fuzzy gain and output sensitivity [22].

In order to remove features effectively, we use a sequential backward search algorithm [23], which removes one feature per step according to the measure. The algorithm used in this paper is named SVM-SBS in the following. The best feature subset is the one with the least test error on the test sample.

Algorithm SVM-SBS

Initialize the surviving feature subset u = [1, 2, ..., M], the discarded feature list r = [ ], and the test error list e = [ ]. The training sample x_r0 = [x_r^1, ..., x_r^i, ..., x_r^M]^T with target values y_r and the test sample x_s0 with target values y_s are input to SVM-SBS.

Step 1: Restrict the training sample to the surviving feature indices, x_r = x_r(:, u); in the first iteration, x_r = x_r0.
Step 2: Train the multi-class machine to obtain M-SVM(x_r, y_r).
Step 3: Test the model on the test sample, compute the classification error rate e_t = M-SVM(x_s0(:, u), y_s), and update the error list e = [e_t, e].
Step 4: Compute the selection criterion S_i for all i on the training sample using one of the evaluation methods of the two preceding subsections.
Step 5: Find the feature with the smallest selection criterion, h = arg min(S).
Step 6: Update the discarded feature list r = [u(h), r] and eliminate that feature, u = u(1 : h-1, h+1 : length(u)). If length(u) > 1, go to Step 1.
Step 7: Output the test error list e and the discarded feature list r.
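
A minimal Python sketch of SVM-SBS with the prediction risk criterion follows. It is an illustration only, assuming scikit-learn's SVC as the multi-class machine (which uses the 1-norm rather than the 2-norm soft margin); function names are hypothetical.

```python
import numpy as np
from sklearn.svm import SVC

def prediction_risk(svm, X_tr, y_tr):
    """S_i = ERR(feature i replaced by its mean) - ERR, computed on the training sample."""
    base_err = np.mean(svm.predict(X_tr) != y_tr)
    scores = np.empty(X_tr.shape[1])
    for i in range(X_tr.shape[1]):
        X_mod = X_tr.copy()
        X_mod[:, i] = X_tr[:, i].mean()       # replace feature i by its mean value
        scores[i] = np.mean(svm.predict(X_mod) != y_tr) - base_err
    return scores

def svm_sbs(X_tr, y_tr, X_te, y_te, C=100.0, sigma=0.5):
    """Sequential backward search: remove one feature per step by prediction risk."""
    gamma = 1.0 / sigma ** 2                  # K(x, z) = exp(-||x - z||^2 / sigma^2)
    surviving = list(range(X_tr.shape[1]))
    discarded, errors = [], []
    while len(surviving) > 1:
        svm = SVC(C=C, kernel='rbf', gamma=gamma).fit(X_tr[:, surviving], y_tr)
        errors.append(np.mean(svm.predict(X_te[:, surviving]) != y_te))
        scores = prediction_risk(svm, X_tr[:, surviving], y_tr)
        h = int(np.argmin(scores))            # least important surviving feature
        discarded.insert(0, surviving.pop(h))
    return errors, discarded
```

Here errors[t] is the test error after t features have been removed, so the best subset is read off from the position of the smallest entry in errors.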

3 Experiments on the UCI data sets

3.1 The UCI data sets used

In order to compare the different feature selection methods for multi-class problems using support vector machines, we use twelve multi-class data sets from the UCI repository [24]. The data sets selected for comparison are listed in Table 1. For all data sets, we first replace symbolic values with numerical values; then all attributes are transformed into the interval [-1, 1] by an affine transformation. Finally, the data set is split equally into two parts according to the number of instances of each class: one part is used as the training sample and the other as the test sample. This splitting is performed 100 times.

Table 1. Properties of the UCI data sets used for comparison

Data set        Instances   Attributes   Classes
all-bp          3772        29           3
all-hyper       3772        29           5
all-hypo        3772        29           4
backup          683         35           19
fisher          47          35           4
glass           214         9            6
lung            32          56           3
processed-cl.   303         13           5
processed-va    200         13           5
soybean-l       305         34           19
soybean-s       47          35           4
stepp-order     47          35           4

3.2 Experimental methods and results

In order to compare the two feature selection methods, we choose the same parameters C = 100 and \sigma = 0.5 for the SVMs on all data sets. Although these are not the optimal parameters, we consider them reasonable. The OBD based and prediction risk based feature selection methods are run on the data sets with the SVM-SBS algorithm: each evaluation method is applied within SVM-SBS to select features on the training data set, and the test error of each selected feature subset is computed on the corresponding test data set. The test error is defined as the classification error rate

ERR(x_s) = \frac{1}{N} \sum_{j=1}^{N} \left( \tilde{y}_j(x_{sj}) \neq y_j \right),

where N is the number of test instances and \tilde{y}_j is the predicted label of x_{sj}. This calculation is performed 100 times. Finally, the average error and its corresponding standard deviation are computed for each size of feature subset. The least average error of each data set and its corresponding standard deviation are listed in Table 2.
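
As a rough illustration of one of the 100 repetitions described above, the helper below (a hypothetical name, using scikit-learn utilities) scales the attributes to [-1, 1] and performs a class-stratified 50/50 split. Scaling the whole data set before splitting mirrors the paper's description; in modern practice the scaler would usually be fit on the training half only.

```python
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

def prepare_split(X, y, seed):
    """Scale attributes to [-1, 1] and split the data equally, stratified by class."""
    X_scaled = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X)
    return train_test_split(X_scaled, y, test_size=0.5,
                            stratify=y, random_state=seed)
```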

Since prediction risk based feature selection is an embedded method, the computation is efficient and is dominated by the training of the SVMs. The CPU time of one selection run of the SVM-SBS algorithm varies across data sets and is no more than one second on a computer with an Intel Pentium IV 1.2 GHz CPU and 512 MB of memory.

Table 2. Statistical results of the test error on the UCI data sets for the different feature selection methods

Data set        Prediction risk based method   OBD based method    All features
all-bp          0.0418 ± 0.0076                0.0408 ± 0.0087     0.0646 ± 0.0432
all-hyper       0.0305 ± 0.0366                0.0262 ± 0.0040     0.0355 ± 0.0206
all-hypo        0.0249 ± 0.0111                0.0394 ± 0.0165     0.0504 ± 0.0168
backup          0.1032 ± 0.0329                0.1457 ± 0.1600     0.2354 ± 0.0841
fisher          0.0013 ± 0.0077                0.0072 ± 0.0216     0.3344 ± 0.0753
glass           0.4019 ± 0.0924                0.4362 ± 0.0967     0.4362 ± 0.0967
lung            0.4580 ± 0.1169                0.4415 ± 0.1468     0.5987 ± 0.1217
processed-cl.   0.2334 ± 0.0375                0.2334 ± 0.0375     0.2334 ± 0.0375
processed-va    0.3537 ± 0.0556                0.3574 ± 0.0532     0.3574 ± 0.0532
soybean-l       0.1841 ± 0.0460                0.3384 ± 0.2639     0.4752 ± 0.1961
soybean-s       0.0257 ± 0.0219                0.0253 ± 0.0443     0.3597 ± 0.0728
stepp-order     0.0128 ± 0.0257                0.0195 ± 0.0456     0.3446 ± 0.0765
Average         0.1559 ± 0.0410                0.1759 ± 0.0749     0.2938 ± 0.0745

From Table 2 we can see that: 1) compared with the full feature set, both methods significantly reduce the classification error rate on all but one data set; 2) on seven of the twelve data sets the prediction risk based method obtains better results than the OBD based method, while on four data sets it performs worse; 3) on average, the prediction risk based method is about 2 percentage points better than the OBD based method in error and about 3 percentage points better in standard deviation.

A typical selection process, on the backup data set, is shown in Figure 1, which plots the average error and the corresponding standard deviation. From Figure 1 we can see that the test error first decreases and then increases as more features are eliminated.

4 Discussions

A prediction risk based feature selection method for multi-class problems using support vector machines has been proposed; it obtains better results than the optimal brain damage based feature selection method on twelve multi-class data sets from the UCI repository.

We think two factors may account for the better performance of the proposed method. First, the two feature selection methods are based on different measures, prediction risk and optimal brain damage; the former behaves more like a wrapper method [2], which can obtain the least error for the specific learning machine. Second, the prediction risk based method uses the whole multi-class support vector machine to evaluate the features, while the optimal brain damage method is based on the binary SVMs and needs auxiliary techniques to evaluate the features; perhaps a different auxiliary method could yield better results.

We also find that both feature selection methods greatly reduce the test error on the data sets used; the reduction is about 13 percentage points relative to the test error on the full feature set. This indicates that almost all of the data sets contain redundant or even noisy features, and that these features hurt the performance of the learning machine. Thus, feature selection should be performed on all of these data sets before learning.

Feature selection using support vector machines is a general method which makes no assumption about the data distribution. However, one or a few outlier examples may cause unexpected results, so outlier detection should be considered before feature selection is performed in real-world applications. In addition, how the proposed method performs compared with other feature selection methods, such as spectral clustering based and mutual information based methods, is still an open issue that needs a thorough investigation.

Acknowledgments

This work is financially supported by the Natural Science Foundation of China under grant number 50174038. Thanks also go to the anonymous reviewers for their valuable advice.

References

1. Dash, M., Liu, H.: Feature selection for classification. Intelligent Data Analysis 1 (1997) 131-156
2. Kohavi, R., John, G.H.: Wrappers for feature subset selection. Artificial Intelligence 97 (1997) 273-324
3. Guyon, I., Elisseeff, A.: An introduction to variable and feature selection. Journal of Machine Learning Research 3 (2003) 1157-1182
4. Reed, R.: Pruning algorithms - a survey. IEEE Transactions on Neural Networks 4 (1993) 740-747
5. Guyon, I., Weston, J., Barnhill, S., Vapnik, V.: Gene selection for cancer classification using support vector machines. Machine Learning 46 (2002) 389-422
6. Haykin, S.: Neural Networks: A Comprehensive Foundation. 2nd edn. Prentice Hall, New Jersey (1999)
7. Cristianini, N., Shawe-Taylor, J.: An Introduction to Support Vector Machines. Cambridge University Press, Cambridge (2000)
8. Vapnik, V.: Statistical Learning Theory. Wiley, New York (1998)

9. Weston, J., Mukherjee, S., Chapelle, O., Pontil, M., Poggio, T., Vapnik, V.: Feature selection for SVMs. In: Advances in Neural Information Processing Systems. Volume 13. (2001)
10. Rakotomamonjy, A.: Variable selection using SVM-based criteria. Journal of Machine Learning Research 3 (2003) 1357-1370
11. Weston, J., Elisseeff, A., Bakir, G., Sinz, F.: The Spider. http://www.kyb.tuebingen.mpg.de/bs/people/spider/index.html (2004)
12. Moody, J., Utans, J.: Principled architecture selection for neural networks: application to corporate bond rating prediction. In Moody, J.E., Hanson, S.J., Lippmann, R.P., eds.: Advances in Neural Information Processing Systems. Volume 4. Morgan Kaufmann (1992) 683-690
13. LeCun, Y., Jackel, L.D., Bottou, L., Brunot, A., Cortes, C., Denker, J.S., Drucker, H., Guyon, I., Müller, U.A., Säckinger, E., Simard, P., Vapnik, V.: Comparison of learning algorithms for handwritten digit recognition. In Fogelman-Soulié, F., Gallinari, P., eds.: Proceedings of ICANN'95, International Conference on Artificial Neural Networks. Volume II. (1995) 53-60
14. Joachims, T.: Text categorization with support vector machines. In: Proceedings of the European Conference on Machine Learning (ECML). (1998)
15. Pontil, M., Verri, A.: Object recognition with support vector machines. IEEE Transactions on PAMI 20 (1998) 637-646
16. El-Naqa, I., Yang, Y., Wernick, M.N., Galatsanos, N.P., Nishikawa, R.: Support vector machine learning for detection of microcalcifications in mammograms. In: Proceedings of the IEEE International Symposium on Biomedical Imaging. (2002) 201-204
17. Mercer, J.: Functions of positive and negative type and their connection with the theory of integral equations. Philos. Trans. Roy. Soc. London A 209 (1909) 415-446
18. Keerthi, S.S., Lin, C.J.: Asymptotic behaviors of support vector machines with Gaussian kernel. Neural Computation 15 (2003) 1667-1689
19. Hsu, C.W., Lin, C.J.: A comparison of methods for multi-class support vector machines. IEEE Transactions on Neural Networks 13 (2002) 415-425
20. LeCun, Y., Denker, J.S., Solla, S.A.: Optimal brain damage. In Touretzky, D., ed.: Advances in Neural Information Processing Systems. Morgan Kaufmann (1990) 598-605
21. Cibas, T., Soulié, F., Gallinari, P.: Variable selection with neural networks. Neurocomputing 12 (1996) 223-248
22. Verikas, A., Bacauskiene, M.: Feature selection with neural networks. Pattern Recognition Letters 23 (2002) 1323-1335
23. Marill, T., Green, D.M.: On the effectiveness of receptors in recognition systems. IEEE Transactions on Information Theory 9 (1963) 11-17
24. Blake, C., Keogh, E., Merz, C.J.: UCI repository of machine learning databases. Technical report, Department of Information and Computer Science, University of California, Irvine, CA (1998). http://www.ics.uci.edu/~mlearn/MLRepository.html

Fig. 1. The feature selection process of the embedded algorithms on the backup data set: average test error (left) and its standard deviation (right) versus the number of eliminated features, for the prediction risk based and optimal brain damage based methods.