Feature Selection for Multi-Class Problems Using Support Vector Machines


Guo-Zheng Li, Jie Yang, Guo-Ping Liu, Li Xue
Institute of Image Processing & Pattern Recognition, Shanghai Jiao Tong University, Shanghai, China, 200030

Abstract. Since feature selection can remove irrelevant features and improve the performance of learning systems, it is a crucial step in machine learning. Feature selection methods using support vector machines have obtained satisfactory results, but previous works usually address binary classification and need auxiliary techniques to be extended to multi-class problems. In this paper, we propose a prediction risk based feature selection method using multi-class support vector machines. The performance of the proposed method is compared with that of the previous optimal brain damage based feature selection methods using binary support vector machines. Experiments on UCI data sets show that the prediction risk based method obtains better results than the previous methods using support vector machines on multi-class problems.

1 Introduction

Feature selection is one of the key topics in machine learning and related fields [1-3]; it can remove irrelevant and even noisy features and hence improve the quality of the data set and the performance of learning systems. In recent years many feature selection algorithms have been developed, but no single algorithm is suitable for all problems. Since neural computing makes no assumption about the distribution of the data, feature selection using neural computing can perform well and really improve the performance of neural learning machines whenever a training data set is available [4, 5].

Support vector machines (SVMs), proposed in the 1990s, have exhibited excellent performance in many applications and have become standard tools in neural computing [6, 7]. Compared with other neural computing methods such as multi-layer perceptrons trained by back propagation, SVMs realize the data-dependent principle of structural risk minimization, have better generalization ability, and can obtain the optimal solution [8]. Although SVMs are powerful algorithms, too many irrelevant features can reduce their performance, so feature selection methods for SVMs have been proposed [5, 3]. Weston et al. proposed to use a leave-one-out error bound as the selection criterion [9]; Guyon et al. used the second derivative of the objective function as the criterion [5].

Rakotomamonjy studied the zeroth order and first order forms of the above criterion and showed that the optimal brain damage measure used by Guyon et al. is better than the others [10]. It is worth noting that all of the above algorithms are based on binary classification SVMs. Since SVM classification algorithms are designed for binary problems, techniques such as one-against-one or one-against-all are needed to build multi-class SVMs. At the same time, the capability of feature selection methods that use binary SVMs is limited. Weston et al. evaluated features for multi-class problems by summing, over all binary SVMs of the multi-class classifier, the measures of the corresponding features [11]. In order to use the multi-class SVM itself as the learning machine for selecting features effectively, we propose to use the prediction risk based feature selection method [12].

The rest of this paper is arranged as follows. The prediction risk based feature selection method using multi-class SVMs is described in Section 2; Section 3 presents experiments on multi-class UCI data sets; and Section 4 gives some discussion.

2 Prediction risk based feature selection method

We use feature selection to improve the accuracy of multi-class support vector machines, which are first introduced in brief.

Support vector machines (SVMs), proposed in the 1990s, have become state-of-the-art methods in machine learning [8, 7] and have exhibited excellent performance in many applications such as digit recognition [13], text categorization [14], computer vision [15], biological data mining [5], and medical diagnosis [16]. In this paper, the 2-norm soft margin SVM [7] is used as the binary classifier; following statistical learning theory [8], it minimizes the training error together with the 2-norm of the slack variables. The dual objective function is

L = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{l} y_i y_j \alpha_i \alpha_j K(x_i, x_j) - \frac{1}{2C} \alpha^T \alpha,

where C is the parameter that controls the trade-off between the training error and the norm of the slack variables, \alpha is the vector of Lagrange multipliers, and K(x_i, x_j) is the kernel function [17] introduced into SVMs to handle nonlinear problems. The radial basis function (RBF) kernel is considered a superior choice [18]:

K(x, z) = \exp(-\|x - z\|^2 / \sigma^2),

where x and z are input examples and \sigma is the radius.

There are several methods to construct multi-class classifiers from binary SVMs; among them the one-against-one method is recommended [19]. If there are k classes in the data set, k(k-1)/2 binary SVMs are trained, one on each pair of class labels. In this work, the one-against-one method is used to build the multi-class classifier, and max-wins voting is used to predict the labels of test examples.
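
For concreteness, the following is a minimal sketch of such a multi-class RBF-kernel SVM. The paper does not name a software library, so scikit-learn is assumed here purely for illustration; note that its SVC implements the standard 1-norm soft margin C-SVC rather than the 2-norm variant described above.

```python
from sklearn.svm import SVC

C, sigma = 100.0, 0.5                       # the fixed parameters used later in Section 3.2
clf = SVC(C=C, kernel='rbf',
          gamma=1.0 / sigma**2,             # K(x, z) = exp(-||x - z||^2 / sigma^2)
          decision_function_shape='ovo')    # one-against-one: k(k-1)/2 binary SVMs
# clf.fit(X_train, y_train); clf.predict(X_test)   # prediction uses max-wins voting
```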

For more details of SVMs, please refer to http://www.kernel-machines.org.

2.1 The previous work

Several embedded feature selection methods using binary classification SVMs have been proposed. Guyon et al. proposed to use optimal brain damage as the selection criterion [5]. Furthermore, optimal brain damage was studied by Rakotomamonjy and shown to be better than the other measures proposed before [10].

Optimal brain damage (OBD), proposed by LeCun et al. [20], uses the change of the objective function as the selection criterion, defined as the second order term of the Taylor series of the objective function:

S_i = \frac{1}{2} \frac{\partial^2 L}{\partial w_i^2} (\Delta w_i)^2,

in which L is the objective function of the learning machine and w is the weight vector of the features. OBD has been used for feature selection in artificial neural networks and obtained satisfactory results [21]. With binary classification SVMs, OBD has performed well on gene analysis problems [5]. For binary classification SVMs, the OBD measure is defined [5] as

S_i = \frac{1}{2} \alpha^T K(x_k, x_h) \alpha - \frac{1}{2} \alpha^T K(x_k^{(-i)}, x_h^{(-i)}) \alpha,

where \alpha is the vector of Lagrange multipliers of the SVM and the superscript (-i) in K(x_k^{(-i)}, x_h^{(-i)}) means that component i has been removed. The feature with the smallest S_i is removed.

The methods proposed in the previous works are based on binary classification SVMs. To extend them to multi-class SVMs, we have to compute the measure for each individual binary SVM. One way is to compute, for each feature, the sum of the measures over the individual binary SVMs and remove the feature with the smallest sum. However, all these methods are based on the individual binary SVMs, not on the multi-class SVM itself, so we propose the prediction risk based feature selection method for multi-class problems, which uses the multi-class SVM directly.
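
The following sketch illustrates the OBD criterion for a binary RBF SVM and its sum-over-pairwise-SVMs extension to multi-class problems described above. It is an illustration under assumptions, not the authors' code: scikit-learn's SVC is used, its dual_coef_ attribute stores y_k alpha_k for the support vectors, and the function names are hypothetical.

```python
import numpy as np
from itertools import combinations
from sklearn.svm import SVC

def rbf_kernel(A, B, gamma):
    """K(x, z) = exp(-gamma * ||x - z||^2)."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def obd_scores_binary(X, y, C, gamma):
    """S_i = 1/2 a^T K a - 1/2 a^T K^(-i) a for one binary SVM (a = y * alpha)."""
    svm = SVC(C=C, kernel='rbf', gamma=gamma).fit(X, y)
    sv = X[svm.support_]                        # support vectors
    a = svm.dual_coef_.ravel()                  # y_k * alpha_k
    full = 0.5 * a @ rbf_kernel(sv, sv, gamma) @ a
    scores = np.empty(X.shape[1])
    for i in range(X.shape[1]):
        sv_i = np.delete(sv, i, axis=1)         # kernel recomputed without feature i
        scores[i] = full - 0.5 * a @ rbf_kernel(sv_i, sv_i, gamma) @ a
    return scores

def obd_scores_multiclass(X, y, C=100.0, sigma=0.5):
    """Multi-class extension: sum the per-feature OBD measures over all
    one-against-one binary SVMs; the feature with the smallest sum is removed."""
    gamma = 1.0 / sigma ** 2
    total = np.zeros(X.shape[1])
    for c1, c2 in combinations(np.unique(y), 2):
        mask = (y == c1) | (y == c2)
        total += obd_scores_binary(X[mask], y[mask], C, gamma)
    return total
```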

2.2 Prediction risk based feature selection method

The prediction risk based feature selection method, proposed by Moody et al. [12], evaluates a feature by the change in training error when that feature is replaced by its mean value:

S_i = ERR(\bar{x}^i) - ERR,

where ERR is the training error and ERR(\bar{x}^i) is the error on the training set with feature i replaced by its mean, defined as

ERR(\bar{x}^i) = \frac{1}{N} \sum_{j=1}^{N} \left( \tilde{y}(x_j^1, \ldots, \bar{x}^i, \ldots, x_j^M) \neq y_j \right),

in which M and N are the numbers of features and instances respectively, \bar{x}^i is the mean value of the ith feature, and \tilde{y}(\cdot) is the predicted label of the jth example with its ith feature replaced by that mean value. The feature with the smallest S_i is removed, because changing it causes the least error, which indicates it is the least important one. This measure was used for feature selection in regularized feedforward neural networks and obtained better results than other measures such as fuzzy gain and output sensitivity [22].

In order to remove features effectively, we use a sequential backward search algorithm [23], which removes one feature per step according to the measure. The algorithm used in this paper is named SVM-SBS in the following. The best feature subset is the one with the least test error on the test sample.

Algorithm SVM-SBS

Initialize the surviving feature subset u = [1, 2, ..., M], the discarded feature list r = [ ], and the test error list e = [ ]. The training sample x_r0 = [x_r^1, ..., x_r^i, ..., x_r^M]^T with target values y_r and the test sample x_s0 with target values y_s are input to SVM-SBS.

Step 1: Restrict the training sample to the surviving feature indices, x_r = x_r(:, u); in the first iteration, x_r = x_r0.
Step 2: Train the multi-class machine to obtain M-SVM(x_r, y_r).
Step 3: Test the model on the test sample, compute the classification error rate e_t = M-SVM(x_s0(:, u), y_s), and update the error list e = [e_t, e].
Step 4: Compute the selection criterion S_i for all i on the training sample using one of the evaluation methods of the two preceding subsections.
Step 5: Find the feature with the smallest selection criterion, h = arg min(S).
Step 6: Update the discarded feature list r = [u(h), r] and eliminate that feature, u = u(1 : h-1, h+1 : length(u)). If length(u) > 1, go to Step 1.
Step 7: Output the test error list e and the discarded feature list r.
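
A minimal Python sketch of SVM-SBS with the prediction risk criterion follows. It is an illustration only, assuming scikit-learn's SVC as the multi-class machine (which uses the 1-norm rather than the 2-norm soft margin); function names are hypothetical.

```python
import numpy as np
from sklearn.svm import SVC

def prediction_risk(svm, X_tr, y_tr):
    """S_i = ERR(feature i replaced by its mean) - ERR, computed on the training sample."""
    base_err = np.mean(svm.predict(X_tr) != y_tr)
    scores = np.empty(X_tr.shape[1])
    for i in range(X_tr.shape[1]):
        X_mod = X_tr.copy()
        X_mod[:, i] = X_tr[:, i].mean()       # replace feature i by its mean value
        scores[i] = np.mean(svm.predict(X_mod) != y_tr) - base_err
    return scores

def svm_sbs(X_tr, y_tr, X_te, y_te, C=100.0, sigma=0.5):
    """Sequential backward search: remove one feature per step by prediction risk."""
    gamma = 1.0 / sigma ** 2                  # K(x, z) = exp(-||x - z||^2 / sigma^2)
    surviving = list(range(X_tr.shape[1]))
    discarded, errors = [], []
    while len(surviving) > 1:
        svm = SVC(C=C, kernel='rbf', gamma=gamma).fit(X_tr[:, surviving], y_tr)
        errors.append(np.mean(svm.predict(X_te[:, surviving]) != y_te))
        scores = prediction_risk(svm, X_tr[:, surviving], y_tr)
        h = int(np.argmin(scores))            # least important surviving feature
        discarded.insert(0, surviving.pop(h))
    return errors, discarded
```

Here errors[t] is the test error after t features have been removed, so the best subset is read off from the position of the smallest entry in errors.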

3 Experiments on the UCI data sets

3.1 The UCI data sets used

In order to compare the different feature selection methods for multi-class problems using support vector machines, we use twelve multi-class data sets from the UCI repository [24]. The data sets selected for comparison are listed in Table 1. For all data sets, we first replace symbolic values with numerical values; then all attributes are transformed into the interval [-1, 1] by an affine transformation. Finally, the data set is split equally into two parts according to the number of instances of each class: one part is used as the training sample and the other as the test sample. This splitting is performed 100 times.

Table 1. Properties of the UCI data sets used for comparison

Data set        Instances   Attributes   Classes
all-bp          3772        29           3
all-hyper       3772        29           5
all-hypo        3772        29           4
backup          683         35           19
fisher          47          35           4
glass           214         9            6
lung            32          56           3
processed-cl.   303         13           5
processed-va    200         13           5
soybean-l       305         34           19
soybean-s       47          35           4
stepp-order     47          35           4

3.2 Experimental methods and results

In order to compare the two feature selection methods, we choose the same parameters C = 100 and \sigma = 0.5 for the SVMs on all data sets. Although these are not the optimal parameters, we consider them reasonable. The OBD based and prediction risk based feature selection methods are run on the data sets with the SVM-SBS algorithm: each evaluation method is applied within SVM-SBS to select features on the training data set, and the test error of each selected feature subset is computed on the corresponding test data set. The test error is defined as the classification error rate

ERR(x_s) = \frac{1}{N} \sum_{j=1}^{N} \left( \tilde{y}_j(x_{sj}) \neq y_j \right),

where N is the number of test instances and \tilde{y}_j is the predicted label of x_{sj}. This calculation is performed 100 times. Finally, the average error and its corresponding standard deviation are computed for each size of feature subset. The least average error of each data set and its corresponding standard deviation are listed in Table 2.
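
As a rough illustration of one of the 100 repetitions described above, the helper below (a hypothetical name, using scikit-learn utilities) scales the attributes to [-1, 1] and performs a class-stratified 50/50 split. Scaling the whole data set before splitting mirrors the paper's description; in modern practice the scaler would usually be fit on the training half only.

```python
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

def prepare_split(X, y, seed):
    """Scale attributes to [-1, 1] and split the data equally, stratified by class."""
    X_scaled = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X)
    return train_test_split(X_scaled, y, test_size=0.5,
                            stratify=y, random_state=seed)
```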

Since prediction risk based feature selection is an embedded method, the computation is efficient and is dominated by the training of the SVMs. The CPU time of one selection run of the SVM-SBS algorithm varies across data sets and is no more than one second on a computer with an Intel Pentium IV 1.2 GHz CPU and 512 MB of memory.

Table 2. Statistical results of the test error on the UCI data sets for the different feature selection methods

Data set        Prediction risk based method   OBD based method    All features
all-bp          0.0418 ± 0.0076                0.0408 ± 0.0087     0.0646 ± 0.0432
all-hyper       0.0305 ± 0.0366                0.0262 ± 0.0040     0.0355 ± 0.0206
all-hypo        0.0249 ± 0.0111                0.0394 ± 0.0165     0.0504 ± 0.0168
backup          0.1032 ± 0.0329                0.1457 ± 0.1600     0.2354 ± 0.0841
fisher          0.0013 ± 0.0077                0.0072 ± 0.0216     0.3344 ± 0.0753
glass           0.4019 ± 0.0924                0.4362 ± 0.0967     0.4362 ± 0.0967
lung            0.4580 ± 0.1169                0.4415 ± 0.1468     0.5987 ± 0.1217
processed-cl.   0.2334 ± 0.0375                0.2334 ± 0.0375     0.2334 ± 0.0375
processed-va    0.3537 ± 0.0556                0.3574 ± 0.0532     0.3574 ± 0.0532
soybean-l       0.1841 ± 0.0460                0.3384 ± 0.2639     0.4752 ± 0.1961
soybean-s       0.0257 ± 0.0219                0.0253 ± 0.0443     0.3597 ± 0.0728
stepp-order     0.0128 ± 0.0257                0.0195 ± 0.0456     0.3446 ± 0.0765
Average         0.1559 ± 0.0410                0.1759 ± 0.0749     0.2938 ± 0.0745

From Table 2 we can see that: 1) compared with the full feature set, both methods significantly reduce the classification error rate on all but one data set; 2) on seven of the twelve data sets the prediction risk based method obtains better results than the OBD based method, while on four data sets it performs worse; 3) on average, the prediction risk based method is about 2 percentage points better than the OBD based method in error and about 3 percentage points better in standard deviation.

A typical selection process, on the backup data set, is shown in Figure 1, which plots the average error and the corresponding standard deviation. From Figure 1 we can see that the test error first decreases and then increases as more features are eliminated.

4 Discussions

A prediction risk based feature selection method for multi-class problems using support vector machines has been proposed; it obtains better results than the optimal brain damage based feature selection method on twelve multi-class data sets from the UCI repository.

We think two factors may account for the better performance of the proposed method. First, the two feature selection methods are based on different measures, prediction risk and optimal brain damage; the former behaves more like a wrapper method [2], which can obtain the least error for the specific learning machine. Second, the prediction risk based method uses the whole multi-class support vector machine to evaluate the features, while the optimal brain damage method is based on the binary SVMs and needs auxiliary techniques to evaluate the features; perhaps a different auxiliary method could yield better results.

We also find that both feature selection methods greatly reduce the test error on the data sets used; the reduction is about 13 percentage points relative to the test error on the full feature set. This indicates that almost all of the data sets contain redundant or even noisy features, and that these features hurt the performance of the learning machine. Thus, feature selection should be performed on all of these data sets before learning.

Feature selection using support vector machines is a general method which makes no assumption about the data distribution. However, one or a few outlier examples may cause unexpected results, so outlier detection should be considered before feature selection is performed in real-world applications. In addition, how the proposed method performs compared with other feature selection methods, such as spectral clustering based and mutual information based methods, is still an open issue that needs a thorough investigation.

Acknowledgments

This work is financially supported by the Natural Science Foundation of China under grant number 50174038. Thanks also go to the anonymous reviewers for their valuable advice.

References

1. Dash, M., Liu, H.: Feature selection for classification. Intelligent Data Analysis 1 (1997) 131-156
2. Kohavi, R., John, G.H.: Wrappers for feature subset selection. Artificial Intelligence 97 (1997) 273-324
3. Guyon, I., Elisseeff, A.: An introduction to variable and feature selection. Journal of Machine Learning Research 3 (2003) 1157-1182
4. Reed, R.: Pruning algorithms - a survey. IEEE Transactions on Neural Networks 4 (1993) 740-747
5. Guyon, I., Weston, J., Barnhill, S., Vapnik, V.: Gene selection for cancer classification using support vector machines. Machine Learning 46 (2002) 389-422
6. Haykin, S.: Neural Networks: A Comprehensive Foundation. 2nd edn. Prentice Hall, New Jersey (1999)
7. Cristianini, N., Shawe-Taylor, J.: An Introduction to Support Vector Machines. Cambridge University Press, Cambridge (2000)
8. Vapnik, V.: Statistical Learning Theory. Wiley, New York (1998)

9. Weston, J., Mukherjee, S., Chapelle, O., Pontil, M., Poggio, T., Vapnik, V.: Feature selection for SVMs. In: Advances in Neural Information Processing Systems. Volume 13. (2001)
10. Rakotomamonjy, A.: Variable selection using SVM-based criteria. Journal of Machine Learning Research 3 (2003) 1357-1370
11. Weston, J., Elisseeff, A., Bakir, G., Sinz, F.: The Spider. http://www.kyb.tuebingen.mpg.de/bs/people/spider/index.html (2004)
12. Moody, J., Utans, J.: Principled architecture selection for neural networks: application to corporate bond rating prediction. In Moody, J.E., Hanson, S.J., Lippmann, R.P., eds.: Advances in Neural Information Processing Systems. Volume 4. Morgan Kaufmann (1992) 683-690
13. LeCun, Y., Jackel, L.D., Bottou, L., Brunot, A., Cortes, C., Denker, J.S., Drucker, H., Guyon, I., Müller, U.A., Säckinger, E., Simard, P., Vapnik, V.: Comparison of learning algorithms for handwritten digit recognition. In Fogelman-Soulié, F., Gallinari, P., eds.: Proceedings of ICANN'95, International Conference on Artificial Neural Networks. Volume II. (1995) 53-60
14. Joachims, T.: Text categorization with support vector machines. In: Proceedings of the European Conference on Machine Learning (ECML). (1998)
15. Pontil, M., Verri, A.: Object recognition with support vector machines. IEEE Transactions on PAMI 20 (1998) 637-646
16. El-Naqa, I., Yang, Y., Wernick, M.N., Galatsanos, N.P., Nishikawa, R.: Support vector machine learning for detection of microcalcifications in mammograms. In: Proceedings of the IEEE International Symposium on Biomedical Imaging. (2002) 201-204
17. Mercer, J.: Functions of positive and negative type and their connection with the theory of integral equations. Philos. Trans. Roy. Soc. London A 209 (1909) 415-446
18. Keerthi, S.S., Lin, C.J.: Asymptotic behaviors of support vector machines with Gaussian kernel. Neural Computation 15 (2003) 1667-1689
19. Hsu, C.W., Lin, C.J.: A comparison of methods for multi-class support vector machines. IEEE Transactions on Neural Networks 13 (2002) 415-425
20. LeCun, Y., Denker, J.S., Solla, S.A.: Optimal brain damage. In Touretzky, D., ed.: Advances in Neural Information Processing Systems. Morgan Kaufmann (1990) 598-605
21. Cibas, T., Soulié, F., Gallinari, P.: Variable selection with neural networks. Neurocomputing 12 (1996) 223-248
22. Verikas, A., Bacauskiene, M.: Feature selection with neural networks. Pattern Recognition Letters 23 (2002) 1323-1335
23. Marill, T., Green, D.M.: On the effectiveness of receptors in recognition systems. IEEE Transactions on Information Theory 9 (1963) 11-17
24. Blake, C., Keogh, E., Merz, C.J.: UCI repository of machine learning databases. Technical report, Department of Information and Computer Science, University of California, Irvine, CA (1998). http://www.ics.uci.edu/~mlearn/MLRepository.html

Fig. 1. The feature selection process of the embedded algorithms on the backup data set: average test error (left) and its standard deviation (right) versus the number of eliminated features, for the prediction risk based and optimal brain damage based methods.