2009 World Congress on Computer Science and Information Engineering

A New Implementation of Recursive Feature Elimination Algorithm for Gene Selection from Microarray Data

Sihua Peng 1, Xiaoping Liu 2, Jiyang Yu 1, Zhizhen Wan 3, and Xiaoning Peng 4*

1 Department of Pathology, School of Medicine, Zhejiang University; 2 College of Life Science and Technology, Xinjiang University; 3 College of Computer Science and Engineering, Zhejiang University; 4 School of Medicine, Hunan Normal University.

Abstract

We propose a new approach for gene selection and multi-cancer classification based on step-by-step improvement of classification performance (SSiCP). The SSiCP gene selection algorithm was evaluated on the NCI60 and GCM benchmark datasets, with accuracies of 96.6% and 95.5% in 10-fold cross-validation, respectively. Furthermore, SSiCP outperformed recently published algorithms when applied to two further multi-cancer data sets. Computational evidence indicates that SSiCP can avoid overfitting effectively. Compared with various gene selection algorithms, the implementation of SSiCP is very simple, and all the computational experiments are repeatable.

1. Introduction

Cancer classification is a very important step in the diagnosis and treatment of cancers. Without correct identification of the cancer type, it is almost impossible to achieve a good therapeutic effect. Many in-depth studies of cancer identification and classification have been based on cDNA microarray technology [1, 2]. For binary classification problems, such as tumour versus normal tissue [3] or one subtype of a tumour versus another [4], molecular classification using gene expression profiles has achieved a very high degree of accuracy. For the classification of multiple tumour types, however, the accuracy has yet to be improved [5-10].
Because of the high dimensionality, the excessive noise, and the relatively small sample sizes of DNA microarray data, this issue has become a hot focus in the data mining of gene expression profiles. Many conventional classification methods show very poor performance on data with a large number of cancer types [11], such as the NCI60 data set (9 types of cancer) [5] and the GCM data set (14 types of cancer) [6]. Recently, to meet the challenge of multi-cancer classification, investigators have proposed many new approaches. Xu et al. used semi-supervised ellipsoid ARTMAP and particle swarm optimization, with competitive performance [12]. Cai et al. proposed a new algorithm, which introduced a new measurement to quantify the difference in class discrimination strength between two genes [13]. Zhou et al. [14] recently put forward the MSVM-RFE algorithms, which are four extensions of the well-known SVM-RFE algorithm [15]. However, it may be possible to obtain higher classification accuracy while selecting fewer genes by using more powerful data mining algorithms.

In this paper, we propose a new approach to gene selection and multi-cancer classification based on step-by-step improvement of classification performance (SSiCP). SSiCP, which is neither SVM-RFE nor an extension of SVM-RFE [15], is a new SVM-based implementation of the RFE feature selection methodology. The results show that our strategy is very effective, with a fast calculation procedure.

2. Materials and Methods

2.1 Data sets

NCI60 dataset [5]

* To whom correspondence should be addressed: Xiaoning Peng, PhD, Hunan Normal University School of Medicine, No. 81 Jiatongjie, Changsha, Hunan Province, P.R. China (Email: pxiaoning@hunnu.edu.cn, Tel: 86-731-8912484, FAX: 86-731-8912417, Zip Code: 410006)

978-0-7695-3507-4/08 $25.00 2008 IEEE DOI 10.1109/CSIE.2009.75
The NCI60 data set was described by Ross et al. and can be downloaded from http://www-genome.wi.mit.edu/mpr/nci60/nci_60.expression.scfrs.txt. There are 60 samples in this data set, each expressing 7129 genes, in nine cancer types.

GCM dataset [6,7]

The original GCM dataset contains 198 samples with 16063 genes from 14 classes of cancers [6]. A subset of the original GCM dataset is employed in this study, downloaded from http://www.broad.mit.edu/cgi-bin/cancer/publications/pub_paper.cgi?mode=view&paper_id=114.

Human Carcinomas Dataset (HCD174) [8]

The HCD174 dataset contains 174 samples in 11 classes. Each sample contains 12533 genes. The dataset was obtained from http://public.gnf.org/cancer/epican/.

Central Nervous System Embryonal Tumors dataset (CNS) [9]

The CNS dataset contains 42 samples with 7129 gene probes, and can be downloaded from http://www.broad.mit.edu/mpr/cns/.

2.2 Gene pre-selection

Without gene pre-selection, computation becomes a time-consuming task because of the very high dimensionality of the feature space. After gene pre-selection, we obtain a few dozen to a few hundred differentially expressed genes. Based on this reduced gene subset, the second step of gene selection can be carried out smoothly, with the calculation burden greatly reduced. As our algorithm is based on the Weka platform, we tested several feature selection methods in Weka. After calculation and comparison, we chose the chi-squared test-based feature selection algorithm, named "ChiSquaredAttributeEval" in Weka, as our gene pre-selection algorithm. The chi-squared (χ2) method evaluates features individually by measuring their χ2 statistic with respect to the classes. After calculating the χ2 value of all considered features, we sorted the values in descending order, as the larger the χ2 value, the more important the feature [16].
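As an illustration of this pre-selection step, the sketch below uses scikit-learn's chi2 scorer in place of Weka's ChiSquaredAttributeEval; the data are random non-negative stand-ins, and the cutoff of 200 genes is an arbitrary assumption, not a value from the paper.

```python
# Illustrative chi-squared gene pre-selection (assumed setup, not the
# paper's Weka implementation): score each gene individually against the
# class labels and keep the top-scoring genes.
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
X = rng.random((60, 500))          # toy stand-in: 60 samples x 500 genes (non-negative)
y = rng.integers(0, 9, size=60)    # toy labels for 9 tumour classes

selector = SelectKBest(score_func=chi2, k=200).fit(X, y)
top_genes = np.argsort(selector.scores_)[::-1][:200]  # largest chi2 value first
X_reduced = selector.transform(X)                     # keep only the selected genes
print(X_reduced.shape)             # (60, 200)
```

The chi2 scorer requires non-negative feature values, which raw or normalized expression intensities typically satisfy.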
2.3 RFE: Recursive Feature Elimination

RFE is an iterative procedure, which can be described as follows.
1. Train the classifier.
2. Compute the ranking criterion for all features.
3. Remove the feature with the smallest ranking criterion.

In the SVM-RFE algorithm proposed by Guyon et al., the main steps are as follows [15].
1. Train the classifier: α = SVM-train(X, y).
2. Compute the weight vector: w = Σ_k α_k y_k x_k.
3. Compute the ranking criteria: c_i = (w_i)^2.
4. Find the feature with the smallest ranking criterion: f = argmin_i(c_i).
5. Eliminate the feature f.

2.4 Feature selection methodology

Step-by-step feature reduction

The SSiCP algorithm is not a kind of wrapper algorithm [17]. In SSiCP, we do not use a search method, but we do employ an evaluation function to guide the elimination of features step by step. To some extent, SSiCP is similar to SVM-RFE in two aspects: both algorithms are SVM-based, and both employ the recursive feature elimination (RFE) methodology. Nevertheless, they are completely different algorithms. The innovation of our algorithm is the feature elimination criterion. Briefly, we eliminate one feature at a time. If the classification accuracy without this feature increases (or equals the original value), we remove the feature permanently; otherwise we restore it. Thus SSiCP does not rank the features by any ranking criterion. The key steps of the proposed algorithm are as follows, where a0 denotes the current accuracy and a1 the accuracy after temporarily eliminating a feature.

Step 1. Train the classifier with n features (genes), and compute the accuracy a0 with m-fold cross-validation.
Step 2. Eliminate a feature f temporarily, and compute the accuracy a1 with m-fold cross-validation.
Step 3. If a1 ≥ a0, remove the feature f and set a0 = a1; if a1 < a0, restore the feature f. If all the retained features have been tried once without any increase, a local maximum of the accuracy has been reached.
Step 4. If n = 2, stop the calculation. If n > 2, go to Step 2.
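The SSiCP loop just outlined can be sketched in a few lines of Python. This is an illustrative sketch only: it uses scikit-learn's SVC with a polynomial kernel in place of Weka's SMO with PolyKernel, random stand-in data, and 3-fold cross-validation; all names and sizes here are assumptions, not the paper's actual setup.

```python
# Illustrative sketch of the SSiCP loop (assumed setup): drop a gene for good
# whenever m-fold cross-validated accuracy does not decrease without it.
# SVC(kernel="poly") stands in for Weka's SMO with PolyKernel.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def ssicp(X, y, folds=3):
    kept = list(range(X.shape[1]))           # start with all genes (Step 1)
    clf = SVC(kernel="poly")
    acc = cross_val_score(clf, X[:, kept], y, cv=folds).mean()
    improved = True
    while improved and len(kept) > 2:        # Step 4: stop at two genes
        improved = False
        for f in list(kept):                 # Step 2: try removing each gene once
            trial = [g for g in kept if g != f]
            trial_acc = cross_val_score(clf, X[:, trial], y, cv=folds).mean()
            if trial_acc >= acc:             # Step 3: no loss -> eliminate f for good
                kept, acc, improved = trial, trial_acc, True
                if len(kept) == 2:
                    return kept, acc
        # a full pass in which every gene was restored means a local maximum
    return kept, acc

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 12))                # toy data: 40 samples x 12 "genes"
y = rng.integers(0, 2, size=40)              # toy binary labels
genes, acc = ssicp(X, y)
print(len(genes), round(acc, 3))
```

Note that no feature ranking is ever computed: a gene is eliminated purely because the cross-validated accuracy does not drop without it.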
These steps are the key points of our algorithm; the details are shown in Fig. 1.
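For contrast with the SSiCP procedure, the classical SVM-RFE loop of Section 2.3 (rank by c_i = w_i^2, repeatedly drop the smallest) can be sketched the same way. Again this is an illustrative sketch with scikit-learn's LinearSVC on random binary toy data, not the original implementation.

```python
# Sketch of the classical SVM-RFE loop (Guyon et al. [15]) with a linear SVM;
# scikit-learn stands in for the original implementation.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 30))            # toy data: 60 samples x 30 features
y = rng.integers(0, 2, size=60)          # binary labels, as in the original SVM-RFE

surviving = list(range(X.shape[1]))      # indices of features still in play
ranking = []                             # features in order of elimination (worst first)
while len(surviving) > 1:
    svm = LinearSVC(max_iter=5000).fit(X[:, surviving], y)
    w = svm.coef_.ravel()                # weight vector w = sum_k alpha_k y_k x_k
    c = w ** 2                           # ranking criterion c_i = w_i^2
    worst = int(np.argmin(c))            # feature with the smallest criterion
    ranking.append(surviving.pop(worst)) # eliminate it and record the order
ranking.append(surviving[0])             # the last survivor is the top-ranked feature
print(ranking[-5:])                      # the five highest-ranked features
```

Unlike SSiCP, this loop never evaluates classification accuracy during elimination; it trusts the weight magnitudes as a proxy for feature importance.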
Fig. 1 Schematic map of the feature reduction algorithm.

Overfitting evaluation of the SSiCP algorithm

As with any machine learning algorithm, the overfitting issue must be addressed. Of the four datasets, HCD174 (174 instances) has more instances than GCM, NCI60, and CNS. Therefore, to evaluate the overfitting status of the SSiCP algorithm, the HCD174 dataset is partitioned into two parts: a training set and a test set. A classifier model is obtained by running the SSiCP algorithm on the training set, with a ten-fold cross-validation accuracy denoted x1. The classifier model is then tested on the independent test set, with an accuracy denoted x2. If there is little difference between x1 and x2, we conclude that SSiCP can avoid overfitting effectively.

2.5 Confirmation of the classification algorithm in the second step of feature selection

By comparing seven classification algorithms, including the Naive Bayes classifier, the BayesNet classifier, SMO (sequential minimal optimization algorithm for training a support vector classifier), KStar, LMT (logistic model trees), J48, and SL (classifier for building linear logistic regression models) [Weka: http://www.cs.waikato.ac.nz/~remco/weka.pdf], we determined the classification algorithm that provided the best performance. By running the seven classification algorithms on the GCM and NCI60 data sets, the optimal algorithm was selected. Subsequent calculation results showed that SMO outperformed all of the other six algorithms.

2.6 Parameter selection in Weka

When we used SVM for the classification task, the choice of the kernel function was a key factor in obtaining better performance. For the classification of microarray datasets, relatively better classification performance has been achieved using the polynomial kernel function [10]. After testing the four kernel functions (NormalizedPolyKernel, PolyKernel, RBFKernel, and StringKernel) in Weka, it was also clear that the best results were achieved with PolyKernel.

3. Results

3.1 Initial noise removal and comparison of classification algorithms

The NCI60 and GCM datasets are generally considered benchmark datasets for the microarray data mining problem, so they are frequently used to test the performance of new algorithms. Therefore, seven classification algorithms commonly used in data mining were employed with these two datasets. First, we obtained the computational results with and without feature pre-selection (using the χ2 test-based feature selection algorithm). The results suggested that after initial pre-selection of the features, the classification performance improved considerably, indicating that the noise in the microarray datasets was removed to a certain extent. The results also indicated that, on both the NCI60 and GCM data, the SMO algorithm was superior to the other algorithms. After feature (gene) pre-selection, 208 genes were selected from the NCI60 data set and 150 genes from the GCM data set.

3.2 Gene selection based on step-by-step improvement of classification performance

By calling the main package of Weka to run our algorithm, the computations were carried out using the NCI60 and GCM datasets, and the gene selection results of the above seven algorithms were obtained
(Fig. 2 and Fig. 3). Clearly, the SMO algorithm also outperformed the other six algorithms.

Fig. 2 Classification performance comparison of the seven algorithms on the NCI60 data set. The maximal accuracy of 96.6% was obtained using the SMO algorithm with 24 genes (red).

Fig. 3 Classification performance comparison of the seven algorithms on the GCM data set. The maximal accuracy of 95.5% was obtained using the SMO algorithm with 28 genes (red).

3.3 Comparison of computational results using four data sets

Through the above comparisons, the SMO algorithm was selected as the classifier embedded in our algorithm. This SMO-based algorithm was then applied to the other two datasets: CNS and HCD174. In the calculation process, we generally chose the following parameters: ten-fold cross-validation, the PolyKernel kernel function, and the standardization data filter, with the remaining parameters set to their default values. The results are shown in Table 1.

Table 1. Accuracy comparison of multi-class classification on the four data sets. Each cell gives the accuracy (%) followed by the number of selected genes in parentheses.

Method        NCI60        GCM            CNS           HCD174
Su            85.37 (13)                                92.0 (1100)
Pomeroy                                   83.3 (7129)
Yeang                      81.25 (16063)
Peng          87.93 (27)   85.19 (26)
Lin           95 (15)      84.3 (48)
Xu                         84.66 (79)
Cai                        85.7 (45)                    97.3 (80)
Zhou                       83.28 (400)
This study    96.6 (24)    95.5 (28)      97.6 (10)     97.1 (37)

3.4 Overfitting evaluation

The HCD174 dataset was divided into a training set with 142 instances and a test set with 32 instances. Running SSiCP on the HCD174 training set, a classifier
model with 49 features was obtained, with an accuracy of 95.8% by ten-fold cross-validation. The independent test set from HCD174 was then used to test the classifier model, yielding an accuracy of 93.8%. From 95.8% to 93.8%, the accuracy declined only slightly, suggesting that SSiCP avoids overfitting effectively.

4. Discussion

Comparing the results obtained on the four datasets, our algorithm was superior to all other algorithms in classification accuracy except that of Cai et al., which achieved slightly higher accuracy than ours (97.3% versus 97.1%, Table 1), although the number of genes we selected was far smaller than theirs (37 versus 80, Table 1).

The advantages of wrapper-based techniques for feature selection are well established [17], so a comparison should be made between wrapper-based approaches and the SSiCP algorithm. First, it has recently been recognized that wrapper-based techniques have the potential to overfit the training data [18], whereas SSiCP has been shown by computational experiments to overcome overfitting. Second, wrapper-based techniques must employ a heuristic search method to explore feature subsets in a large state space, placing a heavy computational burden on the computer. Instead of searching states in a huge space, SSiCP uses step-by-step improvement of classification accuracy to reduce the feature space, resulting in a fast computation procedure and a simple implementation of the algorithm.

5. References

[1] Golub, T.R., Slonim, D.K., Tamayo, P., et al. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring, Science 286, 1999, pp. 531-537.
[2] Bittner, M., et al. Molecular classification of cutaneous malignant melanoma by gene expression profiling, Nature 406, 2000, pp. 536-540.
[3] Furey, T.S., et al. Support vector machine classification and validation of cancer tissue samples using microarray expression data, Bioinformatics 16, 2000, pp. 906-914.
[4] Alizadeh, A.A., et al. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling, Nature 403, 2000, pp. 503-511.
[5] Ross, D.T., et al. Systematic variation in gene expression patterns in human cancer cell lines, Nature Genetics 24, 2000, pp. 227-235.
[6] Ramaswamy, S., Tamayo, P., Rifkin, R., et al. Multiclass cancer diagnosis using tumor gene expression signatures, Proc. Natl. Acad. Sci. USA 98, 2001, pp. 15149-15154.
[7] Lu, J., Getz, G., Miska, E.A., et al. MicroRNA expression profiles classify human cancers, Nature 435, 2005, pp. 834-838.
[8] Su, A.I., et al. Molecular classification of human carcinomas by use of gene expression signatures, Cancer Research 61, 2001, pp. 7388-7393.
[9] Pomeroy, S.L., et al. Prediction of central nervous system embryonal tumour outcome based on gene expression, Nature 415, 2002, pp. 436-442.
[10] Peng, S.H., Xu, Q.H., Ling, X.B., et al. Molecular classification of cancer types from microarray data using the combination of genetic algorithms and support vector machines, FEBS Letters 555, 2003, pp. 358-362.
[11] Li, T., et al. A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression, Bioinformatics 20, 2004, pp. 2429-2437.
[12] Xu, R., et al. Multiclass cancer classification using semisupervised ellipsoid ARTMAP and particle swarm optimization with gene expression data, IEEE/ACM Transactions on Computational Biology and Bioinformatics 4, 2007, pp. 65-77.
[13] Cai, Z.P., et al. Selecting dissimilar genes for multi-class classification, an application in cancer subtyping, BMC Bioinformatics 8, 2007, Art. No. 206.
[14] Zhou, X. and Tuck, D.P. MSVM-RFE: extensions of SVM-RFE for multiclass gene selection on DNA microarray data, Bioinformatics 23, 2007, pp. 1106-1114.
[15] Guyon, I., et al. Gene selection for cancer classification using support vector machines, Machine Learning 46, 2002, pp. 389-422.
[16] Liu, H. and Setiono, R.
Chi2: Feature selection and discrimination of numeric attributes. In: Proceedings of the IEEE 7th International Conference on Tools with Artificial Intelligence, 1995, pp. 388-391.
[17] Kohavi, R. and John, G.H. Wrappers for feature subset selection, Artificial Intelligence 97, 1997, pp. 273-324.
[18] Reunanen, J. Overfitting in making comparisons between variable selection methods, Journal of Machine Learning Research 3, 2003, pp. 1371-1382.