A New Implementation of Recursive Feature Elimination Algorithm for Gene Selection from Microarray Data

2009 World Congress on Computer Science and Information Engineering

Sihua Peng 1, Xiaoping Liu 2, Jiyang Yu 1, Zhizhen Wan 3, and Xiaoning Peng 4*

1 Department of Pathology, School of Medicine, Zhejiang University; 2 College of Life Science and Technology, Xinjiang University; 3 College of Computer Science and Engineering, Zhejiang University; 4 School of Medicine, Hunan Normal University

Abstract

We propose a new approach for gene selection and multi-cancer classification based on step-by-step improvement of classification performance (SSiCP). The SSiCP gene selection algorithm was evaluated on the NCI60 and GCM benchmark datasets, with accuracies of 96.6% and 95.5% in 10-fold cross-validation, respectively. Furthermore, SSiCP outperformed recently published algorithms when applied to two further multi-cancer data sets. Computational evidence indicated that SSiCP can avoid overfitting effectively. Compared with various gene selection algorithms, the implementation of SSiCP is very simple, and all the computational experiments are repeatable.

1. Introduction

Cancer classification is a very important step in the diagnosis and treatment of cancers. Without correct identification of the cancer type, it is almost impossible to achieve a good therapeutic effect. Many in-depth studies of cancer identification and classification based on cDNA microarray technology have been carried out [1, 2]. For binary classification problems, such as tumour versus normal tissue [3], or one subtype of a tumour versus another [4], molecular classification using gene expression profiles has achieved a very high degree of accuracy. For the classification of multiple tumour types, however, the accuracy has yet to be improved [5-10].
Because of the high dimensionality, excessive noise, and relatively small sample sizes of DNA microarray data, this problem has become a focus of data mining on gene expression profiles. Conventional classification methods perform especially poorly on data with a large number of cancer types [11], such as the NCI60 data set (9 types of cancer) [5] and the GCM data set (14 types of cancer) [6]. Recently, to meet the challenge of multi-cancer classification, investigators have proposed many new approaches. Xu et al. used semi-supervised ellipsoid ARTMAP and particle swarm optimization, with competitive performance [12]. Cai et al. proposed a new algorithm, which introduced a new measurement to quantify the difference in class discrimination strength between two genes [13]. Zhou et al. [14] recently put forward the MSVM-RFE algorithms, four extensions of the well-known SVM-RFE algorithm [15]. However, obtaining higher classification accuracy while choosing fewer genes is possible with more powerful data mining algorithms.

In this paper, we propose a new approach to gene selection and multi-cancer classification based on step-by-step improvement of classification performance (SSiCP). SSiCP, which is neither SVM-RFE nor an extension of SVM-RFE [15], is a new SVM-based implementation of the RFE feature selection methodology. The results show that our strategy is very effective, with a fast calculation procedure.

2. Materials and Methods

2.1 Data sets

NCI60 dataset [5]

* To whom correspondence should be addressed: Xiaoning Peng, PhD, Hunan Normal University School of Medicine, No. 81 Jiatongjie, Changsha, Hunan Province, P.R. China (Email: pxiaoning@hunnu.edu.cn, Tel: 86-731-8912484, Fax: 86-731-8912417, Zip Code: 410006)

978-0-7695-3507-4/08 $25.00 © 2008 IEEE. DOI 10.1109/CSIE.2009.75

The NCI60 data set was described by Ross et al. and can be downloaded from http://www-genome.wi.mit.edu/mpr/nci60/nci_60.expression.scfrs.txt. There are 60 samples in this data set, each expressing 7129 genes, in nine classes.

GCM dataset [6, 7]

The original GCM dataset contains 198 samples with 16063 genes from 14 classes of cancers [6]. A subset of the original GCM dataset is employed in this study, downloaded from http://www.broad.mit.edu/cgi-bin/cancer/publications/pub_paper.cgi?mode=view&paper_id=114.

Human Carcinomas Dataset (HCD174) [8]

The HCD174 dataset contains 174 samples in 11 classes. Each sample contains 12533 genes. The dataset was obtained from http://public.gnf.org/cancer/epican/.

Central Nervous System Embryonal Tumors dataset (CNS) [9]

The CNS dataset contains 42 samples with 7129 gene probes and can be downloaded from http://www.broad.mit.edu/mpr/cns/.

2.2 Gene pre-selection

Without gene pre-selection, computation becomes a time-consuming task because of the very high dimensionality of the feature space. After gene pre-selection, we obtain a few dozen to a few hundred differentially expressed genes. Based on this reduced gene subset, the second step of gene selection can be carried out smoothly, with the calculation burden greatly reduced. As our algorithm is built on the Weka platform, we tested several feature selection methods in Weka. After calculation and comparison, we chose the chi-squared test-based feature selection algorithm, named "ChiSquaredAttributeEval" in Weka, as our gene pre-selection algorithm. The chi-squared (χ²) method evaluates features individually by measuring their χ² statistic with respect to the classes. After calculating the χ² value of all considered features, we sorted the values in descending order, as the larger the χ² value, the more important the feature [16].
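The pre-selection step above can be sketched as follows. This is a minimal illustration using scikit-learn's `chi2` scorer in place of Weka's ChiSquaredAttributeEval; the function name, the `n_keep` parameter, and the toy data are our own assumptions, not the authors' code.

```python
# Sketch of chi-squared gene pre-selection: score every gene against the
# class labels and keep the highest-scoring ones. NOTE: this uses
# scikit-learn's chi2 as a stand-in for Weka's ChiSquaredAttributeEval.
import numpy as np
from sklearn.feature_selection import chi2

def preselect_genes(X, y, n_keep=200):
    """Return the indices of the n_keep genes with the largest chi-squared
    statistic w.r.t. the classes. chi2 requires non-negative inputs, so the
    expression matrix is shifted by its global minimum first."""
    X_nonneg = X - X.min()              # make all expression values >= 0
    scores, _ = chi2(X_nonneg, y)
    order = np.argsort(scores)[::-1]    # largest chi-squared value first
    return order[:n_keep]

# Toy usage: 30 samples, 500 "genes", 3 classes (synthetic data)
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 500))
y = rng.integers(0, 3, size=30)
kept = preselect_genes(X, y, n_keep=50)
print(len(kept))
```

On the real datasets this step would be tuned to retain roughly the gene counts reported in Section 3.1 (e.g. around 200 genes) before the second selection stage runs.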
2.3 RFE: Recursive Feature Elimination

RFE is an iterative procedure, which can be described as follows.
1. Train the classifier.
2. Compute the ranking criterion for all features.
3. Remove the feature with the smallest ranking criterion.

In the SVM-RFE algorithm proposed by Guyon et al., the main steps are as follows [15].
1. Train the classifier: α = SVM-train(x, y).
2. Compute the weight vector: w = Σ_k α_k y_k x_k.
3. Compute the ranking criteria: c_i = (w_i)².
4. Find the feature with the smallest ranking criterion: f = argmin_i c_i.
5. Eliminate feature f.

2.4 Feature selection methodology

Step-by-step feature reduction. The SSiCP algorithm is not a kind of wrapper algorithm [17]. In SSiCP, we do not use a search method, but we do employ an evaluation function to guide the elimination of features step by step. To some extent, SSiCP is similar to SVM-RFE in two respects: both algorithms are SVM-based, and both employ the recursive feature elimination (RFE) methodology. Nevertheless, they are completely different algorithms. The innovation of our algorithm is its feature elimination criterion. Briefly, we eliminate one feature at a time: if the classification accuracy without this feature increases (or is equal to the original value), we remove the feature permanently; otherwise we restore it. Thus SSiCP does not rank the features by any ranking criterion. The key steps of the algorithm are as follows:

Step 1. Train the classifier with n features (genes), and compute the accuracy with m-fold cross-validation.
Step 2. Eliminate a feature f temporarily, and compute the accuracy with m-fold cross-validation.
Step 3. If the accuracy does not decrease, remove the feature f; if it decreases, restore the feature f. If all the retained features have been tried once without an increase, a local maximum of the accuracy has been reached.
Step 4. If n = 2, stop the calculation. If n > 2, go to Step 2.
The above steps are the key points of our algorithm; the details are shown in Fig. 1.
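The four steps above can be sketched as a greedy backward-elimination loop. The sketch below uses scikit-learn's `SVC` and `cross_val_score` as stand-ins for the paper's Weka SMO setup; the function name `ssicp`, the fold count, and the synthetic data are our assumptions, not the original implementation.

```python
# Minimal sketch of the SSiCP idea: tentatively drop one feature at a time
# and keep the deletion only if m-fold cross-validated accuracy does not
# decrease; stop when a full pass yields no removal or only 2 features remain.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def ssicp(X, y, folds=3):
    """Greedy backward elimination guided by cross-validated accuracy."""
    kept = list(range(X.shape[1]))
    clf = SVC(kernel="poly", degree=2)     # stands in for Weka's SMO + PolyKernel
    best = cross_val_score(clf, X[:, kept], y, cv=folds).mean()
    improved = True
    while improved and len(kept) > 2:
        improved = False
        for f in list(kept):               # try dropping each retained feature once
            if len(kept) <= 2:
                break
            trial = [g for g in kept if g != f]
            acc = cross_val_score(clf, X[:, trial], y, cv=folds).mean()
            if acc >= best:                # remove f for good if accuracy holds
                kept, best, improved = trial, acc, True
    return kept, best

# Toy usage: 40 samples, 10 features, only the first two are informative
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
genes, acc = ssicp(X, y)
print(len(genes), round(acc, 2))
```

Note how, unlike SVM-RFE, no weight-based ranking is computed: the cross-validation accuracy itself is the elimination criterion, which is what keeps the implementation simple.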

Fig. 1 Schematic map of the feature reduction algorithm.

2.5 Confirmation of the classification algorithm in the second step of feature selection

By comparing seven classification algorithms, namely the Naive Bayes classifier, the BayesNet classifier, SMO (a sequential minimal optimization algorithm for training a support vector classifier), KStar, LMT (logistic model trees), J48, and SL (a classifier for building linear logistic regression models) [Weka: http://www.cs.waikato.ac.nz/~remco/weka.pdf], we determined the classification algorithm that provided the best performance. The seven classification algorithms were applied to the GCM and NCI60 data sets, and the optimal algorithm was selected. Subsequent calculation results showed that SMO outperformed the other six algorithms.

2.6 Parameter selection in Weka

When we used SVM for the classification task, the choice of kernel function was a key factor in obtaining better performance. For the classification of microarray datasets, relatively better classification performance has been achieved with the polynomial kernel function [10]. After testing four kernel functions (NormalizedPolyKernel, PolyKernel, RBFKernel, and StringKernel) in Weka, it was clear that the best results were achieved with PolyKernel.

Overfitting evaluation of the SSiCP algorithm

As for any machine learning algorithm, the overfitting issue must be addressed. Of the four datasets, HCD174 (174 instances) has more instances than GCM, NCI60, and CNS. Therefore, to evaluate the overfitting status of the SSiCP algorithm, the HCD174 dataset was partitioned into two parts: a training set and a test set. A classifier model was obtained by running the SSiCP algorithm on the training set, and its ten-fold cross-validation accuracy was recorded. The classifier model was then tested on the independent test set, and its test accuracy was recorded. If there is little difference between the two accuracies, we conclude that SSiCP avoids overfitting effectively.

3. Results

3.1 Initial noise removal and comparison of classification algorithms

The NCI60 and GCM datasets are generally considered benchmark datasets in microarray data mining, so they are commonly used to test the performance of a new algorithm. Accordingly, the seven classification algorithms described above were applied to these two datasets. First, we obtained computational results with and without feature pre-selection (using the χ² test-based feature selection algorithm). The results suggested that after initial pre-selection of the features the classification performance improved considerably, indicating that the noise in the microarray datasets had been removed to a certain extent. The results also indicated that, on both the NCI60 and GCM data, the SMO algorithm was superior to the other algorithms. After gene pre-selection, 208 genes were selected from the NCI60 data set and 150 genes from the GCM data set.

3.2 Gene selection based on step-by-step improvement of classification performance

By calling the main package of Weka to run our algorithm, the computations were carried out on the NCI60 and GCM datasets, and the gene selection results of the seven algorithms were obtained

(Fig. 2 and Fig. 3). Clearly, the SMO algorithm outperformed the other six algorithms here as well.

Fig. 2 Classification performance comparison of the seven algorithms on the NCI60 data set. The maximal accuracy of 96.6% was obtained by the SMO algorithm with 24 genes (red).

Fig. 3 Classification performance comparison of the seven algorithms on the GCM data set. The maximal accuracy of 95.5% was obtained by the SMO algorithm with 28 genes (red).

3.3 Comparison of computational results on the four data sets

Through the above comparisons, the SMO algorithm was selected as the classifier embedded in our algorithm. This SMO-based algorithm was then applied to the other two datasets: CNS and HCD174. In the calculation process, we generally chose the following parameters: ten-fold cross-validation, the PolyKernel kernel function, and the standardization data filter type, with the remaining parameters set to their default values. The results are shown in Table 1.

Table 1 - Accuracy comparison of multi-class classification on the four data sets (accuracy in %, number of selected genes in parentheses)

Su: 85.37 (13), 92.0 (1100)
Pomeroy: 83.3 (7129)
Yeang: 81.25 (16063)
Peng: 87.93 (27), 85.19 (26)
Lin: 95 (15), 84.3 (48)
Xu: 84.66 (79)
Cai: 85.7 (45), 97.3 (80)
Zhou: 83.28 (400)
This study: NCI60 96.6 (24), GCM 95.5 (28), CNS 97.6 (10), HCD174 97.1 (37)

3.4 Overfitting evaluation

The HCD174 dataset was divided into a training set with 142 instances and a test set with 32 instances. Running SSiCP on the HCD174 training set, a classifier

model including 49 features was obtained, with an accuracy of 95.8% by ten-fold cross-validation. The independent test set from HCD174 was then used to test the classifier model, giving an accuracy of 93.8%. From 95.8% to 93.8%, the accuracy declined only slightly, suggesting that SSiCP avoids overfitting effectively.

4. Discussion

Comparing the results obtained on the four datasets, our algorithm was superior to all other algorithms in classification accuracy except for the algorithm of Cai et al., which achieved slightly higher accuracy than ours (97.3% versus 97.1%, Table 1), whereas the number of genes we selected was far smaller than theirs (37 versus 80, Table 1). The advantages of wrapper-based techniques for feature selection are well established [17], so a comparison should be made between wrapper-based approaches and the SSiCP algorithm. First, it has recently been recognized that wrapper-based techniques have the potential to overfit the training data [18], whereas SSiCP has shown, in computational experiments, the ability to overcome overfitting. Second, wrapper-based techniques must employ a heuristic search method to explore feature subsets in a large state space, placing a heavy computational burden on the computer. Instead of searching states in a huge space, SSiCP uses step-by-step improvement of classification accuracy to reduce the feature space, resulting in a fast computation procedure and a simple implementation of the algorithm.

5. References

[1] Golub, T.R., Slonim, D.K., Tamayo, P., et al. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring, Science 286, 1999, pp. 531-537.
[2] Bittner, M., et al. Molecular classification of cutaneous malignant melanoma by gene expression profiling, Nature 406, 2000, pp. 536-540.
[3] Furey, T.S., et al. Support vector machine classification and validation of cancer tissue samples using microarray expression data, Bioinformatics 16, 2000, pp. 906-914.
[4] Alizadeh, A.A., et al. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling, Nature 403, 2000, pp. 503-511.
[5] Ross, D.T., et al. Systematic variation in gene expression patterns in human cancer cell lines, Nature Genetics 24, 2000, pp. 227-235.
[6] Ramaswamy, S., Tamayo, P., Rifkin, R., et al. Multiclass cancer diagnosis using tumor gene expression signatures, Proc. Natl. Acad. Sci. USA 98, 2001, pp. 15149-15154.
[7] Lu, J., Getz, G., Miska, E.A., et al. MicroRNA expression profiles classify human cancers, Nature 435, 2005, pp. 834-838.
[8] Su, A.I., et al. Molecular classification of human carcinomas by use of gene expression signatures, Cancer Research 61, 2001, pp. 7388-7393.
[9] Pomeroy, S.L., et al. Prediction of central nervous system embryonal tumour outcome based on gene expression, Nature 415, 2002, pp. 436-442.
[10] Peng, S.H., Xu, Q.H., Ling, X.B., et al. Molecular classification of cancer types from microarray data using the combination of genetic algorithms and support vector machines, FEBS Letters 555, 2003, pp. 358-362.
[11] Li, T., et al. A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression, Bioinformatics 20, 2004, pp. 2429-2437.
[12] Xu, R., et al. Multiclass cancer classification using semisupervised ellipsoid ARTMAP and particle swarm optimization with gene expression data, IEEE/ACM Transactions on Computational Biology and Bioinformatics 4, 2007, pp. 65-77.
[13] Cai, Z.P., et al. Selecting dissimilar genes for multi-class classification, an application in cancer subtyping, BMC Bioinformatics 8, 2007, Art. No. 206.
[14] Zhou, X. and Tuck, D.P. MSVM-RFE: extensions of SVM-RFE for multiclass gene selection on DNA microarray data, Bioinformatics 23, 2007, pp. 1106-1114.
[15] Guyon, I., et al. Gene selection for cancer classification using support vector machines, Machine Learning 46, 2002, pp. 389-422.
[16] Liu, H. and Setiono, R.
Chi2: Feature selection and discretization of numeric attributes. In: Proceedings of the IEEE 7th International Conference on Tools with Artificial Intelligence, 1995, pp. 388-391.
[17] Kohavi, R. and John, G.H. Wrappers for feature subset selection, Artificial Intelligence 97, 1997, pp. 273-324.
[18] Reunanen, J. Overfitting in making comparisons between variable selection methods, Journal of Machine Learning Research 3, 2003, pp. 1371-1382.