SVMFILEFS- A NOVEL ENSEMBLE FEATURE SELECTION TECHNIQUE FOR EFFECTIVE BREAST CANCER DIAGNOSIS


International Journal of Civil Engineering and Technology (IJCIET), Volume 9, Issue 11, November 2018, Article ID: IJCIET_09_11_147. IAEME Publication, Scopus Indexed.

SVMFILEFS - A NOVEL ENSEMBLE FEATURE SELECTION TECHNIQUE FOR EFFECTIVE BREAST CANCER DIAGNOSIS

Kavitha C.R, Research Scholar, R&D, Bharathiar University, Coimbatore, India
Mahalekshmi T, Principal, SNIT, Kollam, India

ABSTRACT

This paper describes a novel ensemble feature selection method, SVMFILEFS, used for the diagnosis of breast cancer. First, the technique applies three filters, Chi-square, Random Forest and Information Gain, and combines their normalized outputs into a quantitative ensemble importance; the best attributes were selected using a threshold of 50% ensemble importance. Second, Support Vector Machine Recursive Feature Elimination (SVMRFE) is applied to the dataset after the attributes selected in the first step have been removed from it, yielding a further subset of attributes. Classification was performed with the models random forest (rf), Support Vector Machine-Radial (svmRadial), Linear Discriminant Analysis (LDA), JRip, Recursive Partitioning and Regression Trees (rpart), J48 and Logistic Model Trees (LMT) on the Wisconsin Breast Cancer Dataset (WBCD) downloaded from the UCI repository. In this experiment, classification was run with the attribute subsets obtained from the feature selection methods SVMRFE, Filter Combo and SVMFILEFS, and the resulting accuracies were compared. The findings show that SVMFILEFS, our novel ensemble feature selection technique, outperformed the other feature selection methods considered in this study and achieves high classification accuracy.
Keywords: Feature selection, SVMRFE, Filter, SVMFILEFS, Ensemble Feature Selection, Classification, Accuracy

Cite this Article: Kavitha C.R and Mahalekshmi T, SVMFILEFS - a Novel Ensemble Feature Selection Technique for Effective Breast Cancer Diagnosis, International Journal of Civil Engineering and Technology, 9(11), 2018. editor@iaeme.com

1. INTRODUCTION

Recently, ensemble feature selection has emerged as an effective approach that combines feature selection with ensemble learning. The main aim is to generate good feature subsets that have high correlation with the output class. For binary classification, a single feature selection method gives less reliable results than an ensemble of different base feature selection methods [1]. Classifier performance can be improved by combining multiple feature selection methods, which identifies features that are weak individually but strong as a group. This paper presents a novel ensemble feature selection method, SVMFILEFS, which combines the outputs of three filter methods and the SVMRFE [2][3] feature selection method to generate the best attribute subset for binary classification on health care datasets downloaded from the UCI repository.

This paper is organized as follows. Section II describes the details of the experiment: the dataset, the framework, and the SVMFILEFS algorithm. The next section presents the results and discussion, followed by an explanation of how this ensemble feature selection was implemented as a web application. Finally, the conclusion is given, followed by the references.

2. EXPERIMENT

2.1. Introduction

In this experiment, the ensemble feature selection method SVMFILEFS was implemented in R [4]. In the first step, three filter-based feature selection methods were applied to the datasets: the random forest filter [5], the Chi-square filter [6] and the information gain filter [7]. Each filter applies a statistical measure to assign a score to each attribute.

Random Forest

Random forest is one of the most popular methods for feature ranking. In this paper, the random forest filter is implemented using the FSelector package.
The Random Forest classifier achieves relatively good accuracy, is robust, and is easy to use [8]. The FSelector package provides two importance measures for feature selection with random forests: mean decrease impurity (MDI), based on the Gini index, and mean decrease accuracy (MDA). MDI calculates each attribute's importance as the sum over all splits that include the attribute, weighted proportionally to the number of samples each split partitions [9]. MDA is the decrease in classification accuracy after the variable has been randomly permuted; a higher MDA means the attribute contributes more to the classification accuracy [9].

Chi-Squared (χ²)

The chi-squared (χ²) statistic is used to test the independence of two variables by calculating a score that measures the extent of their dependence. In attribute selection, χ² measures the independence of each attribute with respect to the class. In this experiment, the chi-squared score was computed using the Chi-squared filter of the FSelector package. The chi-squared statistic can be defined as:

χ²(t, c) = N (AD − CB)² / ((A + C)(B + D)(A + B)(C + D))

where A is the frequency with which t and c co-occur, B is the frequency of t without c, C is the frequency of c without t, D is the frequency of non-occurrence of both c and t, and N is the total number of documents.
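As a quick illustration of the formula above, here is a minimal Python sketch (not from the paper; the contingency counts are invented for the example):

```python
def chi_squared_score(A, B, C, D):
    """Chi-squared feature score from a 2x2 contingency table.

    A: t and c co-occur, B: t without c, C: c without t,
    D: neither occurs; N = A + B + C + D is the document total.
    """
    N = A + B + C + D
    numerator = N * (A * D - C * B) ** 2
    denominator = (A + C) * (B + D) * (A + B) * (C + D)
    return numerator / denominator

# A feature strongly associated with the class gets a high score;
# a feature independent of the class scores zero.
print(chi_squared_score(40, 10, 10, 40))  # 36.0
print(chi_squared_score(25, 25, 25, 25))  # 0.0
```

A higher score means stronger dependence between the feature and the class, which is why features are ranked by this value.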

Information Gain

Information Gain (IG) is a filter-based feature selection method used for selecting relevant attributes. The information gain is the mutual information of the target variable and an independent variable: it is the reduction in entropy of the target variable achieved by learning the state of the independent variable [10]. Consider an attribute X and a class attribute Y. The information gain of X with respect to Y is the reduction in uncertainty about the value of Y when the value of X is known. The uncertainty about Y is measured by its entropy, H(Y), and the uncertainty about Y given the value of X by the conditional entropy H(Y|X) [10]:

I(Y; X) = H(Y) − H(Y|X)

If Y and X are discrete variables taking values in {y1, ..., yk} and {x1, ..., xl}, the entropy of Y is

H(Y) = − Σi P(Y = yi) log2 P(Y = yi)

and the conditional entropy of Y given X is

H(Y|X) = − Σj P(X = xj) Σi P(Y = yi | X = xj) log2 P(Y = yi | X = xj)

2.2. Dataset

In this experiment, the Wisconsin Breast Cancer Dataset (WBCD) was used, downloaded from the UCI repository [11]. The downloaded dataset contains missing values, so the data must be pre-processed to produce good results. Since the dataset is a standard benchmark, little pre-processing effort is required: missing values are replaced with the mean value of the attribute. The R tool [4] was used to conduct the experiment described in this paper; R is open-source software with many packages that support machine learning.

Table 1 Dataset Characteristics

Dataset | Total No. of Attributes | No. of Input Attributes | No. of Classes | No. of Examples | Missing Attributes
WBCD | | 31 | 2 | | yes
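The mean-value imputation described above can be sketched in a few lines of Python (an illustrative stand-in, not the paper's R pre-processing code):

```python
def impute_mean(rows):
    """Replace missing values (None) in each column with the mean of
    that column's observed values. rows: list of equal-length lists."""
    n_cols = len(rows[0])
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        mean = sum(observed) / len(observed)
        for r in rows:
            if r[j] is None:
                r[j] = mean
    return rows

data = [[1.0, 4.0], [None, 6.0], [3.0, None]]
print(impute_mean(data))  # [[1.0, 4.0], [2.0, 6.0], [3.0, 5.0]]
```

Mean imputation keeps every example usable while leaving each column's mean unchanged, which is why it needs little effort on a benchmark dataset like WBCD.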
2.3. Framework of the Experiment

This section describes the framework of the proposed novel ensemble feature selection approach, SVMFILEFS, in which filter-based feature selection methods and the SVMRFE feature selection method are used to select relevant features. In the first step, the Chi-squared filter, random forest filter and information gain filter are applied to the dataset. The scores from the three filters are normalized to a common scale, the interval (0, 1), and a cumulative ranking over the three filter results is computed. The attribute subset whose cumulative ranking satisfies the chosen threshold value is selected. In the second step, the SVMRFE feature selection method is applied to the dataset after the attributes selected in the first step have been removed, and the top attributes according to a threshold criterion are selected. The union of the attributes from the first and second steps is used for classification with various models: random forest (rf) [12], Support Vector Machine-Radial (svmRadial) [13], Linear Discriminant Analysis (LDA) [14], JRip [15], Recursive Partitioning and Regression Trees (rpart) [16], J48 [17] and Logistic Model Trees (LMT) [18]. The framework of the proposed method is given in Figure 1.
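The normalization step, scaling each filter's scores onto a common (0, 1) interval, is plain min-max scaling. A hypothetical Python sketch (the attribute names and scores are invented for illustration):

```python
def min_max_normalize(scores):
    """Min-max scale a dict of feature scores onto [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    return {f: (s - lo) / (hi - lo) for f, s in scores.items()}

# Hypothetical chi-squared scores for four attributes
chi_scores = {"radius_worst": 30.0, "area_worst": 25.0,
              "texture_mean": 5.0, "smoothness_se": 1.0}
print(min_max_normalize(chi_scores))
# the best-scoring feature maps to 1.0, the worst to 0.0
```

Putting all three filters on the same scale is what makes their scores comparable, so that summing them into a cumulative ranking is meaningful.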

Figure 1 Proposed Ensemble Feature Selection Framework of SVMFILEFS

2.4. SVMFILEFS - Our Proposed Ensemble Feature Selection Method

The algorithm of the SVMFILEFS ensemble feature selection method is described below:

Algorithm SVMFILEFS
Input: S, the source dataset; F, the entire feature set with features f1, f2, ..., fn
Output: Fselect, the best selected feature subset

Step 1. Initialize the training dataset S.
Step 2. Apply the random forest filter to S, giving Xa.
Step 3. Apply the Chi-square filter to S, giving Xb.
Step 4. Apply the information gain filter to S, giving Xc.
Step 5. Normalize the results of the three filter methods to a common scale, the interval from 0 to 1.
Step 6. Compute the cumulative ranking of the three filter results over all values of Xa, Xb and Xc.
Step 7. Select the attributes S1 whose cumulative ranking is greater than or equal to the 50% threshold value.
Step 8. Apply SVMRFE to S − S1 to select further features. Here the most important features (threshold = 10) are selected: S2 = Fsvmrfe(10).
Step 9. Fselect = S1 ∪ S2.
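Steps 7-9 might be sketched as follows in Python. The cumulative importances, the reading of the 50% threshold as half of the maximum cumulative score, and the SVMRFE ranking are illustrative assumptions, not the paper's R implementation:

```python
def svmfilefs_select(cumulative, svmrfe_ranking, k=10):
    """Two-stage selection: threshold on cumulative filter importance,
    then take the top-k SVMRFE-ranked features from the remainder.

    cumulative: dict feature -> summed normalized filter score (steps 5-6)
    svmrfe_ranking: list of features, best first, ranked over S - S1
    """
    # Step 7: keep features at or above 50% of the best cumulative score
    # (one possible reading of the paper's 50% threshold).
    cut = 0.5 * max(cumulative.values())
    s1 = {f for f, v in cumulative.items() if v >= cut}
    # Step 8: from the remaining features, take the k most important
    # according to the SVMRFE ranking.
    s2 = set([f for f in svmrfe_ranking if f not in s1][:k])
    # Step 9: the final subset is the union of both stages.
    return s1 | s2

cum = {"perimeter_worst": 2.9, "area_worst": 2.6,
       "texture_worst": 1.1, "smoothness_se": 0.4}
rank = ["texture_worst", "smoothness_se"]
print(svmfilefs_select(cum, rank, k=1))
# selects perimeter_worst and area_worst via the filters,
# plus texture_worst via SVMRFE
```

The union in step 9 is the point of the design: SVMRFE is run only on the features the filters discarded, so it can recover attributes that are weak in isolation but useful to the classifier.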

3. RESULTS AND DISCUSSION

In this paper we have implemented a novel ensemble feature selection method that integrates filter feature selection methods with the SVMRFE feature selection method to select relevant attributes for the prediction of breast cancer. The attributes selected by the proposed SVMFILEFS method, by the filters and by SVMRFE are given in Table 2, and the number of attributes selected by each feature selection method is given in Table 3. Classification was performed using the attributes generated by the different feature selection methods and by our proposed ensemble method. A comparison study was done based on the classification accuracy obtained from the classifiers random forest (rf), Support Vector Machine-Radial (svmRadial), Linear Discriminant Analysis (LDA), Recursive Partitioning and Regression Trees (rpart), J48, JRip, Logistic Model Trees (LMT) and Multi-Layer Perceptron (MLP). The classification accuracy obtained by these classifiers on the Wisconsin Breast Cancer Dataset, using the attribute subsets obtained from SVMFILEFS, SVMRFE and the filter combo, is given in Table 4.
Table 2 Attributes Selected from Each Feature Selection Method

Feature Selection | Attributes Selected | No. of Attributes (without class attribute)
Filter Combo | perimeter_worst, area_worst, radius_worst, concave_points_worst, concave_points_mean | 5
SVMFILEFS | perimeter_worst, area_worst, radius_worst, concave_points_worst, concave_points_mean, area_se, texture_worst, fractal_dimension_se, fractal_dimension_worst, concavity_mean, concave_points_se | 11
SVMRFE | radius_mean, area_mean, area_se, radius_worst, texture_worst, perimeter_worst, area_worst, smoothness_worst, concave_points_worst, symmetry_worst | 10
Without FS | all attributes | 31

From the graph shown in Figure 2, it is evident that our hybrid feature selection method SVMFILEFS is able to achieve improved classification accuracy with the different classification models. From the graph it is also clear that the classifiers LMT and MLP achieve the maximum classification accuracy among the classifiers considered in this study.

Table 3 Number of Attributes Selected Using Different Feature Selection Methods

Dataset | Number of Attributes | Names of Attributes
WBCD | 11 | perimeter_worst, concave_points_worst, radius_worst, concave_points_mean, area_worst, perimeter_mean, area_mean, radius_mean, concavity_mean, concavity_worst, area_se
HCC SURVIVAL | 6 | Alkaline_phosphatase, Performance Status, Alpha-fetoprotein, Ferritin, Hemoglobin, Iron
HEPATITIS | 6 | ascites, bilirubin, Albumin, protime, spiders, varices

Table 4 Classification Accuracy Using Different Classifiers on the WBCD Dataset

Classifier | Before FS | SVMRFE | FILTER | SVMFILEFS
rf | | | |
svmRadial | | | |
lda | | | |
JRip | | | |
rpart | | | |
J48 | | | |
LMT | | | |
MLP | | | |

From Table 4, it is clear that our approach SVMFILEFS achieves better classification accuracy than the other feature selection methods. By applying Support Vector Machine Recursive Feature Elimination (SVMRFE) to the dataset after removing the attributes obtained in the first step, additional relevant features were selected that had been ignored in the first step. From Table 4, we can also see that SVMFILEFS achieved greater classification accuracy with the rf, svmRadial, lda, rpart, LMT and MLP classifiers than the other feature selection methods on the WBCD dataset. The rf, LMT and MLP classifiers achieved the highest classification accuracy, 98%, among the classification models on the WBCD dataset using SVMFILEFS.

4. IMPLEMENTATION OF SVMFILEFS AS A WEB APPLICATION

SVMFILEFS was implemented as a web application using Shiny [19], an R package for building interactive web applications, along with RStudio [20]. The ensemble-based feature selection was implemented using two functions, filter_FS and svmrfe_FS. First, filter_FS selects attributes using the chi-square, random forest and information gain filters; second, svmrfe_FS selects the next best subset of attributes. The web application demonstrates the implementation of SVMFILEFS on the WBCD dataset and can be accessed at shinyapps.io/shiny/.

Figure 2 Graph depicting the classification accuracy of the different feature selection methods

5. CONCLUSION

This paper presented an ensemble-based feature selection method that combines the outputs of multiple filter-based feature selection methods (Random Forest, Information Gain and Chi-squared) with the SVMRFE feature selection method to generate a best attribute subset, which achieved higher classification accuracy with random forest (rf), Support Vector Machine-Radial (svmRadial), Linear Discriminant Analysis (LDA), Recursive Partitioning and Regression Trees (rpart), J48, JRip, Logistic Model Trees (LMT) and Multi-Layer Perceptron (MLP).

REFERENCES

[1] Neumann, U. (2017). Stability and accuracy analysis of a feature selection ensemble for binary classification in biomedical datasets.
[2] Liu, J., Ranka, S. and Kahveci, T. Classification and feature selection algorithms for multi-class CGH data. Bioinformatics 24 (13) (2008) i86-i95.
[3] Zhou, X. and Tuck, D.P. MSVM-RFE: extensions of SVM-RFE for multiclass gene selection on DNA microarray data. Bioinformatics 23 (9) (2007).
[4] R Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.
[5] Rudnicki, W.R., Wrzesien, M. and Paja, W. (2015). All Relevant Feature Selection Methods and Applications. In: Stanczyk, U. and Jain, L. (eds) Feature Selection for Data and Pattern Recognition. Studies in Computational Intelligence, vol 584. Springer, Berlin, Heidelberg.
[6] Nissim, N., Moskovitch, R., Rokach, L. and Elovici, Y. Detecting unknown computer worm activity via support vector machines and active learning. Pattern Analysis and Applications 15(4) (2012).
[7] Setiono, R. and Liu, H. (1996). Improving Backpropagation learning with feature selection. Applied Intelligence: The International Journal of Artificial Intelligence, Neural Networks, and Complex Problem-Solving Technologies 6.
[8] Romanski, P. (2009). FSelector: Selecting Attributes.
R package version 0.18.
[9] Wang, H., Yang, F. and Luo, Z. An Experimental Study of the Intrinsic Stability of Random Forest Variable Importance Measures. BMC Bioinformatics 17 (2016): 60.
[10] Rajput, S. and Saxena, S. Combining Pruned Tree Classifiers with Feature Selection Strategies to Improvise Classification Accuracy. International Journal of Scientific & Engineering Research, Volume 4, Issue 12, December 2013.
[11] Charte, F. and Charte, D. Working with multilabel datasets in R: the mldr package. R Journal 7 (2) (2015).
[12] Liaw, A. and Wiener, M. Classification and Regression by randomForest. R News 2 (3) (2002).
[13] Cortes, C. and Vapnik, V. Support-vector networks. Machine Learning 20 (3) (1995).
[14] Liu, Z.P. Linear Discriminant Analysis. Encyclopedia of Systems Biology (2013).
[15] Cohen, W.W. Fast Effective Rule Induction. In: Twelfth International Conference on Machine Learning.
[16] Therneau, T. and Atkinson, B. Recursive partitioning for classification, regression and survival trees. An implementation of most of the functionality of the 1984 book by Breiman, Friedman, Olshen and Stone.

[17] Quinlan, R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.
[18] Landwehr, N., Hall, M. and Frank, E. (2005). Logistic Model Trees. Machine Learning 59(1-2).
[19] Chang, W., Cheng, J., Allaire, J.J., Xie, Y. and McPherson, J. (2015). shiny: Web Application Framework for R. Available at CRAN.R-project.org/package=shiny.
[20] RStudio (2015). RStudio: Integrated Development Environment for R [Computer software]. Boston, MA.


More information

Supervised Learning Classification Algorithms Comparison

Supervised Learning Classification Algorithms Comparison Supervised Learning Classification Algorithms Comparison Aditya Singh Rathore B.Tech, J.K. Lakshmipat University -------------------------------------------------------------***---------------------------------------------------------

More information

BITS F464: MACHINE LEARNING

BITS F464: MACHINE LEARNING BITS F464: MACHINE LEARNING Lecture-16: Decision Tree (contd.) + Random Forest Dr. Kamlesh Tiwari Assistant Professor Department of Computer Science and Information Systems Engineering, BITS Pilani, Rajasthan-333031

More information

The Application of high-dimensional Data Classification by Random Forest based on Hadoop Cloud Computing Platform

The Application of high-dimensional Data Classification by Random Forest based on Hadoop Cloud Computing Platform 385 A publication of CHEMICAL ENGINEERING TRANSACTIONS VOL. 51, 2016 Guest Editors: Tichun Wang, Hongyang Zhang, Lei Tian Copyright 2016, AIDIC Servizi S.r.l., ISBN 978-88-95608-43-3; ISSN 2283-9216 The

More information

Induction of Multivariate Decision Trees by Using Dipolar Criteria

Induction of Multivariate Decision Trees by Using Dipolar Criteria Induction of Multivariate Decision Trees by Using Dipolar Criteria Leon Bobrowski 1,2 and Marek Krȩtowski 1 1 Institute of Computer Science, Technical University of Bia lystok, Poland 2 Institute of Biocybernetics

More information

Information theory methods for feature selection

Information theory methods for feature selection Information theory methods for feature selection Zuzana Reitermanová Department of Computer Science Faculty of Mathematics and Physics Charles University in Prague, Czech Republic Diplomový a doktorandský

More information

SSV Criterion Based Discretization for Naive Bayes Classifiers

SSV Criterion Based Discretization for Naive Bayes Classifiers SSV Criterion Based Discretization for Naive Bayes Classifiers Krzysztof Grąbczewski kgrabcze@phys.uni.torun.pl Department of Informatics, Nicolaus Copernicus University, ul. Grudziądzka 5, 87-100 Toruń,

More information

Supervised vs unsupervised clustering

Supervised vs unsupervised clustering Classification Supervised vs unsupervised clustering Cluster analysis: Classes are not known a- priori. Classification: Classes are defined a-priori Sometimes called supervised clustering Extract useful

More information

Network Traffic Measurements and Analysis

Network Traffic Measurements and Analysis DEIB - Politecnico di Milano Fall, 2017 Sources Hastie, Tibshirani, Friedman: The Elements of Statistical Learning James, Witten, Hastie, Tibshirani: An Introduction to Statistical Learning Andrew Ng:

More information

Feature Selection Using Modified-MCA Based Scoring Metric for Classification

Feature Selection Using Modified-MCA Based Scoring Metric for Classification 2011 International Conference on Information Communication and Management IPCSIT vol.16 (2011) (2011) IACSIT Press, Singapore Feature Selection Using Modified-MCA Based Scoring Metric for Classification

More information

Categorization of Sequential Data using Associative Classifiers

Categorization of Sequential Data using Associative Classifiers Categorization of Sequential Data using Associative Classifiers Mrs. R. Meenakshi, MCA., MPhil., Research Scholar, Mrs. J.S. Subhashini, MCA., M.Phil., Assistant Professor, Department of Computer Science,

More information

Logistic Model Tree With Modified AIC

Logistic Model Tree With Modified AIC Logistic Model Tree With Modified AIC Mitesh J. Thakkar Neha J. Thakkar Dr. J.S.Shah Student of M.E.I.T. Asst.Prof.Computer Dept. Prof.&Head Computer Dept. S.S.Engineering College, Indus Engineering College

More information

An Effective Performance of Feature Selection with Classification of Data Mining Using SVM Algorithm

An Effective Performance of Feature Selection with Classification of Data Mining Using SVM Algorithm Proceedings of the National Conference on Recent Trends in Mathematical Computing NCRTMC 13 427 An Effective Performance of Feature Selection with Classification of Data Mining Using SVM Algorithm A.Veeraswamy

More information

Decision Tree CE-717 : Machine Learning Sharif University of Technology

Decision Tree CE-717 : Machine Learning Sharif University of Technology Decision Tree CE-717 : Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Some slides have been adapted from: Prof. Tom Mitchell Decision tree Approximating functions of usually discrete

More information

S. Sreenivasan Research Scholar, School of Advanced Sciences, VIT University, Chennai Campus, Vandalur-Kelambakkam Road, Chennai, Tamil Nadu, India

S. Sreenivasan Research Scholar, School of Advanced Sciences, VIT University, Chennai Campus, Vandalur-Kelambakkam Road, Chennai, Tamil Nadu, India International Journal of Civil Engineering and Technology (IJCIET) Volume 9, Issue 10, October 2018, pp. 1322 1330, Article ID: IJCIET_09_10_132 Available online at http://www.iaeme.com/ijciet/issues.asp?jtype=ijciet&vtype=9&itype=10

More information

Class Prediction Methods Applied to Microarray Data for Classification

Class Prediction Methods Applied to Microarray Data for Classification Class Prediction Methods Applied to Microarray Data for Classification Fatima.S. Shukir The Department of Statistic, Iraqi Commission for Planning and Follow up Directorate Computers and Informatics (ICCI),

More information

Improving Quality of Products in Hard Drive Manufacturing by Decision Tree Technique

Improving Quality of Products in Hard Drive Manufacturing by Decision Tree Technique www.ijcsi.org 29 Improving Quality of Products in Hard Drive Manufacturing by Decision Tree Technique Anotai Siltepavet 1, Sukree Sinthupinyo 2 and Prabhas Chongstitvatana 3 1 Computer Engineering, Chulalongkorn

More information

8. Tree-based approaches

8. Tree-based approaches Foundations of Machine Learning École Centrale Paris Fall 2015 8. Tree-based approaches Chloé-Agathe Azencott Centre for Computational Biology, Mines ParisTech chloe agathe.azencott@mines paristech.fr

More information

Business Club. Decision Trees

Business Club. Decision Trees Business Club Decision Trees Business Club Analytics Team December 2017 Index 1. Motivation- A Case Study 2. The Trees a. What is a decision tree b. Representation 3. Regression v/s Classification 4. Building

More information

Decision tree learning

Decision tree learning Decision tree learning Andrea Passerini passerini@disi.unitn.it Machine Learning Learning the concept Go to lesson OUTLOOK Rain Overcast Sunny TRANSPORTATION LESSON NO Uncovered Covered Theoretical Practical

More information

International Journal of Computer Science Trends and Technology (IJCST) Volume 5 Issue 4, Jul Aug 2017

International Journal of Computer Science Trends and Technology (IJCST) Volume 5 Issue 4, Jul Aug 2017 International Journal of Computer Science Trends and Technology (IJCST) Volume 5 Issue 4, Jul Aug 17 RESEARCH ARTICLE OPEN ACCESS Classifying Brain Dataset Using Classification Based Association Rules

More information

Robustness of Selective Desensitization Perceptron Against Irrelevant and Partially Relevant Features in Pattern Classification

Robustness of Selective Desensitization Perceptron Against Irrelevant and Partially Relevant Features in Pattern Classification Robustness of Selective Desensitization Perceptron Against Irrelevant and Partially Relevant Features in Pattern Classification Tomohiro Tanno, Kazumasa Horie, Jun Izawa, and Masahiko Morita University

More information

Classification and Optimization using RF and Genetic Algorithm

Classification and Optimization using RF and Genetic Algorithm International Journal of Management, IT & Engineering Vol. 8 Issue 4, April 2018, ISSN: 2249-0558 Impact Factor: 7.119 Journal Homepage: Double-Blind Peer Reviewed Refereed Open Access International Journal

More information

Improving the Random Forest Algorithm by Randomly Varying the Size of the Bootstrap Samples for Low Dimensional Data Sets

Improving the Random Forest Algorithm by Randomly Varying the Size of the Bootstrap Samples for Low Dimensional Data Sets Improving the Random Forest Algorithm by Randomly Varying the Size of the Bootstrap Samples for Low Dimensional Data Sets Md Nasim Adnan and Md Zahidul Islam Centre for Research in Complex Systems (CRiCS)

More information

Improving Tree-Based Classification Rules Using a Particle Swarm Optimization

Improving Tree-Based Classification Rules Using a Particle Swarm Optimization Improving Tree-Based Classification Rules Using a Particle Swarm Optimization Chi-Hyuck Jun *, Yun-Ju Cho, and Hyeseon Lee Department of Industrial and Management Engineering Pohang University of Science

More information

Performance Analysis of Data Mining Classification Techniques

Performance Analysis of Data Mining Classification Techniques Performance Analysis of Data Mining Classification Techniques Tejas Mehta 1, Dr. Dhaval Kathiriya 2 Ph.D. Student, School of Computer Science, Dr. Babasaheb Ambedkar Open University, Gujarat, India 1 Principal

More information

Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques

Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques 24 Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques Ruxandra PETRE

More information

Comparison of Statistical Learning and Predictive Models on Breast Cancer Data and King County Housing Data

Comparison of Statistical Learning and Predictive Models on Breast Cancer Data and King County Housing Data Comparison of Statistical Learning and Predictive Models on Breast Cancer Data and King County Housing Data Yunjiao Cai 1, Zhuolun Fu, Yuzhe Zhao, Yilin Hu, Shanshan Ding Department of Applied Economics

More information

Performance Evaluation of Various Classification Algorithms

Performance Evaluation of Various Classification Algorithms Performance Evaluation of Various Classification Algorithms Shafali Deora Amritsar College of Engineering & Technology, Punjab Technical University -----------------------------------------------------------***----------------------------------------------------------

More information

Hybrid Correlation and Causal Feature Selection for Ensemble Classifiers

Hybrid Correlation and Causal Feature Selection for Ensemble Classifiers Hybrid Correlation and Causal Feature Selection for Ensemble Classifiers Rakkrit Duangsoithong and Terry Windeatt Centre for Vision, Speech and Signal Processing University of Surrey Guildford, United

More information

Stability of Feature Selection Algorithms

Stability of Feature Selection Algorithms Stability of Feature Selection Algorithms Alexandros Kalousis, Jullien Prados, Phong Nguyen Melanie Hilario Artificial Intelligence Group Department of Computer Science University of Geneva Stability of

More information

Data Mining Lecture 8: Decision Trees

Data Mining Lecture 8: Decision Trees Data Mining Lecture 8: Decision Trees Jo Houghton ECS Southampton March 8, 2019 1 / 30 Decision Trees - Introduction A decision tree is like a flow chart. E. g. I need to buy a new car Can I afford it?

More information

Outlier Detection and Removal Algorithm in K-Means and Hierarchical Clustering

Outlier Detection and Removal Algorithm in K-Means and Hierarchical Clustering World Journal of Computer Application and Technology 5(2): 24-29, 2017 DOI: 10.13189/wjcat.2017.050202 http://www.hrpub.org Outlier Detection and Removal Algorithm in K-Means and Hierarchical Clustering

More information

Ensemble Methods, Decision Trees

Ensemble Methods, Decision Trees CS 1675: Intro to Machine Learning Ensemble Methods, Decision Trees Prof. Adriana Kovashka University of Pittsburgh November 13, 2018 Plan for This Lecture Ensemble methods: introduction Boosting Algorithm

More information

A Classifier with the Function-based Decision Tree

A Classifier with the Function-based Decision Tree A Classifier with the Function-based Decision Tree Been-Chian Chien and Jung-Yi Lin Institute of Information Engineering I-Shou University, Kaohsiung 84008, Taiwan, R.O.C E-mail: cbc@isu.edu.tw, m893310m@isu.edu.tw

More information

Analyzing Outlier Detection Techniques with Hybrid Method

Analyzing Outlier Detection Techniques with Hybrid Method Analyzing Outlier Detection Techniques with Hybrid Method Shruti Aggarwal Assistant Professor Department of Computer Science and Engineering Sri Guru Granth Sahib World University. (SGGSWU) Fatehgarh Sahib,

More information

Supervised Learning. Decision trees Artificial neural nets K-nearest neighbor Support vectors Linear regression Logistic regression...

Supervised Learning. Decision trees Artificial neural nets K-nearest neighbor Support vectors Linear regression Logistic regression... Supervised Learning Decision trees Artificial neural nets K-nearest neighbor Support vectors Linear regression Logistic regression... Supervised Learning y=f(x): true function (usually not known) D: training

More information

Data Mining. Decision Tree. Hamid Beigy. Sharif University of Technology. Fall 1396

Data Mining. Decision Tree. Hamid Beigy. Sharif University of Technology. Fall 1396 Data Mining Decision Tree Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 1 / 24 Table of contents 1 Introduction 2 Decision tree

More information

Evaluating the SVM Component in Oracle 10g Beta

Evaluating the SVM Component in Oracle 10g Beta Evaluating the SVM Component in Oracle 10g Beta Dept. of Computer Science and Statistics University of Rhode Island Technical Report TR04-299 Lutz Hamel and Angela Uvarov Department of Computer Science

More information

Proximity Prestige using Incremental Iteration in Page Rank Algorithm

Proximity Prestige using Incremental Iteration in Page Rank Algorithm Indian Journal of Science and Technology, Vol 9(48), DOI: 10.17485/ijst/2016/v9i48/107962, December 2016 ISSN (Print) : 0974-6846 ISSN (Online) : 0974-5645 Proximity Prestige using Incremental Iteration

More information

Hybrid Approach for Classification using Support Vector Machine and Decision Tree

Hybrid Approach for Classification using Support Vector Machine and Decision Tree Hybrid Approach for Classification using Support Vector Machine and Decision Tree Anshu Bharadwaj Indian Agricultural Statistics research Institute New Delhi, India anshu@iasri.res.in Sonajharia Minz Jawaharlal

More information

Published by: PIONEER RESEARCH & DEVELOPMENT GROUP ( 1

Published by: PIONEER RESEARCH & DEVELOPMENT GROUP (  1 Cluster Based Speed and Effective Feature Extraction for Efficient Search Engine Manjuparkavi A 1, Arokiamuthu M 2 1 PG Scholar, Computer Science, Dr. Pauls Engineering College, Villupuram, India 2 Assistant

More information

Trade-offs in Explanatory

Trade-offs in Explanatory 1 Trade-offs in Explanatory 21 st of February 2012 Model Learning Data Analysis Project Madalina Fiterau DAP Committee Artur Dubrawski Jeff Schneider Geoff Gordon 2 Outline Motivation: need for interpretable

More information

Using Decision Boundary to Analyze Classifiers

Using Decision Boundary to Analyze Classifiers Using Decision Boundary to Analyze Classifiers Zhiyong Yan Congfu Xu College of Computer Science, Zhejiang University, Hangzhou, China yanzhiyong@zju.edu.cn Abstract In this paper we propose to use decision

More information

ICA as a preprocessing technique for classification

ICA as a preprocessing technique for classification ICA as a preprocessing technique for classification V.Sanchez-Poblador 1, E. Monte-Moreno 1, J. Solé-Casals 2 1 TALP Research Center Universitat Politècnica de Catalunya (Catalonia, Spain) enric@gps.tsc.upc.es

More information

A Heart Disease Risk Prediction System Based On Novel Technique Stratified Sampling

A Heart Disease Risk Prediction System Based On Novel Technique Stratified Sampling IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661, p- ISSN: 2278-8727Volume 16, Issue 2, Ver. X (Mar-Apr. 2014), PP 32-37 A Heart Disease Risk Prediction System Based On Novel Technique

More information

Classification: Decision Trees

Classification: Decision Trees Classification: Decision Trees IST557 Data Mining: Techniques and Applications Jessie Li, Penn State University 1 Decision Tree Example Will a pa)ent have high-risk based on the ini)al 24-hour observa)on?

More information

REMOVAL OF REDUNDANT AND IRRELEVANT DATA FROM TRAINING DATASETS USING SPEEDY FEATURE SELECTION METHOD

REMOVAL OF REDUNDANT AND IRRELEVANT DATA FROM TRAINING DATASETS USING SPEEDY FEATURE SELECTION METHOD Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology ISSN 2320 088X IMPACT FACTOR: 5.258 IJCSMC,

More information

Lecture 2 :: Decision Trees Learning

Lecture 2 :: Decision Trees Learning Lecture 2 :: Decision Trees Learning 1 / 62 Designing a learning system What to learn? Learning setting. Learning mechanism. Evaluation. 2 / 62 Prediction task Figure 1: Prediction task :: Supervised learning

More information

Data Mining. 3.2 Decision Tree Classifier. Fall Instructor: Dr. Masoud Yaghini. Chapter 5: Decision Tree Classifier

Data Mining. 3.2 Decision Tree Classifier. Fall Instructor: Dr. Masoud Yaghini. Chapter 5: Decision Tree Classifier Data Mining 3.2 Decision Tree Classifier Fall 2008 Instructor: Dr. Masoud Yaghini Outline Introduction Basic Algorithm for Decision Tree Induction Attribute Selection Measures Information Gain Gain Ratio

More information

Information-Theoretic Feature Selection Algorithms for Text Classification

Information-Theoretic Feature Selection Algorithms for Text Classification Proceedings of International Joint Conference on Neural Networks, Montreal, Canada, July 31 - August 4, 5 Information-Theoretic Feature Selection Algorithms for Text Classification Jana Novovičová Institute

More information

Didacticiel - Études de cas. Comparison of the implementation of the CART algorithm under Tanagra and R (rpart package).

Didacticiel - Études de cas. Comparison of the implementation of the CART algorithm under Tanagra and R (rpart package). 1 Theme Comparison of the implementation of the CART algorithm under Tanagra and R (rpart package). CART (Breiman and al., 1984) is a very popular classification tree (says also decision tree) learning

More information

Sandeep Kharidhi and WenSui Liu ChoicePoint Precision Marketing

Sandeep Kharidhi and WenSui Liu ChoicePoint Precision Marketing Generalized Additive Model and Applications in Direct Marketing Sandeep Kharidhi and WenSui Liu ChoicePoint Precision Marketing Abstract Logistic regression 1 has been widely used in direct marketing applications

More information

Machine Learning Methods for Ship Detection in Satellite Images

Machine Learning Methods for Ship Detection in Satellite Images Machine Learning Methods for Ship Detection in Satellite Images Yifan Li yil150@ucsd.edu Huadong Zhang huz095@ucsd.edu Qianfeng Guo qig020@ucsd.edu Xiaoshi Li xil758@ucsd.edu Abstract In this project,

More information

Classification of Hand-Written Numeric Digits

Classification of Hand-Written Numeric Digits Classification of Hand-Written Numeric Digits Nyssa Aragon, William Lane, Fan Zhang December 12, 2013 1 Objective The specific hand-written recognition application that this project is emphasizing is reading

More information

Classification and Regression Trees

Classification and Regression Trees Classification and Regression Trees Matthew S. Shotwell, Ph.D. Department of Biostatistics Vanderbilt University School of Medicine Nashville, TN, USA March 16, 2018 Introduction trees partition feature

More information

Three Embedded Methods

Three Embedded Methods Embedded Methods Review Wrappers Evaluation via a classifier, many search procedures possible. Slow. Often overfits. Filters Use statistics of the data. Fast but potentially naive. Embedded Methods A

More information