Comparison of Various Feature Selection Methods in Application to Prototype Best Rules

Marcin Blachnik
Silesian University of Technology, Electrotechnology Department, Krasinskiego 8, Katowice, Poland
marcin.blachnik@polsl.pl

Summary. Prototype-based rules are an interesting tool for data analysis. However, most prototype selection methods, such as the CFCM+LVQ algorithm, do not have embedded feature selection and require feature selection as an initial preprocessing step. The question that arises is which feature selection method should be used with the CFCM+LVQ prototype selection method, and what advantages and disadvantages particular solutions have. The analysis of these problems is based on empirical data analysis.¹

1 Introduction

In the field of computational intelligence there exist many methods, such as SVM, that provide good classification or regression performance; unfortunately, they do not allow us to understand how they reach their decisions. On the other hand, fuzzy modeling can be very helpful, providing flexible tools that can mimic the data; however, fuzzy models are restricted to continuous or ordinal attributes. An alternative to both of these groups are similarity-based methods [1], which on the one hand build on various machine learning techniques such as the nearest prototype classifier (NPC), and on the other hand can be seen as a generalization of fuzzy rule-based systems (F-rules), leading to prototype (similarity) based logical rules (P-rules) [2]. One of the aims of any rule-based system is the comprehensibility of the obtained rules, and in P-rule systems this leads to the problem of selecting a possibly small set of prototypes. This goal can be achieved using one of the prototype selection methods. An example of such an algorithm that provides very good quality of results is the CFCM+LVQ algorithm [3]. However, this algorithm does not have any embedded feature selection. In any rule-based system feature selection is a very important issue, so in this paper the combination of P-rules and various feature selection techniques is considered.

¹ Project partially sponsored by the grant No. PBU-47/RM3/07 from the Polish Ministry of Education and Science (MNiSZW).

Feature selection methods are usually divided into three groups: filters, which perform feature selection independently of the inductive algorithm; wrappers, where the inductive algorithm is used as the evaluation function; and embedded methods, where feature selection is built into the inductive algorithm. In this paper various methods belonging to the first two groups are compared in application to P-rules. The next section describes the CFCM+LVQ algorithm, section 3 presents different approaches to feature selection, and section 4 describes our experiments. The last section concludes the paper, pointing out advantages and disadvantages of the different feature selection techniques.

2 CFCM+LVQ algorithm

The simplest prototype selection methods are obtained by clustering the dataset and taking the cluster centers as prototypes. However, in this approach the information about the mutual relations between class distributions is ignored. A possible solution are semi-supervised clustering methods, which can acquire external knowledge, for example information describing the mutual class distributions. Generally, context-dependent clustering methods obtain additional information from an external variable f_k defined for every k-th training vector. This variable determines the so-called clustering context, describing the importance of a given training vector. A solution used for building a P-rule system was proposed by Blachnik et al. in [3], where the f_k variable is obtained by first calculating the w_k coefficient:

w_k = \frac{\sum_{j,\,C(x_j)=C(x_k)} \|x_k - x_j\|^2}{\sum_{l,\,C(x_l)\neq C(x_k)} \|x_k - x_l\|^2}    (1)

where C(x) is a function returning the class label of vector x. The obtained w_k value is normalized to fit the range [0, 1]. The w_k parameter describes the mutual position of each vector and its neighbors; after normalization it takes values around 0 for training vectors close to homogeneous class centers, and close to 1 for vectors x_k far from class centers and close to vectors of the opposite class. In other words, large values of w_k represent positions close to the border, and very large values, close to 1, indicate outliers. The coefficient w_k defined this way requires further preprocessing to determine the area of interest, that is, the grouping context for the CFCM algorithm. This preprocessing step can be interpreted as defining a linguistic value for the w_k variable in the fuzzy set sense. In [3] the Gaussian function has been used:

f_k = \exp\left(-\gamma (w_k - \mu)^2\right)    (2)

where µ defines the position of the clustering area, and γ is the spread of this area.
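As an illustration of Eqs. (1) and (2), a minimal sketch of the context computation is given below. It assumes NumPy arrays X (training vectors) and y (class labels) and example values of µ and γ; it is not the implementation used in the paper.

```python
import numpy as np

def clustering_context(X, y, mu=0.6, gamma=4.0):
    """Compute the normalized w_k coefficients of Eq. (1) and the
    Gaussian clustering context f_k of Eq. (2) for every training vector."""
    n = X.shape[0]
    # squared Euclidean distances between all pairs of training vectors
    d2 = np.square(X[:, None, :] - X[None, :, :]).sum(axis=2)
    w = np.empty(n)
    for k in range(n):
        same = (y == y[k])
        # within-class distances over between-class distances, Eq. (1)
        w[k] = d2[k, same].sum() / d2[k, ~same].sum()
    # normalize w_k to the [0, 1] range, as required by the method
    w = (w - w.min()) / (w.max() - w.min())
    # Gaussian "linguistic value", Eq. (2): mu positions the area of
    # interest on the w axis, gamma controls its spread
    f = np.exp(-gamma * (w - mu) ** 2)
    return w, f
```

With µ chosen near the border region of the w axis, vectors close to the decision boundary receive the largest context values and dominate the subsequent CFCM clustering.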

The prototypes obtained via clustering can be further optimized using an LVQ-based algorithm. This combination of two independent algorithms can be seen as a tandem that allows obtaining an accurate model with a small set of prototypes. Another problem facing a P-rules system is determining the number of prototypes. It can be solved with acceptable computational complexity by observing that in prototype selection algorithms the overall system accuracy mostly depends on the classification accuracy of the least accurate class. On this basis we can conclude that the overall classification accuracy can be improved by improving the worst classified class. This assumption is the basis for the racing algorithm, where a new prototype is iteratively added to the class with the highest error rate (C_i^err) [3].

3 Feature selection methods

3.1 Ranking methods

Ranking methods are among the fastest feature selection methods. They are based on a coefficient J(·) describing the relation between each input variable f and the output variable C. The coefficient is computed for each attribute, and the attributes are then sorted by its value from the most to the least important. In the last step, according to some previously selected accuracy measure, the n best features are selected as the final feature subset used to build the final model. Ranking methods use different coefficients to estimate the quality of each feature. One possible choice are criterion functions based on statistical and information-theoretic coefficients. One example of such a metric is the normalized information gain, also known as the asymmetric dependency coefficient (ADC) [4], described as

ADC(C, f) = \frac{MI(C, f)}{H(C)}    (3)

where H(C) and H(f) are the class and feature entropies, and MI(C, f) is the mutual information between class C and feature f, defined by Shannon [5] as:

H(C) = -\sum_{i=1}^{c} p(C_i) \log_2 p(C_i),  H(f) = -\sum_{x} p(f = x) \log_2 p(f = x),  MI(C, f) = H(C) + H(f) - H(C, f)    (4)

In this formula the sum over x for feature f requires discrete values; for numerical ordered features the sum has to be replaced by an integral, or an initial discretization has to be performed to estimate the probabilities p(f = x). Another metric was proposed by Setiono [6] and is called the normalized gain ratio (5):

U_S(C, f) = \frac{MI(C, f)}{H(f)}    (5)
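For illustration, the coefficients (3)-(5) can be estimated from discrete (or already discretized) data along the following lines; this is only a sketch assuming integer-coded feature values and class labels, not code from the paper or from Infosel++.

```python
import numpy as np

def entropy(values):
    """Shannon entropy (base 2) estimated from value frequencies."""
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def mutual_information(f, c):
    """MI(C, f) = H(C) + H(f) - H(C, f) for discrete 1-D arrays f and c."""
    # encode each (feature value, class) pair as a single joint symbol
    _, joint = np.unique(np.stack([f, c], axis=1), axis=0, return_inverse=True)
    return entropy(f) + entropy(c) - entropy(joint)

def adc(f, c):
    """Asymmetric dependency coefficient, Eq. (3)."""
    return mutual_information(f, c) / entropy(c)

def normalized_gain_ratio(f, c):
    """Setiono's normalized gain ratio, Eq. (5)."""
    return mutual_information(f, c) / entropy(f)
```

Features are then ranked by the chosen coefficient and only the n best are kept.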

Another possible normalization of the information gain is a metric based on the joint feature-class entropy (6):

U_H(C, f) = \frac{MI(C, f)}{H(f, C)}    (6)

where H(f, C) is the joint entropy of variable f and class C. Mantaras [7] suggested a criterion D_ML fulfilling the distance metric axioms (7):

D_{ML}(f_i, C) = H(f_i|C) + H(C|f_i)    (7)

where H(f_i|C) and H(C|f_i) are conditional entropies defined by Shannon [5] as H(X|Y) = H(X, Y) - H(Y). The index of weighted joint entropy was proposed by Chi [8] as (8):

Ch(f) = -\sum_{k=1}^{N} p(f = x_k) \sum_{i=1}^{K} p(f = x_k, C_i) \log_2 p(f = x_k, C_i)    (8)

An alternative to the information-theoretic methods presented so far is the χ² statistic, which measures the relation between two random variables, here between a feature and the class. The χ² statistic can be defined as (9):

\chi^2(f, C) = \sum_{i,j} \frac{\left(p(f = x_j, C_i) - p(f = x_j)\,p(C_i)\right)^2}{p(f = x_j)\,p(C_i)}    (9)

where p(·) is the appropriate probability. High values of χ² represent a strong correlation between a given feature and the class variable, which can be used for feature ranking.

3.2 Search based feature selection

The advantage of search-based feature selection methods over rankings is usually more accurate results. These methods are based on stochastic as well as heuristic search strategies, which implies higher computational complexity; for very large datasets (e.g., those provided during the NIPS 2003 challenge) with a few thousand variables this may limit the usability of some algorithms. Typical search-based feature selection solutions are the forward/backward selection methods.

Forward selection

Forward selection starts from an empty feature set and in each iteration adds one new attribute from the set of remaining ones. The feature added is the one that maximizes a certain criterion, usually classification accuracy. To ensure a reliable assessment of adding a new feature to the feature subset, the quality is measured with cross-validation.
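A generic forward selection wrapper of this kind could look like the sketch below. It assumes a scikit-learn style classifier and cross-validated accuracy as the evaluation function; the k-NN classifier is only a stand-in for the CFCM+LVQ model used in the paper.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier  # stand-in for CFCM+LVQ

def forward_selection(X, y, estimator=None, cv=5):
    """Greedy forward selection: repeatedly add the feature that most
    improves cross-validated accuracy, stopping when no feature helps."""
    if estimator is None:
        estimator = KNeighborsClassifier(n_neighbors=3)
    selected, best_score = [], -np.inf
    remaining = list(range(X.shape[1]))
    while remaining:
        scores = [(cross_val_score(estimator, X[:, selected + [j]], y,
                                   cv=cv).mean(), j) for j in remaining]
        score, j = max(scores)
        if score <= best_score:      # stop at the first local maximum
            break
        best_score = score
        selected.append(j)
        remaining.remove(j)
    return selected, best_score
```

Backward elimination, described next, works analogously but starts from the full feature set and removes features.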

Backward elimination

The backward elimination algorithm differs from forward selection in that it starts from the full feature set and iteratively removes features one by one. In each iteration the single feature whose removal has the best effect on overall model accuracy is discarded, and the process is repeated until the accuracy stops improving.

3.3 Embedded feature selection algorithms used as filters

Some data mining algorithms have built-in feature selection (embedded feature selection). An example of this kind of solution are decision trees, which automatically determine an optimal feature subset while building the tree. These methods can also be used as external feature selection methods, called feature filters, by extracting the knowledge about the selected attribute subset from the data mining model. In decision trees this is equivalent to visiting every node of the tree and collecting the attributes considered for testing. Used as an external tool, this approach has one important advantage over ranking methods: it considers not only the relation between a single input attribute and the output attribute, but also searches locally for attributes that allow good local discrimination.

4 Datasets and results

4.1 Datasets used in the experiments

To verify the quality of the previously described feature selection techniques, experiments have been performed on several datasets from the UCI repository. From this repository 4 classification datasets have been used for validating ranking methods and 10 datasets for the search-based methods and tree-based feature selection. The selected datasets are: Appendicitis, Wine, Pima Indian diabetes, Ionosphere, Cleveland Heart Disease, Iris, Sonar, BUPA liver disorders, Wisconsin breast cancer, and Lancet.

4.2 Experiments and results

To compare the algorithms described above, empirical tests were performed on the datasets presented in the previous section. In all tests the Infosel++ library was used to perform feature selection, and as the induction algorithm a self-implemented CFCM+LVQ algorithm written for the Matlab Spider Toolbox was used.
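Before presenting the results, and to make the tree-based filter of Section 3.3 concrete, the sketch below shows one possible realization; it is a hypothetical reconstruction using scikit-learn's DecisionTreeClassifier, not the tool used in the experiments. It fits a tree and returns the indices of the attributes tested in its internal nodes.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def tree_selected_features(X, y, **tree_params):
    """Fit a decision tree and collect the indices of the features that
    appear in at least one split node (the tree-based feature filter)."""
    tree = DecisionTreeClassifier(random_state=0, **tree_params).fit(X, y)
    split_features = tree.tree_.feature       # leaves are marked with -2
    used = np.unique(split_features[split_features >= 0])
    return [int(i) for i in used]
```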

Ranking methods

The results obtained for the various ranking methods displayed different behavior and different ranking orders. These differences appear not only across ranking coefficients but also across the type of discretization performed, which was used to estimate all probabilities such as p(f = x_j, C_i). As the discretization, a simple equal-width method with 5, 10 and 15 bins was used. For the final comparison of results, the number of bins that maximized classification accuracy was used for each coefficient and each dataset. The results obtained for the CFCM+LVQ algorithm are presented in Fig. 1. Each panel shows the relation between the number n of best-ranked features and the obtained classification accuracy.

[Fig. 1. Results of ranking algorithms and the CFCM+LVQ P-rules system. Panels: (a) Ionosphere, (b) Heart Disease, (c) Pima Indians, (d) Wisconsin Breast Cancer. Each panel plots classification accuracy (Acc [%]) against the number of top-ranked features (fi [-]) for the ADC, U_S, U_H, D_ML, Chi, χ² and correlation (Corr) coefficients.]
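The equal-width discretization used above can be reproduced with a few lines of code. The sketch below is illustrative only: score_fn is an assumed callable standing in for the full ranking-plus-classification pipeline (it could be, for example, the adc function from the earlier sketch, or a cross-validated accuracy).

```python
import numpy as np

def equal_width_discretize(x, n_bins):
    """Map a continuous feature to integer bin codes 0 .. n_bins-1."""
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    # interior edges only, so the maximum falls into the last bin
    return np.digitize(x, edges[1:-1])

def best_bin_count(x, y, score_fn, candidates=(5, 10, 15)):
    """Return the bin count for which score_fn(discretized x, y) is highest."""
    scored = [(score_fn(equal_width_discretize(x, b), y), b) for b in candidates]
    return max(scored)[1]
```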

Search based and embedded feature selection

As in the presented results of the ranking methods, a two-stage testing process was used here as well. In the first step feature selection algorithms, i.e. forward selection, backward elimination and selection based on decision trees, were used to select the best feature subset from the whole dataset. In the second stage the classification algorithm was tested using cross-validation. The collected results for the CFCM+LVQ P-rules system are presented in Table 1, which also provides the number of selected features (f) and the number of selected prototypes (l). As the accuracy measure optimized while selecting the best feature subset, only the plain classification accuracy was used.

Table 1. Comparison of feature selection algorithms for the P-rules system

                     Backward elimination     Forward selection        Tree based selection
Dataset              Accuracy        f   l    Accuracy        f   l    Accuracy        f   l
Appendicitis         85.70 ± 13.98   6   2    84.86 ± 14.38   4   2    84.86 ± 15.85   2   2
Wine                 97.25 ±  2.90  11   3    95.06 ±  4.79   5   7    92.57 ±  6.05   6   3
Pima Indians         77.22 ±  4.36   7   3    76.96 ±  3.01   4   3    75.65 ±  2.43   2   4
Ionosphere           87.81 ±  6.90  32   6    92.10 ±  4.25   5   6    87.57 ±  7.11   2   4
Clev. Heart Dis.     85.49 ±  5.59   7   3    84.80 ±  5.23   3   2    84.80 ±  5.23   3   2
Iris                 95.33 ±  6.32   3   5    97.33 ±  3.44   2   4    97.33 ±  3.44   2   4
Sonar                66.81 ± 20.66  56   4    75.62 ± 12.97   4   3    74.62 ± 12.40   1   2
Liver disorders      67.24 ±  8.79   5   2    68.45 ±  6.40   4   2    66.71 ±  9.05   2   2
Breast Cancer        97.82 ±  1.96   7   3    97.81 ±  2.19   5   3    97.23 ±  2.51   5   4
Lancet               95.66 ±  2.36   7   4    94.65 ±  2.64   5   4    93.33 ±  3.20   4   3

5 Conclusions

When combining P-rules systems with feature selection techniques it is possible to observe a trend related to the simplicity and comprehensibility of rule systems. This issue reflects the model complexity versus accuracy dilemma. The analysis of the provided results shows that forward selection and backward elimination outperform the other methods, leading to the conclusion that search-based feature selection is the most robust. However, these methods have a big drawback in terms of computational complexity. The simplest search methods require 0.5k(2n - (k + 1)) invocations of the evaluation function, where n is the number of features and k is the number of iterations of adding or removing a feature. This is an important limitation on the applicability of these methods.

Another problem of simple search strategies like forward or backward selection is getting stuck in local minima (usually the first one encountered). This can be observed on the Liver disorders dataset, where feature selection based on decision trees gives better results than both search methods. The local minima problem can be addressed with more sophisticated search algorithms, but these require even more computational effort. The computational complexity problem does not appear in ranking methods, whose complexity is a linear function of the number of features. However, these methods may be unstable in terms of the obtained accuracy, as shown in Fig. 1(a) and Fig. 1(b), where the accuracy fluctuates with the number of selected features. Another important problem that appears in ranking-based feature selection are redundant features, which ranking alone cannot remove. A possible solution is the FCBF algorithm [9] or the algorithm proposed by Biesiada et al. in [10]. The obtained results for ranking methods do not allow selecting a single best ranking coefficient, so a possible solution are ranking committees, where the feature weight is related to the frequency of appearing at a certain position in different rankings. Another possibility is selecting ranking coefficients with the lowest computational cost, such as the correlation coefficient, which is very cheap to estimate and does not require prior discretization (discretization was used for all entropy-based ranking coefficients).

As a tool for the P-rules system, ranking algorithms should therefore be used only for large datasets, where the computational complexity of the search-based methods is prohibitively high. A compromise between ranking and search-based methods can be obtained with the filter approach realized as the features selected by a decision tree. This approach provides good classification accuracy and low complexity, of order nm log(m) (for n features and m training vectors), and this cost is independent of the complexity of the final classifier.

Acknowledgment

The author would like to thank prof. W. Duch for his help in developing P-rules based systems.

References

1. Duch, W.: Similarity-based methods: a general framework for classification, approximation and association. Control and Cybernetics 29, 937-968 (2000)
2. Duch, W., Blachnik, M.: Fuzzy rule-based systems derived from similarity to prototypes. In: Pal, N., Kasabov, N., Mudi, R., Pal, S., Parui, S. (eds.) Lecture Notes in Computer Science, vol. 3316, pp. 912-917. Springer, New York (2004)
3. Blachnik, M., Duch, W., Wieczorek, T.: Selection of prototypes rules - context searching via clustering. Lecture Notes in Artificial Intelligence 4029, 573-582 (2006)
4. Shridhar, D., Bartlett, E., Seagrave, R.: Information theoretic subset selection. Computers in Chemical Engineering 22, 613-626 (1998)
5. Shannon, C., Weaver, W.: The Mathematical Theory of Communication. University of Illinois Press (1946)
6. Setiono, R., Liu, H.: Improving backpropagation learning with feature selection. The International Journal of Artificial Intelligence, Neural Networks, and Complex Problem-Solving Technologies 6, 129-139 (1996)
7. de Mantaras, R.L.: A distance-based attribute selection measure for decision tree induction. Machine Learning 6, 81-92 (1991)
8. Chi, J.: Entropy based feature evaluation and selection technique. In: Proc. of the 4th Australian Conf. on Neural Networks (ACNN'93) (1993)
9. Yu, L., Liu, H.: Feature selection for high-dimensional data: a fast correlation-based filter solution. In: Proceedings of the Twentieth International Conference on Machine Learning (2003)
10. Duch, W., Biesiada, J.: Feature selection for high-dimensional data: a Kolmogorov-Smirnov correlation-based filter solution. In: Advances in Soft Computing, pp. 95-104. Springer (2005)