Comparison of Various Feature Selection Methods in Application to Prototype Best Rules

Marcin Blachnik
Silesian University of Technology, Electrotechnology Department, Krasinskiego 8, Katowice, Poland
marcin.blachnik@polsl.pl

Summary. Prototype-based rules are an interesting tool for data analysis. However, most prototype selection methods, such as the CFCM+LVQ algorithm, do not have embedded feature selection and require feature selection as an initial preprocessing step. The question that arises is which feature selection method should be used with the CFCM+LVQ prototype selection method, and what advantages and disadvantages particular solutions have. The analysis of these problems is based on empirical data analysis.¹

1 Introduction

In the field of computational intelligence there exist many methods, such as SVM, that provide good classification or regression performance; unfortunately, they do not allow us to understand how they reach their decisions. On the other hand, fuzzy modeling can be very helpful, providing flexible tools that can mimic the data; however, fuzzy models are restricted to continuous or ordinal attributes. An alternative to both of these groups are similarity-based methods [1], which on the one hand build on various machine learning techniques such as the nearest prototype classifier (NPC), and on the other hand can be seen as a generalization of fuzzy rule-based systems (F-rules), leading to prototype (similarity) based logical rules (P-rules) [2]. One of the aims of any rule-based system is the comprehensibility of the obtained rules, and in P-rule systems this leads to the problem of selecting a possibly small set of prototypes. This goal can be achieved using one of the prototype selection methods. An example of such an algorithm that provides very good quality of results is the CFCM+LVQ algorithm [3]. However, this algorithm does not have any embedded feature selection. In any rule-based system feature selection is a very important issue, so in this paper the combination of P-rules and various feature selection techniques is considered.

¹ Project partially sponsored by the grant No. PBU-47/RM3/07 from the Polish Ministry of Education and Science (MNiSZW).

Feature selection methods are usually divided into three groups: filters, which perform feature selection independently of the inductive algorithm; wrappers, where the inductive algorithm is used as the evaluation function; and embedded methods, where feature selection is built into the inductive algorithm. In this paper various methods belonging to the first two groups are compared in application to P-rules. The next section describes the CFCM+LVQ algorithm, section 3 presents different approaches to feature selection, and section 4 describes our experiments. The last section concludes the paper, pointing out advantages and disadvantages of the different feature selection techniques.

2 CFCM+LVQ algorithm

The simplest prototype selection methods are obtained by clustering the dataset and taking the cluster centers as prototypes. However, in this approach the information about the mutual relations between class distributions is ignored. A possible solution are semi-supervised clustering methods, which can acquire external knowledge, for example information describing the mutual class distributions. Generally, context-dependent clustering methods obtain additional information from an external variable f_k defined for every k-th training vector. This variable determines the so-called clustering context, describing the importance of a given training vector. A solution used for building a P-rule system was proposed by Blachnik et al. in [3], where the f_k variable is obtained by first calculating the w_k coefficient:

w_k = \frac{\sum_{j,\,C(x_j)=C(x_k)} \|x_k - x_j\|^2}{\sum_{l,\,C(x_l)\neq C(x_k)} \|x_k - x_l\|^2}    (1)

where C(x) is a function returning the class label of vector x. The obtained w_k value is normalized to fit the range [0, 1]. The w_k parameter describes the mutual position of each vector and its neighbors; after normalization it takes values around 0 for training vectors close to homogeneous class centers, and close to 1 for vectors x_k far from class centers and close to vectors of the opposite class. In other words, large values of w_k represent positions close to the border, and very large values, close to 1, indicate outliers. The coefficient w_k defined this way requires further preprocessing to determine the area of interest, that is, the grouping context for the CFCM algorithm. This preprocessing step can be interpreted as defining a linguistic value for the w_k variable in the fuzzy set sense. In [3] the Gaussian function has been used:

f_k = \exp\left(-\gamma (w_k - \mu)^2\right)    (2)

where µ defines the position of the clustering area, and γ is the spread of this area.
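As an illustration of Eqs. (1) and (2), a minimal sketch of the context computation is given below. It assumes NumPy arrays X (training vectors) and y (class labels) and example values of µ and γ; it is not the implementation used in the paper.

```python
import numpy as np

def clustering_context(X, y, mu=0.6, gamma=4.0):
    """Compute the normalized w_k coefficients of Eq. (1) and the
    Gaussian clustering context f_k of Eq. (2) for every training vector."""
    n = X.shape[0]
    # squared Euclidean distances between all pairs of training vectors
    d2 = np.square(X[:, None, :] - X[None, :, :]).sum(axis=2)
    w = np.empty(n)
    for k in range(n):
        same = (y == y[k])
        # within-class distances over between-class distances, Eq. (1)
        w[k] = d2[k, same].sum() / d2[k, ~same].sum()
    # normalize w_k to the [0, 1] range, as required by the method
    w = (w - w.min()) / (w.max() - w.min())
    # Gaussian "linguistic value", Eq. (2): mu positions the area of
    # interest on the w axis, gamma controls its spread
    f = np.exp(-gamma * (w - mu) ** 2)
    return w, f
```

With µ chosen near the border region of the w axis, vectors close to the decision boundary receive the largest context values and dominate the subsequent CFCM clustering.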

The prototypes obtained via clustering can be further optimized using an LVQ-based algorithm. This combination of two independent algorithms can be seen as a tandem that allows obtaining an accurate model with a small set of prototypes. Another problem facing a P-rules system is determining the number of prototypes. It can be solved with acceptable computational complexity by observing that in prototype selection algorithms the overall system accuracy mostly depends on the classification accuracy of the least accurate class. On this basis we can conclude that the overall classification accuracy can be improved by improving the worst classified class. This assumption is the basis for the racing algorithm, where a new prototype is iteratively added to the class with the highest error rate (C_i^err) [3].

3 Feature selection methods

3.1 Ranking methods

Ranking methods are among the fastest feature selection methods. They are based on a coefficient J(·) describing the relation between each input variable f and the output variable C. The coefficient is computed for each attribute, and the attributes are then sorted by its value from the most to the least important. In the last step, according to some previously selected accuracy measure, the n best features are selected as the final feature subset used to build the final model. Ranking methods use different coefficients to estimate the quality of each feature. One possible choice are criterion functions based on statistical and information-theoretic coefficients. One example of such a metric is the normalized information gain, also known as the asymmetric dependency coefficient (ADC) [4], described as

ADC(C, f) = \frac{MI(C, f)}{H(C)}    (3)

where H(C) and H(f) are the class and feature entropies, and MI(C, f) is the mutual information between class C and feature f, defined by Shannon [5] as:

H(C) = -\sum_{i=1}^{c} p(C_i) \log_2 p(C_i),  H(f) = -\sum_{x} p(f = x) \log_2 p(f = x),  MI(C, f) = H(C) + H(f) - H(C, f)    (4)

In this formula the sum over x for feature f requires discrete values; for numerical ordered features the sum has to be replaced by an integral, or an initial discretization has to be performed to estimate the probabilities p(f = x). Another metric was proposed by Setiono [6] and is called the normalized gain ratio (5):

U_S(C, f) = \frac{MI(C, f)}{H(f)}    (5)
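For illustration, the coefficients (3)-(5) can be estimated from discrete (or already discretized) data along the following lines; this is only a sketch assuming integer-coded feature values and class labels, not code from the paper or from Infosel++.

```python
import numpy as np

def entropy(values):
    """Shannon entropy (base 2) estimated from value frequencies."""
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def mutual_information(f, c):
    """MI(C, f) = H(C) + H(f) - H(C, f) for discrete 1-D arrays f and c."""
    # encode each (feature value, class) pair as a single joint symbol
    _, joint = np.unique(np.stack([f, c], axis=1), axis=0, return_inverse=True)
    return entropy(f) + entropy(c) - entropy(joint)

def adc(f, c):
    """Asymmetric dependency coefficient, Eq. (3)."""
    return mutual_information(f, c) / entropy(c)

def normalized_gain_ratio(f, c):
    """Setiono's normalized gain ratio, Eq. (5)."""
    return mutual_information(f, c) / entropy(f)
```

Features are then ranked by the chosen coefficient and only the n best are kept.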

Another possible normalization of the information gain is a metric based on the joint feature-class entropy (6):

U_H(C, f) = \frac{MI(C, f)}{H(f, C)}    (6)

where H(f, C) is the joint entropy of variable f and class C. Mantaras [7] suggested a criterion D_ML fulfilling the distance metric axioms (7):

D_{ML}(f_i, C) = H(f_i|C) + H(C|f_i)    (7)

where H(f_i|C) and H(C|f_i) are conditional entropies defined by Shannon [5] as H(X|Y) = H(X, Y) - H(Y). The index of weighted joint entropy was proposed by Chi [8] as (8):

Ch(f) = -\sum_{k=1}^{N} p(f = x_k) \sum_{i=1}^{K} p(f = x_k, C_i) \log_2 p(f = x_k, C_i)    (8)

An alternative to the information-theoretic methods presented so far is the χ² statistic, which measures the relation between two random variables, here between a feature and the class. The χ² statistic can be defined as (9):

\chi^2(f, C) = \sum_{i,j} \frac{\left(p(f = x_j, C_i) - p(f = x_j)\,p(C_i)\right)^2}{p(f = x_j)\,p(C_i)}    (9)

where p(·) is the appropriate probability. High values of χ² represent a strong correlation between a given feature and the class variable, which can be used for feature ranking.

3.2 Search based feature selection

The advantage of search-based feature selection methods over rankings is usually more accurate results. These methods are based on stochastic as well as heuristic search strategies, which implies higher computational complexity; for very large datasets (e.g., those provided during the NIPS 2003 challenge) with a few thousand variables this may limit the usability of some algorithms. Typical search-based feature selection solutions are the forward/backward selection methods.

Forward selection

Forward selection starts from an empty feature set and in each iteration adds one new attribute from the set of remaining ones. The feature added is the one that maximizes a certain criterion, usually classification accuracy. To ensure a reliable assessment of adding a new feature to the feature subset, the quality is measured with cross-validation.
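A generic forward selection wrapper of this kind could look like the sketch below. It assumes a scikit-learn style classifier and cross-validated accuracy as the evaluation function; the k-NN classifier is only a stand-in for the CFCM+LVQ model used in the paper.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier  # stand-in for CFCM+LVQ

def forward_selection(X, y, estimator=None, cv=5):
    """Greedy forward selection: repeatedly add the feature that most
    improves cross-validated accuracy, stopping when no feature helps."""
    if estimator is None:
        estimator = KNeighborsClassifier(n_neighbors=3)
    selected, best_score = [], -np.inf
    remaining = list(range(X.shape[1]))
    while remaining:
        scores = [(cross_val_score(estimator, X[:, selected + [j]], y,
                                   cv=cv).mean(), j) for j in remaining]
        score, j = max(scores)
        if score <= best_score:      # stop at the first local maximum
            break
        best_score = score
        selected.append(j)
        remaining.remove(j)
    return selected, best_score
```

Backward elimination, described next, works analogously but starts from the full feature set and removes features.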

Backward elimination

The backward elimination algorithm differs from forward selection in that it starts from the full feature set and iteratively removes features one by one. In each iteration the single feature whose removal has the best effect on overall model accuracy is discarded, and the process is repeated until the accuracy stops improving.

3.3 Embedded feature selection algorithms used as filters

Some data mining algorithms have built-in feature selection (embedded feature selection). An example of this kind of solution are decision trees, which automatically determine an optimal feature subset while building the tree. These methods can also be used as external feature selection methods, called feature filters, by extracting the knowledge about the selected attribute subset from the data mining model. In decision trees this is equivalent to visiting every node of the tree and collecting the attributes considered for testing. Used as an external tool, this approach has one important advantage over ranking methods: it considers not only the relation between a single input attribute and the output attribute, but also searches locally for attributes that allow good local discrimination.

4 Datasets and results

4.1 Datasets used in the experiments

To verify the quality of the previously described feature selection techniques, experiments have been performed on several datasets from the UCI repository. From this repository 4 classification datasets have been used for validating ranking methods and 10 datasets for the search-based methods and tree-based feature selection. The selected datasets are: Appendicitis, Wine, Pima Indian diabetes, Ionosphere, Cleveland Heart Disease, Iris, Sonar, BUPA liver disorders, Wisconsin breast cancer, and Lancet.

4.2 Experiments and results

To compare the algorithms described above, empirical tests were performed on the datasets presented in the previous section. In all tests the Infosel++ library was used to perform feature selection, and as the induction algorithm a self-implemented CFCM+LVQ algorithm written for the Matlab Spider Toolbox was used.
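Before presenting the results, and to make the tree-based filter of Section 3.3 concrete, the sketch below shows one possible realization; it is a hypothetical reconstruction using scikit-learn's DecisionTreeClassifier, not the tool used in the experiments. It fits a tree and returns the indices of the attributes tested in its internal nodes.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def tree_selected_features(X, y, **tree_params):
    """Fit a decision tree and collect the indices of the features that
    appear in at least one split node (the tree-based feature filter)."""
    tree = DecisionTreeClassifier(random_state=0, **tree_params).fit(X, y)
    split_features = tree.tree_.feature       # leaves are marked with -2
    used = np.unique(split_features[split_features >= 0])
    return [int(i) for i in used]
```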

Ranking methods

The results obtained for the various ranking methods displayed different behavior and different ranking orders. These differences appear not only across ranking coefficients but also across the type of discretization performed, which was used to estimate all probabilities such as p(f = x_j, C_i). As the discretization, a simple equal-width method with 5, 10 and 15 bins was used. For the final comparison of results, the number of bins that maximized classification accuracy was used for each coefficient and each dataset. The results obtained for the CFCM+LVQ algorithm are presented in Fig. 1. Each panel shows the relation between the number n of best-ranked features and the obtained classification accuracy.

[Fig. 1. Results of ranking algorithms and the CFCM+LVQ P-rules system. Panels: (a) Ionosphere, (b) Heart Disease, (c) Pima Indians, (d) Wisconsin Breast Cancer. Each panel plots classification accuracy (Acc [%]) against the number of top-ranked features (fi [-]) for the ADC, U_S, U_H, D_ML, Chi, χ² and correlation (Corr) coefficients.]
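The equal-width discretization used above can be reproduced with a few lines of code. The sketch below is illustrative only: score_fn is an assumed callable standing in for the full ranking-plus-classification pipeline (it could be, for example, the adc function from the earlier sketch, or a cross-validated accuracy).

```python
import numpy as np

def equal_width_discretize(x, n_bins):
    """Map a continuous feature to integer bin codes 0 .. n_bins-1."""
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    # interior edges only, so the maximum falls into the last bin
    return np.digitize(x, edges[1:-1])

def best_bin_count(x, y, score_fn, candidates=(5, 10, 15)):
    """Return the bin count for which score_fn(discretized x, y) is highest."""
    scored = [(score_fn(equal_width_discretize(x, b), y), b) for b in candidates]
    return max(scored)[1]
```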

Search based and embedded feature selection

As in the presented results of the ranking methods, a two-stage testing process was used here as well. In the first step feature selection algorithms, i.e. forward selection, backward elimination and selection based on decision trees, were used to select the best feature subset from the whole dataset. In the second stage the classification algorithm was tested using cross-validation. The collected results for the CFCM+LVQ P-rules system are presented in Table 1, which also provides the number of selected features (f) and the number of selected prototypes (l). As the accuracy measure optimized while selecting the best feature subset, only the plain classification accuracy was used.

Table 1. Comparison of feature selection algorithms for the P-rules system

                     Backward elimination     Forward selection        Tree based selection
Dataset              Accuracy        f   l    Accuracy        f   l    Accuracy        f   l
Appendicitis         85.70 ± 13.98   6   2    84.86 ± 14.38   4   2    84.86 ± 15.85   2   2
Wine                 97.25 ±  2.90  11   3    95.06 ±  4.79   5   7    92.57 ±  6.05   6   3
Pima Indians         77.22 ±  4.36   7   3    76.96 ±  3.01   4   3    75.65 ±  2.43   2   4
Ionosphere           87.81 ±  6.90  32   6    92.10 ±  4.25   5   6    87.57 ±  7.11   2   4
Clev. Heart Dis.     85.49 ±  5.59   7   3    84.80 ±  5.23   3   2    84.80 ±  5.23   3   2
Iris                 95.33 ±  6.32   3   5    97.33 ±  3.44   2   4    97.33 ±  3.44   2   4
Sonar                66.81 ± 20.66  56   4    75.62 ± 12.97   4   3    74.62 ± 12.40   1   2
Liver disorders      67.24 ±  8.79   5   2    68.45 ±  6.40   4   2    66.71 ±  9.05   2   2
Breast Cancer        97.82 ±  1.96   7   3    97.81 ±  2.19   5   3    97.23 ±  2.51   5   4
Lancet               95.66 ±  2.36   7   4    94.65 ±  2.64   5   4    93.33 ±  3.20   4   3

5 Conclusions

When combining P-rules systems with feature selection techniques it is possible to observe a trend related to the simplicity and comprehensibility of rule systems. This issue reflects the model complexity versus accuracy dilemma. The analysis of the provided results shows that forward selection and backward elimination outperform the other methods, leading to the conclusion that search-based feature selection is the most robust. However, these methods have a big drawback in terms of computational complexity. The simplest search methods require 0.5k(2n - (k + 1)) invocations of the evaluation function, where n is the number of features and k is the number of iterations of adding or removing a feature. This is an important limitation on the applicability of these methods.

Another problem of simple search strategies like forward or backward selection is getting stuck in local minima (usually the first one encountered). This can be observed on the Liver disorders dataset, where feature selection based on decision trees gives better results than both search methods. The local minima problem can be addressed with more sophisticated search algorithms, but these require even more computational effort. The computational complexity problem does not appear in ranking methods, whose complexity is a linear function of the number of features. However, these methods may be unstable in terms of the obtained accuracy, as shown in Fig. 1(a) and Fig. 1(b), where the accuracy fluctuates with the number of selected features. Another important problem that appears in ranking-based feature selection are redundant features, which ranking alone cannot remove. A possible solution is the FCBF algorithm [9] or the algorithm proposed by Biesiada et al. in [10]. The obtained results for ranking methods do not allow selecting a single best ranking coefficient, so a possible solution are ranking committees, where the feature weight is related to the frequency of appearing at a certain position in different rankings. Another possibility is selecting ranking coefficients with the lowest computational cost, such as the correlation coefficient, which is very cheap to estimate and does not require prior discretization (discretization was used for all entropy-based ranking coefficients).

As a tool for the P-rules system, ranking algorithms should therefore be used only for large datasets, where the computational complexity of the search-based methods is prohibitively high. A compromise between ranking and search-based methods can be obtained with the filter approach realized as the features selected by a decision tree. This approach provides good classification accuracy and low complexity, of order nm log(m) (for n features and m training vectors), and this cost is independent of the complexity of the final classifier.

Acknowledgment

The author would like to thank prof. W. Duch for his help in developing P-rules based systems.

References

1. Duch, W.: Similarity-based methods: a general framework for classification, approximation and association. Control and Cybernetics 29, 937-968 (2000)
2. Duch, W., Blachnik, M.: Fuzzy rule-based systems derived from similarity to prototypes. In: Pal, N., Kasabov, N., Mudi, R., Pal, S., Parui, S. (eds.) Lecture Notes in Computer Science, vol. 3316, pp. 912-917. Springer, New York (2004)
3. Blachnik, M., Duch, W., Wieczorek, T.: Selection of prototypes rules - context searching via clustering. Lecture Notes in Artificial Intelligence 4029, 573-582 (2006)
4. Shridhar, D., Bartlett, E., Seagrave, R.: Information theoretic subset selection. Computers in Chemical Engineering 22, 613-626 (1998)
5. Shannon, C., Weaver, W.: The Mathematical Theory of Communication. University of Illinois Press (1946)
6. Setiono, R., Liu, H.: Improving backpropagation learning with feature selection. The International Journal of Artificial Intelligence, Neural Networks, and Complex Problem-Solving Technologies 6, 129-139 (1996)
7. de Mantaras, R.L.: A distance-based attribute selection measure for decision tree induction. Machine Learning 6, 81-92 (1991)
8. Chi, J.: Entropy based feature evaluation and selection technique. In: Proc. of the 4th Australian Conf. on Neural Networks (ACNN'93) (1993)
9. Yu, L., Liu, H.: Feature selection for high-dimensional data: a fast correlation-based filter solution. In: Proceedings of the Twentieth International Conference on Machine Learning (2003)
10. Duch, W., Biesiada, J.: Feature selection for high-dimensional data: a Kolmogorov-Smirnov correlation-based filter solution. In: Advances in Soft Computing, pp. 95-104. Springer (2005)