Feature Selection in High Dimensional Data by a Filter-Based Genetic Algorithm


C. De Stefano, F. Fontanella and A. Scotto di Freca
Dipartimento di Ingegneria Elettrica e dell'Informazione (DIEI)
Università di Cassino e del Lazio Meridionale
Via G. Di Biasio, Cassino (FR), Italy
{destefano,fontanella,a.scotto}@unicas.it

Abstract. In classification and clustering problems, feature selection techniques can be used to reduce the dimensionality of the data and increase performance. However, feature selection is a challenging task, especially when hundreds or thousands of features are involved. In this framework, we present a new approach for improving the performance of a filter-based genetic algorithm. The proposed approach consists of two steps: first, the available features are ranked according to a univariate evaluation function; then the search space represented by the first M features in the ranking is searched by a filter-based genetic algorithm in order to find feature subsets with a high discriminative power. Experimental results demonstrate the effectiveness of our approach in dealing with high dimensional data, both in terms of recognition rate and of feature number reduction.

1 Introduction

Recent years have seen a strong growth of applications dealing with huge amounts of data, such as data mining and medical data processing [23]. These kinds of applications often imply classification or clustering problems in which the objects to be classified or clustered are represented as feature vectors. The feature selection problem consists in selecting, from the whole set of available features, the subset providing the most discriminative power. The choice of a good feature subset is crucial: if the selected features do not contain enough information to discriminate patterns belonging to different classes, the performance may be unsatisfactory, regardless of the effectiveness of the classification system employed. Moreover, irrelevant and noisy features unnecessarily enlarge the search space, increasing both the time and the complexity of the learning process.

Feature selection algorithms usually imply the definition of an evaluation function and of a search procedure. Evaluation functions can be divided into two broad classes: univariate and multivariate measures. Univariate measures evaluate the effectiveness of each single feature in discriminating samples belonging to different classes and are used to rank the available features. Once the features have been evaluated, the subset search procedure is straightforward: the features are ranked according to their merit and the best M features are selected, where the parameter M is specified by the user.

These kinds of approaches are very fast and can be used to cope with problems involving even hundreds of thousands of features. The main drawback of these measures is that they cannot take into account interactions that may occur between two or more features. For this reason, features that perform well when used in conjunction with other features are discarded if they perform poorly when used alone. Additionally, the features with the highest scores (merits) are usually similar; therefore, these measures tend to select redundant features [6].

Multivariate measures, instead, evaluate feature subsets by measuring how well patterns belonging to different classes are discriminated when projected in the subspace represented by the subset to be evaluated. These measures are generally classified into two categories: filter and wrapper [8]. Wrapper approaches use classification algorithms to evaluate the goodness of the subsets. This leads to high computational costs when a large number of evaluations is required, especially when large datasets are involved. Filter approaches, instead, are independent of any classification algorithm and, in most cases, are computationally less expensive and more general than wrapper algorithms.

As concerns the search strategies, given a measure, the optimal subset can be found by exhaustively evaluating all the possible solutions. Unfortunately, exhaustive search is impracticable when the cardinality N of the whole set of features Y is high (N > 50). This is due to the fact that the search space, made of all the 2^N possible subsets of Y, grows exponentially with N. For this reason, many heuristic algorithms have been proposed for finding near-optimal solutions [4, 8, 21]. Among these algorithms, greedy strategies that incrementally generate feature subsets have been proposed. Since these algorithms do not take into account complex interactions among the features, in most cases they lead to sub-optimal solutions.

Evolutionary computation (EC) based techniques have been widely used to cope with the feature selection problem [21]. Among the EC-based approaches, Genetic Algorithms (GAs) have been widely used. GA binary vectors provide a natural and straightforward representation for feature subsets: the value 1 or 0 of the i-th chromosome element indicates whether the i-th feature is included or not. Most of the GA approaches use wrapper evaluation functions [21], for which different classification algorithms have been adopted, among them Support Vector Machines (SVMs) [16], K-Nearest Neighbor (KNN) [13] and Artificial Neural Networks (ANNs) [22]. As mentioned above, wrapper evaluation functions lead to high computational costs, since their computational complexity depends on the number of samples actually used for training the classifier. As a consequence, such approaches are not well suited to deal with problems involving a huge number of instances and features. Filter fitness functions have also been used: the approach presented in [19] uses an information theory based evaluation function, while in [11] the authors adopt a consistency measure. Moreover, in [3, 5] the authors present a filter fitness function that extends Fisher's linear discriminant.

Recently, in order to reduce the search space size for high-dimensional datasets, different strategies have been adopted [18, 20, 21] for GA-based algorithms. In [18], the search space reduction for the GA is performed by using different filter approaches: the information they provide is used to build a part of the individuals making up the initial population, and the individuals are then evaluated by means of a neural network based wrapper function. The approach has been tested on a credit risk assessment problem involving just 33 features. In [20] the authors present a new GA-based approach for feature selection in which three different ranking algorithms reduce the search space for a GA that uses an SVM-based wrapper as fitness function; in this case, however, the GA is used in a very limited way, because the search space is reduced to only 12 features. Moreover, in [10] and [12] two different GA-based hybrid approaches that use wrapper fitness functions are proposed and tested on data with no more than 100 features. Finally, in [2] a two-step procedure is used to deal with data involving up to thousands of features. In the first step, the whole set of features is ranked according to a univariate measure; in the second step, the final subset is built by incrementally adding the i-th ranked feature. The process continues as long as the added feature improves the performance of the classifier used for the subset evaluation.

In this paper we present a new GA-based algorithm for feature selection that exploits the advantages of both feature ranking and GAs. The goal is to build a high performance feature selection system that selects a small number of features with respect to the total number of available features. For this purpose, we built a two-module system that combines a feature ranking algorithm with a GA. The proposed system allows us to greatly reduce the number of features to be used in the classification phase. More specifically, the first module uses a feature ranking algorithm to greatly reduce the number of features to be taken into account by the second module: it retains only a given number M (fixed a priori) of promising features, according to the univariate measure used for ranking the whole feature set given in input to the system. The second, GA-based module searches, in the space consisting of the feature subsets made of the features provided by the first module, for the best feature subset by using a filter fitness function that evaluates feature subsets. The layout of the proposed system is shown in Figure 1.

Fig. 1. The layout of the proposed system: the N input features are ranked, the best M of them are passed to the GA module, which outputs the selected feature subset.

Because of the reduction performed by the feature ranking, the search space provided to the GA module is much smaller than that made of all the possible subsets of the whole feature set. The proposed system is based on the hypothesis that this reduced search space still contains most of the promising areas, i.e. those containing good and near-optimal solutions (subsets). In practice, the filtering performed by the ranking module does not discard those features that perform well only when used in combination with other ones; this allows the second, GA-based module to focus its search on these more promising areas.

As concerns the univariate measure for the feature ranking, we used the Chi-square measure introduced in [15]. As evaluation function for the GA module we used the one introduced in [9], namely the Correlation-based Feature Selection (CFS) function.

This function evaluates the merit of a subset by considering both the correlation between the class labels and the single features, and the inter-correlation among the selected features. The computation of the CFS function is organized in two steps: (i) the class-feature correlation vector and the feature-feature correlation matrix are computed a priori for all the features and properly stored; (ii) the subsequent evaluations of the CFS function are obtained by accessing the precomputed vector and matrix. It is worth noting that these computations are independent of the training set size.

The effectiveness of the proposed approach has been tested on four publicly available datasets, whose total number of features ranges from 500 to 10000. Two kinds of comparison were performed: in the former, the results of our approach were compared with those achieved by using different feature selection strategies; in the latter, our results were compared with those obtained by a wrapper-based approach.

The remainder of the paper is organized as follows: in Section 2 the feature evaluation functions are described; Section 3 illustrates the GA used to implement the second module; in Section 4 the experimental results are detailed; finally, Section 5 is devoted to the conclusions.

2 Feature Evaluation Function

As mentioned in the Introduction, feature evaluation functions can be broadly divided into two classes, namely univariate and multivariate measures. In the following, the univariate measure adopted for the ranking module and the subset evaluation criterion used as fitness function of the GA module are detailed.

2.1 Univariate Measures

Univariate measures evaluate the effectiveness of a single feature in discriminating samples belonging to different classes and can be used to sort the whole set of available features. The feature selection approaches which use this kind of measure do not need a search procedure: once the features have been sorted, the best subset of M features consists of the first M features of the ranking. Note that the value of M must be chosen by the user.

For our approach, we used the Chi-square univariate measure [15]. This measure estimates feature merit by using a discretization algorithm: if a feature can be discretized to a single value, it has no discriminative power and can safely be discarded. The discretization algorithm adopts a supervised heuristic method based on the χ² statistic. The range of values of each feature is initially discretized into a certain number of intervals (heuristically determined). Then, the χ² statistic is used to determine whether the relative frequencies of the classes in adjacent intervals are similar enough to justify merging such intervals. The χ² value for two adjacent intervals is computed as:

\chi^2 = \sum_{i=1}^{2} \sum_{j=1}^{C} \frac{(A_{ij} - E_{ij})^2}{E_{ij}}    (1)

where C is the number of classes, A_{ij} is the number of instances of the j-th class in the i-th interval and E_{ij} is the expected frequency of A_{ij}, given by E_{ij} = R_i C_j / N_T, where R_i is the number of instances in the i-th interval, and C_j and N_T are the number of instances of the j-th class and the total number of instances, respectively, in both intervals. The extent of the merging process is controlled by a threshold, whose value represents the maximum admissible difference among the occurrence frequencies of the samples in adjacent intervals. The value of this threshold has been heuristically set during preliminary experiments.
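To make the merging test in (1) concrete, the following is a minimal sketch (with hypothetical function and variable names, not the actual Chi2 implementation of [15]) of computing the χ² statistic for two adjacent intervals from their per-class counts:

```python
import numpy as np

def chi2_adjacent(counts_a, counts_b):
    """Chi-square statistic of eq. (1) for two adjacent intervals.

    counts_a, counts_b: per-class instance counts A_1j and A_2j of the two
    intervals (length-C sequences). Low values indicate similar class
    distributions, i.e. the intervals are candidates for merging.
    """
    A = np.array([counts_a, counts_b], dtype=float)   # 2 x C matrix of A_ij
    R = A.sum(axis=1, keepdims=True)                  # R_i: instances per interval
    Cj = A.sum(axis=0, keepdims=True)                 # C_j: instances per class
    N_T = A.sum()                                     # total instances in both intervals
    E = R * Cj / N_T                                  # expected frequencies E_ij
    E[E == 0] = 1e-12                                 # avoid division by zero
    return float(((A - E) ** 2 / E).sum())

# Two intervals with similar class profiles give a small statistic,
# so a Chi2-style merging step would join them.
print(chi2_adjacent([10, 3], [9, 4]))
```

In the Chi2-style procedure, adjacent intervals whose statistic stays below the merging threshold are joined; a feature whose range collapses to a single interval has no discriminative power and is discarded.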

2.2 Subset Evaluation Functions

Multivariate methods for feature subset evaluation can in turn be divided into two classes: filter and wrapper. The former are based on statistical measures and their outcomes are independent of the classifier actually used. The latter, instead, are based on the classification results achieved by a certain classifier trained on the subset to be evaluated. Wrapper methods are usually computationally more expensive than filter ones, as they require the training of a classifier for each evaluation, making them unsuitable for big data tasks, where huge datasets must be processed. Moreover, while filter-based evaluations are more general, as they give statistical information on the data, wrapper-based evaluations may give rise to a loss of generality, because they depend on the specific classifier used.

In order to introduce the subset evaluation function adopted, let us briefly recall the well known information-theoretic concept of entropy. Given a discrete variable X, which can assume the values {x_1, x_2, ..., x_n}, its entropy H(X) is defined as:

H(X) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)    (2)

where p(x_i) is the probability mass function of the value x_i. The quantity H(X) represents an estimate of the uncertainty of the random variable X. The concept of entropy can be used to define the conditional entropy of two random variables X and Y, taking the values x_i and y_j respectively, as:

H(X|Y) = \sum_{i,j} p(x_i, y_j) \log \frac{p(y_j)}{p(x_i, y_j)}    (3)

where p(x_i, y_j) is the joint probability that, at the same time, X = x_i and Y = y_j. The quantity in (3) represents the amount of randomness in the random variable X when the value of Y is known. Given two features X and Y, their correlation r_XY is computed as follows (the same formula also holds for the feature-class correlation):

r_{XY} = 2.0 \cdot \frac{H(X) + H(Y) - H(X, Y)}{H(X) + H(Y)}    (4)

As fitness function for the GA module we chose a filter called CFS (Correlation-based Feature Selection) [9], which uses a correlation-based heuristic to evaluate feature subset quality. This function takes into account the usefulness of the single features for predicting class labels, along with the level of inter-correlation among them. The idea behind this approach is that good subsets contain features highly correlated with the class and uncorrelated with each other. Given a feature selection problem in which the patterns are represented by means of a set Y of N features, the CFS function computes the merit of a generic subset X ⊆ Y, made of k features, as follows:

f_{CFS}(X) = \frac{k \, \overline{r_{cf}}}{\sqrt{k + k(k-1) \, \overline{r_{ff}}}}    (5)

where r_cf is the average feature-class correlation and r_ff is the average feature-feature correlation. Note that the numerator estimates the discriminative power of the features in X, whereas the denominator assesses the redundancy among them. The CFS function allows the GA to discard both irrelevant and redundant features: the former because they are poor in discriminating the different classes at hand, the latter because they are highly correlated with one or more of the other features. In contrast to previously presented approaches [10, 18], this fitness function is able to automatically find the number of features and does not need the setting of any parameter.

Finally, given a dataset D on which to estimate the quantities in (4) and a feature subset X ⊆ Y to be evaluated, the computation of f_CFS(X) can be made very fast. In fact, before starting the search procedure (the GA in our case), the correlation vector V_cf, containing N elements, and the N × N symmetric correlation matrix M_ff can be computed. The i-th element of V_cf contains the value of the correlation between the i-th feature and the class, whereas the element M_ff[i, j] represents the correlation between the i-th and the j-th feature. Once the values of V_cf and M_ff have been computed, given a subset X containing k features, the computation of f_CFS(X) only requires 2k memory accesses.
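The following is a minimal sketch of how the quantities in (2)-(5) can be organised, assuming discrete (or already discretized) features and hypothetical helper names (entropy, symmetrical_uncertainty, precompute_correlations, cfs_merit). The correlations are computed once, before the search starts, so that each subset evaluation only reads entries of the precomputed V_cf and M_ff:

```python
import numpy as np
from itertools import combinations

def entropy(values):
    """Empirical entropy H(X) of a discrete variable, eq. (2)."""
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def symmetrical_uncertainty(x, y):
    """Correlation r_XY of eq. (4), computed from empirical entropies."""
    joint = np.array([f"{a}|{b}" for a, b in zip(x, y)])   # paired symbols for H(X, Y)
    hx, hy, hxy = entropy(x), entropy(y), entropy(joint)
    denom = hx + hy
    return 0.0 if denom == 0 else 2.0 * (hx + hy - hxy) / denom

def precompute_correlations(X, y):
    """Class-feature vector V_cf and feature-feature matrix M_ff."""
    n_features = X.shape[1]
    V_cf = np.array([symmetrical_uncertainty(X[:, i], y) for i in range(n_features)])
    M_ff = np.eye(n_features)
    for i, j in combinations(range(n_features), 2):
        M_ff[i, j] = M_ff[j, i] = symmetrical_uncertainty(X[:, i], X[:, j])
    return V_cf, M_ff

def cfs_merit(subset, V_cf, M_ff):
    """CFS merit of eq. (5) for a list of feature indices."""
    k = len(subset)
    if k == 0:
        return 0.0
    r_cf = V_cf[subset].mean()                               # average feature-class correlation
    if k == 1:
        r_ff = 0.0
    else:
        r_ff = float(np.mean([M_ff[i, j] for i, j in combinations(subset, 2)]))
    return (k * r_cf) / np.sqrt(k + k * (k - 1) * r_ff)
```

The key design point, as noted above, is that the expensive part (precompute_correlations) depends on the training set only once; every subsequent call to cfs_merit is a handful of look-ups, independent of the number of training samples.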

3 Genetic Algorithms for Feature Selection

In the last decades, Evolutionary Computation techniques have proven very effective as a methodology for solving optimization problems whose search spaces are discontinuous and very complex. In this field, GAs represent a subset of these optimization techniques and have been applied to a wide variety of both numerical and combinatorial optimization problems [12]. In a GA the solutions are represented as binary vectors, and operators such as crossover and mutation are applied to explore the search space made of all possible solutions. GAs can be easily applied to the problem of feature selection: given a set Y having cardinality equal to N, a subset X of Y (X ⊆ Y) can be represented by a binary vector of N elements whose i-th element is set to 1 if the i-th feature is included in X, and to 0 otherwise. Besides the simplicity of the solution encoding, GAs are well suited for this class of problems because the search in this exponential space is very hard, since interactions among features can be highly complex and strongly nonlinear. Some studies on the effectiveness of GAs in solving feature selection problems can be found in [12, 21].

The second module of the system presented here has been implemented by using a generational GA. In order to reduce the computational complexity of the fitness function (see Subsection 2.2), the class-feature correlation vector V_cf and the feature-feature correlation matrix M_ff are pre-computed. Then the GA starts by randomly generating a population of P individuals. Afterwards, the fitness of the generated individuals is evaluated according to the formula in (5). After this preliminary evaluation phase, a new population is generated by selecting P/2 couples of individuals using the roulette wheel method. The one-point crossover operator is then applied to each of the selected couples, according to a given probability factor p_c. Afterwards, the mutation operator is applied with a probability p_m. The value of p_m has been set to 1/N, where N is the chromosome length, i.e. the total number of available features for the problem at hand. This probability value allows, on average, the modification of only one chromosome element; it has been suggested in [17] as the optimal mutation rate below the error threshold of replication. Finally, these individuals are added to the new population. The process just described is repeated for N_g generations.
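A compact sketch of the generational GA described above is given below. It reuses the cfs_merit helper sketched in Subsection 2.2, takes the default parameter values from Table 1 in the next section, and makes arbitrary choices (e.g. no elitism) for details the description above does not fix:

```python
import numpy as np

def run_ga(V_cf, M_ff, n_features, pop_size=100, p_c=0.4, n_gen=500, rng=None):
    """Generational GA over binary chromosomes of length n_features."""
    rng = np.random.default_rng() if rng is None else rng
    p_m = 1.0 / n_features                              # on average one flipped gene per offspring
    pop = rng.integers(0, 2, size=(pop_size, n_features))

    def fitness(ind):
        return cfs_merit(list(np.flatnonzero(ind)), V_cf, M_ff)   # eq. (5)

    fit = np.array([fitness(ind) for ind in pop])
    for _ in range(n_gen):
        new_pop = []
        for _ in range(pop_size // 2):
            # roulette-wheel selection of a couple of parents
            probs = fit / fit.sum() if fit.sum() > 0 else np.full(pop_size, 1.0 / pop_size)
            i, j = rng.choice(pop_size, size=2, p=probs)
            a, b = pop[i].copy(), pop[j].copy()
            # one-point crossover with probability p_c
            if rng.random() < p_c:
                cut = rng.integers(1, n_features)
                a[cut:], b[cut:] = b[cut:].copy(), a[cut:].copy()
            # bit-flip mutation with per-gene probability p_m
            for child in (a, b):
                mask = rng.random(n_features) < p_m
                child[mask] = 1 - child[mask]
            new_pop.extend([a, b])
        pop = np.array(new_pop)
        fit = np.array([fitness(ind) for ind in pop])
    best = pop[int(np.argmax(fit))]
    return np.flatnonzero(best), float(fit.max())
```

The mutation rate 1/n_features follows the error-threshold argument of [17] recalled above, so that each offspring differs from its parent, on average, by a single gene.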

4 Experimental Results

We tested the proposed approach on high dimensional data (from 500 up to 10000 features). For each dataset, a set of values for the parameter M (see Figure 1) has been tested. For each value of M, 30 runs have been performed for the GA module. At the end of every run, the feature subset encoded by the individual with the best fitness has been used to build a Multilayer Perceptron classifier (MLP in the following), trained by using the back propagation algorithm. The classification performance of the built classifiers has been estimated by using the 10-fold cross-validation approach. The results reported in the following have been obtained by averaging the performance of the 30 MLPs built. Some preliminary trials have been performed to set the parameters of the GA and of the MLP, reported in Table 1 and Table 2, respectively. These two sets of parameters have been used for all the experiments reported below.

Table 1. The values of the GA module parameters used in the experiments. Note that p_m depends on the chromosome length, i.e. the total number of available features N_F.

  Parameter               Symbol   Value
  Population size         P        100
  Crossover probability   p_c      0.4
  Mutation probability    p_m      1/N_F
  Number of generations   N_g      500

Table 2. The values of the parameters used for training the MLPs. Note that the number of hidden neurons depends on both the number of input attributes N_a and the number of output classes N_c.

  Parameter        Value
  Learning rate    0.3
  Momentum         0.2
  Hidden neurons   (N_a + N_c)/2
  Epochs           500

4.1 The Datasets

The proposed approach has been tested on the following publicly available datasets: Arcene, Gisette, Madelon [1] and Ucihar [14]. The characteristics of the datasets are summarized in Table 3. They differ in the number of attributes, the number of classes (two-class or multi-class problems) and the number of samples. In particular, Arcene contains mass-spectrometric data from medical tests for cancer diagnosis (ovarian or prostate cancer); it is a two-class classification problem with continuous input variables. Gisette contains images of the confusable handwritten digits four and nine. The dataset was constructed from the MNIST data made available by Yann LeCun of the NEC Research Institute; it is a two-class classification problem with sparse continuous input variables. Madelon contains two-class synthetic data with sparse binary attributes; each class contains a certain number of independently generated Gaussian clusters. It also contains some redundant and useless features. Finally, the Ucihar dataset contains signals from smartphone sensors (accelerometer and gyroscope), recorded from 30 persons wearing a smartphone on the waist. Each person performed six activities, each representing a class of the problem: walking, walking upstairs, walking downstairs, sitting, standing and laying.

Table 3. The datasets used in the experiments.

  Dataset   Attributes   Samples   Classes
  Arcene      10000                   2
  Gisette      5000                   2
  Madelon       500                   2
  Ucihar        561                   6
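To make the evaluation protocol described at the beginning of this section concrete, the sketch below scores one selected feature subset with a 10-fold cross-validated MLP, averaged over 30 differently initialized runs. It assumes scikit-learn; its MLPClassifier only approximates the back-propagation MLP configured as in Table 2 and is not the implementation used in the paper:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

def evaluate_subset(X, y, subset, n_classes, n_runs=30):
    """Mean 10-fold CV accuracy of MLPs trained on the selected features."""
    X_sel = X[:, subset]
    n_hidden = (len(subset) + n_classes) // 2            # (N_a + N_c)/2, as in Table 2
    scores = []
    for seed in range(n_runs):                           # 30 MLPs with different initial weights
        mlp = MLPClassifier(hidden_layer_sizes=(n_hidden,),
                            solver="sgd",
                            learning_rate_init=0.3,      # Table 2: learning rate
                            momentum=0.2,                # Table 2: momentum
                            max_iter=500,                # Table 2: epochs
                            random_state=seed)
        scores.append(cross_val_score(mlp, X_sel, y, cv=10).mean())
    return float(np.mean(scores)), float(np.std(scores))
```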

4.2 Comparison Findings

In order to test the effectiveness of our system, we performed two sets of comparisons. In the first set, we compared the results of the proposed system with those obtained by three different feature selection approaches:

- The feature ranking represented by the first module of our system (Figure 1): given the whole set of N features, it outputs the best M features according to the univariate measure adopted. It will be denoted as RNK in the following.
- The GA used in the second module of the proposed system (Figure 1): given the whole set of N features, it searches for the best solution (subset) by using the GA detailed in Section 3. It will be denoted as GA in the following.
- The third approach taken into account for the comparison is quite similar to our approach, but uses sequential forward floating selection as the search strategy of the second module. This strategy searches the solution space by using a greedy hill-climbing technique: it starts with the empty set of features and, at each step, adds the feature that best improves the evaluation function; the algorithm also verifies whether the criterion can be improved by excluding a feature and, in that case, the worst feature according to the evaluation function is removed from the set (see the illustrative sketch below). We used an improved version of this algorithm, presented in [7]. It will be denoted as RNK-SFS in the following.

The purpose of the first comparison was to test the effectiveness of the proposed approach in improving the performance obtained by using only the feature ranking approach. As regards the second comparison, its aim was to validate our hypothesis: a feature ranking algorithm can be used to improve the performance of a standard GA by locating the promising areas of the whole search space consisting of all the available features. Finally, the goal of the third comparison was to assess the ability of the GA in finding good solutions in the search space provided by the feature ranking module. For all the comparisons, the performance has been evaluated in terms of recognition rate and feature reduction.

As concerns the second set of comparisons, we compared the results of our system with those presented in [2]. The approach taken into account for this comparison is called IWSS and, as mentioned in the Introduction, it is wrapper-based and able to deal with problems involving thousands of features.
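For reference, the following is a minimal, assumption-level sketch of a sequential forward floating search driven by the CFS merit (reusing the cfs_merit helper of Subsection 2.2); it is a simplified illustration, not the improved variant of [7] actually used for RNK-SFS:

```python
def sffs_cfs(candidate_features, V_cf, M_ff):
    """Greedy forward selection with a floating backward step, guided by eq. (5)."""
    selected, best_merit = [], 0.0
    improved = True
    while improved:
        improved = False
        # forward step: add the feature that most improves the merit
        remaining = [f for f in candidate_features if f not in selected]
        gains = [(cfs_merit(selected + [f], V_cf, M_ff), f) for f in remaining]
        if gains:
            merit, feat = max(gains)
            if merit > best_merit:
                selected.append(feat)
                best_merit = merit
                improved = True
        # floating backward step: drop features while that improves the merit further
        while len(selected) > 2:
            drops = [(cfs_merit([f for f in selected if f != g], V_cf, M_ff), g)
                     for g in selected]
            merit, worst = max(drops)
            if merit > best_merit:
                selected.remove(worst)
                best_merit = merit
            else:
                break
    return selected, best_merit
```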

With the purpose of investigating how the value of the parameter M affects the performance of the presented system, we tested several M values. Moreover, since the number of attributes of the considered datasets differs widely, we used two sets of values. The set {100, 200, 500, 1000, 2000} has been considered for the datasets Arcene and Gisette, whose samples are described by 10000 and 5000 attributes, respectively. The set {20, 50, 100, 200, 300} has been used for the datasets Madelon and Ucihar, having 500 and 561 attributes, respectively.

First Set of Comparisons

Since the approaches RNK and RNK-SFS are deterministic, for each value of M they generated a single feature subset. However, in order to perform a fair comparison with the proposed approach, for each generated subset, 30 MLPs have been trained with different, randomly generated, initial weights. The trained MLPs have been evaluated by using the 10-fold cross-validation approach. The results reported in the following have been obtained by averaging the performance of the 30 trained MLPs. Also in this case we used the parameters reported in Table 2. As concerns the results of the GA approach, they have been obtained by using the methodology adopted for our approach, as described at the beginning of the present section. Note that, to statistically validate the comparison results, we performed the nonparametric Wilcoxon rank-sum test (α = 0.05) over the 30 runs.

The comparison results have been grouped according to the different values of M used, and are reported in Table 4 (Arcene and Gisette) and Table 5 (Madelon and Ucihar). In both tables, the second column shows the values of the parameter M, while the recognition rate (RR) and the number of selected features (NF) are reported for each method. It is worth noting that for the RNK method the number of selected features has not been reported, because it coincides with the value of M actually used. In each table, the recognition rates in bold highlight the results which are significantly better than the second best results (starred in the table), according to the Wilcoxon test. For results that do not present a statistically significant difference, the best two results are both starred. Moreover, for each method, when two or more of its results do not present a statistically significant difference, the result achieved with the minimum number of features has been considered. Finally, note that for our approach we used the abbreviation RNK-GA.

The comparison results for the Arcene and Gisette datasets are shown in Table 4.

Table 4. Comparison results for the Arcene and Gisette datasets (for each value of M: recognition rate RR and number of selected features NF of RNK-GA, GA and RNK-SFS, and RR of RNK). Bold values represent the best statistically significant results.

From the table it can be seen that the proposed approach achieves better performance for both datasets. In more detail, for the Arcene dataset a recognition rate of 92.3% has been obtained by using only 465 out of the 10000 features provided in input to the system. This result has been achieved with a value of M equal to 2000 and is significantly better than those obtained with smaller values.

This seems to suggest that, for these smaller values, the ranking module discards features that, although they score poorly according to the χ² measure, are relevant for the classification task when used in conjunction with other features. Nonetheless, the search space reduction performed by the ranking module with M = 2000 (from 2^10000 to 2^2000 candidate subsets) allowed a strong improvement of the performance with respect to the GA, which searched the whole search space. As concerns the second best result, the RNK approach reached a recognition rate of 87.4% by using 500 features, which coincides with the value of M. Note that for M = 2000 the RNK approach performs poorly, indicating that only some of the first 2000 ranked features are actually relevant, while most of them are irrelevant or redundant, and training the MLP with all of them leads to poor classification performance. As for the RNK-SFS method, which got its best result with M = 2000, it performs significantly worse than our approach, showing that the GA has a better searching ability than RNK-SFS.

For the Gisette dataset, the best two results of our system (M = 500 and M = 1000) were not significantly different and, according to the criterion mentioned above, we considered the one that used 41.9 features on average and achieved a recognition rate of 96.7% (M = 500). The results of the other methods were not significantly different from each other. In this case, the GA approach got good results, but selected many more features than RNK-GA. This shows that, even if the GA selected most of the relevant features, it was not able to discard the redundant and irrelevant ones. This is due to the fact that the GA searched the whole space of solutions and could not benefit from the filtering action performed by the ranking module. As regards RNK-SFS, it got its best performance with three different results (M = 500, M = 100 and M = 2000), which did not exhibit any statistically significant difference. The chosen result (M = 500) reached a recognition rate of 95.6%, selecting 73 features. These results exhibit only slight differences also in terms of number of features. This seems to suggest that the SFS algorithm, starting from the 500-feature case, got stuck in suboptimal areas of the search space and was not able to locate new areas containing solutions consisting of a greater number of features.

The results just described seem to confirm that, as the search space grows (exponentially) with M, the GA module of the proposed approach is able to locate new areas of the search space containing better solutions, which include the new features progressively added. In particular, in the case of the Arcene dataset, our system was able to find solutions whose cardinality strongly grows as M increases, obtaining a strong increase of the performance in terms of recognition rate. For the Gisette dataset, instead, the fitness increment of the newly found solutions did not lead to a significant improvement in terms of recognition rate.

The comparison results for the Madelon and Ucihar datasets are shown in Table 5.

Table 5. Comparison results for the Madelon and Ucihar datasets (for each value of M: recognition rate RR and number of selected features NF of RNK-GA, GA and RNK-SFS, and RR of RNK). Bold values represent the best statistically significant results.

From the table it can be observed that for both datasets the proposed system did not significantly outperform the compared systems. As concerns the Madelon dataset, the RNK approach reached its best performance by using the first 20 ranked features. This suggests that the whole Madelon feature set contains a small subset of features that, even when taken separately, have a high discriminative power. This subset can be easily identified, either by RNK, which selected the best 20 features according to the χ² measure (equation (1)), or by the GA, even though it searched the whole search space. As for the Ucihar results, RNK-GA, RNK-SFS and GA outperformed RNK, but achieved performances that did not exhibit any statistically significant difference. RNK-GA and RNK-SFS obtained their best results with M = 200, using a comparable number of features. The GA selected many more features than RNK-GA and RNK-SFS (about three times as many), confirming that it was not able to discard redundant and irrelevant features. Nonetheless, in this case, these features did not affect the MLP training process. These results seem to suggest that, when the number of features to be dealt with is not too large: (i) even less effective search algorithms like SFS can find good solutions in the reduced search space provided by the first module; (ii) the GA alone is still able to locate search space areas containing good solutions, but these may include redundant or irrelevant features.

Second Set of Comparisons

In [2], the IWSS approach has been tested on three classifiers: Naive Bayes (NB), KNN (K = 1) and C4.5. Moreover, among the datasets considered in [2] are three of the four reported in Table 3: Arcene, Gisette and Madelon. The comparison results are shown in Table 6. Note that the second column shows the value of the parameter M that obtained the best result among those tested (the same values used for the first set of comparisons). The recognition rate (RR) and the number of selected features (NF) are also reported. Since the IWSS approach is deterministic, the results reported in [2] refer to a single feature subset and have been obtained using the 10-fold cross-validation technique. For this reason, in order to statistically validate the comparison results, we performed the one-sample Wilcoxon rank-sum test (α = 0.05), comparing the single result of IWSS with the results of RNK-GA over the 30 runs. The values in bold highlight the best result, according to the Wilcoxon test.

Table 6. Comparison results with the IWSS approach (for each of Arcene, Gisette and Madelon: the best value of M, and RR/NF of RNK-GA and IWSS), with three sub-tables: (a) C4.5 algorithm, (b) K Nearest Neighbor, (c) Naive Bayes.

From the table it can be seen that for the Arcene dataset (10000 features) the proposed system greatly outperforms the IWSS approach on the three classifiers considered. As concerns the Gisette dataset (5000 features), for the C4.5 and KNN classifiers the performance of our approach is better than that of IWSS. Finally, as for the Madelon dataset (500 features), IWSS achieves better performance for the C4.5 and KNN classifiers, whereas for the NB classifier our system performs slightly better.

The above results confirm the effectiveness of the proposed approach in dealing with high dimensional data.

In fact, on the Arcene dataset our system largely outperforms the IWSS approach, in spite of the fact that IWSS uses a wrapper evaluation function. This is confirmed by the results on the Gisette dataset, where our system obtained better performance on two of the three classifiers. In this case, the differences in terms of recognition rate are much smaller than those observed on the Arcene dataset. However, these results are similar to those reported in Table 4, where the recognition rate differences for the MLP classifier are not greater than 1%. Only for the NB classifier does IWSS significantly outperform our approach, but this performance is nonetheless worse than that obtained by our approach with the MLP. Finally, as for the Madelon dataset, as mentioned above, it is characterized by a small set of discriminative features that can be easily identified by univariate measures. In this case, therefore, the IWSS approach is favored, because it first ranks the features and then incrementally builds the feature subset by using a greedy strategy.

4.3 Discussion

From the results shown above it can be seen that the univariate measure used in the first module is able to identify most of the relevant features, even though it evaluates the relevance of each feature without taking into account any feature interaction. In practice, what happens is that the interacting features, i.e. those that are singularly of little relevance but become useful when taken together with other features, do not appear at the bottom of the ranking. Thus, these features can always be included by suitably increasing the value of M. It is worth noting that, because of the exponential growth of the search space, even high values of M allow a strong reduction of the search space. An interesting property of our system is that, once the value of M has been correctly set, the GA of the second module is able to discard redundant features: in fact, our approach selected many fewer features than those given in input, or than those selected by the GA taken into account for the comparison.

We want to remark that the above results confirm the assumptions underlying our method: (i) a feature ranking algorithm can be used to preselect an a priori fixed number of features among the whole set of available features; (ii) the search space consisting of the subsets made of these selected features contains most of the good and near-optimal solutions (subsets). In practice, the filtering performed by the feature ranking module makes the task of searching for good solutions easier, and this filtering is crucial in improving the performance of the GA when thousands of features are involved.

5 Conclusions

We present a novel GA-based approach for feature selection which is able to deal with thousands of features. The approach consists of two modules. The first uses a feature ranking based approach that reduces the search space made of the whole set of available features.

This reduction is performed by discarding the features that, according to the univariate measure employed, are less useful for discriminating among the different classes at hand. The second module uses a genetic algorithm to search the solution space provided by the first module. This module employs a correlation-based heuristic function to evaluate the worth of the feature subsets encoded by the individuals. Since we used only filter evaluation functions, the proposed approach shows the following interesting properties: (i) it is independent of the classification system used; (ii) once the correlation data have been computed in the initialization step of the GA, the computational cost of the fitness function does not depend on the training set size. The second property makes our system particularly suitable for problems involving a huge number of instances.

The effectiveness of the proposed system has been tested on data represented in high dimensional spaces. The achieved results have been compared with those obtained by different feature selection strategies, both wrapper and filter. For the datasets containing thousands of features, our method obtains better results than the other methods, both in terms of accuracy and in terms of number of selected features. Future work will investigate different feature evaluation functions, both for the ranking (univariate) and for the GA (multivariate) module. Moreover, the system performance will also be evaluated with different classification schemes.

References

1. NIPS 2003 workshop on feature extraction and feature selection challenge (2003)
2. Bermejo, P., Gámez, J.A., Puerta, J.M.: Improving incremental wrapper-based subset selection via replacement and early stopping. IJPRAI 25(5) (2011)
3. Cordella, L.P., De Stefano, C., Fontanella, F., Marrocco, C., Scotto di Freca, A.: Combining single class features for improving performance of a two stage classifier. In: 20th International Conference on Pattern Recognition (ICPR 2010). IEEE Computer Society (2010)
4. Dash, M., Liu, H.: Feature selection for classification. Intelligent Data Analysis 1(1-4) (1997)
5. De Stefano, C., Fontanella, F., Marrocco, C.: A GA-based feature selection algorithm for remote sensing images. Springer Berlin Heidelberg, Berlin, Heidelberg (2008)
6. De Stefano, C., Fontanella, F., Maniaci, M., Scotto di Freca, A.: A method for scribe distinction in medieval manuscripts using page layout features. In: Maino, G., Foresti, G. (eds.) Image Analysis and Processing - ICIAP 2011, Lecture Notes in Computer Science, vol. 6978. Springer Berlin Heidelberg (2011)
7. Gütlein, M., Frank, E., Hall, M., Karwath, A.: Large scale attribute selection using wrappers. In: Proc. of the IEEE Symposium on Computational Intelligence and Data Mining (CIDM 2009) (2009)
8. Guyon, I., Elisseeff, A.: An introduction to variable and feature selection. J. Mach. Learn. Res. 3 (2003)
9. Hall, M.A.: Correlation-based feature selection for discrete and numeric class machine learning. In: Proceedings of the Seventeenth International Conference on Machine Learning. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (2000)

10. Huang, J., Cai, Y., Xu, X.: A hybrid genetic algorithm for feature selection wrapper based on mutual information. Pattern Recogn. Lett. 28(13) (Oct 2007)
11. Lanzi, P.: Fast feature selection with genetic algorithms: a filter approach. In: IEEE International Conference on Evolutionary Computation (Apr 1997)
12. Lee, J.S., Oh, I.S., Moon, B.R.: Hybrid genetic algorithms for feature selection. IEEE Trans. Pattern Anal. Mach. Intell. 26(11) (2004)
13. Li, R., Lu, J., Zhang, Y., Zhao, T.: Dynamic adaboost learning with feature selection based on parallel genetic algorithm for image annotation. Knowledge-Based Systems 23(3) (2010)
14. Lichman, M.: UCI machine learning repository (2013)
15. Liu, H., Setiono, R.: Chi2: Feature selection and discretization of numeric attributes. In: ICTAI. IEEE Computer Society, Washington, DC, USA (1995)
16. Manimala, K., Selvi, K., Ahila, R.: Hybrid soft computing techniques for feature selection and parameter optimization in power quality data mining. Applied Soft Computing 11(8) (2011)
17. Ochoa, G.: Error thresholds in genetic algorithms. Evolutionary Computation 14(2) (2006)
18. Oreski, S., Oreski, G.: Genetic algorithm-based heuristic for feature selection in credit risk assessment. Expert Systems with Applications 41(4, Part 2) (2014)
19. Spolaor, N., Lorena, A., Lee, H.: Multi-objective genetic algorithm evaluation in feature selection. In: Takahashi, R., Deb, K., Wanner, E., Greco, S. (eds.) Evolutionary Multi-Criterion Optimization, Lecture Notes in Computer Science, vol. 6576. Springer Berlin Heidelberg (2011)
20. Tan, F., Fu, X., Zhang, Y., Bourgeois, A.G.: A genetic algorithm-based method for feature subset selection. Soft Computing 12(2) (Sep 2007)
21. Xue, B., Zhang, M., Browne, W.N., Yao, X.: A survey on evolutionary computation approaches to feature selection. IEEE Transactions on Evolutionary Computation 20(4) (Aug 2016)
22. Yusta, S.C.: Different metaheuristic strategies to solve the feature selection problem. Pattern Recognition Letters 30(5) (2009)
23. Zhai, Y., Ong, Y.S., Tsang, I.: The emerging big dimensionality. IEEE Computational Intelligence Magazine 9(3) (Aug 2014)


More information

Estimating Error-Dimensionality Relationship for Gene Expression Based Cancer Classification

Estimating Error-Dimensionality Relationship for Gene Expression Based Cancer Classification 1 Estimating Error-Dimensionality Relationship for Gene Expression Based Cancer Classification Feng Chu and Lipo Wang School of Electrical and Electronic Engineering Nanyang Technological niversity Singapore

More information

Improving the Efficiency of Fast Using Semantic Similarity Algorithm

Improving the Efficiency of Fast Using Semantic Similarity Algorithm International Journal of Scientific and Research Publications, Volume 4, Issue 1, January 2014 1 Improving the Efficiency of Fast Using Semantic Similarity Algorithm D.KARTHIKA 1, S. DIVAKAR 2 Final year

More information

Improving Results and Performance of Collaborative Filtering-based Recommender Systems using Cuckoo Optimization Algorithm

Improving Results and Performance of Collaborative Filtering-based Recommender Systems using Cuckoo Optimization Algorithm Improving Results and Performance of Collaborative Filtering-based Recommender Systems using Cuckoo Optimization Algorithm Majid Hatami Faculty of Electrical and Computer Engineering University of Tabriz,

More information

CHAPTER 6 HYBRID AI BASED IMAGE CLASSIFICATION TECHNIQUES

CHAPTER 6 HYBRID AI BASED IMAGE CLASSIFICATION TECHNIQUES CHAPTER 6 HYBRID AI BASED IMAGE CLASSIFICATION TECHNIQUES 6.1 INTRODUCTION The exploration of applications of ANN for image classification has yielded satisfactory results. But, the scope for improving

More information

International Journal of Scientific Research & Engineering Trends Volume 4, Issue 6, Nov-Dec-2018, ISSN (Online): X

International Journal of Scientific Research & Engineering Trends Volume 4, Issue 6, Nov-Dec-2018, ISSN (Online): X Analysis about Classification Techniques on Categorical Data in Data Mining Assistant Professor P. Meena Department of Computer Science Adhiyaman Arts and Science College for Women Uthangarai, Krishnagiri,

More information

Feature Selection Technique to Improve Performance Prediction in a Wafer Fabrication Process

Feature Selection Technique to Improve Performance Prediction in a Wafer Fabrication Process Feature Selection Technique to Improve Performance Prediction in a Wafer Fabrication Process KITTISAK KERDPRASOP and NITTAYA KERDPRASOP Data Engineering Research Unit, School of Computer Engineering, Suranaree

More information

SSV Criterion Based Discretization for Naive Bayes Classifiers

SSV Criterion Based Discretization for Naive Bayes Classifiers SSV Criterion Based Discretization for Naive Bayes Classifiers Krzysztof Grąbczewski kgrabcze@phys.uni.torun.pl Department of Informatics, Nicolaus Copernicus University, ul. Grudziądzka 5, 87-100 Toruń,

More information

Flexible-Hybrid Sequential Floating Search in Statistical Feature Selection

Flexible-Hybrid Sequential Floating Search in Statistical Feature Selection Flexible-Hybrid Sequential Floating Search in Statistical Feature Selection Petr Somol 1,2, Jana Novovičová 1,2, and Pavel Pudil 2,1 1 Dept. of Pattern Recognition, Institute of Information Theory and

More information

Statistical Pattern Recognition

Statistical Pattern Recognition Statistical Pattern Recognition Features and Feature Selection Hamid R. Rabiee Jafar Muhammadi Spring 2014 http://ce.sharif.edu/courses/92-93/2/ce725-2/ Agenda Features and Patterns The Curse of Size and

More information

Chapter 22 Information Gain, Correlation and Support Vector Machines

Chapter 22 Information Gain, Correlation and Support Vector Machines Chapter 22 Information Gain, Correlation and Support Vector Machines Danny Roobaert, Grigoris Karakoulas, and Nitesh V. Chawla Customer Behavior Analytics Retail Risk Management Canadian Imperial Bank

More information

Supervised Variable Clustering for Classification of NIR Spectra

Supervised Variable Clustering for Classification of NIR Spectra Supervised Variable Clustering for Classification of NIR Spectra Catherine Krier *, Damien François 2, Fabrice Rossi 3, Michel Verleysen, Université catholique de Louvain, Machine Learning Group, place

More information

Slides for Data Mining by I. H. Witten and E. Frank

Slides for Data Mining by I. H. Witten and E. Frank Slides for Data Mining by I. H. Witten and E. Frank 7 Engineering the input and output Attribute selection Scheme-independent, scheme-specific Attribute discretization Unsupervised, supervised, error-

More information

BRACE: A Paradigm For the Discretization of Continuously Valued Data

BRACE: A Paradigm For the Discretization of Continuously Valued Data Proceedings of the Seventh Florida Artificial Intelligence Research Symposium, pp. 7-2, 994 BRACE: A Paradigm For the Discretization of Continuously Valued Data Dan Ventura Tony R. Martinez Computer Science

More information

Fast Efficient Clustering Algorithm for Balanced Data

Fast Efficient Clustering Algorithm for Balanced Data Vol. 5, No. 6, 214 Fast Efficient Clustering Algorithm for Balanced Data Adel A. Sewisy Faculty of Computer and Information, Assiut University M. H. Marghny Faculty of Computer and Information, Assiut

More information

Fuzzy Entropy based feature selection for classification of hyperspectral data

Fuzzy Entropy based feature selection for classification of hyperspectral data Fuzzy Entropy based feature selection for classification of hyperspectral data Mahesh Pal Department of Civil Engineering NIT Kurukshetra, 136119 mpce_pal@yahoo.co.uk Abstract: This paper proposes to use

More information

WEIGHTED K NEAREST NEIGHBOR CLASSIFICATION ON FEATURE PROJECTIONS 1

WEIGHTED K NEAREST NEIGHBOR CLASSIFICATION ON FEATURE PROJECTIONS 1 WEIGHTED K NEAREST NEIGHBOR CLASSIFICATION ON FEATURE PROJECTIONS 1 H. Altay Güvenir and Aynur Akkuş Department of Computer Engineering and Information Science Bilkent University, 06533, Ankara, Turkey

More information

Feature Selection in Knowledge Discovery

Feature Selection in Knowledge Discovery Feature Selection in Knowledge Discovery Susana Vieira Technical University of Lisbon, Instituto Superior Técnico Department of Mechanical Engineering, Center of Intelligent Systems, IDMEC-LAETA Av. Rovisco

More information

Feature Selection Based on Relative Attribute Dependency: An Experimental Study

Feature Selection Based on Relative Attribute Dependency: An Experimental Study Feature Selection Based on Relative Attribute Dependency: An Experimental Study Jianchao Han, Ricardo Sanchez, Xiaohua Hu, T.Y. Lin Department of Computer Science, California State University Dominguez

More information

Feature and Search Space Reduction for Label-Dependent Multi-label Classification

Feature and Search Space Reduction for Label-Dependent Multi-label Classification Feature and Search Space Reduction for Label-Dependent Multi-label Classification Prema Nedungadi and H. Haripriya Abstract The problem of high dimensionality in multi-label domain is an emerging research

More information

Statistical dependence measure for feature selection in microarray datasets

Statistical dependence measure for feature selection in microarray datasets Statistical dependence measure for feature selection in microarray datasets Verónica Bolón-Canedo 1, Sohan Seth 2, Noelia Sánchez-Maroño 1, Amparo Alonso-Betanzos 1 and José C. Príncipe 2 1- Department

More information

A Novel Criterion Function in Feature Evaluation. Application to the Classification of Corks.

A Novel Criterion Function in Feature Evaluation. Application to the Classification of Corks. A Novel Criterion Function in Feature Evaluation. Application to the Classification of Corks. X. Lladó, J. Martí, J. Freixenet, Ll. Pacheco Computer Vision and Robotics Group Institute of Informatics and

More information

Selection of Location, Frequency and Orientation Parameters of 2D Gabor Wavelets for Face Recognition

Selection of Location, Frequency and Orientation Parameters of 2D Gabor Wavelets for Face Recognition Selection of Location, Frequency and Orientation Parameters of 2D Gabor Wavelets for Face Recognition Berk Gökberk, M.O. İrfanoğlu, Lale Akarun, and Ethem Alpaydın Boğaziçi University, Department of Computer

More information

A Content Vector Model for Text Classification

A Content Vector Model for Text Classification A Content Vector Model for Text Classification Eric Jiang Abstract As a popular rank-reduced vector space approach, Latent Semantic Indexing (LSI) has been used in information retrieval and other applications.

More information

Weighting and selection of features.

Weighting and selection of features. Intelligent Information Systems VIII Proceedings of the Workshop held in Ustroń, Poland, June 14-18, 1999 Weighting and selection of features. Włodzisław Duch and Karol Grudziński Department of Computer

More information

1. Introduction. 2. Motivation and Problem Definition. Volume 8 Issue 2, February Susmita Mohapatra

1. Introduction. 2. Motivation and Problem Definition. Volume 8 Issue 2, February Susmita Mohapatra Pattern Recall Analysis of the Hopfield Neural Network with a Genetic Algorithm Susmita Mohapatra Department of Computer Science, Utkal University, India Abstract: This paper is focused on the implementation

More information

Distributed Optimization of Feature Mining Using Evolutionary Techniques

Distributed Optimization of Feature Mining Using Evolutionary Techniques Distributed Optimization of Feature Mining Using Evolutionary Techniques Karthik Ganesan Pillai University of Dayton Computer Science 300 College Park Dayton, OH 45469-2160 Dale Emery Courte University

More information

Information-Theoretic Feature Selection Algorithms for Text Classification

Information-Theoretic Feature Selection Algorithms for Text Classification Proceedings of International Joint Conference on Neural Networks, Montreal, Canada, July 31 - August 4, 5 Information-Theoretic Feature Selection Algorithms for Text Classification Jana Novovičová Institute

More information

An Information-Theoretic Approach to the Prepruning of Classification Rules

An Information-Theoretic Approach to the Prepruning of Classification Rules An Information-Theoretic Approach to the Prepruning of Classification Rules Max Bramer University of Portsmouth, Portsmouth, UK Abstract: Keywords: The automatic induction of classification rules from

More information

Feature selection in environmental data mining combining Simulated Annealing and Extreme Learning Machine

Feature selection in environmental data mining combining Simulated Annealing and Extreme Learning Machine Feature selection in environmental data mining combining Simulated Annealing and Extreme Learning Machine Michael Leuenberger and Mikhail Kanevski University of Lausanne - Institute of Earth Surface Dynamics

More information

COMPARISON OF SUBSAMPLING TECHNIQUES FOR RANDOM SUBSPACE ENSEMBLES

COMPARISON OF SUBSAMPLING TECHNIQUES FOR RANDOM SUBSPACE ENSEMBLES COMPARISON OF SUBSAMPLING TECHNIQUES FOR RANDOM SUBSPACE ENSEMBLES SANTHOSH PATHICAL 1, GURSEL SERPEN 1 1 Elecrical Engineering and Computer Science Department, University of Toledo, Toledo, OH, 43606,

More information

Using PageRank in Feature Selection

Using PageRank in Feature Selection Using PageRank in Feature Selection Dino Ienco, Rosa Meo, and Marco Botta Dipartimento di Informatica, Università di Torino, Italy {ienco,meo,botta}@di.unito.it Abstract. Feature selection is an important

More information

Machine Learning Techniques for Data Mining

Machine Learning Techniques for Data Mining Machine Learning Techniques for Data Mining Eibe Frank University of Waikato New Zealand 10/25/2000 1 PART VII Moving on: Engineering the input and output 10/25/2000 2 Applying a learner is not all Already

More information

Univariate and Multivariate Decision Trees

Univariate and Multivariate Decision Trees Univariate and Multivariate Decision Trees Olcay Taner Yıldız and Ethem Alpaydın Department of Computer Engineering Boğaziçi University İstanbul 80815 Turkey Abstract. Univariate decision trees at each

More information

Optimization of Association Rule Mining through Genetic Algorithm

Optimization of Association Rule Mining through Genetic Algorithm Optimization of Association Rule Mining through Genetic Algorithm RUPALI HALDULAKAR School of Information Technology, Rajiv Gandhi Proudyogiki Vishwavidyalaya Bhopal, Madhya Pradesh India Prof. JITENDRA

More information

REMOVAL OF REDUNDANT AND IRRELEVANT DATA FROM TRAINING DATASETS USING SPEEDY FEATURE SELECTION METHOD

REMOVAL OF REDUNDANT AND IRRELEVANT DATA FROM TRAINING DATASETS USING SPEEDY FEATURE SELECTION METHOD Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology ISSN 2320 088X IMPACT FACTOR: 5.258 IJCSMC,

More information

Data Mining. Part 2. Data Understanding and Preparation. 2.4 Data Transformation. Spring Instructor: Dr. Masoud Yaghini. Data Transformation

Data Mining. Part 2. Data Understanding and Preparation. 2.4 Data Transformation. Spring Instructor: Dr. Masoud Yaghini. Data Transformation Data Mining Part 2. Data Understanding and Preparation 2.4 Spring 2010 Instructor: Dr. Masoud Yaghini Outline Introduction Normalization Attribute Construction Aggregation Attribute Subset Selection Discretization

More information

Improved PSO for Feature Selection on High-Dimensional Datasets

Improved PSO for Feature Selection on High-Dimensional Datasets Improved PSO for Feature Selection on High-Dimensional Datasets Binh Tran, Bing Xue, and Mengjie Zhang Victoria University of Wellington, PO Box 600, Wellington 6140, New Zealand {binh.tran,bing.xue,mengjie.zhang}@ecs.vuw.ac.nz

More information

Self-Organizing Maps for cyclic and unbounded graphs

Self-Organizing Maps for cyclic and unbounded graphs Self-Organizing Maps for cyclic and unbounded graphs M. Hagenbuchner 1, A. Sperduti 2, A.C. Tsoi 3 1- University of Wollongong, Wollongong, Australia. 2- University of Padova, Padova, Italy. 3- Hong Kong

More information

Improving Tree-Based Classification Rules Using a Particle Swarm Optimization

Improving Tree-Based Classification Rules Using a Particle Swarm Optimization Improving Tree-Based Classification Rules Using a Particle Swarm Optimization Chi-Hyuck Jun *, Yun-Ju Cho, and Hyeseon Lee Department of Industrial and Management Engineering Pohang University of Science

More information

CS6375: Machine Learning Gautam Kunapuli. Mid-Term Review

CS6375: Machine Learning Gautam Kunapuli. Mid-Term Review Gautam Kunapuli Machine Learning Data is identically and independently distributed Goal is to learn a function that maps to Data is generated using an unknown function Learn a hypothesis that minimizes

More information

Analytical model A structure and process for analyzing a dataset. For example, a decision tree is a model for the classification of a dataset.

Analytical model A structure and process for analyzing a dataset. For example, a decision tree is a model for the classification of a dataset. Glossary of data mining terms: Accuracy Accuracy is an important factor in assessing the success of data mining. When applied to data, accuracy refers to the rate of correct values in the data. When applied

More information

Semi-Supervised Clustering with Partial Background Information

Semi-Supervised Clustering with Partial Background Information Semi-Supervised Clustering with Partial Background Information Jing Gao Pang-Ning Tan Haibin Cheng Abstract Incorporating background knowledge into unsupervised clustering algorithms has been the subject

More information

Cluster homogeneity as a semi-supervised principle for feature selection using mutual information

Cluster homogeneity as a semi-supervised principle for feature selection using mutual information Cluster homogeneity as a semi-supervised principle for feature selection using mutual information Frederico Coelho 1 and Antonio Padua Braga 1 andmichelverleysen 2 1- Universidade Federal de Minas Gerais

More information

FEATURE EXTRACTION TECHNIQUES USING SUPPORT VECTOR MACHINES IN DISEASE PREDICTION

FEATURE EXTRACTION TECHNIQUES USING SUPPORT VECTOR MACHINES IN DISEASE PREDICTION FEATURE EXTRACTION TECHNIQUES USING SUPPORT VECTOR MACHINES IN DISEASE PREDICTION Sandeep Kaur 1, Dr. Sheetal Kalra 2 1,2 Computer Science Department, Guru Nanak Dev University RC, Jalandhar(India) ABSTRACT

More information

An Empirical Study on feature selection for Data Classification

An Empirical Study on feature selection for Data Classification An Empirical Study on feature selection for Data Classification S.Rajarajeswari 1, K.Somasundaram 2 Department of Computer Science, M.S.Ramaiah Institute of Technology, Bangalore, India 1 Department of

More information

Statistical Pattern Recognition

Statistical Pattern Recognition Statistical Pattern Recognition Features and Feature Selection Hamid R. Rabiee Jafar Muhammadi Spring 2012 http://ce.sharif.edu/courses/90-91/2/ce725-1/ Agenda Features and Patterns The Curse of Size and

More information

Statistical Pattern Recognition

Statistical Pattern Recognition Statistical Pattern Recognition Features and Feature Selection Hamid R. Rabiee Jafar Muhammadi Spring 2013 http://ce.sharif.edu/courses/91-92/2/ce725-1/ Agenda Features and Patterns The Curse of Size and

More information

CS6220: DATA MINING TECHNIQUES

CS6220: DATA MINING TECHNIQUES CS6220: DATA MINING TECHNIQUES Image Data: Classification via Neural Networks Instructor: Yizhou Sun yzsun@ccs.neu.edu November 19, 2015 Methods to Learn Classification Clustering Frequent Pattern Mining

More information

Prognosis of Lung Cancer Using Data Mining Techniques

Prognosis of Lung Cancer Using Data Mining Techniques Prognosis of Lung Cancer Using Data Mining Techniques 1 C. Saranya, M.Phil, Research Scholar, Dr.M.G.R.Chockalingam Arts College, Arni 2 K. R. Dillirani, Associate Professor, Department of Computer Science,

More information

Feature Selection for Supervised Classification: A Kolmogorov- Smirnov Class Correlation-Based Filter

Feature Selection for Supervised Classification: A Kolmogorov- Smirnov Class Correlation-Based Filter Feature Selection for Supervised Classification: A Kolmogorov- Smirnov Class Correlation-Based Filter Marcin Blachnik 1), Włodzisław Duch 2), Adam Kachel 1), Jacek Biesiada 1,3) 1) Silesian University

More information

Variable Selection 6.783, Biomedical Decision Support

Variable Selection 6.783, Biomedical Decision Support 6.783, Biomedical Decision Support (lrosasco@mit.edu) Department of Brain and Cognitive Science- MIT November 2, 2009 About this class Why selecting variables Approaches to variable selection Sparsity-based

More information

Rank Measures for Ordering

Rank Measures for Ordering Rank Measures for Ordering Jin Huang and Charles X. Ling Department of Computer Science The University of Western Ontario London, Ontario, Canada N6A 5B7 email: fjhuang33, clingg@csd.uwo.ca Abstract. Many

More information

Features: representation, normalization, selection. Chapter e-9

Features: representation, normalization, selection. Chapter e-9 Features: representation, normalization, selection Chapter e-9 1 Features Distinguish between instances (e.g. an image that you need to classify), and the features you create for an instance. Features

More information