Feature Selection in High Dimensional Data by a Filter-Based Genetic Algorithm


C. De Stefano, F. Fontanella and A. Scotto di Freca
Dipartimento di Ingegneria Elettrica e dell'Informazione (DIEI)
Università di Cassino e del Lazio Meridionale
Via G. Di Biasio, Cassino (FR), Italy
{destefano,fontanella,a.scotto}@unicas.it

Abstract. In classification and clustering problems, feature selection techniques can be used to reduce the dimensionality of the data and increase performance. However, feature selection is a challenging task, especially when hundreds or thousands of features are involved. In this framework, we present a new approach for improving the performance of a filter-based genetic algorithm. The proposed approach consists of two steps: first, the available features are ranked according to a univariate evaluation function; then the search space represented by the first M features in the ranking is searched by a filter-based genetic algorithm in order to find feature subsets with a high discriminative power. Experimental results demonstrate the effectiveness of our approach in dealing with high dimensional data, both in terms of recognition rate and of feature number reduction.

1 Introduction

Recent years have seen a strong growth of applications dealing with huge amounts of data, such as data mining and medical data processing [23]. These kinds of applications often imply classification or clustering problems in which the objects to be classified or clustered are represented as feature vectors. The feature selection problem consists in selecting, from the whole set of available features, the subset providing the most discriminative power. The choice of a good feature subset is crucial: if the selected features do not contain enough information to discriminate patterns belonging to different classes, the performance may be unsatisfactory, regardless of the effectiveness of the classification system employed. Moreover, irrelevant and noisy features unnecessarily enlarge the search space, increasing both the time and the complexity of the learning process.

Feature selection algorithms usually imply the definition of an evaluation function and of a search procedure. Evaluation functions can be divided into two broad classes: univariate and multivariate measures. Univariate measures evaluate the effectiveness of each single feature in discriminating samples belonging to different classes and are used to rank the available features. Once the features have been evaluated, the subset search procedure is straightforward: the features are ranked according to their merit and the best M features are selected, where the parameter M is specified by the user.

These kinds of approaches are very fast and can be used to cope with problems involving even hundreds of thousands of features. The main drawback of these measures is that they cannot take into account interactions that may occur between two or more features. For this reason, features that perform well when used in conjunction with other features are discarded if they perform poorly when used alone. Additionally, the features with the highest scores (merits) are usually similar; therefore, these measures tend to select redundant features [6].

Multivariate measures, instead, evaluate feature subsets by measuring how well patterns belonging to different classes are discriminated when projected in the subspace represented by the subset to be evaluated. These measures are generally classified into two categories: filter and wrapper [8]. Wrapper approaches use classification algorithms to evaluate the goodness of the subsets. This leads to high computational costs when a large number of evaluations is required, especially when large datasets are involved. Filter approaches, instead, are independent of any classification algorithm and, in most cases, are computationally less expensive and more general than wrapper algorithms.

As concerns the search strategies, given a measure, the optimal subset can be found by exhaustively evaluating all the possible solutions. Unfortunately, exhaustive search is impracticable when the cardinality N of the whole set of features Y is high (N > 50). This is due to the fact that the search space, made of all the 2^N possible subsets of Y, grows exponentially with N. For this reason, many heuristic algorithms have been proposed for finding near-optimal solutions [4, 8, 21]. Among these algorithms, greedy strategies that incrementally generate feature subsets have been proposed. Since these algorithms do not take into account complex interactions among the features, in most cases they lead to sub-optimal solutions.

Evolutionary computation (EC) based techniques have been widely used to cope with the feature selection problem [21]. Among the EC-based approaches, Genetic Algorithms (GAs) have been widely used. GA binary vectors provide a natural and straightforward representation for feature subsets: the value 1 or 0 of the i-th chromosome element indicates whether the i-th feature is included or not. Most of the GA approaches use wrapper evaluation functions [21], for which different classification algorithms have been adopted, among them Support Vector Machines (SVMs) [16], K-Nearest Neighbor (KNN) [13] and Artificial Neural Networks (ANNs) [22]. As mentioned above, wrapper evaluation functions lead to high computational costs, since their computational complexity depends on the number of samples actually used for training the classifier. As a consequence, such approaches are not well suited to deal with problems involving a huge number of instances and features. Filter fitness functions have also been used: the approach presented in [19] uses an information theory based evaluation function, while in [11] the authors adopt a consistency measure. Moreover, in [3, 5] the authors present a filter fitness function that extends Fisher's linear discriminant.

Recently, in order to reduce the search space size for high-dimensional datasets, different strategies have been adopted [18, 20, 21] for GA-based algorithms. In [18], the search space reduction for the GA is performed by using different filter approaches: the information they provide is used to build a part of the individuals making up the initial population, and the individuals are then evaluated by means of a neural network based wrapper function. The approach has been tested on a credit risk assessment problem involving just 33 features. In [20] the authors present a new GA-based approach for feature selection in which three different ranking algorithms reduce the search space for a GA that uses an SVM-based wrapper as fitness function; in this case, however, the GA is used in a very limited way, because the search space is reduced to only 12 features. Moreover, in [10] and [12] two different GA-based hybrid approaches that use wrapper fitness functions are proposed and tested on data with no more than 100 features. Finally, in [2] a two-step procedure is used to deal with data involving up to thousands of features. In the first step, the whole set of features is ranked according to a univariate measure; in the second step, the final subset is built by incrementally adding the i-th ranked feature. The process continues as long as the added feature improves the performance of the classifier used for the subset evaluation.

In this paper we present a new GA-based algorithm for feature selection that exploits the advantages of both feature ranking and GAs. The goal is to build a high performance feature selection system that selects a small number of features with respect to the total number of available features. For this purpose, we built a two-module system that combines a feature ranking algorithm with a GA. The proposed system allows us to greatly reduce the number of features to be used in the classification phase. More specifically, the first module uses a feature ranking algorithm to greatly reduce the number of features to be taken into account by the second module: it retains only a given number M (fixed a priori) of promising features, according to the univariate measure used for ranking the whole feature set given in input to the system. The second, GA-based module searches, in the space consisting of the feature subsets made of the features provided by the first module, for the best feature subset by using a filter fitness function that evaluates feature subsets. The layout of the proposed system is shown in Figure 1.

Fig. 1. The layout of the proposed system: the N input features are ranked, the best M of them are passed to the GA module, which outputs the selected feature subset.

Because of the reduction performed by the feature ranking, the search space provided to the GA module is much smaller than that made of all the possible subsets of the whole feature set. The proposed system is based on the hypothesis that this reduced search space still contains most of the promising areas, i.e. those containing good and near-optimal solutions (subsets). In practice, the filtering performed by the ranking module does not discard those features that perform well only when used in combination with other ones; this allows the second, GA-based module to focus its search on these more promising areas.

As concerns the univariate measure for the feature ranking, we used the Chi-square measure introduced in [15]. As evaluation function for the GA module we used the one introduced in [9], namely the Correlation-based Feature Selection (CFS) function.

This function evaluates the merit of a subset by considering both the correlation between the class labels and the single features, and the inter-correlation among the selected features. The computation of the CFS function is organized in two steps: (i) the class-feature correlation vector and the feature-feature correlation matrix are computed a priori for all the features and properly stored; (ii) the subsequent evaluations of the CFS function are obtained by accessing the precomputed vector and matrix. It is worth noting that these computations are independent of the training set size.

The effectiveness of the proposed approach has been tested on four publicly available datasets, whose total number of features ranges from 500 to 10000. Two kinds of comparison were performed: in the former, the results of our approach were compared with those achieved by using different feature selection strategies; in the latter, our results were compared with those obtained by a wrapper-based approach.

The remainder of the paper is organized as follows: in Section 2 the feature evaluation functions are described; Section 3 illustrates the GA used to implement the second module; in Section 4 the experimental results are detailed; finally, Section 5 is devoted to the conclusions.

2 Feature Evaluation Function

As mentioned in the Introduction, feature evaluation functions can be broadly divided into two classes, namely univariate and multivariate measures. In the following, the univariate measure adopted for the ranking module and the subset evaluation criterion used as fitness function of the GA module are detailed.

2.1 Univariate Measures

Univariate measures evaluate the effectiveness of a single feature in discriminating samples belonging to different classes and can be used to sort the whole set of available features. The feature selection approaches which use this kind of measure do not need a search procedure: once the features have been sorted, the best subset of M features consists of the first M features of the ranking. Note that the value of M must be chosen by the user.

For our approach, we used the Chi-square univariate measure [15]. This measure estimates feature merit by using a discretization algorithm: if a feature can be discretized to a single value, it has no discriminative power and can safely be discarded. The discretization algorithm adopts a supervised heuristic method based on the χ² statistic. The range of values of each feature is initially discretized into a certain number of intervals (heuristically determined). Then, the χ² statistic is used to determine whether the relative frequencies of the classes in adjacent intervals are similar enough to justify merging such intervals. The χ² value for two adjacent intervals is computed as:

\chi^2 = \sum_{i=1}^{2} \sum_{j=1}^{C} \frac{(A_{ij} - E_{ij})^2}{E_{ij}}    (1)

where C is the number of classes, A_{ij} is the number of instances of the j-th class in the i-th interval and E_{ij} is the expected frequency of A_{ij}, given by E_{ij} = R_i C_j / N_T, where R_i is the number of instances in the i-th interval, and C_j and N_T are the number of instances of the j-th class and the total number of instances, respectively, in both intervals. The extent of the merging process is controlled by a threshold, whose value represents the maximum admissible difference among the occurrence frequencies of the samples in adjacent intervals. The value of this threshold has been heuristically set during preliminary experiments.
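To make the merging test in (1) concrete, the following is a minimal sketch (with hypothetical function and variable names, not the actual Chi2 implementation of [15]) of computing the χ² statistic for two adjacent intervals from their per-class counts:

```python
import numpy as np

def chi2_adjacent(counts_a, counts_b):
    """Chi-square statistic of eq. (1) for two adjacent intervals.

    counts_a, counts_b: per-class instance counts A_1j and A_2j of the two
    intervals (length-C sequences). Low values indicate similar class
    distributions, i.e. the intervals are candidates for merging.
    """
    A = np.array([counts_a, counts_b], dtype=float)   # 2 x C matrix of A_ij
    R = A.sum(axis=1, keepdims=True)                  # R_i: instances per interval
    Cj = A.sum(axis=0, keepdims=True)                 # C_j: instances per class
    N_T = A.sum()                                     # total instances in both intervals
    E = R * Cj / N_T                                  # expected frequencies E_ij
    E[E == 0] = 1e-12                                 # avoid division by zero
    return float(((A - E) ** 2 / E).sum())

# Two intervals with similar class profiles give a small statistic,
# so a Chi2-style merging step would join them.
print(chi2_adjacent([10, 3], [9, 4]))
```

In the Chi2-style procedure, adjacent intervals whose statistic stays below the merging threshold are joined; a feature whose range collapses to a single interval has no discriminative power and is discarded.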

2.2 Subset Evaluation Functions

Multivariate methods for feature subset evaluation can in turn be divided into two classes: filter and wrapper. The former are based on statistical measures and their outcomes are independent of the classifier actually used. The latter, instead, are based on the classification results achieved by a certain classifier trained on the subset to be evaluated. Wrapper methods are usually computationally more expensive than filter ones, as they require the training of a classifier for each evaluation, making them unsuitable for big data tasks, where huge datasets must be processed. Moreover, while filter-based evaluations are more general, as they give statistical information on the data, wrapper-based evaluations may give rise to a loss of generality, because they depend on the specific classifier used.

In order to introduce the subset evaluation function adopted, let us briefly recall the well known information-theoretic concept of entropy. Given a discrete variable X, which can assume the values {x_1, x_2, ..., x_n}, its entropy H(X) is defined as:

H(X) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)    (2)

where p(x_i) is the probability mass function of the value x_i. The quantity H(X) represents an estimate of the uncertainty of the random variable X. The concept of entropy can be used to define the conditional entropy of two random variables X and Y, taking the values x_i and y_j respectively, as:

H(X|Y) = \sum_{i,j} p(x_i, y_j) \log \frac{p(y_j)}{p(x_i, y_j)}    (3)

where p(x_i, y_j) is the joint probability that, at the same time, X = x_i and Y = y_j. The quantity in (3) represents the amount of randomness in the random variable X when the value of Y is known. Given two features X and Y, their correlation r_XY is computed as follows (the same formula also holds for the feature-class correlation):

r_{XY} = 2.0 \cdot \frac{H(X) + H(Y) - H(X, Y)}{H(X) + H(Y)}    (4)

As fitness function for the GA module we chose a filter called CFS (Correlation-based Feature Selection) [9], which uses a correlation-based heuristic to evaluate feature subset quality. This function takes into account the usefulness of the single features for predicting class labels, along with the level of inter-correlation among them. The idea behind this approach is that good subsets contain features highly correlated with the class and uncorrelated with each other. Given a feature selection problem in which the patterns are represented by means of a set Y of N features, the CFS function computes the merit of a generic subset X ⊆ Y, made of k features, as follows:

f_{CFS}(X) = \frac{k \, \overline{r_{cf}}}{\sqrt{k + k(k-1) \, \overline{r_{ff}}}}    (5)

where r_cf is the average feature-class correlation and r_ff is the average feature-feature correlation. Note that the numerator estimates the discriminative power of the features in X, whereas the denominator assesses the redundancy among them. The CFS function allows the GA to discard both irrelevant and redundant features: the former because they are poor in discriminating the different classes at hand, the latter because they are highly correlated with one or more of the other features. In contrast to previously presented approaches [10, 18], this fitness function is able to automatically find the number of features and does not need the setting of any parameter.

Finally, given a dataset D on which to estimate the quantities in (4) and a feature subset X ⊆ Y to be evaluated, the computation of f_CFS(X) can be made very fast. In fact, before starting the search procedure (the GA in our case), the correlation vector V_cf, containing N elements, and the N × N symmetric correlation matrix M_ff can be computed. The i-th element of V_cf contains the value of the correlation between the i-th feature and the class, whereas the element M_ff[i, j] represents the correlation between the i-th and the j-th feature. Once the values of V_cf and M_ff have been computed, given a subset X containing k features, the computation of f_CFS(X) only requires 2k memory accesses.
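The following is a minimal sketch of how the quantities in (2)-(5) can be organised, assuming discrete (or already discretized) features and hypothetical helper names (entropy, symmetrical_uncertainty, precompute_correlations, cfs_merit). The correlations are computed once, before the search starts, so that each subset evaluation only reads entries of the precomputed V_cf and M_ff:

```python
import numpy as np
from itertools import combinations

def entropy(values):
    """Empirical entropy H(X) of a discrete variable, eq. (2)."""
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def symmetrical_uncertainty(x, y):
    """Correlation r_XY of eq. (4), computed from empirical entropies."""
    joint = np.array([f"{a}|{b}" for a, b in zip(x, y)])   # paired symbols for H(X, Y)
    hx, hy, hxy = entropy(x), entropy(y), entropy(joint)
    denom = hx + hy
    return 0.0 if denom == 0 else 2.0 * (hx + hy - hxy) / denom

def precompute_correlations(X, y):
    """Class-feature vector V_cf and feature-feature matrix M_ff."""
    n_features = X.shape[1]
    V_cf = np.array([symmetrical_uncertainty(X[:, i], y) for i in range(n_features)])
    M_ff = np.eye(n_features)
    for i, j in combinations(range(n_features), 2):
        M_ff[i, j] = M_ff[j, i] = symmetrical_uncertainty(X[:, i], X[:, j])
    return V_cf, M_ff

def cfs_merit(subset, V_cf, M_ff):
    """CFS merit of eq. (5) for a list of feature indices."""
    k = len(subset)
    if k == 0:
        return 0.0
    r_cf = V_cf[subset].mean()                               # average feature-class correlation
    if k == 1:
        r_ff = 0.0
    else:
        r_ff = float(np.mean([M_ff[i, j] for i, j in combinations(subset, 2)]))
    return (k * r_cf) / np.sqrt(k + k * (k - 1) * r_ff)
```

The key design point, as noted above, is that the expensive part (precompute_correlations) depends on the training set only once; every subsequent call to cfs_merit is a handful of look-ups, independent of the number of training samples.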

3 Genetic Algorithms for Feature Selection

In the last decades, Evolutionary Computation techniques have proven very effective as a methodology for solving optimization problems whose search spaces are discontinuous and very complex. In this field, GAs represent a subset of these optimization techniques and have been applied to a wide variety of both numerical and combinatorial optimization problems [12]. In a GA the solutions are represented as binary vectors, and operators such as crossover and mutation are applied to explore the search space made of all possible solutions. GAs can be easily applied to the problem of feature selection: given a set Y having cardinality equal to N, a subset X of Y (X ⊆ Y) can be represented by a binary vector of N elements whose i-th element is set to 1 if the i-th feature is included in X, and to 0 otherwise. Besides the simplicity of the solution encoding, GAs are well suited for this class of problems because the search in this exponential space is very hard, since interactions among features can be highly complex and strongly nonlinear. Some studies on the effectiveness of GAs in solving feature selection problems can be found in [12, 21].

The second module of the system presented here has been implemented by using a generational GA. In order to reduce the computational complexity of the fitness function (see Subsection 2.2), the class-feature correlation vector V_cf and the feature-feature correlation matrix M_ff are pre-computed. Then the GA starts by randomly generating a population of P individuals. Afterwards, the fitness of the generated individuals is evaluated according to the formula in (5). After this preliminary evaluation phase, a new population is generated by selecting P/2 couples of individuals using the roulette wheel method. The one-point crossover operator is then applied to each of the selected couples, according to a given probability factor p_c. Afterwards, the mutation operator is applied with a probability p_m. The value of p_m has been set to 1/N, where N is the chromosome length, i.e. the total number of available features for the problem at hand. This probability value allows, on average, the modification of only one chromosome element; it has been suggested in [17] as the optimal mutation rate below the error threshold of replication. Finally, these individuals are added to the new population. The process just described is repeated for N_g generations.
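A compact sketch of the generational GA described above is given below. It reuses the cfs_merit helper sketched in Subsection 2.2, takes the default parameter values from Table 1 in the next section, and makes arbitrary choices (e.g. no elitism) for details the description above does not fix:

```python
import numpy as np

def run_ga(V_cf, M_ff, n_features, pop_size=100, p_c=0.4, n_gen=500, rng=None):
    """Generational GA over binary chromosomes of length n_features."""
    rng = np.random.default_rng() if rng is None else rng
    p_m = 1.0 / n_features                              # on average one flipped gene per offspring
    pop = rng.integers(0, 2, size=(pop_size, n_features))

    def fitness(ind):
        return cfs_merit(list(np.flatnonzero(ind)), V_cf, M_ff)   # eq. (5)

    fit = np.array([fitness(ind) for ind in pop])
    for _ in range(n_gen):
        new_pop = []
        for _ in range(pop_size // 2):
            # roulette-wheel selection of a couple of parents
            probs = fit / fit.sum() if fit.sum() > 0 else np.full(pop_size, 1.0 / pop_size)
            i, j = rng.choice(pop_size, size=2, p=probs)
            a, b = pop[i].copy(), pop[j].copy()
            # one-point crossover with probability p_c
            if rng.random() < p_c:
                cut = rng.integers(1, n_features)
                a[cut:], b[cut:] = b[cut:].copy(), a[cut:].copy()
            # bit-flip mutation with per-gene probability p_m
            for child in (a, b):
                mask = rng.random(n_features) < p_m
                child[mask] = 1 - child[mask]
            new_pop.extend([a, b])
        pop = np.array(new_pop)
        fit = np.array([fitness(ind) for ind in pop])
    best = pop[int(np.argmax(fit))]
    return np.flatnonzero(best), float(fit.max())
```

The mutation rate 1/n_features follows the error-threshold argument of [17] recalled above, so that each offspring differs from its parent, on average, by a single gene.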

4 Experimental Results

We tested the proposed approach on high dimensional data (from 500 up to 10000 features). For each dataset, a set of values for the parameter M (see Figure 1) has been tested. For each value of M, 30 runs have been performed for the GA module. At the end of every run, the feature subset encoded by the individual with the best fitness has been used to build a Multilayer Perceptron classifier (MLP in the following), trained by using the back propagation algorithm. The classification performance of the built classifiers has been estimated by using the 10-fold cross-validation approach. The results reported in the following have been obtained by averaging the performance of the 30 MLPs built. Some preliminary trials have been performed to set the parameters of the GA and of the MLP, reported in Table 1 and Table 2, respectively. These two sets of parameters have been used for all the experiments reported below.

Table 1. The values of the GA module parameters used in the experiments. Note that p_m depends on the chromosome length, i.e. the total number of available features N_F.

  Parameter               Symbol   Value
  Population size         P        100
  Crossover probability   p_c      0.4
  Mutation probability    p_m      1/N_F
  Number of generations   N_g      500

Table 2. The values of the parameters used for training the MLPs. Note that the number of hidden neurons depends on both the number of input attributes N_a and the number of output classes N_c.

  Parameter        Value
  Learning rate    0.3
  Momentum         0.2
  Hidden neurons   (N_a + N_c)/2
  Epochs           500

4.1 The Datasets

The proposed approach has been tested on the following publicly available datasets: Arcene, Gisette, Madelon [1] and Ucihar [14]. The characteristics of the datasets are summarized in Table 3. They differ in the number of attributes, the number of classes (two-class or multi-class problems) and the number of samples. In particular, Arcene contains mass-spectrometric data from medical tests for cancer diagnosis (ovarian or prostate cancer); it is a two-class classification problem with continuous input variables. Gisette contains images of the confusable handwritten digits four and nine. The dataset was constructed from the MNIST data made available by Yann LeCun of the NEC Research Institute; it is a two-class classification problem with sparse continuous input variables. Madelon contains two-class synthetic data with sparse binary attributes; each class contains a certain number of independently generated Gaussian clusters. It also contains some redundant and useless features. Finally, the Ucihar dataset contains signals from smartphone sensors (accelerometer and gyroscope), recorded from 30 persons wearing a smartphone on the waist. Each person performed six activities, each representing a class of the problem: walking, walking upstairs, walking downstairs, sitting, standing and laying.

Table 3. The datasets used in the experiments.

  Dataset   Attributes   Samples   Classes
  Arcene      10000                   2
  Gisette      5000                   2
  Madelon       500                   2
  Ucihar        561                   6
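To make the evaluation protocol described at the beginning of this section concrete, the sketch below scores one selected feature subset with a 10-fold cross-validated MLP, averaged over 30 differently initialized runs. It assumes scikit-learn; its MLPClassifier only approximates the back-propagation MLP configured as in Table 2 and is not the implementation used in the paper:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

def evaluate_subset(X, y, subset, n_classes, n_runs=30):
    """Mean 10-fold CV accuracy of MLPs trained on the selected features."""
    X_sel = X[:, subset]
    n_hidden = (len(subset) + n_classes) // 2            # (N_a + N_c)/2, as in Table 2
    scores = []
    for seed in range(n_runs):                           # 30 MLPs with different initial weights
        mlp = MLPClassifier(hidden_layer_sizes=(n_hidden,),
                            solver="sgd",
                            learning_rate_init=0.3,      # Table 2: learning rate
                            momentum=0.2,                # Table 2: momentum
                            max_iter=500,                # Table 2: epochs
                            random_state=seed)
        scores.append(cross_val_score(mlp, X_sel, y, cv=10).mean())
    return float(np.mean(scores)), float(np.std(scores))
```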

4.2 Comparison Findings

In order to test the effectiveness of our system, we performed two sets of comparisons. In the first set, we compared the results of the proposed system with those obtained by three different feature selection approaches:

- The feature ranking represented by the first module of our system (Figure 1): given the whole set of N features, it outputs the best M features according to the univariate measure adopted. It will be denoted as RNK in the following.
- The GA used in the second module of the proposed system (Figure 1): given the whole set of N features, it searches for the best solution (subset) by using the GA detailed in Section 3. It will be denoted as GA in the following.
- The third approach taken into account for the comparison is quite similar to our approach, but uses sequential forward floating selection as the search strategy of the second module. This strategy searches the solution space by using a greedy hill-climbing technique: it starts with the empty set of features and, at each step, adds the feature that best improves the evaluation function; the algorithm also verifies whether the criterion can be improved by excluding a feature and, in that case, the worst feature according to the evaluation function is removed from the set (see the illustrative sketch below). We used an improved version of this algorithm, presented in [7]. It will be denoted as RNK-SFS in the following.

The purpose of the first comparison was to test the effectiveness of the proposed approach in improving the performance obtained by using only the feature ranking approach. As regards the second comparison, its aim was to validate our hypothesis: a feature ranking algorithm can be used to improve the performance of a standard GA by locating the promising areas of the whole search space consisting of all the available features. Finally, the goal of the third comparison was to assess the ability of the GA in finding good solutions in the search space provided by the feature ranking module. For all the comparisons, the performance has been evaluated in terms of recognition rate and feature reduction.

As concerns the second set of comparisons, we compared the results of our system with those presented in [2]. The approach taken into account for this comparison is called IWSS and, as mentioned in the Introduction, it is wrapper-based and able to deal with problems involving thousands of features.
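For reference, the following is a minimal, assumption-level sketch of a sequential forward floating search driven by the CFS merit (reusing the cfs_merit helper of Subsection 2.2); it is a simplified illustration, not the improved variant of [7] actually used for RNK-SFS:

```python
def sffs_cfs(candidate_features, V_cf, M_ff):
    """Greedy forward selection with a floating backward step, guided by eq. (5)."""
    selected, best_merit = [], 0.0
    improved = True
    while improved:
        improved = False
        # forward step: add the feature that most improves the merit
        remaining = [f for f in candidate_features if f not in selected]
        gains = [(cfs_merit(selected + [f], V_cf, M_ff), f) for f in remaining]
        if gains:
            merit, feat = max(gains)
            if merit > best_merit:
                selected.append(feat)
                best_merit = merit
                improved = True
        # floating backward step: drop features while that improves the merit further
        while len(selected) > 2:
            drops = [(cfs_merit([f for f in selected if f != g], V_cf, M_ff), g)
                     for g in selected]
            merit, worst = max(drops)
            if merit > best_merit:
                selected.remove(worst)
                best_merit = merit
            else:
                break
    return selected, best_merit
```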

With the purpose of investigating how the value of the parameter M affects the performance of the presented system, we tested several M values. Moreover, since the number of attributes of the considered datasets differs widely, we used two sets of values. The set {100, 200, 500, 1000, 2000} has been considered for the datasets Arcene and Gisette, whose samples are described by 10000 and 5000 attributes, respectively. The set {20, 50, 100, 200, 300} has been used for the datasets Madelon and Ucihar, having 500 and 561 attributes, respectively.

First Set of Comparisons

Since the approaches RNK and RNK-SFS are deterministic, for each value of M they generated a single feature subset. However, in order to perform a fair comparison with the proposed approach, for each generated subset, 30 MLPs have been trained with different, randomly generated, initial weights. The trained MLPs have been evaluated by using the 10-fold cross-validation approach. The results reported in the following have been obtained by averaging the performance of the 30 trained MLPs. Also in this case we used the parameters reported in Table 2. As concerns the results of the GA approach, they have been obtained by using the methodology adopted for our approach, as described at the beginning of the present section. Note that, to statistically validate the comparison results, we performed the nonparametric Wilcoxon rank-sum test (α = 0.05) over the 30 runs.

The comparison results have been grouped according to the different values of M used, and are reported in Table 4 (Arcene and Gisette) and Table 5 (Madelon and Ucihar). In both tables, the second column shows the values of the parameter M, while the recognition rate (RR) and the number of selected features (NF) are reported for each method. It is worth noting that for the RNK method the number of selected features has not been reported, because it coincides with the value of M actually used. In each table, the recognition rates in bold highlight the results which are significantly better than the second best results (starred in the table), according to the Wilcoxon test. For results that do not present a statistically significant difference, the best two results are both starred. Moreover, for each method, when two or more of its results do not present a statistically significant difference, the result achieved with the minimum number of features has been considered. Finally, note that for our approach we used the abbreviation RNK-GA.

The comparison results for the Arcene and Gisette datasets are shown in Table 4.

Table 4. Comparison results for the Arcene and Gisette datasets (for each value of M: recognition rate RR and number of selected features NF of RNK-GA, GA and RNK-SFS, and RR of RNK). Bold values represent the best statistically significant results.

From the table it can be seen that the proposed approach achieves better performance for both datasets. In more detail, for the Arcene dataset a recognition rate of 92.3% has been obtained by using only 465 out of the 10000 features provided in input to the system. This result has been achieved with a value of M equal to 2000 and is significantly better than those obtained with smaller values.

This seems to suggest that, for these smaller values, the ranking module discards features that, although they score poorly according to the χ² measure, are relevant for the classification task when used in conjunction with other features. Nonetheless, the search space reduction performed by the ranking module with M = 2000 (from 2^10000 to 2^2000 candidate subsets) allowed a strong improvement of the performance with respect to the GA, which searched the whole search space. As concerns the second best result, the RNK approach reached a recognition rate of 87.4% by using 500 features, which coincides with the value of M. Note that for M = 2000 the RNK approach performs poorly, indicating that only some of the first 2000 ranked features are actually relevant, while most of them are irrelevant or redundant, and training the MLP with all of them leads to poor classification performance. As for the RNK-SFS method, which got its best result with M = 2000, it performs significantly worse than our approach, showing that the GA has a better searching ability than RNK-SFS.

For the Gisette dataset, the best two results of our system (M = 500 and M = 1000) were not significantly different and, according to the criterion mentioned above, we considered the one that used 41.9 features on average and achieved a recognition rate of 96.7% (M = 500). The results of the other methods were not significantly different from each other. In this case, the GA approach got good results, but selected many more features than RNK-GA. This shows that, even if the GA selected most of the relevant features, it was not able to discard the redundant and irrelevant ones. This is due to the fact that the GA searched the whole space of solutions and could not benefit from the filtering action performed by the ranking module. As regards RNK-SFS, it got its best performance with three different results (M = 500, M = 100 and M = 2000), which did not exhibit any statistically significant difference. The chosen result (M = 500) reached a recognition rate of 95.6%, selecting 73 features. These results exhibit only slight differences also in terms of number of features. This seems to suggest that the SFS algorithm, starting from the 500-feature case, got stuck in suboptimal areas of the search space and was not able to locate new areas containing solutions consisting of a greater number of features.

The results just described seem to confirm that, as the search space grows (exponentially) with M, the GA module of the proposed approach is able to locate new areas of the search space containing better solutions, which include the new features progressively added. In particular, in the case of the Arcene dataset, our system was able to find solutions whose cardinality strongly grows as M increases, obtaining a strong increase of the performance in terms of recognition rate. For the Gisette dataset, instead, the fitness increment of the newly found solutions did not lead to a significant improvement in terms of recognition rate.

The comparison results for the Madelon and Ucihar datasets are shown in Table 5.

Table 5. Comparison results for the Madelon and Ucihar datasets (for each value of M: recognition rate RR and number of selected features NF of RNK-GA, GA and RNK-SFS, and RR of RNK). Bold values represent the best statistically significant results.

From the table it can be observed that for both datasets the proposed system did not significantly outperform the compared systems. As concerns the Madelon dataset, the RNK approach reached its best performance by using the first 20 ranked features. This suggests that the whole Madelon feature set contains a small subset of features that, even when taken separately, have a high discriminative power. This subset can be easily identified, either by RNK, which selected the best 20 features according to the χ² measure (equation (1)), or by the GA, even though it searched the whole search space. As for the Ucihar results, RNK-GA, RNK-SFS and GA outperformed RNK, but achieved performances that did not exhibit any statistically significant difference. RNK-GA and RNK-SFS obtained their best results with M = 200, using a comparable number of features. The GA selected many more features than RNK-GA and RNK-SFS (about three times as many), confirming that it was not able to discard redundant and irrelevant features. Nonetheless, in this case, these features did not affect the MLP training process. These results seem to suggest that, when the number of features to be dealt with is not too large: (i) even less effective search algorithms like SFS can find good solutions in the reduced search space provided by the first module; (ii) the GA alone is still able to locate search space areas containing good solutions, but these may include redundant or irrelevant features.

Second Set of Comparisons

In [2], the IWSS approach has been tested on three classifiers: Naive Bayes (NB), KNN (K = 1) and C4.5. Moreover, among the datasets considered in [2] are three of the four reported in Table 3: Arcene, Gisette and Madelon. The comparison results are shown in Table 6. Note that the second column shows the value of the parameter M that obtained the best result among those tested (the same values used for the first set of comparisons). The recognition rate (RR) and the number of selected features (NF) are also reported. Since the IWSS approach is deterministic, the results reported in [2] refer to a single feature subset and have been obtained using the 10-fold cross-validation technique. For this reason, in order to statistically validate the comparison results, we performed the one-sample Wilcoxon rank-sum test (α = 0.05), comparing the single result of IWSS with the results of RNK-GA over the 30 runs. The values in bold highlight the best result, according to the Wilcoxon test.

Table 6. Comparison results with the IWSS approach (for each of Arcene, Gisette and Madelon: the best value of M, and RR/NF of RNK-GA and IWSS), with three sub-tables: (a) C4.5 algorithm, (b) K Nearest Neighbor, (c) Naive Bayes.

From the table it can be seen that for the Arcene dataset (10000 features) the proposed system greatly outperforms the IWSS approach on the three classifiers considered. As concerns the Gisette dataset (5000 features), for the C4.5 and KNN classifiers the performance of our approach is better than that of IWSS. Finally, as for the Madelon dataset (500 features), IWSS achieves better performance for the C4.5 and KNN classifiers, whereas for the NB classifier our system performs slightly better.

The above results confirm the effectiveness of the proposed approach in dealing with high dimensional data.

In fact, on the Arcene dataset our system largely outperforms the IWSS approach, in spite of the fact that IWSS uses a wrapper evaluation function. This is confirmed by the results on the Gisette dataset, where our system obtained better performance on two of the three classifiers. In this case, the differences in terms of recognition rate are much smaller than those observed on the Arcene dataset. However, these results are similar to those reported in Table 4, where the recognition rate differences for the MLP classifier are not greater than 1%. Only for the NB classifier does IWSS significantly outperform our approach, but this performance is nonetheless worse than that obtained by our approach with the MLP. Finally, as for the Madelon dataset, as mentioned above, it is characterized by a small set of discriminative features that can be easily identified by univariate measures. In this case, therefore, the IWSS approach is favored, because it first ranks the features and then incrementally builds the feature subset by using a greedy strategy.

4.3 Discussion

From the results shown above it can be seen that the univariate measure used in the first module is able to identify most of the relevant features, even though it evaluates the relevance of each feature without taking into account any feature interaction. In practice, what happens is that the interacting features, i.e. those that are singularly of little relevance but become useful when taken together with other features, do not appear at the bottom of the ranking. Thus, these features can always be included by suitably increasing the value of M. It is worth noting that, because of the exponential growth of the search space, even high values of M allow a strong reduction of the search space. An interesting property of our system is that, once the value of M has been correctly set, the GA of the second module is able to discard redundant features: in fact, our approach selected many fewer features than those given in input, or than those selected by the GA taken into account for the comparison.

We want to remark that the above results confirm the assumptions underlying our method: (i) a feature ranking algorithm can be used to preselect an a priori fixed number of features among the whole set of available features; (ii) the search space consisting of the subsets made of these selected features contains most of the good and near-optimal solutions (subsets). In practice, the filtering performed by the feature ranking module makes the task of searching for good solutions easier, and this filtering is crucial in improving the performance of the GA when thousands of features are involved.

5 Conclusions

We present a novel GA-based approach for feature selection which is able to deal with thousands of features. The approach consists of two modules. The first uses a feature ranking based approach that reduces the search space made of the whole set of available features.

This reduction is performed by discarding the features that, according to the univariate measure employed, are less useful for discriminating among the different classes at hand. The second module uses a genetic algorithm to search the solution space provided by the first module. This module employs a correlation-based heuristic function to evaluate the worth of the feature subsets encoded by the individuals. Since we used only filter evaluation functions, the proposed approach shows the following interesting properties: (i) it is independent of the classification system used; (ii) once the correlation data have been computed in the initialization step of the GA, the computational cost of the fitness function does not depend on the training set size. The second property makes our system particularly suitable for problems involving a huge number of instances.

The effectiveness of the proposed system has been tested on data represented in high dimensional spaces. The achieved results have been compared with those obtained by different feature selection strategies, both wrapper and filter. For the datasets containing thousands of features, our method obtains better results than the other methods, both in terms of accuracy and in terms of number of selected features. Future work will investigate different feature evaluation functions, both for the ranking (univariate) and for the GA (multivariate) module. Moreover, the system performance will also be evaluated with different classification schemes.

References

1. NIPS 2003 workshop on feature extraction and feature selection challenge (2003)
2. Bermejo, P., Gámez, J.A., Puerta, J.M.: Improving incremental wrapper-based subset selection via replacement and early stopping. IJPRAI 25(5) (2011)
3. Cordella, L.P., De Stefano, C., Fontanella, F., Marrocco, C., Scotto di Freca, A.: Combining single class features for improving performance of a two stage classifier. In: 20th International Conference on Pattern Recognition (ICPR 2010). IEEE Computer Society (2010)
4. Dash, M., Liu, H.: Feature selection for classification. Intelligent Data Analysis 1(1-4) (1997)
5. De Stefano, C., Fontanella, F., Marrocco, C.: A GA-based feature selection algorithm for remote sensing images. Springer Berlin Heidelberg, Berlin, Heidelberg (2008)
6. De Stefano, C., Fontanella, F., Maniaci, M., Scotto di Freca, A.: A method for scribe distinction in medieval manuscripts using page layout features. In: Maino, G., Foresti, G. (eds.) Image Analysis and Processing - ICIAP 2011, Lecture Notes in Computer Science, vol. 6978. Springer Berlin Heidelberg (2011)
7. Gütlein, M., Frank, E., Hall, M., Karwath, A.: Large scale attribute selection using wrappers. In: Proc. of the IEEE Symposium on Computational Intelligence and Data Mining (CIDM 2009) (2009)
8. Guyon, I., Elisseeff, A.: An introduction to variable and feature selection. J. Mach. Learn. Res. 3 (2003)
9. Hall, M.A.: Correlation-based feature selection for discrete and numeric class machine learning. In: Proceedings of the Seventeenth International Conference on Machine Learning. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (2000)

10. Huang, J., Cai, Y., Xu, X.: A hybrid genetic algorithm for feature selection wrapper based on mutual information. Pattern Recogn. Lett. 28(13) (Oct 2007)
11. Lanzi, P.: Fast feature selection with genetic algorithms: a filter approach. In: IEEE International Conference on Evolutionary Computation (Apr 1997)
12. Lee, J.S., Oh, I.S., Moon, B.R.: Hybrid genetic algorithms for feature selection. IEEE Trans. Pattern Anal. Mach. Intell. 26(11) (2004)
13. Li, R., Lu, J., Zhang, Y., Zhao, T.: Dynamic adaboost learning with feature selection based on parallel genetic algorithm for image annotation. Knowledge-Based Systems 23(3) (2010)
14. Lichman, M.: UCI machine learning repository (2013)
15. Liu, H., Setiono, R.: Chi2: Feature selection and discretization of numeric attributes. In: ICTAI. IEEE Computer Society, Washington, DC, USA (1995)
16. Manimala, K., Selvi, K., Ahila, R.: Hybrid soft computing techniques for feature selection and parameter optimization in power quality data mining. Applied Soft Computing 11(8) (2011)
17. Ochoa, G.: Error thresholds in genetic algorithms. Evolutionary Computation 14(2) (2006)
18. Oreski, S., Oreski, G.: Genetic algorithm-based heuristic for feature selection in credit risk assessment. Expert Systems with Applications 41(4, Part 2) (2014)
19. Spolaor, N., Lorena, A., Lee, H.: Multi-objective genetic algorithm evaluation in feature selection. In: Takahashi, R., Deb, K., Wanner, E., Greco, S. (eds.) Evolutionary Multi-Criterion Optimization, Lecture Notes in Computer Science, vol. 6576. Springer Berlin Heidelberg (2011)
20. Tan, F., Fu, X., Zhang, Y., Bourgeois, A.G.: A genetic algorithm-based method for feature subset selection. Soft Computing 12(2) (Sep 2007)
21. Xue, B., Zhang, M., Browne, W.N., Yao, X.: A survey on evolutionary computation approaches to feature selection. IEEE Transactions on Evolutionary Computation 20(4) (Aug 2016)
22. Yusta, S.C.: Different metaheuristic strategies to solve the feature selection problem. Pattern Recognition Letters 30(5) (2009)
23. Zhai, Y., Ong, Y.S., Tsang, I.: The emerging big dimensionality. IEEE Computational Intelligence Magazine 9(3) (Aug 2014)


More information

Estimating Error-Dimensionality Relationship for Gene Expression Based Cancer Classification

Estimating Error-Dimensionality Relationship for Gene Expression Based Cancer Classification 1 Estimating Error-Dimensionality Relationship for Gene Expression Based Cancer Classification Feng Chu and Lipo Wang School of Electrical and Electronic Engineering Nanyang Technological niversity Singapore

More information

Improving the Efficiency of Fast Using Semantic Similarity Algorithm

Improving the Efficiency of Fast Using Semantic Similarity Algorithm International Journal of Scientific and Research Publications, Volume 4, Issue 1, January 2014 1 Improving the Efficiency of Fast Using Semantic Similarity Algorithm D.KARTHIKA 1, S. DIVAKAR 2 Final year

More information

Improving Results and Performance of Collaborative Filtering-based Recommender Systems using Cuckoo Optimization Algorithm

Improving Results and Performance of Collaborative Filtering-based Recommender Systems using Cuckoo Optimization Algorithm Improving Results and Performance of Collaborative Filtering-based Recommender Systems using Cuckoo Optimization Algorithm Majid Hatami Faculty of Electrical and Computer Engineering University of Tabriz,

More information

CHAPTER 6 HYBRID AI BASED IMAGE CLASSIFICATION TECHNIQUES

CHAPTER 6 HYBRID AI BASED IMAGE CLASSIFICATION TECHNIQUES CHAPTER 6 HYBRID AI BASED IMAGE CLASSIFICATION TECHNIQUES 6.1 INTRODUCTION The exploration of applications of ANN for image classification has yielded satisfactory results. But, the scope for improving

More information

International Journal of Scientific Research & Engineering Trends Volume 4, Issue 6, Nov-Dec-2018, ISSN (Online): X

International Journal of Scientific Research & Engineering Trends Volume 4, Issue 6, Nov-Dec-2018, ISSN (Online): X Analysis about Classification Techniques on Categorical Data in Data Mining Assistant Professor P. Meena Department of Computer Science Adhiyaman Arts and Science College for Women Uthangarai, Krishnagiri,

More information

Feature Selection Technique to Improve Performance Prediction in a Wafer Fabrication Process

Feature Selection Technique to Improve Performance Prediction in a Wafer Fabrication Process Feature Selection Technique to Improve Performance Prediction in a Wafer Fabrication Process KITTISAK KERDPRASOP and NITTAYA KERDPRASOP Data Engineering Research Unit, School of Computer Engineering, Suranaree

More information

SSV Criterion Based Discretization for Naive Bayes Classifiers

SSV Criterion Based Discretization for Naive Bayes Classifiers SSV Criterion Based Discretization for Naive Bayes Classifiers Krzysztof Grąbczewski kgrabcze@phys.uni.torun.pl Department of Informatics, Nicolaus Copernicus University, ul. Grudziądzka 5, 87-100 Toruń,

More information

Flexible-Hybrid Sequential Floating Search in Statistical Feature Selection

Flexible-Hybrid Sequential Floating Search in Statistical Feature Selection Flexible-Hybrid Sequential Floating Search in Statistical Feature Selection Petr Somol 1,2, Jana Novovičová 1,2, and Pavel Pudil 2,1 1 Dept. of Pattern Recognition, Institute of Information Theory and

More information

Statistical Pattern Recognition

Statistical Pattern Recognition Statistical Pattern Recognition Features and Feature Selection Hamid R. Rabiee Jafar Muhammadi Spring 2014 http://ce.sharif.edu/courses/92-93/2/ce725-2/ Agenda Features and Patterns The Curse of Size and

More information

Chapter 22 Information Gain, Correlation and Support Vector Machines

Chapter 22 Information Gain, Correlation and Support Vector Machines Chapter 22 Information Gain, Correlation and Support Vector Machines Danny Roobaert, Grigoris Karakoulas, and Nitesh V. Chawla Customer Behavior Analytics Retail Risk Management Canadian Imperial Bank

More information

Supervised Variable Clustering for Classification of NIR Spectra

Supervised Variable Clustering for Classification of NIR Spectra Supervised Variable Clustering for Classification of NIR Spectra Catherine Krier *, Damien François 2, Fabrice Rossi 3, Michel Verleysen, Université catholique de Louvain, Machine Learning Group, place

More information

Slides for Data Mining by I. H. Witten and E. Frank

Slides for Data Mining by I. H. Witten and E. Frank Slides for Data Mining by I. H. Witten and E. Frank 7 Engineering the input and output Attribute selection Scheme-independent, scheme-specific Attribute discretization Unsupervised, supervised, error-

More information

BRACE: A Paradigm For the Discretization of Continuously Valued Data

BRACE: A Paradigm For the Discretization of Continuously Valued Data Proceedings of the Seventh Florida Artificial Intelligence Research Symposium, pp. 7-2, 994 BRACE: A Paradigm For the Discretization of Continuously Valued Data Dan Ventura Tony R. Martinez Computer Science

More information

Fast Efficient Clustering Algorithm for Balanced Data

Fast Efficient Clustering Algorithm for Balanced Data Vol. 5, No. 6, 214 Fast Efficient Clustering Algorithm for Balanced Data Adel A. Sewisy Faculty of Computer and Information, Assiut University M. H. Marghny Faculty of Computer and Information, Assiut

More information

Fuzzy Entropy based feature selection for classification of hyperspectral data

Fuzzy Entropy based feature selection for classification of hyperspectral data Fuzzy Entropy based feature selection for classification of hyperspectral data Mahesh Pal Department of Civil Engineering NIT Kurukshetra, 136119 mpce_pal@yahoo.co.uk Abstract: This paper proposes to use

More information

WEIGHTED K NEAREST NEIGHBOR CLASSIFICATION ON FEATURE PROJECTIONS 1

WEIGHTED K NEAREST NEIGHBOR CLASSIFICATION ON FEATURE PROJECTIONS 1 WEIGHTED K NEAREST NEIGHBOR CLASSIFICATION ON FEATURE PROJECTIONS 1 H. Altay Güvenir and Aynur Akkuş Department of Computer Engineering and Information Science Bilkent University, 06533, Ankara, Turkey

More information

Feature Selection in Knowledge Discovery

Feature Selection in Knowledge Discovery Feature Selection in Knowledge Discovery Susana Vieira Technical University of Lisbon, Instituto Superior Técnico Department of Mechanical Engineering, Center of Intelligent Systems, IDMEC-LAETA Av. Rovisco

More information

Feature Selection Based on Relative Attribute Dependency: An Experimental Study

Feature Selection Based on Relative Attribute Dependency: An Experimental Study Feature Selection Based on Relative Attribute Dependency: An Experimental Study Jianchao Han, Ricardo Sanchez, Xiaohua Hu, T.Y. Lin Department of Computer Science, California State University Dominguez

More information

Feature and Search Space Reduction for Label-Dependent Multi-label Classification

Feature and Search Space Reduction for Label-Dependent Multi-label Classification Feature and Search Space Reduction for Label-Dependent Multi-label Classification Prema Nedungadi and H. Haripriya Abstract The problem of high dimensionality in multi-label domain is an emerging research

More information

Statistical dependence measure for feature selection in microarray datasets

Statistical dependence measure for feature selection in microarray datasets Statistical dependence measure for feature selection in microarray datasets Verónica Bolón-Canedo 1, Sohan Seth 2, Noelia Sánchez-Maroño 1, Amparo Alonso-Betanzos 1 and José C. Príncipe 2 1- Department

More information

A Novel Criterion Function in Feature Evaluation. Application to the Classification of Corks.

A Novel Criterion Function in Feature Evaluation. Application to the Classification of Corks. A Novel Criterion Function in Feature Evaluation. Application to the Classification of Corks. X. Lladó, J. Martí, J. Freixenet, Ll. Pacheco Computer Vision and Robotics Group Institute of Informatics and

More information

Selection of Location, Frequency and Orientation Parameters of 2D Gabor Wavelets for Face Recognition

Selection of Location, Frequency and Orientation Parameters of 2D Gabor Wavelets for Face Recognition Selection of Location, Frequency and Orientation Parameters of 2D Gabor Wavelets for Face Recognition Berk Gökberk, M.O. İrfanoğlu, Lale Akarun, and Ethem Alpaydın Boğaziçi University, Department of Computer

More information

A Content Vector Model for Text Classification

A Content Vector Model for Text Classification A Content Vector Model for Text Classification Eric Jiang Abstract As a popular rank-reduced vector space approach, Latent Semantic Indexing (LSI) has been used in information retrieval and other applications.

More information

Weighting and selection of features.

Weighting and selection of features. Intelligent Information Systems VIII Proceedings of the Workshop held in Ustroń, Poland, June 14-18, 1999 Weighting and selection of features. Włodzisław Duch and Karol Grudziński Department of Computer

More information

1. Introduction. 2. Motivation and Problem Definition. Volume 8 Issue 2, February Susmita Mohapatra

1. Introduction. 2. Motivation and Problem Definition. Volume 8 Issue 2, February Susmita Mohapatra Pattern Recall Analysis of the Hopfield Neural Network with a Genetic Algorithm Susmita Mohapatra Department of Computer Science, Utkal University, India Abstract: This paper is focused on the implementation

More information

Distributed Optimization of Feature Mining Using Evolutionary Techniques

Distributed Optimization of Feature Mining Using Evolutionary Techniques Distributed Optimization of Feature Mining Using Evolutionary Techniques Karthik Ganesan Pillai University of Dayton Computer Science 300 College Park Dayton, OH 45469-2160 Dale Emery Courte University

More information

Information-Theoretic Feature Selection Algorithms for Text Classification

Information-Theoretic Feature Selection Algorithms for Text Classification Proceedings of International Joint Conference on Neural Networks, Montreal, Canada, July 31 - August 4, 5 Information-Theoretic Feature Selection Algorithms for Text Classification Jana Novovičová Institute

More information

An Information-Theoretic Approach to the Prepruning of Classification Rules

An Information-Theoretic Approach to the Prepruning of Classification Rules An Information-Theoretic Approach to the Prepruning of Classification Rules Max Bramer University of Portsmouth, Portsmouth, UK Abstract: Keywords: The automatic induction of classification rules from

More information

Feature selection in environmental data mining combining Simulated Annealing and Extreme Learning Machine

Feature selection in environmental data mining combining Simulated Annealing and Extreme Learning Machine Feature selection in environmental data mining combining Simulated Annealing and Extreme Learning Machine Michael Leuenberger and Mikhail Kanevski University of Lausanne - Institute of Earth Surface Dynamics

More information

COMPARISON OF SUBSAMPLING TECHNIQUES FOR RANDOM SUBSPACE ENSEMBLES

COMPARISON OF SUBSAMPLING TECHNIQUES FOR RANDOM SUBSPACE ENSEMBLES COMPARISON OF SUBSAMPLING TECHNIQUES FOR RANDOM SUBSPACE ENSEMBLES SANTHOSH PATHICAL 1, GURSEL SERPEN 1 1 Elecrical Engineering and Computer Science Department, University of Toledo, Toledo, OH, 43606,

More information

Using PageRank in Feature Selection

Using PageRank in Feature Selection Using PageRank in Feature Selection Dino Ienco, Rosa Meo, and Marco Botta Dipartimento di Informatica, Università di Torino, Italy {ienco,meo,botta}@di.unito.it Abstract. Feature selection is an important

More information

Machine Learning Techniques for Data Mining

Machine Learning Techniques for Data Mining Machine Learning Techniques for Data Mining Eibe Frank University of Waikato New Zealand 10/25/2000 1 PART VII Moving on: Engineering the input and output 10/25/2000 2 Applying a learner is not all Already

More information

Univariate and Multivariate Decision Trees

Univariate and Multivariate Decision Trees Univariate and Multivariate Decision Trees Olcay Taner Yıldız and Ethem Alpaydın Department of Computer Engineering Boğaziçi University İstanbul 80815 Turkey Abstract. Univariate decision trees at each

More information

Optimization of Association Rule Mining through Genetic Algorithm

Optimization of Association Rule Mining through Genetic Algorithm Optimization of Association Rule Mining through Genetic Algorithm RUPALI HALDULAKAR School of Information Technology, Rajiv Gandhi Proudyogiki Vishwavidyalaya Bhopal, Madhya Pradesh India Prof. JITENDRA

More information

REMOVAL OF REDUNDANT AND IRRELEVANT DATA FROM TRAINING DATASETS USING SPEEDY FEATURE SELECTION METHOD

REMOVAL OF REDUNDANT AND IRRELEVANT DATA FROM TRAINING DATASETS USING SPEEDY FEATURE SELECTION METHOD Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology ISSN 2320 088X IMPACT FACTOR: 5.258 IJCSMC,

More information

Data Mining. Part 2. Data Understanding and Preparation. 2.4 Data Transformation. Spring Instructor: Dr. Masoud Yaghini. Data Transformation

Data Mining. Part 2. Data Understanding and Preparation. 2.4 Data Transformation. Spring Instructor: Dr. Masoud Yaghini. Data Transformation Data Mining Part 2. Data Understanding and Preparation 2.4 Spring 2010 Instructor: Dr. Masoud Yaghini Outline Introduction Normalization Attribute Construction Aggregation Attribute Subset Selection Discretization

More information

Improved PSO for Feature Selection on High-Dimensional Datasets

Improved PSO for Feature Selection on High-Dimensional Datasets Improved PSO for Feature Selection on High-Dimensional Datasets Binh Tran, Bing Xue, and Mengjie Zhang Victoria University of Wellington, PO Box 600, Wellington 6140, New Zealand {binh.tran,bing.xue,mengjie.zhang}@ecs.vuw.ac.nz

More information

Self-Organizing Maps for cyclic and unbounded graphs

Self-Organizing Maps for cyclic and unbounded graphs Self-Organizing Maps for cyclic and unbounded graphs M. Hagenbuchner 1, A. Sperduti 2, A.C. Tsoi 3 1- University of Wollongong, Wollongong, Australia. 2- University of Padova, Padova, Italy. 3- Hong Kong

More information

Improving Tree-Based Classification Rules Using a Particle Swarm Optimization

Improving Tree-Based Classification Rules Using a Particle Swarm Optimization Improving Tree-Based Classification Rules Using a Particle Swarm Optimization Chi-Hyuck Jun *, Yun-Ju Cho, and Hyeseon Lee Department of Industrial and Management Engineering Pohang University of Science

More information

CS6375: Machine Learning Gautam Kunapuli. Mid-Term Review

CS6375: Machine Learning Gautam Kunapuli. Mid-Term Review Gautam Kunapuli Machine Learning Data is identically and independently distributed Goal is to learn a function that maps to Data is generated using an unknown function Learn a hypothesis that minimizes

More information

Analytical model A structure and process for analyzing a dataset. For example, a decision tree is a model for the classification of a dataset.

Analytical model A structure and process for analyzing a dataset. For example, a decision tree is a model for the classification of a dataset. Glossary of data mining terms: Accuracy Accuracy is an important factor in assessing the success of data mining. When applied to data, accuracy refers to the rate of correct values in the data. When applied

More information

Semi-Supervised Clustering with Partial Background Information

Semi-Supervised Clustering with Partial Background Information Semi-Supervised Clustering with Partial Background Information Jing Gao Pang-Ning Tan Haibin Cheng Abstract Incorporating background knowledge into unsupervised clustering algorithms has been the subject

More information

Cluster homogeneity as a semi-supervised principle for feature selection using mutual information

Cluster homogeneity as a semi-supervised principle for feature selection using mutual information Cluster homogeneity as a semi-supervised principle for feature selection using mutual information Frederico Coelho 1 and Antonio Padua Braga 1 andmichelverleysen 2 1- Universidade Federal de Minas Gerais

More information

FEATURE EXTRACTION TECHNIQUES USING SUPPORT VECTOR MACHINES IN DISEASE PREDICTION

FEATURE EXTRACTION TECHNIQUES USING SUPPORT VECTOR MACHINES IN DISEASE PREDICTION FEATURE EXTRACTION TECHNIQUES USING SUPPORT VECTOR MACHINES IN DISEASE PREDICTION Sandeep Kaur 1, Dr. Sheetal Kalra 2 1,2 Computer Science Department, Guru Nanak Dev University RC, Jalandhar(India) ABSTRACT

More information

An Empirical Study on feature selection for Data Classification

An Empirical Study on feature selection for Data Classification An Empirical Study on feature selection for Data Classification S.Rajarajeswari 1, K.Somasundaram 2 Department of Computer Science, M.S.Ramaiah Institute of Technology, Bangalore, India 1 Department of

More information

Statistical Pattern Recognition

Statistical Pattern Recognition Statistical Pattern Recognition Features and Feature Selection Hamid R. Rabiee Jafar Muhammadi Spring 2012 http://ce.sharif.edu/courses/90-91/2/ce725-1/ Agenda Features and Patterns The Curse of Size and

More information

Statistical Pattern Recognition

Statistical Pattern Recognition Statistical Pattern Recognition Features and Feature Selection Hamid R. Rabiee Jafar Muhammadi Spring 2013 http://ce.sharif.edu/courses/91-92/2/ce725-1/ Agenda Features and Patterns The Curse of Size and

More information

CS6220: DATA MINING TECHNIQUES

CS6220: DATA MINING TECHNIQUES CS6220: DATA MINING TECHNIQUES Image Data: Classification via Neural Networks Instructor: Yizhou Sun yzsun@ccs.neu.edu November 19, 2015 Methods to Learn Classification Clustering Frequent Pattern Mining

More information

Prognosis of Lung Cancer Using Data Mining Techniques

Prognosis of Lung Cancer Using Data Mining Techniques Prognosis of Lung Cancer Using Data Mining Techniques 1 C. Saranya, M.Phil, Research Scholar, Dr.M.G.R.Chockalingam Arts College, Arni 2 K. R. Dillirani, Associate Professor, Department of Computer Science,

More information

Feature Selection for Supervised Classification: A Kolmogorov- Smirnov Class Correlation-Based Filter

Feature Selection for Supervised Classification: A Kolmogorov- Smirnov Class Correlation-Based Filter Feature Selection for Supervised Classification: A Kolmogorov- Smirnov Class Correlation-Based Filter Marcin Blachnik 1), Włodzisław Duch 2), Adam Kachel 1), Jacek Biesiada 1,3) 1) Silesian University

More information

Variable Selection 6.783, Biomedical Decision Support

Variable Selection 6.783, Biomedical Decision Support 6.783, Biomedical Decision Support (lrosasco@mit.edu) Department of Brain and Cognitive Science- MIT November 2, 2009 About this class Why selecting variables Approaches to variable selection Sparsity-based

More information

Rank Measures for Ordering

Rank Measures for Ordering Rank Measures for Ordering Jin Huang and Charles X. Ling Department of Computer Science The University of Western Ontario London, Ontario, Canada N6A 5B7 email: fjhuang33, clingg@csd.uwo.ca Abstract. Many

More information

Features: representation, normalization, selection. Chapter e-9

Features: representation, normalization, selection. Chapter e-9 Features: representation, normalization, selection Chapter e-9 1 Features Distinguish between instances (e.g. an image that you need to classify), and the features you create for an instance. Features

More information