Parameter Tuning for Induction-Algorithm-Oriented Feature Elimination

Ying Yang and Xindong Wu, University of Vermont

Induction-algorithm-oriented feature elimination, with particular parameter configurations, can achieve higher predictive accuracy than existing popular feature selection approaches. The authors propose two sets of well-tuned parameters based on empirical analysis.

Feature selection has long been an active research topic in machine learning. Beginning with an empty set of features, it selects the features most necessary for learning a target concept. Feature elimination, a newer technique, starts out with a full set of features and eliminates those most unnecessary for learning the target concept. At first glance, these two approaches seem to be minor variants on the same theme. However, feature elimination tends to be more effective,1 can capture interacting features more easily, and suffers less from feature interaction than feature selection.2 Because the most unnecessary features are eliminated from the beginning, they will not mislead the induction process in terms of efficiency or accuracy.3

Traditional feature elimination involves measuring each feature's relevance to the target concept.4,5 The process uses general characteristics of the training data to exclude some features and include others; it doesn't care what induction algorithm will use those output features. Because different algorithms have different biases, this blindness can have severe implications for inductive learning.6 The problem is exacerbated by the fact that the goal of most inductive learning is to maximize predictive accuracy, not just discover relevant features.

To address these issues, George John, Ron Kohavi, and Karl Pfleger proposed induction-algorithm-oriented feature elimination (IAOFE).6 Kohavi and John augmented the method by proposing the wrapper approach, which searches for unnecessary features using the induction algorithm of interest.2 Because of this research and others' efforts, IAOFE can improve predictive accuracy for induction algorithm families such as decision trees, naïve Bayes, Bayesian networks, and nearest neighbors. However, because of the way it works, abundant tunable parameters control IAOFE. To understand how to achieve the best performance possible from IAOFE, we conducted a comprehensive analysis of IAOFE parameter tuning.

The IAOFE approach

For each data set, IAOFE eliminates the features that don't enhance an induction algorithm's learning performance for a particular data domain. Figure 1 presents an algorithm that delineates IAOFE's generic framework in the context of classification learning. You can couple IAOFE with any induction algorithm for a data domain as long as you use the same algorithm for induction later. So, you can tailor the elimination to a particular algorithm and domain, and the outcome can range from eliminating no features to eliminating them all. If IAOFE decides to eliminate all features, using the prior distribution of the classes is sufficient for classification.

Input: training data T containing a set of features F and class labels C, and an induction algorithm IA
Output: a subset of F that can optimize the induction performance of IA
{
  eliminated_features = empty_set;
  WHILE (|eliminated_features| < |F|)
    candidates = empty_set;
    improve = 0;
    remaining = F - eliminated_features;
    benchmark_performance = IA(T(remaining, C));
    FOREACH feature i in remaining
      performance = IA(T(remaining - {i}, C));
      if compare(performance, benchmark_performance) == decrease
        continue;
      elseif compare(performance, benchmark_performance) == nochange
        push(candidates, i);
        improve = 1;
      else
        clear(candidates);
        push(candidates, i);
        benchmark_performance = performance;
        improve = 1;
    END of FOREACH
    if (improve == 0) break;
    eliminated_features = eliminated_features + candidates;
  END of WHILE
  F = F - eliminated_features;
  return (F);
}

Figure 1. An algorithm delineating the generic framework of the induction-algorithm-oriented feature elimination method, in the context of classification learning.
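Read literally, the Figure 1 loop is a hill-climbing backward search: in each pass it finds the features whose removal does not decrease (or best improves) the estimated performance, eliminates them, and stops when no removal helps. The following is a minimal runnable Python sketch of that loop, ours rather than the authors' implementation; the names iaofe and estimate_accuracy are assumptions, the accuracy-estimation strategy is supplied as a callable, and ties are broken first-best for brevity.

from typing import Callable, Hashable, Iterable, Set

def iaofe(features: Iterable[Hashable],
          estimate_accuracy: Callable[[Set[Hashable]], float],
          tolerance: float = 0.0) -> Set[Hashable]:
    """Backward feature elimination driven by an induction algorithm.

    estimate_accuracy(subset) should train the induction algorithm of interest
    on the given feature subset and return an accuracy estimate (training-data
    accuracy, cross-validation, or bootstrap). A feature is a candidate for
    elimination when accuracy >= (1 - tolerance) * benchmark.
    """
    remaining = set(features)
    while remaining:
        benchmark = estimate_accuracy(remaining)
        best_feature, best_accuracy = None, None
        for f in remaining:
            accuracy = estimate_accuracy(remaining - {f})
            if accuracy < (1.0 - tolerance) * benchmark:
                continue  # removing f degrades performance beyond the tolerance
            if best_accuracy is None or accuracy > best_accuracy:
                best_feature, best_accuracy = f, accuracy  # first-best tie breaking
        if best_feature is None:
            break  # no feature can be eliminated without hurting performance
        remaining.discard(best_feature)
    return remaining

The estimate_accuracy closure is where the coupling with the induction algorithm happens: it can wrap any classifier, which is what lets IAOFE tailor the elimination to the algorithm that will later be used for induction.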
Parameter tuning

To the best of our knowledge, researchers have studied only one issue about IAOFE parameter tuning: the search strategy that searches the feature space for candidates to be eliminated. Kohavi and John compared a hill-climbing search and a best-first search.2 They also briefly discussed strategies proposed by other researchers, such as beam, bidirectional, simulated-annealing, and genetic search. These alternative strategies produced no significant differences. Most algorithms from the IAOFE family use hill-climbing search.

We propose four IAOFE parameters: accuracy estimation, tolerance, tie breaker, and voting. Previous implementations of IAOFE usually explicitly use cross-validation for accuracy estimation and implicitly use the meticulous approach for tolerance and a random or first-best approach for tie breaking. They don't use voting at all.

Accuracy estimation

Feature elimination involves finding a set of unnecessary features under some objective function. Common objective functions include classification accuracy, classifier structure, and minimization of the features retained.6 The first of these is used most often because a typical goal of inductive learning is to maximize classification accuracy on previously unseen data (predictive accuracy). So, for IAOFE we focus on using classification accuracy to measure whether to eliminate a feature. This parameter, which we call accuracy estimation, can take three values.

Training data accuracy. This is possibly the most natural approach to estimating the accuracy of a classifier that's been induced from a set of training data.4 It involves inducing the classifier from the entire set of training data and then estimating the learned classifier's accuracy on that same data set. However, this doesn't show how well the classifier will perform when it makes new predictions for data it hasn't already seen. So, the resulting IAOFE might be susceptible to overfitting. On the other hand, using training-data accuracy is efficient. It can ameliorate IAOFE's major disadvantage, its computational overhead when repeatedly sampling training data to evaluate each feature.

Cross-validation. Cross-validation is one way to reduce overfitting. Instead of estimating training-data accuracy, it estimates generalization accuracy on the part of the training data that is held out when training the classifier. In k-fold cross-validation,13,14 you randomly split the training-data set D into k mutually exclusive subsets (the folds) of approximately equal size, D_1, D_2, ..., D_k, and then train and test the classifier k times. Each time t in {1, 2, ..., k}, you train on D - D_t and test on D_t. The cross-validation estimate of accuracy is the overall number of correct classifications divided by the number of instances in the data set. Cross-validation can be either stratified or unstratified. In stratified cross-validation, the folds contain approximately the same proportion of classes as the original data set. This is good if you know that the training data embodies the correct proportion of classes, where stratified cross-validation can reduce the variance among the estimates. However, for many real-world data sets, you don't know whether the training data are representative in the class proportion, so imposing that proportion on the cross-validation folds might introduce a bias. In these cases, unstratified cross-validation might be appropriate: you create the folds randomly without considering class proportion.

Bootstrap. Bootstrap was introduced by Bradley Efron15 and later fully described by Efron and Robert Tibshirani.16 It avoids overfitting by estimating the generalization accuracy based on resampling. In the simplest form of bootstrap, instead of repeatedly analyzing data subsets as cross-validation does, you repeatedly analyze data subsamples. Each subsample is a random sample, with replacement, from the full sample.
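To make the resampling-based settings concrete, here is a small illustrative Python sketch, ours and not the authors' code, that builds the index partitions both estimators rest on: k folds (stratified or unstratified) for cross-validation, and repeated samples drawn with replacement for bootstrap. An estimate_accuracy callable for the elimination sketch above could be assembled from either.

import random
from collections import defaultdict

def kfold_indices(labels, k=10, stratified=True, seed=0):
    """Split instance indices into k folds, optionally preserving class proportions."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    if stratified:
        by_class = defaultdict(list)
        for i, y in enumerate(labels):
            by_class[y].append(i)
        groups = list(by_class.values())
    else:
        groups = [list(range(len(labels)))]
    for group in groups:
        rng.shuffle(group)
        for j, i in enumerate(group):
            folds[j % k].append(i)  # deal shuffled indices round-robin into the folds
    return folds

def bootstrap_indices(n, repetitions=10, seed=0):
    """Draw `repetitions` resamples of size n, each sampled with replacement."""
    rng = random.Random(seed)
    return [[rng.randrange(n) for _ in range(n)] for _ in range(repetitions)]

For cross-validation, each fold is held out in turn while the classifier is trained on the rest, and the accuracy estimate is the total number of correct held-out classifications divided by the number of instances, exactly as described above.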

Tolerance

For the convenience of expression, we abbreviate the classification performance resulting from eliminating some feature as current performance, or CP, and the best performance up to this stage as benchmark performance, or BP. IAOFE decides to eliminate a feature when doing so doesn't degrade the induced classifier's performance, that is, when CP >= (1 - e) * BP. The parameter e, tolerance, controls how tolerant IAOFE can be. There are three possible settings:

Meticulous. e = 0. IAOFE compares CP with the exact value of BP to see whether performance decreases at all.

Conservative. e < 0. IAOFE eliminates a feature only when doing so increases classification performance above a certain degree. The elimination is conservative because it picks only malicious features that can damage the induction; it retains the features that don't affect (neither damage nor contribute to) induction performance. However, if the absolute value of e is set too high, IAOFE might not eliminate any features at all, and the whole procedure will quickly terminate.

Aggressive. e > 0. As long as eliminating a feature doesn't decrease performance beyond its tolerance, IAOFE will eliminate it. By this means, IAOFE tends to maximize the number of eliminated features at the cost of insignificantly decreasing classification accuracy. This can be useful in situations where the compactness of the induced concept is a major concern.

Tie breaker

Eliminating different attributes often results in identical predictive accuracies of an induction algorithm. We call these attributes candidates. We need a tie breaker parameter to decide which ones to actually eliminate. It can have various settings:

Random. The most intuitive way is to randomly pick a feature from the pool of candidates.

First best. Follow the practice of ID3-like induction algorithms; that is, choose the first encountered candidate that produces the highest improvement in accuracy.

Classifier structure. All else being equal, eliminate a feature if doing so will produce the most compact structure of the resulting classifier (if there is a structure at all, such as a decision tree). This can be desirable if judged by Occam's razor.

Wholesale. Eliminate all candidates at once and begin the next loop to look for new candidates. This risks eliminating a necessary feature if it has a redundant copy. But it can significantly speed up IAOFE by freeing it from loops of reevaluating and eliminating one candidate at a time. This boost in efficiency can sometimes overshadow the risk of wrongly eliminating candidates, especially when efficiency is a big issue.

Voting

Although it's effective, IAOFE tends to be computationally expensive when there are large amounts of training data or many features. This problem is compounded by the fact that large data sets are routinely involved in modern research. So, we have a fourth parameter, voting, to address this scenario. The idea is to randomly partition a large data set into k parts, apply IAOFE to each part, and let the returned results from each part vote for the final decision on what features to eliminate. In this way, IAOFE can avoid dealing directly with large amounts of data, instead adopting a divide-and-conquer approach. Another important use of this mechanism is when you apply IAOFE to distributed data, which is a common practice in data mining. In this scenario, instead of fusing all the data together, which is normally impossible, IAOFE can work on each data portion in parallel. This is both more feasible and more efficient. The voting parameter has two settings (a sketch of the mechanism follows below):

Unanimous. Eliminate a feature only when each part of the data supports the elimination.

Majority. Eliminate a feature as long as the majority of the data parts agree to do so.
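The voting mechanism composes naturally with the per-part elimination. Below is an illustrative Python sketch, ours rather than the authors' code: should_eliminate encodes the tolerance rule (eliminate when CP >= (1 - e) * BP), and vote_elimination runs a caller-supplied elimination routine on each data part and then combines the per-part decisions unanimously or by majority.

from typing import Callable, Hashable, Iterable, List, Sequence, Set

def should_eliminate(cp: float, bp: float, tolerance: float = 0.0) -> bool:
    """Tolerance rule: eliminate when CP >= (1 - tolerance) * BP.
    tolerance = 0 is meticulous, tolerance < 0 conservative, tolerance > 0 aggressive."""
    return cp >= (1.0 - tolerance) * bp

def vote_elimination(parts: Sequence[object],
                     eliminate_on_part: Callable[[object], Set[Hashable]],
                     all_features: Iterable[Hashable],
                     unanimous: bool = True) -> Set[Hashable]:
    """Run feature elimination on each data part and vote on the final decision."""
    per_part: List[Set[Hashable]] = [eliminate_on_part(part) for part in parts]
    eliminated = set()
    for f in all_features:
        votes = sum(1 for chosen in per_part if f in chosen)
        if unanimous and votes == len(per_part):
            eliminated.add(f)      # every part supports eliminating f
        elif not unanimous and votes > len(per_part) / 2:
            eliminated.add(f)      # a majority of parts supports eliminating f
    return eliminated

Here eliminate_on_part would typically wrap the earlier iaofe sketch applied to one random partition of a large data set or to one portion of a distributed data set.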
Experiments

We experimented with tuning parameters for IAOFE to find effective configurations of parameter settings. We then compared IAOFE with those configurations against existing popular feature selection approaches.

Design

We used the C4.5 induction algorithm as our example because it's one of the most extensively used. For each data set, we conducted a threefold unstratified cross-validation. In each fold, we withheld the test data, conducted IAOFE on the training data, and obtained the retained feature subset. Using the retained features, we then induced a decision tree classifier from the training data and applied it to the test data. We took various measurements, such as classification accuracy, the number of eliminated features, and the decision tree size, and averaged them over the three folds. When tuning a parameter, we kept the other parameters' settings invariant so that only this parameter's settings would produce the performance differences.

These three folds belonged to an outer loop, which differs from the inner loop that IAOFE might conduct on the training data; the latter is where we tuned the parameters. We used unstratified cross-validation for the outer loop because it represents many real-world scenarios, where training data are only a sample of the whole population and the class distributions don't necessarily correspond with those of the whole population. In this way, we expected to have a better understanding of IAOFE's pros and cons as it's used in real-world applications.

Data

We weren't picky with the experimental data. We grabbed as many data sets as we could (36) from the University of California, Irvine's machine learning repository17 because we believed that IAOFE could tune itself to different domains. However, because we planned to conduct cross-validation for evaluation, we excluded data sets with fewer than 100 instances: if a data set is extremely small, cross-validation can incur a very high variance and thus not be indicative of an approach's true performance. Table 1 lists the data sets, including the number of features and instances, in increasing order of the number of instances.

[Table 1. Experimental data sets: the number of features and the number of instances for each of the 36 data sets, listed in increasing order of the number of instances.]

We used two statistics to evaluate the experimental results. The arithmetic mean of a particular measurement (such as classification accuracy) across all data sets provides a gross indication of competing methods' relative performance. It's debatable whether values in different data sets are commensurable and hence whether averaging across data sets is meaningful. Nonetheless, a low mean value indicates a tendency toward low values for individual data sets. The second statistic, the win/lose/tie record, gives the number of data sets for which the decision tree classifier trained with IAOFE obtains a higher, lower, or equal classification accuracy compared with the classifier trained with alternative methods. If we apply a one-tailed sign test to each record and the test result is significantly low (at the 0.05 critical level), the outcome is unlikely to be obtained by chance. Thus, the record of wins to losses represents a systematic underlying advantage of IAOFE with respect to the type of data sets studied.
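The one-tailed sign test used for the win/lose/tie records ignores the ties and asks how probable at least the observed number of wins would be if wins and losses were equally likely. A small illustrative Python sketch, ours rather than the authors' code:

from math import comb

def sign_test_p(wins: int, losses: int) -> float:
    """One-tailed sign test: probability of at least `wins` successes in
    wins + losses trials under a fair coin. Ties are excluded from the test."""
    n = wins + losses
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n

# Example: a 24/9/3 win/lose/tie record ignores the 3 ties and tests 24 wins in 33 trials.
print(sign_test_p(24, 9))  # well below the 0.05 critical level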

Rivals

One goal of this research is to identify effective parameter configurations that can make IAOFE outperform alternative methods. Considering the large amount of literature existing for feature selection methods, evaluating every one of them is impractical. However, several methods are most commonly cited in the literature and most commonly used in practice.

Chi square, stemming from statistics,18 has a long history in feature selection. The Chi square value can measure the association strength between a categorical class and a categorical feature. We calculate it from the contingency table of the class and the feature and then compare it with a threshold value according to a confidence level. A value higher than the threshold can lead to a judgment that dependency exists between the class and the feature, and thus the feature will be selected. If a feature is numeric, it will normally be discretized before the Chi square approach is employed.

Information gain and gain ratio have been popularized with the extensive use of decision tree induction.19 Given entropy as a measure of data impurity with regard to class, information gain measures the expected reduction of entropy caused by partitioning the data according to a feature's different values. The higher the information gain a feature achieves, the more likely it will be chosen for the classification. However, information gain has a disadvantage in that it prefers features with a large number of values that partition the data into many small, pure subsets. This can lead to choosing features such as birthdate, name, ID number, and so on that have poor generalization strength and thus aren't predictive for future unseen data. Gain ratio can overcome this problem by introducing split information, a term taking into account how the feature partitions the data. The more uniformly data are distributed among a feature's values, the higher its split information. Gain ratio equals information gain divided by split information. Although commonly used, information gain and gain ratio assume that features are independent of each other, which is often not true in real-world applications.
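As a concrete illustration of these two entropy-based measures, here is a small Python sketch, ours rather than anything from the article, that scores a single categorical feature against a categorical class:

from collections import Counter
from math import log2

def entropy(labels):
    """Class impurity: negative sum of p * log2(p) over the class proportions."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Expected reduction of class entropy from partitioning on the feature."""
    n = len(labels)
    partitions = {}
    for v, y in zip(feature_values, labels):
        partitions.setdefault(v, []).append(y)
    remainder = sum(len(part) / n * entropy(part) for part in partitions.values())
    return entropy(labels) - remainder

def gain_ratio(feature_values, labels):
    """Information gain divided by the feature's split information."""
    split_info = entropy(feature_values)  # entropy of the feature's own value distribution
    gain = information_gain(feature_values, labels)
    return gain / split_info if split_info > 0 else 0.0

# Example: a two-valued feature that separates the two classes fairly well.
print(gain_ratio(["a", "a", "b", "b", "b"], ["+", "+", "-", "-", "+"]))

A many-valued, identifier-like feature drives split information up, which is exactly how gain ratio penalizes the birthdate-and-ID-number features that information gain tends to favor.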
Relief, another commonly cited approach in the literature, estimates features according to how well their values distinguish among instances that are close to each other.20 For that purpose, Relief, for a given instance, searches for its two nearest neighbors: one from the same class and the other from a different class. It selects the features that can differentiate between instances from different classes and at the same time have the same value for instances from the same class. The original Relief could only deal with two-class problems. Igor Kononenko further extended Relief to deal with multiclass problems.21

Another common approach is Focus, which exhaustively examines all feature subsets and selects the minimal subset that is sufficient to determine the class.22 However, this exhaustive search strategy is prohibitively expensive when the training data size is large. Because our experiments routinely included data sets involving many instances and features, Focus wasn't feasible, so we didn't evaluate it.

Results and analysis

There are four areas of experimental results, one for tuning each parameter.

Accuracy estimation. For each data set, Table 2 presents the average classification accuracy (with the standard deviation resulting from the outer-loop threefold cross-validation) and the average number of eliminated features, corresponding to different settings of accuracy estimation. The table also presents as benchmarks those measurements on the original data, where no feature elimination was applied. Training accuracy represents training data accuracy, CV_stratified represents stratified cross-validation, and CV_unstratified represents unstratified cross-validation. We used tenfold cross-validation for both CV_stratified and CV_unstratified because it offers a decent evaluation for a classifier.13 To make a fair comparison, we also used 10-time resampling for Bootstrap.

The data showed that all settings can eliminate features while maintaining accuracy similar to the original data. CV_stratified, CV_unstratified, and Bootstrap each obtained a similar mean accuracy (in terms of resampling data to estimate accuracy). CV_stratified slightly beats CV_unstratified and Bootstrap, with its win/lose/tie records equal to 15/10/11 and 12/10/14, respectively. When we estimated accuracy using the training data (which didn't resample data at all), IAOFE achieved the highest mean accuracy. It also competed with the alternatives without any statistically significant loss (win/lose/tie records against CV_stratified, CV_unstratified, and Bootstrap are 12/17/7, 19/11/6, and 13/15/8, respectively).

[Table 2. Parameter tuning for accuracy estimation. For each data set, the table reports classification accuracy (with standard deviation) and the number of eliminated features under the Original, Training accuracy, CV_stratified, CV_unstratified, and Bootstrap settings.]

Tolerance. Table 3 shows that although conservative settings obtained the highest mean accuracy, they seldom eliminated any features. (We didn't round up the decimal for the average number of eliminated features because we wanted to show that elimination does happen, but very rarely.) Even with a seemingly marginal value such as e = -2 percent, elimination occurred rarely. This suggests that most unnecessary features don't affect classification accuracy. The malicious features that can actually damage classification accuracy aren't routine cases.

[Table 3. Parameter tuning for tolerance. For each data set, the table reports classification accuracy (with standard deviation) and the number of eliminated features under the meticulous setting, the conservative settings (-5 percent and -2 percent), and the aggressive settings (+2 percent and +5 percent).]

On the contrary, aggressive settings can be much more effective in reducing the number of features, though at the cost of (sometimes greatly) decreasing accuracy. Fortunately, the meticulous style achieved a happy middle between the aggressive setting and the conservative setting. Its mean accuracy approximately matched that of the conservative ones, and it eliminated almost as many features as the aggressive ones did.

[Table 4. Parameter tuning for tie breaker. For each data set, the table reports classification accuracy (with standard deviation) and the number of eliminated features under the random, first best, classifier structure, and wholesale settings.]

Tie breaker. According to Table 4, as we predicted, the wholesale setting produces the lowest mean accuracy. This poor performance stems from the fact that wholesale tends to eliminate necessary features as well as their redundant copies. The settings of random, first best, and classifier structure (the decision tree size in our experiments) seemed to end in a tie. However, a closer look reveals that classifier structure still had an advantage from using more heuristics. Compared with random, first best, and wholesale, it was more accurate more often than not, with win/lose/tie records of 12/10/14, 14/8/14, and 18/3/15, respectively.

Voting. As we mentioned earlier, we're particularly curious about the voting strategy's effect on IAOFE when the data are large or distributed, or both. We tested the five largest data sets (each with more than 3,000 instances) from our experimental suite (see Table 5). Unanimous voting always produced better accuracy than majority voting. Also, compared with the original data, voting effectively eliminated features without any significant loss in accuracy. This suggests that voting is effective and feasible for IAOFE to deal with large and distributed data.

[Table 5. Parameter tuning for voting. For the five largest data sets (Splice, Krvskp, Sick, Mushroom, and Nursery), the table reports classification accuracy (with standard deviation) on the original data, and the accuracy and number of eliminated features under unanimous and majority voting.]

Comparison with existing methods

Based on empirical evidence, we suggest two configurations that can best explore IAOFE's power (see Table 6). Configuration 1 is most desirable overall, and Configuration 2 is the best configuration among the resampling approaches.

Table 6. Suggested configurations for parameter settings.

Configuration              | Accuracy estimation         | Tolerance  | Tie breaker          | Voting
1: IAOFE_trainingAccuracy  | Training data accuracy      | Meticulous | Classifier structure | Unanimous
2: IAOFE_CVstratified      | Stratified cross-validation | Meticulous | Classifier structure | Unanimous

Thus, using training data accuracy for accuracy estimation, meticulous for tolerance, classifier structure for tie breaking, and unanimous for voting, IAOFE is more accurate than other popular feature selection approaches.

We named IAOFE with these two configurations IAOFE_trainingAccuracy and IAOFE_CVstratified, respectively, and ran them against other popular feature selection approaches. Table 7 shows the results. This time, however, instead of reporting the average number of eliminated features as we did earlier, we report the average decision tree size induced by IAOFE for each data set.

Both configurations can reduce the number of features without significantly degrading classification accuracy. Their win/lose/tie records against the original data are 14/14/8 and 12/17/7, respectively. IAOFE_trainingAccuracy (Configuration 1) achieves the highest mean accuracy among all the competing methods. As for win/lose/tie results, it obtained 24/9/3 against Chi square, 22/10/4 against information gain, 22/10/4 against gain ratio, and 23/9/4 against Relief, each resulting in a sign test result below the 0.05 critical level. That is, the frequency of wins to losses for each comparison is statistically significant.

This surprisingly good performance of IAOFE_trainingAccuracy is of special interest because it suggests we can use IAOFE more extensively than we'd thought for classifying unseen data. We had worried about IAOFE's computational overhead if it employed a resampling methodology for evaluating each feature. Apparently, we can avoid this problem by estimating training data accuracy. However, this method doesn't perform so well in reducing the decision tree size. In some data sets, such as Bupa and Cylinder-bands, it produced a tree size even larger than that of the original data. Because of the way we implemented IAOFE, the fold with the biggest contribution to the tree size increase always tended to have the biggest contribution to the accuracy increase as well. As we explained earlier, we chose classification accuracy as the objective function.
In other words, IAOFE maximizes classification accuracy but doesn't optimize the classifier structure. You can make corresponding changes to the objective function if the classifier structure is a major concern.

IAOFE_CVstratified (Configuration 2) also achieved a higher mean accuracy than the existing methods. Its win/lose/tie records, although not as statistically significant as IAOFE_trainingAccuracy's, also favored IAOFE (23/12/1 against Chi square, 21/14/1 against information gain, 18/16/2 against gain ratio, and 20/12/4 against Relief). However, IAOFE_CVstratified was stronger than IAOFE_trainingAccuracy in terms of improving the classifier structure, so it could be more useful when the classifier structure is as important as accuracy.
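For readers following the illustrative sketches above, the two suggested configurations in Table 6 correspond to parameter choices along the following lines; the dictionary keys and values are assumptions of ours for illustration, not an interface defined by the article.

# Hypothetical parameter sets mirroring Table 6 (names are ours, for illustration only).
IAOFE_TRAINING_ACCURACY = {
    "accuracy_estimation": "training_data",         # no resampling
    "tolerance": 0.0,                               # meticulous
    "tie_breaker": "classifier_structure",          # prefer the most compact classifier
    "voting": "unanimous",
}

IAOFE_CV_STRATIFIED = {
    "accuracy_estimation": "stratified_cross_validation",
    "tolerance": 0.0,
    "tie_breaker": "classifier_structure",
    "voting": "unanimous",
}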

[Table 7. Comparison with existing methods. For each data set, the table reports classification accuracy (with standard deviation) and the decision tree size for the original data, IAOFE_CVstratified, IAOFE_trainingAccuracy, Chi square, information gain, gain ratio, and Relief.]

Selecting values for important parameters has been crucial in heuristic learning.23 For inductive learning, the question of whether a feature is relevant to the target concept is less useful than the question of whether a feature is necessary for learning the target concept given an induction algorithm. These two issues are not equivalent.2 However, traditional feature selection and elimination approaches are blind to the induction algorithm that will finally use their output features. Thus, we expect that IAOFE, an approach that considers the biases of an induction algorithm and takes the algorithm itself as part of the evaluation function to eliminate features, will be very useful. However, with lots of parameters offering omnifarious performances, IAOFE can't be ideally effective without parameter tuning.

Our experimental data was particularly comprehensive: 36 data sets, from different real-world and artificial domains, containing categorical as well as numeric features, with 150 to 10,000+ instances and 5 to 61 features. Although no parameter settings can be universally optimal, the results reported here might have some degree of generality and thus might be useful in practice.

Acknowledgments

We are grateful to Hill Zhu for suggesting useful references for this article.

References

1. D. Koller and M. Sahami, "Toward Optimal Feature Selection," Proc. 13th Int'l Conf. Machine Learning (ICML 96), Morgan Kaufmann, 1996.
2. R. Kohavi and G.H. John, "Wrappers for Feature Subset Selection," Artificial Intelligence, special issue on relevance, vol. 97, nos. 1-2, 1997.
3. X. Wu and D. Urpani, "Induction by Attribute Elimination," IEEE Trans. Knowledge and Data Eng., vol. 11, no. 5, 1999.
4. A. Blum and P. Langley, "Selection of Relevant Features and Examples in Machine Learning," Artificial Intelligence, vol. 97, nos. 1-2, 1997.
5. M. Dash and H. Liu, "Feature Selection for Classification," Intelligent Data Analysis, vol. 1, no. 3, 1997.
6. G.H. John, R. Kohavi, and K. Pfleger, "Irrelevant Features and the Subset Selection Problem," Proc. 11th Int'l Conf. Machine Learning, Morgan Kaufmann, 1994.
7. P. Langley and S. Sage, "Induction of Selective Bayesian Classifiers," Proc. 10th Conf. Uncertainty in Artificial Intelligence, Morgan Kaufmann, 1994.
8. P. Langley and S. Sage, "Oblivious Decision Trees and Abstract Cases," Working Notes, AAAI-94 Workshop on Case-Based Reasoning, AAAI Press, 1994.
9. M.J. Pazzani, "Searching for Dependencies in Bayesian Classifiers," Proc. 5th Int'l Workshop Artificial Intelligence and Statistics, Springer-Verlag, 1996.
10. G.M. Provan and M. Singh, "Learning Bayesian Networks Using Feature Selection," Proc. 5th Int'l Workshop Artificial Intelligence and Statistics, Springer-Verlag, 1995.
11. M. Singh and G.M. Provan, "A Comparison of Induction Algorithms for Selective and Non-Selective Bayesian Classifiers," Proc. 12th Int'l Conf. Machine Learning, Morgan Kaufmann, 1995.
12. W.N. Street, O.L. Mangasarian, and W.H. Wolberg, "An Inductive Learning Approach to Prognostic Prediction," Proc. 12th Int'l Conf. Machine Learning, Morgan Kaufmann, 1995.
13. R. Kohavi, "A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection," Proc. 14th Int'l Joint Conf. Artificial Intelligence (IJCAI), Morgan Kaufmann, 1995.
14. R. Kohavi and F. Provost, "Glossary of Terms," Machine Learning, special issue on applications of machine learning and the knowledge discovery process, vol. 30, 1998.
15. B. Efron, "Estimating the Error Rate of a Prediction Rule: Improvement on Cross-Validation," J. Am. Statistical Assoc., vol. 78, no. 382, 1983.
16. B. Efron and R. Tibshirani, An Introduction to the Bootstrap, Chapman and Hall.
17. C.L. Blake and C.J. Merz, UCI Repository of Machine Learning Databases, Dept. of Information and Computer Science, Univ. of California, Irvine, 1998; ~mlearn/mlrepository.html.
18. G.W. Snedecor and W. Cochran, Statistical Methods, Iowa State Univ. Press.
19. J.R. Quinlan, "Induction of Decision Trees," Machine Learning, vol. 1, 1986.
20. K. Kira and L. Rendell, "The Feature Selection Problem: Traditional Methods and a New Algorithm," Proc. 10th Nat'l Conf. Artificial Intelligence, AAAI Press, 1992.
21. I. Kononenko, "Estimating Attributes: Analysis and Extensions of RELIEF," Proc. European Conf. Machine Learning, Springer-Verlag, 1994.
22. C. Cardie, "Using Decision Trees to Improve Case-Based Learning," Proc. 10th Int'l Conf. Machine Learning, Morgan Kaufmann, 1993.
23. Z. Michalewicz and D.B. Fogel, How to Solve It: Modern Heuristics, Springer-Verlag.

The Authors

Ying Yang is a postdoctoral research associate in the Department of Computer Science at the University of Vermont. Her research interests are in machine learning and data mining. She received her PhD in computer science from Monash University, Australia. Contact her at the Dept. of Computer Science, 343 Votey Bldg., Univ. of Vermont, Burlington, VT 05405; yyang@emba.uvm.edu.

Xindong Wu is a professor in and the chair of the Department of Computer Science at the University of Vermont. He received his PhD in artificial intelligence from the University of Edinburgh. He is the executive editor of Knowledge and Information Systems, chair of the Steering Committee of the IEEE International Conference on Data Mining, a series editor of the Springer book series on advanced information and knowledge processing, and chair of the IEEE Computer Society Technical Committee on Computational Intelligence. Contact him at the Dept. of Computer Science, 351 Votey Bldg., Univ. of Vermont, Burlington, VT 05405; xwu@emba.uvm.edu.


More information

Feature Selection. CE-725: Statistical Pattern Recognition Sharif University of Technology Spring Soleymani

Feature Selection. CE-725: Statistical Pattern Recognition Sharif University of Technology Spring Soleymani Feature Selection CE-725: Statistical Pattern Recognition Sharif University of Technology Spring 2013 Soleymani Outline Dimensionality reduction Feature selection vs. feature extraction Filter univariate

More information

FEATURE SELECTION BASED ON INFORMATION THEORY, CONSISTENCY AND SEPARABILITY INDICES.

FEATURE SELECTION BASED ON INFORMATION THEORY, CONSISTENCY AND SEPARABILITY INDICES. FEATURE SELECTION BASED ON INFORMATION THEORY, CONSISTENCY AND SEPARABILITY INDICES. Włodzisław Duch 1, Krzysztof Grąbczewski 1, Tomasz Winiarski 1, Jacek Biesiada 2, Adam Kachel 2 1 Dept. of Informatics,

More information

Evaluating Classifiers

Evaluating Classifiers Evaluating Classifiers Charles Elkan elkan@cs.ucsd.edu January 18, 2011 In a real-world application of supervised learning, we have a training set of examples with labels, and a test set of examples with

More information

Regularization and model selection

Regularization and model selection CS229 Lecture notes Andrew Ng Part VI Regularization and model selection Suppose we are trying select among several different models for a learning problem. For instance, we might be using a polynomial

More information

Analytical model A structure and process for analyzing a dataset. For example, a decision tree is a model for the classification of a dataset.

Analytical model A structure and process for analyzing a dataset. For example, a decision tree is a model for the classification of a dataset. Glossary of data mining terms: Accuracy Accuracy is an important factor in assessing the success of data mining. When applied to data, accuracy refers to the rate of correct values in the data. When applied

More information

Speeding up Logistic Model Tree Induction

Speeding up Logistic Model Tree Induction Speeding up Logistic Model Tree Induction Marc Sumner 1,2,EibeFrank 2,andMarkHall 2 Institute for Computer Science University of Freiburg Freiburg, Germany sumner@informatik.uni-freiburg.de Department

More information

Chapter 12 Feature Selection

Chapter 12 Feature Selection Chapter 12 Feature Selection Xiaogang Su Department of Statistics University of Central Florida - 1 - Outline Why Feature Selection? Categorization of Feature Selection Methods Filter Methods Wrapper Methods

More information

OPTIMIZATION OF BAGGING CLASSIFIERS BASED ON SBCB ALGORITHM

OPTIMIZATION OF BAGGING CLASSIFIERS BASED ON SBCB ALGORITHM OPTIMIZATION OF BAGGING CLASSIFIERS BASED ON SBCB ALGORITHM XIAO-DONG ZENG, SAM CHAO, FAI WONG Faculty of Science and Technology, University of Macau, Macau, China E-MAIL: ma96506@umac.mo, lidiasc@umac.mo,

More information

Genetic Programming for Data Classification: Partitioning the Search Space

Genetic Programming for Data Classification: Partitioning the Search Space Genetic Programming for Data Classification: Partitioning the Search Space Jeroen Eggermont jeggermo@liacs.nl Joost N. Kok joost@liacs.nl Walter A. Kosters kosters@liacs.nl ABSTRACT When Genetic Programming

More information

Dynamic Ensemble Construction via Heuristic Optimization

Dynamic Ensemble Construction via Heuristic Optimization Dynamic Ensemble Construction via Heuristic Optimization Şenay Yaşar Sağlam and W. Nick Street Department of Management Sciences The University of Iowa Abstract Classifier ensembles, in which multiple

More information

Using Decision Trees and Soft Labeling to Filter Mislabeled Data. Abstract

Using Decision Trees and Soft Labeling to Filter Mislabeled Data. Abstract Using Decision Trees and Soft Labeling to Filter Mislabeled Data Xinchuan Zeng and Tony Martinez Department of Computer Science Brigham Young University, Provo, UT 84602 E-Mail: zengx@axon.cs.byu.edu,

More information

Competence-guided Editing Methods for Lazy Learning

Competence-guided Editing Methods for Lazy Learning Competence-guided Editing Methods for Lazy Learning Elizabeth McKenna and Barry Smyth Abstract. Lazy learning algorithms retain their raw training examples and defer all example-processing until problem

More information

MIT 801. Machine Learning I. [Presented by Anna Bosman] 16 February 2018

MIT 801. Machine Learning I. [Presented by Anna Bosman] 16 February 2018 MIT 801 [Presented by Anna Bosman] 16 February 2018 Machine Learning What is machine learning? Artificial Intelligence? Yes as we know it. What is intelligence? The ability to acquire and apply knowledge

More information

How do we obtain reliable estimates of performance measures?

How do we obtain reliable estimates of performance measures? How do we obtain reliable estimates of performance measures? 1 Estimating Model Performance How do we estimate performance measures? Error on training data? Also called resubstitution error. Not a good

More information

Univariate Margin Tree

Univariate Margin Tree Univariate Margin Tree Olcay Taner Yıldız Department of Computer Engineering, Işık University, TR-34980, Şile, Istanbul, Turkey, olcaytaner@isikun.edu.tr Abstract. In many pattern recognition applications,

More information

Trimmed bagging a DEPARTMENT OF DECISION SCIENCES AND INFORMATION MANAGEMENT (KBI) Christophe Croux, Kristel Joossens and Aurélie Lemmens

Trimmed bagging a DEPARTMENT OF DECISION SCIENCES AND INFORMATION MANAGEMENT (KBI) Christophe Croux, Kristel Joossens and Aurélie Lemmens Faculty of Economics and Applied Economics Trimmed bagging a Christophe Croux, Kristel Joossens and Aurélie Lemmens DEPARTMENT OF DECISION SCIENCES AND INFORMATION MANAGEMENT (KBI) KBI 0721 Trimmed Bagging

More information

Handling Missing Values via Decomposition of the Conditioned Set

Handling Missing Values via Decomposition of the Conditioned Set Handling Missing Values via Decomposition of the Conditioned Set Mei-Ling Shyu, Indika Priyantha Kuruppu-Appuhamilage Department of Electrical and Computer Engineering, University of Miami Coral Gables,

More information

Weighting and selection of features.

Weighting and selection of features. Intelligent Information Systems VIII Proceedings of the Workshop held in Ustroń, Poland, June 14-18, 1999 Weighting and selection of features. Włodzisław Duch and Karol Grudziński Department of Computer

More information

Feature Selection Algorithm with Discretization and PSO Search Methods for Continuous Attributes

Feature Selection Algorithm with Discretization and PSO Search Methods for Continuous Attributes Feature Selection Algorithm with Discretization and PSO Search Methods for Continuous Attributes Madhu.G 1, Rajinikanth.T.V 2, Govardhan.A 3 1 Dept of Information Technology, VNRVJIET, Hyderabad-90, INDIA,

More information

Cover Page. The handle holds various files of this Leiden University dissertation.

Cover Page. The handle   holds various files of this Leiden University dissertation. Cover Page The handle http://hdl.handle.net/1887/22055 holds various files of this Leiden University dissertation. Author: Koch, Patrick Title: Efficient tuning in supervised machine learning Issue Date:

More information

Fuzzy Partitioning with FID3.1

Fuzzy Partitioning with FID3.1 Fuzzy Partitioning with FID3.1 Cezary Z. Janikow Dept. of Mathematics and Computer Science University of Missouri St. Louis St. Louis, Missouri 63121 janikow@umsl.edu Maciej Fajfer Institute of Computing

More information

What is Learning? CS 343: Artificial Intelligence Machine Learning. Raymond J. Mooney. Problem Solving / Planning / Control.

What is Learning? CS 343: Artificial Intelligence Machine Learning. Raymond J. Mooney. Problem Solving / Planning / Control. What is Learning? CS 343: Artificial Intelligence Machine Learning Herbert Simon: Learning is any process by which a system improves performance from experience. What is the task? Classification Problem

More information

Efficiently Handling Feature Redundancy in High-Dimensional Data

Efficiently Handling Feature Redundancy in High-Dimensional Data Efficiently Handling Feature Redundancy in High-Dimensional Data Lei Yu Department of Computer Science & Engineering Arizona State University Tempe, AZ 85287-5406 leiyu@asu.edu Huan Liu Department of Computer

More information

Logistic Model Tree With Modified AIC

Logistic Model Tree With Modified AIC Logistic Model Tree With Modified AIC Mitesh J. Thakkar Neha J. Thakkar Dr. J.S.Shah Student of M.E.I.T. Asst.Prof.Computer Dept. Prof.&Head Computer Dept. S.S.Engineering College, Indus Engineering College

More information

An Effective Performance of Feature Selection with Classification of Data Mining Using SVM Algorithm

An Effective Performance of Feature Selection with Classification of Data Mining Using SVM Algorithm Proceedings of the National Conference on Recent Trends in Mathematical Computing NCRTMC 13 427 An Effective Performance of Feature Selection with Classification of Data Mining Using SVM Algorithm A.Veeraswamy

More information

Comparing Case-Based Bayesian Network and Recursive Bayesian Multi-net Classifiers

Comparing Case-Based Bayesian Network and Recursive Bayesian Multi-net Classifiers Comparing Case-Based Bayesian Network and Recursive Bayesian Multi-net Classifiers Eugene Santos Dept. of Computer Science and Engineering University of Connecticut Storrs, CT 06268, USA FAX:(860)486-1273

More information

A Selective Sampling Approach to Active Feature Selection

A Selective Sampling Approach to Active Feature Selection A Selective Sampling Approach to Active Feature Selection Huan Liu 1, Hiroshi Motoda 2, Lei Yu 1 1 Department of Computer Science & Engineering Arizona State University, Tempe, AZ 85287-8809, USA {hliu,leiyu}@asu.edu

More information

Metrics for Performance Evaluation How to evaluate the performance of a model? Methods for Performance Evaluation How to obtain reliable estimates?

Metrics for Performance Evaluation How to evaluate the performance of a model? Methods for Performance Evaluation How to obtain reliable estimates? Model Evaluation Metrics for Performance Evaluation How to evaluate the performance of a model? Methods for Performance Evaluation How to obtain reliable estimates? Methods for Model Comparison How to

More information

Big Data Methods. Chapter 5: Machine learning. Big Data Methods, Chapter 5, Slide 1

Big Data Methods. Chapter 5: Machine learning. Big Data Methods, Chapter 5, Slide 1 Big Data Methods Chapter 5: Machine learning Big Data Methods, Chapter 5, Slide 1 5.1 Introduction to machine learning What is machine learning? Concerned with the study and development of algorithms that

More information

Impact of Boolean factorization as preprocessing methods for classification of Boolean data

Impact of Boolean factorization as preprocessing methods for classification of Boolean data Impact of Boolean factorization as preprocessing methods for classification of Boolean data Radim Belohlavek, Jan Outrata, Martin Trnecka Data Analysis and Modeling Lab (DAMOL) Dept. Computer Science,

More information

Using Pairs of Data-Points to Define Splits for Decision Trees

Using Pairs of Data-Points to Define Splits for Decision Trees Using Pairs of Data-Points to Define Splits for Decision Trees Geoffrey E. Hinton Department of Computer Science University of Toronto Toronto, Ontario, M5S la4, Canada hinton@cs.toronto.edu Michael Revow

More information

Filter methods for feature selection. A comparative study

Filter methods for feature selection. A comparative study Filter methods for feature selection. A comparative study Noelia Sánchez-Maroño, Amparo Alonso-Betanzos, and María Tombilla-Sanromán University of A Coruña, Department of Computer Science, 15071 A Coruña,

More information

Improving the Random Forest Algorithm by Randomly Varying the Size of the Bootstrap Samples for Low Dimensional Data Sets

Improving the Random Forest Algorithm by Randomly Varying the Size of the Bootstrap Samples for Low Dimensional Data Sets Improving the Random Forest Algorithm by Randomly Varying the Size of the Bootstrap Samples for Low Dimensional Data Sets Md Nasim Adnan and Md Zahidul Islam Centre for Research in Complex Systems (CRiCS)

More information

DESIGN AND EVALUATION OF MACHINE LEARNING MODELS WITH STATISTICAL FEATURES

DESIGN AND EVALUATION OF MACHINE LEARNING MODELS WITH STATISTICAL FEATURES EXPERIMENTAL WORK PART I CHAPTER 6 DESIGN AND EVALUATION OF MACHINE LEARNING MODELS WITH STATISTICAL FEATURES The evaluation of models built using statistical in conjunction with various feature subset

More information

ASSOCIATIVE CLASSIFICATION WITH KNN

ASSOCIATIVE CLASSIFICATION WITH KNN ASSOCIATIVE CLASSIFICATION WITH ZAIXIANG HUANG, ZHONGMEI ZHOU, TIANZHONG HE Department of Computer Science and Engineering, Zhangzhou Normal University, Zhangzhou 363000, China E-mail: huangzaixiang@126.com

More information

Rank Measures for Ordering

Rank Measures for Ordering Rank Measures for Ordering Jin Huang and Charles X. Ling Department of Computer Science The University of Western Ontario London, Ontario, Canada N6A 5B7 email: fjhuang33, clingg@csd.uwo.ca Abstract. Many

More information

Contribution of Boosting in Wrapper Models

Contribution of Boosting in Wrapper Models Marc Sebban, Richard Nock TRIVIA, West Indies and Guiana University Campus de Fouillole, 95159 - Pointe à Pitre (France) {msebban,rnock}@univ-ag.fr Abstract. We describe a new way to deal with feature

More information

An Analysis of Applicability of Genetic Algorithms for Selecting Attributes and Examples for the Nearest Neighbour Classifier

An Analysis of Applicability of Genetic Algorithms for Selecting Attributes and Examples for the Nearest Neighbour Classifier BULGARIAN ACADEMY OF SCIENCES CYBERNETICS AND INFORMATION TECHNOLOGIES Volume 7, No 2 Sofia 2007 An Analysis of Applicability of Genetic Algorithms for Selecting Attributes and Examples for the Nearest

More information

Semi-Supervised Clustering with Partial Background Information

Semi-Supervised Clustering with Partial Background Information Semi-Supervised Clustering with Partial Background Information Jing Gao Pang-Ning Tan Haibin Cheng Abstract Incorporating background knowledge into unsupervised clustering algorithms has been the subject

More information

Bing Liu, Minqing Hu and Wynne Hsu

Bing Liu, Minqing Hu and Wynne Hsu From: AAAI- Proceedings. Copyright, AAAI (www.aaai.org). All rights reserved. Intuitive Representation of Decision Trees Using General Rules and Exceptions Bing Liu, Minqing Hu and Wynne Hsu School of

More information

Ensemble Learning: An Introduction. Adapted from Slides by Tan, Steinbach, Kumar

Ensemble Learning: An Introduction. Adapted from Slides by Tan, Steinbach, Kumar Ensemble Learning: An Introduction Adapted from Slides by Tan, Steinbach, Kumar 1 General Idea D Original Training data Step 1: Create Multiple Data Sets... D 1 D 2 D t-1 D t Step 2: Build Multiple Classifiers

More information

COMP 465: Data Mining Classification Basics

COMP 465: Data Mining Classification Basics Supervised vs. Unsupervised Learning COMP 465: Data Mining Classification Basics Slides Adapted From : Jiawei Han, Micheline Kamber & Jian Pei Data Mining: Concepts and Techniques, 3 rd ed. Supervised

More information

Data Mining. 3.2 Decision Tree Classifier. Fall Instructor: Dr. Masoud Yaghini. Chapter 5: Decision Tree Classifier

Data Mining. 3.2 Decision Tree Classifier. Fall Instructor: Dr. Masoud Yaghini. Chapter 5: Decision Tree Classifier Data Mining 3.2 Decision Tree Classifier Fall 2008 Instructor: Dr. Masoud Yaghini Outline Introduction Basic Algorithm for Decision Tree Induction Attribute Selection Measures Information Gain Gain Ratio

More information

Best First and Greedy Search Based CFS and Naïve Bayes Algorithms for Hepatitis Diagnosis

Best First and Greedy Search Based CFS and Naïve Bayes Algorithms for Hepatitis Diagnosis Best First and Greedy Search Based CFS and Naïve Bayes Algorithms for Hepatitis Diagnosis CHAPTER 3 BEST FIRST AND GREEDY SEARCH BASED CFS AND NAÏVE BAYES ALGORITHMS FOR HEPATITIS DIAGNOSIS 3.1 Introduction

More information

Feature Selection Filters Based on the Permutation Test

Feature Selection Filters Based on the Permutation Test Feature Selection Filters Based on the Permutation Test Predrag Radivoac, Zoran Obradovic 2, A. Keith Dunker, Slobodan Vucetic 2 Center for Computational Biology and Bioinformatics, Indiana University,

More information

C2FS: An Algorithm for Feature Selection in Cascade Neural Networks

C2FS: An Algorithm for Feature Selection in Cascade Neural Networks C2FS: An Algorithm for Feature Selection in Cascade Neural Networks Lars Backstrom Computer Science Cornell University lb87@cornell.edu Rich Caruana Computer Science Cornell University caruana@cs.cornell.edu

More information