3 Virtual attribute subsetting
Portions of this chapter were previously presented at the 19th Australian Joint Conference on Artificial Intelligence (Horton et al., 2006).

Virtual attribute subsetting is a meta-classification technique which will be shown in chapter 6 to improve the performance of the confidence mapping technique described there. It was created in this work as a generic technique, and this chapter describes its implementation and application to generic classification problems. It is an extension of the attribute subsetting technique described in chapter 2. Unlike standard attribute subsetting, virtual attribute subsetting uses a single base classifier, but classifies an instance by passing multiple copies of it to that classifier; each copy has some of its attribute values set to unknown. This is shown here to retain some of the accuracy benefits of standard attribute subsetting while needing less training time, since virtual attribute subsetting trains only a single base classifier. It also means that virtual attribute subsetting can be applied when the base classifier already exists, whereas standard attribute subsetting base classifiers must be trained with attribute subsetting in mind. This possibility of improving the performance of an existing classifier without needing any training data or further training time is very unusual in classifier learning.

3.1 Unknown attributes in classification

Virtual attribute subsetting requires base classifiers that can return a classification for an instance with unknown attribute values. Many types of classifier can handle such missing values, but their means of doing so differ. Three such classifiers are considered here.

1. Naïve Bayesian classifiers can easily handle missing values by omitting the term for that attribute when the probabilities are multiplied (Kohavi et al., 1997).

2. Decision tree classifiers can deal with missing values at classification time in several different ways (Quinlan, 1989). In trees learnt by the commonly-used C4.5 algorithm, if a node in the tree depends on an attribute whose value is unknown, all subtrees of that node are checked and their class distributions summed (Quinlan, 1993).
3. Many rule induction algorithms can handle missing values, but not all do so in a manner useful to virtual attribute subsetting. For example, in ripper (Repeated Incremental Pruning to Produce Error Reduction), rules return negative if they require a missing value (Cohen, 1995). This means that making attributes unknown can easily result in all rules evaluating to negative, so that the default classification is returned. Early tests showed that this made virtual attribute subsetting perform poorly, but that the part rule set inductor, which trains decision nodes with the C4.5 decision tree learning algorithm (Frank & Witten, 1998), was more promising.

3.2 The virtual attribute subsetting algorithm

Bay (1998) applied attribute subsetting to a single nearest-neighbour base classifier. The only training step for nearest-neighbour classifiers is reading the training data. There is no benefit in reading the training data multiple times, so they were read only once. At classification time, multiple nearest-neighbour measurements were made on this single dataset, each measuring a subset of the attributes. This saved training time and storage space, while the results were identical to those obtainable by reading the data multiple times to create multiple base classifiers.

This approach can be generalized to other base classifiers, creating virtual attribute subsetting. At training time, a single base classifier is learnt on all of the training data and multiple attribute subsets are created, using one of the algorithms discussed below. To classify an instance, one copy of the instance is created for each subset. In each copy, the attributes missing from its subset are given the value unknown. The copies are then passed to the base classifier, and its predictions are combined to give an overall prediction. The process of copying instances and erasing their values at classification time is shown in fig. 3.1; this may be compared with attribute subsetting at training time using the same subsets, shown in an earlier figure. For most base classifiers, virtual attribute subsetting may be less accurate than standard attribute subsetting, but will use less training time and storage: specifically, it needs only the time and space required by the single base classifier.
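The classification step described above can be sketched in Python. This is a minimal illustration, not the implementation used in the experiments; the function names, the use of None for unknown values, and the toy Naïve Bayes base classifier are all assumptions made for the sketch.

```python
def classify_virtual(base_predict_proba, instance, subsets, n_classes):
    """Classify one instance under virtual attribute subsetting:
    pass one masked copy per subset to the single base classifier,
    then sum and normalize the returned class distributions."""
    combined = [0.0] * n_classes
    for subset in subsets:
        # Attributes outside this subset become unknown (None).
        masked = [v if i in subset else None for i, v in enumerate(instance)]
        probs = base_predict_proba(masked)
        combined = [c + p for c, p in zip(combined, probs)]
    total = sum(combined)
    return [c / total for c in combined]

def make_naive_bayes(priors, likelihoods):
    """Toy Naïve Bayes over discrete attributes that simply omits the
    term for any attribute whose value is unknown (None), as described
    in section 3.1."""
    def predict_proba(x):
        scores = []
        for c, prior in enumerate(priors):
            score = prior
            for i, v in enumerate(x):
                if v is not None:
                    score *= likelihoods[c][i][v]
            scores.append(score)
        z = sum(scores)
        return [s / z for s in scores]
    return predict_proba
```

Note that with a Naïve Bayes base classifier, masking an attribute has the same effect as never having trained on it, which is why standard and virtual attribute subsetting coincide in that case.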
Figure 3.1: Virtual attribute subsetting example: 3 copies of a 4-attribute instance created with attribute values erased

Three parameters of a virtual attribute subsetting classifier must be chosen: the subsets, the type of base classifier, and the means of combining predictions.

Subset choice

There are many ways to select multiple subsets of attributes. Some subsets have been chosen by hand, based on domain knowledge (Cherkauer, 1996; Sutton et al., 2005). Pseudorandom subsets have also been used, as in Ho (1998), where each subset contained 50% of the attributes. Pseudorandom subsets may be optimized by introspective subset choice: a portion of the training data is set aside for evaluation, and only those subsets which lead to accurate classifiers on the evaluation data are used. Introspective subset choice may be based on learning the optimal proportion of attributes for each dataset (Bay, 1998), or on generating many subsets and using only those which lead to accurate classifiers (Bryll et al., 2003). However, virtual attribute subsetting is intended to be a fast, generic technique, so the subsets should not be based on domain knowledge (which is not generic) or introspective subset choice (which is time-consuming). For this experiment, four types of pseudorandom subset choice were tested: random, classifier balanced, attribute balanced and both balanced. Sample subsets generated by all four subset choice algorithms are shown in table 3.1; the shaded cells show where balancing has been enforced.
Each subset generation algorithm receives three parameters: a, the number of attributes; s, the number of subsets to generate; and p, the desired proportion of attributes per subset.

Table 3.1: Examples of subsets created by different algorithms with a = 4, s = 5 and p = 0.7: (a) no balancing, (b) classifier balanced, (c) attribute balanced, (d) both balanced

Random subsets

This is the simplest algorithm. It iterates through the a·s attribute/subset pairs, randomly selecting a·s·p of them to include in the subsets.

Classifier balanced subsets

This algorithm chooses subsets such that each subset contains approximately a·p attributes. Since a·p may not be an integer, it rounds some subsets up and some down to bring the total number of attributes used as close to a·s·p as possible.
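The first two subset choice algorithms can be sketched as follows. This is a hedged illustration of the descriptions above, not the original implementation; the function names, parameter order and the use of Python's random module are assumptions.

```python
import random

def random_subsets(a, s, p, rng=random):
    """No balancing: pick round(a*s*p) of the a*s attribute/subset
    pairs uniformly at random."""
    pairs = [(attr, sub) for attr in range(a) for sub in range(s)]
    chosen = rng.sample(pairs, round(a * s * p))
    subsets = [set() for _ in range(s)]
    for attr, sub in chosen:
        subsets[sub].add(attr)
    return [sorted(ss) for ss in subsets]

def classifier_balanced_subsets(a, s, p, rng=random):
    """Classifier balanced: each subset gets about a*p attributes,
    with sizes rounded up or down so the total is as close to
    a*s*p as possible."""
    total = round(a * s * p)
    base, extra = divmod(total, s)          # `extra` subsets are one larger
    sizes = [base + 1] * extra + [base] * (s - extra)
    rng.shuffle(sizes)                      # distribute the rounding randomly
    return [sorted(rng.sample(range(a), k)) for k in sizes]
```

With a = 4, s = 5 and p = 0.7, both functions allocate 14 attribute slots in total, but only the second guarantees that every subset has 2 or 3 attributes.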
Attribute balanced subsets

This algorithm chooses subsets such that each attribute appears in approximately s·p subsets. Since s·p may not be an integer, it rounds some counts up and some down to bring the total number of attributes used as close to a·s·p as possible.

Both balanced subsets

Creating subsets that satisfy both the classifier and attribute balancing specified above is slightly more difficult. The algorithm can be illustrated by associating a line segment with each attribute; the length of each segment is the number of further times that attribute needs to be added to a subset. It is described in pseudocode below, and the process is illustrated in fig. 3.2. Note that this algorithm may create duplicate subsets; a version that created only unique subsets was tested, but this had no significant effect on accuracy.

    Determine the number of times each attribute should appear; if
    achieving the correct proportion requires varied attribute counts
    (some must be rounded up and some rounded down), randomly
    distribute the counts
    For each subset:
        Randomly arrange the attributes in a line
        selectorlength <- number of subsets remaining
        selectorpos <- random(0..selectorlength-1)
        While selectorpos lies adjacent to an attribute:
            Add that attribute to the subset
            selectorpos <- selectorpos + selectorlength
        Reduce the lengths of the chosen attributes by 1
Figure 3.2: Visualisation of steps in the both balanced subsets algorithm
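The both balanced pseudocode above can be made runnable. The following Python sketch follows the line-segment description: attribute i occupies a segment whose length is the number of further times it must be used, and a selector steps along the line with a stride equal to the number of subsets remaining. The names and the count-distribution details are assumptions, not the original implementation.

```python
import random

def both_balanced_subsets(a, s, p, rng=random):
    """Generate s subsets of a attributes, balanced so that subset
    sizes and per-attribute appearance counts each differ by at most
    one, totalling about a*s*p attribute slots."""
    total = round(a * s * p)
    # Each attribute should appear about s*p times; distribute the
    # rounding up/down randomly so the counts sum to `total`.
    base = total // a
    extra = total - base * a
    counts = [base + 1] * extra + [base] * (a - extra)
    rng.shuffle(counts)
    remaining = counts[:]        # further appearances needed per attribute
    subsets = []
    for subsets_left in range(s, 0, -1):
        # Lay the segments out in random order along a line.
        order = list(range(a))
        rng.shuffle(order)
        starts, pos = {}, 0
        for attr in order:
            starts[attr] = (pos, pos + remaining[attr])
            pos += remaining[attr]
        line_len = pos
        subset = set()
        sel = rng.randrange(subsets_left)
        while sel < line_len:
            # Add the attribute whose segment contains the selector.
            for attr in order:
                lo, hi = starts[attr]
                if lo <= sel < hi:
                    subset.add(attr)
                    break
            sel += subsets_left
        for attr in subset:
            remaining[attr] -= 1
        subsets.append(sorted(subset))
    return subsets
```

Because every segment is no longer than the stride, each attribute is added at most once per subset, and an attribute that must appear in every remaining subset (segment length equal to the stride) is guaranteed to be selected.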
Base classifiers

The effect of virtual attribute subsetting will vary depending upon the base classifier used. Three types of base classifier were tested in this experiment: Naïve Bayes, C4.5 decision trees, and part rule sets. Making an attribute value unknown makes Naïve Bayes ignore it, as if that entire attribute were not in the training data, so there should be no difference in output between standard and virtual attribute subsetting if both use Naïve Bayes as the base classifier. For C4.5 and part base classifiers, virtual attribute subsetting is unlikely to match the accuracy of standard attribute subsetting, as the classifiers used for each subset will not be appropriately independent. However, virtual attribute subsetting may still be more accurate than a single base classifier.

Combining predictions

There are many ways to combine the predictions of multiple classifiers. The method chosen for virtual attribute subsetting was to sum and normalize the class probability distributions of the base classifiers, giving an overall class probability distribution.

3.3 Method

As mentioned in section 2.3.5, all experiments here were carried out in weka, the Waikato Environment for Knowledge Analysis. Weka conversions of 31 datasets from the UCI Machine Learning Repository (Newman et al., 1998) were selected for testing. The dataset names are listed in tables 3.8, 3.9 and 3.10. Each test of a classifier on a dataset involved 10 repetitions of 10-fold cross-validation. Both standard and virtual attribute subsetting have three adjustable settings: subset choice algorithm (no balancing, classifier balanced, attribute balanced or both balanced), attribute proportion (a floating-point number from 0.0 to 1.0, although exactly 1.0 results in every subset containing every attribute) and number of subsets (any positive integer).
Preliminary tests showed that reasonable default settings are balancing = both balanced, attribute proportion = 0.8 and number of subsets = 10.
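The evaluation protocol above (repeated 10-fold cross-validation) can be sketched generically. This is an illustrative sketch only, not weka's evaluation code; the `train_fn(train) -> predict` interface and the (instance, label) data format are assumptions.

```python
import random

def cross_val_accuracy(train_fn, data, folds=10, repeats=10, rng=None):
    """Mean accuracy over `repeats` repetitions of `folds`-fold
    cross-validation, as in the experiments described above."""
    rng = rng or random.Random(0)
    accs = []
    for _ in range(repeats):
        order = data[:]
        rng.shuffle(order)
        for f in range(folds):
            test = order[f::folds]   # every `folds`-th instance is held out
            train = [d for i, d in enumerate(order) if i % folds != f]
            predict = train_fn(train)
            correct = sum(predict(x) == y for x, y in test)
            accs.append(correct / len(test))
    return sum(accs) / len(accs)
```

In the experiments this loop would be run once for the single base classifier and once for each attribute subsetting variant, and the per-dataset accuracies compared.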
The decision tree classifier used was J4.8, the weka implementation of C4.5. The rule set classifier was the weka implementation of part. All base classifier settings were left at their defaults; for both J4.8 and part, this meant pruning with a confidence factor of 0.25 and requiring a minimum of two instances per leaf.

3.4 Results

A virtual attribute subsetting classifier with a given base classifier may be considered to succeed if it is more accurate than a single classifier of the same type and settings: that is, if it improves accuracy for no significant increase in storage space or training time. This is the main comparison made here; standard attribute subsetting results are also shown, as they provide a probable upper bound on accuracy. Results are shaded if there are more wins than losses and the probability that this is due to chance, based on a one-tailed Fisher sign test (Weisstein, 1999), is less than 0.05.

Naïve Bayes

The ability of virtual attribute subsetting to yield exactly the same results as standard attribute subsetting when Naïve Bayes is the base classifier was verified, so the results shown apply to both standard and virtual attribute subsetting. Attribute subsetting usually only improved the accuracy of a Naïve Bayesian classifier when attributes were balanced, as in table 3.2. The best attribute proportion was 0.9, as shown in table 3.3.

Table 3.2: Wins/draws/losses for standard/virtual attribute subsetting with varying subset choice algorithms compared with a single Naïve Bayesian classifier

Table 3.3: Wins/draws/losses for standard/virtual attribute subsetting with varying proportion compared with a single Naïve Bayesian classifier
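The one-tailed sign test used for the shading criterion above is small enough to sketch directly. This is a generic illustration (draws are discarded, as is usual for the sign test); the function name is an assumption.

```python
from math import comb

def sign_test_p(wins, losses):
    """One-tailed sign test: the probability of observing at least
    `wins` wins out of wins + losses fair coin flips (draws excluded).
    A small value means the win count is unlikely to be chance."""
    n = wins + losses
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n

# e.g. sign_test_p(8, 2) -> 0.0546875, just above the 0.05 threshold
```

So, for example, 8 wins against 2 losses over the 31 datasets (with 21 draws) would narrowly fail the 0.05 criterion, while 9 wins against 1 loss would pass it.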
Decision trees

With J4.8 as the base classifier, standard attribute subsetting performed well with all subset algorithms, but virtual attribute subsetting was only effective when both attributes and classifiers were balanced, as shown in table 3.4. Both attribute subsetting classifiers were commonly more accurate with attribute proportions of 0.8 or 0.9, as in table 3.5. The accuracies achieved with the best settings are listed in table 3.9. The accuracy gains made were usually small.

Two additional experiments with J4.8 were undertaken but are not reported in detail here. The number of subsets was varied from 2 to 40, with no significant improvement in virtual attribute subsetting once the number of subsets was at least 6. Unpruned decision trees were also tested; virtual attribute subsetting using unpruned J4.8 as the base classifier performed well against a single unpruned J4.8 classifier, but poorly against a single pruned J4.8 classifier.

Table 3.4: Wins/draws/losses for standard/virtual attribute subsetting with varying subset choice algorithms compared with a single J4.8 classifier

Table 3.5: Wins/draws/losses for standard/virtual attribute subsetting with varying proportion compared with a single J4.8 classifier
Rule learning

Results for the part algorithm are similar to those for J4.8. Standard attribute subsetting gave good results, while virtual attribute subsetting was only effective when the attributes were balanced; balancing both attributes and classifiers brought no further improvement, as shown in table 3.6. Of the attribute proportions tested, attribute subsetting using part was commonly more accurate with proportions of 0.8 or 0.9, as in table 3.7. The accuracies achieved with the best settings are listed in table 3.10. Once again, the actual accuracy gains made were small.

Table 3.6: Wins/draws/losses for standard/virtual attribute subsetting with varying subset choice algorithms compared with a single part classifier

Table 3.7: Wins/draws/losses for standard/virtual attribute subsetting with varying proportion compared with a single part classifier
Training time

Since training for virtual attribute subsetting is limited to building subsets and training one classifier on the original training data, it was expected to take a similar training time to a single classifier. The standard attribute subsetting algorithm builds a classifier for each subset. Since the training data for each classifier have some attributes removed, the individual classifiers may take less time to learn than a single classifier (as there are fewer attributes to consider) or more time (as more steps may be needed to build an accurate classifier). To test this, the ratios of attribute subsetting training time to single-classifier training time were measured. As expected, virtual attribute subsetting took comparable training time to a single classifier, while standard attribute subsetting needed at least six times longer than both.

Classifier size

The size of a decision tree or rule set provides some measure of its complexity. Small classifiers may be too simple and underfit, while large classifiers may be needlessly complex and overfit. Under attribute subsetting, leaving out some attributes may lead the classifier learner to simpler classifiers, or may force it into complexity. The number of nodes in the J4.8 decision trees and the number of rules in the part rule sets were therefore compared in table 3.8. The numbers are averages across the 10-fold cross-validation; each fold's standard attribute subsetting result is also the mean of its 10 base classifiers. Virtual attribute subsetting is not listed, as it uses the single classifier.
The table shows that for J4.8 neither method leads to consistently larger trees, while attribute subsetting with part generally does learn larger rule sets.

Accuracy tables

The accuracies returned by the most accurate attribute subsetting and virtual attribute subsetting classifiers using J4.8 and part base classifiers are shown in tables 3.9 and 3.10 respectively, along with their win/draw/loss totals.
Table 3.8: Sizes of classifiers trained on the entire dataset and under standard attribute subsetting. "Greater than" and "less than" results are based on a simple count comparison and were measured before these figures were truncated for presentation.
Table 3.9: Percentage accuracy over the 31 datasets for J4.8 and standard/virtual attribute subsetting using J4.8 base classifiers. Wins and losses are based on a simple accuracy comparison and were measured before these figures were truncated for presentation.
Table 3.10: Percentage accuracy over the 31 datasets for part and standard/virtual attribute subsetting using part base classifiers. Wins and losses are based on a simple accuracy comparison and were measured before these figures were truncated for presentation.
3.5 Conclusions

This chapter introduced a new meta-classification technique, virtual attribute subsetting. Tests show that it is effective at improving the accuracy of decision tree and rule set classifiers on a variety of common classification datasets. If blind subset selection is used, the results suggest that the attribute subsets should be chosen with the both balanced subsets algorithm described above, with a subset proportion of 0.8.

This technique is discussed further in chapter 6. There, the confidence measures from Haar Classifier Cascades are modified by applying virtual attribute subsetting to the cascade stages, obtaining multiple classifications from a single cascade. The results presented there show the benefits of doing so.
More informationRipple Down Rule learner (RIDOR) Classifier for IRIS Dataset
Ripple Down Rule learner (RIDOR) Classifier for IRIS Dataset V.Veeralakshmi Department of Computer Science Bharathiar University, Coimbatore, Tamilnadu veeralakshmi13@gmail.com Dr.D.Ramyachitra Department
More informationExam Advanced Data Mining Date: Time:
Exam Advanced Data Mining Date: 11-11-2010 Time: 13.30-16.30 General Remarks 1. You are allowed to consult 1 A4 sheet with notes written on both sides. 2. Always show how you arrived at the result of your
More informationMODELLING DOCUMENT CATEGORIES BY EVOLUTIONARY LEARNING OF TEXT CENTROIDS
MODELLING DOCUMENT CATEGORIES BY EVOLUTIONARY LEARNING OF TEXT CENTROIDS J.I. Serrano M.D. Del Castillo Instituto de Automática Industrial CSIC. Ctra. Campo Real km.0 200. La Poveda. Arganda del Rey. 28500
More informationLogical Decision Rules: Teaching C4.5 to Speak Prolog
Logical Decision Rules: Teaching C4.5 to Speak Prolog Kamran Karimi and Howard J. Hamilton Department of Computer Science University of Regina Regina, Saskatchewan Canada S4S 0A2 {karimi,hamilton}@cs.uregina.ca
More informationInvestigation into the use of PCA with Machine Learning for the Identification of Narcotics based on Raman Spectroscopy
Investigation into the use of PCA with Machine Learning for the Identification of Narcotics based on Raman Spectroscopy Tom Howley, Michael G. Madden, Marie-Louise O Connell and Alan G. Ryder National
More information7. Decision or classification trees
7. Decision or classification trees Next we are going to consider a rather different approach from those presented so far to machine learning that use one of the most common and important data structure,
More informationSemi-Supervised Clustering with Partial Background Information
Semi-Supervised Clustering with Partial Background Information Jing Gao Pang-Ning Tan Haibin Cheng Abstract Incorporating background knowledge into unsupervised clustering algorithms has been the subject
More informationAssignment 4 (Sol.) Introduction to Data Analytics Prof. Nandan Sudarsanam & Prof. B. Ravindran
Assignment 4 (Sol.) Introduction to Data Analytics Prof. andan Sudarsanam & Prof. B. Ravindran 1. Which among the following techniques can be used to aid decision making when those decisions depend upon
More informationBack-to-Back Stem-and-Leaf Plots
Chapter 195 Back-to-Back Stem-and-Leaf Plots Introduction This procedure generates a stem-and-leaf plot of a batch of data. The stem-and-leaf plot is similar to a histogram and its main purpose is to show
More informationA Lazy Approach for Machine Learning Algorithms
A Lazy Approach for Machine Learning Algorithms Inés M. Galván, José M. Valls, Nicolas Lecomte and Pedro Isasi Abstract Most machine learning algorithms are eager methods in the sense that a model is generated
More informationEfficient Pairwise Classification
Efficient Pairwise Classification Sang-Hyeun Park and Johannes Fürnkranz TU Darmstadt, Knowledge Engineering Group, D-64289 Darmstadt, Germany {park,juffi}@ke.informatik.tu-darmstadt.de Abstract. Pairwise
More information.. Cal Poly CSC 466: Knowledge Discovery from Data Alexander Dekhtyar.. for each element of the dataset we are given its class label.
.. Cal Poly CSC 466: Knowledge Discovery from Data Alexander Dekhtyar.. Data Mining: Classification/Supervised Learning Definitions Data. Consider a set A = {A 1,...,A n } of attributes, and an additional
More informationPractical Data Mining COMP-321B. Tutorial 1: Introduction to the WEKA Explorer
Practical Data Mining COMP-321B Tutorial 1: Introduction to the WEKA Explorer Gabi Schmidberger Mark Hall Richard Kirkby July 12, 2006 c 2006 University of Waikato 1 Setting up your Environment Before
More informationMachine Learning. Cross Validation
Machine Learning Cross Validation Cross Validation Cross validation is a model evaluation method that is better than residuals. The problem with residual evaluations is that they do not give an indication
More informationA Fast Decision Tree Learning Algorithm
A Fast Decision Tree Learning Algorithm Jiang Su and Harry Zhang Faculty of Computer Science University of New Brunswick, NB, Canada, E3B 5A3 {jiang.su, hzhang}@unb.ca Abstract There is growing interest
More informationClassification. Slide sources:
Classification Slide sources: Gideon Dror, Academic College of TA Yaffo Nathan Ifill, Leicester MA4102 Data Mining and Neural Networks Andrew Moore, CMU : http://www.cs.cmu.edu/~awm/tutorials 1 Outline
More informationEvaluating Classifiers
Evaluating Classifiers Charles Elkan elkan@cs.ucsd.edu January 18, 2011 In a real-world application of supervised learning, we have a training set of examples with labels, and a test set of examples with
More informationNetwork Traffic Measurements and Analysis
DEIB - Politecnico di Milano Fall, 2017 Sources Hastie, Tibshirani, Friedman: The Elements of Statistical Learning James, Witten, Hastie, Tibshirani: An Introduction to Statistical Learning Andrew Ng:
More informationHow Learning Differs from Optimization. Sargur N. Srihari
How Learning Differs from Optimization Sargur N. srihari@cedar.buffalo.edu 1 Topics in Optimization Optimization for Training Deep Models: Overview How learning differs from optimization Risk, empirical
More informationPerformance Analysis of Data Mining Classification Techniques
Performance Analysis of Data Mining Classification Techniques Tejas Mehta 1, Dr. Dhaval Kathiriya 2 Ph.D. Student, School of Computer Science, Dr. Babasaheb Ambedkar Open University, Gujarat, India 1 Principal
More informationData Mining. 3.5 Lazy Learners (Instance-Based Learners) Fall Instructor: Dr. Masoud Yaghini. Lazy Learners
Data Mining 3.5 (Instance-Based Learners) Fall 2008 Instructor: Dr. Masoud Yaghini Outline Introduction k-nearest-neighbor Classifiers References Introduction Introduction Lazy vs. eager learning Eager
More informationUnsupervised Discretization using Tree-based Density Estimation
Unsupervised Discretization using Tree-based Density Estimation Gabi Schmidberger and Eibe Frank Department of Computer Science University of Waikato Hamilton, New Zealand {gabi, eibe}@cs.waikato.ac.nz
More informationAssignment 1: CS Machine Learning
Assignment 1: CS7641 - Machine Learning Saad Khan September 18, 2015 1 Introduction I intend to apply supervised learning algorithms to classify the quality of wine samples as being of high or low quality
More informationAssociation Rule Mining and Clustering
Association Rule Mining and Clustering Lecture Outline: Classification vs. Association Rule Mining vs. Clustering Association Rule Mining Clustering Types of Clusters Clustering Algorithms Hierarchical:
More informationProbabilistic Abstraction Lattices: A Computationally Efficient Model for Conditional Probability Estimation
Probabilistic Abstraction Lattices: A Computationally Efficient Model for Conditional Probability Estimation Daniel Lowd January 14, 2004 1 Introduction Probabilistic models have shown increasing popularity
More informationComparing Univariate and Multivariate Decision Trees *
Comparing Univariate and Multivariate Decision Trees * Olcay Taner Yıldız, Ethem Alpaydın Department of Computer Engineering Boğaziçi University, 80815 İstanbul Turkey yildizol@cmpe.boun.edu.tr, alpaydin@boun.edu.tr
More informationModel Selection and Assessment
Model Selection and Assessment CS4780/5780 Machine Learning Fall 2014 Thorsten Joachims Cornell University Reading: Mitchell Chapter 5 Dietterich, T. G., (1998). Approximate Statistical Tests for Comparing
More informationMachine Learning Techniques for Data Mining
Machine Learning Techniques for Data Mining Eibe Frank University of Waikato New Zealand 10/25/2000 1 PART V Credibility: Evaluating what s been learned 10/25/2000 2 Evaluation: the key to success How
More informationUnivariate Margin Tree
Univariate Margin Tree Olcay Taner Yıldız Department of Computer Engineering, Işık University, TR-34980, Şile, Istanbul, Turkey, olcaytaner@isikun.edu.tr Abstract. In many pattern recognition applications,
More informationClassification Using Unstructured Rules and Ant Colony Optimization
Classification Using Unstructured Rules and Ant Colony Optimization Negar Zakeri Nejad, Amir H. Bakhtiary, and Morteza Analoui Abstract In this paper a new method based on the algorithm is proposed to
More informationPart I. Instructor: Wei Ding
Classification Part I Instructor: Wei Ding Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 Classification: Definition Given a collection of records (training set ) Each record contains a set
More informationData Mining and Knowledge Discovery: Practice Notes
Data Mining and Knowledge Discovery: Practice Notes Petra Kralj Novak Petra.Kralj.Novak@ijs.si 2016/11/16 1 Keywords Data Attribute, example, attribute-value data, target variable, class, discretization
More informationI211: Information infrastructure II
Data Mining: Classifier Evaluation I211: Information infrastructure II 3-nearest neighbor labeled data find class labels for the 4 data points 1 0 0 6 0 0 0 5 17 1.7 1 1 4 1 7.1 1 1 1 0.4 1 2 1 3.0 0 0.1
More informationRank Measures for Ordering
Rank Measures for Ordering Jin Huang and Charles X. Ling Department of Computer Science The University of Western Ontario London, Ontario, Canada N6A 5B7 email: fjhuang33, clingg@csd.uwo.ca Abstract. Many
More informationFeature Selection with Decision Tree Criterion
Feature Selection with Decision Tree Criterion Krzysztof Grąbczewski and Norbert Jankowski Department of Computer Methods Nicolaus Copernicus University Toruń, Poland kgrabcze,norbert@phys.uni.torun.pl
More informationData Mining. Practical Machine Learning Tools and Techniques. Slides for Chapter 3 of Data Mining by I. H. Witten, E. Frank and M. A.
Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 3 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Output: Knowledge representation Tables Linear models Trees Rules
More informationNearest Neighbor Classifiers
Nearest Neighbor Classifiers TNM033 Data Mining Techniques Linköping University 2009-12-04 When I see a bird that walks like a duck and swims like a duck and quacks like a duck, I call that bird a duck.
More informationA Two Stage Zone Regression Method for Global Characterization of a Project Database
A Two Stage Zone Regression Method for Global Characterization 1 Chapter I A Two Stage Zone Regression Method for Global Characterization of a Project Database J. J. Dolado, University of the Basque Country,
More informationImproving Naïve Bayes Classifier for Software Architecture Reconstruction
Improving Naïve Bayes Classifier for Software Architecture Reconstruction Zahra Sadri Moshkenani Faculty of Computer Engineering Najafabad Branch, Islamic Azad University Isfahan, Iran zahra_sadri_m@sco.iaun.ac.ir
More informationAlgorithms: Decision Trees
Algorithms: Decision Trees A small dataset: Miles Per Gallon Suppose we want to predict MPG From the UCI repository A Decision Stump Recursion Step Records in which cylinders = 4 Records in which cylinders
More informationCSC411/2515 Tutorial: K-NN and Decision Tree
CSC411/2515 Tutorial: K-NN and Decision Tree Mengye Ren csc{411,2515}ta@cs.toronto.edu September 25, 2016 Cross-validation K-nearest-neighbours Decision Trees Review: Motivation for Validation Framework:
More informationWeka ( )
Weka ( http://www.cs.waikato.ac.nz/ml/weka/ ) The phases in which classifier s design can be divided are reflected in WEKA s Explorer structure: Data pre-processing (filtering) and representation Supervised
More informationEnhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques
24 Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques Ruxandra PETRE
More informationImpact of Boolean factorization as preprocessing methods for classification of Boolean data
Impact of Boolean factorization as preprocessing methods for classification of Boolean data Radim Belohlavek, Jan Outrata, Martin Trnecka Data Analysis and Modeling Lab (DAMOL) Dept. Computer Science,
More informationUsing Google s PageRank Algorithm to Identify Important Attributes of Genes
Using Google s PageRank Algorithm to Identify Important Attributes of Genes Golam Morshed Osmani Ph.D. Student in Software Engineering Dept. of Computer Science North Dakota State Univesity Fargo, ND 58105
More informationName Period Date. REAL NUMBER SYSTEM Student Pages for Packet 3: Operations with Real Numbers
Name Period Date REAL NUMBER SYSTEM Student Pages for Packet : Operations with Real Numbers RNS. Rational Numbers Review concepts of experimental and theoretical probability. a Understand why all quotients
More information