Multi-Layer Incremental Induction

Xindong Wu and William H.W. Lo
School of Computer Science and Software Engineering
Monash University
900 Dandenong Road
Melbourne, VIC 3145, Australia
Email: xindong@computer.org

To appear in Proceedings of the 5th Pacific Rim International Conference on Artificial Intelligence, Singapore, 25-27 November 1998.

Abstract. This paper describes a multi-layer incremental induction algorithm, MLII, which is linked to an existing nonincremental induction algorithm to learn incrementally from noisy data. MLII makes use of three operations: data partitioning, generalization and reduction. Generalization can either learn a set of rules from a (sub)set of examples, or refine a previous set of rules. The latter is achieved through a redescription operation called reduction: from a set of examples and a set of rules, we derive a new set of examples describing the behaviour of the rule set. New rules are extracted from these behavioural examples, and these rules can be seen as meta-rules, as they control previous rules in order to improve their predictive accuracy. Experimental results show that MLII achieves significant improvement, in terms of rule accuracy, on HCV, the existing nonincremental algorithm used for the experiments in this paper.

1 Introduction

Existing machine learning algorithms can generally be divided into two categories [Langley 1996]: nonincremental algorithms, which process all training examples at once, and incremental algorithms, which handle training examples one by one. When an example set is not a static repository of data (examples may be added, deleted, or changed over a span of time), learning from it cannot be a one-time process, so nonincremental learning has difficulty dealing with changing example populations. However, processing examples one by one in existing incremental algorithms is a very tedious process when the example set is extraordinarily large.
In addition, when some of the examples are noisy, the results learned from them must be revised at a later stage. As stated in [Schlimmer and Fisher 1986], incremental learning produces predictive results that depend on the particular order of data presentation. This paper designs a new incremental learning algorithm, multi-layer induction, which divides an initial training set into subsets of approximately equal
size, runs an existing induction algorithm on the first subset to obtain a first set of rules, and then processes the remaining data subsets one at a time, incorporating the induction results from the previous subset(s). This way, multi-layer induction accumulates discovered rules from each data subset at each layer and produces a final integrated output which represents the original data more accurately. Any noisy data contained in the original data set is partitioned in multi-layer induction into the small data subsets, so the effects of noise are diluted and induction efficiency can be increased. The existing algorithm used in this paper for experiments is HCV (Version 2.0) [Wu 1995], a nonincremental rule induction system that in many cases performs better than other induction algorithms in terms of rule complexity and predictive accuracy.

2 MLII: Multi-Layer Incremental Induction

Multi-layer incremental induction (MLII) combines three learning operations, data partitioning, rule reduction and rule generalization, into a single process. Generalization and reduction work together with sequential incrementality in order to learn and refine rules incrementally. After data partitioning, MLII handles example subsets sequentially through the generalization-reduction process. The sequential incrementality is particularly useful with huge amounts of data, in order to avoid exponential explosion.

2.1 Algorithm Outline

In the first step, the initial data set is partitioned into a number of data subsets of approximately equal size in a randomly shuffled way. In the second step, a set of rules is learned from a first subset of examples by a generalization algorithm. The only assumption we make here is that the generalization algorithm is able to produce deliberately under-optimal solutions (rules are redundant). This way, the learning problem is given an approximate rule set, and this rule set will be refined with the other data subsets.
The third step performs the transition toward another learning problem, namely the refinement of the previous set of rules. This transition is performed by a redescription operator called reduction, which derives a new set of behavioural examples by examining the behaviour of the rule set from Step 2 over a second data subset. From these behavioural examples, generalization can extract new rules, which are expected to correct defects and inconsistencies of the previous rules. A sequence of rule sets is thus gradually built. Successive applications of the above generalization-reduction process allow more accurate and more complex (because disjunctive) rules to be discovered, by sequentially handling the subsets of examples.

2.2 Data Partitioning

Data partitioning affects the quality of information in each data subset and in turn affects the performance of multi-layer induction. Our main design aim here
is to dilute the noise in the original data set and evenly distribute examples of different classes. The partitioning process is designed as follows.

1. Shuffle all examples in the training set randomly.
2. Put the examples of each class into one separate group.
3. Count the number of examples in each class group and compute the ratio between the class sizes.
4. Randomly select examples from each class group according to the above ratio and put them into a subset. This step is performed N times (where N is a parameter adjusted by the user).

In some cases, the ratio between the class groups does not yield integer counts, and for the last subset some class groups may still have examples while other class groups do not. In these cases, we do not form the last subset, but insert the remaining examples randomly into the existing subsets.

2.3 Generalization

Generalization compresses the initial information. It involves observing a (sub)set of training examples of some particular concept, identifying the essential features common to the positive examples among these training examples, and then formulating a concept definition based on these common features. The generalization process can thus be viewed as a search through a space of possible concept definitions for a correct definition of the concept to be learned. Because the space of possible concept definitions is vast, the heart of the generalization problem lies in utilizing whatever training data, assumptions and knowledge are available to constrain the search. In MLII, discriminant generalization by elimination [Tim 1993] is adopted. A discriminant description specifies an expression (or a logical disjunction of such expressions) that distinguishes a given class from a fixed number of other classes. The minimal discriminant descriptions are the shortest expressions (i.e., with the minimum number of descriptors) distinguishing all objects in the given class from the objects of the other classes.
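As an illustration, the partitioning procedure of Section 2.2 can be sketched in Python as follows. This is a rough sketch under our own naming: `partition`, its arguments, and the (example, class) pair representation are illustrative assumptions, not part of the original MLII implementation.

```python
import random
from collections import defaultdict

def partition(examples, n_subsets, seed=0):
    """Illustrative sketch of MLII-style stratified partitioning:
    shuffle, group by class, then fill each subset with examples
    drawn from every class group in proportion to its size."""
    rng = random.Random(seed)
    pool = list(examples)
    rng.shuffle(pool)                       # step 1: random shuffle

    groups = defaultdict(list)              # step 2: one group per class
    for example, cls in pool:
        groups[cls].append((example, cls))

    # step 3: per-subset quota for each class (the class ratio)
    quotas = {cls: len(g) // n_subsets for cls, g in groups.items()}

    subsets = []                            # step 4: performed N times
    for _ in range(n_subsets):
        subset = []
        for cls, g in groups.items():
            for _ in range(quotas[cls]):
                subset.append(g.pop())
        subsets.append(subset)

    # remainders (non-integer ratios): insert randomly into subsets
    for g in groups.values():
        while g:
            rng.choice(subsets).append(g.pop())
    return subsets
```

With 12 examples of one class and 8 of another, `partition(..., 4)` yields four subsets of five examples each, every subset preserving the 3:2 class ratio.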
Such descriptions specify the minimum information sufficient to identify the given class among a fixed number of other classes. These discriminant descriptions will be converted into generalization rules. A generalization rule is a transformation of a description into a more general description that tautologically implies the initial description. Generalization rules are not truth-preserving but falsity-preserving, which means that if an event falsifies some description, then it also falsifies any more general description. This is immediately seen by observing that H ⇒ F is equivalent to ¬F ⇒ ¬H (the law of contraposition).

Generalization by Elimination. Generalization by elimination builds on the star methodology [Michalski 1984]. Its main originality is a logical pruning of counter-examples, based on the near-miss notion [Kodratoff 1984].
Let s be an example of an example (sub)set A. Any counter-example t of A gives a constraint over the generalization of s: the descriptors which discriminate t from s cannot all be dropped simultaneously. The constraint C(s, t) is a subset of integers, given by

C(s, t) = { i | attribute i discriminates s from t }.

A counter-example t' is a maximal near-miss to s in A if the constraint C(s, t') is minimal for set inclusion among all C(s, t). We search all maximal near-miss counter-examples to find an integer set M that intersects every constraint C(s, t). From M, a rule R_sM is defined as follows: its premises are the conjunction of the conditions in s bearing on the attributes in M. We prove that R_sM is a maximally discriminant generalization of s. By construction, for any counter-example t discriminated from s, there exists an element of C(s, t) which belongs to M: the corresponding attribute allows s and t to be discriminated; this condition is kept from s in R_sM, hence R_sM still discriminates t. The search for M can be achieved by a graph exploration, which is exponential with respect to the number of constraints. However, it is enough for a set M to intersect all C(s, t) for t a maximal near-miss to s. This generalization by elimination therefore reduces the size of the exponential exploration by a preliminary (polynomial) pruning.

Predicate Calculus for Reduction. On the discriminant rules obtained above, we apply predicate calculus [Leung 1992] to generate more general rules. The following is a list of formulae (where X, Y and Z each represent a conditional statement, ¬ represents complement (not), ∧ represents conjunction, ∨ represents disjunction, and ≡ represents equivalence) used in our MLII system.

1. ¬(¬X) ≡ X
2. X ∧ Y ≡ Y ∧ X (the commutative law of conjunction)
3. X ∧ (Y ∧ Z) ≡ (X ∧ Y) ∧ Z (the associative law of conjunction)
4. X ∨ X ≡ X
5. X ∨ Y ≡ Y ∨ X (the commutative law of disjunction)
6. X ∨ (Y ∧ Z) ≡ (X ∨ Y) ∧ (X ∨ Z) (the distributive law)
7. X ∧ (Y ∨ Z) ≡ (X ∧ Y) ∨ (X ∧ Z) (the distributive law)
8. ¬(X ∨ Y) ≡ (¬X) ∧ (¬Y) (De Morgan's law)
9. ¬(X ∧ Y) ≡ (¬X) ∨ (¬Y) (De Morgan's law)

These laws are useful for combining different conditional rules together by symbolic resolution in order to obtain generalized conditional rules (meta-rules).
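To make the near-miss machinery concrete, the following Python sketch computes the constraints C(s, t), prunes them to the maximal near-misses, and finds a set M intersecting every constraint. It assumes fixed-length attribute-value tuples, the greedy hitting-set search stands in for the graph exploration described above, and all function names are our own illustrative choices.

```python
def constraint(s, t):
    """C(s, t): indices of the attributes that discriminate s from t."""
    return frozenset(i for i, (a, b) in enumerate(zip(s, t)) if a != b)

def maximal_near_misses(s, counter_examples):
    """Keep only the constraints that are minimal under set inclusion,
    i.e. those contributed by maximal near-miss counter-examples."""
    cs = [constraint(s, t) for t in counter_examples]
    return [c for c in cs if not any(d < c for d in cs)]

def hitting_set(constraints):
    """Greedily build a set M of attribute indices intersecting every
    constraint (a sketch of, not a substitute for, the exact search)."""
    m, remaining = set(), list(constraints)
    while remaining:
        # pick the attribute covering the most remaining constraints
        counts = {}
        for c in remaining:
            for i in c:
                counts[i] = counts.get(i, 0) + 1
        best = max(counts, key=counts.get)
        m.add(best)
        remaining = [c for c in remaining if best not in c]
    return m
```

For s = (1, 0, 1) and counter-examples (0, 0, 1) and (1, 1, 0), the constraints are {0} and {1, 2}; any hitting set M must contain attribute 0 plus one of attributes 1 or 2, and the premises of R_sM are the conditions of s on those attributes.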
2.4 Reduction

Let Ω denote the description space of the learning domain, B be the set of rules expressed within Ω, and L be the number of rules in B. For any rule in B, we say an example in Ω fires the rule if the description of the example satisfies the premises of the rule.

Definition 1. Reduction, denoted by π_B, is a redescription operator defined as follows.

π_B: Ω → {0, 1}^L
π_B: s ∈ Ω → π_B(s) = [r_j(s), j = 1, ..., L]

where the reduced descriptor r_j is given by

r_j(s) = 1 if s fires the j-th rule in B, and 0 otherwise.

The redescription transforms each example in Ω into an L-dimensional description. The class of the example does not change.

Definition 2. From a learning set A and a rule set B, the reduced learning set, denoted by A_B, is generated as follows.

A_B = [(π_B(s_i), Class(s_i)) | (s_i, Class(s_i)) ∈ A]

where Class(s_i) indicates the class information of the example s_i.

The reduced learning set A_B describes the behaviour of B on the examples in A. It is expressed in boolean logic, whatever the initial representations of A and B are. Generalization can be carried out on the reduced learning set to produce a refined set of rules. The refined set of rules is applied to a new subset of the original training examples to obtain a new learning set for further generalization, and so on. The number of examples in the reduced learning set A_B is generally less than the number of examples in the initial learning set A. But the reduced learning set must still contain enough information to enable a further generalization, so the number of examples in each data subset should not decrease too much.

2.5 Refinement of Previous Rules

At each learning layer, generalization on a reduced learning set A_B refines the rule set B from the previous layer(s).
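The reduction operator of Definitions 1 and 2 can be sketched in Python as follows. The representation is an assumption made for illustration: a rule is modelled as a premise dictionary mapping attribute indices to required values, and `fires` and `reduce_set` are our own names, not MLII internals.

```python
def fires(rule, example):
    """An example fires a rule when its description satisfies every
    premise; here a rule is an {attribute index: required value} dict."""
    return all(example[i] == v for i, v in rule.items())

def reduce_set(learning_set, rules):
    """Redescribe each (example, class) pair as the L-dimensional 0/1
    vector of which rules it fires; class labels carry over unchanged."""
    return [
        (tuple(1 if fires(rule, ex) else 0 for rule in rules), cls)
        for ex, cls in learning_set
    ]
```

For instance, with B = [{0: 1}, {1: 0}], the example ((1, 0), 'pos') is redescribed as ((1, 1), 'pos'), since it fires both rules, while ((0, 1), 'neg') becomes ((0, 0), 'neg').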
First, if a rule in B has good predictive accuracy, this information is implicitly available in the reduced learning set A_B: the rule is often fired by the examples in A, so the corresponding descriptor in A_B takes the value 1, and the class of these examples is often the same as the rule's class. Hence, there is a correlation in A_B between a value of this descriptor and a value of the class information, and rules with good predictive accuracy will be discovered again by the next generalization. This process is stable, as good rules in B are carried on. Second, the same argument ensures that irrelevant rules are dropped: if a rule is irrelevant, the associated reduced descriptor is irrelevant with respect to A_B too. As generalization is supposed to detect and drop irrelevant descriptors, the rules learned from A_B do not keep previous irrelevant rules.
Third, generalization discovers links among descriptors and classes. In the reduced learning set A_B, examples are described according to the rules in B they trigger. Hence, the triggering of rules can be generalized from A_B: the generalization resolves conflicts arising among the previous rules.

3 Experiments

In this section, we set up a few experiments to compare the predictive accuracy of MLII rules with the HCV induction program [Wu 1995].

3.1 Experiments 1 to 4

Table 1 provides a summary of the data sets used in our first four experiments and the results. The 4 databases were all copied from the University of California at Irvine machine learning database repository [Murphy & Aha 95], and each contains a certain level of noise. These databases have been selected because each of them consists of two standard components when created or collected by the original providers: a training set and a test set. The databases have been used "as is". Example ordering has not been changed, nor have examples been moved between the sets.

Table 1. Summary of Experiments 1-4.

                           Expt 1    Expt 2    Expt 3       Expt 4
  Database                 Person 1  Person 2  Labor-Neg 1  Labor-Neg 2
  No. of training examples 100       300       100          300
  Number of attributes     4         4         5            5
  Number of classes        2         2         2            2
  Missing values           10        15        23           10
  Misclassifications       3         7         17           7
  Level of noise           low       low       high         low
  No. of test examples     30        50        50           80
  No. of HCV rules         8         12        16           10
  Accuracy of HCV rules    88.92%    77.33%    81.71%       87.77%
  No. of MLII rules        6         7         10           8
  Accuracy of MLII rules   98.88%    90.73%    92.57%       94.61%

For each database, we ran each of HCV (Version 2.0) and MLII (with 4 layers) 10 times on the training set, and the accuracies listed in Table 1 are average results on the test set. For all 4 databases, MLII (with 4 layers) performs better than HCV (Version 2.0). The accuracy difference on each database between MLII and HCV is statistically significant.
Therefore, we conclude that, with a carefully selected number of layers, MLII achieves significant improvement on HCV in terms of rule accuracy.
3.2 Experiment 5

The purpose of this experiment is to check how the accuracy of the MLII rules generated at each layer changes in an n-layer induction. The data set used is Labor-Neg 1, the same as in Experiment 3. Table 2 shows the results, and Figure 1 provides a visual illustration of the same results.

Table 2. Results of Experiment 5.

  Layer No.  Rule Set  Test Set Accuracy
  1          B         73.331%
  2          C1        63.214%
             C2        69.643%
  3          D1        54.123%
             D2        70.813%
             D3        80.771%
  4          E1        45.634%
             E2        60.811%
             E3        75.123%
             E4        92.512%

From the graph in Figure 1, it is obvious that the accuracy of the first-layer rules on the same test data decreases as the number of layers in MLII increases. The highest is HCV induction (just one layer) and the lowest is the 4-layer MLII. The reason is that HCV uses the whole 300 training examples to generate rules, while MLII uses only a subset (one Nth of the training set) at the first layer for HCV to generate an initial rule set. We have tried up to 4 layers with MLII, and the rule accuracy on the test set always increases at the last layer. A question arises here: what is the optimal number of layers for MLII on a training set? Based on the various experiments we have carried out, it depends on the size and the noise level of the training set. If the size of the training set is very large (e.g. 5000 examples or more) and there exists high-level noise, more layers of learning allow deeper rule refinement to dilute the noise. Otherwise, if we use a large number of layers on a small data set, MLII cannot gain enough information to generate approximate (redundant) rules for later successive refinement, and this in turn affects the completeness and consistency of the final generated rules. For each curve in Figure 1, we can find that the test set accuracy increases significantly from the first-layer to the second-layer rules, still increases from the second to the third layer, and the improvement decreases as the number of layers increases.
This indicates that the approximate rules generated at the first layer are successively refined at the following learning layers (i.e., the rules become more consistent and accurate) and finally achieve an optimal level and are no longer redundant. At that point, the successive learning should be stopped and the optimal rules taken as the final rules. In general, it is not the case that the more layers we use, the more accurate the final rules will be.

Fig. 1. Rule Accuracy at Each Induction Layer of Experiment 5.

4 Conclusions

Multi-layer induction learns accurate rules in an incremental manner. It handles subsets of training examples sequentially. Compared to handling training examples one by one in existing incremental learning algorithms, this sequential incrementality is more flexible, because the size of the data subsets is controlled by data partitioning in multi-layer induction. Multi-layer induction suits noisy domains, because data partitioning dilutes the effects of noise into the data subsets. Five experiments were carried out in this paper to quantify the gains of MLII, and significant improvement in rule accuracy has been achieved. Multi-layer induction is designed for handling large and/or noisy data sets. With medium-sized, noise-free data sets, we have not found much improvement of MLII over HCV induction in rule accuracy. The information quality of the data sets is a critical factor in determining the number of layers in MLII. The noise level, the number of training examples, the numbers of attributes and classes, and the value domains of the attributes are all contributing factors when applying MLII to a particular data set.
Future work will involve applying MLII to other induction programs, such as C4.5 [Quinlan 1993], extending the experiments to larger data sets, and comparing with other incremental learning methods such as case-based learning [Ram 1990], which learns incrementally case by case and treats each case as a chunk of partially matched rules.

References

[Kodratoff 1984] Kodratoff, Y. (1984). Learning complex structural descriptions from examples. Computer Vision, Graphics and Image Processing 27.
[Langley 1996] Langley, P. (1996). Elements of Machine Learning. Morgan Kaufmann.
[Leung 1992] Leung, K. T. (1992). Elementary Set Theory (3rd Ed.). Hong Kong University Press.
[Michalski 1984] Michalski, R. S. (1984). A theory and methodology for inductive learning. Artificial Intelligence 20 (2).
[Michalski 1985] Michalski, R. S. (1985). Knowledge repair mechanisms: Evolution versus revolution. In Proceedings of the Third International Machine Learning Workshop, 116-119. Rutgers University.
[Murphy & Aha 95] Murphy, P. M. & Aha, D. W. (1995). UCI Repository of Machine Learning Databases, Machine-Readable Data Repository. University of California, Department of Information and Computer Science, Irvine, CA.
[Quinlan 1993] Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann.
[Ram 1990] Ram, A. (1990). Incremental learning of explanation patterns and their indices. In Proceedings of the Seventh International Conference on Machine Learning, 49-57. Morgan Kaufmann.
[Schlimmer and Fisher 1986] Schlimmer, J. and Fisher, D. (1986). A case study of incremental concept induction. In Proceedings of the Fifth National Conference on Artificial Intelligence, 496-501. Morgan Kaufmann.
[Tim 1993] Tim, N. (1993). Discriminant generalization in logic programs. Knowledge Representation and Organization in Machine Learning 14 (3), 345-351.
[Wu 1995] Wu, X. (1995). Knowledge Acquisition from Databases. Ablex.