Some Approaches to Improve the Interpretability of Neuro-Fuzzy Classifiers

Aljoscha Klose, Andreas Nürnberger, and Detlef Nauck
Faculty of Computer Science (FIN-IWS), University of Magdeburg
Universitätsplatz 2, D-39106 Magdeburg, Germany

Abstract

Neuro-fuzzy classification systems make it possible to obtain a suitable fuzzy classifier by learning from data. Nevertheless, in some cases the derived rule base is hard to interpret. In this paper we discuss some approaches to improve the interpretability of neuro-fuzzy classification systems. We present modified learning strategies to derive fuzzy classification rules from data, and some methods to simplify the found rule base to improve the interpretability of the resulting fuzzy system.

1 Introduction

Fuzzy rules are a well-suited representation of classification knowledge. Their application is rather intuitive and understandable, and the abstraction from figures to linguistic variables eases the readability and interpretability of the rule base. Automatic induction of fuzzy rules from data is therefore an interesting topic. One approach to automatic rule generation is based on neuro-fuzzy systems, which use learning algorithms derived from neural network theory to generate fuzzy rules [1, 5, 12]. Unfortunately, the result of automatic rule induction is not always as easy to interpret as a rule base built by human experts. In this paper we discuss some of these problems and approaches to improve the interpretability of neuro-fuzzy classification systems. We present modified learning strategies to derive fuzzy classification rules from data, and some methods to simplify the found rule base to improve the interpretability for the user. For the analysis of the presented approaches we used an implementation of the NEFCLASS (NEuro-Fuzzy CLASSification) model [6], a neuro-fuzzy model for data analysis, since it was designed as an interactive classification tool.
One of our goals is to develop algorithms that can learn automatically, but that also allow the user to influence the learning and classification process, e.g. by initializing the system, and by modifying or extracting knowledge. The next section describes the NEFCLASS system and its two-phase mechanism of rule creation and fine tuning. Sect. 2 shows problems of automatic rule generation and describes our approaches to tackle them. Experimental evaluation and conclusions are given in Sect. 3 and 4.

Acknowledgment: The research presented in this paper is partly funded by DFG contract KR 521/3-1.

The NEFCLASS Model

NEFCLASS can be viewed as a special 3-layer feed-forward neural network. The units in this network use t-norms or t-conorms instead of the activation functions common in neural networks. The first layer represents input variables, the middle (hidden) layer represents fuzzy rules, and the third layer represents output variables. Fuzzy sets are encoded as (fuzzy) connection weights. Thus NEFCLASS can always (i.e. before, during, and after learning) be interpreted as a system of fuzzy classification rules like

    if x1 is μ1 and x2 is μ2 and ... and xn is μn
    then pattern (x1, x2, ..., xn) belongs to class i,

where the μ1, ..., μn are fuzzy sets. It is possible to create a NEFCLASS system from scratch using training data, or to initialize it by prior knowledge in the form of fuzzy rules. If a classifier is learned from data, two phases are needed (see left side of Fig. 1).
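Before describing the two phases, here is a minimal sketch of how such a fuzzy rule is evaluated. This is an illustration, not the NEFCLASS implementation; the triangular parameterization and all values are assumptions for the example. The min t-norm plays the role of the hidden unit's activation.

```python
# Illustrative sketch: evaluating a fuzzy classification rule
# "if x1 is small and x2 is medium then class benign" with triangular
# membership functions and the min t-norm. Toy example, not NEFCLASS code.

def triangular(a, b, c):
    """Membership function with support [a, c] and peak at b."""
    def mu(x):
        if x <= a or x >= c:
            return 0.0
        return (x - a) / (b - a) if x <= b else (c - x) / (c - b)
    return mu

def rule_activation(antecedent, pattern):
    """Min t-norm over the antecedent's membership degrees
    (the degree to which the pattern fulfils the rule)."""
    return min(mu(x) for mu, x in zip(antecedent, pattern))

small = triangular(0.0, 0.25, 0.5)
medium = triangular(0.25, 0.5, 0.75)

print(rule_activation([small, medium], (0.25, 0.5)))    # 1.0 (both peaks)
print(rule_activation([small, medium], (0.375, 0.5)))   # 0.5 (x1 halfway down a slope)
```

A classifier built from such rules then assigns the class of the rule with the highest activation (winner takes all).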
The rule creation algorithm creates an initial fuzzy partitioning for each variable. This is given by a fixed number of equally distributed triangular membership functions. The combination of the fuzzy sets forms a grid in the data space, i.e. equally distributed overlapping hyperboxes. Then the training data is processed, and those clusters that cover areas where data is located are added as rules to the rule base of the classifier. After the rule base has been created, the membership functions are tuned by a simple heuristic. For each rule a classification error is determined and used to modify the membership function that is responsible for the rule activation (i.e. the one that yields the minimal membership degree of all fuzzy sets in the rule's antecedent). The modification results in shifting the fuzzy set and enlarging or reducing its support, such that a larger or smaller membership degree is obtained depending on the current error. The learning procedure takes into account the semantic properties of the underlying fuzzy system. This results in constraints on the possible modifications applicable to the system parameters. NEFCLASS allows the user to impose different restrictions on the learning algorithm, e.g. that membership functions must not pass their neighbors, must stay symmetrical, or must intersect at 0.5 [7].

2 Improving Neuro-Fuzzy Classification

One major problem of NEFCLASS and other fuzzy rule inducing systems is that they tend to find large numbers of rules, and therefore interpretability (as well as generalizing ability and learning stability) cannot be guaranteed. This is mainly because the rule creation process does not look for larger structures in the data, but for a local covering of the input space with hyperboxes. For datasets of higher dimensionality there are huge numbers of possible rules. The Wisconsin breast cancer dataset from the UCI machine learning repository [4] has 9 input variables.
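The grid-based rule creation described above can be sketched as follows. This is a toy illustration of the idea, not the NEFCLASS algorithm itself: with k equally distributed triangular fuzzy sets per variable, the grid cell in which a pattern has maximal membership is simply the one whose peaks are nearest, and every occupied cell becomes a candidate rule. Choosing the cell's majority class as consequent is an assumption made here for simplicity.

```python
# Sketch of grid-based rule creation (illustration, not NEFCLASS code):
# occupied hyperboxes of the fuzzy grid become rules; the consequent is
# taken as the majority class of the patterns falling into the cell.

from collections import Counter, defaultdict

def best_cell(pattern, k, lo=0.0, hi=1.0):
    """Per dimension, the index of the fuzzy set with maximal membership.
    For equally distributed triangles this is the set with the nearest peak."""
    peaks = [lo + i * (hi - lo) / (k - 1) for i in range(k)]
    return tuple(min(range(k), key=lambda i: abs(x - peaks[i])) for x in pattern)

def create_rules(data, labels, k):
    cells = defaultdict(Counter)
    for pattern, label in zip(data, labels):
        cells[best_cell(pattern, k)][label] += 1
    # one rule per hyperbox that actually covers data
    return {cell: counts.most_common(1)[0][0] for cell, counts in cells.items()}

data = [(0.1, 0.2), (0.15, 0.1), (0.9, 0.8)]
labels = ["benign", "benign", "malign"]
print(create_rules(data, labels, k=3))
# e.g. {(0, 0): 'benign', (2, 2): 'malign'} -- only 2 of the 9 cells become rules
```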
Using all of them and choosing a partitioning of the input ranges into 5 linguistic terms each results in 5^9 = 1953125 possible rules. Thus, most of the about 700 cases in the dataset generate rules which cover only the generating case. This will obviously deliver a poor classification on unseen data. If we define interpretability as the possibility to gain insight into the structures of the data from looking at the rules, this interpretability will be very low. Finding structure in the data means abstracting from single cases by describing them in more general rules. Thus, reducing the number of rules is a common target for generality and interpretability. In [8] some (semi-automatic) pruning methods were presented, which can be used interactively during the learning process. One basic principle of these methods is to prune the rules by deleting hyperboxes that, by some measure, do not essentially contribute to the classification result. As the rules are fuzzy and thus the hyperboxes overlap, the neighboring rules can (hopefully) classify cases formerly classified by the now deleted rules. Successive runs of the second learning phase try to enlarge the remaining fuzzy rules (hyperboxes), such that the input space is completely covered again [8]. Rule simplification is done by iteratively applying learning phase II and pruning (see Fig. 1). Problems arise when the data sparsely covers the input space or the initial hyperboxes are small. In the breast cancer example mentioned above, many rules classify only one data point. Therefore, the optimizing learning phase II has problems adapting the fuzzy sets correctly due to the lack of statistical information. The pruning methods make rather crude changes to the rule base by deleting rules and thus strongly change the classification. We therefore present an alternative approach that reduces the number of rules, but tries to keep the changes in classification performance small. The basic principles are the merging of neighboring hyperboxes (i.e.
rules), the deletion of redundant fuzzy sets, and the choice of an appropriate rule base format.

Merging of Rules

If hyperboxes are adjacent in input space and have identical consequents, they can be joined with no or little loss of information. Thus, initial hyperboxes are enlarged and the joined rules are combined into a single one. Joining rules that are adjacent in one dimension can prohibit joins in other dimensions. Therefore the task is to find the best merge, i.e. to merge the most rules with a minimal loss of classification performance. Unfortunately, this search is NP-complete, so we have to make use of some heuristics. A usual strategy is greedy searching, i.e. promising joins combining many rules are made first. If, for example, a rule can be joined with an adjacent rule in one dimension, but with two or more rules along another dimension, the latter is preferred. A special case are rules that are neighboring along one input dimension, predict the same class, and together cover the complete input range. This input feature can then assume arbitrary values and can therefore be ignored in the joined rule. In the fuzzy rule base this is done by setting this input to the linguistic term
don't care, which is associated with a constant membership function equal to 1. Another, even more promising situation can be seen as a special case of the above-mentioned scenario: if the same input can be ignored in all rules, it can be removed completely from the input dataset. This is desirable, as it very effectively reduces the number of rules and the input complexity of the dataset. According to our greedy approach, we merge rules in 3 phases: first, we search explicitly for inputs that can be ignored in all rules, then for inputs that can be ignored in some rules, and then for all remaining merging possibilities. We call these 3 phases input pruning on dataset level, input pruning on rule level, and (simple) rule merging. Usually, the first two phases will be applied after initial rule base creation, but they may also be used in the pruning loop (see Fig. 1). We will now describe the phases in more detail, beginning with pruning on dataset level.

Figure 1: Structure of the learning algorithm in a modified NEFCLASS system

Input Pruning on Data Set Level

Inspired by the top-down induction of decision trees [9, 10], we start with all rules merged and split them iteratively. That means we start with no inputs used for classification. Then, iteratively, the best input according to an appropriate selection measure is added to the set of used inputs. Previously merged rules that differ in the newly selected variable are split accordingly. This procedure is repeated until the expected classification error reaches a minimum or falls beneath a threshold given by the user. All inputs not belonging to the set of used inputs are set to don't care and all superfluous rules are removed. This greedy approach strongly depends on the attribute selection measure. We decided to use a measure based on the minimum description length principle (MDL, [11]), which in the past turned out to be well suited for the induction of decision trees [2].
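The greedy selection loop just described can be sketched as follows. Both `score` (standing in for the MDL-based selection measure, lower is better) and `error` (the expected classification error of the merged rule base) are assumed, illustrative callables; the names and toy measures below are not from the paper.

```python
# Sketch of the greedy top-down input selection for pruning on data set
# level: start with no inputs, repeatedly add the input that improves the
# selection measure most, stop when the measure no longer improves or the
# expected error falls beneath the user-given threshold.

def select_inputs(all_inputs, score, error, max_error=0.05):
    used, remaining = set(), list(all_inputs)
    while remaining and error(used) > max_error:
        best = min(remaining, key=lambda i: score(used | {i}))
        if score(used | {best}) >= score(used):
            break  # the selection measure no longer improves
        used.add(best)
        remaining.remove(best)
    return used

# Toy measures: inputs 0 and 1 are informative, input 2 is noise.
score = lambda s: 2 - len(s & {0, 1}) + 0.1 * len(s - {0, 1})
error = lambda s: 0.5 if 0 not in s else (0.2 if 1 not in s else 0.01)

print(select_inputs([0, 1, 2], score, error))   # {0, 1} -- the noise input is never selected
```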
MDL is equivalent to maximum likelihood for the estimation of models with a fixed number of free parameters, but it additionally offers a possibility to compare objectively between models of different complexity. Intuitively, in a classification setting MDL yields good values (short description lengths) for models with few parameters that separate the classes well. On the other hand, it punishes models with many parameters. Thus it helps to prevent over-fitting. The basis for the application of MDL is a crisp statistic built from the dataset. It counts how many cases of each class each rule classifies, i.e. each case is counted for the rule which has the highest activation. From this statistic it can be calculated which rules and which cases fall together if a set of inputs is ignored. From this the expected error can be calculated, and the separation can be measured by MDL. Due to lack of space we refer the reader to [2] for a formal discussion.

Input Pruning on Rule Level

The next step is pruning on rule level, i.e. the search for a set of rules that are adjacent in one dimension, independently of the other rules. If they cover a certain input range and do not make contradicting predictions, they are merged by ignoring the corresponding input. In contrast to the last section, this merging works bottom-up. Again a greedy search guided by MDL is applied. This time, no error limit is set. Instead, the message lengths for the unmerged and the merged rules are calculated according to [2]. The merging is only done if this decreases the MDL measure.
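The crisp statistic underlying the MDL computation can be sketched as follows. Each case is counted for the rule with the highest activation, yielding a per-rule, per-class count table from which the expected error can be read off. Rule activations are abstracted as callables; all names here are illustrative assumptions, and the full description-length computation is omitted (see [2]).

```python
# Sketch of the crisp statistic for the MDL computation: count, for each
# rule, how many cases of each class it classifies (winner-take-all), and
# derive the expected error as the fraction of non-majority cases per rule.

from collections import Counter, defaultdict

def crisp_statistic(rules, data, labels):
    """rules: {rule_id: activation_fn}; returns {rule_id: Counter of classes}."""
    stat = defaultdict(Counter)
    for pattern, label in zip(data, labels):
        winner = max(rules, key=lambda r: rules[r](pattern))
        stat[winner][label] += 1
    return stat

def expected_error(stat):
    """Fraction of cases not belonging to their rule's majority class."""
    total = sum(sum(c.values()) for c in stat.values())
    wrong = sum(sum(c.values()) - max(c.values()) for c in stat.values())
    return wrong / total

rules = {"r1": lambda p: 1.0 - p[0], "r2": lambda p: p[0]}
stat = crisp_statistic(rules, [(0.1,), (0.2,), (0.9,)], ["benign", "benign", "malign"])
print(expected_error(stat))   # 0.0 -- each rule's cases all share one class
```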
Simple Rule Merging

The third step is simple rule merging. This can be done on several levels:

Rules with neighboring fuzzy sets can be merged by using a trapezoidal fuzzy set in the antecedent. The interpretability is not affected, since usually a new linguistic term can be given which combines both fuzzy sets. The membership values of data points between the merged fuzzy sets (and thus the classification) will be affected, as the trapezoidal fuzzy set has higher membership values in the inner region. However, if the rules are merged because they describe a common region for a class in the data set, this is even desirable.

Rules can be merged on a logical level: the rule base is given in disjunctive normal form, and more compact representations can be obtained using Boolean transformations. This is only applicable in the post-processing phase before the final rule base is given to the user, since the rules' firing strengths could be modified in an undesired way.

Nearly identical fuzzy sets can be pruned: during optimization fuzzy sets often become very similar. They can be replaced by a single fuzzy set if they are equal according to a similarity measure (see, for example, [3]).

Sorting and Grouping Rule Output

Our approaches can be divided into pruning in the learning loop and post-processing of the rule base. The pruning loop aims at increasing the interpretability as well as supporting learning and enabling more powerful representations. Post-processing only aims at providing a more readable rule base. Boolean transformations on a logical level are an example of this kind of post-processing. Additional help can be given to the user by choosing an appropriate output format: rules that cannot be merged, but nevertheless describe adjacent regions, should be grouped in the output. In this way the user is guided in interpreting the classification.

3 Experimental Evaluation

As an example for our extensions to NEFCLASS we use the "Wisconsin Breast Cancer" data set [13].
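Before turning to the data set, the trapezoidal merge of two neighboring triangular fuzzy sets described above can be sketched as follows. The parameterization (a, b, c, d) for support [a, d] and plateau [b, c] is an assumption for this illustration, not the paper's notation.

```python
# Illustrative sketch: merging two neighboring triangular fuzzy sets into
# one trapezoidal set. The outer slopes are kept and the membership is 1
# on the plateau between the two former peaks, which is why points in the
# inner region receive higher membership values after the merge.

def merge_triangles(t1, t2):
    """t = (a, b, c): triangle with support [a, c] and peak b."""
    (a1, b1, _), (_, b2, c2) = sorted([t1, t2], key=lambda t: t[1])
    return (a1, b1, b2, c2)

def trapezoidal(a, b, c, d):
    """Membership function with support [a, d] and plateau [b, c]."""
    def mu(x):
        if x <= a or x >= d:
            return 0.0
        if b <= x <= c:
            return 1.0
        return (x - a) / (b - a) if x < b else (d - x) / (d - c)
    return mu

small, medium = (0.0, 0.25, 0.5), (0.25, 0.5, 0.75)
merged = merge_triangles(small, medium)
print(merged)                       # (0.0, 0.25, 0.5, 0.75)
print(trapezoidal(*merged)(0.375))  # 1.0 -- each original triangle gave only 0.5 here
```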
This data set contains 699 cases distributed into two classes (benign and malign). We excluded 16 cases with missing values, and split the data set into training and test data. Initial rule creation using 5 fuzzy sets per variable creates 166 rules. To demonstrate the pruning strategies, the initial number of rules was not restricted; this means that "best per class" rule learning [7], which reduces the number of rules right away, was not used in this case. After training, the learned rule base causes only one classification error. However, on the test set it performs badly, with 132 errors (38%). 127 of these errors were ambiguous classifications, i.e. cases where no rule fires. Obviously the hyperboxes are much too small and scattered. Therefore, we next applied the merging/pruning strategies presented in this paper to the rule base. The first pruning phase, with 5% misclassifications as stopping criterion, uses only two inputs to obtain 9 (2.6%) errors on the training set and 18 errors on the test set. There are only 25 rules left. The next pruning phase reduces this number by 8, to a remainder of 17 rules. The classification performance is not affected by this merging. Tab. 1 shows the ordered and logically merged rule output. The last step has not yet been implemented and has therefore been done manually. It should be noted that the number of rules can be further decreased by the other pruning techniques implemented in NEFCLASS [8].

IF Input 2 IS very small AND Input 6 IS (very small, small OR medium) THEN Class IS Benign
IF Input 2 IS small AND Input 6 IS very small THEN Class IS Benign
IF Input 6 IS (large OR very large)
IF Input 6 IS (small OR medium) AND Input 2 IS (small, medium, large OR very large)
IF Input 6 IS very small AND Input 2 IS (medium, large OR very large)

Table 1: Resulting rules for breast cancer data set

4 Conclusions

The presented approaches can be used as automatic as well as interactive pruning methods. Some of them are not limited to neuro-fuzzy classification: they may be applied to all kinds of classifiers, and some (Sect. 2) do not even need a representation by rules. Nonetheless, in combination with adapted learning methods, an improvement of interpretability and learning stability can be achieved. Together with other pruning methods recently integrated in NEFCLASS [8], the automatic part (the machine's support) in the interactive process of data analysis can be increased. The original NEFCLASS software for UNIX (written in C++, with a user interface in TCL/TK) that was used to obtain the results described in this paper can be freely obtained from the World Wide Web.

References

[1] Hamid R. Berenji and Pratap Khedkar. Learning and tuning fuzzy logic controllers through reinforcements. IEEE Trans. Neural Networks, 3:724-740, September.
[2] I. Kononenko. On biases in estimating multi-valued attributes. In Proc. 1st International Conference on Knowledge Discovery and Data Mining, pages 1034-1040.
[3] Chin-Teng Lin. Neural Fuzzy Control Systems with Structure and Parameter Learning. World Scientific, Singapore.
[4] C.J. Merz and P.M. Murphy. UCI repository of machine learning databases. University of California, Irvine, Dept. of Information and Computer Sciences.
[5] Detlef Nauck, Frank Klawonn, and Rudolf Kruse. Foundations of Neuro-Fuzzy Systems. Wiley, Chichester.
[6] Detlef Nauck and Rudolf Kruse. NEFCLASS - a neuro-fuzzy approach for the classification of data. In K. M. George, Janice H. Carrol, Ed Deaton, Dave Oppenheim, and Jim Hightower, editors, Applied Computing: Proc. ACM Symposium on Applied Computing, Nashville, Feb. 26-28, pages 461-465. ACM Press, New York, February.
[7] Detlef Nauck and Rudolf Kruse. A neuro-fuzzy method to learn fuzzy classification rules from data. Fuzzy Sets and Systems, 89:277-288.
[8] Detlef Nauck and Rudolf Kruse. New learning strategies for NEFCLASS. In Proc. Seventh International Fuzzy Systems Association World Congress IFSA'97, volume IV, pages 50-55, Prague.
[9] J.R. Quinlan. Induction of decision trees. Machine Learning, 1:81-106.
[10] J.R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA.
[11] J. Rissanen. A universal prior for integers and estimation by minimum description length. Annals of Statistics, 11:416-431.
[12] Nadine Tschichold-Gürman. Generation and improvement of fuzzy classifiers with incremental learning using fuzzy rulenet. In K. M. George, Janice H. Carrol, Ed Deaton, Dave Oppenheim, and Jim Hightower, editors, Applied Computing: Proc. ACM Symposium on Applied Computing, Nashville, Feb. 26-28, pages 466-470. ACM Press, New York, February.
[13] W.H. Wolberg and O.L. Mangasarian. Multisurface method of pattern separation for medical diagnosis applied to breast cytology. Proc. National Academy of Sciences, 87:9193-9196, December 1990.
More informationConcept Tree Based Clustering Visualization with Shaded Similarity Matrices
Syracuse University SURFACE School of Information Studies: Faculty Scholarship School of Information Studies (ischool) 12-2002 Concept Tree Based Clustering Visualization with Shaded Similarity Matrices
More informationContrained K-Means Clustering 1 1 Introduction The K-Means clustering algorithm [5] has become a workhorse for the data analyst in many diverse elds.
Constrained K-Means Clustering P. S. Bradley K. P. Bennett A. Demiriz Microsoft Research Dept. of Mathematical Sciences One Microsoft Way Dept. of Decision Sciences and Eng. Sys. Redmond, WA 98052 Renselaer
More informationCHAPTER 4 FREQUENCY STABILIZATION USING FUZZY LOGIC CONTROLLER
60 CHAPTER 4 FREQUENCY STABILIZATION USING FUZZY LOGIC CONTROLLER 4.1 INTRODUCTION Problems in the real world quite often turn out to be complex owing to an element of uncertainty either in the parameters
More informationPerformance Analysis of Data Mining Classification Techniques
Performance Analysis of Data Mining Classification Techniques Tejas Mehta 1, Dr. Dhaval Kathiriya 2 Ph.D. Student, School of Computer Science, Dr. Babasaheb Ambedkar Open University, Gujarat, India 1 Principal
More informationClassification Using Unstructured Rules and Ant Colony Optimization
Classification Using Unstructured Rules and Ant Colony Optimization Negar Zakeri Nejad, Amir H. Bakhtiary, and Morteza Analoui Abstract In this paper a new method based on the algorithm is proposed to
More informationBRACE: A Paradigm For the Discretization of Continuously Valued Data
Proceedings of the Seventh Florida Artificial Intelligence Research Symposium, pp. 7-2, 994 BRACE: A Paradigm For the Discretization of Continuously Valued Data Dan Ventura Tony R. Martinez Computer Science
More informationLEARNING WEIGHTS OF FUZZY RULES BY USING GRAVITATIONAL SEARCH ALGORITHM
International Journal of Innovative Computing, Information and Control ICIC International c 2013 ISSN 1349-4198 Volume 9, Number 4, April 2013 pp. 1593 1601 LEARNING WEIGHTS OF FUZZY RULES BY USING GRAVITATIONAL
More informationFuzzy Reasoning. Linguistic Variables
Fuzzy Reasoning Linguistic Variables Linguistic variable is an important concept in fuzzy logic and plays a key role in its applications, especially in the fuzzy expert system Linguistic variable is a
More informationGEOG 5113 Special Topics in GIScience. Why is Classical set theory restricted? Contradiction & Excluded Middle. Fuzzy Set Theory in GIScience
GEOG 5113 Special Topics in GIScience Fuzzy Set Theory in GIScience -Basic Properties and Concepts of Fuzzy Sets- Why is Classical set theory restricted? Boundaries of classical sets are required to be
More informationEgemen Tanin, Tahsin M. Kurc, Cevdet Aykanat, Bulent Ozguc. Abstract. Direct Volume Rendering (DVR) is a powerful technique for
Comparison of Two Image-Space Subdivision Algorithms for Direct Volume Rendering on Distributed-Memory Multicomputers Egemen Tanin, Tahsin M. Kurc, Cevdet Aykanat, Bulent Ozguc Dept. of Computer Eng. and
More informationMISSING VALUES AND LEARNING OF FUZZY RULES MICHAEL R. BERTHOLD. Computer Science Division, Department of EECS
MISSING VALUES AND LEARNING OF FUZZY RULES MICHAEL R. BERTHOLD Computer Science Division, Department of EECS University of California, Bereley, CA 94720, USA email: Michael.Berthold@Informati.Uni-Karlsruhe.DE
More informationSpeeding Up Fuzzy Clustering with Neural Network Techniques
Speeding Up Fuzzy Clustering with Neural Network Techniques Christian Borgelt and Rudolf Kruse Research Group Neural Networks and Fuzzy Systems Dept. of Knowledge Processing and Language Engineering, School
More informationExtra readings beyond the lecture slides are important:
1 Notes To preview next lecture: Check the lecture notes, if slides are not available: http://web.cse.ohio-state.edu/~sun.397/courses/au2017/cse5243-new.html Check UIUC course on the same topic. All their
More informationUsing Decision Boundary to Analyze Classifiers
Using Decision Boundary to Analyze Classifiers Zhiyong Yan Congfu Xu College of Computer Science, Zhejiang University, Hangzhou, China yanzhiyong@zju.edu.cn Abstract In this paper we propose to use decision
More informationEfficient Tuning of SVM Hyperparameters Using Radius/Margin Bound and Iterative Algorithms
IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 13, NO. 5, SEPTEMBER 2002 1225 Efficient Tuning of SVM Hyperparameters Using Radius/Margin Bound and Iterative Algorithms S. Sathiya Keerthi Abstract This paper
More informationKeywords: clustering algorithms, unsupervised learning, cluster validity
Volume 6, Issue 1, January 2016 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Clustering Based
More informationApplication of fuzzy set theory in image analysis. Nataša Sladoje Centre for Image Analysis
Application of fuzzy set theory in image analysis Nataša Sladoje Centre for Image Analysis Our topics for today Crisp vs fuzzy Fuzzy sets and fuzzy membership functions Fuzzy set operators Approximate
More informationThe task of inductive learning from examples is to nd an approximate definition
1 Initializing Neural Networks using Decision Trees Arunava Banerjee 1.1 Introduction The task of inductive learning from examples is to nd an approximate definition for an unknown function f(x), given
More informationFlexible-Hybrid Sequential Floating Search in Statistical Feature Selection
Flexible-Hybrid Sequential Floating Search in Statistical Feature Selection Petr Somol 1,2, Jana Novovičová 1,2, and Pavel Pudil 2,1 1 Dept. of Pattern Recognition, Institute of Information Theory and
More informationLogical Decision Rules: Teaching C4.5 to Speak Prolog
Logical Decision Rules: Teaching C4.5 to Speak Prolog Kamran Karimi and Howard J. Hamilton Department of Computer Science University of Regina Regina, Saskatchewan Canada S4S 0A2 {karimi,hamilton}@cs.uregina.ca
More informationFuzzyDT- A Fuzzy Decision Tree Algorithm Based on C4.5
FuzzyDT- A Fuzzy Decision Tree Algorithm Based on C4.5 Marcos E. Cintra 1, Maria C. Monard 2, and Heloisa A. Camargo 3 1 Exact and Natural Sciences Dept. - Federal University of the Semi-arid - UFERSA
More informationAST: Support for Algorithm Selection with a CBR Approach
AST: Support for Algorithm Selection with a CBR Approach Guido Lindner 1 and Rudi Studer 2 1 DaimlerChrysler AG, Research &Technology FT3/KL, PO: DaimlerChrysler AG, T-402, D-70456 Stuttgart, Germany guido.lindner@daimlerchrysler.com
More informationCluster quality 15. Running time 0.7. Distance between estimated and true means Running time [s]
Fast, single-pass K-means algorithms Fredrik Farnstrom Computer Science and Engineering Lund Institute of Technology, Sweden arnstrom@ucsd.edu James Lewis Computer Science and Engineering University of
More informationFuzzy If-Then Rules. Fuzzy If-Then Rules. Adnan Yazıcı
Fuzzy If-Then Rules Adnan Yazıcı Dept. of Computer Engineering, Middle East Technical University Ankara/Turkey Fuzzy If-Then Rules There are two different kinds of fuzzy rules: Fuzzy mapping rules and
More information6. Dicretization methods 6.1 The purpose of discretization
6. Dicretization methods 6.1 The purpose of discretization Often data are given in the form of continuous values. If their number is huge, model building for such data can be difficult. Moreover, many
More informationHuan Liu. Kent Ridge, Singapore Tel: (+65) ; Fax: (+65) Abstract
A Family of Ecient Rule Generators Huan Liu Department of Information Systems and Computer Science National University of Singapore Kent Ridge, Singapore 119260 liuh@iscs.nus.sg Tel: (+65)-772-6563; Fax:
More informationOBJECT-CENTERED INTERACTIVE MULTI-DIMENSIONAL SCALING: ASK THE EXPERT
OBJECT-CENTERED INTERACTIVE MULTI-DIMENSIONAL SCALING: ASK THE EXPERT Joost Broekens Tim Cocx Walter A. Kosters Leiden Institute of Advanced Computer Science Leiden University, The Netherlands Email: {broekens,
More informationFUZZY BOOLEAN ALGEBRAS AND LUKASIEWICZ LOGIC. Angel Garrido
Acta Universitatis Apulensis ISSN: 1582-5329 No. 22/2010 pp. 101-111 FUZZY BOOLEAN ALGEBRAS AND LUKASIEWICZ LOGIC Angel Garrido Abstract. In this paper, we analyze the more adequate tools to solve many
More informationSegmentation of Images
Segmentation of Images SEGMENTATION If an image has been preprocessed appropriately to remove noise and artifacts, segmentation is often the key step in interpreting the image. Image segmentation is a
More informationSemi-Supervised Clustering with Partial Background Information
Semi-Supervised Clustering with Partial Background Information Jing Gao Pang-Ning Tan Haibin Cheng Abstract Incorporating background knowledge into unsupervised clustering algorithms has been the subject
More informationImproving the Wang and Mendel s Fuzzy Rule Learning Method by Inducing Cooperation Among Rules 1
Improving the Wang and Mendel s Fuzzy Rule Learning Method by Inducing Cooperation Among Rules 1 J. Casillas DECSAI, University of Granada 18071 Granada, Spain casillas@decsai.ugr.es O. Cordón DECSAI,
More informationC-NBC: Neighborhood-Based Clustering with Constraints
C-NBC: Neighborhood-Based Clustering with Constraints Piotr Lasek Chair of Computer Science, University of Rzeszów ul. Prof. St. Pigonia 1, 35-310 Rzeszów, Poland lasek@ur.edu.pl Abstract. Clustering is
More informationthe number of states must be set in advance, i.e. the structure of the model is not t to the data, but given a priori the algorithm converges to a loc
Clustering Time Series with Hidden Markov Models and Dynamic Time Warping Tim Oates, Laura Firoiu and Paul R. Cohen Computer Science Department, LGRC University of Massachusetts, Box 34610 Amherst, MA
More informationAutomatic Generation of Fuzzy Classification Rules Using Granulation-Based Adaptive Clustering
Automatic Generation of Fuzzy Classification Rules Using Granulation-Based Adaptive Clustering Mohammed Al-Shammaa*, Maysam F. Abbod Department of Electronic and Computer Engineering Brunel University
More informationDynamic Clustering of Data with Modified K-Means Algorithm
2012 International Conference on Information and Computer Networks (ICICN 2012) IPCSIT vol. 27 (2012) (2012) IACSIT Press, Singapore Dynamic Clustering of Data with Modified K-Means Algorithm Ahamed Shafeeq
More information(Preliminary Version 2 ) Jai-Hoon Kim Nitin H. Vaidya. Department of Computer Science. Texas A&M University. College Station, TX
Towards an Adaptive Distributed Shared Memory (Preliminary Version ) Jai-Hoon Kim Nitin H. Vaidya Department of Computer Science Texas A&M University College Station, TX 77843-3 E-mail: fjhkim,vaidyag@cs.tamu.edu
More information7. Decision or classification trees
7. Decision or classification trees Next we are going to consider a rather different approach from those presented so far to machine learning that use one of the most common and important data structure,
More informationRecent Progress on RAIL: Automating Clustering and Comparison of Different Road Classification Techniques on High Resolution Remotely Sensed Imagery
Recent Progress on RAIL: Automating Clustering and Comparison of Different Road Classification Techniques on High Resolution Remotely Sensed Imagery Annie Chen ANNIEC@CSE.UNSW.EDU.AU Gary Donovan GARYD@CSE.UNSW.EDU.AU
More informationAn On-line Variable Length Binary. Institute for Systems Research and. Institute for Advanced Computer Studies. University of Maryland
An On-line Variable Length inary Encoding Tinku Acharya Joseph F. Ja Ja Institute for Systems Research and Institute for Advanced Computer Studies University of Maryland College Park, MD 242 facharya,
More informationPerformance Measures for Multi-Graded Relevance
Performance Measures for Multi-Graded Relevance Christian Scheel, Andreas Lommatzsch, and Sahin Albayrak Technische Universität Berlin, DAI-Labor, Germany {christian.scheel,andreas.lommatzsch,sahin.albayrak}@dai-labor.de
More informationAccelerating Unique Strategy for Centroid Priming in K-Means Clustering
IJIRST International Journal for Innovative Research in Science & Technology Volume 3 Issue 07 December 2016 ISSN (online): 2349-6010 Accelerating Unique Strategy for Centroid Priming in K-Means Clustering
More informationMass Classification Method in Mammogram Using Fuzzy K-Nearest Neighbour Equality
Mass Classification Method in Mammogram Using Fuzzy K-Nearest Neighbour Equality Abstract: Mass classification of objects is an important area of research and application in a variety of fields. In this
More informationData Mining. 3.2 Decision Tree Classifier. Fall Instructor: Dr. Masoud Yaghini. Chapter 5: Decision Tree Classifier
Data Mining 3.2 Decision Tree Classifier Fall 2008 Instructor: Dr. Masoud Yaghini Outline Introduction Basic Algorithm for Decision Tree Induction Attribute Selection Measures Information Gain Gain Ratio
More informationInitial Partitions. Region. Region Interpolation. 3. Region parametrization. Ordering. Partition Creation. Interpolated Partitions
SEGMENTATION-BASED MORPHOLOGICAL INTERPOLATION OF PARTITION SEQUENCES R. BR EMOND and F. MARQU ES Dept. of Signal Theory and Communications Universitat Politecnica de Catalunya Campus Nord -Modulo D5 C/
More informationFuzzy Modeling for Control.,,i.
Fuzzy Modeling for Control,,i. INTERNATIONAL SERIES IN INTELLIGENT TECHNOLOGIES Prof. Dr. Dr. h.c. Hans-Jiirgen Zimmermann, Editor European Laboratory for Intelligent Techniques Engineering Aachen, Germany
More informationA Novel Technique for Optimizing the Hidden Layer Architecture in Artificial Neural Networks N. M. Wagarachchi 1, A. S.
American International Journal of Research in Science, Technology, Engineering & Mathematics Available online at http://www.iasir.net ISSN (Print): 2328-3491, ISSN (Online): 2328-3580, ISSN (CD-ROM): 2328-3629
More informationEfficient Pruning Method for Ensemble Self-Generating Neural Networks
Efficient Pruning Method for Ensemble Self-Generating Neural Networks Hirotaka INOUE Department of Electrical Engineering & Information Science, Kure National College of Technology -- Agaminami, Kure-shi,
More informationInteractive Visualization of Fuzzy Set Operations
Interactive Visualization of Fuzzy Set Operations Yeseul Park* a, Jinah Park a a Computer Graphics and Visualization Laboratory, Korea Advanced Institute of Science and Technology, 119 Munji-ro, Yusung-gu,
More informationOC1: A randomized algorithm for building oblique. decision trees. Sreerama K. Murthy Simon Kasif. Steven Salzberg. Department of Computer Science
OC1: A randomized algorithm for building oblique decision trees Sreerama K. Murthy Simon Kasif Steven Salzberg Department of Computer Science Johns Hopkins University Baltimore, MD 21218 Richard Beigel
More information