Reducing redundancy in characteristic rule discovery by using integer programming techniques


Intelligent Data Analysis 4 (2000), IOS Press

Reducing redundancy in characteristic rule discovery by using integer programming techniques

Tom Brijs, Koen Vanhoof and Geert Wets
Department of Applied Economics, Limburg University Centre, B-3590 Diepenbeek, Belgium
{tom.brijs, koen.vanhoof, geert.wets}@luc.ac.be

Received 20 October 1999. Revised 2 December 1999. Accepted 12 December 1999.

Abstract. The discovery of characteristic rules is a well-known data mining task that has led to several successful applications. However, because of the descriptive nature of characteristic rules, a (very) large number of them is typically discovered during the mining stage. This makes monitoring and control of these rules extremely costly and difficult in practice. Therefore, selecting the most promising subset of rules is desirable. Some heuristic rule selection methods that deal with this issue have been proposed in the literature. In this paper, we propose an integer programming model to solve the problem of optimally selecting the most promising subset of characteristic rules. Moreover, the proposed technique makes it possible to enforce a user-defined level of overall model quality in combination with a maximum reduction of the redundancy present in the original ruleset. We use real-world data to empirically evaluate the benefits and performance of the proposed technique against the well-known RuleCover heuristic. Results demonstrate that the proposed integer programming techniques significantly reduce the number of retained rules and the level of redundancy in the final ruleset. Moreover, the results demonstrate that the overall quality, in terms of the discriminant power of the final ruleset, slightly increases when integer programming methods are used.

Keywords: Redundancy reduction, rule selection, characteristic rules, artificial intelligence
1. Introduction

Data mining is the automated search for hidden, previously unknown and potentially useful information in large databases. Moreover, data mining is a crucial phase in the KDD (Knowledge Discovery in Databases) process [7]. In fact, two important goals of KDD can be identified: prediction, i.e. the use of training data to construct a model to predict unknown values of future instances, and description, i.e. the search for interesting patterns and their (re)presentation in an easy, human-understandable format. In this paper, we are primarily interested in the latter objective, namely description, without, however, neglecting the former objective, i.e. predictive power. One of the most well-known data mining tasks for extracting descriptive information from data is the discovery of characteristic rules. Briefly, characteristic rules express characteristics or properties of a certain class of instances in attribute-value (propositional) rule format, such as if species = swan then type = bird and color = white. In general, for a characteristic rule Y → X, X summarizes one or more properties common to all (or many) instances of Y [14].¹ Among the advantages of characteristic rules are clearly their natural representation and the ease with which the discovered rules can be integrated with background knowledge. Several successful applications [3,5,24] have demonstrated their usefulness. However, some drawbacks of characteristic rules can also be identified. Firstly, because of their descriptive nature, a large number of rules is often discovered during the mining stage. Especially for real-world applications, this makes monitoring and control of these rules extremely costly and difficult. Secondly, characteristic rules often suffer from being incomplete, i.e. not all instances are covered by the set of discovered rules, or they may contain redundancy, i.e. the same database instance may be covered by multiple rules. Previous researchers have already highlighted this problem of redundancy. In their study on the interestingness of rules, Klemettinen, Mannila, Ronkainen, Toivonen and Verkamo [16] concluded: "A problem that remains is redundancy. Large amounts of rules could potentially be pruned, if there were appropriate ways to remove redundant or nearly redundant rules." Indeed, with characteristic rule discovery, instances may be covered by multiple rules, causing some rules to overlap, i.e. to describe the same database rows. In this paper, we specifically focus on this problem of redundancy and propose a post-processing method to reduce the redundancy present in a set of induced characteristic rules.
Moreover, we are able to influence the rule pruning process such that some overall measure of quality (such as the discriminant power of the reduced ruleset) can be controlled. The outline of the remainder of this paper is as follows. In Section 2, we introduce the discovery of characteristic rules and present a graphical illustration of the redundancy reduction problem. Section 3 provides an overview of previous work on the problem of interestingness in order to put the key issue of this work in a global perspective. Section 4 introduces a formal representation of the redundancy reduction problem and presents a novel solution for reducing the redundancy in a set of characteristic rules by using integer programming techniques. In Section 5, we discuss the results of the RuleCover heuristic [22] as a method for redundancy reduction and compare it with the two optimal models (Models 1 and 2) that we propose in this study. Finally, Section 6 summarizes our work and presents the limitations of this study.

2. Problem situation

2.1. Characteristic rules

Characteristic rules express characteristics or properties of one class of instances in a typical attribute-value, or propositional, rule format. For instance, to express that a swan is a bird and usually has a white color, the following characteristic rule may apply: if species = swan then type = bird and color = white. Although this rule satisfies the completeness property (i.e. it is true for all or almost all swans), it is not necessarily a good differentiator between different classes of instances in the database (i.e. parrots and ducks are also birds and can have a white color). Therefore, the above rule does not have a high discriminant power with respect to the target class, i.e. swans.
If Y represents the class value and X represents the (combination of) descriptive attribute value(s), then the following presents a formal representation of these notions [14]:

¹ Note that we adopt the same notation as proposed in [14]. For discriminant rules, they use the notation X → Y, whereas for characteristic rules they use the notation Y → X, where X is evidence and Y is a hypothesis.

Definition 1. Completeness (also confidence). The rule Y → X is s% complete if X satisfies/covers s% of the instances belonging to class value Y.

Definition 2. Discriminant power. The rule Y → X is c% discriminant if X satisfies/covers (100 − c)% of the negative instances.

These two measures are important to distinguish between discriminant/classification rules, which primarily serve the purpose of prediction, and characteristic/descriptive rules, which primarily serve the purpose of description. So far, most research on rule induction has concentrated on finding discriminant rules from examples, primarily in noise-free domains, i.e. finding rules that cover all positive instances (high completeness) without covering any of the negative instances (high discriminant power) (e.g. Michalski's AQ family of concept learning algorithms); some recent extensions have been proposed to cope with noisy data [10,16]. In contrast, until now, characteristic rule induction has received far less attention in the research community. However, in some situations, especially when dealing with highly skewed class frequency distributions, it may be worthwhile to consider partial classification techniques, such as characteristic rules, as an alternative [21]. For instance, when the concept/class to be described is largely underrepresented in the data (e.g. in customer satisfaction research, where dissatisfied customers typically represent a small minority compared to satisfied customers), it is known that traditional discriminant rule induction and classification techniques have difficulty discriminating well between the classes. In such situations, however, it may be worthwhile to describe the most prevalent characteristics of the instances of the target class (high completeness), with discriminant power of secondary importance.
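As a concrete illustration of Definitions 1 and 2, both measures can be computed directly from data. The sketch below is ours, not taken from the paper; the instance encoding and all names are illustrative.

```python
# Sketch of Definitions 1 and 2 for a single characteristic rule Y -> X.
# Instances are dicts of attribute values; the rule body X is a dict of
# required attribute-value pairs (illustrative representation, not the paper's).

def covers(rule_body, instance):
    """True when every attribute-value pair of X is present in the instance."""
    return all(instance.get(attr) == val for attr, val in rule_body.items())

def completeness(rule_body, positives):
    """Definition 1: percentage of class-Y instances covered by X."""
    return 100.0 * sum(covers(rule_body, i) for i in positives) / len(positives)

def discriminant_power(rule_body, negatives):
    """Definition 2: 100 minus the percentage of negative instances covered by X."""
    covered = 100.0 * sum(covers(rule_body, i) for i in negatives) / len(negatives)
    return 100.0 - covered

# Toy data: swans (positive class) versus other birds (negative class).
swans = [{"type": "bird", "color": "white"},
         {"type": "bird", "color": "white"},
         {"type": "bird", "color": "black"}]
others = [{"type": "bird", "color": "white"},
          {"type": "bird", "color": "green"}]

body = {"type": "bird", "color": "white"}     # the X of: if swan then bird & white
print(completeness(body, swans))              # ~66.7: 2 of 3 swans covered
print(discriminant_power(body, others))       # 50.0: 1 of 2 negatives covered
```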
In this setting, discovering characteristic rules is essentially a sequential process: firstly, finding a ruleset with high completeness and, secondly, removing rules that have low discriminant power. A number of methods have been developed to discover characteristic rules, of which the two most important approaches are the data cube approach [6,9] and the attribute-oriented induction approach [11,12]. The former is based on specialization and generalization operations carried out as drill-down and roll-up actions on multidimensional database cubes. The latter deploys attribute removal (when concept hierarchies on attributes do not exist) and attribute generalization (when concept hierarchies on attributes exist). In the current paper, we use the notion of frequent itemsets from association rules to generate all combinations of properties/characteristics of instances of a given class that have a minimum presence/completeness within that class.² A similar approach can be found in [3]. In fact, by using a minimum presence/support threshold, we can be sure that all discovered rules are minimally s% complete. The discovery of frequent itemsets has been studied extensively in the literature on association rules [1,2,18]. In this approach, concept hierarchies can also be used; however, the rule induction process is not based on the principle of attribute value generalization. Instead, the basic Apriori algorithm uses boolean attributes³ (representing whether the instance representing that concept possesses the property or not) and finds all combinations of properties of the concept that are shared by at least a minimum proportion of the instances representing that concept. The frequent-itemset approach is especially attractive when no concept hierarchy is available or when one is interested in finding characteristics of a concept expressed at the lowest attribute-value level.
² In essence, this involves finding association rules where the item in the consequent is fixed to a specific class value.
³ Numeric attributes are discretized and discrete-type attributes are mapped to boolean attributes for processing.
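The frequent-itemset route described above can be sketched as a plain levelwise (Apriori-style) search restricted to the instances of the target class. The code below is a minimal illustration, not the authors' implementation; all names and data are invented, and items are (attribute, value) pairs.

```python
from itertools import combinations

def frequent_itemsets(class_instances, min_sup):
    """Return {itemset: support} for itemsets of (attribute, value) pairs whose
    support within the target class is >= min_sup (a fraction, e.g. 0.6).
    Every such itemset is the body of a (100 * min_sup)%-complete rule."""
    n = len(class_instances)
    rows = [frozenset(inst.items()) for inst in class_instances]

    def support(itemset):
        return sum(itemset <= row for row in rows) / n

    # Level 1: frequent single properties.
    items = sorted({item for row in rows for item in row})
    current = [frozenset([it]) for it in items if support(frozenset([it])) >= min_sup]
    result = {c: support(c) for c in current}

    # Levels k > 1: join frequent (k-1)-sets and keep the frequent candidates.
    k = 2
    while current:
        candidates = {a | b for a, b in combinations(current, 2) if len(a | b) == k}
        current = [c for c in candidates if support(c) >= min_sup]
        result.update({c: support(c) for c in current})
        k += 1
    return result

swans = [{"type": "bird", "color": "white"},
         {"type": "bird", "color": "white"},
         {"type": "bird", "color": "black"}]
freq = frequent_itemsets(swans, 0.6)
# ('type', 'bird') holds for all swans; adding ('color', 'white') holds for 2 of 3.
```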

2.2. Redundancy: graphical problem illustration

Fig. 1. Redundancy in characteristic rules. (The figure shows four overlapping rule regions R1-R4 over class 1 and class 2 instances; rule R2 is drawn with a dashed line.)

Especially with the frequent-itemset approach, a (very) large number of characteristic rules is often discovered. Furthermore, because characteristic rules describe properties that are common to many or all instances of a class, different rules may describe different properties of the same instances. Consequently, mutual exclusivity in the discovered set of rules cannot be guaranteed, i.e. some instances in the database are covered by multiple rules. Although completeness and discriminant power (see Definitions 1 and 2) can be used to filter out less interesting rules, these measures do not guarantee mutual exclusivity in the induced set of rules. Therefore, other methods are needed to reduce the level of redundancy present in the set of characteristic rules. Graphically, redundancy can be presented as in Fig. 1, where it can be observed that rule 2 (dashed line) does not cover any instances beyond those already covered by the other rules (rules 1, 3 and 4) in the model. We suggest that rule 2 is redundant and can therefore be discarded. However, one must be careful in cutting rules from the ruleset, because:
- discarding rules can reduce the covered⁴ instance space, which may not be desirable;
- when discriminant power is of particular importance, the selected set of characteristic rules should describe as many positive instances and as few negative instances as possible with respect to the original ruleset.

3. Previous work

A number of methods have been proposed to deal with redundancy in a variety of ways. The following provides an overview of previous work in this field.
Gago and Bento [8] propose a distance metric between rules to select the most heterogeneous set of rules that together give a good coverage of the instance space. The method, however, has several drawbacks. First of all, it can only be applied if the underlying data follow a uniform distribution. Secondly, three weight parameters are specified in the distance function, but there is no concrete guidance on reasonable values for these parameters. Finally, outliers in the data can significantly affect the percentage of overlap of two rules.

⁴ An instance is said to be covered by a rule when the attribute-value combinations in the rule are also present in the instance.

Kryszkiewicz [17] introduces the interesting notion of representative association rules (RR), i.e. a least set of rules that covers all association rules. A user may then be provided with the set of RRs instead of the whole set of association rules. However, when needed, all usual association rules can be generated from the set of RRs by means of a cover operator. Hoschka and Klösgen [13] deal with the problem of redundancy in their Explora system, which uses partial orderings of attributes and attribute sets to avoid presenting several kinds of redundant knowledge. Bayardo [4] proposes a pruning strategy called redundancy exploitation. The idea is to prevent continued effort at classifying instances already classified by existing rules with high confidence. In yet another approach, the use of rule covers was proposed by Toivonen et al. [22] to reduce redundancy in a discovered set of association rules (see also Section 5). Whereas all of the above techniques are based on heuristic procedures, this paper studies the performance of an optimal rule selection technique based on integer programming methods. More specifically, the RuleCover algorithm will be used as a benchmark against the results of our integer programming models (see Section 4). One important advantage of our approach compared to heuristic approaches is that the selection of rules for the final ruleset is independent of any ordering of the rules. For instance, with the RuleCover heuristic, the stepwise selection of a subsequent rule depends on which rules have been chosen during the previous steps. Consequently, because of the adoption of heuristic selection criteria, some of the previously selected rules may not be optimal from an overall perspective.
In contrast, our integer programming approach always results in the optimal selection irrespective of the ordering of the rules, because it selects rules simultaneously instead of stepwise. Finally, redundancy has also been tackled from a totally different point of view: in the research community involved in the validation and verification of knowledge-based systems, redundancy has mainly been studied from a syntactical point of view [19,20,23].

4. Solution: integer programming

In this section, we propose a solution that takes into account either one (Model 1) or both (Model 2) of the measures of completeness and discriminant power that were deemed important for assessing rule quality. Before we introduce both models, we elaborate on the algebraic definition of the problem.

4.1. Algebraic problem definition

Consider the instance-rule matrix in Table 1. The matrix shows K rules and N instances. Depending on the number of classes, the instance space is subdivided into two or more groups. In the matrix of Table 1, two groups of instances can be identified: one group belongs to a first (positive) class and carries the index values 1 to I; the other group belongs to a second (negative) class and carries the index values J to N. The matrix shows whether a particular instance i is covered by a certain characteristic rule j or not. Note, however, that the matrix only contains instances that are covered at least once by the original ruleset. As a consequence, the number of rows in the matrix may be lower than the total number of instances, because with a characteristic ruleset the property of completeness can be violated (see Section 2). Formally, we define:

    s_ij = 1 if instance i is covered by rule j
    s_ij = 0 if instance i is not covered by rule j
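Building the instance-rule matrix s_ij from a ruleset and a dataset is straightforward. The sketch below is our own, with illustrative rules and instances; as in the matrix described above, rows covered by no rule are dropped.

```python
# Building the instance-rule matrix: s[i][j] = 1 when instance i is covered
# by rule j. Rules and instances here are illustrative, not the paper's data.

def coverage_matrix(instances, rule_bodies):
    def covers(body, inst):
        return all(inst.get(a) == v for a, v in body.items())
    s = [[int(covers(body, inst)) for body in rule_bodies] for inst in instances]
    # Keep only instances covered at least once by the original ruleset.
    return [row for row in s if any(row)]

instances = [{"a": 1, "b": 0}, {"a": 1, "b": 1}, {"a": 0, "b": 0}]
rules = [{"a": 1}, {"b": 1}]
print(coverage_matrix(instances, rules))   # [[1, 0], [1, 1]] - third row dropped
```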

Table 1. Instance-rule matrix

                  Rule 1   Rule 2   ...   Rule K
    Instance 1     s_11     s_12    ...    s_1K
    ...
    Instance I     s_I1     s_I2    ...    s_IK
    Instance J     s_J1     s_J2    ...    s_JK
    ...
    Instance N     s_N1     s_N2    ...    s_NK

Now consider the formulation of the following integer programming (IP) model.

4.2. Model specification

Model 1. Maximal redundancy reduction.

    Let: I = number of positive instances, K = number of characteristic rules
    Given: s_ij
    Boolean decision variables: x_j
    Target function:
        Min Z = Σ_{i=1..I} Σ_{j=1..K} s_ij x_j
    Subject to:
        ∀i (i = 1, ..., I): Σ_{j=1..K} x_j s_ij ≥ 1

The decision variable x_j is binary-valued and specifies whether characteristic rule j will be included in the final ruleset. The target function specifies that the model should look for rules with as little overlap of instances as possible in the group of positive instances. This means that the model searches for characteristic rules that are as far apart as possible in the positive instance space. The constraint ensures that the original positive instance space is not reduced, so that the final ruleset still covers all positive instances (in the instance-rule matrix) that were covered by the original ruleset. Without this constraint, the model would select no rules at all, since the objective function forces the model to select as few rules as possible. Although it provides an optimal solution to the redundancy reduction problem, the model presented above still suffers from a few imperfections. Firstly, the user may want to sacrifice completeness in return for obtaining fewer rules. Obviously, if the objective is to further reduce the size of the ruleset obtained from Model 1, this cannot be done without reducing the current level of completeness of that ruleset, because the solution is guaranteed to be optimal given the constraint that each positive instance in the instance-rule matrix must be covered.
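For small rulesets, Model 1 can even be solved by exhaustive search rather than a dedicated IP solver. The sketch below (ours, with toy data) enumerates rule subsets, keeps only those that still cover every positive row of the matrix, and returns the one with the smallest total coverage sum; it is exponential in K, so it is only a toy stand-in for the IP formulation.

```python
from itertools import combinations

def model1(s):
    """Brute-force Model 1. s[i][j] = 1 iff positive instance i is covered by
    rule j. Returns (objective value Z, tuple of selected rule indices)."""
    n_rules = len(s[0])
    best = None
    for size in range(1, n_rules + 1):
        for subset in combinations(range(n_rules), size):
            # Constraint: every positive instance must stay covered.
            if all(any(row[j] for j in subset) for row in s):
                z = sum(row[j] for row in s for j in subset)  # coverage sum
                if best is None or z < best[0]:
                    best = (z, subset)
    return best

s = [[1, 1, 0],
     [1, 0, 0],
     [0, 1, 1],
     [0, 0, 1]]
print(model1(s))   # (4, (0, 2)): rules 1 and 3 cover all rows with no overlap
```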
However, in some circumstances, for instance when dealing with noisy data, it may be sensible to sacrifice a certain level of completeness to enable further reduction of the ruleset [11]. Secondly, although completeness is probably the most important quality criterion for characteristic rules, the user may also want to take into account the discriminant power of the rules (see Definition 2). Indeed, when selecting rules for the final (redundancy-reduced) ruleset, it may be appropriate to select characteristic rules that, as a group, cover as few negative instances as possible. This is particularly important when the task is to discover characteristics of a class of instances that is largely underrepresented in the data while, at the same time, those characteristics should differentiate the target class from the other classes in the data. To accomplish this, we introduce explicit, user-defined bounds on the coverage of positive and negative instances by the final ruleset. More specifically:

    α = proportion of positive instances that are covered by the final ruleset
    β = proportion of negative instances that are covered by the final ruleset

When all positive instances in the instance-rule matrix (see Table 1) are covered by the final ruleset, α = 100%. However, to account for a certain level of noise in the data, we specify that a certain proportion (100 − α) of the positive instances is allowed to remain uncovered, i.e. we sacrifice completeness. In addition, to control the discriminant power (i.e. the proportion of negative instances covered by the final ruleset), we specify that no more than β percent of the negative instances in the instance-rule matrix may be covered. Integrating these improvements into Model 1 results in the slightly different model presented below.

Model 2. Incorporating α and β into Model 1.

    Let: I = number of positive instances, N − I = number of negative instances, K = number of rules
    Given: s_ij, W_1, W_2, α, β
    Boolean decision variables: x_j (and boolean slack variables p_i, q_i)
    Target function:
        Min Z = W_1 Σ_{i=1..I} Σ_{j=1..K} s_ij x_j + W_2 Σ_{i=J..N} Σ_{j=1..K} s_ij x_j
    Subject to:
        ∀i (i = 1, ..., I): Σ_{j=1..K} x_j s_ij + p_i ≥ 1
        ∀i (i = J, ..., N): Σ_{j=1..K} x_j s_ij − M q_i ≤ 0
        Σ_{i=1..I} p_i ≤ ((100 − α)/100) I
        Σ_{i=J..N} q_i ≤ (β/100) (N − I)

In Model 2, in contrast to Model 1, the target function consists of two parts. The first part represents the coverage of positive instances, whereas the second part represents the coverage of negative instances. Consequently, the first part forces the search algorithm to select as few rules as possible, and the second part of the target function is conceived to minimize the coverage of negative-class instances in order to increase the discriminant power of the final ruleset.
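Model 2 can be sketched in the same brute-force style. In the sketch below (ours, with toy numbers), the α and β bounds are checked directly on each candidate subset instead of through the slack variables p_i, q_i and the big-M constraint that the IP formulation needs.

```python
from itertools import combinations

def model2(s_pos, s_neg, alpha, beta, w1, w2):
    """Brute-force Model 2: weighted coverage sum, with a lower bound alpha (%)
    on positive coverage and an upper bound beta (%) on negative coverage."""
    n_rules = len(s_pos[0])
    best = None
    for size in range(n_rules + 1):
        for subset in combinations(range(n_rules), size):
            covered_pos = sum(any(r[j] for j in subset) for r in s_pos)
            covered_neg = sum(any(r[j] for j in subset) for r in s_neg)
            if covered_pos < alpha / 100 * len(s_pos):   # completeness bound
                continue
            if covered_neg > beta / 100 * len(s_neg):    # discriminant bound
                continue
            z = (w1 * sum(r[j] for r in s_pos for j in subset)
                 + w2 * sum(r[j] for r in s_neg for j in subset))
            if best is None or z < best[0]:
                best = (z, subset)
    return best

s_pos = [[1, 1], [1, 0], [0, 1]]
s_neg = [[0, 1], [0, 0]]
w1 = len(s_neg) / len(s_pos)        # W_1 = (N - I) / I, as suggested in the text
print(model2(s_pos, s_neg, alpha=100, beta=50, w1=w1, w2=1.0))
```

If α is set too high and β too low, no subset passes both checks and the function returns None, mirroring the infeasibility discussed in the text.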
Without any constraints, this model would result in an empty ruleset. Therefore, constraints are again added to the model to enforce certain quality criteria. More specifically, to control the completeness of the final ruleset, the first constraint specifies that at least a certain proportion of positive instances must be covered. This is enforced by the introduction of a boolean slack variable p_i, which determines the number of positive instances that may be left uncovered by the final ruleset. p_i is itself subject to a constraint specifying that no more than (100 − α) percent of the positive instances (in the instance-rule matrix) may be left uncovered by the final ruleset. In addition, the second constraint controls the discriminant power of the final ruleset. This is enforced by the introduction of a boolean slack variable q_i, which determines the number of negative instances that may be covered by the final ruleset. q_i is itself subject to a constraint specifying that no more than β percent of the negative instances may be covered by the final ruleset. A sufficiently large number M is introduced in the second constraint to ensure that the boolean slack variable q_i can compensate for the fact that an instance may be covered by multiple rules. The W_1 and W_2 parameters are continuous weight values that correct for possible bias in the target function as a result of a different number of instances in each class. When there are more negative than positive instances, W_1 > W_2, i.e. W_1 = (N − I)/I. Moreover, these weight parameters offer the user an additional degree of freedom to specify the importance of each part of the target function. The model, however, is not guaranteed to reach a solution; this depends on the chosen values of the parameters α and β. For example, if α is too high and β too low, reaching a solution may be impossible: it will then be difficult for the model to find a good set of rules that has a low degree of redundancy while covering at least α percent of the positive instances and less than β percent of the negative instances. When discussing the empirical results (Section 5), we elaborate on this and propose guidelines for appropriate settings of the α and β parameters.

5. Empirical evaluation

To assess the performance of the proposed method, we use the results of previous research [5]. In short, in the latter study, data from a customer satisfaction survey carried out by a leading Belgian bank were used to identify characteristic rules for dissatisfaction. With these rules, 733 latently dissatisfied customers were identified, i.e. customers who report overall satisfaction but who possess characteristics that are indicators of dissatisfaction. The dataset contained 7264 instances, of which 445 (6.1%) reported dissatisfaction and the rest (6819, or 93.9%) reported satisfaction. Characteristic rules were used since we were interested in the characteristics of dissatisfied customers, and this group is largely underrepresented in the data. It turned out that 29 characteristic rules for dissatisfaction were found to be interesting,⁵ covering 328 (i.e. completeness = 74%) of the total group of dissatisfied instances.
However, closer observation of the discovered set of rules revealed considerable redundancy (i.e. the same instance being covered by multiple rules). Therefore, as a post-processing step, the integer programming methods presented in Section 4.2 will be used to reduce the redundancy and select a smaller set of rules. We compare the results of our integer programming method against those obtained from the heuristic rule cover method (see the formal description below) proposed by Toivonen et al. [22].

Algorithm RuleCover.
    Input: Set of rules Γ = {X_i → Y | i = 1, ..., n}. Sets of matched rows m(X_i → Y) for all i ∈ {1, ..., n}.
    Output: Rule cover Δ.
    Method:
        Δ := ∅;                              // rule cover
        s := ∪_{i=1..n} m(X_i → Y);          // rows unmatched by cover
        for all i ∈ {1, ..., n} do
            s_i := m(X_i → Y);               // rows of s matched by rule i
        end;
        while s ≠ ∅ do
            choose i ∈ {1, ..., n} so that (X_i → Y) ∈ Γ and |s_i| is largest;
            Δ := Δ ∪ {X_i → Y};              // add the rule to the cover
            Γ := Γ \ {X_i → Y};              // remove the rule from the original set
            for all (X_j → Y) ∈ Γ do
                s_j := s_j \ m(X_i → Y);     // remove matched rows
            end;
            s := s \ m(X_i → Y);             // remove matched rows
        end;

⁵ Defined as the difference between the percentage coverage of positive instances and the percentage coverage within the total group of instances (i.e. positive and negative).

In short, the RuleCover algorithm works as follows: a greedy algorithm starts from the original set Γ (containing the entire set of characteristic rules) and iteratively selects a rule X_i → Y to move into the cover Δ. In each pass, the rule is selected that covers the maximum number of instances left over after deleting the instances covered by the rules selected during the previous passes. This process continues until no instances or rules are left. At the end, Δ contains a rule cover of Γ. In Sections 5.1 and 5.2, the results of the empirical research are highlighted.

5.1. Maximal redundancy reduction (Model 1)

In the first analysis, we compare the RuleCover heuristic with the first proposed integer programming model for selecting a ruleset with minimal redundancy. In fact, this means that for IP Model 1 of Section 4.2, the redundancy in the ruleset covering the positive instances has to be minimized, regardless of the performance of the ruleset in the negative class, i.e. without taking the discriminant power of the final ruleset into account. The empirical results show that the IP model succeeds in selecting fewer rules than the RuleCover algorithm and that the level of redundancy is lower when the IP model is applied. Table 2 illustrates these results. RuleCover returns 15 rules, whereas the integer programming algorithm returns only 13 rules that are able to cover all positive instances covered by the original ruleset (α = 100%).

Table 2. Results: RuleCover versus Model 1

    Method      Size original   Size final   Average      Coverage positive   Complete-   Coverage negative
                ruleset         ruleset      redundancy   class (α)           ness        class (β)
    RuleCover   29              15           5.02         100%                74%         Irrelevant
    Model 1     29              13           –            100%                74%         Irrelevant
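The RuleCover pseudocode above translates almost line for line into Python; the minimal rendering below is ours and assumes each rule is represented only by its set m(X_i → Y) of matched row ids.

```python
# Direct rendering of the greedy RuleCover pseudocode. matched maps each
# rule id to the set of row ids it covers; the row/rule ids are illustrative.

def rule_cover(matched):
    """Return the greedy cover as a list of rule ids, in selection order."""
    uncovered = set().union(*matched.values())           # rows unmatched by cover
    remaining = {i: set(rows) for i, rows in matched.items()}
    cover = []
    while uncovered:
        # Choose the rule matching the most still-uncovered rows.
        i = max(remaining, key=lambda r: len(remaining[r]))
        cover.append(i)                                  # add the rule to the cover
        newly = matched[i] & uncovered
        uncovered -= newly                               # remove matched rows
        del remaining[i]                                 # remove rule from original set
        for r in remaining:
            remaining[r] -= newly                        # remove matched rows
    return cover

m = {1: {1, 2, 3}, 2: {2, 3}, 3: {3, 4}}
print(rule_cover(m))   # [1, 3]: rule 2 adds nothing once rule 1 is chosen
```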
This means that within the total dataset, the completeness of the ruleset remains 74%, i.e. equal to the completeness of the original ruleset of 29 rules. Furthermore, the level of redundancy differs significantly between the two methods. The average number of times each positive instance is covered by the final ruleset amounts to 5.02 for the RuleCover algorithm, and is considerably lower for the IP model. This again illustrates that RuleCover is a heuristic and therefore cannot guarantee an optimal solution to the redundancy reduction problem. In fact, 11 of the 15 rules selected by the RuleCover algorithm were also selected by our method. However, it must be clear that no attention is paid here to the discriminant power of the resulting ruleset. It is actually possible for RuleCover to return more rules, i.e. to produce more redundancy, while its discriminant power in terms of the coverage of negative instances is better (i.e. it covers fewer negative instances) than that obtained by the integer programming model.

5.2. Incorporating α and β (Model 2)

Firstly, the number of negative instances covered by the final ruleset is, overall, an important indicator of the discriminant power (see Definition 2 in Section 2) of the final ruleset, and therefore it should play an important role in the selection of rules for the final ruleset. Secondly, the user may be willing to sacrifice completeness in return for obtaining fewer rules. Especially when dealing with real-world data, certain levels of noise may be expected, and specifying α = 100% may therefore be too restrictive. To carry out the analysis, the following idea was adopted. Observations indicated that of the 15 patterns selected by RuleCover (see Table 2), the first four cover 81% of the positive instances (in the instance-rule matrix), which corresponds to 60% completeness within the total group of positive instances in the dataset, which is very reasonable. This 4-member ruleset, however, also covers 458 (62.4%) of the negative instances in the instance-rule matrix (see Table 1), producing a relatively high discriminant power of 93% within the total dataset. Consequently, Model 2 of Section 4.2 can be used to select the minimum set of rules that achieves at least the same coverage of positive instances while covering less than 62.4% of the negative instances of the instance-rule matrix. As such, we force the IP model to perform at least as well as the RuleCover heuristic. More specifically, by setting α equal to 81% and β equal to 62.4%, the final ruleset selected by the integer programming model covers 437 (59%) of the negative instances in the instance-rule matrix, resulting in an overall discriminant power of 94%, which is slightly better than the RuleCover heuristic. Table 3 summarizes these results.

Table 3. Results: RuleCover versus Model 2

    Method      Size original   Size final   Average      Coverage positive   Complete-   Coverage negative   Discriminant
                ruleset         ruleset      redundancy   class (α)           ness        class (β)           power
    RuleCover   29              4            1.54         81%                 60%         62.4%               93%
    Model 2     29              –            1.32         81%                 60%         59%                 94%
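The redundancy figure used in this evaluation, i.e. the average number of times each covered instance is matched by the selected ruleset, can be computed directly from the instance-rule matrix. The sketch below is ours and uses toy numbers, not the paper's data.

```python
# Average redundancy of a selected ruleset: mean cover count over the
# instances that the selection covers at least once.

def average_redundancy(s, selected):
    """s[i][j] = 1 iff instance i is covered by rule j; selected = rule indices."""
    counts = [sum(row[j] for j in selected) for row in s]
    covered = [c for c in counts if c > 0]
    return sum(covered) / len(covered)

s = [[1, 1, 0],
     [1, 0, 1],
     [0, 0, 1]]
print(average_redundancy(s, [0, 1, 2]))   # 5 covers over 3 instances ~ 1.67
print(average_redundancy(s, [0, 2]))      # 4 covers over 3 instances ~ 1.33
```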
Notice that although the percentage improvement in discriminant power is relatively low (i.e. 1%), the absolute difference in the coverage of negative instances gives a clearer picture: from 458 with RuleCover to 437 with Model 2, a difference of 21 instances. Since the primary objective of the algorithm is to reduce redundancy, and not to maximally discriminate between the two target groups, the amount of reduction in redundancy is also an important indicator. In analogy with Section 5.1, the degree of redundancy can be expressed as the average number of times each instance is covered by the final ruleset. For the ruleset obtained from the RuleCover heuristic, this figure amounts to 1.54, whereas for the optimal IP model it amounts to only 1.32, again a significant improvement. Furthermore, the above illustrations indicate that the parameter values for α and β obtained from examining the results of RuleCover are good lower bounds (for α) and upper bounds (for β) for the parameter values to be used in the optimization model.

6. Conclusion and future work

In this paper, we introduced two integer programming models to tackle the problem of redundancy in a set of characteristic rules. The first model searches for an optimal selection of rules that maximally reduces redundancy under the constraint of covering all (positive) instances that are covered by the original ruleset. In the second model, the first model was adapted (by incorporating two parameters α and β) to allow flexible adjustment of the completeness and discriminant power of the final ruleset. Both models were empirically tested on real-world data and compared with the well-known RuleCover heuristic. It was found that the IP models are able to produce significantly better results than

the RuleCover heuristic: firstly, in terms of the number of characteristic rules retained for the final ruleset; secondly, in terms of the discriminant power of the final ruleset; and finally, in terms of the total redundancy that remains in the final ruleset. However, the reader should also consider the limitations of this work. Currently, the comparison of our models against the RuleCover heuristic has been carried out on one dataset only. Therefore, to gain better insight into the improvements offered by the proposed models, analyses on additional datasets should be carried out. Secondly, the proposed methods may become computationally expensive when very large redundancy reduction problems are considered; in those circumstances, the use of heuristic techniques may be preferable. In contrast, for reasonably sized problems, the proposed integer programming methods are worth considering as an alternative to the traditional heuristic techniques.

Acknowledgment

Tom Brijs is a research fellow of the Fund for Scientific Research, Flanders (FWO-Vlaanderen).


More information

Data Mining. Part 2. Data Understanding and Preparation. 2.4 Data Transformation. Spring Instructor: Dr. Masoud Yaghini. Data Transformation

Data Mining. Part 2. Data Understanding and Preparation. 2.4 Data Transformation. Spring Instructor: Dr. Masoud Yaghini. Data Transformation Data Mining Part 2. Data Understanding and Preparation 2.4 Spring 2010 Instructor: Dr. Masoud Yaghini Outline Introduction Normalization Attribute Construction Aggregation Attribute Subset Selection Discretization

More information

EFFICIENT ATTRIBUTE REDUCTION ALGORITHM

EFFICIENT ATTRIBUTE REDUCTION ALGORITHM EFFICIENT ATTRIBUTE REDUCTION ALGORITHM Zhongzhi Shi, Shaohui Liu, Zheng Zheng Institute Of Computing Technology,Chinese Academy of Sciences, Beijing, China Abstract: Key words: Efficiency of algorithms

More information

An Approach for Accessing Linked Open Data for Data Mining Purposes

An Approach for Accessing Linked Open Data for Data Mining Purposes An Approach for Accessing Linked Open Data for Data Mining Purposes Andreas Nolle, German Nemirovski Albstadt-Sigmaringen University nolle, nemirovskij@hs-albsig.de Abstract In the recent time the amount

More information

Association Rule Learning

Association Rule Learning Association Rule Learning 16s1: COMP9417 Machine Learning and Data Mining School of Computer Science and Engineering, University of New South Wales March 15, 2016 COMP9417 ML & DM (CSE, UNSW) Association

More information

Open Access Research on the Data Pre-Processing in the Network Abnormal Intrusion Detection

Open Access Research on the Data Pre-Processing in the Network Abnormal Intrusion Detection Send Orders for Reprints to reprints@benthamscience.ae 1228 The Open Automation and Control Systems Journal, 2014, 6, 1228-1232 Open Access Research on the Data Pre-Processing in the Network Abnormal Intrusion

More information

An Empirical Study of Lazy Multilabel Classification Algorithms

An Empirical Study of Lazy Multilabel Classification Algorithms An Empirical Study of Lazy Multilabel Classification Algorithms E. Spyromitros and G. Tsoumakas and I. Vlahavas Department of Informatics, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece

More information

An Efficient Reduced Pattern Count Tree Method for Discovering Most Accurate Set of Frequent itemsets

An Efficient Reduced Pattern Count Tree Method for Discovering Most Accurate Set of Frequent itemsets IJCSNS International Journal of Computer Science and Network Security, VOL.8 No.8, August 2008 121 An Efficient Reduced Pattern Count Tree Method for Discovering Most Accurate Set of Frequent itemsets

More information

Mining of Web Server Logs using Extended Apriori Algorithm

Mining of Web Server Logs using Extended Apriori Algorithm International Association of Scientific Innovation and Research (IASIR) (An Association Unifying the Sciences, Engineering, and Applied Research) International Journal of Emerging Technologies in Computational

More information

An Evolutionary Algorithm for Mining Association Rules Using Boolean Approach

An Evolutionary Algorithm for Mining Association Rules Using Boolean Approach An Evolutionary Algorithm for Mining Association Rules Using Boolean Approach ABSTRACT G.Ravi Kumar 1 Dr.G.A. Ramachandra 2 G.Sunitha 3 1. Research Scholar, Department of Computer Science &Technology,

More information

An Efficient Algorithm for Finding the Support Count of Frequent 1-Itemsets in Frequent Pattern Mining

An Efficient Algorithm for Finding the Support Count of Frequent 1-Itemsets in Frequent Pattern Mining An Efficient Algorithm for Finding the Support Count of Frequent 1-Itemsets in Frequent Pattern Mining P.Subhashini 1, Dr.G.Gunasekaran 2 Research Scholar, Dept. of Information Technology, St.Peter s University,

More information

Database and Knowledge-Base Systems: Data Mining. Martin Ester

Database and Knowledge-Base Systems: Data Mining. Martin Ester Database and Knowledge-Base Systems: Data Mining Martin Ester Simon Fraser University School of Computing Science Graduate Course Spring 2006 CMPT 843, SFU, Martin Ester, 1-06 1 Introduction [Fayyad, Piatetsky-Shapiro

More information

UAPRIORI: AN ALGORITHM FOR FINDING SEQUENTIAL PATTERNS IN PROBABILISTIC DATA

UAPRIORI: AN ALGORITHM FOR FINDING SEQUENTIAL PATTERNS IN PROBABILISTIC DATA UAPRIORI: AN ALGORITHM FOR FINDING SEQUENTIAL PATTERNS IN PROBABILISTIC DATA METANAT HOOSHSADAT, SAMANEH BAYAT, PARISA NAEIMI, MAHDIEH S. MIRIAN, OSMAR R. ZAÏANE Computing Science Department, University

More information

Course on Data Mining ( )

Course on Data Mining ( ) Course on Data Mining (581550-4) Intro/Ass. Rules 24./26.10. Episodes 30.10. 7.11. Home Exam Clustering 14.11. KDD Process 21.11. Text Mining 28.11. Appl./Summary 21.11.2001 Data mining: KDD Process 1

More information

Probabilistic Abstraction Lattices: A Computationally Efficient Model for Conditional Probability Estimation

Probabilistic Abstraction Lattices: A Computationally Efficient Model for Conditional Probability Estimation Probabilistic Abstraction Lattices: A Computationally Efficient Model for Conditional Probability Estimation Daniel Lowd January 14, 2004 1 Introduction Probabilistic models have shown increasing popularity

More information

Bitmap index-based decision trees

Bitmap index-based decision trees Bitmap index-based decision trees Cécile Favre and Fadila Bentayeb ERIC - Université Lumière Lyon 2, Bâtiment L, 5 avenue Pierre Mendès-France 69676 BRON Cedex FRANCE {cfavre, bentayeb}@eric.univ-lyon2.fr

More information

Optimization using Ant Colony Algorithm

Optimization using Ant Colony Algorithm Optimization using Ant Colony Algorithm Er. Priya Batta 1, Er. Geetika Sharmai 2, Er. Deepshikha 3 1Faculty, Department of Computer Science, Chandigarh University,Gharaun,Mohali,Punjab 2Faculty, Department

More information

9. Conclusions. 9.1 Definition KDD

9. Conclusions. 9.1 Definition KDD 9. Conclusions Contents of this Chapter 9.1 Course review 9.2 State-of-the-art in KDD 9.3 KDD challenges SFU, CMPT 740, 03-3, Martin Ester 419 9.1 Definition KDD [Fayyad, Piatetsky-Shapiro & Smyth 96]

More information

Decision Management in the Insurance Industry: Standards and Tools

Decision Management in the Insurance Industry: Standards and Tools Decision Management in the Insurance Industry: Standards and Tools Kimon Batoulis 1, Alexey Nesterenko 2, Günther Repitsch 2, and Mathias Weske 1 1 Hasso Plattner Institute, University of Potsdam, Potsdam,

More information

Dynamic Data in terms of Data Mining Streams

Dynamic Data in terms of Data Mining Streams International Journal of Computer Science and Software Engineering Volume 1, Number 1 (2015), pp. 25-31 International Research Publication House http://www.irphouse.com Dynamic Data in terms of Data Mining

More information

Mining for Mutually Exclusive Items in. Transaction Databases

Mining for Mutually Exclusive Items in. Transaction Databases Mining for Mutually Exclusive Items in Transaction Databases George Tzanis and Christos Berberidis Department of Informatics, Aristotle University of Thessaloniki Thessaloniki 54124, Greece {gtzanis, berber,

More information

A Survey on Moving Towards Frequent Pattern Growth for Infrequent Weighted Itemset Mining

A Survey on Moving Towards Frequent Pattern Growth for Infrequent Weighted Itemset Mining A Survey on Moving Towards Frequent Pattern Growth for Infrequent Weighted Itemset Mining Miss. Rituja M. Zagade Computer Engineering Department,JSPM,NTC RSSOER,Savitribai Phule Pune University Pune,India

More information

Rough Set Approaches to Rule Induction from Incomplete Data

Rough Set Approaches to Rule Induction from Incomplete Data Proceedings of the IPMU'2004, the 10th International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems, Perugia, Italy, July 4 9, 2004, vol. 2, 923 930 Rough

More information

Applying Data Mining Techniques to Wafer Manufacturing

Applying Data Mining Techniques to Wafer Manufacturing Applying Data Mining Techniques to Wafer Manufacturing Elisa Bertino 1, Barbara Catania 2, and Eleonora Caglio 3 1 Università degli Studi di Milano Via Comelico 39/41 20135 Milano, Italy bertino@dsi.unimi.it

More information

Ontology Based Data Analysing Approach for Actionable Knowledge Discovery

Ontology Based Data Analysing Approach for Actionable Knowledge Discovery IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 78-661,p-ISSN: 78-877, Volume 16, Issue 6, Ver. IV (Nov Dec. 14), PP 39-45 Ontology Based Data Analysing Approach for Actionable Knowledge Discovery

More information

Infrequent Weighted Itemset Mining Using SVM Classifier in Transaction Dataset

Infrequent Weighted Itemset Mining Using SVM Classifier in Transaction Dataset Infrequent Weighted Itemset Mining Using SVM Classifier in Transaction Dataset M.Hamsathvani 1, D.Rajeswari 2 M.E, R.Kalaiselvi 3 1 PG Scholar(M.E), Angel College of Engineering and Technology, Tiruppur,

More information

Data Mining Technology Based on Bayesian Network Structure Applied in Learning

Data Mining Technology Based on Bayesian Network Structure Applied in Learning , pp.67-71 http://dx.doi.org/10.14257/astl.2016.137.12 Data Mining Technology Based on Bayesian Network Structure Applied in Learning Chunhua Wang, Dong Han College of Information Engineering, Huanghuai

More information

Building a Concept Hierarchy from a Distance Matrix

Building a Concept Hierarchy from a Distance Matrix Building a Concept Hierarchy from a Distance Matrix Huang-Cheng Kuo 1 and Jen-Peng Huang 2 1 Department of Computer Science and Information Engineering National Chiayi University, Taiwan 600 hckuo@mail.ncyu.edu.tw

More information

Discovering the Association Rules in OLAP Data Cube with Daily Downloads of Folklore Materials *

Discovering the Association Rules in OLAP Data Cube with Daily Downloads of Folklore Materials * Discovering the Association Rules in OLAP Data Cube with Daily Downloads of Folklore Materials * Galina Bogdanova, Tsvetanka Georgieva Abstract: Association rules mining is one kind of data mining techniques

More information