Reducing redundancy in characteristic rule discovery by using integer programming techniques


Intelligent Data Analysis 4 (2000), IOS Press

Reducing redundancy in characteristic rule discovery by using integer programming techniques

Tom Brijs, Koen Vanhoof and Geert Wets
Department of Applied Economics, Limburg University Centre, B-3590 Diepenbeek, Belgium
{tom.brijs, koen.vanhoof, geert.wets}@luc.ac.be

Received 20 October 1999. Revised 2 December 1999. Accepted 12 December 1999.

Abstract. The discovery of characteristic rules is a well-known data mining task that has led to several successful applications. However, because of the descriptive nature of characteristic rules, a (very) large number of them is typically discovered during the mining stage. This makes monitoring and control of these rules extremely costly and difficult in practice. Therefore, selecting the most promising subset of rules is desirable. Some heuristic rule selection methods that deal with this issue have been proposed in the literature. In this paper, we propose an integer programming model to solve the problem of optimally selecting the most promising subset of characteristic rules. Moreover, the proposed technique makes it possible to enforce a user-defined level of overall model quality in combination with a maximum reduction of the redundancy present in the original ruleset. We use real-world data to empirically evaluate the benefits and performance of the proposed technique against the well-known RuleCover heuristic. Results demonstrate that the proposed integer programming techniques significantly reduce the number of retained rules and the level of redundancy in the final ruleset. Moreover, the results demonstrate that the overall quality, in terms of the discriminant power of the final ruleset, slightly increases when integer programming methods are used.

Keywords: Redundancy reduction, rule selection, characteristic rules, artificial intelligence
1. Introduction

Data mining is the automated search for hidden, previously unknown and potentially useful information in large databases. Moreover, data mining is a crucial phase in the KDD (Knowledge Discovery in Databases) process [7]. In fact, two important goals of KDD can be identified: prediction, i.e. the use of training data to construct a model to predict unknown values of future instances, and description, i.e. the search for interesting patterns and their (re)presentation in an easy, human-understandable format. In this paper, we are primarily interested in the latter objective, namely description, without, however, neglecting the former objective, i.e. predictive power. One of the most well-known data mining tasks for extracting descriptive information from data is the discovery of characteristic rules. Briefly, characteristic rules express characteristics or properties of a certain class of instances in attribute-value (propositional) rule format, such as if species = swan then type = bird and color = white. In general, for a characteristic rule Y → X, X summarizes one or more properties common to all (or many) instances of Y [14].¹ Among the advantages of characteristic rules are clearly their natural representation and the ease with which the discovered rules can be integrated with background knowledge. Several successful applications [3,5,24] have demonstrated their usefulness. However, some drawbacks of characteristic rules can also be identified. Firstly, because of their descriptive nature, a large number of rules is often discovered during the mining stage. Especially for real-world applications, this makes monitoring and control of these rules extremely costly and difficult. Secondly, characteristic rules often suffer from being incomplete, i.e. not all instances are covered by the set of discovered rules, or they may contain redundancy, i.e. the same database instance may be covered by multiple rules. Previous researchers have already highlighted this problem of redundancy. In their study on the interestingness of rules, Klemettinen, Mannila, Ronkainen, Toivonen and Verkamo [16] concluded: "A problem that remains is redundancy. Large amounts of rules could potentially be pruned, if there were appropriate ways to remove redundant or nearly redundant rules." Indeed, with characteristic rule discovery, instances may be covered by multiple rules, causing some rules to overlap, i.e. to describe the same database rows. In this paper, we specifically focus on this problem of redundancy and propose a post-processing method to reduce the redundancy present in a set of induced characteristic rules.
Moreover, we are able to influence the rule pruning process such that some overall measure of quality (such as the discriminant power of the reduced ruleset) can be controlled. The outline of the remainder of this paper is as follows. In Section 2, we introduce the discovery of characteristic rules and present a graphical illustration of the redundancy reduction problem. Section 3 provides an overview of previous work on the problem of interestingness in order to put the key issue of this work in a global perspective. Section 4 introduces a formal representation of the redundancy reduction problem and presents a novel solution for reducing the redundancy in a set of characteristic rules by using integer programming techniques. In Section 5, we discuss the results of the RuleCover heuristic [22] as a method for redundancy reduction and compare it with the two optimal models (Models 1 and 2) that we propose in this study. Finally, Section 6 summarizes our work and presents the limitations of this study.

2. Problem situation

2.1. Characteristic rules

Characteristic rules express characteristics or properties of one class of instances in a typical attribute-value, or propositional, rule format. For instance, to express that a swan is a bird and usually has a white color, the following characteristic rule may apply: if species = swan then type = bird and color = white. Although this rule satisfies the completeness property (i.e. it is true for all or almost all swans), it is not necessarily a good differentiator between different classes of instances in the database (i.e. parrots and ducks are also birds and can have a white color). Therefore, the above rule does not have a high discriminant power with respect to the target class, i.e. swans.
If Y represents the class value and X represents the (combination of) descriptive attribute value(s), then the following presents a formal representation of these notions [14]:

¹ Note that we adopt the same notation as proposed in [14]. For discriminant rules, they use the notation X → Y, whereas for characteristic rules they use the notation Y → X, where X is evidence and Y is a hypothesis.

Definition 1. Completeness (also confidence). The rule Y → X is s% complete if X satisfies/covers s% of the instances belonging to class value Y.

Definition 2. Discriminant power. The rule Y → X is c% discriminant if X satisfies/covers (100 − c)% of the negative instances.

These two measures are important to distinguish between discriminant/classification rules, which primarily serve the purpose of prediction, and characteristic/descriptive rules, which primarily serve the purpose of description. So far, most research on rule induction has concentrated on finding discriminant rules from examples, primarily in noise-free domains, i.e. finding rules that cover all positive instances (high completeness) without covering any of the negative instances (high discriminant power) (e.g. Michalski's AQ family of concept learning algorithms); some recent extensions have been proposed to cope with noisy data [10,16]. In contrast, until now, characteristic rule induction has received far less attention in the research community. However, in some situations, especially when dealing with highly skewed class frequency distributions, it may be worthwhile to consider partial classification techniques, such as characteristic rules, as an alternative [21]. For instance, when the concept/class to be described is largely underrepresented in the data (e.g. in customer satisfaction research, where dissatisfied customers typically represent a small minority compared to satisfied customers), it is known that traditional discriminant rule induction and classification techniques have difficulty discriminating well between the classes. In such situations, however, it may be worthwhile to describe the most prevalent characteristics of the instances of the target class (high completeness), with discriminant power of secondary importance.
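As a concrete illustration of Definitions 1 and 2, both measures can be computed directly from data. The sketch below is ours, not taken from the paper; the instance encoding and all names are illustrative.

```python
# Sketch of Definitions 1 and 2 for a single characteristic rule Y -> X.
# Instances are dicts of attribute values; the rule body X is a dict of
# required attribute-value pairs (illustrative representation, not the paper's).

def covers(rule_body, instance):
    """True when every attribute-value pair of X is present in the instance."""
    return all(instance.get(attr) == val for attr, val in rule_body.items())

def completeness(rule_body, positives):
    """Definition 1: percentage of class-Y instances covered by X."""
    return 100.0 * sum(covers(rule_body, i) for i in positives) / len(positives)

def discriminant_power(rule_body, negatives):
    """Definition 2: 100 minus the percentage of negative instances covered by X."""
    covered = 100.0 * sum(covers(rule_body, i) for i in negatives) / len(negatives)
    return 100.0 - covered

# Toy data: swans (positive class) versus other birds (negative class).
swans = [{"type": "bird", "color": "white"},
         {"type": "bird", "color": "white"},
         {"type": "bird", "color": "black"}]
others = [{"type": "bird", "color": "white"},
          {"type": "bird", "color": "green"}]

body = {"type": "bird", "color": "white"}     # the X of: if swan then bird & white
print(completeness(body, swans))              # ~66.7: 2 of 3 swans covered
print(discriminant_power(body, others))       # 50.0: 1 of 2 negatives covered
```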
In this setting, discovering characteristic rules is essentially a sequential process: firstly, finding a ruleset with high completeness and, secondly, removing rules that have low discriminant power. A number of methods have been developed to discover characteristic rules, of which the two most important approaches are the data cube approach [6,9] and the attribute-oriented induction approach [11,12]. The former is based on specialization and generalization operations carried out as drill-down and roll-up actions on multidimensional database cubes. The latter deploys attribute removal (when concept hierarchies on attributes do not exist) and attribute generalization (when concept hierarchies on attributes exist). In the current paper, we use the notion of frequent itemsets from association rules to generate all combinations of properties/characteristics of instances of a given class that have a minimum presence/completeness within that class.² A similar approach can be found in [3]. In fact, by using a minimum presence/support threshold, we can be sure that all discovered rules are minimally s% complete. The discovery of frequent itemsets has been studied extensively in the literature on association rules [1,2,18]. In this approach, concept hierarchies can also be used; however, the rule induction process is not based on the principle of attribute value generalization. Instead, the basic Apriori algorithm uses boolean attributes³ (representing whether the instance representing that concept possesses the property or not) and finds all combinations of properties of the concept that are shared by at least a minimum proportion of the instances representing that concept. The frequent-itemset approach is especially attractive when no concept hierarchy is available or when one is interested in finding characteristics of a concept expressed at the lowest attribute-value level.
² In essence, this involves finding association rules where the item in the consequent is fixed to a specific class value.
³ Numeric attributes are discretized and discrete-type attributes are mapped to boolean attributes for processing.
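The frequent-itemset route described above can be sketched as a plain levelwise (Apriori-style) search restricted to the instances of the target class. The code below is a minimal illustration, not the authors' implementation; all names and data are invented, and items are (attribute, value) pairs.

```python
from itertools import combinations

def frequent_itemsets(class_instances, min_sup):
    """Return {itemset: support} for itemsets of (attribute, value) pairs whose
    support within the target class is >= min_sup (a fraction, e.g. 0.6).
    Every such itemset is the body of a (100 * min_sup)%-complete rule."""
    n = len(class_instances)
    rows = [frozenset(inst.items()) for inst in class_instances]

    def support(itemset):
        return sum(itemset <= row for row in rows) / n

    # Level 1: frequent single properties.
    items = sorted({item for row in rows for item in row})
    current = [frozenset([it]) for it in items if support(frozenset([it])) >= min_sup]
    result = {c: support(c) for c in current}

    # Levels k > 1: join frequent (k-1)-sets and keep the frequent candidates.
    k = 2
    while current:
        candidates = {a | b for a, b in combinations(current, 2) if len(a | b) == k}
        current = [c for c in candidates if support(c) >= min_sup]
        result.update({c: support(c) for c in current})
        k += 1
    return result

swans = [{"type": "bird", "color": "white"},
         {"type": "bird", "color": "white"},
         {"type": "bird", "color": "black"}]
freq = frequent_itemsets(swans, 0.6)
# ('type', 'bird') holds for all swans; adding ('color', 'white') holds for 2 of 3.
```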

2.2. Redundancy: graphical problem illustration

Fig. 1. Redundancy in characteristic rules. (The figure shows four overlapping rule regions R1-R4 over class 1 and class 2 instances; rule R2 is drawn with a dashed line.)

Especially with the frequent-itemset approach, a (very) large number of characteristic rules is often discovered. Furthermore, because characteristic rules describe properties that are common to many or all instances of a class, different rules may describe different properties of the same instances. Consequently, mutual exclusivity in the discovered set of rules cannot be guaranteed, i.e. some instances in the database are covered by multiple rules. Although completeness and discriminant power (see Definitions 1 and 2) can be used to filter out less interesting rules, these measures do not guarantee mutual exclusivity in the induced set of rules. Therefore, other methods are needed to reduce the level of redundancy present in the set of characteristic rules. Graphically, redundancy can be presented as in Fig. 1, where it can be observed that rule 2 (dashed line) does not cover any instances beyond those already covered by the other rules (rules 1, 3 and 4) in the model. We suggest that rule 2 is redundant and can therefore be discarded. However, one must be careful in cutting rules from the ruleset, because:
- discarding rules can reduce the covered⁴ instance space, which may not be desirable;
- when discriminant power is of particular importance, the selected set of characteristic rules should describe as many positive instances and as few negative instances as possible with respect to the original ruleset.

3. Previous work

A number of methods have been proposed to deal with redundancy in a variety of ways. The following provides an overview of previous work in this field.
Gago and Bento [8] propose a distance metric between rules to select the most heterogeneous set of rules that together give a good coverage of the instance space. The method, however, has several drawbacks. First of all, it can only be applied if the underlying data follow a uniform distribution. Secondly, three weight parameters are specified in the distance function, but there is no concrete guidance on reasonable values for these parameters. Finally, outliers in the data can significantly affect the percentage of overlap of two rules.

⁴ An instance is said to be covered by a rule when the attribute-value combinations in the rule are also present in the instance.

Kryszkiewicz [17] introduces the interesting notion of representative association rules (RR), i.e. a least set of rules that covers all association rules. A user may then be provided with the set of RRs instead of the whole set of association rules. However, when needed, all usual association rules can be generated from the set of RRs by means of a cover operator. Hoschka and Klösgen [13] deal with the problem of redundancy in their Explora system, which uses partial orderings of attributes and attribute sets to avoid presenting several kinds of redundant knowledge. Bayardo [4] proposes a pruning strategy called redundancy exploitation. The idea is to prevent continued effort at classifying instances already classified by existing rules with high confidence. In yet another approach, the use of rule covers was proposed by Toivonen et al. [22] to reduce redundancy in a discovered set of association rules (see also Section 5). Whereas all of the above techniques are based on heuristic procedures, this paper studies the performance of an optimal rule selection technique based on integer programming methods. More specifically, the RuleCover algorithm will be used as a benchmark against the results of our integer programming models (see Section 4). One important advantage of our approach compared to heuristic approaches is that the selection of rules for the final ruleset is independent of any ordering of the rules. For instance, with the RuleCover heuristic, the stepwise selection of a subsequent rule depends on which rules have been chosen during the previous steps. Consequently, because of the adoption of heuristic selection criteria, some of the previously selected rules may not be optimal from an overall perspective.
In contrast, our integer programming approach always results in the optimal selection irrespective of the ordering of the rules, because it selects rules simultaneously instead of stepwise. Finally, redundancy has also been tackled from a totally different point of view: in the research community involved in the validation and verification of knowledge-based systems, redundancy has mainly been studied from a syntactical point of view [19,20,23].

4. Solution: integer programming

In this section, we propose a solution that takes into account either one (Model 1) or both (Model 2) of the measures of completeness and discriminant power that were deemed important for assessing rule quality. Before we introduce both models, we elaborate on the algebraic definition of the problem.

4.1. Algebraic problem definition

Consider the instance-rule matrix in Table 1. The matrix shows K rules and N instances. Depending on the number of classes, the instance space is subdivided into two or more groups. In the matrix of Table 1, two groups of instances can be identified: one group belongs to a first (positive) class and carries the index values 1 to I; the other group belongs to a second (negative) class and carries the index values J to N. The matrix shows whether a particular instance i is covered by a certain characteristic rule j or not. Note, however, that the matrix only contains instances that are covered at least once by the original ruleset. As a consequence, the number of rows in the matrix may be lower than the total number of instances, because with a characteristic ruleset the property of completeness can be violated (see Section 2). Formally, we define:

    s_ij = 1 if instance i is covered by rule j
    s_ij = 0 if instance i is not covered by rule j
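Building the instance-rule matrix s_ij from a ruleset and a dataset is straightforward. The sketch below is our own, with illustrative rules and instances; as in the matrix described above, rows covered by no rule are dropped.

```python
# Building the instance-rule matrix: s[i][j] = 1 when instance i is covered
# by rule j. Rules and instances here are illustrative, not the paper's data.

def coverage_matrix(instances, rule_bodies):
    def covers(body, inst):
        return all(inst.get(a) == v for a, v in body.items())
    s = [[int(covers(body, inst)) for body in rule_bodies] for inst in instances]
    # Keep only instances covered at least once by the original ruleset.
    return [row for row in s if any(row)]

instances = [{"a": 1, "b": 0}, {"a": 1, "b": 1}, {"a": 0, "b": 0}]
rules = [{"a": 1}, {"b": 1}]
print(coverage_matrix(instances, rules))   # [[1, 0], [1, 1]] - third row dropped
```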

Table 1. Instance-rule matrix

                  Rule 1   Rule 2   ...   Rule K
    Instance 1     s_11     s_12    ...    s_1K
    ...
    Instance I     s_I1     s_I2    ...    s_IK
    Instance J     s_J1     s_J2    ...    s_JK
    ...
    Instance N     s_N1     s_N2    ...    s_NK

Now consider the formulation of the following integer programming (IP) model.

4.2. Model specification

Model 1. Maximal redundancy reduction.

    Let: I = number of positive instances, K = number of characteristic rules
    Given: s_ij
    Boolean decision variables: x_j
    Target function:
        Min Z = Σ_{i=1..I} Σ_{j=1..K} s_ij x_j
    Subject to:
        ∀i (i = 1, ..., I): Σ_{j=1..K} x_j s_ij ≥ 1

The decision variable x_j is binary-valued and specifies whether characteristic rule j will be included in the final ruleset. The target function specifies that the model should look for rules with as little overlap of instances as possible in the group of positive instances. This means that the model searches for characteristic rules that are as far apart as possible in the positive instance space. The constraint ensures that the original positive instance space is not reduced, so that the final ruleset still covers all positive instances (in the instance-rule matrix) that were covered by the original ruleset. Without this constraint, the model would select no rules at all, since the objective function forces the model to select as few rules as possible. Although it provides an optimal solution to the redundancy reduction problem, the model presented above still suffers from a few imperfections. Firstly, the user may want to sacrifice completeness in return for obtaining fewer rules. Obviously, if the objective is to further reduce the size of the ruleset obtained from Model 1, this cannot be done without reducing the current level of completeness of that ruleset, because the solution is guaranteed to be optimal given the constraint that each positive instance in the instance-rule matrix must be covered.
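For small rulesets, Model 1 can even be solved by exhaustive search rather than a dedicated IP solver. The sketch below (ours, with toy data) enumerates rule subsets, keeps only those that still cover every positive row of the matrix, and returns the one with the smallest total coverage sum; it is exponential in K, so it is only a toy stand-in for the IP formulation.

```python
from itertools import combinations

def model1(s):
    """Brute-force Model 1. s[i][j] = 1 iff positive instance i is covered by
    rule j. Returns (objective value Z, tuple of selected rule indices)."""
    n_rules = len(s[0])
    best = None
    for size in range(1, n_rules + 1):
        for subset in combinations(range(n_rules), size):
            # Constraint: every positive instance must stay covered.
            if all(any(row[j] for j in subset) for row in s):
                z = sum(row[j] for row in s for j in subset)  # coverage sum
                if best is None or z < best[0]:
                    best = (z, subset)
    return best

s = [[1, 1, 0],
     [1, 0, 0],
     [0, 1, 1],
     [0, 0, 1]]
print(model1(s))   # (4, (0, 2)): rules 1 and 3 cover all rows with no overlap
```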
However, in some circumstances, for instance when dealing with noisy data, it may be sensible to sacrifice a certain level of completeness to enable further reduction of the ruleset [11]. Secondly, although completeness is probably the most important quality criterion for characteristic rules, the user may also want to take into account the discriminant power of the rules (see Definition 2). Indeed, when selecting rules for the final (redundancy-reduced) ruleset, it may be appropriate to select characteristic rules that, as a group, cover as few negative instances as possible. This is particularly important when the task is to discover characteristics of a class of instances that is largely underrepresented in the data while, at the same time, those characteristics should differentiate the target class from the other classes in the data. To accomplish this, we introduce explicit, user-defined bounds on the coverage of positive and negative instances by the final ruleset. More specifically:

    α = proportion of positive instances that are covered by the final ruleset
    β = proportion of negative instances that are covered by the final ruleset

When all positive instances in the instance-rule matrix (see Table 1) are covered by the final ruleset, α = 100%. However, to account for a certain level of noise in the data, we specify that a certain proportion (100 − α) of the positive instances is allowed to remain uncovered, i.e. we sacrifice completeness. In addition, to control the discriminant power (i.e. the proportion of negative instances covered by the final ruleset), we specify that no more than β percent of the negative instances in the instance-rule matrix may be covered. Integrating these improvements into Model 1 results in the slightly different model presented below.

Model 2. Incorporating α and β into Model 1.

    Let: I = number of positive instances, N − I = number of negative instances, K = number of rules
    Given: s_ij, W_1, W_2, α, β
    Boolean decision variables: x_j (and boolean slack variables p_i, q_i)
    Target function:
        Min Z = W_1 Σ_{i=1..I} Σ_{j=1..K} s_ij x_j + W_2 Σ_{i=J..N} Σ_{j=1..K} s_ij x_j
    Subject to:
        ∀i (i = 1, ..., I): Σ_{j=1..K} x_j s_ij + p_i ≥ 1
        ∀i (i = J, ..., N): Σ_{j=1..K} x_j s_ij − M q_i ≤ 0
        Σ_{i=1..I} p_i ≤ ((100 − α)/100) I
        Σ_{i=J..N} q_i ≤ (β/100) (N − I)

In Model 2, in contrast to Model 1, the target function consists of two parts. The first part represents the coverage of positive instances, whereas the second part represents the coverage of negative instances. Consequently, the first part forces the search algorithm to select as few rules as possible, and the second part of the target function is conceived to minimize the coverage of negative-class instances in order to increase the discriminant power of the final ruleset.
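Model 2 can be sketched in the same brute-force style. In the sketch below (ours, with toy numbers), the α and β bounds are checked directly on each candidate subset instead of through the slack variables p_i, q_i and the big-M constraint that the IP formulation needs.

```python
from itertools import combinations

def model2(s_pos, s_neg, alpha, beta, w1, w2):
    """Brute-force Model 2: weighted coverage sum, with a lower bound alpha (%)
    on positive coverage and an upper bound beta (%) on negative coverage."""
    n_rules = len(s_pos[0])
    best = None
    for size in range(n_rules + 1):
        for subset in combinations(range(n_rules), size):
            covered_pos = sum(any(r[j] for j in subset) for r in s_pos)
            covered_neg = sum(any(r[j] for j in subset) for r in s_neg)
            if covered_pos < alpha / 100 * len(s_pos):   # completeness bound
                continue
            if covered_neg > beta / 100 * len(s_neg):    # discriminant bound
                continue
            z = (w1 * sum(r[j] for r in s_pos for j in subset)
                 + w2 * sum(r[j] for r in s_neg for j in subset))
            if best is None or z < best[0]:
                best = (z, subset)
    return best

s_pos = [[1, 1], [1, 0], [0, 1]]
s_neg = [[0, 1], [0, 0]]
w1 = len(s_neg) / len(s_pos)        # W_1 = (N - I) / I, as suggested in the text
print(model2(s_pos, s_neg, alpha=100, beta=50, w1=w1, w2=1.0))
```

If α is set too high and β too low, no subset passes both checks and the function returns None, mirroring the infeasibility discussed in the text.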
Without any constraints, this model would result in an empty ruleset. Therefore, constraints are again added to the model to enforce certain quality criteria. More specifically, to control the completeness of the final ruleset, the first constraint specifies that at least a certain proportion of positive instances must be covered. This is enforced by the introduction of a boolean slack variable p_i, which determines the number of positive instances that may be left uncovered by the final ruleset. p_i is itself subject to a constraint specifying that no more than (100 − α) percent of the positive instances (in the instance-rule matrix) may be left uncovered by the final ruleset. In addition, the second constraint controls the discriminant power of the final ruleset. This is enforced by the introduction of a boolean slack variable q_i, which determines the number of negative instances that may be covered by the final ruleset. q_i is itself subject to a constraint specifying that no more than β percent of the negative instances may be covered by the final ruleset. A sufficiently large number M is introduced in the second constraint to ensure that the boolean slack variable q_i can compensate for the fact that an instance may be covered by multiple rules. The W_1 and W_2 parameters are continuous weight values that correct for possible bias in the target function as a result of a different number of instances in each class. When there are more negative than positive instances, W_1 > W_2, i.e. W_1 = (N − I)/I. Moreover, these weight parameters offer the user an additional degree of freedom to specify the importance of each part of the target function. The model, however, is not guaranteed to reach a solution; this depends on the chosen values of the parameters α and β. For example, if α is too high and β too low, reaching a solution may be impossible: it will then be difficult for the model to find a good set of rules that has a low degree of redundancy while covering at least α percent of the positive instances and less than β percent of the negative instances. When discussing the empirical results (Section 5), we elaborate on this and propose guidelines for appropriate settings of the α and β parameters.

5. Empirical evaluation

To assess the performance of the proposed method, we use the results of previous research [5]. In short, in the latter study, data from a customer satisfaction survey carried out by a leading Belgian bank were used to identify characteristic rules for dissatisfaction. With these rules, 733 latently dissatisfied customers were identified, i.e. customers who report overall satisfaction but who possess characteristics that are indicators of dissatisfaction. The dataset contained 7264 instances, of which 445 (6.1%) reported dissatisfaction and the rest (6819, or 93.9%) reported satisfaction. Characteristic rules were used since we were interested in the characteristics of dissatisfied customers, and this group is largely underrepresented in the data. It turned out that 29 characteristic rules for dissatisfaction were found to be interesting,⁵ covering 328 (i.e. completeness = 74%) of the total group of dissatisfied instances.
However, closer observation of the discovered set of rules revealed considerable redundancy (i.e. the same instance being covered by multiple rules). Therefore, as a post-processing step, the integer programming methods presented in Section 4.2 will be used to reduce the redundancy and select a smaller set of rules. We compare the results of our integer programming method against those obtained from the heuristic rule cover method (see the formal description below) proposed by Toivonen et al. [22].

Algorithm RuleCover.
    Input: Set of rules Γ = {X_i → Y | i = 1, ..., n}. Sets of matched rows m(X_i → Y) for all i ∈ {1, ..., n}.
    Output: Rule cover Δ.
    Method:
        Δ := ∅;                              // rule cover
        s := ∪_{i=1..n} m(X_i → Y);          // rows unmatched by cover
        for all i ∈ {1, ..., n} do
            s_i := m(X_i → Y);               // rows of s matched by rule i
        end;
        while s ≠ ∅ do
            choose i ∈ {1, ..., n} so that (X_i → Y) ∈ Γ and |s_i| is largest;
            Δ := Δ ∪ {X_i → Y};              // add the rule to the cover
            Γ := Γ \ {X_i → Y};              // remove the rule from the original set
            for all (X_j → Y) ∈ Γ do
                s_j := s_j \ m(X_i → Y);     // remove matched rows
            end;
            s := s \ m(X_i → Y);             // remove matched rows
        end;

⁵ Defined as the difference between the percentage coverage of positive instances and the percentage coverage within the total group of instances (i.e. positive and negative).

In short, the RuleCover algorithm works as follows: a greedy algorithm starts from the original set Γ (containing the entire set of characteristic rules) and iteratively selects a rule X_i → Y to move into the cover Δ. In each pass, the rule is selected that covers the maximum number of instances left over after deleting the instances covered by the rules selected during the previous passes. This process continues until no instances or rules are left. At the end, Δ contains a rule cover of Γ. In Sections 5.1 and 5.2, the results of the empirical research are highlighted.

5.1. Maximal redundancy reduction (Model 1)

In the first analysis, we compare the RuleCover heuristic with the first proposed integer programming model for selecting a ruleset with minimal redundancy. In fact, this means that for IP Model 1 of Section 4.2, the redundancy in the ruleset covering the positive instances has to be minimized, regardless of the performance of the ruleset in the negative class, i.e. without taking the discriminant power of the final ruleset into account. The empirical results show that the IP model succeeds in selecting fewer rules than the RuleCover algorithm and that the level of redundancy is lower when the IP model is applied. Table 2 illustrates these results. RuleCover returns 15 rules, whereas the integer programming algorithm returns only 13 rules that are able to cover all positive instances covered by the original ruleset (α = 100%).

Table 2. Results: RuleCover versus Model 1

    Method      Size original   Size final   Average      Coverage positive   Complete-   Coverage negative
                ruleset         ruleset      redundancy   class (α)           ness        class (β)
    RuleCover   29              15           5.02         100%                74%         Irrelevant
    Model 1     29              13           –            100%                74%         Irrelevant
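The RuleCover pseudocode above translates almost line for line into Python; the minimal rendering below is ours and assumes each rule is represented only by its set m(X_i → Y) of matched row ids.

```python
# Direct rendering of the greedy RuleCover pseudocode. matched maps each
# rule id to the set of row ids it covers; the row/rule ids are illustrative.

def rule_cover(matched):
    """Return the greedy cover as a list of rule ids, in selection order."""
    uncovered = set().union(*matched.values())           # rows unmatched by cover
    remaining = {i: set(rows) for i, rows in matched.items()}
    cover = []
    while uncovered:
        # Choose the rule matching the most still-uncovered rows.
        i = max(remaining, key=lambda r: len(remaining[r]))
        cover.append(i)                                  # add the rule to the cover
        newly = matched[i] & uncovered
        uncovered -= newly                               # remove matched rows
        del remaining[i]                                 # remove rule from original set
        for r in remaining:
            remaining[r] -= newly                        # remove matched rows
    return cover

m = {1: {1, 2, 3}, 2: {2, 3}, 3: {3, 4}}
print(rule_cover(m))   # [1, 3]: rule 2 adds nothing once rule 1 is chosen
```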
This means that within the total dataset, the completeness of the ruleset remains 74%, i.e. equal to the completeness of the original ruleset of 29 rules. Furthermore, the level of redundancy differs significantly between the two methods. The average number of times each positive instance is covered by the final ruleset amounts to 5.02 for the RuleCover algorithm, and is considerably lower for the IP model. This again illustrates that RuleCover is a heuristic and therefore cannot guarantee an optimal solution to the redundancy reduction problem. In fact, 11 of the 15 rules selected by the RuleCover algorithm were also selected by our method. However, it must be clear that no attention is paid here to the discriminant power of the resulting ruleset. It is actually possible for RuleCover to return more rules, i.e. to produce more redundancy, while its discriminant power in terms of the coverage of negative instances is better (i.e. it covers fewer negative instances) than that obtained by the integer programming model.

5.2. Incorporating α and β (Model 2)

Firstly, the number of negative instances covered by the final ruleset is, overall, an important indicator of the discriminant power (see Definition 2 in Section 2) of the final ruleset, and therefore it should play an important role in the selection of rules for the final ruleset. Secondly, the user may be willing to sacrifice completeness in return for obtaining fewer rules. Especially when dealing with real-world data, certain levels of noise may be expected, and specifying α = 100% may therefore be too restrictive. To carry out the analysis, the following idea was adopted. Observations indicated that of the 15 patterns selected by RuleCover (see Table 2), the first four cover 81% of the positive instances (in the instance-rule matrix), which corresponds to 60% completeness within the total group of positive instances in the dataset, which is very reasonable. This 4-member ruleset, however, also covers 458 (62.4%) of the negative instances in the instance-rule matrix (see Table 1), producing a relatively high discriminant power of 93% within the total dataset. Consequently, Model 2 of Section 4.2 can be used to select the minimum set of rules that achieves at least the same coverage of positive instances while covering less than 62.4% of the negative instances of the instance-rule matrix. As such, we force the IP model to perform at least as well as the RuleCover heuristic. More specifically, by setting α equal to 81% and β equal to 62.4%, the final ruleset selected by the integer programming model covers 437 (59%) of the negative instances in the instance-rule matrix, resulting in an overall discriminant power of 94%, which is slightly better than the RuleCover heuristic. Table 3 summarizes these results.

Table 3. Results: RuleCover versus Model 2

    Method      Size original   Size final   Average      Coverage positive   Complete-   Coverage negative   Discriminant
                ruleset         ruleset      redundancy   class (α)           ness        class (β)           power
    RuleCover   29              4            1.54         81%                 60%         62.4%               93%
    Model 2     29              –            1.32         81%                 60%         59%                 94%
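The redundancy figure used in this evaluation, i.e. the average number of times each covered instance is matched by the selected ruleset, can be computed directly from the instance-rule matrix. The sketch below is ours and uses toy numbers, not the paper's data.

```python
# Average redundancy of a selected ruleset: mean cover count over the
# instances that the selection covers at least once.

def average_redundancy(s, selected):
    """s[i][j] = 1 iff instance i is covered by rule j; selected = rule indices."""
    counts = [sum(row[j] for j in selected) for row in s]
    covered = [c for c in counts if c > 0]
    return sum(covered) / len(covered)

s = [[1, 1, 0],
     [1, 0, 1],
     [0, 0, 1]]
print(average_redundancy(s, [0, 1, 2]))   # 5 covers over 3 instances ~ 1.67
print(average_redundancy(s, [0, 2]))      # 4 covers over 3 instances ~ 1.33
```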
Notice that although the percentage improvement in discriminant power is relatively low (i.e. 1%), the absolute difference in the coverage of negative instances gives a clearer picture: from 458 with RuleCover to 437 with Model 2, a difference of 21 instances. Since the primary objective of the algorithm is to reduce redundancy, and not to maximally discriminate between the two target groups, the amount of reduction in redundancy is also an important indicator. In analogy with Section 5.1, the degree of redundancy can be expressed as the average number of times each instance is covered by the final ruleset. For the ruleset obtained from the RuleCover heuristic, this figure amounts to 1.54, whereas for the optimal IP model it amounts to only 1.32, again a significant improvement. Furthermore, the above illustrations indicate that the parameter values for α and β obtained from examining the results of RuleCover are good lower bounds (for α) and upper bounds (for β) for the parameter values to be used in the optimization model.

6. Conclusion and future work

In this paper, we introduced two integer programming models to tackle the problem of redundancy in a set of characteristic rules. The first model searches for an optimal selection of rules that maximally reduces redundancy under the constraint of covering all (positive) instances that are covered by the original ruleset. In the second model, the first model was adapted (by incorporating two parameters α and β) to allow flexible adjustment of the completeness and discriminant power of the final ruleset. Both models were empirically tested on real-world data and compared with the well-known RuleCover heuristic. It was found that the IP models are able to produce significantly better results than

the RuleCover heuristic: firstly, in terms of the number of characteristic rules retained for the final ruleset; secondly, in terms of the discriminant power of the final ruleset; and finally, in terms of the total redundancy that remains in the final ruleset. However, the reader should also consider the limitations of this work. Currently, the comparison of our models against the RuleCover heuristic has been carried out on one dataset only. Therefore, to gain better insight into the improvements offered by the proposed models, analyses on additional datasets should be carried out. Secondly, the proposed methods may become computationally expensive when very large redundancy reduction problems are considered; in those circumstances, the use of heuristic techniques may be preferable. In contrast, for reasonably sized problems, the proposed integer programming methods are worth considering as an alternative to the traditional heuristic techniques.

Acknowledgment

Tom Brijs is a research fellow of the Fund for Scientific Research, Flanders (FWO-Vlaanderen).


More information

Data Mining. Part 2. Data Understanding and Preparation. 2.4 Data Transformation. Spring Instructor: Dr. Masoud Yaghini. Data Transformation

Data Mining. Part 2. Data Understanding and Preparation. 2.4 Data Transformation. Spring Instructor: Dr. Masoud Yaghini. Data Transformation Data Mining Part 2. Data Understanding and Preparation 2.4 Spring 2010 Instructor: Dr. Masoud Yaghini Outline Introduction Normalization Attribute Construction Aggregation Attribute Subset Selection Discretization

More information

EFFICIENT ATTRIBUTE REDUCTION ALGORITHM

EFFICIENT ATTRIBUTE REDUCTION ALGORITHM EFFICIENT ATTRIBUTE REDUCTION ALGORITHM Zhongzhi Shi, Shaohui Liu, Zheng Zheng Institute Of Computing Technology,Chinese Academy of Sciences, Beijing, China Abstract: Key words: Efficiency of algorithms

More information

An Approach for Accessing Linked Open Data for Data Mining Purposes

An Approach for Accessing Linked Open Data for Data Mining Purposes An Approach for Accessing Linked Open Data for Data Mining Purposes Andreas Nolle, German Nemirovski Albstadt-Sigmaringen University nolle, nemirovskij@hs-albsig.de Abstract In the recent time the amount

More information

Association Rule Learning

Association Rule Learning Association Rule Learning 16s1: COMP9417 Machine Learning and Data Mining School of Computer Science and Engineering, University of New South Wales March 15, 2016 COMP9417 ML & DM (CSE, UNSW) Association

More information

Open Access Research on the Data Pre-Processing in the Network Abnormal Intrusion Detection

Open Access Research on the Data Pre-Processing in the Network Abnormal Intrusion Detection Send Orders for Reprints to reprints@benthamscience.ae 1228 The Open Automation and Control Systems Journal, 2014, 6, 1228-1232 Open Access Research on the Data Pre-Processing in the Network Abnormal Intrusion

More information

An Empirical Study of Lazy Multilabel Classification Algorithms

An Empirical Study of Lazy Multilabel Classification Algorithms An Empirical Study of Lazy Multilabel Classification Algorithms E. Spyromitros and G. Tsoumakas and I. Vlahavas Department of Informatics, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece

More information

An Efficient Reduced Pattern Count Tree Method for Discovering Most Accurate Set of Frequent itemsets

An Efficient Reduced Pattern Count Tree Method for Discovering Most Accurate Set of Frequent itemsets IJCSNS International Journal of Computer Science and Network Security, VOL.8 No.8, August 2008 121 An Efficient Reduced Pattern Count Tree Method for Discovering Most Accurate Set of Frequent itemsets

More information

Mining of Web Server Logs using Extended Apriori Algorithm

Mining of Web Server Logs using Extended Apriori Algorithm International Association of Scientific Innovation and Research (IASIR) (An Association Unifying the Sciences, Engineering, and Applied Research) International Journal of Emerging Technologies in Computational

More information

An Evolutionary Algorithm for Mining Association Rules Using Boolean Approach

An Evolutionary Algorithm for Mining Association Rules Using Boolean Approach An Evolutionary Algorithm for Mining Association Rules Using Boolean Approach ABSTRACT G.Ravi Kumar 1 Dr.G.A. Ramachandra 2 G.Sunitha 3 1. Research Scholar, Department of Computer Science &Technology,

More information

An Efficient Algorithm for Finding the Support Count of Frequent 1-Itemsets in Frequent Pattern Mining

An Efficient Algorithm for Finding the Support Count of Frequent 1-Itemsets in Frequent Pattern Mining An Efficient Algorithm for Finding the Support Count of Frequent 1-Itemsets in Frequent Pattern Mining P.Subhashini 1, Dr.G.Gunasekaran 2 Research Scholar, Dept. of Information Technology, St.Peter s University,

More information

Database and Knowledge-Base Systems: Data Mining. Martin Ester

Database and Knowledge-Base Systems: Data Mining. Martin Ester Database and Knowledge-Base Systems: Data Mining Martin Ester Simon Fraser University School of Computing Science Graduate Course Spring 2006 CMPT 843, SFU, Martin Ester, 1-06 1 Introduction [Fayyad, Piatetsky-Shapiro

More information

UAPRIORI: AN ALGORITHM FOR FINDING SEQUENTIAL PATTERNS IN PROBABILISTIC DATA

UAPRIORI: AN ALGORITHM FOR FINDING SEQUENTIAL PATTERNS IN PROBABILISTIC DATA UAPRIORI: AN ALGORITHM FOR FINDING SEQUENTIAL PATTERNS IN PROBABILISTIC DATA METANAT HOOSHSADAT, SAMANEH BAYAT, PARISA NAEIMI, MAHDIEH S. MIRIAN, OSMAR R. ZAÏANE Computing Science Department, University

More information

Course on Data Mining ( )

Course on Data Mining ( ) Course on Data Mining (581550-4) Intro/Ass. Rules 24./26.10. Episodes 30.10. 7.11. Home Exam Clustering 14.11. KDD Process 21.11. Text Mining 28.11. Appl./Summary 21.11.2001 Data mining: KDD Process 1

More information

Probabilistic Abstraction Lattices: A Computationally Efficient Model for Conditional Probability Estimation

Probabilistic Abstraction Lattices: A Computationally Efficient Model for Conditional Probability Estimation Probabilistic Abstraction Lattices: A Computationally Efficient Model for Conditional Probability Estimation Daniel Lowd January 14, 2004 1 Introduction Probabilistic models have shown increasing popularity

More information

Bitmap index-based decision trees

Bitmap index-based decision trees Bitmap index-based decision trees Cécile Favre and Fadila Bentayeb ERIC - Université Lumière Lyon 2, Bâtiment L, 5 avenue Pierre Mendès-France 69676 BRON Cedex FRANCE {cfavre, bentayeb}@eric.univ-lyon2.fr

More information

Optimization using Ant Colony Algorithm

Optimization using Ant Colony Algorithm Optimization using Ant Colony Algorithm Er. Priya Batta 1, Er. Geetika Sharmai 2, Er. Deepshikha 3 1Faculty, Department of Computer Science, Chandigarh University,Gharaun,Mohali,Punjab 2Faculty, Department

More information

9. Conclusions. 9.1 Definition KDD

9. Conclusions. 9.1 Definition KDD 9. Conclusions Contents of this Chapter 9.1 Course review 9.2 State-of-the-art in KDD 9.3 KDD challenges SFU, CMPT 740, 03-3, Martin Ester 419 9.1 Definition KDD [Fayyad, Piatetsky-Shapiro & Smyth 96]

More information

Decision Management in the Insurance Industry: Standards and Tools

Decision Management in the Insurance Industry: Standards and Tools Decision Management in the Insurance Industry: Standards and Tools Kimon Batoulis 1, Alexey Nesterenko 2, Günther Repitsch 2, and Mathias Weske 1 1 Hasso Plattner Institute, University of Potsdam, Potsdam,

More information

Dynamic Data in terms of Data Mining Streams

Dynamic Data in terms of Data Mining Streams International Journal of Computer Science and Software Engineering Volume 1, Number 1 (2015), pp. 25-31 International Research Publication House http://www.irphouse.com Dynamic Data in terms of Data Mining

More information

Mining for Mutually Exclusive Items in. Transaction Databases

Mining for Mutually Exclusive Items in. Transaction Databases Mining for Mutually Exclusive Items in Transaction Databases George Tzanis and Christos Berberidis Department of Informatics, Aristotle University of Thessaloniki Thessaloniki 54124, Greece {gtzanis, berber,

More information

A Survey on Moving Towards Frequent Pattern Growth for Infrequent Weighted Itemset Mining

A Survey on Moving Towards Frequent Pattern Growth for Infrequent Weighted Itemset Mining A Survey on Moving Towards Frequent Pattern Growth for Infrequent Weighted Itemset Mining Miss. Rituja M. Zagade Computer Engineering Department,JSPM,NTC RSSOER,Savitribai Phule Pune University Pune,India

More information

Rough Set Approaches to Rule Induction from Incomplete Data

Rough Set Approaches to Rule Induction from Incomplete Data Proceedings of the IPMU'2004, the 10th International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems, Perugia, Italy, July 4 9, 2004, vol. 2, 923 930 Rough

More information

Applying Data Mining Techniques to Wafer Manufacturing

Applying Data Mining Techniques to Wafer Manufacturing Applying Data Mining Techniques to Wafer Manufacturing Elisa Bertino 1, Barbara Catania 2, and Eleonora Caglio 3 1 Università degli Studi di Milano Via Comelico 39/41 20135 Milano, Italy bertino@dsi.unimi.it

More information

Ontology Based Data Analysing Approach for Actionable Knowledge Discovery

Ontology Based Data Analysing Approach for Actionable Knowledge Discovery IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 78-661,p-ISSN: 78-877, Volume 16, Issue 6, Ver. IV (Nov Dec. 14), PP 39-45 Ontology Based Data Analysing Approach for Actionable Knowledge Discovery

More information

Infrequent Weighted Itemset Mining Using SVM Classifier in Transaction Dataset

Infrequent Weighted Itemset Mining Using SVM Classifier in Transaction Dataset Infrequent Weighted Itemset Mining Using SVM Classifier in Transaction Dataset M.Hamsathvani 1, D.Rajeswari 2 M.E, R.Kalaiselvi 3 1 PG Scholar(M.E), Angel College of Engineering and Technology, Tiruppur,

More information

Data Mining Technology Based on Bayesian Network Structure Applied in Learning

Data Mining Technology Based on Bayesian Network Structure Applied in Learning , pp.67-71 http://dx.doi.org/10.14257/astl.2016.137.12 Data Mining Technology Based on Bayesian Network Structure Applied in Learning Chunhua Wang, Dong Han College of Information Engineering, Huanghuai

More information

Building a Concept Hierarchy from a Distance Matrix

Building a Concept Hierarchy from a Distance Matrix Building a Concept Hierarchy from a Distance Matrix Huang-Cheng Kuo 1 and Jen-Peng Huang 2 1 Department of Computer Science and Information Engineering National Chiayi University, Taiwan 600 hckuo@mail.ncyu.edu.tw

More information

Discovering the Association Rules in OLAP Data Cube with Daily Downloads of Folklore Materials *

Discovering the Association Rules in OLAP Data Cube with Daily Downloads of Folklore Materials * Discovering the Association Rules in OLAP Data Cube with Daily Downloads of Folklore Materials * Galina Bogdanova, Tsvetanka Georgieva Abstract: Association rules mining is one kind of data mining techniques

More information