Pruning Techniques in Associative Classification: Survey and Comparison


Survey Research

Fadi Thabtah
Management Information Systems Department
Philadelphia University, Amman, Jordan

Journal of Digital Information Management

ABSTRACT: Association rule discovery and classification are common data mining tasks. Integrating association rule discovery and classification, also known as associative classification, is a promising approach that derives classifiers highly competitive, with regard to accuracy, with those of traditional classification approaches such as rule induction and decision trees. However, the size of the classifiers generated by associative classification is often large, and therefore pruning becomes an essential task. In this paper, we survey the different rule pruning methods used by current associative classification techniques. Further, we compare the effect of three pruning methods (database coverage, pessimistic error estimation, lazy pruning) on the accuracy rate and the number of rules derived from different classification data sets. Results obtained from experiments on data sets from the UCI data collection indicate that lazy pruning algorithms may produce slightly more predictive classifiers than those which utilise the database coverage and pessimistic error pruning methods. However, the potential use of such classifiers is limited because they are difficult for the end-user to understand and maintain.

Categories and Subject Descriptors: H.2 [Database Management]; H.2.8 [Database Applications]: Data mining
General Terms: Data mining, Pruning methods
Keywords: Associative classification, Association rule, Classification, Data mining, Rule pruning

Received 12 March 2006; Revised 12 July 2006; Accepted 23 July 2006

1. Introduction

Rapid advances in computing, especially in data collection and storage technology, have enabled organisations to collect massive amounts of data. However, finding and deriving useful information has proven to be a hard task, since the source data is often large. Data mining can be defined as a technology that utilises different intelligent algorithms for processing large data sets in order to extract useful knowledge (Tan et al., 2005). This knowledge can be output in different forms, such as simple if-then rules, probabilities, etc., and is used for forecasting, data analysis and several other tasks.

Association rule discovery is an important data mining task that finds correlations among items in a transactional database. The classic application of association rules is market basket analysis (Agrawal and Srikant, 1994), in which business experts aim to investigate the shopping behaviour of customers in an attempt to discover regularities. In finding association rules, one tries to find groups of items that are frequently sold together, in order to infer the presence of certain items from the presence of other items in the customer's shopping cart. An example of an association rule is: 55% of customers who buy crisps are likely to buy a soft drink as well; 4% of all database transactions contain crisps and a soft drink. Here, "customers who buy crisps" is the rule antecedent, and "buy a soft drink as well" is the rule consequent; the antecedent and the consequent of an association rule each contain at least one item. The 55% represents the strength of the rule and is known as the rule's confidence, whereas the 4% is a statistical significance measure known as the rule's support.
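
Support and confidence follow directly from counting transactions. The minimal sketch below, using a made-up five-transaction database (not from the paper), shows how figures like the 55% and 4% above would be computed.

transactions = [
    {"crisps", "soft drink", "bread"},
    {"crisps", "soft drink"},
    {"crisps", "milk"},
    {"bread", "milk"},
    {"crisps", "soft drink", "milk"},
]

def support(itemset, db):
    # Fraction of transactions containing every item of `itemset`.
    return sum(1 for t in db if itemset <= t) / len(db)

def confidence(antecedent, consequent, db):
    # Of the transactions matching the antecedent, the fraction that
    # also contain the consequent.
    both = sum(1 for t in db if (antecedent | consequent) <= t)
    ante = sum(1 for t in db if antecedent <= t)
    return both / ante

print(support({"crisps", "soft drink"}, transactions))        # 0.6
print(confidence({"crisps"}, {"soft drink"}, transactions))   # 0.75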

Classification, also known as categorisation, is another well-known task in data mining. Unlike association rule discovery, which finds relationships among items in a transactional database, the ultimate goal of classification is to construct a set of rules (a classifier) from a labelled training data set, in order to classify new data objects, known as test data objects, as accurately as possible. In other words, classification is a supervised learning task in which the class labels are known in advance, whereas association rule discovery has no class to predict and can therefore be categorised as an unsupervised learning task. (Freitas, 2000) gives a more comprehensive discussion of the main differences between classification and association rule discovery.

There are many classic classification approaches for extracting knowledge from data, such as divide-and-conquer (Quinlan, 1987; Quinlan, 1993), separate-and-conquer (Furnkranz, 1999), also known as rule induction, and statistical approaches (Duda and Hart, 1973; Meretakis and Wüthrich, 1999). The divide-and-conquer approach selects the root based on the most informative attribute in the training data set, making the selection with statistical measures such as information gain (Quinlan, 1979), and then makes a branch for each possible value of that attribute. This splits the training instances into subsets, one for each possible value of the selected attribute. The same process is repeated until all instances that fall in one branch have the same classification, or the remaining instances cannot be split any further. The separate-and-conquer approach, on the other hand, builds up rules in a greedy manner: after a rule is found, all training instances covered by that rule are removed, and the process is repeated until the best rule found has a large error rate (a sketch follows below). Finally, statistical approaches such as Naïve Bayes (Duda and Hart, 1973) compute the probabilities of classes in the training data set, using the frequency of attribute values associated with these classes, in order to classify test instances. Numerous algorithms have been developed based on these approaches, such as decision trees (Quinlan, 1993), PART (Frank and Witten, 1998) and RIPPER (Cohen, 1995).
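
A bare-bones illustration of the separate-and-conquer loop just described; find_best_rule stands in for any greedy rule search, and the matches/klass interfaces on rules and instances are assumptions of this sketch, not part of any algorithm in this paper.

def separate_and_conquer(data, find_best_rule, max_error=0.5):
    # Greedily learn one rule, remove the instances it covers, and
    # repeat until the best rule found is too inaccurate or no data remains.
    rules = []
    data = list(data)
    while data:
        rule = find_best_rule(data)
        covered = [x for x in data if rule.matches(x)]
        if not covered:
            break
        error = sum(1 for x in covered if x.klass != rule.klass) / len(covered)
        if error > max_error:   # stopping condition: large error rate
            break
        rules.append(rule)
        data = [x for x in data if not rule.matches(x)]
    return rules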

In recent years, a new classification approach called associative classification (Liu et al., 1998; Li et al., 2001; Thabtah et al., 2004), which utilises association rule discovery methods to find the rules, has been developed. Several associative algorithms have been proposed, including CBA (Liu et al., 1998), CMAR (Li et al., 2001), L3 (Baralis and Torino, 2002), CPAR (Yin and Han, 2003) and MCAR (Thabtah et al., 2005). Empirical results (Liu et al., 1998; Li et al., 2001; Yin and Han, 2003; Thabtah et al., 2004; Thabtah et al., 2005) have shown that these algorithms usually build more accurate classifiers than decision tree and rule induction approaches. Association rule discovery techniques generate a massive number of rules, especially when a low support threshold is used (Liu et al., 1999; Zaiane and Antonie, 2005; Adamo, 2006). Since associative classification algorithms utilise association rule methods in the training phase, the size of their classifiers is large as well.

Several pruning methods have been used effectively to reduce the size of the classifiers in associative classification, such as pessimistic error estimation (Quinlan, 1987), chi-square testing (χ²) (Snedecor and Cochran, 1989), database coverage (Liu et al., 1998) and lazy pruning (Baralis and Torino, 2002). The aim of this paper is to survey these pruning methods and to measure their impact on the number of rules derived. In particular, we compare three pruning methods (database coverage, pessimistic error, lazy pruning) with reference to the size of the classifiers derived from several different classification benchmarks. Furthermore, we investigate the predictive accuracy obtained on different data sets by three popular associative algorithms (L3, CBA, MCAR), which employ different pruning heuristics, in order to measure the impact of pruning on accuracy.

The rest of the paper is organised as follows. The basic concepts of associative classification are presented in Section 2. The different methods used to prune redundant and harmful rules are surveyed in Section 3. In Section 4, we show the impact of pruning on classifiers derived by associative techniques. Section 5 is devoted to data and experimental results, and finally conclusions are given in Section 6.

2. Associative Classification

In associative classification, a training data set T has m distinct attributes A1, A2, ..., Am, and C is a list of class labels. The number of rows in T is denoted |T|. Attributes can be categorical (taking a value from a finite set of possible values) or continuous (real or integer). In the case of categorical attributes, all possible values are mapped to a set of positive integers. For continuous attributes, a discretisation method is first used to transform these attributes into categorical ones.

Definition 1: An item can be described as an attribute name Ai and its value ai, denoted (Ai, ai).
Definition 2: The jth row, or training object, in T can be described as a list of items (Aj1, aj1), ..., (Ajk, ajk), plus a class denoted by cj.
Definition 3: An itemset can be described as a set of disjoint attribute values contained in a training object, denoted <(Ai1, ai1), ..., (Aik, aik)>.
Definition 4: A ruleitem r is of the form <cond, c>, where the condition cond is an itemset and c ∈ C is a class.
Definition 5: The actual occurrence (actoccr) of a ruleitem r in T is the number of rows in T that match r's itemset.
Definition 6: The support count (suppcount) of a ruleitem r = <cond, c> is the number of rows in T that match r's itemset and belong to class c.
Definition 7: The occurrence (occitm) of an itemset I in T is the number of rows in T that match I.
Definition 8: An itemset i passes the minimum support (minsupp) threshold if occitm(i)/|T| > minsupp. Such an itemset is called a frequent itemset.
Definition 9: A ruleitem r passes the minsupp threshold if suppcount(r)/|T| > minsupp. Such a ruleitem is said to be a frequent ruleitem.
Definition 10: A ruleitem r passes the minimum confidence (minconf) threshold if suppcount(r)/actoccr(r) > minconf.
Definition 11: An associative rule is represented in the form cond → c, where the antecedent is an itemset and the consequent is a class.

The problem of associative classification is to discover a subset of rules with significant supports and high confidences. This subset is then used to build an automated classifier that can predict the classes of previously unseen data. Figure 1 shows the main phases implemented by an associative classification system, where the end user selects the training data and inputs two values (minsupp, minconf). The system processes the training data and produces the complete set of frequent ruleitems, a subset of which is presented to the user as the classifier. A classifier is a mapping of the form H: A → Y, where A is the set of items and Y is the set of class labels. The goal is to find a classifier h ∈ H that maximises the probability that h(a) = y for each test data object.

Figure 1. Associative classification main phases

Age     Income   Has a car   Buy/class
senior  middle   n           yes
youth   low      y           no
junior  high     y           yes
youth   middle   y           yes
senior  high     n           yes
junior  low      n           no
senior  middle   n           no

Table 1. Car sales training data

Itemset       Class   Support   Confidence
{low}         no      2/7       2/2
{high}        yes     2/7       2/2
{senior, n}   yes     2/7       2/3
{middle}      yes     2/7       2/3
{senior}      yes     2/7       2/3
{y}           yes     2/7       2/3
{n}           yes     2/7       2/4
{n}           no      2/7       2/4

Table 2. Possible ruleitems from Table 1

To demonstrate the main steps of an associative classification system, consider for instance the training data set shown in Table 1, which represents whether or not a person is likely to buy a new car. Assume that minsupp = 2 (as a support count) and minconf = 50%. The frequent ruleitems discovered, along with their support and confidence values, are shown in Table 2. Before constructing the classifier, most associative algorithms, including (Liu et al., 1998; Yin and Han, 2003; Thabtah et al., 2005), sort the discovered rules according to their confidence and support values. After the rules have been sorted, these techniques apply pruning heuristics to discard redundant and useless rules and select a subset of high-confidence rules to form the classifier.
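
The sketch below applies Definitions 5, 6, 9 and 10 to the Table 1 data and reproduces the single-attribute entries of Table 2 (the two-attribute ruleitem <{senior, n}, yes> would additionally require enumerating attribute pairs). The row and attribute representation is our own, and minsupp is treated as a support count of 2, as in the example.

from itertools import product

rows = [  # (Age, Income, has a car, class)
    ("senior", "middle", "n", "yes"),
    ("youth",  "low",    "y", "no"),
    ("junior", "high",   "y", "yes"),
    ("youth",  "middle", "y", "yes"),
    ("senior", "high",   "n", "yes"),
    ("junior", "low",    "n", "no"),
    ("senior", "middle", "n", "no"),
]
attrs = ("Age", "Income", "has a car")
minsupp_count, minconf = 2, 0.5

for i, attr in enumerate(attrs):
    values = {r[i] for r in rows}
    classes = {r[-1] for r in rows}
    for v, c in product(sorted(values), sorted(classes)):
        actoccr = sum(1 for r in rows if r[i] == v)                    # Definition 5
        suppcount = sum(1 for r in rows if r[i] == v and r[-1] == c)   # Definition 6
        if suppcount >= minsupp_count and suppcount / actoccr >= minconf:
            print(f"<{{{v}}}, {c}>  support {suppcount}/{len(rows)}, "
                  f"confidence {suppcount}/{actoccr}")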

3. Pruning Techniques Used in Associative Classification

Associative algorithms normally derive a large set of rules (Liu et al., 1999; Li et al., 2001), since (1) classification data sets are typically highly correlated and (2) association rule mining approaches that consider all attribute-value combinations in the database are used for rule discovery. As a result, there have been many attempts to reduce the size of the classifiers, mainly focused on preventing rules that are either redundant or misleading from taking any role in the prediction of test data objects. The removal of such rules can make the classification process more effective and accurate. Several pruning methods have been used effectively to reduce the size of the classifiers; some have been adopted from decision trees, like pessimistic error estimation, others from statistics, such as chi-square testing (χ²). These pruning techniques are utilised during either rule discovery or the construction of the classifier. For instance, a very early pruning step, which eliminates ruleitems that do not pass the support threshold, may occur in the process of finding frequent ruleitems. Other pruning, such as chi-square testing, may take place when generating the rules, and a late pruning method like database coverage may be used after all potential rules have been discovered. Throughout this section, we discuss the pruning techniques used by associative classification algorithms.

3.1 Chi-square Testing

Chi-square testing (χ²) is a well-known discrete data hypothesis testing method from statistics, which evaluates the correlation between two variables and determines whether they are independent or correlated (Snedecor and Cochran, 1989). The test for independence, when applied to a population of subjects, determines whether they are positively correlated or not. The test statistic is

χ² = Σ (f_o − f_e)² / f_e

where f_e is the expected frequency and f_o is the observed frequency. When the expected frequencies and the observed frequencies are notably different, the hypothesis that the two variables are independent is rejected. This method has been used in associative classification to prune negatively correlated rules. For example, a test can be done on every discovered rule, such as r: x → c, to find out whether the condition x is positively correlated with the class c. If the result of the test is larger than a particular constant, there is a strong indication that x and c are positively correlated, and therefore r will be stored as a candidate rule in the classifier. If the test result indicates negative correlation, r will not take any part in later prediction and is discarded. The CMAR algorithm adopts chi-square testing in its rule discovery step. When a rule is found, CMAR tests whether its body is positively correlated with the class. If a positive correlation is found, CMAR keeps the rule; otherwise the rule is discarded.

3.2 Redundant Rule Pruning

In associative classification, all attribute-value combinations are considered in turn as a rule's condition; therefore rules in the resulting classifiers may share training items in their bodies, and for this reason there may be several specific rules whose conditions contain those of more general rules. Rule redundancy in the classifier is unnecessary and in fact can be a serious problem, especially if the number of discovered rules is extremely large. A pruning method that discards specific rules with lower confidence values than their more general rules, called redundant rule pruning, was proposed in (Li et al., 2001). It works as follows: once the rule generation process is finished and the rules are sorted, an evaluation step is performed to prune every rule r2: x2 → c from the set of generated rules for which there is some more general rule r1: x1 → c of higher rank with x1 ⊆ x2. This pruning method significantly reduces the size of the resulting classifiers and minimises rule redundancy. Algorithms including (Li et al., 2001; Antonie et al., 2003) have used redundant rule pruning. They perform such pruning immediately after a rule is inserted into a compact data structure called the CR-tree: when a rule is added to the CR-tree, a query is issued to check whether the inserted rule can be pruned, or whether some other already-inserted rules in the tree can be removed.

3.3 Database Coverage

The database coverage heuristic, which is illustrated in Figure 2, is a popular pruning technique that is usually invoked after the potential rules have been created. This method tests the generated rules against the training data set; only high-quality rules that cover at least one training instance not considered by other, higher-ranked rules are kept for later classification. The database coverage method was first used by CBA and later by CBA (2) (Liu et al., 1999) and CMAR.

Given a set of generated rules R and the training data set T, the database coverage process works as follows:

For each rule ri in R do
    Find all instances in T that match ri's condition
    If ri correctly classifies at least one instance in T
        Mark ri as a candidate rule in the classifier
        Remove all instances in T covered by ri
    End if
    If ri cannot correctly classify any instance in T
        Remove ri from R
    End if
End for

Figure 2. Database coverage method
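
A runnable counterpart to Figure 2, assuming the rules are already ranked and expose a matches(instance) test and a predicted class klass, and that each training instance carries its true class in klass; these interfaces are illustrative, not from the paper.

def database_coverage(ranked_rules, training_data):
    # Keep a rule only if it correctly classifies at least one
    # still-uncovered training instance; accepted rules consume every
    # instance they cover, whether classified correctly or not.
    remaining = list(training_data)
    classifier = []
    for rule in ranked_rules:
        covered = [x for x in remaining if rule.matches(x)]
        if any(x.klass == rule.klass for x in covered):
            classifier.append(rule)
            remaining = [x for x in remaining if not rule.matches(x)]
        # Rules that cover no instance correctly are simply not kept.
    return classifier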

3.4 Pessimistic Error Estimation

Generally, there are two pruning strategies in decision trees: pre-pruning and post-pruning (Witten and Frank, 2000). The latter, also known as backward pruning, is more popular and has been used by many decision tree algorithms, like C4.5 (Quinlan, 1993). In backward pruning, the tree is first completely constructed; then, at each node, a decision is made whether to replace the node and its descendants with a single leaf or to leave the node unchanged. The decision is made by calculating the estimated error at a particular node, using the pessimistic error estimation measure, and comparing it with that of its potential replacement leaf. The method of replacing a sub-tree with a leaf node is called sub-tree replacement. The error is estimated from the training instances using a pessimistic error estimation measure: the probability of error at a node v is

q(v) = 1 − N_v,c / N_v

where N_v is the number of training data objects at node v and N_v,c is the number of training data objects associated with the majority class at node v. The error rate of a sub-tree T, q(T), is the weighted sum of the error rates of its leaves. The sub-tree T is pruned if q(v) ≤ q(T).

In addition to its use in decision tree algorithms, pessimistic error estimation can also be used in associative classification, by comparing the estimated error of a new rule ri, resulting from the deletion of one item from the condition of the original rule rj, with that of rj itself. If the expected error of ri is lower than that of rj, then rj is replaced by ri. Algorithms including (Liu et al., 1998; Liu et al., 1999) have used pessimistic error estimation to effectively cut down the number of extracted rules.
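
A sketch of this rule-pruning use of pessimistic error estimation. The (errors + 0.5)/covered estimate below is the C4.5-style continuity correction and is our assumption, as is the rule.without(item) helper that drops one item from a rule's condition.

def estimated_error(rule, data):
    # Pessimistic error of a rule on the training data (assumed
    # C4.5-style correction; 1.0 for rules covering nothing).
    covered = [x for x in data if rule.matches(x)]
    if not covered:
        return 1.0
    errors = sum(1 for x in covered if x.klass != rule.klass)
    return (errors + 0.5) / len(covered)

def prune_rule(rule, data):
    # Replace the rule by a one-item-shorter generalisation whenever
    # the generalisation has a strictly lower estimated error.
    best = rule
    for item in list(rule.condition):
        shorter = rule.without(item)   # hypothetical helper
        if estimated_error(shorter, data) < estimated_error(best, data):
            best = shorter
    return best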

3.5 Lazy Pruning

Some associative classification techniques (Baralis and Torino, 2002) argue that the pruning of classification rules should be limited to negative rules only (those that lead to incorrect classification). In addition, they claim that database coverage pruning often discards useful knowledge, since the ideal support threshold is not known in advance. For this reason, these algorithms use a late, database-coverage-like approach called lazy pruning, which discards rules that incorrectly classify training instances and keeps all other rules. Lazy pruning happens after the potential rules have been created and stored: each training instance is taken in turn, and the first rule in the set of ranked rules applicable to the instance is assigned to it. The correctness of the assigned class is then checked; if the class predicted by the rule matches the true class of the instance, the instance is removed and the rule is inserted into the classifier. Once all training instances have been considered, only the rules that wrongly classified training instances are discarded; their covered instances are put into a new cycle, and the process is repeated until all training instances are correctly classified. The result is two levels of rules: the first level contains rules that correctly classified at least one training instance, and the second level contains rules that were never used in the training phase. The main difference between lazy pruning and database coverage pruning is that the second-level rules, which are held in main memory by lazy pruning, are completely removed by the database coverage method during the rule discovery step. Furthermore, once a rule is applied to the training instances, all instances covered by the rule (negative and positive) are removed by the database coverage method.

Experimental tests reported in (Baralis and Torino, 2002) using 26 different data sets from (Merz and Murphy, 1996) showed that methods which employ lazy pruning, such as L3, may improve classification accuracy on average by +1.63% over techniques that use database coverage pruning. However, lazy pruning may lead to very large classifiers, which are difficult for a human to understand or interpret. In addition, the experimental tests indicate that lazy pruning algorithms consume more memory than other associative classification techniques and, more importantly, may fail if the support threshold is set to a very low value, due to the very large number of potential rules.

3.6 Conflicting Rules

For highly dense classification data sets, and other data where multiple class labels may be associated with each training instance, it is possible to produce rules with the same body that predict different classes. Given two rules such as r1: x → c1 and r2: x → c2, (Antonie and Zaïane, 2003) proposed a pruning method that considers these two rules conflicting. Their method removes conflicting rules and disallows them from taking any role in classifying test data objects. However, a recently proposed algorithm called MMAC (Thabtah et al., 2004) showed by experiments that such rules represent useful knowledge, since they pass the support and confidence requirements; thus, domain experts can profit from them. MMAC uses a recursive learning phase that combines the so-called conflicting rules into one multi-label rule. For the above example, MMAC combines the two rules into the following multi-label rule: x → c1 ∨ c2.

3.7 Laplace Accuracy

Laplace accuracy (Clark and Boswell, 1991) is mainly used in classification to estimate the expected error of a rule. The expected accuracy of a given rule r is given by the formula

Laplace(r) = (p_c(r) + 1) / (p_tot(r) + m)

where m is the number of class labels in the domain, p_tot(r) is the number of instances matching r's antecedent, and p_c(r) is the number of instances covered by r that belong to class c. Laplace expected error has been used successfully by the associative classification algorithm CPAR (Yin and Han, 2003), where the expected accuracy of each rule is calculated before the classification of test instances. This ensures that the rules with the best expected accuracy for each class participate in prediction, which results in slightly more accurate classifiers for CPAR than for the CBA and C4.5 algorithms. In particular, experimental results on 26 data sets from (Merz and Murphy, 1996) showed that CPAR achieved on average +0.48% and +1.83% higher prediction rates than the CBA and C4.5 algorithms, respectively.
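
The Laplace formula of Section 3.7 in executable form; the numbers in the example call are illustrative only.

def laplace_accuracy(p_c, p_tot, m):
    # Expected accuracy of a rule: p_c instances covered with the right
    # class, p_tot instances matching the antecedent, m class labels.
    return (p_c + 1) / (p_tot + m)

# A rule matching 3 training instances, 2 of them in its class,
# in a two-class problem:
print(laplace_accuracy(2, 3, 2))   # 0.6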

4. Impact of Pruning on Classifiers

In association rule discovery, a single transaction can be used to generate many rules; therefore, there are tremendous numbers of potential rules. In associative classification, association rule approaches such as Apriori (Agrawal and Srikant, 1994) are used to discover the rules, and thus the expected number of potential rules is large. Without constraints on the rule discovery and generation phases, or appropriate pruning, the very large number of rules, often in the order of thousands and sometimes tens of thousands, makes the classifier impossible for humans to understand or maintain. Pruning noisy and redundant rules in classifiers therefore becomes an important task.

Associative classification algorithms that use pruning methods like database coverage and redundant rule pruning prefer general, effective rules over specific ones; thus they produce smaller classifiers than techniques that adopt lazy pruning. We conducted experiments on the german and wine data sets, downloaded from (WEKA, 2000), to compare a lazy pruning algorithm, L3, and the (database coverage, pessimistic error) approach of CBA, with reference to the number of rules and accuracy. The L3 results on both data sets were generated using a minsupp of 1% and a minconf of 0.0%, and for a fair comparison we ran the CBA experiments using the same support and confidence thresholds. The numbers of rules produced by L3 on the german and wine data sets are … and 40775, respectively, with prediction accuracies of 72.50% and 95.00%. By comparison, CBA derives only 325 and 12 rules from the same data sets, with prediction accuracies of 73.58% and 98.33%, respectively. These results provide direct, if limited, evidence that techniques which use database coverage and/or pessimistic error pruning tend to choose general rules and simpler classifiers, which are sometimes more accurate on test data than lazy pruning methods like L3.

Overall, techniques that derive smaller classifiers are generally preferred by human experts because of their ease of manual maintenance and interpretability. For instance, if general practitioners used their patient data to build a rule-based diagnosis system, they would prefer the resulting number of rules to be small and simple; they might even accept slightly lower accuracy in exchange for a more concise set of rules that human experts can understand. Smaller classifiers do, however, suffer from some drawbacks, including their sensitivity to low-quality data (data sets that contain redundant information and missing values) and their inability to cover the whole training data. On the other hand, approaches that produce very large numbers of rules, such as L3, usually give slightly improved predictive power, but spend a long time in training and in the prediction of test objects, since they must pass over a very large number of rules when classifying test data. In the L3 algorithm, rules which cover no training data instances are known as spare, or secondary, rules. Holding a very large number of spare rules to cover the limited number of test instances missed by the primary rules is inefficient. There should be a trade-off between the size of the classifier and the predictive accuracy, especially where slightly lower accuracy can be tolerated in exchange for a more concise set of rules.

5. Experimental Results

Experiments on fourteen different data sets from the UCI data collection (Merz and Murphy, 1996) were conducted using stratified ten-fold cross validation. Cross validation is a standard evaluation method in data mining: the training data is divided randomly into n blocks, each block is held out once, and the classifier is trained on the remaining n-1 blocks; its error rate is then evaluated on the holdout block. The learning procedure is therefore executed n times on slightly different training data sets.
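
A sketch of the n-fold procedure just described; stratification, which additionally preserves class proportions within each block, is omitted for brevity, and train and error_rate stand in for a concrete learner and evaluation function.

import random

def cross_validate(data, n, train, error_rate):
    # Randomly divide the data into n blocks, hold each block out once,
    # train on the remaining n-1 blocks, and average the error rates.
    data = list(data)
    random.shuffle(data)
    blocks = [data[i::n] for i in range(n)]
    errors = []
    for i in range(n):
        holdout = blocks[i]
        training = [x for j, b in enumerate(blocks) if j != i for x in b]
        errors.append(error_rate(train(training), holdout))
    return sum(errors) / n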

We compared the effect of three pruning methods, namely lazy pruning (Baralis and Torino, 2002), database coverage (Liu et al., 1998) and pessimistic error estimation (Quinlan, 1987), in terms of the number of rules produced by three known associative classification approaches. In particular, we compared the number of rules derived by L3 (a lazy pruning algorithm), MCAR (a database coverage pruning algorithm) and CBA (a pessimistic error and database coverage pruning algorithm) on the fourteen classification benchmarks.

Data       Database Coverage &        Database Coverage    Lazy Pruning
           Pessimistic Error (CBA)    (MCAR)               (L3)
Auto
Led
Breast
Pima
Tic-tac
Glass
Lymph
Diabetes
Iris
Cleve
Heart
Labor

Table 3. Number of rules produced when different pruning approaches are used in associative classification

The experiments with CBA were conducted using an implementation provided by its authors (CBA, 1998). MCAR was implemented in Java under Windows XP, and the results of L3 were provided by its respective authors. All experiments were conducted on a Pentium IV 2.7 GHz machine with 512 MB of RAM. For CBA and MCAR, we used a minsupp of 2% and a minconf of 50%, since these are among the values suggested by their authors. For L3, we used the standard values for all parameters suggested by its authors.

Table 3 shows the number of rules derived when the different pruning approaches are used. The results shown in column two were derived using the CBA algorithm; for column three we used the MCAR algorithm, and for column four the L3 algorithm. It is obvious from the numbers in Table 3 that algorithms which use the lazy pruning approach generate many more rules than those that employ the other approaches. In particular, for all the classification data sets we considered, the L3 algorithm produces more rules than the MCAR and CBA techniques. One of the principal reasons lazy pruning algorithms generate large numbers of rules is that they store rules that do not cover even a single training data object in the classifier. For example, while constructing the classifier, the L3 algorithm evaluates every rule to check whether or not it covers a training data object. If a rule correctly covers a training object, it is added to the classifier; otherwise it is removed. However, rules which have never been tested are also added to the classifier, as spare rules. These spare rules are often in the order of thousands, and even tens of thousands. Producing a very large number of rules raises problems of user understandability and maintenance. Furthermore, the classification time may increase, since the classification system must iterate through the rule set; these problems limit the use of lazy associative algorithms.

Unlike the lazy pruning approach, the database coverage method eliminates the spare rules, which explains its moderately sized classifiers. Specifically, the MCAR and CBA algorithms generate classifiers of reasonable size compared with L3, from which domain users can benefit. In fact, for most of the data sets we considered, the classifiers produced when the database coverage heuristic is used can be easily maintained and interpreted by end users. Moreover, the CBA algorithm, which utilises both database coverage and pessimistic error pruning, cuts down the size of the classifiers even further; this can be seen by comparing the numbers in column two with those in column three of Table 3.

Table 4 gives the accuracy (%) figures derived by the L3, CBA and MCAR algorithms from twelve of the benchmarks we considered. We also report the accuracy derived without using any pruning method, as shown in column 7. The accuracy numbers were generated using a support of 2% and a confidence of 50% for the CBA and MCAR algorithms; for L3, we used the default parameters suggested by its respective authors. The figures in Table 4 indicate that the L3 algorithm outperforms CBA and MCAR on five classification data sets. L3 also produced, on average, higher classification accuracy than when no pruning at all is applied. In fact, L3 achieved on average +0.69% and +0.21% higher prediction rates than the CBA and MCAR algorithms, respectively. Moreover, on average, L3 achieved higher accuracy than an associative classification method that uses no pruning at all.

In the classification step, when the primary rules fail to classify a test object, lazy pruning methods such as L3 use the spare rules to classify that object, which explains the slight increase in accuracy over CBA and MCAR. In other words, unlike the CBA and MCAR algorithms, which fall back on the default class label when the rules in their classifiers fail to classify a test object, the L3 algorithm utilises the spare rules.

6. Conclusions

Associative classification is an important data mining task that has recently attracted many researchers, since it derives highly predictive classification systems. We have surveyed different pruning heuristics in associative classification, such as redundant rule pruning, database coverage and lazy pruning. Moreover, experiments using fourteen different classification data sets were conducted to compare three popular associative algorithms that utilise different pruning methods, i.e. database coverage, lazy pruning and pessimistic error. Our bases of comparison are classifier size and accuracy rate. The results revealed that algorithms that utilise lazy pruning (L3) produce very large classifiers, which are difficult for domain experts to maintain or understand and whose use is consequently limited, although L3 generates on average +0.69% and +0.21% higher accuracy rates than CBA and MCAR, respectively. On the other hand, associative algorithms that utilise database coverage and/or pessimistic error estimation pruning, such as MCAR and CBA, produce moderately sized classifiers that are easy for end users to maintain and interpret. The results also pointed out the need for additional constraints during pruning, in order to decrease further the size of the resulting classifiers. In the near future, we will investigate the possibility of creating a new hybrid pruning method.

Data       Size    # of Classes    CBA    MCAR    L3    No Pruning
Led
Breast
Pima
Tic-tac
Glass
Diabetes
Iris
Cleve
Heart
Labor
Wine
Zoo
Average

Table 4. Accuracy of the CBA, L3 and MCAR algorithms using ten-fold cross validation

References

Adamo, J. (2006). Association rule based classifier built via direct enumeration, online pruning and genetic algorithm based rule decimation. Artificial Intelligence and Applications 2006.
Agrawal, R., Srikant, R. (1994). Fast algorithms for mining association rules. Proceedings of the 20th International Conference on Very Large Data Bases.
Baralis, E., Torino, P. (2002). A lazy approach to pruning classification rules. Proceedings of the 2002 IEEE ICDM'02, p. 35.

Clark, P., Boswell, R. (1991). Rule induction with CN2: some recent improvements. In: Y. Kodratoff (ed.), Machine Learning, EWSL-91. Berlin: Springer-Verlag.
Cohen, W. (1995). Fast effective rule induction. Proceedings of the 12th International Conference on Machine Learning. Morgan Kaufmann, CA.
Dong, G., Li, J. (1999). Efficient mining of emerging patterns: discovering trends and differences. Proceedings of the International Conference on Knowledge Discovery and Data Mining.
Duda, R., Hart, P. (1973). Pattern classification and scene analysis. John Wiley & Sons.
Frank, E., Witten, I. (1998). Generating accurate rule sets without global optimisation. Proceedings of the Fifteenth International Conference on Machine Learning. Morgan Kaufmann, Madison, Wisconsin.
Freitas, A. (2000). Understanding the crucial difference between classification and association rule discovery. ACM SIGKDD Explorations Newsletter, 2(1).
Li, W., Han, J., Pei, J. (2001). CMAR: accurate and efficient classification based on multiple-class association rules. Proceedings of ICDM'01. San Jose, CA.
Liu, B., Hsu, W., Ma, Y. (1998). Integrating classification and association rule mining. Proceedings of KDD. New York, NY.
Liu, B., Hsu, W., Ma, Y. (1999). Mining association rules with multiple minimum supports. Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. San Diego, California.
Meretakis, D., Wüthrich, B. (1999). Extending naïve Bayes classifiers using long itemsets. Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. San Diego, California.
Merz, C., Murphy, P. (1996). UCI repository of machine learning databases. Irvine, CA: University of California, Department of Information and Computer Science.
Tan, P.-N., Steinbach, M., Kumar, V. (2005). Introduction to data mining. Addison Wesley.
Thabtah, F., Cowling, P., Peng, Y. (2005). MCAR: multi-class classification based on association rule approach. Proceedings of the 3rd IEEE International Conference on Computer Systems and Applications. Cairo, Egypt.
Thabtah, F., Cowling, P., Peng, Y. (2004). MMAC: a new multi-class, multi-label associative classification approach. Proceedings of the Fourth IEEE International Conference on Data Mining (ICDM'04). Brighton, UK. (Nominated for the best paper award.)
Quinlan, J. (1993). C4.5: programs for machine learning. San Mateo, CA: Morgan Kaufmann.
Quinlan, J. (1987). Simplifying decision trees. International Journal of Man-Machine Studies, 27(3).
Quinlan, J. (1979). Discovering rules from large collections of examples: a case study. In: D. Michie (ed.), Expert Systems in the Micro-electronic Age. Edinburgh University Press, Edinburgh.
Snedecor, W., Cochran, W. (1989). Statistical methods, eighth edition. Iowa State University Press.
Antonie, M., Zaïane, O., Coman, A. (2003). Associative classifiers for medical images. Lecture Notes in Artificial Intelligence 2797, Mining Multimedia and Complex Data. Springer-Verlag.
Witten, I., Frank, E. (2000). Data mining: practical machine learning tools and techniques with Java implementations. San Francisco: Morgan Kaufmann.
Zaiane, O., Antonie, M. (2005). Pruning and tuning rules for associative classifiers. Ninth International Conference on Knowledge-Based Intelligent Information & Engineering Systems (KES'05). Melbourne, Australia.
CBA (1998).
Yin, X., Han, J. (2003). CPAR: classification based on predictive association rules. Proceedings of SDM 2003. San Francisco, CA.
WEKA (2000). Data mining software in Java.


More information

Optimization using Ant Colony Algorithm

Optimization using Ant Colony Algorithm Optimization using Ant Colony Algorithm Er. Priya Batta 1, Er. Geetika Sharmai 2, Er. Deepshikha 3 1Faculty, Department of Computer Science, Chandigarh University,Gharaun,Mohali,Punjab 2Faculty, Department

More information

Understanding Rule Behavior through Apriori Algorithm over Social Network Data

Understanding Rule Behavior through Apriori Algorithm over Social Network Data Global Journal of Computer Science and Technology Volume 12 Issue 10 Version 1.0 Type: Double Blind Peer Reviewed International Research Journal Publisher: Global Journals Inc. (USA) Online ISSN: 0975-4172

More information

Improving Quality of Products in Hard Drive Manufacturing by Decision Tree Technique

Improving Quality of Products in Hard Drive Manufacturing by Decision Tree Technique Improving Quality of Products in Hard Drive Manufacturing by Decision Tree Technique Anotai Siltepavet 1, Sukree Sinthupinyo 2 and Prabhas Chongstitvatana 3 1 Computer Engineering, Chulalongkorn University,

More information

Discovery of Multi-level Association Rules from Primitive Level Frequent Patterns Tree

Discovery of Multi-level Association Rules from Primitive Level Frequent Patterns Tree Discovery of Multi-level Association Rules from Primitive Level Frequent Patterns Tree Virendra Kumar Shrivastava 1, Parveen Kumar 2, K. R. Pardasani 3 1 Department of Computer Science & Engineering, Singhania

More information

Technical Report TUD KE Jan-Nikolas Sulzmann, Johannes Fürnkranz

Technical Report TUD KE Jan-Nikolas Sulzmann, Johannes Fürnkranz Technische Universität Darmstadt Knowledge Engineering Group Hochschulstrasse 10, D-64289 Darmstadt, Germany http://www.ke.informatik.tu-darmstadt.de Technical Report TUD KE 2008 03 Jan-Nikolas Sulzmann,

More information

Evaluating the Replicability of Significance Tests for Comparing Learning Algorithms

Evaluating the Replicability of Significance Tests for Comparing Learning Algorithms Evaluating the Replicability of Significance Tests for Comparing Learning Algorithms Remco R. Bouckaert 1,2 and Eibe Frank 2 1 Xtal Mountain Information Technology 215 Three Oaks Drive, Dairy Flat, Auckland,

More information

Mining High Average-Utility Itemsets

Mining High Average-Utility Itemsets Proceedings of the 2009 IEEE International Conference on Systems, Man, and Cybernetics San Antonio, TX, USA - October 2009 Mining High Itemsets Tzung-Pei Hong Dept of Computer Science and Information Engineering

More information

Bing Liu, Minqing Hu and Wynne Hsu

Bing Liu, Minqing Hu and Wynne Hsu From: AAAI- Proceedings. Copyright, AAAI (www.aaai.org). All rights reserved. Intuitive Representation of Decision Trees Using General Rules and Exceptions Bing Liu, Minqing Hu and Wynne Hsu School of

More information

An Improved Apriori Algorithm for Association Rules

An Improved Apriori Algorithm for Association Rules Research article An Improved Apriori Algorithm for Association Rules Hassan M. Najadat 1, Mohammed Al-Maolegi 2, Bassam Arkok 3 Computer Science, Jordan University of Science and Technology, Irbid, Jordan

More information

Eager, Lazy and Hybrid Algorithms for Multi-Criteria Associative Classification

Eager, Lazy and Hybrid Algorithms for Multi-Criteria Associative Classification Eager, Lazy and Hybrid Algorithms for Multi-Criteria Associative Classification Adriano Veloso 1, Wagner Meira Jr 1 1 Computer Science Department Universidade Federal de Minas Gerais (UFMG) Belo Horizonte

More information

Applying Objective Interestingness Measures. in Data Mining Systems. Robert J. Hilderman and Howard J. Hamilton. Department of Computer Science

Applying Objective Interestingness Measures. in Data Mining Systems. Robert J. Hilderman and Howard J. Hamilton. Department of Computer Science Applying Objective Interestingness Measures in Data Mining Systems Robert J. Hilderman and Howard J. Hamilton Department of Computer Science University of Regina Regina, Saskatchewan, Canada SS 0A fhilder,hamiltong@cs.uregina.ca

More information

Mining Generalised Emerging Patterns

Mining Generalised Emerging Patterns Mining Generalised Emerging Patterns Xiaoyuan Qian, James Bailey, Christopher Leckie Department of Computer Science and Software Engineering University of Melbourne, Australia {jbailey, caleckie}@csse.unimelb.edu.au

More information

DIVERSITY-BASED INTERESTINGNESS MEASURES FOR ASSOCIATION RULE MINING

DIVERSITY-BASED INTERESTINGNESS MEASURES FOR ASSOCIATION RULE MINING DIVERSITY-BASED INTERESTINGNESS MEASURES FOR ASSOCIATION RULE MINING Huebner, Richard A. Norwich University rhuebner@norwich.edu ABSTRACT Association rule interestingness measures are used to help select

More information

Univariate and Multivariate Decision Trees

Univariate and Multivariate Decision Trees Univariate and Multivariate Decision Trees Olcay Taner Yıldız and Ethem Alpaydın Department of Computer Engineering Boğaziçi University İstanbul 80815 Turkey Abstract. Univariate decision trees at each

More information

Data Mining Part 5. Prediction

Data Mining Part 5. Prediction Data Mining Part 5. Prediction 5.4. Spring 2010 Instructor: Dr. Masoud Yaghini Outline Using IF-THEN Rules for Classification Rule Extraction from a Decision Tree 1R Algorithm Sequential Covering Algorithms

More information

Handling Missing Values via Decomposition of the Conditioned Set

Handling Missing Values via Decomposition of the Conditioned Set Handling Missing Values via Decomposition of the Conditioned Set Mei-Ling Shyu, Indika Priyantha Kuruppu-Appuhamilage Department of Electrical and Computer Engineering, University of Miami Coral Gables,

More information

Evolving SQL Queries for Data Mining

Evolving SQL Queries for Data Mining Evolving SQL Queries for Data Mining Majid Salim and Xin Yao School of Computer Science, The University of Birmingham Edgbaston, Birmingham B15 2TT, UK {msc30mms,x.yao}@cs.bham.ac.uk Abstract. This paper

More information

Improving Classifier Performance by Imputing Missing Values using Discretization Method

Improving Classifier Performance by Imputing Missing Values using Discretization Method Improving Classifier Performance by Imputing Missing Values using Discretization Method E. CHANDRA BLESSIE Assistant Professor, Department of Computer Science, D.J.Academy for Managerial Excellence, Coimbatore,

More information

A Comparison of Text-Categorization Methods applied to N-Gram Frequency Statistics

A Comparison of Text-Categorization Methods applied to N-Gram Frequency Statistics A Comparison of Text-Categorization Methods applied to N-Gram Frequency Statistics Helmut Berger and Dieter Merkl 2 Faculty of Information Technology, University of Technology, Sydney, NSW, Australia hberger@it.uts.edu.au

More information

Association Rule Mining from XML Data

Association Rule Mining from XML Data 144 Conference on Data Mining DMIN'06 Association Rule Mining from XML Data Qin Ding and Gnanasekaran Sundarraj Computer Science Program The Pennsylvania State University at Harrisburg Middletown, PA 17057,

More information

I. INTRODUCTION. Keywords : Spatial Data Mining, Association Mining, FP-Growth Algorithm, Frequent Data Sets

I. INTRODUCTION. Keywords : Spatial Data Mining, Association Mining, FP-Growth Algorithm, Frequent Data Sets 2017 IJSRSET Volume 3 Issue 5 Print ISSN: 2395-1990 Online ISSN : 2394-4099 Themed Section: Engineering and Technology Emancipation of FP Growth Algorithm using Association Rules on Spatial Data Sets Sudheer

More information

Lecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R,

Lecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R, Lecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R, statistics foundations 5 Introduction to D3, visual analytics

More information

The Comparison of CBA Algorithm and CBS Algorithm for Meteorological Data Classification Mohammad Iqbal, Imam Mukhlash, Hanim Maria Astuti

The Comparison of CBA Algorithm and CBS Algorithm for Meteorological Data Classification Mohammad Iqbal, Imam Mukhlash, Hanim Maria Astuti Information Systems International Conference (ISICO), 2 4 December 2013 The Comparison of CBA Algorithm and CBS Algorithm for Meteorological Data Classification Mohammad Iqbal, Imam Mukhlash, Hanim Maria

More information

ISSN: (Online) Volume 3, Issue 9, September 2015 International Journal of Advance Research in Computer Science and Management Studies

ISSN: (Online) Volume 3, Issue 9, September 2015 International Journal of Advance Research in Computer Science and Management Studies ISSN: 2321-7782 (Online) Volume 3, Issue 9, September 2015 International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online

More information

Pamba Pravallika 1, K. Narendra 2

Pamba Pravallika 1, K. Narendra 2 2018 IJSRSET Volume 4 Issue 1 Print ISSN: 2395-1990 Online ISSN : 2394-4099 Themed Section : Engineering and Technology Analysis on Medical Data sets using Apriori Algorithm Based on Association Rules

More information

Web Page Classification using FP Growth Algorithm Akansha Garg,Computer Science Department Swami Vivekanad Subharti University,Meerut, India

Web Page Classification using FP Growth Algorithm Akansha Garg,Computer Science Department Swami Vivekanad Subharti University,Meerut, India Web Page Classification using FP Growth Algorithm Akansha Garg,Computer Science Department Swami Vivekanad Subharti University,Meerut, India Abstract - The primary goal of the web site is to provide the

More information

Raunak Rathi 1, Prof. A.V.Deorankar 2 1,2 Department of Computer Science and Engineering, Government College of Engineering Amravati

Raunak Rathi 1, Prof. A.V.Deorankar 2 1,2 Department of Computer Science and Engineering, Government College of Engineering Amravati Analytical Representation on Secure Mining in Horizontally Distributed Database Raunak Rathi 1, Prof. A.V.Deorankar 2 1,2 Department of Computer Science and Engineering, Government College of Engineering

More information

A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection (Kohavi, 1995)

A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection (Kohavi, 1995) A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection (Kohavi, 1995) Department of Information, Operations and Management Sciences Stern School of Business, NYU padamopo@stern.nyu.edu

More information

Real World Performance of Association Rule Algorithms

Real World Performance of Association Rule Algorithms To appear in KDD 2001 Real World Performance of Association Rule Algorithms Zijian Zheng Blue Martini Software 2600 Campus Drive San Mateo, CA 94403, USA +1 650 356 4223 zijian@bluemartini.com Ron Kohavi

More information

Efficient Pairwise Classification

Efficient Pairwise Classification Efficient Pairwise Classification Sang-Hyeun Park and Johannes Fürnkranz TU Darmstadt, Knowledge Engineering Group, D-64289 Darmstadt, Germany Abstract. Pairwise classification is a class binarization

More information

An approach to calculate minimum support-confidence using MCAR with GA

An approach to calculate minimum support-confidence using MCAR with GA An approach to calculate minimum support-confidence using MCAR with GA Brijkishor Kumar Gupta Research Scholar Sri Satya Sai Institute Of Science & Engineering, Sehore Gajendra Singh Chandel Reader Sri

More information

Uncertain Data Classification Using Decision Tree Classification Tool With Probability Density Function Modeling Technique

Uncertain Data Classification Using Decision Tree Classification Tool With Probability Density Function Modeling Technique Research Paper Uncertain Data Classification Using Decision Tree Classification Tool With Probability Density Function Modeling Technique C. Sudarsana Reddy 1 S. Aquter Babu 2 Dr. V. Vasu 3 Department

More information

Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques

Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques 24 Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques Ruxandra PETRE

More information

Hybrid Feature Selection for Modeling Intrusion Detection Systems

Hybrid Feature Selection for Modeling Intrusion Detection Systems Hybrid Feature Selection for Modeling Intrusion Detection Systems Srilatha Chebrolu, Ajith Abraham and Johnson P Thomas Department of Computer Science, Oklahoma State University, USA ajith.abraham@ieee.org,

More information

Research on Applications of Data Mining in Electronic Commerce. Xiuping YANG 1, a

Research on Applications of Data Mining in Electronic Commerce. Xiuping YANG 1, a International Conference on Education Technology, Management and Humanities Science (ETMHS 2015) Research on Applications of Data Mining in Electronic Commerce Xiuping YANG 1, a 1 Computer Science Department,

More information

Data Mining. Part 2. Data Understanding and Preparation. 2.4 Data Transformation. Spring Instructor: Dr. Masoud Yaghini. Data Transformation

Data Mining. Part 2. Data Understanding and Preparation. 2.4 Data Transformation. Spring Instructor: Dr. Masoud Yaghini. Data Transformation Data Mining Part 2. Data Understanding and Preparation 2.4 Spring 2010 Instructor: Dr. Masoud Yaghini Outline Introduction Normalization Attribute Construction Aggregation Attribute Subset Selection Discretization

More information

Analysis of a Population of Diabetic Patients Databases in Weka Tool P.Yasodha, M. Kannan

Analysis of a Population of Diabetic Patients Databases in Weka Tool P.Yasodha, M. Kannan International Journal of Scientific & Engineering Research Volume 2, Issue 5, May-2011 1 Analysis of a Population of Diabetic Patients Databases in Weka Tool P.Yasodha, M. Kannan Abstract - Data mining

More information

CMAR: Accurate and Efficient Classification Based on Multiple Class-Association Rules

CMAR: Accurate and Efficient Classification Based on Multiple Class-Association Rules CMAR: Accurate and fficient Classification Based on Multiple Class-Association Rules Wenmin Li Jiawei an Jian Pei School of Computing Science, Simon Fraser University Burnaby, B.C., Canada V5A 1S6 -mail:

More information

A Survey on Moving Towards Frequent Pattern Growth for Infrequent Weighted Itemset Mining

A Survey on Moving Towards Frequent Pattern Growth for Infrequent Weighted Itemset Mining A Survey on Moving Towards Frequent Pattern Growth for Infrequent Weighted Itemset Mining Miss. Rituja M. Zagade Computer Engineering Department,JSPM,NTC RSSOER,Savitribai Phule Pune University Pune,India

More information