Pruning Techniques in Associative Classification: Survey and Comparison


Survey Research

Fadi Thabtah
Management Information Systems Department
Philadelphia University, Amman, Jordan

Journal of Digital Information Management

ABSTRACT: Association rule discovery and classification are common data mining tasks. Integrating association rule discovery and classification, also known as associative classification, is a promising approach that derives classifiers highly competitive, with regard to accuracy, with those of traditional classification approaches such as rule induction and decision trees. However, the size of the classifiers generated by associative classification is often large, and therefore pruning becomes an essential task. In this paper, we survey the different rule pruning methods used by current associative classification techniques. Further, we compare the effect of three pruning methods (database coverage, pessimistic error estimation, lazy pruning) on the accuracy rate and the number of rules derived from different classification data sets. Results obtained from experiments on data sets from the UCI data collection indicate that lazy pruning algorithms may produce slightly more predictive classifiers than those which utilise the database coverage and pessimistic error pruning methods. However, the potential use of such classifiers is limited because they are difficult for the end-user to understand and maintain.

Categories and Subject Descriptors: H.2 [Database Management]; H.2.8 [Database Applications]: Data mining
General Terms: Data mining, Pruning methods
Keywords: Associative classification, Association rule, Classification, Data mining, Rule pruning

Received 12 March 2006; Revised 12 July 2006; Accepted 23 July 2006

1. Introduction

Rapid advances in computing, especially in data collection and storage technology, have enabled organisations to collect massive amounts of data. However, finding and deriving useful information has proven to be a hard task, since the source data is often large. Data mining can be defined as a technology that utilises different intelligent algorithms for processing large data sets in order to extract useful knowledge (Tan et al., 2005). This knowledge can be output in different forms, such as simple if-then rules, probabilities, etc., and is used for forecasting, data analysis and several other tasks.

Association rule discovery is an important data mining task that finds correlations among items in a transactional database. The classic application of association rules is market basket analysis (Agrawal and Srikant, 1994), in which business experts aim to investigate the shopping behaviour of customers in an attempt to discover regularities. In finding association rules, one tries to find groups of items that are frequently sold together, in order to infer the presence of certain items from the presence of other items in the customer's shopping cart. An example of an association rule is: 55% of customers who buy crisps are likely to buy a soft drink as well; 4% of all database transactions contain crisps and a soft drink. Here, "customers who buy crisps" is the rule antecedent, and "buy a soft drink as well" is the rule consequent; the antecedent and the consequent of an association rule each contain at least one item. The 55% represents the strength of the rule and is known as the rule's confidence, whereas the 4% is a statistical significance measure known as the rule's support.
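
Support and confidence follow directly from counting transactions. The minimal sketch below, using a made-up five-transaction database (not from the paper), shows how figures like the 55% and 4% above would be computed.

transactions = [
    {"crisps", "soft drink", "bread"},
    {"crisps", "soft drink"},
    {"crisps", "milk"},
    {"bread", "milk"},
    {"crisps", "soft drink", "milk"},
]

def support(itemset, db):
    # Fraction of transactions containing every item of `itemset`.
    return sum(1 for t in db if itemset <= t) / len(db)

def confidence(antecedent, consequent, db):
    # Of the transactions matching the antecedent, the fraction that
    # also contain the consequent.
    both = sum(1 for t in db if (antecedent | consequent) <= t)
    ante = sum(1 for t in db if antecedent <= t)
    return both / ante

print(support({"crisps", "soft drink"}, transactions))        # 0.6
print(confidence({"crisps"}, {"soft drink"}, transactions))   # 0.75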

Classification, also known as categorisation, is another well-known task in data mining. Unlike association rule discovery, which finds relationships among items in a transactional database, the ultimate goal of classification is to construct a set of rules (a classifier) from a labelled training data set, in order to classify new data objects, known as test data objects, as accurately as possible. In other words, classification is a supervised learning task in which the class labels are known in advance, whereas association rule discovery has no class to predict and can therefore be categorised as an unsupervised learning task. (Freitas, 2000) gives a more comprehensive discussion of the main differences between classification and association rule discovery.

There are many classic classification approaches for extracting knowledge from data, such as divide-and-conquer (Quinlan, 1987; Quinlan, 1993), separate-and-conquer (Furnkranz, 1999), also known as rule induction, and statistical approaches (Duda and Hart, 1973; Meretakis and Wüthrich, 1999). The divide-and-conquer approach selects the root based on the most informative attribute in the training data set, making the selection with statistical measures such as information gain (Quinlan, 1979), and then makes a branch for each possible value of that attribute. This splits the training instances into subsets, one for each possible value of the selected attribute. The same process is repeated until all instances that fall in one branch have the same classification, or the remaining instances cannot be split any further. The separate-and-conquer approach, on the other hand, builds up rules in a greedy manner: after a rule is found, all training instances covered by that rule are removed, and the process is repeated until the best rule found has a large error rate (a sketch follows below). Finally, statistical approaches such as Naïve Bayes (Duda and Hart, 1973) compute the probabilities of classes in the training data set, using the frequency of attribute values associated with these classes, in order to classify test instances. Numerous algorithms have been developed based on these approaches, such as decision trees (Quinlan, 1993), PART (Frank and Witten, 1998) and RIPPER (Cohen, 1995).
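
A bare-bones illustration of the separate-and-conquer loop just described; find_best_rule stands in for any greedy rule search, and the matches/klass interfaces on rules and instances are assumptions of this sketch, not part of any algorithm in this paper.

def separate_and_conquer(data, find_best_rule, max_error=0.5):
    # Greedily learn one rule, remove the instances it covers, and
    # repeat until the best rule found is too inaccurate or no data remains.
    rules = []
    data = list(data)
    while data:
        rule = find_best_rule(data)
        covered = [x for x in data if rule.matches(x)]
        if not covered:
            break
        error = sum(1 for x in covered if x.klass != rule.klass) / len(covered)
        if error > max_error:   # stopping condition: large error rate
            break
        rules.append(rule)
        data = [x for x in data if not rule.matches(x)]
    return rules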

In recent years, a new classification approach called associative classification (Liu et al., 1998; Li et al., 2001; Thabtah et al., 2004), which utilises association rule discovery methods to find the rules, has been developed. Several associative algorithms have been proposed, including CBA (Liu et al., 1998), CMAR (Li et al., 2001), L3 (Baralis and Torino, 2002), CPAR (Yin and Han, 2003) and MCAR (Thabtah et al., 2005). Empirical results (Liu et al., 1998; Li et al., 2001; Yin and Han, 2003; Thabtah et al., 2004; Thabtah et al., 2005) have shown that these algorithms usually build more accurate classifiers than decision tree and rule induction approaches. Association rule discovery techniques generate a massive number of rules, especially when a low support threshold is used (Liu et al., 1999; Zaiane and Antonie, 2005; Adamo, 2006). Since associative classification algorithms utilise association rule methods in the training phase, the size of their classifiers is large as well.

Several pruning methods have been used effectively to reduce the size of the classifiers in associative classification, such as pessimistic error estimation (Quinlan, 1987), chi-square testing (χ²) (Snedecor and Cochran, 1989), database coverage (Liu et al., 1998) and lazy pruning (Baralis and Torino, 2002). The aim of this paper is to survey these pruning methods and to measure their impact on the number of rules derived. In particular, we compare three pruning methods (database coverage, pessimistic error, lazy pruning) with reference to the size of the classifiers derived from several different classification benchmarks. Furthermore, we investigate the predictive accuracy obtained on different data sets by three popular associative algorithms (L3, CBA, MCAR), which employ different pruning heuristics, in order to measure the impact of pruning on accuracy.

The rest of the paper is organised as follows. The basic concepts of associative classification are presented in Section 2. The different methods used to prune redundant and harmful rules are surveyed in Section 3. In Section 4, we show the impact of pruning on classifiers derived by associative techniques. Section 5 is devoted to data and experimental results, and finally conclusions are given in Section 6.

2. Associative Classification

In associative classification, a training data set T has m distinct attributes A1, A2, ..., Am, and C is a list of class labels. The number of rows in T is denoted |T|. Attributes can be categorical (taking a value from a finite set of possible values) or continuous (real or integer). In the case of categorical attributes, all possible values are mapped to a set of positive integers. For continuous attributes, a discretisation method is first used to transform these attributes into categorical ones.

Definition 1: An item can be described as an attribute name Ai and its value ai, denoted (Ai, ai).
Definition 2: The jth row, or training object, in T can be described as a list of items (Aj1, aj1), ..., (Ajk, ajk), plus a class denoted by cj.
Definition 3: An itemset can be described as a set of disjoint attribute values contained in a training object, denoted <(Ai1, ai1), ..., (Aik, aik)>.
Definition 4: A ruleitem r is of the form <cond, c>, where the condition cond is an itemset and c ∈ C is a class.
Definition 5: The actual occurrence (actoccr) of a ruleitem r in T is the number of rows in T that match r's itemset.
Definition 6: The support count (suppcount) of a ruleitem r = <cond, c> is the number of rows in T that match r's itemset and belong to class c.
Definition 7: The occurrence (occitm) of an itemset I in T is the number of rows in T that match I.
Definition 8: An itemset i passes the minimum support (minsupp) threshold if occitm(i)/|T| > minsupp. Such an itemset is called a frequent itemset.
Definition 9: A ruleitem r passes the minsupp threshold if suppcount(r)/|T| > minsupp. Such a ruleitem is said to be a frequent ruleitem.
Definition 10: A ruleitem r passes the minimum confidence (minconf) threshold if suppcount(r)/actoccr(r) > minconf.
Definition 11: An associative rule is represented in the form cond → c, where the antecedent is an itemset and the consequent is a class.

The problem of associative classification is to discover a subset of rules with significant supports and high confidences. This subset is then used to build an automated classifier that can predict the classes of previously unseen data. Figure 1 shows the main phases implemented by an associative classification system, where the end user selects the training data and inputs two values (minsupp, minconf). The system processes the training data and produces the complete set of frequent ruleitems, a subset of which is presented to the user as the classifier. A classifier is a mapping of the form H: A → Y, where A is the set of items and Y is the set of class labels. The goal is to find a classifier h ∈ H that maximises the probability that h(a) = y for each test data object.

Figure 1. Associative classification main phases

Age     Income   Has a car   Buy/class
senior  middle   n           yes
youth   low      y           no
junior  high     y           yes
youth   middle   y           yes
senior  high     n           yes
junior  low      n           no
senior  middle   n           no

Table 1. Car sales training data

Itemset       Class   Support   Confidence
{low}         no      2/7       2/2
{high}        yes     2/7       2/2
{senior, n}   yes     2/7       2/3
{middle}      yes     2/7       2/3
{senior}      yes     2/7       2/3
{y}           yes     2/7       2/3
{n}           yes     2/7       2/4
{n}           no      2/7       2/4

Table 2. Possible ruleitems from Table 1

To demonstrate the main steps of an associative classification system, consider for instance the training data set shown in Table 1, which represents whether or not a person is likely to buy a new car. Assume that minsupp = 2 (as a support count) and minconf = 50%. The frequent ruleitems discovered, along with their support and confidence values, are shown in Table 2. Before constructing the classifier, most associative algorithms, including (Liu et al., 1998; Yin and Han, 2003; Thabtah et al., 2005), sort the discovered rules according to their confidence and support values. After the rules have been sorted, these techniques apply pruning heuristics to discard redundant and useless rules and select a subset of high-confidence rules to form the classifier.
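
The sketch below applies Definitions 5, 6, 9 and 10 to the Table 1 data and reproduces the single-attribute entries of Table 2 (the two-attribute ruleitem <{senior, n}, yes> would additionally require enumerating attribute pairs). The row and attribute representation is our own, and minsupp is treated as a support count of 2, as in the example.

from itertools import product

rows = [  # (Age, Income, has a car, class)
    ("senior", "middle", "n", "yes"),
    ("youth",  "low",    "y", "no"),
    ("junior", "high",   "y", "yes"),
    ("youth",  "middle", "y", "yes"),
    ("senior", "high",   "n", "yes"),
    ("junior", "low",    "n", "no"),
    ("senior", "middle", "n", "no"),
]
attrs = ("Age", "Income", "has a car")
minsupp_count, minconf = 2, 0.5

for i, attr in enumerate(attrs):
    values = {r[i] for r in rows}
    classes = {r[-1] for r in rows}
    for v, c in product(sorted(values), sorted(classes)):
        actoccr = sum(1 for r in rows if r[i] == v)                    # Definition 5
        suppcount = sum(1 for r in rows if r[i] == v and r[-1] == c)   # Definition 6
        if suppcount >= minsupp_count and suppcount / actoccr >= minconf:
            print(f"<{{{v}}}, {c}>  support {suppcount}/{len(rows)}, "
                  f"confidence {suppcount}/{actoccr}")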

3. Pruning Techniques Used in Associative Classification

Associative algorithms normally derive a large set of rules (Liu et al., 1999; Li et al., 2001), since (1) classification data sets are typically highly correlated and (2) association rule mining approaches that consider all attribute-value combinations in the database are used for rule discovery. As a result, there have been many attempts to reduce the size of the classifiers, mainly focused on preventing rules that are either redundant or misleading from taking any role in the prediction of test data objects. The removal of such rules can make the classification process more effective and accurate. Several pruning methods have been used effectively to reduce the size of the classifiers; some have been adopted from decision trees, like pessimistic error estimation, others from statistics, such as chi-square testing (χ²). These pruning techniques are utilised during either rule discovery or the construction of the classifier. For instance, a very early pruning step, which eliminates ruleitems that do not pass the support threshold, may occur in the process of finding frequent ruleitems. Other pruning, such as chi-square testing, may take place when generating the rules, and a late pruning method like database coverage may be used after all potential rules have been discovered. Throughout this section, we discuss the pruning techniques used by associative classification algorithms.

3.1 Chi-square Testing

Chi-square testing (χ²) is a well-known discrete data hypothesis testing method from statistics, which evaluates the correlation between two variables and determines whether they are independent or correlated (Snedecor and Cochran, 1989). The test for independence, when applied to a population of subjects, determines whether they are positively correlated or not. The test statistic is

χ² = Σ (f_o − f_e)² / f_e

where f_e is the expected frequency and f_o is the observed frequency. When the expected frequencies and the observed frequencies are notably different, the hypothesis that the two variables are independent is rejected. This method has been used in associative classification to prune negatively correlated rules. For example, a test can be done on every discovered rule, such as r: x → c, to find out whether the condition x is positively correlated with the class c. If the result of the test is larger than a particular constant, there is a strong indication that x and c are positively correlated, and therefore r will be stored as a candidate rule in the classifier. If the test result indicates negative correlation, r will not take any part in later prediction and is discarded. The CMAR algorithm adopts chi-square testing in its rule discovery step. When a rule is found, CMAR tests whether its body is positively correlated with the class. If a positive correlation is found, CMAR keeps the rule; otherwise the rule is discarded.

3.2 Redundant Rule Pruning

In associative classification, all attribute-value combinations are considered in turn as a rule's condition; therefore rules in the resulting classifiers may share training items in their bodies, and for this reason there may be several specific rules whose conditions contain those of more general rules. Rule redundancy in the classifier is unnecessary and in fact can be a serious problem, especially if the number of discovered rules is extremely large. A pruning method that discards specific rules with lower confidence values than their more general rules, called redundant rule pruning, was proposed in (Li et al., 2001). It works as follows: once the rule generation process is finished and the rules are sorted, an evaluation step is performed to prune every rule r2: x2 → c from the set of generated rules for which there is some more general rule r1: x1 → c of higher rank with x1 ⊆ x2. This pruning method significantly reduces the size of the resulting classifiers and minimises rule redundancy. Algorithms including (Li et al., 2001; Antonie et al., 2003) have used redundant rule pruning. They perform such pruning immediately after a rule is inserted into a compact data structure called the CR-tree: when a rule is added to the CR-tree, a query is issued to check whether the inserted rule can be pruned, or whether some other already-inserted rules in the tree can be removed.

3.3 Database Coverage

The database coverage heuristic, which is illustrated in Figure 2, is a popular pruning technique that is usually invoked after the potential rules have been created. This method tests the generated rules against the training data set; only high-quality rules that cover at least one training instance not considered by other, higher-ranked rules are kept for later classification. The database coverage method was first used by CBA and later by CBA (2) (Liu et al., 1999) and CMAR.

Given a set of generated rules R and the training data set T, the database coverage process works as follows:

For each rule ri in R do
    Find all instances in T that match ri's condition
    If ri correctly classifies at least one instance in T
        Mark ri as a candidate rule in the classifier
        Remove all instances in T covered by ri
    End if
    If ri cannot correctly classify any instance in T
        Remove ri from R
    End if
End for

Figure 2. Database coverage method
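
A runnable counterpart to Figure 2, assuming the rules are already ranked and expose a matches(instance) test and a predicted class klass, and that each training instance carries its true class in klass; these interfaces are illustrative, not from the paper.

def database_coverage(ranked_rules, training_data):
    # Keep a rule only if it correctly classifies at least one
    # still-uncovered training instance; accepted rules consume every
    # instance they cover, whether classified correctly or not.
    remaining = list(training_data)
    classifier = []
    for rule in ranked_rules:
        covered = [x for x in remaining if rule.matches(x)]
        if any(x.klass == rule.klass for x in covered):
            classifier.append(rule)
            remaining = [x for x in remaining if not rule.matches(x)]
        # Rules that cover no instance correctly are simply not kept.
    return classifier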

3.4 Pessimistic Error Estimation

Generally, there are two pruning strategies in decision trees: pre-pruning and post-pruning (Witten and Frank, 2000). The latter, also known as backward pruning, is more popular and has been used by many decision tree algorithms, like C4.5 (Quinlan, 1993). In backward pruning, the tree is first completely constructed; then, at each node, a decision is made whether to replace the node and its descendants with a single leaf or to leave the node unchanged. The decision is made by calculating the estimated error at a particular node, using the pessimistic error estimation measure, and comparing it with that of its potential replacement leaf. The method of replacing a sub-tree with a leaf node is called sub-tree replacement. The error is estimated from the training instances using a pessimistic error estimation measure: the probability of error at a node v is

q(v) = 1 − N_v,c / N_v

where N_v is the number of training data objects at node v and N_v,c is the number of training data objects associated with the majority class at node v. The error rate of a sub-tree T, q(T), is the weighted sum of the error rates of its leaves. The sub-tree T is pruned if q(v) ≤ q(T).

In addition to its use in decision tree algorithms, pessimistic error estimation can also be used in associative classification, by comparing the estimated error of a new rule ri, resulting from the deletion of one item from the condition of the original rule rj, with that of rj itself. If the expected error of ri is lower than that of rj, then rj is replaced by ri. Algorithms including (Liu et al., 1998; Liu et al., 1999) have used pessimistic error estimation to effectively cut down the number of extracted rules.
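
A sketch of this rule-pruning use of pessimistic error estimation. The (errors + 0.5)/covered estimate below is the C4.5-style continuity correction and is our assumption, as is the rule.without(item) helper that drops one item from a rule's condition.

def estimated_error(rule, data):
    # Pessimistic error of a rule on the training data (assumed
    # C4.5-style correction; 1.0 for rules covering nothing).
    covered = [x for x in data if rule.matches(x)]
    if not covered:
        return 1.0
    errors = sum(1 for x in covered if x.klass != rule.klass)
    return (errors + 0.5) / len(covered)

def prune_rule(rule, data):
    # Replace the rule by a one-item-shorter generalisation whenever
    # the generalisation has a strictly lower estimated error.
    best = rule
    for item in list(rule.condition):
        shorter = rule.without(item)   # hypothetical helper
        if estimated_error(shorter, data) < estimated_error(best, data):
            best = shorter
    return best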

3.5 Lazy Pruning

Some associative classification techniques (Baralis and Torino, 2002) argue that the pruning of classification rules should be limited to negative rules only (those that lead to incorrect classification). In addition, they claim that database coverage pruning often discards useful knowledge, since the ideal support threshold is not known in advance. For this reason, these algorithms use a late, database-coverage-like approach called lazy pruning, which discards rules that incorrectly classify training instances and keeps all other rules. Lazy pruning happens after the potential rules have been created and stored: each training instance is taken in turn, and the first rule in the set of ranked rules applicable to the instance is assigned to it. The correctness of the assigned class is then checked; if the class predicted by the rule matches the true class of the instance, the instance is removed and the rule is inserted into the classifier. Once all training instances have been considered, only the rules that wrongly classified training instances are discarded; their covered instances are put into a new cycle, and the process is repeated until all training instances are correctly classified. The result is two levels of rules: the first level contains rules that correctly classified at least one training instance, and the second level contains rules that were never used in the training phase. The main difference between lazy pruning and database coverage pruning is that the second-level rules, which are held in main memory by lazy pruning, are completely removed by the database coverage method during the rule discovery step. Furthermore, once a rule is applied to the training instances, all instances covered by the rule (negative and positive) are removed by the database coverage method.

Experimental tests reported in (Baralis and Torino, 2002) using 26 different data sets from (Merz and Murphy, 1996) showed that methods which employ lazy pruning, such as L3, may improve classification accuracy on average by +1.63% over techniques that use database coverage pruning. However, lazy pruning may lead to very large classifiers, which are difficult for a human to understand or interpret. In addition, the experimental tests indicate that lazy pruning algorithms consume more memory than other associative classification techniques and, more importantly, may fail if the support threshold is set to a very low value, due to the very large number of potential rules.

3.6 Conflicting Rules

For highly dense classification data sets, and other data where multiple class labels may be associated with each training instance, it is possible to produce rules with the same body that predict different classes. Given two rules such as r1: x → c1 and r2: x → c2, (Antonie and Zaïane, 2003) proposed a pruning method that considers these two rules conflicting. Their method removes conflicting rules and disallows them from taking any role in classifying test data objects. However, a recently proposed algorithm called MMAC (Thabtah et al., 2004) showed by experiments that such rules represent useful knowledge, since they pass the support and confidence requirements; thus, domain experts can profit from them. MMAC uses a recursive learning phase that combines the so-called conflicting rules into one multi-label rule. For the above example, MMAC combines the two rules into the following multi-label rule: x → c1 ∨ c2.

3.7 Laplace Accuracy

Laplace accuracy (Clark and Boswell, 1991) is mainly used in classification to estimate the expected error of a rule. The expected accuracy of a given rule r is given by the formula

Laplace(r) = (p_c(r) + 1) / (p_tot(r) + m)

where m is the number of class labels in the domain, p_tot(r) is the number of instances matching r's antecedent, and p_c(r) is the number of instances covered by r that belong to class c. Laplace expected error has been used successfully by the associative classification algorithm CPAR (Yin and Han, 2003), where the expected accuracy of each rule is calculated before the classification of test instances. This ensures that the rules with the best expected accuracy for each class participate in prediction, which results in slightly more accurate classifiers for CPAR than for the CBA and C4.5 algorithms. In particular, experimental results on 26 data sets from (Merz and Murphy, 1996) showed that CPAR achieved on average +0.48% and +1.83% higher prediction rates than the CBA and C4.5 algorithms, respectively.
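
The Laplace formula of Section 3.7 in executable form; the numbers in the example call are illustrative only.

def laplace_accuracy(p_c, p_tot, m):
    # Expected accuracy of a rule: p_c instances covered with the right
    # class, p_tot instances matching the antecedent, m class labels.
    return (p_c + 1) / (p_tot + m)

# A rule matching 3 training instances, 2 of them in its class,
# in a two-class problem:
print(laplace_accuracy(2, 3, 2))   # 0.6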

4. Impact of Pruning on Classifiers

In association rule discovery, a single transaction can be used to generate many rules; therefore, there are tremendous numbers of potential rules. In associative classification, association rule approaches such as Apriori (Agrawal and Srikant, 1994) are used to discover the rules, and thus the expected number of potential rules is large. Without constraints on the rule discovery and generation phases, or appropriate pruning, the very large number of rules, often in the order of thousands and sometimes tens of thousands, makes the classifier impossible for humans to understand or maintain. Pruning noisy and redundant rules in classifiers therefore becomes an important task.

Associative classification algorithms that use pruning methods like database coverage and redundant rule pruning prefer general, effective rules over specific ones; thus they produce smaller classifiers than techniques that adopt lazy pruning. We conducted experiments on the german and wine data sets, downloaded from (WEKA, 2000), to compare a lazy pruning algorithm, L3, and the (database coverage, pessimistic error) approach of CBA, with reference to the number of rules and accuracy. The L3 results on both data sets were generated using a minsupp of 1% and a minconf of 0.0%, and for a fair comparison we ran the CBA experiments using the same support and confidence thresholds. The numbers of rules produced by L3 on the german and wine data sets are … and 40775, respectively, with prediction accuracies of 72.50% and 95.00%. By comparison, CBA derives only 325 and 12 rules from the same data sets, with prediction accuracies of 73.58% and 98.33%, respectively. These results provide direct, if limited, evidence that techniques which use database coverage and/or pessimistic error pruning tend to choose general rules and simpler classifiers, which are sometimes more accurate on test data than lazy pruning methods like L3.

Overall, techniques that derive smaller classifiers are generally preferred by human experts because of their ease of manual maintenance and interpretability. For instance, if general practitioners used their patient data to build a rule-based diagnosis system, they would prefer the resulting number of rules to be small and simple; they might even accept slightly lower accuracy in exchange for a more concise set of rules that human experts can understand. Smaller classifiers do, however, suffer from some drawbacks, including their sensitivity to low-quality data (data sets that contain redundant information and missing values) and their inability to cover the whole training data. On the other hand, approaches that produce very large numbers of rules, such as L3, usually give slightly improved predictive power, but spend a long time in training and in the prediction of test objects, since they must pass over a very large number of rules when classifying test data. In the L3 algorithm, rules which cover no training data instances are known as spare, or secondary, rules. Holding a very large number of spare rules to cover the limited number of test instances missed by the primary rules is inefficient. There should be a trade-off between the size of the classifier and the predictive accuracy, especially where slightly lower accuracy can be tolerated in exchange for a more concise set of rules.

5. Experimental Results

Experiments on fourteen different data sets from the UCI data collection (Merz and Murphy, 1996) were conducted using stratified ten-fold cross validation. Cross validation is a standard evaluation method in data mining: the training data is divided randomly into n blocks, each block is held out once, and the classifier is trained on the remaining n-1 blocks; its error rate is then evaluated on the holdout block. The learning procedure is therefore executed n times on slightly different training data sets.
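
A sketch of the n-fold procedure just described; stratification, which additionally preserves class proportions within each block, is omitted for brevity, and train and error_rate stand in for a concrete learner and evaluation function.

import random

def cross_validate(data, n, train, error_rate):
    # Randomly divide the data into n blocks, hold each block out once,
    # train on the remaining n-1 blocks, and average the error rates.
    data = list(data)
    random.shuffle(data)
    blocks = [data[i::n] for i in range(n)]
    errors = []
    for i in range(n):
        holdout = blocks[i]
        training = [x for j, b in enumerate(blocks) if j != i for x in b]
        errors.append(error_rate(train(training), holdout))
    return sum(errors) / n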

We compared the effect of three pruning methods, namely lazy pruning (Baralis and Torino, 2002), database coverage (Liu et al., 1998) and pessimistic error estimation (Quinlan, 1987), in terms of the number of rules produced by three known associative classification approaches. In particular, we compared the number of rules derived by L3 (a lazy pruning algorithm), MCAR (a database coverage pruning algorithm) and CBA (a pessimistic error and database coverage pruning algorithm) on the fourteen classification benchmarks.

Data       Database Coverage &        Database Coverage    Lazy Pruning
           Pessimistic Error (CBA)    (MCAR)               (L3)
Auto
Led
Breast
Pima
Tic-tac
Glass
Lymph
Diabetes
Iris
Cleve
Heart
Labor

Table 3. Number of rules produced when different pruning approaches are used in associative classification

The experiments with CBA were conducted using an implementation provided by its authors (CBA, 1998). MCAR was implemented in Java under Windows XP, and the results of L3 were provided by its respective authors. All experiments were conducted on a Pentium IV 2.7 GHz machine with 512 MB of RAM. For CBA and MCAR, we used a minsupp of 2% and a minconf of 50%, since these are among the values suggested by their authors. For L3, we used the standard values for all parameters suggested by its authors.

Table 3 shows the number of rules derived when the different pruning approaches are used. The results shown in column two were derived using the CBA algorithm; for column three we used the MCAR algorithm, and for column four the L3 algorithm. It is obvious from the numbers in Table 3 that algorithms which use the lazy pruning approach generate many more rules than those that employ the other approaches. In particular, for all the classification data sets we considered, the L3 algorithm produces more rules than the MCAR and CBA techniques. One of the principal reasons lazy pruning algorithms generate large numbers of rules is that they store rules that do not cover even a single training data object in the classifier. For example, while constructing the classifier, the L3 algorithm evaluates every rule to check whether or not it covers a training data object. If a rule correctly covers a training object, it is added to the classifier; otherwise it is removed. However, rules which have never been tested are also added to the classifier, as spare rules. These spare rules are often in the order of thousands, and even tens of thousands. Producing a very large number of rules raises problems of user understandability and maintenance. Furthermore, the classification time may increase, since the classification system must iterate through the rule set; these problems limit the use of lazy associative algorithms.

Unlike the lazy pruning approach, the database coverage method eliminates the spare rules, which explains its moderately sized classifiers. Specifically, the MCAR and CBA algorithms generate classifiers of reasonable size compared with L3, from which domain users can benefit. In fact, for most of the data sets we considered, the classifiers produced when the database coverage heuristic is used can be easily maintained and interpreted by end users. Moreover, the CBA algorithm, which utilises both database coverage and pessimistic error pruning, cuts down the size of the classifiers even further; this can be seen by comparing the numbers in column two with those in column three of Table 3.

Table 4 gives the accuracy (%) figures derived by the L3, CBA and MCAR algorithms from twelve of the benchmarks we considered. We also report the accuracy derived without using any pruning method, as shown in column 7. The accuracy numbers were generated using a support of 2% and a confidence of 50% for the CBA and MCAR algorithms; for L3, we used the default parameters suggested by its respective authors. The figures in Table 4 indicate that the L3 algorithm outperforms CBA and MCAR on five classification data sets. L3 also produced, on average, higher classification accuracy than when no pruning at all is applied. In fact, L3 achieved on average +0.69% and +0.21% higher prediction rates than the CBA and MCAR algorithms, respectively. Moreover, on average, L3 achieved higher accuracy than an associative classification method that uses no pruning at all.

In the classification step, when the primary rules fail to classify a test object, lazy pruning methods such as L3 use the spare rules to classify that object, which explains the slight increase in accuracy over CBA and MCAR. In other words, unlike the CBA and MCAR algorithms, which fall back on the default class label when the rules in their classifiers fail to classify a test object, the L3 algorithm utilises the spare rules.

6. Conclusions

Associative classification is an important data mining task that has recently attracted many researchers, since it derives highly predictive classification systems. We have surveyed different pruning heuristics in associative classification, such as redundant rule pruning, database coverage and lazy pruning. Moreover, experiments using fourteen different classification data sets were conducted to compare three popular associative algorithms that utilise different pruning methods, i.e. database coverage, lazy pruning and pessimistic error. Our bases of comparison are classifier size and accuracy rate. The results revealed that algorithms that utilise lazy pruning (L3) produce very large classifiers, which are difficult for domain experts to maintain or understand and whose use is consequently limited, although L3 generates on average +0.69% and +0.21% higher accuracy rates than CBA and MCAR, respectively. On the other hand, associative algorithms that utilise database coverage and/or pessimistic error estimation pruning, such as MCAR and CBA, produce moderately sized classifiers that are easy for end users to maintain and interpret. The results also pointed out the need for additional constraints during pruning, in order to decrease further the size of the resulting classifiers. In the near future, we will investigate the possibility of creating a new hybrid pruning method.

Data       Size    # of Classes    CBA    MCAR    L3    No Pruning
Led
Breast
Pima
Tic-tac
Glass
Diabetes
Iris
Cleve
Heart
Labor
Wine
Zoo
Average

Table 4. Accuracy of the CBA, L3 and MCAR algorithms using ten-fold cross validation

References

Adamo, J. (2006). Association rule based classifier built via direct enumeration, online pruning and genetic algorithm based rule decimation. Artificial Intelligence and Applications 2006.
Agrawal, R., Srikant, R. (1994). Fast algorithms for mining association rules. Proceedings of the 20th International Conference on Very Large Data Bases.
Baralis, E., Torino, P. (2002). A lazy approach to pruning classification rules. Proceedings of the 2002 IEEE ICDM'02, p. 35.

Clark, P., Boswell, R. (1991). Rule induction with CN2: some recent improvements. In: Y. Kodratoff (ed.), Machine Learning, EWSL-91. Berlin: Springer-Verlag.
Cohen, W. (1995). Fast effective rule induction. Proceedings of the 12th International Conference on Machine Learning. Morgan Kaufmann, CA.
Dong, G., Li, J. (1999). Efficient mining of emerging patterns: discovering trends and differences. Proceedings of the International Conference on Knowledge Discovery and Data Mining.
Duda, R., Hart, P. (1973). Pattern classification and scene analysis. John Wiley & Sons.
Frank, E., Witten, I. (1998). Generating accurate rule sets without global optimisation. Proceedings of the Fifteenth International Conference on Machine Learning. Morgan Kaufmann, Madison, Wisconsin.
Freitas, A. (2000). Understanding the crucial difference between classification and association rule discovery. ACM SIGKDD Explorations Newsletter, 2(1).
Li, W., Han, J., Pei, J. (2001). CMAR: accurate and efficient classification based on multiple-class association rules. Proceedings of ICDM'01. San Jose, CA.
Liu, B., Hsu, W., Ma, Y. (1998). Integrating classification and association rule mining. Proceedings of KDD. New York, NY.
Liu, B., Hsu, W., Ma, Y. (1999). Mining association rules with multiple minimum supports. Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. San Diego, California.
Meretakis, D., Wüthrich, B. (1999). Extending naïve Bayes classifiers using long itemsets. Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. San Diego, California.
Merz, C., Murphy, P. (1996). UCI repository of machine learning databases. Irvine, CA: University of California, Department of Information and Computer Science.
Tan, P.-N., Steinbach, M., Kumar, V. (2005). Introduction to data mining. Addison Wesley.
Thabtah, F., Cowling, P., Peng, Y. (2005). MCAR: multi-class classification based on association rule approach. Proceedings of the 3rd IEEE International Conference on Computer Systems and Applications. Cairo, Egypt.
Thabtah, F., Cowling, P., Peng, Y. (2004). MMAC: a new multi-class, multi-label associative classification approach. Proceedings of the Fourth IEEE International Conference on Data Mining (ICDM'04). Brighton, UK. (Nominated for the best paper award.)
Quinlan, J. (1993). C4.5: programs for machine learning. San Mateo, CA: Morgan Kaufmann.
Quinlan, J. (1987). Simplifying decision trees. International Journal of Man-Machine Studies, 27(3).
Quinlan, J. (1979). Discovering rules from large collections of examples: a case study. In: D. Michie (ed.), Expert Systems in the Micro-electronic Age. Edinburgh University Press, Edinburgh.
Snedecor, W., Cochran, W. (1989). Statistical methods, eighth edition. Iowa State University Press.
Antonie, M., Zaïane, O., Coman, A. (2003). Associative classifiers for medical images. Lecture Notes in Artificial Intelligence 2797, Mining Multimedia and Complex Data. Springer-Verlag.
Witten, I., Frank, E. (2000). Data mining: practical machine learning tools and techniques with Java implementations. San Francisco: Morgan Kaufmann.
Zaiane, O., Antonie, M. (2005). Pruning and tuning rules for associative classifiers. Ninth International Conference on Knowledge-Based Intelligent Information & Engineering Systems (KES'05). Melbourne, Australia.
CBA (1998).
Yin, X., Han, J. (2003). CPAR: classification based on predictive association rules. Proceedings of SDM 2003. San Francisco, CA.
WEKA (2000). Data mining software in Java.


More information

Optimization using Ant Colony Algorithm

Optimization using Ant Colony Algorithm Optimization using Ant Colony Algorithm Er. Priya Batta 1, Er. Geetika Sharmai 2, Er. Deepshikha 3 1Faculty, Department of Computer Science, Chandigarh University,Gharaun,Mohali,Punjab 2Faculty, Department

More information

Understanding Rule Behavior through Apriori Algorithm over Social Network Data

Understanding Rule Behavior through Apriori Algorithm over Social Network Data Global Journal of Computer Science and Technology Volume 12 Issue 10 Version 1.0 Type: Double Blind Peer Reviewed International Research Journal Publisher: Global Journals Inc. (USA) Online ISSN: 0975-4172

More information

Improving Quality of Products in Hard Drive Manufacturing by Decision Tree Technique

Improving Quality of Products in Hard Drive Manufacturing by Decision Tree Technique Improving Quality of Products in Hard Drive Manufacturing by Decision Tree Technique Anotai Siltepavet 1, Sukree Sinthupinyo 2 and Prabhas Chongstitvatana 3 1 Computer Engineering, Chulalongkorn University,

More information

Discovery of Multi-level Association Rules from Primitive Level Frequent Patterns Tree

Discovery of Multi-level Association Rules from Primitive Level Frequent Patterns Tree Discovery of Multi-level Association Rules from Primitive Level Frequent Patterns Tree Virendra Kumar Shrivastava 1, Parveen Kumar 2, K. R. Pardasani 3 1 Department of Computer Science & Engineering, Singhania

More information

Technical Report TUD KE Jan-Nikolas Sulzmann, Johannes Fürnkranz

Technical Report TUD KE Jan-Nikolas Sulzmann, Johannes Fürnkranz Technische Universität Darmstadt Knowledge Engineering Group Hochschulstrasse 10, D-64289 Darmstadt, Germany http://www.ke.informatik.tu-darmstadt.de Technical Report TUD KE 2008 03 Jan-Nikolas Sulzmann,

More information

Evaluating the Replicability of Significance Tests for Comparing Learning Algorithms

Evaluating the Replicability of Significance Tests for Comparing Learning Algorithms Evaluating the Replicability of Significance Tests for Comparing Learning Algorithms Remco R. Bouckaert 1,2 and Eibe Frank 2 1 Xtal Mountain Information Technology 215 Three Oaks Drive, Dairy Flat, Auckland,

More information

Mining High Average-Utility Itemsets

Mining High Average-Utility Itemsets Proceedings of the 2009 IEEE International Conference on Systems, Man, and Cybernetics San Antonio, TX, USA - October 2009 Mining High Itemsets Tzung-Pei Hong Dept of Computer Science and Information Engineering

More information

Bing Liu, Minqing Hu and Wynne Hsu

Bing Liu, Minqing Hu and Wynne Hsu From: AAAI- Proceedings. Copyright, AAAI (www.aaai.org). All rights reserved. Intuitive Representation of Decision Trees Using General Rules and Exceptions Bing Liu, Minqing Hu and Wynne Hsu School of

More information

An Improved Apriori Algorithm for Association Rules

An Improved Apriori Algorithm for Association Rules Research article An Improved Apriori Algorithm for Association Rules Hassan M. Najadat 1, Mohammed Al-Maolegi 2, Bassam Arkok 3 Computer Science, Jordan University of Science and Technology, Irbid, Jordan

More information

Eager, Lazy and Hybrid Algorithms for Multi-Criteria Associative Classification

Eager, Lazy and Hybrid Algorithms for Multi-Criteria Associative Classification Eager, Lazy and Hybrid Algorithms for Multi-Criteria Associative Classification Adriano Veloso 1, Wagner Meira Jr 1 1 Computer Science Department Universidade Federal de Minas Gerais (UFMG) Belo Horizonte

More information

Applying Objective Interestingness Measures. in Data Mining Systems. Robert J. Hilderman and Howard J. Hamilton. Department of Computer Science

Applying Objective Interestingness Measures. in Data Mining Systems. Robert J. Hilderman and Howard J. Hamilton. Department of Computer Science Applying Objective Interestingness Measures in Data Mining Systems Robert J. Hilderman and Howard J. Hamilton Department of Computer Science University of Regina Regina, Saskatchewan, Canada SS 0A fhilder,hamiltong@cs.uregina.ca

More information

Mining Generalised Emerging Patterns

Mining Generalised Emerging Patterns Mining Generalised Emerging Patterns Xiaoyuan Qian, James Bailey, Christopher Leckie Department of Computer Science and Software Engineering University of Melbourne, Australia {jbailey, caleckie}@csse.unimelb.edu.au

More information

DIVERSITY-BASED INTERESTINGNESS MEASURES FOR ASSOCIATION RULE MINING

DIVERSITY-BASED INTERESTINGNESS MEASURES FOR ASSOCIATION RULE MINING DIVERSITY-BASED INTERESTINGNESS MEASURES FOR ASSOCIATION RULE MINING Huebner, Richard A. Norwich University rhuebner@norwich.edu ABSTRACT Association rule interestingness measures are used to help select

More information

Univariate and Multivariate Decision Trees

Univariate and Multivariate Decision Trees Univariate and Multivariate Decision Trees Olcay Taner Yıldız and Ethem Alpaydın Department of Computer Engineering Boğaziçi University İstanbul 80815 Turkey Abstract. Univariate decision trees at each

More information

Data Mining Part 5. Prediction

Data Mining Part 5. Prediction Data Mining Part 5. Prediction 5.4. Spring 2010 Instructor: Dr. Masoud Yaghini Outline Using IF-THEN Rules for Classification Rule Extraction from a Decision Tree 1R Algorithm Sequential Covering Algorithms

More information

Handling Missing Values via Decomposition of the Conditioned Set

Handling Missing Values via Decomposition of the Conditioned Set Handling Missing Values via Decomposition of the Conditioned Set Mei-Ling Shyu, Indika Priyantha Kuruppu-Appuhamilage Department of Electrical and Computer Engineering, University of Miami Coral Gables,

More information

Evolving SQL Queries for Data Mining

Evolving SQL Queries for Data Mining Evolving SQL Queries for Data Mining Majid Salim and Xin Yao School of Computer Science, The University of Birmingham Edgbaston, Birmingham B15 2TT, UK {msc30mms,x.yao}@cs.bham.ac.uk Abstract. This paper

More information

Improving Classifier Performance by Imputing Missing Values using Discretization Method

Improving Classifier Performance by Imputing Missing Values using Discretization Method Improving Classifier Performance by Imputing Missing Values using Discretization Method E. CHANDRA BLESSIE Assistant Professor, Department of Computer Science, D.J.Academy for Managerial Excellence, Coimbatore,

More information

A Comparison of Text-Categorization Methods applied to N-Gram Frequency Statistics

A Comparison of Text-Categorization Methods applied to N-Gram Frequency Statistics A Comparison of Text-Categorization Methods applied to N-Gram Frequency Statistics Helmut Berger and Dieter Merkl 2 Faculty of Information Technology, University of Technology, Sydney, NSW, Australia hberger@it.uts.edu.au

More information

Association Rule Mining from XML Data

Association Rule Mining from XML Data 144 Conference on Data Mining DMIN'06 Association Rule Mining from XML Data Qin Ding and Gnanasekaran Sundarraj Computer Science Program The Pennsylvania State University at Harrisburg Middletown, PA 17057,

More information

I. INTRODUCTION. Keywords : Spatial Data Mining, Association Mining, FP-Growth Algorithm, Frequent Data Sets

I. INTRODUCTION. Keywords : Spatial Data Mining, Association Mining, FP-Growth Algorithm, Frequent Data Sets 2017 IJSRSET Volume 3 Issue 5 Print ISSN: 2395-1990 Online ISSN : 2394-4099 Themed Section: Engineering and Technology Emancipation of FP Growth Algorithm using Association Rules on Spatial Data Sets Sudheer

More information

Lecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R,

Lecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R, Lecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R, statistics foundations 5 Introduction to D3, visual analytics

More information

The Comparison of CBA Algorithm and CBS Algorithm for Meteorological Data Classification Mohammad Iqbal, Imam Mukhlash, Hanim Maria Astuti

The Comparison of CBA Algorithm and CBS Algorithm for Meteorological Data Classification Mohammad Iqbal, Imam Mukhlash, Hanim Maria Astuti Information Systems International Conference (ISICO), 2 4 December 2013 The Comparison of CBA Algorithm and CBS Algorithm for Meteorological Data Classification Mohammad Iqbal, Imam Mukhlash, Hanim Maria

More information

ISSN: (Online) Volume 3, Issue 9, September 2015 International Journal of Advance Research in Computer Science and Management Studies

ISSN: (Online) Volume 3, Issue 9, September 2015 International Journal of Advance Research in Computer Science and Management Studies ISSN: 2321-7782 (Online) Volume 3, Issue 9, September 2015 International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online

More information

Pamba Pravallika 1, K. Narendra 2

Pamba Pravallika 1, K. Narendra 2 2018 IJSRSET Volume 4 Issue 1 Print ISSN: 2395-1990 Online ISSN : 2394-4099 Themed Section : Engineering and Technology Analysis on Medical Data sets using Apriori Algorithm Based on Association Rules

More information

Web Page Classification using FP Growth Algorithm Akansha Garg,Computer Science Department Swami Vivekanad Subharti University,Meerut, India

Web Page Classification using FP Growth Algorithm Akansha Garg,Computer Science Department Swami Vivekanad Subharti University,Meerut, India Web Page Classification using FP Growth Algorithm Akansha Garg,Computer Science Department Swami Vivekanad Subharti University,Meerut, India Abstract - The primary goal of the web site is to provide the

More information

Raunak Rathi 1, Prof. A.V.Deorankar 2 1,2 Department of Computer Science and Engineering, Government College of Engineering Amravati

Raunak Rathi 1, Prof. A.V.Deorankar 2 1,2 Department of Computer Science and Engineering, Government College of Engineering Amravati Analytical Representation on Secure Mining in Horizontally Distributed Database Raunak Rathi 1, Prof. A.V.Deorankar 2 1,2 Department of Computer Science and Engineering, Government College of Engineering

More information

A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection (Kohavi, 1995)

A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection (Kohavi, 1995) A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection (Kohavi, 1995) Department of Information, Operations and Management Sciences Stern School of Business, NYU padamopo@stern.nyu.edu

More information

Real World Performance of Association Rule Algorithms

Real World Performance of Association Rule Algorithms To appear in KDD 2001 Real World Performance of Association Rule Algorithms Zijian Zheng Blue Martini Software 2600 Campus Drive San Mateo, CA 94403, USA +1 650 356 4223 zijian@bluemartini.com Ron Kohavi

More information

Efficient Pairwise Classification

Efficient Pairwise Classification Efficient Pairwise Classification Sang-Hyeun Park and Johannes Fürnkranz TU Darmstadt, Knowledge Engineering Group, D-64289 Darmstadt, Germany Abstract. Pairwise classification is a class binarization

More information

An approach to calculate minimum support-confidence using MCAR with GA

An approach to calculate minimum support-confidence using MCAR with GA An approach to calculate minimum support-confidence using MCAR with GA Brijkishor Kumar Gupta Research Scholar Sri Satya Sai Institute Of Science & Engineering, Sehore Gajendra Singh Chandel Reader Sri

More information

Uncertain Data Classification Using Decision Tree Classification Tool With Probability Density Function Modeling Technique

Uncertain Data Classification Using Decision Tree Classification Tool With Probability Density Function Modeling Technique Research Paper Uncertain Data Classification Using Decision Tree Classification Tool With Probability Density Function Modeling Technique C. Sudarsana Reddy 1 S. Aquter Babu 2 Dr. V. Vasu 3 Department

More information

Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques

Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques 24 Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques Ruxandra PETRE

More information

Hybrid Feature Selection for Modeling Intrusion Detection Systems

Hybrid Feature Selection for Modeling Intrusion Detection Systems Hybrid Feature Selection for Modeling Intrusion Detection Systems Srilatha Chebrolu, Ajith Abraham and Johnson P Thomas Department of Computer Science, Oklahoma State University, USA ajith.abraham@ieee.org,

More information

Research on Applications of Data Mining in Electronic Commerce. Xiuping YANG 1, a

Research on Applications of Data Mining in Electronic Commerce. Xiuping YANG 1, a International Conference on Education Technology, Management and Humanities Science (ETMHS 2015) Research on Applications of Data Mining in Electronic Commerce Xiuping YANG 1, a 1 Computer Science Department,

More information

Data Mining. Part 2. Data Understanding and Preparation. 2.4 Data Transformation. Spring Instructor: Dr. Masoud Yaghini. Data Transformation

Data Mining. Part 2. Data Understanding and Preparation. 2.4 Data Transformation. Spring Instructor: Dr. Masoud Yaghini. Data Transformation Data Mining Part 2. Data Understanding and Preparation 2.4 Spring 2010 Instructor: Dr. Masoud Yaghini Outline Introduction Normalization Attribute Construction Aggregation Attribute Subset Selection Discretization

More information

Analysis of a Population of Diabetic Patients Databases in Weka Tool P.Yasodha, M. Kannan

Analysis of a Population of Diabetic Patients Databases in Weka Tool P.Yasodha, M. Kannan International Journal of Scientific & Engineering Research Volume 2, Issue 5, May-2011 1 Analysis of a Population of Diabetic Patients Databases in Weka Tool P.Yasodha, M. Kannan Abstract - Data mining

More information

CMAR: Accurate and Efficient Classification Based on Multiple Class-Association Rules

CMAR: Accurate and Efficient Classification Based on Multiple Class-Association Rules CMAR: Accurate and fficient Classification Based on Multiple Class-Association Rules Wenmin Li Jiawei an Jian Pei School of Computing Science, Simon Fraser University Burnaby, B.C., Canada V5A 1S6 -mail:

More information

A Survey on Moving Towards Frequent Pattern Growth for Infrequent Weighted Itemset Mining

A Survey on Moving Towards Frequent Pattern Growth for Infrequent Weighted Itemset Mining A Survey on Moving Towards Frequent Pattern Growth for Infrequent Weighted Itemset Mining Miss. Rituja M. Zagade Computer Engineering Department,JSPM,NTC RSSOER,Savitribai Phule Pune University Pune,India

More information