Challenges and Interesting Research Directions in Associative Classification

Similar documents
Pruning Techniques in Associative Classification: Survey and Comparison

Rule Pruning in Associative Classification Mining

A review of associative classification mining

Class Strength Prediction Method for Associative Classification

International Journal of Computer Science Trends and Technology (IJCST) Volume 5 Issue 4, Jul Aug 2017

Combinatorial Approach of Associative Classification

A Classification Rules Mining Method based on Dynamic Rules' Frequency

Enhanced Associative classification based on incremental mining Algorithm (E-ACIM)

Review and Comparison of Associative Classification Data Mining Approaches

A SURVEY OF DIFFERENT ASSOCIATIVE CLASSIFICATION ALGORITHMS

A Novel Algorithm for Associative Classification

Temporal Weighted Association Rule Mining for Classification

Structure of Association Rule Classifiers: a Review

A Survey on Algorithms for Market Basket Analysis

A dynamic rule-induction method for classification in data mining

Using Association Rules for Better Treatment of Missing Values

A Conflict-Based Confidence Measure for Associative Classification

ASSOCIATIVE CLASSIFICATION WITH KNN

A neural-networks associative classification method for association rule mining

Estimating Missing Attribute Values Using Dynamically-Ordered Attribute Trees

An Evolutionary Algorithm for Mining Association Rules Using Boolean Approach

ACN: An Associative Classifier with Negative Rules

AN IMPROVISED FREQUENT PATTERN TREE BASED ASSOCIATION RULE MINING TECHNIQUE WITH MINING FREQUENT ITEM SETS ALGORITHM AND A MODIFIED HEADER TABLE

COMPARATIVE STUDY ON ASSOCIATIVE CLASSIFICATION TECHNIQUES

Categorization of Sequential Data using Associative Classifiers

DISCOVERING ACTIVE AND PROFITABLE PATTERNS WITH RFM (RECENCY, FREQUENCY AND MONETARY) SEQUENTIAL PATTERN MINING A CONSTRAINT BASED APPROACH

An Improved Apriori Algorithm for Association Rules

An Efficient Algorithm for Finding the Support Count of Frequent 1-Itemsets in Frequent Pattern Mining

Feature Selection Based on Relative Attribute Dependency: An Experimental Study

Infrequent Weighted Itemset Mining Using SVM Classifier in Transaction Dataset

An approach to calculate minimum support-confidence using MCAR with GA

Data Mining. 3.3 Rule-Based Classification. Fall Instructor: Dr. Masoud Yaghini. Rule-Based Classification

CS423: Data Mining. Introduction. Jakramate Bootkrajang. Department of Computer Science Chiang Mai University

Lecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R,

Graph Based Approach for Finding Frequent Itemsets to Discover Association Rules

Tendency Mining in Dynamic Association Rules Based on SVM Classifier

Introducing Partial Matching Approach in Association Rules for Better Treatment of Missing Values

Mining High Average-Utility Itemsets

Mining Quantitative Association Rules on Overlapped Intervals

Improved Frequent Pattern Mining Algorithm with Indexing

Data Mining Part 5. Prediction

Product presentations can be more intelligently planned

A Novel Rule Ordering Approach in Classification Association Rule Mining

CloNI: clustering of √N-interval discretization

Semi-Supervised Clustering with Partial Background Information

The Comparison of CBA Algorithm and CBS Algorithm for Meteorological Data Classification Mohammad Iqbal, Imam Mukhlash, Hanim Maria Astuti

A Two Stage Zone Regression Method for Global Characterization of a Project Database

A Decremental Algorithm for Maintaining Frequent Itemsets in Dynamic Databases *

SEQUENTIAL PATTERN MINING FROM WEB LOG DATA

A Novel Rule Weighting Approach in Classification Association Rule Mining

Uncertain Data Classification Using Decision Tree Classification Tool With Probability Density Function Modeling Technique

Optimization using Ant Colony Algorithm

A Technical Analysis of Market Basket by using Association Rule Mining and Apriori Algorithm

WEIGHTED K NEAREST NEIGHBOR CLASSIFICATION ON FEATURE PROJECTIONS 1

Chapter 3: Supervised Learning

A Literature Review of Modern Association Rule Mining Techniques

COMPACT WEIGHTED CLASS ASSOCIATION RULE MINING USING INFORMATION GAIN

Hierarchical Online Mining for Associative Rules

UAPRIORI: AN ALGORITHM FOR FINDING SEQUENTIAL PATTERNS IN PROBABILISTIC DATA

Data Mining. Part 2. Data Understanding and Preparation. 2.4 Data Transformation. Spring Instructor: Dr. Masoud Yaghini. Data Transformation

Associating Terms with Text Categories

Data Mining. Chapter 1: Introduction. Adapted from materials by Jiawei Han, Micheline Kamber, and Jian Pei

Association Rule Mining. Introduction 46. Study core 46

Evolving SQL Queries for Data Mining

2. Department of Electronic Engineering and Computer Science, Case Western Reserve University

A Comparative Study of Selected Classification Algorithms of Data Mining

Predicting Missing Items in Shopping Carts

Discovery of Multi-level Association Rules from Primitive Level Frequent Patterns Tree

An Empirical Study on feature selection for Data Classification

C-NBC: Neighborhood-Based Clustering with Constraints

Association Technique in Data Mining and Its Applications

Study on Classifiers using Genetic Algorithm and Class based Rules Generation

Association Rule Mining from XML Data

Handling Missing Values via Decomposition of the Conditioned Set

A Survey on Moving Towards Frequent Pattern Growth for Infrequent Weighted Itemset Mining

Hybrid Feature Selection for Modeling Intrusion Detection Systems

Chapter 2. Related Work

Analysis of a Population of Diabetic Patients Databases in Weka Tool P.Yasodha, M. Kannan

An Effective Performance of Feature Selection with Classification of Data Mining Using SVM Algorithm

Web Page Classification using FP Growth Algorithm Akansha Garg,Computer Science Department Swami Vivekanad Subharti University,Meerut, India

Journal of Emerging Trends in Computing and Information Sciences

Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques

International Journal of Software and Web Sciences (IJSWS)

Upper bound tighter Item caps for fast frequent itemsets mining for uncertain data Implemented using splay trees. Shashikiran V 1, Murali S 2

Data Mining Part 3. Associations Rules

DATA MINING II - 1DL460

Real World Performance of Association Rule Algorithms

The Fuzzy Search for Association Rules with Interestingness Measure

Salah Alghyaline, Jun-Wei Hsieh, and Jim Z. C. Lai

Adaptive Metric Nearest Neighbor Classification

Optimized Class Association Rule Mining using Genetic Network Programming with Automatic Termination

A Comparative Study of Association Rules Mining Algorithms

Performance Analysis of Apriori Algorithm with Progressive Approach for Mining Data

A Graph-Based Approach for Mining Closed Large Itemsets

INFREQUENT WEIGHTED ITEM SET MINING USING NODE SET BASED ALGORITHM

Performance Analysis of Data Mining Algorithms

Tutorial on Assignment 3 in Data Mining 2009 Frequent Itemset and Association Rule Mining. Gyozo Gidofalvi Uppsala Database Laboratory

Ordering attributes for missing values prediction and data classification

DATA ANALYSIS I. Types of Attributes Sparse, Incomplete, Inaccurate Data

Mining of Web Server Logs using Extended Apriori Algorithm


Challenges and Interesting Research Directions in Associative Classification

Fadi Thabtah
Department of Management Information Systems, Philadelphia University, Amman, Jordan
Email: FFayez@philadelphia.edu.jo

Abstract

Utilising association rule discovery methods to construct classification systems in data mining is known as associative classification. In the last few years, associative classification algorithms such as CBA, CMAR and MMAC have shown experimentally that they generate more accurate classifiers than traditional classification approaches such as decision trees and rule induction. However, there is room to improve further the performance and/or the outcome quality of these algorithms. This paper highlights new research directions within the associative classification approach which could improve solution quality and performance and also minimise drawbacks and limitations. We discuss potential research areas such as incremental learning, noise in test data sets, the exponential growth of rules and many others.

1. Introduction

Since its introduction, association rule discovery has continued to be an active research area in data mining. Association rule discovery finds associations among items in a transactional database [1]. Classification is another important data mining task. The goal of classification is to build a set of rules (a classifier) from labelled examples, known as the training data set, in order to classify previously unseen examples, known as the test data set, as accurately as possible. The primary difference between classification and association rule discovery is that the former aims to predict the class attribute in the test data set, whereas the latter aims to discover correlations among items in a database.

Associative classification (AC) employs association rule discovery methods to find the rules from classification benchmarks. In 1998, AC was successfully used to build classifiers by [7] and later attracted many researchers, e.g. [6, 13], from the data mining and machine learning communities. Several studies [6, 7, 11, 13] provided evidence that AC algorithms are able to extract more accurate classifiers than traditional classification techniques such as decision trees [9], rule induction [3] and probabilistic [4] approaches. However, there are some challenges and issues in AC (described in Section 3) which, if addressed, would make this approach more widely used, especially for real world classification problems. Examples of such challenges are incremental learning, noise in test data sets, and the extraction of multi-label rules. The goal of this paper is to discuss drawbacks and limitations of the AC approach and to highlight some of its important future research directions. This could be useful for researchers who are interested in exploring this scientific field.

The rest of the paper is organised as follows: AC and a simple example demonstrating its main phases are given in Section 2. Important issues and future trends in AC are raised in Section 3. Finally, Section 4 is devoted to conclusions.

2. Associative Classification Problem

In associative classification, the training data set T has m distinct attributes A1, A2, ..., Am, and C is a list of class labels. The number of rows in T is denoted |T|. Attributes can be categorical (meaning they take a value from a finite set of possible values) or continuous (real or integer valued). In the case of categorical attributes, all possible values are mapped to a set of positive integers. For continuous attributes, a discretisation method is first used to transform these attributes into categorical ones.

Definition 1: An item can be described as an attribute name Ai and its value ai, denoted (Ai, ai).
Definition 2: The j-th row (a training object) in T can be described as a list of items (Aj1, aj1), ..., (Ajk, ajk), plus a class denoted by cj.
Definition 3: An itemset can be described as a set of disjoint attribute values contained in a training object, denoted <(Ai1, ai1), ..., (Aik, aik)>.
Definition 4: A ruleitem r is of the form <cond, c>, where the condition cond is an itemset and c ∈ C is a class.
Definition 5: The actual occurrence (actoccr) of a ruleitem r in T is the number of rows in T that match r's itemset.
Definition 6: The support count (suppcount) of a ruleitem r = <cond, c> is the number of rows in T that match r's itemset and belong to class c.
Definition 7: The occurrence (occitm) of an itemset I in T is the number of rows in T that match I.
Definition 8: An itemset I passes the minimum support (minsupp) threshold if occitm(I)/|T| ≥ minsupp. Such an itemset is called a frequent itemset.
Definition 9: A ruleitem r passes the minsupp threshold if suppcount(r)/|T| ≥ minsupp. Such a ruleitem is said to be a frequent ruleitem.
Definition 10: A ruleitem r passes the minimum confidence (minconf) threshold if suppcount(r)/actoccr(r) ≥ minconf.
Definition 11: A rule is represented in the form cond → cj, where the left-hand side of the rule (the antecedent) is an itemset and the right-hand side (the consequent) is a class label.

A classifier is a mapping H: A → Y, where A is the set of itemsets and Y is the set of class labels. The main task of AC is to construct a classifier that is able to predict the classes of previously unseen data as accurately as possible. In other words, the goal is to find a classifier h ∈ H that maximises the probability that h(a) = y for each test data object.

Consider the training data set shown in Table 1, which represents whether or not a person is likely to buy a new car. Assume that minsupp = 2/7 (a support count of two rows) and minconf = 50%. The frequent ruleitems discovered in the learning step (phase 1), along with their support and confidence values, are shown in Table 2.

Table 1: Car sales training data

Age      Income   Has a car   Buy/class
senior   middle   n           yes
youth    low      y           no
junior   high     y           yes
youth    middle   y           yes
senior   high     n           yes
junior   low      n           no
senior   middle   n           no

Table 2: Frequent ruleitems from Table 1

Itemset        Class   Support   Confidence
{low}          no      2/7       2/2
{high}         yes     2/7       2/2
{middle}       yes     2/7       2/3
{senior}       yes     2/7       2/3
{y}            yes     2/7       2/3
{n}            yes     2/7       2/4
{n}            no      2/7       2/4
{senior, n}    yes     2/7       2/3
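To make the learning step concrete, the following Python sketch (our own illustration, not part of the original paper; the function frequent_ruleitems and all variable names are ours) enumerates candidate ruleitems over the Table 1 data and keeps those passing the thresholds of Definitions 9 and 10:

    from itertools import combinations

    # Table 1 rows as (set of (attribute, value) items, class label).
    rows = [
        ({("age", "senior"), ("income", "middle"), ("car", "n")}, "yes"),
        ({("age", "youth"), ("income", "low"), ("car", "y")}, "no"),
        ({("age", "junior"), ("income", "high"), ("car", "y")}, "yes"),
        ({("age", "youth"), ("income", "middle"), ("car", "y")}, "yes"),
        ({("age", "senior"), ("income", "high"), ("car", "n")}, "yes"),
        ({("age", "junior"), ("income", "low"), ("car", "n")}, "no"),
        ({("age", "senior"), ("income", "middle"), ("car", "n")}, "no"),
    ]
    minsupp, minconf = 2 / 7, 0.5

    def frequent_ruleitems(rows, max_size=2):
        items = {item for r, _ in rows for item in r}
        found = []
        for k in range(1, max_size + 1):
            for cond in combinations(sorted(items), k):
                cond = set(cond)
                matching = [c for r, c in rows if cond <= r]   # actoccr rows
                for cls in set(matching):
                    suppcount = matching.count(cls)            # Definition 6
                    if (suppcount / len(rows) >= minsupp       # Definition 9
                            and suppcount / len(matching) >= minconf):  # Def. 10
                        found.append((sorted(cond), cls, suppcount))
        return found

Run on the seven rows above, the sketch recovers the ruleitems of Table 2, e.g. <{(income, low)}, no> with support 2/7 and confidence 2/2.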
3. Associative Classification Challenges and Interesting Research Directions

3.1 Multi-label Rule Classifiers

Existing AC techniques keep only the most obvious class correlated with a rule and simply ignore the other classes, even though such classes, when associated with these rules, may be significant and useful. For example, assume that an itemset a is stored in a database and is associated with three potential classes f1, f2 and f3, occurring 35, 34 and 31 times, respectively. Assume that a holds enough support and confidence when associated with each of the three classes. Typically, existing AC techniques generate only one rule for itemset a, i.e. a → f1, since class f1 has the largest frequency among the classes associated with a. The other two potential rules, a → f2 and a → f3, are simply discarded. However, these two rules may play a useful role in the prediction step because they are highly representative and hold useful information.

The difference between the chosen rule and the two ignored rules is quite small. For itemset a, a rule like a → f1 ∨ f2 ∨ f3, which holds all potential classes that survive the support and confidence thresholds, is more appropriate for decision makers in many applications. A recently proposed multiple-label algorithm called MMAC [11] could be seen as a starting point for research on multi-label AC. MMAC generates classifiers that contain rules with multiple labels from multi-class and multi-label data, extracting important knowledge that would have been discarded by existing techniques. A rule in the MMAC classifier takes the form cond → c1 ∨ c2 ∨ ... ∨ cn, where cond is an itemset and the consequent is a list of ranked class labels, each of which is assigned a weight during the training step. The multiple classes in the consequent provide useful knowledge from which end users and decision makers may benefit. The MMAC approach employs a recursive learning phase that searches for the 1st, 2nd, ..., nth class associated with each itemset in the training data, rather than just looking for the dominant class.
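A rough Python sketch of the idea (ours, not the actual MMAC procedure, which employs recursive learning and rule ranking): given the class frequency counts of one itemset, keep every class label that survives the thresholds instead of only the dominant one.

    def multi_label_rule(class_counts, n_rows, minsupp, minconf):
        # class_counts: {class label: suppcount} for one itemset.
        actoccr = sum(class_counts.values())
        survivors = [(cls, cnt) for cls, cnt in class_counts.items()
                     if cnt / n_rows >= minsupp and cnt / actoccr >= minconf]
        # Rank surviving labels by frequency and weight each by its confidence,
        # yielding a rule of the form cond -> c1 v c2 v ... v cn.
        survivors.sort(key=lambda lc: -lc[1])
        return [(cls, cnt / actoccr) for cls, cnt in survivors]

For the example above, multi_label_rule({"f1": 35, "f2": 34, "f3": 31}, n_rows=1000, minsupp=0.03, minconf=0.3) keeps all three labels, producing a → f1 ∨ f2 ∨ f3 instead of discarding f2 and f3.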

Empirical studies [11] on various well-known multi-class benchmark problems, as well as a real world multi-label optimisation problem, show that MMAC outperformed popular AC algorithms such as CBA, and traditional techniques such as C4.5 and RIPPER, with reference to error rate. For applications such as medical diagnosis it is more appropriate to produce the list of all classes associated with the symptoms, based on their distribution frequencies in the training data. As a result, there is a need to develop algorithms for real world multi-class and multi-label classification data that consider, for each itemset, all available classes that pass certain user thresholds.

3.2 Rule Ranking

Sorting rules according to certain criteria plays an important role in the classification process, since the majority of AC algorithms, such as [6, 7, 13], use rule ranking procedures as the basis for selecting the classifier during pruning. In particular, the CBA and CMAR algorithms use database coverage pruning [7] to build their classifiers, where rules are tested according to their ranks. In addition, the ranking of rules plays an important role in the prediction step, as the top ranked rules are used more frequently than others in classifying test objects. The precedence of the rules is usually determined according to several parameters such as the support, confidence and length (cardinality) of a rule. In AC, normally a very small support is used, and since most classification data sets are dense, the expected number of rules with identical support, confidence and cardinality is high. For example, if someone mines the tic-tac-toe data set, downloaded from [14], with a minsupp of 2% and minconf of 50% using the CBA algorithm [7] and without any pruning, there will be numerous rules with the same support and confidence values. Specifically, the confidence, support and rule length for more than 16 rules are identical, and thus CBA has to discriminate between them using random selection. There have been a few attempts to consider other parameters in rule ranking besides support and confidence, such as the distribution frequency of class labels in the training data [10]. Experimental results [10] on a number of classification data sets revealed that the class distribution parameter is used frequently within the proposed algorithm and positively improves the accuracy of the generated classifiers. In particular, when the class distribution is considered after confidence, support and rule length, the accuracy of the derived classifiers improved on average by +0.6% and +0.40% over the (support, confidence) and (support, confidence, rule length) rule sorting approaches, respectively. This provides evidence that adding appropriate constraints to break ties slightly improves the predictive power of the classifiers.
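The tie-breaking schemes discussed above amount to a composite sort key. The following sketch (our own illustration; the exact criteria and their order differ between algorithms) ranks rules by confidence, then support, then shorter antecedent and, finally, the class distribution frequency used in [10]:

    from dataclasses import dataclass

    @dataclass
    class Rule:
        antecedent: frozenset   # the rule's itemset
        cls: str                # predicted class label
        confidence: float
        support: float
        class_freq: int         # frequency of cls in the training data

    def rank_rules(rules):
        # Higher confidence first, then higher support, then shorter rules,
        # then the more frequent class; any remaining ties still need a
        # random choice, as CBA does.
        return sorted(rules, key=lambda r: (-r.confidence, -r.support,
                                            len(r.antecedent), -r.class_freq))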
3.3 Noise in Test Data

Roughly speaking, a classifier is constructed from labelled data records and is later used to forecast the classes of previously unseen data records. Training and test data sets may contain noise, including missing or incorrect values inside records. One has to think carefully about the importance of missing or incorrect values in training or test data sets; often, only human experts in the application domains used to generate the data sets can make an informed assumption about the significance of missing or invalid values.

Several classification algorithms proposed in data mining produce classifiers with an acceptable error rate. However, most of these algorithms assume that all records in the test data set are complete and that no missing data are present. When test data sets suffer from missing attribute values or incomplete records, classification algorithms may produce classifiers with poor prediction accuracy, because these algorithms tend to tailor the training data set too much [9]. In real world applications it is common for training or test data to contain attributes with missing values. For instance, the labor and hepatitis data sets published in the UCI data repository [8] contain missing records. Thus, it is imperative to build classifiers that are able to accurately predict the classes of test data sets with missing attribute values. Such classifiers are normally called robust classifiers [5]. Unlike traditional classifiers, which assume that the test data are complete, robust classifiers deal with both existing and non-existing values in test data sets.

There have been some solutions to avoid noise in the training data sets. Naïve Bayes [4], for instance, ignores missing values during the computation of probabilities, and thus missing values have no effect on prediction since they are omitted. However, omitting missing values may not be the ideal solution, since these unknown values may provide a good deal of information. Other classification techniques, like CBA, assume that the absence of a value may itself be of some importance, and therefore treat missing values like any other known value in the training data set. If this is not the case, then missing values should be treated in a special way rather than simply being considered as another possible value that the attribute might take. Decision tree algorithms [9] deal with missing values using probabilities, which are calculated from the frequencies of the different values of an attribute at a particular node in the decision tree.

The problem of dealing with unknown values inside test data sets has not yet been explored well in the AC approach. One possible simple solution is to select the common value of the attribute that contains missing values from the training data set. The common value could be selected from the attribute values that occur with the same class to which the object with the missing value belongs. Each missing value for that attribute and its corresponding class in the training data set is then substituted with the common value, i.e. the value that has the largest frequency for that attribute in the training data set. We could also use common values from the test data set in the same way to substitute for missing attribute values. Another possible solution for missing values in test data sets is to use weights or probabilities, similar to the C4.5 algorithm.
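A minimal sketch of the class-conditional common-value substitution suggested above (our own Python illustration; the record layout and all names are hypothetical):

    from collections import Counter

    def impute_common_value(records, attr, missing="?"):
        # records: list of dicts holding attribute values plus a "class" key.
        # Learn the most frequent value of `attr` per class label...
        common = {}
        for r in records:
            if r[attr] != missing:
                common.setdefault(r["class"], Counter())[r[attr]] += 1
        # ...then substitute it for each missing value of the same class.
        for r in records:
            if r[attr] == missing and r["class"] in common:
                r[attr] = common[r["class"]].most_common(1)[0][0]
        return records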

3.4 Incremental Learning

Existing AC algorithms mine the training data set as a whole in order to produce the outcome. When data operations (adding, deleting or editing records) occur on the training data set, current algorithms have to scan the complete training data set once more in order to reflect the changes. Further, since in most application domains data are collected on a daily, weekly or monthly basis, training data sets can grow rapidly. As a result, the repetitive scan required each time a training data set is modified in order to update the set of rules is costly with regard to I/O and CPU time. Incremental AC algorithms, which keep the last mining results and only consider the data records that have been updated, are a more efficient approach and can lead to huge savings in computational time.

To explain the incremental mining problem in AC more precisely, consider a training data set T. The following operations may occur on T:

The original training data T can be incremented by T+ records (adding).
T- records can be removed from the original training data T (deleting).
T+ records can be added to T and T- records can be removed from T (updating).

The result of any of these operations on T is an updated training data set T'. The question is how the outcome (rules) of the original data set T can be updated to reflect the changes made to T without having to perform extensive computations. This problem can be divided further into sub-problems according to the possible ruleitems contained in T' after performing a data manipulation operation. For example, after inserting new records (T+), the ruleitems in T' can be divided into the following groups:

a. ruleitems that are frequent in T and in T+
b. ruleitems that are frequent in T and not frequent in T+
c. ruleitems that are frequent in T+ and not frequent in T
d. ruleitems that are frequent in neither T nor T+

The ruleitems in groups (a) and (b) can be identified in a straightforward manner. For instance, if ruleitem Y is frequent in T, then its support count in the updated training data T' is count_T'(Y) = count_T(Y) + count_T+(Y), where count_T(Y) is already known and count_T+(Y) can be obtained by scanning only T+. The challenge is to find the ruleitems that are not frequent in T but frequent in T+ (group c), since their support counts in T were not retained and cannot be determined by scanning T+ alone. There has been some research work on incremental association rule discovery, e.g. [12], which can be considered as a starting point for research on incremental AC.
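The straightforward part of the update, covering groups (a) and (b), can be sketched as follows (our own Python illustration; finding the counts of group (c) ruleitems would still require a rescan of T):

    def update_counts(known_counts, t_plus_rows, matches):
        # known_counts: {ruleitem: support count in T} for previously frequent
        # ruleitems; matches(ruleitem, row) tests whether a row supports it.
        updated = dict(known_counts)
        for row in t_plus_rows:
            for ruleitem in known_counts:
                if matches(ruleitem, row):
                    # count_T'(Y) = count_T(Y) + count_T+(Y)
                    updated[ruleitem] += 1
        return updated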
3.5 Rule Overlapping

Classic rule-based classification approaches such as rule induction and covering build the classifier in a heuristic way: once a rule is evaluated during the learning step, all training objects covered by it are discarded, so a training instance is covered by only a single rule. Association rule discovery, on the other hand, considers the correlations between all possible items in a database, and therefore rules overlap in their training objects; in other words, multiple rules could be generated from one database transaction. Since AC employs association rule methods to discover the rules, the rules it creates share training objects as well. In most existing AC techniques [2, 6, 7], when a rule is evaluated during the construction of the classifier, all its related training data objects are removed from the training data set by the pruning heuristics. However, these training objects may also be used by other potential rules during the training phase.

Consider, for instance, two rules r1: a ∧ b → c1 and r2: b → c1, and assume that r1 precedes r2 in the ranking. Assume that r1 covers rows (1, 2, 3), which are associated with class c1 in the training data, whereas r2 covers rows (1, 2, 3, 4, 5), of which rows (4, 5) are associated with class c2. Now, once r1 is evaluated and inserted into the classifier by an AC technique such as CBA or CMAR, all training objects associated with r1, i.e. rows (1, 2, 3), are removed by the database coverage pruning. The removal of the evaluated rule r1's training objects may influence other potential rules, such as r2, that share those objects. Consequently, after inserting r1 into the classifier, the statistically fittest class of rule r2 is no longer c1; rather, c2 becomes the fittest class, because it has the largest representation among r2's remaining rows, i.e. (4, 5), in the training data.
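The interaction can be made concrete with a small sketch of database coverage processing that recomputes each rule's fittest class over the rows that remain (our own illustration, not the exact CBA or CMAR procedure):

    from collections import Counter

    def database_coverage(rows, ranked_antecedents):
        # rows: list of (itemset, class); ranked_antecedents: rule itemsets in
        # precedence order. Each accepted rule removes the rows it covers, and
        # every later rule's fittest class is recomputed on the remaining rows.
        remaining = list(rows)
        classifier = []
        for antecedent in ranked_antecedents:
            covered = [(i, c) for i, c in remaining if antecedent <= i]
            if not covered:
                continue  # the rule no longer covers any remaining row
            fittest = Counter(c for _, c in covered).most_common(1)[0][0]
            classifier.append((antecedent, fittest))
            remaining = [rc for rc in remaining if rc not in covered]
        return classifier

Applied to the five-row example above, the sketch assigns c1 to r1 and, once rows (1, 2, 3) are consumed, assigns c2 rather than c1 to r2.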

The effect of removing the training data objects of each evaluated rule should be considered for all other candidate rules that use these objects. If the removal is not considered, the resulting classifier may contain rules that predict class labels that have a low representation, and sometimes no representation at all, in the training data. If the effect of the removal of training data objects for the evaluated rules is taken into account for the other potential rules in the training phase, a more realistic classifier that assigns the true class fitness to each rule will result.

4. Conclusions

Associative classification is becoming a common approach in classification since it extracts very competitive classifiers with regard to prediction accuracy when compared with rule induction, probabilistic and decision tree approaches. However, challenges such as the efficiency of rule discovery methods, the exponential growth of rules, rule ranking and noise in test data sets need more consideration. Furthermore, there are new research directions in associative classification which have not yet been explored, such as incremental learning, multi-label classifiers and rule overlapping. This paper has highlighted and discussed these challenges and potential research directions.

References

[1] Agrawal, R., and Srikant, R. (1994) Fast algorithms for mining association rules. Proceedings of the 20th International Conference on Very Large Data Bases (pp. 487-499).
[2] Baralis, E., and Garza, P. (2002) A lazy approach to pruning classification rules. Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM'02), (pp. 35-42).
[3] Cohen, W. (1995) Fast effective rule induction. Proceedings of the 12th International Conference on Machine Learning, (pp. 115-123). Morgan Kaufmann, CA.
[4] Duda, R., and Hart, P. (1973) Pattern classification and scene analysis. John Wiley & Sons.
[5] Hu, H., and Li, J. (2005) Using association rules to make rule-based classifiers robust. Proceedings of the Sixteenth Australasian Database Conference, (pp. 47-54). Newcastle, Australia.
[6] Li, W., Han, J., and Pei, J. (2001) CMAR: Accurate and efficient classification based on multiple class-association rules. Proceedings of ICDM'01 (pp. 369-376). San Jose, CA.
[7] Liu, B., Hsu, W., and Ma, Y. (1998) Integrating classification and association rule mining. Proceedings of KDD-98, (pp. 80-86). New York, NY.
[8] Merz, C., and Murphy, P. (1996) UCI repository of machine learning databases. Irvine, CA: University of California, Department of Information and Computer Science.
[9] Quinlan, J. (1993) C4.5: Programs for machine learning. San Mateo, CA: Morgan Kaufmann.
[10] Thabtah, F. (2006) Rule preference effect in associative classification mining. Journal of Information and Knowledge Management, 5(1):1-7.
[11] Thabtah, F., Cowling, P., and Peng, Y. (2004) MMAC: A new multi-class, multi-label associative classification approach. Proceedings of the Fourth IEEE International Conference on Data Mining (ICDM'04), (pp. 217-224). Brighton, UK. (Nominated for the best paper award.)
[12] Tsai, P., Lee, C., and Chen, A. (1999) An efficient approach for incremental association rule mining. Proceedings of the Third Pacific-Asia Conference on Methodologies for Knowledge Discovery and Data Mining, (pp. 74-83). London, UK.
[13] Yin, X., and Han, J. (2003) CPAR: Classification based on predictive association rules. Proceedings of the 2003 SIAM International Conference on Data Mining (pp. 331-335). San Francisco, CA.
[14] WEKA (2000): Data Mining Software in Java: http://www.cs.waikato.ac.nz/ml/weka.