A dynamic rule-induction method for classification in data mining


Journal of Management Analytics, 2015, Vol. 2, No. 3

A dynamic rule-induction method for classification in data mining

Issa Qabajeh a*, Fadi Thabtah b and Francisco Chiclana c

a E-Business Department, Canadian University of Dubai, Dubai, UAE; b Computing and Informatics Department, De Montfort University, Leicester, UK; c Centre for Computational Intelligence, De Montfort University, Leicester, UK

(Received 22 March 2015; revised 25 August 2015; accepted 31 August 2015)

Rule induction (RI) produces classifiers containing simple yet effective If-Then rules for decision makers. RI algorithms based on PRISM suffer from a few drawbacks, mainly related to rule pruning and to items (attribute values) shared among rules in the training data instances. In response to these two issues, a new dynamic rule induction (DRI) method is proposed. Whenever a rule is produced and its related training data instances are discarded, DRI updates the frequencies of the attribute values used to make the next in-line rule, so that they reflect the data deletion. The attribute value frequencies are therefore adjusted dynamically each time a rule is generated, rather than statically as in PRISM. This enables DRI to generate near-perfect rules and realistic classifiers. Experimental results using different University of California Irvine data sets show competitive performance of DRI with respect to error rate and classifier size when compared to other RI algorithms.

Keywords: data mining; classification rules; rule induction; expected accuracy

1. Introduction

With the rapid development of computer hardware and networks, companies have been able to capture massive amounts of data offline and online. These data usually hold crucial information about clients and the performance of operating units, and management can therefore use them to improve various business processes. Extracting useful information from such massive amounts of data manually, or by traditional means, is often a challenging task that requires time, domain experts and care. This has created a demand for intelligent tools that automatically discover useful information from scattered data and present it to decision makers in practical ways, in order to increase their confidence when making key decisions. Normally, these decisions serve to develop and sustain a business's competitive advantages (Coulter, 2012).

Generally, the intelligent tools used by decision makers are automated computer software that utilise a certain learning methodology based on data mining. Data mining is a multidisciplinary field combining artificial intelligence (AI; search methods), databases and mathematics (statistics and probability; Abdelhamid & Thabtah, 2014).

*Corresponding author. fadi@cud.ac.ae

© 2015 Antai College of Economics and Management, Shanghai Jiao Tong University

Data mining can be defined as the process of intelligently discerning new patterns from large data to guide key corporate managers (Thabtah, Hammoud, & Adbeljaber, 2015). We define data mining as a learning methodology concerned with revealing hidden knowledge, in a specific format, from data sets for a particular usage.

One popular data mining task, which involves predicting an unseen target attribute (the class) by learning from labelled historical data (the training data set), is classification. Classification involves learning a model, often named the classifier, from a training data set consisting of a set of features (attributes), one of which is labelled as the class. The main goal of a classification technique is to accurately guess the value of the class for an unseen set of data, normally called the test data. The learning performed on the training data set is guided by the value of the class attribute, and classification therefore falls under the category of supervised learning. Common applications of classification are medical diagnosis (Rameshkumar, Sambath, & Ravi, 2013) and website phishing detection (Abdelhamid, Ayesh, & Thabtah, 2014).

Several different classification approaches have been developed in data mining, including decision trees (Quinlan, 1993), neural networks (NN; Mohammad, Thabtah, & McCluskey, 2013), support vector machines (SVM; Cortes & Vapnik, 1995), associative classification (AC; Thabtah, Cowling, & Peng, 2004), rule induction (RI; Cohen, 1995) and others. The latter two approaches, AC and RI, extract classifiers which contain human-interpretable rules in If-Then form, and this explains their widespread applicability. However, there are differences between AC and RI, especially in the way rules are induced. In particular, AC utilises association rule discovery techniques to induce the rules based on two user thresholds, named the minimum confidence and the minimum support. An AC algorithm normally discovers the rules at once from the input data set based on the above thresholds, whereas in RI rules are discovered one by one, in a greedy fashion and per class label. In other words, the classifier in RI is learnt from parts of the training data set, since the data are initially partitioned into parts based on the available class labels, whereas the classifier in AC is learnt from the complete training data without splitting it. This article falls under the umbrella of RI research.

PRISM is an RI technique which was developed by Cendrowska (1987) and slightly enhanced by others (e.g. Stahl & Bramer, 2008). This learning algorithm follows a separate-and-conquer strategy in building the classifier (Witten & Frank, 2005). For each class label available in the training data set, the algorithm builds a set of rules and then combines them to make the classifier. Normally, for a certain class label, PRISM starts with an empty rule and keeps adding attribute values (Definition 1 in section 2.1) to the rule's body until the rule reaches a certain expected accuracy (Definition 8 in section 2.1). Often, PRISM generates only perfect rules (rules that have 100% accuracy). When this happens, the rule gets generated and all training data connected with it are discarded. PRISM continues building other rules in the same way until no more data associated with the current class can be found. At this point, PRISM moves to a new class label and repeats the same steps described earlier until the training data set becomes empty.
During the building of a rule, the attribute value with the largest expected accuracy is the one typically added to the candidate rule. More details on PRISM are given in section 3.
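As a point of reference, the separate-and-conquer loop described above can be sketched as follows. This is a simplified, illustrative reconstruction rather than the original implementation; the list-of-dicts data layout and the function names are assumptions.

```python
def prism(instances, attributes, class_key="class"):
    """Rough sketch of PRISM-style separate-and-conquer rule induction."""
    rules = []
    for cls in {row[class_key] for row in instances}:
        data = list(instances)
        # keep building rules for this class while uncovered instances remain
        while any(row[class_key] == cls for row in data):
            body, covered = {}, list(data)
            while True:
                # pick the attribute value with the highest expected accuracy
                best, best_acc = None, -1.0
                for attr in attributes:
                    if attr in body:
                        continue
                    for val in {row[attr] for row in covered}:
                        matches = [r for r in covered if r[attr] == val]
                        acc = sum(r[class_key] == cls for r in matches) / len(matches)
                        if acc > best_acc:
                            best, best_acc = (attr, val), acc
                if best is None:
                    break
                body[best[0]] = best[1]
                covered = [r for r in covered if r[best[0]] == best[1]]
                if best_acc == 1.0:          # PRISM keeps only perfect rules
                    break
            rules.append((body, cls))
            # discard the training instances covered by the new rule
            data = [r for r in data if not all(r[a] == v for a, v in body.items())]
    return rules
```

Note that the attribute-value statistics are recomputed from scratch on every pass; the frequencies themselves are never maintained dynamically, which is the behaviour this paper sets out to change.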

This paper investigates shortcomings associated with PRISM. Specifically, we look into three main issues:

(1) Reducing the possible number of attribute values used to create rules by using an attribute value frequency threshold that we call freq.

(2) Generating not only rules with 100% accuracy but also other high-accuracy rules. We utilise a predefined user threshold that we call rule strength (Rule_Strength) to separate acceptable from unacceptable rules. Acceptable rules hold accuracy above the Rule_Strength threshold; unacceptable rules hold accuracy below Rule_Strength and are pruned. More details are given in section 2.1. The acceptable rules that are not perfect (< 100% accuracy) are stored in a secondary classifier that is used in the prediction step only when rules in the primary classifier (100% accuracy) fail to classify a test case.

(3) Dynamically updating the frequencies of candidate attribute values, initially computed from the training data set, whenever a rule is produced. When a rule is generated its associated training data are removed, but some of these data may contain items that also appear in other candidate rules. The frequencies of the attribute values appearing in the removed training instances must therefore be decremented, simply because the training data set has shrunk after deleting the generated rule's data instances. This problem is discussed further in section 3 (Issue 2).

In the dynamic rule induction algorithm, the domain expert controls the number of attribute values which can be utilised to make rules by setting the freq threshold. This threshold separates strong attribute values (attribute values with data representation above the freq threshold) from weak attribute values (attribute values with data representation below the freq threshold). Discarding weak attribute values early minimises the search space and reduces computation costs such as training time. Moreover, we check the attribute value frequencies against the freq threshold every time a rule is produced. Section 4 further elaborates this step.

In response to the above-raised issues, we develop a new dynamic learning method based on RI that we name dynamic rule induction (DRI). DRI discovers the rules one by one per class, and primarily uses a minimum frequency threshold called freq to limit the search space for attribute values by discarding weak attribute values. Further, whenever a rule is induced, DRI decrements the frequencies of the strong attribute values that appear inside the deleted training instances of the induced rule. This may result in discarding some attribute values, since they become weak (their frequency drops below the freq threshold), and therefore in a lower number of rules in the classifier, especially rules with 100% accuracy. More details on the distinguishing features of the proposed algorithm are given in section 4.3. Lastly, DRI allows the generation of rules with high but not necessarily perfect accuracy, which may limit the use of the default class rule in classifying test data. The default class rule is formed from the uncovered training data after inducing all rules, and is normally associated with the most frequent class in the uncovered training data.

Often these high-accuracy rules are ignored by the PRISM algorithm since they do not hold 100% accuracy. In DRI, such rules are used in place of the default class rule when no primary rule is able to classify a test datum.

This paper is structured as follows: section 2 presents the definitions related to the classification problem, surveys common RI algorithms and highlights the investigated research issues. Section 3 discusses the proposed algorithm and its related phases, alongside a comprehensive example that reveals DRI's insight and its main features. Section 4 is devoted to the data and the analysis of the experimental results, and finally conclusions are provided in section 5.

2. Literature review and research issues raised in RI

In this section, we define terms related to the research problem, review common RI algorithms in the literature and shed light on the research issues this article investigates. We focus on the PRISM algorithm and its successors, since we tackle research problems associated with it. Another reason for including this section is that some of the algorithms described herein are used in the experimental section for comparison purposes with the proposed algorithm.

2.1. Related definitions

Given an input training data set T with n distinct attributes A1, A2, ..., An, one of which is called the class, i.e. l, each attribute contains a list of values. The size of T is denoted |T|. An attribute may be categorical, meaning it takes a value from a known set of possible values, or continuous (numeric). The values of categorical attributes are mapped to a set of positive integers, whereas continuous attributes are discretised. The ultimate aim is to build a classification model (classifier) from T, e.g. C: A -> l, which guesses the class value of test data, where A is a disjoint set of attribute values and l is a class.

The proposed algorithm depends on a predefined user threshold called freq. This threshold is utilised to differentiate between strong and non-strong (weak) ruleitems <attribute value, class> based on their frequency in the training data set. Any ruleitem that survives the freq threshold is known as a strong ruleitem, and when a strong ruleitem involves a single attribute we call it a strong 1-ruleitem. The main related terms and definitions are given below.

Definition 1: An attribute value is an attribute plus one of its values, denoted (Ai, ai).

Definition 2: A training instance in T is a row combining a list of attribute values (Aj1, aj1), ..., (Ajv, ajv), plus a class denoted by cj.

Definition 3: A ruleitem r has the format <body, c>, where body is a set of disjoint attribute values and c is a class value.

Definition 4: The frequency threshold (freq) is a predefined threshold given by the end user.

Definition 5: The body frequency (body_freq) of a ruleitem r in T is the number of instances in T that match r's body.

Definition 6: The frequency of a ruleitem r in T (ruleitem_freq) is the number of instances in T that match r.

Definition 7: A ruleitem r passes the freq threshold if r's body_freq / |T| >= freq. Such a ruleitem is said to be a strong ruleitem.

Definition 8: The expected accuracy of a ruleitem r is defined as ruleitem_freq / body_freq.

Definition 9: A rule in our classifier is represented as body -> l, where the left-hand side (body) is a set of disjoint attribute values and the right-hand side (l) is a class value. The format of a rule is: a1 ∧ a2 ∧ ... ∧ an -> l.

2.2. Literature review

PRISM is one of the known RI algorithms that derive rules in a greedy manner: it splits the training data set into subsets with respect to the class values. Then, for each subset, the algorithm forms an empty rule, searches for the attribute value that has the highest expected accuracy, appends it to the rule body, and continues adding attribute values until the current candidate rule achieves maximum expected accuracy (often 100%). Once this happens, the algorithm generates the rule and removes all of its positive instances (the data in the subset covered by the rule). The same process is repeated to produce the rest of the rules from the remaining uncovered data in the subset, until the subset becomes empty or no rule with acceptable expected accuracy can be derived. At that point, the algorithm moves on to the next class subset and repeats the same process until all rules in all class data subsets have been generated and merged to form the classifier. One notable problem with this classification approach is that the effort required to find the best attribute value to append to a rule at any stage of the learning phase is exhaustive for high-dimensional training data sets. Moreover, there is no clear pruning mechanism in PRISM, which often results in a classifier with a very large number of rules, each covering a low number of instances.

A parallel PRISM (P-PRISM) method has been developed (Stahl & Bramer, 2008) to overcome PRISM's computationally expensive process of testing all attribute values when computing the expected accuracies while building a rule. The authors pre-sort the items based on their occurrences in the training data set and their class values; holding this information rather than the complete input data minimises memory use. The data are then distributed to different processors (central processing units, CPUs), where rules are produced locally and then combined globally, with no synchronisation mechanism defined. Only limited experiments have been conducted to measure the scalability and efficiency of P-PRISM.

To cut down the classifier size in RI, the Repeated Incremental Pruning to Produce Error Reduction (RIPPER) algorithm (Cohen, 1995) was developed. It divides the training data set with respect to class labels and then, starting with the least frequent class set, builds a rule by adding items to its body until the rule is perfect (i.e. the number of negative examples covered by the rule is zero). For each candidate empty rule, the algorithm looks for the best attribute value in the data set using information gain (IG; defined in Equations 1 and 2; Quinlan, 1993) and appends it to the rule's body. The IG basically evaluates how well an attribute splits the data with respect to the class labels. The algorithm keeps adding attribute values until the rule becomes perfect, at which point the rule is generated.

6 238 I. Qabajeh et al. phase is called rule growing. At the same time as rules are built, RIPPER uses extensive pruning, using both the positive and negative examples associated with the candidate rules, to eliminate unnecessary attribute values. The algorithm stops building the rules when any rule found has 50% error, or in a new implementation of RIPPER when the minimum description length (MDL) of the rules set after adding a candidate rule is larger than the one obtained before adding the candidate rule. Another pruning in RIPPER occurs while building the final classifier. For each candidate rule generated, two substitute rules are made: its replacement and its revision. The first one is made by growing an empty rule r i and filtering it to minimise the error on the overall rule set. The revision rule is built in similar fashion except that the algorithm just inserts an additional item to the rule s body, and examines the original and the revised rule against the data to choose the rule with the lowest error rate. This extensive pruning in RIPPER explains the small-sized classifiers generated by this type of algorithm. Experiments on a number of University of California Irvine data sets (Merz & Murphy, 1996) showed that RIPPER scales well in accuracy rate when compared to decision trees (Cohen, 1995). Gain ( D, A)= Entropy( D) (( ) ) D a / D Entropy ( Da ) (1) where Entropy (D) = P c log 2 P c (2) where P v = the probability that D belongs to class c;. D a = the subset of D for which A has value a; D a = the number of examples in D a, and D = size of D. Ahybridclassification algorithm that uses decision tree and RI approaches together to produce classifiers in one phase rather than two phases, called PART, was proposed by Frank and Witten (1998). PART employs RI to generate the candidate rule set, and then filters this set out using pruning methods adopted from decision trees. PART builds a rule as RI algorithms, but rather than constructing the rule directly from the data, it derives a sub-tree from the training data and then it converts the path leading to the leaf with the largest coverage into a rule, and the sub-tree gets discarded along with its positive instances from the data set. The same process is repeated until all instances in the data set are removed. OneRule is a simple rule-based algorithm that was proposed by Holte (1993). This algorithm makes a one-level tree and produces rules that are connected with the most frequent class in the training data set (having the largest data coverage). For all attribute values in the training data set, OneRule iterates over the training data examples and computes the frequency of each attribute value with respect to available class labels. The algorithm selects the most frequent attribute and class and generates them as a rule if they pass an error rate check. Finally, the algorithm repeats the same step to generate the subsequent rules until it finds a rule with unacceptable error; at that stage, the rule-discovery process terminates.
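To make Equations 1 and 2 concrete, a small sketch of the information gain computation is given below. It is an illustration only, assuming categorical attributes and a list-of-dicts data layout; it is not code from the paper.

```python
import math
from collections import Counter

def entropy(rows, class_key="class"):
    # Entropy(D) = -sum_c P_c * log2(P_c)  (Equation 2)
    counts = Counter(r[class_key] for r in rows)
    total = len(rows)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def information_gain(rows, attribute, class_key="class"):
    # Gain(D, A) = Entropy(D) - sum_a (|D_a| / |D|) * Entropy(D_a)  (Equation 1)
    total = len(rows)
    gain = entropy(rows, class_key)
    for value in {r[attribute] for r in rows}:
        subset = [r for r in rows if r[attribute] == value]
        gain -= (len(subset) / total) * entropy(subset, class_key)
    return gain

# Toy check: "outlook" splits the class cleanly, so it gets the higher gain.
toy = [
    {"outlook": "sunny", "windy": "yes", "class": "play"},
    {"outlook": "sunny", "windy": "no", "class": "play"},
    {"outlook": "rainy", "windy": "yes", "class": "stay"},
    {"outlook": "rainy", "windy": "no", "class": "stay"},
]
print(information_gain(toy, "outlook"))  # 1.0
print(information_gain(toy, "windy"))    # 0.0
```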

2.3. Research issues raised

2.3.1. Issue 1

One of the main problems associated with RI approaches such as PRISM is the large dimensionality of the search space of attribute values (i.e. the large number of candidate attribute values). When constructing a rule for a particular class, PRISM has to evaluate the expected accuracy of all available attribute values linked with that class in order to select the best one to add to the rule's body. This requires heavy computation when the training data have many attribute values, and can be a burden especially when many unnecessary computations are made for attribute values that have low data representation (weak attribute values). To address this issue, we allow the end user to input, early on, a minimum frequency threshold that determines whether an attribute value qualifies to be part of a rule body before its expected accuracy is computed. This minimises the search space by reducing the number of attribute value frequency computations.

2.3.2. Issue 2

Another serious problem, which has not previously been reported in RI research, arises when the instances of a generated rule are discarded by PRISM from the training data set. This usually impacts other attribute values that share these instances with that rule. For example, when a rule R1: IF x1 and y2 THEN C1 is generated, assume that six data instances linked with R1 have been discarded. All candidate attribute values inside the six deleted training instances, other than items x1 and y2, are affected by this removal, and their frequencies should be updated to reflect the change. Some of these candidate attribute values may no longer have a high enough frequency and should therefore be pruned before building the next rule. Decrementing the frequencies of the affected candidate attribute values yields three distinct advantages:

(1) a natural pruning method that discards infrequent attribute values, thereby further reducing the search space;

(2) the number of rules is minimised, since fewer possible attribute values can be added to the next in-line rule, thereby resolving one of the major problems associated with PRISM (the typical PRISM has no pruning method);

(3) the majority of the derived rules are now composed of items whose class associations and frequencies are computed incrementally, following the rule-generation process, rather than statically at once from the training data set. We believe that these dynamic attribute value frequencies are fairer than those computed once by traditional RI algorithms.

More details on the second research issue are given in the detailed example in section 3.

2.3.3. Issue 3

One of the problems associated with PRISM is its insistence on deriving perfect rules regardless of whether the produced rule has sufficient data representation, which may lead to the generation of a massive number of rules that have low frequency despite being perfect in terms of expected accuracy.

So when a rule has an expected accuracy of 90% and a large data representation, PRISM unfortunately does not generate it, and instead prefers to break its instances down to produce multiple low-coverage rules. We believe that such high-coverage rules, when they have a good expected accuracy, can be advantageous, especially in predicting the class of test data when perfect rules fail to do so. This leads us to propose a threshold that separates acceptable from unacceptable rules, which we call Rule_Strength. We also use rule sorting, where top-ranked rules normally have 100% accuracy and lower-ranked rules are acceptable rules with expected accuracy of less than 100%.

3. A new dynamic rule induction algorithm

Our algorithm uses an RI learning strategy to discover and extract the rules. It consists of two main phases: rule discovery and class prediction. In phase 1, the algorithm logically splits the training data per class, and for each class it builds rules with expected accuracy equal to 100% or at least Rule_Strength, until no more rules can be extracted or the class label's data are covered by the produced rules. By considering the Rule_Strength threshold, our algorithm induces rules that are usually ignored by PRISM. The same process is repeated for the rest of the classes in the training data set until the complete data set becomes empty, at which point all rules are merged together to make the classifier.

In order to minimise the number of candidate attribute values, DRI employs a frequency threshold that only allows items with a sufficient number of occurrences, above the freq threshold, to be part of rules. All items that belong to a particular class and have frequencies below the freq threshold are discarded during the rule-discovery phase. During phase 1, once a rule is derived and its associated instances are removed, the frequencies of any candidate attribute values appearing in the deleted instances are updated. Often, this update involves decrementing the frequencies of the impacted candidate attribute values, which may in turn remove candidate attribute values whose frequencies drop below the freq threshold. The consequence is a further reduction of the search space by pruning these candidate attribute values, and therefore the number of rules ending up in the classifier is also minimised. Phase 2 involves using the classifier to forecast the class of unseen data and computing the error rate. The general steps of the proposed algorithm are depicted in Figure 1. Details about each phase are elaborated in the subsequent sections.

The attributes inside the training data set are assumed to be categorical or continuous. For continuous attributes, the entropy-based discretisation method is applied before the rule-discovery phase. In discretising a continuous attribute, the attribute's values are sorted in ascending order and the class linked with each value is noted. A point where the class value changes is considered a breaking point, and the gain of splitting the data for that attribute at each breaking point is computed based on IG (Quinlan, 1993). The breaking point that maximises the IG over all possible points is selected. The same process is repeated for the remaining unselected breaking-point partitions. Further information about discretisation can be found in Witten and Frank (2005). Missing values are treated like any other value in the training data set.
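As an illustration of this pre-processing step, the following is a simplified, single-split sketch. It assumes one numeric attribute and uses the entropy function of Equation 2; the recursion over the resulting partitions is omitted, and the helper names are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def best_split_point(values, labels):
    """Pick the boundary between class changes that maximises information gain."""
    pairs = sorted(zip(values, labels))
    total_entropy = entropy([c for _, c in pairs])
    best_gain, best_cut = -1.0, None
    for i in range(1, len(pairs)):
        # candidate breaking point: the class label changes between i-1 and i
        if pairs[i - 1][1] == pairs[i][1]:
            continue
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2.0
        left = [c for v, c in pairs if v <= cut]
        right = [c for v, c in pairs if v > cut]
        gain = total_entropy - (len(left) / len(pairs)) * entropy(left) \
                             - (len(right) / len(pairs)) * entropy(right)
        if gain > best_gain:
            best_gain, best_cut = gain, cut
    return best_cut, best_gain

# e.g. temperatures with classes: the split lands between 70 and 80 -> (75.0, 1.0)
print(best_split_point([64, 65, 70, 80, 85, 90], ["yes", "yes", "yes", "no", "no", "no"]))
```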

Figure 1. Dynamic rule induction (DRI) algorithm.
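The pseudocode of Figure 1 is not reproduced here. The Python-style sketch below reconstructs the rule-discovery loop from the textual description in this section; the data layout, helper names and default threshold values are illustrative assumptions, not the authors' code. For clarity the sketch recomputes the attribute value frequencies from the remaining data before every rule, which has the same effect as the incremental decrements DRI performs after each rule's instances are deleted.

```python
def dri_discover_rules(instances, attributes, freq=3, rule_strength=0.5, class_key="class"):
    """Sketch of DRI phase 1: per-class rule discovery with dynamic frequencies."""
    primary, secondary = [], []          # perfect rules / acceptable near-perfect rules
    data = list(instances)

    def expected_accuracy(body, cls, rows):
        matches = [r for r in rows if all(r[a] == v for a, v in body.items())]
        if not matches:
            return 0.0
        return sum(r[class_key] == cls for r in matches) / len(matches)

    for cls in sorted({r[class_key] for r in instances}):
        while True:
            # strong 1-ruleitems for this class, measured on the *current* data,
            # so they already reflect every earlier rule's deleted instances
            strong = [(a, v) for a in attributes for v in {r[a] for r in data}
                      if sum(r[a] == v and r[class_key] == cls for r in data) >= freq]
            if not strong:
                break
            # grow a rule greedily from the strong attribute values
            body, acc = {}, 0.0
            candidates = list(strong)
            while candidates and acc < 1.0:
                best = max(candidates,
                           key=lambda av: expected_accuracy({**body, av[0]: av[1]}, cls, data))
                body[best[0]] = best[1]
                acc = expected_accuracy(body, cls, data)
                candidates = [(a, v) for a, v in candidates if a not in body]
            if acc >= 1.0:
                primary.append((body, cls, acc))
            elif acc >= rule_strength:
                secondary.append((body, cls, acc))
            else:
                break                     # no acceptable rule left for this class
            # delete the covered instances; the next pass over `data` therefore
            # sees the decremented frequencies automatically
            data = [r for r in data if not all(r[a] == v for a, v in body.items())]
    return primary, secondary
```

The essential differences from the PRISM sketch given earlier are that acceptable but imperfect rules are kept in a secondary set, that only strong attribute values are considered, and that frequencies always reflect the data remaining after the instances of earlier rules have been deleted.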

3.1. Rule production and classifier building

Before DRI starts the mining process, the training data are transformed into a data structure that holds <Item, class, Line#s / row IDs>. The item and the class are represented by <ColumnID, RowID> pairs, where the column number and the first row number in which the item/class occurs in the training data set denote the item/class. This data representation has been adopted from Thabtah and Hammoud (2013). The main advantage of using this data format is that there is no attribute value frequency counting after iteration 1, because our algorithm stores the locations of both the attribute values and the class in the training data set in a data structure called the TID. The TID of a ruleitem r is used to obtain the frequency of r by simply taking the size of r's TID. This simple mechanism normally reduces the number of passes over the training data set to one (Abdelhamid, Ayesh, Thabtah, Ahmadi, & Hadi, 2012). Further details on the advantages of the data representation used can be found in Thabtah and Hammoud (2013).

DRI starts learning by passing over the input data set and building a data structure that corresponds to all strong 1-ruleitems and their frequencies (TIDs). All candidate 1-ruleitems that are weak (their frequency is below the freq threshold) are discarded. Then, for each class, say L1, we start with an empty rule ri, i.e. "If Empty then L1", and add the attribute value with the largest expected accuracy to ri's body until ri becomes perfect or reaches an acceptable error rate. In other words, our algorithm can generate a rule for a class even if it is not perfect, as long as it passes the Rule_Strength threshold. These non-perfect rules are then ranked and stored in a secondary classifier. Once a rule is produced, all instances connected with it in the training data are deleted and we move on to build the next rule for the current class (L1). The deletion of ri's training instances may impact other candidate attribute values that appear in those instances, and therefore the DRI algorithm updates the frequencies of all other strong attribute values that appeared in the removed instances to reflect the changes made. This guarantees a live, dynamic frequency for all remaining strong attribute values, where some may become more statistically fit and others may become weak. It is a natural pruning process in which weak attribute values are identified without having to look them up in the training data set, which improves the efficiency of the training process and reduces the number of candidate strong items used to generate the next rule. We believe that DRI is the only RI algorithm that addresses this problem. Once the first rule (ri) is devised, the algorithm continues building rules for the current class until:

(1) no more strong attribute values are linked with class L1; or
(2) the expected accuracy of the remaining attribute values is unacceptable.

At this point, DRI picks another class and repeats the same process until the training data set becomes empty or no more strong attribute values are found. Section 3 gives a detailed example of the rule-discovery phase of the proposed algorithm. A minimal sketch of the TID-based frequency bookkeeping described above follows.
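The exact layout used by the authors follows Thabtah and Hammoud (2013) and is not reproduced here; the sketch below only illustrates the general idea that a ruleitem's frequency is the size of its row-ID set, and that deleting a rule's instances decrements the frequencies of every other ruleitem that shared those rows. All names are illustrative.

```python
from collections import defaultdict

def build_tids(instances, attributes, class_key="class"):
    """One pass over the data: each <attribute value, class> ruleitem maps to the
    set of row IDs in which it occurs, so its frequency is simply len(tid)."""
    tids = defaultdict(set)
    for row_id, row in enumerate(instances):
        for attr in attributes:
            tids[(attr, row[attr], row[class_key])].add(row_id)
    return tids

def remove_covered(tids, covered_row_ids, freq=3):
    """Delete a generated rule's instances: dropping their row IDs from every TID
    is exactly the dynamic frequency decrement DRI relies on."""
    for key in list(tids):
        tids[key] -= covered_row_ids
        if len(tids[key]) < freq:        # the ruleitem has become weak: prune it
            del tids[key]
    return tids
```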
Often, there should be a way to distinguish among rules in classification, in order to choose which rule should be fired when classifying the test data during the class allocation process. In the classic PRISM algorithm there is no rule ranking, since all rules generated and stored in the classifier have 100% expected accuracy.

Nevertheless, PRISM and its successors ignore near-perfect rules and rules that are not perfect. We solve this problem by considering not only perfect rules but also rules that pass a user-defined threshold called the rule strength (Rule_Strength). These rules are kept in a secondary classifier that is utilised only when no rules in the primary classifier can cover a test datum. This means the DRI algorithm has two classifiers:

- Primary: stores only perfect rules that have 100% accuracy;
- Secondary: stores rules that are not perfect but passed the Rule_Strength threshold (i.e. rules with an acceptable error rate).

The sorting procedure (Figure 2) is fully applied to the secondary rule set and partially applied (Line 2 onward) to the primary rule set, since rules in the primary classifier have the same expected accuracy.

Figure 2. Rule sorting of the dynamic rule induction (DRI) algorithm.

3.2. Test data class allocation step

Once the rules are derived and sorted in the classifier (primary and secondary), they are ready to be utilised to allocate the right class to unlabelled data (test data). It should be noted that there is a single classifier, whose top part we name the primary classifier part and whose lower part we name the secondary classifier part. The basic idea behind our class allocation procedure is to limit the use of the default class rule, which normally has an unacceptable error rate. This is the main reason for building a secondary classifier part, which normally contains highly predictive rules that are ignored by RI algorithms based on PRISM. Rules in the secondary classifier are only used when rules in the primary classifier are unable to classify a test datum. For a test datum ti, the DRI algorithm goes over the rules in the classifier's primary part, and the first rule whose body items are all contained in ti classifies it. If no such rule is found in the primary part, DRI moves on to the secondary part and applies the same procedure. If no fully matching rule is found in either set, our algorithm takes the first partially matching rule, where partially means that any item of the rule's body matches any item in ti. The DRI class allocation procedure reduces the use of the default class to almost none, which should positively affect the overall classification accuracy of the classifier. Figure 3 displays the class allocation procedure which we propose.

Figure 3. Class allocation procedure of the dynamic rule induction (DRI) algorithm.
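The flowchart of Figure 3 is not reproduced here. A hedged sketch of the allocation procedure as described above follows, with rules and test data represented as dictionaries; the names are illustrative.

```python
def allocate_class(test_datum, primary, secondary, default_class):
    """Return the predicted class for one test datum.

    `primary` and `secondary` are sorted lists of (body, class) pairs, where
    body is a dict of attribute values; `test_datum` is a dict as well.
    """
    def full_match(body):
        return all(test_datum.get(attr) == val for attr, val in body.items())

    def partial_match(body):
        return any(test_datum.get(attr) == val for attr, val in body.items())

    for rules in (primary, secondary):          # 1) first fully matching rule wins
        for body, cls in rules:
            if full_match(body):
                return cls
    for body, cls in primary + secondary:       # 2) otherwise first partial match
        if partial_match(body):
            return cls
    return default_class                        # 3) default class as a last resort
```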

3.3. DRI vs other RI algorithms

A limited number of PRISM-based methods have been developed in data mining to improve PRISM's output quality and its efficiency in finding the rules, such as P-PRISM. This section highlights the primary distinctions between our method and those that are PRISM-based:

- The PRISM algorithm utilises the expected accuracy as a measure of a rule's goodness, and only generates a rule when its expected accuracy is 100%. This unfortunately results in many rules with low data coverage. By contrast, our algorithm generates perfect rules as well as high-coverage, near-perfect rules (low-error rules). This results in a lower number of perfect rules in the primary classifier and allows other good rules to play a role in the classification of test data, which eventually reduces the use of the default class rule.

- There is no rule sorting in PRISM and its successors, whereas DRI discriminates amongst rules based on three new criteria in RI. This allows, in some cases, lower-ranked rules to classify test data.

- The DRI algorithm uses two new thresholds, named freq and Rule_Strength, to minimise the search space of attribute values while constructing the rules. This makes the process of rule discovery more efficient. By contrast, PRISM has to evaluate the expected accuracy of all attribute values each time it builds a rule, which can be problematic when the training data set's dimensionality is large.

- PRISM uses a static expected accuracy and frequency for each attribute value associated with the class, computed once from the training data set during the first scan. By contrast, DRI gives each attribute value a dynamic expected accuracy and frequency that change whenever a rule is derived. This ensures that each attribute value has its true data representation while the classifier is being built.

3.4. Example of the proposed algorithm

In this section, we go through a detailed example to illustrate how the DRI algorithm finds the rules and produces the classifier.

Assume the minimum freq threshold and Rule_Strength are set to 3 and 50%, respectively, and suppose the data set in Table 1 is given.

Table 1. Sample data set (Witten and Frank, 2005).

Inst. ID  Age             Spectacle-prescrip  Astigmatism  Tear-prod-rate  Class
1         Young           Myope               No           Reduced         None
2         Young           Myope               No           Normal          Soft
3         Young           Myope               Yes          Reduced         None
4         Young           Myope               Yes          Normal          Hard
5         Young           Hypermetrope        No           Reduced         None
6         Young           Hypermetrope        No           Normal          Soft
7         Young           Hypermetrope        Yes          Reduced         None
8         Young           Hypermetrope        Yes          Normal          Hard
9         Pre-presbyopic  Myope               No           Reduced         None
10        Pre-presbyopic  Myope               No           Normal          Soft
11        Pre-presbyopic  Myope               Yes          Reduced         None
12        Pre-presbyopic  Myope               Yes          Normal          Hard
13        Pre-presbyopic  Hypermetrope        No           Reduced         None
14        Pre-presbyopic  Hypermetrope        No           Normal          Soft
15        Pre-presbyopic  Hypermetrope        Yes          Reduced         None
16        Pre-presbyopic  Hypermetrope        Yes          Normal          None
17        Presbyopic      Myope               No           Reduced         None
18        Presbyopic      Myope               No           Normal          None
19        Presbyopic      Myope               Yes          Reduced         None
20        Presbyopic      Myope               Yes          Normal          Hard
21        Presbyopic      Hypermetrope        No           Reduced         None
22        Presbyopic      Hypermetrope        No           Normal          Soft
23        Presbyopic      Hypermetrope        Yes          Reduced         None
24        Presbyopic      Hypermetrope        Yes          Normal          None

DRI starts with class "None" and computes the candidate ruleitems shown in Table 2.

Table 2. The frequency and expected accuracy of attribute values connected with class "None".

Candidate ruleitem                        Frequency  Expected accuracy
Age = young, None                         4          4/8
Age = presbyopic, None                    6          6/8
Age = pre-presbyopic, None                5          5/8
Spectacle-prescrip = myope, None          7          7/12
Spectacle-prescrip = hypermetrope, None   8          8/12
Astigmatism = No, None                    7          7/12
Astigmatism = Yes, None                   8          8/12
Tear-prod-rate = reduced, None            12         12/12
Tear-prod-rate = normal, None             3          3/12

The highest-accuracy attribute value is Tear-prod-rate = reduced; all of its occurrences are associated with class "None", and therefore we generate the first rule as follows:

RULE (1): If Tear-prod-rate = reduced then None (12/12).
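The frequencies and expected accuracies of Table 2 can be recomputed directly from Table 1. The following small sketch does so; it is illustrative and not part of the original paper.

```python
from collections import Counter

# Table 1 rows as (Age, Spectacle-prescrip, Astigmatism, Tear-prod-rate, Class).
rows = [
    ("Young", "Myope", "No", "Reduced", "None"),          ("Young", "Myope", "No", "Normal", "Soft"),
    ("Young", "Myope", "Yes", "Reduced", "None"),         ("Young", "Myope", "Yes", "Normal", "Hard"),
    ("Young", "Hypermetrope", "No", "Reduced", "None"),   ("Young", "Hypermetrope", "No", "Normal", "Soft"),
    ("Young", "Hypermetrope", "Yes", "Reduced", "None"),  ("Young", "Hypermetrope", "Yes", "Normal", "Hard"),
    ("Pre-presbyopic", "Myope", "No", "Reduced", "None"),         ("Pre-presbyopic", "Myope", "No", "Normal", "Soft"),
    ("Pre-presbyopic", "Myope", "Yes", "Reduced", "None"),        ("Pre-presbyopic", "Myope", "Yes", "Normal", "Hard"),
    ("Pre-presbyopic", "Hypermetrope", "No", "Reduced", "None"),  ("Pre-presbyopic", "Hypermetrope", "No", "Normal", "Soft"),
    ("Pre-presbyopic", "Hypermetrope", "Yes", "Reduced", "None"), ("Pre-presbyopic", "Hypermetrope", "Yes", "Normal", "None"),
    ("Presbyopic", "Myope", "No", "Reduced", "None"),             ("Presbyopic", "Myope", "No", "Normal", "None"),
    ("Presbyopic", "Myope", "Yes", "Reduced", "None"),            ("Presbyopic", "Myope", "Yes", "Normal", "Hard"),
    ("Presbyopic", "Hypermetrope", "No", "Reduced", "None"),      ("Presbyopic", "Hypermetrope", "No", "Normal", "Soft"),
    ("Presbyopic", "Hypermetrope", "Yes", "Reduced", "None"),     ("Presbyopic", "Hypermetrope", "Yes", "Normal", "None"),
]

attrs = ["Age", "Spectacle-prescrip", "Astigmatism", "Tear-prod-rate"]
body_freq, ruleitem_freq = Counter(), Counter()
for *values, cls in rows:
    for attr, val in zip(attrs, values):
        body_freq[(attr, val)] += 1                 # Definition 5
        ruleitem_freq[(attr, val, cls)] += 1        # Definition 6

for (attr, val, cls), f in ruleitem_freq.items():
    if cls == "None":                               # reproduces Table 2
        print(f"{attr} = {val}, None: frequency {f}, "
              f"expected accuracy {f}/{body_freq[(attr, val)]}")
```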

Then, we remove all data instances covered by rule 1 and update the frequencies of all attribute values that appeared in the removed instances, as shown in Table 3.

Table 3. New frequencies of attribute values linked with class "None" after generating rule 1 (frequencies computed before and after removing R1's instances).

Candidate ruleitem                        Original frequency  New frequency  Status
Age = young, None                         4                   0              Deleted after R1 is derived
Age = presbyopic, None                    6                   2              Deleted after R1 is derived
Age = pre-presbyopic, None                5                   1              Deleted after R1 is derived
Spectacle-prescrip = myope, None          7                   1              Deleted after R1 is derived
Spectacle-prescrip = hypermetrope, None   8                   2              Deleted after R1 is derived
Astigmatism = No, None                    7                   1              Deleted after R1 is derived
Astigmatism = Yes, None                   8                   2              Deleted after R1 is derived
Tear-prod-rate = normal, None             3                   3              Keep for possible secondary classifier

We stop generating rules for class "None", since the remaining attribute value, Tear-prod-rate = normal, has a high error rate besides being the only attribute value left. We keep it for the secondary classifier, in case it later passes the Rule_Strength threshold input by the end user.

We move on to class "Hard". Table 4 displays the expected accuracies computed from the training data set for the attribute values linked with this class.

Table 4. The frequency and expected accuracy of attribute values connected with class "Hard".

Candidate ruleitem                        Frequency  Expected accuracy  Frequency status
Age = young, Hard                         2          2/8                Remove
Age = presbyopic, Hard                    1          1/8                Remove
Age = pre-presbyopic, Hard                1          1/8                Remove
Spectacle-prescrip = myope, Hard          3          3/12
Spectacle-prescrip = hypermetrope, Hard   1          1/12               Remove
Astigmatism = Yes, Hard                   4          4/12
Tear-prod-rate = normal, Hard             4          4/12

In Table 4, we notice only three strong attribute values, so all other weak items are removed, as shown in the last column of the table. It should be noted that other RI algorithms keep them. Based on the computations shown in Table 4, there are two attribute values with similar expected accuracies (4/12), so we select one randomly, i.e. Astigmatism = Yes, and add it to the empty rule as follows:

RULE (2): If Astigmatism = Yes Then Hard (4/12).

We then separate the data instances associated with the current rule, as shown in Table 5, and compute the expected accuracies again, as depicted in Table 6.

Table 5. Training instances linked with Astigmatism = Yes.

Age             Spectacle-prescrip  Astigmatism  Tear-prod-rate  Class
Young           Myope               Yes          Normal          Hard
Young           Hypermetrope        Yes          Normal          Hard
Pre-presbyopic  Myope               Yes          Normal          Hard
Pre-presbyopic  Hypermetrope        Yes          Normal          None
Presbyopic      Myope               Yes          Normal          Hard
Presbyopic      Hypermetrope        Yes          Normal          None

Table 6. Updated frequency and expected accuracy of attribute values computed from Table 5.

Candidate ruleitem                Frequency  Expected accuracy
Tear-prod-rate = normal, Hard     4          4/6
Spectacle-prescrip = myope, Hard  3          3/3

The best and only attribute value left is Spectacle-prescrip = myope, with 3/3 accuracy, so we add it to the current rule as follows:

RULE (2): If Astigmatism = Yes and Spectacle-prescrip = myope Then Hard (3/3).

Only one instance is left uncovered for class "Hard" in the training data, so we stop generating rules for this class, since this attribute value fails the frequency requirement to make a new rule.

We move on to class "Soft". Table 7 displays the expected accuracies computed from the training data set for the attribute values linked with this class.

Table 7. The frequency and expected accuracy of attribute values connected with class "Soft".

Candidate ruleitem                        Frequency  Expected accuracy  Frequency status
Age = young, Soft                         2          2/3                Remove
Age = presbyopic, Soft                    1          1/3                Remove
Age = pre-presbyopic, Soft                2          2/3                Remove
Spectacle-prescrip = myope, Soft          3          2/3
Spectacle-prescrip = hypermetrope, Soft   1          3/6                Remove
Astigmatism = Yes, Soft                   0          0                  Remove
Astigmatism = No, Soft                    5          5/6
Tear-prod-rate = normal, Soft             5          5/8

We notice that there are five weak attribute values, so we remove them, as shown in the last column of the table. We select the attribute value Astigmatism = No, since it has the largest expected accuracy, i.e. 5/6, and build the following rule:

RULE (3): If Astigmatism = No Then Soft (5/6).

Then we separate the data instances for this rule, as shown in Table 8, and recompute the expected accuracies from Table 8, as shown in Table 9.

Table 8. Training instances linked with Astigmatism = No.

Inst. ID  Age             Spectacle-prescrip  Astigmatism  Tear-prod-rate  Class
2         Young           Myope               No           Normal          Soft
6         Young           Hypermetrope        No           Normal          Soft
10        Pre-presbyopic  Myope               No           Normal          Soft
14        Pre-presbyopic  Hypermetrope        No           Normal          Soft
18        Presbyopic      Myope               No           Normal          None
22        Presbyopic      Hypermetrope        No           Normal          Soft

Table 9. Updated frequency and expected accuracy of attribute values computed from Table 8.

Candidate ruleitem                Frequency  Expected accuracy
Spectacle-prescrip = myope, Soft  3          2/3
Tear-prod-rate = normal, Soft     5          5/6

According to Table 9, the expected accuracy remains the same as that of the current rule, i.e. 5/6, so we generate the current rule and remove all instances associated with it. We have generated rule 3 despite it not being perfect, since it passed the Rule_Strength threshold; nevertheless, this rule will be added below the perfect rules. Table 10 shows the remaining unclassified instances.

Table 10. Remaining unclassified instances after generating rule 3.

Inst. ID  Age             Spectacle-prescrip  Astigmatism  Tear-prod-rate  Class
8         Young           Hypermetrope        Yes          Normal          Hard
16        Pre-presbyopic  Hypermetrope        Yes          Normal          None
18        Presbyopic      Myope               No           Normal          None
24        Presbyopic      Hypermetrope        Yes          Normal          None

All candidate attribute values that have become weak are shown in Table 11, along with their updated frequencies; these are eventually removed.

Table 11. The frequency and expected accuracy of the uncovered attribute values in the training data set (frequency and, where given, expected accuracy per class).

Candidate ruleitem                 None      Hard      Status
Age = young                        0         1         Remove
Age = presbyopic                   2         0         Remove
Age = pre-presbyopic               1         0         Remove
Spectacle-prescrip = myope         1                   Remove
Spectacle-prescrip = hypermetrope  2 (2/3)   1 (1/3)   Remove
Astigmatism = Yes                  2 (2/3)   1 (1/3)   Remove
Astigmatism = No                   1                   Remove
Tear-prod-rate = normal            3 (3/4)   1 (1/4)

Based on Table 11, there is one attribute value with an acceptable frequency and an expected accuracy above Rule_Strength, namely Tear-prod-rate = normal. We therefore build a rule for it as follows:

RULE (4): If Tear-prod-rate = normal Then None (3/4)

and delete all of its training instances. At this point we are left with only one instance in the training data set, associated with class "Hard"; this forms our default class rule. In this example, four rules have been devised from the original training data set, two of which are primary and two of which are secondary.

4. Data and experimental results

In this section, we test the proposed algorithm on different data sets from the University of California Irvine data collection (Merz & Murphy, 1996). Our choice of the University of California Irvine data sets is based on different features such as the number of attributes, the data set size, the number of classes and the types of the available attributes. For fair comparison, data sets of different sizes have been chosen. Table 12 displays the details of each data set used in the experiments. Different evaluation criteria are used to conduct the experiments and the analysis of results, mainly:

- classification accuracy (%);
- number of rules, particularly between the DRI and PRISM algorithms.

Different classification algorithms in data mining have been chosen to evaluate the general performance of the DRI algorithm with respect to the classifiers' predictive accuracy and rules. The majority of the chosen algorithms fall under the category of RI; these are RIPPER and PRISM. In addition, we have selected a well-known decision tree algorithm, C4.5, to further evaluate DRI. The reason for picking these algorithms is that most of them employ a learning methodology similar to DRI's, with the exception of C4.5, which uses an information-theoretic measure based on entropy to build a decision tree classifier.

Table 12. Data set characteristics.

Data set        No. of classes  No. of attributes  No. of instances
Contact lenses
Vote
Weather
Labour
Glass
Iris
Diabetes
Segment
Zoo
Sonar
Tic-Tac

The experiments with the proposed algorithm have been conducted using a Java prototype, whereas all remaining algorithms have been tested in WEKA. WEKA is an open-source Java-based platform that was developed at the University of Waikato, New Zealand. It contains implementations and evaluation measures of data mining and machine learning methods for tasks including classification, clustering, regression, association rules and feature selection. All experiments were conducted on a computing machine with a 1.7-GHz processor.

The average accuracy produced by the considered algorithms on the 10 University of California Irvine data sets is displayed in Figure 4.

Figure 4. Average classification accuracy (%) for the considered algorithms on the University of California Irvine data sets.

The figure shows that the DRI algorithm performed, on average, extremely well when compared to the RIPPER and PRISM RI algorithms. In fact, on average, DRI gained higher classification accuracy than the RIPPER and PRISM algorithms by 1.51% and 4.58%, respectively. This gain results from the dynamic rules generated by the algorithm, which keeps the fittest rules besides the perfect rules, improving the predictive power of DRI. On the other hand, the decision tree algorithm C4.5 has a slightly higher classification accuracy on average than our algorithm: to be more precise, C4.5 is on average 0.67% more accurate than DRI on the University of California Irvine data sets used in the experiments. This is explained by the high predictive power of C4.5 and by the extensive pruning used by this algorithm during the construction of the classifier. The fact that DRI is competitive in accuracy with C4.5, and derives classifiers that are on average more predictive than those of its own kind, is an achievement.

We further evaluated the proposed algorithm per University of California Irvine data set and compared its predictive accuracy with the three other classification algorithms. Figure 5 shows the classification accuracy of all algorithms used in the experiments. In the figure, DRI outperforms most of the considered classification algorithms on the University of California Irvine data sets used in the experiments. In particular, the won-lost-tie records of DRI against RIPPER, PRISM and C4.5 are 6-4-0, 7-1-2 and 3-1-6, respectively.

Figure 5. The classification accuracy (%) for the considered algorithms on the 10 University of California Irvine data sets.

The new rule evaluation method of DRI has a positive impact on the classification performance of this algorithm, by only allowing rules that are statistically fit to participate in the classifier. These rules are the ones utilised during the class prediction step.

The number of rules in the classifiers produced by PRISM and by our algorithm is depicted in Figure 6.

Figure 6. The classifier size of the PRISM and dynamic rule induction (DRI) algorithms on the data sets.

It is clear from the figure that PRISM generates, on average, larger classifiers than the DRI algorithm does, due to the fact that PRISM has no pruning strategy at all. The dynamic update of candidate items when rules are generated has a good impact on reducing the search space of items, and therefore a lower number of candidate strong items is presented. In other words, the removal of the overlap among rules in the training instances when each rule is generated also has a positive impact on the classifier size. In particular, DRI ensures that the expected accuracies and frequencies of all candidate strong items are amended on the fly whenever a rule is produced, which minimises the number of candidate strong items available for the next rule.

5. Conclusions

Rule induction (RI) is one of the well-known classification approaches in data mining; it has attracted researchers due to its simple output and its applicability in several domains. However, RI, and especially the PRISM algorithm, has a few substantial issues, including ignoring rules that have high training data coverage but not 100% accuracy. Furthermore, PRISM has no defined rule-pruning strategy, which may lead to the generation of large numbers of low-data-coverage rules. Another serious problem in RI, especially for greedy algorithms like PRISM, is that whenever a rule is produced and all of its covered data are removed from the training data set, these algorithms do not take into account the impact of the removed data on the remaining candidate rules. This may result in the generation of many redundant rules and can also increase the search space of attribute values.

In response to the above issues, we proposed in this article a dynamic rule induction (DRI) strategy that utilises two thresholds to reduce the search space, and guarantees the production not only of perfect rules but of other high-quality rules as well. Moreover, DRI discards all covered data instances when a rule is generated, and amends the frequencies of the attribute values of all remaining candidate rules that appeared in the removed instances. This yields fairer rules, since the actual frequencies of the rule body attribute values are updated incrementally rather than computed once from the original training data set. Experiments were conducted using 10 University of California Irvine data sets and different RI algorithms. The results revealed that the DRI algorithm is highly competitive in classification accuracy with the PRISM, RIPPER and C4.5 algorithms. Moreover, DRI consistently produced a lower number of rules than PRISM on the data sets we considered. In the near future, we intend to extend DRI to deal with unstructured data sets in order to handle the challenging problem of multi-label classification.

References

Abdelhamid, N., Ayesh, A., & Thabtah, F. (2014). Phishing detection-based associative classification data mining. Expert Systems with Applications, 41(13).
Abdelhamid, N., Ayesh, A., Thabtah, F., Ahmadi, S., & Hadi, W. (2012). MAC: A multiclass associative classification algorithm. Journal of Information and Knowledge Management (JIKM), 11(2).
Abdelhamid, N., & Thabtah, F. (2014). Associative classification approaches: Review and comparison. Journal of Information and Knowledge Management (JIKM), 13.
Cendrowska, J. (1987). PRISM: An algorithm for inducing modular rules. International Journal of Man-Machine Studies, 27(4).
Cohen, W. (1995). Fast effective rule induction. In Proceedings of the 12th International Conference on Machine Learning. Tahoe City, CA: Morgan Kaufmann.
Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3).
Coulter, M. (2012). Strategic management in action (6th ed.). Pearson Education.
Frank, E., & Witten, I. (1998). Generating accurate rule sets without global optimisation. In Proceedings of the Fifteenth International Conference on Machine Learning. San Francisco, CA: Morgan Kaufmann.


More information

Association Rule Mining and Clustering

Association Rule Mining and Clustering Association Rule Mining and Clustering Lecture Outline: Classification vs. Association Rule Mining vs. Clustering Association Rule Mining Clustering Types of Clusters Clustering Algorithms Hierarchical:

More information

An Effective Performance of Feature Selection with Classification of Data Mining Using SVM Algorithm

An Effective Performance of Feature Selection with Classification of Data Mining Using SVM Algorithm Proceedings of the National Conference on Recent Trends in Mathematical Computing NCRTMC 13 427 An Effective Performance of Feature Selection with Classification of Data Mining Using SVM Algorithm A.Veeraswamy

More information

International Journal of Computer Science Trends and Technology (IJCST) Volume 5 Issue 4, Jul Aug 2017

International Journal of Computer Science Trends and Technology (IJCST) Volume 5 Issue 4, Jul Aug 2017 International Journal of Computer Science Trends and Technology (IJCST) Volume 5 Issue 4, Jul Aug 17 RESEARCH ARTICLE OPEN ACCESS Classifying Brain Dataset Using Classification Based Association Rules

More information

Temporal Weighted Association Rule Mining for Classification

Temporal Weighted Association Rule Mining for Classification Temporal Weighted Association Rule Mining for Classification Purushottam Sharma and Kanak Saxena Abstract There are so many important techniques towards finding the association rules. But, when we consider

More information

An Empirical Study of Lazy Multilabel Classification Algorithms

An Empirical Study of Lazy Multilabel Classification Algorithms An Empirical Study of Lazy Multilabel Classification Algorithms E. Spyromitros and G. Tsoumakas and I. Vlahavas Department of Informatics, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece

More information

Unsupervised Learning

Unsupervised Learning Unsupervised Learning Unsupervised learning Until now, we have assumed our training samples are labeled by their category membership. Methods that use labeled samples are said to be supervised. However,

More information

Data Mining and Knowledge Discovery Practice notes 2

Data Mining and Knowledge Discovery Practice notes 2 Keywords Data Mining and Knowledge Discovery: Practice Notes Petra Kralj Novak Petra.Kralj.Novak@ijs.si Data Attribute, example, attribute-value data, target variable, class, discretization Algorithms

More information

Improving the Random Forest Algorithm by Randomly Varying the Size of the Bootstrap Samples for Low Dimensional Data Sets

Improving the Random Forest Algorithm by Randomly Varying the Size of the Bootstrap Samples for Low Dimensional Data Sets Improving the Random Forest Algorithm by Randomly Varying the Size of the Bootstrap Samples for Low Dimensional Data Sets Md Nasim Adnan and Md Zahidul Islam Centre for Research in Complex Systems (CRiCS)

More information

Input: Concepts, Instances, Attributes

Input: Concepts, Instances, Attributes Input: Concepts, Instances, Attributes 1 Terminology Components of the input: Concepts: kinds of things that can be learned aim: intelligible and operational concept description Instances: the individual,

More information

International Journal of Scientific Research & Engineering Trends Volume 4, Issue 6, Nov-Dec-2018, ISSN (Online): X

International Journal of Scientific Research & Engineering Trends Volume 4, Issue 6, Nov-Dec-2018, ISSN (Online): X Analysis about Classification Techniques on Categorical Data in Data Mining Assistant Professor P. Meena Department of Computer Science Adhiyaman Arts and Science College for Women Uthangarai, Krishnagiri,

More information

Data Mining and Knowledge Discovery: Practice Notes

Data Mining and Knowledge Discovery: Practice Notes Data Mining and Knowledge Discovery: Practice Notes Petra Kralj Novak Petra.Kralj.Novak@ijs.si 8.11.2017 1 Keywords Data Attribute, example, attribute-value data, target variable, class, discretization

More information

A review of associative classification mining

A review of associative classification mining The Knowledge Engineering Review, Vol. 22:1, 37 65. Ó 2007, Cambridge University Press doi:10.1017/s0269888907001026 Printed in the United Kingdom A review of associative classification mining FADI THABTAH

More information

Review and Comparison of Associative Classification Data Mining Approaches

Review and Comparison of Associative Classification Data Mining Approaches Review and Comparison of Associative Classification Data Mining Approaches Suzan Wedyan Abstract Associative classification (AC) is a data mining approach that combines association rule and classification

More information

Improving Tree-Based Classification Rules Using a Particle Swarm Optimization

Improving Tree-Based Classification Rules Using a Particle Swarm Optimization Improving Tree-Based Classification Rules Using a Particle Swarm Optimization Chi-Hyuck Jun *, Yun-Ju Cho, and Hyeseon Lee Department of Industrial and Management Engineering Pohang University of Science

More information

The digital copy of this thesis is protected by the Copyright Act 1994 (New Zealand).

The digital copy of this thesis is protected by the Copyright Act 1994 (New Zealand). http://waikato.researchgateway.ac.nz/ Research Commons at the University of Waikato Copyright Statement: The digital copy of this thesis is protected by the Copyright Act 1994 (New Zealand). The thesis

More information

Recent Progress on RAIL: Automating Clustering and Comparison of Different Road Classification Techniques on High Resolution Remotely Sensed Imagery

Recent Progress on RAIL: Automating Clustering and Comparison of Different Road Classification Techniques on High Resolution Remotely Sensed Imagery Recent Progress on RAIL: Automating Clustering and Comparison of Different Road Classification Techniques on High Resolution Remotely Sensed Imagery Annie Chen ANNIEC@CSE.UNSW.EDU.AU Gary Donovan GARYD@CSE.UNSW.EDU.AU

More information

COMP 465: Data Mining Classification Basics

COMP 465: Data Mining Classification Basics Supervised vs. Unsupervised Learning COMP 465: Data Mining Classification Basics Slides Adapted From : Jiawei Han, Micheline Kamber & Jian Pei Data Mining: Concepts and Techniques, 3 rd ed. Supervised

More information

A Two Stage Zone Regression Method for Global Characterization of a Project Database

A Two Stage Zone Regression Method for Global Characterization of a Project Database A Two Stage Zone Regression Method for Global Characterization 1 Chapter I A Two Stage Zone Regression Method for Global Characterization of a Project Database J. J. Dolado, University of the Basque Country,

More information

CHAPTER 3 A FAST K-MODES CLUSTERING ALGORITHM TO WAREHOUSE VERY LARGE HETEROGENEOUS MEDICAL DATABASES

CHAPTER 3 A FAST K-MODES CLUSTERING ALGORITHM TO WAREHOUSE VERY LARGE HETEROGENEOUS MEDICAL DATABASES 70 CHAPTER 3 A FAST K-MODES CLUSTERING ALGORITHM TO WAREHOUSE VERY LARGE HETEROGENEOUS MEDICAL DATABASES 3.1 INTRODUCTION In medical science, effective tools are essential to categorize and systematically

More information

Summary. Machine Learning: Introduction. Marcin Sydow

Summary. Machine Learning: Introduction. Marcin Sydow Outline of this Lecture Data Motivation for Data Mining and Learning Idea of Learning Decision Table: Cases and Attributes Supervised and Unsupervised Learning Classication and Regression Examples Data:

More information

ISSN: (Online) Volume 3, Issue 9, September 2015 International Journal of Advance Research in Computer Science and Management Studies

ISSN: (Online) Volume 3, Issue 9, September 2015 International Journal of Advance Research in Computer Science and Management Studies ISSN: 2321-7782 (Online) Volume 3, Issue 9, September 2015 International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online

More information

Fuzzy Partitioning with FID3.1

Fuzzy Partitioning with FID3.1 Fuzzy Partitioning with FID3.1 Cezary Z. Janikow Dept. of Mathematics and Computer Science University of Missouri St. Louis St. Louis, Missouri 63121 janikow@umsl.edu Maciej Fajfer Institute of Computing

More information

An Intelligent Clustering Algorithm for High Dimensional and Highly Overlapped Photo-Thermal Infrared Imaging Data

An Intelligent Clustering Algorithm for High Dimensional and Highly Overlapped Photo-Thermal Infrared Imaging Data An Intelligent Clustering Algorithm for High Dimensional and Highly Overlapped Photo-Thermal Infrared Imaging Data Nian Zhang and Lara Thompson Department of Electrical and Computer Engineering, University

More information

Estimating Missing Attribute Values Using Dynamically-Ordered Attribute Trees

Estimating Missing Attribute Values Using Dynamically-Ordered Attribute Trees Estimating Missing Attribute Values Using Dynamically-Ordered Attribute Trees Jing Wang Computer Science Department, The University of Iowa jing-wang-1@uiowa.edu W. Nick Street Management Sciences Department,

More information

ASSOCIATIVE CLASSIFICATION WITH KNN

ASSOCIATIVE CLASSIFICATION WITH KNN ASSOCIATIVE CLASSIFICATION WITH ZAIXIANG HUANG, ZHONGMEI ZHOU, TIANZHONG HE Department of Computer Science and Engineering, Zhangzhou Normal University, Zhangzhou 363000, China E-mail: huangzaixiang@126.com

More information

Image Mining: frameworks and techniques

Image Mining: frameworks and techniques Image Mining: frameworks and techniques Madhumathi.k 1, Dr.Antony Selvadoss Thanamani 2 M.Phil, Department of computer science, NGM College, Pollachi, Coimbatore, India 1 HOD Department of Computer Science,

More information

Leveraging Set Relations in Exact Set Similarity Join

Leveraging Set Relations in Exact Set Similarity Join Leveraging Set Relations in Exact Set Similarity Join Xubo Wang, Lu Qin, Xuemin Lin, Ying Zhang, and Lijun Chang University of New South Wales, Australia University of Technology Sydney, Australia {xwang,lxue,ljchang}@cse.unsw.edu.au,

More information

An Improved Apriori Algorithm for Association Rules

An Improved Apriori Algorithm for Association Rules Research article An Improved Apriori Algorithm for Association Rules Hassan M. Najadat 1, Mohammed Al-Maolegi 2, Bassam Arkok 3 Computer Science, Jordan University of Science and Technology, Irbid, Jordan

More information

AMOL MUKUND LONDHE, DR.CHELPA LINGAM

AMOL MUKUND LONDHE, DR.CHELPA LINGAM International Journal of Advances in Applied Science and Engineering (IJAEAS) ISSN (P): 2348-1811; ISSN (E): 2348-182X Vol. 2, Issue 4, Dec 2015, 53-58 IIST COMPARATIVE ANALYSIS OF ANN WITH TRADITIONAL

More information

Improving Quality of Products in Hard Drive Manufacturing by Decision Tree Technique

Improving Quality of Products in Hard Drive Manufacturing by Decision Tree Technique Improving Quality of Products in Hard Drive Manufacturing by Decision Tree Technique Anotai Siltepavet 1, Sukree Sinthupinyo 2 and Prabhas Chongstitvatana 3 1 Computer Engineering, Chulalongkorn University,

More information

Comparative Study of Clustering Algorithms using R

Comparative Study of Clustering Algorithms using R Comparative Study of Clustering Algorithms using R Debayan Das 1 and D. Peter Augustine 2 1 ( M.Sc Computer Science Student, Christ University, Bangalore, India) 2 (Associate Professor, Department of Computer

More information

Multi-label classification using rule-based classifier systems

Multi-label classification using rule-based classifier systems Multi-label classification using rule-based classifier systems Shabnam Nazmi (PhD candidate) Department of electrical and computer engineering North Carolina A&T state university Advisor: Dr. A. Homaifar

More information

Hybrid Feature Selection for Modeling Intrusion Detection Systems

Hybrid Feature Selection for Modeling Intrusion Detection Systems Hybrid Feature Selection for Modeling Intrusion Detection Systems Srilatha Chebrolu, Ajith Abraham and Johnson P Thomas Department of Computer Science, Oklahoma State University, USA ajith.abraham@ieee.org,

More information

Improving Quality of Products in Hard Drive Manufacturing by Decision Tree Technique

Improving Quality of Products in Hard Drive Manufacturing by Decision Tree Technique www.ijcsi.org 29 Improving Quality of Products in Hard Drive Manufacturing by Decision Tree Technique Anotai Siltepavet 1, Sukree Sinthupinyo 2 and Prabhas Chongstitvatana 3 1 Computer Engineering, Chulalongkorn

More information

Data Mining: Concepts and Techniques Classification and Prediction Chapter 6.1-3

Data Mining: Concepts and Techniques Classification and Prediction Chapter 6.1-3 Data Mining: Concepts and Techniques Classification and Prediction Chapter 6.1-3 January 25, 2007 CSE-4412: Data Mining 1 Chapter 6 Classification and Prediction 1. What is classification? What is prediction?

More information

Data Mining with Oracle 10g using Clustering and Classification Algorithms Nhamo Mdzingwa September 25, 2005

Data Mining with Oracle 10g using Clustering and Classification Algorithms Nhamo Mdzingwa September 25, 2005 Data Mining with Oracle 10g using Clustering and Classification Algorithms Nhamo Mdzingwa September 25, 2005 Abstract Deciding on which algorithm to use, in terms of which is the most effective and accurate

More information

Random Forest A. Fornaser

Random Forest A. Fornaser Random Forest A. Fornaser alberto.fornaser@unitn.it Sources Lecture 15: decision trees, information theory and random forests, Dr. Richard E. Turner Trees and Random Forests, Adele Cutler, Utah State University

More information

Business Club. Decision Trees

Business Club. Decision Trees Business Club Decision Trees Business Club Analytics Team December 2017 Index 1. Motivation- A Case Study 2. The Trees a. What is a decision tree b. Representation 3. Regression v/s Classification 4. Building

More information

Data Mining Practical Machine Learning Tools and Techniques. Slides for Chapter 6 of Data Mining by I. H. Witten and E. Frank

Data Mining Practical Machine Learning Tools and Techniques. Slides for Chapter 6 of Data Mining by I. H. Witten and E. Frank Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 6 of Data Mining by I. H. Witten and E. Frank Implementation: Real machine learning schemes Decision trees Classification

More information

K-Nearest-Neighbours with a Novel Similarity Measure for Intrusion Detection

K-Nearest-Neighbours with a Novel Similarity Measure for Intrusion Detection K-Nearest-Neighbours with a Novel Similarity Measure for Intrusion Detection Zhenghui Ma School of Computer Science The University of Birmingham Edgbaston, B15 2TT Birmingham, UK Ata Kaban School of Computer

More information

Data Cleaning and Prototyping Using K-Means to Enhance Classification Accuracy

Data Cleaning and Prototyping Using K-Means to Enhance Classification Accuracy Data Cleaning and Prototyping Using K-Means to Enhance Classification Accuracy Lutfi Fanani 1 and Nurizal Dwi Priandani 2 1 Department of Computer Science, Brawijaya University, Malang, Indonesia. 2 Department

More information

Evolving SQL Queries for Data Mining

Evolving SQL Queries for Data Mining Evolving SQL Queries for Data Mining Majid Salim and Xin Yao School of Computer Science, The University of Birmingham Edgbaston, Birmingham B15 2TT, UK {msc30mms,x.yao}@cs.bham.ac.uk Abstract. This paper

More information

AN IMPROVED GRAPH BASED METHOD FOR EXTRACTING ASSOCIATION RULES

AN IMPROVED GRAPH BASED METHOD FOR EXTRACTING ASSOCIATION RULES AN IMPROVED GRAPH BASED METHOD FOR EXTRACTING ASSOCIATION RULES ABSTRACT Wael AlZoubi Ajloun University College, Balqa Applied University PO Box: Al-Salt 19117, Jordan This paper proposes an improved approach

More information

ORT EP R RCH A ESE R P A IDI! " #$$% &' (# $!"

ORT EP R RCH A ESE R P A IDI!  #$$% &' (# $! R E S E A R C H R E P O R T IDIAP A Parallel Mixture of SVMs for Very Large Scale Problems Ronan Collobert a b Yoshua Bengio b IDIAP RR 01-12 April 26, 2002 Samy Bengio a published in Neural Computation,

More information

AN IMPROVISED FREQUENT PATTERN TREE BASED ASSOCIATION RULE MINING TECHNIQUE WITH MINING FREQUENT ITEM SETS ALGORITHM AND A MODIFIED HEADER TABLE

AN IMPROVISED FREQUENT PATTERN TREE BASED ASSOCIATION RULE MINING TECHNIQUE WITH MINING FREQUENT ITEM SETS ALGORITHM AND A MODIFIED HEADER TABLE AN IMPROVISED FREQUENT PATTERN TREE BASED ASSOCIATION RULE MINING TECHNIQUE WITH MINING FREQUENT ITEM SETS ALGORITHM AND A MODIFIED HEADER TABLE Vandit Agarwal 1, Mandhani Kushal 2 and Preetham Kumar 3

More information

KEYWORDS: Clustering, RFPCM Algorithm, Ranking Method, Query Redirection Method.

KEYWORDS: Clustering, RFPCM Algorithm, Ranking Method, Query Redirection Method. IJESRT INTERNATIONAL JOURNAL OF ENGINEERING SCIENCES & RESEARCH TECHNOLOGY IMPROVED ROUGH FUZZY POSSIBILISTIC C-MEANS (RFPCM) CLUSTERING ALGORITHM FOR MARKET DATA T.Buvana*, Dr.P.krishnakumari *Research

More information

The Comparison of CBA Algorithm and CBS Algorithm for Meteorological Data Classification Mohammad Iqbal, Imam Mukhlash, Hanim Maria Astuti

The Comparison of CBA Algorithm and CBS Algorithm for Meteorological Data Classification Mohammad Iqbal, Imam Mukhlash, Hanim Maria Astuti Information Systems International Conference (ISICO), 2 4 December 2013 The Comparison of CBA Algorithm and CBS Algorithm for Meteorological Data Classification Mohammad Iqbal, Imam Mukhlash, Hanim Maria

More information

MIT 801. Machine Learning I. [Presented by Anna Bosman] 16 February 2018

MIT 801. Machine Learning I. [Presented by Anna Bosman] 16 February 2018 MIT 801 [Presented by Anna Bosman] 16 February 2018 Machine Learning What is machine learning? Artificial Intelligence? Yes as we know it. What is intelligence? The ability to acquire and apply knowledge

More information

Introduction to Data Mining and Data Analytics

Introduction to Data Mining and Data Analytics 1/28/2016 MIST.7060 Data Analytics 1 Introduction to Data Mining and Data Analytics What Are Data Mining and Data Analytics? Data mining is the process of discovering hidden patterns in data, where Patterns

More information

Clustering of Data with Mixed Attributes based on Unified Similarity Metric

Clustering of Data with Mixed Attributes based on Unified Similarity Metric Clustering of Data with Mixed Attributes based on Unified Similarity Metric M.Soundaryadevi 1, Dr.L.S.Jayashree 2 Dept of CSE, RVS College of Engineering and Technology, Coimbatore, Tamilnadu, India 1

More information

DATA ANALYSIS I. Types of Attributes Sparse, Incomplete, Inaccurate Data

DATA ANALYSIS I. Types of Attributes Sparse, Incomplete, Inaccurate Data DATA ANALYSIS I Types of Attributes Sparse, Incomplete, Inaccurate Data Sources Bramer, M. (2013). Principles of data mining. Springer. [12-21] Witten, I. H., Frank, E. (2011). Data Mining: Practical machine

More information

Cse634 DATA MINING TEST REVIEW. Professor Anita Wasilewska Computer Science Department Stony Brook University

Cse634 DATA MINING TEST REVIEW. Professor Anita Wasilewska Computer Science Department Stony Brook University Cse634 DATA MINING TEST REVIEW Professor Anita Wasilewska Computer Science Department Stony Brook University Preprocessing stage Preprocessing: includes all the operations that have to be performed before

More information

CSE4334/5334 DATA MINING

CSE4334/5334 DATA MINING CSE4334/5334 DATA MINING Lecture 4: Classification (1) CSE4334/5334 Data Mining, Fall 2014 Department of Computer Science and Engineering, University of Texas at Arlington Chengkai Li (Slides courtesy

More information

Mining of Web Server Logs using Extended Apriori Algorithm

Mining of Web Server Logs using Extended Apriori Algorithm International Association of Scientific Innovation and Research (IASIR) (An Association Unifying the Sciences, Engineering, and Applied Research) International Journal of Emerging Technologies in Computational

More information

Data Mining. Part 2. Data Understanding and Preparation. 2.4 Data Transformation. Spring Instructor: Dr. Masoud Yaghini. Data Transformation

Data Mining. Part 2. Data Understanding and Preparation. 2.4 Data Transformation. Spring Instructor: Dr. Masoud Yaghini. Data Transformation Data Mining Part 2. Data Understanding and Preparation 2.4 Spring 2010 Instructor: Dr. Masoud Yaghini Outline Introduction Normalization Attribute Construction Aggregation Attribute Subset Selection Discretization

More information

Part I. Instructor: Wei Ding

Part I. Instructor: Wei Ding Classification Part I Instructor: Wei Ding Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 Classification: Definition Given a collection of records (training set ) Each record contains a set

More information

Dynamic Clustering of Data with Modified K-Means Algorithm

Dynamic Clustering of Data with Modified K-Means Algorithm 2012 International Conference on Information and Computer Networks (ICICN 2012) IPCSIT vol. 27 (2012) (2012) IACSIT Press, Singapore Dynamic Clustering of Data with Modified K-Means Algorithm Ahamed Shafeeq

More information

Iteration Reduction K Means Clustering Algorithm

Iteration Reduction K Means Clustering Algorithm Iteration Reduction K Means Clustering Algorithm Kedar Sawant 1 and Snehal Bhogan 2 1 Department of Computer Engineering, Agnel Institute of Technology and Design, Assagao, Goa 403507, India 2 Department

More information

Classification. Instructor: Wei Ding

Classification. Instructor: Wei Ding Classification Decision Tree Instructor: Wei Ding Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 Preliminaries Each data record is characterized by a tuple (x, y), where x is the attribute

More information

Associating Terms with Text Categories

Associating Terms with Text Categories Associating Terms with Text Categories Osmar R. Zaïane Department of Computing Science University of Alberta Edmonton, AB, Canada zaiane@cs.ualberta.ca Maria-Luiza Antonie Department of Computing Science

More information

6. Dicretization methods 6.1 The purpose of discretization

6. Dicretization methods 6.1 The purpose of discretization 6. Dicretization methods 6.1 The purpose of discretization Often data are given in the form of continuous values. If their number is huge, model building for such data can be difficult. Moreover, many

More information

CHAPTER V ADAPTIVE ASSOCIATION RULE MINING ALGORITHM. Please purchase PDF Split-Merge on to remove this watermark.

CHAPTER V ADAPTIVE ASSOCIATION RULE MINING ALGORITHM. Please purchase PDF Split-Merge on   to remove this watermark. 119 CHAPTER V ADAPTIVE ASSOCIATION RULE MINING ALGORITHM 120 CHAPTER V ADAPTIVE ASSOCIATION RULE MINING ALGORITHM 5.1. INTRODUCTION Association rule mining, one of the most important and well researched

More information

A Survey on Algorithms for Market Basket Analysis

A Survey on Algorithms for Market Basket Analysis ISSN: 2321-7782 (Online) Special Issue, December 2013 International Journal of Advance Research in Computer Science and Management Studies Research Paper Available online at: www.ijarcsms.com A Survey

More information

Unsupervised Learning

Unsupervised Learning Outline Unsupervised Learning Basic concepts K-means algorithm Representation of clusters Hierarchical clustering Distance functions Which clustering algorithm to use? NN Supervised learning vs. unsupervised

More information

Artificial Intelligence. Programming Styles

Artificial Intelligence. Programming Styles Artificial Intelligence Intro to Machine Learning Programming Styles Standard CS: Explicitly program computer to do something Early AI: Derive a problem description (state) and use general algorithms to

More information

Extra readings beyond the lecture slides are important:

Extra readings beyond the lecture slides are important: 1 Notes To preview next lecture: Check the lecture notes, if slides are not available: http://web.cse.ohio-state.edu/~sun.397/courses/au2017/cse5243-new.html Check UIUC course on the same topic. All their

More information

Noise-based Feature Perturbation as a Selection Method for Microarray Data

Noise-based Feature Perturbation as a Selection Method for Microarray Data Noise-based Feature Perturbation as a Selection Method for Microarray Data Li Chen 1, Dmitry B. Goldgof 1, Lawrence O. Hall 1, and Steven A. Eschrich 2 1 Department of Computer Science and Engineering

More information

Web Page Classification using FP Growth Algorithm Akansha Garg,Computer Science Department Swami Vivekanad Subharti University,Meerut, India

Web Page Classification using FP Growth Algorithm Akansha Garg,Computer Science Department Swami Vivekanad Subharti University,Meerut, India Web Page Classification using FP Growth Algorithm Akansha Garg,Computer Science Department Swami Vivekanad Subharti University,Meerut, India Abstract - The primary goal of the web site is to provide the

More information

WEIGHTED K NEAREST NEIGHBOR CLASSIFICATION ON FEATURE PROJECTIONS 1

WEIGHTED K NEAREST NEIGHBOR CLASSIFICATION ON FEATURE PROJECTIONS 1 WEIGHTED K NEAREST NEIGHBOR CLASSIFICATION ON FEATURE PROJECTIONS 1 H. Altay Güvenir and Aynur Akkuş Department of Computer Engineering and Information Science Bilkent University, 06533, Ankara, Turkey

More information

Semi-Supervised Clustering with Partial Background Information

Semi-Supervised Clustering with Partial Background Information Semi-Supervised Clustering with Partial Background Information Jing Gao Pang-Ning Tan Haibin Cheng Abstract Incorporating background knowledge into unsupervised clustering algorithms has been the subject

More information

APRIORI ALGORITHM FOR MINING FREQUENT ITEMSETS A REVIEW

APRIORI ALGORITHM FOR MINING FREQUENT ITEMSETS A REVIEW International Journal of Computer Application and Engineering Technology Volume 3-Issue 3, July 2014. Pp. 232-236 www.ijcaet.net APRIORI ALGORITHM FOR MINING FREQUENT ITEMSETS A REVIEW Priyanka 1 *, Er.

More information

Decision Tree CE-717 : Machine Learning Sharif University of Technology

Decision Tree CE-717 : Machine Learning Sharif University of Technology Decision Tree CE-717 : Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Some slides have been adapted from: Prof. Tom Mitchell Decision tree Approximating functions of usually discrete

More information

Dr. Prof. El-Bahlul Emhemed Fgee Supervisor, Computer Department, Libyan Academy, Libya

Dr. Prof. El-Bahlul Emhemed Fgee Supervisor, Computer Department, Libyan Academy, Libya Volume 5, Issue 1, January 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Performance

More information

A Novel Algorithm for Associative Classification

A Novel Algorithm for Associative Classification A Novel Algorithm for Associative Classification Gourab Kundu 1, Sirajum Munir 1, Md. Faizul Bari 1, Md. Monirul Islam 1, and K. Murase 2 1 Department of Computer Science and Engineering Bangladesh University

More information

K-Means Clustering With Initial Centroids Based On Difference Operator

K-Means Clustering With Initial Centroids Based On Difference Operator K-Means Clustering With Initial Centroids Based On Difference Operator Satish Chaurasiya 1, Dr.Ratish Agrawal 2 M.Tech Student, School of Information and Technology, R.G.P.V, Bhopal, India Assistant Professor,

More information

Frequent Item Set using Apriori and Map Reduce algorithm: An Application in Inventory Management

Frequent Item Set using Apriori and Map Reduce algorithm: An Application in Inventory Management Frequent Item Set using Apriori and Map Reduce algorithm: An Application in Inventory Management Kranti Patil 1, Jayashree Fegade 2, Diksha Chiramade 3, Srujan Patil 4, Pradnya A. Vikhar 5 1,2,3,4,5 KCES

More information

Lecture 7: Decision Trees

Lecture 7: Decision Trees Lecture 7: Decision Trees Instructor: Outline 1 Geometric Perspective of Classification 2 Decision Trees Geometric Perspective of Classification Perspective of Classification Algorithmic Geometric Probabilistic...

More information

Sandeep Kharidhi and WenSui Liu ChoicePoint Precision Marketing

Sandeep Kharidhi and WenSui Liu ChoicePoint Precision Marketing Generalized Additive Model and Applications in Direct Marketing Sandeep Kharidhi and WenSui Liu ChoicePoint Precision Marketing Abstract Logistic regression 1 has been widely used in direct marketing applications

More information