PERFICT : Perturbed Frequent Itemset based Classification Technique


Raghvendra Mall (IIIT Hyderabad, students.iiit.ac.in), Prakhar Jain (IIIT Hyderabad, students.iiit.ac.in), Vikram Pudi (IIIT Hyderabad, iiit.ac.in)

ABSTRACT

This paper presents the Perturbed Frequent Itemset based Classification Technique (PERFICT), a novel associative classification approach based on perturbed frequent itemsets. Most existing associative classifiers work well on transactional data where each record contains a set of boolean items; they are generally not very effective on relational data, which typically contains real valued attributes. In PERFICT, we handle real attributes by treating items as (attribute, value) pairs, where the value is not the original one but is perturbed by a small amount and becomes a range based value. We also propose our own similarity measure, which captures the nature of real valued attributes and provides effective weights for the itemsets. The probabilistic contributions of different itemsets are taken into consideration during classification. Applications where such a technique is useful include signal classification, medical diagnosis and handwriting recognition. Experiments conducted on UCI Repository datasets show that PERFICT is highly competitive in terms of accuracy in comparison with popular associative classification methods.

1. INTRODUCTION

Classification of real world data is an important aspect of data mining that aims at predicting group membership for data instances. Starting with the seminal work over a decade ago [17], several classification approaches based on association rules have emerged. The foundation of the various rule based approaches is the Apriori [2, 3] and FP-Tree [15] algorithms, which have been studied extensively and applied to machine learning, data mining and many other problem domains [5, 18, 4, 19, 1]. The two thresholds used for selection of rules are min-support and min-confidence. These parameters are kept fixed while generating the rules, so an overhead of rule weighting and selection of the n best rules is required for classification purposes. Most of the recent associative classifiers, such as CBA [17], CMAR [16], MCAR [10], CPAR [22] and GARC [6], work well on transactional data where each record contains a set of boolean items. There remains scope for more efficient handling of continuous attributes.

In this paper, a novel associative classification procedure, namely the PERturbed Frequent Itemset based Classification Technique (PERFICT), is proposed. Our algorithm explicitly and effectively handles real valued attributes by means of Perturbed Frequent Itemsets (PFIs). A new MJ similarity measure is also proposed, which regulates the selection of the PFIs and helps weigh the PFIs during classification. A rule selection process is not required, and the pruned PFIs are used for a probabilistic estimate of the classes. We propose three different methods:

1. A naive histogram based approach (Hist PERFICT).
2. A histogram based approach with the similarity measure (HistSimilar PERFICT).
3. A randomized clustering method including the similarity measure (K-Means PERFICT).

Experimental evaluation of our algorithms on standard UCI datasets shows that they perform better than most recent state-of-the-art associative classifiers. Randomized K-Means PERFICT outperforms HistSimilar PERFICT and Hist PERFICT in most cases.
Our contributions include:
- Handling noisy data and the problem of exact matches in an effective manner using the notion of perturbation.
- Introduction of a new MJ similarity measure for weighing and pruning itemsets.
- Use of a self-adjusting mincount value for pruning perturbed frequent itemsets.
- Identifying drawbacks of the standard discretization method and avoiding it through a preprocessing step.

In Section 2, we describe related work in the field of associative classifiers. The next section presents the concepts and definitions used in our approach. Section 4 outlines the three different PERFICT algorithms. This is followed by a discussion of issues with the Hist PERFICT algorithm. Section 6 describes the HistSimilar PERFICT algorithm. Section 7 covers Randomized K-Means PERFICT, which is followed by results and analysis in Section 8. Finally, we conclude in Section 9.

2. RELATED WORK

Recent state of the art has exploited the paradigm of association rule mining for solving the problem of classification. These methods work on the principle of mining association rules to build classifiers [8]. Advantages of these approaches include: (1) frequent itemsets capture all dominant relationships between items in a dataset; (2) efficient itemset mining algorithms exist, resulting in highly scalable classifiers; (3) these classifiers naturally handle missing values and outliers as they only deal with statistically significant associations, a property that translates well into robustness; (4) extensive experimental statistics show that these techniques are less error prone.

However, these associative classifiers suffer from certain drawbacks. Though they provide more rules and information, the redundancy involved in the rules increases the cost, in terms of time and computational complexity, during the process of classification. MCAR [10] determines a redundant rule by checking whether or not it covers instances in the training data set. GARC [6] brought in the notion of a compact set to shrink the rule set by converting it to a compact one. Since the reduction of redundant rules requires a brute force technique, it fails to avoid some meaningless searching. Second, as rule generation in associative classification is based on frequent pattern mining, when the size of the data set grows the time cost of frequent pattern mining may increase sharply, which may be an inherent limitation of associative classification. The FP-Growth technique [14] used in CMAR [16] has proved to be very efficient, but extra time must be spent to compute the support and confidence of rules by scanning the data set again; the cost problem thus remains unsolved. These algorithms also have a major drawback: generation of rules by exact matches, irrespective of categorical or numeric attributes. This causes a problem in most real world scenarios because records that contain nearly similar values for a real valued attribute should support the same rule. Due to discretized matches, the algorithms do not always generate the required rule. Approaches like [11], [12], [13] and [20] perform numeric attribute based optimized rule mining, but it is difficult to handle noisy data with them. They use computational geometry to determine areas of confidence and use these as the primary criterion. However, similarity of the generated association rules with the given test record is not emphasized, and thus a new similarity measure is required.

3. BASIC CONCEPTS AND DEFINITIONS

Without loss of generality, we assume that our input data is in the form of a relational table whose attributes are {A_1, A_2, A_3, ..., A_n, C}, where C is the class attribute. We use the term item to refer to an attribute-value pair (A_i, a_i), where a_i is the value of an attribute A_i which is not a class attribute. For brevity, we also simply use a_i to refer to the item (A_i, a_i). Each record in the input relational table then contains a set of items I = {a_1, a_2, a_3, ..., a_n}. An itemset T is defined as T ⊆ I. A frequent itemset is an itemset whose support (i.e. frequency) is greater than some user-specified minimum support threshold. We allow for different thresholds depending on the length of itemsets, to account for the fact that itemsets of larger length naturally have lower supports. Let Min_k denote the minimum support, where k is the length of the corresponding itemset.

Figure 1: Hist PERFICT
Use of frequent itemsets for numeric real-world data is not appropriate as exact matches for attribute values might not exist. Instead, we use the notion of perturbation, a term used to convey the disturbance of a value from its mean position. Perturbation represents the noise in the values of the attributes of the items and effectively converts items to ranges. For instance, given an itemset T with attribute values a_v1, a_v2 and a_v3, the perturbed frequent itemset PFI_T will look like

PFI_T = {a_v1 ± σ_1, a_v2 ± σ_2, a_v3 ± σ_3}   (1)

4. THE PERFICT ALGORITHMS

The PERFICT algorithms are based on the principle of weighted probabilistic contribution of the Perturbed Frequent Itemsets. One advantage of this procedure over other associative classifiers is that there is no rule generating step in the PERFICT algorithms. The basic concept employed here is: the larger the length of a perturbed frequent itemset, the greater the similarity between a given test record and the training records containing those PFIs. Here we outline the general structure of the three PERFICT algorithms. From Figures 1, 2 and 3, we observe that several steps are common to the three procedures. We first explain all the steps of the Hist PERFICT algorithm in detail and subsequently explain the remaining steps of HistSimilar PERFICT and Randomized K-Means PERFICT.

Figure 2: HistSimilar PERFICT
Figure 3: K-Means PERFICT

4.1 Preprocessing Techniques

For associative classifiers, it has been observed that a preprocessing step is needed in which real valued attributes are discretized. In our approach, the concept of perturbation appropriately assigns ranges to these attribute values, eliminating the need for discretization.

Histogram Construction

A histogram is a frequency chart with non-overlapping adjacent intervals calculated upon the values of some variable. Mathematically, if n is the total number of observed values and k is the total number of partitions, the histogram counts m_i must meet the following condition:

n = Σ_{i=1}^{k} m_i   (2)

There are several kinds of histograms, but two types were best suited for our approach: the equi-width histogram and the equi-depth histogram. Equi-width histograms have all partitions of the same size; the size of each partition is the important variable here. Equi-depth histograms are based on the concept of equal frequency, i.e. an equal number of values in each partition; the parameter involved is the number of values falling into the different sized partitions. Equi-depth histograms are better suited for classification because they capture the intrinsic nature of the random variable or attribute being observed. Moreover, such a histogram is not affected by the presence of outliers. An equi-width histogram, on the other hand, is highly affected by outliers and is a weak choice for the purpose of classification. There exists an approach in [21] which partitions the values of quantitative attributes into equi-depth intervals. Its underlying principle is that of partial completeness. That approach measures the information loss due to the formation of rules obtained by considering ranges over partitions of quantitative attributes. However, the proposed approach works differently and prevents any information loss.

4.2 Transforming the Training Data Set

From Figure 1, it can be seen that the PERFICT algorithms include transforming the training dataset. Equi-depth histograms are constructed for each attribute with a variable depth value, and the standard deviation of each such partition is computed as well. Let us assume there are k attributes apart from the class attribute in the training set. In order to convert the (attribute, value) pair a_i to a_i ± σ_i we need a transformation. To obtain these ranges we use the histogram constructed above. Each attribute value of a training record is mapped to the corresponding histogram bin using a range query from the hash table of histograms. The attribute value is transformed to the original value ± the standard deviation of all the values in the mapped bin. The perturbation is defined as the standard deviation of all the values of an attribute that are initially hashed into that partition. Let a_ik represent the value of the i-th attribute corresponding to the k-th record. The histogram bins for the i-th attribute are represented as h_i1, h_i2, h_i3, ..., h_ip. As we are using equi-depth histograms, each partition has the same number of values, say n. Suppose a_ik maps to h_i3. Then

h_i3 = hash(a_ik)   (3)

µ_hi3 = (Σ_{j=1}^{n} a_ij) / n   (4)

σ_i3 = √( Σ_{j=1}^{n} (a_ij − µ_hi3)² )   (5)

a_ik = a_ik ± σ_i3   (6)

where µ_hi3 represents the mean value for the histogram bin h_i3 and σ_i3 represents the standard deviation of that histogram partition.

Table 1: Dataset before transformation
S. No | A#1  | A#2  | A#3  | Class
1     | v_11 | v_12 | v_13 | C_1
2     | v_21 | v_22 | v_23 | C_2
3     | v_31 | v_32 | v_33 | C_1

Table 2: Same dataset after transformation
S. No | A#1         | A#2         | A#3         | Class
1     | v_11 ± σ_11 | v_12 ± σ_12 | v_13 ± σ_13 | C_1
2     | v_21 ± σ_21 | v_22 ± σ_22 | v_23 ± σ_23 | C_2
3     | v_31 ± σ_31 | v_32 ± σ_32 | v_33 ± σ_33 | C_1

It can be observed from Table 2 that each attribute value is replaced by a ranged value. This process adds perturbation for each attribute value.
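To make the equi-depth binning and perturbation transformation concrete, the following Python sketch builds per-attribute bins and maps a raw value to its range based form. It is only an illustration under assumed data structures, not the authors' implementation; the names equidepth_bins and perturb and the example column are hypothetical.

```python
import numpy as np

def equidepth_bins(values, depth):
    """Split one attribute into equi-depth bins; return the right edge of
    each bin together with its mean and standard deviation (the perturbation)."""
    values = np.sort(np.asarray(values, dtype=float))
    bins = [values[i:i + depth] for i in range(0, len(values), depth)]
    edges = np.array([b[-1] for b in bins])        # right edge of each bin
    stats = [(b.mean(), b.std()) for b in bins]    # (mu, sigma) per bin
    return edges, stats

def perturb(value, edges, stats):
    """Map a raw attribute value to its range based value (lo, hi), using the
    standard deviation of the bin the value falls into (Equations 3 and 6)."""
    idx = min(int(np.searchsorted(edges, value)), len(stats) - 1)
    sigma = stats[idx][1]
    return (value - sigma, value + sigma)

# Example: transform one attribute column of a training table.
column = [0.5, 2.5, 0.7, 3.1, 4.8, 5.0, 1.2, 2.2, 3.9]
edges, stats = equidepth_bins(column, depth=3)
ranges = [perturb(v, edges, stats) for v in column]
```

Here np.searchsorted over the sorted bin edges plays the role of the hash lookup of Equation 3, and the per-bin standard deviation plays the role of σ in Equation 6.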
Issues with Discretization

Earlier approaches followed a simple discretization step to convert real valued attributes to ranges and mapped these ranges to consecutive integers. There are several issues with discretization. If the bin size is kept small, the number of partitions becomes very high and the ranges obtained do not capture the nature of the dataset effectively. Alternatively, if the bin size is large, two values of the same attribute positioned at opposite extremes of the same partition are treated as the same, even though they might have different contributions. The introduction of perturbation allows two different values of the same attribute belonging to the same partition to be mapped to different ranges. For example, consider a histogram interval 0-3, say for attribute A_1, with a standard deviation of 0.25, and consider two values belonging to this partition, say 0.5 and 2.5. Let the attribute value for the test record be 0.7. A simple discretization process will map 0.5, 2.5 and 0.7 to the interval 0-3 and replace these values with an integer, say 1. In other words, both 0.5 and 2.5 are considered equally similar to the test record's value (0.7). But with the perturbation mechanism, the similarity of 0.5 ± 0.25 (here the perturbation is σ = 0.25) is greater than that of 2.5 ± 0.25, as its range is closer to and intersects the test record's range (0.7 ± 0.25).

4.3 Transforming the Test Record

The same transformation (as for the training dataset) is applied to individual test records using the training data histograms.

4.4 Generating Perturbed Frequent Itemsets

To obtain PFIs we apply the modified Apriori algorithm outlined below.

Algorithm
1. Generate 2-itemsets.
2. Repeat till n-itemsets (where n is the number of predictor attributes):
   (a) Join Step
   (b) Prune Step
   (c) Record Track Step
   (d) for all candidates C_{i,j}: if count(C_{i,j}) ≥ minsupport, then Freq_itemset = Freq_itemset ∪ C_{i,j}

Generating 2-itemsets
1. for all training records r:
   (a) for each pair of attributes a_i and a_j:
       if test record's range(a_i) ∩ r's range(a_i) ≠ φ and test record's range(a_j) ∩ r's range(a_j) ≠ φ,
       then Candidate_{i,j} = Candidate_{i,j} ∪ r
2. for all candidates C_{i,j}:
   (a) if count(C_{i,j}) ≥ minsupport, then Freq_itemset = Freq_itemset ∪ C_{i,j}

The Join Step
1. for all pairs L1, L2 of Freq_itemset_{k-1}:
   (a) if L1.a_1 = L2.a_1 and L1.a_2 = L2.a_2 ... and L1.a_{k-2} = L2.a_{k-2} and L1.a_{k-1} ≠ L2.a_{k-1},
       then C_k = {a_1, a_2, ..., a_{k-2}, L1.a_{k-1}, L2.a_{k-1}}

The Prune Step
1. for all itemsets c ∈ C_k:
   (a) for each (k-1)-subset s of c:
       if s ∉ L_{k-1}, delete c from C_k

Record Track Step
1. for all itemsets c ∈ C_k:
   (a) for each (k-1)-subset s of c:
       for each record r contributing to the count of s, increment count(r) by 1
   (b) for all records r: if count(r) = k, keep track of record r

While developing the algorithm, we assume that the minimum contributing PFIs are perturbed frequent 2-itemsets. To obtain these itemsets we identify all attributes in the training dataset whose value ranges intersect with the test record's value ranges. Let there be k predictor attributes in the training dataset along with the class attribute, and let the i-th attribute be denoted by A_i. The candidate set is then the combination of all possible 2-itemsets and has cardinality C(k, 2). It can be represented as C_2 = {(A_1, A_2), (A_1, A_3), ..., (A_{k-1}, A_k)}, where C_2 refers to the length-2 candidate itemsets. A candidate itemset is formed if the range of each of its attributes in the training record intersects with the corresponding range of the test record. For instance, let the two attributes be A_0 and A_1, and let the values of those attributes for the j-th training record be a_j0 ± σ_j0 and a_j1 ± σ_j1 respectively. From Figure 4, we can conclude that for both attributes A_0 and A_1 the range based values of the test record and the training record intersect. Hence the count for the candidate itemset (A_0, A_1) is incremented by 1 and a track of the training record-id is kept. We note that a single training record may account for multiple candidate itemsets and contributes separately to the frequency of each such candidate itemset. Once the candidate itemsets have been constructed, we introduce a small prune step based on the minsupport threshold criterion, which is applied to the count of each candidate itemset. This mincount is defined as (minsupport / 100) × size(Currentdataset). The minsupport is available as a user parameter and is directly proportional to the degree of pruning. For a very high minsupport value very few candidate itemsets will survive.
A very low minsupport value, tending to 0, results in no pruning or elimination of candidate itemsets. An important aspect of the self-adjusting mincount value is that it prevents overfitting. The size of the Currentdataset is also variable in our procedure. For 2-itemsets, Currentdataset is initialized to the size of the training dataset. But for generating frequent itemsets of length > 2, only the distinct records which contribute towards the count of at least one perturbed frequent itemset are included in Currentdataset. A bookkeeping strategy for all the record ids contributing to the frequency of each PFI is followed, as highlighted in the record track step. Records which do not contribute towards any PFI are removed. Therefore, the value of Currentdataset does not remain the same and generally varies. As the length of the itemsets increases, for example from (A_0, A_1) to (A_0, A_1, A_2), i.e. from 2-itemsets to 3-itemsets, the size of Currentdataset decreases. The value of mincount adjusts accordingly and reduces. The first iteration, which calculates the frequent 2-itemsets, is the one involving the major computation.
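As a rough sketch of the candidate 2-itemset generation and the self-adjusting mincount just described, the snippet below checks, for every training record, whether its perturbed ranges intersect the test record's ranges for each pair of attributes. This is not the authors' code; the data layout (lists of (lo, hi) tuples) and function names are assumptions.

```python
from itertools import combinations

def overlaps(r1, r2):
    """True if two closed ranges (lo, hi) intersect."""
    return max(r1[0], r2[0]) <= min(r1[1], r2[1])

def frequent_2_itemsets(train_ranges, test_ranges, minsupport):
    """train_ranges: one list of (lo, hi) ranges per training record.
    test_ranges: the perturbed ranges of the test record.
    Returns {(i, j): supporting record ids} for candidates passing mincount."""
    n_attrs = len(test_ranges)
    candidates = {pair: set() for pair in combinations(range(n_attrs), 2)}
    for rid, rec in enumerate(train_ranges):
        for i, j in candidates:
            if overlaps(rec[i], test_ranges[i]) and overlaps(rec[j], test_ranges[j]):
                candidates[(i, j)].add(rid)
    # self-adjusting mincount: a percentage of the current dataset size
    mincount = (minsupport / 100.0) * len(train_ranges)
    return {pair: rids for pair, rids in candidates.items() if len(rids) >= mincount}
```

For longer itemsets, the same idea would be repeated over the surviving records only, so that mincount shrinks with the Currentdataset as described above.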

Figure 4: All possible range intersection cases

The Join Step

The join step is similar to the join step of the Apriori algorithm. Consider a candidate itemset of length r: C_{r,1} = {A_1, A_2, ..., A_{r-2}, P, Q}. Then C_{r-1,i} = {A_1, A_2, ..., A_{r-2}, P} and C_{r-1,j} = {A_1, A_2, ..., A_{r-2}, Q} are frequent itemsets of length r-1. A frequent itemset of length r-1 implies that r-1 predictor attributes obtained from the training records intersect with the respective attributes of the test record. While forming candidate itemsets of length r, we take any two frequent itemsets of length r-1 having exactly r-2 overlapping attributes in common. The possibilities of intersection for individual attribute values are shown in Figure 4. The two frequent itemsets of length r-1 contain a number of records with the same ids mapped to them, which percolate to the count of the candidate itemset C_{r,1}. Let us illustrate this with an example. Consider the candidate itemset C = (A_0, A_1, A_2). It can easily be visualized as being formed from the frequent itemsets (A_0, A_1) and (A_0, A_2). The former attribute, namely A_0, is common to both frequent 2-itemsets. The frequency of C is determined by the records which are present in both frequent itemsets.

The Prune Step

After obtaining the candidate itemsets from the above procedure, we apply a prune step similar to the Apriori approach.

The Record Track Step

There is a need to keep track of the records which contribute toward any frequent itemset, because all such records form the Currentdataset for the next iteration (i.e. from itemset length r-1 to r). From the pseudo code it can be observed that for a record to participate in the count of a frequent itemset, the record must contribute to the frequency of each of the subsets of that frequent itemset. For example, consider a record r participating in the frequent itemset (A_0, A_1, A_2). Then r_id ∈ (A_0, A_1)_map, r_id ∈ (A_0, A_2)_map and r_id ∈ (A_1, A_2)_map, where (A_i, A_j)_map represents the map between a frequent itemset and the set of all record ids contributing to that itemset.

4.5 Naive Probabilistic Estimation

Once we have obtained all possible frequent itemsets, the final task is the estimation of the class to which the test record belongs. We devise a formula which comprises two components.

1. For each frequent itemset (PFI) of length i ≥ 2, we keep track of all the records contributing to its count. These records may belong to different classes. For instance, let (A_0, A_1) be the I-th PFI of length 2 and let n_I be the number of records participating in the count of this PFI. Then n_I = Σ_{j ∈ C} n_Ij C_j, where C_j represents the j-th class out of the possible |C| classes. The contribution of the PFIs of length 2 is then defined as:

Contri(PFI_Ii) = (1 / N_i) Σ_{I ∈ Freq(i)} Σ_{j ∈ C} n_Ij C_j   (7)

where N_i is the size of the dataset for itemsets of length i and PFI_Ii represents the I-th PFI of length i.

2. The second part is the heuristics based rank associated with each PFI. We assign different ranks to itemsets of different lengths; however, the same weight is associated with itemsets of the same length. The allocation matches intuition: the greater the length of the PFI, the greater the similarity between the training set and the test record, and hence the larger its contribution towards classification.

Rank_i = (Σ_{p=1}^{i} p) / (Σ_{k=2}^{max} (max − k + 1))   (8)

where the numerator is the sum of all natural numbers from 1 to the length i and the denominator is a normalization constant.
Here max represents the maximum number of attributes in the dataset apart from the class attribute, since the largest PFI can only be of length max. The formula is similar to that used for assigning weights in k-nearest neighbor classification. Equation 7 is similar to the Laplacian operator mentioned in [7]. A deeper analysis of Equation 7 shows that it converges to 1 as the length of the PFIs increases. This conforms with the true nature of the problem, as records which are highly similar to a given test record are fewer in number and play a major role in deciding the class to which the record may belong. The overall formula for finding the class conditional probability of a record becomes

P(C|R) = Σ_{i=2}^{max} Contri(PFI_Ii) × Rank_i   (9)

where P(C|R) contains the contribution from all classes and can also be written as

P(C|R) = a_1 C_1 + a_2 C_2 + ... + a_t C_t   (10)

where a_i represents the contribution of all PFIs for class C_i and i varies from 1 to t. The sum of the coefficients a_i can be either greater than or less than 1; let this sum be denoted by S. To normalize, each coefficient is divided by S. This process converts the ratio into a probability measure. We select as the class for the test record the class label whose coefficient is maximum, i.e. argmax_i {a_1, a_2, ..., a_t}. This naive probabilistic classification technique is used for the Hist PERFICT algorithm. The importance of the result lies in the fact that a probabilistic estimate of the contribution of all the classes pertaining to a single test record is available at the end. This can be viewed as an addendum for detailed analysis and confidence towards classification. We now describe some of the problems with Hist PERFICT and the way their rectification leads to the HistSimilar PERFICT method.
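A minimal sketch of how the naive probabilistic estimation of Equations 7-9 could be combined is shown below. The normalisation constant inside rank() follows our reading of Equation 8, and all of the names (rank, classify, pfis, record_class) are hypothetical rather than taken from the paper.

```python
from collections import defaultdict

def rank(i, max_len):
    """Heuristic weight for PFIs of length i (Equation 8): the sum 1..i,
    divided by a normalisation constant that depends only on max_len."""
    norm = sum(max_len - k + 1 for k in range(2, max_len + 1))
    return sum(range(1, i + 1)) / norm

def classify(pfis, record_class, max_len):
    """pfis: {itemset (tuple of attribute indices): set of supporting record ids}.
    record_class: record id -> class label.
    Returns the predicted class and the per-class probability estimates."""
    by_len = defaultdict(list)
    for itemset, rids in pfis.items():
        by_len[len(itemset)].append(rids)

    scores = defaultdict(float)
    for i, groups in by_len.items():
        n_i = len({rid for rids in groups for rid in rids})  # dataset size at length i
        w = rank(i, max_len)
        for rids in groups:
            for rid in rids:                                  # per-class counts n_Ij
                scores[record_class[rid]] += w / max(n_i, 1)

    if not scores:
        return None, {}
    total = sum(scores.values())
    probs = {c: s / total for c, s in scores.items()}
    return max(probs, key=probs.get), probs
```

The final division by the total mirrors the normalization by S in Equation 10, so that the returned coefficients form a probability estimate per class.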

5. ISSUES WITH HIST PERFICT

5.1 Pruning

Quantitatively, the number of itemsets generated by the Apriori algorithm is huge and requires a pruning step. However, in the case of Hist PERFICT we generate the set of all possible PFIs without including an extra pruning step. Some of the itemsets have a high contribution in more than one class, which sometimes leads to misclassification. So we need a proper pruning step to make the classifier more effective.

5.2 Weights

As mentioned earlier, the rank or weight for itemsets of different lengths is different. However, another major issue with the Hist PERFICT approach is that we assign the same rank or weight to PFIs of the same length. This sometimes degrades the accuracy of classification, as is evident from the precision results (Table 3). We need a metric which can capture the range based nature of the PFIs effectively and provide weights accordingly. Both problems are resolved by means of our proposed MJ similarity metric.

6. HISTSIMILAR PERFICT

From Figures 1 and 2, we observe that an additional step involving a new similarity calculation is required. The criterion is presented as follows.

6.1 MJ Similarity Metric

We define a new similarity measure based on the simple though effective notion of area of overlap. Let us illustrate with an example. Assume that for a given test record the 1st attribute value is a_1 = 0.5 and the 2nd attribute value is a_2 = 0.6. During the transformation of the test record to perturbed range based values, we map a_1 to a histogram bin whose standard deviation is 0.3 and a_2 to a histogram bin whose standard deviation is 0.4, so the perturbed values for attributes 1 and 2 become a_1 ± σ_1 and a_2 ± σ_2. Now let there be a training record r for which r_1 = 0.5 and r_2 = 0.6. These values also map to histogram bins whose deviations are 0.3 and 0.4 respectively, so the perturbed training record values become r_1 ± σ_1 and r_2 ± σ_2, where σ_i represents the standard deviation. For attributes 1 and 2 of the record r,

AO = [(a_1 ± σ_1) ∩ (r_1 ± σ_1)] / σ_1 × [(a_2 ± σ_2) ∩ (r_2 ± σ_2)] / σ_2 × 2²   (11)

where AO represents the Area of Overlap; the intersection of the range based values for the 1st attribute leads to a similarity of 0.5 and the intersection of the range based values for the 2nd attribute leads to a similarity of 0.7. The similarity values are normalized by taking into account the σ value of the respective attribute histograms. It can be observed from the formula that taking only the product of the per-attribute similarities would decrease the overall similarity (multiplying one fraction with another fraction). So we introduce an additional multiplicative factor to obtain the overall area of overlap, which depends on the length of the itemset and is defined as (itemset length)^(itemset length). Figure 5 illustrates the example stated above.

Figure 5: Sample Area of Overlap

The formula for the Area of Overlap can be generalized for the j-th PFI of length k as:

AO_PFIjk = [ Π_{i=1}^{k} ((a_i ± σ_i) ∩ (r_i ± σ_i)) / σ_i ] × k^k   (12)

where a_i is the i-th attribute value of the test record, r_i is the i-th attribute value of a training record r, and σ_i is the deviation of the histogram partition to which the value maps. An important constraint imposed on the proposed similarity measure is that PFIs of larger length must be assigned greater weights than PFIs of smaller length. But if we directly use the AO based formula, this constraint can sometimes be violated.
Without loss of generality, consider a PFI of length 2, (A_1, A_2), from which another frequent itemset of length 3, (A_1, A_2, A_3), is obtained. If the intersection for the 3rd attribute is very small, then the Area of Overlap is reduced, and so the similarity of the length-3 PFI can be smaller than that of the length-2 PFI. In order to prevent such a situation, we add a term 2 × (itemset length) to the Area of Overlap after taking its logarithm (base 10). We also place the constraint that the intersection for each attribute (here the i-th attribute) must be ≥ 0.02 σ_i. By maintaining this criterion, the similarity value for a 3-itemset ≥ the similarity value for a 2-itemset.

MJ_PFIjk = 2k + log_10(AO_PFIjk)   (13)

where MJ_PFIjk represents the similarity value of the j-th PFI of length k with respect to a training record r. But there are several such records contributing to the count of PFI_jk, so we take the average of all such values as the similarity measure for the corresponding PFI. The MJ similarity criterion of intersection (≥ 0.02 σ) is included in the Generating Perturbed Frequent Itemsets step. The probabilistic estimation step is also modified by including the MJ criterion: in Equation (9) we replace the Rank term with the MJ similarity measure to arrive at the following formula used for classification.

P(C|R) = Σ_{i=2}^{max} Contri(PFI_Ii) × MJ_PFIIi   (14)

This changed probabilistic estimation step is used in the HistSimilar PERFICT and Randomized K-Means algorithms.
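The area-of-overlap and MJ similarity computations of Equations 12 and 13, including the per-attribute 0.02·σ pruning criterion, might be sketched as follows. Ranges are assumed to be (lo, hi) pairs and the function names are illustrative only, not the authors' code.

```python
import math

def area_of_overlap(test_ranges, train_ranges, sigmas):
    """Equation 12: product over attributes of (overlap length / sigma_i),
    scaled by k**k, where k is the itemset length. Returns None if any
    attribute violates the 0.02 * sigma intersection criterion."""
    k = len(test_ranges)
    ao = 1.0
    for (a_lo, a_hi), (r_lo, r_hi), sigma in zip(test_ranges, train_ranges, sigmas):
        overlap = min(a_hi, r_hi) - max(a_lo, r_lo)
        if sigma <= 0 or overlap < 0.02 * sigma:   # MJ pruning criterion
            return None
        ao *= overlap / sigma
    return ao * (k ** k)

def mj_similarity(test_ranges, train_ranges, sigmas):
    """Equation 13: MJ = 2*k + log10(AO); None means the itemset is pruned."""
    ao = area_of_overlap(test_ranges, train_ranges, sigmas)
    if ao is None:
        return None
    return 2 * len(test_ranges) + math.log10(ao)
```

In use, mj_similarity would be averaged over all training records supporting a given PFI, and a None result would drop the candidate itemset, which is exactly the implicit pruning discussed next.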

6.2 Benefits of MJ Similarity Measure

1. One benefit of using this similarity measure is that we assign different weights even to itemsets of the same length, based on their area of overlap. The larger the area of overlap, the larger the intersections, the greater the similarity and the greater the associated weight. This follows directly from proportionality.

2. Another advantage of this similarity measure is that it intrinsically provides the much needed pruning step. During the generation of the PFIs, at any stage, if the range of intersection for any attribute of a candidate itemset does not satisfy the MJ criterion, then that candidate itemset is immediately pruned. So an explicit pruning step is no longer necessary.

7. RANDOMIZED K-MEANS

7.1 Disadvantages of Histograms

There are certain limitations to the histogram based approach for preprocessing the data. The depth of the histogram for each attribute is kept variable, but histogram construction is a time consuming process. Further, the process of estimating the best depth for each attribute is manual and cannot easily be automated. Sometimes the variation in the attribute values is small, so that 3 or 4 partitions would be sufficient for such an attribute, but we end up with a larger number of bins. This problem can be solved by using a clustering technique for preprocessing.

7.2 K-Means Approach

The K-Means algorithm [9] is a highly popular clustering approach. It can be used very effectively to identify clusters of data which act as partitions for us. For our algorithm, we compute the k-means independently for each attribute, so the data points are all one dimensional. The purpose of this method is to minimize the Squared Sum Error (SSE), i.e. the distance of each data point from the k means.

7.3 Advantages of K-Means

The major contribution of the K-Means method to our approach is that we calculate the k-means separately for each attribute and the number of clusters is determined on the fly. We vary the number of clusters and identify the best clustering for each attribute based on a threshold

θ = (SSE_{k−1} − SSE_k) / k   (15)

where the numerator represents the difference in the squared sum error between k−1 clusters and k clusters and the denominator is the number of clusters. When the contribution of each cluster to the squared error is very small, i.e. less than the threshold θ, we stop. The approach allows a different number of clusters for different attributes and provides a better opportunity to capture the inherent similarity within the attributes of a dataset. We now take the perturbation to be the standard deviation of each cluster. The rest of the procedure is the same as that of the HistSimilar PERFICT algorithm, but the k-means approach is much faster in comparison to the histogram based approach. Taking advantage of this fact, the K-Means procedure is run multiple times for an accurate estimation of the clusters. The procedure is henceforth called Randomized K-Means PERFICT.
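A possible rendering of the per-attribute clustering with the stopping rule of Equation 15 is sketched below using scikit-learn's KMeans. The threshold value, the use of n_init restarts to mimic the randomized runs, and the function names are assumptions rather than the authors' choices.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_attribute(values, theta=0.05, max_k=15):
    """Grow k until the per-cluster drop in SSE, (SSE_{k-1} - SSE_k) / k,
    falls below theta (Equation 15). Returns the selected fitted model."""
    x = np.asarray(values, dtype=float).reshape(-1, 1)
    best = KMeans(n_clusters=1, n_init=10).fit(x)
    prev_sse = best.inertia_
    for k in range(2, max_k + 1):
        model = KMeans(n_clusters=k, n_init=10).fit(x)   # restarts ~ "randomized"
        if (prev_sse - model.inertia_) / k < theta:
            break
        prev_sse, best = model.inertia_, model
    return best

def cluster_sigmas(values, model):
    """Perturbation per cluster: the standard deviation of its member values."""
    x = np.asarray(values, dtype=float)
    labels = model.predict(x.reshape(-1, 1))
    return {int(c): float(x[labels == c].std()) for c in np.unique(labels)}
```

Each attribute column would be clustered independently, and the per-cluster standard deviations then play the role of the histogram-bin deviations in the transformation step.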
8. RESULTS AND ANALYSIS

We conducted our experiments over 12 datasets from the UCI repository. These datasets consist of real valued attributes which are generally continuous in nature. We conducted experiments over two different minsupport values (1 and 10) for the various datasets and selected the minsupport value for which the accuracy was best for the corresponding dataset. We varied the number of histogram bins and clusters from 3 to 15 and selected the setting for which accuracy was best. The 10-fold cross validation accuracies for associative classifiers like CBA, CPAR and CMAR, the PART algorithm, decision trees like C4.5, and RIPPER, along with the Naive Bayes algorithm, are presented in Table 3. For discretization of the real valued attributes for these classifiers we use the same entropy based technique as that used in the MLC++ library. Our primary concern is the accuracy of the PERFICT classifiers. Apart from the accuracy measure, we also give a detailed analysis of the effect of the MJ criterion for HistSimilar PERFICT and the Randomized K-Means algorithm. We have used the WEKA Toolkit and have implemented the PERFICT algorithms in C.

Table 3: Precision Results. Columns: Dataset, Hist PERFICT, HistSimilar PERFICT, K-Means PERFICT, CBA, CMAR, CPAR, RIPPER, J4.8, PART, Naive Bayes. Rows: breast-w, diabetes, ecoli, glass, heart, image, iris, pima, vehicle, vowel, waveform, wine. (The numeric accuracy entries are not reproduced in this transcription.)

8.1 Performance of PERFICT

We provide a brief analysis of the precision results presented in Table 3. The datasets are composed of real valued attributes. The Randomized K-Means PERFICT algorithm outperforms the other algorithms on 8 datasets and is among the top 3 on 11 datasets. The performance of K-Means PERFICT relative to the other algorithms is exceptional for the waveform, vehicle and ecoli datasets. The inclusion of the MJ similarity measure is primarily responsible for the high precision results: it helps to prune away the itemsets which are not essential for classification and gives each PFI the appropriate weight necessary for classification. The reason for the high success rate of Randomized K-Means is that it captures the noisy nature of the attributes. The superiority of Randomized K-Means over the HistSimilar and Hist PERFICT approaches is due to the fact that a variable number of points can belong to a cluster, as opposed to an equi-depth histogram: the size of a cluster is not fixed, while the frequency of each bin has to be the same in an equi-depth histogram. The necessity of having a variable number of clusters or bins can be seen from the fact that for the diabetes dataset the number of clusters or bins for each attribute is best set to 15, while for the ecoli dataset the number is 3, as there is little variation in the value of each feature. For datasets like image, vowel and wine, K-Means PERFICT is among the top 3 classifiers and hence is suitable for tasks like image recognition and handwriting recognition. While performing the complexity analysis for each of the proposed algorithms, we observe that the complexity of each PERFICT algorithm is the same as that of the Apriori algorithm. Datasets for image, signal and handwriting classification sometimes have a very large feature space, i.e. a large number of attributes. In order to reduce this feature space to low dimensions, we apply a data preprocessing technique like PCA (Principal Component Analysis) or LDA (Linear Discriminant Analysis). This reduces the number of attributes, making the various PERFICT approaches highly scalable.

8.2 Effects of MJ Criteria

For both the HistSimilar PERFICT and Randomized K-Means approaches, the application of the MJ criterion is one of the most important steps. We vary the MJ criterion as

MJ_j = σ_j / N   (16)

where MJ_j represents the threshold for the j-th dimension, σ_j represents the deviation of the partition to which the training value maps for the j-th attribute, and N is a natural number. The maximum overlap that can occur is equal to 2 σ_j, when the range based value of the training record r completely overlaps the test record values. As N is a natural number, the MJ criterion is always smaller than 1/2 of the maximum overlap. We vary N from one to one hundred to see the effect of various thresholds on accuracy.

Figure 6: Accuracy vs Overlap
Figure 7: Accuracy vs MinSupport

From Figure 6, we observe that the maximum variation in accuracy occurs for smaller values of N. The smaller the value of N, the higher the MJ criterion, the stricter the condition and the greater the pruning of the candidate itemsets. As can be observed, more pruning leads to under-fitting; the characteristics of the dataset are not captured, leading to lower accuracy. However, as N increases the MJ criterion is relaxed. We achieve optimal accuracy at some value, and then accuracy remains nearly constant for higher values of N.

8.3 Image Segmentation Data Analysis

The image segmentation problem is one of the oldest problems of pattern recognition. The dataset comprises instances drawn from a database of 7 outdoor images. The images are hand-segmented to create a classification for every pixel. There are 210 records in all, and each record has 19 continuous attributes apart from one class attribute. There are seven classes in total and the data is uniformly distributed among these classes, with 30 instances of each class. From Figure 7, we observe that the CBA and Hist PERFICT algorithms completely fail to build a proper classifier and their accuracy remains low as the minsupport value increases, whereas HistSimilar PERFICT and Randomized K-Means PERFICT perform well. From Table 3, it can be seen that the PART and J4.8 algorithms are better than PERFICT, but as minsupport is not a parameter for these classifiers, they are not considered in Figure 7. A detailed analysis reveals that the Hist PERFICT algorithm is not suitable for this dataset. Here the number of classes is very high and the contributing itemsets are of smaller length. Providing the same weight to different PFIs of the same

length therefore causes misclassification. The accuracy reaches a minimum for this approach. For the HistSimilar PERFICT algorithm the accuracy remains constant throughout; variation of the minsupport value has no effect on its accuracy. The maximum accuracy of the HistSimilar algorithm for this dataset is at 10 percent minsupport, and it can be observed that this accuracy was already achieved before 8 percent minsupport. However, it is the Randomized K-Means PERFICT algorithm which shows the better trend: its accuracy increases with the minsupport value and then reaches an optimum.

9. CONCLUSION

This paper presents a new classification approach developed on the concept of perturbed frequent itemsets and their probabilistic contribution to each class. The suggested work is computationally more efficient than other discretized rule generating algorithms such as CBA, CPAR and CMAR, and employs a new MJ similarity measure. The results obtained are accurate and effective, particularly for noisy datasets like waveform, diabetes and vehicle. Advantages of the various PERFICT algorithms include: (1) they capture the nature of noisy data easily; (2) the value of mincount is not fixed throughout, but dynamically adjusts with increasing length of the candidate itemsets; (3) the probabilistic contribution of each class towards the record is identified, which helps in understanding the maximum contribution of a particular class for a record. Thus, the proposed approach is novel and more efficient than most of the existing well established associative classification algorithms. In future work, we can incorporate another pruning step based on maximum likelihood techniques. Finally, an improvement that can be sought is building a classifier model, rather than running the entire algorithm for individual test records, although the latter is not very computationally intensive as it is.

10. REFERENCES

[1] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. In Proceedings of the ACM SIGMOD Conference on Management of Data, pages 80-86.
[2] R. Agrawal, T. Imielinski, and R. Srikant. Mining association rules between sets of items in large databases. In Proceedings of the ACM SIGMOD Conference on Management of Data.
[3] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proceedings of the International Conference on Very Large Data Bases, volume 20, September.
[4] K. Ali, S. Manganaris, and R. Srikant. Partial classification using association rules. In Proceedings of the International Conference on Knowledge Discovery and Data Mining (KDD), 3.
[5] R. Bayardo, R. Agrawal, and D. Gunopulos. Constraint-based rule mining in large databases. In Proceedings of the International Conference on Knowledge Discovery and Data Engineering, volume 15.
[6] G. Chen et al. A new approach to classification based on association rule mining. Decision Support Systems, 42.
[7] P. Clark and R. Boswell. Rule induction with CN2: Some recent improvements. In Proceedings of the 5th European Working Session on Learning.
[8] G. Dong, X. Zhang, L. Wong, and J. Li. Classification by aggregating emerging patterns. Discovery Science.
[9] R. Duda and P. Hart. Pattern Classification and Scene Analysis. John Wiley and Sons, New York.
[10] F. Thabtah, P. Cowling, and Y. Peng. MCAR: Multi-class classification based on association rule. In Proceedings of the IEEE International Conference on Computer Systems and Applications.
[11] T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Data mining using two-dimensional optimized association rules: Scheme, algorithms, and visualization. In Proceedings of SIGMOD.
[12] T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Mining optimized association rules for numeric attributes. Journal of Computer and System Sciences.
[13] T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Data mining with optimized two-dimensional association rules. ACM Transactions on Database Systems.
[14] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 1-12.
[15] J. Han, J. Pei, Y. Yin, and R. Mao. Mining frequent patterns without candidate generation: A frequent-pattern tree approach. Data Mining and Knowledge Discovery.
[16] W. Li, J. Han, and J. Pei. CMAR: Accurate and efficient classification based on multiple class-association rules. In Proceedings of ICDM.
[17] B. Liu, W. Hsu, and Y. Ma. Integrating classification and association rule mining. In Proceedings of KDD.
[18] H. Mannila and H. Toivonen. Discovering generalized episodes using minimum occurrences. In Proceedings of the International Conference on Knowledge Discovery and Data Mining (KDD-96), 2.
[19] N. Megiddo and R. Srikant. Discovering predictive association rules. In Proceedings of the International Conference on Knowledge Discovery and Data Mining (KDD), 4.
[20] R. T. Ng, L. Lakshmanan, J. Han, and A. Pang. Exploratory mining and pruning optimizations of constrained association rules. In Proceedings of SIGMOD.
[21] R. Srikant and R. Agrawal. Mining quantitative association rules in large relational tables. In Proceedings of SIGMOD.
[22] X. Yin and J. Han. CPAR: Classification based on predictive association rules. In Proceedings of SDM 2003, 2003.


More information

Categorization of Sequential Data using Associative Classifiers

Categorization of Sequential Data using Associative Classifiers Categorization of Sequential Data using Associative Classifiers Mrs. R. Meenakshi, MCA., MPhil., Research Scholar, Mrs. J.S. Subhashini, MCA., M.Phil., Assistant Professor, Department of Computer Science,

More information

Preprocessing of Stream Data using Attribute Selection based on Survival of the Fittest

Preprocessing of Stream Data using Attribute Selection based on Survival of the Fittest Preprocessing of Stream Data using Attribute Selection based on Survival of the Fittest Bhakti V. Gavali 1, Prof. Vivekanand Reddy 2 1 Department of Computer Science and Engineering, Visvesvaraya Technological

More information

Parallel Mining of Maximal Frequent Itemsets in PC Clusters

Parallel Mining of Maximal Frequent Itemsets in PC Clusters Proceedings of the International MultiConference of Engineers and Computer Scientists 28 Vol I IMECS 28, 19-21 March, 28, Hong Kong Parallel Mining of Maximal Frequent Itemsets in PC Clusters Vong Chan

More information

An Evolutionary Algorithm for Mining Association Rules Using Boolean Approach

An Evolutionary Algorithm for Mining Association Rules Using Boolean Approach An Evolutionary Algorithm for Mining Association Rules Using Boolean Approach ABSTRACT G.Ravi Kumar 1 Dr.G.A. Ramachandra 2 G.Sunitha 3 1. Research Scholar, Department of Computer Science &Technology,

More information

Mining Quantitative Maximal Hyperclique Patterns: A Summary of Results

Mining Quantitative Maximal Hyperclique Patterns: A Summary of Results Mining Quantitative Maximal Hyperclique Patterns: A Summary of Results Yaochun Huang, Hui Xiong, Weili Wu, and Sam Y. Sung 3 Computer Science Department, University of Texas - Dallas, USA, {yxh03800,wxw0000}@utdallas.edu

More information

CS570 Introduction to Data Mining

CS570 Introduction to Data Mining CS570 Introduction to Data Mining Frequent Pattern Mining and Association Analysis Cengiz Gunay Partial slide credits: Li Xiong, Jiawei Han and Micheline Kamber George Kollios 1 Mining Frequent Patterns,

More information

Mining Frequent Itemsets Along with Rare Itemsets Based on Categorical Multiple Minimum Support

Mining Frequent Itemsets Along with Rare Itemsets Based on Categorical Multiple Minimum Support IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 18, Issue 6, Ver. IV (Nov.-Dec. 2016), PP 109-114 www.iosrjournals.org Mining Frequent Itemsets Along with Rare

More information

Feature Selection Using Modified-MCA Based Scoring Metric for Classification

Feature Selection Using Modified-MCA Based Scoring Metric for Classification 2011 International Conference on Information Communication and Management IPCSIT vol.16 (2011) (2011) IACSIT Press, Singapore Feature Selection Using Modified-MCA Based Scoring Metric for Classification

More information

Memory issues in frequent itemset mining

Memory issues in frequent itemset mining Memory issues in frequent itemset mining Bart Goethals HIIT Basic Research Unit Department of Computer Science P.O. Box 26, Teollisuuskatu 2 FIN-00014 University of Helsinki, Finland bart.goethals@cs.helsinki.fi

More information

6. Dicretization methods 6.1 The purpose of discretization

6. Dicretization methods 6.1 The purpose of discretization 6. Dicretization methods 6.1 The purpose of discretization Often data are given in the form of continuous values. If their number is huge, model building for such data can be difficult. Moreover, many

More information

Chapter 6: Basic Concepts: Association Rules. Basic Concepts: Frequent Patterns. (absolute) support, or, support. (relative) support, s, is the

Chapter 6: Basic Concepts: Association Rules. Basic Concepts: Frequent Patterns. (absolute) support, or, support. (relative) support, s, is the Chapter 6: What Is Frequent ent Pattern Analysis? Frequent pattern: a pattern (a set of items, subsequences, substructures, etc) that occurs frequently in a data set frequent itemsets and association rule

More information

Mining N-most Interesting Itemsets. Ada Wai-chee Fu Renfrew Wang-wai Kwong Jian Tang. fadafu,

Mining N-most Interesting Itemsets. Ada Wai-chee Fu Renfrew Wang-wai Kwong Jian Tang. fadafu, Mining N-most Interesting Itemsets Ada Wai-chee Fu Renfrew Wang-wai Kwong Jian Tang Department of Computer Science and Engineering The Chinese University of Hong Kong, Hong Kong fadafu, wwkwongg@cse.cuhk.edu.hk

More information

To Enhance Projection Scalability of Item Transactions by Parallel and Partition Projection using Dynamic Data Set

To Enhance Projection Scalability of Item Transactions by Parallel and Partition Projection using Dynamic Data Set To Enhance Scalability of Item Transactions by Parallel and Partition using Dynamic Data Set Priyanka Soni, Research Scholar (CSE), MTRI, Bhopal, priyanka.soni379@gmail.com Dhirendra Kumar Jha, MTRI, Bhopal,

More information

COMPACT WEIGHTED CLASS ASSOCIATION RULE MINING USING INFORMATION GAIN

COMPACT WEIGHTED CLASS ASSOCIATION RULE MINING USING INFORMATION GAIN COMPACT WEIGHTED CLASS ASSOCIATION RULE MINING USING INFORMATION GAIN S.P.Syed Ibrahim 1 and K.R.Chandran 2 1 Assistant Professor, Department of Computer Science and Engineering, PSG College of Technology,

More information

Slides for Data Mining by I. H. Witten and E. Frank

Slides for Data Mining by I. H. Witten and E. Frank Slides for Data Mining by I. H. Witten and E. Frank 7 Engineering the input and output Attribute selection Scheme-independent, scheme-specific Attribute discretization Unsupervised, supervised, error-

More information

A mining method for tracking changes in temporal association rules from an encoded database

A mining method for tracking changes in temporal association rules from an encoded database A mining method for tracking changes in temporal association rules from an encoded database Chelliah Balasubramanian *, Karuppaswamy Duraiswamy ** K.S.Rangasamy College of Technology, Tiruchengode, Tamil

More information

CHAPTER 7 A GRID CLUSTERING ALGORITHM

CHAPTER 7 A GRID CLUSTERING ALGORITHM CHAPTER 7 A GRID CLUSTERING ALGORITHM 7.1 Introduction The grid-based methods have widely been used over all the algorithms discussed in previous chapters due to their rapid clustering results. In this

More information

Business Club. Decision Trees

Business Club. Decision Trees Business Club Decision Trees Business Club Analytics Team December 2017 Index 1. Motivation- A Case Study 2. The Trees a. What is a decision tree b. Representation 3. Regression v/s Classification 4. Building

More information

Pattern Mining. Knowledge Discovery and Data Mining 1. Roman Kern KTI, TU Graz. Roman Kern (KTI, TU Graz) Pattern Mining / 42

Pattern Mining. Knowledge Discovery and Data Mining 1. Roman Kern KTI, TU Graz. Roman Kern (KTI, TU Graz) Pattern Mining / 42 Pattern Mining Knowledge Discovery and Data Mining 1 Roman Kern KTI, TU Graz 2016-01-14 Roman Kern (KTI, TU Graz) Pattern Mining 2016-01-14 1 / 42 Outline 1 Introduction 2 Apriori Algorithm 3 FP-Growth

More information

Fast Fuzzy Clustering of Infrared Images. 2. brfcm

Fast Fuzzy Clustering of Infrared Images. 2. brfcm Fast Fuzzy Clustering of Infrared Images Steven Eschrich, Jingwei Ke, Lawrence O. Hall and Dmitry B. Goldgof Department of Computer Science and Engineering, ENB 118 University of South Florida 4202 E.

More information

Building Intelligent Learning Database Systems

Building Intelligent Learning Database Systems Building Intelligent Learning Database Systems 1. Intelligent Learning Database Systems: A Definition (Wu 1995, Wu 2000) 2. Induction: Mining Knowledge from Data Decision tree construction (ID3 and C4.5)

More information

CS Introduction to Data Mining Instructor: Abdullah Mueen

CS Introduction to Data Mining Instructor: Abdullah Mueen CS 591.03 Introduction to Data Mining Instructor: Abdullah Mueen LECTURE 8: ADVANCED CLUSTERING (FUZZY AND CO -CLUSTERING) Review: Basic Cluster Analysis Methods (Chap. 10) Cluster Analysis: Basic Concepts

More information

Association Rule Mining. Entscheidungsunterstützungssysteme

Association Rule Mining. Entscheidungsunterstützungssysteme Association Rule Mining Entscheidungsunterstützungssysteme Frequent Pattern Analysis Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set

More information

CHAPTER 3 A FAST K-MODES CLUSTERING ALGORITHM TO WAREHOUSE VERY LARGE HETEROGENEOUS MEDICAL DATABASES

CHAPTER 3 A FAST K-MODES CLUSTERING ALGORITHM TO WAREHOUSE VERY LARGE HETEROGENEOUS MEDICAL DATABASES 70 CHAPTER 3 A FAST K-MODES CLUSTERING ALGORITHM TO WAREHOUSE VERY LARGE HETEROGENEOUS MEDICAL DATABASES 3.1 INTRODUCTION In medical science, effective tools are essential to categorize and systematically

More information

Machine Learning with MATLAB --classification

Machine Learning with MATLAB --classification Machine Learning with MATLAB --classification Stanley Liang, PhD York University Classification the definition In machine learning and statistics, classification is the problem of identifying to which

More information

Data Mining: Concepts and Techniques. Chapter 5. SS Chung. April 5, 2013 Data Mining: Concepts and Techniques 1

Data Mining: Concepts and Techniques. Chapter 5. SS Chung. April 5, 2013 Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques Chapter 5 SS Chung April 5, 2013 Data Mining: Concepts and Techniques 1 Chapter 5: Mining Frequent Patterns, Association and Correlations Basic concepts and a road

More information

EXTRACTION OF RELEVANT WEB PAGES USING DATA MINING

EXTRACTION OF RELEVANT WEB PAGES USING DATA MINING Chapter 3 EXTRACTION OF RELEVANT WEB PAGES USING DATA MINING 3.1 INTRODUCTION Generally web pages are retrieved with the help of search engines which deploy crawlers for downloading purpose. Given a query,

More information

Machine Learning Techniques for Data Mining

Machine Learning Techniques for Data Mining Machine Learning Techniques for Data Mining Eibe Frank University of Waikato New Zealand 10/25/2000 1 PART VII Moving on: Engineering the input and output 10/25/2000 2 Applying a learner is not all Already

More information

Unsupervised Discretization using Tree-based Density Estimation

Unsupervised Discretization using Tree-based Density Estimation Unsupervised Discretization using Tree-based Density Estimation Gabi Schmidberger and Eibe Frank Department of Computer Science University of Waikato Hamilton, New Zealand {gabi, eibe}@cs.waikato.ac.nz

More information

Downloaded from

Downloaded from UNIT 2 WHAT IS STATISTICS? Researchers deal with a large amount of data and have to draw dependable conclusions on the basis of data collected for the purpose. Statistics help the researchers in making

More information

Eager, Lazy and Hybrid Algorithms for Multi-Criteria Associative Classification

Eager, Lazy and Hybrid Algorithms for Multi-Criteria Associative Classification Eager, Lazy and Hybrid Algorithms for Multi-Criteria Associative Classification Adriano Veloso 1, Wagner Meira Jr 1 1 Computer Science Department Universidade Federal de Minas Gerais (UFMG) Belo Horizonte

More information

Data Clustering With Leaders and Subleaders Algorithm

Data Clustering With Leaders and Subleaders Algorithm IOSR Journal of Engineering (IOSRJEN) e-issn: 2250-3021, p-issn: 2278-8719, Volume 2, Issue 11 (November2012), PP 01-07 Data Clustering With Leaders and Subleaders Algorithm Srinivasulu M 1,Kotilingswara

More information

CHAPTER VII INDEXED K TWIN NEIGHBOUR CLUSTERING ALGORITHM 7.1 INTRODUCTION

CHAPTER VII INDEXED K TWIN NEIGHBOUR CLUSTERING ALGORITHM 7.1 INTRODUCTION CHAPTER VII INDEXED K TWIN NEIGHBOUR CLUSTERING ALGORITHM 7.1 INTRODUCTION Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called cluster)

More information

COSC160: Detection and Classification. Jeremy Bolton, PhD Assistant Teaching Professor

COSC160: Detection and Classification. Jeremy Bolton, PhD Assistant Teaching Professor COSC160: Detection and Classification Jeremy Bolton, PhD Assistant Teaching Professor Outline I. Problem I. Strategies II. Features for training III. Using spatial information? IV. Reducing dimensionality

More information

Chapter 4 Data Mining A Short Introduction

Chapter 4 Data Mining A Short Introduction Chapter 4 Data Mining A Short Introduction Data Mining - 1 1 Today's Question 1. Data Mining Overview 2. Association Rule Mining 3. Clustering 4. Classification Data Mining - 2 2 1. Data Mining Overview

More information

Partition Based Perturbation for Privacy Preserving Distributed Data Mining

Partition Based Perturbation for Privacy Preserving Distributed Data Mining BULGARIAN ACADEMY OF SCIENCES CYBERNETICS AND INFORMATION TECHNOLOGIES Volume 17, No 2 Sofia 2017 Print ISSN: 1311-9702; Online ISSN: 1314-4081 DOI: 10.1515/cait-2017-0015 Partition Based Perturbation

More information

FEATURE SELECTION TECHNIQUES

FEATURE SELECTION TECHNIQUES CHAPTER-2 FEATURE SELECTION TECHNIQUES 2.1. INTRODUCTION Dimensionality reduction through the choice of an appropriate feature subset selection, results in multiple uses including performance upgrading,

More information

Study on Classifiers using Genetic Algorithm and Class based Rules Generation

Study on Classifiers using Genetic Algorithm and Class based Rules Generation 2012 International Conference on Software and Computer Applications (ICSCA 2012) IPCSIT vol. 41 (2012) (2012) IACSIT Press, Singapore Study on Classifiers using Genetic Algorithm and Class based Rules

More information

COMP 465: Data Mining Still More on Clustering

COMP 465: Data Mining Still More on Clustering 3/4/015 Exercise COMP 465: Data Mining Still More on Clustering Slides Adapted From : Jiawei Han, Micheline Kamber & Jian Pei Data Mining: Concepts and Techniques, 3 rd ed. Describe each of the following

More information

Graph Based Approach for Finding Frequent Itemsets to Discover Association Rules

Graph Based Approach for Finding Frequent Itemsets to Discover Association Rules Graph Based Approach for Finding Frequent Itemsets to Discover Association Rules Manju Department of Computer Engg. CDL Govt. Polytechnic Education Society Nathusari Chopta, Sirsa Abstract The discovery

More information

CHAPTER 3 ASSOCIATON RULE BASED CLUSTERING

CHAPTER 3 ASSOCIATON RULE BASED CLUSTERING 41 CHAPTER 3 ASSOCIATON RULE BASED CLUSTERING 3.1 INTRODUCTION This chapter describes the clustering process based on association rule mining. As discussed in the introduction, clustering algorithms have

More information

Frequent Itemsets Melange

Frequent Itemsets Melange Frequent Itemsets Melange Sebastien Siva Data Mining Motivation and objectives Finding all frequent itemsets in a dataset using the traditional Apriori approach is too computationally expensive for datasets

More information

Data Mining. Part 2. Data Understanding and Preparation. 2.4 Data Transformation. Spring Instructor: Dr. Masoud Yaghini. Data Transformation

Data Mining. Part 2. Data Understanding and Preparation. 2.4 Data Transformation. Spring Instructor: Dr. Masoud Yaghini. Data Transformation Data Mining Part 2. Data Understanding and Preparation 2.4 Spring 2010 Instructor: Dr. Masoud Yaghini Outline Introduction Normalization Attribute Construction Aggregation Attribute Subset Selection Discretization

More information

Mining Temporal Association Rules in Network Traffic Data

Mining Temporal Association Rules in Network Traffic Data Mining Temporal Association Rules in Network Traffic Data Guojun Mao Abstract Mining association rules is one of the most important and popular task in data mining. Current researches focus on discovering

More information

Nearest neighbor classification DSE 220

Nearest neighbor classification DSE 220 Nearest neighbor classification DSE 220 Decision Trees Target variable Label Dependent variable Output space Person ID Age Gender Income Balance Mortgag e payment 123213 32 F 25000 32000 Y 17824 49 M 12000-3000

More information

Decision Trees Dr. G. Bharadwaja Kumar VIT Chennai

Decision Trees Dr. G. Bharadwaja Kumar VIT Chennai Decision Trees Decision Tree Decision Trees (DTs) are a nonparametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target

More information

DESIGN AND EVALUATION OF MACHINE LEARNING MODELS WITH STATISTICAL FEATURES

DESIGN AND EVALUATION OF MACHINE LEARNING MODELS WITH STATISTICAL FEATURES EXPERIMENTAL WORK PART I CHAPTER 6 DESIGN AND EVALUATION OF MACHINE LEARNING MODELS WITH STATISTICAL FEATURES The evaluation of models built using statistical in conjunction with various feature subset

More information

Unsupervised Learning

Unsupervised Learning Outline Unsupervised Learning Basic concepts K-means algorithm Representation of clusters Hierarchical clustering Distance functions Which clustering algorithm to use? NN Supervised learning vs. unsupervised

More information

Chapter 4: Mining Frequent Patterns, Associations and Correlations

Chapter 4: Mining Frequent Patterns, Associations and Correlations Chapter 4: Mining Frequent Patterns, Associations and Correlations 4.1 Basic Concepts 4.2 Frequent Itemset Mining Methods 4.3 Which Patterns Are Interesting? Pattern Evaluation Methods 4.4 Summary Frequent

More information

Clustering Billions of Images with Large Scale Nearest Neighbor Search

Clustering Billions of Images with Large Scale Nearest Neighbor Search Clustering Billions of Images with Large Scale Nearest Neighbor Search Ting Liu, Charles Rosenberg, Henry A. Rowley IEEE Workshop on Applications of Computer Vision February 2007 Presented by Dafna Bitton

More information

Tendency Mining in Dynamic Association Rules Based on SVM Classifier

Tendency Mining in Dynamic Association Rules Based on SVM Classifier Send Orders for Reprints to reprints@benthamscienceae The Open Mechanical Engineering Journal, 2014, 8, 303-307 303 Open Access Tendency Mining in Dynamic Association Rules Based on SVM Classifier Zhonglin

More information

Improving the Efficiency of Fast Using Semantic Similarity Algorithm

Improving the Efficiency of Fast Using Semantic Similarity Algorithm International Journal of Scientific and Research Publications, Volume 4, Issue 1, January 2014 1 Improving the Efficiency of Fast Using Semantic Similarity Algorithm D.KARTHIKA 1, S. DIVAKAR 2 Final year

More information

MSA220 - Statistical Learning for Big Data

MSA220 - Statistical Learning for Big Data MSA220 - Statistical Learning for Big Data Lecture 13 Rebecka Jörnsten Mathematical Sciences University of Gothenburg and Chalmers University of Technology Clustering Explorative analysis - finding groups

More information