PERFICT : Perturbed Frequent Itemset based Classification Technique


Raghvendra Mall (IIIT Hyderabad, students.iiit.ac.in), Prakhar Jain (IIIT Hyderabad, students.iiit.ac.in), Vikram Pudi (IIIT Hyderabad, iiit.ac.in)

ABSTRACT

This paper presents the Perturbed Frequent Itemset based Classification Technique (PERFICT), a novel associative classification approach based on perturbed frequent itemsets. Most existing associative classifiers work well on transactional data where each record contains a set of boolean items; they are generally not very effective on relational data, which typically contains real valued attributes. In PERFICT, we handle real attributes by treating items as (attribute, value) pairs, where the value is not the original one but is perturbed by a small amount and becomes a range based value. We also propose our own similarity measure, which captures the nature of real valued attributes and provides effective weights for the itemsets. The probabilistic contributions of different itemsets are taken into consideration during classification. Applications where such a technique is useful include signal classification, medical diagnosis and handwriting recognition. Experiments conducted on UCI Repository datasets show that PERFICT is highly competitive in terms of accuracy in comparison with popular associative classification methods.

1. INTRODUCTION

Classification of real world data is an important aspect of data mining that aims at predicting group membership for data instances. Starting with the seminal work over a decade ago [17], several classification approaches based on association rules have emerged. The foundation of the various rule based approaches is the Apriori [2, 3] and FP-Tree [15] algorithms, which have been studied extensively and applied to machine learning, data mining and many other problem domains [5, 18, 4, 19, 1]. The two thresholds used for selection of rules are min-support and min-confidence. These parameters are kept fixed while generating the rules, so an overhead of rule weighting and selection of the n best rules is required for classification purposes. Most of the recent associative classifiers, such as CBA [17], CMAR [16], MCAR [10], CPAR [22] and GARC [6], work well on transactional data where each record contains a set of boolean items. There remains scope for more efficient handling of continuous attributes.

In this paper, a novel associative classification procedure, namely the PERturbed Frequent Itemset based Classification Technique (PERFICT), is proposed. Our algorithm explicitly and effectively handles real valued attributes by means of Perturbed Frequent Itemsets (PFIs). A new MJ similarity measure is also proposed, which regulates the selection of the PFIs and helps weigh the PFIs during classification. A rule selection process is not required, and the pruned PFIs are used for a probabilistic estimate of the classes. We propose three different methods:

1. A naive histogram based approach (Hist PERFICT).
2. A histogram based approach with the similarity measure (HistSimilar PERFICT).
3. A randomized clustering method including the similarity measure (K-Means PERFICT).

Experimental evaluation of our algorithms on standard UCI datasets shows that they perform better than most recent state-of-the-art associative classifiers. Randomized K-Means PERFICT outperforms HistSimilar PERFICT and Hist PERFICT in most cases.
Our contributions include:
- Handling noisy data and the problem of exact matches in an effective manner using the notion of perturbation.
- Introduction of a new MJ similarity measure for weighing and pruning itemsets.
- Use of a self-adjusting mincount value for pruning perturbed frequent itemsets.
- Identifying drawbacks of the standard discretization method and avoiding it through a preprocessing step.

In Section 2, we describe related work in the field of associative classifiers. The next section presents the concepts and definitions used in our approach. Section 4 outlines the three different PERFICT algorithms. This is followed by a discussion of issues with the Hist PERFICT algorithm. Section 6 describes the HistSimilar PERFICT algorithm. Section 7 covers Randomized K-Means PERFICT, which is followed by results and analysis in Section 8. Finally, we conclude in Section 9.

2. RELATED WORK

Recent state of the art has exploited the paradigm of association rule mining for solving the problem of classification. These methods work on the principle of mining association rules to build classifiers [8]. Advantages of these approaches include: (1) frequent itemsets capture all dominant relationships between items in a dataset; (2) efficient itemset mining algorithms exist, resulting in highly scalable classifiers; (3) these classifiers naturally handle missing values and outliers as they only deal with statistically significant associations, a property that translates well into robustness; (4) extensive experimental statistics show that these techniques are less error prone.

However, these associative classifiers suffer from certain drawbacks. Though they provide more rules and information, the redundancy involved in the rules increases the cost, in terms of time and computational complexity, during the process of classification. MCAR [10] determines a redundant rule by checking whether or not it covers instances in the training data set. GARC [6] brought in the notion of a compact set to shrink the rule set by converting it to a compact one. Since the reduction of redundant rules requires a brute force technique, it fails to avoid some meaningless searching. Second, as rule generation in associative classification is based on frequent pattern mining, when the size of the data set grows the time cost of frequent pattern mining may increase sharply, which may be an inherent limitation of associative classification. The FP-Growth technique [14] used in CMAR [16] has proved to be very efficient, but extra time must be spent to compute the support and confidence of rules by scanning the data set again; the cost problem thus remains unsolved. These algorithms also have a major drawback: generation of rules by exact matches, irrespective of categorical or numeric attributes. This causes a problem in most real world scenarios because records that contain nearly similar values for a real valued attribute should support the same rule. Due to discretized matches, the algorithms do not always generate the required rule. Approaches like [11], [12], [13] and [20] perform numeric attribute based optimized rule mining, but it is difficult to handle noisy data with them. They use computational geometry to determine areas of confidence and use these as the primary criterion. However, similarity of the generated association rules with the given test record is not emphasized, and thus a new similarity measure is required.

3. BASIC CONCEPTS AND DEFINITIONS

Without loss of generality, we assume that our input data is in the form of a relational table whose attributes are {A_1, A_2, A_3, ..., A_n, C}, where C is the class attribute. We use the term item to refer to an attribute-value pair (A_i, a_i), where a_i is the value of an attribute A_i which is not a class attribute. For brevity, we also simply use a_i to refer to the item (A_i, a_i). Each record in the input relational table then contains a set of items I = {a_1, a_2, a_3, ..., a_n}. An itemset T is defined as T ⊆ I. A frequent itemset is an itemset whose support (i.e. frequency) is greater than some user-specified minimum support threshold. We allow for different thresholds depending on the length of itemsets, to account for the fact that itemsets of larger length naturally have lower supports. Let Min_k denote the minimum support, where k is the length of the corresponding itemset.

Figure 1: Hist PERFICT
Use of frequent itemsets for numeric real-world data is not appropriate as exact matches for attribute values might not exist. Instead, we use the notion of perturbation, a term used to convey the disturbance of a value from its mean position. Perturbation represents the noise in the values of the attributes of the items and effectively converts items to ranges. For instance, given an itemset T with attribute values a_v1, a_v2 and a_v3, the perturbed frequent itemset PFI_T will look like

PFI_T = {a_v1 ± σ_1, a_v2 ± σ_2, a_v3 ± σ_3}   (1)

4. THE PERFICT ALGORITHMS

The PERFICT algorithms are based on the principle of weighted probabilistic contribution of the Perturbed Frequent Itemsets. One advantage of this procedure over other associative classifiers is that there is no rule generating step in the PERFICT algorithms. The basic concept employed here is: the larger the length of a perturbed frequent itemset, the greater the similarity between a given test record and the training records containing those PFIs. Here we outline the general structure of the three PERFICT algorithms. From Figures 1, 2 and 3, we observe that several steps are common to the three procedures. We first explain all the steps of the Hist PERFICT algorithm in detail and subsequently explain the remaining steps of HistSimilar PERFICT and Randomized K-Means PERFICT.

Figure 2: HistSimilar PERFICT
Figure 3: K-Means PERFICT

4.1 Preprocessing Techniques

For associative classifiers, it has been observed that a preprocessing step is needed in which real valued attributes are discretized. In our approach, the concept of perturbation appropriately assigns ranges to these attribute values, eliminating the need for discretization.

Histogram Construction

A histogram is a frequency chart with non-overlapping adjacent intervals calculated upon the values of some variable. Mathematically, if n is the total number of observed values and k is the total number of partitions, the histogram counts m_i must meet the following condition:

n = Σ_{i=1}^{k} m_i   (2)

There are several kinds of histograms, but two types were best suited for our approach: the equi-width histogram and the equi-depth histogram. Equi-width histograms have all partitions of the same size; the size of each partition is the important variable here. Equi-depth histograms are based on the concept of equal frequency, i.e. an equal number of values in each partition; the parameter involved is the number of values falling into the different sized partitions. Equi-depth histograms are better suited for classification because they capture the intrinsic nature of the random variable or attribute being observed. Moreover, such a histogram is not affected by the presence of outliers. An equi-width histogram, on the other hand, is highly affected by outliers and is a weak choice for the purpose of classification. There exists an approach in [21] which partitions the values of quantitative attributes into equi-depth intervals. Its underlying principle is that of partial completeness. That approach measures the information loss due to the formation of rules obtained by considering ranges over partitions of quantitative attributes. However, the proposed approach works differently and prevents any information loss.

4.2 Transforming the Training Data Set

From Figure 1, it can be seen that the PERFICT algorithms include transforming the training dataset. Equi-depth histograms are constructed for each attribute with a variable depth value, and the standard deviation of each such partition is computed as well. Let us assume there are k attributes apart from the class attribute in the training set. In order to convert the (attribute, value) pair a_i to a_i ± σ_i we need a transformation. To obtain these ranges we use the histogram constructed above. Each attribute value of a training record is mapped to the corresponding histogram bin using a range query from the hash table of histograms. The attribute value is transformed to the original value ± the standard deviation of all the values in the mapped bin. The perturbation is defined as the standard deviation of all the values of an attribute that are initially hashed into that partition. Let a_ik represent the value of the i-th attribute corresponding to the k-th record. The histogram bins for the i-th attribute are represented as h_i1, h_i2, h_i3, ..., h_ip. As we are using equi-depth histograms, each partition has the same number of values, say n. Suppose a_ik maps to h_i3. Then

h_i3 = hash(a_ik)   (3)

µ_hi3 = (Σ_{j=1}^{n} a_ij) / n   (4)

σ_i3 = √( Σ_{j=1}^{n} (a_ij − µ_hi3)² )   (5)

a_ik = a_ik ± σ_i3   (6)

where µ_hi3 represents the mean value for the histogram bin h_i3 and σ_i3 represents the standard deviation of that histogram partition.

Table 1: Dataset before transformation
S. No | A#1  | A#2  | A#3  | Class
1     | v_11 | v_12 | v_13 | C_1
2     | v_21 | v_22 | v_23 | C_2
3     | v_31 | v_32 | v_33 | C_1

Table 2: Same dataset after transformation
S. No | A#1         | A#2         | A#3         | Class
1     | v_11 ± σ_11 | v_12 ± σ_12 | v_13 ± σ_13 | C_1
2     | v_21 ± σ_21 | v_22 ± σ_22 | v_23 ± σ_23 | C_2
3     | v_31 ± σ_31 | v_32 ± σ_32 | v_33 ± σ_33 | C_1

It can be observed from Table 2 that each attribute value is replaced by a ranged value. This process adds perturbation for each attribute value.
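To make the equi-depth binning and perturbation transformation concrete, the following Python sketch builds per-attribute bins and maps a raw value to its range based form. It is only an illustration under assumed data structures, not the authors' implementation; the names equidepth_bins and perturb and the example column are hypothetical.

```python
import numpy as np

def equidepth_bins(values, depth):
    """Split one attribute into equi-depth bins; return the right edge of
    each bin together with its mean and standard deviation (the perturbation)."""
    values = np.sort(np.asarray(values, dtype=float))
    bins = [values[i:i + depth] for i in range(0, len(values), depth)]
    edges = np.array([b[-1] for b in bins])        # right edge of each bin
    stats = [(b.mean(), b.std()) for b in bins]    # (mu, sigma) per bin
    return edges, stats

def perturb(value, edges, stats):
    """Map a raw attribute value to its range based value (lo, hi), using the
    standard deviation of the bin the value falls into (Equations 3 and 6)."""
    idx = min(int(np.searchsorted(edges, value)), len(stats) - 1)
    sigma = stats[idx][1]
    return (value - sigma, value + sigma)

# Example: transform one attribute column of a training table.
column = [0.5, 2.5, 0.7, 3.1, 4.8, 5.0, 1.2, 2.2, 3.9]
edges, stats = equidepth_bins(column, depth=3)
ranges = [perturb(v, edges, stats) for v in column]
```

Here np.searchsorted over the sorted bin edges plays the role of the hash lookup of Equation 3, and the per-bin standard deviation plays the role of σ in Equation 6.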
Issues with Discretization

Earlier approaches followed a simple discretization step to convert real valued attributes to ranges and mapped these ranges to consecutive integers. There are several issues with discretization. If the bin size is kept small, the number of partitions becomes very high and the ranges obtained do not capture the nature of the dataset effectively. Alternatively, if the bin size is large, two values of the same attribute positioned at opposite extremes of the same partition are treated as the same, even though they might have different contributions. The introduction of perturbation allows two different values of the same attribute belonging to the same partition to be mapped to different ranges. For example, consider a histogram interval 0-3, say for attribute A_1, with a standard deviation of 0.25, and consider two values belonging to this partition, say 0.5 and 2.5. Let the attribute value for the test record be 0.7. A simple discretization process will map 0.5, 2.5 and 0.7 to the interval 0-3 and replace these values with an integer, say 1. In other words, both 0.5 and 2.5 are considered equally similar to the test record's value (0.7). But with the perturbation mechanism, the similarity of 0.5 ± 0.25 (here the perturbation is σ = 0.25) is greater than that of 2.5 ± 0.25, as its range is closer to and intersects the test record's range (0.7 ± 0.25).

4.3 Transforming the Test Record

The same transformation (as for the training dataset) is applied to individual test records using the training data histograms.

4.4 Generating Perturbed Frequent Itemsets

To obtain PFIs we apply the modified Apriori algorithm outlined below.

Algorithm
1. Generate 2-itemsets.
2. Repeat till n-itemsets (where n is the number of predictor attributes):
   (a) Join Step
   (b) Prune Step
   (c) Record Track Step
   (d) for all candidates C_{i,j}: if count(C_{i,j}) ≥ minsupport, then Freq_itemset = Freq_itemset ∪ C_{i,j}

Generating 2-itemsets
1. for all training records r:
   (a) for each pair of attributes a_i and a_j:
       if test record's range(a_i) ∩ r's range(a_i) ≠ φ and test record's range(a_j) ∩ r's range(a_j) ≠ φ,
       then Candidate_{i,j} = Candidate_{i,j} ∪ r
2. for all candidates C_{i,j}:
   (a) if count(C_{i,j}) ≥ minsupport, then Freq_itemset = Freq_itemset ∪ C_{i,j}

The Join Step
1. for all pairs L1, L2 of Freq_itemset_{k-1}:
   (a) if L1.a_1 = L2.a_1 and L1.a_2 = L2.a_2 ... and L1.a_{k-2} = L2.a_{k-2} and L1.a_{k-1} ≠ L2.a_{k-1},
       then C_k = {a_1, a_2, ..., a_{k-2}, L1.a_{k-1}, L2.a_{k-1}}

The Prune Step
1. for all itemsets c ∈ C_k:
   (a) for each (k-1)-subset s of c:
       if s ∉ L_{k-1}, delete c from C_k

Record Track Step
1. for all itemsets c ∈ C_k:
   (a) for each (k-1)-subset s of c:
       for each record r contributing to the count of s, increment count(r) by 1
   (b) for all records r: if count(r) = k, keep track of record r

While developing the algorithm, we assume that the minimum contributing PFIs are perturbed frequent 2-itemsets. To obtain these itemsets we identify all attributes in the training dataset whose value ranges intersect with the test record's value ranges. Let there be k predictor attributes in the training dataset along with the class attribute, and let the i-th attribute be denoted by A_i. The candidate set is then the combination of all possible 2-itemsets and has cardinality C(k, 2). It can be represented as C_2 = {(A_1, A_2), (A_1, A_3), ..., (A_{k-1}, A_k)}, where C_2 refers to the length-2 candidate itemsets. A candidate itemset is formed if the range of each of its attributes in the training record intersects with the corresponding range of the test record. For instance, let the two attributes be A_0 and A_1, and let the values of those attributes for the j-th training record be a_j0 ± σ_j0 and a_j1 ± σ_j1 respectively. From Figure 4, we can conclude that for both attributes A_0 and A_1 the range based values of the test record and the training record intersect. Hence the count for the candidate itemset (A_0, A_1) is incremented by 1 and a track of the training record-id is kept. We note that a single training record may account for multiple candidate itemsets and contributes separately to the frequency of each such candidate itemset. Once the candidate itemsets have been constructed, we introduce a small prune step based on the minsupport threshold criterion, which is applied to the count of each candidate itemset. This mincount is defined as (minsupport / 100) × size(Currentdataset). The minsupport is available as a user parameter and is directly proportional to the degree of pruning. For a very high minsupport value very few candidate itemsets will survive.
A very low minsupport value, tending to 0, results in no pruning or elimination of candidate itemsets. An important aspect of the self-adjusting mincount value is that it prevents overfitting. The size of the Currentdataset is also variable in our procedure. For 2-itemsets, Currentdataset is initialized to the size of the training dataset. But for generating frequent itemsets of length > 2, only the distinct records which contribute towards the count of at least one perturbed frequent itemset are included in Currentdataset. A bookkeeping strategy for all the record ids contributing to the frequency of each PFI is followed, as highlighted in the record track step. Records which do not contribute towards any PFI are removed. Therefore, the value of Currentdataset does not remain the same and generally varies. As the length of the itemsets increases, for example from (A_0, A_1) to (A_0, A_1, A_2), i.e. from 2-itemsets to 3-itemsets, the size of Currentdataset decreases. The value of mincount adjusts accordingly and reduces. The first iteration, which calculates the frequent 2-itemsets, is the one involving the major computation.
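As a rough sketch of the candidate 2-itemset generation and the self-adjusting mincount just described, the snippet below checks, for every training record, whether its perturbed ranges intersect the test record's ranges for each pair of attributes. This is not the authors' code; the data layout (lists of (lo, hi) tuples) and function names are assumptions.

```python
from itertools import combinations

def overlaps(r1, r2):
    """True if two closed ranges (lo, hi) intersect."""
    return max(r1[0], r2[0]) <= min(r1[1], r2[1])

def frequent_2_itemsets(train_ranges, test_ranges, minsupport):
    """train_ranges: one list of (lo, hi) ranges per training record.
    test_ranges: the perturbed ranges of the test record.
    Returns {(i, j): supporting record ids} for candidates passing mincount."""
    n_attrs = len(test_ranges)
    candidates = {pair: set() for pair in combinations(range(n_attrs), 2)}
    for rid, rec in enumerate(train_ranges):
        for i, j in candidates:
            if overlaps(rec[i], test_ranges[i]) and overlaps(rec[j], test_ranges[j]):
                candidates[(i, j)].add(rid)
    # self-adjusting mincount: a percentage of the current dataset size
    mincount = (minsupport / 100.0) * len(train_ranges)
    return {pair: rids for pair, rids in candidates.items() if len(rids) >= mincount}
```

For longer itemsets, the same idea would be repeated over the surviving records only, so that mincount shrinks with the Currentdataset as described above.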

Figure 4: All possible range intersection cases

The Join Step

The join step is similar to the join step of the Apriori algorithm. Consider a candidate itemset of length r: C_{r,1} = {A_1, A_2, ..., A_{r-2}, P, Q}. Then C_{r-1,i} = {A_1, A_2, ..., A_{r-2}, P} and C_{r-1,j} = {A_1, A_2, ..., A_{r-2}, Q} are frequent itemsets of length r-1. A frequent itemset of length r-1 implies that r-1 predictor attributes obtained from the training records intersect with the respective attributes of the test record. While forming candidate itemsets of length r, we take any two frequent itemsets of length r-1 having exactly r-2 overlapping attributes in common. The possibilities of intersection for individual attribute values are shown in Figure 4. The two frequent itemsets of length r-1 contain a number of records with the same ids mapped to them, which percolate to the count of the candidate itemset C_{r,1}. Let us illustrate this with an example. Consider the candidate itemset C = (A_0, A_1, A_2). It can easily be visualized as being formed from the frequent itemsets (A_0, A_1) and (A_0, A_2). The former attribute, namely A_0, is common to both frequent 2-itemsets. The frequency of C is determined by the records which are present in both frequent itemsets.

The Prune Step

After obtaining the candidate itemsets from the above procedure, we apply a prune step similar to the Apriori approach.

The Record Track Step

There is a need to keep track of the records which contribute toward any frequent itemset, because all such records form the Currentdataset for the next iteration (i.e. from itemset length r-1 to r). From the pseudo code it can be observed that for a record to participate in the count of a frequent itemset, the record must contribute to the frequency of each of the subsets of that frequent itemset. For example, consider a record r participating in the frequent itemset (A_0, A_1, A_2). Then r_id ∈ (A_0, A_1)_map, r_id ∈ (A_0, A_2)_map and r_id ∈ (A_1, A_2)_map, where (A_i, A_j)_map represents the map between a frequent itemset and the set of all record ids contributing to that itemset.

4.5 Naive Probabilistic Estimation

Once we have obtained all possible frequent itemsets, the final task is the estimation of the class to which the test record belongs. We devise a formula which comprises two components.

1. For each frequent itemset (PFI) of length i ≥ 2, we keep track of all the records contributing to its count. These records may belong to different classes. For instance, let (A_0, A_1) be the I-th PFI of length 2 and let n_I be the number of records participating in the count of this PFI. Then n_I = Σ_{j ∈ C} n_Ij C_j, where C_j represents the j-th class out of the possible |C| classes. The contribution of the PFIs of length 2 is then defined as:

Contri(PFI_Ii) = (1 / N_i) Σ_{I ∈ Freq(i)} Σ_{j ∈ C} n_Ij C_j   (7)

where N_i is the size of the dataset for itemsets of length i and PFI_Ii represents the I-th PFI of length i.

2. The second part is the heuristics based rank associated with each PFI. We assign different ranks to itemsets of different lengths; however, the same weight is associated with itemsets of the same length. The allocation matches intuition: the greater the length of the PFI, the greater the similarity between the training set and the test record, and hence the larger its contribution towards classification.

Rank_i = (Σ_{p=1}^{i} p) / (Σ_{k=2}^{max} (max − k + 1))   (8)

where the numerator is the sum of all natural numbers from 1 to the length i and the denominator is a normalization constant.
Here max represents the maximum number of attributes in the dataset apart from the class attribute, since the largest PFI can only be of length max. The formula is similar to that used for assigning weights in k-nearest neighbor classification. Equation 7 is similar to the Laplacian operator mentioned in [7]. A deeper analysis of Equation 7 shows that it converges to 1 as the length of the PFIs increases. This conforms with the true nature of the problem, as records which are highly similar to a given test record are fewer in number and play a major role in deciding the class to which the record may belong. The overall formula for finding the class conditional probability of a record becomes

P(C|R) = Σ_{i=2}^{max} Contri(PFI_Ii) × Rank_i   (9)

where P(C|R) contains the contribution from all classes and can also be written as

P(C|R) = a_1 C_1 + a_2 C_2 + ... + a_t C_t   (10)

where a_i represents the contribution of all PFIs for class C_i and i varies from 1 to t. The sum of the coefficients a_i can be either greater than or less than 1; let this sum be denoted by S. To normalize, each coefficient is divided by S. This process converts the ratio into a probability measure. We select as the class for the test record the class label whose coefficient is maximum, i.e. argmax_i {a_1, a_2, ..., a_t}. This naive probabilistic classification technique is used for the Hist PERFICT algorithm. The importance of the result lies in the fact that a probabilistic estimate of the contribution of all the classes pertaining to a single test record is available at the end. This can be viewed as an addendum for detailed analysis and confidence towards classification. We now describe some of the problems with Hist PERFICT and the way their rectification leads to the HistSimilar PERFICT method.
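A minimal sketch of how the naive probabilistic estimation of Equations 7-9 could be combined is shown below. The normalisation constant inside rank() follows our reading of Equation 8, and all of the names (rank, classify, pfis, record_class) are hypothetical rather than taken from the paper.

```python
from collections import defaultdict

def rank(i, max_len):
    """Heuristic weight for PFIs of length i (Equation 8): the sum 1..i,
    divided by a normalisation constant that depends only on max_len."""
    norm = sum(max_len - k + 1 for k in range(2, max_len + 1))
    return sum(range(1, i + 1)) / norm

def classify(pfis, record_class, max_len):
    """pfis: {itemset (tuple of attribute indices): set of supporting record ids}.
    record_class: record id -> class label.
    Returns the predicted class and the per-class probability estimates."""
    by_len = defaultdict(list)
    for itemset, rids in pfis.items():
        by_len[len(itemset)].append(rids)

    scores = defaultdict(float)
    for i, groups in by_len.items():
        n_i = len({rid for rids in groups for rid in rids})  # dataset size at length i
        w = rank(i, max_len)
        for rids in groups:
            for rid in rids:                                  # per-class counts n_Ij
                scores[record_class[rid]] += w / max(n_i, 1)

    if not scores:
        return None, {}
    total = sum(scores.values())
    probs = {c: s / total for c, s in scores.items()}
    return max(probs, key=probs.get), probs
```

The final division by the total mirrors the normalization by S in Equation 10, so that the returned coefficients form a probability estimate per class.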

5. ISSUES WITH HIST PERFICT

5.1 Pruning

Quantitatively, the number of itemsets generated by the Apriori algorithm is huge and requires a pruning step. However, in the case of Hist PERFICT we generate the set of all possible PFIs without including an extra pruning step. Some of the itemsets have a high contribution in more than one class, which sometimes leads to misclassification. So we need a proper pruning step to make the classifier more effective.

5.2 Weights

As mentioned earlier, the rank or weight for itemsets of different lengths is different. However, another major issue with the Hist PERFICT approach is that we assign the same rank or weight to PFIs of the same length. This sometimes degrades the accuracy of classification, as is evident from the precision results (Table 3). We need a metric which can capture the range based nature of the PFIs effectively and provide weights accordingly. Both problems are resolved by means of our proposed MJ similarity metric.

6. HISTSIMILAR PERFICT

From Figures 1 and 2, we observe that an additional step involving a new similarity calculation is required. The criterion is presented as follows.

6.1 MJ Similarity Metric

We define a new similarity measure based on the simple though effective notion of area of overlap. Let us illustrate with an example. Assume that for a given test record the 1st attribute value is a_1 = 0.5 and the 2nd attribute value is a_2 = 0.6. During the transformation of the test record to perturbed range based values, we map a_1 to a histogram bin whose standard deviation is 0.3 and a_2 to a histogram bin whose standard deviation is 0.4, so the perturbed values for attributes 1 and 2 become a_1 ± σ_1 and a_2 ± σ_2. Now let there be a training record r for which r_1 = 0.5 and r_2 = 0.6. These values also map to histogram bins whose deviations are 0.3 and 0.4 respectively, so the perturbed training record values become r_1 ± σ_1 and r_2 ± σ_2, where σ_i represents the standard deviation. For attributes 1 and 2 of the record r,

AO = [(a_1 ± σ_1) ∩ (r_1 ± σ_1)] / σ_1 × [(a_2 ± σ_2) ∩ (r_2 ± σ_2)] / σ_2 × 2²   (11)

where AO represents the Area of Overlap; the intersection of the range based values for the 1st attribute leads to a similarity of 0.5 and the intersection of the range based values for the 2nd attribute leads to a similarity of 0.7. The similarity values are normalized by taking into account the σ value of the respective attribute histograms. It can be observed from the formula that taking only the product of the per-attribute similarities would decrease the overall similarity (multiplying one fraction with another fraction). So we introduce an additional multiplicative factor to obtain the overall area of overlap, which depends on the length of the itemset and is defined as (itemset length)^(itemset length). Figure 5 illustrates the example stated above.

Figure 5: Sample Area of Overlap

The formula for the Area of Overlap can be generalized for the j-th PFI of length k as:

AO_PFIjk = [ Π_{i=1}^{k} ((a_i ± σ_i) ∩ (r_i ± σ_i)) / σ_i ] × k^k   (12)

where a_i is the i-th attribute value of the test record, r_i is the i-th attribute value of a training record r, and σ_i is the deviation of the histogram partition to which the value maps. An important constraint imposed on the proposed similarity measure is that PFIs of larger length must be assigned greater weights than PFIs of smaller length. But if we directly use the AO based formula, this constraint can sometimes be violated.
Without loss of generality, consider a PFI of length 2, (A_1, A_2), from which another frequent itemset of length 3, (A_1, A_2, A_3), is obtained. If the intersection for the 3rd attribute is very small, then the Area of Overlap is reduced, and so the similarity of the length-3 PFI can be smaller than that of the length-2 PFI. In order to prevent such a situation, we add a term 2 × (itemset length) to the Area of Overlap after taking its logarithm (base 10). We also place the constraint that the intersection for each attribute (here the i-th attribute) must be ≥ 0.02 σ_i. By maintaining this criterion, the similarity value for a 3-itemset ≥ the similarity value for a 2-itemset.

MJ_PFIjk = 2k + log_10(AO_PFIjk)   (13)

where MJ_PFIjk represents the similarity value of the j-th PFI of length k with respect to a training record r. But there are several such records contributing to the count of PFI_jk, so we take the average of all such values as the similarity measure for the corresponding PFI. The MJ similarity criterion of intersection (≥ 0.02 σ) is included in the Generating Perturbed Frequent Itemsets step. The probabilistic estimation step is also modified by including the MJ criterion: in Equation (9) we replace the Rank term with the MJ similarity measure to arrive at the following formula used for classification.

P(C|R) = Σ_{i=2}^{max} Contri(PFI_Ii) × MJ_PFIIi   (14)

This changed probabilistic estimation step is used in the HistSimilar PERFICT and Randomized K-Means algorithms.
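The area-of-overlap and MJ similarity computations of Equations 12 and 13, including the per-attribute 0.02·σ pruning criterion, might be sketched as follows. Ranges are assumed to be (lo, hi) pairs and the function names are illustrative only, not the authors' code.

```python
import math

def area_of_overlap(test_ranges, train_ranges, sigmas):
    """Equation 12: product over attributes of (overlap length / sigma_i),
    scaled by k**k, where k is the itemset length. Returns None if any
    attribute violates the 0.02 * sigma intersection criterion."""
    k = len(test_ranges)
    ao = 1.0
    for (a_lo, a_hi), (r_lo, r_hi), sigma in zip(test_ranges, train_ranges, sigmas):
        overlap = min(a_hi, r_hi) - max(a_lo, r_lo)
        if sigma <= 0 or overlap < 0.02 * sigma:   # MJ pruning criterion
            return None
        ao *= overlap / sigma
    return ao * (k ** k)

def mj_similarity(test_ranges, train_ranges, sigmas):
    """Equation 13: MJ = 2*k + log10(AO); None means the itemset is pruned."""
    ao = area_of_overlap(test_ranges, train_ranges, sigmas)
    if ao is None:
        return None
    return 2 * len(test_ranges) + math.log10(ao)
```

In use, mj_similarity would be averaged over all training records supporting a given PFI, and a None result would drop the candidate itemset, which is exactly the implicit pruning discussed next.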

6.2 Benefits of MJ Similarity Measure

1. One benefit of using this similarity measure is that we assign different weights even to itemsets of the same length, based on their area of overlap. The larger the area of overlap, the larger the intersections, the greater the similarity and the greater the associated weight. This follows directly from proportionality.

2. Another advantage of this similarity measure is that it intrinsically provides the much needed pruning step. During the generation of the PFIs, at any stage, if the range of intersection for any attribute of a candidate itemset does not satisfy the MJ criterion, then that candidate itemset is immediately pruned. So an explicit pruning step is no longer necessary.

7. RANDOMIZED K-MEANS

7.1 Disadvantages of Histograms

There are certain limitations to the histogram based approach for preprocessing the data. The depth of the histogram for each attribute is kept variable, but histogram construction is a time consuming process. Further, the process of estimating the best depth for each attribute is manual and cannot easily be automated. Sometimes the variation in the attribute values is small, so that 3 or 4 partitions would be sufficient for such an attribute, but we end up with a larger number of bins. This problem can be solved by using a clustering technique for preprocessing.

7.2 K-Means Approach

The K-Means algorithm [9] is a highly popular clustering approach. It can be used very effectively to identify clusters of data which act as partitions for us. For our algorithm, we compute the k-means independently for each attribute, so the data points are all one dimensional. The purpose of this method is to minimize the Squared Sum Error (SSE), i.e. the distance of each data point from the k means.

7.3 Advantages of K-Means

The major contribution of the K-Means method to our approach is that we calculate the k-means separately for each attribute and the number of clusters is determined on the fly. We vary the number of clusters and identify the best clustering for each attribute based on a threshold

θ = (SSE_{k−1} − SSE_k) / k   (15)

where the numerator represents the difference in the squared sum error between k−1 clusters and k clusters and the denominator is the number of clusters. When the contribution of each cluster to the squared error is very small, i.e. less than the threshold θ, we stop. The approach allows a different number of clusters for different attributes and provides a better opportunity to capture the inherent similarity within the attributes of a dataset. We now take the perturbation to be the standard deviation of each cluster. The rest of the procedure is the same as that of the HistSimilar PERFICT algorithm, but the k-means approach is much faster in comparison to the histogram based approach. Taking advantage of this fact, the K-Means procedure is run multiple times for an accurate estimation of the clusters. The procedure is henceforth called Randomized K-Means PERFICT.
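A possible rendering of the per-attribute clustering with the stopping rule of Equation 15 is sketched below using scikit-learn's KMeans. The threshold value, the use of n_init restarts to mimic the randomized runs, and the function names are assumptions rather than the authors' choices.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_attribute(values, theta=0.05, max_k=15):
    """Grow k until the per-cluster drop in SSE, (SSE_{k-1} - SSE_k) / k,
    falls below theta (Equation 15). Returns the selected fitted model."""
    x = np.asarray(values, dtype=float).reshape(-1, 1)
    best = KMeans(n_clusters=1, n_init=10).fit(x)
    prev_sse = best.inertia_
    for k in range(2, max_k + 1):
        model = KMeans(n_clusters=k, n_init=10).fit(x)   # restarts ~ "randomized"
        if (prev_sse - model.inertia_) / k < theta:
            break
        prev_sse, best = model.inertia_, model
    return best

def cluster_sigmas(values, model):
    """Perturbation per cluster: the standard deviation of its member values."""
    x = np.asarray(values, dtype=float)
    labels = model.predict(x.reshape(-1, 1))
    return {int(c): float(x[labels == c].std()) for c in np.unique(labels)}
```

Each attribute column would be clustered independently, and the per-cluster standard deviations then play the role of the histogram-bin deviations in the transformation step.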
8. RESULTS AND ANALYSIS

We conducted our experiments over 12 datasets from the UCI repository. These datasets consist of real valued attributes which are generally continuous in nature. We conducted experiments over two different minsupport values (1 and 10) for the various datasets and selected the minsupport value for which the accuracy was best for the corresponding dataset. We varied the number of histogram bins and clusters from 3 to 15 and selected the setting for which accuracy was best. The 10-fold cross validation accuracies for associative classifiers like CBA, CPAR and CMAR, the PART algorithm, decision trees like C4.5, and RIPPER, along with the Naive Bayes algorithm, are presented in Table 3. For discretization of the real valued attributes for these classifiers we use the same entropy based technique as that used in the MLC++ library. Our primary concern is the accuracy of the PERFICT classifiers. Apart from the accuracy measure, we also give a detailed analysis of the effect of the MJ criterion for HistSimilar PERFICT and the Randomized K-Means algorithm. We have used the WEKA Toolkit and have implemented the PERFICT algorithms in C.

Table 3: Precision Results. Columns: Dataset, Hist PERFICT, HistSimilar PERFICT, K-Means PERFICT, CBA, CMAR, CPAR, RIPPER, J4.8, PART, Naive Bayes. Rows: breast-w, diabetes, ecoli, glass, heart, image, iris, pima, vehicle, vowel, waveform, wine. (The numeric accuracy entries are not reproduced in this transcription.)

8.1 Performance of PERFICT

We provide a brief analysis of the precision results presented in Table 3. The datasets are composed of real valued attributes. The Randomized K-Means PERFICT algorithm outperforms the other algorithms on 8 datasets and is among the top 3 on 11 datasets. The performance of K-Means PERFICT relative to the other algorithms is exceptional for the waveform, vehicle and ecoli datasets. The inclusion of the MJ similarity measure is primarily responsible for the high precision results: it helps to prune away the itemsets which are not essential for classification and gives each PFI the appropriate weight necessary for classification. The reason for the high success rate of Randomized K-Means is that it captures the noisy nature of the attributes. The superiority of Randomized K-Means over the HistSimilar and Hist PERFICT approaches is due to the fact that a variable number of points can belong to a cluster, as opposed to an equi-depth histogram: the size of a cluster is not fixed, while the frequency of each bin has to be the same in an equi-depth histogram. The necessity of having a variable number of clusters or bins can be seen from the fact that for the diabetes dataset the number of clusters or bins for each attribute is best set to 15, while for the ecoli dataset the number is 3, as there is little variation in the value of each feature. For datasets like image, vowel and wine, K-Means PERFICT is among the top 3 classifiers and hence is suitable for tasks like image recognition and handwriting recognition. While performing the complexity analysis for each of the proposed algorithms, we observe that the complexity of each PERFICT algorithm is the same as that of the Apriori algorithm. Datasets for image, signal and handwriting classification sometimes have a very large feature space, i.e. a large number of attributes. In order to reduce this feature space to low dimensions, we apply a data preprocessing technique like PCA (Principal Component Analysis) or LDA (Linear Discriminant Analysis). This reduces the number of attributes, making the various PERFICT approaches highly scalable.

8.2 Effects of MJ Criteria

For both the HistSimilar PERFICT and Randomized K-Means approaches, the application of the MJ criterion is one of the most important steps. We vary the MJ criterion as

MJ_j = σ_j / N   (16)

where MJ_j represents the threshold for the j-th dimension, σ_j represents the deviation of the partition to which the training value maps for the j-th attribute, and N is a natural number. The maximum overlap that can occur is equal to 2 σ_j, when the range based value of the training record r completely overlaps the test record values. As N is a natural number, the MJ criterion is always smaller than 1/2 of the maximum overlap. We vary N from one to one hundred to see the effect of various thresholds on accuracy.

Figure 6: Accuracy vs Overlap
Figure 7: Accuracy vs MinSupport

From Figure 6, we observe that the maximum variation in accuracy occurs for smaller values of N. The smaller the value of N, the higher the MJ criterion, the stricter the condition and the greater the pruning of the candidate itemsets. As can be observed, more pruning leads to under-fitting; the characteristics of the dataset are not captured, leading to lower accuracy. However, as N increases the MJ criterion is relaxed. We achieve optimal accuracy at some value, and then accuracy remains nearly constant for higher values of N.

8.3 Image Segmentation Data Analysis

The image segmentation problem is one of the oldest problems of pattern recognition. The dataset comprises instances drawn from a database of 7 outdoor images. The images are hand-segmented to create a classification for every pixel. There are 210 records in all, and each record has 19 continuous attributes apart from one class attribute. There are seven classes in total and the data is uniformly distributed among these classes, with 30 instances of each class. From Figure 7, we observe that the CBA and Hist PERFICT algorithms completely fail to build a proper classifier and their accuracy remains low as the minsupport value increases, whereas HistSimilar PERFICT and Randomized K-Means PERFICT perform well. From Table 3, it can be seen that the PART and J4.8 algorithms are better than PERFICT, but as minsupport is not a parameter for these classifiers, they are not considered in Figure 7. A detailed analysis reveals that the Hist PERFICT algorithm is not suitable for this dataset. Here the number of classes is very high and the contributing itemsets are of smaller length. Providing the same weight to different PFIs of the same

length therefore causes misclassification. The accuracy reaches a minimum for this approach. For the HistSimilar PERFICT algorithm the accuracy remains constant throughout; variation of the minsupport value has no effect on its accuracy. The maximum accuracy of the HistSimilar algorithm for this dataset is at 10 percent minsupport, and it can be observed that this accuracy was already achieved before 8 percent minsupport. However, it is the Randomized K-Means PERFICT algorithm which shows the better trend: its accuracy increases with the minsupport value and then reaches an optimum.

9. CONCLUSION

This paper presents a new classification approach developed on the concept of perturbed frequent itemsets and their probabilistic contribution to each class. The suggested work is computationally more efficient than other discretized rule generating algorithms such as CBA, CPAR and CMAR, and employs a new MJ similarity measure. The results obtained are accurate and effective, particularly for noisy datasets like waveform, diabetes and vehicle. Advantages of the various PERFICT algorithms include: (1) they capture the nature of noisy data easily; (2) the value of mincount is not fixed throughout, but dynamically adjusts with increasing length of the candidate itemsets; (3) the probabilistic contribution of each class towards the record is identified, which helps in understanding the maximum contribution of a particular class for a record. Thus, the proposed approach is novel and more efficient than most of the existing well established associative classification algorithms. In future work, we can incorporate another pruning step based on maximum likelihood techniques. Finally, an improvement that can be sought is building a classifier model, rather than running the entire algorithm for individual test records, although the latter is not very computationally intensive as it is.

10. REFERENCES

[1] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. In Proceedings of the ACM SIGMOD Conference on Management of Data, pages 80-86.
[2] R. Agrawal, T. Imielinski, and R. Srikant. Mining association rules between sets of items in large databases. In Proceedings of the ACM SIGMOD Conference on Management of Data.
[3] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proceedings of the International Conference on Very Large Data Bases, volume 20, September.
[4] K. Ali, S. Manganaris, and R. Srikant. Partial classification using association rules. In Proceedings of the International Conference on Knowledge Discovery and Data Mining (KDD), 3.
[5] R. Bayardo, R. Agrawal, and D. Gunopulos. Constraint-based rule mining in large databases. In Proceedings of the International Conference on Knowledge Discovery and Data Engineering, volume 15.
[6] G. Chen et al. A new approach to classification based on association rule mining. Decision Support Systems, 42.
[7] P. Clark and R. Boswell. Rule induction with CN2: Some recent improvements. In Proceedings of the 5th European Working Session on Learning.
[8] G. Dong, X. Zhang, L. Wong, and J. Li. Classification by aggregating emerging patterns. Discovery Science.
[9] R. Duda and P. Hart. Pattern Classification and Scene Analysis. John Wiley and Sons, New York.
[10] F. Thabtah, P. Cowling, and Y. Peng. MCAR: Multi-class classification based on association rule. In Proceedings of the IEEE International Conference on Computer Systems and Applications.
[11] T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Data mining using two-dimensional optimized association rules: Scheme, algorithms, and visualization. In Proceedings of SIGMOD.
[12] T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Mining optimized association rules for numeric attributes. Journal of Computer and System Sciences.
[13] T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Data mining with optimized two-dimensional association rules. ACM Transactions on Database Systems.
[14] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 1-12.
[15] J. Han, J. Pei, Y. Yin, and R. Mao. Mining frequent patterns without candidate generation: A frequent-pattern tree approach. Data Mining and Knowledge Discovery.
[16] W. Li, J. Han, and J. Pei. CMAR: Accurate and efficient classification based on multiple class-association rules. In Proceedings of ICDM.
[17] B. Liu, W. Hsu, and Y. Ma. Integrating classification and association rule mining. In Proceedings of KDD.
[18] H. Mannila and H. Toivonen. Discovering generalized episodes using minimum occurrences. In Proceedings of the International Conference on Knowledge Discovery and Data Mining (KDD-96), 2.
[19] N. Megiddo and R. Srikant. Discovering predictive association rules. In Proceedings of the International Conference on Knowledge Discovery and Data Mining (KDD), 4.
[20] R. T. Ng, L. Lakshmanan, J. Han, and A. Pang. Exploratory mining and pruning optimizations of constrained association rules. In Proceedings of SIGMOD.
[21] R. Srikant and R. Agrawal. Mining quantitative association rules in large relational tables. In Proceedings of SIGMOD.
[22] X. Yin and J. Han. CPAR: Classification based on predictive association rules. In Proceedings of SDM 2003, 2003.


More information

Categorization of Sequential Data using Associative Classifiers

Categorization of Sequential Data using Associative Classifiers Categorization of Sequential Data using Associative Classifiers Mrs. R. Meenakshi, MCA., MPhil., Research Scholar, Mrs. J.S. Subhashini, MCA., M.Phil., Assistant Professor, Department of Computer Science,

More information

Preprocessing of Stream Data using Attribute Selection based on Survival of the Fittest

Preprocessing of Stream Data using Attribute Selection based on Survival of the Fittest Preprocessing of Stream Data using Attribute Selection based on Survival of the Fittest Bhakti V. Gavali 1, Prof. Vivekanand Reddy 2 1 Department of Computer Science and Engineering, Visvesvaraya Technological

More information

Parallel Mining of Maximal Frequent Itemsets in PC Clusters

Parallel Mining of Maximal Frequent Itemsets in PC Clusters Proceedings of the International MultiConference of Engineers and Computer Scientists 28 Vol I IMECS 28, 19-21 March, 28, Hong Kong Parallel Mining of Maximal Frequent Itemsets in PC Clusters Vong Chan

More information

An Evolutionary Algorithm for Mining Association Rules Using Boolean Approach

An Evolutionary Algorithm for Mining Association Rules Using Boolean Approach An Evolutionary Algorithm for Mining Association Rules Using Boolean Approach ABSTRACT G.Ravi Kumar 1 Dr.G.A. Ramachandra 2 G.Sunitha 3 1. Research Scholar, Department of Computer Science &Technology,

More information

Mining Quantitative Maximal Hyperclique Patterns: A Summary of Results

Mining Quantitative Maximal Hyperclique Patterns: A Summary of Results Mining Quantitative Maximal Hyperclique Patterns: A Summary of Results Yaochun Huang, Hui Xiong, Weili Wu, and Sam Y. Sung 3 Computer Science Department, University of Texas - Dallas, USA, {yxh03800,wxw0000}@utdallas.edu

More information

CS570 Introduction to Data Mining

CS570 Introduction to Data Mining CS570 Introduction to Data Mining Frequent Pattern Mining and Association Analysis Cengiz Gunay Partial slide credits: Li Xiong, Jiawei Han and Micheline Kamber George Kollios 1 Mining Frequent Patterns,

More information

Mining Frequent Itemsets Along with Rare Itemsets Based on Categorical Multiple Minimum Support

Mining Frequent Itemsets Along with Rare Itemsets Based on Categorical Multiple Minimum Support IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 18, Issue 6, Ver. IV (Nov.-Dec. 2016), PP 109-114 www.iosrjournals.org Mining Frequent Itemsets Along with Rare

More information

Feature Selection Using Modified-MCA Based Scoring Metric for Classification

Feature Selection Using Modified-MCA Based Scoring Metric for Classification 2011 International Conference on Information Communication and Management IPCSIT vol.16 (2011) (2011) IACSIT Press, Singapore Feature Selection Using Modified-MCA Based Scoring Metric for Classification

More information

Memory issues in frequent itemset mining

Memory issues in frequent itemset mining Memory issues in frequent itemset mining Bart Goethals HIIT Basic Research Unit Department of Computer Science P.O. Box 26, Teollisuuskatu 2 FIN-00014 University of Helsinki, Finland bart.goethals@cs.helsinki.fi

More information

6. Dicretization methods 6.1 The purpose of discretization

6. Dicretization methods 6.1 The purpose of discretization 6. Dicretization methods 6.1 The purpose of discretization Often data are given in the form of continuous values. If their number is huge, model building for such data can be difficult. Moreover, many

More information

Chapter 6: Basic Concepts: Association Rules. Basic Concepts: Frequent Patterns. (absolute) support, or, support. (relative) support, s, is the

Chapter 6: Basic Concepts: Association Rules. Basic Concepts: Frequent Patterns. (absolute) support, or, support. (relative) support, s, is the Chapter 6: What Is Frequent ent Pattern Analysis? Frequent pattern: a pattern (a set of items, subsequences, substructures, etc) that occurs frequently in a data set frequent itemsets and association rule

More information

Mining N-most Interesting Itemsets. Ada Wai-chee Fu Renfrew Wang-wai Kwong Jian Tang. fadafu,

Mining N-most Interesting Itemsets. Ada Wai-chee Fu Renfrew Wang-wai Kwong Jian Tang. fadafu, Mining N-most Interesting Itemsets Ada Wai-chee Fu Renfrew Wang-wai Kwong Jian Tang Department of Computer Science and Engineering The Chinese University of Hong Kong, Hong Kong fadafu, wwkwongg@cse.cuhk.edu.hk

More information

To Enhance Projection Scalability of Item Transactions by Parallel and Partition Projection using Dynamic Data Set

To Enhance Projection Scalability of Item Transactions by Parallel and Partition Projection using Dynamic Data Set To Enhance Scalability of Item Transactions by Parallel and Partition using Dynamic Data Set Priyanka Soni, Research Scholar (CSE), MTRI, Bhopal, priyanka.soni379@gmail.com Dhirendra Kumar Jha, MTRI, Bhopal,

More information

COMPACT WEIGHTED CLASS ASSOCIATION RULE MINING USING INFORMATION GAIN

COMPACT WEIGHTED CLASS ASSOCIATION RULE MINING USING INFORMATION GAIN COMPACT WEIGHTED CLASS ASSOCIATION RULE MINING USING INFORMATION GAIN S.P.Syed Ibrahim 1 and K.R.Chandran 2 1 Assistant Professor, Department of Computer Science and Engineering, PSG College of Technology,

More information

Slides for Data Mining by I. H. Witten and E. Frank

Slides for Data Mining by I. H. Witten and E. Frank Slides for Data Mining by I. H. Witten and E. Frank 7 Engineering the input and output Attribute selection Scheme-independent, scheme-specific Attribute discretization Unsupervised, supervised, error-

More information

A mining method for tracking changes in temporal association rules from an encoded database

A mining method for tracking changes in temporal association rules from an encoded database A mining method for tracking changes in temporal association rules from an encoded database Chelliah Balasubramanian *, Karuppaswamy Duraiswamy ** K.S.Rangasamy College of Technology, Tiruchengode, Tamil

More information

CHAPTER 7 A GRID CLUSTERING ALGORITHM

CHAPTER 7 A GRID CLUSTERING ALGORITHM CHAPTER 7 A GRID CLUSTERING ALGORITHM 7.1 Introduction The grid-based methods have widely been used over all the algorithms discussed in previous chapters due to their rapid clustering results. In this

More information

Business Club. Decision Trees

Business Club. Decision Trees Business Club Decision Trees Business Club Analytics Team December 2017 Index 1. Motivation- A Case Study 2. The Trees a. What is a decision tree b. Representation 3. Regression v/s Classification 4. Building

More information

Pattern Mining. Knowledge Discovery and Data Mining 1. Roman Kern KTI, TU Graz. Roman Kern (KTI, TU Graz) Pattern Mining / 42

Pattern Mining. Knowledge Discovery and Data Mining 1. Roman Kern KTI, TU Graz. Roman Kern (KTI, TU Graz) Pattern Mining / 42 Pattern Mining Knowledge Discovery and Data Mining 1 Roman Kern KTI, TU Graz 2016-01-14 Roman Kern (KTI, TU Graz) Pattern Mining 2016-01-14 1 / 42 Outline 1 Introduction 2 Apriori Algorithm 3 FP-Growth

More information

Fast Fuzzy Clustering of Infrared Images. 2. brfcm

Fast Fuzzy Clustering of Infrared Images. 2. brfcm Fast Fuzzy Clustering of Infrared Images Steven Eschrich, Jingwei Ke, Lawrence O. Hall and Dmitry B. Goldgof Department of Computer Science and Engineering, ENB 118 University of South Florida 4202 E.

More information

Building Intelligent Learning Database Systems

Building Intelligent Learning Database Systems Building Intelligent Learning Database Systems 1. Intelligent Learning Database Systems: A Definition (Wu 1995, Wu 2000) 2. Induction: Mining Knowledge from Data Decision tree construction (ID3 and C4.5)

More information

CS Introduction to Data Mining Instructor: Abdullah Mueen

CS Introduction to Data Mining Instructor: Abdullah Mueen CS 591.03 Introduction to Data Mining Instructor: Abdullah Mueen LECTURE 8: ADVANCED CLUSTERING (FUZZY AND CO -CLUSTERING) Review: Basic Cluster Analysis Methods (Chap. 10) Cluster Analysis: Basic Concepts

More information

Association Rule Mining. Entscheidungsunterstützungssysteme

Association Rule Mining. Entscheidungsunterstützungssysteme Association Rule Mining Entscheidungsunterstützungssysteme Frequent Pattern Analysis Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set

More information

CHAPTER 3 A FAST K-MODES CLUSTERING ALGORITHM TO WAREHOUSE VERY LARGE HETEROGENEOUS MEDICAL DATABASES

CHAPTER 3 A FAST K-MODES CLUSTERING ALGORITHM TO WAREHOUSE VERY LARGE HETEROGENEOUS MEDICAL DATABASES 70 CHAPTER 3 A FAST K-MODES CLUSTERING ALGORITHM TO WAREHOUSE VERY LARGE HETEROGENEOUS MEDICAL DATABASES 3.1 INTRODUCTION In medical science, effective tools are essential to categorize and systematically

More information

Machine Learning with MATLAB --classification

Machine Learning with MATLAB --classification Machine Learning with MATLAB --classification Stanley Liang, PhD York University Classification the definition In machine learning and statistics, classification is the problem of identifying to which

More information

Data Mining: Concepts and Techniques. Chapter 5. SS Chung. April 5, 2013 Data Mining: Concepts and Techniques 1

Data Mining: Concepts and Techniques. Chapter 5. SS Chung. April 5, 2013 Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques Chapter 5 SS Chung April 5, 2013 Data Mining: Concepts and Techniques 1 Chapter 5: Mining Frequent Patterns, Association and Correlations Basic concepts and a road

More information

EXTRACTION OF RELEVANT WEB PAGES USING DATA MINING

EXTRACTION OF RELEVANT WEB PAGES USING DATA MINING Chapter 3 EXTRACTION OF RELEVANT WEB PAGES USING DATA MINING 3.1 INTRODUCTION Generally web pages are retrieved with the help of search engines which deploy crawlers for downloading purpose. Given a query,

More information

Machine Learning Techniques for Data Mining

Machine Learning Techniques for Data Mining Machine Learning Techniques for Data Mining Eibe Frank University of Waikato New Zealand 10/25/2000 1 PART VII Moving on: Engineering the input and output 10/25/2000 2 Applying a learner is not all Already

More information

Unsupervised Discretization using Tree-based Density Estimation

Unsupervised Discretization using Tree-based Density Estimation Unsupervised Discretization using Tree-based Density Estimation Gabi Schmidberger and Eibe Frank Department of Computer Science University of Waikato Hamilton, New Zealand {gabi, eibe}@cs.waikato.ac.nz

More information

Downloaded from

Downloaded from UNIT 2 WHAT IS STATISTICS? Researchers deal with a large amount of data and have to draw dependable conclusions on the basis of data collected for the purpose. Statistics help the researchers in making

More information

Eager, Lazy and Hybrid Algorithms for Multi-Criteria Associative Classification

Eager, Lazy and Hybrid Algorithms for Multi-Criteria Associative Classification Eager, Lazy and Hybrid Algorithms for Multi-Criteria Associative Classification Adriano Veloso 1, Wagner Meira Jr 1 1 Computer Science Department Universidade Federal de Minas Gerais (UFMG) Belo Horizonte

More information

Data Clustering With Leaders and Subleaders Algorithm

Data Clustering With Leaders and Subleaders Algorithm IOSR Journal of Engineering (IOSRJEN) e-issn: 2250-3021, p-issn: 2278-8719, Volume 2, Issue 11 (November2012), PP 01-07 Data Clustering With Leaders and Subleaders Algorithm Srinivasulu M 1,Kotilingswara

More information

CHAPTER VII INDEXED K TWIN NEIGHBOUR CLUSTERING ALGORITHM 7.1 INTRODUCTION

CHAPTER VII INDEXED K TWIN NEIGHBOUR CLUSTERING ALGORITHM 7.1 INTRODUCTION CHAPTER VII INDEXED K TWIN NEIGHBOUR CLUSTERING ALGORITHM 7.1 INTRODUCTION Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called cluster)

More information

COSC160: Detection and Classification. Jeremy Bolton, PhD Assistant Teaching Professor

COSC160: Detection and Classification. Jeremy Bolton, PhD Assistant Teaching Professor COSC160: Detection and Classification Jeremy Bolton, PhD Assistant Teaching Professor Outline I. Problem I. Strategies II. Features for training III. Using spatial information? IV. Reducing dimensionality

More information

Chapter 4 Data Mining A Short Introduction

Chapter 4 Data Mining A Short Introduction Chapter 4 Data Mining A Short Introduction Data Mining - 1 1 Today's Question 1. Data Mining Overview 2. Association Rule Mining 3. Clustering 4. Classification Data Mining - 2 2 1. Data Mining Overview

More information

Partition Based Perturbation for Privacy Preserving Distributed Data Mining

Partition Based Perturbation for Privacy Preserving Distributed Data Mining BULGARIAN ACADEMY OF SCIENCES CYBERNETICS AND INFORMATION TECHNOLOGIES Volume 17, No 2 Sofia 2017 Print ISSN: 1311-9702; Online ISSN: 1314-4081 DOI: 10.1515/cait-2017-0015 Partition Based Perturbation

More information

FEATURE SELECTION TECHNIQUES

FEATURE SELECTION TECHNIQUES CHAPTER-2 FEATURE SELECTION TECHNIQUES 2.1. INTRODUCTION Dimensionality reduction through the choice of an appropriate feature subset selection, results in multiple uses including performance upgrading,

More information

Study on Classifiers using Genetic Algorithm and Class based Rules Generation

Study on Classifiers using Genetic Algorithm and Class based Rules Generation 2012 International Conference on Software and Computer Applications (ICSCA 2012) IPCSIT vol. 41 (2012) (2012) IACSIT Press, Singapore Study on Classifiers using Genetic Algorithm and Class based Rules

More information

COMP 465: Data Mining Still More on Clustering

COMP 465: Data Mining Still More on Clustering 3/4/015 Exercise COMP 465: Data Mining Still More on Clustering Slides Adapted From : Jiawei Han, Micheline Kamber & Jian Pei Data Mining: Concepts and Techniques, 3 rd ed. Describe each of the following

More information

Graph Based Approach for Finding Frequent Itemsets to Discover Association Rules

Graph Based Approach for Finding Frequent Itemsets to Discover Association Rules Graph Based Approach for Finding Frequent Itemsets to Discover Association Rules Manju Department of Computer Engg. CDL Govt. Polytechnic Education Society Nathusari Chopta, Sirsa Abstract The discovery

More information

CHAPTER 3 ASSOCIATON RULE BASED CLUSTERING

CHAPTER 3 ASSOCIATON RULE BASED CLUSTERING 41 CHAPTER 3 ASSOCIATON RULE BASED CLUSTERING 3.1 INTRODUCTION This chapter describes the clustering process based on association rule mining. As discussed in the introduction, clustering algorithms have

More information

Frequent Itemsets Melange

Frequent Itemsets Melange Frequent Itemsets Melange Sebastien Siva Data Mining Motivation and objectives Finding all frequent itemsets in a dataset using the traditional Apriori approach is too computationally expensive for datasets

More information

Data Mining. Part 2. Data Understanding and Preparation. 2.4 Data Transformation. Spring Instructor: Dr. Masoud Yaghini. Data Transformation

Data Mining. Part 2. Data Understanding and Preparation. 2.4 Data Transformation. Spring Instructor: Dr. Masoud Yaghini. Data Transformation Data Mining Part 2. Data Understanding and Preparation 2.4 Spring 2010 Instructor: Dr. Masoud Yaghini Outline Introduction Normalization Attribute Construction Aggregation Attribute Subset Selection Discretization

More information

Mining Temporal Association Rules in Network Traffic Data

Mining Temporal Association Rules in Network Traffic Data Mining Temporal Association Rules in Network Traffic Data Guojun Mao Abstract Mining association rules is one of the most important and popular task in data mining. Current researches focus on discovering

More information

Nearest neighbor classification DSE 220

Nearest neighbor classification DSE 220 Nearest neighbor classification DSE 220 Decision Trees Target variable Label Dependent variable Output space Person ID Age Gender Income Balance Mortgag e payment 123213 32 F 25000 32000 Y 17824 49 M 12000-3000

More information

Decision Trees Dr. G. Bharadwaja Kumar VIT Chennai

Decision Trees Dr. G. Bharadwaja Kumar VIT Chennai Decision Trees Decision Tree Decision Trees (DTs) are a nonparametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target

More information

DESIGN AND EVALUATION OF MACHINE LEARNING MODELS WITH STATISTICAL FEATURES

DESIGN AND EVALUATION OF MACHINE LEARNING MODELS WITH STATISTICAL FEATURES EXPERIMENTAL WORK PART I CHAPTER 6 DESIGN AND EVALUATION OF MACHINE LEARNING MODELS WITH STATISTICAL FEATURES The evaluation of models built using statistical in conjunction with various feature subset

More information

Unsupervised Learning

Unsupervised Learning Outline Unsupervised Learning Basic concepts K-means algorithm Representation of clusters Hierarchical clustering Distance functions Which clustering algorithm to use? NN Supervised learning vs. unsupervised

More information

Chapter 4: Mining Frequent Patterns, Associations and Correlations

Chapter 4: Mining Frequent Patterns, Associations and Correlations Chapter 4: Mining Frequent Patterns, Associations and Correlations 4.1 Basic Concepts 4.2 Frequent Itemset Mining Methods 4.3 Which Patterns Are Interesting? Pattern Evaluation Methods 4.4 Summary Frequent

More information

Clustering Billions of Images with Large Scale Nearest Neighbor Search

Clustering Billions of Images with Large Scale Nearest Neighbor Search Clustering Billions of Images with Large Scale Nearest Neighbor Search Ting Liu, Charles Rosenberg, Henry A. Rowley IEEE Workshop on Applications of Computer Vision February 2007 Presented by Dafna Bitton

More information

Tendency Mining in Dynamic Association Rules Based on SVM Classifier

Tendency Mining in Dynamic Association Rules Based on SVM Classifier Send Orders for Reprints to reprints@benthamscienceae The Open Mechanical Engineering Journal, 2014, 8, 303-307 303 Open Access Tendency Mining in Dynamic Association Rules Based on SVM Classifier Zhonglin

More information

Improving the Efficiency of Fast Using Semantic Similarity Algorithm

Improving the Efficiency of Fast Using Semantic Similarity Algorithm International Journal of Scientific and Research Publications, Volume 4, Issue 1, January 2014 1 Improving the Efficiency of Fast Using Semantic Similarity Algorithm D.KARTHIKA 1, S. DIVAKAR 2 Final year

More information

MSA220 - Statistical Learning for Big Data

MSA220 - Statistical Learning for Big Data MSA220 - Statistical Learning for Big Data Lecture 13 Rebecka Jörnsten Mathematical Sciences University of Gothenburg and Chalmers University of Technology Clustering Explorative analysis - finding groups

More information