CHAPTER 3 ASSOCIATION RULE MINING WITH LEVELWISE AUTOMATIC SUPPORT THRESHOLDS

This chapter introduces the concepts of association rule mining. It also proposes two algorithms, based on Apriori, that calculate and use support thresholds automatically.

3.1 BASIC CONCEPTS

Association rules were first introduced by Agrawal et al (1993) as a means of determining relationships among a set of items in a dataset. Association rule mining is a type of unsupervised learning which has been applied to many fields such as the retail industry, web mining and text mining. The most familiar and renowned example of association rule mining is Market Basket Analysis (MBA).

The problem of association rule mining is generally divided into two sub-problems. The first is to find frequent itemsets; the second is to discover association rules from those frequent itemsets. The extraction of frequent itemsets is the process of extracting, from a dataset D, the sets of items whose frequency (i.e. the number of times the items occur together) is greater than a given threshold. It is then possible to generate association rules of the form A → B, relating a subset of items A with a subset of items B. Such a rule can be interpreted as follows: an itemset A relates to an itemset B with a certain support and a certain confidence.

The number of itemsets and rules that can be extracted from a dataset may be very large. To allow the subsequent interpretation of the extracted set of itemsets and the set of extracted rules, the extracted units need to be pruned. In the following, the principles of frequent itemset search and of the extraction of association rules are introduced.

3.1.1 Frequent Itemset Search

Definition 3.1: Consider a transaction dataset D which comprises a set of records R. A record in R consists of a set of items. An itemset, or a pattern, corresponds to a set of items. The number of items in an itemset determines the length of the itemset. The support of an itemset corresponds to the number of records which include the itemset. An itemset is said to be frequent if its support is greater than or equal to a given support threshold called minimum support (minsup).

To evaluate the performance of the existing and newly proposed algorithms, the datasets in Table 3.4 are used in the thesis. However, a simple dataset D (Table 3.1) is used throughout the thesis to illustrate a running example of those algorithms.

Table 3.1 Dataset D

Record Number   Items
1               Bread, milk
2               Bread, jam
3               Milk, egg
4               Milk, sugar
5               Bread, milk, egg
6               Bread, jam, milk
7               Bread, jam
8               Bread, egg
9               Milk, sugar, egg
10              Milk, egg, sugar, bread
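As a concrete illustration of Definition 3.1, the following minimal Java sketch (illustrative only, not part of the thesis implementation) counts the support of an itemset over the records of dataset D from Table 3.1 and tests whether it is frequent for minsup = 2; the helper names are chosen here for readability.

import java.util.*;

public class SupportCount {

    // One record of dataset D as a set of items.
    static Set<String> rec(String... items) {
        return new HashSet<>(Arrays.asList(items));
    }

    // Dataset D from Table 3.1.
    static final List<Set<String>> D = Arrays.asList(
        rec("bread", "milk"), rec("bread", "jam"), rec("milk", "egg"),
        rec("milk", "sugar"), rec("bread", "milk", "egg"),
        rec("bread", "jam", "milk"), rec("bread", "jam"), rec("bread", "egg"),
        rec("milk", "sugar", "egg"), rec("milk", "egg", "sugar", "bread"));

    // Support of an itemset = number of records containing every item of it.
    static int support(Set<String> itemset) {
        int count = 0;
        for (Set<String> record : D) {
            if (record.containsAll(itemset)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        int minsup = 2;
        Set<String> itemset = rec("bread", "milk", "egg");
        int sup = support(itemset);
        // Prints: support = 2, frequent = true (records 5 and 10 contain the itemset).
        System.out.println("support = " + sup + ", frequent = " + (sup >= minsup));
    }
}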

For example, considering the dataset D in Table 3.1, with minsup = 2, in the first level {bread} is a frequent itemset of length 1 and of support 7; {bread, milk} is of length 2, of support 4, and frequent; {bread, milk, egg} is of length 3, of support 2, and frequent; {bread, milk, egg, sugar} is of length 4, of support 1, and not frequent. It can be noticed that the support is a monotonically decreasing function with respect to the length of an itemset.

When the number of items in D is equal to n, the number of potential itemsets is equal to 2^n. Thus, a direct search for the frequent itemsets is not conceivable. Heuristic methods have to be used for pruning the set of all itemsets to be tested. This is the purpose of the levelwise search of frequent itemsets, and the well-known Apriori algorithm (Agrawal et al 1993, Agrawal et al 1994, Mannila et al 1994) relies on two fundamental and dual principles: (i) every subset of a frequent itemset is a frequent itemset, (ii) every superset of an infrequent itemset is infrequent. Apriori can further be summarized as follows:

1. The search for frequent itemsets starts with the search for frequent itemsets of length 1.
2. The initial frequent itemsets are found and combined together to form candidate itemsets of greater length.
3. The infrequent itemsets are removed and, by consequence, all their superitemsets are also removed.
4. The candidate itemsets are then tested, and the process continues in the same way until no more candidates can be produced.

For example, considering the dataset in Table 3.1, with minsup = 2, the frequent itemsets of length 1, with their support, are: {bread} (7), {milk} (6), {egg} (5), {jam} (3), {sugar} (3). All items in the first level are found to be frequent. Then the candidates of length 2 are formed by combining the frequent itemsets of length 1, e.g. {bread, milk}, {bread, egg}, {bread, jam}, {bread, sugar}, etc., and then tested. The frequent itemsets of length 2 are: {bread, milk} (4), {bread, egg} (3), {bread, jam} (3), {milk, egg} (4), {milk, sugar} (3). The candidates of length 3 are formed and tested. The frequent itemsets of length 3 are: {bread, milk, egg} (2), {milk, sugar, egg} (2). Finally, the candidate of length 4 is formed, i.e. {bread, milk, sugar, egg}, tested and found not to be a frequent itemset. No other candidates can be formed, and the algorithm terminates.

3.1.2 Association Rule Extraction

Definition 3.2: An association rule has the form A → B, where A and B are two itemsets. The support of the rule A → B is defined as the support of the itemset A ∪ B. The confidence of the rule A → B is defined as sup(A ∪ B) / sup(A). The confidence can be represented as a conditional probability P(B|A), i.e. the probability of B knowing A. A rule is said to be valid if its confidence is greater than or equal to a confidence threshold or minimum confidence (minconf), and its support is greater than or equal to the support threshold or minimum support (minsup). A valid rule can only be extracted from a frequent itemset. A rule is said to be exact if its confidence is equal to 1 (100%); otherwise the rule is approximate.

For example, with minsup = 3 and minconf = 50%, {bread, milk} is frequent, and the rule bread → milk is valid (with support 4 and confidence 4/7); the rule bread → jam is not valid (with support 3 and confidence 3/7). The generation of valid association rules from frequent itemsets of length greater than or equal to two proceeds in a similar way to the search for frequent itemsets.
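The following minimal Java sketch (again illustrative, not from the thesis) applies Definition 3.2 to the two rules of the example above, using the support values already quoted for dataset D; isValid is a hypothetical helper.

public class RuleValidity {

    // Returns true when the rule A -> B is valid: sup(A ∪ B) >= minsup and
    // sup(A ∪ B) / sup(A) >= minconf (Definition 3.2).
    static boolean isValid(int supAB, int supA, int minsup, double minconf) {
        double confidence = (double) supAB / supA;   // conditional probability P(B|A)
        return supAB >= minsup && confidence >= minconf;
    }

    public static void main(String[] args) {
        int minsup = 3;
        double minconf = 0.5;                        // 50 %

        // bread -> milk: sup({bread, milk}) = 4, sup({bread}) = 7, conf = 4/7 ≈ 0.57
        System.out.println("bread -> milk valid: " + isValid(4, 7, minsup, minconf));

        // bread -> jam: sup({bread, jam}) = 3, sup({bread}) = 7, conf = 3/7 ≈ 0.43
        System.out.println("bread -> jam valid:  " + isValid(3, 7, minsup, minconf));
    }
}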

3.2 LEVELWISE MINING ALGORITHMS

The most well-known algorithm of this kind is Apriori. This algorithm addresses the problem of finding all frequent itemsets in a dataset. Apriori has been followed by many variations, and several of these levelwise algorithms concentrate on a special subset of frequent itemsets, like closed itemsets or generators.

The levelwise algorithm for finding all frequent itemsets is a breadth-first and bottom-up algorithm. It means the following: first it finds all 1-long frequent itemsets, then at each i-th iteration it identifies all i-long frequent itemsets. The algorithm stops when it has identified the largest frequent itemset. Frequent itemsets are computed iteratively, in ascending order of their length. This approach is very simple and efficient for sparse, weakly correlated data. The levelwise algorithm is based on two basic properties.

Property 3.1 (downward closure): All subsets of a frequent itemset are frequent.

Property 3.2 (anti-monotonicity): All supersets of a non-frequent itemset are non-frequent.

3.2.1 Levelwise Exploration of the Frequent Itemsets

The levelwise algorithm discovers frequent itemsets in a level by level manner. At each level i, it uses the i-long frequent itemsets to generate their (i+1)-long supersets. These supersets are called candidates, and only the frequent ones are kept. For storing itemsets, two kinds of tables are used: F_i for i-long frequent itemsets, and C_i for i-long candidates. An itemset of length (i+1) can only be frequent if all its i-long subsets are frequent. If it has an i-long subset not present in F_i, then it has an infrequent subset, and by Property 3.2 the candidate is also infrequent and can be pruned.

With each database pass, the support of the candidates is counted, and itemsets that turn out to be infrequent are pruned. The frequent (i+1)-long itemsets are then used to generate (i+2)-long candidates, etc. The process continues until no new candidates can be generated.

The generation of (i+1)-long candidates from i-long frequent itemsets consists of two steps. First, in the join step, table F_i is joined with itself. Next, in the prune step, candidates in C_{i+1} are deleted if they have an i-long subset not present in F_i. This way itemsets with an infrequent subset are pruned and only the remaining candidates are kept in C_{i+1}. As an example of the join step, consider the F_2 table of dataset D: {(bread, milk), (bread, egg), (bread, jam), (milk, egg), (milk, sugar)}. After the join step, C_3 is: {(bread, milk, egg), (bread, milk, jam), (milk, egg, sugar)}. In the prune step, (bread, milk, jam) is deleted, because its 2-long subset (milk, jam) is not present in F_2; the other candidates are kept in C_3.

The candidate generation and the support counting process require a subset test. In candidate generation, the subsets of an (i+1)-long candidate need to be identified in F_i. In the support counting process, each record in the dataset is read; for each record, the candidates in C_k contained in that record are found, and the support value of each such candidate is incremented by 1. Since these operations must be performed many times, they must be implemented very efficiently for good performance. In this thesis, all the algorithms use a trie data structure (Aho et al 1985) for the subset test.

3.3 DYNAMIC ADAPTIVE SUPPORT APRIORI

3.3.1 Motivation

The support distributions of data items have a strong influence on the performance of association rule mining algorithms. In reality, the support distribution of itemsets is highly skewed. The majority of the items have relatively low support values while a small fraction of them have very high support values.

Datasets that exhibit such a highly skewed support distribution are shown in Table 3.2. This table lists the minimum support value among the items (minsup), the maximum support value among the items (maxsup) and the support distribution of items in various datasets.

Table 3.2 Effect of Skewed Support Distribution

Dataset      minsup (%)   maxsup (%)   Support distribution in % of items
                                       (<1 | 1 to 10 | 11 to 30 | 31 to 60 | 61 to 90 | >90)
Mushroom
Chess
C2D1K
T2I6D1K
T25I1D1K

The T2I6D1K and T25I1D1K synthetic datasets imitate market basket data, which are typically sparse and weakly correlated. Mushroom, Chess and C2D1K are highly correlated datasets. Mushroom describes the characteristics of mushrooms, Chess describes a game dataset and C2D1K is a census dataset. These three datasets reflect the characteristics of real life datasets, and all five datasets are used to carry out the experiments in this thesis. In these datasets, most of the items have low support and only a small number of items have high support. The Chess dataset is an exception: it is a highly dense dataset which contains all ranges of support values.

Choosing the right support threshold for mining these datasets is quite tricky. If the threshold is set too high, then many frequent itemsets involving the low support items will be missed.

In market basket analysis, such low support items may correspond to expensive products that are seldom bought by customers, but whose patterns are still interesting to retailers. Conversely, when the threshold is set too low, it becomes difficult to find the association rules for the following reasons. First, the computational and memory requirements of association rule mining algorithms increase considerably with low support thresholds. Second, the number of extracted patterns also increases substantially, and many of them relate a high-frequency item to a low-frequency item (for example, bread to a gold ring). Such patterns are likely to be spurious. However, the actual number of frequent items depends greatly on the support threshold that is chosen. Similarly, the possible number of association rules is large and is sensitive to the chosen support threshold.

For example, considering the dataset in Table 3.1, with minsup = 2, the frequent itemsets of length 1 are {bread} (7), {milk} (6), {egg} (5), {jam} (3), {sugar} (3). The frequent itemsets of length 2 are {bread, milk} (4), {bread, egg} (3), {bread, jam} (3), {milk, egg} (4), {milk, sugar} (3). The frequent itemsets of length 3 are {bread, milk, egg} (2), {milk, sugar, egg} (2). If minsup is set to 3, then the algorithm terminates at the second level itself.

At the initial levels, the support of items will be high, whereas in subsequent levels the support of the combinations of items will be low. For example, in level 2 the itemset {bread, milk} appears 4 times in the dataset, whereas in level 3 the itemset {bread, milk, egg}, which is a superset of {bread, milk}, appears only twice in the dataset. Hence it is necessary to reduce the minsup threshold in subsequent levels. At each level the support distribution of items has to be analyzed and a suitable minsup threshold has to be chosen. This helps to extract a larger number of lengthy frequent itemsets.

3.3.2 Computation of minsup Threshold

In view of this, an algorithm based on Apriori, called Dynamic Adaptive Support Apriori (DAS_Apriori), is proposed for mining association rules. It employs a new method for calculating the minsup threshold and for mining the large frequent itemsets and frequent association rules. An automatic support threshold should satisfy the following properties:

Property 3.3: The support threshold should be feasible.

Property 3.4: The support threshold should be appropriate.

A support threshold is said to be feasible if its value does not exceed the MAXS value and is not lower than the MINS value, that is, MINS ≤ minsup ≤ MAXS. To obtain an appropriate minsup threshold, the support distribution of the items in the dataset needs to be analyzed. Hence, it is necessary to conduct a statistical analysis of the support distribution of items. Here, the mean (µ) and the standard deviation (SD) are the two statistical values used to compute the appropriate threshold.

The mean (µ) is obtained by dividing the sum of support values by the number of candidates in that level. Let there be n candidates and let the support of each candidate be denoted by sup_i. The mean (µ) can be computed as in (3.1):

    µ = (sup_1 + sup_2 + ... + sup_n) / n                                  (3.1)

The standard deviation (SD) denotes how close the entire set of support values is to the mean value. If the support values lie close to the mean, then the SD will be small. If the support values spread out over a large range of values, the SD will be large. The formula for the standard deviation is given below in equation (3.2):

    SD = sqrt( Σ (sup_i − µ)² / (n − 1) )                                  (3.2)

Using the µ and SD values, the minimum and maximum bounds of the set of supports can be determined. The minimum bound of support (MINS) and the maximum bound of support (MAXS) can be calculated using equations (3.3) and (3.4):

    MINS = µ − SD                                                          (3.3)
    MAXS = µ + SD                                                          (3.4)

To widen or narrow these bounds, another threshold called the Candidate Threshold (CT) is introduced. It is based on the inequality of the mathematician Chebyshev.

Theorem 3.1: Chebyshev's inequality (Grimmett and Stirzaker, 2001): if X is a random variable with standard deviation σ, the probability that the outcome of X is at least a standard deviations away from its mean is no more than 1/a².

The fraction of observations falling between two distinct values, whose differences from the mean have the same absolute value, is related to the variance of the population. Chebyshev's theorem gives a conservative estimate of this fraction: for any population or sample, at least (1 − 1/k²) of the observations in the data set fall within k standard deviations of the mean, where k > 1.

Using the concept of z scores, Chebyshev's theorem can be restated as follows: for any population or sample, the proportion of all observations whose z score has an absolute value less than or equal to k is not less than (1 − 1/k²). For k = 1, the theorem states that the fraction of all observations having a z score between −1 and 1 is (1 − 1/1²) = 0. But for k > 1, Chebyshev's theorem provides a lower bound on the proportion of measurements that are within a certain number of standard deviations from the mean. This lower bound estimate can be very helpful when the distribution of a particular population is unknown or mathematically intractable. The bound can be used to determine how many of the items must lie close to the mean. In particular, for any positive value T, the proportion of the items that lies within T standard deviations of the mean is given in equation (3.5) as CT:

    CT = 1 − 1/T²                                                          (3.5)

For example, if T = 2, CT = 1 − 1/2² = 0.75. To ensure that at least a fraction CT of the items lies within T standard deviations of the mean, the T value is calculated from (3.5). This is given below in equation (3.6):

    T = 1 / sqrt(1 − CT)                                                   (3.6)

To ensure this minimum percentage of items, equations (3.3) and (3.4) can be modified as in equations (3.7) and (3.8):

    MINS = µ − T · SD                                                      (3.7)
    MAXS = µ + T · SD                                                      (3.8)
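As a quick numerical check of equations (3.5) and (3.6), the following minimal Java sketch (illustrative only) converts between CT and T; the method names are hypothetical.

public class CandidateThreshold {

    // Equation (3.5): fraction of items guaranteed within T standard deviations.
    static double ctFromT(double t) {
        return 1.0 - 1.0 / (t * t);
    }

    // Equation (3.6): number of standard deviations needed for a target CT.
    static double tFromCt(double ct) {
        return 1.0 / Math.sqrt(1.0 - ct);
    }

    public static void main(String[] args) {
        System.out.println("T = 2    -> CT = " + ctFromT(2.0));    // 0.75, as in the text
        System.out.println("CT = 0.75 -> T = " + tFromCt(0.75));   // 2.0
        System.out.println("CT = 0.1  -> T = " + tFromCt(0.1));    // about 1.05, used in Table 3.3
    }
}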

As mentioned before, a feasible minsup threshold should lie between the MINS and MAXS values. As a simple central value between these bounds, the minsup threshold is obtained from equation (3.9):

    minsup = (MINS + MAXS) / 2                                             (3.9)

At each level, the MINS and MAXS values, and thus the minsup value, are calculated dynamically. Let there be n k-itemsets in a level. Initially, the MINS and MAXS values are calculated. If the MAXS value is greater than the maximum among the support values, then MAXS is set to the maximum support value; it is not necessary to search for itemsets beyond the maximum support. In each subsequent level, a new minsup threshold is adapted based on the support distribution in that level, and this newly computed minsup threshold is used for itemset pruning.

3.3.3 DAS_Apriori Algorithm

The proposed algorithm, DAS_Apriori, extends the Apriori algorithm. It employs a levelwise search procedure for finding large frequent itemsets. The additional computational overhead incurred by this algorithm is compensated by using an efficient support counting procedure proposed by Szathmary et al (2005). It works as follows: if the dataset has n items, then an (n − 1) × (n − 1) upper triangular matrix is built, such as the one shown in Figure 3.1, which illustrates the support counting method for the dataset D of Table 3.1.

Figure 3.1 Upper Triangular Matrix

This matrix contains the support values of 2-itemsets. First, its entries are initialized to zero. A record of the dataset is decomposed into a list of 2-itemsets, and for each element of this list the value of its corresponding entry in the matrix is incremented by 1. This process is repeated for each record in the dataset.
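The following minimal Java sketch (illustrative, not the thesis implementation) mimics this 2-itemset counting on dataset D; for simplicity it allocates a full n × n array and uses only the cells above the diagonal, rather than a packed (n − 1) × (n − 1) triangular matrix.

import java.util.*;

public class PairSupportMatrix {

    public static void main(String[] args) {
        // Items of dataset D (Table 3.1) in a fixed order.
        List<String> items = Arrays.asList("bread", "milk", "egg", "jam", "sugar");
        int n = items.size();
        int[][] pairSupport = new int[n][n];   // only cells with i < j are used

        List<List<String>> records = Arrays.asList(
            Arrays.asList("bread", "milk"),
            Arrays.asList("bread", "jam"),
            Arrays.asList("milk", "egg"),
            Arrays.asList("milk", "sugar"),
            Arrays.asList("bread", "milk", "egg"),
            Arrays.asList("bread", "jam", "milk"),
            Arrays.asList("bread", "jam"),
            Arrays.asList("bread", "egg"),
            Arrays.asList("milk", "sugar", "egg"),
            Arrays.asList("milk", "egg", "sugar", "bread"));

        // Decompose each record into its 2-itemsets and increment the matching cell.
        for (List<String> record : records) {
            for (int a = 0; a < record.size(); a++) {
                for (int b = a + 1; b < record.size(); b++) {
                    int i = items.indexOf(record.get(a));
                    int j = items.indexOf(record.get(b));
                    pairSupport[Math.min(i, j)][Math.max(i, j)]++;
                }
            }
        }

        System.out.println("sup({bread, milk}) = " + pairSupport[0][1]);  // 4
        System.out.println("sup({milk, egg})   = " + pairSupport[1][2]);  // 4
    }
}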

To facilitate quick retrieval and lookup of subsets, the subset function is implemented using the trie data structure. A trie is a tree for storing strings in which each node corresponds to a prefix. The root is associated with the empty string, and the descendants of each node represent strings that begin with the prefix stored at that node. The name of the data structure comes from the word retrieval. Here, itemsets with their support value form a record. All itemsets are sorted in lexicographic order, and the trie is built over the itemsets when a subset operation has to be performed. Each node in the trie has a value, which is a 1-long item (an attribute). Because of the lexicographic order, the value of each child is greater than the value of its parent. An itemset is represented as a path in the trie, starting from the root node. Each node has a pointer back to its parent, and each terminal node has a pointer to its corresponding record.

Algorithm DAS_Apriori

Variables
    D       Dataset
    F       Frequent itemsets
    C       Candidate itemsets
    k       Level
    CT      Candidate Threshold
    T       Threshold calculated from CT
    µ       Mean
    SD      Standard deviation
    MINS    Minimum support bound
    MAXS    Maximum support bound

Input
    Dataset D
    Candidate Threshold CT

Output
    Large frequent itemsets F

DAS_Apriori(D, CT)
    F_1 = find_frequent_1_itemsets(D)
    for (k = 2; F_{k-1} ≠ Ø; k++) do
        C_k = candidate_gen(F_{k-1})
        for each record r in D do
            C_t = subset(C_k, r)
            for each candidate c ∈ C_t do
                c.count++
            end
        end
        minsup_k = minsup_calc(C_k, CT)
        F_k = {c ∈ C_k | c.count ≥ minsup_k}
    end
    return ∪_k F_k

// To generate candidate itemsets
Procedure candidate_gen(F_{k-1})
    for each itemset l1 ∈ F_{k-1} do
        for each itemset l2 ∈ F_{k-1} do
            c = join(l1, l2)
            if has_infrequent_subset(c, F_{k-1}) then
                prune c
            else
                add c to C_k
            end if
        end
    end
    return C_k

// To prune candidates with an infrequent subset
Procedure has_infrequent_subset(c, F_{k-1})
    for each (k−1)-subset s of c do
        if s ∉ F_{k-1} then
            return true
        end if
    end
    return false

// To calculate the minsup value
Procedure minsup_calc(C_k, CT)
    T = calculate_T(CT)
    µ_k = calculate_mean(C_k)
    SD_k = calculate_SD(C_k, µ_k)
    MINS_k = µ_k − T · SD_k
    MAXS_k = µ_k + T · SD_k
    minsup_k = average(MINS_k, MAXS_k)
    return minsup_k
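A minimal Java sketch (illustrative, not the thesis implementation) of the candidate_gen and has_infrequent_subset procedures above is given next. The F2 used here contains the frequent 2-itemsets of dataset D for minsup = 2, including {egg, sugar}, whose support in Table 3.1 is also 2.

import java.util.*;

public class CandidateGen {

    static Set<String> set(String... items) {
        return new TreeSet<>(Arrays.asList(items));
    }

    // Join pairs of (k-1)-itemsets; keep a union only if it has k items and
    // none of its (k-1)-subsets is infrequent.
    static Set<Set<String>> candidateGen(Set<Set<String>> fPrev, int k) {
        Set<Set<String>> candidates = new HashSet<>();
        for (Set<String> l1 : fPrev) {
            for (Set<String> l2 : fPrev) {
                Set<String> c = new TreeSet<>(l1);
                c.addAll(l2);
                if (c.size() == k && !hasInfrequentSubset(c, fPrev)) {
                    candidates.add(c);
                }
            }
        }
        return candidates;
    }

    // True when some (k-1)-subset of c is not in F(k-1); such a candidate is pruned.
    static boolean hasInfrequentSubset(Set<String> c, Set<Set<String>> fPrev) {
        for (String item : c) {
            Set<String> subset = new TreeSet<>(c);
            subset.remove(item);
            if (!fPrev.contains(subset)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<Set<String>> f2 = new HashSet<>(Arrays.asList(
            set("bread", "milk"), set("bread", "egg"), set("bread", "jam"),
            set("milk", "egg"), set("milk", "sugar"), set("egg", "sugar")));

        // Prints, in some order, the two surviving candidates {bread, egg, milk}
        // and {egg, milk, sugar}; e.g. {bread, jam, milk} is pruned because its
        // subset {jam, milk} is not in F2.
        System.out.println("C3 = " + candidateGen(f2, 3));
    }
}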

The DAS_Apriori algorithm generates all large frequent itemsets by making multiple passes over the data. In each pass, it counts the supports of the itemsets, finds the MINS and MAXS values and thus the minsup threshold. Initially, the itemsets that satisfy the minsup are retrieved. To extract a sufficient number of frequent itemsets, the user can specify the value of another threshold, CT. To calculate the T value used in (3.7) and (3.8), the CT value is chosen between 0.1 and 0.9. Hence, the performance of the algorithm depends on the CT value.

Example

The execution of the DAS_Apriori algorithm on dataset D (Table 3.1) with CT = 0.1 is illustrated in Table 3.3.

Table 3.3 Execution of DAS_Apriori with CT = 0.1 on Dataset D

C1        Sup   F1       C2            Sup   F2
Bread     7     Bread    Bread, milk   4     Bread, milk
Milk      6     Milk     Bread, egg    3     Milk, egg
Egg       5     Egg      Milk, egg     4
Jam       3
Sugar     3

During the first level the calculated values are: µ = 4.8, SD = 1.79, T = 1.05, MINS = 3, MAXS = 7 and thus minsup = 5. Hence there are three items satisfying the minsup value, and they are listed in F1. Now, three candidate itemsets are formed in the second level and listed in C2. During level 2, the values are calculated as µ = 3.67, SD = 0.58, MINS = 3, MAXS = 4 and thus minsup = 4. Two itemsets are found to be frequent and listed in F2. Only one candidate is generated for the third level. Since the SD value is 0, the algorithm terminates.
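The following minimal Java sketch (illustrative, not the thesis code) reproduces the level-1 and level-2 threshold values quoted above for CT = 0.1; rounding each bound to the nearest integer is an assumption made here so that the values match the text.

public class DasMinsupCalc {

    static long minsupCalc(double[] supports, double ct) {
        int n = supports.length;

        double mean = 0;                              // equation (3.1)
        for (double s : supports) mean += s;
        mean /= n;

        double ss = 0;                                // equation (3.2), sample SD
        for (double s : supports) ss += (s - mean) * (s - mean);
        double sd = Math.sqrt(ss / (n - 1));

        double t = 1.0 / Math.sqrt(1.0 - ct);         // equation (3.6)
        double mins = Math.round(mean - t * sd);      // equation (3.7), rounded
        double maxs = Math.round(mean + t * sd);      // equation (3.8), rounded
        return Math.round((mins + maxs) / 2.0);       // equation (3.9)
    }

    public static void main(String[] args) {
        double[] level1 = {7, 6, 5, 3, 3};            // C1 supports of dataset D
        double[] level2 = {4, 3, 4};                  // C2 supports of dataset D
        System.out.println("level 1 minsup = " + minsupCalc(level1, 0.1)); // 5
        System.out.println("level 2 minsup = " + minsupCalc(level2, 0.1)); // 4
    }
}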

3.3.4 Rule Generation

The concept of association rules was introduced by Agrawal et al (1993). The proposed algorithm adopts the confidence-based rule generation method of Apriori. The confidence threshold is used to find the frequent association rules; the confidence of a rule is its support divided by the support of its antecedent. In this process, the first step is to find all frequent itemsets F in dataset D, where sup(F) ≥ minsup. For each frequent itemset f, all nonempty subsets s of f are generated, and the rule s → (f − s) is generated if sup(f) / sup(s) ≥ minconf. The algorithm for rule generation is given below.

Algorithm Rule Generation

Input
    F_k        Frequent itemsets
    minconf    Minimum confidence

Output
    Association rules

Rule_gen(F_k, minconf)
    for each frequent itemset f ∈ F_k do
        for each nonempty subset s of f do
            if sup(f) / sup(s) ≥ minconf then
                output the rule s → (f − s)
            end if
        end
    end
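A minimal Java sketch (illustrative, not the thesis implementation) of the Rule_gen procedure is given below; it enumerates the nonempty proper subsets of a frequent itemset with a bit mask, and the support values are those quoted for dataset D in the running example.

import java.util.*;

public class RuleGen {

    // Support values of the itemsets involved, as quoted in the text.
    static final Map<Set<String>, Integer> SUPPORT = new HashMap<>();
    static {
        SUPPORT.put(new TreeSet<>(Arrays.asList("bread")), 7);
        SUPPORT.put(new TreeSet<>(Arrays.asList("milk")), 6);
        SUPPORT.put(new TreeSet<>(Arrays.asList("egg")), 5);
        SUPPORT.put(new TreeSet<>(Arrays.asList("bread", "milk")), 4);
        SUPPORT.put(new TreeSet<>(Arrays.asList("bread", "egg")), 3);
        SUPPORT.put(new TreeSet<>(Arrays.asList("milk", "egg")), 4);
        SUPPORT.put(new TreeSet<>(Arrays.asList("bread", "egg", "milk")), 2);
    }

    static void ruleGen(Set<String> f, double minconf) {
        List<String> items = new ArrayList<>(f);
        int supF = SUPPORT.get(new TreeSet<>(f));
        // Enumerate nonempty proper subsets s of f with a bit mask.
        for (int mask = 1; mask < (1 << items.size()) - 1; mask++) {
            Set<String> s = new TreeSet<>();
            for (int i = 0; i < items.size(); i++) {
                if ((mask & (1 << i)) != 0) s.add(items.get(i));
            }
            Set<String> consequent = new TreeSet<>(f);
            consequent.removeAll(s);
            double conf = (double) supF / SUPPORT.get(s);
            if (conf >= minconf) {
                System.out.printf("%s -> %s (conf = %.2f)%n", s, consequent, conf);
            }
        }
    }

    public static void main(String[] args) {
        // Emits bread,egg -> milk; bread,milk -> egg; egg,milk -> bread (conf >= 0.5).
        ruleGen(new TreeSet<>(Arrays.asList("bread", "egg", "milk")), 0.5);
    }
}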

The rule generation algorithm works on the following principle. For a frequent itemset f and nonempty subsets s1 ⊆ s2 ⊆ f, the support of s1 is greater than or equal to the support of s2. Thus, the confidence of the rule s1 → (f − s1) is necessarily less than or equal to the confidence of the rule s2 → (f − s2). Hence, if the rule s2 → (f − s2) is not confident, then neither is the rule s1 → (f − s1); conversely, if the rule s1 → (f − s1) is confident, then every rule s2 → (f − s2) with s1 ⊆ s2 is also confident. For example, if the rule milk → bread, egg is confident, then the rules milk, bread → egg and milk, egg → bread are confident as well.

For each frequent itemset f, all association rules with one item in the consequent are generated first. Then, using the sets of 1-long consequents, 2-long consequents are generated. The rules with 2 items in the consequent are kept only when their confidence is greater than or equal to minconf. The 2-long consequents are then used for generating consequents with 3 items, and so on. The confidence of the rules varies from 0% to 100%. Rules with low support and 100% confidence are considered exceptional but highly useful in analyzing critical cases. Rules with 100% confidence are known as exact rules.

3.3.5 Experimental Results

All the experiments in this thesis are carried out on an Intel Pentium IV 1.99 GHz machine running under the Fedora 10 operating system with 2.99 GB RAM. The algorithms are implemented in Java. Testing of the algorithms is carried out on five different benchmark datasets: the MUSHROOM, CHESS, C2D1K, T2I6D1K and T25I1D1K datasets are taken from the CORON platform (Szathmary et al 2005a). The characteristics of these datasets are illustrated in Table 3.4.

Table 3.4 Characteristics of Datasets used for Evaluation

Dataset      Records   Items   #Non-empty items   #Average attributes   Density (%)
Mushroom
Chess
C2D1K
T2I6D1K
T25I1D1K

Table 3.4 contains two kinds of datasets: synthetic and real life datasets. According to the frequency of items, they are further categorized into two types: sparse and dense datasets. This study uses two synthetic datasets, namely T2I6D1K and T25I1D1K, which are typically sparse; in these datasets most of the items have low support. Mushroom and C2D1K are real life, dense datasets. Chess is also a real life dataset with highly dense data covering all ranges of support values. Since these datasets reflect various kinds of support behaviour, they are chosen to carry out all experiments in this study.

Since the proposed algorithms are implemented as improvements over the Apriori algorithm, the performance of each algorithm is tested against Apriori. The dataset and CT are given as input to the proposed DAS_Apriori algorithm. The value of CT should be chosen between 0 and 1. The proposed algorithm calculates the support threshold automatically; it uses CT values from 0.1 to 0.9, and the algorithm is executed for each 0.1 interval. The existing Apriori algorithm requires a minsup value to be specified; to run Apriori, each dataset is analyzed and an optimum minsup value is picked.

First, the frequent itemsets (FI) are generated. The number of FIs generated by Apriori is normalized to one, and the number of FIs obtained from DAS_Apriori is normalized against Apriori for better comparison. The length of the longest frequent itemsets is also compared.

Then the extracted frequent itemsets are used to generate association rules. To generate association rules, a minconf value should be specified. By assigning a 0% minconf value, all association rules for the specified minsup value are generated. The exact rules (100% confidence) at the specified support thresholds are produced by choosing a minconf value of 100%. The response time of the DAS_Apriori algorithm is compared with the response times of the existing algorithms Apriori (Agrawal et al 1993), Pascal (Bastide et al 2000) and Zart (Szathmary et al 2005). The implementations of Pascal and Zart are taken from the CORON platform for comparison.

Table 3.5 Response Time (in Seconds), Length of FIs, #FIs and #Exact Rules of Existing Algorithms

Dataset          Apriori                 Pascal                  Zart                    Length    #FIs   #Exact
                 Response    Response    Response    Response    Response    Response    of FIs           Rules
                 Time (FI)   Time (Rule) Time (FI)   Time (Rule) Time (FI)   Time (Rule)
Mushroom (5%)
Chess (5%)
C2d1k (5%)
T2i6d1k (1%)
T25i1d1k (1%)

Table 3.6 Length of FIs, #FIs, #Exact Rules and %Exact Rules of DAS_Apriori

Dataset      Threshold   Length of FIs   #FIs   #Rules   #Exact Rules   %Exact Rules
Mushroom
Chess
C2D1K

Table 3.5 and Table 3.6 show the performance of the existing algorithms and the proposed algorithm in terms of response time, the number of FIs, the number of rules and the number of exact rules respectively. The performance comparison for each of these factors is also shown graphically. Figures 3.2 to 3.10 show the comparison results for the length of the FIs.

Figure 3.2 Comparison of Length of FIs with CT set to 0.1

Figure 3.3 Comparison of Length of FIs with CT set to 0.2

Figure 3.4 Comparison of Length of FIs with CT set to 0.3

Figure 3.5 Comparison of Length of FIs with CT set to 0.4

Figure 3.6 Comparison of Length of FIs with CT set to 0.5

Figure 3.7 Comparison of Length of FIs with CT set to 0.6

Figure 3.8 Comparison of Length of FIs with CT set to 0.7

Figure 3.9 Comparison of Length of FIs with CT set to 0.8

Figure 3.10 Comparison of Length of FIs with CT set to 0.9

In the case of DAS_Apriori, a lower CT value yields lengthier FIs. It becomes evident that for all strongly correlated datasets, like Mushroom, Chess and C2D1K, the lower CT values lead to the longest frequent itemsets. Only short FIs are generated by the proposed algorithm for the weakly correlated, sparse datasets T2I6D1K and T25I1D1K. The proposed algorithm produces lengthier FIs than Apriori for CT values from 0.1 to 0.5. The DAS_Apriori algorithm is guaranteed to produce FIs for any CT value. This is not true of Apriori: for arbitrary minsup values, the Apriori algorithm may break down and not yield any results.

The numbers of FIs generated by Apriori and DAS_Apriori are compared in Figures 3.11 to 3.19. The #FI value of Apriori is normalized to one, and the value of the proposed DAS_Apriori algorithm is normalized against Apriori. The normalized results are compared because the scales of the #FI values vary for different types of datasets.

From Figures 3.11 to 3.19, for strongly correlated datasets, the proposed algorithm with a low CT value produces better results than the existing algorithm. For the Mushroom dataset, the number of FIs increases by between 0.15% and 33.44% for the chosen CT values. The C2D1K dataset shows an improvement from 0.27% to 8.76% in #FI generation. T2I6D1K shows only a slight improvement of about 0.2%, and T25I1D1K exhibits the same pattern.

Figure 3.11 Comparison of #FIs with CT set to 0.1 (Apriori normalized to one)

Figure 3.12 Comparison of #FIs with CT set to 0.2 (Apriori normalized to one)

Figure 3.13 Comparison of #FIs with CT set to 0.3 (Apriori normalized to one)

Figure 3.14 Comparison of #FIs with CT set to 0.4 (Apriori normalized to one)

Figure 3.15 Comparison of #FIs with CT set to 0.5 (Apriori normalized to one)

Figure 3.16 Comparison of #FIs with CT set to 0.6 (Apriori normalized to one)

Figure 3.17 Comparison of #FIs with CT set to 0.7 (Apriori normalized to one)

Figure 3.18 Comparison of #FIs with CT set to 0.8 (Apriori normalized to one)

Figure 3.19 Comparison of #FIs with CT set to 0.9 (Apriori normalized to one)

The response times of the algorithms during the generation of FIs and rules for the various datasets are shown in Table 3.5 and Table 3.7, and are graphically illustrated in Figures 3.20 to 3.29.

Table 3.7 Response Times of DAS_Apriori for FI and Rule Generation

Dataset      Threshold   FI generation (in seconds)   Rule generation (in seconds)
Mushroom
Chess
C2D1K

Figure 3.20 Response Times of FI generation on Mushroom

Figure 3.21 Response Times of FI generation on Chess

Figure 3.22 Response Times of FI generation on C2D1K

Figure 3.23 Response Times of FI generation on T2I6D1K

Figure 3.24 Response Times of FI generation on T25I1D1K

Figure 3.25 Response Times of Rule generation on Mushroom

Figure 3.26 Response Times of Rule generation on Chess

Figure 3.27 Response Times of Rule generation on C2D1K

Figure 3.28 Response Times of Rule generation on T2I6D1K

Figure 3.29 Response Times of Rule generation on T25I1D1K

With respect to rule generation, the DAS_Apriori algorithm generates a larger number of rules and produces more exact rules than Apriori.

For the Mushroom dataset, DAS_Apriori improves the rule generation by at least 8.4% and by up to 18.75% for the chosen CT values. For the Chess dataset, the algorithm shows an increase from 0.73% to 46.32%, whereas for the C2D1K dataset it exhibits an improvement of only 1.55% to 5.49%. The DAS_Apriori algorithm produces poor results on weakly correlated datasets.

In the case of the Mushroom dataset, the response time of the proposed algorithm is short when a high CT value is picked. The proposed algorithm yields better response times than Apriori for medium CT values. For most of the datasets and almost all values of CT, the proposed algorithm yields quick response times, except for a few cases on Chess during FI generation. In many cases, the performance of the algorithm is better than that of the Pascal and Zart algorithms. DAS_Apriori performs better for three out of five datasets. However, the performance on the sparse datasets can still be improved.

3.4 DYNAMIC AND COLLECTIVE SUPPORT THRESHOLDS

This algorithm is proposed to improve the performance of the DAS_Apriori algorithm. DAS_Apriori requires a user specified threshold called CT for the generation of frequent itemsets and association rules; an improved algorithm is proposed which avoids the use of CT. Also, the DAS_Apriori algorithm does not perform well for sparse datasets. In particular, for weakly correlated sparse datasets, the minsup value has to be reduced significantly in subsequent levels, but if the minsup value is lowered for strongly correlated datasets, it may lead to memory scarcity problems. Keeping this in view, the proposed algorithm considers the collective support of itemsets obtained from the previous level and uses it for the subsequent level.

In this model, two minimum support counts, namely the Dynamic Minimum Support (DMS) and the Collective Minimum Support Count (CMS), are introduced for the itemset generation at each level. Initially, DMS is calculated while scanning the items in the dataset; CMS is calculated during the itemset generation. DMS reflects the frequency of items in the dataset, while CMS reflects the intrinsic nature of items in the dataset by carrying the existing support over to the next level. In each level a different minimum support value is used, that is, the DMS and CMS values are calculated in each level. Initially, the DMS is used for itemset generation, and in the subsequent levels the CMS values are used to find the frequent itemsets.

Let there be n items in the dataset, let sup_i be the support of each item, and let k represent the current level. The MAXS_k and MINS_k values are calculated as in DAS_Apriori, except that the T value is ignored since it is derived from CT. Equations (3.7) and (3.8) are therefore redefined as shown in equations (3.10) and (3.11):

    MINS_k = µ_k − SD_k                                                    (3.10)
    MAXS_k = µ_k + SD_k                                                    (3.11)

The total support of the items considered in each level is TOTOCC_k, shown in equation (3.12):

    TOTOCC_k = Σ_{i=1..n} sup_i                                            (3.12)

    DMS_k = (1/2) · ( TOTOCC_k / n + (MINS_k + MAXS_k) / 2 )               (3.13)

    CMS_k = (DMS_{k-1} + DMS_k) / 4                                        (3.14)

DMS_k and CMS_k are calculated using equations (3.13) and (3.14). The calculation of the DMS_k value is the same at every level.

Here DMS_k represents the value at the current level, whereas DMS_{k-1} represents the value at the previous level. The proposed method is known as Dynamic Collective Support Apriori (DCS_Apriori). It works as follows: in each level k, it counts the supports of the itemsets and finds the MINS_k and MAXS_k values, TOTOCC_k and thus the DMS_k value. Initially, the itemsets that satisfy the DMS_k value are retrieved. The DMS_k value is calculated based on the candidates generated in the previous level.

3.4.1 DCS_Apriori Algorithm

Algorithm DCS_Apriori

Variables
    D       Dataset
    F       Frequent itemsets
    C       Candidate itemsets
    k       Level
    µ       Mean
    SD      Standard deviation
    MINS    Minimum support bound
    MAXS    Maximum support bound

Input
    Dataset D

Output
    Large frequent itemsets F_k

DCS_Apriori(D)
    F_1 = find_frequent_1_itemsets(D)
    for (k = 2; F_{k-1} ≠ Ø; k++) do
        C_k = candidate_gen(F_{k-1})
        for each record r in D do
            C_t = subset(C_k, r)
            for each candidate c ∈ C_t do
                c.count++
            end
        end
        minsup_k = minsup_calc(C_k)
        F_k = {c ∈ C_k | c.count ≥ minsup_k}
    end
    return ∪_k F_k

// To generate candidate itemsets
Procedure candidate_gen(F_{k-1})
    for each itemset l1 ∈ F_{k-1} do
        for each itemset l2 ∈ F_{k-1} do
            c = join(l1, l2)
            if has_infrequent_subset(c, F_{k-1}) then
                prune c
            else
                add c to C_k
            end if
        end
    end
    return C_k

// To prune candidates with an infrequent subset
Procedure has_infrequent_subset(c, F_{k-1})
    for each (k−1)-subset s of c do
        if s ∉ F_{k-1} then
            return true
        end if
    end
    return false

// To calculate the minsup value
Procedure minsup_calc(C_k)
    µ_k = calculate_mean(C_k)
    SD_k = calculate_SD(C_k, µ_k)
    MINS_k = µ_k − SD_k
    MAXS_k = µ_k + SD_k
    TOTOCC_k = sum of c.count for all c ∈ C_k
    DMS_k = (1/2) · ( TOTOCC_k / n + (MINS_k + MAXS_k) / 2 )
    if (k = 1) then
        CMS_k = DMS_k
    else
        CMS_k = (DMS_{k-1} + DMS_k) / 4
    end if
    return CMS_k
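The following minimal Java sketch (illustrative, not the thesis code) computes the DMS and CMS thresholds of equations (3.10) to (3.14) and reproduces the values used in the example that follows; rounding each value to the nearest integer is an assumption made here so that the numbers match the text.

public class DcsThreshold {

    // DMS_k = (TOTOCC_k / n + (MINS_k + MAXS_k) / 2) / 2, equations (3.10)-(3.13).
    static long dms(double[] supports) {
        int n = supports.length;
        double totocc = 0;                       // TOTOCC_k, equation (3.12)
        for (double s : supports) totocc += s;
        double mean = totocc / n;

        double ss = 0;                           // sample standard deviation
        for (double s : supports) ss += (s - mean) * (s - mean);
        double sd = Math.sqrt(ss / (n - 1));

        double mins = Math.round(mean - sd);     // equation (3.10), rounded
        double maxs = Math.round(mean + sd);     // equation (3.11), rounded
        return Math.round((totocc / n + (mins + maxs) / 2) / 2);
    }

    public static void main(String[] args) {
        double[] level1 = {7, 6, 5, 3, 3};       // C1 supports of dataset D
        double[] level2 = {4, 3, 4};             // C2 supports of dataset D

        long dms1 = dms(level1);                 // 5
        long dms2 = dms(level2);                 // 4
        long cms2 = Math.round((dms1 + dms2) / 4.0);  // equation (3.14): about 2
        System.out.println("DMS1=" + dms1 + ", DMS2=" + dms2 + ", CMS2=" + cms2);
    }
}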

3.4.2 DCS_Apriori: An Example

The execution of the DCS_Apriori algorithm on dataset D (Table 3.1) is illustrated in Table 3.8.

Table 3.8 Execution of DCS_Apriori on Dataset D

C1        Sup   F1       C2            Sup   F2            F3
Bread     7     Bread    Bread, milk   4     Bread, milk   Bread, milk, egg
Milk      6     Milk     Bread, egg    3     Bread, egg
Egg       5     Egg      Milk, egg     4     Milk, egg
Jam       3
Sugar     3

During the first level, DMS_1 = (4.8 + 5) / 2 ≈ 5, so there are three items satisfying the condition. Three candidate itemsets are formed in the second level. The support threshold becomes DMS_2 = (3.67 + 3.5) / 2 ≈ 4 and CMS_2 = (5 + 4) / 4 ≈ 2. Thus, the support threshold is 2 and all three candidate itemsets are selected in the second level. Only one candidate itemset, with support 2, is generated in the third level. For a single candidate it is not necessary to calculate DMS_3 and CMS_3 during the third level: its support is directly compared with the level 2 threshold and the itemset is found to be frequent.

3.4.3 Experimental Results

The proposed DCS_Apriori algorithm is tested with the benchmark datasets described in Table 3.4, and compared with Apriori and DAS_Apriori. A minsup value of 5% is considered to be the optimum threshold for the Mushroom, Chess and C2D1K datasets, whereas a 1% minsup is considered optimum for the T2I6D1K and T25I1D1K datasets; hence, these minsup values are taken to run Apriori. The DAS_Apriori algorithm uses a CT value of 0.5, which is considered to be optimum for all datasets used in this comparison.

Table 3.9 Response Time of Algorithms during FI and Rule Generation

Dataset          Apriori      Pascal       Zart         DAS_Apriori   DCS_Apriori
                 FI    Rule   FI    Rule   FI    Rule   FI    Rule    FI    Rule
Mushroom (5%)
Chess (5%)
C2d1k (5%)
T2i6d1k (1%)
T25i1d1k (1%)

Table 3.10 Length of FIs, #FIs, #Rules and #Exact Rules of DAS_Apriori, DCS_Apriori and Apriori

Dataset          DAS_Apriori                           DCS_Apriori                                    Apriori
                 Length of FIs  #FIs  #Exact Rules     Length of FIs  #FIs  #Rules  #Exact Rules      Length of FIs  #FIs  #Rules  #Exact Rules
Mushroom (5%)
Chess (5%)
C2d1k (5%)
T2i6d1k (1%)
T25i1d1k (1%)

Table 3.9 lists the response times of the existing algorithms along with the proposed algorithms DAS_Apriori and DCS_Apriori. The length of the FIs, the number of FIs and the number of rules generated by the algorithms for the benchmark datasets are shown in Table 3.10. Graphical representations are given in Figures 3.30 and 3.31.

Figure 3.30 Comparison of Length of FIs

Figure 3.31 Comparison of #FIs

From the analysis, it is clear that both the DAS_Apriori and DCS_Apriori algorithms produce the longest FIs for strongly correlated datasets. The DAS_Apriori algorithm performs poorly for weakly correlated datasets in terms of the length of FIs, #FIs and #Rules. In the case of DCS_Apriori, the weakly correlated datasets T2I6D1K and T25I1D1K also yield comparatively good results. The proposed DCS_Apriori algorithm produces 33.51% more FIs on the Mushroom dataset, and shows an increase of up to 3% for the other benchmark datasets.

In the case of exact rule generation, the DCS_Apriori algorithm yields an additional 3.46% improvement over the DAS_Apriori algorithm on the Mushroom dataset; the total increase in the number of exact rules produced by DCS_Apriori on the Mushroom dataset is 13.46%. Due to the highly dense nature of Chess, the rule generation algorithm terminates. For the C2D1K dataset, the algorithm exhibits an improvement of only 0.1%. The performance improvement for the T25I1D1K dataset in the number of exact rules generated is 0.14%. The same pattern of improvement is noticed in the rule generation process as well.

The results show that the DCS_Apriori algorithm explores more hidden frequent itemsets, and thus more rules, in all kinds of datasets. DCS_Apriori also shows a significant improvement in the generation of exact association rules. Although the response time of DCS_Apriori is somewhat higher for all datasets, the increase is negligible given that the algorithm explores more hidden rules. It also frees the user from specifying a minimum support and guarantees the generation of interesting association rules.


More information

Association Rules. Berlin Chen References:

Association Rules. Berlin Chen References: Association Rules Berlin Chen 2005 References: 1. Data Mining: Concepts, Models, Methods and Algorithms, Chapter 8 2. Data Mining: Concepts and Techniques, Chapter 6 Association Rules: Basic Concepts A

More information

Association Rule Mining: FP-Growth

Association Rule Mining: FP-Growth Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong We have already learned the Apriori algorithm for association rule mining. In this lecture, we will discuss a faster

More information

Chapter 7: Frequent Itemsets and Association Rules

Chapter 7: Frequent Itemsets and Association Rules Chapter 7: Frequent Itemsets and Association Rules Information Retrieval & Data Mining Universität des Saarlandes, Saarbrücken Winter Semester 2013/14 VII.1&2 1 Motivational Example Assume you run an on-line

More information

FP-Growth algorithm in Data Compression frequent patterns

FP-Growth algorithm in Data Compression frequent patterns FP-Growth algorithm in Data Compression frequent patterns Mr. Nagesh V Lecturer, Dept. of CSE Atria Institute of Technology,AIKBS Hebbal, Bangalore,Karnataka Email : nagesh.v@gmail.com Abstract-The transmission

More information

INFREQUENT WEIGHTED ITEM SET MINING USING NODE SET BASED ALGORITHM

INFREQUENT WEIGHTED ITEM SET MINING USING NODE SET BASED ALGORITHM INFREQUENT WEIGHTED ITEM SET MINING USING NODE SET BASED ALGORITHM G.Amlu #1 S.Chandralekha #2 and PraveenKumar *1 # B.Tech, Information Technology, Anand Institute of Higher Technology, Chennai, India

More information

Chapter 6: Association Rules

Chapter 6: Association Rules Chapter 6: Association Rules Association rule mining Proposed by Agrawal et al in 1993. It is an important data mining model. Transaction data (no time-dependent) Assume all data are categorical. No good

More information

Mining Association Rules in Large Databases

Mining Association Rules in Large Databases Mining Association Rules in Large Databases Vladimir Estivill-Castro School of Computing and Information Technology With contributions fromj. Han 1 Association Rule Mining A typical example is market basket

More information

Mining Top-K Strongly Correlated Item Pairs Without Minimum Correlation Threshold

Mining Top-K Strongly Correlated Item Pairs Without Minimum Correlation Threshold Mining Top-K Strongly Correlated Item Pairs Without Minimum Correlation Threshold Zengyou He, Xiaofei Xu, Shengchun Deng Department of Computer Science and Engineering, Harbin Institute of Technology,

More information

CLOSET+:Searching for the Best Strategies for Mining Frequent Closed Itemsets

CLOSET+:Searching for the Best Strategies for Mining Frequent Closed Itemsets CLOSET+:Searching for the Best Strategies for Mining Frequent Closed Itemsets Jianyong Wang, Jiawei Han, Jian Pei Presentation by: Nasimeh Asgarian Department of Computing Science University of Alberta

More information

Pattern Mining. Knowledge Discovery and Data Mining 1. Roman Kern KTI, TU Graz. Roman Kern (KTI, TU Graz) Pattern Mining / 42

Pattern Mining. Knowledge Discovery and Data Mining 1. Roman Kern KTI, TU Graz. Roman Kern (KTI, TU Graz) Pattern Mining / 42 Pattern Mining Knowledge Discovery and Data Mining 1 Roman Kern KTI, TU Graz 2016-01-14 Roman Kern (KTI, TU Graz) Pattern Mining 2016-01-14 1 / 42 Outline 1 Introduction 2 Apriori Algorithm 3 FP-Growth

More information

Purna Prasad Mutyala et al, / (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 2 (5), 2011,

Purna Prasad Mutyala et al, / (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 2 (5), 2011, Weighted Association Rule Mining Without Pre-assigned Weights PURNA PRASAD MUTYALA, KUMAR VASANTHA Department of CSE, Avanthi Institute of Engg & Tech, Tamaram, Visakhapatnam, A.P., India. Abstract Association

More information

Temporal Weighted Association Rule Mining for Classification

Temporal Weighted Association Rule Mining for Classification Temporal Weighted Association Rule Mining for Classification Purushottam Sharma and Kanak Saxena Abstract There are so many important techniques towards finding the association rules. But, when we consider

More information

An Algorithm for Mining Large Sequences in Databases

An Algorithm for Mining Large Sequences in Databases 149 An Algorithm for Mining Large Sequences in Databases Bharat Bhasker, Indian Institute of Management, Lucknow, India, bhasker@iiml.ac.in ABSTRACT Frequent sequence mining is a fundamental and essential

More information

signicantly higher than it would be if items were placed at random into baskets. For example, we

signicantly higher than it would be if items were placed at random into baskets. For example, we 2 Association Rules and Frequent Itemsets The market-basket problem assumes we have some large number of items, e.g., \bread," \milk." Customers ll their market baskets with some subset of the items, and

More information

Appropriate Item Partition for Improving the Mining Performance

Appropriate Item Partition for Improving the Mining Performance Appropriate Item Partition for Improving the Mining Performance Tzung-Pei Hong 1,2, Jheng-Nan Huang 1, Kawuu W. Lin 3 and Wen-Yang Lin 1 1 Department of Computer Science and Information Engineering National

More information

Parallel Mining of Maximal Frequent Itemsets in PC Clusters

Parallel Mining of Maximal Frequent Itemsets in PC Clusters Proceedings of the International MultiConference of Engineers and Computer Scientists 28 Vol I IMECS 28, 19-21 March, 28, Hong Kong Parallel Mining of Maximal Frequent Itemsets in PC Clusters Vong Chan

More information

Efficient Remining of Generalized Multi-supported Association Rules under Support Update

Efficient Remining of Generalized Multi-supported Association Rules under Support Update Efficient Remining of Generalized Multi-supported Association Rules under Support Update WEN-YANG LIN 1 and MING-CHENG TSENG 1 Dept. of Information Management, Institute of Information Engineering I-Shou

More information

A Survey of Sequential Pattern Mining

A Survey of Sequential Pattern Mining Data Science and Pattern Recognition c 2017 ISSN XXXX-XXXX Ubiquitous International Volume 1, Number 1, February 2017 A Survey of Sequential Pattern Mining Philippe Fournier-Viger School of Natural Sciences

More information

Mining Top-K Association Rules. Philippe Fournier-Viger 1 Cheng-Wei Wu 2 Vincent Shin-Mu Tseng 2. University of Moncton, Canada

Mining Top-K Association Rules. Philippe Fournier-Viger 1 Cheng-Wei Wu 2 Vincent Shin-Mu Tseng 2. University of Moncton, Canada Mining Top-K Association Rules Philippe Fournier-Viger 1 Cheng-Wei Wu 2 Vincent Shin-Mu Tseng 2 1 University of Moncton, Canada 2 National Cheng Kung University, Taiwan AI 2012 28 May 2012 Introduction

More information

CS570 Introduction to Data Mining

CS570 Introduction to Data Mining CS570 Introduction to Data Mining Frequent Pattern Mining and Association Analysis Cengiz Gunay Partial slide credits: Li Xiong, Jiawei Han and Micheline Kamber George Kollios 1 Mining Frequent Patterns,

More information

Machine Learning: Symbolische Ansätze

Machine Learning: Symbolische Ansätze Machine Learning: Symbolische Ansätze Unsupervised Learning Clustering Association Rules V2.0 WS 10/11 J. Fürnkranz Different Learning Scenarios Supervised Learning A teacher provides the value for the

More information

Efficient Incremental Mining of Top-K Frequent Closed Itemsets

Efficient Incremental Mining of Top-K Frequent Closed Itemsets Efficient Incremental Mining of Top- Frequent Closed Itemsets Andrea Pietracaprina and Fabio Vandin Dipartimento di Ingegneria dell Informazione, Università di Padova, Via Gradenigo 6/B, 35131, Padova,

More information

FastLMFI: An Efficient Approach for Local Maximal Patterns Propagation and Maximal Patterns Superset Checking

FastLMFI: An Efficient Approach for Local Maximal Patterns Propagation and Maximal Patterns Superset Checking FastLMFI: An Efficient Approach for Local Maximal Patterns Propagation and Maximal Patterns Superset Checking Shariq Bashir National University of Computer and Emerging Sciences, FAST House, Rohtas Road,

More information

Data Structure for Association Rule Mining: T-Trees and P-Trees

Data Structure for Association Rule Mining: T-Trees and P-Trees IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 16, NO. 6, JUNE 2004 1 Data Structure for Association Rule Mining: T-Trees and P-Trees Frans Coenen, Paul Leng, and Shakil Ahmed Abstract Two new

More information

Novel Techniques to Reduce Search Space in Multiple Minimum Supports-Based Frequent Pattern Mining Algorithms

Novel Techniques to Reduce Search Space in Multiple Minimum Supports-Based Frequent Pattern Mining Algorithms Novel Techniques to Reduce Search Space in Multiple Minimum Supports-Based Frequent Pattern Mining Algorithms ABSTRACT R. Uday Kiran International Institute of Information Technology-Hyderabad Hyderabad

More information

Parallelizing Frequent Itemset Mining with FP-Trees

Parallelizing Frequent Itemset Mining with FP-Trees Parallelizing Frequent Itemset Mining with FP-Trees Peiyi Tang Markus P. Turkia Department of Computer Science Department of Computer Science University of Arkansas at Little Rock University of Arkansas

More information

Association Rule Discovery

Association Rule Discovery Association Rule Discovery Association Rules describe frequent co-occurences in sets an itemset is a subset A of all possible items I Example Problems: Which products are frequently bought together by

More information

Roadmap DB Sys. Design & Impl. Association rules - outline. Citations. Association rules - idea. Association rules - idea.

Roadmap DB Sys. Design & Impl. Association rules - outline. Citations. Association rules - idea. Association rules - idea. 15-721 DB Sys. Design & Impl. Association Rules Christos Faloutsos www.cs.cmu.edu/~christos Roadmap 1) Roots: System R and Ingres... 7) Data Analysis - data mining datacubes and OLAP classifiers association

More information

Research Report. Constraint-Based Rule Mining in Large, Dense Databases. Roberto J. Bayardo Jr. Rakesh Agrawal Dimitrios Gunopulos

Research Report. Constraint-Based Rule Mining in Large, Dense Databases. Roberto J. Bayardo Jr. Rakesh Agrawal Dimitrios Gunopulos Research Report Constraint-Based Rule Mining in Large, Dense Databases Roberto J. Bayardo Jr. Rakesh Agrawal Dimitrios Gunopulos IBM Research Division Almaden Research Center 650 Harry Road San Jose, California

More information

CHAPTER 5 WEIGHTED SUPPORT ASSOCIATION RULE MINING USING CLOSED ITEMSET LATTICES IN PARALLEL

CHAPTER 5 WEIGHTED SUPPORT ASSOCIATION RULE MINING USING CLOSED ITEMSET LATTICES IN PARALLEL 68 CHAPTER 5 WEIGHTED SUPPORT ASSOCIATION RULE MINING USING CLOSED ITEMSET LATTICES IN PARALLEL 5.1 INTRODUCTION During recent years, one of the vibrant research topics is Association rule discovery. This

More information

APRIORI ALGORITHM FOR MINING FREQUENT ITEMSETS A REVIEW

APRIORI ALGORITHM FOR MINING FREQUENT ITEMSETS A REVIEW International Journal of Computer Application and Engineering Technology Volume 3-Issue 3, July 2014. Pp. 232-236 www.ijcaet.net APRIORI ALGORITHM FOR MINING FREQUENT ITEMSETS A REVIEW Priyanka 1 *, Er.

More information

Market baskets Frequent itemsets FP growth. Data mining. Frequent itemset Association&decision rule mining. University of Szeged.

Market baskets Frequent itemsets FP growth. Data mining. Frequent itemset Association&decision rule mining. University of Szeged. Frequent itemset Association&decision rule mining University of Szeged What frequent itemsets could be used for? Features/observations frequently co-occurring in some database can gain us useful insights

More information

Salah Alghyaline, Jun-Wei Hsieh, and Jim Z. C. Lai

Salah Alghyaline, Jun-Wei Hsieh, and Jim Z. C. Lai EFFICIENTLY MINING FREQUENT ITEMSETS IN TRANSACTIONAL DATABASES This article has been peer reviewed and accepted for publication in JMST but has not yet been copyediting, typesetting, pagination and proofreading

More information

Association Rule Discovery

Association Rule Discovery Association Rule Discovery Association Rules describe frequent co-occurences in sets an item set is a subset A of all possible items I Example Problems: Which products are frequently bought together by

More information

An Automated Support Threshold Based on Apriori Algorithm for Frequent Itemsets

An Automated Support Threshold Based on Apriori Algorithm for Frequent Itemsets An Automated Support Threshold Based on Apriori Algorithm for sets Jigisha Trivedi #, Brijesh Patel * # Assistant Professor in Computer Engineering Department, S.B. Polytechnic, Savli, Gujarat, India.

More information

FHM: Faster High-Utility Itemset Mining using Estimated Utility Co-occurrence Pruning

FHM: Faster High-Utility Itemset Mining using Estimated Utility Co-occurrence Pruning FHM: Faster High-Utility Itemset Mining using Estimated Utility Co-occurrence Pruning Philippe Fournier-Viger 1 Cheng Wei Wu 2 Souleymane Zida 1 Vincent S. Tseng 2 presented by Ted Gueniche 1 1 University

More information

Value Added Association Rules

Value Added Association Rules Value Added Association Rules T.Y. Lin San Jose State University drlin@sjsu.edu Glossary Association Rule Mining A Association Rule Mining is an exploratory learning task to discover some hidden, dependency

More information

Data Mining Algorithms

Data Mining Algorithms Algorithms Fall 2017 Big Data Tools and Techniques Basic Data Manipulation and Analysis Performing well-defined computations or asking well-defined questions ( queries ) Looking for patterns in data Machine

More information

Performance Based Study of Association Rule Algorithms On Voter DB

Performance Based Study of Association Rule Algorithms On Voter DB Performance Based Study of Association Rule Algorithms On Voter DB K.Padmavathi 1, R.Aruna Kirithika 2 1 Department of BCA, St.Joseph s College, Thiruvalluvar University, Cuddalore, Tamil Nadu, India,

More information

Mining Frequent Patterns without Candidate Generation

Mining Frequent Patterns without Candidate Generation Mining Frequent Patterns without Candidate Generation Outline of the Presentation Outline Frequent Pattern Mining: Problem statement and an example Review of Apriori like Approaches FP Growth: Overview

More information

Finding Sporadic Rules Using Apriori-Inverse

Finding Sporadic Rules Using Apriori-Inverse Finding Sporadic Rules Using Apriori-Inverse Yun Sing Koh and Nathan Rountree Department of Computer Science, University of Otago, New Zealand {ykoh, rountree}@cs.otago.ac.nz Abstract. We define sporadic

More information

ANU MLSS 2010: Data Mining. Part 2: Association rule mining

ANU MLSS 2010: Data Mining. Part 2: Association rule mining ANU MLSS 2010: Data Mining Part 2: Association rule mining Lecture outline What is association mining? Market basket analysis and association rule examples Basic concepts and formalism Basic rule measurements

More information

Math 214 Introductory Statistics Summer Class Notes Sections 3.2, : 1-21 odd 3.3: 7-13, Measures of Central Tendency

Math 214 Introductory Statistics Summer Class Notes Sections 3.2, : 1-21 odd 3.3: 7-13, Measures of Central Tendency Math 14 Introductory Statistics Summer 008 6-9-08 Class Notes Sections 3, 33 3: 1-1 odd 33: 7-13, 35-39 Measures of Central Tendency odd Notation: Let N be the size of the population, n the size of the

More information

An Improved Algorithm for Mining Association Rules Using Multiple Support Values

An Improved Algorithm for Mining Association Rules Using Multiple Support Values An Improved Algorithm for Mining Association Rules Using Multiple Support Values Ioannis N. Kouris, Christos H. Makris, Athanasios K. Tsakalidis University of Patras, School of Engineering Department of

More information

Survey on Frequent Pattern Mining

Survey on Frequent Pattern Mining Survey on Frequent Pattern Mining Bart Goethals HIIT Basic Research Unit Department of Computer Science University of Helsinki P.O. box 26, FIN-00014 Helsinki Finland 1 Introduction Frequent itemsets play

More information

Classification by Association

Classification by Association Classification by Association Cse352 Ar*ficial Intelligence Professor Anita Wasilewska Generating Classification Rules by Association When mining associa&on rules for use in classifica&on we are only interested

More information

Discovery of Multi Dimensional Quantitative Closed Association Rules by Attributes Range Method

Discovery of Multi Dimensional Quantitative Closed Association Rules by Attributes Range Method Discovery of Multi Dimensional Quantitative Closed Association Rules by Attributes Range Method Preetham Kumar, Ananthanarayana V S Abstract In this paper we propose a novel algorithm for discovering multi

More information

Chapter 2. Related Work

Chapter 2. Related Work Chapter 2 Related Work There are three areas of research highly related to our exploration in this dissertation, namely sequential pattern mining, multiple alignment, and approximate frequent pattern mining.

More information

Leveraging Set Relations in Exact Set Similarity Join

Leveraging Set Relations in Exact Set Similarity Join Leveraging Set Relations in Exact Set Similarity Join Xubo Wang, Lu Qin, Xuemin Lin, Ying Zhang, and Lijun Chang University of New South Wales, Australia University of Technology Sydney, Australia {xwang,lxue,ljchang}@cse.unsw.edu.au,

More information