Decision Support Systems 2012/2013
MEIC - TagusPark
Homework #5
Due: 15.Apr.2013

1 Frequent Pattern Mining

1. Consider the database D depicted in Table 1, containing five transactions, each containing several items. Consider minsup = 60% and minconf = 80%.

Table 1: Database D of transactions to be analyzed.

    TID    Items
    T100   {B, O, N, E, C, O}
    T200   {B, O, N, E, C, A}
    T300   {C, A, N, E, C, A}
    T400   {F, A, N, E, C, A}
    T500   {F, A, C, A}

(a) (1 val.) Using the FP-growth algorithm, find all frequent 4- and 3-itemsets in the database D.

The FP-growth algorithm starts by computing the support of all 1-itemsets (the candidate set C1), from which the FP-tree is then computed. Counting each distinct item once per transaction, we get

    C1:  Item   Count
         B      2
         O      2
         N      4
         E      4
         C      5
         A      4
         F      2

where the frequent items, those above minsup, are N, E, C and A (with 5 transactions, minsup = 60% corresponds to an absolute support of 3).[1] Sorting the frequent 1-itemsets in
decreasing order of support, we get the order C, N, E, A, and use this order to build the following FP-tree:

    Root
    └── C : 5
        ├── N : 4
        │   └── E : 4
        │       └── A : 3
        └── A : 1

To determine the frequent 4- and 3-itemsets, we build the conditional pattern bases, keeping only those leading to itemsets with 3 and 4 items. This leads to:

    Item   Cond. Pattern Base   Cond. Tree            Frequent Pattern
    A      {{CNE} : 3}          C : 3, N : 3, E : 3   (none; {CNEA} : 3 is not above minsup)
    E      {{CN} : 4}           C : 4, N : 4          {CNE} : 4

We can then conclude that the only frequent 3-itemset is {CNE} and there are no frequent 4-itemsets.

(b) (1 val.) Consider the frequent itemsets computed in (a). Without computing the corresponding support, show that any subitemset of such frequent itemsets must also be frequent. Use this fact to compute frequent 2- and 1-itemsets.

If minsup denotes the minimum (relative) support, an itemset S is a frequent itemset if sup%(S, D) > minsup or, equivalently, if supc(S, D) > minsup · |D|, where sup% denotes relative support, supc denotes absolute support, and |D| is the number of transactions in D. Let S' be any nonempty subset of S. Since S' appears in all transactions where S appears,

    supc(S', D) ≥ supc(S, D) > minsup · |D|.

Thus, S' is also a frequent itemset. In our case, we have the frequent 3-itemset {CNE}, from which we can derive the frequent 2-itemsets {CN}, {CE} and {NE}. Similarly, we can derive the frequent 1-itemsets {C}, {N} and {E}.

(c) (1 val.) From the frequent itemsets you discovered, list all of the strong association rules matching the following metarule, where X is a variable representing customers, and item_i denotes variables representing items (e.g., A, C, etc.):

    ∀t ∈ D, buys(X, item1) ∧ buys(X, item2) ⇒ buys(X, item3) [S, C].

Do not forget to include the values for the support S and confidence C for any rules you may discover.

[1] In these solutions, we considered a strict minimum support, i.e., we considered as frequent only those itemsets I such that supp(I) > minsup.
However, for grading purposes, we equally admitted solutions that considered as frequent those itemsets I such that supp(I) ≥ minsup.
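As a quick sanity check (not part of the original solution), the frequent itemsets of database D can be recovered by brute-force enumeration in Python; the transaction contents and the strict threshold supp > minsup follow the tables above:

```python
from itertools import combinations

# Database D from Table 1 (repeated items within a transaction count once).
D = [
    {"B", "O", "N", "E", "C"},       # T100
    {"B", "O", "N", "E", "C", "A"},  # T200
    {"C", "A", "N", "E"},            # T300
    {"F", "A", "N", "E", "C"},       # T400
    {"F", "A", "C"},                 # T500
]
minsup_count = 0.60 * len(D)  # strict threshold: frequent means count > 3

def support(itemset):
    """Absolute support: number of transactions containing all the items."""
    return sum(1 for t in D if itemset <= t)

items = sorted(set().union(*D))
frequent = {}
for k in range(1, len(items) + 1):
    level = {frozenset(c): support(frozenset(c))
             for c in combinations(items, k)}
    level = {s: n for s, n in level.items() if n > minsup_count}
    if not level:  # no frequent k-itemsets, hence no larger ones either
        break
    frequent.update(level)

for s, n in sorted(frequent.items(), key=lambda x: (len(x[0]), -x[1])):
    print("{" + "".join(sorted(s)) + "} :", n)
```

This reproduces (a) and (b): {CNE} is the only frequent 3-itemset and there are no frequent 4-itemsets. It also surfaces the frequent itemsets {A} and {CA}, anticipating the discussion in Question 1(d).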
In our case, since the provided metarule involves 3 items, we need only consider the association rules derived from the frequent itemset {CNE}. In particular, we get three possible association rules verifying the provided metarule:

    {CN} ⇒ {E}   [S = 0.8, C = 1]
    {CE} ⇒ {N}   [S = 0.8, C = 1]
    {EN} ⇒ {C}   [S = 0.8, C = 1]

Since all rules are above the minconf threshold, all three are strong rules.

(d) (1 val.) Design an example to illustrate that, in general, computing 2- and 1-frequent itemsets from discovered 3-frequent itemsets is not sufficient to guarantee that all frequent itemsets have been discovered. Is this the case of database D?

As an example, we can use the dataset provided. As can easily be seen in Question 1(a), the itemset {A} is a frequent 1-itemset that is not a subset of {CNE}, the only frequent 3-itemset determined there. Similarly, by running FP-growth to completion, we can conclude that the 2-itemset {CA} is frequent but, again, is not a subset of the frequent itemset {CNE}. This shows that computing 2- and 1-frequent itemsets from discovered 3-frequent itemsets is not sufficient to guarantee that all frequent itemsets are discovered, and database D itself exhibits this.

2. (1 val.) Discuss advantages and disadvantages of FP-growth versus Apriori.

Apriori has to perform multiple scans of the database, while FP-growth scans the database only twice: once to count item supports and once to build the FP-tree, after which no additional scans are required. Moreover, Apriori requires that candidate itemsets be generated, an operation that is computationally expensive (owing to the self-join involved), while FP-growth generates no candidates. On the other hand, FP-growth implies handling an FP-tree, a more complex data structure than those involved in Apriori. Scenarios involving a large number of possible items and transactions of large cardinality may lead to complex FP-trees, the storage and handling of which become computationally expensive.
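To make the candidate-generation cost concrete, here is a minimal Python sketch (an illustration, not part of the original solution) of Apriori's join-and-prune step, applied to the frequent 2-itemsets of database D:

```python
from itertools import combinations

def apriori_gen(frequent_km1):
    """Candidate generation: self-join frequent (k-1)-itemsets sharing their
    first k-2 items, then prune candidates with an infrequent subset."""
    prev = sorted(tuple(sorted(s)) for s in frequent_km1)
    prev_set = set(prev)
    candidates = []
    for a, b in combinations(prev, 2):
        if a[:-1] == b[:-1]:          # join step: differ only in the last item
            cand = a + (b[-1],)
            # prune step: every (k-1)-subset must itself be frequent
            if all(tuple(sorted(sub)) in prev_set
                   for sub in combinations(cand, len(cand) - 1)):
                candidates.append(cand)
    return candidates

# Frequent 2-itemsets of database D (minsup = 60%, strict):
L2 = [("C", "N"), ("C", "E"), ("N", "E"), ("C", "A")]
print(apriori_gen(L2))  # the single 3-candidate {CEN}
```

The only surviving candidate is {CNE}, which agrees with the FP-growth result in Question 1(a); the quadratic self-join over the previous level is the cost FP-growth avoids.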
Though debate exists, it is not established that either method is computationally more efficient in general.

1.1 Practical Questions (Using SQL Server 2012)

3. Using SQL Server Management Studio, connect to the database AdventureWorksDW2012.

(a) (1 val.) Write an SQL query to determine the number of transactions in the view vassocseqorders. In your answer document, include both the SQL query and the obtained value.

One possible query would be:
    SELECT COUNT(*)
    FROM dbo.vassocseqorders

leading to the value 21,255.

(b) (1 val.) Write an SQL query to identify, in the view vassocseqlineitems, which models appear in more than 1,500 orders. In your answer document, include both the SQL query and the obtained result.

One possible query would be:

    SELECT I.Model, COUNT(*) AS Total
    FROM dbo.vassocseqlineitems I
    GROUP BY I.Model
    HAVING COUNT(*) > 1500
    ORDER BY COUNT(*) DESC

resulting in the following table:

    Model                         Total
    Sport-100                     6,171
    Water Bottle                  4,076
    Patch kit                     3,010
    Mountain Tire Tube            2,908
    Mountain-200                  2,477
    Road Tire Tube                2,216
    Cycling Cap                   2,095
    Fender Set - Mountain         2,014
    Mountain Bottle Cage          1,941
    Road Bottle Cage              1,702
    Long-Sleeve Logo Jersey       1,642
    Short-Sleeve Classic Jersey   1,537

(c) (1 val.) Write an SQL query to identify, in the view vassocseqlineitems, which pairs of models appear in more than 1,500 orders (do not include pairs in which both elements are the same). In your answer document, include both the SQL query and the obtained result.
One possible query would be:

    SELECT I.Model, J.Model, COUNT(*) AS Total
    FROM dbo.vassocseqlineitems I
    INNER JOIN dbo.vassocseqlineitems J
        ON I.OrderNumber = J.OrderNumber AND I.Model < J.Model
    GROUP BY I.Model, J.Model
    HAVING COUNT(*) > 1500
    ORDER BY COUNT(*) DESC

resulting in the following table:

    Model                  Model          Total
    Mountain Bottle Cage   Water Bottle   1,623
    Road Bottle Cage       Water Bottle   1,513

(d) (1 val.) Write an SQL query to identify, in the view vassocseqlineitems, which triplets of models appear in more than 1,500 orders (do not include triplets with repeated elements). In your answer document, include both the SQL query and the obtained result.

One possible query would be:

    SELECT I.Model, J.Model, K.Model, COUNT(*) AS Total
    FROM dbo.vassocseqlineitems I, dbo.vassocseqlineitems J, dbo.vassocseqlineitems K
    WHERE I.OrderNumber = J.OrderNumber
      AND J.OrderNumber = K.OrderNumber
      AND I.Model < J.Model AND J.Model < K.Model
    GROUP BY I.Model, J.Model, K.Model
    HAVING COUNT(*) > 1500
    ORDER BY COUNT(*) DESC

resulting in an empty table.
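For intuition, the pair query above can be mimicked in Python (on made-up toy rows, not the actual AdventureWorks data): the self-join condition I.Model < J.Model amounts to enumerating the 2-item combinations of each order's distinct models:

```python
from itertools import combinations
from collections import Counter

# Toy stand-in for dbo.vassocseqlineitems: (OrderNumber, Model) rows.
# Illustrative data only; order numbers and contents are invented.
rows = [
    ("SO1", "Water Bottle"), ("SO1", "Mountain Bottle Cage"),
    ("SO2", "Water Bottle"), ("SO2", "Road Bottle Cage"),
    ("SO3", "Water Bottle"), ("SO3", "Mountain Bottle Cage"),
    ("SO3", "Sport-100"),
]

# Group models by order (each order becomes a set of distinct models).
orders = {}
for order, model in rows:
    orders.setdefault(order, set()).add(model)

# The self-join with I.Model < J.Model is exactly "all unordered pairs of
# distinct models per order", enumerated here in sorted order.
pair_counts = Counter(
    pair
    for models in orders.values()
    for pair in combinations(sorted(models), 2)
)
# ('Mountain Bottle Cage', 'Water Bottle') appears in 2 of the 3 toy orders.
print(pair_counts.most_common())
```

The SQL HAVING clause then corresponds to filtering this Counter by a minimum count; the triplet query is the same idea with 3-item combinations.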
4. The different queries in Question 3 roughly correspond to the main steps of the Apriori algorithm.

(a) (1 val.) From the results in Question 3, determine the minimum (relative) support implicitly used in the aforementioned SQL queries.

Since we selected only itemsets appearing in more than 1,500 orders, we have a minimum relative support of

    minsup = 1,500 / 21,255 ≈ 7.06%.

(b) (2 val.) Determine all possible associations obtained from the frequent itemsets identified in Question 3. Indicate the confidence associated with each such association rule and all relevant calculations. Which of the calculated association rules correspond to strong rules for a minimum confidence of 60%?

Possible associations arise from frequent k-itemsets with k > 1. In our case, the possible associations are:

    Water Bottle ⇒ Mountain Bottle Cage
    Mountain Bottle Cage ⇒ Water Bottle
    Water Bottle ⇒ Road Bottle Cage
    Road Bottle Cage ⇒ Water Bottle

In order to determine which of the associations above are strong associations, we compute the corresponding confidences:

    conf(Water Bottle ⇒ Mountain Bottle Cage) = 1,623 / 4,076 = 39.8%
    conf(Mountain Bottle Cage ⇒ Water Bottle) = 1,623 / 1,941 = 83.6%
    conf(Water Bottle ⇒ Road Bottle Cage)     = 1,513 / 4,076 = 37.1%
    conf(Road Bottle Cage ⇒ Water Bottle)     = 1,513 / 1,702 = 88.9%

and we can conclude that, for minconf = 60%, only Mountain Bottle Cage ⇒ Water Bottle and Road Bottle Cage ⇒ Water Bottle are strong association rules.

5. In SQL Server Data Tools, run the Microsoft Association algorithm you experimented with in the lab, but setting the minimum support to the value computed in Question 4 and the minimum confidence to 60%.

(a) (2 val.) Provide a screenshot of the Itemset pane containing the frequent itemsets discovered by the algorithm. Compare these with your results from Question 4.

As seen in Question 3, the frequent itemsets are:
    Itemset                           Total
    Sport-100                         6,171
    Water Bottle                      4,076
    Patch kit                         3,010
    Mountain Tire Tube                2,908
    Mountain-200                      2,477
    Road Tire Tube                    2,216
    Cycling Cap                       2,095
    Fender Set - Mountain             2,014
    Mountain Bottle Cage              1,941
    Road Bottle Cage                  1,702
    Long-Sleeve Logo Jersey           1,642
    Short-Sleeve Classic Jersey       1,537
    Mountain Bottle Cage, Water Bottle   1,623
    Road Bottle Cage, Water Bottle       1,513

This corresponds to the result obtained by the Microsoft Association algorithm. The only two 2-itemsets observed are precisely those appearing in the associations determined in Question 4, as expected.

(b) (2 val.) Provide a screenshot of the Rules pane containing the strong association rules discovered by the algorithm. Compare these with your results from Question 4.
As seen in Question 4, the only strong associations are:

    Mountain Bottle Cage ⇒ Water Bottle   [sup = 32.4%, C = 83.6%]
    Road Bottle Cage ⇒ Water Bottle       [sup = 30.2%, C = 88.9%]

This corresponds to the result obtained by the Microsoft Association algorithm.

(c) (2 val.) Indicate the dependency network computed by the algorithm and explain its meaning.

The dependency network portrayed by the Microsoft Association algorithm indicates that the presence of either of the items Road Bottle Cage or Mountain Bottle Cage is a strong indicator of the presence of the item Water Bottle.

6. (2 val.) Note that, besides the confidence associated with each association rule, MS SQL Server also indicates the importance of the rule. Importance determines how useful a given rule is, and is computed as

    importance(X ⇒ Y) = log10 [ (sup(X ∧ Y) · sup(¬X)) / (sup(X) · sup(¬X ∧ Y)) ],

where sup(¬A) corresponds to the number of transactions that do not include item A. In the data-mining literature, a quantity providing similar information is the lift, computed as

    lift(X ⇒ Y) = sup%(X ∧ Y) / (sup%(X) · sup%(Y)).

Compute the importance and lift for the association rules mined. For this purpose, take into consideration the total number of transactions you computed in Question 3. Confirm the value of importance provided by Microsoft Association. Indicate your calculations, and verify that the rules with larger lift are also ranked by the Microsoft Association algorithm as more important.
Abbreviating Road Bottle Cage, Mountain Bottle Cage and Water Bottle as RBC, MBC and WB, and computing the importance for the mined rules, we get:

    importance(RBC ⇒ WB) = log10 [ (1,513 × 19,553) / (1,702 × 2,563) ] = 0.831
    importance(MBC ⇒ WB) = log10 [ (1,623 × 19,314) / (1,941 × 2,453) ] = 0.818

where, e.g., 19,553 = 21,255 - 1,702 is the number of transactions without RBC and 2,563 = 4,076 - 1,513 is the number of transactions with WB but without RBC. Computing now the lift, we get:

    lift(RBC ⇒ WB) = (1,513 × 21,255) / (1,702 × 4,076) = 4.64
    lift(MBC ⇒ WB) = (1,623 × 21,255) / (1,941 × 4,076) = 4.36

which agrees with the importance results from the Microsoft Association algorithm: the rule with the larger lift, RBC ⇒ WB, is also the rule ranked as more important.
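The calculations above can be confirmed with a few lines of Python, using the counts from Question 3 (the abbreviations RBC, MBC and WB are ours, not part of the data):

```python
from math import log10

N = 21255  # total number of transactions (Question 3(a))

# Absolute supports taken from the query results in Question 3;
# RBC, MBC, WB abbreviate Road Bottle Cage, Mountain Bottle Cage, Water Bottle.
sup = {
    "WB": 4076, "MBC": 1941, "RBC": 1702,
    ("MBC", "WB"): 1623, ("RBC", "WB"): 1513,
}

def confidence(x, y):
    return sup[(x, y)] / sup[x]

def importance(x, y):
    # log10[ sup(X and Y) * sup(not X) / (sup(X) * sup(not X and Y)) ]
    not_x = N - sup[x]                  # transactions without X
    not_x_and_y = sup[y] - sup[(x, y)]  # transactions with Y but without X
    return log10(sup[(x, y)] * not_x / (sup[x] * not_x_and_y))

def lift(x, y):
    return sup[(x, y)] * N / (sup[x] * sup[y])

for x in ("RBC", "MBC"):
    print(f"{x} => WB: conf={confidence(x, 'WB'):.1%}, "
          f"importance={importance(x, 'WB'):.3f}, lift={lift(x, 'WB'):.2f}")
```

Running this reproduces the values above: confidence 88.9% and 83.6%, importance 0.831 and 0.818, and lift 4.64 and 4.36 for RBC ⇒ WB and MBC ⇒ WB, respectively.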