Decision Support Systems 2012/2013. MEIC - TagusPark. Homework #5. Due: 15.Apr.2013

1 Frequent Pattern Mining

1. Consider the database D depicted in Table 1, containing five transactions, each containing several items. Consider minsup = 60% and minconf = 80%.

Table 1: Database D of transactions to be analyzed.

TID    Items
T100   {B, O, N, E, C, O}
T200   {B, O, N, E, C, A}
T300   {C, A, N, E, C, A}
T400   {F, A, N, E, C, A}
T500   {F, A, C, A}

(a) (1 val.) Using the FP-growth algorithm, find all frequent 4- and 3-itemsets in the database D.

The FP-growth algorithm starts by building the set C1 of frequent 1-itemsets, from which the FP-tree is then computed. From the provided data, we get

C1 =   Item   Count
       B      2
       O      2
       N      4 *
       E      4 *
       C      5 *
       A      4 *
       F      2

where the itemsets marked with * are those above minsup. (In these solutions, we considered a strict minimum support, i.e., we considered as frequent only those itemsets I such that supp(I) > minsup. For grading purposes, however, solutions that considered as frequent those itemsets I such that supp(I) ≥ minsup were equally accepted.)
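As an aside, this counting step maps naturally to SQL, anticipating the queries of the practical part. The sketch below is a minimal illustration, assuming the transactions of Table 1 were loaded into a hypothetical table Transactions(TID, Item), with one row per item occurrence:

-- Support count of each 1-itemset; COUNT(DISTINCT TID) ensures that
-- a repeated item within a transaction (e.g., the two O's in T100)
-- is counted only once, and the strict > matches the convention above.
SELECT Item, COUNT(DISTINCT TID) AS SupportCount
FROM Transactions
GROUP BY Item
HAVING COUNT(DISTINCT TID) >
       0.6 * (SELECT COUNT(DISTINCT TID) FROM Transactions)
ORDER BY SupportCount DESC

On the data of Table 1, this should return exactly C, N, E and A, with the counts shown above.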

Sorting the frequent 1-itemsets in decreasing order of support, we get the order C, N, E, A, and use this order to build the following FP-tree (the header table lists the items in the order C, N, E, A):

Root
 └─ C : 5
     ├─ N : 4
     │   └─ E : 4
     │       └─ A : 3
     └─ A : 1

To determine the frequent 4- and 3-itemsets, we build our conditional pattern bases, including only those prefix paths with 3 and 4 items. This leads to:

Item   Cond. Pattern Base   Cond. Tree            Frequent Patterns
A      {{CNE} : 3}          C : 3, N : 3, E : 3   (none)
E      {{CN} : 4}           C : 4, N : 4          {CNE} : 4

We can then conclude that the only frequent 3-itemset is {CNE} and that there are no frequent 4-itemsets.

(b) (1 val.) Consider the frequent itemsets computed in (a). Without computing the corresponding support, show that any subitemset of such frequent itemsets must also be frequent. Use this fact to compute the frequent 2- and 1-itemsets.

If minsup denotes the minimum (relative) support, an itemset S is a frequent itemset if sup%(S, D) ≥ minsup or, equivalently, if supc(S, D) ≥ minsup × |D|, where |D| is the number of transactions in D. Let S′ be any nonempty subset of S. Since S′ appears in all transactions where S appears,

supc(S′, D) ≥ supc(S, D) ≥ minsup × |D|.

Thus, S′ is also a frequent itemset. In our case, we have the frequent 3-itemset {CNE}, from which we can derive the frequent 2-itemsets {CN}, {CE} and {NE}. Similarly, we can derive the frequent 1-itemsets {C}, {N} and {E}.
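To make the argument concrete with the numbers of this database (a worked instance, using only counts already obtained in (a)):

supc({CN}, D) ≥ supc({CNE}, D) = 4 ≥ minsup × |D| = 0.6 × 5 = 3,

so {CN} is certified frequent without its support ever being counted directly, and the same chain applies to every other nonempty subset of {CNE}.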

(c) (1 val.) From the frequent itemsets you discovered, list all of the strong association rules matching the following metarule, where X is a variable representing customers and Item_i denotes variables representing items (e.g., A, C, etc.):

∀t ∈ D, buys(X, Item1) ∧ buys(X, Item2) ⇒ buys(X, Item3) [S, C].

Do not forget to include the values for the support S and confidence C for any rules you may discover.

In our case, since the provided metarule involves 3 items, we need only consider the association rules derived from the frequent itemset {CNE}. In particular, we get three possible association rules verifying the provided metarule:

{CN} ⇒ {E} [0.8, 1]
{CE} ⇒ {N} [0.8, 1]
{EN} ⇒ {C} [0.8, 1]

For the first rule, for example, S = supc({CNE}, D)/|D| = 4/5 = 0.8 and C = supc({CNE}, D)/supc({CN}, D) = 4/4 = 1. Since all rules are above the minconf threshold, all three are strong rules.

(d) (1 val.) Design an example to illustrate that, in general, computing 2- and 1-frequent itemsets from discovered 3-frequent itemsets is not sufficient to guarantee that all frequent itemsets have been discovered. Is this the case of database D?

As an example, we can consider the dataset provided. As can easily be seen in Question (a), the itemset {A} is a frequent 1-itemset that, however, is not a subset of the only frequent 3-itemset {CNE} determined in Question (a). Similarly, by running FP-growth to completion, we can conclude that the 2-itemset {CA} is frequent but, again, is not a subset of the frequent itemset {CNE}. This shows that computing 2- and 1-frequent itemsets from discovered 3-frequent itemsets is not sufficient to guarantee that all frequent itemsets are discovered.

2. (1 val.) Discuss the advantages and disadvantages of FP-growth versus Apriori.

Apriori has to perform multiple scans of the database, while FP-growth needs only two (one to compute the item counts and one to build the FP-tree) and requires no additional scans thereafter. Moreover, Apriori requires that candidate itemsets be generated, an operation that is computationally expensive (owing to the self-join involved), while FP-growth does not generate any candidates. On the other hand, FP-growth implies handling an FP-tree, a more complex data structure than those involved in Apriori. Scenarios involving a large number of possible items and itemsets of large cardinality may lead to complex FP-trees, the storage and handling of which becomes computationally expensive. Though the matter has been debated, it is not established that either method is computationally more efficient in general.
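To illustrate the candidate-generation step mentioned above, here is a minimal sketch of Apriori's self-join in SQL, over a hypothetical table FrequentItems(Item) holding the frequent 1-itemsets (the table and column names are ours):

-- Each unordered pair of frequent items is generated exactly once;
-- with many frequent items this self-join grows quadratically,
-- which is the cost the discussion above refers to.
SELECT A.Item AS Item1, B.Item AS Item2
FROM FrequentItems A
INNER JOIN FrequentItems B ON A.Item < B.Item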

1.1 Practical Questions (Using SQL Server 2012)

3. Using SQL Server Management Studio, connect to the database AdventureWorksDW2012.

(a) (1 val.) Write an SQL query to determine the number of transactions in the view vAssocSeqOrders. In your answer document, include both the SQL query and the obtained value.

One possible query would be:

SELECT COUNT(*)
FROM dbo.vAssocSeqOrders

leading to the value 21,255.

(b) (1 val.) Write an SQL query to identify, in the view vAssocSeqLineItems, which models appear in more than 1,500 orders. In your answer document, include both the SQL query and the obtained result.

One possible query would be:

SELECT I.Model, COUNT(*) AS Total
FROM dbo.vAssocSeqLineItems I
GROUP BY I.Model
HAVING COUNT(*) > 1500
ORDER BY COUNT(*) DESC

resulting in the following table:

Model                        Total
Sport-100                    6,171
Water Bottle                 4,076
Patch kit                    3,010
Mountain Tire Tube           2,908
Mountain-200                 2,477
Road Tire Tube               2,216
Cycling Cap                  2,095
Fender Set - Mountain        2,014
Mountain Bottle Cage         1,941
Road Bottle Cage             1,702
Long-Sleeve Logo Jersey      1,642
Short-Sleeve Classic Jersey  1,537
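Note that COUNT(*) counts line items, so the query assumes each model appears at most once per order. If the view could repeat a model within an order (an assumption we have not verified), a defensive variant would count distinct orders instead:

-- Variant counting distinct orders rather than line items.
SELECT I.Model, COUNT(DISTINCT I.OrderNumber) AS Total
FROM dbo.vAssocSeqLineItems I
GROUP BY I.Model
HAVING COUNT(DISTINCT I.OrderNumber) > 1500
ORDER BY Total DESC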

(c) (1 val.) Write an SQL query to identify, in the view vAssocSeqLineItems, which pairs of models appear in more than 1,500 orders (do not include pairs in which both elements are the same). In your answer document, include both the SQL query and the obtained result.

One possible query would be:

SELECT I.Model, J.Model, COUNT(*) AS Total
FROM dbo.vAssocSeqLineItems I
INNER JOIN dbo.vAssocSeqLineItems J
ON I.OrderNumber = J.OrderNumber AND I.Model < J.Model
GROUP BY I.Model, J.Model
HAVING COUNT(*) > 1500
ORDER BY COUNT(*) DESC

resulting in the following table:

Model                 Model         Total
Mountain Bottle Cage  Water Bottle  1,623
Road Bottle Cage      Water Bottle  1,513

(d) (1 val.) Write an SQL query to identify, in the view vAssocSeqLineItems, which triplets of models appear in more than 1,500 orders (do not include triplets with repeated elements). In your answer document, include both the SQL query and the obtained result.

One possible query would be:

SELECT I.Model, J.Model, K.Model, COUNT(*) AS Total
FROM dbo.vAssocSeqLineItems I
INNER JOIN dbo.vAssocSeqLineItems J
ON I.OrderNumber = J.OrderNumber AND I.Model < J.Model
INNER JOIN dbo.vAssocSeqLineItems K
ON J.OrderNumber = K.OrderNumber AND J.Model < K.Model
GROUP BY I.Model, J.Model, K.Model
HAVING COUNT(*) > 1500
ORDER BY COUNT(*) DESC

resulting in an empty table. This is to be expected from the downward-closure argument of Question 1(b): a triplet above the threshold would require all three of its constituent pairs to appear in more than 1,500 orders, and the only two such pairs, {Mountain Bottle Cage, Water Bottle} and {Road Bottle Cage, Water Bottle}, do not jointly cover a triplet, since {Mountain Bottle Cage, Road Bottle Cage} is below the threshold.

4. The different queries in Question 3 roughly correspond to the main steps of the Apriori algorithm.

(a) (1 val.) From the results in Question 3, determine the minimum (relative) support implicitly used in the aforementioned SQL queries.

Since we selected only itemsets appearing in more than 1,500 orders, we have a minimum relative support of

minsup = 1,500 / 21,255 ≈ 7.05%.

(b) (2 val.) Determine all possible associations obtained from the frequent itemsets identified in Question 3. Indicate the confidence associated with each such association rule and all relevant calculations. Which of the calculated association rules correspond to strong rules for a minimum confidence of 60%?

Possible associations arise from frequent k-itemsets with k > 1. In our case, we have, as possible associations:

Water Bottle ⇒ Mountain Bottle Cage
Mountain Bottle Cage ⇒ Water Bottle
Water Bottle ⇒ Road Bottle Cage
Road Bottle Cage ⇒ Water Bottle

In order to determine which of the associations above are strong, we compute the corresponding confidences:

conf(Water Bottle ⇒ Mountain Bottle Cage) = 1,623 / 4,076 = 39.8%
conf(Mountain Bottle Cage ⇒ Water Bottle) = 1,623 / 1,941 = 83.6%
conf(Water Bottle ⇒ Road Bottle Cage) = 1,513 / 4,076 = 37.1%
conf(Road Bottle Cage ⇒ Water Bottle) = 1,513 / 1,702 = 88.9%

and we can conclude that, for minconf = 60%, only Mountain Bottle Cage ⇒ Water Bottle and Road Bottle Cage ⇒ Water Bottle are strong association rules.
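These confidences can also be computed directly from the views. The query below is a sketch combining the pair counts of Question 3(c) with the per-model counts of Question 3(b); the CTE names are ours:

-- Confidence of both directions of each frequent pair.
WITH ModelCount AS (
    SELECT Model, COUNT(*) AS Cnt
    FROM dbo.vAssocSeqLineItems
    GROUP BY Model
),
PairCount AS (
    SELECT I.Model AS ModelA, J.Model AS ModelB, COUNT(*) AS Cnt
    FROM dbo.vAssocSeqLineItems I
    INNER JOIN dbo.vAssocSeqLineItems J
    ON I.OrderNumber = J.OrderNumber AND I.Model < J.Model
    GROUP BY I.Model, J.Model
    HAVING COUNT(*) > 1500
)
SELECT P.ModelA, P.ModelB,
       1.0 * P.Cnt / A.Cnt AS ConfAtoB,  -- conf(ModelA => ModelB)
       1.0 * P.Cnt / B.Cnt AS ConfBtoA   -- conf(ModelB => ModelA)
FROM PairCount P
INNER JOIN ModelCount A ON A.Model = P.ModelA
INNER JOIN ModelCount B ON B.Model = P.ModelB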

5. In SQL Server Data Tools, run the Microsoft Association algorithm you experimented with in the lab, but setting the minimum support to the value computed in Question 4 and the minimum confidence to 60%.

(a) (2 val.) Provide a screenshot of the Itemset pane containing the frequent itemsets discovered by the algorithm. Compare these with your results from Question 4.

As seen in Question 3, the frequent itemsets are:

Model                               Total
Sport-100                           6,171
Water Bottle                        4,076
Patch kit                           3,010
Mountain Tire Tube                  2,908
Mountain-200                        2,477
Road Tire Tube                      2,216
Cycling Cap                         2,095
Fender Set - Mountain               2,014
Mountain Bottle Cage                1,941
Road Bottle Cage                    1,702
Long-Sleeve Logo Jersey             1,642
Short-Sleeve Classic Jersey         1,537
Mountain Bottle Cage, Water Bottle  1,623
Road Bottle Cage, Water Bottle      1,513

This corresponds to the result obtained by the Microsoft Association algorithm:

[Screenshot: Itemset pane of the Microsoft Association viewer]

The only two 2-itemsets observed are precisely those appearing in the associations determined in Question 4, as expected.

(b) (2 val.) Provide a screenshot of the Rules pane containing the strong association rules discovered by the algorithm. Compare these with your results from Question 4.

As seen in Question 4, the only strong associations are:

Mountain Bottle Cage ⇒ Water Bottle [sup = 32.4%, C = 83.6%]
Road Bottle Cage ⇒ Water Bottle [sup = 30.2%, C = 88.9%]

This corresponds to the result obtained by the Microsoft Association algorithm:

[Screenshot: Rules pane of the Microsoft Association viewer]

(c) (2 val.) Indicate the dependency network computed by the algorithm and explain its meaning.

The dependency network portrayed by the Microsoft Association algorithm is:

[Screenshot: dependency network of the Microsoft Association viewer]

and indicates that the existence of either of the items Road Bottle Cage or Mountain Bottle Cage is a strong indicator of the presence of the item Water Bottle.

6. (2 val.) Note that, besides the confidence associated with each association rule, MS SQL Server also indicates the importance of the rule. Importance measures how useful a given rule is, and is computed as

importance(X ⇒ Y) = log( (sup(X ∪ Y) × sup(¬X)) / (sup(X) × sup(¬X ∪ Y)) ),

where sup(¬A) corresponds to the number of transactions that do not include item A (so sup(¬X ∪ Y) counts the transactions that include Y but not X), and the logarithm is base 10. In the data-mining literature, a quantity providing similar information is the lift, computed as

lift(X ⇒ Y) = sup%(X ∪ Y) / (sup%(X) × sup%(Y)).

Compute the importance and lift for the association rules mined. For this purpose, take into consideration the total number of transactions you computed in Question 3. Confirm the value of importance provided by the Microsoft Association algorithm. Indicate your calculations, and verify that the rules with larger lift are also ranked by the Microsoft Association algorithm as more important.

Computing the importance for the mined rules (writing RBC for Road Bottle Cage, MBC for Mountain Bottle Cage and WB for Water Bottle), we get:

importance(RBC ⇒ WB) = log( (1,513 × 19,553) / (1,702 × 2,563) ) = 0.831
importance(MBC ⇒ WB) = log( (1,623 × 19,314) / (1,941 × 2,453) ) = 0.818.

Computing now the lift, we get:

lift(RBC ⇒ WB) = (1,513 × 21,255) / (1,702 × 4,076) = 4.64
lift(MBC ⇒ WB) = (1,623 × 21,255) / (1,941 × 4,076) = 4.36,

which agrees with the importance results from the Microsoft Association algorithm: Road Bottle Cage ⇒ Water Bottle has both the larger lift and the larger importance.
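For completeness, both measures can be recomputed directly from the views. The sketch below does so for Road Bottle Cage ⇒ Water Bottle; the variable names are ours, and the use of a base-10 logarithm is an assumption consistent with the values above:

-- Counts needed for the lift and importance of RBC => WB.
DECLARE @N FLOAT = (SELECT COUNT(*) FROM dbo.vAssocSeqOrders);  -- 21,255
DECLARE @X FLOAT = (SELECT COUNT(*) FROM dbo.vAssocSeqLineItems
                    WHERE Model = 'Road Bottle Cage');           -- 1,702
DECLARE @Y FLOAT = (SELECT COUNT(*) FROM dbo.vAssocSeqLineItems
                    WHERE Model = 'Water Bottle');               -- 4,076
DECLARE @XY FLOAT = (SELECT COUNT(*)
                     FROM dbo.vAssocSeqLineItems I
                     INNER JOIN dbo.vAssocSeqLineItems J
                     ON I.OrderNumber = J.OrderNumber
                     WHERE I.Model = 'Road Bottle Cage'
                       AND J.Model = 'Water Bottle');            -- 1,513

SELECT (@XY * @N) / (@X * @Y)                        AS Lift,        -- 4.64
       LOG10((@XY * (@N - @X)) / (@X * (@Y - @XY)))  AS Importance;  -- 0.831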