Business Intelligence. Tutorial for Performing Market Basket Analysis (with ItemCount)

Business Intelligence
Professor Chen
NAME:
Due Date:

Tutorial for Performing Market Basket Analysis (with ItemCount)

1. To perform a Market Basket Analysis, begin by selecting Open Template from the main menu (or by clicking File -> Open Template), as shown in Fig 1.

[Fig 1]

2. In the Open Template window, scroll down to the bottom and choose Market Basket Analysis, as shown in Fig 2-a. Click Next (Fig 2-b), and then Finish. We will modify this skeleton process to suit our needs.

[Fig 2-a, Fig 2-b]

RapidMiner Market Basket Analysis, Page-1

3. The Main Process window now shows the skeleton process for the Market Basket Analysis, as shown in Fig 3-a. Our first modification is to replace the Retrieve operator (Fig 3-a) with a Read Excel operator. Either right-click the Retrieve operator, choose the Select Operator option, and then pick the data import Read Excel operator (Fig 3-b), or delete the Retrieve operator and then add the Read Excel operator (Fig 3-d). You may have to re-connect the out port of the Read Excel operator to the thr (for "through") port of the Define Item Count operator.

Use the import wizard in the Read Excel operator to locate and read the Market Basket file (file name: Basket (RM format).xls). After clicking Next, choose the correct sheet (Numerical Data), as shown in Fig 3-c, and make note of the column header labels in the file (customerid, itemid, itemcount). Then click Next twice (Fig 3-d, 3-e); otherwise, only one attribute is displayed. You do not need to identify a label for a selected attribute (as we did in the Life Insurance example). Then click Finish (Fig 3-f).

[Fig 3-a, Fig 3-b]

[Fig 3-c, Fig 3-d]

Please note that the original data set contains 76 types of items and 89 customers with different combinations of items purchased (details can be found in the Excel file). The itemcount values, however, were generated with Excel's random number function, =RANDBETWEEN(1,10).

[Fig 3-e, Fig 3-f]

4. The first operator to modify is the Aggregate operator seen in Fig 3-a. Click the operator, then click the Edit Lists (1) button under Parameters (Fig 4-a). This brings up a window, also shown in Fig 4-a.

a. In this window, change the marked box to match our column name: click the drop-down menu and select itemcount (Fig 4-b). Then click Apply, as shown in Fig 4-c.

[Fig 4-a]

[Fig 4-b, Fig 4-c]

b. Next, click the Select Attributes button in Fig 4-a, which brings up the window shown in Fig 4-d. Using the green arrows in this window, include the customerid and itemid attributes and exclude customeridattributename and itemidattributename. The result is shown in Fig 4-e. Then click Apply. The operators in the process area are shown in Fig 4-f. (Note that the status light on the Aggregate operator turns yellow.)
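As a rough illustration of what the Aggregate step computes, the following Python sketch totals itemcount for each (customerid, itemid) pair. The sample rows are invented for illustration and are not taken from Basket (RM format).xls. The choice of a sum as the aggregation function is an assumption, since the tutorial does not state which function the template applies.

```python
from collections import defaultdict

# Hypothetical rows in the layout described above: (customerid, itemid, itemcount).
rows = [
    (1, 37, 2),
    (1, 2, 5),
    (1, 37, 1),  # the same customer buys item 37 a second time
    (2, 2, 4),
]


def aggregate_item_counts(rows):
    """Total itemcount per (customerid, itemid) pair (assumed aggregation: sum)."""
    totals = defaultdict(int)
    for customerid, itemid, itemcount in rows:
        totals[(customerid, itemid)] += itemcount
    return dict(totals)


totals = aggregate_item_counts(rows)
for (customerid, itemid), total in sorted(totals.items()):
    print(customerid, itemid, total)
```

Each (customerid, itemid) pair ends up with exactly one row, which is the shape the later Pivot step needs.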

[Fig 4-d, Fig 4-e, Fig 4-f]

5. The Pivot and Set Role operators seen in Fig 3-a are modified in the same way as the Aggregate operator. Fig 5-a (i) and Fig 5-a (ii) show the parameters of the Pivot operator. Set the group attribute and index attribute parameters to the values we have been using (customerid and itemid, respectively). Fig 5-b shows the parameters of the Set Role operator. Set the name parameter to the value we have been using (customerid in this case). Click on the process area; the lights on all operators should now be yellow (Fig 5-c). If they are not, try re-connecting the operators.

[Fig 5-a (i), Fig 5-a (ii), Fig 5-b (i), Fig 5-b (ii), Fig 5-c]
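The effect of pivoting with customerid as the group attribute and itemid as the index attribute can be sketched in Python as follows. The input totals are hypothetical, and the helper names pivot_to_baskets and to_binominal are made up for this illustration; they are not RapidMiner operators. The second helper shows one plausible way to obtain the true/false (binominal) values that FP Growth expects, assuming a count above zero means the item is in the basket.

```python
# Hypothetical aggregated totals keyed by (customerid, itemid); values are item counts.
totals = {(1, 37): 3, (1, 2): 5, (2, 2): 4, (3, 70): 1}


def pivot_to_baskets(totals):
    """Return (customer ids, item ids, matrix) with one row per customer, one column per item."""
    customers = sorted({customerid for customerid, _ in totals})
    items = sorted({itemid for _, itemid in totals})
    # A missing (customer, item) pair means the item was not bought: count 0.
    matrix = [[totals.get((c, i), 0) for i in items] for c in customers]
    return customers, items, matrix


def to_binominal(matrix):
    """Replace counts with True/False: True when the item appears in the basket (assumed rule)."""
    return [[count > 0 for count in row] for row in matrix]


customers, items, matrix = pivot_to_baskets(totals)
baskets = to_binominal(matrix)
print(items)
for customerid, row in zip(customers, baskets):
    print(customerid, row)
```

Keeping the customer ids in their own list, separate from the item columns, mirrors the idea that customerid identifies a basket rather than being an item itself.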

6. We can save this process as a new template by going to File -> Save as Template (Fig 6-a, 6-b). After giving it a name, click Save. Whenever we want to perform a Market Basket Analysis, we can load this template without having to repeat these changes.

[Fig 6-a, Fig 6-b]

7. Before you run the market basket analysis, it is important to understand the minimum support parameter of the FP Growth (Frequent Pattern-Growth) operator, because RapidMiner will find only those itemsets that exceed this minimum support value. The FP Growth operator is a RapidMiner core operator that efficiently calculates all frequent itemsets from the given ExampleSet using the FP-tree data structure. All attributes of the input ExampleSet must be binominal.

a. Click the FP Growth operator (Fig 7-a) and change the minimum support value from 0.95 to 0.0001 (Fig 7-b); otherwise, no association rules will be produced. Leave the other values unchanged.

[Fig 7-a, Fig 7-b, Fig 7-c]

b. Next, check the minimum confidence value and decide whether it needs to change. Click the Create Association Rules operator (Fig 7-c). The minimum confidence is 0.8, and we will keep this value.

c. With these parameters set, we are ready to run the market basket analysis. Click the Run button. You may be asked to provide a file name to save the example (model and process) in the local repository (Fig 7-d). Enter a name for your example (e.g., Market Basket examplembus673) and click OK. The process may take a few seconds to complete, after which RapidMiner switches from the Design Perspective to the Results Perspective. Click AssociationRules; the default Table View output is shown in Fig 7-e. You may also explore other views such as Text View or Graph View (Fig 7-f, 7-g).
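To see why a minimum support of 0.95 leaves nothing to build rules from, consider the following Python sketch. It enumerates itemsets by brute-force counting, which is not the FP-tree method that FP Growth uses, and it runs on a handful of invented baskets rather than the tutorial's data. The function name frequent_itemsets and the "at least" comparison are assumptions made for the illustration.

```python
from itertools import combinations

# Hypothetical baskets (sets of item ids); not the tutorial's data.
baskets = [
    {2, 37},
    {2, 39, 70},
    {2, 37, 39},
    {39, 70},
    {2},
]


def frequent_itemsets(baskets, min_support, max_items=2):
    """Return {itemset: support} for itemsets of up to max_items items.

    Brute-force counting for illustration only; FP Growth avoids this
    enumeration by using an FP-tree. An itemset is kept when its support
    is at least min_support (assumed comparison).
    """
    all_items = sorted(set().union(*baskets))
    result = {}
    for size in range(1, max_items + 1):
        for candidate in combinations(all_items, size):
            count = sum(1 for basket in baskets if set(candidate) <= basket)
            support = count / len(baskets)
            if support >= min_support:
                result[candidate] = support
    return result


high = frequent_itemsets(baskets, min_support=0.95)
low = frequent_itemsets(baskets, min_support=0.0001)
print(len(high), "itemsets at 0.95;", len(low), "itemsets at 0.0001")
```

In this toy data even the most common item appears in only 4 of 5 baskets, so no itemset reaches 0.95, while the low threshold keeps every itemset that occurs at least once.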

[Fig 7-d]

Summary: When the process is run, the results show the association rules in the form Premises => Conclusion, as shown in Fig 7-e. Your results will depend on the minimum support chosen in the FP Growth operator and the minimum confidence chosen in the Create Association Rules operator. For this example, a minimum support of 0.0001 and a minimum confidence of 0.8 were used.

Row no. 1 can be read as: the purchase of item 37 implies the purchase of item 2. These items are bought together in 4.5% of the transactions, and in 80% of the transactions in which item 37 is bought, item 2 is bought as well.

Row no. 33 can be interpreted as: the purchase of item 2 implies the purchase of items 70 and 39. These three items are bought together in 10% of the transactions, and in 20% of the transactions in which item 2 is bought, items 70 and 39 are bought as well.

Note: Generally, with large amounts of data, the lower the minimum confidence is, the longer the program will take to process.

[Fig 7-e]

[Fig 7-f, Fig 7-g]
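The reading of row no. 1 in the Summary can be checked with simple arithmetic. The sketch below uses basket counts that are assumptions: they were chosen only because they reproduce the quoted 4.5% and 80% for a data set of 89 customers, and the tutorial itself does not list the underlying counts.

```python
# Assumed counts, chosen to be consistent with the figures quoted in the Summary;
# the tutorial does not state them.
n_customers = 89        # number of customers (baskets) given in the tutorial
n_with_37 = 5           # assumed: baskets that contain item 37
n_with_37_and_2 = 4     # assumed: baskets that contain both item 37 and item 2

# Support: share of all baskets that contain both items.
support = n_with_37_and_2 / n_customers
# Confidence: share of the item-37 baskets that also contain item 2.
confidence = n_with_37_and_2 / n_with_37

print(f"support = {support:.3f}")
print(f"confidence = {confidence:.2f}")
```

Under these assumed counts, support is computed against all 89 baskets, while confidence is computed only against the baskets that contain the premise.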

Information on RapidMiner Core Operators:

1. FP-Growth (Frequent Pattern-Growth)

Synopsis
The FP Growth operator is a RapidMiner core operator that efficiently calculates all frequent itemsets from the given ExampleSet using the FP-tree data structure. All attributes of the input ExampleSet must be binominal. RapidMiner will find only those itemsets that exceed the minimum support value.

Description
In simple words, frequent itemsets are groups of items that often appear together in the data. It is important to know the basics of market-basket analysis in order to understand frequent itemsets.

The market-basket model of data is used to describe a common form of a many-to-many relationship between two kinds of objects. On the one hand, we have items, and on the other we have baskets, also called 'transactions'. The set of items is usually represented as a set of attributes; mostly these attributes are binominal. Each transaction is usually represented as an example of the ExampleSet. When an attribute value is 'true' in an example, the corresponding item is present in that transaction. Each transaction consists of a set of items (an itemset). Usually it is assumed that the number of items in a transaction is small, much smaller than the total number of items, i.e. in most of the examples most of the attribute values are 'false'. The number of transactions is usually assumed to be very large, i.e. the number of examples in the ExampleSet is assumed to be large.

The frequent-itemsets problem is that of finding sets of items that appear together in at least a threshold ratio of transactions. This threshold is defined by the 'minimum support' criterion. The support of an itemset is the number of times that itemset appears in the ExampleSet divided by the total number of examples.
The 'Transactions' data set at "Samples/data/Transactions" in the RapidMiner repository is an example of what transaction data usually looks like.

Parameters

find min number of itemsets
If this parameter is set to true, the operator finds at least the number of itemsets specified in the min number of itemsets parameter, taking those with the highest support. The min support parameter is ignored to some extent in this case: the minimal support is decreased automatically until the specified minimum number of frequent itemsets is found. The defined minimal support is lowered by 20 percent each time. Range: boolean

min number of itemsets
This parameter is only available when the find min number of itemsets parameter is set to true. It specifies the minimum number of itemsets that should be mined. Range: integer

max number of retries
This parameter is only available when the find min number of itemsets parameter is set to true. It determines how many times the operator should lower the minimal support to find the minimal number of itemsets. Each time, the minimal support is lowered by 20 percent. Range: integer

positive value
This parameter determines which value of the binominal attributes should be treated as positive. The attributes with that value are considered part of a transaction. If left blank, the ExampleSet determines which value is used. Range: string

min support
The minimum support criterion is specified by this parameter. Please study the description of this operator for more information about minimum support. Range: real

max items
This parameter specifies the upper bound for the length of the itemsets, i.e. the maximum number of items in an itemset. If set to -1, this parameter imposes no upper bound. Range: integer

must contain
This parameter specifies the items that should be part of frequent itemsets. It is specified through a regular expression. If there is no specific item that you want to have in the frequent itemsets, you can leave this blank. Range:

2. Create Association Rules

Synopsis
This operator generates a set of association rules from the given set of frequent itemsets.

Description
Association rules are if/then statements that help uncover relationships between seemingly unrelated data. An example of an association rule would be "If a customer buys eggs, he is 80% likely to also purchase milk." An association rule has two parts, an antecedent (if) and a consequent (then). An antecedent is an item (or itemset) found in the data. A consequent is an item (or itemset) that is found in combination with the antecedent.

Association rules are created by analyzing data for frequent if/then patterns and using the criteria support and confidence to identify the most important relationships. Support is an indication of how frequently the items appear in the database. Confidence indicates how often the if/then statements have been found to be true. The frequent if/then patterns are mined using operators like the FP-Growth operator. The Create Association Rules operator takes these frequent itemsets and generates association rules.

Such information can be used as the basis for decisions about marketing activities such as promotional pricing or product placements. In addition to the above example from market basket analysis, association rules are employed today in many application areas, including Web usage mining, intrusion detection and bioinformatics.

Parameters

criterion
This parameter specifies the criterion which is used for the selection of rules.
confidence: The confidence of a rule is defined as $\mathrm{conf}(X \Rightarrow Y) = \mathrm{supp}(X \cup Y) / \mathrm{supp}(X)$. Be careful when reading the expression: here $\mathrm{supp}(X \cup Y)$ means "support for occurrences of transactions where X and Y both appear", not "support for occurrences of transactions where either X or Y appears". Confidence ranges from 0 to 1. Confidence is an estimate of $\Pr(Y \mid X)$, the probability of observing Y given X. The support $\mathrm{supp}(X)$ of an itemset X is defined as the proportion of transactions in the data set which contain the itemset.

lift: The lift of a rule is defined as $\mathrm{lift}(X \Rightarrow Y) = \mathrm{supp}(X \cup Y) / (\mathrm{supp}(X) \times \mathrm{supp}(Y))$, i.e. the ratio of the observed support to that expected if X and Y were independent. Lift can also be defined as $\mathrm{lift}(X \Rightarrow Y) = \mathrm{conf}(X \Rightarrow Y) / \mathrm{supp}(Y)$. Lift measures how far X and Y are from independence. It ranges from 0 to positive infinity. Values close to 1 imply that X and Y are independent and the rule is not interesting.

conviction: Conviction is sensitive to rule direction, i.e. $\mathrm{conv}(X \Rightarrow Y)$ is not the same as $\mathrm{conv}(Y \Rightarrow X)$. Conviction is somewhat inspired by the logical definition of implication and attempts to measure the degree of implication of a rule. Conviction is defined as $\mathrm{conv}(X \Rightarrow Y) = (1 - \mathrm{supp}(Y)) / (1 - \mathrm{conf}(X \Rightarrow Y))$.

gain: When this option is selected, the gain is calculated using the gain theta parameter.

laplace: When this option is selected, the Laplace is calculated using the laplace k parameter.

ps: When this option is selected, the ps criterion is used for rule selection.
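The confidence, lift and conviction definitions above can be tried out on a small example. The following Python sketch is a minimal illustration on invented baskets; the helper names supp and rule_metrics are hypothetical and are not part of RapidMiner. It treats conviction as infinite when confidence equals 1, which is a choice made here to avoid division by zero.

```python
# Hypothetical baskets (sets of item names); not the tutorial's data.
baskets = [
    {"eggs", "milk"},
    {"eggs", "milk", "bread"},
    {"eggs", "bread"},
    {"milk"},
    {"bread"},
]


def supp(itemset, baskets):
    """Proportion of baskets that contain every item of the itemset."""
    return sum(1 for basket in baskets if itemset <= basket) / len(baskets)


def rule_metrics(x, y, baskets):
    """Confidence, lift and conviction of the rule X => Y, following the definitions above."""
    supp_x = supp(x, baskets)
    supp_y = supp(y, baskets)
    supp_xy = supp(x | y, baskets)  # baskets where X and Y both appear
    if supp_x == 0:
        raise ValueError("the antecedent never occurs, so confidence is undefined")
    confidence = supp_xy / supp_x
    lift = supp_xy / (supp_x * supp_y)
    # Assumed convention: conviction is infinite when the rule never fails.
    conviction = float("inf") if confidence == 1 else (1 - supp_y) / (1 - confidence)
    return {"confidence": confidence, "lift": lift, "conviction": conviction}


metrics = rule_metrics({"eggs"}, {"milk"}, baskets)
for name, value in metrics.items():
    print(f"{name} = {value:.3f}")
```

Here the lift is only slightly above 1, so in this toy data eggs and milk are close to independent even though two thirds of the egg baskets also contain milk.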