DATA MINING II - 1DL460


Uppsala University, Department of Information Technology
Kjell Orsborn

Assignment 2 - Implementation of an algorithm for frequent itemset and association rule mining

1 Algorithms for frequent itemset and association rule mining

The purpose of this assignment is to study and implement one algorithm of your choice for frequent itemset and association rule mining [AIS93, AS94] in Amos II. You will choose among a small set of optional algorithms suitable for implementation in a database management system environment.

The classical Apriori method [AIS93, AS94] uses the original horizontal representation of transactions. It uses a breadth-first search strategy to count the support of itemsets, and a candidate generation function that exploits the downward closure property of support. Apriori is one of the most important data mining algorithms for mining frequent itemsets and associations. It opened new doors and created new modalities for mining data, and since its inception many researchers have improved and optimized the Apriori algorithm and presented new Apriori-like algorithms.

One variation of Apriori is the Direct Hashing and Pruning (DHP) algorithm of Park et al. [PCY95], which uses a hash technique to generate candidate itemsets very efficiently, in particular large two-itemsets. DHP thereby greatly improves the performance bottleneck of the whole process. In addition, DHP employs effective pruning techniques to progressively reduce the transaction database size.

In many cases, the Apriori algorithm significantly reduces the size of candidate sets using the Apriori principle. However, it can suffer from two nontrivial costs: (1) generating a huge number of candidate sets, and (2) repeatedly scanning the database and checking the candidates by pattern matching. Han et al. [HPY00] developed the FP-growth method, which mines the complete set of frequent itemsets without candidate generation. FP-growth works in a divide-and-conquer way: the first scan of the database derives a list of frequent items, in which items are ordered in frequency-descending order, and the database is then compressed into a frequent-pattern tree, or FP-tree, that retains the itemset association information.

An alternative to the FP-growth algorithm is the Tree Projection algorithm [AAP01] for generation of frequent itemsets. This algorithm can apply different strategies for generating and traversing a lexicographic tree, including breadth-first search, depth-first search, and a combination of both. The innovation brought by this algorithm is the use of a lexicographic tree, which requires substantially less memory than a hash tree. The support of the frequent itemsets is counted by projecting the transactions onto the nodes of this tree, which improves the performance of counting the number of transactions that contain frequent itemsets. The original formulation traverses the lexicographic tree in a top-down fashion.

Both the Apriori and FP-growth methods mine frequent patterns from a set of transactions in horizontal data format (i.e., {TID: itemset}), where TID is a transaction id and itemset is the set of items bought in transaction TID. Alternatively, mining can also be performed with data presented in vertical data format (i.e., {item: TID set}). Zaki [ZPO97, Z00, ZG03] proposed the Equivalence CLASS Transformation (Eclat) algorithm, which explores the vertical data format. It was the first algorithm to use a vertical (inverted) data layout. Eclat is very efficient for large itemsets but less efficient for small ones.
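To make the vertical layout concrete: under {item: TID set}, the support of any itemset is the size of the intersection of its items' TID sets. The following is a minimal Eclat-style sketch in plain Python, given purely as an illustration of the idea (it is not the Amos II implementation the assignment asks for, and the function name is our own):

```python
from collections import defaultdict

def eclat(transactions, minsup):
    """Mine frequent itemsets from a list of transactions (sets of items)
    using the vertical {item: TID set} layout."""
    # Build the vertical layout: map each item to the set of transaction
    # ids (TIDs) that contain it.
    tidsets = defaultdict(set)
    for tid, transaction in enumerate(transactions):
        for item in transaction:
            tidsets[item].add(tid)

    frequent = {}  # maps itemset (sorted tuple) -> support count

    def extend(prefix, prefix_tids, candidates):
        for i, (item, tids) in enumerate(candidates):
            # Support of prefix + item is the size of the TID-set
            # intersection; no rescan of the database is needed.
            new_tids = prefix_tids & tids if prefix else tids
            if len(new_tids) >= minsup:
                itemset = prefix + (item,)
                frequent[itemset] = len(new_tids)
                # Only later items are candidate extensions; by the Apriori
                # property, extensions of infrequent sets are skipped.
                extend(itemset, new_tids, candidates[i + 1:])

    extend((), set(), sorted(tidsets.items()))
    return frequent
```

For example, on the five transactions {1,3,4}, {1,2,3,5}, {2,3,5}, {2,5}, {1,2,3,6} with minimum support 2, the sketch finds {2,3,5} with support 2, since the TID sets of items 2, 3, and 5 share exactly two transactions.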
Besides taking advantage of the Apriori property in the generation of candidate (k + 1)-itemsets from frequent k-itemsets, another merit of this method is that there is no need to scan the database to find the support of (k + 1)-itemsets (for k >= 1). This is because the TID set of each k-itemset carries the complete information required for counting such support. Improved versions are presented in [Z00, ZG03], the latter introducing the diffset technique.

Another work that mines frequent itemsets using a vertical data format is presented by Holsheimer et al. [HKMT95]. This work demonstrated that, though impressive results have been achieved for some data mining problems using highly specialized and clever data structures, one could also explore the potential of solving data mining problems using general-purpose database management systems, with quite good results.

In the Data Mining I class, you used an implemented version of the PROjection Pattern Discovery (PROPAD) algorithm, a projection-based method relying on a vertical representation of the transaction database [SSG04]. It was implemented within the AmosMiner application using the Amos II database management system. The PROPAD algorithm applies a frequent-pattern growth approach that only needs one scan of the database to generate a transformed transaction table. It avoids complex joins between candidate itemset tables and transaction tables; instead, these are replaced by simple joins between smaller projected transaction tables and frequent itemset tables. The PROPAD authors have also developed an SQL-based frequent pattern mining algorithm with FP-growth [SSG05]. This algorithm implements an SQL-database version of an FP-growth-like approach where the FP-tree is represented using a relational table. It shows better performance than Apriori on large data sets or large patterns.

2 Preparation

We suggest that you read about association analysis in Chapter 6 of Tan et al. [Tan06]. You should also read the background material for your specific algorithm. You can find and download AmosMiner and the Amos II system from the assignments home page. There will also be an introductory scheduled session, which is not mandatory but advisable to attend.

3 Assignment

You should develop an association analysis algorithm of your choice within the AmosMiner application. We suggest that you choose one of Apriori, Eclat, Tree Projection, or possibly SQL FP-growth. In the AmosMiner catalog, you will find the a3.osql script file that executes the PROPAD projection-based method [SSG04]. This was the algorithm applied in the Data Mining I class. You can study the implementation of this algorithm in this script file together with the associationrulemining.osql script file that you find in the MiningFunctions catalog. You can experiment with the different parameters used in the algorithm to see how they influence the results of the analysis. Section 4 describes how you should report the assignment and how the examination will be carried out.

Thus, in your assignment you should do the following:

1. START-UP

Once you have installed the AmosMiner application, you should have the script file and data files for assignment 2 available (if some file is missing in your AmosMiner directory, you should download and install AmosMiner again). Create a script file for this assignment, your_script_file_name.osql. This file can be loaded and executed in AmosMiner by the following command:

< your_script_file_name.osql ;

In this assignment you will use a transactions data set that is synthetically generated using the QUEST data generation tool [IBM96] to provide data with controlled properties. The transactions data to be used is found in the data file transactions1000.nt. The data consists of text that typically looks as follows:

1 3 4
1 2 3 5
2 3 5
2 5
1 2 3 6

In the file, blanks separate items (identified by integers) and newlines separate transactions. For example, the above file contains information about a total of 5 transactions, and its second transaction consists of 4 items. To import this data, you can use the function read_ntuples(), documented in section 2.8.1 of the Amos user manual.
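As a side note, the file format is simple enough to parse in a few lines outside Amos II as well. The sketch below is a plain-Python illustration only (the function name is our own; in the assignment itself you would use read_ntuples() as described above):

```python
def read_transactions(path):
    """Parse a transactions file: items are integers separated by blanks,
    transactions are separated by newlines."""
    transactions = []
    with open(path) as f:
        for line in f:
            items = [int(tok) for tok in line.split()]
            if items:  # skip any blank lines
                transactions.append(items)
    return transactions
```

Applied to the five-line example above, this yields five transactions, the second of which is [1, 2, 3, 5] with 4 items.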

The result from executing an association analysis would typically look as follows:

({131,207,443,489},{104},0.975,0.156)
({207,443,489},{104},0.9765,0.166)
({131,207,443},{104},0.9765,0.166)
({131,443,489},{104},0.976,0.163)
({131,207,489},{104},0.9702,0.163)
({207,443},{104},0.9778,0.176)
({131,443},{104},0.9777,0.175)
({131,207},{104},0.9722,0.175)
({207,489},{104},0.9721,0.174)
({443,489},{104},0.9721,0.174)
({131,489},{104},0.9714,0.17)
({131},{104},0.9735,0.184)

It is now your turn to implement your algorithm for frequent itemset and association rule mining by developing your own script file. As mentioned above, you will probably find it helpful to study the PROPAD algorithm used in the association analysis application of assignment 3 in the Data Mining I course: http://www.it.uu.se/edu/course/homepage/infoutv/ht10/dm1-ht2010-assignments.html. You are referred to the Amos II manual for further reading, available on the lab course home page, and to the tutorial slides. You might also find the following sections in the manual useful: Collections in 2.6, count in 2.6.1, and groupby in 2.7.2.

4 Examination

At the examination, your script file will be executed in AmosMiner using:

< your_script_file_name.osql ;
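When preparing for the examination, it can help to sanity-check the confidence and support values your script prints. The result tuples shown earlier have the form (antecedent, consequent, confidence, support). The following plain-Python sketch shows one way to derive such tuples once frequent-itemset support counts are known; it is a reference computation for checking your numbers, not part of the required AmosQL solution, and the names are our own:

```python
from itertools import combinations

def generate_rules(supports, n_transactions, minconf):
    """Derive rules (antecedent, consequent, confidence, support) from a
    dict mapping frequent itemsets (sorted tuples) to support counts."""
    rules = []
    for itemset, count in supports.items():
        if len(itemset) < 2:
            continue
        support = count / n_transactions
        # Try every non-empty proper subset of the itemset as antecedent.
        for r in range(1, len(itemset)):
            for antecedent in combinations(itemset, r):
                consequent = tuple(x for x in itemset if x not in antecedent)
                # confidence = support(itemset) / support(antecedent);
                # all subsets are frequent by the Apriori property, so the
                # antecedent's count is available in the dict.
                confidence = count / supports[antecedent]
                if confidence >= minconf:
                    rules.append((antecedent, consequent, confidence, support))
    return rules
```

For instance, with the five-transaction example from Section 3, the itemset {3,5} occurring twice and {2,3,5} also occurring twice give the rule {3,5} => {2} with confidence 1.0 and support 0.4.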

We expect you to make a brief presentation of your implementation. When the analysis has executed correctly, you should present the outcome of your analysis, including the choice of your parameters minsupp, minconf, etc. Furthermore, you should be prepared to answer questions about your implementation.

References

[Tan06] Tan, P.-N., Steinbach, M. and Kumar, V.: Introduction to Data Mining, Addison-Wesley, 2006.

[AIS93] R. Agrawal, T. Imielinski, and A. Swami, Mining Associations between Sets of Items in Large Databases. In Proceedings of the ACM SIGMOD International Conference on the Management of Data, pp. 207-216, May 1993 (the paper is available on the lab course home page).

[AS94] R. Agrawal and R. Srikant, Fast Algorithms for Mining Association Rules. In Proceedings of the 20th International Conference on Very Large Databases, pp. 487-499, September 1994 (the paper is available on the lab course home page).

[PCY95] Park, J. S., Chen, M. S., Yu, P. S. (1995): An effective hash-based algorithm for mining association rules. In: Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data (SIGMOD'95), San Jose, CA, pp. 175-186.

[HPY00] Han, J., Pei, J., Yin, Y. (2000): Mining frequent patterns without candidate generation. In: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (SIGMOD'00), Dallas, TX, pp. 1-12.

[AAP01] Agarwal, R., Aggarwal, C. C., Prasad, V. V. V. (2001): A tree projection algorithm for generation of frequent itemsets. Journal of Parallel and Distributed Computing 61:350-371.

[ZPO97] Mohammed J. Zaki, Srinivasan Parthasarathy, Mitsunori Ogihara and Wei Li, New Algorithms for Fast Discovery of Association Rules. In 3rd International Conference on Knowledge Discovery and Data Mining (KDD), Aug 1997.

[Z00] Mohammed J. Zaki, Scalable Algorithms for Association Mining. IEEE Transactions on Knowledge and Data Engineering, 12(3):372-390, May/Jun 2000.

[ZG03] Mohammed J. Zaki and Karam Gouda, Fast Vertical Mining Using Diffsets. In 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Aug 2003.

[HKMT95] Holsheimer, M., Kersten, M., Mannila, H., Toivonen, H. (1995): A perspective on databases and data mining. In: Proceedings of the 1995 International Conference on Knowledge Discovery and Data Mining (KDD'95), Montreal, Canada, pp. 150-155.

[SSG04] X. Shang, K.-U. Sattler, and I. Geist, Efficient Frequent Pattern Mining in Relational Databases. 5. Workshop des GI-Arbeitskreises Knowledge Discovery (AK KD) im Rahmen der LWA 2004 (the paper is available on the lab course home page).

[SSG05] Xuequn Shang, Kai-Uwe Sattler, Ingolf Geist: SQL Based Frequent Pattern Mining with FP-Growth, Lecture Notes in Computer Science, Vol. 3392 (January 2005), pp. 32-46.

[IBM96] http://www.almaden.ibm.com/cs/projects/iis/hdb/Projects/data_mining/mining.shtml