Uppsala University
Department of Information Technology
Kjell Orsborn

DATA MINING II - 1DL460
Assignment 2 - Implementation of an algorithm for frequent itemset and association rule mining

1 Algorithms for frequent itemset and association rule mining

The purpose of this assignment is to study and implement one algorithm of your choice for frequent itemset and association rule mining [AIS93, AS94] in Amos II. You will choose among a small set of optional algorithms suitable for implementation in a database management system environment.

The classical Apriori method [AIS93, AS94] uses the original horizontal representation of transactions. It uses a breadth-first search strategy to count the support of itemsets, and a candidate generation function that exploits the downward closure property of support. The Apriori algorithm is by far the most influential data mining algorithm for mining frequent itemsets and associations. It opened new doors and created new ways to mine data. Since its inception, many researchers have improved and optimized the Apriori algorithm and presented new Apriori-like algorithms.

One variation of Apriori is the Direct Hashing and Pruning (DHP) algorithm by Park et al. [PCY95], which uses a hashing technique to generate candidate itemsets very efficiently, in particular large two-itemsets. DHP thereby greatly relieves the performance bottleneck of the whole process. In addition, DHP employs effective pruning techniques to progressively reduce the size of the transaction database.

In many cases, the Apriori algorithm significantly reduces the size of candidate sets using the Apriori principle. However, it can suffer from
two nontrivial costs: (1) generating a huge number of candidate sets, and (2) repeatedly scanning the database and checking the candidates by pattern matching.

Han et al. [HPY00] developed the FP-growth method, which mines the complete set of frequent itemsets without candidate generation. FP-growth works in a divide-and-conquer way: the first scan of the database derives a list of frequent items, ordered in frequency-descending order. The database is then compressed into a frequent-pattern tree, or FP-tree, that retains the itemset association information.

An alternative to the FP-growth algorithm is the Tree Projection algorithm [AAP01] for generation of frequent itemsets. This algorithm can apply different strategies for generating and traversing a lexicographic tree, including breadth-first search, depth-first search, and a combination of both. The innovation brought by this algorithm is the use of a lexicographic tree, which requires substantially less memory than a hash tree. The support of the frequent itemsets is counted by projecting the transactions onto the nodes of this tree, which improves the performance of counting the transactions that contain frequent itemsets. The original formulation traverses the lexicographic tree in a top-down fashion.

Both the Apriori and FP-growth methods mine frequent patterns from a set of transactions in horizontal data format (i.e., {TID: itemset}), where TID is a transaction id and itemset is the set of items bought in transaction TID. Alternatively, mining can also be performed on data presented in vertical data format (i.e., {item: TID set}). Zaki [ZPO97, Z00, ZG03] proposed the Equivalence CLAss Transformation (ECLAT) algorithm, which explores the vertical data format. It was the first algorithm to use a vertical (inverted) data layout. ECLAT is very efficient for large itemsets but less efficient for small ones.
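To make the vertical layout concrete, here is a minimal, hypothetical sketch of the ECLAT idea in Python (the assignment itself is to be implemented in AmosQL, so this is illustration only; all names and the toy data are invented). The key point is that the support of an extended itemset is the size of the intersection of TID sets, so the transaction database is never rescanned.

```python
# Minimal sketch of the vertical (TID-set) idea behind ECLAT.
# All names and the toy data below are hypothetical illustration,
# not part of the AmosMiner assignment code.

def eclat(prefix, items, min_support, out):
    """Depth-first search over equivalence classes.

    `items` is a list of (item, tidset) pairs in vertical layout; the
    support of an extended itemset is the size of the intersected TID
    set, so the transaction database is never rescanned."""
    while items:
        item, tids = items.pop()
        if len(tids) >= min_support:
            out.append((prefix + [item], len(tids)))
            # Intersect TID sets with the remaining items of the class.
            suffix = [(other, tids & otids)
                      for other, otids in items
                      if len(tids & otids) >= min_support]
            eclat(prefix + [item], suffix, min_support, out)

# Convert horizontal transactions to the vertical {item: TID set} format.
transactions = [{1, 3, 4}, {1, 2, 3, 5}, {2, 3, 5}, {2, 5}, {1, 2, 3, 6}]
vertical = {}
for tid, itemset in enumerate(transactions):
    for item in itemset:
        vertical.setdefault(item, set()).add(tid)

frequent = []
eclat([], sorted(vertical.items()), min_support=2, out=frequent)
```

For instance, {2, 3, 5} is found with support 2 purely by intersecting the TID sets of items 2, 3 and 5; in an AmosQL or SQL setting, such intersections would naturally be expressed as joins.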
Besides taking advantage of the Apriori property in the generation of candidate (k + 1)-itemsets from frequent k-itemsets, another merit of this method is that there is no need to scan the database to find the support of (k + 1)-itemsets (for k ≥ 1), because the TID set of each k-itemset carries the complete information required for counting such support. In [ZPO97, Z00, ZG03], an improved version based on diffsets is presented.

Another work that mines frequent itemsets using a vertical data format is presented in Holsheimer et al. [HKMT95]. This work demonstrated that, although impressive results have been achieved for some data mining problems using highly specialized and clever data structures, one
could also explore the potential of solving data mining problems using general-purpose database management systems, with quite good results.

In the Data Mining I class, you used an implemented version of the PROjection PAttern Discovery (PROPAD) algorithm, a projection-based method relying on a vertical representation of the transaction database [SSG04]. It was implemented within the AmosMiner application using the Amos II database management system. The PROPAD algorithm applies a frequent-pattern growth approach that only needs one scan of the database to generate a transformed transaction table. It avoids complex joins between candidate itemset tables and transaction tables; instead, these are replaced by simple joins between smaller projected transaction tables and frequent itemset tables.

The PROPAD authors have also developed an SQL-based frequent pattern mining algorithm [SSG05] based on FP-growth. This algorithm implements an SQL-database version of an FP-growth-like approach in which the FP-tree is represented as a relational table. It shows better performance than Apriori on large data sets or large patterns.

2 Preparation

We suggest that you read about association analysis in Chapter 6 of Tan et al. [Tan06]. You should also read the background material for your specific algorithm. You can find and download the AmosMiner and Amos II systems from the assignments home page. There will also be an introductory scheduled session, which is not mandatory but advisable to attend.

3 Assignment

You should develop an association analysis algorithm of your choice within the AmosMiner application. We suggest that you choose one of Apriori, ECLAT, Tree Projection, or possibly SQL-based FP-growth. In the AmosMiner catalog, you will find the a3.osql script file that
executes the PROPAD projection-based method [SSG04]. This was the algorithm applied in the Data Mining I class. You can study the implementation of this algorithm in this script file together with the associationrulemining.osql script file found in the MiningFunctions catalog. You can experiment with the different parameters used in the algorithm to see how they influence the results of the analysis. Section 4 describes how you should report the assignment and how the examination will be carried out.

Thus, in your assignment you should do the following:

1. START-UP

Once you have installed the AmosMiner application, you should have the script file and data files for assignment 2 available (if some file is missing in your AmosMiner directory, you should download and install AmosMiner again). Create a script file for this assignment, your_script_file_name.osql. This file can be loaded into and executed by AmosMiner with the following command:

< your_script_file_name.osql ;

In this assignment you will use a transaction data set that is synthetically generated with the QUEST data generation tool [IBM96] to provide data with controlled properties. The transaction data to be used is found in the data file transactions1000.nt. The data consists of text that typically looks as follows:

1 3 4
1 2 3 5
2 3 5
2 5
1 2 3 6

In the file, blanks separate items (identified by integers) and newlines separate transactions. For example, the file above contains information about a total of 5 transactions, and its second transaction consists of 4 items. To import this data, you can use the function read_ntuples(), documented in section 2.8.1 of the Amos user manual.
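For readers unfamiliar with the format, the following Python sketch shows how such a file maps to a list of transactions (this is illustration only; in the actual assignment you import the file with Amos II's read_ntuples() rather than with Python):

```python
# Hypothetical sketch of the transactions1000.nt text format; in the
# actual assignment the file is imported with Amos II's read_ntuples().

sample = """1 3 4
1 2 3 5
2 3 5
2 5
1 2 3 6
"""

# Blanks separate integer item identifiers; newlines separate transactions.
transactions = [[int(item) for item in line.split()]
                for line in sample.splitlines() if line.strip()]
```

This yields 5 transactions, the second of which ([1, 2, 3, 5]) consists of 4 items, matching the description above.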
The result from executing an association analysis would typically look as follows:

({131,207,443,489},{104},0.975,0.156)
({207,443,489},{104},0.9765,0.166)
({131,207,443},{104},0.9765,0.166)
({131,443,489},{104},0.976,0.163)
({131,207,489},{104},0.9702,0.163)
({207,443},{104},0.9778,0.176)
({131,443},{104},0.9777,0.175)
({131,207},{104},0.9722,0.175)
({207,489},{104},0.9721,0.174)
({443,489},{104},0.9721,0.174)
({131,489},{104},0.9714,0.17)
({131},{104},0.9735,0.184)

It is now your turn to implement your algorithm for frequent itemset and association rule mining by developing your own script file. As mentioned above, you will probably find it helpful to study the PROPAD algorithm used in the association analysis application of assignment 3 in the Data Mining I course, http://www.it.uu.se/edu/course/homepage/infoutv/ht10/dm1-ht2010-assignments.html. You are referred to the Amos II manual, available on the lab course home page, for further reading, and we recommend that you look at the tutorial slides. You might also find the following sections in the manual useful: Collections in 2.6, count in 2.6.1, and groupby in 2.7.2.

4 Examination

At the examination, your script file will be executed in AmosMiner using:

< your_script_file_name.osql ;
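When checking your results before the examination, it may help to recall how such rule rows are composed. Judging from the magnitudes in the sample output, each row appears to have the shape ({antecedent}, {consequent}, confidence, support), with confidence = supp(X ∪ Y)/supp(X) and support = supp(X ∪ Y)/N; verify the column order against your own implementation. A small hypothetical Python sketch (the counts below are invented for illustration; what you hand in is the AmosQL implementation):

```python
# Hypothetical sketch of deriving rule rows of the form
# ({antecedent}, {consequent}, confidence, support) from
# frequent-itemset counts.  The counts below are invented.

from itertools import combinations

n_transactions = 1000
supp_count = {            # number of transactions containing each itemset
    frozenset({131}): 184,
    frozenset({104}): 200,
    frozenset({131, 104}): 179,
}

def rules(itemset, supp_count, n, min_conf):
    """Emit (antecedent, consequent, confidence, support) for one itemset."""
    out = []
    for r in range(1, len(itemset)):
        for ante in combinations(sorted(itemset), r):
            ante = frozenset(ante)
            conf = supp_count[itemset] / supp_count[ante]  # supp(XuY)/supp(X)
            if conf >= min_conf:
                out.append((set(ante), set(itemset - ante),
                            round(conf, 4), supp_count[itemset] / n))
    return out

result = rules(frozenset({131, 104}), supp_count, n_transactions, min_conf=0.9)
```

With these invented counts, only the rule {131} → {104} clears the confidence threshold; the candidate rule {104} → {131} has confidence 179/200 = 0.895 and is pruned.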
We expect you to make a brief presentation of your implementation. When the analysis has been executed correctly, you should present the outcome of your analysis, including the choice of your parameters minsupp, minconf, etc. Furthermore, you should be prepared to answer questions about your implementation.

References

[Tan06] Tan, P-N., Steinbach, M., and Kumar, V.: Introduction to Data Mining, Addison-Wesley, 2006.

[AIS93] R. Agrawal, T. Imielinski, and A. Swami, Mining Associations between Sets of Items in Large Databases. In Proceedings of the ACM SIGMOD International Conference on the Management of Data, pp. 207-216, May 1993 (the paper is available on the lab course home page).

[AS94] R. Agrawal and R. Srikant, Fast Algorithms for Mining Association Rules. In Proceedings of the 20th International Conference on Very Large Databases, pp. 487-499, September 1994 (the paper is available on the lab course home page).

[PCY95] J. S. Park, M. S. Chen, and P. S. Yu, An Effective Hash-Based Algorithm for Mining Association Rules. In Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data (SIGMOD '95), San Jose, CA, pp. 175-186, 1995.

[HPY00] J. Han, J. Pei, and Y. Yin, Mining Frequent Patterns without Candidate Generation. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (SIGMOD '00), Dallas, TX, pp. 1-12, 2000.

[AAP01] R. Agarwal, C. C. Aggarwal, and V. V. V. Prasad, A Tree Projection Algorithm for Generation of Frequent Itemsets. Journal of Parallel and Distributed Computing, 61:350-371, 2001.

[ZPO97] M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li, New Algorithms for Fast Discovery of Association Rules. In Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining (KDD), August 1997.
[Z00] M. J. Zaki, Scalable Algorithms for Association Mining. IEEE Transactions on Knowledge and Data Engineering, 12(3):372-390, May/June 2000.

[ZG03] M. J. Zaki and K. Gouda, Fast Vertical Mining Using Diffsets. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 2003.

[HKMT95] M. Holsheimer, M. Kersten, H. Mannila, and H. Toivonen, A Perspective on Databases and Data Mining. In Proceedings of the 1995 International Conference on Knowledge Discovery and Data Mining (KDD '95), Montreal, Canada, pp. 150-155, 1995.

[SSG04] X. Shang, K.-U. Sattler, and I. Geist, Efficient Frequent Pattern Mining in Relational Databases. 5. Workshop des GI-Arbeitskreises Knowledge Discovery (AK KD) im Rahmen der LWA 2004 (the paper is available on the lab course home page).

[SSG05] X. Shang, K.-U. Sattler, and I. Geist, SQL Based Frequent Pattern Mining with FP-Growth. Lecture Notes in Computer Science, Vol. 3392, pp. 32-46, January 2005.

[IBM96] IBM Quest data generation tool, http://www.almaden.ibm.com/cs/projects/iis/hdb/Projects/data_mining/mining.shtml