Mining Top-K Strongly Correlated Item Pairs Without Minimum Correlation Threshold

Similar documents
WIP: mining Weighted Interesting Patterns with a strong weight and/or support affinity

Mining Frequent Patterns without Candidate Generation

Association Rule Mining

Data Mining for Knowledge Management. Association Rules

Association Rules Mining Including Weak-Support Modes Using Novel Measures

DESIGN AND CONSTRUCTION OF A FREQUENT-PATTERN TREE

Association Rule Mining

To Enhance Projection Scalability of Item Transactions by Parallel and Partition Projection using Dynamic Data Set

Improved Frequent Pattern Mining Algorithm with Indexing

Data Analysis in the Internet of Things (Nesnelerin İnternetinde Veri Analizi)

AN IMPROVISED FREQUENT PATTERN TREE BASED ASSOCIATION RULE MINING TECHNIQUE WITH MINING FREQUENT ITEM SETS ALGORITHM AND A MODIFIED HEADER TABLE

Appropriate Item Partition for Improving the Mining Performance

A Survey on Moving Towards Frequent Pattern Growth for Infrequent Weighted Itemset Mining

Mining Rare Periodic-Frequent Patterns Using Multiple Minimum Supports

Chapter 4: Mining Frequent Patterns, Associations and Correlations

Association Rule Mining. Introduction 46. Study core 46

FP-Growth algorithm in Data Compression frequent patterns

AC-Close: Efficiently Mining Approximate Closed Itemsets by Core Pattern Recovery

Efficient Remining of Generalized Multi-supported Association Rules under Support Update

Salah Alghyaline, Jun-Wei Hsieh, and Jim Z. C. Lai

CS570 Introduction to Data Mining

An Evolutionary Algorithm for Mining Association Rules Using Boolean Approach

Ascending Frequency Ordered Prefix-tree: Efficient Mining of Frequent Patterns

CSE 5243 INTRO. TO DATA MINING

A New Fast Vertical Method for Mining Frequent Patterns

Memory issues in frequent itemset mining

Association rules. Marco Saerens (UCL), with Christine Decaestecker (ULB)

Mining Quantitative Maximal Hyperclique Patterns: A Summary of Results

Data Mining Part 3. Associations Rules

Lecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R,

620 HUANG Liusheng, CHEN Huaping et al. Vol.15 this itemset. Itemsets that have minimum support (minsup) are called large itemsets, and all the others

A Decremental Algorithm for Maintaining Frequent Itemsets in Dynamic Databases *

Improved Algorithm for Frequent Item sets Mining Based on Apriori and FP-Tree

Scalable Frequent Itemset Mining Methods

Frequent Itemsets Melange

Mining N-most Interesting Itemsets. Ada Wai-chee Fu Renfrew Wang-wai Kwong Jian Tang. fadafu,

Infrequent Weighted Itemset Mining Using SVM Classifier in Transaction Dataset

Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach

An Improved Frequent Pattern-growth Algorithm Based on Decomposition of the Transaction Database

ETP-Mine: An Efficient Method for Mining Transitional Patterns

Data Structure for Association Rule Mining: T-Trees and P-Trees

Web Page Classification using FP Growth Algorithm Akansha Garg,Computer Science Department Swami Vivekanad Subharti University,Meerut, India

Comparing the Performance of Frequent Itemsets Mining Algorithms

Efficiently Mining Positive Correlation Rules

Mining Recent Frequent Itemsets in Data Streams with Optimistic Pruning

An Improved Apriori Algorithm for Association Rules

Generation of Potential High Utility Itemsets from Transactional Databases

Discovery of Multi-level Association Rules from Primitive Level Frequent Patterns Tree

CSE 5243 INTRO. TO DATA MINING

Association Rule Mining from XML Data

Available online at ScienceDirect. Procedia Computer Science 45 (2015 )

Structure of Association Rule Classifiers: a Review

Parallelizing Frequent Itemset Mining with FP-Trees

PTclose: A novel algorithm for generation of closed frequent itemsets from dense and sparse datasets

A Technical Analysis of Market Basket by using Association Rule Mining and Apriori Algorithm

Incremental Mining of Frequent Patterns Without Candidate Generation or Support Constraint

SA-IFIM: Incrementally Mining Frequent Itemsets in Update Distorted Databases

APPLYING BIT-VECTOR PROJECTION APPROACH FOR EFFICIENT MINING OF N-MOST INTERESTING FREQUENT ITEMSETS

A Graph-Based Approach for Mining Closed Large Itemsets

Tutorial on Association Rule Mining

INFREQUENT WEIGHTED ITEM SET MINING USING NODE SET BASED ALGORITHM

Item Set Extraction of Mining Association Rule

Closed Non-Derivable Itemsets

Basic Concepts: Association Rules. What Is Frequent Pattern Analysis? COMP 465: Data Mining Mining Frequent Patterns, Associations and Correlations

Purna Prasad Mutyala et al, / (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 2 (5), 2011,

Knowledge Discovery from Web Usage Data: Research and Development of Web Access Pattern Tree Based Sequential Pattern Mining Techniques: A Survey

IJESRT. Scientific Journal Impact Factor: (ISRA), Impact Factor: [35] [Rana, 3(12): December, 2014] ISSN:

Discovery of Frequent Itemsets: Frequent Item Tree-Based Approach

Mining Frequent Itemsets Along with Rare Itemsets Based on Categorical Multiple Minimum Support

Fast Discovery of Sequential Patterns Using Materialized Data Mining Views

Mining Frequent Itemsets for data streams over Weighted Sliding Windows

An Efficient Algorithm for Finding the Support Count of Frequent 1-Itemsets in Frequent Pattern Mining

EFFICIENT TRANSACTION REDUCTION IN ACTIONABLE PATTERN MINING FOR HIGH VOLUMINOUS DATASETS BASED ON BITMAP AND CLASS LABELS

An Approximate Approach for Mining Recently Frequent Itemsets from Data Streams *

Novel Techniques to Reduce Search Space in Multiple Minimum Supports-Based Frequent Pattern Mining Algorithms

CSCI6405 Project - Association rules mining

An Algorithm for Mining Frequent Itemsets from Library Big Data

CHAPTER 5 WEIGHTED SUPPORT ASSOCIATION RULE MINING USING CLOSED ITEMSET LATTICES IN PARALLEL

Performance Analysis of Apriori Algorithm with Progressive Approach for Mining Data

Association Rule Mining. Entscheidungsunterstützungssysteme

A Literature Review of Modern Association Rule Mining Techniques

Performance Based Study of Association Rule Algorithms On Voter DB

Pattern Mining. Knowledge Discovery and Data Mining 1. Roman Kern KTI, TU Graz. Roman Kern (KTI, TU Graz) Pattern Mining / 42

Parallel Popular Crime Pattern Mining in Multidimensional Databases

A Comparative study of CARM and BBT Algorithm for Generation of Association Rules

Concurrent Processing of Frequent Itemset Queries Using FP-Growth Algorithm

Leveraging Set Relations in Exact Set Similarity Join

Parallel Mining of Maximal Frequent Itemsets in PC Clusters

Incremental Mining of Frequently Correlated, Associated- Correlated and Independent Patterns Synchronously by Removing Null Transactions

Survey: Efficent tree based structure for mining frequent pattern from transactional databases

An Approach for Privacy Preserving in Association Rule Mining Using Data Restriction

This paper proposes: Mining Frequent Patterns without Candidate Generation

Mining High Average-Utility Itemsets

Fast Algorithm for Mining Association Rules

Frequent Pattern Mining

DATA MINING II - 1DL460

An Algorithm for Mining Large Sequences in Databases

TAPER: A Two-Step Approach for All-strong-pairs. Correlation Query in Large Databases

Comparison of FP tree and Apriori Algorithm

A Novel Algorithm for Associative Classification


Mining Top-K Strongly Correlated Item Pairs Without Minimum Correlation Threshold

Zengyou He, Xiaofei Xu, Shengchun Deng
Department of Computer Science and Engineering, Harbin Institute of Technology, 92 West Dazhi Street, P.O. Box 315, P. R. China, 150001
Email: zengyouhe@yahoo.com, {xiaofei, dsc}@hit.edu.cn

Abstract

Given a user-specified minimum correlation threshold and a transaction database, the problem of mining strongly correlated item pairs is to find all item pairs with Pearson's correlation coefficients above the threshold. However, setting such a threshold is by no means an easy task. In this paper, we consider a more practical problem: mining top-k strongly correlated item pairs, where k is the desired number of item pairs with the largest correlation values. Based on the FP-tree data structure, we propose an efficient algorithm, called Tkcp, for mining such patterns without a minimum correlation threshold. Our experimental results show that the Tkcp algorithm outperforms the Taper algorithm, an efficient algorithm for mining correlated item pairs, even under the assumption of an optimally chosen correlation threshold. We therefore conclude that mining top-k strongly correlated pairs without a minimum correlation threshold is preferable to the original threshold-based mining.

Keywords: Correlation, Association Rule, Transactions, FP-tree, Data Mining

1. Introduction

Association rule mining is a widely studied topic in data mining research [e.g., 2-4]. Its popularity can be attributed to the simplicity of the problem statement and the efficiency of the support-based pruning strategy for pruning a combinatorial search space. However, association mining often generates a huge number of rules, the majority of which are either redundant or do not reflect the true correlation relationships among data objects.
To overcome this difficulty, correlation has been adopted as an alternative to, or an augmentation of, association rules, since most people are interested not only in association-like co-occurrences but also in possible strong correlations. More recently, the discovery of statistical correlations has been applied to transaction databases [1], retrieving all pairs of items with high positive correlation in a transaction database. The problem can be formalized as follows: given a user-specified minimum correlation threshold θ and a transaction database with N items and T transactions, an all-strong-pairs correlation query finds all item pairs with Pearson's correlation coefficients above the minimum correlation threshold θ. However, without specific knowledge about the target data, users will have difficulty setting the correlation threshold to obtain the results they need. If the correlation threshold is set too high, there may be only a small number of results, or even none; in that case, the user may have to guess a smaller threshold and repeat the mining, which may or may not give a better result. If the threshold is too low, there may be too many

results for the user; too many results can imply an exceedingly long computation time, as well as extra effort to screen the answers. As can be seen in the experimental section, the appropriate threshold values can contrast sharply across datasets. A similar phenomenon has also been observed in association rule mining [13,14,15] and sequential pattern mining [16]. To solve this problem, as proposed in [13-16], a good solution is to change the task of mining frequent/sequential patterns to mining top-k frequent/sequential (closed) patterns. This setting is also desirable in the context of correlation pattern mining. That is, it is often preferable to change the task of mining correlated item pairs under a pre-specified threshold to mining top-k strongly correlated item pairs, where k is the desired number of item pairs with the largest correlation values. Unfortunately, the previously developed Taper algorithm [1] cannot be directly applied to mining top-k correlated item pairs, since Taper is based on the generate-and-test framework. Nevertheless, the ideas in [13-16] are helpful in designing our new algorithm.

Motivated by the above observations, we examine other possible solutions to the problem of mining top-k correlated item pairs. We note that the FP-tree approach [4] does not rely on a candidate generation step. We therefore consider how to make use of the FP-tree for mining the top-k pairs of items with the highest positive correlation. An FP-tree (frequent pattern tree) is a variation of the trie data structure: a prefix-tree structure for storing crucial, compressed support information. It is thus possible to utilize this information to alleviate the problem encountered in Taper and discover top-k correlated item pairs by working directly on the FP-tree.
Based on the FP-tree data structure, we propose an efficient algorithm, called Tkcp (FP-Tree based top-k Correlated Pairs mining), to find the top-k strongly correlated item pairs. The Tkcp algorithm consists of two steps: FP-tree construction and top-k correlation mining. In the FP-tree construction step, Tkcp constructs an FP-tree from the transaction database. In the original FP-tree method [4], the FP-tree is built only from items with sufficient support. In our problem setting, however, no support threshold is used initially; we therefore propose to build a complete FP-tree with all items in the database. Note that this is equivalent to setting the initial support threshold to zero. In the top-k correlation mining step, we utilize a pattern-growth method to generate all item pairs one by one and compute their exact correlation values. An array is used to maintain the k item pairs with the largest correlation values found so far. Our experimental results show that Tkcp outperforms the Taper algorithm, even when Taper runs with a best-tuned correlation threshold.

2. Related Work

Association rule mining is a widely studied topic in data mining research. However, it is well recognized that true correlation relationships among data objects may be missed in the support-based association-mining framework. To overcome this difficulty, correlation has been adopted as an interestingness measure, since most people are interested not only in association-like co-occurrences but also in the possible strong correlations implied by such co-occurrences. The related literature can be categorized by correlation measure. Brin et al. [5] introduced lift and a χ² correlation measure and developed methods for

mining such correlations. Ma and Hellerstein [6] proposed an Apriori-based algorithm for mining so-called mutually dependent patterns. Recently, Omiecinski [7] introduced two interestingness measures, called all-confidence and bond, which have the downward closure property. Lee et al. [8] proposed two algorithms, extending the pattern-growth methodology [4], for mining all-confidence and bond based correlated patterns. By extending the concepts proposed in [7], a new notion of confidence-closed correlated patterns is presented in [9], along with an algorithm called CCMine for mining those patterns. Xiong et al. [10] independently proposed the h-confidence measure, which is equivalent to the all-confidence measure of [7]. In [1], the Taper algorithm was proposed to find positively correlated pairs of items, using Pearson's correlation coefficient as the correlation measure.

As discussed in Section 1, more and more researchers have realized that changing the task of mining frequent/sequential patterns to mining top-k frequent/sequential (closed) patterns is preferable in real data mining applications [13-16]. Shen et al. [13] consider the problem of mining the N largest itemsets. Cheung and Fu [14] proposed a different formulation: mining the N k-itemsets with the highest supports, for k up to a certain k_max value. Han et al. [15] proposed another new mining task: mining top-k frequent closed patterns of length no less than min_l, where k is the desired number of frequent closed patterns to be mined and min_l is the minimal length of each pattern. As an extension of [15], the problem of mining top-k closed sequential patterns of length no less than min_l is discussed in [16]. To our knowledge, the problem of mining top-k correlated patterns has only been studied in [17]. Sese et al. [17] utilize the χ² correlation measure to prune uncorrelated association rules, which is very different from Pearson's correlation coefficient.
In this paper, we focus on developing an FP-tree based method for efficiently identifying the top-k strongly correlated pairs of items, with Pearson's correlation coefficient as the correlation measure.

3. Background: Pearson's Correlation Coefficient

In statistics, a measure of association is a numerical index that describes the strength or magnitude of a relationship among variables. Relationships among nominal variables can be analyzed with nominal measures of association such as Pearson's correlation coefficient. The φ correlation coefficient [11] is the computational form of Pearson's correlation coefficient for binary variables. As shown in [1], when adopting the support measure of association rule mining [2], for two items A and B in a transaction database we can derive the support form of the φ correlation coefficient, shown in Equation (1):

    φ = (sup(A,B) - sup(A) * sup(B)) / sqrt(sup(A) * sup(B) * (1 - sup(A)) * (1 - sup(B)))    (1)

where sup(A), sup(B) and sup(A,B) are the supports of A, B and the pair {A, B}, respectively. Furthermore, given an item pair {A, B}, the support value sup(A) for item A, and the support

value sup(b) for item B, without loss of generality, let sup(a)>=sup(b). The upper bound upper φ ) of the φ correlation coefficient for an item pair {A, B} is [1]: ( ( A, B) sup( B) 1 sup( A) upper( φ ( A, B) ) = * (2) sup( A) 1 sup( B) As can be seen in Equation (2), the upper bound of φ correlation coefficient for an item pair {A, B} only relies on the support value of item A and the support value of item B. In other words, there is no requirement to get the support value sup(a, B) of an item pair{a, B} for the calculation of this upper bound. As a result, in the Taper algorithm [1], this upper bound is used to serve as a coarse filter to filter out item pairs that are of no interest, thus saving I/O cost by reducing the computation of the support value of those pruned pairs. The Taper algorithm [1] is a two-step filter-and-refine mining algorithm, which consists of two steps: filtering and refinement. In the filtering step, the Taper algorithm applies the upper bound of the φ correlation coefficient as a coarse filter. If the upper bound of the φ correlation coefficient for an item pair is less than the user-specified correlation threshold, we can prune this item pair right way. In the refinement step, the Taper algorithm computes the exact correlation coefficient for each surviving pair from the filtering step and retrieves the pairs with coefficients above the user-specified correlation threshold as the mining results. 4. Tkcp Algorithm In this section, we introduce Tkcp algorithm for mining top-k correlated pairs between items. We adopt ideas of the FP-tree structure [4] in our algorithm. 4.1 FP-Tree Basics An FP-tree (frequent pattern tree) is a variation of the trie data structure, which is a prefix-tree structure for storing crucial and compressed information about frequent patterns. It consists of one root labeled as NULL, a set of item prefix sub-trees as the children of the root, and a frequent item header table. 
Each node in an item-prefix subtree consists of three fields: item-name, count, and node-link, where item-name indicates which item the node represents, count indicates the number of transactions containing the items on the portion of the path reaching this node, and node-link points to the next node in the FP-tree carrying the same item-name, or null if there is none. Each entry in the frequent-item header table consists of two fields, item-name and head of node-link; the latter points to the first node in the FP-tree carrying that item-name.

Let us illustrate the algorithm for building an FP-tree with a user-specified minimum support threshold, using the example from [4]. Suppose we have the transaction database shown in Fig. 1 and a minimum support threshold of 3. By scanning the database, we get the sorted (item: support) pairs (f:4), (c:4), (a:3), (b:3), (m:3), (p:3); the frequent 1-itemsets are thus f, c, a, b, m, p. We use the tree construction

algorithm in [4] to build the corresponding FP-tree. We scan each transaction and insert its frequent items (in the sorted order above) into the tree. First, we insert <f, c, a, m, p> into the empty tree. This results in a single path:

    root (NULL) -> (f:1) -> (c:1) -> (a:1) -> (m:1) -> (p:1)

Then, we insert <f, c, a, b, m>. This leads to two paths with f, c and a as the common prefix:

    root (NULL) -> (f:2) -> (c:2) -> (a:2) -> (m:1) -> (p:1)
    root (NULL) -> (f:2) -> (c:2) -> (a:2) -> (b:1) -> (m:1)

Third, we insert <f, b>. This time, we get a new path:

    root (NULL) -> (f:3) -> (b:1)

In a similar way, we can insert <c, b, p> and <f, c, a, m, p> into the tree, obtaining the tree shown in Fig. 1, which also shows the horizontal node-links for each frequent 1-itemset as dotted-line arrows.

    TID  Items bought              (Ordered) frequent items
    100  f, a, c, d, g, i, m, p    f, c, a, m, p
    200  a, b, c, f, l, m, o       f, c, a, b, m
    300  b, f, h, j, o             f, b
    400  b, c, k, s, p             c, b, p
    500  a, f, c, e, l, p, m, n    f, c, a, m, p

    Fig. 1. Database and corresponding FP-tree (tree drawing omitted; the header table links the items f, c, a, b, m, p to their node chains).

With the initial FP-tree, we can mine frequent itemsets of size k, where k >= 2. The FP-growth algorithm [4] is used for the mining phase. We may start from the bottom of the header table and consider item p first. There are two paths: (f:4) -> (c:3) -> (a:3) -> (m:2) -> (p:2) and (c:1) -> (b:1) -> (p:1). Hence, we get two prefix paths for p: <f:2, c:2, a:2, m:2> (the count of each item is 2 because the prefix path appears together with p only twice) and <c:1, b:1>; these are called p's conditional pattern base (i.e., the sub-pattern base under the condition of p's existence). Constructing an FP-tree on this conditional pattern base (the conditional FP-tree), which acts as a transaction database with respect to item p, results in only one branch, <c:3>. Hence, if we represent frequent itemsets in the form (itemset: count), only one frequent itemset, (cp:3), is derived.
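The construction just walked through can be sketched in a few lines. This is a simplified reimplementation, not the authors' code: `Node`, `build_fp_tree` and the list-based header table are illustrative, and ties between equal-support items are broken alphabetically here, so c may precede f where Fig. 1 puts f first; the resulting tree is still a valid FP-tree.

```python
from collections import defaultdict

class Node:
    """One FP-tree node: item name, count, parent link, children map."""
    def __init__(self, item, parent=None):
        self.item, self.count, self.parent = item, 0, parent
        self.children = {}

def build_fp_tree(transactions, min_sup):
    """Build an FP-tree as in [4]: count supports, keep frequent items,
    reorder each transaction by descending support, insert into the tree.
    Returns the root and a header table mapping item -> list of nodes
    (a plain list stands in for the node-link chain)."""
    support = defaultdict(int)
    for t in transactions:
        for item in t:
            support[item] += 1
    freq = {i for i, s in support.items() if s >= min_sup}
    root, header = Node(None), defaultdict(list)
    for t in transactions:
        items = sorted((i for i in t if i in freq),
                       key=lambda i: (-support[i], i))
        node = root
        for item in items:
            if item not in node.children:
                node.children[item] = Node(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, header

# The example database of Fig. 1, with minimum support threshold 3.
db = [list("facdgimp"), list("abcflmo"), list("bfhjo"),
      list("bcksp"), list("afcelpmn")]
root, header = build_fp_tree(db, 3)
```

Summing the counts along each item's node-link chain recovers that item's support, e.g. `sum(n.count for n in header['p'])` gives 3.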
The search for frequent itemsets associated with p then terminates. Next, we consider item m. We get the conditional pattern base <f:2, c:2, a:2>, <f:1, c:1, a:1, b:1>. Constructing an FP-tree on it, we get m's conditional FP-tree, which has a single path: root (NULL) -> (f:3) -> (c:3) -> (a:3). Mining

this resulting FP-tree by forming all possible combinations of the items f, c and a, each with item m appended, we get the frequent itemsets (am:3), (cm:3), (fm:3), (cam:3), (fam:3), (fcam:3) and (fcm:3). Similarly, we consider items b, a, c and f to obtain all frequent itemsets.

4.2 Tkcp Algorithm

The Tkcp algorithm consists of two steps: FP-tree construction and top-k correlation mining. In the FP-tree construction step, Tkcp constructs an FP-tree from the transaction database. In the top-k correlation mining step, we utilize a pattern-growth method to generate all item pairs and compute their exact correlation values, obtaining the k item pairs with the largest correlation values. Fig. 2 shows the main algorithm of Tkcp; below, we discuss each step in detail.

Tkcp Algorithm
Input: (1) A transaction database D
       (2) k: the desired number of item pairs with the largest correlation values
Output: TIP: the k item pairs with the largest correlation values
Steps:
(A) FP-Tree Construction Step
  (1) Scan the transaction database D and find the support of each item
  (2) Sort the items by support in descending order, giving sorted-list
  (3) Create an FP-tree T with only a root node labeled NULL
  (4) Scan the transaction database again to build the whole FP-tree with all (ordered) items
(B) Top-K Correlation Mining Step
  (5) For each item Ai in the header table of T do
  (6)   Get the conditional pattern base of Ai and compute its conditional FP-tree Ti
  (7)   For each item Bj in the header table of Ti do
  (8)     Compute the correlation value φ(Ai, Bj) of the pair (Ai, Bj)
  (9)     If φ(Ai, Bj) > min { φ(Cp, Dq) | (Cp, Dq) ∈ TIP }
  (10)      Then replace the pair (Cp, Dq) with the minimal correlation value in TIP by (Ai, Bj)
  (11) Return TIP

Fig. 2. The Tkcp Algorithm

(A) FP-tree Construction Step

In the original FP-tree method [4], the FP-tree is built only from items with sufficient support. In our problem setting, however, there is no initial support threshold.
We therefore propose to build a complete FP-tree with all items in the database (Steps 3-4). Note that this is equivalent to setting the initial support threshold to zero. The size of an FP-tree is bounded by the size of its corresponding database, because each transaction contributes at most one path to the FP-tree, with path length equal to the number of items in that transaction. Since transactions often share many frequent items, the size of the tree is usually much smaller than that of the original database [4].

As can be seen from Fig. 2, before constructing the FP-tree we use an additional scan over the database to count the support of each item (Steps 1-2). At first glance, one might argue that this additional I/O cost is unnecessary, since we have to build the complete FP-tree with zero support anyway. The rationale, however, is as follows: (1) knowing the support of each item helps us build a more compact FP-tree, i.e., an ordered FP-tree; and (2) in the top-k correlation mining step, single-item supports are required to compute the correlation value of each item pair.

(B) Top-K Correlation Mining Step

In the top-k correlation mining step, since all support information can be obtained from the FP-tree, we utilize a pattern-growth method to generate all item pairs one by one and compute their exact correlation values (Steps 5-8). Furthermore, an array (TIP) is used to maintain the k item pairs with the largest correlation values found so far (Steps 9-10). As described in Fig. 2, unlike the Taper algorithm, Tkcp does not need the upper-bound based pruning technique, since the exact correlation values are derived in Step 8.

Usually, the input parameter k is relatively small (e.g., 200 or 500). Hence, the computational cost of Steps 9-10 does not differ much across different values of k. That is, the Tkcp algorithm is insensitive to the desired number of item pairs, which is highly desirable in real data mining applications. In contrast, the performance of the Taper algorithm depends on both the data distribution and the minimum correlation threshold, because these factors greatly affect the effectiveness of upper-bound based pruning. Therefore, when discovering top-k correlated item pairs from datasets containing only item pairs with low correlation values, the computational cost of Taper can still be very high even with a best-chosen correlation threshold.
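The top-k maintenance of Steps 5-10 can be sketched as follows. This is an illustrative reimplementation, not the authors' Java code: the pair supports are passed in precomputed (in Tkcp they come from the conditional FP-trees), and a min-heap replaces the paper's array TIP, an equivalent design choice that makes each replacement an O(log k) operation.

```python
import heapq
from itertools import combinations
from math import sqrt

def top_k_correlated_pairs(single_sup, pair_sup, n_transactions, k):
    """Keep the k item pairs with the largest phi values.
    `single_sup` maps item -> support count, `pair_sup` maps
    frozenset({a, b}) -> joint support count (hypothetical input
    format).  Assumes every item's support is strictly between 0 and
    n_transactions, so the denominator of phi is never zero."""
    heap = []  # (phi, pair) tuples; the smallest phi sits on top
    for a, b in combinations(sorted(single_sup), 2):
        sa = single_sup[a] / n_transactions
        sb = single_sup[b] / n_transactions
        sab = pair_sup.get(frozenset((a, b)), 0) / n_transactions
        # Equation (1): support form of Pearson's correlation coefficient
        corr = (sab - sa * sb) / sqrt(sa * sb * (1 - sa) * (1 - sb))
        if len(heap) < k:                       # TIP not yet full
            heapq.heappush(heap, (corr, (a, b)))
        elif corr > heap[0][0]:                 # beats the current minimum
            heapq.heapreplace(heap, (corr, (a, b)))
    return sorted(heap, reverse=True)           # largest correlation first

# Tiny example: items a, b, c over 4 transactions; a and b co-occur twice.
pairs = top_k_correlated_pairs({'a': 3, 'b': 2, 'c': 1},
                               {frozenset(('a', 'b')): 2}, 4, 2)
```

Because every pair's exact φ is computed, no upper-bound filter and hence no correlation threshold is needed; k alone controls the output size.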
5. Experimental Results

A performance study has been conducted to evaluate our method. In this section, we describe the experiments and their results. Both real-life datasets from UCI [12] and synthetic datasets were used. In particular, we compare the efficiency of Tkcp with that of Taper. To give the best possible credit to the Taper algorithm, our comparison always assigns Taper the best-tuned minimal correlation threshold (which is hard to obtain in practice), so that the two algorithms generate the same results for a user-specified k. These optimal threshold values are obtained by running Tkcp once under each experimental condition. This means that even if Tkcp had only comparable performance to Taper, it would still be far more useful, owing to its usability and the difficulty of guessing a minimal correlation threshold without mining. The experiments show that the running time of Tkcp is shorter than that of Taper on large datasets with many items, and comparable in the other cases. All algorithms were implemented in Java, and all experiments were conducted on a Pentium 4 2.4 GHz machine with 512 MB of RAM running Windows 2000.

5.1 Experimental Datasets

We experimented with both real and synthetic datasets. The mushroom dataset from UCI [12] has 23 attributes and 8124 records; each record represents the physical characteristics of a single mushroom. Viewed as a transaction database, the mushroom dataset has 119 items in total. The synthetic datasets were created with the data generator in the ARMiner software (see footnote 1), which also follows the basic spirit of the well-known IBM synthetic data generator for association rule mining. The data size (i.e., the number of transactions), the number of items and the average transaction size are the major parameters of the synthetic data generation. Table 1 shows the four datasets generated with different parameters and used in the experiments. The main difference between these datasets is the number of items, which ranges from 400 to 1000.

Table 1. Synthetic Test Data Sets

    Data Set Name    Number of Transactions    Number of Items    Average Size of Transactions
    T10I400D100K     100,000                   400                10
    T10I600D100K     100,000                   600                10
    T10I800D100K     100,000                   800                10
    T10I1000D100K    100,000                   1000               10

5.2 Empirical Results

We compared the performance of Tkcp and Taper on the five datasets, varying k (the desired number of item pairs with the highest positive correlations) from 100 to 500. Fig. 3 shows the running times of the two algorithms on the mushroom dataset. We observe that Tkcp's running time remains roughly stable over the whole range of k; this stability is also observed in the subsequent experiments on the synthetic datasets. The reason is that Tkcp does not depend on the upper-bound based pruning technique, and hence is very robust to the parameter k. In contrast, the execution time of Taper increases as k increases. When k reaches 300, Tkcp starts to outperform Taper. Although Tkcp's performance is not as good as Taper's when k is less than 300, Tkcp at least achieves the same level of performance as Taper on the mushroom dataset.
The reason for Tkcp's less satisfactory performance here is that, for a dataset with few items, the advantage of Tkcp is not very significant (keep in mind that Taper is assigned the best-tuned minimal correlation threshold). As can be seen in the subsequent experiments, Tkcp performs significantly better than the Taper algorithm when the dataset has more items and transactions.

1. http://www.cs.umb.edu/~laur/arminer/

Fig. 3. Execution time comparison between Tkcp and Taper on the mushroom dataset (chart omitted; y-axis: execution time in seconds, x-axis: the number of desired item pairs k, from 100 to 500).

Fig. 4, Fig. 5, Fig. 6 and Fig. 7 show the execution times of the two algorithms on the four synthetic datasets as k increases. From these four figures, some important observations can be summarized as follows.

(1) Tkcp keeps a stable running time over the whole range of k on all the synthetic datasets. This further confirms that the Tkcp algorithm is robust with respect to its input parameter.

(2) Taper's running time keeps increasing as k increases (although the increase is not very significant), and on all datasets Tkcp always outperforms Taper significantly.

(3) As the number of items increases (from Fig. 4 to Fig. 7), the gap in execution time between Tkcp and Taper widens. Furthermore, on the T10I1000D100K dataset, when k reaches 300 the Taper algorithm fails because of limited memory (since the optimal correlation threshold is relatively low), while the Tkcp algorithm remains effective on the same dataset. Hence, we conclude that Tkcp is more suitable for mining transaction databases with very large numbers of items and transactions, which is highly desirable in real data mining applications.

(4) Table 2 shows the ideal correlation thresholds for the Taper algorithm on the different datasets. Reasonable correlation thresholds vary from 0.349 to 0.692 (at k = 100) for datasets of different nature. Hence, in real applications it is generally very difficult to pick an optimal correlation threshold. If the guess is not proper, the conventional approach (Taper) will not produce the proper results, and our approach is therefore definitely much better.
Fig.4 Execution Time Comparison between Tkcp and Taper on T10I400D100K Dataset (x-axis: number of desired item pairs k, from 100 to 500; y-axis: execution time in seconds)

Fig.5 Execution Time Comparison between Tkcp and Taper on T10I600D100K Dataset

Fig.6 Execution Time Comparison between Tkcp and Taper on T10I800D100K Dataset

Fig.7 Execution Time Comparison between Tkcp and Taper on T10I1000D100K Dataset

Table 2. Ideal Thresholds for Taper on Different Data Sets

Data Set Name    k=100   k=200   k=300   k=400   k=500
T10I400D100K     0.456   0.38    0.325   0.286   0.261
T10I600D100K     0.567   0.479   0.42    0.375   0.343
T10I800D100K     0.692   0.569   0.5     0.447   0.416
T10I1000D100K    0.349   0.297   0.26    0.238   0.223
Mushroom         0.496   0.365   0.308   0.256   0.229
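The threshold sensitivity behind Table 2 comes from how Taper prunes: a pair is examined only if the support-based upper bound of φ given in [1] reaches the user-chosen threshold θ. A minimal sketch of that bound follows; the function names are ours, not from the Taper paper, and supports are assumed to lie strictly between 0 and 1:

```python
from math import sqrt

def taper_upper_bound(sup_a, sup_b):
    """Support-based upper bound on phi(A, B) from Xiong et al. [1].

    Assumes sup_a >= sup_b, with both supports in (0, 1). It is obtained
    by substituting the maximum possible joint support, sup(AB) = sup(B),
    into the phi formula.
    """
    assert sup_a >= sup_b
    return sqrt((sup_b / sup_a) * ((1 - sup_a) / (1 - sup_b)))

def survives_pruning(sup_a, sup_b, theta):
    """Taper keeps a candidate pair only if its upper bound reaches theta.

    Too high a theta prunes away true top-k pairs; too low a theta (as on
    T10I1000D100K) leaves so many candidates that memory is exhausted --
    the trade-off that Table 2 illustrates.
    """
    hi, lo = max(sup_a, sup_b), min(sup_a, sup_b)
    return taper_upper_bound(hi, lo) >= theta
```

For instance, a pair with supports 0.5 and 0.1 has an upper bound of about 0.33, so it is pruned at θ = 0.9 but kept at θ = 0.3.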

6 Conclusions

In this paper, we propose a new algorithm called Tkcp for mining the top-k statistically correlated item pairs from transaction databases. The salient feature of Tkcp is that it generates correlated item pairs directly, without candidate generation, through the use of the FP-tree structure. We have compared this algorithm to the previously known Taper algorithm using both synthetic and real-world data. As shown in our performance study, the proposed algorithm significantly outperforms the Taper algorithm, even when the latter runs with an ideal correlation threshold.

Acknowledgements

This work was supported by the High Technology Research and Development Program of China (Grant No. 2003AA4Z2170, Grant No. 2003AA413021), the National Natural Science Foundation of China (Grant No. 40301038) and the IBM SUR Research Fund.

References

[1] H. Xiong, S. Shekhar, P.-N. Tan, V. Kumar. Exploiting a Support-based Upper Bound of Pearson's Correlation Coefficient for Efficiently Identifying Strongly Correlated Pairs. In: Proc. of ACM SIGKDD'04, pp. 334-343, 2004
[2] R. Agrawal, R. Srikant. Fast Algorithms for Mining Association Rules. In: Proc. of VLDB'94, pp. 478-499, 1994
[3] M. J. Zaki. Scalable Algorithms for Association Mining. IEEE Transactions on Knowledge and Data Engineering, 2000, 12(3): 372-390
[4] J. Han, J. Pei, Y. Yin. Mining Frequent Patterns without Candidate Generation. In: Proc. of SIGMOD'00, pp. 1-12, 2000
[5] S. Brin, R. Motwani, C. Silverstein. Beyond Market Baskets: Generalizing Association Rules to Correlations. In: Proc. of SIGMOD'97, pp. 265-276, 1997
[6] S. Ma, J. L. Hellerstein. Mining Mutually Dependent Patterns. In: Proc. of ICDM'01, pp. 409-416, 2001
[7] E. Omiecinski. Alternative Interest Measures for Mining Associations. IEEE Transactions on Knowledge and Data Engineering, 2003, 15(1): 57-69
[8] Y.-K. Lee, W.-Y. Kim, Y. D. Cai, J. Han. CoMine: Efficient Mining of Correlated Patterns. In: Proc. of ICDM'03, pp. 581-584, 2003
[9] W.-Y. Kim, Y.-K. Lee, J. Han. CCMine: Efficient Mining of Confidence-Closed Correlated Patterns. In: Proc. of PAKDD'04, pp. 569-579, 2004
[10] H. Xiong, P.-N. Tan, V. Kumar. Mining Hyper-Clique Patterns with Confidence Pruning. Tech. Report, Univ. of Minnesota, Minneapolis, March 2003
[11] H. T. Reynolds. The Analysis of Cross-classifications. The Free Press, New York, 1977
[12] C. J. Merz, P. Murphy. UCI Repository of Machine Learning Databases, 1996. (http://www.ics.uci.edu/~mlearn/MLRepository.html)

[13] L. Shen, H. Shen, P. Pritchard, R. Topor. Finding the N Largest Itemsets. In: Data Mining (Ed. N.F.F. Ebecken, Proc. of ICDM'98), Great Britain: WIT Press, 1998
[14] Y.-L. Cheung, A. Fu. Mining Frequent Itemsets without Support Threshold: With and Without Item Constraints. IEEE Transactions on Knowledge and Data Engineering, 16(9): 1052-1069, 2004
[15] J. Han, J. Wang, Y. Lu, P. Tzvetkov. Mining Top-K Frequent Closed Patterns without Minimum Support. In: Proc. of ICDM'02, pp. 211-218, 2002
[16] P. Tzvetkov, X. Yan, J. Han. TSP: Mining Top-K Closed Sequential Patterns. In: Proc. of ICDM'03, pp. 347-354, 2003
[17] J. Sese, S. Morishita. Answering the Most Correlated N Association Rules Efficiently. In: Proc. of PKDD'02, pp. 410-422, 2002